Loretta C. Duckworth Scholars Studio

Identifying Similar YouTube Videos by Comparing Metadata in R – Part 2

Posted on May 7, 2020 (updated January 24, 2023) by Nicole Lemire Garlic

By Nicole Lemire-Garlic

In my last post, I described my YouTube-recommended video pilot project. This project assesses the similarity between a UNICEF child poverty video and the “related” videos recommended by YouTube’s algorithms. This project is connected to my dissertation work on the ethics of communicating social justice through YouTube.

Recommended videos appear on the right-hand side of the YouTube viewing screen. The list is generated automatically by YouTube’s algorithms. Because YouTube keeps its algorithms secret, their inner workings are unknown. We can reasonably guess that they take many factors into account, including the number of likes and dislikes a video has garnered, the time a viewer has already spent watching videos, the subject matter and newness of the candidate videos, the user’s watch history, and more.

Non-profit or government organizations looking to reach YouTube users can examine the recommended video lists to see what additional content their viewers may be exposed to during a viewing session. In this post, I will demonstrate the custom dictionary approach I used in R to do just that.

I followed a three-step process: collecting and cleaning the metadata, creating and applying the custom dictionary, and visualizing the results.

 

Collecting and Cleaning the Metadata

To collect the data, I used R’s tuber package to access YouTube’s Data API. After obtaining credentials to access the API, I used tuber’s functions to gather the metadata for the UNICEF video and the recommended videos (n = 50). The metadata included video channel title, video ID, video title, video category, video description, and video tags.

This alone created a rich set of metadata for each recommended video, but I added a few more calculated fields for my purposes. I added Rank to capture the placement of the recommended video on the screen (1 = first recommended video at the top of the screen). My hunch was that, based on YouTube’s advertiser-driven business model, a disproportionate number of popular videos would be placed in the top (1-5) recommended videos. To test this, I added a Popularity variable, using view count (the number of times a video has been viewed) as a proxy.

Here is what the completed data frame looked like. You can’t see the full data frame here because there were over 650 unique tags! Wrangling the tags was one of the more challenging aspects of this coding project. For more details on how I accomplished the data wrangling, check out my GitHub repository for step-by-step instructions in the R code.

To apply the dictionary described below, I combined the video title, description, and tags into one column. I pre-processed this text data to remove digits, extra spaces, punctuation, and NA values. I also converted the data to lowercase and lemmatized it with the lemmatize_strings function from the textstem package.
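The cleaning steps above might look roughly like this in base R. This is a minimal sketch, not the code from the repository: the toy data frame and the helper names `na_to_empty` and `clean_text` are hypothetical, and the lemmatization step with textstem's `lemmatize_strings` is left out so the example runs without extra packages.

```r
# Hypothetical toy metadata standing in for the real API output
videos <- data.frame(
  title       = c("UNICEF: End Child Poverty 2020", "Top 10 Goals!!"),
  description = c("Every child deserves a fair chance.", NA),
  tags        = c("unicef, #foreverychild", "football, sports"),
  stringsAsFactors = FALSE
)

# Replace NA with an empty string so paste() does not emit the literal "NA"
na_to_empty <- function(x) ifelse(is.na(x), "", x)

# Lowercase, then strip digits and punctuation and collapse extra spaces
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:digit:]]", " ", x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Combine title, description, and tags into one cleaned text column
videos$text <- clean_text(paste(
  na_to_empty(videos$title),
  na_to_empty(videos$description),
  na_to_empty(videos$tags)
))

print(videos$text)
```

Note that stripping punctuation also removes the `#` from hashtags, so any later hashtag search in this sketch would have to look for the bare word (for example `foreverychild`).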

 

Creating and Applying the Custom Dictionary

As explained in my last post, a dictionary approach to text analysis involves searching for pre-selected, meaningful terms in a body of text. I developed a coding scheme with three types of dictionary entries (organization, issue, and hashtag-based) and gave the recommended video a point for each type it matched. Altogether, a recommended video could earn between 0 (Not Similar) and 3 points (Very Similar).

Dictionary Category | Dictionary Term                    | Points
Organization        | “UNICEF”                           | 1
Issue               | “child” and “poverty”              | 1
Hashtag             | “#fightunfair” or “#foreverychild” | 1

Although the lemmatization was helpful, it did not normalize all the terms relevant to my analysis. So, when writing the code, I used the OR operator ( | ) to search for terms with similar meanings. For example, when searching for matches to “child”, I also searched for “girl”, “kid”, “teen”, etc. I used str_detect from the tidyverse’s stringr package to search for the terms and output the results into one final data frame.
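The scoring scheme can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: it uses base R's `grepl` as a stand-in for stringr's `str_detect`, the example texts are invented, the synonym list contains only the terms named above, and the hashtags are matched without `#` on the assumption that punctuation was stripped during cleaning.

```r
# Invented, already-cleaned example texts (lowercase, no punctuation)
texts <- c(
  "unicef end child poverty every kid deserves a fair chance foreverychild",
  "girl education poverty report",
  "top goals football sports"
)

# Patterns use | (OR) to group terms with similar meanings
org_pattern     <- "unicef"
child_pattern   <- "child|girl|kid|teen"
poverty_pattern <- "poverty"
hashtag_pattern <- "fightunfair|foreverychild"  # assumes '#' was removed earlier

# One point per dictionary category; the issue point needs both concepts
org_hit     <- grepl(org_pattern, texts)
issue_hit   <- grepl(child_pattern, texts) & grepl(poverty_pattern, texts)
hashtag_hit <- grepl(hashtag_pattern, texts)

scores <- data.frame(
  text   = texts,
  points = as.integer(org_hit) + as.integer(issue_hit) + as.integer(hashtag_hit),
  stringsAsFactors = FALSE
)

print(scores$points)
```

Because these patterns match substrings, a term like `kid` would also match `kidney`. Wrapping terms in word boundaries (`\\b`) is one way to tighten the search if that matters for your data.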

 

Visualizing the Results

Video Categories and Similarity

As mentioned in my last post, my hunch was that videos from the same category would not necessarily contain similar content. However, as this ggplot2 visualization shows, the higher levels of similarity were made up entirely of videos from the same category as the UNICEF video: Nonprofit & Activism. This may have been an idiosyncratic finding for nonprofit videos. It would be interesting to test an initial video from a broader YouTube category, like News & Politics or Entertainment, to compare results.

Popularity and Recommended Video Placement

I noted above that I expected popular videos to be disproportionately placed in the top recommended videos. Recall that Rank refers to the order in which a user would encounter recommended videos on the screen. Because users’ eyes are drawn first to the recommended videos near the top, I recoded the Rank variable into “Top” (ranks 1-5), “Middle” (6-15), and “Bottom” (16-50). Recognizing that YouTube views for a single video can exceed a billion, I recoded Popularity as: “Unpopular” = 0 to 1 million views, “Popular” = 1 million views or more, and “Very Popular” = 5 million views or more.
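One way to do this kind of recoding in R is with `cut()`. The sketch below uses made-up ranks and view counts, and it assumes the popularity bins are read as below 1 million, 1 million up to 5 million, and 5 million or more. That reading of the bin edges is my interpretation, not something the post spells out.

```r
# Made-up ranks and view counts chosen to sit on the bin edges
videos <- data.frame(
  rank  = c(1, 5, 6, 15, 16, 50),
  views = c(12e6, 999999, 1e6, 4999999, 5e6, 250)
)

# Rank groups: 1-5 = Top, 6-15 = Middle, 16-50 = Bottom
videos$rank_group <- cut(
  videos$rank,
  breaks = c(0, 5, 15, 50),
  labels = c("Top", "Middle", "Bottom")
)

# Popularity bins (assumed edges); right = FALSE makes each bin
# include its lower bound, so exactly 1 million counts as Popular
videos$popularity <- cut(
  videos$views,
  breaks = c(0, 1e6, 5e6, Inf),
  labels = c("Unpopular", "Popular", "Very Popular"),
  right  = FALSE
)

print(videos)
```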

This visualization suggests that Very Popular videos are disproportionately represented in the Top recommended videos, while Unpopular videos are more heavily represented in the Bottom. It would be interesting to see if this holds over time. With the R script, this would be easy to do: you could run it multiple times a day for a series of days on the same initial video.
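A quick numeric check of "disproportionately represented" is to compare the share of each popularity level within each rank group. The sketch below uses invented group labels purely for illustration (they are not the project's results) and base R's `table()` and `prop.table()`.

```r
# Invented labels for nine hypothetical videos (not the project's data)
rank_group <- factor(
  c("Top", "Top", "Top", "Middle", "Middle",
    "Bottom", "Bottom", "Bottom", "Bottom"),
  levels = c("Top", "Middle", "Bottom")
)
popularity <- factor(
  c("Very Popular", "Very Popular", "Popular", "Popular", "Unpopular",
    "Unpopular", "Unpopular", "Popular", "Unpopular"),
  levels = c("Unpopular", "Popular", "Very Popular")
)

# Cross-tabulate, then convert each row to proportions (margin = 1)
counts <- table(rank_group, popularity)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Each row of `shares` sums to 1, so the rows can be compared directly even though the rank groups contain different numbers of videos.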

Channel Title and Similarity

Another question I explored was whether only those videos produced by the same creator would be Moderately Similar or Very Similar to the initial video. This visualization shows that the Moderately Similar and Very Similar videos were created by UNICEF. This is, in part, because the dictionary coding scheme gave an additional point to videos made by the same creator. In future work, it would be interesting to test a dictionary approach that does not include organization matches.


Through this pilot project, I explored on a small scale what can be gleaned from studying YouTube metadata for recommended videos. Because this project used the list of recommended videos generated by the YouTube API, it only mimics what a non-logged-in user would see on their screen. My future work will expand on this approach by analyzing personalized recommended video lists.

You can find the full R code for this project on GitHub. Be aware that you will need to use your own YouTube API OAuth credential to run the code. For more on setting up the credential, check out this YouTube API post or follow the detailed instructions on my GitHub repository.
