Loretta C. Duckworth Scholars Studio

Identifying Similar YouTube Videos by Comparing Metadata in R – Part I

Posted on February 26, 2020 (updated January 25, 2021) by Nicole Lemire Garlic

With 91% of millennial and Gen-Z viewers on YouTube, non-profit organizations are increasingly promoting positive social change through compelling videos uploaded to the platform, a practice YouTube encourages. Yet claims proliferate that YouTube’s recommendation algorithms direct users down unrelated rabbit holes or, worse, into extremist bastions.

How can a non-profit organization assess how YouTube’s algorithms will affect viewers who watch its uploaded videos? At the Scholars Studio, I’ve been testing out various computational text analysis methods that might help answer this question. In this blog post, I demonstrate a custom dictionary approach, implemented in the open-source software R, using a UNICEF child poverty video with over 3 million views.

A Dictionary Approach

Dictionary approaches to text mining are useful for locating specific words within a text. Pre-set dictionaries exist for searching pre-determined topics; the LIWC (Linguistic Inquiry and Word Count), which contains lists of words linked to beliefs, thinking patterns, and personality, among other things, is one example.

For this project, I did not locate a suitable pre-set dictionary of terms. So, to assess the similarity between the videos recommended by YouTube and the initial UNICEF video, I created a custom dictionary of words that reflect the “content” (see below) of the video. I culled the dictionary terms from the metadata of the UNICEF child poverty video.
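
To illustrate the idea, a custom dictionary in R can be as simple as a named list of term vectors plus a small matching function. The terms below are hypothetical stand-ins chosen for this example, not the actual dictionary built from the UNICEF video’s metadata.

```r
# An illustrative custom dictionary: each entry is a theme with associated terms.
# These terms are stand-ins, not the dictionary used in the project.
example_dictionary <- list(
  poverty  = c("poverty", "poor", "deprivation", "inequality"),
  children = c("child", "children", "kids", "youth"),
  advocacy = c("unicef", "rights", "aid", "donate")
)

# Count how many terms from each theme appear in a piece of metadata text.
count_dictionary_hits <- function(text, dictionary) {
  text <- tolower(text)
  sapply(dictionary, function(terms) sum(sapply(terms, grepl, x = text, fixed = TRUE)))
}

count_dictionary_hits("UNICEF: what is child poverty?", example_dictionary)
#>  poverty children advocacy
#>        1        1        1
```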

 

Determining Video Content

Determining the “content” of a video is not as simple as it might sound. Humans can watch a video and pick out its general subject matter. For example, take a news broadcast showing cell phone footage of a police officer shooting a fleeing suspect. A human might classify this video as about “police violence” or “crime.”

How would a computer go about this task? There are advanced applications in visual analytics designed to let computers view videos and analyze their images. But these techniques often require large amounts of sample data for testing, human coders for validation, and advanced programming. (Click here for a description of this process from Universität Osnabrück’s Institute of Cognitive Science.) Textual representations of a video, however, are accessible and work well with tested computational tools. Enter: YouTube metadata.

YouTube Video Metadata 

YouTube videos are categorized, marked and described in three primary ways:

  1. The platform places them in categories;
  2. Creators assign tags; and
  3. Creators name the videos with a title and provide a description.

Categories are, essentially, genre classifications created by YouTube. Each category has its own number and corresponding phrase, e.g., Entertainment (24) and Nonprofits & Activism (29). When a video is uploaded, YouTube assigns it a single category, which is displayed on the video page under the description.

Tags are keywords crafted by the video’s creator (like “human rights” or “child”) that YouTube uses to match videos to users’ keyword searches. They work behind the scenes and cannot be seen on the YouTube interface; to see them, you have to access them through YouTube’s API. For marketing purposes, creators are encouraged to choose tags that represent what the video is of, or about, at its most basic level. Some creators use them this way, while others reiterate the creator’s name or the video title, or incorporate target-market-specific vocabulary. (Click here for the Tagging YouTube study, if you’d like to know more.) There is no predetermined set of tags and no standard number of tags per video. And, because there is no universal taxonomy, creators may use different terms to refer to the same concept. Even so, tags can provide some insight into video content.

Titles and descriptions are also chosen by the video’s creator and include keywords the creator believes will encourage viewers to watch. This suggests that a video’s title will contain what the creator believes is its central subject. The description typically contains additional information about the video itself, but may also include links to the creator’s channel or other material unrelated to the video being promoted. Like the category, the title and description can be seen on the YouTube page.
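
For reference, here is a minimal sketch of how this metadata can be pulled through YouTube’s API with the tuber package. The OAuth credentials and video ID are placeholders, and the list paths into the response are assumptions based on the shape of the API’s JSON reply, so they may need adjusting.

```r
library(tuber)

# Authenticate against the YouTube Data API with credentials from a
# Google API project (placeholder values below).
yt_oauth(app_id = "YOUR_CLIENT_ID", app_secret = "YOUR_CLIENT_SECRET")

# Request the snippet metadata for one video; "VIDEO_ID" is a placeholder.
details <- get_video_details(video_id = "VIDEO_ID")

# The snippet holds the fields discussed above; the exact structure mirrors
# the API's JSON response, so these paths are assumptions.
snippet     <- details$items[[1]]$snippet
title       <- snippet$title
description <- snippet$description
tags        <- unlist(snippet$tags)   # creator-supplied keywords, if any
category_id <- snippet$categoryId     # e.g., "29" for Nonprofits & Activism
```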

 

Project Design

For this pilot project, I chose to focus on tags, titles, and descriptions, like the researchers here, who used YouTube metadata to classify videos into emotion groups (e.g., happiness, fear, surprise). Categories appeared too broad to be useful, but I included them in my analysis to confirm this hunch. Overall, I followed a three-step process: collecting the tags, title, and description for a list of 50 recommended videos with the help of R’s tuber package and YouTube’s API; pre-processing and analyzing them with the custom dictionary I created; and visualizing the results with ggplot2.
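
To make the three steps concrete, here is a minimal, hedged sketch of that pipeline. It reuses the illustrative example_dictionary and count_dictionary_hits() helper from the earlier sketch, uses a placeholder seed video ID, and scores titles only for brevity; the column names returned by tuber’s get_related_videos() may differ slightly across package versions.

```r
library(tuber)
library(ggplot2)

# Step 1: collect up to 50 videos YouTube recommends alongside the seed video.
# "SEED_VIDEO_ID" stands in for the UNICEF video's actual ID.
related <- get_related_videos(video_id = "SEED_VIDEO_ID", max_results = 50)

# Step 2: score each recommended video's title against the custom dictionary
# (the full project also scores tags and descriptions).
related$hits <- sapply(related$title,
                       function(x) sum(count_dictionary_hits(x, example_dictionary)))

# Step 3: visualize how many dictionary terms each recommended video shares
# with the seed video.
ggplot(related, aes(x = reorder(title, hits), y = hits)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Dictionary term matches in title")
```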

Check out my next blog post to see how it all turned out.
