Loretta C. Duckworth Scholars Studio

Thinking about national affiliation through word frequency

Posted on June 16, 2016; updated August 26, 2019, by Gerald Doyle

Written by Liz Rodrigues

Once I had assembled the early 20th century US immigrant autobiographies readily available on Project Gutenberg and Internet Archive, I needed to assess the quality of the corpus and what methods might be usefully revelatory given its size. At around 50 books, it wasn’t large enough for any kind of machine learning classifier exercise. It was, in fact, on the small side for the most common forms of so-called distant reading. There was another complicating problem for the majority of texts: they were full of typos and incomprehensibilities from the error-laden process of optical character recognition transcription. Without a good way of correcting them, even the most basic forms of topic modeling would likely be of little use. (Perhaps there is a way of at least estimating the error rate, but I wasn’t sure if that would be helpful.)
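One rough way to approach the error-rate question is to measure how many word tokens fall outside a reference vocabulary. The Python sketch below is an illustration only, not something done for this project: the function name `estimate_oov_rate` and the tiny vocabulary are hypothetical, and a real check would need a full word list. Proper nouns such as village names would also count as "unknown," so the figure is at best an upper-bound proxy for OCR error.

```python
import re


def estimate_oov_rate(text, vocabulary):
    """Return the share of alphabetic tokens not found in `vocabulary`.

    This is an out-of-vocabulary rate, used here as a crude stand-in for
    OCR error. It overcounts proper nouns and rare but legitimate words.
    """
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)


# Hypothetical toy vocabulary; a real run would load a full word list.
sample_vocabulary = {"the", "promised", "land"}
sample_rate = estimate_oov_rate("The promisod land", sample_vocabulary)
print(f"Share of unknown tokens: {sample_rate:.2f}")
```

Comparing this rate between an Internet Archive OCR text and a hand-corrected Project Gutenberg text would give a relative sense of how noisy each source is, even if the absolute numbers mean little.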

Faced with the prospect of many solo hours of hand correcting, I made a decision to shift the scale of my research at least temporarily. I figured that by using a small number of texts from the human-created, human-corrected Project Gutenberg, I could at least do some meaningful exploration.

I realized that my idea of modeling immigrant narrative implicitly assumed there was some way of tracking national affiliation textually. The most obvious place to start seemed to be mentions of country names. I also shifted slightly to a comparative look at autobiography and fiction, and so Mary Antin’s The Promised Land & Abraham Cahan’s The Rise of David Levinsky became my test cases. These are both highly canonical texts for studying US immigrant narrative, and both are available on Project Gutenberg.

My first go was to look at mentions of the key nation states, in both cases America and Russia, using Voyant’s counting & visualization tools.

Antin: mentions of “America” and “Russia”

Cahan: mentions of “America” and “Russia”

 

In Antin, an initial prevalence of “Russia” or “Russian” is, in the third quarter or so of the text, displaced by an even greater prevalence of “America” (see note below) or “American.” America/n retains its dominance in the closing segments, although Russia/n rises in parallel. For Cahan, the novelist, America/n is dominant throughout, with a brief intersection at the beginning of the final segment.

While these texts show strikingly different patterns of mentioning Russia & America, they both seem to suggest that America predominates in their thoughts at the end. Of course, this is an assertion and a proxy measure that would need to be tested by close reading: their mentions of America could be denunciatory or self-distancing rather than affiliating.
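For readers who want to reproduce this kind of segment-by-segment comparison outside Voyant, here is a minimal Python sketch. It is not Voyant’s implementation: it simply splits a text into roughly equal runs of word tokens and counts, per segment, the tokens that match each pattern. The function name `segment_counts`, the patterns, and the sample sentence are all illustrative assumptions.

```python
import re


def segment_counts(text, term_patterns, n_segments=10):
    """Count tokens matching each named regex in n roughly equal segments.

    `term_patterns` maps a label (e.g. "America/n") to a regex that must
    match a whole token, so one label can cover several word forms.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    counts = {name: [0] * n_segments for name in term_patterns}
    if not tokens:
        return counts
    compiled = {
        name: re.compile(pattern, re.IGNORECASE)
        for name, pattern in term_patterns.items()
    }
    for i, token in enumerate(tokens):
        # Integer arithmetic assigns token i to a segment in 0..n_segments-1.
        segment = i * n_segments // len(tokens)
        for name, pattern in compiled.items():
            if pattern.fullmatch(token):
                counts[name][segment] += 1
    return counts


# Illustrative patterns: each groups the noun with its adjective forms.
patterns = {
    "Russia/n": r"russia(n|ns)?",
    "America/n": r"america(n|ns)?",
}
sample = "Russia Russian home village America American school"
for label, per_segment in segment_counts(sample, patterns, n_segments=2).items():
    print(label, per_segment)
```

Swapping in other patterns, such as a village name, requires changing only the dictionary. Raw counts like these are still only a proxy and say nothing about whether a mention is affiliating or distancing.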

The interesting part of this process, though, came after I had done more in-depth work with named entity recognition & mapping. In the process of cleaning that data, I realized that, in fact, nation state names were not the primary mode of identification for these narrators’ homelands. For both, Russia was actually the name of an occupying power. They instead identified more frequently with village names in occupied countries: Polotzk, Belarus, for Antin and Antomir, Lithuania, for Cahan. The mentions of Russia are likely there for intelligibility to a US audience. So, I re-charted the mentions.

Antin: mentions of “Polotzk” charted in green, “America” charted in blue

Cahan: mentions of “Antomir” charted in blue, “America” charted in green

 

For Antin, plotting Polotzk vs. America shows a much tighter story. Polotzk still dominates the first half, until America spikes at almost the exact midway point. From then on both remain relatively close in parallel.

For Cahan, plotting Antomir vs. America shows a much sharper divergence between the first and second halves. America spikes right at the beginning, but Antomir spikes strongly at the end, and actually has (it looks like) one additional mention in the final segment. One mention isn’t, perhaps, noticeable to the casual reader, but it shows a competing presence of both national spaces in the narrator’s mind rather than a clear switch to the US, the country to which the main character has immigrated.

The primary lesson I learned here is that the granularity of place names matters both for modeling the text and for understanding the historical context of the works at hand. These immigrants did not narrate their movement as being primarily between countries; it is also between much more specific places.

Note 1: neither text mentions “United States” much at all (7 times in Antin and 13 times in Cahan). Arguably, “America” is as much an idea as a place name & the official place name gets little attention.
