Skip to content

Loretta C. Duckworth Scholars Studio

⠀

Menu
  • Scholars Studio Blog
    • Digital Methods
      • coding
      • critical making
      • data visualization
      • digital pedagogy
      • immersive technology (AR/VR)
      • mapping
      • textual analysis
      • web scraping
    • Disciplinary Fields
      • Anthropology
      • Archaeology
      • Architecture
      • Art History
      • Business
      • Computer Science
      • Critical Digital Studies
      • Cultural Studies
      • Dance
      • Economics
      • Education
      • Environmental Studies
      • Film Studies
      • Gaming Studies
      • Geography
      • History
      • Information Science
      • Linguistics
      • Literary Studies
      • Marketing
      • Media and Communication Studies
      • Music Studies
      • Political Science
      • Psychology
      • Public Health
      • Sculpture
      • Sociology
      • Urban Studies
      • Visual Art
  • Current Staff
  • Current Fellows
    • Faculty Fellowships
    • Graduate Extern Program
  • About
  • Newsletter
Menu

More Advanced Stylometry with JGAAP and R-stylo

Posted on August 11, 2015August 28, 2019 by

By Jaclyn Partyka

In my previous blog posts, I described my first forays into Stylometry and then reviewed Signature as a viable, but ultimately limited stylometric tool. In this post I’ll talk about n-grams and quickly review JGAAP and R-stylo before giving you a bit of a glimpse into my own research.

While Signature provided some basic frequency analytics of data (such as letters, punctuation, word length, sentence length, and paragraph length), many of these measures have fallen out of fashion in current stylometric research. Rather, following John Burrows and his Delta method, the preferred method of analysis is to use statistical frequency measures to look at n-grams.

In linguistics and computational statistics, n-grams are essentially numeric sets of characters or words used to measure probability. So, for example, a word n-gram of size two (sometimes called a bigram) would take a sentence such as this quote from Roth’s The Counterlife: “He sat down at the desk and began to read” and create the following list: “he sat” “sat down” “down at” “at the” “the desk” “desk and” “and began” “began to” “to read”

You can also make character n-grams. Here’s a list of character trigrams (3 characters) using the same sentence as above: “h e ” “e s” ” s a” “s a t” “a t ” “t d” ” d o” “d o w” “o w n” “w n ” “n a” ” a t” “a t ” “t t” ” t h” “t h e” “h e ” “e d” ” d e” “d e s” “e s k” “s k ” “k a” ” a n” “a n d” “n d ” “d b” ” b e” “b e g” “e g a” “g a n” “a n ” “n t” ” t o” “t o ” “o r” ” r e” “r e a” “e a d”

Essentially, you could do this manually, but it’s so much easier for a computer program like JGAAP or R to do the work for you, especially if you have a large corpus.

JGAAP stands for “Java Graphical Authorship Attribution Program” and it was designed by Patrick Juola of Duquesne University. It is a free Java-based program for textual analysis, text categorization, and authorship attribution. It has an easy to use graphical user interface (GUI) which even includes explanations for specific analytical options.

JGAAP_ROTH1

However, while JGAAP can run through a number of customized stylometric analyses simultaneously, the output of data is very limited and it does not have a graphical option. The program will list the most likely candidate first and include the numerical probability data but the details of the analysis is hidden from basic users. So, JGAAP would be useful for a closed set of likely candidates with an unknown text, but the data would have to be extracted for advanced analytics or graphical representation.

JGAAP_ROTH2

In light of the limitations of both Signature and JGAAP, R-stylo seems like the ideal program for stylometric textual analysis. R and RStudio are free to download and the “stylo” package is easily downloadable via R’s CRAN directory. Additionally, there are a number of active scholars who use R-stylo so the support community online is very helpful for beginners.

Some may balk at using R-stylo because it involves more upfront work and some familiarity with coding language. However, R-stylo’s functional GUI interface, its capacity to graphically display results in a variety of different and customizable formats, and its ability to perform machine learning analytics make it the best option for stylometric analysis at this current time.

As an example, here is a cluster analysis of Roth’s corpus according to either “Roth” or “Zuckerman” novels. It’s really interesting that the only officially nonfictional entry from the corpus, Reading Myself and Others correlates so closely to The Facts: A Novelist’s Autobiography, even though this novel plays with some metafictional techniques.

Roth_cluster_word3gram_delta_800MFW

And here is a Principal Component Analysis (PCA) of the same data. What’s different here is the statistical algorithm that R-stylo uses to measure the authorship probability. Though these are both distance measures, the kind of algorithm used really affects how the data is represented. So, it’s important that you really consider the best method of analysis for your specific project. Obviously the Cluster Analysis is much easier to read without much editing, while the PCA analysis would have to be plotted using symbols, which is an additional option in R. However, I also think the messiness on this graph is good, since it shows a significant overlap between the Zuckerman and the ‘Roth’ corpora, supporting the claim that the division between these two sets of texts is certainly permeable.

Roth_PCA_delta_800MFW
Finally, R-stylo has a rolling.classify option which takes a test corpus of, in this case, novels, and then reads it to determine multi-authored works. Maciej Eder most recently used this function to look at Harper Lee’s newly published novel Go Set a Watchman in order to determine the degree of Truman Capote’s involvement.

Roth_rollingclassify_delta_800MFW

Note: ‘Roth’ is in Red and Zuckerman is in Green. 

As for me, this rolling.classify analysis of Roth’s The Facts: The Novelist’s Autobiography reinforces what we already know about the structure of the novel, while also providing some interesting observations. We already know that the structure of the “autobiography” begins with a paratextal letter composed by “Roth” to Zuckerman and then Zuckerman replies to this letter at the end. This graph clearly supports the appearance of Zuckerman’s narration at the end of the novel, but what’s interesting is the middle section of the autobiography that also reads as a “Zuckerman” section. This section is most likely the description of Roth’s relationship with his first wife, “Josie” – a tumultuous event that he fictionalized in earlier novels. Ultimately, what these preliminary stylometric findings reveal is that there is indeed some kind of key stylistic difference between Roth’s and Zuckerman’s brand of authorship, a claim I will develop further over this next semester.

Leave a Reply

You must be logged in to post a comment.

Recent Posts

  • Web Scraping Wikipedia to Analyze XBOX Game Development Companies by Nationality January 4, 2023
  • Critical Elements for Making Games December 22, 2022
  • Cities as Havens for Bees: Using Remote Sensing to Visualize Urban Bee Habitat December 21, 2022

Tags

3D modeling 3D printing 360 video arduino augmented reality authorship attribution coding corpus building critical making Cultural Heritage data cleaning data visualization digital art history Digital Preservation digital reconstruction digital scholarship film editing games GIS linked open data machine learning makerspace mapping network analysis oculus rift omeka OpenRefine Photogrammetry physical computing Python QGIS R SketchUp stylometry text analysis text mining textual analysis top news twitter video analysis virtual reality visual analysis voyant web scraping YouTube

Archives

©2023 Loretta C. Duckworth Scholars Studio | Design: Newspaperly WordPress Theme