{"id":913,"date":"2015-08-11T13:00:01","date_gmt":"2015-08-11T17:00:01","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=913"},"modified":"2019-08-28T15:47:04","modified_gmt":"2019-08-28T19:47:04","slug":"more-advanced-stylometry-with-jgaap-and-r-stylo","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2015\/08\/11\/more-advanced-stylometry-with-jgaap-and-r-stylo\/","title":{"rendered":"More Advanced Stylometry with JGAAP and R-stylo"},"content":{"rendered":"<p>By Jaclyn Partyka<\/p>\n<p><!--more--><\/p>\n<p>In my previous blog posts, I described my first forays into <a href=\"https:\/\/sites.temple.edu\/tudsc\/2015\/07\/14\/testing-authorship-attribution-with-signature\/\" target=\"_blank\" rel=\"noopener noreferrer\">stylometry<\/a> and then <a href=\"https:\/\/sites.temple.edu\/tudsc\/2015\/07\/14\/testing-authorship-attribution-with-signature\/\" target=\"_blank\" rel=\"noopener noreferrer\">reviewed Signature<\/a> as a viable but ultimately limited stylometric tool. In this post I\u2019ll talk about n-grams and quickly review JGAAP and R-stylo before giving you a glimpse into my own research.<\/p>\n<p>While Signature provided some basic frequency analytics of data (such as letters, punctuation, word length, sentence length, and paragraph length), many of these measures have fallen out of fashion in current stylometric research. 
Rather, following John Burrows and his <a href=\"http:\/\/llc.oxfordjournals.org\/content\/17\/3\/267.abstract\" target=\"_blank\" rel=\"noopener noreferrer\">Delta method<\/a>, researchers now prefer to use statistical frequency measures to look at n-grams.<\/p>\n<p>In linguistics and computational statistics, n-grams are contiguous sequences of n characters or words whose frequencies can be counted and compared.\u00a0So, for example, a word n-gram of size two (sometimes called a bigram) would take a sentence such as this quote from Roth\u2019s <em>The Counterlife<\/em>:\u00a0\u201cHe sat down at the desk and began to read\u201d and create the following list: &#8220;he sat&#8221; &#8220;sat down&#8221; &#8220;down at&#8221; &#8220;at the&#8221; &#8220;the desk&#8221; &#8220;desk and&#8221; &#8220;and began&#8221; &#8220;began to&#8221; &#8220;to read&#8221;<\/p>\n<p>You can also make character n-grams. Here\u2019s a list of character trigrams (3 characters, with spaces counted as characters) using the same sentence as above: &#8220;h e &#8221; &#8220;e s&#8221; &#8221; s a&#8221; &#8220;s a t&#8221; &#8220;a t &#8221; &#8220;t d&#8221; &#8221; d o&#8221; &#8220;d o w&#8221; &#8220;o w n&#8221; &#8220;w n &#8221; &#8220;n a&#8221; &#8221; a t&#8221; &#8220;a t &#8221; &#8220;t t&#8221; &#8221; t h&#8221; &#8220;t h e&#8221; &#8220;h e &#8221; &#8220;e d&#8221; &#8221; d e&#8221; &#8220;d e s&#8221; &#8220;e s k&#8221; &#8220;s k &#8221; &#8220;k a&#8221; &#8221; a n&#8221; &#8220;a n d&#8221; &#8220;n d &#8221; &#8220;d b&#8221; &#8221; b e&#8221; &#8220;b e g&#8221; &#8220;e g a&#8221; &#8220;g a n&#8221; &#8220;a n &#8221; &#8220;n t&#8221; &#8221; t o&#8221; &#8220;t o &#8221; &#8220;o r&#8221; &#8221; r e&#8221; &#8220;r e a&#8221; &#8220;e a d&#8221;<\/p>\n<p>You could do this manually, but it\u2019s so much easier for a computer program like JGAAP or R to <a href=\"https:\/\/web.archive.org\/web\/20160427111608\/http:\/\/www.inside-r.org:80\/packages\/cran\/stylo\/docs\/make.ngrams\" 
target=\"_blank\" rel=\"noopener noreferrer\">do the work for you<\/a>, especially if you have a large corpus.<\/p>\n<p><a href=\"http:\/\/evllabs.github.io\/JGAAP\/\" target=\"_blank\" rel=\"noopener noreferrer\">JGAAP<\/a> stands for \u201cJava Graphical Authorship Attribution Program\u201d and it was designed by Patrick Juola of Duquesne University. It is a free Java-based program for textual analysis, text categorization, and authorship attribution. It has an easy-to-use graphical user interface (GUI) which even includes explanations for specific analytical options.<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-914 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1.jpg\" alt=\"JGAAP_ROTH1\" width=\"716\" height=\"519\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1.jpg 888w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1-300x218.jpg 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1-700x508.jpg 700w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1-232x168.jpg 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1-464x337.jpg 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH1-624x453.jpg 624w\" sizes=\"auto, (max-width: 716px) 100vw, 716px\" \/><\/a><\/p>\n<p>However, while JGAAP can run through a number of customized stylometric analyses simultaneously, its data output is very limited and it does not have a graphical option. The program will list the most likely candidate first and include the numerical probability data, but the details of the analysis are hidden from basic users. 
So, JGAAP would be useful for a closed set of likely candidates with an unknown text, but the data would have to be extracted for advanced analytics or graphical representation.<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-915 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2.jpg\" alt=\"JGAAP_ROTH2\" width=\"712\" height=\"630\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2.jpg 728w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2-300x265.jpg 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2-700x619.jpg 700w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2-232x205.jpg 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2-464x410.jpg 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/JGAAP_ROTH2-624x552.jpg 624w\" sizes=\"auto, (max-width: 712px) 100vw, 712px\" \/><\/a><\/p>\n<p>In light of the limitations of both Signature and JGAAP, R-stylo seems like the ideal program for stylometric textual analysis. <a href=\"https:\/\/www.r-project.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">R<\/a> and <a href=\"https:\/\/www.rstudio.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">RStudio<\/a> are free to download and the \u201cstylo\u201d package is easily downloadable via R\u2019s <a href=\"https:\/\/cran.r-project.org\/web\/packages\/stylo\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">CRAN directory<\/a>. Additionally, there are a number of active scholars who use R-stylo so the support community <a href=\"https:\/\/sites.google.com\/site\/computationalstylistics\/stylo\" target=\"_blank\" rel=\"noopener noreferrer\">online<\/a> is very helpful for beginners.<\/p>\n<p>Some may balk at using R-stylo because it involves more upfront work and some familiarity with coding language. 
However, R-stylo\u2019s functional GUI, its capacity to display results graphically in a variety of customizable formats, and its ability to perform machine learning analytics make it the best option for stylometric analysis at present.<\/p>\n<p>As an example, here is a cluster analysis of Roth\u2019s corpus, with the novels labeled as either \u201cRoth\u201d or \u201cZuckerman.\u201d It&#8217;s really interesting that the only officially nonfictional entry from the corpus, <em>Reading Myself and Others<\/em>, correlates so closely with <em>The Facts: A Novelist&#8217;s Autobiography<\/em>, even though the latter plays with some metafictional techniques.<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-916 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW.jpeg\" alt=\"Roth_cluster_word3gram_delta_800MFW\" width=\"900\" height=\"578\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW.jpeg 900w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW-300x193.jpeg 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW-700x450.jpeg 700w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW-232x149.jpeg 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW-464x298.jpeg 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_cluster_word3gram_delta_800MFW-624x401.jpeg 624w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><\/a><\/p>\n<p>And here is a Principal Component Analysis (PCA) of the same data. What\u2019s different here is the statistical algorithm that R-stylo uses to measure the authorship probability. 
Though these are both distance measures, the kind of algorithm used really affects how the data is represented. So it\u2019s important to consider the best method of analysis for your specific project.\u00a0Obviously, the cluster analysis is much easier to read without much editing, while the PCA would have to be plotted using symbols, which is an additional option in R. However, I also think the messiness of this graph is good, since it shows a significant overlap between the Zuckerman and the &#8216;Roth&#8217; corpora, supporting the claim that the division between these two sets of texts is permeable.<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-917 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW.jpeg\" alt=\"Roth_PCA_delta_800MFW\" width=\"1000\" height=\"642\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW.jpeg 1000w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW-300x193.jpeg 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW-700x449.jpeg 700w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW-232x149.jpeg 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW-464x298.jpeg 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_PCA_delta_800MFW-624x401.jpeg 624w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><br \/>\nFinally, R-stylo has a rolling.classify option, which takes a test corpus of, in this case, novels, and then reads it to determine multi-authored works. 
Maciej Eder most recently used this function to look at Harper Lee\u2019s newly published novel <em><a href=\"https:\/\/sites.google.com\/site\/computationalstylistics\/projects\/lee_vs_capote\" target=\"_blank\" rel=\"noopener noreferrer\">Go Set a Watchman<\/a><\/em> in order to determine the degree of Truman Capote\u2019s involvement.<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW.jpeg\"><img loading=\"lazy\" decoding=\"async\" class=\" size-full wp-image-918 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW.jpeg\" alt=\"Roth_rollingclassify_delta_800MFW\" width=\"861\" height=\"553\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW.jpeg 861w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW-300x193.jpeg 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW-700x450.jpeg 700w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW-232x149.jpeg 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW-464x298.jpeg 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2015\/08\/Roth_rollingclassify_delta_800MFW-624x401.jpeg 624w\" sizes=\"auto, (max-width: 861px) 100vw, 861px\" \/><\/a><\/p>\n<p style=\"text-align: center\"><em>Note: &#8216;Roth&#8217; is in red and Zuckerman is in green.<\/em><\/p>\n<p style=\"text-align: left\">As for me, this rolling.classify analysis of Roth\u2019s <em>The Facts: A Novelist\u2019s Autobiography<\/em> reinforces what we already know about the structure of the novel, while also providing some interesting observations. We already know that the \u201cautobiography\u201d begins with a paratextual letter composed by \u201cRoth\u201d to Zuckerman, and that Zuckerman replies to this letter at the end. 
This graph clearly supports the appearance of Zuckerman\u2019s narration at the end of the novel, but what\u2019s interesting is the middle section of the autobiography that also reads as a \u201cZuckerman\u201d section. This section is most likely the description of Roth\u2019s relationship with his first wife, &#8220;Josie&#8221; \u2013 a tumultuous event that he fictionalized in earlier novels.\u00a0Ultimately, what these preliminary stylometric findings reveal is that there is indeed some kind of key stylistic difference between Roth\u2019s and Zuckerman\u2019s brand of authorship, a claim I will develop further over this next semester.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Jaclyn Partyka<\/p>\n","protected":false},"author":1418,"featured_media":918,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[2],"tags":[34,64,36,37],"class_list":["post-913","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-grad-students","tag-authorship","tag-coding","tag-stylometry","tag-textual-analysis"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/913","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/1418"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=913"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/913\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/t
udsc\/wp-json\/wp\/v2\/media\/918"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=913"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=913"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=913"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}