By Jillian Benedict
Although I spent most of my time proofreading the books I had scanned to make sure I had clean digital copies, I was unable to finish every one. The data below therefore covers only six books and excludes Jon Krakauer’s most recent work, Missoula. Since I had finished converting Krakauer’s earlier books, I wanted to use the little time I have left at Temple to explore my corpus in R (I chose RStudio because I am more comfortable with its interface).
The six books in question can be divided into two groups based on their copyright dates. Eiger Dreams, Into the Wild, and Into Thin Air were published in the 1990s; Under the Banner of Heaven, Where Men Win Glory, and Three Cups of Deceit were published in the first two decades of the 2000s. The six books cover different topics, but all are clearly works of investigative journalism written by the same author.
Before installing the stylo package in R, I made sure that all the files in my corpus were the same file type. I chose plain text to avoid unnecessary programming. From there I installed the stylo package and loaded it, which gives RStudio access to the stylo functions. Once R recognized that stylo had been installed, I set my working directory to the parent folder in which my corpus folder was located. After that, all I had to do was call the stylo function and R did the rest of the legwork.
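The setup steps above can be sketched roughly as follows. This is a minimal sketch rather than my exact script: the folder name krakauer-project and the helper corpus_is_plain_text are made up for illustration, and it assumes the texts sit in a subfolder named corpus, which is where stylo looks by default.

```r
# One-time setup, run once in the console:
# install.packages("stylo")

# Hypothetical helper (not part of stylo): TRUE when the folder is
# non-empty and every file in it has a .txt extension.
corpus_is_plain_text <- function(corpus_dir) {
  files <- list.files(corpus_dir)
  length(files) > 0 && all(tolower(tools::file_ext(files)) == "txt")
}

# Hypothetical parent folder; it must contain the "corpus" subfolder.
project_dir <- "~/krakauer-project"
corpus_dir <- file.path(project_dir, "corpus")

# Only run the analysis when the folder, the files, and the package are in place.
if (dir.exists(corpus_dir) &&
    corpus_is_plain_text(corpus_dir) &&
    requireNamespace("stylo", quietly = TRUE)) {
  setwd(project_dir)  # working directory = parent of "corpus"
  stylo::stylo()      # opens the stylo settings window, then runs the analysis
}
```

Checking the file types first is optional, but it catches a stray .docx or .pdf before stylo tries to read it.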
Since stylo is a package that focuses on statistics, there was not a lot of programming for me to do; however, I was glad I had spent so much time getting comfortable with R, because it made it easier to identify what was wrong with my code when I received an error message. The difficulty with the stylo package lies in taking the results of the function, placing them in the context of your project, and figuring out what they mean. When I applied the stylo function to my corpus of plain text files of Krakauer’s books, a plot appeared in the interface.
This plot, a cluster analysis, separated the texts (identified by different colors) into branches according to how similar the documents in the corpus are in style. Even though the graph does not show what exactly differs in the style of each book, there is a clear distinction between the books published in the 1990s and those published after 2000. Perhaps Krakauer’s style differs depending on how much autobiographical information a book includes (the works published in the 1990s contain more of it). Perhaps Krakauer’s style changed after 9/11. It is impossible to know for sure. Krakauer himself may not even be aware of a change in his style.
To make sure I was reading the cluster analysis plot correctly, I changed the analysis type in the stylo settings, and R produced a consensus tree. The separation between the two groups was not as clear at first, but when I looked closely at the plot I saw that the books were once again separated into the same two groups as before and were situated on opposite ends of the tree.
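For anyone who would rather script this step than click through the settings window, something like the following should produce a consensus tree. It is only a sketch: I made the change in the GUI, so the specific values here (100 to 1000 most frequent words in steps of 100, a consensus strength of 0.5) are illustrative assumptions, not the settings behind my plot.

```r
# Illustrative settings for a bootstrap consensus tree (assumed values).
bct_settings <- list(
  gui = FALSE,               # skip the settings window
  corpus.dir = "corpus",     # subfolder of the working directory
  analysis.type = "BCT",     # "CA" would give the cluster analysis instead
  mfw.min = 100,             # start with the 100 most frequent words
  mfw.max = 1000,            # go up to the 1000 most frequent words
  mfw.incr = 100,            # in steps of 100
  consensus.strength = 0.5   # share of runs that must agree on a grouping
)

# Run only when the package and the corpus folder are available.
if (requireNamespace("stylo", quietly = TRUE) && dir.exists("corpus")) {
  do.call(stylo::stylo, bct_settings)
}
```

Because a consensus tree combines many cluster analyses run at different word-list sizes, groupings that survive in it are less likely to be an accident of one particular setting.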
Due to a lack of time and experience, I did not have the opportunity to explore the stylo package as much as I would have liked. While I would not say that I came to a real conclusion, my proofreading (I read each book to correct OCR inconsistencies) and the data provided by the stylo package make me comfortable suggesting that there appears to be a difference of some sort between Krakauer’s earlier published works and his more recent ones.
To learn more about stylometry and the stylo package, check out “Stylometry with R: A Package for Computational Text Analysis” by Maciej Eder, Jan Rybicki, and Mike Kestemont.
Books and copyright dates:
Eiger Dreams: 1990
Into the Wild: 1996
Into Thin Air: 1997
Under the Banner of Heaven: 2003
Where Men Win Glory: 2009
Three Cups of Deceit: 2011