

{"id":28,"date":"2014-06-12T14:07:12","date_gmt":"2014-06-12T18:07:12","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=28"},"modified":"2019-10-15T09:20:59","modified_gmt":"2019-10-15T13:20:59","slug":"wandering-a-giant-digital-library-experiments-with-hathitrust","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2014\/06\/12\/wandering-a-giant-digital-library-experiments-with-hathitrust\/","title":{"rendered":"Wandering a Giant Digital Library: Experiments with HathiTrust"},"content":{"rendered":"<p dir=\"ltr\">By Beth Seltzer<!--more--><\/p>\n<p id=\"docs-internal-guid-9b2ea903-8c92-53dc-4b72-101ad3705b42\" dir=\"ltr\">Hi folks! I\u2019m Beth Seltzer, a PhD Candidate in Temple\u2019s English department, and I\u2019m also a summer graduate student worker in the Digital Scholarship Center. My dissertation is on the intersection of nineteenth-century British detective fiction and new Victorian \u201cinformational genres,\u201d like railway timetables and telegraph messages.<\/p>\n<p dir=\"ltr\">One of the things I get to do in the Center is to play around with different Temple resources and explore their implications for research. I\u2019ve been using the HathiTrust, an 11 million-volume digital library which comes with built-in algorithms for analyzing the texts you find. For this post, I\u2019m going to describe some of the messy, trial-and-error process of testing out the capacities of a new tool.<\/p>\n<p dir=\"ltr\"><strong>Maybe it will help with my latest dissertation chapter&#8230;<\/strong><\/p>\n<p dir=\"ltr\">I started out with a topic I know pretty well from my dissertation research: Dickens\u2019s <em>The Mystery of Edwin Drood<\/em> (1870). Dickens died halfway through writing it, and I\u2019m researching the hundreds of completions and theories about the ending.<\/p>\n<p dir=\"ltr\">But it doesn\u2019t quite work with the HathiTrust. 
When you cut out all the duplicates you\u2019re left with only about 10 texts&#8211;it doesn\u2019t have the richest database of Drood reserves. And that\u2019s okay&#8211;different databases and tools work better for different things! For this topic, Temple\u2019s database access to Nineteenth Century British Library Newspapers is much more helpful.<\/p>\n<p dir=\"ltr\"><strong>Or maybe I can use it to define the concept of the \u201cdetective.\u201d<\/strong><\/p>\n<p dir=\"ltr\">HathiTrust is a comprehensive resource, so maybe I can use it to trace the development of a concept over time.<\/p>\n<p dir=\"ltr\">One way to do this is to use Hathi&#8217;s built-in research portal to make topic models of different years. In topic modeling, the computer looks through more texts than you could humanly read and pulls out \u201ctopics,\u201d i.e. collections of words that the computer thinks are related. (So, the computer might read a lot of books and put ocean, sea, ship, boat, whale, and captain all in the same category.)<\/p>\n<p dir=\"ltr\">I decided to start by searching for all of the texts which mention the word &#8220;detective&#8221; in 1830, and forming that collection into a dataset. Then, I made two more datasets for 1860 and 1890. Generally, you want to do little tests along the way to see if anything interesting pops up, before you invest too much time into prepping your collections.<\/p>\n<p dir=\"ltr\">When the results came back, I noticed some interesting things. 
For example, all three models showed what we might call a \u201cpolitical\u201d topic, but this topic looks different in the different time periods:<\/p>\n<p dir=\"ltr\">1830: \u00a0<img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/THeZWfjijZl5ZjWX0j01EDbQV0KtNk56P41KdI_xgdambbBZwwASqOcpR-YdNfIAV897A6TC3sv0vAJCrd7K_hFghTlRfBR-0EdVzINqoapaLSaHTY0NUQeRqdnJRSQKlw\" alt=\"\" width=\"213px;\" height=\"233px;\" \/>1860: <img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/MGV3KkrHcxJyAjHffKVK3Q9qcaAGOWILrwLb5T4FjTD39gBSurF1EWQrlDteqmBpYXjrJArvLavYdIxoBRPaqt5OPPpJych1-jJDpwtc7rh2XfH0781yb3WcRa2nYf9rNQ\" alt=\"\" width=\"241px;\" height=\"225px;\" \/><\/p>\n<p dir=\"ltr\">1890: <img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/MePmysZ1sl59lM3BeedDtfSFAldAHTBh7KZvEUgY-KM58oGfDEdctCLpX3t9ceCdpLf-g80R9rBwMktmCALNxwwXymdV35Ejpyz0K1KohRpyrfqCOtIkiB3M1xUV7gAR2Q\" alt=\"\" width=\"204px;\" height=\"224px;\" \/><\/p>\n<p dir=\"ltr\">Look how the Irish pop up in the 1890s! And how \u201cKing\u201d fades away with time, though it\u2019s oddly not replaced by \u201cQueen,\u201d even though Queen Victoria took the throne in 1837. Perhaps this relates to Victoria\u2019s domestic persona? Or to an increasingly middle-class society?<\/p>\n<p dir=\"ltr\">Of course, you\u2019ll also find topics like this:<\/p>\n<p dir=\"ltr\"><img decoding=\"async\" src=\"https:\/\/lh6.googleusercontent.com\/Cj6yFRf-9LxkST18DUce1E1ngELhuelrAKE_KZW5qwYBuoJ8b790m7ZlMGKIqkG1BoYOxX8GtpykZ8wec2jLPVp1ZwU1XJnO1GeoYbaN6RN_mHNY9Jacfn3ETxXPo-xHGw\" alt=\"\" width=\"223px;\" height=\"226px;\" \/><\/p>\n<p dir=\"ltr\">Did I just make an exciting find about the significance of &#8220;cloth\u201d to nineteenth-century detective fiction? 
Nope, a topic like this one just means that most books ended with lists of advertisements for other books, specifying the price and the cloth covers of the books they were selling.<\/p>\n<p dir=\"ltr\">Anyway, I found some interesting things here, but these results aren&#8217;t really telling me anything about detectives or detection. (Even though I made this collection by searching for the word &#8220;detective,&#8221; detective doesn\u2019t even make it to the top words list in any of these years.) New mission: find a detective theme!<\/p>\n<p dir=\"ltr\"><strong>Okay, looks like I\u2019ve got to narrow down my corpus\u2026<\/strong><\/p>\n<p dir=\"ltr\">With a little effort, I assembled a reasonably edited collection of a little under 300 detective novels from 1830-1900. I ran another topic model on this collection, and this topic popped up:<\/p>\n<p dir=\"ltr\">\u00a0<img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/8aPGlpAwA70zgM0CoHEfUMpSn1WmGUuS2v1MSkK0LVN2Gd5CeHl4Eu8454D3DoPvlu44_5w5oNIfIUC_p-GkouNgLoNoTZDH3FxoFUJeAwG8_Yf9-4UHQ7ggmd5V3UHvlQ\" alt=\"\" width=\"254px;\" height=\"262px;\" \/><\/p>\n<p dir=\"ltr\">Hooray! A \u201cdetective\u201d theme at last!<\/p>\n<p dir=\"ltr\"><strong>Run tests on different authors to determine key themes.<\/strong><\/p>\n<p dir=\"ltr\">But what if I&#8217;m really interested in comparing the themes of different authors? I tested this out by forming datasets of all of the texts from two very different Victorian novelists&#8211;Sir Arthur Conan Doyle, of Sherlock Holmes fame, and George Eliot, best known for <em>Middlemarch<\/em> and other works of high realism.<\/p>\n<p dir=\"ltr\">I should say that this is kind of problematic to begin with because HathiTrust has multiple copies of everything and their \u201cauthor search\u201d isn\u2019t very precise. So if you do an author search for Jane Austen, who only wrote 6 novels (maybe 7.5 if you\u2019re being generous), you\u2019ll get 405 results. 
These results include things like secondary criticism, not just the novels. But it\u2019s still interesting to see what results we get.<\/p>\n<p dir=\"ltr\">Sir Arthur Conan Doyle&#8217;s topics definitely display his focus on rationality and crime-solving:<\/p>\n<p dir=\"ltr\"><img decoding=\"async\" src=\"https:\/\/lh3.googleusercontent.com\/iUU-_YISDoyDvsTxgf1aPJB75GlRI6EnAbfx6yGoaEKQBeitIRpI7h25uJBqsVjb_3QPguUVYm4wqmOeVKWRHU4rx_p53w7wiaTOK19FC7k4Ccm4jib5DtyyOp2J-DeCqA\" alt=\"\" width=\"284px;\" height=\"275px;\" \/><\/p>\n<p dir=\"ltr\">George Eliot doesn\u2019t have any topics that look like this one. Instead, she\u2019s got topics about feeling\/sympathy\/the soul, which seems very true to form for Eliot:<\/p>\n<p dir=\"ltr\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone\" src=\"https:\/\/lh6.googleusercontent.com\/kW7Ax1adBvA5t5t6OV0qa6UqtaWmKkJkr-Q_374DpugVGaWG6XtI1li5MfRgC5Uqn7pqRYPUfAeQNmJmYnoEb-Eros3UX4Mz_26k-C4uHNmA_vjH6Yl_agKAYhJ1KeFUbw\" alt=\"love, soul, etc. word cloud\" width=\"277\" height=\"277\" \/><\/p>\n<p dir=\"ltr\">While normally you want to start your project with a research question rather than a tool, this type of fun, trial-and-error experimentation is very useful for getting ideas and figuring out what\u2019s possible. 
You can play around with HathiTrust tools yourself at the <a href=\"https:\/\/analytics.hathitrust.org\/\">HTRC Portal<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Beth Seltzer<\/p>\n","protected":false},"author":1420,"featured_media":433,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[2,288],"tags":[256,37],"class_list":["post-28","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-grad-students","category-literary-studies","tag-hathitrust","tag-textual-analysis"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/28","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/1420"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=28"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/28\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/433"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=28"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=28"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=28"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}