{"id":69,"date":"2014-07-01T09:12:28","date_gmt":"2014-07-01T13:12:28","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=69"},"modified":"2019-09-26T12:01:40","modified_gmt":"2019-09-26T16:01:40","slug":"tracking-old-books-on-the-web-preparing-a-digital-corpus","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2014\/07\/01\/tracking-old-books-on-the-web-preparing-a-digital-corpus\/","title":{"rendered":"Tracking Old Books on the Web: Preparing a Digital Corpus"},"content":{"rendered":"<p>By Beth Seltzer<\/p>\n<p><!--more--><\/p>\n<p>For my summer project in Temple\u2019s Digital Scholarship Center, I wanted to use textual analysis to learn more about Victorian detective fiction. Scholarship has only touched on a small percentage of Victorian detective authors&#8211;Sir Arthur Conan Doyle of Sherlock Holmes fame, Wilkie Collins\u2019s sensation novels, and a handful of others. Yet there were hundreds of short stories and novels from the time period which are increasingly digitized and freely available online. Since my dissertation research is on the genre of detective fiction, I\u2019m hoping that computer-assisted textual analysis will help me get a fuller sense of the genre.<\/p>\n<p>Sometimes you\u2019ll find a corpus ready-made for you. In my case, it\u2019s going to be tricky just to <em>identify<\/em> novels and stories from 1830-1900 as detective fiction, much less track them down. I\u2019m dealing with a budding genre and very obscure texts. 
So I\u2019m going to be relying a lot on these highly-digital tools:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131821_483.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-70\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131821_483-300x169.jpg\" alt=\"Pile of books \" width=\"300\" height=\"169\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131821_483-300x169.jpg 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131821_483-1024x577.jpg 1024w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131821_483-500x281.jpg 500w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Since I\u2019m not going to read all the Victorian novels that exist (around 400,000 by one estimate), I need to rely as much as I can on work done by human eyes to figure out which books are actually \u201cdetective fiction.\u201d<\/p>\n<p>The most useful book for me is Graham Greene and Dorothy Glover\u2019s 1966 catalogue, <em>Victorian Detective Fiction<\/em>, which I got from Temple\u2019s library depository. This catalogue lists 471 novels, all vetted as fitting into the loose, Victorian definition of the genre\u2014a novel mostly about detection, containing a detective. It\u2019s important to keep this contemporary definition in mind, since we wouldn\u2019t necessarily call many of these detective novels today. 
(If you\u2019re used to Agatha Christie\u2019s gang of suspects, it may be a surprise to start reading one of these earlier novels and find only one obvious criminal, for instance.)<\/p>\n<p>My first big task is to look through this catalogue, and some other books I\u2019ve got on hand, and try to track down as many of these texts as I can.<\/p>\n<p>A lot of digital textual analysis projects require clean, plain-text versions of each book, so my first stop is <a title=\"Gutenberg\" href=\"http:\/\/www.gutenberg.org\" target=\"_blank\" rel=\"noopener noreferrer\">Project Gutenberg<\/a>.<\/p>\n<p><a href=\"http:\/\/www.gutenberg.org\/\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-72 size-medium\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/Gutenberg-300x177.png\" alt=\"Gutenberg Screenshot\" width=\"300\" height=\"177\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/Gutenberg-300x177.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/Gutenberg-1024x607.png 1024w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/Gutenberg-500x296.png 500w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/Gutenberg.png 1681w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Gutenberg texts aren\u2019t perfect, but they were all corrected by hand, so they\u2019re among the cleanest copies out there for digital projects. Other good sources for \u201cclean\u201d digital books include Eighteenth Century Collections Online, the <a title=\"Women Writers Project\" href=\"http:\/\/www.wwp.northeastern.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">Women Writers Project<\/a>, and <a title=\"JSTOR Data for Research\" href=\"http:\/\/about.jstor.org\/service\/data-for-research\" target=\"_blank\" rel=\"noopener noreferrer\">JSTOR Data for Research<\/a>.<\/p>\n<p>However, Gutenberg has a comparatively small collection of 45,000 books, and it has only around 13% of the books I look for. 
So if I can\u2019t find a book there, I search for it in the <a title=\"Internet Archive\" href=\"https:\/\/archive.org\/index.php\" target=\"_blank\" rel=\"noopener noreferrer\">Internet Archive<\/a>, an open library of about 6 million texts.<\/p>\n<p><a href=\"https:\/\/archive.org\/index.php\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-73 size-medium\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IA-300x177.png\" alt=\"Internet Archive Screenshot\" width=\"300\" height=\"177\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IA-300x177.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IA-1024x607.png 1024w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IA-500x296.png 500w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IA.png 1681w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>The Internet Archive lets you easily convert a book into plain text. Unfortunately, it gives you rough, unedited OCR, so the text is going to need more work to get it clean enough to use. Look at the fun sort of stuff you find at the beginning of most of these books:<\/p>\n<p style=\"text-align: left\">\u00a0m<\/p>\n<p style=\"text-align: left\">:M.^^&amp;M^i^f,<\/p>\n<p style=\"text-align: left\">wmmm<\/p>\n<p style=\"text-align: left\">^-\u00bb&gt;4#^^&#8217;.SL^gJH&#8217;<\/p>\n<p>I find about 18% of what I search for in the Internet Archive.<\/p>\n<p>If I still can\u2019t find a book, then I move on to sites that have only PDFs, like <a title=\"HathiTrust\" href=\"http:\/\/www.hathitrust.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">HathiTrust<\/a> and <a title=\"Google Books\" href=\"http:\/\/books.google.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Google Books<\/a>. 
These sites have the most texts to work with, but they\u2019ll be more time-consuming to process, since in most cases I\u2019ll need to run them through OCR software like Adobe Acrobat Pro or ABBYY FineReader to get them into the text format I need. I find about 9% of what I search for here that I can\u2019t get in any other form.<\/p>\n<p>But wait, you\u2019re saying, you\u2019ve only found about 40% of what you looked for! And the rest, all of those texts which are still marked with yellow sticky-tabs?<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131927_836.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-74\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131927_836-169x300.jpg\" alt=\"Book with lots of tabs\" width=\"169\" height=\"300\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131927_836-169x300.jpg 169w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131927_836-577x1024.jpg 577w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/06\/IMG_20140626_131927_836.jpg 1840w\" sizes=\"auto, (max-width: 169px) 100vw, 169px\" \/><\/a><\/p>\n<p>Some of these books have never been digitized, some might be available on other databases I haven&#8217;t tried yet, and some I just don&#8217;t have access to. 
For now, I&#8217;ll keep searching.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Beth Seltzer<\/p>\n","protected":false},"author":1420,"featured_media":70,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[2,288],"tags":[37,169],"class_list":["post-69","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-grad-students","category-literary-studies","tag-textual-analysis","tag-web-scraping"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/69","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/1420"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=69"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/69\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/70"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=69"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=69"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=69"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}