{"id":7205,"date":"2020-05-06T15:00:38","date_gmt":"2020-05-06T19:00:38","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=7205"},"modified":"2020-05-06T15:00:52","modified_gmt":"2020-05-06T19:00:52","slug":"document-cleaning-guide","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2020\/05\/06\/document-cleaning-guide\/","title":{"rendered":"Document Cleaning: An Introductory Guide"},"content":{"rendered":"<p>By Laura Biesiadecki<\/p>\n<p><!--more--><\/p>\n<h2><b>An Important Note on the Data Life Cycle&#8230;<\/b><\/h2>\n<p><span style=\"font-weight: 400\">Gathering and curating a batch of data can be an intricate and time-consuming task \u2014 as you develop your own digital project, think about how the collection can be preserved. If you plan on submitting this data to a repository, make sure to reach out to a librarian before you start! There are plenty of things to keep track of during the gathering process that will be immensely helpful when the time comes to share your data set.<\/span><\/p>\n<h2><b>Time to Tidy Up<\/b><\/h2>\n<p><span style=\"font-weight: 400\">Once your digital project has been outlined, and once you\u2019ve established appropriate parameters for your corpus, it\u2019s time to collect and clean workable versions of relevant titles. And whether you\u2019re lounging in the ivy like the young woman above, trying to determine if the gentleman in the wide-brimmed hat is of marriageable stock, or working from home, you\u2019re ready to dive in.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The process for gathering and editing your documents will depend on the resources you\u2019re working with and where they come from. 
Thankfully, a majority of <a href=\"https:\/\/sites.temple.edu\/tudsc\/2019\/09\/26\/corpus-construction\/\" target=\"_blank\" rel=\"noopener noreferrer\">the titles on my list<\/a> were found as both PDF and text-only files on <a href=\"https:\/\/www.hathitrust.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">HathiTrust<\/a>, so I didn\u2019t need to use Optical Character Recognition (OCR) software to process the scanned pages. The titles I <\/span><i><span style=\"font-weight: 400\">did<\/span><\/i><span style=\"font-weight: 400\"> need to convert were downloaded as PDFs from various library databases and processed using Adobe Acrobat\u2019s OCR tool.<\/span><\/p>\n<h2><b>The Basics of Document Cleaning: Find and Replace<\/b><\/h2>\n<p><span style=\"font-weight: 400\">In each of the full-text versions of etiquette books I download, there are errors that need to be removed so future textual analysis can be as accurate as possible. Thankfully, most of the errors in these documents are obvious and repetitive, which makes the documents easier and significantly less time-consuming to clean. 
We can look to <a href=\"https:\/\/catalog.hathitrust.org\/Record\/011203702\" target=\"_blank\" rel=\"noopener noreferrer\">Dame Curtsey\u2019s <\/a><\/span><i><span style=\"font-weight: 400\">Art of Entertaining for All Occasions<\/span><\/i><span style=\"font-weight: 400\"> (1918) for an example \u2014 I\u2019m sure she\u2019d be thrilled.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The format of the original printed page looked like this:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7212 size-full\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-05-05-at-12.52.46-PM-e1588697598627.png\" alt=\"\" width=\"463\" height=\"334\" \/><\/p>\n<p><span style=\"font-weight: 400\">And the downloaded text file looked like this:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7211 size-full\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-05-05-at-12.49.46-PM-e1588697652651.png\" alt=\"\" width=\"600\" height=\"291\" \/><\/p>\n<p><span style=\"font-weight: 400\">There are several things that need to be cut from this text, like the bold \u201cPage 45\u201d and the oddly spaced \u201cE N T E R T A IN IN G IN M A R C H\u201d. Since these headers appear on every page of the original document, they\u2019ll need to be erased. A simple \u201cfind and replace\u201d will be helpful, but more often than not these headers are complicated by different spacing patterns, misspellings, or punctuation. To be sure they\u2019re all cut, scroll down and check the first text lines of each \u201cpage.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400\">It\u2019s also important to watch for words that were split between two lines in the original document. 
In this selection, \u201chaving\u201d on the third and fourth lines, and \u201cgraveyard\u201d on the tenth and eleventh lines, appear in the text file as \u201chav- ing\u201d and \u201cgrave- yard.\u201d Inconvenient, yes, but not difficult to fix. It\u2019s easy enough to find and replace all instances of \u201c- \u201d, bringing those lonely word-halves back together again, and the very low-tech cleaning required for errors like these takes mere minutes.<\/span><\/p>\n<h2><b>Some Heavy-Duty Scouring<\/b><\/h2>\n<p><span style=\"font-weight: 400\">Some documents, due to formatting or quality of the original scan, are more difficult to manage. The formatting issues associated with my particular collection are fairly standard: almost every text is muddled by illustrations, handwritten letters, lists of necessary kitchen appliances arranged in three columns, and oddly arranged recipes for anti-aging sheet masks (seriously!). Unfortunately, variations of these errors in image-to-text translation are more than likely to pop up, and will introduce unique roadblocks to your document cleaning process.<\/span><\/p>\n<p><span style=\"font-weight: 400\">In his <\/span><a href=\"https:\/\/catalog.hathitrust.org\/Record\/102287903\" target=\"_blank\" rel=\"noopener noreferrer\"><i><span style=\"font-weight: 400\">Chesterfield\u2019s Art of Letter Writing Simplified<\/span><\/i><\/a><span style=\"font-weight: 400\"> (1857), Philip Dormer Stanhope, Earl of Chesterfield, uses footnotes at the bottom of each page to give his reader information about appropriate use of punctuation:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7216 size-full\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-04-22-at-11.03.38-AM.png\" alt=\"\" width=\"591\" height=\"335\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-04-22-at-11.03.38-AM.png 591w, 
https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-04-22-at-11.03.38-AM-300x170.png 300w\" sizes=\"auto, (max-width: 591px) 100vw, 591px\" \/><\/p>\n<p><span style=\"font-weight: 400\">Unfortunately, the scanned version of this text isn\u2019t quite able to handle the fine print:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-7217 size-full\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-04-22-at-11.03.27-AM.png\" alt=\"\" width=\"497\" height=\"339\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-04-22-at-11.03.27-AM.png 497w, https:\/\/sites.temple.edu\/tudsc\/files\/2020\/05\/Screen-Shot-2020-04-22-at-11.03.27-AM-300x205.png 300w\" sizes=\"auto, (max-width: 497px) 100vw, 497px\" \/><\/p>\n<p><span style=\"font-weight: 400\">We can see a few familiar and easy-to-clean errors (the \u201c- \u201d in \u201ccapi- tal\u201d), but the section turns to unintelligible mush after the standard-sized paragraph ends. I\u2019m left wondering, <\/span><i><span style=\"font-weight: 400\">What am I supposed to do with that?<\/span><\/i><\/p>\n<p><span style=\"font-weight: 400\">When faced with more complicated errors, you have the option to clean or cut. Depending on the content of broken text \u2014 is this section necessary? does it contribute something relevant and new to the document? \u2014 and the frequency of this particular formatting issue, your decision will be made for you.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">In the case of Chesterfield\u2019s <\/span><i><span style=\"font-weight: 400\">Art of Letter Writing<\/span><\/i><span style=\"font-weight: 400\">, there are footnotes on a majority of the text\u2019s 156 pages. It would take valuable time to go through the original scanned images, read each footnote, decide whether it was worth saving, and type the sections deemed relevant. 
I\u2019d also be setting myself up for hours of future work, committing to the same intensive cleaning for the other 200+ titles in my data set. Further, these particular footnotes are not necessary to a project about general etiquette; letter writing was certainly important, but I have little use for musings on the proper use of italics.<\/span><\/p>\n<p><span style=\"font-weight: 400\">While cleaning this title, and at least a dozen others in my collection, I cut supplementary content in favor of efficiency.<\/span><\/p>\n<h2><b>A (Not Completely) Clean Sweep<\/b><\/h2>\n<p><span style=\"font-weight: 400\">The most important step in the document cleaning process is accepting that your final TXT products won\u2019t be perfect. To make the most of your working hours, you\u2019ll need to decide how many errors are acceptable and which sections of text you can afford to lose.<\/span><\/p>\n<p><span style=\"font-weight: 400\">For this project, each of my cleaned text documents will have no more than 10 errors in the first 300 words, and all samples of letters or invitations will be cut, as will footnotes and advertisements for products or other books. For another project, these elements will be invaluable, and I\u2019ll be taking note of which titles feature handwritten templates for wedding invitations. But for the purposes of my project, their content is not worth the time I\u2019d have to spend cleaning it.<\/span><\/p>\n<p><span style=\"font-weight: 400\">I\u2019d be lying if I said the cleaning process wasn\u2019t the most tedious part of my research so far, but a set of clean documents makes for more accurate data and more engaging analysis. In my next post I\u2019ll discuss preliminary exploration with <a href=\"https:\/\/voyant-tools.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Voyant<\/a>, and my experience learning (and writing!) 
code designed to analyze literature and non-fiction.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Laura Biesiadecki<\/p>\n","protected":false},"author":18181,"featured_media":7220,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[296,2,288],"tags":[185,407,329,6],"class_list":["post-7205","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-critical-digital-studies","category-grad-students","category-literary-studies","tag-data-cleaning","tag-ocr","tag-text-analysis","tag-top-news"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/7205","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/18181"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=7205"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/7205\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/7220"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=7205"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=7205"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=7205"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}