{"id":162,"date":"2014-08-12T12:01:37","date_gmt":"2014-08-12T16:01:37","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=162"},"modified":"2019-08-26T11:31:13","modified_gmt":"2019-08-26T15:31:13","slug":"text-scrubbing-hacks-cleaning-your-ocred-text","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2014\/08\/12\/text-scrubbing-hacks-cleaning-your-ocred-text\/","title":{"rendered":"Text Scrubbing Hacks: Cleaning Your OCRed Text"},"content":{"rendered":"<p>By Beth Seltzer<\/p>\n<p><!--more--><\/p>\n<p>I believe that as digital methods grow in prominence, we\u2019ll develop a huge collection of beautiful, correctly spelled, carefully curated texts for scholars to use for textual analysis.<\/p>\n<p>The problem (as I point out in <a title=\"Tracking Old Books on the Web: Preparing a Digital Corpus\" href=\"https:\/\/sites.temple.edu\/tudsc\/2014\/07\/01\/tracking-old-books-on-the-web-preparing-a-digital-corpus\/\">my earlier post<\/a>) is that we\u2019re not quite there yet. A lot of books are only available as messy OCR that\u2019s never gotten any type of spell-checking, and other texts you have to OCR yourself from a PDF.<\/p>\n<p>Fortunately, there are a few things you can do to clean up your texts without programming.<\/p>\n<p>I used a program called ABBYY FineReader Pro to OCR my collection of Victorian detective novels that are only available as PDFs. I also tried out Adobe Acrobat Pro in an admittedly not particularly controlled experiment. I found they were both relatively accurate for OCR, but ABBYY FineReader was faster and more intuitive to use, which makes sense because it is just designed for OCR. I also tested out the Pro vs. 
the free versions of ABBYY with the help of one of our librarians, and discovered that the Pro version is significantly better than the free version.<\/p>\n<p>One issue with OCRed texts is that old books often repeat the book title and\/or chapter title on each page, and the computer will read these as text like any other text. And if your file repeats the title &#8220;Murder or Manslaughter?&#8221; two hundred times, it will significantly skew your textual analysis results! ABBYY solves this problem by automatically detecting page features such as headers and footers.<\/p>\n<figure id=\"attachment_164\" aria-describedby=\"caption-attachment-164\" style=\"width: 220px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/Abbyylayout.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-164 size-medium\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/Abbyylayout-e1406837545550-220x300.png\" alt=\"Screenshot of Abbyy shows highlighted headers, footers, body text.\" width=\"220\" height=\"300\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/Abbyylayout-e1406837545550-220x300.png 220w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/Abbyylayout-e1406837545550.png 324w\" sizes=\"auto, (max-width: 220px) 100vw, 220px\" \/><\/a><figcaption id=\"caption-attachment-164\" class=\"wp-caption-text\">Abbyy analyzes the page layout automatically.<\/figcaption><\/figure>\n<p>Then, you can change your settings so that headers and footers aren&#8217;t included when you create a text version of the file:<\/p>\n<figure id=\"attachment_163\" aria-describedby=\"caption-attachment-163\" style=\"width: 384px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/AbbyySettings.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-163 \" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/AbbyySettings-e1406837642944-300x193.png\" alt=\"Screenshot showing 
settings menu.\" width=\"384\" height=\"247\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/AbbyySettings-e1406837642944-300x193.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/AbbyySettings-e1406837642944-464x300.png 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/AbbyySettings-e1406837642944.png 928w\" sizes=\"auto, (max-width: 384px) 100vw, 384px\" \/><\/a><figcaption id=\"caption-attachment-163\" class=\"wp-caption-text\">Make sure &#8220;headers and footers&#8221; isn&#8217;t checked.<\/figcaption><\/figure>\n<p>You can save your file in many different formats. I chose to save as plain text, the format most often used for textual analysis. It\u2019s still got plenty of OCR errors, but it looks a little nicer than the plain text I got from the Internet Archive, and it only takes 10-15 minutes to OCR a book (depending on the size of your PDF).<\/p>\n<p>But once everything\u2019s OCRed, how do you spell-check it? You could check each text file individually, but even with a relatively small corpus of 200 novels, that would take weeks of work. 
You can speed up the process with a free program called Notepad++.<\/p>\n<figure id=\"attachment_167\" aria-describedby=\"caption-attachment-167\" style=\"width: 383px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/DH-Notepad.png\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-167\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/DH-Notepad-300x203.png\" alt=\"Screenshot of program with multiple files as tabs.\" width=\"383\" height=\"259\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/DH-Notepad-300x203.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/DH-Notepad-1024x693.png 1024w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/DH-Notepad-442x300.png 442w, https:\/\/sites.temple.edu\/tudsc\/files\/2014\/07\/DH-Notepad.png 1439w\" sizes=\"auto, (max-width: 383px) 100vw, 383px\" \/><\/a><figcaption id=\"caption-attachment-167\" class=\"wp-caption-text\">Notepad++ opens all your files at once in separate tabs.<\/figcaption><\/figure>\n<p>Not particularly fancy-looking, but this program will let you open all of your text files at once and do a find-and-replace on your whole corpus with just one step\u2014a huge time-saver!<\/p>\n<p>My sense is that we\u2019re only a few years, at most, from a fancy spellchecker that will clean up your OCR automatically. 
In the meantime, here are some of the top corrections I needed in my 19<sup>th<\/sup>-century corpus:<\/p>\n<ul>\n<li>Remove all page numbers*: 55,776 corrections<\/li>\n<li>Recombine all hyphenated words beginning with com-, con-, in-, re-, and un-*: 8,002 corrections<\/li>\n<li>Recombine all hyphenated words ending with -ment, -tion, and -ing*: 5,612 corrections<\/li>\n<li>Fix some common misspellings:\n<ul>\n<li>delete \u25a0 (960)<\/li>\n<li>bis -&gt; his (422)<\/li>\n<li>delete &#8220;digitized by&#8221; (1,331)<\/li>\n<li>liis -&gt; his (250)<\/li>\n<li>tbe -&gt; the (337)<\/li>\n<li>tiie -&gt; the (698)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>And so on!<\/p>\n<p>*I did these more complex searches through something called &#8220;regular expressions.&#8221; Basically, instead of searching for and deleting every number individually: 1, 2, 3, 4\u2026 301, 302, etc., you can tell the computer \u201cdelete all the numbers\u201d by entering a short code you can look up on the internet. (In this case, the code is [0-9]+, which matches any run of one or more digits.) Pretty convenient!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Beth 
Seltzer<\/p>\n","protected":false},"author":1420,"featured_media":163,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[2,288],"tags":[334,333,37],"class_list":["post-162","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-grad-students","category-literary-studies","tag-abbyy","tag-digitization","tag-textual-analysis"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/162","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/1420"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=162"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/162\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/163"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=162"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=162"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=162"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}