Loretta C. Duckworth Scholars Studio

The Process of Turning a Print Book into a Machine Readable Text

Posted on November 7, 2016 (updated November 8, 2016)

By Jillian Benedict

Before I can start using R to look for patterns in Jon Krakauer’s body of work, I need to turn the printed text into a form I can work with in R. For this project, I have decided to turn the text of each book into a plain text file. Unlike HTML or any other markup language, plain text is exactly what it sounds like: the most bare-bones version of a text. This allows me to focus on what Krakauer has written rather than getting caught up in any special formatting or layout chosen to add aesthetic interest to the print version of the book.

Taking a print text and turning it into a plain text file is a process, but an important one. The first step, of course, is to scan Krakauer’s books onto the computer as PDFs so I can work with the text freely and without concern for infringing upon any copyright laws that cover digital copies like those found on Google Books (assuming the whole text of a book is even available on Google Books). Scanning each book from cover to cover is time-consuming, but it is time well spent: it gives me all of the publication information for each book in a digital format, which may make it easier to access later. Once I have a book scanned, I send the PDF through OCR software. What is OCR? OCR, otherwise known as “optical character recognition…is a system of converting scanned printed/handwritten image files into its machine readable text format” (Basu). Once the files have been run through OCR software, I can turn them into Word documents, which allows me to remove any unnecessary information from the text.

Prologue to Under the Banner of Heaven as a plain text file.
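The cleanup described above, removing unnecessary information before saving plain text, could also be partly scripted once the OCR output has been exported as text. The following is only a minimal Python sketch under stated assumptions: the function name `clean_ocr_text` is hypothetical, the sample passage is invented for illustration (it is not Krakauer's text), and the rule that a line containing only digits is a printed page number is a guess that would need checking against each scanned book.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Reduce exported OCR text to running prose (illustrative rules only)."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Assumption: a line holding nothing but digits is a page number.
        if re.fullmatch(r"\d+", stripped):
            continue
        kept.append(stripped)
    text = "\n".join(kept)
    # Rejoin words split across a line break: "si-\nlent" becomes "silent".
    # Caveat: this also merges genuine compounds broken at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Invented sample standing in for exported OCR output.
sample = "The desert was si-\nlent that night.\n\n12\n\nA new page begins."
cleaned = clean_ocr_text(sample)
print(cleaned)
```

The cleaned string could then be saved with something like `pathlib.Path("prologue.txt").write_text(cleaned, encoding="utf-8")`, where the file name is again only a placeholder. A script like this does not replace proofreading against the printed page; it only takes care of the most mechanical cleanup.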

The OCR software I use is ABBYY FineReader, which allows you to clean up the scanned files if necessary so the computer can read them better. It was a little difficult to figure out, but vital once I understood what I was doing. I know what you are thinking: if the computer can read the text using ABBYY FineReader, why does the file need to become a Word document? Unfortunately, just because a computer can recognize the characters in the document doesn’t mean the computer always reads those characters correctly. Let’s look at an example.

As you can see in the image below of Part II in Under the Banner of Heaven (I captured it as a PNG file using the Snipping Tool instead of taking a screenshot), ABBYY FineReader “reads” the scanned image on the left, which says “Part II,” but it does not read it correctly. On the right side of the image is a close-up of what this page looks like in the Word document created by the OCR software. It says “Parl ll.” The OCR software did not recognize that the “t” in “Part” is a “t” and read it as an “l”; it also read the Roman numeral “II” as two lowercase “l”s. It is a small but important example of the inconsistencies in OCR software. Even though the example is just a header for a new section, for the sake of accurate text analysis the Word document needs to be as close to the printed text as possible.

ABBYY FineReader analyzing a PDF of Under the Banner of Heaven.
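A misreading like “Parl ll” follows a predictable pattern, so section headings of this kind could be checked with a short script in addition to proofreading by eye. The Python sketch below is an illustration, not a feature of ABBYY FineReader: the function name `fix_part_heading` and the correction rule are my own assumptions, and it only handles headings of the form “Part” followed by a Roman numeral in which “I” was read as a lowercase “l” or the digit “1”.

```python
import re

# Matches a heading line such as "Part II", "Parl ll" or "Part 1V".
# Assumption: headings stand alone on their own line.
HEADING = re.compile(r"^Par[tl] ([IVXl1]+)$")

# Characters OCR commonly confuses with the Roman numeral "I".
NUMERAL_FIX = str.maketrans({"l": "I", "1": "I"})


def fix_part_heading(line: str) -> str:
    """Return a corrected "Part <numeral>" heading, or the line unchanged."""
    match = HEADING.match(line.strip())
    if match is None:
        return line
    return "Part " + match.group(1).translate(NUMERAL_FIX)


print(fix_part_heading("Parl ll"))
```

Ordinary prose is left alone because the pattern must match the entire line, so a sentence that merely begins with a word like “Parlor” passes through unchanged. Errors inside running text are less regular and would still need to be corrected by hand against the printed page.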

This is a tedious part of the process, but creating an accurate, machine-readable text is crucial to the development of this project. Even though I am still unsure about what I will find once I can play with the text itself, I am one step closer to finding out what analyzing all of these books will reveal about the change in an author’s style over time.

 

Basu, Saikat. “Top 5 Free OCR Software Tools To Convert Images Into Text.” Makeuseof.com, http://www.makeuseof.com/tag/top-5-free-ocr-software-tools-to-convert-your-images-into-text-nb/. Accessed 1 Nov. 2016.
