Tracking Old Books on the Web: Preparing a Digital Corpus

Posted on July 1, 2014

By Beth Seltzer

For my summer project in Temple’s Digital Scholarship Center, I wanted to use textual analysis to learn more about Victorian detective fiction. Scholarship has touched on only a small percentage of Victorian detective authors: Sir Arthur Conan Doyle of Sherlock Holmes fame, Wilkie Collins of sensation-novel fame, and a handful of others. Yet hundreds of short stories and novels from the period survive, increasingly digitized and freely available online. Since my dissertation is on detective fiction, I’m hoping that computer-assisted textual analysis will help me get a fuller sense of the genre.

Sometimes you’ll find a corpus ready-made for you. In my case, it’s going to be tricky just to identify novels and stories from 1830–1900 as detective fiction, much less track them down. I’m dealing with a budding genre and some very obscure texts. So I’m going to be relying a lot on these highly digital tools:

[Image: a pile of books]

Since I’m not going to read all the Victorian novels that exist (around 400,000 by one estimate), I need to rely as much as I can on work done by human eyes to figure out which books are actually “detective fiction.”

The most useful book for me is Graham Greene and Dorothy Glover’s 1966 catalogue, Victorian Detective Fiction, which I got from Temple’s library depository. This catalogue lists 471 novels, all vetted as fitting the loose, Victorian definition of the genre: a novel mostly about detection, containing a detective. It’s important to keep this contemporary definition in mind, since we wouldn’t necessarily call many of these books detective novels today. (If you’re used to Agatha Christie’s gang of suspects, it may be a surprise to start reading one of these earlier novels and find only one obvious criminal, for instance.)

My first big task is to look through this catalogue, along with some other books I’ve got on hand, and track down as many of these texts as I can.

A lot of digital textual analysis projects require clean, plain-text versions of each book, so my first stop is Project Gutenberg.

[Image: Project Gutenberg screenshot]

Gutenberg texts aren’t perfect, but they were all corrected by hand, so they’re among the cleanest copies out there for digital projects. Other good sources for “clean” digital books include Eighteenth-Century Collections Online, the Women Writers Project, and JSTOR Data for Research.
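If you want to script this step, here’s a minimal Python sketch of the kind of download-and-strip routine I use. The ebook ID and URL pattern are illustrative (look IDs up on gutenberg.org, and expect mirrors to vary), but Gutenberg’s *** START/END *** markers do bracket the body of most texts:

```python
import requests

def fetch_gutenberg_text(ebook_id):
    """Download a Project Gutenberg ebook as plain UTF-8 text.
    This URL pattern works for most books at the time of writing,
    but treat it as illustrative rather than guaranteed."""
    url = f"https://www.gutenberg.org/cache/epub/{ebook_id}/pg{ebook_id}.txt"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def strip_gutenberg_boilerplate(text):
    """Keep only the body between the *** START OF ... *** and
    *** END OF ... *** markers that bracket most Gutenberg texts."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if "*** START OF" in line:
            start = i + 1
        elif "*** END OF" in line:
            end = i
            break
    return "\n".join(lines[start:end])

# 155 should be Wilkie Collins's The Moonstone -- verify on gutenberg.org
body = strip_gutenberg_boilerplate(fetch_gutenberg_text(155))
```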

However, Gutenberg has a comparatively small collection of about 45,000 books, and it holds only around 13% of the books I’m looking for. So if I can’t find a book there, I search for it in the Internet Archive, an open library of about 6 million texts.

[Image: Internet Archive screenshot]
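The lookup itself can be scripted, too. Here’s a rough sketch against the Archive’s advanced-search API; the endpoint and metadata fields are real, but the particular query is just an illustration:

```python
import requests

def search_internet_archive(title, creator):
    """Query the Internet Archive's advanced-search API for texts
    matching a title and author. 'title', 'creator', and 'mediatype'
    are standard IA metadata fields."""
    params = {
        "q": f'title:("{title}") AND creator:("{creator}") AND mediatype:texts',
        "fl[]": ["identifier", "title", "year"],
        "rows": 20,
        "output": "json",
    }
    r = requests.get("https://archive.org/advancedsearch.php",
                     params=params, timeout=30)
    r.raise_for_status()
    return r.json()["response"]["docs"]

# Illustrative query -- any title/author pair from the catalogue works
for doc in search_internet_archive("The Moonstone", "Collins, Wilkie"):
    print(doc.get("identifier"), "-", doc.get("title"))
```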

The Internet Archive makes it easy to download a book as plain text. Unfortunately, what you get is rough, unedited OCR, so it’s going to need more work to get it clean enough to use. Look at the fun sort of stuff you find at the beginning of most of these books:

 m

:M.^^&M^i^f,

wmmm

^-»>4#^^’.SL^gJH’
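There’s no one right way to clean this stuff up, but as a first pass, a crude heuristic filter that drops short lines and mostly-symbol lines catches a lot of it. The thresholds here are guesses to tune per book:

```python
def drop_ocr_junk(text, min_alpha_ratio=0.6, min_length=3):
    """First-pass OCR cleanup: drop very short lines and lines that
    are mostly punctuation or stray symbols, like the noise above.
    Both thresholds are guesses; tune them per book."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) < min_length:
            continue  # drop fragments like "m"
        alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
        if alpha / len(stripped) >= min_alpha_ratio:
            kept.append(line)
    return "\n".join(kept)

sample = "m\n:M.^^&M^i^f,\nwmmm\nIt was a dark and stormy night."
print(drop_ocr_junk(sample))  # keeps the last two lines
```

Note that purely alphabetic gibberish like “wmmm” survives this pass; catching it takes something stronger, like checking words against a dictionary.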

I find about 18% of what I search for in the Internet Archive.

If I still can’t find a book, then I move on to sites that only offer PDFs, like HathiTrust and Google Books. These sites have the most texts to work with, but they’ll be more time-consuming to process, since in most cases I’ll need to run the PDFs through OCR software like Adobe Acrobat Pro or ABBYY FineReader to get them into the plain-text format I need. About 9% of what I search for turns up here and nowhere else.
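If you don’t have Acrobat Pro or ABBYY handy, the open-source Tesseract engine can do the same job, give or take some accuracy. Here’s a sketch using the pdf2image and pytesseract wrappers, assuming the poppler and tesseract binaries are installed:

```python
from pdf2image import convert_from_path  # wraps poppler to rasterize pages
import pytesseract                       # wraps the Tesseract OCR engine

def pdf_to_text(pdf_path):
    """OCR every page of a scanned PDF into one plain-text string."""
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

text = pdf_to_text("some_victorian_novel.pdf")  # hypothetical filename
```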

But wait, you’re saying: you’ve only found about 40% of what you looked for! What about the rest, all of those texts still marked with yellow sticky-tabs?

[Image: a book bristling with yellow sticky-tabs]

Some of these books have never been digitized, some might be available on other databases I haven’t tried yet, and some I just don’t have access to. For now, I’ll keep searching.
