Written By Liz Rodrigues
The most recent pamphlet from the Stanford Lit Lab offers a rare level of insight into the process of assembling a corpus of digitized texts.
As the Pamphlet 11 co-authors suggest, we can contextualize any actual corpus against three theoretical backgrounds: the published, the archive, and the corpus. The published is “the totality of the books that have been published (the plays that have been acted, the poems that have been recited, and so on),” while the archive is “that portion of published literature that has been preserved – in libraries and elsewhere – and that is now being increasingly digitized.” The corpus is always the smallest of the three, “that portion of the archive that is selected, for one reason or another, in order to pursue a specific research project.” For example, the DSC’s Beth Seltzer selected detective novels as her corpus out of the wide spectrum of nineteenth-century literature.
And in practice, our actual corpus is always even smaller than that–as quickly becomes apparent as soon as you try to assemble one, and as the Pamphlet 11 co-authors describe in illuminating detail. They sought to build a corpus of works sampled (some randomly, some not) from field-specific bibliographies. They composed their (hoped-for) list of 674 titles and began looking for already digitized versions, starting with the 4,000 English novels that can be accessed through ECCO, the Chadwyck-Healey 19C Collection, and the University of Illinois’s curated collection of 19C novels from the Internet Archive.
This is what followed:
“We generated the sample at the end of the school year, in June 2014. Then we turned to our own database, where we found 35 of the 82 gothic novels, 35 of the 85 historical novels, and 145 of the 507 novels from the Raven-Garside bibliographies. In early July, we passed the list of the titles we had not found – roughly 460 – to Glen Worthey and Rebecca Wingfield, at the Stanford Libraries, who promptly disentangled it into a few major bundles. Around 300 texts were held (in more or less equal parts) by the Hathi trust and by Gale (through NCCO and ECCO II). Another 30 were in collected works, in alternate editions, concealed by slightly different titles, in microfiche or microfilm collections, etc.; about 100 existed only in print, and of 10 novels there were no known extant copies. In August, requests were sent to Hathi and Gale – with both of which Stanford has a long-standing financial agreement – for their 300 volumes. Of the 100 novels existing only in print, about half were held by the British Library, in London, which a few months earlier had kindly offered the Literary Lab a collection of 65,000 digitized volumes from its collections; unfortunately, none of the books we were looking for was there. The special collections at UCLA and Harvard, which held about 50 of the books, sent us a series of estimates that ranged (depending, quite reasonably, on the conditions of the original, and on photographic requirements which could be very labor-intensive) from $1,000 to $20,000 per novel; finally, six novels were part of larger collections held by Proquest, and would have cost us – despite Proquest’s very generous 50% discount – $147,000, or $25,000 per title.”
Spoiler alert: they didn’t buy any digitized titles for 25k each. Instead, faced with a ridiculous price tag (even for Stanford) and months of waiting for database providers to deliver the requested titles, the team decided to go to work with the corpus they had: a selection of 1,117 readily available titles.
“We don’t present this as an ideal model of research, and are aware that our results are weaker as a consequence of our decision. But collective work, especially when conducted in a sort of “interstitial” institutional space – as ours still is – has its own temporality: waiting months and months before research can begin would kill any project. Maybe in the future we will send out a scout, a year in advance, in search of the sample. Or maybe we will keep working with what we have, acknowledging the limits and flaws of our data. Dirty hands are better than empty.”
I’m keeping this in mind as I continue corpus wrangling for my own project.