Written by Liz Rodrigues
After I had assembled a list of US immigrant autobiographies and checked which were available as full-text, plain-text files, the next step was to get those files. This list of about fifty books was probably short enough to handle manually, by looking up each title and copying and pasting the text into Notepad, but to prepare for future projects I wanted to learn how to do it programmatically.
My first step was to do some googling and find out what kinds of access Project Gutenberg, the Internet Archive, and HathiTrust provide to their text files. The first two both offer an API, which basically lets you write a script that queries their server for particular kinds of data and then downloads it. HathiTrust also offers an API, but it takes more sophisticated programming to use. So I decided to focus on Gutenberg and the Internet Archive, which offered the majority of my texts anyway. Before grabbing anything, it's a good idea to check the terms of service: both sites allow moderate downloading for personal use but suggest that if you want the entire collection you should set up your own mirror.
It took a little bit of tweaking to do what I specifically wanted: get the full text, put it into a folder where I could quickly keep track of it, and avoid leaving a bunch of extra copies on my computer. Here is what I came up with: stripped-down versions of more ambitious scripts, adapted to my specific need.
For getting texts from the Internet Archive, I used Jacob Johnson's Internet Archive package for Python. It is well documented and does many things beyond grabbing full text, but since that's what I was focused on, that's where I started. It ran smoothly both last fall and just today, using this very basic code (sorry I can't make it prettier; our installation of WordPress doesn't currently support Markdown):
import internetarchive
from internetarchive import get_item
import shutil, os
corpus = ["text_identifier_1", "text_identifier_2"]
# the text identifier is the string found in the URL for the text
for text in corpus:
    item = get_item(text)
    # identifies the files available for the text you want
    t = text + "_djvu.txt"
    # specifies that you want the plain text file specifically
    f = item.get_file(t)
    # gets that plain text file
    f.download(t)
    # downloads the plain text file
    new_name = text
    shutil.copyfile(t, new_name)
    # renames the file with a sleek version of the text id
    shutil.move(new_name, 'path_to_where_you_want_the_file/')
    # moves the file to a place where you know where it is
    os.remove(t)
    # removes the duplicate copy
For getting texts off of Project Gutenberg, I started with the Gutenberg package for Python by Clemens Wolff. When I was doing this work in the fall, it was a well-documented tool that can do a lot more when you work one text at a time than the fairly basic bulk-downloading version I used below. Currently, the developer has released an update that is not working well on my PC and that I'm still figuring out how to install on my Mac, so your mileage may vary. It is still well documented, though, and you may be able to make it work just fine.
Begin code:
import gutenberg
from gutenberg.acquire import load_etext
import shutil
corpus = [text_identifier_1, text_identifier_2]
# the text identifier is the number found in the URL for the text
for text in corpus:
    load_etext(text)
    # shutil.move(text, 'path_to_where_you_want_the_file')
The last line is commented out because, at the time, I wasn't quite able to get it to rename and move the files to where I wanted them to go, although I think this is roughly how it would be done. So after running this, you would need to use Windows search or Spotlight to figure out where the files were downloaded. Mine were in the Python code folder.
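If you would rather pick the destination yourself instead of hunting down the cache afterward, one possible workaround is to capture the string that load_etext returns and write it out to a file directly. This is just a sketch of that idea rather than what I actually ran; the output folder and file-naming scheme are placeholders you would need to fill in.
import io
import os
from gutenberg.acquire import load_etext
corpus = [text_identifier_1, text_identifier_2]
out_dir = 'path_to_where_you_want_the_file'
# placeholder folder; replace with a real path on your machine
for text in corpus:
    raw = load_etext(text)
    # load_etext returns the whole book as one string, so we can save it wherever we like
    out_path = os.path.join(out_dir, str(text) + '.txt')
    with io.open(out_path, 'w', encoding='utf-8') as f:
        f.write(raw)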
NB: Due to Python changing its included modules, this now only works on Python installations that include BSD-DB. See the developer’s notes for more details about how to work around this.
For both of these bulk downloads, if I were to do serious rather than exploratory research, I would also want to figure out a way to organize the metadata for the particular volumes I ended up with. So that's a next step.
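On the Internet Archive side, the same package I used above exposes each item's metadata as a plain dictionary, so one possible starting point (just a sketch, with placeholder identifiers and column choices and an output filename I picked for illustration) would be to write a few fields out to a CSV alongside the texts:
import csv
from internetarchive import get_item
corpus = ["text_identifier_1", "text_identifier_2"]
with open('corpus_metadata.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['identifier', 'title', 'creator', 'year'])
    for text in corpus:
        item = get_item(text)
        md = item.metadata
        # item.metadata is a dictionary; .get() keeps missing fields from raising errors
        writer.writerow([text, md.get('title', ''), md.get('creator', ''), md.get('year', '')])
Something similar might be possible on the Gutenberg side, but I believe that package's metadata functions depend on the cache mentioned in the note above, so that installation issue would need sorting out first.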