Loretta C. Duckworth Scholars Studio

Downloading plain text from Internet Archive and Project Gutenberg with Python

Posted on June 8, 2016 (updated August 26, 2019) by Gerald Doyle

Written by Liz Rodrigues

After I had assembled a list of US immigrant autobiographies and checked which were available as full-text, plain-text files, the next step was to get those files. This list of about fifty books was probably short enough to handle manually, by looking up each title and copying and pasting the text into Notepad, but to prepare for future projects I wanted to learn how to do it programmatically.

My first step was to do some googling and find out what kinds of access Project Gutenberg, the Internet Archive, and HathiTrust provided to their text files. The first two both support programmatic access, which basically lets you write a script that queries their servers for particular kinds of data and then downloads it. HathiTrust offers an API as well, but it takes more sophisticated programming to use. So I decided to focus on Gutenberg and IA, which offered the majority of my texts anyway. Before grabbing anything, it’s a good idea to check the terms of service. Both sites allow moderate downloading for personal use but suggest that if you want the entire site you should set up your own mirror.
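One way to keep a bulk download "moderate" is to pause between requests. The following is a minimal sketch, not something either site prescribes: the function name `download_politely`, the `download_one` callback, and the two-second default delay are all assumptions made for illustration.

```python
import time


def download_politely(identifiers, download_one, delay_seconds=2.0, sleep=time.sleep):
    """Call download_one(identifier) for each identifier, pausing between calls.

    download_one is whatever function actually fetches a text (hypothetical here);
    sleep is injectable so the pause can be skipped when experimenting.
    """
    results = {}
    for position, identifier in enumerate(identifiers):
        if position > 0:
            sleep(delay_seconds)  # pause between requests, not before the first one
        results[identifier] = download_one(identifier)
    return results
```

Any of the download loops in this post could be passed in as `download_one`; the right delay depends on each site's current terms of service.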

It took a little bit of tweaking to do what I specifically wanted, which was to get the full text, put it into a folder where I could easily keep track of it, and not leave a bunch of extra copies on my computer. Here is what I came up with: stripped-down versions of more ambitious scripts, adapted to my specific need.

For getting texts from Internet Archive, I used Jacob Johnson’s Internet Archive package for Python. It is well documented and does many things beyond grabbing full text, but since that was my focus, that is where I started. It ran smoothly last fall and again just today, using this very basic code (sorry I can’t make it prettier; our installation of WordPress doesn’t currently support Markdown):

import os
import shutil

from internetarchive import get_item

corpus = ["text_identifier_1", "text_identifier_2"]
# the text identifier is the string at the end of the URL for the text
# (archive.org/details/<identifier>)

destination = "path_to_where_you_want_the_file"
# a folder that already exists

for text in corpus:
    # identifies the files available for the text you want
    item = get_item(text)
    # specifies that you want the plain text file specifically
    t = text + "_djvu.txt"
    # gets that plain text file
    f = item.get_file(t)
    # downloads the plain text file into the current folder
    f.download(t)
    # renames the file with a sleek version of the text id
    new_name = text + ".txt"
    # moves the file to a place where you know where it is;
    # moving rather than copying means no duplicate copy is left to remove
    shutil.move(t, os.path.join(destination, new_name))
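The same steps can be wrapped in a function that skips texts already downloaded, which helps when a long run is interrupted and restarted. This is a sketch under assumptions: the names `plain_text_name` and `fetch_plain_text` are made up for illustration, and the `get_item` parameter exists only so the function can be tried without network access (by default it uses the package's own `get_item`, as above).

```python
import os


def plain_text_name(identifier):
    """Name of the plain-text file, following the <identifier>_djvu.txt pattern above."""
    return identifier + "_djvu.txt"


def fetch_plain_text(identifier, dest_dir, get_item=None):
    """Download one item's plain text into dest_dir and return the saved path."""
    if get_item is None:
        # third-party package: pip install internetarchive
        from internetarchive import get_item
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, identifier + ".txt")
    if os.path.exists(target):
        return target  # already downloaded on an earlier run, so skip it
    item = get_item(identifier)
    # download straight to the final location so there is no copy to clean up
    item.get_file(plain_text_name(identifier)).download(target)
    return target
```

Calling `fetch_plain_text(text, "path_to_where_you_want_the_file")` for each `text` in `corpus` would then replace the body of the loop.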

 

For getting texts off of Gutenberg, I started with the Gutenberg package for Python by Clemens Wolff. In the fall, when I was doing this work, it was a well-documented tool that could do a lot more, when you work one text at a time, than the fairly basic bulk download I used it for below. Currently, the developer has released an update that is not working well on my PC and that I’m still figuring out how to install on Mac, so your mileage may vary. It is still well documented, though, and you may be able to make it work just fine.

 

Begin code:

 

import shutil

from gutenberg.acquire import load_etext

corpus = [text_identifier_1, text_identifier_2]
# the text identifier is the number found in the URL for the text;
# replace each placeholder with that number (an integer, not a string)

for text in corpus:
    # fetches the text from Project Gutenberg
    load_etext(text)
    #shutil.move(text, 'path_to_where_you_want_the_file')

 

The last line is commented out because, at the time, I wasn’t quite able to get it to rename and move the files to where I wanted them to go. It is only a guess at how that step might be done, and it would not work as written: load_etext returns the text as a string, so the number in text is not a file that shutil.move can move. So after running this, you would need to use Windows search or Spotlight to figure out where the files were downloaded. Mine were in the Python code folder.
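Because load_etext hands back the book's text as a string, one way to get files where you want them is to write that string out yourself instead of moving anything. This is a minimal sketch, not the package's documented workflow: the function name `save_etexts` and the `<number>.txt` naming are assumptions, and the `load_etext` parameter exists only so the function can be tried without the package installed.

```python
import os


def save_etexts(corpus, dest_dir, load_etext=None):
    """Fetch each Project Gutenberg text and write it to dest_dir/<number>.txt."""
    if load_etext is None:
        # third-party package: pip install gutenberg
        from gutenberg.acquire import load_etext
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for number in corpus:
        body = load_etext(number)  # the whole book as one string
        path = os.path.join(dest_dir, "{}.txt".format(number))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        saved.append(path)
    return saved
```

With this, `save_etexts(corpus, "path_to_where_you_want_the_file")` would leave one named file per text in a folder you chose, with no searching afterwards.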

NB: Due to Python changing its included modules, this now only works on Python installations that include BSD-DB. See the developer’s notes for more details about how to work around this.

For both of these bulk downloads, if I were doing serious rather than exploratory research with them, I would also want to figure out a way to organize the metadata for the particular volumes I ended up with. So that’s a next step.
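One simple starting point for that next step would be a manifest: a CSV file with one row per downloaded volume. The sketch below is only an illustration, and the function name `write_manifest` and the three columns (`identifier`, `source`, `path`) are assumptions, not a scheme from either archive.

```python
import csv


def write_manifest(rows, manifest_path):
    """Write one CSV row per downloaded volume and return the manifest's path.

    Each row is a dict with the (hypothetical) keys identifier, source, and path.
    """
    fields = ["identifier", "source", "path"]
    with open(manifest_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return manifest_path
```

A download loop could append a row such as `{"identifier": text, "source": "Internet Archive", "path": saved_path}` for each text and call `write_manifest` once at the end; richer fields like author and title could be added as columns later.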
