

{"id":1804,"date":"2016-06-08T12:09:29","date_gmt":"2016-06-08T16:09:29","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=1804"},"modified":"2019-08-26T10:44:17","modified_gmt":"2019-08-26T14:44:17","slug":"downloading-plain-text-from-internet-archive-and-project-gutenberg-with-python","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2016\/06\/08\/downloading-plain-text-from-internet-archive-and-project-gutenberg-with-python\/","title":{"rendered":"Downloading plain text from Internet Archive and Project Gutenberg with Python"},"content":{"rendered":"<p>Written by\u00a0 Liz Rodrigues<\/p>\n<p><!--more--><\/p>\n<p><span style=\"font-weight: 400\">After I had assembled a list of US immigrant autobiographies and checked to see which were available as full-text, plain-text files, the next step was to get those files. This list of about fifty books was probably short enough to do this manually&#8211;by looking up each title and copying &amp; pasting the file into Notepad&#8211;but to prepare for future projects, I wanted to learn how to do it programmatically.<\/span><\/p>\n<p><span style=\"font-weight: 400\">My first step was to do some googling and find out what kinds of access Project Gutenberg, the Internet Archive, and HathiTrust provided to their text files. The first two both offer an API&#8211;basically allowing you to write a script that will query their server for particular kinds of data and then download it. HathiTrust also offers an API, but it takes more sophisticated programming to use. So, I decided to focus on Gutenberg and IA, which offered the majority of my texts anyway. Before grabbing anything, it\u2019s a good idea to check the terms of service. 
Both sites allow moderate downloading for personal use but suggest that if you want the entire site you should set up your own mirror.<\/span><\/p>\n<p><span style=\"font-weight: 400\">It took a little bit of tweaking to do what I specifically wanted, which was to get the full text, put it into a folder where I could easily keep track of it, and avoid leaving a bunch of extra copies on my computer. Here is what I came up with&#8211;stripped-down versions of more ambitious scripts, adapted to my specific needs.<\/span><\/p>\n<p><span style=\"font-weight: 400\">For getting texts from Internet Archive, I used <\/span><a href=\"https:\/\/pypi.python.org\/pypi\/internetarchive\"><span style=\"font-weight: 400\">Jacob Johnson\u2019s Internet Archive package for Python<\/span><\/a><span style=\"font-weight: 400\">. It is well documented and does many things beyond grabbing full text, but since that\u2019s what I was focused on, that\u2019s what I started with. It ran smoothly last fall and again just today, using this very basic code <\/span><span style=\"font-weight: 400\">(sorry I can\u2019t make it prettier; our installation of WordPress doesn\u2019t currently support Markdown)<\/span><span style=\"font-weight: 400\">:<\/span><\/p>\n<p><span style=\"font-weight: 400\">import internetarchive<\/span><\/p>\n<p><span style=\"font-weight: 400\">from internetarchive import get_item<\/span><\/p>\n<p><span style=\"font-weight: 400\">import shutil, os<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">corpus = [&quot;text_identifier_1&quot;, &quot;text_identifier_2&quot;]<\/span><\/p>\n<p><span style=\"font-weight: 400\">#the text identifier is the string that follows \/details\/ in the url for the text<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">for text in corpus:<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 item = get_item(text)<\/span><\/p>\n<p><span style=\"font-weight: 400\">#identifies the files available for the text you 
want<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 t = text + &quot;_djvu.txt&quot;<\/span><\/p>\n<p><span style=\"font-weight: 400\">#specifies that you want the plain text file specifically<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 f = item.get_file(t)<\/span><\/p>\n<p><span style=\"font-weight: 400\">#gets that plain text file<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 f.download(t)<\/span><\/p>\n<p><span style=\"font-weight: 400\">#downloads the plain text file<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 new_name = text<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 shutil.copyfile(t, new_name)<\/span><\/p>\n<p><span style=\"font-weight: 400\">#copies the file under a sleeker name: just the text id<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 shutil.move(new_name, 'path_to_where_you_want_the_file\/')<\/span><\/p>\n<p><span style=\"font-weight: 400\">#moves the copy to a place where you can find it<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 os.remove(t)<\/span><\/p>\n<p><span style=\"font-weight: 400\">#removes the duplicate copy<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">For getting texts from Gutenberg, I started with the <\/span><a href=\"https:\/\/github.com\/c-w\/Gutenberg\"><span style=\"font-weight: 400\">Gutenberg package for Python<\/span><\/a><span style=\"font-weight: 400\"> by <\/span><span style=\"font-weight: 400\">Clemens Wolff. In the fall, when I was doing this work, it was a well-documented tool that could do a lot more with one text at a time than the fairly basic bulk-downloading version I use below. The developer has since released an update that does not work well on my PC and that I\u2019m still figuring out how to install on a Mac&#8211;so your mileage may vary. 
It is still well documented, though, and you may be able to make it work just fine.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">Begin code:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">import gutenberg<\/span><\/p>\n<p><span style=\"font-weight: 400\">from gutenberg.acquire import load_etext<\/span><\/p>\n<p><span style=\"font-weight: 400\">import shutil<\/span><\/p>\n<p><span style=\"font-weight: 400\">corpus = [text_identifier_1, text_identifier_2]<\/span><\/p>\n<p><span style=\"font-weight: 400\">#the text identifier is a number found in the url for the text<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">for text in corpus:<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 load_etext(text)<\/span><\/p>\n<p><span style=\"font-weight: 400\">\u00a0 \u00a0 #shutil.move(text, 'path_to_where_you_want_the_file')<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400\">The last line is commented out because, at the time, I wasn\u2019t quite able to get it to rename and move the files to where I wanted them to go, although I think this is something like how it would be done. So after running this, you would need to use Windows search or Spotlight to figure out where they were downloaded. Mine were in the Python code folder.<\/span><\/p>\n<p><span style=\"font-weight: 400\">NB: Due to Python changing its included modules, this now only works on Python installations that include BSD-DB. See <\/span><a href=\"https:\/\/github.com\/c-w\/Gutenberg\"><span style=\"font-weight: 400\">the developer\u2019s notes<\/span><\/a><span style=\"font-weight: 400\"> for more details about how to work around this.<\/span><\/p>\n<p><span style=\"font-weight: 400\">For both of these bulk downloads, if I were to do serious rather than exploratory research with them, I would also want to figure out a way to organize the metadata for the particular volumes that I ended up with. 
So that\u2019s a next step.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Written by\u00a0 Liz Rodrigues<\/p>\n","protected":false},"author":285,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[2],"tags":[71,37,169],"class_list":["post-1804","post","type-post","status-publish","format-standard","hentry","category-grad-students","tag-python","tag-textual-analysis","tag-web-scraping"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/1804","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/285"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=1804"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/1804\/revisions"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=1804"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=1804"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=1804"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}