

Web scraping with Python

By Andrea Siotto
September 1, 2016

This article is the second in a series on the project of making a map of all the ships sunk during the First World War.

[Screenshot of HTML source code: https://sites.temple.edu/tudsc/files/2016/08/htmlcode.jpg]

In creating the map of the ships lost during the First World War there are two major steps: scraping the data from Wikipedia, and retrieving the coordinates from the descriptions when they are not already present. Since there are more than eight thousand ships, the key is to automate as much as possible. By hand, at an average of five minutes of work per ship, the total would come to more than 600 hours.
No way do I have the time for that (or the patience).

In this article we will review web scraping, the automated collection of data from web pages, in our case Wikipedia. I did not know Python, so I decided this was the perfect occasion to learn it. It is an easy and forgiving programming language; in addition, it is often used for exactly this kind of task and offers powerful, simple tools for interacting with data on the internet.

At the bottom of the article you will find the heavily commented code, so if you are only interested in seeing how I solved this problem you can jump directly there.

If instead you are still in the middle of planning your own project, a few considerations:

- Never programmed before? Don't be scared: it is easier than it seems, and if you think your project will take a lot of time to do manually, consider that by using Python you will learn a tool for the future AND save time.
- Keep things as simple as you can: store your data in plain text files, so that you can read them and modify them by hand.
- If you have not already thought of them, use CSV files (comma-separated values, a plain-text format for tabular data), but choose a separator character that you know will never appear in your data (I used @).
- Use BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/): it is a very powerful Python library for handling HTML; most importantly, it is broadly used and there are many tutorials online.

So you have decided to scrape some pages, grab the data, and collect it in an orderly fashion.
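The separator advice above can be sketched with Python's standard `csv` module, which accepts any single-character delimiter. The ship rows below are purely illustrative; the point is that descriptions full of commas survive a round trip untouched when @ is the delimiter:

```python
import csv
import io

# Illustrative rows in the shape the project collects:
# name, date of sinking, country, description of the events.
rows = [
    ["HMS Amphion", "6 August 1914", "United Kingdom",
     "Struck a mine in the North Sea, with heavy loss of life"],
    ["SM U-15", "9 August 1914", "Germany",
     "Rammed, shelled and sunk by HMS Birmingham"],
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="@")  # @ instead of the usual comma
writer.writerows(rows)

# Reading the text back recovers the original fields, commas and all,
# because @ never appears inside the data itself.
recovered = list(csv.reader(io.StringIO(buf.getvalue()), delimiter="@"))
```

Using the `csv` module rather than naive string concatenation also gives you quoting for free, should the separator ever sneak into a field after all.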
Your goal has a great friend and a major enemy.

The friend is your browser, which can show you the HTML source code of the page, the same code your program will search through.

The enemy is JavaScript, because when a page builds its content with JavaScript the data is not available directly in the source code, and you will need much more complicated programming. (If you do face this kind of task, look into Selenium, a library that drives a real browser such as Chrome to interact with the page: it is possible, but it requires more programming and is slower.)

If you are still considering copying and pasting every line of data by hand, think twice. In my case it would most likely have taken a whole day, whereas after a couple of hours of programming (I had never used Python or BeautifulSoup and had to learn them as I went) and four minutes of actual scraping, I had my text file with all the ships neatly laid out line by line: the name, the date of sinking, the country, in many cases the coordinates, and a description of the events.

The problem of the ships without coordinates was a much more complicated one, so I decided to use C#, a language that I know better.
I will analyze it in the next article.

---

Here is the code. WordPress mangled the original listing (the angle brackets in the docstring and a couple of `if` conditions were lost), so the missing pieces have been reconstructed and are flagged in the comments:

```python
'''
The program collects the names of the links and opens all the pages,
then searches for every div with style="overflow-x: auto;". For each of
them it grabs the caption text ("List of shipwrecks: 3 August 1914") and
elaborates the string, taking off the "List of shipwrecks: " prefix;
then it collects all the td cells, searches whether the fourth td has
the coordinates and collects them, and finally writes a line on a file
with all the text of the cells separated by @, followed by the
coordinates.
'''
import codecs
import re
import urllib.request

from bs4 import BeautifulSoup

OUTPUT_FILE = "List_Results_wikipedia_All_Pages_Plus_Coordinates and Names.txt"


def CheckLocation(Check_Link_Url, CheckLinkName):
    # Filters the CheckLinkName string and, if it passes the filter,
    # scrapes the page Check_Link_Url. (Left unfinished in the original
    # post; it is not called by the rest of the program.)
    FilterList = ["", ]
    with urllib.request.urlopen(Check_Link_Url) as response:
        data = response.read()
    return data


def search_for_overflow(soup):
    # We search for the data. In our case the data we are looking for is
    # always in a div with this specific attribute; these divs represent
    # each day of the month, and each contains a table listing all the
    # ships sunk on that date.
    for overflow_List in soup.findAll('div', attrs={'style': 'overflow-x:auto'}):
        date = overflow_List.caption.text
        # Clean the title of the table for the day, to have the date in a
        # simpler format.
        date = date.replace("List of shipwrecks: ", "")
        # There must be a more elegant way, but I found it simpler to
        # create a new BeautifulSoup object for every div found.
        soup2 = BeautifulSoup(overflow_List.prettify(), 'lxml')
        for List_tr in soup2.findAll('tr'):  # the table rows of each div
            soup3 = BeautifulSoup(List_tr.prettify(), 'lxml')  # again a soup object
            for index, List_td in enumerate(soup3.findAll('td')):  # then all the cells
                # Collapse every run of whitespace into a single space.
                stringLinea = re.sub(r'\s+', ' ', List_td.text)
                # Then delete footnote marks such as "[1]".
                stringLinea = re.sub(r'(\[\d+\])', ' ', stringLinea)
                # The condition below was lost in the original post; the
                # comments imply the first three cells are the simple ones.
                if index < 3:
                    stringLinea = stringLinea + "@"  # @ as separator character
                    stringLinea = stringLinea.replace("\n", "")  # delete the endline
                    with codecs.open(OUTPUT_FILE, "a", encoding='utf8') as file:
                        file.write(stringLinea)  # write the cell on the file
                else:  # this is the more complicated fourth cell
                    stringLinea = stringLinea.replace("\n", "")
                    # Search for coordinates in the string.
                    m = re.search('(?<=/).+(?=/)', stringLinea)
                    if m is None:
                        result = ""
                    else:
                        result = m.group(0)  # result = the coordinates found
                    # Once again we create a soup object, for the cell.
                    soup4 = BeautifulSoup(List_td.prettify(), 'lxml')
                    stringLinks = "none"
                    if soup4 is not None:
                        # Simplifying strategy: Wikipedia often links the
                        # interesting names in the text; we search for the
                        # links and store them to help the subsequent work.
                        ListNameLinks = soup4.findAll('a')
                        if len(ListNameLinks) == 0:  # no links in the text
                            stringLinks = "none"
                        else:
                            stringLinks = ""
                            # Put all the links in a single string,
                            # separated by semicolons. (The index test was
                            # lost in the original post; reconstructed.)
                            for i, element in enumerate(ListNameLinks):
                                if i == 0:
                                    stringLinks = element.text.replace("\n", "")
                                else:  # multiple links
                                    textadd = element.text.replace("\n", "")
                                    stringLinks = stringLinks + ";" + textadd
                    # In some cases the string still had multiple spaces;
                    # just for good measure we eliminate them.
                    stringLinks = re.sub(r'\s+', ' ', stringLinks)
                    with codecs.open(OUTPUT_FILE, "a", encoding='utf8') as file:
                        # Organize the whole line as name@date@coordinates@
                        # the list of links found in the text.
                        file.write(stringLinea + "@" + date + "@" + result
                                   + "@" + stringLinks + "\n")


def main():
    with open(OUTPUT_FILE, "w"):  # the file where we will store all the data
        pass
    # The text file where we listed all the web pages to scrape.
    with open('List of Wikipedia links by month August 1914-December 1918') as fileUrls:
        addresses = fileUrls.readlines()
    for page in addresses:  # open every page
        page = page.strip()  # drop the newline left behind by readlines()
        with urllib.request.urlopen(page) as response:
            data = response.read()
        print(page)
        soup = BeautifulSoup(data, "lxml")  # create the BeautifulSoup object
        search_for_overflow(soup)  # and search for the data with our function


if __name__ == "__main__":
    main()
```
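The coordinate extraction in the code above rests on one regular expression, `(?<=/).+(?=/)`, which greedily grabs everything between the first and the last slash of the cell text. Wikipedia's coordinate template renders as slash-separated forms (degrees and minutes, then decimal degrees), so the captured middle segment is the decimal-degrees form. A minimal sketch, with a sample string that only approximates the real Wikipedia output:

```python
import re

# Illustrative cell text in the shape Wikipedia's coordinate template
# renders: degrees-minutes / decimal degrees, separated by slashes.
cell = "The ship was sunk at 54°20'N 11°15'E / 54.333°N 11.250°E / 54.333; 11.250"

# The same pattern the scraper uses: a lookbehind for a '/', a greedy
# match, and a lookahead that forces the match to stop before the last '/'.
m = re.search('(?<=/).+(?=/)', cell)
coords = m.group(0).strip() if m else ""
print(coords)
```

Because `.+` is greedy, the match runs from just after the first slash to just before the last one; cells without any slash simply yield an empty string, which is exactly what the script writes for ships without coordinates.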