

{"id":2555,"date":"2016-11-29T12:00:09","date_gmt":"2016-11-29T16:00:09","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=2555"},"modified":"2017-12-01T19:43:30","modified_gmt":"2017-12-01T23:43:30","slug":"preparing-data-with-openrefine-part-i-time-in-sequence","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2016\/11\/29\/preparing-data-with-openrefine-part-i-time-in-sequence\/","title":{"rendered":"Preparing Data With OpenRefine Part I &#8211; Time in Sequence"},"content":{"rendered":"<p>By Luling Huang<\/p>\n<p><!--more--><\/p>\n<p>As a continuation of my previous post on <a href=\"https:\/\/sites.temple.edu\/tudsc\/2016\/10\/27\/scraping-for-studying-online-political-discussion-part-2-python\/\">webscrapping\u00a0a political discussion forum<\/a>, I will show how to prepare the time column for doing Relevant Event Modeling in R&#8217;s &#8220;relevent&#8221; (Butts, 2015). I used OpenRefine for data cleaning.<\/p>\n<p>The goal is to create a dyadic edgelist for each thread. The edgelist has three columns: time information, sender, and receiver. To use the modeling function &#8220;rem.dyad&#8221; in R, time must be relative to the start of observation (Butts, 2015).<\/p>\n<p>In my data, time is recorded as it is shown on the website. When the data was loaded in OpenRefine, it looks like this:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-9.48.17-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2560 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-9.48.17-AM.png\" alt=\"screen-shot-2016-11-17-at-9-48-17-am\" width=\"109\" height=\"69\" \/><\/a><\/p>\n<p>Also, some information is in the form &#8220;Today, xx:xx PM.&#8221; Therefore, here is a list of what should be done:<\/p>\n<p>(1) Get rid of\u00a0\ufffd\ufffd (Originally on the website, it is &#8220;&amp;nbsp;,&#8221; which is a no break space. The scraper should have done a better job to delete it).<\/p>\n<p>(2) Recode &#8220;Today&#8221; into &#8220;10\/22&#8221;, and &#8220;Yesterday&#8221; into &#8220;10\/21.&#8221; The scraper was run on 10\/22 (technically, it was run from 11:10 PM 10\/21 to 3:53 AM 10\/22, which creates a real problem of how to recode &#8220;Today&#8221; and &#8220;Yesterday.&#8221; To fix it, manual verification should be done for each of the 203 &#8220;Yesterday&#8221; and the 87 &#8220;Today&#8221; cells. For this post, I&#8217;ll simplify this process and do aggregate transformation).<\/p>\n<p>(3) Transform all &#8220;12:xx AM&#8221; and &#8220;12:xx PM&#8221; into &#8220;00:xx AM&#8221; and &#8220;00:xx PM.&#8221; The reason is that, somehow, OpenRefine does not like &#8220;12:xx AM\/PM.&#8221; When converting to time data, without transformation, OpenRefine would recognize &#8220;12:xx AM&#8221; as &#8220;12:xx PM,&#8221; and would not convert &#8220;12:xx PM&#8221; at all. I followed Little&#8217;s (2015, p. 32) trick to fix this issue.<\/p>\n<p>(4) For each thread, create a column for onset time, which should be the time of the first post in thread.<\/p>\n<p>(5) For each thread, create another column. This column is the original time subtracted by onset time.<\/p>\n<p>I used GREL (General Refine Expression Language).<\/p>\n<p>(1) For the column &#8220;PostDateAndTime,&#8221; apply the function &#8220;value.replace(&#8220;\ufffd\ufffd&#8221;, &#8221; &#8220;).<a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-10.52.57-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2561\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-10.52.57-AM.png\" alt=\"screen-shot-2016-11-17-at-10-52-57-am\" width=\"679\" height=\"450\" \/><\/a><\/p>\n<p>(2) With the same function, recode &#8220;Today&#8221; and &#8220;Yesterday&#8221; cells into &#8220;10\/22&#8221; and &#8220;10\/21.&#8221;<\/p>\n<p>(3) With the same function, recode\u00a0&#8220;12:xx AM&#8221; and &#8220;12:xx PM&#8221; into &#8220;00:xx AM&#8221; and &#8220;00:xx PM.&#8221;<\/p>\n<p>(4) Apply facet to the column &#8220;ThreadID&#8221; and select a single thread (in order to perform the following actions only to a single thread). Apply another facet to the column &#8220;PostPosition&#8221; and select the row for the first post. For the column&#8221;PostDateAndTime,&#8221; use &#8220;Edit column -&gt; Add a column based on this column&#8221; to fill the cell in the new column with the first post&#8217;s time. Cancel the facet for the column &#8220;PostPosition&#8221; and &#8220;Fill down&#8221; the new column. Now we have a new column (named &#8220;onset time&#8221;).<\/p>\n<p style=\"text-align: justify\"><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-11.14.39-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-2563 aligncenter\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-11.14.39-AM.png\" alt=\"screen-shot-2016-11-17-at-11-14-39-am\" width=\"115\" height=\"294\" \/><\/a><\/p>\n<p style=\"text-align: justify\">(5) For the column &#8220;onset time,&#8221; use &#8220;Edit column -&gt; Add a column based on this column&#8221; to create another new column &#8220;t.&#8221; &#8220;t&#8221; is the difference between original time and onset time. Use the function diff():<\/p>\n<p style=\"text-align: justify;padding-left: 30px\"><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-12.25.57-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2565\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-17-at-12.25.57-PM.png\" alt=\"screen-shot-2016-11-17-at-12-25-57-pm\" width=\"686\" height=\"460\" \/><\/a><\/p>\n<p>a<em>. cells[&#8220;PostDateAndTime&#8221;].value.toDate()<\/em>: Locate the original time and convert it to time format.<\/p>\n<p>b<em>. cells[&#8220;onset time&#8221;].value.toDate()<\/em>: Locate the onset time and convert it to time format.<\/p>\n<p>c. <em>&#8220;seconds&#8221;<\/em>: Designate second as the calculation unit.<\/p>\n<p>In the new column &#8220;t&#8221;, for example, &#8220;50160&#8221; (for the fourth post in thread) means that there are 50160 seconds between the start of the thread and the fourth post.<\/p>\n<p style=\"padding-left: 30px;text-align: center\">References<\/p>\n<p style=\"text-align: left\">Butts, C. (2015). Package &#8216;relevent&#8217; [R package documentation]. Retrieved from\u00a0https:\/\/cran.r-project.org\/web\/packages\/relevent\/relevent.pdf<\/p>\n<p style=\"text-align: left\">Little, J. (2015). OpenRefine: Introduction: Workbook [Online tutorial slide]. Retrieved from https:\/\/docs.google.com\/presentation\/d\/1YkArEiaws0dMcyFZEppg4eZ7CxvqCTckjY78ao93zIw\/edit#slide=id.p<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Luling Huang<\/p>\n","protected":false},"author":3163,"featured_media":2571,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[303,2],"tags":[179],"class_list":["post-2555","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-grad-students","tag-openrefine"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/2555","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/3163"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=2555"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/2555\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/2571"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=2555"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=2555"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=2555"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}