

{"id":2579,"date":"2016-12-13T14:11:43","date_gmt":"2016-12-13T18:11:43","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=2579"},"modified":"2023-01-17T15:12:13","modified_gmt":"2023-01-17T19:12:13","slug":"preparing-data-with-openrefine-part-ii-assign-unique-numerical-identifiers","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2016\/12\/13\/preparing-data-with-openrefine-part-ii-assign-unique-numerical-identifiers\/","title":{"rendered":"Preparing Data With OpenRefine Part II \u2013 Assign Unique Numerical Identifiers"},"content":{"rendered":"<p>By Luling Huang<\/p>\n<p><!--more--><\/p>\n<p><strong>Problem:<\/strong><\/p>\n<p>In R&#8217;s &#8220;relevent&#8221; (Butts, 2015), the identifiers of sender and receiver have to be integers, rather than strings. For example, if we have a sequence data with 12 ordered events like this:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-9.42.17-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2581\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-9.42.17-AM.png\" alt=\"sample data with three columns: 't,' 's,' and 'r.' 's' for senders and 'r' for receivers. Senders and receivers are coded as 'a,' 'b,' 'c,' and 'd.'\" width=\"119\" height=\"279\" \/><\/a><\/p>\n<p>we need\u00a0to assign &#8220;a,&#8221; &#8220;b,&#8221; &#8220;c,&#8221; and &#8220;d&#8221; to integers 1, 2, 3, and 4.<\/p>\n<p><strong>Assumption:<\/strong><\/p>\n<p>Receiver&#8217;s set <em>R<\/em> is a subset of sender&#8217;s set <em>S<\/em>.<\/p>\n<p><strong>Procedure:<\/strong><\/p>\n<p>1) Assign unique integers to objects in <em>S<\/em>.<\/p>\n<p>2) Match objects in <em>R<\/em> to the\u00a0assigned integers in <em>S<\/em>.<\/p>\n<p><strong>How to do it in OpenRefine:<\/strong><\/p>\n<p>1) Move the column &#8220;s&#8221; to beginning. That is, let the column be the first column from the left.<\/p>\n<p>2) Sort the column &#8220;s&#8221; and reorder rows permanently.<\/p>\n<p>3) Blank down &#8220;s&#8221; and switch to the &#8220;records&#8221; mode. The result is:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-10.50.12-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2583\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-10.50.12-AM.png\" alt=\"Result after step 3: for column 's,' all values of 'a' become grouped with '1,' 'b' with '2,' 'c' with '3,' and 'd' with '4.' Duplicate values deleted.\" width=\"195\" height=\"283\" \/><\/a><\/p>\n<p>Similar cells in &#8220;s&#8221; become &#8220;records&#8221; under each unique value. The purpose of this step is to group values of &#8220;a,&#8221; &#8220;b,&#8221; &#8220;c,&#8221; and &#8220;d&#8221; with the row indexes 1, 2, 3, and 4 (imagine these numbers as group labels).<\/p>\n<p>4) In &#8220;records&#8221; mode, for the column &#8220;s,&#8221; use &#8220;Edit column -&gt; Add a column based on this column&#8221; to create a new column &#8220;sID.&#8221; Use the GREL expression &#8220;row.record.index+1.&#8221; Result:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.09.53-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2585\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.09.53-AM.png\" alt=\"Result after step 4: for a new column 'sID,' cell values are assigned based on group values in 's.'\" width=\"246\" height=\"280\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.09.53-AM.png 246w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.09.53-AM-232x264.png 232w\" sizes=\"auto, (max-width: 246px) 100vw, 246px\" \/><\/a><\/p>\n<p>5) Fill down &#8220;s,&#8221; which is a reverse of Step 3. Result:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.14.26-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2586\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.14.26-AM.png\" alt=\"Result after step 5: duplicate values in 's' were added back.\" width=\"251\" height=\"282\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.14.26-AM.png 251w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.14.26-AM-232x261.png 232w\" sizes=\"auto, (max-width: 251px) 100vw, 251px\" \/><\/a><\/p>\n<p>Now, Procedure 1 has been done. We have assigned unique integers to objects in <em>S<\/em>.<\/p>\n<p>6) For the column &#8220;r,&#8221; use &#8220;Edit column -&gt; Add a column based on this column&#8221; to create another new column &#8220;rID.&#8221; Use the GREL function &#8220;cross():&#8221;<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2589\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM.png\" alt=\"Description of step 6. The function used: cell.cross('sample4 csv', 's')[0].cells['sID'].value\" width=\"687\" height=\"449\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM.png 687w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM-300x196.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM-232x152.png 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM-464x303.png 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-11.55.07-AM-624x408.png 624w\" sizes=\"auto, (max-width: 687px) 100vw, 687px\" \/><\/a><\/p>\n<p>The cross() function is written to match content of columns across different projects (e.g., different csv files). By designating the project name as the one we are working on, we can use cross() for our purpose: to match column values within one csv file.<\/p>\n<p>In <em>cross(cell, &#8220;sample4 csv&#8221;, &#8220;s&#8221;).cells[&#8220;sID&#8221;].value[0]<\/em>, what we do is to\u00a0first bind the two columns &#8220;s&#8221; and &#8220;sID&#8221; together (imagine the process as creating pairs of key-value in a dictionary); second, match the values in a third column, &#8220;r,&#8221; to &#8220;s;&#8221; third, assign values to a fourth column &#8220;rID&#8221; based on the matching chain of &#8220;s-sID-r.&#8221; Result (after sorting by time):<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-2.28.21-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2592\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/11\/Screen-Shot-2016-11-28-at-2.28.21-PM.png\" alt=\"Final result: Senders and receivers are coded as '1,' '2,' '3,' and '4.'\" width=\"140\" height=\"279\" \/><\/a><\/p>\n<p>Feature image&#8217;s source: map by Martin Magdinier.<\/p>\n<p style=\"padding-left: 30px;text-align: center\">References<\/p>\n<p style=\"text-align: left\">Butts, C. (2015). Package &#8216;relevent&#8217; [R package documentation]. Retrieved from\u00a0https:\/\/cran.r-project.org\/web\/packages\/relevent\/relevent.pdf<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Luling Huang<\/p>\n","protected":false},"author":3163,"featured_media":2646,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[303,2],"tags":[185,179],"class_list":["post-2579","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-grad-students","tag-data-cleaning","tag-openrefine"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/2579","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/3163"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=2579"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/2579\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/2646"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=2579"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=2579"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=2579"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}