{"id":2610,"date":"2017-01-17T12:00:41","date_gmt":"2017-01-17T16:00:41","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=2610"},"modified":"2023-01-24T11:28:28","modified_gmt":"2023-01-24T15:28:28","slug":"preparing-data-part-iii-split-data-and-apply-operation-history-to-subsets-combine-openrefine-and-python","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2017\/01\/17\/preparing-data-part-iii-split-data-and-apply-operation-history-to-subsets-combine-openrefine-and-python\/","title":{"rendered":"Preparing Data Part III \u2013 Split Data and Apply Operation History to Subsets (Combine OpenRefine and Python)"},"content":{"rendered":"<p>By Luling Huang<\/p>\n<p><!--more--><\/p>\n<p>The work done\u00a0in <a href=\"https:\/\/sites.temple.edu\/tudsc\/2016\/11\/29\/preparing-data-with-openrefine-part-i-time-in-sequence\/\">Part I<\/a> and <a href=\"https:\/\/sites.temple.edu\/tudsc\/2016\/12\/13\/preparing-data-with-openrefine-part-ii-assign-unique-numerical-identifiers\/\">Part II<\/a> with OpenRefine applies to each thread in my data. My next goal is to complete\u00a0the following three tasks with both OpenRefine and python:<\/p>\n<p>A. Split the data based on unique values in the column &#8220;ThreadID;&#8221;<\/p>\n<p>B. Do the cleaning work in Part I and Part II on one subset;<\/p>\n<p>C. Apply the work done in &#8220;Task B&#8221; to other subsets.<\/p>\n<p>The reason to include python is that it automates the process in Tasks A and C, which is especially important with a large dataset. Whereas my previous two blog posts\u00a0speak more to my own project, I hope this one can be helpful\u00a0for a more general situation when you want to automate the OpenRefine workflow on many subsets.<\/p>\n<p>Needed tools:<\/p>\n<p>I. OpenRefine;<\/p>\n<p>II. pandas: a python library for processing dataframes (here is the <a href=\"http:\/\/pandas.pydata.org\/pandas-docs\/stable\/install.html\">link for installation<\/a>);<\/p>\n<p>III. refine-python: a client library for interacting with OpenRefine in python (here is the <a href=\"https:\/\/github.com\/maxogden\/refine-python\">link for its GitHub page<\/a>, and yes, it&#8217;s still working under OpenRefine 2.6).<\/p>\n<p>Link for the hypothetical data used in this demonstration: Example_3Threads.csv (no longer available via Google Drive) (for simplification the time data is ready).<\/p>\n<p>Task A:<\/p>\n<p>What I thought I could do: Export subsets based on a facet\u00a0of the column &#8220;ThreadID&#8221; in OpenRefine.<\/p>\n<p>Is it possible to do? Yes. Although it is not an explicit functionality in OpenRefine, I found a workaround using Javascript provided by one developer. <a href=\"https:\/\/groups.google.com\/forum\/?fromgroups#!topic\/openrefine\/vF7b1ZygiUY\">You can find the method here<\/a>.<\/p>\n<p>Problem: The code was written for exporting Excel files. It did not work when I changed &#8220;xls&#8221; to &#8220;csv&#8221; in the first two lines. Given the dire situation that I had zero knowledge in Javascript, for an easy workaround of the workaround, I turned to pandas in python.<\/p>\n<p>Python code using pandas:<\/p>\n<div style=\"background: #f8f8f8;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0;line-height: 125%\">1\n2\n3\n4\n5\n6\n7\n8<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #008000;font-weight: bold\">import<\/span> <span style=\"color: #0000ff;font-weight: bold\">pandas<\/span> <span style=\"color: #008000;font-weight: bold\">as<\/span> <span style=\"color: #0000ff;font-weight: bold\">pd<\/span>\ntest_df <span style=\"color: #666666\">=<\/span> pd<span style=\"color: #666666\">.<\/span>read_csv(<span style=\"color: #ba2121\">'Example_3Threads.csv'<\/span>)\nThreadIDList <span style=\"color: #666666\">=<\/span> test_df<span style=\"color: #666666\">.<\/span>ThreadID<span style=\"color: #666666\">.<\/span>unique()\nroot <span style=\"color: #666666\">=<\/span> <span style=\"color: #ba2121\">'SplitFiles'<\/span>\n<span style=\"color: #008000;font-weight: bold\">for<\/span> i <span style=\"color: #aa22ff;font-weight: bold\">in<\/span> ThreadIDList:\n    spliti_df <span style=\"color: #666666\">=<\/span> test_df<span style=\"color: #666666\">.<\/span>loc[test_df[<span style=\"color: #ba2121\">'ThreadID'<\/span>] <span style=\"color: #666666\">==<\/span> i]\n    FileNamei <span style=\"color: #666666\">=<\/span> <span style=\"color: #ba2121\">'Example_3Threads-'<\/span> <span style=\"color: #666666\">+<\/span> <span style=\"color: #008000\">str<\/span>(i) <span style=\"color: #666666\">+<\/span> <span style=\"color: #ba2121\">'.csv'<\/span>\n    spliti_df<span style=\"color: #666666\">.<\/span>to_csv(root <span style=\"color: #666666\">+<\/span> <span style=\"color: #ba2121\">'\/'<\/span> <span style=\"color: #666666\">+<\/span> FileNamei, index <span style=\"color: #666666\">=<\/span> <span style=\"color: #008000\">False<\/span>, encoding <span style=\"color: #666666\">=<\/span> <span style=\"color: #ba2121\">'utf-8'<\/span>)\n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>Line 2: Create a dataframe from the sample csv file.<\/p>\n<p>Line 3: Create a list of unique values in the column &#8220;ThreadID.&#8221;<\/p>\n<p>Line 4: Locate a subdirectory under the working directory to store all split files.<\/p>\n<p>Lines 5 to 8: Loop through &#8220;ThreadIDList,&#8221; do the split, and export to csv with corresponding file names. The key is line 6: rows are selected based on values in the column &#8220;ThreadID.&#8221;<\/p>\n<p>Result: Three split files based on thread IDs have been stored in the folder &#8220;SplitFiles.&#8221;<\/p>\n<p>Task B:<\/p>\n<p>In OpenRefine, <a href=\"https:\/\/sites.temple.edu\/tudsc\/2016\/12\/13\/preparing-data-with-openrefine-part-ii-assign-unique-numerical-identifiers\/\">do the work covered in Part II on any one of the subsets.<\/a><\/p>\n<p>Task C:<\/p>\n<p>OpenRefine stores every step that changes the data in an operation history. History of one project can be extracted, saved in JSON, and applied to other projects. Therefore, the logic of completing Task C is straightforward: C1. When Task B is finished, extract and save the history as JSON; C2. Use refine-python to automate the process of applying the operation history to the other two subsets.<\/p>\n<p>C1: Extract operation history.<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-2635\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM.png\" alt=\"Screenshot: extract operation history\" width=\"781\" height=\"541\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM.png 781w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM-300x208.png 300w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM-768x532.png 768w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM-700x485.png 700w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM-232x161.png 232w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM-464x321.png 464w, https:\/\/sites.temple.edu\/tudsc\/files\/2016\/12\/Screen-Shot-2016-12-08-at-7.27.59-PM-624x432.png 624w\" sizes=\"auto, (max-width: 781px) 100vw, 781px\" \/><\/a><\/p>\n<p>The left column lists all the steps performed in Task B. The right column shows the operation history in JSON. We\u00a0can copy the JSON data to a code editor, and save it as &#8220;OperationHistory.json.&#8221;<\/p>\n<p>C2: Let python streamline the remaining process.<\/p>\n<p>Notes on directories for this demonstration: the working directory is &#8220;\/python refine\/,&#8221; where both &#8220;refine.py&#8221; and &#8220;OperationHistory.json&#8221; locate. Under &#8220;python refine,&#8221; &#8220;SplitFiles&#8221; stores split csv files from Task A, and &#8220;SplitFiles2&#8221; stores transformed csv files.<\/p>\n<p>Python code with brief comments:<\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<table>\n<tbody>\n<tr>\n<td>\n<pre style=\"margin: 0;line-height: 125%\"> 1\n 2\n 3\n 4\n 5\n 6\n 7\n 8\n 9\n10\n11\n12\n13\n14\n15\n16\n17\n18\n19\n20\n21\n22\n23\n24\n25\n26\n27\n28\n29\n30\n31\n32\n33\n34\n35\n36\n37\n38\n39\n40\n41\n42\n43\n44\n45\n46\n47\n48\n49\n50\n51\n52\n53<\/pre>\n<\/td>\n<td>\n<pre style=\"margin: 0;line-height: 125%\"><span style=\"color: #888888\"># Load refine-python<\/span>\n<span style=\"color: #008800;font-weight: bold\">import<\/span> <span style=\"color: #bb0066;font-weight: bold\">sys<\/span>\nsys.path.append(<span style=\"color: #dd2200;background-color: #fff0f0\">\"refine.py\"<\/span>)\n<span style=\"color: #008800;font-weight: bold\">import<\/span> <span style=\"color: #bb0066;font-weight: bold\">refine<\/span>\n<span style=\"color: #888888\"># Load other needed modules<\/span>\n<span style=\"color: #008800;font-weight: bold\">import<\/span> <span style=\"color: #bb0066;font-weight: bold\">json<\/span>, <span style=\"color: #bb0066;font-weight: bold\">os<\/span>, <span style=\"color: #bb0066;font-weight: bold\">re<\/span>\n\nroot = <span style=\"color: #dd2200;background-color: #fff0f0\">\"SplitFiles\"<\/span>\nroot2 = <span style=\"color: #dd2200;background-color: #fff0f0\">\"SplitFiles2\"<\/span>\n\n<span style=\"color: #888888\"># Loop through every csv file under the folder \"SplitFiles\"<\/span>\n<span style=\"color: #008800;font-weight: bold\">for<\/span> filename <span style=\"color: #008800\">in<\/span> os.listdir(root):\n    <span style=\"color: #888888\"># Avoid annoying hidden files (.DS_Store in Mac OS X for example)<\/span>\n    <span style=\"color: #008800;font-weight: bold\">if<\/span> filename.endswith(<span style=\"color: #dd2200;background-color: #fff0f0\">\".csv\"<\/span>): \n        <span style=\"color: #888888\"># Extract thread ID from file name<\/span>\n        filenameindex=filename.split(<span style=\"color: #dd2200;background-color: #fff0f0\">\"-\"<\/span>)[<span style=\"color: #0000dd;font-weight: bold\">1<\/span>].split(<span style=\"color: #dd2200;background-color: #fff0f0\">\".\"<\/span>)[<span style=\"color: #0000dd;font-weight: bold\">0<\/span>]\n        \n        <span style=\"color: #888888\"># Create an OpenRefine project<\/span>\n        r = refine.Refine()\n        pfilename = r.new_project(filename)\n        \n        <span style=\"color: #888888\"># Update JSON file<\/span>\n        <span style=\"color: #888888\">## 1. Read JSON file<\/span>\n        os.chdir(<span style=\"color: #dd2200;background-color: #fff0f0\">\"\/Users\/lulinghuang\/Desktop\/2016 DSC\/openrefine\/python refine\/\"<\/span>)\n        jsonfilename=<span style=\"color: #003388\">open<\/span>(<span style=\"color: #dd2200;background-color: #fff0f0\">\"OperationHistory.json\"<\/span>, <span style=\"color: #dd2200;background-color: #fff0f0\">\"r\"<\/span>)\n        datafilename=json.load(jsonfilename)\n        jsonfilename.close()\n        <span style=\"color: #888888\">## 2. Change specific parts of content<\/span>\n        datafilename[<span style=\"color: #0000dd;font-weight: bold\">5<\/span>][<span style=\"color: #dd2200;background-color: #fff0f0\">\"description\"<\/span>] = re.sub(<span style=\"color: #dd2200;background-color: #fff0f0\">r\"(?&lt;=-)\\d+\"<\/span>, filenameindex, datafilename[<span style=\"color: #0000dd;font-weight: bold\">5<\/span>][<span style=\"color: #dd2200;background-color: #fff0f0\">\"description\"<\/span>])\n        datafilename[<span style=\"color: #0000dd;font-weight: bold\">5<\/span>][<span style=\"color: #dd2200;background-color: #fff0f0\">\"expression\"<\/span>] = re.sub(<span style=\"color: #dd2200;background-color: #fff0f0\">r\"(?&lt;=-)\\d+\"<\/span>, filenameindex, datafilename[<span style=\"color: #0000dd;font-weight: bold\">5<\/span>][<span style=\"color: #dd2200;background-color: #fff0f0\">\"expression\"<\/span>])\n        <span style=\"color: #888888\">## 3. Update <\/span>\n        jsonfilename=<span style=\"color: #003388\">open<\/span>(<span style=\"color: #dd2200;background-color: #fff0f0\">\"OperationHistory.json\"<\/span>, <span style=\"color: #dd2200;background-color: #fff0f0\">\"w\"<\/span>)\n        jsonfilename.write(json.dumps(datafilename))\n        jsonfilename.close()\n        \n        <span style=\"color: #888888\"># Apply updated JSON to OpenRefine project<\/span>\n        pfilename.apply_operations(<span style=\"color: #dd2200;background-color: #fff0f0\">\"OperationHistory.json\"<\/span>)    \n        \n        <span style=\"color: #888888\"># Export csv file to the folder \"SplitFiles2\"<\/span>\n        exportfilename=<span style=\"color: #dd2200;background-color: #fff0f0\">\"Thread\"<\/span> + <span style=\"color: #dd2200;background-color: #fff0f0\">\"_\"<\/span> + <span style=\"color: #003388\">str<\/span>(filenameindex) + <span style=\"color: #dd2200;background-color: #fff0f0\">\".csv\"<\/span>\n        os.chdir(root2)\n        ffilename = <span style=\"color: #003388\">open<\/span>(exportfilename, <span style=\"color: #dd2200;background-color: #fff0f0\">\"w\"<\/span>)\n        ffilename.write(pfilename.export_rows(format=<span style=\"color: #dd2200;background-color: #fff0f0\">\"csv\"<\/span>))\n        ffilename.close()\n        \n        <span style=\"color: #888888\"># Delete completed project<\/span>\n        pfilename.delete_project()\n        \n        <span style=\"color: #888888\"># Reset directory for the loop<\/span>\n        os.chdir(<span style=\"color: #dd2200;background-color: #fff0f0\">\"\/Users\/lulinghuang\/Desktop\/2016 DSC\/openrefine\/python refine\/\"<\/span>) \n    <span style=\"color: #008800;font-weight: bold\">else<\/span>:\n        <span style=\"color: #008800;font-weight: bold\">continue<\/span>\n        \n<\/pre>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<p>The tricky part (for me, may not apply to other situations):<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/2016\/12\/13\/preparing-data-with-openrefine-part-ii-assign-unique-numerical-identifiers\/\">In Part II (Step 6)<\/a>, there is a part of GREL&#8217;s cross() function that points to the working project&#8217;s name. Therefore, I have to update the JSON file in each loop so that the project names match. Lines 22-34 do this job (read, change, and update JSON).<\/p>\n<p>More specifically, when python reads JSON, it converts JSON&#8217;s content to a list of dictionaries. Locating and changing parts of the JSON content thus follows the way such a list is structured (lines 29-30, the two places where project name needs to be changed locate in the keys &#8220;description&#8221; and &#8220;expression&#8221; under the sixth dictionary in the list).<\/p>\n<p>More concretely, when &#8220;Example_3Threads-3.csv&#8221; enters the loop (the project name follows its file name by default), after line 34, the original project name in JSON, e.g., &#8220;Example_3Threads-689(whatever the index is).csv,&#8221; is changed to &#8220;Example_3Threads-3.csv.&#8221;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Luling Huang<\/p>\n","protected":false},"author":3163,"featured_media":2658,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[303,2],"tags":[185,179,71],"class_list":["post-2610","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-grad-students","tag-data-cleaning","tag-openrefine","tag-python"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/2610","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/3163"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=2610"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/2610\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/2658"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=2610"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=2610"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=2610"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}