

{"id":4036,"date":"2017-09-21T11:35:14","date_gmt":"2017-09-21T15:35:14","guid":{"rendered":"https:\/\/sites.temple.edu\/tudsc\/?p=4036"},"modified":"2017-12-01T19:43:29","modified_gmt":"2017-12-01T23:43:29","slug":"processing-json-data-with-jq","status":"publish","type":"post","link":"https:\/\/sites.temple.edu\/tudsc\/2017\/09\/21\/processing-json-data-with-jq\/","title":{"rendered":"Processing JSON Data With jq"},"content":{"rendered":"<p>By Luling Huang<\/p>\n<p><!--more--><\/p>\n<p>When we collect data from the Web through API (e.g., Twitter), we usually receive data in JSON (JavaScript Object Notation). How do we sift out unwanted information and transform JSON into the tabular formats many data analysis programs recognize (e.g., CSV; Comma-Separated Values)?<\/p>\n<p>This post introduces a powerful command line tool called <a href=\"https:\/\/stedolan.github.io\/jq\/\">jq<\/a>\u00a0for processing json data. It works on multiple platforms (<a href=\"https:\/\/stedolan.github.io\/jq\/download\/\">see more for installation here<\/a>). The primary reason to use jq is that no coding is involved. Although <a href=\"https:\/\/sites.temple.edu\/tudsc\/2017\/01\/17\/preparing-data-part-iii-split-data-and-apply-operation-history-to-subsets-combine-openrefine-and-python\/\">Python&#8217;s &#8220;json&#8221; package<\/a> is easy to work with, jq can get what we want even with just one line of command.<\/p>\n<p>The data structure of JSON is similar to Python&#8217;s dictionary (<a href=\"https:\/\/sites.temple.edu\/tudsc\/2016\/10\/27\/scraping-for-studying-online-political-discussion-part-2-python\/\">with key-value pairs<\/a>, in JSON they&#8217;re called name-value pairs). This is why Python&#8217;s &#8220;json&#8221; package is very handy if we have already stored some data in Python&#8217;s dictionary. But let&#8217;s just assume no Python is involved and we just get a JSON file from somewhere.<\/p>\n<p>In OS X&#8217;s Terminal (it should work similarly on Windows), we can display the structure of a JSON file called &#8220;test.json&#8221; with this command (remember to navigate to the directory that stores the file first):<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 50%\"><span style=\"color: #888888\">$ jq '.' test.json<\/span>\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>&#8220;jq&#8221; simply invokes jq. &#8220;.&#8221; is the basic jq operator to filter information. Here the JSON file is just reproduced without filter. Below is a small portion of a tweet&#8217;s information from a Twitter data set (collected through\u00a0<a href=\"https:\/\/github.com\/DocNow\/twarc\">twarc<\/a>). We can interpret the data structure from a CSV perspective: in one row (for each tweet), we have 6 variables (&#8220;contributors,&#8221; &#8220;truncated,&#8221; &#8220;is_quote_status,&#8221; etc.) as column names, and we have 6 corresponding values for each variable for that tweet (&#8220;null,&#8221; &#8220;false,&#8221; &#8220;false,&#8221; etc.).<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/json-data-example.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4037\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/json-data-example.png\" alt=\"json data example\" width=\"239\" height=\"88\" \/><\/a><\/p>\n<p>Because Twitter API gives us a lot metadata information that may not be relevant, our next task is to find the information we want to use for data analysis. Let&#8217;s say &#8220;id&#8221; is what we want to keep.<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 50%\"><span style=\"color: #888888\">$ jq '.id' test.json<\/span>\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>The above command will print out all the values under the attribute name of &#8220;id.&#8221; To see how many tweets we have collected, we can do this:<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 50%\"><span style=\"color: #888888\">$ jq '.id' test.json | wc -l<\/span>\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>&#8220;wc&#8221; is for word count and &#8220;-l&#8221; is for counting number of lines in output (this is a built-in command in Terminal, not part of jq). And we use &#8220;|&#8221; to &#8220;pipe&#8221; the result of filtering into the function for counting.<\/p>\n<p>From inspecting the data, we know that tweet content is stored under the attribute name of &#8220;full_text.&#8221; Let&#8217;s extract this information together with id, and put the filtering result in an JSON array enclosed by [ ] for each tweet.<\/p>\n<p>Note that in the original JSON file, each tweet is enclosed by { } (i.e., a JSON object with name-value pairs). Because we want to save the output to CSV, name-value pairs are no longer needed. Instead, we need arrays to represent rows in CSV. In other words, the hierarchical JSON structure is flattened.<\/p>\n<p>The procedure can be illustrated with a simple example. This is what we begin with (there are three tweets each of which is a JSON object and each object has two name-value pairs):<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-json-objects.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4046\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-json-objects.png\" alt=\"this is a simple example of 3 json objects\" width=\"252\" height=\"189\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-json-objects.png 252w, https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-json-objects-213x160.png 213w\" sizes=\"auto, (max-width: 252px) 100vw, 252px\" \/><\/a><\/p>\n<p>Then we extract the information from &#8220;id&#8221; and &#8220;full_text&#8221; and put it into three arrays each of which is for one tweet:<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-json-array.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4047\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-json-array.png\" alt=\"this is a simple example of three json arrays\" width=\"218\" height=\"52\" \/><\/a><\/p>\n<p>If we write each array as one row, then we have the CSV output (if necessary, column names can be added easily later):<\/p>\n<p><a href=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-csv-output2.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-4049\" src=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-csv-output2.png\" alt=\"\" width=\"153\" height=\"75\" srcset=\"https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-csv-output2.png 153w, https:\/\/sites.temple.edu\/tudsc\/files\/2017\/09\/simple-csv-output2-150x75.png 150w\" sizes=\"auto, (max-width: 153px) 100vw, 153px\" \/><\/a><\/p>\n<p>By using jq, we can achieve this with:<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<div style=\"background: #ffffff;overflow: auto;width: auto;border: solid gray;border-width: .1em .1em .1em .8em;padding: .2em .6em\">\n<pre style=\"margin: 0;line-height: 50%\"><span style=\"color: #888888\">$ jq -r '[.id, .full_text] | @csv' test.json &gt; test.csv<\/span>\r\n<\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>&#8220;[.id, .full_text]&#8221; means we create a new JSON array and put the filtered information in it. Then we pipe the array to &#8220;@csv,&#8221; which formats the array to CSV. &#8220;-r&#8221; is for &#8220;raw output,&#8221; which tells jq to treat the CSV formatted result as plain text. \u00a0&#8220;&gt; test.csv&#8221; designates the output file.<\/p>\n<p>Now we have turned a JSON file into a two-column CSV and it&#8217;s ready to be put into data analysis programs.<\/p>\n<p>For more information on JSON syntax, here is the <a href=\"http:\/\/www.json.org\/\">official introduction page<\/a>. Also, there are many other features in jq and\u00a0<a href=\"https:\/\/stedolan.github.io\/jq\/manual\/\">here is the manual<\/a>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>By Luling Huang<\/p>\n","protected":false},"author":3163,"featured_media":4052,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[303,2],"tags":[240,239],"class_list":["post-4036","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-grad-students","tag-jq","tag-json"],"_links":{"self":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/4036","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/users\/3163"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/comments?post=4036"}],"version-history":[{"count":0,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/posts\/4036\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media\/4052"}],"wp:attachment":[{"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/media?parent=4036"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/categories?post=4036"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.temple.edu\/tudsc\/wp-json\/wp\/v2\/tags?post=4036"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}