Loretta C. Duckworth Scholars Studio



Preparing Data With OpenRefine Part I – Time in Sequence

Posted on November 29, 2016 (updated December 1, 2017) by Luling Huang

As a continuation of my previous post on web scraping a political discussion forum, I will show how to prepare the time column for relational event modeling with the R package “relevent” (Butts, 2015). I used OpenRefine for data cleaning.

The goal is to create a dyadic edgelist for each thread. The edgelist has three columns: time information, sender, and receiver. To use the modeling function “rem.dyad” in R, time must be relative to the start of observation (Butts, 2015).
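As a rough illustration of the target structure, here is a minimal Python sketch (not the R workflow itself) of a three-column edgelist whose time column is relative to the first event. The user names, the timestamps, and the function name to_edgelist are hypothetical, and treating the first event as the start of observation is an assumption.

```python
from datetime import datetime

# Hypothetical posts in one thread: (absolute time, sender, receiver).
posts = [
    (datetime(2016, 10, 21, 8, 0), "user_a", "user_b"),
    (datetime(2016, 10, 21, 8, 30), "user_b", "user_a"),
    (datetime(2016, 10, 21, 9, 15), "user_c", "user_a"),
]

def to_edgelist(posts):
    """Return rows of (seconds since the first event, sender, receiver)."""
    ordered = sorted(posts, key=lambda p: p[0])
    start = ordered[0][0]  # assumption: observation starts at the first event
    return [((when - start).total_seconds(), sender, receiver)
            for when, sender, receiver in ordered]

edgelist = to_edgelist(posts)
for row in edgelist:
    print(row)
```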

In my data, time is recorded as it is shown on the website. When the data is loaded into OpenRefine, it looks like this:

[Screenshot: the time column as loaded in OpenRefine]

Some timestamps also take the form “Today, xx:xx PM.” Here is a list of what needs to be done:

(1) Get rid of the garbled characters “��” (on the website, this was originally a no-break space, which the scraper should have stripped).

(2) Recode “Today” as “10/22” and “Yesterday” as “10/21.” The scraper ran from 11:10 PM on 10/21 to 3:53 AM on 10/22. Because the run crossed midnight, “Today” and “Yesterday” can refer to different dates depending on when a given page was scraped. A proper fix would manually verify each of the 203 “Yesterday” cells and the 87 “Today” cells. For this post, I simplify the process and recode them in bulk.

(3) Transform all “12:xx AM” and “12:xx PM” into “00:xx AM” and “00:xx PM.” For some reason, OpenRefine mishandles “12:xx AM/PM”: without this transformation, it reads “12:xx AM” as “12:xx PM” and does not convert “12:xx PM” at all. I followed Little’s (2015, p. 32) trick to fix this issue.

(4) For each thread, create a column for onset time, which is the time of the first post in the thread.

(5) For each thread, create another column holding the original time minus the onset time.
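For readers who want to see the logic of steps (1) to (3) outside OpenRefine, here is a minimal Python sketch. It is not the method used in this post. The function names, the “MM/DD, HH:MM AM” layout, and the year 2016 are assumptions made for illustration. Python's “%I” directive already reads 12:xx AM as just after midnight, so the 00:xx workaround from step (3) should not be needed here.

```python
from datetime import datetime

NBSP = "\u00a0"         # no-break space, as it appears on the website
REPLACEMENT = "\ufffd"  # the garbled character shown after a bad decode

def clean_timestamp(raw, today="10/22", yesterday="10/21"):
    """Steps (1) and (2): strip stray characters, recode relative days."""
    text = raw.replace(NBSP, " ").replace(REPLACEMENT, " ")
    text = text.replace("Today", today).replace("Yesterday", yesterday)
    return " ".join(text.split())  # collapse repeated whitespace

def parse_timestamp(text, year=2016):
    """Parse 'MM/DD, HH:MM AM' (assumed layout) into a datetime."""
    return datetime.strptime(f"{year}/{text}", "%Y/%m/%d, %I:%M %p")

cleaned = clean_timestamp("Today,\u00a012:05 AM")
print(cleaned, "->", parse_timestamp(cleaned))
```

Passing the dates as arguments keeps the bulk recoding of “Today” and “Yesterday” visible, so it can be swapped for per-cell values after manual verification.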

I used GREL (General Refine Expression Language).

(1) For the column “PostDateAndTime,” apply the function value.replace("��", " ").

[Screenshot: applying value.replace() to the column in OpenRefine]

(2) With the same function, recode “Today” and “Yesterday” cells into “10/22” and “10/21.”

(3) With the same function, recode “12:xx AM” and “12:xx PM” into “00:xx AM” and “00:xx PM.”

(4) Apply a facet to the column “ThreadID” and select a single thread, so that the following actions affect only that thread. Apply another facet to the column “PostPosition” and select the row for the first post. For the column “PostDateAndTime,” use “Edit column -> Add a column based on this column” to fill the cell in the new column with the first post’s time. Remove the facet on the column “PostPosition” and “Fill down” the new column. Now we have a new column (named “onset time”).

[Screenshot: the new “onset time” column after filling down]

(5) For the column “onset time,” use “Edit column -> Add a column based on this column” to create another new column, “t.” “t” is the difference between the original time and the onset time. Use the function diff() with the three arguments described below:

[Screenshot: the diff() expression in OpenRefine]

a. cells["PostDateAndTime"].value.toDate(): Locate the original time and convert it to a date.

b. cells["onset time"].value.toDate(): Locate the onset time and convert it to a date.

c. "seconds": Designate seconds as the unit of calculation.

In the new column “t”, for example, “50160” (for the fourth post in thread) means that there are 50160 seconds between the start of the thread and the fourth post.
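The facet, fill-down, and diff() steps above can also be sketched in Python for comparison. This is an illustrative equivalent under stated assumptions, not the OpenRefine procedure itself: the thread IDs, post positions, and timestamps are hypothetical, chosen so that the fourth post falls 50160 seconds after the first.

```python
from datetime import datetime

# Hypothetical rows: (ThreadID, PostPosition, PostDateAndTime).
posts = [
    ("A", 1, datetime(2016, 10, 21, 8, 0)),
    ("A", 2, datetime(2016, 10, 21, 8, 20)),
    ("A", 3, datetime(2016, 10, 21, 12, 0)),
    ("A", 4, datetime(2016, 10, 21, 21, 56)),
    ("B", 1, datetime(2016, 10, 22, 1, 0)),
    ("B", 2, datetime(2016, 10, 22, 1, 10)),
]

def add_relative_time(posts):
    """Return (thread, position, t), with t in seconds since thread onset."""
    onset = {}
    # The lowest post position in each thread supplies that thread's onset.
    for thread, position, when in sorted(posts, key=lambda p: (p[0], p[1])):
        onset.setdefault(thread, when)
    return [(thread, position, int((when - onset[thread]).total_seconds()))
            for thread, position, when in posts]

relative = add_relative_time(posts)
for row in relative:
    print(row)
```

Keeping one onset per thread in a dictionary plays the role of the “onset time” column, and the subtraction mirrors diff() with "seconds" as the unit.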

References

Butts, C. (2015). Package ‘relevent’ [R package documentation]. Retrieved from https://cran.r-project.org/web/packages/relevent/relevent.pdf

Little, J. (2015). OpenRefine: Introduction: Workbook [Online tutorial slide]. Retrieved from https://docs.google.com/presentation/d/1YkArEiaws0dMcyFZEppg4eZ7CxvqCTckjY78ao93zIw/edit#slide=id.p
