Loretta C. Duckworth Scholars Studio

Preparing Data With OpenRefine Part I – Time in Sequence

Posted on November 29, 2016 (updated December 1, 2017) by Luling Huang

As a continuation of my previous post on web scraping a political discussion forum, I will show how to prepare the time column for Relational Event Modeling with the R package “relevent” (Butts, 2015). I used OpenRefine for data cleaning.

The goal is to create a dyadic edgelist for each thread. The edgelist has three columns: time information, sender, and receiver. To use the modeling function “rem.dyad” in R, time must be relative to the start of observation (Butts, 2015).
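To make the target structure concrete, here is a minimal sketch in Python (outside OpenRefine) of what a three-column edgelist could look like once time is relative to the start of observation. The participant names, the times, and the helper `build_edgelist` are hypothetical. Numbering the participants with integer IDs is a convenience I am assuming here, not something this post prescribes.

```python
def build_edgelist(events):
    """Return rows of (time, sender_id, receiver_id), sorted by time.

    `events` holds (seconds_since_start, sender_name, receiver_name).
    Names are mapped to integer IDs in order of first appearance.
    """
    ordered = sorted(events, key=lambda event: event[0])
    ids = {}
    rows = []
    for time, sender, receiver in ordered:
        for name in (sender, receiver):
            # Assign the next free ID the first time a name is seen.
            ids.setdefault(name, len(ids) + 1)
        rows.append((time, ids[sender], ids[receiver]))
    return rows, ids


# Hypothetical events from one thread, deliberately out of order.
events = [
    (5400, "bob", "alice"),
    (0, "alice", "carol"),
    (7200, "carol", "bob"),
]

rows, ids = build_edgelist(events)
for row in rows:
    print(row)
```

The first row starts at time 0, and every later row gives the number of seconds elapsed since then.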

In my data, time is recorded as it is shown on the website. When loaded into OpenRefine, the data looks like this:

[Screenshot: the time column as loaded in OpenRefine]

Some timestamps are also in the form “Today, xx:xx PM.” Here is a list of what needs to be done:

(1) Get rid of “��” (on the website, this was originally a no-break space; the scraper should have removed it).

(2) Recode “Today” into “10/22” and “Yesterday” into “10/21.” The scraper was run on 10/22. Technically, it ran from 11:10 PM on 10/21 to 3:53 AM on 10/22, so what “Today” and “Yesterday” mean depends on when each page was scraped. A proper fix would be to verify each of the 203 “Yesterday” cells and the 87 “Today” cells manually. For this post, I’ll simplify the process and transform them in aggregate.

(3) Transform all “12:xx AM” and “12:xx PM” into “00:xx AM” and “00:xx PM.” OpenRefine does not handle “12:xx AM/PM” well: without this transformation, converting to a date reads “12:xx AM” as “12:xx PM” and does not convert “12:xx PM” at all. I followed Little’s (2015, p. 32) trick to fix this issue.

(4) For each thread, create a column for onset time, which should be the time of the first post in thread.

(5) For each thread, create another column that holds the original time minus the onset time.
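Cleaning steps (1) to (3) are plain string substitutions, so they can be sketched outside OpenRefine as well. The Python below is only an illustration under stated assumptions: the raw strings are hypothetical (the post does not show the exact format), the no-break space is assumed to arrive as the character U+00A0, and `clean_timestamp` is a name I made up. Step (3) is a workaround for OpenRefine's date conversion and is reproduced here only to show the pattern.

```python
import re

NBSP = "\u00a0"        # assumed form of the no-break space in the raw data
TODAY = "10/22"        # dates used in this post
YESTERDAY = "10/21"


def clean_timestamp(raw: str) -> str:
    """Apply cleaning steps (1)-(3) to one raw time string."""
    # (1) Replace the no-break space with an ordinary space.
    text = raw.replace(NBSP, " ")
    # (2) Recode relative day names into calendar dates.
    text = text.replace("Today", TODAY).replace("Yesterday", YESTERDAY)
    # (3) Rewrite "12:xx AM/PM" as "00:xx AM/PM".
    text = re.sub(r"\b12:(\d{2}) (AM|PM)\b", r"00:\1 \2", text)
    return text.strip()


# Hypothetical raw values.
print(clean_timestamp("Today,\u00a012:05 PM"))
print(clean_timestamp("Yesterday, 09:30 AM"))
```

Anchoring the pattern on “12:” followed by two digits and AM/PM avoids touching a “12” that appears elsewhere in the string, such as in a date.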

I used GREL (General Refine Expression Language).

(1) For the column “PostDateAndTime,” apply the function value.replace("��", " ").

(2) With the same function, recode “Today” and “Yesterday” cells into “10/22” and “10/21,” for example value.replace("Today", "10/22").

(3) With the same function, recode “12:xx AM” and “12:xx PM” into “00:xx AM” and “00:xx PM.”

(4) Apply a facet to the column “ThreadID” and select a single thread, so that the following actions apply only to that thread. Apply another facet to the column “PostPosition” and select the row for the first post. For the column “PostDateAndTime,” use “Edit column -> Add a column based on this column” to fill the cell in the new column with the first post’s time. Remove the facet on the column “PostPosition” and “Fill down” the new column. Now we have a new column (named “onset time”).

[Screenshot: the new “onset time” column after filling down]

(5) For the column “onset time,” use “Edit column -> Add a column based on this column” to create another new column, “t.” “t” is the difference between the original time and the onset time. Use the function diff():

[Screenshot: the expression diff(cells["PostDateAndTime"].value.toDate(), cells["onset time"].value.toDate(), "seconds")]

a. cells["PostDateAndTime"].value.toDate(): Locate the original time and convert it to a date.

b. cells["onset time"].value.toDate(): Locate the onset time and convert it to a date.

c. "seconds": Set seconds as the unit of the difference.

In the new column “t”, for example, “50160” (for the fourth post in thread) means that there are 50160 seconds between the start of the thread and the fourth post.
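Steps (4) and (5) amount to a per-thread subtraction. As a rough sketch of the same arithmetic in Python, not the OpenRefine workflow itself: the rows, the timestamp format, and the helper `add_relative_time` are all hypothetical, and the times are chosen so that the fourth post comes out at 50160 seconds (13 hours 56 minutes) after the first.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed format; the real column looks different

# Hypothetical rows: (ThreadID, PostPosition, PostDateAndTime).
posts = [
    ("t1", 1, "2016-10-20 08:00:00"),
    ("t1", 2, "2016-10-20 09:30:00"),
    ("t1", 3, "2016-10-20 10:00:00"),
    ("t1", 4, "2016-10-20 21:56:00"),
    ("t2", 1, "2016-10-21 12:00:00"),
    ("t2", 2, "2016-10-21 12:00:45"),
]


def add_relative_time(rows):
    """Return (thread, position, t), with t in seconds since the thread's first post."""
    # Step (4): the onset time of a thread is the time of its first post.
    onset = {
        thread: datetime.strptime(stamp, FMT)
        for thread, position, stamp in rows
        if position == 1
    }
    # Step (5): t = original time minus onset time, in seconds.
    result = []
    for thread, position, stamp in rows:
        delta = datetime.strptime(stamp, FMT) - onset[thread]
        result.append((thread, position, int(delta.total_seconds())))
    return result


for row in add_relative_time(posts):
    print(row)
```

Looking up the onset by thread plays the role of the facet and “Fill down” combination: every row in a thread is compared against the same first-post time.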

References

Butts, C. (2015). Package ‘relevent’ [R package documentation]. Retrieved from https://cran.r-project.org/web/packages/relevent/relevent.pdf

Little, J. (2015). OpenRefine: Introduction: Workbook [Online tutorial slide]. Retrieved from https://docs.google.com/presentation/d/1YkArEiaws0dMcyFZEppg4eZ7CxvqCTckjY78ao93zIw/edit#slide=id.p
