By Luling Huang
As a continuation of my previous post on web scraping a political discussion forum, I will show how to prepare the time column for relational event modeling with the R package “relevent” (Butts, 2015). I used OpenRefine for data cleaning.
The goal is to create a dyadic edgelist for each thread. The edgelist has three columns: time information, sender, and receiver. To use the modeling function “rem.dyad” in R, time must be relative to the start of observation (Butts, 2015).
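To make the target structure concrete, here is a minimal Python sketch of one thread's edgelist. It is only an illustration: the `Event` type, the `to_edgelist` helper, the usernames, and the time values are all made up for this example and are not part of the original data or of “relevent.”

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One row of the dyadic edgelist (hypothetical type for illustration)."""
    t: float        # seconds since the start of observation
    sender: str     # who posted
    receiver: str   # whom the post replies to


def to_edgelist(events):
    """Return the events ordered by relative time, one row per event."""
    return sorted(events, key=lambda e: e.t)


# Made-up replies within a single thread, deliberately out of order.
events = [
    Event(50160.0, "user_c", "user_a"),
    Event(1320.0, "user_b", "user_a"),
    Event(4500.0, "user_a", "user_b"),
]

edgelist = to_edgelist(events)
for row in edgelist:
    print(row.t, row.sender, row.receiver)
```

Each row holds the three columns described above, and the time column counts from the start of observation rather than from the calendar.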
In my data, time is recorded as it is shown on the website, and it is still in that raw form when loaded into OpenRefine.
Some timestamps also take the form “Today, xx:xx PM.” Here is a list of what should be done:
(1) Get rid of “��.” On the website, this is a no-break space, which the scraper should have stripped.
(2) Recode “Today” into “10/22” and “Yesterday” into “10/21.” The scraper ran from 11:10 PM on 10/21 to 3:53 AM on 10/22, so what “Today” and “Yesterday” refer to depends on when each page was scraped. A proper fix would be to manually verify each of the 203 “Yesterday” cells and the 87 “Today” cells. For this post, I simplify the process and recode them in aggregate, treating 10/22 as the scrape date.
(3) Transform all “12:xx AM” and “12:xx PM” into “00:xx AM” and “00:xx PM.” OpenRefine does not handle “12:xx AM/PM” well: without this transformation, it reads “12:xx AM” as “12:xx PM” and does not convert “12:xx PM” at all. I followed Little’s (2015, p. 32) trick to fix this issue.
(4) For each thread, create a column for onset time, which is the time of the first post in the thread.
(5) For each thread, create another column that holds the original time minus the onset time.
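Steps (1) and (2) are plain string replacements, so they can be sketched outside OpenRefine as well. The Python below is an illustration under stated assumptions, not the procedure used in this post: the helper name `clean_timestamp` and the sample strings are hypothetical, and the stray character is assumed to be a Unicode no-break space (U+00A0). Step (3) is left out because it works around OpenRefine's own date parsing.

```python
NBSP = "\u00a0"  # Unicode no-break space (assumed source of the stray characters)


def clean_timestamp(raw, today="10/22", yesterday="10/21"):
    """Strip no-break spaces and recode relative day labels.

    The default dates follow the aggregate simplification in step (2);
    they are parameters so that a verified date can be passed per row.
    """
    text = raw.replace(NBSP, " ")
    text = " ".join(text.split())  # collapse runs of whitespace and trim the ends
    text = text.replace("Yesterday", yesterday)
    text = text.replace("Today", today)
    return text


# Made-up sample cells for illustration.
print(clean_timestamp("\u00a0Today, 09:15 PM"))
print(clean_timestamp("Yesterday,\u00a011:40 PM"))
```

Keeping the two dates as parameters leaves room for the manual verification mentioned in step (2), since each “Today” or “Yesterday” cell could then be recoded with its own checked date.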
Here is how I carried out each step, mostly with GREL (General Refine Expression Language).
(1) For the column “PostDateAndTime,” apply the function value.replace("��", " ").
(2) With the same function, recode “Today” and “Yesterday” cells into “10/22” and “10/21.”
(3) With the same function, recode “12:xx AM” and “12:xx PM” into “00:xx AM” and “00:xx PM.”
(4) Apply a facet to the column “ThreadID” and select a single thread, so that the following actions affect only that thread. Apply another facet to the column “PostPosition” and select the row for the first post. For the column “PostDateAndTime,” use “Edit column -> Add a column based on this column” to fill the cell in the new column with the first post’s time. Remove the facet on “PostPosition” and “Fill down” the new column. Now we have a new column (named “onset time”).
(5) For the column “onset time,” use “Edit column -> Add a column based on this column” to create another new column, “t.” “t” is the difference between the original time and the onset time. Use the function diff() with three arguments:
a. cells["PostDateAndTime"].value.toDate(): Locate the original time and convert it to a date.
b. cells["onset time"].value.toDate(): Locate the onset time and convert it to a date.
c. "seconds": Designate seconds as the unit of calculation.
Put together in that order, the expression is diff(cells["PostDateAndTime"].value.toDate(), cells["onset time"].value.toDate(), "seconds").
For example, a value of “50160” in the new column “t” (for the fourth post in the thread) means that 50160 seconds passed between the start of the thread and the fourth post.
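The logic of steps (4) and (5) can also be sketched in Python. This is only an illustration of the same computation, not the OpenRefine procedure itself: the function `add_relative_seconds`, the dictionary keys, and the sample dates (including the year) are hypothetical, and the timestamps are assumed to be already parsed into `datetime` objects.

```python
from datetime import datetime


def add_relative_seconds(rows):
    """Add 't', the seconds elapsed since the first post of each row's thread.

    rows: list of dicts with 'thread_id', 'post_position', and 'time' (datetime).
    """
    # Step (4): find the onset row (smallest post position) of each thread.
    first = {}
    for row in rows:
        tid = row["thread_id"]
        if tid not in first or row["post_position"] < first[tid]["post_position"]:
            first[tid] = row

    # Step (5): original time minus onset time, expressed in seconds.
    result = []
    for row in rows:
        onset = first[row["thread_id"]]["time"]
        elapsed = int((row["time"] - onset).total_seconds())
        result.append({**row, "t": elapsed})
    return result


# Made-up sample rows; the dates are illustrative only.
sample = [
    {"thread_id": 1, "post_position": 1, "time": datetime(2015, 10, 20, 8, 0)},
    {"thread_id": 1, "post_position": 4, "time": datetime(2015, 10, 20, 21, 56)},
    {"thread_id": 2, "post_position": 1, "time": datetime(2015, 10, 21, 10, 0)},
    {"thread_id": 2, "post_position": 2, "time": datetime(2015, 10, 21, 10, 5)},
]

result = add_relative_seconds(sample)
for row in result:
    print(row["thread_id"], row["post_position"], row["t"])
```

Grouping by thread in code replaces the manual facet-and-fill-down routine, which otherwise has to be repeated once per thread.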
References
Butts, C. (2015). Package ‘relevent’ [R package documentation]. Retrieved from https://cran.r-project.org/web/packages/relevent/relevent.pdf
Little, J. (2015). OpenRefine: Introduction: Workbook [Online tutorial slide]. Retrieved from https://docs.google.com/presentation/d/1YkArEiaws0dMcyFZEppg4eZ7CxvqCTckjY78ao93zIw/edit#slide=id.p