Loretta C. Duckworth Scholars Studio

Preparing Data With OpenRefine Part II – Assign Unique Numerical Identifiers

Posted on December 13, 2016 (updated January 17, 2023) by Luling Huang

Problem:

In the R package “relevent” (Butts, 2015), sender and receiver identifiers must be integers rather than strings. For example, suppose we have a sequence dataset with 12 ordered events like this:

[Image: sample data with three columns, “t,” “s,” and “r,” where “s” holds senders and “r” holds receivers. Senders and receivers are coded as “a,” “b,” “c,” and “d.”]

We need to map “a,” “b,” “c,” and “d” to the integers 1, 2, 3, and 4.

Assumption:

Receiver’s set R is a subset of sender’s set S.
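Before starting, it may be worth checking that this assumption holds for your data. The following is a minimal Python sketch of such a check; the event list is made up for illustration and is not the sample data shown above.

```python
# Hypothetical event list of (t, s, r) tuples; not the post's actual sample data.
events = [(1, "a", "b"), (2, "b", "a"), (3, "c", "a"), (4, "d", "c")]

senders = {s for _, s, _ in events}
receivers = {r for _, _, r in events}

# Any receiver that never appears as a sender would have no integer to match against.
unmatched = receivers - senders
assumption_holds = not unmatched

print(assumption_holds, sorted(unmatched))
```

If "unmatched" is not empty, those receivers would need identifiers assigned some other way before the matching step can work.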

Procedure:

1) Assign unique integers to objects in S.

2) Match objects in R to the assigned integers in S.
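As a rough illustration of the two-step procedure outside OpenRefine, here is a short Python sketch. The event list and the variable names are hypothetical, and the sketch relies on the assumption above that every receiver also appears as a sender.

```python
# Hypothetical event list of (t, s, r) tuples; not the post's actual sample data.
events = [(1, "a", "b"), (2, "b", "c"), (3, "c", "a"), (4, "a", "d"), (5, "d", "b")]

# Step 1: assign the integers 1, 2, 3, ... to the sorted unique senders.
unique_senders = sorted({s for _, s, _ in events})
sender_ids = {name: i for i, name in enumerate(unique_senders, start=1)}

# Step 2: look up each receiver in the mapping built from the senders.
coded = [(t, sender_ids[s], sender_ids[r]) for t, s, r in events]

print(sender_ids)
print(coded)
```

A receiver missing from the sender set would raise a KeyError in step 2, which is the Python counterpart of the subset assumption being violated.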

How to do it in OpenRefine:

1) Move the column “s” to the beginning so that it is the leftmost column. OpenRefine’s “records” mode, used in Step 3, groups rows by the first column.

2) Sort the column “s” and reorder rows permanently.

3) Blank down “s” and switch to “records” mode. The result is:

Result after step 3: for column 's,' all values of 'a' become grouped with '1,' 'b' with '2,' 'c' with '3,' and 'd' with '4.' Duplicate values deleted.

Rows with identical values in “s” are now grouped into one “record” per unique value. The purpose of this step is to pair the values “a,” “b,” “c,” and “d” with the record numbers 1, 2, 3, and 4 (think of these numbers as group labels).

4) In “records” mode, for the column “s,” use “Edit column -> Add column based on this column” to create a new column “sID.” Use the GREL expression row.record.index + 1 (the record index starts at 0, so adding 1 yields 1, 2, 3, and so on). Result:

Result after step 4: for a new column 'sID,' cell values are assigned based on group values in 's.'

5) Fill down “s,” which reverses Step 3. Result:

Result after step 5: duplicate values in 's' were added back.

Step 1 of the procedure is now complete: we have assigned unique integers to the objects in S.
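To make the logic of the sort, blank down, record index, and fill down steps concrete, here is a minimal Python sketch that mimics them on a small made-up table. The helper functions are hypothetical stand-ins, not OpenRefine code, and the sketch assumes that every row in a record receives that record's index.

```python
# Hypothetical rows; not the post's actual sample data.
rows = [
    {"t": 1, "s": "b", "r": "a"},
    {"t": 2, "s": "a", "r": "c"},
    {"t": 3, "s": "c", "r": "a"},
    {"t": 4, "s": "a", "r": "b"},
]

def blank_down(rows, col):
    """Blank out a cell when it repeats the value directly above it."""
    out, previous = [], None
    for row in rows:
        value = row[col]
        out.append({**row, col: None if value == previous else value})
        previous = value
    return out

def add_record_id(rows, key_col, new_col):
    """Start a new record at each non-blank key cell; store record index + 1."""
    out, record_index = [], -1
    for row in rows:
        if row[key_col] is not None:
            record_index += 1
        out.append({**row, new_col: record_index + 1})
    return out

def fill_down(rows, col):
    """Copy the last non-blank value into the blank cells below it."""
    out, last = [], None
    for row in rows:
        if row[col] is not None:
            last = row[col]
        out.append({**row, col: last})
    return out

ordered = sorted(rows, key=lambda row: row["s"])   # sort by "s"
blanked = blank_down(ordered, "s")                 # blank down "s"
numbered = add_record_id(blanked, "s", "sID")      # add "sID" per record
filled = fill_down(numbered, "s")                  # fill down "s"

pairs = [(row["s"], row["sID"]) for row in filled]
print(pairs)
```

Sorting first matters: blank down only merges neighboring duplicates, so unsorted data would split one sender across several records.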

6) For the column “r,” use “Edit column -> Add column based on this column” to create another new column “rID.” Use the GREL function “cross()” with the following expression:

cell.cross('sample4 csv', 's')[0].cells['sID'].value

The cross() function is designed to match column contents across different projects (e.g., different csv files). By passing the name of the project we are currently working on, we can use cross() for our purpose: matching column values within a single csv file.

The expression cell.cross('sample4 csv', 's')[0].cells['sID'].value works in three stages. First, it treats the columns “s” and “sID” as bound together (imagine key-value pairs in a dictionary). Second, it matches each value in a third column, “r,” against “s.” Third, it writes the “sID” of the first matching row into a fourth column, “rID,” following the matching chain “r-s-sID.” Result (after sorting by time):

Final result: Senders and receivers are coded as '1,' '2,' '3,' and '4.'
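The matching step can be pictured with a small Python sketch. The cross() function below is a hypothetical stand-in that only imitates the idea of returning all rows whose column equals a given value; it is not OpenRefine's implementation, and the rows are made up for illustration.

```python
def cross(value, rows, column):
    """Return every row whose `column` equals `value` (a rough stand-in for GREL's cross())."""
    return [row for row in rows if row[column] == value]

# Hypothetical rows after the "sID" column has been added.
rows = [
    {"t": 1, "s": "a", "sID": 1, "r": "b"},
    {"t": 2, "s": "b", "sID": 2, "r": "a"},
    {"t": 3, "s": "a", "sID": 1, "r": "c"},
    {"t": 4, "s": "c", "sID": 3, "r": "a"},
]

for row in rows:
    matches = cross(row["r"], rows, "s")
    # Take the first match, as [0] does in the expression; None if there is no match.
    row["rID"] = matches[0]["sID"] if matches else None

r_ids = [row["rID"] for row in rows]
print(r_ids)
```

Taking the first match is safe here because every row with the same “s” carries the same “sID.” In this sketch a receiver with no matching sender gets None, which is where the subset assumption comes into play.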

Feature image’s source: map by Martin Magdinier.

References

Butts, C. (2015). Package ‘relevent’ [R package documentation]. Retrieved from https://cran.r-project.org/web/packages/relevent/relevent.pdf
