As you know from my previous post, I decided to gather data from the University of Florida’s Samuel Proctor Oral History Program. Specifically, I have gathered data from the interviews conducted with Lumbee Indians who were living in urban centers, primarily in Baltimore, Maryland. Since I created the dataset myself from interview transcripts, I found that my data was in great need of cleaning. Cleaning is the process of finding and correcting incorrect or incomplete data and resolving any typographical errors. In order to do this quickly and effectively, I used a free program called OpenRefine.
When cleaning, I tried to focus on consistency and uniformity, especially when it came to geographic locations. Ensuring consistency makes the data easier to use for further projects. The process of deciding what to clean and how to clean it was difficult for this project. Because I was extracting data from oral history interviews, I had to work with the information that was given. In some cases, I did not find the data I was looking for in every interview. For example, while some individuals provided their ages, others did not. However, if the interviewee provided their birth date, I took the liberty of calculating their age based on the year the interview was conducted.
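To make these two steps concrete, here is a minimal sketch, written in Python with pandas rather than in OpenRefine itself, of the kind of cleaning described above: collapsing variant spellings of a place name into one canonical form, and deriving an age from a birth year and the year of the interview. The column names and sample values are hypothetical stand-ins, not rows from my actual dataset.

```python
import pandas as pd

# Hypothetical columns and values for illustration only.
df = pd.DataFrame({
    "interviewee": ["Person A", "Person B", "Person C"],
    "location": ["baltimore, md", "Baltimore, Maryland ", "Baltimore"],
    "birth_year": [1942, None, 1935],
    "interview_year": [1973, 1973, 1972],
})

# Collapse variant spellings of the same place into one canonical form,
# so "baltimore, md", "Baltimore, Maryland", and "Baltimore" all match.
location_map = {
    "baltimore, md": "Baltimore, Maryland",
    "baltimore, maryland": "Baltimore, Maryland",
    "baltimore": "Baltimore, Maryland",
}
normalized = df["location"].str.strip().str.lower().map(location_map)
df["location"] = normalized.fillna(df["location"])

# Derive age at the time of the interview when a birth year is available;
# rows without a birth year simply stay blank.
df["age_at_interview"] = df["interview_year"] - df["birth_year"]

print(df)
```

OpenRefine handles the first step interactively through its clustering and text-transform features, but the logic is the same: decide on one canonical spelling and map every variant onto it.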
The lack of certain information was indeed frustrating in the cleaning process. Because the interviewers were inconsistent in the questions they asked, the data I was able to provide was patchy in some categories. I cleaned my data in a way that would allow me to have as complete a spreadsheet as possible, even if this meant adding or taking away categories that I originally intended to include. The downside is that I have made decisions that will affect the way I interpret and draw conclusions from the data. Manipulating and cleaning data provides the opportunity to create silences in the narrative.