Faculty Support of Open Data: An Interview with Sergei Pond

This week is Open Access Week, a yearly international celebration that aims to increase awareness about open access. Most academic work is locked up behind a paywall, available only to those who are affiliated with a college or university. Open access scholarship is completely free to read and reuse.

Professor of Evolutionary Genomics Sergei Pond is one of the many Temple faculty members who support open research practices. Pond recently spoke with Librarian Sarah Jones to discuss his work and his thoughts on how open data can help with the current reproducibility crisis.

Tell us about your recent research on COVID-19. What role did open data play in your work? 

My group at Temple (Institute for Genomics and Evolutionary Medicine) is a computational biology group. We use sequence data to watch what the virus is doing and evaluate how certain intervention efforts are going. Sequence data have never been generated at a faster pace than during the COVID-19 outbreak. As of today, around 130,000 SARS-CoV-2 genomes are available. In March there were 500. The rate of accumulation is really remarkable.

We’ve done a lot of collaborative work to look at the early evolution of SARS-CoV-2. Viral genomes change all the time; the trick is to figure out which changes matter and which ones don’t. At this moment, there are some changes but none that appear to be particularly important. Once we start giving it something to work against, like large-scale drugs and vaccines, then we’ll watch it.

Something the public may not appreciate is that you have to do a lot of tedious work to make sure that the data you’re analyzing makes sense. You have to clean it up and make sure that your tools run fast enough. One of the issues everyone has run across is the volume of data: typically you’re talking about hundreds or maybe thousands of sequences, but tens or hundreds of thousands brings it up to a different scale. One of the issues with these large datasets is that they’re so big and the techniques that you use tend to be fairly complicated, so it turns into this hard-to-interpret black box. We’re trying to design something that’s easy to understand.

You recently published an article on the lack of data sharing in COVID-19 research. What problems do you see this causing? Which open tools and practices would you like to see adopted? 

Ideally, what you would like to be able to access are the original files that came off the sequencer. Typically what you see is the final genome; it’s a product of many steps that translate these data from raw sequencing data to genomes. It’s been the bane of computational biology that it’s not very common to share the original data. More importantly, it is next to impossible to find sufficient detail about how people went about processing these data to generate the genome. So basically you receive a genome but you’re missing how it was assembled. This is what creates the crisis of reproducibility. You have to be able to trust the data that you’re putting into your analyses.

There’s absolutely no excuse with modern tool availability not to publish the entire chain. If you’re an experimental scientist and you don’t publish your lab protocol nobody will believe it; it has to be recreatable. But in computational analysis there’s no standard like this. There’s no expectation that you will release the data and the tools that you used to analyze these data.

Tell us about your work with the Galaxy Project. How does this platform encourage open practices? 

Galaxy is a computational framework for open data and democratizing data analysis. Every step from extracting raw data to doing comparative analysis can be done in Galaxy. I think the strongest aspect of it is the longstanding focus on the reproducibility and shareability of research. When you develop a process for doing something, you can publish it and share it. It will record which tools you used, which settings, and how they were connected to each other. Each step you can store and share, so when you publish your work you instantly release your entire workflow.

Are there any misconceptions that you would like to address regarding open data? What do you wish people knew about it? 

There are a few things that tend to slow down or prevent people from doing open data and sharing. One, the logistics of it: will you find the time to annotate and format everything correctly and submit it? That excuse is becoming harder to use because there are large entities that have databases and tools that allow you to do this as easily as possible.

The other issue is data ownership. If you release open data it will be good for science, it will be good for discovery, and it will enable other people to extract more information from it. But as a data producer, how do you get proper credit for it? As a scientist you get rewarded for publishing papers and bringing in grants, but for being a good citizen of the open community, it’s not there.

I want to mention the idea of privacy. Human genomic data is personal health information, which needs to be guarded and protected. Viral data are a little different; you can’t track them down to specific individuals. But nonetheless, that could be a concern. It definitely is a concern in the area of HIV because in many jurisdictions in the United States HIV transmission is still a felony. That’s changing, but it’s still there. You don’t want to have a potential disclosure of an infection route.

Is there anything else you wanted to share? 

I want to emphasize that SARS-CoV-2 genomics has been a unique effort when it comes to collaboration and open science. It’s not ideal and we can improve on it, but compared to previous outbreaks this is probably the most open environment that we’ve had. It’s obviously a necessity, considering how much damage this pandemic has already caused. A truly international, truly open effort is necessary.

Fortunately, a lot of it was set up prior to SARS-CoV-2. There are people that took the time and strategically thought about how they could accelerate all of the necessary steps when the next big pathogen came out, and we’re reaping the benefits now. That really is important and worth emphasizing. That would not have happened without planning.

Thank you, Dr. Pond!