Algorithm detects unknown anomalies in RNA-seq data

By Samantha Black, PhD, ScienceBoard editor in chief

November 27, 2019 -- Computational biologists from Carnegie Mellon University (CMU) have developed a new computational approach to analyzing gene expression data. The work was published in Cell Systems on November 27.

Modern technology has allowed researchers to make great strides in analyzing the massive data sets generated by RNA sequencing (RNA-seq) techniques. While accuracy is often high, there are cases where the technique produces erroneous quantifications. To address this concern, the CMU researchers developed an algorithm that automates the search for anomalies in expression estimates inferred from RNA-seq data. The algorithm also re-examines its own output, identifying mistakes and correcting them.

The system offers several advantages:

  • It requires no known ground truth to discover potential errors.
  • It gives more insight into what is causing a misquantification by identifying the specific regions of specific transcripts where the assumed theoretical model of read coverage does not match what is observed.
  • The anomalies it finds can be used to design better quantification algorithms.
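To make the idea of a region where "the assumed theoretical model of read coverage does not match what is observed" concrete, here is a minimal sketch in Python. It is not the method published by the CMU team. The function name `find_anomalous_regions`, the relative-deviation rule, and the `threshold` value are all assumptions chosen for illustration. It simply compares per-position observed coverage with an expected coverage profile and reports contiguous over- or under-covered spans.

```python
def find_anomalous_regions(observed, expected, threshold=0.5):
    """Return (start, end, kind) spans where observed coverage departs from expected.

    Illustrative only: the relative-deviation rule and threshold are assumptions,
    not the published algorithm. Spans are half-open: positions start..end-1.
    """
    if len(observed) != len(expected):
        raise ValueError("coverage vectors must have the same length")

    def label(obs, exp):
        # Positions with no expected coverage are skipped rather than scored.
        if exp <= 0:
            return None
        deviation = (obs - exp) / exp
        if deviation > threshold:
            return "over"
        if deviation < -threshold:
            return "under"
        return None

    regions = []
    start, current = None, None
    for pos, (obs, exp) in enumerate(zip(observed, expected)):
        kind = label(obs, exp)
        if kind != current:
            # Close the span that just ended, if it was anomalous.
            if current is not None:
                regions.append((start, pos, current))
            start, current = pos, kind
    if current is not None:
        regions.append((start, len(observed), current))
    return regions


# Toy transcript with uniform expected coverage of 10 reads per position.
observed = [10, 10, 30, 30, 10, 2, 2, 10]
expected = [10] * 8
print(find_anomalous_regions(observed, expected))
```

Because the comparison is made against the model's own expectation, a check of this kind needs no known ground truth, which mirrors the first advantage listed above.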

Using this system, the researchers have already identified 88 previously unknown anomalies, regions of unexpectedly high or low levels of gene expression, in two widely used RNA-seq libraries (GEUVADIS and Human Body Map). "We don't yet know why we're seeing those 88 weird patterns," said Carl Kingsford, a professor in CMU's Computational Biology Department, noting that they could be a subject of further investigation.

Anomalies can be important clues for researchers, but until now finding them has been a painstaking, manual process, sometimes called "sequence gazing." Finding one anomaly might require examining 200,000 transcript sequences, Kingsford said. Most researchers, therefore, zero in on regions of genes that they think are important, largely ignoring the vast majority of potential anomalies.

But identifying anomalies is often not clear-cut. Some RNA-seq "reads," for instance, are common to multiple genes and transcripts and sometimes get mapped incorrectly. When that occurs, a genetic region might appear more or less active than expected. The algorithm therefore re-examines any anomalies it detects to see whether they disappear when the RNA-seq reads are redistributed between the genes.
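The re-examination step can be pictured with a toy example. The sketch below is an assumption-laden illustration, not the researchers' algorithm: the function `resolvable_by_reassignment`, its parameters, and the brute-force search are all hypothetical. It takes two transcripts that share some ambiguous reads and asks whether any split of those shared reads brings both transcripts within a tolerance of their expected read counts. If such a split exists, the apparent anomaly can be explained by read misassignment.

```python
def resolvable_by_reassignment(unique_a, unique_b, shared,
                               expected_a, expected_b, tolerance):
    """Return how many shared reads to give transcript A, or None.

    Hypothetical helper for illustration. unique_a and unique_b are reads that
    map only to A or only to B; shared reads could belong to either transcript.
    A split is accepted when both transcripts land within `tolerance` reads of
    their expected counts; the split with the smallest worst-case error wins.
    """
    best = None  # (reads assigned to A, worst-case error)
    for to_a in range(shared + 1):
        to_b = shared - to_a
        err_a = abs(unique_a + to_a - expected_a)
        err_b = abs(unique_b + to_b - expected_b)
        worst = max(err_a, err_b)
        if worst <= tolerance and (best is None or worst < best[1]):
            best = (to_a, worst)
    return None if best is None else best[0]


# The anomaly disappears if 10 of the 50 shared reads go to A and 40 to B.
print(resolvable_by_reassignment(40, 10, 50, 50, 50, tolerance=5))

# No split helps here, so the anomaly would remain unexplained.
print(resolvable_by_reassignment(100, 0, 10, 50, 50, tolerance=5))
```

An anomaly that survives every possible redistribution, as in the second call, is the kind that would be left for further investigation.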

The new algorithm provides researchers with the ability to examine all the transcript sequences and could lead to new discoveries about unknown and unsuspected RNA-seq anomalies. "By correcting anomalies when possible, we reduce the number of falsely predicted instances of differential expression," said Cong Ma, a Ph.D. student in computational biology at CMU.

