Is there a reproducibility problem with RNA-seq data?

By Samantha Black, PhD, ScienceBoard editor in chief

November 12, 2019 -- Modern-day technology has allowed scientists to analyze highly complex genomic data sets. However, this brings increased challenges associated with reproducibility and the misinterpretation of results from massive data sets. One such technology, RNA sequencing (RNA-seq), can measure the expression levels of all genes in a sample in a single test.

New research from an international group of scientists, published in PLOS on November 12, suggests that there is a frequent bias generated by RNA-seq technology.

Biology has been transformed by revolutionary technology that brings systems-level analysis to the benchtop. RNA-seq is widely used in biology and biomedical research to identify transcriptional drivers of biology, and it plays a key role in disease identification and diagnosis.

A key component of RNA-seq processing is data normalization. This process keeps technical bias to a minimum and allows differences between samples to be estimated and detected. In this study, the team analyzed publicly available RNA-seq datasets, which had been normalized and then analyzed based on a specific function. They found that sample-specific length effects have a greater impact on expression measurements than currently documented. This could be problematic when comparing the expression level of a gene between samples.
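To make the idea of normalization concrete, here is a minimal sketch of two widely used schemes: counts per million (CPM), which corrects only for sequencing depth, and transcripts per million (TPM), which also divides by gene length. The article does not say which normalization the analyzed datasets used, so this is an illustration rather than the study's method; the function names and toy numbers are assumptions made for the example.

```python
def counts_per_million(counts):
    """Scale raw read counts so that the sample sums to one million."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c * 1_000_000 / total for c in counts]


def transcripts_per_million(counts, lengths_bp):
    """Divide counts by gene length in kilobases, then rescale to one million."""
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no reads")
    return [r * 1_000_000 / total for r in rates]


# Hypothetical toy sample: three genes with made-up counts and lengths.
counts = [100, 300, 600]
lengths_bp = [500, 1500, 6000]

cpm = counts_per_million(counts)
tpm = transcripts_per_million(counts, lengths_bp)

for name, c, t in zip(["short", "medium", "long"], cpm, tpm):
    print(f"{name:>6}: CPM={c:,.0f}  TPM={t:,.0f}")
```

In this toy sample the long gene dominates under CPM but not under TPM, which shows how strongly the treatment of gene length can shape the apparent expression level.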

Puzzled by this recurring pattern of length-dependent expression changes, the authors then asked whether it reflects some universal biological response common to many different triggers or instead stems from an experimental artifact. To tackle this question, they compared replicate samples from the same biological condition. Differences in gene expression between replicates can reflect technical effects that are not related to the experiment's biological factor of interest. Unexpectedly, the same pattern, in which particularly short genes (such as ribosomal protein genes) or long genes (like extracellular matrix genes) showed changes in expression level, was also observed in the comparisons between replicates. This demonstrated that the pattern is the result of a technical bias that appears to be coupled with gene length. These genes were found to be especially prone to false-positive results.
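The replicate comparison described above can be sketched as a simple diagnostic: compute the log fold change of each gene between two replicates of the same condition and check whether it tracks gene length. This is a hedged illustration, not the authors' actual analysis; the pseudocount, the use of a Pearson correlation against log10 length, and all names and numbers below are assumptions chosen for the example.

```python
import math


def log2_fold_change(rep_a, rep_b, pseudocount=1.0):
    """Per-gene log2 ratio of replicate B over replicate A."""
    return [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(rep_a, rep_b)
    ]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_bias(rep_a, rep_b, lengths_bp):
    """Correlation between log10 gene length and replicate log fold change."""
    lfc = log2_fold_change(rep_a, rep_b)
    log_len = [math.log10(length) for length in lengths_bp]
    return pearson(log_len, lfc)


# Hypothetical toy data: replicate B over-represents long genes.
lengths_bp = [400, 800, 2000, 5000, 12000]
rep_a = [200, 200, 200, 200, 200]
rep_b = [150, 180, 200, 230, 270]

r = length_bias(rep_a, rep_b, lengths_bp)
print(f"length vs. replicate fold change: r = {r:.2f}")
```

Because replicates share the same biology, a correlation far from zero in a check like this points to a technical, length-coupled effect rather than a real cellular response.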

Overall, the researchers determined that length bias inherent in RNA-seq datasets, combined with flawed normalization during statistical analysis, can lead to specific biological functions being falsely identified as cellular responses to the conditions tested in a given dataset. As concerns over reproducibility grow in the scientific community, this timely work emphasizes the importance of proper statistical analysis.


Copyright © 2019