November 12, 2019 -- Modern-day technology has allowed scientists to analyze highly complex genomic data sets. However, this brings increased challenges associated with reproducibility and the misinterpretation of results from massive data sets. One such technology, RNA sequencing (RNA-seq), can measure the expression levels of all genes in a sample simultaneously in a single test.
New research from an international group of scientists, published in PLOS on November 12, suggests that there is a frequent bias generated by RNA-seq technology.
Biology has been transformed by revolutionary technology that brings systems-level analysis to the benchtop. RNA-seq is a widely used technique in biology and biomedical research for identifying transcriptional drivers of biology, and it plays a key role in disease identification and diagnosis.
A key component of RNA-seq processing is data normalization. This process ensures that technical bias is minimal and allows for estimation and detection of differences in a given sample. In this study, the team analyzed publicly available RNA-seq datasets, which had been normalized and then analyzed based on a specific function. They found that sample-specific length effects have a greater impact on expression measurements than currently documented. This could be problematic when comparing the expression level of a gene between samples.
Puzzled by this recurring pattern, the authors then asked whether it reflects some universal biological response common to many different triggers or rather stems from an experimental artifact. To tackle this question, they compared replicate samples from the same biological condition. Differences in gene expression between replicates can reflect technical effects that are not related to the experiment's biological factor of interest. Unexpectedly, the same pattern, in which particularly short genes (such as ribosomal protein genes) or long genes (like extracellular matrix genes) showed changes in expression level, was also observed in the comparisons between replicates. This demonstrates that the pattern is the result of a technical bias that appears to be coupled with gene length. These genes were found to be especially prone to false-positive results.
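For readers who want to try the replicate comparison on their own data, the idea can be sketched in a few lines of Python. This is an illustrative sketch on simulated data, not the authors' pipeline: the function names (`length_bias`, `simulate`), the simulated gene lengths and expression values, and the choice of a simple Pearson correlation are all assumptions made for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def length_bias(lengths, expr_a, expr_b, pseudo=1.0):
    """Correlate log2(B/A) between two replicates with log10 gene length.

    Replicates share the same biology, so a correlation far from zero
    points to a technical effect coupled with gene length.
    """
    log_ratio = [math.log2((b + pseudo) / (a + pseudo))
                 for a, b in zip(expr_a, expr_b)]
    log_len = [math.log10(length) for length in lengths]
    return pearson(log_len, log_ratio)

def simulate(n_genes=2000, slope=0.0, seed=0):
    """Hypothetical data: two replicates, with an optional length effect in B."""
    rng = random.Random(seed)
    # Gene lengths spread from roughly 300 bp to 100 kb (assumed range).
    lengths = [10 ** rng.uniform(2.5, 5.0) for _ in range(n_genes)]
    base = [10 ** rng.uniform(1.0, 3.0) for _ in range(n_genes)]
    rep_a = [b * 2 ** rng.gauss(0.0, 0.2) for b in base]
    # slope > 0 inflates long genes and deflates short genes in replicate B.
    rep_b = [b * 2 ** rng.gauss(0.0, 0.2)
             * 2 ** (slope * (math.log10(length) - 3.75))
             for b, length in zip(base, lengths)]
    return lengths, rep_a, rep_b

unbiased = length_bias(*simulate(slope=0.0))
biased = length_bias(*simulate(slope=0.3))
print(f"no length effect:   r = {unbiased:.2f}")
print(f"with length effect: r = {biased:.2f}")
```

In this toy setup, the replicate pair without a length effect should give a correlation close to zero, while the pair with a simulated length effect gives a clearly positive one. On real data, the same check would use normalized expression values and annotated gene lengths in place of the simulated lists.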
Overall, the researchers determined that length bias inherent in RNA-seq datasets, combined with flawed normalization during statistical analysis, can lead to specific biological functions being falsely identified as cellular responses to the conditions tested in a given dataset. As concerns over reproducibility grow in the scientific community, this timely work emphasizes the importance of proper statistical analysis.