New algorithm helps integrate single-cell data from around the globe

By Samantha Black, PhD, ScienceBoard editor in chief

April 20, 2021 -- A new algorithm enables researchers from around the globe to integrate multiple single-cell datasets from a variety of omics platforms in a quick and efficient process that can be done on standard computers. The technology, described in an April 19 Nature Biotechnology article, will help speed collaborative cell cataloging projects, such as the Human Body Map and Human Cell Atlas.

High-throughput single-cell sequencing technologies have enabled researchers to profile cell types based on gene expression, chromatin accessibility, and DNA methylation state. These approaches generate enormous amounts of data, classifying hundreds of thousands to millions of individual cells. In most cases, the datasets are not designed to integrate multiple modalities or incorporate new data.

A new approach to single-cell data

To address this limitation, a team of University of Michigan researchers, led by Joshua Welch, PhD, and doctoral candidate Chao Gao, developed an online integrative non-negative matrix factorization (online iNMF) algorithm, which allows scalable, iterative integration of single-cell datasets generated by different omics technologies.

"Our technique allows anyone with a computer to perform analyses at the scale of an entire organism," Welch said in a statement. "That's really what the field is moving towards."

The new algorithm is an extension of the team's recently published linked inference of genomic experimental relationships (LIGER) method. LIGER infers a set of latent factors ("metagenes") that represent the same biological signals in multiple datasets while also retaining the ways in which these signals differ across datasets. The shared and dataset-specific factors are then used jointly to identify cell types and states while preserving cell type-specific differences in the metagene features that define cell identities.
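As a rough illustration of how shared and dataset-specific factors fit together, the sketch below computes an integrative NMF objective of the kind LIGER is generally described as minimizing. The function name, the variable names (E for a cells-by-genes expression matrix, H for cell loadings, W for shared metagenes, V for dataset-specific metagenes), and the penalty weight lam are assumptions made for this example, not details taken from the article or the authors' code.

```python
import numpy as np

def inmf_objective(datasets, H_list, W, V_list, lam=5.0):
    """Illustrative iNMF loss (hypothetical helper, not the LIGER API).

    For each dataset i, with E_i (cells x genes), H_i (cells x k),
    shared metagenes W (k x genes) and dataset-specific V_i (k x genes):
        ||E_i - H_i (W + V_i)||_F^2 + lam * ||H_i V_i||_F^2
    The second term penalizes the dataset-specific part, so signal is
    pushed into the shared factors W wherever the datasets agree.
    """
    total = 0.0
    for E, H, V in zip(datasets, H_list, V_list):
        total += np.linalg.norm(E - H @ (W + V), "fro") ** 2  # reconstruction error
        total += lam * np.linalg.norm(H @ V, "fro") ** 2      # dataset-specific penalty
    return total
```

A larger lam, under this formulation, makes dataset-specific differences more costly and so favors a more strongly shared representation.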

In the new algorithm, "online learning" does not refer to the internet; rather, it is a technical term that denotes calculations that are performed iteratively and incrementally as new datasets become available. Taken together, these properties allow online iNMF to integrate data scalably and efficiently, with fixed memory usage and the ability to incorporate new data without starting from scratch.

The authors point to two important advantages of the technology. First, online iNMF allows for the integration of large single-cell multiomic datasets by cycling through the data multiple times in small "minibatches." Second, the approach allows for the integration of continually arriving datasets, where the entire dataset is not available at any point during training.
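The minibatch idea can be sketched in a few lines of Python. The example below is a deliberately simplified stand-in: it fits ordinary non-negative matrix factorization (without the shared and dataset-specific split of iNMF) and uses multiplicative updates, which may differ from the authors' solver. The class name OnlineNMF and the method partial_fit are hypothetical names chosen for this illustration. The point it shows is that only two small summary matrices are carried between minibatches, so memory use does not grow with the number of cells.

```python
import numpy as np

class OnlineNMF:
    """Plain NMF (X ~ H @ W) fitted one minibatch at a time (illustrative only)."""

    def __init__(self, n_genes, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.random((k, n_genes)) + 0.1  # metagenes (k x genes)
        self.A = np.zeros((k, k))                # running sum of H.T @ H
        self.B = np.zeros((k, n_genes))          # running sum of H.T @ X

    def _cell_loadings(self, X, n_iter=50):
        # Solve for nonnegative H with W held fixed (multiplicative updates).
        H = np.full((X.shape[0], self.W.shape[0]), 0.5)
        WWt = self.W @ self.W.T
        XWt = X @ self.W.T
        for _ in range(n_iter):
            H *= XWt / (H @ WWt + 1e-10)
        return H

    def partial_fit(self, X, n_iter=20):
        """Fold in one minibatch (cells x genes); the raw cells need not be kept."""
        H = self._cell_loadings(X)
        # Fixed-size summaries of everything seen so far.
        self.A += H.T @ H
        self.B += H.T @ X
        # Update W using only the summaries, never the earlier raw data.
        for _ in range(n_iter):
            self.W *= self.B / (self.A @ self.W + 1e-10)
        return H
```

In this sketch, a dataset that arrives later is handled exactly like another minibatch: it is passed to partial_fit, and the metagenes are refined without reloading any earlier cells.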

The team showed that the algorithm can be used to integrate different types of omics datasets. For instance, they used online iNMF to integrate single-cell RNA sequencing (Slide-seq) and spatial transcriptomics (MERFISH) datasets. Integration allows for deeper, transcriptome-wide data, the authors noted.

To provide further proof of concept, the team used datasets from the National Institutes of Health's BRAIN Initiative, a project aimed at understanding the human brain by mapping every cell. Typically, for this type of collaborative project, each single-cell dataset that is submitted must be reanalyzed with the previous datasets in the order they arrive. A major benefit of the technology is the ability to incorporate new data points as they become available, a particularly useful capability for large, collaborative efforts to construct comprehensive cell atlases, such as the BRAIN Initiative.

The new approach allows researchers to add new datasets to existing ones without the need to reprocess older datasets, thereby saving a great deal of computation time. The new algorithm also enables scientists to analyze data in minibatches, which significantly reduces the amount of memory needed for processing. Specifically, the approach incorporated single-cell RNA-seq and single-nucleus RNA-seq data without revisiting previously processed cells. The datasets were iteratively refined with each data upload.

"This is crucial for the sets increasingly generated with millions of cells," Welch explained. "This year, there have been five to six papers with two million cells or more and the amount of memory you need just to store the raw data is significantly more than anyone has on their computer."

Welch likens the technique to the continuous data processing done by social media platforms like Facebook and Twitter, which must process continuously generated data from users and serve up relevant posts to people's feeds. However, instead of tweets, labs submit experimental data for processing.

The approach has the potential to greatly improve efficiency for other ambitious cell atlas projects like the Human Body Map and Human Cell Atlas, which can help the scientific community understand the body's composition of cells and how cells go wrong in disease, Welch noted.
