November 21, 2019 -- Data science as a discipline has become a central aspect of many commercial endeavors, and a wide range of distinct applications has adopted its approaches and techniques. Despite major differences between these applications, techniques can often be transferred from established data-driven fields to new, completely separate data-intensive areas. For example, certain image analysis algorithms from astronomy can be applied in fields involving microscopy.
One of the big current trends in the life sciences is the parallelized, large-scale study of cells, which has become increasingly common in many fields. This area involves particularly challenging data analysis due to the high volume of data, high dimensionality, high degree of variability, and the spatial and temporal heterogeneity of cells and tissues.
Two distinct applications of data science in large-scale cellular studies are single cell RNA sequencing (scRNA-seq) and high content analysis/screening (HCA/HCS). Both generate huge datasets with large numbers of samples and variables.
Single cell RNA sequencing (scRNA-seq) and its challenges
While most genomics techniques use a mixture of cells as a sample, single cell RNA sequencing (scRNA-seq) provides information about individual cells. Like other types of RNA-seq, it analyzes gene expression levels, or the transcriptome; but as a single cell technology, it offers advantages such as the ability to resolve differences between individual cells that would otherwise be averaged out or lost in a bulk sample.
These can be huge advantages for many reasons; for example, similar cells in a tissue often express different genes at different times, and in cancer samples there is sequence variation between cells at the DNA and RNA levels that would typically be lost in the mixture.
Given that a single cell can yield gigabases of sequence data, experiments that examine significant numbers of cells pose an enormous data analysis challenge. In addition, there is often a need to work with external data sources, comparing results against public datasets to interpret or validate them. One such public resource is the 10X Genomics dataset collection, which includes data from millions of individual cells. These factors mean labs need access to extensive data analysis and data management capabilities.
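As an illustration of working with such external resources, here is a minimal sketch of loading a downloaded 10X-style count matrix for comparison work. It assumes the scanpy library and a hypothetical local path; filter thresholds are illustrative.

```python
# A minimal sketch: pulling a public 10X Genomics count matrix into an
# analysis session for comparison with in-house results. The path below
# is hypothetical; scanpy is assumed to be installed.
import scanpy as sc

# Read a 10X-style matrix directory (matrix.mtx, barcodes.tsv, features.tsv)
adata = sc.read_10x_mtx("data/pbmc_10k/filtered_feature_bc_matrix",
                        var_names="gene_symbols")

# Basic quality filtering before any cross-dataset comparison
sc.pp.filter_cells(adata, min_genes=200)   # drop near-empty barcodes
sc.pp.filter_genes(adata, min_cells=3)     # drop genes seen in very few cells

print(adata)  # AnnData object: cells x genes plus metadata
```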
With scRNA-seq there is also the upstream technical challenge of consistent sample preparation and nucleic acid isolation from individual cells. 10X Genomics provides the datasets mentioned above, but is best known for its single cell technology, now widely used in combination with next-generation sequencing (NGS) for the large-scale genomic analysis of cells due to its low cost per cell and high yield.
Deep learning for scRNA-seq data
Clustering has been one of the most widely used statistical approaches for grouping cells in scRNA-seq studies. One of the challenges with clustering scRNA-seq data, however, is the prevalence of false zero count observations, often called "dropouts" (a problem that also affects other methods to a varying extent). A dropout is a failure to detect an expressed gene; a "true" zero count, in contrast, means the gene really is not expressed in the given cell. Dropouts have a major impact on clustering and are a general problem when working with single cells, where RNA transcripts can be present in low copy numbers. Datasets in which over 70% of values are zero are common, and each of those zeros could be a false negative or a true absence of expression.
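To make that scale of sparsity concrete, here is a minimal sketch of measuring the zero fraction of a count matrix. The data are simulated, since the exact numbers vary by platform and tissue; the Poisson rate and dropout probability are illustrative assumptions.

```python
# A minimal sketch: quantifying how sparse an scRNA-seq count matrix is,
# in line with the observation that >70% zero values are common.
import numpy as np

rng = np.random.default_rng(0)

# Simulate a cells x genes count matrix with heavy zero inflation:
# low-expression Poisson counts, then random "dropout" masking.
true_counts = rng.poisson(lam=0.5, size=(1000, 2000))
dropout_mask = rng.random(true_counts.shape) < 0.3   # 30% chance of dropout
observed = np.where(dropout_mask, 0, true_counts)

zero_fraction = (observed == 0).mean()
print(f"Fraction of zero entries: {zero_fraction:.1%}")
# Note: a zero can be a true absence of expression or a failed detection;
# the raw matrix alone does not distinguish the two.
```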
One recently published clustering approach to this problem, scDeepCluster (with code available on GitHub), uses a model-based deep learning method to simultaneously learn feature representation and clustering. Many other approaches instead use imputation to estimate the missing transcript values; one deep learning method of this type, DeepImpute, was recently shown to perform particularly well against false zero counts and is also freely available on GitHub.
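As a rough illustration of the imputation idea (and not the published DeepImpute or scDeepCluster architectures), a simplified autoencoder can be trained to reconstruct expression profiles while ignoring zero entries, then used to fill them in. The layer sizes and training setup below are assumptions for the sketch.

```python
# A minimal sketch of autoencoder-style imputation of zero counts, much
# simpler than the published methods it gestures at.
import torch
import torch.nn as nn

n_genes = 2000

model = nn.Sequential(
    nn.Linear(n_genes, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),           # compressed representation
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, n_genes), nn.Softplus(),  # non-negative reconstructions
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x):
    """One optimization step; x is a batch of log-transformed counts."""
    recon = model(x)
    # Compute the loss only on non-zero entries, so the model is not
    # trained to reproduce potential dropouts.
    mask = (x > 0).float()
    loss = ((recon - x) ** 2 * mask).sum() / mask.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, model(x) provides estimates for the zero (possibly
# dropped-out) entries, which can be used to fill in the matrix.
```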
Deep learning is an area of data science that has evolved from artificial neural networks (ANNs), which have proven superior to other machine learning algorithms in applications including voice recognition, image recognition, and natural language processing. For image analysis, convolutional neural networks (CNNs) have become a popular architecture; these generally contain several convolution layers and subsampling layers, with the subsampling layers reducing the size of the feature maps. By reducing the number of free parameters to be learned, a CNN typically lowers memory consumption and increases learning speed, as the sketch below illustrates.
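Here is a minimal PyTorch sketch of that convolution-plus-subsampling pattern; the layer sizes and input dimensions are illustrative assumptions, not a prescribed architecture.

```python
# A minimal sketch of the convolution + subsampling pattern described above.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution layer
    nn.ReLU(),
    nn.MaxPool2d(2),                             # subsampling: halves feature maps
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halves them again
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # classifier over 10 classes
)

x = torch.randn(1, 1, 64, 64)                    # one 64x64 grayscale image
print(cnn(x).shape)                              # torch.Size([1, 10])

# Because each pooling layer shrinks the feature maps, the final linear
# layer sees 16*16*16 = 4096 inputs instead of 16*64*64, cutting the
# number of free parameters and the memory footprint substantially.
```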
High content analysis/screening and image analysis
The field of high content analysis/screening (HCA/HCS) also involves highly parallel analysis of cells, but has its own distinct technical challenges that are being addressed by data science. This area has seen continued growth due to factors such as the shortcomings of biochemical screening, the increasing availability of phenotypic and biologically relevant data, and the move towards replacing animal models.
These experiments generally use flow cytometry or high resolution microscopy/imaging systems to examine large numbers of cells. The microscopy techniques rely on specialized software that automatically identifies and characterizes pre-selected features of cells across groups of images. Goals of this research include phenotypic screening of compounds and providing biologically relevant alternatives to animal models.
Typically, the imaging experiments involve large numbers of samples, such as screening thousands of compounds separately, which creates the need for automated systems. There can be dozens to hundreds of features being tracked through the software, using methods like fluorescent labeling. Unfortunately, the software's ability to automate the image analysis steps is often limited; the process can be labor-intensive, creating a bottleneck. For example, detecting a feature may require a "design" step in which the analyst specifies the exact metrics to use in terms of size, color, and intensity, as in the sketch below.
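As a simple illustration of what such a manual "design" step looks like in practice, here is a sketch using scikit-image; the file name and the size and intensity cutoffs are hypothetical, and this does not represent any particular vendor's software.

```python
# A minimal sketch of hand-designed feature detection: segmenting cells
# with analyst-chosen intensity and size cutoffs.
import numpy as np
from skimage import io, filters, measure

image = io.imread("plate01_well_A01.tif", as_gray=True)  # hypothetical file

# Intensity threshold; Otsu's method is a common automated starting point,
# but the analyst often still tunes it per assay.
threshold = filters.threshold_otsu(image)
binary = image > threshold

# Label connected regions and keep only those within a size range the
# analyst has decided corresponds to single cells.
labels = measure.label(binary)
cells = [r for r in measure.regionprops(labels, intensity_image=image)
         if 50 < r.area < 500]

for cell in cells:
    print(cell.area, cell.mean_intensity)  # per-cell size and intensity
```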
Deep learning for HCA/HCS
Genedata, a bioinformatics solutions company headquartered in Basel, Switzerland, first publicly demonstrated Genedata Imagence®, a high content screening image analysis workflow based on deep learning, in October 2018. The ANN-based technology eliminates most of the manual steps in image analysis: the end-user defines the phenotypes, and the system essentially learns by itself which features matter and how much. The result is a trained neural network that can be applied to subsequent assays automatically; the company has also demonstrated that the same training can be transferred to new experimental conditions within minutes, without human involvement.
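The sketch below is not the Genedata Imagence implementation, which is proprietary; it only illustrates the general idea of reusing a network trained on user-defined phenotype classes to classify images from a new assay. The model file and class names are hypothetical.

```python
# NOT the Genedata Imagence implementation; a generic sketch of applying
# a previously trained phenotype classifier to a new assay.
import torch

# Hypothetical: a network previously trained on user-labeled phenotype images
model = torch.load("phenotype_classifier.pt")
model.eval()

phenotypes = ["untreated", "apoptotic", "mitotic"]  # illustrative classes

def classify_well(image_batch):
    """Assign each cell image from a new assay to a learned phenotype class."""
    with torch.no_grad():
        logits = model(image_batch)
        return [phenotypes[i] for i in logits.argmax(dim=1).tolist()]
```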
The Genedata Imagence software reduces image analysis times, increases data quality, and improves the reproducibility of results; it also allows scientists without image analysis expertise to set up and analyze screens. It was developed with pharmaceutical partners, with the goals of reducing human bias, enabling closer examination of specific cell biology, and helping scientists understand the phenotypic space more intuitively. The software presents the end-user with user-friendly maps of the phenotypic space within minutes of loading image data into the system, allowing subsequent identification of phenotype classes and refinement of the deep learning model for future analyses of the assay.
The new AI-based software reduces the process to hours rather than the days or weeks typically needed. In a survey of labs using HCA/HCS systems, end-users ranked "image analysis software" as the second most important feature when purchasing a system, after "sensitivity/resolution." In April 2019, the Bio-IT World Conference & Expo gave the Genedata Imagence software its "Best of Show" award.
"Pharma R&D is becoming overwhelmed by a tsunami of complex phenotypic assays in early drug discovery, which require complex image analysis procedures and time-consuming setups that are done manually, prone to some degree of error, and simply do not scale to the demands of today's pharma research," said Dr. Stephan Stiegle, head of science at Genedata. "This issue was the major catalyst for the development of Genedata Imagence, which capitalizes on the power of deep learning to automate and optimize high-content image analysis in early drug discovery."
"Our pharma customers view Genedata Imagence as a game-changer in the truest sense of the term as the solution's deep learning-based approach allows pharma organizations to broadly implement and automate phenotypic screens to drive faster delivery of R&D projects while reducing development cycle times and corresponding costs -- producing quality results from a new experiment in a matter of seconds vs. the weeks historically required by manual optimization."
Pathways, Regulatory Networks for Interpretation, Context
The use of data science and machine learning extends from the types of experiments described above into many other disciplines of life science research. A common goal is to integrate an experiment's findings with external data, or with distinctly different types of data, to generate higher-level information. For example, many genomics, proteomics, cellular analysis, and other large-scale studies need to interpret their results in the context of cellular pathways and signaling, and various machine learning tools are available to integrate this regulatory network and pathway information. More developments are anticipated as these areas of data science continue to see accelerating technological advancement.