February 2, 2021 -- A novel machine-learning technique leverages gene expression data to improve drug repurposing and can even predict interactions between drug candidates and targets based on incomplete data. The framework, which was described in Nature Machine Intelligence on February 1, was applied to drug repurposing for COVID-19 and generated potential lead compounds that are in line with clinical evidence.
Deep-learning applications have advanced drug screening efforts, but often the readout of a single-protein modulation by a chemical is poorly correlated with organism-level therapeutic effects. Phenotype-based screening is used for identifying cell-active compounds but is a low-throughput technique, which makes target deconvolution difficult. Alternatively, gene expression profiling is one effective way to characterize cellular and organism-level phenotypes and can be applied to drug repurposing efforts.
Researchers at Ohio State University designed a mechanism-driven, neural network-based method called DeepCE (pronounced "deep sea"). It models high-dimensional associations among biological features, as well as nonlinear relationships between those features and outputs, to predict the gene expression profile a cell line will show when exposed to a new chemical compound.
The researchers used L1000, a genome-wide database of chemical-induced gene expression developed by the U.S. National Institutes of Health (NIH) Library of Integrated Network-Based Cellular Signatures (LINCS) program. The dataset consists of approximately 1,400,000 gene expression profiles recording the responses of around 50 human cell lines to about 20,000 compounds across a range of concentrations.
The model uses a graph convolutional network to automatically extract chemical substructure features from data. It also uses an attention mechanism, which selectively weights the most relevant inputs, to capture associations among chemical substructures and genes and among genes in cell lines.
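To make the graph convolution idea concrete, here is a minimal sketch of a single, simplified graph-convolution step over a toy molecule. It is not DeepCE's actual architecture, which the article does not detail: the function name, the mean-aggregation rule, the toy adjacency matrix, and the identity weight matrix are all assumptions chosen for illustration, and the attention step is not shown.

```python
def graph_conv_layer(adjacency, features, weights):
    """One simplified graph-convolution step (illustrative only).

    Each atom averages its own features with those of its bonded
    neighbours, applies a linear map, then a ReLU.
    """
    n_atoms = len(adjacency)
    n_in = len(features[0])
    n_out = len(weights[0])
    updated = []
    for i in range(n_atoms):
        # Neighbourhood = bonded atoms plus the atom itself (self-loop).
        hood = [j for j in range(n_atoms) if adjacency[i][j] or j == i]
        aggregated = [
            sum(features[j][k] for j in hood) / len(hood) for k in range(n_in)
        ]
        transformed = [
            sum(aggregated[k] * weights[k][m] for k in range(n_in))
            for m in range(n_out)
        ]
        updated.append([max(0.0, v) for v in transformed])
    return updated


# Hypothetical 3-atom chain (atom 0 - atom 1 - atom 2).
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Made-up one-hot atom types: [is_carbon, is_oxygen].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Identity weights, so the output shows the aggregation step directly.
weights = [[1.0, 0.0], [0.0, 1.0]]

substructure_features = graph_conv_layer(adjacency, features, weights)
```

After one step, each atom's vector already mixes in information about its bonded neighbours. Stacking several such layers lets each atom's features describe progressively larger substructures, which is the general motivation for using graph convolutions on molecules.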
While the L1000 dataset is one of the most advanced of its kind, it is missing many values that matter for screening potential drug candidates. Being able to predict gene expression values for unmeasured and unreliable experiments in the dataset could help overcome the limitations of the database and make it more useful for drug screening.
Therefore, the researchers also proposed a data augmentation method by which they could extract useful information from unreliable experiments in L1000 to improve the prediction performance of their model.
The team found that the DeepCE model considerably outperformed other models, including linear models, a vanilla neural network, k-nearest neighbor (kNN), and tensor-train weight optimization (TT-WOPT).
"The output demonstrates multitask learning - we can predict gene expression values for new chemicals not from one cell to one cell, but automatically predict the role of a drug on different cell lines and different genes," said Ping Zhang, PhD, assistant professor of computer science and engineering and biomedical informatics at Ohio State University, in a statement. "We can use the computer to simulate drug-induced gene expression. This provides real value."
Drug repurposing for COVID-19
"The story should stop here -- this is where we were during spring break. But then COVID-19 arrived, and we hoped our research could help, so we did a special case study for COVID-19 drug repurposing," explained Zhang.
The team used DeepCE, trained on the high-quality part of the L1000 dataset, to generate predicted gene expression profiles for all 11,179 drugs in the DrugBank database at the largest chemical dosage. DrugBank is an open-access database that contains chemical structures and other details on approved and investigational drugs.
"Based on the known gene expression changes that have occurred and been identified with known drugs, we apply that to the gene expression in question - in this case, compounds that are being studied but are not yet experimented in L1000," said Zhang.
The researchers generated patient gene expression profiles by using SARS-CoV-2 gene expression datasets from the National Genomics Data Center (NGDC) and National Center for Biotechnology Information (NCBI) to calculate the differential gene expression profiles of patients under population- and individual-based settings. They focused on lung and airway cell lines for the COVID-19 analysis.
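As a rough illustration of the difference between the two settings, the sketch below computes a differential profile as the per-gene difference between mean patient expression and mean control expression. The article does not say how the researchers computed differential expression, so the subtraction of means, the function names, and the toy numbers (assumed to be log-scale expression values) are all assumptions.

```python
def mean_profile(samples):
    """Average expression per gene across a list of samples."""
    n_samples = len(samples)
    return [sum(gene_values) / n_samples for gene_values in zip(*samples)]


def differential_profile(disease_samples, control_samples):
    """Per-gene difference: mean disease expression minus mean control."""
    disease_mean = mean_profile(disease_samples)
    control_mean = mean_profile(control_samples)
    return [d - c for d, c in zip(disease_mean, control_mean)]


# Made-up log-scale expression values for three genes.
controls = [[5.0, 3.0, 1.0], [5.0, 5.0, 1.0]]
patients = [[7.0, 4.0, 0.0], [9.0, 2.0, 1.0]]

# Population-based setting: all patients pooled against all controls.
population_profile = differential_profile(patients, controls)

# Individual-based setting: each patient compared against the controls.
individual_profiles = [differential_profile([p], controls) for p in patients]
```

In this toy example the population profile is one vector for the whole cohort, while the individual setting yields one vector per patient, so a drug could in principle be matched to a single patient's expression changes.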
They then screened the drugs in DrugBank by computing correlation scores between each drug's predicted gene expression profile and the patient gene expression profiles, and selected the drugs with the most negative scores as potential candidates.
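The screening step can be sketched in a few lines. The example below assumes Pearson correlation as the score, which the article does not specify, and uses made-up drug names and expression values; `pearson` and `rank_drugs` are hypothetical helper names, not part of DeepCE.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_drugs(drug_profiles, patient_profile, top_k=10):
    """Return the top_k (drug, score) pairs, most negative score first."""
    scores = [
        (name, pearson(profile, patient_profile))
        for name, profile in drug_profiles.items()
    ]
    scores.sort(key=lambda pair: pair[1])
    return scores[:top_k]


# Made-up differential expression for four genes in patients.
patient = [2.0, -1.0, 0.5, -0.5]

# Made-up predicted drug-induced profiles for the same four genes.
drugs = {
    "drug_A": [-2.0, 1.0, -0.5, 0.5],  # mirrors the disease signature
    "drug_B": [2.0, -1.0, 0.5, -0.5],  # mimics the disease signature
    "drug_C": [0.5, 0.5, -1.0, 0.0],   # unrelated to the disease signature
}

ranked = rank_drugs(drugs, patient, top_k=2)
```

Here `drug_A` ranks first because its profile is the exact opposite of the patient profile, which is the pattern the screening looks for: a drug whose expression signature reverses the disease signature.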
"We put such predicted 'drug signatures' against the COVID-19 patient profiles on a population level," explained Zhang. "Once you can identify both signatures, the job is easy. Wherever we find the disease and a drug show opposite gene expression profiles, suggesting the drug would reverse the effects of the disease, you have found a drug that may treat the disease."
Among the ten drugs they identified for population-based analysis, three drugs are antiviral drugs used in hepatitis C treatment (faldaprevir, alisporivir, and NIM811) and two drugs are immunosuppressive agents (voclosporin and cyclosporine).
"I want to put together a research agenda using all the different data resources for drug repurposing and drug-disease associations from multiple perspectives and connect with researchers who can collaborate with us to find new drugs for diseases - including unknown diseases," explained Zhang.