AI uncovers the genome's hidden regulatory code

By Samantha Black, PhD, ScienceBoard editor in chief

February 19, 2021 -- A neural network trained on high-resolution maps of protein-DNA interactions can uncover how these sequences are organized to regulate genes, revealing a hidden regulatory code. Findings from use of the artificial intelligence (AI) model were published in Nature Genetics on February 18.

One of the great unsolved mysteries of genomics is the genome's second code. The precise rules of the cis-regulatory code, which is carried by regions of noncoding DNA that regulate the transcription of neighboring genes, remain unclear. The code is read by transcription factors that bind to short stretches of DNA called motifs.

Gene enhancers are stretches of DNA that, when bound by proteins, increase the likelihood that a particular gene will be transcribed. The short sequence motifs within them are critical for the binding of sequence-specific transcription factors, but how motif combinations and their arrangements influence transcription factor binding in vivo is not well understood.

Experimental manipulations, such as mutations or synthetic designs, have provided some evidence of specific motif arrangements, referred to by the authors as syntax. However, syntax rules and patterns are difficult to identify with genome-wide analyses.

Neural networks can learn flexible, predictive models that capture de novo sequence motifs in complex, multilayer data without making strong biological assumptions. However, the complexity of the models makes them challenging to interpret. Existing models are also limited by low resolution and an inability to detect transcription factor cooperativity (including indirect binding).

Now, an interdisciplinary team of biologists and computational researchers led by Julia Zeitlinger, PhD, of the Stowers Institute for Medical Research and Anshul Kundaje, PhD, of Stanford University has designed a convolutional neural network -- named Base Pair Network (BPNet) -- that can be interpreted to reveal the regulatory code by predicting transcription factor binding from DNA sequences with unprecedented accuracy.

BPNet can uncover the genome's regulatory code

The researchers trained the model on ChIP-nexus data (chromatin immunoprecipitation experiments with nucleotide resolution through exonuclease, unique barcode, and single ligation) from embryonic stem cells to achieve modeling at the highest possible resolution. The increased resolution allowed them to develop interpretation tools that extract key sequence patterns and directly summarize how motifs influence transcription factor binding.

"This was extremely satisfying, as the results fit beautifully with existing experimental results, and also revealed novel insights that surprised us," Zeitlinger said in a statement.

The team found that transcription factor binding is guided by soft syntax rules, which follow clear intermotif, distance-dependent relationships consistent with protein-protein interactions or nucleosome-mediated cooperativity. For example, BPNet predicted that transcription factors Sox2 and Nanog interact and that this cooperative interaction is directional. In this way, interactions between two motifs occur in a flexible but distance-dependent fashion that is specific for each motif pair.

"There has been a long trail of experimental evidence that such motif periodicity sometimes exists in the regulatory code," Zeitlinger said. "However, the exact circumstances were elusive, and Nanog had not been a suspect. Discovering that Nanog has such a pattern, and seeing additional details of its interactions, was surprising because we did not specifically search for this pattern."

Moreover, the researchers found that the Nanog motif showed a strong helical spacing preference, recurring at multiples of around 10.5 base pairs (roughly one turn of the DNA double helix), independent of orientation. This helical spacing may help Nanog engage in cooperative protein-protein interactions by presenting on the same side of the DNA as partner motifs.
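To make the 10.5-base-pair periodicity concrete, here is a minimal sketch of how a repeating pattern can be detected in a spacing-dependent signal. It is not the authors' analysis: the function name `periodicity_strength`, the candidate periods, and the synthetic signal are all assumptions made for illustration. The idea is simply to project the signal onto sine and cosine waves of a given period and see which period gives the largest amplitude.

```python
import math

def periodicity_strength(signal, period):
    """Amplitude of the component of `signal` that repeats every `period` positions."""
    n = len(signal)
    mean = sum(signal) / n
    # Project the mean-centered signal onto a cosine and a sine of the given period.
    cos_sum = sum((v - mean) * math.cos(2 * math.pi * d / period)
                  for d, v in enumerate(signal))
    sin_sum = sum((v - mean) * math.sin(2 * math.pi * d / period)
                  for d, v in enumerate(signal))
    return 2 * math.hypot(cos_sum, sin_sum) / n

# Synthetic (made-up) interaction score that peaks every 10.5 bp,
# sampled over 105 positions, i.e. ten full helical turns.
synthetic = [1.0 + 0.5 * math.cos(2 * math.pi * d / 10.5) for d in range(105)]

# Hypothetical candidate periods to compare.
candidate_periods = [7.0, 9.0, 10.5, 12.0, 14.0]
best = max(candidate_periods, key=lambda p: periodicity_strength(synthetic, p))
print(best)
```

On real data the signal would be a measured or predicted quantity as a function of motif distance; the synthetic cosine here only demonstrates that the projection picks out the period that was built into it.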

"This is the key advantage of using neural networks for this task," said Žiga Avsec, PhD, senior research scientist at the Technical University in Munich and first author of the paper.

"More traditional bioinformatics approaches model data using pre-defined rigid rules that are based on existing knowledge. However, biology is extremely rich and complicated," explained Avsec. "By using neural networks, we can train much more flexible and nuanced models that learn complex patterns from scratch without previous knowledge, thereby allowing novel discoveries."

How does BPNet work?

BPNet takes raw DNA sequence as input and learns to detect sequence motifs and, eventually, the higher-order rules by which these elements predict the base-resolution binding data. Once the model is trained, the learned patterns are extracted with interpretation tools. The output signal is traced back to the input sequences to reveal sequence motifs.
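As a rough illustration of what "learning from raw DNA sequence" means for a convolutional network, the sketch below one-hot encodes a sequence and slides a single filter along it. This is not BPNet's code: the helper names `one_hot` and `conv_scan`, the toy motif "GATTA", and the hand-set filter weights are assumptions. In a trained network the filter weights are learned from data rather than written by hand.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv_scan(encoded, kernel):
    """Slide `kernel` (width x 4 weights) along the sequence; return one score per offset."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for k in range(width):
            for c in range(4):
                score += encoded[start + k][c] * kernel[k][c]
        scores.append(score)
    return scores

# Hand-set filter rewarding the toy motif "GATTA" (+1 per match, -1 per mismatch).
# A real network would learn these weights during training.
toy_motif = "GATTA"
kernel = [[1.0 if b == m else -1.0 for b in BASES] for m in toy_motif]

sequence = "CCGATTACCA"
scores = conv_scan(one_hot(sequence), kernel)
best_offset = max(range(len(scores)), key=scores.__getitem__)
print(best_offset, scores[best_offset])
```

The filter responds most strongly where the sequence matches the motif. Stacking many such filters and further layers is what lets a network combine motif matches into higher-order rules.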

Researchers used DNA sequences from high-resolution experiments to train a neural network called BPNet, whose "black box" inner workings were then uncovered to reveal sequence patterns and organizing principles of the genome's regulatory code. Illustration courtesy of Mark Miller, Stowers Institute for Medical Research.

The final step is to use the model as an oracle and systematically query it with specific DNA sequence designs, similar to what one would do to test hypotheses experimentally, to reveal the rules by which sequence motifs function in a combinatorial manner.
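The oracle idea can be sketched as a small in silico experiment: build synthetic sequences in which two motifs sit at controlled distances, query a model, then repeat with one motif mutated. Everything below is hypothetical. `toy_oracle` is a hand-written stand-in for a trained model (its distance-decay rule is invented so the example has something to measure), and the motifs "AAAC" and "GGGT" are placeholders, not real binding sites.

```python
import math

MOTIF_A = "AAAC"  # hypothetical partner motif
MOTIF_B = "GGGT"  # hypothetical motif whose binding is "predicted"

def toy_oracle(seq):
    """Stand-in for a trained model: predicted binding at MOTIF_B.

    Returns 1.0 for MOTIF_B alone, plus a boost that decays with the
    distance to MOTIF_A. Purely illustrative, not a learned rule.
    """
    b = seq.find(MOTIF_B)
    if b < 0:
        return 0.0
    a = seq.find(MOTIF_A)
    if a < 0:
        return 1.0
    return 1.0 + math.exp(-abs(b - a) / 20.0)

def design(distance, length=120, anchor=10):
    """Synthetic sequence with MOTIF_A at `anchor` and MOTIF_B `distance` bp downstream."""
    seq = ["T"] * length  # neutral background
    seq[anchor:anchor + len(MOTIF_A)] = MOTIF_A
    start_b = anchor + distance
    seq[start_b:start_b + len(MOTIF_B)] = MOTIF_B
    return "".join(seq)

distances = range(5, 101, 5)
with_partner = {d: toy_oracle(design(d)) for d in distances}
# Control: replace MOTIF_A with background and re-query, mimicking a mutation experiment.
without_partner = {d: toy_oracle(design(d).replace(MOTIF_A, "TTTT")) for d in distances}
boost = {d: with_partner[d] - without_partner[d] for d in distances}

for d in (5, 50, 100):
    print(d, round(boost[d], 3))
```

With a real trained model in place of `toy_oracle`, the same loop would yield a predicted interaction-versus-distance curve for any motif pair, which is far cheaper than building and testing each sequence design in the lab.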

"The beauty is that the model can predict way more sequence designs than we could test experimentally," Zeitlinger said. "Furthermore, by predicting the outcome of experimental perturbations, we can identify the experiments that are most informative to validate the model."

To experimentally validate the motif syntax, the researchers performed targeted point mutations in motifs and compared the changes in ChIP-nexus profiles to those predicted by BPNet. They used CRISPR/Cas9 to perform two-base substitutions in either the Sox2 or Nanog motif, and then performed ChIP-nexus experiments on wild-type and mutant embryonic stem cells.

As expected, mutating the Sox2 motif eliminated the binding associated with that transcription factor. The Sox2 mutation also caused a near-complete loss of Nanog binding close to the mutation site, whereas the Nanog mutation did not affect Sox2 binding. This asymmetry validates the directional cooperativity between the two transcription factors.

Both the Zeitlinger lab and the Kundaje lab are already using BPNet to reliably identify binding motifs for other cell types, relate motifs to biophysical parameters, and learn other structural features in the genome, such as those associated with DNA packaging. The teams have made the entire BPNet software framework freely available to other scientists.


Copyright © 2021
