Artificial genetic sequences improve performance of deep neural networks

By Nick Paul Taylor, The Science Advisory Board contributing writer

May 5, 2023 -- Researchers have generated artificial genetic sequences to improve the ability of deep neural networks (DNNs) to predict the rules of DNA regulation.

The project, details of which were published in Genome Biology, was designed to address a limitation of DNNs. Researchers are interested in using DNNs to understand how regulatory motifs in DNA control the expression of genes. However, the effectiveness of DNNs is tied to the amount of data they are trained on.

"With DNNs, the mantra is the more data, the better. We really need these models to see a diversity of genomes so they can learn robust motif signals. But in some situations, biology itself is the limiting factor, because we can't generate more data than exists inside the cell," Peter Koo, assistant professor at Cold Spring Harbor Laboratory, said in a statement.

If a DNN is trained on too little data, it may misinterpret how a regulatory motif affects gene function. The risk is real when studying how sections of DNA control gene expression, because some regulatory motifs are uncommon and the pool of natural sequences is finite.

To expand the training dataset, Koo and his collaborators developed EvoAug. The tool generates artificial DNA sequences that nearly match sequences found in nature. Inspired by evolution, EvoAug creates DNA sequences that could theoretically exist but are not found in actual cells. The artificial sequences provide DNNs with additional training data to help them recognize regulatory motifs.
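The general idea of generating near-matches to natural sequences can be sketched in a few lines of Python. This is an illustration only, not EvoAug's API: the article does not say which perturbations the tool applies, so random point substitutions are assumed here, and the names `mutate` and `augment_dataset` are hypothetical.

```python
import random
from typing import List, Tuple

BASES = "ACGT"
Example = Tuple[str, float]  # (DNA sequence, measured activity label)


def mutate(seq: str, n_mutations: int, rng: random.Random) -> str:
    """Return a copy of seq with n_mutations positions swapped for a different base."""
    if n_mutations > len(seq):
        raise ValueError("cannot mutate more positions than the sequence has")
    chars = list(seq)
    # Sample distinct positions so exactly n_mutations sites change.
    for pos in rng.sample(range(len(chars)), n_mutations):
        alternatives = [b for b in BASES if b != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)


def augment_dataset(examples: List[Example], copies: int,
                    n_mutations: int, seed: int = 0) -> List[Example]:
    """Keep every natural example and add mutated copies that reuse its label."""
    rng = random.Random(seed)
    out = list(examples)
    for seq, label in examples:
        for _ in range(copies):
            # Assumption: a small perturbation leaves the label unchanged.
            out.append((mutate(seq, n_mutations, rng), label))
    return out
```

Calling `augment_dataset([("ACGTACGTAC", 0.7)], copies=2, n_mutations=1)` returns the original example plus two single-substitution variants, all carrying the label 0.7.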

The approach assumes that most changes to regulatory motifs have no effect on their function. Koo uses DNNs trained on cat images as an example, noting that mirroring a picture creates two examples for a model to learn from. The cat remains a cat whichever way the image faces, and the DNN learns to recognize that.
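Koo's cat analogy corresponds to a standard image augmentation. As a minimal sketch, assuming an image is stored as a list of pixel rows (the helper names `mirror` and `with_mirrored` are made up for illustration):

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixel values


def mirror(image: Image) -> Image:
    """Flip an image left to right by reversing each row."""
    return [row[::-1] for row in image]


def with_mirrored(dataset: List[Tuple[Image, str]]) -> List[Tuple[Image, str]]:
    """Pair each image with its mirror; both keep the same label."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((mirror(image), label))
    return out
```

One labeled cat picture becomes two training examples, and the label is copied rather than re-measured.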

Regulatory motifs are more complex because some changes to DNA do affect function. To minimize the risk that the artificial sequences will lead the DNN astray, Koo and his colleagues ran a second training step using only real biological data. The second step is intended to preserve functional integrity.
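The two-step schedule described above might look like the following sketch. It is not the team's training code: `fit_epoch` stands in for one pass of whatever model-fitting routine is used, `augment` for any sequence perturbation, and the epoch counts are arbitrary assumptions.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, float]  # (DNA sequence, measured activity label)


def two_stage_training(
    fit_epoch: Callable[[List[Example]], None],
    real_data: List[Example],
    augment: Callable[[str], str],
    augmented_epochs: int = 3,
    finetune_epochs: int = 1,
) -> None:
    """Train on augmented sequences first, then fine-tune on real data only."""
    # Stage 1: each epoch sees freshly perturbed sequences with the original labels.
    for _ in range(augmented_epochs):
        fit_epoch([(augment(seq), label) for seq, label in real_data])
    # Stage 2: the untouched biological data corrects any bias the
    # artificial sequences may have introduced.
    for _ in range(finetune_epochs):
        fit_epoch(list(real_data))
```

Passing the training step in as a callback keeps the schedule independent of any particular deep learning framework.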

In assessments run by the team at Cold Spring Harbor Laboratory, DNNs trained with the expanded EvoAug dataset performed better than models trained solely on biological data. In most cases, EvoAug DNNs performed better even before the second, fine-tuning step on real biological data. The augmented DNNs also worked better than models trained on relatively large natural datasets.
