May 5, 2023 -- Researchers have generated artificial genetic sequences to improve the ability of deep neural networks (DNNs) to predict the rules of DNA regulation.
The project, details of which were published in Genome Biology, was designed to address a limitation of DNNs. Researchers are interested in using DNNs to understand how regulatory motifs in DNA control the expression of genes. However, the effectiveness of DNNs is tied to the amount of data they are trained on.
"With DNNs, the mantra is the more data, the better. We really need these models to see a diversity of genomes so they can learn robust motif signals. But in some situations, biology itself is the limiting factor, because we can't generate more data than exists inside the cell," Peter Koo, assistant professor at Cold Spring Harbor Laboratory, said in a statement.
If a DNN is trained on too little data, it may misinterpret how a regulatory motif affects gene function. That risk is acute when mapping how sections of DNA control gene expression, because some regulatory motifs are uncommon and the natural dataset is finite.
To expand the training dataset, Koo and his collaborators developed EvoAug. The tool generates artificial DNA sequences that nearly match sequences found in nature. Inspired by evolution, EvoAug creates DNA sequences that could theoretically exist but are not found in actual cells. The artificial sequences provide DNNs with additional training data to help them recognize regulatory motifs.
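The core idea can be sketched in a few lines: perturb a real sequence with random, evolution-style point mutations to produce a plausible variant that does not occur in the cell. This is an illustrative sketch under stated assumptions, not EvoAug's actual API; the `mutate` function and its parameters are hypothetical stand-ins.

```python
import numpy as np

BASES = "ACGT"

def mutate(seq: str, rate: float, rng: np.random.Generator) -> str:
    """Apply random point mutations to a DNA string.

    A toy, evolution-inspired augmentation in the spirit of EvoAug
    (hypothetical helper, not the library's real interface)."""
    out = list(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            # substitute the base with one of the three other bases
            out[i] = rng.choice([b for b in BASES if b != out[i]])
    return "".join(out)

rng = np.random.default_rng(0)
original = "ACGTACGTACGTACGTACGT"
augmented = mutate(original, rate=0.1, rng=rng)
print(augmented)  # same length as the original, a few bases substituted
```

EvoAug itself draws on a broader menu of evolution-inspired perturbations (for example insertions, deletions, and inversions as well as substitutions); point mutation is shown here only because it is the simplest case.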
The approach assumes that most changes to regulatory motifs have no effect on their function. Koo uses DNNs trained on cat images as an example, noting that mirroring the same picture creates two examples for a model to learn from. The cat remains a cat whichever way the image faces, and the DNN learns to recognize that.
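The mirroring analogy is easy to show in code: flipping an image left-to-right yields a second training example that keeps the same label.

```python
import numpy as np

# A stand-in "image": a 3 x 4 array of pixel values.
img = np.arange(12).reshape(3, 4)

# Mirroring produces a new training example with identical content
# (and therefore the same label); only the orientation changes.
mirrored = np.fliplr(img)
```

Flipping twice recovers the original image, which is exactly why the label is safe to reuse: the transformation loses no information.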
Regulatory motifs are more complex because some changes to DNA do affect function. To minimize the risk that the artificial sequences will lead the DNN astray, Koo and his colleagues ran a second training step using real biological data. The second step is intended to preserve functional integrity.
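The two-stage schedule described above can be sketched with a toy model: pretrain on an augmented (real plus perturbed) dataset, then fine-tune on the real data alone. The linear model, synthetic data, and training loop below are illustrative stand-ins under that assumption, not EvoAug's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
true_w = rng.normal(size=d)
X_real = rng.normal(size=(50, d))
y_real = X_real @ true_w

# "Augmented" copies: small perturbations that mostly preserve the label,
# mirroring the assumption that most changes are function-neutral.
X_aug = np.vstack([X_real, X_real + 0.1 * rng.normal(size=X_real.shape)])
y_aug = np.concatenate([y_real, y_real])

def sgd_step(w, X, y, lr=0.05):
    """One full-batch gradient step on mean squared error."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

w = np.zeros(d)
for _ in range(300):            # stage 1: pretrain on augmented data
    w = sgd_step(w, X_aug, y_aug)
for _ in range(100):            # stage 2: fine-tune on real data only
    w = sgd_step(w, X_real, y_real)

mse = np.mean((X_real @ w - y_real) ** 2)
print(f"fine-tuned MSE on real data: {mse:.4f}")
```

The fine-tuning pass pulls the model back toward what the real measurements actually show, which is the role the second step plays in EvoAug's pipeline.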
In assessments run by the team at Cold Spring Harbor Laboratory, DNNs trained on the expanded EvoAug dataset outperformed models trained solely on biological data, even when those baseline models were trained on relatively large natural datasets. In most cases, the EvoAug DNNs came out ahead even before the second fine-tuning step on real biological data.