MIT researchers use natural language processing to analyze viral evolution

By Leah Sherwood, The Science Advisory Board assistant editor

January 18, 2021 -- In a breakthrough that could guide the development of targeted vaccines, Massachusetts Institute of Technology (MIT) researchers used natural language processing methods lifted from the field of computational linguistics to analyze the viral protein sequence data of influenza A, HIV, and SARS-CoV-2 to identify regions within the genomes of those viruses that are most vulnerable to mutation. The results were published in a new study in Science on January 15.

One of the greatest challenges to defeating influenza and HIV is their rapid rate of mutation, which allows them to evade the antibodies generated by a particular vaccine through a process known as "viral escape." The phenomenon occurs when a mutation enables the virus to change the shape of its surface proteins in a way that prevents antibodies from binding to them, but still leaves the proteins' functionality intact.

"If a virus wants to escape the human immune system, it doesn't want to mutate itself so that it dies or can't replicate," said Brian Hie, the lead author of the Science paper and a graduate student at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), in a statement. "It wants to preserve fitness but disguise itself enough so that it's undetectable by the human immune system."

Viral escape of the surface protein of influenza and the envelope surface protein of HIV explains why we don't have a universal flu vaccine or a vaccine for HIV. In the case of SARS-CoV-2, it is still unclear how rapidly the virus mutates, which raises the question of how long the vaccines now being deployed to combat COVID-19 will remain effective before succumbing to viral escape.

Hie and his co-authors, who include members of MIT's departments of biological engineering and computational and systems biology, devised a new way to computationally model viral escape based on machine-learning models that were originally developed to analyze natural language. Neural language models, which underlie technologies such as speech recognition and machine translation, are trained on huge collections of text in order to calculate the frequency with which certain words occur together.
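To make the idea of word co-occurrence concrete, here is a minimal Python sketch that counts how often pairs of adjacent words appear in a tiny, made-up corpus. It is an illustration only: neural language models learn such statistics implicitly during training rather than by explicit counting, and the function name `cooccurrence_counts`, the `window` parameter, and the example sentences are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def cooccurrence_counts(sentences: Iterable[str], window: int = 1) -> Counter:
    """Count ordered word pairs that appear within `window` words of each other."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Pair each word with the next `window` words that follow it.
            for right in tokens[i + 1 : i + 1 + window]:
                pair: Tuple[str, str] = (left, right)
                counts[pair] += 1
    return counts


# Toy corpus invented for this example; it is not real training data.
corpus = [
    "the virus mutates quickly",
    "the virus evades antibodies",
    "the vaccine targets the virus",
]
counts = cooccurrence_counts(corpus)
print(counts.most_common(3))
```

In this toy corpus the pair ("the", "virus") is counted more often than any other, which is the kind of regularity a language model picks up at a vastly larger scale.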

The techniques the MIT researchers brought to the viral domain include constrained semantic change search (CSCS), which they use to search for mutations to a viral sequence that preserve fitness while being antigenically different, and bidirectional long short-term memory (BiLSTM), a neural language model architecture they adapted to learn "grammatical" protein sequences and predict viral escape. The researchers trained these models on the amino acid sequences of 60,000 HIV sequences, 45,000 influenza sequences, and 4,000 coronavirus sequences.
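The search described above can be pictured as ranking candidate mutations by two scores at once: how much a mutation changes the sequence's "meaning" (a stand-in for antigenic difference) and how "grammatical" the mutated sequence remains (a stand-in for fitness). The Python sketch below is a simplified illustration under stated assumptions, not the researchers' implementation: the rank-sum combination, the `beta` weight, the mutation names, and all score values are hypothetical, and in practice both scores would come from a trained language model.

```python
from typing import Dict, List


def rank_ascending(values: Dict[str, float]) -> Dict[str, int]:
    """Map each key to its rank, where rank 1 is the smallest value."""
    ordered = sorted(values, key=lambda key: values[key])
    return {key: position + 1 for position, key in enumerate(ordered)}


def rank_escape_candidates(
    semantic_change: Dict[str, float],
    grammaticality: Dict[str, float],
    beta: float = 1.0,
) -> List[str]:
    """Order mutations so those scoring high on BOTH measures come first."""
    sem_rank = rank_ascending(semantic_change)
    gram_rank = rank_ascending(grammaticality)
    # Assumed combination rule: sum of ranks, with beta weighting grammaticality.
    combined = {m: sem_rank[m] + beta * gram_rank[m] for m in semantic_change}
    return sorted(combined, key=lambda m: combined[m], reverse=True)


# Made-up scores for four hypothetical single-residue mutations.
semantic_change = {"A10T": 0.90, "K25E": 0.60, "G40D": 0.10, "L55P": 0.95}
grammaticality = {"A10T": 0.80, "K25E": 0.60, "G40D": 0.70, "L55P": 0.01}

ranking = rank_escape_candidates(semantic_change, grammaticality)
print(ranking[0])  # A10T: large semantic change and still highly grammatical
```

In this toy example the hypothetical mutation L55P changes the sequence the most but is barely "grammatical," so it does not take the top spot; the top candidate is the one that scores well on both measures, mirroring the idea of a mutation that preserves fitness while looking different to the immune system.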

The team's models analyze patterns in the viral protein sequence in order to predict new sequences of viral surface proteins that have new functions but still follow the biological rules of protein structure. These are the sequences that are more likely to mutate in a way that enables viral escape. Conversely, the models can identify sections that are less likely to mutate, which makes them good targets for new vaccines.

"Language models are very powerful because they can learn this complex distributional structure and gain some insight into function just from sequence variation," Hie said. "We have this big corpus of viral sequence data for each amino acid position, and the model learns these properties of amino acid co-occurrence and co-variation across the training data."

An advantage of this kind of modeling is that it requires only sequence information, which is much easier to obtain than the protein structures themselves.

"The beautiful thing is all we need is sequence data, which is easy to produce," said co-author Bryan Bryson, an assistant professor of biological engineering at MIT.

According to the researchers, since their paper was accepted for publication, they have used their method to identify sequences likely to generate escape mutations in the new variants of SARS-CoV-2 that recently emerged in the U.K. and South Africa.

In more recent work, the MIT team is collaborating with cancer researchers to identify sequences to be used as targets for cancer vaccines that stimulate the body's own immune system to destroy tumors.
