June 20, 2021 -- A computer algorithm called molDiscovery uses mass spectrometry data from small molecules to predict the identity of unknown substances, potentially saving researchers time and money in the search for new naturally occurring products with medical uses. The new approach was reported in Nature Communications on June 17.
Small molecules are organic compounds of low molecular weight and a size on the order of 1 nm. The ability to determine which are present or absent in a specific sample, and whether those molecules are already known, has wide applications throughout the life sciences.
For example, in medicine, physicians look for small-molecule biomarkers in a patient's blood or tissue sample for purposes of disease diagnosis and prognosis, while epidemiologists search for small molecules in a population's diet and environment to identify disease risk factors. In pharmacology, small molecules are of interest for their potential as therapeutic drugs.
The molDiscovery algorithm improves both the efficiency and accuracy of small-molecule identification by matching small molecules to their mass spectra based on a pretrained probabilistic model.
Thanks to its speed, the algorithm can tell scientists early in their research whether they have stumbled onto a truly novel molecule or merely rediscovered something already known.
"Scientists waste a lot of time isolating molecules that are already known, essentially rediscovering penicillin," said co-author Hosein Mohimani, PhD, an assistant professor in the school of computer sciences at Carnegie Mellon University, in a statement. "Detecting whether a molecule is known or not early on can save time and millions of dollars, and will hopefully enable pharmaceutical companies and researchers to better search for novel natural products that could result in the development of new drugs."
A mass spectrum, which can be represented by a set of mass peaks, serves as a "fingerprint" or unique identifier of a small molecule. The molDiscovery algorithm works by comparing the mass spectra acquired from a sample against millions of molecular structures in small-molecule databases.
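The database comparison described above can be illustrated with a minimal sketch. It is not molDiscovery's scoring method: it simply counts how many observed peaks fall within a mass tolerance of the fragment masses predicted for each candidate. The function names, the 0.01 Da tolerance, and the toy masses are all assumptions made for illustration.

```python
from bisect import bisect_left

def count_matched_peaks(observed_peaks, predicted_masses, tol=0.01):
    """Count observed peaks lying within `tol` Da of any predicted fragment mass."""
    predicted = sorted(predicted_masses)
    matched = 0
    for mz in observed_peaks:
        # First predicted mass that is >= mz - tol; it matches if it is also <= mz + tol.
        i = bisect_left(predicted, mz - tol)
        if i < len(predicted) and predicted[i] <= mz + tol:
            matched += 1
    return matched

def rank_candidates(observed_peaks, database, tol=0.01):
    """Return (name, matched_peak_count) pairs, best match first."""
    scores = [
        (name, count_matched_peaks(observed_peaks, fragments, tol))
        for name, fragments in database.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: one observed spectrum and two candidate molecules,
# each represented by an invented list of predicted fragment masses (Da).
observed = [57.07, 85.10, 113.13, 141.16]
database = {
    "candidate_A": [57.07, 85.10, 141.16, 200.00],
    "candidate_B": [60.00, 113.13, 150.00],
}
ranking = rank_candidates(observed, database)
```

A shared-peak count like this treats every matched peak equally; a probabilistic model instead weights matches by how likely they are.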
The probabilistic model at the heart of molDiscovery was trained on reference spectra from the MassBank of North America (MoNA) and on molecule-spectrum pairs from the U.S. National Institutes of Health (NIH) Natural Products Library.
The probabilistic model takes the form P(logRank∣bondType), where logRank represents the intensity rank, on a log scale, of the mass peak of the corresponding small-molecule fragment and bondType is S-C, O-P, P-C, C-C, N-C, O-C, or a pairwise combination of these bonds.
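To show how a model of the form P(logRank∣bondType) could be turned into a match score, here is a rough sketch. Every number in the tables is invented. The binning of ranks on a log2 scale, the "absent" bin for fragments with no peak, the null distribution, and the use of a summed log-likelihood ratio are all assumptions for illustration, not details taken from the published method.

```python
import math

ABSENT = "absent"  # hypothetical bin for a predicted fragment with no observed peak

# Invented probabilities P(logRank | bondType); bins 0, 1, 2 run from the most
# intense peaks to weaker ones. Each row sums to 1.
P_LOGRANK_GIVEN_BOND = {
    "C-C": {0: 0.05, 1: 0.10, 2: 0.15, ABSENT: 0.70},
    "N-C": {0: 0.20, 1: 0.20, 2: 0.20, ABSENT: 0.40},
    "O-C": {0: 0.15, 1: 0.15, 2: 0.20, ABSENT: 0.50},
}

# Invented null model: the distribution of bins when bond type is ignored.
P_NULL = {0: 0.10, 1: 0.15, 2: 0.15, ABSENT: 0.60}

def log_rank(rank):
    """Bin a 1-based intensity rank on a log2 scale, capped at bin 2."""
    return min(int(math.log2(rank)), 2)

def score_match(fragments):
    """Sum log-likelihood ratios over predicted fragments.

    `fragments` is a list of (bond_type, intensity_rank) pairs, where
    intensity_rank is None if no peak was observed for that fragment.
    """
    total = 0.0
    for bond_type, rank in fragments:
        key = ABSENT if rank is None else log_rank(rank)
        total += math.log(P_LOGRANK_GIVEN_BOND[bond_type][key] / P_NULL[key])
    return total

# A fragment tied to an N-C bond that produced the top-ranked peak scores above
# zero, while the same peak tied to a C-C bond scores below zero.
nc_score = score_match([("N-C", 1)])
cc_score = score_match([("C-C", 1)])
```

In this toy setup, a positive total means the observed peaks are better explained by the candidate molecule than by the null model.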
To test the system, the researchers ran molDiscovery on over 8 million spectra in the Global Natural Product Social Molecular Networking (GNPS) repository, an open-access knowledge base for sharing mass spectrometry data. The molDiscovery system was able to identify 3,185 unique small molecules at a 0% false discovery rate (FDR), a sixfold increase compared to existing methods based on chemistry domain knowledge.
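For readers unfamiliar with FDR, the sketch below shows one common way it is estimated in mass spectrometry searches: a target-decoy comparison. The article does not say how FDR was estimated in this study, so the approach, the function names, and the scores here are assumptions for illustration only.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimate FDR as (decoy hits) / (target hits) at or above `threshold`."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0

def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.0):
    """Walk thresholds from high to low and stop at the first FDR violation.

    Returns the lowest threshold reached before the estimated FDR exceeds
    `max_fdr`, or None if even the top score fails.
    """
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        if fdr_at_threshold(target_scores, decoy_scores, threshold) <= max_fdr:
            best = threshold
        else:
            break
    return best

# Hypothetical match scores against real (target) and shuffled (decoy) molecules.
targets = [9.1, 8.4, 7.7, 6.0, 5.2, 3.3]
decoys = [5.5, 4.0, 2.1]

cutoff = lowest_threshold_for_fdr(targets, decoys, max_fdr=0.0)
accepted = [score for score in targets if score >= cutoff]
```

With these toy numbers, an estimated FDR of 0% means that no decoy match scores as high as any accepted target match.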
On a subset of the GNPS repository with known genomes, molDiscovery was able to correctly link 19 known and three putative biosynthetic gene clusters to their molecular products.
The authors also noted that molDiscovery works for a wider range of molecule masses than previous approaches, which do not perform well for very small molecules (< 400 Da) and become computationally inefficient for heavy small molecules (> 1000 Da).
The molDiscovery system can handle molecules with masses up to 2000 Da, which is twice the mass that was handled by Dereplicator+, an earlier system developed in Mohimani's lab for searching mass spectra against chemical structures.