New method for searching molecular structure databases with tandem MS data

Written by Bethany Small, Future Science Group

A team of researchers from Aalto University (Espoo, Finland) and the University of Jena (Germany) have developed a method for searching molecular structure databases using tandem MS spectra of small molecules. The study, recently published online ahead of print in Proceedings of the National Academy of Sciences of the United States of America, could help identify a greater number of unknown metabolites in metabolomics experiments.

Tandem MS can be used to identify the thousands of metabolites simultaneously detected in a biological sample by LC–MS. However, identifying the structure of a detected metabolite can be challenging, especially when the compound cannot be found in a tandem MS spectral library.

To help overcome this problem, the bioinformatics research team, led by Sebastian Böcker (Chair of Bioinformatics, University of Jena), devised a new method for searching molecular structure databases. They developed a search engine, CSI:FingerID (Compound Structure Identification:FingerID), that combines the computation and comparison of fragmentation trees with machine learning. Its workflow can be roughly divided into three phases.

In all phases, the tandem MS spectrum of each compound is first transformed into a fragmentation tree by an automated method. In the first phase, the method is trained on a database of reference compounds with known molecular structures. Following this machine learning phase, the second phase attempts to find an unknown compound in a database of molecular structures. Using the tandem MS spectrum of the unknown compound, the search engine computes the similarity of this compound to every compound in the reference data set. This results in a predicted fingerprint of the unknown compound. In the final phase, this predicted fingerprint is compared against the fingerprints of compounds in a molecular structure database such as PubChem. This gives a ‘shortlist’ of candidate molecular structures, and the fingerprint of each candidate is scored against the predicted fingerprint of the unknown compound. Candidate structures are then sorted according to this score and reported back to the user.
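The final phase, scoring candidate fingerprints against the predicted fingerprint and sorting by score, can be illustrated with a minimal Python sketch. This is not the CSI:FingerID implementation: the scoring function (mean per-bit agreement), the function names `agreement_score` and `rank_candidates`, and the four-bit toy fingerprints are all assumptions made for illustration, and the article does not specify how the real scoring works.

```python
from typing import Dict, List, Tuple


def agreement_score(predicted: List[float], candidate: List[int]) -> float:
    """Mean per-bit agreement between a predicted fingerprint
    (probabilities in [0, 1]) and a candidate's binary fingerprint.

    Hypothetical scoring chosen for illustration only.
    """
    if len(predicted) != len(candidate):
        raise ValueError("fingerprints must have the same length")
    total = 0.0
    for p, bit in zip(predicted, candidate):
        # Reward p when the candidate has the property, 1 - p when it does not.
        total += p if bit == 1 else 1.0 - p
    return total / len(predicted)


def rank_candidates(
    predicted: List[float], candidates: Dict[str, List[int]]
) -> List[Tuple[str, float]]:
    """Score every candidate and return (name, score) pairs, best first."""
    scored = [(name, agreement_score(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (invented): one predicted fingerprint, three candidate structures.
predicted_fingerprint = [0.9, 0.1, 0.8, 0.2]
candidate_fingerprints = {
    "candidate_A": [1, 0, 1, 0],
    "candidate_B": [1, 1, 0, 0],
    "candidate_C": [0, 1, 0, 1],
}

ranking = rank_candidates(predicted_fingerprint, candidate_fingerprints)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy example the candidate whose fingerprint agrees most closely with the prediction is listed first, mirroring the sorted shortlist that is reported back to the user.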

Böcker explained the utility of this process: “After obtaining the list of possible candidates we still don’t know with absolute certainty which metabolite we are dealing with. But when we can reduce the number of possible compounds from several thousand down to perhaps ten, then this is huge progress.”

The published study reports that the new search method shows significantly increased identification rates compared with existing methods. Böcker and the research team have made their search engine freely available to the international scientific community.

Sources: Dührkop K, Shen H, Meusel M, Rousu J, Böcker S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA doi:10.1073/pnas.1509788112 (2015) (Epub ahead of print); Search engine for more accurate and fast recognition of metabolites; CSI:FingerID.