GoBioSpace: A database search tool for metabolite analysis

Because the masses measured in a mass spectrometer are almost directly connected to the elemental composition of the measured analyte, we conceptualised the GoBioSpace (Golm Biochemical Space) database as a simple repository of annotated elemental compositions that can be searched directly with all kinds of mass spectrometric data. To this end, entries from publicly available chemical databases such as PubChem Compound, PubChem Substance and ChemSpider, and from biological databases such as the Human Metabolome Database and Metabolome.JP, were collected and subsequently consolidated into a single repository. Accurate isotopic masses were calculated for ambient (¹²C) and for fully isotopically labelled (¹³C, ¹⁵N and ³⁴S) formulas. To date, specific information for more than 366 million elemental compositions, covering properties such as InChI strings, CAS numbers, IUPAC names, trade names, synonyms, cross references, literature references, KEGG Pathway names, foreign database identifiers and other descriptions, has been collected, tagged according to both depositor and property, and assigned to 2.1 million distinct formulas in a second repository.
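The link between elemental composition and accurate mass can be illustrated with a minimal sketch. It is not the GoBioSpace implementation: the function names, the small isotope tables and the simple formula parser (no brackets, no charges) are assumptions made for illustration only.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes ("ambient" formulas).
AMBIENT = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
    "P": 30.97376199842,
    "S": 31.9720711744,
}

# Masses (u) of the heavy isotopes used for full labelling: 13C, 15N, 34S.
LABELLED = {
    "C": 13.00335483507,
    "N": 15.00010889888,
    "S": 33.967867004,
}


def parse_formula(formula):
    """Turn a flat formula such as 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula, labelled_elements=()):
    """Sum isotope masses; elements in labelled_elements use the heavy isotope."""
    total = 0.0
    for element, count in parse_formula(formula).items():
        table = LABELLED if element in labelled_elements else AMBIENT
        total += table[element] * count
    return total


glucose_ambient = monoisotopic_mass("C6H12O6")             # about 180.0634 u
glucose_13c = monoisotopic_mass("C6H12O6", ("C",))         # about 186.0835 u
```

Full ¹³C labelling shifts the mass by roughly 1.00335 u per carbon atom, which is why ambient and labelled formulas have to be stored and searched as separate masses.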

Meaningful interpretation of search results in a biological context is achieved by a targeted search that limits the formulas to those from biology-related depositors such as KEGG and BioCyc, among others. In contrast, searches that are relaxed with regard to the formula's depositor, and therefore include elemental compositions reported only by vendors of potentially synthesised chemicals, return hits with lower biological interpretability. In addition, search results can be restricted by elemental composition, mass accuracy, ambient or isotopically labelled formulas, and expected analytical adducts.
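The effect of a depositor and element restriction can be sketched as a simple filter. The record layout (`formula`, `depositors`), the function names and the example hits, including the vendor name, are hypothetical and do not reflect the actual GoBioSpace schema.

```python
import re

# Biology-related depositors named in the text; the real list is longer.
BIOLOGY_DEPOSITORS = {"KEGG", "BioCyc"}


def elements_of(formula):
    """Return the set of element symbols in a flat formula, e.g. 'C7H5ClO2'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def restrict(hits, depositors=None, allowed_elements=None):
    """Keep hits reported by at least one given depositor and built only
    from the allowed elements. A filter set to None is not applied."""
    kept = []
    for hit in hits:
        if depositors is not None and not (set(hit["depositors"]) & depositors):
            continue
        if allowed_elements is not None and not elements_of(hit["formula"]) <= allowed_elements:
            continue
        kept.append(hit)
    return kept


# Hypothetical search hits for illustration.
EXAMPLE_HITS = [
    {"formula": "C6H12O6", "depositors": ["KEGG", "PubChem Compound"]},
    {"formula": "C7H5ClO2", "depositors": ["ExampleVendor"]},
]

targeted = restrict(EXAMPLE_HITS, depositors=BIOLOGY_DEPOSITORS)
chnops_only = restrict(EXAMPLE_HITS, allowed_elements={"C", "H", "N", "O", "P", "S"})
```

In this toy example both restrictions drop the vendor-only, chlorine-containing entry and keep the formula deposited by KEGG.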

The main function of GoBioSpace is to compare measured masses from mass spectrometric measurements, now including all kinds of mass spectrometric data (highly accurate masses as well as data of lower mass accuracy), against one or several databases. As illustrated in the figure below, the workflow for the data analysis is relatively simple: a single mass or elemental composition, or a list of masses or formulas (tab-delimited text file), is loaded into the software and searched against one or several databases (at the moment more than 150 public databases are hosted, including the whole PubChem collection). Prior to the database search, a number of parameters have to be specified: the possible adducts of the measured mass (e.g. [M+H]⁺, [M+Na]⁺, [M+NH₄]⁺, [M-2H]²⁻, [M-Acetate+H]⁻), the mass accuracy of the entered data, and a selection of elements expected to be contained in the matching compounds.

The database search itself (in the in-house version) is fast and easily processes 2,000 searches per second, meaning that even a large list containing 30,000 peaks is processed within 15 seconds. However, because of the additional protocol layers that use XML for data encapsulation and transport over the internet, we expect the performance of the internet version to fall below this value, depending also on the final capacity of the web and database servers.

The result list, which is again a tab-delimited text file, contains all the information from the input table (measured m/z, RT and intensity of the measured peaks), supplemented with the possible elemental composition of the measured mass, the adduct used to match the measured and calculated masses, the database from which the hit was derived, one or several compound name(s) if specified within the selected databases, and the mass error between the measured mass and the matched hit.
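The search workflow described above can be sketched in a few lines, under stated assumptions: the column names (`mz`, `RT`, `intensity`), the two-entry reference list, the adduct table and all function names are hypothetical, the mass accuracy is taken to be given in ppm, and the sketch is not the GoBioSpace code. The [M-Acetate+H]⁻ adduct is left out because its mass shift is not specified in the text.

```python
import csv
import io
from bisect import bisect_left, bisect_right

H, N, NA = 1.00782503223, 14.00307400443, 22.9897692820
ELECTRON = 0.000548579909
PROTON = H - ELECTRON  # approximation that neglects the electron binding energy

# Adduct name -> (mass shift added to the neutral molecule M, absolute charge).
ADDUCTS = {
    "[M+H]+": (PROTON, 1),
    "[M+Na]+": (NA - ELECTRON, 1),
    "[M+NH4]+": (N + 4 * H - ELECTRON, 1),
    "[M-2H]2-": (-2 * PROTON, 2),
}

# Toy reference list, sorted by monoisotopic mass: (mass, formula, database).
REFERENCE = sorted([
    (180.0633881, "C6H12O6", "KEGG"),
    (89.0476785, "C3H7NO2", "KEGG"),
])
MASSES = [mass for mass, _, _ in REFERENCE]


def search(mz, adducts, ppm):
    """Return (formula, adduct, database, error_ppm) for every reference mass
    that explains the measured m/z within the ppm tolerance."""
    hits = []
    for name in adducts:
        shift, charge = ADDUCTS[name]
        neutral = mz * charge - shift          # neutral mass implied by this adduct
        tol = mz * ppm * 1e-6 * charge         # m/z tolerance scaled to neutral mass
        lo = bisect_left(MASSES, neutral - tol)
        hi = bisect_right(MASSES, neutral + tol)
        for mass, formula, database in REFERENCE[lo:hi]:
            calc_mz = (mass + shift) / charge
            error_ppm = (mz - calc_mz) / calc_mz * 1e6
            hits.append((formula, name, database, error_ppm))
    return hits


def annotate(tsv_text, adducts, ppm):
    """Read a tab-delimited peak table and return a tab-delimited result table
    with one row per hit; peaks without any hit are omitted in this sketch."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = list(reader.fieldnames) + ["formula", "adduct", "database", "error_ppm"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for formula, adduct, database, err in search(float(row["mz"]), adducts, ppm):
            writer.writerow({**row, "formula": formula, "adduct": adduct,
                             "database": database, "error_ppm": f"{err:.2f}"})
    return out.getvalue()


peaks = "mz\tRT\tintensity\n181.0707\t5.2\t1000\n"
result = annotate(peaks, ["[M+H]+", "[M+Na]+"], ppm=5)
```

Keeping the reference masses sorted lets each lookup use a binary search instead of a full scan, which is one plausible way to reach a rate of thousands of searches per second; how GoBioSpace indexes its repositories is not stated in the text.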

GoBioSpace can be accessed online here