How to submit metabolomic data to MetaboLights

Overview

To manage the exponential rise in biological data and the associated metadata, it has become necessary to submit raw data to dedicated public repositories. While this requirement has worked well for genomic and transcriptomic studies, it has found less acceptance in studies involving metabolite and metabolomic analysis. The requirement is not fully applied and not even enforced by journals¹, which makes it difficult to retrieve metabolomic raw data for further meta-analysis and comparative studies.

Despite the complexity of harmonising disparate sampling strategies, metabolite extraction protocols, and chromatographic and spectroscopic techniques, as well as data measurement, data integration and validation, and metabolite annotation, there is an urgent need to make these biological data and information available to the entire plant science community².

MetaboLights³ is an open-access database for metabolomic experiments, together with their raw data and associated metadata⁴,⁵. It is part of the ELIXIR infrastructure and is hosted by the European Bioinformatics Institute (EMBL-EBI). MetaboLights is the recommended metabolomic repository for a number of leading journals and publishers in the field, such as Metabolomics, Metabolites, Frontiers, BioMed Central, Scientific Data, PLOS Biology, EMBO Press, Data, F1000Research and OpenResearch.

Making both data and metadata available is essential for adhering to the FAIR principles⁶, so that data, experiments and results are Findable, Accessible, Interoperable and Reusable. MetaboLights facilitates data submission through the ISA Commons, a growing community that uses the ISA metadata tracking framework to aid standards-compliant collection, curation, management and reuse of datasets in an increasingly diverse set of life science domains.

Precaution

Exceptions to a rule always exist, and these guidelines cannot claim to be exhaustive. Our intention is to provide a guided example of how to submit metabolomic data and metadata, with particular emphasis on LC-MS and GC-MS techniques, that can serve as a tutorial for students, researchers and newcomers to the field.

Submission Process

Creating an account on MetaboLights is straightforward and free. It requires only a few details, such as name, email address, affiliation, country and password, with the option of linking the profile to the researcher’s ORCID iD.

Once the account is opened, the submission of a study requires the availability of raw data produced by the analytical instrument, preferably in an open-source format, and the metadata information to support the study.

Select “Submit study” to create a new study (or edit an existing study), and the Guided Submission Portal will lead the user through the study creation step by step.

Studies must pass validation to be submitted. Validation errors are highlighted in the information bar at the top of the study. Details of the errors are available in the study validations tab, and some examples can be found here

Initially, the study is not published. It is made available only once all of the following conditions have been met:

1. preliminary online validation has been completed,
2. the submitter has promoted the study from “submitted” to “in curation”,
3. the curation team has approved the study, which takes a minimum of 4-8 weeks, provided that everything is correct,
4. the release date has been reached.
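
The four conditions act together: a study becomes visible only when every one of them holds. The sketch below expresses this as a single check. It is illustrative only; the `StudyState` class and its field names are hypothetical and do not correspond to any MetaboLights data structure or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StudyState:
    """Hypothetical summary of a study's progress (not a MetaboLights object)."""
    validation_passed: bool      # condition 1: preliminary online validation
    promoted_to_curation: bool   # condition 2: moved from "submitted" to "in curation"
    curator_approved: bool       # condition 3: approved by the curation team
    release_date: date           # condition 4: release date chosen by the submitter


def is_publicly_available(study: StudyState, today: date) -> bool:
    """Return True only when all four conditions listed above are met."""
    return (
        study.validation_passed
        and study.promoted_to_curation
        and study.curator_approved
        and today >= study.release_date
    )
```

For example, a study that has been approved but whose release date lies in the future is still not available.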

Study Description

The word “study” means the entire experiment, which can be further subdivided into different assays if multiple analytical techniques were used, such as LC and GC. Besides the title, which is freely chosen by the submitter, one of the most important fields is the study ID, or study unique identifier: an alphanumeric code always composed of the letters MTBLS followed by numbers, which is assigned automatically when a new study is created. The study ID cannot be modified, and it is needed when referencing the study in manuscripts or elsewhere, together with the corresponding URL, such as www.ebi.ac.uk/metabolights/MTBLS000.
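
When study IDs are collected in a manuscript or a spreadsheet, a small check can catch typing errors before the reference is published. The sketch below assumes only what is stated above, namely the letters MTBLS followed by digits and the URL form shown in the text. The function names are illustrative.

```python
import re

# Pattern taken from the description above: the letters MTBLS followed by digits.
STUDY_ID_PATTERN = re.compile(r"MTBLS\d+")


def is_study_id(text: str) -> bool:
    """Check that the whole string looks like a study identifier."""
    return STUDY_ID_PATTERN.fullmatch(text) is not None


def study_url(study_id: str) -> str:
    """Build the reference URL in the form shown in the text."""
    if not is_study_id(study_id):
        raise ValueError(f"not a study identifier: {study_id!r}")
    return f"www.ebi.ac.uk/metabolights/{study_id}"
```

Calling `study_url("MTBLS897")` gives `www.ebi.ac.uk/metabolights/MTBLS897`, whereas a lowercase or truncated ID raises an error.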

This section contains general descriptions such as the title of the study, the author(s), an abstract giving a brief overview of the study and a list of keywords for indexing the study. These can be copied from the publication (if already available), which can be linked via the DOI of the article. All this information is stored in the file called i_MTBLSxxx.txt, together with the descriptive protocols described below.

The factors considered in the design of the experiment are displayed at the beginning of the submission and are added to the sample table, where a new column is created for each factor. Increasing the number of factors also increases the metadata associated with the study and therefore its future reusability. A rich set of metadata is indeed key to reusability, including beyond the scope of the original study.
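
To illustrate how one column per factor ends up in the sample table, the sketch below adds factor columns to a list of sample rows. It assumes the ISA-Tab header style `Factor Value[...]` and a `Sample Name` column. The sample IDs and the value "Control" are made up for the example.

```python
def add_factor_columns(rows, factors):
    """Return new rows with one 'Factor Value[...]' column per factor.

    rows    -- list of dicts, one per sample (existing columns are kept)
    factors -- dict mapping factor name to {sample name: value}
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        for name, values in factors.items():
            # Header style assumed from ISA-Tab; missing values stay empty.
            new_row[f"Factor Value[{name}]"] = values.get(row["Sample Name"], "")
        result.append(new_row)
    return result


# Made-up samples and factor values, for illustration only.
samples = [{"Sample Name": "SAN12"}, {"Sample Name": "SAN13"}]
factors = {
    "Cultivar": {"SAN12": "Sangiovese", "SAN13": "Sangiovese"},
    "Stress": {"SAN12": "Water deficit", "SAN13": "Control"},
}
table = add_factor_columns(samples, factors)
```

Each additional factor widens the table by one column without touching the columns that already exist.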

Field	Description	Example
Study Title	Title of the study (publication)
Study ID	Study unique identifier	MTBLS897; MTBLS898;
Authors	Names of the authors involved, or of the person responsible for the study
Abstract	Description of the study	Abstract of the publication
Keywords	4 to 8 keywords separated by semicolons	Untargeted metabolites; HPLC; Polyphenols;
Publications	DOI link of the publication(s) related to the study
Factors	Study Design Descriptors divided in name and type of factor	Cultivar; Year; Stress;
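
The keyword field is a simple semicolon-separated list, so its length can be checked before submission. The sketch below assumes the 4 to 8 range from the table. The function names are illustrative.

```python
def parse_keywords(field: str) -> list[str]:
    """Split a semicolon-separated keyword field, dropping empty entries."""
    return [part.strip() for part in field.split(";") if part.strip()]


def keyword_count_ok(field: str, minimum: int = 4, maximum: int = 8) -> bool:
    """Check the 4-to-8 keyword range given in the table above."""
    return minimum <= len(parse_keywords(field)) <= maximum
```

Note that the abbreviated example in the table lists only three keywords, so a real entry written that way would need at least one more.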

Protocols

Protocols are essential for reproducibility and should provide a detailed description of the steps taken in the study. When an assay is selected, the protocol section is populated with several titled sections to which the user can add the specific study details. Protocols are in free-text form, but a list of minimum requirements for each protocol section is presented in the tables below, together with some examples.

a. Sample collection

In this section, it is important to describe the sample origin (source, organism, species, intraspecific name, organism part), any relevant treatment, time points etc. as well as collection and storage procedures.

Field	Description	Example
Protocol Title	More than one protocol can be included and so more than one title ⁷	Leaves sampling
Source	From where the samples were collected. The use of an ID is recommended⁸	Fondazione Edmund Mach collection (ID 0381bab64)
Organism	An identifier for the organism at the species level. The use of the NCBI taxon ID⁹ is recommended.	NCBITAXON:29760
Species	Species name (formally: specific epithet) for the organism under study, according to standard scientific nomenclature.	Vitis vinifera L.
Intraspecific name(s)	Name of any subtaxa level, including variety, crossing name, etc. The use of the VIVC ID¹⁰ is recommended.	Prime name: Sangiovese; VIVC ID: 10680
Organism part	A reliable description of biological samples requires a shared vocabulary for the organ collected¹¹,¹²	berry, leaf, root, etc
Factor description	Describe or group the samples based on the “Study Design Descriptors” used in the “Study description” (Table 1)	Water deficit; Sangiovese; Veraison;
Replicate sampling and analyses	The number of samples represents the “sample size”, and all the samples together form the “sample set”. A part of the “sample set”, i.e. a “sample subset”, can be used for the analysis. A minimum of triplicate (n = 3) biological sampling is proposed, with n = 5 preferred¹³	For each variety/country condition, 4 leaves (not growing any more) per plant were collected, from 10 plants, from the same insertion level.
Tissue harvesting method	Record the details of how the sampling was carried out in the field/vineyard. For example, report whether the samples were frozen directly and how (e.g. liquid N2, dry ice, freeze clamping, etc.), the date and time of collection, the place of collection, whether samples were washed to remove unwanted external material (e.g. soil), shipping (time and temperature), and sample storage prior to further preparation (e.g. −80 °C for 2 weeks)	The leaves from each plant were pooled together, frozen immediately in liquid nitrogen and stored at -80 °C, until analysis.
Storage conditions and aliquoting prior to extraction or further processing	Report information about temperature, duration, possible additives, volumes, container and atmosphere. Samples can be divided into small aliquots. Report method, volume or weight and number of aliquots	The samples were transferred into 50 mL Falcon tubes and stored at -80 °C until further treatment. Each biological sample, including the QC sample, was divided into 10-15 2 mL Eppendorf tubes.
Relocation or shipping info	Provide information about the shipping conditions (if relevant)	… grape berries were directly frozen under liquid nitrogen, packed in dry ice and shipped to Italy within 24 h.
Sample IDs list	Sample ID list connected to the above information
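
The replicate recommendation in the table (a minimum of n = 3, with n = 5 preferred) can be checked against the sample list before any extraction is planned. The sketch below counts samples per combination of factor values. The record layout and the labels returned are assumptions made for this example.

```python
from collections import Counter


def replicate_counts(samples, factor_names):
    """Count biological samples per combination of factor values."""
    return Counter(tuple(s[name] for name in factor_names) for s in samples)


def flag_groups(counts, minimum=3, preferred=5):
    """Label each group against the n = 3 minimum and the n = 5 preference."""
    labels = {}
    for group, n in counts.items():
        if n < minimum:
            labels[group] = "below minimum"
        elif n < preferred:
            labels[group] = "meets minimum"
        else:
            labels[group] = "preferred"
    return labels
```

A group flagged as "below minimum" signals that more biological replicates should be collected for that condition, or that the limitation should be stated in the protocol.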

b. Extraction (Sample Preparation)

This protocol describes any extraction or preparation methods applied to the sample before analysis. Please also include information on any control samples prepared for the assay, e.g. pooled samples, standards, quality controls, solvent blanks, etc.

Field	Description
Protocol Title	More than one protocol can be included and so more than one title.
Randomization	Report whether the sample preparation order was randomized and how¹⁴, ¹⁵
Extraction parameters	Solvent(s), pH and ionic strength of the buffer, solvent temperature and volume(s) per quantity of tissue, internal standard(s), number of replicate extracts, sequential extraction, and extraction time¹⁶, ¹⁷
Concentration/Dilution	Extract concentration, dilution, and resolubilization processes (e.g. dried under nitrogen, solubilized in methanol)¹⁶
Enrichment	Extract enrichment (e.g. solid phase extraction, desalting, molecular cut-off, ion exchange, rotary evaporation)¹⁶
Extract treatments	Extract cleanup and/or use of additives (e.g. ultrafiltration, centrifugation, the addition of antioxidants, pH change)¹⁶
Derivatization	Report the protocol of derivatization (the chemical used, temperature, time, etc).
Quality Control Sample(s)	Report if a QC pooled sample was prepared by using the extracts of all the “sample set” or a “sample subset”. Also, report the method (volume or weight from each sample and total amount of the QC pooled sample)¹⁶, ¹⁷
Reference Material	Report whether any biological reference material and/or standard mixture was used, and how it was purchased or prepared. This material can also be used as a QC sample.
Blanks	Report how the blank sample was prepared. A blank is a type of QC sample. Blanks should be prepared following the same protocol as the study samples and the QC pooled samples, except that the sample matrix is omitted if it is solid, or replaced by the extraction solvent (or water) if it is liquid.
Aliquoting	Aliquots prepared during or after the sample preparation (code, volume, number). This includes the QC samples¹⁷
Storage – Relocation	Extract storage (e.g. temperature, duration, atmosphere, volumes, containers, etc) and/or relocation (e.g. temperature, duration, atmosphere, places).
Internal standard(s) addition	Internal standard(s) added at any stage.
Samples ID list	Update the Sample ID list, including the names or the IDs of the extracts. Often more than one extraction protocol is applied for the same samples.

Examples:

About 1 g of each sample was then weighed into 15‐ml amber vials, and 50 µl of the internal standard (IS) o‐coumaric acid (2 mg/ml MeOH), 1.2 ml of H₂O/CH₃OH (1 : 2) and 0.8 ml of CHCl₃ were then added. The extraction mixture was vortexed for 1 min, shaken for 15 min at room temperature using an orbital shaker (Grant‐Bio Rotator PTR‐60) and centrifuged at 4 °C and 1000 g for 10 min. The upper aqueous methanolic phase was transferred into a 5‐ml flask, and extraction was repeated by adding another 1.2 ml of H₂O/CH₃OH (1 : 2) with shaking and centrifugation as before. The two supernatants were collected in the same 5‐ml flask and the flask was brought up to 5 ml using Milli‐Q water then filtered into LC‐MS certificated vials through a 0.2‐µm PTFE filter (Millipore, Italy) and analyzed. This procedure was repeated three times for each biological sample in order to obtain three technical replicates.¹⁸
A sample of 100 mg frozen tissue powder was transferred to 2 ml Eppendorf tube, and metabolites were extracted in 1 ml methanol/chloroform/water extraction solution (2.5/1/1 v/v/v). The mixture was then vortexed for 1 min and centrifuged for 5 min at 10,000 RPM (Sigma, Germany) at 4 °C, and the supernatant was decanted into the new tubes. The supernatant was mixed with 400 µl of chloroform and 400 µl of MilliQ water and then centrifuged at 5 min at 10,000 RPM at 4°C. The upper water/methanol phase was filtered 0.22 µm (Millipore) and transferred to MS vials for LC-MS analysis.¹⁹
For the free (non-glycosylated) VOCs, on the day of analysis, four grams of frozen grape powder were weighed out in a 20 mL SPME dark-glass vial. Three grams of NaCl, 15 mg of citric acid, 15 mg of ascorbic acid, 50 μL of sodium azide, and 7 mL of milliQ water were added to the sample. Fifty μL of a solution containing five internal standards, d10-4-methyl-3-penten-2-one (1 g/L), d11-ethyl hexanoate (1 g/L), d16-octanal (1 g/L), d8-acetophenone (1 g/L), d7-benzyl alcohol (1 g/L), was added to each sample.²⁰

c. Chromatography

For reproducibility, provide details of the instrument and column used, the mobile phase and gradient, and settings such as temperatures, flow rate and injection volume.

Field	Description
Protocol Title	More than one protocol can be included and so more than one title (e.g. gas or liquid chromatography, different stationary phases or mobile phases).
Instrument	Chromatography instrument description (manufacturer, model, modules, software package(s), software version) ¹⁷, ²¹
Injection	Auto-injector (manufacturer, model, type, software, injector/loop volume, wash cycles, solvents, volume, SPME parameters, automatic derivatization, etc)¹⁷, ²¹
Stationary phase	Separation column(s) and pre/guard column (manufacturer, model/name, stationary phase composition, particles, internal diameter, physical parameters, length, parameters of 2D chromatography, etc)¹⁷, ²¹
Mobile phase	Mobile phase (e.g. gases, solvents, buffers, pH) including their preparation protocol (information of the type of flasks, pipette, degasser, etc) post-column modifiers (if applied)¹⁷, ²¹
Separation	Separation parameters (injector temperature, sample temperature, split or splitless mode and ratio, mobile phases composition(s), gradient profile, column temperature, flow rate(s), pressure, etc)¹⁷, ²¹
Sequence	Sequence duration and length of stay of the sample in the sampler prior to analysis. Report whether the “sample set” or “sample subset” order was randomized, and the frequency of the QC analysis (all types of QC samples used)¹⁷, ²¹
Analysis or Assay ID	Update the Sample ID list, including the names or the IDs of the assay raw files. This name could include the date of the analysis, the order of the analysis and information about the protocol. The sample with the ID: SAN12 could have as Assay ID: 20201214_055_SAN12_RP_NEG, where the first part gives information about the date of analysis, the second the order of the analysis, the third the sample ID, and the fourth about the chromatography protocol.
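
The naming scheme in the last row can be generated and read back automatically, which keeps raw file names consistent across a long sequence. The sketch below follows the four parts described above. It assumes that the sample ID contains no underscore and that the trailing `RP_NEG` is a single protocol label. The function names are illustrative.

```python
from datetime import date, datetime


def build_assay_id(run_date: date, order: int, sample_id: str, protocol: str) -> str:
    """Compose date, zero-padded run order, sample ID and protocol label."""
    return f"{run_date:%Y%m%d}_{order:03d}_{sample_id}_{protocol}"


def parse_assay_id(assay_id: str) -> dict:
    """Split an assay ID into its four parts.

    Only the first three underscores are treated as separators, so a protocol
    label that itself contains an underscore (e.g. "RP_NEG") stays in one piece.
    This assumes the sample ID contains no underscore.
    """
    date_text, order_text, sample_id, protocol = assay_id.split("_", 3)
    return {
        "date": datetime.strptime(date_text, "%Y%m%d").date(),
        "order": int(order_text),
        "sample_id": sample_id,
        "protocol": protocol,
    }
```

With these helpers, `build_assay_id(date(2020, 12, 14), 55, "SAN12", "RP_NEG")` reproduces the ID used in the table, and `parse_assay_id` recovers the date, run order, sample ID and protocol from it.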

Examples:

A Waters Acquity UPLC controlled by MassLynx 4.1 was used. The column was a reversed-phase (RP) ACQUITY UPLC 1.8 µm 2.1 x 150 mm HSS T3 column (Waters); column manager was set at 40 °C; the mobile phase flow rate was 0.28 ml/min, and the eluents were water and methanol both with 0.1% formic acid. The multistep linear gradient used was as follows: 0-1 min, 100% A isocratic; 1-3 min, 100-90% A; 3-18 min, 90-60% A; 18-21 min, 60-0% A; 21-25.5 min, 0% A isocratic; 25.5-25.6 min, 0-100% A; 25.6-28 min, 100% A isocratic. The injection volume was 5 µl and the samples were kept at 4 °C throughout the analysis. The QC sample injections were used for the initial equilibration of the LC-MS system (5 injections) and controls at regular intervals (one QC sample injection every 6 real sample injections) during the sequence. The samples were analyzed according to a randomized order.²²
GC analysis was performed using a Trace GC Ultra gas chromatograph coupled with a TSQ Quantum Tandem mass spectrometer, upgraded to the XLS configuration. A DuraBrite IRIS ion source with a pre-filter was installed to improve the performance of the spectrometer. The system was equipped with a Triplus autosampler (Thermo Electron Corporation, Waltham, MA). The injection volume was 1 µL, post-injection dwell time 4 s, tray temperature 10 °C. GC separation was performed on a 30 m VF-WAXms capillary column with an internal diameter of 0.25 mm and a film thickness of 0.25 µm (Varian, Inc., USA). Temperature programme: 40 °C hold for 2 min after injection, 10 °C/min up to 50 °C, 1.4 °C/min up to 60 °C, hold for 2 min, 1.6 °C/min up to 70 °C, hold for 1 min, 2.2 °C/min up to 100 °C, hold for 0.5 min, 3.1 °C/min up to 140 °C, 4.4 °C/min up to 200 °C, 12 °C/min up to 250 °C, hold for 6 min. Injection parameters were: splitless injection, splitless time: 0.8 min, inlet temperature 250 °C, carrier gas was helium 5.5, programme flow: 0.8 mL/min hold for 62.50 min, 0.8 mL/min up to 1.2 mL/min in 0.5 min, hold for 7 min.²³
Briefly, chromatography was performed in a 1290 Agilent UPLC equipped with an RP C30 3-μm column (250 × 2.1 mm i.d.) coupled to a 20 × 4.6-mm C30 guard column (YMC Inc., Wilmington, NC, USA). A flow rate of 0.21 ml/min and an injection volume of 3 μl were adopted. The mobile phases consisted of methanol (A) and tert-methyl butyl ether (B), both containing 5 % of a mixture of water/methanol (20/80 by volume) and 0.2 % (w/v) ammonium acetate. The gradient elution consisted of 100 % A isocratically for 6 min, a step to 82.5 % A at 7 min, maintained isocratically for 5 min, followed by a linear gradient to 32.5 % A by 30 min, and these conditions were maintained for 14 min. A conditioning phase (48–60 min) was then used to return the column to the initial concentration of A. The DAD signal was acquired from 200 to 600 nm (step 1.2 nm), with a slit width of 1 nm, at a frequency of 2.5 Hz.²⁴
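
The first example above reports a randomized run order with five QC injections for equilibration and one QC injection after every six samples. The sketch below shows one way such a sequence could be generated and documented. The function name, the `QC` label and the use of a fixed random seed are assumptions made for this example, not a description of how the cited study built its sequence.

```python
import random


def build_run_sequence(sample_ids, qc_label="QC", equilibration_qcs=5,
                       qc_every=6, seed=None):
    """Return an injection order: QC equilibration runs, then shuffled
    samples with one QC injection after every `qc_every` samples."""
    rng = random.Random(seed)   # a fixed seed makes the order reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)

    sequence = [qc_label] * equilibration_qcs
    for position, sample in enumerate(shuffled, start=1):
        sequence.append(sample)
        if position % qc_every == 0:
            sequence.append(qc_label)
    return sequence
```

Recording the seed alongside the protocol makes the randomization itself reportable. Whether a closing QC injection is added when the last block holds fewer than six samples is left open here and should follow the laboratory's own practice.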

d. Mass Spectrometry

Provide details of the instrument used (manufacturer and model), ion source, ionisation mode (positive/negative), m/z range, and specific parameters such as temperatures, voltages, flow rates and scan rates.

Instrumental performance and method validation

Since the analyzed metabolites are not pre-defined in metabolomics, method validation is rather difficult. However, a minimum reporting of instrumental performance parameters is encouraged. Describe the nature of the checks and the method(s) used to ensure instrumental sensitivity, selectivity, linearity, stability, resolution and mass accuracy. The distribution of the QC samples in a PCA plot is a good indicator.
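
Two of the performance parameters named above, mass accuracy and stability, are commonly reported as simple numbers. The sketch below computes a mass error in parts per million and the relative standard deviation of one feature across repeated QC injections. It is a complement to the PCA check mentioned in the text, not a replacement for it, and the function names are illustrative.

```python
from statistics import mean, stdev


def mass_error_ppm(measured_mz: float, theoretical_mz: float) -> float:
    """Mass accuracy as a relative error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def qc_rsd_percent(intensities) -> float:
    """Relative standard deviation (%) of one feature across QC injections.

    Uses the sample standard deviation, so at least two injections are needed.
    """
    return stdev(intensities) / mean(intensities) * 100.0
```

Reporting the acceptance thresholds applied to these values, together with the values themselves, lets a reader judge the quality of the run.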

Field	Description
Protocol Title	More than one protocol can be included and so more than one title (e.g. positive or negative ionization mode, full scan, data-dependent or -independent acquisition, etc).
Instrument	Manufacturer, model number, software package and version. The majority of the instruments can be found in the EMBL/EBI²⁷ ontology. If this is the case, we recommend using the ontology terms; if not, use free text.
Sample introduction and delivery	Direct infusion (continuous or not), or after GC, CE, or LC separation. Use the Separation Protocol names (Table 4).
Ionization source	Ionization mode (EI, APCI, ESI etc.), polarity (positive or negative), vacuum pressure, skimmer/focusing lens voltages (e.g. capillary voltage etc.), gas flows (e.g. nebulization gas, cone gas etc.), source temperature.¹⁷
Mass analyzer	Type of analyzer (e.g. quadrupole, ion-trap, time-of-flight, FT-ICR, including combinations of these for hybrid instruments). The majority of the analyzers can be found in EMBL/EBI²⁷ ontology.
Acquisition mode and parameters	For a single quadrupole instrument the scan modes are full scan and SIM; for a triple quadrupole instrument common modes are full scan, product scan, precursor scan, neutral loss scan and MRM. In high-resolution MS (QTof and Orbitrap) common scan modes are: a) full scan; b) Data Dependent Acquisitions like SIM, MRM, MS/MS; and c) Data Independent Acquisitions like Swath, Sonar, MSall, MSn, MSe, MSc2, AIF-MS2, vDIA, bbCID. Report here all the parameters of the acquisition mode, such as the m/z scan range, polarity(ies), scan speed, collision energy(ies), cycle time, resolution, mass accuracy, spectral acquisition rate, vacuum pressure, various voltages etc.¹⁷
Ion Mobility	Type (DTIMS, TIMS, DMS, etc), place (e.g. before or after the quadrupole), buffer gas, separation parameters.
Technique-specific sample preparation (if relevant)	Re-suspension of sample (e.g. in MeOH:water 1:1 with 0.2% formic acid), derivatization, volume injected, and internal calibrant(s) added (if relevant).
Calibration	Calibration compound(s) and mode.¹⁷
Lock spray	Concentration, lock mass, flow rate and frequency.¹⁷
Sequence	In case of direct infusion, report sequence duration and length of stay of the sample in the sampler prior to analysis. Report whether the “sample set” or “sample subset” order was randomized, and the frequency of the QC analysis (all types of QC samples used).

Examples:

GC analysis was performed using a Trace GC Ultra gas chromatograph coupled with a TSQ Quantum Tandem mass spectrometer, upgraded to the XLS configuration. The mass spectrometer was operated in electron ionisation (EI) mode at 70 eV. The filament current was 50 μA. The temperature of the transfer line was 220 °C and argon (99.9998% purity) was used as the collision gas with a collision cell pressure of 1.2 mTorr. Dwell time was 0.03 s up to 12 min and 0.05 s further on. The mass spectrometer was tuned and calibrated using FC-43 (perfluorotributylamine (PFTBA)). Data acquisition and analyses were performed using the Xcalibur Workstation software supplied by the manufacturer.²³
Mass spectrometry detection was performed on a Waters Xevo TQMS (Milford, MA, USA) instrument equipped with an electrospray (ESI) source. Capillary voltage was 3.5 kV in positive mode and −2.5 kV in negative mode; the source was kept at 150 °C; desolvation temperature was 500 °C; cone gas flow, 50 L/h; and desolvation gas flow, 800 L/h. Unit resolution was applied to each quadrupole. Flow injections of each individual metabolite were used to optimize the MRM conditions. For the majority of the metabolites, this was done automatically by the Waters Intellistart software, whereas for some compounds the optimal cone voltages and collision energies were identified during collision-induced dissociation (CID) experiments and manually set. A dwell time of at least 25 ms was applied to each MRM transition.²⁵
The UHPLC system was coupled directly to an API 5500 triple-quadrupole mass spectrometer (Applied Biosystems/MDS Sciex, Toronto, Canada) equipped with an electrospray source. Analyst™ software version 1.6.1 (Applera Corporation, Norwalk, CT, USA) was used for instrument control and data acquisition. The transitions and spectrometric parameters were optimized individually for each standard by direct infusion of their solutions (10 µg mL⁻¹) in water/ACN (40:60 v/v) with NH₄COOH 10 mM and HCOOH 0.1% into a mass spectrometer at a flow rate of 10 μL min⁻¹. The two most abundant fragments to use as quantifier and qualifier were identified for each compound. Declustering potential (DP) and entrance potential (EP) were optimized for each precursor ion and collision energy (CE) and Collision Cell Exit Potential (CXP) for each product ion. Table 3 shows the compound-specific instrumental parameters used in the analytical method. The presence of our metabolite of interest was confirmed using the q/Q ratio. The spray voltage was set at 5500 V for positive mode and −4500 V for negative mode. The source temperature was set at 250 °C, the nebulizer gas (Gas 1) and heater gas (Gas 2) at 40 and 20 psi respectively (1 psi=6894.76 Pa). UHP nitrogen (99.999%) was used as both curtain and collision gas (CAD) at 20 and 9 psi respectively.²⁶

e. Data Transformation

Provide details of methods/pipelines and software used to transform the raw data.

Field	Description
Protocol Title	More than one protocol can be included and so more than one title (e.g. positive or negative ionization mode, full scan, data dependent or independent acquisition, etc)
Raw data format	Report the format of the original raw data, as registered by the instrument and its software. ¹⁴
Data conversion	The raw data are often converted to other formats, “open” or not, such as NetCDF, mzXML, mzML, etc., for further analysis. Report the software used for the data conversion, its version and the parameters used.¹⁴
Data pre-treatment	The original or converted data are often processed before the statistical analysis. For MS data, the processing might include peak picking, background subtraction, noise reduction, time or m/z filtering, alignment, spectral deconvolution, smoothing, binning, data reduction, filling of missing peaks, etc. NMR data pre-processing might include phase correction, conversion from time to frequency domain, degree of zero filling, degree of linear prediction, data reduction, baseline correction, etc. Report the software and its version, together with the parameters used. The most popular software packages are MZmine, XCMS, MS-DIAL, metaMS, Progenesis QI and MetAlign.
Data treatment	The obtained peak table from the data pre-treatment can be further treated with Normalization and Scaling tools. Report the software, its version and parameters used.⁹
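
As an illustration of what a "Data treatment" entry needs to document, the sketch below applies two common steps to a small peak table: total-sum normalization of each sample, followed by autoscaling (mean-centring and division by the standard deviation) of each feature. These two methods are chosen here only as examples; the study's own normalization and scaling choices should be reported as actually applied.

```python
from statistics import mean, stdev


def normalize_to_total(sample):
    """Divide each feature intensity by the sample's summed intensity."""
    total = sum(sample)
    return [value / total for value in sample]


def autoscale(column):
    """Mean-centre one feature across samples and divide by its standard
    deviation. A feature with zero variance cannot be scaled this way."""
    centre, spread = mean(column), stdev(column)
    return [(value - centre) / spread for value in column]


def treat(peak_table):
    """Apply total-sum normalization per sample, then autoscaling per feature.

    peak_table -- list of samples, each a list of feature intensities
    """
    normalized = [normalize_to_total(sample) for sample in peak_table]
    columns = [autoscale(list(col)) for col in zip(*normalized)]
    return [list(row) for row in zip(*columns)]
```

The order of the steps matters, which is one reason the protocol should state the sequence and parameters explicitly.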

f. Metabolite Identification

Provide details of methods/pipelines, reference databases and software used to identify features and/or annotate metabolites.

Field	Description
Protocol Title	More than one protocol can be included (and so more than one title), since different instruments can be used for the same study
Annotation confidence	Correct peak or metabolite annotation is crucial for the interpretation of the results. Because of the large number of measured features, annotating all of them is impossible, the confidence level varies between annotations, and many features will remain unknown. It is important to report the confidence of each annotation. The three major methods in use are described below; it is mandatory to use one of them.
4 levels annotation ¹³	This is the most common method used to report the annotation confidence in metabolomics. Includes the following levels of annotation: 1. Identified compounds. A minimum of two independent and orthogonal data relative to an authentic compound analyzed under identical experimental conditions are proposed as necessary to validate non-novel metabolite identifications (e.g. retention time/index and mass spectrum, retention time and NMR spectrum, accurate mass and tandem MS, accurate mass and isotope pattern, full 1H and/or 13C NMR, 2-D NMR spectra). The use of literature values reported for authentic samples by other laboratories is generally believed insufficient to validate a confident and rigorous identification. The use of literature or external laboratory data results in level 2 identifications. 2. Putatively annotated compounds. If spectral (MS or NMR) matching is utilized in the identification process then the authentic spectra used for the spectral matching should be described appropriately or libraries made publicly available. It is preferred that the reference spectra are made available at no cost, but the CAWG recognizes that this may not always be possible for commercialized libraries (NIST, Wiley, etc.). However, the premise of this minimum is that authors document and provide spectral evidence to validate the metabolite identifications. If the authors choose not to provide the experimental evidence to support the identifications, then the identifications should be reported as ‘putative identifications’. 3. Putatively characterized compound classes. Metabolite identifications based upon additional orthogonal data (i.e. more than two) are highly advantageous, provide additional confidence, and are often necessary to provide unambiguous identification of stereo configuration. Additional data consistent with best chemical practices might include: selective solvent extraction, retention time, m/z, photodiode array spectra, λmax and εmax, chemical derivatization, isotope labelling, 2D NMR, IR spectra, etc. 4. Unknown compounds: although unidentified or unclassified, these metabolites can still be differentiated and quantified based upon spectral data.
5 levels annotation ²⁸	This is the second most used method to report the annotation confidence in metabolomics. It includes the following levels of annotation: Level 1: Confirmed structure represents the ideal situation, where the proposed structure has been confirmed via appropriate measurement of a reference standard with MS, MS/MS and retention time matching. If possible, an orthogonal method should also be used. Level 2: Probable structure indicates that it was possible to propose an exact structure using different evidence. For Level 2a: a library that involves matching literature or library spectrum data where the spectrum-structure match is unambiguous. Care is needed when comparing spectra recorded with different acquisition parameters (e.g., resolution, collision energy, ionization, MS level) to ensure the validity of the match and decision criteria should be clearly presented. Desirable additional evidence such as retention behavior would require, for example, a retention index for both the measured and matched spectrum, established for GC-MS but not yet sufficiently for LC-MS-based techniques. Level 2b: diagnostic represents the case where no other structure fits the experimental information, but no standard or literature information is available for confirmation. Evidence can include diagnostic MS/MS fragments and/or ionization behavior, parent compound information and the experimental context. Level 3: Tentative candidate(s) describes a “grey zone”, where evidence exists for possible structure(s), but insufficient information for one exact structure only (e.g., positional isomers). Although there are many shades of grey (i.e., certainty) possible for many different situations, for all intents and purposes the exact structure remains speculative at level 3. Level 4: Unequivocal molecular formula is possible when a formula can be unambiguously assigned using the spectral information (e.g., adduct, isotope, and/or fragment information), but insufficient evidence exists to propose possible structures. The MS/MS could be uninformative, contain interferences or not even exist. Level 5: Exact mass (m/z) can be measured in a sample and be of specific interest for the investigation, but lack information to assign even a formula. Screening and nontarget methods allow the tracing of these masses in other investigations, but level 5 indicates that no unequivocal information about the structure or formula exists. It is even possible to record the MS/MS of a level 5 mass and save it as an “unknown” spectrum in a database. This level should only apply to a few masses of specific interest since it would be counterproductive to label all masses in a sample as level 5. Blank measurements should be used to ensure the substance does not arise from sample preparation or measurement.
Metabolomics Society’s Metabolite Identification Task Group²⁹	The metabolomics community recently released a new method by introducing new subclasses (A-F) for unambiguous metabolite annotation. The proposed levels are: A: Known enantiomer. A single defined enantiomer or a single defined achiral metabolite. Known molecular formula, structure and stereochemistry including chirality. Usually requires isolation of metabolite and full structure determination or chiral chromatography on metabolite in a mixture to prove chirality and matching of two orthogonal pieces of data with an authentic chemical standard; For achiral metabolites requires matching of two orthogonal pieces of data with authentic chemical standard (e.g. RT and MS/MS mass spectrum). B: Known diastereomer. One of two enantiomers. Known molecular formula, structure and stereochemistry but unknown chirality. Requires matching of two orthogonal pieces of data with authentic chemical standard (e.g. RT and MS/MS mass spectrum). C: Known structure/DB position. One of a number of stereoisomers, e.g. E/Z geometric or cis-/trans- ring isomers. Known molecular formula and structure but unknown stereochemistry. Requires matching of two orthogonal pieces of data with authentic chemical standard (e.g. RT and MS/MS mass spectrum). D: Known functional group. One of a number of positional isomers. Known molecular formula and metabolite class but unknown structure; e.g. high-resolution mass spectrometry provides unique and unambiguous single molecular formula and additional data proves metabolite class membership. E: Known formula. One of a number of possible compounds of known molecular formula. Known molecular formula but unknown structure; e.g. high-resolution mass spectrometry provides the unique and unambiguous single molecular formula. F: Known structural class. Specific spectral features defining a structural class. Unknown molecular formula but known class of metabolite; Characteristic signals of metabolite class in sample. G: Known formula. Specific spectral features. Unknown molecular formula; Characteristic signals of unknown metabolite in the sample
Metabolite ID	Identify the annotated metabolites with a unique identifier (ID) that corresponds to one of the following databases: Chemical Entities of Biological Interest (ChEBI), HMDB, FooDB, KEGG, ChemSpider, PubChem, COlleCtion of Open Natural ProdUcTs (COCONUT), or CAS.

Examples:

VOCs were identified by comparing the retention times of individual peaks with the retention times of their reference standards, and by identifying the mass spectra using the NIST library. The ratio of each VOC area to the d₁₆-octanal internal standard area was considered to reduce technical variability among extractions and chromatographic runs, and VOC quantities were expressed as μg/kg of berry of d₁₆-octanal equivalents.²⁰
The acquired spectra were directly converted to NetCDF files using Databridge software (Waters). Peak picking, alignment, and principal component analysis (PCA) were performed using the automated data analysis pipeline MetaDB, developed at our institution. This package supports the execution of experiments compatible with the concept of interoperable bioscience data, including the production of a validated experimental data set with the relevant metadata in ISA-Tab format. The workflow of MetaDB consists of six different steps: (1) upload of metadata in ISA-Tab format; (2) preparation of MS acquisition sequence, including sample randomization; (3) upload of raw and derived spectral data files; (4) data processing for feature alignment and detection with metaMS; (5) visualization of data for quality assessment; (6) preparation of data for upload to public repositories (MetaboLights).³⁰
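
The internal-standard normalisation described in the first example can be illustrated with a short calculation. This is only a minimal sketch: it assumes a response factor of 1 between analyte and internal standard (semi-quantification as equivalents), and the function name, parameter names and numbers are hypothetical, not taken from the cited study.

```python
def is_equivalents_ug_per_kg(analyte_area, istd_area, istd_amount_ug, berry_mass_kg):
    """Express an analyte as internal-standard equivalents (ug per kg of berry).

    Assumes a response factor of 1 between the analyte and the internal standard.
    """
    if istd_area <= 0 or berry_mass_kg <= 0:
        raise ValueError("internal standard area and berry mass must be positive")
    area_ratio = analyte_area / istd_area        # corrects for run-to-run variability
    return area_ratio * istd_amount_ug / berry_mass_kg


# Hypothetical values: 0.5 ug of internal standard spiked into 2 g of berry tissue.
concentration = is_equivalents_ug_per_kg(
    analyte_area=50_000, istd_area=200_000, istd_amount_ug=0.5, berry_mass_kg=0.002
)
print(f"{concentration:.1f} ug/kg of internal standard equivalents")
```

Reporting the formula explicitly in the ‘Metabolite identification’ or ‘Data transformation’ protocol makes the reported concentrations easier to reproduce.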

Samples, Assays and Metabolites

The next three sections, “Samples”, “Assays” and “Metabolites”, are dedicated to entering the metadata and results of the experiment. The tables can be filled in online; alternatively, once all additional columns have been added, the files can be downloaded, edited in Excel and re-uploaded in their final form. The basic structure of the tables must not be altered: existing columns must not be removed or changed, and file names and extensions must be retained as they are.

The sample information file (s_MTBLSxxx.txt) should provide all relevant facts about each sample included in the study. Sample metadata should include a unique sample name, organism, organism part, etc., as seen in the dedicated protocol above. Further sample descriptors should be included where available by selecting +Factor to add new columns (e.g. Days after anthesis, Treatment). Term selection is facilitated by a drop-down menu that shows the most relevant ontology terms. If no ontology term is available, free text can be typed instead. The protocol of reference is ‘Sample collection’.

More samples can be added to the sample table using +Samples and pasting a list or selecting to import Raw data file names if appropriate. There is also the option to add as many new rows as required with +Rows and edit cells individually.

The assay information file (a_MTBLSxxx_technique.txt) describes the assay process for each sample and connects the sample name to both its corresponding raw data file and its metabolite identification table. Multiple assays can be added per study.

For LC-MS and GC-MS files, the predefined columns include the protocols of reference (Extraction, Chromatography, Mass Spectrometry, Data Transformation, Metabolite Identification), followed by the instrument and column used and the technical parameters, such as column type, scan polarity, range, etc. Here too, a drop-down menu with controlled vocabulary helps users fill in the tables, so that all studies using the same metadata terms can be found easily.

The metabolite information file (m_MTBLSxxx_technique_MAF.tsv) is the file where users must add as much information as possible on the metabolites identified in the study and report their final concentration for each sample. Important features to add are metabolite name, ChEBI ID, chemical formula, SMILES and InChI codes. Other parameters, such as mass to charge and retention time, are useful for reproducibility.
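
Before re-uploading an edited metabolite file, it can help to check locally which rows still lack the descriptors listed above. The sketch below is illustrative only: the column names are assumptions and should be compared with the header of the MAF downloaded from your own study, and the example row uses a placeholder identifier.

```python
import csv
import io

# Assumed column names; verify them against the header of your downloaded MAF.
DESCRIPTOR_COLUMNS = [
    "metabolite_identification",
    "database_identifier",
    "chemical_formula",
    "smiles",
    "inchi",
]


def find_incomplete_rows(maf_text):
    """Return (row_number, missing_columns) for every row lacking a descriptor."""
    reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t")
    problems = []
    for row_number, row in enumerate(reader, start=1):
        missing = [c for c in DESCRIPTOR_COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            problems.append((row_number, missing))
    return problems


# Hypothetical one-row MAF with empty SMILES and InChI cells.
example_maf = (
    "database_identifier\tchemical_formula\tsmiles\tinchi\t"
    "metabolite_identification\tmass_to_charge\tretention_time\n"
    "CHEBI:00000\tC15H10O7\t\t\tquercetin\t303.0499\t412.3\n"
)

for row_number, missing in find_incomplete_rows(example_maf):
    print(f"row {row_number}: missing {', '.join(missing)}")
```

Reading the file with the standard `csv` module rather than a spreadsheet program leaves the column order and file extension untouched.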

Files

When a study is created, the MetaboLights team allows users to upload the raw files through either FTP or the Aspera client, and provides the credentials needed to access the remote folder on the EBI FTP server. We recommend using the FileZilla software.

The users receive an email including FTP setting parameters such as:
user: mtblight
password: ****
server: ftp-private.ebi.ac.uk
remote folder: /prod/-obfuscation_code

Please be aware that the remote folder path must be typed in full, as the folder is not browsable; use “cd /prod/-obfuscation_code” to access the private folder. Files and folders to be uploaded must not be zip-compressed.
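
For users who prefer scripting to a graphical client, the same upload can be sketched with Python's standard `ftplib` module. This is a minimal sketch under stated assumptions: the credentials and file name are placeholders to be replaced with the values from the e-mail, the helper names are hypothetical, and only single files are handled (directory-based vendor formats would need a recursive upload, which is not covered here).

```python
from ftplib import FTP
from pathlib import Path


def upload_raw_files(ftp, remote_folder, local_paths):
    """Enter the private folder by its full path and upload each file uncompressed."""
    paths = [Path(p) for p in local_paths]
    for path in paths:
        if path.suffix.lower() == ".zip":
            raise ValueError(f"{path.name}: zip-compressed files must not be uploaded")
    ftp.cwd(remote_folder)  # the folder is not browsable, so the full path is required
    uploaded = []
    for path in paths:
        with path.open("rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)
        uploaded.append(path.name)
    return uploaded


def main():
    # Placeholders: use the user, password and remote folder received by e-mail.
    with FTP("ftp-private.ebi.ac.uk") as ftp:
        ftp.login(user="<user from e-mail>", passwd="<password from e-mail>")
        print(upload_raw_files(ftp, "/prod/-obfuscation_code", ["sample_01.raw"]))
```

Calling `main()` after filling in the placeholders performs the upload; passing the connection as an argument keeps the helper easy to try out against a test server first.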

Each sample present in the study should have a corresponding raw data file, both of which should be referenced within the assay table.
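
This sample-to-file correspondence can be checked locally before submission. The sketch below assumes the column headers `Sample Name` and `Raw Spectral Data File`; confirm them against your own s_ and a_ files. The table contents are invented for illustration.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def samples_without_raw_file(sample_rows, assay_rows,
                             sample_col="Sample Name",
                             raw_col="Raw Spectral Data File"):
    """List sample names that have no raw data file referenced in the assay table."""
    referenced = {
        row[sample_col] for row in assay_rows if (row.get(raw_col) or "").strip()
    }
    return [row[sample_col] for row in sample_rows if row[sample_col] not in referenced]


# Hypothetical tables: berry_03 is missing from the assay table.
samples = read_tsv("Sample Name\nberry_01\nberry_02\nberry_03\n")
assays = read_tsv(
    "Sample Name\tRaw Spectral Data File\n"
    "berry_01\tberry_01.raw\n"
    "berry_02\tberry_02.raw\n"
)
print(samples_without_raw_file(samples, assays))
```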

The accepted file formats for data are the following:

Raw file formats: d, raw, idb, cdf, wiff, scan, dat, cmp, cdf.cmp, lcd, abf, jpf, xps, mgf.

Derived file formats: mzml, nmrml, mzxml, xml, mzdata, cef, cnx, peakml, xy, smp, scan.
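
A quick local check of file names against the two lists above can catch unsupported files before upload. This is a sketch only: the extension sets are copied from the lists above, matching is done case-insensitively on the file name ending (an assumption), and the function name is hypothetical. Note that `scan` appears in both lists, so such a file matches both categories.

```python
RAW_FORMATS = {
    "d", "raw", "idb", "cdf", "wiff", "scan", "dat", "cmp",
    "cdf.cmp", "lcd", "abf", "jpf", "xps", "mgf",
}
DERIVED_FORMATS = {
    "mzml", "nmrml", "mzxml", "xml", "mzdata", "cef", "cnx",
    "peakml", "xy", "smp", "scan",
}


def classify_file(filename):
    """Return the categories ('derived', 'raw') whose extensions match the file name."""
    name = filename.lower()
    kinds = []
    if any(name.endswith("." + ext) for ext in DERIVED_FORMATS):
        kinds.append("derived")
    if any(name.endswith("." + ext) for ext in RAW_FORMATS):
        kinds.append("raw")
    return kinds


for example in ["QC_01.raw", "QC_01.mzML", "blank.cdf.cmp", "run.scan", "archive.zip"]:
    print(example, classify_file(example) or "not an accepted format")
```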

Validation Rules

A study must meet certain requirements and pass automatic validation to progress to the curation stage. There are four validation flags (success, information, warning and error), which can be viewed individually by selecting them from the drop-down menu. To progress with the study, users must address all errors and should resolve all warnings. Below are some examples.

♦ Successfully read the investigation file

♦ Successfully found one or more samples/factors/descriptors

♦ Could not find any assays

♦ Found a publication (title/author list/DOI..)

♦ Protocol ‘Sample collection’, ‘Extraction’, ‘Chromatography’, ‘Mass spectrometry’, ‘Data transformation’, ‘Metabolite identification’, match the protocol type definition and the protocol description are validated

♦ Data transformation description should be more than just one sentence

♦ Sample column ‘Sample Name’, ‘Characteristics[Organism]’, ‘Characteristics[Organism part]’, …found in the sample file

♦ Sample column ‘Characteristics[Sample type]’ was not found

♦ No raw or derived files found

♦ Incomplete Metabolite Annotation File (MAF)
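
The rule stated above (errors block progression, warnings should be resolved) can be expressed as a small helper for triaging a list of validation messages. This is a hypothetical sketch: the flag assigned to each message below is an assumption made for illustration, not the status reported by the MetaboLights validator.

```python
from collections import Counter

FLAGS = ("success", "information", "warning", "error")


def summarise_validation(messages):
    """Count messages per flag and report whether the study can progress.

    Errors block progression; warnings are listed but do not block.
    """
    counts = Counter(flag for flag, _ in messages)
    unknown = set(counts) - set(FLAGS)
    if unknown:
        raise ValueError(f"unknown validation flag(s): {sorted(unknown)}")
    return {
        "counts": {flag: counts[flag] for flag in FLAGS},
        "can_progress": counts["error"] == 0,
        "to_review": [text for flag, text in messages if flag == "warning"],
    }


# Flags below are assumed for illustration only.
report = summarise_validation([
    ("success", "Successfully read the investigation file"),
    ("warning", "Data transformation description should be more than just one sentence"),
    ("error", "No raw or derived files found"),
])
print(report["can_progress"], report["counts"])
```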

Available Datasets

Below is a list of publicly available datasets for the Vitis/grape metabolome.

ID	Title (year)
MTBLS2876	Mono-Locus and Pyramided Resistant Grapevine Cultivars Reveal Early Putative Biomarkers Upon Artificial Inoculation With Plasmopara viticola (2021)
MTBLS784	LC-MS untargeted approach showed that methyl jasmonate application on Vitis labrusca L. grapes increases phenolics at subtropical Brazilian regions (2020)
MTBLS898 MTBLS982 MTBLS984	Multi-Omics and Integrated Network Analyses Reveal New Insights into the Systems Relationships between Metabolites, Structural Genes, and Transcriptional Regulators in Developing Grape Berries (Vitis vinifera L.) Exposed to Water Deficit (Phenolics, UPLC-MS/MS; Volatile organic compounds, GC-MS; Carotenoids, LC-DAD) (2019)
MTBLS897 MTBLS892 MTBLS889	Transcriptome and metabolite profiling reveals that prolonged drought modulates the phenylpropanoid and terpenoid pathway in white grapes (Vitis vinifera L.) (Phenolics, UPLC-MS/MS; Volatile organic compounds, GC-MS; Carotenoids, LC-DAD) (2019)
MTBLS968	The accumulation profiles of terpene metabolites in three Muscat table grape cultivars through HS-SPME-GCMS (2019)
MTBLS209	The measurement of spatial distribution of grape metabolites in berry tissues using LC-MS (2017)
MTBLS392	Core Microbiota and Metabolome of Vitis vinifera L. cv. Corvina Grapes and Musts (2017)
MTBLS346	Metabolic changes of genetically engineered grapes (Vitis vinifera L.) studied by 1H-NMR, metabolite heatmaps and iPLS (2016)
MTBLS85	High-throughput carotenoid profiling using multivariate curve resolution (Stability of isoprenoids in grape extract) (2014)
MTBLS39	The plasticity of the grapevine berry transcriptome (2013)
V02001	Grape berry – vintage terroir Bordeaux – ACA 2006 (2006)

Provided by

Panagiotis Arapitsas, Stefania Savoi, Fulvio Mattivi

Primary contacts: panagiotis.arapitsas@fmach.it | savoi.stefania@gmail.com | fulvio.mattivi@unitn.it