MetaboAnalyst

Data Formats:

Various example datasets are available for different analysis purposes. You can download them to inspect their formats, or scroll down for more detailed instructions.

Analysis Path	Title	Download	Description
From LC-MS Spectra to Feature/Compound Table	Small test spectra (mzML)	IBD_small.zip	A trimmed small MS1 dataset (10 samples)
	Malaria raw spectra (mzML)	malaria_raw.zip	An experimental raw MS1 spectra dataset (15 samples)
	Blood samples (mzML)	blood_samples.zip	A blood spectra dataset (MS1+DDA), containing MS1 and DDA-based MS2
	COVID-19 dataset (mzML)	swath_dia_covid.zip	An experimental raw MS1+SWATH-DIA spectra COVID-19 dataset (16 samples)
From MS Peaks to Functions	MS peak table	malaria_feature_table.csv	Peak table of Malaria (MTBLS665) study
	MS peak list	mummichog_ibd.txt	An MS peak list (3 columns: m/z, p-value, and t-score) for functional analysis
	Multiple peak tables	A1_pos.csv	3 MS peak tables from a COVID-19 study for functional meta-analysis
		B1_pos.csv
		C1_pos.csv
Statistics [one factor] and Biomarker Analysis	Concentration table	cow_diet.csv	A metabolite concentration table from cow rumen samples with four groups
	Concentration table	human_cachexia.csv	A metabolite concentration table from human urine samples with two groups
	Peak Intensity table	lcms_table.csv	A peak intensity table from mice spinal cord samples with two groups
	NMR/MS spectra data	nmr_bins.csv	Binned spectral data for statistical analysis
	mzTab 2.0-M	MouseLiver_negative.mzTab	mzTab 2.0-M file example data
	Zipped files	nmr_peaks.zip	NMR data with 2 columns (chemical shift and intensity)
		lcms_peaks_2col.zip	MS data with 2 columns (mass and intensity)
		lcms_peaks_3col.zip	MS data with 3 columns (mass, retention time, and intensity)
Statistics [metadata table] and Covariate Analysis	Time-series data	cress_time.csv	Peak table of a time-series study across two conditions
	Time-series data	cress_time_meta.csv	Peak table of a time-series study across two conditions
	Data and metadata	TCE_feature_table.csv	A peak intensity table from a trichloroethylene (TCE) exposure study for covariate analysis. Two files included (a peak table + metadata).
	Data and metadata	TCE_metadata.csv
Multi-omics Integration	Gene and compound lists	integ_genes_1.txt	Integration analysis of transcriptomics and metabolomics data (compounds) from a COVID-19 study.
	Gene and compound lists	integ_cmpds.txt
	Gene and peak lists	integ_genes_2.txt	Integration analysis of transcriptomics and metabolomics data (untargeted, peaks) from a malaria study.
	Gene and peak lists	integ_peaks.txt
	Protein and compound lists	integ_genes_3.txt	Integration analysis of proteomics and metabolomics data (compounds, HMDB) from a COVID-19 study.
	Protein and compound lists	integ_cmpds_3.txt

Comma Separated Values (.csv) or Tab Delimited Text (.txt):

These two formats are used for concentration data, peak intensity tables, and MS/NMR spectral bins. Samples can be in either rows or columns. Note:

Sample and feature names must be unique and consist only of common English letters, underscores, and numbers. Latin/Greek letters are not supported.
Statistical Analysis [one factor] module: for statistical analysis with one factor (two or multiple groups), class labels must immediately follow sample names.
Statistical Analysis [metadata table] module: for statistical analysis with multiple factors (including time series), users need to upload a separate metadata table.
For time-series data, the time-point group must be named Time. In addition, samples collected from the same subject at different time points should be consecutive. For more details, please see the screenshot demo for "Metadata / Time-series".
Data values (concentrations, bins, peak intensities) must be numeric and positive; use an empty cell or NA for missing values. In addition, numbers must not contain spaces. For instance, 1 600 should be formatted as 1600; otherwise the value will be read as 1.
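
The rules above can be checked before upload. The following Python sketch is not part of MetaboAnalyst; it assumes a comma-separated table with samples in rows (sample name, class label, then features), and the function name check_table is hypothetical. The text does not say whether zero counts as positive, so the sketch only flags negative numbers.

```python
import csv
import io
import re

NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")  # letters, digits, underscores
MISSING = {"", "NA"}  # markers accepted for missing values


def check_table(csv_text):
    """Return a list of problems in a samples-in-rows table.

    Assumed layout: column 1 = sample name, column 2 = class label,
    remaining columns = features.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    features = header[2:]
    samples = [row[0] for row in body]
    problems = []

    for kind, names in (("sample", samples), ("feature", features)):
        if len(set(names)) != len(names):
            problems.append(f"duplicate {kind} names")
        for name in names:
            if not NAME_PATTERN.match(name):
                problems.append(f"unsupported characters in {kind} name {name!r}")

    for row in body:
        for feature, value in zip(features, row[2:]):
            value = value.strip()
            if value in MISSING:
                continue
            try:
                number = float(value)
            except ValueError:
                # Catches entries such as "1 600" that contain a space.
                problems.append(f"{row[0]}/{feature}: {value!r} is not numeric")
                continue
            if number < 0:
                problems.append(f"{row[0]}/{feature}: {value!r} is negative")
    return problems
```

An empty list means the table passed these particular checks; it does not guarantee that the upload will be accepted.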

mzTab 2.0-M files (.mzTab)

MetaboAnalyst now supports the upload of mzTab files in the Statistical Analysis module. MetaboAnalyst parses both the Metadata Table (MTD) and the Small Molecule Table (SML) into a MetaboAnalyst-ready data table. From the SML, users can choose to name their features using either the "chemical_name" or the "theoretical_neutral_mass". However, if too many of these values are missing, the features will be named with the "SML_ID". Further, if there are duplicate names, the "SML_ID" will be appended to the end of the selected feature identifier. From the MTD, any "study_variable" labeled "Blank" will be excluded from the final data table. Note that MetaboAnalyst supports only mzTab-M 2.0 files that have been validated, to ensure that the files can be read by our software.

Zipped files (.zip)

For NMR/MS peak list files and GC/LC-MS spectra data, users need to upload a zipped folder containing data files from the different groups under study (one file per spectrum and one sub-folder for each group). For paired comparisons, users need to upload a separate text file specifying the pairing information.

GC/LC-MS spectra must be in NetCDF, mzXML, or mzDATA format. The spectra should be stored in two separate folders according to their class labels and then compressed into a zip file. Please note that the program is not compatible with the most recent WinZip (v12.0) using its default option; make sure to select Legacy compression (Zip 2.0 compatible) when compressing files. No spaces are allowed in either the folder names or the spectra names. The size limit for each uploaded zip file is 50 MB. Please contact the author if you wish to upload a larger dataset.

The peak list data is composed of peak list files organized into separate folders named by their class labels. For example, if your data contains three groups, the peak list files should be organized into three folders accordingly. Compress these folders into a single zip file then upload them to MetaboAnalyst.

NMR peak list files should contain two comma-separated columns, with the first column for peak positions (ppm) and the second column for peak intensities. MS peak list files can be in either two-column (mass and intensity) or three-column (mass, retention time, and intensity) format, but not a mixture of both. The first line of each peak list file is reserved for column labels. The files must be saved in comma-separated values (.csv) format.
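
As an illustration of the folder layout described above, the following Python sketch writes one .csv peak list per sample into one folder per group and packs them into a single zip archive. It is only a sketch: the function name build_peak_list_zip, the column labels, and the group and sample names in the usage note are hypothetical, and the archive uses Python's standard deflate compression.

```python
import csv
import io
import zipfile


def build_peak_list_zip(groups, columns=("mass", "intensity")):
    """Return the bytes of a zip archive with one folder per group.

    groups maps a group name to {sample name: list of peak rows}.
    """
    if len(columns) not in (2, 3):
        raise ValueError("peak lists need two or three columns")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for group, samples in groups.items():
            for sample, peaks in samples.items():
                # Two- and three-column rows must not be mixed.
                if any(len(row) != len(columns) for row in peaks):
                    raise ValueError(f"{sample}: mixed column counts")
                text = io.StringIO()
                writer = csv.writer(text, lineterminator="\n")
                writer.writerow(columns)  # first line is reserved for labels
                writer.writerows(peaks)
                archive.writestr(f"{group}/{sample}.csv", text.getvalue())
    return buffer.getvalue()
```

For example, build_peak_list_zip({"control": {"c1": [(100.5, 2000)]}, "disease": {"d1": [(101.2, 1500)]}}) yields an archive containing control/c1.csv and disease/d1.csv.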

Samples in rows (unpaired)

Each row represents data from one sample. The class label is in the second column. For unpaired comparisons, the class label can be either numeric (e.g. 0/1) or character (e.g. Healthy/Disease).

Samples in rows (paired)

For paired comparisons, there must be an even number (n) of samples. The class labels must be integers from -n/2 to -1 and from 1 to n/2. Samples whose class labels have the same absolute value are treated as a pair. In the example below, Patient1_d0 and Patient1_d3 are a pair.
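
The pairing rule can be sketched in Python as follows. This is an illustrative check, not MetaboAnalyst code; the function name find_pairs is hypothetical, and the choice to list the negative-label sample first in each pair is an assumption made only for this example.

```python
def find_pairs(samples, labels):
    """Pair samples whose integer class labels share an absolute value.

    With n samples, the labels must be exactly -n/2..-1 and 1..n/2,
    each used once.
    """
    n = len(labels)
    half = n // 2
    expected = list(range(-half, 0)) + list(range(1, half + 1))
    if n == 0 or n % 2 != 0 or sorted(labels) != expected:
        raise ValueError("labels must be -n/2..-1 and 1..n/2, each used once")
    by_label = dict(zip(labels, samples))
    # Negative label first, positive label second (arbitrary convention).
    return [(by_label[-k], by_label[k]) for k in range(1, half + 1)]
```

For four samples labeled -1, 1, -2, 2, the function returns two pairs; a label set such as 1, 1, -1, -2 is rejected.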

Samples in columns (unpaired)

Samples can also be in columns, in which case each row represents a measured variable. The class label must be in the second row. The requirements for class labels are the same as those for samples in rows, for both paired and unpaired comparisons. The screenshot below shows the unpaired case.

Samples in columns (paired)

The screenshot below shows a subset of binned NMR spectra data (bin width 0.04 ppm). In this table, the samples from controls (e.g. Contr_1) are paired with the samples from subjects with disease (e.g. Disease_1) based on some criteria (e.g. age, weight, gender). Each sample occupies a column, and the second row is used for class labels.

Peak intensity table

The screenshot below shows an LC-MS peak intensity table. Each column represents peaks from one sample. The peaks are grouped and identified by their retention time and mass. The class labels are in the second row, immediately following the sample names.

Metadata table containing multiple factors and covariates

This is a general table containing various descriptors for the data to be analyzed:

The sample IDs must be identical to those in the metabolomics data.
The column after the sample IDs should be the primary metadata of interest.
The metadata can contain either categorical values (with at least three replicates per group) or continuous values (covariates).
Missing values are not allowed; you will be asked to manually "fix" any missing values that are detected.
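
These requirements can be checked with a few lines of Python. The sketch below is illustrative only; check_metadata is a hypothetical helper, and it assumes the metadata has already been read into a dictionary keyed by sample ID, with empty strings or NA marking missing values.

```python
from collections import Counter

MISSING = {None, "", "NA"}  # assumed markers for missing metadata values


def check_metadata(data_samples, metadata, categorical):
    """Return a list of problems in a metadata table.

    metadata maps sample ID -> {column: value}; categorical lists the
    columns whose groups need at least three replicates.
    """
    problems = []
    if set(data_samples) != set(metadata):
        problems.append("sample IDs differ between data and metadata")
    for sample, row in metadata.items():
        for column, value in row.items():
            if value in MISSING:
                problems.append(f"{sample}: missing value in {column}")
    for column in categorical:
        counts = Counter(row.get(column) for row in metadata.values())
        for group, count in counts.items():
            if group not in MISSING and count < 3:
                problems.append(f"{column}: group {group!r} has {count} replicate(s)")
    return problems
```

Continuous covariates are simply left out of the categorical list, so only the missing-value check applies to them.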

A screenshot of a metadata table is shown below.

Time-series data only

This design requires two factors: the time-point column must be labeled Time, and the other column must be labeled Subject and contain the subject IDs across the different time points. Samples should be balanced (i.e. no missing time points for any subject). A screenshot of an example dataset with samples in rows is shown below.
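
Whether a design is balanced can be verified as in the following Python sketch. It assumes the Subject and Time columns have been read into a list of (subject, time) tuples; the function name unbalanced_subjects is hypothetical.

```python
from collections import defaultdict


def unbalanced_subjects(records):
    """Map each subject to the time points it is missing.

    records is a list of (subject, time) tuples taken from the Subject
    and Time columns. An empty result means the design is balanced.
    """
    all_times = {time for _, time in records}
    seen = defaultdict(set)
    for subject, time in records:
        seen[subject].add(time)
    return {
        subject: sorted(all_times - times)
        for subject, times in seen.items()
        if times != all_times
    }
```

For example, if subject S2 was measured at time 0 but not at time 1, the result is {"S2": [1]}.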

Time-series + one experimental factor

This design requires three factors: the experimental factor must be labeled Phenotype, the time-point column must be labeled Time, and the remaining column must be labeled Subject and contain the subject IDs across the different time points. The screenshot illustrates the appropriate structure of a time-series data table. The data shown contains 24 samples measured at three time points from 6 subjects under two conditions (MT and WT).

For NMR/MS peak list data, users need to upload a zipped folder containing data files from the different groups under study (one file per spectrum and one sub-folder for each group). For paired comparisons, users need to upload a separate text file specifying the pairing information.

The pairing information is encoded using the two sample names (without suffix) separated by a colon ":", with one pair per line, and uploaded as a text file (.txt). The screenshot below illustrates the data structure for peak list data as well as the specification of paired samples:
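
A pairing file in this format could be generated as in the Python sketch below. It interprets "without suffix" as "without the file extension", which is an assumption; the function name pair_lines is hypothetical.

```python
import os


def pair_lines(pairs):
    """Return the text of a pairing file: one "name1:name2" line per pair.

    pairs is a list of (filename, filename) tuples; file extensions are
    stripped (assumed meaning of "without suffix").
    """
    lines = []
    for first, second in pairs:
        name_a = os.path.splitext(first)[0]
        name_b = os.path.splitext(second)[0]
        lines.append(f"{name_a}:{name_b}")
    return "\n".join(lines) + "\n"
```

The returned text can then be saved as a .txt file and uploaded alongside the zip archive.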

Sample Preparation and Raw Spectra Acquisition

Here, we share an in-house experimental protocol from our team to show how to prepare metabolomics samples and perform a standard LC-MS data acquisition, using a human islets case study as an example. Click here to read the details.

Raw Spectra Data for Processing

LC-MS spectra must be in mzML, mzXML, or mzDATA format. The msconvert tool in ProteoWizard can convert most common vendor formats.
Centroiding the MS data is required for online processing. This can be done during format conversion using ProteoWizard, except for NetCDF files.
To facilitate this step, we have developed the CentroidMSData() function (available in MetaboAnalystR) to centroid all common formats (including NetCDF).
Only standard mzXML (example) or mzML (example) files are accepted. Non-standard mzXML or mzML files will cause exceptions.
The size limit for each zip file is 200 MB for online processing. Use MetaboAnalystR to process larger spectral files.
No spaces are allowed in spectra names. Use underscores instead (e.g. my_spectrum1.mzML).
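
The naming and format rules in this list can be pre-checked locally. The Python sketch below is illustrative only; check_spectra_names is a hypothetical helper, and it assumes that file formats are recognized by their extension, compared case-insensitively.

```python
import os

ALLOWED_EXTENSIONS = {".mzml", ".mzxml", ".mzdata"}  # compared in lower case


def check_spectra_names(filenames):
    """Return a list of problems with the given spectra file names."""
    problems = []
    for name in filenames:
        if " " in name:
            problems.append(f"{name!r}: spaces are not allowed")
        extension = os.path.splitext(name)[1].lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name!r}: unsupported format {extension!r}")
    return problems
```

The check covers file names only; it does not verify that the spectra are centroided or that the files follow the standard mzML or mzXML schema.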

(Optional) Meta-data for raw spectra processing

A two-column metadata table (.txt only) is mandatory. The first column contains the filenames of the spectra compressed above; the second column contains the class/group of each sample. No spaces are allowed in filenames or classes; for example, use "control_1" instead of "Control 1". For QC samples, the class/group name must be "QC". At least 3 samples are required for every group except QC. For QC, providing at least 2 samples is strongly recommended; otherwise, the 2 sample files with the largest file size will be used for optimization.
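
The group-size rules can be expressed as a short Python check. In this sketch, which is not MetaboAnalyst code, the rows are assumed to have been read already into (filename, group) tuples, and check_spectra_metadata is a hypothetical name. Violations of the stated requirements are returned as errors, while a low QC count is only a warning because the text describes it as a recommendation.

```python
from collections import Counter


def check_spectra_metadata(rows):
    """Check (filename, group) rows; return (errors, warnings)."""
    errors, warnings = [], []
    for filename, group in rows:
        if " " in filename or " " in group:
            errors.append(f"space found in {filename!r} / {group!r}")
    counts = Counter(group for _, group in rows)
    for group, count in counts.items():
        if group == "QC":
            if count < 2:
                warnings.append("fewer than 2 QC samples")
        elif count < 3:
            errors.append(f"group {group!r} has only {count} sample(s)")
    return errors, warnings
```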

Exploratory biomarker analysis

The data format is the same as the one-factor data, with samples in rows or columns followed immediately by class labels. Please note that ROC curve-based biomarker analysis is only defined for two-group analysis. If your data contains multiple groups, you need to specify which two groups you want to investigate.

Creating biomarker models to predict new samples

You can create biomarker models to predict new samples (with unknown class) using the ROC Tester. To do this, you need to upload a dataset that contains both the samples with class labels and the samples whose class labels need to be predicted (leave their class labels empty). A screenshot is shown below.

The data format is the same as the one-factor data, with samples in rows or columns followed immediately by class labels. Before uploading your data to the module, please make sure that the names of your features (compound names, spectral bins, peaks) are consistent between the individual studies. At least 25% of the features must match between the studies. Also make sure that the group labels are consistent between the studies (e.g. Cancer and Healthy). Finally, all uploaded sample identifiers must be unique.
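
The consistency requirements between studies can be sketched in Python as follows. The text does not specify how the 25% is computed; this sketch assumes it is the number of features common to all studies divided by the number of features in each individual study, and that assumption should be treated with caution. The function name check_studies is hypothetical.

```python
def check_studies(studies, min_shared=0.25):
    """Return a list of problems across several studies.

    studies is a list of dicts with "features", "labels", and "samples"
    lists. The shared fraction is computed per study as
    (features common to all studies) / (features in that study),
    which is an assumed definition.
    """
    problems = []
    common = set.intersection(*(set(s["features"]) for s in studies))
    for index, study in enumerate(studies, start=1):
        fraction = len(common) / len(set(study["features"]))
        if fraction < min_shared:
            problems.append(f"study {index}: too few shared features")
    if len({frozenset(s["labels"]) for s in studies}) > 1:
        problems.append("group labels differ between studies")
    all_samples = [name for s in studies for name in s["samples"]]
    if len(set(all_samples)) != len(all_samples):
        problems.append("sample identifiers are not unique across studies")
    return problems
```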

A screenshot example is shown below:

Peak list data format

Version 1 Peak Lists: The MS Peaks to Pathways module accepts a three-column table containing the m/z features, p-values, and statistical scores; a two-column table containing m/z features and either p-values or t-scores; or a one-column table ranked by either p-values or t-scores. All input files must be in .txt format. If the input is a three-column table, both the mummichog and GSEA algorithms (and their combination) can be applied. If only p-values (or a list ranked by p-values) are provided, only the mummichog algorithm will be applied. If only t-scores (or a list ranked by t-scores) are provided, only the GSEA algorithm will be applied.

Version 2 Peak Lists: With Version 2 of the MS Peaks to Pathways module, retention time can be included as a new column with the "rt" or "r.t" heading. The maximum number of columns that can be uploaded is now 5: "m.z", "r.t", "p.value", "t.score" and "mode".
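
The relationship between the uploaded columns and the applicable algorithms can be summarized in a short Python sketch. It uses the column headings quoted above and covers only tables with headed p-value or t-score columns, not one-column ranked lists; the function name applicable_algorithms and the "combined" label are hypothetical.

```python
ALLOWED_COLUMNS = {"m.z", "r.t", "p.value", "t.score", "mode"}


def applicable_algorithms(columns):
    """Return the algorithms that a peak list with these columns supports."""
    unknown = set(columns) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unexpected columns: {sorted(unknown)}")
    has_p = "p.value" in columns
    has_t = "t.score" in columns
    if has_p and has_t:
        return ["mummichog", "GSEA", "combined"]
    if has_p:
        return ["mummichog"]
    if has_t:
        return ["GSEA"]
    return []
```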

If p-values have not yet been calculated for their data, users can use the exploratory statistical analysis module to upload their raw peak tables, process the data, perform t-tests or fold-change analysis, and then upload these results into the module. An example dataset is shown below:

Peak table format

Upload your data in either tab-delimited (.txt) or comma-separated (.csv) format. The MS Peaks to Pathways module accepts either a generic peak table or an MZmine-formatted peak table. For Version 2, retention times can be included in the generic peak table and should be formatted so that the m/z value and retention time are separated by two underscores. An example of the generic peak table with retention time is shown below:
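
The two-underscore convention can be illustrated with a minimal Python sketch. The helper names and the numeric values in the usage note are hypothetical and chosen only for illustration.

```python
def feature_label(mz, rt):
    """Join an m/z value and a retention time with two underscores."""
    return f"{mz}__{rt}"


def split_feature_label(label):
    """Split a label of the form "<m/z>__<retention time>" into floats."""
    mz_text, rt_text = label.split("__")
    return float(mz_text), float(rt_text)
```

For instance, feature_label(85.02783, 55.66) gives "85.02783__55.66", and split_feature_label reverses the operation.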

Data format overview

Metabolite or gene list data: a list of metabolite or gene IDs with optional fold changes. Each feature should be in its own row. Please refer to the example data for further details.

Metabolite/Gene list labels

It is critical for your data to be properly labeled so they can be uploaded into the Joint Pathway Analysis or Network Explorer module. The following common metabolite and gene IDs are supported:

Metabolite list: Common compound names, HMDB IDs, or KEGG compound IDs as metabolite identifiers.
Gene list: Entrez IDs, Ensembl Gene IDs, official gene symbols, or KEGG orthologs (KOs) are currently supported.
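
To see which identifier type a list uses before upload, a rough pattern check such as the Python sketch below may help. The patterns reflect the commonly used shapes of these identifiers and are assumptions, not MetaboAnalyst's own matching rules; anything that matches no pattern is treated as a compound name or gene symbol. The function name guess_id_type is hypothetical.

```python
import re

# Assumed identifier shapes, checked in order.
ID_PATTERNS = [
    ("HMDB ID", re.compile(r"^HMDB\d{5,7}$")),
    ("KEGG compound ID", re.compile(r"^C\d{5}$")),
    ("KEGG ortholog", re.compile(r"^K\d{5}$")),
    ("Ensembl gene ID", re.compile(r"^ENS[A-Z]*G\d{11}$")),
    ("Entrez ID", re.compile(r"^\d+$")),
]


def guess_id_type(identifier):
    """Return a best-guess label for the identifier type."""
    for label, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return label
    return "name or symbol"
```

A list that mixes several identifier types is a sign that the labels should be harmonized first.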

An example of what your data should look like in any text editor (WordPad, TextEdit) is shown in the screenshot below.