Date post: | 29-Jun-2015 |
Category: | Science |
Upload: | juan-antonio-vizcaino |
ProteomeXchange: data deposition and data retrieval made easy
Proteomics Services Group
European Bioinformatics Institute
Hinxton, Cambridge
United Kingdom
Juan Antonio VIZCAINO, Ph.D.
PRIDE Group coordinator
Overview
• The ProteomeXchange (PX) consortium
• Highlights in the last year
• PRIME-XS datasets
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• Common identifier space (PXD identifiers).
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers.
http://www.proteomexchange.org
ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Diagram labels: Researcher's results; Raw data*; Metadata; Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral (Metadata / Manuscript); Reprocessed results; Journals; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs.]
Vizcaíno et al., Nat Biotechnol, 2014
MassIVE (UCSD)
http://proteomics.ucsd.edu/service/massive/
• Joined ProteomeXchange in June 2014.
• Only partial submissions. A few datasets so far.
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
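The difference between complete and partial submissions can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the logic of any ProteomeXchange tool: the `RAW`, `RESULT` and `SEARCH` labels, the `submission_type` helper, and the requirement that at least one `RAW` file is present are assumptions made for the example.

```python
from collections import Counter

# Optional file types taken from the list above.
OPTIONAL_TYPES = {"QUANT", "PEAK", "GEL", "OTHER", "FASTA", "SP_LIBRARY"}
# Labels assumed for this sketch: mass spectrometer output, standard-format
# results (PRIDE XML or mzIdentML), and search engine output in original form.
CORE_TYPES = {"RAW", "RESULT", "SEARCH"}


def submission_type(files):
    """Classify a submission given (file_name, file_type) pairs.

    Returns "complete" if a standard-format result file is present,
    "partial" if only original search engine output is present.
    """
    counts = Counter(file_type for _, file_type in files)
    unknown = set(counts) - OPTIONAL_TYPES - CORE_TYPES
    if unknown:
        raise ValueError(f"unknown file types: {sorted(unknown)}")
    if counts["RAW"] == 0:
        # Assumption: mass spectrometer output files are always required.
        raise ValueError("no mass spectrometer output files")
    if counts["RESULT"] > 0:
        return "complete"
    if counts["SEARCH"] > 0:
        return "partial"
    raise ValueError("no result or search engine output files")


if __name__ == "__main__":
    files = [("run1.raw", "RAW"), ("run1.mzid", "RESULT"), ("db.fasta", "FASTA")]
    print(submission_type(files))
```

The file names in the usage example are hypothetical; the point is only that the presence of a standard-format result file is what makes a submission complete.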
Complete vs Partial submissions: processed results
[Figure: complete vs partial submissions. Panel labels: Published, Raw files, Other files, Complete, Partial.]
For complete submissions, it is possible to connect the spectra with the processed identification results, and both can be visualized.
PRIDE XML and mzIdentML are supported; mzTab support is to come.
Complete vs Partial submissions: experimental metadata
[Figure: panel labels: Complete, Partial.]
General experimental metadata about the projects is similar. However, assay-level information is less thoroughly annotated in partial submissions.
Complete submissions using mzIdentML
[Figure: diagram labels: Search engines; mzIdentML; Search Engine Results + MS files.]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
- Referenced spectral files need to be submitted as well (all open formats are supported).
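Because the referenced spectral files must accompany the mzIdentML file, a quick pre-submission check can be useful. The sketch below reads the `location` attribute of the `SpectraData` elements of an mzIdentML document and reports any referenced file that is missing from a list of submitted file names. The helper names and the example document are made up for illustration, and the check compares file names only, which is a simplifying assumption.

```python
import re
import xml.etree.ElementTree as ET


def referenced_spectra_files(mzid_text):
    """Return the base names of the spectra files referenced in an mzIdentML document."""
    root = ET.fromstring(mzid_text)
    names = []
    for element in root.iter():
        # Compare local tag names so the sketch does not depend on the namespace URI.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "SpectraData":
            location = element.get("location", "")
            # Keep the last path component; handles "/" and "\" separators.
            name = re.split(r"[\\/]", location)[-1]
            if name:
                names.append(name)
    return names


def missing_spectra_files(mzid_text, submitted_file_names):
    """List referenced spectra files that are not among the submitted files."""
    submitted = {name.lower() for name in submitted_file_names}
    return [name for name in referenced_spectra_files(mzid_text)
            if name.lower() not in submitted]


# Minimal, made-up mzIdentML fragment used only to exercise the functions.
EXAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run1.mgf"/>
      <SpectraData id="SD_2" location="C:\\data\\run2.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""

if __name__ == "__main__":
    print(missing_spectra_files(EXAMPLE, ["run1.mgf"]))
```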
Now: native file export
[Figure: columns: Tools; ‘RESULT’ file generation; Final ‘RESULT’ file. Tools: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, Others. Step: native file export. Output: mzIdentML ‘RESULT’ plus spectra files.]
Before: file conversion using PRIDE Converter
[Figure: columns: Original data files; ‘RESULT’ file generation; Final ‘RESULT’ file. Input: search output files plus spectra files. Step: file conversion with PRIDE Converter. Output: PRIDE XML ‘RESULT’.]
PRIDE Inspector 2
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
PX submission tool: data submission
• Capture the mappings between the different types of files.
• Add the mandatory metadata annotation.
• Make the file upload process straightforward for the submitter (the tool transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed.
[Figure: PX submission tool. Panel labels: Published, Raw, Other files.]
http://www.proteomexchange.org/submission
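One way to picture the mappings between the different types of files is a table that links each result file to the files it was derived from. The sketch below checks such a table for obvious gaps. The data layout, the `check_mappings` helper and the file names are hypothetical and do not reproduce the submission tool's own file format.

```python
def check_mappings(file_types, mappings):
    """Report problems in a result-file-to-source-file mapping.

    file_types: dict of file name -> type label (e.g. "RAW", "RESULT").
    mappings:   dict of result file name -> list of file names it refers to.
    Returns a list of human-readable problem descriptions (empty if none).
    """
    problems = []
    # Every result file should point to at least one other file.
    for name, file_type in file_types.items():
        if file_type == "RESULT" and not mappings.get(name):
            problems.append(f"{name}: RESULT file is not mapped to any other file")
    # Every mapped file should itself be part of the submission.
    for result_name, linked_names in mappings.items():
        for linked in linked_names:
            if linked not in file_types:
                problems.append(f"{result_name}: mapped file {linked} is not in the submission")
    return problems


if __name__ == "__main__":
    file_types = {"run1.raw": "RAW", "run1.mzid": "RESULT", "run2.mzid": "RESULT"}
    mappings = {"run1.mzid": ["run1.raw"], "run2.mzid": ["run2.raw"]}
    for problem in check_mappings(file_types, mappings):
        print(problem)
```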
Uploading large datasets: Aspera
- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!!
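To put the "up to 50X" figure in perspective, the following back-of-the-envelope calculation estimates transfer times for a large dataset. The 1.4 TB dataset size is just an example, and the baseline FTP throughput of 10 MB/s is a purely hypothetical value; since 50X is an upper bound, the Aspera figure is a best case rather than an expected value.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at rate_mb_per_s megabytes per second."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


ASSUMED_FTP_RATE_MB_S = 10.0  # hypothetical baseline, not a measured value
MAX_SPEEDUP = 50              # "up to 50X faster than FTP"

if __name__ == "__main__":
    ftp_hours = transfer_hours(1.4, ASSUMED_FTP_RATE_MB_S)
    aspera_hours = ftp_hours / MAX_SPEEDUP
    print(f"FTP: {ftp_hours:.1f} h, Aspera (best case): {aspera_hours:.1f} h")
```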
Tutorial manuscript detailing the process
Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission
Example dataset: PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Top species (studied in at least 10 datasets):
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Datasets/year:
2012: 102
2013: 527
2014: 700
PX submission tool: PRIME-XS tags
37 datasets in total (both public and private at present):
- 20 from the Netherlands
- 4 from UK
- 2 each from Austria, Belgium, Denmark, Spain and Switzerland
- 1 each from France and USA
PRIME-XS datasets are now tagged in PRIDE
PRIME-XS datasets are now tagged and can be browsed as a group
http://www.ebi.ac.uk/pride/archive/simpleSearch?q=prime-xs
ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
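Since every dataset carries a PXD identifier, a link to its ProteomeCentral page can be built programmatically. The sketch below validates an identifier and appends it to the URL above. Two points are assumptions rather than facts stated in this talk: the six-digit identifier pattern (inferred from examples such as PXD000764) and the `ID` query parameter.

```python
import re
from urllib.parse import urlencode

# Assumed pattern: "PXD" followed by six digits, as in PXD000764.
PXD_PATTERN = re.compile(r"PXD\d{6}")
PROTEOMECENTRAL_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(identifier):
    """Return a ProteomeCentral link for a PXD identifier."""
    identifier = identifier.strip().upper()
    if not PXD_PATTERN.fullmatch(identifier):
        raise ValueError(f"not a PXD identifier: {identifier!r}")
    # The "ID" query parameter is an assumption of this sketch.
    return f"{PROTEOMECENTRAL_URL}?{urlencode({'ID': identifier})}"


if __name__ == "__main__":
    print(dataset_url("PXD000764"))
```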
Which are the most accessed datasets?
PXD Identifier | Total Hits | Dataset title | Publication
PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543
[Figure: Which are the most accessed datasets? Axis label: Total numbers.]
Reshake PRIDE data in PeptideShaker
Find the desired PRIDE project …
… inspect the project details …
… and start re-analyzing the data!
http://peptide-shaker.googlecode.com
Vaudel M, Burkhart J, Zahedi RP, Berven FS, Sickmann A, Martens L, Barsnes H. Nature Biotechnology (in press)
A little bit of perspective
[Figure: Berlin 2011, Mallorca 2012, Annecy 2013, Split 2013.]
A little bit of perspective
[Figure: timeline 2011, 2012, 2013, 2014. Labels: mzIdentML, mzQuantML, qcML, mzTab; PRIDE web (2011); PRIDE Converter; PRIDE Converter 2; PRIDE Inspector; PX Submission Tool; PRIDE Inspector 2; PRIDE web (2014); PRIDE/PX datasets.]
Conclusions
• ProteomeXchange is widely used.
– PRIDE contains most of the MS/MS datasets.
– It now has a new consortium member: MassIVE (UCSD).
– Around half of the datasets are already public.
• Different open source tools are available to facilitate the process:
– File transfer speed should not be a problem (Aspera support).
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.
Acknowledgements: People
Attila Csordas
Tobias Ternent
Noemi del Toro
Rui Wang
Florian Reisinger
Jose A. Dianes
Johannes Griss
Steven Lewis
Yasset Perez-Riverol
Henning Hermjakob
All previous team members
ProteomeXchange partners
Acknowledgements: Funding
[email protected]@ebi.ac.uk
http://www.proteomexchange.org
http://code.google.com/p/pride-converter-2/
@pride_ebi