Post on 12-Jan-2017
Update to the PRIDE Cluster project
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
Juan A. Vizcaíno, juan@ebi.ac.uk
Bioinformatics Hub, HUPO 2016, Taipei, September 2016
PRIDE (PRoteomics IDEntifications) database
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data, a prerequisite for data reuse (e.g. in UniProt).
PRIDE Cluster - Concept
Griss et al., Nat Methods, 2016
[Figure: originally submitted identified spectra (peptide identifications such as NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is summarised by a consensus spectrum.]
Threshold: At least 3 spectra in a cluster and ratio >70%.
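The threshold on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the PRIDE Cluster code: it assumes that "ratio" means the share of spectra in a cluster that carry the cluster's most frequent peptide identification, and the function name, parameter names and example cluster are hypothetical.

```python
from collections import Counter


def is_reliable(peptides, min_size=3, min_ratio=0.70):
    """Check one cluster against the slide's threshold.

    peptides: one peptide sequence per identified spectrum in the cluster.
    Assumption: 'ratio' is the fraction of spectra assigned to the most
    frequent peptide, and it must be strictly greater than min_ratio.
    Returns (reliable, top_peptide, ratio).
    """
    if not peptides:
        return False, None, 0.0
    top_peptide, top_count = Counter(peptides).most_common(1)[0]
    ratio = top_count / len(peptides)
    reliable = len(peptides) >= min_size and ratio > min_ratio
    return reliable, top_peptide, ratio


# Illustrative cluster only; the grouping shown in the figure is not recoverable.
EXAMPLE_CLUSTER = ["NMMAACDPR"] * 5 + ["PPECPDFDPPR"] * 2
RELIABLE, TOP_PEPTIDE, RATIO = is_reliable(EXAMPLE_CLUSTER)
print(RELIABLE, TOP_PEPTIDE, round(RATIO, 3))
```

In this made-up cluster, five of seven spectra agree on one peptide, so it passes both conditions; a cluster with only two spectra would fail regardless of its ratio.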
PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in PRIDE by April 2015
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra, filtered down to the 111 M that are likely to represent a peptide.
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on a 30-node Hadoop cluster with 340 CPU cores
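The figures on this slide can be cross-checked with some quick arithmetic. The sketch below is a rough comparison under stated assumptions: it assumes the second run clustered the 66 M identified plus the 111 M filtered unidentified spectra, and it treats 5 days × 340 cores as an upper bound on core-days (the slide does not say that all cores were busy for the whole period). All variable names are illustrative.

```python
# Figures taken from the slide (in millions of spectra).
IDENTIFIED_M = 66
UNIDENTIFIED_M = 190
UNIDENTIFIED_FILTERED_M = 111

# Identified + unidentified should give the 256 M starting point.
total_m = IDENTIFIED_M + UNIDENTIFIED_M

# Assumption: the clustered input is identified + filtered unidentified spectra.
clustered_m = IDENTIFIED_M + UNIDENTIFIED_FILTERED_M

# First implementation: 20.7 M spectra in 610 CPU days.
first_rate = 20.7e6 / 610

# Second implementation: upper bound on core-days, assuming full utilisation.
second_core_days = 5 * 340
second_rate = clustered_m * 1e6 / second_core_days

print(f"Starting spectra: {total_m} M")
print(f"Clustered spectra (assumed): {clustered_m} M")
print(f"First run: about {first_rate:,.0f} spectra per CPU day")
print(f"Second run: at least about {second_rate:,.0f} spectra per core-day")
```

Because the core-day figure for the second run is an upper bound, the resulting throughput is a lower bound, so the two rates are only loosely comparable.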
Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is an open-source framework for parallel processing, based on the MapReduce programming model published by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• Inter-job communication
• Failure handling
https://hadoop.apache.org/
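To make the MapReduce idea concrete, here is a toy, single-process sketch in plain Python. It is not the PRIDE Cluster Hadoop code and does not use the Hadoop API: the map step assigns each spectrum to a precursor m/z bin, an in-memory "shuffle" groups spectra by bin, and the reduce step clusters the spectra within each bin greedily by cosine similarity. The bin width, similarity threshold, data layout and spectra are all hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy spectra: (spectrum_id, precursor_mz, {fragment_bin: intensity})
SPECTRA = [
    ("s1", 500.26, {100: 1.0, 200: 0.5, 300: 0.2}),
    ("s2", 500.27, {100: 0.9, 200: 0.6, 300: 0.1}),
    ("s3", 500.30, {150: 1.0, 250: 0.8}),
    ("s4", 742.81, {100: 1.0, 400: 0.7}),
]


def map_phase(spectrum, bin_width=1.0):
    """Map: emit (key, value) with the precursor m/z bin as the key."""
    spectrum_id, precursor_mz, peaks = spectrum
    return int(precursor_mz // bin_width), (spectrum_id, peaks)


def cosine(a, b):
    """Cosine similarity between two sparse peak dictionaries."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reduce_phase(values, threshold=0.9):
    """Reduce: greedily cluster all spectra that share one key."""
    clusters = []  # list of (representative_peaks, member_ids)
    for spectrum_id, peaks in values:
        for representative, members in clusters:
            if cosine(representative, peaks) >= threshold:
                members.append(spectrum_id)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append((peaks, [spectrum_id]))
    return [members for _, members in clusters]


def run(spectra):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    shuffled = defaultdict(list)
    for spectrum in spectra:
        key, value = map_phase(spectrum)
        shuffled[key].append(value)
    return {key: reduce_phase(values) for key, values in sorted(shuffled.items())}


RESULT = run(SPECTRA)
print(RESULT)
```

In a real Hadoop job the framework would run the map and reduce steps on different machines and take care of the scheduling, data movement and failure recovery listed on the slide; only the two functions would be user code.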
PRIDE Cluster Home page
http://www.ebi.ac.uk/pride/cluster/#/
PRIDE Cluster: result of searches
http://www.ebi.ac.uk/pride/cluster/#/
A couple of examples …
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
Examples: one perfect cluster (2)
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
2. Inferring identifications for originally unidentified spectra
• 9.1 M unidentified spectra were contained in clusters with a reliable identification.
• These are candidate new identifications (which still need to be confirmed), often missed due to search engine settings.
• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
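The inference step described above can be sketched as follows. This is a hypothetical illustration rather than the PRIDE Cluster implementation: it assumes that a cluster is "reliable" when it has at least 3 identified spectra and more than 70% of them agree on one peptide, and it then proposes that peptide as a candidate for every unidentified spectrum in the same cluster. Function name, spectrum IDs and data layout are made up.

```python
from collections import Counter


def candidate_identifications(cluster, min_size=3, min_ratio=0.70):
    """Propose identifications for unidentified spectra in one cluster.

    cluster: list of (spectrum_id, peptide), with peptide=None when the
    spectrum was not identified by the original submitter.
    Assumption: size and ratio are computed over identified spectra only.
    Returns {spectrum_id: candidate_peptide}; candidates still need
    independent confirmation.
    """
    identified = [peptide for _, peptide in cluster if peptide is not None]
    if len(identified) < min_size:
        return {}
    top_peptide, top_count = Counter(identified).most_common(1)[0]
    if top_count / len(identified) <= min_ratio:
        return {}
    return {sid: top_peptide for sid, peptide in cluster if peptide is None}


# Hypothetical cluster: three agreeing identifications, two unidentified spectra.
EXAMPLE = [
    ("s1", "NMMAACDPR"),
    ("s2", "NMMAACDPR"),
    ("s3", "NMMAACDPR"),
    ("u1", None),
    ("u2", None),
]
CANDIDATES = candidate_identifications(EXAMPLE)
print(CANDIDATES)
```

The function deliberately returns nothing for small or inconsistent clusters, since transferring an identification is only defensible when the identified members agree.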
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• A vast amount of data mining remains to be done.
PRIDE Cluster as a Public Data Mining Resource
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest, are available.
• Source code (open source) and Java API
Consistently unidentified clusters
• We provide the results split per species in MGF and mzML format.
• We are very interested in having people work on these.
• Available for several species (at present, the largest clusters only).
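For anyone who wants to start exploring the MGF downloads, a minimal reader might look like the sketch below. It relies only on the general MGF layout (spectra delimited by BEGIN IONS / END IONS, KEY=value parameter lines, and "m/z intensity" peak lines); the example content, titles and values are hypothetical and do not come from the PRIDE Cluster files. For real work, an established proteomics library would be a safer choice than this hand-written parser.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectra.

    Each spectrum is a dict with 'params' (upper-cased keys mapped to
    string values) and 'peaks' (list of (mz, intensity) float tuples).
    """
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                # Peak line: m/z, intensity, optionally more columns
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Hypothetical two-spectrum example, not real PRIDE Cluster data.
EXAMPLE_MGF = """\
BEGIN IONS
TITLE=hypothetical_cluster_1
PEPMASS=500.26
CHARGE=2+
100.1 10.0
200.2 25.5
END IONS
BEGIN IONS
TITLE=hypothetical_cluster_2
PEPMASS=742.81
CHARGE=3+
150.0 5.0
END IONS
"""

SPECTRA = parse_mgf(EXAMPLE_MGF)
for spectrum in SPECTRA:
    print(spectrum["params"]["TITLE"], len(spectrum["peaks"]), "peaks")
```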
Acknowledgements: People
Attila Csordas, Tobias Ternent, Gerhard Mayer (de.NBI)
Johannes Griss, Yasset Perez-Riverol, Manuel Bernal-Llinares, Andrew Jarnuczak
Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
Questions?