Pride cluster presentation

Update to the PRIDE Cluster project

Dr. Juan Antonio Vizcaíno

Proteomics Team Leader

EMBL-European Bioinformatics Institute

Hinxton, Cambridge, UK

Juan A. Vizcaíno

juan@ebi.ac.uk

Bioinformatics Hub HUPO 2016

Taipei, September 2016

PRIDE (PRoteomics IDEntifications) database

• PRIDE stores mass spectrometry (MS)-based proteomics data:

• Peptide and protein expression data (identification and quantification)

• Post-translational modifications

• Mass spectra (raw data and peak lists)

• Technical and biological metadata

• Any other related information

• Full support for tandem MS approaches

http://www.ebi.ac.uk/pride/archive

Martens et al., Proteomics, 2005

Vizcaíno et al., NAR, 2016

PRIDE Cluster: Initial Motivation

• Provide a QC-filtered peptide-centric view of PRIDE.

• Data is stored in PRIDE Archive as originally analysed by the submitters (no data reprocessing is done).

• Heterogeneous quality, difficult to make the data comparable.

• Enable assessment of (published) proteomics data, a prerequisite for data reuse (e.g. in UniProt).

PRIDE Cluster - Concept

Griss et al., Nat Methods, 2016

[Figure: originally submitted identified spectra (peptide labels NMMAACDPR and PPECPDFDPPR) are grouped by spectrum clustering, and each cluster is summarized by a consensus spectrum.]

Threshold: At least 3 spectra in a cluster and ratio >70%.
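The threshold above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "ratio" is taken here to mean the share of identified spectra in a cluster that carry the most common peptide sequence, and the function and constant names are hypothetical, not part of the PRIDE Cluster code.

```python
from collections import Counter

MIN_SPECTRA = 3    # "at least 3 spectra in a cluster"
MIN_RATIO = 0.70   # "ratio >70%"


def consensus_identification(peptides):
    """Return (peptide, ratio) for the most common identification.

    `peptides` holds one peptide sequence per identified spectrum.
    ASSUMPTION: "ratio" is the share of spectra that carry the most
    common peptide sequence in the cluster.
    """
    if not peptides:
        return None, 0.0
    peptide, count = Counter(peptides).most_common(1)[0]
    return peptide, count / len(peptides)


def is_reliable(peptides):
    """Apply both parts of the threshold: cluster size and ratio."""
    _, ratio = consensus_identification(peptides)
    return len(peptides) >= MIN_SPECTRA and ratio > MIN_RATIO


# Hypothetical clusters built from the two peptide labels in the figure.
cluster_a = ["NMMAACDPR", "NMMAACDPR", "NMMAACDPR", "PPECPDFDPPR"]  # 3 of 4
cluster_b = ["NMMAACDPR", "PPECPDFDPPR", "NMMAACDPR"]               # 2 of 3

print(is_reliable(cluster_a))  # ratio 0.75, above the threshold
print(is_reliable(cluster_b))  # ratio about 0.67, below the threshold
```

Under this reading, a cluster in which one spectrum disagrees with three others still passes, while a three-spectrum cluster with one disagreement does not.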

PRIDE Cluster: Second Implementation

• Griss et al., Nat. Methods, 2013

• Clustered all public, identified spectra in PRIDE

• EBI compute farm, LSF

• 20.7 M identified spectra

• 610 CPU days, two calendar weeks

• Validation, calibration

• Feedback into PRIDE datasets

• EBI farm, LSF

• Griss et al., Nat. Methods, 2016

• Clustered all spectra publicly available in PRIDE as of April 2015

• Apache Hadoop.

• Starting with 256 M spectra.

• 190 M unidentified spectra (filtered down to 111 M spectra that are likely to represent a peptide).

• 66 M identified spectra

• Result: 28 M clusters

• 5 calendar days on 30 node Hadoop cluster, 340 CPU cores
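The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above. That the clustering input was the filtered unidentified spectra plus the identified spectra is an assumption, and the CPU-day value for the Hadoop run is an upper bound (all cores busy for the full five days), not a reported measurement.

```python
# All spectrum counts are in millions, as quoted on the slide.
total_spectra = 256
unidentified = 190
identified = 66
unidentified_filtered = 111

# The two categories add up to the starting total.
assert unidentified + identified == total_spectra

# ASSUMPTION: the clustering input is the filtered unidentified spectra
# plus all identified spectra.
clustered_input = unidentified_filtered + identified

# Upper bound on compute for the Hadoop run: 5 days on 340 cores.
cpu_day_upper_bound = 5 * 340

print(clustered_input)       # millions of spectra
print(cpu_day_upper_bound)   # CPU days, upper bound
```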

Parallelizing Spectrum Clustering: Hadoop

• Optimizes work distribution among machines.

• Hadoop is an open source framework for parallel processing that implements Google's MapReduce programming model.

• Solves many general issues of large parallel jobs:

• Scheduling

• Inter-job communication

• Failure handling

https://hadoop.apache.org/
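The MapReduce pattern itself can be illustrated without Hadoop. The toy below runs in a single Python process and does not use the Hadoop API. Keying spectra by precursor m/z bin is an assumption made for illustration, and the function names and spectrum records are hypothetical.

```python
from collections import defaultdict


def map_spectrum(spectrum, bin_width=1.0):
    """Map step: emit a (key, value) pair for one spectrum.

    ASSUMPTION: the key is the precursor m/z bin, so that spectra with
    similar precursor m/z end up in the same group.
    """
    key = int(spectrum["precursor_mz"] // bin_width)
    return key, spectrum["id"]


def shuffle(pairs):
    """Shuffle step: group values by key (Hadoop does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_bin(key, spectrum_ids):
    """Reduce step: process one group independently of all others.

    Here it only sorts the ids; a real job would cluster the spectra
    that share the bin.
    """
    return key, sorted(spectrum_ids)


# Hypothetical input records.
spectra = [
    {"id": "s1", "precursor_mz": 500.27},
    {"id": "s2", "precursor_mz": 500.31},
    {"id": "s3", "precursor_mz": 742.88},
]

pairs = [map_spectrum(s) for s in spectra]
result = dict(reduce_bin(k, v) for k, v in shuffle(pairs).items())
print(result)
```

Because each group is reduced independently, the groups can be distributed across machines, which is the property the framework exploits.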

PRIDE Cluster Home page

http://www.ebi.ac.uk/pride/cluster/#/

PRIDE Cluster: result of searches

http://www.ebi.ac.uk/pride/cluster/#/

A couple of examples …

Examples: one perfect cluster

- 880 PSMs give the same peptide ID

- 4 species

- 28 datasets

- Same instruments

Examples: one perfect cluster (2)

Output of the analysis

• 1. Inconsistent spectrum clusters

• 2. Clusters including identified and unidentified spectra.

• 3. Clusters just containing unidentified spectra.

2. Inferring identifications for originally unidentified spectra

• 9.1 M unidentified spectra were contained in clusters with a reliable identification.

• These are candidate new identifications (that need to be confirmed), often missed due to search engine settings.

• Example: 49,263 reliable clusters (containing 560,000 identified and 130,000 unidentified spectra) contained phosphorylated peptides, many of them from non-enriched studies.
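A minimal sketch of how identifications might be proposed for unidentified spectra in a reliable cluster. It assumes that the threshold quoted earlier (at least 3 spectra, ratio >70%) is applied to the identified members of the cluster. The function name and data layout are hypothetical, and the output is a list of candidates that would still need to be confirmed.

```python
from collections import Counter


def candidate_identifications(cluster, min_identified=3, min_ratio=0.70):
    """Propose a peptide for each unidentified spectrum in a cluster.

    `cluster` is a list of (spectrum_id, peptide) pairs, where peptide
    is None for an unidentified spectrum.
    ASSUMPTION: reliability is judged on the identified members only.
    Returns {spectrum_id: peptide}, empty if the cluster is not reliable.
    """
    identified = [pep for _, pep in cluster if pep is not None]
    if len(identified) < min_identified:
        return {}
    peptide, count = Counter(identified).most_common(1)[0]
    if count / len(identified) <= min_ratio:
        return {}
    return {sid: peptide for sid, pep in cluster if pep is None}


# Hypothetical cluster: three agreeing identifications, two unidentified.
cluster = [
    ("a", "NMMAACDPR"),
    ("b", "NMMAACDPR"),
    ("c", "NMMAACDPR"),
    ("d", None),
    ("e", None),
]
print(candidate_identifications(cluster))
```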

3. Consistently unidentified clusters

• 19 M clusters contain only unidentified spectra.

• 41,155 of these clusters have more than 100 spectra (= 12 M spectra).

• Most of them are likely to be derived from peptides.

• They could correspond to PTMs or variant peptides.

• With various methods, we found likely identifications for about 20%.

• Vast amount of data mining remains to be done.

PRIDE Cluster as a Public Data Mining Resource

• http://www.ebi.ac.uk/pride/cluster

• Spectral libraries for 16 species.

• All clustering results, as well as specific subsets of interest available.

• Source code (open source) and Java API

Consistently unidentified clusters

• We provide the results split per species in MGF and mzML format.

• We are very interested in having people work with these data.

• Available for several species (currently, the largest clusters).
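For anyone wanting to try the MGF files, a minimal reader can be written in plain Python. This sketch assumes the usual MGF layout (BEGIN IONS / END IONS blocks, KEY=VALUE parameter lines, and peak lines of m/z and intensity). The example content is made up, and the exact fields present in the PRIDE Cluster files are not confirmed here.

```python
import io


def read_mgf(handle):
    """Yield one dict per spectrum from an MGF text stream.

    Each dict has "params" (KEY=VALUE lines) and "peaks"
    (a list of (m/z, intensity) tuples).
    """
    spectrum = None
    for raw in handle:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None:
            if line[0].isdigit():
                # Peak line: m/z and intensity (extra columns are ignored).
                mz, intensity = line.split()[:2]
                spectrum["peaks"].append((float(mz), float(intensity)))
            elif "=" in line:
                key, value = line.split("=", 1)
                spectrum["params"][key] = value


# Hypothetical MGF content, for illustration only.
example = """BEGIN IONS
TITLE=hypothetical_cluster_1
PEPMASS=500.27
CHARGE=2+
100.1 200.0
250.5 1500.0
END IONS
"""

spectra = list(read_mgf(io.StringIO(example)))
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

To read a downloaded file, pass an open text file handle to `read_mgf` in place of the `io.StringIO` object.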

Acknowledgements: People

Attila Csordas

Tobias Ternent

Gerhard Mayer (de.NBI)

Johannes Griss

Yasset Perez-Riverol

Manuel Bernal-Llinares

Andrew Jarnuczak

Former team members, especially Rui Wang, Florian Reisinger, Noemi del Toro, Jose A. Dianes & Henning Hermjakob

Acknowledgements: The PRIDE Team

All data submitters !!!

Questions?
