Assessing proteome completeness and quality

Last modified March 5, 2021

To assess the quality and completeness of proteomes, we provide two values:

  • Statistical evaluation and classification of the proteome by the Complete Proteome Detector (CPD) algorithm (developed by UniProt)
  • The BUSCO score of the proteome, which was developed to quantify genomic data completeness in terms of expected gene content.

Complete Proteome Detector (CPD)

CPD statistically analyses each proteome against a group of closely related proteomes in order to determine completeness. For each proteome, CPD uses taxonomic lineage information to identify the group of proteomes taxonomically closest to it. A valid group must contain at least three proteomes. For the proteome being analysed, the algorithm compares selected attributes of all proteomes in its group to define the standard distribution of the number of proteins that the proteome would be expected to contain if it were complete. A quality metric is also determined, based on the taxonomic closeness of the group of proteomes used to analyse the proteome in question.

This evaluation places each proteome in one of three categories, according to how its protein count compares with the standard distribution of protein counts expected for completeness: standard, close to standard, or outlier.
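The classification step can be illustrated with a minimal sketch. It assumes the comparison reduces to a z-score of the protein count against the group's mean and standard deviation, and the two cut-offs (`close_sd`, `outlier_sd`) are hypothetical values chosen only for illustration. The attributes and thresholds CPD actually uses are not stated here. Only the minimum group size of three comes from the description above.

```python
from statistics import mean, stdev

# From the text: a valid group must contain at least three proteomes.
MIN_GROUP_SIZE = 3


def classify_protein_count(count, group_counts, close_sd=1.0, outlier_sd=2.0):
    """Classify a proteome's protein count against its taxonomic group.

    close_sd and outlier_sd are hypothetical cut-offs (in standard
    deviations); they are NOT the thresholds used by the real CPD.
    """
    if len(group_counts) < MIN_GROUP_SIZE:
        raise ValueError("a valid group requires at least three proteomes")

    mu = mean(group_counts)
    sigma = stdev(group_counts)  # sample standard deviation

    # Degenerate group: every proteome has the same protein count.
    if sigma == 0:
        return "standard" if count == mu else "outlier"

    z = abs(count - mu) / sigma
    if z <= close_sd:
        return "standard"
    if z <= outlier_sd:
        return "close to standard"
    return "outlier"


# Made-up protein counts for a group of five related proteomes.
group = [4000, 4200, 4100, 3900, 4300]
print(classify_protein_count(4150, group))
print(classify_protein_count(4350, group))
print(classify_protein_count(2000, group))
```

A proteome whose count falls far below its group's typical count is the case this check is most useful for, since it suggests missing proteins rather than natural variation.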

Benchmarking Universal Single-Copy Orthologs (BUSCO)

For eukaryotic and bacterial proteomes, we also provide the BUSCO score, which was developed to quantify genomic data completeness in terms of expected gene content based on single-copy orthologs. We're currently using BUSCO version 4.0.2.

The score reports the percentage of complete (C) genes, split into single-copy (S) and duplicated (D) genes, along with the percentages of fragmented (F) and missing (M) genes, and the total number of orthologous clusters (n) used in the BUSCO assessment.
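As a rough illustration of how these values relate, the sketch below parses a score written in the one-line notation that BUSCO prints in its short summary (for example `C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:3640`) and checks that the parts add up. The function names and the example values are invented for illustration, and the notation is assumed rather than taken from this page.

```python
import re

# Assumed one-line BUSCO notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:..
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(summary):
    """Return the C, S, D, F, M percentages and n as a dictionary."""
    match = _BUSCO_RE.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a recognisable BUSCO summary: {summary!r}")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


def is_consistent(scores, tol=0.2):
    """Check S + D = C and C + F + M = 100, allowing for rounding."""
    complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) <= tol
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tol
    return complete_ok and total_ok


# Made-up example values, not a real UniProt proteome.
scores = parse_busco_summary("C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:3640")
print(scores["C"], scores["n"], is_consistent(scores))
```

The tolerance exists because each percentage is rounded to one decimal place before it is reported, so the parts may not sum exactly.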

We also report the lineage dataset used to analyse each proteome. As recommended in BUSCO's user guide, this is the most specific lineage dataset available. For example, to assess fish data we would select the actinopterygii lineage (dataset: actinopterygii_odb10) rather than the metazoa or eukaryota lineage. A full list of available lineage datasets can be found on the BUSCO website.
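One way to picture the "most specific lineage" rule is to walk an organism's taxonomic lineage from the most specific taxon back towards the root and stop at the first taxon that has a dataset. The sketch below assumes dataset names follow the `<taxon>_odb10` pattern seen in the example above. The function name, the lineage and the set of available datasets are hypothetical, and this is not how BUSCO or UniProt necessarily implement the selection.

```python
def most_specific_dataset(lineage, available, suffix="_odb10"):
    """Pick the dataset for the deepest taxon in `lineage` that has one.

    lineage   -- taxon names ordered from root to most specific
    available -- names of the lineage datasets that exist
    Returns None when no taxon in the lineage has a dataset.
    """
    available_set = set(available)
    # Walk from the most specific taxon back towards the root.
    for taxon in reversed(lineage):
        candidate = f"{taxon.lower()}{suffix}"
        if candidate in available_set:
            return candidate
    return None


# Illustrative lineage for a fish and a hypothetical subset of datasets.
fish_lineage = ["Eukaryota", "Metazoa", "Actinopterygii", "Danio"]
datasets = {"eukaryota_odb10", "metazoa_odb10", "actinopterygii_odb10"}
print(most_specific_dataset(fish_lineage, datasets))
```

Here the genus has no dataset of its own, so the search falls back one level to actinopterygii_odb10 rather than to the broader metazoa or eukaryota datasets.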

UniProt is an ELIXIR core data resource
Main funding by: National Institutes of Health
