Controlling the quality of protein datasets
The MisPred tool uses five principles to identify suspect protein sequence data in public databases. Such errors seriously undermine the reliability of these databases and can hamper the search for cancer-causing mutations, tumour suppressor genes and the various causes of life-threatening diseases.
The new tool is described by researchers, led by Professor Laszlo Patthy, of the Institute of Enzymology of the Hungarian Academy of Sciences, Budapest, in the open access journal BMC Bioinformatics.
“Recent studies have shown that a significant proportion of eukaryotic genes are mispredicted at the transcript level,” said Prof. Patthy.
“As the MisPred routines are able to detect many of these errors, and may aid in their correction, we suggest that it may significantly improve the quality of protein sequence data based on gene predictions.”
While there has been a significant improvement in the quality of genomic information recorded in public databases, there are still numerous examples of abnormal, incomplete or incorrectly predicted gene and protein sequences.
Manually locating and curating information about a genomic entity from the biomedical literature requires a vast amount of human effort and has led to the development of automated computational tools that capture and annotate data from the literature.
The MisPred approach uses five distinct routines for identifying possibly erroneous entries, based on the principle that “a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins.”
The five routines can be summarised as:
i) Extracellular or transmembrane proteins must have appropriate secretory signals;
ii) A protein with intra- and extra-cellular parts must have a transmembrane segment;
iii) Extracellular and nuclear domains must not occur in a single protein;
iv) The number of amino acid residues in closely related members of a globular domain family must fall into a relatively narrow range; and
v) A protein must be encoded by exons located on a single chromosome.
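To make the logic of the five checks concrete, they can be sketched as a simple rule filter. This is only an illustration of the rules as summarised above, not the MisPred implementation: the `ProteinRecord` class, its field names and the `suspect_rules` function are hypothetical, and the sketch assumes that signal peptides, transmembrane segments, domain locations, domain lengths and exon positions have already been annotated by other tools.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ProteinRecord:
    """Hypothetical, pre-annotated summary of one predicted protein."""
    has_secretory_signal: bool = False
    has_transmembrane_segment: bool = False
    has_extracellular_domain: bool = False
    has_cytoplasmic_domain: bool = False
    has_nuclear_domain: bool = False
    # One tuple per globular domain:
    # (observed length, shortest expected, longest expected), in residues.
    domain_lengths: List[Tuple[int, int, int]] = field(default_factory=list)
    # Chromosomes on which the protein's exons are located.
    exon_chromosomes: Set[str] = field(default_factory=set)


def suspect_rules(p: ProteinRecord) -> List[int]:
    """Return the numbers (1-5) of the rules that the record violates."""
    violated = []

    # i) Extracellular or transmembrane proteins need a secretory signal.
    if (p.has_extracellular_domain or p.has_transmembrane_segment) \
            and not p.has_secretory_signal:
        violated.append(1)

    # ii) Intra- plus extracellular parts require a transmembrane segment.
    if p.has_extracellular_domain and p.has_cytoplasmic_domain \
            and not p.has_transmembrane_segment:
        violated.append(2)

    # iii) Extracellular and nuclear domains must not co-occur.
    if p.has_extracellular_domain and p.has_nuclear_domain:
        violated.append(3)

    # iv) Each domain length must fall within its family's expected range.
    if any(not (low <= length <= high)
           for length, low, high in p.domain_lengths):
        violated.append(4)

    # v) All exons must lie on a single chromosome.
    if len(p.exon_chromosomes) > 1:
        violated.append(5)

    return violated


# Usage: an entry with an extracellular and a nuclear domain, no secretory
# signal, and exons on two chromosomes.
record = ProteinRecord(
    has_extracellular_domain=True,
    has_nuclear_domain=True,
    exon_chromosomes={"1", "7"},
)
print(suspect_rules(record))  # [1, 3, 5]
```

An entry that violates none of the rules returns an empty list. A flagged entry is only a candidate for review, since a rule violation marks a sequence as suspect rather than proving it wrong.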
The authors found that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions.
Even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although the authors note that this contamination was “to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON predicted entries”.
The researchers also point out that there are some exceptions to the five rules used by the MisPred system.
“Some secreted proteins may truly lack secretory signal peptides since they are subject to leaderless protein secretion. Similarly, it cannot be excluded at present that transchromosomal chimeras can be formed and may have normal physiological functions,” said Professor Patthy.
“Nevertheless, the fact that MisPred analyses of protein sequences of the Swiss-Prot database identified very few such exceptions indicates that the rules of MisPred are generally valid.”