pScan: A database preprocessing software tool for proteomics
ABSTRACT
Summary: pScan is a flexible tool that helps biologists to preprocess
protein sequence databases in proteomics research. Besides the
commonly used functions, such as sequence pattern-matching, building
decoy databases, and converting protein sequence databases to
peptide sequence databases, pScan also supports querying and substituting
protein entries based on regular expressions, creating customized
databases, and conducting statistical characterization of the
databases. pScan can greatly help biologists to improve the design
of proteomics experiments and to facilitate the database search
and analysis by making full use of the information content contained
in the sequence databases.
1 INTRODUCTION
Database searching is a commonly used method for peptide identification
in high-throughput proteomics. Protein sequence databases, such
as Swiss-Prot, IPI and the NCBI-nr, play a critical role in proteomics.
Several database preprocessing toolkits are currently available, such
as Kangaroo (Betel et al., 2002), DecoyDBB (Reidegeld et al., 2008),
and DBToolkit (Martens et al., 2005), and they have already helped
biologists considerably in processing protein sequence databases.
Kangaroo is a sequence pattern-matching toolkit, DecoyDBB can build
target-decoy databases with three different decoy strategies,
and DBToolkit can convert protein sequence databases to peptide
sequence databases to enhance protein identification. However,
these commonly used functions are implemented separately in different
software packages. Moreover, some specialized database preprocessing
functions have not been implemented in the available software,
although they are extremely useful for designing proteomics experiments
and for facilitating the database search. These include querying and
substituting protein entries based on regular expressions,
creating customized databases, and conducting statistical characterization
of the databases. To address this gap, we have developed an
integrated software tool named pScan for protein sequence
database preprocessing.
pScan is an easily extensible and user-friendly database preprocessing
toolbox. First, pScan allows biologists to edit, query and substitute
the accession ID, the description information and the sequence
of each entry in a FASTA file using various types
of regular expressions. Second, pScan can be used to create
customized databases, e.g., sub-species databases, N- and C-terminal
sequence databases, and target-decoy databases with different
decoy strategies, which are very helpful for peptide identification
in database search engines, such as pFind (Wang et al., 2007),
SEQUEST and Mascot. Third, pScan also supports the statistical
characterization of the protein sequence databases, for example,
the ratio of digested peptides with a specific amino acid to all
peptide sequences, the ratio of digested peptides with special
modification patterns (e.g., 'NXS/T/C' in glycosylation and
'S/T/Y' in phosphorylation) to all peptide sequences, and the
distribution of mass values of all peptides (with or without modifications)
obtained from digestion of the proteins. The flexible manipulations
in pScan can greatly help to improve the design of proteomics
experiments and to facilitate the database search and analysis
by making full use of the information content contained in the
sequence databases.
2 FUNCTIONALITIES & APPLICATIONS
pScan can perform various types of preprocessing on protein sequence
databases. Here, we present some commonly used applications in
pScan.
Display, Query and Substitute Sequences
Besides the commonly used regular expression based sequence
pattern-matching against the entire database file, pScan can also
help biologists to display, query and substitute the accession ID,
the description information and the sequence for each entry included
in the sequence databases, collectively or separately.
For example, in N-glycosylation site analysis biologists are often
interested in the sequence motif 'NXS/T/C', where X may be
any amino acid except proline (Bause et al., 1979). pScan
has been successfully used to substitute the letter N with J,
which was defined to have the same mass as Asn, to conduct the
database searching by pFind in large-scale identification of core
fucosylated glycoproteins (Jia et al., 2009; Fu et al., 2009).
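The substitution described above can be sketched with a regular expression. This is only an illustrative sketch, not pScan's implementation: it assumes that the substitution targets the N of each 'NXS/T/C' motif with X not proline, and the function name `mark_sequons` is hypothetical.

```python
import re

# N followed by any residue except proline (P), then S, T or C.
# The lookahead leaves the two following residues unconsumed, so only
# the N itself is replaced and overlapping motifs are all found.
SEQUON = re.compile(r"N(?=[^P][STC])")


def mark_sequons(sequence: str, placeholder: str = "J") -> str:
    """Replace the N of every NX(S/T/C) motif (X != P) with `placeholder`."""
    return SEQUON.sub(placeholder, sequence)


# Example: the N in "NPT" is left untouched because X is proline.
print(mark_sequons("MKNGSAANPTLLNNTSC"))
```

The search engine must then be configured so that J carries the same mass as Asn, as stated in the text.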
Create Customized Databases
pScan can help biologists to extract any protein sub-database that they
want from the NCBI taxonomy database or any other database, based
on self-defined regular expressions. For example, the 'bovin'
protein database can be easily retrieved from the NCBI taxonomy database
by entering the regular expression 'bovin' into the 'DE (DEscription)'
query edit box in pScan.
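A description-based query of this kind can be sketched as follows. The sketch makes several assumptions that are not taken from pScan: the accession is the first whitespace-delimited token of the FASTA header, the rest of the header is the description, and matching is case-insensitive. The names `parse_fasta` and `filter_by_description` are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Entry = Tuple[str, str, str]  # (accession, description, sequence)


def _make_entry(header: str, chunks: List[str]) -> Entry:
    # Assumption: accession = first token, description = remainder.
    parts = header.split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return accession, description, "".join(chunks)


def parse_fasta(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one (accession, description, sequence) tuple per entry."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_entry(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_entry(header, chunks)


def filter_by_description(entries: Iterable[Entry], pattern: str) -> List[Entry]:
    """Keep entries whose description matches `pattern` (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [entry for entry in entries if regex.search(entry[1])]
```

With a file on disk, `filter_by_description(parse_fasta(open(path)), "bovin")` would return the matching entries, which can then be written out as a new FASTA file.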
pScan can create N- or C-terminal sequence databases, which contain
the first n residues from the N- or C-terminal side of each target
sequence, respectively. In contrast to shotgun proteomics, terminal
proteomics lets biologists retrieve higher-fidelity results
because of the high information content of terminal sequences (Nakazawa
et al., 2008).
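Extracting a terminal sequence amounts to slicing each protein sequence. The following minimal sketch uses a hypothetical function `terminal_subsequence`; how pScan handles proteins shorter than n residues is not stated in the text, so returning the whole sequence in that case is an assumption.

```python
def terminal_subsequence(sequence: str, n: int, terminus: str = "N") -> str:
    """Return the n residues at the N- or C-terminal end of `sequence`.

    Assumption: a sequence shorter than n is returned unchanged.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if terminus == "N":
        return sequence[:n]
    if terminus == "C":
        return sequence[-n:]
    raise ValueError("terminus must be 'N' or 'C'")
```

Applying this function to every entry of a database yields the corresponding terminal sequence database.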
The target-decoy search strategy is a widely used method to control
the false discovery rate. Reverse and shuffle strategies have
been implemented in pScan to create decoy databases. A reverse database
is created simply by reversing the target protein sequences. A shuffle
database is built by moving each letter of the target protein
sequence to a randomly chosen position in the decoy sequence.
pScan can be used to create two types of databases: the composite
target-decoy database and the decoy-only database.

Fig. 1. The human IPI database version 3.55 is used to conduct the statistical
characterization. (a) The ratio of digested peptides with specific
amino acids to all peptide sequences. (b) The ratio of digested
peptides with special modification patterns (e.g., 'NXS/T/C'
in glycosylation, 'S/T/Y' in phosphorylation, 'M' in oxidation,
and 'C' in carbamidomethylation) to all peptide sequences. (c)
Mass distribution of phosphorylated peptides for nominal mass
950 u and 1050 u. (d) Robust and extensible framework in the core
implementation of pScan.
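The reverse and shuffle decoy strategies described in the text can be sketched as below. This is an illustration under stated assumptions rather than pScan's code: the function names, the optional `seed` argument and the `DECOY_` accession prefix are all hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Protein = Tuple[str, str]  # (accession, sequence)


def reverse_decoy(sequence: str) -> str:
    """Reverse strategy: read the target sequence backwards."""
    return sequence[::-1]


def shuffle_decoy(sequence: str, seed: Optional[int] = None) -> str:
    """Shuffle strategy: place every residue at a random position.

    The residue composition of the target is preserved.
    """
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def decoy_database(
    targets: List[Protein],
    strategy: Callable[[str], str] = reverse_decoy,
    prefix: str = "DECOY_",  # hypothetical label for decoy accessions
) -> List[Protein]:
    """Decoy-only database built from the target entries."""
    return [(prefix + acc, strategy(seq)) for acc, seq in targets]


def composite_database(
    targets: List[Protein],
    strategy: Callable[[str], str] = reverse_decoy,
) -> List[Protein]:
    """Composite database: all targets followed by their decoys."""
    return list(targets) + decoy_database(targets, strategy)
```

Passing a fixed `seed` makes the shuffled database reproducible, which is convenient when a search has to be repeated.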
Conduct Statistical Characterization
IndexToolkit (Li et al., 2006), a powerful software package for enzymatic
protein digestion and indexing, has been integrated into pScan
to generate all peptides obtained from digestion of the proteins. Currently,
three peptide-level statistical characterization methods
have been implemented in pScan to improve the design of experiments.
The human IPI database version 3.55 is used here to illustrate the statistical
characterization.
First, pScan can be used to calculate the ratio of digested peptides
with a specific amino acid to all peptide sequences (Fig. 1a),
which is useful for stable isotope labeling in quantitative
proteomics.
Second, pScan is able to calculate the ratio of digested peptides with
special modification patterns (e.g., 'NXS/T/C' in glycosylation,
'S/T/Y' in phosphorylation, 'M' in oxidation, and 'C' in
carbamidomethylation) to all peptide sequences (Fig. 1b), which
is very helpful for the study of post-translational modifications.
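The two ratios above can be illustrated with a small sketch. It does not use IndexToolkit; instead it assumes a simple tryptic rule (cleave after K or R unless the next residue is P, no missed cleavages), and it counts every digested peptide once per occurrence. The enzyme choice and the names `digest` and `ratio_matching` are assumptions made only for illustration.

```python
import re
from typing import Iterable, List


def digest(sequence: str, min_length: int = 1) -> List[str]:
    """In silico tryptic digestion without missed cleavages (assumed rule)."""
    peptides: List[str] = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_length]


def ratio_matching(peptides: Iterable[str], pattern: str) -> float:
    """Fraction of peptides containing at least one match of `pattern`."""
    peptide_list = list(peptides)
    if not peptide_list:
        return 0.0
    regex = re.compile(pattern)
    hits = sum(1 for p in peptide_list if regex.search(p))
    return hits / len(peptide_list)


peptides = digest("MNGSKACDRPEYKLLK")
cysteine_ratio = ratio_matching(peptides, "C")          # specific amino acid
sequon_ratio = ratio_matching(peptides, "N[^P][STC]")   # glycosylation motif
phospho_ratio = ratio_matching(peptides, "[STY]")       # phosphorylation sites
```

Because both statistics reduce to a regular expression search over the digested peptides, a single helper covers specific amino acids and modification motifs alike.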
Third, pScan can calculate the mass distribution of
all peptides (with or without modifications) obtained from digestion
of the proteins (Fig. 1c).
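A peptide mass distribution can be sketched by summing monoisotopic residue masses and binning the results. The residue masses and the phosphorylation shift below are standard values rounded to five decimals. The treatment of modifications (a fixed mass shift on every occurrence of a residue), the bin width and the function names are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Monoisotopic residue masses in u (standard values).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056     # H2O added once per peptide
PHOSPHO = 79.96633   # HPO3 mass shift


def peptide_mass(peptide: str,
                 mod_deltas: Optional[Dict[str, float]] = None) -> float:
    """Monoisotopic neutral mass; `mod_deltas` shifts every listed residue.

    Raises KeyError for letters outside the 20 standard residues.
    """
    deltas = mod_deltas or {}
    return WATER + sum(RESIDUE_MASS[r] + deltas.get(r, 0.0) for r in peptide)


def mass_histogram(peptides: Iterable[str], bin_width: float = 1.0,
                   mod_deltas: Optional[Dict[str, float]] = None) -> Counter:
    """Count peptides per mass bin, keyed by the lower edge of each bin."""
    counts: Counter = Counter()
    for peptide in peptides:
        mass = peptide_mass(peptide, mod_deltas)
        counts[int(mass // bin_width) * bin_width] += 1
    return counts
```

For example, `mass_histogram(peptides, 0.01, {"S": PHOSPHO})` would give a fine-grained distribution for peptides with every serine phosphorylated; a narrower bin width resolves the structure within a single nominal mass.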
These statistical characterizations help biologists
to design their experiments with more careful consideration and
to obtain more reliable identification results.
3 CONCLUSIONS
In sum, pScan can greatly help biologists to improve the design
of proteomics experiments and to facilitate the database search
and analysis by making full use of the information content contained
in the sequence databases. pScan has been integrated into
pFind Studio (http://pfind.ict.ac.cn), a new, efficient
and effective software platform for mass spectrometry-based proteomics,
and has also been successfully applied in numerous tasks involving the
design of experiments and database search. With the robust and
extensible framework in the core implementation (Fig. 1d), new
functions can be readily incorporated into pScan as needed in
the future.