Refreshes the protein references for all peptide hits from an idXML file and adds target/decoy information.
pot. predecessor tools | ![]() ![]() | pot. successor tools |
IDFilter or any protein/peptide processing tool | FalseDiscoveryRate |
All peptide and protein hits are annotated with target/decoy information, using the meta value "target_decoy". For proteins the possible values are "target" and "decoy", depending on whether the protein accession contains the decoy pattern (parameter decoy_string
) as a suffix or prefix, respectively (see parameter prefix
). For peptides, the possible values are "target", "decoy" and "target+decoy", depending on whether the peptide sequence is found only in target proteins, only in decoy proteins, or in both. The target/decoy information is crucial for the FalseDiscoveryRate tool. (For FDR calculations, "target+decoy" peptide hits count as target hits.)
PeptideIndexer supports relative database filenames, which (when not found in the current working directory) are looked up in the directories specified by OpenMS.ini:id_db_dir
(see TOPP for Advanced Users).
By default this tool will fail if an unmatched peptide occurs, i.e. if the database does not contain the corresponding protein. You can force it to return successfully in this case by using the flag allow_unmatched
.
Some search engines (such as Mascot) will replace ambiguous amino acids ('B', 'Z', and 'X') in the protein database with unambiguous amino acids in the reported peptides, e.g. exchange 'X' with 'H'. This will cause such peptides to not be found by exactly matching their sequences to the database. However, we can recover these cases by using tolerant search (done automatically).
Two search modes are available:
The exact mode is much faster (about 10 times) and consumes less memory (about 2.5 times), but might fail to report a few protein hits with ambiguous amino acids for some peptides. Usually these proteins are putative, however. The exact mode also supports usage of multiple threads (threads
option) to speed up computation even further, at the cost of some memory. If tolerant searching needs to be done for unassigned peptides, the latter will consume the major share of the runtime. Independent of whether exact or tolerant search is used, we require ambiguous amino acids in peptide sequences to match exactly in the protein DB (i.e. 'X' in a peptide only matches 'X' in the database).
Leucine/Isoleucine: Further complications can arise due to the presence of the isobaric amino acids isoleucine ('I') and leucine ('L') in protein sequences. Since the two have the exact same chemical composition and mass, they generally cannot be distinguished by mass spectrometry. If a peptide containing 'I' was reported as a match for a spectrum, a peptide containing 'L' instead would be an equally good match (and vice versa). To account for this inherent ambiguity, setting the flag IL_equivalent
causes 'I' and 'L' to be considered as indistinguishable.
For example, if the sequence "PEPTIDE" (matching "Protein1") was identified as a search hit, but the database additionally contained "PEPTLDE" (matching "Protein2"), running PeptideIndexer with the IL_equivalent
option would report both "Protein1" and "Protein2" as accessions for "PEPTIDE". (This is independent of the error-tolerant search controlled by full_tolerant_search
and aaa_max
.)
Enzyme specificity: Once a peptide sequence is found in a protein sequence, this does not imply that the hit is valid! This is where enzyme specificity comes into play. By default, we demand that the peptide is fully tryptic (i.e. the enzyme parameter is set to "trypsin" and specificity is "full"). So unless the peptide coincides with C- and/or N-terminus of the protein, the peptide's cleavage pattern should fulfill the trypsin cleavage rule [KR][^P]. We make one exception for peptides starting at the second amino acid of a protein if the first amino acid of that protein is methionine (M), which is usually cleaved off in vivo. For example, the two peptides AAAR and MAAAR would both match a protein starting with MAAAR. You can relax the requirements further by choosing semi-tryptic
(only one of two "internal" termini must match requirements) or none
(essentially allowing all hits, no matter their context).
The command line parameters of this tool are:
INI file documentation of this tool:
OpenMS / TOPP release 2.0.0 | Documentation generated on Mon May 23 2016 21:42:41 using doxygen 1.8.11 |