pcr_match

Name

pcr_match - Find primer pairs in a DNA sequence database

Synopsis

pcr_match [ options ]

Description

pcr_match finds pairs of short DNA sequences, usually primers, in a (much) larger DNA sequence database such as the human genome. 

By default, pcr_match outputs a human readable alignment for each occurrence of a primer pair in the sequence database. The format of the alignments is completely configurable with the -A option.

pcr_match runs fastest when the sequence database has been pre-processed with compress_seq, but this is not required. If the sequence database has not been pre-processed with compress_seq, it must be in a regular FASTA format: each line, except for the last, of every sequence entry must hold the same number of sequence characters. If the sequence database is not in a regular FASTA format, the results may be incorrect. pcr_match will warn the user if the sequence database is not in a regular FASTA format.
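The regularity rule above can be checked before running a search. The following is a minimal sketch, not part of pcr_match; the function name is_regular_fasta is hypothetical, and it assumes the rule applies separately to each sequence entry and places no constraint on the last line of an entry.

```python
def is_regular_fasta(text):
    """Return True if, within each entry, all sequence lines except
    the last hold the same number of characters."""
    entries = []  # one list of sequence lines per FASTA entry
    for line in text.splitlines():
        if line.startswith(">"):
            entries.append([])          # a defline starts a new entry
        elif entries:
            entries[-1].append(line)    # sequence line of the current entry
    for seq_lines in entries:
        # Widths of every line except the last one of the entry.
        widths = {len(s) for s in seq_lines[:-1]}
        if len(widths) > 1:
            return False
    return True
```

For example, an entry with lines of 4, 4 and 2 characters is regular, while one with lines of 4, 3 and 2 characters is not.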

Options

-i FASTA_sequence_database

Name of the sequence database to search. Required.

-p primers

White space (space, tab, new line) separated list of primer sequences to find in the sequence database. When this option is used on the command line, the primers will usually need to be placed in quotes ("). The primer pairs must be consecutive in the list of primers. One of -p, -P, -F, or -S must be supplied.

-P primer_file

File containing a white space (space, tab, new line) separated list of primer sequences to find in the sequence database. The primer pairs must be consecutive in the list of primers. One of -p, -P, -F, or -S must be supplied.
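To illustrate what "consecutive" means for the -p and -P primer lists, here is a small sketch that groups a white space separated list into pairs. The function name primer_pairs is hypothetical, and rejecting an odd number of primers is an assumption of this sketch; the behavior of pcr_match in that case is not documented here.

```python
def primer_pairs(text):
    """Group a white space separated primer list into consecutive pairs."""
    primers = text.split()  # splits on spaces, tabs and new lines
    if len(primers) % 2 != 0:
        # Assumption of this sketch: an unpaired trailing primer is an error.
        raise ValueError("odd number of primers; pairs must be consecutive")
    # Primers 1 and 2 form the first pair, 3 and 4 the second, and so on.
    return list(zip(primers[0::2], primers[1::2]))
```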

-F fasta_primer_file

FASTA file containing a list of primer sequences to find in the sequence database. The primer pairs must be consecutive in the list of primers. One of -p, -P, -F, or -S must be supplied.

-S sts-format-file

UniSTS format file containing a list of primer pairs to find in the sequence database. One of -p, -P, -F, or -S must be supplied.

-o output_file

Output is written to the file output_file. If absent, output goes to standard output.

-k edit_distance

The maximum number of insertions, deletions, and substitutions permitted in any primer alignment. If absent, edit distance 0 is assumed. 

-K mismatches

The maximum number of mismatches permitted in any primer alignment.
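The difference between the two limits can be seen in a short sketch: edit distance (-k) counts insertions, deletions and substitutions, while mismatches (-K) counts substitutions only. The code below is a textbook dynamic-programming illustration with hypothetical function names, not the algorithm pcr_match uses, and it ignores wildcards.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("mismatch counting needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))
```

Applied to the sequences in the example alignment under Output Format, the left primer CTGTAATCCCAGGACTTTGG is at edit distance 2 from the database text CTTGTAATCCCAGAACTTTGG (one gap and one substitution), and the right primer has a single mismatch.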

-r 

Reverse complement the reverse (second) primer of each pair. This option is set automatically for UniSTS format primer pairs. Default: false.
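For reference, reverse complementing a primer means complementing each base and reversing the result. A minimal sketch follows; the function name is hypothetical, and only the four unambiguous bases are handled (IUPAC ambiguity codes are left unchanged, which is a simplification).

```python
# Translation table for the four unambiguous bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```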

-a

Output all primer pair orientations. Default: false.

-x l 

Length of the exact seed, or word size as in BLAST, required by any primer alignment. Can be combined with other options.

-s ( l | ~l ) 

Constrain the first l primer characters to match exactly; any insertions, deletions or substitutions must occur after position l. The reverse complement of a primer must also have its first l characters match exactly. Note that a wildcard match is considered an exact match. With the ~ modifier, the first l primer characters are constrained to match inexactly, the remaining characters must match exactly. 

-e ( l | ~l ) 

Constrain the last l primer characters to match exactly; any insertions, deletions or substitutions must occur before position l. The reverse complement of a primer must also have its last l characters match exactly. Note that a wildcard match is considered an exact match.  With the ~ modifier, the last l primer characters are constrained to match inexactly, the remaining characters must match exactly.
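The -s and -e constraints can be illustrated for the simplest case, an alignment with substitutions only, where primer and database text have the same length. This is a sketch under that assumption, with hypothetical function names; it does not handle insertions, deletions or wildcards, and it is not the pcr_match implementation.

```python
def meets_start_constraint(primer, target, l, inexact=False):
    """-s l: the first l characters match exactly.
    -s ~l (inexact=True): only the characters after the first l must match."""
    if len(primer) != len(target):
        raise ValueError("sketch covers substitution-only alignments")
    if inexact:
        return primer[l:] == target[l:]
    return primer[:l] == target[:l]


def meets_end_constraint(primer, target, l, inexact=False):
    """-e l: the last l characters match exactly.
    -e ~l (inexact=True): only the characters before the last l must match."""
    if len(primer) != len(target):
        raise ValueError("sketch covers substitution-only alignments")
    split = len(primer) - l  # index where the last l characters begin
    if inexact:
        return primer[:split] == target[:split]
    return primer[split:] == target[split:]
```

With the right primer of the example alignment under Output Format, whose single substitution is at the 14th character, a start constraint of 13 is met and a start constraint of 14 is not.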

-5 ( l | ~l )

Constrain the l primer characters at the 5' end of the primer to match exactly; any insertions, deletions or substitutions must occur after position l from the 5' end of the primer. The reverse complement of a primer must also have the l characters at its 5' end match exactly. Note that a wildcard match is considered an exact match. With the ~ modifier, the l primer characters at the 5' end of the primer are constrained to match inexactly, the remaining characters must match exactly. 

-3 ( l | ~l )

Constrain the l primer characters at the 3' end of a primer to match exactly; any insertions, deletions or substitutions must occur after position l from the 3' end of the primer. The reverse complement of a primer must also have the l characters at its 3' end match exactly. Note that a wildcard match is considered an exact match. With the ~ modifier, the l primer characters at the 3' end of the primer are constrained to match inexactly, the remaining characters must match exactly. 

-w 

Respect IUPAC ambiguity codes as wildcards, in both the sequence database and the primers. A symbol from the sequence database is considered a wildcard match to a primer symbol if either set of represented DNA symbols contains the other. The only exception is that an N in the sequence database does not match any primer symbol. Note: this is almost certainly what you want, as long stretches of Ns are often used to indicate gaps in assembled sequence.

-W 

Respect IUPAC ambiguity codes as wildcards, in both the sequence database and the primers. A symbol from the sequence database is considered a wildcard match to a primer symbol if either set of represented DNA symbols contains the other. Also respects Ns in the sequence databases.  
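The wildcard rule shared by -w and -W can be sketched with sets of represented bases. The table below holds the standard IUPAC nucleotide codes; the function name wildcard_match and its match_db_n parameter are hypothetical. Treating match_db_n=False as the -w behavior and match_db_n=True as the -W behavior is this sketch's reading of the two descriptions above.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}


def wildcard_match(db_symbol, primer_symbol, match_db_n=False):
    """True if either symbol's set of bases contains the other's."""
    if db_symbol == "N" and not match_db_n:
        # Exception described for -w: a database N matches nothing.
        return False
    d, p = IUPAC[db_symbol], IUPAC[primer_symbol]
    return d <= p or p <= d  # subset test in either direction
```

Under this rule R (A or G) matches A, but R does not match Y (C or T), because neither set contains the other.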

-u 

Force all primers to uppercase characters.

-m min-length

Minimum length, in bases, of the amplicon product of the primer pairs. Default: 0.

-M max-length

Maximum length, in bases, of the amplicon product of the primer pairs. Default: 2000.

-d deviation

Maximum deviation of the length, in bases, of the amplicon product of the primer pairs from the length specified in the UniSTS format primer file. UniSTS format primers required. Default: no constraint.

-b

Measure the length of the amplicon as the number of bases between the primers.
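The two ways of measuring amplicon length can be sketched with space-based coordinates. The default measure below (right end minus left start, so both primers are included) is inferred from the example alignment under Output Format, where 59395 - 57681 = 1714; the -b measure as right start minus left end is this sketch's reading of "bases between primers". The function name and parameters are hypothetical.

```python
def amplicon_length(left_start, left_end, right_start, right_end,
                    between_primers=False):
    """Amplicon length from space-based alignment coordinates.

    between_primers=False: span from the start of the left primer
    alignment to the end of the right one (primers included).
    between_primers=True: bases strictly between the two alignments,
    as with -b.
    """
    if between_primers:
        return right_start - left_end
    return right_end - left_start
```

The test values below take the left alignment to cover 21 database bases and the right one 20, as in the example alignment; the resulting left end (57702) and right start (59375) are derived for illustration and do not appear in the example output.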

-A format

Output format for primer alignments. See Output Format below. If present, alignments will be output.

-R report_interval

Usually, pcr_match accumulates many matches before taking the time to output alignments. This reduces the running time tremendously. However, if you are debugging or want reassurance that pcr_match is actually doing something, setting report_interval to 1 will force pcr_match to report alignments as they are found.

-E eos

Consider the sequence character with ASCII code eos to represent the end of the sequence in a FASTA entry. This character can never be part of an alignment, unless it is explicitly included in a primer sequence. By default, 12 (new line) is considered the end of sequence character. The end of sequence character is inserted by compress_seq.

-D ( 0 | 1 | 2 | 3 | 4 )

Select the sequence database pre-processing strategy. The default, 0, will choose the fastest strategy, based on the pre-processing done, or not done, by compress_seq.

  1. Sequence database has not been pre-processed.
  2. Sequence database has been indexed by compress_seq. This is the default behavior of compress_seq.
  3. Sequence database has been indexed and normalized by compress_seq, using the option -n true.
  4. Sequence database has been indexed, normalized and compressed by compress_seq, using the option -z true.

Given the availability of pre-processed sequence database files, option 3 is selected first, then option 4, then option 2, then option 1. This will typically represent the fastest possible run time. 
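The default selection order described above amounts to a fixed preference list. A minimal sketch follows, assuming the caller already knows which strategies are usable for a given database; how pcr_match detects the pre-processed files is not covered here, and the names are hypothetical.

```python
# Preference order used when -D 0 (the default) is in effect.
PREFERENCE = (3, 4, 2, 1)


def choose_strategy(available):
    """Return the most preferred strategy among those available."""
    for strategy in PREFERENCE:
        if strategy in available:
            return strategy
    raise ValueError("no usable pre-processing strategy")
```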

-B

Use buffered standard I/O rather than mmap to stream through the sequence database. On some platforms, where the behavior of mmap is somewhat unpredictable, this option may make it possible to run pcr_match reliably.

-v 

Verbose (version & diagnostic) output.

-h 

Command-line help.

Output Format

An example of the default alignment output format:

>gi|21700565|gb|AC092408.3| Papio anubis clone RP41-446H8, complete sequence
CTTGTAATCCCAGAACTTTGG 57681 ... 1714 ... 59395 CCCCGTCTCTACTAAAAATA
||^||||||||||*|||||||                          |||||||||||||*||||||
CT-GTAATCCCAGGACTTTGG F                      R CCCCGTCTCTACTTAAAATA D11S3114 REVERSE-STRAND

The command line option -A gives the user explicit control over the output of alignments. The format string contains conversion characters, which specify pieces of the alignment output.

Alignment format conversion characters:

%h FASTA header (defline) of the sequence entry containing the alignment.
%H First "word" of the FASTA header (defline) of the sequence entry containing the alignment. The first word is everything up to (but not including) the first whitespace character of the defline.
%f Index of the FASTA entry containing the alignment.
%>s Start position of the "left" primer alignment within the FASTA entry (space based).
%<s Start position of the "right" primer alignment within the FASTA entry (space based).
%>e End position of the "left" primer alignment in the FASTA entry (space based).
%<e End position of the "right" primer alignment in the FASTA entry (space based).
%>l Length of the "left" primer alignment.
%<l Length of the "right" primer alignment.
%l Length of the amplicon.
%>5 Position of the 5' end of the "left" primer alignment in the sequence entry (space based).
%<5 Position of the 5' end of the "right" primer alignment in the sequence entry (space based).
%>3 Position of the 3' end of the "left" primer alignment in the sequence entry (space based).
%<3 Position of the 3' end of the "right" primer alignment in the sequence entry (space based).
%>S Start position (absolute) of the "left" primer alignment in the sequence database.
%<S Start position (absolute) of the "right" primer alignment in the sequence database.
%>E End position (absolute) of the "left" primer alignment in the sequence database.
%<E End position (absolute) of the "right" primer alignment in the sequence database.
%i Index of the aligned primer pair.
%>d Edit distance (number of insertions, deletions, substitutions) of the "left" primer alignment.
%<d Edit distance (number of insertions, deletions, substitutions) of the "right" primer alignment.
%>p The (forward) sequence of the "left" primer, whether it was found in its forward or reverse complement form.
%<p The (forward) sequence of the "right" primer, whether it was found in its forward or reverse complement form.
%>P The FASTA header (defline) of the "left" primer, if the primers came from a FASTA format file. Otherwise, "".
%<P The FASTA header (defline) of the "right" primer, if the primers came from a FASTA format file. Otherwise, "".
%I The STS identifier of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%L The STS length of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%D The absolute value of the difference between the length of the amplicon and the STS length of the primer pair, if the primers came from a UniSTS format file.
%a The STS accession of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%O The STS organism of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%& The alternative STS accessions of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%X The STS chromosome of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%>q The "left" primer sequence of the alignment.
%<q The "right" primer sequence of the alignment.
%>Q The "left" primer sequence of the alignment, with alignment characters to indicate an insertion.
%<Q The "right" primer sequence of the alignment, with alignment characters to indicate an insertion.
%>t The "left" aligned sequence from the sequence database.
%<t The "right" aligned sequence from the sequence database.
%>T The "left" aligned sequence from the sequence database, with alignment characters to indicate deletion.
%<T The "right" aligned sequence from the sequence database, with alignment characters to indicate deletion.
%>A The string of alignment characters indicating exact match, insertion, deletion and substitution at each position of the "left" primer alignment.
%<A The string of alignment characters indicating exact match, insertion, deletion and substitution at each position of the "right" primer alignment.
%>r "F" if the forward form of the "left" primer was found, "R" if the reverse complement form of the "left" primer was found.
%<r "F" if the forward form of the "right" primer was found, "R" if the reverse complement form of the "right" primer was found.
%>R " REVCOMP" if the reverse complement form of the "left" primer was found, "" otherwise.
%<R " REVCOMP" if the reverse complement form of the "right" primer was found, "" otherwise.
%R " REVERSE-STRAND" if the primer pair was found in reverse strand orientation (second primer first), "" otherwise.
%@ The sequence of the amplicon amplified by the primer pair.
%N The number of N's in the sequence of the amplicon "amplified" by the primer pair.
%0 e-PCR format output.
%% Percent (%).

The default alignment format is ">%h\n %>T %>s ... %l ... %<e %<T\n %>A %!>s %!l %!<e %<A\n %>Q %>r%!>s %!l %!<e%<r %<Q %a%R\n". The character '!' in the format indicates that the formatted data should be output as the same number of spaces instead.
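How conversion characters and the '!' modifier expand can be sketched with a small formatter. This is an illustration only, not the pcr_match implementation: the function name expand and the values dictionary are hypothetical, the caller supplies a value for every conversion used, and backslash escapes such as \n are assumed to have been interpreted already.

```python
import re

# A conversion is '%', an optional '!', an optional '>' or '<',
# and one conversion character.
_CONVERSION = re.compile(r"%(!?)([<>]?)(.)", re.DOTALL)


def expand(fmt, values):
    """Expand conversions in fmt using values, a dict keyed by the
    conversion without '%' or '!' (for example ">s", "<e", "l")."""
    def replace(match):
        blank, side, char = match.groups()
        if char == "%" and not blank and not side:
            return "%"                      # '%%' is a literal percent
        text = str(values[side + char])
        # With '!', emit spaces of the same width instead of the data.
        return " " * len(text) if blank else text
    return _CONVERSION.sub(replace, fmt)
```

With the positions from the example alignment, the fragment "%>s ... %l ... %<e" expands to "57681 ... 1714 ... 59395", and "%!>s" expands to five spaces, which is how the lower lines of the default format stay aligned with the first.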

See Also

primer_match, compress_seq

Author

Nathan Edwards