primer_match

Name

primer_match - Find and count primers in a DNA sequence database

Synopsis

primer_match [ options ]

Description

primer_match finds and counts exact and near exact instances of short DNA sequences, usually primers, in a (much) larger DNA sequence database such as the human genome. 

By default, primer_match outputs a human-readable alignment for each occurrence of a primer in the sequence database. With the appropriate option, -c, primer_match will output the number of occurrences of each primer. The format of the alignments and counts is fully configurable with the -A and -C options.

primer_match runs fastest when the sequence database has been pre-processed with compress_seq, but this is not necessary. If the sequence database has not been pre-processed with compress_seq, it must be in a regular FASTA format: each line, except for the last, of every sequence entry must hold the same number of sequence characters. If the sequence database is not in a regular FASTA format, the results may be incorrect. primer_match will warn the user if the sequence database is not in a regular FASTA format.
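The regularity rule above can be checked before a long search. The following Python sketch is not part of primer_match; the function name is hypothetical, and it assumes that "the same number of sequence characters" means one line width shared across the whole file.

```python
def is_regular_fasta(text):
    """Return True if every sequence line, except the last of each entry, has one width."""
    width = None    # width shared by all non-final sequence lines (assumed file-wide)
    pending = None  # length of the previous sequence line in the current entry
    for line in text.splitlines():
        if line.startswith(">"):
            # A new entry begins: the pending line was the last line of its
            # entry, so its length is not constrained.
            pending = None
            continue
        if pending is not None:
            # The pending line is followed by more sequence, so it must be full width.
            if width is None:
                width = pending
            elif pending != width:
                return False
        pending = len(line)
    return True
```

Blank lines are treated as empty sequence lines here, so a stray blank line inside an entry is reported as irregular.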

Options

-i FASTA_sequence_database

Name of the sequence database to search. Required.

-p primers

White space (space, tab, new line) separated list of primer sequences to find in the sequence database. When this option is used on the command line, the primers will usually need to be placed in quotes ("). One of -p, -P, -F, or -S must be supplied.

-P primer_file

File containing a white space (space, tab, new line) separated list of primer sequences to find in the sequence database. One of -p, -P, -F, or -S must be supplied.

-F fasta_primer_file

FASTA file containing a list of primer sequences to find in the sequence database. One of -p, -P, -F, or -S must be supplied.

-S sts-format-file

UniSTS format file containing a list of primer sequences to find in the sequence database. One of -p, -P, -F, or -S must be supplied.

-o output_file

Output is redirected into the file output_file. If absent, output goes to standard out.

-k edit_distance

The maximum number of insertions, deletions, and substitutions permitted in any primer alignment. If absent, edit distance 0 is assumed. 

-K mismatches

The maximum number of mismatches permitted in any primer alignment.
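To illustrate the difference between the two limits, the Python sketch below counts mismatches (taken here to mean substitutions only, which is an assumption) and computes a standard edit distance. It is a textbook dynamic program over plain characters, not the search algorithm used by primer_match, and it ignores wildcards.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("mismatch counting needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]
```

A single-base deletion, for example, costs one edit but cannot be expressed as a mismatch count because the two strings differ in length.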

-r 

Search for the reverse complements of the primers too.
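For reference, the reverse complement of a primer can be computed as in the Python sketch below. The function name is hypothetical, and the handling of IUPAC ambiguity codes and lowercase letters is an assumption for illustration, not a description of primer_match internals.

```python
# Each IUPAC code paired with the code for its complementary base set.
_CODES = "ACGTRYSWKMBDHVN"
_COMPS = "TGCAYRSWMKVHDBN"
_COMPLEMENT = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(primer):
    """Complement every symbol, then reverse the string."""
    return primer.translate(_COMPLEMENT)[::-1]
```

For instance, reverse_complement("CAGTCAGATCTGAG") gives "CTCAGATCTGACTG", the pair that appears as primer 3 in the counts example under Output Format.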

-x l 

Length of the exact seed, or word size (as in BLAST), required by any primer alignment. Can be combined with other options.

-s ( l | ~l ) 

Constrain the first l primer characters to match exactly; any insertions, deletions or substitutions must occur after position l. The reverse complement of a primer must also have its first l characters match exactly. Note that a wildcard match is considered an exact match. With the ~ modifier, the first l primer characters are constrained to match inexactly, the remaining characters must match exactly. 

-e ( l | ~l ) 

Constrain the last l primer characters to match exactly; any insertions, deletions or substitutions must occur before position l. The reverse complement of a primer must also have its last l characters match exactly. Note that a wildcard match is considered an exact match.  With the ~ modifier, the last l primer characters are constrained to match inexactly, the remaining characters must match exactly.

-5 ( l | ~l )

Constrain the l primer characters at the 5' end of the primer to match exactly; any insertions, deletions or substitutions must occur after position l from the 5' end of the primer. The reverse complement of a primer must also have the l characters at its 5' end match exactly. Note that a wildcard match is considered an exact match. With the ~ modifier, the l primer characters at the 5' end of the primer are constrained to match inexactly, the remaining characters must match exactly. 

-3 ( l | ~l )

Constrain the l primer characters at the 3' end of a primer to match exactly; any insertions, deletions or substitutions must occur after position l from the 3' end of the primer. The reverse complement of a primer must also have the l characters at its 3' end match exactly. Note that a wildcard match is considered an exact match. With the ~ modifier, the l primer characters at the 3' end of the primer are constrained to match inexactly, the remaining characters must match exactly. 

-w 

Respect IUPAC ambiguity codes as wildcards, in both the sequence database and the primers. A symbol from the sequence database is considered a wildcard match to a primer symbol if either set of represented DNA symbols contains the other. The only exception is that an N in the sequence database does not match any primer symbol. Note: this is almost certainly what you want, as long stretches of Ns are often used to indicate gaps in assembled sequence.

-W 

Respect IUPAC ambiguity codes as wildcards, in both the sequence database and the primers. A symbol from the sequence database is considered a wildcard match to a primer symbol if either set of represented DNA symbols contains the other. Unlike -w, an N in the sequence database is also treated as a wildcard.
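The containment rule shared by -w and -W can be sketched in Python as follows. The function and parameter names are hypothetical; the table is the standard IUPAC nucleotide code, and the flag n_in_database_matches stands in for the difference between -w (False) and -W (True).

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def wildcard_match(db_symbol, primer_symbol, n_in_database_matches=False):
    """True if either symbol's base set contains the other's."""
    if db_symbol == "N" and not n_in_database_matches:
        # -w behaviour: an N in the sequence database matches nothing.
        return False
    db_set = set(IUPAC[db_symbol])
    primer_set = set(IUPAC[primer_symbol])
    return db_set <= primer_set or primer_set <= db_set
```

Under this rule T in the database matches W in a primer, while R and K do not match each other because neither set {A,G} nor {G,T} contains the other.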

-u 

Force all primers to uppercase characters.

-M max

Stop counting primer occurrences once a primer has been seen max times.

-A format

Output format for primer alignments. See Output Format below. If present, alignments will be output.

-C format

Output format for primer counts. See Output Format below. If present, counts will be output.

-R report_interval

Usually, primer_match accumulates many matches before taking the time to output alignments. This reduces the running time tremendously. However, if you are debugging or want reassurance that primer_match is actually doing something, setting report_interval to 1 will force primer_match to report alignments as they are found.

-E eos

Consider the sequence character with ASCII code eos to represent the end of the sequence in a FASTA entry. This character can never be part of an alignment, except if explicitly included in a primer sequence. By default, 12 (new line; the ASCII new line character is 12 in octal, 10 in decimal) is considered the end-of-sequence character. The end-of-sequence character is inserted by compress_seq.

-D ( 0 | 1 | 2 | 3 | 4 )

Select the sequence database pre-processing strategy. The default, 0, will choose the fastest strategy, based on the pre-processing done, or not done, by compress_seq.

  1. Sequence database has not been pre-processed.
  2. Sequence database has been indexed by compress_seq. This is the default behavior of compress_seq.
  3. Sequence database has been indexed and normalized by compress_seq, using the option -n true.
  4. Sequence database has been indexed, normalized and compressed by compress_seq, using the option -z true.

Given the availability of pre-processed sequence database files, option 3 is selected first, then option 4, then option 2, then option 1. This will typically represent the fastest possible run time. 
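The preference order just described amounts to a first-available lookup. The Python sketch below is purely illustrative: the function name is hypothetical, and how primer_match actually detects which pre-processed files exist is not stated here.

```python
# Strategy numbers as listed above, in the order the default (0) prefers them.
PREFERENCE = (3, 4, 2, 1)


def choose_strategy(available):
    """Return the most preferred strategy number present in `available`."""
    for strategy in PREFERENCE:
        if strategy in available:
            return strategy
    raise ValueError("no usable sequence database representation")
```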

-B

Use buffered standard I/O rather than mmap to stream through the sequence database. On some platforms, where the use of mmap is somewhat unpredictable, this option may make it possible to run primer_match reliably. 

-v 

Verbose (version & diagnostic) output.

-h 

Command-line help.

Output Format

The default alignment output format is

>defline
sequence start end edits
alignment
primer index rc?

where defline is the FASTA header line of the sequence entry containing the alignment; sequence is the aligned sequence from the sequence database; start and end are the space based start and end positions of the aligned sequence in the sequence entry; edits is the number of insertions, deletions, and substitutions in the alignment; alignment is a series of alignment characters indicating match, insertion, deletion or substitution at each position of the alignment; primer is the aligned primer sequence; index is the index of this primer in the primer input set; and rc? is "REVCOMP" if the primer matched in its reverse complement form.

For example

>CCO_UID:219000002141424:BAC_UID:human_12212001_reproc:LEN:33337
AGATCGCAGGTACATAAATGCTTCT 20115 20140 0
|||||||||||||||||||||||||
AGATCGCAGGTACATAAATGCTTCT 3242
>CCO_UID:219000002142926:BAC_UID:human_12212001_reproc:LEN:2262
CCCATTCAGTCTTTCTTTTAAAAACATTTATTTTTAATTCAT 1671 1713 0
||||||||||||||||||||||||||||||||||||||||||
CCCATTCAGTCTTTCTTTTAAAAACATTTATTTTTAATTCAT 4781 REVCOMP

and

>gi|683734|gb|U20581.1|MFU20581 Macaca fascicularis endothelin 3 mRNA
CAGCCAGATCTGAG 44 58 1
|||*||||||||||
CAGTCAGATCTGAG 3
>gi|9967394|dbj|AB047965.1| Macaca fascicularis brain cDNA
CTCAGATCTGA-TG 1569 1582 1
|||||||||||v||
CTCAGATCTGACTG 3 REVCOMP

and

>gi|21320903|dbj|AB059653.1| Macaca fascicularis PGDH1 mRNA 
 TGGATAATTTTT 2338 2350 1
 +++|^|+||||+
 WRRA-AWTTTTW 13
>gi|21320905|dbj|AB059654.1| Macaca fascicularis PGDH2 mRNA 
 ACCGAGGAGGA 502 513 1
 ||*||+|||||
 ACAGAKGAGGA 11
>gi|21320905|dbj|AB059654.1| Macaca fascicularis PGDH2 mRNA 
 AGCTG-GTGGG 512 522 1
 |||||v|||||
 AGCTGYGTGGG 18
>gi|7593035|dbj|AB041420.1| Gorilla gorilla gene for alpha-1
 CGCCRGCACGAGTT 596 610 1
 ||||+|||^|||||
 CGCCAGCA-GAGTT 2
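Output in the default alignment format can be read back four lines at a time. The Python sketch below is a hypothetical helper, not part of primer_match; it assumes the default format is in use and that deflines never span more than one line.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    defline: str
    sequence: str
    start: int
    end: int
    edits: int
    alignment: str
    primer: str
    index: int
    revcomp: bool


def parse_alignments(text: str) -> List[Alignment]:
    """Parse default-format alignment records (four non-blank lines each)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4:
        raise ValueError("expected four lines per alignment record")
    records = []
    for i in range(0, len(lines), 4):
        header, target, marks, query = lines[i:i + 4]
        if not header.startswith(">"):
            raise ValueError("record does not start with a defline")
        sequence, start, end, edits = target.split()
        fields = query.split()  # primer, index, and optionally REVCOMP
        records.append(Alignment(
            header[1:], sequence, int(start), int(end), int(edits),
            marks.strip(), fields[0], int(fields[1]),
            fields[2:] == ["REVCOMP"],
        ))
    return records
```

In the examples above, end minus start equals the number of database characters in the aligned sequence (gap characters excluded), which is what space-based positions imply.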

The default counts output format is

index rc? primer count ( 0-count 1-count ... )

where index is the index of the primer; rc? is "R" for the reverse complement of the primer and "F" otherwise; primer is the sequence of the primer if rc? is "F" and the sequence of the primer's reverse complement if rc? is "R"; count is the number of occurrences of primer in the sequence database; and k-count is the number of occurrences of primer in the sequence database with k insertions, deletions, or substitutions.

For example

1 F TTACGGGCAGCTCA 9 ( 6 3 )
1 R TGAGCTGCCCGTAA 0 ( 0 0 )
2 F CCTTGCCAGTCAGATC 23 ( 8 15 )
2 R GATCTGACTGGCAAGG 0 ( 0 0 )
3 F CAGTCAGATCTGAG 15 ( 2 13 )
3 R CTCAGATCTGACTG 6 ( 0 6 )
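Lines in the counts layout shown above can be split into fields as in the Python sketch below. The helper and its field names are hypothetical; it assumes the layout of the example lines, with an optional trailing + on the count as produced by the %+ conversion character.

```python
def parse_count_line(line):
    """Split one count line into its index, strand, primer, and counts."""
    head, _, tail = line.partition("(")
    index, strand, primer, count = head.split()
    return {
        "index": int(index),
        "strand": strand,                 # "F" or "R"
        "primer": primer,
        "count": int(count.rstrip("+")),
        "capped": count.endswith("+"),    # maximum count threshold exceeded
        # Per-edit-distance counts: position k holds the k-count.
        "by_edits": [int(n) for n in tail.rstrip().rstrip(")").split()],
    }
```

In the example lines the total count equals the sum of the per-edit-distance counts, for instance 9 = 6 + 3 for primer 1.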

The command line options -A and -C give the user explicit control over the output of alignments and counts respectively. Each format string contains conversion characters, which specify pieces of the alignment or count output.

Alignment format conversion characters:

%h  FASTA header (defline) of the sequence entry containing the alignment.
%H First "word" of the FASTA header (defline) of the sequence entry containing the alignment. The first word is everything up to (but not including) the first whitespace character of the defline.
%f Index of the FASTA entry containing the alignment.
%s Start position of the alignment within the FASTA entry (space based).
%e End position of the alignment in the FASTA entry (space based).
%l Length of the alignment.
%5 Position of the 5' end of the alignment in the sequence entry (space based).
%3 Position of the 3' end of the alignment in the sequence entry (space based).
%S Start position (absolute) of the alignment in the sequence database.
%E  End position (absolute) of the alignment in the sequence database.
%i Index of the aligned primer.
%d Edit distance (number of insertions, deletions, substitutions) of the alignment.
%p The (forward) sequence of the primer, whether it was found in its forward or reverse complement form.
%P The FASTA header (defline) of the primer, if the primers came from a FASTA format file. Otherwise, "".
%I  STS id (first column) of primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%L  STS length for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%a  STS accession for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%O  STS organism for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%&  Alternative STS accessions for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%X  STS chromosome for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%q The primer sequence of the alignment.
%Q The primer sequence of the alignment, with alignment characters to indicate an insertion.
%t The aligned sequence from the sequence database.
%T The aligned sequence from the sequence database, with alignment characters to indicate deletion.
%A The string of alignment characters indicating exact match, insertion, deletion and substitution at each position of the alignment.
%r "F" if the forward form of the primer was found, "R" if the reverse complement form of the primer was found.
%R " REVCOMP" if the reverse complement form of the primer was found, "" otherwise.
%% Percent (%).

The default alignment format is ">%h\n %T %s %e %d\n %A\n %Q %i%R\n".
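How a format string turns into output can be pictured with a small expander. The Python sketch below is an illustration only: the function name is hypothetical, the values are supplied by hand, and the \n in the format is an actual new line in the Python string rather than whatever escape handling primer_match applies to its command-line argument.

```python
def expand(fmt, values):
    """Replace each %x in fmt with values["x"]; %% yields a literal percent."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            key = fmt[i + 1]
            out.append("%" if key == "%" else str(values[key]))
            i += 2
        else:
            out.append(ch)  # ordinary character (or a trailing lone %)
            i += 1
    return "".join(out)


DEFAULT_ALIGNMENT_FORMAT = ">%h\n %T %s %e %d\n %A\n %Q %i%R\n"
```

Expanding the default format with the values of one alignment reproduces the four-line record layout, with " REVCOMP" appended to the last line only when %R is non-empty.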

Count format conversion characters:

%i The primer index.
%p The (forward form of the) sequence of the primer.
%P The FASTA header (defline) of the primer, if the primers came from a FASTA format file. Otherwise, "".
%q The forward or reverse complement form of the primer.
%r "F" for the forward form of the primer, "R" for the reverse complement form of the primer.
%R " REVCOMP" for the reverse complement form of the primer, "" otherwise.
%c Count for primer or reverse complement.
%C Space separated list of counts for edit distance 0, 1, etc.
%+ Plus (+) if the count for this primer exceeded the maximum count threshold.
%%  Percent (%).

The default count format is "%i %q %c%+ ( %C )\n".

See Also

pcr_match, compress_seq

Author

Nathan Edwards