compress_seq
compress_seq - Normalize and compress a multi-FASTA sequence database
compress_seq -i fasta_sequence_database [ options ]
compress_seq takes a multi-FASTA sequence database and splits it into separate sequence, header and index files. It can output the sequence data in a variety of forms, depending of the command line options given, including a DNA optimized normalized form, and a bit compressed form. It can also add an arbitrary end of sequence character to each sequence entry, and force each sequence character to uppercase.
If compress_seq is used to pre-process a sequence database for primer_match, the best performance will result from the use of the -n true option, to index and normalize the sequence database.
By default, compress_seq splits the multi-FASTA sequence database file db into three files: db.seq, db.hdr and db.idb. In order to make compress_seq more suitable for use in computational analysis pipelines, it only re-constructs the sequence database component files if the file system timestamps indicate that it is necessary. Note that the -F option can be used to force each component to be re-made.
-i fasta_sequence_database
Name of the multi-Fasta sequence database file to process. Required.
-I ( true | false )
Write fasta index file in binary format. Default: true.
-n ( true | false )
Create a normalized version of the sequence data. This creates additional files with suffixes .sqn and .tbl. Default: false.
-D ( true | false )
Optimize the normalized sequence data for DNA sequence. Default: true.
-z ( true | false )
Create a bit compressed normalized version of the sequence data. This creates additional files with suffixes .sqz and .tbz. Default: false.
-u ( true | false )
Force the sequence data to uppercase characters. Default: false.
-e ( true | false )
Add an end of sequence character after each entry from the multi-FASTA sequence database file. This can help ensure that text matching algorithms cannot find a match that straddles two FASTA entries. Default: true.
-E eos
Use eos as the end of sequence character. eos is the ascii code for the desired character, it may be specifed as a decimal, octal or hexadecimal number. Default: 12 (newline).
-S ( true | false )
Insert end of sequence character before initial sequence entry. Default: false.
-F ( true | false )
Force each component of the compressed sequence database to be regenerated, even if the file timestamps indicate that this isn't necessary. Default: false.
-C ( true | false )
Cleanup unnecessary temporary files. Default: true.
-B
Use buffered standard I/O rather than mmap to stream through the sequence database. On some platforms, where the use of mmap is somewhat unpredictable, this option may make it possible to run compress_seq reliably.
-v
Output the release tag of the binary.
-h
Command-line help.
Nathan Edwards