SeqState Documentation
Features
Primer design
Although many other features have been added since 2003, one of the main purposes of SeqState remains automated primer design. For this, the alignment (NEXUS format; non-interleaved; see sample files) is screened for aligned internal primers, which may also be loaded from and saved to resource files. External primers (upstream and downstream of the alignment) can be specified as well. In each sequence, SeqState searches for stretches of missing data ("?"), which may contain, start with, or end with indel gap characters ("-") resulting from insertion of gaps in multiple sequences simultaneously. For each region of missing data, all primers are evaluated in terms of distance to the region and fit to the matching part of the sequence to be completed. Degeneracy of primers as well as ambiguity codes in the template sequence are correctly interpreted, and mismatches in the head region receive particular attention.
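The search for stretches of missing data can be pictured with a minimal sketch. It assumes one aligned sequence is available as a plain string; the function name and the 0-based, end-exclusive coordinates are illustrative choices and are not taken from SeqState itself.

```python
import re

# A run of "?" and "-" characters; only runs with at least one "?" count as missing data.
MISSING_RUN = re.compile(r"[?-]+")

def missing_regions(aligned_seq):
    """Return (start, end) pairs, 0-based and end-exclusive, of missing-data stretches."""
    regions = []
    for match in MISSING_RUN.finditer(aligned_seq):
        if "?" in match.group():
            regions.append((match.start(), match.end()))
    return regions
```

A run that consists of "-" only is an ordinary indel and is skipped, whereas gaps inside or next to a "?" stretch become part of the reported region.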
For each section of absent data, the best primers are provided in the output. If no suitable primers are found, SeqState screens the sequence adjacent to the missing nucleotides (subtracting "-" and jumping over ambiguities) and suggests new primers for synthesis. To select the best primers, the program evaluates nucleotide composition of head and tail region, maximum length of primer dimer complements, primer specificity and annealing temperature, as well as the percentage of other sequences in the alignment to which the primer fits. SeqState allows specification and saving of user assumptions (e.g., the lengths of primer reads, which strongly depend on the sequencer used).
The results are exported as a directly printable list and as a table that can be imported into other programs (e.g., Microsoft Excel). They are also printed to the screen and can be copied from there into any (online) primer order form.
In addition, SeqState can calculate diverse characteristics of manually entered and/or currently loaded primers (e.g., Tm, fit to loaded sequences, primer dimers) and primer pairs (e.g., longest primer dimer complements and Tm differences between the two).
Sequence statistics for character sets
SeqState also supports character sets as understood by PAUP (Swofford, 1998) and calculates sequence statistics for the whole matrix and/or such character sets, including sequence length range, sequence divergence range, transition/transversion ratios, variability measures, and nucleotide composition. These are formatted as a table (on the screen and saved to a file) ready to be used in publications.
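Two of the statistics listed above, nucleotide composition and transition/transversion counts, can be sketched as follows. This is only an illustration under the assumption that gaps, missing data and ambiguity codes are left out of the counts; how SeqState treats them is not stated here, and the function names are hypothetical.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def base_composition(seq):
    """Fraction of A, C, G and T among the unambiguous bases of one sequence."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"} if total else {}

def ti_tv_counts(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical, gap, missing or ambiguous: not counted (assumption)
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1   # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1
    return transitions, transversions
```

The ratio for a pair is then transitions divided by transversions, provided at least one transversion was counted.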
Indel coding
Finally, SeqState supports a number of published indel coding schemes. For details, please refer to:
- Müller K: Incorporating information from length-mutational events into phylogenetic analysis. Molecular Phylogenetics and Evolution 2006, 38:667-676
- Simmons M, Müller K, Norton A: The relative performance of indel-coding methods in simulations. Molecular Phylogenetics and Evolution 2007, 44:724-740
Using SeqState
Once SeqState is running, its use should be mostly self-evident from the menu bar and its items.
Loading data
The first thing you will usually do is load a data file from the "File" menu. A file dialog will open, as you know it from other programs on your computer.
Currently, it is safest to use plain, non-interleaved NEXUS files as input, such as those generated by PAUP's export command with format=nexus. Additional blocks after the Data block may currently confuse SeqState.
Charset commands should be restricted to single positions and contiguous ranges, e.g., charset one= 23-67 78 95-102; charsets using intervals such as 1-.\3 are not supported yet. Taxsets will be supported soon, but the current version does not handle them properly, so it is better to omit the taxset command.
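To make the supported charset form concrete, here is a minimal sketch that expands the part after the "=" sign into 1-based alignment positions and rejects the interval notation. The function is hypothetical and is not SeqState's parser.

```python
def parse_charset_positions(spec):
    """Expand a simple charset body such as '23-67 78 95-102' to 1-based positions."""
    positions = []
    for token in spec.replace(";", " ").split():
        if "\\" in token or "." in token:
            # Interval notation such as 1-.\3 is the unsupported case.
            raise ValueError(f"unsupported charset token: {token!r}")
        if "-" in token:
            start, end = (int(part) for part in token.split("-"))
            positions.extend(range(start, end + 1))  # ranges are inclusive
        else:
            positions.append(int(token))
    return positions
```

Running a charset definition through a check of this kind before loading the file shows quickly whether it uses only the supported notation.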
Choosing and designing primers; requesting sequence statistics
Second, you might want to check the global settings (e.g., assumed primer read lengths) from the Primers menu. Use "choose/design primers" from the same menu to have SeqState analyse the data. Statistics are available from the Statistics menu; bootstrapping for standard errors can be adjusted via the Settings submenu.
Indel coding
Indel coding (a variety of simple or more complex schemes) can be requested from the IndelCoder menu. Just choose your preferred coding scheme to have SeqState write a NEXUS file with indels coded that is ready to be executed in PAUP.
Contact
In the case of difficulties or questions, don't hesitate to contact me. Also, bug reports and comments are always highly appreciated.
Prof. Dr. Kai Müller
Research group for Evolution and Biodiversity of Plants
Institute for Evolution and Biodiversity
Westphalian Wilhelms-University, Münster, Germany
Hüfferstrasse 1
48149 Münster
Germany
E-mail: kaimuelleruni-muenster.de
http://bioinfweb.info/People/Mueller
Citation
If SeqState was of any help to you, I would appreciate its citation as follows:
Müller K: SeqState - primer design and sequence statistics for phylogenetic DNA data sets. Applied Bioinformatics 2005, 4:65-69
Appendix: How are the primers evaluated?
For each region of missing data, all primers are evaluated in terms of
1. distance to the region (ideally, the gap can be closed with the specified primer read length; if not, the user-provided distance must not be exceeded)
2. fit to the matching part of the sequence to be completed. A score is computed representing the sum of all mismatching nucleotides. Degeneracy of primers as well as ambiguity codes in the target sequence are taken into account (ambiguity in the template sequence is treated as a mismatch in any case, since uncertainty of a base call should not lead to choosing an unsuitable primer; degenerate positions in primers are treated as usual according to IUPAC).
3. If the score is equal for two primers, mismatches in the head region (defined as the first 5 nucleotides) are considered more problematic.
4. All primers meeting condition (1.) are sorted by the criteria under (2.-3.), and the best three primers are provided in the output (unless fewer primers are found).
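The fit score and its tie-breaker can be sketched as follows. This is a minimal illustration and not SeqState's code: it assumes that primer and template stretch have the same length and orientation, and that the "first 5 nucleotides" are the first five characters of the primer string as given. The names are hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one covers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

HEAD_LENGTH = 5  # head region: first 5 nucleotides, as stated in the text

def mismatch_positions(primer, template):
    """Indices where the primer does not fit the equally long template stretch."""
    mismatches = []
    for i, (p, t) in enumerate(zip(primer.upper(), template.upper())):
        # Only an unambiguous template base can match; a degenerate primer
        # position matches any base covered by its IUPAC code.
        if t not in "ACGT" or t not in IUPAC.get(p, ""):
            mismatches.append(i)
    return mismatches

def fit_key(primer, template):
    """Sort key: total mismatches first, head mismatches as the tie-breaker."""
    mismatches = mismatch_positions(primer, template)
    head = sum(1 for i in mismatches if i < HEAD_LENGTH)
    return (len(mismatches), head)
```

Sorting the primers that pass the distance condition by this key and keeping the first three entries corresponds to the ranking described above.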
If no suitable primers are found among those supplied in the alignment (i.e., all of them are too far away (the user-provided distance is exceeded), and/or have >=3 mismatches, and/or have >=1 mismatch in the head), SeqState screens the sequence adjacent to the missing nucleotides (ignoring "-") and suggests new primers for synthesis. The criteria are:
- Primers have to be at least 19 bp long (but no primers >=24 bp are evaluated here).
- The matching sequence must not contain ambiguities or missing data. Gaps are ignored.
- The primer must not contain primer dimer complements of >=5 bp length.
- The first two head positions should contain at least one G or C; the last tail positions should contain at most one G or C.
- If primers of 19 bp matching these conditions are found, an extension up to 23 bp is attempted, following the same criteria.
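A candidate check along these lines might look like the sketch below. It makes several assumptions that are not taken from SeqState: "primer dimer complements" is approximated as the longest stretch the primer shares with its own reverse complement, the head is taken to be the first two characters of the string, and the tail rule is left out because the number of tail positions is not given above. All names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def longest_self_complement(primer):
    """Length of the longest stretch shared by the primer and its reverse complement."""
    rc = reverse_complement(primer)
    best = 0
    # Dynamic programming for the longest common substring, one row at a time.
    previous = [0] * (len(rc) + 1)
    for a in primer:
        current = [0] * (len(rc) + 1)
        for j, b in enumerate(rc, start=1):
            if a == b:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best

def is_candidate(primer, min_len=19, max_len=23, max_dimer=4):
    """Apply the length, ambiguity, head and primer dimer criteria to one candidate."""
    primer = primer.upper()
    if not min_len <= len(primer) <= max_len:
        return False
    if any(base not in "ACGT" for base in primer):
        return False  # ambiguities or missing data are not allowed
    if not any(base in "GC" for base in primer[:2]):
        return False  # first two head positions need at least one G or C
    return longest_self_complement(primer) <= max_dimer
```

The extension step would then amount to calling is_candidate again on the 20 to 23 bp versions of a 19 bp primer that passed.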
The potential primers found in a range of -100 bp and +100 bp (of the target sequence, not matrix positions) from the beginning and end of the gap, respectively, are sorted by
- GC content in the first positions of the primer head
- length of primer dimer complements <=4 bp
- length of primer sequence (the longer the primer, the higher the Tm (for a given base composition) and the better the specificity).
At most, the 10 best primers are listed.
For further evaluation by the user, the percentage of sequences in the alignment to which the primer fits perfectly is provided (plus the percentage of "valid" sequences in brackets, i.e., those sequences that don't have ???? in the respective region). Since the primary purpose of SeqState is to provide a reliable suggestion for well-working sequencing primers able to fill gaps of missing data, matching the primer to as many other sequences as possible is not a priority when sorting the best primers. However, if the primer is to be used for more than one taxon, the user should decide to synthesize the primer with the highest fit percentage out of the 10 suggested.
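The two percentages can be illustrated with a small sketch. It assumes that the stretch of the alignment under the primer has already been cut out for every sequence, that gaps are removed before comparison, and that a "perfect" fit means an exact string match; this is a simplification, and the function name is hypothetical.

```python
def fit_percentages(primer, regions):
    """Return (percent perfect fit, percent valid) for one primer.

    regions: the alignment stretch under the primer, one string per sequence.
    """
    if not regions:
        return 0.0, 0.0
    # "Valid" sequences have no missing data in the respective region.
    valid = [region for region in regions if "?" not in region]
    perfect = sum(
        1 for region in valid
        if region.replace("-", "").upper() == primer.upper()
    )
    return 100.0 * perfect / len(regions), 100.0 * len(valid) / len(regions)
```

A primer with a high first value and a low second value fits most of the sequences that could be checked at all, which is worth keeping in mind when comparing suggestions.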
Note: F primers (sequence upstream of the gap) are evaluated first and R primers later, and both are added to one list. F primers would therefore come first among primers of identical quality (according to the above sorting criteria). If more than 10 primers are found, this would be unsatisfactory, since there is no reason why F primers should be privileged. The list is therefore shuffled prior to sorting, which gives a more even distribution of F and R primers. This, however, introduces a certain level of randomness into the procedure and explains why the lists provided by SeqState may not be completely identical in two subsequent analyses with identical settings and input files. The 10 primers suggested are still among the best; there may be other, equally good primers that are not output, so what is included in the top 10 may vary. Since you won't synthesize them all anyway, but should have enough suggestions to choose from, I considered this randomness to be of no significance for the purpose of the program. Advanced features that allow you to navigate through ALL primers found and to extend the distance from the gap are currently being implemented, and so is a refined scoring procedure for the primers.
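The effect of shuffling before sorting can be reproduced in a few lines. The sketch assumes a stable sort and a caller-supplied quality key where smaller is better; neither is confirmed as SeqState's actual implementation, and the names are hypothetical.

```python
import random

def top_primers(candidates, key, limit=10, rng=None):
    """Shuffle, then stable-sort by quality and keep the best `limit` entries."""
    rng = rng or random.Random()
    pool = list(candidates)
    rng.shuffle(pool)   # randomizes the order among equally good primers
    pool.sort(key=key)  # list.sort is stable, so ties keep the shuffled order
    return pool[:limit]
```

With 10 F and 10 R primers of identical quality, two calls can return different mixtures of F and R primers, but a primer of worse quality never displaces a better one.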
Tm is estimated according to the formula Tm=69.3+0.41*GC[%]-650/primer_length.
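As a worked example of the formula, the following sketch computes Tm for a primer given as a string. Counting only G and C (so that degenerate positions do not contribute to the GC content) is an assumption of this sketch, and the function name is hypothetical.

```python
def estimate_tm(primer):
    """Tm = 69.3 + 0.41 * GC[%] - 650 / primer_length, as given in the text."""
    primer = primer.upper()
    gc_percent = 100.0 * sum(base in "GC" for base in primer) / len(primer)
    return 69.3 + 0.41 * gc_percent - 650.0 / len(primer)
```

For a 20 bp primer with 50% GC, this gives 69.3 + 20.5 - 32.5 = 57.3.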