of large libraries of 16S rRNA sequences from bacterial isolates and
environmental DNA is a significant challenge, despite the widespread
availability of public sequence databases and associated bioinformatics
and genomics software. The quality of taxonomic information in general
public databases, such as GenBank, varies considerably, and new sequences
are added at a phenomenal rate, quickly rendering phylogenetic trees
and taxonomic placements obsolete.
To provide consistent and up-to-date taxonomic classifications for the
thousands of 16S sequences in the SIMO database, we have developed an
automated process for assigning unknown sequences to taxonomic ranks.
This taxonomic information is then used to annotate sequence records
for submission to GenBank, support taxonomic searches of the SIMO
database, and provide taxonomic classification and lineage information
for each SIMO sequence record.
SIMO taxonomic assignments are based on similarity to vetted type species
sequences in the Ribosomal Database Project (RDP) database. Unknown 16S
sequences are first compared to RDP type species sequences using the RDP
Sequence Match program. The highest-ranking sequences are then retrieved
from RDP and aligned with the unknown SIMO sequence using the Smith-Waterman
pairwise local alignment algorithm (SSEARCH34 in the Pearson
FASTA 3 package).
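The local alignment is later summarized as a percent nucleotide identity. As a minimal sketch of that calculation, the function below counts matching columns in two already-aligned sequences. The function name, the gap character, and the choice of denominator (all aligned columns, counting gaps as mismatches) are assumptions made for illustration; SSEARCH34 reports its own identity value, which may be defined differently.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two aligned sequences of equal length.

    Assumption: identity = matching columns / aligned columns, where a
    column with a gap in one sequence counts as a mismatch and columns
    that are gaps in both sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    columns = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # no residue in either sequence, skip the column
        columns += 1
        if a == b:
            matches += 1

    if columns == 0:
        raise ValueError("alignment contains no residues")
    return 100.0 * matches / columns


# Hypothetical 10-column alignment: 8 matches, 1 gap, 1 substitution
identity = percent_identity("ACGT-ACGTA", "ACGTTACGAA")
```

Here `identity` is 80.0, because 8 of the 10 aligned columns match.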
The taxonomic lineage
of the most similar type species is then parsed from the RDP tree, and
corresponding taxonomic ranks are assigned to the unknown SIMO sequence
after applying a rank cut-off filter based on percent nucleotide identity
determined from the local alignment. The taxonomic rank cut-offs, listed
below, were determined empirically by comparing a large number of aligned
sequences for known type species. The comparisons (between 104 and 335,
depending on the taxonomic rank) were chosen to give a balanced representation
of more than one bacterial phylum. For each taxonomic rank, the cut-off
values are conservative and represent the 16S rRNA sequence similarity
values that would include approximately 95% of the comparisons at that
rank (Whitman and Dyszynski, unpublished data).
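One way to read "would include approximately 95% of the comparisons" is as a low percentile of the observed similarity values for pairs of type species that share a given rank. The sketch below illustrates that reading with a simple nearest-rank calculation. It is not the authors' procedure, which is unpublished; the function name and the sample values are hypothetical.

```python
def conservative_cutoff(identities, coverage_percent=95):
    """Largest observed identity v such that at least `coverage_percent`
    of the comparisons have identity >= v (nearest-rank method)."""
    if not identities:
        raise ValueError("need at least one comparison")

    ordered = sorted(identities, reverse=True)  # highest identity first
    n = len(ordered)
    # Number of comparisons that must be included, rounded up
    k = (coverage_percent * n + 99) // 100
    return ordered[k - 1]


# Hypothetical identities (%) for 20 same-rank type species comparisons
sample = [float(v) for v in range(80, 100)]
cutoff = conservative_cutoff(sample)
```

With these 20 made-up values the cut-off is 81.0, which includes 19 of the 20 comparisons (95%).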
Taxonomic Rank | % Identity Cut-off*
* For example, 93% identity would result in assignment of the following
ranks from the closest type species:
Domain - Phylum - Class - Order - Family
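The rank cut-off filter can be sketched as follows. The cut-off values in this example are HYPOTHETICAL placeholders, chosen only so that 93% identity stops at Family as in the example above; they are not the SIMO cut-offs. The Genus and Species ranks, the function name, and the data layout are likewise assumptions for illustration.

```python
# Ranks ordered from most general to most specific
RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]

# HYPOTHETICAL cut-offs (% identity), for illustration only
EXAMPLE_CUTOFFS = {
    "Domain": 60.0, "Phylum": 75.0, "Class": 80.0, "Order": 85.0,
    "Family": 90.0, "Genus": 95.0, "Species": 97.0,
}


def assign_ranks(lineage, percent_identity, cutoffs):
    """Copy ranks from the closest type species' lineage, stopping at the
    first rank whose cut-off exceeds the observed percent identity."""
    assigned = []
    for rank in RANKS:
        if rank not in lineage or percent_identity < cutoffs[rank]:
            break  # this rank and all finer ranks are left unassigned
        assigned.append((rank, lineage[rank]))
    return assigned


# Lineage of an example type species (Escherichia coli)
lineage = {
    "Domain": "Bacteria", "Phylum": "Proteobacteria",
    "Class": "Gammaproteobacteria", "Order": "Enterobacteriales",
    "Family": "Enterobacteriaceae", "Genus": "Escherichia",
    "Species": "Escherichia coli",
}
result = assign_ranks(lineage, 93.0, EXAMPLE_CUTOFFS)
```

With the placeholder values, `result` contains the Domain through Family entries and omits Genus and Species.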
Taxonomic assignments are performed automatically when new sequence data
are added to the SIMO database, and are updated approximately quarterly
to reflect additions and refinements to the RDP database and phylogenetic tree.
SIMO RDP Agent
A set of software utilities was developed by Wade Sheldon to automate the
entire process described above, and to retrieve RDP trees for the 10 closest
sequences overall (type and non-type) to augment the taxonomic classification
results. Classifications and trees returned from each analysis are uploaded
to the SIMO database for use by SIMO investigators and for display on the
SIMO database web site. The software, termed the SIMO RDP Agent, was
developed using MATLAB 6.5.
The conceptual diagram on the RDP Taxonomy Agent
page illustrates the work-flow performed for each analysis.
SIMO RDPquery Program
An open-source Java program (RDPquery) has also been developed by Glen
Dyszynski and Wade Sheldon to allow individuals to classify 16S sequences
on their own using this same approach. This program is fully described on
the SIMO RDPquery web page.