Most USEARCH commands use a database index to enable fast searching. There are two types of index: one for finding matching seeds for the UBLAST algorithm, and another for fast calculation of common word counts for the USEARCH algorithm. Clustering uses a USEARCH-style index. Indexing parameters apply to both types of index.
During search and clustering, indexes are always accessed directly in memory rather than being read from a disk file, in order to maximize speed. The amount of RAM required to store the index is approximately the same as the size of a UDB file created with the same sequences and options. The physical RAM in the computer should be larger than the index; otherwise, virtual memory paging will make execution much slower.
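As a rough illustration of the memory rule above, the following Python sketch compares the size of a UDB file with the physical RAM of the machine. It is not part of USEARCH. The function names and the 0.8 headroom factor are assumptions made for this example, and the RAM query relies on POSIX sysconf names that are available on Linux but not on every platform.

```python
import os


def index_fits_in_ram(index_bytes, ram_bytes, headroom=0.8):
    """Return True if the index fits within a fraction of physical RAM.

    headroom is a hypothetical safety margin (not a USEARCH parameter) that
    leaves room for the operating system and other processes.
    """
    return index_bytes <= headroom * ram_bytes


def physical_ram_bytes():
    """Physical RAM in bytes, using POSIX sysconf (assumes a Linux-like system)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def udb_fits(udb_path):
    """Use the UDB file size as an estimate of the in-memory index size."""
    return index_fits_in_ram(os.path.getsize(udb_path), physical_ram_bytes())
```

For example, udb_fits("db.udb") would estimate whether an index built from a hypothetical file db.udb is likely to stay in physical memory.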
Indexes are constructed in three different ways:
(1) Loaded from a UDB file.
(2) Built from a FASTA file.
(3) Built dynamically during clustering. The index is initially empty, then grows as centroid sequences are added to the database.
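The third case, together with the idea of counting common words, can be sketched with a toy index in Python. This is a simplified illustration and not the data structure USEARCH uses internally. The class and method names are hypothetical, words are plain k-mers, and each distinct word is counted once per sequence, which is a simplifying assumption.

```python
from collections import defaultdict


class GrowingWordIndex:
    """Toy word index that starts empty and grows as centroids are added."""

    def __init__(self, k):
        self.k = k
        self.postings = defaultdict(set)  # word -> ids of centroids containing it
        self.centroids = []

    def words(self, seq):
        """Set of distinct k-mers in seq."""
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def add(self, seq):
        """Add a centroid sequence and index its words; return its id."""
        cid = len(self.centroids)
        self.centroids.append(seq)
        for w in self.words(seq):
            self.postings[w].add(cid)
        return cid

    def common_word_counts(self, query):
        """Map centroid id -> number of distinct words shared with query."""
        counts = defaultdict(int)
        for w in self.words(query):
            for cid in self.postings.get(w, ()):
                counts[cid] += 1
        return dict(counts)
```

A clustering loop would query the index with each new sequence and call add only when no existing centroid is similar enough, so the index grows with the number of clusters rather than the number of input sequences.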
Indexing options
In the following table, "word" refers generically to the fixed-length segment of the database sequence that is indexed. It may be a k-mer or a pattern. The effective word length is the length of the k-mer or the number of 1s in the pattern.
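The definition of effective word length can be made concrete with a short Python sketch. It assumes that a pattern is written as a string of 1s and 0s in which a 1 marks a position that is part of the word and a 0 marks a position that is skipped; the function names are hypothetical and do not correspond to USEARCH options.

```python
def effective_word_length(spec):
    """Effective word length for a k-mer length (int) or a pattern (str of 1s and 0s)."""
    if isinstance(spec, int):
        return spec  # k-mer: the effective length is k itself
    return spec.count("1")  # pattern: count the positions marked 1


def pattern_words(seq, pattern):
    """Slide the pattern along seq, keeping only letters at positions marked 1."""
    span = len(pattern)
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    return ["".join(seq[start + i] for i in keep)
            for start in range(len(seq) - span + 1)]
```

Under these assumptions, the pattern 1101 spans four letters of the sequence but has an effective word length of 3, so pattern_words("ACGTAC", "1101") yields the words ACT, CGA and GTC.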