Sequence labels
Sequence labels are strings of text following a greater-than (>) symbol at the start of a line in a FASTA file or the @ symbol at the start of a read in a FASTQ file. The label is terminated by the end of the line. By default, the complete label is stored and reported, unlike BLAST, which truncates labels at the first white space (blank or tab). You can use the -trunclabels option to specify that labels should be truncated.
Annotations in sequence labels
Many usearch commands support or require annotations in sequence labels. Annotations indicate attributes of a sequence such as its abundance, sample identifier, taxonomy, etc. There are no generally accepted standards for including annotations in FASTQ or FASTA files, so annotations are not usually compatible with other software packages.
Most annotations have the form name=value.
In usearch, annotations are separated by semi-colons. The first annotation begins at the first semi-colon. The label up to the first semi-colon is sometimes understood to be an implied name, e.g. an OTU identifier.
A semi-colon terminating the last annotation at the end of the label is optional, but recommended.
White space (blanks and tabs) is allowed within annotations, but is discouraged because it can cause problems. For example, if a sequence label contains a tab and is written to a tab-separated text file, then the number of fields on that line will be wrong.
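The annotation rules above can be sketched in a few lines of Python. This is an illustrative sketch, not usearch's own parser: the function name parse_annotations is hypothetical, and storing None for a field that has no = sign is an assumption made here.

```python
def parse_annotations(label):
    """Split a label into its implied name and a dict of annotations."""
    # Accept a raw header line by dropping a leading > (FASTA) or @ (FASTQ).
    if label[:1] in (">", "@"):
        label = label[1:]
    fields = label.split(";")
    name = fields[0]          # text before the first semi-colon
    annots = {}
    for field in fields[1:]:
        if not field:
            continue          # empty field left by an optional trailing semi-colon
        key, sep, value = field.partition("=")
        # Assumption: a field without '=' is kept with the value None.
        annots[key] = value if sep else None
    return name, annots
```

For the label >READ-127;size=5;sample=soil3; this returns the name READ-127 and the annotations size and sample as strings.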
Size annotations
Size annotations are used to indicate cluster sizes of representative sequences.
A size annotation is specified as a size=N field in the sequence label. It may appear anywhere in the label. It must be delimited by semi-colons, which may optionally be omitted at the end of a label, though this is not recommended.
You can propagate size annotations through multiple commands by using the -sizein and -sizeout options, which are supported by most (but not all) clustering commands.
Examples
>KR08766;size=1;
>KR08766;size=2
>READ-127;size=5;sample=soil3;
@M141:79:749142:1:1101:14941:1421;size=57721;
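A script that needs to read these size values could use a regular expression along the following lines. This is a sketch under stated assumptions: get_size is a hypothetical helper, and returning None when no size annotation is present is a choice made here, not documented usearch behavior.

```python
import re

# size=N must be delimited by semi-colons; the trailing one may be omitted
# at the end of the label.
_SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")

def get_size(label):
    """Return the size annotation as an int, or None if there is none."""
    if label[:1] in (">", "@"):
        label = label[1:]     # drop the FASTA/FASTQ header symbol
    match = _SIZE_RE.search(label)
    return int(match.group(1)) if match else None
```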
Sample identifiers in read labels
An OTU table is made by the otutab command. The query set, i.e. the FASTA file or FASTQ file containing the reads, must have sample identifiers in the labels.
Why different ways to do it?
Usearch supports different ways to put sample names into sequence labels to provide some degree of backwards compatibility with earlier versions and to allow flexibility in the formatting of sample names which were probably designed without thinking about the software package. For example, QIIME does not allow an underscore in the sample identifier, which is too restrictive in my opinion.
How to check that your sample names are formatted correctly
Use the fastx_get_sample_names command.
Sample identifier syntax
The sample name can be specified by putting sample=xxx; into the label. The semi-colon marks the end of the sample identifier, so semi-colons are not allowed but any other character may be used. If sample= is not found, the sample identifier is assumed to start at the beginning of the label and continue to the first character in the label which is not alphanumeric or an underscore, unless the -sample_delim option is specified (see below). Put another way, any character which is not a letter, number or underscore marks the end of the sample identifier. The following labels have sample identifier S01. FASTA labels start with > at the beginning of the line, FASTQ labels start with @.
>S01.123
>S01.123;size=14;
@M00967:43:000000000-A3JHG:1:1101:18327:1699;sample=S01;
In the first and second example, the period (.) is the first non-alphanumeric character so the .123 is not part of the sample identifier.
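The default rule can be written out as a short Python sketch, which may help when checking labels by hand. The function name get_sample is hypothetical, and requiring sample= to sit at the start of the label or after a semi-colon is an assumption based on how other annotations are delimited.

```python
import re

def get_sample(label):
    """Return the sample identifier of a read label (default rule only)."""
    if label[:1] in (">", "@"):
        label = label[1:]     # drop the FASTA/FASTQ header symbol
    # Rule 1: an explicit sample=xxx; annotation wins.
    match = re.search(r"(?:^|;)sample=([^;]*)", label)
    if match:
        return match.group(1)
    # Rule 2: otherwise take the leading run of letters, digits and underscores.
    return re.match(r"[A-Za-z0-9_]*", label).group(0)
```

Applied to the three example labels above, this returns S01 each time.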
The -sample_delim option
This option specifies a string of one or more characters that marks the end of a sample identifier. If this option is used, the sample identifier must begin with the first character in the label, and it continues until the first match to the delimiter string. For example, if you have reads that were processed with QIIME, then read labels start with the sample identifier, which is followed by an underscore (_) and an integer read number. Input in this format can be processed like this:
usearch -otutab qiime_reads.fq -sample_delim _ -otutabout otutable.txt
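The effect of the delimiter on a single label can be sketched as follows. This is only an illustration: sample_by_delim is a hypothetical name, and returning None when the delimiter does not occur in the label is an assumption, since the behavior of usearch in that case is not described here.

```python
def sample_by_delim(label, delim):
    """Return the text from the start of the label up to the first delimiter."""
    if not delim:
        raise ValueError("delimiter must be a non-empty string")
    if label[:1] in (">", "@"):
        label = label[1:]     # drop the FASTA/FASTQ header symbol
    pos = label.find(delim)
    if pos < 0:
        return None           # assumption: no delimiter means no sample identifier
    return label[:pos]
```

With an underscore as the delimiter, a sample identifier may contain periods, which the default rule would not allow.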
How to get sample names into your labels
The simplest method is to use the fastx_relabel command or the -relabel option of fastq_mergepairs, fastq_filter or fastx_uniques. If you process one file at a time, you can do something like this:
usearch -fastx_uniques reads.fastq -relabel SampleName. -fastaout uniques.fa
Note the period following SampleName.
If -relabel @ is specified, the sample name is constructed from the FASTQ filename by truncating at the first underscore or period. With typical Illumina FASTQ filenames, this is the sample name.
Alternatively, you could write your own script to do this task.
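A script of that kind might start from something like the sketch below. Both function names are hypothetical. The filename rule mirrors the description of -relabel @ (truncate at the first underscore or period); removing the directory part first is an assumption. The relabeling writes labels of the form Sample.N, which the default sample identifier rule reads back as Sample provided the name contains only letters, digits and underscores.

```python
import os
import re

def sample_from_filename(path):
    """Derive a sample name by truncating the filename at the first _ or period."""
    base = os.path.basename(path)   # assumption: directories are ignored
    return re.split(r"[_.]", base, maxsplit=1)[0]

def relabel_fasta(lines, sample):
    """Yield FASTA lines with each label replaced by sample.1, sample.2, ..."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield f">{sample}.{count}\n"
        else:
            yield line              # sequence lines pass through unchanged
```

This handles FASTA only. FASTQ input needs a record-based reader, because quality lines may themselves begin with @.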
OTU identifiers
OTU sequences must have OTU identifiers
An OTU table is generated by the otutab command or the closed_ref command. OTU identifiers are extracted from the sequence labels in the search database, which contains OTU or ZOTU sequences.
OTU identifier syntax
First, usearch looks for an annotation otu=xxx; in the label. If this is found, then xxx is the OTU identifier. Otherwise, the OTU identifier is the sequence label with all annotations removed, i.e. all characters up to the first semi-colon (;) or the end of the label, whichever comes first.
Examples
The following labels contain the OTU identifiers Otu123, Otu123 and Zotu123, respectively.
>FQZ00866;otu=Otu123;
>Otu123
>Zotu123
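The two-step lookup can be sketched in Python as follows. The name get_otu_id is hypothetical, and requiring otu= to appear at the start of the label or after a semi-colon is an assumption.

```python
import re

def get_otu_id(label):
    """Return the OTU identifier of a database sequence label."""
    if label[:1] in (">", "@"):
        label = label[1:]     # drop the FASTA/FASTQ header symbol
    # Step 1: an explicit otu=xxx; annotation takes priority.
    match = re.search(r"(?:^|;)otu=([^;]*)", label)
    if match:
        return match.group(1)
    # Step 2: otherwise use everything before the first semi-colon.
    return label.split(";", 1)[0]
```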
fastx_strip_annots command
This command strips all usearch-style annotations from sequence labels by deleting everything after the first semi-colon (;).
Input can be in FASTA or FASTQ format. The output filename is specified by -fastqout (FASTQ format) and / or -fastaout (FASTA format). You cannot create FASTQ output from FASTA input.
Example
usearch -fastx_strip_annots seqs.fa -fastaout seqs_without_annots.fa
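For readers who want to see what the operation amounts to, here is a rough Python equivalent for FASTA input. It is a sketch of the rule stated above, not the usearch implementation, and strip_fasta_annots is a hypothetical name. It does not handle FASTQ, where a quality line can begin with > or @.

```python
def strip_fasta_annots(lines):
    """Yield FASTA lines with each label cut at the first semi-colon."""
    for line in lines:
        if line.startswith(">"):
            # Keep everything before the first ';' and restore the newline.
            yield line.rstrip("\n").split(";", 1)[0] + "\n"
        else:
            yield line        # sequence lines pass through unchanged
```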
fastx2qiime command
Converts a FASTA or FASTQ file with usearch-compatible sample identifiers to QIIME-compatible format.
QIIME embeds sample identifiers into sequence labels using the following rules:
A label starts with the sample name followed by an underscore (_) and a read number 1, 2, 3 etc.
The read number is optionally followed by white space and other information which is usually (always?) ignored by the QIIME scripts.
Only alphanumeric and period (.) characters are allowed in sample names.
The fastx2qiime command converts to QIIME format by finding the sample identifier and appending an underscore and a sequence number. All other information in the label is discarded. Any character in the sample identifier which is not alphanumeric or a period is replaced by a period.
Input can be in FASTA or FASTQ format. The output filename is specified by -fastqout (FASTQ format) and / or -fastaout (FASTA format). You cannot create FASTQ output from FASTA input.
Example
usearch -fastx2qiime reads.fastq -fastaout reads_qiime.fa
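The label conversion described above can be illustrated with a short Python sketch. The function to_qiime_label is hypothetical and takes an already extracted sample identifier as input. Treating only ASCII letters and digits as alphanumeric is an assumption.

```python
import re

def to_qiime_label(sample, read_number):
    """Build a QIIME-style label: cleaned sample name, underscore, read number."""
    # Replace every character that is not a letter, digit or period with a period.
    clean = re.sub(r"[^A-Za-z0-9.]", ".", sample)
    return f"{clean}_{read_number}"
```

For example, the sample identifier soil_3 with read number 1 becomes soil.3_1, because the underscore is not allowed in a QIIME sample name.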