Interpreting counts and frequenc

Author: 代号北极能 | Published: 2019-06-28 22:12

Source: https://www.drive5.com/usearch/manual/otu_count_interpret.html


Estimating microbial diversity

The diversity in a single sample (alpha diversity) is commonly measured using metrics such as the Shannon index and the Chao1 estimator, while the variation between pairs of samples (beta diversity) is measured using metrics such as the Jaccard distance or Bray-Curtis dissimilarity. Many such metrics, including Shannon, Chao1, Jaccard and Bray-Curtis, are calculated from OTU frequencies. Other metrics, e.g. unweighted UniFrac (called unifrac_binary in usearch) use presence / absence only, effectively considering a count to be one if it is any non-zero value.

OTU frequency does not correlate with species frequency

OTU frequencies have low correlation with species frequencies. This means, for example, that the most abundant OTU usually does not contain the most abundant species.

Cross-talk degrades presence / absence

Some diversity metrics use OTU presence / absence rather than frequencies. In usearch, such metrics are called "binary" because the count is considered to be zero or one. With amplicon reads, presence / absence cannot be reliably measured if samples are multiplexed because cross-talk often causes reads to be incorrectly assigned to a sample where the OTU is in fact absent. This problem is particularly severe if samples from different environments (e.g., human gut and mouse gut) are multiplexed into a single sequencing run.
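As a rough illustration of why cross-talk matters for binary metrics, the sketch below converts counts to presence / absence and compares richness with and without a few stray reads. The counts and the helper name `presence` are made up for this example and do not come from usearch.

```python
def presence(counts):
    """Convert OTU counts to presence / absence (1 if any reads, else 0)."""
    return [1 if c > 0 else 0 for c in counts]

# Hypothetical counts for four OTUs in one sample.
true_counts = [120, 0, 45, 0]      # what is really in the sample
observed_counts = [120, 1, 45, 2]  # same sample after assumed cross-talk

true_richness = sum(presence(true_counts))
observed_richness = sum(presence(observed_counts))

print(true_richness, observed_richness)  # 2 4
```

Three stray reads out of 168 barely change the frequencies, but they double the number of OTUs counted as present.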

Singleton counts are especially suspect

If you follow my recommended procedures, then you will pool reads for all samples and discard singleton unique sequences for making 97% OTUs and discard unique sequences with abundance <8 for making ZOTUs (denoising). Even so, many OTU table entries are often singletons (i.e., have value 1) for smaller OTUs because the total count is distributed over several samples. Small counts are more likely to be spurious, especially singletons, either because the OTU itself is spurious (e.g., an undetected chimera), or because of cross-talk.

Traditional diversity metrics are invalid or hard to interpret

Because of the issues described above, many diversity metrics are invalid, meaningless or hard to interpret when calculated from OTUs. Some alpha diversity metrics, including Chao1 and Robbins, explicitly use singleton counts or singleton frequencies in their formulas. If singleton unique reads or singleton OTUs are discarded, then these calculations are obviously invalid. Either way, singleton counts are suspect as described above, so the calculations are misleading or meaningless in practice. All beta diversity metrics use OTU frequencies or presence / absence, neither of which can be reliably determined from amplicon reads.
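To make the dependence on singletons concrete, here is a minimal sketch that assumes the commonly used bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton OTUs. The exact formula used by a given tool may differ, and the counts are invented for illustration.

```python
def chao1(counts):
    """Bias-corrected Chao1 estimate from a list of OTU counts (assumed form)."""
    observed = [c for c in counts if c > 0]
    f1 = sum(1 for c in observed if c == 1)  # number of singleton OTUs
    f2 = sum(1 for c in observed if c == 2)  # number of doubleton OTUs
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical OTU counts for one sample.
counts = [50, 30, 10, 2, 2, 1, 1, 1, 1]

with_singletons = chao1(counts)
without_singletons = chao1([c for c in counts if c > 1])

print(with_singletons, without_singletons)  # 11.0 5.0
```

Once singletons are discarded, F1 is zero and the estimate collapses to the observed richness, so the extrapolation step no longer does anything.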


Alpha diversity

Alpha diversity is the diversity in a single ecosystem or sample. The simplest measure is richness, the number of species (or OTUs) observed in the sample. Other metrics consider the abundances (frequencies) of the OTUs, for example to give lower weight to lower-abundance OTUs. See alpha diversity metrics. The abundance distribution can be visualized using an octave plot.

Interpretation

It is important to keep in mind that NGS amplicon sequencing cannot reliably measure frequencies or presence / absence of OTUs, so the biological meaning of alpha diversity metrics developed for traditional ecology is unclear, potentially misleading, and difficult to interpret.

Estimators

Some rare species may not have been observed. An alpha diversity estimator attempts to extrapolate from the available observations (reads) to the total number of species in the community. The best-known estimator for NGS OTUs is Chao1. In my opinion, estimators cannot be usefully applied to NGS OTUs because rare species are underrepresented if an abundance threshold is used (e.g., discarding singletons), and, regardless of any threshold, the number of spurious OTUs increases at low abundances. The low-abundance tail of the distribution is therefore highly uncertain, and attempting to extrapolate makes no sense.

Rarefaction

The goal of rarefaction is to get an indication of whether enough observations have been made to get a good measurement of an alpha diversity metric. This is done by making a rarefaction curve which shows the change in a metric as the number of observations increases. If the curve converges to a horizontal asymptote, this indicates that further observations (i.e., more reads) will have little or no effect on the metric. As with estimators, the asymptote of a rarefaction curve depends on the low-abundance tail of the distribution, and is therefore of dubious value when applied to NGS reads. The number of OTUs is almost certain to increase with more reads due to errors, even if all species in a sample have been accounted for, so the rarefaction curve will almost certainly keep a positive slope instead of flattening to a horizontal asymptote.
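The procedure can be sketched as repeated random subsampling of reads without replacement, recording the mean richness at each depth. This is only an illustration under assumed inputs: the function name, the number of replicates and the counts are hypothetical, and it does not model sequencing errors.

```python
import random

def rarefaction_curve(counts, depths, reps=20, seed=0):
    """Mean observed richness at each subsampling depth.

    counts: reads per OTU in one sample; depths: numbers of reads to draw.
    """
    rng = random.Random(seed)
    # Expand counts into one entry per read, labelled by OTU index.
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    curve = []
    for depth in depths:
        # Draw `depth` reads without replacement and count distinct OTUs.
        richness = [len(set(rng.sample(reads, depth))) for _ in range(reps)]
        curve.append((depth, sum(richness) / reps))
    return curve

# Hypothetical sample with 1000 reads spread over 9 OTUs.
counts = [500, 300, 150, 40, 5, 2, 1, 1, 1]
curve = rarefaction_curve(counts, depths=[10, 100, 500, 1000])

for depth, mean_richness in curve:
    print(depth, mean_richness)
```

At the full depth every read is drawn, so the last point is always the observed richness. Whether the curve looks flat before that point is driven largely by the three singleton OTUs, which is exactly the part of the distribution that is least trustworthy.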

Units of measurement

Confusingly, alpha diversity metrics often use different units. Sometimes the meaning is not obvious (entropy!?), and metrics with different units cannot be compared with each other. For example, the popular Shannon index is a measure of entropy where the unit is bits of information if the logarithms are base 2, but people sometimes use natural logarithms (base e) or base 10. None of these variants of the Shannon index have an obvious connection to the number of OTUs, and people often do not say which variant they used, so the numerical values are difficult to interpret.
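A small sketch shows how the same sample gives three different Shannon values depending on the logarithm base. The function name and counts are chosen for illustration only.

```python
import math

def shannon(counts, base=2.0):
    """Shannon index of a list of OTU counts, using logarithms of the given base."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in freqs)

# Hypothetical sample: four OTUs with equal abundance.
counts = [25, 25, 25, 25]

h_bits = shannon(counts, base=2)       # 2 bits
h_nats = shannon(counts, base=math.e)  # ln(4), about 1.386
h_base10 = shannon(counts, base=10)    # log10(4), about 0.602

print(h_bits, h_nats, h_base10)
```

None of the three numbers looks like "four OTUs", and a reported value of 1.4 is uninterpretable unless the base is stated.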

Effective number of OTUs

Metrics using unfamiliar units can be interpreted and compared by converting to an effective number of species.
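For the Shannon index, the standard conversion is to raise the logarithm base to the power of the entropy, which gives the number of equally abundant OTUs that would produce the same value. The sketch below assumes that conversion; the function names and counts are hypothetical.

```python
import math

def shannon(counts, base):
    """Shannon index of a list of OTU counts, using logarithms of the given base."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def effective_number(counts, base=2.0):
    """Effective number of OTUs implied by the Shannon index (assumed: base ** H)."""
    return base ** shannon(counts, base)

even = [25, 25, 25, 25]  # four equally abundant OTUs
skewed = [97, 1, 1, 1]   # one dominant OTU and three rare ones

print(effective_number(even))              # approximately 4
print(effective_number(skewed))            # a little above 1
print(effective_number(skewed, math.e))    # same value up to rounding, whatever the base
```

The result is in units of OTUs and no longer depends on the base, so values from different studies can be compared directly.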


Beta diversity

Beta diversity compares two samples. Usually, this is done by calculating a number that indicates the similarity or difference between the samples. Often, but not always, the number is in the range zero to one.

The pair-wise comparisons for a set of samples can be presented in a distance matrix.

In usearch, beta diversities are always difference measures, not similarity measures, so increasing values indicate lower similarity and increasing distance. For a distance measure D that ranges between zero and one, there is always an equivalent similarity measure S defined by S = 1 - D, for example (Jaccard similarity) = 1 - (Jaccard distance). You can easily convert between distance and similarity measures in a spreadsheet program such as Excel.
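As an illustration, the sketch below computes two common distances for a few made-up samples, builds a pair-wise distance matrix, and converts one distance to a similarity. It assumes the presence / absence form of Jaccard and the textbook Bray-Curtis formula; the definitions used by usearch may differ in detail, and the sample names and counts are hypothetical.

```python
def jaccard_distance(a, b):
    """Presence / absence Jaccard distance between two count vectors."""
    present_a = {i for i, c in enumerate(a) if c > 0}
    present_b = {i for i, c in enumerate(b) if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(present_a & present_b) / len(union)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Hypothetical OTU counts; each position is the same OTU in every sample.
samples = {
    "gut1": [10, 0, 5, 1],
    "gut2": [8, 2, 0, 1],
    "soil": [0, 7, 0, 0],
}

names = list(samples)
# Pair-wise distance matrix stored as a dict keyed by (sample, sample).
matrix = {(p, q): jaccard_distance(samples[p], samples[q])
          for p in names for q in names}

d = matrix[("gut1", "gut2")]
s = 1.0 - d  # equivalent similarity measure
print(d, s)  # 0.5 0.5
print(bray_curtis(samples["gut1"], samples["gut2"]))  # 9/27, about 0.333
```

Both functions return values between zero and one, so the S = 1 - D conversion applies to either.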

Samples can be clustered to bring similar samples together, producing a tree, as shown in the figure above. In usearch, this can be done using the beta_div command, which automatically generates sample trees, or by using the cluster_aggd command to cluster a distance matrix generated by beta_div or by third-party software.
