Samtools使用大全

作者: Davey1220 | 来源:发表于2019-10-28 20:59 被阅读0次

Samtools使用大全
samtools使用大全
2020-07-16 解决samtools报错：error wh
bcftools 替代samtools 产生vcf
基因组分析
BAM文件转换成fastq
Samtools 使用
[转]samtools使用大全（2018-05-29）
samtools merge命令
[samtools] bam 文件合并提示No @HD tag

一、简介

Samtools是一个用于操作sam和bam格式文件的应用程序集合，具有众多的功能。它从SAM（序列比对/映射）格式导入和导出，进行排序，合并和索引，并允许快速检索任何区域中的读数。SAM和BAM格式的比对文件主要由bwa、bowtie2、tophat和hisat2等序列比对工具产生，用于记录测序reads在参考基因组上的映射信息。其中，BAM格式文件是SAM文件的的二进制格式，占据内存较小且运算速度更快。

二、基本用法

Samtools由一系列的子命令组成，可以对sam和bam格式的文件进行不同的整合与处理。

$ samtools --help

Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     rmdup          remove PCR duplicates
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

2.1 构建索引(Indexing)

dict -- 对fasta格式的参考基因组序列构建字典索引

$ samtools dict

About:   Create a sequence dictionary file from a fasta file
Usage:   samtools dict [options] <file.fa|file.fa.gz>

Options: -a, --assembly STR    assembly
         -H, --no-header       do not print @HD line
         -o, --output STR      file to write out dict file [stdout]
         -s, --species STR     species
         -u, --uri STR         URI [file:///abs/path/to/file.fa]

Example：samtools dict -a GRCh38 -s "Homo sapiens" hg38.fasta -o hg38.dict
构建完成后会生成一个hg38.dict的参考基因组索引文件

$ head hg38.dict # 查看索引文件的前十行

@HD     VN:1.0  SO:unsorted
@SQ     SN:chr1 LN:248956422    M5:2648ae1bacce4ec4b6cf337dcae37816     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr2 LN:242193529    M5:4bb4f82880a14111eb7327169ffb729b     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr3 LN:198295559    M5:a48af509898d3736ba95dc0912c0b461     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr4 LN:190214555    M5:3210fecf1eb92d5489da4346b3fddc6e     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr5 LN:181538259    M5:f7f05fb7ceea78cbc32ce652c540ff2d     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr6 LN:170805979    M5:6a48dfa97e854e3c6f186c8ff973f7dd     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr7 LN:159345973    M5:94eef2b96fd5a7c8db162c8c74378039     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr8 LN:145138636    M5:c67955b5f7815a9a1edfaa15893d3616     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens
@SQ     SN:chr9 LN:138394717    M5:addd2795560986b7491c40b1faa3978a     UR:file:///data/public/refGenome/bwa_index/hg38/hg38.fa AS:GRCh38  SP:Homo sapiens

faidx -- 对fasta格式的文件建立索引，生成的索引文件以.fai后缀结尾。

$ samtools faidx

Usage:   samtools faidx <file.fa|file.fa.gz> [<reg> [...]]

Example：samtools faidx hg38.fasta
构建完成后会生成一个hg38.fasta.fai的索引文件，该文件共有五列，分别表示：

第一列：序列的名称

第二列：序列的长度

第三列：第一个碱基的偏移量，从0开始计数

第四列：除了最后一行外，序列中每行的碱基数

第五列：除了最后一行外，序列中每行的长度（包括换行符）

$ head hg38.fasta.fai

chr1    248956422       6       50      51
chr2    242193529       253935563       50      51
chr3    198295559       500972969       50      51
chr4    190214555       703234446       50      51
chr5    181538259       897253299       50      51
chr6    170805979       1082422330      50      51
chr7    159345973       1256644435      50      51
chr8    145138636       1419177334      50      51
chr9    138394717       1567218749      50      51
chr10   133797422       1708381368      50      51

该命令也可以根据索引文件，快速提取fasta文件中任意区域的序列文件

# 提取1号染色体的序列信息
$ samtools faidx ha38.fasta chr1 > hg38.chr1.fa

# 提取1号染色体200-1000bp之间的序列
$ samtools faidx hg38.fasta chr1:200-1000 > hg38.chr1_200_1000.fa

index -- 对bam格式的比对文件构建索引，需对bam文件进行排序后才能构建索引

$ samtools index

Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]

Example： samtools index aln.sorted.bam
构建完成后将产生后缀为.bai的文件，用于快速的随机处理

2.2 文件编辑(Editing)

calmd -- recalculate MD/NM tags and '=' bases 计算MD tag标签值（记录了mismatch信息）

$ samtools calmd
Usage: samtools calmd [-eubrAES] <aln.bam> <ref.fasta>

Options:
  -e       change identical bases to '='
  -u       uncompressed BAM output (for piping)
  -b       compressed BAM output
  -S       ignored (input format is auto-detected)
  -A       modify the quality string
  -r       compute the BQ tag (without -A) or cap baseQ by BAQ (with -A)
  -E       extended BAQ for better sensitivity but lower specificity
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example：samtools calmd -AEur aln.bam ref.fa

fixmate -- fix mate information 为以名称排序定位的alignment填入配对坐标，ISIZE（inferred insert size推测的插入序列大小）和配对相应的标签（flag）

$ samtools fixmate

Usage: samtools fixmate <in.nameSrt.bam> <out.nameSrt.bam>
Options:
  -r           Remove unmapped reads and secondary alignments
  -p           Disable FR proper pair check
  -c           Add template cigar ct tag
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

As elsewhere in samtools, use '-' as the filename for stdin/stdout. The input
file must be grouped by read name (e.g. sorted by name). Coordinated sorted
input is not accepted.

Example：samtools fixmate in.namesorted.sam out.bam

reheader -- replace BAM header 替换bam文件头注释信息

$ samtools reheader

Usage: samtools reheader [-P] in.header.sam in.bam > out.bam
   or  samtools reheader [-P] -i in.header.sam file.bam

Options:
    -P, --no-PG      Do not generate an @PG header line.
    -i, --in-place   Modify the bam/cram file directly.
                     (Defaults to outputting to stdout.)

rmdup -- remove PCR duplicates 去除bam文件中的PCR重复信息

$ samtools rmdup

Usage:  samtools rmdup [-sS] <input.srt.bam> <output.bam>

Option: -s    rmdup for SE reads
        -S    treat PE reads as SE in rmdup (force -s)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example：samtools rmdup -sS in.sorted.bam output.bam

targetcut -- cut fosmid regions (for fosmid pool only) 此命令仅用于从fosmid混池测序中切除fosmid克隆序列

$ samtools targetcut

Usage: samtools targetcut [-Q minQ] [-i inPen] [-0 em0] [-1 em1] [-2 em2] <in.bam>
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -f, --reference FILE
               Reference sequence FASTA FILE [null]

addreplacerg -- adds or replaces RG tags 该命令用于添加或更改bam文件中的RG tag标签值

$ samtools addreplacerg

Usage: samtools addreplacerg [options] [-r <@RG line> | -R <existing id>] [-o <output.bam>] <input.bam>

Options:
  -m MODE   Set the mode of operation from one of overwrite_all, orphan_only [overwrite_all]
  -o FILE   Where to write output to [stdout]
  -r STRING @RG line text
  -R STRING ID of @RG line in existing header to use
      --input-fmt FORMAT[,OPT[=VAL]]...
               Specify input format (SAM, BAM, CRAM)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example ：samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam

2.3 文件处理(File operations)

collate -- shuffle and group alignments by name 根据read name信息将bam文件进行随机打散和分组

$ samtools collate

Usage:   samtools collate [-Ou] [-n nFiles] [-c cLevel] <in.bam> <out.prefix>

Options:
      -O       output to stdout
      -u       uncompressed BAM output
      -l INT   compression level [1]
      -n INT   number of temporary files [64]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools collate -o aln.name_collated.bam aln.sorted.bam

cat -- concatenate BAMs 将多个bam文件合并为一个bam文件（与merge命令的区别是cat不需要将bam文件提前进行排序）
```
$ samtools cat

Usage: samtools cat [-h header.sam] [-o out.bam] <in1.bam> [...]
```
- Example: samtools cat -o out.bam in1.bam in2.bam

merge -- merge sorted alignments 将多个已经sort的bam文件进行合并

$ samtools merge

Usage: samtools merge [-nurlf] [-h inh.sam] [-b <bamlist.fofn>] <out.bam> <in1.bam> [<in2.bam> ... <inN.bam>]

Options:
  -n         Input files are sorted by read name
  -r         Attach RG tag (inferred from file names)
  -u         Uncompressed BAM output
  -f         Overwrite the output BAM if exist
  -1         Compress level 1
  -l INT     Compression level, from 0 to 9 [-1]
  -R STR     Merge file in the specified region STR [all]
  -h FILE    Copy the header in FILE to <out.bam> [in1.bam]
  -c         Combine @RG headers with colliding IDs [alter IDs to be distinct]
  -p         Combine @PG headers with colliding IDs [alter IDs to be distinct]
  -s VALUE   Override random seed
  -b FILE    List of input BAM filenames, one per line [null]
  -@, --threads INT
             Number of BAM/CRAM compression threads [0]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools merge merged.bam in1.sorted.bam in2.sorted.bam

mpileup -- multi-way pileup 对参考基因组每个位点做碱基堆积，用于call SNP和INDEL。主要是生成BCF、VCF文件或者pileup一个或多个bam文件。比对记录以在@RG中的样本名作为区分标识符。如果样本标识符缺失，那么每一个输入文件则视为一个样本

$ samtools mpileup

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
  -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding
  -A, --count-orphans     do not discard anomalous read pairs
  -b, --bam-list FILE     list of input BAM filenames, one per line
  -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)
  -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]
  -d, --max-depth INT     max per-file depth; avoids excessive memory usage [250]
  -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs
  -f, --fasta-ref FILE    faidx indexed reference sequence file
  -G, --exclude-RG FILE   exclude read groups listed in FILE
  -l, --positions FILE    skip unlisted positions (chr pos) or regions (BED)
  -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
  -r, --region REG        region in which pileup is generated
  -R, --ignore-RG         ignore RG tags (one BAM = one sample)
  --rf, --incl-flags STR|INT  required flags: skip reads with mask bits unset []
  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                            [UNMAP,SECONDARY,QCFAIL,DUP]
  -x, --ignore-overlaps   disable read-pair overlap detection

Output options:
  -o, --output FILE       write output to FILE [standard output]
  -g, --BCF               generate genotype likelihoods in BCF format
  -v, --VCF               generate genotype likelihoods in VCF format

Output options for mpileup format (without -g/-v):
  -O, --output-BP         output base positions on reads
  -s, --output-MQ         output mapping quality

Output options for genotype likelihoods (when -g/-v is used):
  -t, --output-tags LIST  optional tags to output:
               DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
  -u, --uncompressed      generate uncompressed VCF/BCF output

SNP/INDEL genotype likelihoods options (effective with -g/-v):
  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [20]
  -F, --gap-frac FLOAT    minimum fraction of gapped reads [0.002]
  -h, --tandem-qual INT   coefficient for homopolymer errors [100]
  -I, --skip-indels       do not perform indel calling
  -L, --max-idepth INT    maximum per-file depth for INDEL calling [250]
  -m, --min-ireads INT    minimum number gapped reads for indel candidates [1]
  -o, --open-prob INT     Phred-scaled gap open seq error probability [40]
  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity
  -P, --platforms STR     comma separated list of platforms for indels [all]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Notes: Assuming diploid individuals.

Example: samtools mpileup -C 50 -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
主要参数解读：

-A：在检测变异中，不忽略异常的reads对

-C：用于调节比对质量的系数，如果reads中含有过多的错配，不能设置为零

-D：输出每个样本的reads深度

-l：BED文件或者包含区域位点的位置列表文件

注意：位置文件包含两列，染色体和位置，从1开始计数。BED文件至少包含3列，染色体、起始和终止位置，开始端从0开始计数。

-r：在指定区域产生pileup，需已建立索引的bam文件，通常和-l参数一起使用

-o/g/v：输出文件类型（标准格式文件或者VCF、BCF文件）

-t：设置FORMAT和INFO的列表内容，以逗号分割

-u：生成未压缩的VCF和BCF文件

-I：跳过INDEL检测

-m：候选INDEL的最小间隔的reads

-f：输入有索引文件的fasta参考序列

-F ：含有间隔reads的最小片段
mpileup生成的结果包含6列：1) 参考序列名，2) 位置，3) 参考碱基，4) 比对上的reads数，5) 比对情况，6) 比对上的碱基的质量。其中第5列比较复杂,解释如下：
a) ‘.’代表与参考序列正链匹配。
b) ‘,’代表与参考序列负链匹配。
c) ‘ATCGN’代表在正链上的不匹配。
d) ‘atcgn’代表在负链上的不匹配。
e) ‘*’代表模糊碱基
f) ‘’代表匹配的碱基是一个read的开始；’'后面紧跟的ascii码减去33代表比对质量；这两个符号修饰的是后面的碱基，其后紧跟的碱基(.,ATCGatcgNn)代表该read的第一个碱基。
g) ‘$’代表一个read的结束，该符号修饰的是其前面的碱基。
h) 正则式’+[0-9]+[ACGTNacgtn]+’代表在该位点后插入的碱基；
i) 正则式’-[0-9]+[ACGTNacgtn]+’代表在该位点后缺失的碱基；

sort -- sort alignment file 对比对后的bam文件进行排序，默认按coordinate进行排序

$ samtools sort

Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
  -@, --threads INT
             Set number of sorting and compression threads [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
主要参数释义：

-l：设置文件压缩等级，0不压缩，9压缩最高

-m：每个线程运行内存大小（可使用K M G表示）

-n：按照read名称进行排序

-o：排序后的输出文件

-T：PREFIX临时文件前缀

-@：设置排序和压缩的线程数，默认单线程

split -- splits a file by read group 根据read group标签将bam文件进行分割

$ samtools split

Usage: samtools split [-u <unaccounted.bam>[:<unaccounted_header.sam>]]
                      [-f <format_string>] [-v] <merged.bam>
Options:
  -f STRING       output filename format string ["%*_%#.%."]
  -u FILE1        put reads with no RG tag or an unrecognised RG tag in FILE1
  -u FILE1:FILE2  ...and override the header with FILE2
  -v              verbose output
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Format string expansions:
  %%     %
  %*     basename
  %#     @RG index
  %!     @RG ID
  %.     filename extension for output format

Example: samtools split merged.bam

quickcheck -- quickly check if SAM/BAM/CRAM file appears intact 检查SAM/BAM/CRAM文件的完整性

$ samtools quickcheck

Usage: samtools quickcheck [options] <input> [...]
Options:
  -v              verbose output (repeat for more verbosity)

Notes:

1. In order to use this command effectively, you should check its exit status;
   without any -v options it will NOT print any output, even when some files
   fail the check. One way to use quickcheck might be as a check that all
   BAM files in a directory are okay:

        samtools quickcheck *.bam && echo 'all ok' \
           || echo 'fail!'

   To also determine which files have failed, use the -v option:

        samtools quickcheck -v *.bam > bad_bams.fofn \
           && echo 'all ok' \
           || echo 'some files failed check, see bad_bams.fofn'

Example: samtools quickcheck in1.bam in2.cram

fastq -- converts a BAM to a FASTQ 将bam格式文件转换为fastq文件

$ samtools fastq

Usage: samtools fastq [options...] <in.bam>
Options:
  -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
  -1 FILE   write paired reads flagged READ1 to FILE
  -2 FILE   write paired reads flagged READ2 to FILE
  -f INT    only include reads with all bits set in INT set in FLAG [0]
  -F INT    only include reads with none of the bits set in INT set in FLAG [0]
  -n        don't append /1 and /2 to the read name
  -O        output quality in the OQ tag if present
  -s FILE   write singleton reads to FILE [assume single-end]
  -t        copy RG, BC and QT tags to the FASTQ header line
  -v INT    default quality score if not given in file [1]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools fastq input.bam > output.fastq

fasta -- converts a BAM to a FASTA 将bam格式文件转换为fasta文件

$ samtools fasta

Usage: samtools fasta [options...] <in.bam>
Options:
  -0 FILE   write paired reads flagged both or neither READ1 and READ2 to FILE
  -1 FILE   write paired reads flagged READ1 to FILE
  -2 FILE   write paired reads flagged READ2 to FILE
  -f INT    only include reads with all bits set in INT set in FLAG [0]
  -F INT    only include reads with none of the bits set in INT set in FLAG [0]
  -n        don't append /1 and /2 to the read name
  -s FILE   write singleton reads to FILE [assume single-end]
  -t        copy RG, BC and QT tags to the FASTA header line
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools fasta input.bam > output.fasta

2.4 数据统计(Statistics)

bedcov -- read depth per BED region 统计给定region区间内reads的深度，每个碱基的深度叠加在一起

$ samtools bedcov

Usage: samtools bedcov [options] <in.bed> <in1.bam> [...]

  -Q INT       Only count bases of at least INT quality [0]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools bedcov in.bed aln.sorted.bam

depth -- compute the depth 统计每个碱基位点的测序深度，注意计算前必须对bam文件排序并构建索引

$ samtools depth

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b <bed>            list of positions or regions
   -f <list>           list of input BAM filenames, one per line [null]
   -l <int>            read length threshold (ignore reads shorter than <int>)
   -d/-m <int>         maximum coverage depth [8000]
   -q <int>            base quality threshold
   -Q <int>            mapping quality threshold
   -r <chr:from-to>    region
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth.  Note that positions with zero coverage may be
omitted by default; see the -a option.

Example: samtools depth aln.sorted.bam

flagstat -- simple stats 统计bam文件中read的比对情况
```
$ samtools flagstat

Usage: samtools flagstat [--input-fmt-option OPT=VAL] <in.bam>
```
- Example: samtools flagstat aln.sorted.bam

idxstats -- BAM index stats 统计bam索引文件里的比对信息
```
$ samtools idxstats

Usage: samtools idxstats <in.bam>
```
- Example: samtools idxstats aln.sorted.bam
- idxstats生成一个统计表格，共有4列，分别为”序列名，序列长度，比对上的reads数，unmapped reads number”。第4列应该是paired reads中有一端能匹配到该scaffold上，而另外一端不匹配到任何scaffolds上的reads数。

phase -- Call and phase heterozygous SNPs

$ samtools phase

Usage:   samtools phase [options] <in.bam>

Options: -k INT    block length [13]
         -b STR    prefix of BAMs to output [null]
         -q INT    min het phred-LOD [37]
         -Q INT    min base quality in het calling [13]
         -D INT    max read depth [256]
         -F        do not attempt to fix chimeras
         -A        drop reads with ambiguous phase

      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools phase test.bam

stats -- generate stats (former bamcheck) 对bam文件做详细统计, 统计结果可用mics/plot-bamstats作图

$ samtools stats

About: The program collects statistics from BAM files. The output can be visualized using plot-bamstats.
Usage: samtools stats [OPTIONS] file.bam
       samtools stats [OPTIONS] file.bam chr:from-to
Options:
    -c, --coverage <int>,<int>,<int>    Coverage distribution min,max,step [1,1000,1]
    -d, --remove-dups                   Exclude from statistics reads marked as duplicates
    -f, --required-flag  <str|int>      Required flag, 0 for unset. See also `samtools flags` [0]
    -F, --filtering-flag <str|int>      Filtering flag, 0 for unset. See also `samtools flags` [0]
        --GC-depth <float>              the size of GC-depth bins (decreasing bin size increases memory requirement) [2e4]
    -h, --help                          This help message
    -i, --insert-size <int>             Maximum insert size [8000]
    -I, --id <string>                   Include only listed read group or sample name
    -l, --read-length <int>             Include in the statistics only reads with the given read length []
    -m, --most-inserts <float>          Report only the main part of inserts [0.99]
    -P, --split-prefix <str>            Path or string prefix for filepaths output by -S (default is input filename)
    -q, --trim-quality <int>            The BWA trimming parameter [0]
    -r, --ref-seq <file>                Reference sequence (required for GC-depth and mismatches-per-cycle calculation).
    -s, --sam                           Ignored (input format is auto-detected).
    -S, --split <tag>                   Also write statistics to separate files split by tagged field.
    -t, --target-regions <file>         Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.
    -x, --sparse                        Suppress outputting IS rows where there are no insertions.
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools stats aln.sorted.bam
输出的信息比较多，部分如下：
Summary Numbers，raw total sequences，filtered sequences, reads mapped, reads mapped and paired,reads properly paired等信息
Fragment Qualitites：根据cycle统计每个位点上的碱基质量分布
Coverage distribution：深度为1，2，3，，，的碱基数目
ACGT content per cycle：ACGT在每个cycle中的比例
Insert sizes：插入长度的统计
Read lengths：read的长度分布

2.5 可视化(Viewing)

flags -- explain BAM flags 查看不同flag值的含义

$ samtools flags

About: Convert between textual and numeric flag representation
Usage: samtools flags INT|STR[,...]

Flags:
        0x1     PAIRED        .. paired-end (or multiple-segment) sequencing technology
        0x2     PROPER_PAIR   .. each segment properly aligned according to the aligner
        0x4     UNMAP         .. segment unmapped
        0x8     MUNMAP        .. next segment in the template unmapped
        0x10    REVERSE       .. SEQ is reverse complemented
        0x20    MREVERSE      .. SEQ of the next segment in the template is reversed
        0x40    READ1         .. the first segment in the template
        0x80    READ2         .. the last segment in the template
        0x100   SECONDARY     .. secondary alignment
        0x200   QCFAIL        .. not passing quality controls
        0x400   DUP           .. PCR or optical duplicate
        0x800   SUPPLEMENTARY .. supplementary alignment

Example: samtools flags PAIRED,UNMAP,MUNMAP

FLAGS:

0x1	PAIRED	paired-end (or multiple-segment) sequencing technology
0x2	PROPER_PAIR	each segment properly aligned according to the aligner
0x4	UNMAP	segment unmapped
0x8	MUNMAP	next segment in the template unmapped
0x10	REVERSE	SEQ is reverse complemented
0x20	MREVERSE	SEQ of the next segment in the template is reverse complemented
0x40	READ1	the first segment in the template
0x80	READ2	the last segment in the template
0x100	SECONDARY	secondary alignment
0x200	QCFAIL	not passing quality controls
0x400	DUP	PCR or optical duplicate
0x800	SUPPLEMENTARY	supplementary alignment

tview -- text alignment viewer 直观地显示reads比对到参考基因组的情况，类似于基因组浏览器

$ samtools tview

Usage: samtools tview [options] <aln.bam> [ref.fasta]
Options:
   -d display      output as (H)tml or (C)urses or (T)ext
   -p chr:pos      go directly to this position
   -s STR          display only reads from this sample or group
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Example: samtools tview aln.sorted.bam ref.fasta
需先对bam文件进行排序并构建索引

view -- SAM<->BAM<->CRAM conversion 将sam和bam文件进行格式互换

$ samtools view

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:
  -b       output BAM
  -C       output CRAM (requires -T)
  -1       use fast BAM compression (implies -b)
  -u       uncompressed BAM output (implies -b)
  -h       include header in SAM output
  -H       print SAM header only (no alignments)
  -c       print only the count of matching records
  -o FILE  output file name [stdout]
  -U FILE  output reads not selected by filters to FILE [null]
  -t FILE  FILE listing reference names and lengths (see long help) [null]
  -L FILE  only include reads overlapping this BED FILE [null]
  -r STR   only include reads in read group STR [null]
  -R FILE  only include reads with read group listed in FILE [null]
  -q INT   only include reads with mapping quality >= INT [0]
  -l STR   only include reads in library STR [null]
  -m INT   only include reads with number of CIGAR operations consuming
           query sequence >= INT [0]
  -f INT   only include reads with all bits set in INT set in FLAG [0]
  -F INT   only include reads with none of the bits set in INT set in FLAG [0]
  -x STR   read tag to strip (repeatable) [null]
  -B       collapse the backward CIGAR operation
  -s FLOAT integer part sets seed of random number generator [0];
           rest sets fraction of templates to subsample [no subsampling]
  -@, --threads INT
           number of BAM/CRAM compression threads [0]
  -?       print long help, including note about region specification
  -S       ignored (input format is auto-detected)
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
  -T, --reference FILE
               Reference sequence FASTA FILE [null]

Example ：samtools view -bS test.sam > test.bam
重要参数解读：

-b：输出bam格式

-C：输出CRAM文件

-1：快速压缩（需要-b）

-u：输出未压缩的bam文件，节约时间，占据较多磁盘空间（需要-b）

-h：默认输出sam文件不带表头，该参数设定后输出带表头信息

-H：仅仅输出表头信息

-c：仅打印匹配数

-o：输出文件（stdout标准输出）

-U：输出没有经过过滤选择的reads

-t：制表分隔符文件（需要提供额外的参考数据，比如参考基因组的索引）

-L：仅包括和bed文件有重叠的reads

-r：仅输出在STR读段组中的reads

-R：仅输出特定reads

-q：定位的质量大于INT[默认0]

-l：仅输出在STR 库中的reads

-F：获得比对上（mapped）的过滤设置[默认0]

-f：获得未必对上（unmapped）的过滤设置[默认0]

-T：使用fasta格式的参考序列

depad -- convert padded BAM to unpadded BAM

$ samtools depad

Usage:   samtools depad <in.bam>

Options:
  -s           Output is SAM (default is BAM)
  -S           Input is SAM (default is BAM)
  -u           Uncompressed BAM output (can't use with -s)
  -1           Fast compression BAM output (can't use with -s)
  -T, --reference FILE
               Padded reference sequence file [null]
  -o FILE      Output file name [stdout]
  -?           Longer help
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE

Example: samtools depad input.bam

三、常用示例

sam/bam格式文件互换

$ samtools view -bS -1 test.sam > test.bam # sam转bam
$ samtools view -h test.bam > test.sam # bam转sam

提取比对到参考基因组上的数据

$ samtools view -bF 4 test.bam > test.F.bam

提取没有比对到参考基因组上的数据

$ samtools view -bf 4 test.bam > test.f.bam

双端reads都比对到参考基因组上的数据

$ samtools view -bF 12 test.bam > test.12.bam

单端reads1比对到参考基因组上的数据

samtools view -bF 4 -f 8  test .bam > test1.bam

单端reads2比对到参考基因组上的数据

$ samtools view -bF 8 -f 4 test.bam > test2.bam

提取bam文件中比对到scaffold1上的比对结果，并保存到sam文件格式

$ samtools view abc.bam scaffold1 > scaffold1.sam

提取scaffold1上能比对到30k到100k区域的比对结果

$ samtools view abc.bam scaffold1:30000-100000 > scaffold1_30k-100k.sam

根据fasta文件，将 header 加入到 sam 或 bam 文件中

$ samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam

call SNP和INDEL等变异信息

$ samtools mpileup -f genome.fasta abc.bam > abc.txt
$ samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf
$ samtools mpileup -guSDf genome.fasta abc.bam | \
           bcftools view -cvNg - > abc.vcf

参考来源：
http://www.htslib.org/doc/samtools.html
https://www.cnblogs.com/xiaofeiIDO/p/6805373.html
https://blog.csdn.net/sunchengquan/article/details/85176940

Samtools使用大全
一、简介 Samtools是一个用于操作sam和bam格式文件的应用程序集合，具有众多的功能。它从SAM（序列比...
samtools使用大全
samtools是一个用于操作sam和bam文件（通常是短序列比对工具如bwa,bowtie2,hisat2,to...
2020-07-16 解决samtools报错：error wh
今天想使用samtools时突然报错： samtools: error while loading shared ...
bcftools 替代samtools 产生vcf
使用samtools 1.10 发现不能生成vcf了 samtools mpileup bcftools call...
基因组分析
bwa 安装使用 Samtools 安装使用 Picard 安装使用
BAM文件转换成fastq
1、使用samtools ## 将BAM文件按照read name进行排序samtools sort -n ${S...
Samtools 使用
samtools常用命令详解我的一个例子，用来分离并改chr名字,HPC递交系统，mouse_chr 分离并改成c...
[转]samtools使用大全（2018-05-29）
转自：https://blog.csdn.net/sinat_38163598/article/details/7...
samtools merge命令
samtools merge命令的功能描述：当有多个样本的bam文件时，可以使用samtools的merge命令...
[samtools] bam 文件合并提示No @HD tag
使用版本 samtools 1.9 我找的别人安装好的旧版本，替换的samtools 0.1.13 版本，可...

Samtools使用大全

一、简介

二、基本用法

2.1 构建索引(Indexing)

2.2 文件编辑(Editing)

2.3 文件处理(File operations)

2.4 数据统计(Statistics)

2.5 可视化(Viewing)

三、常用示例

相关文章

Samtools使用大全

samtools使用大全

2020-07-16 解决samtools报错：error wh

bcftools 替代samtools 产生vcf

基因组分析

BAM文件转换成fastq

Samtools 使用

[转]samtools使用大全（2018-05-29）

samtools merge命令

[samtools] bam 文件合并提示No @HD tag

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

科研信息学

Linux-NGS-二代测序分析流程

生物信息学

小教程收藏