samtools的功能大全:查看bam文件外还能做什么

作者: Amy_Cui | 来源:发表于2019-10-09 14:43 被阅读0次

    必须学习这篇英文:http://www.htslib.org/doc/samtools.html
    【继续更新ing】

    samtools view -h

    查看bam文件,包含头文件,去掉-h,不包含

    samtools view -h s.bam|less -S
    samtools view s.bam|less -S
    # 提取chr1染色体,生成只有chr1的bam文件
    samtools view -h -b s.bam chr1 >s.chr1.bam
    samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
    

    samtools view -c

    -c print only the count of matching records;统计比对条数;结果等于samtools flagstat的总reads条数。

    # 单端比对reads数目的统计
    cuiqingmei 2019/10/09 10:57:10 /ifs9/BC_PS/cuiqingmei/ass/3.mapping/result_bam
    $ samtools view -c CL100141207_L01_69.sort.bam
    31504107
    cuiqingmei 2019/10/09 10:57:38 /ifs9/BC_PS/cuiqingmei/ass/3.mapping/result_bam
    $ zcat ../CL100141207_L01_69.fq.gz |wc -l
    126016428
    cuiqingmei 2019/10/09 11:02:26 /ifs9/BC_PS/cuiqingmei/ass/3.mapping/result_bam
    $ echo "scale=4;126016428/4"|bc
    31504107.0000
    
    qmcui 12:07:32 /teach/project/1.rna/4.clean_data_25000reads
    $ echo `zcat SRR1039510_1_val_1.100000.fq.gz|wc -l`/4|bc
    25000
    qmcui 12:08:12 /teach/project/1.rna/4.clean_data_25000reads
    $ echo `zcat SRR1039510_2_val_2.100000.fq.gz|wc -l`/4|bc
    25000
    qmcui 12:45:49 /teach/project/1.rna/4.clean_data_25000reads
    $ samtools view -c  ../5.sort.bam/SRR1039510.sort.bam 
    55621
    

    “Instead of printing the alignments, only count them and print the total number.” samtools view -c 算的不是reads,而是alignments数。当read有很多位置可以align上同时又都输出了,用samtools view -c 会比实际reads树木要多~~~ by: 严云

    samtools view -F/-f 4

    看下表,4是unmapped;所以我们-F(none of) 4就是找到map的reads数目
    -q INT only include reads with mapping quality >= INT [0]
    -m INT only include reads with number of CIGAR operations consuming
    query sequence >= INT [0]
    -f INT only include reads with all of the FLAGs in INT present [0]
    -F INT only include reads with none of the FLAGS in INT present [0]
    -G INT only EXCLUDE reads with all of the FLAGs in INT present [0]

    So-f 4 only output alignments that areunmapped(flag 0x0004 is set) and -F 4only output alignments that are not unmapped (i.e. flag 0x0004 is not set), hence these would onlyinclude mapped alignments.

    cuiqingmei 2019/10/09 11:02:58 /ifs9/BC_PS/cuiqingmei/ass/3.mapping/result_bam
    $ samtools view -c CL100141207_L01_69.sort.bam -F 4
    30441976
    # 想知道有多少paired end reads有mate并且都有map时,可以使用-f 1 -F 12来过滤。
    samtools view -c -f 1 -F 12 test.bam
    # 12是   read unmapped、   mate unmapped;12=8+4
    
    # Mapped reads only
    $ samtools view -c -F 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
    5068340
    
    # Unmapped reads only
    $ samtools view -c -f 4 HG00173.chrom11.ILLUMINA.bwa.FIN.low_coverage.20111114.bam
    149982
    
    Flag Chr Description
    0×0001 p the read is paired in sequencing
    0×0002 P the read is mapped in a proper pair
    0×0004 u the query sequence itself is unmapped
    0×0008 U the mate is unmapped
    0×0010 r strand of the query (1 for reverse)
    0×0020 R strand of the mate
    0×0040 1 the read is the first read in a pair
    0×0080 2 the read is the second read in a pair
    0×0100 s the alignment is not primary
    0×0200 f the read fails platform/vendor quality checks
    0×0400 d the read is either a PCR or an optical duplicate

    samtools sort

    bam文件必须排序
    一般按照默认参数,控制线程数均可,-@ 5即5个线程。
    -n Sort by read name:按照reads名字来排序
    -l INT Set compression level, from 0 (uncompressed) to 9 (best):设置压缩倍数
    -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]:
    Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>
    -m 参数默认下是 500,000,000 即500M(不支持K,M,G等缩写)。对于处理大数据时,如果内存够用,则设置大点的值,以节约时间。
    -n 设定排序方式按short reads的ID排序。默认下是按序列在fasta文件中的顺序(即header)和序列从左往右的位点排序。

    $ samtools sort -@ 5  -o output.sort.bam input.sam
    $ samtools sort -@ 5  -o output.sort.bam input.bam
    # 排序sam,bam均可,而且排序后结果默认生成bam
    # samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
    

    samtools flags

    explain BAM flags:解释bam文件第二列标签的含义

    cuiqingmei 2019/10/09 11:51:33 
    $ samtools flags 4  
    0x4 4   UNMAP
    cuiqingmei 2019/10/09 11:52:44
    $ samtools flags 355
    0x163   355 PAIRED,PROPER_PAIR,MREVERSE,READ1,SECONDARY
    

    samtools index

    给sam或者bam文件建立索引,一般下游程序需要位置信息的时候,就必须在bam同目录下存在bai索引文件。

    # samtools 1.8版本
    $ samtools index
    Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
    Options:
      -b       Generate BAI-format index for BAM files [default]
      -c       Generate CSI-format index for BAM files
      -m INT   Set minimum interval size for CSI indices to 2^INT [14]
      -@ INT   Sets the number of threads [none]
    cuiqingmei 2019/10/09 11:56:21 /ifs9/
    $ samtools index CL100141207_L01_69.sort.bam
    

    samtools idxstat

    结果解释:reference sequence name, sequence length, # mapped reads and # unmapped reads,\t分割

    $ samtools idxstats CL100141207_L01_69.sort.bam   
    chr1    249250621   2443169 0
    chr2    243199373   2575343 0
    chr3    198022430   2018108 0
    chr4    191154276   1987277 0
    chr5    180915260   1838927 0
    ...
    *   0   0   1062131
    

    samtools markdup

    去重

    EXAMPLE
    
    # The first sort can be omitted if the file is already name ordered
    samtools sort -n -o namesort.bam example.bam
    
    
    # Add ms and MC tags for markdup to use later
    samtools fixmate -m namesort.bam fixmate.bam
    # -m:Add ms (mate score) tags. These are used by markdup to select the best reads to keep.
    
    # Markdup needs position order
    samtools sort -o positionsort.bam fixmate.bam
    
    
    # Finally mark duplicates
    samtools markdup positionsort.bam markdup.bam
    

    samtools rmdup:过时

    samtools rmdup [-sS] <input.srt.bam> <out.bam>
    This command is obsolete. Use markdup instead.命令过时,最好不要再使用。


    值得学习翻译链接:
    http://qnot.org/2012/04/14/counting-the-number-of-reads-in-a-bam-file/
    http://www.htslib.org/doc/samtools.html

    还可以学习:http://www.chenlianfu.com/?p=1399

    相关文章

      网友评论

        本文标题:samtools的功能大全:查看bam文件外还能做什么

        本文链接:https://www.haomeiwen.com/subject/gghxkctx.html