bam文件转换参考基因组版本

之前的质控结果提示有点问题，查找原因后发现，之前的bam文件是根据hg19参考基因组完成的，所以需要先进行版本转换：
参考资料http://crossmap.sourceforge.net/#convert-bam-cram-sam-format-files
CrossMap.py工具可以进行各种文件之间的版本转换

image.png

Chain file

Download Ensembl chain files

Human to Human: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/

Mouse to Mouse: ftp://ftp.ensembl.org/pub/assembly_mapping/mus_musculus/

Other organisms: ftp://ftp.ensembl.org/pub/assembly_mapping/

Download UCSC chain files

Chain files from hg38 (GRCh38) to hg19 and all other organisms: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/

Chain File from hg19 (GRCh37) to hg17/hg18/hg38 and all other organisms: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/

Chain File from mm10 (GRCm38) to mm9 and all other organisms: http://hgdownload.soe.ucsc.edu/goldenPath/mm10/liftOver/

CrossMap依赖pyBigWig

安装指令：

conda create -n py3 python=3.6
conda activate py3
conda install -c bioconda pyBigWig
pip3 install CrossMap

使用参数：

$ python CrossMap.py bam

Usage: CrossMap.py bam chain_file input_file output_file [options]
Note: If output_file == STDOUT or -, CrossMap will write BAM file to the screen

Options:
  -m INSERT_SIZE, --mean=INSERT_SIZE
                        Average insert size of pair-end sequencing (bp).
                        [default=200.0]
  -s INSERT_SIZE_STDEV, --stdev=INSERT_SIZE_STDEV
                        Stanadard deviation of insert size. [default=30.0]
  -t INSERT_SIZE_FOLD, --times=INSERT_SIZE_FOLD
                        A mapped pair is considered as "proper pair" if both
                        ends mapped to different strand and the distance
                        between them is less then '-t' * stdev from the mean.
                        [default=3.0]
  -a, --append-tags     Add tag to each alignment.

转换结果部分：

(wes) root@1100150:~/project# cat config | while read id
> do
> bam=~/CHD_pooling_seq/${id}.dedup.bam
> python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38.bam
> done
Insert size = 200.000000
Insert size stdev = 30.000000
Number of stdev from the mean = 3.000000
Add tags to each alignment = False
@ 2020-07-13 12:19:07: Read the chain file:  /root/biosoft/liftover/hg19ToHg38.over.chain.gz
[W::hts_idx_load3] The index file is older than the data file: /root/CHD_pooling_seq/C1.dedup.bam.bai
@ 2020-07-13 12:19:07: Liftover BAM file: /root/CHD_pooling_seq/C1.dedup.bam ==> /root/project/0.bwa/C1.hg38.bam.bam

由上可知，生成的bam会自带.bam后缀，所以前面的命名可以不加。输出文件名直接写成~/project/0.bwa/${id}.hg38即可。
终止命令，重新转换:

(wes) root@1100150:~/project# cat config | while read id
> do
> bam=~/CHD_pooling_seq/${id}.dedup.bam
> python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38
> done
Insert size = 200.000000
Insert size stdev = 30.000000
Number of stdev from the mean = 3.000000
Add tags to each alignment = False
@ 2020-07-14 00:45:25: Read the chain file:  /root/biosoft/liftover/hg19ToHg38.over.chain.gz
[W::hts_idx_load3] The index file is older than the data file: /root/CHD_pooling_seq/C1.dedup.bam.bai
@ 2020-07-14 00:45:26: Liftover BAM file: /root/CHD_pooling_seq/C1.dedup.bam ==> /root/project/0.bwa/C1.hg38.bam
@ 2020-07-14 01:07:19: Done!
@ 2020-07-14 01:07:19: Sort "/root/project/0.bwa/C1.hg38.bam" and save as "/root/project/0.bwa/C1.hg38.sorted.bam"
@ 2020-07-14 01:14:00: Index "/root/project/0.bwa/C1.hg38.sorted.bam" ...
Total alignments:13282037
     QC failed: 0
     Paired-end reads:
    R1 unique, R2 unique (UU): 12842196
    R1 unique, R2 unmapp (UN): 69655
    R1 unique, R2 multiple (UM): 0
    R1 multiple, R2 multiple (MM): 0
    R1 multiple, R2 unique (MU): 25411
    R1 multiple, R2 unmapped (MN): 2648
    R1 unmap, R2 unmap (NN): 275629
    R1 unmap, R2 unique (NU): 66498
    R1 unmap, R2 multiple (NM): 0

从结果可以看到，转换版本之后，还会自己对bam文件进行sort。