bam文件转换参考基因组版本
之前的质控结果提示有点问题,查找原因后发现,之前的bam文件是根据hg19参考基因组完成的,所以需要先进行版本转换:
参考资料http://crossmap.sourceforge.net/#convert-bam-cram-sam-format-files
CrossMap.py
工具可以进行各种文件之间的版本转换
Chain file
- Download Ensembl chain files
- Human to Human: ftp://ftp.ensembl.org/pub/assembly_mapping/homo_sapiens/
- Mouse to Mouse: ftp://ftp.ensembl.org/pub/assembly_mapping/mus_musculus/
- Other organisms: ftp://ftp.ensembl.org/pub/assembly_mapping/
- Download UCSC chain files
- Chain files from hg38 (GRCh38) to hg19 and all other organisms: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/
- Chain File from hg19 (GRCh37) to hg17/hg18/hg38 and all other organisms: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/
- Chain File from mm10 (GRCm38) to mm9 and all other organisms: http://hgdownload.soe.ucsc.edu/goldenPath/mm10/liftOver/
CrossMap依赖pyBigWig
安装指令:
conda create -n py3 python=3.6
conda activate py3
conda install -c bioconda pyBigWig
pip3 install CrossMap
使用参数:
$ python CrossMap.py bam
Usage: CrossMap.py bam chain_file input_file output_file [options]
Note: If output_file == STDOUT or -, CrossMap will write BAM file to the screen
Options:
-m INSERT_SIZE, --mean=INSERT_SIZE
Average insert size of pair-end sequencing (bp).
[default=200.0]
-s INSERT_SIZE_STDEV, --stdev=INSERT_SIZE_STDEV
Stanadard deviation of insert size. [default=30.0]
-t INSERT_SIZE_FOLD, --times=INSERT_SIZE_FOLD
A mapped pair is considered as "proper pair" if both
ends mapped to different strand and the distance
between them is less then '-t' * stdev from the mean.
[default=3.0]
-a, --append-tags Add tag to each alignment.
转换结果部分:
(wes) root@1100150:~/project# cat config | while read id
> do
> bam=~/CHD_pooling_seq/${id}.dedup.bam
> python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38.bam
> done
Insert size = 200.000000
Insert size stdev = 30.000000
Number of stdev from the mean = 3.000000
Add tags to each alignment = False
@ 2020-07-13 12:19:07: Read the chain file: /root/biosoft/liftover/hg19ToHg38.over.chain.gz
[W::hts_idx_load3] The index file is older than the data file: /root/CHD_pooling_seq/C1.dedup.bam.bai
@ 2020-07-13 12:19:07: Liftover BAM file: /root/CHD_pooling_seq/C1.dedup.bam ==> /root/project/0.bwa/C1.hg38.bam.bam
由上可知,生成的bam会自带.bam后缀,所以前面的命名可以不加。输出文件名直接写成~/project/0.bwa/${id}.hg38
即可。
终止命令,重新转换:
(wes) root@1100150:~/project# cat config | while read id
> do
> bam=~/CHD_pooling_seq/${id}.dedup.bam
> python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38
> done
Insert size = 200.000000
Insert size stdev = 30.000000
Number of stdev from the mean = 3.000000
Add tags to each alignment = False
@ 2020-07-14 00:45:25: Read the chain file: /root/biosoft/liftover/hg19ToHg38.over.chain.gz
[W::hts_idx_load3] The index file is older than the data file: /root/CHD_pooling_seq/C1.dedup.bam.bai
@ 2020-07-14 00:45:26: Liftover BAM file: /root/CHD_pooling_seq/C1.dedup.bam ==> /root/project/0.bwa/C1.hg38.bam
@ 2020-07-14 01:07:19: Done!
@ 2020-07-14 01:07:19: Sort "/root/project/0.bwa/C1.hg38.bam" and save as "/root/project/0.bwa/C1.hg38.sorted.bam"
@ 2020-07-14 01:14:00: Index "/root/project/0.bwa/C1.hg38.sorted.bam" ...
Total alignments:13282037
QC failed: 0
Paired-end reads:
R1 unique, R2 unique (UU): 12842196
R1 unique, R2 unmapp (UN): 69655
R1 unique, R2 multiple (UM): 0
R1 multiple, R2 multiple (MM): 0
R1 multiple, R2 unique (MU): 25411
R1 multiple, R2 unmapped (MN): 2648
R1 unmap, R2 unmap (NN): 275629
R1 unmap, R2 unique (NU): 66498
R1 unmap, R2 multiple (NM): 0
从结果可以看到,转换版本之后,还会自己对bam文件进行sort。
网友评论