原地址
几点说明
1.非简单翻译,所有代码均可运行,为了辅助理解,基本每步代码都有结果,需要比较的进行了整合
2.原文中的软件都下载最新版本
3.原文中有少量代码是错误的,这里进行了修正
4.对于需要的一些知识背景,在这里进行了注释或链接到他人博客
一共4部分
通过简单数据熟悉Linux下生物信息学各种操作1
通过简单数据熟悉Linux下生物信息学各种操作2
通过简单数据熟悉Linux下生物信息学各种操作3
通过简单数据熟悉Linux下生物信息学各种操作4
11安装使用bwa(mac)
11.1安装编译并创建到bin目录的链接
cat src
curl -OL http://downloads.sourceforge.net/project/bio-bwa/bwa-0.7.17.tar.bz2
tar jxvf bwa-0.7.17.tar.bz2
cd bwa-0.7.17/
make
ln -s ~/src/bwa-0.7.17/bwa ~/bin
11.2 制作index
创建文件夹放references
获取1999爆发ebola的参考基因组
mkdir -p ~/refs/852
efetch -db nucleotide -id NC_002549 -format fasta > ~/refs/852/ebola-1999.fa
bwa index ~/refs/852/ebola-1999.fa
ls ~/refs/852
ebola-1999.fa ebola-1999.fa.ann ebola-1999.fa.pac
ebola-1999.fa.amb ebola-1999.fa.bwt ebola-1999.fa.sa
通过blast也可以搜寻到
makeblastdb -in ~/refs/852/ebola-1999.fa -dbtype nucl -parse_seqids
ls ~/refs/852
ebola-1999.fa ebola-1999.fa.nhr ebola-1999.fa.nsi
ebola-1999.fa.amb ebola-1999.fa.nin ebola-1999.fa.nsq
ebola-1999.fa.ann ebola-1999.fa.nog ebola-1999.fa.pac
ebola-1999.fa.bwt ebola-1999.fa.nsd ebola-1999.fa.sa
共有16个文件,这也是为什么刚才创建单独文件夹的原因
获取ebolas 基因组的前1行作为query序列
head -2 ~/refs/852/ebola-1999.fa > query.fa
现在用bwa-mem把上面的query map到它自己的基因组上去
cat results.sam
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0 NC_002549.1 1 60 70M * 0 0 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 AS:i:70 XS:i:0
比较用上面同样序列的blasting结果
blastn -db ~/refs/852/ebola-1999.fa -query query.fa > results.blastn
Query= NC_002549.1 Zaire ebolavirus isolate Ebola
virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome
Length=70
Score E
Sequences producing significant alignments: (Bits) Value
NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD... 130 6e-34
>NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga,
complete genome
Length=18959
Score = 130 bits (70), Expect = 6e-34
Identities = 70/70 (100%), Gaps = 0/70 (0%)
Strand=Plus/Plus
Query 1 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGA 60
Query 61 AGATTAATAA 70
||||||||||
Sbjct 61 AGATTAATAA 70
Lambda K H
1.33 0.621 1.12
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 1079922
Database: /Users/ucco/refs/852/ebola-1999.fa
12理解SAM格式
现在vi query.fa,增加或删除几个bases,然后进行比对
比如在第6个碱基开始添加一串T
cat query.fa
>NC_002549.1 Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome
CGGACTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA
再次运行比对程序,结果如下
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0 NC_002549.1 6 60 14S65M * 0 CGGACTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAANM:i:0 MD:Z:65 AS:i:65 XS:i:0
#之前没有修改bases的序列比对结果
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
NC_002549.1 0 NC_002549.1 1 60 70M * 0 0 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 AS:i:70 XS:i:0
12.1切片操作看特定列
为了查看特定列,可以进行cut操作,要深入理解每列的意义
query id,alignment flag and target id
cat results.sam |cut -f 1,2,3
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa
NC_002549.1 0 NC_002549.1
比对的start,mapping quality,CIGAR string
cat results.sam |cut -f 4,5,6
VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa
1 60 70M
paired end read information,name ,position and length of template
cat results.sam |cut -f 7,8,9
* 0 0
query sequence, query quality, other attributes
cat results.sam |cut -f 10-14
CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 AS:i:70
12.2 同时比对两个序列
在这里下载示例数据
curl -O http://www.personal.psu.edu/iua1/courses/files/2014/data-14.tar.gz
tar xzvf data-14.tar.gz
会产生两个名字为read1.fq和read2.fq的文件
用bwa比对
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam
部分结果如下
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa read1.fq read2.fq
gi|10313991|ref|NC_002549.1|_17166_17748_1:1:0_0:0:0_0 83 NC_002549.1 17679 60 70M = 17166 -583 AATTCAACGAAGCCCATACTGGCTAAGTCATTTAACTCAGTATGCTGACTGTGAGTTACATTTAAGTTAT 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:0 MD:Z:70 MC:Z:70M AS:i:70 XS:i:0
gi|10313991|ref|NC_002549.1|_17166_17748_1:1:0_0:0:0_0 163 NC_002549.1 17166 60 70M = 17679 583 CAGTCTAGAGTCAGAAATAGTATCAGGAATGACTACTCCTAGGATGCTTCTACCCGTTATGTCAAAATTC 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:2 MD:Z:4A49T15 MC:Z:70M AS:i:6XS:i:0
gi|10313991|ref|NC_002549.1|_4818_5334_1:0:0_3:0:0_1 99 NC_002549.1 4818 60 70M = 5265 517 GCCATCATGCTTGCTTCATACACTATCACCCAATTCGGCAAGGCAACCAATCCACTTGTCAGAGTCAATC 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:1 MD:Z:32T37 MC:Z:70M AS:i:6XS:i:0
gi|10313991|ref|NC_002549.1|_4818_5334_1:0:0_3:0:0_1 147 NC_002549.1 5265 60 70M = 4818 -517 GTGCCAGAAACTCTGGTCCTCAAGCTGACCGGTAAGAAGGTGACTTCTAAAATTCGACAACCAATCATCC 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 MD:Z:19A32A1G15 MC:Z:70M AS:i:5XS:i:0
gi|10313991|ref|NC_002549.1|_8927_9349_3:0:0_3:0:0_2 83 NC_002549.1 9280 60 70M = 8927 -423 GGTTCAGGGTTAAGAACATTGGTTCCTCAATCAGATAATGAGGAAGCTTCAACCAAACCGGGGAAATGCT 2222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 MD:Z:1T54C7C5 MC:Z:70M AS:i:5XS:i:0
为了简单明了表示,把上述的qurey加T后一起运行
cp query.fa query_addT.fa
vi query_addT.fa
bwa mem ~/refs/852/ebola-1999.fa query.fa query_addT.fa> xresults.sam
结果如下
@SQ SN:NC_002549.1 LN:18959
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem /Users/ucco/refs/852/ebola-1999.fa query.fa query_addT.fa
NC_002549.1 65 NC_002549.1 1 60 70M = 6 6 CGGACACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:70 MC:Z:15S65M AS:i:70 XS:i:0
NC_002549.1 129 NC_002549.1 6 60 15S65M = 1 -6 CGGACTTTTTTTTTTACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAA * NM:i:0 MD:Z:65 MC:Z:70M AS:i:65 XS:i:0
13 BAM 文件和samtools
安装短的read simulator:wgsim
cd ~/src
git clone https://github.com/lh3/wgsim.git
cd wgsim
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
ln -s ~/src/wgsim/wgsim ~/bin/wgsim
#Check that it works
wgsim
Program: wgsim (short read simulator)
Version: 1.9
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>
Options: -e FLOAT base error rate [0.020]
-d INT outer distance between the two ends [500]
-s INT standard deviation [50]
-N INT number of read pairs [1000000]
-1 INT length of the first read [70]
-2 INT length of the second read [70]
-r FLOAT rate of mutations [0.0010]
-R FLOAT fraction of indels [0.15]
-X FLOAT probability an indel is extended [0.30]
-S INT seed for random generator [0, use the current time]
-A FLOAT discard if the fraction of ambiguous bases higher than FLOAT [0.05]
-h haplotype mode
下载并安装samtools包,并且链接到~/bin 目录
cd ~/src
curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.9/samtools-1.1.tar.bz2
tar jxvf samtools-1.9.tar.bz2
cd samtools-1.9
make
ln -s ~/src/samtools-1.9/samtools ~/bin/
现在从参考基因组生成短reads,然后再map到这个基因组上
mkdir ~/edu/lec15
cd ~/edu/lec15
wgsim -N 10000 ~/refs/852/ebola-1999.fa read1.fq read2.fq > mutations.txt
[wgsim] seed = 1561981271
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 18959
比对
bwa mem ~/refs/852/ebola-1999.fa read1.fq read2.fq > results.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 20000 sequences (1400000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 9950, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (466, 500, 534)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (330, 670)
[M::mem_pestat] mean and std.dev: (499.84, 50.02)
[M::mem_pestat] low and high boundaries for proper pairs: (262, 738)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 20000 reads in 0.537 CPU sec, 0.539 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem /Users/ucco/refs/852/ebola-1999.fa read1.fq read2.fq
[main] Real time: 1.111 sec; CPU: 0.606 sec
转变为BAM文件
samtools view -Sb results.sam > temp.bam
# Sort the alignment.
samtools sort temp.bam -oresults.bam
#index the aligment
samtools index results.bam
用samtools进行过滤
详细用法参考samtools常用命令详解
-f match the flag:保留match flags的reads
-F filter the flags:删除match flags的reads,保留剩余部分
首先,保留比对到reverse链上的reads
samtools view -f 16 results.bam
然后,删除比对到forward链上的reads
samtools view -F 16 results.bam
对flag counts进行计数,下面两个命令都可以
samtools view -F 16 results.bam |wc -l
samtools view -c -F 16 results.bam
10000
关于samtools flag代表的意义请参考
samtools flag
根据质量进行过滤。如果一个reads可以map到多个位点,那么质量为0,否则>=1代表唯一匹配
# uniquely mapping reads.
samtools view -c -q 1 results.bam
# Count the high quality alignment.
samtools view -c -q 40 results.bam
14 用ReadSeq转换数据
14.1安装ReadSeq
mkdir -p readseq
cd ~/src/readseq
curl -OL http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq.jar
我打不开这个网站,然后google下载的.
因为readseq调用的方式和前面安装的trimmomatic相似,所以cp
cp ~/bin/trimmomatic ~/bin/readseq
vi ~/bin/readseq
改成下图即可
java -jar ~/src/readseq/readseq.jar $@
vi_readseq
14.2获取1999 ebola基因组的full genbank 记录
http://www.ncbi.nlm.nih.gov/genome/4887
进入lec文件夹,随便你以前用其他名字命名的
获取complete data
efetch -db nucleotide -id NC_002549.1 -format gb > NC.gb
格式为GFF(Generic Feature Format)文件
readseq -format=GFF NC.gb
# 也可以设定输出名字
readseq -format=GFF -o NC-all.gff NC.gb
会有以下几个文件
4files
提取到fasta 文件
readseq -format=FASTA -o NC.fa NC.gb
先查看一下gff文件
cat NC-all.gff |head
##gff-version 2
# seqname source feature start end score strand frame attributes
NC_002549 - source 1 18959 . + . organism "Zaire ebolavirus" ; mol_type "viral cRNA" ; isolate "Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga" ; db_xref "taxon:186538"
NC_002549 - 5'UTR 1 55 . + . note "leader region" ; citation "[1]" ; function "regulation or initiation of RNA replication"
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - mRNA 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; product "nucleoprotein" ; db_xref "GeneID:911830"
NC_002549 - regulatory 56 67 . + . regulatory_class "other" ; gene "NP" ; locus_tag "ZEBOVgp1" ; note "putative; transcription start signal" ; citation "[1]"
NC_002549 - CDS 470 2689 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; function "encapsidation of genomic RNA" ; codon_start 1 ; product "nucleoprotein" ; protein_id "NP_066243.1" ; db_xref "GeneID:911830" ; translation "MDSRPQKIWMAPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAFEAGVDFQESADSFLLMLCLHHAYQGDYKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEELLPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHAEQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQARFSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLLNLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAESRELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGHYDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAPDDLVLFDLDEDDEDTKPVPNRSTKGGQQKNSQKGQHIEGRQTQSRPIQNVPGPHRTIHHASAPLTDNDRRNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDGTSNRTPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQSEHSFEEMYRHILRSQGPFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQFYWPVMNHKNKFMAILQHHQ"
NC_002549 - regulatory 3015 3026 . + . regulatory_class "polyA_signal_sequence" ; gene "NP" ; locus_tag "ZEBOVgp1"
NC_002549 - misc_feature 3027 3031 . + . note "intergenic region"
只获取gff文件中的gene rows
cat NC-all.gff |egrep '\tgene\t'
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - gene 3032 4407 . + . gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549 - gene 4390 5894 . + . gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549 - gene 5900 8305 . + . gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549 - gene 8288 9740 . + . gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549 - gene 9885 11518 . + . gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549 - gene 11501 18282 . + . gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
实际上我们还想要前两行的注释行
cat NC-all.gff |head -2 > NC-genes.gff
cat NC-all.gff | egrep '\tgene\t' >> NC-genes.gff
cat NC-genes.gff
##gff-version 2
# seqname source feature start end score strand frame attributes
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - gene 3032 4407 . + . gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549 - gene 4390 5894 . + . gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549 - gene 5900 8305 . + . gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549 - gene 8288 9740 . + . gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549 - gene 9885 11518 . + . gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549 - gene 11501 18282 . + . gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
NC_002549 - gene 56 3026 . + . gene "NP" ; locus_tag "ZEBOVgp1" ; db_xref "GeneID:911830"
NC_002549 - gene 3032 4407 . + . gene "VP35" ; locus_tag "ZEBOVgp2" ; db_xref "GeneID:911827"
NC_002549 - gene 4390 5894 . + . gene "VP40" ; locus_tag "ZEBOVgp3" ; db_xref "GeneID:911825"
NC_002549 - gene 5900 8305 . + . gene "GP" ; locus_tag "ZEBOVgp4" ; db_xref "GeneID:911829"
NC_002549 - gene 8288 9740 . + . gene "VP30" ; locus_tag "ZEBOVgp5" ; db_xref "GeneID:911826"
NC_002549 - gene 9885 11518 . + . gene "VP24" ; locus_tag "ZEBOVgp6" ; note "putative" ; db_xref "GeneID:911828"
NC_002549 - gene 11501 18282 . + . gene "L" ; locus_tag "ZEBOVgp7" ; db_xref "GeneID:911824"
染色体坐标不匹配,需要修复
重新索引并且重新比对
cp NC.fa ~/refs/852/
bwa index ~/refs/852/NC.fa
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.01 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index /Users/ucco/refs/852/NC.fa
[main] Real time: 0.113 sec; CPU: 0.015 sec
再align
注意我的命名有点乱,最好去完全按原文中的命名
下面的align.sh我没找到
但是aligh就是比对索引那些一起写到一个脚本里了,改天写下来。
cp ../lec4/*.fq .
bash align.sh read1.fq read2.fq results.bam
载入IGV,看100-150bp区域的深度
网友评论