reads 1 和 reads 2 去掉adapter,程序一样,且from SJ
perl read1_2_filter_adapter.pl $s.1.clean.fq.gz $s.1.adapter.fq.gz
reads2 去掉前6bp和12bp后,还要剩余20bp;reads1去掉前12bp和后6bp,还要剩余20bp。
perl rm_firstx_leny.pl $s.1.adapter.fq.gz $s.2.filter.fq.gz 6 20 12
- 想了一下,去掉前6bp后12bp或者前12bp后6bp的策略是合理的,因为基因组motif显示cgcgcgg中后5bpmapping更好一些,所以去掉前4+2bp,由于片段很短,基本上reads1和reads2是重合的,所以有必要前后都要去掉一定的碱基。但是对于组织样本,是否有必要去掉一定的碱基呢,如果组织片段大,那么就没有必要了吧。对组织的测序为什么不一开始就用RRBS呢?哦,对,因为RRBS和MCTA-Seq覆盖位点有差异。我觉得应该衡量一下切胶片段大小对结果的影响,同一个组织切不同长短,上机测序,看看结果有没有异同,如果相关性很好,那么就没有必要纠结长短了;如果切长一些更好,那么组织的去掉前xbp后xbp的策略可能需要修改
去掉reads中含有3个或以上CpH的序列
perl ch3deleate.pl $2.2.filter.fq.gz $s.2.nonCG.fq.gz
- perl很简单,reads匹配三个或三个以上CA/CT/CC就删掉
- 其实这步没有将CH去完,只去掉序列中的CH;noch_arra去掉了比对到reference的CH
bismark比对
noch_arra
STEP6_cgcgma_pcr
**(1)在xx范围内含有至少2个CG;(2)reads1 barcode区域的质量值控制;(3)barcode一样的reads仅保留一个 **
(1)在xx范围内含有至少2个CG
- 目的:筛选在-3~+3bp(含3bp)(mapping位点为0)中至少有2个CG的序列
- 输入文件:$s.temp2
- 结果文件:$s.5bp.cgcgmat.gz
perl extractCGx2.pl $s.temp2 hg19.fa $s.5bp.cgcgmat.gz
$samtools view $s.2.nonCG.fq.gz_bismark.sort.bam > $s.5temp2
(2)reads1 barcode区域的质量值控制
- 目的:barcode区域中每个碱基的质量值大于等于17
- 测序得到第一链如下,H至少有5个,所以将前5个H作为barcode,作为barcode的5个碱基质量值要大于等于17
TTTCCCTACACGACGCTCTTCCGATCTHHHHHHHHCGCH
TTTCCCTACACGACGCTCTTCCGATCTHHHHHHHCGHCH
TTTCCCTACACGACGCTCTTCCGATCTHHHHHHCGHHCH
TTTCCCTACACGACGCTCTTCCGATCTHHHHHCGHHHCH
perl qc5bp.pl $s.clean.1.fq.gz 5 $s.5qc.gz
(3)、barcode一样的reads仅保留一个
- 目的:如果read1没有比对上基因组,则只考虑reads1的barcode以及reads2的起始位置,一致,则认为是同一条reads。若read1和read2的比对信息都有,则考虑reads1的barcode、reads1的起始位置以及reads2的起始位置,一致,则认为是同一条reads
- 输入文件:$s.5temp1
- 输出文件:$s.5bp.cgcgmat.rmd.gz
perl rmduppcrv2.pl \$s.5qc.gz \$s.5bp.cgcgmat.gz 5 \$s.5temp1 \$s.5bp.cgcgmat.rmd.gz
用$s.5bp.cgcgmat.gz得到noch_arra
重复1.5、noch_arra 过程,输入文件为$s.5bp.cgcgmat.gz
step10 combine
perl combine.pl Pcrc1_m Pcrc2_m Pcrc3_m Pcrc1_Pcrc3 6
CMP部分文件说明
生成文件说明
s07.Pcgibed:$s.5bp.cgcgmat.rmd.gz s07.Tcgibed:$s.5bp.cgcgmat.gz T- s07.Tcgibed/$s.cgcgmat P- s07.Pcgibed/$s.cgcgmat.qc uPnorm- s07.Pcgibed/$s.cgcgmat.qc*(qc-dCGI-$s/rmd-dCGI-$s)
uP-xxx:对于血浆样本,在MePM基础上乘以 qc-dCGI-$s/rmd-dCGI-$s
P和T的差别在于P算的UMI,T算的MePM
cal文件内容说明
fmg9_m.clean:
clean reads 条数
fmg9_m.qc5.clean:
对clean reads再次做前5bp的qc后的reads
fmg9_m.filter:去掉前6bp后12bp后剩余的reads数
fmg9_m.rmfilter:去掉含有3个及以上nonCG的reads
fmg9_m.unique_mapping:bismark mapping到基因组
fmg9_m.cgcgmat:在-3~+3bp(含3bp)(mapping位点为0)中至少有2个CG
fmg9_m.cgcgmat.qc:在fmg9_m.cgcgmat的基础上,reads的前5bp做过qc
fmg9_m.cgcgmat.qc.rmd:用UMI去掉PCR重复序列
P-CGI-fmg9_m:用UMI去掉PCR重复后落在CGI中的序列
T-CGI-fmg9_m:不用UMI去掉PCR重复后落在CGI中的序列
qc-dCGI-fmg9_m:qc前5bp,落在dCGI区域中的reads(dCGI有3024个,分别是什么
呢?)
qc-dCGI2-fmg9_m:qc前5bp,落在dCGI2区域中的reads(dCGI2有9513个,分别是
什么呢?)
rmd-dCGI-fmg9_m:qc前5bp,去掉UMI,落在dCGI中的reads
rmd-dCGI2-fmg9_m:qc前5bp,去掉UMI,落在dCGI2中的reads
filter2_nonCGfmg9_m:1-("total methylated C in CHG"+"total methylated
C in CHH" )/("total methylated C in CHG"+"total methylated C in
CHH" +"total C to T conversions in CHG context"+"Total C to T
conversions in CHH context"))完成filter而未去掉含3个及以上nonCG的第二端序列然后bismark比对结果
filter1_nonCGfmg9_m :同上,第一端序列
CMP改进流程
单位点C和C+T衡量甲基化程度
不光用MePM衡量甲基化程度,还用测到reads含有的甲基化位点的C/C+T来衡量。
一个小问题:是否应该同时考虑reads1和reads2的信息,为了解决这个问题,
应该计算reads1和reads2重叠区域是否很多,如果基本上重复,那么reads1和
reads2的信息是一致的,仅需要考虑一条reads即可,如果reads1和reads2重叠
区域少,那么应该同时考虑两条reads的情况。这样计算有点麻烦,因为不能分别
把reads1和reads2的C加起来,C+T加起来,然后C/C+T,原因是这样会导致重叠
区域权重增大,应该是上述的C-重叠区域C,上述C+T-重叠区域C+T,然后C/C+T
才是真的甲基化程度。所以我觉得考虑一条reads足以。
计算单个cgcgcgg
正链序列(起始点)落在正链cgcgcgg上,负链序列(起始点)落在负链cgcgcgg上
sh CGIcgcgcggv2.sh Pn1 Pn
问:为什么要对单碱基数据也做normalise?
答:文老师发现一个肝癌数据中C特别高,但是癌症程度并不算太高,而是由于测序深度太深造成的。那么如果只关注C的绝对值,测序越深,C的绝对值就会越高。如果测饱和了(每个阳性位点都测到了),C的绝对值不会因为测序而升高(去掉PCR duplicate后),没有测饱和的时候,用绝对值计算是要受到测序深度影响的。另外,两个病人释放不同量的ccfDNA,而其中癌症相关的都是一条,因为取血量一样,都是5ml,那么癌症相关DNA浓度一样,但是测序得到的结果(同样测序深度)就不一样了,解决办法:饱和程度。测饱和可以解决以上两个问题。
问:为什么不用CGI-qc/CGI-rmd作为duplication rate,既然T和upnor的方法本质上是一样的,upnor的优势是什么?
答:upnor的duplication rate是一样的,而去重的时候,不可能每个位点去掉重复的比例一致,只要是乘以一个固定的duplication rate,T带来的随机性就被去掉了,至于能否用CGI-qc/rmd-CGI作为duplication rate,也要筛选那些低拷贝的地方吧,不筛选得到的duplication rate,高拷贝的地方占权重会大。
1. read1_2_filter_adapter.pl
1 #! /usr/bin/perl -w
2 use strict;
3
4 open IN,"zcat $ARGV[0] |" or die $!;
5 open OUT,"|gzip > $ARGV[1]" or die $!;
6
7 while(<IN>)
8 {
9 chomp;
10 my $line1=$_;
11 chomp(my $line2=<IN>);
12 #<IN>;
13 chomp(my $line3=<IN>);
14 chomp(my $line4=<IN>);
15 if($line2 =~ /^(\w+)AGATCGGAAGAGC/)
16 {
17 #$line2= s/AGATCGGAAGAGCAC//;
18 #my $len = length($1)
19 if(length($1)>2)
20 {
21 my $len = length($1);
22 my $new_line2=substr($line2,0,$len);
23 #my $line2_length= length $line2;
24 my $new_line4=substr($line4,0,$len);
25 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
26 }
27 }
28 elsif($line2 =~ /^AGATCGGAAGAGCAC/)
29 {
30 next;
31 }
32 #elsif($line2 =~ /^(\w+)AGATCGGAAGAGCA$/)
33 #{
34 # #$line2= s/AGATCGGAAGAGCA$//;
35 # my $len = length($1);
36 # my $new_line2=substr($line2,0,$len);
37 # my $new_line1=substr($line1,0,$len);
38 ## print OUT "$new_line1\n$new_line2\n\n";
39 #}
40 elsif($line2 =~ /^(\w+)AGATCGGAAGAGC$/)
41 {
42 #$line2= s/AGATCGGAAGAGC$//;
43 my $len = length($1);
44 my $new_line2=substr($line2,0,$len);
45 my $new_line4=substr($line4,0,$len);
46 # print OUT "$new_line1\n$new_line2\n\n";
47 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
48 }
49 elsif($line2 =~ /^(\w+)AGATCGGAAGAG$/)
50 {
51 #$line2= s/AGATCGGAAGAG$//;
52 my $len = length($1);
53 my $new_line2=substr($line2,0,$len);
54 my $new_line4=substr($line4,0,$len);
55 #print OUT "$new_line1\n$new_line2\n\n";
56 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
57 }
58 elsif($line2 =~ /^(\w+)AGATCGGAAGA$/)
59 {
60 #$line2= s/AGATCGGAAGA$//;
61 my $len = length($1);
62 my $new_line2=substr($line2,0,$len);
63 my $new_line4=substr($line4,0,$len);
64 #print OUT "$new_line1\n$new_line2\n\n";
65 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
66 }
67 elsif($line2 =~ /^(\w+)AGATCGGAAG$/)
68 {
69 #$line2= s/AGATCGGAAG$//;
70 my $len = length($1);
71 my $new_line2=substr($line2,0,$len);
72 my $new_line4=substr($line4,0,$len);
73 # print OUT "$new_line1\n$new_line2\n\n";
74 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
75 }
76 elsif($line2 =~ /^(\w+)AGATCGGAA$/)
77 {
78 #$line2= s/AGATCGGAA$//;
79 my $len = length($1);
80 my $new_line2=substr($line2,0,$len);
81 my $new_line4=substr($line4,0,$len);
82 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
83 # print OUT "$new_line1\n$new_line2\n\n";
84 }
85 elsif($line2 =~ /^(\w+)AGATCGGA$/)
86 {
87 #$line2= s/AGATCGGA$//;
88 #my $line2_length= length $line2;
89 my $len = length($1);
90 my $new_line2=substr($line2,0,$len);
91 my $new_line4=substr($line4,0,$len);
92 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
93 # print OUT "$new_line1\n$new_line2\n\n";
94 }
95 elsif($line2 =~ /^(\w+)AGATCGG$/)
96 {
97 #$line2= s/AGATCGG$//;
98 #my $line2_length= length $line2;
99 my $len = length($1);
100 my $new_line2=substr($line2,0,$len);
101 my $new_line4=substr($line4,0,$len);
102 # print OUT "$new_line1\n$new_line2\n\n";
103 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
104 }
105 elsif($line2 =~ /^(\w+)AGATCG$/)
106 {
107 #$line2= s/AGATCG$//;
108 #my $line2_length= length $line2;
109 my $len = length($1);
110 my $new_line2=substr($line2,0,$len);
111 my $new_line4=substr($line4,0,$len);
112 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
113 # print OUT "$new_line1\n$new_line2\n\n";
114 }
115 elsif($line2 =~ /^(\w+)AGATC$/)
116 {
117 #$line2= s/AGATC$//;
118 #my $line2_length= length $line2;
119 my $len = length($1);
120 my $new_line2=substr($line2,0,$len);
121 my $new_line4=substr($line4,0,$len);
122 # print OUT "$new_line1\n$new_line2\n\n";
123 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
124 }
125 elsif($line2 =~ /^(\w+)AGAT$/)
126
127 {
128 #$line2= s/AGAT$//;
129 #my $line2_length= length $line2;
130 my $len = length($1);
131 my $new_line2=substr($line2,0,$len);
132 my $new_line4=substr($line4,0,$len);
133 # print OUT "$new_line1\n$new_line2\n\n";
134 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
135 }
136 elsif($line2 =~ /^(\w+)AGA$/)
137 {
138 #$line2= s/AGA$//;
139 #my $line2_length= length $line2;
140 my $len = length($1);
141 my $new_line2=substr($line2,0,$len);
142 my $new_line4=substr($line4,0,$len);
143 # print OUT "$new_line1\n$new_line2\n\n";
144 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
145 }
146 elsif($line2 =~ /^(\w+)AG$/) #以AG结尾则去掉AG
147 {
148 #$line2= s/AG$//;
149 #my $line2_length= length $line2;
150 my $len = length($1);
151 my $new_line2=substr($line2,0,$len);
152 my $new_line4=substr($line4,0,$len);
153 print OUT "$line1\n$new_line2\n$line3\n$new_line4\n";
154 # print OUT "$new_line1\n$new_line2\n\n";
155 }
156 else
157 {
158 print OUT "$line1\n$line2\n$line3\n$line4\n";
159 }
160 }
161 close IN;
162 close OUT;
2.rm_firstx_leny.pl
#! /usrt/bin/perl -w
use strict;
# remove forward xbp, backward zbp, remain more than ybp
my $usage = "perl $0 <IN1> <OUT> x y z";
die $usage unless @ARGV==5;
open IN,"zcat $ARGV[0] |" or die $!;
open OUT,"| gzip > $ARGV[1]" or die $!;
while(<IN>)
{
chomp;
my $line1=$_;
chomp(my $line2=<IN>);
chomp(my $line3=<IN>);
chomp(my $line4=<IN>);
my $len_2=length($line2);
if($len_2 >=$ARGV[2]+$ARGV[4]+$ARGV[3]) #大于或等于6+12+20(38)
{
my $second=substr($line2,$ARGV[2],$len_2-$ARGV[2]-$ARGV[4]);
my $line4_1=substr($line4,$ARGV[2],$len_2-$ARGV[2]-$ARGV[4]);
my $len=length($second);
#my $third=substr($second,0,$len_2-24);
#my $line4_2=substr($line4_1,0,$len_2-24);
print OUT "$line1\n$second\n$line3\n$line4_1\n";
}
else
{
next;
}
}
close IN;
close OUT;
3.ch3deleate.pl
1 #!/usr/bin/perl
2 use strict;
3 use warnings;
4
5 my ($line1,$line2,$line3,$line4,$count);
6 open IN,"zcat $ARGV[0]|" or die $!;
7 open OUT,"| gzip > $ARGV[1]" or die $!;
8 while (<IN>)
9 {
10 $count=0;
11 chomp;
12 $line1 = $_;
13 chomp($line2=<IN>);
14 chomp($line3=<IN>);
15 chomp($line4=<IN>);
16 while ($line2 =~m/CA/g) #从建库方法来看,第二链不会有TG这种情况
17 {
18 $count++;
19 }
20 while ($line2 =~m/CT/g)
21 {
22 $count++;
23 }
24 while ($line2 =~m/CC/g)
25 {
26 $count++;
27 }
28 next if ($count >= 3);
29 print OUT "$line1\n$line2\n$line3\n$line4\n";
30 }
31 close IN;
4.s05.noch_arrange
109 out=\$dir/\${f}_\${b}trim/s05.noch_arrange/\$s
110 for i in {1..22} X Y
111 do
112 c=chr\$i
113 ff=\$f
114 bb=\$b
115 sbam2=\$dir/\${ff}_\${bb}trim/s04.align/\$s/\$s.2.nonCG.fq.gz_bismark.sort.bam
116 log=\$dir/\${ff}_\${bb}trim/s05.noch_arrange/\$s/\$c.log
117 outout=\$dir/\${ff}_\${bb}trim/s05.noch_arrange/\$s
118
119 touch \$shell/\$c.ing... && \\
120 \$samtools view -h \$sbam2 \$c\\
121 |perl -ne 'print \$_ and next if /^@/;my \$met = \$1 if/XM:Z:(\S+)/;next if (\$met =~tr/HX/HX/)>=3;print \ $_'\\
122 |perl \$dirperl/samStat.pl /dev/stdin \$log \\
123 |\$samtools view -uSb /dev/stdin \\
124 |\$samtools mpileup -f \$database/hg19/DNA_fa/\$c.fa /dev/stdin \\
125 |perl \$dirperl/covStrBed.pl /dev/stdin \$outout 1 \\
126 |perl \$dirperl/covStrBed.pl /dev/stdin \$outout 3 \\
127 |perl \$dirperl/modCytSplt.pl \$database/hg19/dbSNP/human_build_137/vcf_by_chromosome/CHB/processed_data/\ $c.gz \$database/hg19/DNA_fa/\$c.fa \\
128 /dev/stdin \$outout/\$c. >/dev/null && \\
129 mv \$shell/\$c.ing... \$shell/\$c.ok
130 done
131
132 for i in {1..22} X Y
133 do
134 zcat \$out/chr\$i.CG.gz|awk '{if (\$9==\".\")print \$1\"\\t\"\$2-1\"\\t\"\$2\"\\t\"\$3\"\\t\"\$4\"\\t\"\$5 \"\\t\"\$6}' - > \$out/chr\$i.CG
135 done
136 cat \$out/chr*.CG > \$out/all.CG
137
138 for i in {1..22} X Y
139 do
140 zcat \$out/chr\$i.CHG.gz|awk '{if (\$9==\".\")print \$1\"\\t\"\$2-1\"\\t\"\$2\"\\t\"\$3\"\\t\"\$4\"\\t\"\$ 5\"\\t\"\$6}' - > \$out/chr\$i.CHG
141 zcat \$out/chr\$i.CHH.gz|awk '{if (\$9==\".\")print \$1\"\\t\"\$2-1\"\\t\"\$2\"\\t\"\$3\"\\t\"\$4\"\\t\"\$ 5\"\\t\"\$6}' - > \$out/chr\$i.CHH
142 perl \$dirperl/sorcharacter.pl \$out/chr\$i.CHG > \$out/chr\$i.CHG.sort
143 perl \$dirperl/sorcharacter.pl \$out/chr\$i.CHH > \$out/chr\$i.CHH.sort
144 done
145 cat \$out/chr*.CHG.sort \$out/chr*.CHH.sort > \$out/all.nonCG
146 rm \$out/chr*
147 rm \$dir/\${f}_\${b}trim/s05.noch_arrange/\${s}chr*.cov*.bed
148 echo s05done
5.extractCGx2.pl
1 #!/usr/bin/perl
2 use strict;
3 use warnings;
4
5 my $usage = "perl $0 <IN1> <IN2> <OUT>
6 <IN1>:bam file;reads_name flag chr1 pos 255 74M * 0 0 TTTTTTTTTTTTTTT ############## ...(not imp ortant)
7 <IN2>:fasta(hg19)
8 <OUT>:bam file;reads_name flag chr1 pos 255 74M * 0 0 TTTTTTTTTTTTTTT ############## ...(not imp ortant)(delete some reads)";
9 die $usage unless @ARGV==3;
10 my ($id,$seq,%hash2,$len,$end1,$str1,$i);
11 open OUT,"|gzip > $ARGV[2]" or die $!;
12 open IN1,"<$ARGV[1]" or die $!;
13 $/=">";
14 while (<IN1>)
15 {
16 if ($_=~/(chr.*?)\s(.*)/ms)
17 {
18 $id = $1;
19 $seq = $2;
20 $seq =~s/\s//g;
21 $hash2{$id} = $seq;
22 }
23 }
24 close IN1;
25
26 $/="\n";
27 open IN2,"<$ARGV[0]" or die $!;
28 while (<IN2>)
29 {
30 $i=0;
31 chomp;
32 my @a = split;
33 if ($a[1] == 16 )
34 {
35 $len = length($a[10]);
36 $end1 = $a[3]+$len+2;
37 $str1 = $end1-7;
38 while (substr($hash2{$a[2]},$str1,7)=~/GC/ig)
39 {
40 $i++;
41 }
42 }
43 if ($a[1]==0)
44 {
45 $str1 = $a[3]-1-3;
46 while (substr($hash2{$a[2]},$str1,7)=~/CG/ig)
47 {
48 $i++;
49 }
50 }
51 if ($i>=2)
52 {
53 print OUT "$_\n";
54 }
55 }
56 close IN2;
6.qc5bp.pl
1 #!/usr/bin/perl
2 use strict;
3 use warnings;
4
5 my $usage = "perl $0 <IN1> <IN2> <OUT>
6 <IN1>:clean_data
7 <IN2>:5 (or 10)
8 <OUT>:clean_data_5bpqc,if 5bp or 10bp phred score >= 20,retain";
9 die $usage unless @ARGV==3;
10
11 my ($asc,$i,$j);
12 open IN1,"zcat $ARGV[0]|" or die $!;
13 open OUT,"|gzip > $ARGV[2]" or die $!;
14 while (<IN1>)
15 {
16 chomp;
17 my $line1 = $_;
18 chomp(my $line2=<IN1>);
19 chomp(my $line3=<IN1>);
20 chomp(my $line4=<IN1>);
21 if ($line1=~/@(.*?)\s/)
22 {
23 $j = 0;
24 for ($i=1;$i<=$ARGV[1];$i++)
25 {
26 my $s = substr($line4,$i,1);
27 if (ord($s)>=50) #按照phred33系统,50-33=17,大于等于17哦
28 {
29 $j++;
30 }
31 }
32 if ($j == 5 )
33 {
34 print OUT "$line1\n$line2\n$line3\n$line4\n";
35 }
36 }
37 }
38 close IN1;
7.rmduppcrv2.pl
1 #!/usr/bin/perl
2 use strict;
3 use warnings;
4
5 my $usage = "perl $0 <IN1> <IN2> <IN3> <IN4> <OUT>
6 <IN1>:clean_data
7 <IN2>:bam2
8 <IN3>:5bp or 10bp;
9 <IN4>;bam1
10 <OUT>:bam.txt";
11 die $usage unless @ARGV==5;
12 my (%bam,%rmdup,%seq,%pos,%pos2,$i,$end);#$end is the right-most coordinate of bam reads
13 open IN1,"zcat $ARGV[0] |" or die $!;
14 open OUT,"|gzip > $ARGV[4]" or die $!;
15 while (<IN1>) # for barcode quality
16 {
17 chomp;
18 my $line1 = $_;
19 chomp(my $line2=<IN1>);
20 chomp(my $line3=<IN1>);
21 chomp(my $line4=<IN1>);
22 if ($line1=~/@(.*?)\s/)
23 {
24 my $s = substr($line2,0,$ARGV[2]); #extract barcode in reads1
25 my $key = $1;# read name
26 $seq{$key} = $s; # barcode <- read_name1
27 }
28 }
29 close IN1;
30 open IN2,"zcat $ARGV[1] |" or die $!;
31 while (<IN2>)
32 {
33 chomp;
34 my @a = split;
35 if ($_=~/(.*?)_/)
36 {
37 my $key = $1;# read_name2
38 $end=$a[3]+length($a[9]);# end position
39 if ($a[1]==0)
40 {
41 $pos{$key} = "$a[2]\t$a[3]"; #chr&start <- read_name2
42 }
43 if ($a[1]==16)
44 {
45 $pos{$key} = "$a[2]\t$end"; #chr&end <- read_name2
46 }
47 }
48 }
49 close IN2;
50 open IN4,"<$ARGV[3]" or die $!;
51 while (<IN4>)
52 {
53 chomp;
54 my @a = split;
55 if ($_=~/(.*?)_/)
56 {
57 my $key = $1;# read_name1
58 $end=$a[3]+length($a[9]);# end position
59 if ($a[1]==0)
60 {
61 $pos2{$key} = "$a[2]\t$a[3]"; #chr&str <- read_name1
62 }
63 if ($a[1]==16)
64 {
65 $pos2{$key} = "$a[2]\t$end"; #chr&end <- read_name1
66 }
67 }
68 }
69
70 foreach my $key (keys %pos) #number of keys of %seq is more than that of %pos, for bamfile only contains t he reads that mapped on the genome;as it is sorted by %pos's key, if there is a seqid that can only be fou nd in %pos, but not in %seq, this id will be printed in final results, and there will be an error reported . But this situation won't happen, for %seq contains id in %pos.
71 {
72 next if (not exists $seq{$key}); # 5qc clean data.Discard reads with low quality
73 if (not exists $pos2{$key})
74 {
75 $i=0;
76 my $item = "$seq{$key}\t$pos{$key}\t$i"; #barcode,read2_chromosome,read2_start_coordinate ,0 。如果read1没有比对信息,则只考虑reads1的barcode以及reads2的起始位置,一致,则认为是同一条reads
77 $rmdup{$item} = $key; #read name <- barcode,read2_chromosome,read2_start_coordinate
78 }
79 else
80 {
81 my $item = "$seq{$key}\t$pos{$key}\t$pos2{$key}"; #barcode,read2_chromosome,read2_start_co ordinate,read1_chromosome,read1_start_coordinate 。若read1和read2的比对信息都有,则考虑reads1的barcode、reads1的起始位置以及reads2的起始位置,一致,则认为是同一条reads
82 $rmdup{$item} = $key; #read2name <- barcode&chr&str;key is xbp seq and position of the seq id without pcr duplication, value is the seqid without pcr duplication;
83 }
84 }
85 undef %seq;
86 undef %pos;
87 undef %pos2;
88
89 open IN2,"zcat $ARGV[1]|" or die $!; # read bam file again, print reads without pcr duplication in bam.txt format;
90 while (<IN2>)
91 {
92 chomp;
93 my @a = split;
94 if ($_=~/(.*?)_/)
95 {
96 my $item = $1;
97 $bam{$item} = $_; # put bamfile into a hash, key is the seqid, value is the whole informat ion from bam file;
98 }
99 }
100 foreach my $id (values %rmdup)
101 {
102 print OUT "$bam{$id}\n";
103 }
104 close IN2;
8、CGIcgcgcggv2.sh
1 export PATH=/data/bin/:$PATH
2 i=ss
3 s=$1 #Pnlc1
4 f=6
5 b=12
6 type=$2 #gac
7
8 shell=/WPSnew/liuxiaomeng/shell/cmp_pipeline/CGIcgcgcgg/shell2/$type
9 out=/WPSnew/liuxiaomeng/project/cmp/analysis/7.additional_pipeline/2.CGIcgcg cgg/singlesample/$s
10 txt=/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline/${f}_${b}trim /s06.cgcgma_pcr/$s.5bp.cgcgmat.rmd.gz
11 perlsup=/WPSnew/liuxiaomeng/perl/sorcharacter.pl
12 #bedtemp=$out/temp.$s.${b}bp.cgcgmat.rm.bed
13 Pbedres=$out/P_$s.${b}bp.CGI.bed
14 Ppositivebedres=$out/temppositiveP_$s.${b}bp.CGI.bed
15 Pnegativebedres=$out/tempnegativeP_$s.${b}bp.CGI.bed
16 intersectBed=/WPSnew/liuxiaomeng/software/bedtools-2.17.0/bin/intersectBed
17 CGIcgcgcgg=/WPSnew/liuxiaomeng/Database/hg19/CGI/CGIcgcgcgg.bed
18 CGIcgcgcggpositive=/WPSnew/liuxiaomeng/Database/hg19/CGI/CGIcgcgcggpositive. bed
19 CGIcgcgcggnegative=/WPSnew/liuxiaomeng/Database/hg19/CGI/CGIcgcgcggnegative. bed
20 bedtemppositive=$out/temppositive.$s.${b}bp.cgcgmat.rm.bed
21 bedtempnegative=$out/tempnegative.$s.${b}bp.cgcgmat.rm.bed
22
23 mkdir -p $out $shell
24 echo "zcat $txt|awk '{OFS=\"\t\";if(\$2==0) print \$3,\$4-1,\$4,\$1,\$5,\"+\ "}' > $bedtemppositive && \
25 zcat $txt|awk '{OFS=\"\t\";if(\$2==16)print \$3,\$4+length(\$10)-2,\$4+lengt h(\$10)-1,\$1,\$5,\"-\"}' > $bedtempnegative && \
26 $intersectBed -a $CGIcgcgcggpositive -b $bedtemppositive -c > $Ppositivebedr es && \
27 $intersectBed -a $CGIcgcgcggnegative -b $bedtempnegative -c > $Pnegativebedr es && \
28 cat $Ppositivebedres $Pnegativebedres|perl $perlsup - > $Pbedres && \
29 rm $bedtemppositive $bedtempnegative $Ppositivebedres $Pnegativebedres
30
31
32 txt=/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline/${f}_${b}trim /s06.cgcgma_pcr/$s.5bp.cgcgmat.gz
33 Tbedres=$out/T_$s.${b}bp.CGI.bed
34 Tpositivebedres=$out/temppositiveT_$s.${b}bp.CGI.bed
35 Tnegativebedres=$out/tempnegativeT_$s.${b}bp.CGI.bed
36
37 zcat \$txt|awk '{OFS=\"\t\";if(\$2==0) print \$3,\$4-1,\$4,\$1,\$5,\"+\"}' > $bedtemppositive && \
38 zcat \$txt|awk '{OFS=\"\t\";if(\$2==16)print \$3,\$4+length(\$10)-2,\$4+leng th(\$10)-1,\$1,\$5,\"-\"}' > $bedtempnegative && \
39 $intersectBed -a $CGIcgcgcggpositive -b $bedtemppositive -c > \$Tpositivebed res && \
40 $intersectBed -a $CGIcgcgcggnegative -b $bedtempnegative -c > \$Tnegativebed res && \
41 cat \$Tpositivebedres \$Tnegativebedres|perl $perlsup - > \$Tbedres && \
42 rm $bedtemppositive $bedtempnegative \$Tpositivebedres \$Tnegativebedres && \
43 echo "done $s.$b"
44 ############STEP7_Pcgi_distri.sh STEP7_Tcgi_distri.sh END##############
45
46 ############STEP8_Pcalreads.sh##########################################
47 reads=$out/$s.${f}_${b}reads.txt
48 awk -v var=$s 'BEGIN{sum=0}{sum+=\$4}END{printf \"P-CGIcgcgcgg-\"var\"\\t\"s um\"\\t\"}' $Pbedres >> \$reads
49 awk -v var=$s 'BEGIN{sum=0}{sum+=\$4}END{printf \"T-CGIcgcgcgg-\"var\"\\t\"s um\"\\t\"}' \$Tbedres >> \$reads
50 ################################STEP8_Pcalreads.sh END###################### ###############
51
52 ############STEP9_Mepm.sh##############################################
53 perl=/WPSnew/liuxiaomeng/perl/cmp/Mepm.pl
54 perl2=/WPSnew/liuxiaomeng/perl/cmp/upMepm.pl
55 TMepm=$out/T_$s.${b}bp.Mepm.bed
56 PMepm=$out/P_$s.${b}bp.Mepm.bed
57 uPMepm=$out/uPnor_$s.${b}bp.Mepm.bed
58 in3=/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline/cal/$s.${f}_$ {b}reads.txt
59
60 awk '{print \$12}' \$in3|perl \$perl \$Tbedres - > \$TMepm
61 awk '{print \$14}' \$in3|perl \$perl $Pbedres - > \$PMepm
62 perl \$perl2 $Pbedres \$in3 > \$uPMepm
63 ###########STEP9_Mepm.sh END###########################################" > $ shell/$s.sh
64 qsub -o $shell/$s.sh.o -e $shell/$s.sh.e -cwd -l vf=1g -V $shell/$s.sh $s $t ype
9.combine.pl
#!/usr/bin/perl
use strict;
use warnings;
no strict 'refs';
my $usage = "perl $0 IN1 IN2 IN3... OUT
IN1:chr10 100028204 100028508 2
IN2:chr10 100028204 100028508 2
...
INn:6(or 8)
OUT1:chr10 100028204 100028508 2 3 5...";
die $usage unless @ARGV>2;
my ($f,$i,$j,%cgi,$in);
my $dir="/WPSnew/liuxiaomeng/project/cmp/analysis/1.normal_pipeline";
open IN0,"<$dir/$ARGV[-1]_12trim/s07.Tcgibed/$ARGV[0].12bp.CGI.bed" or die $!;
$f=@ARGV-1;
open OUT1,">$dir/$ARGV[-1]_12trim/s08.Tcgibednorm/T_$ARGV[$f-1].cgireads.txt" or die $!;
open OUT2,">$dir/$ARGV[-1]_12trim/s08.Tcgibednorm/T_$ARGV[$f-1].cgiMePM.txt" or die $!;
open OUT3,">$dir/$ARGV[-1]_12trim/s08.Pcgibednorm/P_$ARGV[$f-1].cgireads.txt" or die $!;
open OUT4,">$dir/$ARGV[-1]_12trim/s08.Pcgibednorm/P_$ARGV[$f-1].cgiMePM.txt" or die $!;
open OUT5,">$dir/$ARGV[-1]_12trim/s08.Pcgibednorm/uPnor_$ARGV[$f-1].cgiMePM.txt" or die $!;
while (<IN0>)
{
chomp;
$i++;
my @a = split;
$cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
$j=0;
$in = "IN".$i;
open $in, "<$dir/$ARGV[-1]_12trim/s07.Tcgibed/$ARGV[$i].12bp.CGI.bed" or die $!;
while (<$in>)
{
$j++;
chomp;
my @a = split;
$cgi{$j}=$cgi{$j}."$a[3]\t";
}
}
foreach my $key (sort {$a<=>$b} keys %cgi)
{
print OUT1 "$cgi{$key}\n";
}
###########################################
open IN0,"<$dir/$ARGV[-1]_12trim/s09.mepm/T_$ARGV[0].12bp.Mepm.bed" or die $!;
$f=@ARGV-1;
$i=0;
while (<IN0>)
{
chomp;
$i++;
my @a = split;
$cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
$j=0;
$in = "IN".$i;
open $in, "<$dir/$ARGV[-1]_12trim/s09.mepm/T_$ARGV[$i].12bp.Mepm.bed" or die $!;
while (<$in>)
{
$j++;
chomp;
my @a = split;
$cgi{$j}=$cgi{$j}."$a[3]\t";
}
}
foreach my $key (sort {$a<=>$b} keys %cgi)
{
print OUT2 "$cgi{$key}\n";
}
############################################
open IN0,"<$dir/$ARGV[-1]_12trim/s07.Pcgibed/$ARGV[0].12bp.CGI.bed" or die $!;
$f=@ARGV-1;
$i=0;
while (<IN0>)
{
chomp;
$i++;
my @a = split;
$cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
$j=0;
$in = "IN".$i;
open $in, "<$dir/$ARGV[-1]_12trim/s07.Pcgibed/$ARGV[$i].12bp.CGI.bed" or die $!;
while (<$in>)
{
$j++;
chomp;
my @a = split;
$cgi{$j}=$cgi{$j}."$a[3]\t";
}
}
foreach my $key (sort {$a<=>$b} keys %cgi)
{
print OUT3 "$cgi{$key}\n";
}
###########################
open IN0,"<$dir/$ARGV[-1]_12trim/s09.mepm/P_$ARGV[0].12bp.Mepm.bed" or die $!;
$f=@ARGV-1;
$i=0;
while (<IN0>)
{
chomp;
$i++;
my @a = split;
$cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
$j=0;
$in = "IN".$i;
open $in, "<$dir/$ARGV[-1]_12trim/s09.mepm/P_$ARGV[$i].12bp.Mepm.bed" or die $!;
while (<$in>)
{
$j++;
chomp;
my @a = split;
$cgi{$j}=$cgi{$j}."$a[3]\t";
}
}
foreach my $key (sort {$a<=>$b} keys %cgi)
{
print OUT4 "$cgi{$key}\n";
}
##########
open IN0,"<$dir/$ARGV[-1]_12trim/s09.mepm/uPnor_$ARGV[0].12bp.Mepm.bed" or die $!;
$f=@ARGV-1;
$i=0;
while (<IN0>)
{
chomp;
$i++;
my @a = split;
$cgi{$i} = "$a[0]\t$a[1]\t$a[2]\t$a[3]\t";
}
for ($i=1;$i<=$f-1-1;$i++)
{
$j=0;
$in = "IN".$i;
open $in, "<$dir/$ARGV[-1]_12trim/s09.mepm/uPnor_$ARGV[$i].12bp.Mepm.bed" or die $!;
while (<$in>)
{
$j++;
chomp;
my @a = split;
$cgi{$j}=$cgi{$j}."$a[3]\t";
}
}
foreach my $key (sort {$a<=>$b} keys %cgi)
{
print OUT5 "$cgi{$key}\n";
}
网友评论