1、计算一条DNA序列中的ATGC个数、GC含量并格式化输出
2、对每一行数据求和,每个数剧使用各自行和归一化
3、计算sam文件每个染色体匹配reads数,GC碱基数,GC含量
4、计算SAM文件中reads落在每个bin区间中的reads数目,GC碱基数目,GC比
首发于本人公众号:pythonic生物人
更好的阅读体验请戳:
Perl学习17之生信简单运用
Perl学习18之生信简单运用(二)
Perl学习19之生信简单运用(三)
1、计算一条DNA序列中的ATGC个数、GC含量并格式化输出
gccount_read.pl
#!/usr/bin/perl
use strict;
use warnings;
my $read = "AACAAACCCCTTTTCTCTATTAAAAAATACAAAATAGCTTAGCTGCGGCATAGTGGAGCACG";
my $G = ($read =~ s/G/G/g);#s为匹配表示匹配,($read =~ s/G/G/g)返回匹配次数个数
my $C = ($read =~ s/C/C/g);
my $A = ($read =~ s/A/A/g);
my $T = ($read =~ s/T/T/g);
my $total = $G + $C + $A +$T;
my $GC_P = ($G + $C)/$total;
#格式化输出printf
printf "\$G:%d\n\$C:%d\n\$A:%d\n\$T:%d\n\$total:%d\n\$GC_P:%.2f\n",($G,$C,$A,$T,$total,$GC_P);
#sprintf格式化不输出
my $result = sprintf("\$G:%d\n\$C:%d\n\$A:%d\n\$T:%d\n\$total:%d\n\$GC_P:%.2f\n",($G,$C,$A,$T,$total,$GC_P));
#输出到文件GC.file
open OUT,">","./GC.file";
print OUT $result;
close OUT;
perl gccount_read.pl
$G:10
$C:14
$A:23
$T:15
$total:62
$GC_P:0.39
cat GC.file
$G:10
$C:14
$A:23
$T:15
$total:62
2、对每一行数据求和,每个数剧使用各自行和归一化
输入文件:
sumfile
a 1 2 3 4 5 6
b 1 2 3 4 5 6
目的:分别求a,b行中所有数之和,每个数在该行所占比例格式化输出
脚本:sum1.pl
#!/usr/bin/perl
use strict;
use warnings;
open IN,"./sumfile";
while(<IN>){
chomp;
my @t=split;
my $sum=0;
for my $i (1..5){
$sum+=$t[$i];
}
print "$t[0]\t$sum\t";#求和
for my $i (1..5){
my $percent=$t[$i]/$sum;
printf "%.2f\t","$percent";
}
print "\n";#换行
}
perl sum1.pl
a 15 0.07 0.13 0.20 0.27 0.33
b 15 0.07 0.13 0.20 0.27 0.33
3、计算sam文件每个染色体匹配reads数,GC碱基数,GC含量
gccount_reads.pl
#!/usr/bin/perl
use strict;
use warnings;
my (%hash,%GC,%ATGC);
open IN,"<",$ARGV[0];
while(<IN>){
chomp;
next if /^@/;#跳过@开头的行
my @line=split;
next if $line[2] eq "*";#跳过未匹配上的序列
next if $line[4] < 25;#排除比对质量小于25的序列
#计算ATGC的碱基个数
my $G = ($line[9] =~ s/G/G/g);
my $C = ($line[9] =~ s/C/C/g);
my $A = ($line[9] =~ s/A/A/g);
my $T = ($line[9] =~ s/T/T/g);
my $GC = $G + $C;
#计算每个染色体匹配到的read数,GC碱基个数,总碱基个数
$hash{$line[2]} ++;
$GC{$line[2]} += $GC;
$ATGC{$line[2]} += ($GC + $A + $T);
}
close IN;
foreach (keys %hash){
printf "$_\t$hash{$_}\t$GC{$_}\t%.2f%%\n",($GC{$_}/$ATGC{$_})*100;
}
perl gccount_reads.pl *sam
chr7 108663 2212904 40.73%
chr20 44747 990732 44.28%
chr22 24434 586887 48.04%
chr14 63145 1293962 40.98%
chrY 1444 29104 40.31%
chr19 36744 881130 47.96%
chr8 102992 2080246 40.40%
chr1 162445 3403360 41.90%
chr11 94802 1984799 41.87%
。。。。。。。。。。。。。。。
4、计算SAM文件中reads落在每个bin区间中的reads数目,GC碱基数目,GC比
输入文件1:file1,kemr区间文件,按照1M每个bin区间
chr1 7000001
chr1 8000001
chr1 9000001
chr1 10000001
chr1 11000001
chr1 12000001
chr1 14000001
chr1 15000001
chr1 16000001
chr1 18000001
chr1 19000001
chr1 20000001
输入文件2:
file2,sam文件问题解决代码:reads_inbin.pl
#!/usr/bin/perl
use strict;
use warnings;
#存储每个bin的区间{chr1=>{1000001=>0,2000001>0.......}}
my (%bin,%GC,%ATGC);
open BIN,"<",$ARGV[0];
while(<BIN>){
chomp;
my @line = split /\t/;
$bin{$line[0]}{$line[1]}=0;
}
close BIN;
open IN,"<",$ARGV[1];
while(<IN>){
chomp;
next if /^@/;
my @line = split/\t/,$_;
next if $line[2] eq "*";
next if $line[4] < 25;
#求每个bin区间例如,[1000001,1000001+1000000)上的reads数目
#keys %{$hash{line[2]}}取出所有1000001,二维哈希的循环写法
#foreach my $start (keys $bin{$line[2]}){#该写法报如下错误
#Type of argument to keys on reference must be unblessed hashref or arrayref
foreach my $s (sort keys %{$bin{$line[2]}}){
if($line[3]>=$s && $line[3]<($s+1000000)){
$bin{$line[2]}{$s} ++;
my $G = ($line[9] =~ s/G/G/g);
my $C = ($line[9] =~ s/C/C/g);
my $A = ($line[9] =~ s/A/A/g);
my $T = ($line[9] =~ s/T/T/g);
$GC{$line[2]}{$s} += ($G + $C);
$ATGC{$line[2]}{$s} += ($G + $C + $A + $T);
}
}
}
close IN;
foreach (keys %bin){
my $chr=$_;
foreach (keys $bin{$chr}){
printf "$chr\t$_\t$bin{$chr}{$_}\t$GC{$chr}{$_}\t$ATGC{$chr}{$_}\t%.2f%%\n",($GC{$chr}{$_}/$ATGC{$chr}{$_})*100;
}
}
perl reads_inbin.pl file1 test.sam
chr7 131000001 790 17650 39500 44.68%
chr7 79000001 706 12653 35300 35.84%
chr7 36000001 710 15303 35500 43.11%
chr7 24000001 725 14647 36250 40.41%
chr7 97000001 681 14593 34050 42.86%
chr7 47000001 796 18134 39800 45.56%
chr7 64000001 512 10840 25600 42.34%
chr7 33000001 749 14887 37450 39.75%
chr7 19000001 652 12071 32600 37.03%
。。。。。。。。。。。。。。。。。。。
网友评论