1. awk直接根据列名处理数据框
R中
tidyverse
处理数据框非常方便,可以根据列名操作相应的列。
awk
总是根据$n
来操作第n
列,当列数太多怎么办?怎么直接根据列名操作呢?
- 在工具 | bioawk中就有一个提取列的例子:
$ cat b.txt
geneid T1_1 T1_2 T1_3 T2_1 T2_2 T2_3
gene1 1 3 9 2 8 5
gene2 3 4 5 8 0 8
$ bioawk -t -c header '{print $geneid,$T1_1,$T2_1}' b.txt
geneid T1_1 T2_1
gene1 1 2
gene2 3 8
-
bioawk
还需要下载,我想用awk
实现:
$ awk 'BEGIN{OFS="\t"}NR==1{for(i=1;i<=NF;i++){a[$i]=i}}NR>=1{print $a["geneid"],$a["T1_1"],$a["T2_1"]}' b.txt
geneid T1_1 T2_1
gene1 1 2
gene2 3 8
除了提取列,列的相加等也都可以进行,,,但确实很鸡肋哈,好麻烦。所以,还是学R吧,
tidyverse
yyds。
2. 如何 pairs 去重?
有了
a,b
,怎么防止b,a
出现?
$ cat d.txt
a b
c a
d b
b d
a c
下面这一行是我写的,在读取每一行时,先判断数组里面有没有正或反的形式,没有则打印,再把这一行放进数组。
$ awk '{if($1"_"$2!=a[$2"_"$1] && $2"_"$1!=a[$2"_"$1])print;a[$1"_"$2]=$1"_"$2}' d.txt
a b
c a
d b
下面这一行是师兄写的,写的太好了,我无论如何写不出来。我只能明白这行代码达到的效果。
$ awk '!(SEEN[$1,$2]++) && !(($2,$1) in SEEN)' d.txt
a b
c a
d b
3. awk求基因表达平均值
- 一个基因表达矩阵,5个样本,每个样本5个重复,
每列分别为:geneid
,S1_1
,S1_2
,S1_3
,S1_4
,S1_5
,S2_1
,S2_2
,S2_3
,S2_4
,S2_5
,......
$ cat a.tpm.tsv
gene1 0.0 0.0 0.0 0.0 0.0 0.039595 0.0 0.000000 0.0 0.0 0.0 0.072418 0.136454 0.0 0.0 125.319084 206.183060 312.973907 1000.064026 535.770325 1111.690186 677.701477 747.069885 1043.504395 587.607422
gene2 153.139114 80.315964 51.284149 81.167549 53.799297 19.027866 14.644533 24.061693 36.095894 15.593305 52.814518 47.106213 40.647087 40.411690 46.323067 64.985306 112.807793 64.820419 104.483238 87.293083 82.768089 91.885117 49.162933 58.043594 72.681511
gene3 6.967771 11.112999 16.661377 12.513005 8.348808 3.606478 1.576240 3.155725 2.502464 2.007834 11.333991 8.626426 7.127884 3.307657 8.415873 3.260129 12.483765 11.294296 22.956882 13.596114 12.488963 9.070916 7.859025 11.917100 16.502171
gene4 0.505876 0.386053 0.912009 1.151058 1.154426 0.850491 0.175352 0.269306 0.603023 0.486635 1.176807 0.524952 1.303849 0.030600 1.180763 0.071114 0.826968 0.919727 0.965852 0.624409 0.112988 0.484227 0.0 0.0 0.157508
gene5 74.891602 106.629990 107.611679 184.312668 182.105698 54.929993 156.888580 95.991226 97.382843 57.194332 185.523758 85.128159 129.768570 87.086861 77.005058 87.280197 108.832001 142.719055 161.644943 158.324509 168.457855 196.658752 79.357765 77.824776 152.112717
gene6 43.037334 25.253721 18.894386 26.046051 25.136530 10.951700 5.766452 9.901020 13.677269 9.700789 25.597813 18.042780 25.968611 23.025476 23.210535 35.266296 53.085224 33.859856 56.133625 38.710495 33.187077 27.609497 23.028906 26.431442 24.113132
gene7 25.063200 56.035549 32.237309 72.436508 48.017216 26.486784 47.957367 44.359760 29.554129 25.999800 36.883751 17.445330 30.227165 15.413628 38.705254 30.413851 25.855337 19.630417 30.933867 35.670910 20.697100 30.442179 21.638878 36.755108 35.545338
gene8 11.461951 13.055794 9.187213 14.815991 11.335137 4.229527 4.162697 4.990695 2.702080 3.662157 11.143311 7.460255 7.482308 11.640666 9.239034 15.579340 13.978998 11.687348 12.549714 8.261336 14.801984 15.432146 10.693584 8.335018 17.221148
gene9 0.0 0.033566 0.329905 0.128400 0.042617 0.0 0.0 0.0 0.0 0.0 0.368945 0.033192 0.145969 0.000000 0.0 0.046507 0.149812 0.031563 2.481390 3.989131 0.000000 0.052335 0.093023 0.272024 0.965466
- 想求每个基因每个样本的平均值,用
awk
实现的笨办法是将对应的列相加求平均值,
($2+$3+$4+$5+$6)/5,($7+$8+$9+$10+$11)/5,......
$ awk '{print $1,($2+$3+$4+$5+$6)/5,($7+$8+$9+$10+$11)/5,($12+$13+$14+$15+$16)/5,($17+$18+$19+$20+$21)/5,($22+$23+$24+$25+$26)/5}' a.tpm.tsv | sort -k1 | column -t
gene1 0 0.007919 0.0417744 436.062 833.515
gene2 83.9412 21.8847 45.4605 86.878 70.9082
gene3 11.1208 2.56975 7.76237 12.7182 11.5676
gene4 0.821884 0.476961 0.843394 0.681614 0.150945
gene5 131.11 92.4774 112.902 131.76 134.882
gene6 27.6736 9.99945 23.169 43.4111 26.874
gene7 46.758 34.8716 27.735 28.5009 29.0157
gene8 11.9712 3.94943 9.39311 12.4113 13.2968
gene9 0.106898 0 0.109621 1.33968 0.27657
- 下面这行代码,只需要改变参数
-v sample=5 -v repeat=5
中的值,
如:6个样本,每个样本3个重复,改为-v sample=6 -v repeat=3
即可。
这行代码,感觉上用处不大,但如果要封装
Shell
脚本,还是挺方便的。
可以配合 脚本 | Shell | 任意n个样本表达量超过0.5视为表达 使用。
$ awk -v sample=5 -v repeat=5 '{for(i=1;i<=sample;i++){I=(i-1)*repeat;sum=0;for(j=1;j<=repeat;j++){sum+=$(I+j+1)};if(allsum[$1]=="")allsum[$1]=$1"\t"sum/repeat; else allsum[$1]=allsum[$1]"\t"sum/repeat}}END{for(k in allsum)print allsum[k]}' a.tpm.tsv | sort -k1 | column -t
gene1 0 0.007919 0.0417744 436.062 833.515
gene2 83.9412 21.8847 45.4605 86.878 70.9082
gene3 11.1208 2.56975 7.76237 12.7182 11.5676
gene4 0.821884 0.476961 0.843394 0.681614 0.150945
gene5 131.11 92.4774 112.902 131.76 134.882
gene6 27.6736 9.99945 23.169 43.4111 26.874
gene7 46.758 34.8716 27.735 28.5009 29.0157
gene8 11.9712 3.94943 9.39311 12.4113 13.2968
gene9 0.106898 0 0.109621 1.33968 0.27657
网友评论