Exercise source: https://www.jianshu.com/p/9260baae4b4e
- In any directory, create a nested series of directories of the form 1/2/3/4/5/6/7/8/9.
a=; for i in $(seq 9); do a=${a}/$i; done; mkdir -p "$(pwd)$a"
(base) yjk@DESKTOP-U8UULFU:~/test/1/2/3/4/5/6/7/8/9$ pwd
/home/yjk/test/1/2/3/4/5/6/7/8/9
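Since `mkdir -p` creates every missing parent directory, the path can also be built once and passed in a single call. A minimal sketch, assuming `seq` and `paste` are available; `make_nested` is a hypothetical helper name, not part of the exercise:

```shell
# make_nested BASE DEPTH: create BASE/1/2/.../DEPTH in a single mkdir call.
# (make_nested is a hypothetical helper name used only for illustration.)
make_nested() {
    base=$1
    depth=$2
    # seq prints 1..DEPTH one per line; paste joins the lines with "/".
    path=$(seq "$depth" | paste -sd / -)
    mkdir -p "$base/$path"
}

# Example: make_nested "$(pwd)" 9
```

For a fixed depth, typing `mkdir -p 1/2/3/4/5/6/7/8/9` directly does the same thing.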
- Inside the directory you just created (mine, for example, is /Users/jimmy/tmp/1/2/3/4/5/6/7/8/9), create a text file named me.txt
(base) yjk@DESKTOP-U8UULFU:~/test/1/2/3/4/5/6/7/8/9$ touch me.txt
- Enter the following content into me.txt:
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
(base) yjk@DESKTOP-U8UULFU:~/test/1/2/3/4/5/6/7/8/9$ vim me.txt
i
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
Esc
:wq
Enter
- Delete the directories 1/2/3/4/5/6/7/8/9 and the text file me.txt created above
(base) yjk@DESKTOP-U8UULFU:~/test$ rm -rf 1/
- In any directory, create the five directories folder1~5, then create another five directories folder1~5 inside each of them
(base) yjk@DESKTOP-U8UULFU:~/test$ for i in `seq 5`;do mkdir folder_$i;cd folder_$i;for j in `seq 5`;do mkdir folder_$j;done;cd ..;done
(base) yjk@DESKTOP-U8UULFU:~/test$ tree
.
├── folder_1
│   ├── folder_1
│   ├── folder_2
│   ├── folder_3
│   ├── folder_4
│   └── folder_5
├── folder_2
│   ├── folder_1
│   ├── folder_2
│   ├── folder_3
│   ├── folder_4
│   └── folder_5
├── folder_3
│   ├── folder_1
│   ├── folder_2
│   ├── folder_3
│   ├── folder_4
│   └── folder_5
├── folder_4
│   ├── folder_1
│   ├── folder_2
│   ├── folder_3
│   ├── folder_4
│   └── folder_5
└── folder_5
    ├── folder_1
    ├── folder_2
    ├── folder_3
    ├── folder_4
    └── folder_5
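The same 5x5 layout can be created without `cd`, because `mkdir -p` also creates the parent level. A minimal sketch; `make_grid` is a hypothetical helper name:

```shell
# make_grid ROOT: create ROOT/folder_1..5, each containing folder_1..5.
# (make_grid is a hypothetical helper name used only for illustration.)
make_grid() {
    root=$1
    for i in 1 2 3 4 5; do
        for j in 1 2 3 4 5; do
            # -p creates folder_$i on the first pass and is a no-op afterwards.
            mkdir -p "$root/folder_$i/folder_$j"
        done
    done
}

# Example: make_grid .
```

In bash, brace expansion gives a one-liner with the same result: `mkdir -p folder_{1..5}/folder_{1..5}`.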
- In every directory created in exercise 5, create the text file me.txt from exercise 2, with the same content
(base) yjk@DESKTOP-U8UULFU:~/test$ for i in `ls`;do cd $i;for j in `ls`;do cd $j;cat >me.txt<<EOF
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
EOF
cd ../; done; cd ../; done
(base) yjk@DESKTOP-U8UULFU:~/test$ tree
.
├── folder_1
│   ├── folder_1
│   │   └── me.txt
│   ├── folder_2
│   │   └── me.txt
│   ├── folder_3
│   │   └── me.txt
│   ├── folder_4
│   │   └── me.txt
│   └── folder_5
│       └── me.txt
├── folder_2
│   ├── folder_1
│   │   └── me.txt
│   ├── folder_2
│   │   └── me.txt
│   ├── folder_3
│   │   └── me.txt
│   ├── folder_4
│   │   └── me.txt
│   └── folder_5
│       └── me.txt
├── folder_3
│   ├── folder_1
│   │   └── me.txt
│   ├── folder_2
│   │   └── me.txt
│   ├── folder_3
│   │   └── me.txt
│   ├── folder_4
│   │   └── me.txt
│   └── folder_5
│       └── me.txt
├── folder_4
│   ├── folder_1
│   │   └── me.txt
│   ├── folder_2
│   │   └── me.txt
│   ├── folder_3
│   │   └── me.txt
│   ├── folder_4
│   │   └── me.txt
│   └── folder_5
│       └── me.txt
└── folder_5
    ├── folder_1
    │   └── me.txt
    ├── folder_2
    │   └── me.txt
    ├── folder_3
    │   └── me.txt
    ├── folder_4
    │   └── me.txt
    └── folder_5
        └── me.txt
30 directories, 25 files
(base) yjk@DESKTOP-U8UULFU:~/test/folder_1/folder_1$ cat me.txt
Go to: http://www.biotrainee.com/
I love bioinfomatics.
And you ?
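The loop above writes me.txt only into the 25 second-level directories, which matches the "25 files" in the tree summary. If the exercise is read as covering all 30 directories, one possible sketch uses `find` to visit every directory below a root; `write_me_everywhere` is a hypothetical helper name:

```shell
# write_me_everywhere ROOT: put the three-line me.txt into every directory
# below ROOT (ROOT itself is skipped by -mindepth 1).
# (write_me_everywhere is a hypothetical helper name.)
write_me_everywhere() {
    find "$1" -mindepth 1 -type d | while IFS= read -r dir; do
        printf '%s\n' \
            'Go to: http://www.biotrainee.com/' \
            'I love bioinfomatics.' \
            'And you ?' > "$dir/me.txt"
    done
}

# Example: write_me_everywhere .
```

This assumes directory names without embedded newlines, which holds for the folder_N names used here.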
- Once again, delete the directories and files created in the previous steps
(base) yjk@DESKTOP-U8UULFU:~/test$ rm -rf *
- Download http://www.biotrainee.com/jmzeng/igv/test.bed, then find which line contains H3K4me3 and how many lines the file has in total
(base) yjk@DESKTOP-U8UULFU:~/test$ wget http://www.biotrainee.com/jmzeng/igv/test.bed
(base) yjk@DESKTOP-U8UULFU:~/test$ vi test.bed
?H3K4me3
The match is on line 8.
How many lines in total?
(base) yjk@DESKTOP-U8UULFU:~/test$ cat test.bed | wc -l
10
10 lines in total.
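Both answers can also be obtained without opening an editor: `grep -n` prefixes each match with its line number, and `wc -l` counts lines. A minimal sketch; `line_of` and `count_lines` are hypothetical helper names:

```shell
# line_of PATTERN FILE: print the line number(s) where the fixed string occurs.
# (line_of and count_lines are hypothetical helper names.)
line_of() {
    # -n prefixes "N:" to each match; -F treats PATTERN as a literal string.
    grep -n -F -- "$1" "$2" | cut -d: -f1
}

# count_lines FILE: print the number of lines in FILE.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Example: line_of H3K4me3 test.bed; count_lines test.bed
```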
- Download http://www.biotrainee.com/jmzeng/rmDuplicate.zip, unzip it, and inspect the directory structure inside
(base) yjk@DESKTOP-U8UULFU:~/test$ wget http://www.biotrainee.com/jmzeng/rmDuplicate.zip
(base) yjk@DESKTOP-U8UULFU:~/test$ unzip rmDuplicate.zip
(base) yjk@DESKTOP-U8UULFU:~/test$ tree rmDuplicate
rmDuplicate
├── picard
│   ├── paired
│   │   ├── readme.txt
│   │   ├── tmp.MarkDuplicates.log
│   │   ├── tmp.header
│   │   ├── tmp.metrics
│   │   ├── tmp.rmdup.bai
│   │   ├── tmp.rmdup.bam
│   │   ├── tmp.sam
│   │   └── tmp.sorted.bam
│   └── single
│       ├── readme.txt
│       ├── tmp.MarkDuplicates.log
│       ├── tmp.header
│       ├── tmp.metrics
│       ├── tmp.rmdup.bai
│       ├── tmp.rmdup.bam
│       ├── tmp.sam
│       └── tmp.sorted.bam
└── samtools
    ├── paired
    │   ├── readme.txt
    │   ├── tmp.header
    │   ├── tmp.rmdup.bam
    │   ├── tmp.rmdup.vcf.gz
    │   ├── tmp.sam
    │   ├── tmp.sorted.bam
    │   └── tmp.sorted.vcf.gz
    └── single
        ├── readme.txt
        ├── tmp.header
        ├── tmp.rmdup.bam
        ├── tmp.rmdup.vcf.gz
        ├── tmp.sam
        ├── tmp.sorted.bam
        └── tmp.sorted.vcf.gz
- Open the files unzipped in exercise 9, go into the rmDuplicate/samtools/single directory, look at the file with the .sam suffix, and work out how SAM/BAM are defined in bioinformatics
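Got it: SAM (Sequence Alignment/Map) is a tab-delimited text format that stores the alignments of reads against a reference, and BAM is its compressed binary equivalent. If this is unclear, see
(https://www.jianshu.com/p/9c99e09630da)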
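To make the format concrete: each alignment line in SAM has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), and header lines start with `@`. A minimal sketch that prints the first six fields in labelled form; `sam_summary` is a hypothetical helper name:

```shell
# sam_summary: read SAM text on stdin, skip "@" header lines, and print
# read name, flag, position, mapping quality and CIGAR for each alignment.
# (sam_summary is a hypothetical helper name.)
sam_summary() {
    awk -F '\t' '!/^@/ {
        printf "%s flag=%s %s:%s mapq=%s cigar=%s\n", $1, $2, $3, $4, $5, $6
    }'
}

# Example: sam_summary < tmp.sam
```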
- Install the samtools software
conda install -c bioconda samtools
- Open the file with the .bam suffix and find the command that produced it. As a hint, the command is:
/home/jianmingzeng/biosoft/bowtie/bowtie2-2.2.9/bowtie2-align-s --wrapper basic-0 -p 20 -x /home/jianmingzeng/reference/index/bowtie/hg38 -S /home/jianmingzeng/data/public/allMouse/alignment/WT_rep2_Input.sam -U /tmp/41440.unp
(base) yjk@DESKTOP-U8UULFU:~/test/rmDuplicate/samtools/single$ samtools view tmp.sorted.bam | less
SRR1042600.42157053 0 chr1 629895 42 51M * 0 0 ATAACCAATACTACCAATCANTACTCATCATTAATAATCATAATGGCTATA CCCFFFFFHHHHHJJJJJJJ#4AGHJJIIJJIIIIIJJJJIJIIIIJJIJI AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:11C8A30 YT:Z:UU
SRR1042600.42212881 0 chr1 629895 42 51M * 0 0 ATAACCAATACTACCAATCANTACTCATCATTAATAATCATAATGGCTATA @@<FDFFBFDHHFJEIIGJI#3AFHGEHEIJIIGIIGGIJIIJIGIIGIIJ AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:11C8A30 YT:Z:UU
SRR1042600.12010763 16 chr1 629895 24 51M * 0 0 ATAACCAATACTTCTAATCAAAACTCATCATTAATAATCATAATGGCTATA ?4B?1*4DD?11*1*?+22+<3F:3@EC:CC4EA,DEDDDDD?D3B:==+; AS:i:-10 XN:i:0 XM:i:4 XO:i:0 XG:i:0 NM:i:4 MD:Z:11C0A1C6T29 YT:Z:UU
SRR1042600.29629551 16 chr1 629895 40 51M * 0 0 ATAACCAATACTACCAATCACTACTCATCATTAATAATCATAATGGCTATA HGF?JJHHFDHHGJJIHDFA+E?JIJJIIHGJJJJJJJHHHHHFFFFFCC@ AS:i:-8 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:11C8A30 YT:Z:UU
SRR1042600.41910745 0 chr1 629896 42 51M * 0 0 TAACCAATACTACCAATCAANACTCATCATTAATAATCATAATGGCTATAG CC@FFFFFHHHHGIIHIJJJ#3<CFHCGGIIIJJJJJJJJIGGFHIIJFII AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:10C9T30 YT:Z:UU
SRR1042600.14329856 16 chr1 629896 8 18M1I32M * 0 0 AAACCAAATCCTCCAATCAAATCCTCATCATTAATAATCATAATGGCTATA #############################@IHHGCE9GHFHHHDDDDD<@@ AS:i:-18 XN:i:0 XM:i:5 XO:i:1 XG:i:1 NM:i:6 MD:Z:0T6T0A2A9A28 YT:Z:UU
SRR1042600.15078214 16 chr1 629896 40 51M * 0 0 TAACCAATACTACCAATCAATACCCATCATTAATAATCATAATGGCTATAG 9?1EFDD4CE?1F@?F<HFA<<C+F9HBC<<FEBBC4GD<=+8DDDDA=;1 AS:i:-8 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:10C12T27 YT:Z:UU
SRR1042600.52533601 16 chr1 629896 40 51M * 0 0 TAACCAATACTACCAATCAATCCTCATCATTAATAATCATAATGGCTATAG D?0?*?1*?C?*EGC99>FA+3FBHBEBCA4HCC<:FFFFFF<DB?BD<@@ AS:i:-8 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:10C10A29 YT:Z:UU
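`samtools view` without options prints only the alignment records. The command that produced a BAM file is normally recorded in the header, in the `CL:` field of an `@PG` line, and `samtools view -H` prints that header. A minimal sketch that extracts the command lines, assuming this particular file actually carries `@PG` records; `pg_commands` is a hypothetical helper name:

```shell
# pg_commands: read a SAM header on stdin and print the CL: (command line)
# value of every @PG record.
# (pg_commands is a hypothetical helper name.)
pg_commands() {
    awk -F '\t' '/^@PG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^CL:/) print substr($i, 4)   # drop the "CL:" prefix
    }'
}

# Example: samtools view -H tmp.sorted.bam | pg_commands
```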
- Based on the command above, find the reference genome I used
/home/jianmingzeng/reference/index/bowtie/hg38
How many chromosomes does it have, exactly?
I don't know how to do this one.
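One way to approach the chromosome count: the SAM specification puts one `@SQ` header line per reference sequence, so counting those lines in the BAM header gives the number of sequences in the index. A minimal sketch, assuming UCSC-style names (chr1..chr22, chrX, chrY, chrM) for the primary chromosomes; `sq_counts` is a hypothetical helper name:

```shell
# sq_counts: read a SAM header on stdin; report how many @SQ records exist
# and how many of them look like primary chromosomes.
# (sq_counts is a hypothetical helper name; the chr-name pattern is an assumption.)
sq_counts() {
    awk -F '\t' '
        /^@SQ/ {
            total++
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:chr([0-9]+|X|Y|M)$/) primary++
        }
        END { printf "total=%d primary=%d\n", total, primary }'
}

# Example: samtools view -H tmp.sorted.bam | sq_counts
```

The total usually exceeds the chromosome count because alt, random and unplaced contigs are also listed as `@SQ` records.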
- The second column of the BAM file above contains only the two values 0 and 16; use cut/sort/uniq and similar commands to count how often each occurs.
(base) yjk@DESKTOP-U8UULFU:~/test/rmDuplicate/samtools/single$ samtools view tmp.rmdup.bam | cut -f 2 | sort | uniq -c
16 0
12 16
- Now open the BAM file in the rmDuplicate/samtools/paired directory, look at the second column again, and count the values
(base) yjk@DESKTOP-U8UULFU:~/test/rmDuplicate/samtools/paired$ samtools view tmp.rmdup.bam | cut -f 2 | sort | uniq -c | sort -k 2n
2 83
2 97
8 99
7 147
2 163
1 323
1 353
1 371
1 387
1 433
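The second column is the SAM FLAG, a bit field, which is why paired-end data shows many more distinct values than 0 and 16. A minimal sketch that decodes a flag into the bit names defined by the SAM specification; `decode_flag` and the short labels are hypothetical names chosen for illustration:

```shell
# decode_flag N: print the names of the bits set in SAM flag N.
# (decode_flag and the label spellings are hypothetical; the bit values
# follow the SAM specification.)
decode_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:first 128:second \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( flag & bit )) -ne 0 ]; then
            out="$out${out:+,}$name"   # comma-separate after the first name
        fi
    done
    printf '%s\n' "${out:-none}"
}

# Example: decode_flag 99
```

For example, 16 decodes to a read mapped to the reverse strand, while 99 and 147 are the two mates of a properly paired fragment.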
- Download http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip, unzip it, and inspect the directory structure inside. The file is 2.3 MB, so pay attention to the download time and speed.
(base) yjk@DESKTOP-U8UULFU:~/test$ wget http://www.biotrainee.com/jmzeng/sickle/sickle-results.zip
(base) yjk@DESKTOP-U8UULFU:~/test$ unzip sickle-results.zip
(base) yjk@DESKTOP-U8UULFU:~/test$ tree sickle-results
sickle-results
├── command.txt
├── single_tmp_fastqc.html
├── single_tmp_fastqc.zip
├── test1_fastqc.html
├── test1_fastqc.zip
├── test2_fastqc.html
├── test2_fastqc.zip
├── trimmed_output_file1_fastqc.html
├── trimmed_output_file1_fastqc.zip
├── trimmed_output_file2_fastqc.html
└── trimmed_output_file2_fastqc.zip
- Unzip sickle-results/single_tmp_fastqc.zip, go into the extracted directory, find fastqc_data.txt, and count how many lines in that text file start with >>
(base) yjk@DESKTOP-U8UULFU:~/test/sickle-results$ unzip single_tmp_fastqc.zip
(base) yjk@DESKTOP-U8UULFU:~/test/sickle-results$ cd single_tmp_fastqc/
(base) yjk@DESKTOP-U8UULFU:~/test/sickle-results/single_tmp_fastqc$ cat fastqc_data.txt | grep "^>>" | wc -l
24
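In fastqc_data.txt each module opens with a `>>Module name` line and closes with `>>END_MODULE`, so the raw count of `>>` lines is twice the number of modules. A minimal sketch that counts only the opening lines, assuming that layout holds for this file; `count_modules` is a hypothetical helper name:

```shell
# count_modules FILE: count ">>" lines that open a module, i.e. every
# ">>" line except the ">>END_MODULE" terminators.
# (count_modules is a hypothetical helper name.)
count_modules() {
    grep '^>>' "$1" | grep -vc '^>>END_MODULE'
}

# Example: count_modules fastqc_data.txt
```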
- Download http://www.biotrainee.com/jmzeng/tmp/hg38.tss, look up on NCBI the RefSeq database IDs of TP53/BRCA1 or other genes you are interested in, then find which line of hg38.tss they are on
(base) yjk@DESKTOP-U8UULFU:~/test$ wget http://www.biotrainee.com/jmzeng/tmp/hg38.tss
Search RefSeq for "tp53 human" (https://www.ncbi.nlm.nih.gov/nuccore/?term=tp53+human+AND+srcdb_refseq%5BPROP%5D) and pick the accession of any human TP53 mRNA, e.g. NM_001276695
vi hg38.tss
?NM_001276695
It is on line 8685.
Then just quit.
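The lookup can also be done non-interactively. A minimal sketch, assuming the accession sits in the first tab-separated column of hg38.tss (which the `grep "^NM"` commands in this write-up suggest); `tss_line` is a hypothetical helper name:

```shell
# tss_line ID FILE: print the line number(s) whose first column equals ID.
# An exact comparison avoids matching longer accessions that share a prefix.
# (tss_line is a hypothetical helper name.)
tss_line() {
    awk -F '\t' -v id="$1" '$1 == id { print NR }' "$2"
}

# Example: tss_line NM_001276695 hg38.tss
```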
- Parse hg38.tss and count the number of genes on each chromosome
(base) yjk@DESKTOP-U8UULFU:~/test$ cat hg38.tss | cut -f 2 | sort | uniq -c | sort -k 1n
1 chr11_KI270827v1_alt
1 chr17_GL000205v2_random
1 chr17_GL383566v1_alt
...
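The same tally can be produced in one pass with an awk associative array and sorted so that the largest counts come first. A minimal sketch, assuming the chromosome name is the second tab-separated column as in the `cut -f 2` command above; `per_chrom_counts` is a hypothetical helper name:

```shell
# per_chrom_counts FILE: print "count<TAB>chromosome", largest count first,
# ties broken by chromosome name.
# (per_chrom_counts is a hypothetical helper name.)
per_chrom_counts() {
    tab=$(printf '\t')
    awk -F '\t' '{ n[$2]++ } END { for (c in n) print n[c] "\t" c }' "$1" |
        sort -t "$tab" -k1,1nr -k2,2
}

# Example: per_chrom_counts hg38.tss | head
```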
- Parse hg38.tss, count the entries starting with NM and with NR, and learn what the NM and NR prefixes mean.
(base) yjk@DESKTOP-U8UULFU:~/test$ cat hg38.tss | grep "^NR" | wc -l
15954
(base) yjk@DESKTOP-U8UULFU:~/test$ cat hg38.tss | grep "^NM" | wc -l
51064
Or like this:
(base) yjk@DESKTOP-U8UULFU:~/test$ cat hg38.tss | grep -oE "^NR|^NM" | sort | uniq -c
51064 NM
15954 NR
NM denotes mRNA (protein-coding transcripts).
NR denotes non-coding transcript sequences, including structural RNAs.