Bioinformatics Exercises | Part 1

Author: kkkkkkang | Published 2020-08-26 14:31


Below is a collection of bioinformatics exercises I ran into during analysis and study that seemed worth writing down.

Exercise 1: Download the latest KEGG annotation text file and write a script that reshapes it into a table mapping KEGG pathway IDs to gene IDs.

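# For downloading the data, see the 生信技能树 tutorial: http://www.bio-info-trainee.com/2550.html
My approach: lines starting with C carry the KEGG pathway ID and lines starting with D carry the gene ID. Store the second field of each C line in the variable a; for every D line, print a and that line's second field separated by \t. a keeps its value until the next C line replaces it, and this repeats until the end of the file.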
awk '{if ($0 ~ /^C/) a=$2; else if ($0 ~ /^D/) print a"\t"$2}' hsa00001.keg | less
00010   3101
00010   3098
00010   3099
00010   80201
00010   2645
00010   2821
00010   5213
00010   5214
00010   5211
00010   2203
00010   8789
00010   230
00010   226
00010   229
00010   7167
00010   2597
00010   26330
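
The same carry-forward idea can be sketched in Python for cases where the result feeds into further processing. This is a minimal sketch, not the author's script: the function name `pathway_gene_pairs` and the sample lines are made up for illustration, and the layout (C lines with the pathway ID in the second field, D lines with the gene ID in the second field) is the one the awk command above assumes.

```python
def pathway_gene_pairs(lines):
    """Yield (pathway_id, gene_id) pairs from lines of a KEGG .keg file."""
    pathway = None
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                      # nothing usable on this line
        if line.startswith("C"):
            pathway = fields[1]           # remember the current pathway ID
        elif line.startswith("D") and pathway is not None:
            yield pathway, fields[1]      # pair the gene with that pathway


if __name__ == "__main__":
    # Hypothetical excerpt, only to show the expected layout
    sample = [
        "C    00010 Glycolysis / Gluconeogenesis [PATH:hsa00010]",
        "D      3101 HK3; hexokinase 3",
        "D      3098 HK1; hexokinase 1",
    ]
    for pathway, gene in pathway_gene_pairs(sample):
        print(pathway, gene, sep="\t")
```

Unlike the awk one-liner, this version skips D lines that appear before any C line instead of printing an empty pathway column.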

Exercise 2: Parse go-basic.obo into a three-column, tab-separated table

The original go-basic.obo looks like this

format-version: 1.2
data-version: releases/2020-08-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
remark: cvs version: use data-version
remark: Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process

The goal is to convert it into this

GO	Description	Level
GO:0000001	mitochondrion inheritance	biological_process
GO:0000002	mitochondrial genome maintenance	biological_process
GO:0000003	reproduction	biological_process
GO:0000005	obsolete ribosomal chaperone activity	molecular_function
GO:0000006	high-affinity zinc transmembrane transporter activity	molecular_function
GO:0000007	low-affinity zinc ion transmembrane transporter activity	molecular_function

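My approach is simple, and the pattern is easy to spot: the first output column comes from the id line of the original file, the second from the name line, and the third from the namespace line.
Extract them in that order, separate the first two with a tab, and end the third with the default newline '\n'.
But there is a pitfall here: when matching with Python's startswith, startswith("name") also matches the namespace lines.
The fix is easy: match on name: (with the colon) instead 🤢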
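
The pitfall is easy to reproduce in isolation. The snippet below is a small illustration, not part of the original script; `split_field` is a hypothetical helper showing an alternative that compares the whole key rather than a prefix.

```python
lines = ["name: reproduction", "namespace: biological_process"]

# A bare prefix matches both fields
print([l.startswith("name") for l in lines])    # [True, True]
# Including the colon matches only the name field
print([l.startswith("name:") for l in lines])   # [True, False]


def split_field(line):
    """Split 'key: value' into (key, value); return None for other lines."""
    key, sep, value = line.rstrip("\n").partition(": ")
    return (key, value) if sep else None


print(split_field("namespace: biological_process"))
# ('namespace', 'biological_process')
```

Comparing the full key (`key == "name"`) avoids the prefix problem altogether, at the cost of one extra split per line.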

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import argparse
parser = argparse.ArgumentParser(description = "\nThis python script is used to parse the go-basic.obo file into a tab-delimited table", add_help = False, usage = "\n python3 parse_go.py -i [go-basic.obo] -o [go.txt] \n python3 parse_go.py --input [go-basic.obo] --output [go.txt]")
required = parser.add_argument_group("Required options")
optional = parser.add_argument_group("Optional options")
required.add_argument("-i", "--input", metavar = "[go-basic.obo]", help = "input file format: go_obo", required = True)
required.add_argument("-o", "--output", metavar = "[go.txt]", help = "output file format: tab-delimited table", required = True)
optional.add_argument("-h", "--help", action = "help", help = "help.info")
args = parser.parse_args()
with open(args.input,"r") as fi:
    with open(args.output,"w") as fo:
        print("GO\tDescription\tLevel",file = fo)
        # Relies on each stanza listing id, name, namespace in that order.
        # Every stanza with these three lines is written, not only [Term] stanzas.
        for line in fi:
            if line.startswith("id:"):
                # column 1: the identifier, followed by a tab
                print(line.split(" ")[1].strip(), file = fo, end = '\t')
            elif line.startswith("name:"):
                # column 2: the name may contain spaces, so rejoin everything after the key
                print(' '.join(line.split(" ")[1:]).strip(), file = fo, end = '\t')
            elif line.startswith("namespace:"):
                # column 3: the namespace, followed by the default newline
                print(line.split(" ")[1].strip(), file = fo)
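
Because the script above prints fields as it meets them, a stanza that lacks one of the three lines would shift the columns, and stanzas other than [Term] are not filtered out. One possible way around both is to collect the fields per stanza and emit a row only when a [Term] stanza is complete. This is a sketch under that assumption; `parse_terms` is a name chosen here, not something from the original post.

```python
def parse_terms(lines):
    """Yield (id, name, namespace) for each complete [Term] stanza."""
    wanted = ("id", "name", "namespace")
    in_term = False
    record = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):               # a new stanza begins
            if in_term and len(record) == 3:
                yield tuple(record[k] for k in wanted)
            in_term = (line == "[Term]")
            record = {}
        elif in_term:
            key, sep, value = line.partition(": ")
            # keep only the first occurrence of each wanted key
            if sep and key in wanted and key not in record:
                record[key] = value
    if in_term and len(record) == 3:           # flush the last stanza
        yield tuple(record[k] for k in wanted)


if __name__ == "__main__":
    sample = [
        "[Term]",
        "id: GO:0000001",
        "name: mitochondrion inheritance",
        "namespace: biological_process",
    ]
    for row in parse_terms(sample):
        print("\t".join(row))
```

Header lines before the first stanza are ignored because no [Term] has been opened yet.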

Exercise 3: Compute the total length of human exon regions

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# 2020.8
import argparse
parser = argparse.ArgumentParser(description = '\nThis is a python3 script used to count the total length of human exons', add_help = False, usage = '\npython3 extron_len.py -i [human_extron.txt]')
required = parser.add_argument_group('Required options')
optional = parser.add_argument_group('Optional options')
required.add_argument('-i', metavar = '[human_extron.txt]', help = 'input: exon coordinates can be downloaded at https://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt', required = True)
optional.add_argument('-h','--help', action = 'help', help = 'help.info')
args = parser.parse_args()
# Unique (chromosome, start, end) triples. Keying on start alone would let
# exons on different chromosomes, or with different ends, overwrite each other.
exons = set()
with open(args.i,'r') as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if fields[9] == '-':
            continue  # this record has no exon coordinates
        # column 10 looks like [start-end, start-end, ...]
        for exon in fields[9].lstrip('[').rstrip(']').split(', '):
            start, end = exon.split('-')
            exons.add((fields[0], int(start), int(end)))
# Exons that overlap but differ in their boundaries are still counted separately
length = 0
for chrom, start, end in exons:
    length += end - start + 1
print(length)
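
Removing exact duplicates still counts a base twice when two exons overlap but have different boundaries. If the goal is the number of distinct bases covered, one option is to merge overlapping intervals per chromosome before summing. The sketch below assumes inclusive coordinates, as in the `end - start + 1` formula above; `merged_length` and the sample intervals are hypothetical and not part of the original script.

```python
from collections import defaultdict


def merged_length(exons):
    """Total length of the union of inclusive (chrom, start, end) intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((start, end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:            # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:                               # gap: close the current block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1        # close the last block
    return total


if __name__ == "__main__":
    # Made-up intervals: two overlapping exons on chr1, one separate, one on chr2
    demo = [("1", 100, 199), ("1", 150, 249), ("1", 300, 309), ("2", 100, 199)]
    print(merged_length(demo))  # 260
```

In the exon script, the set of `(chrom, start, end)` triples could be passed to such a function in place of the final summing loop.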