Biopython
此笔记参照Lec5
1. what is python?
Biopython is a Python library for reading and writing many common biological data formats. It contains some functionality to perform calculations, in particular on 3D structures.
biopython支持的数据格式
- Blast: 用来寻找序列间局部的局部相似性
- ClustalW: 多序列alignment
- GenBank:NCBI序列数据库
- PubMed and Medicine: document 数据库
- ExPASy : SSIB resource portal (Enzyme and Prosite)
- SCOP: Structural Classification of Proteins (e.g. ‘dom’,’lin’)
- UniGene: computationally identifies transcripts from the same locus
- SwissProt: annotated and non-redundant protein sequence database
2. Scripts
- The sequence object
sequence object的内容和一个string相似。
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna # 导入明确序列和不明确的
unamb_dna_seq = Seq("ACGTTTT", unambiguous_dna)
ambig_dna_seq = Seq("ACRGTRRRRRR", ambiguous_dna)
print( unamb_dna_seq )
print( ambig_dna_seq.alphabet ) # ?输出是 IUPACAmbiguousDNA()
# A Seq object in python acts like a normal python string.
for letter in ambig_dna_seq:
print(letter) # 打印每一个ambitious——data中的nuleotide
print( 'Length: ', len(ambig_dna_seq) ) # 序列长度
print( ambig_dna_seq[4:12] ) # 第4到12个
print( ambig_dna_seq[::-1] ) # 倒序
print( str(ambig_dna_seq) ) # 打印序列内容
- Nucleotide counts, transcription, translation
- 统计一段序列中核苷酸的个数 seq.count()
- 给出转录成的mRNA序列 seq.transcribe()
- 翻译的蛋白的序列 seq.translate()
- 互补DNA seq.complement()
- 互补倒序DNA seq.reverse_complement()
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna
my_seq = Seq( "AGTACACTGGTA", unambiguous_dna )
print( '原始序列:',my_seq )
# How many As,Cs, Ts and Gs in this sequence? 统计AGCT数目
print( '# A: ',my_seq.count("A") )
print( '# C: ',my_seq.count("C") )
print( '# G: ',my_seq.count("G") )
print( '# T: ',my_seq.count("T") )
# To get the GC nucleotide content: 计算GC含量
from Bio.SeqUtils import GC
print( 'GC content: ', GC(my_seq), '%' )
# Transcription and translation # 转录
my_mRNA = my_seq.transcribe()
print('转录的mRNA:' ,my_mRNA )
my_pept = my_seq.translate() # 翻译
print( '翻译的蛋白质序列:',my_pept )
# Complement and reverse complement
print( 'sequence: ',str(my_seq) )
print( 'comp. : ',my_seq.complement() ) # 互补序列
print( 'revcomp : ',my_seq.reverse_complement() ) # 倒序互补序列
- translation
翻译时候可以设置参数 seq.translate(to_stop=True)来考虑到终止密码子的情况
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
# mRNA序列
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
print( messenger_rna )
print( messenger_rna.translate() )
# There are three RNA stop codons: UAG, UAA, and UGA.
# 本段序列是UGA
# Now, you may want to translate the nucleotides up to the first in frame stop codon,
# and then stop (as happens in nature):
print( messenger_rna.translate(to_stop=True) )
- sequence record object
SeqRecord是SeqIO objects的一种基本的数据形式,和Seq object很像(注意之前的练习都是Seq object), 但是SeqRecord还包含其他的属性。
seq - 序列本身,通常是Seq对象。
id - 用于标识序列的主要ID,是一个字符串。在大多数情况下,这是一个accession number。
name - 序列的 "通用 "名称/id,是一个字符串。在某些情况下,这与accession number相同,但也可能是clone name。类似于GenBank记录中的LOCUS ID。
description - 一个人类可读的描述或序列的表达性名称-一个字符串。
我们可以把SeqRecord看作是一个容器,它包括Seq的上述属性。
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord # 导入seqrecord
## create a simple SeqRecord object
simple_seq = Seq( "GATCAGGATTAGGCC" )
simple_seq_r = SeqRecord( simple_seq )
simple_seq_r.id = "AC12345"
simple_seq_r.description = "I am not a real sequence"
simple_seq_r.name = 'fake sequence name'
## print summary
print( simple_seq_r.id )
print( simple_seq_r.description )
print( simple_seq_r.name )
print( str(simple_seq_r.seq) )
#print( simple_seq_r.seq ) 输出和沙面的一样
print( simple_seq_r.seq.alphabet )
## translate the sequence
translated_seq = simple_seq_r.seq.translate()
print ( translated_seq )
- Sequence IO object
import os
from Bio import SeqIO
## save the sequence records to a list
allSeqRecords = []
allSeqIDs = []
# fasta file is in current directory
pathToFile = os.path.join("./","multi_fasta_gb_header.fasta")
for seq_record in SeqIO.parse(pathToFile, "fasta"):
allSeqRecords.append(seq_record)
allSeqIDs.append(seq_record.id.split("|")[1])
print( seq_record.id )
# print 1st 10 nucleotides only
print( str(seq_record.seq[:10]) )
print( len(seq_record) )
## print out fun stuff about the sequences
print( "We found ", len(allSeqIDs), "sequences" )
print( "information on the third sequence:" )
ind = 2
seqRec = allSeqRecords[ind]
print( "\t", "GI number ", allSeqIDs[ind] )
print( "\t", "full id ", seqRec.id )
print( "\t", "num nucleo. ", len(seqRec.seq) )
print( "\t", "1st 10 nucleo.", seqRec.seq[:10] )
网友评论