Biopython

此笔记参照Lec5

1. what is python?

Biopython is a Python library for reading and writing many common biological data formats. It contains some functionality to perform calculations, in particular on 3D structures.

biopython支持的数据格式

Blast: 用来寻找序列间局部的局部相似性
ClustalW：多序列alignment
GenBank：NCBI序列数据库
PubMed and Medicine： document 数据库
ExPASy ： SSIB resource portal (Enzyme and Prosite)
SCOP： Structural Classification of Proteins (e.g. ‘dom’,’lin’)
UniGene： computationally identifies transcripts from the same locus
SwissProt： annotated and non-redundant protein sequence database

2. Scripts

- The sequence object

sequence object的内容和一个string相似。

from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna # 导入明确序列和不明确的

unamb_dna_seq = Seq("ACGTTTT", unambiguous_dna)
ambig_dna_seq = Seq("ACRGTRRRRRR", ambiguous_dna)

print( unamb_dna_seq ) 
print( ambig_dna_seq.alphabet ) # ？输出是 IUPACAmbiguousDNA()

# A Seq object in python acts like a normal python string.
for letter in ambig_dna_seq:
    print(letter) # 打印每一个ambitious——data中的nuleotide

print( 'Length: ', len(ambig_dna_seq) ) # 序列长度
print( ambig_dna_seq[4:12] ) # 第4到12个
print( ambig_dna_seq[::-1] ) # 倒序
print( str(ambig_dna_seq) ) # 打印序列内容

- Nucleotide counts, transcription, translation

统计一段序列中核苷酸的个数 seq.count()
给出转录成的mRNA序列 seq.transcribe()
翻译的蛋白的序列 seq.translate()
互补DNA seq.complement()
互补倒序DNA seq.reverse_complement()

from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import unambiguous_dna, ambiguous_dna

my_seq = Seq( "AGTACACTGGTA", unambiguous_dna ) 
print( '原始序列：',my_seq )

# How many As,Cs, Ts and Gs in this sequence? 统计AGCT数目
print( '# A: ',my_seq.count("A") )
print( '# C: ',my_seq.count("C") )
print( '# G: ',my_seq.count("G") )
print( '# T: ',my_seq.count("T") )

# To get the GC nucleotide content: 计算GC含量
from Bio.SeqUtils import GC
print( 'GC content: ', GC(my_seq), '%' )

# Transcription and translation # 转录
my_mRNA = my_seq.transcribe()
print('转录的mRNA：' ,my_mRNA )
my_pept = my_seq.translate() # 翻译
print( '翻译的蛋白质序列：',my_pept )

# Complement and reverse complement 
print( 'sequence: ',str(my_seq) )
print( 'comp.   : ',my_seq.complement() ) # 互补序列
print( 'revcomp : ',my_seq.reverse_complement() ) # 倒序互补序列

- translation

翻译时候可以设置参数 seq.translate(to_stop=True)来考虑到终止密码子的情况

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

# mRNA序列
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
print( messenger_rna )
print( messenger_rna.translate() )

# There are three RNA stop codons: UAG, UAA, and UGA.
# 本段序列是UGA

# Now, you may want to translate the nucleotides up to the first in frame stop codon, 
# and then stop (as happens in nature):
print( messenger_rna.translate(to_stop=True) )

- sequence record object

SeqRecord是SeqIO objects的一种基本的数据形式，和Seq object很像（注意之前的练习都是Seq object），但是SeqRecord还包含其他的属性。

seq - 序列本身，通常是Seq对象。
id - 用于标识序列的主要ID，是一个字符串。在大多数情况下，这是一个accession number。
name - 序列的 "通用 "名称/id，是一个字符串。在某些情况下，这与accession number相同，但也可能是clone name。类似于GenBank记录中的LOCUS ID。
description - 一个人类可读的描述或序列的表达性名称-一个字符串。
我们可以把SeqRecord看作是一个容器，它包括Seq的上述属性。

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord # 导入seqrecord

## create a simple SeqRecord object
simple_seq = Seq( "GATCAGGATTAGGCC" )
simple_seq_r = SeqRecord( simple_seq )
simple_seq_r.id = "AC12345"
simple_seq_r.description = "I am not a real sequence"
simple_seq_r.name = 'fake sequence name'

## print summary
print( simple_seq_r.id )
print( simple_seq_r.description )
print( simple_seq_r.name ) 
print( str(simple_seq_r.seq) )
#print( simple_seq_r.seq ) 输出和沙面的一样
print( simple_seq_r.seq.alphabet )

## translate the sequence
translated_seq = simple_seq_r.seq.translate()
print ( translated_seq )

- Sequence IO object

import os
from Bio import SeqIO

## save the sequence records to a list
allSeqRecords = []
allSeqIDs     = []

# fasta file is in current directory
pathToFile = os.path.join("./","multi_fasta_gb_header.fasta")
for seq_record in SeqIO.parse(pathToFile, "fasta"):
    allSeqRecords.append(seq_record)
    allSeqIDs.append(seq_record.id.split("|")[1])
    print( seq_record.id )
    # print 1st 10 nucleotides only
    print( str(seq_record.seq[:10]) )
    print( len(seq_record) )

## print out fun stuff about the sequences
print( "We found ", len(allSeqIDs), "sequences" )
print( "information on the third sequence:" )
ind = 2
seqRec = allSeqRecords[ind]
print( "\t", "GI number     ", allSeqIDs[ind] )
print( "\t", "full id       ", seqRec.id )
print( "\t", "num nucleo.   ", len(seqRec.seq) )
print( "\t", "1st 10 nucleo.", seqRec.seq[:10] )