将mRNA翻译为氨基酸序列,我们主要要解决以下几个问题:
1.建立类似于Python中字典的数据结构,将密码子与氨基酸一一对应起来;
2.按照每三个的步长进行翻译;
3.将翻译出来的每个氨基酸进行拼接;
下面我将代码贴出来,一步一步解释
#### make code table
code = c('F','F','L','L','L','L','L','L',
'L','L','L','M','V','V','V','V',
'S','S','S','S','P','P','P','P',
'T','T','T','T','A','A','A','A',
'Y','Y','*','*','H','H','Q','Q',
'N','N','K','K','D','D','E','E',
'C','C','*','W','R','R','R','R',
'S','S','R','R', 'G','G','G','G')
code_name = c('TTT','TTC','TTA','TTG','CTT','CTC','CTA','CTG',
'ATT','ATC','ATA','ATG','GTT','GTC','GTA','GTG',
'TCT','TCC','TCA','TCG','CCT','CCC','CCA','CCG',
'ACT','ACC','ACA','ACG','GCT','GCC','GCA','GCG',
'TAT','TAC','TAA','TAG','CAT','CAC','CAA','CAG',
'AAT','AAC','AAA','AAG','GAT','GAC','GAA','GAG',
'TGT','TGC','TGA','TGG','CGT','CGC','CGA','CGG',
'AGT','AGC','AGA','AGG','GGT','GGC','GGA','GGG')
names(code) = code_name
这一步是利用R语言中可以给每个向量命名的特点实现了将密码子与氨基酸一一对应起来的功能。我们可以测试一下:
> code["TTT"]
TTT
"F"
> code["CAC"]
CAC
"H"
下面我们就可以对一个mRNA序列进行翻译了,这里我们将翻译的过程写成一个function:
##### function for translating mRNA to protein #########
##### the first parameter is reading frame, the second parameter is a clean mRNA sequece
translate = function (frame, mrna)
{
prot = ''
i = frame
for (i in seq(frame,nchar(mrna), 3))
{
AA = code[substr(mrna,i,i+2)]
if (is.na(AA)) #翻译到最后剩少于3个氨基酸的时候,直接将AA
AA = "-" #赋值为“-”
prot = str_c(prot,AA)
}
return(prot)
}
下面调用测试一下:
> rna_seq = "ATCTCGATCTGAT"
> translate(1,rna_seq)
[1] "LSL*-"
> translate(2,rna_seq)
[1] "SRSD"
> translate(3,rna_seq)
[1] "LDL-"
由于一段未知序列翻译成氨基酸序列会有6种可能性,所以我们现在要添加一个可以将输入序列进行反向互补的功能,这个功能我们也是用两个简单的function实现的:
##### make complement sequece of the input sequence
complement = function(mrna)
{
tempseq = str_replace_all(mrna,"A","Q")
tempseq = str_replace_all(tempseq,"T","P")
tempseq = str_replace_all(tempseq,"C","M")
tempseq = str_replace_all(tempseq,"G","N")
tempseq = str_replace_all(tempseq,"Q","T")
tempseq = str_replace_all(tempseq,"P","A")
tempseq = str_replace_all(tempseq,"M","G")
tempseq = str_replace_all(tempseq,"N","C")
return(tempseq)
}
##### make reverse complement sequece of the input sequence
reverse = function(mrna)
{
str.matrix = str_split(mrna,"",simplify = T)
r_mRNA =paste(rev(str.matrix),collapse = "")
return(r_mRNA)
}
我们还是进行以下测试:
> rna_seq = "ATCTCGATCTGAT"
> reverse(rna_seq)
[1] "TAGTCTAGCTCTA"
> reverse_complement(rna_seq)
[1] "ATCAGATCGAGAT"
测试没问题,我们进行下一步,读入fasta格式文件,并将序列名称和序列分别存储,然后将序列中的数字和空格进行简单处理,并将序列中所有的小写字母转换为大写字母。
rna =""
while(length(line) != 0)
{
print(line)
if (str_detect(line, "^>"))
id = line
line = readLines(connec, n = 1, warn = F)
rna = str_c(rna, line)
}
close(connec)
fasta = data.frame(name = id,sequenes = rna)
mRNA = str_to_upper(str_replace_all(fasta$sequenes, "[0-9]| ", ""))
下面进行序列翻译,并将翻译好的序列用一个dataframe进行储存:
aa_seq = NULL
for (i in 1:3){aa_seq[i] = translate(i,mRNA)} # translate sense sequence
for (i in 1:3){aa_seq[i+3] = translate(i,reverse_complement(mRNA))} # translate reverse complement sequence
aa_name = paste(fasta$name ,"Frame", 1:6, sep = " ")
aa_result = data.frame(aa_name, aa_seq)
最后,写入文件,大功告成。
write.table(aa_result, "output.txt", sep = "\n", quote = F,
append = T, row.names = F, col.names = F)
网友评论