Biostrings ③ 多重序列比对

作者: 美式永不加糖 | 来源:发表于2019-06-14 01:02 被阅读2次

    MultipleAlignment 对象

    1. MultipleAlignment classes

    DNAMultipleAlignment , RNAMultipleAlignment, AAMultipleAlignment 可以将已比对的 DNA, RNA, 氨基酸序列作为一个单独的对象表现出来,并可以对其进行 masking.

    Masking: Converting a DNA sequence [A,C,G,T] (usually repetitive or of low quality) to the uninformative character state N or to lower case characters [a,c,g,t] (soft masking).

    2. 创建和Masking

    2.1 创建 MultipleAlignment 对象

    Biostrings 含有可以读取 clustaW, Phylip, Stolkholm 格式文件的函数。

    用包内自带数据尝试一波:

    origMAlign <- 
      readDNAMultipleAlignment(filepath = system.file("extdata",
                                                      "msx2_mRNA.aln",
                                                      package = "Biostrings"),
                               format = "clustal")
    
    phylipMAlign <- 
      readAAMultipleAlignment(filepath = system.file("extdata",
                                                     "Phylip.txt",
                                                     package = "Biostrings"),
                              format = "phylip")
    

    更改 origMAlign 的行名:

    origMAlign
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] -----TCCCGTCTCCGCAGC...ATTAAAAAAAAAAAAAAAAA gi|84452153|ref|N...
    # [2] --------------------...-------------------- gi|208431713|ref|...
    # [3] --------------------...-------------------- gi|118601823|ref|...
    # [4] --------------------...-------------------- gi|114326503|ref|...
    # [5] --------------------...-------------------- gi|119220589|ref|...
    # [6] --------------------...-------------------- gi|148540149|ref|...
    # [7] --------------CGGCTC...-------------------- gi|45383056|ref|N...
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- gi|213515133|ref|...
    rownames(origMAlign)
    # [1] "gi|84452153|ref|NM_002449.4|"   "gi|208431713|ref|NM_001135625."
    # [3] "gi|118601823|ref|NM_001079614." "gi|114326503|ref|NM_013601.2|" 
    # [5] "gi|119220589|ref|NM_012982.3|"  "gi|148540149|ref|NM_001003098."
    # [7] "gi|45383056|ref|NM_204559.1|"   "gi|213515133|ref|NM_001141603."
    rownames(origMAlign) <- c("Human","Chimp","Cow","Mouse","Rat",
                               "Dog","Chicken","Salmon")
    origMAlign
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] -----TCCCGTCTCCGCAGC...ATTAAAAAAAAAAAAAAAAA Human
    # [2] --------------------...-------------------- Chimp
    # [3] --------------------...-------------------- Cow
    # [4] --------------------...-------------------- Mouse
    # [5] --------------------...-------------------- Rat
    # [6] --------------------...-------------------- Dog
    # [7] --------------CGGCTC...-------------------- Chicken
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- Salmon
    

    展示细节:

    detail(origMAlign) 
    
    image

    2.2 Masking

    分别使用函数 rowmask()colmask()

    maskTest <- origMAlign 
    rowmask(maskTest) <- IRanges(start=1,end=3)  ## operations to mask rows
    rowmask(maskTest) 
    # NormalIRanges object with 1 range and 0 metadata columns:
    #           start       end     width
    #       <integer> <integer> <integer>
    #   [1]         1         3         3
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] ####################...#################### Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] --------------------...-------------------- Mouse
    # [5] --------------------...-------------------- Rat
    # [6] --------------------...-------------------- Dog
    # [7] --------------CGGCTC...-------------------- Chicken
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- Salmon
    colmask(maskTest) <- IRanges(start=c(1,1000),end=c(500,2343)) 
    colmask(maskTest)
    # NormalIRanges object with 2 ranges and 0 metadata columns:
    #           start       end     width
    #       <integer> <integer> <integer>
    #   [1]         1       500       500
    #   [2]      1000      2343      1344
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] ####################...#################### Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] ####################...#################### Mouse
    # [5] ####################...#################### Rat
    # [6] ####################...#################### Dog
    # [7] ####################...#################### Chicken
    # [8] ####################...#################### Salmon
    

    通过赋值 'NULL',移除被 mask 的行和列:

    rowmask(maskTest) <- NULL 
    rowmask(maskTest)
    # NormalIRanges object with 0 ranges and 0 metadata columns:
    #        start       end     width
    #    <integer> <integer> <integer>
    colmask(maskTest) <- NULL 
    colmask(maskTest)
    # NormalIRanges object with 0 ranges and 0 metadata columns:
    #        start       end     width
    #   <integer> <integer> <integer>
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] -----TCCCGTCTCCGCAGC...ATTAAAAAAAAAAAAAAAAA Human
    # [2] --------------------...-------------------- Chimp
    # [3] --------------------...-------------------- Cow
    # [4] --------------------...-------------------- Mouse
    # [5] --------------------...-------------------- Rat
    # [6] --------------------...-------------------- Dog
    # [7] --------------CGGCTC...-------------------- Chicken
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- Salmon
    

    通过参数 invert=TRUE ,可以设定要留下的行和列,并能得到与之前的操作完全一致的mask:

    rowmask(maskTest, invert=TRUE) <- IRanges(start=4,end=8)  
    ## IRange的参数也与之前的相反/互补
    rowmask(maskTest) 
    # NormalIRanges object with 1 range and 0 metadata columns:
    #           start       end     width
    #       <integer> <integer> <integer>
    #   [1]         1         3         3
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] ####################...#################### Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] --------------------...-------------------- Mouse
    # [5] --------------------...-------------------- Rat
    # [6] --------------------...-------------------- Dog
    # [7] --------------CGGCTC...-------------------- Chicken
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- Salmon
    colmask(maskTest, invert=TRUE) <- IRanges(start=501,end=999)  ##列的操作同上
    colmask(maskTest)
    # NormalIRanges object with 2 ranges and 0 metadata columns:
    #           start       end     width
    #       <integer> <integer> <integer>
    #   [1]         1       500       500
    #   [2]      1000      2343      1344
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] ####################...#################### Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] ####################...#################### Mouse
    # [5] ####################...#################### Rat
    # [6] ####################...#################### Dog
    # [7] ####################...#################### Chicken
    # [8] ####################...#################### Salmon
    

    移除mask:

    colmask(maskTest) <- NULL 
    rowmask(maskTest) <- NULL 
    

    针对参数 append= , 进行不同的设置和操作:

    The append argument takes union, replace or intersect to indicate how to combine the new value with rowmask(x),

    ## 默认状态,mask全部
    rowmask(maskTest) <- IRanges(start=1,end=3)
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] ####################...#################### Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] --------------------...-------------------- Mouse
    # [5] --------------------...-------------------- Rat
    # [6] --------------------...-------------------- Dog
    # [7] --------------CGGCTC...-------------------- Chicken
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- Salmon
    
    ## append="intersect", mask the rows that intersect with masked rows
    rowmask(maskTest,append="intersect") <- IRanges(start=2,end=5)
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] -----TCCCGTCTCCGCAGC...ATTAAAAAAAAAAAAAAAAA Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] --------------------...-------------------- Mouse
    # [5] --------------------...-------------------- Rat
    # [6] --------------------...-------------------- Dog
    # [7] --------------CGGCTC...-------------------- Chicken
    # [8] GGGGGAGACTTCAGAAGTTG...-------------------- Salmon
    
    ## append="replace"
    rowmask(maskTest,append="replace") <- IRanges(start=5,end=8) 
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] -----TCCCGTCTCCGCAGC...ATTAAAAAAAAAAAAAAAAA Human
    # [2] --------------------...-------------------- Chimp
    # [3] --------------------...-------------------- Cow
    # [4] --------------------...-------------------- Mouse
    # [5] ####################...#################### Rat
    # [6] ####################...#################### Dog
    # [7] ####################...#################### Chicken
    # [8] ####################...#################### Salmon
    
    
    ## append="union"
    rowmask(maskTest,append="union") <- IRanges(start=7,end=8) 
    maskTest
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] -----TCCCGTCTCCGCAGC...ATTAAAAAAAAAAAAAAAAA Human
    # [2] --------------------...-------------------- Chimp
    # [3] --------------------...-------------------- Cow
    # [4] --------------------...-------------------- Mouse
    # [5] ####################...#################### Rat
    # [6] ####################...#################### Dog
    # [7] ####################...#################### Chicken
    # [8] ####################...#################### Salmon
    

    函数 maskMotif() 也可以对 MultipleAlignment 对象进行操作,用于在列的水平上mask相同序列的出现。

    tataMasked <- maskMotif(origMAlign, "TATA") 
    colmask(tataMasked)
    # NormalIRanges object with 5 ranges and 0 metadata columns:
    #           start       end     width
    #       <integer> <integer> <integer>
    #   [1]       811       814         4
    #   [2]      1180      1183         4
    #   [3]      1186      1191         6
    #   [4]      1204      1207         4
    #   [5]      1218      1221         4
    

    函数 maskGaps() 可以基于设定的参数在列的水平上进行mask.

    min.fraction A value in [0,1] that indicates the minimum fraction needed to call a gap in
    the consensus string (default is 0.5).
    min.block.width A positive integer that indicates the minimum number of consecutive gaps
    to mask, as defined by min.fraction (default is 4).

    autoMasked <- maskGaps(origMAlign, min.fraction=0.5, min.block.width=4) 
    autoMasked
    # DNAMultipleAlignment with 8 rows and 2343 columns
    #      aln                                         names               
    # [1] ####################...#################### Human
    # [2] ####################...#################### Chimp
    # [3] ####################...#################### Cow
    # [4] ####################...#################### Mouse
    # [5] ####################...#################### Rat
    # [6] ####################...#################### Dog
    # [7] ####################...#################### Chicken
    # [8] ####################...#################### Salmon
    

    as.matrix() 可以应用在 MultipleAlignment 对象上,将其转换为矩阵。

    full = as.matrix(origMAlign) 
    dim(full)
    # [1]    8 2343
    partial = as.matrix(autoMasked) 
    dim(partial)
    # [1]    8 1143
    

    3. 分析应用

    3.1 完成mask的序列的碱基频率

    alphabetFrequency(autoMasked)
    #        A   C   G   T M R W S Y K V H D B N   - + .
    # [1,] 260 351 296 218 0 0 0 0 0 0 0 0 0 0 0  18 0 0
    # [2,] 171 271 231 128 0 0 0 0 0 0 0 0 0 0 3 339 0 0
    # [3,] 277 360 275 209 0 0 0 0 0 0 0 0 0 0 0  22 0 0
    # [4,] 265 343 277 226 0 0 0 0 0 0 0 0 0 0 0  32 0 0
    # [5,] 251 345 287 229 0 0 0 0 0 0 0 0 0 0 0  31 0 0
    # [6,] 160 285 241 118 0 0 0 0 0 0 0 0 0 0 0 339 0 0
    # [7,] 224 342 273 190 0 0 0 0 0 0 0 0 0 0 0 114 0 0
    # [8,] 268 289 273 262 0 0 0 0 0 0 0 0 0 0 0  51 0 0
    

    3.2 计算一致性矩阵 (consensus matrix)

    consensusMatrix(autoMasked, baseOnly=TRUE)[, 1138:1143] 
    #       [,1] [,2] [,3] [,4] [,5] [,6]
    # A        0    1    0    0    1    0
    # C        5    4    4    0    1    1
    # G        1    1    1    0    0    1
    # T        0    0    1    6    4    4
    # other    2    2    2    2    2    2
    

    3.3 提取一致性序列 (consensus string) & Consensus Views

    substr(consensusString(autoMasked),80,125) 
    # [1] "####CRGABAMGTCA-YRGCTTCTCYGTSCAWAGGCRRTGRCYTGT"
    consensusViews(autoMasked) 
    #   Views on a 2343-letter BString subject
    # subject: ----------------------VWVMKYY...----------------------------
    # views:
    #     start  end width
    # [1]    84  325   242 [CRGABAMGTCA-YRGCTTCTCY...TCSGCYGGSGCYRYCCTGSGG]
    # [2]   330  332     3 [CCR]
    # [3]   338 1191   854 [CTGCTGCTGYCGGGVCACGGCG...GTTTTTATGTATAAATATATA]
    # [4]  1198 1241    44 [ATAAAATATAAKAC--TTTTTATAYRSCARATGTAAAAATTCAA]
    

    3.4 聚类分析&树形图

    用masking前的序列:

    sdist <- stringDist(as(origMAlign,"DNAStringSet"), method="hamming") 
    clust <- hclust(sdist, method = "single") 
    plot(clust)
    

    嗯,得到了一张hin奇怪的图。

    如果用masking后的序列:

    sdist <- stringDist(as(autoMasked,"DNAStringSet"), method="hamming") 
    clust <- hclust(sdist, method = "single")
    plot(clust) 
    

    改善,但依旧——

    当然小鼠、大鼠和人类的高相似性是因为这三个物种的基因组序列整体更长,而第一张图里的奇怪关系是因为 rodent/mouse/human 之间的 extra "length", 而不是保守区域的相似性。

    4. 导出文件

    MultipleAlignment 对象转换为 DNAStringSet ,导出为 fasta 文件。

    DNAStr = as(origMAlign, "DNAStringSet")
    writeXStringSet(DNAStr, file="myFile.fa") 
    

    References


    最后,向大家隆重推荐生信技能树的一系列干货!

    1. 生信技能树全球公益巡讲:https://mp.weixin.qq.com/s/E9ykuIbc-2Ja9HOY0bn_6g
    2. B站公益74小时生信工程师教学视频合辑:https://mp.weixin.qq.com/s/IyFK7l_WBAiUgqQi8O7Hxw
    3. 招学徒:https://mp.weixin.qq.com/s/KgbilzXnFjbKKunuw7NVfw

    相关文章

      网友评论

        本文标题:Biostrings ③ 多重序列比对

        本文链接:https://www.haomeiwen.com/subject/lleyfctx.html