美文网首页RNASeq 数据分析
tsRNA分类分析:seqac包

tsRNA分类分析:seqac包

作者: 信你个鬼 | 来源:发表于2023-09-02 18:51 被阅读0次

    标题:tsRNA分类分析:seqac包

    tRNA的二级结构如下:

    image-20230903154931207.png

    根据其二级结构,tsRNA分类如下:

    早期研究报道,tiRNAs通过血管生成素(ANG)切割成熟tRNA产生,而tRFs来源于Dicer或ANG切割成熟tRNA或tRNA前体。根据成熟或前体tRNA转录本中的裂解位点,可以分为以下几类:

    • tRF-1:来源于前体tRNA,在3' trailer部由RNase Z or ELAC2酶裂解产生。
    • tRF-3:来自成熟体tRNA的3'端
    • tRF-5:来自成熟体tRNA的5'端
    • tRF-i: 来自成熟体tRNA的中间区域
    • tiRNAs:are generated by specific cleavage by angiogenin in the anticodon loops of mature tRNAs
    image-20230619222644080.png

    今天主要给大家介绍一个软件,可以根据tsRNA的序列以及tRNA的二级结构文件来鉴定tsRNA的分类,软件为Seqpac。

    Seqpac主要用于small RNA Sequence数据分析,教程最新版链接:

    https://www.bioconductor.org/packages/release/workflows/vignettes/seqpac/inst/doc/seqpac_-_A_guide_to_sRNA_analysis_using_sequence-based_counts.html

    软件分析输入为fq文件,seqpac workflow的核心对象为:

    The PAC object: 这个对象有三个部分,Phenotypic information (P), Annotations (A) and Counts (C)。

    这个PAC对象有两种格式, S3 list或者S4,并且可以使用函数进行互转,as(pac-object, "list")将S4转为S3, as.PAC(pac-object)将S3转为S4。

    The Targeting objects:用于访问PAC对象内的特定数据子集。

    seqpac的分析步骤主要为四步:

    • Generate a PAC object from fastq-files
    • Preprocess the PAC object
    • Annotate the PAC object
    • Analyze the PAC object.

    示例数据

    seqpac包中自带了一个示例数据,6 sRNA fastq文件。

    image-20230823222638876.png

    样本组织类型为果蝇胚胎,测序平台为Illumina NextSeq 500,测序读长为75个碱基。这里抽取原始fq的部分read进行演示。

    安装包:

    ## Installation
    devtools::install_github("Danis102/seqpac", upgrade="never", 
     build_manual=TRUE, build_vignettes=TRUE)
    

    首先快速一览整个包的分析步骤:

    # Setup
     library(seqpac)
     
     sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
     fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
                     full.names = TRUE)
     ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
                             package = "seqpac", mustWork = TRUE)
     
     # Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
     count_list  <- make_counts(fastq, plot = FALSE, 
                                 parse="default_neb", trimming="seqpac")
     pac <- make_PAC(count_list$counts)
     pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
     
     # Preprocess PAC-object and creat means
     pac <- PAC_norm(pac)                        # Default normalization is cpm
     pac <- PAC_filter(pac, size=c(15,60))       # Here, filter on sequence length
     pac <- PAC_summary(pac, norm = "cpm", type = "means", 
                        pheno_target=list("groups", unique(pheno(pac)$groups)))
     
     # Quickly align (annotate) and plot PAC-object
     map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
     plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
     plts[[13]]
    

    详细步骤

    1.首先加载fq数据

    # Use the ShortRead-package to randomly sample a fastq file
    sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
    fq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE, full.names = TRUE)
    fq
    [1] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq1.fastq.gz"
    [2] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq2.fastq.gz"
    [3] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq3.fastq.gz"
    [4] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq4.fastq.gz"
    [5] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq5.fastq.gz"
    [6] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq6.fastq.gz"
    

    2.生成count矩阵

    使用make_counts函数

    count_list <- make_counts(input=fq, trimming="seqpac", threads=16, parse="default_neb")
    
    # 查看count_list的结构
    str(count_list, max.level = 1)
    
    List of 3
     $ counts         :'data.frame':    8300 obs. of  6 variables:
     $ progress_report:'data.frame':    6 obs. of  19 variables:
     $ evidence_plots :List of 2
    

    3.构造样本表型矩阵

    使用make_pheno函数

    Sample_ID <- colnames(count_list$counts)
    pheno_fl <- data.frame(Sample_ID=Sample_ID,
                        Treatment=c(rep("heat", times=1), 
                                    rep("control", times=2)),
                        Batch=rep(c("1", "2", "3"), times=1))
    
    pheno <- make_pheno(pheno=pheno_fl, 
                        progress_report=count_list$progress_report, 
                        counts=count_list$counts)
    

    每个样本的表型为一个数据框:

    image-20230903161349444.png

    4.构造PAC对象

    ###--------------------------------------------------------------------- 
    ## Generate PAC object (default S4)
    pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
    pac_master
    
    PAC object with: 
       6 samples
       8300 sequences
       mean total counts: 4410 (min:4346/max:4476)
       best sequence: 230 mean counts
       worst sequence: 0 mean counts
    

    至此,数据的分析对象已经构建好了,PAC对象,后续的分析基本上都是对这个PAC进行操作。

    tRNA class analysis

    我们本次主要关注tRNA class analysis,此包的其他分析可以见分析教程。

    首先加载包中内置的PAC对象

    load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                     package = "seqpac", mustWork = TRUE))
                     
    # 查看加载的对象
    pac
    

    比对:

    这个包内置比对工具为bwa,此处mismatches为0,还讲序列上下增加了3bp,参考序列为tRNA序列,来自GtRNAdb数据库:http://gtrnadb.ucsc.edu/

    ###---------------------------------------------------------------------
    ## tRNA analysis in Seqpac 
    
    # First create an annotation blank PAC with group means
    anno(pac) <- anno(pac)[,1, drop=FALSE]
    pac_trna <- PAC_summary(pac, norm = "cpm", type = "means", 
                            pheno_target=list("stage"), merge_pac = TRUE)
    
    # Then reannotate only tRNA using the PAC_mapper function
    ref <- "/some/path/to/trna/fasta/recerence.fa"  
    # or use the provided fasta
    ref <- system.file("extdata/trna2", "tRNA2.fa", 
                              package = "seqpac", mustWork = TRUE)          
    
    map_object <- PAC_mapper(pac_trna, ref=ref, N_up = "NNN", N_down = "NNN", 
                             mismatches=0, threads=6, report_string=TRUE,
                             override=TRUE)
    

    比对完后的map_object对象是一个list对象,如下:

    image-20230903174938936.png

    每一个tRNA上比对到的tsRNA如下,每一行为一个tsRNA的序列:

    image-20230903175956563.png

    先看看比对结果,PAC_covplot可以绘制每一个tRNA上的短片段tsRNA的比对情况:

    # Hint: By adding N_up ad N_down you can make sure that modified fragments (like
    # 3' -CAA in mature tRNA are included).
    
    ## Plot tRNA using xseq=TRUE gives you the reference sequence as X-axis:
    # (OBS! Long reference will not show well.)                     
    cov_tRNA <- PAC_covplot(pac_trna, map_object, 
                            summary_target = list("cpmMeans_stage"), 
                            xseq=TRUE, style="line", 
                            color=c("red", "black", "blue"))
    

    选一个tRNA查看,这里对每类stage进行的绘制:

    image-20230903175253710.png

    也可以指定某个tRNA绘制查看:

    # Targeting a single tRNA using a summary data.frame 
    PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"), 
                map_target="tRNA-Lys-CTT-1-9")
    

    结果如下:

    image-20230903175431933.png

    此外,还可以筛选出tRNA上比对read最多的进行展示:

    # Find tRNAs with many fragments
    # 1st extract number of rows from each alignment 
    n_tRFs <- unlist(lapply(map_object, function(x){nrow(x[[2]])}))
    # The test data is highly down an filterd sampled, but still some with tRNA have
    # a few alignment
    table(n_tRFs)
    names(map_object)[n_tRFs>2]
    # Lets select a few of them and plot them
    selct <- names(map_object)[n_tRFs>2][c(1, 16, 27, 37)]
    cov_plt <- PAC_covplot(pac_trna, map=map_object, 
                           summary_target= list("cpmMeans_stage"), 
                           map_target=selct)
                           
    cowplot::plot_grid(plotlist=cov_plt, nrow=2, ncol=2)
    

    选取的四个tRNA绘图结果如下:

    image-20230903180248037.png

    接下来就是解析每个tsRNA的分类了:

    解理tsRNA的分类,如5 '和3 'halves等,更复杂。这种分类涉及到tRNA环(A-loop、anticodon loop和T-loop)中可能发生切割的位置。有一些外部软件可以从二级结构模型中预测环路。预测tRNA二级结构最流行的工具之一是tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)。这个软件的输出是一个ss文件,它以以下格式存储循环信息:

    image-20230903180717224.png

    Ss-files for most common genomes are readily available for download at http://gtrnadb.ucsc.edu/

    根据ss文件,结合比对结果就可以使用tRNA_class函数对tsRNA进行分类注释了:

    ###---------------------------------------------------------------------
    ## Generate range types using a ss-file
    
    # Download ss object from GtRNAdb 
    # ("http://gtrnadb.ucsc.edu/")
    ss_file <- "/some/path/to/GtRNAdb/trna.ss"
    # or for testing use the provided ss-file in secpac
    ss_file <- system.file("extdata/trna2", "tRNA2.ss", 
                              package = "seqpac", mustWork = TRUE)
    
    # Classify fragments according to loop cleavage (small loops are omitted)       
    map_object_ss <- map_rangetype(map_object, type="ss", 
                                   ss=ss_file, min_loop_width=4)            
    
    # Remove reference tRNAs with no hits
    map_object_ss <-  map_object_ss[
      !unlist(lapply(map_object_ss, function(x){x[[2]][1,1] == "no_hits"}))]
    
    # Now we have quite comprehensive tRNA loop annotations
    names(map_object_ss)
    map_object_ss[[1]][[2]]
    
    # Don't forget to check ?map_rangetype to obtain more options 
    
    
    ###---------------------------------------------------------------------
    ## Function classifying 5'-tRF, 5'halves, i-tRF, 3'-tRF, 3'halves
    
    # Does all tRNAs have 3 loops?
    table(unlist(lapply(map_object_ss, function(x){unique(x$Alignments$n_ssloop)})))
    
    # Set tolerance for classification as a terminal tRF
    tolerance <- 5  # 2 nucleotides from start or end of full-length tRNA)
    
    ### Important:
    # We set N_up and N_down to "NNN" in the PAC_mapper step. To make sure
    # that we have a tolerance that include the original tRNA sequence
    # we set terminal= 2+3 (5).  
    ## tRNA classifying function 
    # Apply the tRNA_class function and make a tRNA type column
    pac_trna <- tRNA_class(pac_trna, map=map_object_ss, terminal=tolerance)
    ## 
    ## -- Anno target was specified, will retain: 29 of 9131 seqs.
    anno(pac_trna)$type <- paste0(anno(pac_trna)$decoder, anno(pac_trna)$acceptor)
    anno(pac_trna)[1:5, 1:4]    
    

    得到的分类结果如下:

    image-20230903181248338.png

    查看每种tsRNA类别的占比:

    ###---------------------------------------------------------------------
    ## Plot tRNA types
    
    # Now use PAC_trna to generate some graphs based on grand means
    trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL,
      join = TRUE, top = 15, log2fc = TRUE,
      pheno_target = list("stage", c("Stage1", "Stage3")), 
      anno_target_1 = list("type"),
      anno_target_2 = list("class"))
      
    cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Grand_means,
                   trna_result$plots$Log2FC_Anno_1,
                   trna_result$plots$Percent_bars$Grand_means,
                   nrow=1, ncol=3)
    

    结果如下:

    image-20230903181537012.png

    每个tsRNA的最终注释结果如下:

    anno_result <- anno(pac_trna)
    
    image-20230903182329699.png

    常见文献中tsRNA分类结果展示可以如下:

    image-20230903182439352.png

    ref:https://doi.org/10.1186/s12958-022-00978-3 Reproductive Biology and Endocrinology (2022) 20:106

    根据这个分类注释做个性化绘制吧~

    相关文章

      网友评论

        本文标题:tsRNA分类分析:seqac包

        本文链接:https://www.haomeiwen.com/subject/dqnxvdtx.html