美文网首页
rMATS : RNA-seq 可变剪切分析 (可比较两有重复

rMATS : RNA-seq 可变剪切分析 (可比较两有重复

作者: 重拾生活信心 | 来源:发表于2023-07-21 17:18 被阅读0次

    Reference :

    Installation

    cd <my.bio-tools.path>
    git clone https://github.com/Xinglab/rmats-turbo.git
    cd rmats-turbo
    ./build_rmats --conda
    conda activate {my.bio-tools.path}/rmats-turbo/conda_envs/rmats
    rmats.py 
    
    cd <my.bio-tools.path>
    git clone https://github.com/Xinglab/rmats2sashimiplot.git
    cd rmats2sashimiplot/
    bash 2to3.sh 
    pip uninstall rmats2sashimiplot
    python ./setup.py install
    rmats2sashimiplot
    
    

    Usage

    • Take bam files & gtf file as input.
    #-------------------------------------------
    # generate bam files from STAR, arguments from script of rmats.
    ## --s1 --s2 take fastq files as input
    wd=
    output=$wd/results
    index=refdata/cellranger_ref/hg38/star
    gtf=refdata/cellranger_ref/hg38/genes/genes.gtf
    cores=19
    cd $wd 
    for i in `cat SRR_Acc_List.txt`
    do
    STAR --readFilesIn $output/01cleandata/${i}_1_paired_clean.fastq.gz $output/01cleandata/${i}_2_paired_clean.fastq.gz \
    --chimSegmentMin 2 --outFilterMismatchNmax 3 \
    --alignEndsType EndToEnd \
    --runThreadN $cores \
    --outSAMstrandField intronMotif \
    --outSAMtype BAM Unsorted \  # Failed with BAM SortedByCoordinate
    --alignSJDBoverhangMin 6 \
    --alignIntronMax 299999 \
    --genomeDir $index \
    --sjdbGTFfile $gtf \
    --outFileNamePrefix  $output/02alignment-hg38/${i}_ \
    --readFilesCommand zcat
    
    
    samtools sort -@ $cores $output/02alignment-hg38/${i}_Aligned.out.bam -o $output/02alignment-hg38/${i}.bam
    samtools index -@ $cores $output/02alignment-hg38/${i}.bam
    
    
    done
    
    #-------------------------------------------
    vim rmats.sh
    #!/bin/bash
    echo "/path/to/bam/ctr_rep1.bam,/path/to/bam/ctr_rep2.bam" > b1.txt 
    echo "/path/to/bam/treat_rep1.bam,/path/to/bam/treat_rep2.bam" > b2.txt
    gtf=/path/to/gtf/xxxx.gtf
    out_dir=/path/to/output
    tmp_dir=/path/to/tmp
    core=16
    read_length=150
    
    rmats.py --b1 b1.txt --b2 b2.txt  --od $out_dir --tmp $tmp_dir \
    --readLength $read_length --variable-read-length --nthread $core
    #--------------------------------------------
    
    python rmats2sashimiplot/src/rmats2sashimiplot/rmats2sashimiplot.py \
    -o plot-RI-1-2/ \
    --l1 ctr \ #label the sample1
    --l2 treat \
    --b1 /path/to/bam/ctr_rep1.bam,/path/to/bam/ctr_rep2.bam \
    --b2 /path/to/bam/treat_rep1.bam,/path/to/bam/treat_rep2.bam \
    -e /path/to/output/RI.MATS.JC.txt \
    --event-type RI
    
    python rmats.py -h
    usage: rmats.py [options]
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --gtf GTF             An annotation of genes and transcripts in GTF format
      --b1 B1               A text file containing a comma separated list of the
                            BAM files for sample_1. (Only if using BAM)
      --b2 B2               A text file containing a comma separated list of the
                            BAM files for sample_2. (Only if using BAM)
      --s1 S1               A text file containing a comma separated list of the
                            FASTQ files for sample_1. If using paired reads the
                            format is ":" to separate pairs and "," to separate
                            replicates. (Only if using fastq)
      --s2 S2               A text file containing a comma separated list of the
                            FASTQ files for sample_2. If using paired reads the
                            format is ":" to separate pairs and "," to separate
                            replicates. (Only if using fastq)
      --od OD               The directory for final output from the post step
      --tmp TMP             The directory for intermediate output such as ".rmats"
                            files from the prep step
      -t {paired,single}    Type of read used in the analysis: either "paired" for
                            paired-end data or "single" for single-end data.
                            Default: paired
      --libType {fr-unstranded,fr-firststrand,fr-secondstrand}
                            Library type. Use fr-firststrand or fr-secondstrand
                            for strand-specific data. Only relevant to the prep
                            step, not the post step. Default: fr-unstranded
      --readLength READLENGTH
                            The length of each read
      --variable-read-length
                            Allow reads with lengths that differ from --readLength
                            to be processed. --readLength will still be used to
                            determine IncFormLen and SkipFormLen
      --anchorLength ANCHORLENGTH
                            The "anchor length" or "overhang length" used when
                            counting the number of reads spanning splice
                            junctions. A minimum number of "anchor length"
                            nucleotides must be mapped to each end of a given
                            junction. The minimum value is 1 and the default value
                            is set to 1 to make use of all possible splice
                            junction reads.
      --tophatAnchor TOPHATANCHOR
                            The "anchor length" or "overhang length" used in the
                            aligner. At least "anchor length" NT must be mapped to
                            each end of a given junction. The default is 1. (Only
                            if using fastq)
      --bi BINDEX           The directory name of the STAR binary indices (name of
                            the directory that contains the SA file). (Only if
                            using fastq)
      --nthread NTHREAD     The number of threads. The optimal number of threads
                            should be equal to the number of CPU cores. Default: 1
      --tstat TSTAT         The number of threads for the statistical model. If
                            not set then the value of --nthread is used
      --cstat CSTAT         The cutoff splicing difference. The cutoff used in the
                            null hypothesis test for differential splicing. The
                            default is 0.0001 for 0.01% difference. Valid: 0 <=
                            cutoff < 1. Does not apply to the paired stats model
      --task {prep,post,both,inte,stat}
                            Specify which step(s) of rMATS to run. Default: both.
                            prep: preprocess BAMs and generate a .rmats file.
                            post: load .rmats file(s) into memory, detect and
                            count alternative splicing events, and calculate P
                            value (if not --statoff). both: prep + post. inte
                            (integrity): check that the BAM filenames recorded by
                            the prep task(s) match the BAM filenames for the
                            current command line. stat: run statistical test on
                            existing output files
      --statoff             Skip the statistical analysis
      --paired-stats        Use the paired stats model
      --novelSS             Enable detection of novel splice sites (unannotated
                            splice sites). Default is no detection of novel splice
                            sites
      --mil MIL             Minimum Intron Length. Only impacts --novelSS
                            behavior. Default: 50
      --mel MEL             Maximum Exon Length. Only impacts --novelSS behavior.
                            Default: 500
      --allow-clipping      Allow alignments with soft or hard clipping to be used
      --fixed-event-set FIXED_EVENT_SET
                            A directory containing fromGTF.[AS].txt files to be
                            used instead of detecting a new set of events```
    

    Output

    • rMATS 能识别5种Altenative Splicing events (AS : SE \ RI \ A5SS\A3SS\MXE,对于每种AS,都会有给出比对到exon junction的count。[AS].MATS.JC.txt or [AS].MATS.JCEC.txt

    • 23列,值得关注 :PValueIncLevelDifference=average(IncLevel1) - average(IncLevel2)
      。有几列描述这个剪切事件的染色体位置,名称根据不同AS有点不同。

    • 类似于gene-level的RNA-seq 分析,需要用长度进行 normalization。在rMATS中,用IncLevel (Inclusion Level)对 AS进行定量描述。 image.png
    • 用likelihood-ratio test ,将sample1和sample2的某event的IncLevel 的差值进行检验。(--cstat)


      I/S
    intro : 三个外显子,虚线为intron。看比对到junction 的reads,如果 1-3 ,则跳过了中间,如果 1-2 ;2-3 这个连接部位被检测到了,就表明包含了中间外显子。 Inclusion level ,不同 AS ,长度算法点不一样 paper-stat vis

    相关文章

      网友评论

          本文标题:rMATS : RNA-seq 可变剪切分析 (可比较两有重复

          本文链接:https://www.haomeiwen.com/subject/hcytpdtx.html