美文网首页ncRNA生信non-coding RNA
Docker封装生物信息学lncRNA流程

Docker封装生物信息学lncRNA流程

作者: 儒雅随和王小板 | 来源:发表于2020-10-21 15:07 被阅读0次

    一.流程介绍

    后续会开发lncRNA、smallRNA、circRNA流程
    首先是lncRNA流程,主要分析模块为:

    • 测序数据质控;
    • 序列比对分析;
    • 转录本组装;
    • lncRNA分析;
    • 表达量分析;
    • 表达量差异分析;
    • lncRNA靶基因分析;
    • 富集分析

    1.使用基础ubuntu镜像:

    docker run -itd --name lnc   ubuntu:18.04
    

    2.安装json环境配置环境
    换源:

    docker cp sources.list 1a73843e30e5:/etc/apt
    
    apt-get update
    apt-get upgrade
    

    lncRNA流程前几个模块与有参流程模块脚本基本一样,软件使用相同,直接复制原来的软件到lnc环境中,并安装即可。


    图片.png

    3.测试脚本
    01-qc.sh、01-qcStat.sh、02-alignHST.sh、02-alignSummary.sh
    使用FastQC质控与Hisat2比对;
    使用stringtie进行转录本拼接与组装,分为得到mRNA序列与注释文件与lncRNA序列与注释文件
    03-assemblyStie.sh脚本:

    for sample in $samplenames
    do
        delay stringtie
        cd $sample
        mkdir AssemblyStielncRNA
        cd AssemblyStielncRNA
    
        algn_bam=../AlignmentHST/$sample.bam
    
        echo "Transcript assembly for $sample at $(date)"
    
        $stringtie -p $stringtie_threads -G $lnc_gtf \
        -o transcripts.gtf -A $sample.exp -l $sample $algn_bam &
    
        cd ../
    
        mkdir AssemblyStieRNA
        cd AssemblyStieRNA
    
        algn_bam=../AlignmentHST/$sample.bam
    
        echo "Transcript assembly for $sample at $(date)"
    
        $stringtie -p $stringtie_threads -G $genome_gtf \
        -o transcripts.gtf -A $sample.exp -l $sample $algn_bam &
    
        cd ../../
    
    done
    

    根据lncRNA注释文件与mRNA注释文件,拼接得到每个样本的转录本注释文件

    mkdir MergedGTFlncRNA_Stie
    find */AssemblyStielncRNA/ -name "transcripts.gtf" -type f -printf "%p \n" >transcripts_file-lncRNA.list
    $stringtie --merge -p $stringtie_threads -G $lnc_gtf -o MergedGTFlncRNA_Stie/mergedlncRNA.gtf transcripts_file-lncRNA.list
    # #$gffcompare -r $genome_gtf -G -o MergedGTFlncRNA_Stie/merged_compare MergedGTFlncRNA_Stie/mergedlncRNA.gtf
    
     mkdir MergedGTFRNA_Stie
    find */AssemblyStieRNA/ -name "transcripts.gtf" -type f -printf "%p \n" >transcripts_file-mRNA.list
    $stringtie --merge -p $stringtie_threads -G $genome_gtf -o MergedGTFRNA_Stie/mergedmRNA.gtf transcripts_file-mRNA.list
    
    $cufflinks/gffread -w MergedGTFlncRNA_Stie/transcripts-lncRNA.fa -g $genome_fasta MergedGTFlncRNA_Stie/mergedlncRNA.gtf &
    
    $cufflinks/gffread -w MergedGTFRNA_Stie/transcripts-mRNA.fa -g $genome_fasta MergedGTFRNA_Stie/mergedmRNA.gtf &
    
    

    分别得到所有样本lncRNA与mRNA总注释文件,下一步计算表达量

    for sample in $samplenames
    do
        delay stringtie
        cd $sample
        mkdir ExpressionStielncRNA
        cd ExpressionStielncRNA
    
        algn_bam=../AlignmentHST/$sample.bam 
    
        echo "Start estimating for $sample at $(date)"
    
        $stringtie -e -B -p $stringtie_threads -G ../../MergedGTFlncRNA_Stie/mergedlncRNA.gtf \
        -o $sample.gtf -A $sample.exp $algn_bam &
    
        cd ../
    
        mkdir ExpressionStieRNA
        cd ExpressionStieRNA
    
        algn_bam=../AlignmentHST/$sample.bam 
    
        echo "Start estimating for $sample at $(date)"
    
        $stringtie -e -B -p $stringtie_threads -G ../../MergedGTFRNA_Stie/mergedmRNA.gtf \
        -o $sample.gtf -A $sample.exp $algn_bam &
    
        cd ../../
    
    done
    

    得到结果如下:


    图片.png

    4.接下来使用DEseq2或edger进行差异分析,可参考有参流程分析:
    (1)有生物学重复的同时满足 log2FoldChange 绝对值大于等于 1 且 pvalue 小于 0.05 的基因 significant 标为 yes,否则标为 no;(2)没有生物学重复的同时满足 log2FoldChange 绝对值大于等于 1 且 pvalue 小于 1 的基因 significant 标为 yes,否则标为 no。同时我们会根据初始运算结果,对差异阈值做进一步定义或者给予建议,例如:1. fold change 阈值更改;2. 统计检验阈值(pvalue)更改。满足显著性差异阈值的差异基因在分析结果中通过 significant 参数一列为 yes 显现。
    5.使用clusterProfiler包 进行KEGG与GO富集分析;

    1. lncRNA靶基因分析;

    10.23更新:

    LncRNA预测分析:
    使用软件:
    CNCI;
    CPC2;
    LGC;

    CNCI

    CNCI是中科院计算所赵屹团队开发的一款从转录组中分析编码RNA和非编码RNA的软件。赵屹团队在非编码RNA领域做了很多出色的工作,建立了目前最权威的非编码RNA数据库NONCODE。
    github地址:

    https://github.com/www-bioinfo-org/CNCI

    安装:

    git clone [git@github.com](mailto:git@github.com):www-bioinfo-org/CNCI.git
    
    cd CNCI
    
    unzip libsvm-3.0.zip
    
    cd libsvm-3.0
    
    make
    
    cd ..
    

    或者下载到服务器,unzip解压

    基本命令为:

    python CNCI_package/CNCI.py -f novel.fasta -o CNCI_out -m ve -p 4
    参数说明:
    -f 输入fasta文件(可以使用-g参数输入GTF文件,但是同时需要使用-d参数指定参考基因组的目录)
    -o 输出结果目录
    -m 指定模式,脊椎动物选择ve,植物选择pl
    -p 指定CPU核数
    

    更多用法参看帮助文档
    结果文件:


    图片.png

    CPC2

    CPC2 下载安装

    https://github.com/gao-lab/CPC2_standalone/releases/tag/v1.0.1

    安装:

    gzip -dc CPC2-beta.tar.gz | tar xf -
    
    cd CPC2-beta
    export CPC_HOME="$PWD"
    cd libs/libsvm
    gzip -dc libsvm-3.18.tar.gz | tar xf -
    cd libsvm-3.18
    make clean && make
    cd $CPC_HOME
    bin/CPC2.py -i (input_seq) -o (result_in_table)
    
    图片.png

    运行命令:

    python3 CPC2.py  -i transcripts-lncRNA.fa
    

    结果展示:


    图片.png

    LGC

    LGC是由北京基因组所基于python2 开发的一款快速lncRNA预测工具,该工具通过ORF(开放阅读框)长度和GC含量间的关系进行相关运算来鉴定lncRNA。LGC最大特点是能够基于跨物种策略进行lncRNA发掘。因此LGC可以支持有参数据和无参数据 进行lncRNA鉴定。在区分从植物到哺乳动物的不同物种的lncRNA和蛋白编码RNA方面,LGC鉴定的准确率高达90%。


    图片.png

    下载与安装:

     wget [http://bigd.big.ac.cn/biocode/tools/4/releases/1.0/file/download?filename=lgc-1.0.tar.gz](http://bigd.big.ac.cn/biocode/tools/4/releases/1.0/file/download?filename=lgc-1.0.tar.gz)
    
    tar zxf lgc-1.0.tar.gz
    
    chmod 755 lgc-1.0.py
    
    #确保conda,lgc-1.0.py在环境变量里
    
    lgc-1.0.py input.fasta output.txt
    
    # Or
    
    python lgc-1.0.py input.fasta output.txt
    
    

    结果展示:


    图片.png
    图片.png

    二.nextflow使用

    图片.png

    这里使用 Nextflow 作为流程搭建工具,它有着很多强大的功能,第一次接触,首先先安装:

    wget -qO- [https://get.nextflow.io](https://get.nextflow.io/) | bash
    

    或者使用conda安装

    conda install nextflow
    

    官方参考文档:
    https://www.nextflow.io/docs/latest/getstarted.html
    这里可以直接使用
    直接使用已有lncRNA流程进行测试,参考地址
    或者使用下面代码使用rna流程:

    nextflow pull nf-core/rnaseq
    

    下载完流程之后:


    图片.png

    运行文件为nf结尾文件,
    文件内容如下:

    #!/usr/bin/env nextflow
    import SampleGroup
    import Config
    
    
    
    
    /*------------------RNA分析nextflow-pipeline初版-----------------------------
    
    RNA_np_v0.0.1
    
    System: Linux version 3.10.0-693.21.1.el7.x86_64
    Tool  : Nextflow 0.30.2.4867
    -------------------------------------------------------------------------*/
    
    /*---------------------------定义基本目录-----------------------------------
    
    base pathway
    
    database: 与人hg19相关的数据、工具存储的目录
    opd: output pathway本项目的输出文件母目录 
    -------------------------------------------------------------------------*/
    database="/data/zhangxc/database/homo_sapiens"
    opd =  "/data/zhangxc/lnc_rna/lncRNA/output"
    main_dir = "/data/zhangxc/lnc_rna/lncRNA"
    /*-------------------------读取样本及定义组---------------------------------
    
    sample and group
    
    data.ini为样本和组的配置文件,通过congfig.groovy将定义好的样本和组读入pipline中备用
    
    sample = {{sample_name},{sample_file1,sample_file2}}
    group = {{group_name},{control_name},{case_name},{control_name,case_name}}
    
    nextflow 同一命名变量只允许一次作为输入变量
    sample 扩展成3份 分别用来进行fastqc、SOAPnuke、println 
    -------------------------------------------------------------------------*/
    params.data_file="/data/zhangxc/lnc_rna/lncRNA/test/data.ini"
    config=new Config(params.data_file)
    Channel.from( config.ReadSamples() ).set{samples}
    samples.into{samples0;samples1;sample_2}
    Channel.from( config.ReadGroups() ).set{groups0}
    sample_2.println()
    /*------------------------------fastqc------------------------------------
    
    fastqc
    
    质量控制模块,独立于分析流程的单独模块
    input : sample
    output: otp/fastqc/sample_name
    -------------------------------------------------------------------------*/
    process fastqc{
        tag { sample_name }
        // container 'biocontainers/fastqc'
        // conda "fastqc"
        publishDir { "output/fastqc/"+ sample_name }
        input:
            set sample_name , files from samples0
        output:
            file "*_fastqc/Images/*.png"
        script:
        if(files.size == 1){
            """
            echo \$PWD
            fastqc --extract -o . ${files[0]}
            """
        }else{
            """
            echo \$PWD
            fastqc --extract -o . ${files[0]} ${files[1]}
            """
        }
    }
    /*------------------------------soapnuke------------------------------------
    
    soapnuke
    
    质量控制模块,开始与分析第一步,用来去除引物生成clean sample
    input : sample
    output: file otp/soapnuke/sample_name/*.fq.gz  val clean_samples-->{{sample_name},{*.fq.gz}}
    
    clean_samples copy to clean_samples1|clean_samples2
    -------------------------------------------------------------------------*/
    process soapnuke{
        tag { sample_name }
        // container 'registry.cn-hangzhou.aliyuncs.com/bio/soapnuke'
        publishDir path:{"output/soapnuke/"+sample_name} ,mode:'copy',
            saveAs: {filename ->
                if (filename.indexOf(".txt")>0) filename
                else null
            }
    
        input:
            set sample_name , files from samples1
    
        output:
            set sample_name , file("*.fq.gz") into clean_samples
            file "*.txt"
    
        script:
            """
            echo \$PWD
            SOAPnuke filter -1 ${files[0]} -2 ${files[1]} -l 20 -q 0.5  \
                -f GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -r GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
                -o ./ -C ${sample_name}_1.clean.fq.gz -D ${sample_name}_2.clean.fq.gz
            """
    }
    clean_samples.into{clean_samples1;clean_samples2;}
    
    /*------------------------------hisat------------------------------------
    
    hisat
    
    比对软件,与bowtie和tophat功能相似,将read与基因组作比对
    input :  clean_samples1
    output:  file otp/hisat/sample_name/${sample_name}.sam  val sam-->{{sample_name},{${sample_name}.sam}}
             file otp/hisat/sample_name/*.align_summary.txt val summary_txt-->{{sample_name},{*.align_summary.txt}}
    -------------------------------------------------------------------------*/
    process hisat{
        tag { sample_name }
        // container 'registry.cn-hangzhou.aliyuncs.com/bio/hisat2'
        publishDir { "output/hisat/"+ sample_name }
        input:
            set sample_name , files from clean_samples1
    
        output:
            set sample_name , file("${sample_name}.sam") into sam
            set sample_name , file("*.align_summary.txt") into summary_txt
    
        script:
        """
            echo \$PWD
            hisat2  -x $database/grch37_tran/genome_tran -1 ${files[0]} -2 ${files[1]} -S ${sample_name}.sam 2> ${sample_name}.align_summary.txt
        """
    }
    
    /*------------------------------samtools----------------------------------
    
    samtools
    
    工具软件,将sam转为bam格式,并根据染色体定位进行sort和index
    input :  val sam 
    output:  file otp/samtools/sample_name/${sample_name}.bam  val sam-->{{sample_name},{${sample_name}.bam}}
    -------------------------------------------------------------------------*/
    process samtools{
        tag { sample_name }
        // container 'broadinstitute/genomes-in-the-cloud:2.3.1-1512499786'
        publishDir { "output/samtools/"+ sample_name }
        input:
            set sample_name , file('sample.sam') from sam
    
        output:
            set sample_name , file("${sample_name}.bam") into bam
    
        script:
        """
            echo \$PWD
            samtools view -bS sample.sam > ${sample_name}_unsorted.bam
            samtools sort -@ 8 ${sample_name}_unsorted.bam ${sample_name}
            samtools index ${sample_name}.bam ${sample_name}.bam.bai
        """
    }
    
    /*------------------------------htseq----------------------------------
    
    htseq
    
    转录组表达量分析
    input :  val bam 
    output:  file otp/htseq/sample_name/${sample_name}.rawCount.txt  val htseq_txt-->{{sample_name},{${sample_name}.rawCount.txt}}
    
    htseq_txt copy to htseq_txt0|htseq_txt1|htseq_txt2
    -------------------------------------------------------------------------*/
    process htseq{
        tag { sample_name }
        //container 'broadinstitute/genomes-in-the-cloud:2.3.1-1512499786'
        //container "dyndna/docker-for-pcawg14-htseq"
        // conda "htseq=0.9.1"
        publishDir { "output/htseq/"+ sample_name }
        input:
            set sample_name , file('bam_files') from bam
        output:
            set sample_name , file("${sample_name}.rawCount.txt") into htseq_txt
        script:
        """
            echo \$PWD
            htseq-count -f bam -m union -s yes -t exon -i gene_id -r pos bam_files \
            $database/Homo_sapiens.GRCh37.75.gtf >${sample_name}.rawCount.txt
        """
    }
    htseq_txt.into{htseq_txt0;htseq_txt1;htseq_txt2} 
    
    htseq_txt2.println()
    
    /*------------------------------htseq_xls----------------------------------
    
    htseq_xls
    
    转录组表达量分析后,将txt文件转换格式为xls
    input :  val summary_txt 
             val htseq_txt0
    output:  file otp/htseq_xls/sample_name/${sample_name}.rpkm.xls  val htseq_xls-->{{sample_name},{${sample_name}.rpkm.xls}}
    -------------------------------------------------------------------------*/
    process htseq_xls{
        tag { sample_name }
        //container 'registry.cn-hangzhou.aliyuncs.com/bio/hisat2'
        // container "broadinstitute/genomes-in-the-cloud:2.3.1-1512499786 "
        publishDir { "output/htseq/"+ sample_name }
        input:
            set sample_name , file('txt_files') from summary_txt
            set sample_name , file('sample.rawCount.txt') from htseq_txt0
        output:
            set sample_name , file("${sample_name}.rpkm.xls") into htseq_xls
        script:
        """
            echo \$PWD
            perl /data/zhangxc/lnc_rna/lncRNA/bin/RNAseq_bin/htseq2rpkm_hisat.pl txt_files \
            $database/hg19.GRCh37.74.ncRNA.gene.len \
            sample.rawCount.txt $sample_name ${sample_name}.rpkm.xls
            sed 's/^/$sample_name\t/' ${sample_name}.rpkm.xls | sed '1d' > ${sample_name}.rpkm.tmp
        """
    }
    

    只截取到基因表达量部分,感兴趣的可以去尝试运行,运行方式为:
    nextflow run zxc.nf


    图片.png

    运行完成,后续模块分析中...

    相关文章

      网友评论

        本文标题:Docker封装生物信息学lncRNA流程

        本文链接:https://www.haomeiwen.com/subject/niosmktx.html