Transcriptome Primer (5): Sequence Alignment

Author: JeremyL | Published: 2017-09-18 21:46

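    Many alignment tools exist, and it is worth surveying them first. Because this series is an introduction, everyone should use hisat2 and learn how it works.
    Download the index files directly from the hisat2 homepage, then align the fastq reads against the index to produce SAM files.
    Next, use samtools to convert the SAM files to BAM, sort them (note the difference between the two sort orders: by read name (N) and by position (P)), index them, load them into IGV, and take screenshots of a few genes.
    Also run a simple QC on the BAM files; see the "直播我的基因组" (Live-Streaming My Genome) series for reference.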

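    Installing HISAT2:

    Download the Linux build of HISAT2; once unzipped it is ready to use:
    $ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip
    Unzip (-d extracts into the specified directory):
    $ unzip -d /work/LXJ/software/ hisat2-2.1.0-Linux_x86_64.zip
    Check that it runs (from inside the unpacked hisat2-2.1.0 directory):
    $ ./hisat2
    (ERR): hisat2-align exited with value 1 (this message can be ignored when hisat2 is run without arguments)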

    Setting the PATH:
    $ sudo vi /etc/environment
    Append the install directory to the PATH entry: /work/LXJ/software/hisat2-2.1.0
    $ source /etc/environment
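If editing a system-wide file is not an option, the same effect can be achieved per user. The following is a minimal sketch, assuming a bash shell and the install directory used above; the function name `path_append` is made up for illustration and is not part of HISAT2.

```shell
#!/usr/bin/env bash
# Append a directory to PATH only if it is not already listed.
# path_append is a hypothetical helper, not part of HISAT2.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

path_append /work/LXJ/software/hisat2-2.1.0
export PATH

# Report whether the shell can now find the executable.
command -v hisat2 || echo "hisat2 not found on PATH"
```

Placing the function and the call in ~/.bashrc would make the change persist across sessions for that user.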

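    Using HISAT2

    Genome index

    Building your own genome index:
    Command Line : hisat2-build [options]* <reference_in> <ht2_base>
    Usage : hisat2-build -p 8 genome.fa genome
    To analyze SNPs, exons, or splice sites, the annotated SNP, exon, and splice-site information must be supplied when the HISAT2 genome index is built (the hisat2 package includes scripts that prepare these files).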
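As a rough illustration of the annotation-aware build, the sketch below prints the commands for extracting splice sites and exons from a GTF file and passing them to `hisat2-build` through its `--ss` and `--exon` options. The extraction scripts ship with HISAT2; the file names `genes.gtf`, `genome.fa`, and `genome_tran`, and the function name `annotated_index_cmds`, are placeholders chosen for this example. SNP handling is left out.

```shell
#!/usr/bin/env bash
# Print (not run) the commands for an annotation-aware index build.
# genes.gtf, genome.fa and genome_tran are placeholder names.
annotated_index_cmds() {
    local gtf=$1 fasta=$2 base=$3 threads=${4:-8}
    printf '%s\n' \
        "hisat2_extract_splice_sites.py $gtf > $base.ss" \
        "hisat2_extract_exons.py $gtf > $base.exon" \
        "hisat2-build -p $threads --ss $base.ss --exon $base.exon $fasta $base"
}

annotated_index_cmds genes.gtf genome.fa genome_tran
```

After reviewing the printed lines, they can be piped to `bash` to run them. An annotation-aware build of a mammalian genome can need far more memory than a plain build, which is one more reason to prefer a prebuilt index.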
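    Downloading a genome index:
    Downloading a prebuilt genome index from the HISAT2 website is less work and also avoids build errors:

    This is the mouse genome index; download whichever version you need:
    $ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz
    $ tar zxvf mm10.tar.gz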

    Aligning RNA-Seq reads to the genome with HISAT2:
    hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]
    <ht2-idx> Index filename prefix (minus trailing .X.ht2).
    <m1> Files with #1 mates, paired with files in <m2>.
    Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
    <m2> Files with #2 mates, paired with files in <m1>.
    Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
    <r> Files with unpaired reads.
    Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
    <SRA accession number> Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
    <sam> File for SAM output (default: stdout)

    <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
    specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.

    Running the HISAT2 alignment:

    for i in {59..62};
    do
    echo $i
    hisat2 -t -p 8 -x /work/LXJ/Genome/M.musculus/mm10.hisat2.index/genome -1 SRR35899${i}.sra_1.fastq.gz -2 SRR35899${i}.sra_2.fastq.gz -S /mnt/hgfs/Labubuntu_data/GSE81916.RNAseq/hisat2.mm10/SRR35899${i}.sam;
    done
    
    59
    Time loading forward index: 00:00:25
    Time loading reference: 00:00:04
    Multiseed full-index search: 00:15:41
    30468155 reads; of these:
      30468155 (100.00%) were paired; of these:
        2722598 (8.94%) aligned concordantly 0 times
        24300848 (79.76%) aligned concordantly exactly 1 time
        3444709 (11.31%) aligned concordantly >1 times
        ----
        2722598 pairs aligned concordantly 0 times; of these:
          156872 (5.76%) aligned discordantly 1 time
        ----
        2565726 pairs aligned 0 times concordantly or discordantly; of these:
          5131452 mates make up the pairs; of these:
            3276583 (63.85%) aligned 0 times
            1334447 (26.01%) aligned exactly 1 time
            520422 (10.14%) aligned >1 times
    94.62% overall alignment rate
    Time searching: 00:15:45
    Overall time: 00:16:11
    60
    Time loading forward index: 00:00:29
    Time loading reference: 00:00:04
    Multiseed full-index search: 00:29:01
    52972617 reads; of these:
      52972617 (100.00%) were paired; of these:
        4438954 (8.38%) aligned concordantly 0 times
        42836426 (80.87%) aligned concordantly exactly 1 time
        5697237 (10.76%) aligned concordantly >1 times
        ----
        4438954 pairs aligned concordantly 0 times; of these:
          268939 (6.06%) aligned discordantly 1 time
        ----
        4170015 pairs aligned 0 times concordantly or discordantly; of these:
          8340030 mates make up the pairs; of these:
            5335211 (63.97%) aligned 0 times
            2173091 (26.06%) aligned exactly 1 time
            831728 (9.97%) aligned >1 times
    94.96% overall alignment rate
    Time searching: 00:29:05
    Overall time: 00:29:34
    61
    Time loading forward index: 00:00:31
    Time loading reference: 00:00:05
    Multiseed full-index search: 00:21:39
    36763726 reads; of these:
      36763726 (100.00%) were paired; of these:
        3102153 (8.44%) aligned concordantly 0 times
        29382458 (79.92%) aligned concordantly exactly 1 time
        4279115 (11.64%) aligned concordantly >1 times
        ----
        3102153 pairs aligned concordantly 0 times; of these:
          173349 (5.59%) aligned discordantly 1 time
        ----
        2928804 pairs aligned 0 times concordantly or discordantly; of these:
          5857608 mates make up the pairs; of these:
            3596954 (61.41%) aligned 0 times
            1595531 (27.24%) aligned exactly 1 time
            665123 (11.35%) aligned >1 times
    95.11% overall alignment rate
    Time searching: 00:21:44
    Overall time: 00:22:15
    62
    Time loading forward index: 00:00:28
    Time loading reference: 00:00:05
    Multiseed full-index search: 00:22:33
    43802631 reads; of these:
      43802631 (100.00%) were paired; of these:
        3816434 (8.71%) aligned concordantly 0 times
        35462440 (80.96%) aligned concordantly exactly 1 time
        4523757 (10.33%) aligned concordantly >1 times
        ----
        3816434 pairs aligned concordantly 0 times; of these:
          209180 (5.48%) aligned discordantly 1 time
        ----
        3607254 pairs aligned 0 times concordantly or discordantly; of these:
          7214508 mates make up the pairs; of these:
            4769954 (66.12%) aligned 0 times
            1806461 (25.04%) aligned exactly 1 time
            638093 (8.84%) aligned >1 times
    94.56% overall alignment rate
    Time searching: 00:22:38
    Overall time: 00:23:06
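
The overall alignment rate in each summary can be recomputed from the counts above it. Every pair contributes two mates, and only the mates reported as "aligned 0 times" in the final block count as unaligned. Below is a small sketch using awk, fed with the numbers from sample 59; the function name `overall_rate` is illustrative only.

```shell
#!/usr/bin/env bash
# overall rate = 100 * (1 - unaligned_mates / (2 * pairs))
overall_rate() {
    LC_ALL=C awk -v pairs="$1" -v unaligned="$2" \
        'BEGIN { printf "%.2f%%\n", 100 * (1 - unaligned / (2 * pairs)) }'
}

# Sample 59: number of pairs, then mates aligned 0 times.
overall_rate 30468155 3276583   # prints 94.62%
```

The same call with the sample 60 counts (52972617 and 5335211) gives 94.96%, matching the log.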
    

    Samtools

    samtools view:

    Converting SAM files to BAM:

    for i in {59..62};
    do
    echo $i
    samtools view -S SRR35899${i}.sam -b > SRR35899${i}.bam;
    done
    

    samtools sort:

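    sort operates on BAM files, not SAM files. Here the alignments are sorted by read name (the default is to sort by position on the chromosome). Name sorting is used to suit the later HTSeq counting step; keeping the default chromosome-position order would increase the work HTSeq has to do when generating the count files.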

    for i in {59..62};
    do
    echo $i
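    # samtools 1.3 and later take the output file via -o; older releases took an output prefix as the last argument
    samtools sort -n -@ 8 -o SRR35899${i}_n.sorted.bam SRR35899${i}.bam;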
    done
    

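    By default the sort is by chromosome position, while -n sorts by read name. With -t TAG, records are sorted first by the value of that tag, then by chromosome position or read name.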

    Viewing in IGV
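
IGV expects a BAM file sorted by coordinate together with its index, and a name-sorted BAM cannot be indexed, so a second, position-sorted copy is needed for viewing. The sketch below assumes samtools 1.3 or later (the `-o` form of `samtools sort`). The helper names `run` and `sort_and_index` and the output file names are made up for illustration. It only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# DRY_RUN defaults to 1, so commands are only printed; set DRY_RUN=0 to execute.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Coordinate-sort one BAM file and index the result.
# sort_and_index is a hypothetical helper; output names are placeholders.
sort_and_index() {
    local sample=$1 threads=${2:-8}
    run samtools sort -@ "$threads" -o "${sample}.sorted.bam" "${sample}.bam"
    run samtools index "${sample}.sorted.bam"
}

for i in {59..62}; do
    sort_and_index "SRR35899${i}"
done
```

`samtools index` writes a `.bai` file next to the sorted BAM; loading the `.sorted.bam` file in IGV then picks up the index automatically as long as both files sit in the same directory.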

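    Alignment QC:
    Commonly used tools include
    Picard https://broadinstitute.github.io/picard/
    RSeQC http://rseqc.sourceforge.net/
    Qualimap http://qualimap.bioinfo.cipf.es/
    RSeQC is used here. It bundles a wide range of tools, and its website provides test data and worked examples.
    RSeQC
    Installation: pip install RSeQC
    Available programs: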
    Reference:
    Transcriptome Primer (5): Sequence Alignment


        Link to this article: https://www.haomeiwen.com/subject/untisxtx.html