Gene fusion, step 1: the GeneFuse software

Author: 小灰灰不会飞那又怎样 | Published: 2019-10-22 13:25

GeneFuse: a tool to detect and visualize target gene fusions by scanning FASTQ files directly. This tool accepts FASTQ files and a reference genome as input, and outputs detected fusion results in TEXT, JSON and HTML formats.

I. Overview

Written in C++. Project page: https://github.com/OpenGene/GeneFuse

Publication:
Shifu Chen, Ming Liu, Tanxiao Huang, Wenting Liao, Mingyan Xu and Jia Gu. GeneFuse: detection and visualization of target gene fusions from DNA sequencing data. International Journal of Biological Sciences, 2018; 14(8): 843-848. doi: 10.7150/ijbs.24626
Citations: 1 (at the time of writing)
Year of publication: 2018
Journal and impact factor: International Journal of Biological Sciences, 4.067 (2018)
Author and affiliation: Shifu Chen (陈实富), Shenzhen HaploX Biotechnology Co., Ltd. (also the developer of fastp and other well-known tools, see https://github.com/OpenGene/)

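II. Installation

Downloading and installing is straightforward; just follow the project page.
The commands below download the precompiled binary, which can be used directly. Alternatively, follow the steps on the project page to download the source and compile it.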

wget http://opengene.org/GeneFuse/genefuse
chmod a+x ./genefuse

III. Usage

Usage example:

genefuse -r hg19.fasta -f genes/druggable.hg19.csv -1 genefuse.R1.fq.gz -2 genefuse.R2.fq.gz -h report.html > result

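The required inputs are the two FASTQ files (read 1 and read 2), the reference genome FASTA, and the fusion CSV file.
CSV download link: https://github.com/OpenGene/GeneFuse/tree/master/genes

Help output: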

$ /software/GeneFuse/genefuse --help
usage: /software/GeneFuse/genefuse --read1=string --fusion=string --ref=string [options] ...
options:
  -1, --read1                          read1 file name (string)
  -2, --read2                          read2 file name (string [=])
  -f, --fusion                         fusion file name, in CSV format (string)
  -r, --ref                            reference fasta file name (string)
  -u, --unique                         specify the least supporting read number is required to report a fusion, default is 2 (int [=2])
  -h, --html                           file name to store HTML report, default is genefuse.html (string [=genefuse.html])
  -j, --json                           file name to store JSON report, default is genefuse.json (string [=genefuse.json])
  -t, --thread                         worker thread number, default is 4 (int [=4])
  -d, --deletion                       specify the least deletion length of a intra-gene deletion to report, default is 50 (int [=50])
  -D, --output_deletions               long deletions are not output by default, enable this option to output them
  -U, --output_untranslated_fusions    the fusions that cannot be transcribed or translated are not output by default, enable this option to output them
  -?, --help                           print this message

A real run:

/software/GeneFuse/genefuse --ref /data/Project/cailili/Gene27/bin/pre/database/Genome/hg19/ucsc.hg19.fasta --read1  sample_name_1.fastq.gz --read2  sample_name_2.fastq.gz --thread 8 --html sample_name.cancer.html --json  sample_name.cancer.json -f cancer.hg19.csv > sample_name.cancer.result

Output files:

-rw-rw-r--. 1  root  root 109002 Sep 16 15:58 sample_name.cancer.html
-rw-rw-r--. 1  root  root   9614 Sep 16 15:58 sample_name.cancer.json
-rw-rw-r--. 1  root  root   4138 Sep 16 15:58 sample_name.cancer.result
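
For batch processing, the invocation shown above can be assembled in a script. The following is a minimal sketch, not part of GeneFuse: `build_genefuse_cmd` is a hypothetical helper, the file naming (`<sample>_1.fastq.gz`, `<sample>_2.fastq.gz`) is assumed from the example run, and only flags listed in the help output are used.

```python
import shlex

def build_genefuse_cmd(sample, ref, fusion_csv, genefuse="genefuse", threads=8):
    """Return the argument list for one paired-end sample (hypothetical helper)."""
    return [
        genefuse,
        "--ref", ref,
        "--read1", f"{sample}_1.fastq.gz",   # assumed naming scheme
        "--read2", f"{sample}_2.fastq.gz",   # assumed naming scheme
        "--thread", str(threads),
        "--html", f"{sample}.html",
        "--json", f"{sample}.json",
        "--fusion", fusion_csv,
    ]

cmd = build_genefuse_cmd("sample_name", "ucsc.hg19.fasta", "cancer.hg19.csv")
print(shlex.join(cmd))  # shell-quoted command line, handy for logging

# To actually run it and keep the plain-text result, as the shell redirect does:
#   import subprocess
#   with open("sample_name.result", "w") as out:
#       subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell quoting problems when sample names or paths contain spaces.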

Example result (as reported in the HTML):

1, Fusion: ALK_ENST00000389048.3:intron:4|+chr2:29736691___NTRK3_ENST00000394480.2:intron:17|+chr15:88428871 (total: 2, unique:2)

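This means that ALK (transcript ENST00000389048.3) and NTRK3 (transcript ENST00000394480.2) are fused. The breakpoints lie in intron 4 of ALK at chr2:29736691 and in intron 17 of NTRK3 at chr15:88428871; the field after the transcript ID says whether the breakpoint falls in an exon or an intron.

The breakpoint is supported by 2 reads in total, 2 of them unique. Clicking on the reads in the HTML report shows the individual read names.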
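
To collect such fusion lines from many samples, the header can be split into fields. This is a sketch based only on the single example line above, so the pattern is an assumption about the format rather than a documented specification; in particular it assumes gene symbols contain no underscore. `parse_fusion_line` is a hypothetical helper.

```python
import re

# One side of the fusion: GENE_TRANSCRIPT:exon|intron:NUMBER|STRANDchr:pos
_SIDE = (r"(?P<gene{n}>[^_\s]+)_(?P<tx{n}>[^:]+):(?P<region{n}>[A-Za-z]+):(?P<num{n}>\d+)"
         r"\|(?P<strand{n}>[+-])(?P<chrom{n}>[^:]+):(?P<pos{n}>\d+)")

FUSION_RE = re.compile(
    r"Fusion:\s*" + _SIDE.format(n=1) + "___" + _SIDE.format(n=2)
    + r"\s*\(total:\s*(?P<total>\d+),\s*unique:\s*(?P<unique>\d+)\)"
)

def parse_fusion_line(line):
    """Return a dict of fields, or None if the line is not a fusion header."""
    m = FUSION_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    for key in ("num1", "pos1", "num2", "pos2", "total", "unique"):
        rec[key] = int(rec[key])
    return rec

line = ("1, Fusion: ALK_ENST00000389048.3:intron:4|+chr2:29736691___"
        "NTRK3_ENST00000394480.2:intron:17|+chr15:88428871 (total: 2, unique:2)")
rec = parse_fusion_line(line)
print(rec["gene1"], rec["gene2"], rec["unique"])  # ALK NTRK3 2
```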

IV. Paper walkthrough
(1) Introduction

GeneFuse: detection and visualization of target gene fusions from DNA sequencing data

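The paper compares GeneFuse with two commonly used fusion detection tools, DELLY and FACTERA. Both detect fusions from the output of a read aligner.

It raises two concerns about that approach:
1) If the aligner cannot detect accurate clips and chimeras, the mapping-based fusion detection algorithms may not work properly. However, misalignments can happen often for the reads containing fusions.
In other words, any aligner error propagates directly to the tools built on top of it, and alignment errors are markedly more frequent around fusion breakpoints.
2) Clips and chimeras can also happen often for the normal reads that don't contain any fusions.
That is, clipped and chimeric alignments are not specific to fusion regions.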

Why false positives and false negatives are common:
False positives can happen often at repetitive regions.
Meanwhile false negatives can also happen often when they process data from the samples with low tumor DNA composition, like cell-free tumor DNA.

In addition, GeneFuse focuses only on clinically relevant fusions, and it can visualize its results.

Basic idea:
The basic idea of GeneFuse is to search for the reads that can be well mapped to two different genes for its left part and right part, but cannot be entirely mapped to any position of the whole reference genome.

(2) Methods

The method has four major steps: indexing, matching, filtering, and reporting (see the figure).

[Figure: how GeneFuse works (GeneFuse原理.png)]

Step 1: indexing

In the indexing step, a hashmap of mapping to genes is computed.

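The CSV file: "A CSV file, which lists the genome regions of target fusion genes and their exons, is needed to extract gene sequences from reference genome." This file is how the target gene regions are customized.

The sequences of the regions listed in the CSV are extracted from the reference genome. Each sequence is split into 16 bp k-mers, each k-mer is tied to its genomic position, and the k-mers with their positions are stored in a hashmap for the matching step.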
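
The indexing idea can be illustrated with a toy example. This is a sketch under stated assumptions, not GeneFuse's actual implementation: it takes overlapping k-mers at every offset, stores positions relative to the gene sequence, and keeps plain strings as keys. `build_kmer_index` and the two short sequences are made up for illustration.

```python
from collections import defaultdict

K = 16  # k-mer length mentioned in the text

def build_kmer_index(gene_seqs, k=K):
    """Map every k-mer to the list of (gene, offset) pairs where it occurs."""
    index = defaultdict(list)
    for gene, seq in gene_seqs.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:          # skip k-mers with ambiguous bases
                continue
            index[kmer].append((gene, i))
    return index

# Toy sequences, 22 bp each (real gene regions are thousands of bases long).
gene_seqs = {
    "A": "ACGTTGCAAGCTTAGGCATCGA",
    "B": "TTGACCGGTATCGCAATGCCTA",
}
index = build_kmer_index(gene_seqs)
print(index["ACGTTGCAAGCTTAGG"])  # positions of the first k-mer of gene A
```

Looking up the k-mers of a read in such a table gives, for each part of the read, the genes and offsets it may come from, which is what the matching step needs.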

Step 2: matching

In the matching step, reads are mapped to the genes using the computed hashmap, and those that can be mapped to two genes are saved to fusion matches.

If the left part and right part of a read can be mapped to two different genes, the read will be segmented to two regions. The read will be considered as a match candidate if its left region and right region are both long enough (Tregion = 20 by default), and simultaneously meets such condition that the bases that are out of both regions are less than a certain threshold (Tunmapped = 10 by default).

In short: a read is a candidate when its two ends map to different genes, both mapped regions reach the 20 bp length threshold, and fewer than 10 bases lie outside the two regions.
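
The candidate rule can be written down as a small predicate. This is a sketch of the rule as described here, not code from GeneFuse; it assumes the two regions do not overlap on the read and treats "long enough" as "at least Tregion". `is_match_candidate` and the coordinate convention are hypothetical.

```python
T_REGION = 20    # minimum length of each mapped region (default in the paper)
T_UNMAPPED = 10  # bases outside both regions must stay below this (default in the paper)

def is_match_candidate(read_len, left, right,
                       t_region=T_REGION, t_unmapped=T_UNMAPPED):
    """left/right: (gene, start, end) in half-open read coordinates [start, end)."""
    gene_l, l_start, l_end = left
    gene_r, r_start, r_end = right
    if gene_l == gene_r:
        return False                      # both parts hit the same gene
    left_len = l_end - l_start
    right_len = r_end - r_start
    if left_len < t_region or right_len < t_region:
        return False                      # one side is too short to trust
    unmapped = read_len - left_len - right_len
    return 0 <= unmapped < t_unmapped     # few bases left unexplained

print(is_match_candidate(100, ("ALK", 0, 48), ("NTRK3", 50, 100)))  # True
print(is_match_candidate(100, ("ALK", 0, 15), ("NTRK3", 15, 100)))  # False
print(is_match_candidate(100, ("ALK", 0, 40), ("NTRK3", 55, 100)))  # False
```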

To obtain longer sequences, rcR2 is computed as the reverse complement of R2. When R1 and rcR2 overlap by more than 30 bp, the pair is merged into a single read. This handles breakpoints that fall near the edge of a read.
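
A minimal sketch of the merging idea follows. It is an illustration, not GeneFuse's algorithm: it requires an exact match between the end of R1 and the start of rcR2 (a real tool would presumably tolerate mismatches) and treats the threshold as "at least `min_overlap` bases". `merge_pair` and the 60 bp fragment are made up for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T, N."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def merge_pair(r1, r2, min_overlap=30):
    """Merge R1 with rcR2 when R1's suffix equals rcR2's prefix over at least
    min_overlap bases; return None when no such overlap exists."""
    rc_r2 = reverse_complement(r2)
    # Try the longest possible overlap first.
    for olen in range(min(len(r1), len(rc_r2)), min_overlap - 1, -1):
        if r1[-olen:] == rc_r2[:olen]:
            return r1 + rc_r2[olen:]
    return None

# A 60 bp fragment sequenced from both ends: R1 covers bases 0-44,
# R2 is the reverse complement of bases 10-59, so they share 35 bases.
fragment = "ACGTTGCAAGCTTAGGCATCGATTGACCGGTATCGCAATGCCTAGGATCCAGTCAGTACG"
r1 = fragment[:45]
r2 = reverse_complement(fragment[10:])
merged = merge_pair(r1, r2)
print(merged == fragment)  # True
```

With the merged read, a breakpoint that sits in the last few bases of R1 ends up in the middle of a longer sequence, so both sides can reach the region length threshold.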

Step 3: filtering

In the filtering step, each fusion match is filtered by its read complexity, match quality, and other factors.

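Matches are filtered by read complexity and quality. Two further filters apply:

1) The candidate fusion reads are scanned against the whole genome, and any read that can be matched to the genome as a whole is discarded.
2) If the two parts of a read map to different positions within the same gene, the event is treated as a deletion. Deletions that are too short (below the -d threshold, 50 by default) are discarded, since they are probably ordinary indels.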
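
The second rule can be sketched as a small classifier. This is an illustration of the rule as described here, assuming the deletion length is simply the distance between the two breakpoints; `classify_match` is a hypothetical helper, and the threshold mirrors the default of `-d/--deletion` from the help output.

```python
MIN_DELETION = 50  # default of -d/--deletion in the help output

def classify_match(left, right, min_deletion=MIN_DELETION):
    """left/right: (gene, genomic_position) of the two breakpoints of a match.
    Returns "fusion", "deletion", or None when the match should be dropped."""
    gene_l, pos_l = left
    gene_r, pos_r = right
    if gene_l != gene_r:
        return "fusion"                   # two different genes
    if abs(pos_r - pos_l) >= min_deletion:
        return "deletion"                 # long intra-gene deletion
    return None                           # short gap, most likely an indel

print(classify_match(("ALK", 29736691), ("NTRK3", 88428871)))  # fusion
print(classify_match(("ALK", 1000), ("ALK", 1300)))            # deletion
print(classify_match(("ALK", 1000), ("ALK", 1010)))            # None
```

Note that, according to the help output, long deletions are only written to the report when `-D/--output_deletions` is given.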

Step 4: reporting

Finally, in the reporting step, the detected fusions are validated, and the supporting reads for each fusion are piled up and rendered to an HTML page

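Each fusion is reported with its total and unique supporting read counts.

In addition, to evaluate sensitivity, specificity, and speed, the paper compares GeneFuse with the two other tools.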


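[Figure: speed comparison of the three tools (三款软件的检测速度比较.png)]
[Figure: sensitivity and specificity comparison of the three tools (三款软件的灵敏度和特异性比较.png)]

V. Summary

The tool is indeed quite sensitive. Its drawback is that many parameters are hard-coded and cannot be passed on the command line; changing them means editing the C++ source and recompiling, so in practice some specific fusions cannot be detected.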
