美文网首页生物信息
diploS/HIC鉴定选择信号

diploS/HIC鉴定选择信号

作者: DumplingLucky | 来源:发表于2021-05-13 12:23 被阅读0次

    diploS / HIC使用深度卷积神经网络来识别种群基因组数据中的硬选择和软选择区域。

    使用diploS / HIC进行分析的工作流程包括四个基本部分。
    1)使用模拟为diploS / HIC生成训练集。
    2)diploS / HIC培训和绩效评估。
    3)从基因组数据计算dipoS / HIC特征向量。
    4)使用训练有素的网络对经验数据进行预测。
    此处提供的软件可以处理最后三个部分。 人口遗传模拟必须使用诸如discoal之类的单独软件进行。

    1. 安装

    使用conda安装scikit-allel,这将为我们提供基础包(numpy,scipy,scikit-learn等)

    conda install -c conda-forge scikit-allel
    

    tensorflowkeras是用于处理事物的深度学习部分的库

    pip install tensorflow 
    pip install keras
    

    现在准备安装diploS / HIC

    git clone https://github.com/kern-lab/diploSHIC.git
    cd diploSHIC 
    python setup.py install
    

    2. 软件使用

    将要连接的主程序是diploSHIC.py。 该脚本具有四个运行模式,这些模式允许用户执行监督的机器学习过程中的每个主要步骤。
    diploSHIC.py使用python中的argparse模块尝试提供完整的基于命令行的帮助菜单。
    (1)使用方法

    $ python diploSHIC.py -h
    usage: diploSHIC.py [-h] {train,predict,fvecSim,fvecVcf} ...
    
    calculate feature vectors, train, or predict with diploSHIC
    
    possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help message:
      {fvecSim,makeTrainingSets,train,fvecVcf,predict}
                            sub-command help
        fvecSim             Generate feature vectors from simulated data
        makeTrainingSets    Combine feature vectors from muliple fvecSim runs into
                            5 balanced training sets
        train               train and test a shic CNN
        fvecVcf             Generate feature vectors from data in a VCF file
        predict             perform prediction using an already-trained SHIC CNN
    
    optional arguments:
      -h, --help            show this help message and exit
    

    (2)准备模拟训练/测试数据
    所有版本的S / HIC都需要模拟数据来进行训练,该软件提供了一个脚本generateSimLaunchScript.py,该脚本演示了如何模拟训练集。
    (3)特征向量生成模型

    • fvecSim mode
      fvecSim运行模式用于将ms样式的输出转换为与diploSHIC.py兼容的特征向量。
    $ python diploSHIC.py fvecSim -h
    usage: diploSHIC.py fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
                                [--numSubWins NUMSUBWINS]
                                [--maskFileName MASKFILENAME]
                                [--chrArmsForMasking CHRARMSFORMASKING]
                                [--unmaskedFracCutoff UNMASKEDFRACCUTOFF]
                                [--outStatsDir OUTSTATSDIR]
                                [--ancFileName ANCFILENAME] [--pMisPol PMISPOL]
                                shicMode msOutFile fvecFileName
    
    required arguments:
      shicMode              specifies whether to use original haploid SHIC (use
                            'haploid') or diploSHIC ('diploid')
      msOutFile             path to simulation output file (must be same format
                            used by Hudson's ms)
      fvecFileName          path to file where feature vectors will be written
    
    optional arguments:
      -h, --help            show this help message and exit
       --totalPhysLen TOTALPHYSLEN
                            Length of simulated chromosome for converting infinite
                            sites ms output to finite sites (default=1100000)
      --numSubWins NUMSUBWINS
                            The number of subwindows that our chromosome will be
                            divided into (default=11)
      --maskFileName MASKFILENAME
                            Path to a fasta-formatted file that contains masking
                            information (marked by 'N'). If specified, simulations
                            will be masked in a manner mirroring windows drawn
                            from this file.
      --chrArmsForMasking CHRARMSFORMASKING
                            A comma-separated list (no spaces) of chromosome arms
                            from which we want to draw masking information (or
                            'all' if we want to use all arms. Ignored if
                            maskFileName is not specified.
      --unmaskedFracCutoff UNMASKEDFRACCUTOFF
                            Minimum fraction of unmasked sites, if masking
                            simulated data
      --outStatsDir OUTSTATSDIR
                            Path to a directory where values of each statistic in
                            each subwindow are recorded for each rep
      --ancFileName ANCFILENAME
                            Path to a fasta-formatted file that contains inferred
                            ancestral states ('N' if unknown). This is used for
                            masking, as sites that cannot be polarized are masked,
                            and we mimic this in the simulted data. Ignored in
                            diploid mode which currently does not use ancestral
                            state information
      --pMisPol PMISPOL     The fraction of sites that will be intentionally
                            polarized to better approximate real data
    

    该模型接受三个参数,然后提供选项。 参数是“ shicMode”,指的是是否计算单倍体或二倍体摘要统计信息,输入文件的名称和输出文件的名称。

    • fvecVcf mode
      fvecVcf模式用于根据VCF文件计算特征向量。
    $ python diploSHIC.py fvecVcf -h
    usage: diploSHIC.py fvecVcf [-h] [--targetPop TARGETPOP]
                                [--sampleToPopFileName SAMPLETOPOPFILENAME]
                                [--winSize WINSIZE] [--numSubWins NUMSUBWINS]
                                [--maskFileName MASKFILENAME]
                                [--unmaskedFracCutoff UNMASKEDFRACCUTOFF]
                                [--ancFileName ANCFILENAME]
                                [--statFileName STATFILENAME]
                                [--segmentStart SEGMENTSTART]
                                [--segmentEnd SEGMENTEND]
                                shicMode chrArmVcfFile chrArm chrLen
    
    required arguments:
      shicMode              specifies whether to use original haploid SHIC (use
                            'haploid') or diploSHIC ('diploid')
      chrArmVcfFile         VCF format file containing data for our chromosome arm
                            (other arms will be ignored)
      chrArm                Exact name of the chromosome arm for which feature
                            vectors will be calculated
      chrLen                Length of the chromosome arm
      fvecFileName          path to file where feature vectors will be written
    
    optional arguments:
      -h, --help            show this help message and exit
      --targetPop TARGETPOP
                            Population ID of samples we wish to include
      --sampleToPopFileName SAMPLETOPOPFILENAME
                            Path to tab delimited file with population
                            assignments; format: SampleID popID
      --winSize WINSIZE     Length of the large window (default=1100000)
      --numSubWins NUMSUBWINS
                            Number of sub-windows within each large window
                            (default=11)
      --maskFileName MASKFILENAME
                            Path to a fasta-formatted file that contains masking
                            information (marked by 'N'); must have an entry with
                            title matching chrArm
      --unmaskedFracCutoff UNMASKEDFRACCUTOFF
                            Fraction of unmasked sites required to retain a
                            subwindow
      --ancFileName ANCFILENAME
                            Path to a fasta-formatted file that contains inferred
                            ancestral states ('N' if unknown); must have an entry
                            with title matching chrArm. Ignored for diploid mode
                            which currently does not use ancestral state
                            information.
      --statFileName STATFILENAME
                            Path to a file where statistics will be written for
                            each subwindow that is not filtered out
      --segmentStart SEGMENTSTART
                            Left boundary of region in which feature vectors are
                            calculated (whole arm if omitted)
      --segmentEnd SEGMENTEND
                            Right boundary of region in which feature vectors are
                            calculated (whole arm if omitted)
    

    此模式有五个参数,并且还有许多选项。 必需的参数是“ shicMode”,即是否要计算单倍体或二倍体摘要统计信息,输入文件的名称,要为其计算统计信息的染色体,该染色体的长度以及输出文件的名称。
    (4)training the CNN and prediction
    准备好特征向量文件后,我们就可以训练和测试我们的CNN,然后最后对经验数据进行预测。
    (5)格式化训练集
    训练模型:
    这是diploSHIC.py的训练模式的帮助消息

    $ python diploSHIC.py train -h
    usage: diploSHIC.py train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
                              trainDir testDir outputModel
    
    required arguments:
      trainDir              path to training set files
      testDir               path to test set files, can be same as trainDir
      outputModel           file name for output model, will create two files one
                            with structure one with weights
    
    optional arguments:
      -h, --help            show this help message and exit
      --epochs EPOCHS       max epochs for training CNN (default = 100)
      --numSubWins NUMSUBWINS
                            number of subwindows that our chromosome is divided
                            into (default = 11)
    

    训练模式用于训练深度学习分类器。 它的必需参数是trainDir(保留训练特征向量的目录),testDir(保留测试特征向量的目录)和outputModel受训网络的文件名。diploSHIC.py在训练和测试目录中需要五个名为hard.fvec,soft.fvec,neut.fvec,linkedSoft.fvec和linkedHard.fvec的文件。 训练和测试目录可以是同一目录,在这种情况下,保留20%的培训示例用于测试和验证。

    训练模式有两个选项,用于特征向量的子窗口数和用于网络的训练时期数。
    (6)预测模型
    训练完分类器后,便可以使用diploSHIC.py的预测模型对经验数据进行分类。

    $ python diploSHIC.py predict -h
    usage: diploSHIC.py predict [-h] [--numSubWins NUMSUBWINS]
                                modelStructure modelWeights predictFile
                                predictFileOutput
    
    required arguments:
      modelStructure        path to CNN structure .json file
      modelWeights          path to CNN weights .h5 file
      predictFile           input file to predict
      predictFileOutput     output file name
    
    optional arguments:
      -h, --help            show this help message and exit
      --numSubWins NUMSUBWINS
                            number of subwindows that our chromosome is divided
                            into (default = 11)
    

    预测模式将训练模式输出的两个模型文件作为输入,经验特征向量的输入文件,以及预测输出的文件名。

    3. 简单示例

    该示例首先从模拟计算特征向量开始,然后以预测基因组数据。 目录test /和training /各自包含格式正确的二倍体特征向量,可以将其输入到diploSHIC中。
    首先,我们将训练diploSHIC CNN,训练纪元的数量限制为10个。

    $ python diploSHIC.py train training/ testing/ fooModel --epochs 10
    

    优化完成后,我们训练后的网络将包含在两个文件fooModel.json和fooModel.weights.hdf5中。 diploSHIC.py的最后输出为我们提供了有关保留的测试数据的丢失和准确性的信息。

    evaluation on test set:
    diploSHIC loss: 0.404791
    diploSHIC accuracy: 0.846800
    

    --epochs的值默认设置100就足够了。 现在我们有了训练好的模型,我们可以对一些经验数据进行预测。 在仓库中有一个名为testEmpirical.fvec的文件,我们将其用作输入

    $ python diploSHIC.py predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds
    

    输出预测将保存在testEmpirical.preds中
    一个完整测试案例
    参考:https://github.com/kr-colab/diploSHIC

    相关文章

      网友评论

        本文标题:diploS/HIC鉴定选择信号

        本文链接:https://www.haomeiwen.com/subject/qjwzdltx.html