美文网首页生信分析流程
DeepVariant — step4 DeepVariant

DeepVariant — step4 DeepVariant

作者: 小灰灰不会飞那又怎样 | 来源:发表于2019-07-31 17:07 被阅读0次

    Deepvariant是google 推出的基于CNN(convolutional neural network,卷积神经网络)算法,图像识别的变异检测软件。

    参考的有关Deepvariant原理的博客:
    Creating a universal SNP and small indel variant caller with deep neural networks(deepvariants文章)
    Creating a universal SNP and small indel variant caller with deep neural networks理解(对deepvariants文章的解读)
    2017-google-DeepVariant-机器学习(卷积神经网络)与基因组学
    DeepVariant: 用卷积神经网络进行DNA序列变异位点检测
    EVALUATING DEEPVARIANT: A NEW DEEP LEARNING VARIANT CALLER FROM THE GOOGLE BRAIN TEAM

    附图一张:


    DeepVariant 识别variant的过程

    DeepVariant识别variant过程总共包括了三个步骤,一一说明:

    一、make_examples

    筛选候选位点

    1.原理:

    数据比对到参考基因组,考量候选区域支持的reads数,以及支持变异的read数与该区域总reads数之比),碱基质量值等,筛选潜在变异。
    得到候选位点,将每个位点转化成RGB图像,这一步会得到多个multi-dimensional pileup images。

    2.命令:
    /opt/deepvariant/bin/make_examples --mode calling --ref "/opt/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta" --reads "/opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam" --examples "/tmp/deepvariant_tmp_output/make_examples.tfrecord@2.gz" --regions "chr20:10,000,000-10,010,000" --gvcf "/tmp/deepvariant_tmp_output/gvcf.tfrecord@2.gz"
    
    3.输入:

    参考基因组;
    bam文件;

    4.得到:
    root@bcffde6cb466:/deepvariant# ll /tmp/deepvariant_tmp_output/
    total 468
    drwxr-xr-x   2 root root   4096 Jul 30 09:54 ./
    drwxrwxrwt. 13 root root   4096 Jul 30 09:54 ../
    -rw-r--r--   1 root root   4496 Jul 30 09:54 gvcf.tfrecord-00000-of-00002.gz
    -rw-r--r--   1 root root 303120 Jul 30 09:54 make_examples.tfrecord-00000-of-00002.gz
    -rw-r--r--   1 root root 154859 Jul 30 09:54 make_examples.tfrecord-00000-of-00002.gz.run_info.pbtxt
    
    5.LOG输出:
    2019-07-30 09:54:06.077220: W third_party/nucleus/io/sam_reader.cc:564] Unrecognized SAM header type, ignoring:
    I0730 09:54:06.077377 140332821325568 genomics_reader.py:218] Reading /opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam with NativeSamReader
    I0730 09:54:06.079504 140332821325568 make_examples.py:1110] Preparing inputs
    2019-07-30 09:54:06.079838: W third_party/nucleus/io/sam_reader.cc:564] Unrecognized SAM header type, ignoring:
    I0730 09:54:06.079907 140332821325568 genomics_reader.py:218] Reading /opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam with NativeSamReader
    I0730 09:54:06.080715 140332821325568 make_examples.py:1034] Common contigs are [u'chr20']
    I0730 09:54:06.083332 140332821325568 make_examples.py:1116] Writing examples to /tmp/deepvariant_tmp_output/make_examples.tfrecord-00000-of-00002.gz
    I0730 09:54:06.083600 140332821325568 make_examples.py:1120] Writing gvcf records to /tmp/deepvariant_tmp_output/gvcf.tfrecord-00000-of-00002.gz
    2019-07-30 09:54:06.084198: I third_party/nucleus/io/sam_reader.cc:600] Setting HTS_OPT_BLOCK_SIZE to 134217728
    2019-07-30 09:54:06.143938: W third_party/nucleus/io/sam_reader.cc:564] Unrecognized SAM header type, ignoring:
    I0730 09:54:06.144200 140332821325568 genomics_reader.py:218] Reading /opt/quickstart-testdata/NA12878_S1.chr20.10_10p1mb.bam with NativeSamReader
    I0730 09:54:06.452500 140332821325568 make_examples.py:1149] Task 0: 6 candidates (6 examples) [0.37s elapsed]
    I0730 09:54:06.900475 140332821325568 make_examples.py:1164] Writing MakeExamplesRunInfo to /tmp/deepvariant_tmp_output/make_examples.tfrecord-00000-of-00002.gz.run_info.pbtxt
    I0730 09:54:07.027529 140332821325568 make_examples.py:1167] Found 40 candidate variants
    I0730 09:54:07.028004 140332821325568 make_examples.py:1168] Created 46 examples
    
    二、call_variants

    运用TensorFlow框架构建模型,训练样本

    1.原理:

    这些候选位点被分为三类:hom-ref(与参考相同),het(杂合型),hom-alt(纯合变异),利用卷积神经网络进行训练和分类。

    CNN训练(trained CNN):利用一个已知genotype的样本(genome in a bottle, NA12878, http://jimb.stanford.edu/giab/)的image训练CNN,不断优化参数,直到模型表现收敛。

    2.命令:
    /opt/deepvariant/bin/call_variants --outfile "/tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz" --examples "/tmp/deepvariant_tmp_output/make_examples.tfrecord@2.gz" --checkpoint "/opt/models/wgs/model.ckpt"
    
    3.输入:

    参考基因组;
    第一步得到的候选位点;
    已知模型;

    4.得到:
    root@bcffde6cb466:/deepvariant# ll /tmp/deepvariant_tmp_output/
    total 472
    drwxr-xr-x   2 root root   4096 Jul 31 01:18 ./
    drwxrwxrwt. 14 root root   4096 Jul 31 01:19 ../
    -rw-r--r--   1 root root   2504 Jul 31 01:19 call_variants_output.tfrecord.gz
    
    5.LOG输出:
    WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
    For more information, please see:
      * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
      * https://github.com/tensorflow/addons
    If you depend on functionality not listed there, please file an issue.
    
    I0731 01:18:53.880074 140390238451456 call_variants.py:313] Set KMP_BLOCKTIME to 0
    2019-07-31 01:18:54.094641: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2019-07-31 01:18:54.524745: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
    2019-07-31 01:18:54.525717: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x52c22f0 executing computations on platform Host. Devices:
    2019-07-31 01:18:54.525743: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
    I0731 01:18:55.435303 140390238451456 modeling.py:364] Initializing model with random parameters
    W0731 01:18:55.437553 140390238451456 estimator.py:1760] Using temporary folder as model directory: /tmp/tmp4YBYF_
    I0731 01:18:55.437961 140390238451456 estimator.py:201] Using config: {'_save_checkpoints_secs': 1000, '_session_config': , '_keep_checkpoint_max': 100000, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7faf05f5db10>, '_model_dir': '/tmp/tmp4YBYF_', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}
    I0731 01:18:55.438348 140390238451456 call_variants.py:381] Writing calls to /tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz
    W0731 01:18:55.445154 140390238451456 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Colocations handled automatically by placer.
    I0731 01:18:55.480995 140390238451456 data_providers.py:360] self.input_read_threads=8
    W0731 01:18:55.481403 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/data_providers.py:365: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `tf.data.experimental.parallel_interleave(...)`.
    I0731 01:18:55.596564 140390238451456 data_providers.py:366] self.input_map_threads=48
    W0731 01:18:55.596837 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/data_providers.py:371: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use `tf.data.experimental.map_and_batch(...)`.
    I0731 01:18:55.621093 140390238451456 estimator.py:1111] Calling model_fn.
    W0731 01:18:55.621329 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/modeling.py:679: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.cast instead.
    W0731 01:18:55.623230 140390238451456 deprecation.py:323] From /tmp/Bazel.runfiles_HSnmci/runfiles/com_google_deepvariant/deepvariant/modeling.py:681: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Deprecated in favor of operator or tf.math.divide.
    I0731 01:18:59.044318 140390238451456 estimator.py:1113] Done calling model_fn.
    I0731 01:19:02.425590 140390238451456 monitored_session.py:222] Graph was finalized.
    W0731 01:19:02.426923 140390238451456 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use standard file APIs to check for files with this prefix.
    I0731 01:19:02.463726 140390238451456 saver.py:1270] Restoring parameters from /opt/models/wgs/model.ckpt
    I0731 01:19:05.943398 140390238451456 session_manager.py:491] Running local_init_op.
    I0731 01:19:06.011964 140390238451456 session_manager.py:493] Done running local_init_op.
    I0731 01:19:06.536978 140390238451456 modeling.py:283] Reloading EMA...
    I0731 01:19:06.538686 140390238451456 saver.py:1270] Restoring parameters from /opt/models/wgs/model.ckpt
    2019-07-31 01:19:11.018226: W tensorflow/core/framework/allocator.cc:124] Allocation of 67891200 exceeds 10% of system memory.
    2019-07-31 01:19:11.110581: W tensorflow/core/framework/allocator.cc:124] Allocation of 24398400 exceeds 10% of system memory.
    2019-07-31 01:19:11.154066: W tensorflow/core/framework/allocator.cc:124] Allocation of 31736320 exceeds 10% of system memory.
    2019-07-31 01:19:11.352648: W tensorflow/core/framework/allocator.cc:124] Allocation of 29887488 exceeds 10% of system memory.
    2019-07-31 01:19:11.527675: W tensorflow/core/framework/allocator.cc:124] Allocation of 59774976 exceeds 10% of system memory.
    I0731 01:19:14.422972 140390238451456 call_variants.py:399] Processed 1 examples in 1 batches [1898.435 sec per 100]
    I0731 01:19:14.607350 140390238451456 call_variants.py:401] Done evaluating variants
    
    三、postprocess_variants

    用训练好的CNN模型来判断最可能的genotype

    1.原理:

    利用训练好的CNN模型,寻找与参考基因组不一样的位点(基于图像识别与参考基因组不同的位点,并由CNN算法(trained CNN)给出候选位点的genotype likehood值),将概率最大值作为该基因的基因型。

    2.命令:
    mkdir -p /deepvariant/quickstart-output/
    /opt/deepvariant/bin/postprocess_variants --ref "/opt/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta" --infile "/tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz" --outfile "/deepvariant/quickstart-output/output.vcf.gz"
    
    3.输入:

    参考基因组;
    第二步得到的训练模型;

    4.得到:
    root@bcffde6cb466:/deepvariant# ll /deepvariant/quickstart-output/
    total 20
    drwxr-xr-x 2 root root 4096 Jul 31 01:32 ./
    drwxr-xr-x 3 root root 4096 Jul 31 01:22 ../
    -rw-r--r-- 1 root root  533 Jul 31 01:22 output.g.vcf.gz
    -rw-r--r-- 1 root root 1526 Jul 31 01:32 output.vcf.gz
    -rw-r--r-- 1 root root  143 Jul 31 01:32 output.vcf.gz.tbi
    
    5.LOG输出:
    2019-07-31 01:32:06.448499: I deepvariant/postprocess_variants.cc:88] Read from: /tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz
    2019-07-31 01:32:06.449696: I deepvariant/postprocess_variants.cc:97] Done reading: /tmp/deepvariant_tmp_output/call_variants_output.tfrecord.gz. #entries in single_site_calls = 46
    2019-07-31 01:32:06.449717: I deepvariant/postprocess_variants.cc:101] Total #entries in single_site_calls = 46
    2019-07-31 01:32:06.449724: I deepvariant/postprocess_variants.cc:103] Start SortSingleSiteCalls
    2019-07-31 01:32:06.449744: I deepvariant/postprocess_variants.cc:105] Done SortSingleSiteCalls
    I0731 01:32:06.450007 140146169325312 postprocess_variants.py:907] CVO sorting took 2.54511833191e-05 minutes
    I0731 01:32:06.450489 140146169325312 postprocess_variants.py:909] Transforming call_variants_output to variants.
    I0731 01:32:06.450607 140146169325312 postprocess_variants.py:921] Writing variants to VCF.
    I0731 01:32:06.450683 140146169325312 postprocess_variants.py:628] Writing output to VCF file: /deepvariant/quickstart-output/output.vcf.gz
    I0731 01:32:06.451236 140146169325312 genomics_writer.py:172] Writing /deepvariant/quickstart-output/output.vcf.gz with NativeVcfWriter
    I0731 01:32:06.463268 140146169325312 postprocess_variants.py:633] 1 variants written.
    I0731 01:32:06.487912 140146169325312 postprocess_variants.py:929] VCF creation took 0.000621068477631 minutes
    

    相关文章

      网友评论

        本文标题:DeepVariant — step4 DeepVariant

        本文链接:https://www.haomeiwen.com/subject/zcutdctx.html