Genomic Cloud Computing Book Recommendation: Genomics in the Cloud


Author: 生物信息与育种 | Published: 2022-05-23 18:16

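    For anyone learning genomic cloud computing alongside me, here is a book recommendation: Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. The author is a GATK community manager, and the book was published in 2020, so it is still fairly recent.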



    GitHub repository:
    genomics-in-the-cloud

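    The book covers:

    • Background on basic genomics and computing technology
    • Basic cloud computing operations
    • Getting started with GATK, plus three major GATK Best Practices pipelines
    • Automating analysis with scripted workflows using WDL and Cromwell
    • Scaling up workflow execution in the cloud, including parallelization and cost optimization
    • Interactive analysis in the cloud using Jupyter notebooks
    • Secure collaboration and computational reproducibility using Terra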

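    The book has three weaknesses: it is thick, it devotes a lot of space to the Broad's own products even though most of us will rarely use its cloud platform, Terra, and its typesetting is poor. It is also written around the human genome, so its scope is limited. Still, reading selected chapters is a good option, since there are so few books on this topic.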

    The table of contents follows. To get the PDF version, follow the WeChat official account Bioinfarmer and reply with the keyword: cloud.

    1. Introduction
      The Promises and Challenges of Big Data in Biology and Life Sciences
      Infrastructure Challenges
      Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
      Cloud-Hosted Data and Compute
      Platforms for Research in the Life Sciences
      Standardization and Reuse of Infrastructure
      Being FAIR
      Wrap-Up and Next Steps
    2. Genomics in a Nutshell: A Primer for Newcomers to the Field
      Introduction to Genomics
      The Gene as a Discrete Unit of Inheritance (Sort Of)
      The Central Dogma of Biology: DNA to RNA to Protein
      The Origins and Consequences of DNA Mutations
      Genomics as an Inventory of Variation in and Among Genomes
      The Challenge of Genomic Scale, by the Numbers
      Genomic Variation
      The Reference Genome as Common Framework
      Physical Classification of Variants
      Germline Variants Versus Somatic Alterations
      High-Throughput Sequencing Data Generation
      From Biological Sample to Huge Pile of Read Data
      Types of DNA Libraries: Choosing the Right Experimental Design
      Data Processing and Analysis
      Mapping Reads to the Reference Genome
      Variant Calling
      Data Quality and Sources of Error
      Functional Equivalence Pipeline Specification
      Wrap-Up and Next Steps
    3. Computing Technology Basics for Life Scientists
      Basic Infrastructure Components and Performance Bottlenecks
      Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
      Levels of Compute Organization: Core, Node, Cluster, and Cloud
      Addressing Performance Bottlenecks
      Parallel Computing
      Parallelizing a Simple Analysis
      From Cores to Clusters and Clouds: Many Levels of Parallelism
      Trade-Offs of Parallelism: Speed, Efficiency, and Cost
      Pipelining for Parallelization and Automation
      Workflow Languages
      Popular Pipelining Languages for Genomics
      Workflow Management Systems
      Virtualization and the Cloud
      VMs and Containers
      Introducing the Cloud
      Categories of Research Use Cases for Cloud Services
      Wrap-Up and Next Steps
    4. First Steps in the Cloud
      Setting Up Your Google Cloud Account and First Project
      Creating a Project
      Checking Your Billing Account and Activating Free Credits
      Running Basic Commands in Google Cloud Shell
      Logging in to the Cloud Shell VM
      Using gsutil to Access and Manage Files
      Pulling a Docker Image and Spinning Up the Container
      Mounting a Volume to Access the Filesystem from Within the Container
      Setting Up Your Own Custom VM
      Creating and Configuring Your VM Instance
      Logging into Your VM by Using SSH
      Checking Your Authentication
      Copying the Book Materials to Your VM
      Installing Docker on Your VM
      Setting Up the GATK Container Image
      Stopping Your VM…to Stop It from Costing You Money
      Configuring IGV to Read Data from GCS Buckets
      Wrap-Up and Next Steps
    5. First Steps with GATK
      Getting Started with GATK
      Operating Requirements
      Command-Line Syntax
      Multithreading with Spark
      Running GATK in Practice
      Getting Started with Variant Discovery
      Calling Germline SNPs and Indels with HaplotypeCaller
      Filtering Based on Variant Context Annotations
      Introducing the GATK Best Practices
      Best Practices Workflows Covered in This Book
      Other Major Use Cases
      Wrap-Up and Next Steps
    6. GATK Best Practices for Germline Short Variant Discovery
      Data Preprocessing
      Mapping Reads to the Genome Reference
      Marking Duplicates
      Recalibrating Base Quality Scores
      Joint Discovery Analysis
      Overview of the Joint Calling Workflow
      Calling Variants per Sample to Generate GVCFs
      Consolidating GVCFs
      Applying Joint Genotyping to Multiple Samples
      Filtering the Joint Callset with Variant Quality Score Recalibration
      Refining Genotype Assignments and Adjusting Genotype Confidence
      Next Steps and Further Reading
      Single-Sample Calling with CNN Filtering
      Overview of the CNN Single-Sample Workflow
      Applying 1D CNN to Filter a Single-Sample WGS Callset
      Applying 2D CNN to Include Read Data in the Modeling
      Wrap-Up and Next Steps
    7. GATK Best Practices for Somatic Variant Discovery
      Challenges in Cancer Genomics
      Somatic Short Variants (SNVs and Indels)
      Overview of the Tumor-Normal Pair Analysis Workflow
      Creating a Mutect2 PoN
      Running Mutect2 on the Tumor-Normal Pair
      Estimating Cross-Sample Contamination
      Filtering Mutect2 Calls
      Annotating Predicted Functional Effects with Funcotator
      Somatic Copy-Number Alterations
      Overview of the Tumor-Only Analysis Workflow
      Creating a Somatic CNA PoN
      Applying Denoising
      Performing Segmentation and Call CNAs
      Additional Analysis Options
      Wrap-Up and Next Steps
    8. Automating Analysis Execution with Workflows
      Introducing WDL and Cromwell
      Installing and Setting Up Cromwell
      Your First WDL: Hello World
      Learning Basic WDL Syntax Through a Minimalist Example
      Running a Simple WDL with Cromwell on Your Google VM
      Interpreting the Important Parts of Cromwell’s Logging Output
      Adding a Variable and Providing Inputs via JSON
      Adding Another Task to Make It a Proper Workflow
      Your First GATK Workflow: Hello HaplotypeCaller
      Exploring the WDL
      Generating the Inputs JSON
      Running the Workflow
      Breaking the Workflow to Test Syntax Validation and Error Messaging
      Introducing Scatter-Gather Parallelism
      Exploring the WDL
      Generating a Graph Diagram for Visualization
      Wrap-Up and Next Steps
    9. Deciphering Real Genomics Workflows
      Mystery Workflow #1: Flexibility Through Conditionals
      Mapping Out the Workflow
      Reverse Engineering the Conditional Switch
      Mystery Workflow #2: Modularity and Code Reuse
      Mapping Out the Workflow
      Unpacking the Nesting Dolls
      Wrap-Up and Next Steps
    10. Running Single Workflows at Scale with Pipelines API
      Introducing the GCP Genomics Pipelines API Service
      Enabling Genomics API and Related APIs in Your Google Cloud Project
      Directly Dispatching Cromwell Jobs to PAPI
      Configuring Cromwell to Communicate with PAPI
      Running Scattered HaplotypeCaller via PAPI
      Monitoring Workflow Execution on Google Compute Engine
      Understanding and Optimizing Workflow Efficiency
      Granularity of Operations
      Balance of Time Versus Money
      Suggested Cost-Saving Optimizations
      Platform-Specific Optimization Versus Portability
      Wrapping Cromwell and PAPI Execution with WDL Runner
      Setting Up WDL Runner
      Running the Scattered HaplotypeCaller Workflow with WDL Runner
      Monitoring WDL Runner Execution
      Wrap-Up and Next Steps
    11. Running Many Workflows Conveniently in Terra
      Getting Started with Terra
      Creating an Account
      Creating a Billing Project
      Cloning the Preconfigured Workspace
      Running Workflows with the Cromwell Server in Terra
      Running a Workflow on a Single Sample
      Running a Workflow on Multiple Samples in a Data Table
      Monitoring Workflow Execution
      Locating Workflow Outputs in the Data Table
      Running the Same Workflow Again to Demonstrate Call Caching
      Running a Real GATK Best Practices Pipeline at Full Scale
      Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
      Examining the Preloaded Data
      Selecting Data and Configuring the Full-Scale Workflow
      Launching the Full-Scale Workflow and Monitoring Execution
      Options for Downloading Output Data—or Not
      Wrap-Up and Next Steps
    12. Interactive Analysis in Jupyter Notebook
      Introduction to Jupyter in Terra
      Jupyter Notebooks in General
      How Jupyter Notebooks Work in Terra
      Getting Started with Jupyter in Terra
      Inspecting and Customizing the Notebook Runtime Configuration
      Opening Notebook in Edit Mode and Checking the Kernel
      Running the Hello World Cells
      Using gsutil to Interact with Google Cloud Storage Buckets
      Setting Up a Variable Pointing to the Germline Data in the Book Bucket
      Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
      Visualizing Genomic Data in an Embedded IGV Window
      Setting Up the Embedded IGV Browser
      Adding Data to the IGV Browser
      Setting Up an Access Token to View Private Data
      Running GATK Commands to Learn, Test, or Troubleshoot
      Running a Basic GATK Command: HaplotypeCaller
      Loading the Data (BAM and VCF) into IGV
      Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
      Visualizing Variant Context Annotation Data
      Exporting Annotations of Interest with VariantsToTable
      Loading R Script to Make Plotting Functions Available
      Making Density Plots for QUAL by Using makeDensityPlot
      Making a Scatter Plot of QUAL Versus DP
      Making a Scatter Plot Flanked by Marginal Density Plots
      Wrap-Up and Next Steps
    13. Assembling Your Own Workspace in Terra
      Managing Data Inside and Outside of Workspaces
      The Workspace Bucket as Data Repository
      Accessing Private Data That You Manage Outside of Terra
      Accessing Data in the Terra Data Library
      Re-Creating the Tutorial Workspace from Base Components
      Creating a New Workspace
      Adding the Workflow to the Methods Repository and Importing It into the Workspace
      Creating a Configuration Quickly with a JSON File
      Adding the Data Table
      Filling in the Workspace Resource Data Table
      Creating a Workflow Configuration That Uses the Data Tables
      Adding the Notebook and Checking the Runtime Environment
      Documenting Your Workspace and Sharing It
      Starting from a GATK Best Practices Workspace
      Cloning a GATK Best Practices Workspace
      Examining GATK Workspace Data Tables to Understand How the Data Is Structured
      Getting to Know the 1000 Genomes High Coverage Dataset
      Copying Data Tables from the 1000 Genomes Workspace
      Using TSV Load Files to Import Data from the 1000 Genomes Workspace
      Running a Joint-Calling Analysis on the Federated Dataset
      Building a Workspace Around a Dataset
      Cloning the 1000 Genomes Data Workspace
      Importing a Workflow from Dockstore
      Configuring the Workflow to Use the Data Tables
      Wrap-Up and Next Steps
    14. Making a Fully Reproducible Paper
      Overview of the Case Study
      Computational Reproducibility and the FAIR Framework
      Original Research Study and History of the Case Study
      Assessing the Available Information and Key Challenges
      Designing a Reproducible Implementation
      Generating a Synthetic Dataset as a Stand-In for the Private Data
      Overall Methodology
      Retrieving the Variant Data from 1000 Genomes Participants
      Creating Fake Exomes Based on Real People
      Mutating the Fake Exomes
      Generating the Definitive Dataset
      Re-Creating the Data Processing and Analysis Methodology
      Mapping and Variant Discovery
      Variant Effect Prediction, Prioritization, and Variant Load Analysis
      Analytical Performance of the New Implementation
      The Long, Winding Road to FAIRness
      Final Conclusions

    https://www.oreilly.com/library/view/genomics-in-the/9781491975183/
    https://www.amazon.ca/Genomics-Cloud-GATK-Spark-Docker/dp/1491975199
