Bacterial genome assembly tutorial

Author: 浩瀚之宇 | Published: 2018-12-29 12:34


Source: https://bioinformatics.uconn.edu/bacterial-genome-assembly-tutorial/#

This tutorial will serve as an example of how to use free and open-source genome assembly and secondary scaffolding tools to generate high quality assemblies of bacterial sequence data. The bacterial sample used in this tutorial will be referred to simply as “Species” since it is live data. This data is paired-end data, meaning that there are forward and reverse reads, which we will designate as Sample_R1.fastq and Sample_R2.fastq, respectively.

Software download links:

Sickle
ABySS
SOAPdenovo
SPAdes
QUAST
SSPACE
AlignGraph

Assembly tutorial directory

<pre>/common/Assembly_Tutorial</pre>

Sickle: Quality control on raw reads

The first step is to perform quality control on the reads using sickle. To run the program we will use the sickle command. Since our reads are paired-end reads, we indicate this with the pe option. The -f flag designates the input file containing the forward reads, -r the input file containing the reverse reads, -o the output file containing the trimmed forward reads, -p the output file containing the trimmed reverse reads, and -s the output file containing trimmed singles. The -q flag designates the minimum quality, -l the minimum read length, and -t designates the type of read.

<pre class="p1">sickle pe -f /common/Assembly_Tutorial/Sample_R1.fastq -r /common/Assembly_Tutorial/Sample_R2.fastq -t sanger -o Sample_1.fastq -p Sample_2.fastq -s Sample_s.fastq -q 30 -l 45</pre>
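A quick sanity check after trimming is to count how many reads survived in each output file. The helper below is a minimal sketch, not part of sickle: `count_reads` is a hypothetical name, and it assumes standard four-line FASTQ records without wrapped sequence lines. It runs on a tiny generated file so that it works without the tutorial data.

```shell
#!/usr/bin/env bash
# count_reads: hypothetical helper (not a sickle command).
# Assumes each read occupies exactly four lines in the FASTQ file.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Tiny two-read demonstration file so the sketch runs anywhere.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$demo"
echo "reads in demo file: $(count_reads "$demo")"
rm -f "$demo"
```

On real data, the same function could be called on Sample_1.fastq, Sample_2.fastq, and Sample_s.fastq to see how many pairs and singles remain after trimming.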

The trimmed quality control files are located in /common/Assembly_Tutorial/Quality_Control and the script to perform the quality control is located at /common/Assembly_Tutorial/Quality_Control/Sample_QC.sh.

ABySS: de novo sequence assembler

ABySS is the first assembler we will use on our trimmed reads. Because the reads are paired-end, we run it with the abyss-pe command. The parameters are k for the k-mer size, name for the output file prefix, in for the paths to the trimmed forward and reverse reads, se for the path to the singles file, and np for the number of processors, which should match the number of processors requested in the header of your shell script.

<pre class="p1">abyss-pe np=8 k=31 name=Sample_Kmer31 in='/common/Assembly_Tutorial/Quality_Control/Sample_1.fastq /common/Assembly_Tutorial/Quality_Control/Sample_2.fastq' se='/common/Assembly_Tutorial/Quality_Control/Sample_s.fastq' # repeat for k=35, k=41, etc</pre>
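Repeating the assembly for several k-mer sizes is easiest with a loop. The sketch below only prints the commands (a dry run) so it can be inspected before anything is submitted; `abyss_cmd` and the variable `qc` are hypothetical names, and the k values are the three used in this tutorial.

```shell
#!/usr/bin/env bash
# Directory holding the trimmed reads (path taken from the tutorial).
qc=/common/Assembly_Tutorial/Quality_Control

# abyss_cmd: hypothetical helper that prints the abyss-pe command for one k.
abyss_cmd() {
    local k=$1
    printf "abyss-pe np=8 k=%s name=Sample_Kmer%s in='%s/Sample_1.fastq %s/Sample_2.fastq' se='%s/Sample_s.fastq'\n" \
        "$k" "$k" "$qc" "$qc" "$qc"
}

# Dry run: print one command per k-mer size.
for k in 31 35 41; do
    abyss_cmd "$k"
done
```

Piping the output to `bash` would execute the commands one after another; each run writes files with its own Sample_Kmer prefix, so the results do not overwrite each other.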

The kmers used in this example can be viewed as a starting point to get an idea of what kmer would best assemble the data. The assembly output files are located in /common/Assembly_Tutorial/Assembly/ABySS and the script to perform assembly is located at /common/Assembly_Tutorial/Assembly/Sample_assembly.sh. Note that this script also includes the assembly commands for SOAP and SPAdes.

SOAPdenovo: de novo sequence assembler

SOAPdenovo is another de novo sequence assembler. Unlike the other assemblers, SOAP uses a config file to pass information about the sequences into the program. The configuration file is shown below. Notable fields include the average insert size and read length, which differ depending on the sequencing technology, and q1, q2, and q, which give the paths to the trimmed forward, reverse, and singles reads.

<pre>#maximal read length
max_rd_len=250
[LIB]
#average insert size
avg_ins=550
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 250 bps of each read
rd_len_cutoff=250
#in which order the reads are used while scaffolding
rank=1
#cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#paths to the trimmed forward, reverse, and singles reads
q1=/common/Assembly_Tutorial/Quality_Control/Sample_1.fastq
q2=/common/Assembly_Tutorial/Quality_Control/Sample_2.fastq
q=/common/Assembly_Tutorial/Quality_Control/Sample_s.fastq</pre>

To run the assembler we will use the SOAPdenovo-63mer command with the all option (to perform kmer graph construction, contig error correction, mapping of reads to contigs, and scaffolding), -s for the path to the config file, -K for the size of the kmer, -R to resolve repeats using the reads, and -o for the output prefix. The shell redirections 1> and 2> send the assembly log (standard output) and the assembly errors (standard error) to separate files.

<pre>SOAPdenovo-63mer all -s /common/Assembly_Tutorial/Assembly/Sample.config -K 31 -R -o graph_Sample_31 1>ass31.log 2>ass31.err
# repeat for k=35, k=41, etc</pre>

The assembly output files are located in /common/Assembly_Tutorial/Assembly/SOAP.

SPAdes: de Bruijn graph based assembler

The last assembler we will run is SPAdes. SPAdes is different from the other assemblers in that it generates a final assembly from multiple kmers. A list of kmers is automatically selected by SPAdes using the maximum read length of the input data, and each individual kmer contributes to the final assembly. To run SPAdes we will use the spades.py command with the --careful option to minimize the number of mismatches in the contigs, -o for the output folder, -1 for the path to the forward reads, -2 for the path to the reverse reads, and -s for the path to the singles reads. If desired, a list of kmers can be specified with the -k flag which will override automatic kmer selection.

<pre>spades.py --careful -o SPAdes_out -1 /common/Assembly_Tutorial/Quality_Control/Sample_1.fastq -2 /common/Assembly_Tutorial/Quality_Control/Sample_2.fastq -s /common/Assembly_Tutorial/Quality_Control/Sample_s.fastq</pre>

The assembly output files are located in /common/Assembly_Tutorial/Assembly/SPAdes.

QUAST: assembly statistics

Now that we have several assemblies, it’s time to analyze the quality of each assembly. ABySS and SOAPdenovo both have their own statistics output, but for consistency, we will be using the program QUAST. The statistics we are most interested in are number of contigs, total length, and N50. A good assembly would have a low number of contigs, a total length that makes sense for the species, and a high N50 value. To run quast on all of our final assembly files we will run the following commands, with the only parameters used being the name of the assembly file(s) and output directory.

<pre># ABySS statistics
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Assembly/ABySS/Sample_Kmer*-scaffolds.fa -o ABySS</pre>

<pre># SOAPdenovo statistics
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Assembly/SOAP/graph_Sample_*.scafSeq -o SOAP</pre>

<pre># SPAdes statistics
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta -o SPAdes</pre>
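Since N50 is the headline statistic here, it helps to see how it is computed: sort the contig lengths from longest to shortest and walk down the list until the running sum reaches half of the total assembly length; the length of the contig at that point is the N50. The sketch below does this with awk and sort. `n50` is a hypothetical helper, not a QUAST command, and because it counts every sequence in the file, its result can differ from QUAST's report, which by default ignores contigs shorter than 500 bp.

```shell
#!/usr/bin/env bash
# n50: hypothetical helper that prints the N50 of a FASTA file.
n50() {
    # Step 1: print one length per sequence (handles multi-line sequences).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: longest first.
    sort -rn |
    # Step 3: accumulate until at least half of the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Demonstration: contigs of 8, 6, 4, and 2 bp (total 20 bp).
demo=$(mktemp)
printf '>c1\nAAAAAAAA\n>c2\nCCCCCC\n>c3\nGGGG\n>c4\nTT\n' > "$demo"
echo "N50 of demo file: $(n50 "$demo")"
rm -f "$demo"
```

In the demonstration the 8 bp contig alone covers less than half of the 20 bp total, while the 8 bp and 6 bp contigs together cover 14 bp, so the N50 is 6.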

ABySS results:

| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| --- | --- | --- | --- | --- | --- |
| Sample_Kmer31-scaffolds | 363 | 86593 | 2779506 | 32.76 | 14714 |
| Sample_Kmer35-scaffolds | 342 | 86909 | 2787431 | 32.75 | 16801 |
| Sample_Kmer41-scaffolds | 330 | 84960 | 2794086 | 32.76 | 17579 |

SOAP results:

| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| --- | --- | --- | --- | --- | --- |
| graph_Sample_31.scafSeq | 276 | 103125 | 3574101 | 32.44 | 26176 |
| graph_Sample_35.scafSeq | 246 | 86844 | 3543834 | 32.46 | 27766 |
| graph_Sample_41.scafSeq | 214 | 99593 | 3438095 | 32.46 | 36169 |

SPAdes results:

From the data, it’s clear that SPAdes performed the best. SPAdes generated only 59 contigs, compared to ~200 from SOAP and ~300 from ABySS. Its largest contig size and N50 value were also the highest. Finally, its total number of base pairs was the closest to that of a different strain of this bacterium that has already been sequenced. We will proceed to secondary scaffolding with this assembly, located in /common/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta.

QUAST’s output consists of a folder containing results in multiple formats within each of the three assembly directories. The script to run QUAST is located at /common/Assembly_Tutorial/QUAST/Sample_quast.sh.

SSPACE Standard

SSPACE is a script able to extend and scaffold pre-assembled contigs. SSPACE requires a library file containing the paths to the paired end reads, average insert size, and type of data. This file is located at /common/Assembly_Tutorial/Scaffolding/Species_library.txt.
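For orientation, a library file of this kind can be written with a single printf. The sketch below is an assumption-heavy illustration rather than the tutorial's actual file: it assumes the SSPACE Standard v3.0 layout of seven whitespace-separated columns (library name, aligner, forward reads, reverse reads, insert size, allowed insert-size error, orientation), uses the trimmed reads as input, and picks 0.25 as the error fraction and `Species_library.example.txt` as the file name purely as placeholders. Check the README shipped with your SSPACE version before relying on it.

```shell
#!/usr/bin/env bash
# All values below are illustrative assumptions, not the tutorial's real library file.
qc=/common/Assembly_Tutorial/Quality_Control
lib=Species_library.example.txt

# Assumed columns: name, aligner, reads 1, reads 2, insert size, allowed error, orientation
printf '%s %s %s %s %s %s %s\n' \
    Lib1 bowtie "$qc/Sample_1.fastq" "$qc/Sample_2.fastq" 550 0.25 FR > "$lib"

cat "$lib"
```

The insert size of 550 matches the value used elsewhere in this tutorial, and FR is the usual orientation for standard paired-end libraries.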

We will run SSPACE using a perl command with the parameters -l for the species library, -s for the fasta file containing assembled scaffolds, -b for the output prefix, and -T for the number of threads.

<pre class="p1">perl /opt/bioinformatics/SSPACE-STANDARD/SSPACE_Standard_v3.0.pl -l /common/Assembly_Tutorial/Scaffolding/Species_library.txt -s /common/Assembly_Tutorial/Assembly/SPAdes/scaffolds.fasta -b SSPACE -T 16</pre>

The output file is located at /common/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta. The script to run SSPACE is located at /common/Assembly_Tutorial/Scaffolding/Sample_sspace.sh.

We then will run QUAST on this file to compare it with previous assemblies. This time, we will run QUAST over the command line without a submit script, since it is only one line.

<pre>cd /common/Assembly_Tutorial/QUAST
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta -o SSPACE</pre>

Quast results:

AlignGraph on close relation (different strain of species)

AlignGraph is the final step in this assembly pipeline. From the documentation, “AlignGraph is a software that extends and joins contigs or scaffolds by reassembling them with help provided by a reference genome of a closely related organism.” By using a reference genome of a closely related organism, it can improve the assembly.

To run AlignGraph we first need to convert the raw reads from fastq format to fasta format. There are many ways to do this, but one of the most efficient ways is to use a sed command to parse out the reads from the fastq file:

<pre class="p1">sed -n '1~4s/^@/>/p;2~4p' /common/Assembly_Tutorial/Sample_R1.fastq > Sample_R1.fasta
sed -n '1~4s/^@/>/p;2~4p' /common/Assembly_Tutorial/Sample_R2.fastq > Sample_R2.fasta</pre>
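The conversion itself is simple: each FASTQ read occupies four lines (header, sequence, a "+" line, and quality scores), and FASTA keeps only the first two, with the leading @ of the header replaced by >. As an alternative to sed, the same step can be sketched in awk, which behaves the same across implementations. `fastq_to_fasta` is a hypothetical helper name, and the sketch assumes four-line records without wrapped sequences.

```shell
#!/usr/bin/env bash
# fastq_to_fasta: hypothetical helper that prints FASTA records to standard output.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: drop "@", add ">"
         NR % 4 == 2 { print }                     # sequence line: keep as-is
        ' "$1"
}

# Demonstration on a tiny two-read file.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$demo"
fastq_to_fasta "$demo"
rm -f "$demo"
```

On the tutorial data this would be called as `fastq_to_fasta /common/Assembly_Tutorial/Sample_R1.fastq > Sample_R1.fasta`, and likewise for the reverse reads.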

Then we will run AlignGraph using the AlignGraph command and the parameters --read1 for the forward read in fasta format, --read2 for the reverse read in fasta format, --contig for the path to the assembly we are rescaffolding, and --genome for the path to the reference genome we are using for rescaffolding. The genome we are using is named AlignGraph_genome.fasta, again to protect the live data.

Additionally, we have to define the --distanceLow and --distanceHigh parameters. Following the documentation, distanceLow is set to the larger of (insert size - 1000) and the insert size, and distanceHigh is set to (insert size + 1000). The insert size of this dataset is 550, giving us a distanceLow of 550 and a distanceHigh of 1550. Finally, we define the output file names using --extendedContig and --remainingContig. The --remainingContig file will contain the final assembly.

<pre class="p1">AlignGraph --read1 /common/Assembly_Tutorial/Scaffolding/Sample_R1.fasta --read2 /common/Assembly_Tutorial/Scaffolding/Sample_R2.fasta --contig /common/Assembly_Tutorial/Scaffolding/SSPACE/SSPACE.final.scaffolds.fasta --genome /common/Assembly_Tutorial/Scaffolding/AlignGraph_genome.fasta --distanceLow 550 --distanceHigh 1550 --extendedContig Species_extendedContigs.fa --remainingContig Species_remainingContigs.fa</pre>
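The two distance bounds follow from simple arithmetic on the insert size, which can be sketched in shell as below. `distance_window` is a hypothetical helper; its second argument is the lower limit that distanceLow may not fall below, which this tutorial takes to be the insert size itself. It is worth checking the AlignGraph documentation for the recommended lower limit for your own library.

```shell
#!/usr/bin/env bash
# distance_window: hypothetical helper, prints "<distanceLow> <distanceHigh>".
#   $1 = insert size, $2 = lower limit for distanceLow
distance_window() {
    insert=$1
    floor=$2
    low=$(( insert - 1000 ))
    # Take the larger of (insert - 1000) and the lower limit.
    if [ "$low" -lt "$floor" ]; then low=$floor; fi
    high=$(( insert + 1000 ))
    echo "$low $high"
}

# Values used in this tutorial: insert size 550, lower limit 550.
distance_window 550 550   # prints "550 1550"
```

For a library with a much larger insert size, say 3000, the same rule would give 2000 and 4000.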

The output file is located at /common/Assembly_Tutorial/Scaffolding/AlignGraph/Species_remainingContigs.fa. The script to run AlignGraph is located at /common/Assembly_Tutorial/Scaffolding/Sample_aligngraph.sh.

Then QUAST:

<pre>cd /common/Assembly_Tutorial/QUAST
python /opt/bioinformatics/quast-2.3/quast.py /common/Assembly_Tutorial/Scaffolding/AlignGraph/Species_remainingContigs.fa -o AlignGraph</pre>

| Assembly | # contigs | Largest contig | Total length | GC (%) | N50 |
| --- | --- | --- | --- | --- | --- |
| Species_remainingContigs | 57 | 255551 | 2880249 | 32.65 | 147660 |

Unfortunately, this dataset was not improved by AlignGraph with this specific genome, but this tutorial still illustrates the general idea.
