Assembly¶
We can now proceed with the assembly of short and long reads. We will assemble each library (2 short and 1 long reads library) reseparately and compare the assemblies.
Assembly using Illumina short reads¶
We will assemble the M. tuberculosis genome using the qced short reads, separately for 20X and 60X libraries. We will be using SPAdes
assembler, which is a de Bruijn graph
based short read assembler.
The parameters that go into a short read assembler can be tricky to optimize and tuning them is quite time consuming. The most important parameter is the k-mer
size to be used to built the de Bruijn graph
. SPAdes
makes the parameterization quite easy and will select parameter values automatically.
The assembly for the 20X library will be computed quickly but for the 60X library the computation will take a few hours. ONLY RUN for 20X, DONOT RUN for 60X. For 60X, we will use precalculated results.
You can run SPAdes
as follows using 4 threads (-t 4
):
spades.py -o $MYRESULTSDIR/short_long_reads_assembly/MTB_illumina20X_spades -t 4 --12 $MYRESULTSDIR/qc/MTB_illumina20X_qced_R1R2.fastq
spades.py -o $MYRESULTSDIR/short_long_reads_assembly/MTB_illumina60X_spades -t 4 --12 $MYRESULTSDIR/qc/MTB_illumina60X_qced_R1R2.fastq
Copy the precalculated assembly directory:
cp -r ~/mtb_genomics_workshop_data/genome-assembly-annotation/precalculated/MTB_illumina60X_spades $MYRESULTSDIR/short_long_reads_assembly/MTB_illumina60X_spades
Assembly using PacBio long reads¶
Now, we will generate an assembly of the M. tuberculosis genome using PacBio long reads. Longer reads are better at resolving repeats and therefor will typically give more contiguous assemblies compared to short reads. However, long reads have a significantly higher error rate so consensus base qualities are expected to be much lower.
To assemble long reads, we will use canu
, an overlap-layour-consensus
assembler. Generate the assembly with the following command (this will take about ~30 min.).
Again, we will use precomputed results.
canu gnuplotTested=true -p MTB_pacbio_canu -d MTB_pacbio_canu_auto genomeSize=4.4m -pacbio-raw MTB_pacbio_circular-consensus-sequence-reads.fastq
- Copy the precomputed assembly:
- cp -r ~/mtb_genomics_workshop_data/genome-assembly-annotation/precalculated/MTB_pacbio_canu_auto $MYRESULTSDIR/short_long_reads_assembly/MTB_pacbio_canu_auto
Now you should have 3 separate M. tuberculosis assemblies. Copy the fasta
files for the assembled contigs into a separate directory for convenience:
mkdir $MYRESULTSDIR/short_long_reads_assembly/assemblies
cp $MYRESULTSDIR/short_long_reads_assembly/MTB_pacbio_canu_auto/MTB_pacbio_canu.contigs.fasta $MYRESULTSDIR/short_long_reads_assembly/assemblies/MTB_pacbio_assembly.fasta
cp $MYRESULTSDIR/short_long_reads_assembly/MTB_illumina20X_spades/contigs.fasta $MYRESULTSDIR/short_long_reads_assembly/assemblies/MTB_illumina20X_assembly.fasta
cp $MYRESULTSDIR/short_long_reads_assembly/MTB_illumina60X_spades/contigs.fasta $MYRESULTSDIR/short_long_reads_assembly/assemblies/MTB_illumina60X_assembly.fasta
Next, we will look into these assemblies in detail.