QCing and filtering the read alignments

Before we can attempt variant calling, we need to make sure that the alignments that are resulted from mapping artefacts are identified. We also want to get an overall picture on the mapping quality.

Duplicate marking

During the lecture, you heard about library and optical duplicates, which if they exist will bias the variant calling. We will mark these duplicates using picard.

First, let’s sort the .bam file, this time using picard. See the note in the previous section. This should give identical results to samtools sort, we are simply doing it this way for pipelining consistency.

java -Xmx16g -Djava.io.tmpdir=/tmp -jar /opt/picard.jar SortSam \
I=$MYRESULTS/mapping/${sample}.mapped.bam O=$MYRESULTS/mapping/${sample}.sorted.bam SORT_ORDER=coordinate

Now, we can mark the duplicates:

java -Xmx16g -Djava.io.tmpdir=/tmp -jar /opt/picard.jar MarkDuplicates \
I=$MYRESULTS/mapping/${sample}.sorted.bam O=$MYRESULTS/mapping/${sample}.sorted.nodup.bam \
REMOVE_DUPLICATES=true ASSUME_SORT_ORDER=coordinate M=$MYRESULTS/mapping/${sample}.sorted.dupmetrics

The metrics file will contain some statistics about the de-duplication. In the resulting bam file, only one fragment from each duplicate group survives unchanged, other duplicate fragments are given a FLAG 0x400 and will NOT be used downstream. Optimally, detection and marking of duplicate fragments should be done per library, i.e., over all read groups corresponding to a given library. In practice, often it is fine to take a shortcut and is sufficient to do it per lane (read group) to keep things naively parallelizable.

Assessing depth of coverage

You might often want to take a peak at the depth of coverage using the read alignments. You can do this with samtools depth:

samtools depth -a $MYRESULTS/mapping/${sample}.sorted.nodup.bam > $MYRESULTS/mapping/${sample}.coverage.txt

This will give you the depth of coverage for each position in the genome.

Question: How many lines should be in this file? Question: Can you write an awk one-liner to quickly compute the average depth of average?

Assessing mapping qualities

Next, we will generate summary statistics on mapping qualities. For this, we will use qualimap. qualimap has a GUI but we don’t want to use it here. To avoid launching the GUI, we do unset DISPLAY first.

unset DISPLAY;
qualimap bamqc -bam $MYRESULTS/mapping/${sample}.sorted.nodup.bam --java-mem-size=4g -outformat PDF -outdir $MYRESULTS/mapping/${sample}.bam.qualimap

The results will be in a text file named ${sample}.bam.qualimap/genome_results.txt. Later when we scale our analysis up to multiple isolates, we will parse out these files to get summary statistics for all of the isolates.

Indel realignment and filtering low quality alignments

We will use pilon to call our variants. This will give us a candidate list from a single sample, which we can further filter after we merge variants from multiple samples. pilon with its --variant option will perform indel realignment. We can also specify depth and mapping quality based filters. Run pilon:

java -Xmx16g -jar /opt/pilon-1.22.jar --genome $MYDATA/ref/NC_000962.3.fna --bam $MYRESULTS/mapping/${sample}.sorted.nodup.bam --output $MYRESULTS/mapping/${sample}_pilon --variant  --mindepth 10 --minmq 40 --minqual 20

The output will be a .vcf file and a .fasta file. .fasta file has the sequence from this reference based assembly, which we won’t be using downstream.