Processing and Merging SNVs from multiple isolates to get an MSA

What we need to do next is to get a set of high-quality SNVs across the whole genome. A SNV, single nucleotide variant (variant means substitution here), is defined as a variation in a single locus without any reference to its frequency in the population. A SNP or single nucleotide polymorphism, on the other hand is a variation for which we have evidence that it is fixed in the population. That means we should see it in more than a certain (arbitrarily defined but often >1%) number of individuals/isolates. Therefore, not all the SNVs are SNPs. Although this is the working definition we will be following, there is no consensus on this defition and all single nucleotide variants are called SNPs independent of their frequency in the population.

Previously in the workshop…

vcf file we generated using pilon contains unfiltered SNVs as well as indels. To remind, pilon did a series of filtering and quality control steps before coming up with this list SNVs and indels. These steps include
  1. indel realignment
  2. base quality score filtering
  3. mapping quality score filtering
  4. depth of coverage filtering.

So all SNVs in our vcf file have been, to the best of our ability, stripped off any known read mapping/alignment artifacts, has a reasonably high base quality, depth of coverage, and mapping quality.

Even after these steps, we expect a relatively high false positive rate within this list so we will need to do some further filtering. The assumption is that we are leaning towards minimizing false positive rate. We might be missing some true variants during this strict filtering steps. However, realize that we want to make sure that what we have as a final list (for instance to infer phylogeny) contain only true variants. That is we don’t care so much if we do miss few true variants here and there (that is higher false negative rate is somewhat acceptable).

Why are we doing all this?

The end point for this analysis is to get an MSA (multiple sequence alignment) of the variable sites, which we can then use to infer phylogeny. Keep that in mind as you dive in.

Separating SNVs and indels and filtering SNVs within highly repetitive regions

First, we will need to get rid of the indels (insertion/deletion). As you’ve heard during the lecture, read mappers donot do a particularly good job with indels. Any called indel by any mapper, you should be very suspicious of. We want to remove them for phylogenetic analysis -we should already have lots of variation to infer phylogeny- as they will bias the phylogenetic inference. In the next step, we will use vcftools to make 2 separate vcf files, one that contains only SNVs and the other indels. Concurrently, we will also remove any variant that sits within a highly repetitive region. Next, a few words on why we want to do this, particularly in analyzing M. tuberculosis genomes.


Repetitive regions in M. tuberculosis genome: A significant portion of the genome of Mycobacterium tuberculosis contains highly repetitive regions (every microbial genome will have some level of tandem and more complex repetitive regions, regions with low sequence complexity). Short-read sequencing isn’t sufficient to make accurate SNV calls within these regions. Therefore we want to remove these regions from further analysis as they will confuse downstream phylogenetic analysis. We have precalculated the repetitive regions, which is provided in the Mtb_repeats.bed file in bed format. bed is a file format used to specify genomic intervals. Overall, the total length of repetitive regions specified in Mtb_repeats.bed is ~572 Kb (~13% of the genome). The majority of these regions (~10% of the genome) belong to two large unrelaed gene families (PE and PPE families) clustered in the genome. They code for glycine-rich proteins that are often based on multiple copies of the polymorphic repetitive sequences referred to as PGRSs restructured text. (**P**olymorphic guanine cytosine–rich (GC-rich) repetitive sequences). Despite their prominence in the genome, we almost know nothing about the biological significance of these regions.

Run vcftools to separate SNVs/indels and filter repetitive regions:

# get rid of "_pilon" in the filename
mv ${sample}_pilon.vcf ${sample}.vcf
vcftools --gzvcf ${sample}.vcf --exclude-bed $SRCBASE/data/ref/Mtb_repeats.bed --remove-indels --recode --recode-INFO-all --out $sample
vcftools --gzvcf ${sample}.vcf --exclude-bed $SRCBASE/data/ref/Mtb_repeats.bed --keep-only-indels --recode --recode-INFO-all --out ${sample}.indels

# take a peek at the indels file! We will be leaving it behind from this point on
more ${sample}.indels.recode.vcf
# and the vcftools log file
more ${sample}.indels.log


vcf files are typically huge (~0.5 GB even for a single microbial genome). Compressing vcf files with bgzip and indexing them with tabix is the standard way vcf files are stored. Once that is done, further processing of the binary files is done with bcftools.

Compress and index the vcf file. Notice that we won’t use the vcf file for indels for this analysis.

bgzip ${sample}.recode.vcf
tabix -p vcf ${sample}.recode.vcf.gz

We need to change the sample header (which is “SAMPLE”, the very last column within the data fields) in the vcf file to specify that these variants are from $sample. This is to ensure data provenance: we will soon merge vcf files from multiple samples and they all have the same header at this point. You can do this with bcftools which requires the header name to be stored in a text file, so we write the sample name into a file called “sample” first.

echo $sample > sample
bcftools reheader -s sample ${sample}.recode.vcf.gz > m.${sample}.recode.vcf.gz # we add this "m" just to make the filename different in this temp file
mv -f m.${sample}.recode.vcf.gz ${sample}.recode.vcf.gz

# reindex
tabix -f -p vcf ${sample}.recode.vcf.gz

# make sure nothing that ``pilon`` didn't let it "PASS" doesn't get in
bcftools view --include 'FILTER="PASS"' ${sample}.recode.vcf.gz -O z -o temp

mv -f temp ${sample}.recode.vcf.gz
# reindex and we are done
tabix -f -p vcf ${sample}.recode.vcf.gz

All we did here was modifying sample “header” and some simple “data massaging”. If you want, you can unzip the ${sample}.recode.vcf.gz file, peek into it, and do some other quick sanity check:

gunzip ${sample}.recode.vcf.gz
# pay attention to the column name for the very last column, does it look ok?
more ${sample}.recode.vcf
wc ${sample}.vcf
wc ${sample}.recode.vcf

Merge vcf files from multiple isolates

Assuming that we did read mapping etc. for multiple isolates (which you will learn how to do using snakemake), at this point we can merge all the vcf files from multiple samples into a single vcf file. We will use that single vcf to define *SNP*s that be used for phylogenetic analysis:

# the way this following command is constructed is intentionally explicit so you can see what is happening
# notice that this list of 99 isolates contains our prototyping isolate, "ERS050945", and 98 additional ones
bcftools merge \
ERS050923.recode.vcf.gz \
ERS050924.recode.vcf.gz \
ERS050925.recode.vcf.gz \
ERS050926.recode.vcf.gz \
ERS050927.recode.vcf.gz \
ERS050928.recode.vcf.gz \
ERS050929.recode.vcf.gz \
ERS050930.recode.vcf.gz \
ERS050931.recode.vcf.gz \
ERS050932.recode.vcf.gz \
ERS050933.recode.vcf.gz \
ERS050934.recode.vcf.gz \
ERS050935.recode.vcf.gz \
ERS050936.recode.vcf.gz \
ERS050937.recode.vcf.gz \
ERS050938.recode.vcf.gz \
ERS050939.recode.vcf.gz \
ERS050940.recode.vcf.gz \
ERS050942.recode.vcf.gz \
ERS050943.recode.vcf.gz \
ERS050944.recode.vcf.gz \
ERS050945.recode.vcf.gz \
ERS050946.recode.vcf.gz \
ERS050947.recode.vcf.gz \
ERS050948.recode.vcf.gz \
ERS050949.recode.vcf.gz \
ERS050950.recode.vcf.gz \
ERS050951.recode.vcf.gz \
ERS050953.recode.vcf.gz \
ERS050954.recode.vcf.gz \
ERS050955.recode.vcf.gz \
ERS050956.recode.vcf.gz \
ERS050957.recode.vcf.gz \
ERS050958.recode.vcf.gz \
ERS050959.recode.vcf.gz \
ERS050960.recode.vcf.gz \
ERS050961.recode.vcf.gz \
ERS050962.recode.vcf.gz \
ERS050963.recode.vcf.gz \
ERS050964.recode.vcf.gz \
ERS050965.recode.vcf.gz \
ERS050966.recode.vcf.gz \
ERS050968.recode.vcf.gz \
ERS050969.recode.vcf.gz \
ERS050970.recode.vcf.gz \
ERS050971.recode.vcf.gz \
ERS050972.recode.vcf.gz \
ERS053603.recode.vcf.gz \
ERS053604.recode.vcf.gz \
ERS053605.recode.vcf.gz \
ERS053606.recode.vcf.gz \
ERS053607.recode.vcf.gz \
ERS053608.recode.vcf.gz \
ERS053609.recode.vcf.gz \
ERS053610.recode.vcf.gz \
ERS053611.recode.vcf.gz \
ERS053612.recode.vcf.gz \
ERS053613.recode.vcf.gz \
ERS053614.recode.vcf.gz \
ERS053615.recode.vcf.gz \
ERS053616.recode.vcf.gz \
ERS053617.recode.vcf.gz \
ERS053618.recode.vcf.gz \
ERS053619.recode.vcf.gz \
ERS053620.recode.vcf.gz \
ERS053621.recode.vcf.gz \
ERS053622.recode.vcf.gz \
ERS053623.recode.vcf.gz \
ERS053624.recode.vcf.gz \
ERS053625.recode.vcf.gz \
ERS053626.recode.vcf.gz \
ERS053627.recode.vcf.gz \
ERS053628.recode.vcf.gz \
ERS053629.recode.vcf.gz \
ERS053630.recode.vcf.gz \
ERS053631.recode.vcf.gz \
ERS053632.recode.vcf.gz \
ERS053633.recode.vcf.gz \
ERS053634.recode.vcf.gz \
ERS053635.recode.vcf.gz \
ERS053636.recode.vcf.gz \
ERS053637.recode.vcf.gz \
ERS053638.recode.vcf.gz \
ERS053639.recode.vcf.gz \
ERS053641.recode.vcf.gz \
ERS053642.recode.vcf.gz \
ERS053643.recode.vcf.gz \
ERS053644.recode.vcf.gz \
ERS053651.recode.vcf.gz \
ERS053656.recode.vcf.gz \
ERS053659.recode.vcf.gz \
ERS053661.recode.vcf.gz \
ERS053663.recode.vcf.gz \
ERS053667.recode.vcf.gz \
ERS053672.recode.vcf.gz \
ERS053684.recode.vcf.gz \
ERS096323.recode.vcf.gz \
ERS181351.recode.vcf.gz \
ERS181603.recode.vcf.gz \
-O v -o merged-99RussianIsolates.vcf

Merging step requires a reasonable amount of memory and takes some time. We will be using precalculated file. Now, what we essentially have is a vcf file that contains all the information to make an MSA of the variable position in the genome. Since everything is already defined with respect to a reference genome, we don’t need to build an MSA, per se, it will already just fall off once we get rid of the non variable sites. That is what snp-sites does.


Merging multiple vcf files, the way it has been implemented here, is highly compute heavy. It requires high memory instances when you have 100s to 1000s of isolates.

In what follows, we first convert vcf to a fasta file using (despite its name, it also does this) before calling SNPs. compute heavy, DONOT RUN

# -m: keep all SNVs independent of their frequency, typically with a larger number of isolates
# we would want to impose a threshold --fasta --phylip-disable -m 1 --input merged-99RussianIsolates.vcf

# -m: output fasta, -v: output VCF, -p: output PHYLIP
snp-sites -m merged-99RussianIsolates.min1.fasta -o 99RussianIsolates.msa.fasta

99RussianIsolates.msa.fasta contains our multiple sequence alignment.