Inspecting, summarizing, and manipulating the read alignments

We will now inspect and further process the mapping output. For this, we will use samtools suite, which as its name suggests let us massage the mapping information.

In this section, we will also be playing with the read alignments to have a working understanding of how to get information from and manipulate them.

Converting sam to bam and back

First, let’s convert our .sam mapping output to its binary counterpart .bam. The binary format is much easier for computers to process but very difficult for humans to read.

To convert .sam to .bam, we use the samtools view command. We must specify that our input is in .sam format (by default it expects .bam) using the -S option. We will also explicitely specify that we want the output to be .bam (by default it also produces bam) with the -b option. samtools follows the UNIX convention of sending its output to the UNIX STDOUT, so we need to use a redirect operator (“>”) to create a BAM file from the output.

samtools view -S -b $MYRESULTS/mapping/${sample}.mapped.sam > $MYRESULTS/mapping/${sample}.mapped.bam

If you need to convert back .bam to .sam and look at it on the fly without saving to disk, you would do:

samtools view $MYRESULTS/mapping/${sample}.mapped.bam | head -n 10

Getting mapping stats directly from .bam

We can also determine mapping statistics directly from the bam file. Use for instance the following (which uses FLAG options, -F and -f, more on that later below):

samtools view -c -F 4 $MYRESULTS/mapping/${sample}.mapped.bam
samtools view -c -f 4 $MYRESULTS/mapping/${sample}.mapped.bam

What does these numbers outputted to STDOUT represent -invoke samtools’s help to answer this-. Do things make sense and add up?

Sorting the read alignments

When you map and align the reads to the reference, the resulting read alignments are in random order with respect to their position in the reference genome. In other words, the .bam file is in the order that the sequences occurred in the input .fastq. Prove to yourself that this is the case:

samtools view $MYRESULTS/mapping/${sample}.mapped.bam | more

Doing anything useful downstream such as calling variants or visualizing the alignments requires that the .bam is further manipulated. It must be sorted such that the alignments occur in “genome order”. That is, ordered positionally based upon their alignment coordinates on each chromosome (we obviously have a single one here). Sort the .bam file:

samtools sort $MYRESULTS/mapping/${sample}.mapped.bam -o $MYRESULTS/mapping/${sample}.sorted.bam

After sorting, check the order:

samtools view $MYRESULTS/mapping/${sample}.sorted.bam | more

Question (easy!): What do you notice?

Note

Sorting can also be done using picard. We will be using picard for removing duplicate alignments after sorting. When we build our variant calling pipeline, we will do the sorting with picard just to keep things consistent with downstream processing. The results should be identical.

Indexing read alignments (i.e. indexing .bam)

Indexing a genome sorted .bam file allows one to quickly extract alignments overlapping particular genomic regions. Indexing is also required by genome viewers (for instance IGV) so that the viewers can quickly display alignments in each genomic region to which you navigate. This is essential for large genomes.

Index your .bam file:

samtools index $MYRESULTS/mapping/${sample}.sorted.bam

This will create an additional index file -what’s that file’s extension?-

For instance, now that we have indexed the .bam file, we have the flexibility to extract the alignments from a defined genomic region. We can, for example, extract alignments from the 150th kilobase of the genome.

samtools view $MYRESULTS/mapping/${sample}.sorted.bam NC_000962.3:150000-151000

Question: How many alignments are within NC_000962.3:150000-151000?

Detailed inspection of some alignments

Again, let’s inspect the first 10 alignments in our .bam file in detail:

samtools view $MYRESULTS/mapping/${sample}.sorted.bam | head -n 10

Let’s also inspect just the header. The header in a bam file records important information regarding the reference genome to which the reads were aligned, as well as other information about how the BAM has been processed. We can ask the view command to report solely the header by using the -H option. As the downstream programs further process the alignments, they will typically add information about what they did to the header. For now, our header contain bare minimum information.

samtools view -H $MYRESULTS/mapping/${sample}.sorted.bam

The FLAG field in the .bam format encodes several key pieces of information regarding how an alignment aligned to the reference genome. We can use this information to isolate specific types of alignments that we want to use in our analysis. Here is a table for the flags right from the manual:

Bit Description
0x1 template having multiple segments in sequencing
0x2 each segment properly aligned according to the aligner
0x4 segment unmapped
0x8 next segment in the template unmapped
0x10 SEQ being reverse complemented
0x20 SEQ of the next segment in the template being reversed
0x40 the first segment in the template
0x80 the last segment in the template
0x100 secondary alignment
0x200 not passing quality controls
0x400 PCR or optical duplicate
0x800 supplementary alignment

For example, we often want to call variants solely from paired-end sequences that aligned “properly” to the reference genome. Can you tell why?

To ask the view command to report solely “proper pairs” we use the -f option and ask for alignments where the second bit is true (proper pair is true).

samtools view -f 0x2 $MYRESULTS/mapping/${sample}.sorted.bam
  • Question: How many properly paired alignments are there? How does that number compare to the number you got above using -F?
  • Question: How many improperly paired alignments are there?
  • Question: Using what you played with, how would you calculate the fragment size distribution?