Read Quality Control and Trimming¶
To perform data quality control, we will use FastQC
, a tool that processes reads and presents visual reports.
First, peak into the first few lines of .fastq
files for the Illumina reads from 60X sequencing run
head -n 10 $MYDATADIR/MTB_illumina60X_R1.fastq
head -n 10 $MYDATADIR/MTB_illumina60X_R2.fastq
What can you tell about the headers? What is the read length
for this run? Can you quickly count the number of reads (Hint: use grep
and wc
)?
Running fastqc
¶
We will now run fastqc
. For most tools used in practicals, you can get help by using -h
or -help
flag. Type the following to get help for fastqc
:
fastqc -h
Run fastqc
for forward
reads from 60X and 20X runs saving the results under your results directory:
fastqc -o $MYRESULTSDIR/qc $MYDATADIR/MTB_illumina60X_R1.fastq
fastqc -o $MYRESULTSDIR/qc $MYDATADIR/MTB_illumina20X_R1.fastq
Now run it again for reverse
reads:
fastqc -o $MYRESULTSDIR/qc $MYDATADIR/MTB_illumina60X_R2.fastq
fastqc -o $MYRESULTSDIR/qc $MYDATADIR/MTB_illumina20X_R2.fastq
Repeat for PacBio reads.
fastqc -o $MYRESULTSDIR/qc $MYDATADIR/MTB_pacbio_subreads.fastq
fastqc -o $MYRESULTSDIR/qc $MYDATADIR/MTB_pacbio_circular-consensus-sequence-reads.fastq
fastqc
performs a suite of tests and then for each indicates whether the test has been passed (green tick) or failed (red cross). Notice tough that fastqc
has no idea about what your library should look like. All of its tests are based on a completely random library with 50% GC content and therefore it will show the test as failed if your sample doesn’t match these assumptions. For instance, you might have reads sampled from a high AT or high GC genome, and it will fail per sequence GC content. You need to be aware of how your library been generated and interpret the fastqc
reports taking that into account.
Question: For a whole genome shotgun library, what would you expect “per base sequence content” to look like? Question: What if you had amplicon sequences from a multiplex PCR assay?
Now, open fastqc
reports in a browser:
firefox $MYRESULTSDIR/qc/MTB_illumina60X_R1_fastqc.html
firefox $MYRESULTSDIR/qc/MTB_illumina60X_R2_fastqc.html
firefox $MYRESULTSDIR/qc/MTB_illumina20X_R1_fastqc.html
firefox $MYRESULTSDIR/qc/MTB_illumina20X_R2_fastqc.html
firefox $MYRESULTSDIR/qc/MTB_pacbio_R1_fastqc.html
firefox $MYRESULTSDIR/qc/MTB_pacbio_R2_fastqc.html
- Per base sequence quality: This is probably one of the most important diagnostics. Inspecting this, you can determine what quality trimming parameters you should use in downstream analysis.
- Per base sequence content: For a randomly sampled genome with a GC content of 50%, we would expect that at any given position within a read, there will be an equal probability of finding a A,C,G, or T base. Our library from M. tuberculosis is expected to reflect the GC content, which holds here. There seems to be some minor bias at the start of the read, which may be due to PCR duplicates during amplification or during library preparation.
- Sequence duplication levels: In a whole genome shotgun library, very few reads are expected to occur more than once. In a shotgun library of a targeted region, a low level of duplication can simply be the result of high level of sampling (i.e. high coverage). In addition, duplication might also indicate some kind of enrichment bias (i.e. PCR overamplification)
https://sequencing.qcfail.com/ is a good resource where people post about different ways a sequencing run might have failed.
Running sickle
to quality trim reads:¶
Now that we have an idea about the quality profile of our reads, we can do quality based trimming of reads. We will use a Q-score
threshold of 30 (-q 30), remove all reads that are shorter than 100 bp after trimming (-l 100), and also trim 10 bp from 5’ end just to reduce bias. After trimming, singletons reads (reads whose pair dropped off) are save into a separate file:
sickle pe -t sanger -q 30 -l 100 -n 10 \
-f $MYDATADIR/MTB_illumina60X_R1.fastq \
-r $MYDATADIR/MTB_illumina60X_R2.fastq \
-o $MYRESULTSDIR/qc/MTB_illumina60X_qced_R1.fastq \
-p $MYRESULTSDIR/qc/MTB_illumina60X_qced_R2.fastq \
-s $MYRESULTSDIR/qc/MTB_illumina60X_qced_single.fastq 1>$MYRESULTSDIR/qc/MTB_illumina60X_sickle.log
sickle pe -t sanger -q 30 -l 100 -n 10 \
-f $MYDATADIR/MTB_illumina20X_R1.fastq \
-r $MYDATADIR/MTB_illumina20X_R2.fastq \
-o $MYRESULTSDIR/qc/MTB_illumina20X_qced_R1.fastq \
-p $MYRESULTSDIR/qc/MTB_illumina20X_qced_R2.fastq \
-s $MYRESULTSDIR/qc/MTB_illumina20X_qced_single.fastq 1>$MYRESULTSDIR/qc/MTB_illumina20X_sickle.log