Microbial Genome Annotation

Now that we have assembled the reads into contigs, we will annotate the genome. That means we will predict genes and assign functional annotation to them. M. tuberculosis is a very clonal organism and its genome is already well annotated so typically when we assemble a new M. tuberculosis isolate, there is no need to do de novo annotation. Although we are using M. tuberculosis as a working example here, the methodology is applicable to all microbial genomes.

A set of programs are available for genome annotation tasks. Here we will use Prokka, a pipeline comprising several open source bioinformatic tools and databases. Prokka automates the process of locating open reading frames (ORFs) and RNA regions on contigs, translating ORFs to protein sequences, searching for protein homologs and producing standard output files. For gene finding and translation, Prokka uses a program named prodigal. Homology searches are then performed using BLAST and HMMER based on the translated protein sequences as queries against a set of public databases (CDD, PFAM, TIGRFAM) as well as custom databases that are included with Prokka.

Create a directory to hold the annotation results:

mkdir -p $MYRESULTSDIR/annotation

Define a shell variable for the final contigs:

FINALCONTIGS=$MYRESULTSDIR/short_long_reads_assembly/polished_pacbio_assembly/MTB_pacbio_assembly_shortreadcorrected.fasta

Run Prokka (will take ~30 minutes for our assembly). DONOT RUN, copy the precalculated results.

prokka $FINALCONTIGS --outdir $MYRESULTSDIR/annotation/MTB_pacbio_assembly_shortreadcorrected --norrna --notrna --metagenome --cpus 8

Prokka produces several types of output:

  • a .gff file, which is a standardised, tab-delimited, format for genome annotations
  • a Genbank (.gbk) file, which is a more detailed description of nucleotide sequences and the genes encoded in these.

Copy the precalculated results:

mkdir -p $MYRESULTSDIR/annotation/MTB_pacbio_assembly_shortreadcorrected/
cp ~/mtb_genomics_workshop_data/genome-assembly-annotation/precalculated/prokka_precalculated/* $MYRESULTSDIR/annotation/MTB_pacbio_assembly_shortreadcorrected/

When your dataset has been annotated you can view the annotations directly in the gff file. Peak into the gff file by doing:

more -S $MYRESULTSDIR/annotation/MTB_pacbio_assembly_shortreadcorrected/PROKKA_07052018.gff

Question: How many coding regions were found by Prodigal? (Hint: use grep) Question: How many of the coding regions were given an enzyme identifier? Question: How many were given a COG identifier?

Visualizing the annotations with Artemis

Artemis is a graphical Java program to browse annotated genomes. Launch Artemis and load gff file:

art&

Pick one of favorite M. tuberculosis genes, and try finding it (Goto–>Navigator–>Goto Feature with Key Name)