Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobials 2016 - Melb, AU -...

Bioinformatics tools for the diagnostic laboratory

A/Prof Torsten Seemann

Victorian Life Sciences Computation Initiative (VLSCI)
Microbiological Diagnostic Unit Public Health Laboratory (MDU PHL)

Doherty Applied Microbial Genomics (DAMG)
The University of Melbourne

ASA 2016 - Melbourne, AU - Sat 27 Feb 2016

Doherty Applied Microbial Genomics

Lead bioinformatician ♥ microbial genomics

Whole genome sequencing

The currency of genomics

Reads

Reads are stored in FASTQ files
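A minimal sketch of reading FASTQ, assuming the common four-line record layout (`@` header, sequence, `+` separator, quality string) and Phred+33 quality encoding. The function name `read_fastq` and the two toy records are illustrative only, not taken from the talk.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so quality score = ASCII code - 33
        yield header[1:], sequence, [ord(c) - 33 for c in quality]


# Two hypothetical 4 bp reads, just to exercise the parser
example = "@read1\nAGTC\n+\nIIII\n@read2\nTTAC\n+\n!!5I\n"
records = list(read_fastq(io.StringIO(example)))
for name, sequence, qualities in records:
    print(name, sequence, qualities)
```

Real FASTQ files are usually gzip-compressed and hold millions of records, so production tools stream them rather than loading them into a list.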

Genome

Types of sequence reads

100 - 300 bp (paired)

100 - 400 bp

5,000 - 15,000+ bp

5,000 - 50,000+ bp

What data do we really have?

Isolate genome
Sequenced reads

Other isolates in sequencing run

Contamination
Sequencing adaptors
Spike-in controls, e.g. phiX

Unsequenced regions

Do we have enough data?

∷ Depth
: expressed as fold-coverage of the genome, e.g. 25x
: 25x means each base was sequenced 25 times (on average)

∷ Coverage
: the % of the genome sequenced with depth > 0
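The two definitions can be sketched in a few lines. This is a toy illustration: reads are given as hypothetical (start, length) pairs on a 10 bp genome, and the read count, read length and genome size in the back-of-envelope estimate are made-up numbers chosen to give 25x.

```python
def depth_profile(genome_length, reads):
    """Per-base depth for reads given as (start, length) pairs, 0-based."""
    depth = [0] * genome_length
    for start, length in reads:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def mean_depth(depth):
    """Average number of times each base was sequenced (the 'x' figure)."""
    return sum(depth) / len(depth)


def coverage_percent(depth):
    """Percentage of positions with depth > 0."""
    return 100.0 * sum(1 for d in depth if d > 0) / len(depth)


def expected_depth(n_reads, read_length, genome_size):
    """Back-of-envelope depth before any alignment is done."""
    return n_reads * read_length / genome_size


toy_reads = [(0, 4), (2, 4), (6, 2)]  # hypothetical placements
profile = depth_profile(10, toy_reads)
print(profile, mean_depth(profile), coverage_percent(profile))
print(expected_depth(500_000, 150, 3_000_000))
```

Note that a genome can have a high mean depth and still have coverage below 100% if some regions were never sequenced.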

25x

Genome data itself is of limited value.

Needs “extra” information

□ location: Australia 37.8S,145.0E
□ date: 2015 2015-07-20
□ source: human 60yo male faecal swab
□ etc.

Metadata

Got my reads, now what?

Two options

∷ De novo genome assembly
: reconstruct the original sequence from reads alone
: like a giant jigsaw puzzle
: “create”

∷ Align to reference
: identify where each read fits on a related genome
: reads cannot always be uniquely placed
: “compare”

De novo genome assembly

Amplified DNA

Shear DNA

Sequenced reads

Overlaps

Layout

Consensus ↠ “Contigs”
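The overlap, layout and consensus steps can be imitated with a toy greedy assembler. This is a sketch under strong assumptions (error-free reads, one strand, no repeats); real assemblers are far more sophisticated. The function names and the three 8 bp reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


toy_reads = ["AGTCTTGA", "TTGACCGA", "CCGATAGG"]  # hypothetical reads
print(greedy_assemble(toy_reads))
```

Reads that do not overlap anything by at least `min_len` bases stay as separate contigs, which is one reason a draft assembly has more than one piece.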

The effect of read length

250 bp - Illumina - $200
8000 bp - PacBio - $2000

The problem with repeats

Repeat copy 1
Repeat copy 2

Collapsed repeat consensus

1 locus

4 contigs

Align to reference

Seven short 4bp reads
AGTC TTAC GGGA CTTT TAGG TTTA ATAG

Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC       TAGG     ATAG  TTAC
    TTTA     GGGA       CTTT

Eight short 4bp reads
AGTC TTAC GGGA CTTT TAGG TTTA ATAG TTAT

Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC       TAGG     ATAG  TTAC
    TTTA     GGGA       CTTT
     TTAT
        TTAT

Ambiguous alignment
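The ambiguity can be reproduced with plain exact-match search. This is a sketch, not how a real read aligner works: it assumes the reference is the first 31 bases of the sequence shown on the slide, and it allows no mismatches. The name `find_placements` is illustrative.

```python
# Assumption: the 31 bp reference recovered from the slide text
REFERENCE = "AGTCTTTATTATAGGGAGCCATAGCTTTACA"
READS = ["AGTC", "TTAC", "GGGA", "CTTT", "TAGG", "TTTA", "ATAG", "TTAT"]


def find_placements(reference, read):
    """Return every 0-based position where the read matches the reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # +1 so overlapping hits are found
    return positions


placements = {read: find_placements(REFERENCE, read) for read in READS}
ambiguous = [read for read, positions in placements.items() if len(positions) > 1]

for read, positions in placements.items():
    print(read, positions)
print("more than one placement:", ambiguous)
```

TTAT matches at two positions, so the aligner cannot know which copy it came from. With reads this short, exact matching alone also finds a second position for some of the other reads; longer reads make unique placement much more likely.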

D’oh!

Best practice

■ Use both approaches
□ reference-based + de novo

■ Best of both worlds
□ and worst of both worlds - interpretation is non-trivial

■ Still need
□ good epidemiology, metadata and domain knowledge!

The one true assay?

Applications of WGS

∷ Diagnostics
: species ⇒ subspecies ⇒ strain identification
: in silico antibiogram and virulence profile

∷ Surveillance
: in silico genotyping - MLST, serotyping, VNTR, MLVA
: what’s lurking in our hospital/community?

∷ Forensics
: outbreak detection
: source tracking

Isolate identification

∷ Can be done in seconds
∷ Directly from reads (or a subset)
∷ Scan against an index of unique k-mers (oligos)
∷ Accurate to species level (on average)
∷ Great for quality control!

Kraken, MetaPhlAn, One Codex
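The idea of scanning reads against an index of unique k-mers can be sketched as below. This is a toy illustration only, not the algorithm used by any of the tools named above; the two reference sequences, the species labels and k = 5 are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer that occurs in exactly one reference to that reference's name."""
    owners = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            owners.setdefault(kmer, set()).add(name)
    return {kmer: next(iter(names)) for kmer, names in owners.items() if len(names) == 1}


def classify(read, index, k):
    """Vote with the read's unique k-mers; None if no k-mer is in the index."""
    votes = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy references
REFERENCES = {
    "species_A": "AGTCTTTATTATAGGG",
    "species_B": "CCATAGCTTTACAGGC",
}
K = 5
index = build_index(REFERENCES, K)

print(classify("TTATTATAG", index, K))
print(classify("GCTTTACA", index, K))
print(classify("GGGGGGGG", index, K))
```

The k-mer CTTTA occurs in both toy references, so it is left out of the index and never casts a vote. A read with no indexed k-mers is left unclassified, which in a QC setting is itself a useful signal.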

One Codex example metagenome output

Antibiogram

∷ The “resistome”

∷ Resistance-specific genes
: we have good databases of these
: easy to identify to the exact allele, e.g. blaNDM-9

∷ New alleles conferring resistance
: databases are poor (exceptions include M. tuberculosis)
: novel mechanisms arrive de novo

ResFinder, CARD, ARG-Annot

SRST2, ABRicate

ABRicate example E. faecium output

START    END      GENE         COVERAGE     COVERAGE_MAP     GAPS  %COVERAGE  %IDENTITY
7140     7902     erm(B)       1-762/762    ========/======  1     100.00     99.08
8627     9421     aph(3')-III  1-795/795    ===============  0     100.00     100.00
11040    11948    ant(6)-Ia    1-345/909    =====..........  0     35.00      100.00
15456    16257    lnu(B)       1-804/804    ========/======  2     99.75      99.63
573128   575046   tet(M)       1-1920/1920  ========/======  1     99.95      99.95
770130   770792   VanR-B       1-663/663    ===============  0     100.00     99.25
770792   772135   VanS-B       1-1344/1344  ===============  0     100.00     99.63
772306   773112   VanY-B       1-807/807    ===============  0     100.00     100.00
773130   773957   VanW-B       1-828/828    ===============  0     100.00     97.58
773954   774925   VanH-B       1-972/972    ===============  0     100.00     99.38
774918   775946   VanA-B       1-1029/1029  ===============  0     100.00     98.93
775952   776560   VanX-B       1-609/609    ===============  0     100.00     96.72
2352083  2352631  aac(6')-Ii   1-549/549    ===============  0     100.00     99.64
2789984  2791462  msr(C)       1-1479/1479  ===============  0     100.00     98.92
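A report like this is easy to post-process. The sketch below parses whitespace-separated rows using the eight column names shown on the slide (current ABRicate output may have different or additional columns) and flags partial gene hits. The 90% cut-off is an arbitrary illustrative threshold, not a recommendation.

```python
# Three rows copied from the slide; column names as shown there
SLIDE_ROWS = """\
7140 7902 erm(B) 1-762/762 ========/====== 1 100.00 99.08
11040 11948 ant(6)-Ia 1-345/909 =====.......... 0 35.00 100.00
774918 775946 VanA-B 1-1029/1029 =============== 0 100.00 98.93
"""

COLUMNS = ["START", "END", "GENE", "COVERAGE", "COVERAGE_MAP",
           "GAPS", "%COVERAGE", "%IDENTITY"]


def parse_hits(text):
    """Turn whitespace-separated rows into dicts with typed numeric fields."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        for key in ("START", "END", "GAPS"):
            hit[key] = int(hit[key])
        for key in ("%COVERAGE", "%IDENTITY"):
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def partial_hits(hits, min_coverage=90.0):
    """Genes whose reported %COVERAGE is below the (hypothetical) threshold."""
    return [hit["GENE"] for hit in hits if hit["%COVERAGE"] < min_coverage]


hits = parse_hits(SLIDE_ROWS)
print(partial_hits(hits))
```

A partial hit such as ant(6)-Ia may be a truncated gene or a gene split across contigs, so it deserves a manual look before being reported as resistance.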

Virulence profile

∷ The “virulome”

∷ Curated databases
: known virulence genes
: pathogenicity islands

∷ Caveats
: variable representation across organisms

VirulenceFinder, VFDB, MvirDB, ViPR, PAI DB

Backward compatibility

MLST
Resistome
Virulome
NG-MAST
MLVA
VNTR
Serotyping
Phage typing
PFGE

SRST2, mlst, ngmaster, lissero, and many more!

When typing lets us down

Typing resolution

Focus on a small “informative” section

Genotype shows isolates are related

D’oh!

Exploiting the whole genome

A familiar tree

Every SNP is sacred

∷ Chocolate bar tree
: branches were based on phenotypic attributes
: size, colour, filling, texture, ingredients, flavour

∷ Genomic trees
: want to use every part of the genome sequence
: need to find all differences between isolates

Finding differences

Reference  AGTCTGATTAGCTTAGCTTGTAGCGCTATATTAT
Reads      AGTCTGATTAGCTTAGAT
                 ATTAGCTTAGATTGTAG
                      CTTAGATTGTAGC-C
               TGATTAGCTTAGATTGTAGC-CTATAT
                   TAGCTTAGATTGTAGC-CTATATT
                        TAGATTGTAGC-CTATATTA
                        TAGATTGTAGC-CTATATTAT
                           ^       ^
                           SNP     Deletion

Snippy, VarScan, SAMtools, GATK and many more!
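Once a read has been placed on the reference, spotting the differences is a base-by-base comparison. The sketch below assumes the 34 bp reference and one of the gapped reads from the slide, with the read's offset (4) worked out by hand. It handles only substitutions and deleted bases marked with `-`, not insertions, and it is not how the tools named above are implemented.

```python
# Assumption: the 34 bp reference as recovered from the slide text
REFERENCE = "AGTCTGATTAGCTTAGCTTGTAGCGCTATATTAT"


def call_differences(reference, aligned_read, offset):
    """Compare a gapped read with the reference, starting at a 0-based offset.

    Returns (position, reference_base, read_base, kind) tuples.
    A '-' in the read marks a reference base that is missing from the read.
    """
    differences = []
    for i, base in enumerate(aligned_read):
        position = offset + i
        ref_base = reference[position]
        if base == "-":
            differences.append((position, ref_base, "-", "deletion"))
        elif base != ref_base:
            differences.append((position, ref_base, base, "SNP"))
    return differences


read = "TGATTAGCTTAGATTGTAGC-CTATAT"  # one read from the slide
for difference in call_differences(REFERENCE, read, offset=4):
    print(difference)
```

In practice a variant is only believed when several independent reads agree on it, which is why sequencing depth matters for SNP calling.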

SNP distance matrix
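A SNP distance matrix is the count of differing positions for every pair of isolates. The sketch below assumes the isolates are already aligned to the same coordinates and ignores positions where either sequence has a gap or an ambiguous base. The four isolate names and sequences are made up.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of positions where two aligned sequences differ (A/C/G/T only)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")


def distance_matrix(alignment):
    """Symmetric dict-of-dicts of pairwise SNP distances."""
    names = list(alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = snp_distance(alignment[n], alignment[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


# Hypothetical aligned isolates
ALIGNMENT = {
    "iso1": "AGTCTGATTA",
    "iso2": "AGTCTGATTA",
    "iso3": "AGTATGATTC",
    "iso4": "AGT-TGATTC",
}
matrix = distance_matrix(ALIGNMENT)
for name, row in matrix.items():
    print(name, [row[other] for other in ALIGNMENT])
```

Skipping gapped positions is a design choice: counting them as differences would inflate distances for isolates with lower coverage.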

Annotated tree

∷ 1 SNP resolution

∷ Distinguishes clades within genotypes

∷ Interpretation is not straightforward

10 SNPs

L. monocytogenes

Same tree!

Dendrogram

Spanning

Radial

Reference based analysis

∷ Implies you have a “close” reference
: need to be careful with draft genomes

∷ Very sensitive
: single mutation precision

∷ May not be complete
: ignores novel DNA in your isolate

Inferring transmission

∷ Identical sequence does not imply transmission

∷ Easier to rule out than in

The pan genome

Align all your isolate genomes

Find “common” segments

The core genome

Core is common to all & has similar sequence.

Example pan genome

Roary, LS-BSR, OrthoMCL, Degust

Rows are genomes, columns are genes.

Core

∷ Common DNA
∷ Vertical evolution
∷ Genotyping
∷ Phylogenetics

Accessory

∷ Novel DNA
∷ Lateral transfer
∷ Plasmids
∷ Mobile elements
∷ Partly unexploited
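The core/accessory split can be sketched from a gene presence/absence table (rows are genomes, columns are genes). This toy version looks only at presence; it does not check that the shared genes also have similar sequence, which the tools listed earlier do. The isolate names and gene sets are hypothetical.

```python
def split_pan_genome(presence):
    """Split genes into core (in every genome) and accessory (in only some).

    presence maps genome name -> set of gene names found in that genome.
    """
    gene_sets = [set(genes) for genes in presence.values()]
    pan = set().union(*gene_sets)          # every gene seen in any genome
    core = set.intersection(*gene_sets)    # genes seen in all genomes
    return pan, core, pan - core


# Hypothetical presence/absence data
PRESENCE = {
    "isolate_A": {"gyrA", "rpoB", "recA", "vanA"},
    "isolate_B": {"gyrA", "rpoB", "recA"},
    "isolate_C": {"gyrA", "rpoB", "recA", "tetM", "vanA"},
}
pan, core, accessory = split_pan_genome(PRESENCE)
print("pan:", sorted(pan))
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

Adding more isolates can only shrink the core and grow the pan genome, so both depend on which isolates are included.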

Progress at MDU-PHL

Traditional workflow

Modern workflow

Nullarbor

∷ Software pipeline
: does “reads to report”
: cloud image available (mGVL)

∷ Under active development
: used at MDU-PHL for the past year for routine jobs
: also used by USA CDC Enterics, FSS Qld, and research

∷ National access programme underway

null arbor = “no trees”

Doherty Applied Microbial Genomics

■ Non-profit service available
□ fixed price per isolate

■ Genome sequencing
□ Illumina NextSeq 500

■ Bioinformatics analysis
□ Nullarbor

■ Report
□ QC, typing, resistome, phylogeny
□ plus your raw data

Sharing is caring

Open science

∷ Crowd-sourcing provably works
: EHEC outbreak 2011
: Ebola, MERS, Zika

∷ But only if people share
: sequencing data
: metadata
: software source code for analysis

GenomeTrakr

∷ International cooperation
: led by FDA + NCBI
: >20 collaborating institutes inc. UK PHE, DK DTU, MX
: Salmonella and Listeria

∷ Public SRA BioProject #183844
: real-time submission of WGS genome reads
: nightly updates of phylogenomic trees
: contains ~25,000 strains of Salmonella

“GenomeTrakka”

∷ A shared online system for all Australian labs
: upload samples
: automated standard/specific analyses
: simple reports and visualization
: easy to submit to international archives (SRA)

∷ Access control
: each lab controls their own data
: jurisdictions can share data in national outbreaks

Final thoughts

Does WGS deliver?

Yes!

Bioinformatics
Epidemiology
Technology
Microbiology

This means scientists

not just software

Domain expertise

Always changing...

Acknowledgements

Ben Howden
Tim Stinear
Dieter Bulach
Jason Kwong

Anders G da Silva

Contact

tseemann.github.io


@torstenseemann

The End

Thank you for listening.
