Bioinformatic tools for the diagnostic laboratory
A/Prof Torsten Seemann
Victorian Life Sciences Computation Initiative (VLSCI)
Microbiological Diagnostic Unit Public Health Laboratory (MDU PHL)
Doherty Applied Microbial Genomics (DAMG)
The University of Melbourne
ASA 2016 - Melbourne, AU - Sat 27 Feb 2016
Doherty Applied Microbial Genomics
Lead bioinformatician ♥ microbial genomics
Whole genome sequencing
The currency of genomics
Reads
Reads are stored in FASTQ files
Genome
Types of sequence reads
100 - 300 bp (paired)
100 - 400 bp
5,000 - 15,000+ bp
5,000 - 50,000+ bp
What data do we really have?
Isolate genome
Sequenced reads
Other isolates in sequencing run
Contamination
Sequencing adaptors
Spike-in controls eg. phiX
Unsequenced regions
Do we have enough data?
∷ Depth: expressed as fold-coverage of genome eg. 25x: means each base sequenced 25 times (on average)
∷ Coverage: the % of genome sequenced with depth > 0
25x
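The two definitions above reduce to simple arithmetic. The sketch below is illustrative only: the function names and the example numbers (500,000 reads of 150 bp against a 3 Mbp genome) are assumptions, not figures from the talk.

```python
def mean_depth(read_lengths, genome_size):
    """Fold-coverage: total sequenced bases divided by genome length."""
    return sum(read_lengths) / genome_size


def coverage_percent(per_base_depth):
    """Percentage of genome positions with depth > 0."""
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return 100.0 * covered / len(per_base_depth)


# Hypothetical run: 500,000 reads of 150 bp on a 3,000,000 bp genome.
depth = mean_depth([150] * 500_000, 3_000_000)
print(f"{depth:.0f}x")

# Hypothetical per-base depths for a tiny 4 bp "genome".
print(coverage_percent([0, 3, 5, 0]))
```

A high average depth does not guarantee full coverage: some regions can stay at depth 0 even when the mean is 25x.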
Genome data itself is of limited value.
Needs “extra” information
□ location: Australia 37.8S,145.0E
□ date: 2015 2015-07-20
□ source: human 60yo male faecal swab
□ etc.
Metadata
Got my reads, now what?
Two options
∷ De novo genome assembly: reconstruct original sequence from reads alone: like a giant jigsaw puzzle: “create”
∷ Align to reference: identify where each read fits on a related genome: can not always be uniquely placed: “compare”
De novo genome assembly
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
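The overlap, layout and consensus steps can be sketched as a toy greedy assembler. This is a minimal illustration, assuming error-free reads and no repeats; the reads and the function names are made up for the example, and real assemblers use far more sophisticated graph methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical reads sheared from one short sequence.
reads = ["ATGGCGTG", "GCGTGCAAT", "GCAATCCGA"]
print(greedy_assemble(reads))
```

When no overlap of at least `min_len` remains, whatever sequences are left are reported as separate contigs.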
The effect of read length
250 bp - Illumina - $200
8000 bp - PacBio - $2000
The problem with repeats
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
1 locus
4 contigs
Align to reference
Seven short 4bp reads
AGTC TTAC GGGA CTTT TAGG TTTA ATAG
Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC TAGG ATAG TTAC
TTTA GGGA CTTT
Eight short 4bp reads
AGTC TTAC GGGA CTTT TAGG TTTA ATAG TTAT
Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC TAGG ATAG TTAC
TTTA GGGA CTTT TTAT TTAT
Ambiguous alignment
D’oh!
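The ambiguity is easy to reproduce with exact string matching. The sketch below uses the 31bp reference from the slide and reports every 0-based position where a read matches; `find_all` is a hypothetical helper, and real aligners also tolerate mismatches and gaps.

```python
REFERENCE = "AGTCTTTATTATAGGGAGCCATAGCTTTACA"  # 31bp reference from the slide


def find_all(reference, read):
    """Return every 0-based start position where read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping hits
    return positions


for read in ["AGTC", "TTAT"]:
    hits = find_all(REFERENCE, read)
    status = "unique" if len(hits) == 1 else "ambiguous"
    print(read, hits, status)
```

Running the same check over the other 4bp reads shows that very short reads often match in more than one place, which is why longer reads are easier to place.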
Best practice
■ Use both approaches
□ reference-based + de novo
■ Best of both worlds
□ and worst of both worlds - interpretation is non-trivial
■ Still need
□ good epidemiology, metadata and domain knowledge!
The one true assay?
Applications of WGS
∷ Diagnostics: species ⇒ subspecies ⇒ strain identification: in silico antibiogram and virulence profile
∷ Surveillance: in silico genotyping - MLST, serotyping, VNTR, MLVA: what’s lurking in our hospital/community?
∷ Forensics: outbreak detection: source tracking
Isolate identification
∷ Can be done in seconds
∷ Directly from reads (or subset)
∷ Scan against index of unique k-mers (oligos)
∷ Species level accurate (on average)
∷ Great for quality control!
Kraken, MetaPhlAn, One Codex
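The idea of scanning reads against an index of unique k-mers can be sketched in a few lines. This is a simplified illustration, not how Kraken, MetaPhlAn or One Codex are actually implemented; the two toy "genomes", the species names and the choice of k=5 are all assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_unique_index(genomes, k):
    """Map each k-mer seen in exactly one species to that species."""
    owner, shared = {}, set()
    for species, seq in genomes.items():
        for kmer in kmers(seq, k):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != species:
                del owner[kmer]      # seen in two species: not unique
                shared.add(kmer)
            else:
                owner[kmer] = species
    return owner


def classify(read, index, k):
    """Vote with the read's k-mers; None if no unique k-mer is hit."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical reference sequences sharing the suffix GGCCTTAA.
genomes = {"species_a": "ACGTACGTGGCCTTAA", "species_b": "TTGACCATGGCCTTAA"}
index = build_unique_index(genomes, k=5)
print(classify("ACGTACGTGG", index, k=5))
print(classify("GGCCTTAA", index, k=5))
```

Reads that fall entirely in sequence shared between species cannot be assigned, which is one reason identification is accurate only on average.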
One Codex example metagenome output
Antibiogram
∷ The “resistome”
∷ Resistance specific genes: we have good databases of these: easy to identify to exact allele eg. blaNDM-9
∷ New alleles conferring resistance: databases are poor (exceptions include M.tb): novel mechanisms arrive de novo
ResFinder, CARD, ARG-Annot
SRST2, ABRicate
ABRicate example E.faecium output
START    END      GENE         COVERAGE     COVERAGE_MAP     GAPS  %COVERAGE  %IDENTITY
7140     7902     erm(B)       1-762/762    ========/======  1     100.00     99.08
8627     9421     aph(3')-III  1-795/795    ===============  0     100.00     100.00
11040    11948    ant(6)-Ia    1-345/909    =====..........  0     35.00      100.00
15456    16257    lnu(B)       1-804/804    ========/======  2     99.75      99.63
573128   575046   tet(M)       1-1920/1920  ========/======  1     99.95      99.95
770130   770792   VanR-B       1-663/663    ===============  0     100.00     99.25
770792   772135   VanS-B       1-1344/1344  ===============  0     100.00     99.63
772306   773112   VanY-B       1-807/807    ===============  0     100.00     100.00
773130   773957   VanW-B       1-828/828    ===============  0     100.00     97.58
773954   774925   VanH-B       1-972/972    ===============  0     100.00     99.38
774918   775946   VanA-B       1-1029/1029  ===============  0     100.00     98.93
775952   776560   VanX-B       1-609/609    ===============  0     100.00     96.72
2352083  2352631  aac(6')-Ii   1-549/549    ===============  0     100.00     99.64
2789984  2791462  msr(C)       1-1479/1479  ===============  0     100.00     98.92
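A table like this is usually filtered before reporting. The sketch below parses a few rows copied from the slide and keeps only well-covered, high-identity hits. It assumes whitespace-separated columns exactly as shown on the slide (actual ABRicate output may carry additional columns), and the 90% / 95% thresholds are arbitrary examples, not recommendations.

```python
SLIDE_ROWS = """\
START END GENE COVERAGE COVERAGE_MAP GAPS %COVERAGE %IDENTITY
7140 7902 erm(B) 1-762/762 ========/====== 1 100.00 99.08
11040 11948 ant(6)-Ia 1-345/909 =====.......... 0 35.00 100.00
770130 770792 VanR-B 1-663/663 =============== 0 100.00 99.25
"""


def parse_hits(text):
    """Turn the header line plus data rows into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def confident_genes(hits, min_coverage=90.0, min_identity=95.0):
    """Gene names passing both (hypothetical) thresholds."""
    return [
        hit["GENE"]
        for hit in hits
        if float(hit["%COVERAGE"]) >= min_coverage
        and float(hit["%IDENTITY"]) >= min_identity
    ]


print(confident_genes(parse_hits(SLIDE_ROWS)))
```

The partial ant(6)-Ia hit is dropped by the coverage threshold; whether a partial gene matters clinically is still a judgement call.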
Virulence profile
∷ The “virulome”
∷ Curated databases : known virulence genes: pathogenicity islands
∷ Caveats: variable representation across organisms
VirulenceFinder, VFDB, MvirDB, ViPR, PAI DB
Backward compatibility
MLST
Resistome
Virulome
NG-MAST
MLVA
VNTR
Serotyping
Phage typing
PFGE
SRST2, mlst, ngmaster, lissero, and many more!
When typing lets us down
Typing resolution
Focus on a small “informative” section
Genotype shows isolates are related
D’oh!
Exploiting the whole genome
A familiar tree
Every SNP is sacred
∷ Chocolate bar tree: branches were based on phenotypic attributes: size, colour, filling, texture, ingredients, flavour
∷ Genomic trees: want to use every part of the genome sequence: need to find all differences between isolates
Finding differences
Reference
AGTCTGATTAGCTTAGCTTGTAGCGCTATATTAT
Reads
AGTCTGATTAGCTTAGAT
      ATTAGCTTAGATTGTAG
           CTTAGATTGTAGC-C
    TGATTAGCTTAGATTGTAGC-CTATAT
        TAGCTTAGATTGTAGC-CTATATT
             TAGATTGTAGC-CTATATTA
             TAGATTGTAGC-CTATATTAT
                ^ SNP   ^ Deletion
Snippy, VarScan, SAMtools, GATK and many more!
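The pileup on this slide can be turned into variant calls with a simple column-wise majority vote. In the sketch below the reference and reads are taken from the slide, but the 0-based start offsets are an assumption, inferred by lining each read up against the reference. The function name and `min_depth` are illustrative, and real callers such as those listed also model base quality and mapping uncertainty.

```python
from collections import Counter

REFERENCE = "AGTCTGATTAGCTTAGCTTGTAGCGCTATATTAT"

# (start offset, aligned read); "-" marks a deleted base.
ALIGNED_READS = [
    (0, "AGTCTGATTAGCTTAGAT"),
    (6, "ATTAGCTTAGATTGTAG"),
    (11, "CTTAGATTGTAGC-C"),
    (4, "TGATTAGCTTAGATTGTAGC-CTATAT"),
    (8, "TAGCTTAGATTGTAGC-CTATATT"),
    (13, "TAGATTGTAGC-CTATATTA"),
    (13, "TAGATTGTAGC-CTATATTAT"),
]


def call_variants(reference, aligned_reads, min_depth=2):
    """Report columns where the most common read base differs from the reference."""
    columns = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    variants = []
    for pos, counts in enumerate(columns):
        if sum(counts.values()) < min_depth:
            continue  # too few reads to make a call
        base = counts.most_common(1)[0][0]
        if base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


for pos, ref_base, alt_base in call_variants(REFERENCE, ALIGNED_READS):
    kind = "deletion" if alt_base == "-" else "SNP"
    print(pos, ref_base, alt_base, kind)
```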
SNP distance matrix
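A SNP distance matrix is just the count of differing positions between every pair of aligned isolates. A minimal sketch, assuming equal-length sequences already aligned to the same reference; the three isolate names and sequences are invented for illustration.

```python
def snp_distance(a, b):
    """Number of aligned positions where two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(isolates):
    """Nested dict: matrix[name1][name2] = SNP distance."""
    return {
        name_a: {name_b: snp_distance(seq_a, seq_b)
                 for name_b, seq_b in isolates.items()}
        for name_a, seq_a in isolates.items()
    }


# Hypothetical aligned sequences for three isolates.
isolates = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",
    "iso3": "ACCTACGAAT",
}
matrix = distance_matrix(isolates)
for name, row in matrix.items():
    print(name, [row[other] for other in isolates])
```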
Annotated tree
∷ 1 SNP resolution
∷ Distinguishes clades within genotypes
∷ Interpretation is not straightforward
10 SNPs
L. monocytogenes
Same tree!
Dendrogram
Spanning
Radial
Reference based analysis
∷ Implies you have a “close” reference: need to be careful with draft genomes
∷ Very sensitive: single mutation precision
∷ May not be complete: ignores novel DNA in your isolate
Inferring transmission
∷ Identical sequence does not imply transmission
∷ Easier to rule out than in
The pan genome
Align all your isolate genomes
Find “common” segments
The core genome
Core is common to all & has similar sequence.
Example pan genome
Roary, LS-BSR, OrthoMCL, Degust
Rows are genomes, columns are genes.
Core
∷ Common DNA
∷ Vertical evolution
∷ Genotyping
∷ Phylogenetics
Accessory
∷ Novel DNA
∷ Lateral transfer
∷ Plasmids
∷ Mobile elements
∷ Partly unexploited
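Given a presence/absence table (rows are genomes, columns are genes), the core/accessory split is a set operation. The sketch below is a simplification: it treats a gene as core if it is present in every genome and ignores the sequence-similarity requirement. The isolate names and gene sets are hypothetical.

```python
def split_pan_genome(presence):
    """presence maps genome name -> set of gene names found in it."""
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)          # every gene seen anywhere
    core = set.intersection(*gene_sets)    # genes present in all genomes
    accessory = pan - core                 # genes missing from at least one
    return core, accessory


# Hypothetical presence/absence data for three isolates.
presence = {
    "isolate_A": {"gyrA", "rpoB", "vanA"},
    "isolate_B": {"gyrA", "rpoB"},
    "isolate_C": {"gyrA", "rpoB", "tetM"},
}
core, accessory = split_pan_genome(presence)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```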
Progress at MDU-PHL
Traditional workflow
Modern workflow
Nullarbor
∷ Software pipeline: does “reads to report”: cloud image available (mGVL)
∷ Under active development: used at MDU-PHL for past year for routine jobs: also used by USA CDC Enterics, FSS Qld, and research
∷ National access programme underway
null arbor: “no trees”
Doherty Applied Microbial Genomics
■ Non-profit service available
□ fixed price per isolate
■ Genome sequencing
□ Illumina NextSeq 500
■ Bioinformatics analysis
□ Nullarbor
■ Report
□ QC, typing, resistome, phylogeny
□ plus your raw data
Sharing is caring
Open science
∷ Crowd-sourcing provably works: EHEC outbreak 2011: Ebola, MERS, Zika
∷ But only if people share: sequencing data: metadata: software source code for analysis
GenomeTrakr
∷ International cooperation : Led by FDA + NCBI: >20 collaborating institutes inc. UK PHE, DK DTU, MX: Salmonella and Listeria
∷ Public SRA BioProject #183844 : Real-time submission of WGS genome reads: Nightly updates of phylogenomic trees: Contains ~25,000 strains of Salmonella
“GenomeTrakka”
∷ A shared online system for all Australian labs: upload samples: automated standard/specific analyses: simple reports and visualization: easy to submit to international archives (SRA)
∷ Access control: each lab controls their own data: jurisdictions can share data in national outbreaks
Final thoughts
Does WGS deliver?
Yes!
Bioinformatics
Epidemiology
Technology
Microbiology
This means scientists, not just software
Domain expertise
Always changing...
Acknowledgements
Ben Howden
Tim Stinear
Dieter Bulach
Jason Kwong
Anders G da Silva
The End
Thank you for listening.