Post on 11-May-2018
This talk: a bird's-eye view of the GenScale NGS work
Dominique Lavenier
Rayan Chikhi
Guillaume Rizk
Claire Lemaitre
Raluca Uricaru
And many collaborators, …
EPGV Seminar - Lusignan - April 2013
Minia: de novo assembly of a human genome on a laptop
Guillaume Rizk, Rayan Chikhi
Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms in Bioinformatics, 2012, Springer
Contiging, overview
Reads
TCAGGCAG ACTCAGCA ATATAATA …
k-mers
De Bruijn graph
…TCAGGCAGATCGATACTCAGCACAACGTATATAATA…
Contig, Unitig
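The reads → k-mers → de Bruijn graph → unitig pipeline above can be sketched in a few lines. This is only an illustration under stated assumptions: the read set is invented (only TCAGGCAG comes from the slide), k = 5 is arbitrary, reverse complements are ignored, and the function names are hypothetical, not those of any assembler.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_graph(reads, k):
    """Nodes are k-mers; an edge u -> v exists when v is u shifted by one base."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    succ, pred = defaultdict(set), defaultdict(set)
    for u in nodes:
        for base in "ACGT":
            v = u[1:] + base
            if v in nodes:
                succ[u].add(v)
                pred[v].add(u)
    return nodes, succ, pred

def unitig_from(start, succ, pred):
    """Extend to the right while the path is unambiguous (one way out, one way in)."""
    seq, node, seen = start, start, {start}
    while len(succ[node]) == 1:
        (nxt,) = succ[node]
        if len(pred[nxt]) != 1 or nxt in seen:
            break
        seq += nxt[-1]          # each step adds exactly one base
        node = nxt
        seen.add(nxt)
    return seq

# Illustrative, overlapping reads (assumption: not the slide's data set).
reads = ["TCAGGCAG", "GGCAGATC", "AGATCGAT"]
nodes, succ, pred = build_graph(reads, k=5)
print(unitig_from("TCAGG", succ, pred))   # TCAGGCAGATCGAT
```

Holding the whole node set in memory, as this sketch does, is exactly what becomes expensive on large genomes.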
De Bruijn Graph
• Very simple example: one read
Source: homolog.us
Contiging, overview
Reads
TCAGGCAG ACTCAGCA ATATAATA …
k-mers
De Bruijn graph (memory bottleneck)
…TCAGGCAGATCGATACTCAGCACAACGTATATAATA…
Contig, Unitig
Minia approach
• Use a BLOOM FILTER: a probabilistic data structure for indexing k-mers:
  Is ACGATCGACTCAGCAT indexed?
  o NO → We are sure ACGATCGACTCAGCAT is absent
  o YES → We cannot conclude: FALSE POSITIVES
• Minia:
  o Store some of the false positives:
    • Those that may be met while walking the graph.
Minia approach
• Use a BLOOM FILTER: a probabilistic data structure for indexing k-mers, and store the annoying false positives
• (Knowing that AACGATCGACTCAGCA exists) Is ACGATCGACTCAGCAT indexed?
  o NO → We are sure ACGATCGACTCAGCAT is absent
  o YES →
    • if ACGATCGACTCAGCAT is not a stored FP, then it is present
    • else, ACGATCGACTCAGCAT is absent
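The query logic above can be sketched as follows. This is a toy under stated assumptions, not Minia's code: the class and function names are hypothetical, the filter uses one byte per bit for readability, and the set of true k-mers is kept in memory to find the false positives, which a real low-memory tool would avoid.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may answer 'present' wrongly, never 'absent' wrongly."""
    def __init__(self, n_bits, n_hashes):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits)        # one byte per bit, for clarity only

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def neighbours(kmer):
    """The 8 k-mers that could be reached in one step while walking the graph."""
    for base in "ACGT":
        yield kmer[1:] + base
        yield base + kmer[:-1]

def build_index(true_kmers, n_bits=64, n_hashes=2):
    bloom = BloomFilter(n_bits, n_hashes)
    for kmer in true_kmers:
        bloom.add(kmer)
    # Store only the false positives that can be met from a true k-mer.
    stored_fp = {n for kmer in true_kmers for n in neighbours(kmer)
                 if n in bloom and n not in true_kmers}
    return bloom, stored_fp

def is_present(kmer, bloom, stored_fp):
    """Exact answer, provided kmer is a true k-mer or a neighbour of one."""
    return kmer in bloom and kmer not in stored_fp

true_kmers = {"AACGATCGACTCAGCA"}
bloom, stored_fp = build_index(true_kmers)
print(is_present("AACGATCGACTCAGCA", bloom, stored_fp))  # True
print(is_present("ACGATCGACTCAGCAT", bloom, stored_fp))  # False
```

The second query is a neighbour of the indexed k-mer, so it is answered exactly whatever the hash values are: either the filter says NO, or it says YES and the k-mer is found among the stored false positives.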
Minia results
C. elegans nematode: 33 million paired-end reads of length 100 bp (SRR065390)
Minia Consequences
• Exploring parameter sets
• Developing low-memory NGS applications:
  o Kissnp2
  o Mapsembler
  o Ultimate Gap Filler
  o Kissplice
  o Intl
  o Minigraph
  o …
Kissnp2: search for SNPs on your laptop
• Assembly difficulty:
  o Repeats
  o Polymorphism
• Find SNPs:
  o Usual approach: map against a reference sequence (if it exists).
    • (else) create a reference to map the reads against
  o (else) Kissnp
Reads → kissnp:
>SNP_higher_path_1|score_27|high|left_contig_length_41|right_contig_length_45
atggcaadgggaataadcataadtadddctaaagtGACAACAGTCATTTTTTTCAAAGAACTTCAGTCTGGAACTATTATGTTGTTaggacatgtgatcdcatcaccagtatctcgaaatcctaaaada
>SNP_lower_path_1|score_27|high|left_contig_length_41|right_contig_length_45
atggcaadgggaataadcataadtadddctaaagtGACAACAGTCATTTTTTTCAAAGAATTTCAGTCTGGAACTATTATGTTGTTaggacatgtgatcdcatcaccagtatctcgaaatcctaaaada
Kissnp2: main idea
• A SNP in the de Bruijn graph:
• Algo
  o Find a branching k-mer: two (or more) extension possibilities
  o Walk the k-mers from one to another, until the full bubble is closed
  o Uses the Minia data structure
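The bubble search above can be sketched on a plain Python set of k-mers standing in for the Minia structure. This is an illustration under stated assumptions, not Kissnp2's implementation: function names and the two toy alleles are invented, only the forward strand is considered, and a bubble is accepted when both branches reach the same k-mer after exactly k steps, which is the shape an isolated SNP produces.

```python
def kmers_of(seq, k):
    """Set of all k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def successors(kmer, index):
    """Indexed k-mers reachable by shifting one base to the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in index]

def walk(start, index, steps):
    """Follow unique successors; return the spelled path, or None if it stops or branches."""
    seq, node = start, start
    for _ in range(steps):
        nxt = successors(node, index)
        if len(nxt) != 1:
            return None
        node = nxt[0]
        seq += node[-1]
    return seq

def find_snp_bubbles(index, k):
    """Return pairs of paths that open at a branching k-mer and close k steps later."""
    bubbles = []
    for kmer in sorted(index):
        branches = successors(kmer, index)
        if len(branches) < 2:
            continue                      # not a branching k-mer
        paths = [p for p in (walk(b, index, k) for b in branches) if p]
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                if paths[i][-k:] == paths[j][-k:]:   # same closing k-mer
                    bubbles.append((kmer[0] + paths[i], kmer[0] + paths[j]))
    return bubbles

# Two toy alleles differing by one base (A/G), k = 4.
index = kmers_of("ACCGTATCGGA", 4) | kmers_of("ACCGTGTCGGA", 4)
print(find_snp_bubbles(index, 4))   # [('CCGTATCGG', 'CCGTGTCGG')]
```

Each reported path has length 2k + 1, with the variant base in the middle and k bases of shared context on each side.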
Kissnp2: yes, but…
• Pros:
  o No reference needed
  o Does not construct the full graph
  o Works with 1, 2, 3, …, n datasets:
>SNP_higher_path_16|C1_2|C2_6|Q1_70|Q2_67
TAATGTTAAATGACGAGTTAATGGGTGCAGCACATGAACATGGCACATGTA
>SNP_lower_path_16|C1_2|C2_5|Q1_70|Q2_68
TAATGTTAAATGACGAGTTAATGGGGGCAGCACATGAACATGGCACATGTA
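The multi-dataset headers shown above can be read programmatically. The sketch below assumes, without confirmation from the slides, that `C<i>_<n>` is the coverage in read set i and `Q<i>_<n>` the quality in read set i; the function name is hypothetical and any other field is simply ignored.

```python
def parse_snp_header(header):
    """Split a header into its name plus per-set coverage (C) and quality (Q)."""
    fields = header.lstrip(">").split("|")
    record = {"name": fields[0], "coverage": {}, "quality": {}}
    targets = {"C": "coverage", "Q": "quality"}
    for field in fields[1:]:
        key, _, value = field.partition("_")
        # Keep only fields shaped like C<set>_<int> or Q<set>_<int> (assumed meaning).
        if key[:1] in targets and key[1:].isdigit() and value.isdigit():
            record[targets[key[:1]]][int(key[1:])] = int(value)
    return record

higher = parse_snp_header(">SNP_higher_path_16|C1_2|C2_6|Q1_70|Q2_67")
lower = parse_snp_header(">SNP_lower_path_16|C1_2|C2_5|Q1_70|Q2_68")
print(higher["coverage"], lower["coverage"])
```

Under that reading, both paths of SNP 16 are supported in both read sets, with higher support in set 2.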
Kissnp2: yes, but…
• Cons:
  1. A sequencing error = a SNP
  2. An approximate repeat = a SNP (with no filtering: 9 billion SNPs found in the human genome!)
• Avoid this bias:
  1. Use only k-mers with a minimal support (= coverage)
  2. Avoid approximate repeats:
    • Avoid k-mers with too high a support (e.g. 1000)
    • Extend found SNPs
>SNP_higher_path_16|C1_2|C2_6|Q1_70|Q2_67
…acgatcgacgacgctaacacatataccTAATGTTAAATGACGAGTTAATGGGTGCAGCACATGAACATGGCACATGTAgagagdacatatgtaacacgcagcat…
>SNP_lower_path_16|C1_2|C2_5|Q1_70|Q2_68
…acgatcgacgacgctaacacatataccTAATGTTAAATGACGAGTTAATGGGGGCAGCACATGAACATGGCACATGTAgagagdacatatgtaacacgcagcat…
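The support filter described above (drop k-mers seen too rarely or too often) can be sketched as below. The function names, the toy reads and the threshold values are illustrative assumptions; only the upper bound of 1000 echoes the example given on the slide.

```python
from collections import Counter

def count_kmers(reads, k):
    """Support of a k-mer = number of times it occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_support, max_support):
    """Keep k-mers that are neither likely errors (too rare) nor likely repeats (too frequent)."""
    return {kmer for kmer, c in counts.items() if min_support <= c <= max_support}

reads = ["ACGTA", "ACGTA", "ACGTT"]          # the last read carries a singleton k-mer
counts = count_kmers(reads, k=4)
print(sorted(solid_kmers(counts, min_support=2, max_support=1000)))  # ['ACGT', 'CGTA']
```

Here CGTT occurs once and is discarded, so it can no longer open a spurious bubble.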
Kissnp2: in a nutshell
• Reads to SNPs
  o Any number of read sets
  o For each SNP found:
    • The average coverage per set
    • The average quality of the SNP position
    • Possibility to extend SNPs (left and right unitigs)
• First results (publication in progress)
  o Duck: 350 million reads: 4 GB, 3 h 30 min, 527,208 SNPs (on genotoul)
  o Duck: 3,500 million reads: 4 GB, 5.5 days, 4,994,126 SNPs (on genotoul)
  o Tick: 576 SNPs found and selected w.r.t. coverage → 575 true positives.
Use / extricate polymorphism
• Idea:
  o Polymorphism
  o Polyploidy
  o Repeats
  o …
  … are the source of both:
  o A huge volume of information
  AND
  o A huge mess in the assembly process, and in any NGS process.
Use / extricate polymorphism: a trail
• From a graph of UNITIGS
  o Produced by assemblers before contiging
• Map sets of reads…
Figure: unitig graph (nodes labelled A, T, C, G) with reads from set1 and set2 mapped onto it.
• Compute
  o Coverage per set
  o Quality per set
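Coverage per read set on a unitig can be sketched as follows. The slides do not say how reads are mapped, so this assumes the simplest possible stand-in: a read maps wherever it occurs as an exact substring. The function name, the unitig and the two read sets are illustrative, and quality per set is left out.

```python
def mean_coverage(unitig, reads):
    """Average per-base depth of a unitig, counting exact occurrences of each read."""
    depth = [0] * len(unitig)
    for read in reads:
        start = unitig.find(read)
        while start != -1:
            for i in range(start, start + len(read)):
                depth[i] += 1
            start = unitig.find(read, start + 1)   # look for further occurrences
    return sum(depth) / len(unitig)

unitig = "TCAGGCAGAT"
read_sets = {"set1": ["TCAGG", "GGCAG"], "set2": ["CAGAT"]}
coverage = {name: mean_coverage(unitig, reads) for name, reads in read_sets.items()}
print(coverage)   # {'set1': 1.0, 'set2': 0.5}
```

Comparing such per-set values across the branches of the graph is what would let one tell which variant belongs to which read set.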
• "Phase" polymorphism
  o Better assembly
Ultimate gap filler
Inputs: a reference (Ref.) and reads
1. Detection of gaps in the reference
2. Fill gaps (using pairs)
Take-home message
• Assemble with a low memory footprint:
  o Minia
• Find polymorphism (SNPs) with no reference:
  o Kissnp2
  o (kissplice)
• Use and extricate polymorphism
• Detect and close gaps (reads + ref):
  o Ultimate Gap Filler
• More info and links: http://team.inria.fr/genscale