The software
Results
Experimental Validation
In Practice
Post-treatment in development
KisSplice: Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data
29 May 2013
Next Generation Sequencing
A sequencing experiment now produces millions of short reads (∼ 100 nt) in a single run for a reasonable cost (∼ 10^3 euros)
For model species, the first step is usually to map the reads to the reference genome/transcriptome
For non-model species, the first step is usually to assemble the reads and reconstruct the genome/transcriptome
Downstream analysis includes the analysis of polymorphism (SNPs, rearrangements, splicing)
Our main idea is to extract polymorphism directly from the reads, without assembling the genome/transcriptome
The Model, Algorithm outline
De-Bruijn Graph
De Bruijn graphs (DBG) are used as a first step in many short-read assemblers.
Node = k-mer
Edge = overlap of k-1 bases
Example
CACTCAA, k = 3
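The example above can be reproduced in a few lines. This is a minimal sketch, assuming a graph without reverse complements in which two k-mers are linked whenever the last k-1 bases of one equal the first k-1 bases of the other. The name `build_dbg` is illustrative and is not part of KisSplice.

```python
from collections import defaultdict

def build_dbg(sequences, k):
    """Map each k-mer to the sorted list of k-mers it overlaps by k-1 bases."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    # Index k-mers by their (k-1)-prefix so successors can be looked up directly.
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in sorted(kmers)}

graph = build_dbg(["CACTCAA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

With this definition, TCA is linked to both CAA and CAC, since both start with CA: even a single short sequence can produce a branching node.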
De-Bruijn Graph
More complicated example
Reference : CACTCAACTG (unknown)
read1 CACTCA
read2 CAACTG
De-Bruijn Graph
Even more complicated example
Reference : CACTCAACTGACT (unknown)
read1 CACTCA
read2 CAACTG
read3 CTGACT
Compressed De Bruijn Graph
Even more complicated example
Reference : CACTCAACTGACT (unknown)
read1 CACTCA
read2 CAACTG
read3 CTGACT
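Compression merges every non-branching chain of k-mers into a single node. The sketch below applies this to the three reads with k = 3, assuming an overlap-based graph without reverse complements. The names `build_dbg` and `compact` are illustrative, not KisSplice code.

```python
from collections import defaultdict

def build_dbg(sequences, k):
    """Map each k-mer to the sorted list of k-mers it overlaps by k-1 bases."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmers}

def compact(succ):
    """Merge maximal non-branching chains of k-mers into single sequences."""
    indeg = defaultdict(int)
    pred = {}
    for node, outs in succ.items():
        for nxt in outs:
            indeg[nxt] += 1
            pred[nxt] = node  # only meaningful when indeg[nxt] == 1

    def extends_predecessor(node):
        # True when `node` is the unique successor of its unique predecessor.
        return indeg[node] == 1 and len(succ[pred[node]]) == 1

    unitigs = []
    for start in succ:
        if extends_predecessor(start):
            continue  # this node sits inside a chain started elsewhere
        seq, node = start, start
        while len(succ[node]) == 1 and indeg[succ[node][0]] == 1:
            node = succ[node][0]
            seq += node[-1]  # each step adds exactly one new base
        unitigs.append(seq)
    # Isolated cycles without any branching node are not reported.
    return sorted(unitigs)

reads = ["CACTCA", "CAACTG", "CTGACT"]
unitigs = compact(build_dbg(reads, k=3))
print(unitigs)
```

The nine k-mers of the three reads collapse into five compressed nodes; the branching k-mers (here ACT, which has three predecessors and two successors) stay as nodes of their own.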
De-Bruijn Graph
An assembly is a walk in the de Bruijn graph that contains all reads as subwalks
This problem is known to be NP-complete
In practice, heuristics are used, which consist in simplifying the graph to "make it linear"
However, the structures that are removed may correspond to relevant biological structures (SNPs, alternative splicing).
Specificities of RNA-seq data
Dynamic range of gene expression
Few genes are highly expressed
Many are poorly expressed
Alternative splicing
A gene may give rise to several transcripts
Example of DBG built from RNA-seq data
Polymorphism in RNA-seq data
If the purpose is to identify polymorphism, then assemblers are not well suited
The variable parts are precisely the ones that will be removed
3 types of polymorphism are expected in RNA-seq:
At the genomic level
SNP
Approximate tandem repeats
At the transcriptomic level
Alternative splicing
SNPs
SNPs correspond to recognizable patterns in the de Bruijn graph
Issue : how to discriminate SNPs from sequencing errors ?
Idea : require a minimum coverage for each path
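The coverage idea can be illustrated at the k-mer level. This is a minimal sketch, assuming that coverage is the number of times a k-mer occurs across the reads and using a hypothetical threshold `min_cov`; the counting and the threshold actually used by KisSplice may differ.

```python
from collections import Counter

def solid_kmers(reads, k, min_cov=2):
    """Keep only k-mers seen at least `min_cov` times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer: n for kmer, n in counts.items() if n >= min_cov}

# Toy reads for illustration: the third read carries a single mismatch.
reads = ["ACGTA", "ACGTA", "ACCTA"]
kept = solid_kmers(reads, k=3, min_cov=2)
print(sorted(kept))
```

A path supported by a single read is discarded here. The same threshold would also discard a true SNP in a poorly expressed gene, so the value is a trade-off between robustness to errors and sensitivity.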
Approximate Tandem Repeats
Alternative splicing events
Exon skipping
Intron retention
Alternative 5’ or 3’ splice site
Not covered by this pattern :
Alternative transcription start and end
Mutually exclusive exons
A general model for polymorphism in DBG
SNP: 2 paths of length 2k − 1
Repeats: 1 path of length at most 2k − 2; the two paths align
AS: 1 path of length at most 2k − 2
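The length criteria of this model can be written as a small pre-classification step. This is a sketch under assumptions: lengths are taken to be in nucleotides and counted between the two branching nodes, which is one reading of the slide's convention, and the function name `preclassify_bubble` is hypothetical. Separating repeats from AS events needs the path sequences, so that step is left out.

```python
def preclassify_bubble(len_a, len_b, k):
    """Pre-classify a bubble from the lengths of its two paths.

    Lengths are assumed to be in nucleotides, counted between the two
    branching nodes (an assumption about the convention used here).
    """
    shorter, longer = sorted((len_a, len_b))
    if shorter == longer == 2 * k - 1:
        return "SNP"
    if shorter <= 2 * k - 2:
        # Repeats and AS events share this length bound; telling them apart
        # requires checking whether the two path sequences align.
        return "AS or repeat"
    return "other"

print(preclassify_bubble(49, 49, k=25))   # both paths have length 2k - 1
print(preclassify_bubble(48, 170, k=25))  # shorter path within 2k - 2
```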
Algorithm outline
KisSplice
1. De Bruijn graph construction;
2. BiConnected Components decomposition (BCC);
3. Four nodes compression (SNPs and sequencing errors);
4. Enumeration of all bubbles with a shorter path of length at most 2k − 2;
5. Quantification and classification.
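To make the notion of a bubble concrete, here is a simplified sketch of what the enumeration step looks for. It is not the KisSplice algorithm: it only reports bubbles whose two branches are non-branching chains, and it ignores the BCC decomposition, the 2k − 2 bound and reverse complements. The names `build_dbg` and `simple_bubbles` are illustrative.

```python
from collections import defaultdict

def build_dbg(sequences, k):
    """Map each k-mer to the sorted list of k-mers it overlaps by k-1 bases."""
    kmers = {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: sorted(by_prefix.get(kmer[1:], [])) for kmer in kmers}

def simple_bubbles(succ):
    """Find bubbles whose two branches are non-branching chains that rejoin."""
    indeg = defaultdict(int)
    for outs in succ.values():
        for nxt in outs:
            indeg[nxt] += 1

    def walk(node):
        # Follow a branch while nodes have one predecessor and one successor.
        path = []
        while indeg[node] == 1 and len(succ[node]) == 1:
            path.append(node)
            node = succ[node][0]
        return path, node

    bubbles = []
    for source, outs in succ.items():
        if len(outs) != 2:
            continue
        (path_a, end_a), (path_b, end_b) = walk(outs[0]), walk(outs[1])
        if end_a == end_b and indeg[end_a] == 2:
            bubbles.append((source, path_a, path_b, end_a))
    return bubbles

# Two toy sequences that differ by a single base (C/G at the fourth position).
bubbles = simple_bubbles(build_dbg(["GATCCAGT", "GATGCAGT"], k=3))
print(bubbles)
```

In this toy case each branch holds k = 3 k-mers, spelling 2k − 1 = 5 bases (ATCCA and ATGCA), which is the SNP pattern of the general model.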
Simulated Data, Real Data, Comparison with Trinity, Unresolved BCCs, Assembly Vs Mapping
Results on simulated data
We simulated the sequencing of one Drosophila gene with two alternative transcripts (using FluxSimulator)
For different values of the coverage, we test whether our method recovers the AS event
KisSplice recovers the AS event when the coverage is above 8X
Trinity recovers the AS event when the coverage is above 18X
Note : Trinity’s purpose is more general as it reconstructs full-length transcripts, but for this task, it is less sensitive
Impact of k
At 8X, kmin=17, kmax=29
Results on real data
To assess whether our predicted AS events are true positives, we need to test our method in a case where a reference genome is available.
Data :
Human Body Map 2.0 data (ERP00546)
2 tissues (out of 16) : brain and liver
75 bp reads, 32M and 39M
BCCs repartition
Confirming AS events
We align the two paths of the bubble to the reference genome using BLAT
If the two paths align with the same initial and final coordinates, then it is a true positive
Otherwise, it is a false positive
Next, we check if the alignment coordinates correspond to an annotated AS event
If the coordinates match, then it is a known event
Otherwise, it is a novel AS event
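The decision rule above can be sketched as follows. The data layout is an assumption made for illustration: each path is summarised by a hypothetical `(chrom, start, end)` tuple taken from its best alignment, and annotated events are stored as a set of such tuples. Parsing the aligner output is not shown.

```python
def classify_event(path_a_hit, path_b_hit, annotated_events):
    """Classify a bubble from the genomic alignments of its two paths.

    Each hit is a hypothetical (chrom, start, end) tuple, or None when the
    path did not align; `annotated_events` is a set of such tuples.
    """
    if path_a_hit is None or path_b_hit is None or path_a_hit != path_b_hit:
        return "false positive"
    return "known" if path_a_hit in annotated_events else "novel"

# Made-up coordinates, for illustration only.
annotation = {("chr1", 1000, 2500)}
print(classify_event(("chr1", 1000, 2500), ("chr1", 1000, 2500), annotation))
print(classify_event(("chr2", 400, 900), ("chr2", 400, 900), annotation))
print(classify_event(("chr2", 400, 900), ("chr3", 10, 510), annotation))
```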
Annotated exon skipping
Annotated intron retention
Novel alternative 5’ splice site
Novel complex event
Novel AS events are less expressed
Novel AS events are shorter
1. Short AS events tend to be under-annotated (e.g. NAGNAG)
2. We also detect genomic indels that lie within genes, which we mistake for AS events
Comparison with Trinity on real data
Memory usage is better (5 GB for 100M reads)
KisSplice is faster, which is expected because it solves a simpler task
KisSplice finds 4099 events, while Trinity finds 1123, of which 570 are common
50% of the events found only by Trinity are false positives
The rest are hidden in very large BCCs, and we can recover part of them using larger values of k
Unresolved BCC
This is not an elephant, this is a gene family :)
Assembly Vs Mapping
For model species, the pipeline is usually TopHat + Cufflinks
Even in this case, KisSplice (or other assembly-based approaches) may be useful.
Example of an event missed by Cufflinks, although it is annotated
Experimental Validation (ongoing)
With Didier Auboeuf (CRCL)
Validation by RT-PCR
Almost all novel events are validated
Novel events found by both KisSplice and Cufflinks are almost all validated
Novel events found by KisSplice alone are validated only if :
The minor isoform has a relative abundance of at least 15 %
The splicing event is simple, not complex
KisSplice in practice
Input : FASTA/FASTQ files
Output : 5 files (SNPs, AS events, Repeats, Indels <3 nt, others)
Format :
After the counts
Testing if a variant is specific to a condition :
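M_Reduced : $Y_{v,c} = \mu + \beta^{\mathrm{variant}}_{v} + \beta^{\mathrm{cond}}_{c}$
M_Full : $Y_{v,c} = \mu + \beta^{\mathrm{variant}}_{v} + \beta^{\mathrm{cond}}_{c} + \beta^{\mathrm{variant} \ast \mathrm{cond}}_{v,c}$
$\mu$ : local mean expression of the gene that contains the variant
$\beta^{\mathrm{variant}}_{v}$ : contribution of variant v
$\beta^{\mathrm{cond}}_{c}$ : contribution of condition c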
Counts are modelled using a negative binomial
We compute the likelihood of both models and compare them with a χ² test (likelihood-ratio test)
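The final comparison can be sketched once both log-likelihoods are known. This is a sketch assuming the full model has exactly one more free parameter than the reduced one (for instance two variants and two conditions), so that the statistic is compared to a χ² distribution with one degree of freedom. Fitting the negative binomial models is not shown, and `lrt_pvalue_1df` is a hypothetical name.

```python
import math

def lrt_pvalue_1df(loglik_reduced, loglik_full):
    """P-value of a likelihood-ratio test with one degree of freedom."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical log-likelihoods, for illustration only.
print(lrt_pvalue_1df(loglik_reduced=-105.0, loglik_full=-102.0))
```

A small p-value indicates that the interaction term improves the fit, that is, the variant behaves differently across conditions. With more variants or conditions, the number of degrees of freedom grows and the closed form above no longer applies.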
KisSplice2IGV
Combining KisSplice output with the context given by a full-length transcriptome assembler (Trinity, Oases, etc.)
Visualisation in a genome browser (IGV)
The colour of an alignment depends on the log10(RPKM) (Reads Per Kilobase per Million mapped reads)
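For reference, the quantity that drives the colour can be computed as below. This is a sketch using the standard RPKM definition; the argument names are illustrative, and how KisSplice2IGV handles zero counts or maps the value to a colour is not stated here.

```python
import math

def log10_rpkm(read_count, length_bp, total_mapped_reads):
    """log10 of Reads Per Kilobase per Million mapped reads."""
    # RPKM = reads / (length in kb) / (mapped reads in millions)
    rpkm = read_count * 1e9 / (length_bp * total_mapped_reads)
    return math.log10(rpkm)

# Toy numbers: 500 reads on a 2 kb sequence, 25 million mapped reads.
value = log10_rpkm(read_count=500, length_bp=2000, total_mapped_reads=25_000_000)
print(value)
```

A sequence with no supporting read has an RPKM of 0, for which the logarithm is undefined, so a real implementation would need a pseudocount or a separate colour for that case.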
Conclusion
KisSplice detects various polymorphisms (SNPs, AS, repeats) in RNA-seq data
It provides quantification for such events.
KisSplice is more sensitive than Trinity for finding AS events
KisSplice is relevant for studies on non-model species
It brings information even when there is a model species and can be used in addition to classical pipelines
KisSplice People and download
Download
KisSplice : http://kissplice.prabi.fr
DBG construction : http://minia.genouest.org/
KisSplice People
Rennes : Rayan Chikhi, Pavlos Antoniou, Guillaume Rizk, Raluca Uricaru, Pierre Peterlongo
Lyon : Gustavo Sacomoto, Alice Julien-Laferriere, David Parsons, Janice Kielbassa, Lilia Brinza, Marie-France Sagot, Vincent Miele, Vincent Lacroix
Thank you! Questions?
Further analysis on short events