Hard assembly. Jan Pačes, Institute of Molecular Genetics AS CR.

Post on 04-Jan-2016



problems

genomes: high GC content, repetitions (short, with low information content, or long), polymorphism, "unreadable" sequences, "weird" structures

technologies: nonrandom libraries, wrong sizes, erroneous or chimeric reads

sequencing technologies

ABI (Sanger)

454 (pyrosequencing)

Solexa (reversible terminator)

SOLiD (two-base ligation)

PacBio (SMRT)

example of errors in one technology

http://chevreux.org/mira_ex_454sanger.html

Aird et al. Genome Biology 2011

high GC regions are underrepresented

Aird et al. Genome Biology 2011

protocol optimization for high GC content
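To see where GC bias might affect one's own data, it can help to compute GC content in sliding windows and compare it with read coverage over the same windows. The following is only a minimal sketch: the function names `gc_fraction` and `gc_windows`, the window size and the step are illustrative assumptions, not something taken from the talk or from Aird et al.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc_windows(seq, window=100, step=50):
    """Yield (start, gc_fraction) for sliding windows along seq.

    A sequence shorter than `window` is reported as a single window.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_fraction(seq[start:start + window])


if __name__ == "__main__":
    # Toy sequence: a GC-rich half followed by an AT-rich half.
    toy = "GGCCGGCC" + "ATATATAT"
    for start, gc in gc_windows(toy, window=8, step=8):
        print(start, gc)
```

Windows with high GC content and unusually low coverage are candidates for the underrepresentation described above.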

repetitions

[figure labels: scaffold, repetition]

repetitions recognition

MIRA http://sourceforge.net/projects/mira-assembler/

MaSuRCA http://www.genome.umd.edu/masurca.html

SPAdes http://bioinf.spbau.ru/spades

RepeatMasker http://www.repeatmasker.org/

RepeatModeler (RECON and RepeatScout) http://www.repeatmasker.org/RepeatModeler.html

position aware assemblers

k-mer distribution

k-mer analysis

Jellyfish: fast, parallel k-mer counting for DNA http://www.cbcb.umd.edu/software/jellyfish/

Quake: a package to correct substitution sequencing errors in experiments with deep coverage http://www.cbcb.umd.edu/software/quake/

khmer: trims off likely erroneous k-mers https://khmer-protocols.readthedocs.org/en/v0.8.2/
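The idea behind a k-mer distribution can be illustrated in a few lines: count every k-mer in the reads, then tabulate how many distinct k-mers occur once, twice, and so on. This is a toy sketch for small inputs only; it is not how Jellyfish, Quake or khmer are implemented, and the helper names (`canonical`, `count_kmers`, `kmer_spectrum`) are hypothetical. Counting canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption made here so that both strands are pooled.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTNACG"]
    spectrum = kmer_spectrum(count_kmers(reads, 4))
    for multiplicity in sorted(spectrum):
        print(multiplicity, spectrum[multiplicity])
```

In a real data set, k-mers at very low multiplicity are mostly sequencing errors, the main peak corresponds to unique genomic sequence, and k-mers at multiples of the main peak point to repetitions.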

repetitions

[figure labels: scaffold, repetition]

filling gaps

GapCloser (part of SOAPdenovo) http://soap.genomics.org.cn/soapdenovo.html

GapFiller (part of SSPACE) http://www.baseclear.com/lab-products/bioinformatics-tools/gapfiller/

GapFiller http://sourceforge.net/projects/gapfiller/

454 multiplicates

contig coverage by large libraries

Illumina paired-end (PE) and mate-pair libraries

highly polymorphic genomes

scaffold

two copies of polymorphic contigs

polymorphic assembly workflow

normal assembly

condensing alternative contigs

mapping to identify SNPs

"repair" reads

second "polymorphic" assembly

http://www.fishbrowser.org/software/L_RNA_scaffolder

G-quadruplex

AGCGACCCCCCCCCACCACCGCCACCACCACCTCTGCCATTGGCCGCCGCCGCCCCCCCCCCATTAAACCCCCCCACCCCCCCCCGCGCTGCCCCCTCCCCGGTGG

Chicken p53 – coverage from RNAseq data

Coverage > 13,000X

CCCGCCCACCCCCACCCCCACCCGCACCCCCCACTCTCCCACCCCCACCCCCTTTTCTCCCACCCCCTCTTCTCCCACCCCCTTTTCCCCCCCTTCCTCCCCCCACTCCGCCCCCCCCCCGCCCCCTCCCCCCCCCCAGGTGAGGACCCT

Chicken erythropoietin (EPO) – coverage from RNAseq data

Coverage > 500X from RNAseq

(*EPO locus not completed even with 1000X-coverage genomic Illumina data!)
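Regions like the two shown above can be flagged in advance by scanning for potential G-quadruplex motifs. The sketch below uses a commonly cited folding pattern, four runs of at least three G separated by loops of 1 to 7 bases; that pattern, the loop limits and the function names are assumptions for illustration, not taken from the talk. Because the sequences shown are C-rich, the same pattern is also searched with C, which corresponds to a G-quadruplex on the complementary strand.

```python
import re


def g4_pattern(base):
    """Build a regex for four runs of >= 3 `base` separated by loops of 1-7 bases."""
    tract = base + "{3,}"
    loop = "[ACGT]{1,7}"
    return re.compile(loop.join([tract] * 4))


def find_g4_candidates(seq):
    """Return sorted (start, end, strand) for non-overlapping motif hits.

    Strand '+' means a G-rich hit in seq itself; '-' means a C-rich hit,
    i.e. a potential G-quadruplex on the complementary strand.
    """
    seq = seq.upper()
    hits = []
    for strand, base in (("+", "G"), ("-", "C")):
        for match in g4_pattern(base).finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


if __name__ == "__main__":
    toy = "AAGGGTGGGTGGGTGGGAA"
    for start, end, strand in find_g4_candidates(toy):
        print(start, end, strand, toy[start:end])
```

A hit is only a candidate: whether a structure actually forms and blocks sequencing depends on conditions this pattern does not model.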

chicken missing genes

that’s it, thank you

many thanks also to:

Daniel Elleder, Tomáš Hron, Michal Kolář, Hynek Strnad