Post on 13-Jan-2016
26th International Mammalian Genome Conference 2012
Bioinformatics Workshop
Sunday, October 21, 2012
09.00 – 12.00
Location: Tarpon Room
@IMGC2012 #IMGC2012
Wi-Fi: twgroup / password: group5500
IMGS 2012 Bioinformatics Workshop
Deanna Church, NCBI
Carol Bult, The Jackson Laboratory
Tutorial Resources
• Galaxy – https://main.g2.bx.psu.edu/
• Genome Analysis for Biologists – http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/
• NCBI 1000 Genomes Browser – http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
• Genome Reference Consortium – http://genomereference.org/
Schedule
9-10 am: Intro
• Genome Assembly Basics
• Alignment Basics
10-11 am: Getting Stuff Done
• File formats (sequences, alignments, annotations)
11 am-12 pm: Doing stuff
• Typical RNA-Seq workflow
• RNA-Seq in Galaxy
• Differential Gene Expression with RNA-Seq data
Assembly Basics
19 Oct 2012
Some assembly required…
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
End-sequence all clones and retain pairing information (“mate-pairs”)
Find sequence overlaps
Each end sequence is referred to as a read
WGS contig
tails
WGS: Sanger Reads, Overlap-Layout-Consensus
http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf
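The "find sequence overlaps" step can be illustrated with a toy sketch. This is not the assembler used for any of the genomes discussed here: it is a greedy merge over exact suffix-prefix overlaps, it ignores sequencing errors, reverse complements, and mate-pair constraints, and it skips the consensus step entirely. The function names and the `min_len` parameter are illustrative assumptions.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` (>= 1) exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def greedy_merge(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the largest exact overlap.

    A greedy stand-in for the overlap and layout steps; stops when no pair
    overlaps by at least `min_len` bases.
    """
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = suffix_prefix_overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # remaining sequences stay as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Made-up reads for illustration.
print(greedy_merge(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Reads that share no sufficiently long overlap are left as separate contigs, which is one simple way gaps arise in a draft assembly.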
Alignable trace count in frameshift window vs control in Opossum: 51nt window, >95% identity
23,894 genes
452 models with >1 exon, sym.best hit, and one frameshift
334 cases have 3 or fewer hits
Alexander Souvorov, NCBI
Fragmented genomes tend to have fewer frameshifts
Alexander Souvorov, NCBI
Fragmented genomes tend to have more partial models
Alexander Souvorov, NCBI
BAC insert
BAC vector
Shotgun sequence
Assemble
Fold sequence
Gaps
Deeper sequence coverage rarely resolves all gaps
GAPS
“finishers” go in to manually fill the gaps, often by PCR
Clone based assemblies
Scaffold N50 by chromosome
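For readers unfamiliar with the statistic, N50 is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch follows; the function name and the example lengths are illustrative and are not values from the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the overall total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Made-up scaffold lengths for illustration.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Computing the value per chromosome only requires grouping the scaffold lengths by chromosome before calling the function.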
7 May 2010
Spanned Gaps by Assembly
Church et al., 2011 PLoS Biology
http://genomereference.org
NCBI36 (hg18)
GRCh37 (hg19)
NCBI35 (hg17)
GRCh37 (hg19)
AL139246.20
AL139246.21
Build sequence contigs based on contigs defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
NCBI36
nsv832911 (nstd68) Submitted on NCBI35 (hg17)
NCBI35 (hg17) Tiling Path
GRCh37 (hg19) Tiling Path
Gap Inserted
Moved approximately 2 Mb distal on chr15
NC_000015.8 (chr15)
NC_000015.9 (chr15)
Removed from assembly
Added to assembly
HG-24
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
AC074378.4 AC079749.5
AC134921.2 AC147055.2
AC140484.1 AC019173.4
AC093720.2 AC021146.7
NCBI36 NC_000004.10 (chr4) Tiling Path
Xue Y et al, 2008
TMPRSS11E TMPRSS11E2
GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4 AC079749.5
AC134921.1 AC147055.2
AC093720.2 AC021146.7
TMPRSS11E
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4 AC140484.1
AC019173.4 AC226496.2
AC021146.7
TMPRSS11E2
nsv532126 (nstd37)
GRCh37 (hg19)
http://genomereference.org
7 alternate haplotypes at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
UGT2B17 MHC MAPT
Assembly (e.g. GRCh37.p2)
Primary Assembly
Non-nuclear assembly unit
(e.g. MT)
ALT 1
ALT 2
ALT 3
ALT 4
ALT 5
ALT 6
ALT 7
ALT 8
ALT 9
PAR
Patches…
Genomic Region (MHC)
Genomic Region (UGT2B17)
Genomic Region (MAPT)
Genomic Region (ABO)
Genomic Region (SMA)
Genomic Region (PECAM1)
MHC (chr6)
Chr 6 representation (PGF)
Alt_Ref_Locus_2 (COX)
Richa Agarwala
Eugene Yaschenko
GenBank
Data Archives
• Data in a common format
• Data in a single location (and mirrored)
• Most quality checked prior to deposition
• Robust data tracking mechanism (accession.version)
• Data owned by submitter
Data tracking
ABC14-1065514J1
Gaps  Phase  Length  Date
FP565796.1  1  1  21-Oct-2009
FP565796.2  1  0  14-Oct-2010
FP565796.3  3  0  07-Nov-2010
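The accession.version convention shown above (for example FP565796.1 through FP565796.3) is easy to work with programmatically. The following is a minimal sketch, assuming identifiers have the form `ACCESSION.VERSION` with an integer version; the class and function names are illustrative and do not belong to any NCBI library.

```python
from typing import Dict, Iterable, NamedTuple


class AccessionVersion(NamedTuple):
    accession: str
    version: int


def parse_accession(identifier: str) -> AccessionVersion:
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return AccessionVersion(accession, int(version))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, int]:
    """Map each accession to the highest version seen."""
    latest: Dict[str, int] = {}
    for identifier in identifiers:
        parsed = parse_accession(identifier)
        latest[parsed.accession] = max(parsed.version, latest.get(parsed.accession, 0))
    return latest


print(latest_versions(["FP565796.1", "FP565796.3", "FP565796.2", "AL139246.20", "AL139246.21"]))
```

Because the sequence behind a given accession.version never changes, recording the full identifier (not just the accession) is what makes an analysis reproducible.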
Mouse chrX: 35,000,000-36,000,000
X
MGSCv3 MGSCv36
Unique Identification
NC_000086.6: chrX in MGSCv36
List of scaffolds and gaps (AGP)
List of components and gaps (AGP)
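An AGP file is tab-delimited text with one line per component or gap. The sketch below reads such lines and counts gaps. The column layout follows my understanding of the AGP 2.0 specification (object, start, end, part number, component type, then either component fields or gap fields); treat it as an assumption and check the specification before relying on it. The object and component names in the sample are hypothetical.

```python
def parse_agp(text):
    """Parse AGP text into a list of dicts, one per component or gap line."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        record = {
            "object": cols[0],
            "object_beg": int(cols[1]),
            "object_end": int(cols[2]),
            "part_number": int(cols[3]),
            "component_type": cols[4],
        }
        if cols[4] in ("N", "U"):  # gap of known (N) or unknown (U) size
            record.update(is_gap=True, gap_length=int(cols[5]),
                          gap_type=cols[6], linkage=cols[7])
        else:  # a sequence component placed on the object
            record.update(is_gap=False, component_id=cols[5],
                          component_beg=int(cols[6]), component_end=int(cols[7]),
                          orientation=cols[8])
        records.append(record)
    return records


def gap_summary(records):
    """Return (number of gaps, total gap length)."""
    gaps = [r for r in records if r["is_gap"]]
    return len(gaps), sum(r["gap_length"] for r in gaps)


# Hypothetical object and component names, for illustration only.
SAMPLE_AGP = "\n".join("\t".join(row) for row in [
    ("chrDemo", "1", "1000", "1", "F", "COMP0001.1", "1", "1000", "+"),
    ("chrDemo", "1001", "1100", "2", "N", "100", "contig", "no", "na"),
    ("chrDemo", "1101", "1600", "3", "F", "COMP0002.3", "1", "500", "-"),
])

print(gap_summary(parse_agp(SAMPLE_AGP)))
```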
hg19  GRCh37
mm8  MGSCv37  NCBIM37
danRer5  Zv7
What’s in a name?
Assemblies with the same name aren’t always the same
chr21:8,913,216-9,246,964
Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
hg19  GRCh37
GRCh37.p2
GCA_000001405.1
Assembly Database to the rescue
GCA_000001405.3
http://www.ncbi.nlm.nih.gov/genome/assembly
GRCh37  hg19
Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17
Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1
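In the accession pairs listed above, the GenBank (GCA_) and RefSeq (GCF_) identifiers for the same assembly unit share their numeric part, while their versions can differ. A small sketch of checking that, assuming the `GCA_`/`GCF_` prefix, nine digits, and a dot-separated integer version seen in these examples; the function names are illustrative.

```python
import re

# Pattern inferred from the accessions shown in this section (an assumption).
ASSEMBLY_ACCESSION = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession: str):
    """Return (prefix, numeric id, version) for e.g. 'GCF_000001405.17'."""
    match = ASSEMBLY_ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"unrecognised assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version)


def is_paired(genbank: str, refseq: str) -> bool:
    """True when a GCA_ and a GCF_ accession share the same numeric id."""
    gb_prefix, gb_number, _ = parse_assembly_accession(genbank)
    rs_prefix, rs_number, _ = parse_assembly_accession(refseq)
    return gb_prefix == "GCA" and rs_prefix == "GCF" and gb_number == rs_number


print(is_paired("GCA_000001405.6", "GCF_000001405.17"))
```

Citing the versioned accession rather than a name such as hg19 or GRCh37 removes the ambiguity about which patch release was used.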
GenBank vs RefSeq
GenBank: Submitter owned; Redundant; Updated rarely; INSDC
RefSeq: RefSeq owned; Non-redundant; Curated; Not INSDC
BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records
BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
Sequence Alignments Basics
Hypothesis
• The biological basis of sequence alignment is evolution
• Sequences that share a common ancestor are homologous
– Sequence similarity is evidence of homology
– Sequences, genes, etc. are homologous or not; there is no “percent homology”
Homology
• Orthologous sequences
– Common ancestor; speciation
• Paralogous sequences
– Gene duplication within a species (lineage-specific expansion)
http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html
Alignment to NR -> Homology
Alignment to an Assembly -> Mapping
Global and local alignments
Optimal global alignment
Needleman-Wunsch
Sequences align essentially from end to end
Optimal local alignment
Smith-Waterman
Sequences align only in small, isolated regions
References
Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.
Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
http://en.wikipedia.org/wiki/Sequence_alignment
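The difference between the two approaches is visible in the dynamic-programming recurrences. The sketch below computes scores only (no traceback) with a simple linear gap penalty; the scoring values (match +1, mismatch -1, gap -1) and the function names are illustrative assumptions, not the parameters of any particular tool.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # floor at 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Made-up sequences: a shared core flanked by unrelated bases.
x, y = "TTTGATTACATTT", "CCGATTACACC"
print(global_score(x, y), local_score(x, y))
```

The only structural differences are the zero floor and taking the maximum over the whole matrix in the local version, which is what lets it ignore the unrelated flanks.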
Hashing methods
Query sequence: MVRRLPERTSTPACE
Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
Word size = 3 (configurable)
References
Wilbur & Lipman (1983), PNAS 80, 726-30
Lipman & Pearson (1985), Science 227, 1435-1441
Pearson & Lipman (1988), PNAS 85, 2444-2448
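The word decomposition of the query shown above can be reproduced in a few lines, together with a lookup table that finds exact word matches ("seeds") in a subject sequence. This is a minimal sketch of the hashing idea only: real tools add neighbourhood words, scoring thresholds, and seed extension. The function names and the subject sequence are illustrative assumptions.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seq, k=3):
    """Map each word to the list of positions where it starts in `seq`."""
    index = defaultdict(list)
    for pos, word in enumerate(words(seq, k)):
        index[word].append(pos)
    return dict(index)


def seed_hits(query, subject, k=3):
    """(query position, subject position) for every exact shared word."""
    index = build_index(subject, k)
    return [(qpos, spos)
            for qpos, word in enumerate(words(query, k))
            for spos in index.get(word, [])]


query = "MVRRLPERTSTPACE"
print(words(query))
# Made-up subject sequence for illustration.
print(seed_hits(query, "AAPERTSTAA"))
```

Hits that share the same offset (query position minus subject position) lie on one diagonal and are natural candidates for extension into a longer alignment.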
http://wwwdev.ebi.ac.uk/fg/hts_mappers/
Fonseca et al., 2012
Sensitivity vs. Specificity
Sensitivity = fraction of actual positives that are correctly identified (true positives, TP)
Specificity = fraction of actual negatives that are correctly identified (true negatives, TN)
                   Predicted positives   Predicted negatives
Actual positives   TP                    FN
Actual negatives   FP                    TN
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
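As a worked example of the two formulas, the sketch below tallies a confusion matrix from paired truth and prediction labels and then computes both measures. The function names and the example labels are illustrative assumptions (for instance, "read truly originates from this locus" versus "aligner placed it there").

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) from two equal-length lists of booleans."""
    tp = fn = fp = tn = 0
    for truth, call in zip(actual, predicted):
        if truth and call:
            tp += 1
        elif truth and not call:
            fn += 1
        elif not truth and call:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    """TP / (TP + FN): share of actual positives that were found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """TN / (TN + FP): share of actual negatives that were rejected."""
    return tn / (tn + fp)


# Made-up labels for illustration.
actual = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(sensitivity(tp, fn), specificity(tn, fp))
```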
• Aligner technology specific?
• Gapped vs. ungapped alignments?
• Spliced alignments (cDNAs/RNA-Seq)?
• Can use paired-end data?
Ruffalo et al., 2012
Li and Homer, 2010
Indels have correct and consistent alignment in reads after multiple sequence local realignment
DePristo, M., Banks, E., Poplin, R. et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
Phase 1: NGS data processing
Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
CDC27
Richa Agarwala
MHC Alternate locus
Alignment to chr6
Mouse Ren1 chr1 (NC_000067.6): 133350674-133360320
NM_031192.3: transcript from C57BL/6J
NM_031193.2: transcript from FVB/N
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
CEPH: A=1.000 G=0
APOL1
YRI: A=0.5852 G=0.4148
Multiple submissions
Frequency Data
1000G
Suspect
Sudmant et al., 2010