26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Location: Tarpon Room @IMGC2012 #IMGC2012 Wi-Fi: twgroup / password: group5500
Transcript
Page 1:

26th International Mammalian

Genome Conference 2012

Bioinformatics Workshop

Sunday, October 21, 2012

09.00 – 12.00

Location: Tarpon Room

@IMGC2012 #IMGC2012

Wi-Fi: twgroup / password: group5500

Page 2:

IMGS 2012 Bioinformatics Workshop

Deanna Church, NCBI

Carol Bult, The Jackson Laboratory

Page 3:

Tutorial Resources

• Galaxy – https://main.g2.bx.psu.edu/

• Genome Analysis for Biologists – http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/

• NCBI 1000 Genomes Browser – http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/

• Genome Reference Consortium – http://genomereference.org/

Page 4:

Schedule

9-10 am: Intro
• Genome Assembly Basics
• Alignment Basics

10-11 am: Getting Stuff Done
• File formats (sequences, alignments, annotations)

11 am-12 pm: Doing Stuff
• Typical RNA-Seq workflow
• RNA-Seq in Galaxy
• Differential Gene Expression with RNA-Seq data

Page 5:

Assembly Basics

19 Oct 2012

Page 6:

Some assembly required…

Page 7:

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Find sequence overlaps

Each end sequence is referred to as a read

WGS contig

tails

WGS: Sanger Reads, Overlap-Layout-Consensus
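The overlap step on this slide can be illustrated with a small sketch. It is not the method used by any production assembler: the function names, the minimum overlap of 3, and the three toy reads are assumptions made for illustration, and real assemblers tolerate sequencing errors and use mate-pair information as well.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    min_len = max(min_len, 1)  # a zero-length overlap is not informative
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_len: int = 3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    n = suffix_prefix_overlap(a, b, min_len)
    return a + b[n:] if n else None


# Toy reads (hypothetical data) laid out left to right along one contig.
reads = ["ACGTTGCA", "TGCAAGGC", "AGGCTTAA"]
contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is None:
        break  # no overlap: this is where a gap would start
    contig = merged

print(contig)  # ACGTTGCAAGGCTTAA
```

Exact string matching keeps the sketch short; it only shows why overlapping ends let reads be chained into a contig.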

Page 8:

http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf

Page 9:

Alignable trace count in frameshift window vs control in Opossum: 51nt window, >95% identity

23,894 genes

452 models with >1 exon, sym. best hit, and one frameshift

334 cases have 3 or fewer hits

Alexander Souvorov, NCBI

Page 10:

Fragmented genomes tend to have fewer frameshifts

Alexander Souvorov, NCBI

Page 11:

Fragmented genomes tend to have more partial models

Alexander Souvorov, NCBI

Page 12:

BAC insert

BAC vector

Shotgun sequence

Assemble

Fold sequence

Gaps

Deeper sequence coverage rarely resolves all gaps

“finishers” go in to manually fill the gaps, often by PCR

Clone based assemblies

Page 13:

Scaffold N50 by chromosome
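For readers unfamiliar with the statistic named on this slide, N50 is the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch follows; the function name and the example lengths are assumptions for illustration and are not taken from the slide's data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one length")


# Hypothetical scaffold lengths (in kb) for one chromosome.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffolds))  # 70: the two longest scaffolds already cover 150 of 300 kb
```

Computing the value per chromosome, as the slide title suggests, would simply mean calling the function once per chromosome's scaffold list.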

Page 14:

7 May 2010

Spanned Gaps by Assembly

Page 15:

Church et al., 2011 PLoS Biology

http://genomereference.org

Page 16:

NCBI36 (hg18)

GRCh37 (hg19)

Page 17:

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Page 18:

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence

Page 19:

NCBI36

Page 20:

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

Page 21:

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_000015.8 (chr15)

NC_000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

Page 22:

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

Page 23:

AC074378.4 AC079749.5

AC134921.2 AC147055.2

AC140484.1 AC019173.4

AC093720.2 AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4 AC079749.5

AC134921.1 AC147055.2

AC093720.2 AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4 AC140484.1

AC019173.4 AC226496.2

AC021146.7

TMPRSS11E2

nsv532126 (nstd37)

Page 24:

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome

UGT2B17 MHC MAPT

Page 25:

Assembly (e.g. GRCh37.p2)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 6

ALT 7

ALT 8

ALT 9

PAR

Patches…

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

Page 26:

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

Page 27:

Richa Agarwala

Eugene Yaschenko

Page 28:
Page 29:

GenBank

Data Archives

• Data in a common format
• Data in a single location (and mirrored)
• Most quality checked prior to deposition
• Robust data tracking mechanism (accession.version)
• Data owned by submitter

Page 30:

Data tracking

ABC14-1065514J1

Gaps Phase Length Date

FP565796.1 1 1 21-Oct-2009

FP565796.2 1 0 14-Oct-2010

FP565796.3 3 0 07-Nov-2010
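The accession.version scheme shown here can be handled programmatically. The sketch below is an illustration only, not an NCBI tool: `Accession`, `parse_accession` and `latest` are hypothetical names, and the only data used are the three identifiers from the slide.

```python
from typing import NamedTuple


class Accession(NamedTuple):
    base: str     # stable identifier, e.g. "FP565796"
    version: int  # incremented each time the sequence record changes


def parse_accession(text: str) -> Accession:
    """Split an identifier such as 'FP565796.3' into its base and version."""
    base, _, version = text.strip().rpartition(".")
    if not base or not version.isdecimal():
        raise ValueError(f"not an accession.version string: {text!r}")
    return Accession(base, int(version))


def latest(records):
    """Return the highest-version record seen for each base accession."""
    newest = {}
    for acc in map(parse_accession, records):
        if acc.base not in newest or acc.version > newest[acc.base].version:
            newest[acc.base] = acc
    return newest


history = ["FP565796.1", "FP565796.2", "FP565796.3"]
print(latest(history)["FP565796"])
```

Keeping the version as an integer matters: comparing versions as strings would rank ".9" above ".10".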

Page 31:

Mouse chrX: 35,000,000-36,000,000

Page 32:

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

Page 33:

Unique Identification

NC_000086.6

chrX in MGSCv36

List of scaffolds and gaps (AGP)

List of components and gaps (AGP)

Page 34:

hg19 GRCh37

mm8 MGSCv37

NCBIM37

danRer5 Zv7

What’s in a name?

Page 35:

What’s in a name?

Page 36:

Assemblies with the same name aren’t always the same

chr21:8,913,216-9,246,964

Page 37:

Assemblies with the same name aren’t always the same

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

Page 38:

hg19 GRCh37

GRCh37.p2

GCA_000001405.1

Assembly Database to the rescue

GCA_000001405.3

Page 39:

http://www.ncbi.nlm.nih.gov/genome/assembly

GRCh37 hg19

Page 40:
Page 41:

Assembly (e.g. GRCh37.p5)

GCA_000001405.6/GCF_000001405.17

Primary Assembly

GCA_000001305.1/GCF_000001305.13

ALT 1

GCA_000001315.1/GCF_000001315.1

ALT 2

GCA_000001325.1/GCF_000001325.2

ALT 3

GCA_000001335.1/GCF_000001335.1

ALT 4

GCA_000001345.1/GCF_000001345.1

ALT 5

GCA_000001355.1/GCF_000001355.1

ALT 6

GCA_000001365.1/GCF_000001365.2

ALT 7

GCA_000001375.1/GCF_000001375.1

ALT 8

GCA_000001385.1/GCF_000001385.1

ALT 9

GCA_000001395.1/GCF_000001395.1

Patches

GCA_000005045.5/GCF_000005045.4

Non-nuclear assembly unit

(e.g. MT)

GCA_000006015.1/GCF_000006015.1

Page 42:

GenBank vs RefSeq

GenBank                  RefSeq
Submitter Owned          RefSeq Owned
Redundancy               Non-Redundant
Updated rarely           Curated
INSDC                    Not INSDC

BRCA1
83 genomic records       3 genomic records
31 mRNA records          5 mRNA records
27 protein records       1 RNA record
                         5 protein records

Page 43:

Sequence Alignments Basics

Page 44:

Hypothesis

Page 45:

• The biological basis of sequence alignment is evolution

• Sequences that share a common ancestor are homologous
– Sequence similarity is evidence of homology
– Sequences, genes, etc. are either homologous or not; there is no “percent homology”

Page 46:

Homology

• Orthologous sequences
– Common ancestor; speciation

• Paralogous sequences
– Gene duplication within a species (lineage-specific expansion)

http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html

Page 47:

Alignment to NR -> Homology

Alignment to an Assembly -> Mapping

Page 48:
Page 49:

Global and local alignments

Optimal global alignment

Needleman-Wunsch

Sequences align essentially from end to end

Optimal local alignment

Smith-Waterman

Sequences align only in small, isolated regions

References

Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.

Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
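The difference between the two approaches is easiest to see in the dynamic-programming recurrence. The sketch below computes only the optimal score, not the alignment itself. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for illustration; the original papers and real tools use other scoring schemes.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score of `a` and `b`.

    local=False: global, end-to-end alignment (Needleman-Wunsch style).
    local=True:  best-scoring local region (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                                    # 2
print(alignment_score("TTTGATTACATTT", "CCCGATTACACCC", local=True))    # 7
```

In the second call the flanking bases do not match, so the local score reflects only the shared GATTACA core, whereas a global alignment of the same pair would be penalised for every flanking mismatch.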

Page 50:
Page 51:

http://en.wikipedia.org/wiki/Sequence_alignment

Page 52:

Hashing methods

Query sequence: MVRRLPERTSTPACE

Word size = 3 (configurable)

MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE

References

Wilbur & Lipman (1983), PNAS 80, 726-30

Lipman & Pearson (1985), Science 227, 1435-1441

Pearson & Lipman (1988), PNAS 85, 2444-2448
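The word decomposition on this slide can be reproduced in a few lines. The sketch below splits the slide's query MVRRLPERTSTPACE into overlapping words of size 3 and looks them up in a hash table built from a subject sequence. The function names and the subject sequence are hypothetical, and real tools such as FASTA add scoring and extension steps that are not shown here.

```python
from collections import defaultdict


def words(seq: str, k: int = 3):
    """All overlapping words (k-mers) of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seq: str, k: int = 3):
    """Hash table mapping each word to the positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(words(seq, k)):
        index[word].append(pos)
    return index


def find_seeds(query: str, subject: str, k: int = 3):
    """(query_pos, subject_pos) pairs where a query word occurs in the subject."""
    index = build_index(subject, k)
    return [(q, s) for q, word in enumerate(words(query, k)) for s in index.get(word, [])]


query = "MVRRLPERTSTPACE"
print(words(query))                     # 13 words, MVR through ACE
print(find_seeds(query, "GGPERTSTGG"))  # [(5, 2), (6, 3), (7, 4), (8, 5)]
```

All four seeds lie on the same diagonal (query position minus subject position equals 3), which is the signal hashing methods use to decide where a full alignment is worth attempting.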

Page 53:
Page 54:
Page 55:
Page 56:

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Fonseca et al., 2012

Page 57:

Sensitivity vs. Specificity

Sensitivity = fraction of actual positives that are correctly identified (true positives, TP)
Specificity = fraction of actual negatives that are correctly identified (true negatives, TN)

                    Predicted positives   Predicted negatives
Actual positives    TP                    FN
Actual negatives    FP                    TN

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
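As a worked example of the two formulas, the sketch below evaluates them for a hypothetical aligner benchmark. The counts are invented for illustration and do not come from any of the studies cited in this workshop.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of actual negatives that were predicted negative."""
    return tn / (tn + fp)


# Hypothetical benchmark: 100 reads that truly map, 100 that truly do not.
tp, fn = 90, 10   # true mappings found / missed
fp, tn = 5, 95    # spurious mappings reported / correctly rejected

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # sensitivity = 0.90
print(f"specificity = {specificity(tn, fp):.2f}")  # specificity = 0.95
```

Tuning an aligner usually trades one measure against the other: looser matching criteria find more true mappings but also report more spurious ones.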

Page 58:

• Aligner technology specific?
• Gapped vs. ungapped alignments?
• Spliced alignments (cDNAs/RNA-Seq)?
• Can use paired-end data?

Page 59:

Ruffalo et al., 2012

Page 60:

Li and Homer, 2010

Page 61:

Indels have correct and consistent alignment in reads after multiple sequence local realignment

DePristo, M., Banks, E., Poplin, R. et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Phase 1: NGS data processing

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

Page 62:

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

Page 63:
Page 64:

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Page 65:

Mouse Ren1 chr1 (NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6J

NM_031193.2: transcript from FVB/N

Page 66:
Page 67:

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CEPH: A=1.000 G=0

APOL1

Page 68:

YRI: A=0.5852 G=0.4148

Multiple submissions

Frequency Data

1000G

Suspect

Sudmant et al., 2010

