26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012...

26th International Mammalian Genome Conference 2012

Bioinformatics Workshop

Sunday, October 21, 2012

09.00 – 12.00

Location: Tarpon Room

@IMGC2012 #IMGC2012

Wi-Fi: twgroup / password: group5500

IMGS 2012 Bioinformatics Workshop

Deanna Church, NCBI

Carol Bult, The Jackson Laboratory

Tutorial Resources

• Galaxy – https://main.g2.bx.psu.edu/
• Genome Analysis for Biologists – http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/
• NCBI 1000 Genomes Browser – http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/
• Genome Reference Consortium – http://genomereference.org/

Schedule

9-10 am: Intro
• Genome Assembly Basics
• Alignment Basics

10-11 am: Getting Stuff Done
• File formats (sequences, alignments, annotations)

11 am-12 pm: Doing stuff
• Typical RNA-Seq workflow
• RNA-Seq in Galaxy
• Differential Gene Expression with RNA-Seq data

Assembly Basics

19 Oct 2012

Some assembly required…

Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb

End-sequence all clones and retain pairing information (“mate-pairs”)

Find sequence overlaps

Each end sequence is referred to as a read

WGS contig

tails

WGS: Sanger Reads Overlap-Layout-Consensus

http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf
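The "find sequence overlaps" step can be sketched in a few lines. This is a toy illustration only: it assumes error-free reads on the same strand and exact suffix/prefix matches, and it ignores quality scores and mate-pair constraints that real assemblers rely on. The function names, the `min_len` parameter, and the reads are made up for the example.

```python
from typing import List


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_merge(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by `min_len` or more.
    """
    contigs = list(reads)
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(a, b, min_len)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads, invented for illustration.
reads = ["ACGTAC", "TACGGA", "GGATTC"]
contigs = greedy_merge(reads)
```

Greedy merging is simple but can mis-join reads that come from repeats, which is one reason pairing information is retained.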

Alignable trace count in frameshift window vs control in Opossum: 51nt window, >95% identity

23,894 genes

452 models with >1 exon, sym. best hit, and one frameshift

334 cases have 3 or fewer hits

Alexander Souvorov, NCBI

Fragmented genomes tend to have fewer frameshifts

Alexander Souvorov, NCBI

Fragmented genomes tend to have more partial models

Alexander Souvorov, NCBI

BAC insert

BAC vector

Shotgun sequence

Assemble

Fold sequence

Gaps

Deeper sequence coverage rarely resolves all gaps

GAPS

“finishers” go in to manually fill the gaps, often by PCR

Clone based assemblies

Scaffold N50 by chromosome
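For reference, N50 is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the scaffold lengths are invented for the example.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the running sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical scaffold lengths (in kb), not taken from any real assembly.
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
```

Here the total is 290, and the two longest scaffolds (80 + 70 = 150) already cover half of it, so the N50 is 70.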

7 May 2010

Spanned Gaps by Assembly

Church et al., 2011 PLoS Biology

http://genomereference.org

NCBI36 (hg18)

GRCh37 (hg19)

NCBI35 (hg17)

GRCh37 (hg19)

AL139246.20

AL139246.21

Build sequence contigs based on contigs defined in TPF (Tiling Path File).

Check for orientation consistencies

Select switch points

Instantiate sequence for further analysis

Switch point

Consensus sequence

NCBI36

nsv832911 (nstd68) Submitted on NCBI35 (hg17)

NCBI35 (hg17) Tiling Path

GRCh37 (hg19) Tiling Path

Gap Inserted

Moved approximately 2 Mb distal on chr15

NC_000015.8 (chr15)

NC_000015.9 (chr15)

Removed from assembly

Added to assembly

HG-24

Sequences from haplotype 1

Sequences from haplotype 2

Old Assembly model: compress into a consensus

New Assembly model: represent both haplotypes

AC074378.4 AC079749.5

AC134921.2 AC147055.2

AC140484.1 AC019173.4

AC093720.2 AC021146.7

NCBI36 NC_000004.10 (chr4) Tiling Path

Xue Y et al, 2008

TMPRSS11E TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path

AC074378.4 AC079749.5

AC134921.1 AC147055.2

AC093720.2 AC021146.7

TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)

AC074378.4 AC140484.1

AC019173.4 AC226496.2

AC021146.7

TMPRSS11E2

nsv532126 (nstd37)

GRCh37 (hg19)

http://genomereference.org

7 alternate haplotypes at the MHC

Alternate loci released as:
• FASTA
• AGP
• Alignment to chromosome

UGT2B17 MHC MAPT

Assembly (e.g. GRCh37.p2)

Primary Assembly

Non-nuclear assembly unit

(e.g. MT)

ALT 1

ALT 2

ALT 3

ALT 4

ALT 5

ALT 9

ALT 6

ALT 7

ALT 8

PAR

Patches…

Genomic Region (MHC)

Genomic Region (UGT2B17)

Genomic Region (MAPT)

Genomic Region (ABO)

Genomic Region (SMA)

Genomic Region (PECAM1)

MHC (chr6)

Chr 6 representation (PGF)

Alt_Ref_Locus_2 (COX)

Richa Agarwala

Eugene Yaschenko

GenBank

Data Archives

• Data in a common format
• Data in a single location (and mirrored)
• Most quality checked prior to deposition
• Robust data tracking mechanism (accession.version)
• Data owned by submitter

Data tracking

ABC14-1065514J1

Gaps Phase Length Date

FP565796.1 1 1 21-Oct-2009

FP565796.2 1 0 14-Oct-2010

FP565796.3 3 0 07-Nov-2010
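The accession.version convention shown above makes it easy to tell which record is the most recent. The sketch below simply splits an identifier on its final dot; the helper names are invented for illustration and are not part of any NCBI tool.

```python
from typing import Iterable, Tuple


def parse_accession(identifier: str) -> Tuple[str, int]:
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def latest_version(identifiers: Iterable[str]) -> str:
    """Return the identifier with the highest version.

    All identifiers must share one accession; mixing accessions is an error.
    """
    parsed = [(parse_accession(i), i) for i in identifiers]
    if not parsed:
        raise ValueError("no identifiers given")
    accessions = {acc for (acc, _), _ in parsed}
    if len(accessions) != 1:
        raise ValueError(f"mixed accessions: {sorted(accessions)}")
    return max(parsed, key=lambda item: item[0][1])[1]


newest = latest_version(["FP565796.1", "FP565796.3", "FP565796.2"])
```

Because the accession stays fixed while the version increments, a result reported against FP565796.1 can always be traced even after the record has been updated.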

Mouse chrX: 35,000,000-36,000,000

X

MGSCv3 MGSCv36

Unique Identification

NC_000086.6: chrX in MGSCv36

List of scaffolds and gaps (AGP)

List of components and gaps (AGP)
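An AGP file is a tab-delimited table with nine columns: each line places either a component sequence or a gap on an object such as a chromosome. The following is a minimal reader, assuming the standard column order (object, start, end, part number, component type, then component or gap details) and treating types N and U as gaps. The object name, accessions, and coordinates in the example are invented.

```python
from typing import List, NamedTuple, Optional

GAP_TYPES = {"N", "U"}  # N = gap of known size, U = gap of unknown size


class AgpRow(NamedTuple):
    obj: str                     # object being assembled, e.g. a chromosome
    obj_beg: int                 # 1-based start on the object
    obj_end: int                 # 1-based inclusive end on the object
    part_number: int
    component_type: str
    component_id: Optional[str]  # None for gap lines

    @property
    def is_gap(self) -> bool:
        return self.component_type in GAP_TYPES

    @property
    def length(self) -> int:
        return self.obj_end - self.obj_beg + 1


def parse_agp(text: str) -> List[AgpRow]:
    """Parse AGP text into rows, skipping blank lines and '#' comments."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        ctype = fields[4]
        component_id = None if ctype in GAP_TYPES else fields[5]
        rows.append(AgpRow(fields[0], int(fields[1]), int(fields[2]),
                           int(fields[3]), ctype, component_id))
    return rows


# Hypothetical AGP content, for illustration only.
EXAMPLE_AGP = "\n".join([
    "# made-up example",
    "\t".join(["chrT", "1", "1000", "1", "F", "AB000001.1", "1", "1000", "+"]),
    "\t".join(["chrT", "1001", "1100", "2", "N", "100", "scaffold", "yes", "paired-ends"]),
    "\t".join(["chrT", "1101", "1600", "3", "F", "AB000002.2", "201", "700", "-"]),
])

rows = parse_agp(EXAMPLE_AGP)
gap_bases = sum(r.length for r in rows if r.is_gap)
```

The component identifiers carry their versions, so the AGP records exactly which sequence version was used to build the assembly.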

hg19 GRCh37

mm8 MGSCv37 NCBIM37

danRer5 Zv7

What’s in a name?

Assemblies with the same name aren’t always the same

chr21:8,913,216-9,246,964

Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

hg19 GRCh37

GRCh37.p2

GCA_000001405.1

Assembly Database to the rescue

GCA_000001405.3

http://www.ncbi.nlm.nih.gov/genome/assembly

GRCh37 hg19

Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17

Primary Assembly: GCA_000001305.1 / GCF_000001305.13
ALT 1: GCA_000001315.1 / GCF_000001315.1
ALT 2: GCA_000001325.1 / GCF_000001325.2
ALT 3: GCA_000001335.1 / GCF_000001335.1
ALT 4: GCA_000001345.1 / GCF_000001345.1
ALT 5: GCA_000001355.1 / GCF_000001355.1
ALT 6: GCA_000001365.1 / GCF_000001365.2
ALT 7: GCA_000001375.1 / GCF_000001375.1
ALT 8: GCA_000001385.1 / GCF_000001385.1
ALT 9: GCA_000001395.1 / GCF_000001395.1
Patches: GCA_000005045.5 / GCF_000005045.4
Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1

GenBank vs RefSeq

GenBank            RefSeq
Submitter owned    RefSeq owned
Redundancy         Non-redundant
Updated rarely     Curated
INSDC              Not INSDC

BRCA1

83 genomic records, 31 mRNA records, 27 protein records

3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

Sequence Alignment Basics

Hypothesis

• The biological basis of sequence alignment is evolution

• Sequences that share a common ancestor are homologous
– Sequence similarity is evidence of homology
– Sequences, genes, etc. are homologous or not; there is no “percent homology”

Homology

• Orthologous sequences
– Common ancestor; speciation
• Paralogous sequences
– Gene duplication within a species (lineage-specific expansion)

http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html

Alignment to NR -> Homology

Alignment to an Assembly -> Mapping

Global and local alignments

Optimal global alignment

Needleman-Wunsch

Sequences align essentially from end to end

Optimal local alignment

Smith-Waterman

Sequences align only in small, isolated regions
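The difference between the two approaches is easiest to see in the dynamic-programming recurrences. The sketch below computes scores only (no traceback) and assumes a simple scheme of +1 match, -1 mismatch, and a linear gap penalty of -1; real tools use substitution matrices and affine gap penalties. The function names and the example sequences are made up.

```python
def global_score(a: str, b: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1,
                mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a poor prefix be dropped.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Invented sequences: a shared core (GATTACA) with unrelated flanks.
seq_a = "XXGATTACAYY"
seq_b = "ZZGATTACAWW"
g = global_score(seq_a, seq_b)
l = local_score(seq_a, seq_b)
```

With these inputs the global score is dragged down by the four mismatched flanking positions, while the local score reflects only the shared core.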

References

Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453.

Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.

http://en.wikipedia.org/wiki/Sequence_alignment

Hashing methods

Query sequence: MVRRLPERTSTPACE

Word size = 3 (configurable)

Words: MVR VRR RRL RLP LPE PER ERT RTS TST STP TPA PAC ACE
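The word list for the query can be generated with a sliding window, and a hash table from word to position then allows fast lookup of exact seed matches. This is a sketch of the general idea rather than the implementation of any particular aligner; the function names and the short second sequence are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def words(seq: str, k: int = 3) -> List[str]:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(seq: str, k: int = 3) -> Dict[str, List[int]]:
    """Hash table mapping each word to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos, word in enumerate(words(seq, k)):
        index[word].append(pos)
    return dict(index)


def seed_hits(query: str, subject: str, k: int = 3) -> List[Tuple[int, int]]:
    """(query position, subject position) pairs that share an exact word."""
    index = build_index(subject, k)
    return [(qpos, spos)
            for qpos, word in enumerate(words(query, k))
            for spos in index.get(word, [])]


query_words = words("MVRRLPERTSTPACE", 3)
hits = seed_hits("PERT", "MVRRLPERTSTPACE", 3)
```

Seed hits that fall on the same diagonal (equal difference between the two positions) are the candidates that a hashing-based aligner would go on to extend.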

References

Wilbur & Lipman (1983), PNAS 80, 726-30

Lipman & Pearson (1985), Science 227, 1435-1441

Pearson & Lipman (1988), PNAS 85, 2444-2448

http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Fonseca et al., 2012

Sensitivity vs. Specificity

Sensitivity = fraction of actual positives that are identified as positive
Specificity = fraction of actual negatives that are identified as negative

                    Predicted positives   Predicted negatives
Actual positives    TP                    FN
Actual negatives    FP                    TN

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
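As a worked example of the two formulas, the sketch below tallies a confusion matrix from paired truth and prediction labels and then applies them. The labels are invented, standing in for something like "read truly maps here" versus "aligner reported a mapping here".

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[bool],
                     predicted: Sequence[bool]) -> Dict[str, int]:
    """Count TP, FN, FP, and TN from paired actual/predicted labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for truth, call in zip(actual, predicted):
        if truth and call:
            counts["TP"] += 1
        elif truth and not call:
            counts["FN"] += 1
        elif not truth and call:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): share of actual positives that were found."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): share of actual negatives that were rejected."""
    return tn / (tn + fp)


# Made-up labels for illustration.
actual = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]
c = confusion_counts(actual, predicted)
sens = sensitivity(c["TP"], c["FN"])  # 3 of 4 actual positives found
spec = specificity(c["TN"], c["FP"])  # 5 of 6 actual negatives rejected
```

An aligner tuned to report more mappings will usually raise sensitivity at the cost of specificity, which is why the two are reported together.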

• Aligner technology specific?
• Gapped vs. ungapped alignments?
• Spliced alignments (cDNAs/RNA-Seq)
• Can use paired-end data?

Ruffalo et al., 2012

Li and Homer, 2010

Indels have correct and consistent alignment in reads after multiple sequence local realignment

DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.

Phase 1: NGS data processing

Highlighted as one of the major methodological advances of the 1000 Genomes Pilot Project!

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CDC27

Richa Agarwala

MHC Alternate locus

Alignment to chr6

Mouse Ren1 chr1 (NC_000067.6): 133350674-133360320

NM_031192.3: transcript from C57BL/6J

NM_031193.2: transcript from FVB/N

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

CEPH: A=1.000 G=0

APOL1

YRI: A=0.5852 G=0.4148

Multiple submissions

Frequency Data

1000G

Suspect

Sudmant et al., 2010