1 of 33
Genome annotation with Ensembl
Institute of Bioinformatics, National Yang-Ming University
Xosé Mª Fernández
European Bioinformatics Institute
December 2004
2 of 33
Outline of talk
• High level overview of Ensembl
  – Making genomes useful
• Outline workshop
  – New web code
  – DAS, display your own data
  – Modify EnsMart
  – BLAST/SSAHA
  – Comparing genomes
  – Customising Ensembl
• Outlook
  – Manual annotation
  – Other features
3 of 33
We make genomes useful
4 of 33
Making genomes useful
• Interpretation
  – Where are the interesting parts of the genome?
  – What do they do?
  – How are they related to elements in other genomes?
• Access
  – for bench biologists
  – for non-programming mid-scale groups
  – for good programming groups
5 of 33
Access… bench biologists
• Mainly via the web
• Web site designed for non-programming biologists who are not especially genome-aware
  – Simple things to find are simple to find
  – Graphical displays and overviews
  – Consistency of layout, colour and text
6 of 33
Ensembl website: Role
– Visual display of Ensembl data
  • A graphical, intuitive display for biologists
– “Public face” of Ensembl
  • Contact point for the project
– Local site installation
  • Free, open-source, supported
– A framework on which to hang user data
  • DAS and data upload
  • Local data integration via data adaptors
– Web-based tools
  • Display tools, primer selection, Anopheles gene name and transposon submission, etc.
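As an illustration of how user data can be hung on the site via DAS, here is a minimal Python sketch of a DAS `features` request and of reading the reply. It assumes the DAS 1.x URL layout and `DASGFF` element names as commonly described; the server `das.example.org`, the source name `my_source` and the sample feature are hypothetical, and the exact format should be checked against the DAS specification.

```python
import xml.etree.ElementTree as ET

def das_features_url(server, source, segment, start, stop):
    """Build a DAS 1.x style 'features' request URL (layout assumed)."""
    return f"{server}/das/{source}/features?segment={segment}:{start},{stop}"

# Hand-written sample reply, assumed to follow the DASGFF layout.
SAMPLE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/my_source/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feat1" label="my clone">
        <TYPE id="clone">clone</TYPE>
        <METHOD id="manual">manual</METHOD>
        <START>1100</START>
        <END>1500</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

def parse_features(xml_text):
    """Extract id, type, coordinates and strand from each FEATURE element."""
    root = ET.fromstring(xml_text)
    features = []
    for feature in root.iter("FEATURE"):
        features.append({
            "id": feature.get("id"),
            "type": feature.findtext("TYPE"),
            "start": int(feature.findtext("START")),
            "end": int(feature.findtext("END")),
            "strand": feature.findtext("ORIENTATION"),
        })
    return features

print(das_features_url("http://das.example.org", "my_source", "1", 1000, 2000))
print(parse_features(SAMPLE))
```

A DAS client such as the web site would issue one request of this kind per visible region and draw the returned features as an extra track.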
7 of 33
Architecture
• Encapsulates
  – Input
  – Output
  – Ensembl API
  – Rendering
• Improves
  – Maintainability
  – Flexibility
  – Code re-use
[Figure: architecture diagram. Labels: Client browsers; Apache / mod_perl web servers; View script with Input, Data, Output and Renderer; Ensembl API and BioPerl; MySQL RDBMS holding the core, lite, est and snp databases.]
8 of 33
Access… mid scale groups
• Wanting to work with 50 to 1,000 genes, regions, expression data
• Little in-house programming
  – Some web views designed for this group
  – EnsMart focused on this group
    • Mix and match queries
    • “Instant” refresh of selected set
    • Output to Excel, FASTA, HTML table
9 of 33
Mart database
• De-normalised
• Tables with ‘redundant’ information
• Query-optimised
• Fast and flexible
• Ideal for data mining
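To make the de-normalisation idea concrete, here is a minimal Python sketch using an in-memory SQLite database. The table and column names (`gene`, `transcript`, `transcript_main`) and the rows are hypothetical and do not reproduce the real EnsMart schema; the point is only that gene columns are copied onto every transcript row so that a query needs no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout (hypothetical): one fact per table, joined at query time.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER, biotype TEXT);
INSERT INTO gene VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
INSERT INTO transcript VALUES
    (10, 1, 'protein_coding'), (11, 1, 'protein_coding'), (12, 2, 'pseudogene');
""")

# De-normalised, mart-style table: the join is done once, when the table is
# built, and the 'redundant' gene columns are repeated on every transcript row.
cur.execute("""
CREATE TABLE transcript_main AS
SELECT t.transcript_id, t.biotype, g.gene_id, g.name AS gene_name, g.chromosome
FROM transcript t JOIN gene g ON g.gene_id = t.gene_id
""")
cur.execute("CREATE INDEX idx_main_chromosome ON transcript_main (chromosome)")

def transcripts_on(chromosome):
    """Query the wide table directly; no join is needed at query time."""
    cur.execute(
        "SELECT transcript_id, gene_name FROM transcript_main "
        "WHERE chromosome = ? ORDER BY transcript_id",
        (chromosome,),
    )
    return cur.fetchall()

print(transcripts_on("1"))
```

The trade-off is the usual one: the wide table costs extra storage and must be rebuilt when the source data change, in exchange for simpler and faster filter queries.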
10 of 33
There are other ways…
MartShell
Command-line interface to Mart, written in Java. It works with a Mart Query Language.
11 of 33
MartExplorer
12 of 33
BLAST/SSAHA
13 of 33
BLAST/SSAHA
• Different web interfaces exist for sequence comparison over genome scales
• Ensembl’s BlastView is a generic/modular interface that integrates several databases and methods
• BlastView has been extended to integrate tightly with the Ensembl web site
• Server-side state maintenance mechanisms provide a high-performance/flexible framework for the UI
14 of 33
Access… large scale groups
• Full use of the genome, by experienced bioinformaticians
• Complete openness of the group
  – Open data
  – Open software
  – Open MySQL server on the internet
  – Expect everything to be portable
  – Participate in standards and adopt other standards (DAS, UCSC upload)
15 of 33
Ensembl – Open source
Freely available
Community development
– 51 Ensembl installs worldwide
– Both public and commercial, e.g.
  Gramene (CSHL)
  Fugu-sg (ICMB)
  Ciona-sg (Temasek)
16 of 33
Uploading data to Ensembl
17 of 33
Display of uploaded data
18 of 33
Comparing genomes
19 of 33
Many Genomes
[Figure: genomes covered. Labels: Vertebrate Compara, Worm Compara, Diptera Compara, InterPro. Species shown: Human, Chimp, Mouse, Rat, Chicken, Zebrafish, Takifugu, Tetraodon, C. elegans, C. briggsae, Drosophila, Anopheles, Honey bee.]
20 of 33
Many more genomes
• Ciona (C. savignyi and C. intestinalis)
• Rhesus
• Sea Urchin, Platynereis…
• Aedes, Ixodes… (vectors)
21 of 33
• High level overview of Ensembl
  – Making genomes useful
• Outline workshop
  – New web code
  – DAS, display your own data
  – Modify EnsMart
  – BLAST/SSAHA
  – Comparing genomes
  – Customising Ensembl
• Outlook
  – Manual annotation
  – Other features
22 of 33
Future plans
• New data
  – More species
  – Variation data
  – Comparative data
• More integrated views
  – GeneSNPView
  – Comparative ContigView
• More focused tool displays
  – Primer & haplotype selection
• Greater integration of user data
  – Gene & Protein DAS
23 of 33
Challenges
• What is the right way to calculate evolutionary relationships between these genomes?
  – How different is the gene build for each new genome?
• Is there novel information to be deduced from the set of related genomes?
• How do we integrate “close” genomes and genome variation?
24 of 33
Manual Curation
• People are the best at
  – Resolving conflicting heterogeneous information
  – Recognising “out of the ordinary” biology
• For high-investment genomes, an automated pipeline with human intervention is the endgame
  – Human and Mouse
25 of 33
Vega
• Vega is the collection of manually annotated human and other vertebrate genome data
  – Reuses Ensembl database and website technology
  – Reuses Ensembl pipelines for Sanger annotation
26 of 33
Two types of variation data
Natural
• Limitless
• Dense markers required
• Need for optimal experimental design (HapMap)
• Human and Anopheles
Managed
• Limited strain number
• Light density adequate for some uses
• (dense for complete dataset)
• Mouse, Rat
27 of 33
Variation data (now)
• dbSNP-centric
  – Key data: SNP position and allele
  – Calculate derived properties (coding SNP, amino acid change)
• Provide views on ContigView and TransView
• Provide selection via EnsMart
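As an illustration of a derived property, here is a minimal Python sketch that works out the amino acid change caused by a SNP in a coding sequence. It is not the Ensembl implementation: the function name `snp_consequence`, the 0-based indexing and the toy sequence are assumptions, and the sketch ignores strand, introns and splice sites.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def snp_consequence(cds, index, alt):
    """Describe a SNP at 0-based position `index` of coding sequence `cds`.

    Returns the 1-based residue number, the reference and alternative
    amino acids, and whether the change is synonymous.
    """
    offset = index % 3
    codon_start = index - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return {
        "residue": codon_start // 3 + 1,
        "ref_aa": ref_aa,
        "alt_aa": alt_aa,
        "kind": "synonymous" if ref_aa == alt_aa else "non-synonymous",
    }

# Toy coding sequence ATG GAA GTT (Met-Glu-Val): the A>T change in the
# second codon turns GAA (Glu) into GTA (Val).
print(snp_consequence("ATGGAAGTT", 4, "T"))
```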
28 of 33
Variation data (expected)
• Recombination variability and population history of a species provide for optimal experimental design
  – “HapMap”
• Have to add individual, cohort, population and genotype concepts
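One way to picture the added concepts is as a small data model. The Python sketch below is hypothetical and is not the Ensembl variation schema: the class names, fields and sample genotypes are invented for illustration, and a cohort could be modelled as a further grouping alongside population.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Individual:
    name: str
    population: str  # hypothetical: one population label per individual

@dataclass(frozen=True)
class Genotype:
    individual: Individual
    snp_id: str
    alleles: tuple  # the two alleles observed in a diploid individual

def allele_frequencies(genotypes, snp_id, population):
    """Allele frequencies for one SNP within one population."""
    counts = Counter(
        allele
        for g in genotypes
        if g.snp_id == snp_id and g.individual.population == population
        for allele in g.alleles
    )
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()} if total else {}

# Invented sample data: two individuals in P1, one in P2.
ind1 = Individual("ind1", "P1")
ind2 = Individual("ind2", "P1")
ind3 = Individual("ind3", "P2")
genotypes = [
    Genotype(ind1, "snp1", ("A", "G")),
    Genotype(ind2, "snp1", ("A", "A")),
    Genotype(ind3, "snp1", ("G", "G")),
]
print(allele_frequencies(genotypes, "snp1", "P1"))
```

Keeping genotypes per individual, rather than storing only per-SNP alleles, is what makes population-level summaries such as these frequencies possible.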
29 of 33
Variation data (future)
• Allow for inexpensive hyper-dense genotype determination of large cohorts
• Integrate population substructure, close species and individual variation
  – Understanding positive and negative selection
30 of 33
Other genomic features
31 of 33
There are more than genes!
• RNA genes
  – “Well known” structural RNA genes
  – Newer miRNA genes
  – Pseudogenes/duplications a massive headache
• Cis-regulatory motifs
  – Transcriptional motifs
  – RNA processing motifs
• Yet unknown other stuff
32 of 33
Comparative genomics…
• Action of negative selection should let us see these features
  – Honest research problem: how does one expect promoters to evolve?
  – Overlapping signals, e.g. splicing enhancers in exons
33 of 33
Thanks
Ensembl Team
Database Schema and Core API: Arne Stabenau, Yuan Chen, Ian Longden, Craig Melsopp, Glenn Proctor, Daniel Ríos, Guy Slater
Distributed Annotation System: Andreas Kähäri
Project Leader: Ewan Birney (EBI), Tim Hubbard (Sanger)
Ensembl Web Team: James Stalker, Fiona Cunningham, James Smith
Vega Web Team: Patrick Meidl, Steve Trevanion
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Dan Andrews, Mario Caccamo, Laura Clarke, Martin Hammond, Jan-Hinnerk Vogel, Kevin Howe, Vivek Iyer, Kerstin Jekosch, Felix Kokocinski, Simon White
User Support: Xosé Mª Fernández, Michael Schuster
Comparative Genomics: Abel Ureta-Vidal, Javier Herrero Sánchez, Jessica Severin, Cara Woodwark
EnsMart & BioMart: Arek Kasprzyk, Damian Keefe, Darin London, Damian Smedley