GENOMICS
Using the Genome Browser to annotate genomic sequences.
UCSC
University of California Santa Cruz
Genome Browser
Dr. Inga Prokopenko
Genome sequencing: assembly of the available sequences
• A sequence is called “finished” when its error rate is below 1 in 10,000 bases and it has no gaps.
• The Human Genome Project was challenging both technically and computationally.
• The output of a single sequencing reaction (a read) is 500-800 bp.
• All the individual fragments had to be assembled into a single linear string.
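The assembly step can be illustrated with a toy greedy overlap merge. This is only a minimal sketch of the idea: the function names, the `min_len` threshold, and the error-free reads are assumptions made for illustration, and real assemblers must also handle sequencing errors, repeats, and both strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the remaining contigs (a single string if everything joins).
    """
    contigs = list(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the rest stay as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three short made-up "reads" that tile one region
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```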
Biological Databases
• Genome annotation is an active area of research and involves many organizations that publish their results in biological databases available to the whole community:
  • ENCyclopedia Of DNA Elements (ENCODE)
  • Entrez Gene
  • Ensembl
  • Gene Ontology Consortium
  • GeneRIF
  • RefSeq
  • UniProt
  • Vertebrate Genome Annotation project (Vega)
Genome browsers
• Help visualize complete annotated genomes (including genes and their structures, proteins, expression, regulation, variation, and comparative analyses)
• There are multiple annotation resources.
• Genome browsers:
  • UCSC Genome Bioinformatics Genome Browser and Tools (UCSC)
  • NCBI-UniGene (National Center for Biotechnology Information)
  • Ensembl, the Ensembl Genome Browser (Sanger Institute and EBI)
Copyright OpenHelix. No use or reproduction without express written consent
3 billion bases in plain-text format = no use at all
What is needed:
• Annotation of the sequence information
• The ability to quickly retrieve the sequence of specific genome regions based on criteria of:
  • Information content
  • Sequence features
Available genomes
Human (Homo sapiens) assembly
• 99% of gene-containing regions
• 99.99% accuracy
• 2.84 Gb finished, “highly contiguous”
Species: A. gambiae, A. mellifera, C. briggsae, C. elegans, C. intestinalis, Chicken, Chimp, Cow, D. ananassae, D. erecta, D. grimshawi, D. melanogaster, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans, D. virilis
UCSC Genome Browser
A system for “navigating” genome sequence and annotation. It displays information at different levels of magnification and retrieves portions of sequence together with their associated annotation, such as:
• Known genes and predicted genes
• ESTs, mRNAs
• CpG islands
• Assembly gaps and coverage, chromosome bands
• Homology with other genomes
• …
Organization of genomic data…
Genome backbone: base position number, sequence
Annotation Tracks
chromosome band
known genes
predicted genes
evolutionary conservation
SNPs
STS sites
gap locations
repeated regions
microarray/expression data
more…
Links out to more data
The UCSC Home page: genome.ucsc.edu
navigate
General information
Specific information: new features, current status, etc.
A sample of what we will find:
The Genome Browser Gateway: start page, basic search
text/ID searches
Helpful search examples and suggestions below
• Use this Gateway to search by:
  • Gene names, symbols
  • Chromosome number: chr7, or region: chr11:1038475-1075482
  • Keywords: kinase, receptor
  • IDs: NP, NM, OMIM, and more…
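A position query such as chr11:1038475-1075482 is easy to handle in a script. The sketch below shows one way to split it into chromosome, start, and end. The function name and the accepted chromosome names (digits, X, Y, M) are assumptions for illustration; the Gateway itself accepts more forms than this.

```python
import re

# Assumed pattern: "chr" + number or X/Y/M, optionally followed by ":start-end".
# Commas inside the numbers (e.g. 1,038,475) are tolerated.
_POSITION = re.compile(r"^(chr(?:\d+|X|Y|M))(?::([\d,]+)-([\d,]+))?$")


def parse_position(text: str):
    """Return (chrom, start, end) for a coordinate query, or None otherwise.

    A bare chromosome such as "chr7" gives (chrom, None, None).
    Anything else (gene name, keyword, accession) returns None.
    """
    m = _POSITION.match(text.strip())
    if m is None:
        return None
    chrom, start, end = m.groups()
    if start is None:
        return (chrom, None, None)
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i < 1 or end_i < start_i:
        raise ValueError(f"invalid range in {text!r}")
    return (chrom, start_i, end_i)


chrom, start, end = parse_position("chr11:1038475-1075482")
# Both ends are included, so the length is end - start + 1
print(chrom, start, end, end - start + 1)
```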
UCSC Genome Browser
Many ways to search for a specific region:
• chr7              an entire chromosome
• 20p13             a region (band p13 of chr. 20)
• chr3:1-1000000    the first million bases of chr. 3, starting from the p telomere
• D16S3046          the region around the marker (100,000 bases on each side)
• RH18061;RH80175   the region between the two markers
• AA205474          the genomic region that aligns with the sequence with this GenBank accession number
• PRNP              the genome region that contains the PRNP gene
• NM_017414         the genome region for this RefSeq identifier
• NP_059110         the genome region for this protein accession number
• 11274             LocusLink ID (LLID)
Or for lists of regions:
• pseudogene mRNA   lists transcribed pseudogenes, but not cDNAs
• homeobox caudal   lists mRNAs for caudal homeobox genes
• zinc finger       lists many zinc finger mRNAs
• huntington        lists candidate genes associated with Huntington's disease
The Genome Browser Gateway: start page choices, February 2005
Make your Gateway choices:
1. Select clade
2. Select species: search 1 species at a time
3. Assembly: the official backbone DNA sequence
4. Position: location in the genome to examine
5. Image width: how many pixels in display window; 5000 max
6. Configure: make fonts bigger + other choices
The Genome Browser Gateway: sample search for Human BRCA1
• Sample search: human, May 2004 assembly, BRCA1
• Often you will have to select the right gene from a results list
• Sometimes, you will go directly to a browser image (use an ID)
• Select the entry: AF005068, breast cancer 1, early onset
Overview of the whole Genome Browser page
(first day, new human release)
Genome viewer section
Track and image controls (day 1 = 40 tracks)
Overview of the whole Genome Browser page
(mature release)
Genome viewer section
Groups of data:
• Mapping and Sequencing Tracks
• Genes and Gene Prediction Tracks
• mRNA and EST Tracks
• Expression and Regulation
• Comparative Genomics
• ENCODE Tracks
• Variation and Repeats
Different species, different tracks, same software
Sample Genome Viewer image, BRCA1 region
Genome backbone
STS markers
Known genes
RefSeq genes
Gene predictions
GenBank mRNAs
repeats
GenBank ESTs
conservation
SNPs
MGC clones
Annotation Track options, defined
• Hide: removes a track from view
• Dense: all items collapsed into a single line
• Squish: each item on a separate line, but at 50% height and packed
• Pack: each item separate, but efficiently stacked (full height)
• Full: each item on a separate line
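The difference between the display modes can be pictured as three ways of laying out the same intervals. The sketch below is only an illustration of the definitions above, not the browser's own layout code; items are assumed to be (start, end) pairs with both ends included, and all function names are hypothetical.

```python
def full_rows(items):
    """Full: every item gets its own row."""
    return [[item] for item in sorted(items)]


def pack_rows(items):
    """Pack: put each item on the first row where it does not overlap."""
    rows = []
    for start, end in sorted(items):
        for row in rows:
            if row[-1][1] < start:  # last item on this row ends before we begin
                row.append((start, end))
                break
        else:
            rows.append([(start, end)])
    return rows


def dense_row(items):
    """Dense: collapse everything onto one line by merging overlapping items."""
    merged = []
    for start, end in sorted(items):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


items = [(1, 10), (5, 15), (12, 20), (30, 40)]
print(len(full_rows(items)), "rows in full mode")
print(len(pack_rows(items)), "rows in pack mode")
print(dense_row(items))
```

Squish would use the same row assignment as pack and only draw each row at half height.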
Clicking an annotation line, new page of detailed information
You will get detail for the single item you click.
Example: click on the black BRCA1 “Known Genes” line
Click the line: a new web page opens
Many details and links to more data about BRCA1
Click annotation track = BRCA1 “Known gene” detail page
informative description
other resource links
microarray data
mRNA secondary structure
links to sequences
protein domains/structure
homologs in other species
Gene Ontology™ descriptions
mRNA descriptions
pathways
Not all genes have this much detail.
Different annotation tracks carry different detail data.
SNP detail page sample
Getting the sequences
Get DNA, with Extended Options; or Details pages
• Use the DNA link at the top
• Plain or Extended options
• Change colors, fonts, etc.
Accessing the BLAT tool
• Rapid searches by INDEXING the entire genome
• Works best with high similarity matches
BLAT = BLAST-Like Alignment Tool
BLAT: BLAST-like alignment tool
• BLAT is very fast: it has been optimized to search whole genomes more quickly than BLAST does.
• UCSC has created an INDEX of all the 11-mers for DNA, or 4-mers for protein (that is, stretches of 11 nucleotides or 4 amino acids).
• BLAT looks up pieces of your query in its index of 11-mers, finds a match, and extends the alignment outward from there.
• BLAST does it the other way: it indexes your query and then runs that smaller index over the whole database. That is the essential difference between the two algorithms.
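The index-then-look-up idea can be sketched in a few lines. This is a toy illustration, not BLAT's implementation: it assumes the genome is indexed as non-overlapping k-mers while the query is scanned at every offset (a detail not stated on the slide), it only finds exact seeds, and the extension step is left out. The example uses k = 4 so the strings stay short.

```python
from collections import defaultdict


def build_index(genome: str, k: int = 11) -> dict:
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seeds(query: str, index: dict, k: int = 11) -> list:
    """Look up every overlapping k-mer of the query in the genome index.

    Returns (query_offset, genome_position) pairs; a real aligner would
    extend each seed in both directions to build the alignment.
    """
    seeds = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            seeds.append((q, g))
    return seeds


genome = "ACGTTGCAGGATCCTA"   # made-up sequence
index = build_index(genome, k=4)
print(find_seeds("TTGCAGG", index, k=4))
```

The index is built once per genome and reused for every query, which is where the speed comes from.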
BLAT
• UCSC documentation:
  • “On DNA queries, BLAT is designed to quickly find sequences with 95% or greater similarity of length 40 bases or more. It may miss genomic sequences that are more divergent or shorter than these minimums, although it will find perfect sequence matches of 33 bases and sometimes as few as 22 bases. The tool is capable of aligning sequences that contain large introns.
  • On protein queries, BLAT rapidly locates genomic sequences with 80% or greater similarity of length 20 amino acids or more. In general, gene family members that arose within the last 350 million years can generally be detected.”
BLAT tool overview: www.openhelix.com/sampleseqs.html
• Make choices
• Paste one or more sequences
• Or upload
• Submit
Limits: DNA 25000 bases; protein 10000 aa; 25 total sequences
BLAT results, with links
• Results with demo sequences, settings default; sort = Query, Score
• Score is a count of matches: higher number, better match
• Click browser to go to Genome Browser image location (next slide)
• Click details to see the alignment to genomic sequence (2nd slide)
BLAT results, alignment details browser
• From browser click in BLAT results
• A new line with Your Sequence from BLAT Search appears!
• Watch out for reading frame! Click - - - > to flip frame
• Base position = full and zoomed in enough to see amino acids
(figure labels: query, matches)
BLAT results, alignment details
Your query
Genomic match, color cues
Side-by-side alignment
In Silico PCR: find genomic sequence using primers
• Select genome
• Enter primers
• Minimum 15 bases
• Flip reverse primer?
• Submit
(note: the tool does not handle ambiguous bases at this time, so don't use Ns)
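The logic behind an in silico PCR search can be sketched as follows. This is a simplified illustration under stated assumptions, not the UCSC tool's algorithm: it searches a single template string on the forward strand, requires exact primer matches, and mirrors the slide's rules on minimum primer length and ambiguous bases. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_pcr(template: str, forward: str, reverse: str,
                  min_primer_len: int = 15):
    """Return the predicted product, or None if the primers do not both match.

    The reverse primer is given 5'->3' as ordered, so it is flipped
    (reverse-complemented) before being searched on the forward strand.
    """
    template, forward, reverse = template.upper(), forward.upper(), reverse.upper()
    for primer in (forward, reverse):
        if len(primer) < min_primer_len:
            raise ValueError("primers must be at least 15 bases long")
        if set(primer) - set("ACGT"):
            raise ValueError("ambiguous bases (such as N) are not supported")
    start = template.find(forward)
    if start == -1:
        return None
    site = reverse_complement(reverse)
    site_start = template.find(site, start + len(forward))
    if site_start == -1:
        return None
    return template[start:site_start + len(site)]


# Made-up template and primers
fwd = "ATGCGTACGTTAGCCGA"
rev = reverse_complement("CCGGATTACAGGCTTAA")
template = "GG" + fwd + "TTTTGGGGCCCCAAAA" + "CCGGATTACAGGCTTAA" + "TT"
product = in_silico_pcr(template, fwd, rev)
print(len(product), product)
```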
In Silico PCR: results
• Genomic location shown, links to Genome Viewer
• Your primers displayed, flipped if necessary
• Predicted genomic sequence shown
• Primer melting temperatures (Tm) provided
• Product size shown
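For a rough feel of what a primer melting temperature is, the classic Wallace rule, Tm = 2 °C per A/T + 4 °C per G/C, gives a quick estimate for short oligos. This is a swapped-in textbook approximation, not the formula the In Silico PCR tool uses (the slide does not say which one it applies), and the function name is hypothetical.

```python
def wallace_tm(primer: str) -> int:
    """Estimate Tm in degrees Celsius with the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C); intended only for short primers.
    """
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    if at + gc != len(p):
        raise ValueError("primer must contain only A, C, G, T")
    return 2 * at + 4 * gc


print(wallace_tm("ATGCGTACGTTAGCCGA"))
```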
Proteome Browser
• Access from homepage or Known Gene pages
• Exon diagram, amino acids
• Many protein properties (pI, mw, composition, 3D…)
more protein data
Gene Sorter
• From homepage select ‘Gene sorter’
Gene Sorter interface
• Sorts genes by several criteria
Gene Sorter interface
• Choose from 11 sorting options
Gene Sorter results