Winter 2008 UNCC CS/Bioinformatics Instructors: Weller, Carr
BINF 6211: Design and Implementation of Bioinformatics Databases
Lecture 22
April 21st, 2008
Dr. Jennifer W. Weller
Dr. Andrew Carr
Agenda
• Genetic Databases
– OMIM
– dbSNP
Genetic Information Databases
• From the phenotype perspective
– A mutation may be inferred from the way it tracks in crosses (inheritance)
– Given enough crosses, the relative location may be inferred
  • Recombination frequency with respect to linked phenotypes
– The physical map location provides an absolute position
– Mutant alleles have the sequence changes leading to the range of phenotypes associated with the disease
– The frame of reference is within the set of samples having the phenotype
• From the physical location perspective
– Not all sequence changes (alleles) lead to different phenotypes
  • The changes may be synonymous
  • The changes may lead to subtle, multi-genic effects
  • Variants are defined with respect to the gene/chromosomal location
  • The frame of reference is the genome sequence representative of a population.
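The inference of relative location from crosses can be sketched as a small calculation. This is a minimal illustration, assuming the usual approximation that 1% recombination corresponds to about 1 centimorgan for closely linked markers; the function names and the offspring counts are hypothetical and are not taken from the lecture.

```python
def recombination_frequency(recombinant_offspring: int, total_offspring: int) -> float:
    """Fraction of offspring that show a recombinant combination of two phenotypes."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must be between 0 and total_offspring")
    return recombinant_offspring / total_offspring


def map_distance_cm(rf: float) -> float:
    """Approximate map distance in centimorgans (assumes 1% recombination ~ 1 cM).

    The approximation is only reasonable for small recombination frequencies.
    """
    return rf * 100.0


# Hypothetical cross: 1000 offspring scored, 80 of them recombinant.
rf = recombination_frequency(80, 1000)
print(f"recombination frequency = {rf:.3f}, approx. distance = {map_distance_cm(rf):.1f} cM")
```

This gives only a relative position between two linked phenotypes; the absolute position still comes from the physical map.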
Online Mendelian Inheritance in Man
• A catalog of human genes and genetic disorders
– Expert curation: Dr. Victor A. McKusick (JHU) and colleagues
– Developed for the World Wide Web and hosted by NCBI.
• Contents: textual information and references, a federated database
– Links to MEDLINE
– Links to sequence records in Entrez
– Links to additional related resources at NCBI and elsewhere, as curators deem relevant.
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim
• List is not comprehensive – contents must fit the categories defined.
• Two strategies to get subsets of interest
– Via phenotype
  • Use the Limits page to retrieve records that have the prefixes (+, #, %, or no prefix) by checking the box in front of each, then GO
– Via clinical synopsis
  • Don’t select the above boxes, then GO
  • Not all disease-related records have such a synopsis
– Use the History page to combine the two searches
http://www.ncbi.nlm.nih.gov/Omim/mimstats.html
Y-linked is 4xxxxx OID
Record Types
• "Mendelian inheritance" refers to the transmission of inherited characters
– Via reproductive transmission of genes.
• Character types have keys in certain ranges, as below
– (100000- 200000- ) Autosomal loci or phenotypes from before May 15, 1994.
– (300000- ) X-linked loci or phenotypes
– (400000- ) Y-linked loci or phenotypes
– (500000- ) Mitochondrial loci or phenotypes
– (600000- ) Autosomal loci or phenotypes from after May 15, 1994
• Allelic variants have the MIM number of the parent entry, a period, then a unique 4-digit number.
– Example: Factor IX (hemophilia B) locus is 306900
  • Alleles are 306900.0001 to 306900.0101.
– The beta-globin locus (HBB) is 141900
  • Sickle hemoglobin allele is 141900.0243.
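The numbering scheme above can be turned into a small parser. This is a sketch only: the function name, the returned dictionary layout and the category wording are assumptions made for illustration; the ranges and the example numbers come from the slide.

```python
import re

# First digit of the six-digit MIM number -> category, as listed on the slide.
MIM_RANGES = {
    "1": "autosomal (before May 15, 1994)",
    "2": "autosomal (before May 15, 1994)",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
    "6": "autosomal (after May 15, 1994)",
}

# Six digits, optionally followed by a period and a four-digit allele number.
MIM_PATTERN = re.compile(r"^(\d{6})(?:\.(\d{4}))?$")


def parse_mim(identifier: str) -> dict:
    """Split a MIM identifier into parent entry, optional allele, and category."""
    match = MIM_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a MIM number: {identifier!r}")
    entry, allele = match.groups()
    category = MIM_RANGES.get(entry[0])
    if category is None:
        raise ValueError(f"first digit outside the documented ranges: {identifier!r}")
    return {"entry": entry, "allele": allele, "category": category}


for example in ("306900", "306900.0001", "141900.0243"):
    print(example, parse_mim(example))
```

Keeping the allele as a string preserves the leading zeros, so the original identifier can be rebuilt as entry + "." + allele.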
MIM special characters
• Symbols preceding an entry number mean:
– An asterisk (*) indicates a gene of known sequence.
– A number symbol (#) indicates that it is a descriptive entry
  • This is usually of a phenotype, and will be explained in the first paragraph; discussion of related genes is included in the Gene entry
  • This does not represent a unique locus.
– A plus sign (+) indicates a description of a gene of known sequence and a phenotype.
– A percent sign (%) indicates a confirmed Mendelian phenotype or phenotypic locus for which the molecular basis is not known.
– No symbol indicates a description of a phenotype where the Mendelian basis is not proven, or there may be phenotypic crossover
– A caret symbol (^) indicates that the entry no longer exists.
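The prefix symbols can be handled with a simple lookup. This is a minimal sketch: the function names are hypothetical, the meanings are paraphrased from the slide, and the prefixes attached to the example numbers are made up for illustration rather than checked against OMIM.

```python
# Prefix symbol -> meaning, paraphrased from the slide. "" means no symbol.
PREFIX_MEANING = {
    "*": "gene of known sequence",
    "#": "descriptive entry, usually a phenotype; not a unique locus",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed Mendelian phenotype or locus; molecular basis not known",
    "^": "entry no longer exists",
    "": "phenotype whose Mendelian basis is not proven",
}


def split_prefix(entry: str) -> tuple:
    """Return (prefix, number); prefix is '' when the entry has no symbol."""
    entry = entry.strip()
    if entry and entry[0] in "*#+%^":
        return entry[0], entry[1:]
    return "", entry


def describe(entry: str) -> str:
    """Human-readable description of a prefixed MIM entry."""
    prefix, number = split_prefix(entry)
    return f"{number}: {PREFIX_MEANING[prefix]}"


# Illustrative input only: these prefix/number pairings are not verified.
for entry in ("*141900", "%306900", "141900"):
    print(describe(entry))
```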
Other limitations
• There is no aggregation function to track how many inherited diseases have a known sequence
• There is no aggregation function for inherited diseases with a known phenotype but no corresponding sequence (those with % prefix)
– Entrez Gene will let you retrieve human genes for which there is no sequence data or for which only a phenotype is known, BUT there is no keyword here for disease genes
  • human[orgn] NOT gene_nucleotide[filter]
  • human[orgn] AND phenotype_only[Properties]
  • Note: these lists will overlap
  • To get those with a phenotype and an OMIM record use human[orgn] AND phenotype_only[Properties] AND gene_omim[filter]
• While there is an emphasis on inheritance and cytogenetics, there is very little information on chromosomal aberrations (these are often NOT inherited).
– For this the genome-wide map of chromosomal break points elsewhere is a better source (although this does not include monoploid or polyploid examples)
• An OMIM record may link to a gene that is not actually the primary locus, if it was included in the discussion by the authors of a paper – these are the links at the top and bottom of the record, while those in the side bar are limited to the specific locus.
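The Entrez Gene queries listed above can also be sent programmatically. The sketch below only builds request URLs for the NCBI E-utilities esearch endpoint and makes no network call. The query strings are copied from the slide; whether each filter is still recognized by the current Entrez Gene index has not been verified here, and the function name and the retmax default are assumptions.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query strings as given on the slide.
QUERIES = {
    "no sequence data": "human[orgn] NOT gene_nucleotide[filter]",
    "phenotype only": "human[orgn] AND phenotype_only[Properties]",
    "phenotype only with OMIM record":
        "human[orgn] AND phenotype_only[Properties] AND gene_omim[filter]",
}


def gene_search_url(term: str, retmax: int = 20) -> str:
    """Build an esearch URL against the Entrez Gene database (db=gene)."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


for label, term in QUERIES.items():
    print(label)
    print("  " + gene_search_url(term))
```

The response to such a request reports a total hit count, which is one way to work around the missing aggregation functions, at least at the level of gene records.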
Gene Map
• This section has information on the cytogenetic locations of genes – a single tabular file, with chromosomes in order, each from ptel to qtel
– Limited to those with demonstrated cytogenetic location
  • If only mapped to a chromosome, given at the end of the list for that chromosome
– The Web version is searchable by gene symbol, chromosomal location or keyword
– There is an associated file called the GeneMapKey to describe the column headings in the file, and special characters used
Morbid Map
• Alphabetical list of diseases described in OMIM, with location as known
– Searchable by gene name, location and keyword
– There is a graphical view of this data, which is visualizable in the Entrez Map Viewer
  • You need to select the correct display settings in order to interpret the input correctly
• Special symbols
– [ ] information for molecular aberrations that don’t lead to something classified as a disease
– { } indicates a variant that leads to pathogen susceptibility
– ? means the mapping status is unresolved
• After the name of the disorder there is a (number) that indicates the method of mapping
  • With respect to the WT gene (1)
  • With respect to the mutant allele (2)
  • Both (3)
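The Morbid Map conventions above can be decoded with a short parser. This is a sketch under assumptions: the exact layout of a disorder field in the downloadable file is not shown in the lecture, so the pattern below assumes an optional leading "?", a name optionally wrapped in [ ] or { }, and a trailing "(1)", "(2)" or "(3)". The disorder names in the example are invented.

```python
import re

# Mapping-method codes as listed on the slide.
METHODS = {
    1: "mapped with respect to the wild-type gene",
    2: "mapped with respect to the mutant allele",
    3: "both",
}

# Assumed field layout: [?]name (n)
ENTRY = re.compile(r"^(?P<unresolved>\?)?\s*(?P<name>.+?)\s*\((?P<method>[123])\)\s*$")


def parse_disorder(text: str) -> dict:
    """Decode the special symbols and mapping-method number of one disorder field."""
    match = ENTRY.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized disorder field: {text!r}")
    name = match.group("name")
    if name.startswith("[") and name.endswith("]"):
        kind, name = "not classified as a disease", name[1:-1]
    elif name.startswith("{") and name.endswith("}"):
        kind, name = "susceptibility", name[1:-1]
    else:
        kind = "disease"
    return {
        "name": name.strip(),
        "kind": kind,
        "mapping_unresolved": match.group("unresolved") is not None,
        "method": METHODS[int(match.group("method"))],
    }


# Invented examples, for illustration only.
for field in ("Example disorder (3)", "{Example susceptibility} (2)", "?[Example trait] (1)"):
    print(parse_disorder(field))
```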
Local hosting
• OMIM is not relational but some of the information is tracked in a relational system:
– MIM number, create date, update dates
• It uses the ASN.1 format
– You can download selected files (matrices of commonly requested data – think data hypercubes):
  • The complete text of OMIM
  • The OMIM Gene table, either from the ftp site or from the directory of the NCBI Web site as an alphabetical list of gene symbols and their MIM numbers.
  • The OMIM Gene Map key and columns in the GeneMap file
  • The OMIM Morbid Map
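For local hosting, the gene symbol / MIM number list is the easiest piece to put into a relational table. The sketch below uses an in-memory SQLite database. The table and column names are hypothetical, and the two rows are typed in by hand from the loci mentioned earlier in the lecture (F9 is the standard symbol for Factor IX and does not appear on the slides); a real load would read the downloaded file, whose exact format is not described here.

```python
import sqlite3

# In-memory database; a real local mirror would use a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE omim_gene ("
    " symbol TEXT PRIMARY KEY,"
    " mim_number INTEGER NOT NULL)"
)

# Hand-entered rows for illustration (loci named earlier in the lecture).
rows = [("F9", 306900), ("HBB", 141900)]
conn.executemany("INSERT INTO omim_gene (symbol, mim_number) VALUES (?, ?)", rows)
conn.commit()


def mim_for_symbol(connection, symbol):
    """Look up the MIM number for a gene symbol; None if the symbol is absent."""
    row = connection.execute(
        "SELECT mim_number FROM omim_gene WHERE symbol = ?", (symbol,)
    ).fetchone()
    return row[0] if row else None


print(mim_for_symbol(conn, "HBB"))
```

Once the data is in a table like this, the counts that OMIM itself cannot aggregate become ordinary GROUP BY queries.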
OMIM Case Study
• I want to design an array that will capture the mutations known to be associated with the Collagen I-A1 gene.
• I want to know what other genes I might need to assay for patients with this phenotype
• I want to know what patient data I should collect to do a good job on clinical signs and symptoms.
Nearby genes on the chromosome
Go back to the Collagen DB
Lists of specific types of mutations, but not in a text file
For SNP-specific assays
Circular
NIA database, but no data on bone
Pathway data rather than chromosomal location data
Signs and Symptoms
SNPs
• SNPs are single nucleotide polymorphisms, occurring at a frequency of about 1 per 300 nt
– Sequence variants that do not change the number of bases in a gene
  • Can still cause early truncation of a gene product
  • Most are biallelic
• Several large-scale international and commercial projects have undertaken to assess the level of polymorphism in the human genome, in various populations
• Why useful: the collection of such markers is unique for an individual
– Mapping
– Defining population structure
– Performing functional studies
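The point that a single-base substitution keeps the sequence length but can still truncate the product is easy to show on a toy coding sequence. This is a minimal sketch: the 12-base sequence is invented, the function names are hypothetical, and only the three standard stop codons are checked.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(cds: str):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    cds = cds.upper()
    for start in range(0, len(cds) - 2, 3):
        if cds[start:start + 3] in STOP_CODONS:
            return start // 3
    return None


def substitute(seq: str, position: int, base: str) -> str:
    """Replace the base at a 0-based position; the length is unchanged."""
    return seq[:position] + base + seq[position + 1:]


# Invented toy sequence: ATG TAC GGA TAA (stop at the fourth codon).
reference = "ATGTACGGATAA"
# Single-base change C -> A turns the second codon TAC into the stop codon TAA.
variant = substitute(reference, 5, "A")

print(len(reference) == len(variant))
print(first_stop_codon(reference), first_stop_codon(variant))
```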
dbSNP
• dbSNP is a public database of single nucleotide polymorphisms (SNPs) and a bit more
– Any species is allowed, and SNPs may come from any part of a particular genome.
– SNPs linked to known genes or expressed DNA segments (ESTs) are most useful.
  • Thus SNPs from these regions are prioritized for integration with other NCBI databases/views/tools.
• dbSNP includes several types of simple genetic polymorphisms
– single-base nucleotide substitutions
– small-scale multi-base deletions or insertions
– retroposable element insertions
– microsatellite repeat variation.
• Experimental information is also included:
– the sequence information around the polymorphism
– specific experimental conditions necessary to perform an experiment (such as PCR of the locus)
– frequency information by population or individual genotype.
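Frequency information by population can be derived from individual genotypes by counting alleles. The sketch below assumes diploid genotypes written as "A/G" strings; that input format, the function name and the counts are all made up for illustration and do not reflect how dbSNP stores or reports genotypes.

```python
def allele_frequencies(genotype_counts: dict) -> dict:
    """Compute allele frequencies from counts of individuals per genotype.

    genotype_counts maps genotype strings such as "A/G" to a number of individuals.
    """
    allele_counts = {}
    for genotype, individuals in genotype_counts.items():
        for allele in genotype.split("/"):
            allele_counts[allele] = allele_counts.get(allele, 0) + individuals
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: count / total for allele, count in allele_counts.items()}


# Hypothetical population sample of 100 individuals at one biallelic SNP.
counts = {"A/A": 50, "A/G": 40, "G/G": 10}
print(allele_frequencies(counts))
```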
Integration map
dbSNP Schemas
• dbSNP Schema
– > 100 tables and many relationships among tables.
– No single ER diagram with all dbSNP tables is available
• Sub-schemas are available in which tables are grouped according to subject areas:
– Batch Submission
– Submitted SNP
– Submitted SNP, population frequency and individual genotype
– Frequency calculation by submitted SNP and population
– SNP Mapping and Annotation
• Version control: b125_SNPContigLoc_b34_3 is the mapping data for b125 SNPs that are mapped to NCBI genome build 34 version 3.
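The version information packed into a table name such as b125_SNPContigLoc_b34_3 can be pulled apart with a regular expression. This is a sketch based only on the single example on the slide; it assumes other versioned tables follow the same b<SNP build>_<table>_b<genome build>_<version> pattern, which has not been checked against the full schema. The field names in the result are hypothetical.

```python
import re

# Pattern inferred from the one example on the slide.
TABLE_NAME = re.compile(
    r"^b(?P<snp_build>\d+)_(?P<table>[A-Za-z]+)_b(?P<genome_build>\d+)_(?P<genome_version>\d+)$"
)


def parse_versioned_table(name: str) -> dict:
    """Split a versioned dbSNP table name into its build and version components."""
    match = TABLE_NAME.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed pattern: {name!r}")
    return {
        "snp_build": int(match.group("snp_build")),
        "table": match.group("table"),
        "genome_build": int(match.group("genome_build")),
        "genome_version": int(match.group("genome_version")),
    }


print(parse_versioned_table("b125_SNPContigLoc_b34_3"))
```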
• SNPs are indexed by two different accession numbers
– the HANDLE | ID / NCBI | ssASSAY ID forms, which refer to an individual submission record
– the NCBI | rsSNP ID form, which refers to the abstracted SNP and all associated records.
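When working with lists of identifiers it helps to separate the two accession types early. The sketch below assumes the common short forms, an "ss" or "rs" prefix followed by digits; the function name is hypothetical and the example numbers are arbitrary, not real records looked up in dbSNP.

```python
import re

# Short accession form assumed here: "ss" or "rs" followed by digits.
ACCESSION = re.compile(r"^(rs|ss)(\d+)$", re.IGNORECASE)

KINDS = {
    "ss": "individual submission record",
    "rs": "abstracted SNP with all associated records",
}


def classify_accession(accession: str) -> dict:
    """Identify whether an accession is a submission (ss) or an abstracted SNP (rs)."""
    match = ACCESSION.match(accession.strip())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    prefix = match.group(1).lower()
    return {"prefix": prefix, "number": int(match.group(2)), "kind": KINDS[prefix]}


# Arbitrary example numbers, for illustration only.
for acc in ("ss12345", "rs67890"):
    print(acc, classify_accession(acc))
```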
No data
Try next
What should the record have