BINF 6211 Design and Implementation of …bioinformatics.gmu.edu/weller/BINF8211/Course...

Post on 30-Apr-2018

Winter 2008 UNCC CS/Bioinformatics Instructors: Weller, Carr

BINF 6211
Design and Implementation of Bioinformatics Databases
Lecture 22

April 21st, 2008
Dr. Jennifer W. Weller
Dr. Andrew Carr

Agenda

• Genetic Databases
  – OMIM
  – dbSNP

Genetic Information Databases

• From the phenotype perspective
  – A mutation may be inferred from the way it tracks in crosses (inheritance)
  – Given enough crosses, the relative location may be inferred
    • Recombination frequency with respect to linked phenotypes
  – The physical map location provides an absolute position
  – Mutant alleles have the sequence changes leading to the range of phenotypes associated with the disease
  – The frame of reference is within the set of samples having the phenotype
• From the physical location perspective
  – Not all sequence changes (alleles) lead to different phenotypes
    • The changes may be synonymous
    • The changes may lead to subtle, multi-genic effects
  – Variants are defined with respect to the gene/chromosomal location
  – The frame of reference is the genome sequence representative of a population.
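As a rough illustration of inferring relative location from crosses, the sketch below computes a recombination frequency from offspring counts and reads it as an approximate map distance (1% recombination is about 1 centimorgan, which is a reasonable approximation only for closely linked loci). The function names and the counts are made up for this example and are not from the lecture.

```python
def recombination_frequency(recombinant_offspring, total_offspring):
    """Fraction of offspring that show a recombinant combination of two phenotypes."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant_offspring must lie between 0 and total_offspring")
    return recombinant_offspring / total_offspring


def approx_map_distance_cm(rf):
    """Approximate map distance in centimorgans; only sensible for small rf."""
    return rf * 100.0


# Hypothetical cross: 18 recombinant offspring out of 200 scored.
rf = recombination_frequency(recombinant_offspring=18, total_offspring=200)
distance = approx_map_distance_cm(rf)
print(f"recombination frequency = {rf:.3f}, approx. distance = {distance:.1f} cM")
```

This gives only a relative order and spacing of loci; the absolute position still has to come from the physical map.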

Online Mendelian Inheritance in Man

• A catalog of human genes and genetic disorders
  – Expert curation: Dr. Victor A. McKusick (JHU) and colleagues
  – Developed for the World Wide Web and hosted by NCBI
• Contents: textual information and references, a federated database
  – Links to MEDLINE
  – Links to sequence records in Entrez
  – Links to additional related resources at NCBI and elsewhere, as curators deem relevant
  – http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim

• List is not comprehensive – contents must fit the categories defined.
• Two strategies to get subsets of interest
  – Via phenotype
    • Use the Limits page to retrieve records that have the prefixes (+, #, %, or no prefix) by checking the box in front of each, then click GO
  – Via clinical synopsis
    • Don’t select the above boxes, then click GO
    • Not all disease-related records have such a synopsis
  – Use the History page to combine the two searches

http://www.ncbi.nlm.nih.gov/Omim/mimstats.html

Y-linked is 4xxxxx OID

Record Types

• "Mendelian inheritance" refers to the transmission of inherited characters
  – Via reproductive transmission of genes.
• Character types have keys in certain ranges, as below
  – (100000- , 200000- ) Autosomal loci or phenotypes from before May 15, 1994
  – (300000- ) X-linked loci or phenotypes
  – (400000- ) Y-linked loci or phenotypes
  – (500000- ) Mitochondrial loci or phenotypes
  – (600000- ) Autosomal loci or phenotypes from after May 15, 1994
• Allelic variants have the MIM number of the parent entry, a period, then a unique 4-digit number.
  – Example: Factor IX (hemophilia B) locus is 306900
    • Alleles are 306900.0001 to 306900.0101.
  – The beta-globin locus (HBB) is 141900
    • Sickle hemoglobin allele is 141900.0243.
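The numbering scheme above can be sketched as a small parser: the first digit of the six-digit MIM number gives the record type, and an optional four-digit suffix identifies an allelic variant. The function name, the returned dictionary and the label wording are assumptions of this sketch, not part of OMIM.

```python
import re

# First digit of the MIM number -> record type, following the ranges on the slide.
MIM_RANGES = {
    "1": "autosomal (entry from before May 15, 1994)",
    "2": "autosomal (entry from before May 15, 1994)",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
    "6": "autosomal (entry from after May 15, 1994)",
}

# Six-digit parent number, optionally followed by '.' and a four-digit allele number.
MIM_PATTERN = re.compile(r"^(\d{6})(?:\.(\d{4}))?$")


def parse_mim(identifier):
    """Split a MIM identifier into parent number, allele suffix and record type."""
    match = MIM_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a MIM identifier: {identifier!r}")
    parent, allele = match.group(1), match.group(2)
    category = MIM_RANGES.get(parent[0])
    if category is None:
        raise ValueError(f"MIM number outside the listed ranges: {parent}")
    return {"parent": parent, "allele": allele, "category": category}


print(parse_mim("306900.0001"))  # a Factor IX allele
print(parse_mim("141900"))       # the HBB locus itself, no allele suffix
```

Keeping the allele suffix as a string preserves the leading zeros, so the original identifier can be rebuilt as parent + "." + allele.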

MIM special characters

• Symbols preceding an entry number mean:
  – An asterisk (*) indicates a gene of known sequence.
  – A number symbol (#) indicates that it is a descriptive entry
    • This is usually of a phenotype, and will be explained in the first paragraph; discussion of related genes is included in the Gene entry
    • This does not represent a unique locus.
  – A plus sign (+) indicates a description of a gene of known sequence and a phenotype.
  – A percent sign (%) indicates a confirmed Mendelian phenotype or phenotypic locus whose molecular basis is not known.
  – No symbol indicates a description of a phenotype where the Mendelian basis is not proven, or there may be phenotypic crossover
  – A caret symbol (^) indicates the entry no longer exists.
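One way to keep these symbols straight when processing a list of entries is a simple lookup table. The sketch below is illustrative only: the descriptions paraphrase the slide, and the function name and the sample number 123456 are hypothetical.

```python
# Prefix symbol -> meaning, paraphrased from the slide. "" stands for "no symbol".
MIM_PREFIXES = {
    "*": "gene of known sequence",
    "#": "descriptive entry, usually a phenotype; not a unique locus",
    "+": "gene of known sequence and a phenotype",
    "%": "confirmed Mendelian phenotype or locus, molecular basis not known",
    "^": "entry no longer exists",
    "": "phenotype whose Mendelian basis is not proven",
}


def split_prefix(entry):
    """Return (symbol, number, meaning) for an entry such as '%123456'."""
    entry = entry.strip()
    symbol = entry[0] if entry and entry[0] in "*#+%^" else ""
    return symbol, entry[len(symbol):], MIM_PREFIXES[symbol]


for sample in ("%123456", "123456"):  # placeholder numbers, not real records
    print(split_prefix(sample))
```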

Other limitations

• There is no aggregation function to track how many inherited diseases have a known sequence
• There is no aggregation function for inherited diseases with a known phenotype but no corresponding sequence (those with % prefix)
  – Entrez Gene will let you retrieve human genes for which there is no sequence data or for which only a phenotype is known, BUT there is no keyword here for disease genes
    • human[orgn] NOT gene_nucleotide[filter]
    • human[orgn] AND phenotype_only[Properties]
    • Note: these lists will overlap
    • To get those with a phenotype and an OMIM record use human[orgn] AND phenotype_only[Properties] AND gene_omim[filter]
• While there is an emphasis on inheritance and cytogenetics, there is very little information on chromosomal aberrations (these are often NOT inherited).
  – For this the genome-wide map of chromosomal break points elsewhere is a better source (although this does not include monoploid or polyploid examples)
• An OMIM record may link to a gene that is not actually the primary locus, if it was included in the discussion by the authors of a paper – these are the links at the top and bottom of the record, while those in the side bar are limited to the specific locus.
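The Entrez Gene queries above can also be issued programmatically. The sketch below only builds the request URL for the NCBI E-utilities ESearch service and does not send it. The field and filter names are copied from the slide as of 2008 and may have changed since; the helper names are assumptions of this example.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_terms(*terms):
    """Join Entrez query terms with AND."""
    return " AND ".join(terms)


def esearch_url(term, db="gene", retmax=20):
    """Build (but do not send) an ESearch request URL for the given query."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Genes known only by phenotype that also have an OMIM record (terms from the slide).
query = and_terms("human[orgn]", "phenotype_only[Properties]", "gene_omim[filter]")
url = esearch_url(query)
print(query)
print(url)
```

Building the URL separately from fetching it makes the query easy to inspect before any request is sent to NCBI.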

Gene Map

• This section has information on the cytogenetic locations of genes – a single tabular file, with chromosomes in order, each from p-tel to q-tel
  – Limited to those with demonstrated cytogenetic location
    • If only mapped to a chromosome, given at the end of the list for that chromosome
  – The Web version is searchable by gene symbol, chromosomal location or keyword
  – There is an associated file called the GeneMapKey to describe the column headings in the file, and special characters used

Morbid Map

• Alphabetical list of diseases described in OMIM, with location as known
  – Searchable by gene name, location and keyword
  – There is a graphical view of this data, which can be displayed in the Entrez Map Viewer
    • You need to select the correct display settings in order to interpret the input correctly
• Special symbols
  – [ ] indicates information for molecular aberrations that don’t lead to something classified as a disease
  – { } indicates a variant that leads to pathogen susceptibility
  – ? means the mapping status is unresolved
• After the name of the disorder there is a (number) that indicates the method of mapping
  – With respect to the WT gene (1)
  – With respect to the mutant allele (2)
  – Both (3)
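To make the Morbid Map conventions concrete, the sketch below parses a disorder label into its name, its special-symbol flag and its mapping method. The label layout used here (symbol around or before the name, method number in parentheses at the end) and the sample labels are assumptions for illustration; check the actual Morbid Map file before relying on it.

```python
import re

# Mapping-method codes and special symbols as described on the slide.
MAPPING_METHOD = {1: "wild-type gene", 2: "mutant allele", 3: "both"}
FLAGS = {
    "[": "molecular aberration, not classified as a disease",
    "{": "susceptibility variant",
    "?": "mapping status unresolved",
}


def parse_disorder(label):
    """Split a Morbid Map style label into name, flag and mapping method."""
    label = label.strip()
    method = None
    trailing = re.search(r"\((\d)\)\s*$", label)
    if trailing:
        method = MAPPING_METHOD.get(int(trailing.group(1)))
        label = label[:trailing.start()].strip()
    flag = FLAGS.get(label[:1])          # None when the label has no special symbol
    name = label.strip("[]{}? ")          # drop the symbol characters themselves
    return {"name": name, "flag": flag, "method": method}


# Hypothetical labels, not taken from the real file.
print(parse_disorder("{Example susceptibility} (3)"))
print(parse_disorder("?Example disorder (2)"))
```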

Local hosting

• OMIM is not relational but some of the information is tracked in a relational system:
  – MIM number, create date, update dates
• It uses the ASN.1 format
  – You can download selected files (matrices of commonly requested data – think data hypercubes):
    • The complete text of OMIM
    • The OMIM Gene table, either from the ftp site or from the directory of the NCBI Web site as an alphabetical list of gene symbols and their MIM numbers
    • The OMIM Gene Map key and columns in the GeneMap file
    • The OMIM Morbid Map
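If you host the relationally tracked part yourself, one possible layout is a table of entries plus a table of update dates, since one MIM number has one create date but many updates. The sketch below uses SQLite from Python; the table and column names and all dates are invented for the example and do not reflect NCBI's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mim_entry (
        mim_number INTEGER PRIMARY KEY,
        created_on TEXT NOT NULL              -- ISO date string
    );
    CREATE TABLE mim_update (
        mim_number INTEGER NOT NULL REFERENCES mim_entry(mim_number),
        updated_on TEXT NOT NULL,
        PRIMARY KEY (mim_number, updated_on)  -- one row per update event
    );
    """
)

# Placeholder dates for the HBB entry; the real dates are not given in the lecture.
conn.execute("INSERT INTO mim_entry VALUES (?, ?)", (141900, "2000-01-01"))
conn.executemany(
    "INSERT INTO mim_update VALUES (?, ?)",
    [(141900, "2005-06-01"), (141900, "2008-03-15")],
)

# Most recent update per entry; ISO date strings sort correctly as text.
latest = conn.execute(
    """
    SELECT e.mim_number, e.created_on, MAX(u.updated_on)
    FROM mim_entry AS e
    JOIN mim_update AS u ON u.mim_number = e.mim_number
    GROUP BY e.mim_number, e.created_on
    """
).fetchall()
print(latest)
```

Splitting updates into their own table avoids a repeating group of date columns on the entry row.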

OMIM Case Study

• I want to design an array that will capture the mutations known to be associated with the Collagen I-A1 gene.

• I want to know what other genes I might need to assay for patients with this phenotype

• I want to know what patient data I should collect to do a good job on clinical signs and symptoms.

Nearby genes on the chromosome

Go back to the Collagen DB

Lists of specific types of mutations, but not in a text file

For SNP-specific assays

Circular

NIA database, but no data on bone

Pathway data rather than chromosomal location data

Signs and Symptoms

SNPs

• SNPs are single nucleotide polymorphisms, occurring at a frequency of roughly 1 per 300 nt
  – Sequence variants that do not change the number of bases in a gene
    • Can still cause early truncation of a gene product
• Most are biallelic
• Several large-scale international and commercial projects have undertaken to assess the level of polymorphism in the human genome, in various populations
• Why useful: the collection of such markers is unique for an individual
  – Mapping
  – Defining population structure
  – Performing functional studies

dbSNP

• dbSNP is a public database of single nucleotide polymorphisms (SNPs) and a bit more
  – Any species is allowed, and from any part of a particular genome.
  – SNPs linked to known genes or expressed DNA segments (ESTs) are most useful.
    • Thus SNPs from these regions are prioritized for integration with other NCBI databases/views/tools.
• dbSNP includes several types of simple genetic polymorphisms
  – single-base nucleotide substitutions
  – small-scale multi-base deletions or insertions
  – retroposable element insertions
  – microsatellite repeat variation.
• Experimental information is also included:
  – the sequence information around the polymorphism
  – specific experimental conditions necessary to perform an experiment (such as PCR of the locus)
  – frequency information by population or individual genotype.
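Frequency information by individual genotype reduces to counting alleles. As a minimal sketch, assuming diploid genotypes written as two-letter strings, allele frequencies for one SNP in one population can be computed as follows; the function name and the genotype list are made up for the example.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from diploid genotypes given as strings such as 'AG'."""
    counts = Counter()
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
        counts.update(genotype)  # adds one count for each of the two alleles
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


# Five hypothetical individuals typed at one biallelic SNP.
freqs = allele_frequencies(["AA", "AG", "AG", "GG", "AA"])
print(freqs)
```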

Integration map

dbSNP Schemas

• dbSNP Schema
  – > 100 tables and many relationships among tables.
  – No single ER diagram with all dbSNP tables is available
• Sub-schemas are available in which tables are grouped according to subject areas:
  – Batch Submission
  – Submitted SNP
  – Submitted SNP, population frequency and individual genotype
  – Frequency calculation by submitted SNP and population
  – SNP Mapping and Annotation
• Version control: b125_SNPContigLoc_b34_3 is the mapping data for b125 SNPs that are mapped to NCBI genome build 34 version 3.
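The version information is encoded in the table name itself, so it can be recovered with a pattern match. The sketch below assumes every versioned table follows the same shape as the one example on the slide (b<dbSNP build>_<table>_b<genome build>_<version>); other tables may not, and the function name is hypothetical.

```python
import re

# b<dbSNP build>_<table name>_b<genome build>_<genome version>
VERSIONED_TABLE = re.compile(r"b(\d+)_(\w+?)_b(\d+)_(\d+)")


def parse_versioned_table(name):
    """Split a versioned dbSNP table name into its build and version parts."""
    match = VERSIONED_TABLE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a versioned table name: {name!r}")
    snp_build, table, genome_build, genome_version = match.groups()
    return {
        "snp_build": int(snp_build),
        "table": table,
        "genome_build": int(genome_build),
        "genome_version": int(genome_version),
    }


print(parse_versioned_table("b125_SNPContigLoc_b34_3"))
```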

• SNPs are indexed by two different accession numbers
  – the HANDLE | ID / NCBI | ssASSAY ID forms, which refer to an individual submission record
  – the NCBI | rsSNP ID form, which refers to the abstracted SNP and all associated records.
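When mixing identifiers from different sources it helps to tell the two accession types apart before querying. A minimal sketch, assuming accessions are written as the ss or rs prefix followed by digits; the function name and the sample numbers are placeholders, not real records.

```python
import re

ACCESSION = re.compile(r"(rs|ss)(\d+)", re.IGNORECASE)

KINDS = {
    "ss": "submitted SNP (an individual submission record)",
    "rs": "reference SNP (the abstracted SNP and all associated records)",
}


def classify_accession(accession):
    """Return (prefix, number, description) for an ss or rs accession."""
    match = ACCESSION.fullmatch(accession.strip())
    if match is None:
        raise ValueError(f"not an ss/rs accession: {accession!r}")
    prefix = match.group(1).lower()
    return prefix, int(match.group(2)), KINDS[prefix]


for sample in ("ss12345", "rs678"):  # placeholder accessions
    print(classify_accession(sample))
```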

No data

Try next

What should the record have
