Biology 162: Computational Genetics Fall 2004

Biology 162: Computational Genetics
Fall 2004
Todd Vision
Assistant Professor
Department of Biology, UNC Chapel Hill

Bioinformatics vs computational genetics
• Bioinformatics: The application of computing technology to molecular biology
• Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics

Course emphasis
• Data analysis in molecular genetics
• We will not cover
  – Developments in IT hardware
  – Analysis of protein structure
  – Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e. systems biology)

Prerequisites
• Bio 50: Molecular Biology and Genetics
  – Gene/protein structure and expression
  – Principles of inheritance
• Comp Sci 14: Introduction to Programming
  – Algorithms and their design
  – Fundamental programming skills
• Stat 31: Introduction to Statistics
  – Probability and distributions
  – Hypothesis testing and parameter estimation

Related courses at UNC
• Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio)
• Summer courses in
  – Computer Science
• Graduate courses in
  – Bioinformatics and Computational Biology
  – Biostatistics
  – School of Pharmacy

Readings
• Gibson and Muse, A Primer of Genome Science, Sinauer Associates.
  – Available in Student Bookstore
  – Primarily covers genomic technologies
  – Brief on computational/statistical aspects
• Supplemental papers
  – Handed out in class or posted on Blackboard
  – Includes
    • More detail on computational/statistical aspects
    • Papers which you will review for class assignments

https://blackboard.unc.edu

Computer labs / Problem sets
• Thursdays 3:30-4:30 in Wilson 132
• Assignments are due the following Tuesday
• Purpose:
  – Familiarity with genomic databases and tools
    • Functional and evolutionary sequence analysis
    • Gene expression analysis
    • Mapping of genomes and complex traits
  – Comfort with command-line tools and computing
  – Exercise of scientific reasoning and biological judgement
  – No programming required (but learn Perl anyway!)

Research paper
• Critical review of the computational challenges involved in assembly of the human genome
• Based on opposing articles from the main players in the drama
• Paper will be judged on
  – Understanding of content
  – Critical and synthetic reasoning
  – Clarity of scientific writing

Late policy
• Assignments are due at the beginning of class on the due date
• Late assignments receive half-credit
• Exceptions can be made but require more than 24 hours' notice

Group work
• You are encouraged to work together on most assignments (some exceptions)
• What you turn in should be your own
  – Show your work
  – Be able to defend your answers
• Know and love the UNC Honor Code
  – http://honor.unc.edu

Exams
• Two midterms
• Final exam will be cumulative
• May include material from labs/problem sets, readings and lectures
• Most questions will be similar to those on lab/problem sets
• You will receive a study guide in advance

Grading
• 10 Labs/problem sets - 50% (5% each)
• Review paper - 10%
• Midterms - 20% (10% each)
• Final exam - 20%
• Final grades
  – No curve, point divisions at discretion of instructor
  – Different divisions for undergraduate/graduate students
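
The grading weights can be checked with a short calculation. The following is a minimal Python sketch, not part of the course materials: the function names and the sample scores are hypothetical, and it assumes every component is scored on a 0-100 scale. Letter-grade cutoffs are left to the instructor's discretion, so none are modeled here.

```python
# Weights taken from the grading slide; they sum to 1.0.
WEIGHTS = {
    "labs": 0.50,      # 10 labs/problem sets, 5% each
    "paper": 0.10,     # review paper
    "midterms": 0.20,  # two midterms, 10% each
    "final": 0.20,     # final exam
}


def mean(scores):
    """Average of a non-empty list of scores."""
    return sum(scores) / len(scores)


def course_score(labs, paper, midterms, final):
    """Return the weighted course score on a 0-100 scale (hypothetical helper)."""
    if len(labs) != 10:
        raise ValueError("expected 10 lab/problem set scores")
    if len(midterms) != 2:
        raise ValueError("expected 2 midterm scores")
    return (
        WEIGHTS["labs"] * mean(labs)
        + WEIGHTS["paper"] * paper
        + WEIGHTS["midterms"] * mean(midterms)
        + WEIGHTS["final"] * final
    )


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(course_score([80] * 10, 90, [70, 90], 60))
```

Because the labs and problem sets carry half the weight, averaging them first and then applying a single 50% weight is equivalent to weighting each of the ten at 5%.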

Computer lab server: Biolinux
• All necessary analysis software is installed
• Dell PowerEdge server
  – Red Hat Linux operating system
  – 2 Xeon processors
  – 2 GB RAM
  – 60 GB disk space
• Requires an ONYEN for login
• Uses AFS file space

Connecting to Biolinux
• biolinux.bio.unc.edu (IP 152.2.66.25)
• Windows
  – Zip archive contains necessary connection software
• MacOSX
  – X11 for graphical sessions
  – Fugu for secure ftp
• Linux/Solaris/etc.
  – Should work as is

https://onyen.unc.edu

http://cilantro.bio.unc.edu/biolinux

Cretaceous Park?
• In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil.
• DNA was extracted
  – Care was taken to prevent contamination
• Specific regions were amplified
  – 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB
  – How would you design primers for dinosaur DNA?
  – All yielded products in mammals, birds and reptiles
  – Only one cytB pair yielded a product from the fossil
  – Negative controls did not reveal contamination

Cretaceous Park?
• One cytB fragment amplified
• 9 sequences obtained from two bone samples
  – Variability was present within and between the two samples; no two sequences were identical
• Consensus sequences used to search for homologs
  – GenBank (215,000 sequences)
  – BLAST
• Measured percent identity
• Closest matches were ~70% identical
  – Equidistant to mammals, birds, and reptiles
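
Percent identity, the measure used in the search above, is the share of aligned positions at which two sequences carry the same residue. The following is a minimal Python sketch under stated assumptions: the two sequences are already aligned to equal length with `-` marking gaps, the function name and toy sequences are hypothetical, and columns containing a gap are excluded from the denominator (tools differ on this convention).

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0  # nothing to compare
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy sequences for illustration only, not the fossil data.
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAA"))
```

A single percent-identity figure says nothing about which lineage a sequence belongs to, which is why the reanalyses described next turned to multiple alignment and phylogenetic methods.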

Cretaceous Park?
• One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians
• Other authors reanalyzed the data
  – Multiple alignment
  – Protein sequence scoring matrix
  – Phylogenetic analysis
• All concluded that the DNA was clearly mammalian, possibly human
• One group showed that similar sequences could be amplified from human nuclear DNA

Cretaceous Park?
• Three possibilities
  – Preparation of human nuclear DNA could have been contaminated by dinosaur DNA
  – Dinosaurs and humans might have hybridized during the Cretaceous
  – Dinosaur extracts were contaminated by human DNA
• Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs
• Lesson learned: naïve computational analysis can lead to very misguided conclusions!

Discussion question
• You are given the sequence of a new gene and asked to determine its function.
• How would you begin?
  – What ‘wet lab’ approaches are possible?
  – What ‘in silico’ approaches are possible?
  – What approaches might require both wet lab and in silico components?

Biological topics
• Sequence alignment and assembly
• Sequence homology searching
• Sequence evolution and phylogenetics
• Finding genes and other features
• Patterns of gene expression
• Genetic mapping
• Dissecting genetic diseases and quantitative traits

Computational topics
• Dynamic programming
• Regular expressions and suffix trees
• Markov chains
• Hidden Markov models and machine learning
• Techniques for clustering and classification
• Maximum likelihood and Bayesian statistics
• Graph traversal

Some informatics tools
• GenBank, UniProt, and major sequence repositories
• InterPro and protein signature dBs
• Gene Ontology
• Model organism genome databases (SGD, FlyBase, Ensembl)
• A sampling of software programs
  – Chosen primarily for pedagogical utility

Genomics
• Genetics on lots of genes?
• Hypothesis-free science?
• Some technologies
• Enabled by
  – Robotics
  – Computers

Genome database examples
• Primary databases
  – GenBank/EMBL/DDBJ
• Secondary databases
  – Pfam (protein domains)
• Organism-specific
  – SGD (yeast genomics)
• Specialized dBs
  – OMIM (human genetic disorders)
• Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/

[Figure: Growth of GenBank]

http://www.expasy.org/cgi-bin/show_thumbnails.pl?2

First bacterial genome: 1995
• Haemophilus influenzae (TIGR)
  – 1.8 x 10^6 bp shotgun assembly
  – Required 9 months of computer time
• Now there are hundreds
  – 160 Bacterial
  – 19 Archaeal
  – 32 Eukaryotic
• Over a thousand projects ongoing
• And a bacterial genome takes only days to sequence and assemble

[Figure: Tree of life]

[Figure: More protein families await]

Other types of genomic data
• Spatiotemporal gene expression
• Alternative transcription
• Genetic knockout/overexpression phenotypes
• Genetic variability
  – Molecular polymorphism
• Phenotypic variation / disease
• Comparative data / molecular evolution
• Protein
  – Structure, including modifications
  – Interactions with other molecules
• Metabolic profiling, etc., etc.

Algorithmic/statistical innovations
• The most fundamental and heavily used application in the field is pairwise alignment
  – Smith-Waterman algorithm (1981)
    • Still too slow for general database search
  – BLAST (1990)
    • Made database search of 10^7-10^8 sequences feasible
    • Statistical ranking of each alignment
• Statistical methods in molecular evolution <25 yrs old
• Modern genetic mapping methods ~15 yrs old
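
The Smith-Waterman recurrence named above can be sketched in a few lines of dynamic programming. This is a minimal Python illustration, not the course's implementation: the scoring values (match +2, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen for the example, and only the best local alignment score is returned, without traceback.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in b
            left = H[i][j - 1] + gap  # gap in a
            # Flooring at 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # Toy sequences for illustration only.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The nested loops fill one cell per pair of positions, so the cost grows with the product of the two sequence lengths. Repeating that for every database entry is what makes the exact method slow for large searches, and it is the motivation for heuristic tools such as BLAST.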

Things to review
• Chemical differences among amino acids
• Prokaryotic and eukaryotic gene structure
• The central dogma
• Anatomy of a typical protein

Reading for Thursday
• Gibson and Muse, Ch. 1 Genome Projects, pgs. 1-58.

