Biology 162: Computational Genetics
Fall 2004
Todd Vision
Assistant Professor
Department of Biology, UNC Chapel Hill
Bioinformatics vs computational genetics
• Bioinformatics: The application of computing technology to molecular biology
• Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics
Course emphasis
• Data analysis in molecular genetics
• We will not cover
  – Developments in IT hardware
  – Analysis of protein structure
  – Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e., systems biology)
Prerequisites
• Bio 50: Molecular Biology and Genetics
  – Gene/protein structure and expression
  – Principles of inheritance
• Comp Sci 14: Introduction to Programming
  – Algorithms and their design
  – Fundamental programming skills
• Stat 31: Introduction to Statistics
  – Probability and distributions
  – Hypothesis testing and parameter estimation
Related courses at UNC
• Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio)
• Summer courses in
  – Computer Science
• Graduate courses in
  – Bioinformatics and Computational Biology
  – Biostatistics
  – School of Pharmacy
Readings
• Gibson and Muse, A Primer of Genome Science, Sinauer Associates.
  – Available in Student Bookstore
  – Primarily covers genomic technologies
  – Brief on computational/statistical aspects
• Supplemental papers
  – Handed out in class or posted on Blackboard
  – Includes
    • More detail on computational/statistical aspects
    • Papers which you will review for class assignments
https://blackboard.unc.edu
Computer labs / Problem sets
• Thursdays 3:30-4:30 in Wilson 132
• Assignments are due following Tuesday
• Purpose:
  – Familiarity with genomic databases and tools
    • Functional and evolutionary sequence analysis
    • Gene expression analysis
    • Mapping of genomes and complex traits
  – Comfort with command-line tools and computing
  – Exercise of scientific reasoning and biological judgement
  – No programming required (but learn Perl anyway!)
Research paper
• Critical review of the computational challenges involved in assembly of the human genome
• Based on opposing articles from the main players in the drama
• Paper will be judged on
  – Understanding of content
  – Critical and synthetic reasoning
  – Clarity of scientific writing
Late policy
• Assignments are due at beginning of class on the due date
• Late assignments receive half credit
• Exceptions can be made but require more than 24 hours' notice
Group work
• You are encouraged to work together on most assignments (some exceptions)
• What you turn in should be your own
  – Show your work
  – Be able to defend your answers
• Know and love the UNC Honor Code
  – http://honor.unc.edu
Exams
• Two midterms
• Final exam will be cumulative
• May include material from labs/problem sets, readings and lectures
• Most questions will be similar to those on lab/problem sets
• You will receive a study guide in advance
Grading
• 10 Labs/problem sets - 50% (5% each)
• Review paper - 10%
• Midterms - 20% (10% each)
• Final exam - 20%
• Final grades
  – No curve, point divisions at discretion of instructor
  – Different divisions for undergraduate/graduate students
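The weights above combine into a single course score as a weighted average. The sketch below is only an illustration of that arithmetic: the function name and the assumption that every component is recorded as a percentage (0-100) are ours, and the letter-grade divisions remain at the instructor's discretion.

```python
# Weights are taken from the grading slide; everything else is illustrative.
WEIGHTS = {
    "labs": 0.50,      # 10 labs/problem sets at 5% each
    "paper": 0.10,     # review paper
    "midterms": 0.20,  # 2 midterms at 10% each
    "final": 0.20,     # cumulative final exam
}


def course_score(labs, paper, midterms, final):
    """Return the weighted course score on a 0-100 scale.

    labs and midterms are lists of percentage scores (0-100);
    paper and final are single percentage scores.
    """
    lab_avg = sum(labs) / len(labs)
    midterm_avg = sum(midterms) / len(midterms)
    return (
        WEIGHTS["labs"] * lab_avg
        + WEIGHTS["paper"] * paper
        + WEIGHTS["midterms"] * midterm_avg
        + WEIGHTS["final"] * final
    )
```

For example, `course_score([80] * 10, 90, [70, 90], 85)` adds 40 + 9 + 16 + 17 points.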
Computer lab server: Biolinux
• All necessary analysis software is installed
• Dell PowerEdge server
  – Red Hat Linux operating system
  – 2 Xeon processors
  – 2 GB RAM
  – 60 GB disk space
• Requires an ONYEN for login
• Uses AFS file space
Connecting to Biolinux
• biolinux.bio.unc.edu (IP 152.2.66.25)
• Windows
  – Zip archive contains necessary connection software
• Mac OS X
  – X11 for graphical sessions
  – Fugu for secure ftp
• Linux/Solaris/etc.
  – Should work as is
https://onyen.unc.edu
http://cilantro.bio.unc.edu/biolinux
Cretaceous Park?
• In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil.
• DNA was extracted
  – Care was taken to prevent contamination
• Specific regions were amplified
  – 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB
  – How would you design primers for dinosaur DNA?
  – All yielded products in mammals, birds and reptiles
  – Only one cytB pair yielded a product from the fossil
  – Negative controls did not reveal contamination
Cretaceous Park?
• One cytB fragment amplified
• 9 sequences obtained from two bone samples
  – Variability was present within and between the two samples; no two sequences were identical
• Consensus sequences used to search for homologs
  – Genbank (215,000 sequences)
  – BLAST
    • Measured percent identity
    • Closest matches were ~70% identical
      – Equidistant to mammals, birds, and reptiles
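To make the two computations on this slide concrete, here is a minimal sketch of a majority-rule consensus and a percent-identity measure. It is not the original study's procedure: it assumes the sequences are already aligned to equal length, uses "-" as the gap character, and counts identity only over gap-free columns (other denominators are in common use and give different numbers). The function names are hypothetical.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-rule consensus of equal-length, pre-aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    # most_common(1) picks the most frequent residue in each column;
    # ties go to the residue seen first in that column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )


def percent_identity(a, b, gap="-"):
    """Percent of gap-free aligned columns in which a and b match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

A single number like this says nothing about which positions differ or how significant the similarity is, which is part of why the later reanalyses used multiple alignment and phylogenetic methods.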
Cretaceous Park?
• One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians
• Other authors reanalyzed the data
  – Multiple alignment
  – Protein sequence scoring matrix
  – Phylogenetic analysis
• All concluded that the DNA was clearly mammalian, possibly human
• One group showed that similar sequences could be amplified from human nuclear DNA
Cretaceous Park?
• Three possibilities
  – Preparation of human nuclear DNA could have been contaminated by dinosaur DNA
  – Dinosaurs and humans might have hybridized during the Cretaceous
  – Dinosaur extracts were contaminated by human DNA
• Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs
• Lesson learned: naïve computational analysis can lead to very misguided conclusions!
Discussion question
• You are given the sequence of a new gene and asked to determine its function.
• How would you begin?
  – What ‘wet lab’ approaches are possible?
  – What ‘in silico’ approaches are possible?
  – What approaches might require both wet lab and in silico components?
Biological topics
• Sequence alignment and assembly
• Sequence homology searching
• Sequence evolution and phylogenetics
• Finding genes and other features
• Patterns of gene expression
• Genetic mapping
• Dissecting genetic diseases and quantitative traits
Computational topics
• Dynamic programming
• Regular expressions and suffix trees
• Markov chains
• Hidden Markov models and machine learning
• Techniques for clustering and classification
• Maximum likelihood and Bayesian statistics
• Graph traversal
Some informatics tools
• Genbank, Uniprot, and major sequence repositories
• InterPro and protein signature dBs
• Gene Ontology
• Model organism genome databases (SGD, FlyBase, Ensembl)
• A sampling of software programs
  – Chosen primarily for pedagogical utility
Genomics
• Genetics on lots of genes?
• Hypothesis-free science?
• Some technologies
• Enabled by
  – Robotics
  – Computers
Genome database examples
• Primary databases
  – Genbank/EMBL/DDBJ
• Secondary databases
  – Pfam (protein domains)
• Organism-specific
  – SGD (yeast genomics)
• Specialized dBs
  – OMIM (human genetic disorders)
• Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/
Growth of Genbank
http://www.expasy.org/cgi-bin/show_thumbnails.pl?2
First bacterial genome: 1995
• Haemophilus influenzae (TIGR)
  – 1.8 x 10^6 bp shotgun assembly
  – Required 9 months of computer time
• Now there are hundreds
  – 160 Bacterial
  – 19 Archaeal
  – 32 Eukaryotic
• Over a thousand projects ongoing
• And a bacterial genome takes only days to sequence and assemble
Tree of life
More protein families await
Other types of genomic data
• Spatiotemporal gene expression
• Alternative transcription
• Genetic knockout/overexpression phenotypes
• Genetic variability
  – Molecular polymorphism
• Phenotypic variation / disease
• Comparative data / molecular evolution
• Protein
  – Structure, including modifications
  – Interactions with other molecules
• Metabolic profiling, etc., etc.
Algorithmic/statistical innovations
• The most fundamental and heavily used application in the field is pairwise alignment
  – Smith-Waterman algorithm (1981)
    • Still too slow for general database search
  – BLAST (1990)
    • Made database search of 10^7-10^8 sequences feasible
    • Statistical ranking of each alignment
• Statistical methods in molecular evolution <25 yrs old
• Modern genetic mapping methods ~15 yrs old
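As a preview of the dynamic programming behind pairwise alignment, here is a minimal sketch of Smith-Waterman local alignment. The scoring values (match +2, mismatch -1, gap -2) and the linear gap penalty are illustrative assumptions, not the parameters of any particular tool; practical implementations typically use substitution matrices and affine gap costs.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh alignment here
                H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths. Repeating that for every sequence in a large database is what makes the exact method too slow for routine searching and motivates heuristics such as BLAST.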
Things to review
• Chemical differences among amino acids
• Prokaryotic and eukaryotic gene structure
• The central dogma
• Anatomy of a typical protein
Reading for Thursday
• Gibson and Muse, Ch. 1, Genome Projects, pp. 1-58.