Introduction to Bioinformatics
• Cell biology
  – Organisms and cells
  – Building blocks of cells
  – How do genes encode proteins?
• Bioinformatics
  – What is bioinformatics?
  – Practical applications
  – Tools and databases
Contents
Cell Biology
Lineage tree of life on earth
• Prokaryotes
  – Bacteria
  – Archaebacteria
• Eukaryotes
  – Plants
  – Animals
  – Fungi
• Single-celled organisms
• Consist of cytosol bounded by the plasma membrane
• Possess a cell wall
• Gram-negative bacteria have a thin cell wall and an outer membrane
• Gram-positive bacteria have a thick cell wall and no outer membrane
• DNA is condensed in the center of the cell (the nucleoid) and is not enclosed in a defined nucleus
• Ribosomes are found in the DNA-free region
• Relatively simple internal organization
• Some can grow in extreme conditions (temperature, pH, salt concentration)
Prokaryotic cells
• Single-celled (unicellular fungi and protozoans) or multicellular organisms (plants and animals)
• Plant and fungal cells both possess a cell wall, but the walls differ in composition
• Surrounded by a plasma membrane, like the prokaryotes
• Contain a defined nucleus
• Structurally more complex: organelles, cytoskeleton
• Organelles are enclosed compartments separated from the cytoplasm by internal membranes
• The cytoskeleton is a network of structural proteins that gives the cell strength and rigidity; it can be connected to organelles and provides tracks for organelle movement
Eukaryotic cells
Animal cell structure
Plant cell structure
Building blocks of cells
• Macromolecules
  – Nucleic acids (e.g. DNA, RNA)
  – Proteins (e.g. collagen)
  – Sugars and their polymers (e.g. glucose, glycogen)
  – Lipids (e.g. cholesterol)
• Other molecules
  – Water
  – Ions
Central dogma
• Genetic information flow: DNA → mRNA → Protein
• Contains genetic information arranged in units termed “genes”
• In an organism, nearly all cells contain the same DNA
• Basic subunits
  – adenine (A)
  – guanine (G)
  – cytosine (C)
  – thymine (T)
Deoxyribonucleic acid (DNA)
Native DNA is a double helix of complementary antiparallel chains
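The "complementary antiparallel" relationship can be illustrated with a minimal sketch. It assumes only standard Watson-Crick pairing (A with T, G with C); the function name `reverse_complement` and the example strand are illustrative choices, not part of the slides.

```python
# Watson-Crick base pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written in its own 5'->3' direction."""
    # Complement every base, then reverse, because the two chains
    # of the double helix run in opposite (antiparallel) directions.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    top = "ATGGCTTGT"  # illustrative strand, read 5'->3'
    print(top, "pairs with", reverse_complement(top))
```

Applying the function twice returns the original strand, which is a quick sanity check that the two chains determine each other.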
DNA is packaged into chromosomes
The total DNA in the chromosomes of an organism is its genome
Ribonucleic acid (RNA)
• Carries genetic information from DNA in the form of messenger RNA (mRNA)
• In an organism, different cells contain different sets of mRNAs
• Basic subunits
  – adenine (A)
  – guanine (G)
  – cytosine (C)
  – uracil (U)
Protein
• Expresses genetic information as an amino acid sequence
• Basic subunits are 20 amino acids
• A protein’s amino acid sequence determines its 3D structure, which in turn determines the function of that protein
Question: What are essential amino acids?
Answer: Amino acids that body cells cannot synthesize and that therefore have to be included in the diet. Soybean is rich in essential amino acids; corn protein is comparatively low in lysine and tryptophan.
The genetic code is a triplet code
• Example:
Gene X    ATG GCT TGT TTA CGA ATT TAG
mRNA      AUG GCU UGU UUA CGA AUU UAG
Protein   Met Ala Cys Leu Arg Ile *
          M   A   C   L   R   I   *
How do genes encode proteins?
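The Gene X example can be reproduced with a short sketch. It assumes the standard genetic code and treats the gene sequence as the coding strand, so transcription amounts to replacing T with U; the names `transcribe`, `translate` and `CODON_TABLE` are illustrative.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G
# at each codon position; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon; one-letter amino acids."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene_x = "ATGGCTTGTTTACGAATTTAG"  # Gene X from the example, spaces removed
    mrna = transcribe(gene_x)
    print(mrna)
    print(translate(mrna))
```

The table has 4 × 4 × 4 = 64 codons for 20 amino acids plus stop signals, which is why several codons map to the same amino acid.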
Bioinformatics
• Background
  – Massive explosion in the amount of biological information available due to huge advances in the fields of molecular biology and genomics.
• What is bioinformatics?
  – Bioinformatics is the application of computer technology to the management, interpretation and analysis of biological data.
  – An interdisciplinary research area at the interface between the biological and computational sciences.
• Goals
  – To uncover the wealth of biological information hidden in the mass of data.
  – To provide improvements in research fields such as human health, agriculture, the environment, energy and biotechnology.
What is Bioinformatics?
• Large scale sequencing projects
  – Genome sequencing
    • Examples: microbial or human genome sequencing
    • Determine the DNA sequence of an organism
    • Discover “genes” in the genome using bioinformatics tools
  – EST (Expressed Sequence Tag) sequencing
    • Examples: a specific tissue or cell type from a given organism
    • Determine the mRNA sequences found in a specific tissue or cell type
    • Determine the “genes” expressed in a specific tissue or cell type
Data generation
DNA sequencing
                       E. coli   Arabidopsis   Rice     Maize    Human
Genome size (Mb)       4.6       125           400      2,500    3,000
Repetitive DNA         -         10%           40%      80%      80%
Estimated gene count   4,000     27,000        40,000   45,000   35,000
Genome sizes
• Whole genome shotgun sequencing
  – For relatively small genomes (e.g. bacteria)
  – Break up the genome into small DNA fragments
  – Rely on computer algorithms to assemble the fragments
  – Examples: microbial genomes, Drosophila, Human (Celera)
• Hierarchical sequencing
  – For large genomes (e.g. human, maize)
  – Break the genome into many long pieces
  – Map each long piece onto the chromosome (physical mapping)
  – Select and sequence pieces with minimal overlaps
  – Examples: Rice, Human
Genome sequencing strategies
Genome: cut many times at random
1. Cut genomic DNA into pieces
2. Sequence random fragments
3. Put the sequences together into one piece, relying on computer algorithms
Whole Genome Shotgun
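Step 3 can be sketched with a toy greedy assembler that repeatedly merges the two fragments with the longest suffix-to-prefix overlap. This is only an illustration under strong assumptions (error-free reads, a single strand, no repeats); production assemblers use far more elaborate methods, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Three overlapping fragments of the 21-base Gene X example sequence.
    reads = ["TTACGAATTTAG", "ATGGCTTGT", "CTTGTTTACG"]
    print(greedy_assemble(reads))
```

Repeated sequence is what breaks this simple approach: identical overlaps from different genome locations cannot be told apart, which is one motivation for the hierarchical strategy.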
Genome (example locations: Chr 1, region 1; Chr 1, region 5)
1. Cut genomic DNA into pieces
2. Assign a chromosomal location to each DNA fragment
3. Sequence fragments originating from known locations
4. Stitch together the fragments from each chromosomal location
Hierarchical Sequencing
• The order of the nucleotide bases contains the instructions for making an organism.
• There are 4 types of nucleotide base: A (adenine), T (thymine), C (cytosine), G (guanine).
• In protein-coding regions, every three bases code for one amino acid.
• There are 20 different amino acids, which combine in different ways to make different proteins.
Genome facts
• The human genome is composed of more than 3 billion nucleotide bases.
• Almost all nucleotide bases (99.9%) are exactly the same in all people.
• Our DNA is 98% identical to that of chimpanzees.
• Less than 2% of the genome codes for proteins.
• The vast majority of the DNA in the genome (>97%) has no known function.
• The functions remain unknown for over 50% of discovered genes.
• Chromosome 1 has the most genes (2,968) and chromosome Y has the fewest (231).
• If unwound and joined end to end, the strands of DNA in one cell would stretch about 6 feet.
• The total number of human genes is estimated to be between 30,000 and 40,000.
Human genome facts
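A few of these figures can be checked with back-of-the-envelope arithmetic. The sketch below takes the genome size and percentages from the list above; the rise of 0.34 nm per base pair (B-form DNA) and the two genome copies per cell (a diploid cell) are assumptions not stated on the slide, and the function names are illustrative.

```python
GENOME_BP = 3_000_000_000   # "more than 3 billion" bases, one genome copy
IDENTICAL_FRACTION = 0.999  # fraction shared by all people
CODING_FRACTION = 0.02      # upper bound: "less than 2%"
RISE_PER_BP_NM = 0.34       # assumption: B-form DNA, nm per base pair
COPIES_PER_CELL = 2         # assumption: diploid cell
METERS_PER_FOOT = 0.3048


def differing_bases(genome_bp: int, identical_fraction: float) -> float:
    """Number of bases that may differ between two people."""
    return genome_bp * (1 - identical_fraction)


def coding_bases(genome_bp: int, coding_fraction: float) -> float:
    """Upper bound on the number of protein-coding bases."""
    return genome_bp * coding_fraction


def dna_length_feet(genome_bp: int, copies: int, rise_nm: float) -> float:
    """End-to-end length of all DNA in one cell, in feet."""
    meters = genome_bp * copies * rise_nm * 1e-9
    return meters / METERS_PER_FOOT


if __name__ == "__main__":
    print(f"{differing_bases(GENOME_BP, IDENTICAL_FRACTION):,.0f} differing bases")
    print(f"{coding_bases(GENOME_BP, CODING_FRACTION):,.0f} coding bases at most")
    print(f"{dna_length_feet(GENOME_BP, COPIES_PER_CELL, RISE_PER_BP_NM):.1f} feet of DNA per cell")
```

Under these assumptions the 0.1% difference corresponds to roughly 3 million bases, and the estimated length comes out at a little under 7 feet, in line with the figure quoted above.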
CTCTAGCTATCTTGGTCTCCTACACAGCCTATGCACATGAGCCCATGCCTCTCCTCTCCTTGCGCCTGCATAGAGAGGTGGTATGATCACCTGGAAAGTTTTTAACTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTTACAAGCCTAGACCTTATGCATGGTCGGACGGACACATCTGATCATAGGACATATGAGTAGGCCACACTCCTCCTGCCCCTCTCTCGTAGAGATCAACACACACTGCTCTTAGTGCCAGGACCTAGAGAGGGGAGCGTGGAGAGGGCATCAGGGGGCCTTGGAGTCCCATCAGTAAAGCACATGTTTCCTTTCTGTGATTCCTCAAGCCCCATGGACTTACCGCTTTACCAACAACTGCAGCTAAGCCCGTCTTCCCCAAAGACGGACCAATCCAGCAGCTTCTACTGCTACCCATGCTCCCCTCCCTTCGCCGCCGCCGACGCCAGCTTTCCCCTCAGCTACCAGATCGGTAGTGCCGCGGCCGCCGACGCCACCCCTCCACAAGCCGTGATCAACTCGCCGGACCTGCCGGTGCAGGCGCTGATGGACCACGCGCCGGCGCCGGCTACAGAGCTGGGCGCCTGCGCCAGTGGTGCAGAAGGATCCGGCGCCAGCCTCGACAGGGCGGCTGCCGCGGCGAGGAAAGACCGGCACAGCAAGATATGCACCGCCGGCGGGATGAGGGACCGCCGGATGCGGCTCTCCCTTGACGTCGCGCGCAAATTCTTCGCGCTGCAGGACATGCTTGGCTTCGACAAGGCAAGCAAGACGGTACAGTGGCTCCTCAACACGTCCAAGTCCGCCATCCAGGAGATCATGGCCGACGACGCGTCTTCGGAGTGCGTGGAGGACGGCTCCAGCAGCCTCTCCGTCGACGGCAAGCACAACCCGGCAGAGCAGCTGGGAGGAGGAGGAGATCAGAAGCCCAAGGGTAATTGCCGCGGCGAGGGGAAGAAGCCGGCCAAGGCAAGTAAAGCGGCGGCCACCCCGAAGCCGCCAAGAAAATCGGCCAATAACGCACACCAGGTCCCCGACAAGGAGACGAGGGCGAAAGCGAGGGAGAGGGCGAGGGAGCGGACCAAGGAGAAGCACCGGATGCGCTGGGTAAAGCTTGCTTCAGCAATTGACGTGGAGGCGGCGGCTGCCTCGGGGCCGAGCGACAGGCCGAGCTCGAACAATTTGAGCCACCACTCATCGTTGTCCATGAACATGCCGTGTGCTGCCGCTGAATTGGAGGAGAGGGAGAGGTGTTCATCAGCTCTCAGCAATAGATCAGCAGGTAGGATGCAAGAAATCACAGGGGCGAGCGACGTGGTCCTGGGCTTTGGCAACGGAGGAGGAGGATACGGCGACGGCGGCGGCAACTACTACTGCCAAGAGCAATGGGAACTCGGTGGAGTCGTCTTTCAGCAGAACTCACGCTTCTACTGAACACTACGGGCGCACTAGGTACTAGAACTACTCTTTCGACTTACATCTATCTCCTTTCCCTCAACGTGAGCTTCTCAATAATTTGCTGTCTTAATCTATGCGTGTGTTTCTCTTTCTAGACTTCGTAATTGGCTGTGTGACGATGAACT
A piece of DNA sequence
CTCTAGCTATCTTGGTCTCCTACACAGCCTATGCACATGAGCCCATGCCTCTCCTCTCCTTGCGCCTGCATAGAGAGGTGGTATGATCACCTGGAAAGTTTTTAACTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTTACAAGCCTAGACCTTATGCATGGTCGGACGGACACATCTGATCATAGGACATATGAGTAGGCCACACTCCTCCTGCCCCTCTCTCGTAGAGATCAACACACACTGCTCTTAGTGCCAGGACCTAGAGAGGGGAGCGTGGAGAGGGCATCAGGGGGCCTTGGAGTCCCATCAGTAAAGCACATGTTTCCTTTCTGTGATTCCTCAAGCCCCATGGACTTACCGCTTTACCAACAACTGCAGCTAAGCCCGTCTTCCCCAAAGACGGACCAATCCAGCAGCTTCTACTGCTACCCATGCTCCCCTCCCTTCGCCGCCGCCGACGCCAGCTTTCCCCTCAGCTACCAGATCGGTAGTGCCGCGGCCGCCGACGCCACCCCTCCACAAGCCGTGATCAACTCGCCGGACCTGCCGGTGCAGGCGCTGATGGACCACGCGCCGGCGCCGGCTACAGAGCTGGGCGCCTGCGCCAGTGGTGCAGAAGGATCCGGCGCCAGCCTCGACAGGGCGGCTGCCGCGGCGAGGAAAGACCGGCACAGCAAGATATGCACCGCCGGCGGGATGAGGGACCGCCGGATGCGGCTCTCCCTTGACGTCGCGCGCAAATTCTTCGCGCTGCAGGACATGCTTGGCTTCGACAAGGCAAGCAAGACGGTACAGTGGCTCCTCAACACGTCCAAGTCCGCCATCCAGGAGATCATGGCCGACGACGCGTCTTCGGAGTGCGTGGAGGACGGCTCCAGCAGCCTCTCCGTCGACGGCAAGCACAACCCGGCAGAGCAGCTGGGAGGAGGAGGAGATCAGAAGCCCAAGGGTAATTGCCGCGGCGAGGGGAAGAAGCCGGCCAAGGCAAGTAAAGCGGCGGCCACCCCGAAGCCGCCAAGAAAATCGGCCAATAACGCACACCAGGTCCCCGACAAGGAGACGAGGGCGAAAGCGAGGGAGAGGGCGAGGGAGCGGACCAAGGAGAAGCACCGGATGCGCTGGGTAAAGCTTGCTTCAGCAATTGACGTGGAGGCGGCGGCTGCCTCGGGGCCGAGCGACAGGCCGAGCTCGAACAATTTGAGCCACCACTCATCGTTGTCCATGAACATGCCGTGTGCTGCCGCTGAATTGGAGGAGAGGGAGAGGTGTTCATCAGCTCTCAGCAATAGATCAGCAGGTAGGATGCAAGAAATCACAGGGGCGAGCGACGTGGTCCTGGGCTTTGGCAACGGAGGAGGAGGATACGGCGACGGCGGCGGCAACTACTACTGCCAAGAGCAATGGGAACTCGGTGGAGTCGTCTTTCAGCAGAACTCACGCTTCTACTGAACACTACGGGCGCACTAGGTACTAGAACTACTCTTTCGACTTACATCTATCTCCTTTCCCTCAACGTGAGCTTCTCAATAATTTGCTGTCTTAATCTATGCGTGTGTTTCTCTTTCTAGACTTCGTAATTGGCTGTGTGACGATGAACT
A piece of DNA sequence carrying a gene unit
• Sequence properties
  – Length, base composition, GC content, etc.
• Sequence assembly
  – Put sequences together based on similarity
• Gene prediction
  – Find gene units in a given DNA sequence
• Repeat finding
  – Find repeated units in a given DNA sequence
• Sequence similarity search
  – Find other similar sequences based on DNA or protein sequences
• Protein function analysis
  – Predict protein function based on known functional units found in protein sequence (domains)
Data analysis (tools)
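Two of the analyses listed above, sequence properties and gene prediction, can be sketched in a few lines. The open reading frame (ORF) scan below is a deliberately naive stand-in for gene prediction: it looks only on the forward strand for an ATG followed in frame by a stop codon, whereas real gene finders also model introns, both strands and statistical signals. All function names are illustrative.

```python
from collections import Counter
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def base_composition(seq: str) -> Counter:
    """Count how often each base occurs."""
    return Counter(seq.upper())


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) slices of ATG...stop stretches in the 3 forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end includes the stop codon
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGGCTTGTTTACGAATTTAGCC"  # Gene X example with two flanking bases each side
    print(len(dna), dict(base_composition(dna)), round(gc_content(dna), 3))
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```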
• Databases
  – Research articles
    • What is the latest research with regard to genes involved in horse coat color?
  – Taxonomy
    • How many plant or animal genomes have been sequenced?
  – Nucleotide
    • What is the nucleotide sequence of the maize domestication gene teosinte branched 1 (tb1)?
  – Protein
    • What is the protein sequence of the maize domestication gene tb1?
  – Genome
    • Where are the human diabetes genes located in the human genome? Which chromosome?
Data storage (databases)
• Agriculture
  – Improve insect resistance
  – Improve nutritional quality
  – Improve drought resistance and/or environmental adaptability
• Animals
  – Improve production and nutrition of farm animals
• Molecular medicine
  – Preventative medicine
  – Gene therapy
• Microbial genome
  – Waste cleanup
  – Climate change
  – Alternative energy sources
  – Biotechnology
  – Antibiotic resistance
  – Forensic analysis of microbes
  – Metagenomics
Long term goals