
Information Theory, Statistical Measures and Bioinformatics approaches to gene expression

Date post: 15-Jan-2016
Upload: alissa
Page 1: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory, Statistical Measures and Bioinformatics approaches to gene expression

Page 2: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Friday’s Class

• Sei Hyung Lee will make a presentation on his dissertation proposal at 1pm in this room

• Vector Support Clustering

• So I will discuss Clustering and other topics

Page 3: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory

• Given a probability distribution, Pi, for the letters in an independent and identically distributed (i.i.d.) message, the probability of seeing a particular sequence of letters i, j, k, ..., n is simply

$$P_i P_j P_k \cdots P_n \quad \text{or} \quad e^{\log P_i + \log P_j + \log P_k + \cdots + \log P_n}$$
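The equivalence of the product form and the exponentiated sum of logs can be checked numerically. The following is a minimal sketch; the letter probabilities and the sequence are illustrative assumptions, not values from the slides.

```python
import math

# Hypothetical letter probabilities for an i.i.d. DNA source (illustrative only).
P = {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05}

def sequence_probability(seq, probs):
    """Probability of an i.i.d. sequence as a product of letter probabilities."""
    result = 1.0
    for letter in seq:
        result *= probs[letter]
    return result

def sequence_log_probability(seq, probs):
    """Natural log of the sequence probability as a sum of per-letter logs."""
    return sum(math.log(probs[letter]) for letter in seq)

seq = "ATTGA"
direct = sequence_probability(seq, P)
via_logs = math.exp(sequence_log_probability(seq, P))
print(direct, via_logs)
```

Summing logs is usually preferred for long sequences, because the raw product underflows quickly in floating point.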

Page 4: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory 2

• The information or surprise of an answer to a question (a message) is inversely related to its probability – the smaller the probability, the more surprise or information

• Ask a child “Do you like ice cream?”

• If the answer is yes, you’re not surprised and the information conveyed is little

• If the answer is no, you are surprised – more information has been given with this lower probability answer

Page 5: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory 3

Information H associated with probability p is

H(p) = log2 (1/p)

1/p measures the surprise (the rarer the answer, the larger 1/p)

and

log2 (1/p) = # of bits required

Page 6: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory 4

Log-probabilities and their sums represent measures of information.

Conversely, information can be thought of as log-probabilities (with a negative sign, so that information increases as probability decreases)

H(p) = log2 (1/p) = - log2 p
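As a small sketch of H(p) = -log2 p, the function below returns the information, in bits, of an outcome with probability p. The function name is chosen here for illustration.

```python
import math

def surprisal_bits(p):
    """Information (in bits) conveyed by an outcome of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# A certain outcome carries no information; rarer outcomes carry more.
for p in (1.0, 0.5, 0.25, 0.05):
    print(p, surprisal_bits(p))
```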

Page 7: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory 5

If we have an i.i.d. source with 6 equally likely values (a die), or 4 (A, C, T, G), or n values (the distribution is flat), then the probability of any particular symbol is

1/n and the information in any such symbol is then

log2 n and this value is also the average

If the symbols are not equally probable (the distribution is not flat), we need to weight the information of each symbol by its probability of occurring. This is Claude Shannon’s entropy

H = Σ pi log2 (1/pi) = - Σ pi log2 pi

Page 8: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory 6

If we have a coin, assuming h and t have equal probabilities.

H = - ( (1/2) log2 (1/2) + (1/2) log2 (1/2) ) = - ( (1/2) (-1) + (1/2) (-1) )

= - ( -1) = 1 bit

If the coin comes up heads ¾ of the time then the entropy should decrease (we’re more certain of the outcome and there’s less surprise)

H = - ( (3/4) log2 (3/4) + (1/4) log2 (1/4) ) = - ( (0.75) (-0.415) + (0.25) (-2) )

= - ( -0.81) = 0.81 bits

Page 9: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Information Theory 7

A random DNA source has an entropy of

H = - ( (1/4) log2 (1/4) + (1/4) log2 (1/4) + (1/4) log2 (1/4) + (1/4) log2 (1/4) ) = - ( (1/4) (-2) + (1/4) (-2) + (1/4) (-2) + (1/4) (-2))

= - ( -2) = 2 bits

A DNA source that emits 45% A, 45% T, 5% G and 5% C has an entropy of

H = - ( 2*(0.45) log2 (0.45) + 2*(0.05) log2 (0.05)) = - ( (0.90) (-1.152) + (0.10) (-4.322) )

= - ( - 1.037 – 0.432) = 1.469 bits
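The coin and DNA examples above can be reproduced with a few lines of code. This is a minimal sketch; the function name is an assumption made for illustration. Rounding the intermediate logarithms by hand can shift the last printed digit slightly.

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2 p), in bits per symbol."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (p * log p tends to 0).
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy_bits([0.5, 0.5])
biased_coin = shannon_entropy_bits([0.75, 0.25])
random_dna = shannon_entropy_bits([0.25] * 4)
at_rich_dna = shannon_entropy_bits([0.45, 0.45, 0.05, 0.05])

print(fair_coin, biased_coin, random_dna, at_rich_dna)
```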

Page 10: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Natural Logs

• Using natural logarithms, the information is expressed in units of nats (a contraction of Natural digits). Bits and nats are easily convertible as follows

nats = bits · ln 2

ln 2 (or log 2) ≈ 0.693

1/ln 2 ≈ 1.44

• Generalizing, for a given base of the logarithm b

$$\ln x = \ln b \cdot \log_b x$$

• Using logarithms to arbitrary bases, information can be expressed in arbitrary units, not just bits and nats, such that

$$-\log_b P = -k \ln P, \quad \text{where } 1/k = \ln b$$

Since k is only a constant scale factor, it is often ignored
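The unit conversions above amount to multiplying or dividing by ln 2 (or, in general, by ln b). A minimal sketch, with function names chosen here for illustration:

```python
import math

def bits_to_nats(bits):
    """Convert information from bits to nats: nats = bits * ln 2."""
    return bits * math.log(2)

def nats_to_bits(nats):
    """Convert information from nats to bits: bits = nats / ln 2."""
    return nats / math.log(2)

def log_base(x, b):
    """Logarithm of x to an arbitrary base b, via log_b x = ln x / ln b."""
    return math.log(x) / math.log(b)

print(bits_to_nats(1.0))   # one bit expressed in nats
print(nats_to_bits(1.0))   # one nat expressed in bits
print(log_base(8.0, 2.0))  # same value as math.log2(8.0)
```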

Page 11: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Shannon’s Entropy

• Shannon's entropy, H = Σ pi log2 (1/pi) = - Σ pi log2 pi , is the expected (weighted arithmetic mean) value of log (1/Pi) computed over all letters in the alphabet, using weights that are simply the probabilities of the letters themselves, the Pis

• The information, H, is expressed in units per letter

• Shannon's entropy allows us to compute the expected description length of a message, given an a priori assumption (or knowledge) that the letters in the message will appear with frequencies Pi

• If a message is 200 letters in length, 200H is the expected description length for that message

• Furthermore, the theory tells us there is no shorter encoding for the message than 200H, if the symbols do indeed appear at the prescribed frequencies

• Shannon's entropy thus informs us of the minimum description length, or MDL, for a message
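As a worked instance, assume the DNA source used elsewhere in these slides (45% A, 45% T, 5% G, 5% C), for which H is about 1.47 bits per letter. A 200-letter message then has an expected description length of

$$200 H \approx 200 \times 1.47 \approx 294 \text{ bits}$$

compared with $200 \times 2 = 400$ bits for a uniformly random DNA source.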

Page 12: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Shannon Entropy

A DNA source that emits 45% A, 45% T, 5% G and 5% C has an entropy of

H = - ( 2*(0.45) log2 (0.45) + 2*(0.05) log2 (0.05))

= - ( (0.90) (-1.152) + (0.10) (-4.322) )

= - ( - 1.037 – 0.432)

= 1.469 bits

Page 13: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Huffman Entropy (or Encoding)

If we build a binary Huffman code (tree) for the same DNA source:

• 1 bit would be required to code for A
• 2 bits to code for T (or vice versa)
• 3 bits each to code for G and C

The "Huffman entropy" in this case is 1*0.45 + 2*0.45 + 3*0.05 + 3*0.05 = 1.65 bits per letter -- which is not quite as efficient as the previous Shannon code (about 1.47 bits per letter)
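The code lengths quoted above can be derived by building the Huffman tree explicitly. The following is a minimal sketch using the standard-library heap; it tracks only code lengths, not the bit patterns themselves. Which of the two 45% letters receives the 1-bit code depends on tie-breaking, which is the "or vice versa" in the slide.

```python
import heapq
from itertools import count

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a binary Huffman code."""
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside
        # them moves one level deeper, so its code grows by one bit.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

dna = {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05}
lengths = huffman_code_lengths(dna)
expected_length = sum(dna[s] * lengths[s] for s in dna)
print(lengths, expected_length)
```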

Page 14: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Length of Message

• The description length of a message when using a Huffman code is expected to be about equal to the Shannon or arithmetic code length

• Thus, a Huffman-encoded 200 letter sequence will average about 200H bits in length

• For some letters in the alphabet, the Huffman code may be more or less efficient in its use of bits than the Shannon code (and vice versa), but the expected behavior averaged over all letters in the alphabet is the same for both (within one bit)

Page 15: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Relative Entropy

• Relative entropy, H, is a measure of the expected coding inefficiency per letter, when describing a message with an assumed letter distribution P, when the true distribution is Q.

• If a binary Huffman code is used to describe the message, the relative entropy concerns the inefficiency of using a Huffman code built for the wrong distribution

• Whereas Shannon's entropy is the expected log-likelihood for a single distribution, relative entropy is the expected logarithm of the likelihood ratio (log-likelihood ratio or LLR) between two distributions.

H(Q || P) = Σ Qi log2 (Qi /Pi)

Page 16: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

The Odds Ratio

• The odds ratio Qi / Pi is the relative odds that letter i was chosen from the distribution Q versus P

• In computing the relative entropy, the log-odds ratios are weighted by Qi

• Thus the weighted average is relative to the letter frequencies expected for a message generated with Q

H(Q || P) = Σ Qi log2 (Qi /Pi)

Page 17: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Interpreting the Log-Odds Ratio

• The numerical sign of individual LLRs log(Qi / Pi)

LLR > 0 => i was more likely chosen using Q than P
LLR = 0 => i was as likely selected using Q as it was using P
LLR < 0 => i was more likely chosen using P than Q

• Relative entropy is sometimes called the Kullback-Leibler distance between the distributions P and Q

H(P || Q) ≥ 0
H(P || Q) = 0 iff P = Q
H(P || Q) is generally ≠ H(Q || P)
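The three properties above can be observed numerically. This is a minimal sketch; the two distributions are illustrative assumptions (the A/T-rich source from earlier slides against a uniform background), and the function assumes the second distribution has no zero entries where the first is positive.

```python
import math

def relative_entropy_bits(q, p):
    """H(Q || P) = sum(Qi * log2(Qi / Pi)), in bits per letter.

    Assumes p[i] > 0 wherever q[i] > 0; otherwise the value is infinite.
    """
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

Q = [0.45, 0.45, 0.05, 0.05]  # illustrative target distribution
P = [0.25, 0.25, 0.25, 0.25]  # illustrative background distribution

h_qp = relative_entropy_bits(Q, P)
h_pq = relative_entropy_bits(P, Q)
h_pp = relative_entropy_bits(P, P)
print(h_qp, h_pq, h_pp)
```

Because the background here is uniform over four letters, H(Q || P) equals 2 bits minus the Shannon entropy of Q.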

Page 18: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

The Role of Relative Entropy

• Given a scoring matrix (or scoring vector, in this one-dimensional case) and a model sequence generated i.i.d. according to some background distribution, P, we can write a program that finds the maximal scoring segment (MSS) within the sequence (local alignment scoring)

• Here the complete, full-length sequence represents one message, while the MSS reported by our program constitutes another

• For example, we could create a hydrophobicity scoring matrix that assigns increasingly positive scores to more hydrophobic amino acids, zero score to amphipathic residues, and increasingly negative scores to more hydrophilic residues. (Karlin and Altschul (1990) provide other examples of scoring systems, as well).
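A program of the kind described above can be sketched with a single linear scan (the classic maximum-subarray approach). The scoring vector below is a toy assumption made up for illustration: it is not a published hydrophobicity scale, and the sequence is arbitrary.

```python
def maximal_scoring_segment(seq, scores):
    """Return (best_score, start, end); seq[start:end] is the MSS."""
    best_score, best_start, best_end = 0, 0, 0
    running, run_start = 0, 0
    for i, letter in enumerate(seq):
        if running <= 0:
            # A non-positive prefix can never help; restart the segment here.
            running, run_start = 0, i
        running += scores[letter]
        if running > best_score:
            best_score, best_start, best_end = running, run_start, i + 1
    return best_score, best_start, best_end

# Toy scores: positive = hydrophobic, zero = neutral, negative = hydrophilic.
toy_scores = {"I": 2, "L": 2, "V": 2, "G": 0, "K": -2, "D": -2, "E": -2}
seq = "KDGILVKIDE"

score, start, end = maximal_scoring_segment(seq, toy_scores)
print(score, seq[start:end])
```

Note that the expected score per letter must be negative for the segment to stay local; otherwise the MSS tends to span the whole sequence.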

Page 19: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

The Role of Relative Entropy

• By searching for the MSS - the "most hydrophobic" segment in this case - we select for a portion of the sequence whose letters exhibit a different, hydrophobic distribution than the random background

• By convention, the distribution of letters in the MSS is called the target distribution and is signified here by Q. (Note: some texts may signify the background distribution as Q and the target distribution as P).

Page 20: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

The Role of Relative Entropy 2

• If the letters in the MSS are described using an optimal Shannon code for the background distribution, we expect a longer description than if a code optimized for the target distribution was used instead

• These extra bits -- the extra description length -- tell us the relative odds that the MSS arose by chance from the background distribution rather than the target

• Why would we consider a target distribution, Q, when our original sequence was generated from the background distribution, P?

• Target frequencies represent the underlying evolutionary model

• The biology tells us that

– The target frequencies are indeed special
– The background and target frequencies are formally related to one another by the scoring system (the scoring matrices don’t actually contain the target frequencies but they are implicit in the score)
– The odds of chance occurrence of the MSS are related to its score

Page 21: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

BLAST Scoring

Database similarity searches typically yield high-scoring pairwise sequence alignments such as the following

Score = 67 (28.6 bits), Expect = 69., P = 1.0
Identities = 13/33 (39%), Positives = 23/33 (69%)

Query:  164 ESLKISQAVHGAFMELSEDGIEMAGSTGVIEDI 196
            E+L + Q+  G+  ELSED ++  G +GV+E++
Sbjct:  381 ETLTLRQSSFGSKCELSEDFLKKVGKSGVVENL 413
            +++-+-++--++--+++++-++- + +++++++
Scores:     514121512161315445632110601643512

• The total score for an alignment (67 in the above case) is simply the sum of pairwise scores

• The individual pairwise scores are listed beneath the alignment above (+5 on the left through +2 on the right)
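The claim that the total score is the sum of the pairwise scores can be checked against the sign and magnitude rows listed under the alignment. In this minimal sketch the two strings are transcribed from the slide; a blank sign corresponds to a score of 0.

```python
# Sign row and magnitude row transcribed from the alignment on the slide.
signs = "+++-+-++--++--+++++-++- + +++++++"
digits = "514121512161315445632110601643512"

# Combine each sign with its magnitude; a blank sign pairs with a 0 score.
pairwise_scores = [
    (-1 if sign == "-" else 1) * int(digit)
    for sign, digit in zip(signs, digits)
]

total_score = sum(pairwise_scores)
print(len(pairwise_scores), total_score)
```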

Page 22: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Scoring Statistics

• Note that some pairs of letters may yield the same score

• For example, in the BLOSUM62 matrix used to find the previous alignment, S(S,T) = S(A,S) = S(E,K) = S(T,S) = S(D,N) = +1

• While alignments are usually reported without the individual pairwise scores shown, the statistics of alignment scores depend implicitly on the probabilities of occurrence of the scores, not the letters.

• Dependency on the letter frequencies is factored out by the scoring matrix.

• We see then that the "message" obtained from a similarity search consists of a sequence of pairwise scores indicated by the high-scoring alignment

Page 23: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 1

If we search two sequences X and Y using a scoring matrix Sij to identify the maximal-scoring segment pair (MSP), and if the following conditions hold

1. the two sequences are i.i.d. and have respective background distributions PX and PY (which may be the same),

2. the two sequences are effectively "long" or infinite and not too dissimilar in length,

3. the expected pairwise score Σ PX(i) PY(j) Sij is negative,

4. a positive score is possible, i.e., PX(i)PY(j)Sij > 0 for some letters i and j, then

Page 24: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 2

• The scores in the scoring matrix are implicitly log-odds scores of the form

$$S_{ij} = \frac{1}{\lambda} \log \frac{Q_{ij}}{P_X(i) P_Y(j)}$$

where Qij is the limiting target distribution of the letter pairs (i,j) in the MSP and λ is the unique positive-valued solution to the equation

$$\sum_{i,j} P_X(i) P_Y(j) e^{\lambda S_{ij}} = 1$$

• The expected frequency of chance occurrence of an MSP with score S or greater is

$$E = K m n e^{-\lambda S}$$

where m and n are the lengths of the two sequences, mn is the size of the search space, and K is a measure of the relative independence of the points in this space in the context of accumulating the MSP score

• While the complete sequences X and Y must be i.i.d., the letter pairs comprising the MSP do exhibit an interdependency
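The scale parameter λ has no closed form in general, but it can be found numerically as the positive root of Σ PX(i) PY(j) e^(λ Sij) = 1. The following is a minimal sketch using bisection; the uniform DNA background and the +1/-1 match/mismatch scores are assumptions chosen because they give the exact answer λ = ln 3, which makes the result easy to check. The implied target frequencies Qij = PX(i) PY(j) e^(λ Sij) are computed as well.

```python
import math

def solve_lambda(px, py, score, iterations=200):
    """Positive root of sum_ij px[i] * py[j] * exp(lam * score(i, j)) = 1.

    Assumes a negative expected score and at least one positive score,
    as required by the Karlin-Altschul conditions.
    """
    def f(lam):
        return sum(px[i] * py[j] * math.exp(lam * score(i, j))
                   for i in px for j in py) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0.0:       # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:      # f is negative between 0 and the root
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def match_mismatch(i, j):
    """Toy scoring system: +1 for a match, -1 for a mismatch."""
    return 1 if i == j else -1

background = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(background, background, match_mismatch)

# Target frequencies implied by the scores and the background.
target = {(i, j): background[i] * background[j] * math.exp(lam * match_mismatch(i, j))
          for i in background for j in background}
print(lam, sum(target.values()))
```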

Page 25: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 3

• Since gaps are disallowed in the MSP, a pairwise sequence comparison for the MSP is analogous to a linear search for the MSS along the diagonals of a 2-d search space. The sum total length of all the diagonals searched is just mn

• Another way to express the pairwise scores is:

$$S_{ij} = \log_b \frac{Q_{ij}}{P_X(i) P_Y(j)}$$

where the log is to some base b and λ = ln b; λ is often called the scale of the scoring matrix

Page 26: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 4

• The MSP score is the sum of the scores Sij for the aligned pairs of letters in the MSP (the sum of log-odds ratios)

• The MSP score is then the logb of the odds that an MSP with its score occurs by chance at any given starting location within the random background -- not considering yet how large an area, or how many starting locations, were actually examined in finding the MSP

• Considering the size of the examined area alone, the expected description length of the MSP (measured in information) is log(K m n)

• Since the relative entropy, H, has units of information per length (or letter pair), the expected length of the MSP (measured in letter pairs) is

$$E(L) = \frac{\log(K m n)}{H}$$

where H is the relative entropy of the target and background frequencies

$$H = \sum_{i,j} Q_{ij} \log \frac{Q_{ij}}{P_X(i) P_Y(j)}$$

where PX(i) PY(j) is the product frequency expected for letter i paired with j in the background search space; and Qij is the frequency at which we expect to observe i paired with j in the MSP

Page 27: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 5

• By definition, the MSP has the highest observed score. Since this score is expected to occur just once, the (expected) MSP score has an (expected) frequency of occurrence, E, of 1

• The appearance of MSPs can be modeled as a Poisson process with characteristic parameter E, as it is possible for multiple, independent segments of the background sequences to achieve the same high score

• The Poisson "events" are the individual MSPs having the same score S or greater

• The probability of one or more MSPs having score S or greater is simply one minus the probability that no MSPs appear having score S or greater

$$P = 1 - e^{-E}$$

• Note that in the limit as E approaches 0, P ≈ E; and for values of E or P less than about 0.05, E and P are practically speaking equal
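The conversion from E to P, and the near-equality of the two for small values, can be seen in a short sketch. The function name is chosen here for illustration; `math.expm1` is used so that very small E values do not lose precision.

```python
import math

def pvalue_from_evalue(e_value):
    """Poisson probability of at least one MSP: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)

for e_value in (0.001, 0.05, 1.0, 69.0):
    print(e_value, pvalue_from_evalue(e_value))
```

With E = 69, as in the BLAST report shown earlier, P is indistinguishable from 1.0 at the reported precision.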

Page 28: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 6

• For convenience in comparing scores from different database searches, which may have been performed with different score matrices and/or different background distributions, we can compute the normalized or adjusted score

$$S' = \lambda S - \ln K$$

• Expressing the "Karlin-Altschul equation" as a function of the adjusted score, we obtain

$$E = m n e^{-S'}$$

• Note how the search-dependent values for λ and K have been factored out of the equation by using adjusted scores

• We can therefore assess the relative significance of two alignments found under different conditions, merely by comparing their adjusted scores. To assess their absolute statistical significance, we need only know the size of the search space, mn.
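The adjusted-score form of the equation gives the same E as the raw form, as the sketch below illustrates. All numeric values (λ, K, the raw score, and the sequence lengths) are hypothetical and chosen only for illustration; they are not the parameters of any particular matrix or search.

```python
import math

def adjusted_score(raw_score, lam, K):
    """Normalized score S' = lambda * S - ln K (in nats)."""
    return lam * raw_score - math.log(K)

def evalue_from_adjusted(s_adj, m, n):
    """E = m * n * exp(-S'), using only the search-space size."""
    return m * n * math.exp(-s_adj)

def evalue_from_raw(raw_score, lam, K, m, n):
    """E = K * m * n * exp(-lambda * S), the raw Karlin-Altschul form."""
    return K * m * n * math.exp(-lam * raw_score)

# Hypothetical search parameters, for illustration only.
lam, K = 0.3, 0.1
m, n = 250, 1_000_000
raw = 80

s_adj = adjusted_score(raw, lam, K)
e_adj = evalue_from_adjusted(s_adj, m, n)
e_raw = evalue_from_raw(raw, lam, K, m, n)
print(s_adj, e_adj, e_raw)
```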

Page 29: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 7

• Biological sequences have finite length, as do the MSPs found between them

• MSPs will tend not to appear near the outer (lower and right) edges of the search space

• The chance of achieving the highest score of all is reduced in that region because the end of one or both sequences may be reached before a higher score is obtained than might be found elsewhere

• Effectively, the search space is reduced by the expected length, E(L), for the MSP

• We can therefore modify the Karlin-Altschul equation to use effective lengths for the sequences compared.

$$m' = m - E(L), \qquad n' = n - E(L)$$

$$E' = K m' n' e^{-\lambda S}$$

• Since m' is less than m and n' is less than n, the edge-corrected E' is less than E, indicating greater statistical significance.
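The edge correction can be sketched as follows. The parameter values are hypothetical, H is assumed to be expressed in nats per letter pair (matching the natural log used for E(L)), and the clamp that keeps effective lengths at 1 or more is an assumption added here, not part of the slides.

```python
import math

def expected_msp_length(K, m, n, H):
    """E(L) = ln(K * m * n) / H, with H in nats per letter pair."""
    return math.log(K * m * n) / H

def edge_corrected_evalue(raw_score, lam, K, m, n, H):
    """E' = K * m' * n' * exp(-lambda * S) with effective lengths m', n'."""
    length = expected_msp_length(K, m, n, H)
    m_eff = max(m - length, 1.0)  # clamp is an illustrative safeguard
    n_eff = max(n - length, 1.0)
    return K * m_eff * n_eff * math.exp(-lam * raw_score)

# Hypothetical parameters, for illustration only.
lam, K, H = 0.3, 0.1, 0.4
m, n, raw = 250, 1_000_000, 80

uncorrected = K * m * n * math.exp(-lam * raw)
corrected = edge_corrected_evalue(raw, lam, K, m, n, H)
print(expected_msp_length(K, m, n, H), uncorrected, corrected)
```

The correction matters most when one sequence is short relative to the expected MSP length, as with the 250-letter query assumed here.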

Page 30: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 8

• A raw MSP or alignment score is meaningless (uninformative) unless we know the associated value of the scaling parameter λ, to a lesser extent K, and the size of the search space

• Even if the name of the scoring matrix someone used has been told to us (e.g., BLOSUM62), it might be that the matrix they used was scaled differently from the version of the matrix we normally use

• For instance, a given database search could be performed twice, with the second search using the same scoring matrix as the first's but with all scores multiplied by 100

• While the first search's scores will be 100-fold lower, the alignments produced by the two searches will be the same and their significance should rightfully be expected to be the same

• This is a consequence of the scaling parameter λ being 100-fold lower for the second search, so that the product λS is identical for both searches

• If we did not know two different scoring matrices had been used and hadn't learned the lessons of Karlin-Altschul statistics, we might have been tempted to say that the higher scores from the second search are more significant (at the same time, we might have wondered: if the scores are so different, why are the alignments identical?!)
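The invariance follows directly from the equation that defines the scale parameter. If every score is multiplied by 100, so that $S'_{ij} = 100\,S_{ij}$, the new scale $\lambda'$ must satisfy

$$\sum_{i,j} P_X(i) P_Y(j) e^{\lambda' \cdot 100\, S_{ij}} = 1$$

Since the positive solution is unique, $100\,\lambda' = \lambda$, that is $\lambda' = \lambda / 100$. For any alignment the raw score becomes $100\,S$, so

$$\lambda' \cdot (100\, S) = \lambda S$$

and the term $e^{-\lambda S}$ in the Karlin-Altschul equation is unchanged.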

Page 31: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 9

• Karlin-Altschul statistics suggest we should be careful about drawing conclusions from raw alignment scores

• When someone says they used the "BLOSUM62" matrix to obtain a high score, it is possible their matrix values were scaled differently than the BLOSUM62 matrix we have come to know

• Generally in our work with different programs, and in communications with other people, we may find "BLOSUM62" refers to the exact same matrix, but this doesn't happen automatically; it only happens through efforts to standardize and avoid confusion

• Miscommunication still occasionally happens, and people do sometimes experiment with matrix values and scaling factors, so consider yourself forewarned!

• Hopefully you see how the potential for this pitfall further motivates the use of adjusted or normalized scores, as well as P-values and E-values, instead of raw scores

Page 32: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 10

• Statistical interpretation of results often involves weighing a null model against an alternative model

• In the case of sequence comparisons and the use of Karlin-Altschul statistics, the two models being weighed are the frequencies with which letters are expected to be paired in unselected, random alignments from the background versus the target frequencies of their pairing in the MSP

• Karlin and Altschul tell us when we search for the MSP we are implicitly selecting for alignments with a specific target distribution, Q, defined for seeing letter i paired with letter j, where

$$Q_{ij} = P_X(i) P_Y(j) e^{\lambda S_{ij}}$$

Page 33: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 11

• We can interpret a probability (or P-value) reported by BLASTP as being a function of the odds that the MSP was sampled from the background distribution versus being sampled from the target distribution

• In each case, one considers the MSP to have been created by a random, i.i.d. process

• The only question is which of the two distributions was most likely used to generate the MSP score: the background frequencies or the target frequencies?

Page 34: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Karlin-Altschul Statistics 12

• If the odds of being sampled from the background are low (i.e., much less than 1), then they are approximately equal to the P-value of the MSP having been created from the background distribution. Low P-values do not necessarily mean the score is biologically significant, only that the MSP was more likely to have been generated from the target distribution, which presumably was chosen on the basis of some interesting biological phenomena (such as multiple alignments of families of protein sequences)

• Further interpretations, such as the biological significance of an MSP score, are not formally covered by the theory (and are often made with a rosy view of the situation)

• Even if (biological) sequences are not random according to the conditions required for proper application of Karlin-Altschul statistics, we can still search for high-scoring alignments

• It may just happen that the results still provide some biological insight, but their statistical significance cannot be believed. Even so, if the scoring system and search algorithm are not carefully chosen, the results may be uninformative or irrelevant -- and the software provides no warning.

Page 35: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 36: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 37: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 38: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 39: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Gene expression is regulated in several basic ways

• by region (e.g. brain versus kidney)

• in development (e.g. fetal versus adult tissue)

• in dynamic response to environmental signals (e.g. immediate-early response genes)

• in disease states

• by gene activity

Page 157

Page 40: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Gene expression changes measured, by organism

Organisms (rows): virus, bacteria, fungi, invertebrates, rodents, human

Conditions (columns): in mutant or wild type cells; development; cell types; disease; in virus, bacteria, and/or host; in response to stimuli

Page 158

Page 41: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

[Diagram: DNA → RNA → protein → phenotype, with cDNA derived from RNA]

Page 159

Page 42: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

[Diagram: the DNA → RNA (cDNA) → protein pathway shown for two samples, compared using UniGene, SAGE, and microarrays]

Page 159

Page 43: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

[Diagram: DNA → RNA → protein → phenotype, with cDNA derived from RNA]

[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance

Page 160

Page 44: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 161

[Figure: a gene with exon 1, intron, exon 2, intron, exon 3, drawn 5’ to 3’. Steps shown: transcription; RNA splicing (remove introns); polyadenylation (AAAAA tail at the 3’ end); export to cytoplasm]

Page 45: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 162

Relationship of mRNA to genomic DNA for RBP4

Page 46: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Analysis of gene expression in cDNA libraries

A fundamental approach to studying gene expression is through cDNA libraries.

• Isolate RNA (always from a specific organism, region, and time point)

• Convert RNA to complementary DNA

• Subclone into a vector

• Sequence the cDNA inserts. These are expressed sequence tags (ESTs)

Page 162-163

vector

insert

Page 47: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

UniGene: unique genes via ESTs

• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

• UniGene clusters contain many ESTs

• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

Page 164

Page 48: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Cluster sizes in UniGene

This is a gene with 1 EST associated; the cluster size is 1

Page 164

Page 49: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Cluster sizes in UniGene

This is a gene with 10 ESTs associated; the cluster size is 10

Page 164

Page 50: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Cluster sizes in UniGene

Cluster size     Number of clusters
1                34,000
2                14,000
3-4              15,000
5-8              10,000
9-16             6,000
17-32            4,000
500-1000         500
2000-4000        50
8000-16,000      3
>16,000          1

Page 164

Page 51: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Ten largest UniGene clusters (10/02)

Cluster size     Gene
25,232           eukary. translation EF (Hs.181165)
14,277           GAPDH (Hs.169476)
14,231           ubiquitin (Ta.9227)
12,749           actin, gamma 1 (Hs.14376)
10,649           euk transl EF (Mm.196614)
10,596           ribosomal prot. S2 (Hs.356360)
10,290           hemoglobin, beta (Mm.30266)
9,987            mRNA, placental villi (Hs.356428)
9,667            actin, beta (Hs.288061)
9,058            40S ribosomal prot. S18 (Dr.2984)

Page 165

Page 52: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Digital Differential Display (DDD) in UniGene

• UniGene clusters contain many ESTs

• UniGene data come from many cDNA libraries

• Libraries can be compared electronically

Page 165

Page 53: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 166

Page 54: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 166

Page 55: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 166

Page 56: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

UniGene brain libraries

Page 57: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

UniGene lung libraries

Page 58: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 167

Page 59: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 167

Page 60: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

n-sec1 up-regulated in brain

CamKII up-regulated in brain

surfactant up-regulated in lung

Page 167

Page 61: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Fisher’s Exact Test: deriving a p value

           Gene 1           All other genes            Total
Pool A     g1A              NA - g1A                   NA
Pool B     g1B              NB - g1B                   NB
Total      c = g1A + g1B    C = (NA-g1A) + (NB-g1B)
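A p value for the 2x2 table above can be derived from the hypergeometric distribution. The following is a minimal sketch of a two-sided Fisher's exact test using only the standard library; the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention and is an assumption, as are the example counts.

```python
from math import comb, isclose

def fisher_exact_two_sided(g1a, na, g1b, nb):
    """Two-sided Fisher's exact p value for gene 1 counts in two pools.

    g1a, g1b: ESTs for gene 1 in pool A and pool B.
    na, nb:   total ESTs sequenced in pool A and pool B.
    """
    c = g1a + g1b   # total gene 1 ESTs across both pools
    total = na + nb

    def prob(x):
        # Hypergeometric probability that x of the c gene 1 ESTs
        # fall in pool A, with all margins held fixed.
        return comb(na, x) * comb(nb, c - x) / comb(total, c)

    p_observed = prob(g1a)
    lowest, highest = max(0, c - nb), min(c, na)
    # Sum every table at least as extreme (no more probable) than observed;
    # the small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Illustrative counts only: 8 of 10 ESTs in pool A versus 1 of 6 in pool B.
p_value = fisher_exact_two_sided(8, 10, 1, 6)
print(p_value)
```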

Page 167

Page 62: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Pitfalls in interpreting cDNA library data

• bias in library construction
• variable depth of sequencing
• library normalization
• error rate in sequencing
• contamination (chimeric sequences)

Pages 166-168

Page 63: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 168-169

Page 64: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Serial analysis of gene expression (SAGE)

• 9 to 11 base “tags” correspond to genes

• measure of gene expression in different biological samples

• SAGE tags can be compared electronically

Page 169

Page 65: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

SAGE tags are mapped to UniGene clusters

[Diagram: Tag 1, Tag 2, ..., Tag n mapped to Cluster 1, Cluster 2, Cluster 3]

Page 169

Page 66: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 171

Page 67: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 171

Page 68: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 172

Page 69: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 173

Page 70: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 174

Page 71: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 175

Page 72: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 175

Page 73: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 74: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Microarrays: tools for gene expression

A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array.

RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.

Page 173

Page 75: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Questions addressed using microarrays

• Wildtype versus mutant

• Cultured cells +/- drug

• Physiological states (hibernation, cell polarity formation)

• Normal versus diseased tissue (cancer, autism)

Page 173

Page 76: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

• metazoans: human, mouse, rat, worm, insect

• fungi: yeast

• plants: Arabidopsis

• other: bacteria, viruses

Organisms represented on microarrays

Page 77: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Fast: Data on 15,000 genes in 1-4 weeks

Comprehensive: Entire yeast genome on a chip

Flexible:
• As more genomes are sequenced, more arrays can be made.
• Custom arrays can be made to represent genes of interest

Easy: You can submit RNA samples to a core facility for analysis

Cheap?: Chip representing 15,000 genes for $350; robotic spotter/scanner cost $100,000

Advantages of microarray experiments

Page 175

Page 78: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Cost: Many researchers can’t afford to do appropriate controls, replicates

RNA significance: The final product of gene expression is protein (see pages 174-176 for references)

Quality control:
• Impossible to assess elements on array surface
• Artifacts with image analysis
• Artifacts with data analysis

Disadvantages of microarray experiments

Page 176

Page 79: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Sample acquisition (RNA: purify, label)

Data acquisition (Microarray: hybridize, wash, image)

Data analysis

Data confirmation

Biological insight

Page 176

Page 80: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 1: Experimental design

[1] Biological samples: technical and biological replicates

[2] RNA extraction, conversion, labeling, hybridization

[3] Arrangement of array elements on a surface

Page 177

Page 81: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Sample 1 Sample 2 Sample 3

Page 177

Page 82: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Samples 1,2 Samples 1,3 Samples 2,3

Sample 1, pool   Sample 2, pool   Samples 2,1: switch dyes
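
The "switch dyes" design can be illustrated with a small calculation. Assuming a simple additive model on the log2 scale (measured log ratio = true effect + dye bias on the first array, and minus the true effect + dye bias on the swapped array), half the difference of the two log ratios estimates the effect and half the sum estimates the dye bias. The function names and the intensities below are hypothetical.

```python
import math

def log_ratio(red, green):
    """log2 of red-channel over green-channel intensity for one spot."""
    return math.log2(red / green)

def dye_swap_estimate(forward, swapped):
    """Combine one spot measured on a dye-swap pair of arrays.

    forward: (red, green) with sample 1 labeled red and sample 2 green.
    swapped: (red, green) with sample 2 labeled red and sample 1 green.
    Returns (log2 effect of sample 1 vs sample 2, estimated dye bias),
    under the additive model assumed in this sketch.
    """
    m_fwd = log_ratio(*forward)
    m_swp = log_ratio(*swapped)
    effect = (m_fwd - m_swp) / 2
    dye_bias = (m_fwd + m_swp) / 2
    return effect, dye_bias

# Invented intensities: a true 2-fold difference plus a 2-fold red-dye bias.
effect, dye_bias = dye_swap_estimate(forward=(4000, 1000), swapped=(2000, 2000))
```

In this made-up case the forward array alone would suggest a 4-fold change; averaging against the swapped array separates the part that follows the sample from the part that follows the dye.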

Page 177

Page 83: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 2: RNA and probe preparation

Page 178

For Affymetrix chips, need total RNA (about 10 µg)

Confirm purity by running agarose gel

Measure A260/A280 to confirm purity, quantity
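
A short sketch of the arithmetic behind an absorbance reading follows. It relies on two widely used conventions that are not stated on the slide: an A260 of 1.0 is taken as roughly 40 µg/mL of RNA at a 1 cm path length, and an A260/A280 ratio near 2.0 is generally read as reasonably pure RNA. The function name, the readings, and the dilution are hypothetical.

```python
RNA_UG_PER_ML_PER_A260 = 40.0  # conventional conversion factor (assumed, 1 cm path)

def rna_summary(a260, a280, dilution_factor=1.0, volume_ul=None):
    """Estimate RNA concentration, purity ratio, and optional total yield."""
    conc_ug_per_ml = a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor
    ratio = a260 / a280
    total_ug = None
    if volume_ul is not None:
        total_ug = conc_ug_per_ml * volume_ul / 1000  # µL -> mL
    return {"conc_ug_per_ml": conc_ug_per_ml, "a260_a280": ratio, "total_ug": total_ug}

# Invented reading: sample diluted 1:50, 20 µL of stock in the tube.
summary = rna_summary(a260=0.5, a280=0.25, dilution_factor=50, volume_ul=20)
enough_for_chip = summary["total_ug"] >= 10  # "about 10 µg" from the slide
```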

Page 84: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Basic sciences Affymetrix core http://microarray.mbg.jhmi.edu/

Johns Hopkins Oncology Center Microarray Core http://www.hopkinsmedicine.org/microarray/

Johns Hopkins University NIDDK Gene Profiling Center http://www.hopkinsmedicine.org/nephrology/microarray/

Gene expression methodology seminar series http://astor.som.jhmi.edu/hex/gem.html

The Hopkins Expressionists http://astor.som.jhmi.edu/hex/

Page 85: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 3: hybridization to DNA arrays

Page 178-179

The array consists of cDNA or oligonucleotides

Oligonucleotides can be deposited by photolithography

The sample is converted to cRNA or cDNA

Page 86: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 87: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Microarrays: array surface

Page 179

Page 88: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Microarrays: robotic spotters

See Nature Genetics microarray supplement

Page 89: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 4: Image analysis

Page 180

RNA expression levels are quantitated

Fluorescence intensity is measured with a scanner, or radioactivity with a phosphorimager
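
The slides do not say how intensities are turned into expression values. One common approach for two-color arrays, sketched here as an assumption, is to subtract a local background estimate from each channel and report the log2 ratio of the two corrected channels. The function names, the floor value, and the numbers are hypothetical.

```python
import math

def corrected_intensity(foreground, background, floor=1.0):
    """Background-subtract one channel, never going below a small floor."""
    return max(foreground - background, floor)

def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """log2(red / green) for one spot after background subtraction."""
    red = corrected_intensity(red_fg, red_bg)
    green = corrected_intensity(green_fg, green_bg)
    return math.log2(red / green)

# Invented scanner readings for a single spot.
m = spot_log_ratio(red_fg=900, red_bg=100, green_fg=300, green_bg=100)
```

The floor keeps the logarithm defined when background exceeds foreground; spots in that situation are usually better flagged than trusted.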

Page 90: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Rett

Control

Differential Gene Expression on a cDNA Microarray

αB Crystallin is over-expressed in Rett Syndrome

Page 180

Page 91: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 181

Page 92: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 181

Page 93: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Page 181

Page 94: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 95: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 5: Data analysis

Page 180

This is the subject of Wednesday’s class

• How can arrays be compared?

• Which genes are regulated?

• Are differences authentic?

• What are the criteria for statistical significance?

• Are there meaningful patterns in the data (such as groups)?

Page 96: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Microarray data analysis

preprocessing

inferential statistics

exploratory statistics

Page 180

Page 97: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Microarray data analysis

preprocessing: global normalization, local normalization, scatter plots

inferential statistics: t-tests

exploratory statistics: clustering
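
Two of the steps named on this slide can be sketched with the Python standard library. Global normalization is shown here as median-centering of log ratios, which is one common choice and an assumption of this example, and the t-test is shown as Welch's unequal-variance statistic with its approximate degrees of freedom. The function names and data are hypothetical, and turning the statistic into a p-value would need a t distribution from a statistics package.

```python
import math
from statistics import mean, median, variance

def global_normalize(log_ratios):
    """Shift all log ratios on one array so that their median is zero."""
    center = median(log_ratios)
    return [x - center for x in log_ratios]

def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom for one gene.

    Each group needs at least two replicate values with nonzero spread.
    """
    va, vb = variance(group_a), variance(group_b)
    na, nb = len(group_a), len(group_b)
    se2 = va / na + vb / nb
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Invented values: log ratios on one array, then one gene in two conditions.
normalized = global_normalize([1.0, 2.0, 4.0])
t, df = welch_t([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

With thousands of genes tested at once, the resulting statistics also need a correction for multiple comparisons, which this sketch leaves out.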

Page 180

Page 98: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Matrix of genes versus samples

Metric (define distance)

supervised, unsupervised analyses

clustering trees (hierarchical, k-means)

self-organizing maps

principal components analysis
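
To make "define distance" concrete, the sketch below computes two metrics often used on a genes-by-samples matrix: Euclidean distance and a Pearson correlation distance (1 minus r). The gene names and expression values are invented, and the choice of these two metrics is an assumption of this example rather than something the slide specifies.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson_distance(u, v):
    """1 - Pearson correlation; undefined for a constant profile."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)

def distance_matrix(rows, metric):
    """Pairwise distances between named rows, keyed by (name1, name2)."""
    names = sorted(rows)
    return {(p, q): metric(rows[p], rows[q]) for p in names for q in names if p < q}

# Hypothetical matrix: three genes measured in three samples.
matrix = {
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [10.0, 20.0, 30.0],  # same shape as gene_a, larger magnitude
    "gene_c": [3.0, 2.0, 1.0],     # opposite trend to gene_a
}
by_euclid = distance_matrix(matrix, euclidean)
by_pearson = distance_matrix(matrix, pearson_distance)
```

Here gene_a and gene_b are far apart by Euclidean distance but essentially identical by correlation distance, so the metric chosen can change which genes a clustering method groups together.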

Page 180

Page 99: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 6: Biological confirmation

Page 182

Microarray experiments can be thought of as “hypothesis-generating” experiments.

The differential up- or down-regulation of specific genes can be measured using independent assays such as

-- Northern blots

-- reverse transcription polymerase chain reaction (RT-PCR)

-- in situ hybridization

Page 100: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Stage 7: Microarray databases

Page 182

There are two main repositories:

Gene Expression Omnibus (GEO) at NCBI

ArrayExpress at the European Bioinformatics Institute (EBI)

See the URLs on page 184

Page 101: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Gene Expression Omnibus (GEO)

NCBI repository for gene expression data

Page 102: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 103: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 104: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

http://www.dnachip.org

Page 183

Page 105: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

• Many links on Leming Shi’s page: http://www.gene-chips.com

• Stanford Microarray Database http://www.dnachip.org

• links at http://pevsnerlab.kennedykrieger.org/

Microarrays: web resources

Page 106: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Database Referencing of Array Genes Online (DRAGON)

Page 107: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Database Referencing of Array Genes Online (DRAGON)

Credit: Christopher Bouton, Carlo Colantuoni, George Henry

Page 108: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 109: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 110: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

Paste accession numbers into DRAGON here

Page 111: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression
Page 112: Information Theory, Statistical Measures and Bioinformatics approaches  to gene expression

DRAGON relates genes to KEGG pathways

