Practical Applications and Properties of the Exponentially
Modified Gaussian (EMG) Distribution
A Thesis
Submitted to the Faculty
of
Drexel University
by
Scott Haney
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
March 23rd, 2011
© Copyright March 23rd, 2011 Scott Haney. All Rights Reserved.
Table of Contents
List of Tables
Abstract
1. Introduction
2. Background on Microarray Data Analysis
2.1 Gene Expression
2.2 Measuring Gene Expression
2.3 Affymetrix Microarrays
2.4 Experimental Errors and Data Preprocessing
3. Properties of the Exponentially Modified Gaussian (EMG) Distribution
3.1 Reparameterization of the EMG Distribution
3.2 EMG Quantile Bounds
3.3 Parameter Estimation
3.4 Shape Estimation
3.5 EMG Right Tail Approximation
4. Application of the EMG Distribution to Actual Affymetrix Microarray Perfect Match (PM) Probe Distributions
4.1 Comparing the Right Tail to a Shifted Exponential Distribution
4.2 Discrepancy in the Sample Quantile of the Sample Mean
5. Fitting the Right Tail of the Perfect Match (PM) Probe Data
5.1 Derivation of Functions That Decrease by a Common Ratio
5.2 Application of Functions that Decrease by a Common Ratio to the Right Tail
6. Practical Implementation of EMG Parameter Estimation Method and Properties
6.1 Proof of Consistency
6.2 Practical Considerations and Alterations
6.3 Summary of Final Parameter Estimation Method
6.4 Currently Available Methods
6.5 Maximum Likelihood Estimation Method
6.6 The Silver Method
6.7 Method of Moments
7. Comparison of Methods on Synthetic Data
8. Conclusion
Appendices
A. Derivation of pdf and cdf
A.1 Derivation of the Probability Density Function and the Cumulative Distribution Function
Bibliography
List of Figures
2.1 The steps of gene expression that lead to a protein product (taken from [25])
2.2 Affymetrix Chip Design (taken from [13])
2.3 Step by step procedure of a typical Affymetrix microarray experiment (taken from [9])
2.4 Several sources of error for a microarray experiment (Taken from [35])
3.1 Plots of EMG distributions for different values of k
4.1 Plots of the sample pdf histograms for the PM probe distributions from five Affymetrix microarrays along with a plot of an EMG distribution with k = 1
5.1 Plot of the right tail of the sample pdf histogram for the PM probe data from T01 tumor.CEL fitted to a shifted version of f(x) = 3log2(x)
List of Tables
7.1 Synthetic data results for the new method
7.2 Synthetic data results for the method of moments
7.3 Synthetic data results for the Silver method
Abstract
Practical Applications and Properties of the Exponentially Modified Gaussian
(EMG) Distribution
Scott Haney
Advisor: Moshe Kam, Ph.D.
The exponentially modified Gaussian (EMG) probability distribution is defined
as the convolution of an exponential distribution and a Gaussian distribution which
are independent of each other. Using a reparameterized form of the EMG cumulative
distribution function (cdf) several properties of the EMG distribution are derived.
These properties are used to test whether the distribution of the perfect match (PM)
probes from five Affymetrix microarrays follows an EMG distribution and to create
a new parameter estimation method. A commonly used method for preprocessing
Affymetrix microarray data, known as the robust multi-array average (RMA), as-
sumes that the distribution of the PM probes at least approximately follows an EMG
distribution. Using the results derived in this thesis it is found that the EMG distri-
bution is not a good fit for these sample data based on differences in the right tail of
the sample distribution. A new distribution that is very dissimilar to the right tail of
an EMG distribution is derived that more accurately fits the right tail of the sample
data. From the properties of the EMG distribution derived in this thesis it is further
shown that a new parameter estimation method can be created. This new parameter
estimation method is compared against two other methods from the literature namely
the method of moments and the Silver method (2009). From a theoretical perspec-
tive, the new parameter estimation method has the advantage that it is proven to
be consistent and to always return valid parameter estimates (such as the constraint
that the variance be positive). Neither the Silver method nor the method of moments
has both of these qualities. All three methods were compared on the same synthetic
data generated from EMG distributions and it was found that the performance of
each method depended on the “shape” of the EMG distribution. It was also found
that the Silver method appears to not be consistent for EMG distributions that are
too “close” to being a Gaussian distribution.
1. Introduction
The EMG distribution is the convolution of a Gaussian distribution and an ex-
ponential distribution which are independent of each other. This distribution has
found practical applications in a variety of scientific disciplines such as chromatogra-
phy [17,20,23,29], cellular biology [14], radiotherapy [16], and microarray preprocess-
ing [18,30]. Many of these practical applications focus on the problem of curve fitting
of data points to a function which is an EMG pdf multiplied by a scaling parameter.
A large number of algorithms have been introduced in the literature to solve this
problem [3,11,12,36].
The focus of this thesis is to better understand the properties of the EMG distribution
so that it can be determined whether or not the perfect match (PM) probe
distributions from five Affymetrix microarrays approximately follow an EMG distribution.
This is an important assumption made by a commonly used microarray
preprocessing method known as the robust multi-array average (RMA) [18]. Several
properties of the EMG distribution were derived and were used to show that the right
tails of the sample probability density function (pdf) were much “heavier” than would
be expected for an EMG distribution. By visual analysis of the sample pdf histograms
it was determined that the right tails of the sample pdfs approximately reduced in
height by one third whenever the value on the x-axis was doubled. This is a property
that the right tail of an EMG pdf does not come close to having. A function with
this property was derived and it was found to be a reasonable approximation for the
right tails of the sample pdfs. These results strongly challenge the assumption used
by the RMA method that the PM probes approximately follow an EMG distribution.
Using the derived properties it is also possible to create a parameter estimation
method that has some very desirable properties such as consistency and always being
able to return “valid” parameter estimates where “valid” refers to parameter esti-
mates that satisfy all of the constraints of the original parameters. Several parameter
estimation methods already exist in the literature [22, 30] and the new parameter
estimation method is compared to two of these. The two methods selected were the
method provided in [30] (referred to as the “Silver method”) and the method of mo-
ments. All three methods were compared on synthetic data generated from EMG
distributions. The synthetic data trials distinguished between three scenarios which
were:
1. The EMG distribution is “close” to being a shifted exponential distribution
2. The EMG distribution is “close” to being a Gaussian distribution
3. The EMG distribution is neither “close” to a shifted exponential distribution
nor “close” to a Gaussian distribution
An EMG distribution is considered to be “close” to a shifted exponential distribution
when a large fraction of the variance of the EMG distribution is due to the variance
of the exponential component; an EMG distribution is considered to be “close” to a
Gaussian distribution when a large fraction of the variance of the EMG distribution
is due to the variance of the Gaussian component.
Both the Silver method and the method of moments were found to have distinct
disadvantages compared to the new parameter estimation method. The method of
moments failed to return valid parameter estimates at least 10 times out of 100 and
at most 61 times out of 100 in the synthetic data trials. For these failed runs the
method of moments returned at least one imaginary parameter estimate. The Silver
method appears to be converging to incorrect parameter estimates under the second
scenario. The average parameter estimates for the Silver method after applying it to
100 random samples of size 10,000 generated from a certain EMG distribution showed
that the parameter estimates were off by as much as 29 standard deviations.
With respect to accuracy, the results of the synthetic data trials showed that the
performance of the parameter estimation methods varied across the three scenarios.
In the first scenario it was found that the accuracy of the Silver method was noticeably
better in most cases than the accuracy of the new method and the accuracy of the
method of moments. In the second scenario it was found that the accuracy of the
method of moments and the accuracy of the new method were comparable, while
in most cases the accuracy of the Silver method was noticeably lower. In the third
scenario it was found that the accuracy of the method of moments and the accuracy
of the new method were comparable while in most cases the accuracy of the Silver
method was noticeably lower.
The organization of this thesis is as follows:
1. Background necessary for understanding the application of the EMG distribu-
tion to Affymetrix microarray data is described.
2. Properties of the EMG distribution that will be used in improving the application
of the EMG distribution in practice are derived.
3. The assumption that the PM probe data from Affymetrix microarrays approximately
follow an EMG distribution is tested for data from five microarrays
and it is found that this assumption is unlikely to be true.
4. A new distribution is derived to fit the right tails of the PM probe distributions
from the five microarrays. This new distribution is found to visually fit the
sample data well and is not “close” to the right tail of an EMG distribution.
5. A new parameter estimation procedure is described and is proven to be consistent.
6. The new parameter estimation method is compared to two other parameter
estimation methods from the literature and is found to have several important
advantages over these two methods.
2. Background on Microarray Data Analysis
Within a single human being different cell types can have exactly the same DNA
yet be extraordinarily different. For example, skin cells and bone cells have the same
DNA yet they are not very similar in form or function [2]. Although skin and muscle
cells have the same DNA, certain subsequences of the DNA (known as genes) affect
the cellular environment in different ways. Perhaps the most commonly studied way
by which a gene can affect a cell is the process of gene expression.
2.1 Gene Expression
Gene expression is a multi-step process by which a gene product is created from
a gene. In humans the most common gene products are proteins, which are one or
more long chains of amino acids that are folded together. For simplicity it is assumed
that gene expression refers to gene expression where the gene product is a protein
since proteins are thought to be the primary reason for biological changes within the
cell. The steps of gene expression for protein products [2] are
1. DNA is transcribed into a complementary mRNA copy
2. Intron sequences are removed (or spliced) from the complementary mRNA copy
3. The spliced complementary mRNA sequence is translated into a chain of amino
acids
4. Posttranslational modifications are made to the chain of amino acids and the
final protein product is formed
These steps are shown pictorially in Figure 2.1.
Figure 2.1: The steps of gene expression that lead to a protein product (taken from [25])
Protein gene products are typically very complex and can affect the cell in different
ways depending on a variety of factors. Two common factors that impact the effect
of proteins are the concentration of other proteins in the cellular environment and the
folded shape of the protein. Any change starting from gene expression and ending
with the final structure, form, and environment of the protein product can affect the
biology of the cell [2].
2.2 Measuring Gene Expression
Obtaining a meaningful measure of gene expression is not straightforward. A
single change in any step of the process can lead to different biological results. In
practice, the first step of the process of gene expression where the DNA is transcribed
into a complementary mRNA copy is the portion of the process that is measured.
Measuring this step provides an estimate of the total amount of protein product that
can be produced. This measurement, however, does not provide any estimate as to
how much of the protein is actually produced or give any idea as to the final physical
form of the protein in the cell.
There are several important reasons for focusing on this portion of the process
which are as follows:
1. Methods for measuring the presence of mRNA molecules are well established
2. Since the human genome is approximately 99.9% identical across individuals it
is reasonable to assume that the same mRNA molecules are being tested for
3. It is possible to simultaneously measure the presence of a large number of mRNA
molecules within the same sample
A number of testing devices are available for simultaneously measuring large numbers
of mRNA molecules in a sample. One class of these testing devices, known as microarrays,
is commonly used for this purpose in practice.
2.3 Affymetrix Microarrays
One of the most well known manufacturers of microarrays is Affymetrix [1].
Affymetrix microarrays are small chips that have their surfaces subdivided into a
rectangular grid. Each rectangle in the grid contains a large number of 25 nucleotide
base pair long DNA probes all having the same sequence. These DNA probes are
“standing straight up” on the surface of the chip with the bottom end of the probe
affixed to the surface of the chip and the top end of the probe being free to move
(Figure 2.2). This design allows any mRNA molecules to chemically bind to the DNA
probes on the surface of the microarray.
Figure 2.2: Affymetrix Chip Design (taken from [13])
Each subgrid contains either perfect match (PM) probes or mismatch (MM)
probes. A PM probe is designed to be complementary to an expected subsequence of
a specific mRNA. An MM probe is designed to match a PM probe sequence with the
exception that the 13th nucleotide base is switched. Every subgrid of PM probes has
a corresponding subgrid of MM probes. For every gene there are typically several PM
and MM probe subgrids. The entire collection of these subgrids is termed a probeset.
Affymetrix microarrays “measure” mRNA levels by using basic principles of chem-
istry. Each DNA probe on the surface will prefer to be bound to other DNA that
is exactly complementary. In general, the “closer” a subsequence of a DNA molecule is to
being complementary to a probe sequence the more likely it will be to bind to the
corresponding probe. By using this principle it is thought that if a targeted sequence
is present in solution it will bind to its corresponding probe with high probability. Of
course, other DNA sequences in solution that have a subsequence which is “close” to
being complementary can also bind. It is thought that the MM probes can be used
to provide an estimate of this erroneous binding known as cross-hybridization.
An Affymetrix microarray experiment begins with the extraction of mRNA from a biological
sample. The extracted mRNA then goes through a number of preparation steps
where it is labeled with some molecule that can be identified using a scanner and
the labeled mRNA is then applied to the surface of the microarray. After the chem-
ical reactions have had some time to take place the microarray is washed and only
the mRNA from the sample that is bound to probes should remain. Lastly, the mi-
croarray is put under a scanner and for each rectangular subgrid the intensity of the
labeling molecule is measured. A pictorial example of this process is given in Figure
2.3.
Figure 2.3: Step by step procedure of a typical Affymetrix microarray experiment (taken from [9])
2.4 Experimental Errors and Data Preprocessing
Affymetrix microarrays are subject to technical, chemical, and human errors. An
example of some of these errors can be seen in Figure 2.4. These errors have been
extensively studied in the literature [33,35,37]; however, they still remain to be convincingly
modeled in practical Affymetrix microarray experiments. An understanding
of how these errors affect Affymetrix microarray data is essential for determining how
reliable the data are as well as for extracting a reasonable estimate for the “level” of
gene expression in the sample.
Figure 2.4: Several sources of error for a microarray experiment (Taken from [35])
Previous work has been completed towards estimating the mRNA concentration
in the presence of error and has met with some success. In one publication [21],
a method was developed that was capable of detecting known mRNA levels in the
presence of experimental error. At least two other authors determined differential
equation models that took into account error which worked well on the data tested
[5,35]. For the type of microarray experiment described in the previous section these
techniques have not had a very significant impact in practice. For a different type
of microarray experiment, a real-time microarray experiment [8], these techniques
are much more effective and practical. Most microarray data sets that are currently
available, however, are not real-time microarray experiments.
In practice it is common to handle microarray errors by using techniques that are
much simpler than the methods discussed in the previous paragraph. The typical first
step to microarray data analysis is to preprocess the data in order to remove error.
Some of the most commonly used microarray preprocessing techniques in practice
are those provided directly by Affymetrix (PLIER and MAS 5.0) [34] and the robust
multi-array average (RMA) [18]. As of the time of this writing no single
preprocessing method has been found to be generally preferable to the rest [19].
After the data are preprocessed it is usually assumed that the resulting data are
“error free.” Data analysis techniques are then applied to the preprocessed data to
find interesting results.
The preprocessing technique of primary interest in this thesis is RMA. At the
present time the original RMA publication has been cited over 3,000 times. This
technique makes the assumption that the distribution of PM probes from a microarray
approximately follows an EMG distribution [18]. RMA uses this assumption to model
observed values as signal (which follows an exponential distribution) plus noise (which
follows a Gaussian distribution). The signal value is then estimated by solving for
the expected value of signal given the value of signal plus noise. If the assumption
that the distribution of the PM probes follows an EMG distribution is incorrect then
the use of the EMG distribution in RMA to estimate the signal is questionable. This assumption
is shown to be unlikely to be true based on the results of comparing the sample data
to certain properties of the EMG distribution.
3. Properties of the Exponentially Modified Gaussian (EMG)
Distribution
Due to the large reach of the EMG distribution in practical applications [14, 18,
30], a better understanding of the EMG distribution is worth pursuing. The cumulative
distribution function (cdf) and probability density function (pdf) for an EMG distribution are
given below as EMG(c; µ, σ, λ) and emg(c; µ, σ, λ) respectively (see Appendix A for
derivations):
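$$\mathrm{EMG}(c;\mu,\sigma,\lambda) = \frac{1}{2} - \frac{1}{2} e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - c\right)} \operatorname{erfc}\left(\frac{\sigma}{\sqrt{2}}\left(\lambda + \frac{\mu - c}{\sigma^2}\right)\right) + \frac{1}{2}\operatorname{erf}\left(\frac{1}{\sqrt{2}\sigma}(c - \mu)\right) \tag{3.1}$$

$$\mathrm{emg}(c;\mu,\sigma,\lambda) = \frac{\lambda}{2} e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - c\right)} \operatorname{erfc}\left(\left(\frac{\sigma}{\sqrt{2}}\right)\left(\lambda + \frac{\mu - c}{\sigma^2}\right)\right) \tag{3.2}$$

where

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\,dt$$

$$\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-t^2}\,dt = 1 - \operatorname{erf}(x)$$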
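As a rough numerical companion to (3.1) and (3.2), the sketch below transcribes both formulas using only Python's standard `math` module. The function names `emg_cdf` and `emg_pdf` are illustrative choices and do not come from the thesis, and the direct transcription is only suitable for moderate arguments because the exponential factor can overflow far into the left tail.

```python
import math

def emg_cdf(c, mu, sigma, lam):
    """EMG cdf, transcribed term by term from (3.1)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    tail = math.erfc((sigma / math.sqrt(2.0)) * (lam + (mu - c) / sigma ** 2))
    gauss = math.erf((c - mu) / (math.sqrt(2.0) * sigma))
    return 0.5 - 0.5 * expo * tail + 0.5 * gauss

def emg_pdf(c, mu, sigma, lam):
    """EMG pdf, transcribed from (3.2)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    tail = math.erfc((sigma / math.sqrt(2.0)) * (lam + (mu - c) / sigma ** 2))
    return 0.5 * lam * expo * tail

# Illustrative parameter values (assumed here, not taken from the thesis).
mu, sigma, lam = 0.0, 1.0, 1.0
for c in (-2.0, 0.0, 1.0, 4.0):
    print(f"c={c:5.1f}  cdf={emg_cdf(c, mu, sigma, lam):.6f}  pdf={emg_pdf(c, mu, sigma, lam):.6f}")
```

A simple sanity check on such a transcription is that a central finite difference of `emg_cdf` should agree closely with `emg_pdf` at the same point.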
In this chapter several properties of the EMG distribution are derived. The derivations
predominantly rely on reparameterizing the input to the EMG cdf. These properties
will later be used to challenge a current assumption that the PM probe data
from Affymetrix microarrays approximately follow an EMG distribution [18] as well
as to create a new parameter estimation method.
3.1 Reparameterization of the EMG Distribution
A carefully selected reparameterization of the input to the EMG cdf can be used
to show several useful properties of the EMG distribution. This result was discovered
by analyzing the mode of the EMG pdf, which occurs when the derivative of the EMG
pdf is equal to zero. This equation is given by
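$$\lambda\sigma = \frac{1}{\sqrt{\pi}} \times \frac{e^{-p^2}}{\operatorname{erfc}(p)} \tag{3.3}$$

where

$$p = \frac{\lambda\sigma}{2} + \frac{\mu - c}{2\sigma}$$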
Solving for c yields the mode of the EMG pdf.
The equation for the mode can be simplified somewhat by replacing c with a
reparameterization c1 which is given by
$$c_1 = \mu + \lambda\sigma^2 - 2D\sigma \tag{3.4}$$
where D ∈ R. Replacing c with c1 in (3.3) causes the equation to become
$$\lambda\sigma = \frac{e^{-D^2}}{\sqrt{\pi}\,\operatorname{erfc}(D)}$$
This equation shows that the mode of the EMG pdf can be written entirely in terms
of D and λσ. The term λσ will be used often throughout the rest of this thesis, and
from this point on this term will be denoted by k.
This reparameterization can be slightly generalized and can be used to simplify the
EMG cdf for some input values. This slightly updated reparameterization is denoted
by c2 and is given by
$$c_2 = \mu + C\lambda\sigma^2 + D\sigma$$
where C,D ∈ R. Replacing c with c2 in (3.1), the EMG cdf reduces to
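$$\mathrm{EMG}_{c_2}(C, D, k) = \frac{1}{2} \times \left[1 - e^{\frac{k^2}{2} - Ck^2 - Dk}\operatorname{erfc}\left(\frac{k - Ck - D}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{Ck + D}{\sqrt{2}}\right)\right]. \tag{3.5}$$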
where k = λσ. Using this equation it is possible to calculate any quantile that can
be represented in terms of c2 once k is known.
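The claim that (3.5) depends on the parameters only through k can be checked numerically. The sketch below, using illustrative names (`emg_cdf`, `emg_c2`) and arbitrarily chosen parameter values that are not from the thesis, evaluates the cdf (3.1) at c2 for two different parameter sets sharing the same k and compares the result with (3.5).

```python
import math

def emg_cdf(c, mu, sigma, lam):
    """EMG cdf as in (3.1)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    tail = math.erfc((sigma / math.sqrt(2.0)) * (lam + (mu - c) / sigma ** 2))
    return 0.5 - 0.5 * expo * tail + 0.5 * math.erf((c - mu) / (math.sqrt(2.0) * sigma))

def emg_c2(C, D, k):
    """Reparameterized cdf (3.5); depends on the parameters only through k."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2.0 - C * k * k - D * k)
                  * math.erfc((k - C * k - D) / math.sqrt(2.0))
                  + math.erf((C * k + D) / math.sqrt(2.0)))

# Two hypothetical parameter sets with the same k = lam * sigma = 0.5.
C, D = 0.7, -0.3
for mu, sigma, lam in [(0.0, 1.0, 0.5), (3.0, 0.25, 2.0)]:
    k = lam * sigma
    c2 = mu + C * lam * sigma ** 2 + D * sigma
    print(f"mu={mu}, sigma={sigma}, lam={lam}: "
          f"direct={emg_cdf(c2, mu, sigma, lam):.12f}  reparameterized={emg_c2(C, D, k):.12f}")
```

Both parameter sets should give the same value up to floating point rounding, since the location and scale of the distribution are absorbed into c2.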
Several important results that will be used in this writing are now explained in the
following sections. These results heavily rely on the term k and (3.5). From the work
in the following sections it will become evident that the term k provides a significant
amount of information about an EMG distribution.
3.2 EMG Quantile Bounds
Analysis of specific values of c2 revealed that at least some of the quantiles must
lie within certain bounds. This is accomplished by combining the constraint k > 0
(which is true because both σ and λ are greater than zero) with (3.5). Two such
bounds are given in the following paragraphs both as examples and for use later in
this thesis.
Perhaps the simplest example of a quantile bound is when C = D = 0. Under
these conditions it follows that c2 = µ and (3.5) reduces to a function that only
depends on k which is given by
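$$\mathrm{EMG}_{c_2}(0, 0, k) = \mathrm{EMG}_{\mu}(k) = \frac{1}{2} \times \left[1 - e^{\frac{k^2}{2}}\operatorname{erfc}\left(\frac{k}{\sqrt{2}}\right)\right] \tag{3.6}$$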
Taking a derivative shows that the right hand side of EMGµ(k) is monotonically
increasing for k ∈ (0,∞), approaching 0 as k → 0 and 1/2 as k → ∞. Using this information it can be shown that
$$0 < \mathrm{EMG}_{\mu}(k) < \frac{1}{2} \tag{3.7}$$
for any EMG distribution.
It is also possible to determine a quantile bound on the mean $m = \mu + \lambda^{-1}$ of
an EMG distribution. The reparameterization c2 is equal to m when $C = k^{-2}$ and
D = 0. Under these conditions (3.5) reduces to a function that only depends on k
which is given by
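$$\mathrm{EMG}_{m}(k) = \mathrm{EMG}_{c_2}(k^{-2}, 0, k) = \frac{1}{2} \times \left[1 - e^{\frac{k^2}{2} - 1}\operatorname{erfc}\left(\frac{k - k^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{1}{k\sqrt{2}}\right)\right] \tag{3.8}$$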
Analysis of the derivative shows that the right hand side of EMGm(k) is monotoni-
cally decreasing for k ∈ (0,∞). Using this information it can be shown that
$$\frac{1}{2} < \mathrm{EMG}_{m}(k) < 1 - e^{-1} \approx 0.632 \tag{3.9}$$
for any EMG distribution.
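The two quantile bounds (3.7) and (3.9) can be spot-checked numerically. The following sketch evaluates (3.6) and (3.8) on a geometric grid of k values; the helper names (`emg_mu`, `emg_m`, `k_grid`) and the grid range are assumptions made for illustration, and a grid check is of course not a proof.

```python
import math

def emg_mu(k):
    """(3.6): EMG cdf evaluated at c = mu."""
    return 0.5 * (1.0 - math.exp(k * k / 2.0) * math.erfc(k / math.sqrt(2.0)))

def emg_m(k):
    """(3.8): EMG cdf evaluated at the mean m = mu + 1/lambda."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2.0 - 1.0)
                  * math.erfc((k - 1.0 / k) / math.sqrt(2.0))
                  + math.erf(1.0 / (k * math.sqrt(2.0))))

def k_grid(lo=0.05, hi=10.0, n=40):
    """Geometrically spaced k values; the range is an arbitrary illustration."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]

upper_m = 1.0 - math.exp(-1.0)  # upper bound in (3.9)
for k in k_grid(n=8):
    ok_mu = 0.0 < emg_mu(k) < 0.5
    ok_m = 0.5 < emg_m(k) < upper_m
    print(f"k={k:7.3f}  EMG_mu={emg_mu(k):.4f} ({ok_mu})  EMG_m={emg_m(k):.4f} ({ok_m})")
```

The grid is deliberately kept to moderate k because the factor exp(k²/2) overflows in double precision for k somewhat above 37.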
3.3 Parameter Estimation
It is possible to completely define an EMG distribution in terms of k and two
quantiles rather than in terms of the three parameters µ, σ, and λ. Assuming that k
is known, one such procedure for determining the parameters is as follows:
1. Determine µ from the quantile determined by the right hand side of
$$\mathrm{EMG}_{\mu}(k) = \frac{1}{2} \times \left[1 - e^{\frac{k^2}{2}}\operatorname{erfc}\left(\frac{k}{\sqrt{2}}\right)\right]$$
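2. Determine $m_l = \mu + \lambda^{-1}$ from the quantile determined by the right hand side
of
$$\mathrm{EMG}_{m}(k) = \frac{1}{2} \times \left[1 - e^{\frac{k^2}{2} - 1}\operatorname{erfc}\left(\frac{k - k^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{1}{k\sqrt{2}}\right)\right]$$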
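3. Determine $m_s = \mu + \sigma$ from the quantile determined by the right hand side of
$$\mathrm{EMG}_{c_2}(0, 1, k) = \frac{1}{2} \times \left[1 - e^{\frac{k^2}{2} - k}\operatorname{erfc}\left(\frac{k - 1}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{1}{\sqrt{2}}\right)\right]$$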
4. Estimate λ by subtracting the estimate of µ from $m_l$ and then taking the multiplicative
inverse of the result
5. Estimate σ by subtracting the estimate of µ from $m_s$
The ability to define the EMG distribution in terms of k and two quantiles opens
up the possibility of a new type of parameter estimation method for an EMG distri-
bution. Given a sample from an EMG distribution if k can be estimated then it is
possible to estimate the parameters of the EMG distribution. In practice, a simple
way to estimate k is by estimating the sample quantile of the sample mean. This
estimate can then be substituted for the left hand side of (3.8) and an estimate for k
can be obtained by solving this equation for k. As long as the estimate for the sample
quantile of the sample mean satisfies (3.9) it will be possible to solve for k.
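The procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the final method of this thesis: the bisection bracket for k, the crude empirical quantile rule, and all function names (`emg_c2`, `solve_k`, `sample_quantile`, `estimate_emg`, `draw_emg`) are choices made here for the example.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def emg_c2(C, D, k):
    """Reparameterized EMG cdf (3.5) at c2 = mu + C*lambda*sigma^2 + D*sigma."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2.0 - C * k * k - D * k)
                  * math.erfc((k - C * k - D) / SQRT2)
                  + math.erf((C * k + D) / SQRT2))

def solve_k(q_mean, lo=1e-3, hi=20.0, iters=100):
    """Solve EMG_m(k) = q_mean for k by bisection (EMG_m decreases in k).

    The bracket [lo, hi] is an assumption of this sketch; hi is kept moderate
    so that exp(k*k/2) does not overflow.
    """
    def gap(k):
        return emg_c2(k ** -2, 0.0, k) - q_mean
    if not (gap(lo) > 0.0 > gap(hi)):
        raise ValueError("quantile of the sample mean is outside the bracket")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid   # EMG_m(mid) is still too large, so k must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_quantile(sorted_xs, p):
    """Crude empirical p-quantile: the order statistic at index floor(p * n)."""
    idx = min(len(sorted_xs) - 1, max(0, int(p * len(sorted_xs))))
    return sorted_xs[idx]

def estimate_emg(xs):
    """Return (mu, sigma, lam, k) estimated from the sample xs."""
    xs = sorted(xs)
    n = len(xs)
    mean = sum(xs) / n
    q_mean = sum(1 for x in xs if x <= mean) / n          # sample quantile of the sample mean
    k = solve_k(q_mean)                                   # invert (3.8)
    mu = sample_quantile(xs, emg_c2(0.0, 0.0, k))         # step 1
    m_l = sample_quantile(xs, emg_c2(k ** -2, 0.0, k))    # step 2: mu + 1/lambda
    m_s = sample_quantile(xs, emg_c2(0.0, 1.0, k))        # step 3: mu + sigma
    return mu, m_s - mu, 1.0 / (m_l - mu), k              # steps 4 and 5

def draw_emg(n, mu, sigma, lam, seed=0):
    """Synthetic EMG sample: independent Gaussian plus exponential draws."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) + rng.expovariate(lam) for _ in range(n)]

if __name__ == "__main__":
    print(estimate_emg(draw_emg(20000, mu=2.0, sigma=1.0, lam=1.0)))
```

The sketch raises an error when the sample quantile of the sample mean falls outside the bracket, which mirrors the requirement that the estimate satisfy (3.9) before k can be solved for.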
3.4 Shape Estimation
The value of k determines the overall “shape” of the EMG distribution. This can
be seen by analyzing the variance of an EMG distribution in terms of k which yields
the following:
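\begin{align}
\operatorname{Var}(\mathrm{EMG}(c;\mu,\sigma,\lambda)) &= \sigma^2 + \lambda^{-2} \notag \\
&= \frac{k^2 + 1}{\lambda^2} \tag{3.10} \\
&= \sigma^2 + \sigma^2 k^{-2} \tag{3.11}
\end{align}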
As k in (3.10) approaches zero the impact of the Gaussian component on the variance
becomes negligible. As k in (3.11) approaches ∞ the impact of the shifted exponential
component on the variance becomes negligible. As the variance of a component
becomes negligible, the EMG distribution will be “close” to the distribution of the
other component. These observations indicate that for values of k that are “large”
the EMG distribution is “close” to a Gaussian distribution and that for values of k
that are “small” the EMG distribution is “close” to a shifted exponential distribution.
In practice, it is likely that an EMG distribution which is very “close” to being
either a Gaussian distribution or a shifted exponential distribution will be treated
as a Gaussian distribution or a shifted exponential distribution respectively. Due to
this, it seems reasonable to assume that EMG distributions which arise in practice
are likely to have k values that are located within a certain bounded interval. The
variance relations which were discussed in the previous paragraph provide a way to
obtain a rough estimate for this bounded interval. By combining (3.10) and (3.11) it
follows that
$$\frac{k^2 + 1}{\lambda^2} = \sigma^2 + \sigma^2 k^{-2}$$
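Setting k = 1 results in $\sigma^2 = \lambda^{-2}$, which implies that the variances of the two components
are equal. It seems reasonable to assume that a component will become negligible when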
is equal. It seems reasonable to assume that a component will become negligible when
its variance is less than a certain percentage of the other. It further seems reasonable
to assume that this percentage can be set to 1%, which results in the following bounds
on k
k ∈ [0.1, 10].
Several plots of EMG distributions for different values of k between 0.1 and 10 are
given in Figure 3.1.
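The interval can be recovered in one line from the variance relations, reading "negligible" as one component's variance being below 1% of the other's, as assumed above. The ratio of the Gaussian variance to the exponential variance is exactly k², so:

```latex
\frac{\sigma^2}{\lambda^{-2}} = \lambda^2\sigma^2 = k^2,
\qquad
0.01 \le k^2 \le 100
\;\Longleftrightarrow\;
0.1 \le k \le 10 .
```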
In practice, it seems unlikely that values of k outside of this interval will be encountered.
If such values do occur then it will be very difficult to estimate the parameters
of the EMG distribution. The reason for this is that the closer the EMG distribution
becomes to either a Gaussian distribution or a shifted exponential distribution the
harder it will be to estimate the exact magnitude of the difference. In general, the
slighter the modification to a distribution the harder it will be to detect.
3.5 EMG Right Tail Approximation
It is possible to show that the EMG cdf is approximately the same as a shifted
exponential cdf in the right tail of the distribution. The cdf of a shifted exponential
distribution will be denoted by SED(c;λ, T ) and is defined to be
\[ \mathrm{SED}(c;\lambda,T) = 1 - e^{-\lambda(c - T)} \tag{3.12} \]
where T is the shift parameter and λ is the same shape parameter that is used in an
exponential distribution. The desired approximation will be derived by considering
the reparameterization
\[ c_3 = \mu + D\sigma \tag{3.13} \]
where D ∈ R. Using this new reparameterization in place of c in the EMG cdf it
follows that
\[ \mathrm{EMG}_{c_3}(D,k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk}\,\mathrm{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{D}{\sqrt{2}}\right)\right] \]
Figure 3.1: Plots of EMG distributions for different values of k. Panels: (a) k = 0.1, (b) k = 0.5, (c) k = 1.0, (d) k = 2.0.
To see what happens in the right tail the limit as $c_3$ approaches infinity is considered.
This limit cannot immediately be determined because the right hand side of
$\mathrm{EMG}_{c_3}(D,k)$ does not directly include $c_3$. The right hand side is written in terms of
D so a relation between the limiting value of $c_3$ and D would allow the limit to be
easily evaluated. From the constraint that $\mathrm{EMG}_{\mu}(k) \in (0, \tfrac{1}{2})$ (3.7) and the constraint
that σ > 0 it is clear that if $c_3$ is greater than the median then D > 0. Thus as $c_3$
approaches infinity, D approaches infinity. Using this information it can be seen that
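\begin{align*}
\lim_{c_3 \to \infty} \mathrm{EMG}_{c_3}(D,k) &= \lim_{c_3 \to \infty} \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk}\,\mathrm{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{D}{\sqrt{2}}\right)\right] \\
&= \lim_{D \to \infty} \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk}\,\mathrm{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{D}{\sqrt{2}}\right)\right] \\
&= \frac{1}{2}\left[2 - 2 \lim_{D \to \infty} e^{\frac{k^2}{2} - Dk}\right] \\
&= 1 - \lim_{D \to \infty} e^{-k\left(D - \frac{k}{2}\right)}
\end{align*}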
where the last expression is the cdf of a shifted exponential distribution in D with shift
$T = \frac{k}{2}$ and shape parameter λ = k. Both the erf and the erfc terms approach their limits
at a much faster rate than does a term of the form $e^{-kD}$, hence the cdf of the EMG
distribution should be approaching the cdf of a shifted exponential distribution.
To show that the right tail approximation is accurate in a more quantitative
manner it is first noted that if D ≥ D0 > 0 then the following bounds hold
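\begin{align*}
1 &> \mathrm{erf}\!\left(\frac{D}{\sqrt{2}}\right) \ge 1 - \alpha_1 \\
2 &> \mathrm{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) \ge 2 - \alpha_2
\end{align*}
where
\begin{align*}
\alpha_1 &= \mathrm{erfc}\!\left(\frac{D_0}{\sqrt{2}}\right) \\
\alpha_2 &= \mathrm{erfc}\!\left(\frac{D_0 - k}{\sqrt{2}}\right)
\end{align*}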
From the fact that k > 0 it must be that α1 < α2. Using these constraints along with
the inequality between α1 and α2 bounds can be put on EMGc3(D, k). The lower
bound is given by
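\begin{align*}
\mathrm{EMG}_{c_3}(D,k) &\ge \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk}\,\mathrm{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right) + 1 - \alpha_1\right] \\
&= \frac{1}{2}\left[(2 - \alpha_1) - e^{\frac{k^2}{2} - Dk}\,\mathrm{erfc}\!\left(\frac{k - D}{\sqrt{2}}\right)\right] \\
&> \frac{1}{2}\left[(2 - \alpha_1) - 2e^{\frac{k^2}{2} - Dk}\right] \\
&= \left(1 - \frac{\alpha_1}{2}\right) - e^{\frac{k^2}{2} - Dk} \\
&= 1 - e^{\frac{k^2}{2} - Dk} - \frac{\alpha_1}{2} \\
&= 1 - e^{\frac{k^2}{2} - Dk} - \frac{\mathrm{erfc}\!\left(\frac{D_0}{\sqrt{2}}\right)}{2}
\end{align*}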
and the upper bound is given by
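\begin{align*}
\mathrm{EMG}_{c_3}(D,k) &< \frac{1}{2}\left[1 - (2 - \alpha_2)\,e^{\frac{k^2}{2} - Dk} + 1\right] \\
&= \frac{1}{2}\left[2 - (2 - \alpha_2)\,e^{\frac{k^2}{2} - Dk}\right] \\
&= 1 - e^{\frac{k^2}{2} - Dk} + \frac{\mathrm{erfc}\!\left(\frac{D_0 - k}{\sqrt{2}}\right)}{2}\,e^{\frac{k^2}{2} - Dk} \\
&\le 1 - e^{\frac{k^2}{2} - Dk} + \frac{\mathrm{erfc}\!\left(\frac{D_0 - k}{\sqrt{2}}\right)}{2}\,e^{\frac{k^2}{2} - D_0 k} \\
&= 1 - e^{\frac{k^2}{2} - Dk} + \frac{\mathrm{erfc}\!\left(\frac{w(k)}{\sqrt{2}}\right)}{2}\,e^{\frac{w^2(k)}{2} - \frac{D_0^2}{2}}
\end{align*}
Here the first inequality uses $\mathrm{erf}(D/\sqrt{2}) < 1$ together with the lower bound on the erfc term,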
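where $w(k) = D_0 - k$. For these two bounds the errors between $\mathrm{EMG}_{c_3}(D,k)$ and
the bounds are given by
\begin{align*}
L_e &= \frac{\mathrm{erfc}\!\left(\frac{D_0}{\sqrt{2}}\right)}{2} \tag{3.14} \\
U_e &= \frac{\mathrm{erfc}\!\left(\frac{w(k)}{\sqrt{2}}\right)}{2}\,e^{\frac{w^2(k)}{2} - \frac{D_0^2}{2}} \tag{3.15}
\end{align*}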
where Le is the error in the lower bound and Ue is the error in the upper bound.
Both error terms approach zero much more rapidly than does the term $e^{\frac{k^2}{2} - Dk}$ so
these approximations should be quite accurate as long as D is large enough relative
to k.
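The two bounds can be spot-checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the EMG cdf is written in the (D, k) form used in this section, with $L_e$ and $U_e$ taken from (3.14) and (3.15).

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_cdf(d, k):
    """EMG cdf at c3 = mu + d*sigma, written in terms of D and k = lambda*sigma."""
    tail = math.exp(k * k / 2.0 - d * k)
    return 0.5 * (1.0 - tail * math.erfc((k - d) / SQRT2) + math.erf(d / SQRT2))

def sed_approx(d, k):
    """Shifted exponential approximation 1 - exp(k^2/2 - D*k)."""
    return 1.0 - math.exp(k * k / 2.0 - d * k)

def lower_error(d0):
    """L_e from (3.14)."""
    return math.erfc(d0 / SQRT2) / 2.0

def upper_error(d0, k):
    """U_e from (3.15), with w(k) = D0 - k."""
    w = d0 - k
    return math.erfc(w / SQRT2) / 2.0 * math.exp(w * w / 2.0 - d0 * d0 / 2.0)

def within_bounds(d, d0, k):
    """Check approx - L_e <= EMG <= approx + U_e at a point D >= D0."""
    exact, approx = emg_cdf(d, k), sed_approx(d, k)
    return approx - lower_error(d0) <= exact <= approx + upper_error(d0, k)

if __name__ == "__main__":
    k, d0 = 1.0, 2.0
    print("L_e =", lower_error(d0), " U_e =", upper_error(d0, k))
    for d in (2.0, 3.0, 5.0, 8.0):
        print(d, emg_cdf(d, k), sed_approx(d, k), within_bounds(d, d0, k))
```

A check of this kind only samples a few points; it complements the derivation rather than replacing it.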
The error in approximation can also be characterized in terms of the percentage
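error. It is possible to show that the percentage error of the approximation monotonically
decreases to zero for D > 0. The percentage error PE of approximating the
value of $\mathrm{EMG}(c_3)$ at D is given by
\begin{align*}
PE &= \frac{\mathrm{EMG}_{c_3}(D,k) - \left(1 - e^{\frac{k^2}{2} - Dk}\right)}{\mathrm{EMG}_{c_3}(D,k)} \\
&= 1 - \frac{1 - e^{\frac{k^2}{2} - Dk}}{\mathrm{EMG}_{c_3}(D,k)} \tag{3.16}
\end{align*}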
If the percentage error is monotonically decreasing to zero for D > 0 then it must be
the case that the second term in PE given by
\[ P_r = \frac{1 - e^{\frac{k^2}{2} - Dk}}{\mathrm{EMG}_{c_3}(D,k)} \]
is monotonically increasing to one for D > 0. Clearly the limit of Pr is one since both
the numerator and the denominator are valid cdfs. The derivative of Pr with respect
to D can be shown to be positive so it follows that Pr is monotonically increasing
with respect to D. The denominator of the derivative is always positive since it is
squared and the numerator of the derivative given by
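\[ -2k\,e^{\frac{k^2}{2} - Dk}\left(\mathrm{erf}\!\left(\frac{\sqrt{2}}{2}(D - k)\right) - \mathrm{erf}\!\left(\frac{\sqrt{2}}{2}D\right)\right) \]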
will be positive for all D > 0.
As an example of the accuracy of this approximation assume that k = 1 and
D = 2. Under these circumstances it follows that
\[ \frac{\mathrm{EMG}_{c_3}(D,k) - \left(1 - e^{\frac{k^2}{2} - Dk}\right)}{\mathrm{EMG}_{c_3}(D,k)} \approx 0.016 \]
which shows that the percentage error in the approximation is close to 1.6%. Because
the percentage error is monotonically decreasing for D > 0, the percent error in the
approximation will be no more than approximately 1.6% for all D ≥ 2.
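The worked example can be reproduced with a few lines of code. This is a minimal sketch under the (D, k) form of the EMG cdf given above; `emg_cdf` and `percentage_error` are hypothetical helper names introduced here for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_cdf(d, k):
    """EMG cdf at c3 = mu + d*sigma, with k = lambda*sigma."""
    tail = math.exp(k * k / 2.0 - d * k)
    return 0.5 * (1.0 - tail * math.erfc((k - d) / SQRT2) + math.erf(d / SQRT2))

def percentage_error(d, k):
    """PE from (3.16): relative gap between the EMG cdf and the
    shifted exponential approximation 1 - exp(k^2/2 - D*k)."""
    exact = emg_cdf(d, k)
    approx = 1.0 - math.exp(k * k / 2.0 - d * k)
    return (exact - approx) / exact

if __name__ == "__main__":
    for d in (2.0, 3.0, 4.0, 6.0):
        print(f"k=1, D={d}: PE={percentage_error(d, 1.0):.6f}")
```

Evaluating `percentage_error` on a grid of D values for other choices of k is a quick way to pick a tail cutoff that meets a given error tolerance.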
4. Application of the EMG Distribution to Actual Affymetrix Microarray
Perfect Match (PM) Probe Distributions
Data from five Affymetrix microarrays described in [31] were downloaded from [7].
The five Affymetrix microarray data files that were selected were T01 tumor.CEL -
T05 tumor.CEL. It is found that the sample distributions of the PM probes for these
five Affymetrix microarrays are unlikely to follow an EMG distribution. First, it is
shown that the right tail of the sample pdf is not what would be expected for an EMG
distribution. Further, it is shown that the sample quantiles of the sample means for
all five distributions are larger than would be expected for an EMG distribution.
4.1 Comparing the Right Tail to a Shifted Exponential Distribution
From the results in 3.5 it is clear that the EMG cdf should be well approximated
by the cdf of a shifted exponential distribution in the right tail. In order to apply
this approximation in practice it will be necessary to know where to begin. It will
be shown that the start of the right tail can be reasonably approximated if an upper
bound kmax on k can be assumed. Once the right tail has been located, a slightly
modified ratio of two sample quantiles will be compared to the ratio that would be
expected if the distribution was a shifted exponential distribution. The results of
this test will show that the right tails of the PM probe distributions from the five
Affymetrix microarrays described at the beginning of this chapter are very different
from what would be expected for an EMG distribution.
4.1.1 Locating the Beginning of the Right Tail
To estimate the beginning of the right tail there must first be an estimate for the
upper bound on k denoted by kmax. From this estimate it is possible to determine
upper bounds on σ denoted by σmax and µ denoted by µmax. Using the result from 3.5
that the percentage error in the right tail approximation is monotonically decreasing
it is possible to select a value for D such that the percentage error is bounded. These
three estimates are then used to calculate c3 (3.13) which is the estimate for the
beginning of the right tail.
To estimate kmax it is not unreasonable to assume a value for kmax by eye-balling
the data given the insights from 3.4. From viewing the sample pdf histograms of the
five PM probe distributions (Figure 4.1) kmax = 1 seems like a safe estimate. Using
kmax, σmax can be obtained by rearranging (3.11) to obtain
\[ \sigma_{max} \le \frac{s}{\sqrt{1 + k_{max}^{-2}}} \]
where s is the sample standard deviation. Substituting k = kmax, C = D = 0 into
(3.5) yields an estimate for µmax. Lastly a suitable value for D must be chosen so
that the percentage error between the actual EMG tail and the shifted exponential
tail is “small” enough. In 3.5 it was shown that for D > 2 the percentage error in
the approximation was no more than roughly 1.6%. Given that this error seems to
be “small” enough it is assumed that the right tail begins at c3 = µmax + 2σmax.
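The calculation of the tail start can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, and the estimate of µmax (which comes from (3.5), not reproduced here) is simply passed in as an input.

```python
import math

def sigma_max(sample_std, k_max):
    """Upper bound on sigma from rearranging (3.11):
    s^2 = sigma^2 * (1 + k^-2), evaluated at k = k_max."""
    return sample_std / math.sqrt(1.0 + k_max ** -2)

def right_tail_start(mu_max, sample_std, k_max, d=2.0):
    """Estimated start of the right tail, c3 = mu_max + d * sigma_max (3.13).

    mu_max is assumed to have been obtained separately from (3.5);
    d = 2 corresponds to the roughly 1.6% error level discussed for k = 1.
    """
    return mu_max + d * sigma_max(sample_std, k_max)

if __name__ == "__main__":
    # Purely illustrative numbers, not taken from the microarray data.
    print(sigma_max(sample_std=100.0, k_max=1.0))
    print(right_tail_start(mu_max=250.0, sample_std=100.0, k_max=1.0))
```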
4.1.2 Testing the Right Tail
In order to test that the right tail is approximately a shifted exponential distribution
it is necessary to use a test that will not be affected much by the error in
the approximation. One such test is to slightly modify the ratio of two quantiles.
Estimates of quantiles tend to be fairly robust so this test should not be greatly affected
by the approximation error between $\mathrm{EMG}_{c_3}(D,k)$ and the cdf of some shifted
exponential distribution.
Figure 4.1: Plots of the sample pdf histograms for the PM probe distributions from five Affymetrix microarrays along with a plot of an EMG distribution with k = 1. Panels: (a) T01 tumor.CEL, (b) T02 tumor.CEL, (c) T03 tumor.CEL, (d) T04 tumor.CEL, (e) T05 tumor.CEL, (f) EMG with k = 1.
In order to derive a test for the ratio of quantiles from a shifted exponential
distribution such a test was first created for an exponential distribution. This test is
then extended in a natural way to the shifted exponential distribution. The cdf for
an exponential distribution denoted by E(c;λ) is given by
\[ E(c;\lambda) = 1 - e^{-\lambda c} \]
For an exponential distribution, the ratio of any two quantiles is constant. To see
this suppose that E(x1;λ) = q and E(x2;λ) = p. Then it follows that
\[ \frac{x_1}{x_2} = \frac{\ln(1 - q)}{\ln(1 - p)} \]
where the ratio of the quantiles is clearly independent of λ. For a shifted exponential
distribution the only change that needs to be made is to shift the input by the value
of the shift parameter T. The shifted ratio of its quantiles, denoted by $S_r$, is given by
\begin{align*}
S_r &= \frac{x_1 - T}{x_2 - T} \tag{4.1} \\
&= \frac{\ln(1 - q)}{\ln(1 - p)} \tag{4.2}
\end{align*}
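The invariance of the shifted ratio can be illustrated directly from the quantile function of a shifted exponential distribution. The sketch below is illustrative; the helper names are hypothetical and the parameter values are arbitrary.

```python
import math

def shifted_exp_quantile(q, lam, shift=0.0):
    """Value x solving 1 - exp(-lam * (x - shift)) = q."""
    return shift - math.log(1.0 - q) / lam

def shifted_ratio(x1, x2, shift=0.0):
    """S_r from (4.1)."""
    return (x1 - shift) / (x2 - shift)

def expected_ratio(q, p):
    """Right hand side of (4.2); depends only on the quantile levels."""
    return math.log(1.0 - q) / math.log(1.0 - p)

if __name__ == "__main__":
    q, p, shift = 0.50, 0.80, 3.0
    for lam in (0.5, 2.0, 7.0):
        x1 = shifted_exp_quantile(q, lam, shift)
        x2 = shifted_exp_quantile(p, lam, shift)
        print(lam, shifted_ratio(x1, x2, shift), expected_ratio(q, p))
```

Whatever values of λ and T are chosen, the printed ratio matches `expected_ratio(q, p)`, which is what makes the ratio usable as a test statistic without knowing λ.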
The ratio test just derived for a shifted exponential distribution can be directly
applied to the experimental data being studied despite two possible problems. The
first possible problem is that the right tail of the distribution will not be a valid
probability distribution (because the area under the right tail is not equal to one).
Instead the right tail will be some constant multiple of a probability distribution that
is “close” to being a shifted exponential distribution. Fortunately, these constants
will cancel out by taking a ratio so the test is not affected. The second possible
problem is the approximation error. It is important to show that the approximation
error will not cause a “large” error in Sr. Bounds on the error in Sr caused by error
in the approximation will be derived. In the application of this test to the PM probe
distribution of Affymetrix microarrays these bounds will be used to show that the
error in Sr caused by approximation error is not significant.
Showing that the approximation does not “significantly” affect Sr requires some
extra work due to the shift parameter T being present in the ratio test. It was shown
in 3.5 that the error in approximation can be written in terms of percentage error and
that the percentage error can be made as small as desired by moving far enough to
the right. Shifting the actual quantile value along with its approximation changes the
percentage error so it is necessary to know how the percentage error changes in this
case. The percentage error (PE from (3.16)) and the shifted percentage error denoted
by $PE_s$ can be related as follows
\[ PE = PE_s\,\frac{\mathrm{EMG}_{c_3}(D,k) - T}{\mathrm{EMG}_{c_3}(D,k)} \]
Equivalently, $PE_s = PE \cdot \mathrm{EMG}_{c_3}(D,k) / (\mathrm{EMG}_{c_3}(D,k) - T)$. From the last equation it
follows that if the ratio of the actual quantile to the shifted quantile is not too “large”
then $PE_s$ will be “reasonably” small whenever PE is “reasonably” small. Applying this
result back to the EMG cdf it follows that the ratio of any two quantiles
$x_1 = \mathrm{EMG}(y_1;\mu,\sigma,\lambda)$ and $x_2 = \mathrm{EMG}(y_2;\mu,\sigma,\lambda)$
that are “far” enough into the right tail is bounded by
\[ \left(\frac{1 - PE_s}{1 + PE_s}\right) < \left(\frac{\mathrm{EMG}(y_1;\mu,\sigma,\lambda) - T}{\mathrm{EMG}(y_2;\mu,\sigma,\lambda) - T}\right)\left(\frac{\mathrm{SED}(y_2) - T}{\mathrm{SED}(y_1) - T}\right) < \left(\frac{1 + PE_s}{1 - PE_s}\right) \]
This shows that the quantile ratio assuming a shifted exponential distribution will be
approximately the same as the quantile ratio assuming an EMG cdf as long as PEs
is “small” enough.
This ratio test is now applied to the sample data from the five Affymetrix PM
probe distributions using the quantiles q = 0.50 and p = 0.80. For all five sample pdfs
it was assumed that kmax = 1 (see Figure 4.1 for a visual comparison) and that the
right tail could be assumed to start at D = 2 (see 4.1.1 for justification). Using these
assumptions it was found that even with the approximation error being taken into
account, varying the sample quantiles by even as much as five standard deviations
was not enough to match the ratio that would be expected. This result strongly
suggests that the right tail does not follow a shifted exponential distribution which
casts doubt on the assumption that these data follow an EMG distribution.
4.2 Discrepancy in the Sample Quantile of the Sample Mean
For all five data sets the sample quantile of the sample mean was much larger
than the $(1 - e^{-1})$th quantile. Since the quantile of the mean of an EMG distribution
can not be larger than the $(1 - e^{-1})$th quantile it seems likely that the sample data
are not EMG distributed. To investigate this possibility a hypothesis test is created
to determine whether or not the quantile of the mean of each distribution was larger
than the $(1 - e^{-1})$th quantile.
To create the hypothesis test it is assumed that both the sample mean and the
sample quantiles approximately follow a Gaussian distribution. Due to the fact that
the sample size was greater than 200,000 for all five sample distributions, these two
assumptions seem reasonable in light of the central limit theorem. Given these assumptions,
the paired t-test can be used to determine if it is likely that the quantile
of the mean is larger than the $(1 - e^{-1})$th quantile.
After applying the paired t-test to all of the sample distributions it was found that
the difference between the sample quantile of the sample mean and the $(1 - e^{-1})$th
quantile was very high. For all five data sets, the difference between the sample
quantile of the sample mean and the $(1 - e^{-1})$th sample quantile was no less than 90,
while the standard deviations for both estimates were less than 1. Given these values
the null hypothesis that the quantile of the mean is less than the $(1 - e^{-1})$th quantile
can easily be rejected at the α = 0.01 level for all five sample distributions.
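The quantity behind this test, the quantile at which the sample mean falls, is simple to compute. The sketch below is illustrative only and does not implement the paired t-test; the function names are hypothetical, the data are simulated rather than taken from the microarrays, and the lognormal sample is merely one assumed example of a heavy-tailed distribution.

```python
import math
import random

LIMIT = 1.0 - math.exp(-1.0)  # largest possible quantile of the mean for an EMG

def quantile_of_mean(sample):
    """Fraction of sample values that do not exceed the sample mean."""
    mean = sum(sample) / len(sample)
    return sum(1 for s in sample if s <= mean) / len(sample)

def draw_emg(n, mu, sigma, lam, rng):
    """Simulate EMG values as Gaussian plus independent exponential."""
    return [rng.gauss(mu, sigma) + rng.expovariate(lam) for _ in range(n)]

if __name__ == "__main__":
    rng = random.Random(0)
    emg = draw_emg(50000, 0.0, 1.0, 1.0, rng)
    heavy = [rng.lognormvariate(0.0, 1.5) for _ in range(50000)]
    print("limit 1 - 1/e       :", LIMIT)
    print("EMG sample          :", quantile_of_mean(emg))
    print("heavy-tailed sample :", quantile_of_mean(heavy))
```

For simulated EMG data the value stays between 0.5 and $1 - e^{-1}$, whereas a sufficiently heavy right tail pushes the mean to a higher quantile, which is the pattern reported for the PM probe data.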
Since the mean of the sample data occurs at such a large quantile it seems likely
that the best EMG fit for the data would be a distribution that is close to being
a shifted exponential distribution (small value of k). From viewing Figure 4.1 it is
clear that this sample pdf is not very similar to a shifted exponential distribution.
This result shows that it is unlikely that these sample distributions follow an EMG
distribution.
5. Fitting the Right Tail of the Perfect Match (PM) Probe Data
Given that the sample data are unlikely to follow an EMG distribution the next
question that should be asked is what distribution do these data follow? The previous
chapter showed that the right tails of the sample distributions were very different from
the right tail of an EMG distribution. From visual inspection (Figure 4.1) it appears
that the problem is due to the right tails of the sample pdf histograms being much
too “heavy.” In other words the right tails of the sample pdf histograms do not go to
zero as quickly as would be expected.
After further visual examination the right tails of the sample pdfs all seemed to
share the property that doubling the input to the sample pdf reduced the height of
the sample pdf histogram to approximately one third. Taking this observation as
an assumption the problem of determining an appropriate distribution for the right
tail of the sample data becomes the problem of finding a function with this property.
Such a function will be derived in the next section and will be shown to fit the right
tails of the sample pdf histograms closely. The derivation of this function will then
be generalized to functions of a larger class.
5.1 Derivation of Functions That Decrease by a Common Ratio
It is assumed that a function f(x) such that
\[ \frac{f(x)}{f(2x)} = 3 \tag{5.1} \]
may be an appropriate distribution for modeling the right tails of the sample pdf
histograms. In order to determine the form of f(x), several common functional forms
were assumed for f(x) and the algebra was checked to see if the final result was valid.
After several attempts it was found that by assuming $f(x) = g(x)^x$ it was possible to
determine f(x).
By using the substitution $f(x) = g(x)^x$ it follows that
\[ \frac{f(x)}{f(2x)} = \frac{g(x)^x}{g(2x)^{2x}} = 3 \tag{5.2} \]
It is possible to create a recurrence relation over some values of g(x). This is accomplished
using the following modified form of (5.2)
\begin{align*}
\frac{g(x)^x}{g(2x)^{2x}} &= 3 \\
x\log(g(x)) - 2x\log(g(2x)) &= \log(3) \\
\log\left(\frac{g(x)}{g(2x)^2}\right) &= \frac{\log(3)}{x} = \log\left(3^{x^{-1}}\right) \\
g(x) &= 3^{x^{-1}}\,g(2x)^2 \\
g(2x) &= \sqrt{g(x)\,3^{-x^{-1}}}
\end{align*}
If it is assumed that g(1) = 1 then the first six terms of the recurrence are as
follows:
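\begin{align*}
g(1) &= 1 \\
g(2) &= \sqrt{g(1)\,3^{-1}} = 3^{-\frac{1}{2}} \\
g(4) &= \sqrt{g(2)\,3^{-\frac{1}{2}}} = 3^{-\frac{1}{2}} \\
g(8) &= \sqrt{g(4)\,3^{-\frac{1}{4}}} = 3^{-\frac{3}{8}} \\
g(16) &= \sqrt{g(8)\,3^{-\frac{1}{8}}} = 3^{-\frac{4}{16}} \\
g(32) &= \sqrt{g(16)\,3^{-\frac{1}{16}}} = 3^{-\frac{5}{32}}
\end{align*}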
The last three terms in this list show that the numerator of the exponent is the log base
two of the input and the denominator of the exponent is the input. This suggests that
the function $3^{-\log_2(x)/x}$ may work for g(x) which suggests that $f(x) = g(x)^x = 3^{-\log_2(x)}$.
To verify that $f(x) = 3^{-\log_2(x)}$ has the desired property this function is tested in
(5.1).
\begin{align*}
\frac{3^{-\log_2(x)}}{3^{-\log_2(2x)}} &= r \\
\log_2(x^{-1}) - \log_2((2x)^{-1}) &= \log_3(r) \\
\log_2(2) &= \log_3(r) \\
3 &= r
\end{align*}
The last line of the algebra shows that f(x) has the desired property.
The format of this function suggests that it would be possible to generate functions
such that
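\[ \frac{f(x)}{f(\alpha x)} = \beta \]
where α, β > 1 by using the function $\beta^{-\log_\alpha(x)}$. Working out the same steps that were
performed for f(x) in the previous paragraph it follows that
\begin{align*}
\frac{\beta^{-\log_\alpha(x)}}{\beta^{-\log_\alpha(\alpha x)}} &= r \\
\log_\alpha(x^{-1}) - \log_\alpha((\alpha x)^{-1}) &= \log_\beta(r) \\
\log_\alpha(\alpha) &= \log_\beta(r) \\
\beta &= r
\end{align*}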
This algebra shows that this class of functions has the expected property.
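Both the recurrence and the closed form can be checked numerically. The following is a small illustrative sketch; the function names `common_ratio_function` and `g_recurrence` are hypothetical and introduced only for this check.

```python
import math

def common_ratio_function(x, alpha=2.0, beta=3.0):
    """f(x) = beta ** (-log_alpha(x)), which satisfies f(x) / f(alpha*x) = beta."""
    return beta ** (-math.log(x, alpha))

def g_recurrence(steps):
    """Iterate g(2x) = sqrt(g(x) * 3 ** (-1/x)) starting from g(1) = 1.

    Returns a list of (x, g(x)) pairs for x = 1, 2, 4, ...
    """
    x, g = 1.0, 1.0
    values = [(x, g)]
    for _ in range(steps):
        g = math.sqrt(g * 3.0 ** (-1.0 / x))
        x *= 2.0
        values.append((x, g))
    return values

if __name__ == "__main__":
    for x, g in g_recurrence(5):
        closed_form = 3.0 ** (-math.log2(x) / x)
        print(f"x={x:4.0f}  recurrence={g:.6f}  closed form={closed_form:.6f}")
    for x in (1.5, 10.0, 250.0):
        print(x, common_ratio_function(x) / common_ratio_function(2.0 * x))
```

Passing other values of `alpha` and `beta` exercises the general class of functions described above.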
5.2 Application of Functions that Decrease by a Common Ratio to the
Right Tail
Attempting to fit the right tails of the sample pdfs immediately yields encouraging
results. By fitting a shifted version of the function $f(x) = 3^{-\log_2(x)}$ to the right tail
of the sample pdf histogram from T01 tumor.CEL it can be seen that the shifted
version of f(x) and the right tail of the sample pdf histogram are very similar (Figure
5.1). It seems likely that the pdf for the sample data approaches a function that
decreases by a common ratio.
Figure 5.1: Plot of the right tail of the sample pdf histogram for the PM probe data from T01 tumor.CEL fitted to a shifted version of $f(x) = 3^{-\log_2(x)}$.
Comparing the right tail of an EMG pdf to f(x) shows that these two functions
are very different. Both functions are concave up and strictly decreasing; however,
the rate of decrease is very different. By definition the ratio $\frac{f(x)}{f(2x)} = 3$ is constant with
respect to x. For an EMG distribution this ratio is not constant with respect to x
and can be very different from 3 depending on the value of x. As an example when
µ = 0, σ = 1, λ = 1, and x = 10 the ratio for the right tail of an EMG distribution
is approximately 5,000. In general, the right tail of an EMG pdf converges to zero
much more quickly than does f(x).
6. Practical Implementation of EMG Parameter Estimation Method and
Properties
In section 3.3 it was shown that once the value of the variable k is known, it is
possible to estimate the parameters of an EMG distribution using two sample quan-
tiles. Using (3.6) it is possible to estimate k by replacing EMGm(k) with the sample
quantile of the sample mean. Combining these two results constitutes a parameter
estimation method which is given by
1. Estimate k with $k_e$ where $k_e$ is calculated by replacing the left hand side of the
following equation with the sample quantile of the sample mean and solving for $k_e$
\[ \mathrm{EMG}_m(k_e) = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2} - 1}\,\mathrm{erfc}\!\left(\frac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{1}{k_e\sqrt{2}}\right)\right] \]
2. Determine µ from the quantile determined by the right hand side of
\[ \mathrm{EMG}_\mu(k_e) = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2}}\,\mathrm{erfc}\!\left(\frac{k_e}{\sqrt{2}}\right)\right] \]
3. Determine $m_l = \mu + \lambda^{-1}$ from the quantile determined by the right hand side
of
\[ \mathrm{EMG}_m(k_e) = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2} - 1}\,\mathrm{erfc}\!\left(\frac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{1}{k_e\sqrt{2}}\right)\right] \]
4. Determine $m_s = \mu + \sigma$ from the quantile determined by the right hand side of
\[ \mathrm{EMG}_{c_2}(0, 1, k_e) = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2} - k_e}\,\mathrm{erfc}\!\left(\frac{k_e - 1}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{1}{\sqrt{2}}\right)\right] \]
5. Estimate λ by subtracting the estimate of µ from $m_l$ and then taking the multiplicative
inverse of the result
6. Estimate σ by subtracting the estimate of µ from $m_s$
Although this method will work in theory there are several modifications that need
to be made in order to make it practical.
It will be shown that by performing several slight modifications to the parameter
estimation procedure described in the previous paragraph a practical implementation
will result. The final implementation is consistent and always returns “valid” pa-
rameter estimates where “valid” parameter estimates are parameter estimates that
satisfy all constraints on the original parameters (such as σ > 0). This new parameter
estimation method is then compared to other parameter estimation methods for the
EMG distribution from the literature. It is found that the new parameter estimation
method has several advantages over other currently available methods.
6.1 Proof of Consistency
It is proved that the new parameter estimation method as introduced at the
beginning of this chapter is consistent. This proof will also apply to the final imple-
mentation as the modification made will in no way affect consistency.
Theorem 6.1.1. The parameter estimation method introduced is consistent.
Proof. To prove the theorem it will first be proved that the estimate for k is consistent.
Applying the same techniques used to show that k is consistent it can easily be shown
that the consistency of the parameter estimates follows from the consistency of k. To
show that k is consistent it will be shown that the sample quantile of the sample
mean is a consistent estimate for the quantile of the mean. Given the continuity of
the EMG cdf it will then follow that the estimate for k is consistent.
By definition a consistent estimator Rn of the parameter k must have the following
property
\[ \lim_{n \to \infty} \Pr(|R_n - k| > \varepsilon) = 0 \]
where ε is any real number greater than zero and n is the sample size. It is well
known that the sample mean is a consistent estimator for the mean of a random
variable with a finite mean because of the weak law of large numbers. This
shows that the sample mean will be a consistent estimator of the mean for an EMG
distribution.
If the quantiles were completely known then it would immediately follow by the
continuity of the EMG cdf that the quantile of the sample mean is a consistent esti-
mate for the quantile of the mean. In the actual estimation method sample quantiles
are being used instead of actual quantiles, however, this will not affect consistency.
The distribution of any quantile value q (where 0 < q <1) follows a binomial distribu-
tion. Given any closed interval of quantile values (which does not include zero or one)
it is clear that the variance of all the quantiles in the interval will be bounded above
by the quantile that is farthest from 0.5. Using Chebyshev’s inequality it follows that
no matter which quantile in this interval is chosen the probability that the sample
quantile will be more than ε units away from the ends of this interval approaches
zero as the sample size approaches infinity. This shows that the estimation error
introduced by using sample quantiles in place of the actual quantiles can be made
arbitrarily small with probability approaching one. Since the EMG cdf is continuous
the “small” deviations in the sample quantiles will cause “small” deviations in the
sample quantile of the sample mean.
Combining these results it is clear that the estimator for k will be consistent.
Using the same techniques it is also possible to show that the rest of the estimates
used in this method are also consistent.
6.2 Practical Considerations and Alterations
Before describing the final form of the new parameter estimation method several
practical considerations that impact the method will be discussed. The first issue is
that in order to estimate the parameters, (3.8) must be solved for k. If the estimated
value of $\mathrm{EMG}_m(k)$ is outside of the interval $(0.5, 1 - e^{-1})$ then it seems reasonable
to modify the estimate to be the nearest value within $(0.5, 1 - e^{-1})$. The problem
with this situation is that the interval is open and therefore the nearest value does
not exist. To rectify this, the estimated value of $\mathrm{EMG}_m(k)$ is rounded to the closest
sample quantile within $(0.5, 1 - e^{-1})$. In order to avoid a discontinuous estimate, this
rounding procedure is extended to any estimates of $\mathrm{EMG}_m(k)$ that are either smaller
than the smallest sample quantile in $(0.5, 1 - e^{-1})$ or larger than the largest sample
quantile in $(0.5, 1 - e^{-1})$.
For smaller sample sizes that only have one sample quantile within the interval
$(0.5, 1 - e^{-1})$ the discontinuous estimate is more natural. If the continuous version
of the estimate were used, then the estimate of k would always be the same for these
sample sizes. For sample sizes that are greater than or equal to 15, there are always
at least two sample quantiles within $(0.5, 1 - e^{-1})$ so the discontinuous estimate is an
attractive option for sample sizes less than 15.
There are a handful of sample sizes that have no sample quantile within
$(0.5, 1 - e^{-1})$. For these sample sizes the rounding procedure fails because there is no sample
quantile to round to. The sample sizes for which this situation arises are one, two,
three, four, and six. When the sample size is six it is reasonable to round to the
sample quantile from a sample size of five that is within $(0.5, 1 - e^{-1})$. Choosing an
appropriate quantile for sample sizes that are less than or equal to four is not quite
as straightforward. In this case the easiest remedy is to choose a completely arbitrary
quantile for rounding (such as the center of the interval). This will solve the problem
for all but the sample size of one which is ignored.
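The sample sizes affected can be enumerated with a short script. This sketch assumes that the sample quantile levels for a sample of size n are i/n for i = 1, ..., n; that convention is an assumption made here because it is consistent with the sample sizes listed above, and the function name is hypothetical.

```python
import math

UPPER = 1.0 - math.exp(-1.0)  # right end of the open interval (0.5, 1 - 1/e)

def levels_in_interval(n):
    """Quantile levels i/n (i = 1..n) lying strictly inside (0.5, 1 - 1/e).

    Assumes the i-th order statistic is assigned the level i/n.
    """
    return [i / n for i in range(1, n + 1) if 0.5 < i / n < UPPER]

if __name__ == "__main__":
    none_inside = [n for n in range(1, 200) if not levels_in_interval(n)]
    fewer_than_two = [n for n in range(1, 200) if len(levels_in_interval(n)) < 2]
    print("sample sizes with no level inside the interval:", none_inside)
    print("sample sizes with fewer than two levels inside:", fewer_than_two)
```

Under a different quantile convention the affected sample sizes would change, so the special cases in the method should be tied to whichever convention the implementation actually uses.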
The rounding procedure just described will not affect consistency and will not
cause the estimates to be invalid. By construction of the rounding procedure it is
clear that the estimate for k must be valid. Next the estimate of λ is investigated. To
show that the estimate for λ is always valid it is noted that the sample mean for a
sample size of two or greater will always be greater than the smallest sample value
because the EMG distribution has a continuous pdf that is positive for all real values.
Since the quantiles are estimated using linear interpolation and since the EMG cdf is
monotonically increasing, it is clear that the estimated quantile of the mean (given
by µ + λ−1) and the estimated quantile of µ will not be the same if both quantiles
are estimated using the same value of k. This shows that the estimate of λ will be
valid and therefore the estimate of σ will also be valid.
One slight modification remains to be made. The estimate for k is often not accu-
rate and so if at all possible this estimate should be avoided for estimating anything
other than a quantile. In the new estimation procedure, the estimate of k is used in
such a manner to estimate σ. An alternative approach is to estimate µ + σ and µ
from the sample quantiles. An estimate for σ can then be obtained by subtracting
these two estimates. This alternative estimation procedure may fail if the estimated
quantiles for both values are less than or equal to the smallest sample quantile. In this
case σ will be estimated to be zero and this invalid estimate can simply be discarded
and replaced with the estimate from the previous estimation method for σ. By using
the alternative estimate for σ, a slight gain in estimation accuracy can be achieved.
Several steps of the new estimation procedure require the evaluation of an erfc
term. As was discussed in section 6.4, this term can introduce numerical instability.
Although this was a major problem for the MLE parameter estimation method it is
only a minor issue for the new method. All of the erfc terms in the new method
depend on the value of k which is likely to occur in the interval [0.1, 10] for most
practical applications. If k occurs in this interval then there should be no issue with
the numerical instability of the erfc term. For applications that require values of
k that are far outside of this interval, it may be necessary to use software that can
calculate erfc to a high precision.
6.3 Summary of Final Parameter Estimation Method
A step by step procedure for estimating the parameters of an EMG distribution
from a random sample S = {s1,...,sn} using the final form of the new method is given
below for sample sizes greater than or equal to 15:
1. Calculate mq, where mq is the sample quantile of the mean of S.
2. Calculate minsq, where minsq is the smallest sample quantile which is greater
than 0.5 and less than $1 - e^{-1}$.
3. Calculate maxsq, where maxsq is the largest sample quantile which is greater
than 0.5 and less than $1 - e^{-1}$.
4. If mq ∉ [minsq, maxsq] then set mq to the closest endpoint of this interval.
5. Estimate k with $k_e$ where $k_e$ is obtained by solving
\[ \mathit{mq} = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2} - 1}\,\mathrm{erfc}\!\left(\frac{k_e^2 - 1}{k_e\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{1}{k_e\sqrt{2}}\right)\right] \]
6. Calculate µqe where
\[ \mu\mathit{qe} = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2}}\,\mathrm{erfc}\!\left(\frac{k_e}{\sqrt{2}}\right)\right] \]
7. Estimate µ as the sample quantile of µqe using linear interpolation.
8. Calculate mlqe where
\[ \mathit{mlqe} = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2} - 1}\,\mathrm{erfc}\!\left(\frac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{1}{k_e\sqrt{2}}\right)\right] \]
9. Estimate ml as the sample quantile of mlqe using linear interpolation.
10. Estimate λ with $\lambda_e$ where $\lambda_e$ is obtained by subtracting the estimate for µ from
ml and then taking the multiplicative inverse of the result.
11. Calculate msqe where
\[ \mathit{msqe} = \frac{1}{2}\left[1 - e^{\frac{k_e^2}{2} - k_e}\,\mathrm{erfc}\!\left(\frac{k_e - 1}{\sqrt{2}}\right) + \mathrm{erf}\!\left(\frac{1}{\sqrt{2}}\right)\right] \]
12. Estimate ms as the sample quantile of msqe using linear interpolation.
13. Estimate σ by subtracting the estimate for µ from ms.
14. If the estimate for σ is zero then estimate σ by $k_e$ divided by the estimate for
λ.
For sample sizes less than 15, a few slight modifications are necessary. First,
the rounding procedure must be changed to only round $m_q$ if $m_q$ occurs outside of
$(0.5, 1 - e^{-1})$. Second, the sample quantile to round to when the sample size is six
should be set to the sample quantile to round to when the sample size is five. Third,
the sample quantile to round to for sample sizes of two, three, and four should be set
to the center of $(0.5, 1 - e^{-1})$. Lastly, an error message should be returned when the
sample size is one because this case is not supported.
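The core of the procedure (steps 5 through 14) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this thesis. It assumes that the rounded mean quantile from steps 1 to 4 is already available, that the mean quantile function decreases in k on the bracket [0.01, 30] so that bisection applies, and that sample quantiles are interpolated linearly between order statistics at position q(n − 1). The function names, the bracket, and the interpolation convention are all hypothetical choices made for the sketch.

```python
import math

SQRT2 = math.sqrt(2.0)

def mean_quantile(k):
    """Steps 5 and 8: EMG cdf evaluated at the mean mu + 1/lambda, written in terms of k."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2.0 - 1.0) * math.erfc((k * k - 1.0) / (k * SQRT2))
                  + math.erf(1.0 / (k * SQRT2)))

def mu_quantile(k):
    """Step 6: EMG cdf evaluated at mu."""
    return 0.5 * (1.0 - math.exp(k * k / 2.0) * math.erfc(k / SQRT2))

def mu_plus_sigma_quantile(k):
    """Step 11: EMG cdf evaluated at mu + sigma."""
    return 0.5 * (1.0
                  - math.exp(k * k / 2.0 - k) * math.erfc((k - 1.0) / SQRT2)
                  + math.erf(1.0 / SQRT2))

def solve_k(mq, lo=0.01, hi=30.0, iters=200):
    """Step 5: bisection for k_e (assumes mean_quantile decreases on [lo, hi])."""
    if not (mean_quantile(hi) < mq < mean_quantile(lo)):
        raise ValueError("mq is outside the range reachable on the bracket")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_quantile(mid) > mq:
            lo = mid  # quantile still too large, so k must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_quantile(sorted_sample, q):
    """Linear interpolation between order statistics (assumed convention)."""
    pos = q * (len(sorted_sample) - 1)
    i = int(math.floor(pos))
    j = min(i + 1, len(sorted_sample) - 1)
    return sorted_sample[i] + (pos - i) * (sorted_sample[j] - sorted_sample[i])

def estimate_from_mq(sample, mq):
    """Steps 5-14, given an mq that has already been rounded in steps 1-4."""
    xs = sorted(sample)
    k = solve_k(mq)                                       # step 5
    mu = sample_quantile(xs, mu_quantile(k))              # steps 6-7
    ml = sample_quantile(xs, mean_quantile(k))            # steps 8-9
    lam = 1.0 / (ml - mu)                                 # step 10
    ms = sample_quantile(xs, mu_plus_sigma_quantile(k))   # steps 11-12
    sigma = ms - mu                                       # step 13
    if sigma == 0.0:                                      # step 14
        sigma = k / lam
    return mu, sigma, lam
```

Because the mean quantile flattens out near 0.5 as k grows, small changes in the input quantile translate into large changes in the solved k at the upper end of the bracket, which is one reason the bracket is kept finite in this sketch.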
6.4 Currently Available Methods
Three parameter estimation methods are considered: the maximum likelihood
estimation (MLE) method, the method introduced in [30] (henceforth referred to as
the Silver method), and the method of moments. Each method is described in detail
in the following sections.
6.5 Maximum Likelihood Estimation Method
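Given a random sample S = {s1, ..., sn}, the MLE method for an EMG distribution
returns the estimated values µe, σe, λe which maximize the following expression:
\[
L_E(S;\mu,\sigma,\lambda) = \prod_{i=1}^{n} \frac{\lambda}{2} \, e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - s_i\right)} \operatorname{erfc}\left( \left(\frac{\sigma}{\sqrt{2}}\right) \left(\lambda + \frac{\mu - s_i}{\sigma^2}\right) \right) \qquad (6.1)
\]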
This method is proven to be consistent for an EMG distribution [27].
A major problem with the MLE method is the subtractive cancellation in the
exponential term and the erfc term in (6.1). The exponential term is subject to
subtractive cancellation when
\[
s_i \approx \frac{\lambda\sigma^2}{2} + \mu
\]
and the erfc term is subject to subtractive cancellation when the input to the erfc
term becomes large and positive due to the fact that erfc(x) = 1 − erf(x) and the
fact that erf(x) ≈ 1 when x is large and positive. As an example of subtractive
cancellation, erfc(35) < $10^{-530}$, which would underflow at many machine precisions.
Due to this issue the MLE is not a viable method for estimating the parameters
of an EMG distribution in general. If even a single underflow occurs the entire MLE
expression will reduce to zero causing the method to fail.
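The failure mode described above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function names and the parameter and sample values are hypothetical, the product in (6.1) is evaluated naively in IEEE double precision, and one sample value is placed far enough to the left of µ that the erfc argument exceeds 35.

```python
import math

def emg_pdf(s, mu, sigma, lam):
    """One factor of the product in (6.1), evaluated naively."""
    expo = lam * (lam * sigma ** 2 / 2.0 + mu - s)
    arg = (sigma / math.sqrt(2.0)) * (lam + (mu - s) / sigma ** 2)
    return (lam / 2.0) * math.exp(expo) * math.erfc(arg)

def naive_likelihood(sample, mu, sigma, lam):
    """Direct product over the sample, with no protection against underflow."""
    product = 1.0
    for s in sample:
        product *= emg_pdf(s, mu, sigma, lam)
    return product

# Hypothetical trial parameters, as might be visited during an optimization.
mu, sigma, lam = 0.0, 0.1, 1.0

well_behaved = [0.5, 1.0, 2.0]
with_outlier = well_behaved + [-5.0]  # erfc argument is roughly 35.4 for s = -5

print(math.erfc(35.0))                                # underflows to 0.0 in double precision
print(naive_likelihood(well_behaved, mu, sigma, lam)) # positive
print(naive_likelihood(with_outlier, mu, sigma, lam)) # collapses to 0.0
```

A single factor that underflows is enough to make the whole product zero, regardless of how well the remaining sample values are described by the trial parameters.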
6.6 The Silver Method
The Silver method [30] circumvents the numerical stability issues by replacing the
EMG pdf (3.2) with a saddlepoint approximation [10] and then approximating the
solution of the maximum likelihood expression using the Nelder-Mead simplex
algorithm [24]. The saddlepoint approximation is a simplified approximation to a
distribution that uses information from several moments of the distribution.
Replacing the EMG pdf with this approximation makes numerical stability easier to
control, and subtractive cancellation can be avoided in the resulting approximation
to the MLE expression. The Nelder-Mead simplex algorithm is then used to
estimate the optimal value of this new expression. A software implementation of this
method is the normexp.fit function (using the saddle method option) from the limma
package version 3.6.9 for the R programming language [26].
6.7 Method of Moments
The method of moments [6] estimates the parameters of a distribution by setting
the moment equations of a random variable equal to the sample moments and then
solving the system of equations for the parameters. The number of equations needed
is equal to the number of parameters that are needed to describe the random variable.
For an EMG distribution the first three moment equations can be used to solve for
µ, σ, and λ. These three equations are given by
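\[
\begin{aligned}
m_1 &= \mu_e + \lambda_e^{-1} \\
m_2 &= \sigma_e^2 + \mu_e^2 + 2\mu_e\lambda_e^{-1} + 2\lambda_e^{-2} \\
m_3 &= 3\sigma_e^2\mu_e + 3\sigma_e^2\lambda_e^{-1} + \mu_e^3 + 3\mu_e^2\lambda_e^{-1} + 6\mu_e\lambda_e^{-2} + 6\lambda_e^{-3},
\end{aligned}
\]
where $m_1$, $m_2$, and $m_3$ denote the first three sample moments.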
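One way to solve these three equations is sketched below in Python. The sketch is an illustration, not the solver used for the results in this thesis. It rests on the observation that the system is algebraically equivalent to matching the mean µ + 1/λ, the variance σ² + 1/λ², and the third central moment 2/λ³, which gives a closed form. The function names are hypothetical, and returning `None` is simply the convention chosen here for cases where no valid real parameters exist.

```python
import math

def raw_moments(sample):
    """First three raw sample moments (divisor n)."""
    n = len(sample)
    return tuple(sum(s ** j for s in sample) / n for j in (1, 2, 3))

def solve_moment_equations(m1, m2, m3):
    """Solve the three moment equations for (mu, sigma, lambda).

    Returns None when the solution is not a valid parameter set, for example
    when the sample is not right skewed or when sigma would be imaginary.
    """
    variance = m2 - m1 ** 2                          # sigma^2 + 1/lambda^2
    third_central = m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3  # 2/lambda^3
    if third_central <= 0.0:
        return None                                  # lambda would not be positive
    lam = (2.0 / third_central) ** (1.0 / 3.0)
    sigma_squared = variance - 1.0 / lam ** 2
    if sigma_squared < 0.0:
        return None                                  # sigma would be imaginary
    return m1 - 1.0 / lam, math.sqrt(sigma_squared), lam

# Exact moments of an EMG distribution with mu = 1, sigma = 2, lambda = 2.
print(solve_moment_equations(1.5, 6.5, 22.75))

# A symmetric sample has zero third central moment, so no valid solution exists.
print(solve_moment_equations(*raw_moments([-1.0, 0.0, 1.0])))
```

The two early returns correspond to the situations in which the method of moments cannot return valid parameter values for a given sample.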
7. Comparison of Methods on Synthetic Data
To determine its usefulness the method described in the previous section was
applied to actual experimental data as well as synthetic data.
7.0.1 Synthetic Data
The new method was compared to the Silver method and the method of moments.
Synthetic data were generated for three different EMG distributions, each having
unique values of k (k = 0.1, 4, and 10). For each EMG distribution, 100 random
samples were generated for each of seven possible sample sizes. Each of the three
methods was then applied to the generated synthetic data. The average and standard
deviations of the estimates for each parameter along with the number of times that
valid parameters were not returned (denoted as “Fails”) were calculated. In order to
quickly compare results a goodness-of-fit metric is defined to show how closely the
estimates resembled the actual values. This metric is defined to be the largest of the
percent errors for the three parameters (denoted as “Error”). The results are given
in Tables 7.1 - 7.3.
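The data generation and the “Error” metric can be sketched in Python as follows. This is a hedged illustration: the sampler relies on the definition of an EMG random variable as the sum of a Gaussian and an independent exponential random variable, the function names and the seed are hypothetical, and the metric is applied to a single rounded triple of averages rather than to the unrounded values behind the tables.

```python
import random

def sample_emg(n, mu, sigma, lam, rng):
    """Draw n EMG values as Gaussian(mu, sigma) plus Exponential(rate lam)."""
    return [rng.gauss(mu, sigma) + rng.expovariate(lam) for _ in range(n)]

def max_percent_error(estimates, actual):
    """Largest percent error over the three parameters (mu, sigma, lambda)."""
    return max(abs(e - a) / abs(a) * 100.0 for e, a in zip(estimates, actual))

rng = random.Random(2011)  # arbitrary seed, fixed only for repeatability
data = sample_emg(1000, mu=1.0, sigma=2.0, lam=2.0, rng=rng)
print(sum(data) / len(data))  # should be near mu + 1/lambda = 1.5

# Rounded averages for k = 4 at N = 1000 from Table 7.1, against the true values.
print(max_percent_error((0.8, 1.9, 1.7), (1.0, 2.0, 2.0)))
```

Applied to the rounded averages, the metric gives roughly 20% here, while Table 7.1 reports 16%; a difference of that size is what rounding the averages to one decimal place would be expected to produce.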
The performance of each method over the synthetic data samples is dependent on
the value of k. For all three methods, the parameter estimates become less accurate
as k approaches the higher end of the range defined in [0.1, 10]. The accuracy of the
method of moments (when it manages to return valid parameter values) appears to
be very similar to the accuracy of the new method for values of k that are in the
higher end of [0.1, 10]. The Silver method appears to be superior to the other two
methods when k is at the lower end of [0.1, 10] and the new method appears to be
superior to the other two methods when k is at the higher end of [0.1, 10].
The results for k = 4 (Table 7.3) suggest that the Silver method is not converging
to the correct parameter estimates. The parameter estimates for a sample size of
1,000 using the Silver method are µe = 0.6, σe = 1.9, and λe = 1.2, and the standard
deviation of each estimate is less than 0.15. This implies that the estimate for λ
(which is no greater than 1.25 when rounding is taken into account) is at least five
standard deviations away from the actual value of 2. Since the standard error in the
estimates is so close to zero, it seems likely that the parameter estimates from the
Silver method are not consistent for this set of parameter values. This observation
casts doubt on the consistency of the Silver method.
A similar analysis on the estimate for µ when k = 10 (Table 7.3) shows that the
Silver method’s estimate is at least four standard deviations away from the actual
value at a sample size of 1,000. This analysis is less convincing, however, because the
standard deviation of the estimate for λ is not close to zero. To investigate
further, the test was rerun using a sample size of 10,000. At this sample size all standard
deviations were less than 0.15 and all three parameter estimates were at least seven
standard deviations away from the actual parameter values (the estimate for λ was
at least 29 standard deviations away from the actual parameter value). This result,
combined with the result from the previous paragraph, seems to imply that the Silver
method is not consistent for large values of k.
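The “at least five” and “at least four” standard deviation statements for the sample size of 1,000 follow from a worst-case reading of the rounded entries in Table 7.3, which the short Python sketch below reproduces. The function name is hypothetical, and the sketch assumes every tabulated value was rounded to one decimal place, so that the true estimate may lie up to 0.05 closer to the actual value and the true standard deviation may be up to 0.05 larger.

```python
def min_sd_distance(actual, reported_estimate, reported_sd, half_unit=0.05):
    """Lower bound on |actual - estimate| / sd given rounding to one decimal."""
    smallest_gap = max(abs(actual - reported_estimate) - half_unit, 0.0)
    largest_sd = reported_sd + half_unit
    return smallest_gap / largest_sd

# k = 4, N = 1000: lambda reported as 1.2 with Sd 0.1, actual value 2.
print(min_sd_distance(2.0, 1.2, 0.1))

# k = 10, N = 1000: mu reported as -0.5 with Sd 0.3, actual value 1.
print(min_sd_distance(1.0, -0.5, 0.3))
```

The first call evaluates (0.8 − 0.05) / 0.15, which is 5 up to floating-point rounding, and the second evaluates (1.5 − 0.05) / 0.35, which is a little above 4.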
New Method

N       Avg (µ, σ, λ)       Sd (µ, σ, λ)        Fails   Error

(µ = 1, σ = 0.1, λ = 1), k = 0.1
15      (1.1,0.30,1.6)      (0.1,0.21,0.7)      0       205%
25      (1.1,0.29,1.5)      (0.1,0.17,0.4)      0       190%
50      (1.1,0.23,1.2)      (0.1,0.13,0.2)      0       135%
100     (1.1,0.15,1.1)      (0.1,0.08,0.1)      0       53%
200     (1.0,0.15,1.1)      (0.1,0.06,0.1)      0       54%
500     (1.0,0.12,1.0)      (0.1,0.06,0.1)      0       21%
1000    (1.0,0.10,1.0)      (0.1,0.05,0.1)      0       4%

(µ = 1, σ = 2, λ = 2), k = 4
15      (-0.6,1.6,0.6)      (0.7,0.7,0.2)       0       157%
25      (-0.1,1.7,0.7)      (0.7,0.5,0.3)       0       106%
50      (0.2,1.7,0.7)       (0.4,0.4,0.2)       0       83%
100     (0.3,1.8,0.9)       (0.4,0.3,0.2)       0       67%
200     (0.6,1.9,1.1)       (0.3,0.2,0.3)       0       43%
500     (0.8,1.9,1.5)       (0.3,0.2,0.4)       0       27%
1000    (0.8,1.9,1.7)       (0.3,0.1,0.6)       0       16%

(µ = 1, σ = 5, λ = 2), k = 10
15      (-3.1,4.2,0.2)      (1.8,1.7,0.1)       0       412%
25      (-2.1,4.0,0.3)      (1.7,1.3,0.1)       0       311%
50      (-1.7,4.2,0.3)      (1.2,0.9,0.1)       0       272%
100     (-1.1,4.3,0.4)      (0.9,0.7,0.1)       0       211%
200     (-0.6,4.5,0.5)      (0.8,0.6,0.1)       0       165%
500     (-0.2,4.6,0.6)      (0.7,0.4,0.2)       0       124%
1000    (0.2,4.8,0.8)       (0.6,0.3,0.2)       0       83%
Table 7.1: Synthetic data results for the new method
Method of Moments

N       Avg (µ, σ, λ)       Sd (µ, σ, λ)        Fails   Error

(µ = 1, σ = 0.1, λ = 1), k = 0.1
15      (1.2,0.47,1.7)      (0.2,0.19,0.8)      10      366%
25      (1.2,0.46,1.5)      (0.2,0.17,0.6)      26      363%
50      (1.2,0.42,1.3)      (0.1,0.14,0.3)      20      321%
100     (1.1,0.36,1.2)      (0.1,0.12,0.2)      25      257%
200     (1.1,0.34,1.1)      (0.1,0.12,0.1)      24      239%
500     (1.1,0.28,1.1)      (0.1,0.10,0.1)      37      181%
1000    (1.0,0.25,1.1)      (0.0,0.08,0.1)      42      147%

(µ = 1, σ = 2, λ = 2), k = 4
15      (0.2,1.5,1.1)       (0.7,0.4,0.6)       36      76%
25      (0.6,1.8,1.2)       (0.5,0.3,0.8)       45      39%
50      (0.6,1.8,1.3)       (0.4,0.3,1.0)       49      41%
100     (0.6,1.8,1.4)       (0.3,0.2,0.7)       33      37%
200     (0.7,1.9,1.4)       (0.3,0.1,0.6)       46      30%
500     (0.8,1.9,1.5)       (0.2,0.1,0.5)       39      26%
1000    (0.8,1.9,1.7)       (0.2,0.1,0.7)       37      16%

(µ = 1, σ = 5, λ = 2), k = 10
15      (-1.0,3.8,0.4)      (1.4,0.9,0.2)       53      203%
25      (-0.7,4.1,0.6)      (1.2,0.8,0.5)       45      173%
50      (-0.6,4.3,0.6)      (1.1,0.5,0.4)       49      160%
100     (-0.6,4.3,0.5)      (0.9,0.5,0.2)       61      160%
200     (-0.3,4.6,0.7)      (0.7,0.3,0.5)       49      127%
500     (-0.1,4.7,0.7)      (0.5,0.2,0.5)       47      113%
1000    (0.1,4.8,0.8)       (0.5,0.2,0.4)       54      92%
Table 7.2: Synthetic data results for the method of moments
Silver Method

N       Avg (µ, σ, λ)       Sd (µ, σ, λ)        Fails   Error

(µ = 1, σ = 0.1, λ = 1), k = 0.1
15      (1.1,0.07,1.3)      (0.2,0.14,0.7)      0       35%
25      (1.0,0.05,1.1)      (0.1,0.08,0.3)      0       55%
50      (1.0,0.07,1.0)      (0.1,0.07,0.2)      0       29%
100     (1.0,0.08,1.0)      (0.0,0.05,0.1)      0       17%
200     (1.0,0.09,1.0)      (0.0,0.03,0.1)      0       9%
500     (1.0,0.09,1.0)      (0.0,0.02,0.1)      0       7%
1000    (1.0,0.09,1.0)      (0.0,0.01,0.0)      0       8%

(µ = 1, σ = 2, λ = 2), k = 4
15      (-0.3,1.0,8.5)      (1.3,0.9,21.4)      0       326%
25      (0.5,1.6,12.5)      (1.0,0.7,21.5)      0       527%
50      (0.7,1.8,9.9)       (0.7,0.4,19.0)      0       393%
100     (0.7,1.8,3.6)       (0.3,0.2,9.3)       0       82%
200     (0.6,1.9,1.2)       (0.2,0.1,0.1)       0       41%
500     (0.6,1.9,1.2)       (0.1,0.1,0.1)       0       42%
1000    (0.6,1.9,1.2)       (0.1,0.0,0.1)       0       42%

(µ = 1, σ = 5, λ = 2), k = 10
15      (-1.5,3.2,5.5)      (2.9,2.1,9.2)       0       253%
25      (-1.2,3.5,4.1)      (2.9,1.9,8.1)       0       222%
50      (-0.3,4.4,4.3)      (1.7,1.0,7.2)       0       127%
100     (-0.3,4.6,2.2)      (0.8,0.5,5.2)       0       134%
200     (-0.5,4.6,0.8)      (0.5,0.2,1.9)       0       148%
500     (-0.5,4.6,0.6)      (0.4,0.2,1.1)       0       155%
1000    (-0.5,4.6,0.6)      (0.3,0.2,0.9)       0       151%
Table 7.3: Synthetic data results for the Silver method
8. Conclusion
By using a linear reparameterization of the input to the exponentially modified
Gaussian (EMG) cumulative distribution function (cdf) several important properties
of the EMG distribution were derived. These properties showed that the product
of the exponential distribution parameter λ and the standard deviation parameter σ
of the Gaussian distribution, denoted by k = λσ, provides a large amount
of information about an EMG distribution. This term can be used, for instance, to
determine the relative “shape” of the EMG distribution, to calculate bounds on
certain quantiles, and to estimate the parameters of an EMG distribution from sample
values.
These properties were applied to a specific practical application of the EMG
distribution: Affymetrix microarray preprocessing. The robust multiarray average
(RMA) Affymetrix microarray preprocessing technique assumes that the distribution
of the perfect match (PM) probes from an Affymetrix microarray at least
approximately follows an EMG distribution. Five Affymetrix microarrays were downloaded
from a public database and the properties derived in this thesis were used to create
two tests for determining whether or not the sample data distributions were likely
to follow an EMG distribution. Both tests agreed that the sample data distributions
were not likely to follow an EMG distribution. The first test found that the sample
quantiles of the sample means were much larger than would be expected for an EMG
distribution while the second test found that the right tails of the sample distribu-
tions were much “heavier” than would be expected for an EMG distribution. Using
these results a new distribution f(x) was derived for fitting the right tail with pdf
f(x) = 3log2(x) that seemed to fit the right tail of the data reasonably well. This fitting
further challenges the assumption that the sample data follow an EMG distribution
because f(x) has a significantly “heavier” tail than does an EMG distribution.
The derived properties of the EMG distribution also revealed a new way to
estimate the parameters of an EMG distribution from sample data. After a few slight
modifications, a practical method for estimating the parameters of an EMG
distribution was created that is proven to be consistent. This new method was shown to have
distinct advantages over two other EMG parameter estimation methods, the Silver
method and the method of moments. Compared to the Silver method [30],
the new method (1) is simpler to implement, (2) is proven to be consistent (the
Silver method does not appear to be consistent when applied to synthetic data), and
(3) appears to estimate more accurately the parameters of EMG distributions with
“large” values of k. The Silver method does, however, appear to return more
accurate parameter estimates for EMG distributions with “small” values of k. Compared
to the method of moments, the new method does not have the problem of returning
imaginary parameter estimates. Overall, the new method appears to be most useful
for EMG distributions that have “large” values of k and the Silver method appears
to be most useful for EMG distributions that have “small” values of k.
By better understanding the EMG distribution it was possible to not only ade-
quately answer the practical problem being considered (the distribution of the PM
probes from the five Affymetrix microarrays) but also to gain insight into a completely
different application area (parameter estimation). Due to the nature of the derived
properties, it was further possible to show that the parameter estimation method that
was created is consistent and to show how to determine the accuracy of parameter
estimation based on the “shape” of the EMG distribution. This type of analysis
was not completed for the Silver method, most likely because the numerical
approximation techniques involved provided no actual insight into why the method may or
may not be working. From the synthetic data trials given in the original publication
of the Silver method [30] it would seem that the Silver method would likely have
no issues with consistency in practice. With the knowledge gained from the proper-
ties derived in this thesis it was possible to challenge this “reasonable” assumption.
Any application that involves the EMG distribution is likely to benefit from a better
understanding of its properties.
Appendix A. Derivation of pdf and cdf
A.1 Derivation of the Probability Density Function and the Cumulative
Distribution Function
From the definition of the EMG distribution the cumulative distribution function
can be written as:
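\[
\begin{aligned}
EMG(c;\mu,\sigma,\lambda) &= P\{E + G \le c\} \\
&= \int_{0}^{\infty} \int_{-\infty}^{c-x} \left(\lambda e^{-\lambda x}\right) \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy\,dx \qquad \text{(A.1)} \\
&= \int_{-\infty}^{c} \int_{0}^{c-y} \left(\lambda e^{-\lambda x}\right) \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dx\,dy \qquad \text{(A.2)}
\end{aligned}
\]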
Integrating (A.2) yields
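\[
\begin{aligned}
EMG(c;\mu,\sigma,\lambda) &= \int_{-\infty}^{c} \left( -e^{-\lambda(c-y)} + 1 \right) \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy \\
&= \int_{-\infty}^{c} \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy - \int_{-\infty}^{c} \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2} - \lambda(c-y)} \right) dy
\end{aligned}
\]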
The second integral can be simplified using the following integral from [15]:
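\[
\int_{0}^{\infty} e^{-\frac{x^2}{4\beta} - \gamma x}\,dx = \sqrt{\pi\beta}\, e^{\beta\gamma^2} \left( 1 - \operatorname{erf}\left(\gamma\sqrt{\beta}\right) \right) \qquad \text{(A.3)}
\]
where
\[
\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\,dt
\]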
After some algebra and simplification the cumulative distribution function reduces
to:
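\[
EMG(c;\mu,\sigma,\lambda) = \frac{1}{2} \times \left[ 1 - e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - c\right)} \operatorname{erfc}\left( \frac{\sigma}{\sqrt{2}} \left( \lambda + \frac{\mu - c}{\sigma^2} \right) \right) + \operatorname{erf}\left( \frac{1}{\sqrt{2}\sigma} (c - \mu) \right) \right]
\]
where
\[
\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_{x}^{\infty} e^{-t^2}\,dt = 1 - \operatorname{erf}(x)
\]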
Using (A.1) and the fact that the probability density function is the derivative of
the cumulative distribution function it follows that:
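\[
\begin{aligned}
emg(c;\mu,\sigma,\lambda) &= \frac{d}{dc} EMG(c;\mu,\sigma,\lambda) \\
&= \frac{d}{dc} \int_{0}^{\infty} \int_{-\infty}^{c-x} \left(\lambda e^{-\lambda x}\right) \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\mu)^2}{2\sigma^2}} \right) dy\,dx \\
&= \int_{0}^{\infty} \left(\lambda e^{-\lambda x}\right) \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(c-x-\mu)^2}{2\sigma^2}} \right) dx
\end{aligned}
\]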
After using (A.3) and doing some algebra the probability density function simplifies
to:
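\[
emg(c;\mu,\sigma,\lambda) = \frac{\lambda}{2} \, e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - c\right)} \operatorname{erfc}\left( \left(\frac{\sigma}{\sqrt{2}}\right) \left( \lambda + \frac{\mu - c}{\sigma^2} \right) \right)
\]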
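As a numerical sanity check on the closed forms in this appendix, the following Python sketch compares a central finite difference of the cdf with the pdf. It is an illustration only: the function names, the step size, and the test parameters (µ = 1, σ = 2, λ = 2) are choices made for the sketch, and the pdf is written with the factor σ/√2 inside the erfc term, in agreement with (6.1) and with the cdf above.

```python
import math

def emg_cdf(c, mu, sigma, lam):
    """Closed-form EMG cumulative distribution function."""
    expo = lam * (lam * sigma ** 2 / 2.0 + mu - c)
    arg = (sigma / math.sqrt(2.0)) * (lam + (mu - c) / sigma ** 2)
    return 0.5 * (1.0
                  - math.exp(expo) * math.erfc(arg)
                  + math.erf((c - mu) / (math.sqrt(2.0) * sigma)))

def emg_pdf(c, mu, sigma, lam):
    """Closed-form EMG probability density function."""
    expo = lam * (lam * sigma ** 2 / 2.0 + mu - c)
    arg = (sigma / math.sqrt(2.0)) * (lam + (mu - c) / sigma ** 2)
    return (lam / 2.0) * math.exp(expo) * math.erfc(arg)

def central_difference(f, c, h=1e-5):
    """Symmetric finite-difference approximation to f'(c)."""
    return (f(c + h) - f(c - h)) / (2.0 * h)

mu, sigma, lam = 1.0, 2.0, 2.0  # illustrative values with k = 4
for c in (-2.0, 0.0, 1.5, 4.0):
    slope = central_difference(lambda x: emg_cdf(x, mu, sigma, lam), c)
    print(c, slope, emg_pdf(c, mu, sigma, lam))
```

For moderate values of c the two columns agree to many digits. Far into either tail the naive evaluation of the exponential and erfc terms can lose accuracy or overflow, so the check is only meaningful away from those extremes.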
Bibliography
[1] Affymetrix Website. GeneChip Overview: Activity #2 - Structure & Function of GeneChip Microarrays. http://media.affymetrix.com/about affymetrix/outreach/lesson plan/...downloads/student manual activities/activity2/activity2 structure function.pdf Accessed on: March 19th, 2011.
[2] Alberts, B. et al. (2002). Molecular Biology of the Cell (4th ed). Garland Science; New York, New York.
[3] Barber, W.E. and Carr, P.W. (1981). Graphical method for obtaining retention time and number of theoretical plates from tailed chromatographic peaks. Anal. Chem. 53 1939–1942.
[4] Bioconductor. affy [Software Package]: methods for affymetrix oligonucleotide arrays. http://www.bioconductor.org/help/bioc-views/2.5/bioc/html/affy.html Accessed on: November 29th, 2010.
[5] Bishop, J. et al. (2008). Kinetics of Multiplex Hybridization: Mechanisms and Implications. Biophysical Journal 94 1726–1734.
[6] Breiman, L. (1973). Statistics: With a View Towards Applications. Houghton Mifflin Company; Boston, MA.
[7] Broad Institute Website. Cancer Program Data Sets. http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi Accessed on: November 29th, 2010.
[8] Chagovetz, A. and Blair, S. (2009). Real-time DNA microarrays: reality check. Biochemical Society Transactions 37(Pt 2) 471–475.
[9] Columbia University. DNA Microarrays in Health Care and Drug Discovery. http://www.columbia.edu/ bo8/undergraduate research/projects/...sahil mehta project/work.htm Accessed on: March 16th, 2011.
[10] Daniels, H.E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25 631–650.
[11] Felinger, A. (2010). Estimation of chromatographic peak shape parameters in Fourier domain. Talanta doi:10.1016/j.talanta.2010.10.001
[12] Foley, J.P. (1987). Equations for chromatographic peak modeling and calculation of peak area. Anal. Chem. 59 1984–1987.
[13] GeceMexico. gecemexico.com Accessed on: March 15th, 2011.
[14] Golubev, A. (2010). Exponentially modified Gaussian (EMG) relevance to distributions related to cell proliferation and differentiation. Journal of Theoretical Biology 262(2) 257–266.
[15] Gradshteyn, I.S. and Ryzhik, I.M. (1980). Table of Integrals, Series and Products: Corrected and Enlarged Edition (2nd ed). Academic Press; Orlando, Florida.
[16] Gunzert-Marx, K. et al. (2008). Secondary beam fragments produced by 200 MeV u$^{-1}$ $^{12}$C ions in water and their dose contributions in carbon ion radiotherapy. New Journal of Physics 10 1–21.
[17] Howerton, S.B. and McGuffin, V.L. (2003). Thermodynamic and kinetic characterization of polycyclic aromatic hydrocarbons in reversed-phase liquid chromatography. Anal. Chem. 75 3539–3548.
[18] Irizarry, R.A. et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2) 249–264.
[19] Irizarry, R.A. et al. (2006). Comparison of Affymetrix Genechip expression. Bioinformatics 22(7) 789–794.
[20] Kong, H. et al. (2005). Deconvolution of overlapped peaks based on the exponentially modified Gaussian model in comprehensive two-dimensional gas chromatography. Journal of Chromatography A. 1086 160–164.
[21] Li, S. et al. (2008). A competitive hybridization model predicts probe signal intensity on high density DNA microarrays. Nucleic Acids Research 36(20) 6585–6591.
[22] McGee, M. and Chen, Z. (2006). Parameter estimation for the exponential-normal convolution model for background correction of Affymetrix GeneChip data. Statistical Applications in Genetics and Molecular Biology 5 Article 24.
[23] Naish, P.J. and Hartwell, S. (1988). Exponentially modified Gaussian functions - a good model for chromatographic peaks in isocratic HPLC? Chromatographia 26 285–296.
[24] Nelder, J.A. and Mead, R. (1965). A simplex algorithm for function minimization. Computer Journal 7 308–313.
[25] News Medical (2011). What is Gene Expression? http://www.news-medical.net/health/What-is-Gene-Expression.aspx Accessed on: March 15th, 2011.
[26] R Development Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org Accessed on: March 1st, 2011.
[27] Roussas, G. (2003). Introduction to Probability and Statistical Inference. Academic Press; Orlando, Florida.
[28] Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons; New York, New York.
[29] Shao, X. et al. (2004). Extraction of mass spectra and chromatographic profiles from overlapping GC/MS signal with background. Anal. Chem. 76 5143–5148.
[30] Silver, J. et al. (2009). Microarray background correction: maximum likelihood estimation for the normal-exponential convolution model. Biostatistics 10(2) 352–363.
[31] Singh, D. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203–209.
[32] Steffen, B. et al. (2005). A new mathematical procedure to evaluate peaks in complex chromatograms. Journal of Chromatography A. 1071 239–246.
[33] Suzuki, S. et al. (2007). Experimental optimization of probe length to increase the sequence specificity of high-density oligonucleotide microarrays. BMC Genomics 8:373.
[34] Therneau, T.M. and Ballman, K.V. (2008). What Does PLIER Really Do? Cancer Informatics 6 423–431.
[35] Vikalo, H. et al. (2008). Modeling and Estimation for Real-Time Microarrays. IEEE Journal of Selected Topics in Signal Processing 2(3) 286–296.
[36] Walsh, S. and Diamond, D. (1995). Non-linear curve fitting using microsoft excel solver. Talanta 42(4) 561–572.
[37] Zakharkin, S. et al. (2005). Sources of variation in Affymetrix microarray experiments. BMC Bioinformatics 6:214.