Computation Solutions
Traditional High-Performance Computer (HPC)
  Pro: Node-node communication
  Con: 20x to 200x the cost of other solutions
Traditional Cluster Computer
  Pro: Less expensive to maintain & upgrade
  Con: Requires significant infrastructure
Internet Grid Computer
  Pro: Massive power on demand
  Con: Less adequate for massive data
Enterprise Grid Computer
  Pro: Harnesses existing infrastructure
  Con: Limited power
Total Annual Spending on HPC
[Chart: annual HPC spending, 1995–2003, in millions of dollars ($0 to $18,000); series: Purchases of New HPC vs. Spending on Existing HPC]
Price per GF for HPC-Generated Computation
Worldwide (excluding Departmental class)
Installed base of HPC 400,000 GF*
Annual cost of installed base $16.2 billion
Average annual cost per GF $41,000
*1 gigaflops = 1 billion calculations per second ~ 5 GHz
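The average cost per GF is just the annual cost of the installed base divided by its size; a quick arithmetic check of the figures above:

```python
# Sanity check: average annual cost per GF of the worldwide HPC installed base.
installed_base_gf = 400_000      # installed base, in gigaflops
annual_cost_usd = 16.2e9         # annual cost of the installed base

cost_per_gf = annual_cost_usd / installed_base_gf
print(f"${cost_per_gf:,.0f} per GF per year")   # $40,500, quoted as ~$41,000
```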
Cluster Costs
20-Node Cluster Computer
Purchase price per node $1,100
Effective life 3 years
Weight (including racks) 1 ton
Power consumption 8 kilowatts
Required air conditioning 1 ton
Required space 16 square feet
Computational power 12 GF
Price per GF for Cluster-Generated Computation
Annual Cost of a Cluster (per GF)
  Nodes (amortized purchase price)           $763 / year
  Additional hardware (amortized)            $275 / year
  Software (amortized)                       $69 / year
  Hardware service contracts                 $69 / year
  Electricity                                $466 / year
  Space                                      $33 / year
  Installation & configuration (amortized)   $125 / year
  Labor                                      $533 / year
Total ≈ $2,300 per GF per year
Annual cost of a cluster > 125% of the (unamortized) purchase price of the nodes
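The line items above can be tallied directly (they sum to $2,333, which the slide rounds to $2,300); a minimal check:

```python
# Itemized annual cost of cluster-generated computation, per GF, as on the slide.
annual_cost_per_gf = {
    "nodes (amortized purchase price)": 763,
    "additional hardware (amortized)": 275,
    "software (amortized)": 69,
    "hardware service contracts": 69,
    "electricity": 466,
    "space": 33,
    "installation & configuration (amortized)": 125,
    "labor": 533,
}
total = sum(annual_cost_per_gf.values())
print(f"total = ${total:,} per GF per year")    # $2,333, quoted as ~$2,300
```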
Price Comparison
Annual Cost per GF
Traditional HPC $41,000
Traditional Cluster $2,300
Internet Grid $300*
*Assumes ½ availability at $100 per year.
What can a researcher do with cheap computation?
Exhaustive Regression
(in the literature, "all subsets regression")
Goal: Examine all combinations of factors for a significant effect on an outcome variable, evaluating each combination on its ability to predict that variable.
Scope: With K factors, there are 2^K possible factor combinations.
There are statistical issues associated with performing data searches in this manner. But, in the absence of a theoretical model, the alternative is to do nothing.
Potential factors: Rock Pyrolysis, Organic Mass Spectrometry, Vitrinite Reflectance
Outcome variable: Presence of Natural Gas
Factor Combinations
Example: Examine all combinations of three factors that might predict presence of natural gas.
Combination #1: Rock Pyrolysis, Organic Mass Spectrometry, Vitrinite Reflectance
Combination #2: Rock Pyrolysis, Organic Mass Spectrometry
Combination #3: Rock Pyrolysis, Vitrinite Reflectance
Combination #4: Organic Mass Spectrometry, Vitrinite Reflectance
Combination #5: Rock Pyrolysis
Combination #6: Organic Mass Spectrometry
Combination #7: Vitrinite Reflectance
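The seven combinations above are just the non-empty subsets of the three factors; a minimal sketch of enumerating them with Python's itertools (for K factors this generates the 2^K − 1 non-empty combinations):

```python
from itertools import combinations

factors = ["Rock Pyrolysis", "Organic Mass Spectrometry", "Vitrinite Reflectance"]

# All non-empty factor combinations, largest first (2**3 - 1 = 7 of them).
combos = [
    list(subset)
    for r in range(len(factors), 0, -1)
    for subset in combinations(factors, r)
]
for i, combo in enumerate(combos, 1):
    print(f"Combination #{i}: {', '.join(combo)}")
```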
As the number of possible factors grows, the number of models in the search space rises exponentially.
[Chart: number of models in the search space (0 to 1,200,000,000) vs. number of factors (25 to 30)]
Factor Combinations
Time required to exhaust all factor combinations with 40 factors when a single PC can compute 1,000 models per second. (40 factors implies over 1 trillion possible models: 2^40 ≈ 1.1 trillion.)
Factor Combinations
Procedure    One PC             10,000-node Grid    100,000-node Grid
OLS ER       35 years           2 days              5 hours
LOGIT ER     Several centuries  10 days             1 day
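The "One PC" column follows directly from the stated throughput; a quick check (the grid lines below assume perfect scaling, so the slide's figures, which are larger, presumably allow for overhead and availability):

```python
# Time to fit all 2**40 models at 1,000 models per second.
models = 2 ** 40                    # over 1.1 trillion models for 40 factors
rate = 1_000                        # models per second on one PC

seconds = models / rate
years = seconds / (365 * 24 * 3600)
print(f"one PC: ~{years:.0f} years")            # ~35 years

# Idealized (perfect-scaling) grid times:
for nodes in (10_000, 100_000):
    print(f"{nodes:,}-node grid: ~{seconds / nodes / 3600:.0f} hours")
```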
Typically, researchers would use “stepwise procedures” to avoid having to compute all 1 trillion models.
Search Space
Each square represents one combination of factors (a “model”).
The 144 squares shown here correspond (approximately) to the 2^7 = 128 possible models that can be constructed using just 7 factors.
[Figure: search-space grid, each square shaded by model quality: Bad, Poor, Good, Better, Best]
Stepwise Procedures
Stepwise methods pick a single model as a starting point and follow an "improvement path" to a local optimum.
[Figure: starting at the model marked x, stepwise follows an improvement path and finds the locally optimal model marked *]
Stepwise Procedures
In this example, depending on where it begins its search, stepwise could return any one of these four models.
[Figure: four starting points (x), each leading stepwise to a different local optimum (*)]
Stepwise Procedures
Stepwise methods would not reveal:
1. That there are four locally optimal models.
2. That there are five models that are as good as the local optima but are not themselves locally optimal.
3. That there are nine models that are ranked "Good" or "Best."
4. Commonalities among the more preferred models.
5. Commonalities among the less preferred models.
Exhaustive regression looks at all the models in the search space (within either an OLS or LOGIT framework) and:
1. Returns results from all models, or
2. Returns only results from models that contain no insignificant parameter estimates, and/or
3. Returns only models that satisfy a specified minimum goodness of fit.
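A minimal sketch of exhaustive (all-subsets) OLS regression with a significance filter, assuming a numeric design matrix `X` and outcome `y`; the function name, the t-threshold, and the use of plain NumPy are illustrative choices, not the original implementation:

```python
from itertools import combinations
import numpy as np

def exhaustive_ols(X, y, names, t_crit=2.0):
    """Fit OLS on every non-empty subset of columns; keep models in which
    every parameter estimate is significant (|t| >= t_crit)."""
    n = len(y)
    results = []
    for r in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), r):
            Xk = np.column_stack([np.ones(n), X[:, list(cols)]])  # add intercept
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            resid = y - Xk @ beta
            dof = n - Xk.shape[1]
            if dof <= 0:
                continue
            s2 = resid @ resid / dof                              # residual variance
            se = np.sqrt(s2 * np.diag(np.linalg.inv(Xk.T @ Xk)))
            t = beta / se
            if np.all(np.abs(t[1:]) >= t_crit):                   # ignore intercept
                results.append(([names[c] for c in cols], beta, se))
    return results

# Toy data: y depends on the first factor only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(size=200)
models = exhaustive_ols(X, y, ["x1", "x2", "x3", "x4"])
print(len(models), "models with all parameters significant")
```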
Exhaustive Regression
Each row corresponds to one of the 2^K models; the factor list gives the factors that appear in that model.

Factor List       N    UnRestr ln L  Restr ln L  MSPE     Parameter Estimates (X[1], X[2], X[3])  Standard Errors (X[1], X[2], X[3])
[1,2,15,17,19]    316  -78.402       -89.839     0.07390  1.2763                                  0.4347
[1,15,16,24,26]   315  -79.193       -89.753     0.07491  -0.0148                                 0.0046
[2,15,17,19]      316  -78.948       -89.839     0.07520  1.2664                                  0.4372
[1,3,11,15,16]    313  -76.069       -89.580     0.07520  -0.0143  1.1939                         0.0046  0.4290
[1,2,3,16,24,26]  315  -77.098       -89.753     0.07522  -0.0145  1.1939  1.3143                 0.0046  0.4290  0.4201
Exhaustive Regression
Models can be evaluated via:
1. Multiple correlation,
2. k-fold cross-validation mean squared prediction error, or
3. Other methods.
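The second criterion can be sketched in a few lines; a minimal k-fold cross-validation MSPE for one OLS model, assuming plain NumPy (the fold count and toy data are illustrative):

```python
import numpy as np

def kfold_mspe(X, y, k=5):
    """Mean squared prediction error of OLS under k-fold cross-validation."""
    n = len(y)
    idx = np.arange(n)
    for fold_idx in [np.array_split(idx, k)]:
        pass
    folds = np.array_split(idx, k)
    sq_errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                       # hold one fold out
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[fold] @ beta
        sq_errors.extend((y[fold] - pred) ** 2)
    return float(np.mean(sq_errors))

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(120), rng.normal(size=(120, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + 0.1 * rng.normal(size=120)
print(f"5-fold MSPE: {kfold_mspe(X, y):.4f}")
```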
Exhaustive Regression
Proposed “other method”: Cross-model stability measure
Assuming:
1. The list of potential factors does not exclude any factors that determine the outcome variable, and
2. The pair-wise between-factor correlations are randomly distributed…
…the expected values, across models, of parameter estimates will equal the values of the parameters.
Exhaustive Regression
For included factors (X_1), excluded factors (X_3), and extraneous factors (X_2), the expected value of the mean, across models, of the parameter estimate vector is:

E[ (1/2^K) Σ_{k=1}^{2^K} b_{1|k} ] = β_1 + (1/2^K) Σ_{k=1}^{2^K} ( X'_{1|k} (I − P_{2|k}) X_{1|k} )^{-1} X'_{1|k} (I − P_{2|k}) X_{3|k} β_3

where P_{2|k} = X_{2|k} ( X'_{2|k} X_{2|k} )^{-1} X'_{2|k} and k indexes the k-th factor combination.
Exhaustive Regression
In preliminary Monte Carlo experiments in which there are three "true" factors (among a set of up to 12 factors), the cross-model procedure correctly identifies:
1. All three of the "true" factors 85% of the time, and
2. Two of the three "true" factors 100% of the time.
(moderate-to-low correlated data sets; average "true" R² = 0.43)
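A sketch of the cross-model idea behind those experiments: under the stated assumptions, averaging each factor's estimate across all models that include it should recover the true parameters, so true factors separate from noise factors by the magnitude of that cross-model mean. The setup below (8 candidate factors, 3 true ones, plain NumPy OLS) is an illustrative reconstruction, not the original procedure:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(42)
K, n = 8, 300                        # 8 candidate factors, 3 of them "true"
X = rng.normal(size=(n, K))
y = X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=n)

# Collect each factor's OLS estimate from every model that includes it.
estimates = {j: [] for j in range(K)}
for r in range(1, K + 1):
    for cols in combinations(range(K), r):
        beta, *_ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
        for j, b in zip(cols, beta):
            estimates[j].append(b)

# Cross-model mean of each factor's estimates; true factors stand out.
cross_model_mean = {j: float(np.mean(v)) for j, v in estimates.items()}
top3 = sorted(cross_model_mean, key=lambda j: abs(cross_model_mean[j]), reverse=True)[:3]
print("most stable factors:", sorted(top3))   # the true factors: [0, 1, 2]
```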
Problem: Amarillo Biosciences collected patient data from a phase II clinical study. Repeated statistical analyses of their experimental drug yielded no conclusive evidence for or against the drug's efficacy.
With patient data comprising 36 factors, there were almost 69 billion possible ways to model the data.
Exhaustive Regression: Case Study
Case Study: Amarillo Biosciences
Solution: Looking at all 69 billion models, Exhaustive Regression revealed…
• 250 models in which all factors were statistically significant,
• 8 models that were superior (by stepwise criteria) to the single model found by stepwise methods,
• 15 factors that were more stable (w.r.t. the cross-model criterion) than were the factors that stepwise methods selected,
• 42 factors that did not appear in any of the 250 significant models.