Computation Solutions
Traditional High-Performance Computer (HPC)
  Pro: Node-node communication
  Con: 20x to 200x the cost of other solutions
Traditional Cluster Computer
  Pro: Less expensive to maintain & upgrade
  Con: Requires significant infrastructure
Internet Grid Computer
  Pro: Massive power on demand
  Con: Less adequate for massive data
Enterprise Grid Computer
  Pro: Harnesses existing infrastructure
  Con: Limited power
Total Annual Spending on HPC
[Chart: annual HPC spending, 1995–2003, in millions of dollars ($0 to $18,000); series: Purchases of New HPC vs. Spending on Existing HPC]
Price per GF for HPC-Generated Computation
Worldwide (excluding Departmental class)
Installed base of HPC 400,000 GF*
Annual cost of installed base $16.2 billion
Average annual cost per GF $41,000
*1 gigaflops = 1 billion calculations per second ~ 5 GHz
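The average cost per GF is just the annual cost of the installed base divided by its size; a quick arithmetic check of the figures above:

```python
# Sanity check: average annual cost per GF of the worldwide HPC installed base.
installed_base_gf = 400_000      # installed base, in gigaflops
annual_cost_usd = 16.2e9         # annual cost of the installed base

cost_per_gf = annual_cost_usd / installed_base_gf
print(f"${cost_per_gf:,.0f} per GF per year")   # $40,500, quoted as ~$41,000
```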
Cluster Costs
20-Node Cluster Computer
Purchase price per node $1,100
Effective life 3 years
Weight (including racks) 1 ton
Power consumption 8 kilowatts
Required air conditioning 1 ton
Required space 16 square feet
Computational power 12 GF
Price per GF for Cluster-Generated Computation
Annual Cost of a Cluster (per GF)
  Nodes (amortized purchase price)           $763 / year
  Additional hardware (amortized)            $275 / year
  Software (amortized)                       $69 / year
  Hardware service contracts                 $69 / year
  Electricity                                $466 / year
  Space                                      $33 / year
  Installation & configuration (amortized)   $125 / year
  Labor                                      $533 / year
Total ≈ $2,300 per GF per year
Annual cost of a cluster > 125% of the (unamortized) purchase price of the nodes
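The line items above can be tallied directly (they sum to $2,333, which the slide rounds to $2,300); a minimal check:

```python
# Itemized annual cost of cluster-generated computation, per GF, as on the slide.
annual_cost_per_gf = {
    "nodes (amortized purchase price)": 763,
    "additional hardware (amortized)": 275,
    "software (amortized)": 69,
    "hardware service contracts": 69,
    "electricity": 466,
    "space": 33,
    "installation & configuration (amortized)": 125,
    "labor": 533,
}
total = sum(annual_cost_per_gf.values())
print(f"total = ${total:,} per GF per year")    # $2,333, quoted as ~$2,300
```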
Price Comparison
Annual Cost per GF
Traditional HPC $41,000
Traditional Cluster $2,300
Internet Grid $300*
*Assumes ½ availability at $100 per year.
What can a researcher do with cheap computation?
Exhaustive Regression
(in the literature, "all subsets regression")
Goal: Examine all combinations of factors for a significant effect on an outcome variable, evaluating each combination on its ability to predict that variable.
Scope: With K factors, there are 2^K possible factor combinations.
There are statistical issues associated with performing data searches in this manner. But, in the absence of a theoretical model, the alternative is to do nothing.
Potential factors: Rock Pyrolysis, Organic Mass Spectrometry, Vitrinite Reflectance
Outcome variable: Presence of Natural Gas
Factor Combinations
Example: Examine all combinations of three factors that might predict presence of natural gas.
Combination #1: Rock Pyrolysis, Organic Mass Spectrometry, Vitrinite Reflectance
Combination #2: Rock Pyrolysis, Organic Mass Spectrometry
Combination #3: Rock Pyrolysis, Vitrinite Reflectance
Combination #4: Organic Mass Spectrometry, Vitrinite Reflectance
Combination #5: Rock Pyrolysis
Combination #6: Organic Mass Spectrometry
Combination #7: Vitrinite Reflectance
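The seven combinations above are just the non-empty subsets of the three factors; a minimal sketch of enumerating them with Python's itertools (for K factors this generates the 2^K − 1 non-empty combinations):

```python
from itertools import combinations

factors = ["Rock Pyrolysis", "Organic Mass Spectrometry", "Vitrinite Reflectance"]

# All non-empty factor combinations, largest first (2**3 - 1 = 7 of them).
combos = [
    list(subset)
    for r in range(len(factors), 0, -1)
    for subset in combinations(factors, r)
]
for i, combo in enumerate(combos, 1):
    print(f"Combination #{i}: {', '.join(combo)}")
```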
As the number of possible factors grows, the number of models in the search space rises exponentially.
[Chart: number of models in the search space (0 to 1,200,000,000) vs. number of factors (25 to 30)]
Factor Combinations
Time required to exhaust all factor combinations with 40 factors when a single PC can compute 1,000 models per second. (40 factors implies over 1 trillion possible models: 2^40 ≈ 1.1 trillion.)
Factor Combinations
Procedure    One PC             10,000-node Grid    100,000-node Grid
OLS ER       35 years           2 days              5 hours
LOGIT ER     Several centuries  10 days             1 day
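The "One PC" column follows directly from the stated throughput; a quick check (the grid lines below assume perfect scaling, so the slide's figures, which are larger, presumably allow for overhead and availability):

```python
# Time to fit all 2**40 models at 1,000 models per second.
models = 2 ** 40                    # over 1.1 trillion models for 40 factors
rate = 1_000                        # models per second on one PC

seconds = models / rate
years = seconds / (365 * 24 * 3600)
print(f"one PC: ~{years:.0f} years")            # ~35 years

# Idealized (perfect-scaling) grid times:
for nodes in (10_000, 100_000):
    print(f"{nodes:,}-node grid: ~{seconds / nodes / 3600:.0f} hours")
```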
Typically, researchers would use “stepwise procedures” to avoid having to compute all 1 trillion models.
Search Space
Each square represents one combination of factors (a “model”).
The 144 squares shown here correspond (approximately) to the 2^7 = 128 possible models that can be constructed using just 7 factors.
[Figure: search-space grid, each square shaded by model quality: Bad, Poor, Good, Better, Best]
Stepwise Procedures
Stepwise methods pick a single model as a starting point and follow an "improvement path" to a local optimum.
[Figure: starting at the model marked x, stepwise follows an improvement path and finds the locally optimal model marked *]
Stepwise Procedures
In this example, depending on where it begins its search, stepwise could return any one of these four models.
[Figure: four starting points (x), each leading stepwise to a different local optimum (*)]
Stepwise Procedures
Stepwise methods would not reveal:
1. That there are four locally optimal models.
2. That there are five models that are as good as the local optima but are not themselves locally optimal.
3. That there are nine models that are ranked "Good" or "Best."
4. Commonalities among the more preferred models.
5. Commonalities among the less preferred models.
Exhaustive regression looks at all the models in the search space (within either an OLS or LOGIT framework) and:
1. Returns results from all models, or
2. Returns only results from models that contain no insignificant parameter estimates, and/or
3. Returns only models that satisfy a specified minimum goodness of fit.
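A minimal sketch of exhaustive (all-subsets) OLS regression with a significance filter, assuming a numeric design matrix `X` and outcome `y`; the function name, the t-threshold, and the use of plain NumPy are illustrative choices, not the original implementation:

```python
from itertools import combinations
import numpy as np

def exhaustive_ols(X, y, names, t_crit=2.0):
    """Fit OLS on every non-empty subset of columns; keep models in which
    every parameter estimate is significant (|t| >= t_crit)."""
    n = len(y)
    results = []
    for r in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), r):
            Xk = np.column_stack([np.ones(n), X[:, list(cols)]])  # add intercept
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            resid = y - Xk @ beta
            dof = n - Xk.shape[1]
            if dof <= 0:
                continue
            s2 = resid @ resid / dof                              # residual variance
            se = np.sqrt(s2 * np.diag(np.linalg.inv(Xk.T @ Xk)))
            t = beta / se
            if np.all(np.abs(t[1:]) >= t_crit):                   # ignore intercept
                results.append(([names[c] for c in cols], beta, se))
    return results

# Toy data: y depends on the first factor only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + rng.normal(size=200)
models = exhaustive_ols(X, y, ["x1", "x2", "x3", "x4"])
print(len(models), "models with all parameters significant")
```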
Exhaustive Regression
Each row corresponds to one of the 2^K models; the factor list gives the factors that appear in that model.

Factor List       N    UnRestr ln L  Restr ln L  MSPE     Parameter Estimates (X[1], X[2], X[3])  Standard Errors (X[1], X[2], X[3])
[1,2,15,17,19]    316  -78.402       -89.839     0.07390  1.2763                                  0.4347
[1,15,16,24,26]   315  -79.193       -89.753     0.07491  -0.0148                                 0.0046
[2,15,17,19]      316  -78.948       -89.839     0.07520  1.2664                                  0.4372
[1,3,11,15,16]    313  -76.069       -89.580     0.07520  -0.0143  1.1939                         0.0046  0.4290
[1,2,3,16,24,26]  315  -77.098       -89.753     0.07522  -0.0145  1.1939  1.3143                 0.0046  0.4290  0.4201
Exhaustive Regression
Models can be evaluated via:
1. Multiple correlation,
2. k-fold cross-validation mean squared prediction error, or
3. Other methods.
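The second criterion can be sketched in a few lines; a minimal k-fold cross-validation MSPE for one OLS model, assuming plain NumPy (the fold count and toy data are illustrative):

```python
import numpy as np

def kfold_mspe(X, y, k=5):
    """Mean squared prediction error of OLS under k-fold cross-validation."""
    n = len(y)
    idx = np.arange(n)
    for fold_idx in [np.array_split(idx, k)]:
        pass
    folds = np.array_split(idx, k)
    sq_errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                       # hold one fold out
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[fold] @ beta
        sq_errors.extend((y[fold] - pred) ** 2)
    return float(np.mean(sq_errors))

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(120), rng.normal(size=(120, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + 0.1 * rng.normal(size=120)
print(f"5-fold MSPE: {kfold_mspe(X, y):.4f}")
```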
Exhaustive Regression
Proposed “other method”: Cross-model stability measure
Assuming:
1. The list of potential factors does not exclude any factors that determine the outcome variable, and
2. The pair-wise between-factor correlations are randomly distributed…
…the expected values, across models, of parameter estimates will equal the values of the parameters.
Exhaustive Regression
For included factors (X_1), excluded factors (X_3), and extraneous factors (X_2), the expected value of the mean, across models, of the parameter estimate vector is:

E[ (1/2^K) Σ_{k=1}^{2^K} b_{1|k} ] = β_1 + (1/2^K) Σ_{k=1}^{2^K} ( X'_{1|k} (I − P_{2|k}) X_{1|k} )^{-1} X'_{1|k} (I − P_{2|k}) X_{3|k} β_3

where P_{2|k} = X_{2|k} ( X'_{2|k} X_{2|k} )^{-1} X'_{2|k} and k indexes the k-th factor combination.
Exhaustive Regression
In preliminary Monte Carlo experiments in which there are three "true" factors (among a set of up to 12 factors), the cross-model procedure correctly identifies:
1. All three of the "true" factors 85% of the time, and
2. Two of the three "true" factors 100% of the time.
(moderate-to-low correlated data sets; average "true" R² = 0.43)
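A sketch of the cross-model idea behind those experiments: under the stated assumptions, averaging each factor's estimate across all models that include it should recover the true parameters, so true factors separate from noise factors by the magnitude of that cross-model mean. The setup below (8 candidate factors, 3 true ones, plain NumPy OLS) is an illustrative reconstruction, not the original procedure:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(42)
K, n = 8, 300                        # 8 candidate factors, 3 of them "true"
X = rng.normal(size=(n, K))
y = X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(size=n)

# Collect each factor's OLS estimate from every model that includes it.
estimates = {j: [] for j in range(K)}
for r in range(1, K + 1):
    for cols in combinations(range(K), r):
        beta, *_ = np.linalg.lstsq(X[:, list(cols)], y, rcond=None)
        for j, b in zip(cols, beta):
            estimates[j].append(b)

# Cross-model mean of each factor's estimates; true factors stand out.
cross_model_mean = {j: float(np.mean(v)) for j, v in estimates.items()}
top3 = sorted(cross_model_mean, key=lambda j: abs(cross_model_mean[j]), reverse=True)[:3]
print("most stable factors:", sorted(top3))   # the true factors: [0, 1, 2]
```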
Problem: Amarillo Biosciences collected patient data from a phase II clinical study. Repeated statistical analyses of their experimental drug yielded no conclusive evidence for or against the drug's efficacy.
With patient data comprising 36 factors, there were almost 69 billion possible ways to model the data.
Exhaustive Regression: Case Study
Case Study: Amarillo Biosciences
Solution: Looking at all 69 billion models, Exhaustive Regression revealed…
• 250 models in which all factors were statistically significant,
• 8 models that were superior (by stepwise criteria) to the single model found by stepwise methods,
• 15 factors that were more stable (w.r.t. the cross-model criterion) than were the factors that stepwise methods selected,
• 42 factors that did not appear in any of the 250 significant models.