Page 1:

ConSil

Jeff Chase
Duke University

Page 2:

Collaborators

• Justin Moore – received PhD in April, en route to Google.
  – Did this research.
  – Wrote this paper.
  – Named the system.
    • Something to do with “Get Smart” (?)
  – Did not send me slides…
• Partha Ranganathan (HP) has led this work.

Page 3:

Context: Dynamic Thermal Management for Data Centers

[Figure: data center thermal map showing CRAC units, rack temperatures, a temperature scale (°C), and heat build-ups]

Page 4:

Goals

• ConSil is part of a larger system to analyze data center thermals and manage heat proactively.
  – Temperature-aware workload placement
  – “Smart cooling”
• Preliminary conclusion: it is practical to reduce total energy by about 15% under “typical” conditions.
  – Your mileage may vary.
• Other goals:
  – Reduce capital cost with a “common case” cooling system.
    • Allow the cluster to “burst”, but stop short of meltdown.
  – Improve long-term reliability and availability
  – Better data center design

Page 5:

“Green” Workload Placement

Making Scheduling "Cool": Temperature-Aware Resource Assignment in Data Centers by Justin Moore, J. Chase, P. Ranganathan, and R. Sharma. In the 2005 USENIX Annual Technical Conference, April 2005.

Place workload intelligently to promote an even temperature distribution, given the “thermal topology” of the data center.

Page 6:

The Subproblem that ConSil Solves

• How hot is point (x, y, z) in your data center?
  – Placement policies need a thermal map
• Option 1: install new instrumentation
  – Tradeoff: $$$ vs. granularity
• Option 2: use built-in sensors
  – But: how to derive the inlet temperatures?
• If we can do that, then we can obtain a precise and accurate thermal map with low instrumentation cost.

Page 7:

Thermal Instrumentation

[Figure: server airflow with inlet heat (Q_inlet), internal heat sources (Q_workload), and onboard temperature sensors (Q_observed)]

Observed: Q_observed = f(Q_inlet, Q_workload)
Learn:    Q_inlet = g(Q_observed, Q_workload)

Page 8:
Page 9:
Page 10:

ConSil in Context

[Figure: workload measures feeding the ConSil model]

Page 11:

Learning a Model

• Learn a statistical model for Y from m samples of the attributes X1..Xn (array layout sketched below):

  Samples | X1   X2   ..  Xn  | Y
  s1      | s11  s12  ..  s1n | Y1
  s2      | s21  s22  ..  s2n | Y2
  ..      | ..   ..   ..  ..  | ..
  sm      | sm1  sm2  ..  smn | Ym
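As a purely illustrative picture of that layout in C (the counts below are placeholder assumptions, not values from the paper):

/* Illustrative layout only: the sample and attribute counts are placeholders. */
#define M_SAMPLES    1000   /* m: number of samples    */
#define N_ATTRIBUTES 10     /* n: number of attributes */

/* X[i][j] holds attribute j of sample i (s_ij in the table above);
 * Y[i] holds the target value Y_i that the statistical model learns. */
static float X[M_SAMPLES][N_ATTRIBUTES];
static float Y[M_SAMPLES];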

Page 12:

First Cut: Neural Nets

• Infer ambient temperature from an input sample:
  – Last N workload measure samples (epoch E)
  – Internal temperature sensor readings
• Use the off-the-shelf FANN library
• Some static (SWAG) structural choices (sketched below):
  – Four layers of neurons
    • Input → hidden → hidden → output
  – Neurons use the FANN sigmoid activation function
  – Train the net using FANN back-propagation to set the input weights on each neuron.
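A minimal sketch of those choices against the real FANN C API follows. The FANN calls are genuine library functions, but the input width, file name, and error threshold are illustrative assumptions; the 10^5 iteration cap and the 2x-input hidden-layer width come from the appendix notes at the end of this transcript.

#include <fann.h>

int main(void)
{
    /* Input width is a placeholder: B workload epochs + C sensor epochs. */
    const unsigned int num_input  = 10;
    const unsigned int num_hidden = 2 * num_input;   /* appendix: 2x input width */
    const unsigned int num_output = 1;                /* inferred inlet temperature */

    /* Four layers: input -> hidden -> hidden -> output. */
    struct fann *ann = fann_create_standard(4, num_input, num_hidden,
                                            num_hidden, num_output);

    /* Sigmoid neurons throughout (targets assumed normalized to [0, 1]). */
    fann_set_activation_function_hidden(ann, FANN_SIGMOID);
    fann_set_activation_function_output(ann, FANN_SIGMOID);

    /* Plain back-propagation rather than FANN's default RPROP. */
    fann_set_training_algorithm(ann, FANN_TRAIN_INCREMENTAL);

    /* "consil_train.data" is a hypothetical FANN-format training file. */
    fann_train_on_file(ann, "consil_train.data",
                       100000,   /* max iterations (10^5, per the appendix) */
                       1000,     /* report every 1000 epochs */
                       0.001f);  /* desired mean squared error (placeholder) */

    fann_save(ann, "consil.net");
    fann_destroy(ann);
    return 0;
}

At inference time, fann_run() would map a fresh sample of recent workload and sensor readings to an inlet-temperature estimate.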

Page 13:

Experiments with ConSil

• Collected data for 12 servers in a data center.
• Picked servers whose inlet temperatures are known
  – i.e., they have a sensor near them
• 45 hours of data collected under active/varying load
  – Two server models (HP DL360 G3, Dell 1425)
  – CPU data: 1-second granularity
  – Temperature data: 5- or 30-second granularity
• CPU utilization only
  – The CPU accounts for about 80% of power (225 of 275 watts peak)
• 266 lines of FANN code

Page 14:

Methodology

• Five-fold cross-validation (FFCV); see the sketch below
  – Divide the observations into fifths
  – Train on one fifth, test on the other four
  – Repeat for each fifth
  – Compute the sum of squared errors (SSE)
• Output: CDFs of errors
• Sensitivity study
  – Training time
  – Accuracy
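A hedged sketch of that procedure, again with FANN: only the structure (train on one fifth, test on the other four, accumulate SSE) follows the slide; the fold file names, network dimensions, and training parameters are assumptions.

#include <stdio.h>
#include <fann.h>

#define NUM_FOLDS 5

int main(void)
{
    for (unsigned int fold = 0; fold < NUM_FOLDS; fold++) {
        char path[64];

        /* Train on one fifth (assumes the data was pre-split into fold files). */
        snprintf(path, sizeof(path), "consil_fold%u.data", fold);
        struct fann *ann = fann_create_standard(4, 10, 20, 20, 1);
        fann_train_on_file(ann, path, 100000, 1000, 0.001f);

        /* Test on the other four fifths, accumulating squared error. */
        double sse = 0.0;
        for (unsigned int other = 0; other < NUM_FOLDS; other++) {
            if (other == fold)
                continue;
            snprintf(path, sizeof(path), "consil_fold%u.data", other);
            struct fann_train_data *test = fann_read_train_from_file(path);
            for (unsigned int i = 0; i < fann_length_train_data(test); i++) {
                fann_type *out = fann_run(ann, test->input[i]);
                double err = (double)out[0] - (double)test->output[i][0];
                sse += err * err;
            }
            fann_destroy_train(test);
        }

        printf("fold %u: SSE = %f\n", fold, sse);
        fann_destroy(ann);
    }
    return 0;
}

The per-sample absolute errors collected in the same loop would feed the error CDFs reported on the following slides.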

Page 15:

ConSil: Accuracy

• Accurate inference using workload and onboard data
  – 75% of inferred values are within 1°C of the actual value

Page 16:
Page 17:

Sensitivity

• Time-to-train
  – Most significant factor: the FFCV sub-experiment
  – Training time is highly data-dependent
  – Epoch length
  – Number of sensor/workload epochs
• Accuracy (SSE)
  – Most significant factor: the FFCV sub-experiment
  – Indicates not enough variation in behavior
  – Coarser granularity (more history) improves accuracy

Page 18:

ConSil in Context

[Figure: workload measures feeding the ConSil model]

Page 19:

Predicting Thermal Effects

• Model the relationship using machine learning (input/output layout sketched below)
  – Inputs: workload data, AC settings, fan speeds
  – Output: predicted thermal map
  – Learns from observations during normal operation
  – FANN neural net library
  – Active “burn in” may speed learning

Weatherman: Automated, Online, and Predictive Thermal Mapping and Management for Data Centers by Justin Moore, J. Chase, and P. Ranganathan. Third IEEE International Conference on Autonomic Computing, June 2006.
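The slide names the model's input and output categories but not their sizes; the layout below is a hypothetical illustration of how such an input vector and output map might be organized (all counts are placeholders).

/* Hypothetical sizes for a Weatherman-style input vector and output map;
 * the categories come from the slide, the numbers do not. */
#define NUM_SERVERS     100   /* per-server workload measures */
#define NUM_CRAC_UNITS    4   /* air-conditioner settings     */
#define NUM_FANS          8   /* fan speeds                   */
#define NUM_MAP_POINTS  500   /* predicted thermal-map points */

#define WM_NUM_INPUTS   (NUM_SERVERS + NUM_CRAC_UNITS + NUM_FANS)
#define WM_NUM_OUTPUTS  (NUM_MAP_POINTS)

/* One training sample: observed conditions in, measured thermal map out. */
struct wm_sample {
    float inputs[WM_NUM_INPUTS];
    float thermal_map[WM_NUM_OUTPUTS];
};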

Page 20:

Weatherman: Accuracy

• Accurate inferences using workload and AC data
  – Data from validated Flovent CFD models
  – 92% of predicted values are within 1.0°C of the actual value

Page 21:

Summary/Conclusion

• Machine learning is a useful tool for “autonomic” self-optimization.
  – Sense and respond
  – Optimizing control loops based on learned models
• Neural nets don’t always suck.
  – Initial results suggest they work well here.
  – Maybe we can do better.
• Need good baseline datasets for training/validation.
  – Variance
  – History

Page 22:

Why “ConSil”?

• Cone of Silence
  – “Mask out” unwanted signals

Page 23:

http://www.cs.duke.edu/~chase

Page 24:

The maximum number of training iterations was set to 10^5. Each neural net contained one input layer, one output layer, and two hidden layers; each hidden layer contained twice as many neurons as the input layer. By varying the number of recent epochs used as input, we vary the number of workload epochs (parameter B) and internal sensor epochs (parameter C) independently. Using general full factorial design analysis, we can identify which parameters have a significant effect when changed, and for which parameters we can simply select a “reasonable” value.
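A sketch of that sweep under stated assumptions: the ranges for B and C, the per-combination file names, and the error threshold are placeholders; only the layer-sizing rule (two hidden layers, each twice the input width) and the 10^5 iteration cap come from the text above.

#include <stdio.h>
#include <fann.h>

/* Build a net for a given (B, C) combination, assuming one input value per
 * workload epoch and one per sensor epoch. */
static struct fann *build_net(unsigned int b_workload, unsigned int c_sensor)
{
    unsigned int num_input  = b_workload + c_sensor;
    unsigned int num_hidden = 2 * num_input;   /* hidden layers: 2x input width */
    return fann_create_standard(4, num_input, num_hidden, num_hidden, 1);
}

int main(void)
{
    /* Placeholder ranges for the full factorial sweep over B and C. */
    for (unsigned int b = 1; b <= 4; b++) {
        for (unsigned int c = 1; c <= 4; c++) {
            char path[64];
            /* Assumes a training file prepared for each (B, C) combination. */
            snprintf(path, sizeof(path), "consil_B%u_C%u.data", b, c);

            struct fann *ann = build_net(b, c);
            fann_train_on_file(ann, path, 100000, 1000, 0.001f);
            /* ...record time-to-train and test SSE for this (B, C) cell... */
            fann_destroy(ann);
        }
    }
    return 0;
}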

