
Application-Aware Management of Parallel Simulation Collections

Siu-Man Yau ([email protected]), New York University
Steven G. Parker ([email protected]), University of Utah
Kostadin Damevski ([email protected]), University of Utah
Vijay Karamcheti ([email protected]), New York University
Denis Zorin ([email protected]), New York University

Multi-Experiment Studies

• Computational studies require multiple runs of simulation software

Multi-Experiment Studies

• Existing (batch-based) systems treat each execution as a ‘black box’:
  – Issue one simulation at a time
• Application-aware system:
  – Schedules the collection of simulations as a whole
  – Uses application-specific knowledge for scheduling and resource allocation decisions
• Application-awareness brings a 4x improvement in response time

Outline

• Example MES: Helium Model Validation

• Evaluation platform: SimX System

• Application-specific considerations
  – Parallel overhead, sampling, result reuse, malleability

• Application-Driven Scheduling and Resource Allocation Strategies

• Conclusion

Helium Model Validation

• Gas mixing model for fire simulation

• “Knobs” on the model:
  – Prandtl number
  – Smagorinsky constant
  – Grid resolution
  – Inlet velocity
  – etc.

• To validate: compare against a real-life experiment

Helium Model Validation

• Measure velocity profile from real-life experiment

• Pick two “knobs”:
  – Prandtl number
  – Inlet velocity

• Run simulated experiments

• Find the combination that matches the measured profiles at both heights
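
To make the comparison concrete, here is a minimal sketch of scoring one simulated experiment against the measured velocity profiles at the two heights. The RMS metric, the function name rmsMismatch, and the sample values are illustrative assumptions, not data or methodology taken from the study.

    #include <cmath>
    #include <iostream>
    #include <vector>

    // Illustrative only: score how closely a simulated velocity profile
    // matches the measured one at a given height via an RMS mismatch.
    double rmsMismatch(const std::vector<double>& simulated,
                       const std::vector<double>& measured) {
        double sum = 0.0;
        for (std::size_t i = 0; i < simulated.size(); ++i) {
            double d = simulated[i] - measured[i];
            sum += d * d;
        }
        return std::sqrt(sum / simulated.size());
    }

    int main() {
        // One experiment yields two objectives, one per measurement height.
        std::vector<double> simLow   = {0.10, 0.31, 0.52};
        std::vector<double> measLow  = {0.12, 0.30, 0.50};
        std::vector<double> simHigh  = {0.08, 0.27, 0.45};
        std::vector<double> measHigh = {0.09, 0.29, 0.44};
        std::cout << "mismatch at lower height:  " << rmsMismatch(simLow, measLow)   << '\n'
                  << "mismatch at higher height: " << rmsMismatch(simHigh, measHigh) << '\n';
    }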

Helium Model Validation

• Pareto Frontier: the set of inputs for which no other input is at least as good in every objective and strictly better in at least one
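
A minimal sketch of extracting the Pareto frontier from a set of completed experiments, assuming two mismatch objectives (one per measurement height). The Run fields and the sample numbers are illustrative, not the study's data.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    // One completed experiment: its two "knob" settings and its mismatch
    // against the measured profile at each of the two heights.
    struct Run {
        double prandtl, inletVelocity;   // inputs
        double errLow, errHigh;          // objectives (lower is better)
    };

    // A run is Pareto-optimal if no other run is at least as good in both
    // objectives and strictly better in at least one.
    std::vector<Run> paretoFrontier(const std::vector<Run>& runs) {
        std::vector<Run> frontier;
        for (const Run& a : runs) {
            bool dominated = std::any_of(runs.begin(), runs.end(), [&](const Run& b) {
                return b.errLow <= a.errLow && b.errHigh <= a.errHigh &&
                       (b.errLow < a.errLow || b.errHigh < a.errHigh);
            });
            if (!dominated) frontier.push_back(a);
        }
        return frontier;
    }

    int main() {
        std::vector<Run> runs = {{0.7, 1.0, 0.10, 0.40},
                                 {0.9, 1.2, 0.20, 0.20},
                                 {1.1, 1.4, 0.30, 0.35}};
        for (const Run& r : paretoFrontier(runs))
            std::cout << "Prandtl " << r.prandtl
                      << ", inlet velocity " << r.inletVelocity << '\n';
    }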

Evaluation platform: SimX

• System support for Interactive Multi-Experiment Studies (SIMECS)

• View computational study as a whole

• For parallel, distributed clusters
  – Workers (simulation code & evaluation code)
  – Manager (UI, Sampler, Resource Allocator)
  – Spatially-Indexed Shared Object Layer (SISOL)

Evaluation platform: SimX

[Architecture diagram: the Front-end Manager Process (User Interface: Visualisation & Interaction, Sampler, Resource Allocator) connects through the SISOL API to the SISOL Server Pool (Directory Server and Data Servers) and to the Worker Process Pool (Task Queue, FUEL Interface, Simulation code, Evaluation code).]

Application-Awareness

• Decision: how many processes for each task?
• Application-specific considerations:
  – Minimizing parallelization overhead favours many concurrent tasks at low parallelism
  – Sampling strategy (task dependencies) favours serial tasks at high parallelism
  – Reuse opportunities (maximizing “reusable” work) favour serial tasks at high parallelism
  – Malleability: claim idle resources when beneficial
• These considerations work against each other

Application-awareness

• Naïve approach: assign one worker per task
  – Eliminates per-task parallelization overhead
  – Does not maximize reuse and sampling efficiency
  – Leaves leftover “holes” of idle workers
• Naïve approach: assign one task at a time to all workers
  – Maximizes reuse potential and sampling efficiency
  – But also maximizes parallelization overhead
• Application-aware approach: batching
  – Groups of tasks are allowed to execute concurrently

Solution: Application-awareness

[Architecture diagram as above, annotated with the batching and reconfiguration API: TaskQueue::AddTask(Experiment), TaskQueue::CreateBatch(set<Experiment>&), TaskQueue::GetIdealGroupSize(), and Reconfigure(const int* assignment) on the worker-side SimulationContainer.]
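
Below is a hedged sketch of how an application-level allocator might drive the batching API named above. Only the call names come from the slide; the Experiment type, the stub bodies, and the fixed group size are illustrative assumptions, and the real SimX signatures may differ.

    #include <iostream>
    #include <set>
    #include <vector>

    // Hypothetical Experiment type; the real SimX type is not shown here.
    struct Experiment {
        double prandtl, inletVelocity;
        bool operator<(const Experiment& o) const {
            return prandtl != o.prandtl ? prandtl < o.prandtl
                                        : inletVelocity < o.inletVelocity;
        }
    };

    // Stubbed TaskQueue mirroring the call names on the slide.
    class TaskQueue {
    public:
        void AddTask(const Experiment& e) { pending_.push_back(e); }          // sampler enqueues work
        void CreateBatch(std::set<Experiment>& b) { batches_.push_back(b); }  // allocator groups it
        int  GetIdealGroupSize() const { return 4; }                          // placeholder policy

        std::vector<Experiment> pending_;
        std::vector<std::set<Experiment>> batches_;
    };

    int main() {
        TaskQueue q;
        q.AddTask({0.7, 1.0});
        q.AddTask({0.9, 1.2});

        // Everything currently pending is independent, so batch it together.
        std::set<Experiment> batch(q.pending_.begin(), q.pending_.end());
        q.CreateBatch(batch);
        std::cout << "batched " << batch.size() << " experiments, "
                  << q.GetIdealGroupSize() << " workers each\n";
    }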

Naïve Approach

• Response time = 12 hr 35 min

[Execution chart; annotation marks the idle workers.]

Batch for Sampling

• Identify independent experiments in the sampler
• Maximize parallelism while still allowing active sampling

[Sequence of parameter-space plots (Prandtl number vs. inlet velocity): first batch → 1st Pareto-optimal point found; second batch → 1st & 2nd; third batch → 1st to 3rd; fourth batch → full Pareto frontier.]
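
The sketch below illustrates the batching idea generically: experiments whose prerequisites (the results they are sampled around) have already finished form the next batch, so independent experiments run concurrently while dependent ones wait. The layering scheme and the dependency map are illustrative assumptions, not the actual SimX sampler.

    #include <iostream>
    #include <map>
    #include <set>
    #include <vector>

    using Id = int;

    // Group experiments into batches: an experiment joins a batch once all
    // experiments it depends on have completed in earlier batches.
    std::vector<std::set<Id>> makeBatches(std::map<Id, std::set<Id>> deps) {
        std::vector<std::set<Id>> batches;
        std::set<Id> done;
        while (!deps.empty()) {
            std::set<Id> batch;
            for (const auto& [id, pre] : deps) {
                bool ready = true;
                for (Id p : pre) ready = ready && done.count(p) > 0;
                if (ready) batch.insert(id);
            }
            if (batch.empty()) break;   // guard against cyclic dependencies
            for (Id id : batch) { deps.erase(id); done.insert(id); }
            batches.push_back(batch);
        }
        return batches;
    }

    int main() {
        // Experiments 1-3 are independent; 4 and 5 are sampled around their results.
        std::map<Id, std::set<Id>> deps = {{1, {}}, {2, {}}, {3, {}}, {4, {1, 2}}, {5, {3}}};
        for (const auto& batch : makeBatches(deps)) {
            for (Id id : batch) std::cout << id << ' ';
            std::cout << '\n';
        }
    }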

Batch for Sampling

• Response time = 6 hr 10 min

[Execution chart: 1st, 2nd, 3rd, and 4th batches.]

Batch for Result Reuse

• Sub-divide each batch into 2 smaller batches:
  – 1st sub-batch: the first experiment in each reuse class; no two belong to the same reuse class
  – No two concurrent from-scratch experiments can reuse each other’s checkpoints (maximizes reuse potential)
  – Experiments in the same batch have comparable run times (reduces holes)

[Parameter-space plot: Prandtl number vs. inlet velocity.]
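
A minimal sketch of the sub-batching rule described above: one representative per reuse class goes into the first sub-batch (running from scratch and writing a checkpoint), and the remaining experiments go into the second sub-batch, where each can restart from its class representative's checkpoint. The Experiment fields and reuse-class ids are illustrative assumptions.

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    struct Experiment { std::string name; int reuseClass; };

    // Split one batch into two sub-batches following the rule on the slide:
    // the first sub-batch holds at most one experiment per reuse class.
    void splitByReuse(const std::vector<Experiment>& batch,
                      std::vector<Experiment>& fromScratch,
                      std::vector<Experiment>& reusing) {
        std::set<int> seen;
        for (const Experiment& e : batch) {
            if (seen.insert(e.reuseClass).second) fromScratch.push_back(e);
            else                                  reusing.push_back(e);
        }
    }

    int main() {
        std::vector<Experiment> batch = {{"A", 0}, {"B", 0}, {"C", 1}, {"D", 1}, {"E", 2}};
        std::vector<Experiment> fromScratch, reusing;
        splitByReuse(batch, fromScratch, reusing);
        std::cout << "sub-batch 1 (from scratch): ";
        for (const Experiment& e : fromScratch) std::cout << e.name << ' ';
        std::cout << "\nsub-batch 2 (reuse checkpoints): ";
        for (const Experiment& e : reusing) std::cout << e.name << ' ';
        std::cout << '\n';
    }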

Batch for Result Reuse

• Total time: 5 hr 10 min

[Execution chart: 1st through 6th batches.]

Preemption

• Helium code is malleable:
  – A checkpointed run can be restarted on a different number of workers
• Preemption system:
  – Manager stores a database of idle workers in SISOL
  – Workers use application knowledge to determine whether to claim idle workers
  – Manager creates a new worker group by adding the idle workers to the group
  – Manager restarts the simulation on the new group
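
As a sketch of the "application knowledge" step, the code below estimates whether claiming advertised idle workers pays off: restart only if the expected saving from a larger worker group exceeds the cost of checkpointing and restarting. The registry type, the near-linear-scaling assumption, and the numbers are illustrative assumptions; the real decision sits behind the SISOL and Resource Allocator APIs.

    #include <iostream>
    #include <vector>

    // Stand-in for the SISOL-backed database of idle workers.
    struct IdleRegistry {
        std::vector<int> idleWorkers;
    };

    // Application knowledge (assumed here): the code scales roughly linearly,
    // so estimate the new runtime and compare the saving against the cost of
    // checkpointing and restarting on the larger group.
    bool shouldClaim(double remainingHours, int currentWorkers, int idleCount,
                     double restartCostHours) {
        if (idleCount == 0) return false;
        double newHours = remainingHours * currentWorkers / (currentWorkers + idleCount);
        return (remainingHours - newHours) > restartCostHours;
    }

    int main() {
        IdleRegistry registry{{7, 8, 9}};          // three idle workers advertised
        double remainingHours = 2.0, restartCostHours = 0.25;
        int myWorkers = 3;
        if (shouldClaim(remainingHours, myWorkers,
                        static_cast<int>(registry.idleWorkers.size()), restartCostHours))
            std::cout << "claim idle workers and restart from checkpoint\n";
        else
            std::cout << "keep the current worker group\n";
    }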

Preemption

• Total time: 4 hr 30 min

[Execution chart: 1st through 6th batches, with preemption.]

Evaluation: Resource Allocation

Knowledge used             Total time     Utilization rate   Avg. time per run   Improvement
None (run on 1 worker)     12 hr 35 min   56.3%              6 hr 17 min         N/A
None (run 1 experiment)    20 hr 35 min   100%               34.3 min            N/A
+ Active Sampling          6 hr 10 min    71.1%              63.4 min            51% / 70%
+ Reuse classes            5 hr 10 min    71.3%              39.7 min            59% / 75%
+ Preemption               4 hr 30 min    91.8%              34.5 min            64% / 78%

(Improvement is relative to the two baselines: the 1-worker-per-task run and the 1-experiment-at-a-time run, respectively.)

Conclusion

• Application-awareness yields more than a 4x improvement in response time (relative to the one-experiment-at-a-time baseline)

• Conclusions:
  – Viewing the study from the application level is important
  – Domain knowledge is important
  – System API and infrastructure to exploit domain knowledge are important
    • Task Queue API for batching
    • SISOL & Resource Allocator API for pre-emption

