Options for Parallelizing CASPER for a Mesh Processor Brad Clement, Tara Estlin, Ben Bornstein Jet Propulsion Laboratory California Institute of Technology Copyright 2011 California Institute of Technology. Government sponsorship acknowledged.
Transcript
Page 1:

Options for Parallelizing CASPER for a Mesh Processor

Brad Clement, Tara Estlin, Ben Bornstein

Jet Propulsion Laboratory
California Institute of Technology

Copyright 2011 California Institute of Technology. Government sponsorship acknowledged.

Page 2:

Outline

• Project context
• CASPER
• Some example designs
• Parallelization design choices
• Our design

Page 3:

Improving Long-Range Rover Science Using Multi-Core Computing

Objectives:
• Develop and demonstrate key capabilities for rover traverse science using multi-core computing
• Adapt three autonomous science technologies to a state-of-the-art (SOA) multi-core system
– rock finder (Rockster)
– texture analysis
– continual replanning (CASPER)
• Demonstrate with rover hardware and measure performance benefits using metrics such as execution time and data processed

(Image: GNU Free Documentation License)

Page 4:

CASPER – Continuous Activity Scheduling Planning Execution and Replanning

• CASPER uses a model of [spacecraft] activities to construct a [mission] plan to achieve [mission] goals while respecting [spacecraft operations] constraints
– Example goals: science requests, downlink requests, maneuver requests
– Example constraints: limited memory, power, propellant
• Has autonomously commanded EO-1 for the past 7 years
• Automated sequence planning/generation for Orbital Express
• DSN resource allocation
• Modified Antarctic Mapping Mission (MAMM)
• 60+ models

Page 5:

Schedule Database (SDB)

[Diagram: the SDB holds activities, resource and state timelines, and conflicts.]

Page 6:

[Diagram: CASPER architecture — goals and state updates flow into the SDB; the Repairer reads conflicts from the SDB and performs schedule operations; commands are sent to the Commandable for execution.]

Page 7:

CASPER Cycle

1. Commandable updates timelines for state updates
2. propagate timelines and constraint networks
3. detail activities for new goals
4. check for flaws
5. repair/optimize for a flaw
6. send commands to the Commandable for execution
7. repeat
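The seven steps above can be sketched as a control loop. This is a minimal, hypothetical C sketch: the `Sdb` struct, `casper_cycle`, and every stub body are illustrative stand-ins, not CASPER's actual API; each iteration resolves at most one flaw, as in iterative repair.

```c
#include <assert.h>

/* Hypothetical sketch of the CASPER cycle over a toy flaw count.
   The real system manipulates a schedule database; here each step
   is a stub so the control flow of the cycle stays visible. */

typedef struct {
    int flaws;          /* unresolved conflicts in the schedule */
    int pending_goals;  /* goals not yet detailed into activities */
    int commands_sent;
} Sdb;

static void apply_state_updates(Sdb *s) { (void)s; }        /* step 1 */
static void propagate(Sdb *s)           { (void)s; }        /* step 2 */
static void detail_new_goals(Sdb *s) {                      /* step 3 */
    s->flaws += s->pending_goals;  /* new goals introduce new flaws */
    s->pending_goals = 0;
}
static int  check_flaws(const Sdb *s)   { return s->flaws; } /* step 4 */
static void repair_one_flaw(Sdb *s) {                       /* step 5 */
    if (s->flaws > 0) s->flaws--;
}
static void send_commands(Sdb *s)  { s->commands_sent++; }  /* step 6 */

/* Run the cycle until no flaws remain; returns iterations used. */
int casper_cycle(Sdb *s, int max_iters) {
    int i;
    for (i = 0; i < max_iters; i++) {                       /* step 7 */
        apply_state_updates(s);
        propagate(s);
        detail_new_goals(s);
        if (check_flaws(s) == 0 && s->pending_goals == 0) break;
        repair_one_flaw(s);
        send_commands(s);
    }
    return i;
}
```

The point of the sketch is only the shape of the loop: continual replanning means this cycle never terminates in flight; `max_iters` stands in for "forever".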

Page 8:

CASPER bottlenecks: planning data → planning functions

• plans/schedules → identify flaws; add/delete search states
• activities → add, delete, constrain, move, detail, abstract
– parameters, possible values → get, set, choose value
– parameter dependencies → evaluate dependency functions, propagate values, check staleness
– parameter/state constraints → find valid values, detect violations, propagate constraints
– reservations → apply to state/resource timelines
– temporal constraints → add, remove
• valid time intervals/orderings → compute valid time intervals/orderings
• state/resource variables (timelines) → compute valid time intervals, identify conflicts
– values → compute, propagate, get contributing activities
• constraint rules → identify conflicts
– conflicts → choose conflict, choose resolution method (e.g. move)
• preference/optimization criteria → compute scores, identify deficiencies
– scores, deficiencies → choose preference to improve

Page 9:

tile/mesh processor

[Diagram: Tilera TILE64™, an 8 x 8 grid of cores; each tile contains a processor, cache, and switch; RAM controllers sit at the edges of the mesh.]

Page 10:

Tilera TILE64™ memory access (Ungar and Adams, 2009; GNU Free Documentation License):

location/event | cache size | penalty (cycles) | line size | MIPS
best | – | 0 | – | 600-1800
branch mis-predict | – | 2 | – | 250
L1 | 8KB I, 8KB D | 2 | 16B | 250
L2 | 64KB | 35-49 | 64B | 80
L3 | 4MB | 8 | 64B | 20
RAM | 4GB | 69-88 | – | 10

Page 11:

|| stochastic search, star, copied memory

[Diagram: a master (cpu0) sends an updated schedule to workers (cpu1, cpu2, …, cpu8, cpu9, cpu10); each worker returns its score and best schedule.]
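One way to read the star design above: each worker searches a private copy of the schedule with its own random seed, and the master reduces the workers' best scores. A toy Pthreads sketch under those assumptions — `score`, `stochastic_search`, and `parallel_search` are hypothetical stand-ins for CASPER's evaluation and search, the "schedule" is just an array, and at most 8 workers are supported:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Toy star-topology stochastic search with copied memory. */
#define SCHED_LEN 16

static int score(const int *sched) {   /* toy objective: lower is better */
    int s = 0;
    for (int i = 0; i < SCHED_LEN; i++) s += sched[i];
    return s;
}

typedef struct {
    int sched[SCHED_LEN];  /* worker's private copy of the schedule */
    unsigned seed;         /* per-worker random seed */
    int best;              /* best score this worker found */
} Worker;

static void *stochastic_search(void *arg) {
    Worker *w = arg;
    w->best = score(w->sched);
    for (int step = 0; step < 1000; step++) {
        int i = (int)(rand_r(&w->seed) % SCHED_LEN);
        int old = w->sched[i];
        w->sched[i] = (int)(rand_r(&w->seed) % 100); /* random move */
        int s = score(w->sched);
        if (s < w->best) w->best = s;                /* keep improvement */
        else w->sched[i] = old;                      /* undo otherwise */
    }
    return NULL;
}

/* Master: copy the schedule to n <= 8 workers, run them, reduce best. */
int parallel_search(const int *sched, int n) {
    Worker w[8];
    pthread_t t[8];
    int best = score(sched);
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < SCHED_LEN; i++) w[k].sched[i] = sched[i];
        w[k].seed = (unsigned)(k + 1);   /* different seed per core */
        pthread_create(&t[k], NULL, stochastic_search, &w[k]);
    }
    for (int k = 0; k < n; k++) {
        pthread_join(t[k], NULL);
        if (w[k].best < best) best = w[k].best;
    }
    return best;
}
```

Copied memory keeps the workers embarrassingly parallel — the only communication is the seed out and the score back, which is why this design maps cleanly onto independent cores.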

Page 12:

evolutionary search

[Diagram: same star of cpus; workers receive mutated/crossed schedules with updates and return score and best schedule.]

Page 13:

memory || by time, master-slave, chain

[Diagram: one SDB partitioned by time across a chain of cpus; the master exchanges valid intervals and conflicts, while activity Δs and propagation pass between neighbors along the chain.]

Page 14:

memory || by time, peer-to-peer

[Diagram: the SDB is partitioned by time into segments 0-4 spread across peer cpus; peers exchange activity Δs and propagation directly with each other.]

Page 15:

memory || by activity/timeline type, master – slave, star

[Diagram: the SDB is partitioned by activity/timeline type into SDB0-SDB4 on slave cpus; the master sends activity Δs and reservations and receives valid intervals and conflicts.]

Page 16:

memory || by time, peer-to-peer

[Diagram: SDB partitions SDB0-SDB4 spread across peer cpus exchanging activity Δs and propagation; a "??" on the slide marks an open question about coordination in this design.]

Page 17:

Options: distributing memory

• which data
– search space
– plans
– activities
– timelines
• how to partition for load balance
• data replication

Related work: Lansky (GEMPLAN); Zhou and Hansen, 2007; Burns et al., 2009; Kishimoto et al., 2009; DCR (DCSP & DCOP); distributed planning

Page 18:

Options: parallelizing function

• which functions
– entire algorithm
– parts of algorithm
• identifying valid search operations (valid intervals)
• performing a planning/search operation
• parameter dependency updates
• timeline updates
• identifying flaws
– methods of data objects (i.e. distributing memory)
– data structure operations
• symmetry (loop-parallelized, master-slave, distributed)

Related work: Lansky (GEMPLAN); Zhou and Hansen, 2007; Burns et al., 2009; Kishimoto et al., 2009; DCR (DCSP & DCOP); distributed planning

Page 19:

Options: data access and communication

• access location types (processing node, cache, RAM, disk, network/messages)
• allocation control (specify node, specify cache, OS decides)
• movement of data
• maintaining consistency of replicated data (transactions/mutexes, conflict resolution)
• integration of results (transactions/mutexes, conflict resolution)
• data routing (centralized, hierarchical, peer-to-peer)
• synchronous or asynchronous
• communication services (hardware specific, threads, sockets, file I/O, MPI, CORBA, database, distributed planning interfaces)

Page 20:

memory || by activity/timeline type, computation || by conflict type, master-slave, star

[Diagram: SDB partitions SDB1-SDB4 by activity/timeline type on slave cpus; Repairers assigned per conflict type exchange activity Δs, reservation Δs, and timeline values with the slaves and receive valid intervals and conflicts.]

Page 21:

Design comparison of three options:
(A) parallelize bottleneck functions — timelines
(B) parallelize bottleneck functions — dependencies/activities
(C) parallelize repair/optimize by flaw type

memory distributed: (A) timelines (B) dependencies/activities (C) none
load balance strategy: (A) dynamic grouping (B) dynamic grouping (C) none needed
replicated data: (A) none (B) none (C) none
functions parallelized: (A) propagation, valid intervals (B) propagation, conflict gathering (C) repair, optimize
symmetry: (A) peer-to-peer (B) peer-to-peer (C) master-slave, asymmetric by conflict type
data location: (A) local cache, pre-specified (B) local cache, pre-specified (C) RAM/cache
data movement: (A) none (B) none (C) OS controlled
replicated data: (A) none, shared memory (B) none, shared memory (C) none, shared memory
integration: (A) shared memory, no conflicts (B) shared memory, no conflicts (C) determine independence
data routing: (A) centralized through cache & RAM (B) centralized through cache & RAM (C) centralized through RAM/cache
synchronization: (A) synchronize after propagation (B) full propagate before conflict gathering (C) sequential processing of dependent conflicts
services: (A) Pthreads (B) Pthreads (C) Pthreads
advantages: (A) may keep nearly all data in local cache (C) many flaws may be independently addressed
disadvantages: (A) local cache may not be large enough for both instructions and data, but that may be unavoidable (B) difficult to take advantage of locally cached data (C) difficult to load balance and maximize utilization

Page 22:

results on simple || stochastic search

[Diagram: star topology as before — the master (cpu0) sends the updated schedule to workers (cpu1, cpu2, …), which return score and best schedule.]

Page 23:

Page 24:

Page 25:

Page 26:

Page 27:

[Diagram: numbered core assignments (1-16) on the mesh for the two placement strategies described below.]

Spreading out cores to try and get more efficient access to memory

• At first, just assigned cores from left to right and top down, “all at top”

• Tried strategy to spread cores as far apart as possible while keeping as close as possible to memory controllers, “top and bottom”
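The two strategies can be made concrete as placement functions on the 8 x 8 mesh. This is a hypothetical reconstruction (the slide does not give the exact tile choices), assuming memory controllers sit along the top and bottom edges; `place_all_at_top` and `place_top_and_bottom` are illustrative names:

```c
#include <assert.h>

/* Sketch of the two core-placement strategies on an 8 x 8 mesh.
   Each function maps worker index k to a (row, col) tile. */

#define MESH 8

/* "all at top": fill tiles left to right, top down */
void place_all_at_top(int k, int *row, int *col) {
    *row = k / MESH;
    *col = k % MESH;
}

/* "top and bottom": alternate between the top and bottom rows so
   cores stay near the edge memory controllers while being spread
   as far apart as possible */
void place_top_and_bottom(int k, int *row, int *col) {
    *row = (k % 2 == 0) ? 0 : MESH - 1;  /* even -> top, odd -> bottom */
    *col = (k / 2) % MESH;               /* spread along the row */
}
```

The design intent is that each pair of workers splits the vertical distance to a controller, trading a little neighbor locality for shorter average paths to RAM.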

Page 28:

Page 29:

Page 30:

Page 31:

Page 32:

Page 33:

summary of experimental results

• Ran multiple instances of CASPER (1 per core) on the same problem but with different random seeds
– of 20 problems, about half improved with more cores
– many slowed down with more cores
– several only improved by a few tens of percent
– one improved >60X
– more than 8 cores typically doesn't provide a dramatic increase in speedup
• Tried spreading cores out to see if memory access would improve
– run times vary less and are often slightly better
– but more variance actually leads to bigger speedups!
• 3 main computational bottlenecks in CASPER; parallelizing one (valid intervals) in map/reduce fashion
– full parallelization introduces too much overhead
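The map/reduce treatment of valid intervals can be illustrated on a toy model where an activity's valid placement is the intersection of per-timeline windows: map threads intersect slices of the windows in parallel, and a reduce step intersects the partial results. Everything here (`Interval`, `valid_interval`, the 8-thread limit) is an illustrative assumption, not CASPER's API:

```c
#include <assert.h>
#include <pthread.h>

/* Toy map/reduce for the valid-intervals bottleneck: each thread
   intersects a slice of [lo, hi] windows; the reduce step intersects
   the per-thread partials. */

typedef struct { int lo, hi; } Interval;

typedef struct {
    const Interval *wins;  /* this thread's slice of windows */
    int n;
    Interval out;          /* partial intersection (map result) */
} Chunk;

static void *map_intersect(void *arg) {
    Chunk *c = arg;
    Interval acc = { -1000000, 1000000 };   /* "anywhere" */
    for (int i = 0; i < c->n; i++) {        /* intersect the slice */
        if (c->wins[i].lo > acc.lo) acc.lo = c->wins[i].lo;
        if (c->wins[i].hi < acc.hi) acc.hi = c->wins[i].hi;
    }
    c->out = acc;
    return NULL;
}

/* Split n windows across nthreads <= 8, map in parallel, reduce. */
Interval valid_interval(const Interval *wins, int n, int nthreads) {
    Chunk chunks[8];
    pthread_t t[8];
    int per = (n + nthreads - 1) / nthreads;
    for (int k = 0; k < nthreads; k++) {
        int start = k * per;
        int len = (start >= n) ? 0 : (start + per > n ? n - start : per);
        chunks[k].wins = (start < n) ? wins + start : wins;
        chunks[k].n = len;
        pthread_create(&t[k], NULL, map_intersect, &chunks[k]);
    }
    Interval acc = { -1000000, 1000000 };
    for (int k = 0; k < nthreads; k++) {    /* reduce */
        pthread_join(t[k], NULL);
        if (chunks[k].out.lo > acc.lo) acc.lo = chunks[k].out.lo;
        if (chunks[k].out.hi < acc.hi) acc.hi = chunks[k].out.hi;
    }
    return acc;
}
```

Because intersection is associative and commutative, the chunking does not affect the result — which is what makes this bottleneck a natural map/reduce candidate, and also why thread-spawn overhead (not correctness) becomes the limiting factor the slide reports.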

Page 34:

what’s the plan from here?1. bottleneck 1: figure out how many threads to spawn for valid intervals2. see how homing timeline memory affects performance2. bottleneck 2: parallelize parameter constraint network (PCN) similarly

– map: update function source parameters– reduce: apply function to source parameters and assign to sink parameter– home memory for activities (with their parameters)

3. bottleneck 3: conflict gathering – see if parallelizing PCN also helps this, or apply map-reduce approach

4. parallelize conflict repair– for example, move activity to resolve one conflict while switching another

activity’s resource to resolve another conflict– can mutexes around activities, parameters, and timelines allow

unrestricted parallelization without deadlock?– if not, we will need to actively determine when/where repair operations

can run concurrently5. tuning

– as with valid intervals, need to know how many threads to spawn for these different operations; if homing memory, maybe always spawning to home core is ok

– how to balance memory and threads across cores
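On the deadlock question in item 4: one standard answer is to acquire the mutexes of all objects a repair touches in a single global order, which prevents the circular wait that deadlock requires. A hypothetical Pthreads sketch — the objects, `repair_pair`, and the index-based order are illustrative stand-ins, not CASPER code:

```c
#include <assert.h>
#include <pthread.h>

/* Two repair operations each need two plan objects (say, an activity
   and a timeline).  Locking objects in a fixed global order (here, by
   index) makes concurrent repairs deadlock-free. */

#define NOBJ 4
static pthread_mutex_t locks[NOBJ] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static int values[NOBJ];

/* Lock objects a and b in global (index) order, mutate, unlock. */
void repair_pair(int a, int b) {
    int first  = a < b ? a : b;
    int second = a < b ? b : a;
    pthread_mutex_lock(&locks[first]);
    pthread_mutex_lock(&locks[second]);
    values[a]++;                      /* stand-in for a repair move */
    values[b]++;
    pthread_mutex_unlock(&locks[second]);
    pthread_mutex_unlock(&locks[first]);
}

typedef struct { int a, b, reps; } Job;

static void *worker(void *arg) {
    Job *j = arg;
    for (int i = 0; i < j->reps; i++) repair_pair(j->a, j->b);
    return NULL;
}

/* Two workers touching the same object pair in opposite order; with
   ordered locking this always completes instead of deadlocking. */
void run_conflicting_repairs(void) {
    pthread_t t1, t2;
    Job j1 = { 0, 2, 10000 };
    Job j2 = { 2, 0, 10000 };   /* same pair, opposite order */
    pthread_create(&t1, NULL, worker, &j1);
    pthread_create(&t2, NULL, worker, &j2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

Ordered locking answers the deadlock half of the question; it does not by itself answer the slide's other concern, whether unrestricted parallel repairs remain semantically independent.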

[Model fragment: energy = power * duration, where power = powerForMode(mode) and duration = takeImage.duration.]

Page 35:

Summary

• A large number of choices may go into a design for a parallelized planning system.

• Presented a hybrid design for parallelizing a continual iterative repair planning system for a tile/mesh multi-core processor.

• Need to characterize design choices by listing what implementation features each would entail.

