Sasha Goldshtein
CTO, Sela Group
Task and Data Parallelism
Agenda
• Multicore machines have been a cheap commodity for >10 years
• Adoption of concurrent programming is still slow
• Patterns and best practices are scarce
• We discuss the APIs first…
• …and then turn to examples, best practices, and tips
TPL Evolution
Tasks
• A task is a unit of work
  – May be executed in parallel with other tasks by a scheduler (e.g. the thread pool)
  – Offers much more than a thread, and yet is much cheaper
Task<string> t = Task.Factory.StartNew(
    () => { return DnaSimulation(…); });
t.ContinueWith(r => Show(r.Exception),
    TaskContinuationOptions.OnlyOnFaulted);
t.ContinueWith(r => Show(r.Result),
    TaskContinuationOptions.OnlyOnRanToCompletion);
DisplayProgress();
try { //The C# 5.0 version
    var task = Task.Run(DnaSimulation);
    DisplayProgress();
    Show(await task);
} catch (Exception ex) {
    Show(ex);
}
Parallel Loops
• Ideal for parallelizing work over a collection of data
• Easy porting of for and foreach loops
  – Beware of inter-iteration dependencies!
Parallel.For(0, 100, i => {
    ...
});
Parallel.ForEach(urls, url => {
    webClient.Post(url, options, data);
});
Parallel LINQ
• Mind-bogglingly easy parallelization of LINQ queries
• Can introduce ordering into the pipeline, or preserve the order of the original elements

var query = from monster in monsters.AsParallel()
            where monster.IsAttacking
            let newMonster = SimulateMovement(monster)
            orderby newMonster.XP
            select newMonster;
query.ForAll(monster => Move(monster));
Measuring Concurrency
• Visual Studio Concurrency Visualizer to the rescue
Recursive Parallelism Extraction
• Divide-and-conquer algorithms are often parallelized through the recursive call
  – Be careful with the parallelization threshold and watch out for dependencies

void FFT(float[] src, float[] dst, int n, int r, int s) {
    if (n == 1) {
        dst[r] = src[r];
    } else {
        FFT(src, dst, n/2, r, s*2);
        FFT(src, dst, n/2, r+s, s*2);
        //Combine the two halves in O(n) time
    }
}

Parallel.Invoke(
    () => FFT(src, dst, n/2, r, s*2),
    () => FFT(src, dst, n/2, r+s, s*2));
DEMO: Recursive parallel QuickSort
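A minimal sketch of what the demo covers, assuming a standard Lomuto partition and an arbitrary cutoff (the names and the threshold value are illustrative, not the demo's actual code):

using System;
using System.Threading.Tasks;

static class ParallelSort {
    const int SequentialThreshold = 4096; // assumption: tune per workload

    public static void QuickSort<T>(T[] items, int left, int right)
        where T : IComparable<T> {
        if (left >= right) return;
        if (right - left < SequentialThreshold) {
            // Below the threshold, task-creation overhead would dominate;
            // fall back to plain sequential sorting of this range.
            Array.Sort(items, left, right - left + 1);
            return;
        }
        int pivot = Partition(items, left, right);
        Parallel.Invoke( // fork the two partitions, as on the FFT slide
            () => QuickSort(items, left, pivot - 1),
            () => QuickSort(items, pivot + 1, right));
    }

    static int Partition<T>(T[] items, int left, int right)
        where T : IComparable<T> {
        T pivot = items[right];
        int store = left;
        for (int i = left; i < right; ++i) {
            if (items[i].CompareTo(pivot) < 0) {
                T tmp = items[i]; items[i] = items[store]; items[store] = tmp;
                ++store;
            }
        }
        T t = items[right]; items[right] = items[store]; items[store] = t;
        return store;
    }
}

Without the threshold, the recursion would spawn a task for every few elements and scheduling overhead would swamp the actual sorting work.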
Symmetric Data Processing
• For a large set of uniform data items that need to be processed, parallel loops are usually the best choice and lead to ideal work distribution
• Inter-iteration dependencies complicate things (think in-place blur)

Parallel.For(0, image.Rows, i => {
    for (int j = 0; j < image.Cols; ++j) {
        destImage.SetPixel(i, j, PixelBlur(image, i, j));
    }
});
Uneven Work Distribution
• With non-uniform data items, use custom partitioning or manual distribution
  – Primes: 7 is easier to check than 10,320,647
var work = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(n => Task.Run(() =>
        CountPrimes(start+chunk*n, start+chunk*(n+1))));
Task.WaitAll(work.ToArray());

versus

Parallel.ForEach(Partitioner.Create(Start, End, chunkSize),
    chunk => CountPrimes(chunk.Item1, chunk.Item2));
DEMO: Uneven workload distribution
Complex Dependency Management
• Must extract all dependencies and incorporate them into the algorithm
  – Typical scenarios: 1D loops, dynamic algorithms
  – Edit distance: each task depends on 2 predecessors; the computation proceeds as a wavefront
C = x[i-1] == y[j-1] ? 0 : 1;
D[i, j] = min(D[i-1, j] + 1,
              D[i, j-1] + 1,
              D[i-1, j-1] + C);

(Diagram: the wavefront sweeps the D table from cell 0,0 to cell m,n.)
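One common way to parallelize this recurrence (a sketch, not necessarily the demo's approach) is to sweep the anti-diagonals sequentially and parallelize within each diagonal, since cells on the same anti-diagonal do not depend on each other:

using System;
using System.Threading.Tasks;

static int EditDistance(string x, string y) {
    int m = x.Length, n = y.Length;
    var D = new int[m + 1, n + 1];
    for (int i = 0; i <= m; ++i) D[i, 0] = i; // deleting i characters
    for (int j = 0; j <= n; ++j) D[0, j] = j; // inserting j characters

    for (int d = 2; d <= m + n; ++d) { // anti-diagonal index: i + j == d
        int iMin = Math.Max(1, d - n), iMax = Math.Min(m, d - 1);
        Parallel.For(iMin, iMax + 1, i => {
            int j = d - i;
            int c = x[i - 1] == y[j - 1] ? 0 : 1;
            D[i, j] = Math.Min(Math.Min(D[i - 1, j] + 1,   // delete
                                        D[i, j - 1] + 1),  // insert
                               D[i - 1, j - 1] + c);       // substitute
        });
    }
    return D[m, n];
}

In real code you would process blocks of cells per diagonal rather than single cells; the per-cell Parallel.For is shown only for clarity.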
DEMO: Dependency management
Synchronization ⇒ Aggregation
• Excessive synchronization brings parallel code to its knees
  – Try to avoid shared state
  – Aggregate thread- or task-local state and merge
Parallel.ForEach(
    Partitioner.Create(Start, End, ChunkSize),
    () => new List<int>(),            //initial local state
    (range, pls, localPrimes) => {    //aggregator
        for (int i = range.Item1; i < range.Item2; ++i)
            if (IsPrime(i))
                localPrimes.Add(i);
        return localPrimes;
    },
    localPrimes => {                  //combiner
        lock (primes)
            primes.AddRange(localPrimes);
    });
DEMO: Aggregation
Creative Synchronization
• We implement a collection of stock prices, initialized with 10^5 name/price pairs
  – 10^7 reads/s, 10^6 “update” writes/s, 10^3 “add” writes/day
  – Many reader threads, many writer threads

GET(key):
    if safe contains key then return safe[key]
    lock { return unsafe[key] }

PUT(key, value):
    if safe contains key then safe[key] = value
    else lock { unsafe[key] = value }
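A C# sketch of the GET/PUT pseudocode above (class and member names are illustrative): reads and updates of keys known at initialization never touch a lock, because replacing a value for an existing key causes no structural change to the dictionary; only keys added later go through the locked dictionary.

using System.Collections.Generic;

class StockPrices {
    private readonly Dictionary<string, double> _safe;   // fixed key set, never resized
    private readonly Dictionary<string, double> _unsafe =
        new Dictionary<string, double>();                // rarely-added keys, locked

    public StockPrices(IDictionary<string, double> initial) {
        _safe = new Dictionary<string, double>(initial);
    }

    public double Get(string key) {
        double value;
        // Lock-free fast path: _safe's key set never changes, and an in-place
        // value update does not resize the table (readers may observe a
        // slightly stale price, which is acceptable for this workload).
        if (_safe.TryGetValue(key, out value)) return value;
        lock (_unsafe) return _unsafe[key];
    }

    public void Put(string key, double value) {
        if (_safe.ContainsKey(key)) { _safe[key] = value; return; }
        lock (_unsafe) _unsafe[key] = value;
    }
}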
Lock-Free Patterns (1)
• Try to avoid Windows synchronization and use hardware synchronization
  – Primitive operations such as Interlocked.Increment and Interlocked.CompareExchange
  – A retry pattern with Interlocked.CompareExchange enables arbitrary lock-free algorithms
int InterlockedMultiply(ref int x, int y) {
    int t, r;
    do {
        t = x;
        r = t * y;
    } while (Interlocked.CompareExchange(ref x, r, t) != t);
    return r;
}
(Diagram: Interlocked.CompareExchange operands: old value, new value, comparand.)
Lock-Free Patterns (2)
• User-mode spinlocks (the SpinLock class) can replace locks you acquire very often, which protect tiny computations

class __DontUseMe__SpinLock {
    private volatile int _lck;
    public void Enter() {
        while (Interlocked.CompareExchange(ref _lck, 1, 0) != 0);
    }
    public void Exit() {
        _lck = 0;
    }
}
Miscellaneous Tips (1)
• Don’t mix several concurrency frameworks in the same process
• Some parallel work is best organized in pipelines – TPL DataFlow (see the sketch below)
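For example, a minimal two-stage TPL DataFlow pipeline might look like this (a sketch; the stage bodies are placeholders, and the library ships as a separate NuGet package):

using System;
using System.Threading.Tasks.Dataflow; // TPL DataFlow NuGet package

class PipelineSketch {
    static void Main() {
        int total = 0;

        // Stage 1: parse input lines, up to 4 in parallel.
        var parse = new TransformBlock<string, int>(
            line => int.Parse(line),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Stage 2: consume the results; ActionBlock runs single-threaded
        // by default, so 'total' needs no lock.
        var sum = new ActionBlock<int>(n => total += n);

        parse.LinkTo(sum, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var line in new[] { "1", "2", "3" })
            parse.Post(line);

        parse.Complete();        // no more input
        sum.Completion.Wait();   // drain the pipeline
        Console.WriteLine(total); // 6
    }
}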
Miscellaneous Tips (2)
• Some parallel work can be offloaded to the GPU – C++ AMP
void vadd_exp(float* x, float* y, float* z, int n) {
    array_view<const float,1> avX(n, x), avY(n, y);
    array_view<float,1> avZ(n, z);
    avZ.discard_data();
    parallel_for_each(avZ.extent, [=](index<1> i) restrict(amp) {
        avZ[i] = avX[i] + fast_math::exp(avY[i]);
    });
    avZ.synchronize();
}
Miscellaneous Tips (3)
• Invest in SIMD parallelization of heavy math or data-parallel algorithms
  – Already available on Mono (Mono.Simd); a C# sketch follows the assembly listing below
• Make sure to take cache effects into account, especially on multiprocessor (MP) systems
START: movups xmm0, [esi+4*ecx]
       addps  xmm0, [edi+4*ecx]
       movups [ebx+4*ecx], xmm0
       sub    ecx, 4
       jns    START
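A rough C# counterpart of the loop above, assuming the Mono.Simd Vector4f API (a sketch; real code would load vectors directly from the arrays rather than element by element):

using Mono.Simd; // JIT-accelerated to SSE on Mono, simulated elsewhere

static void VectorAdd(float[] x, float[] y, float[] z) {
    int i = 0;
    for (; i <= x.Length - 4; i += 4) { // four float lanes per iteration
        Vector4f vx = new Vector4f(x[i], x[i+1], x[i+2], x[i+3]);
        Vector4f vy = new Vector4f(y[i], y[i+1], y[i+2], y[i+3]);
        Vector4f vz = vx + vy; // maps to a single addps instruction
        z[i] = vz.X; z[i+1] = vz.Y; z[i+2] = vz.Z; z[i+3] = vz.W;
    }
    for (; i < x.Length; ++i) // scalar tail for leftover elements
        z[i] = x[i] + y[i];
}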
Summary
• Avoid shared state and synchronization
• Parallelize judiciously and apply thresholds
• Measure and understand performance gains or losses
• Concurrency and parallelism are still hard
• A body of best practices, tips, patterns, and examples is being built
THANK YOU!
Sasha Goldshtein
CTO, Sela Group
blog.sashag.net | @goldshtn