From LINQ to DryadLINQ
Michael Isard
Workshop on Data-Intensive Scientific Computing Using DryadLINQ
Overview
• From sequential code to parallel execution
• Dryad fundamentals
• Simple program example, plan for practicals
Distributed computation
• Single computer, shared memory
  – All objects always available for read and write
• Cluster of workstations
  – Each computer sees a subset of objects
  – Writes on one computer must be explicitly shared
• System automatically handles complexity
  – Needs some help
Data-parallel computation
• LINQ is a high-level declarative specification
• Same action on entire collection of objects
• set.Select(x => f(x))
  – Compute f(x) on each x in set, independently
• set.GroupBy(x => key(x))
  – Group by unique keys, independently
• set.OrderBy(x => key(x))
  – Sort whole set (system chooses how)
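The three operators above can be tried on an ordinary in-memory collection before any cluster is involved. The following is a minimal sketch using plain LINQ to Objects; the class name, the sample data, and the choice of doubling as f are assumptions made for illustration and are not part of the workshop material.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqOperatorsDemo
{
    // Hypothetical sample data: each record has a key and a value.
    public static readonly (string Key, int Value)[] Set =
    {
        ("b", 3), ("a", 1), ("b", 4), ("a", 2)
    };

    // Select: apply the same function to every element, independently.
    public static List<int> Doubled() =>
        Set.Select(x => x.Value * 2).ToList();

    // GroupBy: collect the elements that share a key, then count each group.
    public static Dictionary<string, int> CountsByKey() =>
        Set.GroupBy(x => x.Key).ToDictionary(g => g.Key, g => g.Count());

    // OrderBy: sort the whole collection; the caller does not say how.
    public static List<int> SortedValues() =>
        Set.OrderBy(x => x.Value).Select(x => x.Value).ToList();

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Doubled()));      // 6,2,8,4
        Console.WriteLine(string.Join(",", SortedValues())); // 1,2,3,4
        foreach (var kv in CountsByKey().OrderBy(kv => kv.Key))
            Console.WriteLine($"{kv.Key}: {kv.Value}");      // a: 2 then b: 2
    }
}
```

None of the three calls says where or in what order the work happens, which is what leaves a system free to run them in parallel.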
Distributed cluster computing
• Dataset is stored on local disks of cluster
[Figure: the dataset set is split into partitions set.0 through set.7, which are spread across the local disks of the cluster computers.]
Simple distributed computation
var set2 = set.Select(x => f(x))
[Figure: set is stored as partitions set.0 through set.7. Each partition set.i passes through its own copy of f to produce the matching output partition set2.i, for set2.0 through set2.7.]
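The per-partition structure of this Select can be imitated on one machine. The sketch below is an assumption-laden stand-in, not how Dryad runs the job: partitions are plain arrays, each copy of f runs in its own `Task`, and the squaring function `F` is a hypothetical choice.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class PartitionedSelectDemo
{
    // Hypothetical stand-in for the user function f in set.Select(x => f(x)).
    public static int F(int x) => x * x;

    // Run f over every partition in its own task. Input partition i yields
    // output partition i, and no task reads another task's data.
    public static int[][] SelectPerPartition(int[][] partitions)
    {
        Task<int[]>[] tasks = partitions
            .Select(part => Task.Run(() => part.Select(x => F(x)).ToArray()))
            .ToArray();
        Task.WaitAll(tasks);
        return tasks.Select(t => t.Result).ToArray();
    }

    public static void Main()
    {
        int[][] set = { new[] { 1, 2 }, new[] { 3 }, new[] { 4, 5, 6 } };
        int[][] set2 = SelectPerPartition(set);
        for (int i = 0; i < set2.Length; i++)
            Console.WriteLine($"set2.{i}: {string.Join(" ", set2[i])}");
    }
}
```

Because each element is handled independently, the number and size of the partitions affect only how the work is spread, not the result.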
Distributed acyclic graph
• Computation reads and writes along edges
• Graph shows parallelism via independence
• Goals of DryadLINQ optimizer
  – Extract parallelism (find independent work)
  – Control data skew (balance work across nodes)
  – Limit cross-computer data transfer
Distributed grouping
var groups = set.GroupBy(x => x.key)
• set is a collection of records, each with a key
• Don’t know what keys are present
  – Or in which partitions
• First, reorganize data
  – All records with same key on same computer
• Then can do final grouping in parallel
[Figure: set → hash partition by key → group locally → groups. Four input partitions hold records with keys a, b, c and d. After the hash partition, all a and c records sit on one computer (a a a c) and all b and d records on another (b b d d), where they are grouped locally.]
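The two stages of the grouping plan, hash partition by key and then group locally, can be sketched on one machine as follows. This is an illustration under stated assumptions: each record is reduced to its one-character key, each input partition is written as a string, and the character code modulo the partition count stands in for whatever hash function the real system uses.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashGroupDemo
{
    // Stage 1: route every record to the partition chosen by its key, so
    // all records with the same key end up in the same partition.
    public static List<char>[] HashPartition(IEnumerable<string> inputParts, int numberOfParts)
    {
        List<char>[] output = Enumerable.Range(0, numberOfParts)
            .Select(_ => new List<char>())
            .ToArray();
        foreach (string part in inputParts)
            foreach (char key in part)
                output[key % numberOfParts].Add(key); // assumed toy hash
        return output;
    }

    // Stage 2: group inside one partition; no other partition is consulted.
    public static Dictionary<char, List<char>> GroupLocally(List<char> part) =>
        part.GroupBy(key => key).ToDictionary(g => g.Key, g => g.ToList());

    public static void Main()
    {
        string[] set = { "ac", "ad", "db", "ba" }; // four input partitions
        List<char>[] parts = HashPartition(set, 2);
        for (int i = 0; i < parts.Length; i++)
            foreach (var group in GroupLocally(parts[i]).OrderBy(g => g.Key))
                Console.WriteLine($"partition {i}: key {group.Key}, {group.Value.Count} record(s)");
    }
}
```

With this toy hash, keys b and d land in partition 0 and keys a and c in partition 1, so each key's group is complete within a single partition.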
Distributed sorting
var sorted = set.OrderBy(x => x.key)
[Figure: set → sample → compute histogram → range partition by key → sort locally → sorted. Keys are sampled from the input partitions and the histogram yields the key ranges [1,1] and [2,100]. Records are range partitioned accordingly and each partition is sorted locally, so the concatenated output reads 1 1 1 1 2 3 4 100.]
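The sorting plan (sample, choose key ranges, range partition, sort locally) can be sketched in a few lines. Everything specific here is an assumption for illustration: the sampling rule (first key of each partition), the way split points are picked from the sorted sample, and the method names. The split point it finds differs from the ranges in the figure, but the sorted output is the same.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeSortDemo
{
    // Pick numberOfParts - 1 split points from a sorted sample of the keys.
    public static int[] ChooseBoundaries(IEnumerable<int> sample, int numberOfParts)
    {
        int[] sorted = sample.OrderBy(k => k).ToArray();
        return Enumerable.Range(1, numberOfParts - 1)
            .Select(i => sorted[i * sorted.Length / numberOfParts])
            .ToArray();
    }

    // Send each key to the range it falls in, then sort every range locally.
    public static int[][] RangePartitionAndSort(int[][] inputParts, int[] boundaries)
    {
        List<int>[] output = Enumerable.Range(0, boundaries.Length + 1)
            .Select(_ => new List<int>())
            .ToArray();
        foreach (int key in inputParts.SelectMany(p => p))
        {
            // The number of split points at or below the key is its range index.
            int target = boundaries.Count(b => b <= key);
            output[target].Add(key);
        }
        return output.Select(p => p.OrderBy(k => k).ToArray()).ToArray();
    }

    public static void Main()
    {
        int[][] set = { new[] { 100, 1 }, new[] { 1, 1 }, new[] { 2, 3 }, new[] { 4, 1 } };
        int[] sample = set.Select(p => p[0]).ToArray();  // assumed sampling rule
        int[] boundaries = ChooseBoundaries(sample, 2);
        int[][] sortedParts = RangePartitionAndSort(set, boundaries);
        // Reading the partitions in order gives the fully sorted set.
        Console.WriteLine(string.Join(" ", sortedParts.SelectMany(p => p))); // 1 1 1 1 2 3 4 100
    }
}
```

Because every key in range i is smaller than every key in range i + 1, the local sorts need no merge step. A poor sample gives unbalanced ranges, which is the data skew the optimizer tries to control.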
Additional optimizations
var histogram = set.GroupBy(x => x.key).Select(x => new { x.Key, Count = x.Count() })
[Figure, plan without the optimization: set → hash partition by key → group locally → count → histogram. The four input partitions (a b b a, a a d d, b d b d, a b b a) are repartitioned record by record, so all 16 records are moved before counting gives a,6 b,6 d,4.]
[Figure, plan with the optimization: set → group locally and count → hash partition by key → group locally and combine counts → histogram. Each input partition is first reduced to partial counts (a,2 b,2; a,2 d,2; b,2 d,2; a,2 b,2), so only 8 (key, count) pairs are repartitioned. Combining the partial counts gives the same result: a,6 b,6 d,4.]
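The optimized histogram plan relies on counting being decomposable: counts computed inside each partition can be summed later. A minimal single-machine sketch follows, with the four partitions from the figure written as strings. The class and method names are hypothetical and this is not DryadLINQ's generated code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartialCountDemo
{
    // Step 1: count keys inside one input partition, before any data moves.
    public static Dictionary<char, int> LocalCounts(IEnumerable<char> part) =>
        part.GroupBy(key => key).ToDictionary(g => g.Key, g => g.Count());

    // Step 2: after repartitioning by key, sum the partial counts per key.
    public static Dictionary<char, int> Combine(IEnumerable<Dictionary<char, int>> partials) =>
        partials.SelectMany(p => p)
                .GroupBy(kv => kv.Key)
                .ToDictionary(g => g.Key, g => g.Sum(kv => kv.Value));

    public static void Main()
    {
        string[] set = { "abba", "aadd", "bdbd", "abba" }; // four input partitions
        List<Dictionary<char, int>> partials = set.Select(p => LocalCounts(p)).ToList();
        Console.WriteLine(partials.Sum(p => p.Count)); // 8 pairs to move instead of 16 records
        foreach (var kv in Combine(partials).OrderBy(kv => kv.Key))
            Console.WriteLine($"{kv.Key},{kv.Value}"); // a,6 then b,6 then d,4
    }
}
```

The saving grows with the number of repeated keys per partition: the data moved is bounded by the number of distinct keys in each partition rather than by the number of records.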
What Dryad does
• Abstracts cluster resources
  – Set of computers, network topology, etc.
• Schedules DAG: chooses cluster computers
  – Fairly among competing jobs
  – So computation is close to data
• Recovers from transient failures
  – Reruns computations on machine or network fault
  – Speculates duplicates for slow computations
Resources are virtualized
• Each graph node is a process
  – Writes outputs to disk
  – Reads inputs from upstream nodes’ output files
• Graph generally larger than cluster
  – 1TB input, 250MB partition, 4000 parts
• Cluster is shared
  – Don’t size program for exact cluster
  – Use whatever share of resources is available
What controls parallelism
• Initially based on partitioning of inputs
• After reorganization, system or user decides
DryadLINQ-specific operators
• set = PartitionedTable.Get<T>(uri)
• set.ToPartitionedTable(uri)
• set.HashPartition(x => f(x), numberOfParts)
• set.AssumeHashPartition(x => f(x))
• [Associative] f(x) { … }
• RangePartition(…), Apply(…), Fork(…)
• [Decomposable], [Homomorphic], [Resource]
• Field mappings, Multiple partitioned tables, …
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using LinqToDryad;

namespace Count
{
    class Program
    {
        public const string inputUri = @"tidyfs://datasets/Count/inputfile1.pt";

        static void Main(string[] args)
        {
            // Open the partitioned table of line records stored at inputUri.
            PartitionedTable<LineRecord> table = PartitionedTable.Get<LineRecord>(inputUri);
            // Count the records in the table and print the total.
            Console.WriteLine("Lines: {0}", table.Count());
            Console.ReadKey();
        }
    }
}
Form into groups
• 9 groups, one MSRI member per group
• Try to pick common interest for project later
sherwood-246 to sherwood-253, sherwood-255
d:\dryad\data\Workshop\DryadLINQ\samples
Count, Points, Robots
Cluster job browser: d:\dryad\data\Workshop\DryadLINQ\job_browser\DryadAnalysis.exe
TidyFS (file system) browser: d:\dryad\data\Workshop\DryadLINQ\bin\retail\tidyfsexplorerwpf.exe