Date post: | 16-Apr-2017 |
Upload: | john-yates |
Selling an executable data flow graph based IR
John Yates
Order of presentation
• Who am I and why am I here?
• 2010: Netezza needs a new architecture
• A family of statically typed acyclic DFG IRs
• (Time permitting: Some engineering details)
• Q&A
“Who am I and why am I here?”
(with apologies to Adm. Stockdale)
1970: Maybe I’ll be a programmer
• NYC hippie, ponytail, curled handlebar mustache
• Liberal arts high school, lousy student
• Wanted to build things, real things
• Computers seemed interesting and intuitive
• Luckily in 1970 programmers were scarce
40 years…
– 1970: learning the craft, various jobs (all in assembler)
– 1978: Digital Equipment Corp
• Pascal frontend, dynamic programming code selector
– 1983: Apollo Computer
• Designed RISC ISP w/ explicit parallel dispatch (pre-VLIW)
• Lead architect for RISC backend optimizer; built team
• 1st commercial: SSA IR, SW pipeliner, lattice const prop
– 1992: Binary translation: DEC (sw), Chromatic (hw-support)
• More SSA IR, lowering; built teams; lot of patents (many hw)
– 1999: Everfile - NFS-like Win32 internet file system
– 2002: Netezza, badge #26
• Storage: compression, indices, access methods, txns, CBTs
20+ years
2010: Netezza needs a new architecture
Data parallel analytics engine
• Data partitioned across a cluster of nodes
– Multiple “slices” per node to exploit multi-core
• Execution model:
– Leader accepts query, produces an execution plan
– Leader broadcasts plan’s parallel components
– Cluster performs data parallel work
– Leader performs work requiring a single locus
• Competition: Teradata, Greenplum, DB2, …
Netezza’s architecture
[Figure: block diagram. Boxes: PG Plan; Split 1; Split 2; Gen FPGA; Gen C++ (×2); Compile (×2); Load DLL (×2); Bcast; Load FPGA; Execute (×2). Labels: N workers; Latency.]
Netezza’s problems
[Figure: block diagram. Boxes: PG Plan; Split 1; Split 2; Gen FPGA; Gen C++ (×2); Compile (×2); Load DLL (×2); Bcast; Load FPGA; Execute (×2). Label: N workers. Callouts follow.]
Very simplistic code generator:
- Lowering across an enormous semantic gulf
- No intermediate representation
- Very complex, very fragile
- Difficult to implement much more than general case code patterns
Hardware development time scales
Garth’s incomplete Marlin vision
• What is the real input to the interpreter?
• How do we get from query plan to that form?
[Figure: block diagram. Boxes: PG Plan; Split; Bcast; Interpret (faster?) (×2). Labels: N workers; Unspecified miracle; Multi-core?]
A family of statically typed acyclic data flow graph IRs
Working backwards
• Graph
• Dataflow
• Acyclic
• Statically typed
• A family of … IRs
Graph
• Operators
– Label names a function
– Edge connections in and out
• Edges
– Directed (“dataflow”)
Dataflow
• Dataflow machines
– Apply history, wisdom, insights to the interpreter
• Value semantics
– All edges carry data
– No other kinds of edges (i.e. no anti-dependence)
– No updatable shared state (i.e. no store)
• Expose all opportunities for concurrency
Acyclic
• No backedges ≡ no cycles ☺
• Can exploit topological ordering
– Fact propagation: rDFS (forward) or DFS (reverse)
– No iteration, guaranteed termination
– Linear algorithms, O(graph)
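The single-pass propagation described above can be sketched as follows. This is a minimal illustration, not the Netezza code: the adjacency-list form and the names `Adjacency`, `topoOrder` and `longestPathDepth` are assumptions made for the example. Reversing a depth-first post-order yields a topological order, and one linear walk over that order then computes a fact (here, each operator's longest distance from a source) with no iteration to a fixed point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical adjacency form: succ[op] lists the operators fed by op.
using Adjacency = std::vector<std::vector<std::uint32_t>>;

// Depth-first post-order visit. The graph is assumed acyclic, so no
// "in progress" marker is needed to detect back edges.
static void visit(Adjacency const& succ, std::uint32_t op,
                  std::vector<bool>& seen, std::vector<std::uint32_t>& post) {
    seen[op] = true;
    for (std::uint32_t next : succ[op])
        if (!seen[next]) visit(succ, next, seen, post);
    post.push_back(op);
}

// Reverse DFS post-order is a topological order: producers precede consumers.
std::vector<std::uint32_t> topoOrder(Adjacency const& succ) {
    std::vector<bool> seen(succ.size(), false);
    std::vector<std::uint32_t> post;
    for (std::uint32_t op = 0; op < succ.size(); ++op)
        if (!seen[op]) visit(succ, op, seen, post);
    std::reverse(post.begin(), post.end());
    return post;
}

// One forward pass: an operator's fact is final by the time it is visited,
// so the work is linear in the number of operators plus edges.
std::vector<unsigned> longestPathDepth(Adjacency const& succ) {
    std::vector<unsigned> depth(succ.size(), 0);
    for (std::uint32_t op : topoOrder(succ))
        for (std::uint32_t next : succ[op])
            depth[next] = std::max(depth[next], depth[op] + 1);
    return depth;
}
```

A reverse fact (for example liveness) would walk the same order back to front, reading successors instead of writing them.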
Statically typed
• Edges initially have unknown type
• A well-formed graph can be statically typed
– Linear pass over topologically ordered Operators
– Assign edge types per Operator descriptors
– Inconsistencies can be diagnosed and reported
• Well-nested subsets of edge type vocabularies
• Constraining edge types constrains operators
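The typing pass above might look roughly like this. It is a sketch under simplifying assumptions: every operator has exactly one output, the type vocabulary is a three-value enum, and the names `Type`, `Descriptor`, `Op` and `typeCheck` are invented for the example rather than taken from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Type { Unknown, Int, Bool };

// Hypothetical operator descriptor: required input types and the output type.
struct Descriptor {
    std::string name;
    std::vector<Type> inputs;
    Type output;
};

// Hypothetical operator: its descriptor plus the operators feeding each input.
struct Op {
    Descriptor const* desc;
    std::vector<std::size_t> sources;  // indices of producing operators
};

// ops must already be topologically ordered (every source index < own index).
// Returns one diagnostic per inconsistency; outType[i] receives the type of
// operator i's output edge.
std::vector<std::string> typeCheck(std::vector<Op> const& ops,
                                   std::vector<Type>& outType) {
    std::vector<std::string> errors;
    outType.assign(ops.size(), Type::Unknown);  // edges start untyped
    for (std::size_t i = 0; i < ops.size(); ++i) {
        Op const& op = ops[i];
        if (op.sources.size() != op.desc->inputs.size()) {
            errors.push_back(op.desc->name + ": wrong number of inputs");
        } else {
            for (std::size_t k = 0; k < op.sources.size(); ++k) {
                assert(op.sources[k] < i);  // topological order precondition
                if (outType[op.sources[k]] != op.desc->inputs[k])
                    errors.push_back(op.desc->name + ": input " +
                                     std::to_string(k) + " has the wrong type");
            }
        }
        outType[i] = op.desc->output;  // assign the edge type per descriptor
    }
    return errors;
}
```

Because producers are typed before consumers, each edge is examined once and every inconsistency is reported in a single pass.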
A family of … IRs
[Figure: block diagram. Boxes: PG Plan; Lower and Opt (×3); Tree patterns; Graph 1 patterns; Graph 2 patterns; Topo expand, insert CLONEs (×2); Split; Bcast; Interpret (×2). Labels: N workers; Common pattern notation; ≈.]
• High level tree - tuples
• High level graph - tuples
• Mid level graph - nullable values
• Low level graph - values
Nothing convinces like working code
• First delivery
– Table-driven operator semantics
– Utilities: build, edit & expand
– Topologically sort
– Type check & report errors
[Figure: block diagram. Boxes: Graph assembler; Topo expand, insert CLONEs (×2); Split; Bcast; Interpret (×2). Labels: N workers; Graph assembly program.]
Sold!
• Working code rendered my successive lowerings idea credible
• Overall Marlin added ~10 engineers; I got 3
• My team got its first end-to-end test case working
[Figure: block diagram. Boxes: PG Plan; Lower and Opt (×3); Tree patterns; Graph 1 patterns; Graph 2 patterns; Topo expand, insert CLONEs (×2); Split; Bcast; Interpret (×2). Label: N workers.]
IBM killed the Marlin program…
• Marlin was a clean up project promising…
– Performance and shorter development cycles
– But no new features nor functionality
• It is always hard to fund significant clean up
– Especially if not legitimately tied to a coveted feature
• Harder if your company is under duress
• Harder still if DB2 is gunning for your headcount
Question?
Some engineering details
Why clone?
• After expansion all edges are point-to-point
– No output is multiply-consumed
• Chunk handoff along an edge becomes trivial
– Think C++11’s new move semantics
• So only clones implement reference counting
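The ownership argument above can be illustrated with standard smart pointers. This is a sketch, assuming a hypothetical `Chunk` type and helper names `handOff` and `cloneFor`; the talk does not say how chunks or clones were actually represented. A point-to-point edge moves a `std::unique_ptr`, and only the clone step converts to a reference-counted `std::shared_ptr`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical unit of data flowing along an edge.
struct Chunk {
    std::vector<int> rows;
};

// Point-to-point edge: exactly one consumer, so ownership simply moves.
// No reference count is touched and no data is copied.
std::unique_ptr<Chunk> handOff(std::unique_ptr<Chunk>& producerSlot) {
    return std::move(producerSlot);
}

// Clone operator: the only place where a chunk gains several consumers,
// hence the only place that pays for reference counting.
std::vector<std::shared_ptr<Chunk const>> cloneFor(std::unique_ptr<Chunk> in,
                                                   std::size_t consumers) {
    std::shared_ptr<Chunk const> shared = std::move(in);
    return std::vector<std::shared_ptr<Chunk const>>(consumers, shared);
}
```

After `handOff` the producer's slot is empty, which makes accidental reuse of a handed-off chunk easy to detect.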
Broadcast
• Serialize / deserialize
• On the network, size matters
• Graph object
– Small number of scalar members
– Handful of C++ vectors (some ephemeral)
– Position independent (no pointers in vectors)
No pointers
• Pointers index the linear address space
– Implicit context (there is only one address space)
• Unsigned as vector index
– User must provide explicit context (vector base)
– 32 bit indices are ½ the size of 64 bit pointers
– Position independence simplifies serialization
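A small sketch of the index-instead-of-pointer idea, with invented names (`NodeIndex`, `Node`, `at`, `serialize`, `deserialize`) and a toy element type. Because links are 32-bit indices rather than addresses, a flat byte copy of the vector is a complete serialization and the links still resolve in the copy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 32-bit index, meaningful only together with a vector base.
struct NodeIndex {
    std::uint32_t value;
};

// Links are stored as indices, so the element holds no addresses.
struct Node {
    std::int32_t payload;
    NodeIndex next;
};

static_assert(sizeof(NodeIndex) == 4, "half the size of a 64-bit pointer");

// The caller supplies the context (the vector) explicitly.
inline Node const& at(std::vector<Node> const& base, NodeIndex i) {
    assert(i.value < base.size());
    return base[i.value];
}

// Serialization is a flat byte copy: there are no pointers to fix up.
inline std::vector<unsigned char> serialize(std::vector<Node> const& nodes) {
    std::vector<unsigned char> bytes(nodes.size() * sizeof(Node));
    if (!bytes.empty()) std::memcpy(bytes.data(), nodes.data(), bytes.size());
    return bytes;
}

inline std::vector<Node> deserialize(std::vector<unsigned char> const& bytes) {
    std::vector<Node> nodes(bytes.size() / sizeof(Node));
    if (!bytes.empty()) std::memcpy(nodes.data(), bytes.data(), bytes.size());
    return nodes;
}
```

The sketch ignores byte order and versioning, which a real wire format between different machines would have to address.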
The graph object
• Exposed read-only data
– Vector of Operator objects
– Vector of EdgeIn objects
– Vector of EdgeOut objects
– Literal table and pool
• Private data (may be missing or elided)
– Vector of EdgeIn next links
– Vector of Operator BreadCrumbs
Discardable elements
• vecBc: BreadCrumbs vector
• vecNxt: EdgeIn sibling links
• LiteralPool hash table array
Graph vector details
Vector     Index Type      Element Type   Element Size
g.vecOp    OperatorIndex   Operator       16 bytes
g.vecOut   EdgeOutIndex    EdgeOut        8 bytes
g.vecIn    EdgeInIndex     EdgeIn         8 bytes
g.lit      LiteralKey      Literal        multiple of 8 bytes
g.vecNxt   EdgeInIndex     EdgeInIndex    4 bytes
g.vecBc    OperatorIndex   BreadCrumb     4 bytes
Connectivity: Operator objects
• Operator private members
– Operator’s edges are sub-vectors of g.vecIn, g.vecOut
– Start of EdgeIn objects: EdgeInIndex baseIn_;
– Start of EdgeOut objects: EdgeOutIndex baseOut_;
• Number of connections
– Inputs: vecOp[x+1].baseIn_ - vecOp[x].baseIn_
– Outputs: vecOp[x+1].baseOut_ - vecOp[x].baseOut_
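The base-difference arithmetic above can be sketched as follows. Two things here are assumptions, not statements from the slides: that `vecOp` ends with a sentinel element so `vecOp[x + 1]` always exists, and the helper names `numInputs` and `numOutputs`. The fields are public and lack the trailing underscore only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using OperatorIndex = std::uint32_t;
using EdgeInIndex = std::uint32_t;
using EdgeOutIndex = std::uint32_t;

// Only the two base fields from the slide; the real Operator is 16 bytes.
struct Operator {
    EdgeInIndex baseIn;    // start of this operator's EdgeIn sub-vector
    EdgeOutIndex baseOut;  // start of this operator's EdgeOut sub-vector
};

// Assumption: vecOp ends with a sentinel whose bases equal the sizes of the
// edge vectors, so the "next operator" always exists.
inline unsigned numInputs(std::vector<Operator> const& vecOp, OperatorIndex x) {
    assert(x + 1 < vecOp.size());
    return vecOp[x + 1].baseIn - vecOp[x].baseIn;
}

inline unsigned numOutputs(std::vector<Operator> const& vecOp, OperatorIndex x) {
    assert(x + 1 < vecOp.size());
    return vecOp[x + 1].baseOut - vecOp[x].baseOut;
}
```

Storing only the start of each sub-vector means no per-operator count fields are needed; the counts fall out of adjacent elements.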
Connectivity: EdgeIn objects
• EdgeIn private members
– Sink Operator: OperatorIndex dstOp_;
– Source EdgeOut: EdgeOutIndex src_;
• EdgeIn connection position
– Use pointer arithmetic: this - (vecIn + vecOp[dstOp_].baseIn_);
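The connection-position calculation can be sketched like this. On the slide the expression lives inside `EdgeIn` and uses `this`; the version below instead passes the graph explicitly through a hypothetical `position` member of a stand-in `Graph`, purely so the example is self-contained. The `EdgeOut` case is symmetric, using `srcOp` and `baseOut`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Operator {
    std::uint32_t baseIn;   // start of this operator's EdgeIn sub-vector
    std::uint32_t baseOut;  // start of this operator's EdgeOut sub-vector
};

struct EdgeIn {
    std::uint32_t dstOp;  // sink Operator
    std::uint32_t src;    // source EdgeOut
};
static_assert(sizeof(EdgeIn) == 8, "matches the 8-byte element size in the table");

// Stand-in for the graph object, holding just what the calculation needs.
struct Graph {
    std::vector<Operator> vecOp;
    std::vector<EdgeIn> vecIn;

    // Which input slot of its sink Operator does this edge occupy?
    // The edge must be an element of this graph's vecIn.
    std::ptrdiff_t position(EdgeIn const& e) const {
        EdgeIn const* first = vecIn.data() + vecOp[e.dstOp].baseIn;
        return &e - first;
    }
};
```

The position is never stored: it is recovered from the edge's own address, which keeps `EdgeIn` at two 32-bit fields.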
Connectivity: EdgeOut objects
• EdgeOut private members
– Source Operator: OperatorIndex srcOp_;
– Sink EdgeIn: EdgeInIndex dst_;
• EdgeOut connection position
– Use pointer arithmetic: this - (vecOut + vecOp[srcOp_].baseOut_);
Working with XG
Thin graph construction
Method / Effect
graph.add(BreadCrumb, Op, Locus, Expansion, unsigned nVarIn = 0, unsigned nVarOut = 0);
    Add an Operator and its Edge resources
graph.connect(OperatorIndex srcOp, unsigned srcPos, OperatorIndex dstOp, unsigned dstPos);
    Guarantee a srcOp[srcPos] to dstOp[dstPos] edge exists
Whole graph operations
Operation                                             Effect
Graph();                                              Construct an empty Graph
void done();                                          Topo sort and type check
Graph(Graph const& thinGraph, bool forSpu);           Partitioning constructor
BinStream& operator << (BinStream&, Graph const&);    Put to a BinStream (cheap)
BinStream& operator >> (BinStream&, Graph&);          Get from a BinStream (cheap)
void expand(bool forSpu, Environment const& env);     Expand, insert clones, etc.
Graph states and conversions
• Start with a “thin” graph
• Leader plus one representative node and dataslice
• Operators tagged with a locus and expansion rule
• Outputs can have multiple consumers
• Partition into leader-side & node-side subsets
• Expand based on loci and system topology
• Duplicate operators, adjust in and out arities, add sites
• Expand edges: fan-in, fan-out, parallel
• Introduce clones as needed
Graph overlay
• Template object publicly derived from Graph
• Macro hides lots of template boilerplate
• User supplied types for parallel vectors
– MyOperator ovOp[OperatorIndex]
– MyEdgeIn ovIn[EdgeInIndex]
– MyEdgeOut ovOut[EdgeOutIndex]
• Constructor shares vectors and LiteralTable
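One way the overlay could be shaped is sketched below. The stand-in `Graph` with `std::shared_ptr` members is an assumption chosen to make "shares vectors" concrete; the talk does not say how sharing was implemented, and the macro and LiteralTable are omitted. Only the member names `ovOp`, `ovIn` and `ovOut` come from the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

struct Operator { std::uint32_t baseIn, baseOut; };
struct EdgeIn { std::uint32_t dstOp, src; };
struct EdgeOut { std::uint32_t srcOp, dst; };

// Stand-in Graph: shared, immutable vectors so that copies are cheap.
struct Graph {
    std::shared_ptr<std::vector<Operator> const> vecOp;
    std::shared_ptr<std::vector<EdgeIn> const> vecIn;
    std::shared_ptr<std::vector<EdgeOut> const> vecOut;
};

// User annotations live in vectors parallel to the graph's own, indexed by
// the same OperatorIndex / EdgeInIndex / EdgeOutIndex values.
template <typename MyOperator, typename MyEdgeIn, typename MyEdgeOut>
struct Overlay : public Graph {
    explicit Overlay(Graph const& g)
        : Graph(g),  // shares the underlying vectors, copies no elements
          ovOp(g.vecOp->size()),
          ovIn(g.vecIn->size()),
          ovOut(g.vecOut->size()) {}

    std::vector<MyOperator> ovOp;
    std::vector<MyEdgeIn> ovIn;
    std::vector<MyEdgeOut> ovOut;
};
```

A pass that needs scratch data per operator or per edge can then keep it in the overlay and discard it afterwards, leaving the shared graph untouched.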
1973: Began 2-axis controller; I wrote every line of code (in assembler)
1975: First installation: 0.5 megawatt torch cutting up to ¾” steel plate at Marion Power Shovel
1975: Torch on… I was hooked!