Adaptive Dataflow
Joe Hellerstein UC Berkeley
Overview
• Trends Driving Adaptive Dataflow
• Lessons
  – networking
    • flow control, event programming, app-level routing
  – query processing
    • distributed & “shared nothing” parallel query systems
• Adaptive Dataflow
  – Rivers, Eddies
• Telegraph
  – Facts and Figures on the Web (FFW)
  – sensor-based information systems
  – software traces and adaptive distributed systems
• FFW Questions
Recent Trends
• 1990s: shift in the focus of academic CS
  – driving example applications changed
  – everybody on the information bandwagon
• 1990s: tightening of bottlenecks
  – data growing at double the rate of Moore’s Law
• 90’s systems R&D: Parallel & Distributed Information Services
  – infocentric, multi-user, highly available
  – “shared-nothing” clusters, not “parallelism” a la 1989
  – distributed data, not “distributed OS” a la 1989
Systems Research: Up or Down
• UP: Global Federations
  – internet services as procedure calls
    • with fees and lawyers!
  – B2B e-commerce has this problem today
    • Cohera examples
• DOWN: Sensor/Actuator Networks
  – UPC codes? Clickstream? Smart dust!
  – HUGE, noisy data volumes
  – new architectures, major challenges
Core Technology Not There Yet
• Key component: dataflow
  – The plumbing is coming
    • XML/http, WML/WAP, etc. give lowest-common-denominator (LCD) communication
    • glue at the boundaries
    • ho hum
  – Systems challenge: move the data
    • efficiently
    • robustly
    • intelligently
  – Language challenge too
    • programming and debugging tools
    • interfaces and economic models
What’s So Hard Here?
• Volatile regime
  – Data flows unpredictably from sources
  – Code performs unpredictably along flows
  – Continuous volatility due to many decentralized systems
• Lots of choices
  – Choice of services
  – Choice of machines
  – Choice of info: sensor fusion, data reduction, etc.
  – Order of operation
• Maintenance
  – Federated world
  – Partial failure is the common case
• Adaptivity required!
A Networking Problem!?
• Networks do dataflow!
• Significant history of adaptive techniques
  – E.g. TCP congestion control
  – E.g. routing
• But traditionally much lower function
  – Ship bitstreams
  – Minimal, fixed code
• Lately, moving up the foodchain?
  – app-level routing
  – active networks
  – politics of growth
    • assumption of complexity = assumption of liability
Networking Code as Dataflow?
• States & Events, Not Threads
  – Asynchronous events natural to networks
  – State machines in protocol specification and system code
  – Low-overhead, spreading to big systems
  – Totally different programming style
    • remaining area of hacker machismo
• Eventflow optimization
  – Can’t eventflow be adaptively optimized like dataflow?
  – Why didn’t that happen years ago?
  – Hold this thought
Query Plans are Dataflow Too
• Programming model: iterators
  – old idea, widely used in DB query processing
  – object with three methods:
    • Init(), GetNext(), Close()
  – input/output types
  – query plan: graph of iterators
    • pipelining: iterators that return results before children Close()
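A minimal sketch of the iterator model, assuming Python method names init/get_next/close in place of Init(), GetNext(), Close(). The Scan and Select classes and the run helper are hypothetical names chosen for illustration, and None is assumed to mark end of stream.

```python
class Scan:
    """Leaf iterator over an in-memory list (stand-in for a table scan)."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def init(self):
        self.pos = 0

    def get_next(self):
        # None signals end of stream (assumes rows themselves are never None).
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Select:
    """Pipelining filter: returns each qualifying row as soon as it is found,
    long before its child is closed."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


def run(plan):
    """Drive the root iterator of a plan to completion."""
    plan.init()
    out = []
    while True:
        row = plan.get_next()
        if row is None:
            break
        out.append(row)
    plan.close()
    return out


# Query plan: a two-node graph of iterators.
plan = Select(Scan([1, 2, 3, 4, 5, 6]), lambda r: r % 2 == 0)
result = run(plan)
```

Because every operator exposes the same three methods, operators compose into arbitrary plan graphs without knowing what their children are.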
Distributed/Parallel Databases?
• Query plans across machines
• Bloom filters, query optimization minimize bandwidth
  – Send lossily compressed signatures, not just data
  – Model network, disk, CPU costs in dataflow optimization
  – A “Distributed DB” contribution
• App-level optimization … code to data or data to code
  – DB research highest up the food chain
• Data partitioning, query opt. parallelizes dataflow
  – Pipeline & partition parallelism: natural!
  – Model resource consumption and response time
  – A “Parallel DB” contribution
• Challenge: move down the foodchain to serve all
  – Biggest problem: limited adaptivity
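The "lossily compressed signatures" bullet can be illustrated with a small Bloom-filter join sketch. This is an assumption-laden illustration, not the design of any particular distributed DBMS: the BloomFilter class, its sizes, and the site_a/site_b names are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit signature of a key set: false positives are possible,
    false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


# Site A ships a 1024-bit signature of its join keys instead of the keys' rows.
site_a_keys = [3, 7, 42]
signature = BloomFilter()
for k in site_a_keys:
    signature.add(k)

# Site B ships back only rows that might join; the real join then runs on
# this reduced set and discards any false positives.
site_b_rows = [(k, f"row{k}") for k in range(100)]
shipped = [row for row in site_b_rows if signature.might_contain(row[0])]
```

The bandwidth saving comes from sending a small signature one way and a filtered row set the other way, rather than a whole relation.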
Adaptive Systems: General Flavor
Repeat:
1. Observe (model) environment
2. Use observation to choose behavior
3. Take action
Adaptive Dataflow in DBs: History
• Rich But Unacknowledged History
  – Codd's data independence predicated on adaptivity!
    • adapt opaquely to changing schema and storage
  – Query optimization does it!
    • statistics-driven optimization
    • key differentiator between DBMSs and other systems
Adaptivity in Current DBs
• Limited & coarse grain
Repeat:
1. Observe (model) environment
  – runstats (once per week!!): model changes in data
2. Use observation to choose behavior
  – query optimization: fixes a single static query plan
3. Take action
  – query execution: blindly follow plan
Adaptive Query Processing Work
– Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97]
– Per Query: Mariposa [SA+96], ASE [CR94]
– Competition: RDB [AZ96]
– Inter-Op: [KD98], Tukwila [IF+99]
– Query Scrambling: [AF+96,UFA98]
• Survey: Hellerstein, Franklin, et al., IEEE Data Engineering Bulletin, 2000
[Figure: prior work placed along an axis labeled “Frequency of Adaptivity”. Labels: System R, Late Binding, Per Query, Competition & Sampling, Inter-Operator, Query Scrambling, Eddies, Ingres DECOMP, Future Work]
Some Solutions We’re Focusing On
• Rivers
  – Adaptive partitioning of work across machines
• Eddies
  – Adaptive ordering of pipelined operations
• Quality of Service
  – Online aggregation & data reduction: CONTROL
  – MUST have app-semantics
  – Often may want user interaction
    • UI models of temporal interest
• Data Dissemination
  – Adaptively choosing what to send, what to cache
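To make the online aggregation bullet concrete, here is a rough sketch of a running AVG estimate with a shrinking confidence interval. It is not the CONTROL implementation. It assumes rows can be visited in random order, uses Welford's running-variance update, and uses a plain normal-approximation interval with no finite-population correction. The function name and parameters are hypothetical.

```python
import math
import random


def online_average(values, seed=0, report_every=100, z=1.96):
    """Yield (rows_seen, running_mean, half_width) while scanning values
    in random order, so a user can stop as soon as the answer is good enough."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random order is what makes the prefix a fair sample

    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for v in order:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        if n > 1 and n % report_every == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err


estimates = list(online_average(range(1000)))
```

Each yielded triple is an answer the interface could display immediately; the half-width tells the user how much it might still move.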
Dataflow Parallelism in DBs
• Volcano: “exchange” iterator [Graefe]
  – encapsulate exchange logic in an iterator
  – not in the dataflow system
  – Box-and-arrow programming can ignore parallelism
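A simplified sketch of the exchange idea, assuming threads on one machine stand in for separate processes or nodes. The Exchange and ListScan classes are hypothetical and much simpler than Volcano's operator: the point is only that the consumer sees the ordinary init/get_next/close interface while producers run in parallel behind it.

```python
import queue
import threading


class ListScan:
    """Ordinary single-threaded iterator over a list."""

    def __init__(self, rows):
        self.rows = rows
        self.it = None

    def init(self):
        self.it = iter(self.rows)

    def get_next(self):
        return next(self.it, None)  # None marks end of stream

    def close(self):
        self.it = None


class Exchange:
    """Runs each child in its own thread and merges their output.
    Assumes the consumer drains the stream before calling close()."""

    _DONE = object()  # per-child end-of-stream marker

    def __init__(self, children, buffer_size=16):
        self.children = children
        self.buffer = queue.Queue(maxsize=buffer_size)
        self.threads = []
        self.live = 0

    def _produce(self, child):
        child.init()
        while True:
            row = child.get_next()
            if row is None:
                break
            self.buffer.put(row)  # blocks when the buffer is full (flow control)
        child.close()
        self.buffer.put(self._DONE)

    def init(self):
        self.live = len(self.children)
        self.threads = [
            threading.Thread(target=self._produce, args=(c,), daemon=True)
            for c in self.children
        ]
        for t in self.threads:
            t.start()

    def get_next(self):
        while self.live > 0:
            item = self.buffer.get()
            if item is self._DONE:
                self.live -= 1
                continue
            return item
        return None

    def close(self):
        for t in self.threads:
            t.join()


plan = Exchange([ListScan([1, 2, 3]), ListScan([4, 5, 6])])
plan.init()
rows = []
while True:
    r = plan.get_next()
    if r is None:
        break
    rows.append(r)
plan.close()
```

The arrival order of rows depends on thread scheduling, so callers should not rely on it.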
River
• We built the world’s fastest sorting machine
  – On the “NOW”: 100 Sun workstations + SAN
  – Only beat the record under ideal conditions
    • No such thing in practice!
• River: adaptive dataflow on clusters
  – One main idea: Distributed Queues
    • adaptive exchange operator
  – Simplifies management and programming
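The effect of a distributed queue can be sketched with a small discrete-event simulation. This is an illustration of the general pull-based idea, not River's actual mechanism: the function name, the cost model (fixed time units per item per consumer), and the two-consumer setup are assumptions.

```python
import heapq
from collections import deque


def run_distributed_queue(items, costs):
    """Simulate consumers pulling from one shared queue.
    costs[i] is the simulated time consumer i spends per item.
    Returns (items handled per consumer, time the last consumer finishes)."""
    pending = deque(items)
    counts = [0] * len(costs)
    # Heap of (time at which the consumer is next free, consumer id).
    free_at = [(0.0, i) for i in range(len(costs))]
    heapq.heapify(free_at)

    while pending:
        now, i = heapq.heappop(free_at)  # whoever is free first pulls next
        pending.popleft()
        counts[i] += 1
        heapq.heappush(free_at, (now + costs[i], i))

    finish = max(t for t, _ in free_at)
    return counts, finish


# One consumer is twice as slow as the other (e.g. a perturbed node).
counts, finish = run_distributed_queue(range(90), costs=[1.0, 2.0])

# A static 50/50 partition would be gated by the slow consumer's 45 items.
static_finish = 45 * 2.0
```

Because work is pulled rather than pre-assigned, the faster consumer ends up with more items and the whole run is not held back to the pace of the slowest node.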
River
Multi-Operator Query Plans
• Deal with pipelines of commutative operators
• Adapt at finer granularity than current DBMSs
Continuous Adaptivity: Eddies
• A pipelining tuple-routing iterator
  – just like join or sort or exchange
• Works great with other pipelining operators
  – like Ripple Joins, online reordering, etc.
Eddy: Avnur & Hellerstein, SIGMOD 2000
Continuous Adaptivity: Eddies
• How to order and reorder operators over time
  – based on performance, economic/admin feedback
• Vs. River:
  – River optimizes each operator “horizontally”
  – Eddies optimize a pipeline “vertically”
Eddy
Continuous Adaptivity: Eddies
• Adjusts flow adaptively
  – Tuples routed through ops in different orders
  – Visit each op once before output
• Naïve routing policy:
  – All ops fetch from eddy as fast as possible
    • A la River
  – Turns out, doesn’t quite work
    • Only measures rate of work, not benefit
An Aside: n-Arm Bandits
• A little machine learning problem:
  – Each arm pays off differently
  – Explore? Or Exploit?
• Sometimes want to randomly choose an arm
• Usually want to go with the best
• If probabilities are stationary, dampen exploration over time
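One common way to realize the explore/exploit trade-off above is an epsilon-greedy policy whose exploration rate shrinks over time. This is a generic sketch, not a policy taken from the talk: the function name, the 1/sqrt(t) schedule, and the Bernoulli payoffs are assumptions made for illustration.

```python
import random


def epsilon_greedy(payoff_probs, pulls, seed=0):
    """Play a multi-armed bandit with Bernoulli payoffs.
    Returns how many times each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(payoff_probs)
    counts = [0] * n_arms
    totals = [0.0] * n_arms

    for t in range(1, pulls + 1):
        epsilon = 1.0 / t ** 0.5  # dampen exploration over time
        if 0 in counts or rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick an arm at random
        else:
            # exploit: go with the best observed average payoff
            arm = max(range(n_arms), key=lambda a: totals[a] / counts[a])
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


pull_counts = epsilon_greedy([0.2, 0.5, 0.8], pulls=5000)
```

If the payoff probabilities were not stationary, the exploration rate would need a floor rather than decaying toward zero, which is the situation an eddy faces.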
Eddies with Lottery Scheduling
• Operator gets 1 ticket when it takes a tuple
  – Favor operators that run fast (low cost)
• Operator loses a ticket when it returns a tuple
  – Favor operators with high rejection rate
    • Low selectivity
• Lottery Scheduling:
  – When two ops vie for the same tuple, hold a lottery
  – Never let any operator go to zero tickets
    • Support occasional random “exploration”
• Set up “inflation” (forgetting) to adapt over time
  – E.g. tix’ = oldtix + newtix
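The ticket rules above can be sketched for the simplest case, a pipeline of filters. This is a single-threaded toy, not the eddy from the SIGMOD 2000 paper: the Op class, the eddy function, and especially the multiplicative decay factor used for forgetting are assumptions, and operator cost is not modeled at all.

```python
import random


class Op:
    """A filter operator with a ticket balance (hypothetical helper)."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1.0  # never allowed to fall to zero


def eddy(tuples, ops, seed=0, decay=0.99):
    """Route each tuple through every op once, choosing the order by lottery."""
    rng = random.Random(seed)
    output = []
    for tup in tuples:
        remaining = list(ops)  # ops this tuple has not visited yet
        alive = True
        while alive and remaining:
            # Lottery among eligible ops, weighted by ticket balance.
            weights = [op.tickets for op in remaining]
            op = rng.choices(remaining, weights=weights)[0]
            remaining.remove(op)
            op.tickets += 1  # credited for taking a tuple
            if op.predicate(tup):
                op.tickets -= 1  # debited for returning it to the eddy
            else:
                alive = False  # tuple rejected; the op keeps its ticket
        if alive:
            output.append(tup)
        # Assumed forgetting rule: old tickets decay, with a floor of one
        # so every op keeps a chance of being explored.
        for o in ops:
            o.tickets = max(1.0, o.tickets * decay)
    return output


evens = Op("evens", lambda x: x % 2 == 0)  # rejects half of the input
tens = Op("tens", lambda x: x % 10 == 0)   # rejects 90% of the input
out = eddy(range(1000), [evens, tens])
```

The more selective filter accumulates tickets and so tends to see tuples first, while the output is the same regardless of routing order because the filters commute.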
Promising!
• Initial performance results
• Ongoing work on proofs of convergence
  – have analysis for constrained case
To Be Continued
• Tune & formalize policy
• Competitive eddies
  – Source & Join selection
  – Requires duplicate management
• Parallelism
  – Eddies + Rivers?
• Reliability
  – Long-running flows
  – Rivers + RAID-style computation
[Figure: Eddy with inputs R1, R2, R3, S1, S2, S3. Other labels: hash, block, index1, index2]
To Be Continued, cont.
• What about wide area?
  – data reduction
  – sensor fusion
  – asynchronous communication
• Continuous queries
  – events
  – disconnected operation
• Lower-level eventflow?
  – can eddies, rivers, etc. be brought to bear on programming?
Telegraph: An Adaptive Dataflow System
• An adaptive dataflow system
• Currently cluster + http
  – Rivers and Eddies
  – Web wrappers
• Sensor nets next
• Target applications
  – Facts and Figures on the Web (FFW)
  – Distributed Introspection Services
  – Sensor Stream Services
w/Mike Franklin, Sirish Chandrasekaran, Amol Deshpande, Nick Lanham, Sam Madden, VijayShankar Raman, Mehul Shah
Facts & Figures on the Web
• “Deep” Web
  – “Hidden” Web, “Dark Matter”
• More interesting: Facts & figures, not text
  – “search” is not the main problem
    • “search” was always easy
    • ranking often not apropos to facts
  – combine, transform, summarize, analyze
FFW Election 2000
• Campaign Finance Drill-down
  – Bush/Gore donations
    • Personal and industrial
  – Industry data
  – Neighborhood data
  – Personal data
  – Historical voting data
• Live demo, online aggregation
http://ffw.cs.berkeley.edu
Web Research Revisited
• Crawling, Caching, Relationship Graphs
  – Transitive Closure
  – The Berkeley Bindings
    • & graph analysis?
  – Form identification & APIs
  – “Semantic” caching
• Socio-Techno-Legal Issues
  – Privacy: Statistical DBs + Federation
    • WhitePages |x| WhoIs |x| DoubleClick |x| CDC Wonder
  – Stats in the wrong hands
  – Accuracy of derived results
  – Intellectual property
    • Etc!
Summary
• Adaptive software systems must happen
  – federation & scaling require it
  – systems and stats must marry
• Dataflow programming natural
  – for many applications
  – best hope for large-scale apps
• Terrific nexus of research
  – DB, Networking, Learning/Stat
  – Lots of work to be done!
• Drop by the Telegraph FFW demo!
  – http://ffw.cs.berkeley.edu
Backup slides
• The rest of the slides are backup to answer questions…
Prior Progress in DB Adaptivity
• Per-query adaptivity
  – E.g. Mariposa [Sto95]
    • 1st distributed DBMS to consider scalability
    • economic APIs for federation, give limited adaptivity too
• One-time intra-query
  – DEC Rdb competition [AZ96]
  – Sampling [lots]
• Intra-query, inter-operator
  – “Query Scrambling”: reoptimize in face of delays [UFA98]
  – Kabra/DeWitt ‘98: dam the flow, reoptimize downstream