Post on 19-Dec-2015
Federated Facts and Figures
Joseph M. Hellerstein, UC Berkeley
Road Map
The Deep Web and the FFF
An Overview of Telegraph
Demo: Election 2000
From Tapping to Trawling
A Taste of Policy and Countermeasures
Delicious Snacks
Available in your browser, but not via hyperlinks Accessed via forms (press the “submit” button) Typically runs some code to generate data
E.g. call out to a database, or run some “servlet” Pretty-print results in HTML
Dynamic HTML
Estimated to be >400x larger than the “surface web”
Not accessible in the search engines Typically crawl hyperlinks only
Meet the “Deep Web”
Federated Facts and Figures
One part of the deep web: more full-text documents E.g. archived newspaper articles, legal documents, etc. Figure out how to fetch these, then add to search engine Various people working on this (e.g. CompletePlanet)
Another part: Facts and Figures I.e. structured database data Fetch is only the first challenge Want to combine (“federate”) these databases Want to search by criteria other than keywords Want to analyze the data en masse I.e. want full query power, not just search
Search was always easy Ranking not clearly appropriate here
Meet the FFF
http://telegraph.cs.berkeley.edu
Telegraph
An adaptive dataflow system Dataflow
siphon data from the deep web and other data pools harness data streaming from sensors and traces flow these data streams through code
Adaptive sensor nets & wide area networks: volatile! like Telegraph Avenue
needs to “be cool” with volatile mix from all over the world adaptive techniques route data to machines and code
marriage of queries, app-level route/filter, machine learning
First apps Facts and Figures Federation: Election 2000 Continuous queries on sensor nets Rich queries on Peer-to-Peer
Joe Hellerstein, Mike Franklin, & co.
Dataflow Commonalities
Dataflow at the heart of queries and networks Query engines move records through operators Networks move packets through routers Networked data-intensive apps an emerging middle ground
Database Systems: High-function, high integrity, carefully administered.
Compile intelligent query plans based on data models and statistical properties, query semantics.
Networks: Low-function, high availability, federated administration.
Adapt to performance variabilities, treat data and code as opaque for loose coupling.
Long-Running Dataflows on the FFF
Not precomputed like web indexes
Need online systems & apps for online performance goals
Subject of prior work in CONTROL project Combo of query processing, sampling/estimation, HCI
[Figure: plot against Time, up to 100%, comparing Online and Traditional]
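A minimal sketch of the online idea, assuming a running average over records that arrive in random order. The function name `online_average`, the reporting interval, and the normal-approximation error bound are illustrative assumptions, not taken from the CONTROL project:

```python
import math
import random

def online_average(values, report_every=1000, z=1.96):
    """Yield (rows_seen, running_mean, half_width) as records stream by.

    Assumes records arrive in random order, so the prefix seen so far
    acts as a random sample; half_width is a normal-approximation bound.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0:
            variance = m2 / (n - 1) if n > 1 else 0.0
            yield n, mean, z * math.sqrt(variance / n)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(50.0, 10.0) for _ in range(10_000)]  # synthetic data
    for seen, estimate, half_width in online_average(data):
        print(f"{seen:6d} rows: {estimate:.2f} +/- {half_width:.2f}")
```

The user sees an estimate early and watches the bound shrink, rather than waiting for the full scan to finish.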
Telegraph Architecture
Telegraph executes Dataflow Graphs
Extensible set of operators With extensible optimization rules
Data access operators TeSS: the Telegraph Screen Scraper Napster/Gnutella readers File readers
Data processing operators Selections (filters), Joins, Drill-Down/Roll-Up, Aggregation
Adaptivity Operators Eddies, STeMs, FLuX, etc.
Screen Scraping: TeSS
Screen scrapers do two things: Fetch: emulate a web user clicking Parse: extract info from resulting HTML/XML
Somebody has to train the screen scraper Need a separate wrapper for each site Some research work on making this process semi-automatic
TeSS is an open-source screen-scraper Available at http://telegraph.cs.berkeley.edu/tess Written by a (superstar) sophomore! Simple scripting interface targeted today Moving towards GUI for non-technical users (“by example”)
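A minimal sketch of the parse half of a screen scraper, using only the Python standard library. This is not the TeSS scripting interface; `TableRowParser`, `parse_results`, and the sample page are hypothetical stand-ins. The fetch half (submitting the form) is left out so the example runs offline:

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip header rows, which have no <td> cells
                self.rows.append(tuple(self._row))
            self._row = None
            self._cell = None

def parse_results(html):
    """Return the data rows of the tables in an HTML page as tuples."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows

# Made-up response page standing in for what a form submission returns.
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Donor</th><th>Amount</th></tr>
<tr><td>A. Smith</td><td>$250</td></tr>
<tr><td>B. Jones</td><td>$1,000</td></tr>
</table></body></html>
"""

if __name__ == "__main__":
    for row in parse_results(SAMPLE_PAGE):
        print(row)
```

Even in this toy form, the wrapper encodes site-specific layout knowledge, which is why each site needs its own.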
First Demo: Election 2000
From Tapping to Trawling
Telegraph allows users to pose rich queries over the deep web
But sometimes would like to be more aggressive: Preload a telegraph cache Access a variety of data for offline mining More (we’ll see soon!)
Want something like a webcrawler for FFF But FFF is too big. Want to “trawl” for interesting stuff hidden there.
[Figure: From Tapping to Trawling. Diagram labels: Infospace Name, Infospace Street, Anywho Name, Yahoo Maps, Eddy, DupElim; sample values “Smith” (Name) and “1600 Pennsylvania Avenue, DC” (Address)]
API Challenges in Trawling
Load APIs on the web today: service and silence Various policies at the servers, hard to learn No analogy to robots.txt (which is too limiting anyhow) Feedback can be delayed, painful
Solutions Be very conservative Make out-of-band (human) arrangements Both seem inefficient
Finding new sites to trawl is hard Have to wrap them: fetch is easyish, parse hardish
XML will help a little here Query? Or Update? Again, an API problem!
Imagine we auto-trawled AnyWho and WeSpamYou.com
Trawling “Domains”
Can now collect lists of: Names (First, Last), Addresses, Companies, Cities, States, etc. etc.
Can keep lists organized by site and in toto Allows for offline mining, etc.
Q: Do webgraph mining techniques apply to facts and figures?
Exploiting Enumerated Domains I
Can trawl any site on known domains! Suddenly the deep web is not so hidden.
In essence, we expand our trawl Can use pre-existing domains to trawl further Or, can add new sites to the trawl process
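A minimal sketch of trawl expansion over enumerated domains, assuming each site is modeled as a lookup from one input domain to records. The function `trawl`, the `PEOPLE` data, and the two lookups are hypothetical stand-ins for wrapped web forms, not Telegraph code:

```python
from collections import deque

def trawl(sites, seeds, max_probes=100):
    """Expand enumerated domains by probing sites with known values.

    sites: list of (input_domain, lookup) pairs; lookup(value) returns
           a list of {domain: value} records.
    seeds: {domain: iterable of starting values}
    max_probes: hard bound on the number of site lookups issued.
    """
    known = {domain: set(values) for domain, values in seeds.items()}
    frontier = deque((d, v) for d, vals in known.items() for v in sorted(vals))
    probes = 0
    while frontier:
        domain, value = frontier.popleft()
        for input_domain, lookup in sites:
            if input_domain != domain:
                continue
            if probes >= max_probes:
                return known  # stay conservative: stop at the probe budget
            probes += 1
            for record in lookup(value):
                for out_domain, out_value in record.items():
                    seen = known.setdefault(out_domain, set())
                    if out_value not in seen:
                        seen.add(out_value)  # new value: probe with it later
                        frontier.append((out_domain, out_value))
    return known

# Toy data standing in for two deep-web sites.
PEOPLE = {"Smith": ["12 Oak St", "9 Elm Ave"], "Lee": ["9 Elm Ave"]}

def name_to_addresses(name):
    return [{"address": a} for a in PEOPLE.get(name, [])]

def address_to_names(address):
    return [{"name": n} for n, addrs in PEOPLE.items() if address in addrs]

SITES = [("name", name_to_addresses), ("address", address_to_names)]

if __name__ == "__main__":
    print(trawl(SITES, {"name": ["Smith"]}))
```

Starting from one name, values returned by one site become probes for the other, so the known domains grow until nothing new appears or the budget runs out.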
Exploiting Enumerated Domains II
Trawling gets a sample (signature) of a site’s content Analogous to a random walk, but needs to be characterized better
Can identify that 2 sites have related subsets of domains
Helps with the query composition problem Rich query interfaces tend to be non-trivial
What sites to use? How to combine them? Imagine:
Traditional search engine experience to pick some sites System suggests how to join the sites in a meaningful way As you build the query, you always see incremental results
Refine query as the data pours in Berkeley CONTROL project has been working on incremental queries
Blends search, query, browse and mine
A Sampler of FFF Policy Issues
Statistical DB Security Issues
Facing the Power of the FFF “False” combinations Combination strength
What is trawling? Copying? So what?
Akamai for the deep web? Cracking?
Sampler of Countermeasures
Trawl detection And Distributed Trawl Detection
Metadata Watermarking Provenance, Lineage, Disclaimers
Stockpiling Spam
Delicious Snacks
"Concepts are delicious snacks with which we try to alleviate our amazement” -- A. J. Heschel, Man Is Not Alone
Technical Snacks
Adaptive Dataflow Systems + Learning
Incremental & continuous querying And online, bounded trawling Adds an HCI component to the above
FFF APIs, standards The wrapper-writing bottleneck: XML? Backoff APIs? Search vs. Update
Mining trawls
More Technical Snacks
Tie-ins with Security
Applications beyond FFF Sensors P2P Overlay Networks
Policy Questions
Presenting & Interpreting Data Not just search
Privacy: What is it, what’s it for?
Leading Indicators from the FFF
More?
http://telegraph.cs.berkeley.edu
jmh@cs.berkeley.edu
Collaborators:
Mike Franklin, Hal Varian -- UCB
Lisa Hellerstein & Torsten Suel -- Polytechnic
Sirish Chandrasekaran, Amol Deshpande, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah -- UCB
Backup Slides
Telegraph: Adaptive Dataflow
Mixed design philosophy: Tolerate loose coupling and partial failure Adapt online and provide best-effort results Learn statistical properties online Exploit knowledge of semantics via extensible optimization infrastructures
Target new networked, data-intensive applications
Adaptive Systems: General Flavor
Repeat:
1. Observe (model) environment
2. Use observation to choose behavior
3. Take action
Adaptive Dataflow in DBs: History
Rich But Unacknowledged History
Codd's data independence predicated on adaptivity! adapt opaquely to changing schema and storage
Query optimization does it! statistics-driven optimization key differentiator between DBMSs and other systems
Adaptivity in Current DBs
Limited & coarse grain
Repeat:
1. Observe (model) environment – runstats (once per week!!): model changes in data
2. Use observation to choose behavior – query optimization: fixes a single static query plan
3. Take action – query execution: blindly follow plan
What’s So Hard Here?
Volatile regime Data flows unpredictably from sources Code performs unpredictably along flows Continuous volatility due to many decentralized systems
Lots of choices Choice of services Choice of machines Choice of info: sensor fusion, data reduction, etc. Order of operation
Maintenance Federated world Partial failure is the common case
Adaptivity required!
Adaptive Query Processing Work
Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97]
Per Query: Mariposa [SA+96], ASE [CR94] Competition: RDB [AZ96] Inter-Op: [KD98], Tukwila [IF+99] Query Scrambling: [AF+96,UFA98]
Survey: Hellerstein, Franklin, et al., DE Bulletin 2000
[Figure: adaptive query processing work arranged by Frequency of Adaptivity. Labels: System R, Late Binding, Per Query, Competition & Sampling, Inter-Operator, Query Scrambling, Eddies, Ingres DECOMP, Future Work]
A Networking Problem!?
Networks do dataflow!
Significant history of adaptive techniques E.g. TCP congestion control E.g. routing
But traditionally much lower function Ship bitstreams Minimal, fixed code
Lately, moving up the foodchain? app-level routing active networks politics of growth
assumption of complexity = assumption of liability
Networking Code as Dataflow?
States & Events, Not Threads Asynchronous events natural to networks State machines in protocol specification and system code Low-overhead, spreading to big systems Totally different programming style remaining area of hacker machismo
Eventflow optimization Can’t eventflow be adaptively optimized like dataflow? Why didn’t that happen years ago? Hold this thought
Query Plans are Dataflow Too
Programming model: iterators
old idea, widely used in DB query processing
object with three methods: Init(), GetNext(), Close()
input/output types
query plan: graph of iterators
pipelining: iterators that return results before children Close()
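A minimal sketch of the iterator model, assuming in-memory records and using `None` as the end-of-stream marker. The class names `Scan` and `Select` and the helper `run` are illustrative, not from any particular engine:

```python
class Scan:
    """Leaf iterator over an in-memory list of records (records are not None)."""

    def __init__(self, records):
        self.records = records
        self.pos = None

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.records):
            return None  # end of stream
        record = self.records[self.pos]
        self.pos += 1
        return record

    def close(self):
        self.pos = None


class Select:
    """Filter iterator: passes through child records that satisfy predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        # Pipelined: returns the first qualifying record as soon as it is
        # found, long before the child has been exhausted or closed.
        while True:
            record = self.child.get_next()
            if record is None or self.predicate(record):
                return record

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan from the root: init, pull until exhausted, close."""
    plan.init()
    try:
        out = []
        while (record := plan.get_next()) is not None:
            out.append(record)
        return out
    finally:
        plan.close()


if __name__ == "__main__":
    plan = Select(Select(Scan(list(range(10))), lambda x: x % 2 == 0),
                  lambda x: x > 3)
    print(run(plan))
```

Because every operator exposes the same three methods, a plan is just a graph of objects, and the root pulls records through it one at a time.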
Clever Dataflow Tricks
Volcano: “exchange” iterator [Graefe]
encapsulate exchange logic in an iterator, not in the dataflow system
Box-and-arrow programming can ignore parallelism
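A minimal sketch of the exchange idea, assuming Python threads and a bounded queue as a stand-in for real process or machine boundaries. It is not Graefe's implementation; `ListScan` and `Exchange` are illustrative names:

```python
import queue
import threading

_DONE = object()  # sentinel a producer enqueues when its child is exhausted


class ListScan:
    """Leaf iterator over a list; None marks end of stream."""

    def __init__(self, records):
        self.records = records

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.records):
            return None
        self.pos += 1
        return self.records[self.pos - 1]

    def close(self):
        pass


class Exchange:
    """Iterator that hides parallel producers behind init/get_next/close."""

    def __init__(self, children, capacity=16):
        self.children = children
        self.capacity = capacity

    def init(self):
        self.queue = queue.Queue(maxsize=self.capacity)  # bounded: flow control
        self.live = len(self.children)  # producers that have not finished yet
        self.threads = [
            threading.Thread(target=self._produce, args=(child,), daemon=True)
            for child in self.children
        ]
        for thread in self.threads:
            thread.start()

    def _produce(self, child):
        child.init()
        try:
            while (record := child.get_next()) is not None:
                self.queue.put(record)
        finally:
            child.close()
            self.queue.put(_DONE)

    def get_next(self):
        while self.live > 0:
            item = self.queue.get()
            if item is _DONE:
                self.live -= 1
            else:
                return item
        return None

    def close(self):
        # Drain whatever is still in flight so blocked producers can finish.
        while self.get_next() is not None:
            pass
        for thread in self.threads:
            thread.join()


if __name__ == "__main__":
    plan = Exchange([ListScan([1, 2, 3]), ListScan([10, 20])])
    plan.init()
    while (record := plan.get_next()) is not None:
        print(record)
    plan.close()
```

The consumer above the exchange sees an ordinary iterator. The order in which records from different children interleave is not deterministic.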
Some Solutions We’re Focusing On
Rivers Adaptive partitioning of work across machines
Eddies Adaptive ordering of pipelined operations
Quality of Service Online aggregation & data reduction: CONTROL MUST have app-semantics Often may want user interaction
UI models of temporal interest
Data Dissemination Adaptively choosing what to send, what to cache
River
Berkeley built the world-record sorting machine On the NOW: 100 Sun workstations + SAN Only beat the record under ideal conditions
No such thing in practice! (Arpaci-Dusseau)^2 with Culler, Hellerstein, Patterson
River: adaptive dataflow on clusters One main idea: Distributed Queues adaptive exchange operator Simplifies management and programming Remzi Arpaci-Dusseau, Eric Anderson, Noah Treuhaft w/Culler, Hellerstein, Patterson, Yelick
Multi-Operator Query Plans
Deal with pipelines of commutative operators
Adapt at finer granularity than current DBMSs
Continuous Adaptivity: Eddies
A pipelining tuple-routing iterator just like join or sort or exchange
Works best with other pipelining operators like Ripple Joins, online reordering, etc.
Ron Avnur & Joe Hellerstein
Continuous Adaptivity: Eddies
How to order and reorder operators over time based on performance, economic/admin feedback
Vs.River: River optimizes each operator “horizontally” Eddies optimize a pipeline “vertically”
Continuous Adaptivity: Eddies
Adjusts flow adaptively Tuples routed through ops in different orders Visit each op once before output
Naïve routing policy: All ops fetch from eddy as fast as possible A la River Turns out, doesn’t quite work Only measures rate of work, not benefit
Lottery-based routing Uses “lottery scheduling” to address a bandit problem Kris Hildrum, et al. looking at formalizing this Various AI students looking at Reinforcement Learning
Competitive Eddies Throw in redundant data access and code modules!
An Aside: n-Arm Bandits
A little machine learning problem: Each arm pays off differently Explore? Or Exploit?
Sometimes want to randomly choose an arm Usually want to go with the best
If probabilities are stationary, dampen exploration over time
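A minimal sketch of the explore/exploit trade-off, assuming an epsilon-greedy policy whose exploration rate decays as 10/t. The policy, the decay schedule, and the name `play_bandit` are illustrative choices, not something the slides prescribe:

```python
import random

def play_bandit(payoff_probs, pulls=5000, seed=0):
    """Pull arms with stationary payoff probabilities; return pulls per arm."""
    rng = random.Random(seed)
    counts = [0] * len(payoff_probs)
    totals = [0.0] * len(payoff_probs)
    for t in range(1, pulls + 1):
        epsilon = min(1.0, 10.0 / t)  # dampen exploration over time
        if rng.random() < epsilon or 0 in counts:
            # Explore: pick an arm at random (also until every arm was tried).
            arm = rng.randrange(len(payoff_probs))
        else:
            # Exploit: pick the arm with the best observed average payoff.
            arm = max(range(len(counts)), key=lambda a: totals[a] / counts[a])
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

if __name__ == "__main__":
    print(play_bandit([0.2, 0.5, 0.8]))
```

With stationary payoffs, the pull counts tend to concentrate on the best arm as exploration fades. If payoffs drift, as in a volatile dataflow, a decaying exploration rate is no longer safe.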
Eddies with Lottery Scheduling
Operator gets 1 ticket when it takes a tuple Favor operators that run fast (low cost)
Operator loses a ticket when it returns a tuple Favor operators with high rejection rate (low selectivity)
Lottery Scheduling: When two ops vie for the same tuple, hold a lottery Never let any operator go to zero tickets
Support occasional random “exploration”
Set up “inflation” (forgetting) to adapt over time E.g. tix’ = oldtix + newtix
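A minimal single-threaded sketch of lottery-based routing, assuming every operator is a filter. Because nothing runs concurrently here, the sketch captures only the selectivity effect, not operator speed, and it omits the inflation (forgetting) step. `Op` and `run_eddy` are illustrative names, not the Telegraph implementation:

```python
import random

class Op:
    """A filter operator with a ticket balance."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1  # never allowed to reach zero


def run_eddy(tuples, ops, seed=0):
    """Route each tuple through every op once, in lottery-chosen order."""
    rng = random.Random(seed)
    output = []
    for tup in tuples:
        pending = list(ops)  # ops this tuple has not visited yet
        alive = True
        while pending and alive:
            # Hold a lottery among the eligible ops, weighted by tickets.
            weights = [op.tickets for op in pending]
            op = rng.choices(pending, weights=weights, k=1)[0]
            pending.remove(op)
            op.tickets += 1  # credit: the operator takes a tuple
            if op.predicate(tup):
                # Debit: the tuple is returned to the eddy.
                op.tickets = max(1, op.tickets - 1)
            else:
                alive = False  # rejected: the tuple leaves the dataflow
        if alive:
            output.append(tup)
    return output


if __name__ == "__main__":
    even = Op("even", lambda x: x % 2 == 0)
    small = Op("small", lambda x: x < 10)
    print(run_eddy(range(100), [even, small]))
    print({op.name: op.tickets for op in (even, small)})
```

The result is the same whatever the routing order. Only the work done changes: an operator that rejects many tuples keeps its tickets and so wins more lotteries, which moves it earlier in the effective order.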
Promising!
Initial performance results
Ongoing work on proofs of convergence have analysis for constrained case
To Be Continued
Tune & formalize policy
Competitive eddies Source & Join selection Requires duplicate management
Parallelism Eddies + Rivers?
Reliability Long-running flows Rivers + RAID-style computation
[Figure: Eddy diagram. Labels: R1, R2, R3, S1, S2, S3, hash, block, index1, index2]
To Be Continued, cont.
What about wide area? data reduction sensor fusion asynchronous communication
Continuous queries events disconnected operation
Lower-level eventflow? can eddies, rivers, etc. be brought to bear on programming?