MLconf NYC Shan Shan Huang

Date post: 01-Nov-2014
Transcript
Page 1: MLconf NYC Shan Shan Huang

Smart database for next-generation applications

LOGICBLOX - SIMPLIFYING YOUR DATA STACK

MLConf NY, 2014.04.11

Page 2: MLconf NYC Shan Shan Huang

AREN’T THERE ENOUGH DATABASES?

©2014. LogicBlox. All Rights Reserved.

Page 3: MLconf NYC Shan Shan Huang

IN 2007 THE SMARTPHONE UNIFIED CONSUMER DEVICES

Page 4: MLconf NYC Shan Shan Huang

Is a similar revolution coming in databases?

Page 5: MLconf NYC Shan Shan Huang

OUR MISSION

▪ Be the iPhone of databases
  ▪ “Hybrid Transaction Analytical Processing”, Gartner, Jan. 2014
▪ One database to replace many specialized databases
  ▪ Transactional (e.g. Oracle, VoltDB, NuoDB)
  ▪ Analytical (e.g. Teradata, Redshift, Hadoop)
  ▪ Graphs
  ▪ Documents
  ▪ ...

Footnote: for a certain class of applications

Page 7: MLconf NYC Shan Shan Huang

SHOW ME

Page 8: MLconf NYC Shan Shan Huang

FIRST THINGS FIRST

▪ Declarative query language
  ▪ Based on Datalog
▪ ACID transactions
  ▪ In fact… full serializability
▪ Built from scratch, not by stitching together different databases under the hood

Page 9: MLconf NYC Shan Shan Huang

CLIQUES IN LOGIQL

3-Clique (Triangle Queries)

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

4-Clique

4cliques(a, b, c, d) <-
    edge(a, b),
    edge(a, c),
    edge(a, d),
    edge(b, c),
    edge(b, d),
    edge(c, d).
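To make the meaning of the two rules concrete, here is a minimal Python sketch that evaluates them by brute force over a set of directed `(source, target)` pairs. The function names `three_cliques` and `four_cliques` and the `succ` index are illustrative assumptions, not part of LogiQL or of the LogicBlox engine, and the nested loops say nothing about how the database actually executes the query.

```python
def three_cliques(edge):
    """Evaluate 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c)."""
    edge = set(edge)
    # succ[a] is the set of all c with edge(a, c); hypothetical helper index.
    succ = {}
    for a, b in edge:
        succ.setdefault(a, set()).add(b)
    return {(a, b, c)
            for (a, b) in edge      # edge(a, b)
            for c in succ[a]        # edge(a, c)
            if (b, c) in edge}      # edge(b, c)


def four_cliques(edge):
    """Evaluate the 4cliques rule: extend every triangle with a vertex d."""
    edge = set(edge)
    succ = {}
    for a, b in edge:
        succ.setdefault(a, set()).add(b)
    return {(a, b, c, d)
            for (a, b, c) in three_cliques(edge)        # edge(a,b), edge(a,c), edge(b,c)
            for d in succ[a]                            # edge(a, d)
            if (b, d) in edge and (c, d) in edge}       # edge(b, d), edge(c, d)


# Complete graph on four vertices, each edge stored once from low to high id.
k4 = {(i, j) for i in range(1, 5) for j in range(1, 5) if i < j}
triangles = three_cliques(k4)
quads = four_cliques(k4)
```

Because each edge is stored in one direction only, every clique is reported exactly once, which is the same effect the ordering filter in the SPARQL version achieves explicitly.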

Page 10: MLconf NYC Shan Shan Huang

3 CLIQUE in LOGIQL vs. SQL

SQL

SELECT DISTINCT
    v1.x AS x, v2.x AS y, v3.x AS w
FROM edge AS v1, edge AS v2, edge AS v3
WHERE
    v1.y = v2.x
    AND v2.y = v3.x
    AND EXISTS(
        SELECT 1 FROM edge AS vv1
        WHERE vv1.x = v1.x AND vv1.y = v3.x);

LogiQL

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

Page 11: MLconf NYC Shan Shan Huang

3 CLIQUE in LOGIQL vs. SPARQL

SPARQL

sparql PREFIX g: <http://logicblox.com/graph>
SELECT DISTINCT ?av ?bv ?cv FROM <$database>
WHERE {
    ?a g:edge ?b .
    ?a g:edge ?c .
    ?b g:edge ?c .
    ?a g:value ?av .
    ?b g:value ?bv .
    ?c g:value ?cv .
    FILTER (xsd:int(?av) < xsd:int(?bv) and
            xsd:int(?bv) < xsd:int(?cv))
};

LogiQL

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

Page 12: MLconf NYC Shan Shan Huang

3-CLIQUE IN LOGIQL vs. GRAPHLAB

GraphLab - C++

class triangle_count
    : public graphlab::ivertex_program<graph_type, set_union_gather> {
 public:
  bool do_not_scatter;

  // Gather on all edges
  edge_dir_type gather_edges(icontext_type& context,
                             const vertex_type& vertex) const {
    return graphlab::ALL_EDGES;
  }

  gather_type gather(icontext_type& context, const vertex_type& vertex,
                     edge_type& edge) const {
    set_union_gather gather;
    graphlab::vertex_id_type otherid =
        edge.target().id() == vertex.id() ? edge.source().id()
                                          : edge.target().id();
    size_t other_nbrs =
        (edge.target().id() == vertex.id())
            ? (edge.source().num_in_edges() + edge.source().num_out_edges())
            : (edge.target().num_in_edges() + edge.target().num_out_edges());
    size_t my_nbrs = vertex.num_in_edges() + vertex.num_out_edges();
    if (PER_VERTEX_COUNT || (other_nbrs > my_nbrs) ||
        (other_nbrs == my_nbrs && otherid > vertex.id())) {
      gather.v = otherid;
    }
    return gather;
  }

  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& neighborhood) {
    do_not_scatter = false;
    if (neighborhood.vid_vec.size() == 0) {
      vertex.data().vid_set.clear();
      if (neighborhood.v != (graphlab::vertex_id_type(-1)))
        vertex.data().vid_set.vid_vec.push_back(neighborhood.v);
    } else
      vertex.data().vid_set.assign(neighborhood.vid_vec);
    do_not_scatter = vertex.data().vid_set.size() == 0;
  }

  edge_dir_type scatter_edges(icontext_type& context,
                              const vertex_type& vertex) const {
    if (do_not_scatter)
      return graphlab::NO_EDGES;
    else
      return graphlab::OUT_EDGES;
  }

  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    const vertex_data_type& srclist = edge.source().data();
    const vertex_data_type& targetlist = edge.target().data();
    if (targetlist.vid_set.size() < srclist.vid_set.size())
      edge.data() += count_set_intersect(targetlist.vid_set, srclist.vid_set);
    else
      edge.data() += count_set_intersect(srclist.vid_set, targetlist.vid_set);
  }
};

LogiQL

3cliques(a, b, c) <-
    edge(a, b),
    edge(a, c),
    edge(b, c).

Page 13: MLconf NYC Shan Shan Huang

4 CLIQUE - SYNTHETIC DATA

Page 14: MLconf NYC Shan Shan Huang

4 CLIQUE - REAL DATA

Page 15: MLconf NYC Shan Shan Huang

SEMANTIC WEB - LUBM

Page 16: MLconf NYC Shan Shan Huang

DATA WAREHOUSE - TPC-H

Page 17: MLconf NYC Shan Shan Huang

A NON-TRIVIAL EXAMPLE: PAGERANK IN LOGIQL

d[] = 0.85f.          // damping factor
tolerance[] = 0.01f.  // stop once the change in pr is small enough

pr[p] = 1.0f / node_count[] <- node(p), !pr[p] = _.  // initial pr

pr[p] = (1.0f - d[]) + (d[] * sum[p]) <-
    r = (1.0f - d[]) + (d[] * sum[p]),
    abs[r - pr[p]] > tolerance[].

pr[p] = pr[p] <-
    r = (1.0f - d[]) + (d[] * sum[p]),
    !(abs[r - pr[p]] > tolerance[]).

pr[p] = pr[p] <- !sum[p] = _.

sum[n] = t <-
    agg<< t = total(r) >>
    edge(p, n),
    r = pr[p] / out_count[p].
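For readers less familiar with Datalog-style recursion, the rules above can be read as a fixed-point loop. The following Python sketch mirrors them under stated assumptions: the function name `pagerank`, the round-by-round update order, and the `max_rounds` safety cap are illustrative choices of this sketch, not something the slide specifies, and the LogicBlox engine may evaluate the rules quite differently.

```python
def pagerank(nodes, edges, d=0.85, tolerance=0.01, max_rounds=1000):
    """Iterate the slide's rules until no rank changes by more than tolerance."""
    edges = set(edges)

    # out_count[p]: number of outgoing edges of p.
    out_count = {}
    for p, _ in edges:
        out_count[p] = out_count.get(p, 0) + 1

    # Initial pr: 1 / node_count for every node.
    pr = {p: 1.0 / len(nodes) for p in nodes}

    for _ in range(max_rounds):  # max_rounds is an assumed safety cap
        # sum[n] = total of pr[p] / out_count[p] over all edges (p, n).
        total = {}
        for p, n in edges:
            total[n] = total.get(n, 0.0) + pr[p] / out_count[p]

        changed = False
        new_pr = dict(pr)  # nodes without a sum keep their old rank
        for n, s in total.items():
            r = (1.0 - d) + d * s
            if abs(r - pr[n]) > tolerance:
                new_pr[n] = r  # change is large enough: take the new rank
                changed = True
        pr = new_pr
        if not changed:
            break
    return pr


# Two nodes pointing at each other converge towards rank 1.0 each.
cycle = pagerank({"a", "b"}, {("a", "b"), ("b", "a")})
# A node with no incoming edges keeps its initial rank.
chain = pagerank({"a", "b"}, {("a", "b")})
```

Note that the formula uses `(1 - d)` rather than `(1 - d) / N`, as on the slide, so ranks are not normalized to sum to one.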

Page 18: MLconf NYC Shan Shan Huang

HOW DOES IT WORK

Page 19: MLconf NYC Shan Shan Huang

ALGORITHMS FIRST

Computer Science @CompSciFact Sep 28

“Computer science is now about systems. It hasn’t been about algorithms since the 1960’s.” -- Alan Kay #hlf13

Page 20: MLconf NYC Shan Shan Huang

PHILOSOPHY: BRAINS BEFORE BRAWN

▪ Algorithmic scalability
  ▪ New worst-case optimal join algorithm
  ▪ Incremental maintenance proportional to trace edit distance
  ▪ Adaptive domain decomposition for parallelization
▪ Data structures
  ▪ Compression close to the information-theoretic limit in some cases
  ▪ I/O minimization, cache consciousness
  ▪ Persistent data structures: full serializability, branch & merge, auditability, scalable distribution
▪ Unified declarative programming model
  ▪ Optimizations through aggressive analysis
▪ Brute force
  ▪ In-memory when data fits
  ▪ Distribution across thousands of cores, and GPUs

Page 21: MLconf NYC Shan Shan Huang

A SMART JOIN ALGORITHM - LFTJ

▪ “Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm”, T. Veldhuizen, ICDT 2014
  ▪ Best Newcomer Award

Page 22: MLconf NYC Shan Shan Huang

LFTJ INTUITION: CONSIDER MORE THAN PAIRS

▪ Widely adopted technique: pair-wise joins
▪ Suppose A, B, and C each have 1 million records distributed over 3 months
  ▪ Pair-wise join: best case scenario, 0.5 million records as intermediate results
  ▪ LFTJ: no records materialized

[Figure: records of A(x), B(x), and C(x) laid out across Jan, Feb, and Mar]
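The single-variable case of this idea can be sketched in a few lines of Python. The code below is a simplified illustration of a leapfrog-style intersection of sorted unary relations, assuming in-memory sorted lists and binary search for the seek step; the function name `leapfrog_intersect` is hypothetical, and the full trie-based algorithm from the paper handles multi-variable joins, which this sketch does not.

```python
from bisect import bisect_left


def leapfrog_intersect(*relations):
    """Intersect all relations at once, never building pairwise intermediates."""
    rels = [sorted(set(r)) for r in relations]
    if not rels or any(not r for r in rels):
        return []

    k = len(rels)
    pos = [0] * k                      # current position in each relation
    hi = max(r[0] for r in rels)       # largest key any iterator is sitting on
    agree = 0                          # how many iterators in a row sit on hi
    i = 0
    out = []
    while True:
        r = rels[i]
        # Seek: jump to the least key >= hi, skipping everything in between.
        pos[i] = bisect_left(r, hi, pos[i])
        if pos[i] == len(r):
            return out                 # one relation is exhausted: done
        key = r[pos[i]]
        if key == hi:
            agree += 1
            if agree >= k:             # every relation contains hi
                out.append(key)
                pos[i] += 1
                if pos[i] == len(r):
                    return out
                hi = r[pos[i]]
                agree = 1
        else:
            hi = key                   # overshoot: this key is the new target
            agree = 1
        i = (i + 1) % k


A = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
B = [0, 2, 6, 7, 8, 9]
C = [2, 4, 5, 8, 10]
common = leapfrog_intersect(A, B, C)
```

A pair-wise plan would first compute all of A ∩ B (five values here) before discarding most of them against C; the loop above only ever holds one candidate key at a time.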

Page 23: MLconf NYC Shan Shan Huang

SMARTER INCREMENTAL VIEW MAINTENANCE

▪ “Incremental Maintenance for Leapfrog Triejoin”, T. Veldhuizen, 2013
  ▪ http://arxiv.org/abs/1303.5313
▪ Replaced our implementation of the Count and DRed algorithms [Gupta+ 93]
▪ Guarantees that the work done is proportional to the trace edit distance between the before and after
  ▪ Critical for caching analytical views for performance while still incorporating real-time updates

Page 24: MLconf NYC Shan Shan Huang

INCREMENTALIZING 3 CLIQUE VIEW

DRed - Syntactic

+3cliques(a, b, c) <-
    +edge(a, b), edge(a, c), edge(b, c).

+3cliques(a, b, c) <-
    edge(a, b), +edge(a, c), edge(b, c).

+3cliques(a, b, c) <-
    edge(a, b), edge(a, c), +edge(b, c).

LogicBlox - Algebraic

3cliques(a, b, c) <-
    edge(a, b), edge(a, c), edge(b, c).

[Figure: edge(a, b), edge(a, c), edge(b, c)]
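As a rough illustration of the three delta rules for insertions, the Python sketch below derives the new triangles directly from a batch of inserted edges and compares the result with full recomputation. It assumes the inserted edges are not already present and that `edge` in each rule refers to the relation after the insert; the names `triangles`, `delta_triangles`, and `index` are hypothetical, and the sketch does not model the trace-based algebraic approach.

```python
def index(edge):
    """Build successor and predecessor indexes for a set of directed edges."""
    succ, pred = {}, {}
    for a, b in edge:
        succ.setdefault(a, set()).add(b)
        pred.setdefault(b, set()).add(a)
    return succ, pred


def triangles(edge):
    """Full evaluation of 3cliques(a, b, c) <- edge(a, b), edge(a, c), edge(b, c)."""
    succ, _ = index(edge)
    return {(a, b, c) for (a, b) in edge for c in succ[a] if (b, c) in edge}


def delta_triangles(edge_after, inserted):
    """Union of the three delta rules, one per position of the inserted edge."""
    succ, pred = index(edge_after)
    empty = set()
    out = set()
    for x, y in inserted:
        # +edge(a, b), edge(a, c), edge(b, c) with (a, b) = (x, y)
        out |= {(x, y, c) for c in succ.get(x, empty) & succ.get(y, empty)}
        # edge(a, b), +edge(a, c), edge(b, c) with (a, c) = (x, y)
        out |= {(x, b, y) for b in succ.get(x, empty) & pred.get(y, empty)}
        # edge(a, b), edge(a, c), +edge(b, c) with (b, c) = (x, y)
        out |= {(a, x, y) for a in pred.get(x, empty) & pred.get(y, empty)}
    return out


before = {(1, 2), (1, 3), (3, 4)}
inserted = {(2, 3), (1, 4)}
after = before | inserted
incremental = delta_triangles(after, inserted)
recomputed = triangles(after) - triangles(before)
```

The incremental version only touches neighbourhoods of the inserted edges, whereas the recomputation scans the whole relation twice.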

Page 25: MLconf NYC Shan Shan Huang

INCREMENTAL MAINTENANCE OF 4-CLIQUE

Page 26: MLconf NYC Shan Shan Huang

A PARTICULAR USE CASE OF LB FOR GRAPHS

Page 27: MLconf NYC Shan Shan Huang

SCREAMING FAST PROGRAM ANALYSIS

▪ An order of magnitude faster than prior art
▪ Program analysis is graph analysis
  ▪ “Strictly Declarative Specification of Sophisticated Points-to Analyses” (OOPSLA ’09)
  ▪ “Exception Analysis and Points-to Analysis - Better Together” (ISSTA ’09)
  ▪ “Pick Your Context Well - Understanding Object-Sensitivity” (POPL ’11)
  ▪ “Efficient and Effective Handling of Exceptions in Java Points-to Analysis” (CC ’13)
  ▪ “Hybrid Context Sensitivity for Points-to Analysis” (PLDI ’13)
  ▪ “Set-based Pre-processing for Points-to Analysis” (OOPSLA ’13)

Page 28: MLconf NYC Shan Shan Huang

PROGRAM ANALYSIS IS ALL ABOUT GRAPH ANALYSIS

Page 29: MLconf NYC Shan Shan Huang

COMPARED TO PRIOR ART: >10x

Page 30: MLconf NYC Shan Shan Huang

...AND THAT WAS ON PRIOR ART LOGICBLOX

Page 31: MLconf NYC Shan Shan Huang

RECAP

▪ LogicBlox: the iPhone of databases
  ▪ But perhaps the $10k camera of graph queries?
▪ Holy Grails
  ▪ Declarative query language: LogiQL
  ▪ ACID transactions
▪ Guiding Principle: Brains before Brawn
  ▪ Innovate on algorithms: LFTJ, incremental view maintenance, etc.
  ▪ Innovate on data structures
  ▪ Declarative language allows aggressive optimizations
  ▪ Brute force when necessary

Page 32: MLconf NYC Shan Shan Huang

THANK YOU
