Graph Analysis at Linkfluence Why Cascalog Introduction to Cascalog Conclusion
Using Cascalog and Hadoop for rapid graph processing and exploration
Nils Grunwald and Hugo Zanghi
Linkfluence
2012-02-05 - FOSDEM 2012 - Graph Devroom
Nils Grunwald and Hugo Zanghi Linkfluence
Using Cascalog and Hadoop for rapid graph processing and exploration
Outline
Graph Analysis at Linkfluence
Why Cascalog
Introduction to Cascalog
Conclusion
What we do at Linkfluence
- Web data mining (blogs, media, etc.)
- Social network data mining (Twitter, Facebook)
- Use this data to build various search engines
- Visualize the data with various UIs (Gephi, maps, etc.)
What we get
- Lots of nodes (users, pages, websites, words)
- Lots of edges (hyperlinks, comments, RTs, co-occurrences)
- These datasets are interconnected (Twitter users link to pages, words occur everywhere)
The problem
- Collecting and processing this data as a graph is not the primary goal of our system
- But it is a very rich dataset that we want to explore for R&D purposes
The constraints
- The graph processing should not compromise the rest of the system
- Low maintenance
- Used for queries and rapid prototyping
- Flexible: it is hard to tell beforehand which fields or metadata will be used
What is Cascalog
- Built on top of Hadoop and Cascading (workflow management)
- Inspired by the Datalog query syntax
- Hosted on the JVM by the Clojure language
Hadoop for reliability and scalability
- Reliable and scalable
- Everything is dumped in text files; we reuse our existing rsyslog infrastructure
- We can reuse the existing Hadoop instances of our system
- No need to know beforehand about indexed fields or to have data in a perfectly uniform format
Datalog for rapid prototyping
- Subset of Prolog
- Declarative, expressive and very concise way of writing queries
- Prolog has long been used for making queries over graphs
Clojure for flexibility
- Only one language and one file for queries and business logic
- Tasks unrelated to data processing are possible inside the queries (resolving shortened links, for example)
- Allows complex algorithms to be concisely expressed
The downsides
- Slow compared to in-memory computation or a non-distributed graph DB
- Cannot do real-time processing
Use-cases
- Post-processing on a large number of edges
- Filtering or transforming a dataset before exporting to Gephi or Neo4j
- Back-processing old data with inconsistent fields and merging datasets from different sources
Using Cascalog
- Declarative syntax
- Order of statements is arbitrary
- Syntax is LISP-like
- Operations are based on tuples
- Possibility to control the flow with custom operators (filter, mapcat, etc.)
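The tuple-based model in the list above can be illustrated in plain Clojure, without Cascalog or Hadoop. This is only a local sketch: the `docs` data, the `[doc-id text]` layout and the function name `doc->word-tuples` are hypothetical. A mapcat-style operation emits several output tuples per input tuple, and a filter-style operation keeps or drops each tuple.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input tuples of the form [doc-id text].
(def docs [[1 "graph data graph"]
           [2 "data mining"]])

(defn doc->word-tuples
  "Emits one [doc-id word] tuple per word: one input tuple, many outputs."
  [[doc-id text]]
  (for [word (str/split text #"\s+")]
    [doc-id word]))

;; mapcat-style step: flattens the per-document tuples into one sequence
(def word-tuples (mapcat doc->word-tuples docs))

;; filter-style step: keep only tuples whose word has more than four characters
(def long-words (filter (fn [[_ word]] (> (count word) 4)) word-tuples))
```

In Cascalog the same two roles are played by custom operators applied to query variables, with the work distributed over Hadoop instead of run on an in-memory sequence.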
Anatomy of a Cascalog query (Aggregation)
Example (in-degree from cascalog.graph.core)
;; namespace setup assumed by the example (not shown on the slide)
(use 'cascalog.api)
(require '[cascalog.ops :as c])

(defn in-degree               ;; just a normal function
  "computes the in degrees"   ;; docstring
  [edges]
  (<-                         ;; returns a cascalog query
    [?dst ?in_d]              ;; returned tuple
    (edges ?dst _)            ;; destructuring on a generator
    (:distinct false)
    (c/count :> ?in_d)))      ;; infers aggregation on ?dst
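To make the result of the aggregation concrete, here is a minimal in-memory sketch in plain Clojure of what the in-degree query computes. It is not Cascalog code: `sample-edges` and `in-degree-local` are hypothetical names, and the `[dst src]` tuple layout is an assumption read off the `(edges ?dst _)` pattern.

```clojure
;; Made-up edge tuples, assumed to be laid out as [dst src].
(def sample-edges
  [["b" "a"]
   ["b" "c"]
   ["c" "a"]
   ["b" "a"]])   ; repeated tuple: a plain count includes it

(defn in-degree-local
  "Returns a map of destination node -> number of edge tuples pointing to it."
  [edges]
  (frequencies (map first edges)))

(in-degree-local sample-edges)
```

Grouping by the destination and counting is what the query expresses declaratively: `?dst` is the only non-aggregated output variable, so the count is taken per destination.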
Anatomy of a Cascalog query (Filtering)
Example (filtering on in-degree)
(defn filtered-nodes
  [edges threshold]
  ;; compute in-degree as a subquery
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      ;; filters on computed in-degree
      (> ?in-deg threshold)
      ;; uses previous subquery as a generator
      (in-degrees ?node-id ?in-deg))))
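The subquery-then-filter structure can be mirrored locally in plain Clojure. This is a sketch under the same assumptions as before (hypothetical names, edges assumed to be `[dst src]` tuples), not the Cascalog implementation.

```clojure
;; Made-up edge tuples, assumed to be laid out as [dst src].
(def sample-edges [["b" "a"] ["b" "c"] ["c" "a"]])

(defn in-degree-local
  "Local stand-in for the in-degree subquery: node -> in-degree."
  [edges]
  (frequencies (map first edges)))

(defn filtered-nodes-local
  "Keeps only the nodes whose in-degree is strictly above threshold."
  [edges threshold]
  (let [in-degrees (in-degree-local edges)]   ; computed first, then reused
    (into {} (filter (fn [[_ deg]] (> deg threshold))) in-degrees)))

(filtered-nodes-local sample-edges 1)
```

As in the query, the intermediate result is bound with `let` and then consumed as the input of the next step.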
Under the hood, this happens...
Example (using custom filter ops)
(deffilterop over-threshold
  [deg threshold]
  (> deg threshold))

(defn filtered-nodes
  [edges threshold]
  (let [in-degrees (in-degree edges)]
    (<-
      [?node-id ?in-deg]
      (in-degrees ?node-id ?in-deg)
      ;; use custom operator
      (over-threshold ?in-deg threshold))))
Anatomy of a Cascalog query (Join)
Example (joining on heterogeneous datasets)
(import 'java.net.URL)

(defn get-website
  [url]
  (-> (URL. url)
      (.getHost)))

(defn join-edges
  [backlinks rt]
  ;; the shared ?url variable joins the two generators
  (<-
    [?resolved]
    (backlinks ?src ?url)
    (rt _ ?url)
    (get-website ?url :> ?resolved)))
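The join above can be pictured with a small in-memory equivalent in plain Clojure. This is a sketch only: the sample data, the `[src url]` and `[user url]` tuple layouts and the name `join-edges-local` are assumptions, and the output is deduplicated here for readability.

```clojure
(import 'java.net.URL)

;; Hypothetical sample data; the tuple layouts are assumed, not from the talk.
(def backlinks [["blog-1" "http://example.com/a"]
                ["blog-2" "http://example.org/b"]])
(def retweets  [["user-1" "http://example.com/a"]
                ["user-2" "http://example.net/c"]])

(defn get-website
  "Extracts the host part of a URL string."
  [url]
  (.getHost (URL. url)))

(defn join-edges-local
  "Keeps the URLs present in both datasets and maps each one to its host."
  [backlinks rt]
  (let [rt-urls (set (map second rt))]   ; URLs seen in the retweet dataset
    (->> backlinks
         (map second)                    ; URLs seen in the backlink dataset
         (filter rt-urls)                ; inner join on the shared URL
         (map get-website)
         distinct
         vec)))

(join-edges-local backlinks retweets)
```

Only the URL that appears in both datasets survives the join; the host extraction is an ordinary function applied to each joined tuple.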
Further reading
- Cascalog home: https://github.com/nathanmarz/cascalog
- More advanced uses (PageRank and components detection): https://github.com/docteurZ/cascalog-contrib/tree/pagerank
Thanks!
If you like this kind of problem, we’re hiring! Contact us at [email protected]