Date posted: 26-Jan-2015
Category: Technology
Uploaded by: datablend
Thursday 23.5
Graph Databases
about me
who am i ...
Davy Suvee (@DSUVEE)
➡ big data architect @ datablend - continuum
• provide big data and nosql consultancy
• share practical knowledge and big data use cases via blog
Big Data
2-3 years ago ...
Nowadays ...
Big Data
What is big data ...
... large and complex data sets that are difficult to process with traditional database management tools ...
➡ store (nosql)
➡ enrich (data mining, ml, nlp, ... )
➡ visualize (d3, gephi, mapbox, tableau, ... )
➡ process/analyze (map/reduce, cep, storm, ... )
Volume: data exceeds the limits of vertically scalable tools, requiring novel storage solutions
Variety: data takes different formats that make integration complex and expensive
Velocity: data analysis time windows are small compared to the speed of data acquisition
The world has changed ...
Tackling the volume problem ...
What we are currently doing ...
➡ Throwing our data away :-(
➡ Storing preprocessed data :-/
➡ Trying to store it anyway ;-(
But why?
Tackling the volume problem ...
Vertical Scaling: as your database grows, each upgrade of the single machine costs more than the last (€ → €² → €³ → €⁴)
Horizontal Scaling: cost grows linearly (€ x #nodes)
Your database
NoSQL
Tackling the variety problem ...
Video
Audio
Social streams
Log files
Text
Massive & Unstructured
Tackling the variety problem ...
One schema-structured model vs. a best-fit, schema-less model
Your database
NoSQL
Key-Value Databases
Document-Based Databases
Graph Databases
Wide-column Databases
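To make the four families concrete, here is a rough sketch (plain Python structures, not any vendor's API; the layouts are illustrative assumptions) of how the same fact from the later slides — Davy founded Datablend in 2011 — might be laid out in each model:

```python
# The same fact in four NoSQL models (plain Python stand-ins, not vendor APIs).

# Key-value: an opaque value stored behind a key.
kv = {"person:davy": '{"name": "Davy", "founded": ["Datablend"]}'}

# Document: a nested, schema-less record.
doc = {"name": "Davy", "founded": [{"company": "Datablend", "in": 2011}]}

# Graph: explicit nodes plus a labelled edge carrying properties.
nodes = {"davy": {"name": "Davy"}, "datablend": {"name": "Datablend"}}
edges = [("davy", "founded", "datablend", {"in": 2011})]

# Wide-column: row key -> column family -> columns.
wide = {"davy": {"info": {"name": "Davy"}, "founded": {"Datablend": 2011}}}
```

Each layout answers "who founded Datablend, and when?" but with a very different access pattern, which is why the talk recommends a best-fit model per problem.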
AS IS ...
Tackling the velocity problem ...
We want to ...
➡ Collect
➡ Process
➡ Query
➡ Analyze
... MASSIVE amounts of unstructured data, in Real-Time
Tackling the velocity problem ...
Slow and outdated information: your stack (APP → SYNC → ETL → BI)
Fast and real-time: NoSQL & Big Data (APP → Map-Reduce → BI + analytics)
graphs are everywhere ...
a little bit of graph theory ...
node/vertex: Davy (age = 33)
node/vertex: Kim (age = 26, gender = F)
node/vertex: Datablend (btw = 123...)
node/vertex: Janssen (sector = pharma)
edge: Davy -[knows, since: 2013]-> Kim
edge: Davy -[founded, in: 2011]-> Datablend
edge: Davy -[worked_for, from: 2008, to: 2013]-> Janssen
Advantages ... ?
Graph Database
➡ whiteboard friendly
➡ schema-less
➡ index-free adjacency (no joins!)
➡ queries as traversals
➡ queries as pattern matching
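Index-free adjacency can be illustrated with a toy sketch (plain Python, not Neo4j): each vertex holds direct references to its neighbours, so a friends-of-friends traversal follows pointers instead of joining tables:

```python
# Toy illustration of index-free adjacency: each vertex stores direct
# references to its neighbours, so traversals follow pointers (no joins).

class Vertex:
    def __init__(self, name):
        self.name = name
        self.knows = []  # direct references to adjacent vertices

def friends_of_friends(start):
    # Breadth-first traversal to depth 2 along 'knows' edges.
    seen, frontier = {start}, [start]
    result = []
    for _ in range(2):
        frontier = [n for v in frontier for n in v.knows if n not in seen]
        seen.update(frontier)
        result = frontier
    return result

davy, kim, peter = Vertex("Davy"), Vertex("Kim"), Vertex("Peter")
davy.knows.append(kim)
kim.knows.append(peter)
print([v.name for v in friends_of_friends(davy)])  # → ['Peter']
```

The same traversal in a relational schema would require a self-join on a friendship table per hop; here each hop is a constant-time pointer dereference.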
Products/projects ... ?
Graph Database
➡ databases: neo4j, orientdb, allegrograph, dex, ...
➡ processing: pregel, giraph, hama, goldenorb, ...
➡ APIs: blueprints
➡ query languages: gremlin, cypher, sparql
Graph database 101 (neo4j)
GraphDatabaseService graph = ...
Node davy = graph.createNode();
davy.setProperty("name", "Davy");
Node kim = graph.createNode();
kim.setProperty("name", "Kim");
Graph database 101 (neo4j)
enum RelTypes implements RelationshipType { KNOWS, WORKED_FOR, FOUNDED}
Davy
Kim
knows
Relationship davy_kim = davy.createRelationshipTo(kim, RelTypes.KNOWS);
davy_kim.setProperty("since", 2013);
Graph database 101 (neo4j)
Relationship davy_datablend = davy.createRelationshipTo(datablend, RelTypes.FOUNDED);
davy_datablend.setProperty("in", 2011);
Davy
Datablend
founded
➡ how to access the datablend node?
Graph database 101 (neo4j)
Index<Node> nodeIndex = graph.index().forNodes("nodes");
Node datablend = graph.createNode();
datablend.setProperty("name", "Datablend");
nodeIndex.add(datablend, "name", "Datablend");
Node found = nodeIndex.get("name", "Datablend").getSingle();
Graph database 101 (neo4j)
➡ find friends of my friends ...
TraversalDescription td = Traversal.description()
    .breadthFirst()
    .relationships(RelTypes.KNOWS, Direction.OUTGOING)
    .evaluator(Evaluators.toDepth(2));
Traverser traverser = td.traverse(davy);
for (Path path : traverser) { ... }
Graph database 101 (neo4j)
➡ find friends of my friends ...
START davy=node:node_auto_index(name = "Davy")
MATCH davy-[:KNOWS]->()-[:KNOWS]->fof
RETURN davy, fof

ExecutionEngine engine = new ExecutionEngine(graph);
ExecutionResult result = engine.execute(query);
for (Map<String, Object> row : result) { ... }
Use cases ... ?
Graph Database
➡ recommendations
➡ access control
➡ routing
➡ social computing/networks
➡ genealogy
insights in big data
➡ typical approach through warehousing
★ star schema with fact tables and dimension tables
insights in big data
★ real-time visualization
★ filtering
★ metrics
★ layouting
★ modular 1, 2
1. http://gephi.org/plugins/neo4j-graph-database-support/ 2. http://github.com/datablend/gephi-blueprints-plugin
gene expression clustering
➡ oncology data set:
★ 4,800 samples
★ 27,000 genes
➡ Question:
★ for a particular subset of samples, which genes are co-expressed?
mongodb for storing gene expressions

{ "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3" },
  "sample_name" : "122551hp133a21.cel",
  "genomics_id" : 122551,
  "sample_id" : 343981,
  "donor_id" : 143981,
  "sample_type" : "Tissue",
  "sample_site" : "Ascending colon",
  "pathology_category" : "MALIGNANT",
  "pathology_morphology" : "Adenocarcinoma",
  "pathology_type" : "Primary malignant neoplasm of colon",
  "primary_site" : "Colon",
  "expressions" : [
    { "gene" : "X1_at", "expression" : 5.54217719084415 },
    { "gene" : "X10_at", "expression" : 3.92335121981739 },
    { "gene" : "X100_at", "expression" : 7.81638155662255 },
    { "gene" : "X1000_at", "expression" : 5.44318512260619 },
    …
  ]
}
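A minimal sketch of pulling one expression value out of a document with that shape (a plain dict stands in for the MongoDB driver; `expression_of` is a hypothetical helper, not part of any API):

```python
# A sample document shaped like the MongoDB record above (plain dict).
sample = {
    "sample_name": "122551hp133a21.cel",
    "pathology_category": "MALIGNANT",
    "expressions": [
        {"gene": "X1_at", "expression": 5.54217719084415},
        {"gene": "X10_at", "expression": 3.92335121981739},
    ],
}

def expression_of(doc, gene):
    # Linear scan over the embedded expressions array.
    return next(e["expression"] for e in doc["expressions"] if e["gene"] == gene)

print(expression_of(sample, "X10_at"))  # → 3.92335121981739
```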
pearson correlation through map-reduce
pearson correlation

  x |  y
 ---+----
 43 | 99
 21 | 65
 25 | 79
 42 | 75
 57 | 87
 59 | 81

r ≈ 0.52
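The table's result can be reproduced directly; this is a plain-Python sketch of the Pearson formula the map-reduce job computes, using the x/y pairs above (full-precision evaluation gives ≈0.53; the slide's 0,52 appears to be truncated rather than rounded):

```python
import math

# Pearson correlation of the x/y pairs from the table above.
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

print(round(pearson(x, y), 2))  # → 0.53 (the slide truncates to 0,52)
```

In the map-reduce setting, the per-pair sums (Σx, Σy, Σxy, Σx², Σy², n) are what the mappers emit, since they combine associatively across splits.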
co-expression graph
➡ create a node for each gene
➡ if the correlation between two genes >= 0.8, draw an edge between both nodes
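The two construction rules can be sketched with plain sets (only the gene names and the 0.8 threshold come from the slides; the correlation values below are invented for illustration):

```python
# Build a co-expression graph: one node per gene, an undirected edge
# whenever the pairwise correlation reaches the 0.8 threshold.
THRESHOLD = 0.8

# Hypothetical pairwise correlations: (gene_a, gene_b) -> r
correlations = {
    ("X1_at", "X10_at"): 0.91,
    ("X1_at", "X100_at"): 0.42,
    ("X10_at", "X100_at"): 0.85,
}

nodes = {g for pair in correlations for g in pair}
edges = {pair for pair, r in correlations.items() if r >= THRESHOLD}
print(len(nodes), len(edges))  # → 3 2
```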
mutation prevalence
analyzing running data
<trkpt lon="4.723870977759361" lat="51.075748661533">
  <ele>29.799999237060547</ele>
  <time>2011-11-08T19:18:39.000Z</time>
</trkpt>
<trkpt lon="4.724105251953006" lat="51.075623352080584">
  <ele>29.799999237060547</ele>
  <time>2011-11-08T19:18:45.000Z</time>
</trkpt>
<trkpt lon="4.724143054336309" lat="51.07560558244586">
  <ele>29.799999237060547</ele>
  <time>2011-11-08T19:18:46.000Z</time>
</trkpt>
analyzing running data through neo4j
➡ using the neo4j spatial extension
➡ create a node for each tracked point
➡ connect succeeding tracking nodes in a graph

List<GeoPipeFlow> closests = GeoPipeline
    .startNearestNeighborLatLonSearch(runningLayer, to, 0.02)
    .sort("OrthodromicDistance")
    .getMin("OrthodromicDistance")
    .toList();
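A hedged sketch of the per-point chaining step, using Python's standard library rather than the Neo4j Spatial API: parse the trkpt elements and compute the great-circle distance between succeeding points (the haversine formula here is a stand-in for the plugin's orthodromic distance):

```python
import math
import xml.etree.ElementTree as ET

# Parse <trkpt> elements like the ones above and compute the
# great-circle distance between succeeding tracked points.
GPX = """<trk>
  <trkpt lon="4.723870977759361" lat="51.075748661533"><ele>29.8</ele></trkpt>
  <trkpt lon="4.724105251953006" lat="51.075623352080584"><ele>29.8</ele></trkpt>
</trk>"""

def haversine_m(lat1, lon1, lat2, lon2):
    # Distance in metres on a spherical Earth (radius 6,371 km).
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371000 * math.asin(math.sqrt(a))

points = [(float(p.get("lat")), float(p.get("lon")))
          for p in ET.fromstring(GPX).iter("trkpt")]
for (lat1, lon1), (lat2, lon2) in zip(points, points[1:]):
    print(round(haversine_m(lat1, lon1, lat2, lon2)))  # roughly 20 m per step here
```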
analyzing running data
analyzing google analytics data
➡ source url -> target url
graphs and time ...
➡ towards a time-aware graph ...
➡ fluxgraph: a blueprints-compatible graph on top of Datomic
➡ make FluxGraph fully time-aware
★ travel your graph through time
★ time-scoped iteration of vertices and edges
★ temporal graph comparison
➡ reproducible graph state
travel through time

FluxGraph fg = new FluxGraph();
Vertex davy = fg.addVertex();
davy.setProperty("name", "Davy");
Vertex kim = ...
Vertex peter = ...
Edge e1 = fg.addEdge(davy, kim, "knows");

(graph so far: Peter, and Davy -[knows]-> Kim)
travel through time

Date checkpoint = new Date();
davy.setProperty("name", "David");
Edge e2 = fg.addEdge(davy, peter, "knows");

(after the checkpoint: Davy is renamed to David and a second knows edge to Peter is added)
travel through time

(time axis: at the checkpoint, Davy -[knows]-> Kim; at the current time, David -[knows]-> Kim and David -[knows]-> Peter)

by default, the graph is read at the current time
fg.setCheckpointTime(checkpoint);
time-scoped iteration

(version chain along the time axis: Davy at t1 → Davy' at t2 → Davy'' at t3 → Davy''' at tcurrent, each change linking succeeding versions via next/previous)

➡ how to find the version of the vertex you are interested in?

Vertex previousDavy = davy.getPreviousVersion();
Iterable<Vertex> allDavy = davy.getNextVersions();
Iterable<Vertex> selDavy = davy.getPreviousVersions(filter);
Interval valid = davy.getTimerInterval();
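The previous/next version chain can be sketched in plain Python; this illustrates the idea only and is not FluxGraph's implementation (`VersionedVertex` is an invented class):

```python
# Illustration of time-scoped vertex versions: every property change
# yields a new version linked to its predecessor via previous/next.
class VersionedVertex:
    def __init__(self, props, previous=None):
        self.props = dict(props)
        self.previous, self.next = previous, None
        if previous is not None:
            previous.next = self

    def set_property(self, key, value):
        # Instead of mutating in place, create the next version.
        new_props = dict(self.props)
        new_props[key] = value
        return VersionedVertex(new_props, previous=self)

    def previous_versions(self):
        version = self.previous
        while version is not None:
            yield version
            version = version.previous

davy = VersionedVertex({"name": "Davy"})
david = davy.set_property("name", "David")
print([v.props["name"] for v in david.previous_versions()])  # → ['Davy']
```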
temporal graph comparison

(checkpoint graph: Davy -[knows]-> Kim; current graph: David -[knows]-> Kim and David -[knows]-> Peter)

what changed?
temporal graph comparison
➡ difference (A , B) = union (A , B) - B
➡ ... as an (immutable) graph!
(example: the difference of the current and checkpoint graphs contains David and his knows edges)
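Modelling each graph as a set of labelled edges makes the difference rule a one-liner; a sketch using the checkpoint/current example from the earlier slides (node identity is simplified to names here):

```python
# difference(A, B) = union(A, B) - B, with each graph modelled as a set
# of (source, label, target) edges.
checkpoint = {("Davy", "knows", "Kim")}
current = {("David", "knows", "Kim"), ("David", "knows", "Peter")}

def difference(a, b):
    # Everything in the union that B does not contain.
    return (a | b) - b

print(sorted(difference(current, checkpoint)))
```

Note that (A ∪ B) − B equals the plain set difference A − B; writing it via the union mirrors the slide's formulation.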
use case: longitudinal patient data

(timeline: t1: patient · t2: patient, smoking · t3: patient, smoking · t4: patient, cancer · t5: patient, cancer, death)
use case: longitudinal patient data
➡ historical data for 15,000 patients over a period of 10 years (2001-2010)
➡ example analysis:
★ if a male patient is no longer smoking in 2005,
★ what are the chances of getting lung cancer by 2010, comparing patients that smoked before 2005 with patients that never smoked?
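The example analysis amounts to comparing event rates between two cohorts; a sketch over invented patient records (the real data set is not reproduced here, and the record shape is an assumption):

```python
# Hypothetical cohort comparison for the analysis above.
# Record: (gender, smoked_until_year or None if never smoked, lung_cancer_by_2010)
patients = [
    ("M", 2004, True), ("M", 2003, False), ("M", 2004, False),
    ("M", None, False), ("M", None, False), ("M", None, True),
]

def cancer_rate(cohort):
    return sum(1 for p in cohort if p[2]) / len(cohort)

# Male patients who stopped smoking before 2005 vs. males who never smoked.
quit_before_2005 = [p for p in patients
                    if p[0] == "M" and p[1] is not None and p[1] < 2005]
never_smoked = [p for p in patients if p[0] == "M" and p[1] is None]
print(cancer_rate(quit_before_2005), cancer_rate(never_smoked))
```

In the FluxGraph setting, each cohort would be selected by traversing the graph as it stood at the relevant checkpoint (2005, 2010) rather than by filtering flat records.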
FluxGraph
➡ available on github: http://github.com/datablend/fluxgraph
Open Innovation Networking Tool
➡ Many different projects, many different partners, many different domains ...
★ how do we keep track?
★ how can we learn from the data?
➡ Store the data in its most natural form, a graph
➡ use graph algorithms to identify the importance of each node and its related ones
Open Innovation Networking Tool
More graphs ...
➡ pharma
➡ geospatial
➡ dependency analysis
➡ ontology
➡ ...
Questions?
Follow us
twitter.com/data_blendwww.datablend.be
www.datablend.be [email protected] 0499/05.00.89
datablend - continuum