Date post: | 14-Dec-2014 |
Upload: | marcus-paradies |
© Prof. Dr.-Ing. Wolfgang Lehner |
Challenges in the Design of a Graph Database Benchmark FOSDEM‘12 – Graph Processing DevRoom
Marcus Paradies
> Outline
Motivation
Challenges
Thoughts on Graph Data Generation
Thoughts on Query Workload
Summary and Outlook
Discussion
FOSDEM 2012
> Motivation
Graph databases are gaining momentum
Enterprise corporations are getting interested
How to compare the available graph database vendors?
Main issue: Results from benchmarks are not comparable
Lack of standardization in the data model and query language
What are “typical“ graph operations?
> Challenges
> Challenge #1: Application Domain
Graph data is not homogeneous
Graph data from different domains follows different patterns
Examples:
Social Network Analysis (SNA)
Protein Interaction Analysis
Recommendation Systems
Supply Chain Management (Vehicle Routing, CRM)
Fraud Detection in Financial Systems
…
Challenge: Find an application domain which represents a graph data pattern
common in many different scenarios.
> Challenge #2: Graph Data Model
What flavours of graph data models are commonly used?
Directed Graph
Undirected Graph
Mixed Graph
Multi Graph
(Plain) Property Graph
(Structured Property Graph)
Hyper Graph
Challenge: Find a graph data model suited for the majority of use cases
from various domains.
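To make the model flavours above concrete, here is a minimal Python sketch of one structure that can hold directed, undirected, mixed, multi and property graphs at once. The class and field names (`Node`, `Edge`, `PropertyGraph`, `directed`, `properties`) are illustrative assumptions, not taken from the talk or from any particular graph database; hypergraphs are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    directed: bool = True  # mixing True/False edges gives a mixed graph
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """Hypothetical in-memory property graph, for illustration only."""
    nodes: Dict[int, Node] = field(default_factory=dict)
    # A list (not a set keyed by endpoints) permits parallel edges (multigraph).
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, dict(properties))

    def add_edge(self, source, target, label, directed=True, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, label, directed, dict(properties)))


# Usage: two parallel edges between the same pair of nodes,
# one directed with a property, one undirected.
g = PropertyGraph()
g.add_node(1, name="Alice")
g.add_node(2, name="Bob")
g.add_edge(1, 2, "knows", since=2010)
g.add_edge(1, 2, "works_with", directed=False)
```

A structured property graph would additionally constrain which labels and property keys are allowed; that schema layer is omitted here.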
> Challenge #3: Querying Graph Data
Large variety in graph processing and manipulation languages
Each graph database vendor implements its own query language/API
Reason: No standardized graph query language available
Challenge: Find a way to abstract from the zoo of available query languages.
> Challenge #4: Defining the Workload
The workload to be defined depends on the underlying
query/manipulation language
Should complex (algorithmic) operations be part of a database benchmark?
Which algorithms to pick?
Social Network Analysis → Find communities
Supply Chain Management → Find maximal flow
Web of Data → Find pattern matches
How are concurrent users represented?
What about transactionality?
> Thoughts on Graph Data Generation
> Graph Data Generation - Patterns
Understanding graph patterns (characteristics) is crucial for a good graph data generator
What are distinguishing characteristics of graphs?
How can we identify graph patterns on large graphs?
Three main patterns [1]:
Power law distributed
Small diameters
Community Effects
> Pattern 1 – Power law distributed
Most real-world graph data sets exhibit power law distributions, for example in their node degrees
Examples:
Internet router graph
Subsets of the WWW
Citation Graphs
source: [2]
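One way a data generator could reproduce a heavy-tailed degree distribution is preferential attachment (the Barabási-Albert growth model). The talk does not prescribe a generator, so the sketch below is only one possible choice, and the function names and parameters (`preferential_attachment_graph`, `edges_per_node`, `seed`) are assumptions made for illustration.

```python
import random
from collections import Counter


def preferential_attachment_graph(num_nodes, edges_per_node, seed=None):
    """Return an undirected edge list grown by preferential attachment.

    Every new node attaches to `edges_per_node` distinct existing nodes,
    chosen with probability proportional to their current degree.
    """
    if edges_per_node < 1 or num_nodes <= edges_per_node:
        raise ValueError("need 1 <= edges_per_node < num_nodes")
    rng = random.Random(seed)
    edges = []
    endpoint_pool = []  # each node appears once per incident edge
    targets = list(range(edges_per_node))  # the first new node links to all seed nodes
    for new_node in range(edges_per_node, num_nodes):
        for target in targets:
            edges.append((new_node, target))
        endpoint_pool.extend(targets)
        endpoint_pool.extend([new_node] * edges_per_node)
        # Sampling from the pool is degree-proportional sampling.
        chosen = set()
        while len(chosen) < edges_per_node:
            chosen.add(rng.choice(endpoint_pool))
        targets = sorted(chosen)
    return edges


def degree_counts(edges):
    """Map each node to its degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


edges = preferential_attachment_graph(num_nodes=1000, edges_per_node=2, seed=42)
degree = degree_counts(edges)
# Number of nodes per degree value; a log-log plot of this histogram
# is the usual visual check for a power law.
histogram = Counter(degree.values())
```

Preferential attachment alone does not guarantee the community structure or diameters of a given application domain, so a benchmark generator would likely need additional knobs.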
> Pattern 2 – Small Diameters
Effective Diameter (eccentricity): The minimum number of hops within which a fraction (e.g. 90%) of all connected pairs of nodes can reach each other
Other measures exist as well, but are not applicable to disconnected graphs
In most use cases, diameter is much smaller than the size of the graph
Examples:
97% eccentricity of around 16 for path lengths in the WWW
Average path length around 6 for Epinions social network
source: [1]
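The effective diameter defined above can be computed exactly on small graphs with one breadth-first search per node. The following is a minimal sketch under stated assumptions: the graph is an adjacency dict that lists every node as a key, pairs are counted as ordered pairs, and the names `hop_distances` and `effective_diameter` are hypothetical. For large graphs, sampling or approximation would be needed instead.

```python
from collections import Counter, deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def effective_diameter(adjacency, fraction=0.9):
    """Smallest hop count h such that at least `fraction` of all connected
    (ordered) node pairs are within h hops of each other."""
    hop_histogram = Counter()
    for source in adjacency:
        for target, hops in hop_distances(adjacency, source).items():
            if target != source:
                hop_histogram[hops] += 1
    total_pairs = sum(hop_histogram.values())
    if total_pairs == 0:
        return 0  # no connected pairs at all
    needed = fraction * total_pairs
    covered = 0
    for hops in sorted(hop_histogram):
        covered += hop_histogram[hops]
        if covered >= needed:
            return hops
    return max(hop_histogram)


# A path 0-1-2-3-4 plus an isolated node 5: unreachable pairs are simply
# ignored, which is why the measure still works on disconnected graphs.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
full = effective_diameter(path, fraction=1.0)
half = effective_diameter(path, fraction=0.5)
```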
> Pattern 3 – Community Effects
Community: A set of nodes, where each node in the set is closer to all other nodes in the community than to nodes outside the community.
Communities can be found in many real-world graphs, especially social networks and collaboration networks
Clustering Coefficient C: A measure which quantifies the "clumpiness" of a graph
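Several definitions of the clustering coefficient exist. The sketch below assumes the common local (Watts-Strogatz style) definition for undirected graphs: for a node with k neighbors, the fraction of the k(k-1)/2 possible neighbor pairs that are themselves connected, averaged over all nodes. The function names and the adjacency-dict-of-sets representation are illustrative assumptions.

```python
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = adjacency[node] - {node}  # ignore self-loops
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here: no pairs means coefficient 0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adjacency):
    """Mean of the local coefficients over all nodes."""
    if not adjacency:
        return 0.0
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)


# Triangle 0-1-2 with a pendant node 3 attached to node 2
# (undirected, so every edge is listed in both directions).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
c_node_2 = local_clustering(graph, 2)
c_graph = average_clustering(graph)
```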
> Thoughts on Query Workload
> Query Workload - Operations
Graph Manipulation Operations:
Add/Update/Remove Nodes from the Graph
Add/Update/Remove Edges from the Graph
Add/Update/Remove Edge attributes
Add/Update/Remove Node attributes
Graph Query Operations:
Retrieve selection of nodes from given filter expression
Getting the neighbors of a set of nodes (possibly with edge filter constraints)
Graph Traversals:
Based on basic query operations
Exploration of neighborhood from a given set of start nodes
Terminated by the number of steps and/or edge/node filter constraints
Graph Analytical Operations:
Aggregation operations such as sum, avg, min, max
Aggregations on node-level and on edge-level
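The query, traversal and aggregation operations listed above can be sketched independently of any vendor query language. The example below uses plain Python dicts and tuples as a stand-in for a database; all function names (`select_nodes`, `neighbors`, `traverse`, `aggregate`) and the sample data are hypothetical and only meant to show how a traversal builds on the basic neighbor query.

```python
def select_nodes(nodes, predicate):
    """Query: ids of all nodes whose attributes satisfy a filter expression."""
    return {node_id for node_id, attrs in nodes.items() if predicate(attrs)}


def neighbors(edges, node_ids, edge_filter=lambda attrs: True):
    """Query: targets of outgoing edges of a node set, with an optional edge filter."""
    return {dst for src, dst, attrs in edges if src in node_ids and edge_filter(attrs)}


def traverse(edges, start_nodes, max_steps, edge_filter=lambda attrs: True):
    """Traversal: repeated neighbor expansion, terminated by a step limit
    and/or by the edge filter leaving nothing new to visit."""
    visited = set(start_nodes)
    frontier = set(start_nodes)
    for _ in range(max_steps):
        frontier = neighbors(edges, frontier, edge_filter) - visited
        if not frontier:
            break
        visited |= frontier
    return visited


def aggregate(values, how):
    """Analytics: sum, avg, min or max over node-level or edge-level values."""
    values = list(values)
    functions = {"sum": sum, "min": min, "max": max,
                 "avg": lambda v: sum(v) / len(v)}
    return functions[how](values)


# Illustrative data: node id -> attributes, edges as (source, target, attributes).
nodes = {1: {"age": 30}, 2: {"age": 40}, 3: {"age": 50}, 4: {"age": 20}}
edges = [
    (1, 2, {"type": "knows", "weight": 1.0}),
    (2, 3, {"type": "knows", "weight": 2.0}),
    (3, 4, {"type": "likes", "weight": 0.5}),
]

older = select_nodes(nodes, lambda a: a["age"] >= 40)
two_hops = traverse(edges, {1}, max_steps=2)
knows_only = traverse(edges, {1}, max_steps=5,
                      edge_filter=lambda a: a["type"] == "knows")
avg_age = aggregate((nodes[n]["age"] for n in two_hops), "avg")    # node-level
total_weight = aggregate((a["weight"] for _, _, a in edges), "sum")  # edge-level
```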
> Query Workload - Measures
Closely related to benchmark capabilities
Measures from relational benchmarks apply, such as:
Average query response time
Transactions per second (throughput)
Additional measures for graph traversals
Traversals per second
What about distributed scenarios?
What about concurrent users?
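A single-client measurement loop for the measures above might look like the sketch below. It is an assumption-laden illustration: `run_workload` is a hypothetical helper, each operation is a zero-argument callable, and "traversals per second" is interpreted here as traversed edges per second, with each operation optionally returning the number of edges it visited. Concurrent users and distributed setups are not covered.

```python
import time


def run_workload(operations):
    """Run zero-argument callables sequentially and report basic measures.

    An operation may return the number of edges it traversed (or None).
    """
    latencies = []
    traversed_edges = 0
    started = time.perf_counter()
    for operation in operations:
        t0 = time.perf_counter()
        result = operation()
        latencies.append(time.perf_counter() - t0)
        if isinstance(result, int):
            traversed_edges += result
    elapsed = time.perf_counter() - started
    count = len(latencies)
    return {
        "operations": count,
        "traversed_edges": traversed_edges,
        "avg_response_time_s": sum(latencies) / count if count else 0.0,
        "throughput_ops_per_s": count / elapsed if elapsed > 0 else float("inf"),
        "traversed_edges_per_s": traversed_edges / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in operations; a real benchmark would issue queries against the database.
report = run_workload([lambda: 3, lambda: 5, lambda: None])
```

Averages hide outliers, so a real benchmark would probably also report percentiles of the response time.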
> Summary and Outlook
Graph data distribution is highly important for a graph database benchmark
Application domains do have very specific graph characteristics
A graph database benchmark has to provide abstract and high-level graph
operation descriptions
Feel free to contact me if you want to contribute:
> Discussion
> Theses
A benchmark based on social network data is nice, but might not be that
representative for large enterprise applications
Algorithms should NOT be part of a graph database benchmark
Only support basic operations such as simple lookups and path traversals
The underlying graph data model should be a simple property graph
A graph database has to scale in terms of data size as well as number of
concurrent users
....
> References
[1] Graph Mining: Laws, Generators, and Algorithms (2006)
[2] http://konect.uni-koblenz.de/
[3] A Discussion on the Design of Graph Database Benchmarks (2010)