Piecing together large puzzles, efficiently
Scalable bulk loading into graph databases
Work-in-progress paper
Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, Gunter Saake
Databases and Software Engineering Workgroup, OvGU University of Magdeburg
Agenda
● Motivation
● Background & The Graph Loading Process
● Experiments
● Conclusion & Future Work
Motivation
How can we understand better the networks that we belong to?
● An example of a practical application: Recommendations@Pinterest
Motivation
● An example of a practical application: Dependency-driven analytics@Microsoft
Motivation
● Large graphs are ubiquitous
○ ⅕ of participants use graphs with >100 M edges
● Scalability is the main challenge
● Graph DBMSs are the most popular tool at the moment
Motivation
● User experience starts with data loading
● This can still be improved
○ Currently no standard scale-out solution for the process (our focus)
○ Limited handling of variable input data characteristics
bin/neo4j-import --into retail.db --id-type string \
  --nodes:Customer customers.csv --nodes products.csv \
  --nodes orders_header.csv,orders1.csv,orders2.csv \
  --relationships:CONTAINS order_details.csv \
  --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv
Background
● Input data characteristics
Edge Lists, from SNAP Astro-Physics Collaboration Dataset
Implicit Entities, from SNAP Amazon Movie Reviews Dataset
Also property encodings, others…
Working today with large and diverse graph datasets is cumbersome
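For instance, a minimal Python sketch of the first case, assuming SNAP's plain-text convention (`#`-prefixed comment lines, then whitespace-separated source/target IDs per edge):

```python
def parse_edge_list(lines):
    """Collect the vertex set and edge list from a SNAP-style edge file."""
    vertices, edges = set(), []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment/header lines
        src, dst = line.split()[:2]
        vertices.update((src, dst))
        edges.append((src, dst))
    return vertices, edges
```

Even this simplest format forces a choice on the loader: vertices are only implicit in the file and must be collected before (or while) the edges are inserted.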
Background
● But before going any further, the single unavoidable slide :)
● Property graphs (the underlying logical model we’re assuming)
● Directed
● Labeled
● Attributed
● Multi-graph
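As a concrete (if toy) illustration of these four properties, a Python sketch of our own, not tied to any particular DBMS:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    src: int              # directed: src -> dst
    dst: int
    label: str            # labeled
    props: dict = field(default_factory=dict)  # attributed

class PropertyGraph:
    """Directed, labeled, attributed multigraph."""
    def __init__(self):
        self.vertices = {}  # vertex id -> {"label": ..., "props": {...}}
        self.edges = []     # a list, so parallel edges are allowed (multi-graph)

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, "props": props}

    def add_edge(self, src, dst, label, **props):
        self.edges.append(Edge(src, dst, label, props))
```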
The Graph Loading Process
Topology-only representations
Complete representations
● Moving data from input files to physical storage, while respecting constraints
The Graph Loading Process
Experiments
● The basic question we address today:
○ How much can a developer nowadays scale out and tune the process, without changing database internals?
Experiments
● Setup
JanusGraph (formerly Titan)
Datasets: Wiki-RfA (10,835 V, 159,388 E) and Google-Web (875,731 V, 5,105,039 E)
Experiments
● Setup
○ JanusGraph version 0.1.1 (May 11, 2017)
○ Apache Cassandra 2.1.1
○ Commodity multi-core machine with 2 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz processors (8 cores in total) and 251 GB of memory
Experiments
Gains from batching
● Fit more data inside a single transaction
● Up to a point, the bigger the batch size, the faster the loading process
○ Batching works!
● But larger batch sizes don't guarantee better performance
○ Poor use of transaction caches
○ Higher costs for failed transactions
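The batching loop can be sketched as follows; `graph.tx()` and `tx.add_edge` are hypothetical stand-ins for a transaction handle, not the actual JanusGraph API:

```python
def batches(items, batch_size):
    """Yield fixed-size chunks; each chunk becomes one transaction."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def load_edges(graph, edges, batch_size):
    """Hypothetical loading loop: one commit per batch instead of per edge."""
    for chunk in batches(edges, batch_size):
        tx = graph.tx()
        for src, dst in chunk:
            tx.add_edge(src, dst)
        tx.commit()  # commit cost is amortized over the whole batch
```

The trade-off above is visible in the code: a failed `commit()` now throws away a whole batch, and very large chunks stress the transaction cache.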
Experiments
Adding some parallelism
● Partition the data into chunks and load in parallel
○ Here we report the average across strategies
● This consistently reduces the loading time
● Less impact than batching
○ Multiple users on the same data bring transaction commit overheads
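A sketch of this scale-out step with Python's standard thread pool; `load_partition` is a placeholder for the per-worker loading loop (in practice each worker would hold its own graph transaction):

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(partition):
    """Placeholder per-worker loader; returns the number of items handled."""
    return len(partition)

def parallel_load(partitions, workers):
    """Load pre-partitioned chunks concurrently, one worker per chunk."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_partition, partitions))
```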
Experiments
A closer look at the partitioning strategies
● EE: Partition Edges, Balance Edges
● VV: Partition Vertices, Balance Vertices
● BE: Partition Vertices, Balance Edges
● DS: Extension to BE, dealing with skew
All achieve good balancing in these datasets
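One plausible reading of two of these strategies in code (our own sketch, not the implementation evaluated here): EE deals edges round-robin so edge counts match, while BE greedily sends each vertex, together with its incident edges, to the least-loaded partition:

```python
def partition_edges_balance_edges(edges, k):
    """EE: deal edges round-robin across k partitions."""
    parts = [[] for _ in range(k)]
    for i, edge in enumerate(edges):
        parts[i % k].append(edge)
    return parts

def partition_vertices_balance_edges(adjacency, k):
    """BE: assign each vertex (plus its out-edges) to the partition
    currently holding the fewest edges."""
    parts, loads = [[] for _ in range(k)], [0] * k
    for vertex, out_edges in adjacency.items():
        i = loads.index(min(loads))  # least-loaded partition
        parts[i].append(vertex)
        loads[i] += len(out_edges)
    return parts
```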
Experiments
No big differences between them for these datasets
The only imbalance appears in Wiki-RfA with VV and 2 partitions
Distribution Across Partitions in Google Web =>
Experiments
No big differences between them for these datasets
The only imbalance appears in Wiki-RfA with VV and 2 partitions
Distribution Across Partitions in Wiki-RfA =>
Experiments
Putting it all together
Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Wiki-RfA)
● Combining batching and partitioning leads to degraded performance
○ In a multi-user environment, transaction commit time increases with batch size if users select the same data
○ It also increases with more users
No improvements over batching
Experiments
Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Google Web)
● Combining batching and partitioning leads to degraded performance
○ In a multi-user environment, transaction commit time increases with batch size if users select the same data
○ It also increases with more users
No improvements over batching
Conclusion
● Batching is the best first strategy: we've seen loading time drop from 100 minutes to 1.5 minutes.
○ Small disclaimer: gains do not grow in proportion to batch sizes.
● But the combination of batching and partitioning is not straightforward and can bring deterioration.
○ How can we make them work well together?
● EE and BE/DS could be the default partitioning strategies
○ But load imbalance is not the only factor affecting performance
Next, we will move our questions toward physical storage alternatives, in line with a broader interest in supporting adaptive HTAP designs.
Thanks :)
Questions?