Posted on 12-Aug-2020

Piecing together large puzzles, efficiently

Scalable bulk loading into graph databases (work-in-progress paper)

Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, Gunter Saake

Databases and Software Engineering Workgroup, OvGU University of Magdeburg

Agenda

● Motivation
● Background & The Graph Loading Process
● Experiments
● Conclusion & Future Work


Motivation

How can we better understand the networks that we belong to?



Motivation


● An example of a practical application: Recommendations@Pinterest

Motivation


● An example of a practical application: Dependency-driven analytics@Microsoft

Motivation


● Large graphs are ubiquitous
  ○ ⅕ of participants use graphs with >100 M edges

● Scalability is the main challenge
● Graph DBMSs are the most popular tool, at the moment

Motivation


● User experience starts with data loading

● This can still be improved
  ○ Currently no standard scale-out solution for the process (our focus)
  ○ Limited handling of variable input data characteristics

bin/neo4j-import --into retail.db --id-type string \
  --nodes:Customer customers.csv --nodes products.csv \
  --nodes orders_header.csv,orders1.csv,orders2.csv \
  --relationships:CONTAINS order_details.csv \
  --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv

Background


● Input data characteristics

Edge Lists, from SNAP Astro-Physics Collaboration Dataset

Implicit Entities, from SNAP Amazon Movie Reviews Dataset

Also property encodings, others…

Working today with large and diverse graph datasets is cumbersome

Background


● But before going any further, the single unavoidable slide :)
● Property graphs (the underlying logical model we’re assuming)
  ○ Directed
  ○ Labeled
  ○ Attributed
  ○ Multi-graph
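The property graph model above can be sketched as a small data structure. This is an illustrative Python sketch of the logical model only, not any particular database's API:

```python
from collections import defaultdict

class PropertyGraph:
    """Minimal directed, labeled, attributed multi-graph (illustrative sketch)."""

    def __init__(self):
        self.vertices = {}                  # vertex id -> (label, properties)
        self.out_edges = defaultdict(list)  # source id -> [(target, label, properties)]

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = (label, props)

    def add_edge(self, src, dst, label, **props):
        # Multi-graph: repeated (src, dst) pairs are allowed
        self.out_edges[src].append((dst, label, props))

g = PropertyGraph()
g.add_vertex(1, "Customer", name="Alice")
g.add_vertex(2, "Product", name="Book")
g.add_edge(1, 2, "ORDERED", quantity=2)
g.add_edge(1, 2, "ORDERED", quantity=1)  # parallel edge, fine in a multi-graph
```

Every element carries a label and a property map, edges have a direction, and parallel edges between the same vertex pair are permitted.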

The Graph Loading Process


Topology-only representations

Complete representations

● Moving data from input files to physical storage, while complying with constraints

The Graph Loading Process


Experiments


● The basic question we address today:

○ How much can a developer nowadays scale out and tune the process, without changing database internals?

Experiments


● Setup

JanusGraph (formerly Titan)

Datasets: Wiki-RfA (10,835 V, 159,388 E) and Google-Web (875,731 V, 5,105,039 E)

Experiments


● Setup
  ○ JanusGraph version 0.1.1 (May 11, 2017)
  ○ Apache Cassandra 2.1.1
  ○ Commodity multi-core machine with 2 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz processors (8 cores in total) and 251 GB of memory

Experiments


Gains from batching

● Fit more data inside a single transaction
● The bigger the batch size, the faster the loading process
  ○ Batching works!
● Larger batch sizes don’t guarantee better performance
  ○ Poor use of transaction caches
  ○ Higher costs for failed transactions
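The batching idea can be sketched as follows. Here `add_edge` and `commit` are stand-ins for the real per-element insert and transaction commit of the target store (e.g., a JanusGraph transaction); this sketch only shows how commits are amortized over batches:

```python
def load_in_batches(edges, batch_size, add_edge, commit):
    """Insert edges, committing one transaction per batch instead of per edge."""
    pending = 0
    for edge in edges:
        add_edge(edge)
        pending += 1
        if pending >= batch_size:
            commit()   # one commit amortized over batch_size inserts
            pending = 0
    if pending:
        commit()       # flush the final partial batch

# Simulated usage: count commits for 250 edges with a batch size of 100
inserted, commits = [], []
load_in_batches(range(250), 100, inserted.append, lambda: commits.append(1))
# 250 edges in batches of 100 -> 3 commits instead of 250
```

With per-edge commits the example above would pay 250 commit round trips; batching reduces that to 3, which is where the reported speedups come from.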

Experiments

Adding some parallelism

● Partition the data into chunks and load in parallel
  ○ Here we report the average across strategies.
● This consistently reduces the loading time
● Less impact than batching
  ○ Multiple users on the same data bring transaction commit overheads.
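Chunk-parallel loading can be sketched with a thread pool. `load_chunk` is a hypothetical stand-in for one loading client/session against the store; the round-robin split is our simplification of the partitioning step:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def parallel_load(edges, n_workers, load_chunk):
    """Split the edge list into n_workers chunks and load them concurrently."""
    chunks = [edges[i::n_workers] for i in range(n_workers)]  # round-robin split
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker plays the role of one loading client/session
        list(pool.map(load_chunk, chunks))

# Simulated usage: each "client" just records how many edges it received
sizes, lock = [], threading.Lock()

def load_chunk(chunk):
    with lock:
        sizes.append(len(chunk))

parallel_load(list(range(10)), 4, load_chunk)
```

In a real deployment each worker would run its own batched transactions, which is exactly where the commit contention mentioned above appears when the clients touch the same data.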

Experiments

A closer look at the partitioning strategies

● EE: Partition by Edges, Balance Edges
● VV: Partition by Vertices, Balance Vertices
● BE: Partition by Vertices, Balance Edges
● DS: Extension to BE that deals with skew

All achieve good balancing in these datasets
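As an illustration of the BE idea (partition by vertices, balance edges), here is a greedy sketch that assigns each vertex, together with its out-edges, to the currently lightest partition. This is our own simplification for illustration, not the paper's exact algorithm:

```python
def partition_vertices_balance_edges(out_degree, k):
    """Greedily assign vertices to k partitions, balancing total edge counts."""
    edge_load = [0] * k
    assignment = {}
    # Placing heavy vertices first gives the greedy heuristic more room to balance
    for v, deg in sorted(out_degree.items(), key=lambda kv: -kv[1]):
        target = min(range(k), key=lambda p: edge_load[p])
        assignment[v] = target
        edge_load[target] += deg
    return assignment, edge_load

# Hypothetical out-degrees for six vertices, split across two partitions
degrees = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1, "f": 1}
assignment, load = partition_vertices_balance_edges(degrees, 2)
# Both partitions end up with 8 edges each
```

EE balances edges directly, VV balances vertex counts instead, and DS adds skew handling on top of BE for vertices with very high degrees.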

Experiments

No big differences between them for these datasets

Only imbalance in Wiki-RfA with VV, 2 partitions

(Figure: Distribution across partitions in Google Web)

Experiments


No big differences between them for these datasets

Only imbalance in Wiki-RfA with VV, 2 partitions

(Figure: Distribution across partitions in Wiki-RfA)

Experiments


Putting it all together

Load Time Using Different Partitioning Strategies with Batch Sizes 10, 100, 1000 (Wiki-RfA)

● Combination of batching and partitioning leads to degraded performance
  ○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
  ○ It also increases with more users.

No improvements over batching

Experiments


Load Time Using Different Partitioning Strategies with Batch Sizes 10, 100, 1000 (Google Web)

● Combination of batching and partitioning leads to degraded performance
  ○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
  ○ It also increases with more users.

No improvements over batching

Conclusion


● Batching is the best first strategy. We've seen loading times drop from 100 minutes to 1.5.
  ○ Small disclaimer: gains do not grow in proportion to batch sizes.

● But the combination of batching and partitioning is not straightforward and can degrade performance.
  ○ How can we make them work well together?

● EE, BE/DS could be the default partitioning strategy
  ○ But load imbalance is not the only factor affecting performance

Next, we plan further studies, extending our questions to physical storage alternatives, in line with a broader interest in supporting adaptive HTAP designs.

Thanks :)

Questions?
