(DAT203) Building Graph Databases on AWS

Post on 08-Jan-2017

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Todd Hildebrant and Matthew Sowders

AWS

October 2015

DAT203

Graph Databases on AWS

What to Expect from the Session

• Who are we?

• General overview of graph database technology

• AWS architecture examples

• Amazon Fulfillment technology’s “Inventory Notification Graph”

• Amazon DynamoDB Storage Backend for Titan

Graph databases on AWS

What is a graph? What is a graph database?

• A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties. A tree is a special case of a graph.

• A graph database uses a property graph as the data model and includes a query language.

• Other possible data models are hypergraphs, triple stores, RDF.
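
A minimal Python sketch of the property graph described above: vertices and directed edges, each carrying properties. The class and method names (Vertex, Edge, PropertyGraph, out_edges) and the sample property values are illustrative assumptions, not the API of any graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str   # relationship type, e.g. "FRIEND"
    out_v: str   # id of the vertex the edge leaves
    in_v: str    # id of the vertex the edge points to
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """In-memory stand-in for a property graph (hypothetical, for illustration)."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = Vertex(vertex_id, properties)

    def add_edge(self, label, out_v, in_v, **properties):
        self.edges.append(Edge(label, out_v, in_v, properties))

    def out_edges(self, vertex_id, label=None):
        # Directed: only edges that leave vertex_id are returned.
        return [e for e in self.edges
                if e.out_v == vertex_id and (label is None or e.label == label)]


g = PropertyGraph()
g.add_vertex("justin", name="Justin")
g.add_vertex("anna", name="Anna")
g.add_edge("FRIEND", "justin", "anna", since=2015)  # made-up edge property
```

Because edges are directed, g.out_edges("anna") is empty here until a second edge from Anna to Justin is added.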

Graph data modeling

• NoSQL data models – Document, Key-Value, Columnar, Graph, Mixed

• CAP and ACID

• Start with the use case, then develop the data model:

• As a Student, I want to know other Students in my Class who know about a Subject

• Student KNOWS Subject, Student BELONGS_TO Class

[Diagram: Subject <-KNOWS- Student -BELONGS_TO-> Class]
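
The Student / Class / Subject use case can be sketched as a two-hop traversal in Python. The edge labels KNOWS and BELONGS_TO come from the slide. The student names, class ids, subjects, and helper names (out, in_, classmates_who_know) are made up for illustration.

```python
# Directed edges as (source, label, target) triples. Sample data is hypothetical.
EDGES = [
    ("alice", "BELONGS_TO", "class-101"),
    ("bob", "BELONGS_TO", "class-101"),
    ("carol", "BELONGS_TO", "class-101"),
    ("dave", "BELONGS_TO", "class-202"),
    ("bob", "KNOWS", "graphs"),
    ("carol", "KNOWS", "sql"),
    ("dave", "KNOWS", "graphs"),
]


def out(vertex, label):
    """Targets reached by following `label` edges out of `vertex`."""
    return {t for s, l, t in EDGES if s == vertex and l == label}


def in_(vertex, label):
    """Sources that have a `label` edge pointing at `vertex`."""
    return {s for s, l, t in EDGES if t == vertex and l == label}


def classmates_who_know(student, subject):
    # Hop 1: student -> classes; hop 2: classes -> all members.
    classmates = set()
    for cls in out(student, "BELONGS_TO"):
        classmates |= in_(cls, "BELONGS_TO")
    classmates.discard(student)
    # Filter: keep classmates with a KNOWS edge to the subject.
    return {s for s in classmates if subject in out(s, "KNOWS")}


print(classmates_who_know("alice", "graphs"))  # {'bob'}
```

The query starts at one vertex and follows edges outward, so no table-wide JOIN is needed.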

Graph vs. relational database

Graph

• Need to traverse a graph without JOINs

• Queries have a starting location MATCH ON x

• Normalized attribute to enable filtering

• Dynamic schema

Relational

• Columnar analytics

• Tables denormalized for performance

• Cluster and fault management

• Recursive query support in the query optimizer

Titan: distributed graph database

• Distributed graph

• Storage layer has plug-in architecture

• Native TinkerPop implementation

• Full text search with Lucene, SOLR, Elasticsearch

• HA using multi-master replication (Cassandra cluster)

• Scalability using DynamoDB

Neo4j

• Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable using JVM

• HA when distributed, uses Paxos for master election

• Attempts to load DB into RAM, larger is better. Efficient spilling to disk.

• Primary query language is Cypher, supports Gremlin

AWS deployment for Neo4j

Availability Zone #1

Write ELB

Availability Zone #1

Read ELB

ELB health checks

HTTP GET

/db/manage/server/ha/master

/db/manage/server/ha/slave

/db/manage/server/ha/active
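
A rough Python sketch of how the two load balancers could use these paths. It assumes the write ELB probes the master path and the read ELB probes the slave path, and that each endpoint answers 200 when the instance holds that role and a non-200 status otherwise. The instance ids and the function name instances_in_service are hypothetical.

```python
# Paths are from the slide. Which ELB probes which path is an assumption.
HEALTH_CHECK_PATHS = {
    "write": "/db/manage/server/ha/master",
    "read": "/db/manage/server/ha/slave",
}


def instances_in_service(elb_role, statuses_by_instance):
    """Return the instances an ELB would keep in service.

    statuses_by_instance maps instance id -> {path: HTTP status code}.
    An ELB HTTP health check counts only a 200 response as healthy.
    """
    path = HEALTH_CHECK_PATHS[elb_role]
    return sorted(
        instance
        for instance, statuses in statuses_by_instance.items()
        if statuses.get(path) == 200
    )


# Hypothetical cluster: one master, two slaves.
statuses = {
    "i-master": {"/db/manage/server/ha/master": 200,
                 "/db/manage/server/ha/slave": 404},
    "i-slave-1": {"/db/manage/server/ha/master": 404,
                  "/db/manage/server/ha/slave": 200},
    "i-slave-2": {"/db/manage/server/ha/master": 404,
                  "/db/manage/server/ha/slave": 200},
}

print(instances_in_service("write", statuses))  # ['i-master']
print(instances_in_service("read", statuses))   # ['i-slave-1', 'i-slave-2']
```

Under these assumptions, a master election changes which instance passes the write health check, and the write ELB follows it without any client reconfiguration.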

Analytics on graphs

• OLAP not OLTP

• Leverages the Hadoop / MapReduce framework

• GraphX is graph analytics on Spark, in-memory; functional-like, “declarative” programming model

• Giraph is graph processing on MapReduce / HDFS; procedural, vertex-centric programming model

• Aggregation type queries over the entire graph
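
To make "vertex-centric" concrete, here is a small Python sketch in the Pregel style: in each superstep a vertex reads its incoming messages, updates its own value, and sends messages to its neighbors. It propagates the maximum value through the graph. This is not Giraph or GraphX code; the function name and sample graph are assumptions for illustration.

```python
def propagate_max(values, neighbors):
    """Pregel-style max propagation.

    values:    vertex -> initial numeric value
    neighbors: vertex -> list of vertices it sends messages to
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in neighbors.get(v, []):
            inbox[n].append(val)

    # Later supersteps: a vertex only acts (and sends) if a message
    # raises its value; the job halts when no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbors.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox
    return values


# Hypothetical three-vertex chain a - b - c (edges in both directions).
result = propagate_max(
    {"a": 3, "b": 6, "c": 1},
    {"a": ["b"], "b": ["a", "c"], "c": ["b"]},
)
print(result)  # {'a': 6, 'b': 6, 'c': 6}
```

Each vertex only sees its own state and inbox, which is what lets a framework spread the computation across machines.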

TinkerPop

• Apache Incubator graph framework supporting both OLAP and OLTP.

• Gremlin, a query language for graph traversals. Supports analysis, modification, and queries.

• Gremlin Structure API, a generic connector framework or API. Interface to a backend graph engine.

Graph DB use cases

• Social

• Recommendation

• Classic network problems

• Deep hierarchies

• Sensor analysis with geo-spatial constraints

• Fraud detection

• Identity and Access Management

Recommendation engine example

[Diagram: Neo4j cluster and EMR; writes (buy, like item) and reads; outputs are “People who bought this item also bought” and a custom email, “Something you recently looked at has changed”]
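
A minimal Python sketch of the “People who bought this item also bought” query as a two-hop traversal: item -> buyers -> their other items, ranked by how many buyers share them. The purchase data and the function name also_bought are made up. In the architecture above this would be a graph query against the cluster, not an in-memory list.

```python
from collections import Counter

# (customer, item) BOUGHT edges. Sample data is hypothetical.
PURCHASES = [
    ("ann", "book"), ("ann", "lamp"),
    ("bob", "book"), ("bob", "lamp"), ("bob", "desk"),
    ("cy", "book"),
    ("dee", "pen"),
]


def also_bought(item, purchases, limit=3):
    # Hop 1: everyone who bought the item.
    buyers = {customer for customer, bought in purchases if bought == item}
    # Hop 2: everything else those customers bought, counted per item.
    counts = Counter(
        bought for customer, bought in purchases
        if customer in buyers and bought != item
    )
    # Most shared first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:limit]]


print(also_bought("book", PURCHASES))  # ['lamp', 'desk']
```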

Inbound fulfillment

Inbound fulfillment data problems

Approaches

Manual Research

• All tools emit events

• Humans trace the events

• Difficult to follow as search space increases

• Developed queries, but they took too long to run

Unique Identifiers

• Every item gets a unique identifier

• Easy to get all related events

• Expensive

• Impractical for some items

Inventory notification graph: data model

Why not use a relational or NoSQL database?

• Relational Database

• Knew data volume would be huge and keep growing

• Did not want to vertically scale

• JOINs on table will be expensive

• Use case required high availability

• NoSQL Store

• Would be the same solution without all the functionality built into the TinkerPop Graph Framework

Why a graph?

• No way to index just the events we need

• Need to perform search from receive to stow and vice versa; i.e., requires many hops to find the data

• Need to process messages out of order

• Graphs provide a simple mental model
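
The points above can be sketched in Python: links between events are stored in both directions, so a search can run from receive to stow or back, and links can arrive in any order. The class name EventGraph, the event ids, and the intermediate "move" step are hypothetical. Only "receive" and "stow" come from the slide.

```python
from collections import defaultdict, deque


class EventGraph:
    """In-memory stand-in for the event graph (hypothetical)."""

    def __init__(self):
        self.forward = defaultdict(set)    # earlier event -> later events
        self.backward = defaultdict(set)   # later event -> earlier events

    def link(self, earlier, later):
        # Insertion order does not matter: an edge only needs its two
        # endpoint ids, not the rest of the chain.
        self.forward[earlier].add(later)
        self.backward[later].add(earlier)

    def reachable(self, start, direction="forward"):
        """Breadth-first search over any number of hops."""
        adjacency = self.forward if direction == "forward" else self.backward
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


g = EventGraph()
# Messages processed out of order: the later link arrives first.
g.link("move-2", "stow-3")
g.link("receive-1", "move-2")

print(g.reachable("receive-1"))           # move-2 and stow-3
print(g.reachable("stow-3", "backward"))  # move-2 and receive-1
```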

Why Titan?

[Diagram: layered stack, top to bottom]

TinkerPop: Rexster (graph server), Blueprints (generic graph API), Furnace (graph algorithms), Frames (object-graph mapper), Gremlin (traversal language), Pipes (dataflows)

Titan

Backend: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB

Cassandra

• Highly available

• Existing Titan implementation

• EC2Snitch

• Replication

• RandomPartitioner

Cassandra: Titan lessons learned

• No one on our team had experience managing or configuring a Cassandra cluster

• Needed to manage a cluster

• Team manually replaces hosts as EC2 swaps them out

• Does not handle time series data well

• We ran two producers against two keyspaces so we could efficiently drop old data

DynamoDB: Titan

• Massively scalable

• No more tuning and host management

• Team was already familiar with DynamoDB

• Risky because there was no existing Titan implementation

Inventory notification graph – architecture

DynamoDB: single-item data model

Hash Key (hk) | Attribute | Attribute | Attribute | Attribute | Attribute
Vertex id 1 | Property – Name Justin | Edge (out) – Friend: Anna | Edge (out) – Friend: Kris | Edge (out) – Likes: Movies | Hidden Property – Exists
Vertex id 2 | Property – Name Anna | Edge (out) – Friend: Justin | Edge (out) – Likes: Books | Hidden Property – Exists |
Vertex id 3 | Property – Name Kris | Edge (out) – Friend: Justin | Edge (out) – Likes: Movies | Hidden Property – Exists |
Vertex id 4 | Property – Name Movies | Edge (out) – Friend: Justin | Edge (out) – Likes: Kris | Hidden Property – Exists |
Vertex id 5 | Property – Name Books | Edge (out) – Friend: Anna | Hidden Property – Exists | |
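
A Python sketch of the single-item layout: one item per vertex, with the vertex id as the hash key and one attribute per property or edge. The attribute-name format ("property:...", "edge-out:...", "hidden:exists") is an assumption made for readability and is not the storage backend's actual encoding.

```python
def to_single_item(vertex_id, properties, out_edges):
    """Pack a whole vertex into one item keyed by hk (illustrative encoding)."""
    item = {"hk": vertex_id}
    for name, value in properties.items():
        item[f"property:{name}"] = value
    for label, target in out_edges:
        # The attribute name identifies the edge; the value is unused here.
        item[f"edge-out:{label}:{target}"] = ""
    item["hidden:exists"] = True
    return item


justin = to_single_item(
    1,
    {"Name": "Justin"},
    [("Friend", "Anna"), ("Friend", "Kris"), ("Likes", "Movies")],
)
print(len(justin))  # 6: the hash key plus five attributes, as in row 1
```

One read by hash key returns the vertex with all of its properties and edges. The trade-off is that DynamoDB limits an item to 400 KB, so in this layout the number of edges a single vertex can hold is bounded.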

DynamoDB: multiple-item data model

Hash Key (hk) | Range Key (rk) | Value (v)
Vertex id 1 | Range key |
Vertex id 1 | Property id | Property – Name Justin
Vertex id 1 | Edge id | Edge (out) – Friend Anna
Vertex id 1 | Edge id | Edge (out) – Friend Kris
Vertex id 2 | Range key |
Vertex id 2 | Property id | Property – Name Anna
Vertex id 2 | Edge id | Edge (out) – Friend Justin
Vertex id 2 | Edge id | Edge (out) – Friend Brooks
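
A Python sketch of the multiple-item layout: each property or edge becomes its own item, sharing the vertex id as hash key and distinguished by range key. The range-key format and the in-memory query helper are assumptions. The helper only imitates a hash-key query with an optional range-key prefix. The marker row shown with an empty value in the table is left out.

```python
def to_items(vertex_id, properties, out_edges):
    """Spread one vertex across several (hk, rk, v) items (illustrative encoding)."""
    items = []
    for name, value in properties.items():
        items.append({"hk": vertex_id, "rk": f"property:{name}", "v": value})
    for label, target in out_edges:
        items.append({"hk": vertex_id, "rk": f"edge-out:{label}:{target}", "v": target})
    return items


def query(table, hk, rk_prefix=""):
    """Items for one hash key, sorted by range key, optionally filtered by prefix."""
    rows = [i for i in table if i["hk"] == hk and i["rk"].startswith(rk_prefix)]
    return sorted(rows, key=lambda i: i["rk"])


table = to_items(1, {"Name": "Justin"}, [("Friend", "Anna"), ("Friend", "Kris")])
table += to_items(2, {"Name": "Anna"}, [("Friend", "Justin")])

print([i["v"] for i in query(table, 1, "edge-out:")])  # ['Anna', 'Kris']
```

Compared with the single-item layout, a vertex is no longer limited by the size of one item, and a query can fetch only the edges or only the properties, at the cost of reading several items per vertex.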

DynamoDB: how does it scale?

• Close to 100 billion vertices

• Terabytes of data

• Without corresponding increase in latency

DynamoDB: Titan lessons learned

• Use Titan explicit partitioning on large graph

• Partition across multiple graphs for time series data

• Able to achieve stable performance at scale
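
One way to read "partition across multiple graphs for time series data" is to route each event to a graph chosen by its timestamp, so old data is removed by dropping a whole graph instead of deleting individual elements. The Python sketch below illustrates that idea only. The 30-day window, the class name, and the use of a list as a stand-in for a graph are all assumptions, and the slides do not say how the team implemented it.

```python
from datetime import datetime, timedelta, timezone


class TimePartitionedGraphs:
    """Routes edges to one graph per time window (hypothetical design)."""

    def __init__(self, window_days=30):  # window size is an assumption
        self.window = timedelta(days=window_days)
        self.graphs = {}  # bucket number -> list of edges (stand-in for a graph)

    def bucket(self, ts):
        return int(ts.timestamp() // self.window.total_seconds())

    def add_edge(self, ts, out_v, in_v):
        self.graphs.setdefault(self.bucket(ts), []).append((out_v, in_v))

    def drop_older_than(self, ts):
        # Dropping a partition is one operation per graph,
        # independent of how many edges it holds.
        cutoff = self.bucket(ts)
        for b in [b for b in self.graphs if b < cutoff]:
            del self.graphs[b]


store = TimePartitionedGraphs()
store.add_edge(datetime(2015, 1, 1, tzinfo=timezone.utc), "a", "b")
store.add_edge(datetime(2015, 6, 1, tzinfo=timezone.utc), "c", "d")
store.drop_older_than(datetime(2015, 5, 15, tzinfo=timezone.utc))

print(list(store.graphs.values()))  # [[('c', 'd')]]
```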

Resources

• Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem

• Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson

• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler

• Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels

• Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr

• Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis

• Amazon DynamoDB Storage Backend for Titan FAQ

• Amazon DynamoDB Storage Backend for Titan Documentation

Thank you!

Remember to complete your evaluations!