Page 1: (DAT203) Building Graph Databases on AWS

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Todd Hildebrant and Matthew Sowders

AWS

October 2015

DAT203

Graph Databases on AWS

Page 2: (DAT203) Building Graph Databases on AWS

What to Expect from the Session

• Who are we?

• General overview of graph database technology

• AWS architecture examples

• Amazon Fulfillment technology’s “Inventory Notification Graph”

• Amazon DynamoDB Storage Backend for Titan

Page 3: (DAT203) Building Graph Databases on AWS

Graph databases on AWS

Page 4: (DAT203) Building Graph Databases on AWS

What is a graph? What is a graph database?

• A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties. A tree is a special case of a graph.

• A graph database uses a property graph as the data model and includes a query language.

• Other possible data models are hypergraphs, triple stores, RDF.
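
As a minimal sketch of the property-graph model described on this slide, vertices and directed edges can each carry a dictionary of properties. The class and method names below are illustrative assumptions and do not come from any graph database API; the `since` edge property is made-up sample data.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: int  # id of the vertex the edge leaves
    in_v: int   # id of the vertex the edge enters
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph (illustrative only)."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vid, **props):
        self.vertices[vid] = Vertex(vid, props)

    def add_edge(self, label, out_v, in_v, **props):
        self.edges.append(Edge(label, out_v, in_v, props))

    def out(self, vid, label):
        """Return the vertices reached by outgoing edges with this label."""
        return [self.vertices[e.in_v] for e in self.edges
                if e.out_v == vid and e.label == label]


g = PropertyGraph()
g.add_vertex(1, name="Justin")
g.add_vertex(2, name="Anna")
g.add_edge("friend", 1, 2, since=2015)  # hypothetical edge property

friends = [v.properties["name"] for v in g.out(1, "friend")]
```

Because the edge is directed, `g.out(2, "friend")` is empty until a second edge is added in the other direction.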

Page 5: (DAT203) Building Graph Databases on AWS

Graph data modeling

• NoSQL data models – Document, Key-Value, Columnar, Graph, Mixed

• CAP and ACID

• Start with the use case, then develop the data model:

• As a Student, I want to know other Students in my Class who know about a Subject

• Student KNOWS Subject, Student BELONGS_TO Class

[Diagram: Subject ← KNOWS – Student – BELONGS_TO → Class]
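
The use case on this slide can be sketched as a two-hop traversal over labeled edges. This is a rough illustration in plain Python rather than a graph query language; the student, class, and subject names are made-up sample data, and the helper functions are hypothetical.

```python
# Edges as (source, label, target) triples; all names are made-up sample data.
EDGES = [
    ("alice", "BELONGS_TO", "class-101"),
    ("bob", "BELONGS_TO", "class-101"),
    ("carol", "BELONGS_TO", "class-101"),
    ("dave", "BELONGS_TO", "class-202"),
    ("bob", "KNOWS", "graphs"),
    ("dave", "KNOWS", "graphs"),
    ("carol", "KNOWS", "sql"),
]


def out_neighbors(vertex, label):
    """Targets of edges with this label that leave the vertex."""
    return {dst for src, lbl, dst in EDGES if src == vertex and lbl == label}


def in_neighbors(vertex, label):
    """Sources of edges with this label that enter the vertex."""
    return {src for src, lbl, dst in EDGES if dst == vertex and lbl == label}


def classmates_who_know(student, subject):
    """Students sharing a Class with `student` who KNOW `subject`."""
    classmates = set()
    for cls in out_neighbors(student, "BELONGS_TO"):
        classmates |= in_neighbors(cls, "BELONGS_TO")
    classmates.discard(student)
    return sorted(classmates & in_neighbors(subject, "KNOWS"))


result = classmates_who_know("alice", "graphs")
```

The traversal starts at one Student and only touches that Student's Classes, which is the point of modeling the use case first: the query never scans unrelated vertices.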

Page 6: (DAT203) Building Graph Databases on AWS

Graph vs. relational database

Graph

• Need to traverse a graph without JOINs

• Queries have a starting location (MATCH ON x)

• Normalized attribute to enable filtering

• Dynamic schema

Relational

• Columnar analytics

• Tables denormalized for performance

• Cluster and fault management

• Recursive query support in the query optimizer

Page 7: (DAT203) Building Graph Databases on AWS

Titan: distributed graph database

• Distributed graph

• Storage layer has plug-in architecture

• Native TinkerPop implementation

• Full text search with Lucene, SOLR, Elasticsearch

• HA using multi-master replication (Cassandra cluster)

• Scalability using DynamoDB

Page 8: (DAT203) Building Graph Databases on AWS

Neo4j

• Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable in the JVM

• HA when distributed, uses Paxos for master election

• Attempts to load the DB into RAM, so larger is better. Efficient spilling to disk.

• Primary query language is Cypher, supports Gremlin

Page 9: (DAT203) Building Graph Databases on AWS

AWS deployment for Neo4j

Availability Zone #1

Write ELB

Availability Zone #1

Read ELB

ELB health checks

HTTP GET

/db/manage/server/ha/master

/db/manage/server/ha/slave

/db/manage/server/ha/active
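
One way to read this slide: the write load balancer keeps only the instance that passes the master check, and the read load balancer keeps the instances that pass the slave check. The sketch below models that routing decision as a pure function, with no network calls. It assumes each endpoint answers HTTP 200 when the instance holds that role and a non-200 status otherwise; the host addresses and the `route_pools` helper are hypothetical.

```python
# Health-check paths as listed on the slide.
MASTER_PATH = "/db/manage/server/ha/master"
SLAVE_PATH = "/db/manage/server/ha/slave"


def route_pools(status_by_host):
    """Split hosts into (write_pool, read_pool) from observed HTTP statuses.

    status_by_host maps host -> {path: http_status}. A host is treated as
    healthy for a pool only if the matching path returned 200 (assumption).
    """
    write_pool = sorted(host for host, statuses in status_by_host.items()
                        if statuses.get(MASTER_PATH) == 200)
    read_pool = sorted(host for host, statuses in status_by_host.items()
                       if statuses.get(SLAVE_PATH) == 200)
    return write_pool, read_pool


# Hypothetical health-check results for a three-node cluster.
observed = {
    "10.0.1.10": {MASTER_PATH: 200, SLAVE_PATH: 404},
    "10.0.1.11": {MASTER_PATH: 404, SLAVE_PATH: 200},
    "10.0.2.12": {MASTER_PATH: 404, SLAVE_PATH: 200},
}

write_pool, read_pool = route_pools(observed)
```

After a master election the same checks would move the new master into the write pool without any change to client configuration.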

Page 10: (DAT203) Building Graph Databases on AWS

Analytics on graphs

• OLAP not OLTP

• Leverages the Hadoop / MapReduce framework

• GraphX is graph analytics on Spark, in memory; functional-like, “declarative” programming model

• Giraph is graph processing using MapReduce / HDFS; procedural, vertex-centric programming model

• Aggregation type queries over the entire graph
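
To make the vertex-centric model concrete, here is a small single-process sketch of connected components computed in supersteps: each vertex holds a label, reads its incoming messages, and messages its neighbors only when its label improves. This is plain Python written for illustration and does not use the Giraph or GraphX APIs; the function name and the sample graph are assumptions.

```python
def connected_components(adjacency):
    """Label every vertex with the smallest vertex id in its component.

    adjacency maps vertex -> list of neighbors (undirected, so each edge
    appears in both directions). Every neighbor must also be a key.
    """
    label = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in adjacency}
    for v, neighbors in adjacency.items():
        for n in neighbors:
            inbox[n].append(label[v])

    # Later supersteps: a vertex with no messages stays halted.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if not messages:
                continue
            best = min(messages)
            if best < label[v]:
                label[v] = best
                for n in adjacency[v]:
                    outbox[n].append(best)
        inbox = outbox
    return label


# Two components: {1, 2, 3} and {4, 5}.
sample = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels = connected_components(sample)
```

The computation touches every vertex, which is why this style of job fits the OLAP side rather than low-latency OLTP traversals.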

Page 11: (DAT203) Building Graph Databases on AWS

TinkerPop

• Apache Incubator graph framework supporting both OLAP and OLTP.

• Gremlin, a query language for graph traversals. Supports analysis, modification, and queries.

• Gremlin Structured API, a generic connector framework or API. Interface to a backend graph engine.

Page 12: (DAT203) Building Graph Databases on AWS

Graph DB use cases

• Social

• Recommendation

• Classic network problems

• Deep hierarchies

• Sensor analysis with geo-spatial constraints

• Fraud detection

• Identity and Access Management

Page 13: (DAT203) Building Graph Databases on AWS

Recommendation engine example

Diagram components: Neo4j cluster, EMR, Writes, Reads

Buy like item: “People who bought this item also bought”

Custom email: “Something you recently looked at has changed”
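
The “People who bought this item also bought” query amounts to a two-hop walk: from the item to its buyers, then from those buyers to their other purchases, ranked by count. The sketch below shows that logic in plain Python over made-up purchase data; it is not the query used in the talk, and `also_bought` is a hypothetical helper.

```python
from collections import Counter

# (user, item) purchase edges; made-up sample data.
PURCHASES = {
    ("u1", "book"), ("u1", "lamp"),
    ("u2", "book"), ("u2", "lamp"), ("u2", "desk"),
    ("u3", "book"), ("u3", "lamp"),
    ("u4", "desk"),
}


def also_bought(item, purchases, top_n=3):
    """Rank other items by how many buyers of `item` also bought them."""
    buyers = {user for user, bought in purchases if bought == item}
    counts = Counter(bought for user, bought in purchases
                     if user in buyers and bought != item)
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:top_n]]


recommendations = also_bought("book", PURCHASES)
```

In a graph database the same question starts from the item vertex and follows purchase edges in both directions, so the cost depends on the item's neighborhood rather than on the total number of purchases.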

Page 14: (DAT203) Building Graph Databases on AWS

Inbound fulfillment

Page 15: (DAT203) Building Graph Databases on AWS

Inbound fulfillment data problems

Page 16: (DAT203) Building Graph Databases on AWS

Approaches

Manual Research

• All tools emit events

• Humans trace the events

• Difficult to follow as the search space increases

• Developed queries, but they took too long to run

Unique Identifiers

• Every item gets a unique identifier

• Easy to get all related events

• Expensive

• Impractical for some items

Page 17: (DAT203) Building Graph Databases on AWS

Inventory notification graph: data model

Page 18: (DAT203) Building Graph Databases on AWS

Why not use a relational or NoSQL database?

• Relational Database

• Knew data volume would be huge and keep growing

• Did not want to vertically scale

• JOINs on table will be expensive

• Use case required high availability

• NoSQL Store

• Would be the same solution without all the functionality built into the TinkerPop Graph Framework

Page 19: (DAT203) Building Graph Databases on AWS

Why a graph?

• No way to index just the events we need

• Need to perform search from receive to stow and vice versa; i.e., requires many hops to find the data

• Need to process messages out of order

• Graphs provide a simple mental model

Page 20: (DAT203) Building Graph Databases on AWS

Why Titan?

TinkerPop: Rexster (graph server), Blueprints (generic graph API), Furnace (graph algorithms), Frames (object-graph mapper), Gremlin (traversal language), Pipes (dataflows)

Titan

Backend: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB

Page 21: (DAT203) Building Graph Databases on AWS

Cassandra

• Highly available

• Existing Titan implementation

• EC2Snitch

• Replication

• RandomPartitioner

Page 22: (DAT203) Building Graph Databases on AWS

Cassandra: Titan lessons learned

• No one on our team had experience managing or configuring a Cassandra cluster

• Needed to manage a cluster

• Team manually replaces hosts as EC2 swaps them out

• Does not handle time series data well

• We ran two producers against two keyspaces so we could efficiently drop old data

Page 23: (DAT203) Building Graph Databases on AWS

DynamoDB: Titan

• Massively scalable

• No more tuning and host management

• Team was already familiar with DynamoDB

• Risky because there was no existing Titan implementation

Page 24: (DAT203) Building Graph Databases on AWS

Inventory notification graph – architecture

Page 25: (DAT203) Building Graph Databases on AWS

DynamoDB: single-item data model

Hash Key (hk) | Attribute | Attribute | Attribute | Attribute | Attribute
Vertex id 1 | Property – Name: Justin | Edge (out) – Friend: Anna | Edge (out) – Friend: Kris | Edge (out) – Likes: Movies | Hidden Property – Exists
Vertex id 2 | Property – Name: Anna | Edge (out) – Friend: Justin | Edge (out) – Likes: Books | Hidden Property – Exists |
Vertex id 3 | Property – Name: Kris | Edge (out) – Friend: Justin | Edge (out) – Likes: Movies | Hidden Property – Exists |
Vertex id 4 | Property – Name: Movies | Edge (out) – Friend: Justin | Edge (out) – Likes: Kris | Hidden Property – Exists |
Vertex id 5 | Property – Name: Books | Edge (out) – Friend: Anna | Hidden Property – Exists | |
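
In the single-item model, everything known about a vertex lives in one item keyed by the vertex id, with one attribute per property or edge. The sketch below builds such an item as a plain dictionary. The attribute naming scheme (`property:…`, `edge:…`, `hidden:exists`) is an assumption made for readability and is not the encoding used by the DynamoDB Storage Backend for Titan.

```python
def to_single_item(vertex_id, properties, out_edges):
    """Pack one vertex, its properties, and its out-edges into one item."""
    item = {"hk": vertex_id, "hidden:exists": True}
    for name, value in properties.items():
        item[f"property:{name}"] = value
    for label, target in out_edges:
        # One attribute per edge; the attribute name identifies the edge.
        item[f"edge:{label}:{target}"] = target
    return item


# Vertex 1 from the slide.
justin = to_single_item(
    1,
    {"Name": "Justin"},
    [("Friend", "Anna"), ("Friend", "Kris"), ("Likes", "Movies")],
)

edge_attributes = sorted(k for k in justin if k.startswith("edge:"))
```

A single read by hash key returns the whole vertex, but the item grows with every edge, so a vertex with very many edges can run into DynamoDB's per-item size limit.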

Page 26: (DAT203) Building Graph Databases on AWS

DynamoDB: multiple-item data model

Hash Key (hk) | Range Key (rk) | Value (v)
Vertex id 1 | Range key |
Vertex id 1 | Property id | Property – Name: Justin
Vertex id 1 | Edge id | Edge (out) – Friend: Anna
Vertex id 1 | Edge id | Edge (out) – Friend: Kris
Vertex id 2 | Range key |
Vertex id 2 | Property id | Property – Name: Anna
Vertex id 2 | Edge id | Edge (out) – Friend: Justin
Vertex id 2 | Edge id | Edge (out) – Friend: Brooks
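
In the multiple-item model, each property or edge becomes its own item that shares the vertex id as hash key and uses a range key to tell the entries apart. The sketch below mimics that layout with a list of dictionaries and a query by hash key and range-key prefix. The range-key format and the helper names are assumptions made for illustration, not the backend's real encoding.

```python
def to_items(vertex_id, properties, out_edges):
    """Emit one (hk, rk, v) item per property and per out-edge."""
    items = []
    for name, value in properties.items():
        items.append({"hk": vertex_id, "rk": f"property:{name}", "v": value})
    for label, target in out_edges:
        items.append({"hk": vertex_id, "rk": f"edge:{label}:{target}", "v": target})
    return items


def query(table, vertex_id, rk_prefix=""):
    """Return items for one hash key whose range key starts with the prefix."""
    return [item for item in table
            if item["hk"] == vertex_id and item["rk"].startswith(rk_prefix)]


# Vertices 1 and 2 from the slide.
table = []
table += to_items(1, {"Name": "Justin"}, [("Friend", "Anna"), ("Friend", "Kris")])
table += to_items(2, {"Name": "Anna"}, [("Friend", "Justin"), ("Friend", "Brooks")])

justin_friends = [item["v"] for item in query(table, 1, "edge:Friend:")]
```

Compared with the single-item layout, a vertex is no longer bounded by the size of one item, at the cost of reading several items to assemble it.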

Page 27: (DAT203) Building Graph Databases on AWS

DynamoDB: how does it scale?

• Close to 100 billion vertices

• Terabytes of data

• Without corresponding increase in latency

Page 28: (DAT203) Building Graph Databases on AWS

DynamoDB: Titan lessons learned

• Use Titan explicit partitioning on large graph

• Partition across multiple graphs for time series data

• Able to achieve stable performance at scale

Page 30: (DAT203) Building Graph Databases on AWS

Resources

• Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem

• Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson

• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler

• Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels

• Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr

• Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis

• Amazon DynamoDB Storage Backend for Titan FAQ

• Amazon DynamoDB Storage Backend for Titan Documentation

Page 31: (DAT203) Building Graph Databases on AWS

Thank you!


