© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Todd Hildebrant and Matthew Sowders
AWS
October 2015
DAT203
Graph Databases on AWS
What to Expect from the Session
Who are we?
General overview of graph database technology
AWS architecture examples
Amazon Fulfillment Technologies' Inventory Notification Graph
Amazon DynamoDB Storage Backend for Titan
Graph databases on AWS
What is a graph? What is a graph database?
A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties.
A tree is a special case of a graph.
A graph database uses a property graph as the data
model and includes a query language.
Other possible data models are hypergraphs and triple stores (RDF).
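To make the definition concrete, here is a minimal sketch of a property graph in Python. The class name, method names, and the `since` property are illustrative assumptions, not part of any graph database API.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Vertices plus directed, labeled edges; both may carry properties."""
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (source, label, target, properties)

    def add_vertex(self, vertex_id, **properties):
        self.vertices[vertex_id] = properties

    def add_edge(self, source, label, target, **properties):
        # An edge only makes sense between two known vertices.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((source, label, target, properties))

    def out_neighbors(self, vertex_id, label):
        """Follow edges with the given label in their stored direction."""
        return [t for s, l, t, _ in self.edges if s == vertex_id and l == label]


# Hypothetical sample data.
graph = PropertyGraph()
graph.add_vertex("justin", name="Justin")
graph.add_vertex("anna", name="Anna")
graph.add_edge("justin", "FRIEND", "anna", since=2015)
```

Because edges are directed, `out_neighbors("justin", "FRIEND")` finds Anna, while the same call on `"anna"` finds nothing unless a second edge is added.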
Graph data modeling
NoSQL data models: Document, Key-Value, Columnar, Graph, Mixed
CAP and ACID
Start with the use case, then develop the data model:
As a Student, I want to know other Students in my Class who know about a Subject
Student KNOWS Subject, Student BELONGS_TO Class
Diagram: (Subject) <-KNOWS- (Student) -BELONGS_TO-> (Class)
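The use case can be sketched as a small traversal over KNOWS and BELONGS_TO edges. This is an illustration in plain Python, not a graph database query; the student, class, and subject names and the `classmates_who_know` function are invented for the example.

```python
# (source, label, target) triples; all names are invented sample data.
EDGES = {
    ("alice", "BELONGS_TO", "graphs101"),
    ("bob", "BELONGS_TO", "graphs101"),
    ("carol", "BELONGS_TO", "graphs101"),
    ("dave", "BELONGS_TO", "sql201"),
    ("bob", "KNOWS", "gremlin"),
    ("dave", "KNOWS", "gremlin"),
    ("carol", "KNOWS", "cypher"),
}


def classmates_who_know(student, subject, edges=EDGES):
    """Other Students in my Class who know about a Subject."""
    # Hop 1: the classes the starting student belongs to.
    classes = {t for s, l, t in edges if s == student and l == "BELONGS_TO"}
    # Hop 2: everyone else who belongs to one of those classes.
    classmates = {
        s for s, l, t in edges
        if l == "BELONGS_TO" and t in classes and s != student
    }
    # Hop 3: keep the classmates with a KNOWS edge to the subject.
    return sorted(s for s in classmates if (s, "KNOWS", subject) in edges)
```

Dave knows the subject too, but he is in a different class, so `classmates_who_know("alice", "gremlin")` returns only Bob.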
Graph vs. relational database
Graph
Need to traverse a graph without JOINs
Queries have a starting location: MATCH ON x
Normalized attribute to enable filtering
Dynamic schema
Relational
Columnar analytics
Tables denormalized for performance
Cluster and fault management
Recursive query support in the query optimizer
Titan: distributed graph database
Distributed graph
Storage layer has plug-in architecture
Native TinkerPop implementation
Full text search with Lucene, SOLR, Elasticsearch
HA using multi-master replication (Cassandra cluster)
Scalability using DynamoDB
Neo4j
Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable using JVM
HA when distributed, uses Paxos for master election
Attempts to load DB into RAM, larger is better. Efficient spilling to disk.
Primary query language is Cypher, supports Gremlin
AWS deployment for Neo4j
Availability Zone #1
Write ELB
Availability Zone #1
Read ELB
ELB health checks
HTTP GET
/db/manage/server/ha/master
/db/manage/server/ha/slave
/db/manage/server/ha/available
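One way to read this deployment: each ELB runs an HTTP GET health check against a role endpoint and only routes to instances that pass. The sketch below models that decision without any network calls. It assumes the write ELB checks the master endpoint, the read ELB checks the slave endpoint, and an instance answers 200 only for the role it currently holds; the instance names and status codes are made up.

```python
WRITE_CHECK = "/db/manage/server/ha/master"
READ_CHECK = "/db/manage/server/ha/slave"


def in_service(status_code):
    """An HTTP health check passes only on a 200 response."""
    return status_code == 200


def route(instances):
    """Map {instance: {path: status}} to the instances each ELB would serve."""
    pools = {"write": [], "read": []}
    for name, statuses in sorted(instances.items()):
        # A missing status (0) counts as a failed check.
        if in_service(statuses.get(WRITE_CHECK, 0)):
            pools["write"].append(name)
        if in_service(statuses.get(READ_CHECK, 0)):
            pools["read"].append(name)
    return pools


# Hypothetical cluster: one master, two replicas.
cluster = {
    "neo4j-a": {WRITE_CHECK: 200, READ_CHECK: 404},
    "neo4j-b": {WRITE_CHECK: 404, READ_CHECK: 200},
    "neo4j-c": {WRITE_CHECK: 404, READ_CHECK: 200},
}
pools = route(cluster)
```

When a new master is elected, the health checks move it between pools without any change to client configuration.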
Analytics on graphs
OLAP not OLTP
Leverages the Hadoop / MapReduce framework
GraphX is analytics on Spark in-memory; functional-like, declarative programming model
Giraph is graph processing using MapReduce / HDFS; procedural, vertex-centric programming model
Aggregation type queries over the entire graph
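The vertex-centric model can be illustrated with a single-process sketch: in each superstep, active vertices send their current label to their neighbors, and a vertex that receives a smaller label adopts it. This is plain Python that mimics the style, not Giraph or GraphX code; connected components was chosen here as a simple whole-graph computation.

```python
def connected_components(adjacency):
    """Label every vertex with the smallest vertex id in its component.

    adjacency: {vertex: set of neighbors} for an undirected graph.
    """
    labels = {v: v for v in adjacency}
    active = set(adjacency)
    while active:
        # Message phase: each active vertex sends its label to its neighbors.
        inbox = {v: [] for v in adjacency}
        for v in active:
            for neighbor in adjacency[v]:
                inbox[neighbor].append(labels[v])
        # Compute phase: a vertex stays active only if its label improved.
        active = set()
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                active.add(v)
    return labels


# Invented sample graph with two components.
sample = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
components = connected_components(sample)
```

The loop ends when no vertex changes its label, which is the usual halting condition in this style of computation.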
TinkerPop
Apache Incubator graph framework supporting both OLAP and OLTP.
Gremlin, a query language for graph traversals. Supports analysis, modification, and queries.
Gremlin Structure API, a generic connector framework or API. Interface to a backend graph engine.
Graph DB use cases
Social
Recommendation
Classic network problems
Deep hierarchies
Sensor analysis with geo-spatial constraints
Fraud detection
Identity and Access Management
Recommendation engine example
Diagram: Neo4j cluster (writes, reads) and EMR
Buy / like item
People who bought this item also bought
Custom
Something you recently looked at has changed
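A rough sketch of the "People who bought this item also bought" query: start at an item, hop to its buyers, hop to their other purchases, and rank by frequency. The purchase data and the `also_bought` function are invented; this does not reflect how the cluster in the diagram is actually queried.

```python
from collections import Counter

# Hypothetical buyer -> purchased items edges.
PURCHASES = {
    "ann": {"tent", "stove", "lantern"},
    "ben": {"tent", "stove"},
    "cat": {"tent", "boots"},
    "dan": {"boots"},
}


def also_bought(item, purchases=PURCHASES, limit=3):
    """Rank other items by how many buyers of `item` also bought them."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            counts.update(basket - {item})
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:limit]]
```

The query touches only the neighborhood of the starting item, which is why it fits the read path of a graph store.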
Inbound fulfillment
Inbound fulfillment data problems
Manual Research
All tools emit events
Humans trace the events
Difficult to follow as search space increases
Developed queries, but took too long to run
Approaches
Unique Identifiers
Every item gets a unique identifier
Easy to get all related events
Expensive
Impractical for some items
Inventory notification graph: data model
Why not use a relational or NoSQL database?
Relational Database
Knew data volume would be huge and keep growing
Did not want to vertically scale
JOINs on table will be expensive
Use case required high availability
NoSQL Store
Would be the same solution without all the functionality built into the TinkerPop Graph Framework
Why a graph?
No way to index just the events we need
Need to perform search from receive to stow and vice versa; i.e., requires many hops to find the data
Need to process messages out of order
Graphs provide a simple mental model
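The requirements above can be illustrated with a small in-memory event graph: edges are indexed in both directions so a search can run from receive to stow or back, and edges can be added in any order. The class, method names, and event ids are hypothetical and are not the team's actual schema.

```python
from collections import defaultdict, deque


class EventGraph:
    def __init__(self):
        self.out_edges = defaultdict(set)  # earlier event -> later events
        self.in_edges = defaultdict(set)   # later event -> earlier events

    def link(self, earlier, later):
        """Record that `earlier` led to `later`; arrival order does not matter."""
        self.out_edges[earlier].add(later)
        self.in_edges[later].add(earlier)

    def reachable(self, start, forward=True):
        """Breadth-first search over many hops, in either direction."""
        edges = self.out_edges if forward else self.in_edges
        seen, queue = {start}, deque([start])
        while queue:
            current = queue.popleft()
            for nxt in edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen - {start}


events = EventGraph()
# The later hop arrives first, as an out-of-order message would.
events.link("unpack-1", "stow-1")
events.link("receive-1", "unpack-1")
```

Searching forward from `receive-1` and backward from `stow-1` both find the full chain, even though the messages arrived out of order.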
Why Titan?
TinkerPop stack on top of Titan:
Rexster (graph server)
Blueprints (generic graph API)
Furnace (graph algorithms)
Frames (object-graph mapper)
Gremlin (traversal language)
Pipes (dataflows)
Titan
Backend: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB
Cassandra
Highly available
Existing Titan implementation
EC2Snitch
Replication
RandomPartitioner
Cassandra: Titan lessons learned
No one on our team had experience managing or configuring a Cassandra cluster
Needed to manage a cluster
Team manually replaces hosts as EC2 swaps them out
Does not handle time series data well
We ran two producers against two keyspaces so we could efficiently drop old data
DynamoDB: Titan
Massively scalable
No more tuning and host management
Team was already familiar with DynamoDB
Risky because there was no existing Titan implementation
Inventory notification graph architecture
DynamoDB: single-item data model
Hash Key (hk) | Attribute | Attribute | Attribute | Attribute | Attribute
Vertex id 1 | Property Name: Justin | Edge (out) Friend: Anna | Edge (out) Friend: Kris | Edge (out) Likes: Movies | Hidden Property - Exists
Vertex id 2 | Property Name: Anna | Edge (out) Friend: Justin | Edge (out) Likes: Books | Hidden Property - Exists |
Vertex id 3 | Property Name: Kris | Edge (out) Friend: Justin | Edge (out) Likes: Movies | Hidden Property - Exists |
Vertex id 4 | Property Name: Movies | Edge (out) Friend: Justin | Edge (out) Likes: Kris | Hidden Property - Exists |
Vertex id 5 | Property Name: Books | Edge (out) Friend: Anna | Hidden Property - Exists | |
DynamoDB: multiple-item data model
Hash Key (hk) | Range Key (rk) | Value (v)
Vertex id 1 | Range key |
Vertex id 1 | Property id | Property Name: Justin
Vertex id 1 | Edge id | Edge (out) Friend: Anna
Vertex id 1 | Edge id | Edge (out) Friend: Kris
Vertex id 2 | Range key |
Vertex id 2 | Property id | Property Name: Anna
Vertex id 2 | Edge id | Edge (out) Friend: Justin
Vertex id 2 | Edge id | Edge (out) Friend: Brooks
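The difference between the two layouts can be sketched with plain dictionaries. In the single-item model a vertex is one item with one attribute per property or edge; in the multiple-item model each property or edge becomes its own item under the same hash key. The attribute names below are readable stand-ins chosen for illustration, not the encoding the storage backend actually uses.

```python
# Single-item model: one item per vertex, one attribute per column.
single_item = {
    "hk": "vertex-1",
    "property:name": "Justin",
    "edge-out:friend:anna": "Anna",
    "edge-out:friend:kris": "Kris",
}


def to_multiple_items(item):
    """Split one wide item into (hk, rk, v) items, one per column."""
    hash_key = item["hk"]
    return [
        {"hk": hash_key, "rk": column, "v": value}
        for column, value in sorted(item.items())
        if column != "hk"
    ]


multiple_items = to_multiple_items(single_item)
```

DynamoDB limits a single item to 400 KB, so a vertex with a very large number of edges may not fit the single-item layout; the multiple-item layout trades that limit for more items per vertex.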
DynamoDB: how does it scale?
Close to 100 billion vertices
Terabytes of data
Without corresponding increase in latency
DynamoDB: Titan lessons learned
Use Titan explicit partitioning on large graph
Partition across multiple graphs for time series data
Able to achieve stable performance at scale
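Partitioning time series data across multiple graphs might look like the sketch below: each event is routed to a graph named after its time bucket, and whole graphs are dropped once they age out. The monthly bucket, the `inventory` prefix, and the retention of two months are assumptions made for the example, not details from the talk.

```python
from datetime import date


def graph_name_for(day, prefix="inventory"):
    """Route an event to the graph for its calendar month."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def expired_graphs(existing, today, keep_months=2, prefix="inventory"):
    """Return the graphs that fall outside the retention window."""
    keep = set()
    year, month = today.year, today.month
    for _ in range(keep_months):
        keep.add(f"{prefix}_{year:04d}_{month:02d}")
        # Step back one month, wrapping from January to December.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return sorted(name for name in existing if name not in keep)
```

Dropping an entire graph avoids deleting old vertices and edges one by one, the same goal the two Cassandra keyspaces served.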
How to get started
GitHub Repository: https://github.com/awslabs/dynamodb-titan-storage-backend
DynamoDB Local
CloudFormation Template: https://github.com/awslabs/dynamodb-titan-storage-backend/blob/0.5.4/dynamodb-titan-storage-backend-cfn.json
Resources
Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem
http://www.amazon.com/Graph-Databases-Opportunities-Connected-Data-ebook/dp/B00ZGRS4VY
Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson
http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement-ebook/dp/B00AYQNR50
NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler
http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence-ebook/dp/B0090J3SYW
Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels
http://www.allthingsdistributed.com/2015/08/titan-graphdb-integration-in-dynamodb.html
Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr
https://aws.amazon.com/blogs/aws/new-store-and-process-graph-data-using-the-dynamodb-storage-backend-for-titan/
Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis
https://medium.com/aws-activate-startup-blog/amazon-dynamodb-storage-backend-for-titan-distributed-graph-database-b9cc8cca80b7
Amazon DynamoDB Storage Backend for Titan FAQ
https://aws.amazon.com/dynamodb/faqs/#storagebackend
Amazon DynamoDB Storage Backend for Titan Documentation
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.html
Thank you!
Remember to complete your evaluations!