(DAT203) Building Graph Databases on AWS

Date post:08-Jan-2017
  • © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

    Todd Hildebrant and Matthew Sowders


    October 2015


    Graph Databases on AWS

  • What to Expect from the Session

    Who are we?

    General overview of graph database technology

    AWS architecture examples

    Amazon Fulfillment Technology's Inventory Notification


    Amazon DynamoDB Storage Backend for Titan

  • Graph databases on AWS

  • What is a graph? What is a graph database?

    A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties.

    Superset of the tree data structure: every tree is a graph, but not every graph is a tree.

    A graph database uses a property graph as the data model and includes a query language.

    Other possible data models are hyper-graphs, triple-stores, RDF.

  • Graph data modeling

    NoSQL data models: Document, Key-Value, Columnar, Graph, Mixed

    CAP and ACID

    Start with the use case, then develop the data model:

    As a Student, I want to know other Students in my Class who know about a Subject

    Student KNOWS Subject, Student BELONGS_TO Class

    [Diagram: Student, Subject, and Class vertices]
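
    The use case above can be sketched as a tiny in-memory property graph. This is only an illustration of the data model, not how any particular graph database stores it; the sample students, the class and subject names, and the helper functions add_edge and classmates_who_know are all hypothetical.

```python
from collections import defaultdict

# Directed, labeled edges kept as adjacency sets in both directions:
# out_edges[src][label] -> {dst}, in_edges[dst][label] -> {src}.
out_edges = defaultdict(lambda: defaultdict(set))
in_edges = defaultdict(lambda: defaultdict(set))


def add_edge(src, label, dst):
    """Record a directed edge src -[label]-> dst."""
    out_edges[src][label].add(dst)
    in_edges[dst][label].add(src)


# Hypothetical sample data (not from the talk).
add_edge("alice", "BELONGS_TO", "class-101")
add_edge("bob", "BELONGS_TO", "class-101")
add_edge("carol", "BELONGS_TO", "class-101")
add_edge("dave", "BELONGS_TO", "class-202")
add_edge("bob", "KNOWS", "graphs")
add_edge("dave", "KNOWS", "graphs")
add_edge("carol", "KNOWS", "sql")


def classmates_who_know(student, subject):
    """Students who share a Class with `student` and KNOW `subject`."""
    classmates = set()
    for cls in out_edges[student]["BELONGS_TO"]:
        classmates |= in_edges[cls]["BELONGS_TO"]
    classmates.discard(student)
    experts = in_edges[subject]["KNOWS"]
    return sorted(classmates & experts)


print(classmates_who_know("alice", "graphs"))
```

    The query starts at one Student vertex and follows BELONGS_TO and KNOWS edges, which mirrors how the user story is phrased.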


  • Graph vs. relational database


    Need to traverse a graph without JOINs

    Queries have a starting location MATCH ON x

    Normalized attribute to enable filtering

    Dynamic schema

    Columnar analytics

    Tables denormalized for

    Cluster and fault

    Recursive query support in the query optimizer
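
    To make "traverse without JOINs" and "queries have a starting location" concrete, here is a minimal sketch of a hop-limited traversal over an adjacency list. In a relational schema each hop would typically be another self-join on an edge table; here each hop is a dictionary lookup. The FOLLOWS data and the within_hops function are hypothetical names used only for illustration.

```python
from collections import deque

# Hypothetical adjacency list: vertex -> list of neighbor vertices.
FOLLOWS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}


def within_hops(graph, start, max_hops):
    """Return {vertex: hop_count} for every vertex reachable from `start`
    in at most `max_hops` edges, using breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(vertex, []):
            if neighbor not in distance:
                distance[neighbor] = distance[vertex] + 1
                queue.append(neighbor)
    return distance


print(within_hops(FOLLOWS, "a", 2))
```

    The work done depends on the neighborhood of the starting vertex rather than on the total size of the data set, which is the usual argument for graph stores on multi-hop queries.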

  • Titan: distributed graph database

    Distributed graph

    Storage layer has plug-in architecture

    Native TinkerPop implementation

    Full-text search with Lucene, Solr, Elasticsearch

    HA using multi-master replication (Cassandra cluster)

    Scalability using DynamoDB

  • Neo4j

    Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable in a JVM

    HA when distributed, uses Paxos for master election

    Attempts to load DB into RAM, larger is better. Efficient

    spilling to disk.

    Primary query language is Cypher, supports Gremlin

  • AWS deployment for Neo4j

    [Diagram labels: Availability Zone #1 (shown twice), Write ELB, Read ELB, ELB health checks]





  • Analytics on graphs

    OLAP not OLTP

    Leverages the Hadoop / MapReduce framework

    GraphX is graph analytics on Spark, in memory; functional-like, declarative programming model

    Giraph is graph processing using MapReduce / HDFS; procedural, vertex-centric programming model

    Aggregation type queries over the entire graph
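
    The "vertex-centric programming model" can be illustrated with a toy superstep loop in plain Python. This is not the Giraph or GraphX API; it is a single-process sketch of the idea that each vertex reads its incoming messages, updates its own value, and sends messages to its neighbors. The algorithm shown (connected components by propagating the smallest vertex id) and the function name connected_components are choices made for this example.

```python
def connected_components(neighbors):
    """Label each vertex with the smallest vertex id in its component.

    `neighbors` maps vertex -> list of adjacent vertices. The graph is
    treated as undirected, so each edge must appear in both directions.
    """
    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Later supersteps: run until no vertex receives a message.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex is idle this superstep
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox
    return label


# Hypothetical sample graph with three components.
GRAPH = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
print(connected_components(GRAPH))
```

    Because every vertex only looks at its own messages, the per-vertex work in each superstep can be distributed across workers, which is what makes this style a fit for whole-graph aggregation.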

  • TinkerPop

    Apache Incubator graph framework supporting both OLAP and OLTP.

    Gremlin, a query language for graph traversals. Supports analysis, modification, and queries.

    Gremlin Structure API, a generic connector framework or API. Interface to a backend graph engine.

  • Graph DB use cases



    Classic network problems

    Deep hierarchies

    Sensor analysis with geo-spatial constraints

    Fraud detection

    Identity and Access Management

  • Recommendation engine example

    Neo4j cluster


    Writes Reads

    Buy like


    People who bought

    this item also bought



    Something you

    recently looked at has
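
    A "people who bought this item also bought" read can be sketched as a two-hop traversal: from the item to its buyers, then from those buyers to their other items. The sketch below uses plain Python rather than Cypher or a Neo4j driver, and the BOUGHT data and also_bought function are hypothetical.

```python
from collections import Counter

# Hypothetical purchase edges: customer -> set of items bought.
BOUGHT = {
    "ann": {"book", "lamp", "pen"},
    "ben": {"book", "lamp"},
    "cat": {"book", "mug"},
    "dan": {"pen"},
}


def also_bought(item, purchases, limit=3):
    """Rank other items by how many buyers of `item` also bought them."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            counts.update(basket - {item})
    # Sort by descending count, then by name for a deterministic order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:limit]]


print(also_bought("book", BOUGHT))
```

    In a deployment like the one on this slide, purchases would arrive as writes and a query of this shape would be served as a read.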


  • Inbound fulfillment

  • Inbound fulfillment data problems

  • Manual Research

    All tools emit events

    Humans trace the events

    Difficult to follow as search

    space increases

    Developed queries, but took

    too long to run


    Unique Identifiers

    Every item gets a unique


    Easy to get all related events


    Impractical for some items

  • Inventory notification graph: data model

  • Why not use a relational or NoSQL database?

    Relational Database

    Knew data volume would be huge and keep growing

    Did not want to vertically scale

    JOINs on table will be expensive

    Use case required high availability

    NoSQL Store

    Would be the same solution without all the functionality built

    into the TinkerPop Graph Framework

  • Why a graph?

    No way to index just the events we need

    Need to perform search from receive to stow and vice

    versa; i.e., requires many hops to find the data

    Need to process messages out of order

    Graphs provide a simple mental model

  • Why Titan?



    Storage backends: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB

    Rexster (graph server)

    Blueprints (generic graph API)

    Furnace (graph algorithms)

    Frames (object-graph mapper)

    Gremlin (traversal language)


  • Cassandra

    Highly available

    Existing Titan implementation




  • Cassandra: Titan lessons learned

    No one on our team had experience managing or

    configuring a Cassandra cluster

    Needed to manage a cluster

    Team manually replaces hosts as EC2 swaps them out

    Does not handle time series data well

    We ran two producers against two keyspaces so we

    could efficiently drop old data

  • DynamoDB: Titan

    Massively scalable

    No more tuning and host management

    Team was already familiar with DynamoDB

    Risky because there was no existing Titan


  • Inventory notification graph architecture

  • DynamoDB: single-item data model

    Hash Key (hk) | Attribute | Attribute | Attribute | Attribute | Attribute
    Vertex id 1 | Property Name: Justin | Edge (out) Friend: Anna | Edge (out) Friend: Kris | Edge (out) Likes: Movies | Property -
    Vertex id 2 | Property Name: Anna | Edge (out) Friend: Justin | Edge (out) Likes: Books | Property -
    Vertex id 3 | Property Name: Kris | Edge (out) Friend: Justin | Edge (out) Likes: Movies | Property -
    Vertex id 4 | Property Name: Movies | Edge (out) Friend: Justin | Edge (out) Likes: Kris | Property -
    Vertex id 5 | Property Name: Books | Edge (out) Friend: Anna | Property -


  • DynamoDB: multiple-item data model

    Hash Key (hk) | Range Key (rk) | Value (v)
    Vertex id 1 | Range key |
    Vertex id 1 | Property id | Property Name Justin
    Vertex id 1 | Edge id | Edge (out) Friend Anna
    Vertex id 1 | Edge id | Edge (out) Friend Kris
    Vertex id 2 | Range key |
    Vertex id 2 | Property id | Property Name Anna
    Vertex id 2 | Edge id | Edge (out) Friend Justin
    Vertex id 2 | Edge id | Edge (out) Friend
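
    The difference between the two layouts can be sketched with plain dictionaries. The attribute and range-key naming below ("property:...", "edge-out:...") is a hypothetical encoding chosen for readability; the real storage backend uses its own encoding, and the functions to_single_item and to_multiple_items are illustrative only.

```python
def to_single_item(vertex_id, properties, out_edges):
    """Single-item model: one item per vertex, with one attribute per
    property and one attribute per outgoing edge."""
    item = {"hk": vertex_id}
    for name, value in properties.items():
        item[f"property:{name}"] = value
    for label, target in out_edges:
        item[f"edge-out:{label}:{target}"] = target
    return item


def to_multiple_items(vertex_id, properties, out_edges):
    """Multiple-item model: one (hk, rk, v) item per property or edge,
    all sharing the vertex id as the hash key."""
    items = []
    for name, value in properties.items():
        items.append({"hk": vertex_id, "rk": f"property:{name}", "v": value})
    for label, target in out_edges:
        items.append(
            {"hk": vertex_id, "rk": f"edge-out:{label}:{target}", "v": target}
        )
    return items


# Vertex 1 from the slides: Justin, with two Friend edges and one Likes edge.
props = {"Name": "Justin"}
edges = [("Friend", "Anna"), ("Friend", "Kris"), ("Likes", "Movies")]

single = to_single_item("vertex-1", props, edges)
multiple = to_multiple_items("vertex-1", props, edges)
print(single)
for row in multiple:
    print(row)
```

    One trade-off worth noting: a DynamoDB item is limited to 400 KB, so a vertex with very many edges cannot fit in the single-item layout, while the multiple-item layout spreads the vertex over many small items that share a hash key.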


  • DynamoDB: how does it scale?

    Close to 100 billion vertices

    Terabytes of data

    Without corresponding increase in latency

  • DynamoDB: Titan lessons learned

    Use Titan explicit partitioning on large graph

    Partition across multiple graphs for time series data

    Able to achieve stable performance at scale
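
    "Partition across multiple graphs for time series data" could look something like the sketch below: route each event to a graph named after its time bucket, so that old data is removed by dropping a whole graph rather than deleting individual items. The talk does not say how the team chose buckets; the monthly bucket size, the "inventory" prefix, the retention of three months, and the function names are all assumptions made for this example.

```python
from datetime import datetime, timezone


def graph_name_for(timestamp, prefix="inventory"):
    """Map an event time to a per-month graph name such as 'inventory_2015_10'."""
    utc = timestamp.astimezone(timezone.utc)
    return f"{prefix}_{utc.year:04d}_{utc.month:02d}"


def month_index(year, month):
    """Count months on a single axis so buckets can be compared."""
    return year * 12 + (month - 1)


def expired_graphs(graph_names, now, keep_months=3):
    """Return the graph names that fall outside the newest `keep_months` buckets."""
    utc = now.astimezone(timezone.utc)
    oldest_kept = month_index(utc.year, utc.month) - (keep_months - 1)
    expired = []
    for name in graph_names:
        _, year, month = name.rsplit("_", 2)
        if month_index(int(year), int(month)) < oldest_kept:
            expired.append(name)
    return sorted(expired)


now = datetime(2015, 10, 8, 12, 0, tzinfo=timezone.utc)
names = [f"inventory_2015_{m:02d}" for m in range(6, 11)]
print(graph_name_for(now))
print(expired_graphs(names, now))
```

    Dropping an entire bucket is the same idea as the two-keyspace approach described in the Cassandra lessons learned.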

  • How to get started

    GitHub Repository

    DynamoDB Local

    CloudFormation Template


  • Resources

    Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem

    Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson

    NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler

    Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels

    Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr

    Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis

    Amazon DynamoDB Storage Backend for Titan FAQ

    Amazon DynamoDB Storage Backend for Titan Documentation


  • Thank you!

  • Remember to complete

    your evaluations!
