
(DAT203) Building Graph Databases on AWS

Date post:08-Jan-2017
Transcript:
  • © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

    Todd Hildebrant and Matthew Sowders

    AWS

    October 2015

    DAT203

    Graph Databases on AWS

  • What to Expect from the Session

    Who are we?

    General overview of graph database technology

    AWS architecture examples

    Amazon Fulfillment Technology's Inventory Notification Graph

    Amazon DynamoDB Storage Backend for Titan

  • Graph databases on AWS

  • What is a graph? What is a graph database?

    A graph is a data structure consisting of vertices (nodes), directed edges (relationships), and properties.

    A generalization of the tree data structure: every tree is a graph, but not every graph is a tree.

    A graph database uses a property graph as the data

    model and includes a query language.

    Other possible data models are hypergraphs, triple stores, RDF.

  • Graph data modeling

    NoSQL data models: Document, Key-Value, Columnar, Graph, Mixed

    CAP and ACID

    Start with the use case, then develop the data model:

    As a Student, I want to know other Students in my Class who

    know about a Subject

    Student KNOWS Subject, Student BELONGS_TO Class

    Diagram: (Subject) <-[KNOWS]- (Student) -[BELONGS_TO]-> (Class)
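The user story above can be sketched as a tiny in-memory property graph. This is only an illustration in plain Python, not Titan or Gremlin code; the student, class, and subject names, and the helper functions `add_edge`, `out`, and `classmates_who_know`, are invented for the example.

```python
from collections import defaultdict

# Outgoing edges are stored as (label, target) pairs keyed by source vertex.
# All vertex names below are made up for illustration.
edges = defaultdict(list)

def add_edge(source, label, target):
    edges[source].append((label, target))

add_edge("alice", "BELONGS_TO", "class-101")
add_edge("bob", "BELONGS_TO", "class-101")
add_edge("carol", "BELONGS_TO", "class-202")
add_edge("bob", "KNOWS", "graphs")
add_edge("carol", "KNOWS", "graphs")

def out(vertex, label):
    """Return the targets of the vertex's outgoing edges with this label."""
    return [target for (edge_label, target) in edges[vertex] if edge_label == label]

def classmates_who_know(student, subject):
    """Students who share a class with `student` and KNOW `subject`."""
    my_classes = set(out(student, "BELONGS_TO"))
    result = []
    for other in list(edges):
        if other == student:
            continue
        shares_class = my_classes & set(out(other, "BELONGS_TO"))
        if shares_class and subject in out(other, "KNOWS"):
            result.append(other)
    return sorted(result)

print(classmates_who_know("alice", "graphs"))  # prints ['bob']
```

The query starts at one Student vertex and follows edges outward, which is the access pattern the use case asks for; carol knows the subject but is filtered out because she is in a different class.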

  • Graph vs. relational database

    Graph

    Need to traverse a graph without JOINs

    Queries have a starting location (MATCH ON x)

    Normalized attribute to enable filtering

    Dynamic schema

    Relational

    Columnar analytics

    Tables denormalized for performance

    Cluster and fault management

    Recursive query support in the query optimizer

  • Titan: distributed graph database

    Distributed graph

    Storage layer has plug-in architecture

    Native TinkerPop implementation

    Full-text search with Lucene, Solr, Elasticsearch

    HA using multi-master replication (Cassandra cluster)

    Scalability using DynamoDB

  • Neo4j

    Shared-nothing architecture, single master (writes), multiple replicas (reads), embeddable in the JVM

    HA when distributed, uses Paxos for master election

    Attempts to load DB into RAM, larger is better. Efficient spilling to disk.

    Primary query language is Cypher, supports Gremlin

  • AWS deployment for Neo4j

    Availability Zone #1

    Write ELB

    Availability Zone #1

    Read ELB

    ELB health checks

    HTTP GET

    /db/manage/server/ha/master

    /db/manage/server/ha/slave

    /db/manage/server/ha/available
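The split between a write ELB and a read ELB can be illustrated with a small simulation. It assumes, and this is not stated on the slide, that each HA status endpoint answers HTTP 200 when the instance currently holds that role and 404 otherwise, so the load balancer's health check doubles as a role check. The cluster and node names are hypothetical, and no network calls are made.

```python
# Assumed behavior (not stated on the slide): an endpoint answers 200 when
# the instance holds that role and 404 otherwise.
MASTER_PATH = "/db/manage/server/ha/master"
SLAVE_PATH = "/db/manage/server/ha/slave"

def simulated_status(role, path):
    """Status code a node with the given role would return for a path."""
    if path == MASTER_PATH:
        return 200 if role == "master" else 404
    if path == SLAVE_PATH:
        return 200 if role == "slave" else 404
    raise ValueError("unknown health check path: " + path)

def in_service(nodes, health_check_path):
    """Nodes a load balancer would keep in service for this health check."""
    return sorted(
        name for name, role in nodes.items()
        if simulated_status(role, health_check_path) == 200
    )

# Hypothetical three-node cluster.
cluster = {"node-a": "master", "node-b": "slave", "node-c": "slave"}

write_targets = in_service(cluster, MASTER_PATH)  # behind the write ELB
read_targets = in_service(cluster, SLAVE_PATH)    # behind the read ELB
print(write_targets, read_targets)
```

Under that assumption, a master election moves instances between the two pools on the next health check without any change to the load balancer configuration.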

  • Analytics on graphs

    OLAP not OLTP

    Leverages the Hadoop / MapReduce framework

    GraphX is analytics on Spark in-memory; functional-like,

    declarative programming model

    Giraph is graph using MapReduce / HDFS; procedural,

    vertex-centric programming model

    Aggregation type queries over the entire graph
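The vertex-centric model mentioned above can be sketched in a few lines. This is a plain Python imitation of the superstep idea (each vertex reads its messages, updates its own state, and messages its neighbors), not Giraph or GraphX code. It computes connected components by propagating the smallest vertex id; the function name and the sample graph are invented for the example.

```python
def connected_components(adjacency):
    """Label each vertex with the smallest vertex id in its component.

    adjacency maps a vertex id to the ids of its neighbors (undirected).
    """
    labels = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in adjacency}
    for v, neighbors in adjacency.items():
        for n in neighbors:
            inbox[n].append(labels[v])

    # Later supersteps: a vertex only acts when it has received messages,
    # and only sends messages when its own label improved.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex stays halted this superstep
            smallest = min(messages)
            if smallest < labels[v]:
                labels[v] = smallest
                for n in adjacency[v]:
                    outbox[n].append(smallest)
        inbox = outbox
    return labels

# Two components: {1, 2, 3} and {4, 5}.
sample = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components(sample))
```

The computation stops when no vertex sends a message, which is the usual termination rule for this style of whole-graph aggregation.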

  • TinkerPop

    Apache Incubator graph framework supporting both

    OLAP and OLTP.

    Gremlin, a query language for graph traversals.

    Supports analysis, modification, and queries.

    Gremlin Structure API, a generic connector framework or API: the interface to a backend graph engine.

  • Graph DB use cases

    Social

    Recommendation

    Classic network problems

    Deep hierarchies

    Sensor analysis with geo-spatial constraints

    Fraud detection

    Identity and Access Management

  • Recommendation engine example

    Neo4j cluster, EMR

    Writes: buy, like item

    Reads: "People who bought this item also bought"

    Custom email: "Something you recently looked at has changed"
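The "people who bought this item also bought" read can be sketched as a two-hop traversal: from the item to its buyers, then from those buyers to their other items. The example below is a plain Python illustration, not the query used in the talk; the customers, items, and the `also_bought` helper are made up.

```python
from collections import Counter

# Hypothetical purchase edges: customer -> set of items bought.
purchases = {
    "ann": {"tent", "stove", "lantern"},
    "ben": {"tent", "stove"},
    "cat": {"tent", "boots"},
    "dan": {"boots"},
}

def also_bought(item, purchases, limit=3):
    """Rank other items by how many buyers of `item` also bought them."""
    counts = Counter()
    for items in purchases.values():
        if item in items:
            counts.update(items - {item})
    # Sort by count (descending), then by name so ties have a stable order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ranked[:limit]]

print(also_bought("tent", purchases))  # prints ['stove', 'boots', 'lantern']
```

A graph database answers this kind of question by starting at one vertex and following edges, rather than joining a purchases table against itself.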

  • Inbound fulfillment

  • Inbound fulfillment data problems

  • Approaches

    Manual Research

    All tools emit events

    Humans trace the events

    Difficult to follow as search space increases

    Developed queries, but took too long to run

    Unique Identifiers

    Every item gets a unique identifier

    Easy to get all related events

    Expensive

    Impractical for some items

  • Inventory notification graph: data model

  • Why not use a relational or NoSQL database?

    Relational Database

    Knew data volume would be huge and keep growing

    Did not want to vertically scale

    JOINs on table will be expensive

    Use case required high availability

    NoSQL Store

    Would be the same solution without all the functionality built

    into the TinkerPop Graph Framework

  • Why a graph?

    No way to index just the events we need

    Need to perform search from receive to stow and vice

    versa; i.e., requires many hops to find the data

    Need to process messages out of order

    Graphs provide a simple mental model
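The two requirements above, multi-hop search in both directions and tolerance for out-of-order messages, can be illustrated with a small in-memory sketch. It is not the team's implementation: the `EventGraph` class and the event ids such as `receive:1` and `stow:42` are hypothetical. The point is that an edge can be recorded as soon as a message names both ends, regardless of which event was seen first.

```python
from collections import defaultdict, deque

class EventGraph:
    def __init__(self):
        self.forward = defaultdict(set)   # event -> events it led to
        self.backward = defaultdict(set)  # event -> events that led to it

    def link(self, earlier, later):
        """Record that `earlier` led to `later`; arrival order is irrelevant."""
        self.forward[earlier].add(later)
        self.backward[later].add(earlier)

    def reachable(self, start, direction="forward"):
        """All events reachable from `start` by any number of hops."""
        index = self.forward if direction == "forward" else self.backward
        seen, queue = set(), deque([start])
        while queue:
            current = queue.popleft()
            for nxt in index.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

graph = EventGraph()
# Messages arrive out of order: the later hop is linked first.
graph.link("sort:7", "stow:42")
graph.link("receive:1", "sort:7")

print(sorted(graph.reachable("receive:1")))            # receive -> stow
print(sorted(graph.reachable("stow:42", "backward")))  # stow -> receive
```

Keeping both a forward and a backward index is what makes the "and vice versa" search as cheap as the forward one.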

  • Why Titan?

    TinkerPop

    Backend: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB

    Titan

    Rexster (graph server)

    Blueprints (generic graph API)

    Furnace (graph algorithms)

    Frames (object-graph mapper)

    Gremlin (traversal language)

    Pipes (dataflows)

  • Cassandra

    Highly available

    Existing Titan implementation

    EC2Snitch

    Replication

    RandomPartitioner

  • Cassandra: Titan lessons learned

    No one on our team had experience managing or

    configuring a Cassandra cluster

    Needed to manage a cluster

    Team manually replaces hosts as EC2 swaps them out

    Does not handle time series data well

    We ran two producers against two keyspaces so we

    could efficiently drop old data

  • DynamoDB: Titan

    Massively scalable

    No more tuning and host management

    Team was already familiar with DynamoDB

    Risky because there was no existing Titan

    implementation

  • Inventory notification graph architecture

  • DynamoDB: single-item data model

    Hash Key (hk) | Attribute            | Attribute                 | Attribute                | Attribute                | Attribute
    --------------+----------------------+---------------------------+--------------------------+--------------------------+-------------------------
    Vertex id 1   | Property Name Justin | Edge (out) Friend: Anna   | Edge (out) Friend: Kris  | Edge (out) Likes: Movies | Hidden Property - Exists
    Vertex id 2   | Property Name Anna   | Edge (out) Friend: Justin | Edge (out) Likes: Books  | Hidden Property - Exists |
    Vertex id 3   | Property Name Kris   | Edge (out) Friend: Justin | Edge (out) Likes: Movies | Hidden Property - Exists |
    Vertex id 4   | Property Name Movies | Edge (out) Friend: Justin | Edge (out) Likes: Kris   | Hidden Property - Exists |
    Vertex id 5   | Property Name Books  | Edge (out) Friend: Anna   | Hidden Property - Exists |                          |

  • DynamoDB: multiple-item data model

    Hash Key (hk) | Range Key (rk) | Value (v)
    --------------+----------------+---------------------------
    Vertex id 1   | Range key      |
    Vertex id 1   | Property id    | Property Name Justin
    Vertex id 1   | Edge id        | Edge (out) Friend Anna
    Vertex id 1   | Edge id        | Edge (out) Friend Kris
    Vertex id 2   | Range key      |
    Vertex id 2   | Property id    | Property Name Anna
    Vertex id 2   | Edge id        | Edge (out) Friend Justin
    Vertex id 2   | Edge id        | Edge (out) Friend Brooks
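The difference between the single-item and multiple-item data models can be sketched with plain dictionaries. This is an illustration only: the attribute names and range-key strings such as `property:Name` and `edge:Friend:Anna` are invented here and are not the storage backend's real serialization. Only the `hk`, `rk`, and `v` column names come from the slides.

```python
# A tiny graph based on the slide's example (two vertices only).
graph = {
    1: {"properties": {"Name": "Justin"},
        "out_edges": [("Friend", "Anna"), ("Friend", "Kris")]},
    2: {"properties": {"Name": "Anna"},
        "out_edges": [("Friend", "Justin")]},
}

def single_item_layout(graph):
    """One item per vertex: every property and edge becomes an attribute."""
    items = []
    for vertex_id, vertex in graph.items():
        item = {"hk": vertex_id}
        for name, value in vertex["properties"].items():
            item["property:" + name] = value
        for label, target in vertex["out_edges"]:
            item["edge:" + label + ":" + target] = "out"
        items.append(item)
    return items

def multiple_item_layout(graph):
    """One item per property or edge; items of a vertex share the hash key."""
    items = []
    for vertex_id, vertex in graph.items():
        for name, value in vertex["properties"].items():
            items.append({"hk": vertex_id, "rk": "property:" + name, "v": value})
        for label, target in vertex["out_edges"]:
            items.append({"hk": vertex_id,
                          "rk": "edge:" + label + ":" + target,
                          "v": "out"})
    return items

single = single_item_layout(graph)
multiple = multiple_item_layout(graph)
print(len(single), len(multiple))  # prints 2 5
```

The trade-off the two layouts suggest: a single item returns a whole vertex in one read but is bounded by DynamoDB's per-item size limit, while multiple items let a vertex with many edges keep growing at the cost of a range query per vertex.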

  • DynamoDB: how does it scale?

    Close to 100 billion vertices

    Terabytes of data

    Without corresponding increase in latency

  • DynamoDB: Titan lessons learned

    Use Titan explicit partitioning on large graphs

    Partition across multiple graphs for time series data

    Able to achieve stable performance at scale
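Partitioning time series data across multiple graphs can be sketched as a naming scheme: each time bucket gets its own graph, so expiring old data means dropping a whole graph rather than deleting individual vertices. The weekly bucket size, the `inventory-graph` prefix, and both helper functions are assumptions made for this example; the talk does not say how the buckets were chosen.

```python
from datetime import date, timedelta

def graph_name_for(day, prefix="inventory-graph"):
    """Name of the graph holding events for the ISO week containing `day`."""
    iso_year, iso_week, _ = day.isocalendar()
    return "{}-{}-w{:02d}".format(prefix, iso_year, iso_week)

def graphs_to_drop(existing, today, keep_weeks=4, prefix="inventory-graph"):
    """Graphs older than the retention window, safe to drop as a unit."""
    keep = {
        graph_name_for(today - timedelta(weeks=i), prefix)
        for i in range(keep_weeks)
    }
    return sorted(name for name in existing if name not in keep)

today = date(2015, 10, 8)
existing = [
    "inventory-graph-2015-w41",
    "inventory-graph-2015-w40",
    "inventory-graph-2015-w30",
]
print(graph_name_for(today))
print(graphs_to_drop(existing, today, keep_weeks=2))
```

Writers pick the graph from the event's timestamp, and readers that need a longer history query the few graphs that cover their time range.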

  • How to get started

    GitHub Repository

    DynamoDB Local

    CloudFormation Template

    https://github.com/awslabs/dynamodb-titan-storage-backend

    https://github.com/awslabs/dynamodb-titan-storage-backend/blob/0.5.4/dynamodb-titan-storage-backend-cfn.json

  • Resources

    Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem

    http://www.amazon.com/Graph-Databases-Opportunities-Connected-Data-ebook/dp/B00ZGRS4VY

    Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson

    http://www.amazon.com/Seven-Databases-Weeks-Modern-Movement-ebook/dp/B00AYQNR50

    NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler

    http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence-ebook/dp/B0090J3SYW

    Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels

    http://www.allthingsdistributed.com/2015/08/titan-graphdb-integration-in-dynamodb.html

    Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr

    https://aws.amazon.com/blogs/aws/new-store-and-process-graph-data-using-the-dynamodb-storage-backend-for-titan/

    Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis

    https://medium.com/aws-activate-startup-blog/amazon-dynamodb-storage-backend-for-titan-distributed-graph-database-b9cc8cca80b7

    Amazon DynamoDB Storage Backend for Titan FAQ

    https://aws.amazon.com/dynamodb/faqs/#storagebackend

    Amazon DynamoDB Storage Backend for Titan Documentation

    http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.html

  • Thank you!

  • Remember to complete

    your evaluations!
