Date posted: 18-Dec-2014
Category: Technology
Uploaded by: chris-clarke
Using MongoDB as a Graph Database
Chris Clarke
NoSQL Birmingham
16th October 2014
Graphs 101
For the uninitiated
[Diagram: John and Jane joined by a single "knows" edge]
John knows Jane
Jane knows John
[Diagram: John and Jane joined by a single "knows" edge]
John knows Jane
Jane ? John
[Diagram: John and Jane joined by two "knows" edges]
John knows Jane
Jane knows John
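The difference between these pictures is whether the "knows" edge can be followed in both directions. A minimal sketch, assuming each edge is stored as a directed [from, label, to] triple; the helper name `hasEdge` is made up for illustration and does not come from the talk:

```javascript
// Each edge is directed: [from, label, to].
const oneWay = [["John", "knows", "Jane"]];
const bothWays = [
  ["John", "knows", "Jane"],
  ["Jane", "knows", "John"],
];

// True if an edge with this label runs from `a` to `b`.
function hasEdge(edges, a, label, b) {
  return edges.some(([from, l, to]) => from === a && l === label && to === b);
}

console.log(hasEdge(oneWay, "John", "knows", "Jane"));   // true
console.log(hasEdge(oneWay, "Jane", "knows", "John"));   // false: the "Jane ? John" case
console.log(hasEdge(bothWays, "Jane", "knows", "John")); // true
```

With directed edges, "Jane knows John" has to be stated as its own edge; it is never implied by "John knows Jane".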
RDF
Entity Property Value
John knows Jane

Subject Predicate Object
John knows Jane
Jane knows John
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
Subject Predicate Object
http://example.com/John foaf:knows http://example.com/Jane
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
Subject Predicate Object
http://example.com/John foaf:knows http://example.com/Jane
http://example.com/John foaf:name "John"
http://example.com/John rdf:type foaf:Person
http://example.com/Jane foaf:name "Jane"
http://example.com/Jane rdf:type foaf:Person
http://example.com/Jane foaf:knows http://example.com/John
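[Diagram: the same triples drawn as a graph. example:John and example:Jane each have an rdf:type edge to foaf:Person and a foaf:name edge to "John" and "Jane" respectively, plus a foaf:knows edge to each other.]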
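To make the subject/predicate/object model concrete, here is a minimal sketch that holds the six example triples in memory and looks them up by pattern. The `match` helper and its null-as-wildcard convention are assumptions for illustration, not part of RDF or of the talk:

```javascript
// Triples for the John/Jane example, using the prefixed names from the slides.
const triples = [
  ["example:John", "foaf:knows", "example:Jane"],
  ["example:John", "foaf:name", "John"],
  ["example:John", "rdf:type", "foaf:Person"],
  ["example:Jane", "foaf:name", "Jane"],
  ["example:Jane", "rdf:type", "foaf:Person"],
  ["example:Jane", "foaf:knows", "example:John"],
];

// Return every triple matching the pattern; null acts as a wildcard.
function match(graph, s, p, o) {
  return graph.filter(
    ([ts, tp, to]) =>
      (s === null || ts === s) &&
      (p === null || tp === p) &&
      (o === null || to === o)
  );
}

// Everything stated about John.
const aboutJohn = match(triples, "example:John", null, null);

// Every subject typed as foaf:Person.
const people = match(triples, null, "rdf:type", "foaf:Person").map(([s]) => s);

console.log(aboutJohn.length, people);
```

Every question about the graph reduces to one or more such patterns, which is roughly what a query language over triples has to evaluate.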
“WTF! Surely this is easier in JSON!”
– Jack Fullstack
> db.people.find()
{ _id: ObjectId('123'), name: 'John', knows: [ObjectId('456')] },
{ _id: ObjectId('456'), name: 'Jane', knows: [ObjectId('123')] }
[Diagram: merging two datasets that describe the same node]
Dataset A: example:John foaf:name "John"
Dataset B: example:John foaf:age 24
Dataset A+B: example:John foaf:name "John"; example:John foaf:age 24
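Because both datasets use the same identifier for John, merging them is just a union of their triples. A minimal sketch, assuming triples are plain [subject, predicate, object] arrays; the `mergeGraphs` name is hypothetical:

```javascript
// Dataset A and Dataset B each say one thing about example:John.
const datasetA = [["example:John", "foaf:name", "John"]];
const datasetB = [["example:John", "foaf:age", 24]];

// Merge graphs as a set union of triples, dropping exact duplicates.
function mergeGraphs(...graphs) {
  const seen = new Set();
  const merged = [];
  for (const triple of graphs.flat()) {
    const key = JSON.stringify(triple);
    if (!seen.has(key)) {
      seen.add(key);
      merged.push(triple);
    }
  }
  return merged;
}

const merged = mergeGraphs(datasetA, datasetB);
console.log(merged);
```

No schema mapping is needed: the shared subject URI is what joins the two statements onto one node.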
SPARQL
An RDF Query Language
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
ORDER BY ?name
LIMIT 50
Query form: Result
CONSTRUCT: Graph
DESCRIBE: Graph
SELECT: Tabular
ASK: Boolean
Graphs and Talis
A bit of history
Over time…
• Our apps became popular. Last week they averaged 4M requests per day, with 600k+ per hour at peak times
• Our dataset is growing in size: about 350M triples this week
• Our apps needed more queries, and more expensive queries
• Our in-house triple store was EoL and out of date
Project Tripod
http://github.com/talis/tripod-php
http://github.com/talis/tripod-node
System characteristics
• 99:1 read:write
• Well shared, tenant based system. Our largest single customer has 35M triples
• Graph data structures and operations (merges, sub-graphs etc.) well entrenched in the codebase, over 2M lines of code (inc. libraries)
• Actually not that many distinct query shapes
Simple Queries, and how they influenced our core data model
DESCRIBE <http://example.com/John>
Give me all the triples about John as a graph

SELECT ?name ?age WHERE {
  <http://example.com/John> <foaf:name> ?name .
  <http://example.com/John> <foaf:age> ?age .
}
Give me properties name, age of John as tabular data
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
Subject Predicate Object

Concise Bound Description of http://example.com/John
http://example.com/John foaf:knows http://example.com/Jane
http://example.com/John foaf:name "John"
http://example.com/John rdf:type foaf:Person

Concise Bound Description of http://example.com/Jane
http://example.com/Jane foaf:name "Jane"
http://example.com/Jane rdf:type foaf:Person
http://example.com/Jane foaf:knows http://example.com/John
{
  _id: "example:John",
  "foaf:knows": { u: "example:Jane" },
  "rdf:type": { u: "foaf:Person" },
  "foaf:name": { l: "John" }
}
_id is the unique primary key. There can only be one John.
u means the value is a URI, or another node.
l means the value is a literal text value.
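The step from a list of triples to this one-document-per-subject shape can be sketched as below. This is an illustration under assumptions, not Tripod's actual code: the `toCbd` name is made up, and storing a repeated predicate as an array of values is a guess about how multi-valued properties might be handled.

```javascript
// Triples as [subject, predicate, object]; an object is {u: ...} for a URI
// or {l: ...} for a literal, mirroring the document shape on the slide.
const triples = [
  ["example:John", "foaf:knows", { u: "example:Jane" }],
  ["example:John", "rdf:type", { u: "foaf:Person" }],
  ["example:John", "foaf:name", { l: "John" }],
  ["example:Jane", "foaf:name", { l: "Jane" }],
];

// Collect every triple about `subject` into one document keyed by predicate.
// Assumption: a predicate with several values becomes an array of values.
function toCbd(graph, subject) {
  const doc = { _id: subject };
  for (const [s, p, o] of graph) {
    if (s !== subject) continue;
    if (!(p in doc)) doc[p] = o;
    else doc[p] = [].concat(doc[p], o);
  }
  return doc;
}

const johnDoc = toCbd(triples, "example:John");
console.log(JSON.stringify(johnDoc, null, 2));
```

Using the subject as `_id` is what lets the simple queries later in the deck become single primary-key lookups.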
DESCRIBE <http://example.com/John>
mongo$ col.findOne({_id: "example:John"});

SELECT ?name ?age WHERE {
  <http://example.com/John> <foaf:name> ?name .
  <http://example.com/John> <foaf:age> ?age .
}
mongo$ col.findOne({_id: "example:John"}, {"foaf:name.l": 1, "foaf:age.l": 1});
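What the application does with the fetched document can be sketched in plain JavaScript, with no database involved. The `describe` and `select` helpers are hypothetical names, and `select` assumes each requested predicate holds a single literal value:

```javascript
// The CBD document for John, as returned by the findOne above.
const john = {
  _id: "example:John",
  "foaf:knows": { u: "example:Jane" },
  "rdf:type": { u: "foaf:Person" },
  "foaf:name": { l: "John" },
};

// DESCRIBE: turn the document back into [subject, predicate, object] triples.
function describe(doc) {
  const triples = [];
  for (const [predicate, value] of Object.entries(doc)) {
    if (predicate === "_id") continue;
    for (const v of [].concat(value)) {
      triples.push([doc._id, predicate, "u" in v ? v.u : v.l]);
    }
  }
  return triples;
}

// SELECT: pick the literal value of each requested predicate as one row.
// A predicate the document lacks is left unbound (null).
function select(doc, predicates) {
  const row = {};
  for (const p of predicates) {
    row[p] = doc[p] !== undefined ? doc[p].l : null;
  }
  return row;
}

console.log(describe(john));
console.log(select(john, ["foaf:name", "foaf:age"]));
```

The example document has no foaf:age, so that column comes back unbound, much as an unmatched OPTIONAL variable would in SPARQL.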
{ s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
{ s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
{ s: "example:John", p: "foaf:name", o: { l: "John" } },

DESCRIBE <http://example.com/John>
mongo$ var s = col.find({s: "example:John"});
mongo$ while (s.hasNext()) { addToGraph(s.next()) }

SELECT ?name ?age WHERE {
  <http://example.com/John> <foaf:name> ?name .
  <http://example.com/John> <foaf:age> ?age .
}
mongo$ col.find({s: "example:John", p: "foaf:name"}, {"o": 1});
mongo$ col.find({s: "example:John", p: "foaf:age"}, {"o": 1});
DESCRIBE ?person WHERE { ?person <foaf:name> "John" . }
mongo$ var s = col.find({p: "foaf:name", "o.l": "John"}); // BasicCursor = slow
{ _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
DESCRIBE ?person WHERE { ?person <foaf:name> "John" . }
mongo$ col.ensureIndex({"foaf:name.l": 1});
mongo$ var s = col.find({"foaf:name.l": "John"}); // BTreeCursor = fast
Complex Queries
DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList ?author ?usedBy ?creator ?libraryNote ?publisher
WHERE {
  OPTIONAL {
    <http://example.com/foo> resource:contains ?sectionOrItem .
    OPTIONAL {
      ?sectionOrItem resource:resource ?resource .
      OPTIONAL { ?resource dcterms:isPartOf ?document . }
      OPTIONAL {
        ?resource bibo:authorList ?authorList .
        OPTIONAL { ?authorList ?p ?author . }
      }
      OPTIONAL { ?resource dcterms:publisher ?publisher . }
    }
    OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
  } .
  OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
  OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
}
“We don’t need dynamic queries”
– Project Tripod Team, sometime 2012
Precomputed views
Remember those from the RDBMS?
{ _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }
{ _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } }
DESCRIBE example:John ?knownPerson WHERE { example:John foaf:knows ?knownPerson . }
mongo$ var john = col.findOne({_id: "example:John"});
mongo$ var knows = [].concat(john["foaf:knows"]); // one value or an array of values
mongo$ for (var i = 0; i < knows.length; i++) {
         // one extra round trip per known person
         var knownPerson = col.findOne({_id: knows[i].u});
       }
System characteristics
• 99:1 read:write
• Well shared, tenant based system. Our largest single customer has 35M triples
• Graph data structures and operations (merges, sub-graphs etc.) well entrenched in the codebase, over 2M lines of code (inc. libraries).
• Actually not that many distinct query shapes.
DESCRIBE example:John ?knownPerson WHERE { example:John foaf:knows ?knownPerson . }
mongo$ viewsCol.findOne({_id: {r: "example:John", t: "v_knows"}})
{
  _id: { r: "example:John", t: "v_knows" },
  graphs: [
    { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } },
    { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } }
  ],
  _impactIndex: ["example:Jane", "example:John"]
}
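The slides do not spell out what `_impactIndex` is for. A plausible reading is that it lists every resource a view was built from, so that a change to any of them identifies the views to regenerate. A minimal sketch under that assumption; the second view and `example:Mary` are invented for contrast, and `impactedViews` is a hypothetical name:

```javascript
// Two stored views; each lists the resources it was built from.
const views = [
  { _id: { r: "example:John", t: "v_knows" }, _impactIndex: ["example:Jane", "example:John"] },
  { _id: { r: "example:Mary", t: "v_knows" }, _impactIndex: ["example:Mary"] }, // invented
];

// Return the ids of the views that include `resource` and so may be stale
// once that resource changes.
function impactedViews(allViews, resource) {
  return allViews
    .filter((view) => view._impactIndex.includes(resource))
    .map((view) => view._id);
}

console.log(impactedViews(views, "example:Jane"));
```

In MongoDB, an equality match on an array field matches documents whose array contains the value, so a query along the lines of `viewsCol.find({_impactIndex: "example:Jane"})` would be the stored equivalent of this filter.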
View specification
{
  "_id": "v_knows",
  "type": ["foaf:Person"],
  "from": "CBD_people",
  "joins": { "foaf:knows": {} }
}
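How a specification like this might be turned into a stored view can be sketched as follows. This is not Tripod's implementation: `follow` and `buildView` are hypothetical names, a Map stands in for the CBD_people collection, and the sketch ignores the `type` filter and options such as `followSequence` and `maxJoins`.

```javascript
// In-memory stand-in for the CBD_people collection, keyed by _id.
const cbdPeople = new Map([
  ["example:John", { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "foaf:name": { l: "John" } }],
  ["example:Jane", { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "foaf:name": { l: "Jane" } }],
]);

const spec = { _id: "v_knows", type: ["foaf:Person"], from: "CBD_people", joins: { "foaf:knows": {} } };

// Follow each join predicate from `doc`, collecting the joined documents.
// `seen` stops a resource being added twice (John knows Jane knows John).
function follow(doc, joins, collection, graphs, seen) {
  for (const [predicate, next] of Object.entries(joins || {})) {
    for (const value of [].concat(doc[predicate] || [])) {
      const target = collection.get(value.u);
      if (!target || seen.has(target._id)) continue;
      seen.add(target._id);
      graphs.push(target);
      follow(target, next.joins, collection, graphs, seen);
    }
  }
}

// Build the view document for one root resource.
function buildView(viewSpec, rootId, collection) {
  const root = collection.get(rootId);
  const graphs = [root];
  const seen = new Set([rootId]);
  follow(root, viewSpec.joins, collection, graphs, seen);
  return {
    _id: { r: rootId, t: viewSpec._id },
    graphs,
    _impactIndex: graphs.map((g) => g._id),
  };
}

const view = buildView(spec, "example:John", cbdPeople);
console.log(JSON.stringify(view, null, 2));
```

The joins are paid for once, at write time, so reading the view back is a single lookup by `_id`.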
More complex example
{
  "_id": "v_resources",
  "type": ["resourcelist:Resource"],
  "from": "CBD_resources",
  "joins": {
    "dct:partOf": {
      "joins": {
        "bibo:authorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "bibo:editorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "dct:publisher": {}
      }
    },
    "dct:isPartOf": {
      "joins": {
        "bibo:authorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "bibo:editorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "dct:publisher": {}
      }
    },
    "bibo:authorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
    "bibo:editorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
    "dct:publisher": {}
  }
}
What about tabular data?
• We also have tables and table specs
• Conceptually the same as views
• Instead of an array of graphs we have computed columns for complex tabular queries
• You can page, limit, offset results just like you’d expect
{
  "_id" : {
    "r" : "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4-41DB-AF132854770F",
    "type" : "t_user_resources"
  },
  "value" : {
    "_impactIndex" : [
      {
        "r" : "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4-41DB-AF132854770F",
        "c" : "tenantContexts:DefaultGraph"
      },
      {
        "r" : "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE",
        "c" : "tenantContexts:DefaultGraph"
      }
    ],
    "collection" : "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks",
    "createdDate" : "2011-02-08T15:59:45+00:00",
    "resourceUri" : "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE",
    "note" : "ELECTRONIC",
    "title" : "Feminism & psychology",
    "type" : [
      "resourcelist:Resource",
      "bibo:Journal"
    ]
  }
}
Database layout
talis-rs:PRIMARY> show collections
CBD_config
CBD_draft
CBD_events
CBD_jobs
CBD_lists
CBD_nodes
CBD_resources
CBD_reviews
CBD_service
CBD_user_lists
CBD_user_resources
CBD_users
table_rows
views
[Diagram annotations: "r/w" and "read only", each bracketing a group of the collections above]
Fast and slow saves, you decide.
Tripod save()
• Based on change sets, you supply the old and new graphs
• CBDs updated immediately. Write ahead transaction log for multi-CBD writes
• Choice per save on whether to update views/tables sync or async (eventually consistent)
• Async adds jobs to a Mongo based queue
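The change-set idea can be sketched as a diff between the old and new graphs. This is an illustration only, not Tripod's save() implementation: the `changeSet` name, the plain-array triple format and the returned field names are all assumptions.

```javascript
// Old and new versions of a graph as [subject, predicate, object] triples.
const oldGraph = [
  ["example:John", "foaf:name", "John"],
  ["example:John", "foaf:age", 24],
];
const newGraph = [
  ["example:John", "foaf:name", "John"],
  ["example:John", "foaf:age", 25],
];

// Work out which triples to remove and which to add to get from old to new.
function changeSet(before, after) {
  const key = (t) => JSON.stringify(t);
  const beforeKeys = new Set(before.map(key));
  const afterKeys = new Set(after.map(key));
  const removals = before.filter((t) => !afterKeys.has(key(t)));
  const additions = after.filter((t) => !beforeKeys.has(key(t)));
  // Subjects whose documents, and anything derived from them, need updating.
  const subjects = [...new Set([...removals, ...additions].map(([s]) => s))];
  return { removals, additions, subjects };
}

const cs = changeSet(oldGraph, newGraph);
console.log(cs);
```

The list of changed subjects is the natural input for the next step: update those CBDs immediately, then regenerate dependent views and tables either inline or by queueing a job.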
Measure everything
[Chart: Query volume, complex vs. simple]
[Chart: Query volume, graph vs. tabular]
[Chart: Query speed, complex vs. simple graph query]
Hardware
• Real tin, 2x Dell low-end rack mount servers
• 96GB RAM, 24 cores
• RAID-10 disks, non-SSD
• Keep ‘em on the same LAN as your app servers
• About the same to lease per month as a couple of c3.4xlarge (30GB, 32vCPU)
• We’re about to add a similar second cluster, 144GB
Why Mongo? RTFM, not HN comment feeds.
But seriously it could have been n other document DBs
There’s lots more
Search, named graphs (quads), data functions
Future roadmap
• Multi-cluster <- IN PROGRESS
• NodeJS port <- IN PROGRESS
• Choose better solution for tlog, probably PostgreSQL
• Background queue -> redis and resque
• Chainable API
• Spout of updates for Apache Storm
• Versioned views/tables config
Aperture
Annotate your models to persist to graph
tripod-php code…
…same in aperture
@talis
facebook.com/talisgroup
+44 (0) 121 374 2740
48 Frederick Street
Birmingham
B1 3HN