Not All Graph Databases are Created Equally
A webinar with Atanas Kiryakov, CEO and Founder of Ontotext
September 30th, 2014
Today’s Topics
• An overview of graph databases
• Triplestores: advantages and design choices
• Reasoning: best practices and pitfalls
• Essential features of triplestores
  – owl:sameAs optimization
  – Full-text search and NoSQL connectors
• Enterprise resilience and scalability
• Text mining pipeline and triplestores
• Success stories
About Ontotext
• Information management company providing text analysis, data management and state-of-the-art semantic technology
• 70 employees, headquartered in Sofia, Bulgaria
• Sales presence in London, Washington, DC, and Boston
• Clients include BBC, AstraZeneca, US DoD, Wiley & Sons, Getty
• Over 400 person-years in R&D to create a one-stop shop for:
  – Content enrichment
  – Data management
  – Graph database engine
• Open and standard compliant technology:
  – RDF(S), OWL, GATE, Sesame
Some of our clients
The most popular financial newspaper
How are RDF Databases Different?
• Standard compliance
  – Unlike most of the NoSQL and proprietary graph databases
  – Based on a mature set of W3C standards: RDF, RDFS, OWL, SPARQL
• Schema agility, easy querying of diverse data
  – Unlike SQL databases
  – RDF facilitates dealing with multiple schemata and schema evolution
• Allow for complex queries
  – Unlike the typical NoSQL databases
  – SPARQL allows for comprehensive structured queries, similar to SQL
  – Queries that are not possible in SQL (e.g. unknown relation types)
• Linked Data ready
  – RDF is the standard for linked data publication
Visual Representation - Graph Database
What It Takes to Make It Work?
• Have a database of locations, with part-of info
• Have a database with companies, with dependencies
• Define semantics for the relevant relationships:
  – sub-region and control are transitive relationships
  – located-in is transitive over sub-region
• Define the semantics of suspicious relationships:
  CONSTRUCT { ?orgA my:suspiciousLink ?orgB } WHERE {
    ?orgA ptop:locatedIn ?x ; fibo:controls ?y .
    ?y fibo:controls ?orgB ; ptop:locatedIn ?z .
    ?orgB ptop:locatedIn ?x .
    ?z a ptop:OffshoreZone .
  }
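The rules and the query pattern on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the organisation and place names are hypothetical sample data, and the relations are modelled as in-memory sets of pairs rather than as triples in a real triplestore.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (a, b) pairs."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

# Hypothetical sample data (all names are illustrative only).
sub_region = {("Sofia", "Bulgaria"), ("Bulgaria", "Europe")}
located_in = {("AcmeBG", "Sofia"), ("ShellCo", "IslandX"), ("BetaEU", "Bulgaria")}
controls = {("AcmeBG", "ShellCo"), ("ShellCo", "BetaEU")}
offshore_zones = {"IslandX"}

# sub-region and control are transitive relationships.
sub_region_tc = transitive_closure(sub_region)
controls_tc = transitive_closure(controls)

# located-in is transitive over sub-region:
# org located-in X and X sub-region-of Y  =>  org located-in Y.
located_tc = set(located_in) | {
    (org, outer)
    for (org, inner) in located_in
    for (region, outer) in sub_region_tc
    if inner == region
}

def suspicious_links():
    """Mirror the CONSTRUCT pattern: orgA controls y, y controls orgB,
    y sits in an offshore zone, and orgA and orgB share a location."""
    links = set()
    for (org_a, y) in controls_tc:
        y_offshore = any(org == y and z in offshore_zones for (org, z) in located_tc)
        if not y_offshore:
            continue
        places_a = {x for (org, x) in located_tc if org == org_a}
        for (y2, org_b) in controls_tc:
            if y2 != y:
                continue
            places_b = {x for (org, x) in located_tc if org == org_b}
            if places_a & places_b:
                links.add((org_a, org_b))
    return links

print(suspicious_links())
```

In a triplestore with inference, the closure steps would be handled by the reasoner, so the query itself stays as short as the CONSTRUCT pattern on the slide.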
Sample RDF Graph: Data and Schema
[Figure: RDF graph with data and schema. Labels in the figure: the instances myData:Maria and myData:Ivan; the classes ptop:Agent, ptop:Person and ptop:Woman; the properties ptop:childOf, ptop:parentOf and owl:relativeOf; edges labelled rdf:type, rdfs:range, rdfs:subPropertyOf and owl:inverseOf; owl:SymmetricProperty; some statements are marked as inferred.]
Data Representation: RDBMS vs. RDF
Relational Tables

Person
ID  Name      Gender
1   Maria P.  F
2   Ivan Jr.  M
3   …

Parent
ParID  ChiID
1      2
…

Spouse
S1ID  S2ID  From  To
1     3
…

RDF Representation

Statement
Subject     Predicate   Object
myo:Person  rdf:type    rdfs:Class
myo:gender  rdf:type    rdf:Property
myo:parent  rdfs:range  myo:Person
myo:spouse  rdfs:range  myo:Person
myd:Maria   rdf:type    myo:Person
myd:Maria   rdfs:label  “Maria P.”
myd:Maria   myo:gender  “F”
myd:Ivan    rdfs:label  “Ivan Jr.”
myd:Ivan    myo:gender  “M”
myd:Maria   myo:parent  myd:Ivan
myd:Maria   myo:spouse  myd:John
…
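The mapping from the relational rows to RDF statements can be sketched as follows. This is a minimal illustration, not a real conversion tool: the `IRI_BY_ID` lookup from row IDs to `myd:` names is an assumption (the slide does not show how identifiers are minted), and labels use `rdfs:label`, the standard RDFS labelling property.

```python
# Rows copied from the Person and Parent tables on the slide.
PERSON_ROWS = [(1, "Maria P.", "F"), (2, "Ivan Jr.", "M")]
PARENT_ROWS = [(1, 2)]  # (ParID, ChiID)

# Assumed mapping from primary keys to resource names.
IRI_BY_ID = {1: "myd:Maria", 2: "myd:Ivan"}

def rows_to_triples(person_rows, parent_rows, iri_by_id):
    """Turn each table cell into one (subject, predicate, object) triple."""
    triples = []
    for person_id, name, gender in person_rows:
        subject = iri_by_id[person_id]
        triples.append((subject, "rdf:type", "myo:Person"))
        triples.append((subject, "rdfs:label", name))
        triples.append((subject, "myo:gender", gender))
    for parent_id, child_id in parent_rows:
        # A foreign-key pair becomes a single edge between two resources.
        triples.append((iri_by_id[parent_id], "myo:parent", iri_by_id[child_id]))
    return triples

triples = rows_to_triples(PERSON_ROWS, PARENT_ROWS, IRI_BY_ID)
for triple in triples:
    print(triple)
```

Every table collapses into the same three-column shape, which is why adding a new attribute or relationship needs no schema migration on the RDF side.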
What is RDF Good for?
• Metadata-based content management
  – Metadata represents a re-usable result of content analytics
  – It can be repurposed, allowing for a wide range of applications
  – Most search engines do analytics, but the results are not explicit, so they cannot be validated, refined and used by other applications
• Linking text and structured data
  – Allows structured, uniform and efficient access to diverse domain models, taxonomies, dictionaries, reference databases
• Reference data management
  – E.g. product catalogs and taxonomies that are too structured to be managed with NoSQL, but too diverse and interconnected for SQL
• Using linked open data (LOD)
  – A growing amount of diverse public data can be used in enterprise Knowledge Management applications
Interlinking Text and Data
Why is Inference Important?
• Intelligent mapping of queries to data
  – Rather than query 10+ different sources, an application can resolve queries using inferred facts. These facts can evolve independently.
• Cheaper data integration, lower costs
  – Data integration costs are reduced
  – Developers don’t need to maintain multiple schemas
• Finding patterns and inferring new relationships
  – Users can use inferred facts to discover patterns, connections and relationships that they previously did not know existed
• Database depth, accurate complete results
  – With tens of billions of triples, users get complete, accurate results
• Faster query evaluation
  – One can look at materialization as a specific type of indexing
Lightweight Inference - Simple Rules
The database will return ‘Ivan’ as a result of the query
    Maria relativeOf ?x
when the asserted fact was
    Ivan childOf Maria
This type of “intelligence” can be achieved in many ways, but graph databases offer the cleanest approach, delivering the best efficiency and lowest cost through the entire data lifecycle.
[Figure: the sample RDF graph from the slide "Sample RDF Graph: Data and Schema", repeated here]
Forward-Chaining and Materialization
<C1,rdfs:subClassOf,C2>
<C2,rdfs:subClassOf,C3>
=> <C1,rdfs:subClassOf,C3>

<I,rdf:type,C1>
<C1,rdfs:subClassOf,C2>
=> <I,rdf:type,C2>

<P1,owl:inverseOf,P2>
<I1,P1,I2>
=> <I2,P2,I1>

<P1,rdf:type,owl:SymmetricProperty>
=> <P1,owl:inverseOf,P1>

<P1,rdfs:range,C1>
<I1,P1,I2>
=> <I2,rdf:type,C1>

The rule entailment language used by GraphDB is a simplification of Datalog, used in DBMS since the 1980s.
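The five rules above can be applied in a forward-chaining loop until no new triples appear. The sketch below is a naive in-memory version written for illustration, not GraphDB's rule engine. The sample statements are assumptions loosely based on the sample graph (in particular, the range of ptop:parentOf is made up for the example).

```python
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"
INVERSE, SYMMETRIC, RANGE = "owl:inverseOf", "owl:SymmetricProperty", "rdfs:range"

def apply_rules(facts):
    """Return every triple derivable from `facts` by one pass of the five rules."""
    derived = set()
    for (s, p, o) in facts:
        if p == SUBCLASS:
            for (s2, p2, o2) in facts:
                if p2 == SUBCLASS and s2 == o:   # subclass transitivity
                    derived.add((s, SUBCLASS, o2))
                if p2 == TYPE and o2 == s:       # type propagation
                    derived.add((s2, TYPE, o))
        elif p == INVERSE:
            for (s2, p2, o2) in facts:
                if p2 == s:                      # <I1,P1,I2> => <I2,P2,I1>
                    derived.add((o2, o, s2))
        elif p == TYPE and o == SYMMETRIC:       # symmetric => own inverse
            derived.add((s, INVERSE, s))
        elif p == RANGE:
            for (s2, p2, o2) in facts:
                if p2 == s:                      # object gets the range class
                    derived.add((o2, TYPE, o))
    return derived

def materialize(explicit):
    """Compute the inferred closure by repeating the rules to a fixpoint."""
    closure = set(explicit)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new

# Assumed sample statements, loosely following the sample graph.
explicit = {
    ("ptop:Woman", SUBCLASS, "ptop:Person"),
    ("ptop:Person", SUBCLASS, "ptop:Agent"),
    ("myData:Maria", TYPE, "ptop:Woman"),
    ("ptop:childOf", INVERSE, "ptop:parentOf"),
    ("ptop:parentOf", RANGE, "ptop:Person"),   # hypothetical range
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
}

closure = materialize(explicit)
for triple in sorted(closure - explicit):
    print("inferred:", triple)
```

A production engine would avoid re-scanning all facts on every pass, but the fixpoint idea is the same: the stored closure is what makes query-time reasoning unnecessary.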
[Figure: the sample RDF graph from the slide "Sample RDF Graph: Data and Schema", repeated here]
Reasoning Strategies
Two principal strategies for rule-based inference:
• Forward-chaining: to start from the known facts (the explicit statements), and to perform inference in an inductive fashion. Typically, the goal is to compute the inferred closure
• Backward-chaining: to start from a particular fact or a query, and to verify it or get all possible results. In a nutshell, the reasoner decomposes (or transforms) the query, or the fact, into simpler, or alternative, facts, which are available in the KB, or can be proven through further recursive transformations
Inferred closure: the extension of an RDF graph with all implicit facts (triples) that can be inferred from it
Materialization: keeping an up-to-date inferred closure
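To contrast with materialization, backward-chaining can be sketched as a goal-directed search: to prove that an instance has a type, the goal is decomposed into "the instance has some direct subclass of that type". The example below is a minimal illustration restricted to the rdf:type and rdfs:subClassOf rules; the knowledge base contents are assumed sample data.

```python
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

def prove_type(kb, instance, cls, seen=None):
    """Backward-chain: is <instance, rdf:type, cls> entailed by `kb`?

    The goal is either an explicit fact, or it is rewritten into the
    simpler goal <instance, rdf:type, sub> for each direct subclass.
    `seen` guards against cycles in the subclass hierarchy.
    """
    seen = set() if seen is None else seen
    if cls in seen:
        return False
    seen.add(cls)
    if (instance, TYPE, cls) in kb:
        return True
    return any(
        prove_type(kb, instance, sub, seen)
        for (sub, p, sup) in kb
        if p == SUBCLASS and sup == cls
    )

# Assumed sample knowledge base: only explicit statements are stored.
kb = {
    ("ptop:Woman", SUBCLASS, "ptop:Person"),
    ("ptop:Person", SUBCLASS, "ptop:Agent"),
    ("myData:Maria", TYPE, "ptop:Woman"),
}

print(prove_type(kb, "myData:Maria", "ptop:Agent"))
print(prove_type(kb, "myData:Ivan", "ptop:Agent"))
```

Nothing is stored beyond the explicit statements, so updates are cheap, but every query pays for the recursive search.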
Inference - Retraction
• Inferences materialized at load time
  – Fast query answering, as no inference is done during query time
  – Alternative approaches (backward-chaining) harm query optimization
    • Query optimization requires statistics about the “selectivity” of the constraints in the query, in order to reorder them for optimal execution
• Retracting statements using a custom algorithm
  – Does not require re-computation of the full closure
  – Forward chaining to find potentially affected inferences
  – Backward chaining to test which inferences are still supported
  – No truth maintenance; pending patent application
  – Fast (same order of magnitude as inserting)
• Result: supports massive query loads along with high update (insert and delete) rates
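The general idea of retracting a statement without recomputing the whole closure can be illustrated with the textbook delete-and-rederive approach. This is explicitly not Ontotext's algorithm, whose details are not given here. It is a swapped-in, simplified technique that uses a single transitivity rule over pairs and hypothetical data: it over-deletes everything that might depend on the removed statement, then re-derives what is still supported.

```python
def derive(facts):
    """One pass of a single transitivity rule: (a, b), (b, c) => (a, c)."""
    return {(a, d) for (a, b) in facts for (c, d) in facts if b == c}

def closure_of(facts):
    """Forward-chain the rule to a fixpoint."""
    closure = set(facts)
    while True:
        new = derive(closure) - closure
        if not new:
            return closure
        closure |= new

def retract(explicit, closure, removed):
    """Remove one explicit pair and repair the materialized closure."""
    explicit = explicit - {removed}
    # Step 1 (forward): collect every inference that may depend on `removed`.
    affected = {removed}
    frontier = {removed}
    while frontier:
        candidates = set()
        for (a, b) in frontier:
            for (c, d) in closure:
                if b == c:
                    candidates.add((a, d))   # affected pair as left premise
                if d == a:
                    candidates.add((c, b))   # affected pair as right premise
        frontier = candidates - affected
        affected |= frontier
    # Step 2: keep untouched and explicit statements, re-derive the rest.
    kept = (closure - affected) | explicit
    return explicit, closure_of(kept)

# Hypothetical part-of style data.
explicit = {("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")}
closure = closure_of(explicit)

new_explicit, new_closure = retract(explicit, closure, ("b", "c"))
print(sorted(new_closure))
```

Only statements reachable from the removed one are revisited, which is why the cost can stay close to that of an insert when the affected region is small.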
Essential Features
• Geo-spatial indexing
• Ranking graph nodes
• Optimized handling of owl:sameAs
• Integration with FTS and NoSQL engines
The Honey and the Sting of owl:sameAs
[Figure: diagram with the nodes E11, E12, E21, E22 and E23]
GraphDB Connectors
• Adapters and configuration interfaces that connect GraphDB to external stores/engines
• Based on GraphDB’s Plug-in and Notification APIs
• Integrate IR engines and NoSQL databases
• IR engines
  – Full-text search, faceted search, real-time synchronization, property paths
  – Lucene, Solr, Elasticsearch (end of June)
• NoSQL databases
  – Access Big Data from SPARQL
Integration with FTS Engines
• Limited query expressivity
• Extreme performance and scalability
[Figure: architecture diagram with the labels: SPARQL queries, IR queries, Query Processor, Graph indexes, Internal indexes, Replication, IR engine, GraphDB database]
Enterprise Resilience
GraphDB Enterprise has two design goals:
• Improved resilience
  – failover, dynamic configuration
• Improved query bandwidth
  – larger cluster means more queries per unit time
Replication Cluster
• Two types of nodes - flexible topologies
• Resilient to failure of workers and masters
[Figure: cluster diagram. A master dispatches queries and updates to the GraphDB-Enterprise worker nodes (Worker 1, Worker 2, Worker 3) in read/write mode. A hot-standby master dispatches queries only to the workers (read only).]
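The division of labour between masters and workers can be sketched as a toy dispatcher. Everything below is an assumption made for illustration: the class names, the round-robin choice of worker and the "send every update to all healthy workers" policy are not taken from GraphDB Enterprise, and re-synchronising a recovered worker is not modelled.

```python
class Worker:
    """A toy worker node holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.statements = set()

    def update(self, statement):
        self.statements.add(statement)

    def query(self, statement):
        return statement in self.statements


class Master:
    """Dispatches requests to workers; a read-only master models the hot standby."""

    def __init__(self, workers, read_only=False):
        self.workers = workers
        self.read_only = read_only
        self._next = 0

    def _healthy(self):
        return [w for w in self.workers if w.alive]

    def update(self, statement):
        if self.read_only:
            raise PermissionError("hot-standby master accepts queries only")
        for worker in self._healthy():
            worker.update(statement)   # every replica receives the update

    def query(self, statement):
        healthy = self._healthy()
        if not healthy:
            raise RuntimeError("no worker available")
        worker = healthy[self._next % len(healthy)]   # round-robin over replicas
        self._next += 1
        return worker.name, worker.query(statement)


workers = [Worker("Worker 1"), Worker("Worker 2"), Worker("Worker 3")]
master = Master(workers)
standby = Master(workers, read_only=True)

master.update("Ivan childOf Maria")
workers[0].alive = False               # simulate a worker failure
print(master.query("Ivan childOf Maria"))
print(standby.query("Ivan childOf Maria"))
```

Because each worker holds the full data set, adding workers raises the number of queries served per unit time, and losing one only shrinks the pool.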
High Availability Cluster
Integration with Text Mining Pipelines
Technology Portfolio
Ontotext and BBC
Profile
• Mass media broadcaster founded in 1922
• 23,000 employees and over 5 billion pounds in annual revenue
Goals
• Create a dynamic semantic publishing platform that assembles web pages on-the-fly using a variety of data sources
• Deliver highly relevant data to web site visitors with sub-second response
Challenges
• BBC journalists author and publish content which is then statically rendered. The costs and time to do this were high.
• Diverse content was difficult to navigate, content re-use was not flexible
• User experience needed to be improved with relevant content
"The goal is to be able to more easily and accurately aggregate content, find it and share it across many sources. From these simple relationships and building blocks you can dynamically build up incredibly rich sites and navigation on any platform."
John O’Donovan, Chief Technical Architect
Ontotext and AstraZeneca
Profile
• Global bio-pharma company
• $28 billion in sales in 2012
• $4 billion in R&D across three continents
Goals
• Efficient design of new clinical studies
• Quick access to all of the data
• Improved evidence-based decision-making
• Strengthen the knowledge feedback loop
• Enable predictive science
Challenges
• Over 7,000 studies and 23,000 documents are difficult to obtain
• Searches returning 1,000 – 10,000 results
• Document repositories not designed for reuse
• Tedious process to arrive at evidence-based decisions
Context-based Disambiguation
Ontotext and LMI
Profile
• Established in 1961 to enable federal agencies
• Specializes in logistics, financial, infrastructure & information management
Goals
• Unlock large collections of complex documents
• Improve analyst productivity
• Create an application they can sell to US Federal agencies
Challenges
• Analysts taking hours to find, download and search documents, using inaccurate keyword searches
• Needed a knowledge base to search quickly and guide the analysts – highly relevant searches

• Extracts knowledge from collection of documents
• Uses GraphDB to intuitively search and filter
• Knowledge base used to suggest searches
• Hyper speed performance
• Huge savings in analyst time
• Accurate results
Ontotext and Euromoney
Profile
• Euromoney Institutional Investor PLC, the international online information and events group
Goals
• Create a horizontal platform to serve 100 different publications
• Create a new publishing and information platform which would include the latest authoring, storing, and display technologies, including semantic annotation, search and a triple store repository
Challenges
• Different domains covered
• Sophisticated content analytics incl. relation, template and scenario extraction

• Analytics of reports and news of various domains
• Extraction of sophisticated macroeconomic views on markets and market conditions; trades, condition and trade horizons, assets, asset allocations, etc.
• Multi-faceted search
• Completely new content and data infrastructure
GraphDB Essentials
• Enterprise grade (resilience, scale, management)
• Geo-spatial, ranking and full-text search
• Scales to tens of billions of RDF statements
• Expressive inference (from RDFS to OWL2-RL)
• SPARQL 1.1 (query, update, federation, graph store)
• Pure Java implementation (portable)
• Sesame openRDF framework (Jena also in GraphDB-SE)
Additional Resources: Ontotext.com
Thank you!
A link to the recording and responses to any unanswered questions will be sent out shortly.