Not All Graph Databases are Created Equally


A webinar with Atanas Kiryakov, CEO and Founder of Ontotext

September 30th, 2014

Today’s Topics

• An overview of graph databases

• Triplestores: advantages and design choices

• Reasoning: best practices and pitfalls

• Essential features of triplestores

– owl:sameAs optimization

– Full-Text Search and NoSQL Connectors

• Enterprise resilience and scalability

• Text mining pipeline and triplestores

• Success stories


About Ontotext

• Information management company providing text analysis, data management and state-of-the-art semantic technology

• 70 employees, headquartered in Sofia, Bulgaria

• Sales presence in London, Washington, DC, and Boston

• Clients include BBC, AstraZeneca, US DoD, Wiley & Sons, Getty

• Over 400 person-years in R&D to create a one-stop shop for:

– Content enrichment

– Data management

– Graph database engine

• Open and standards-compliant technology:

– RDF(S), OWL, GATE, Sesame


Some of our clients

[Figure: client logos, one captioned "The most popular financial newspaper"]


How are RDF Databases Different?

• Standards compliance

– Unlike most NoSQL and proprietary graph databases

– Based on a mature set of W3C standards: RDF, RDFS, OWL, SPARQL

• Schema agility, easy querying of diverse data

– Unlike SQL databases

– RDF facilitates dealing with multiple schemata and schema evolution

• Allow for complex queries

– Unlike typical NoSQL databases

– SPARQL allows for comprehensive structured queries, similar to SQL

– Queries that are not possible in SQL (e.g. unknown relation types)

• Linked Data ready

– RDF is the standard for linked data publication


Visual Representation - Graph Database


What It Takes to Make It Work?

• Have a database of locations, with part-of info

• Have a database with companies, with dependencies

• Define semantics for the relevant relationships:

– sub-region and control are transitive relationships

– located-in is transitive over sub-region

• Define the semantics of suspicious relationships:

CONSTRUCT { ?orgA my:suspiciousLink ?orgB }
WHERE {
    ?orgA ptop:locatedIn ?x ; fibo:controls ?y .
    ?y fibo:controls ?orgB ; ptop:locatedIn ?z .
    ?orgB ptop:locatedIn ?x .
    ?z a ptop:OffshoreZone .
}
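To make the suspicious-link example concrete, here is a minimal Python sketch that runs the same pattern over a handful of in-memory triples. It is an illustration only, not how GraphDB evaluates the query. All `ex:` entities and the property name `ptop:subRegionOf` are hypothetical; the slide names only `ptop:locatedIn`, `fibo:controls`, `ptop:OffshoreZone` and `my:suspiciousLink`. The closure step applies the semantics stated above (sub-region and control are transitive; located-in is transitive over sub-region) before the pattern is matched.

```python
from itertools import product

LOCATED_IN = "ptop:locatedIn"
CONTROLS = "fibo:controls"
SUB_REGION = "ptop:subRegionOf"  # hypothetical name for the "sub-region" relation
TYPE = "rdf:type"
OFFSHORE = "ptop:OffshoreZone"

# Hypothetical sample data: every ex: entity is made up for illustration.
explicit = {
    ("ex:AcmeLtd", LOCATED_IN, "ex:London"),
    ("ex:London", SUB_REGION, "ex:England"),
    ("ex:England", SUB_REGION, "ex:UK"),
    ("ex:BetaLtd", LOCATED_IN, "ex:UK"),
    ("ex:ShellCo", LOCATED_IN, "ex:Island"),
    ("ex:Island", TYPE, OFFSHORE),
    ("ex:AcmeLtd", CONTROLS, "ex:HoldCo"),
    ("ex:HoldCo", CONTROLS, "ex:ShellCo"),
    ("ex:ShellCo", CONTROLS, "ex:BetaLtd"),
}


def closure(triples):
    """Apply the transitivity rules until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for (s1, p1, o1), (s2, p2, o2) in product(facts, repeat=2):
            if o1 != s2:
                continue
            if p1 == p2 and p1 in (SUB_REGION, CONTROLS):
                new.add((s1, p1, o2))           # transitive property
            elif p1 == LOCATED_IN and p2 == SUB_REGION:
                new.add((s1, LOCATED_IN, o2))   # located-in over sub-region
        if new <= facts:
            return facts
        facts |= new


def suspicious_links(facts):
    """Match the CONSTRUCT pattern: A controls Y controls B, Y is offshore,
    and A and B share a location."""
    located = {(s, o) for s, p, o in facts if p == LOCATED_IN}
    controls = {(s, o) for s, p, o in facts if p == CONTROLS}
    offshore = {s for s, p, o in facts if p == TYPE and o == OFFSHORE}
    links = set()
    for org_a, y in controls:
        for y2, org_b in controls:
            if y2 != y:
                continue
            y_offshore = any(s == y and z in offshore for s, z in located)
            shared = any((org_a, x) in located for s, x in located if s == org_b)
            if y_offshore and shared:
                links.add((org_a, "my:suspiciousLink", org_b))
    return links


links = suspicious_links(closure(explicit))
print(links)
```

In this toy data the link is only found because of inference: AcmeLtd's control of ShellCo and its location in the UK are both derived rather than asserted.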


Sample RDF Graph: Data and Schema

[Figure: RDF graph with instance nodes myData:Maria and myData:Ivan, class nodes ptop:Agent, ptop:Person and ptop:Woman, and properties ptop:childOf, ptop:parentOf and owl:relativeOf. Edge labels: rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf, owl:SymmetricProperty. Some edges are marked "inferred".]

Data Representation: RDBMS vs. RDF

Relational Tables

Person

ID  Name      Gender
1   Maria P.  F
2   Ivan Jr.  M
3   …

Parent

ParID  ChiID
1      2
…

Spouse

S1ID  S2ID  From  To
1     3
…

RDF Representation

Statement

Subject     Predicate   Object
myo:Person  rdf:type    rdfs:Class
myo:gender  rdf:type    rdf:Property
myo:parent  rdfs:range  myo:Person
myo:spouse  rdfs:range  myo:Person
myd:Maria   rdf:type    myo:Person
myd:Maria   rdfs:label  “Maria P.”
myd:Maria   myo:gender  “F”
myd:Ivan    rdfs:label  “Ivan Jr.”
myd:Ivan    myo:gender  “M”
myd:Maria   myo:parent  myd:Ivan
myd:Maria   myo:spouse  myd:John
…
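The mapping from rows to statements can be sketched in a few lines of Python. This is a rough illustration rather than a real mapping tool: the way a row ID is turned into a `myd:` identifier is an assumption (the slide does not say how the names were minted), literals are kept as plain strings, and the label property is written as `rdfs:label`, the standard RDFS property for labels.

```python
# Rows copied from the Person and Parent tables; the second column is a
# hypothetical local name used to mint the myd: identifier for each row.
PERSON = [
    (1, "Maria", "Maria P.", "F"),
    (2, "Ivan", "Ivan Jr.", "M"),
]
PARENT = [(1, 2)]  # (ParID, ChiID)


def to_triples(persons, parents):
    """Turn each cell into one (subject, predicate, object) statement."""
    iri = {pid: f"myd:{local}" for pid, local, _, _ in persons}
    triples = []
    for pid, _, name, gender in persons:
        triples.append((iri[pid], "rdf:type", "myo:Person"))
        triples.append((iri[pid], "rdfs:label", name))
        triples.append((iri[pid], "myo:gender", gender))
    for par_id, chi_id in parents:
        # A foreign-key pair becomes a direct edge between two nodes.
        triples.append((iri[par_id], "myo:parent", iri[chi_id]))
    return triples


triples = to_triples(PERSON, PARENT)
for t in triples:
    print(*t)
```

Each table row fans out into one statement per column, and the join table disappears: a relationship is just another statement.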


What is RDF Good for?

• Metadata-based content management

– Metadata represents a re-usable result of content analytics

– It can be repurposed, allowing for a wide range of applications

– Most search engines do analytics, but the results are not explicit, so they cannot be validated, refined and used by other applications

• Linking text and structured data

– Allows structured, uniform and efficient access to diverse domain models, taxonomies, dictionaries, reference databases

• Reference data management

– E.g. product catalogs and taxonomies that are too structured to be managed with NoSQL, but too diverse and interconnected for SQL

• Using linked open data (LOD)

– A growing amount of diverse public data can be used in enterprise Knowledge Management applications


Interlinking Text and Data


Why is Inference Important?

• Intelligent mapping of queries to data

– Rather than query 10+ different sources, an application can resolve queries using inferred facts. These facts can evolve independently.

• Cheaper data integration, lower costs

– Data integration costs are reduced

– Developers don’t need to maintain multiple schemas

• Finding patterns and inferring new relationships

– Users can use inferred facts to discover patterns, connections and relationships that they previously did not know existed

• Database depth, accurate complete results

– With tens of billions of triples, users get complete, accurate results

• Faster query evaluation

– One can look at materialization as a specific type of indexing


Lightweight Inference - Simple Rules


The database will return ‘Ivan’ as a result of a query for

Maria relativeOf ?x

when the fact asserted was

Ivan childOf Maria

This type of “intelligence” can be achieved in many ways, but graph databases offer the cleanest approach, delivering the best efficiency and lowest cost through the entire data lifecycle.

[Figure: the sample RDF graph with myData:Maria, myData:Ivan and the ptop: classes and properties, with inferred statements marked]

Forward-Chaining and Materialization

<C1, rdfs:subClassOf, C2>, <I, rdf:type, C1>  =>  <I, rdf:type, C2>

<P1, owl:inverseOf, P2>, <I1, P1, I2>  =>  <I2, P2, I1>

<P1, rdf:type, owl:SymmetricProperty>  =>  <P1, owl:inverseOf, P1>

<P1, rdfs:range, C1>, <I1, P1, I2>  =>  <I2, rdf:type, C1>

The rule entailment language used by GraphDB is a simplification of Datalog, which has been used in DBMSs since the 1980s.
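The four rules on this slide can be tried out with a naive fixpoint loop in Python. This is a teaching sketch, not GraphDB's rule engine or rule syntax. Two things are assumptions: the schema triples (the exact edges of the sample graph are not fully recoverable from the slide, and the property is written here as `ptop:relativeOf`), and the extra `rdfs:subPropertyOf` rule, which is the standard RDFS sub-property rule and is added because the figure shows an rdfs:subPropertyOf edge.

```python
def forward_chain(explicit):
    """Naive materialization: apply every rule until nothing new is inferred."""
    facts = set(explicit)
    while True:
        new = set()
        for s, p, o in facts:
            # <P1 rdf:type owl:SymmetricProperty> => <P1 owl:inverseOf P1>
            if p == "rdf:type" and o == "owl:SymmetricProperty":
                new.add((s, "owl:inverseOf", s))
            for s2, p2, o2 in facts:
                # <C1 rdfs:subClassOf C2>, <I rdf:type C1> => <I rdf:type C2>
                if p == "rdfs:subClassOf" and p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))
                # <P1 owl:inverseOf P2>, <I1 P1 I2> => <I2 P2 I1>
                elif p == "owl:inverseOf" and p2 == s:
                    new.add((o2, o, s2))
                # <P1 rdfs:range C1>, <I1 P1 I2> => <I2 rdf:type C1>
                elif p == "rdfs:range" and p2 == s:
                    new.add((o2, "rdf:type", o))
                # Standard RDFS rule, not listed on the slide:
                # <P1 rdfs:subPropertyOf P2>, <I1 P1 I2> => <I1 P2 I2>
                elif p == "rdfs:subPropertyOf" and p2 == s:
                    new.add((s2, o, o2))
        if new <= facts:
            return facts
        facts |= new


# Assumed schema and data, loosely following the sample graph.
explicit = {
    ("ptop:Woman", "rdfs:subClassOf", "ptop:Person"),
    ("ptop:Person", "rdfs:subClassOf", "ptop:Agent"),
    ("ptop:childOf", "owl:inverseOf", "ptop:parentOf"),
    ("ptop:childOf", "rdfs:subPropertyOf", "ptop:relativeOf"),
    ("ptop:relativeOf", "rdf:type", "owl:SymmetricProperty"),
    ("ptop:parentOf", "rdfs:range", "ptop:Person"),
    ("myData:Maria", "rdf:type", "ptop:Woman"),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
}

inferred_closure = forward_chain(explicit)
for triple in sorted(inferred_closure - explicit):
    print(*triple)
```

Once the closure is stored, the query `Maria relativeOf ?x` is a plain index lookup, which is the sense in which materialization behaves like an index. A production engine would use indexed, incremental rule evaluation rather than this quadratic loop.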


Reasoning Strategies

Two principal strategies for rule-based inference:

• Forward-chaining: to start from the known facts (the explicit statements), and to perform inference in an inductive fashion. Typically, the goal is to compute the inferred closure

• Backward-chaining: to start from a particular fact or a query, and to verify it or get all possible results. In a nutshell, the reasoner decomposes (or transforms) the query, or the fact, into simpler, or alternative, facts, which are available in the KB, or can be proven through further recursive transformations

Inferred closure: extension of an RDF graph with all implicit facts (triples) that can be inferred from it

Materialization: keep an up-to-date inferred closure
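For contrast with materialization, backward-chaining can be sketched as a recursive check that rewrites a goal triple into alternative goals until it reaches an explicit statement. The Python below is a simplified illustration under stated assumptions: it only verifies fully bound triples, it omits the rdfs:range rule (which would need an unbound variable), and the sample schema is assumed rather than taken from the slides.

```python
def holds(goal, facts, seen=frozenset()):
    """Return True if the fully bound triple `goal` is explicit or provable."""
    if goal in facts:
        return True
    if goal in seen:  # this goal is already being proven further up: stop
        return False
    seen = seen | {goal}
    s, p, o = goal
    for s2, p2, o2 in facts:
        # <s p o> holds if <o q s> holds and q owl:inverseOf p
        if p2 == "owl:inverseOf" and o2 == p and holds((o, s2, s), facts, seen):
            return True
        # <s p o> holds if <s q o> holds and q rdfs:subPropertyOf p
        if p2 == "rdfs:subPropertyOf" and o2 == p and holds((s, s2, o), facts, seen):
            return True
        # <s p o> holds if p is symmetric and <o p s> holds
        if (s2 == p and p2 == "rdf:type" and o2 == "owl:SymmetricProperty"
                and holds((o, p, s), facts, seen)):
            return True
        # <s rdf:type C2> holds if <s rdf:type C1> holds and C1 rdfs:subClassOf C2
        if (p == "rdf:type" and p2 == "rdfs:subClassOf" and o2 == o
                and holds((s, "rdf:type", s2), facts, seen)):
            return True
    return False


# Assumed schema and data; nothing here is materialized.
facts = {
    ("ptop:Woman", "rdfs:subClassOf", "ptop:Person"),
    ("ptop:Person", "rdfs:subClassOf", "ptop:Agent"),
    ("ptop:childOf", "owl:inverseOf", "ptop:parentOf"),
    ("ptop:childOf", "rdfs:subPropertyOf", "ptop:relativeOf"),
    ("ptop:relativeOf", "rdf:type", "owl:SymmetricProperty"),
    ("myData:Maria", "rdf:type", "ptop:Woman"),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
}

print(holds(("myData:Maria", "ptop:relativeOf", "myData:Ivan"), facts))
print(holds(("myData:Ivan", "rdf:type", "ptop:Woman"), facts))
```

Nothing is stored beyond the explicit statements, but every query pays for the proof search, and the work depends on the rules, not just on the data. That is the trade-off the forward-chaining approach avoids.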


Inference - Retraction

• Inferences materialized at load time

– Fast query answering, as no inference is done during query time

– Alternative approaches (backward-chaining) harm query optimization

– Query optimization requires statistics about the "selectivity" of the constraints in the query, in order to reorder them for optimal execution

• Retracting statements using a custom algorithm

– Does not require re-computation of the full closure

– Forward chaining to find potentially affected inferences

– Backward chaining to test which inferences are still supported

– No truth maintenance; pending patent application

– Fast (same order of magnitude as inserting)

• Result: allows massive query loads along with high update (insert and delete) rates
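The idea of retracting without recomputing the full closure can be illustrated with a textbook delete-and-rederive scheme. This is explicitly a stand-in and not Ontotext's algorithm: the "still supported" test is done here by re-running forward chaining over the untouched statements instead of backward chaining, and only two rules are implemented to keep the sketch short. The class `ex:Mother` and the sample data are hypothetical.

```python
def derivations(facts):
    """Yield (conclusion, premises) for every rule instance that fires."""
    for s, p, o in facts:
        for s2, p2, o2 in facts:
            if p == "rdfs:subClassOf" and p2 == "rdf:type" and o2 == s:
                yield (s2, "rdf:type", o), ((s, p, o), (s2, p2, o2))
            elif p == "owl:inverseOf" and p2 == s:
                yield (o2, o, s2), ((s, p, o), (s2, p2, o2))


def closure(explicit):
    """Forward-chain to a fixpoint."""
    facts = set(explicit)
    while True:
        new = {c for c, _ in derivations(facts)} - facts
        if not new:
            return facts
        facts |= new


def retract(explicit, materialized, removed):
    """Remove one explicit statement and repair the materialized closure."""
    # 1. Forward pass: collect every inference that may depend on `removed`.
    suspect = {removed}
    changed = True
    while changed:
        changed = False
        for conclusion, premises in derivations(materialized):
            if conclusion not in suspect and any(pr in suspect for pr in premises):
                suspect.add(conclusion)
                changed = True
    # 2. Everything outside `suspect` is untouched and is kept as-is.
    remaining_explicit = set(explicit) - {removed}
    kept = (set(materialized) - suspect) | remaining_explicit
    # 3. Re-derive only the suspects that still have support.
    return remaining_explicit, closure(kept)


explicit = {
    ("ptop:Woman", "rdfs:subClassOf", "ptop:Person"),
    ("ex:Mother", "rdfs:subClassOf", "ptop:Person"),  # hypothetical class
    ("ptop:Person", "rdfs:subClassOf", "ptop:Agent"),
    ("myData:Maria", "rdf:type", "ptop:Woman"),
    ("myData:Maria", "rdf:type", "ex:Mother"),
}
materialized = closure(explicit)

gone = ("myData:Maria", "rdf:type", "ptop:Woman")
new_explicit, new_closure = retract(explicit, materialized, gone)
print(sorted(new_closure - new_explicit))
```

After the retraction, `Maria rdf:type ptop:Person` survives because it is still supported through `ex:Mother`; only inferences with no remaining support are dropped.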


Essential Features

• Geo-spatial indexing

• Ranking graph nodes

• Optimized handling of owl:sameAs

• Integration with FTS and NoSQL engines


The Honey and the Sting of owl:sameAs

[Figure: graph of the entities E11, E12, E21, E22 and E23, shown on two consecutive slides with the same title]


GraphDB Connectors

• Adapters and configuration interfaces that connect GraphDB to external stores/engines

• Based on GraphDB’s Plug-in and Notification APIs

• Integrate IR engines and NoSQL databases

• IR engines

– Full-text search, faceted search, real-time synchronization, property paths

– Lucene, Solr, Elasticsearch (end of June)

• NoSQL databases

– Access Big Data from SPARQL


Integration with FTS Engines

• Limited query expressivity

• Extreme performance and scalability

[Figure: architecture diagram with the labels SPARQL queries, IR queries, Query Processor, Graph indexes, Internal indexes, Replication, IR engine and GraphDB database]


Enterprise Resilience

GraphDB Enterprise has two design goals:

• Improved resilience

– failover, dynamic configuration

• Improved query bandwidth

– larger cluster means more queries per unit time


Replication Cluster

• Two types of nodes - flexible topologies

• Resilient to failure of workers and masters

[Figure: cluster diagram with GraphDB-Enterprise worker nodes (Worker 1, Worker 2, Worker 3). The master dispatches queries and updates to workers (read/write); the hot-standby master dispatches queries to workers (read only).]


High Availability Cluster


Integration with Text Mining Pipelines


Technology Portfolio


Ontotext and BBC


Profile

• Mass media broadcaster founded in 1922

• 23,000 employees and over 5 billion pounds in annual revenue

Goals

• Create a dynamic semantic publishing platform that assembles web pages on-the-fly using a variety of data sources

• Deliver highly relevant data to web site visitors with sub-second response

Challenges

• BBC journalists author and publish content which is then statically rendered. The costs and time to do this were high.

• Diverse content was difficult to navigate, content re-use was not flexible

• User experience needed to be improved with relevant content

"The goal is to be able to more easily and accurately aggregate content, find it and share it across many sources. From these simple relationships and building blocks you can dynamically build up incredibly rich sites and navigation on any platform."

John O’Donovan, Chief Technical Architect

Ontotext and AstraZeneca


Profile

• Global biopharma company

• $28 billion in sales in 2012

• $4 billion in R&D across three continents

Goals

• Efficient design of new clinical studies

• Quick access to all of the data

• Improved evidence-based decision-making

• Strengthen the knowledge feedback loop

• Enable predictive science

Challenges

• Over 7,000 studies and 23,000 documents are difficult to obtain

• Searches returning 1,000 – 10,000 results

• Document repositories not designed for reuse

• Tedious process to arrive at evidence-based decisions

Context-based Disambiguation


Ontotext and LMI


Profile

• Established in 1961 to enable federal agencies

• Specializes in logistics, financial, infrastructure & information management

Goals

• Unlock large collections of complex documents

• Improve analyst productivity

• Create an application they can sell to US Federal agencies

Challenges

• Analysts taking hours to find, download and search documents, using inaccurate keyword searches

• Needed a knowledge base to search quickly and guide the analysts – highly relevant searches

• Extracts knowledge from collection of documents

• Uses GraphDB to intuitively search and filter

• Knowledge base used to suggest searches

• Hyper speed performance

• Huge savings in analyst time

• Accurate results

Ontotext and Euromoney


Profile

• Euromoney Institutional Investor PLC, the international online information and events group

Goals

• Create a horizontal platform to serve 100 different publications

• Create a new publishing and information platform which would include the latest authoring, storing, and display technologies, including semantic annotation, search and a triple store repository

Challenges

• Different domains covered

• Sophisticated content analytics incl. relation, template and scenario extraction

• Analytics of reports and news of various domains

• Extraction of sophisticated macroeconomic views on markets and market conditions; trades, condition and trade horizons, assets, asset allocations, etc.

• Multi-faceted search

• Completely new content and data infrastructure

GraphDB Essentials

• Enterprise grade (resilience, scale, management)

• Geo-spatial, ranking and full-text search

• Scales to tens of billions of RDF statements

• Expressive inference (from RDFS to OWL2-RL)

• SPARQL 1.1 (query, update, federation, graph store)

• Pure Java implementation (portable)

• Sesame openRDF framework (Jena also in GraphDB-SE)


Additional Resources: Ontotext.com


Thank you!


A link to the recording and responses to any of your unanswered questions will be sent out shortly.
