Technological White Paper
sones Graph Database
Technology
Contents
1. Introduction
1.1. Components of the sones graph database
1.2. Data model
2. USPs
2.1. Index-free adjacency
2.2. Handling semi-structured data
2.3. Dynamic type extension
2.4. Graph query language
2.5. Solving the object-relational depiction problem
2.6. HTTP/REST API
2.7. Traverser API
3. Technical case study
Bibliography
Glossary
1. Introduction
This white paper is intended for IT professionals needing more in-depth information on the sones
graph database technology.
Many problems encountered in every-day (IT) life (e.g., hyperlinks, navigation, who-knows-who)
can be depicted with graphs. A graph is a tuple consisting of the set V of nodes (vertices) and the
set E of edges. The latter depicts the relationship between elements in V.
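As a minimal sketch of this definition, the following Python fragment models a graph as a tuple of a node set V and an edge set E. The node names and the helper function neighbors are made up for illustration and are not part of the sones graph database.

```python
# Hypothetical example data: a tiny "who-knows-who" graph.
V = {"Alice", "Bob", "Carol"}                 # set of nodes (vertices)
E = {("Alice", "Bob"), ("Alice", "Carol")}    # set of directed edges between elements of V

graph = (V, E)


def neighbors(graph, node):
    """Return all nodes that `node` points to via one outgoing edge."""
    _, edges = graph
    return {target for source, target in edges if source == node}


print(sorted(neighbors(graph, "Alice")))
```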
A graph-oriented database uses this structure to present and administer information.
In the following, we will first explain the sones graph database technology in greater detail and
then draw a comparison to a relational database model. This will be followed by an illustration of
the technology's unique features followed by a technical case study.
1.1. Components of the sones graph database
The diagram below illustrates the components of the sones graph database. It comprises four
layers: the storage medium, the GraphFS, the GraphDB and the GraphDS.
Illustration 1 Components of the sones graph database
The bottom layer consists of an interface to a number of storage media. These include both local
file systems such as NTFS or Ext4 as well as storage service providers such as Microsoft Azure
or Amazon S3. In addition to using the persistent varieties mentioned above, it is possible to store
data in an in-memory-only structure. (Edlich, Friedland, Hampe, & Brauer, 2010)
Based on this first layer, we then have the GraphFS, which provides object management. This
includes, among other things, management of the following aspects:
Object namespace
Object identities
Object data streams
Object editions
Object revisions
Unlike traditional file systems, the GraphFS manages all of an object's information as metadata of
that object and controls the distribution of the data on the integrated storage media. This enables
a hybrid approach consisting of fast, local storage media and network storage providers. In
addition to storing objects, the GraphFS applies the MVCC principle when these objects are accessed. This
principle enables concurrent access without blocking. (Edlich, Friedland, Hampe, & Brauer, 2010)
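The MVCC idea can be sketched in a few lines of Python. This is a simplified, single-threaded illustration and not the GraphFS implementation: the class VersionedObject and its methods are hypothetical. Writers append a new revision instead of overwriting, so a reader that started earlier keeps seeing the revision it began with and never has to wait.

```python
import itertools


class VersionedObject:
    """Keeps every written revision instead of overwriting in place."""

    def __init__(self, value):
        self._counter = itertools.count(1)
        self._revisions = [(next(self._counter), value)]  # (revision id, value)

    def snapshot(self):
        """A reader remembers the latest revision id at the time it starts."""
        return self._revisions[-1][0]

    def read(self, snapshot_id):
        """Return the newest value that is not newer than the snapshot."""
        for revision_id, value in reversed(self._revisions):
            if revision_id <= snapshot_id:
                return value
        raise KeyError(snapshot_id)

    def write(self, value):
        """A writer appends a new revision; existing snapshots stay valid."""
        revision_id = next(self._counter)
        self._revisions.append((revision_id, value))
        return revision_id


obj = VersionedObject("first")
reader_snapshot = obj.snapshot()   # a reader starts here
obj.write("second")                # a writer commits afterwards
print(obj.read(reader_snapshot), obj.read(obj.snapshot()))
```

A real implementation additionally needs atomic commits and cleanup of revisions that no reader can see any more; both are omitted here.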
The purpose of the GraphDB is to manage property hypergraphs. This includes type, node,
index and plugin administration. GraphDB type administration (create, alter, drop) is able to
manage node types hierarchically in an ontology. Both the definition of abstract types and
the inheritance of attributes are possible. Node administration regulates the manipulation
(insert, update, delete, …) of nodes within the property hypergraph.
The GraphDB also manages indexes and the manipulation of these (insert, update, rebuilding,
reorganization). The GraphDB can be extended modularly in many areas (aggregates, functions,
settings, …). This is handled by an administration system that takes care of coordinating the
different components. The GraphDB's main function is to provide logic for presenting and
manipulating the property hypergraph. This includes implementing projection and selection as well
as manipulating nodes and dynamically extending the type schema. An integrated subgraph
matching engine is necessary to enable the selection of nodes. This engine generates subgraphs
using Boolean expressions. In addition to logic and administration, the graph query language can
also be used to make ad hoc queries to the database. (Edlich, Friedland, Hampe, & Brauer, 2010)
The GraphDS module combines the above-mentioned components into a whole. The GraphDS is
an interface for user applications and offers an entire spectrum of access options such as REST,
.Net, Java and WebDAV. (Edlich, Friedland, Hampe, & Brauer, 2010)
1.2. Data model
The previous section referred to the property hypergraph model. This section discusses it
in further detail.
The property hypergraph
The property hypergraph is an extension of the property graph data model, which was established
in the past few years. A property graph is a directed multi-relational graph. The nodes and edges
of this graph are comprised of objects and the semi-structured properties embedded therein.
These are key/value relationships whose keys and values can be specified by the relevant node
and edge type. In this case, an edge is a special case of a property value.
The extension to a hypergraph is based on the use of hypernodes, which carry
additional information in the context of the edges.
Illustration 2 Diagram of a property hypergraph
Illustration 2 gives an example of the structure of a property hypergraph. The user nodes Alice,
Bob and Carol are shown. These have been assigned the property "Age" in addition to a unique
ID. Alice has also been assigned the attribute "Friends," which is realized as a hypernode to a
number of other nodes of the "User" vertex type. Here you can see that it is possible to enter
information in the hypernode as well as in the nodes it contains.
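The structure described for Alice, Bob and Carol can be sketched as plain Python data classes. The classes Vertex and HyperNode, the concrete ages and the "Since" values are assumptions made for this illustration; they are not taken from the sones API or from the illustration itself.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vertex_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class HyperNode:
    """An edge property that points to several vertices at once."""
    properties: dict = field(default_factory=dict)   # information on the hypernode itself
    members: list = field(default_factory=list)      # (target vertex, per-member information)


# Ages and IDs are made up for the example.
alice = Vertex("1", {"Name": "Alice", "Age": 30})
bob = Vertex("2", {"Name": "Bob", "Age": 25})
carol = Vertex("3", {"Name": "Carol", "Age": 35})

friends = HyperNode(properties={"Label": "Friends"})
friends.members.append((bob, {"Since": 2008}))
friends.members.append((carol, {"Since": 2010}))

# The edge is stored as a property value of the source vertex.
alice.properties["Friends"] = friends

friend_names = [vertex.properties["Name"] for vertex, _ in alice.properties["Friends"].members]
print(friend_names)
```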
Node definition (vertex type)
The set of properties defined for a node is specified via the vertex type. Just like the node
instances in the sones graph database, the vertex type is an object in the GraphFS and therefore contains
metadata such as ID, name and position. In addition, the indexes that have been created and a reference to the
super vertex type are stored in the vertex type. The purpose of the latter is to realize the ontology,
which allows a vertex type to inherit properties. Another function of the vertex type is to define the
structured set of node properties. It is also possible to define constraints on these properties, such as
uniqueness or mandatory. The properties of the sones graph database can generally be divided into
two categories.
1. Basic properties…
are properties whose values are instances of basic data types such as string, integer or
Boolean. Collections (lists, sets) of basic values can also be entered as a property.
2. Edge properties…
are properties that connect a sones graph database node with a number of other nodes.
These realize the hypernodes of the property hypergraph described above.
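The role of the vertex type (inherited properties plus constraints) can be illustrated with a small Python sketch. The class VertexType, the type names Entity and User, and the validation messages are hypothetical; only the mandatory constraint and the basic data types are modelled, and uniqueness is left out because it would require access to all instances.

```python
class VertexType:
    """Hypothetical stand-in for a vertex type with a single super type."""

    def __init__(self, name, parent=None, properties=None, mandatory=()):
        self.name = name
        self.parent = parent                          # reference to the super vertex type
        self.own_properties = dict(properties or {})  # property name -> basic data type
        self.own_mandatory = set(mandatory)

    def all_properties(self):
        """Own properties plus everything inherited from the super type."""
        inherited = self.parent.all_properties() if self.parent else {}
        return {**inherited, **self.own_properties}

    def all_mandatory(self):
        inherited = self.parent.all_mandatory() if self.parent else set()
        return inherited | self.own_mandatory

    def validate(self, values):
        """Return a list of constraint violations for one node instance."""
        errors = []
        for name in sorted(self.all_mandatory()):
            if name not in values:
                errors.append(f"missing mandatory property: {name}")
        for name, expected in sorted(self.all_properties().items()):
            if name in values and not isinstance(values[name], expected):
                errors.append(f"wrong type for property: {name}")
        return errors


entity = VertexType("Entity", properties={"Name": str}, mandatory=["Name"])
user = VertexType("User", parent=entity, properties={"Age": int})

print(user.validate({"Name": "Alice", "Age": 30}))
print(user.validate({"Age": "thirty"}))
```

Properties that are not declared in the type are accepted without checks in this sketch, which mirrors the semi-structured character of the data model.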
Node instance (vertex)
Node instances act as a container for the properties they contain. They are comprised of multiple
streams, which are depicted and explained in brief in Illustration 3. Each of the entered streams
contains a dynamic number of editions and revisions. This enables both semantic (editions) as
well as temporal (revisions) information management. It is also possible to access individual
streams separately.
Illustration 3 Schematic diagram of a node instance
2. USPs
Now that the components of the sones graph database have been described and the basic data
model has been illustrated, this section of the white paper discusses its unique selling points (USPs) in
comparison with state-of-the-art relational databases.
2.1. Index-free adjacency
Graph databases like the sones GraphDB follow the paradigm of index-free adjacency. This
means that it is not necessary to manage a global index for relationships between nodes/entities.
The linked objects contain direct references to their neighboring nodes. There is no need
to search a relation table to locate relevant entries that may account for only a thousandth of a percent of its content.
This makes it possible to scale the hypergraph well, since there is no need to manage
extensive relation tables.
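The difference between the two approaches can be sketched as follows. The Python fragment is an illustration under simplifying assumptions, not sones code: the relation table is searched with a plain scan (a real RDBMS would typically use an index, which is exactly the global structure that index-free adjacency avoids), and the class Node and the sample data are made up.

```python
# Relational style: one global relation table that is searched for every hop.
relation_table = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]


def neighbors_via_table(node_id):
    # Work grows with the size of the whole table, not with the node's degree.
    return [target for source, target in relation_table if source == node_id]


# Index-free adjacency: every node holds direct references to its neighbors.
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.neighbors = []   # direct object references, no global lookup


nodes = {name: Node(name) for name in ("alice", "bob", "carol")}
for source, target in relation_table:
    nodes[source].neighbors.append(nodes[target])


def neighbors_via_references(node):
    # Work depends only on the number of neighbors of this one node.
    return [neighbor.node_id for neighbor in node.neighbors]


print(neighbors_via_table("alice"))
print(neighbors_via_references(nodes["alice"]))
```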
2.2. Handling semi-structured data
In the 70s and 80s, structured data administration was the prevailing paradigm. To this
day, this approach is implemented in almost all state-of-the-art RDBMSs. The
semi-structured data approach was established in the mid-90s. It is based on the observation that
the complex information characteristics of many application areas rarely fit a rigid table
structure. Examples of such problem domains are bioinformatics and the
Semantic Web. The sones graph database is able to store and retrieve unstructured properties in
any node of the graph. The idea is also to be able to convert unstructured data into structured data and vice
versa.
2.3. Dynamic type extension
Another advantage is that the structured data in nodes and edges can be extended dynamically
and with high performance at runtime. Additional properties can be added to or deleted from
vertex types in a short amount of time, regardless of the number of nodes. By contrast,
changing relational data schemas at a later time is very time- and resource-intensive.
2.4. Graph query language
The sones GraphQL is a user-friendly domain-specific language and can be thought of as an
"SQL for graphs." The similarity to SQL is intentional and makes the transition much easier for
developers/consultants. It enables queries to the sones graph database property hypergraph and
can be dynamically extended during runtime using plugins such as functions or aggregates. When
an SQL query on the RDBMS is as long as half a novel (see complex JOINs), the GraphQL
equivalent is usually much shorter and much more intuitive. Here is an example of this type of
query:
“FROM User SELECT Enemies.Enemies.Name WHERE Name = 'JohnSmith'”
This query starts at the user "JohnSmith" and returns the names of all of his enemies' enemies. The
same query on an RDBMS could look like this:
“SELECT u_end.Name
FROM User u_start
JOIN Enemies e1 ON u_start.user_id = e1.user_id
JOIN Enemies e2 ON e1.enemy_id = e2.user_id
JOIN User u_end ON e2.enemy_id = u_end.user_id
WHERE u_start.Name = 'JohnSmith'”
In addition to the GraphQL, other DSLs can also be used with the sones graph
database, since language and logic are kept separate from one another.
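What a path expression such as Enemies.Enemies.Name does conceptually can be sketched in Python. This is not how the GraphQL is implemented; the function follow, the sample users and the decision to drop duplicate results are assumptions made for the illustration.

```python
# Hypothetical sample data: each user has a Name and an optional Enemies edge.
users = {
    "JohnSmith": {"Name": "JohnSmith", "Enemies": ["Mallory", "Trent"]},
    "Mallory": {"Name": "Mallory", "Enemies": ["Eve"]},
    "Trent": {"Name": "Trent", "Enemies": ["Eve", "JohnSmith"]},
    "Eve": {"Name": "Eve"},
}


def follow(vertices, start_keys, edge_path, attribute):
    """Follow each edge name in `edge_path` in turn, then project `attribute`."""
    current = list(start_keys)
    for edge_name in edge_path:
        next_keys = []
        for key in current:
            for target in vertices[key].get(edge_name, []):
                if target not in next_keys:   # assumption: duplicates are dropped
                    next_keys.append(target)
        current = next_keys
    return [vertices[key][attribute] for key in current]


# Corresponds to: FROM User SELECT Enemies.Enemies.Name WHERE Name = 'JohnSmith'
start = [key for key, user in users.items() if user["Name"] == "JohnSmith"]
print(follow(users, start, ["Enemies", "Enemies"], "Name"))
```

Each additional hop is one more entry in the edge path, whereas the relational formulation needs one more JOIN per hop.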
2.5. Solving the object-relational depiction problem
Depicting object-oriented programming language objects in an RDBMS calls for what is known as
an object-relational mapper (see Illustration 4). This is due to the fact that, conceptually speaking,
the OOP and RDBMS paradigms are fundamentally different. Objects encapsulate their state
behind an interface and have a unique identity.
Illustration 4 Object-relational mapper
By contrast, RDBMSs are based on the mathematical concept of relational algebra. In the
90s, this mismatch became known as the object-relational depiction problem (also called the object-relational impedance mismatch).
Illustration 5 No O/R mapper
The sones graph database solves this dilemma by implementing an object-oriented concept (see
the highly simplified diagram in Illustration 5). This results in better integration into object-oriented
languages, since no O/R mapper is required.
2.6. HTTP/REST API
In addition to a number of interfaces (e.g., Java, C#, WebShell, WebDAV) the sones graph
database also offers a REST API. This enables uncomplicated interaction with state-of-the-art
web technologies. A REST query is all that is required to execute CRUD operations directly on the
database.
2.7. Traverser API
Another important feature of the sones graph database is the Traverser API. This feature makes it
possible to analyze data locally. Starting from a set of nodes, neighboring nodes can be
searched recursively (breadth first or depth first). This method can be used, for example, to realize local rankings,
recommendations or (path) searches. The result of a traversal can be a set of paths, a
set of nodes or an aggregate result.
Realizing this technology with an RDBMS is highly resource intensive, since each step to a
neighboring node has to be expressed as a JOIN. In contrast, the sones property hypergraph
concept allows direct access to neighboring nodes by resolving the edge attribute.
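A breadth-first traversal of this kind can be sketched in Python as follows. The function traverse_breadth_first, its parameters and the sample adjacency data are hypothetical and do not reproduce the signatures of the sones Traverser API.

```python
from collections import deque


def traverse_breadth_first(adjacency, start_nodes, max_depth):
    """Visit neighbors level by level, starting from a set of nodes.

    Returns a list of (node, depth) pairs in visiting order.
    """
    visited = set(start_nodes)
    queue = deque((node, 0) for node in start_nodes)
    result = []
    while queue:
        node, depth = queue.popleft()
        result.append((node, depth))
        if depth == max_depth:
            continue                      # do not expand beyond the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:   # each node is visited only once
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return result


# Hypothetical adjacency data: node -> directly referenced neighbors.
adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(traverse_breadth_first(adjacency, ["a"], max_depth=2))
```

Replacing the queue with a stack turns the same loop into a depth-first traversal.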
3. Technical case study
This section illustrates a technical case study on implementing a keyword recommendation
engine. The purpose of this engine is to generate relevant keywords based on a click-path
analysis. 200,000 paths were analyzed during the study. Together, these paths contained around
5,000 keywords. An individual path contained around 10-30 keywords.
Entering data
Before a keyword can be recommended, the data set needs to be loaded into the sones
graph database. This process is divided into two steps, which are explained below:
1. Generating vertex types
The first step is to define a schema for the data set. This includes creating the vertex
types Keyword and Path in steps a and b. The latter contains a hyperedge to the nodes
of the vertex type Keyword. In step c, what is known as a backward edge attribute is added to
the Keyword type generated in step a. This makes it possible to select and project the implicit
incoming edges of Keyword instances. The new property is named UsedInPaths and
specifies which incoming edges are used. As a result, each neighbor of the Keywords
hyperedge holds an explicit edge back to the corresponding path.
a. “CREATE VERTEX Keyword”
b. “CREATE VERTEX Path ATTRIBUTES (SET<Keyword> Keywords)”
c. “CHANGE VERTEX Keyword
ADD BACKWARDEDGES (Path.Keywords UsedInPaths)”
2. Generating nodes
Once the vertex types have been generated, the data set itself can be created.
a. Generating keywords
First, the keywords are loaded into the sones graph database. Only the unique
node IDs (UUID) are set during this process. Other properties are not necessary.
“INSERT INTO Keyword VALUES (UUID='keyWordID1')”
b. Generating paths
Once the keywords have been entered, the path nodes can be
generated. As with the keyword instances, the UUID is set. The Keywords hyperedge
is also filled by referring to the unique IDs of the keywords. As soon as this step has been
completed, it is possible to traverse from a path node to its adjacent keyword nodes.
Each keyword node holds the explicit incoming edge, which enables "backward" movement.
“INSERT INTO Path VALUES (UUID = 'Path1',
Keywords = SETOF(
UUID='keyWordID1',
UUID='keyWordID2',
UUID='keyWordIDn'))”
Generating recommendations
Once the data set for the recommendations has been created, the following query is used to
calculate the top 10 potentially interesting keywords.
“FROM Keyword
SELECT FindMatching(UsedInPaths)
WHERE UUID IN ['keyWordID1', 'keyWordID2', 'keyWordIDn']”
The query consists of three parts:
1. FROM Keyword
Selects the vertex type Keyword, which acts as the vertex type reference for the subsequent
projection and selection.
2. SELECT FindMatching(UsedInPaths)
The actual recommendation is generated in this part. This is done with the help of the
FindMatching aggregate function, which works on the UsedInPaths hyperedge of the
keywords selected in part 3. As mentioned above, this edge provides the path instances
that contain the selected keywords. A frequency analysis of the keywords in these path nodes is
conducted in order to make a recommendation. The order of the keyword nodes selected
in part 3 is also important, since their influence on the result decreases as the index
increases ('keyWordID1' is more important than 'keyWordID2').
3. WHERE UUID IN ['keyWordID1', 'keyWordID2', 'keyWordIDn']
The third and final part of the query provides the relevant keyword instances for the
projection in part 2.
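The frequency analysis described above can be sketched in Python. This is an illustration of the idea only and not the FindMatching plugin itself: the function find_matching, the sample paths and in particular the weighting 1 / (index + 1) are assumptions, since the white paper only states that the influence of a keyword decreases with its index.

```python
from collections import defaultdict


def find_matching(paths, selected, top_n=10):
    """Rank keywords that co-occur with the selected keywords in click paths."""
    # Backward edge (UsedInPaths): keyword -> paths that contain it.
    used_in_paths = defaultdict(list)
    for path_id, keywords in paths.items():
        for keyword in keywords:
            used_in_paths[keyword].append(path_id)

    scores = defaultdict(float)
    for index, keyword in enumerate(selected):
        weight = 1.0 / (index + 1)   # assumed weighting: earlier keywords count more
        for path_id in used_in_paths.get(keyword, []):
            for candidate in paths[path_id]:
                if candidate not in selected:
                    scores[candidate] += weight

    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [keyword for keyword, _ in ranked[:top_n]]


# Hypothetical click paths, each with its set of keywords.
paths = {
    "Path1": {"k1", "k2", "k3"},
    "Path2": {"k1", "k4"},
    "Path3": {"k2", "k4", "k5"},
}
print(find_matching(paths, ["k1", "k2"]))
```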
Benchmark
The measurements below were generated on the open source edition of the sones graph
database and the sones embedded C# API was used.
                          .NET                          Mono
Number of paths           10,000   100,000   200,000    10,000   100,000   200,000
Import duration (in sec)  220      1,840     2,940      140      1,560     3,300
Recommendations / sec     2,827    2,539     2,339      2,022    2,030     1,979
Table 1 Recommendation engine performance
Table 1 shows the performance measured on a DELL Latitude E6400 notebook (Core2Duo 2.60GHz,
4.00GB RAM) using the .NET framework and the Linux equivalent, Mono. The second line
("Number of paths") shows the number of paths that were analyzed as a basis for the engine.
The time needed to load that number of paths into the sones graph database is
indicated in the line below it. The last line provides the number of recommendations per
second.
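The import figures in Table 1 can also be read as an import rate (paths per second). The following Python fragment derives that rate from the table values; the function import_rate and the dictionary layout are made up for this calculation, and the rate itself is a derived figure that does not appear in the original measurements.

```python
# Import durations in seconds, copied from Table 1.
import_seconds = {
    ".NET": {10_000: 220, 100_000: 1_840, 200_000: 2_940},
    "Mono": {10_000: 140, 100_000: 1_560, 200_000: 3_300},
}


def import_rate(runtime, number_of_paths):
    """Return the average number of paths imported per second."""
    return number_of_paths / import_seconds[runtime][number_of_paths]


for runtime, durations in import_seconds.items():
    for number_of_paths in durations:
        rate = import_rate(runtime, number_of_paths)
        print(f"{runtime:5} {number_of_paths:>7} paths: {rate:.1f} paths/sec")
```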
Bibliography
Edlich, S., Friedland, A., Hampe, J., & Brauer, B. (2010). NoSQL: Einstieg in die Welt nichtrelationaler Web 2.0 Datenbanken (NoSQL: Introduction to the world of non-relational web 2.0 databases). Hanser Fachbuchverlag.
Glossary
Term Explanation
Backward edge Backward edges are incoming edges on nodes.
CRUD CRUD stands for create, read, update, delete, i.e., fundamental
database operations.
DSL DSL is an abbreviation for domain-specific language. A formal
language which is developed for a specific problem.
GraphQL Graph query language (GraphQL) is a query language developed by
sones that can be thought of as an "SQL for graphs."
GraphDB The GraphDB (graph database) is a component of the sones graph
database technology and handles administration of the property
hypergraphs.
GraphDS The GraphDS (graph data storage) combines the GraphDB and the
GraphFS into a whole. The GraphDS is an interface for user
applications and offers an entire spectrum of access options such as
REST, .Net, Java and WebDAV.
GraphFS The GraphFS (graph file system) is a component of the sones graph
database technology and provides abstract object management.
Hyperedge A hyperedge is able to connect a node to more than one node.
JOIN A JOIN combines a Cartesian product with a subsequent selection.
Edge An edge is a structural element in any graph. It specifies the
connection of nodes. A special case is what is referred to as a
hyperedge.
Edge type The set of properties defined for an edge is specified via the edge
type.
Nodes Nodes are elementary components of a graph and act as a container
for properties in the sones graph database.
Node type The set of properties defined for a node is specified via the node
type.
MVCC Multiversion concurrency control (MVCC) refers to a technology that
is used to avoid conflict between read-only and write access to the
same object. In the context of a database, MVCC makes
simultaneous access possible.
Ontology An ontology is a formally ordered set of concepts and their
relationships to one another.
OOP OOP stands for object-oriented programming and denotes a
programming paradigm.
Property graph A property graph is a directed multi-relational graph. The nodes and
edges of this graph are comprised of objects and the semi-structured
properties embedded therein. These are key/value relationships
whose keys and values can be specified by the relevant node and
edge schema. In this case, an edge is a special case of a property
value.
Property hypergraph The extension of a property graph to a hypergraph is based on the
use of hyperedges, which act as carriers of additional information in
the context of the edges.
RDBMS A relational database management system (RDBMS) manages a
relational database.
REST Representational state transfer (REST) refers to a software
architecture style for distributed hypermedia information systems such
as the World Wide Web. It suggests that each resource be addressed
with its own unique identifier.
SQL Structured query language (SQL) is a query language used to define,
access and manipulate data in relational databases.
Traverser A traverser enables local data analysis. Starting from a set of nodes,
neighboring nodes can be searched recursively
(breadth first or depth first). This method can be used, for example, to realize local rankings,
recommendations or (path) searches.
WebShell WebShell is a console depicted via the browser, which enables
interaction with the sones graph database.
sones GmbH
November 2010
sones GmbH
Eugen-Richter-Str. 44
99085 Erfurt
Germany
Tel.: +49(0) 361 - 30 26 25 0
Fax.: +49(0) 361 - 244 500 8
© 2010 sones GmbH. All rights reserved. sones and its logos are registered trademarks of sones GmbH. All other names of products and services are trademarks of the associated company. The information contained in this publication is non-binding and is intended for informational purposes only. Products may vary depending on the country. Information contained in this publication may be modified without prior notice. The information contained herein has been provided by sones and is intended for informational purposes only. sones does not assume any liability or guarantee for errors or inconsistencies in this publication. No further liability may ensue from information contained in this publication.