Post on 01-Nov-2014
LOD2 Webinar . 29.11.2011 . Page 1 http://lod2.eu
Creating Knowledge out of Interlinked Data
LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. The partners, from 12 countries, are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.
LOD2 will integrate and syndicate Linked Data with existing large-scale
applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.
Once a month, the LOD2 webinar series offers a free webinar about tools and services along the Linked Open Data Life Cycle.
Stay with us and learn more about acquisition, editing, composing, connected applications, and finally publishing Linked Open Data.
© 2012 OpenLink Software, All rights reserved.
Virtuoso 7.0 Enabling Massively Scalable Big Data Analytics
for RDF & SQL Data Management
By Orri Erling, Virtuoso Program Manager & Hugh Williams, Professional Services Manager
Making Technology Work For You
Company Overview
OpenLink Company Overview
• OpenLink Software is a privately held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry-acclaimed technology innovator in the following areas:
§ ODBC, JDBC, ADO.NET, and OLE-DB compliant Data Access Drivers for Oracle, SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
§ High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology
§ Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)
§ Web Application Server Technology
§ Linked Data Deployment & Management
§ Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
§ Identity Management.
Products & Services
Software Products
• OpenLink Universal Data Access Drivers (UDA) - High-performance data access drivers for ODBC, JDBC, ADO.NET, and OLE DB that provide transparent access to enterprise databases.
• OpenLink Virtuoso - available in single server and cluster editions that are deployed in cloud and/or enterprise modes.
• OpenLink Data Spaces Platform and Applications
• OpenLink Ajax Toolkit
• OpenLink Data Explorer
• An Open Source Data Access SDK for ODBC
All OpenLink products are delivered by download from the Internet (http, ftp, etc.). Temporary licenses are issued upon download and may be extended as needed, on a case-by-case basis. Permanent licenses are issued once payment is received.
Products & Services
Professional and Support Services
• OpenLink Product Support provides front-line email and phone support, web-based online support, and a variety of premium services such as phone, emergency, and onsite support.
• Our Support staff is comprised of individuals with extensive knowledge of data access, data migration, database administration, programming APIs, and other relevant skills.
• Services are sold in either Standard "Bronze" or Premium "Platinum" Support packages, with varying hours of availability, response times, etc.
• We also offer Custom Development, Training, and other Consultancy services. These services can be offered on- or off-site. Expenses for travel, accommodations, food, etc., associated with on-site services are charged separately.
Customers
OpenLink's installed base is in excess of 10,000 customers worldwide. Examples include:
• Data.Gov (U.S. Govt. Open Linked Data initiative)
• Verizon
• Raytheon
• Bank of America
• CGI Federal
• Elsevier
• French National Library
• Globo
• Scottish Government
• St Jude's Medical
• Barclays Bank
• Wells Fargo
• and many more
Office Locations
USA
OpenLink Software, Inc
10 Burlington Mall Road, Suite 265
Burlington, MA 01803
Tel.: +1 781 273 0900
Fax: +1 781 229 8030
UK
OpenLink Software Ltd.
Airport House, Purley Way
Croydon, Surrey CR0 0XZ
Tel.: +44 (0)20 8681 7701
Fax: +44 (0)20 8681 7702
Virtuoso Universal Server Overview
Situation Analysis
Data is growing exponentially along the following dimensions:
• Volume
• Velocity
• Variety
All of this happens while the number of hours in a day remains 24.
Product Value Proposition
Enterprise and Individual Agility via Data Access, Integration, and Management, without compromising performance, scalability, security, and platform independence.
Virtuoso locks you into an experience (openness, performance, and scale) not the platform itself. -- Kingsley Idehen, Founder & CEO, OpenLink Software
Product Architecture
A high-performance, scalable, secure, and operating-system-independent server designed to handle contemporary challenges associated with standards-compliant data access, data integration, and data management.
Data Virtualization Middleware
A built-in middleware layer (“Sponger”) for creating Transient & Persistent Views over Heterogeneous Data Sources.
Sophisticated Content Crawler
A DBMS-hosted Content Crawler that leverages loosely coupled binding to the Sponger Middleware component for transformation of unstructured and semi-structured data into Linked Data.
Core Platform behind LOD Cloud
Core Platform (Graph DBMS and Linked Data Deployment) behind DBpedia, many bubbles in the LOD Cloud, and the LOD Cloud cache itself.
Virtuoso Linked Data projects
• DBpedia - public SPARQL endpoint over the DBpedia data (and international Chapters)
• LOD Cloud Cache - public server hosting LOD cloud datasets
• URIBurner - Linked Data generation & transformation service
• Linked Geo Data - OpenStreetMap Spatial data as Linked Data
• Sindice - SPARQL endpoint behind its Semantic Web Index
• Data.gov - US Government Linked Data
• Health.data.gov - Clinical Quality Linked Data on health.data.gov
• Seevl - Linked Data music discovery service
• Bio2RDF - Life science data mapped to Linked Data
• Neurocommons - Life science data mapped to Linked Data
• Musicbrainz - MusicBrainz database published as Linked Data
• Open PHACTS - DBpedia-like Linked Data Space for Pharma
• Others - Many others …
Powerful Standards Support
ODBC compliance enables use of client applications (e.g. Microsoft Access) as front-ends for Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.
Powerful Standards Support Cont’d
ODBC & HTML5 compliance enables development of rich client apps that leverage the WebDB-ODBC bridge for accessing data across Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.
Insight Discovery & Exploration
Native Faceted Browsing that enables multi-dimensional drill-downs via any browser
Insight Discovery & Exploration
Microsoft Silverlight or HTML5 based PivotViewer Front-End for SPARQL and SPARQL-FED Queries
Powerful SPARQL Query Service
Basic SPARQL Endpoint for Creating Query Definitions & Sharing Query Results.
Example: health.data.gov data directly from a Web Browser.
Powerful SPARQL Query Builder
Use Query By Example (QBE) Patterns to Construct & Share Query Results.
How Do I Get Going?
• Download, install, and experience the power of coherent integration of disparate data sources, data access protocols, and data representation formats.
• In a nutshell, commence exploitation of powerful business intelligence, socially enhanced collaboration, data virtualization, and entity analytics without writing a line of code!
• Turn "Big Data" into exploitable "Smart Data" without compromise!
• Will be integrated into the next release of the LOD2 Stack
Virtuoso 7.0
Flexible Big Data Challenge
• Data Agility is challenged by Volume, Velocity, and Variety
• “Schema Last” is great, if the price is right
• RDF and graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores
• Inference may be good for integration, if it can express the right things, beyond OWL
• RDF data management technology must learn from the lessons of SQL RDBMS: everything applies
Virtuoso 7.0 Mission Statement
Destruction of the following items as impediments to Big (Open) Linked Data exploitation:
• Performance
• Scalability
• Platform Independence
• Security & Privacy
• Price
Virtuoso 7.0 & Big Data Myths
Myths put to rest:
• Scalable Open Ended SPARQL Endpoints
• Scalable Open Ended Read-Write SPARQL Endpoints
• Fine-grained Access Controls underlying Read-Only or Read-Write endpoints.
Virtuoso Column Store Features
• Supports SQL and SPARQL query languages
• Compact column-wise storage
• Vectored execution of commands
• Shared nothing scale out for clusters
• Powerful procedure language with parallel, distributed control structures
• Full-text and geospatial indexes
Storage Engine
• Freely mix column-wise and row-wise indices
• All SQL and RDF data types natively supported, single execution engine for SQL/SPARQL
• Column compression 3x more space efficient than row-wise compression for RDF
• Column stores are not only for big scans: random access surpasses rows as soon as there is some locality
• 9 B/quad with DBpedia, 7 B/quad with BSBM or RDF-H, 14 B/quad with web crawls (PSOG, POSG, SP, OP, GS, excluding literals)
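Why sorted, column-wise RDF indices compress well can be illustrated with a toy sketch. This is not Virtuoso's storage format: the function name, the integer ids, and the sample quads are all hypothetical, and plain run-length encoding stands in for whatever compression the engine really uses. The point is only that in an index sorted on P first, the leading columns consist of long runs of repeated values.

```python
from itertools import groupby

def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# Hypothetical quads as (predicate, subject, object, graph) ids, in PSOG order.
quads = sorted([
    (1, 10, 100, 7),
    (1, 10, 101, 7),
    (1, 11, 100, 7),
    (2, 10, 200, 7),
    (2, 12, 201, 7),
    (2, 12, 202, 7),
])

# Column-wise layout: one list per key part instead of one tuple per row.
p_col, s_col, o_col, g_col = (list(col) for col in zip(*quads))

p_runs = run_length_encode(p_col)  # two runs: the index is sorted on P first
g_runs = run_length_encode(g_col)  # one run: every quad is in the same graph
print(p_runs, g_runs)
```

A row-wise layout would repeat the predicate and graph id in every row; here six values shrink to two pairs and one pair respectively.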
Execution Engine
• Vectoring is not only for column stores
• Vectoring makes a random access into a linear merge join if there is any locality: always a win, mileage depends on run time factors
• Vectoring eliminates interpretation overhead and makes CPU friendly code possible
• Even with run time data typing, vectoring allows use of type-specific operators on homogenous data, e.g. arithmetic
• Dynamically adjust vector size: a larger vector may not fit in cache but will get better locality for random access
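The claim that vectoring turns random access into a linear merge join can be sketched as follows. This is a minimal illustration, not the engine's implementation: `vectored_lookup` and the sample keys are hypothetical. Instead of probing the index once per key, the whole batch is sorted and resolved in a single forward pass over the sorted index.

```python
def vectored_lookup(sorted_index_keys, lookup_keys):
    """Resolve a whole vector of lookups in one forward pass.

    Sorting the batch turns independent random probes into a merge
    against the already sorted index, so the cursor never moves backwards.
    """
    found = set()
    pos = 0
    for key in sorted(set(lookup_keys)):
        # Advance the cursor to the first index entry that is >= key.
        while pos < len(sorted_index_keys) and sorted_index_keys[pos] < key:
            pos += 1
        if pos < len(sorted_index_keys) and sorted_index_keys[pos] == key:
            found.add(key)
    # Report the hits in the caller's original order.
    return [key for key in lookup_keys if key in found]

index_keys = [2, 3, 5, 8, 13, 21, 34]
hits = vectored_lookup(index_keys, [21, 4, 3, 34, 8])
print(hits)
```

The larger the batch, the closer together the sorted keys fall in the index, which is the locality the slide refers to.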
Graph operations
• Run time computation plus caching instead of materialization
• SPARQL/SQL extension for arbitrary transitive subqueries:
• Flexible options for returning shortest paths, all paths, all / distinct reachable, attributes of steps on paths etc.
• Efficient execution, searching the graph from both ends if looking for a path with ends given
• Query operators for RDF hierarchy traversal
• Special query operator for OWL sameAs and IFP based identity
• Taking OWL sameAs / IFP identity into account for DISTINCT / GROUP BY
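Searching the graph from both ends when both endpoints of a path are given can be sketched with a bidirectional breadth-first search. This is an illustrative Python version under assumed names (`shortest_path_length` and the sample edges are hypothetical); it does not show Virtuoso's transitive subquery syntax or its internal algorithm.

```python
from collections import defaultdict

def shortest_path_length(edges, start, goal):
    """Bidirectional breadth-first search over directed edges.

    Returns the number of hops on a shortest path, or None if unreachable.
    """
    if start == goal:
        return 0
    forward, backward = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)

    # One distance map and one frontier per search direction.
    dist_f, dist_b = {start: 0}, {goal: 0}
    frontier_f, frontier_b = {start}, {goal}

    while frontier_f and frontier_b:
        # Expand the smaller frontier to keep the explored area small.
        expand_forward = len(frontier_f) <= len(frontier_b)
        if expand_forward:
            frontier, dist, other, adj = frontier_f, dist_f, dist_b, forward
        else:
            frontier, dist, other, adj = frontier_b, dist_b, dist_f, backward
        next_frontier = set()
        best = None
        for node in frontier:
            for nxt in adj[node]:
                if nxt in other:
                    # The two searches meet: total hops through this edge.
                    candidate = dist[node] + 1 + other[nxt]
                    best = candidate if best is None else min(best, candidate)
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    next_frontier.add(nxt)
        if best is not None:
            return best
        if expand_forward:
            frontier_f = next_frontier
        else:
            frontier_b = next_frontier
    return None

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]
print(shortest_path_length(edges, "a", "d"))
```

Each side only has to explore roughly half the path depth, which is why this pays off on graphs with high fanout.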
Query Optimization Challenges
• Typical SQL stats do not help
• Need to measure data cardinalities starting from constants in the query
• Need to sample fanout predicate by predicate, as needed
• Predicate and class hierarchies are easy to handle in sampling
• sameAs or IFP inference voids all guesses
• Is hash join worthwhile? High setup cost means that one must be sure of cardinalities first
Deep Sampling
• Everything is a join -> sampling must also do joins
• As the candidate plan grows, the cost model executes all the ops on a sample of the data
• Actual cardinality and locality are known, also when search conditions are correlated
• Having high confidence in the cost model, hash join plans become safe and attractive
• Even though there is an indexed access path for all, a scan can be better because it produces results in order. Need to be sure of selectivity before taking the risk
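The idea that sampling must itself execute joins can be sketched as below. This is a simplified illustration with hypothetical names and data, not the optimizer's actual cost model: the join is run on a random sample of the left-hand rows, and the observed fanout is scaled up to the full input.

```python
import random

def estimate_join_cardinality(left_keys, right_index, sample_size, rng):
    """Estimate the size of a join by executing it on a sample of the left side.

    right_index maps a join key to the list of matching right-hand rows,
    so the length of that list is the observed fanout for the key.
    """
    if not left_keys:
        return 0.0
    sample = rng.sample(left_keys, min(sample_size, len(left_keys)))
    matches = sum(len(right_index.get(key, ())) for key in sample)
    # Scale the sampled fanout up to the whole left-hand side.
    return matches / len(sample) * len(left_keys)

# Hypothetical data: subjects on the left, their matching triples on the right.
subjects = [1, 2, 3, 4]
triples_by_subject = {1: ["t1", "t2"], 3: ["t3"]}

estimate = estimate_join_cardinality(
    subjects, triples_by_subject, sample_size=10, rng=random.Random(0)
)
print(estimate)
```

Because the sample is pushed through the real join, correlations between conditions show up in the measured fanout instead of being assumed away.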
Elastic Cluster
• Data is partitioned by key, different indices may have different partition keys
• Partitions may split and migrate between servers
• Partitions may be kept in duplicate for fault tolerance/load balancing
• Actual access stats drive partition split and placement
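Partitioning by key, with a different partition key per index, can be sketched as follows. The mapping of index names to partition columns, the CRC32 hash, and the sample quad are assumptions made for illustration; the slides do not state which column each index is partitioned on or which hash function is used.

```python
import zlib

def partition_of(key, partition_count):
    """Map a key to a partition with a hash that is stable across processes."""
    return zlib.crc32(str(key).encode("utf-8")) % partition_count

# Hypothetical layout: each index names the quad column it is partitioned on.
INDEX_PARTITION_COLUMN = {"PSOG": "s", "POSG": "o"}

def placements(quad, partition_count):
    """Return the partition that holds this quad's entry in each index."""
    return {
        index: partition_of(quad[column], partition_count)
        for index, column in INDEX_PARTITION_COLUMN.items()
    }

quad = {"p": "foaf:knows", "s": "ex:alice", "o": "ex:bob", "g": "ex:graph"}
where = placements(quad, partition_count=4)
print(where)
```

The same quad can therefore live on different servers in different indices, so a lookup by subject and a lookup by object each go straight to one partition.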
Optimizing for Cluster
• Vectored execution is natural in a cluster since single-tuple messages are not an option
• Keep max ops in flight at all times, always send long messages
• Fully distributed query coordination:
  ◦ Any node can service a client request. Correlated subqueries, stored procedures may execute anywhere, arbitrary parallelism and recursion between partitions
  ◦ On single shared memory box, cluster is approximately even with single process multithreading, low overhead
  ◦ 1.8x more throughput in BSBM BI when going from 1 to 2 machines
  ◦ Distributed stored procedures, send the proc to the data, as in map-reduce, except that there are no limits on cross partition calling/recursion
  ◦ Choice of transactional and auto-commit update semantics, can have atomic ops without global transaction
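Why vectored execution suits a cluster can be shown with a small sketch: a vector of lookups is grouped by destination partition, so each server receives one long message instead of one message per tuple. The function name and the modulo partitioning rule are hypothetical simplifications.

```python
from collections import defaultdict

def batch_by_partition(keys, partition_count):
    """Group lookup keys by destination partition: one message per partition."""
    batches = defaultdict(list)
    for key in keys:
        batches[key % partition_count].append(key)
    return dict(batches)

messages = batch_by_partition([17, 4, 8, 21, 12, 5], partition_count=4)
print(messages)
```

Here six lookups travel in two messages; with single-tuple execution they would need six round trips.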
Cluster Architecture Diagrams
LOD Cache
• 55 billion triples in LOD cache, only 384 GB of RAM, 2TB disk
• 2 x 384 GB of RAM, 4TB SSD
• Most of Linked Open Data and Web Crawls
• http://lod.openlinksw.com
• http://lod.openlinksw.com/sparql
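A query against the LOD Cache SPARQL endpoint listed on the slide could be prepared as sketched below. The sketch relies only on the standard SPARQL Protocol (a `query` parameter on a GET request plus an `Accept` header); the helper name and the sample query are hypothetical, and the request is built but deliberately not sent, since the endpoint's current availability is not assumed.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://lod.openlinksw.com/sparql"  # endpoint URL from the slide

def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_sparql_request(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.full_url)
# Passing the request to urllib.request.urlopen would send it over the network.
```

On a dataset of this size, keeping a `LIMIT` clause in exploratory queries is a sensible precaution.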
Independent Benchmark Report from CWI:
Berlin SPARQL Benchmark
#Triples | Source File Size | Compressed Source File Size | Source Data Files Per Loader Node | Final Database File Size | Load Time
50 Billion | 2.8 TB | 240 GB | 30 GB | 1.8 TB | 10h 54s
150 Billion | 8.5 TB | 728 GB | 91 GB | 5.6 TB | n/a
Store Comparisons Summary:
Exploration oriented queries (QMpH)
Berlin SPARQL Benchmark
100 Million Triples
200 Million Triples
1 Billion Triples
Virtuoso 6 37,678.319 32,969.006
8,984.789
Virtuoso 7 47,178.820
27,933.682
Store Comparisons Summary:
Business Intelligence oriented queries (QMpH)
Berlin SPARQL Benchmark
10 Million Triples 100 Million Triples
1 Billion Triples
Virtuoso 6 431.465 35.342 2.383
Virtuoso 7 996.795 75.236
Store Comparisons Summary:
Exploration oriented queries (Cluster Edition) (QMpH)
Berlin SPARQL Benchmark
Store | 10 Billion Triples | 50 Billion Triples | 150 Billion Triples
Virtuoso 7 | 2,360.210 | 4,253.157 | 2,090.574
Store Comparisons Summary:
Business Intelligence oriented queries (Cluster Edition) (QMpH)
Berlin SPARQL Benchmark
Store | 10 Billion Triples | 50 Billion Triples | 150 Billion Triples
Virtuoso 7 | 13.078 | 0.964 | 0.285
Future Work
• Complete deep sampling: enhanced query optimization plans
• Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible
Additional Information
• OpenLink Software
  ◦ OpenLink Software - www.openlinksw.com
  ◦ OpenLink Virtuoso - virtuoso.openlinksw.com
  ◦ Universal Data Access - uda.openlinksw.com
• Social Media Data spaces
  ◦ http://virtuoso.openlinksw.com/blog/ (weblog)
  ◦ https://plus.google.com/112399767740508618350/posts (Google+)
  ◦ https://twitter.com/OpenLink (Twitter)
  ◦ http://www.linkedin.com/company/openlink-software (LinkedIn)
  ◦ Hashtag: #LinkedData (Anywhere)
EU-FP7 LOD2 WP6 – 25.-26.03.2013. Page 47 http://lod2.eu
Creating Knowledge out of Interlinked Data
LOD2 Stack Usability Survey 2013
http://www.surveygizmo.com/s3/1188229/LOD2-Stack-Usability-Survey-2013