Heart Project Proposal: Distributed RDF Table & Processing Engine
Frederick Haebin Na [email protected] Project Group
2008.10.23.
Contents
1. Heart Proposal Overview
2. Goals & Objectives
3. Backgrounds
4. Benefits
5. Features
3 / Heart Project Proposal
Heart Proposal Overview
Heart (Highly Extensible & Accumulative RDF Table) aims to provide a planet-scale RDF store and a set of features for processing the data in a distributed manner. Heart is based on Hadoop and HBase. It aims to be a batch processor, or analyzer, rather than a real-time database.
Heart will be the heart of Web 3.0, where machines extend human-powered knowledge at a far greater rate than in Web 2.0. As the volume of semantic data grows at this rate, Heart will become very useful in about a decade. Until then, Heart will play a crucial role in experimenting with niche service models.
Benefits
• Massive Storage & Processor: Highly Extensible & Accumulative Storage; Faster Loader/Query Processing/Materializer for Massive RDF Data
• RDF Data Mining Platform: Knowledge Discovery; Prediction/Classification/Association
• Semantic Search Platform: Bulk Pre & Post Processing for Semantic Search

Features
• Heart Data Loader: Bulk Triples to HBase
• Heart Storage Manager: Smart Triples Partitioning
• Heart Query Processor: Optimized Query for Massive Data
• Heart Data Miner: Extension to SPARQL for Data Mining
• Heart Data Materializer: Indexing for Implicit Statements

Relevant Projects
• Core (Billion Triples) 1): Garlik JXT (9.8), YARS2 (7), BigOWLIM (6.7), Jena TDB (1.7), Virtuoso (1)
• Applications: PowerSet, a semantic search engine; "A Scale-Out RDF Molecule Store for Distributed Processing of Biomedical Data", Newman, et al.
1) http://esw.w3.org/topic/LargeTripleStores
Goals & Objectives
The goal of Heart is to provide massive RDF data storage and a batch processor for various kinds of RDF data mining.
Several key problems must be solved to meet the first objective, which has the highest priority of the three.
Goals
• To Provide Massive RDF Data Storage & Batch Processor for Various RDF Data Mining

Key Problems That Need to Be Solved
• Would the sequential-read-centric HBase index be enough for the random reads/writes that joins require?
  • If not, how can HBase indexes be exploited, or new ones generated, to speed up processing?
  • What is the most suitable index for semantic search?
• How should the triples be partitioned for efficient joins? (By subject, by predicate, or grouped by named graph)

Objectives
• Faster Massive Data Processor
  • Loader 1): Better than Garlik JXT
  • Query Processor 1): Better than Garlik JXT
• Highly Extensible & Accumulative RDF Table
  • Supports more than 10 billion triples over more than 3,000 computers.
• Extensions for Data Mining
  • Full Support for the Standard SPARQL
  • Machine Learning Extensions 2)
1) http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
2) http://www.eswc2008.org/final-pdfs-for-web-site/qpI-4.pdf
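The partitioning question can be made concrete with a small sketch. It is only an illustration under stated assumptions, not Heart's design: the sample triples, the `partition` and `star_join` helpers, and the choice of CRC32 hashing are all hypothetical, and plain Python dictionaries stand in for HBase regions. It shows why partitioning by subject lets a join between two patterns that share a subject run inside each partition, without moving data between machines.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical sample triples (subject, predicate, object); not from the proposal.
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:carol", "foaf:name", '"Carol"'),
]


def partition(triples, key_index, num_partitions):
    """Assign each triple to a partition by hashing one component.

    key_index 0 partitions by subject, 1 by predicate.
    crc32 is used instead of hash() so the assignment is stable across runs.
    """
    parts = defaultdict(list)
    for triple in triples:
        bucket = crc32(triple[key_index].encode("utf-8")) % num_partitions
        parts[bucket].append(triple)
    return dict(parts)


def star_join(parts, pred_a, pred_b):
    """Join two patterns that share a subject, one partition at a time.

    Valid only when parts was built with key_index=0: every triple of a
    given subject then lives in the same partition, so no cross-partition
    lookups are needed.
    """
    results = []
    for triples in parts.values():
        a, b = defaultdict(list), defaultdict(list)
        for s, p, o in triples:
            if p == pred_a:
                a[s].append(o)
            elif p == pred_b:
                b[s].append(o)
        for s, objs_a in a.items():
            for oa in objs_a:
                for ob in b.get(s, []):
                    results.append((s, oa, ob))
    return sorted(results)


by_subject = partition(TRIPLES, key_index=0, num_partitions=3)
joined = star_join(by_subject, "foaf:knows", "foaf:name")
```

Partitioning by predicate (`key_index=1`) would instead keep all `foaf:knows` triples together, which suits full scans of one property but spreads a single subject's triples across partitions, so the same join would need a shuffle step.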
Backgrounds
More and more services are beginning to provide and refine their RDF/RDFS-related features, and users are beginning to ask for more specific and contextual search results. Service providers, in turn, now hold RDF data and the schemas needed to process it for their customers' needs.

Drivers converging on Heart:
• More RDF Supporting Services
• Needs for Contextual & Specific Search Results
• Proliferation of Various RDF Schemas
• Increase in RDF Data
• Refinement in RDFS
• Needs for Processing RDF Data
Benefits
Heart provides three benefits: a massive RDF storage/processor, an RDF data mining platform, and a semantic search platform.

1. Massive RDF Storage & Processor
• Highly extensible and accumulative storage comes from Hadoop and HBase.
• Faster processing of massive RDF data is made possible by the MapReduce model for distributed RDF data processing.
• HBase-based column-oriented partitioning improves performance because fewer joins are needed.

2. RDF Data Mining Platform
• Full Support for Standard SPARQL over Massive RDF Data: converts SPARQL queries to a MapReduce implementation
• Machine Learning Features for SPARQL Extensions 1): Prediction, Classification, Association

3. Semantic Search Platform
• Provides fundamental features for semantic search: Storage & Processor; Knowledge Discovery by Data Mining
• Massive RDF data can be mined to generate a semantic search index: Support for User Defined Index Model
1) http://www.eswc2008.org/final-pdfs-for-web-site/qpI-4.pdf
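To illustrate what converting a SPARQL query into a MapReduce job could look like, here is a minimal sketch in plain Python. It does not use the Hadoop API and is not Heart's actual conversion: the sample data and the `map_phase`, `shuffle`, `reduce_phase`, and `run_query` functions are hypothetical, and the map, shuffle, and reduce stages are simulated in a single process. The query being evaluated is `SELECT ?x ?n WHERE { ?x foaf:knows ?y . ?y foaf:name ?n }`, a join of two triple patterns on the shared variable `?y`.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample triples (subject, predicate, object); not from the proposal.
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:carol", "foaf:name", '"Carol"'),
]


def map_phase(triple):
    """Emit (join key, tagged value) pairs keyed on the binding of ?y."""
    s, p, o = triple
    if p == "foaf:knows":
        yield o, ("knows", s)   # pattern 1: ?y is the object, keep ?x
    if p == "foaf:name":
        yield s, ("name", o)    # pattern 2: ?y is the subject, keep ?n


def shuffle(pairs):
    """Group values by key, as the MapReduce framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """For one ?y, combine every ?x that knows it with every ?n it is named."""
    xs = [v for tag, v in values if tag == "knows"]
    names = [v for tag, v in values if tag == "name"]
    yield from product(xs, names)


def run_query(triples):
    pairs = (kv for t in triples for kv in map_phase(t))
    groups = shuffle(pairs)
    return sorted(row for k, vs in groups.items() for row in reduce_phase(k, vs))


rows = run_query(TRIPLES)
```

Each additional triple pattern that introduces a new join variable would typically need another map/reduce round in this style, which is one reason the partitioning and indexing questions raised earlier matter for query speed.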
Features
Heart provides five core features: data loader, storage manager, query processor, data miner, and data materializer.

1. Data Loader
• Fast Bulk Storing & Reasoning
• Bulk Triples into HBase
• Supports Various File Formats

2. Storage Manager
• Smart Triples Partitioning
• C-Store with Sequential-Read Centric Processing
• Reduce or Eliminate Random Access

3. Query Processor
• Full Standard SPARQL
• Query Conversion to MapReduce Codes

4. Data Miner
• Machine Learning Extensions
• Prediction, Classification, Association

5. Data Materializer
• Indexes for Implicit Statements
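
As an illustration of materializing implicit statements, the sketch below assumes that "implicit statements" means triples entailed by RDFS rules; the proposal does not say which rules the Data Materializer would cover. The sample data and the `materialize` function are hypothetical. Only two standard RDFS entailment patterns are applied, transitivity of `rdfs:subClassOf` (rdfs11) and type inheritance through `rdfs:subClassOf` (rdfs9), repeated until no new triple appears.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical explicit triples; not from the proposal.
EXPLICIT = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:rex", TYPE, "ex:Dog"),
}


def materialize(triples):
    """Return the input triples plus everything the two rules entail."""
    closed = set(triples)
    while True:
        subclass_pairs = [(s, o) for s, p, o in closed if p == SUBCLASS]
        new = set()
        # rdfs11: A subClassOf B and B subClassOf C entail A subClassOf C.
        for a, b in subclass_pairs:
            for c, d in subclass_pairs:
                if b == c:
                    new.add((a, SUBCLASS, d))
        # rdfs9: x type A and A subClassOf B entail x type B.
        for s, p, o in closed:
            if p == TYPE:
                for a, b in subclass_pairs:
                    if o == a:
                        new.add((s, TYPE, b))
        if new <= closed:       # fixpoint: nothing new was derived
            return closed
        closed |= new


ALL_TRIPLES = materialize(EXPLICIT)
IMPLICIT = ALL_TRIPLES - EXPLICIT
```

Storing the derived triples alongside the explicit ones trades storage and load time for simpler queries: a lookup for all instances of `ex:Animal` becomes a single scan instead of a chain of joins at query time.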
Thank you.