Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores

Date post: 23-Feb-2016
Page 1: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Iowa State University, Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery

Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.

Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores

Harris T. Lin and Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Iowa State University
[email protected]

[Figure: "Machine Learning" linked to "Relational, Distributed" with a question mark]

Page 2: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Resource Description Framework (RDF) Primer

• RDF triple = subject-predicate-object triple
• RDF graph = set of RDF triples
• Directed labeled graph whose nodes are URIs (or literal values such as 1987)

RDF Data (Triple representation)

(Inception, hasActor, Ellen Page)
(Inception, hasActor, Leonardo DiCaprio)
(Titanic, hasActor, Leonardo DiCaprio)
(Ellen Page, yearOfBirth, 1987)
(Ellen Page, gender, F)
(Leonardo DiCaprio, yearOfBirth, 1974)
(Leonardo DiCaprio, gender, M)

RDF Data (Graph representation)

[Figure: the triples above drawn as a directed labeled graph. Inception and Titanic point to their actors via hasActor; each actor points to a year via yearOfBirth and to F or M via gender.]

RDF Schema

[Figure: Movie -hasActor-> Actor; Actor -yearOfBirth-> xsd:integer; Actor -gender-> Gender]
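The triple and graph views of the slide's example can be sketched in a few lines of Python. This is only an illustration: plain strings stand in for URIs, and the helper names `build_graph` and `objects` are made up for this sketch.

```python
from collections import defaultdict

# The seven triples from the slide, as (subject, predicate, object).
# Plain strings stand in for URIs; 1987 and 1974 are literal values.
triples = {
    ("Inception", "hasActor", "Ellen Page"),
    ("Inception", "hasActor", "Leonardo DiCaprio"),
    ("Titanic", "hasActor", "Leonardo DiCaprio"),
    ("Ellen Page", "yearOfBirth", 1987),
    ("Ellen Page", "gender", "F"),
    ("Leonardo DiCaprio", "yearOfBirth", 1974),
    ("Leonardo DiCaprio", "gender", "M"),
}


def build_graph(triples):
    """Index the triple set as subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in sorted(triples, key=str):
        graph[s].append((p, o))
    return graph


def objects(triples, subject, predicate):
    """Return all objects linked from `subject` by `predicate`, sorted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


graph = build_graph(triples)
print(objects(triples, "Inception", "hasActor"))
```

The set of tuples is the triple representation; `graph` is the same data viewed as a directed labeled graph.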

Page 3: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Introduction

• Motivating scenario: Facebook + New York Times
  – Facebook users share posts about news items published in the New York Times
  – Goal: predict the interest of a user in joining a group
• Challenges for Machine Learning
  – Multiple interlinked data stores
  – Physically distributed data stores
  – Autonomously maintained data stores

Page 4: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Introduction

• Linked Open Data cloud
  – 300+ interlinked datasets
  – 30+ trillion triples
• Multiple interlinked, physically distributed, autonomously maintained data stores
• Downloading all the data to one place is often not possible
  – Bandwidth limits
  – Access limits
  – Storage and memory limits
  – Privacy and confidentiality constraints
• We need
  – Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. SPARQL query interface)

[Figure: Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/]

Page 5: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Summary of Contribution

• Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores
• Contributions
  1. Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data
  2. Distributed learning framework for RDF stores that form a chain [Not covered in this talk]
  3. Identification of 3 special cases of RDF data fragmentation [Not covered in this talk]
  4. Novel application of matrix reconstruction for approximating statistics, which dramatically reduces communication
  5. Experimental results demonstrating feasibility

Page 6: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Problem Formulation

• Last.fm dataset:

[Figure: Last.fm data graph with class-labeled users: User1 = Y, User2 = N, User3 = N]

Dataset (Conceptual)

((B11, …, B1K), c1)
((B21, …, B2K), c2)
…
((Bn1, …, BnK), cn)

Page 7: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Learning with Indirect Access to Data

• Single RDF data store
  – Lin et al. [10]
• Multiple interlinked RDF data stores
  – This work

[Figure: the RDF data stores hold the conceptual dataset ((B11, …, B1K), c1), ((B21, …, B2K), c2), …, ((Bn1, …, BnK), cn). The Learner obtains statistics via SPARQL queries and outputs a Classifier, which maps a new instance to a predicted class.]
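To make "statistics via SPARQL queries" concrete, the following sketch builds a SPARQL 1.1 aggregate query that counts, for each starting node, the paths along a chain of predicates that end in a given value. It is an assumption-laden illustration rather than the authors' query formulation: the function name `chain_count_query` and the `http://example.org/...` URIs are hypothetical, and sending the query to one or more endpoints is not shown.

```python
def chain_count_query(predicates, value):
    """Build a SPARQL query that counts, per start node ?n0, the paths that
    follow `predicates` in order and end at the URI `value`."""
    lines = []
    for i, pred in enumerate(predicates):
        is_last = i == len(predicates) - 1
        # Intermediate hops bind fresh variables; the last hop is fixed to `value`.
        obj = f"<{value}>" if is_last else f"?n{i + 1}"
        lines.append(f"  ?n{i} <{pred}> {obj} .")
    body = "\n".join(lines)
    return (
        "SELECT ?n0 (COUNT(*) AS ?count)\n"
        "WHERE {\n" + body + "\n}\n"
        "GROUP BY ?n0"
    )


# Hypothetical predicate and tag URIs for a user -> track -> artist -> tag chain.
query = chain_count_query(
    [
        "http://example.org/listenedTo",
        "http://example.org/byArtist",
        "http://example.org/hasTag",
    ],
    "http://example.org/tag/rock",
)
print(query)
```

The learner only ever sees the aggregated counts returned by such a query, never the underlying triples.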

Page 8: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Learning Algorithms

1. Aggregation
  – Simple aggregation (max, min, avg, etc.)
  – Vector distance aggregation (Perlich and Provost [12])
2. Generative Models
  – Naïve Bayes (with 4 different distributions)
    • Bernoulli
    • Multinomial
    • Dirichlet
    • Polya (Dirichlet-Multinomial)

• Key sufficient statistic: count for each value, for each instance (= histogram for each instance)
• How to obtain this efficiently?
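As an illustration of the per-instance histogram as a sufficient statistic, here is a minimal multinomial naïve Bayes trained from histograms only. It is a textbook sketch with Laplace smoothing (`alpha`), not the formulation used in the talk, and the users, tags, and labels are invented toy data.

```python
import math
from collections import Counter


def histogram(bag, vocabulary):
    """Count how often each vocabulary value occurs in one instance's bag."""
    counts = Counter(bag)
    return [counts[v] for v in vocabulary]


def train_multinomial_nb(histograms, labels, alpha=1.0):
    """Estimate class priors and per-class value probabilities from histograms."""
    n_values = len(histograms[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [h for h, y in zip(histograms, labels) if y == c]
        totals = [sum(col) for col in zip(*rows)]  # per-value counts in class c
        denom = sum(totals) + alpha * n_values
        model[c] = {
            "log_prior": math.log(len(rows) / len(labels)),
            "log_theta": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def predict(model, hist):
    """Return the class with the highest log posterior for one histogram."""
    def score(c):
        m = model[c]
        return m["log_prior"] + sum(n * lt for n, lt in zip(hist, m["log_theta"]))
    return max(model, key=score)


# Toy data (hypothetical): tags reachable from each user, plus a class label.
vocabulary = ["rock", "pop", "jazz"]
bags = [["rock", "rock", "pop"], ["jazz", "jazz"], ["jazz", "pop"]]
labels = ["Y", "N", "N"]

hists = [histogram(b, vocabulary) for b in bags]
model = train_multinomial_nb(hists, labels)
print(predict(model, histogram(["rock", "rock"], vocabulary)))
```

Training touches the data only through the histograms, so any procedure that delivers (or approximates) those counts is enough to learn the model.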

Page 9: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

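Obtaining Statistics for Learning

Schema: [Figure: chain of classes Users, Track, Artist, Tag]

Data Graph: [Figure: not recoverable from the transcript]

Matrix Representation: [Figure: matrices labeled User × Track, Track × Artist, Artist × Tag, and User × Tag]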
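One plausible reading of the matrix representation, stated here as an assumption, is that each link in the chain is a count matrix and that the User × Tag histogram matrix is their product. The sketch below uses small invented matrices to show this.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy link matrices (hypothetical data); entry [i][j] counts links from i to j.
user_track = [[1, 1, 0],
              [0, 0, 1]]      # 2 users x 3 tracks
track_artist = [[1, 0],
                [0, 1],
                [0, 1]]       # 3 tracks x 2 artists
artist_tag = [[1, 1, 0],
              [0, 1, 1]]      # 2 artists x 3 tags

# Chaining the links: each row of user_tag is one user's tag histogram.
user_artist = matmul(user_track, track_artist)
user_tag = matmul(user_artist, artist_tag)
print(user_tag)  # [[1, 2, 1], [0, 1, 1]]
```

Computing this product exactly requires moving whole matrices between stores, which is the communication cost the talk aims to avoid.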

Page 10: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Approximating Statistics

[Figure: the User × Track, Track × Artist, Artist × Tag chain and the User × Tag matrix, shown together with its Column Projection and its Row Projection]
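A sketch of how the two projections could be obtained without ever forming the User × Tag matrix. It assumes the projections are the row sums and column sums of that matrix, and that the matrix is the product of the link matrices along the chain; neither assumption is spelled out in the transcript, and which sum the slide calls "row" versus "column" projection is also a guess. Only a vector travels between neighbouring stores.

```python
def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(x * y for x, y in zip(v, col)) for col in zip(*m)]


def row_sums_of_product(chain):
    """Row sums of the chain product, passing a vector from the last store back."""
    v = [1] * len(chain[-1][0])
    for m in reversed(chain):
        v = mat_vec(m, v)
    return v


def col_sums_of_product(chain):
    """Column sums of the chain product, passing a vector from the first store on."""
    v = [1] * len(chain[0])
    for m in chain:
        v = vec_mat(v, m)
    return v


# Toy link matrices (hypothetical): user-track, track-artist, artist-tag.
chain = [
    [[1, 1, 0], [0, 0, 1]],
    [[1, 0], [0, 1], [0, 1]],
    [[1, 1, 0], [0, 1, 1]],
]
print(row_sums_of_product(chain))  # [4, 2]
print(col_sums_of_product(chain))  # [1, 3, 2]
```

For this toy chain the full product is [[1, 2, 1], [0, 1, 1]], whose row and column sums match the two vectors above.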

Page 11: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Approximating Statistics

[Figure: the unknown User × Tag matrix, marked "?"]

• Could we approximate this matrix from the two projections?
• CT scans to the rescue!
• CT: Reconstruct a 3D object from its projected slices
• We want: Reconstruct a 2D matrix from its projections

Source: http://health-fts.blogspot.com/2012/01/brain-ct-mri.html
Source: https://www.medicalradiation.com/types-of-medical-imaging/imaging-using-x-rays/computed-tomography-ct/

Page 12: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Approximating Statistics

[Figure: the unknown User × Tag matrix, marked "?"]

• We adapted one of the simplest reconstruction methods: Algebraic Reconstruction Technique (ART)
• Proposed scheme
  1. Use SPARQL queries to accumulate and pass along column and row vectors, ultimately sending them back to the learner
  2. The learner uses a CT method to reconstruct the matrix from the projections
  3. Use the approximated matrix to compute the statistics needed for learning
• Dramatically reduces communication!
• How accurate are the learned classifiers?
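Step 2 of the scheme can be illustrated with the basic additive form of ART (a Kaczmarz-style iteration), here applied to row sums and column sums. This is a generic textbook variant written under that assumption; the talk says ART was adapted, and the details of the adaptation are not given in the transcript. The function name `art_reconstruct` and the input vectors are made up for the sketch.

```python
def art_reconstruct(row_sums, col_sums, sweeps=20):
    """Approximate a matrix from its row and column sums with additive ART.

    Each constraint (one row sum or one column sum) is enforced in turn by
    spreading its residual evenly over the cells it covers.
    """
    n_rows, n_cols = len(row_sums), len(col_sums)
    x = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(sweeps):
        for i in range(n_rows):
            corr = (row_sums[i] - sum(x[i])) / n_cols
            for j in range(n_cols):
                x[i][j] += corr
        for j in range(n_cols):
            corr = (col_sums[j] - sum(x[i][j] for i in range(n_rows))) / n_rows
            for i in range(n_rows):
                x[i][j] += corr
    return x


# Hypothetical projections of a 2 x 3 count matrix; both sum to the same total.
approx = art_reconstruct([4, 2], [1, 3, 2])
for row in approx:
    print([round(v, 3) for v in row])
```

The result reproduces both projections but is only one of many matrices that do so, which is why the accuracy of classifiers learned from it has to be checked empirically. Note that plain additive ART can produce negative entries for some inputs; this sketch does not clamp them.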

Page 13: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Experimental Results

• Two subsets of the Last.fm dataset
• 2 aggregation and 4 naïve Bayes models
• Compared against centralized counterparts
  – These use the exact matrix for learning
• Accuracy results (10-fold cross validation)
  – The effect of ART approximation depends on the model
  – NB(Pol) is competitive, even in the ART-approximated case
  – NB(Mul) is competitive too, despite using less information than NB(Pol)
• NB (Bernoulli) and NB (Multinomial) only need projections for learning, hence their results are identical (*)
• Sensitivity of ART on different models [Not covered in this talk]

Page 14: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Communication Complexity

• Size of query results transferred vs. size of the dataset (# users)
• The projections are several orders of magnitude smaller
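A back-of-the-envelope way to see where the saving comes from, assuming the alternative is to transfer a dense User × Tag matrix: the matrix has one cell per (user, tag) pair, while the two projections have one cell per user plus one per tag. The sizes below are invented for illustration and are not the measurements reported in the talk.

```python
def exact_matrix_cells(n_users, n_tags):
    """Cells in a dense users x tags matrix."""
    return n_users * n_tags


def projection_cells(n_users, n_tags):
    """Cells in one row-sum vector plus one column-sum vector."""
    return n_users + n_tags


# Hypothetical sizes, chosen only to show the scaling.
n_users, n_tags = 10_000, 1_000
full = exact_matrix_cells(n_users, n_tags)
proj = projection_cells(n_users, n_tags)
print(full, proj, round(full / proj))
```

The gap grows with the number of users and tags, since the matrix scales with their product and the projections with their sum.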

Page 15: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Conclusion

• Challenges
  – Multiple interlinked, physically distributed, autonomously maintained RDF data stores
  – The learner may be unable to download all the data because of limits on bandwidth, access, storage and memory, or because of privacy and confidentiality constraints
• We need
  – Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. SPARQL query interface)
• Contributions
  – Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data
  – Distributed learning framework for RDF stores that form a chain [Not covered in this talk]
  – Identification of 3 special cases of RDF data fragmentation [Not covered in this talk]
  – Novel application of matrix reconstruction from Computerized Tomography for approximating statistics, which dramatically reduces communication
  – Experimental results demonstrating feasibility

Page 16: Learning Classifiers from  Chains of Multiple Interlinked RDF Data Stores

Related Work and Future Work

• Related Work
  – Most existing work on learning from RDF data assumes direct access to the data
  – Lin et al. [10] learn relational Bayesian classifiers from a single remote RDF store via SPARQL queries
  – This work extends the remote access framework [20] to multiple RDF stores
• Future Work
  – Consider more recent and complex CT methods
  – Explore other ways of taking projections
  – Consider more complex RDF data fragmentations
  – Consider richer classes of learning models

