Ariadne's Thread -- Exploring a world of networked information built from free-text metadata

Rob Koopman, Shenghui Wang

OCLC Research

12 March 2015

KNAW eHumanities group

What would you do if you were interested in a topic?

It is difficult to answer these questions:

• What are the different aspects of this topic?

• Are there related aspects missing in my search terms?

• Who are the most prominent authors about this topic?

• Which journals publish most about this topic?

• How have others (e.g. librarians) described and classified this topic?

Demo examples

• http://thoth.pica.nl/demo/relate

http://thoth.pica.nl/demo/relate?input=prekindergarten

How do we do this?

● Offline: Build low-dimensional semantic representation using Random Projection

● Online: Interactive exploration of networked entities

A MARC record

title

authors

issn

dewey

publisher

Step 1: Build semantic representation

• Direct co-occurrence based

Does prekindergarten improve school preparation and performance?

[author:loeb susanna]

[issn:0272-7757]

• Indirect co-occurrence based (context based)

The effects of state prekindergarten programs on young children’s school readiness in five states

[author:jung kwanghee]

[subject:readiness for school]
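
A minimal sketch of how a record might be flattened into items and how direct co-occurrences could be counted. The field names, the tokenisation, and the functions `record_items` and `direct_cooccurrence` are assumptions for illustration; the slides do not describe the actual preprocessing.

```python
from collections import Counter
from itertools import combinations

def record_items(record):
    """Flatten one metadata record into a list of items.

    Title words become plain topical terms; other fields become
    tagged entities such as "[author:loeb susanna]".
    """
    words = [w.strip("?.,:;").lower() for w in record["title"].split()]
    items = [w for w in words if w]
    for field in ("author", "issn", "subject"):
        for value in record.get(field, []):
            items.append(f"[{field}:{value}]")
    return items

def direct_cooccurrence(records):
    """Count how often two items appear together in the same record."""
    counts = Counter()
    for record in records:
        unique_items = sorted(set(record_items(record)))
        for a, b in combinations(unique_items, 2):
            counts[(a, b)] += 1
    return counts

records = [
    {"title": "Does prekindergarten improve school preparation and performance?",
     "author": ["loeb susanna"], "issn": ["0272-7757"]},
    {"title": "The effects of state prekindergarten programs on young "
              "children's school readiness in five states",
     "author": ["jung kwanghee"], "subject": ["readiness for school"]},
]
counts = direct_cooccurrence(records)
```

In this toy data the author of the first record and the subject heading of the second never share a record, yet both co-occur with "prekindergarten". That shared context is what a context-based representation is meant to pick up.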

Dataset

● WorldCat, 300+ million records

● Selected 13 million items (topical terms, authors, ISSNs, Dewey decimal codes, publishers, subject headings)

● Represented by 6 million topical terms

But a matrix of 13M x 6M is too big to process

Dimension reduction based on Random Projection

C: a co-occurrence matrix

R: a random matrix with entries +1 or -1

C’: approximation of C after random projection (the semantic matrix)

Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: Interactive navigation in a world of networked information. In: CHI’15 Extended Abstracts.
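
The projection can be illustrated on a tiny dense matrix. This is a sketch under the assumption that C’ is obtained as the product C·R; the function names, the target dimension `k`, and the seed are illustrative and not taken from the slides.

```python
import math
import random

def random_projection(C, k, seed=0):
    """Project the rows of C onto k dimensions using a random +/-1 matrix R."""
    rng = random.Random(seed)
    n_cols = len(C[0])
    # R has one row per column of C and k columns, entries +1 or -1.
    R = [[rng.choice((-1, 1)) for _ in range(k)] for _ in range(n_cols)]
    return [
        [sum(row[j] * R[j][d] for j in range(n_cols)) for d in range(k)]
        for row in C
    ]

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Rows 0 and 1 point in the same direction; row 2 is orthogonal to both.
C = [[3, 0, 0, 0],
     [6, 0, 0, 0],
     [0, 1, 0, 0]]
C_prime = random_projection(C, k=256, seed=1)
same_direction = cosine(C_prime[0], C_prime[1])   # stays exactly parallel
orthogonal = cosine(C_prime[0], C_prime[2])       # only approximately zero
```

Rows that were parallel in C remain parallel in C’, while orthogonal rows end up only approximately orthogonal; the error shrinks as `k` grows.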

Scaling up

- We first select the items we want to build vectors for.

- Matrices C and R are never stored.

- We only store C’, a matrix of about 16 GB.

- We read metadata records sequentially and update C’ for each “co-occurrence”.

- Cost is of order N_totalwords × N_columns.

- Can be done in parallel

- Can use Hadoop
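
One way to avoid storing C and R is to regenerate each word's random row on demand and add it straight into C’. The sketch below assumes a hash-seeded row generator; `random_row`, `build_semantic_matrix`, the use of SHA-256, the value of `K`, and the `(items, words)` record layout are all hypothetical choices, not the authors' implementation.

```python
import hashlib

K = 64  # number of columns of C' (illustrative value)

def random_row(word, k=K):
    """Regenerate the +/-1 row of R for `word` from a hash, so R is never stored."""
    bits = []
    counter = 0
    while len(bits) < k:
        digest = hashlib.sha256(f"{word}:{counter}".encode("utf-8")).digest()
        for byte in digest:
            for shift in range(8):
                bits.append(1 if (byte >> shift) & 1 else -1)
        counter += 1
    return bits[:k]

def build_semantic_matrix(records, selected_items, k=K):
    """Stream over records and accumulate C' one co-occurrence at a time.

    Each record is a pair (items, words). For every selected item in a
    record, the random rows of the record's words are added to its vector.
    """
    c_prime = {item: [0] * k for item in selected_items}
    for items, words in records:
        rows = [random_row(word, k) for word in words]
        for item in items:
            if item not in c_prime:
                continue  # only build vectors for the selected items
            vector = c_prime[item]
            for row in rows:
                for d in range(k):
                    vector[d] += row[d]
    return c_prime

records = [
    (["[author:loeb susanna]"], ["prekindergarten", "school"]),
    (["[author:loeb susanna]", "[issn:0272-7757]"], ["performance"]),
]
semantic = build_semantic_matrix(records, ["[author:loeb susanna]"])
```

Because every update is a plain addition, records can be processed in any order, or split across workers whose partial matrices are summed afterwards.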

Step 2: Interactive exploration

- Let user input search term

- Calculate the top 500 most related candidates

- Find mutually related items

- Convert distances to probabilities

- Project to 2D

Step 2: Interactive exploration

- Let user select term

- Calculate the top 500 most similar candidates

- Find mutually related items

- Convert distances to probabilities

- Project to 2D
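
The candidate selection step could look roughly like this, assuming items are ranked by cosine similarity of their semantic vectors. The function names and the brute-force scan are illustrative; a production system would likely use something faster.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_related(query, vectors, n=500):
    """Return the n items most similar to `query` as (similarity, name) pairs."""
    query_vector = vectors[query]
    scored = (
        (cosine(query_vector, vector), name)
        for name, vector in vectors.items()
        if name != query
    )
    return heapq.nlargest(n, scored)

# Toy two-dimensional vectors standing in for rows of the semantic matrix.
vectors = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
best = [name for _, name in top_related("a", vectors, n=2)]
```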

Raw

Find mutually related items

- Basic assumption: my friends are the ones who consider me a friend too.

- For each candidate, calculate the average distance and standard deviation to all other candidates in the top 500 list.

- Keep candidates with the highest z-scores for the selected term.
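
One possible reading of this filter, sketched below: from each candidate's point of view, measure how unusually close the selected term is compared with the other candidates. The sign convention (larger z-score means closer than average), the function names, and the toy 2D points are assumptions; the slides do not give the exact formula.

```python
import math
import statistics

def mutual_z_scores(query, candidates, dist):
    """For each candidate, z-score of its distance to the query term,
    relative to its distances to all other candidates in the list."""
    scores = {}
    for candidate in candidates:
        others = [dist(candidate, other) for other in candidates if other != candidate]
        mean = statistics.fmean(others)
        std = statistics.pstdev(others)
        if std == 0:
            scores[candidate] = 0.0
        else:
            # Positive when the query is closer than this candidate's average.
            scores[candidate] = (mean - dist(candidate, query)) / std
    return scores

def keep_mutual(query, candidates, dist, keep):
    """Keep the `keep` candidates that consider the query a close neighbour."""
    scores = mutual_z_scores(query, candidates, dist)
    return sorted(candidates, key=scores.get, reverse=True)[:keep]

# Toy example: points in the plane with Euclidean distance.
points = {
    "q": (0, 0), "a": (1, 0), "b": (0, 1),
    "c": (10, 0), "d": (10, 1), "e": (11, 0),
}

def euclidean(x, y):
    return math.dist(points[x], points[y])

kept = keep_mutual("q", ["a", "b", "c", "d", "e"], euclidean, keep=2)
```

Here "a" and "b" are kept because, from their side, the query is much closer than their average neighbour, whereas "c", "d" and "e" have each other as closer friends.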

Cooked

Visualise in 2D

- Multidimensional scaling

- Simple spring model

- Verlet integration
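
A small sketch of a spring layout driven by position Verlet integration: every pair of items is joined by a spring whose rest length is the target distance. The step size, damping factor, step count, and function name are illustrative assumptions rather than the settings used in the demo.

```python
import math
import random

def spring_layout(target, steps=1000, dt=0.05, damping=0.9, seed=0):
    """Place n points in 2D so pairwise distances approach target[i][j]."""
    n = len(target)
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    prev = [p[:] for p in pos]  # previous positions; start at rest
    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                length = math.hypot(dx, dy) or 1e-9
                # Hooke's law: pull together if too far apart, push if too close.
                pull = (length - target[i][j]) / length
                force[i][0] += pull * dx
                force[i][1] += pull * dy
                force[j][0] -= pull * dx
                force[j][1] -= pull * dy
        for i in range(n):
            for axis in (0, 1):
                x = pos[i][axis]
                # Verlet step: velocity is implied by the last two positions.
                velocity = (x - prev[i][axis]) * damping
                prev[i][axis] = x
                pos[i][axis] = x + velocity + force[i][axis] * dt * dt
    return pos

# Three items whose target distances form a 3-4-5 triangle.
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]
layout = spring_layout(target)
```

Verlet integration needs no explicit velocity array, which keeps the update cheap enough for interactive use.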

Convert distances to probabilities

Idea from SNE to get more structure:

- Calculate z-score of each item

- Convert these scores to probabilities

- Average A to B and B to A probabilities

L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
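
The three steps above might be sketched as follows. The slides do not say how z-scores become probabilities, so the softmax over negated z-scores used here is an assumption loosely modelled on SNE, and the function names are illustrative.

```python
import math
import statistics

def conditional_probabilities(dist):
    """P[i][j]: probability that item i picks item j as a neighbour.

    Distances from i are turned into z-scores, then into probabilities
    with a softmax so that closer items get more weight.
    """
    n = len(dist)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [dist[i][j] for j in range(n) if j != i]
        mean = statistics.fmean(others)
        std = statistics.pstdev(others) or 1.0
        weights = {
            j: math.exp(-(dist[i][j] - mean) / std)
            for j in range(n) if j != i
        }
        total = sum(weights.values())
        for j, weight in weights.items():
            P[i][j] = weight / total
    return P

def symmetric_probabilities(dist):
    """Average the A-to-B and B-to-A probabilities."""
    P = conditional_probabilities(dist)
    n = len(P)
    return [[(P[i][j] + P[j][i]) / 2 for j in range(n)] for i in range(n)]

# Four items on a line at positions 0, 1, 5 and 6.
positions = [0, 1, 5, 6]
dist = [[abs(a - b) for b in positions] for a in positions]
P_cond = conditional_probabilities(dist)
P_sym = symmetric_probabilities(dist)
```

Normalising per item means each item spreads the same total probability over its neighbours, so the two tight pairs stand out even though their absolute distances are small compared with the gap between them.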

[Figure: cosine similarity vs. probability]

Future work

● Compare the algorithm to other existing algorithms (benchmarking)

● Improve visualisation (more simple NLP)

● More functionality (timeline, history)

● More metadata fields (publisher, subject, identifiers)

● Extend the implementation to other databases

Future work

● Identify applications, e.g.

o Author name disambiguation

o Matching chemical molecules

● Prepare user scenarios for usability testing

Thank you

[email protected]

[email protected]

http://thoth.pica.nl/relate (ArticleFirst)

http://thoth.pica.nl/astro/relate (Astrophysics articles)

http://thoth.pica.nl/demo/relate (WorldCat)
