Ariadne’s Thread
Exploring a world of networked information built from free-text metadata
Rob Koopman, Shenghui Wang
OCLC Research
12 March 2015
KNAW eHumanities group
It is difficult to answer these questions:
• What are the different aspects of this topic?
• Are there related aspects missing in my search terms?
• Who are the most prominent authors about this topic?
• Which journals publish most about this topic?
• How have others — e.g. librarians — described and
classified this topic?
Demo examples
• http://thoth.pica.nl/demo/relate
How do we do this?
● Offline: Build low-dimensional semantic
representation using Random Projection
● Online: Interactive exploration of networked
entities
Step 1: Build semantic representation
• Direct co-occurrence based
Does prekindergarten improve school preparation and performance?
[author:loeb susanna]
[issn:0272-7757]
• Indirect co-occurrence based (context based)
The effects of state prekindergarten programs on young children’s
school readiness in five states
[author:jung kwanghee]
[subject:readiness for school]
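The two kinds of co-occurrence can be illustrated with a minimal sketch built on the two records above. The word lists (a hand-made tokenisation without stop words) and the helper names `build_contexts` and `shared_context` are assumptions for illustration, not the authors' code.

```python
from collections import Counter, defaultdict

# Two toy records modelled on the slide; the tokenisation is an assumption.
records = [
    {"words": ["prekindergarten", "improve", "school", "preparation", "performance"],
     "entities": ["author:loeb susanna", "issn:0272-7757"]},
    {"words": ["effects", "state", "prekindergarten", "programs", "young",
               "children", "school", "readiness", "five", "states"],
     "entities": ["author:jung kwanghee", "subject:readiness for school"]},
]

def build_contexts(records):
    """Direct co-occurrence: count the words each entity shares a record with."""
    contexts = defaultdict(Counter)
    for rec in records:
        for entity in rec["entities"]:
            contexts[entity].update(rec["words"])
    return contexts

def shared_context(contexts, a, b):
    """Indirect co-occurrence: words that both entities co-occur with."""
    return sorted(set(contexts[a]) & set(contexts[b]))

contexts = build_contexts(records)
# The two authors never appear in the same record, but they share context words.
print(shared_context(contexts, "author:loeb susanna", "author:jung kwanghee"))
# → ['prekindergarten', 'school']
```

The two authors are never listed together, yet their context vectors overlap, which is what makes them related in the semantic representation.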
Dataset
● WorldCat, 300+ million records
● Selected 13 million items (topical terms,
authors, ISSNs, Dewey decimal codes,
publishers, subject headings)
● Represented by 6 million topical terms
But a matrix of 13M x 6M is too big to process
Dimension reduction based on Random Projection
C: a co-occurrence matrix
R: a random matrix with entries +1 or -1
C’ = C R: approximation of C
after random projection
(the semantic matrix)
Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: Interactive navigation in a world of networked information. In: CHI’15 Extended Abstracts.
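A small sketch of the random projection idea: multiplying a co-occurrence matrix by a random +/-1 matrix gives much shorter vectors whose cosine similarities stay close to the original ones. The sizes (600 terms, 256 columns), the toy rows and all function names are assumptions chosen for illustration.

```python
import math
import random

def random_sign_matrix(n_terms, n_columns, seed=0):
    """R: n_terms x n_columns matrix with entries drawn uniformly from {-1, +1}."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n_columns)] for _ in range(n_terms)]

def project(C, R):
    """Return C' = C R, computed row by row and skipping zero counts."""
    n_columns = len(R[0])
    projected = []
    for row in C:
        acc = [0] * n_columns
        for term, count in enumerate(row):
            if count:
                r = R[term]
                for d in range(n_columns):
                    acc[d] += count * r[d]
        projected.append(acc)
    return projected

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def indicator(start, stop, n_terms):
    """Toy co-occurrence row: count 1 for terms in [start, stop), else 0."""
    return [1 if start <= t < stop else 0 for t in range(n_terms)]

N_TERMS, N_COLUMNS = 600, 256  # hypothetical sizes
# Rows 0 and 1 share 80 of their 100 terms; row 2 shares none with row 0.
C = [indicator(0, 100, N_TERMS), indicator(20, 120, N_TERMS), indicator(300, 400, N_TERMS)]
R = random_sign_matrix(N_TERMS, N_COLUMNS)
C_prime = project(C, R)

# Original cosine next to the cosine after projection.
print(cosine(C[0], C[1]), cosine(C_prime[0], C_prime[1]))
print(cosine(C[0], C[2]), cosine(C_prime[0], C_prime[2]))
```

The projected similarities are estimates; their error shrinks as the number of columns grows.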
Scaling up
- We first select the items we want to build
vectors for
- Matrices C and R are never stored.
- We only store C’, a matrix of about 16 GB.
- We read metadata records sequentially and
update C’ for each “co-occurrence”.
- Cost is of order N_totalwords × N_columns.
- Can be done in parallel.
- Can use Hadoop.
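One way such a streaming build could look, as a sketch only. It assumes each term's row of R is regenerated on demand from a seed derived from the term (here a CRC32 checksum), so neither C nor R is ever held in memory. The seeding scheme, the width `N_COLUMNS = 64`, the toy records and the function names are all assumptions, not the authors' implementation.

```python
import random
import zlib

N_COLUMNS = 64  # hypothetical width of C'

def index_vector(term):
    """Regenerate the +/-1 row of R for `term`; the seed depends only on the term."""
    rng = random.Random(zlib.crc32(term.encode("utf-8")))
    return [rng.choice((-1, 1)) for _ in range(N_COLUMNS)]

def build(records, selected_items):
    """One sequential pass: add a term's index vector to C' per co-occurrence."""
    semantic = {}
    for words, entities in records:
        for item in entities:
            if item not in selected_items:
                continue
            row = semantic.setdefault(item, [0] * N_COLUMNS)
            for word in words:
                for d, sign in enumerate(index_vector(word)):
                    row[d] += sign  # N_COLUMNS additions per co-occurrence
    return semantic

def merge(a, b):
    """Sum two partial C' matrices, e.g. built by separate workers."""
    merged = {item: list(row) for item, row in a.items()}
    for item, row in b.items():
        target = merged.setdefault(item, [0] * N_COLUMNS)
        for d, value in enumerate(row):
            target[d] += value
    return merged

# Toy records: (title words, entities). Invented for illustration.
records = [
    (["prekindergarten", "school", "performance"], ["author:loeb susanna", "issn:0272-7757"]),
    (["prekindergarten", "school", "readiness"], ["author:jung kwanghee"]),
    (["school", "performance"], ["issn:0272-7757"]),
]
selected = {"author:loeb susanna", "author:jung kwanghee", "issn:0272-7757"}

full = build(records, selected)
in_parts = merge(build(records[:2], selected), build(records[2:], selected))
print(full == in_parts)  # True: the updates are plain sums, so order does not matter
```

Because every update is an addition, partial matrices built from disjoint slices of the records can be summed afterwards, which is what makes a parallel or map-reduce style build possible.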
Step 2: Interactive exploration
- Let user input a search term or select a term
- Calculate the top 500 most related
candidates
- Find mutually related items
- Convert distances to probabilities
- Project to 2D
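The candidate-retrieval step can be sketched as a nearest-neighbour lookup in the semantic matrix. The slides do not say which similarity measure is used; cosine similarity is assumed here, and the vectors and names below are made up.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_related(query, vectors, k=500):
    """Return the k (similarity, name) pairs most similar to `query`, best first."""
    q = vectors[query]
    scored = ((cosine(q, v), name) for name, v in vectors.items() if name != query)
    return heapq.nlargest(k, scored)

# Made-up 3-dimensional semantic vectors; real rows of C' are much longer.
vectors = {
    "query": [4, 1, 0],
    "near": [3, 1, 0],
    "mid": [2, 2, 0],
    "far": [0, 1, 5],
}
top = most_related("query", vectors, k=2)
print([name for _, name in top])  # → ['near', 'mid']
```

A linear scan like this touches every row; with millions of items a production system would need something faster, which the slides do not describe.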
Find mutually related items
- Basic assumption: my friends are the ones
who consider me as a friend too.
- For each candidate, calculate the mean and standard
deviation of its distances to all other candidates in the
top-500 list.
- Keep candidates with the highest z-scores for the
selected term.
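One possible reading of this filter, as a sketch: from each candidate's point of view, check whether the selected term is unusually close compared with the other candidates. The sign convention (mean minus distance, so that a higher z-score means closer) is an assumption, as are the function names and the toy points on a line.

```python
import statistics

def mutual_z_scores(selected, candidates, dist):
    """z-score of each candidate's distance to `selected`, measured against
    that candidate's distances to the other candidates (higher = closer)."""
    scores = {}
    for c in candidates:
        others = [dist(c, o) for o in candidates if o != c]
        mean = statistics.mean(others)
        sd = statistics.pstdev(others)
        scores[c] = (mean - dist(c, selected)) / sd if sd > 0 else 0.0
    return scores

def keep_mutual(selected, candidates, dist, n_keep):
    """Keep the n_keep candidates that rate `selected` as closest, best first."""
    scores = mutual_z_scores(selected, candidates, dist)
    return sorted(candidates, key=scores.get, reverse=True)[:n_keep]

# Toy example: items are points on a line, distance is the absolute difference.
position = {"s": 0, "a": 1, "b": -1, "c": 2, "f": 10}
dist = lambda x, y: abs(position[x] - position[y])

candidates = ["a", "b", "c", "f"]
scores = mutual_z_scores("s", candidates, dist)
print(keep_mutual("s", candidates, dist, n_keep=3))  # → ['b', 'a', 'c']
```

Here `f` is dropped: the selected term is not closer to it than its other neighbours are, so the "friendship" is not mutual.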
Convert distances to probabilities
Idea from SNE to get more structure:
- Calculate z-score of each item
- Convert these scores to probabilities
- Average A to B and B to A probabilities
L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of
Machine Learning Research 9(Nov):2579-2605, 2008.
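A sketch of how the three steps could fit together. For each item the distances to the others are z-scored, the negated z-scores are passed through a softmax to get conditional probabilities, and the two directions are averaged. The softmax is an assumption: SNE itself uses a Gaussian kernel on distances, and the slides do not give the exact conversion.

```python
import math
import statistics

def conditional_probabilities(dist_row):
    """dist_row maps each other item to its distance; returns p(other | item)."""
    values = list(dist_row.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against identical distances
    # Smaller distance -> lower z-score -> larger weight.
    weights = {j: math.exp(-(d - mean) / sd) for j, d in dist_row.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def symmetric_probabilities(dist):
    """dist[i][j] is the distance from i to j; average A to B and B to A."""
    cond = {i: conditional_probabilities(row) for i, row in dist.items()}
    return {(i, j): (cond[i][j] + cond[j][i]) / 2 for i in dist for j in dist[i]}

# Toy distances between three items at positions 0, 1 and 5 on a line.
dist = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 4},
    "c": {"a": 5, "b": 4},
}
p = symmetric_probabilities(dist)
print(p[("a", "b")] > p[("a", "c")])  # True: a and b are the close pair
```

Averaging the two directions makes the relation symmetric, so a pair only gets a high probability when each item rates the other as close.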
Future work
● Compare the algorithm to other existing
algorithms - benchmarking
● Improve visualisation (more simple NLP)
● More functionality (timeline, history)
● More metadata fields (publisher, subject,
identifiers)
● Extend the implementation to other
databases
● Identify applications, e.g.
o Author name disambiguation
o Matching chemical molecules
● Prepare user scenarios for usability testing
Thank you
http://thoth.pica.nl/relate (ArticleFirst)
http://thoth.pica.nl/astro/relate (Astrophysics articles)
http://thoth.pica.nl/demo/relate (WorldCat)