Our journey with semantic embedding

Rob Koopman, Shenghui Wang (OCLC)

Fourth Annual KnowEscape Conference, 22-24 Feb 2017

Agenda

● What is semantic embedding
● Applications:
○ Context explorer
○ Topic delineation
○ Information retrieval
○ Concept drift

An example by Stefan Evert: what’s the meaning of bardiwac?

• He handed her her glass of bardiwac.
• Beef dishes are made to complement the bardiwacs.
• Nigel staggered to his feet, face flushed from too much bardiwac.
• Malbec, one of the lesser-known bardiwac grapes, responds well to Australia’s sunshine.
• I dined on bread and cheese and this excellent bardiwac.
• The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

⇒ ‘bardiwac’ is a heavy red alcoholic beverage made from grapes

How can we calculate the similarity/relatedness?

● Discrete encoding does not help to automatically process the underlying semantics

● Statistical Semantics [furnas1983, weaver1955], based on the assumption that “a word is characterized by the company it keeps” [firth1957]

● Distributional Hypothesis [harris1954, sahlgren2008]: words that occur in similar contexts tend to have similar meanings.

Let’s embed words in a vector space

● Words are represented in a continuous vector space where semantically similar words are mapped to nearby points ('embedded near each other').

● A desirable property: cosine similarity
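A minimal sketch of how cosine similarity compares two word vectors. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
toy_vectors = {
    "bardiwac": [0.9, 0.8, 0.1],
    "wine":     [1.0, 0.7, 0.0],
    "bicycle":  [0.0, 0.1, 1.0],
}

sim_wine = cosine_similarity(toy_vectors["bardiwac"], toy_vectors["wine"])
sim_bike = cosine_similarity(toy_vectors["bardiwac"], toy_vectors["bicycle"])
print(f"bardiwac~wine: {sim_wine:.3f}, bardiwac~bicycle: {sim_bike:.3f}")
```

Because cosine similarity depends only on direction, not on vector length, frequent and rare words can be compared on the same scale.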

What can we do with the similarity?

● Context explorer
● Topic delineation
● Information retrieval
● Concept drift

Context explorer

● Context exploration

http://thoth.pica.nl/astro/relate?input=supernovae&type=7

Document clustering

Topic delineation based on clustering

● Generate vectors for entities
● Generate vectors for articles based on weighted average of entity vectors
● Use standard clustering methods to cluster articles
● In the end, this approach has proven to be remarkably compatible with methods based on citation networks.

Wang, S., & Koopman, R. (2017). Clustering articles based on semantic similarity. In J. Gläser, A. Scharnhorst, & W. Glänzel (Eds.), Same data—Different results? (pp. 234–556). Towards a comparative approach to the identification of thematic structures in science. Special Issue of Scientometrics
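The steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline from the cited paper: the entity vectors and weights are invented, and plain k-means (seeded with the first k points) stands in for "standard clustering methods".

```python
def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = weighted_average(members, [1.0] * len(members))
    return labels

# Hypothetical entity vectors and per-article entity weights.
entity_vectors = {
    "supernova": [1.0, 0.0],
    "galaxy":    [0.9, 0.1],
    "diabetes":  [0.0, 1.0],
    "insulin":   [0.1, 0.9],
}
articles = [
    {"supernova": 2.0, "galaxy": 1.0},
    {"diabetes": 1.0, "insulin": 3.0},
    {"galaxy": 2.0, "supernova": 1.0},
    {"insulin": 1.0, "diabetes": 2.0},
]

article_vectors = [
    weighted_average([entity_vectors[e] for e in article], list(article.values()))
    for article in articles
]
labels = kmeans(article_vectors, k=2)
print(labels)
```

Any clustering method that works on dense vectors could replace `kmeans` here; the article vectors are the only input it needs.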

Information Retrieval

1. 2014 glycated nail proteins a new approach for detecting diabetes in developing countries

2. 2015 glycation of nail proteins from basic biochemical findings to a representative marker for diabetic glycation associated target organ damage

3. 2005 glycation products as markers and predictors of the progression of diabetic complications

4. 2015 glycated nail proteins as a new biomarker in management of the south kivu congolese diabetics

5. 2005 advanced glycosylation end products in skin serum saliva and urine and its association with complications of patients with type 2 diabetes mellitus

6. 1993 review of diabetes identification of markers for early detection glycemic control and monitoring clinical complications

7. 2012 glycation and biomarkers of vascular complications of diabetes

8. 2005 the nail under fungal siege in patients with type ii diabetes mellitus

9. 2003 improvement in quality of diabetes control and concentrations of age products in patients with type 1 and insulin treated type 2 diabetes mellitus studied over a period of 10 years jevin

10. 2005 a novel advanced glycation index and its association with diabetes and microangiopathy

Now let’s evaluate and compare

Word embedding techniques

Two main categories of approaches:

● global co-occurrence count-based methods, such as Latent Semantic Analysis and Random Projection: these suffer in word analogy tasks

● local context predictive methods, such as neural probabilistic language models: these do not leverage the global statistics


● Ariadne (OCLC): based on Random Projection of the global co-occurrence matrix

● Word2Vec (Google): shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words

● GloVe (Stanford): a global log-bilinear regression model to learn word vectors based on the ratio of the co-occurrence probabilities of two words
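To make the count-based idea concrete, here is a minimal sketch of Random Projection applied to a co-occurrence matrix. It is not Ariadne's actual implementation: the window size, the dimension, the +1/-1 random vectors and all function names are assumptions for illustration.

```python
import random
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

def random_projection(counts, dim=8, seed=0):
    """Project each sparse co-occurrence row onto `dim` random directions."""
    rng = random.Random(seed)
    vocab = sorted(set(counts) | {c for ctx in counts.values() for c in ctx})
    # One fixed random +1/-1 vector per context word (rows of the projection matrix).
    basis = {w: [rng.choice((-1.0, 1.0)) for _ in range(dim)] for w in vocab}
    embeddings = {}
    for word, ctx in counts.items():
        vec = [0.0] * dim
        for context_word, n in ctx.items():
            for k in range(dim):
                vec[k] += n * basis[context_word][k]
        embeddings[word] = vec
    return embeddings

sentences = [["red", "wine", "glass"], ["red", "bardiwac", "glass"]]
counts = cooccurrence_counts(sentences, window=1)
embeddings = random_projection(counts, dim=8, seed=0)
# "wine" and "bardiwac" share exactly the same contexts, so their vectors coincide.
print(embeddings["wine"] == embeddings["bardiwac"])
```

The full co-occurrence matrix never has to be stored densely: each word vector is just a running sum of the random vectors of its context words.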

https://en.wikipedia.org/wiki/Neural_network

Different models lead to different embeddings

knee

Word2Vec: ankle, hip, elbow, knees, shoulder, patellofemoral, joint, wrist, tka, patellar
GloVe: ankle, hip, joint, knees, arthroplasty, osteoarthritis, elbow, flexion, cruciate, joints
Ariadne: knees, knee joint, contralateral knee, tibiofemoral, knee pain, knee motion, medial compartment, lateral compartment, operated knees, right knee

frog

Word2Vec: toad, bullfrog, amphibian, rana, turtle, salamander, caudiverbera, frogs, leptodactylid, pleurodema
GloVe: rana, toad, amphibian, bullfrog, frogs, temporaria, laevis, xenopus, anuran, catesbeiana
Ariadne: frogs, isolated frog, frog muscle, rana pipiens, anurans, hyla, anuran, tree frog, anuran species, hylid

Word analogy evaluation

Which word is the most similar to Italy in the same sense as Paris is similar to France?

X = vector("Paris") - vector("France") + vector("Italy")

Method     Accuracy (%)   Runtime (seconds)   #Threads
Word2Vec   61.4           32,432              16
GloVe      53.6           22,680              16
Ariadne    1.6            15,020              1
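The analogy query (Paris minus France plus Italy) can be sketched as a nearest-neighbour search around the combined vector. The toy vectors below are invented so that the arithmetic works out exactly. Excluding the three query words from the candidates is a common convention, assumed here rather than taken from the evaluation reported above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def analogy(vectors, a, b, c):
    """Word nearest to vector(a) - vector(b) + vector(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Invented toy space: dimensions loosely mean (France, Italy, capital city).
toy = {
    "France": [1.0, 0.0, 0.0],
    "Paris":  [1.0, 0.0, 1.0],
    "Italy":  [0.0, 1.0, 0.0],
    "Rome":   [0.0, 1.0, 1.0],
    "pasta":  [0.0, 0.9, 0.1],
}

answer = analogy(toy, "Paris", "France", "Italy")
print(answer)
```

Accuracy on an analogy benchmark is then the share of questions for which the returned word matches the expected one.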

Information retrieval evaluation

Use case: evidence-based medical guideline

Statement: There are no indications to suggest that a skin-sparing mastectomy followed by immediate reconstruction leads to a higher risk of local or systemic recurrence of breast cancer.

Old references (PMID): 9142378, 1985335

New references (PMID): 9142378, 9694613, 18210199

From word embedding to document distance

● Doc2Vec: an extension of Word2Vec that learns to correlate documents and words, rather than words with other words

● Ariadne: weighted average of word vectors
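A minimal sketch of the weighted-average route from word vectors to document ranking (Doc2Vec is not sketched here). The word vectors are invented, and the IDF-style weights are an assumption: the slides do not say which weighting is used.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def idf_weights(docs):
    """Assumed weighting: rarer words count more."""
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(1 + len(docs) / df[w]) for w in df}

def doc_vector(tokens, word_vectors, weights):
    """Weighted average of the vectors of known words; unknown words are skipped."""
    dim = len(next(iter(word_vectors.values())))
    known = [t for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    total = sum(weights.get(t, 1.0) for t in known)
    return [sum(weights.get(t, 1.0) * word_vectors[t][i] for t in known) / total
            for i in range(dim)]

def rank(query, docs, word_vectors):
    """Indices of docs, most similar to the query first."""
    weights = idf_weights(docs)
    q = doc_vector(query, word_vectors, weights)
    scored = [(cosine(q, doc_vector(d, word_vectors, weights)), i)
              for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)]

# Invented 2-dimensional word vectors and two tiny "documents".
word_vectors = {
    "mastectomy": [1.0, 0.0], "reconstruction": [0.8, 0.2],
    "hepatitis": [0.0, 1.0], "liver": [0.1, 0.9], "risk": [0.5, 0.5],
}
docs = [["hepatitis", "liver", "risk"], ["mastectomy", "reconstruction", "risk"]]
ranking = rank(["mastectomy", "recurrence", "risk"], docs, word_vectors)
print(ranking)
```

A guideline statement can be used as the query, and the top-ranked articles become candidate references.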

A tiny gold set

● 29 statements (16 breast cancer, 4 hepatitis C, 4 lung cancer, 5 ovarian cancer)

● 103 (96 unique) source articles, 156 (145 unique) target articles, in total 180 unique articles

● 66 articles are in both source and target lists, so the baseline total recall is 42.3% (the average baseline recall is 45.8%)

● These articles were published between 1984 and 2012.
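The baseline figures can be checked with a few lines of arithmetic, assuming that "source" means the old references, "target" means the new references, and the baseline simply proposes the old references again. The per-statement example uses the PMIDs from the use case above; the second pair of reference lists is hypothetical and only shows why total and average recall differ.

```python
def recall(retrieved, relevant):
    """Share of relevant items that were retrieved."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """Share of retrieved items that are relevant."""
    retrieved = set(retrieved)
    return len(retrieved & set(relevant)) / len(retrieved) if retrieved else 0.0

def total_recall(pairs):
    """Pooled over all statements: total hits / total relevant."""
    hits = sum(len(set(r) & set(t)) for r, t in pairs)
    return hits / sum(len(set(t)) for _, t in pairs)

def average_recall(pairs):
    """Mean of the per-statement recalls."""
    return sum(recall(r, t) for r, t in pairs) / len(pairs)

# The statement from the use case: old references reused as the prediction.
old_refs = [9142378, 1985335]
new_refs = [9142378, 9694613, 18210199]
statement_recall = recall(old_refs, new_refs)        # 1 of 3 new references
statement_precision = precision(old_refs, new_refs)  # 1 of 2 old references

# Gold-set totals from the slide: 66 shared articles, 156 target articles.
baseline_total_recall = 66 / 156
print(f"{baseline_total_recall:.1%}")

# Hypothetical second statement (IDs 1 and 1 are placeholders).
pairs = [(old_refs, new_refs), ([1], [1])]
print(total_recall(pairs), average_recall(pairs))
```

Total recall weights every target article equally, while average recall weights every statement equally, which is why the slide reports two different baseline numbers.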

Average recall

Average precision

Concept drift

Now let’s talk about concept drift

● 20 million Medline articles published since 1977
● 1.5 million entities (subjects, authors, journals, words)
● 8 five-year periods
● Each subject is embedded in 8 chronological vector spaces
● Is there concept drift and can we detect it?

Jaccard similarity based on important subjects
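One plausible reading, sketched here as an assumption: for each subject, take its most related subjects in two periods and compute the Jaccard similarity of the two sets. The example uses the 2002-2007 and 2007-2012 neighbour lists for “trauma nervous system” that appear later in this deck.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (a convention)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Most related subjects of "trauma nervous system" in two periods (from the slides).
neighbours_2002_2007 = [
    "peripheral nervous system diseases", "peripheral nerve injuries",
    "neurologic examination", "male", "recovery of function",
    "peripheral nerves", "elbow", "comorbidity", "mother child relations",
]
neighbours_2007_2012 = [
    "peripheral nerve injuries", "sciatic neuropathy", "papilledema",
    "sciatic nerve", "peripheral nerves", "nerve crush", "neuroma",
    "nerve regeneration", "acute disease",
]

overlap = jaccard(neighbours_2002_2007, neighbours_2007_2012)
print(overlap)  # 2 shared subjects out of 16 distinct ones
```

A subject whose neighbour sets keep a high Jaccard similarity from period to period would count as stable; a low value suggests drift.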

Most and least stable subjects

Most stable subjects: history 15th century, history 18th century, history 17th century, history 16th century, history 19th century, thymoma, history ancient, history medieval, rabies, history

Least stable subjects: diagnostic techniques surgical, chromium isotopes, shock surgical, iodine isotopes, diagnostic techniques and procedures, blood circulation time, trauma nervous system, cesium isotopes, liver extracts, macroglobulins

Subjects most related to “trauma nervous system”

1977-1982

anatomy regional, fracture fixation internal, bulgaria, piedra, surgery plastic, germany west, wound infection, carbuncle, burns

1982-1987

legionellosis, povidone, tropocollagen, attention deficit disorder with hyperactivity, legionnaires disease, transfer psychology

1987-1992

leg injuries, neurosurgical procedures, arm injuries, wound infection, orthopedic equipment, dermatomycoses, multiple trauma, candidiasis cutaneous, fractures closed

1992-1997

piperacillin, tazobactam, microbiology, diagnostic errors, sorption detoxification, arthroplasty, hsp40 heat shock proteins, emaciation, professional patient relations

1997-2002

defensive medicine, insurance liability, diagnostic errors, expert testimony, birth injuries, maleic anhydrides, dimethyl sulfate, medical errors, p protein hepatitis b virus

2002-2007

peripheral nervous system diseases, peripheral nerve injuries, neurologic examination, male, recovery of function, peripheral nerves, elbow, comorbidity, mother child relations

2007-2012

peripheral nerve injuries, sciatic neuropathy, papilledema, sciatic nerve, peripheral nerves, nerve crush, neuroma, nerve regeneration, acute disease

2012-2017

mitochondrial dynamics, dental records, park7 protein human, persistent vegetative state, dnm1l protein human, platelet derived growth factor bb, dual specificity phosphatases, lingual nerve injuries, dental care


Global drift based on Self Organising Maps

- Create document vectors
- Put the documents in a self organizing map
- For each point in the map count the documents in a year range
- Make sub maps for each year range
- Now color code lower than expected as blue and higher than expected in red
- The result shows global drift

Note that this shows how the content of the Medline database drifts over time, not necessarily how science drifts over time.
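The counting and colour-coding steps can be sketched as below. The sketch assumes each document has already been assigned to a map cell by a trained self organizing map, and it defines "expected" as the count under independence of cell and period (cell total times period total, divided by the number of documents). The slides do not specify either choice.

```python
from collections import Counter

def drift_colors(assignments):
    """assignments: one (cell, period) pair per document.

    Returns a colour per (cell, period): "red" if the cell holds more
    documents in that period than expected, "blue" if fewer.
    """
    n = len(assignments)
    cell_totals = Counter(cell for cell, _ in assignments)
    period_totals = Counter(period for _, period in assignments)
    observed = Counter(assignments)
    colors = {}
    for period in period_totals:
        for cell in cell_totals:
            # Assumed definition of "expected": independence of cell and period.
            expected = cell_totals[cell] * period_totals[period] / n
            count = observed[(cell, period)]
            if count > expected:
                colors[(cell, period)] = "red"
            elif count < expected:
                colors[(cell, period)] = "blue"
            else:
                colors[(cell, period)] = "neutral"
    return colors

# Hypothetical assignments on a map with two cells and two periods.
assignments = (
    [((0, 0), "1977-1982")] * 3 + [((0, 0), "2012-2017")] * 1 +
    [((0, 1), "1977-1982")] * 1 + [((0, 1), "2012-2017")] * 3
)
colors = drift_colors(assignments)
print(colors[((0, 0), "1977-1982")], colors[((0, 0), "2012-2017")])
```

Each period's colours form one sub map; viewing the sub maps in chronological order shows where content accumulates or thins out.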

Summary

● Semantic indexing enables operations directly on the underlying semantics

● It helps to explore the context of a subject, cluster and retrieve related documents, and study drift

● Different methods have their own limitations

● The choice depends on the application
