Date post: 21-Dec-2015
SCCH is an initiative of SCCH is located in Analysis of word co-occurrence in human literature for supporting semantic correspondence discovery Jorge Martinez-Gil, Mario Pichler Software Competence Center Hagenberg (Austria)
Page 1: SCCH is an initiative of SCCH is located in Analysis of word co-occurrence in human literature for supporting semantic correspondence discovery Jorge Martinez-Gil,

SCCH is an initiative of SCCH is located in

Analysis of word co-occurrence in human literature for supporting semantic correspondence discovery

Jorge Martinez-Gil, Mario Pichler

Software Competence Center Hagenberg (Austria)

Who are we?

Application-oriented research institution

Founded in July 1999 by some institutes of the Johannes Kepler University Linz

Johannes Kepler University as a strong partner

Form of enterprise: GmbH, non-profit

~65 employees (including partners: 100)

5.6 million Euro turnover

Located at the Hagenberg Software Park

Since 1.1.2008 COMET competence center

2© Software Competence Center Hagenberg GmbH

Index

Introduction

State-of-the-art

Word co-occurrence in human literature

Evaluation

Fields of application

Conclusions

Introduction

Semantic similarity measurement aims to determine the likeness between two text expressions that use different lexicographies for representing the same real object or idea

Source: http://eil.stanford.edu

Introduction (ii)

In the past, great effort has gone into finding new semantic similarity measures, since semantic similarity measurement is of fundamental importance in many application-oriented fields:

automatic processing of text and email messages

healthcare dialogue systems

natural language querying of databases

question answering

sentence fusion

...

Introduction (iii)

Similarity vs Relatedness

Semantic similarity states the taxonomic proximity between terms or text expressions. For example, automobile and car are similar because both are means of transport.

Semantic relatedness considers taxonomic and relational proximity. For example, nurse and hospital are related because both belong to the world of health.

Introduction (iv)

Culturomics is a field of study which consists of collecting and analyzing large amounts of data for the study of human culture.

To do that, it is necessary to use a corpus of digitized texts representing the digested history of human literature. The rationale behind this idea is that an analysis of this corpus can enable people to investigate cultural trends quantitatively.

Introduction (v)

In this work:

We propose to study the co-occurrence of words in human literature (Culturomics) to determine the semantic similarity between words.

We evaluate our proposal on the word pairs included in the Miller & Charles benchmark data set. (This is the de facto standard used by researchers.)

State-of-the-art

Existing semantic similarity measures:

Edge-counting measures which are based on the computation of the number of taxonomical links separating two concepts represented in a given dictionary

Feature-based measures which try to estimate the amount of common and non-common taxonomical information retrieved from dictionaries.

Information theoretic measures which try to determine similarity between concepts as a function of what concepts have in common in a given ontology.

Distributional measures which use text corpora as source. They look for word co-occurrences in the Web or large document collections using search engines.

Our solution is a distributional measure, since we use a text corpus as the source (perhaps the largest text corpus ever assembled).
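For contrast with the distributional approach, the edge-counting family can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny taxonomy `PARENT` and the function `edge_count` are invented here and do not come from any real dictionary or from the authors' work.

```python
from collections import deque

# Toy is-a taxonomy (child -> parent), invented for illustration only.
PARENT = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "vehicle": "artifact",
    "hammer": "tool",
    "tool": "artifact",
}

def edge_count(a, b, parent=PARENT):
    """Number of taxonomic links on the shortest path between a and b.

    Returns None when either term is missing from the taxonomy.
    """
    # Build an undirected adjacency map from the child -> parent links.
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)
    if a not in neighbours or b not in neighbours:
        return None
    # Breadth-first search yields the shortest path length in links.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(edge_count("car", "bicycle"))  # path: car - vehicle - bicycle
print(edge_count("car", "kotlin"))   # term not covered by the taxonomy
```

The second call returns `None`: a dictionary-based measure has nothing to say about a term the dictionary does not contain, which is the gap a corpus-based measure tries to close.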

Word co-occurrence in human literature

Our contribution is based on the idea of exploring culturomics, that is, the application of quantitative analysis to the study of human culture, to determine the semantic similarity between terms.

Word co-occurrence in human literature (ii)

Why? According to the book library digitized by Google, the number of words in the English lexicon is currently above a million. Therefore, the data sets we are using contain more words than appear in any dictionary.

Word co-occurrence in human literature (iii)

Word co-occurrence in human literature (iv)

The method that we propose consists of measuring how often two terms appear in the same text sentence.

Studying the co-occurrence of terms in a text corpus has commonly been used as evidence of semantic similarity in the scientific literature.

We calculate the joint probability that a text expression contains the two terms together over time.
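The counting step can be illustrated on raw text. The following is a minimal sketch, assuming a naive sentence splitter and lowercase query terms; the helper names (`split_sentences`, `tokens`, `joint_probability`) and the sample text are hypothetical and are not taken from the authors' implementation, which works on precomputed corpus data rather than raw sentences.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokens(sentence):
    """Set of lowercased alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def joint_probability(text, term_a, term_b):
    """Fraction of sentences containing both terms (terms given in lowercase)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    both = sum(1 for s in sentences if {term_a, term_b} <= tokens(s))
    return both / len(sentences)

# Toy text, invented for illustration.
sample = (
    "The nurse walked into the hospital. "
    "The hospital was quiet. "
    "A car is an automobile. "
    "The nurse drove her car home."
)
print(joint_probability(sample, "nurse", "hospital"))  # 1 of 4 sentences
```

Repeating this count separately for the texts of each time period would give the "over time" view mentioned above.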

Word co-occurrence in human literature (v)

This formula computes a similarity score that indicates whether two terms appear together in the same text expressions in each time unit.

Due to the way data are stored, the minimum time unit that can be considered is a year.

The result from this similarity measure can be easily interpreted since the range of possible values is bounded by 0 (no similarity at all) and 1 (totally similar).

Moreover, the output value stating the degree of semantic similarity can be fuzzified when a high level of detail is not needed.
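The formula itself is not reproduced in this transcript, so the sketch below is only one possible reading of the description: the score is taken to be the share of time units in which the two terms co-occur at least once, which is bounded by 0 and 1 as stated. The input format, the function names and the fuzzification thresholds are all assumptions made for illustration.

```python
def cooccurrence_score(yearly_counts, start, end, unit=1):
    """Share of time units in [start, end) in which the two terms co-occur.

    yearly_counts maps a year to the number of text expressions containing
    both terms in that year (hypothetical input format).
    """
    unit_starts = range(start, end, unit)
    if len(unit_starts) == 0:
        return 0.0
    hits = 0
    for first_year in unit_starts:
        years = range(first_year, min(first_year + unit, end))
        if any(yearly_counts.get(y, 0) > 0 for y in years):
            hits += 1
    return hits / len(unit_starts)

def fuzzify(score):
    """Map a score in [0, 1] to a coarse label (thresholds are assumptions)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < 0.25:
        return "not similar"
    if score < 0.5:
        return "slightly similar"
    if score < 0.75:
        return "moderately similar"
    return "very similar"

# Toy counts, invented for illustration.
counts = {1900: 3, 1901: 0, 1905: 1, 1912: 7}
score = cooccurrence_score(counts, 1900, 1920, unit=5)
print(score, fuzzify(score))
```

With `unit=1` the function works at the yearly granularity mentioned above; a larger `unit` groups several years into one time unit.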

Evaluation

We report our results using the data set offered by Google.

The data used were extracted from the English-language texts published between 1900 and 2000.

Results are obtained according to the Miller-Charles benchmark data set.

The rationale behind this way of evaluating quality is that each result obtained by means of artificial techniques can be compared to human judgments.

The goal is to replicate human behavior when solving tasks related to semantic similarity without supervision.

Evaluation (ii)

Results have been obtained by using our method for the range 1900-2000 using 5 years as a time unit. The overall fitness we have obtained by measuring the correlation between human judgment and our approach is 0.458.
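The slide does not say which correlation coefficient was used. Assuming the Pearson coefficient, which is a common choice for this kind of benchmark comparison, the fitness computation can be sketched as follows. The rating lists are placeholders and are not values from the benchmark or from the authors' experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)

# Placeholder values, one entry per word pair (not benchmark data).
human_ratings = [3.9, 3.0, 1.5, 0.1]
method_scores = [0.9, 0.8, 0.2, 0.1]
print(round(pearson(human_ratings, method_scores), 3))
```

A value close to 1 means the method ranks and spaces the word pairs much as the human raters do; the reported 0.458 indicates a moderate positive agreement.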

Evaluation (iii)

We repeated our experiment with a modification: the numerical values were fuzzified.

We got 23/30 hits, which means we achieved 76.67% accuracy. The results are now better than in the previous experiment.
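The accuracy figure follows directly from the hit count. The sketch below assumes that a "hit" means the fuzzified label produced by the method equals the fuzzified label derived from the human rating; the slide does not spell out this definition, and the function name `hit_rate` and the example labels are hypothetical.

```python
def hit_rate(predicted_labels, human_labels):
    """Return (hits, accuracy) for two equally long lists of fuzzy labels."""
    if len(predicted_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p == h for p, h in zip(predicted_labels, human_labels))
    return hits, hits / len(human_labels)

# Toy labels, invented for illustration.
predicted = ["very similar", "not similar", "slightly similar"]
human = ["very similar", "not similar", "moderately similar"]
print(hit_rate(predicted, human))

# Arithmetic check of the figure reported on the slide.
print(f"{23 / 30:.2%}")  # 76.67%
```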

Fields of application

This technique can be useful for supporting a number of tasks that currently have to be done manually. One example is the field of Human Resources Management (automatically matching job offers and applicant profiles).

Due to the dynamism of the job market, job offers and applicant profiles contain terms that are not usually covered by dictionaries (new programming languages, software tools...).

If we are able to find algorithms for discovering semantic similarity on the basis of large book libraries, then we do not need to use dictionaries for supporting the process.

Conclusions

We have described how we have benefited from a new paradigm called culturomics for automatically determining the degree of semantic similarity between words.

We have shown that appropriately studying the co-occurrence of words throughout human literature can provide very accurate results when measuring semantic similarity between these words.

An advantage of this technique over traditional ones is that it can be applied to more than 600,000 single-word forms on which dictionary-based techniques cannot work.

End

Doubts? Suggestions?

Thank you for your attention!
