Latent Semantic Analysis Auro Tripathy [email protected]
Page 1: Latent Semantic Analysis, Auro Tripathy

Latent Semantic Analysis

Auro Tripathy

[email protected]

Page 2

Outline

Introduction

Singular Value Decomposition

Dimensionality Reduction

LSA in Information Retrieval

Page 3

Latent Semantic Analysis

Introduction

Page 4

Mathematical treatment capable of inferring meaning

Measures of word-word, word-passage, & passage-passage relations that correlate well with human understanding of semantic similarity

Similarity estimates are NOT based on contiguity frequencies, co-occurrence counts, or usage correlations

A mathematical method capable of inferring deeper, hidden relationships; hence “latent”

Page 5

Akin to a well-read nun dispensing sex advice

Analysis of text alone

Its knowledge does NOT come from perceived information about the physical world, NOT from instinct, NOT from feelings, NOT from emotions

Does NOT take into account word order, phrases, syntactic relationships, or logic

It takes in large amounts of text and looks for mutual interdependencies in the text

Page 6

Words and Passages

LSA represents the meaning of a word as the average of the meaning of all the passages in which it appears…

…and the meaning of the passage as an average of the meaning of the words it contains

(Diagram: word1, word2, word3)
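The averaging idea can be sketched in a few lines (the 2-D word vectors below are made-up illustrative values, not actual LSA output):

```python
import numpy as np

# Hypothetical 2-D reduced-space vectors for three words
# (made-up values, purely for illustration)
word_vecs = {
    "human":     np.array([0.9, 0.1]),
    "interface": np.array([0.8, 0.3]),
    "computer":  np.array([0.7, 0.2]),
}

# The meaning of a passage as the average of its word vectors
passage = ["human", "computer", "interface"]
passage_vec = np.mean([word_vecs[w] for w in passage], axis=0)
print(passage_vec)  # -> [0.8 0.2]
```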

Page 7

What is LSA?

LSA is a mathematical technique for extracting and inferring relations of expected contextual usage of words in documents

Page 8

What LSA is not

Not a natural language processing program

Not an artificial intelligence program

Does NOT use dictionaries or databases

Does NOT use syntactic parsers

Does not use morphologies

Takes as input – words and text paragraphs

Page 9

Example

Titles of N=9 technical memoranda

Five on human-computer interaction

Four on mathematical graph theory

The two topics are disjoint

Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

Page 10

Sample Word-by-Document Matrix

Word selection criterion: the word must occur in at least two of the titles

Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

How much was said about a topic

Page 11

Semantic Similarity Using the Spearman Rank Correlation Coefficient

The correlation between human and user is negative, -0.38

The correlation between human and minor is also negative, -0.29

This is expected: the two words never appear in the same passage, so there are no co-occurrences

http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient

Spearman ρ (human.user) = -0.38

Spearman ρ (human.minor) = -0.29
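Spearman's ρ ranks the two count vectors and then computes a Pearson correlation on the ranks. A small self-contained sketch (the 9-element occurrence vectors are hypothetical stand-ins for two word rows, not the paper's actual data):

```python
import numpy as np

def rank_with_ties(x):
    """Rank values 1..n, averaging the ranks of tied values."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        # extend j over the run of equal values
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the ranks."""
    return float(np.corrcoef(rank_with_ties(x), rank_with_ties(y))[0, 1])

# Hypothetical occurrence counts over 9 documents: the two words never
# appear in the same document, so the correlation comes out negative.
human = [1, 0, 0, 1, 0, 0, 0, 0, 0]
user  = [0, 1, 1, 0, 0, 0, 0, 0, 0]
print(round(spearman(human, user), 2))  # -> -0.29
```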

Page 12

Singular Value Decomposition

Page 13

The Term Space

(Figure: Terms × Documents matrix; terms as row vectors)

Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß

Page 14

The Document Space

(Figure: Terms × Documents matrix; documents as column vectors)

Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß

Page 15

The Semantic Space: one space for terms and documents

Represent terms AND documents in one space

Makes it possible to calculate similarities

Between documents

Between terms

Between terms and documents

Page 16

The Decomposition

Splits the term-document matrix into three matrices, creating a new space (the SVD space), because SVD finds new axes along which the terms and documents can be grouped.

M = T S D^T

where M is the term-by-document matrix (t x d), T is t x r, S is r x r, and D^T is r x d.
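A minimal numpy sketch of the decomposition (the 4 x 3 count matrix is made up for illustration; `np.linalg.svd` produces the T, S, D^T factors):

```python
import numpy as np

# Hypothetical term-by-document count matrix M (t = 4 terms, d = 3 documents)
M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

# Thin SVD: M = T @ S @ Dt
T, sigma, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(sigma)

# Singular values come out sorted: sigma_1 >= sigma_2 >= ... >= 0
print(np.all(np.diff(sigma) <= 0), sigma.min() >= 0)  # -> True True

# The product of the three factors reconstructs M exactly
print(np.allclose(T @ S @ Dt, M))                     # -> True
```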

Page 17

New Term Vector, New Document Vector, & Singular Values

T contains in its rows the term vectors scaled to a new basis

D^T contains the new document vectors

S contains the singular values σ1, σ2, …, σn

where σ1 ≥ σ2 ≥ … ≥ σn ≥ 0

Page 18

Dimensionality Reduction

To reveal the latent semantic structure

Page 19

Reduce to k Dimensions

Keep only the k largest singular values:

M ≈ T_k S_k D_k^T

where M is the term-by-document matrix (t x d), T_k is t x k, S_k is k x k, and D_k^T is k x d.
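Truncation can be sketched by keeping only the first k columns/rows of the factors (hypothetical matrix, k = 2). By the Eckart-Young theorem, the Frobenius error of the rank-k approximation equals the energy in the dropped singular values:

```python
import numpy as np

# Hypothetical term-by-document matrix (made up for illustration)
M = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

T, sigma, Dt = np.linalg.svd(M, full_matrices=False)

k = 2  # keep only the two largest singular values
M_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]

# Eckart-Young: the rank-k SVD is the best rank-k approximation, and its
# Frobenius error is the sqrt of the sum of the squared dropped singular values
err = np.linalg.norm(M - M_k, "fro")
print(np.isclose(err, np.sqrt(np.sum(sigma[k:] ** 2))))  # -> True
```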

Page 20

Example: Term Vectors Reduced to Two Dimensions

(Figure: the T, S, and D matrices truncated to k = 2)

Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

Page 21

Reconstruction of the original matrix from the reduced dimensions

(Figure: the NEW reconstructed matrix compared with the Original)

Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

Page 22

Recomputed Semantic Similarity Using the Spearman Rank Correlation Coefficient

Original:

Spearman ρ (human, user) = -0.38

Spearman ρ (human, minor) = -0.29

NEW:

Spearman ρ (human, user) = +0.94

Spearman ρ (human, minor) = -0.83

The human-user correlation went up and the human-minor correlation went down

Page 23

Correlation between a title and all other titles – Raw Data

• Correlation between the human-computer interaction titles was low: the average correlation was 0.2, and half of the Spearman correlations were 0

• Correlation between the four graph-theory papers (mx/my) was mixed; the average Spearman correlation was 0.44

• Correlation between the human-computer interaction titles and the graph-theory papers was -0.3, despite no semantic overlap

Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

Page 24

Correlation in the reduced dimension (k=2) space

• Average correlations jumped from 0.2 to 0.92

• Correlation between the graph-theory papers (mx/my) was high: 1.0

• Correlation between the human-computer interaction titles and the graph-theory papers was strongly negative

Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

Page 25

LSA in Information Retrieval

Page 26

How to treat a query

Build the term-by-document matrix

Perform SVD and reduce the dimensions to 50-400

Treat a query as a “pseudo-document”: the weighted average of the vectors of the words it contains

Use a similarity metric (such as cosine) between the query vector and each document vector

Rank the results
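The steps above can be sketched with the standard LSI fold-in, which maps a query's raw term vector q into the reduced space as q_k = q T_k S_k^-1 and scores documents by cosine similarity against the reduced document vectors (tiny made-up matrix, k = 2; the query itself is also an assumption for illustration):

```python
import numpy as np

# Hypothetical term-by-document matrix (5 terms x 4 documents)
M = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 1, 0]], dtype=float)

T, sigma, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
T_k, S_k, D_k = T[:, :k], np.diag(sigma[:k]), Dt[:k, :].T  # D_k: docs x k

# Fold the query (raw term counts) into the reduced space: q_k = q @ T_k @ S_k^-1
q = np.array([1, 0, 0, 0, 1], dtype=float)  # query uses terms 0 and 4
q_k = q @ T_k @ np.linalg.inv(S_k)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score every document against the query, then rank the results
scores = [cosine(q_k, d) for d in D_k]
ranking = np.argsort(scores)[::-1]
print(ranking[0])  # index of the best-matching document
```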

Page 27

The Query Vector

Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß

Does better than literal matching between the terms in the query and the documents

Superior when the query and the document use different words

Page 28

References

• Latent Semantic Indexing and Information Retrieval, Johanna Geiß

• An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham

