1
Unsupervised Methods
2
Association Measures
• Association between items: assoc(x,y)
– term-term, term-document, term-category, …
• Simple measure: freq(x,y), log(freq(x,y))+1
• Based on contingency table
3
Mutual Information
• The term corresponding to the pair x,y in the Mutual Information of X,Y (pointwise mutual information):
• Disadvantage: the MI value is inflated for low freq(x,y)
• Examples: results for two NLP articles
$MI(x,y) = \log\frac{P(x,y)}{P(x)\,P(y)} = \log\frac{P(y|x)}{P(y)}$
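A minimal sketch (hypothetical toy corpus, whitespace tokenization; not from the original slides) of estimating pointwise mutual information between co-occurring words from corpus counts:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Estimate MI(x,y) = log P(x,y) / (P(x)P(y)) from co-occurrence counts."""
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        word_counts.update(tokens)
        # count each unordered pair of distinct tokens co-occurring in the sentence
        pair_counts.update(frozenset(p) for p in combinations(set(tokens), 2))
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c_xy / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        # note the disadvantage from the slide: low freq(x,y) inflates this value
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# hypothetical toy corpus
sents = ["the court passed a new law", "the new law limits insurance claims"]
for pair, score in sorted(pmi_scores(sents).items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 2))
```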
4
Log-Likelihood Ratio Test
• Compares the likelihood of the data under two competing hypotheses (Dunning, 1993)
• Does not depend heavily on assumptions of normality, can be applied to small samples
• Used to test whether p(x|y) = p(x|~y) = p(x), by comparing this hypothesis to the general case (inequality allowed)
• A high log-likelihood score indicates that the data is much less likely under the equality assumption
5
Log-Likelihood (cont.)
• Likelihood function: $H(p_1, p_2, \ldots;\ k_1, k_2, \ldots)$
• The likelihood ratio:
$\lambda = \frac{\max_{H_0} H(p_1, p_2, \ldots;\ k)}{\max_{H} H(p_1, p_2, \ldots;\ k)}$
• $-2\log\lambda$ is asymptotically $\chi^2$ distributed
• High $-2\log\lambda$: the data is less likely given $H_0$
6
Log-Likelihood for Bigrams
$p_1 = p(x|y) = \frac{k_1}{n_1} = \frac{a}{a+c}$
$p_2 = p(x|{\sim}y) = \frac{k_2}{n_2} = \frac{b}{b+d}$
$p = p(x) = \frac{k_1+k_2}{n_1+n_2} = \frac{a+b}{a+b+c+d}$
7
Log-Likelihood for Binomial
• Binomial likelihood: $H(p; n, k) = p^k (1-p)^{n-k}$
• $H(p_1, p_2;\ n_1, k_1, n_2, k_2) = H(p_1; n_1, k_1) \cdot H(p_2; n_2, k_2)$
• $\lambda = \frac{\max_{p} H(p, p;\ n_1, k_1, n_2, k_2)}{\max_{p_1, p_2} H(p_1, p_2;\ n_1, k_1, n_2, k_2)}$
• Maximum obtained for: $p_1 = \frac{k_1}{n_1}$;  $p_2 = \frac{k_2}{n_2}$;  $p = \frac{k_1+k_2}{n_1+n_2}$
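A small sketch of Dunning's statistic for one bigram, using the contingency cells a, b, c, d as defined above (the counts below are hypothetical):

```python
import math

def log_binom_likelihood(p, n, k):
    """log H(p; n, k) = k*log(p) + (n-k)*log(1-p), with 0*log(0) treated as 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1 - p)

def log_likelihood_ratio(a, b, c, d):
    """-2 log lambda for the contingency table
    a = freq(x,y), b = freq(x,~y), c = freq(~x,y), d = freq(~x,~y)."""
    k1, n1 = a, a + c          # x within y-contexts
    k2, n2 = b, b + d          # x within non-y-contexts
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled estimate under H0: p(x|y) = p(x|~y)
    log_lambda = (log_binom_likelihood(p, n1, k1) + log_binom_likelihood(p, n2, k2)
                  - log_binom_likelihood(p1, n1, k1) - log_binom_likelihood(p2, n2, k2))
    return -2 * log_lambda

# hypothetical counts for a bigram (x, y) in a small corpus
print(round(log_likelihood_ratio(a=20, b=180, c=80, d=9720), 2))
```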
8
Measuring Term Topicality
• For query relevance ranking: Inverse Document Frequency
• For term extraction:
– Frequency
– Frequency ratio for specialized vs. general corpus
– Entropy of term co-occurrence distribution
– Burstiness:
• Entropy of distribution (frequency) in documents
• Proportion of topical documents for term (freq > 1) within all documents containing the term (Katz, 1996)
9
Similarity Measures
• Cosine:
$sim(u,v) = \frac{\sum_{att} \log freq(u,att) \cdot \log freq(v,att)}{\sqrt{\sum_{att} \log^2 freq(u,att) \cdot \sum_{att} \log^2 freq(v,att)}}$
• Min/Max:
$sim(u,v) = \frac{\sum_{att} \min(I(u,att),\, I(v,att))}{\sum_{att} \max(I(u,att),\, I(v,att))}$
• KL to Average:
$A(u,v) = \sum_{att} \left[ P(att|u)\log\frac{2P(att|u)}{P(att|u)+P(att|v)} + P(att|v)\log\frac{2P(att|v)}{P(att|u)+P(att|v)} \right]$
10
A Unifying Schema of Similarity (with Erez Lotan)
• A general schema encoding most measures
• Identifies explicitly the important factors that determine (word) similarity
• Provides the basis for:
– a general and efficient similarity computation procedure
– evaluating and comparing alternative measures and components
11
Mapping to Unified Similarity Scheme
[Figure: example vectors for words u and v – raw counts count(u,att) are mapped by a function g to association values assoc(u,att); each vector gets a global weight W(u); SJ(u,v) = Σ_{att ∈ Both(u,v)} joint(assoc(u,att), assoc(v,att)); sim(u,v) = f(SJ(u,v), norm(SJ(u,v), W(u), W(v)))]
12
Association and Joint Association
• assoc(u,att): quantify association strength
– mutual information, weighted log frequency, conditional probability (orthogonal to scheme)
• joint(assoc(u,att), assoc(v,att)): quantify the “similarity” of the two associations
– ratio, difference, min, product
• $SJ(u,v) = \sum_{att \in Both(u,v)} joint(assoc(u,att),\, assoc(v,att))$
$Both(u,v) = \{att : freq(u,att) > 0,\ freq(v,att) > 0\}$
13
Normalization
• Global weight of a word vector:
$W(u) = \sum_{att \in Just(u)} g(assoc(u,att))$,  $Just(u) = \{att : freq(u,att) > 0\}$
– For cosine: $W(u) = \sum_{att} assoc(u,att)^2$
• Normalization factor:
$Norm\_Factor(u,v) = norm(SJ(u,v), W(u), W(v))$
– For cosine: $Norm\_Factor(u,v) = \sqrt{W(u) \cdot W(v)}$
14
The General Similarity Scheme
$sim(u,v) = f(SJ(u,v),\ Norm\_Factor(u,v))$
where
$SJ(u,v) = \sum_{att \in Both(u,v)} joint(assoc(u,att),\, assoc(v,att))$
$Norm\_Factor(u,v) = norm(SJ(u,v), W(u), W(v))$
– For example, cosine: $sim(u,v) = \frac{SJ(u,v)}{\sqrt{W(u) \cdot W(v)}}$
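To make the scheme concrete, here is a sketch (toy association vectors; the helper names sj and weight mirror the slide notation and are illustrative, not from the original implementation) showing how cosine and Min/Max arise as instances by plugging in different joint, W, and normalization choices:

```python
import math

def sj(u_vec, v_vec, joint):
    """SJ(u,v): sum joint(assoc(u,att), assoc(v,att)) over attributes shared by u and v."""
    both = set(u_vec) & set(v_vec)
    return sum(joint(u_vec[att], v_vec[att]) for att in both)

def weight(u_vec, g):
    """W(u): global weight of the vector, aggregated over its non-zero attributes."""
    return sum(g(a) for a in u_vec.values())

def sim_cosine(u_vec, v_vec):
    """Cosine instance: joint = product, W = sum of squares, f = SJ / sqrt(W(u)W(v))."""
    num = sj(u_vec, v_vec, joint=lambda a, b: a * b)
    norm_factor = math.sqrt(weight(u_vec, lambda a: a * a) * weight(v_vec, lambda a: a * a))
    return num / norm_factor if norm_factor else 0.0

def sim_min_max(u_vec, v_vec):
    """Min/Max instance: joint = min, normalization = sum of max over all attributes."""
    num = sj(u_vec, v_vec, joint=min)
    all_atts = set(u_vec) | set(v_vec)
    denom = sum(max(u_vec.get(att, 0.0), v_vec.get(att, 0.0)) for att in all_atts)
    return num / denom if denom else 0.0

# hypothetical association vectors (attribute -> assoc value)
u = {"print": 2.1, "open": 1.3, "erase": 0.7}
v = {"print": 1.8, "open": 0.9, "browse": 1.1}
print(round(sim_cosine(u, v), 3), round(sim_min_max(u, v), 3))
```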
15
Min/Max Measures
$sim(u,v) = \frac{\sum_{att} \min(assoc(u,att),\, assoc(v,att))}{\sum_{att} \max(assoc(u,att),\, assoc(v,att))}$
• May be viewed as:
$joint(assoc(u,att), assoc(v,att)) = \frac{\min(assoc(u,att),\, assoc(v,att))}{\max(assoc(u,att),\, assoc(v,att))} \cdot weight(att)$
$weight(att) = \max(assoc(u,att),\, assoc(v,att))$
16
Associations Used with Min/Max
• Log-frequency and Global Entropy Weight (Grefenstette, 1994):
$assoc(u,att) = \log(freq(u,att) + 1) \cdot Gew(att)$
$Gew(att) = 1 + \frac{1}{\log(nrels)} \sum_{v} P(v|att)\log P(v|att)\ \in [0,1]$
• Mutual information (Dagan et al., 1993/5):
$assoc(u,att) = \log\frac{P(u,att)}{P(u)\,P(att)} = \log\frac{P(att|u)}{P(att)} = \log\frac{P(u|att)}{P(u)}$
17
Cosine Measure
• Used for word similarity (Ruge, 1992) with: assoc(u,att)=ln(freq(u,att))
• Popular for document ranking (vector space)
$\cos(u,v) = \frac{\sum_{att} assoc(u,att) \cdot assoc(v,att)}{\sqrt{\sum_{att} assoc(u,att)^2 \cdot \sum_{att} assoc(v,att)^2}}$
For document ranking:
$assoc(doc,term) = tf \cdot idf$
$tf = \frac{freq(doc,term)}{\max_{term'} freq(doc,term')}$
$idf = \log\frac{\max_{term'} docfreq(term')}{docfreq(term)} + 1$
18
Methodological Benefits
Joint work with Erez Lotan (Dagan 2000 and in preparation)
• Uniform understanding of similarity measure structure
• Modular evaluation/comparison of measure components
• Modular implementation architecture, easy experimentation by “plugging” alternative measure combinations
19
Empirical Evaluation
• Thesaurus for query expansion (e.g. “insurance laws”):
Similar words for law:

Word          Similarity   Judgment
regulation    0.050242     +
rule          0.048414     +
legislation   0.038251     +
guideline     0.035041     +
commission    0.034499     -
bill          0.033414     +
budget        0.031043     -
regulator     0.031006     +
code          0.030998     +
circumstance  0.030534     -

• Precision and comparative Recall at each point in the list
20
Comparing Measure Combinations
[Figure: precision vs. recall curves for the compared measure combinations]
• Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20 points); stable over association measures
21
Effect of Co-occurrence Type on Semantic Similarity
22
Computational Benefits
• Complexity reduced by “sparseness” factor – #non-zero cells / total #cells; two orders of magnitude in corpus data
• Efficient implementation through sparse matrix indexing, by computing over common attributes only (Both(u,v))
[Figure: sparse word-by-attribute matrix (words × attributes att_1 … att_n) mapped to a word-by-word similarity results matrix]
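A sketch of this sparse computation (hypothetical data structures, not the original implementation): an inverted index over attributes lets SJ(u,v) be accumulated only for word pairs that actually share attributes.

```python
from collections import defaultdict

def pairwise_sj(vectors, joint=lambda a, b: min(a, b)):
    """Accumulate SJ(u,v) for all word pairs via an inverted index over attributes,
    so pairs with no common attribute are never considered."""
    # inverted index: attribute -> list of (word, assoc value)
    index = defaultdict(list)
    for word, vec in vectors.items():
        for att, val in vec.items():
            index[att].append((word, val))
    sj = defaultdict(float)
    for att, postings in index.items():
        for i, (u, a_u) in enumerate(postings):
            for v, a_v in postings[i + 1:]:
                sj[(u, v)] += joint(a_u, a_v)
    return sj

# hypothetical sparse association vectors
vectors = {
    "folder": {"print": 1.2, "open": 0.8},
    "file": {"print": 1.0, "open": 0.9, "save": 0.5},
    "suitcase": {"carry": 1.1},
}
print(dict(pairwise_sj(vectors)))
```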
23
General Scheme - Conclusions
• A general mathematical scheme
• Identifies the important factors for measuring similarity
• Efficient general procedure based on the scheme
• Empirical comparison of different measure components (measure structure and assoc)
• Successful application in an Internet crawler for thesaurus construction (small corpora)
24
Clustering Methods
• Input: A set of objects (words, documents)
• Output: A set of clusters (sets of elements)
• Based on a criterion for the quality of a class, which guides cluster split/merge/modification:
– a distance function between objects/classes
– a global quality function
25
Clustering Types
• Soft / Hard
• Hierarchical / Flat
• Top-down / Bottom-up
• Predefined number of clusters or not
• Input:
– all point-to-point distances
– original vector representation for points, computing needed distances during clustering
26
Applications of Clustering
• Word clustering
– Constructing a hierarchical thesaurus
– Compactness and generalization in word cooccurrence modeling (will be discussed later)
• Document clustering
– Browsing of document collections and search query output
– Assistance in defining a set of supervised categories
27
Hierarchical Agglomerative Clustering Methods (HACM)
1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best scoring merge
4. Compute the merge score between the new cluster and all other clusters
5. If more than one cluster remains, return to 3
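A minimal sketch of this loop, assuming a pre-computed point-to-point distance function and single-link cluster distance as the merge score (plain Python, stopping at a requested number of clusters rather than one):

```python
def hacm(points, dist, num_clusters=1):
    """Hierarchical agglomerative clustering: repeatedly merge the two closest
    clusters (single link) until num_clusters remain."""
    clusters = [[p] for p in points]                  # 1. every point is a cluster

    def single_link(c1, c2):                          # merge score: distance of nearest points
        return min(dist(p, q) for p in c1 for q in c2)

    while len(clusters) > num_clusters:               # 5. loop while merges remain
        # 2./4. compute merge scores for all current cluster pairs
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]       # 3. perform the best scoring merge
        del clusters[j]
    return clusters

# hypothetical 1-D points with absolute difference as distance
pts = [1.0, 1.2, 5.0, 5.1, 9.0]
print(hacm(pts, dist=lambda a, b: abs(a - b), num_clusters=3))
```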
28
Types of Merge Score
• Minimal distance between the two candidates for the merge. Alternatives for cluster distance:
– Single link: distance between the two nearest points
– Complete link: distance between the two furthest points
– Group average: average pairwise distance over all points
– Centroid: distance between the two cluster centroids
• Based on the “quality” of the merged class:
– Ward’s method: minimal increase in total within-group sum of squares (average squared distance to centroid)
• Based on a global criterion (in Brown et al., 1992: minimal reduction in average mutual information)
29
Unsupervised Statistics and Generalizations for Classification
• Many supervised methods use cooccurrence statistics as features or probability estimates
– eat a {peach, beach}
– fire a missile vs. fire the prime minister
• Sparse data problem: if alternative cooccurrences never occurred, how to estimate their probabilities, or their relative “strength” as features?
30
Application: Semantic Disambiguation
• Traditional AI-style approach: manually encoded semantic preferences/constraints
[Figure: semantic hierarchy – Weapon > Bombs > grenade and Actions > Cause_movement > throw, drop – linked by an <object – verb> preference]
Anaphora resolution (Dagan, Justeson, Lappin, Leass, Ribak 1995)
The terrorist pulled the grenade from his pocket and threw it at the policeman (it → ?)
31
Statistical Approach
Corpus (text collection)
<verb–object: throw-grenade> 20 times
<verb–object: throw-pocket> 1 time
“Semantic” Judgment
• Semantic confidence combined with syntactic preferences: it → grenade
• “Language modeling” for disambiguation
32
What about sense disambiguation? (for translation)
I bought soap bars / I bought window bars – in each, bar is ambiguous between sense1 (‘chafisa’) and sense2 (‘sorag’)
Corpus (text collection):
Sense1: <noun-noun: soap-bar> 20 times, <noun-noun: chocolate-bar> 15 times
Sense2: <noun-noun: window-bar> 17 times, <noun-noun: iron-bar> 22 times
• “Hidden” senses – supervised labeling required?
33
Solution: Mapping to Another Language
English(-English)-Hebrew Dictionary:
bar1 → ‘chafisa’   bar2 → ‘sorag’   soap → ‘sabon’   window → ‘chalon’
Map ambiguous constructs to the second language (all possibilities) and count in a Hebrew corpus:
<noun-noun: soap-bar>    1: <noun-noun: ‘chafisat-sabon’> 20 times   2: <noun-noun: ‘sorag-sabon’> 0 times
<noun-noun: window-bar>  1: <noun-noun: ‘chafisat-chalon’> 0 times   2: <noun-noun: ‘sorag-chalon’> 15 times
• Exploiting the difference in ambiguities between the two languages
• Principle – intersecting redundancies (Dagan and Itai 1994)
34
Selection Model Highlights
• Multinomial model, under certain linguistic assumptions
• Selection “confidence” – lower bound (at confidence level 1-α) for the odds ratio:
$\ln\frac{p_i}{p_j} \geq \ln\frac{n_i}{n_j} - Z_{1-\alpha}\sqrt{\frac{1}{n_i} + \frac{1}{n_j}} = Conf(i)$
• Overlapping ambiguous constructs are resolved through constraint propagation, by decreasing confidence order.
• Results (Hebrew→English): Coverage ~70%, Precision within coverage ~90%
– ~20% improvement over choosing the most frequent translation (the common baseline)
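A small numeric illustration of the confidence bound above (hypothetical counts; Z_{1-α} = 1.645 corresponds to a one-sided 95% level):

```python
import math

def selection_confidence(n_i, n_j, z=1.645):
    """Conf(i): lower bound for ln(p_i/p_j), estimated from the counts n_i, n_j
    of the two alternative translations of an ambiguous construct."""
    return math.log(n_i / n_j) - z * math.sqrt(1.0 / n_i + 1.0 / n_j)

# hypothetical counts: alternative i seen 20 times, alternative j 1 time
print(round(selection_confidence(20, 1), 2))
```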
35
Data Sparseness and Similarity
<verb–object: ‘hidpis-tikiya’> – ?
<verb–object: print-folder> 0 times
<verb–object: print-file_cabinet> 0 times
• Standard approach: “back-off” to single term frequency
• Similarity-based inference:
[Figure: words similar to folder (file, directory, record, …) and to file_cabinet (cupboard, closet, suitcase, …), each paired with print in the <verb-object> relation]
36
Computing Distributional Similarity
[Figure: folder and a similar word file, sharing context words such as print, erase, open, retrieve, browse, save, …]
• Association between word u (“folder”) and its “attributes” (context words/features) is based on mutual information:
$I(u,att) = \log_2\frac{P(att,u)}{P(att)\,P(u)} = \log_2\frac{P(att|u)}{P(att)}$
• Similarity between u and v (weighted Jaccard, [0,1]):
$sim(u,v) = \frac{\sum_{att} \min(I(u,att),\, I(v,att))}{\sum_{att} \max(I(u,att),\, I(v,att))}$
37
Disambiguation Algorithm
Selection of preferred alternative:
• Hypothesized similarity-based frequency derived from average association for similar words (incorporating single term frequency)
• Comparing hypothesized frequencies
[Figure: same similarity diagram as above – print in the <verb-object> relation with the similar words of folder and file_cabinet]
38
Computation and Evaluation
• Heuristic search used to speed up computation of the k most similar words
• Results (Hebrew→English):
– 15% coverage increase, while decreasing precision by 2%
– Accuracy 15% better than back-off to single-word frequency
(Dagan, Marcus and Markovitch 1995)
39
Probabilistic Framework - Smoothing
• Counts are obtained from a sample of the probability space
• Maximum Likelihood Estimate proportional to sample counts:
– MLE gives 0 probability to unobserved events
• Smoothing discounts observed events, leaving probability “mass” to unobserved events:
– discounted estimate for observed events
– positive estimate for unobserved events
40
Smoothing Conditional Attribute Probability
• Good-Turing smoothing scheme – discount & redistribute:
$P(att|u) = P_d(att|u)$  if $count(att,u) > 0$
$norm(u) = 1 - \sum_{att:\, count(att,u)>0} P_d(att|u)$
• Katz’s seminal back-off scheme (speech language modeling):
$P(att|u) = norm(u) \cdot P(att)$  if $count(att,u) = 0$
• Similarity-based smoothing (Dagan, Lee, Pereira 1999):
$P(att|u) = norm(u) \cdot P_{SIM}(att|u)$  if $count(att,u) = 0$
where
$P_{SIM}(att|u) = \sum_{u' \in SIM(u)} \frac{f(u,u')}{\sum_{u'' \in SIM(u)} f(u,u'')}\, P(att|u')$
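A schematic sketch of the similarity-based estimate (hypothetical discounted probabilities and similarity weights; the discounting step itself is assumed given):

```python
def smoothed_prob(att, u, p_discounted, p_sim_weights, norm_u):
    """P(att|u): discounted estimate if (att,u) was observed; otherwise probability
    mass norm_u redistributed according to the similar words u' of u."""
    if (att, u) in p_discounted:                      # count(att,u) > 0
        return p_discounted[(att, u)]
    total_w = sum(p_sim_weights[u].values())
    p_sim = sum(w / total_w * p_discounted.get((att, u_sim), 0.0)
                for u_sim, w in p_sim_weights[u].items())
    return norm_u * p_sim                             # count(att,u) = 0

# hypothetical discounted estimates P_d(att|u) and similarity weights f(u,u')
p_d = {("print", "file"): 0.30, ("open", "file"): 0.25,
       ("print", "directory"): 0.20, ("open", "folder"): 0.28}
sim_w = {"folder": {"file": 2.0, "directory": 1.0}}
norm_folder = 0.4   # leftover probability mass for unseen attributes of "folder"
print(round(smoothed_prob("print", "folder", p_d, sim_w, norm_folder), 3))
```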
41
Similarity/Distance Functions for Probability Distributions
• L1 norm:
$L_1(u,v) = \sum_{att} |P(att|u) - P(att|v)|$
$f_{SIM,L_1}(u,v) = 2 - L_1(u,v)$
• Jensen-Shannon divergence (KL-distance to the average):
$A(u,v) = \frac{1}{2}\left[ D_{KL}\!\left(P_u \,\Big\|\, \frac{P_u + P_v}{2}\right) + D_{KL}\!\left(P_v \,\Big\|\, \frac{P_u + P_v}{2}\right) \right]$
Information loss by approximating u and v by their average
$f_{SIM,A}(u,v) = \exp(-\beta A(u,v))$
β controls the relative influence of close vs. remote neighbors
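A sketch of the Jensen-Shannon based neighbor weight (toy attribute distributions; the β value is illustrative):

```python
import math

def kl(p, q):
    """D_KL(p || q) over attributes with p(att) > 0."""
    return sum(pi * math.log(pi / q[att]) for att, pi in p.items() if pi > 0)

def js_divergence(p_u, p_v):
    """A(u,v): average KL-distance of P_u and P_v to their average distribution."""
    atts = set(p_u) | set(p_v)
    avg = {att: 0.5 * (p_u.get(att, 0.0) + p_v.get(att, 0.0)) for att in atts}
    return 0.5 * (kl(p_u, avg) + kl(p_v, avg))

def sim_weight(p_u, p_v, beta=5.0):
    """f_SIM,A(u,v) = exp(-beta * A(u,v)); beta sets how fast remote neighbors are discounted."""
    return math.exp(-beta * js_divergence(p_u, p_v))

# hypothetical attribute distributions P(att|u)
p_folder = {"print": 0.5, "open": 0.3, "erase": 0.2}
p_file = {"print": 0.4, "open": 0.4, "save": 0.2}
print(round(sim_weight(p_folder, p_file), 3))
```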
42
Sample Results
• Several smoothing experiments (A performed best):
– Language modeling for speech (hunt bears / pears)
– Perplexity (predicting test corpus likelihood)
– Data recovery task (similar to sense disambiguation)
– Insensitive to exact value of β
• Most similar words to “guy”:
Measure   Closest Words
A         guy kid thing lot man mother doctor friend boy son
L         guy kid lot thing man doctor girl rest son bit
PC        role people fire guy man year lot today way part

Typical common verb contexts: see get give tell take …
PC: an earlier attempt for similarity-based smoothing
43
Class-Based Generalization
• Obtain a cooccurrence-based clustering of words and model a word cooccurrence by word-class or class-class cooccurrence
• Brown et al., CL 1992: Mutual information clustering; class-based model interpolated to n-gram model
• Pereira, Tishby, Lee, ACL 1993: soft, top-down distributional clustering for bigram modeling
• Similarity/class-based methods: general effectiveness yet to be shown
44
Conclusions• (Relatively) simple models cover a wide
range of applications
• Usefulness in (hybrid) systems: automatic processing and knowledge acquisition
45
Discussion