Mark Sanderson, University of Sheffield
University of Sheffield; CIIR, University of Massachusetts
Deriving concept hierarchies from text
Mark Sanderson, Bruce Croft
The question is...
• What paper already presented at this SIGIR is most like the one you’re about to see?
• We’ll have the answer, right after this!
Concept hierarchies from documents?
• Hierarchy of concepts, Yahoo
  - General down to specific
  - Child under one or more parents
  - No training data
• Why?
  - Understandable
Current methods
• Polythetic clustering
Terms: Battery, California, Technology, Mile, State
D1: X X X X
D2: X X X X X
D3: X X X
D4: X X
An alternative?
• Monothetic clustering
  - Clusters based on a single feature
  - More ‘Yahoo/Dewey decimal’ like?
  - Easier to understand?
    » Preferable to users?
• What about hierarchies of clusters?
Terms: Battery, California, Technology, Mile, State
D1: X X X X
D2: X X X X X
D3: X X X
D4:
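A minimal sketch of what "clusters based on a single feature" could look like in practice: every term defines one cluster, made up of the documents that contain it. The document-term assignments below are invented for illustration and are not the ones from the slide's table; the function name `monothetic_clusters` is likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical document-term incidence, for illustration only.
DOCS = {
    "D1": {"battery", "california", "technology", "mile"},
    "D2": {"battery", "california", "technology", "mile", "state"},
    "D3": {"california", "mile", "state"},
    "D4": {"battery", "technology"},
}


def monothetic_clusters(docs):
    """Map each term to the sorted list of documents that contain it."""
    clusters = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            clusters[term].append(doc_id)
    return dict(clusters)


clusters = monothetic_clusters(DOCS)
for term in sorted(clusters):
    print(term, clusters[term])
```

Membership in each cluster is explained by one term, which is what makes such clusters easy to label; a document can belong to several clusters at once.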
How to arrange cluster terms?
• Existing techniques
  - WordNet
    » earthquake, volcano (eruption?)
  - Key phrases (Hearst 1998)
    » “such as”, “especially”
  - Phrase classification (Grefenstette 1997)
    » NP head or modifier: “types of research” from “research things”
  - Hierarchical phrase analysis (Woods 1997)
    » Head modifier again, “car washing” under “washing”, not “car”
WordNet (aside)
• 1 sense of earthquake, sense 1
  - earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
    » geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth)
    » natural phenomenon, nature -- (all non-artificial phenomena)
    » phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)
WordNet (aside)
• 5 senses of eruption, sense 1
  - volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material)
    » discharge -- (the sudden giving off of energy)
    » happening, occurrence, natural event -- (an event that happens)
    » event -- (something that happens at a given place and time)
Start with something simpler?
• Term clustering?
  - simple monothetic clusters
  - No ordering.
Use subsumption
• Initially using subsumption.
  - Finds related terms
  - Decides which is more general, which is more specific (idf?)
• Strict interpretation
  - X s Y (X subsumes Y) iff P(x|y) = 1, P(y|x) < 1
• In practice
  - X s Y iff P(x|y) > 0.8, P(y|x) < 1
  - P(x|y) > 0.8, P(y|x) < P(x|y)
[Diagram; labels: xy, x, y]
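The subsumption test can be sketched over sets of document identifiers. This is an illustrative reading of the last "in practice" rule on the slide (P(x|y) > 0.8 and P(y|x) < P(x|y)), with the probabilities estimated from document co-occurrence. The function names, the example terms, and the postings are assumptions made for the sketch.

```python
def cond_prob(docs_with_a, docs_with_b):
    """Estimate P(a|b) as the fraction of b's documents that also contain a."""
    if not docs_with_b:
        return 0.0
    return len(docs_with_a & docs_with_b) / len(docs_with_b)


def subsumes(docs_x, docs_y, threshold=0.8):
    """True if x subsumes y: P(x|y) > threshold and P(y|x) < P(x|y)."""
    p_x_given_y = cond_prob(docs_x, docs_y)
    p_y_given_x = cond_prob(docs_y, docs_x)
    return p_x_given_y > threshold and p_y_given_x < p_x_given_y


# Hypothetical postings: term -> set of ids of documents containing it.
POSTINGS = {
    "polio": set(range(1, 11)),      # documents 1..10
    "vaccine": {1, 2, 3, 4, 5, 11},  # 5 of its 6 documents also contain "polio"
    "salk": {1, 2},
}

for x in POSTINGS:
    for y in POSTINGS:
        if x != y and subsumes(POSTINGS[x], POSTINGS[y]):
            print(f"{x} s {y}")
```

With these postings the more frequent term ends up as the parent in every reported pair, which matches the intuition that the subsuming term is the more general one.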
How to build a “hierarchy”
• X s Y
• X s Z
• X s M
• X s N
• Y s Z
• A s B
• A s Z
• B s Z
[Diagram of the resulting graph; nodes: X, Y, Z, M, N, A, B]
really it’s a DAG
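One way to turn the pairs listed above into a graph is sketched here. It assumes that a parent-child link is dropped when the child is already reachable through another child of the same parent (so X s Z is implied by X s Y and Y s Z); the slide does not state this step explicitly, and the helper names are hypothetical.

```python
from collections import defaultdict

# The subsumption pairs listed on the slide, as (parent, child).
PAIRS = [
    ("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
    ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z"),
]


def build_children(pairs):
    """Map each parent to the set of its direct children."""
    children = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)
    return children


def reachable(children, start):
    """All nodes reachable from start by following one or more edges."""
    seen, stack = set(), list(children.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def reduce_edges(pairs):
    """Drop parent->child edges already implied by a longer path (assumed step)."""
    children = build_children(pairs)
    kept = []
    for parent, child in pairs:
        implied = any(
            child in reachable(children, mid)
            for mid in children[parent]
            if mid != child
        )
        if not implied:
            kept.append((parent, child))
    return kept


DAG_EDGES = reduce_edges(PAIRS)
parents_of_z = sorted(p for p, c in DAG_EDGES if c == "Z")
print(DAG_EDGES)
print("Parents of Z:", parents_of_z)
```

Z keeps two parents (Y and B) after the reduction, which is why the structure is a DAG and not a tree.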
How to display it?
• DAGs were big
  - Unlikely to get all on screen
  - Only want to see current focus plus route taken to get there?
• Use a method users are familiar with
  - Hierarchical menus
[Diagram of the menus; nodes in order: X, Y, Z, M, N, A, B, Z]
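A sketch of how a DAG might be flattened into hierarchical menus: a node with several parents is simply repeated under each of them. The edge set below is an assumed reduced form of the X/Y/Z/M/N/A/B example, and the function names are hypothetical.

```python
# Assumed reduced edges of the example graph, as parent -> ordered children.
CHILDREN = {
    "X": ["Y", "M", "N"],
    "Y": ["Z"],
    "A": ["B"],
    "B": ["Z"],
}


def roots(children):
    """Nodes that never appear as a child of another node."""
    all_children = {c for kids in children.values() for c in kids}
    return [node for node in children if node not in all_children]


def menu_lines(children, node, depth=0):
    """Render node and its descendants as indented menu entries."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(menu_lines(children, child, depth + 1))
    return lines


def render_menus(children):
    """Render one menu per root, duplicating shared descendants."""
    lines = []
    for root in roots(children):
        lines.extend(menu_lines(children, root))
    return lines


MENU = render_menus(CHILDREN)
print("\n".join(MENU))
```

Z is listed once under Y and once under B, so a user reaches the same entry by either route without needing to see the whole graph.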
What about ambiguity?
• Monothetic clusters of ambiguous terms?
• Derive hierarchy from retrieved documents
  - Take a query and retrieve on it,
  - take top 500 documents,
  - build hierarchy from them.
• Topics/concepts are words/phrases taken from
  - Query
  - Retrieved documents
  - Comparison of frequencies
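The slide does not say how frequencies are compared, so the following is only one plausible sketch: keep a term when its document frequency in the retrieved set is a large enough share of its document frequency in the whole collection. The ratio, the `min_ratio` value, the function names, and the toy data are all assumptions.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def select_terms(retrieved, collection, min_ratio=0.5):
    """Keep terms whose retrieved-set frequency is at least min_ratio of
    their collection frequency. Assumes retrieved is a subset of collection."""
    df_retrieved = document_frequency(retrieved)
    df_collection = document_frequency(collection)
    return sorted(
        term
        for term, count in df_retrieved.items()
        if count / df_collection[term] >= min_ratio
    )


# Toy collection; the first two documents stand in for the retrieved set.
COLLECTION = [
    ["polio", "vaccine", "the"],
    ["polio", "the"],
    ["the", "weather"],
    ["the", "football"],
    ["vaccine", "the"],
]
RETRIEVED = COLLECTION[:2]

SELECTED = select_terms(RETRIEVED, COLLECTION)
print(SELECTED)
```

Terms that are common everywhere fall below the threshold, while terms concentrated in the retrieved documents are kept as candidate concepts.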
Poliomyelitis and Post-Polio (TREC topic 302)
[Sequence of 21 slides sharing this title; slide images not preserved]
Did you guess the paper?
• Bit like Peter Anick’s work?
Experiment
• Test properties of hierarchy
• Does it mimic (in some way) Yahoo-like categories?
  - Parent related to child?
  - Parent more general than child?
Experimental set-up
• Gathered eight subjects
  - Presented subsumption categories and ‘random’ categories.
  - Asked if each parent/child pair is ‘interesting’.
    » If yes, then what type of relationship is it (types taken roughly from WordNet)?
    » Aspect of
    » Type of
    » Same as
    » Opposite of
    » Don’t know
Results
• Question of parent/child pairing ‘interesting’ or not
  - Random, 51%
  - Subsumption, 67%
  - Difference significant from t-test, p<0.002
• If interesting, what is parent/child type?
Odd?
Yahoo categories?
Results and conclusions
• Interesting AND (aspect of OR type of)
  - Random, 28% (51% * (47% + 8%))
  - Subsumption, 48% (67% * (49% + 23%))
• Appears that subsumption and an ordering based on document frequency do a reasonable job.
  - For work on term frequency, see:
    » Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21
    » Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
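The combined percentages quoted on this slide can be checked with a short calculation. The sketch assumes that the two bracketed shares are given in the same order as the labels (aspect of, then type of); the function and parameter names are illustrative.

```python
def combined_rate(interesting, aspect_of, type_of):
    """Share of pairs judged interesting AND (aspect of OR type of)."""
    return interesting * (aspect_of + type_of)


random_rate = combined_rate(0.51, 0.47, 0.08)
subsumption_rate = combined_rate(0.67, 0.49, 0.23)

print(f"Random: {random_rate:.0%}")
print(f"Subsumption: {subsumption_rate:.0%}")
```

Both products round to the figures on the slide, 28% for random pairs and 48% for subsumption pairs.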
Future work?
• More user studies.
• Incorporate other term relationship techniques
• Other visualisations
• Application of techniques to whole document collections.
• Presentation of Cross Language IR results?