Post on 19-Dec-2015

Mark Sanderson, University of Sheffield

University of Sheffield; CIIR, University of Massachusetts

Deriving concept hierarchies from text

Mark Sanderson, Bruce Croft

The question is...

• What paper already presented at this SIGIR is most like the one you’re about to see?
• We’ll have the answer, right after this!

Concept hierarchies from documents?

• Hierarchy of concepts, Yahoo
  • General down to specific
  • Child under one or more parents
  • No training data
• Why?
  • Understandable

Current methods

• Polythetic clustering

Terms: Battery, California, Technology, Mile, State
D1: X X X X
D2: X X X X X
D3: X X X
D4: X X
[Term-by-document table; the column position of each mark is not preserved.]

An alternative?

• Monothetic clustering
  • Clusters based on a single feature
  • More ‘Yahoo/Dewey decimal’ like?
  • Easier to understand?
    » Preferable to users?
  • What about hierarchies of clusters?

Terms: Battery, California, Technology, Mile, State
D1: X X X X
D2: X X X X X
D3: X X X
D4: (no marks)
[Term-by-document table; the column position of each mark is not preserved.]
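The monothetic idea can be sketched in a few lines: every term defines one cluster, and a document belongs to that cluster simply because it contains the term. The incidence data below is hypothetical (the slide's table lost its column alignment), and the function name `monothetic_clusters` is an illustrative choice, not something from the talk.

```python
from collections import defaultdict

# Hypothetical incidence data: which terms occur in which documents.
docs = {
    "D1": {"battery", "california", "technology", "mile"},
    "D2": {"battery", "california", "technology", "mile", "state"},
    "D3": {"battery", "technology", "state"},
    "D4": set(),
}

def monothetic_clusters(docs):
    """Return one cluster per term: the set of documents containing that term."""
    clusters = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            clusters[term].add(doc_id)
    return dict(clusters)

clusters = monothetic_clusters(docs)
for term in sorted(clusters):
    # Membership is decided by a single feature (the term itself).
    print(term, sorted(clusters[term]))
```

Because membership depends on one feature only, each cluster can be labelled by its term, which is what makes the result easy to read. A document with no terms (D4 here) falls into no cluster.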

How to arrange cluster terms?

• Existing techniques
  • WordNet
    » earthquake, volcano (eruption?)
  • Key phrases (Hearst 1998)
    » “such as”, “especially”
  • Phrase classification (Grefenstette 1997)
    » NP head or modifier: “types of research” from “research things”
  • Hierarchical phrase analysis (Woods 1997)
    » Head modifier again: “car washing” under “washing”, not “car”

WordNet (aside)

• 1 sense of earthquake, sense 1
  • earthquake, quake, temblor, seism -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
    » geological phenomenon -- (a natural phenomenon involving the structure or composition of the earth)
    » natural phenomenon, nature -- (all non-artificial phenomena)
    » phenomenon -- (any state or process known through the senses rather than by intuition or reasoning)

WordNet (aside)

• 5 senses of eruption, sense 1
  • volcanic eruption, eruption -- (the sudden occurrence of a violent discharge of steam and volcanic material)
    » discharge -- (the sudden giving off of energy)
    » happening, occurrence, natural event -- (an event that happens)
    » event -- (something that happens at a given place and time)

Start with something simpler?

• Term clustering?
  • Simple monothetic clusters
  • No ordering.

Use subsumption

• Initially using subsumption.
  • Finds related terms
  • Decides which is more general, which is more specific (idf?)
• Strict interpretation
  • X s Y (X subsumes Y) iff P(x|y) = 1, P(y|x) < 1
• In practice
  • X s Y iff P(x|y) > 0.8, P(y|x) < 1
  • P(x|y) > 0.8, P(y|x) < P(x|y)

[Diagram: sets of documents containing x and y, with regions labelled x, xy and y.]
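The subsumption test on this slide can be sketched directly from document sets, estimating P(x|y) as the fraction of documents containing y that also contain x. This is a minimal sketch, assuming each document is reduced to a set of terms; the helper names `doc_sets` and `subsumes` and the toy documents are illustrative assumptions. It implements the "in practice" rule P(x|y) > 0.8, P(y|x) < 1; the variant in the last bullet would replace the second test with P(y|x) < P(x|y).

```python
def doc_sets(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index

def subsumes(x, y, index, threshold=0.8):
    """True if x subsumes y: P(x|y) > threshold and P(y|x) < 1."""
    dx, dy = index.get(x, set()), index.get(y, set())
    if x == y or not dx or not dy:
        return False
    both = len(dx & dy)
    p_x_given_y = both / len(dy)  # share of y's documents that also contain x
    p_y_given_x = both / len(dx)  # share of x's documents that also contain y
    return p_x_given_y > threshold and p_y_given_x < 1

# Hypothetical toy collection: document id -> set of terms.
docs = {
    1: {"polio", "vaccine"},
    2: {"polio", "vaccine"},
    3: {"polio", "symptom"},
    4: {"polio"},
    5: {"symptom"},
}
index = doc_sets(docs)
print(subsumes("polio", "vaccine", index))  # True: every vaccine document mentions polio
print(subsumes("vaccine", "polio", index))  # False: only half the polio documents mention vaccine
print(subsumes("polio", "symptom", index))  # False: symptom also occurs without polio
```

Setting `threshold` just below 1 and changing `>` to `>=` would approximate the strict interpretation; the 0.8 default follows the slide.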

How to build a “hierarchy”

• X s Y
• X s Z
• X s M
• X s N
• Y s Z
• A s B
• A s Z
• B s Z

[Figure: graph built from the pairs above, over the nodes X, Y, Z, M, N, A and B.]

really it’s a DAG
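One plausible way to turn the subsumption pairs on this slide into a graph is to treat each pair as a parent-to-child edge and then drop any edge that is already implied by a longer path (for example X s Z, given X s Y and Y s Z). The slide does not state this pruning rule, so it is an assumption here, as is the function name `build_dag`; the sketch also assumes the pairs contain no cycles.

```python
def build_dag(pairs):
    """Build {parent: sorted direct children} from (parent, child) pairs.

    An edge is dropped when the child is also reachable from the parent
    through some other path (a transitive reduction). Assumes no cycles.
    """
    children = {}
    for parent, child in pairs:
        children.setdefault(parent, set()).add(child)
        children.setdefault(child, set())

    def reachable(start, skip_edge):
        """Nodes reachable from start without using skip_edge."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in children[node]:
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                seen.add(nxt)
                stack.append(nxt)
        return seen

    reduced = {}
    for parent in children:
        direct = [c for c in children[parent]
                  if c not in reachable(parent, (parent, c))]
        reduced[parent] = sorted(direct)
    return reduced

# The pairs listed on the slide, read as (parent, child).
pairs = [("X", "Y"), ("X", "Z"), ("X", "M"), ("X", "N"),
         ("Y", "Z"), ("A", "B"), ("A", "Z"), ("B", "Z")]
dag = build_dag(pairs)
for node in sorted(dag):
    print(node, "->", dag[node])
```

With these pairs, Z ends up with two parents (Y and B), which is why the result is a DAG rather than a tree.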

How to display it?

• DAGs were big
  • Unlikely to get all on screen
  • Only want to see the current focus plus the route taken to get there?
• Use a method users are familiar with
  • Hierarchical menus

[Figure: hierarchical menu over the nodes X, Y, Z, M, N, A and B, with Z appearing twice.]
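Showing a DAG as hierarchical menus means a node with several parents is listed once under each of them. A minimal sketch, assuming the DAG is given as a hypothetical `{parent: [children]}` mapping over the slide's node names and that the menu is rendered as indented text:

```python
def menu_lines(dag, roots, indent="  "):
    """Expand a DAG into indented menu lines, repeating shared children."""
    lines = []

    def walk(node, depth):
        lines.append(indent * depth + node)
        for child in dag.get(node, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines

# Hypothetical DAG using the node names from the slide.
dag = {"X": ["Y", "M", "N"], "Y": ["Z"], "A": ["B"], "B": ["Z"]}
lines = menu_lines(dag, ["X", "A"])
for line in lines:
    print(line)
```

Z is printed twice, once under Y and once under B. An interactive menu would expand only the path the user is following, so the duplication costs little screen space.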

What about ambiguity?

• Monothetic clusters of ambiguous terms?
• Derive hierarchy from retrieved documents
  • Take a query and retrieve on it,
  • take top 500 documents,
  • build hierarchy from them.
• Topics/concepts are words/phrases taken from
  • Query
  • Retrieved documents
  • Comparison of frequencies
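The slide does not say how frequencies are compared, so the following is only one possible reading: keep the query terms, plus any term whose document frequency rate in the retrieved set is at least `min_ratio` times its rate in the whole collection. The function name `select_terms`, the `min_ratio` value and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def select_terms(query_terms, retrieved_docs, collection_docs, min_ratio=2.0):
    """Pick candidate concept terms by comparing document frequency rates."""
    n_ret, n_col = len(retrieved_docs), len(collection_docs)
    # Document frequency: count each term once per document.
    df_ret = Counter(t for doc in retrieved_docs for t in set(doc))
    df_col = Counter(t for doc in collection_docs for t in set(doc))

    selected = set(query_terms)
    for term, df in df_ret.items():
        rate_ret = df / n_ret
        rate_col = df_col.get(term, 0) / n_col
        # Keep terms that are relatively more common in the retrieved set.
        if rate_col == 0 or rate_ret / rate_col >= min_ratio:
            selected.add(term)
    return sorted(selected)

# Hypothetical data: the retrieved documents are a subset of the collection.
retrieved = [["the", "polio", "vaccine"], ["polio", "symptom"]]
collection = retrieved + [["the", "market"], ["the", "election"],
                          ["the", "weather"], ["the", "polio"]]
terms = select_terms(["polio"], retrieved, collection)
print(terms)
```

Here "the" is dropped because it is no more common in the retrieved documents than in the collection, while "vaccine" and "symptom" are kept.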

Poliomyelitis and Post-Polio (TREC topic 302)

Did you guess the paper?

• Bit like Peter Anick’s work?

Experiment

• Test properties of hierarchy
• Does it mimic (in some way) Yahoo-like categories?
  • Parent related to child?
  • Parent more general than child?

Experimental set-up

• Gathered eight subjects
  • Presented subsumption categories and ‘random’ categories.
  • Asked whether each parent-child pair is ‘interesting’.
    » If yes, what type of relationship is it? Types taken (roughly) from WordNet:
    » Aspect of
    » Type of
    » Same as
    » Opposite of
    » Don’t know

Results

• Was the parent/child pairing ‘interesting’ or not?
  • Random, 51%
  • Subsumption, 67%
  • Difference significant by t-test, p<0.002
• If interesting, what is the parent/child type?

Odd?

Yahoo categories?

Results and conclusions

• Interesting AND (aspect of OR type of)
  • Random, 28% (51% * (47% + 8%))
  • Subsumption, 48% (67% * (49% + 23%))
• It appears that subsumption plus an ordering based on document frequency does a reasonable job.
  • For work on term frequency, see:

» Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval, in Journal of Documentation, 28(1): 11-21

» Caraballo, S.A., Charniak, E. (1999) Determining the specificity of nouns from text, in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP):
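The combined figures in the first bullet of this slide can be checked with a short calculation: the share of pairs judged interesting is multiplied by the share of those labelled either "aspect of" or "type of". The function name `combined` is illustrative; the percentages are the ones given on the slide.

```python
def combined(interesting, aspect_of, type_of):
    """Share of all pairs that are interesting AND (aspect of OR type of)."""
    return interesting * (aspect_of + type_of)

random_rate = combined(0.51, 0.47, 0.08)       # 0.51 * 0.55
subsumption_rate = combined(0.67, 0.49, 0.23)  # 0.67 * 0.72

print(f"Random:      {random_rate:.0%}")       # 28%
print(f"Subsumption: {subsumption_rate:.0%}")  # 48%
```

Both products round to the values quoted on the slide (28% and 48%).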

Future work?

• More user studies.
• Incorporate other term relationship techniques
• Other visualisations
• Application of techniques to whole document collections.
• Presentation of Cross Language IR results?