Clustering Related Terms with Definitions

Page 1: Clustering Related Terms with Definitions

LREC 2008 Marrakech 1

Clustering Related Terms with Definitions

Scott Piao, John McNaught and Sophia Ananiadou

{scott.piao,john.mcnaught,sophia.ananiadou}@manchester.ac.uk

National Centre for Text Mining
School of Computer Science
The University of Manchester

Page 2: Clustering Related Terms with Definitions

Outline of talk

• Task: match related terms of an ontology.
• Approach: detect and cluster related terms based on definitions.
• Implementation: definition matching and term clustering, user interface.
• Evaluation on GO terms.
• Conclusion.

Page 3: Clustering Related Terms with Definitions

Task: matching terms for ontology enrichment

• Matching similar or related terms/expressions is an important task in NLP and text mining applications.

• Ontology term matching is also closely related to ontology enrichment.

• In the EU BOOTStrep Project, some techniques have been tested for ontology entity matching and alignment.

• Our work focuses on testing and evaluating a text matching tool for identifying related ontology terms with their definitions.

Page 4: Clustering Related Terms with Definitions

Definitions of term definitions

• Ontology terms, such as GO (Gene Ontology) terms, often contain detailed definitions:
  – id: GO:0000124
  – name: SAGA complex
  – def: "A large multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, several proteins of the Spt and Ada families, and several TBP-associate proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."

  – id: GO:0005671
  – name: Ada2/Gcn5/Ada3 transcription activator complex
  – def: "A multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, two proteins of the Ada family, and two TBP-associate proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."
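The id/name/def fields shown above can be pulled into a simple record before any comparison takes place. The following is a minimal sketch, assuming the input looks like the lines on this slide (one field per line); `parse_terms` is a hypothetical helper, not part of the package described in the talk, and the sample definition is cut to its first sentence for brevity.

```python
import re

def parse_terms(text):
    """Parse 'id:', 'name:' and 'def:' lines into one dict per term."""
    terms = []
    current = None
    for line in text.splitlines():
        match = re.match(r"^(id|name|def):\s*(.*)$", line.strip())
        if not match:
            continue
        key, value = match.groups()
        if key == "id":
            # An id line starts a new term record.
            current = {"id": value}
            terms.append(current)
        elif current is not None:
            # Drop the surrounding double quotes of a definition, if any.
            current[key] = value.strip('"')
    return terms

# Sample input: first sentence of the GO:0000124 definition only.
sample = '''
id: GO:0000124
name: SAGA complex
def: "A large multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription."
'''

terms = parse_terms(sample)
print(terms[0]["id"], "|", terms[0]["name"])
```

The resulting list of dicts gives each term a definition string that a text comparison step can consume.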

Page 5: Clustering Related Terms with Definitions

Our approach to the issue

• The definitions can provide a fundamental information source for detecting relations between terms.

• Lexicon definitions have previously been used for analyzing relations between words/terms (Castillo et al., 2003).

• We assume text matching tools can be used to detect related terms based on the definitions.

Page 6: Clustering Related Terms with Definitions

A tool for clustering related texts

• Align similar sentences between texts.

• Measure the distances between texts based on the aligned sentences.

• Cluster similar texts based on a distance matrix.
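The first step above, sentence alignment, can be illustrated with a small sketch. The slides do not say how the tool aligns sentences, so this example swaps in a simple stand-in: each sentence is reduced to a set of lowercased words, and each sentence of one text is paired with its best match in the other text by Dice coefficient. The function names and the 0.5 threshold are assumptions made for illustration.

```python
def tokens(sentence):
    """Lowercased word set; punctuation is stripped from word edges."""
    return {w.strip('.,;:()"').lower() for w in sentence.split()} - {""}

def dice(a, b):
    """Dice coefficient between two token sets (0 when both are empty)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def align(text_a, text_b, threshold=0.5):
    """Pair each sentence of text_a with its best match in text_b.

    Returns (index_a, index_b, score) triples for matches that reach
    the (assumed) similarity threshold.
    """
    if not text_b:
        return []
    pairs = []
    for i, sa in enumerate(text_a):
        score, j = max((dice(tokens(sa), tokens(sb)), j)
                       for j, sb in enumerate(text_b))
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

a = ["A large multiprotein complex that possesses histone acetyltransferase.",
     "It is found in yeast."]
b = ["A multiprotein complex that possesses histone acetyltransferase."]
pairs = align(a, b)
print(pairs)
```

Only the first sentence of `a` finds a partner here; the second shares no words with `b` and is left unaligned.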

Page 7: Clustering Related Terms with Definitions

Metrics for pairwise text comparison

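\[ psd = \frac{l_{sw} + l_{ng}}{l_{m2}} \]

\[ dc = \frac{2\,(l_{sw} + l_{ng})}{l_{m1} + l_{m2}} \]

\[ psng = \frac{l_{ng}}{l_{sw} + l_{ng}} \]

\[ ws = \delta_1\, psd + \delta_2\, dc + \delta_3\, psng, \quad (\delta_1 = 0.85,\ \delta_2 = 0.05,\ \delta_3 = 0.1) \]

\[ d = \frac{\sum_i (ws_i \times ls_i)}{l}, \quad (0 \le d \le 1) \]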

For further details, see the paper.
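As a rough illustration of how the weighted score is put together, the sketch below reads the slide's metrics as psd = (l_sw + l_ng) / l_m2, dc = 2(l_sw + l_ng) / (l_m1 + l_m2), psng = l_ng / (l_sw + l_ng), ws = δ1·psd + δ2·dc + δ3·psng, and d = Σ(ws_i × ls_i) / l. This reading is an interpretation of the slide, and the meanings given to the symbols in the comments (lengths of matched single words, matched n-grams, and the two sentences) are assumptions; the paper is the authority on both.

```python
def sentence_score(l_sw, l_ng, l_m1, l_m2, deltas=(0.85, 0.05, 0.1)):
    """Weighted score ws for one aligned sentence pair.

    Assumed meanings: l_sw = length of matched single words,
    l_ng = length of matched n-grams, l_m1 / l_m2 = sentence lengths.
    """
    shared = l_sw + l_ng
    if shared == 0:
        return 0.0  # nothing matched, so every component is zero
    psd = shared / l_m2
    dc = 2 * shared / (l_m1 + l_m2)   # Dice-style coefficient
    psng = l_ng / shared              # share of the match made of n-grams
    d1, d2, d3 = deltas
    return d1 * psd + d2 * dc + d3 * psng

def text_score(aligned, total_length):
    """Score d: length-weighted sum of sentence scores divided by l.

    aligned: iterable of (ws_i, ls_i) pairs; total_length: l.
    """
    return sum(ws * ls for ws, ls in aligned) / total_length

ws = sentence_score(l_sw=2, l_ng=6, l_m1=10, l_m2=8)
d = text_score([(1.0, 5), (0.5, 5)], total_length=10)
print(round(ws, 4), d)
```

With the weights from the slide, psd dominates the score (δ1 = 0.85), while dc and psng act as small corrections.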

Page 8: Clustering Related Terms with Definitions

An effective algorithm for text comparison

Cited from Clough et al. (2002)

Page 9: Clustering Related Terms with Definitions

Clustering texts

• Using the text comparison tool, produce a distance matrix with elements \( e_{ij} = 1 - d_{ij} \), \( 0 \le e_{ij} \le 1 \).

• Error Sum of Squares (ESS) hierarchical clustering:

\[ ESS = \sum_{\text{clusters}} \; \sum_{\text{within clusters}} \; \sum_{k=1}^{p} (x_{ik} - \bar{x}_k)^2 \]
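The ESS criterion can be sketched in a few lines: for each cluster, sum the squared deviations of every member from the cluster mean over all p dimensions, then add the results across clusters. Merging, at each step, the two clusters whose union increases ESS least is commonly known as Ward's method. The sketch below assumes each item is a numeric vector; how the tool derives such vectors from its distance matrix is not stated on the slide, and `best_merge` is a hypothetical helper that needs at least two clusters.

```python
def ess(clusters):
    """Error sum of squares of a partition.

    clusters: list of clusters, each a non-empty list of equal-length
    numeric vectors.
    """
    total = 0.0
    for members in clusters:
        p = len(members[0])
        means = [sum(x[k] for x in members) / len(members) for k in range(p)]
        total += sum((x[k] - means[k]) ** 2 for x in members for k in range(p))
    return total

def best_merge(clusters):
    """Indices (a, b) of the two clusters whose merge yields the lowest ESS."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
            merged.append(clusters[a] + clusters[b])
            cost = ess(merged)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best[1], best[2]

singletons = [[[0.0]], [[0.1]], [[1.0]]]
print(best_merge(singletons))
print(ess([[[0.0], [0.1]], [[1.0]]]))
```

Repeating `best_merge` until one cluster remains yields the layered tree structure that a hierarchical clustering produces.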

Page 10: Clustering Related Terms with Definitions

Sample of cluster tree

{layer=9
  {layer=10
    {layer=11
      {layer=12 GO:0009897 GO:0010339 }
      {layer=12 GO:0010282 }
    }
    {layer=11
      {layer=12 GO:0045284 }
      {layer=12 GO:0045293 }
    }
  }
  {layer=10
    {layer=11
      {layer=12 GO:0017117 GO:0033202 }
      {layer=12 GO:0017119 }
    }

Page 11: Clustering Related Terms with Definitions

A package for definition comparison and term clustering

[Diagram of the package. Labels: term database, pairwise definitions comparison, synonym lexicon, extended Porter's stemmer, distance matrix, term clusterer, clusters, user interface, check, update.]

Page 12: Clustering Related Terms with Definitions

User interface for checking and updating terms

Page 13: Clustering Related Terms with Definitions

Evaluation

• The text comparison and clustering components are evaluated on a set of GO terms as test data.

• In the evaluation, we consider GO terms to be related if they:
  – share a parent term within three layers of ancestor trees via IS_A relation, or
  – have direct parent/child relations (e.g. X is_a Y), or
  – have direct part-of relations (e.g. X is part of Y).
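The three criteria above can be expressed as a small check over parent maps. This is a sketch under assumptions: `is_a` and `part_of` are hypothetical dicts mapping a term to its direct parents, the term labels are toy placeholders rather than GO identifiers, and "within three layers" is read as a shared IS_A ancestor at most three steps above each of the two terms.

```python
def ancestors_within(term, is_a, depth):
    """IS_A ancestors of `term` reachable in at most `depth` steps."""
    found, frontier = set(), {term}
    for _ in range(depth):
        # Move one layer up, skipping ancestors that were already seen.
        frontier = {p for t in frontier for p in is_a.get(t, ())} - found
        found |= frontier
    return found

def related(x, y, is_a, part_of, depth=3):
    """True if x and y satisfy any of the three relatedness criteria."""
    # Direct parent/child relation via IS_A, in either direction.
    if y in is_a.get(x, ()) or x in is_a.get(y, ()):
        return True
    # Direct part-of relation, in either direction.
    if y in part_of.get(x, ()) or x in part_of.get(y, ()):
        return True
    # Shared IS_A ancestor within `depth` layers of both terms.
    shared = ancestors_within(x, is_a, depth) & ancestors_within(y, is_a, depth)
    return bool(shared)

# Toy ontology: T1 and T2 share parent P; T3 reaches P only after four steps.
is_a = {"T1": ["P"], "T2": ["P"], "T3": ["Q"], "Q": ["R"], "R": ["S"], "S": ["P"]}
part_of = {"T4": ["T3"]}

print(related("T1", "T2", is_a, part_of))
print(related("T1", "T3", is_a, part_of))
```

Changing `depth` to 1 or 2 gives the stricter variants that the evaluation tables report as "depths of parent nodes considered".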

Page 14: Clustering Related Terms with Definitions

Evaluation

• Test data
  – GO terms under the namespace cellular_component.
  – 2,027 terms found, of which 2,010 have definitions; these 2,010 form the actual test data.
  – All of the 2,010 test terms are related, as defined previously, to one or more other test terms.

• Our evaluation strategy is to examine:
  – how many clustered terms have the relations defined previously, and
  – how many of the related terms can be covered by the clusters.

Page 15: Clustering Related Terms with Definitions

Evaluation of bottom-layer clusters

Total_clustered_terms = 1,076

Depth of parent nodes considered   Clustered true pairs   Precision (%)   Coverage (%)
1                                  417 (834 terms)        76.09           41.49
2                                  489 (978 terms)        89.23           48.66
3                                  531 (1,062 terms)      96.90           52.84
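The coverage figures can be reproduced from the numbers on the slides, which suggests that coverage here is the number of terms in true pairs divided by the 2,010 test terms (not by the 1,076 clustered terms). The short check below is a sketch of that arithmetic; the denominator behind the precision column is not given on the slide, so precision is not recomputed.

```python
TEST_TERMS = 2010  # test terms with definitions, from the test data slide

def coverage(true_pairs, total_terms=TEST_TERMS):
    """Percentage of test terms that fall in correctly clustered pairs."""
    terms_in_pairs = 2 * true_pairs  # each pair contributes two terms
    return round(100 * terms_in_pairs / total_terms, 2)

# Clustered true pairs at parent-node depths 1, 2 and 3.
for depth, pairs in {1: 417, 2: 489, 3: 531}.items():
    print(depth, 2 * pairs, coverage(pairs))
```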

Page 16: Clustering Related Terms with Definitions

Distribution of relation types IS_A and PART_OF in the clustered terms

           1 parent node       2 parent nodes      3 parent nodes
type       is-a     part-of    is-a     part-of    is-a     part-of
number     122      49         128      50         128      50
percent    29.3     11.75      26.2     10.2       24.1     9.4

Page 17: Clustering Related Terms with Definitions

Evaluation of the second layer clusters

Depth of parent nodes considered   Correctly clustered terms   Precision/coverage (%)
1                                  1,163                       57.86
2                                  1,474                       73.33
3                                  1,685                       83.83

Total_clustered_terms = 2,010

Page 18: Clustering Related Terms with Definitions

Evaluation of the third layer clusters

Depth of parent nodes considered   Correctly clustered terms   Precision/coverage (%)
1                                  1,284                       63.88
2                                  1,642                       81.69
3                                  1,843                       91.69

Total_clustered_terms = 2,010

Page 19: Clustering Related Terms with Definitions

Application of this package

• This package can be used as an assistant tool for modifying and enriching ontologies and terminologies. (Brief demo of interface)

Page 20: Clustering Related Terms with Definitions

Conclusion

• Ontology term definitions provide an important information source for term matching.

• A text comparison and clustering tool can be useful for matching the terms.

• For better performance, the tool needs domain knowledge resources.

Page 21: Clustering Related Terms with Definitions

Acknowledgements

This research was supported by EC BOOTStrep Project (ref. FP6-028099).

The UK National Centre for Text Mining is sponsored by the JISC/BBSRC/EPSRC.

Page 22: Clustering Related Terms with Definitions

References

• BOOTStrep Project website: http://www.BOOTStrep.org.

• Castillo, Gabriel, Gerardo Sierra, John McNaught (2003). An improved Algorithm for Semantic Clustering. Proceedings of the 1st international symposium on Information and communication technologies, Dublin.

• Clough, Paul, Robert Gaizauskas, Scott Piao, Yorick Wilks (2002), METER: MEasuring TExt Reuse, In Proceedings of the ACL-2002, University of Pennsylvania, Philadelphia, USA, pp. 152-159.

• Gene Ontology http://www.geneontology.org.

• Piao, Scott and Tony McEnery (2003). A tool for text comparison. Proceedings of the Corpus Linguistics

