1
Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying
Department of Computing, The Hong Kong Polytechnic University
Chinese Core Ontology Construction from a Bilingual Term Bank
2
Outline
Introduction
Related Works
Algorithm Design - COCA
Performance Evaluation
Conclusion
3
Introduction
What is a Core Ontology?
A mid-level ontology that bridges the gap between an upper ontology and a domain ontology
[Figure: three ontology layers with example concepts]
  Upper Ontology: ComputerProgram
  Core Ontology: Software
  Domain Ontology: Free Software, Public Domain Software
4
Concepts and Terminologies
Upper Ontology
  A general ontology that ensures reusability across different domains (e.g., Computer Program in SUMO)
Domain Ontology
  An ontology that conceptualizes a specific domain (e.g., Free Software in the IT domain)
  More application dependent; emphasizes the extension of concepts
Mid-level Ontology (Core Concepts)
  Basic concepts of a domain
  More application independent; emphasizes the intension of concepts
  Core ontology (e.g., Software)
  Frequently used, with the ability to form other concepts
Core Terms
  Lexical units of core concepts
5
Related Works
Manually constructed ontologies
  SUMO: a well-known upper-level ontology
Works based on lexicons
  CoreLex (Buitelaar, P., 1998)
  EuroWordNet (Rodríguez, 1998)
Ontology harmonization: core ontology
  “Towards a Core Ontology for Information Integration” (M. Doerr, 2003)
The most similar work
  “Enriching Core Ontology with Domain Thesaurus through Concept and Relation Classification” (Huang, 2007)
  Uses concept and relation classification to enrich a core ontology
6
Our Previous Works
Chinese terminology extraction
Chinese core term extraction (Ji et al., 2007)
Preliminary work on automatic core ontology construction using an English-Chinese term bank (MRCOCA, Ontolex 2007; Chen, 2007)
  Bilingual lexicon
  Extended strings
  Frequency information in synsets
  Weights from extended strings are integrated into the final weight by simple addition
  Mapping to synsets and SUMO achieves an accuracy of only about 50%
7
Issues
What kinds of concepts should be included?
How to identify core concepts?
  If through core terms, disambiguation is needed
What relations to identify, and how?
Making use of available resources
  Chinese NLP resources are scarce
  English NLP resources are abundant
8
Requirements of Core Ontology
The concepts must be widely accepted and commonly referenced
Corresponding core terms must be highly used and productive
The concepts/terms can be mapped to an upper ontology, so that the core ontology can inherit the attributes provided by the upper ontology
9
Core Ontology Construction Algorithm (COCA) for Chinese
Extract Chinese core terms from a bilingual term bank
Map each core term T_C to English terms
Map the English terms to WordNet synsets
Map each synset to an upper ontology concept in SUMO
$$\arg\max_{S} P(S \mid T_C)$$
10
COCA - Resources Used
ITCTerm
  A domain-specific core term list (Chen, 2007)
CETBank
  Chinese-English bilingual term bank
  (The 1,500 most productive core terms extracted can serve as suffixes to form more than 50% of the terms in CETBank)
WordNet
SUMO
Mappings between WordNet and SUMO
11
The Framework of COCA
[Figure: COCA framework pipeline]
  Chinese Core Term → Statistical Translation Module (uses Bilingual Term Bank) → English Core Term Candidate
  English Core Term Candidate → Sense Disambiguation Module (uses WordNet) → Core Concept Candidate
  Core Concept Candidate → Concept Selection Module (uses Additional Features) → Core Concept
  Core Concept → SUMO-WordNet Mapping Module (uses Mapping Data) → SUMO Concept
12
COCA - Statistical Translation Module
Translation ambiguity
  Each Chinese core term T_C ∈ ITCTerm has a set of translations T_Set_E(T_C), with T_E ∈ T_Set_E(T_C)
Objective
  Estimate the likelihood P(T_E | T_C) of every translation T_E ∈ T_Set_E(T_C) using the extended terms of T_C
$$W(T_C, T_E) = \sum_{T_e \in Ext\_T\_Set(T_C),\; T_E \subseteq T_{eE}} \frac{len(T_E)}{len(T_{eE})}$$
  where T_eE is the English translation of the extended term T_e
$$P(T_E \mid T_C) = \frac{W(T_C, T_E)}{\sum_{T_{Ei} \in T\_Set_E(T_C)} W(T_C, T_{Ei})}$$
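The weighting and normalization above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: length is measured in English words, a translation is taken to support an extended term only when it is that term's suffix (head), and the term list is a toy example rather than data from CETBank. The function name `translation_probabilities` is hypothetical.

```python
from collections import defaultdict


def translation_probabilities(core_translations, extended_translations):
    """Estimate P(T_E | T_C) for one Chinese core term.

    core_translations: candidate English translations T_E of the core term.
    extended_translations: English translations of bank terms that extend
        the core term (the T_eE strings).
    """
    weights = defaultdict(float)
    for t_e in core_translations:
        cand = t_e.lower().split()
        if not cand:
            continue
        for ext in extended_translations:
            ext_words = ext.lower().split()
            # Assumption: T_E supports an extended term when it is its suffix.
            if ext_words[-len(cand):] == cand:
                # Length ratio len(T_E) / len(T_eE), measured in words.
                weights[t_e] += len(cand) / len(ext_words)
    total = sum(weights.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return {t: 1.0 / len(core_translations) for t in core_translations}
    return {t: weights[t] / total for t in core_translations}


# Toy data (hypothetical): two candidate translations of one core term.
probs = translation_probabilities(
    ["software", "facility"],
    ["free software", "system software", "public domain software",
     "testing facility"],
)
print(probs)  # "software" gets 8/11 of the mass, "facility" gets 3/11
```

Normalizing over the whole translation set keeps the result a proper distribution, so it can be multiplied with the sense probabilities in the next module.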
13
COCA - Sense Disambiguation Module
Map a given T_C to a synset S through its translation set T_Set_E(T_C)
Mapping probability of an English term T_E taking a synset S, using frequency information in WordNet:
$$P(S \mid T_E) = \frac{F(T_E, S) + 1}{\sum_{x \in synset(T_E)} \left( F(T_E, x) + 1 \right)}$$
Mapping probability of T_C taking a particular synset S via an English translation T_E:
$$P(S \mid T_C) = P(T_E \mid T_C) \cdot P(S \mid T_E)$$
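A small sketch of the two probabilities on this slide, assuming the sense frequencies F(T_E, x) are already available as a dictionary. The synset labels and counts below are made up for illustration, and the function names are hypothetical.

```python
def sense_probabilities(sense_freqs):
    """Add-one smoothed P(S | T_E) from WordNet-style sense frequencies.

    sense_freqs maps each synset of T_E to its frequency count F(T_E, S).
    """
    total = sum(freq + 1 for freq in sense_freqs.values())
    return {synset: (freq + 1) / total for synset, freq in sense_freqs.items()}


def path_probability(p_te_given_tc, p_s_given_te):
    """P(S | T_C) along one path: T_C -> T_E -> S."""
    return p_te_given_tc * p_s_given_te


# Hypothetical counts for the senses of one English term.
p_sense = sense_probabilities({"sense_a": 5, "sense_b": 2, "sense_c": 0})
print(p_sense)  # sense_a: 0.6, sense_b: 0.3, sense_c: 0.1

# Path probability when the translation itself has probability 0.5.
print(path_probability(0.5, p_sense["sense_a"]))  # 0.3
```

The add-one term keeps senses with a zero count from being ruled out entirely, which matters for domain-specific senses that are rare in general text.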
14
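COCA - Concept Selection Module
Combining three features
  Multi-path feature
  Hypernyms feature
  Part-of-speech feature
Using the Union Probability of Independent Events
$$U_{x \in E}\, p(x) = \begin{cases} 0 & |E| = 0 \\ p(x) + U_{y \in E - \{x\}}\, p(y) - p(x) \cdot U_{y \in E - \{x\}}\, p(y) & |E| \neq 0 \end{cases}$$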
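The recursive definition of the union probability can be computed with a simple loop. The sketch below (the function name `union_probability` is hypothetical) folds the events in one at a time; for independent events this equals one minus the product of the complements.

```python
import math


def union_probability(probs):
    """Probability that at least one of several independent events occurs.

    Applies U = p + U_rest - p * U_rest once per event, starting from 0
    for the empty set.
    """
    result = 0.0
    for p in probs:
        result = result + p - result * p
    return result


print(union_probability([]))          # 0.0
print(union_probability([0.5, 0.5]))  # 0.75

# Closed form for comparison: 1 - prod(1 - p).
values = [0.2, 0.3, 0.5]
closed_form = 1.0 - math.prod(1.0 - p for p in values)
print(abs(union_probability(values) - closed_form) < 1e-12)  # True
```

Unlike simple addition, this combination never exceeds 1 and lets several weak pieces of evidence reinforce one another.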
15
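Feature 1 - Multiple Paths to a Synset
Multiple paths are the paths between a Chinese core term and a synset via different English translations
$$SP(S \mid T_C) = U_{T_E \in T\_Set_E(T_C)}\, P(S \mid T_C)$$
  where each P(S | T_C) is the path probability via one translation T_E
[Figure: 电流 → translations "Current" and "Electric Current" → synsets {Current} and {Current, Electric Current}]
The feature merges the probabilities of the multiple paths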
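To illustrate how the paths are merged, the sketch below uses the 电流 example with made-up probabilities; the numbers and synset labels are assumptions for illustration, not values from the paper. Each path contributes P(T_E | T_C) · P(S | T_E), and paths reaching the same synset are combined with the union probability.

```python
def union_probability(probs):
    """Union probability of independent events."""
    result = 0.0
    for p in probs:
        result = result + p - result * p
    return result


def multi_path_feature(p_translation, p_sense):
    """SP(S | T_C): merge all translation paths that reach each synset.

    p_translation: {T_E: P(T_E | T_C)}
    p_sense: {T_E: {S: P(S | T_E)}}
    """
    paths = {}
    for t_e, p_te in p_translation.items():
        for synset, p_s in p_sense.get(t_e, {}).items():
            paths.setdefault(synset, []).append(p_te * p_s)
    return {synset: union_probability(ps) for synset, ps in paths.items()}


# Hypothetical probabilities for the two translations of 电流.
p_translation = {"current": 0.6, "electric current": 0.4}
p_sense = {
    "current": {"{current, electric current}": 0.5, "{current}": 0.5},
    "electric current": {"{current, electric current}": 1.0},
}
sp = multi_path_feature(p_translation, p_sense)
print(sp)  # the shared synset scores about 0.58, the other one 0.3
```

The synset reachable through both translations ends up with a higher score than either path alone would give it.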
16
Feature 2 - Hyponyms in Domain
Incorporate information on all the extended strings
$$HP(S \mid T_C) = U_{t \in Suffix\_Ext(T_C)} \left( \frac{len(T_C)}{len(t)} \cdot U_{h \in hyponym(S)}\, SP(h \mid t) \right)$$
  len(T_C) / len(t): length ratio
  U: Union Probability of Independent Events
An extended string uses the core term as its headword and is a hyponym of the core term
[Figure: 计算机 is the head of 大型计算机 (Mainframe Computer); English terms: Computer, Calculator; synsets: {Computer, Computing Machine}, {Calculator, Computer}; relations shown: head-of, hyponym-of]
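One possible reading of this feature in code, as a sketch only: each extended string t votes for a candidate synset S when t maps to one of the hyponyms of S, and the vote is scaled by the length ratio. Measuring length in Chinese characters, the synset labels, and the probability 0.9 are all assumptions for illustration; `hyponym_feature` is a hypothetical name.

```python
def union_probability(probs):
    """Union probability of independent events."""
    result = 0.0
    for p in probs:
        result = result + p - result * p
    return result


def hyponym_feature(core_term, extended, hyponyms_of_s):
    """HP(S | T_C) for one candidate synset S.

    extended: {extended string t: {synset h: SP(h | t)}}
    hyponyms_of_s: set of synsets that are hyponyms of S.
    """
    votes = []
    for t, sp_t in extended.items():
        inner = union_probability(sp_t.get(h, 0.0) for h in hyponyms_of_s)
        # Length ratio len(T_C) / len(t), counted in characters (assumption).
        votes.append(len(core_term) / len(t) * inner)
    return union_probability(votes)


# Hypothetical scores: 大型计算机 maps to a "mainframe" synset with SP = 0.9.
extended = {"大型计算机": {"mainframe": 0.9}}
hp_machine = hyponym_feature("计算机", extended, {"mainframe", "laptop"})
hp_person = hyponym_feature("计算机", extended, {"accountant"})
print(hp_machine)  # about 0.54 (3/5 * 0.9)
print(hp_person)   # 0.0
```

In this toy case the machine sense of the core term is supported by its extended string, while a sense with no matching hyponyms receives no support.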
17
Feature 3 - Part of Speech
Probability of the POS tag pos(S) owned by a synset S, given a core term T_C
$$OP(S \mid T_C) = \frac{pos\_freq(T_C, pos(S))}{\sum_{po \in \{noun,\, verb,\, adjective\}} pos\_freq(T_C, po)}$$
POS tag estimation: heuristics on adjective, verb, and noun based on position
Example: 标准 (Standard)
  {Standard, Criterion} (n.): relatively low probability to be a noun
  {Standard} (adj.): high probability to be an adjective
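A minimal sketch of the part-of-speech feature, assuming the position-based heuristics have already produced a count per POS tag for the core term. The counts below are invented for the 标准 example and the function name is hypothetical.

```python
POS_TAGS = ("noun", "verb", "adjective")


def pos_probability(pos_freq, synset_pos):
    """OP(S | T_C): share of the core term's POS evidence matching pos(S).

    pos_freq: {tag: pos_freq(T_C, tag)} estimated elsewhere by heuristics.
    synset_pos: the POS tag of the candidate synset S.
    """
    total = sum(pos_freq.get(tag, 0) for tag in POS_TAGS)
    if total == 0:
        return 0.0  # no POS evidence for this core term
    return pos_freq.get(synset_pos, 0) / total


# Hypothetical counts: 标准 mostly behaves like an adjective in this domain.
counts = {"noun": 2, "verb": 0, "adjective": 8}
print(pos_probability(counts, "noun"))       # 0.2
print(pos_probability(counts, "adjective"))  # 0.8
```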
18
Integrate Features
Using the Union Probability of Independent Events
$$FP(S \mid T_C) = U_{x \in \{SP(S \mid T_C),\; HP(S \mid T_C),\; OP(S \mid T_C)\}}\, x$$
19
Evaluation
Algorithm Output
  A pair <T_Ci, Synset_i> for each Chinese core term, chosen as the mapping with the highest weight
Evaluation Standard
  For each T_Ci, whether its mapping to a synset is the best match with respect to this domain
Answer Preparation
  Answers are prepared manually and independently by two experts in the IT domain on the same set of data
20
Performance
The evaluation is conducted on the top N most frequent core terms
COCA achieves 71% accuracy (N is 28 in this paper)
  Compared with MRCOCA (Chen, 2007), which achieved only 50%
Two examples of core term to synset mappings generated by the algorithm are given for “软件” and “网络”
21
No. | Zh | En | SUMO Concept | Synset | Gloss
--- | --- | --- | --- | --- | ---
1 | 软件 (SC) | Software | ComputerProgram + | software, software_system | (computer science) written programs or procedures or rules and associated documentation pertaining to the operation of a computer system and that are stored in read/write memory
2 | 软件 | Facility | StationaryArtifact + | facility, installation | something created to provide a particular service; "the assembly plant is an enormous facility"
3 | 软件 | Facility | SubjectiveAssessmentAttribute + | proficiency, facility, technique | skillfulness in the command of fundamentals deriving from practice and familiarity; "practice greatly improves proficiency"
4 | 软件 | Facility | SubjectiveAssessmentAttribute + | adeptness, adroitness, deftness, facility, quickness | skillful performance without difficulty; "his quick adeptness was a product of good design"
5 | 软件 | facility | Room + | toilet, lavatory, lav, can, facility, john, privy, bathr | a room equipped with washing and toilet facilities
6 | 软件 | facility | SubjectiveAssessmentAttribute + | facility, readiness | a natural effortlessness; "a happy readiness of conversation"--Jane Austen
7 | 网络 (S) | net | Artifact + | network, net, mesh, meshwork, reticulation | an interconnected or intersecting configuration or system of components
8 | 网络 (C) | network | Collection + | network, web | an intricately connected system of things or people; "a network of spies" or "a web of intrigue"
9 | 网络 | network | SocialInteraction + | network | communicate with and within a group; "You have to network if you want to get a good job"
10 | 网络 | net | Pursuing + | net, nett | catch with a net; "net a fish"
11 | 网络 | net | Making + | web, net | construct or form a web, as if by weaving
12 | 网络 | net | SubjectiveAssessmentAttribute + | final, last, net | conclusive in a process or progression; "the final answer"; "a last resort"; "the net result"
13 | 网络 | net | CurrencyMeasure + | net, nett | remaining after all deductions; "net profit"
22
Conclusion
Evaluation of COCA, repeated on an English-Chinese bilingual term bank with more than 130K entries, shows that the algorithm improves accuracy by 42% compared to MRCOCA (our previous work)
The three features and the new probability-based algorithm account for the improvement
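As a quick check, and assuming the 42% figure is a relative improvement over the MRCOCA baseline (an interpretation, since the slide does not say so explicitly), the accuracies reported on the Performance slide give:

$$\frac{71\% - 50\%}{50\%} = \frac{21}{50} = 0.42 = 42\%$$

In absolute terms this corresponds to a gain of 21 percentage points.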
23
A term bank can help to quickly construct a domain core ontology by selecting the concept nodes and relations used in the domain
A bilingual term bank can further introduce the second-language realization of the core ontology effectively and automatically
24
Future Works
Evaluation of the three features
  How effective they are
  How much they contribute to the final performance
Consideration of more features, such as abbreviations and the synset of the headword of a core term
Use of other resources
25
Q&A