
Foundation item: Key project of the Education Department of Hubei Province, No. D200618003

Design and Realization for Ontology Learning Model Based on Web

Wu Yuhuang, Network Center

Wuhan Polytechnic University, Wuhan, China

e-mail: [email protected]

Li Yusheng, Network Center

Wuhan Polytechnic University, Wuhan, China

e-mail: [email protected]

Abstract—Ontology research is becoming increasingly popular. However, the conventional approach of constructing ontologies manually is laborious and makes ontologies difficult to update. Ontology construction is therefore often regarded as the bottleneck of knowledge acquisition. Technologies such as machine learning make automatic or semiautomatic ontology construction possible. In this paper, we propose a Web-based ontology learning model and describe the key components of the approach: document pretreatment, term extraction, concept choice, and concept classification.

Keywords-ontology; ontology learning; knowledge acquisition; ontology evaluation

1. Introduction

Ontology research in computer science is becoming increasingly popular, which in turn promotes its application in many areas. A key step in applying an ontology is constructing it. The conventional approach to ontology construction is tedious and time consuming. Systems such as Cyc and WordNet require a significant amount of manual entry before their self-learning features take effect. Ontology construction therefore often becomes the bottleneck of knowledge acquisition and prevents timely updates. Because the knowledge in an ontology changes constantly, manual construction is often impractical, and automatic or semiautomatic construction is used instead. Ontology learning [1] studies automatic or semiautomatic ontology construction by exploiting a variety of techniques such as ontology engineering [2] and machine learning [3].

In this paper, we propose a Web-based ontology learning model. We discuss the key components of the model, including Web documents pretreatment, term extraction, concept choice, and concept classification. We also present an application example of the proposed model.

2. Ontology learning model design

This paper is concerned with extracting an ontology automatically from Web pages and with discovering the patterns and relations among ontology concepts in Web page data. The model extracts the Web ontology semi-automatically by analyzing a collection of Web pages from the same application domain. The proposed ontology learning model is shown in Figure 1. The entire ontology learning process includes Web documents pretreatment, formation of the candidate keyword collection, term extraction, concept choice (the production of the concept collection), and concept classification. The model draws on several kinds of data sources to collect, select, and pretreat the Web documents; produces the candidate keyword collection; extracts candidate terms from the candidate keywords to form an initial list of domain terms; filters out terms unrelated to the domain through concept choice; and obtains the domain ontology concepts.

2009 International Conference on Information Technology and Computer Science

978-0-7695-3688-0/09 $25.00 © 2009 IEEE

DOI 10.1109/ITCS.2009.234



$$p(d, c_j) = \sum_{d_i \in KNN} Sim(d, d_i)\, y(d_i, c_j)$$

where $d$ is the feature vector of the new document; $Sim(d, d_i)$ is the similarity, computed with the same formula as in the previous step; and $y(d_i, c_j)$ is the category attribute function: its value is 1 if $d_i$ belongs to class $c_j$ and 0 otherwise.

(5) The class weights are compared, and the document is assigned to the class with the largest weight.
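The KNN classification procedure (cosine similarity, selection of the K nearest training documents, summation of class weights, and choice of the largest weight) can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the function names `cosine_similarity` and `knn_classify` and the sparse dictionary representation of document vectors are assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(new_doc, training, k):
    """Assign new_doc to the class with the largest summed similarity.

    training is a non-empty list of (vector, label) pairs (hypothetical layout).
    """
    # Step (3): rank the training documents by similarity to the new document.
    scored = sorted(
        ((cosine_similarity(new_doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Step (4): p(d, c_j) is the sum of Sim(d, d_i) over neighbors in class c_j.
    weights = defaultdict(float)
    for sim, label in scored[:k]:
        weights[label] += sim
    # Step (5): choose the class with the largest weight.
    return max(weights, key=weights.get)
```

Summing similarities instead of counting votes means that a close neighbor contributes more to its class than a distant one, which matches the weighting formula in step (4).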

3. Evaluation and testing

Because different ontology learning systems learn different content and handle different input data in different ways, it is very difficult to compare their results directly. Many ontology learning systems therefore have their own evaluation and testing methods, based on the application environment and the selected ontology domain. For example, many ontology learning systems are evaluated by calculating two indicators of the learning model: recall and precision.

Recall is the ratio of the relevant terms correctly extracted from the analyzed corpus ($\mathit{correct}_{extracted}$) to all terms that should be extracted from the corpus ($\mathit{all}_{corpus}$). Its mathematical formula is:

$$\mathit{recall} = \frac{\mathit{correct}_{extracted}}{\mathit{all}_{corpus}}$$

Precision is the ratio of correctly extracted terms to all extracted terms ($\mathit{all}_{extracted}$). Its mathematical formula is:

$$\mathit{precision} = \frac{\mathit{correct}_{extracted}}{\mathit{all}_{extracted}}$$

Recall and precision reflect two different aspects of classification quality; both must be considered together and neither can be neglected. A combined indicator, the F-measure, is therefore used. Its mathematical formula is:

$$F = \frac{2 \times \mathit{precision} \times \mathit{recall}}{\mathit{precision} + \mathit{recall}}$$
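As a worked illustration of the three indicators, the sketch below computes recall, precision, and the F-measure from a set of extracted terms and a set of reference terms. The function names and the set-based inputs are assumptions introduced for this example and do not come from the evaluated system.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision: 2PR / (P + R)."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


def recall_precision_f(extracted, relevant):
    """Return (recall, precision, F) for extracted terms against reference terms."""
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)  # correctly extracted terms
    recall = correct / len(relevant) if relevant else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision, f_measure(recall, precision)
```

For instance, a recall of 91.2% and a precision of 94.1% give `f_measure(0.912, 0.941)`, which rounds to 0.926.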

This article takes Sina's sports news pages as the test corpus. Five sports are selected from the corpus as the subject classes: pingpong, badminton, basketball, tennis, and swimming. For each class, 200 arbitrarily selected pages are used as the training set and another 80 pages as the test set. The test was carried out using the above method, and the experimental results are shown in Table 1.

Table 1. Web documents test results

Type          Recall    Precision    F
pingpong      91.2%     94.1%        92.6%
badminton     79.6%     93.4%        85.9%
basketball    76.1%     84.6%        80.1%
tennis        85.3%     92.1%        88.6%
swimming      80.4%     90.5%        85.2%

4. Conclusion

In the process of constructing an ontology, the collected Web documents change dynamically, so domain concepts need to be added or deleted constantly. At present, fully automatic construction of a domain ontology is not feasible, and manual intervention is still necessary. We proposed selecting the most appropriate concepts from the candidate terms in order to reduce noise and to mitigate the loss of information when concepts are added or deleted. Future research topics include how to determine the relationship between two concepts and how to maintain and update the domain ontology.

5. References

[1] Perez G, Macho M. A survey of ontology learning methods and techniques. OntoWeb Deliverable D1, 2003, 5: 1-86.
[2] Shauntrelle D D, Tia B W. Engineering knowledge. In: Proceedings of the 42nd Annual Southeast Regional Conference, Huntsville, Alabama, 2004. 406-407.
[3] Zheng De-Quan, Zhao Tie-Jun, Yu Feng, et al. Machine learning for automatic acquisition of Chinese linguistic ontology knowledge. IEEE, 2005. 3728-3733.
[4] Roberto Navigli, Paola Velardi. Learning domain ontologies from document warehouses and dedicated web sites[M]. Computational Linguistics (30-2), MIT Press, 2004-06.
[5] Kwok Yin Lai, Wai Lam. Automatic Textual Document Categorization Using Multiple Similarity Based Models. SDM 2001, Nov. 2001.
[6] Liu Baisong, Gao Ji. General ontology learning framework. Journal of Southeast University (English Edition), Vol. 22, No. 3. 381-384.
[7] Maedche A, Staab S. Ontology Learning for the Semantic Web. IEEE Intelligent Systems: Special Issue on the Semantic Web, 2001, 16(2): 72-79.
[8] Alexander Maedche, Steffen Staab. Ontology Learning, 2005.
[9] Velardi P, Navigli R, Cuchiarelli A, et al. Evaluation of OntoLearn, a methodology for automatic learning of domain ontologies. In: Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005. 1-32.
[10] Maedche A, Staab S. Ontology learning for the semantic web. IEEE Intelligent Systems, 2001, 16(2): 72-79.



Figure 1. Ontology learning model

2.1 Web documents pretreatment

Web page data is mostly unstructured or semi-structured. To represent unstructured text in a structured form that a computer can process, the Web document collection must be pretreated. Features extracted from a Web document serve as metadata describing it and are treated as its semantic units. A feature may be a character, a word, a phrase, or a concept. Text features are represented by a TF-IDF vector. The widely used TF-IDF formula is:

$$W(t,d) = \frac{tf(t,d) \times \log(N/n_t + 0.01)}{\sqrt{\sum_{t \in d} \left[ tf(t,d) \times \log(N/n_t + 0.01) \right]^2}}$$

Here $W(t,d)$ is the weight of word $t$ in document $d$, chosen to differentiate documents as much as possible; $tf(t,d)$ is the frequency of word $t$ in document $d$; $N$ is the total number of sample documents; and $n_t$ is the number of the $N$ samples that contain word $t$. Words that occur frequently in a document but in few documents of the collection are the most useful for distinguishing documents. The larger the weight, the stronger the word's ability to discriminate document content. After the pretreatment step, we obtain a series of candidate keywords.
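The weighting above can be sketched in a few lines. This is a minimal illustration that assumes documents are already tokenized into word lists and that the natural logarithm is used (the normalization makes the choice of base irrelevant); the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one normalized {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # n_t: number of documents that contain term t.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Numerator of W(t, d): tf(t, d) * log(N / n_t + 0.01).
        raw = {t: tf[t] * math.log(n_docs / doc_freq[t] + 0.01) for t in tf}
        # Denominator: Euclidean norm of the raw weights in this document.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors
```

In a collection such as `[["ball", "net", "ball"], ["ball", "pool"]]`, the word "net" receives a larger weight in the first document than "ball", even though "ball" occurs twice, because "ball" appears in every document.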

2.2 Term extraction

A term is the linguistic designation of a concept in a specialized field: a phrase or character string that has a simple or complex meaning in a given domain. In a sense, a term is a shallow textual expression of domain knowledge. Because terms have low ambiguity and refer to concepts specifically, they are especially effective for generalizing domain knowledge and can support the construction of a domain ontology. Extracting candidate terms by computer as accurately and completely as possible, that is, with high precision and recall, is the key point of ontology learning research. The steps are as follows:

(1) Candidate term collection production: First, phrase chunking is used to determine shallow phrase boundaries in each sentence. In this process, this article uses shallow parsing together with heuristic information, such as cue words that mark key sentences and paragraphs. The shallow parser module can be divided into the following processes: sentence anchoring, candidate term production, and ontology term choice. Each anchored sentence is divided into chunks that form noun phrases, verb phrases, and subordinate clauses. The output of this step is a group of candidate noun phrases without structural disambiguation.

(2) Domain correlation computation: Roberto Navigli put forward a method of screening terms [4] that is based on two measures, domain relevance and domain consensus. The domain relevance of term $t$ in domain $D_k$ is computed by the following formula:

$$DR_{t,k} = \frac{P(t \mid D_k)}{\max_{1 \le j \le n} P(t \mid D_j)}$$

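The conditional probability $P(t \mid D_k)$ is estimated by the equation below, where $f_{t,k}$ is the frequency of term $t$ in the documents of domain $D_k$:

$$E(P(t \mid D_k)) = \frac{f_{t,k}}{\sum_{t' \in D_k} f_{t',k}}$$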



(3) Co-occurring word extraction: Wrong noun phrases produced in the previous step are pruned. In this step, the noun phrases are analyzed using syntactic structure and statistical techniques, which addresses the problem of too many or too few noun phrases being produced. A probabilistic model of noun phrases is derived from a syntactically annotated corpus. Information is extracted from the document collection and the corpus using the results of the following equation:

$$P_{NP_{ij}}(w_i, w_j) = P_f(w_i) \times P_b(w_j)$$

where $P_{NP_{ij}}(w_i, w_j)$ is the probability of the target noun phrase or compound noun to be extracted, in which $w_i$ and $w_j$ may be joined to form a new word; $P_f(w_i)$ is the frequency with which $w_i$ follows other words; and $P_b(w_j)$ is the frequency with which $w_j$ appears before other words.

This probabilistic model can be used to prune wrong noun phrases from the candidate noun phrases. If the probability of a candidate noun phrase is larger than the threshold value, the noun phrase is likely to be an appropriate term. The selected term collection is sorted by degree of correlation to form the term list.
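The domain relevance score of step (2) and the final sorting of terms can be sketched as follows. This is an illustrative sketch only: it assumes each domain corpus is available as a flat list of tokens, and the function names `term_probabilities`, `domain_relevance`, and `rank_terms` are hypothetical.

```python
from collections import Counter


def term_probabilities(domain_tokens):
    """Estimate P(t | D_k) as f_{t,k} divided by the sum of f_{t',k} over all t'."""
    counts = Counter(domain_tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def domain_relevance(term, target, corpora):
    """DR_{t,k} = P(t | D_k) / max_j P(t | D_j).

    corpora maps a domain name to its token list (assumed layout).
    """
    probs = {
        name: term_probabilities(tokens).get(term, 0.0)
        for name, tokens in corpora.items()
    }
    best = max(probs.values())
    return probs[target] / best if best > 0 else 0.0


def rank_terms(candidates, target, corpora):
    """Sort candidate terms by decreasing domain relevance for the target domain."""
    scored = [(t, domain_relevance(t, target, corpora)) for t in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A score of 1.0 means the term is at least as probable in the target domain as in any other domain, while a score near 0 suggests that the term belongs elsewhere and is a candidate for filtering.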

2.3 Concept choice

A concept is the fundamental unit of knowledge as well as the smallest unit of thought. Terms and concepts should map one to one: one term expresses only one concept, and one concept has only one designation, that is, it is expressed by only one term. To become an ontology concept, a term must have a clear meaning and also play an important role. To judge whether a term has an explicit meaning, its stability and integrity are examined. According to Shannon's information theory, the stability of a term can be measured by its internal mutual information, and the term with the highest mutual information value is selected as the candidate concept.

Definition: Suppose that a character string $S$ of document $T$ is composed of $P$ ($P \ge 2$) characters, namely "$c_1 c_2 \cdots c_P$". Then the mutual information of $S$ is:

$$MI(S) = \frac{f(S)}{f(S_L) + f(S_R) - f(S)}$$

where $S_L$ is the left substring obtained by removing the rightmost character of $S$; $S_R$ is the right substring obtained by removing the leftmost character of $S$; and $f(S)$, $f(S_L)$, $f(S_R)$ are the frequencies of occurrence of the character strings $S$, $S_L$, and $S_R$, respectively.

If the mutual information of a character string is higher than some threshold value, the character string can be regarded as stable. The integrity of a character string means that it can express a complete meaning independently, and can therefore appear independently in different contexts.
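The stability test can be illustrated with a short sketch. It assumes that the mutual information of a string is $f(S) / (f(S_L) + f(S_R) - f(S))$ and that frequencies are counted as possibly overlapping substring occurrences in a single text; both the counting convention and the function names are assumptions made for the example.

```python
def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def mutual_information(text, s):
    """Internal mutual information of s: f(S) / (f(S_L) + f(S_R) - f(S))."""
    if len(s) < 2:
        raise ValueError("s must contain at least two characters")
    f_s = count_occurrences(text, s)
    f_left = count_occurrences(text, s[:-1])   # S_L: drop the rightmost character
    f_right = count_occurrences(text, s[1:])   # S_R: drop the leftmost character
    denom = f_left + f_right - f_s
    return f_s / denom if denom > 0 else 0.0
```

A value of 1.0 means that the two substrings never occur outside the full string, so the string behaves as a stable unit; a low value means that its parts mostly occur in other contexts.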

2.4 Concept classification

To classify the ontology concepts effectively, the KNN (K-Nearest Neighbor) algorithm [5] is used. The basic idea of this algorithm is as follows: when a new document is given, the documents in the training set that are nearest (most similar) to it are found, and the type of the new document is determined from the types of these documents. The specific steps of the algorithm are as follows:

(1) The training document vectors are re-expressed according to the feature set.

(2) When a new document arrives, it is segmented into words according to the feature words, and the vector representation of the new document is determined.

(3) The K documents most similar to the new document are selected from the training document set. The similarity is given by:

$$Sim(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{M} W_{ik}^2\right)\left(\sum_{k=1}^{M} W_{jk}^2\right)}}$$

where $d_i$ is the feature vector of the test Web text, $d_j$ is the center vector of category $j$, $M$ is the dimension of the feature vector, and $W_{ik}$ and $W_{jk}$ are the $k$-th components of $d_i$ and $d_j$. At present there is no good way to determine the value of K. Generally an initial value is chosen first, and the value of K is then adjusted based on experimental results. A common initial value lies between several hundred and several thousand. Using the above formula, the K texts in the training set most similar to the test text are selected.

(4) Among the K nearest neighbors of the new document, the weight of each class is calculated in turn. The formula is:
