
Expert Systems with Applications 39 (2012) 9730–9742

User-oriented ontology-based clustering of stored memories

Lei Shi, Rossitza Setchi ⇑

School of Engineering, Cardiff University, The Parade, Cardiff CF24 3AA, UK

Keywords: Reminiscence; Life Story Book; Semantic technology; User-oriented ontology; Clustering; Topic identification; Assistive technology

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2012.02.087

⇑ Corresponding author. Tel.: +44 29 20875720; fax: +44 29 2087. E-mail address: [email protected] (R. Setchi).

Abstract

This research addresses the needs of people who find reminiscence helpful. It focuses on the development of a computerised system called a Life Story Book (LSB), which facilitates access to and retrieval of stored memories used as the basis for positive interactions between elderly and young people, and especially between people with cognitive impairment and members of their family or caregivers. To facilitate information management and dynamic generation of content, this paper introduces a semantic model of the LSB which is based on the use of ontologies and advanced algorithms for feature selection and dimension reduction. Furthermore, the paper defines a lightweight user-oriented domain ontology and its building principles. It then proposes an algorithm called Onto-SVD, which uses the user-oriented ontology to automatically detect the semantic relations within the stored memories. It combines semantic feature selection with k-means clustering and Singular Value Decomposition (SVD) to achieve topic identification based on semantic similarity. The experiments conducted explore the effect of semantic feature selection as a result of establishing indirect relations, with the help of the ontology, within the information content. The results show that Onto-SVD considerably outperforms SVD in both topic identification and semantic disambiguation.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

This research addresses the needs of people who find reminiscence therapy helpful. Reminiscence is the process of recollecting significant past experiences or events. It normally involves the use of personal photographs, text, sound recordings, video clips and memorabilia. As indicated by research, reminiscence enhances positive and diminishes negative experiences (Haight & Burnside, 1993), and can have a positive impact on the memory of elderly people, including those with memory loss and dementia (Mather & Carstensen, 2005; Woods, Spector, Jones, Orrell, & Davies, 2005). Moreover, it has been shown that mutual reminiscence can be equally beneficial for elderly and young people, as it can trigger positive emotions (Bryant, Smart, & King, 2005; Pasupathi & Carstensen, 2003).

As a useful tool for reminiscing, the Life Story Book (LSB) was created in the 1980s to help children in foster care develop a sense of identity and provide them with some personal history (Ryan & Walker, 1985). From an information management point of view, an LSB is an information container which facilitates reminiscing by providing individuals with memory problems and/or their family members with a means of reviewing and recalling life events.

This paper contributes to research in reminiscence therapy by developing a user-oriented algorithm for clustering stored memories, i.e. content related to the life of a person. The algorithm is developed within the context of the Semantic Life Story Book (Sem-LSB), also described in this paper, which uses advances in ontologies, natural language processing and information retrieval to deliver user-tailored information.

The paper is organised as follows. Section 2 introduces reminiscence therapy and Life Story Books. Section 3 presents a state-of-the-art review of text classification and clustering, semantic topic modelling and dimension reduction of text using Singular Value Decomposition (SVD). Section 4 introduces the conceptual model of the Semantic Life Story Book and the user-oriented ontology used in it. Section 5 describes Onto-SVD, the content indexing algorithm developed, which is based on a user-oriented ontology and Singular Value Decomposition. The experimental evaluation of the algorithm is included in Section 6. Finally, Section 7 concludes the paper.

2. Reminiscence therapy and Life Story Books

The therapeutic effect of life reviews was first noted by Butler (1980), who studied their therapeutic benefit to elderly people. Butler's work changed the professional perspective on reminiscence and also led to the introduction of reminiscing in the dementia care domain (Bender, Norris, & Bauckham, 1987).

The life review is a one-to-one reminiscence therapy that relies on the communication between an individual and a therapist. In a life review therapy session, the communication is focused on the past experiences of the individual, which can be both positive and negative. Artefacts belonging to the individual, such as photos, postcards, pieces of music and video clips, can be used as cues during the session. A recent trend in reminiscence therapy is to involve in the session not only the therapist but also caregivers or family members (Haight, Gibson, & Michel, 2006; Haight et al., 2003; Thorgrimsen, Schweitzer, & Orrell, 2002). The benefits are an increased sense of personal identity, enjoyable interaction with others, and improved mood (Woods et al., 2005). In addition, this therapy enables people to leave their memories to their families and make their mark on succeeding generations.

Reminiscence therapy uses autobiographical memories organised as a life story, which is a collection of reminders of important events in a person's life. The organisation of the autobiographical memories is normally chronological; negative memories are not included. The collection is divided into chapters, each of them corresponding to a particular time period or predefined theme category (Bayer & Reban, 2004). The time periods are later used as cues to help the individual recall certain events (Conway, 2005; Conway & Pleydell-Pearce, 2000; Thomsen & Berntsen, 2008). In the context of an LSB, a chapter is a type of high-level memory representation which includes various memories of a particular time period or theme. A hierarchical model is normally used to organise chapters into categories (Conway, 2005; Conway & Pleydell-Pearce, 2000). From top to bottom, the levels are life story, lifetime periods, mini-narratives and categorised memories. One aim of this model is to reduce the complexity of long and nested chapters. When an individual is asked to recall a specific memory, he/she can start from an important category instead of a brief time period (Thomsen, Pillemer, & Ivcevic, 2011).

The LSB model consists of three elements: collecting, annotating and maintaining information and materials (Bayer & Reban, 2004). Data is collected from the archives of the person and his/her family and friends. Data may be electronic or material; examples from the second category are an artefact related to one's hobbies or a postcard from a friend. These data and materials are organised chronologically and related to persons, places or events which have played a significant role in one's life. The second step includes devising annotations for these materials, which can be potential cues for memory identification. The final step, maintenance, includes editing and updating the content as significant life events occur or new material is added to the collection.

Due to their predominantly paper-based format, traditional LSBs have three obvious limitations, related to their information capacity, content management, and dynamic retrieval.

• Information capacity. As information is collected over the whole lifespan of the individual, an LSB needs to deal with a huge number of fragmented and heterogeneous documents in different formats and of varied quality. This problem can be addressed by developing a solution which is fully integrated with a database.
• Content management. LSBs have a weak structure and do not benefit from the existing semantic associations. For example, content belonging to one category (e.g. 'children') may have obvious semantic similarity with another (e.g. 'special days'). Such connections could easily be detected and established at the conceptual level. Therefore, LSBs need a robust mechanism for detecting and identifying topics within the life story content.
• Dynamic retrieval. Conventional LSBs have limited functionality, as both authoring and retrieval depend mainly on manual work. Advanced information retrieval techniques can be employed to facilitate the retrieval process.

This research attempts to address the limitations of conventional LSBs by creating a system with improved management of personal memories stored in a digital format. The next section reviews text classification, clustering and topic modelling algorithms used in information retrieval.

3. Literature review

3.1. Text classification and clustering

Text classification and clustering are approaches to text categorisation which automatically identify topics and similar information objects.

For a given document set $D = \{d_1, \ldots, d_i\}$ and a category set $C = \{C_1, \ldots, C_j\}$, text classification is formally defined in information retrieval as a method $\gamma$ of mapping a document $d_i$ to a class $C_j$, i.e. $\gamma : d_i \to C_j$. This mechanism is called text classification, and $\gamma$ is called a classifier. If the set $C$ represents natural groups of the document set, the mechanism is referred to as text clustering (Andrews & Fox, 2007). Both categorisation and clustering enhance information retrieval, as the process improves the system's accuracy and helps to organise the retrieval results. Each category $C_j$ is usually treated as a topic, and all documents under this category (or cluster) belong to that topic.

Developing reliable, high-performance classification and clustering methods is a well-researched topic in information retrieval. In general, most of the methods developed represent data in high-dimensional spaces. For example, if the terms in a document are used as features, then the number of these terms is the dimension of the feature space of this particular document. The feature space of the dataset or document collection is then a combination of the feature spaces of all the documents included in it. In reality, large datasets have high dimensionality, which reduces the efficiency and accuracy of clustering. This problem is addressed by applying dimension reduction algorithms for feature selection and feature extraction, which convert high-dimensional data into a lower dimensional space. By selecting a subset of the features, feature selection algorithms remove noise and errors, and then use the principal features to analyse the data (Guyon & Elisseeff, 2003; Manning, Raghavan, & Schütze, 2008). Data transformation and feature extraction algorithms such as Principal Component Analysis (PCA) and its variations extract the principal components of datasets and then analyse the inter-correlation between the features (Abdi & Williams, 2010; Erkmen & Yıldırım, 2008; Pasi, 2009; Wall, Rechtsteiner, & Rocha, 2003).
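To make the high-dimensional representation concrete, the short Python sketch below builds a term-document count matrix for a three-document toy corpus and reduces the document vectors with PCA computed via SVD of the centred matrix. The corpus, the variable names and the choice of k = 2 are illustrative assumptions, not material from the paper.

# Toy illustration: documents as term-count vectors, reduced with PCA.
import numpy as np

docs = ["the queen visited cardiff",
        "the prince visited london",
        "football results and scores"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document count matrix: one row per term, one column per document.
M = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# PCA on the documents: centre the document vectors, then keep the top-k
# principal components obtained from the SVD of the centred matrix.
k = 2
X = M.T                          # documents as rows
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
docs_k = Xc @ Vt[:k].T           # documents projected onto k components
print(docs_k.shape)              # (3, 2)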

3.2. Topic identification

Three main algorithms are applied in information retrieval to facilitate topic identification of textual information: Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (pLSI) and Latent Dirichlet Allocation (LDA). LSI is based on matrix factorization, while the other two rely on probability theory.

LSI (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990) factorizes the term-document matrix into three matrices containing the left singular vectors, the singular values and the right singular vectors, respectively. A singular value indicates the importance of its related singular vectors (left and right); e.g. a greater singular value indicates greater significance. The left and right singular vector matrices represent the semantic space of terms and documents. In terms of its geometric interpretation, LSI projects documents from a high-dimensional space into a lower-dimensional semantic space, where terms (documents) with similar meaning have a high probability of being placed on one dimension.

pLSI (Hofmann, 1999) is based on a latent variable model called the aspect model. It associates hidden classes (topics) with observations (terms). In pLSI, each document is a combination of topics that follows a multinomial distribution, and one document can have multiple topics. Hofmann (1999) claims that pLSI outperforms LSI. This was disputed by later research, which found that the evaluation dataset used was small and that the complexity of the evaluation data was far lower than any real-life IR system would have (Wei & Croft, 2006). Moreover, in some situations pLSI produces overfitting, for example when the number of variables (topics) of the pLSI model grows linearly with the number of documents in the dataset. This means that the computational cost of using pLSI with large datasets is prohibitive; adding new documents to an existing pLSI model has also been found difficult (Blei, Ng, & Jordan, 2003).

LDA, developed by Blei et al. (2003) to address the limitations of pLSI, plays an essential role in statistical topic modelling. It solves the overfitting problem by limiting the topic mixture distribution through a k-parameter hidden random variable. A limitation of LDA is its weak identification of correlated topics (Blei & Lafferty, 2007). In addition, one disadvantage of LDA is the complexity of the model itself, as parameter selection directly impacts its complexity. Moreover, there is no formal way to set the parameters, which necessitates their empirical selection. Another disadvantage is the need to evaluate LDA performance with a large data collection, such as web data (Wei & Croft, 2006).

3.3. Dimension reduction of text using Singular Value Decomposition

The core technology of LSI is Singular Value Decomposition (SVD), a matrix factorization method that plays an essential role in multivariate data analysis, e.g. text analysis, image processing, and the analysis of gene expressions.

The method can be formally represented in the following way. Let $M$ be a real $m \times n$ matrix with rank $r$ (i.e. the number of linearly independent rows/columns of $M$). Then $M$ can be factored as

$$M = U \Sigma V^T \quad (1)$$

where $U$ and $V$ are $m \times m$ and $n \times n$ orthogonal matrices, respectively, so that $I = U^T U = V^T V$; $\Sigma$ is a diagonal matrix of the same size as $M$. The nonnegative entries on its main diagonal are called the singular values of $M$: $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$, where $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0$ when $1 \leq i \leq r$, and $\sigma_{r+1} = \sigma_{r+2} = \cdots = \sigma_n = 0$ when $r+1 \leq i \leq n$.

As a low-rank approximation, truncated SVD converts data from an $r$-dimensional to a $k$-dimensional space, where $1 \leq k \ll r$. In other words, it finds a rank-reduced matrix $M_k$ by minimising the Frobenius-norm difference $\Delta$ between the initial matrix $M$ and $M_k$, i.e.

$$\|M\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}, \quad \Delta = \|M - M_k\| \quad (2)$$

where $\Delta$ is the information loss (noise). The information loss is determined by the parameter $k$; for example, a greater $k$ indicates less information loss.

Let $u_i$ and $v_i$ denote a column of $U$ and a row of $V^T$; then $u_i$ and $v_i$ are called the left and right singular vectors of $M$, respectively. The rank-$k$ approximation $M_k$ of $M$ is represented as

$$M_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} u_i \cdot \sigma_i \cdot v_i^T, \quad k \ll r \quad (3)$$

Formula (4) is applied to convert a vector $q$ into the $k$-dimensional space:

$$\tilde{q} = q^T U_k \Sigma_k^{-1} \quad (4)$$
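The following numpy sketch mirrors formulas (1)-(4) on a toy 4 x 3 term-document matrix. The matrix, the query and the choice k = 2 are invented for illustration; numpy's SVD is used in place of whatever implementation an LSI system would employ.

import numpy as np

M = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])                       # terms x documents

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U Sigma V^T      (1)

k = 2                                              # target dimension, k << r
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
Mk = Uk @ np.diag(sk) @ Vtk                        # rank-k approximation (3)

delta = np.linalg.norm(M - Mk, 'fro')              # information loss     (2)

q = np.array([1., 0., 1., 0.])                     # a query over the terms
q_k = q @ Uk @ np.diag(1.0 / sk)                   # fold q into k-space  (4)
print(delta, q_k)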

LSI uses truncated SVD to partly solve the polysemy and synonymy problem. Besides analysing the most frequent usage of common terms, LSI also considers their co-occurrence in contexts, which enables it to identify the semantic similarity of information objects. LSI outperforms the Vector Space Model (VSM) for the following reasons. In the VSM, the similarity between information objects is measured on the basis of the common terms they contain. The limitation of VSM is that it is hard to distinguish the meaning of the terms, which is important when dealing with polysemy and synonymy. For instance, VSM cannot distinguish between 'apple' as a type of fruit and 'apple' as a brand, as it treats them identically based on their spelling. This causes disordered VSM retrieval results, leading to reduced precision; users therefore need to filter the results based on their own knowledge. The second problem is synonymy. For instance, two documents may contain the words 'sorrowful' and 'sad', respectively. If one of them is used in a query, VSM would not return the other as a retrieval result, although the two terms are synonyms. This has a negative impact on recall. These two problems illustrate why VSM is not used for complicated tasks such as topic identification and relation detection.

4. Sem-LSBs conceptual model and ontology

4.1. Conceptual model

In the context of this research, an information object is defined as an atomic element of textual information (Shi & Setchi, 2010). It must include at least one named entity, and can be distinguished from or related to other information objects by the named entities (or other semantic features) included in it. For example, the photos and their annotations in Fig. 1 are both information objects, as they refer to named entities. The second photo in the figure is annotated with the sentence 'Me, Christopher, Alison and Pebbles on holiday on the Norfolk Broads.', which includes the named entities of the four participants and the place.

As shown in Fig. 2, the conceptual model of the Semantic Life Story Book (Sem-LSB) includes modules for natural language processing (NLP), named entity recognition (NER), semantic feature matching (selection), clustering and indexing. The model is based on processing text, which necessitates that all data input is carefully annotated first. The NLP module executes tokenization, stop-word removal, stemming, and corpus building. It removes unimportant terms and symbols, and splits textual data into tokens. A lexical ontology is used at this stage to help identify foreign languages and typos. The NER module detects and labels the named entities. Next, semantic feature matching based on the labelled named entities of each information object is used to retrieve similar named entities from a user-oriented ontology; it then expands the feature space of the information object by including all these similar named entities, which are further used as semantic features. The Onto-SVD algorithm, explained in detail in the next section, identifies the topics of the information objects and detects the semantic similarity between them. The clustering and indexing modules process the data further and store it in clusters. Finally, the module for dynamic generation of content provides an intelligent way of automatically generating content based on the user's input. The output is a series of reminders of life events in a story-like format, which can help the user review and recall events from his/her life. The user-oriented ontology provides knowledge support to the semantic feature matching module, whose task is to retrieve the most relevant features.
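The skeleton below illustrates one possible composition of these modules. Every function and data structure here is a hypothetical placeholder written for this description; the paper does not publish an implementation, and real NLP and NER stages would of course be far richer.

from dataclasses import dataclass, field

@dataclass
class InformationObject:
    text: str                       # annotation of a photo, clip, etc.
    tokens: list = field(default_factory=list)
    entities: list = field(default_factory=list)
    features: list = field(default_factory=list)

def nlp(io):                        # tokenization (stop-word removal and
    io.tokens = [t.lower() for t in io.text.split()]   # stemming omitted)
    return io

def ner(io, known_entities):        # label tokens found in the ontology
    io.entities = [t for t in io.tokens if t in known_entities]
    return io

def match_features(io, ontology):   # expand entities with ontology neighbours
    io.features = list(io.entities)
    for e in io.entities:
        io.features.extend(ontology.get(e, []))
    return io

ontology = {"elizabeth": ["philip", "charles"]}    # toy people-ontology links
io = match_features(ner(nlp(InformationObject(
    "Elizabeth at Westminster Abbey")), ontology.keys()), ontology)
print(io.features)                  # ['elizabeth', 'philip', 'charles']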

4.2. Ontology

Ontologies are nowadays widely used to represent knowledge using knowledge structures, relations and properties of concepts or entities. There are two main types of ontology: generic and domain-specific. A generic ontology contains general concepts from many domains. In contrast, a domain-specific ontology represents the knowledge of a specific domain.

Fig. 1. An LSB page [from http://www.alzscot.org].

Fig. 2. Sem-LSBs conceptual model.

4.2.1. User-oriented ontology

For well-defined applications like the one described in this paper, it would be computationally expensive and very costly to apply a generic ontology for information analysis. Most generic ontologies have a complex structure and contain a considerable amount of information, most of which is irrelevant to the events in one's life. In contrast, a domain-specific ontology has lower complexity and is more suitable for personal information management, and especially for the management of stored memories (Shi & Setchi, 2010). This work defines and formalises a user-oriented ontology. Its specification includes:

• Interactive building. A user-oriented ontology involves users in its interactive creation and modification. In most cases ontologies are created and maintained by knowledge experts, while a user-oriented ontology expects users to undertake these processes themselves. The personal information belongs to the users, so they understand the facts it contains better than knowledge experts do. This mechanism lets users apply their knowledge intuitively, and allows them to decide what essential knowledge needs to be represented.
• Naïve structure. The knowledge (feature) selection for a user-oriented ontology employs the Occam's Razor principle (Gheyas & Smith, 2010; Koller & Sahami, 1996): users need to select a minimum number of significant features to build the ontology, in order to reduce redundancy and irrelevance as much as possible. Using a large number of features in an ontology may cause ambiguity and conflict. The benefit of employing this principle is therefore twofold: decreased complexity and improved computational performance.
• Homogeneous semantic topics. Named entities and their relations are the fundamental elements of a user-oriented ontology. For a single ontology, all entities belong to the same semantic category (topic), representing the domain knowledge of the ontology.
• Flexible structure. The purpose of a user-oriented ontology depends on its application scope. It can be lightweight and include a relatively small number of entities and relations. Without violating the 'naïve structure' principle, it can also be of a larger size.

4.2.2. Building principle

The principle underlying the Life Story Book model is based on the user's involvement in building the backbone of the Semantic Life Story Book. This includes populating the ontology with information about their family tree and important events in their life. Users themselves determine the scope of the Life Story Book, i.e. who among their family and friends and what topics they want included in it. The selection process is completely intuitive. It also enables semantic feature selection, which identifies named entities within the scope of the ontology as semantic features. Typical categories of named entities include organizations, persons, locations, dates, times, monetary values and percentages (Marsh & Perzanowski, 1998). In the case of the user-oriented ontology, the categories of named entities can vary according to the user's preferences. For example, all named entities within an ontology representing a person's social hub would belong to the category 'people', and they would all be connected via suitable relations. A weight is used to distinguish between different types of relation. It represents, as a relative value, the strength of the relation between any two connected named entities. For example, 'spouseOf' (e.g. having weight = 1) is a closer relation than 'friendOf' (e.g. with weight = 0.5).

The semantic representation of a user-oriented ontology is determined by its category. Thus a single ontology may not be suitable for representing a heterogeneous dataset with several topics. To address this, a multi-ontology model is proposed, in which each single ontology has its own domain. A predefined relation set is applied to connect entities from different ontologies. For example, if ontology ⟨people⟩ and ontology ⟨location⟩ contain 'Elizabeth' and 'Cardiff' respectively, these named entities can be connected by a relation 'visited' if an information object depicting the event exists. In the multi-ontology model, each entity can be mapped to one or more entities. For each information object (often containing more than one named entity) there are four methods of mapping its named entities to the related ontologies: one-to-one, one-to-many, many-to-one and many-to-many. In the single-ontology model, the entities are connected horizontally by relations. In addition to horizontal relations, the multi-ontology model has vertical connections between entities from different ontologies. This provides a mechanism for representing more complex knowledge structures.
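As an illustration only, one simple in-memory form of such a weighted multi-ontology is a dictionary of typed, weighted edges per ontology, plus a list of cross-ontology (vertical) relations. The spouseOf/parentOf weights follow the paper's example values; the 'visited' weight and the dictionary layout are assumptions made for this sketch.

people = {                          # people ontology: weighted, typed edges
    ("Elizabeth", "Philip"): ("spouseOf", 1.0),
    ("Elizabeth", "Charles"): ("parentOf", 1.0),
    ("Philip", "Charles"): ("parentOf", 1.0),
}
cross_relations = [                 # vertical links between ontologies
    ("Elizabeth", "visited", "Cardiff", 0.5),   # weight is an assumption
]

def weight(ontology, a, b):
    """Return the strongest relation weight between two entities."""
    rel = ontology.get((a, b)) or ontology.get((b, a))
    return rel[1] if rel else 0.0

print(weight(people, "Philip", "Elizabeth"))    # 1.0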

4.2.3. Illustrative example

Fig. 3 shows a simple example of two information objects linked to three user-oriented ontologies. The two information objects are: '20 November 1947: Princess Elizabeth and Prince Philip leave Westminster Abbey after their wedding', and 'Prince Charles and Princess Diana on the balcony of Buckingham Palace in 1981'. Eight named entities are extracted from the annotations and used as semantic features: 'Elizabeth', 'Philip', 'Charles', 'Diana', 'Westminster Abbey', 'Buckingham Palace', '1947' and '1981'. The relations 'celebrate (wedding)', 'attend (activity)' and 'happendOn (date)' establish the vertical connections and link the three ontologies: ⟨people⟩, ⟨location⟩ and ⟨time⟩. The representation of the two information objects shown in Fig. 3 enables them to be linked by semantic relations which are preserved by the ontologies.

Table 1 shows a dataset containing five information objects io1–io5. The underlined terms are used in this example as semantic features. Two retrieval algorithms, namely VSM and SVD, have been tested with the query 'Queen Elizabeth'. The VSM algorithm retrieves information objects io1 and io5, as they both include the word 'Elizabeth'. When the SVD algorithm is used, the result also includes information object io4: although io4 contains no terms in common with the query, Westminster Abbey provides a cue to a latent semantic relation.

However, as shown by this example, the SVD-based approach also has limitations, because the two remaining objects, io2 and io3, are also related to Queen Elizabeth's life. It is well known that Elizabeth and Philip are connected to Charles by a parentOf relation, and both Westminster Abbey and Windsor Castle are official residences of the Queen. Therefore, in the context of this study, the retrieval result should ideally also contain io2 and io3. However, this type of connection is hidden from SVD, which only considers term co-occurrences. If the co-occurrence is very low or zero, it fails to detect the connection. A new, improved SVD algorithm called Onto-SVD is designed to address this problem; it uses additional domain knowledge which is specific to each individual, e.g. their family, circle of friends, locations, and important events in their life.

5. Onto-SVD

Onto-SVD is based on the idea that the main concept and semantic meaning of each information object can be represented semantically through the combination of terms and named entities included in it. As mentioned before, textual data is normally represented in a high-dimensional space where each term or named entity is treated as a dimension. The Onto-SVD algorithm presented in this section extends this approach by combining semantic feature selection with a user-oriented ontology, and uses SVD as the dimension reduction method to achieve topic identification based on semantic similarity.

5.1. Algorithm

This algorithm uses named entities as semantic features. Let $O = \{o_1, \ldots, o_j\}$ be a set of ontologies and $E_i = \{e_x \mid e_x \in io_i, x \geq 1\}$ the set of entities identified within an information object $io_i$. The feature set of $io_i$ is represented as $F_i = \{f_y \mid f_y \in E_i, f_y \in o_j, y \geq 1\}$, which indicates that a semantic feature $f_y$ is contained in both the information object $io_i$ and an ontology $o_j$. A normalisation function $g(f_y)$ is used to represent the feature weight (frequency). Each information object is then converted to a feature vector, where $v_f(io_i)$ denotes the feature vector of $io_i$:

$$v_f(io_i) = \overbrace{g(f_1) \wedge (f_1 \in o_1), \ldots, g(f_{n_1}) \wedge (f_{n_1} \in o_1)}^{n_1} \vee \overbrace{g(f_{n_1+1}) \wedge (f_{n_1+1} \in o_2), \ldots, g(f_{n_2}) \wedge (f_{n_2} \in o_2)}^{n_2} \vee \ldots \vee \overbrace{g(f_{n_{j-1}+1}) \wedge (f_{n_{j-1}+1} \in o_j), \ldots, g(f_y) \wedge (f_y \in o_j)}^{n_j} \quad (5)$$

where $n_j$ is the number of named entities of $io_i$ belonging to $o_j$, $l = \sum n_j$ is the total number of entities in $io_i$, and $j$ is the number of ontologies ($j = 1$ for a single-ontology model and $j > 1$ for a multi-ontology model). For example, the information object io3 in Table 1 contains seven named entities which belong to three different ontologies, namely people (Camilla, Harry, William, Laura, Tom Parker Bowles), location (Cornwall), and time (2005). Then $l = \sum_{j=1}^{3} n_j = n_1 + n_2 + n_3 = 5 + 1 + 1 = 7$.
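A small sketch of formula (5) follows: the entities of an information object are grouped per ontology and normalised into one concatenated feature vector. The paper leaves the normalisation g() abstract; here it is taken, as an assumption, to be the entity frequency divided by the total entity count l, and the entity sets are toy data.

import numpy as np

ontologies = {                       # entity -> ontology membership (toy data)
    "people":   {"camilla", "harry", "william", "laura", "tom"},
    "location": {"cornwall"},
    "time":     {"2005"},
}

def feature_vector(tokens):
    """Concatenate per-ontology feature weights g(f_y) into one vector."""
    found = [t for t in tokens
             if any(t in ents for ents in ontologies.values())]
    l = max(len(found), 1)           # l = sum of n_j, total entities in io_i
    parts = []
    for name, entities in ontologies.items():
        parts.extend(tokens.count(e) / l for e in sorted(entities))
    return np.array(parts)

io3 = "9 april 2005 camilla cornwall harry william laura tom".split()
print(feature_vector(io3))           # seven entities found, so l = 7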

Fig. 3. Multiple user-oriented ontology model.

Table 1
Information objects.

io1 | 20 November 1947, Princess Elizabeth and Prince Philip leave Westminster Abbey after their wedding
io2 | Prince Charles and Princess Diana on the balcony of Buckingham Palace in 1981
io3 | 9 April 2005, Camilla, Duchess of Cornwall, with Prince Harry, Prince William, Laura and Tom Parker Bowles
io4 | Prince William and Kate Middleton announced their engagement with plans to marry in summer 2011, Westminster Abbey
io5 | Windsor Castle is the oldest and largest occupied castle in the world and the official residence of Her Majesty Queen Elizabeth II

The feature space $F(D)$ of a dataset $D$ is then defined through its feature vectors:

$$F(D) = \sum_{io_i \in D}^{n} v_f(io_i) \quad (6)$$

where $n$ is the number of information objects contained in $D$. The feature matrix is next constructed using the feature space. It is a sparse matrix of the same size as the initial term–information object matrix, in which all non-entity entries are zeros. The feature matrix of the dataset $D$ is denoted by $M_F(D)$:

$$M_F(D) = \sum_{io_i \in D}^{n} \left( k_j v_f(io_i) \right)^T, \quad \begin{cases} k_j = 1, & \text{if } j = 1 \\ \sum_j k_j = 1, & \text{if } j > 1 \end{cases} \quad (7)$$

where $k_j$ is the weight of $o_j$, and $0 < k_j \leq 1$. The entries of $M_F(D)$ are the weights of the entities initially contained within the information objects.

Semantic feature selection is then used as a mechanism to enhance a named entity's semantic representation by selecting neighbours of the entity from the related user-oriented ontology. Assume an information object $io_i$ contains a set of named entities $E_i = \{e_x \mid e_x \in io_i, x \geq 1\}$. The ontology $o_j$ is treated as a weighted undirected graph $G_j = (V, R)$, where $V$ is the vertex set of the graph, $\forall e_x \in E_i : e_x \in V$, and $R$ is the edge (relation) set, which connects the entities (vertices). In the graph $G_j$, the degree $\deg(e_x)$ of a vertex $e_x$ is the number of entities (vertices) connected to it. The adjacency matrix of $G_j$ is denoted $A(G_j)$:

$$A(G_j) = a_{ij} = \begin{cases} 1, & v_i \text{ adj } v_j \\ 0, & v_i \text{ nadj } v_j \text{ or } i = j \end{cases} \quad (8)$$

where $i$ and $j$ indicate vertices $v_i$ and $v_j$. Then the degree of $v_i$ is computed as $\deg(v_i) = \sum_j a_{ij}$. The self-information of an entity $e_x$ with an outcome is $I(e_x) = -\log P(e_x) = -\log \frac{1}{\deg(e_x)}$; the outcome indicates the probability of this entity being selected as a feature in the semantic feature selection process. According to information theory (Cover & Thomas, 2006), the Shannon entropy of $e_x$ is as follows:

$$H(e_x) = \sum_{i=1}^{n} P(e_x) \cdot I(e_x) = -\sum_{i=1}^{n} \frac{1}{\deg(e_x)} \cdot \log \frac{1}{\deg(e_x)} \quad (9)$$

where $n = \deg(e_x)$. The entropy measures the semantic information of $e_x$.
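A quick numeric check of formulas (8) and (9): with P(e_x) = 1/deg(e_x), the Shannon entropy of an entity reduces to log(deg(e_x)), so higher-degree entities carry more semantic information. The 3 x 3 adjacency matrix is invented for the check.

import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]])           # three mutually connected entities

deg = A.sum(axis=1)                 # deg(v_i) = sum_j a_ij             (8)

def entropy(d):
    # H(e_x) = -sum over deg terms of (1/deg) * log(1/deg) = log(deg)  (9)
    return -sum((1.0 / d) * np.log(1.0 / d) for _ in range(d))

print(deg, [entropy(d) for d in deg])   # degrees [2 2 2] -> entropies log 2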

For each information object, the algorithm selects a named entity $e_x$ (further referred to in this paper as a detected entity) from the observed entity set $E_i$, and then extracts its related entities from the ontology $o_j$ (i.e. its neighbouring vertices in the graph representing the ontology relations). The selection of the detected entity is based on the maximum entropy principle (Berger, Pietra, & Pietra, 1996), which suggests that the detected entity $e_x$ should be the entity with the highest entropy in the corresponding information object $io_i$. As shown in (9), $H(e_x)$ is directly proportional to $\deg(e_x)$, which means that the named entity with the highest entropy is the one with the highest degree. The neighbours are the nearest entities of $e_x$ in $G_j$. Onto-SVD considers $e_x$ together with its neighbours as the semantic representation of $io_i$. Formula (10) measures the length of the edge between any two entities $e_x$ and $e_{x'}$:

$$\mathrm{Edge}(e_x, e_{x'}) = \frac{1}{\log_c \left( \mathrm{weight}(e_x, e_{x'}) + c \right)} \quad (10)$$

where $c$ is an empirical constant and $\mathrm{weight}(e_x, e_{x'})$ is the weight of the relation between entities $e_x$ and $e_{x'}$. If there is more than one relation between $e_x$ and $e_{x'}$, the one with the greater weight is selected (Shi & Setchi, 2010).

Let $N_x = \{n_1, \ldots, n_i\}$ denote the set of selected neighbours of the detected entity $e_x$, where $i = \deg(e_x)$ and $e_x \notin N_x$. The parameter $t$ is a threshold which limits the number of neighbouring entities of $e_x$ selected from $N_x$; it is used to adjust the strength of the semantic representation. Let $t(e_x)$ denote a threshold function:

$$t(e_x) = \begin{cases} \deg(e_x), & \text{if } t > \deg(e_x) \\ t, & \text{otherwise} \end{cases} \quad (11)$$


$N_t(e_x)$ denotes the neighbour set of $e_x$ with threshold $t$, where $N_t(e_x) \subseteq N_x$. Assuming the semantic feature selection process picks $t$ neighbours of an identified entity $e_x$, the process can be represented as a chain with the joint conditional probability:

$$P(e_x, N_t(e_x)) = P(n_1 \mid e_x) \ldots P(n_t \mid e_x, \ldots, n_{t-1}) = \prod_{n_t \in N_x,\, 0 \leq t \leq t(e_x)} P(n_i \mid n_0, \ldots, n_{i-1}) \quad (12)$$

where $n_0 = e_x$ and $N_t(e_x) = \{n_1, \ldots, n_t\}$. The entropy of this selection process is as follows:

$$H(e_x, N_t(e_x)) = H(e_x) + \ldots + H(n_t \mid e_x, \ldots, n_{t-1}) = \sum_{n_t \in N_x,\, 0 \leq t \leq t(e_x)} H(n_t \mid n_0, \ldots, n_{t-1}) \quad (13)$$

where $n_0 = e_x$ and $N_x(e_x) = \{n_1, \ldots, n_t\}$.

The aim of selecting neighbours is to enhance the semantic representation of the detected entity $e_x$ and reduce the risk of selecting neighbours with weak semantic representation. Therefore, the selection process also follows the maximum entropy principle (Berger et al., 1996):

$$H_t(e_x) = \arg\max \sum_{n_t \in N_x,\, 0 \leq t \leq t(e_x)} H(n_t \mid n_0, \ldots, n_{t-1}) \quad (14)$$

where $n_0 = e_x$. In other words, semantic feature selection based on maximum entropy ensures that the selected features have strong semantic value, i.e. the selected entities should have high entropy in order to maximise $H_t(e_x)$. Therefore, the selected neighbours should have the highest degree (entropy).
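The selection step can be sketched as follows: take the highest-degree (maximum entropy) entity of the information object as the detected entity, then pick up to t of its highest-degree neighbours that the object does not already contain, per (11)-(14). The adjacency mirrors the graph of the illustrative example in Section 5.2 with all edges present (so e6 has degree 4 here, whereas the worked example ignores friendOf edges); alphabetical tie-breaking is an assumption of this sketch.

adj = {
    "e1": {"e2", "e3"}, "e2": {"e1", "e3"},
    "e3": {"e1", "e2", "e4", "e5", "e6", "e7"},
    "e4": {"e3", "e6", "e7"}, "e5": {"e3", "e8", "e9"},
    "e6": {"e3", "e4", "e7", "e10"}, "e7": {"e3", "e4", "e6", "e8", "e10"},
    "e8": {"e5", "e7", "e9"}, "e9": {"e5", "e8"}, "e10": {"e6", "e7"},
}
deg = {e: len(n) for e, n in adj.items()}

def select_neighbours(io_entities, t):
    detected = max(sorted(io_entities), key=deg.get)   # highest degree/entropy
    candidates = [n for n in adj[detected] if n not in io_entities]
    candidates.sort(key=lambda n: (-deg[n], n))        # maximum entropy first
    return detected, candidates[:min(t, deg[detected])]

print(select_neighbours({"e1", "e2"}, t=2))            # io1 -> ('e1', ['e3'])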

The value of the neighbour weight is set as $g'(e_x) = \frac{1}{2^t}$. Let $v_f^t(io_i)$ and $M_F^t(D)$ denote the semantic feature vector with the feature value(s) $g'(e_x)$ and the semantic feature matrix, respectively:

$$M_F^t(D) = \sum_{io_i \in D,\, 0 \leq t \leq t(e_x)}^{n} k_j \left( v_f(io_i) + v_f^t(io_i) \right)^T, \quad \begin{cases} k_j = 1, & \text{if } j = 1 \\ \sum_j k_j = 1, & \text{if } j > 1 \end{cases} \quad (15)$$

where $t$ is the threshold. The semantic feature matrix $M_F^t$ has the same size as the initial term–information object matrix $M$. After applying the semantic features to the initial matrix $M$, the semantically enhanced matrix $M_{enhanced}$ is produced:

$$M_{enhanced} = M + M_F^t(D) \quad (16)$$

The next step of the algorithm uses SVD to decompose $M_{enhanced}$:

$$M_{enhanced} \cong U_k \Sigma_k V_k^T \quad (17)$$

where $U_k$ and $V_k^T$, multiplied by their corresponding singular values, are treated as the projections of the terms (entities) and the information objects in the $k$-dimensional space, respectively. As the $k$-dimensional space has the same attributes as a Euclidean space, cosine similarity is applied to compare vectors in this space:

$$\mathrm{sim}_{\cos}(a, b) = \frac{v_a \cdot v_b^T}{\|v_a\|_2 \cdot \|v_b\|_2} \quad (18)$$

This measure is used to calculate the semantic similarity of information objects, as shown in the illustrative example below.
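To make the whole chain (15)-(18) concrete, the numeric sketch below enhances a term-information object matrix with ontology-derived features weighted 1/2^t, decomposes the result with truncated SVD, and compares two information objects by cosine similarity. The matrices follow Tables 2(a) and 2(c) of the illustrative example below; everything else (k = 2, the comparison pair) is an arbitrary choice for the sketch.

import numpy as np

M = np.zeros((10, 5))                      # Table 2(a): entities e1..e10 (rows)
for i, j in [(0, 0), (0, 4), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2),
             (6, 2), (6, 3), (7, 2), (8, 2), (9, 3)]:
    M[i, j] = 1.0                          # against objects io1..io5 (columns)

t = 2
M_F = np.zeros_like(M)                     # semantic feature matrix M_F^t
for i, j in [(1, 4), (2, 0), (2, 2), (2, 3), (2, 4), (3, 2),
             (5, 1), (5, 3), (6, 1)]:
    M_F[i, j] = 1.0 / 2**t                 # added neighbours, as in Table 2(c)

M_enh = M + M_F                            # (16)
U, s, Vt = np.linalg.svd(M_enh, full_matrices=False)   # (17)
k = 2
docs = (np.diag(s[:k]) @ Vt[:k, :]).T      # information objects in k-space

def cos(a, b):                             # cosine similarity (18)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(docs[0], docs[4]))               # io1 vs io5, now semantically linked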

5.2. Illustrative example

Using as an example a simple ontology representing relations within the British Royal Family, this section further develops the example introduced in Section 4.2.3 (see Table 1) to illustrate the process of creating a semantic feature matrix using a user-oriented ontology. The part of the ontology ⟨people⟩ related to the head of the Royal Family is shown in Fig. 4. Note that only part of the family tree contains labelled entities (e1–e10), as not all family members are mentioned in the small collection of information objects used in this example (Table 1).

Next, the named entities included in the information objects have to be extracted and analysed. For simplicity, times and locations are not considered in this example. For instance, information object io1 contains two named entities: Elizabeth (e1) and Philip (e2); information object io2 also contains two named entities: Charles (e3) and Diana (e4); etc. The graph representation of the ontology (Fig. 5) provides the background knowledge needed to select the features (neighbours) and then establish the semantic relations between the information objects.

Fig. 4. Part of the family tree of the Royal Family.

Fig. 5. Weighted graph representing relations within the user-oriented ontology.

To compare the degrees of the named entities, the adjacency matrix corresponding to Fig. 5 is shown below. The adjacency matrix is symmetric, and the entry $a_{ij}$ indicates the adjacency of $e_i$ and $e_j$:

$$A(G_j) = \begin{pmatrix}
0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0
\end{pmatrix}$$

For example, $e_1$ and $e_2$ are adjacent, so $a_{12} = a_{21} = 1$; $e_1$ and $e_4$ are not adjacent, so $a_{14} = a_{41} = 0$. The degree of $e_1$ is $\deg(e_1) = \sum_{1 \leq j \leq 10} a_{1,j} = 2$.

Table 2 is constructed to show the use of the user-oriented ontology to enhance the semantic representation of the dataset employed in this example. It represents the raw dataset shown in Table 1 through the named entities included in the ontology ⟨people⟩. Tables 2a–d are different representations of the same dataset. Table 2a shows the occurrence of the identified named entities; it represents only direct connections between the information objects. In other words, the correlation of the information objects depends on their shared named entities. For example, io1 and io5 are linked as they share the named entity e1; io2 is not correlated to any other information object, as it shares no named entity with the rest of the dataset. Table 2b shows the semantic feature matrix built with a threshold t = 1. It includes all named entities $e_x$ identified in the information objects, as well as one selected feature (neighbour entity) from the ontology shown in Fig. 5, and so represents both direct and indirect connections. For example, io2 is now linked with all information objects in the dataset through its entity e3. Tables 2c and d show the semantic feature matrices computed with thresholds of 2 (all added values are 1/2²) and 3 (all added values are 1/2³). As illustrated in Fig. 5, the degree set of the entities in this example is: deg(Gj) = {deg(e1) = 2, deg(e2) = 2, deg(e3) = 6, deg(e4) = 3, deg(e5) = 3, deg(e6) = 3, deg(e7) = 5, deg(e8) = 3, deg(e9) = 2, deg(e10) = 1}. Note that the friendOf relation is ignored for simplicity. Using the degree set, the additional values (all 1/2^t = 1/2² = 1/4 in Table 2c) are determined in the following way:

(i) Information object io1: the detected entity is e1 (it could also be e2, as they both have degree 2), which according to the graph has two neighbours, e2 and e3, i.e. its neighbour set is N1 = {e2, e3}. The threshold t = 2 requires two neighbouring entities to be selected. However, one of the two entities is already contained in io1. Therefore, only e3 is added to the semantic feature matrix.

(ii) Information object io2: the detected entity is e3 and its neighbour set is N3 = {e1, e2, e4, e5, e6, e7}. At t = 2, e7 (degree = 5) is selected first because it has the highest degree. One more named entity needs to be selected among the three candidates having the same degree (degree = 3); these are e4, e5 and e6, and they all carry equal amounts of semantic information. In this particular case, e7 is selected together with e6.

(iii) Information object io3: the neighbour set of the detected entity e7 is N7 = {e3, e4, e6, e8, e10}. At t = 2, the selection principle is similar to the above case, and e3 (degree = 6) is selected as the first neighbour. The candidates for the second neighbour selection are e4 and e8 (degree = 3); e6 is removed from the candidates because it is already contained in io3. In this example, e3 and e4 are selected.

(iv) Information object io4: the detected entity is e7 and its neighbour set is N7 = {e3, e4, e6, e8, e10}. At t = 2, e3 (degree = 6) is selected as the first neighbour. The candidates for the second neighbour selection are e4, e6 and e8 (degree = 3). In this example, e3 is selected with e6.

(v) Information object io5: the only entity of ontology ⟨people⟩ in it is e1; its neighbour set is N1 = {e2, e3}. At t = 2, e2 (degree = 2) and e3 (degree = 6) are both selected.

Indirect connections indicate the semantic similarity between the information objects detected using the ontology. Onto-SVD analyses the information objects using both direct and indirect connections.

6. Experiment and evaluation

6.1. Data and user-oriented ontology

The experiment aims to evaluate topic identification performance using k-means with Onto-SVD (Onto-SVDK) in comparison with k-means with SVD (SVDK). The dataset used in the evaluation is real web data comprising 2065 high-quality English articles, manually tagged with eight labelled groups (topics): Health (426 articles), Dementia Disease (420), Olympics (384), Finance (365), British Royal Family (184), FIFA (106), Celebrity (95) and Politics (94). A user-oriented ontology $o_p$: ⟨people⟩ is built to guide Onto-SVD in identifying information objects related to the British Royal Family. The relation set selected is $R_p$ = {r1: spouseOf, r2: parentOf, r3: siblingOf, r4: friendOf}. As a relative measure of semantic similarity, the weight of all these connections is set as $W_p = \{w_1 = c_0, w_2 = c_0, w_3 = c_0, w_4 = c_0\}$, where $c_0$ is a positive constant (in this experiment, $c_0 = 1$, and friendOf relations are ignored).

6.2. Evaluation

In the experiment, truncated SVD factorizes the term–information object and semantic feature matrices, and then projects the information object vectors into a lower-dimensional space. The algorithm then uses cosine similarity to measure the correlation of the information objects based on the position of the projections, and then applies the k-means algorithm to classify the result.


Table 2
Initial matrix and its related semantic feature matrices.

(a) Term–information object matrix
      io1   io2   io3   io4   io5
e1     1     0     0     0     1
e2     1     0     0     0     0
e3     0     1     0     0     0
e4     0     1     0     0     0
e5     0     0     1     0     0
e6     0     0     1     0     0
e7     0     0     1     1     0
e8     0     0     1     0     0
e9     0     0     1     0     0
e10    0     0     0     1     0

(b) Semantic feature matrix, t = 1
      io1   io2   io3   io4   io5
e1     1     0     0     0     1
e2     1     0     0     0     0
e3    1/2    1    1/2   1/2   1/2
e4     0     1     0     0     0
e5     0     0     1     0     0
e6     0     0     1     0     0
e7     0    1/2    1     1     0
e8     0     0     1     0     0
e9     0     0     1     0     0
e10    0     0     0     1     0

(c) Semantic feature matrix, t = 2
      io1   io2   io3   io4   io5
e1     1     0     0     0     1
e2     1     0     0     0    1/4
e3    1/4    1    1/4   1/4   1/4
e4     0     1    1/4    0     0
e5     0     0     1     0     0
e6     0    1/4    1    1/4    0
e7     0    1/4    1     1     0
e8     0     0     1     0     0
e9     0     0     1     0     0
e10    0     0     0     1     0

(d) Semantic feature matrix, t = 3
      io1   io2   io3   io4   io5
e1     1     0     0     0     1
e2     1     0     0     0    1/8
e3    1/8    1    1/8   1/8   1/8
e4     0     1    1/8   1/8    0
e5     0    1/8    1     0     0
e6     0    1/8    1    1/8    0
e7     0    1/8    1     1     0
e8     0     0     1     0     0
e9     0     0     1     0     0
e10    0     0    1/8    1     0

Table 3
Evaluation measures.

Precision (local):
$$\mathrm{Precision}(C_j) = \frac{|L_l \cap C_j|}{|C_j|}$$

Recall (local):
$$\mathrm{Recall}(C_j) = \frac{|L_l \cap C_j|}{|L_l|}$$

F-score (local):
$$F(C_j) = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

F-score (global):
$$F = \sum_{l=1}^{k} \frac{|L_l|}{|D|} \cdot \max_{1 \leq j \leq k} \left\{ \frac{2 \cdot \mathrm{precision}(C_j) \cdot \mathrm{recall}(C_j)}{\mathrm{precision}(C_j) + \mathrm{recall}(C_j)} \right\}$$

Purity (local):
$$\mathrm{purity}(C_j) = \frac{1}{|C_j|} \max_l \left( |C_j|_{\mathrm{label}=l} \right)$$

Purity (global):
$$\mathrm{purity} = \sum_{j=1}^{k} \frac{|C_j|}{|D|} \, \mathrm{purity}(C_j)$$

Entropy (global):
$$\mathrm{entropy} = \sum_{j=1}^{k} \frac{|C_j|}{|D|} \cdot \left( -\frac{1}{\log k} \sum_{l=1}^{k} \frac{|L_l \cap C_j|}{|C_j|} \cdot \log \frac{|L_l \cap C_j|}{|C_j|} \right)$$

Normalised Mutual Information (global):
$$\mathrm{NMI} = \frac{\sum_{l,j} |L_l \cap C_j| \log \frac{|D| \cdot |L_l \cap C_j|}{|L_l| \cdot |C_j|}}{\sqrt{\left( \sum_l |L_l| \log \frac{|L_l|}{|D|} \right) \left( \sum_j |C_j| \log \frac{|C_j|}{|D|} \right)}}$$
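As a hedged illustration of how the global measures in Table 3 can be computed from group and cluster labels, the sketch below implements purity directly and delegates NMI to scikit-learn's normalized_mutual_info_score; the label arrays are invented.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(groups, clusters):
    groups, clusters = np.asarray(groups), np.asarray(clusters)
    total = 0
    for c in np.unique(clusters):
        members = groups[clusters == c]      # true labels inside cluster c
        total += np.bincount(members).max()  # majority-label count
    return total / len(groups)

groups   = [0, 0, 0, 1, 1, 2, 2, 2]          # labelled topics L_l
clusters = [0, 0, 1, 1, 1, 2, 2, 0]          # clustering output C_j
print(purity(groups, clusters),              # 0.75
      normalized_mutual_info_score(groups, clusters))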

Table 4
Clustering performance of SVDK with different dimensions (global).

SVDK      d = 80   d = 90   d = 100  d = 110  d = 120
Purity    0.4383   0.4281   0.4610   0.4591   0.4368
F-score   0.3730   0.3696   0.4080   0.4027   0.3571
Entropy   0.4667   0.5041   0.5393   0.5418   0.5179
NMI       0.2362   0.2171   0.2462   0.2345   0.2540

Table 5
Clustering performance of SVDK with different dimensions (local).

SVDK       d = 80   d = 90   d = 100  d = 110  d = 120
Precision  0.4077   0.4082   0.4018   0.4546   0.2449
Recall     0.4446   0.5435   0.4783   0.4052   0.6467
F-score    0.4254   0.4662   0.4367   0.4285   0.3553


The output includes k clusters of correlated information objects. When k = 8, the clusters are supposed to match the groups (topics). After comparing the groups with the clusters, the experiment evaluates the topic identification performance (clustering performance) of Onto-SVDK and SVDK.

The evaluation measures employed are listed in Table 3 (Manning et al., 2008). The table uses the following notation: $D$ denotes the dataset; $|D|$ is the number of information objects; each information object has an initial label describing its topic; $k_l$ is the number of labelled groups; $L_l$ and $C_j$ denote the information objects in a labelled group $l$ and a cluster $j$, respectively; $|L_l|$ and $|C_j|$ are their total numbers; $|L_l \cap C_j|$ is the number of correctly classified information objects (i.e. documents which are members of both a group $L_l$ and a cluster $C_j$).

As shown in Table 3, precision is computed as the number of relevant documents returned by a search divided by the total number of documents retrieved by that search; in the context of this experiment, it reflects the correctly classified information objects $|L_l \cap C_j|$ within a cluster $C_j$. Recall is the number of relevant documents returned by a search divided by the total number of existing relevant documents, i.e. the correctly classified information objects $|L_l \cap C_j|$ within a group $L_l$. The F-score combines the two measures by computing the harmonic mean of precision and recall. The local and global F-scores evaluate the particular cluster (related to the ontology) and all the clusters, respectively; the same holds for local and global purity. Purity is the maximum number of correctly classified documents within a cluster divided by its size. Entropy represents the uncertainty of the global clustering result; it relies on the summation of the cluster entropies weighted by the proportion of the cluster size to the dataset size, i.e. $|C_j|/|D|$. For entropy, a smaller value indicates lower uncertainty of the result, which indicates better clustering performance. NMI (Normalised Mutual Information) measures the reduction in uncertainty of the global clustering result: a greater value indicates a greater reduction in uncertainty.

The experiment explores the effect of semantic feature selection, i.e. establishing indirect connections between information objects with the help of the ontology. As mentioned before, the threshold t is the number of selected neighbouring entities (vertices) of an identified $e_x$ of $io_i$ from $G_j$ ($o_j$). Therefore, the construction of the term–information object matrix (with semantic features) of the dataset depends on the value of t. In the experiment, t is set to 2, 4, 6 and 8; to avoid overfitting, the maximum threshold is limited to 8. Tables 4 and 5 list the global and local clustering performance of SVDK on the dataset. For global clustering performance (Table 4), when d = 100 SVDK has the best purity (0.4610) and F-score (0.4080), as well as an acceptable entropy (0.5393) and NMI (0.2462). Moreover, for local clustering performance (Table 5), when d = 100 SVDK shows adequate precision (0.4018), recall (0.4783) and F-score (0.4367). Dimension 100 is therefore considered the optimal dimension of SVDK, and the results achieved when d = 100 are used to compare the performance of the two algorithms.
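For orientation, a minimal scikit-learn rendering of the SVDK baseline described above might look as follows; the six-document corpus is a placeholder, and the numbers of clusters and components are scaled down from the paper's k = 8 and d = 100 purely so the toy example runs.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans

corpus = ["the queen opened parliament", "dementia care at home",
          "olympic medal table", "bank interest rates rise",
          "royal wedding at westminster abbey", "football cup draw"]

X = TfidfVectorizer().fit_transform(corpus)          # term-document matrix
X_k = TruncatedSVD(n_components=2).fit_transform(X)  # d-dimensional projection
X_k = Normalizer().fit_transform(X_k)                # unit length, so Euclidean
                                                     # k-means behaves like the
                                                     # cosine measure above
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_k)
print(labels)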


Table 6
Onto-SVDK with different thresholds t when d = 100 (global).

Onto-SVDK  t = 2    t = 4    t = 6    t = 8
Purity     0.7036   0.7698   0.6261   0.6015
F-score    0.7327   0.7910   0.5664   0.5423
Entropy    0.3556   0.4069   0.4451   0.4693
NMI        0.4846   0.5484   0.4278   0.4207

Table 7
Onto-SVDK with different thresholds t when d = 100 (local).

Onto-SVDK  t = 2    t = 4    t = 6    t = 8
Precision  0.3872   0.6693   0.5339   0.5401
Recall     0.9239   0.9239   0.7283   0.8043
F-score    0.5457   0.7763   0.6161   0.6462

Table 8
Cluster 'Royal Family' generated by SVDK and Onto-SVDK, when dimension d = 100 (local). Entries are the numbers of information objects from different topics within the cluster.

Topic          SVDK   Onto-SVDK
                      t = 2   t = 4   t = 6   t = 8
Finance           3       5       2       2       2
Dementia          0       1       2       2      10
FIFA              1       0       7      17      18
Health            7      19       7       7       7
Politics         36      63      16      17      17
Olympics         24      93      24      47      47
Celebrity        60      89      26      25      25
Royal Family     88     169     170     134     148
Total           219     439     254     251     274

Table 9
Onto-SVDK with different dimensions, when t = 4 (global).

Onto-SVDK  d = 80   d = 90   d = 100  d = 110  d = 120
Purity     0.8029   0.7690   0.7698   0.7603   0.7622
F-score    0.8435   0.7932   0.7910   0.7757   0.7881
Entropy    0.3951   0.4325   0.4069   0.3878   0.3917
NMI        0.6148   0.5472   0.5484   0.5475   0.5479


Furthermore, the nearby dimensions are also selected as testing dimensions.

To compare with the optimal result of SVDK, the dimension of Onto-SVDK is set to SVDK's optimal dimension, d = 100. Table 6 shows the impact of the semantic feature selection on the global clustering performance at this dimension. With the threshold t equal to 2, 4, 6 or 8, Onto-SVDK displays different performance; as expected, all the results of Onto-SVDK outperform the optimal result of SVDK. Onto-SVDK achieves its best purity (0.7698), F-score (0.7910), entropy (0.4069) and NMI (0.5484) when the threshold is 4. In other words, within the same dimension, whichever number of semantic features is used, the topic identification performance of Onto-SVDK is better than that of SVDK.

Fig. 6. Details of the local cluster, when d = 100 (local).

In Table 7, when the threshold t = 2, the local clustering performance has a high recall (0.9239) but a low precision (0.3872). This means that Onto-SVDK identifies 92.39% of the information objects related to "Royal Family", but information objects from other topics are also wrongly placed in the cluster. When t = 4, Onto-SVDK reaches its best local clustering performance (precision 0.6693, recall 0.9239 and F-score 0.7763). When t = 6 and t = 8, the performance goes down slightly. The results indicate that the clustering performance is not directly proportional to the value of the threshold t, which can be explained by two factors. Firstly, some information objects belonging to different topics share content, so a low threshold is not sufficient to enhance the semantic representation of the information objects to a proper level; the information objects remain semantically indistinguishable (see, for example, the local clustering result when t = 2). Secondly, a high threshold causes overfitting, i.e. the use of excess features in cases of semantic ambiguity, e.g. when the same name is shared by different persons. If such semantic features are overused, the probability of bringing irrelevant information objects into a cluster is high (see, for example, the local clustering result when t = 8).
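The local evaluation behind these numbers can be mimicked with the earlier sketch: for each t, cluster the collection and score the predicted cluster that best matches the target topic. The `docs`, `onto` and `gold_labels` variables are assumptions carried over from the previous example.

```python
# Hypothetical threshold sweep, reusing onto_svdk from the sketch above.
from collections import Counter

def local_f_score(pred, gold, topic="Royal Family"):
    """Score the predicted cluster that holds most objects of `topic`."""
    best = Counter(c for c, g in zip(pred, gold) if g == topic).most_common(1)[0][0]
    tp = sum(1 for c, g in zip(pred, gold) if c == best and g == topic)
    size = sum(1 for c in pred if c == best)         # cluster size
    total = sum(1 for g in gold if g == topic)       # topic size in the dataset
    p, r = tp / size, tp / total
    return 2 * p * r / (p + r)

for t in (2, 4, 6, 8):                               # the thresholds tested above
    pred = onto_svdk(docs, onto, t=t, d=100, k=8)
    print(f"t = {t}: local F-score = {local_f_score(pred, gold_labels):.4f}")
```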

To assess the local clustering performance under the same tests (dimension d = 100 and thresholds t = 2, 4, 6, 8), the local clusters generated by the two algorithms, SVDK and Onto-SVDK, have been examined in terms of the number of information objects from different topics (e.g. "Finance", "Dementia", etc.) included in the local cluster (i.e. the one related to "Royal Family"). As Table 8 shows, the cluster generated by SVDK includes 219 information objects, 88 of which were originally tagged as "Royal Family". However, this cluster also contains 3 information objects from "Finance", 1 from "FIFA", etc. Fig. 6, plotted using the data from Table 8, shows that the local clustering performance of Onto-SVDK is better than that of SVDK, especially when t = 4, 6 and 8. Table 8 and Fig. 6 also indicate that some of the topics are hard to distinguish, e.g. some of the information objects in the clusters "Politics", "Celebrity" and "Olympics" are also related to "Royal Family". This is also the reason why SVDK fails to identify the difference on a semantic basis. Onto-SVDK applies the ontology to enhance the semantic representation of the information objects related to "Royal Family", and thereby addresses the problem well.
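As a quick sanity check, the t = 4 column of Table 8 reproduces the local metrics reported in Table 7: the cluster contains 254 objects, 170 of which are correctly labelled "Royal Family". Assuming the dataset holds 184 "Royal Family" objects in total (a figure implied by the reported recall rather than stated here), we get

\[
P = \frac{170}{254} \approx 0.6693, \qquad
R = \frac{170}{184} \approx 0.9239, \qquad
F = \frac{2PR}{P + R} \approx 0.7763.
\]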


Fig. 7. Clustering performance improvement of Onto-SVDK (global).

Table 10
Onto-SVDK with different dimensions, when t = 4 (local).

Onto-SVDK, t = 4   d = 80   d = 90   d = 100   d = 110   d = 120
Precision          0.6842   0.6693   0.6693    0.6883    0.6746
Recall             0.9185   0.9231   0.9239    0.9239    0.9239
F-score            0.7842   0.7760   0.7763    0.7889    0.7798



As mentioned before, Onto-SVDK has an optimal threshold of t = 4 when the dimension d = 100. To compare the global and local clustering performance of Onto-SVDK and SVDK at different dimensions, the threshold of Onto-SVDK is fixed at t = 4. Table 9 shows the global clustering performance of Onto-SVDK when tested with different dimensions, from 80 to 120. The comparison with Table 4 shows that Onto-SVDK outperforms SVDK at all the tested dimensions. In other words, for a fixed set of selected semantic features, the topic identification performance of Onto-SVDK is better than that of SVDK across the tested dimensions. Fig. 7 shows the performance improvement of Onto-SVDK over SVDK at the different dimensions; the average improvement is 32.82% in purity, 41.62% in F-score, 21.38% in entropy (reduced) and 32.36% in NMI.

Table 10 shows the local clustering performance of Onto-SVDK with different dimensions, from 80 to 120, when t = 4. The comparison with Table 5 shows that Onto-SVDK outperforms SVDK. The best local clustering performance of Onto-SVDK is precision 0.6883, recall 0.9239 and F-score 0.7889, achieved at d = 110. Fig. 8 presents the performance improvement of Onto-SVDK over SVDK; the average improvement is 29.37% in precision, 41.90% in recall and 35.86% in F-score.

Fig. 8. Clustering performance improvement of Onto-SVDK (local).

Fig. 9. Clustering result produced by Onto-SVDK.

Fig. 9 shows part of the clustering result of Onto-SVDK when tested with the Royal Family ontology. It displays four topics, namely "Royal Family", "Politics", "Finance" and "Celebrity". The result shows that information objects which are not related to the topic "Royal Family" but share similar content can be successfully identified by Onto-SVDK. This facilitates the information management process of the Sem-LSB system. In addition, the cluster-based data provides a story-like retrieval mechanism for the system. For example, given a query about Queen Elizabeth II, the retrieval mechanism can draw results from the different clusters (topics), such as her family, significant political events and dignitaries. The clusters (topics) are not independent of each other; it is therefore possible to generate a retrieval result in the form of a story.
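As an illustration of how the cluster-based data could support such story-like retrieval, a minimal sketch is given below. The `clusters` mapping and the `mentions` helper are hypothetical; the paper does not prescribe a retrieval implementation.

```python
# A hypothetical sketch of story-like retrieval over topic clusters
# (illustrative only; not the Sem-LSB implementation).

def mentions(obj: dict, query: str) -> bool:
    """Naive relevance test: does the object's text mention the query entity?"""
    return query.lower() in obj["text"].lower()

def story_retrieval(query: str, clusters: dict[str, list[dict]]) -> list[tuple[str, dict]]:
    """Collect query-relevant objects from every cluster, grouped by topic,
    so the result reads like chapters of a story."""
    story = []
    for topic, objects in clusters.items():
        for obj in objects:
            if mentions(obj, query):
                story.append((topic, obj))
    return story

# Example: a query about Queen Elizabeth II gathers material from
# "Royal Family", "Politics", ... into one narrative; "Finance" contributes nothing.
clusters = {
    "Royal Family": [{"text": "Queen Elizabeth II celebrates her Golden Jubilee."}],
    "Politics":     [{"text": "The Prime Minister meets Queen Elizabeth II."}],
    "Finance":      [{"text": "Stock markets rally on trade news."}],
}
for topic, obj in story_retrieval("Queen Elizabeth II", clusters):
    print(f"[{topic}] {obj['text']}")
```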

7. Conclusion

Sem-LSB, the Semantic Life Story Book proposed in this research, is a semantic-based information management system which, unlike traditional LSBs, provides an automatic mechanism for managing stored memories by analysing, clustering, indexing and generating content. Sem-LSB uses a user-oriented ontology which, in an interactive pattern, enables users to apply their background knowledge. The ontology allows the system to understand the user's personal life events on a semantic basis, which leads to improved performance. Onto-SVD, a new algorithm working with a user-oriented ontology, provides a high-performance clustering approach. It uses terms and essential semantic features to establish connections between similar information objects. Thanks to the indirect connections identified with the help of the ontology, Onto-SVD is more effective for topic identification and semantic similarity detection than the traditional SVD-based method.

The evaluation shows that Onto-SVD outperforms SVD in topic identification (clustering): the average improvements are 32.82% in purity, 41.62% in F-score, 21.38% in entropy (reduced) and 32.36% in NMI. For the cluster related to the user-oriented ontology, the average improvements are 29.37% in precision, 41.90% in recall and 35.86% in F-score. This means that Onto-SVD has a proven ability to distinguish information objects based on their topics and to detect the indirect connections between information objects on a semantic basis.

Further work includes the development of the dynamic generation module, using the model and algorithm already developed.


