Research Article Using SVD on Clusters to Improve...

Research ArticleUsing SVD on Clusters to Improve Precision ofInterdocument Similarity Measure

Wen Zhang1 Fan Xiao1 Bin Li1 and Siguang Zhang2

1Center on Big Data Sciences Beijing University of Chemical Technology Beijing 100039 China2Institute of Policy and Management Chinese Academy of Sciences Beijing 100190 China

Correspondence should be addressed to Wen Zhang zhangwenmailbucteducn

Received 2 March 2016 Accepted 8 June 2016

Academic Editor Toshihisa Tanaka

Copyright copy 2016 Wen Zhang et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Recently LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) is proposed to overcome the problemsof polysemy and homonym in traditional lexical matching However it is usually criticized as with low discriminative powerfor representing documents although it has been validated as with good representative quality In this paper SVD on clusters isproposed to improve the discriminative power of LSI The contribution of this paper is three manifolds Firstly we make a surveyof existing linear algebra methods for LSI including both SVD basedmethods and non-SVD basedmethods Secondly we proposeSVD on clusters for LSI and theoretically explain that dimension expansion of document vectors and dimension projection usingSVD are the two manipulations involved in SVD on clusters Moreover we develop updating processes to fold in new documentsand terms in a decomposed matrix by SVD on clusters Thirdly two corpora a Chinese corpus and an English corpus are used toevaluate the performances of the proposed methods Experiments demonstrate that to some extent SVD on clusters can improvethe precision of interdocument similarity measure in comparison with other SVD based LSI methods

1 Introduction

As computer networks become the backbones of scienceand economy enormous quantities of machine readabledocuments become available The fact that about 80 percentof businesses are conducted on unstructured information[1 2] makes the great demand for the efficient and effectivetext mining techniques which aims to discover high qualityknowledge from unstructured information Unfortunatelythe usual logic-based programming paradigm has greatdifficulties in capturing fuzzy and often ambiguous relationsin text documents For this reason text mining which is alsoknown as knowledge discovery from texts is proposed to dealwith uncertainness and fuzziness of languages and disclosehidden patterns (knowledge) in documents

Typically information is retrieved by literally matchingterms in documents with those of a query However lexicalmatching methods can be inaccurate when they are used tomatch a userrsquos query Since there are usually many ways toexpress a given concept (synonymy) the literal terms in auserrsquos query may not match those of a relevant document

In addition most words have multiple meanings (polysemyand homonym) so terms in a userrsquos query will literally matchterms in irrelevant documents For these reasons a betterapproach would allow users to retrieve information on thebasis of a conceptual topic or meanings of a document [3 4]

Latent Semantic Indexing (LSI) is proposed to overcomethe problem of lexical matching by using statistically derivedconceptual indices instead of individual words for retrieval[5 6]We call this retrieval method Latent Semantic Indexingbecause the subspace represents important associative rela-tionships between terms and documents that are not evidentin individual documents LSI assumes that there is someunderlying or latent structure in word usage that is partiallyobscured by variability in word choice Using the singularvalue decomposition (SVD) one can take advantage of theimplicit higher-order structure in the association of termswith documents by determining the SVD of large sparseterm-document matrix Terms and documents representedby a reduced dimension of the largest singular vectors arethen matched against user queries Performance data showsthat the statistically derived term-document matrix by SVD

Hindawi Publishing CorporationComputational Intelligence and NeuroscienceVolume 2016 Article ID 1096271 11 pageshttpdxdoiorg10115520161096271

2 Computational Intelligence and Neuroscience

is more robust to retrieve documents based on concepts andmeanings than the original term-document matrix producedusing merely individual words with vector space model(VSM)

In this paper we propose SVD on clusters (SVDC) toimprove the discriminative power of LSI The contributionof this paper is three manifolds Firstly we make a survey ofexisting linear algebra methods for LSI including both SVDbased methods and non-SVD based methods Secondly wetheoretically explain that dimension expansion of documentvectors and dimension projection using SVD are the twomanipulations involved in SVD on clusters We developupdating processes to fold in new documents and terms in adecomposedmatrix by SVDon clustersThirdly two corporaa Chinese corpus and an English corpus are used to evaluatethe performances of the proposed methods

The rest of this paper is organized as follows Section 2provides a survey on recent researches on Latent SemanticIndexing and its related topics Section 3 proposes SVD onclusters and its updating process Section 4 is the experimentto evaluate the proposed methods Section 5 concludes thispaper and indicates future work

2 Related Work

21 Singular Value Decomposition The singular valuedecomposition is commonly used in the solution of uncon-strained linear least square problemsmatrix rank estimationand canonical correlation analysis [7 8] Given119898 times 119899matrix119860 where without loss of generality 119898 ge 119899 and rank(119860) = 119903the singular value decomposition of 119860 denoted by SVD(119860)is defined as

119860 = 119880Σ119881119879 (1)

Here 119880119879119880 = 119881119879119881 = 119868

119899and Σ = diag(120590

1 120590

119899) 120590119894gt 0

for 1 le 119894 le 119903 and 120590119895gt 0 for 119895 ge 119903 + 1 The first 119903 columns of

the orthonormal matrices 119880 and 119881 define the orthonormaleigenvector associated with 119903 nonzero eigenvalues of 119860119860119879

and 119860119879119860 respectively The columns of 119880 and 119881 are referredto as the left and right singular vectors respectively and thesingular values of119860 are defined as the diagonal elements of Σwhich are the nonnegative square roots of the 119899 eigenvaluesof 119860119860119879 Furthermore if we define 119860

119896= sum119896

119894=1119906119894120575119894V119879119894 then we

will find that 119860119896is the best rank-119896 approximation for 119860 in

terms of Frobenius norm [7]

22 Recent Studies in LSI Recently a series of methodsbased on different methods of matrix decomposition havebeen proposed to conduct LSI A common point of thesedecomposition methods is to find a rank-deficient matrix inthe decomposed space to approximate the original matrixso that the term frequency distortion in term-document canbe adjusted Basically we can divide these methods into twocategories matrix decomposition based on SVD and matrixdecomposition not based on SVD Table 1 lists the existinglinear algebraic methods for LSI

In the aspect of SVD based LSI methods it includes IRR[9] SVR [10] and ADE [11] Briefly IRR conjectures that

Table 1 Existing linear algebra methods for LSI

Category Abbreviation Full nameSVD baseddecomposition forterm-documentmatrix

IRR Iterative Residual RescalingSVR Singular Value Rescaling

ADE Approximate DimensionEqualization

Non-SVD baseddecomposition forterm-documentmatrix

SDD SemidiscreteDecomposition

LPI Locality PreservingIndexing

R-SVD Riemannian-SVD

SVD removes two kinds of ldquonoisesrdquo from the original term-document matrix exceptional documents and documentswith minor terms However if our concentration is on char-acterizing relationships of documents in a collection ratherthan looking for representative documents then IRR can playan effective role for this work The basic idea behind SVR isthat the ldquonoiserdquo in original document representation vectorscomes from minor vectors that is those vectors which arefar from representative vectors in terms of distance Thuswe need to augment the influence of representative vectorsand meanwhile reduce the influence of minor vectors in theapproximation matrix Following this idea SVR adjusts thedifferences among major dimensions and minor dimensionsin the approximation matrix by rescaling the singular valuesin Σ Based on the observation that singular values in Σ havethe characteristics as low-rank-plus-shift structure ADE triesto flatten the first 119896 largest singular values with a fixed valueand combine with other small singular values to reconstructΣ to make dimension values relatively equalized in theapproximation matrix of 119860

In the aspect of non-SVD based LSI methods it includesSDD [12] LPI [13] and R-SVD [14] SDD restricts valuesin singular vectors (119880 and 119881) in approximation matrix onlyhaving entries in the set minus1 0 1 By this way it merely needsone-twentieth of storage and only one-half query time whileit can do and SVD does LSI in terms of information retrievalLPI argues that LSI seeks to uncover the most representativefeatures rather themost discriminative features for documentrepresentation With this motivation LPI constructs theadjacency graph of documents and aims to discover the localstructure of document space using Local Preserving Projec-tion (LPP) In essence we can regard that LPI is adaptedfrom LDA (Linear Discriminant Analysis) [15] which is atopic concerning dimension reduction for supervised classi-fication R-SVD is different with SVDmathematically in thatthe term-document matrix decomposition of SVD is basedon Total Least Square (TLS) while matrix decomposition inR-SVD is based on Structured Total Least Square (STLS) R-SVD is not designed for LSI but for information filtering toimprove the effectiveness of information retrieval by usingusersrsquo feedback

Recently two methods in [16 17] are presented whichalso make use of SVD and clustering In [16] Gao andZhang investigate three strategies of using clustering and SVD

Computational Intelligence and Neuroscience 3

for information retrieval as noncluster retrieval full-clusterretrieval and partial cluster retrieval Their study shows thatpartial cluster retrieval produces the best performance In[17] Castelli et al make use of clustering and singular valuedecomposition for nearest-neighbor search in image index-ing They use SVD to rotate the original vectors of imagesto produce zero-mean uncorrelated features Moreover arecursive clustering and SVD strategy is also adopted in theirmethod when the distance of reconstructed centroids andoriginal centroids exceeds a threshold

Although the two methods are very similar with SVD onclusters they are proposed for different uses with differentmotivations Firstly this research presents a complete the-ory for SVD on clusters including theoretical motivationtheoretical analysis of effectiveness and updating processwhich are entirely not mentioned in any of the two referredmethods Secondly this research describes the detailed proce-dures of using SVD on clusters and attempts to use differentclustering methods (119896-Means and SOMs clustering) whichare not mentioned in any of the two referred methodseither Thirdly the motivations of proposing SVDC aredifferent with theirs They proposed clustering and SVD forinhomogeneous data sets and our motivation is to improvethe discriminative power of document indexing

3 SVD on Clusters

31 The Motivation Themotivation for the proposal of SVDon clusters can be specified as the following 4 aspects

(1) The huge computation complexity involved in tradi-tional SVD According to [18] the actual computationcomplexity of SVD is quadratic in the rank of term-document matrix (the rank is bounded by the smallerof the number of documents and the number ofterms) and cubic in the number of singular values thatare computed [19] On the one hand in most casesof SVD for a term-document matrix the numberof documents is quite smaller than the number ofindex terms On the other hand the number ofsingular values which is equal to the rank of the term-document matrix is also dependent on the numberof documents For this reason we can regard thatthe computation complexity of SVD is completelydetermined by the number of documents in the term-document matrix That is to say if the number ofdocuments in the term-document matrix is reducedthen the huge computation complexity of SVD can bereduced as well

(2) Clusters existing in a document collection Usuallythere are different topics scattered in different docu-ments of a text collection Even if all documents ina collection are concerning on a same topic we candivide them into several subtopics Although SVDhasthe ability to uncover the most representative vectorsfor text representation it might not be optimal indiscriminating documents with different semanticsIn information retrieval the relevant documents withthe query should be retrieved as many as possible

on the other hand the irrelevant documents with thequery should be retrieved as few as possible If princi-pal clusters in which documents have closely relatedsemantics can be extracted automatically then rele-vant documents can be retrieved in the cluster withthe assumption that closely associated documentstend to be relevant to the same request that is rele-vant documents are more like one another than theyare like nonrelevant documents

(3) Contextual information and cooccurrence of indexterms in documents Classic weighting schemes [2021] are proposed on the basis of information aboutthe frequency distribution of index terms within thewhole collection or within the relevant and nonrel-evant sets of documents The underlying model forthese term weighting schemes is a probabilistic oneand it assumes that the index terms used for represen-tation are distributed independently in documentsAssuming variables to be independent is usually amatter of mathematical convenience However inthe nature of information retrieval exploitation ofdependence or association between index terms ordocuments will often lead to a better retrieval resultssuch as most linear algebra methods proposed forLSI [3 22] That is from mathematical point of viewthe index terms in documents are dependent on eachother In the viewpoint of linguistics topical wordsare prone to have burstiness in documents and lexicalwords concerning the same topic are likely to cooccurin the same content That is the contextual words ofan index term should also be emphasized and puttogether when used for retrieval In this sense cap-turing the cooccurrence of index terms in documentsand further capturing the cooccurrence of documentswith some common index terms are of great impor-tance to characterize the relationships of documentsin a text collection

(4) Divide-and-conquer strategy as theoretical supportThe singular values in Σ of SVD of term-documentmatrix 119860 have the characteristic as low-rank-plus-shift structure that is the singular values decreasesharply at first level off noticeably and dip abruptly atthe end According to Zha et al [23] we know that if119860 has the low-rank-plus-shift structure then the opti-mal low-rank approximation of 119860 can be computedvia a divide-and-conquer approach That is to sayapproximation of submatrices of 119860 can also producecomparable effectiveness in LSI to direct SVD of 119860

With all of the above observations from both practicesand theoretical analysis SVD on clusters is proposed for LSIto improve its discriminative power in this paper

32 The Algorithms To proceed the basic concepts adoptedin SVD on clusters are defined in the following in order tomake clear the remainder of this paper

Definition 1 (cluster submatrix) Assuming that 119860 is a term-document matrix that is 119860 = (119889

1 1198892 119889

119899) (119889119894(1 le 119894 le 119899)


is a term-document vector) after clustering process 119899 docu-ment vectors are partitioned into 119896 disjoint groups (each doc-ument belongs to only one group but all the documents havethe same terms for representation) For each of these clustersa submatrix of 119860 can be constructed by grouping the vectorsof documents which are partitioned into the same cluster byclustering algorithm That is 119860 = [119860

(1) 119860(2) 119860

(119896)] due

to the fact that changing the order of documents vectors in119860 can be ignored Then one calls that 119860(119895) (1 le 119895 le 119896) is acluster submatrix of 119860

Definition 2 (SVDC approximation matrix) Assuming that119860(1) 119860(2) 119860

(119896) are the all cluster submatrices of 119860 that is119860 = [119860

(1) 119860(2) 119860

(119896)] after SVD for each of these cluster

submatrices that is 119860(1) asymp 119860(1)

1199031

119860(2) asymp 119860(2)

1199032

119860(119896)

asymp

119860(119896)

119903119896

and 119903119896is the rank of SVD approximation matrix of 119860(119896)

and 119860(119896)119903119896

is the SVD approximation matrix of 119860(119896) then onecalls that 119860 = [119860

(1)

1199031

119860(2)

1199032

119860(119896)

119903119896

] is a SVDC approximationmatrix of 119860

With the above two definitions of cluster submatrix andSVDC approximation matrix we proposed two versionsof SVD on clusters by using 119896-Means clustering [24] andSOMs (Self-Organizing Maps) clustering [25] These twoversions are illustrated in Algorithms 3 and 4 respectivelyThe difference of these two versions lies in different clusteringalgorithms used in them For 119896-Means clustering we need topredefine the number of clusters in the document collectionand for SOMs clustering it is not necessary to predefine thenumber of clusters beforehand

Algorithm 3 Algorithm of SVD on clusters with 119896-Meansclustering to approximate the term-document matrix for LSIis as follows

Input

119860 is term-document matrix that is 119860 =

(1198891 1198892 119889

119899)

119896 is predefined number of clusters in 1198601199031 1199032 119903

119896are predefined rank of SVD approx-

imation matrix for 119896 clusters submatrices of 119860

Output

119860 is the SVDC approximation matrix of 119860

Method

(1) Cluster the document vectors 1198891 1198892 119889

119899into

119896 clusters using 119896-Means clustering algorithm(2) Allocate the document vectors according to

vectorsrsquo cluster labels from 119860 to construct thecluster submatrices (119860(1) 119860(2) 119860(119896))

(3) Conduct SVD for each of the cluster subma-trices of 119860(119894) (1 le 119894 le 119896) and produce theirSVD approximation matrix respectively Thatis 119860(119894) asymp 119860(119894)

119903119894

(4) Merge all the SVD approximation matrices ofthe cluster submatrices to construct the SVDCapproximation matrix of 119860 That is 119860 =

[119860(1)

1199031

119860(2)

1199032

119860(119896)

119903119896

]

33 Theoretical Analysis of SVD on Clusters For simplicityhere we only consider the case that term-document 119860 isclustered into two cluster submatrices 119860

1and 119860

2 that is

119860 = [1198601 1198602]After SVDprocessing for119860

1and119860

2 we obtain

1198601= 1198801Σ1119881119879

1and 119860

2= 1198802Σ2119881119879

2 For convenience of

explanation if we assume that

1198601015840= (

1198601

0

0 1198602

)

1198801015840= (

1198801

0

0 1198802

)

Σ1015840= (

Σ1

0

0 Σ2

)

1198811015840119879= (

1198811015840119879

10

0 1198811015840119879

2

)

(2)

we will obtain that 1198601015840 = 1198801015840Σ10158401198811015840119879 and 11988010158401198791198801015840 = 119881

10158401198791198811015840= 119868119899

that is1198801015840 and1198811015840 are orthogonalmatrices Hence we will alsoobtain

1198601015840=

1199031015840

sum

119894=1

1205901015840

1198941199061015840

119894V1015840119879119894 (3)

where 1199031015840 is the total number of elements in Σ1and Σ

2which

are nonzerosThus we can say that1198601015840 = 1198801015840Σ10158401198811015840119879 is a singulardecomposition of 1198601015840 and 119860

1015840

119896= sum119896

119894=11205901015840

1198941199061015840

119894V1015840119879119894

is the closetrank-119896 approximation for 1198601015840 in terms of Frobenius norm(assuming that we sort the values in Σ1015840 in descending orderand adapt the orders of 1199061015840

119894and V1015840119894accordingly)

We can conclude that there are actually two kinds ofmanipulations involved in SVD on clusters the first one isdimension expansion of document vectors and the secondone is dimension projection using SVD

On the one hand notice that 119860 isin 119877119898times119899 and 1198601015840 isin 1198772119898times119899

1198601015840 has expanded 119860 into another space where the number of

dimensions is twice as that of the original space of 119860 Thatis in 119860

1015840 we expanded each document vector 119889 into 1198772119898

dimension vector 1198891015840 by

1198891015840

119901=

119889119902 if 119889 isin 119862

119894 119901 = (119894 minus 1)119898 + 119902

0 otherwise(4)

Here 119889119902is the value of 119902th dimension in 119889 1198891015840

119901is the value of

119901th dimension of 1198891015840 and 1 le 119894 le 2 In this way we expandedeach 119889 into 1198772119898 dimension vector 1198891015840 where values of 1198891015840 areequal to the corresponding values of 119889 if 119889 belongs to cluster119862119894or zero if 119889 is not a member of that cluster


Theoretically according to the explanation documentvectors which are not in the same cluster submatrix will havezero cosine similarity However in fact all document vectorshave the same terms in representation and dimension expan-sion of document vectors is derived by merely copying theoriginal pace of119860 For this reason in practice we use the vec-tors in119860

1and119860

2for indexing and cosine similarities of doc-

ument vectors in1198601and119860

2will not necessarily be zeroThis

validates our motivation of using similarity measure for LSIperformance evaluation in Section 42

Algorithm 4 Algorithm of SVD on clusters with SOMsclustering to approximate the term-document matrix for LSIis as follows

Input


(1198891 1198892 119889

119899)

120572 is predefined preservation rate for submatricesof 119860

Output


Method


119899into

clusters using SOMs clustering algorithm(2) Allocate the document vectorsrsquo according to

vectorsrsquo cluster labels from 119860 to construct thecluster submatrices (119860(1) 119860(2) 119860(119896)) (noticehere that 119896 is not a predefinednumber of clustersof 119860 but the number of neurons which arematched with at least 1 document vector)

(3) Conduct SVD using predefined preservationrate for each cluster submatrix of 119860(119894) (1 le 119894 le

119896) and produce its SVD approximation matrixThat is 119860(119894) asymp 119860(119894)

120572


[119860(1)

120572 119860(2)

120572 119860

(119896)

120572]

On the other hand when using SVD for 119860 that is 119860 =

119880Σ119881119879 we obtain 119880119879119860 = Σ119881

119879 and further we say that SVDhas folded each document vector of 119860 into a reduced space(assuming that we use 119880119879

119896for the left multiplication of 119860 the

number of dimensions of original document vectors will bereduced to 119896) which is represented by119880 and reflects the latentsemantic dimensions characterized by term cooccurrence of119860 [3] In the same way for 1198601015840 we have 11988010158401198791198601015840119879 = Σ119881

1015840119879 andfurther we may say that 1198601015840 is projected into space which isrepresented by 1198801015840 However here 1198801015840 is not characterized byterm cooccurrence of 1198601015840 but by the existing clusters of 119860 andthe term cooccurrence of each cluster submatrix of 119860

34 The Computation Complexity of SVD on Clusters Thecomputation complexity of SVDC is 119874(1198992

1198951199033

119895) where 119899

119895is

the maximum number of documents in 119860(119894)(1 le 119894 le 119896)

and 119903119895is the corresponding rank-119895 to approximate cluster

submatrix 119860(119894) Because the original term-document matrix119860 is partitioned into 119896 cluster submatrices by clusteringalgorithm we can estimate 119899

119895asymp 119899119896 and 119903

119895asymp 119903119896 That is to

say the computation complexity of SVD compared to that ofSVDC has been decreased by approximate 1198965 The larger thevalue of 119896 is that is themore the document clusters setting fora document collection is the more computation complexitywhichwill be saved by SVDon clusters inmatrix factorizationis Although onemay argue that clustering process in SVD onclusters will bring about computation complexity in fact thecost of clustering computation is far smaller than that of SVDFor instance the computation complexity of 119896-Means cluster-ing is 119874(119899119896119905) [24] where 119899 and 119896 have the same meaning asthose in SVD on clusters and 119905 is the number of iterationsThe computation complexity of clustering is not comparableto the complexity 119874(1198995) involved in SVD The computationcomplexity of SOMs clustering is in the similar case with 119896-Means clustering

35 Updating of SVD on Clusters In rapidly changing envi-ronments such as the World Wide Web the document col-lection is frequently updated with new documents and termsconstantly being added and there is a need to find the latent-concept subspace for the updated document collection Inorder to avoid recomputing the matrix decomposition thereare two kinds of updates for an established latent subspace ofLSI folding in new documents and folding in new terms

351 Folding in New Documents Let 119863 denote 119901 newdocument vectors to be appended into the original term-document matrix 119860 then 119863 is 119898 times 119901 matrix Thus the newterm-documentmatrix is119861 = (119860119863)Then119861 = (119880Σ119881119879 119863) =119880Σ(119881

119879 Σminus1119880119879119863) = 119880Σ(119881119863

119879119880Σminus1)119879 That is if 119863 is

appended into the originalmatrix119860119881new = (119881119863119879119880Σminus1) and

119861 = 119880Σ119881119879

new However here119881new is not an orthogonal matrixlike 119881 So 119861

119896is not the closest rank-119896 approximation matrix

to 119861 in terms of Frobenius normThis is the reason whymoredocuments are appended in 119860 more deteriorating effectsare produced on the representation of the SVD approxima-tion matrix using folding in method

Despite this to fold in 119901 new document vectors 119863 intoan existing SVD decomposition a projection119863 of119863 onto thespan of the current term vectors (columns of119881119879

119896) is computed

by (5) Here 119896 is the rank of the approximation matrix

119863 = (119863119879119880Σminus1)1119896

(5)

As for folding in these 119901 new document vectors 119863 intothe established SVDC decomposition of matrix119860 we shoulddecide firstly the cluster submatrices of 119860 into which eachvector in 119863 should be appended Next using (5) we canfold in the new document vector into the cluster submatrixAssuming that 119889 is a new document vector of 119863 first the


Euclidean distance between 119889 and 119888119894(119888119894is the cluster center

of cluster submatrix 119860(119894)) is calculated using (6) where 119898 isthe dimension of 119889 that is the number of terms used in 119860One has

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2= (1198891minus 1198881198941)2+ (1198892minus 1198881198942)2+ sdot sdot sdot

+ (119889119898minus 119888119894119898)2

(6)

Second 119889 is appended into the 119904th cluster where 119889 has theminimum Euclidean distance with 119904th cluster That is

119904 = arg119894

min1le119894le119896

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2 (7)

Third (5) is used to update the SVD of 119860(119904) That is

119889 = (119889119879119880Σminus1)1119903119904

(8)

Here 119903119904is the rank of approximation matrix of 119860(119904)

Finally 119860 is updated as 119860 = (119860(1)

1199031

119860(119904)

119903119904

119860(119896)

119903119896

) with

119860(119904)

119903119904

= 119880(119904)

1119903119904

Σ(119904)

11199031199041119903119904

(119881(119904)

1119903119904

| 119889

119879

) (9)

Thus we finish the process of folding in a new documentvector into SVDC decomposition and the centroid of 119904thcluster is updated with new document The computationalcomplexity of updating SVDC depends on the size of 119880 andΣ because it involves only one-way matrix multiplication

352 Folding in New Terms Let 119879 denote a collection of 119902term vectors for SVD update Then 119879 is 119902 times 119899 matrix Thuswe have the new term-document 119862 with 119862 = (119860119879) =

(119860119879 119879119879)119879 Then 119862 = ((119880Σ119881

119879)119879 119879119879)119879

= (119880Σ119881119879119879) =

(119880119879119881Σminus1)Σ119881119879 That is 119880new = (119880119879119881Σ

minus1) and 119862 =

119880newΣ119881119879 Here 119880new is not an orthonormal matrix So 119862

119896is

not the closest rank-119896 approximation matrix to 119862 in terms ofFrobenius norm Thus the more the terms being appendedinto the approximationmatrix119860

119896are themore the deviation

between 119860119896and 119860 which will be induced in document

representation isAlthough the method specified above has a disadvantage

of SVD for folding in new terms we do not have bettermethod to tackle this problem until now if no recomputingof SVD is desired To fold in 119902 term vectors 119879 into an existingSVD decomposition a projection 119879 of 119879 onto the span ofcurrent document vectors (rows of 119880

119896) is determined by

119879 = (119879119881119896Σminus1

119896)11119896

(10)

Concerning folding in an element 119905 of 119879 the updatingprocess of SVDC is more complex than that of SVD Firstthe weight of 119905 in each document of each cluster is calculatedas

119905(119894)= (119908(119894)

1 119908

(119894)

119895 119908

(119894)

119898119894

) (1 le 119894 le 119896) (11)

Here 119908(119894)119895

is the weight of the new term 119905 in the jthdocument of ith cluster submatrix 119860

(119894) 119898119894is the number

of documents in 119860(119894) and 119896 is the number of clusters in theoriginal term-document matrix119860 Second for each119860(119894) (1 le119894 le 119896) in 119860 of Definition 2 the process of folding in a newterm in SVD is used to update each 119860(119894) shown in

119905

(119894)

= (119905(119894)119881(119894)Σ(119894)minus1

)11119903119894

(12)

Then each 119860(119894)119903119894

is updated using

119860(119894)

119903119894

= (

119880(119894)

11198981198941119903119894

119905

(119894))Σ(119894)

11199031198941119903119894

119881(119894)119879

11198991119903119894

(13)

Finally approximation term-document 119860 of Definition 2is reconstructed with all updated 119860(119894)

119903119894

as

119860 = [119860(1)

1199031

119860(119894)

119903119894

119860(119896)

119903119896

] (14)

Thus we finish the process of folding 119905 into SVDCdecomposition For folding 119902 term vectors 119879 into an existingSVDC decomposition we need to repeat the processes of(11)ndash(14) for each element of 119879 one by one

4 Experiments and Evaluation

41 The Corpus Reuters-21578 distribution 10 is used forperformance evaluation as the English corpus and it isavailable online (httpwwwdaviddlewiscomresourcestestcollectionsreuters21578) It collects 21578 news fromReuters newswire in 1987 Here the documents from 4categories as ldquocruderdquo (520 documents) ldquoagriculturerdquo (574documents) ldquotraderdquo (514 documents) and ldquointerestrdquo (424documents) are assigned as the target English documentcollection That is 2042 documents from this corpusare selected for evaluation After stop-word (we obtainthe stop-words from USPTO (United States Patent andTrademark Office) patent full-text and image database athttppatftusptogovnetahtmlPTOhelpstopwordhtm Itincludes about 100 usual wordsThe part of speech of Englishword is determined by QTAG which is a probabilisticparts-of-speech tagger and can be downloaded freely onlinehttpwwwenglishbhamacukstaffomasonsoftwareqtaghtml) elimination and stemming processing (Porterstemming algorithm is used for English stemming processingwhich can be downloaded freely online httptartarusorgsimmartinPorterStemmer) a total amount of 50837 sen-tences and 281111 individual words in these documents isestimated

TanCorpV10 is used as the Chinese corpus in thisresearch which is available in the internet (httpwwwcnblogscomtristanrobertarchive201202162354973html)Here documents from4 categories as ldquoagriculturerdquo ldquohistoryrdquoldquopoliticsrdquo and ldquoeconomyrdquo are assigned as target Chinesecorpus For each category 300 documents were selectedrandomly from original corpus obtaining a corpus of 1200documents After morphological analysis (because Chineseis character based we conducted the morphological analysisusing the ICTCLAS tool It is a Chinese Lexical AnalysisSystem Online httpictclasnlpirorg) a total amount of219115 sentences and 5468301 individual words is estimated


42 Evaluation Method We use similarity measure as themethod for performance evaluation The basic assumptionbehind similarity measure is that document similarity shouldbe higher for any document pair relevant to the same topic(intratopic pair) than for any pair relevant to different topics(cross-topic pair) This assumption is based on considerationof how the documents would be used by applications Forinstance in text clustering by 119896-Means clusters are con-structed by collecting document pairs having the greatestsimilarity at each updating

In this research documents in same category are regardedas having same topic and documents in different categoryare regarded as cross-topic pairs Firstly document pairs areproduced by coupling each document vector in a predefinedcategory and another document vector in the whole corpusiteratively Secondly cosine similarity is computed for eachdocument pair and all the document pairs are sorted ina descending order by their similarities Finally (15) and(16) are used to compute the average precision of similaritymeasure More details concerning similarity measure can befound in [9] One has

precision (119901119896)

=

of intra - topic pairs 119901119895where 119895 le 119896

119896

(15)

average precision =sum119898

119894=1119901119894

119898

(16)

Here 119901119895denotes the document pair that has the 119895th

greatest similarity value of all document pairs 119896 is variedfrom 1 to119898 and119898 is the number of total document pairsThelarger the average precision (119901

119896) is the more the document

pairs in same categories which are regarded as having sametopic are That is the better performance is produced Asimplified method may be that 119896 is predefined as fixednumbers such as 10 20 and 200 (as suggested by one of thereviewers) Thus (16) is not necessary However due to thelack of knowledge of the optimal 119896 we conjecture that anaverage precision on all possible 119896 is more convincing forperformance evaluation

43 Experimental Results of Indexing For both Chinese andEnglish corpus we carried out experiments for measuringsimilarities of documents in each category When usingSVDC in Algorithm 3 for LSI the predefined number ofclusters in 119896-Means clustering algorithm is set as 4 for bothChinese and English documents which is equal to the num-ber of categories used in both corpora In SOMs clusteringwhen using SVDC in Algorithm 4 for LSI 10 times 10 array ofneurons is set to map the original document vectors to thistarget space and the limit on time iteration is set as 10000As a result Chinese documents are mapped to 11 clusters andEnglish documents are mapped to 16 clusters Table 2 showsthe 119865-measure values [26] of the clustering results producedby 119896-Means and SOMs clustering respectively The largerthe 119865-measure value the better the clustering result Here119896-Means has produced better clustering results than SOMsclustering algorithm

Table 2 119865-measures of clustering results produced by 119896-Means andSOMs on Chinese and English documents

Corpus 119896-Means SOMs clusteringChinese 07367 06046English 07697 06534

Average precision (see (16)) on the 4 categories of bothEnglish and Chinese documents is used as the performancemeasure Tables 3 and 4 are the experimental results ofsimilarity measure on the English and Chinese documentsrespectively For SVD SVDC and ADE the only requiredparameter to compute the latent subspace is preservationrate which is equal to 119896rank(119860) where 119896 is the rankof the approximation matrix For IRR and SVR besidesthe preservation rate they also need another parameter asrescaling factor to compute the latent subspace

To compare document indexing methods at differentparameter settings preservation rate is varied from 01 to 10in increments of 01 for SVD SVDC SVR and ADE ForSVR its rescaling factor is set to 135 as suggested in [10] foroptimal average results in information retrieval For IRR itspreservation rate is set as 01 and its rescaling factor is variedfrom 1 to 10 the same as in [13] Note that in Tables 3 and 4 forIRR the preservation rate of 1 corresponds to rescaling factor10 09 to 9 and so forth The baseline of TF lowast IDF methodcan be regarded as pure SVD at preservation rate 10

We can see from Tables 3 and 4 that for both Englishand Chinese similaritymeasure SVDCwith 119896-Means SVDCwith SOMs clustering and SVDoutperformother SVDbasedmethods In most cases SVDC with 119896-Means and SVDCwith SOMs clustering have better performances than SVDThis outcome validates our motivation of SVD on clusters inSection 31 that all documents in a corpus are not necessarilyto be in a same latent space but in some different latentsubspaces Thus SVD on clusters which constructs latentsubspaces on document clusters can characterize documentsimilarity more accurately and appropriately than other SVDbased methods Here we regard that the variances of thementioned methods are comparable to each other becausethey have similar values

Considering the variances of average precisions on dif-ferent categories we admit that SVDC may not be a robustapproach since its superiority is not obvious than SVD (aspointed out by one of the reviewers) However we regard thatthe variances of the mentioned methods are comparable toeach other because they have similar values

Moreover SVDC with 119896-Means outperforms SVDC withSOMs clustering The better performance of SVDC with 119896-Means can be attributed to the better performance of 119896-Means than SOMs in clustering (see Table 2) When preser-vation rate declines from 1 to 01 the performances of SVDCwith 119896-Means and SVD increase significantly However forSVDC with SOMs clustering its performance decreaseswhen preservation is smaller than 03 We hypothesize thatSVDC with 119896-Means has effectively captured latent structureof documents but SVDC with SOMs clustering has not


Table 3 Similarity measure on English documents of SVD on clusters and other SVD based LSI methods PR is the abbreviation forldquopreservation raterdquo and the best performances (measured by average precision) are marked in bold type

PR SVD SVDC (119896-Means) SVDC (SOMs) SVR ADE IRR10 04373 plusmn 00236 04373 plusmn 00236 04373 plusmn 00236 04202 plusmn 00156 03720 plusmn 00253 03927 plusmn 0037809 04382 plusmn 00324 04394 plusmn 00065 04400 plusmn 00266 04202 plusmn 00197 02890 plusmn 00271 03929 plusmn 0020708 04398 plusmn 00185 04425 plusmn 00119 04452 plusmn 00438 04202 plusmn 00168 03293 plusmn 00093 03927 plusmn 0062107 04420 plusmn 00056 04458 plusmn 00171 04385 plusmn 00287 04089 plusmn 00334 03167 plusmn 00173 03928 plusmn 0027406 04447 plusmn 00579 04483 plusmn 00237 04462 plusmn 00438 04201 plusmn 00132 03264 plusmn 00216 03942 plusmn 0024305 04475 plusmn 00431 04502 plusmn 00337 04487 plusmn 00367 04203 plusmn 00369 03338 plusmn 00295 03946 plusmn 0027904 04499 plusmn 00089 04511 plusmn 00173 04498 plusmn 00194 04209 plusmn 00234 03377 plusmn 00145 03951 plusmn 0032503 04516 plusmn 00375 04526 plusmn 00235 04396 plusmn 00309 04222 plusmn 00205 03409 plusmn 00247 03970 plusmn 0021402 04538 plusmn 00654 04554 plusmn 00423 04372 plusmn 00243 04227 plusmn 00311 03761 plusmn 00307 03990 plusmn 0026101 04553 plusmn 00247 04605 plusmn 00391 04298 plusmn 00275 04229 plusmn 00308 04022 plusmn 00170 03956 plusmn 00185

Table 4 Similarity measure on Chinese documents of SVD on clusters and other SVD based LSI methods PR is the abbreviation forldquopreservation raterdquo and the best performances (measured by average precision) are marked in bold type


Table 5 Results of 119905-test on the performances of similarity measureof SVD on clusters and other SVD based LSI methods in Englishcorpus

Method SVDC with SOMs clustering SVDSVDC with 119896-Means ≫ ≫

SVDC with SOMs clustering gt

Table 6 Results of 119905-test on the performances of similarity measureof SVD on clusters and other SVD based LSI methods in Chinesecorpus

Method SVDC with SOMs clustering SVDSVDC with 119896-Means gt gt

SVDC with SOMs clustering sim

captured the appropriate latent structure due to its poorcapacity in document clustering

To better illustrate the effectiveness of each method theclassic 119905-test is employed [27 28] Tables 5 and 6 demonstratethe results of 119905-test of the performances of the examinedmethods on English and Chinese documents respectivelyThe following codification of 119875 value in ranges was usedldquo≫rdquo (ldquo≪rdquo) means that 119875 value is lesser than or equal to001 indicating a strong evidence that a method produces

a significant better (worse) similarity measure than anotherone ldquoltrdquo (ldquogtrdquo) means that 119875 value is larger than 001 andminor or equal to 005 indicating a weak evidence thata method produces a significant better (worse) similaritymeasure than another one ldquosimrdquo means that 119875 value is greaterthan 005 indicating that the compared methods do nothave significant differences in performances We can see thatSVDC with 119896-Means outperforms both SVDC with SOMsclustering and pure SVD in both English andChinese corpusMeanwhile SVDC with SOMs clustering has a very similarperformance with pure SVD

44 Experimental Results of Updating Figure 1 is the perfor-mances of updating process of SVDon clusters in comparisonwith SVD updating The vertical axis indicates averageprecision and the horizontal axis indicates the retaining ratioof original documents for initial SVDC or SVD approxi-mation For example the retaining ratio 08 indicates that80 percentage of documents (terms) in the corpus are usedfor approximation and the left 20 percentage of documents(terms) are used for updating the approximation matrixHere the preservation rates of approximation matrices areset as 08 uniformly We only compared SVDC with 119896-Meansand SVD in updating because SVDC with SOMs clusteringhas not produced a competitive performance in similaritymeasure


06 04 0208Retaining ratio


04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion

SVDSVDC with k-Means


Figure 1 Similarity measure of SVDC with 119896-Means and SVD for updating the preservation rates of their approximation matrices are set as08

We can see from Figure 1 that in folding in newdocuments the updating process of SVDC with 119896-Means issuperior to SVD updating on similarity measure An obvioustrend on their performance difference is that the superiorityof SVDC with 119896-Means becomes more and more significantthan SVD when the number of training documents declinesWe conjecture that less diversity in latent spaces of smallnumber of training documents can improve the documentsimilarity in the same category

In folding in new terms SVDC with 119896-Means is superiorto SVD as well However their performances drop dramati-cally in initial phase and increase after a critical value Thisphenomenon can be explained as that when retaining ratio islarge the removal of more and more index terms from term-document matrix will hurt the latent structure of documentspace However when retaining ratio attains to a small value(the critical value) the latent structure of document space isdecided principally by the appended terms which have largernumber than remaining terms For this reason document

similarities in the corpus are determined by the appendedindex terms Furthermore we observe that the critical valueon Chinese corpus is larger than that on English corpusThiscan be explained as that the number of Chinese index terms(21475) ismuch larger than that of English index terms (3269)but the number of Chinese documents (1200) is smaller thanthat of English documents (2402) Thus the structure ofChinese latent space ismuchmore robust than that of Englishlatent space which is very sensitive to the number of indexterms

5 Concluding Remarks

This paper proposes SVD on clusters as a new indexingmethod for Latent Semantic Indexing Based on the reviewon current trend of linear algebraicmethods for LSI we claimthat the state of art of LSI roughly follows two disciplinesSVD based LSI methods and non-SVD based LSI methodsThenwith the specification of itsmotivation SVDon clusters


is proposed We describe the algorithm of SVD on clusterswith two different clustering algorithms 119896-Means and SOMsclustering The computation complexity of SVD on clustersits theoretical analysis and its updating process for folding innew documents and terms are presented in this paper SVDon clusters is different from existing SVD based LSI methodsin the way of eliminating noise from the term-documentmatrix It neither changes the weights of singular values inΣ as done in SVR and ADE nor revises directions of singularvectors as done in IRR It adapts the structure of the originalterm-document matrix based on document clusters Finallytwo document collections as a Chinese and an English corpusare used to evaluate the proposed methods using similaritymeasure in comparison with other SVD based LSI methodsExperimental results demonstrate that in most cases SVDon clusters outperforms other SVD based LSI methodsMoreover the performances of clustering techniques used inSVD on clusters play an important role on its performances

The possible applications of SVD on clusters may beautomatic categorization of large amount of Web documentswhere LSI is an alternative for document indexing but withhuge computation complexity and the refinement of docu-ment clustering where interdocument similarity measure isdecisive for its performance We admit that this paper coversmerely linear algebra methods for latent sematic indexingIn the future we will compare SCD on clusters with thetopic based methods for Latent Semantic Indexing on inter-document similarity measure such as Probabilistic LatentSemantic Indexing [29] and Latent Dirichlet Allocation [30]

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This research was supported in part by National NaturalScience Foundation of China under Grants nos 7110113861379046 91218301 91318302 and 61432001 Beijing NaturalScience Fund under Grant no 4122087 the FundamentalResearch Funds for the Central Universities (buctrc201504)

References

[1] C White ldquoConsolidating accessing and analyzing unstruc-tured datardquo httpwwwb-eye-networkcomview2098

[2] R Rahimi A Shakery and I King ldquoExtracting translationsfrom comparable corpora for Cross-Language InformationRetrieval using the languagemodeling frameworkrdquo InformationProcessing amp Management vol 52 no 2 pp 299ndash318 2016

[3] M W Berry S T Dumais and G W OrsquoBrien ldquoUsing linearalgebra for intelligent information retrievalrdquo SIAM Review vol37 no 4 pp 573ndash595 1995

[4] M T Hassan A Karim J-B Kim and M Jeon ldquoCDIM docu-ment clustering by discrimination information maximizationrdquoInformation Sciences vol 316 no 20 pp 87ndash106 2015

[5] S Deerwester S T Dumais G W Furnas T K Landauer andR Harshman ldquoIndexing by latent semantic analysisrdquo Journal of

the American Society for Information Science vol 41 no 6 pp391ndash407 1990

[6] C Laclau andM Nadif ldquoHard and fuzzy diagonal co-clusteringfor document-term partitioningrdquoNeurocomputing vol 193 pp133ndash147 2016

[7] GH Golub andC F von LoanMatrix ComputationsThe JohnHopkins University Press 3rd edition 1996

[8] L Yue W Zuo T Peng Y Wang and X Han ldquoA fuzzy docu-ment clustering approach based on domain-specified ontologyrdquoData and Knowledge Engineering vol 100 pp 148ndash166 2015

[9] R K Ando ldquoLatent semantic space iterative scaling imrpovesprecision of inter-document similarity measurementrdquo in Pro-ceedings of the 23rd ACM International SIGIR Conference onResearch and Development in Information Retrieval (SIGIR rsquo00)pp 216ndash223 Athens Greece July 2000

[10] H Yan W I Grosky and F Fotouhi ldquoAugmenting the powerof LSI in text retrieval singular value rescalingrdquo Data andKnowledge Engineering vol 65 no 1 pp 108ndash125 2008

[11] F Jiang and M L Littman ldquoApproximate dimension equaliza-tion in vector-based information retrievalrdquo in Proceedings of the17th International Conference on Machine Learning (ICML rsquo00)pp 423ndash430 Stanford Calif USA 2000

[12] T G Kolda and D P OrsquoLeary ldquoA semidiscrete matrix decom-position for latent semantic indexing in information retrievalrdquoACM Transactions on Information Systems vol 16 no 4 pp322ndash346 1998

[13] X He D Cai H Liu and W Y Ma ldquoLocality preservingindeixng for document reprenentationrdquo in Proceedings of theInternational ACM SIGIR Conference on Research and Develop-ment in Information Retrieval pp 218ndash225 2004

[14] E P Jiang andMW Berry ldquoInformation filtering using the Rie-mannian SVD (R-SVD)rdquo in Solving Irregularly Structured Prob-lems in Parallel 5th International Symposium IRREGULARrsquo98Berkeley California USA August 9ndash11 1998 Proceedings vol1457 of Lecture Notes in Computer Science pp 386ndash395 2005

[15] M Welling Fisher Linear Discriminant Analysis httpwwwicsuciedusimwellingclassnotespapers classFisher-LDApdf

[16] J Gao and J Zhang ldquoClustered SVDstrategies in latent semanticindexingrdquo Information Processing and Management vol 41 no5 pp 1051ndash1063 2005

[17] V Castelli A Thomasian and C-S Li ldquoCSVD clustering andsingular value decomposition for approximate similarity searchin high-dimensional spacesrdquo IEEE Transactions on Knowledgeand Data Engineering vol 15 no 3 pp 671ndash685 2003

[18] MW Berry ldquoLarge scale singular value computationsrdquo Interna-tional Journal of Supercomputer Applications vol 6 pp 13ndash491992

[19] C D Manning and H Schutze Foundations of StatisitcalNatural Language Processing The MIT Press 4th edition 2001

[20] G Salton A Wang and C S Yang ldquoA vector space modelfor information retrievalrdquo Journal of American Society forInformation Science vol 18 no 11 pp 613ndash620 1975

[21] L Jiang C Li S Wang and L Zhang ldquoDeep feature weightingfor naive Bayes and its application to text classificationrdquo Engi-neering Applications of Artificial Intelligence vol 52 pp 26ndash392016

[22] T Van Phan and M Nakagawa ldquoCombination of global andlocal contexts for textnon-text classification in heterogeneousonline handwritten documentsrdquo Pattern Recognition vol 51 pp112ndash124 2016


[23] H Zha O Marques and H D Simon ldquoLarge scale SVD andsubspace-based methods for information retrievalrdquo in Proceed-ings of the 5th International Symposium on Solving IrregularlyStructured Problems in Parallel (IRREGULAR rsquo98) pp 29ndash42Berkeley Calif USA August 1998

[24] J Han and M Kamber Data Mining Concepts and TechniquesMorgan Kaufmann Boston Mass USA 2nd edition 2006

[25] T Kohonen Self-Organization and Associative Memory vol 8of Springer Series in Information Sciences Springer New YorkNY USA 2nd edition 1988

[26] M Steinbach G Karypis and V Kumar ldquoA comparison ofdocument clustering techniquesrdquo in Proceedings of the KDDWorkshop on Text Mining pp 109ndash110 2000

[27] Y M Yang and X Liu ldquoA re-examination of text categorizationmethodsrdquo inProceedings on the 22ndAnnual International ACMSIGIR Conference on Research and Development in InformationRetrieval (SIGIR rsquo99) pp 42ndash49 Berkeley Calif USA August1999

[28] R F Correa andT B Ludermir ldquoImproving self-organization ofdocument collections by semantic mappingrdquo Neurocomputingvol 70 no 1ndash3 pp 62ndash69 2006

[29] T Hofmann ldquoLearning the similarity of documents aninformation-geometric approach to document retrieval andcategorizationrdquo in Advances in Neural Information ProcessingSystems 12 pp 914ndash920 The MIT Press 2000

[30] D M Blei A Y Ng and M I Jordan ldquoLatent dirichlet alloca-tionrdquo Journal of Machine Learning Research vol 3 no 4-5 pp993ndash1022 2003

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications


Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


ArtificialNeural Systems

Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



is more robust to retrieve documents based on concepts andmeanings than the original term-document matrix producedusing merely individual words with vector space model(VSM)

In this paper we propose SVD on clusters (SVDC) toimprove the discriminative power of LSI The contributionof this paper is three manifolds Firstly we make a survey ofexisting linear algebra methods for LSI including both SVDbased methods and non-SVD based methods Secondly wetheoretically explain that dimension expansion of documentvectors and dimension projection using SVD are the twomanipulations involved in SVD on clusters We developupdating processes to fold in new documents and terms in adecomposedmatrix by SVDon clustersThirdly two corporaa Chinese corpus and an English corpus are used to evaluatethe performances of the proposed methods

The rest of this paper is organized as follows Section 2provides a survey on recent researches on Latent SemanticIndexing and its related topics Section 3 proposes SVD onclusters and its updating process Section 4 is the experimentto evaluate the proposed methods Section 5 concludes thispaper and indicates future work

2 Related Work

21 Singular Value Decomposition The singular valuedecomposition is commonly used in the solution of uncon-strained linear least square problemsmatrix rank estimationand canonical correlation analysis [7 8] Given119898 times 119899matrix119860 where without loss of generality 119898 ge 119899 and rank(119860) = 119903the singular value decomposition of 119860 denoted by SVD(119860)is defined as

119860 = 119880Σ119881119879 (1)

Here 119880119879119880 = 119881119879119881 = 119868

119899and Σ = diag(120590

1 120590

119899) 120590119894gt 0

for 1 le 119894 le 119903 and 120590119895gt 0 for 119895 ge 119903 + 1 The first 119903 columns of

the orthonormal matrices 119880 and 119881 define the orthonormaleigenvector associated with 119903 nonzero eigenvalues of 119860119860119879

and 119860119879119860 respectively The columns of 119880 and 119881 are referredto as the left and right singular vectors respectively and thesingular values of119860 are defined as the diagonal elements of Σwhich are the nonnegative square roots of the 119899 eigenvaluesof 119860119860119879 Furthermore if we define 119860

119896= sum119896

119894=1119906119894120575119894V119879119894 then we

will find that 119860119896is the best rank-119896 approximation for 119860 in

terms of Frobenius norm [7]

22 Recent Studies in LSI Recently a series of methodsbased on different methods of matrix decomposition havebeen proposed to conduct LSI A common point of thesedecomposition methods is to find a rank-deficient matrix inthe decomposed space to approximate the original matrixso that the term frequency distortion in term-document canbe adjusted Basically we can divide these methods into twocategories matrix decomposition based on SVD and matrixdecomposition not based on SVD Table 1 lists the existinglinear algebraic methods for LSI

In the aspect of SVD based LSI methods it includes IRR[9] SVR [10] and ADE [11] Briefly IRR conjectures that

Table 1 Existing linear algebra methods for LSI

Category Abbreviation Full nameSVD baseddecomposition forterm-documentmatrix

IRR Iterative Residual RescalingSVR Singular Value Rescaling

ADE Approximate DimensionEqualization

Non-SVD baseddecomposition forterm-documentmatrix

SDD SemidiscreteDecomposition

LPI Locality PreservingIndexing

R-SVD Riemannian-SVD

SVD removes two kinds of ldquonoisesrdquo from the original term-document matrix exceptional documents and documentswith minor terms However if our concentration is on char-acterizing relationships of documents in a collection ratherthan looking for representative documents then IRR can playan effective role for this work The basic idea behind SVR isthat the ldquonoiserdquo in original document representation vectorscomes from minor vectors that is those vectors which arefar from representative vectors in terms of distance Thuswe need to augment the influence of representative vectorsand meanwhile reduce the influence of minor vectors in theapproximation matrix Following this idea SVR adjusts thedifferences among major dimensions and minor dimensionsin the approximation matrix by rescaling the singular valuesin Σ Based on the observation that singular values in Σ havethe characteristics as low-rank-plus-shift structure ADE triesto flatten the first 119896 largest singular values with a fixed valueand combine with other small singular values to reconstructΣ to make dimension values relatively equalized in theapproximation matrix of 119860

In the aspect of non-SVD based LSI methods it includesSDD [12] LPI [13] and R-SVD [14] SDD restricts valuesin singular vectors (119880 and 119881) in approximation matrix onlyhaving entries in the set minus1 0 1 By this way it merely needsone-twentieth of storage and only one-half query time whileit can do and SVD does LSI in terms of information retrievalLPI argues that LSI seeks to uncover the most representativefeatures rather themost discriminative features for documentrepresentation With this motivation LPI constructs theadjacency graph of documents and aims to discover the localstructure of document space using Local Preserving Projec-tion (LPP) In essence we can regard that LPI is adaptedfrom LDA (Linear Discriminant Analysis) [15] which is atopic concerning dimension reduction for supervised classi-fication R-SVD is different with SVDmathematically in thatthe term-document matrix decomposition of SVD is basedon Total Least Square (TLS) while matrix decomposition inR-SVD is based on Structured Total Least Square (STLS) R-SVD is not designed for LSI but for information filtering toimprove the effectiveness of information retrieval by usingusersrsquo feedback

Recently two methods in [16 17] are presented whichalso make use of SVD and clustering In [16] Gao andZhang investigate three strategies of using clustering and SVD




3 SVD on Clusters










1 1198892 119889

119899) (119889119894(1 le 119894 le 119899)



(1) 119860(2) 119860

(119896)] due




(1) 119860(2) 119860



1199031

119860(2) asymp 119860(2)

1199032

119860(119896)

asymp

119860(119896)

119903119896


and 119860(119896)119903119896


(1)

1199031

119860(2)

1199032

119860(119896)

119903119896




Input


(1198891 1198892 119889

119899)




Output


Method


119899into




119903119894


[119860(1)

1199031

119860(2)

1199032

119860(119896)

119903119896

]


1and 119860

2 that is


1and119860

2 we obtain

1198601= 1198801Σ1119881119879

1and 119860

2= 1198802Σ2119881119879



1198601015840= (

1198601

0

0 1198602

)

1198801015840= (

1198801

0

0 1198802

)

Σ1015840= (

Σ1

0

0 Σ2

)

1198811015840119879= (

1198811015840119879

10

0 1198811015840119879

2

)

(2)


10158401198791198811015840= 119868119899


1198601015840=

1199031015840

sum

119894=1

1205901015840

1198941199061015840

119894V1015840119879119894 (3)


2which


1015840

119896= sum119896

119894=11205901015840

1198941199061015840

119894V1015840119879119894









1198891015840

119901=

119889119902 if 119889 isin 119862

119894 119901 = (119894 minus 1)119898 + 119902

0 otherwise(4)






1and119860






Input


(1198891 1198892 119889

119899)


Output


Method


119899into





120572


[119860(1)

120572 119860(2)

120572 119860

(119896)

120572]


119880Σ119881119879 we obtain 119880119879119860 = Σ119881






1198951199033

119895) where 119899

119895is




119895asymp 119899119896 and 119903

119895asymp 119903119896 That is to




119879 Σminus1119880119879119863) = 119880Σ(119881119863



119861 = 119880Σ119881119879





119896) is computed


119863 = (119863119879119880Σminus1)1119896

(5)





1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817


+ (119889119898minus 119888119894119898)2

(6)


119904 = arg119894

min1le119894le119896

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2 (7)


119889 = (119889119879119880Σminus1)1119903119904

(8)



1199031

119860(119904)

119903119904

119860(119896)

119903119896

) with

119860(119904)

119903119904

= 119880(119904)

1119903119904

Σ(119904)

11199031199041119903119904

(119881(119904)

1119903119904

| 119889

119879

) (9)



(119860119879 119879119879)119879 Then 119862 = ((119880Σ119881

119879)119879 119879119879)119879

= (119880Σ119881119879119879) =




119896is







119879 = (119879119881119896Σminus1

119896)11119896

(10)


119905(119894)= (119908(119894)

1 119908

(119894)

119895 119908

(119894)

119898119894

) (1 le 119894 le 119896) (11)

Here 119908(119894)119895


(119894) 119898119894is the number


119905

(119894)

= (119905(119894)119881(119894)Σ(119894)minus1

)11119903119894

(12)

Then each 119860(119894)119903119894

is updated using

119860(119894)

119903119894

= (

119880(119894)

11198981198941119903119894

119905

(119894))Σ(119894)

11199031198941119903119894

119881(119894)119879

11198991119903119894

(13)


119903119894

as

119860 = [119860(1)

1199031

119860(119894)

119903119894

119860(119896)

119903119896

] (14)








precision (119901119896)

=


119896

(15)


119894=1119901119894

119898

(16)































04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






3 SVD on Clusters










1 1198892 119889

119899) (119889119894(1 le 119894 le 119899)



(1) 119860(2) 119860

(119896)] due




(1) 119860(2) 119860



1199031

119860(2) asymp 119860(2)

1199032

119860(119896)

asymp

119860(119896)

119903119896


and 119860(119896)119903119896


(1)

1199031

119860(2)

1199032

119860(119896)

119903119896




Input


(1198891 1198892 119889

119899)




Output


Method


119899into




119903119894


[119860(1)

1199031

119860(2)

1199032

119860(119896)

119903119896

]


1and 119860

2 that is


1and119860

2 we obtain

1198601= 1198801Σ1119881119879

1and 119860

2= 1198802Σ2119881119879



1198601015840= (

1198601

0

0 1198602

)

1198801015840= (

1198801

0

0 1198802

)

Σ1015840= (

Σ1

0

0 Σ2

)

1198811015840119879= (

1198811015840119879

10

0 1198811015840119879

2

)

(2)


10158401198791198811015840= 119868119899


1198601015840=

1199031015840

sum

119894=1

1205901015840

1198941199061015840

119894V1015840119879119894 (3)


2which


1015840

119896= sum119896

119894=11205901015840

1198941199061015840

119894V1015840119879119894









1198891015840

119901=

119889119902 if 119889 isin 119862

119894 119901 = (119894 minus 1)119898 + 119902

0 otherwise(4)






1and119860






Input


(1198891 1198892 119889

119899)


Output


Method


119899into





120572


[119860(1)

120572 119860(2)

120572 119860

(119896)

120572]


119880Σ119881119879 we obtain 119880119879119860 = Σ119881






1198951199033

119895) where 119899

119895is




119895asymp 119899119896 and 119903

119895asymp 119903119896 That is to




119879 Σminus1119880119879119863) = 119880Σ(119881119863



119861 = 119880Σ119881119879





119896) is computed


119863 = (119863119879119880Σminus1)1119896

(5)





1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817


+ (119889119898minus 119888119894119898)2

(6)


119904 = arg119894

min1le119894le119896

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2 (7)


119889 = (119889119879119880Σminus1)1119903119904

(8)



1199031

119860(119904)

119903119904

119860(119896)

119903119896

) with

119860(119904)

119903119904

= 119880(119904)

1119903119904

Σ(119904)

11199031199041119903119904

(119881(119904)

1119903119904

| 119889

119879

) (9)



(119860119879 119879119879)119879 Then 119862 = ((119880Σ119881

119879)119879 119879119879)119879

= (119880Σ119881119879119879) =




119896is







119879 = (119879119881119896Σminus1

119896)11119896

(10)


119905(119894)= (119908(119894)

1 119908

(119894)

119895 119908

(119894)

119898119894

) (1 le 119894 le 119896) (11)

Here 119908(119894)119895


(119894) 119898119894is the number


119905

(119894)

= (119905(119894)119881(119894)Σ(119894)minus1

)11119903119894

(12)

Then each 119860(119894)119903119894

is updated using

119860(119894)

119903119894

= (

119880(119894)

11198981198941119903119894

119905

(119894))Σ(119894)

11199031198941119903119894

119881(119894)119879

11198991119903119894

(13)


119903119894

as

119860 = [119860(1)

1199031

119860(119894)

119903119894

119860(119896)

119903119896

] (14)








precision (119901119896)

=


119896

(15)


119894=1119901119894

119898

(16)































04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in





(1) 119860(2) 119860

(119896)] due




(1) 119860(2) 119860



1199031

119860(2) asymp 119860(2)

1199032

119860(119896)

asymp

119860(119896)

119903119896


and 119860(119896)119903119896


(1)

1199031

119860(2)

1199032

119860(119896)

119903119896




Input


(1198891 1198892 119889

119899)




Output


Method


119899into




119903119894


[119860(1)

1199031

119860(2)

1199032

119860(119896)

119903119896

]


1and 119860

2 that is


1and119860

2 we obtain

1198601= 1198801Σ1119881119879

1and 119860

2= 1198802Σ2119881119879



1198601015840= (

1198601

0

0 1198602

)

1198801015840= (

1198801

0

0 1198802

)

Σ1015840= (

Σ1

0

0 Σ2

)

1198811015840119879= (

1198811015840119879

10

0 1198811015840119879

2

)

(2)


10158401198791198811015840= 119868119899


1198601015840=

1199031015840

sum

119894=1

1205901015840

1198941199061015840

119894V1015840119879119894 (3)


2which


1015840

119896= sum119896

119894=11205901015840

1198941199061015840

119894V1015840119879119894









1198891015840

119901=

119889119902 if 119889 isin 119862

119894 119901 = (119894 minus 1)119898 + 119902

0 otherwise(4)






1and119860






Input


(1198891 1198892 119889

119899)


Output


Method


119899into





120572


[119860(1)

120572 119860(2)

120572 119860

(119896)

120572]


119880Σ119881119879 we obtain 119880119879119860 = Σ119881






1198951199033

119895) where 119899

119895is




119895asymp 119899119896 and 119903

119895asymp 119903119896 That is to




119879 Σminus1119880119879119863) = 119880Σ(119881119863



119861 = 119880Σ119881119879





119896) is computed


119863 = (119863119879119880Σminus1)1119896

(5)





1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817


+ (119889119898minus 119888119894119898)2

(6)


119904 = arg119894

min1le119894le119896

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2 (7)


119889 = (119889119879119880Σminus1)1119903119904

(8)



1199031

119860(119904)

119903119904

119860(119896)

119903119896

) with

119860(119904)

119903119904

= 119880(119904)

1119903119904

Σ(119904)

11199031199041119903119904

(119881(119904)

1119903119904

| 119889

119879

) (9)



(119860119879 119879119879)119879 Then 119862 = ((119880Σ119881

119879)119879 119879119879)119879

= (119880Σ119881119879119879) =




119896is







119879 = (119879119881119896Σminus1

119896)11119896

(10)


119905(119894)= (119908(119894)

1 119908

(119894)

119895 119908

(119894)

119898119894

) (1 le 119894 le 119896) (11)

Here 119908(119894)119895


(119894) 119898119894is the number


119905

(119894)

= (119905(119894)119881(119894)Σ(119894)minus1

)11119903119894

(12)

Then each 119860(119894)119903119894

is updated using

119860(119894)

119903119894

= (

119880(119894)

11198981198941119903119894

119905

(119894))Σ(119894)

11199031198941119903119894

119881(119894)119879

11198991119903119894

(13)


119903119894

as

119860 = [119860(1)

1199031

119860(119894)

119903119894

119860(119896)

119903119896

] (14)








precision (119901119896)

=


119896

(15)


119894=1119901119894

119898

(16)































04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in





1and119860






Input


(1198891 1198892 119889

119899)


Output


Method


119899into





120572


[119860(1)

120572 119860(2)

120572 119860

(119896)

120572]


119880Σ119881119879 we obtain 119880119879119860 = Σ119881






1198951199033

119895) where 119899

119895is




119895asymp 119899119896 and 119903

119895asymp 119903119896 That is to




119879 Σminus1119880119879119863) = 119880Σ(119881119863



119861 = 119880Σ119881119879





119896) is computed


119863 = (119863119879119880Σminus1)1119896

(5)





1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817


+ (119889119898minus 119888119894119898)2

(6)


119904 = arg119894

min1le119894le119896

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2 (7)


119889 = (119889119879119880Σminus1)1119903119904

(8)



1199031

119860(119904)

119903119904

119860(119896)

119903119896

) with

119860(119904)

119903119904

= 119880(119904)

1119903119904

Σ(119904)

11199031199041119903119904

(119881(119904)

1119903119904

| 119889

119879

) (9)



(119860119879 119879119879)119879 Then 119862 = ((119880Σ119881

119879)119879 119879119879)119879

= (119880Σ119881119879119879) =




119896is







119879 = (119879119881119896Σminus1

119896)11119896

(10)


119905(119894)= (119908(119894)

1 119908

(119894)

119895 119908

(119894)

119898119894

) (1 le 119894 le 119896) (11)

Here 119908(119894)119895


(119894) 119898119894is the number


119905

(119894)

= (119905(119894)119881(119894)Σ(119894)minus1

)11119903119894

(12)

Then each 119860(119894)119903119894

is updated using

119860(119894)

119903119894

= (

119880(119894)

11198981198941119903119894

119905

(119894))Σ(119894)

11199031198941119903119894

119881(119894)119879

11198991119903119894

(13)


119903119894

as

119860 = [119860(1)

1199031

119860(119894)

119903119894

119860(119896)

119903119896

] (14)








precision (119901119896)

=


119896

(15)


119894=1119901119894

119898

(16)































04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817


+ (119889119898minus 119888119894119898)2

(6)


119904 = arg119894

min1le119894le119896

1003817100381710038171003817119889 minus 119888119894

1003817100381710038171003817

2 (7)


119889 = (119889119879119880Σminus1)1119903119904

(8)



1199031

119860(119904)

119903119904

119860(119896)

119903119896

) with

119860(119904)

119903119904

= 119880(119904)

1119903119904

Σ(119904)

11199031199041119903119904

(119881(119904)

1119903119904

| 119889

119879

) (9)



(119860119879 119879119879)119879 Then 119862 = ((119880Σ119881

119879)119879 119879119879)119879

= (119880Σ119881119879119879) =




119896is







119879 = (119879119881119896Σminus1

119896)11119896

(10)


119905(119894)= (119908(119894)

1 119908

(119894)

119895 119908

(119894)

119898119894

) (1 le 119894 le 119896) (11)

Here 119908(119894)119895


(119894) 119898119894is the number


119905

(119894)

= (119905(119894)119881(119894)Σ(119894)minus1

)11119903119894

(12)

Then each 119860(119894)119903119894

is updated using

119860(119894)

119903119894

= (

119880(119894)

11198981198941119903119894

119905

(119894))Σ(119894)

11199031198941119903119894

119881(119894)119879

11198991119903119894

(13)


119903119894

as

119860 = [119860(1)

1199031

119860(119894)

119903119894

119860(119896)

119903119896

] (14)








precision (119901119896)

=


119896

(15)


119894=1119901119894

119898

(16)































04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






precision (119901119896)

=


119896

(15)


119894=1119901119894

119898

(16)































04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in





















04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






04

042

044

046

048

05

Aver

age p

reci

sion

035

036

037

038

039

04

Aver

age p

reci

sion


04

042

044

046

048

05

Aver

age p

reci

sion


035

04

Aver

age p

reci

sion












Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in






Competing Interests


Acknowledgments


References








































Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



















Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in










Advances in

FuzzySystems


Volume 2014












Journal of

Journal of





Advances in

Multimedia


Biomedical Imaging



Advances in


RoboticsJournal of










Advances in



Date post:	30-Jul-2020
Category:	Documents
Upload:	others
View:	3 times
Download:	0 times