
Semi-Supervised Feature Selection via Spline Regression for Video Semantic Recognition

Yahong Han, Yi Yang, Zhigang Ma, Yan Yan, Nicu Sebe, Xiaofang Zhou

Abstract—In order to improve both the efficiency and accuracy of video semantic recognition, we can perform feature selection on the extracted video features to select a subset of features from the high dimensional feature set for a compact and accurate video data representation. Provided the number of labeled videos is small, supervised feature selection could fail to identify the relevant features that are discriminative to target classes. In many applications, abundant un-labeled videos are easily accessible. This motivates us to develop semi-supervised feature selection algorithms to better identify the relevant video features, which are discriminative to target classes, by effectively exploiting the information underlying the huge amount of un-labeled video data. In this paper, we propose a framework of video semantic recognition by Semi-Supervised Feature Selection via Spline Regression (S2FS2R). Two scatter matrices are combined to capture both the discriminative information and the local geometry structure of labeled and un-labeled training videos: a within-class scatter matrix to encode discriminative information of labeled training videos and a spline scatter output from a local spline regression to encode data distribution. An ℓ2,1-norm is imposed as a regularization term on the transformation matrix to ensure it is sparse in rows, making it particularly suitable for feature selection. To efficiently solve S2FS2R, we develop an iterative algorithm and prove its convergence. In the experiments, three typical tasks of video semantic recognition, namely video concept detection, video classification, and human action recognition, are used to demonstrate that the proposed S2FS2R achieves better performance compared with the state-of-the-art methods.

I. INTRODUCTION

In many applications of video semantic recognition, such as video concept detection [1] [2], human activity analysis [3] [4], and object tracking [5] [6], data are always represented by high dimensional feature vectors. For example, we can extract high dimensional heterogeneous visual features from one given video key frame, such as global features (color moment, edge direction, and Gabor) and local features (space-time interest points [7] and MoSIFT [8]). In the high dimensional space of visual features, it is hard to discriminate video samples of different classes from each other, which results in the so-called "curse of dimensionality" problem [9]. Moreover, in the presence of many irrelevant features, the training process of classification tends to overfit.

Dimensionality reduction is a commonly used step in machine learning to deal with a high dimensional space of features.

Yahong Han is with the School of Computer Science and Technology, Tianjin University, China, e-mail: [email protected]

Yi Yang, Yan Yan, and Xiaofang Zhou are with the School of Information Technology and Electrical Engineering, The University of Queensland, Australia, e-mail: [email protected], [email protected], [email protected]

Zhigang Ma and Nicu Sebe are with the Department of Information Engineering and Computer Science, University of Trento, Italy, email: {ma, sebe}@disi.unitn.it

The original feature space is mapped onto a new, reduced dimensionality space and the samples are represented in that new space. The mapping is usually performed either by constructing some new features or by selecting a subset of the original features. Two major strands of mapping by constructing new features are linear subspace learning (e.g., Principal Component Analysis [10]) and nonlinear manifold learning methods (e.g., isometric mapping of data manifolds [11]). This paper explores the second approach of dimensionality reduction, i.e., feature selection, and its applications to video semantic recognition.

Feature selection has a twofold role in improving both the efficiency and accuracy of data analysis. First, the dimensionality of the selected feature subset is much lower, making the subsequent computation on the input data more efficient. Second, the noisy features are eliminated for a better data representation, resulting in a more accurate classification result. Therefore, during recent years feature selection has attracted much research attention [1] [4] [12] [13] [14] [15]. In video semantic recognition, feature selection is usually applied for a higher classification accuracy and a compact feature representation [1] [4] [12] [6].

Feature selection algorithms can be roughly classified into two groups, i.e., supervised feature selection and unsupervised feature selection. Supervised feature selection determines feature relevance by evaluating a feature's correlation with the classes. For example, Fisher Score [16], robust regression [17], and sparse multi-output regression [18] usually select features according to labels of the training data. Because discriminative information is enclosed in the labels, supervised feature selection is usually able to select discriminative features. Without labels, unsupervised feature selection exploits data variance and separability to evaluate feature relevance. A frequently used criterion is to select the features which best preserve the data distribution or local structure derived from the whole feature set [19]. However, because there is no label information directly available, it is much more difficult for unsupervised feature selection to select the discriminative features [12].

In real-world applications, collecting high-quality labeled training videos is difficult, and at the same time abundant un-labeled videos are often easily accessible. Provided the number of labeled data is small, supervised feature selection could fail to identify the relevant features that are discriminative to target classes. This motivates us to develop a semi-supervised feature selection algorithm to better identify the relevant features. In order to use both labeled and un-labeled data, inspired by the semi-supervised learning algorithms [20] [21], semi-supervised feature selection algorithms utilize the data


Fig. 1. Flowchart of the proposed framework S2FS2R. We first construct the within-class scatter matrix to encode label information of labeled training videos. Data distribution and local geometry structure of both labeled and un-labeled training videos are preserved by the local spline regression. Combining within-class and spline scatters, we form a semi-supervised scatter matrix to encode data distribution and label information. An ℓ2,1-norm is imposed as a regularization term on the transformation matrix W to ensure that W is sparse in rows, making it particularly suitable for feature selection.

distribution or local structure of both labeled and un-labeled data to evaluate the features' relevance. For example, Zhao and Liu [14] introduced a semi-supervised feature selection algorithm based on spectral analysis. The spectral assumption states that the data points forming the same structure are likely to have the same label. Similarly, the method in [22] utilizes manifold regularization to consider the geometry of the data distribution. In [23], Kong et al. proposed a semi-supervised feature selection algorithm for graph data. Many local evaluations are introduced to model the neighboring data points so as to explore the data structures. Typical methods include data affinity between neighbors [24], locally linear representation [25], and locally nonlinear representation with kernels [26]. However, besides the parameter tuning problem in affinity measures with a Gaussian function, the locally linear representations and kernel functions may lack the ability to accurately capture the local geometry [27].

In this paper, to better exploit the data distribution and the local geometry of both labeled and un-labeled videos, we propose a framework of Semi-Supervised Feature Selection via Spline Regression (S2FS2R). The flowchart of the proposed framework is illustrated in Figure 1. Both the labeled and un-labeled video data are collected as training videos. For each video sample in the training and testing video set, we extract high-dimensional features to form the feature matrix X = [XL; XU] of the training data. As illustrated in Figure 1, to make use of the discriminative information in the labeled videos, we form a within-class scatter matrix on the labeled

training videos. To exploit the data distribution and local geometry underlying the huge amount of un-labeled videos, we use splines developed in Sobolev space [28] [27] to interpolate scattered videos in geometrical design; see the step of spline regression in Figure 1. By integrating the polynomials and Green's functions into the local spline [29] [27], the local geometry of video data can be smoothly and accurately captured according to their distribution. By summing the local losses estimated from all of the neighboring videos, we construct a spline scatter matrix to preserve the local geometry of labeled and un-labeled video data. Thus, the local structure and geometry of all training videos are preserved in the formed spline scatter matrix. Combining within-class and spline scatters, we form a semi-supervised scatter matrix to encode data distribution and label information. Our goal is to compute a transformation matrix W (see matrix W in Figure 1) which optimally preserves discriminative information and data distribution of training videos. To make W suitable for feature selection, we add an ℓ2,1-norm of W as a regularization term to ensure that W is sparse in rows [13] [17]. Then the learned W is able to select the most discriminant features for testing video prediction. To efficiently solve the ℓ2,1-norm minimization problem with the orthogonal constraint, we develop an iterative algorithm and prove its convergence.

In the experiments, four open benchmark video datasets are used to evaluate the performance of video semantic recognition by Semi-Supervised Feature Selection via Spline Regression (S2FS2R), which correspond to three typical video semantic


recognition tasks: video concept detection in news videos, video classification of consumer videos, and human action recognition. Experimental results show that S2FS2R achieves better video semantic recognition performance than state-of-the-art algorithms.

The remainder of this paper is organized as follows. In Section II, we briefly review the recent related works. The framework of S2FS2R and its solutions are introduced in Section III. In Section III-D, we develop an iterative algorithm to solve S2FS2R and prove its convergence. The experimental analysis and conclusions are given in Section IV and Section V, respectively.

II. RELATED WORKS

In this section, we review some of the representative related works on video representation and feature selection for video semantic recognition.

A. Video Feature Representations

In applications of video classification and video concept detection, one key frame within each shot is obtained as a representative image for that shot. In this way, video shots can be represented by the extracted low-level visual features of the corresponding key frames. For example, TRECVID¹

provides global features of each key frame, such as color histograms, textures, and canny edge. With the popularity of key-point based local features, e.g., the SIFT feature [30], and their successful applications in scene classification [31], we can also represent each key frame using a Bag-of-Words (BoW) approach. Another important characteristic of video data is the temporally associated co-occurrence. Considering that each video frame is a two-dimensional object represented by image features, the temporal axis makes up the third dimension. Thus, a video stream spans a three-dimensional space. As discussed in [3], the SIFT feature lacks the ability to represent temporal information in videos and does not consider motion information. Recently, multi-instance space-time volumes [32], space-time interest points (STIP) [7], and MoSIFT [8] representations have been respectively proposed to model more information of video data. In order to perform video event detection in real-world conditions, Ke et al. [32] efficiently match the volumetric representation of an event against over-segmented spatio-temporal video volumes. The STIP descriptor concatenates several histograms from a space-time grid defined on the patch and generalizes the SIFT descriptor to space-time. In contrast, MoSIFT detects interest points and not only encodes their local appearance but also explicitly models the local motion. Owing to the above characteristics, STIP and MoSIFT have been widely used in motion analysis and human action recognition [3] [7] [8].

B. Feature Selection for Video Semantic Recognition

Feature selection has an important role in improving both the efficiency and accuracy of video semantic recognition. During recent years, feature selection has attracted much

¹ http://trecvid.nist.gov/

research attention [16] [19] [14]. However, most feature selection algorithms evaluate the importance of each feature individually and select features one by one. A limitation is that the correlation among features is neglected. Sparsity-based methods, e.g., lasso, use the ℓ1-norm of coefficient vectors as a penalty to make many coefficients shrink to zero, which can be used for feature selection. For example, sparse multinomial logistic regression via Bayesian ℓ1 regularization (SBMLR) [33] exploits sparsity by using a Laplace prior. Inspired by block sparsity, [17] employs a joint ℓ2,1-norm minimization on both the loss function and regularization to realize feature selection across all data points. More recently, researchers have applied the two-step approach, i.e., spectral regression, to supervised and unsupervised feature selection [18]. The works in [17] [18] [34] [35] have shown that it is a better way to evaluate the importance of the selected features jointly. On the other hand, though some multiple kernel feature selection methods have been proposed for video semantic recognition [36], semi-supervised feature selection for video semantic recognition has not been well explored. In this paper, we propose a new one-step approach to perform semi-supervised feature selection by simultaneously exploiting discriminative information and preserving the local geometry of labeled and un-labeled video data.

III. SEMI-SUPERVISED FEATURE SELECTION VIA SPLINE REGRESSION

In this section, we present the framework of Semi-Supervised Feature Selection via Spline Regression (S2FS2R). In order to solve this framework efficiently, we develop an iterative algorithm and prove its convergence. To better present the proposed methods, we also introduce local spline regression in this section. In the following, we first provide the notations used in the rest of this paper.

A. Notations

Let us denote X = {x1, x2, . . . , xn} as the training set of videos, where xi ∈ R^d (1 ≤ i ≤ n) is the i-th video sample and n is the total number of training instances. For each video sample, we extract d-dimensional video features, and the matrix of training videos can then be represented by X = [x1, . . . , xn] ∈ R^{d×n}. We let XL = [x1, . . . , x_{nl}] ∈ R^{d×nl} denote the first nl (nl ≤ n) video samples in X, which are the labeled videos, for which the labels YL = [y1, . . . , y_{nl}] ∈ {0, 1}^{c×nl} are provided for the c semantic categories. XU = [x_{nl+1}, . . . , x_{nl+nu}] ∈ R^{d×nu} denotes the un-labeled videos whose labels are not given. Thus we have X = [XL, XU] and n = nl + nu. In this paper, I is an identity matrix. For an arbitrary matrix M ∈ R^{r×p}, its ℓ2,1-norm is defined as

\|M\|_{2,1} = \sum_{i=1}^{r} \sqrt{\sum_{j=1}^{p} M_{ij}^2}.   (1)

We let M_{(s,:)} and M_{(:,t)} denote the s-th row and the t-th column vector of matrix M, respectively.
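For concreteness, a minimal NumPy sketch of the ℓ2,1-norm in Eq. (1), i.e., the sum of the ℓ2-norms of the rows (the function name is ours):

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: sum over rows of the Euclidean norm of each row (Eq. (1))."""
    return np.sum(np.sqrt(np.sum(M ** 2, axis=1)))

# Example: a matrix with rows [3, 4] and [0, 0] has ||M||_{2,1} = 5.
M = np.array([[3.0, 4.0], [0.0, 0.0]])
print(l21_norm(M))  # 5.0
```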


B. Proposed Framework

In applications of video semantic recognition, such as video concept detection, video classification, and human action recognition, the extracted video features are usually high-dimensional. Selecting a subset of features for a compact and accurate video representation will improve the efficiency and accuracy of video semantic recognition. To select the most discriminative video features for video semantic recognition, we assume there is a transformation matrix W ∈ R^{d×c} (c < d) which maps the high-dimensional video samples onto a lower-dimensional subspace, and x'_i = W^T x_i is the new representation of each video sample x_i in that subspace. As each row of W is used to weight one feature, if some rows of W shrink to zero, W can be used for feature selection. In the general framework of graph embedding for dimensionality reduction [37], a better transformation matrix W can be learned by the minimization of Tr(W^T M W), where matrix M encodes certain structures of the training data. In this paper, we propose the framework of semi-supervised feature selection to solve the following ℓ2,1-norm regularized minimization problem:

\min_{W^T W = I} \; Tr(W^T M W) + \lambda \|W\|_{2,1},   (2)

where the regularization term ||W||_{2,1} controls the capacity of W and also ensures that W is sparse in rows, making it particularly suitable for feature selection. Parameter λ controls the regularization effect. M ∈ R^{d×d} is a semi-supervised scatter matrix which encodes both data distribution and label information. The orthogonal constraint W^T W = I is imposed to avoid arbitrary scaling and the trivial solution of all zeros.

We define M as:

M = A + \mu D,   (3)

where the weight parameter μ (0 ≤ μ ≤ 1) is used to control the weight of matrix D. Matrix A ∈ R^{d×d} is a scatter matrix which encodes label information of labeled training videos. Matrix D ∈ R^{d×d} is a scatter matrix which encodes local structural information of all training videos (both labeled and un-labeled). Thus, if μ = 0 we incorporate no local distribution of training videos. In the following section, we present the details of matrices A and D.

C. Estimation of Scatter Matrices

1) The Within-Class Scatter Matrix: Fisher discriminant analysis [16] is a well-known method to utilize discriminative information of the labeled data to find a low-dimensional subspace that better separates samples. Fisher discriminant analysis maximizes the ratio of between-class and within-class scatter matrices. In this way, data from the same class are close to each other and data from different classes are far apart from each other in the subspace. If we incorporated both the between-class and within-class scatter matrices into A of Eq. (3), one more parameter would have to be introduced [38], whose value is difficult to tune. Thus, in this work, we use the within-class scatter matrix of Fisher discriminant analysis to encode the label information of training videos.

The within-class scatter matrix A is estimated as follows:

A = \sum_{j=1}^{c} \frac{1}{N_j} \sum_{x \in \omega_j} (x - m_j)(x - m_j)^T,   (4)

where m_j = \frac{1}{N_j} Y_{(j,:)} X^T is the sample mean (j = 1, . . . , c) for the j-th class, N_j = \sum_{i=1}^{n_l} Y_{(j,i)} is the number of labeled samples in class j, and ω_j = {x_i | Y_{(j,i)} = 1} is the set of labeled videos in class j.
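A minimal NumPy sketch of Eq. (4), assuming XL is the d × n_l labeled data matrix and YL the c × n_l binary indicator matrix defined above (the function name is ours):

```python
import numpy as np

def within_class_scatter(XL, YL):
    """Within-class scatter matrix A of Eq. (4).
    XL: d x nl labeled data, YL: c x nl binary class-indicator matrix."""
    d, nl = XL.shape
    c = YL.shape[0]
    A = np.zeros((d, d))
    for j in range(c):
        idx = np.where(YL[j] == 1)[0]      # samples in class j (omega_j)
        Nj = len(idx)
        if Nj == 0:
            continue
        mj = XL[:, idx].mean(axis=1)       # class mean m_j
        diff = XL[:, idx] - mj[:, None]
        A += diff @ diff.T / Nj            # (1/N_j) * sum_x (x - m_j)(x - m_j)^T
    return A
```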

2) The Spline Scatter Matrix: Suppose matrix G ∈ R^{n×n} encodes the local similarity relationship of each pair of samples in X; then the local structure of the training videos can be preserved in XGX^T. A recent study [27] shows that, if the local geometry of the training data (both labeled and un-labeled) is represented in G, then the unsupervised local distribution of the training data can be utilized. We define the spline scatter matrix D to be:

D = XGX^T,   (5)

where matrix G is obtained by a local spline regression [27]. It has been shown that splines developed in Sobolev space [28] can be used to interpolate the scattered distribution and preserve the local geometry structure of training data. A Sobolev space is a space of functions with sufficiently many derivatives for some application domain [28]. One important property of the Sobolev space is that it provides conditions under which a function can be approximated by smooth functions. Splines developed in Sobolev space [28] are a combination of polynomials and Green's functions, which are popularly used to interpolate scattered data in geometrical design [29]. This spline is smooth, nonlinear, and able to interpolate the scattered data points with high accuracy. Recent research has shown that it can effectively handle high-dimensional data [27]. In the following, we briefly introduce how to estimate the matrix G.

Given each datum x_i ∈ X, to exploit its local similarity structure, we add its k−1 nearest neighbors as well as x_i itself into a local clique denoted as N_i = {x_i, x_{i_1}, x_{i_2}, . . . , x_{i_{k−1}}}. The goal of local spline regression is to find a function g_i : R^d → R such that it can directly associate each data point x_{i_j} ∈ R^d with a class label y_{i_j} = g_i(x_{i_j}) (j = 1, 2, . . . , k), which is a regularized regression process:

J(g_i) = \sum_{j=1}^{k} \left( f_{i_j} - g_i(x_{i_j}) \right)^2 + \gamma S(g_i),   (6)

where S(g_i) is a penalty functional and γ > 0 is a trade-off parameter. Parameter γ controls the amount of smoothness of the spline [27]. In order to utilize the good characteristics of splines in Sobolev space [39], provided the penalty term S(g_i) is defined as a semi-norm², the minimizer g_i of Eq. (6) is given by

g_i(x) = \sum_{j=1}^{m} \beta_{i,j} p_j(x) + \sum_{j=1}^{k} \alpha_{i,j} G_{i,j}(x),   (7)

² A norm is a function that assigns a strictly positive length or size to all vectors in a vector space, other than the zero vector (which has zero length assigned to it). A semi-norm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).


where m = (d + s − 1)!/(d!(s − 1)!) [39]. {p_j(x)}_{j=1}^{m} and G_{i,j} are a set of primitive polynomials and a Green's function, respectively, which are defined in [39]. It has been shown in [27] that the local function g_i(x) can better fit the local geometry structure near the scattered points, as the data points can be locally wrapped by the Green's function G_{i,j}(x). Now, our task is to estimate the parameters α and β. According to [39], the coefficients α_i and β_i can be solved from

A \cdot \begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix} = \begin{pmatrix} Y_i^T \\ 0 \end{pmatrix},   (8)

where Y_i = [y_i, y_{i_1}, y_{i_2}, . . . , y_{i_{k−1}}] corresponds to the label indicator of the data points in N_i generated by the local function g_i, and

A = \begin{pmatrix} K_i & P \\ P^T & 0 \end{pmatrix} \in R^{(k+m)×(k+m)},

in which K_i is a k × k symmetrical matrix with elements K_{p,q} = G_{p,q}(\|x_{i_p} − x_{i_q}\|) and P is a k × m matrix with elements P_{i,j} = p_i(x_{i_j}). Denoting M_i as the upper left k × k sub-matrix of the matrix A^{−1}, it can be demonstrated that [27], [39]

J(g_i) ≈ \eta Y_i^T M_i Y_i,   (9)

where η is a scalar. Since there are n local functions with respect to the n local cliques, we now consider how to integrate the label indicators generated by the different local functions. As each local indicator matrix Y_i = [y_i, y_{i_1}, . . . , y_{i_{k−1}}] is a sub-matrix of the global indicator matrix Y = [y1, y2, . . . , yn], we can find a column selection matrix S_i ∈ R^{n×k} to map the global indicator matrix into the local indicator matrix.

More specifically, given the r-th row and c-th column element S_i(r, c), if the column selection matrix S_i satisfies

S_i(r, c) = \begin{cases} 1, & \text{if } r = i_c, \\ 0, & \text{otherwise}, \end{cases}   (10)

then we have Y_i = Y S_i. In this way, the global label indicator matrix Y can be mapped into n local indicator matrices by n column selection matrices. Thus the combined local loss becomes

\sum_{i=1}^{n} J(g_i) = \gamma \sum_{i=1}^{n} Y_i^T M_i Y_i = \gamma S^T Y^T M Y S,   (11)

where S = [S_1, S_2, . . . , S_n] and M = diag(M_1, M_2, . . . , M_n). For each video point, the local indicators generated by the different local functions are integrated into one matrix to find the overall optimized label indicator matrix. Defining

G = S^T M S,   (12)

the spline scatter matrix D = XGX^T sums up the local distributions and encodes the geometry structure of the labeled and un-labeled training videos.
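To make the construction concrete, the following sketch assembles D = XGX^T from the local M_i blocks (Eqs. (5)-(12)) under simplifying assumptions: a generic radial function G(r) = r^3 stands in for the Sobolev-space Green's function of [27], [39], and first-order polynomials p(x) = [1, x^T] are used, so m = d + 1; the actual choices follow those references.

```python
import numpy as np

def spline_scatter(X, k=5):
    """Assemble D = X G X^T from local spline regressions.
    X: d x n data matrix.  The Green's function and polynomial basis below
    are simplified placeholders (see the note above)."""
    d, n = X.shape
    green = lambda r: r ** 3                        # placeholder Green's function
    G = np.zeros((n, n))
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)   # n x n distances
    for i in range(n):
        nbrs = np.argsort(dist[i])[:k]              # local clique N_i (includes i)
        Xi = X[:, nbrs]                             # d x k
        K = green(dist[np.ix_(nbrs, nbrs)])         # k x k
        P = np.vstack([np.ones(k), Xi]).T           # k x (d+1), first-order polynomials
        m = P.shape[1]
        A = np.zeros((k + m, k + m))
        A[:k, :k], A[:k, k:], A[k:, :k] = K, P, P.T
        Mi = np.linalg.pinv(A)[:k, :k]              # upper-left k x k block of A^{-1}
        G[np.ix_(nbrs, nbrs)] += Mi                 # accumulate M_i (role of the S_i matrices)
    return X @ G @ X.T                              # D = X G X^T
```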

D. Solution and Algorithm

The ℓ2,1-norm regularized minimization problem has been studied in previous works [17]. However, it remains unclear how to directly apply the existing algorithms to optimize our objective function in Eq. (2), where the orthogonal constraint

Algorithm 1 Semi-Supervised Feature Selection via Spline Regression (S2FS2R)

Input: matrix of n training videos X = [x1, . . . , xn] ∈ R^{d×n}; XL = [x1, . . . , x_{nl}] ∈ R^{d×nl} is the matrix of the first nl (nl ≤ n) labeled video samples and YL = [y1, . . . , y_{nl}] ∈ {0, 1}^{c×nl} is the corresponding indicator matrix for the c labels (or semantic categories); XU = [x_{nl+1}, . . . , x_{nl+nu}] ∈ R^{d×nu} is the matrix of un-labeled videos whose labels are not given; k is the number of nearest neighbors in the local clique N_i for each video x_i; control parameter μ and regularization parameter λ; f is the number of features to be selected.
Output: index idx of the top f selected features.

1: for each video x_i ∈ X do
2:   Construct local clique N_i by adding x_i with its k − 1 nearest neighbors;
3:   Construct matrix K_i using the Green's function G_{i,j} defined on N_i;
4:   Construct matrix A = \begin{pmatrix} K_i & P \\ P^T & 0 \end{pmatrix};
5:   Construct matrix M_i, which is the upper left k × k sub-matrix of the matrix A^{−1};
6: end for
7: Form matrix D using Eq. (5);
8: Form matrix A using Eq. (4);
9: Form matrix M using Eq. (3);
10: Set t = 0 and initialize D_{(0)} ∈ R^{d×d} to be an identity matrix;
11: repeat
12:   U_{(t)} = M + λ D_{(t)};
13:   W_{(t)} = [u_1, . . . , u_c], where u_1, . . . , u_c are the eigenvectors of U_{(t)} corresponding to the first c smallest eigenvalues;
14:   Update matrix D_{(t+1)} as D_{(t+1)} = diag\left( \frac{1}{2\|w^1_{(t)}\|_2}, \ldots, \frac{1}{2\|w^d_{(t)}\|_2} \right), where w^i_{(t)} is the i-th row of W_{(t)};
15:   t = t + 1;
16: until convergence.
17: Sort the features X_{(j,i)}, i = 1, . . . , d, of the j-th video sample according to the value of \|w^i\|_2 in descending order;
18: Output the index idx of the top f selected features.

W^T W = I is imposed. In this section, we give a new approach to solve the optimization problem shown in Eq. (2) for feature selection. The proposed algorithm is very efficient in solving the ℓ2,1-norm minimization problem with the orthogonal constraint. We observe in the experiments that the algorithm usually converges in around 30 iterations. We summarize the detailed solution of S2FS2R in Algorithm 1. Once the optimal W is obtained, we sort the d features of the j-th video sample X_{(j,i)}, i = 1, . . . , d, according to the value of \|w^i\|_2 (i = 1, . . . , d) in descending order and select the top-ranked video features.

From step 11 to step 16 in Algorithm 1, we propose an iterative approach to optimize the minimization problem in Eq. (2).
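As an illustration, steps 10-16 (and the final ranking of steps 17-18) of Algorithm 1 can be sketched in NumPy as follows; the small eps guard against zero rows and the function name are ours:

```python
import numpy as np

def solve_s2fs2r(M, lam, c, n_iter=30, eps=1e-12):
    """Iteratively minimize Tr(W^T M W) + lam * ||W||_{2,1} s.t. W^T W = I
    (steps 10-16 of Algorithm 1) and return W plus the feature ranking."""
    d = M.shape[0]
    D = np.eye(d)                              # step 10: D_(0) = I
    for _ in range(n_iter):                    # steps 11-16
        U = M + lam * D                        # step 12
        _, vecs = np.linalg.eigh(U)            # eigenvalues in ascending order
        W = vecs[:, :c]                        # step 13: c smallest eigenvalues
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)
        D = np.diag(1.0 / (2.0 * row_norms))   # step 14
    idx = np.argsort(-np.linalg.norm(W, axis=1))   # steps 17-18: rank features
    return W, idx
```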


[Figure 2 shows example frames: (a) bird, (b) desert, (c) explosion, (d) office, (e) sports, (f) weather; (g) animal, (h) birthday, (i) dancing, (j) picnic, (k) sports, (l) wedding; (m) walking, (n) jogging, (o) running, (p) boxing, (q) hand waving, (r) hand clapping; (s) cycling, (t) diving, (u) juggling, (v) jumping, (w) riding, (x) shooting.]

Fig. 2. Example video frames from the four datasets. From the top to the bottom rows are videos from the TRECVID, Kodak, KTH, and UCF YouTube datasets, respectively.

In the following, we verify in Theorem 1 that the proposed iterative approach in Algorithm 1 converges to the optimal W corresponding to Eq. (2). We mainly follow the proof from our previous work [13] to prove Theorem 1. The details of the proof are given in Appendix A.

Theorem 1. The iterative approach in Algorithm 1 (from step 1 to step 16) monotonically decreases the objective function value of Tr(W^T M W) + \lambda \sum_{i=1}^{d} \|w^i\|_2, s.t. W^T W = I, in each iteration until convergence.

According to Theorem 1, we can see that the iterative approach in Algorithm 1 converges to the optimal W corresponding to Eq. (2). In Algorithm 1, because k is much smaller than n, the time complexity of computing D, A, and M is about O(n^2). Moreover, the computation of D, A, and M is outside the iterative process of Algorithm 1. Thus, to optimize the objective function of S2FS2R, the most time-consuming operation is the eigen-decomposition of U_{(t)}. Note that U_{(t)} ∈ R^{d×d}. According to [40], the eigen-decomposition of U_{(t)} is solved by the tridiagonal QR iteration algorithm, which is the main algorithm of the function eig in MATLAB. It first performs a tridiagonal reduction of U_{(t)}, which needs (8/3)d^3 + O(d^2) flops [40]. Then the tridiagonal QR iteration needs O(d^2) flops. Thus, the time complexity of this operation is approximately O(d^3).

IV. EXPERIMENTS

In this section, three typical tasks of video semantic recognition, i.e., video concept detection in news videos, video classification of consumer videos, and human action recognition, are used to investigate the performance of the proposed S2FS2R algorithm. Accordingly, we use four open benchmark video datasets to compare S2FS2R with the state-of-the-art algorithms.

A. Video Datasets

We choose four video datasets, i.e., TRECVID³, Kodak [41], KTH [42], and the UCF YouTube action dataset [43], in our experiments. In Figure 2, we show sample videos and the corresponding class labels/concepts of TRECVID, Kodak, KTH, and UCF YouTube. We summarize the datasets used in our experiment in Table I. The following is a brief description of the four datasets.

TRECVID: We use the Columbia374 baseline detectors [44] for TRECVID 2005⁴ in our experiments. TRECVID 2005 consists of about 170 hours of TV news videos from 13 different programs in English, Arabic, and Chinese. We use the development set in our experiments, since it has annotations of the semantic concepts defined in LSCOM (Large-Scale Concept Ontology for Multimedia) [44], which can be taken as the ground truth. As there are 39 concepts annotated in the TRECVID 2005 dataset in total, we use all these 39 concepts in our experiment. Thus, the dataset used in our experiments includes 61,562 labeled key frames.

³ http://trecvid.nist.gov/
⁴ http://www-nlpir.nist.gov/projects/tv2005/


TABLE I
A BRIEF SUMMARY OF THE FOUR VIDEO DATASETS USED IN OUR EXPERIMENT. IN THIS TABLE, N, d, AND c DENOTE THE NUMBER OF INSTANCES, THE DIMENSIONALITY OF VIDEO FEATURES, AND THE NUMBER OF CLASSES IN EACH OF THE FOUR DATASETS, RESPECTIVELY.

Dataset       TRECVID                  Kodak                    KTH                       UCF YouTube
Video Types   News                     Consumer                 Human Action              Human Action
Tasks         Video Concept Detection  Video Concept Detection  Human Action Recognition  Human Action Recognition "in the Wild"
N             61,562                   3,590                    2,391                     1,596
d             546                      1,000                    1,000                     1,000
c             39                       22                       6                         11

Three global feature types used in [44], namely the 73-dimensional edge direction histogram (EDH), the 48-dimensional Gabor (GBR), and the 225-dimensional grid color moment (GCM), together with the 200-dimensional canny edge provided by NIST, are combined into a 546-dimensional vector of global features to represent each key frame in our experiments.

Kodak: There are 5,166 key frames extracted from 1,358 consumer video clips in this dataset. Among these key frames, 3,590 key frames are annotated by students from Columbia University, who were asked to assign binary labels for each concept. We use all the annotated key frames belonging to 22 concepts in our experiments. We extracted SIFT points for each key frame. A randomly selected subset of the extracted SIFT points is then clustered to produce 1,000 centers as the visual dictionary. Finally, each key frame is quantized into a 1,000-dimensional histogram of bag-of-visual-words (BoW).
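The BoW construction described above can be sketched roughly as follows; scikit-learn's MiniBatchKMeans is our choice of clustering here, since the paper does not specify the implementation:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_bow_histograms(descriptor_sets, n_words=1000, sample_size=100000, seed=0):
    """descriptor_sets: list of (n_i x p) local-descriptor arrays, one per key frame.
    Cluster a random subset of all descriptors into a visual dictionary,
    then quantize each key frame into an n_words-dimensional histogram."""
    rng = np.random.default_rng(seed)
    all_desc = np.vstack(descriptor_sets)
    pick = rng.choice(len(all_desc), min(sample_size, len(all_desc)), replace=False)
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(all_desc[pick])
    hists = []
    for desc in descriptor_sets:
        words = kmeans.predict(desc)                       # assign each point to a visual word
        hists.append(np.bincount(words, minlength=n_words).astype(float))
    return np.array(hists)                                 # one BoW histogram per key frame
```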

KTH: The KTH actions dataset [42] contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four different scenarios. Currently the dataset contains 2,391 video sequences. In our experiments, we describe each video sequence using space-time interest points (STIP) [7]. For each STIP point, descriptors of the associated space-time patch were computed. Two alternative patch descriptors were computed in terms of (i) histograms of oriented (spatial) gradient (HOG) and (ii) histograms of optical flow (HOF). Thus, the STIP descriptor concatenates several histograms from a space-time grid defined on the patch and generalizes the SIFT descriptor to space-time. We built a 1,000-dimensional visual vocabulary of local space-time descriptors and assign each interest point to a visual word label. In this way, each video sequence in KTH is represented by a 1,000-dimensional STIP feature.

UCF YouTube: The UCF YouTube action dataset [43] contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog. This dataset is very challenging for recognizing realistic actions from videos "in the wild", due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. For each category, the videos are grouped into 25 groups with more than 4 action clips in each group. The video clips in the same group may share some common features, such as the same actor, similar background, similar viewpoint, and so on. In our experiments, we describe each video sequence using space-time interest points (STIP) [7]. We built a 1,000-dimensional visual vocabulary of local space-time descriptors and assign each interest point to a visual word label. In this way, each video sequence in UCF YouTube is represented by a 1,000-dimensional STIP feature.

B. Evaluation Metric

We evaluate the classification performance in terms of F1-score (F-measure). Since there are multiple concepts (semantic categories) in our experiments, to measure the global performance across multiple classes, we use the microaveraging method following [45]. Therefore, the evaluation criterion we use is microF1. More specifically, we present the "micro-" definition as follows.

Let Y* ∈ {0, 1}^{n×c} denote the indicator matrix of the ground truth for the testing data, and Ŷ ∈ R^{n×c} denote the corresponding estimated indicator matrix, where c denotes the number of classes. Function F1(a, b) computes the F1-score between vectors a and b. Let function Vec(A) denote the operator that converts matrix A into a vector by concatenating its columns sequentially; then the "micro-" criterion is

microF1 = F1(Vec(Y*), Vec(Ŷ)),

where the F1 score is defined as the harmonic mean of precision and recall, with the functions precision(a, b) and recall(a, b) defined in [46]:

F1(a, b) = \frac{2 \cdot precision(a, b) \cdot recall(a, b)}{precision(a, b) + recall(a, b)}.
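A minimal NumPy sketch of this microF1 computation (the function name is ours; both indicator matrices are assumed to be binary n × c arrays):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """microF1 as defined above: vectorize both indicator matrices column by
    column (Vec(.)) and compute a single F1 score over all entries."""
    y_true = Y_true.ravel(order="F")
    y_pred = Y_pred.ravel(order="F")
    tp = np.sum((y_true == 1) & (y_pred == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```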

C. Experimental Configuration

1) Parameter Setting: Four parameters, i.e., k, µ, λ, and f in Algorithm 1, need to be set and tuned. In our experiments, we chose k = 5, 10 in the construction of the local clique N_i for each video x_i. We set µ = 1 to treat the scatter matrices D and A equally. Parameter λ determines the regularization effect of the ℓ2,1-norm in Eq. (2) and should be well tuned. The best number of features to be selected, i.e., f, will differ for different feature types and different video data. In our experiments, we use a 5-fold cross-validation process to tune parameters λ and f simultaneously. The ranges for λ are set to λ ∈ {1e-3, 1e-2, 1e-1, 1, 10, 100, 1,000} for all datasets. Because the feature dimensionality of TRECVID is d = 546 and f ≤ d (see Section IV-A), the range of f for TRECVID is f ∈ {100, 200, 300, 400, 500}; f ∈ {50, 100, 200, 400, 600, 800, 900} for the Kodak, KTH, and UCF YouTube datasets, as the feature dimensionality of these three datasets is d = 1,000.
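As a rough sketch of this joint tuning loop, assuming the `solve_s2fs2r` helper from the Algorithm 1 sketch above and a hypothetical `build_M(X_tr, y_tr)` that assembles the scatter matrix of Eq. (3) on a training fold; scikit-learn supplies the kNN classifier and fold splitting:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

def tune_lambda_f(X, y, build_M, lambdas, fs, n_splits=5):
    """X: n x d features, y: labels; build_M is a user-supplied (hypothetical)
    routine returning the d x d semi-supervised scatter matrix for a fold."""
    best = (None, None, -np.inf)
    for lam in lambdas:
        for f in fs:
            scores = []
            for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
                M = build_M(X[tr], y[tr])
                _, idx = solve_s2fs2r(M, lam, c=len(np.unique(y)))
                sel = idx[:f]                                      # keep the top-f features
                clf = KNeighborsClassifier(n_neighbors=10).fit(X[tr][:, sel], y[tr])
                scores.append(f1_score(y[te], clf.predict(X[te][:, sel]), average="micro"))
            if np.mean(scores) > best[2]:
                best = (lam, f, np.mean(scores))
    return best  # (best_lambda, best_f, cross-validated microF1)
```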


[Figure 3: four panels plotting microF1 versus λ for TRECVID (1%,5), (1%,10), (5%,5), and (5%,10), with one curve per f ∈ {100, 200, 300, 400, 500}.]

Fig. 3. Performance of video semantic recognition by the proposed S2FS2R on TRECVID when λ and f are set to different values. The impacts of the parameters are reported when the ratio of labeled training data is set to 5% and 1%. The number "5" or "10" after the ratio in each panel title denotes the value of k = 5, 10 in the construction of the local clique N_i. For example, "(1%,5)" denotes that the ratio of labeled training data is 1% and k = 5 for N_i.

[Figure 4: four panels plotting microF1 versus λ for Kodak (1%,5), (1%,10), (5%,5), and (5%,10), with one curve per f ∈ {50, 100, 200, 400, 600, 800, 900}.]

Fig. 4. Performance of video semantic recognition by the proposed S2FS2R on Kodak when λ and f are set to different values. The impacts of the parameters are reported when the ratio of labeled training data is set to 5% and 1%. The number "5" or "10" after the ratio in each panel title denotes the value of k = 5, 10 in the construction of the local clique N_i. For example, "(1%,5)" denotes that the ratio of labeled training data is 1% and k = 5 for N_i.

2) Partition of Training/Testing Videos: We randomly sampled 10,000 and 2,000 video key frames as the training data for the TRECVID and Kodak datasets, respectively. For the KTH and UCF YouTube datasets, we randomly sampled 1,000 video clips as training data. The remaining data are used as the corresponding testing data for each of the four datasets. For all these datasets, the sampling process was repeated five times to generate five random training/testing partitions, and the average performance over the five repetitions is reported. The significance of the repeated results has been assessed with the Student's t-test. In this experiment, we report the average results of the repetitions. For the first random partition of the five repetitions, we tuned and chose the best parameters λ and f using 5-fold cross-validation. Then the tuned values of λ and f were fixed for all the remaining partitions. In order to investigate the performance of semi-supervised feature selection, we set the ratio of labeled training videos among the sampled training videos to different values from {50%, 25%, 10%, 5%, 1%}.

3) Classifiers and Comparison Methods: Once the index idx of the features to be selected is obtained, we train a classifier on the selected video features. In our experiments, we chose the kNN classifier (k = 10) for the four datasets. Furthermore, as shown in [47], the χ2 kernel SVM is a better classifier for human action recognition, especially for BoW histogram representations. Thus, in this experiment, for the task of human action recognition on the KTH and UCF YouTube datasets, we also report the results of the χ2 kernel in a Support Vector Machine (χ2-SVM); a sketch of this setup is given after the method list below. To show the comparative performance, we first compare S2FS2R with two baselines:

• Classification with full features: Conduct classification on the original features by kNN (k = 10) or χ2-SVM.

• Classification with PCA [10]: Conduct classification on the reduced features obtained by dimensionality reduction with PCA.

We also compare S2FS2R with five state-of-the-art feature selection methods. Detailed information on these methods is given as follows.

• Fisher Score (FScore) [16]: It depends on fully labeled training data to select the features with the best discriminating ability.

• Feature Selection via Spectral Analysis (FSSA) [14]: It is a semi-supervised feature selection method using spectral regression.

• Feature Selection via Joint ℓ2,1-Norms Minimization (FSNM) [17]: It employs joint ℓ2,1-norm minimization on both the loss function and the regularization to realize feature selection across all data points.

• Sparse Multinomial Logistic Regression via Bayesian ℓ1 Regularization (SBMLR) [33]: It exploits sparsity by using a Laplace prior and is used for multi-class pattern recognition. It can also be applied to feature selection.

• Discriminative Semi-Supervised Feature Selection via Manifold Regularization (FS-Manifold) [22]: It selects features by maximizing the classification margin between different classes while simultaneously exploiting the data geometry via manifold regularization.
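For the χ2-SVM classifier mentioned above, a hedged sketch using scikit-learn's chi2_kernel with a precomputed-kernel SVC; the paper does not state its exact χ2-SVM implementation, so gamma and C below are placeholders:

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def chi2_svm_predict(train_bow, train_labels, test_bow, gamma=1.0, C=1.0):
    """Train an SVM with an exponential chi-squared kernel on BoW histograms
    (rows are non-negative histograms) and predict labels for the test set."""
    K_train = chi2_kernel(train_bow, gamma=gamma)             # n_train x n_train
    K_test = chi2_kernel(test_bow, train_bow, gamma=gamma)    # n_test x n_train
    clf = SVC(kernel="precomputed", C=C).fit(K_train, train_labels)
    return clf.predict(K_test)
```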

Moreover, we investigate special instantiations of S2FS2R, which correspond to different settings of |N_i| = 5, 10 and µ = 0, 1. To demonstrate the impact of the size of the local clique N_i in the local spline regression, we let "S2FS2R(5)" and "S2FS2R(10)" denote S2FS2R with |N_i| = 5 and |N_i| = 10, respectively. Note that when µ = 0 we have M = A (see Eq. (3)), which means the spline scatter matrix D is not included and the information of the unsupervised local distribution is not utilized.


[Figure 5: four panels plotting microF1 versus λ for KTH (1%,5), (1%,10), (5%,5), and (5%,10), with one curve per f ∈ {50, 100, 200, 400, 600, 800, 900}.]

Fig. 5. Performance of video semantic recognition by the proposed S2FS2R on KTH when λ and f are set to different values. The impacts of the parameters are reported when the ratio of labeled training data is set to 5% and 1%. The number "5" or "10" after the ratio in each panel title denotes the value of k = 5, 10 in the construction of the local clique N_i. For example, "(1%,5)" denotes that the ratio of labeled training data is 1% and k = 5 for N_i.

[Figure 6: four panels plotting microF1 versus λ for YouTube (1%,5), (1%,10), (5%,5), and (5%,10), with one curve per f ∈ {50, 100, 200, 400, 600, 800, 900}.]

Fig. 6. Performance of video semantic recognition by the proposed S2FS2R on YouTube when λ and f are set to different values. The impacts of the parameters are reported when the ratio of labeled training data is set to 5% and 1%. The number "5" or "10" after the ratio in each panel title denotes the value of k = 5, 10 in the construction of the local clique N_i. For example, "(1%,5)" denotes that the ratio of labeled training data is 1% and k = 5 for N_i.

In the following, we let "S2FS2R(without local)" denote S2FS2R with µ = 0.

D. Experimental Results

1) Impacts of Parameters: In this section, we investigate the impacts of parameters λ and f for the different tasks of video semantic recognition. In Figure 3 - Figure 6, we show the performance of video semantic recognition by algorithm S2FS2R on the TRECVID, Kodak, KTH, and UCF YouTube datasets, respectively. From the figures we observe that, although parameters λ and f affect the performance of video semantic recognition, the impacts differ across the video semantic recognition tasks and datasets. Firstly, the performance of video concept detection on TRECVID varies little when λ and f are set to different values, whereas the performance of video semantic recognition on Kodak, KTH, and UCF YouTube has a larger variance than that on the TRECVID dataset. From these results we can see that the local features used for Kodak, KTH, and UCF YouTube are more sensitive to parameters λ and f than the global visual features used to represent key frames in TRECVID. In particular, the performance of action recognition is very sensitive to the number f of selected features. Secondly, we can observe in some cases (e.g., TRECVID (5%,5) and f = 100, 200) that the performance of video semantic recognition decreases when increasing f. A possible reason could be that, when f is set to f = 200, more noisy features are selected than for f = 100. Thirdly, for each of the four datasets we can observe that the best performance of video semantic recognition is obtained by S2FS2R when f is set to the larger values of the tuning ranges, e.g., 400 or 500 out of 546 for TRECVID and 600 or 800 out of 1,000 for Kodak. This demonstrates that, for the video features used in this experiment, most of the dimensions contribute to video semantic recognition, given that the number of noisy features is small. But in some cases (e.g., when f is set to small values), more noisy features may be selected when f is larger. In this experiment, we choose the best performance when λ and f are set to different values. Moreover, as we will report in the following results, the performance of S2FS2R is better than using all the features. It is clear that S2FS2R can select the most discriminative subset of features for video semantic recognition.

2) Video Semantic Recognition Results: In this section, we first investigate the performance of S2FS2R compared with the state-of-the-art methods for the different tasks of video semantic recognition: video concept detection for TRECVID videos, consumer video classification for Kodak videos, and human action recognition for videos in KTH and UCF YouTube. In order to show the impact of different ratios of labeled training videos on the semi-supervised methods, we report results when the ratios of labeled training videos are set to 50% and 5%. As shown in Table II and Table III, the results in the left four columns are obtained using the kNN (k = 10) classifier, whereas "χ2-SVM" denotes that we also report the results using the χ2-SVM classifier for KTH and YouTube. From the results we can observe: (1) The proposed framework of semi-supervised feature selection via spline regression outperforms the state-of-the-art methods for the different settings of the ratio of labeled training videos. (2) When there are more labeled training videos (see Table II), S2FS2R with a bigger local clique N_i has better performance than that with a smaller local clique for the spline regression (except for the YouTube dataset). Despite a small variance in performance between N_i = 5 and 10, algorithm S2FS2R outperforms all the compared methods.


TABLE II
COMPARISON RESULTS OF VIDEO SEMANTIC RECOGNITION ON DIFFERENT VIDEO DATASETS. FULL FEATURE DENOTES THE BASELINE OF CLASSIFICATION WITH FULL FEATURES. PCA DENOTES THE BASELINE OF CLASSIFICATION WITH PCA. FOR THE SEMI-SUPERVISED S2FS2R AND FSSA, THE RATIO OF LABELED TRAINING VIDEO IS 50%. IN THE FIRST FOUR COLUMNS, WE REPORT THE RESULTS USING THE kNN (k = 10) CLASSIFIER. FOR KTH AND YOUTUBE, WE ALSO REPORT THE RESULTS USING THE χ2-SVM CLASSIFIER. THE NUMBER IN [] DENOTES THE REFERENCE INDEX.

Methods                 TRECVID (kNN)  Kodak (kNN)  KTH (kNN)  YouTube (kNN)  KTH (χ2-SVM)  YouTube (χ2-SVM)
S2FS2R(5)               0.5821         0.4047       0.6252     0.2982         0.8940        0.6540
S2FS2R(10)              0.5874         0.4301       0.6714     0.2894         0.8994        0.6485
S2FS2R(without local)   0.5511         0.3107       0.5569     0.2660         0.8910        0.6279
Full Feature            0.5646         0.3107       0.5611     0.2376         0.8858        0.6459
PCA                     0.5789         0.3556       0.5923     0.2817         0.1592        0.0926
FScore [16]             0.5561         0.3224       0.6080     0.2824         0.8922        0.6314
FSSA [14]               0.5330         0.3506       0.6130     0.2567         0.8876        0.6261
FSNM [17]               0.5571         0.3203       0.5765     0.2693         0.8784        0.6109
SBMLR [33]              0.4845         0.2075       0.6115     0.2562         0.8768        0.4899
FS-Manifold [22]        0.5633         0.3487       0.6133     0.2601         0.8799        0.6455

TABLE III
COMPARISON RESULTS OF VIDEO SEMANTIC RECOGNITION ON DIFFERENT VIDEO DATASETS. FULL FEATURE DENOTES THE BASELINE OF CLASSIFICATION WITH FULL FEATURES. PCA DENOTES THE BASELINE OF CLASSIFICATION WITH PCA. FOR THE SEMI-SUPERVISED S2FS2R AND FSSA, THE RATIO OF LABELED TRAINING VIDEOS IS 5%. IN THE FIRST FOUR COLUMNS, WE REPORT THE RESULTS USING THE kNN (k = 10) CLASSIFIER. FOR KTH AND YOUTUBE, WE ALSO REPORT THE RESULTS USING THE χ2-SVM CLASSIFIER. THE NUMBER IN [] DENOTES THE REFERENCE INDEX.

Methods                  TRECVID(kNN)  Kodak(kNN)  KTH(kNN)  YouTube(kNN)  KTH(χ2-SVM)  YouTube(χ2-SVM)
S2FS2R(5)                0.4961        0.1406      0.1981    0.0824        0.6965       0.2881
S2FS2R(10)               0.4857        0.1093      0.2080    0.1275        0.6419       0.2530
S2FS2R(without local)    0.4744        0.0578      0.0758    0.0228        0.5803       0.2482
Full Feature             0.4716        0.0326      0.0585    0.0298        0.6248       0.2406
PCA                      0.4761        0.1071      0.0867    0.0781        0.1345       0.0755
FScore [16]              0.4778        0.0611      0.0917    0.0686        0.6021       0.2020
FSSA [14]                0.4701        0.0375      0.0192    0.0345        0.6261       0.2208
FSNM [17]                0.4781        0.0712      0.0241    0.0373        0.6454       0.2403
SBMLR [33]               0.4493        0.0000      0.0249    0.0000        0.1891       0.0000
FS-Manifold [22]         0.4721        0.1057      0.0604    0.0418        0.6233       0.2419

[Figure 7: four line plots, (a) TRECVID, (b) Kodak, (c) KTH, (d) YouTube, with the ratio of labeled training data (1%, 5%, 10%, 25%, 50%, 100%) on the x-axis and microF1 on the y-axis, comparing S2FS2R(5), S2FS2R(10), Full Feature, PCA, Fisher Score, FSSA, FSNM, SBMLR, and FS-Manifold.]

Fig. 7. Performance comparison of S2FS2R with the baselines and the state-of-the-art methods on the TRECVID, Kodak, KTH, and YouTube datasets. The microF1 scores are plotted when the ratios of labeled training data are set to 100%, 50%, 25%, 10%, 5%, and 1%. The results of PCA are obtained using the kNN (k = 10) classifier.

[Figure 8: four line plots, (a) TRECVID, (b) Kodak, (c) KTH, (d) YouTube, with the ratio of labeled training data (1%, 5%, 10%, 25%, 50%, 100%) on the x-axis and microF1 on the y-axis, comparing S2FS2R(5), S2FS2R(without local), and Full Feature.]

Fig. 8. Performance comparison of S2FS2R with S2FS2R(without local) and performing classification on the full features for the TRECVID, Kodak, KTH, and YouTube datasets. The microF1 scores are plotted when the ratios of labeled training data are set to 100%, 50%, 25%, 10%, 5%, and 1%.


(3) Comparing the results of S2FS2R(5) and S2FS2R(10) with those of Full Feature and PCA, S2FS2R achieves better performance than using the full feature set or conducting dimensionality reduction with PCA. (4) The performance of conducting χ2-SVM after performing PCA is poor for KTH and YouTube. As introduced in Section IV-A, we extract BoW histograms of STIP for KTH and YouTube; since the χ2 kernel is defined for non-negative histogram features, which the PCA projections are not, "PCA + χ2-SVM" is not suitable for the BoW representation. As shown in Tables II and III, conducting kNN (k = 10) after performing PCA performs better.

3) Performance of Semi-Supervised Feature Selection: In order to investigate the performance of semi-supervised feature selection, we set the ratio of labeled training videos in the sampled training videos to different values in {50%, 25%, 10%, 5%, 1%}. Figure 7 shows the performance of video semantic recognition of the different methods when the ratio of labeled training videos is set to these values. From the results we observe the following: (1) As the number of labeled training samples increases, the performance obviously increases. (2) Compared to the supervised feature selection methods, S2FS2R(5) and S2FS2R(10) achieve competitive or better performance than Fisher Score, FSNM, and SBMLR, thanks to the preservation of the local geometry structure of un-labeled videos via spline regression. (3) S2FS2R(5) and S2FS2R(10) outperform the semi-supervised FSSA at all ratios of labeled training videos for TRECVID, Kodak, KTH, and YouTube. (4) When the ratio of labeled training videos is very low, e.g., 1%, S2FS2R outperforms all the compared methods, which demonstrates its advantage as a semi-supervised feature selection method.
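The labeled-ratio protocol behind Figure 7 amounts to masking the labels of part of the training set. Below is a minimal sketch; the class-agnostic random split is an illustrative assumption, since the actual sampling in the experiments may be stratified per class.

```python
# Minimal sketch of the labeled-ratio protocol: a fraction 'ratio' of the
# training videos keeps its labels, the rest are treated as unlabeled by the
# semi-supervised selector.
import numpy as np

def split_labeled(n_train, ratio, seed=0):
    rng = np.random.default_rng(seed)
    n_labeled = max(1, int(round(ratio * n_train)))
    labeled = rng.choice(n_train, size=n_labeled, replace=False)
    unlabeled = np.setdiff1d(np.arange(n_train), labeled)
    return labeled, unlabeled

# The operating points plotted in Figure 7 (illustrative n_train = 1000).
for ratio in (0.01, 0.05, 0.10, 0.25, 0.50, 1.00):
    labeled_idx, unlabeled_idx = split_labeled(n_train=1000, ratio=ratio)
    # Feature selection then uses all training videos, but only labeled_idx
    # contributes label information; micro-F1 is reported on the test set.
    print(f"ratio={ratio:.0%}: {len(labeled_idx)} labeled / {len(unlabeled_idx)} unlabeled")
```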

Figure 8 compares S2FS2R(5) with S2FS2R(without local) and with using the full features on the TRECVID, Kodak, KTH, and YouTube datasets. As introduced at the end of Section IV-C3, information about the local geometry of the training videos is not incorporated into S2FS2R(without local), which can therefore be taken as a supervised version of S2FS2R. From the results we observe that, without the local information, the performance of S2FS2R(without local) is not better than that of using the full feature set. Owing to the preservation of the local geometry of the unlabeled data, S2FS2R(5) outperforms S2FS2R(without local) and the full feature set on all four datasets, which further demonstrates the strength of the semi-supervised feature selection of S2FS2R.
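As a rough illustration of this ablation, the sketch below builds a within-class scatter matrix from the labeled videos and optionally adds a local (spline) scatter term. The additive combination, the weight gamma, and the name spline_scatter are illustrative assumptions only; the actual construction of the spline scatter and the way the two scatters are combined follow the earlier sections of the paper, not this sketch.

```python
# Rough ablation sketch: the full model uses a scatter built from labels plus
# a local (spline) term, while the "without local" variant drops the latter.
# The additive combination and 'gamma' are illustrative assumptions.
import numpy as np

def within_class_scatter(X_labeled, y_labeled):
    """Standard within-class scatter: sum_c sum_{x in c} (x - mu_c)(x - mu_c)^T."""
    d = X_labeled.shape[1]
    Sw = np.zeros((d, d))
    for c in np.unique(y_labeled):
        Xc = X_labeled[y_labeled == c]
        diff = Xc - Xc.mean(axis=0)
        Sw += diff.T @ diff
    return Sw

def combined_scatter(X_labeled, y_labeled, spline_scatter=None, gamma=1.0):
    Sw = within_class_scatter(X_labeled, y_labeled)
    if spline_scatter is None:           # the "without local" ablation
        return Sw
    return Sw + gamma * spline_scatter   # semi-supervised combination (illustrative)
```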

V. CONCLUSION

This paper proposed a framework of video semantic recognition by Semi-Supervised Feature Selection via Spline Regression (S2FS2R). In this framework, the discriminative information of labeled training videos and the local geometry structure of all the training videos are preserved by the combined semi-supervised scatters: a within-class scatter matrix to encode label information and a spline scatter matrix to encode the data distribution by spline regression. An ℓ2,1-norm is imposed as a regularization term on the transformation matrix to control its capacity and to ensure it is sparse in rows, making it particularly suitable for feature selection. Three typical tasks of video semantic recognition, i.e., video concept detection in news videos, classification of consumer videos, and human action recognition, were used in our experiments to investigate the performance of S2FS2R. To efficiently solve S2FS2R, we proposed an iterative algorithm and proved its convergence. Experimental results show that the proposed S2FS2R achieves better feature selection performance than the state-of-the-art methods. S2FS2R can also be extended to incorporate new neighborhood information into the feature selection process by defining new scatter matrices.

APPENDIX A
PROOF OF THEOREM 1

Proof: According to the definition of W_{(t)} in step 13 of Algorithm 1, we can see that

W_{(t)} = \arg\min_{W^{T}W=I} \operatorname{Tr}\left( W^{T}(M + \lambda D_{(t)}) W \right).    (13)

That is to say, for any matrix A such that A^{T}A = I, we have \operatorname{Tr}\left( W_{(t)}^{T}(M + \lambda D_{(t)}) W_{(t)} \right) \le \operatorname{Tr}\left( A^{T}(M + \lambda D_{(t)}) A \right). Therefore,

\operatorname{Tr}\left( W_{(t)}^{T}(M + \lambda D_{(t)}) W_{(t)} \right) \le \operatorname{Tr}\left( W_{(t-1)}^{T}(M + \lambda D_{(t)}) W_{(t-1)} \right)
\Rightarrow \operatorname{Tr}\left( W_{(t)}^{T} M W_{(t)} \right) + \lambda \sum_{i} \frac{\|w_{(t)}^{i}\|_{2}^{2}}{2\|w_{(t-1)}^{i}\|_{2}} \le \operatorname{Tr}\left( W_{(t-1)}^{T} M W_{(t-1)} \right) + \lambda \sum_{i} \frac{\|w_{(t-1)}^{i}\|_{2}^{2}}{2\|w_{(t-1)}^{i}\|_{2}}.    (14)

Then we have the following inequality:

\operatorname{Tr}\left( W_{(t)}^{T} M W_{(t)} \right) + \lambda \sum_{i} \|w_{(t)}^{i}\|_{2} - \lambda \left( \sum_{i} \|w_{(t)}^{i}\|_{2} - \sum_{i} \frac{\|w_{(t)}^{i}\|_{2}^{2}}{2\|w_{(t-1)}^{i}\|_{2}} \right)
\le \operatorname{Tr}\left( W_{(t-1)}^{T} M W_{(t-1)} \right) + \lambda \sum_{i} \|w_{(t-1)}^{i}\|_{2} - \lambda \left( \sum_{i} \|w_{(t-1)}^{i}\|_{2} - \sum_{i} \frac{\|w_{(t-1)}^{i}\|_{2}^{2}}{2\|w_{(t-1)}^{i}\|_{2}} \right).    (15)

According to Lemma 1 in [17], we have

\sum_{i} \|w_{(t)}^{i}\|_{2} - \sum_{i} \frac{\|w_{(t)}^{i}\|_{2}^{2}}{2\|w_{(t-1)}^{i}\|_{2}} \le \sum_{i} \|w_{(t-1)}^{i}\|_{2} - \sum_{i} \frac{\|w_{(t-1)}^{i}\|_{2}^{2}}{2\|w_{(t-1)}^{i}\|_{2}}.    (16)

Therefore, we have the following inequality:

\operatorname{Tr}\left( W_{(t)}^{T} M W_{(t)} \right) + \lambda \sum_{i} \|w_{(t)}^{i}\|_{2} \le \operatorname{Tr}\left( W_{(t-1)}^{T} M W_{(t-1)} \right) + \lambda \sum_{i} \|w_{(t-1)}^{i}\|_{2},    (17)

which indicates that the objective function value of \operatorname{Tr}(W^{T}MW) + \lambda \sum_{i=1}^{d} \|w^{i}\|_{2}, s.t. W^{T}W = I, monotonically decreases until convergence using the updating rule in Algorithm 1.
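To make the monotonicity argument tangible, the following numerical sketch implements the analysed update for a generic symmetric M: rebuild the diagonal matrix D(t) from the row norms of W(t-1), then re-solve the orthonormal trace minimization via an eigendecomposition (whose solution is given by the eigenvectors of M + λD(t) associated with the c smallest eigenvalues). The printed objective Tr(W^T M W) + λ Σ_i ||w^i||_2 should decrease monotonically, as Theorem 1 states. This is an illustrative stand-in, not the paper's implementation of Algorithm 1.

```python
# Numerical illustration of the update analysed in the proof: alternate
# between D(t) = diag(1 / (2 ||w^i_(t-1)||_2)) and the eigenvector solution
# of min_{W^T W = I} Tr(W^T (M + lam * D) W); the regularized objective
# should decrease monotonically.
import numpy as np

def iterate_w(M, lam=0.1, c=5, n_iter=15, eps=1e-12):
    d = M.shape[0]
    rng = np.random.default_rng(0)
    W, _ = np.linalg.qr(rng.standard_normal((d, c)))       # orthonormal initialization
    for t in range(n_iter):
        row_norms = np.maximum(np.linalg.norm(W, axis=1), eps)
        D = np.diag(1.0 / (2.0 * row_norms))                # D(t) built from W(t-1)
        eigvals, eigvecs = np.linalg.eigh(M + lam * D)      # eigenvalues in ascending order
        W = eigvecs[:, :c]                                  # c smallest eigenvalues
        objective = np.trace(W.T @ M @ W) + lam * np.linalg.norm(W, axis=1).sum()
        print(f"iteration {t:2d}: objective = {objective:.6f}")
    return W

# Example with a random symmetric positive semi-definite M.
A = np.random.default_rng(1).standard_normal((50, 50))
iterate_w(A @ A.T / 50.0)
```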


REFERENCES

[1] R. Ewerth and B. Freisleben, “Semi-supervised learning for semantic video retrieval,” in Proceedings of the 6th ACM International Conference on Image and Video Retrieval. ACM, 2007, pp. 154–161.
[2] S. Dagtas, W. Al-Khatib, A. Ghafoor, and R. L. Kashyap, “Models for motion-based video indexing and retrieval,” IEEE Transactions on Image Processing, vol. 9, no. 1, pp. 88–101, 2000.
[3] M. Chen, A. Hauptmann, A. Bharucha, H. Wactlar, and Y. Yang, “Human activity analysis for geriatric care in nursing home,” in Proceedings of the 2011 Pacific-Rim Conference on Multimedia, 2011.
[4] X. Zhen, L. Shao, D. Tao, and X. Li, “Embedding motion and structure features for action recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 7, pp. 1182–1190, 2013.
[5] L. Maddalena and A. Petrosino, “Stopped object detection by learning foreground model in videos,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 5, pp. 723–735, 2013.
[6] R. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1631–1643, 2005.
[7] I. Laptev, “On space-time interest points,” International Journal of Computer Vision, vol. 64, no. 2, pp. 107–123, 2005.
[8] M. Chen and A. Hauptmann, “MoSIFT: Recognizing human actions in surveillance videos,” CMU-CS-09-161, Carnegie Mellon University, 2009.
[9] F. Korn, B. Pagel, and C. Faloutsos, “On the dimensionality curse and the self-similarity blessing,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, pp. 96–111, 2001.
[10] H. Abdi and L. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
[11] J. Tenenbaum, V. De Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[12] P. Padungweang, C. Lursinsap, and K. Sunat, “A discrimination analysis for unsupervised feature selection via optic diffraction principle,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 10, pp. 1587–1600, 2012.
[13] Y. Yang, H. Shen, Z. Ma, Z. Huang, and X. Zhou, “ℓ2,1-norm regularized discriminative feature selection for unsupervised learning,” in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI-11), 2011, pp. 1589–1594.
[14] Z. Zhao and H. Liu, “Semi-supervised feature selection via spectral analysis,” in Proceedings of the 7th SIAM International Conference on Data Mining, Minneapolis, MN, 2007, pp. 1151–1158.
[15] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, “Discriminative least squares regression for multiclass classification and feature selection,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, 2012.
[16] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd edition. New York, USA: John Wiley & Sons, 2001.
[17] F. Nie, H. Huang, X. Cai, and C. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” Advances in Neural Information Processing Systems, vol. 23, pp. 1813–1821, 2010.
[18] Z. Zhao, L. Wang, and H. Liu, “Efficient spectral feature selection with minimum redundancy,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010.
[19] Z. Zhao and H. Liu, “Spectral feature selection for supervised and unsupervised learning,” in Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 1151–1157.
[20] X. Zhu, “Semi-supervised learning literature survey,” Computer Science, University of Wisconsin-Madison, Madison, WI, Technical Report, vol. 1530, 2007.
[21] Y. Wang, S. Chen, and Z.-H. Zhou, “New semi-supervised classification method based on modified cluster assumption,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 5, pp. 689–702, 2012.
[22] Z. Xu, R. Jin, M.-T. Lyu, and I. King, “Discriminative semi-supervised feature selection via manifold regularization,” in 2009 International Joint Conferences on Artificial Intelligence, 2009, pp. 1303–1308.
[23] X. Kong and P. S. Yu, “Semi-supervised feature selection for graph classification,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010, pp. 793–802.
[24] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[25] M. Wu and B. Scholkopf, “Transductive classification via local learning regularization,” in Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007, pp. 624–631.
[26] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.
[27] S. Xiang, F. Nie, C. Zhang, and C. Zhang, “Nonlinear dimensionality reduction with local spline embedding,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1285–1298, 2009.
[28] R. Adams, Sobolev Spaces. Academic Press, 1975.
[29] F. Bookstein, “Principal warps: Thin-plate splines and the decomposition of deformations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, 1989.
[30] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[31] J. Yang, Y. Jiang, A. Hauptmann, and C. Ngo, “Evaluating bag-of-visual-words representations in scene classification,” in Proceedings of the International Workshop on Multimedia Information Retrieval. ACM, 2007, pp. 197–206.
[32] Y. Ke, R. Sukthankar, and M. Hebert, “Volumetric features for video event detection,” International Journal of Computer Vision, vol. 88, no. 3, pp. 339–362, 2010.
[33] G. Cawley, N. Talbot, and M. Girolami, “Sparse multinomial logistic regression via Bayesian L1 regularisation,” Advances in Neural Information Processing Systems, vol. 19, p. 209, 2007.
[34] C. Hou, F. Nie, D. Yi, and Y. Wu, “Feature selection via joint embedding learning and sparse regression,” in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. AAAI Press, 2011, pp. 1324–1329.
[35] Q. Gu, Z. Li, and J. Han, “Joint feature selection and subspace learning,” in Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two. AAAI Press, 2011, pp. 1294–1299.
[36] S. Nilufar, N. Ray, and H. Zhang, “Object detection with DoG scale-space: A multiple kernel learning approach,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3744–3756, 2012.
[37] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40–51, 2007.
[38] Y. Yang, F. Wu, D. Xu, Y. Zhuang, and L.-T. Chia, “Cross-media retrieval using query dependent search methods,” Pattern Recognition, vol. 43, no. 8, pp. 2927–2936, 2010.
[39] J. Duchon, “Splines minimizing rotation-invariant semi-norms in Sobolev spaces,” Constructive Theory of Functions of Several Variables, pp. 85–100, 1977.
[40] J. W. Demmel, Applied Numerical Linear Algebra. SIAM, 1997.
[41] A. Loui, J. Luo, S. Chang, D. Ellis, W. Jiang, L. Kennedy, K. Lee, and A. Yanagawa, “Kodak's consumer video benchmark data set: concept definition and annotation,” in Proceedings of the International Workshop on Multimedia Information Retrieval. ACM, 2007, pp. 245–254.
[42] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in Proceedings of the 17th International Conference on Pattern Recognition, vol. 3. IEEE, 2004, pp. 32–36.
[43] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1996–2003.
[44] A. Yanagawa, S. Chang, L. Kennedy, and W. Hsu, “Columbia University's baseline detectors for 374 LSCOM semantic visual concepts,” Columbia University ADVENT Technical Report, 2007.
[45] D. Lewis, “Evaluating text categorization,” in Proceedings of the Speech and Natural Language Workshop, 1991, pp. 312–318.
[46] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[47] S. Wang, Y. Yang, Z. Ma, X. Li, C. Pang, and A. Hauptmann, “Action recognition by exploring data distribution and feature correlation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1370–1377.

