ORIGINAL ARTICLE
Semi-supervised classification with privileged information
Zhiquan Qi1 • Yingjie Tian1 • Lingfeng Niu1 • Bo Wang1
Received: 25 December 2014 / Accepted: 12 June 2015 / Published online: 30 June 2015
© Springer-Verlag Berlin Heidelberg 2015
Abstract Privileged information, which is available only for the training examples and not for the test examples, is a concept proposed by Vapnik and Vashist (Neural Netw 22(5–6):544–557, 2009). With the help of privileged information, learning using privileged information (LUPI) (Neural Netw 22(5–6):544–557, 2009) can significantly accelerate the speed of learning. However, LUPI is a standard supervised learning method, while in many real-world problems a large amount of unlabeled data is also available. This drives us to solve such problems under a semi-supervised learning framework. In this paper, we propose semi-supervised learning using privileged information (Semi-LUPI), which can exploit both the distribution information in unlabeled data and the privileged information to improve the efficiency of learning. Furthermore, we also compare the relative importance of both types of information for the learning model. All experiments verify the effectiveness of the proposed method and show that Semi-LUPI can obtain superior performance over traditional supervised and semi-supervised methods.
Keywords Classification · Support vector machine · Privileged information
1 Introduction
Consider the classical learning model with training data [2]
$$\{(x_1, y_1), \dots, (x_l, y_l)\}, \quad x_i \in X \subseteq \mathbb{R}^n,\ y_i \in Y = \{-1, 1\}, \qquad (1)$$
where $x_i$ denotes the $i$th training input and $y_i$ is its class label. The learner's aim is to select, from a given collection of functions $f(x, \alpha),\ \alpha \in \Lambda$, a classifier that minimizes the number of misclassified points [3].
However, in the human learning process, teachers play an important role. They teach students through all kinds of information, including comments, comparisons, explanations, logical, emotional or metaphorical reasoning, and so on. Likewise, during the machine learning process, a teacher may describe training examples with such additional information. Vapnik et al. [1, 4–6] called this kind of additional information privileged information, which is available only at the training stage and never available for test samples, and then proposed a new learning model, learning using privileged information (LUPI), which has been proven through statistical learning theory to significantly increase the speed of learning [1, 4, 5].
Recently, semi-supervised learning has attracted increasing interest [7–11]. One important reason is that in many practical problems labeled examples are rare while large amounts of unlabeled examples are available. Graph-based methods are a very important branch of this field: the nodes of the graph are the labeled and unlabeled points, and weighted edges reflect the similarities between nodes. The initial assumption of these methods is that all points are located on a low-dimensional
Correspondence: Yingjie Tian. Authors: Zhiquan Qi, Yingjie Tian, Lingfeng Niu, Bo Wang
1 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
Int. J. Mach. Learn. & Cyber. (2015) 6:667–676. DOI 10.1007/s13042-015-0390-1
manifold, and the graph is used as an approximation of the underlying manifold. Neighboring pairs of points connected by large-weight edges tend to have the same labels, and vice versa. By this means, the labels associated with the data can be propagated throughout the graph. Using the graph Laplacian, [12] proposed the Laplacian support vector machine (Lap-SVM). Unlike other graph-based methods [13–15], Lap-SVM admits a natural out-of-sample extension: it can classify data that become available after the training process without retraining the classifier or resorting to various heuristics [12].
In this paper, we propose a novel semi-supervised learning method using privileged information (Semi-LUPI), which can effectively exploit labeled data, unlabeled data and privileged information to improve the performance of the classifier, and which is a useful extension of LUPI. Moreover, Semi-LUPI can be solved efficiently as a standard quadratic programming problem.
The remainder of the paper is organized as follows. Section 2 briefly introduces the background of LUPI; Sect. 3 describes the proposed method, Semi-LUPI; Sect. 4 gives various extensions of Semi-LUPI; all experimental results are shown in Sect. 5; the last section concludes.
2 Background
First, we give the mathematical formulation of the privileged classification problem [1, 16].
Privileged classification problem [1]: Given a training set
$$T = \{(x_1, x_1^*, y_1), \dots, (x_l, x_l^*, y_l)\}, \quad x_i \in \mathbb{R}^n,\ x_i^* \in \mathbb{R}^m,\ y_i \in \{-1, 1\},\ i = 1, \dots, l, \qquad (2)$$
where $x_i$ denotes the $i$th training input, $x_i^*$ denotes the additional information about the $i$th training input, and $y_i$ is its class label. The goal is to find a real-valued function $g(x)$ on $\mathbb{R}^n$ such that the value of $y$ for any $x$ can be predicted by the decision function
$$f(x) = \operatorname{sgn}(g(x)). \qquad (3)$$
Since the additional information $x_i^* \in X^*$ is included in the training input $(x_i, x_i^*)$ but not in any test input $x$, Vapnik et al. [1] call it privileged information.
In order to explain the basic idea of LUPI, we first introduce the definition of the oracle function.
Definition 1 (Oracle function [1]) Given a traditional classification problem with the training set
$$T = \{(x_1, y_1), \dots, (x_l, y_l)\}. \qquad (4)$$
Suppose there exists the best but unknown linear hyperplane
$$(w_0 \cdot x) + b_0 = 0. \qquad (5)$$
The oracle function $\xi(x)$ of the input $x$ is defined as
$$\xi^0 = \xi(x) = \big[1 - y\big((w_0 \cdot x) + b_0\big)\big]_+, \qquad (6)$$
where
$$[\eta]_+ = \begin{cases} \eta, & \text{if } \eta > 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (7)$$
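In code, the oracle function of Definition 1 is a hinge residual with respect to the best (but in practice unknown) hyperplane. A minimal Python sketch, where the hyperplane $(w_0, b_0)$ below is a made-up stand-in:

```python
def oracle_slack(x, y, w0, b0):
    """Oracle function xi(x) = [1 - y((w0 . x) + b0)]_+ from Definition 1."""
    margin = y * (sum(wi * xi for wi, xi in zip(w0, x)) + b0)
    return max(0.0, 1.0 - margin)

# A point classified with functional margin >= 1 has zero slack; a point
# violating the margin gets a positive slack equal to its hinge loss.
w0, b0 = [1.0, -1.0], 0.0
print(oracle_slack([2.0, 0.0], +1, w0, b0))  # margin 2.0 -> slack 0.0
print(oracle_slack([0.5, 0.0], +1, w0, b0))  # margin 0.5 -> slack 0.5
```

Knowing these slacks for every training point is exactly the information a perfect teacher would supply; LUPI approximates them with the correcting function (8) below.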
If we knew the value of the oracle function on each training input $x_i$, that is, the triplets $(x_i, \xi_i^0, y_i)$ with $\xi_i^0 = \xi(x_i),\ i = 1, \dots, l$, we could accelerate the learning rate. In practice, however, a teacher does not know the values of these slacks. Instead, Vapnik et al. [1] use a so-called correcting function to approximate the oracle function. In the linear case,
$$\varphi(x^*) = (w^* \cdot x^*) + b^*. \qquad (8)$$
Replacing $\xi_i\ (i = 1, \dots, l)$ by $\varphi(x_i^*)$ in the primal problem of SVM, we get the following primal problem:
$$\min_{w, w^*, b, b^*}\ \frac{1}{2}\big(\|w\|^2 + \gamma \|w^*\|^2\big) + C \sum_{i=1}^{l} \big[(w^* \cdot x_i^*) + b^*\big],$$
$$\text{s.t.}\quad y_i\big[(w \cdot x_i) + b\big] \ge 1 - \big[(w^* \cdot x_i^*) + b^*\big], \quad (w^* \cdot x_i^*) + b^* \ge 0, \quad i = 1, \dots, l. \qquad (9)$$
The corresponding dual problem is as follows:
$$\max_{\alpha, \beta}\ \sum_{j=1}^{l} \alpha_j - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - \frac{1}{2\gamma} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i + \beta_i - C)(\alpha_j + \beta_j - C)(x_i^* \cdot x_j^*),$$
$$\text{s.t.}\quad \sum_{i=1}^{l} \alpha_i y_i = 0, \quad \sum_{i=1}^{l} (\alpha_i + \beta_i - C) = 0, \quad \alpha_i \ge 0,\ \beta_i \ge 0,\ i = 1, \dots, l. \qquad (10)$$
For the nonlinear case, we introduce two transformations $\Phi(x): \mathbb{R}^n \to H$ and $\Phi^*(x^*): \mathbb{R}^m \to H^*$, and the primal problem is constructed as follows:
$$\min_{w, w^*, b, b^*}\ \frac{1}{2}\big(\|w\|^2 + \gamma \|w^*\|^2\big) + C \sum_{i=1}^{l} \big[(w^* \cdot \Phi^*(x_i^*)) + b^*\big],$$
$$\text{s.t.}\quad y_i\big[(w \cdot \Phi(x_i)) + b\big] \ge 1 - \big[(w^* \cdot \Phi^*(x_i^*)) + b^*\big], \quad (w^* \cdot \Phi^*(x_i^*)) + b^* \ge 0, \quad i = 1, \dots, l. \qquad (11)$$
Similarly, we can give its dual programming:
$$\max_{\alpha, \beta}\ \sum_{j=1}^{l} \alpha_j - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \frac{1}{2\gamma} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i + \beta_i - C)(\alpha_j + \beta_j - C) K^*(x_i^*, x_j^*),$$
$$\text{s.t.}\quad \sum_{i=1}^{l} \alpha_i y_i = 0, \quad \sum_{i=1}^{l} (\alpha_i + \beta_i - C) = 0, \quad \alpha_i \ge 0,\ \beta_i \ge 0,\ i = 1, \dots, l. \qquad (12)$$
3 Semi-LUPI
In this section, we elaborate on our proposed method, Semi-LUPI.
First, we are given the set of labeled data (1) and a set of unlabeled data
$$(x_{l+1}, \dots, x_{l+u}), \qquad (13)$$
where $x_{l+i} \in \mathbb{R}^n,\ i = 1, \dots, u$. Suppose the labeled data are generated according to a distribution $P$ on $X \times \mathbb{R}$, whereas the unlabeled examples are drawn according to the marginal distribution $P_X$ of $P$; labels can be obtained from the conditional probability distribution $P(y|x)$. According to [12], the semi-supervised learning framework can be expressed as
$$\min_{f \in H_K}\ \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_H \|f\|_H^2 + \gamma_M \|f\|_M^2, \qquad (14)$$
where $H_K$ is a reproducing kernel Hilbert space (RKHS), $f$ is a classifier defined on a manifold $M$, and $V$ is some loss function on the labeled data. The weight $\gamma_H$ of $\|f\|_H^2$ controls the complexity of $f$ in the RKHS, while the weight $\gamma_M$ of $\|f\|_M^2$ controls the complexity of $f$ in the intrinsic geometry of the marginal distribution; $\|f\|_M^2$ penalizes $f$ along the Riemannian manifold $M$.
Now, our goal is to use the labeled data with privileged information together with the unlabeled data to infer labels. By the Representer Theorem, the weight vector $w$ can be expressed as $w = \sum_{i=1}^{l+u} \alpha_i \Phi(x_i)$, and $K$ denotes the kernel matrix formed by the kernel functions $K(x_i, x_j) = (\Phi(x_i) \cdot \Phi(x_j))$. So the regularization term $\|f\|_H^2$ can be rewritten as
$$\|f\|_H^2 = \alpha^\top K \alpha. \qquad (15)$$
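With an RBF kernel (the kernel family used in the experiments of Sect. 5), the identity (15) can be checked directly in a few lines; the inputs and expansion coefficients below are illustrative toys:

```python
import math

def rbf(x, z, sigma=1.0):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

X = [[0.0], [1.0], [2.0]]                       # labeled + unlabeled inputs
K = [[rbf(xi, xj) for xj in X] for xi in X]     # (l+u) x (l+u) kernel matrix
alpha = [0.5, -1.0, 0.5]                        # expansion coefficients of f

# ||f||_H^2 = alpha^T K alpha, the RKHS regularizer of (15)
norm_sq = sum(alpha[i] * K[i][j] * alpha[j]
              for i in range(len(X)) for j in range(len(X)))
print(norm_sq > 0.0)  # the Gram matrix of a PD kernel gives a nonnegative norm
```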
Similarly, the correcting function (8) can be rewritten as
$$\varphi(x^*) = \sum_{j=1}^{l} \alpha_j^* K^*(x_j^*, x^*) + b^*, \qquad (16)$$
where $\alpha^* = (\alpha_1^*, \dots, \alpha_l^*)^\top$ and $K^* = (\Phi^*(x_i^*) \cdot \Phi^*(x_j^*))_{l \times l}$. Replacing $\|f\|_H^2$ by (15) and modeling the slacks by the correcting function (16), the formulation of Semi-LUPI can be expressed as
$$\min_{\alpha, \alpha^*, b, b^*}\ \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \gamma_3 \|f\|_M^2,$$
$$\text{s.t.}\quad y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \Big[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\Big],$$
$$\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = 1, \dots, l. \qquad (17)$$
An important premise of this kind of approach is the assumption that the marginal distribution of the data has the geometric structure of a Riemannian manifold $M$: the labels of two points that are close in the intrinsic geometry of $P_X$ should be the same or similar. Following [12], the intrinsic regularizer $\|f\|_M^2$ describes this constraint,
$$\|f\|_M^2 = \frac{1}{(l+u)^2} \cdot \frac{1}{2} \sum_{i,j=1}^{l+u} W_{ij}\big(f(x_i) - f(x_j)\big)^2 = \frac{1}{(l+u)^2}\, f^\top L f, \qquad (18)$$
where $L$ is the graph Laplacian. In practice, a data adjacency graph over the $l+u$ samples is built, with edge weights $W_{ij}$ representing the similarity of each pair of input samples. The $(l+u) \times (l+u)$ weight matrix $W$ may be defined by $k$ nearest neighbors with Gaussian weighting [12]:
$$W_{ij} = \begin{cases} \exp\big(-\|x_i - x_j\|_2^2 / 2\sigma^2\big), & \text{if } x_i, x_j \text{ are neighbors}, \\ 0, & \text{otherwise}, \end{cases} \qquad (19)$$
where $\|x_i - x_j\|_2$ denotes the Euclidean norm in $\mathbb{R}^n$. Here $L = D - W$, where $D$ is the diagonal matrix with $i$th diagonal entry $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$, and $f = [f(x_1), \dots, f(x_{l+u})]^\top = K\alpha$.
When (18) is used as a penalty term in (17), it can be understood as follows: if neighbors $x_i, x_j$ have high similarity ($W_{ij}$ is large), a difference between $f(x_i)$ and $f(x_j)$ receives a large penalty; more intuitively, the smaller $|f(x_i) - f(x_j)|$ is, the smoother $f(x)$ is on the data adjacency graph. So (17) can be translated into the following optimization problem:
$$\min_{\alpha, \alpha^*, b, b^*}\ \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$$
$$\text{s.t.}\quad y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \Big[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\Big],$$
$$\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = 1, \dots, l. \qquad (20)$$
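The intrinsic regularizer entering (18)–(20) can be checked numerically: with $L = D - W$, the quadratic form $f^\top L f$ equals $\frac{1}{2}\sum_{i,j} W_{ij}(f_i - f_j)^2$. A NumPy sketch on a toy fully connected graph (Gaussian weights as in (19), without the k-nearest-neighbor truncation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                  # toy labeled + unlabeled points

# Weight matrix as in (19): Gaussian similarities, sigma = 1
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)
np.fill_diagonal(W, 0.0)

D = np.diag(W.sum(axis=1))                   # degree matrix
L = D - W                                    # graph Laplacian

f = rng.normal(size=6)                       # values f(x_i) at the nodes
quad = f @ L @ f
pairwise = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                     for i in range(6) for j in range(6))
print(np.isclose(quad, pairwise))            # -> True
```

A smooth f (small differences across heavy edges) makes this penalty small, which is exactly the manifold assumption the method relies on.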
The Lagrangian corresponding to problem (20) is given by
$$L(\Theta) = \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha$$
$$- \sum_{i=1}^{l} \beta_i \left( y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] - 1 + \Big[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\Big] \right) - \sum_{i=1}^{l} \eta_i \left( \sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \right), \qquad (21)$$
where $\Theta = \{\alpha, \alpha^*, b, b^*, \beta, \eta\}$, and $\beta = (\beta_1, \dots, \beta_l)^\top$, $\eta = (\eta_1, \dots, \eta_l)^\top$ are the Lagrange multipliers. The dual problem can be formulated as
$$\max_{\Theta}\ L(\Theta) \quad \text{s.t.}\ \nabla_{\alpha, \alpha^*, b, b^*} L(\Theta) = 0, \quad \beta, \eta \ge 0. \qquad (22)$$
From Eq. (22), we get
$$\nabla_\alpha L = \Big( 2\gamma_1 K + \frac{2\gamma_3}{(l+u)^2} K L K \Big) \alpha - K J^\top Y \beta = 0, \qquad (23)$$
$$\nabla_{\alpha^*} L = 2\gamma_2 K^* \alpha^* + \frac{1}{l} K^* e - K^* (\beta + \eta) = 0, \qquad (24)$$
$$\nabla_b L = \sum_{i=1}^{l} y_i \beta_i = 0, \qquad (25)$$
$$\nabla_{b^*} L = 1 - \sum_{i=1}^{l} \beta_i - \sum_{i=1}^{l} \eta_i = 0, \qquad (26)$$
where $J = [I\ 0]$ is an $l \times (l+u)$ matrix with $I$ the $l \times l$ identity matrix, and $Y = \operatorname{diag}(y_1, \dots, y_l)$ is the diagonal matrix of labels. Now, substituting (23)–(26) into the dual (22), we can obtain the Wolfe dual of problem (20) as follows:
$$\max_{\beta, \eta}\ \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^\top Q \beta - \frac{1}{4\gamma_2} \Big( \beta + \eta - \frac{1}{l} e \Big)^\top K^* \Big( \beta + \eta - \frac{1}{l} e \Big),$$
$$\text{s.t.}\quad \sum_{i=1}^{l} y_i \beta_i = 0, \quad 1 - \sum_{i=1}^{l} \beta_i - \sum_{i=1}^{l} \eta_i = 0, \quad \beta \ge 0,\ \eta \ge 0, \qquad (27)$$
where
$$Q = Y J K \Big( 2\gamma_1 I + \frac{2\gamma_3}{(l+u)^2} L K \Big)^{-1} J^\top Y. \qquad (28)$$
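The matrix $Q$ of (28) can be assembled with a few NumPy calls. For a positive definite kernel matrix $K$, $Q$ is symmetric positive semi-definite, which is what makes (27) a convex QP; the kernel matrix and Laplacian below are random toy stand-ins:

```python
import numpy as np

def assemble_Q(K, L, labels, l, u, gamma1, gamma3):
    """Q = Y J K (2 g1 I + (2 g3/(l+u)^2) L K)^{-1} J^T Y, as in (28)."""
    n = l + u
    J = np.hstack([np.eye(l), np.zeros((l, u))])   # J = [I 0], l x (l+u)
    Y = np.diag(labels)                            # Y = diag(y_1,...,y_l)
    M = 2 * gamma1 * np.eye(n) + (2 * gamma3 / n ** 2) * (L @ K)
    return Y @ J @ K @ np.linalg.solve(M, J.T @ Y)

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
K = A @ A.T + 5 * np.eye(5)                        # PD stand-in for a Gram matrix
W = np.abs(rng.normal(size=(5, 5)))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W                     # toy graph Laplacian

Q = assemble_Q(K, L, [1, -1, 1], l=3, u=2, gamma1=1.0, gamma3=1.0)
print(np.allclose(Q, Q.T, atol=1e-8))                    # symmetric
print(np.linalg.eigvalsh((Q + Q.T) / 2).min() >= -1e-8)  # PSD up to round-off
```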
From (27) it is easy to see that this is a standard convex quadratic programming problem, and we do not need to solve for the additional variables $\alpha^*$ and $b^*$. Finally, Semi-LUPI can be summarized as Algorithm 1.
4 Other extensions of Semi-LUPI
In this section, we give some extensions of Semi-LUPI.
4.1 Mixture model of slacks
Modeling the slacks by the values of some smooth function is not always the best choice [1]. Let us instead model the slacks by a mixture of the values of a smooth function $\varphi(x_i^*) = \sum_{j=1}^{l} \alpha_j^* K^*(x_j^*, x_i^*) + b^*$ and slack variables $\xi_i^*,\ i = 1, \dots, l$. The primal optimization problem (20) then becomes
Algorithm 1 Semi-LUPI
• Input the training set given by (1) and (13);
• Choose two appropriate kernels $K(\cdot,\cdot)$ and $K^*(\cdot,\cdot)$ and parameters $\gamma_1, \gamma_2, \gamma_3 > 0$;
• Construct and solve the convex quadratic programming problem (27), obtaining the solution $\beta^*, \eta^*$;
• Select a component index $j$ such that $\beta_j^* > 0,\ \eta_j^* > 0$, and compute $b = y_j - \sum_{i=1}^{l+u} \alpha_i^* y_i K(x_i, x_j)$, where $\alpha^* = \big( 2\gamma_1 I + \frac{2\gamma_3}{(l+u)^2} L K \big)^{-1} J^\top Y \beta^*$;
• Construct the decision function $f(x) = \operatorname{sgn}(g(x))$, where $g(x) = \sum_{i=1}^{l+u} y_i \alpha_i^* K(x_i, x) + b$.
$$\min_{\alpha, \alpha^*, b, b^*}\ \gamma_1 \alpha^\top K \alpha + \gamma_2 \alpha^{*\top} K^* \alpha^* + \frac{1}{l} e^\top K^* \alpha^* + b^* + \frac{\theta}{l} \sum_{i=1}^{l} \xi_i^* + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$$
$$\text{s.t.}\quad y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \Big[\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\Big] - \xi_i^*, \quad i = 1, \dots, l,$$
$$\sum_{j=1}^{l} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad \xi_i^* \ge 0, \quad i = 1, \dots, l, \qquad (29)$$
where $\theta > 0$ is a pre-specified penalty factor. The dual problem of this variant is almost the same as (27); the only difference is one extra constraint, $\beta \le \frac{\theta}{l} e$.
4.2 The case where only part of the samples possess privileged information

When only part of the samples come with corresponding privileged information, the given training data can be expressed as
$$(x_1, y_1), \dots, (x_n, y_n),\ (x_{n+1}, x_{n+1}^*, y_{n+1}), \dots, (x_l, x_l^*, y_l),\ x_{l+1}, \dots, x_{l+u}. \qquad (30)$$
In this situation, (20) can be rewritten as
$$\min_{\alpha, \alpha^*, b, b^*}\ \gamma_1 \alpha^\top K \alpha + \gamma_2 \sum_{i=n+1}^{l} \sum_{j=n+1}^{l} \alpha_{i-n}^* \alpha_{j-n}^* K^*(x_i^*, x_j^*) + \frac{1}{l-n} \sum_{i=n+1}^{l} \sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_i^*, x_j^*) + b^* + \frac{\theta}{n} \sum_{i=1}^{n} \xi_i + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$$
$$\text{s.t.}\quad y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n,$$
$$y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \Big[\sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_i^*, x_j^*) + b^*\Big], \quad i = n+1, \dots, l,$$
$$\sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = n+1, \dots, l. \qquad (31)$$
The corresponding correcting function becomes
$$\varphi(x^*) = \sum_{j=n+1}^{l} \alpha_{j-n}^* K^*(x_j^*, x^*) + b^*, \qquad (32)$$
and the dual problem for this case can be written as
$$\max_{\beta, \eta}\ \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^\top Q \beta - \frac{1}{4\gamma_2} \sum_{i=n+1}^{l} \sum_{j=n+1}^{l} \Big( \beta_i + \eta_i - \frac{1}{l-n} \Big) \Big( \beta_j + \eta_j - \frac{1}{l-n} \Big) K^*(x_i^*, x_j^*),$$
$$\text{s.t.}\quad \sum_{i=1}^{l} y_i \beta_i = 0, \quad 1 - \sum_{i=n+1}^{l} \beta_i - \sum_{i=n+1}^{l} \eta_i = 0,$$
$$0 \le \beta_i \le \frac{\theta}{n}, \quad i = 1, \dots, n, \qquad \beta_i \ge 0,\ \eta_i \ge 0, \quad i = n+1, \dots, l. \qquad (33)$$
4.3 Privileged information with different dimensions

Suppose the privileged information is described in different spaces. For simplicity, we consider only two spaces, $X^*$ and $X^{**}$. The given training data are
$$(x_1, x_1^*, y_1), \dots, (x_n, x_n^*, y_n),\ (x_{n+1}, x_{n+1}^{**}, y_{n+1}), \dots, (x_l, x_l^{**}, y_l),\ x_{l+1}, \dots, x_{l+u}. \qquad (34)$$
In this case, the primal problem may be expressed as
$$\min_{\alpha, \alpha^*, \alpha^{**}, b, b^*, b^{**}}\ \gamma_1 \alpha^\top K \alpha + \gamma_2 \Big( \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^* \alpha_j^* K^*(x_i^*, x_j^*) + \sum_{i=n+1}^{l} \sum_{j=n+1}^{l} \alpha_{i-n}^{**} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) \Big)$$
$$+ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_j^* K^*(x_i^*, x_j^*) + b^* + \frac{1}{l-n} \sum_{i=n+1}^{l} \sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) + b^{**} + \frac{\gamma_3}{(l+u)^2} \alpha^\top K L K \alpha,$$
$$\text{s.t.}\quad y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \Big[\sum_{j=1}^{n} \alpha_j^* K^*(x_i^*, x_j^*) + b^*\Big], \quad i = 1, \dots, n,$$
$$\sum_{j=1}^{n} \alpha_j^* K^*(x_i^*, x_j^*) + b^* \ge 0, \quad i = 1, \dots, n,$$
$$y_i \Big[\sum_{j=1}^{l+u} \alpha_j K(x_i, x_j) + b\Big] \ge 1 - \Big[\sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) + b^{**}\Big], \quad i = n+1, \dots, l,$$
$$\sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_i^{**}, x_j^{**}) + b^{**} \ge 0, \quad i = n+1, \dots, l. \qquad (35)$$
The corresponding correcting functions become
$$\varphi(x^*) = \sum_{j=1}^{n} \alpha_j^* K^*(x_j^*, x^*) + b^*, \qquad (36)$$
$$\varphi(x^{**}) = \sum_{j=n+1}^{l} \alpha_{j-n}^{**} K^{**}(x_j^{**}, x^{**}) + b^{**}, \qquad (37)$$
and the dual problem is
$$\max_{\beta, \eta}\ \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^\top Q \beta - \frac{1}{4\gamma_2} \sum_{i=1}^{n} \sum_{j=1}^{n} \Big( \beta_i + \eta_i - \frac{1}{n} \Big) \Big( \beta_j + \eta_j - \frac{1}{n} \Big) K^*(x_i^*, x_j^*)$$
$$- \frac{1}{4\gamma_2} \sum_{i=n+1}^{l} \sum_{j=n+1}^{l} \Big( \beta_i + \eta_i - \frac{1}{l-n} \Big) \Big( \beta_j + \eta_j - \frac{1}{l-n} \Big) K^{**}(x_i^{**}, x_j^{**}),$$
$$\text{s.t.}\quad \sum_{i=1}^{l} y_i \beta_i = 0, \quad 1 - \sum_{i=1}^{n} \beta_i - \sum_{i=1}^{n} \eta_i = 0, \quad 1 - \sum_{i=n+1}^{l} \beta_i - \sum_{i=n+1}^{l} \eta_i = 0,$$
$$\beta_i \ge 0,\ \eta_i \ge 0, \quad i = 1, \dots, l. \qquad (38)$$
5 Experiments
In this section, we compare the Semi-LUPI against LUPI
[1] and Lap-SVM [12] on time series prediction datasets
and MNIST datasets. For simplicity, we set c2 ¼ 1. c1; c3
and RBF kernel parameter r are all selected from the set
f2iji ¼ �7; . . .; 7g [16, 17].
All algorithms are implemented in MATLAB 2010. Experimental environment: Intel Core i7-2600 CPU, 4 GB memory. For comparison purposes, the "quadprog" function in MATLAB is employed to solve the quadratic programming problems related to this paper.
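For readers reproducing the experiments without MATLAB, the box constraints of these duals can be handled with a simple projected-gradient loop. The sketch below is a toy stand-in for a QP solver such as quadprog, under the assumption of box constraints only; it ignores the equality constraints of (27), which a real solver must also enforce:

```python
import numpy as np

def projected_gradient_qp(Q, c, ub, steps=2000, lr=0.1):
    """Minimize 0.5 x^T Q x - c^T x subject to 0 <= x <= ub by projected
    gradient descent (toy solver; no equality constraints)."""
    x = np.zeros_like(c, dtype=float)
    for _ in range(steps):
        x = x - lr * (Q @ x - c)          # gradient step
        x = np.clip(x, 0.0, ub)           # project back onto the box
    return x

# min 0.5 x_i^2 - x_i on each coordinate: the unconstrained optimum is 1.0,
# so the first coordinate is clipped to its bound 0.5.
Q = np.eye(2)
c = np.ones(2)
ub = np.array([0.5, 2.0])
print(projected_gradient_qp(Q, c, ub))    # approximately [0.5, 1.0]
```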
5.1 Time series prediction
The time series prediction datasets are obtained from
Mackey–Glass time series [18], which can be described by
an equation
dxðtÞdt
¼ �axðtÞ þ bxðt � sÞ1 þ x10ðt � sÞ ; ð39Þ
where t[ 0 and a; b; s are parameters of the equation.
Basically, the experiment’s goal is to predict the value if the
time series at the moment t þ D will be larger or smaller
than the value at t for a given historical information about
the values of time series up to moment t. In the finance
market, there are many similar prediction problems.
Specifically, examples of time series before t� are taken
as the standard input data, and t0
between t� and t can be
taken as the privileged information (the future in the past),
which information is not available for testing (but obtain-
able for training) (see Fig. 1).
Similar to [1], we use the Mackey–Glass series with parameters $a = 0.1$, $b = 0.2$, $\tau = 17$ and initial condition $x(s) = 1.1$, and construct the training data as
$$x_t = \big( x(t-3), x(t-2), x(t-1), x(t) \big). \qquad (40)$$
The corresponding privileged information can be expressed as
$$x_t^* = \big( x(t+\Delta-2), x(t+\Delta-1), x(t+\Delta+1), x(t+\Delta+2) \big). \qquad (41)$$
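The construction (39)–(41) can be reproduced in a few lines. The explicit Euler step and the label rule (the sign of $x(t+\Delta) - x(t)$) are our assumptions for a minimal sketch, not necessarily the exact discretization used in [1, 18]:

```python
def mackey_glass(n, a=0.1, b=0.2, tau=17, x0=1.1, dt=1.0):
    """Euler integration of (39); x(s) = x0 for s <= 0."""
    hist = [x0] * (tau + 1)
    for _ in range(n):
        x_t, x_lag = hist[-1], hist[-1 - tau]
        hist.append(x_t + dt * (-a * x_t + b * x_lag / (1.0 + x_lag ** 10)))
    return hist[tau + 1:]

series = mackey_glass(600)
delta = 1
X, X_star, y = [], [], []
for t in range(3, len(series) - delta - 2):
    X.append(series[t - 3:t + 1])                         # features (40)
    X_star.append([series[t + delta - 2], series[t + delta - 1],
                   series[t + delta + 1], series[t + delta + 2]])  # (41)
    y.append(1 if series[t + delta] > series[t] else -1)  # larger or smaller
print(len(X), len(X[0]), len(X_star[0]))
```

At test time only the vectors $x_t$ of (40) are available; the "future in the past" vectors $x_t^*$ of (41) are used during training only.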
As in [1], we set $\Delta = 1, 5, 8$ to generate three classification problems. The sizes of the training sets are 100, 200, 400 and 500, respectively. A validation set of 500 examples is used to select the model parameters; 500 examples are treated as unlabeled
Fig. 1 Illustration of the privileged information in the time series prediction problem
Table 1 Error rates (%) of Semi-LUPI, LUPI and Lap-SVM on the Mackey–Glass series

Model       Training set size   Δ = 1   Δ = 5   Δ = 8
Lap-SVM     100                 3.82    6.17    8.11
LUPI        100                 3.12    5.43    7.73
Semi-LUPI   100                 2.64    4.78    6.43
Lap-SVM     200                 3.62    6.24    7.64
LUPI        200                 2.43    4.68    7.21
Semi-LUPI   200                 2.12    3.97    5.98
Lap-SVM     400                 3.34    4.65    6.33
LUPI        400                 1.91    3.64    5.56
Semi-LUPI   400                 1.68    3.31    4.17
Lap-SVM     500                 2.24    4.55    5.12
LUPI        500                 1.81    2.92    4.43
Semi-LUPI   500                 1.42    2.18    3.44
data and 500 are used for testing. Table 1 and Fig. 2 give the final results of the three methods.
From the results, we can draw the following conclusions. (1) As $\Delta$ changes, the privileged information has different impacts on the final error rates on the Mackey–Glass series: the case $\Delta = 1$ is superior to the cases $\Delta = 5$ and $\Delta = 8$. This shows that the closer the privileged data points are to the corresponding training data points, the more obvious the effect of the privileged information. (2) Semi-LUPI performs better than both LUPI and Lap-SVM. This result is not surprising, because Semi-LUPI uses more prior information to improve quality. (3) LUPI outperforms Lap-SVM in all cases, which suggests that the prior information provided by the teacher is richer than the distribution information provided by the unlabeled samples.
The first panel of Fig. 5 shows how the error rate changes with the number of unlabeled data in the case of $\Delta = 1$ with 500 labeled samples as training data. As the number of unlabeled data increases, the performance of our algorithm gradually improves.
5.2 Digits recognition
In the second experiment, we use the MNIST dataset [1]. Similar to [1], we only consider the binary classification
Fig. 2 The results of a Lap-SVM, b LUPI and c Semi-LUPI for Δ = 5. All data are split into four parts (left to right): the first part (yellow circles) is for training; the second part (purple circles) is for validation; the third part (gray dots) is unlabeled data; the last part is for testing (correctly predicted points are shown with cyan '>', and erroneously predicted points with red '<') (color figure online)
Int. J. Mach. Learn. & Cyber. (2015) 6:667–676 673
123
problem of "5" versus "8" with 28 × 28 pixels (the database contains 5522 and 5652 images of "5" and "8", respectively). To make the problem more challenging, these digits are further resized to 10 × 10 pixel images (see Fig. 3).
Each training example with privileged information was supplied with a holistic description of the corresponding image [1]. These holistic descriptions are translated into 21-dimensional features such as two-part-ness (0–5), tilting to the right (0–3), aggressiveness (0–2), stability (0–3), uniformity (0–3), and so on. This privileged information was created prior to the learning process by an independent expert (more details can be found at the NEC lab website1).
Figure 4 illustrates the results obtained by varying the number of training data over a wide range. From all samples of "5" and "8", we pick the first 50 images of "5" and 50 images of "8" as the training set with privileged information. Training sets of smaller sizes are randomly extracted from these 100 selected images. 2000 digits are used as a fixed validation set, 2000 digits are used as the unlabeled set, and 1766 digits are used as a fixed test set. Lap-SVM1 in Fig. 4 uses samples at the 10 × 10 resolution, and Lap-SVM2 uses the 28 × 28 resolution. (Note: the privileged information was not used as part of the unlabeled dataset in this experiment or the next.)
From the results, we find that when the number of samples of "5" and "8" is small, the error rates of LUPI, Semi-LUPI and Lap-SVM are very close; but when the number of samples is larger than 35, Semi-LUPI and LUPI perform better than Lap-SVM. This shows that the privileged information can significantly increase the speed of learning. In addition, with the help of unlabeled data, the average error rate of Semi-LUPI is 1.459% lower than that of LUPI, and Semi-LUPI outperforms LUPI in all cases. Note that, although Lap-SVM2 uses digits at the higher resolution, its accuracy is still lower than that of LUPI and Semi-LUPI. This suggests that the information contained in the holistic descriptions exceeds the total information obtained from both the unlabeled data and the higher-resolution images.
The second panel of Fig. 5 shows how the error rate changes with the number of unlabeled data. For each class, 50 samples are randomly selected as labeled ones. The number of unlabeled samples ranges from 100 to 2000. As can be seen, the performance improves with more unlabeled data.
5.3 Image classification
In this subsection, we will apply our proposed method to
image classification on PASCAL 2006 dataset (see Fig. 6)
[19]. The dataset contains 10 object categories (cats,
bicycles, cows, motorbikes, cars, dogs, buses, sheep, peo-
ple, horses) and 5304 images. In order to simplify, we only
pick up the 50 images of ‘‘cat’’ and 50 images of ‘‘dog’’ as
the training set. 50 samples are used as a fixed validation
set, 50 of images are used as unlabeled set and 100 images
are used as a fixed test set. The color representation method
[20] is used to extract the feature of images. All images are
resized to be gray images of 80 � 100.
In order to obtain the corresponding privileged infor-
mation, we created its holistic description for each training
sample. A holistic description for some ‘‘cat’’ is as follows
Fig. 3 Samples of "5" and "8" at different image resolutions. Images in the first and third rows are 28 × 28; images in the second and fourth rows are 10 × 10. When the resolution is reduced, some digits become vague and incomplete, and are hard to recognize even for human eyes
Fig. 4 Error rates (%) versus the size of the training set for Lap-SVM1 (using 10 × 10 digits), Lap-SVM2 (using 28 × 28 digits), LUPI and Semi-LUPI
1 http://www.nec-labs.com/research/machine/ml_website/department/software/learning-with-teacher.
(see Fig. 6a): the ear is small in proportion to the face; the mouth is narrow and non-prominent; the nose is small and its color is light; the head is short and rounded; the lip is hardly visible on the face; the color of the whole body is very bright and rich; the whole body is visible; there are several cats in the picture; the image is clear. A holistic description of a "dog" image is, for example, as follows (see Fig. 6b): the ear is large in proportion to the face; the mouth is wide and prominent; the nose is large and black; the face is very long; the lip is also long, like a zipper across the face; the color of the whole body is very dark and lacks diversity; only part of the body is visible; there is only one dog in the picture; the image is clear.
We translate these holistic descriptions into 11-dimensional feature vectors: the length of the ear in proportion to the face (0–5)²; the width of the mouth (0–5); the prominence of the mouth (0–6); the size of the nose (0–6); the color of the nose (0–4); the length of the head (0–5); the
Fig. 5 Error rates (%) of Lap-SVM and Semi-LUPI versus the number of unlabeled data on the three tasks: a time series prediction, b digits recognition, c image classification
Fig. 6 Image classification on PASCAL 2006 dataset
Fig. 7 The final results of "cat" and "dog" on the PASCAL 2006 dataset (accuracy of Lap-SVM, LUPI and Semi-LUPI versus the dimension of the privileged information)
² 0–5 is the range of possible values.
appearance of the head (0–4); the length of the lip (0–6); whether the whole body is visible (1–2); the number of animals (0–6); and the clearness of the image (0–5).
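Encoding such a holistic description as the ordered 11-dimensional privileged vector is mechanical; the feature names and scores in the example below are illustrative, not the annotations actually used in the experiments:

```python
# (feature name, maximum score) in the order listed above
FEATURES = [
    ("ear_length_vs_face", 5), ("mouth_width", 5), ("mouth_prominence", 6),
    ("nose_size", 6), ("nose_color", 4), ("head_length", 5),
    ("head_appearance", 4), ("lip_length", 6), ("whole_body_visible", 2),
    ("animal_count", 6), ("image_clearness", 5),
]

def encode(description):
    """Map a {feature: score} dict to the ordered 11-dim privileged vector,
    clipping each score to its allowed range (missing features default to 0)."""
    return [max(0, min(description.get(name, 0), hi)) for name, hi in FEATURES]

cat_like = {"ear_length_vs_face": 1, "mouth_width": 1, "nose_size": 1,
            "nose_color": 1, "image_clearness": 5}
x_star = encode(cat_like)
print(len(x_star))  # 11
```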
Figure 7 gives the results obtained by varying the vector dimension of the privileged information. Since Lap-SVM cannot use the privileged information, its accuracy does not change. The performance of LUPI and Semi-LUPI improves steadily as the privileged information increases. Because Semi-LUPI uses more additional information, its accuracy is 0.76% higher than that of LUPI and 2.18% higher than that of Lap-SVM. The third panel of Fig. 5 shows how the error rate changes with the number of unlabeled data. The result is similar to the above two experiments: as the number of unlabeled data increases, Semi-LUPI achieves better performance.
6 Conclusion
In human behavior and cognition, the teacher always plays an important role. However, in the field of machine learning, the information offered by a teacher is seldom exploited. Recently, Vapnik et al. introduced a new learning paradigm called learning using privileged information (LUPI), which mainly considers how to include a "teacher" in the learning process. Theory and experiments show that LUPI can accelerate the convergence rate of learning, especially when the learning problem itself is hard. In this paper, we propose a novel semi-supervised classification method (Semi-LUPI) that can simultaneously utilize the geometric information of the marginal distribution embedded in unlabeled data and the additional (privileged) information offered by the teacher to improve classification performance. In order to deal with different forms of the privileged information, we also give several extensions of Semi-LUPI. All experiments confirm the effectiveness of our method. In future work, we will consider how to further accelerate the algorithm; extensions to online learning and multi-class classification are also of interest.
Acknowledgments This work has been partially supported by grants from the National Natural Science Foundation of China (Nos. 61472390, 61402429, 11271361, 11201472, 11331012), the Key Project of the National Natural Science Foundation of China (No. 71331005), and the Major International (Regional) Joint Research Project (No. 71110107026).
References
1. Vapnik V, Vashist A (2009) A new learning paradigm: learning
using privileged information. Neural Netw 22(5–6):544–557
2. Vapnik V (1995) The nature of statistical learning theory.
Springer, New York
3. Vapnik V (1996) The nature of statistical learning theory.
Springer, New York
4. Vapnik V (2006) Estimation of dependences based on empirical
data (information science and statistics). Springer, Berlin
5. Pechyony D, Vapnik V (2010) On the theory of learning with
privileged information. In: Advances in neural information pro-
cessing systems, vol 23
6. Pechyony D, Izmailov R, Vashist A, Vapnik V (2010) Smo-style
algorithms for learning using privileged information. In: DMIN.
CSREA Press, Providence, pp 235–241
7. Seeger M (2001) Learning with labeled and unlabeled data.
Technical report
8. Chapelle O, Scholkopf B, Zien A (eds) (2006) Semi-supervised
learning (adaptive computation and machine learning). The MIT
Press, Cambridge
9. Zhu X (2006) Semi-supervised learning literature survey. Tech-
nical Report 15304, University of Wisconsin, Madison
10. Belkin M, Matveeva I, Niyogi P (2004) Regularization and semi-
supervised learning on large graphs. In: COLT. Springer, Berlin,
pp 624–638
11. Grandvalet Y, Bengio Y (2005) Semi-supervised learning by
entropy minimization. In: CAP, PUG, pp 281–296
12. Belkin M, Niyogi P, Sindhwani V (2006) Manifold regulariza-
tion: a geometric framework for learning from labeled and
unlabeled examples. J Mach Learn Res 7:2399–2434
13. Joachims T (2003) Transductive learning via spectral graph
partitioning. In: ICML, pp 290–297
14. Belkin M, Niyogi P (2002) Using manifold structure for partially
labelled classification. In: NIPS, pp 953–960
15. Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised
learning using gaussian fields and harmonic functions. In: ICML,
pp 912–919
16. Deng N, Tian Y, Zhang C (2011) Optimization based data min-
ing: theory and applications. Springer Press, Berlin
17. Tian Y, Shi Y, Liu X (2012) Recent advances on support vector machines research. Technol Econ Dev Econ 18(1):5–33
18. Mackey MC, Glass L (1977) Oscillation and chaos in physio-
logical control systems. Science 197(4300):287–289
19. Everingham M, Zisserman A, Williams CKI, Van Gool L (2006) The
PASCAL visual object classes challenge 2006 (VOC 2006) results.
http://www.pascal-network.org/challenges/VOC/voc2006/results.
20. Deng Y, Manjunath BS, Kenney C, Moore MS, Member S, Shin
H (2001) An efficient color representation for image retrieval.
IEEE Trans Image Process 10:140–147