Computers and Electrical Engineering 62 (2017) 499–510
Richer feature for image classification with super and sub kernels based on deep convolutional neural network
Pengjie Tang a,b,c, Hanli Wang a,b,∗
a Department of Computer Science and Technology, Tongji University, Shanghai 201804, P. R. China
b Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, P. R. China
c College of Math and Physics, Jinggangshan University, Ji'an 343009, P. R. China
Article info
Article history:
Received 29 January 2016
Revised 13 January 2017
Accepted 13 January 2017
Available online 20 January 2017
Keywords:
Deep convolutional neural network
Super convolutional kernel
Sub convolutional kernel
Parallel crossing
Image classification
Abstract
Deep convolutional neural network (DCNN) has achieved great success in image classification. However, the principles of the human visual system (HVS) are not fully investigated and incorporated into current popular DCNN models. In this work, a novel DCNN model named parallel crossing DCNN (PC–DCNN) is designed to simulate HVS, with the concepts of super convolutional kernel and sub convolutional kernel introduced. Moreover, a multi-scale PC–DCNN (MS-PC-DCNN) framework is designed, in which a batch of PC–DCNN models are deployed and the scores from each PC–DCNN model are fused by weighted averaging for the final prediction. Experimental results on four public datasets verify the superiority of the proposed model as compared to a number of state-of-the-art models.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Image classification plays an important role in computer vision. Traditionally, the histogram of oriented gradients (HOG) [1], the scale invariant feature transform (SIFT) [2], and other biological features [3] are first extracted from the image. Then, techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) are usually employed for dimension reduction. Afterwards, the bag of features (BoF) or Fisher vector (FV) approach is applied to encode the descriptors as feature vectors. Finally, the feature vectors are fed to a classifier such as the support vector machine (SVM) to predict the image class. Many effective methods, such as spatial pyramid matching (SPM) [4] and sparse coding with pooling and spatial pyramid matching (Sc+SPM) [5], have emerged and achieved strong performance for image classification. However, these handcrafted features generally carry less semantic and structural information, and the classification performance can be further improved.
Nowadays, the deep convolutional neural network (DCNN) has attracted considerable research attention because of its amazing performance [6–8]. It simulates the human visual system (HVS) and the multi-level architecture of the brain. A number of DCNN models (e.g., Alex-Net [6], VGG16 [8], GoogLeNet [7]) have been designed and have obtained astonishing results on a number of visual tasks (e.g., image classification, object detection, human action recognition). In DCNN, depth is one of the key factors for enhancing the discriminative ability of features. Generally speaking, the deeper the model is, the better the
✩ Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. M. Senthil Kumar.
∗ Corresponding author.
E-mail address: [email protected] (H. Wang).
http://dx.doi.org/10.1016/j.compeleceng.2017.01.011
0045-7906/© 2017 Elsevier Ltd. All rights reserved.
(a) PC-DCNN model
(b) MS-PC-DCNN model
Fig. 1. Overview of the proposed PC–DCNN and MS-PC-DCNN models.
performance becomes. Many research efforts are devoted to optimizing the DCNN model architecture, such as [9–11]. The works mentioned above achieve great breakthroughs on the task of image classification. However, many of them pay much more attention to the model depth and neglect the fact that, as the number of layers increases, the model complexity grows dramatically while the performance is not always improved [12]. In addition, most of the current models ignore another fact: human eyes have different visual fields, and the types of information they collect also differ.
As we know, visual information enters the brain through two visual pathways and is then integrated into more comprehensive information via the optic chiasma; the combined information is more discriminative and abstract. Therefore, in this work, we simulate the human eyes with two types of convolutional kernels of different sizes at the bottom layer; the extracted information then forms two streams, each forwarded via the other's pathway. When the features arrive at the top of the proposed model, the two streams are fused, which is similar to the mechanism of the optic chiasma. Based on this process, we design a novel architecture called parallel crossing DCNN (PC–DCNN), as shown in Fig. 1(a).
In practice, the ability of the human eyes is limited, and people often turn to optical instruments for more information about the objects they want to understand or recognize. For example, people use telescopes to observe the macro features of objects, and microscopes to obtain information about microstructure. The more information about the objects can be used, the higher the precision that can be obtained. We simulate this process with multi-scale convolutional kernels and develop the multi-scale PC–DCNN (MS-PC-DCNN) model, as shown in Fig. 1(b). In MS-PC-DCNN, we use the super convolutional kernel to simulate the use of telescopes, and the sub convolutional kernel to simulate the use of microscopes. In this way, a batch of trained DCNN models and a few groups of scores are obtained, and we fuse all the models' scores by computing their weighted average. The experimental results demonstrate that the proposed models can greatly improve image classification performance. Meanwhile, the proposed framework is expandable: if we employ more and smaller sub convolutional kernels, more PC–DCNN modules can be generated, and the performance can be further improved.
The main contributions of this work are threefold. First, inspired by the principles of human vision, we propose a novel model with a reasonable architecture and low complexity, called PC–DCNN, for the task of image classification. Second, the concepts of super convolutional kernel and sub convolutional kernel are proposed in accordance
with the process by which humans observe objects. Third, the MS-PC-DCNN framework is designed, in which a batch of PC–DCNN models are generated according to the principles of the super convolutional kernel and the sub convolutional kernel. The rest of this paper is structured as follows. Section 2 reviews related works on DCNN, including the history and several state-of-the-art DCNN architectures. In Section 3, the proposed PC–DCNN and MS-PC-DCNN models are detailed, with the principles of the super convolutional kernel and the sub convolutional kernel introduced. The experimental results are presented in Section 4. Finally, Section 5 concludes this work.
2. Related works
LeCun et al. [13] present the convolutional neural network (CNN), which constructs convolutional kernels to simulate the human receptive field and filters image patches with convolutional operations. Hinton et al. propose the idea of deep learning in 2006 and design the deep belief network (DBN) model, in which several restricted Boltzmann machines (RBMs) are stacked and a layer-wise learning mechanism is used for training [14].
An astonishing achievement of DCNN was made in the ImageNet competition of 2012, where Krizhevsky et al. designed the Alex-Net model by combining the idea of deep learning with CNN [6]; the classification accuracy reached 84.7% (Top-5) on the ImageNet 2012 dataset [15], outperforming the previous state-of-the-art model (SIFT+FV) by more than 10%. In [9], Zeiler et al. propose an approach that can visualize the features of every layer based on Alex-Net, and further improve the classification performance by refining the Alex-Net model. The VGG16 and VGG19 models are designed based on the conclusion that depth is significant for feature representation [8]. As compared to the VGG models, the GoogLeNet model has lower complexity and better performance thanks to its small convolutional kernels and ingenious design, in which the module named Inception is introduced for clustering sparse features [7]. In [10], the network in network (NIN) model is designed, which uses repeated convolutional operations before pooling to generate richer image features and reduce the model complexity. The classification accuracies achieved by NIN on the benchmark CIFAR-10 [16] and CIFAR-100 [16] datasets reach 92% and 64.3%, respectively. Other improved models such as All-CNN [11] have also been developed for image classification and perform well. Besides model architecture, a number of works focus on developing new techniques for DCNN. Zeiler et al. propose the stochastic pooling method [17] to eliminate the drawbacks of max pooling and average pooling. Dropout [6], Maxout [18] and DropConnect [19] are designed to prevent over-fitting. In [20], the parametric rectified linear unit (PReLU) method is designed to improve the widely applied ReLU method for neuron activation.
The aforementioned works concentrate on the models themselves or their optimization, seeking more reasonable architectures and techniques. In comparison, the proposed method further follows the principles of HVS as mentioned in Section 1, so that the parallel crossing mechanism in the process of visual information transfer is explicitly explored, and the super convolutional kernel as well as the sub convolutional kernel are designed for the proposed PC–DCNN and MS-PC-DCNN models. To implement the proposed models, we build upon the relatively low-complexity Alex-Net model to verify our ideas.
3. Proposed PC–DCNN model and MS-PC-DCNN model
3.1. Convolutional neural network
Generally, a convolutional neural network consists of convolutional layers, pooling layers and fully connected layers. Given an input feature map x ∈ R^{H×W×N}, where H and W are the height and width of the feature map and N is the number of feature maps in a layer, the convolutional operation can be given by

y_{ij} = (K ∗ x_k)_{ij} + b, (1)

where y_{ij} is the output at (i, j), K is the convolutional kernel, x_k is the k-th patch in x, and b is the bias. Then, max or average pooling follows for sampling, by x_k = max(x^1_k, x^2_k, ..., x^n_k) or x_k = (x^1_k + x^2_k + ... + x^n_k)/n, where n indicates the number of feature map elements within the k-th patch. Next, an activation function is employed to enhance the sparseness of the features, with the form f(y_{ij}) = max(0, y_{ij}) + α · min(0, y_{ij}) if the PReLU method [20] is used, where the parameter α can be updated as

α = α − η ∂Φ/∂α, (2)

where η is the learning rate and Φ is the cost function, for which the cross entropy is usually employed:

Φ = −(1/m) Σ^m_{i=1} [t_i log(o_i) + (1 − t_i) log(1 − o_i)], (3)

where m is the number of samples in one iteration, t_i is the ground truth of the i-th sample, and o_i is the output of the system. When training a DCNN model, the objective is to minimize Φ: f(x, w) ↦ R, where w is the set of weights to be optimized.
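As a concrete illustration of the activation and cost just described, a minimal plain-Python sketch follows; the function names and toy inputs are ours, not part of the original model:

```python
import math

def prelu(y, alpha=0.25):
    # PReLU activation from Section 3.1: f(y) = max(0, y) + alpha * min(0, y);
    # alpha itself is learned via the gradient update of Eq. (2)
    return max(0.0, y) + alpha * min(0.0, y)

def cross_entropy(outputs, targets):
    # Eq. (3): mean cross entropy over the m samples of one iteration
    m = len(outputs)
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for o, t in zip(outputs, targets)) / m

print(prelu(-2.0))                          # -0.5
print(round(cross_entropy([0.9], [1]), 4))  # 0.1054
```

The negative-side slope alpha is what distinguishes PReLU from plain ReLU, which would output 0 for the first call.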
3.2. Super convolutional kernel and PC–DCNN model
Suppose that the convolutional kernel has equal length and width. We define the scale ratio z_l as

z_l = (K^c_l − K^s_l) / K^c_l, K^c_l ≠ K^s_l, (4)

where l indexes the convolutional layer, K^c_l is the size of the original convolutional kernel and K^s_l is the size of the scaled convolutional kernel. If z_l < 0, we have

(K^c_l − K^s_l) / K^c_l < 0, (5)

namely,

K^s_l > K^c_l. (6)

If K^s_l ≫ K^c_l under this condition, the kernel of size K^s_l will be very large; we define this type of kernel as the super convolutional kernel. A naive approach to obtain super convolutional kernels is to increase the kernel size directly. However, this would greatly increase the number of parameters and the time complexity. Instead, we propose another way to generate super convolutional kernels.
As we know, the size of the feature map in the l-th convolutional layer can be calculated by

M^H_l = ⌊(M^H_{l−1} − K^H_l) / s^H_l⌋ + 1,
M^W_l = ⌊(M^W_{l−1} − K^W_l) / s^W_l⌋ + 1, (7)

where M_l and M_{l−1} denote the sizes of the feature maps in the l-th and (l−1)-th convolutional layers, respectively, K_l is the size of the convolutional kernel in the l-th layer, s_l is the stride of the convolutional operation in the l-th layer, and H and W are the height and width, respectively. For simplicity, we set H = W, and Eq. (7) can be simplified as

M_l = ⌊(M_{l−1} − K_l) / s_l⌋ + 1. (8)

Given K_l = K^c_l, and denoting the size of the original feature map M_{l−1} and the stride s_l by M^c_{l−1} and s^c_l, respectively, Eq. (8) can be written as

M^c_l = ⌊(M^c_{l−1} − K^c_l) / s^c_l⌋ + 1. (9)

Then, we increase s^c_l and denote the enlarged stride as s^s_l, so that the size of the output feature map M^s_l can be calculated by

M^s_l = ⌊(M^c_{l−1} − K^c_l) / s^s_l⌋ + 1. (10)

Afterwards, since M^s_l is fixed, we have

K^s_l = M^c_{l−1} − s^c_l (M^s_l − 1). (11)

According to Eq. (10), M^s_l decreases as the stride grows. Because M^c_{l−1} and s^c_l are fixed, K^s_l increases according to Eq. (11). Therefore, we utilize the method of increasing the stride to obtain the super convolutional kernel. In this way, the size of the convolutional kernels is not increased, and the complexity is restricted. We can further rewrite Eq. (11) as

K^s_l = M^c_{l−1} − s^c_l ⌊(M^c_{l−1} − K^c_l) / s^s_l⌋. (12)

Therefore, the new stride can be calculated by

s^s_l = ⌊(M^c_{l−1} − K^c_l) · s^c_l / (M^c_{l−1} − K^s_l)⌋. (13)
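The derivation of Eqs. (8)–(11) can be checked numerically. The sketch below uses the first-layer numbers of Model(a) (224 × 224 crop, 11 × 11 kernel, stride 4 in Stream A and 6 in Stream B, cf. Tables 1 and 2); the helper names are ours:

```python
def feat_map_size(m_prev, k, s):
    # Eq. (8): output feature-map size for square maps
    return (m_prev - k) // s + 1

def super_kernel_size(m_prev, k_c, s_c, s_s):
    # Eqs. (10)-(11): enlarging the stride from s_c to s_s shrinks the
    # output map, which is equivalent to convolving with a larger
    # (super) kernel at the original stride s_c
    m_s = feat_map_size(m_prev, k_c, s_s)   # Eq. (10)
    return m_prev - s_c * (m_s - 1)         # Eq. (11)

# First layer of Model(a): 224x224 crop, 11x11 kernel, stride raised 4 -> 6
print(super_kernel_size(224, 11, 4, 6))  # 84, the Stream B super kernel in Table 2
```

Note that the stored kernel stays 11 × 11; only the stride changes, so the parameter count is unaffected while the effective receptive field grows to 84 × 84.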
At the training stage, the weights w and the biases b are updated by the chain rule. However, at the top of the model, the two streams are crossed twice, and the process differs from the traditional back propagation (BP) algorithm. Suppose g_L is the function of the last fully connected layer, denoted as g_L = f(w · X_L + b), where X_L is the neuron vector. At the second crossing layer, the weights and biases are updated by

w_{L−1} = w_{L−1} − η ∂Φ/∂w_{L−1} = w_{L−1} − η [ (∂Φ/∂g_L)(∂g_L/∂g^A_{L−1})(∂g^A_{L−1}/∂w^A_{L−1}) + (∂Φ/∂g_L)(∂g_L/∂g^B_{L−1})(∂g^B_{L−1}/∂w^B_{L−1}) ], (14)
b_{L−1} = b_{L−1} − η ∂Φ/∂b_{L−1} = b_{L−1} − η [ (∂Φ/∂g_L)(∂g_L/∂g^A_{L−1})(∂g^A_{L−1}/∂b^A_{L−1}) + (∂Φ/∂g_L)(∂g_L/∂g^B_{L−1})(∂g^B_{L−1}/∂b^B_{L−1}) ]. (15)
In a similar way, w and b at the first crossing layer can be updated with

w_{L−2} = w_{L−2} − η (∂Φ/∂g_L) [ (∂g_L/∂g^A_{L−1}) ( (∂g^A_{L−1}/∂g^A_{L−2})(∂g^A_{L−2}/∂w^A_{L−2}) + (∂g^A_{L−1}/∂g^B_{L−2})(∂g^B_{L−2}/∂w^B_{L−2}) ) + (∂g_L/∂g^B_{L−1}) ( (∂g^B_{L−1}/∂g^A_{L−2})(∂g^A_{L−2}/∂w^A_{L−2}) + (∂g^B_{L−1}/∂g^B_{L−2})(∂g^B_{L−2}/∂w^B_{L−2}) ) ], (16)
b_{L−2} = b_{L−2} − η (∂Φ/∂g_L) [ (∂g_L/∂g^A_{L−1}) ( (∂g^A_{L−1}/∂g^A_{L−2})(∂g^A_{L−2}/∂b^A_{L−2}) + (∂g^A_{L−1}/∂g^B_{L−2})(∂g^B_{L−2}/∂b^B_{L−2}) ) + (∂g_L/∂g^B_{L−1}) ( (∂g^B_{L−1}/∂g^A_{L−2})(∂g^A_{L−2}/∂b^A_{L−2}) + (∂g^B_{L−1}/∂g^B_{L−2})(∂g^B_{L−2}/∂b^B_{L−2}) ) ], (17)

where g^A and g^B stand for the transformation functions of Stream A and Stream B, respectively. Similarly, w^A and b^A are the weights and biases for Stream A, and w^B and b^B are those for Stream B.
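The multi-path sum structure of Eq. (14) can be verified on a toy scalar network by comparing the chain-rule gradient with a finite difference; all functions below are illustrative stand-ins, not the actual PC–DCNN layers:

```python
# Toy scalar stand-ins: two streams feed one fused output g_L
def g_A(w_A, x): return w_A * x          # Stream A branch
def g_B(w_B, x): return w_B * x          # Stream B branch
def g_L(a, b):   return a + 2.0 * b      # fusion layer
def loss(y):     return 0.5 * y * y      # toy cost Phi

w_A, w_B, x = 0.3, -0.2, 1.5
y = g_L(g_A(w_A, x), g_B(w_B, x))

# Chain rule as in Eq. (14): dPhi/dg_L * dg_L/dg_A * dg_A/dw_A
grad_wA = y * 1.0 * x

# Finite-difference check of the analytic gradient
eps = 1e-6
num = (loss(g_L(g_A(w_A + eps, x), g_B(w_B, x))) - loss(y)) / eps
print(abs(grad_wA - num) < 1e-4)  # True
```

In the real model each stream weight collects gradient contributions through every path that reaches the fused output, which is exactly the two-term sums of Eqs. (14)–(17).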
In our model, the number of parameters does not increase, because the actual size of the convolutional kernels does not change even though we use the super convolutional kernel. Meanwhile, the number of neurons and the time complexity are both reduced, because increasing the stride of the convolutional operations yields smaller feature maps. However, using only a single model with the proposed super convolutional kernel cannot improve the performance, because much detailed information is lost with a long stride. Inspired by HVS, we combine the model with the super convolutional kernel and the model with the original convolutional kernel to form two data transformation streams that simulate the two visual pathways. At the top of the model, the two streams are mixed to simulate the mechanism of the optic chiasma.
3.3. Sub convolutional kernel and MS-PC-DCNN model
According to Eq. (4), given z_l > 0, we have

(K^c_l − K^s_l) / K^c_l > 0, (18)

that is,

K^s_l < K^c_l. (19)

If K^s_l > 0, K^s_l is simply the size of a traditional (smaller) convolutional kernel; using such a scaled smaller kernel directly reduces the parameters and the complexity, but increases the number of neurons and thus requires more memory. If K^s_l < 0, no such convolutional kernel actually exists, so we name this type of kernel the sub convolutional kernel.
In order to obtain sub convolutional kernels, the following two situations are investigated. In the first, M_{l−1} is fixed while the size of the convolutional kernel may increase or decrease; in the second, the size of the convolutional kernel is fixed while M_{l−1} increases or decreases. As illustrated in Fig. 2, the effects of the two conditions are equivalent. In the first situation, because the size of the feature maps is fixed, the regions covered by different convolutional kernels differ: larger kernels cover bigger regions of the feature map, and smaller kernels cover smaller regions. In the second case, if the input feature maps become smaller, the regions covered by the fixed-size convolutional kernels become relatively larger; similarly, the covered regions become relatively smaller if the feature maps become larger. For image classification, these two approaches have the same effect. Therefore, we obtain sub convolutional kernels by increasing the size of the input feature maps while keeping the size of the convolutional kernels fixed.
If M_{l−1} increases, M_l also increases according to Eq. (8). Let K_l = K^c_l, and denote the increased M_l and M_{l−1} as (M^s_l)′ and (M^s_{l−1})′, respectively; we can then compute (K^s_l)′ as

(K^s_l)′ = (M^s_{l−1})′ − s^c_l ((M^s_l)′ − 1). (20)

If (K^s_l)′ < 0, (K^s_l)′ is the size of the sub convolutional kernel. However, according to Fig. 2, our goal is not the sub convolutional kernel itself but (M^s_{l−1})′. From Eq. (20), we derive

(M^s_{l−1})′ = (K^s_l)′ + s^c_l ((M^s_l)′ − 1), (21)
Fig. 2. Illustration of the proposed super convolutional kernel and sub convolutional kernel via image decreasing and increasing.
that is,

(M^s_{l−1})′ = (K^s_l)′ + s^c_l ⌊(M^c_{l−1} − K^c_l) / s^s_l⌋. (22)

According to Eq. (4), if the scale ratio z_l and the original convolutional kernel K^c_l are known, (K^s_l)′ can be obtained, and thus (M^s_{l−1})′ can be computed with the aforementioned equations. If l = 1, (M^s_{l−1})′ = (M^s_0)′, which represents the expected size of the input image.
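Eq. (21) reduces to one line of code; the numbers below (a hypothetical 5 × 5 target kernel, stride 4, and 68 outputs) only illustrate the arithmetic and are not taken from the paper's configuration:

```python
def sub_kernel_input_size(k_s, s_c, m_s):
    # Eq. (21): input size (M^s_{l-1})' required so that a stride-s_c
    # convolution yielding m_s outputs corresponds to a kernel of size k_s
    return k_s + s_c * (m_s - 1)

# Hypothetical numbers: emulate a 5x5 kernel with stride 4 and 68 outputs
print(sub_kernel_input_size(5, 4, 68))  # 273
```

Enlarging the input in this way makes the fixed 11 × 11 kernel behave like a relatively smaller one, which is the sub-kernel effect described above.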
The ideas discussed above are applied to the proposed PC–DCNN model. By setting different scale ratios, several types of input images with different sizes can be obtained, leading to the generation of more than one PC–DCNN model. In addition, the experimental results show that a single model with the sub convolutional kernel brings no obvious performance improvement, mainly because a single model misses the macro structure information of the entire object. Therefore, we regard the PC–DCNN model as the basic module, combine multiple basic modules with different sizes of sub convolutional kernels, and compute the weighted average score over these modules. Assume there are c categories in a dataset and R modules in the proposed MS-PC-DCNN model. The feature vector generated by the i-th module is denoted as v_i. For an image, we define the estimate that it belongs to the j-th class as ŝc_j = P(c_j | v_i) + ε_{ij}(v_i), where P(c_j | v_i) is the true probability from the i-th module and ε_{ij}(v_i) denotes the estimation error. Because ε_{ij}(v_i) may differ among the modules, we apply the weighted average combination rule

ŜC_j = (1/R) Σ^R_{i=1} ŝc_j = (1/R) Σ^R_{i=1} β_i P(c_j | v_i), (23)

where R = 4 in our model, β_i is the weight for the i-th PC–DCNN module, and ŜC_j is the final score estimate for the j-th class. The experimental results demonstrate that the proposed framework improves the performance greatly. It is worth noting that our multi-scale method is different from the traditional method [8] that enlarges the original images and just uses fixed crop patches (e.g., 224 × 224).
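The weighted-average fusion of Eq. (23) can be sketched as follows, with hypothetical scores from R = 2 modules over c = 3 classes:

```python
def fuse_scores(module_scores, betas):
    # Eq. (23): weighted average of per-module class probabilities
    R = len(module_scores)
    c = len(module_scores[0])
    return [sum(betas[i] * module_scores[i][j] for i in range(R)) / R
            for j in range(c)]

# Hypothetical scores from R = 2 PC-DCNN modules over c = 3 classes
fused = fuse_scores([[0.2, 0.5, 0.3], [0.1, 0.7, 0.2]], [1.0, 1.0])
print(fused.index(max(fused)))  # predicted class: 1
```

With uniform betas this reduces to plain score averaging; unequal betas let more reliable modules dominate the final prediction.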
4. Experimental results
The image resolution has a big effect on the performance, so we implement two different PC–DCNN models for datasets with high-resolution images and with low-resolution images. For the first situation, we use the Caltech101 [21] and Caltech256 [22] datasets to evaluate our approach. The Caltech101 dataset has more than 9000 images in 101 categories, with about 90 images per category on average. The Caltech256 dataset has a small number of images overlapping with Caltech101, but contains many more images and categories: more than 30,000 images in 256 categories, with about 120 images per category on average. On each dataset, we repeat the experiments three times and report the mean value. For the second situation, we use the popular CIFAR-10 [16] and CIFAR-100 [16] datasets to evaluate the model. Both CIFAR-10 and CIFAR-100 contain 60,000 color images of size 32 × 32, with 50,000 images for training and the rest for test.
4.1. Model design
High-resolution images have more pixels than tiny images, and more neurons are required for representing them, leading
to higher model complexities. We design Model(a) and Model(b) for the large image datasets and tiny image datasets,
Table 1
The architectures of Model(a) and Model(b).

Layer type     Model(a) Stream A   Model(a) Stream B   Model(b) Stream A   Model(b) Stream B
Conv.          {96@11 × 11, 4}     {96@11 × 11, 6}     {96@3 × 3, 1}       {96@3 × 3, 2}
PReLU/LRN      −−                  −−                  −−                  −−
Max pooling    {96@3 × 3, 2}       {96@3 × 3, 2}       {96@3 × 3, 2}       {96@3 × 3, 2}
Conv.          {256@5 × 5, 1}      {256@5 × 5, 1}      {256@3 × 3, 1}      {256@3 × 3, 1}
PReLU/LRN      −−                  −−                  −−                  −−
Max pooling    {256@3 × 3, 2}      {256@3 × 3, 2}      {256@3 × 3, 2}      {256@3 × 3, 2}
Conv.          {384@3 × 3, 1}      {384@3 × 3, 1}      {384@3 × 3, 1}      {384@3 × 3, 1}
PReLU          −−                  −−                  −−                  −−
Conv.          {384@3 × 3, 1}      {384@3 × 3, 1}      {384@3 × 3, 1}      {384@3 × 3, 1}
PReLU          −−                  −−                  −−                  −−
Conv.          {256@3 × 3, 1}      {256@3 × 3, 1}      {256@3 × 3, 1}      {256@3 × 3, 1}
PReLU          −−                  −−                  −−                  −−
Max pooling    {256@3 × 3, 2}      {256@3 × 3, 2}      {256@3 × 3, 2}      {256@3 × 3, 2}
FC             2048                2048                2048                2048
Dropout        ratio: 0.5          ratio: 0.5          ratio: 0.5          ratio: 0.5
Concat         Fusion                                  Fusion
FC             2048                2048                2048                2048
Dropout        ratio: 0.5          ratio: 0.5          ratio: 0.7          ratio: 0.7
Concat         Fusion                                  Fusion
FC             1024                                    1024
Table 2
The size of super convolutional kernels and sub convolutional kernels with different input patches.

Model(a):
Crop size    Stream A (Sub)   Stream B (Super)   Stream B (Sub)
224 × 224    −−               84 × 84            −−
280 × 280    40 × 40          48 × 48            −−
336 × 336    100 × 100        8 × 8              −−
392 × 392    156 × 156        −−                 28 × 28

Model(b):
Crop size    Stream A (Sub)   Stream B (Super)   Stream B (Sub)
28 × 28      −−               16 × 16            −−
42 × 42      12 × 12          9 × 9              −−
56 × 56      26 × 26          2 × 2              −−
70 × 70      40 × 40          −−                 5 × 5
Table 3
Crop size and image size used by Model(a) and Model(b).
Model(a) Model(b)
Crop size Image size Crop size Image size
224 × 224 256 × 256 28 × 28 32 × 32
280 × 280 320 × 320 42 × 42 48 × 48
336 × 336 384 × 384 56 × 56 64 × 64
392 × 392 448 × 448 70 × 70 80 × 80
Table 4
Configuration of MS-PC-DCNN with Model(a).
MS-PC-DCNN Fusion scale
1-Scale PC-DCNN:256 × 256
2-Scale PC-DCNN:256 × 256 + 320 × 320
3-Scale PC-DCNN:256 × 256 + 320 × 320 + 384 × 384
4-Scale PC-DCNN:256 × 256 + 320 × 320 + 384 × 384 + 448 × 448
respectively, based on the ideas of the super convolutional kernel and the sub convolutional kernel. The model settings are shown in Table 1, where in the format {n1@n2 × n2, s}, n1 is the number of output channels, n2 is the size of the convolutional kernel, and s is the stride. Supposing the size of the original convolutional kernels in the first layer is 11 × 11 in Model(a) and 3 × 3 in Model(b), a series of super convolutional kernels and sub convolutional kernels are generated; the detailed configuration is shown in Table 2. It is worth noting that, within a single PC–DCNN module, super convolutional kernels always appear in Stream B and sub convolutional kernels always appear in Stream A.
In the proposed MS-PC-DCNN model, we resize each image according to the crop size (Table 3). The fusion scales are shown in Tables 4 and 5 for Model(a) and Model(b), respectively, where the format m_1 × m_1 + ... + m_r × m_r denotes that there are r PC–DCNN modules with different scales in the MS-PC-DCNN model.
Table 5
Configuration of MS-PC-DCNN with Model(b).
MS-PC-DCNN Fusion scale
1-Scale PC-DCNN:32 × 32
2-Scale PC-DCNN:32 × 32 + 48 × 48
3-Scale PC-DCNN:32 × 32 + 48 × 48 + 64 × 64
4-Scale PC-DCNN:32 × 32 + 48 × 48 + 64 × 64 + 80 × 80
Table 6
Configuration for training.

Parameter type   Model(a)   Model(b)   Policy
Learning rate    0.001      0.01       poly
Decay power      0.6        0.6        fixed
Weight decay     0.0005     0.0005     fixed
Momentum         0.9        0.9        fixed
Batch size       32         50         fixed
Table 7
Performance comparison of different models on Caltech256 and Caltech101.

Deep model              Caltech256 (N_train = 60)       Caltech101 (N_train = 30)
                        Acc% (Top1)   Acc% (Top5)       Acc% (Top1)   Acc% (Top5)
Alex-Net [6]            38.8          57.2              56.6          75.9
ZF-Net [9]              38.8          −−                46.5          −−
GoogLeNet [7]           43.8          63.9              53.3          73.2
VGG16 [8]               41.5          59.9              58.1          77.7
Ex-CNN [24]             53.6          −−                87.1          −−
PC-DCNN (256 × 256)     47.3          66.6              66.7          85.4
PC-DCNN (320 × 320)     48.6          68.2              67.2          87.0
PC-DCNN (384 × 384)     49.6          68.1              67.5          86.5
PC-DCNN (448 × 448)     50.1          68.5              69.7          88.1
MS-PC-DCNN (4-Scale)    54.4          71.5              72.6          88.8
4.2. Configuration
In all experiments, we use the technology of data augmentation [16] . First of all, we crop the image from the top left
corner, top right corner, bottom left corner, bottom right corner and center for patches, and then flip them horizontally at
the training stage. So the size of augmented dataset is 10 times of the original dataset.
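The five-crop-plus-flip scheme can be sketched as below; the image is represented as a plain list of pixel rows for illustration:

```python
def ten_crop_offsets(h, w, c):
    # Four corners plus center; flipping each patch doubles this to 10
    return [(0, 0), (0, w - c), (h - c, 0), (h - c, w - c),
            ((h - c) // 2, (w - c) // 2)]

def ten_crop(img, c):
    # img: 2-D list of pixel rows; returns 10 c-by-c patches
    h, w = len(img), len(img[0])
    patches = [[row[x:x + c] for row in img[y:y + c]]
               for y, x in ten_crop_offsets(h, w, c)]
    patches += [[row[::-1] for row in p] for p in patches]  # horizontal flips
    return patches

img = [[0] * 256 for _ in range(256)]
print(len(ten_crop(img, 224)))  # 10
```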
During training, for Model(a), 30 images are randomly selected from each category of the Caltech101 dataset, and the remaining images are used for test. To find the hyperparameters, we use the first 20 images in each class for training and the rest for validation, and locate the best-performing point on the validation set. Then, we train on the whole training set until the iteration number reaches that point. On the Caltech256 dataset, we first randomly select 60 images from each category as training samples and use the other images for test; we then use the first 50 images for training and the rest for validation.
Regarding Model(b), we follow the method of Goodfellow et al. [18], which uses the first 40,000 images for training and the remaining 10,000 images for validation on CIFAR-10. The same protocol is applied to the CIFAR-100 dataset, without fine-tuning the parameters of the model trained on CIFAR-10. The other configuration parameters are shown in Table 6. In all experiments, we employ the popular Caffe toolkit [23] to deploy and implement the proposed models, and we use two NVIDIA TITAN-X GPUs to accelerate the computation. On Caltech256 and Caltech101, about 3 days are required for training the models, and on CIFAR-10 and CIFAR-100, about 1.5 days. At the test stage, the average time per image is about 0.15 s on Caltech256 and Caltech101, and about 0.07 s on CIFAR-10 and CIFAR-100.
4.3. Performance analysis
For large images, a number of models are implemented for comparison, including Alex-Net [6], GoogLeNet [7] and VGG16 [8] with the same configuration, while the results of the ZF-Net model are cited directly from [9]. The comparative results are presented in Table 7, where it can be observed that the proposed PC-DCNN model outperforms the other competing models on both Caltech256 and Caltech101. Furthermore, the proposed MS-PC-DCNN (4-Scale) model outperforms GoogLeNet by more than 10% (Top-1) on Caltech256 and 19% (Top-1) on Caltech101, and outperforms VGG16 by about 13% (Top-1) and 14% (Top-1) on Caltech256 and Caltech101, respectively. As compared with Ex-CNN,
Table 8
Performance comparison of different models on CIFAR-100 and CIFAR-10.

Deep model                CIFAR-100 (Err% (Top1))   CIFAR-10 (Err% (Top1))
Stochastic pooling [17]   42.51                     15.13
Conv. Maxout [18]         38.57                     9.38
Deeply supervised [25]    34.57                     7.97
DropConnect [19]          −−                        9.32
NIN + Dropout [10]        35.68                     8.81
All-CNN [11]              33.71                     7.25
Ex-CNN [24]               −−                        15.7
PC–DCNN (32 × 32)         31.91                     7.73
PC–DCNN (48 × 48)         28.81                     7.41
PC–DCNN (64 × 64)         28.87                     7.49
PC–DCNN (80 × 80)         29.90                     7.23
MS-PC-DCNN (4-Scale)      25.90                     6.07
Table 9
Complexity of different models. Columns 1–2 list the results of Model(a) and Columns 3–4 list the results of Model(b).

Deep model             T_Complexity   Deep model                T_Complexity
Alex-Net [6]           0.88           Stochastic pooling [17]   0.14
ZF-Net [9]             0.94           Conv. Maxout [18]         28.23
GoogLeNet [7]          1.32           Deeply supervised [25]    0.14
VGG16 [8]              20.1           NIN [10]                  1.13
Ex-CNN [24]            76.43          All-CNN [11]              1.29
PC-DCNN                1.0            Ex-CNN [24]               10.02
MS-PC-DCNN (4-Scale)   10.51          PC-DCNN                   1.0
                                      MS-PC-DCNN (4-Scale)      13.43
the performance of the proposed model also shows an obvious improvement on Caltech256. However, on Caltech101, there is still a great gap between our model and Ex-CNN.
We also compare the performance achieved by the proposed Model(b) with several state-of-the-art models on the tiny image datasets, as shown in Table 8. The proposed MS-PC-DCNN (4-Scale) model obtains state-of-the-art results on the CIFAR-100 dataset, reducing the test error to 25.90%. Even with a single 1-Scale model, the performance outperforms NIN [10] and All-CNN [11]. On CIFAR-10, we also obtain the best result, with a test error of 6.07% by the 4-Scale model.
We further compare multi-scale models built from different numbers of single PC–DCNN models. We fuse PC–DCNN
models of different scales according to the rules in Tables 4 and 5 for Model(a) and Model(b), respectively.
The results are shown in Fig. 3, where it is obvious that the performance improves as the number of PC–DCNN
modules increases. Especially on Caltech256 and Caltech101, the 4-Scale model achieves accuracies of 54.4% and
72.6%, respectively, outperforming the original PC–DCNN model by more than 7% and 6%. On the CIFAR-100 dataset,
the 4-Scale model reduces the test error by 6.01% compared with the original PC–DCNN. Even on the CIFAR-10
dataset, the 4-Scale model reduces the test error by about 1.7% as compared to the original PC–DCNN model.
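The multi-scale fusion described above can be sketched as follows: class scores produced by several single-scale PC–DCNN models are combined by weighted average, and the fused score determines the predicted class. The specific weights and score values below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted average of per-model class-score vectors."""
    scores = np.stack(score_list)   # shape: (num_models, num_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                 # normalize the fusion weights
    return w @ scores               # fused class scores

# Example: three scales voting over four classes (hypothetical scores).
s1 = np.array([0.1, 0.6, 0.2, 0.1])
s2 = np.array([0.2, 0.5, 0.2, 0.1])
s3 = np.array([0.1, 0.4, 0.4, 0.1])
fused = fuse_scores([s1, s2, s3], weights=[1.0, 1.0, 1.0])
print(int(np.argmax(fused)))  # predicted class index -> 1
```

With equal weights this reduces to a plain average; unequal weights allow stronger scales to dominate the final prediction.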
4.4. Model complexity
The model complexity of a DCNN model is related to the size of the convolutional kernels, the size of the feature maps and the stride. The
smaller the convolutional kernel, the lower the model complexity; conversely, a smaller stride greatly increases the
complexity. The following formula is employed to evaluate the model complexity [12]:
\[
T\_Complexity = O\left( \sum_{l=1}^{d} n_{l-1} \cdot K_l^2 \cdot n_l \cdot \left( \left\lfloor \frac{M_{l-1} - K_l + 2 \cdot pad_l}{s_l} \right\rfloor + 1 \right)^2 \right), \quad (24)
\]
where d is the number of convolutional layers, K_l and n_l are the kernel size and the number of kernels (or outputs) of
the l-th convolutional layer, s_l is the stride, and M_l is the size of the output feature maps generated by the l-th convolutional layer; when l = 1, n_0
and M_0 are the number of channels and the size of the input images, respectively. The parameter pad_l is the border padding added to the feature
map to prevent the convolutional kernel from exceeding the region of the feature maps, and its value is usually set to 0.
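Eq. (24) can be evaluated directly from a layer specification. The sketch below implements the summation above; the tuple-based layer format (kernel size, number of outputs, stride, padding) and the example layer values are our own illustrative choices, not a configuration from the paper.

```python
def conv_time_complexity(in_channels, in_size, layers):
    """Evaluate Eq. (24): sum over layers of n_{l-1} * K_l^2 * n_l * M_l^2,
    where M_l = floor((M_{l-1} - K_l + 2*pad_l) / s_l) + 1."""
    total = 0
    n_prev, m_prev = in_channels, in_size
    for kernel, n_out, stride, pad in layers:
        # output feature map size of this layer
        m_out = (m_prev - kernel + 2 * pad) // stride + 1
        total += n_prev * kernel ** 2 * n_out * m_out ** 2
        n_prev, m_prev = n_out, m_out
    return total

# Example: two hypothetical layers on a 32x32 RGB input.
layers = [(5, 32, 1, 2),   # 5x5 kernel, 32 outputs, stride 1, pad 2
          (3, 64, 2, 1)]   # 3x3 kernel, 64 outputs, stride 2, pad 1
print(conv_time_complexity(3, 32, layers))  # -> 7176192
```

Ratios of such raw counts (normalized by the single PC-DCNN model) are what Table 9 reports as T_Complexity.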
The comparison of the complexity of different models is shown in Table 9, where it can be observed that for large images the GoogLeNet
model has 1.32 times the complexity of the single Model(a), and the 4-Scale Model(a) has about 10 times the complexity of the
single PC–DCNN model. However, even for the 4-Scale model, the complexity is less than 1/7 of that of Ex-CNN. For tiny images, our single model has lower complexity
than NIN, All-CNN and Ex-CNN, although the 4-Scale Model(b) has about 13 times the complexity of the
single PC–DCNN model.
Fig. 3. Performance comparison of the proposed MS-PC-DCNN model with different multiple scales: (a) Caltech256, (b) Caltech101, (c) CIFAR-100, (d) CIFAR-10.
5. Conclusion
Deep learning technology, especially DCNN, is developing rapidly. Novel methods and architectures constantly emerge, and the state-of-the-art results on public datasets in the
field of computer vision are regularly surpassed. In the DCNN model, an end-to-end method is employed for feature learning, avoiding the need to design
complicated algorithms for handcrafted features and special models for each application; it extracts more abstract, semantically
richer features through multiple nonlinear transformations. However, the current DCNN models are becoming
deeper and deeper, and their architectures more and more sophisticated. From the view of HVS, the PC–DCNN model
and the MS-PC-DCNN model are proposed in this work, based on the super convolutional kernel and the sub convolutional kernel.
In a sense, the proposed approach does not increase the depth of the model; instead, it increases the width of the model
by applying more than one transformation stream to an image, leading to more abstract and robust features. The
experimental results demonstrate that the proposed PC-DCNN model obtains better performance for image classification
than a number of state-of-the-art models, and the MS-PC-DCNN model further improves the performance.
However, the complexity of the proposed model is higher than that of most other deep models under the same conditions.
In future work, more effort will be put into reducing the complexity and speeding up the convergence of the model.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281,
and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No.
GZ2015005).
References
[1] Felzenszwalb P, Girshick R, McAllester D, Ramanan D. Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 2010;32(9):1627–45.
[2] Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis 2004;60(2):91–110.
[3] Waheed Z, Akram MU, Waheed A, Khan MA, Shaukat A, Ishaq M. Person identification using vascular and non-vascular retinal features. Comput Electr Eng 2016;53:359–71.
[4] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Conference on computer vision and pattern recognition. IEEE; 2006. p. 2169–78.
[5] Zhang C, Liu J, Liang C, Xue Z, Pang J. Image classification by non-negative sparse coding, correlation constrained low-rank and sparse decomposition. Comput Vis Image Und 2014;123(7):14–22.
[6] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems. NIPS; 2012. p. 1106–14.
[7] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Conference on computer vision and pattern recognition. IEEE; 2015. p. 1–9.
[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations; 2015.
[9] Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.
[10] Lin M, Chen Q, Yan S. Network in network. In: International conference on learning representations; 2014.
[11] Springenberg JT, Dosovitskiy A, Brox T. The all convolutional net. In: International conference on learning representations; 2015.
[12] He K, Sun J. Convolutional neural networks at constrained time cost. In: Conference on computer vision and pattern recognition. IEEE; 2015. p. 5353–60.
[13] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86(11):2278–324.
[14] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science 2006;313(5786):504–7.
[15] Russakovsky O, Deng J, Huang Z, Berg A, Li F-F. Detecting avocados to zucchinis: what have we done, and where are we going? In: IEEE international conference on computer vision. IEEE; 2013. p. 2064–71.
[16] Krizhevsky A. Learning multiple layers of features from tiny images. Tech Report. University of Toronto; 2009.
[17] Zeiler MD, Fergus R. Stochastic pooling for regularization of deep convolutional neural networks. In: International conference on learning representations; 2013.
[18] Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y. Maxout networks. In: International conference on machine learning; 2013. p. 1319–27.
[19] Wan L, Zeiler MD, Zhang S, LeCun Y, Fergus R. Regularization of neural networks using DropConnect. In: International conference on machine learning; 2013. p. 1058–66.
[20] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE international conference on computer vision. IEEE; 2015. p. 1026–34.
[21] Li F-F, Fergus R, Perona P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. Comput Vis Image Und 2007;106(1):59–70.
[22] Griffin G, Holub A, Perona P. Caltech-256 object category dataset. Tech Report. California Institute of Technology; 2007.
[23] Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, et al. Caffe: convolutional architecture for fast feature embedding. In: ACM international conference on multimedia. ACM; 2014. p. 675–8.
[24] Dosovitskiy A, Fischer P, Springenberg J, Riedmiller M, Brox T. Discriminative unsupervised feature learning with exemplar convolutional neural networks. In: Advances in neural information processing systems. NIPS; 2014. p. 766–74.
[25] Lee CY, Xie S, Gallagher P, Zhang Z, Tu Z. Deeply-supervised nets. In: International conference on artificial intelligence and statistics; 2015. p. 562–70.
Pengjie Tang received the M.S. degree in Computer Software and Theory from Nanchang University, China, in 2009. He is currently a Ph.D. candidate at the Department of Computer Science and Technology, Tongji University, Shanghai, China. His current research interests include computer vision and deep learning.

Hanli Wang received the M.E. degree in Electrical Engineering from Zhejiang University, Hangzhou, China, in 2004, and the Ph.D. degree in Computer Science from City University of Hong Kong, Kowloon, Hong Kong, in 2007. He is a professor at the Department of Computer Science and Technology, Tongji University, Shanghai, China. His current research interests include digital video coding, computer vision, and machine learning.