
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 26, NO. 8, AUGUST 2017

Weakly Supervised Part Proposal Segmentation From Multiple Images

Fanman Meng, Member, IEEE, Hongliang Li, Senior Member, IEEE, Qingbo Wu, Member, IEEE, Bing Luo, and King Ngi Ngan, Fellow, IEEE

Abstract— Weakly supervised local part segmentation is challenging, due to the difficulty of modeling multiple local parts from image-level priors. In this paper, we propose a new weakly supervised local part proposal segmentation method based on the observation that local parts keep fixed under object pose variations. Hence, local parts can be segmented by capturing object pose variations. Based on this observation, a new local part proposal segmentation model is proposed, which considers three aspects: shape similarity-based cosegmentation, shape matching-based part detection and segmentation, and graph matching-based part assignment. A part segmentation energy function is first proposed, containing four terms: an MRF-based single image segmentation term, a shape feature-based foreground consistency term, an NCuts-based part segmentation term, and a two-order graph matching-based part consistency term. Then, an energy minimization method consisting of three sub-minimizations is proposed to obtain an approximate solution. Finally, we verify our method on three image data sets (the PASCAL VOC 2008 Part data set, the UCB Bird data set, and the Cat-Dog data set) and one video data set (the UCF Sports data set). The experimental results demonstrate better segmentation performance compared with existing object cosegmentation and part proposal generation methods.

Index Terms— Part segmentation, cosegmentation, shape matching, graph matching, NCuts.

I. INTRODUCTION

THE existing image segmentation methods have paid much attention to object segmentation, which segments the object region from images [1]–[3], while the detailed local part regions are ignored. Note that many computer vision tasks rely on local information analysis in which part segmentation is an essential step, such as fine-grained bird image classification that uses the appearances of local parts to distinguish bird subspecies [4].

Recently, a few researchers have paid attention to local part segmentation [4]–[7], where multiple local parts and their structures, rather than the whole object, are segmented. Compared with object segmentation, it needs to handle multiple part priors and their spatial structures, and is thus a more difficult task.

Manuscript received March 8, 2016; revised March 29, 2017; accepted May 11, 2017. Date of publication May 26, 2017; date of current version June 23, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61502084, Grant 61525102, and Grant 61601102. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. David Clausi. (Corresponding author: Fanman Meng.)

F. Meng, H. Li, Q. Wu, and B. Luo are with the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]; [email protected]).

K. N. Ngan is with the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, and also with the School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu 610051, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2017.2708839

Fig. 1. An explanation of part-level segmentation using weak priors, illustrating the input and output of the proposed method.

The existing local part segmentation methods mainly focus on a supervised manner that learns each part prior from accurate training data [5]–[7]. Successful part segmentation can be achieved by careful prior learning and segmentation model design. However, pixel-level training data is generally not available in many applications, while rough priors such as image tags often are. An example is shown in Fig. 1, where image-level tags can easily be provided by the user. In such a case, the problem becomes segmenting multiple parts from images with only tag priors, which is the weakly supervised part segmentation problem.

The challenge of weakly supervised part segmentation is how to define semantic local parts from weak priors. In other words, the initial priors are so rough that it is difficult to generate part priors from them. A feasible solution is local part proposal generation, i.e., using a set of sufficient local part proposals to provide the local parts [4]. However, a useful cue for capturing part regions is still lacking. Fortunately, it is observed that local parts keep fixed among object variations. Hence, local parts can be defined as the regions that keep fixed among the variations. An example is shown in Fig. 1, where the “head” region keeps fixed among the “Cat” images and is set as a local part, while a region containing both “head” and “body” varies among the images and is treated as a local region instead. With this definition, we can obtain part proposals by shape matching.

Based on this observation, we propose a weakly supervised part proposal segmentation model. Given multiple images related to an object, with the assumption that the object is contained in each image, we aim at segmenting local part proposals from the images by measuring pose variations. Our part proposal segmentation problem is formulated as a label assignment task. An energy function is designed to measure the label assignment by four terms: the single image segmentation term, the object region consistency term, the part consistency term, and the global part structure consistency term. The first two terms enforce the foreground to be the common object. The third term constrains the consistency of parts, which is formulated by capturing shape variations, as shown in Fig. 2. The fourth term enforces the similarity of the part structure. Based on the energy, part segmentation is accomplished by energy minimization, which is solved by three sub-minimization problems: the cosegmentation problem via object proposals and shortest path searching, the part generation problem via shape context matching and NCuts, and the part label matching problem via two-order graph matching. We verify our method on both image and video data sets. The experimental results demonstrate that our method obtains better IOU values than several state-of-the-art object segmentation and part segmentation methods.

1057-7149 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 2. An example of original region pairs and their correspondence matching. (a)(b): the original region pairs. (c): the matching results. (d): the corresponding pixels on the boundary obtained by [8]. It is seen that local parts keep fixed along the shape variation, such as the “head”, whose pixels have the same shifts.

Our contributions are listed as follows.

• We propose a weakly supervised part proposal generation method that captures pose variations among objects, which can obtain better part segmentation.

• A new segmentation model is proposed that includes cosegmentation, part segmentation, and part label matching, and can segment local part proposals from multiple images.

The paper is organized as follows. We present related work in Section II. The proposed method, including the energy design, the energy minimization, and the detailed algorithm, is introduced in Section III. Section IV presents the results of our method and the comparison methods. We finally draw conclusions in Section V.

II. RELATED WORK

Image segmentation is a clustering process that clusters pixels into semantic regions. The existing segmentation methods can be classified into superpixel-level segmentation, object-level segmentation, and part-level segmentation according to the semantic level of the segmentation targets.

A. Superpixel Level Segmentation

Superpixel-level segmentation is an unsupervised manner that automatically clusters pixels into a set of locally smooth regions. The similarities among adjacent pixels are employed to guide the clustering. Typical methods are spectral clustering-based NCuts segmentation [9], Mean-shift clustering-based segmentation [10], edge clustering-based UCM superpixels [11], and K-means clustering-based SLIC [12]. Because the number of superpixels is much smaller than the number of pixels, superpixels are mainly used in place of pixels in practical applications in order to reduce the computational burden. However, since the similarities among neighboring pixels are not sufficient to provide semantic priors, superpixels are not semantic regions.

B. Object Level Segmentation

Another important research branch is object-level segmentation, which aims at extracting semantic object regions. An object prior is needed in such segmentation, and it is generally learned from two types of training data: pixel-level training data and image-level training data. The first type of data is accurate and results in good segmentation, such as the recent CNN learning-based specific object segmentation and objectness evaluation-based general object proposal generation [13]–[16]. However, it is hard to provide this type of data, since the training data is obtained by either manual drawing or interactive image segmentation [17], [18], which are very time-consuming for large sets of images.

Compared with pixel-level training data, image-level training data is easier to obtain; segmentation with such data is called weakly supervised segmentation [19], [20]. The main idea is to learn the prior from the similar regions among multiple relevant images. There are usually two steps, similar region matching and object prior learning, which are performed iteratively until convergence. In general, the two steps are formulated in a CRF segmentation framework that is usually minimized by an EM algorithm with the α-expansion algorithm. Note that object-level segmentation focuses on whole object region segmentation and ignores semantic local part segmentation.

Observing the discrimination ability of local regions and their spatial structure in object region representation, Zhang et al. [21] propose an excellent weakly supervised semantic segmentation framework that exploits the spatial structure cue from image-level labels. In the framework, local regions are first represented by a graphlet structure [22]. Then, three cues, i.e., image-level labels, global spatial layout, and geometric context, are combined in a manifold embedding to discover the discriminative spatial structure. Based on the results of the manifold embedding, normalized cut-based segmentation is finally employed to obtain the semantic regions. Zhang et al. further extend the framework by considering feature contributions and object region relationships in [23]. Patch alignment and a Bayesian network are used, and better results are obtained. The framework by Zhang et al. shows the usefulness of spatial structure in capturing local region relationships. Meanwhile, our method is different from the method by Zhang et al. The main difference is that our targets are local part proposal regions, while the targets of the methods in [21] and [23] are semantic object regions. Therefore, they are still object-level segmentation rather than part-level segmentation.

C. Part Level Segmentation

Local parts and their structures have been widely used in many high-level object detection, recognition, and understanding tasks. In part-level segmentation, two aspects need to be considered: learning the multiple part prior models, and measuring the part relationships, which makes part-level segmentation more difficult than object-level segmentation. The existing part segmentation methods focus on a supervised manner, i.e., learning the part prior models and part structure from accurate training data, and then applying them to accomplish part segmentation. For example, Luo et al. [5] propose a Deep Decompositional Network to segment the local parts of pedestrians. Three kinds of layers, i.e., occlusion estimation layers, completion layers, and decomposition layers, are carefully designed to directly obtain the label map. Wang and Yuille [6] propose a semantic part segmentation model that uses a compositional model to describe the relationships among parts. A latent SVM is employed to learn the parameters of the model, and model inference is accomplished with dynamic programming. The results show an improvement compared with object-level segmentation methods. Wang et al. [7] aim to obtain both object and part segmentation with the concept of semantic compositional parts (SCP). The segmentation is performed in a novel fully connected conditional random field model with SCP potentials learned from an FCN network. Note that these methods rely on accurate learning of the part prior models and part structures, i.e., they are fully supervised methods.

Recently, Krause et al. [4] try to automatically generate part annotations given only the bounding box of the object. Compared with the above fully supervised methods, it is a weakly supervised manner. The method first performs cosegmentation to obtain the object regions, and then aligns them to generate the parts. However, since it is difficult to determine the part regions without any part annotations, the method instead generates a diverse set of part candidates by random sampling, expecting the discriminative parts to be selected by learning in the subsequent object recognition task. Here, we instead try to generate the part proposals by capturing the pose variations.

D. Cosegmentation

Another related line of work is cosegmentation, which assumes that a common object is contained in each image, and segments the object by extracting similar regions among the images. Although many cosegmentation methods have been proposed, such as single-class cosegmentation [24], [25], multiple-class cosegmentation [26], [27], multiple-group cosegmentation [28], noise image-based cosegmentation, and RGBD cosegmentation [29], these methods focus on object region segmentation and are not part-level segmentation. Compared with cosegmentation, our method tries to obtain local part regions, which is a more detailed and difficult task.

III. THE PROPOSED METHOD

In this section, we introduce our method by first illustrating the label assignment-based problem formulation. Then, our energy function consisting of four terms is introduced. Finally, the model minimization is presented based on three sub-minimization problems: cosegmentation, part generation, and part assignment.

A. The Problem Formulation

Given n images I = {I_1, · · · , I_n}, a common object is contained in each image I_i. We denote the object regions as S = {S_1, · · · , S_n}. Each object region S_i consists of N local parts, i.e., S_i = {P_{i1}, · · · , P_{iN}}. Based on the assumption that the objects and their parts are similar among the images, our task is to obtain the object part set S from the multiple images according to their similarity consistency and shape variations. For simplicity, we assume the images have the same size with the same number of pixels m.

We formulate the part-level segmentation problem as a label assignment problem. Specifically, denoting C = {C_0, C_1, · · · , C_N} as the background label and the part labels, every pixel p_{ik} in each image I_i is assigned a label l_{ik} ∈ C that represents its part class. By denoting L_i = {l_{i1}, · · · , l_{im}} as the label set for I_i, and L = {L_1, · · · , L_n} as the label set of all images, we formulate the part segmentation problem as searching for the L* ∈ Ω_L that best fits the semantic part segmentation, which can be represented as

$L^* = \arg\min_{L \in \Omega_L} E(L)$   (1)

where Ω_L is the domain of L, and E(L) is the energy function measuring the fitness between L and the image regions. A small energy indicates a good label assignment.

There are two challenges in modeling (1): the design of the energy E(L), and the energy minimization. On the one hand, the energy E, designed from weak image-level tags, should efficiently evaluate the fitness between L and the semantic parts. On the other hand, an easy minimization should be deducible from the energy to obtain a global or approximate solution. Here, our energy is designed from shape matching with four terms: single image segmentation, multiple image cosegmentation, part consistency, and part structure consistency. The energy minimization is accomplished by three sub-optimization problems: cosegmentation minimization, NCuts minimization, and graph matching minimization. We next detail our energy design and the corresponding energy minimization, respectively.
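To make the overall optimization flow concrete, the following is a minimal Python sketch of the three-step scheme outlined above. The three sub-solver functions are illustrative placeholders (their names and signatures are our assumptions, not the authors' implementation); the later subsections describe what each step actually does.

def solve_cosegmentation(images):
    """Placeholder: return one common-object foreground mask per image."""
    return [None for _ in images]

def solve_part_generation(images, fg_masks, num_parts):
    """Placeholder: NCuts on the averaged shift maps, one part label map per image."""
    return [None for _ in images]

def solve_part_matching(part_maps):
    """Placeholder: two-order graph matching making part labels consistent across images."""
    return part_maps

def weakly_supervised_part_proposals(images, num_parts):
    # Step 1: cosegmentation, minimizing the E^s + E^c terms.
    fg_masks = solve_cosegmentation(images)
    # Step 2: part generation, minimizing the pairwise (NCuts) part of E^p.
    part_maps = solve_part_generation(images, fg_masks, num_parts)
    # Step 3: part label matching, handling the unary part of E^p together with E^h.
    part_maps = solve_part_matching(part_maps)
    return fg_masks, part_maps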


Fig. 3. An illustration of our energy design. There are four terms, i.e., single image segmentation E^s, multiple image cosegmentation E^c, part consistency E^p, and part structure consistency E^h, respectively.

B. The Energy Function Design

Our energy function is composed of four terms, which can be represented as:

$E = E^s + E^c + E^p + E^h$   (2)

where E^s is the segmentation evaluation in each image, which makes the foreground different from the background. E^c is the cosegmentation evaluation, which measures the similarity among foregrounds and constrains the object regions to be similar. The first two terms can be regarded as cosegmentation terms. E^p is the part consistency evaluation among images, which enforces the similarity of parts. E^h is the part structure consistency, which measures the consistency of the part structure. Fig. 3 displays an example to illustrate our energy design. Note that the last two terms are related to parts, and we name them the part terms.

1) E^s: We design E^s based on the pixel consistency of the foreground and background, which is the classical energy of single image segmentation models. Here, the Markov Random Field (MRF) segmentation model is employed for E^s. Given an image I_k and its label set L_k, the background pixels and foreground pixels are denoted as B_k = {p_{ki} | l_{ki} = C_0} and F_k = {p_{ki} | l_{ki} ≠ C_0}, where l_{ki} is the label of pixel p_{ki}. Then, we evaluate the label set L_k by a data term and a pairwise term, which is represented by

$E^s_k = \sum_{p_{ki} \in \Omega_k} \left[ P(p_{ki}|\theta_{kF})\,\delta(l_{ki} \neq C_0) + P(p_{ki}|\theta_{kB})\,\delta(l_{ki} = C_0) \right] + \sum_{(p_{ki},p_{kj}) \in N} w_{sk}(p_{ki}, p_{kj})\,\delta(l_{ki} \neq l_{kj})$   (3)

where E^s_k is the energy for image I_k, Ω_k is the pixel domain of image I_k, and θ_{kF} and θ_{kB} are the parameters of the foreground and background models, which are learned from F_k and B_k by Gaussian Mixture Models. P(p_{ki}|θ_{kF}) is the probability of pixel p_{ki} under the foreground model. A large value indicates a good consistency between the pixel and the region. δ(·) = 1 if · is true; otherwise, it is zero. The first term is also known as the data term. w_{sk} is the similarity matrix of pixels within the single image I_k, and N is the 3 × 3 neighborhood relationship. The second term (the pairwise term) penalizes label changes between neighboring pixels unless there are very large color variations.

Based on (3), we define E^s by aggregating E^s_k over all images, which can be represented as

$E^s = \sum_{k=1}^{n} \exp(-E^s_k)$   (4)
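As a rough illustration of the single-image term, the following sketch evaluates (3) for one image under simplifying assumptions: GMM color models stand in for θ_{kF} and θ_{kB}, and a contrast-sensitive weight over a 4-neighbour grid stands in for w_{sk} (the paper uses a 3 × 3 neighborhood); the exact parameter settings are not specified in the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def single_image_energy(image, labels, beta=2.0):
    """Sketch of the per-image MRF term E^s_k in Eq. (3).

    image  : (H, W, 3) float array of pixel colors.
    labels : (H, W) int array, 0 = background (C_0), >0 = part labels.
    """
    colors = image.reshape(-1, 3)
    fg = labels.reshape(-1) > 0
    # Foreground / background color models learned from the current labeling.
    gmm_f = GaussianMixture(n_components=5).fit(colors[fg])
    gmm_b = GaussianMixture(n_components=5).fit(colors[~fg])
    p_f = np.exp(gmm_f.score_samples(colors))   # P(p | theta_kF)
    p_b = np.exp(gmm_b.score_samples(colors))   # P(p | theta_kB)
    # Data term exactly as written in Eq. (3).
    data = np.where(fg, p_f, p_b).sum()

    # Contrast-sensitive pairwise term: penalize label changes between
    # neighbours unless the color difference is large.
    def contrast_cut(l1, l2, c1, c2):
        w = np.exp(-beta * np.sum((c1 - c2) ** 2, axis=-1))
        return (w * (l1 != l2)).sum()

    pair = contrast_cut(labels[:-1, :], labels[1:, :], image[:-1, :], image[1:, :])
    pair += contrast_cut(labels[:, :-1], labels[:, 1:], image[:, :-1], image[:, 1:])
    return data + pair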

2) E^c: We use E^c to enforce similarity of the foregrounds among images, which is the global term in cosegmentation. Given a pair of images I_k and I_r with labels L_k and L_r, we first obtain the foreground regions F_k and F_r from the images via the labels. Then, we define the foreground similarity of I_k and I_r as

$E^c_{kr} = d(f(F_k), f(F_r))$   (5)

where f is the feature extraction function of a region and d is the Euclidean distance between the features. Similar features yield a small value of E^c_{kr}. Here, we use the shape context feature [8] as f in order to capture mid-level features. It is seen that E^c_{kr} forces the foregrounds to be similar in shape.

Considering the multiple images, we define E^c by summing E^c_{kr} over all image pairs, which is represented as

$E^c = \sum_{k=1}^{n} \sum_{r=1}^{n} E^c_{kr} = \sum_{k=1}^{n} \sum_{r=1}^{n} d(f(F_k), f(F_r))$   (6)
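The cosegmentation term reduces to summed pairwise feature distances; the sketch below shows this bookkeeping. The radial_profile descriptor is only an illustrative stand-in for the shape context of [8], which the paper actually uses.

import numpy as np

def cosegmentation_term(fg_masks, shape_feature):
    """Sketch of E^c in Eq. (6): summed pairwise Euclidean feature distances.

    fg_masks      : list of (H, W) boolean foreground masks F_k.
    shape_feature : callable mapping a mask to a fixed-length 1-D descriptor.
    """
    feats = [shape_feature(m) for m in fg_masks]
    energy = 0.0
    for fk in feats:
        for fr in feats:
            energy += np.linalg.norm(fk - fr)   # d(f(F_k), f(F_r))
    return energy

def radial_profile(mask, bins=32):
    """Illustrative stand-in for f: normalized histogram of foreground-pixel
    distances from the mask centroid (NOT the shape context of [8])."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    r = np.hypot(ys - cy, xs - cx)
    hist, _ = np.histogram(r / (r.max() + 1e-9), bins=bins, range=(0, 1))
    return hist / max(hist.sum(), 1)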

3) E^p: E^p evaluates the part consistency, which is defined under the assumption that there is a pixel-level matching between each image pair (I_k, I_r). A matching example is shown in Fig. 4, where the match is obtained by shape variation matching, and each pixel in I_k (the first image) has a matched pixel in I_r (the second image). Based on the match, the label energy E^p_{kr} for I_k and I_r is defined by

$E^p_{kr} = -\sum_{p_i \in \Omega_k} \delta(l_i = \tilde{l}_i) + \sum_{(p_i,p_j) \in N_k} Ncut\big(d(\Delta(p_i), \Delta(p_j))\big)\,\delta(l_i \neq l_j)$   (7)

where l_i ∈ L_k is the label of p_i ∈ I_k, p̃_i ∈ I_r is the matched pixel of p_i, and l̃_i ∈ L_r is its label. Δ(p_i) is the shift vector of p_i, defined as v(p_i) − v(p̃_i), where v(p) = (x_p, y_p) is the position vector of p. It describes the shift of pixel p_i, as shown in Fig. 4(c). By considering the shift vectors of all pixels in I_k, we obtain a shift map M_s, where the value of each pixel is its shift vector, i.e., M_s(p_i) = Δ(p_i). Fig. 5 displays some shift maps, where five original images and their shift maps are shown. It is seen that the shift map distinguishes the local parts in these images, which supports the subsequent part proposal generation.

Fig. 4. An example of region matching. (a)(b): Two images containing regions of the object “Cow”. Each pixel in the first object region has a matched pixel in the second object region, as indicated by the lines. Based on the matching, pixels within a part region have the same matching shifts, while pixels of different part regions have very different shifts, as shown in (c).

Fig. 5. Some examples of the shift maps obtained by our method. The shift value is represented by color. It is seen that the colors of the parts are different, which demonstrates the effectiveness of our method.

In the last term of (7), d(Δ(p_i), Δ(p_j)) is the distance between the shift vectors, which describes the matching shift consistency between the image pair. A large value indicates a large change of shift, and corresponds to the border of part regions. Ncut(d(Δ(p_i), Δ(p_j))) is the cutting evaluation among different label regions δ(l_i ≠ l_j), which prefers to segment local parts along large variations of d(Δ(p_i), Δ(p_j)). We formulate it as the normalized cut defined in [9]. Because a part always keeps fixed among images, pixels within one part have similar Δ, while pixels of different parts have large differences of Δ due to the pose variation. Hence, the second term aims at dividing the object along pixels with large changes of Δ.

In (7), the first term is the number of matched pixels with the same labels; a larger value indicates a better part consistency. The second term is the cost of labeling neighboring pixels. It enforces the labels to be consistent among neighboring pixels; when a label change occurs, it should coincide with large matching shifts, i.e., the pixels should come from different parts. The two terms in (7) are similar to the data term and pairwise term in MRF segmentation. However, the first term is based on the part matching, which is a global cue, and the second term is based on the matching shift, which is different from the pixel color variations. Hence, the regions can be segmented into several segments even if the pixels have the same colors.

Fig. 6. An example illustrating the fourth term E^h of part structure consistency. g is the part relationship given by the spatial distance, and d is the part structure consistency evaluation, which is based on the relationships of all part pairs.
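Given the pixel correspondences produced by the shape matching step, the shift vectors and the two-channel shift map M_s follow directly, as in the sketch below; the correspondence itself is taken as an input here, since the paper obtains it from shape context matching [8].

import numpy as np

def shift_map(coords_k, coords_r, matches):
    """Sketch of the shift vectors Delta(p) = v(p) - v(p~) used in E^p.

    coords_k : (M, 2) array of (x, y) positions of the matched pixels in I_k.
    coords_r : (M', 2) array of (x, y) positions in I_r.
    matches  : (M,) int array; matches[i] is the index in coords_r of the pixel
               matched to coords_k[i] (assumed given by the shape matching).
    Returns the raw shift vectors and the two-channel (length, angle) map M_s.
    """
    delta = coords_k - coords_r[matches]            # Delta(p_i) = v(p_i) - v(p~_i)
    length = np.linalg.norm(delta, axis=1)
    angle = np.arctan2(delta[:, 1], delta[:, 0])
    return delta, np.stack([length, angle], axis=1)

Pixels of the same rigid part share nearly identical shift vectors, while the shift jumps across part boundaries; this is exactly the signal the NCuts step later exploits.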

Based on (7), we set the energy function E^p by considering all image pairs, i.e.,

$E^p = \sum_{k \in \Omega} \sum_{r \in \Omega} \Big[ -\sum_{p_i \in \Omega_k} \delta(l_i = \tilde{l}_i) + \sum_{(p_i,p_j) \in N} Ncut\big(d(\Delta^r(p_i), \Delta^r(p_j))\big)\,\delta(l_i \neq l_j) \Big]$   (8)

where Δ^r(p_i) is the matching shift vector of p_i ∈ I_k based on image I_r.

4) E^h: E^h evaluates the part structure consistency. We consider two aspects: the part spatial relationship, and the consistency of that relationship. Given an image pair (I_k, I_r) with labels (L_k, L_r), E^h_{kr} is defined as

$E^h_{kr} = \sum_{i=1}^{N} \sum_{j=1}^{N} d\big(g(S^k_i, S^k_j), g(S^r_i, S^r_j)\big)$   (9)

where S^k_i is the region of the i-th part label in image I_k, g(S^k_i, S^k_j) is the spatial relationship between the i-th and j-th regions in image I_k, represented by the function g, and d(g(S^k_i, S^k_j), g(S^r_i, S^r_j)) is the distance between a pair of relationships, which measures the consistency of the part pair matching. An example is shown in Fig. 6. The final structure consistency is the sum over all part pairs.


By considering all image pairs, E^h is defined as

$E^h = \sum_{k=1}^{n} \sum_{r=1}^{n} \Big[ \sum_{i=1}^{N} \sum_{j=1}^{N} d\big(g(S^k_i, S^k_j), g(S^r_i, S^r_j)\big) \Big]$   (10)

It is seen that (9) measures the consistency of part spatial relationships (“Head”–“Body” to “Head”–“Body”), which is a second-order matching. Note that the first term in (7) measures the part similarity (“Head” to “Head”), which is a first-order matching. Hence, by combining the two terms, we obtain a two-order graph matching problem. Only label assignments with similar features within the same label region and the same spatial structure, in terms of this high-order graph matching, lead to small values.

By introducing the terms (4), (6), (8), and (10) into (2), we obtain the final energy function. We next introduce the energy minimization.
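The structure term only requires a pairwise spatial relationship g and a distance between relationships. The sketch below instantiates g as the centroid offset between parts, which is one plausible choice; the paper does not commit to a specific g, so this is an assumption.

import numpy as np

def part_centroids(label_map, num_parts):
    """Centroid of each part label (1..num_parts) in one image; NaN if absent."""
    cents = np.full((num_parts, 2), np.nan)
    for i in range(1, num_parts + 1):
        ys, xs = np.nonzero(label_map == i)
        if len(ys):
            cents[i - 1] = (xs.mean(), ys.mean())
    return cents

def structure_term(label_maps, num_parts):
    """Sketch of E^h in Eq. (10), with g(S_i, S_j) taken as the centroid offset."""
    rel = []
    for lm in label_maps:
        c = part_centroids(lm, num_parts)
        rel.append(c[:, None, :] - c[None, :, :])     # g(S_i, S_j) for all pairs
    energy = 0.0
    for gk in rel:
        for gr in rel:
            d = np.linalg.norm(gk - gr, axis=-1)       # d(g^k(i,j), g^r(i,j))
            energy += np.nansum(d)                     # missing parts are skipped
    return energy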

C. The Energy Function Minimization

Because our energy contains nonlinear terms such as E^c, and is also formed by many sums over multiple images, it is difficult to minimize globally. Instead, we pursue an approximate solution by dividing the original problem into three sub-minimization problems, i.e., the cosegmentation problem, the part generation problem, and the region matching problem.

1) Cosegmentation Problem: We combine the first two terms E^s and E^c to form the cosegmentation problem, which can be represented as:

$E^C = E^s + E^c = \sum_{i \in \Omega} E^s_i + \sum_{(i,j) \in \Omega \times \Omega} d(f(F_i), f(F_j))$   (11)

This is a classical cosegmentation problem, but with a difficult shape similarity constraint. Here, we use the strategy in [30], [31] for the minimization. The main idea is to first consider all segments satisfying the first term E^s via object proposals, and then select the regions that best satisfy E^c as the final results. Minimizing the second term can be solved by using a fully connected graph to represent the relationships of the proposals, and then performing belief propagation on the graph to score the common regions, as in [31]. Considering the computational cost of the fully connected graph, we only construct the graph over neighboring images, and achieve the common object segmentation by dynamic programming [30]. In our method, the object proposals are generated by [32]. Since it provides bounding boxes rather than regions, we apply GrabCut to the bounding boxes to obtain region proposals. In the graph construction, the shape feature in [8] is used for the edge weight calculation.
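The chain-graph selection over neighboring images can be written as a standard shortest-path dynamic program, sketched below. The per-proposal unary costs and pairwise shape distances are assumed to be precomputed (e.g., from the region proposals and the shape feature of [8]); this is a simplified rendering of the strategy in [30], not the released implementation.

import numpy as np

def chain_cosegmentation(unary, pairwise):
    """Pick one proposal per image so that the summed unary cost plus shape
    distances between neighbouring images is minimal (shortest path on a chain).

    unary    : list of length n; unary[k] is a (P_k,) array of per-proposal costs.
    pairwise : list of length n-1; pairwise[k] is a (P_k, P_{k+1}) array of
               shape-feature distances between proposals of images k and k+1.
    Returns the index of the selected proposal in each image.
    """
    n = len(unary)
    cost = [unary[0].astype(float)]
    back = []
    for k in range(1, n):
        # total[i, j]: best cost ending at proposal j of image k via proposal i of image k-1.
        total = cost[-1][:, None] + pairwise[k - 1] + unary[k][None, :]
        back.append(np.argmin(total, axis=0))
        cost.append(np.min(total, axis=0))
    # Backtrack the cheapest chain.
    pick = [int(np.argmin(cost[-1]))]
    for k in range(n - 2, -1, -1):
        pick.append(int(back[k][pick[-1]]))
    return pick[::-1]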

2) Part Generation Problem: We treat the second term of E^p as the part generation sub-problem, which is represented as

$E^P = \sum_{k \in \Omega} \sum_{r \in \Omega} \Big[ \sum_{(p_i,p_j) \in N_k} Ncut\big(d(\Delta^r(p_i), \Delta^r(p_j))\big)\,\delta(l_i \neq l_j) \Big]$   (12)

$\;\;\;\;\;\, = \sum_{k \in \Omega} \Big[ \sum_{(p_i,p_j) \in N_k} \sum_{r \in \Omega} Ncut\big(d(\Delta^r(p_i), \Delta^r(p_j))\big)\,\delta(l_i \neq l_j) \Big]$   (13)

To solve this problem, we divide it into sub-problems, one per image, and form each sub-problem as

$l^* = \arg\min_{l} \Big[ \sum_{(p_i,p_j) \in N_k} Ncut\big(d(\Delta(p_i), \Delta(p_j))\big)\,\delta(l_i \neq l_j) \Big]$   (14)

where $\Delta(p_i) = \frac{1}{n}\sum_{r=1}^{n} \Delta^r(p_i)$ is the average shift over all images. In other words, because there are multiple images, and each image yields a shift map, we average these shift maps to form the final one. It is seen that the sub-problem in (14) is the classical normalized cut segmentation problem with the part number N, which can be solved by spectral techniques. Based on (14), we minimize (12) by solving the sub-problems in (14) one by one.

Algorithm 1: Weakly Supervised Local Part Segmentation
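A minimal rendering of the per-image sub-problem (14) is shown below, using spectral clustering on a precomputed affinity built from the averaged shift vectors as an approximation of NCuts [9]. Running it on superpixel centers rather than raw pixels keeps the affinity matrix small; the affinity form and the spatial radius are our assumptions.

import numpy as np
from sklearn.cluster import SpectralClustering

def part_generation(shift_vectors, positions, num_parts, sigma=1.0, radius=30.0):
    """Sketch of the NCuts sub-problem in Eq. (14).

    shift_vectors : (M, 2) averaged shift vectors Delta(p_i) of the M foreground
                    sites (e.g., superpixel centers within the cosegmented region).
    positions     : (M, 2) site coordinates, used to keep the graph local.
    num_parts     : the part number N.
    """
    d_shift = np.linalg.norm(shift_vectors[:, None, :] - shift_vectors[None, :, :], axis=-1)
    d_pos = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    # Sites with similar shifts attract each other; only nearby sites are connected.
    affinity = np.exp(-(d_shift ** 2) / (2 * sigma ** 2)) * (d_pos < radius)
    labels = SpectralClustering(n_clusters=num_parts,
                                affinity="precomputed").fit_predict(affinity)
    return labels  # one part label per foreground site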


Fig. 7. Our part proposal segmentation results on the PASCAL 2008 part dataset with Ns = 4 and various numbers of local parts N. The original images are shown in the first row. The segmentation results with N = 3 to 8 are displayed from the second row to the seventh row, respectively. The bottom row shows the results for N = 1, which are the cosegmentation results.

3) Local Part Matching: The third step combines the first term of (8) and (10), forming a two-order graph matching problem, which is represented as

$l^* = \arg\min_{l} \sum_{k \in \Omega} \sum_{r \in \Omega} \Big[ -\sum_{p_i \in \Omega_k} \delta(l_i = \tilde{l}_i) \Big] + \sum_{k=1}^{n} \sum_{r=1}^{n} \Big[ \sum_{i=1}^{N} \sum_{j=1}^{N} d\big(g(S^k_i, S^k_j), g(S^r_i, S^r_j)\big) \Big]$   (15)

which is equivalent to the problem

$l^* = \arg\max_{l} \sum_{(k,r)} \frac{\sum_{(i,j)} M^{kr}_1(i,j)}{\sum_{(i,j)} \big| M^k_2(i,j) - M^r_2(i,j) \big|}$   (16)

where M^{kr}_1(i,j) is the matching score between the labels l_i and l_j, based on the matching of the image pair (I_k, I_r), and is defined as

$M^{kr}_1(i,j) = \frac{\sum_{p \in l^k_i} \delta(p' \in l^r_j)}{N_{l^k_i}}$   (17)

where l^k_i is the set of pixels of I_k in the region with label l_i, p' ∈ I_r is the matched pixel of p ∈ l^k_i, and N_{l^k_i} is the number of pixels in l^k_i.

Meanwhile, M^k_2(i,j) is the spatial distance between the region pair (S_i, S_j) in image I_k, and |M^k_2(i,j) − M^r_2(i,j)| evaluates the spatial relationship consistency; small values indicate good consistency. It is seen from (16) that region pairs with a good label match have many matching pixels, leading to a large value of M^{kr}_1. In addition, consistently labeled part regions have similar M_2, resulting in a small denominator. Hence, the best region labeling attains the largest value of the fraction. In this paper, we use gradient descent to optimize (16) with grid-based initial value settings. Note that when the number of parts is small, a traversal (exhaustive) method can also be used to search for the global solution quickly.
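For small part numbers, the traversal search mentioned above can be written as an exhaustive scan over label permutations. The sketch below scores a candidate permutation with the first-order term M_1 of (17) only; the structure term M_2 is omitted for brevity, so this is a partial illustration of (16) rather than the full objective, and the matches input is assumed to come from the shape matching step.

import itertools
import numpy as np

def match_score(labels_k, labels_r, matches, perm, num_parts):
    """First-order score for one image pair (I_k, I_r) under a candidate permutation.

    labels_k, labels_r : (M,) and (M',) part labels (1..num_parts) of the
                         foreground sites of the two images.
    matches            : (M,) index into labels_r of the site matched to each
                         site of I_k.
    perm               : tuple; perm[j-1] is the label of I_r assigned to part j of I_k.
    """
    score = 0.0
    for j in range(1, num_parts + 1):
        in_j = labels_k == j
        if in_j.sum() == 0:
            continue
        # M1(j, perm[j]): fraction of part-j pixels whose match lands in the
        # corresponding part of I_r, as in Eq. (17).
        score += np.mean(labels_r[matches[in_j]] == perm[j - 1])
    return score

def best_label_permutation(labels_k, labels_r, matches, num_parts):
    """Exhaustive search over label permutations (feasible only for small N)."""
    best, best_perm = -np.inf, None
    for perm in itertools.permutations(range(1, num_parts + 1)):
        s = match_score(labels_k, labels_r, matches, perm, num_parts)
        if s > best:
            best, best_perm = s, perm
    return best_perm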

After performing the three sub-minimizations in turn, we finally obtain the approximate solution of our model. Algorithm 1 summarizes the whole process.

IV. EXPERIMENTAL RESULTS

In this section, our method is verified by subjective and objective results. A part dataset constructed from three image datasets (the PASCAL 2010 part dataset, the Caltech-UCSD Birds dataset, and the Cat-Dog dataset) and one video dataset (the UCF Sports Actions dataset) is used for the verification.


Fig. 8. Our part proposal segmentation results on the Caltech-UCSD Birds dataset. Each row is organized as in Fig. 7.

A. Implementation Details

The shift map M_s is composed of vectors, which are represented by two channels: length and angle. Because the shift vectors change gradually under shape context matching, the value differences between neighboring pixels are small, which leads to unsuccessful segmentation. Hence, we refine the two channels by replacing each value with its class center, where the centers are obtained by the k-means algorithm. Denoting the number of k-means classes as N_s, we adjust N_s to obtain multiple layers of proposals. Meanwhile, the number of parts N is also adjusted to obtain proposals. Hence, our model adjusts two parameters, N and N_s, for proposal generation. In this paper, we empirically set N_s = 4 and N_s = N, with N ∈ [3, 8], in consideration of the computational cost.
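The channel refinement amounts to k-means quantization of the length and angle channels, as sketched below (the k-means settings beyond N_s, and any pre-scaling of the channels, are illustrative defaults rather than values given in the paper).

import numpy as np
from sklearn.cluster import KMeans

def quantize_shift_map(length, angle, num_clusters):
    """Replace the length and angle channels of M_s by their k-means class
    centers so that pixels of different parts are clearly separated.

    length, angle : (H, W) float arrays, the two channels of M_s.
    num_clusters  : the parameter N_s (set to 4 or to N in the paper).
    """
    quantized = []
    for channel in (length, angle):
        values = channel.reshape(-1, 1)
        km = KMeans(n_clusters=num_clusters, n_init=10).fit(values)
        centers = km.cluster_centers_.ravel()
        quantized.append(centers[km.labels_].reshape(channel.shape))
    return quantized[0], quantized[1]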

B. Subjective Results

Some part proposal generation results are shown in Figs. 7, 8, 9, and 10 for the four datasets. These results are obtained with N_s = 4. The original images are shown in the first row. The segmentation results with N = 3 to 8 are displayed from the second row to the seventh row, respectively. The bottom row shows the results for N = 1, which are the cosegmentation results. We can see from the results with N = 1 that common objects with similar shapes can be segmented from these images, such as the “Cow” in Fig. 7 and the “Girl” in Fig. 10. This indicates that our cosegmentation supports the subsequent part segmentation.

Furthermore, looking at each row, it is seen that the part proposals are matched well, although there may be noisy regions caused by the cosegmentation. For example, in Fig. 7, each part of the “Cow” is matched in the row of N = 6, and the matches keep the spatial consistency, such as the relationship between “Head” and “Leg”. These results indicate the effectiveness of our spatial consistency constraint E^h.

The results also show that a local part can be represented by one of the segmentation layers. For example, the “Head” regions are extracted in the layers of N = 6 and N = 5 in Fig. 8 and Fig. 9, respectively. Meanwhile, the “Tails” are segmented in the layers of N = 8 for both sets of images. This indicates that shape variation can provide part regions.

It is also seen that there are failure cases in these segmentation results, such as the sixth “Cat” image in Fig. 9 and the last segmentation results in Fig. 10. We also display more failed segmentation results in Fig. 11, where three images and their segmentations are displayed. The unsuccessful segments are mainly caused by the initial object segmentation: when the object segmentation is inaccurate, wrong matching is performed, which leads to failed segmentation. Note that when most of the object regions are successfully extracted, good segmentation can still be obtained, since the failed shape matchings can be corrected by the successful segmentations.

C. Objective Results

We next verify our method based on objective values. The IOU value is used in our experiments, defined as $\frac{|F \cap G|}{|F \cup G|}$, where F and G are the regions of the segment and the groundtruth, respectively. A large IOU value indicates a good segmentation. Since there are multiple images and parts, we evaluate the results based on the regions of each part. Given a part, we have n groundtruth regions. We then select one label region, calculate its average IOU value against the groundtruth regions, and use the largest such value over all labels as the IOU value of this part. The final objective value is the average over all parts. Note that although the segmentation results are single-label regions, the IOU value can still be obtained for the evaluation. Hence, we can compare our method with both part-level and region-level methods.

Fig. 9. Our part proposal segmentation results on the Cat-Dog dataset. Each row is organized as in Fig. 7.
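The part-level evaluation protocol described above can be written compactly as follows; this is a sketch of the protocol as we read it, not the authors' evaluation script.

import numpy as np

def iou(a, b):
    """IOU of two boolean masks: |A ∩ B| / |A ∪ B|."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def part_level_iou(pred_maps, gt_maps, num_labels, num_parts):
    """For every ground-truth part, take the predicted label whose average IOU
    over the n images is highest, then average these best values over all parts.

    pred_maps : list of (H, W) int arrays with predicted labels 1..num_labels.
    gt_maps   : list of (H, W) int arrays with ground-truth parts 1..num_parts.
    """
    scores = []
    for part in range(1, num_parts + 1):
        best = 0.0
        for label in range(1, num_labels + 1):
            vals = [iou(p == label, g == part) for p, g in zip(pred_maps, gt_maps)]
            best = max(best, float(np.mean(vals)))
        scores.append(best)
    return float(np.mean(scores))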

We show our objective results in the last row of Table I, where the IOU values for each class are shown and the average IOU values are displayed in the last column. It is seen that the average IOU value is 0.186, which is low. This is because part segmentation is a very challenging task that requires obtaining consistent part regions across images. It is also seen that the value on the Pascal 2010 dataset is lower than on the other datasets, which is caused by the large object variations among images and the complicated backgrounds.

The results with the settings N_s = 4 and N_s = N are also displayed for comparison. It is seen that the proposal results are affected by the setting of N_s. The average IOU value for N_s = N is 0.179, which is better than the 0.168 obtained with N_s = 4, while their combination obtains the best average IOU value of 0.186.

We next display the average IOU values with respect to N in Fig. 12, where the x-axis is the part number and the y-axis is the average IOU value. It is seen from the curve that the average IOU value becomes larger as N increases. This is because there are some small part regions, such as “Ear” and “Eye”, and setting a large part number yields more detailed regions that cover these small parts, leading to larger average IOU values.

We also compare our method with four existing methods, including weak object-level segmentation [30], weakly supervised part segmentation [4], interactive cosegmentation [33], and interactive binary region segmentation [34]. The codes publicly released by the authors of these methods are used. The method in [30] is a cosegmentation method that extracts common regions with similar shapes from multiple images. The method in [4] is a recent part segmentation method that extracts part regions based on average sampling. In the implementation of [4], we replace the CNN feature with the shape feature for simplification, since the authors indicate the robustness of the method to feature selection. Furthermore, since the results of [4] are windows instead of regions, we use the windows of the segment and the groundtruth when calculating the IOU value. The method in [33] designs a cosegmentation model with three cues: user interaction, local smoothness, and foreground consistency. The model is converted to a constrained quadratic programming problem with a simple iterative solution. With a few user interactions, better segmentation results are obtained. The method in [34] is a user interaction-based binary segmentation model. Three constraints, i.e., shape convexity, user-defined hard constraints, and other standard constraints, are considered. The model is efficiently minimized by a trust region approach. Meanwhile, we observe that such a model can segment multiple part regions by separately scribbling foreground and background seeds for each part, thanks to the shape convexity constraint. Hence, we simply change the interaction-based binary segmentation [34] into an interaction-based part segmentation by running the method in [34] for each part separately, and name it [34]+Part.

Fig. 10. Our part proposal segmentation results on the UCF Sports Actions dataset. Each row is organized as in Fig. 7.

TABLE I: THE OBJECTIVE VALUES OF OUR METHODS, AND THE COMPARISON RESULTS


Fig. 11. Original images and the failed segmentation results, which are caused by unsuccessful object region extraction.

Fig. 12. The average IOU values when varying the part number N. It is seen that the average IOU values increase with N.

The objective results of the comparison methods are shown in Table I. It is seen that the average IOU value of [34]+Part (0.633) is clearly larger than that of our method (0.186) and the other comparison methods, owing to its part-level segmentation with user interaction. Meanwhile, our method outperforms the object-level segmentation methods [33] (0.080) and [30] (0.076), since cosegmentation obtains object-level rather than part-level regions. Furthermore, the result of [4] is 0.176, which is better than our result with N_s = 4, while our final result (0.186) is larger than that of [4], which demonstrates the usefulness of shape variation in part proposal segmentation.

D. Discussions

Our method is a weakly supervised part proposal generation method, which aims at segmenting part regions from a set of images with image-level labels. This is a new research topic in weakly supervised segmentation. Although a few weakly supervised part segmentation methods [4] have been proposed recently, the basic problem of how to efficiently define a part region is still underdeveloped. Compared with the existing weakly supervised part segmentation methods, we design the part-level segmentation model from the new viewpoint of pose variation, which is a different and efficient cue for discovering part regions. Furthermore, our method is different from our previous cosegmentation work [30], which is used within our model. The difference is that the work in [30] aims at extracting common objects from multiple images; it first describes the relationships of the regions by a digraph, and then formulates cosegmentation as a shortest path problem. Our method, in contrast, aims at segmenting more detailed part regions, which is a more difficult problem. In addition, our previous work [30] is used here to simplify the minimization of the cosegmentation term of our model; other cosegmentation minimizations, such as the belief propagation-based method [31], could also be used in place of [30].

In our model, as is usual in common object segmentation [1], [29], [35], we assume that similar objects share a similar feature f, such as the mid-level shape feature used in our method. Successful segmentation of similar objects can be guaranteed when the feature f describes the object similarity well. However, when the objects differ greatly under f, the foreground consistency term E^c in (5) cannot correctly evaluate the similarities of the regions, and wrong segmentations are obtained. Note that such wrong segmentations can be avoided by using a more effective feature. In the future, we will study more robust and adaptive feature descriptions, such as deep learning-based feature extraction, to further improve our model.

V. CONCLUSIONS

In this paper, a weakly supervised part region segmentation method is proposed. Object pose variations are captured to obtain the fixed regions, which are then used to generate the local parts. Four aspects, i.e., shape feature-based cosegmentation, shape matching-based variation capture, NCuts-based part proposal generation, and second-order graph matching-based part label matching, are considered. The four aspects are combined to form our energy function, which is minimized by three sub-minimization steps, including cosegmentation, NCuts segmentation, and label matching. Our method is verified on three image datasets and one video dataset. The experimental results demonstrate the effectiveness of the proposed method.

REFERENCES

[1] C. Rother, V. Kolmogorov, T. Minka, and A. Blake, “Cosegmentation of image pairs by histogram matching—Incorporating a global constraint into MRFs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., New York, NY, USA, Jun. 2006, pp. 993–1000.
[2] H. Li, F. Meng, Q. Wu, and B. Luo, “Unsupervised multiclass region cosegmentation via ensemble clustering and energy minimization,” IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 5, pp. 789–801, May 2014.
[3] Y. Fang, Z. Chen, W. Lin, and C.-W. Lin, “Saliency detection in the compressed domain for adaptive image retargeting,” IEEE Trans. Image Process., vol. 21, no. 9, pp. 3888–3901, Sep. 2012.
[4] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 5546–5555.
[5] P. Luo, X. Wang, and X. Tang, “Pedestrian parsing via deep decompositional network,” in Proc. Int. Conf. Comput. Vis., Dec. 2013, pp. 2648–2655.
[6] J. Wang and A. L. Yuille, “Semantic part segmentation using compositional model combining shape and appearance,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1788–1795.
[7] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, “Joint object and part segmentation using deep learned potentials,” in Proc. Int. Conf. Comput. Vis., Dec. 2015, pp. 1573–1581.
[8] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.


[9] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[10] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, May 2002.
[11] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 898–916, May 2011.
[12] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.
[13] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?” in Proc. CVPR, Jun. 2010, pp. 73–80.
[14] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous detection and segmentation,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 297–312.
[15] P. Rantalankila, J. Kannala, and E. Rahtu, “Generating object segmentation proposals using global and local search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2417–2424.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. (2013). “Rich feature hierarchies for accurate object detection and semantic segmentation.” [Online]. Available: https://arxiv.org/abs/1311.2524
[17] Y. Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images,” in Proc. Int. Conf. Comput. Vis., Jul. 2001, pp. 105–112.
[18] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” in Proc. Annu. Conf. Neural Inf. Process. Syst., 2011, pp. 109–117.
[19] A. Vezhnevets, V. Ferrari, and J. M. Buhmann, “Weakly supervised structured output learning for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 845–852.
[20] Y. Liu, J. Liu, Z. Li, J. Tang, and H. Lu, “Weakly-supervised dual clustering for image semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2075–2082.
[21] L. Zhang, M. Song, Z. Liu, X. Liu, J. Bu, and C. Chen, “Probabilistic graphlet cut: Exploiting spatial structure cue for weakly supervised image segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1908–1915.
[22] L. Zhang, M. Song, Q. Zhao, X. Liu, J. Bu, and C. Chen, “Probabilistic graphlet transfer for photo cropping,” IEEE Trans. Image Process., vol. 22, no. 2, pp. 802–815, Feb. 2013.
[23] L. Zhang, Y. Yang, Y. Gao, Y. Yu, C. Wang, and X. Li, “A probabilistic associative model for segmenting weakly supervised images,” IEEE Trans. Image Process., vol. 23, no. 9, pp. 4150–4159, Sep. 2014.
[24] T. Ma and L. J. Latecki, “Graph transduction learning with connectivity constraints with application to multiple foreground cosegmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1955–1962.
[25] F. Meng, H. Li, K. N. Ngan, L. Zeng, and Q. Wu, “Feature adaptive co-segmentation by complexity awareness,” IEEE Trans. Image Process., vol. 22, no. 12, pp. 4809–4824, Dec. 2013.
[26] A. Joulin, F. Bach, and J. Ponce, “Multi-class cosegmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 542–549.
[27] G. Kim and E. P. Xing, “On multiple foreground cosegmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 837–844.
[28] F. Meng, H. Li, and J. Cai, “On multiple image group cosegmentation,” in Proc. Asian Conf. Comput. Vis., 2014, pp. 258–272.
[29] H. Fu, D. Xu, S. Lin, and J. Liu, “Object-based RGBD image co-segmentation with mutex constraint,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4428–4436.
[30] F. Meng, H. Li, G. Liu, and K. N. Ngan, “Object co-segmentation based on shortest path algorithm and saliency model,” IEEE Trans. Multimedia, vol. 14, no. 5, pp. 1429–1441, Oct. 2012.
[31] S. Vicente, C. Rother, and V. Kolmogorov, “Object cosegmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2011, pp. 2217–2224.
[32] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in Proc. ECCV, 2014, pp. 391–405.
[33] X. Dong, J. Shen, L. Shao, and M.-H. Yang, “Interactive cosegmentation using global and local energy optimization,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3966–3977, Nov. 2015.
[34] L. Gorelick, O. Veksler, Y. Boykov, and C. Nieuwenhuis, “Convexity shape prior for binary segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 258–271, Feb. 2017.
[35] L. Mukherjee, V. Singh, and C. R. Dyer, “Half-integrality based algorithms for cosegmentation of images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Miami, FL, USA, Jun. 2009, pp. 2028–2035.

Fanman Meng (S’12–M’14) received the Ph.D. degree in signal and information processing from the University of Electronic Science and Technology of China, Chengdu, China, in 2014. From 2013 to 2014, he joined the Division of Visual and Interactive Computing, Nanyang Technological University, Singapore, as a Research Assistant. He is currently an Associate Professor with the School of Electronic Engineering, University of Electronic Science and Technology of China. He has authored or co-authored numerous technical articles in well-known international journals and conferences. His research interests include image segmentation and object detection. He received the Best Student Paper Honourable Mention award at the 12th Asian Conference on Computer Vision, Singapore, in 2014, and the Top 10% Paper Award at the IEEE International Conference on Image Processing, Paris, France, in 2014. He is a member of the IEEE CAS Society.

Hongliang Li (SM’12) received the Ph.D. degree in electronics and information engineering from Xi’an Jiaotong University, China, in 2005. From 2005 to 2006, he was with the Visual Signal Processing and Communication Laboratory, The Chinese University of Hong Kong (CUHK), as a Research Associate, and was a Post-Doctoral Fellow at the same laboratory from 2006 to 2008. He is currently a Professor with the School of Electronic Engineering, University of Electronic Science and Technology of China. He has authored or co-authored numerous technical articles in well-known international journals and conferences. He is a co-editor of the Springer book Video Segmentation and Its Applications. His research interests include image segmentation, object detection, image and video coding, visual attention, and multimedia communication systems. He has been involved in many professional activities. He is a member of the Editorial Board of the Journal on Visual Communications and Image Representation, and an Area Editor of Signal Processing: Image Communication (Elsevier). He served as a Technical Program Co-Chair of VCIP 2016 and ISPACS 2009, a General Co-Chair of ISPACS 2010, a Publicity Co-Chair of IEEE VCIP 2013, the Local Chair of IEEE ICME 2014, and a TPC Member of a number of international conferences, including ICME 2013, ICME 2012, ISCAS 2013, PCM 2007, PCM 2009, and VCIP 2010.

Qingbo Wu (S’12–M’13) received the B.E. degree in education of applied electronic technology from Hebei Normal University in 2009, and the Ph.D. degree in signal and information processing from the University of Electronic Science and Technology of China in 2015. In 2014, he was a Research Assistant with the Image and Video Processing Laboratory, The Chinese University of Hong Kong. From 2014 to 2015, he was a Visiting Scholar with the Image and Vision Computing Laboratory, University of Waterloo. He is currently a Lecturer with the School of Electronic Engineering, University of Electronic Science and Technology of China. His research interests include image/video coding, quality evaluation, and perceptual modeling and processing.


Bing Luo received the B.Sc. degree in communication engineering from The Second Artillery Command College, Wuhan, China, in 2009, and the M.Sc. degree in computer application technology from Xihua University, Chengdu, China, in 2012. He is currently pursuing the Ph.D. degree at the University of Electronic Science and Technology of China, Chengdu, under the supervision of Prof. H. Li. His research interests include image and video segmentation and machine learning.

King Ngi Ngan (F’00) received the Ph.D. degree in electrical engineering from Loughborough University, U.K. He is currently a Chair Professor with the Department of Electronic Engineering, The Chinese University of Hong Kong. He was a Full Professor with Nanyang Technological University, Singapore, and The University of Western Australia, Australia. He has been appointed a Chair Professor with the University of Electronic Science and Technology of China, Chengdu, China, under the National Thousand Talents Program since 2012. He has authored extensively, including three authored books, seven edited volumes, over 380 refereed technical papers, and nine edited special issues in journals. In addition, he holds 15 patents in the areas of image/video coding and communications. He is a member of IET, U.K., and IEAust, Australia, and was an IEEE Distinguished Lecturer from 2006 to 2007. He holds honorary and visiting professorships at numerous universities in China, Australia, and South East Asia. He served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the Journal on Visual Communications and Image Representation, the EURASIP Journal of Signal Processing: Image Communication, and the Journal of Applied Signal Processing. He chaired and co-chaired a number of prestigious international conferences on image and video processing, including the 2010 IEEE International Conference on Image Processing, and served on the advisory and technical committees of numerous professional organizations.

