Page 1: Spatial Pattern Discovery by Learning a Probabilistic ...hong/Research/e_paper/DAM2002/DAM2002.pdf · Probabilistic Parametric Model from Multiple Attributed Relational Graphs ...

Spatial Pattern Discovery by Learning a Probabilistic Parametric Model from Multiple Attributed Relational Graphs

Pengyu Hong and Thomas S. Huang

Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign

Urbana, IL 61801, USA.

Abstract

This paper presents the methodology and theory for automatic spatial pattern discovery from multiple attributed relational graph samples. The spatial pattern is modelled as a mixture of probabilistic parametric attributed relational graphs. A statistical learning procedure is designed to learn the parameters of the spatial pattern model from the attributed relational graph samples. The learning procedure is formulated as a combinatorial non-deterministic process, which uses the Expectation-Maximization (EM) algorithm to find maximum-likelihood estimates for the parameters of the spatial pattern model. The learned model summarizes the samples and captures the statistical characteristics of the appearance and structure of the spatial pattern, which is observed under various conditions. It can be used to detect the spatial pattern in new samples. In the experiments, the proposed approach is applied to unsupervised visual pattern extraction from multiple images.

Key words: Spatial Pattern Discovery, Attributed Relational Graph, Parametric Attributed Relational Graph, the EM Algorithm.

1 Introduction

In many application domains (e.g., image/video retrieval, software engineering, understanding the biological activity of chemical compounds, etc.), structured information is dependently distributed among the basic primitives and

Email addresses: [email protected] (Pengyu Hong), [email protected] (Thomas S. Huang).

Preprint submitted to Elsevier Science 12 September 2002


the relationships between them. Extracting regular structured information from observations is an interesting and challenging problem. The extracted information can be used to summarize old observations and predict new observations. This paper reports our work on automatic regular structured information extraction from samples. The regular structured information is represented as a spatial pattern. The samples consist of various backgrounds as well as varied instances of the spatial pattern (see Fig. 1).


Fig. 1. The McDonald’s logo is observed under different conditions.

We chose a general graphical representation, attributed relational graphs (ARGs) [23], to represent structured information. An ARG consists of a set of nodes that are connected by a set of arcs. The nodes represent basic primitives (e.g., image pixels, edges, image segments, atoms, molecules, genes, computer programming segments, etc.). The arcs represent the relationships between the primitives. The attributes of the nodes encode the properties of the primitives. The attributes of the relationships describe the context of the primitives. In the rest of this paper, we refer to the ARG representations of the samples as sample ARGs.

There are many approaches for learning spatial pattern models from multiple samples. Ratan et al. [19] used the diverse density algorithm [17] to learn “visual concepts” (i.e., spatial patterns) from multiple images. A “visual concept” is a pre-specified conjunction of several image primitives. The representation of the “visual concept” in [19] is similar to an ARG. Nonetheless, the relationships between the image primitives are not modelled in [19]. Different spatial patterns may share the same set of primitives while containing considerably different relationships (see Fig. 2).


Fig. 2. Three different image patterns are formed by shuffling the same set of image blocks so that the spatial relationships between the image blocks in the patterns are different from each other.


Adopting the data augmentation scheme [9], [22], Frey and Jojic [11] treated transformations as latent variables and used a probabilistic graphical model to represent image patterns and their transformations. They used the EM algorithm [9] to learn the model from image samples. The transformations of an image pattern are defined as shuffles of image pixels inside the pattern. They limited the value range of the transformations to a small pre-defined discrete set because the number of potential transformations is exponential in the number of image pixels. It requires non-trivial prior knowledge to define the discrete transformation set. In addition, the image pixels of an image pattern are just like the nodes of an ARG. However, similar to [19], they did not model the relationships between image pixels.

Zhu and Guo [27] studied the conceptualization and modelling of visual patterns from the perspective of statistical physics. They proposed a Gestalt ensemble for modelling the spatial organization of attributed points (i.e., the nodes in an ARG). The Gestalt ensemble is associated with a probability model, which can be learned from samples via a minimax entropy learning scheme [28]. The learned model captures visual patterns by examining the local interactions among attributed points in a dynamic local neighborhood.

The contextual information of image pixels was utilized by Hong and Huang to automatically detect recurrent image patterns in a big image [13]. They showed how image patterns could be extracted by the local interactions of image pixels. However, the only allowable transformation of the image patterns was translation. Hong, Wang, and Huang [14] used the Generalized EM algorithm to learn the spatial pattern model from multiple sample ARGs. Nevertheless, the theory in [14] is far from fully developed. This paper improves the work of [14], and reports the methodology and theory for unsupervised spatial pattern discovery by learning a spatial pattern model as a probabilistic parametric model from multiple sample ARGs.

We assume that the instances of the spatial pattern are governed by some underlying probabilistic distribution, which is represented by a parametric spatial pattern model. The task is to infer the parameters of the model from multiple sample ARGs. Section 2 introduces the mathematical representations of the sample ARGs and the parametric spatial pattern model. Section 3 mathematically formulates the task and uses the EM algorithm to learn the maximum-likelihood parameters for the parametric model. Section 4 addresses implementation issues and analyzes the computational complexity. Section 5 discusses how to use the learned model for pattern detection. Experimental results are shown in Section 6. Finally, the paper closes with a summary and discussion in Section 7.


2 The Representations

In reality, the instances of a spatial pattern will not be identical because of sensor noise, different observation conditions, and so on. Probabilistic modelling tools have been shown to be effective for handling noise and variation. We design a probabilistic parametric model to represent the spatial pattern.

2.1 Probabilistic Modelling of the Spatial Pattern

Without loss of generality, we assume that the instances of the spatial pattern are governed by some probability distribution function (PDF) f(G|Z), where Z is the spatial pattern model of interest and G is a variable representing an instance of Z. Our goal is to infer Z given the sample ARGs.

It is in general very difficult to estimate Z without any prior knowledge about f(G|Z). In practice, the PDF f(G|Z) is usually assumed to have a structure, for example, a linear combination of parametric mixtures. Adopting this method, we assume that f(G|Z) is a linear combination of parametric mixtures and that Z consists of a set of parametric model components $\{\Phi_w\}_{w=1}^{W}$, where W is the number of model components. Each mixture of f(G|Z) is represented by a model component $\Phi_w$. Hence, we have

$$f(G|Z) = \sum_{w=1}^{W} \alpha_w\, \xi(G|\Phi_w) \qquad (1)$$

where $\xi(G|\Phi_w)$ is a parametric mixture (or parametric distribution sub-function) of f(G|Z), $\alpha_w$ is the weight of $\xi(G|\Phi_w)$, and $\sum_{w=1}^{W} \alpha_w = 1$. Each $\xi(G|\Phi_w)$ has a simpler structure and is easier to estimate. The value of $\alpha_w$ indicates how much of the information is captured by $\xi(G|\Phi_w)$. For the purpose of data summarization, the value of W should be much smaller than the number of sample ARGs.

To calculate f(G|Z), we need to know how to evaluate the $\xi(G|\Phi_w)$. In the following subsections, we first define the representations of the sample ARGs and Z. Then, we derive the computational forms of the $\xi(G|\Phi_w)$.
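As a concrete illustration of Eq. (1), the mixture can be evaluated as a weighted sum once the component likelihoods are known. The sketch below is illustrative only: the function name and the placeholder likelihood values are ours, not the paper's; in the paper the $\xi(G|\Phi_w)$ are derived from the node and relation PDFs of Section 2.4.

```python
# Sketch of Eq. (1): f(G|Z) is a weighted sum of component likelihoods.
# The component likelihoods xi_w are placeholders here; in the paper they
# come from the node and relation PDFs of each model component Phi_w.

def pattern_likelihood(weights, component_likelihoods):
    """f(G|Z) = sum_w alpha_w * xi(G|Phi_w); the weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(a * xi for a, xi in zip(weights, component_likelihoods))

# Example: two model components with weights 0.7 and 0.3.
f = pattern_likelihood([0.7, 0.3], [0.5, 0.2])
```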

2.2 The Sample ARGs

The sample ARG set is denoted as $\mathcal{G} = \{G_i\}_{i=1}^{S}$, where S is the number of sample ARGs. The nodes of the sample ARGs are called sample nodes. The relations of the sample ARGs are called sample relations. A sample ARG is represented as $G_i = \langle A_i, R_i \rangle$, which is explained in detail below.

(a) $A_i = \{\langle o_{ik}, \vec{a}_{ik} \rangle\}_{k=1}^{U_i}$, where $o_{ik}$ is a sample node, $\vec{a}_{ik}$ is the attribute vector of $o_{ik}$, and $U_i$ is the number of sample nodes in $G_i$.

(b) $R_i = \{\langle r_{icd}, \vec{b}_{icd} \rangle\}_{c,d=1}^{U_i}$, where $r_{icd}$ represents the relation between $o_{ic}$ and $o_{id}$, and $\vec{b}_{icd}$ is the attribute vector of $r_{icd}$. We assume the relationships are directional. If $r_{icd}$ and $r_{idc}$ are directionless, we have $r_{icd} = r_{idc}$ and $\vec{b}_{icd} = \vec{b}_{idc}$. If there is no relationship from $o_{ic}$ to $o_{id}$, then $r_{icd} = \mathrm{NULL}$ and $\vec{b}_{icd}$ is invalid.
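The definitions above can be mirrored in a small data structure. This is a hypothetical container of our own devising, not code from the paper; a missing entry in `relations` plays the role of $r_{icd} = \mathrm{NULL}$, and directionless relations would simply store both (c, d) and (d, c).

```python
# A minimal container for a sample ARG G_i = <A_i, R_i>, assuming directed
# relations keyed by ordered node-index pairs.
from dataclasses import dataclass, field

@dataclass
class ARG:
    node_attrs: list                               # a_ik per sample node o_ik
    relations: dict = field(default_factory=dict)  # (c, d) -> b_icd

    @property
    def num_nodes(self):                           # U_i
        return len(self.node_attrs)

# A toy 3-node ARG with one directed relation from node 0 to node 1.
g = ARG(node_attrs=[[0.1], [0.2], [0.3]], relations={(0, 1): [1.0]})
```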

2.3 The Parametric Pattern Model

The model components are represented as parametric attributed relational graphs. The nodes of the model components are called model nodes. The relations of the model components are called model relations. Each model component is denoted as $\Phi_w = \langle \Omega_w, \Psi_w \rangle$, where:

(a) $\Omega_w = \{\langle \omega_{wk}, \vec{\varphi}_{wk}, \beta_{wk} \rangle\}_{k=0}^{N_w}$. $\omega_{wk}$ is a model node. In particular, $\omega_{w0}$ is a null model node. $N_w$ is the number of non-null model nodes. To allow occlusions, different model components may have different numbers of non-null model nodes. Each non-null model node $\omega_{wk}$ is associated with a parametric node PDF $p(o_{im}|\omega_{wk})$ whose parameter vector is $\vec{\varphi}_{wk}$. The parameter $\beta_{wk}$ reflects the relative frequency with which the model node $\omega_{wk}$ is observed in the sample ARGs. It is normalized with respect to all the model nodes in $\Phi_w$, so that $\sum_{k=0}^{N_w} \beta_{wk} = 1$. The null model node $\omega_{w0}$ has no physical existence and is used to provide a modelling destination for those sample nodes that represent backgrounds. The node PDF $p(o_{im}|\omega_{w0})$ and the parameter vector $\vec{\varphi}_{w0}$ are invalid.

(b) $\Psi_w = \{\langle \psi_{w\sigma\tau}, \vec{\vartheta}_{w\sigma\tau} \rangle\}$. $\psi_{w\sigma\tau}$ is a model relation. The model relation $\psi_{w\sigma\tau}$ is a null relation if there is no relation from $\omega_{w\sigma}$ to $\omega_{w\tau}$. Each non-null relation $\psi_{w\sigma\tau}$ is associated with a parametric relation PDF $p(r_{icd}|\psi_{w\sigma\tau})$ whose parameter vector is $\vec{\vartheta}_{w\sigma\tau}$. The relation PDF and the parameter vector of a null model relation are invalid.

Let $\Theta_w = \{\vec{\varphi}_{wk}\} \cup \{\beta_{wk}\} \cup \{\vec{\vartheta}_{w\sigma\tau}\}$ denote the parameter set of $\Phi_w$. Let $\Theta = \bigcup_w \Theta_w$ denote the parameter set of Z.


2.4 The Probability Density Function of the Spatial Pattern Model

To evaluate the $\xi(G|\Phi_w)$ and f(G|Z), the match between G and the model Z is required. Let $\vec{y}_i = [q_i, y_{i1}, \ldots, y_{iU_i}]$ denote the match between a sample ARG $G_i$ and the model Z. The information in $\vec{y}_i$ represents a two-level match between $G_i$ and Z. The first level is represented by $q_i$, which denotes that $G_i$ as a whole graph matches the component $\Phi_{q_i}$ of Z. The value range of $q_i$ is $[1, W]$. The second level is represented by $[y_{i1}, \ldots, y_{iU_i}]$, which denotes the match between the sample nodes of $G_i$ and the model nodes of $\Phi_{q_i}$. The element $y_{ij}$ denotes that the sample node $o_{ij}$ matches the model node $\omega_{q_i y_{ij}}$. The value range of $y_{ij}$ is $[0, N_{q_i}]$.

Let $P(y_{ij}|G_i, \Phi_w)$ (i.e., $P(y_{ij}|G_i, \Theta_w)$) denote the matching probability between $o_{ij}$ and $\omega_{w y_{ij}}$. Assuming $P(y_{ij}|G_i, \Phi_w)$ is available (the details of calculating it are discussed in Section 4), we have

$$\xi(G_i|\Phi_w) = \sum_{k=1}^{U_i} \sum_{j=1}^{N_w} P(y_{ik}=j|G_i,\Phi_w)\, p(o_{ik}|\omega_{wj}) + \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{\sigma=1}^{N_w} \sum_{\tau=1}^{N_w} P(y_{ic}=\sigma|G_i,\Phi_w)\, P(y_{id}=\tau|G_i,\Phi_w)\, p(r_{icd}|\psi_{w\sigma\tau}) \qquad (2)$$

The probability of $G_i$ matching $\Phi_w$ given the model Z is

$$P(q_i = w|G_i, Z) = \frac{\xi(G_i|\Phi_w)}{\sum_{t=1}^{W} \xi(G_i|\Phi_t)} \qquad (3)$$
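Eqs. (2) and (3) translate directly into code once the matching probabilities and the PDFs are available. The sketch below assumes they are supplied as plain tables and callables; all names are illustrative, not from the paper.

```python
# Sketch of Eqs. (2)-(3). match_prob[k][j] stands for P(y_ik = j | G_i, Phi_w)
# (index 0 is the null model node and is skipped); node_pdf and rel_pdf stand
# for the parametric node and relation PDFs.

def xi(match_prob, node_pdf, rel_pdf, num_sample_nodes, num_model_nodes):
    # First term of Eq. (2): expected node likelihood.
    total = sum(match_prob[k][j] * node_pdf(k, j)
                for k in range(num_sample_nodes)
                for j in range(1, num_model_nodes + 1))
    # Second term of Eq. (2): expected relation likelihood over node pairs.
    total += sum(match_prob[c][s] * match_prob[d][t] * rel_pdf(c, d, s, t)
                 for c in range(num_sample_nodes)
                 for d in range(num_sample_nodes)
                 for s in range(1, num_model_nodes + 1)
                 for t in range(1, num_model_nodes + 1))
    return total

def component_posterior(xis):
    """Eq. (3): P(q_i = w | G_i, Z) = xi_w / sum_t xi_t."""
    z = sum(xis)
    return [x / z for x in xis]
```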

3 Estimating the Parameters of the Spatial Pattern Model via the EM Algorithm

The parameter estimation problem becomes straightforward if we know the matching probabilities between the sample ARGs and Z. However, it is tedious and labor-intensive to manually specify the matching information for a large set of sample ARGs. We are interested in automatically learning the spatial pattern model without manually specifying the matching information. This section derives the theory for inferring the maximum-likelihood parameters of Z using the EM algorithm [9]. The learning procedure simultaneously estimates the parameters of Z and the matching probabilities between the sample ARGs and Z.


3.1 The Basic EM Algorithm

The EM algorithm is a technique for iteratively finding maximum-likelihood estimates for the parameters of an underlying distribution from a training data set that is incomplete or has missing information. The EM algorithm defines a likelihood function

$$Q(H; H^{(n)}) = E[\log p(D_o, D_m|H)\,|\,D_o, H^{(n)}] \qquad (4)$$

where H is the unknown parameter set, $D_o$ is the observed data, $D_m$ is the missing information, and n is the iteration index of the EM algorithm. The complete data set is $D_o \cup D_m$. The likelihood function $Q(H; H^{(n)})$ is a function of H under the assumption that $H = H^{(n)}$. The right-hand side of (4) is the expected value of the complete-data log-likelihood $\log p(D_o, D_m|H)$ with respect to $D_m$, given $D_o$ and assuming $H = H^{(n)}$.

The EM algorithm starts with an initial value of H, say $H^{(0)}$, and refines the value of H iteratively in two steps: the expectation step (the E-step) and the maximization step (the M-step). In the E-step, $Q(H; H^{(n)})$ is computed. In the M-step, the parameter set H is updated by:

$$H^{(n+1)} = \arg\max_{H} Q(H; H^{(n)}) \qquad (5)$$

The iterative procedure stops when it converges or when a pre-defined maximum number of iterations is reached.
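The generic E-step/M-step loop of Eqs. (4)-(5) can be illustrated on a much simpler model than the ARG mixture. The toy below fits only the means of a 1-D two-component Gaussian mixture (unit variances and equal weights are assumed for brevity); it is a stand-in for the pattern model, not the paper's procedure.

```python
# A minimal E-step/M-step loop for a 1-D two-component Gaussian mixture,
# illustrating the generic scheme of Eqs. (4)-(5).
import math

def em_two_means(data, mu, iters=50):
    for _ in range(iters):
        # E-step: responsibilities under the current parameters H^(n).
        resp = []
        for x in data:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M-step: H^(n+1) maximizes the expected complete-data log-likelihood;
        # for Gaussian means this is a responsibility-weighted average.
        mu = [sum(r[j] * x for r, x in zip(resp, data)) / sum(r[j] for r in resp)
              for j in range(2)]
    return mu

mu = em_two_means([-2.1, -1.9, -2.0, 1.9, 2.1, 2.0], [-1.0, 1.0])
```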

3.2 The Likelihood Function for Learning the Parameters of the Spatial Pattern Model

In our case, the observed data $D_o$ is the sample ARG set $\mathcal{G}$. The missing data $D_m$ corresponds to the match between the sample ARGs and Z. Let $\mathcal{Y} = \{\vec{y}_i\}$. The unknown parameter set is Θ. The likelihood function for our problem is

$$\begin{aligned} Q(\Theta; \Theta^{(n)}) &= E_f[\log p(\mathcal{G}, \mathcal{Y}|\Theta)\,|\,\mathcal{G}, \Theta^{(n)}] \\ &= \textstyle\sum_{\mathcal{Y}} f(\mathcal{G}, \mathcal{Y}|\Theta^{(n)}) \log p(\mathcal{G}, \mathcal{Y}|\Theta) \\ &= \textstyle\sum_{\mathcal{Y}} f(\mathcal{Y}|\mathcal{G}, \Theta^{(n)})\, f(\mathcal{G}|\Theta^{(n)}) \log p(\mathcal{G}, \mathcal{Y}|\Theta) \\ &= f(\mathcal{G}|\Theta^{(n)}) \textstyle\sum_{\mathcal{Y}} f(\mathcal{Y}|\mathcal{G}, \Theta^{(n)}) \log p(\mathcal{G}, \mathcal{Y}|\Theta) \end{aligned} \qquad (6)$$


We can remove $f(\mathcal{G}|\Theta^{(n)})$ from (6) because it does not depend on either Θ or $\mathcal{Y}$ and will not affect the final results. We further assume that the $G_i$ are independent of each other; consequently, the $\vec{y}_i$ are independent of each other. Hence, (6) can be rewritten as

$$\begin{aligned} Q(\Theta; \Theta^{(n)}) &= \textstyle\sum_{\mathcal{Y}} f(\mathcal{Y}|\mathcal{G}, \Theta^{(n)}) \log p(\mathcal{G}, \mathcal{Y}|\Theta) \\ &= \textstyle\sum_{\vec{y}_1} \cdots \sum_{\vec{y}_S} \sum_{i=1}^{S} \Big( \log p(G_i, \vec{y}_i|\Theta) \prod_{j=1}^{S} f(\vec{y}_j|G_j, \Theta^{(n)}) \Big) \\ &= \textstyle\sum_{i=1}^{S} \sum_{\vec{y}_i} f(\vec{y}_i|G_i, \Theta^{(n)}) \log p(G_i, \vec{y}_i|\Theta) \\ &= \textstyle\sum_{i=1}^{S} \sum_{\vec{y}_i} f(\vec{y}_i|G_i, \Theta^{(n)}) \log\big( p(G_i|\vec{y}_i, \Theta)\, p(\vec{y}_i|\Theta) \big) \\ &= \textstyle\sum_{i=1}^{S} \sum_{\vec{y}_i} f(\vec{y}_i|G_i, \Theta^{(n)}) \log\big( p(G_i|\vec{y}_i, \Theta)\, p(\vec{y}_i) \big) \end{aligned} \qquad (7)$$

The term $f(\vec{y}_i|G_i, \Theta^{(n)})$ in (7) is the marginal distribution of $\vec{y}_i$, i.e., the unobserved match between $G_i$ and Z. It depends on the observed data $\mathcal{G}$ and the current value of the parameter set Θ. The contextual information of the nodes is fully described in $G_i$; in other words, the interdependence among the $y_{ik}$ is described by $G_i$. Hence, we have

$$f(\vec{y}_i|G_i, \Theta^{(n)}) = P(q_i|G_i, \Theta^{(n)}) \prod_{k=1}^{U_i} f(y_{ik}|G_i, \Theta^{(n)}_{q_i}) \qquad (8)$$

Since the value space of $y_{ik}$ is uniformly discretized, (8) can be rewritten as

$$f(\vec{y}_i|G_i, \Theta^{(n)}) = P(q_i|G_i, \Theta^{(n)}) \prod_{k=1}^{U_i} P(y_{ik}|G_i, \Theta^{(n)}_{q_i}) \qquad (9)$$

The term $p(G_i|\vec{y}_i, \Theta)$ in (7) is the marginal distribution of $G_i$ given the model Z and the match $\vec{y}_i$. It can be rewritten as

$$p(G_i|\vec{y}_i, \Theta) = p(G_i|[y_{i1} \cdots y_{iU_i}], \Theta_{q_i}) = \prod_{m=1}^{U_i} p(o_{im}|\omega_{q_i y_{im}}) \prod_{c=1}^{U_i} \prod_{d=1}^{U_i} p(r_{icd}|\psi_{q_i y_{ic} y_{id}}) \qquad (10)$$

where $p(o_{im}|\omega_{q_i y_{im}})$ is the node PDF of $\omega_{q_i y_{im}}$ and $p(r_{icd}|\psi_{q_i y_{ic} y_{id}})$ is the relation PDF of $\psi_{q_i y_{ic} y_{id}}$. If the relations are directionless, (10) should be written as


$$p(G_i|\vec{y}_i, \Theta) = p(G_i|[y_{i1} \cdots y_{iU_i}], \Theta_{q_i}) = \prod_{m=1}^{U_i} p(o_{im}|\omega_{q_i y_{im}}) \Big( \prod_{c=1}^{U_i} \prod_{d=1}^{U_i} p(r_{icd}|\psi_{q_i y_{ic} y_{id}}) \Big)^{1/2} \qquad (11)$$

In the following derivation, we use (10). It can easily be shown that only part of the results will be affected by a scale of 1/2 if we use (11) instead.

Expanding the term P (−→y i) in (7), we have

P (−→y i) = P (qi)Ui∏t=1

P (yit|qi) (12)

where P (qi = h) = αh and P (yic = η|qi = h) = βhη.

Substituting (9), (10), and (12) into (7), we have

$$Q(\Theta; \Theta^{(n)}) = \sum_{i=1}^{S} \sum_{\vec{y}_i} P(q_i|G_i, \Theta^{(n)}) \prod_{k=1}^{U_i} P(y_{ik}|G_i, \Theta^{(n)}_{q_i}) \log\Big( \prod_{m=1}^{U_i} p(o_{im}|\omega_{q_i y_{im}}) \prod_{c=1}^{U_i} \prod_{d=1}^{U_i} p(r_{icd}|\psi_{q_i y_{ic} y_{id}})\, P(q_i) \prod_{t=1}^{U_i} P(y_{it}|q_i) \Big) \qquad (13)$$

Expanding $\sum_{\vec{y}_i}$ and replacing $\log \prod g(x)$ with $\sum \log g(x)$ in (13), we have:

$$Q(\Theta; \Theta^{(n)}) = \sum_{i=1}^{S} \sum_{q_i=1}^{W} \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} P(q_i|G_i, \Theta^{(n)}) \prod_{k=1}^{U_i} P(y_{ik}|G_i, \Theta^{(n)}_{q_i}) \Big[ \log P(q_i) + \sum_{m=1}^{U_i} \log\big( p(o_{im}|\omega_{q_i y_{im}})\, P(y_{im}|q_i) \big) + \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \log p(r_{icd}|\psi_{q_i y_{ic} y_{id}}) \Big] \qquad (14)$$

Equation (14) can be simplified into (see Appendix A)


$$Q(\Theta; \Theta^{(n)}) = \sum_{i=1}^{S} \sum_{h=1}^{W} P_{q_i}(h|G_i, \Theta^{(n)}) \Big[ \log \alpha_h + \sum_{m=1}^{U_i} \sum_{\eta=0}^{N_h} P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h) \log \beta_{h\eta} + \sum_{m=1}^{U_i} \sum_{\eta=0}^{N_h} P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h) \log p(o_{im}|\omega_{h\eta}) + \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{\sigma=0}^{N_h} \sum_{\tau=0}^{N_h} P_{y_{ic}}(\sigma|G_i, \Theta^{(n)}_h)\, P_{y_{id}}(\tau|G_i, \Theta^{(n)}_h) \log p(r_{icd}|\psi_{h\sigma\tau}) \Big] \qquad (15)$$

where $P_{q_i}(h|G_i, \Theta^{(n)})$ denotes $P(q_i = h|G_i, \Theta^{(n)})$ and $P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)$ denotes $P(y_{im} = \eta|G_i, \Theta^{(n)}_h)$. The probability $P_{q_i}(h|G_i, \Theta^{(n)})$ can be calculated using (3). The calculation of $P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)$ will be discussed in Section 4.1.

3.3 The Expressions for Updating the Parameters in the M-Step

In the maximization step, Θ is updated by $\Theta^{(n+1)} = \arg\max_{\Theta} Q(\Theta; \Theta^{(n)})$. The expressions for updating $\alpha_h$ and $\beta_{h\eta}$ can be obtained as below, regardless of the forms of the node PDFs and the relation PDFs (see Appendix B):

$$\alpha_h^{(n+1)} = \frac{\sum_{i=1}^{S} P_{q_i}(h|G_i, \Theta^{(n)})}{S} \qquad (16)$$

$$\beta_{h\eta}^{(n+1)} = \frac{\sum_{i=1}^{S} \sum_{m=1}^{U_i} P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)\, P_{q_i}(h|G_i, \Theta^{(n)})}{\sum_{i=1}^{S} P_{q_i}(h|G_i, \Theta^{(n)})\, U_i} \qquad (17)$$
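Updates (16) and (17) are simple weighted averages of the E-step posteriors. A sketch, assuming the posteriors are stored as nested lists (a layout of our own choosing, not the paper's):

```python
# Sketch of the M-step updates (16)-(17). P_q[i][h] stands for
# P_qi(h|G_i, Theta^(n)); P_y[i][m][eta] stands for P_yim(eta|G_i, Theta_h^(n)).

def update_alpha(P_q, h):
    """Eq. (16): average component posterior over the S samples."""
    S = len(P_q)
    return sum(P_q[i][h] for i in range(S)) / S

def update_beta(P_q, P_y, h, eta):
    """Eq. (17): expected matches to model node eta, normalized by node counts."""
    num = sum(P_q[i][h] * sum(P_y[i][m][eta] for m in range(len(P_y[i])))
              for i in range(len(P_q)))
    den = sum(P_q[i][h] * len(P_y[i]) for i in range(len(P_q)))
    return num / den
```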

Both the parameters of the node PDFs and those of the relation PDFs are determined by the forms of the PDFs, and so are their updating expressions.

If the node PDFs and relation PDFs are Gaussian, analytical expressions can be derived for updating the parameters of the PDFs in the M-step of the EM algorithm.

Assume the node PDF is Gaussian:

$$p(o_{im}|\omega_{h\eta}) = \frac{\exp\big( -\tfrac{1}{2} (\vec{a}_{im} - \vec{\mu}_{h\eta})^T \Sigma_{h\eta}^{-1} (\vec{a}_{im} - \vec{\mu}_{h\eta}) \big)}{(2\pi)^{\varsigma/2} |\Sigma_{h\eta}|^{1/2}} \qquad (18)$$


where $\vec{\mu}_{h\eta}$ and $\Sigma_{h\eta}$ are the mean and covariance matrix of the node PDF of the model node $\omega_{h\eta}$, respectively, and ς is the dimension of $\vec{\mu}_{h\eta}$. We can obtain the expressions for updating $\vec{\mu}_{h\eta}$ and $\Sigma_{h\eta}$ as below (see Appendix C):

$$\vec{\mu}_{h\eta}^{(n+1)} = \frac{\sum_{i=1}^{S} \sum_{m=1}^{U_i} \vec{a}_{im}\, P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)\, P_{q_i}(h|G_i, \Theta^{(n)})}{\sum_{i=1}^{S} \sum_{m=1}^{U_i} P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)\, P_{q_i}(h|G_i, \Theta^{(n)})} \qquad (19)$$

$$\Sigma_{h\eta}^{(n+1)} = \frac{\sum_{i=1}^{S} \sum_{m=1}^{U_i} \vec{x}_{im}^{(n)} \vec{x}_{im}^{(n)T}\, P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)\, P_{q_i}(h|G_i, \Theta^{(n)})}{\sum_{i=1}^{S} \sum_{m=1}^{U_i} P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)\, P_{q_i}(h|G_i, \Theta^{(n)})} \qquad (20)$$

where $\vec{x}_{im}^{(n)} = \vec{a}_{im} - \vec{\mu}_{h\eta}^{(n+1)}$.
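Updates (19) and (20) are responsibility-weighted mean and covariance estimates. The sketch below shows the scalar-attribute case, with the two posterior factors already multiplied into a single weight per node (our simplification); the vector case replaces the squared difference with an outer product.

```python
# Sketch of Eqs. (19)-(20) for scalar node attributes. Each entry of `obs`
# is (a_im, weight), where the weight stands for the product
# P_yim(eta|G_i) * P_qi(h|G_i) attached to that node.

def update_gaussian(obs):
    wsum = sum(w for _, w in obs)
    mu = sum(a * w for a, w in obs) / wsum               # Eq. (19)
    var = sum(w * (a - mu) ** 2 for a, w in obs) / wsum  # Eq. (20), 1-D case
    return mu, var

mu, var = update_gaussian([(1.0, 0.5), (3.0, 0.5)])
```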

Assume the relation PDF is Gaussian:

$$p(r_{icd}|\psi_{h\sigma\tau}) = \frac{\exp\big( -\tfrac{1}{2} (\vec{b}_{icd} - \vec{\gamma}_{h\sigma\tau})^T \Lambda_{h\sigma\tau}^{-1} (\vec{b}_{icd} - \vec{\gamma}_{h\sigma\tau}) \big)}{(2\pi)^{\kappa/2} |\Lambda_{h\sigma\tau}|^{1/2}} \qquad (21)$$

where $\vec{\gamma}_{h\sigma\tau}$ and $\Lambda_{h\sigma\tau}$ are the mean and covariance matrix of the relation PDF of $\psi_{h\sigma\tau}$, and κ is the dimension of $\vec{\gamma}_{h\sigma\tau}$. We can obtain the expressions for updating $\vec{\gamma}_{h\sigma\tau}$ and $\Lambda_{h\sigma\tau}$ as below (see Appendix C):

$$\vec{\gamma}_{h\sigma\tau}^{(n+1)} = \frac{\sum_{i=1}^{S} \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \vec{b}_{icd}\, \ell_h(y_{ic}, y_{id}, \sigma, \tau)\, P_{q_i}(h|G_i, \Theta^{(n)})}{\sum_{i=1}^{S} \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \ell_h(y_{ic}, y_{id}, \sigma, \tau)\, P_{q_i}(h|G_i, \Theta^{(n)})} \qquad (22)$$

$$\Lambda_{h\sigma\tau}^{(n+1)} = \frac{\sum_{i=1}^{S} \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \vec{z}_{icd}^{(n)} \vec{z}_{icd}^{(n)T}\, \ell_h(y_{ic}, y_{id}, \sigma, \tau)\, P_{q_i}(h|G_i, \Theta^{(n)})}{\sum_{i=1}^{S} \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \ell_h(y_{ic}, y_{id}, \sigma, \tau)\, P_{q_i}(h|G_i, \Theta^{(n)})} \qquad (23)$$

where $\ell_h(y_{ic}, y_{id}, \sigma, \tau) = P_{y_{ic}}(\sigma|G_i, \Theta^{(n)}_h)\, P_{y_{id}}(\tau|G_i, \Theta^{(n)}_h)$ and $\vec{z}_{icd}^{(n)} = \vec{b}_{icd} - \vec{\gamma}_{h\sigma\tau}^{(n+1)}$.

4 Implementation Issues

4.1 Register the Sample ARGs with the Spatial Pattern Model

Given the current value of the parameter set of $\Phi_w$, we can calculate the matching probabilities between $G_i$ and $\Phi_w$ using two-graph matching techniques. Two-graph matching is a fundamental combinatorial problem and is NP-hard in general [12]. Finding a locally optimal inexact match between two graphs has been widely investigated [1], [5], [7], [16], [20], [21], [23], [24], [26]. We use an implementation of the probabilistic relaxation graph matching algorithm [7] to match each sample ARG with every component of Z. The matching results are approximations to $P_{y_{im}}(\eta|G_i, \Theta^{(n)}_h)$.

4.2 Initialize the Spatial Pattern Model

Initializing the spatial pattern model is the first step of the learning procedure and is very important. The number of model components is decided by the user or the application. We initialize the model components one by one. First, the average number of nodes of the sample ARGs is calculated. We select a sample ARG, say $G_p$, whose number of sample nodes is closest to this average. The geometric structure of $G_p$ is used to initialize that of the first model component $\Phi_1$. If the node PDFs and relation PDFs are assumed to be Gaussian, the feature vectors of the nodes and relations of $G_p$ are used to initialize the corresponding means of the node PDFs and relation PDFs of $\Phi_1$. The covariance matrices of the node PDFs and relation PDFs are initialized as identity matrices.

The rest of the model components are initialized using the following algorithm. The idea is to initialize the model components with sample ARGs that are as different from each other as possible.

Algorithm 1. Initialize the Spatial Pattern Model.

(a) for w = 2 to W
(b)   Select a sample $G_p = \arg\min_{G_i} \big( \max_{\Phi_h} \xi(G_i|\Phi_h) \big)$.
(c)   Initialize the model component $\Phi_w$ using $G_p$.
(d)   $\alpha_w = 1$
(e)   $\beta_{wk} = 1$  $(0 \le k \le N_w)$
(f) endfor

Before beginning the iterative procedure of the EM algorithm, the K-means algorithm is used to pre-adjust the parameters of the spatial pattern model.
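The initialization steps above can be sketched as follows. The likelihood ξ and the per-sample component initializer are passed in as placeholders; in the test usage we treat samples as plain lists and ξ as a crude overlap count, which are our assumptions, not the paper's.

```python
# Sketch of the initialization scheme: Phi_1 from the sample closest to the
# average node count, then Algorithm 1 for the remaining components, picking
# the sample worst explained by the components chosen so far.

def init_model(samples, W, xi, init_component):
    # Phi_1: the sample whose node count is closest to the average.
    avg = sum(len(g) for g in samples) / len(samples)
    first = min(samples, key=lambda g: abs(len(g) - avg))
    model = [init_component(first)]
    # Phi_2 .. Phi_W: Algorithm 1, steps (a)-(f).
    for _ in range(2, W + 1):
        gp = min(samples, key=lambda g: max(xi(g, phi) for phi in model))
        model.append(init_component(gp))
    return model
```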

4.3 Modify the Structure of the Spatial Pattern Model

Since we select a subset of the sample ARGs to initialize the components of the model, it is very likely that the model components have spurious nodes which represent backgrounds. During the iterations of the EM algorithm, we calculate the average probability of being matched for each model node $\omega_{wk}$ as

$$\varrho_{wk} = \frac{\sum_{i=1}^{S} P_{q_i}(w|G_i, \Theta^{(n)}) \big( \sum_{m=1}^{U_i} P_{y_{im}}(k|G_i, \Theta^{(n)}_w) \big)}{\sum_{i=1}^{S} P_{q_i}(w|G_i, \Theta^{(n)})} \qquad (24)$$

If $\varrho_{wk}$ is smaller than a threshold ε, the model node $\omega_{wk}$ and its relations are removed. The threshold ε can be a constant or an increasing function of the EM iteration number (e.g., we choose $\varepsilon = 1 - 0.5^n$).
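The pruning rule with the iteration-dependent threshold $\varepsilon = 1 - 0.5^n$ can be sketched as follows; the function name and input layout are ours, and the $\varrho_{wk}$ values are assumed to be precomputed via Eq. (24).

```python
# Sketch of the structure-modification step of Section 4.3: model nodes whose
# average matching probability falls below eps(n) = 1 - 0.5**n are pruned.

def prune_nodes(rho, n):
    """rho[k] is the average matched probability of model node k (Eq. (24))."""
    eps = 1.0 - 0.5 ** n
    return [k for k, r in enumerate(rho) if r >= eps]

kept = prune_nodes([0.95, 0.10, 0.80], n=2)  # threshold eps = 0.75
```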

4.4 The Computational Complexity

The computational complexity of the learning procedure is O(number of EM iterations × $\sum_{i=1}^{S} \sum_{w=1}^{W}$ (complexity of matching $G_i$ to $\Phi_w$)). Since it might take too long for the EM algorithm to converge, a maximum number of iterations T is set empirically (e.g., we set T to 50). The graph matching algorithm is used by the EM algorithm to deal with the hidden variables, i.e., the match between the sample ARGs and the spatial pattern model. Without additional constraints or prior knowledge, the size of the value space of the match between the sample nodes of $G_i$ and the model nodes of $\Phi_w$ is $O((N_w + 1)^{U_i})$. To deal with such a huge search space, we chose a bottom-up graph matching approach (see Section 4.1), which finds a locally optimal solution by fusing low-level information. The computational complexity of our implementation of the graph matching algorithm is $O(N_w^2 U_i^2)$. The overall computational complexity of the implemented learning procedure is $O(T \sum_i \sum_w N_w^2 U_i^2)$.

5 Detect the Spatial Pattern

The learned model captures the statistical characteristics of a spatial pattern observed under various conditions. It can be used to detect whether the pattern appears in a new sample ARG, say $G_x = \langle O_x, R_x \rangle$. The similarity between $G_x$ and the model Z is calculated as $f(G_x|Z)$. An instance of the pattern is said to be found in $G_x$ if $f(G_x|Z)$ is larger than a predefined threshold $\varepsilon_1$, which depends on the application. A choice of $\varepsilon_1$ could be $\min_{G_i \in \mathcal{G}} f(G_i|Z)$ if each sample ARG $G_i$ contains at least one instance of the spatial pattern.

The likelihood of each sample node $o_{xk}$ is calculated as

$$\sum_{h=1}^{W} \alpha_h P(G_x = \Phi_h|G_x, Z) \sum_{\eta=1}^{N_h} \beta_{h\eta} P(o_{xk} = \omega_{h\eta}|G_x, \Phi_h) \qquad (25)$$

Those sample nodes whose likelihood is larger than a predefined threshold $\varepsilon_2$ are selected. A choice of $\varepsilon_2$ could be $0.95\,S/(W \sum_{i=1}^{S} U_i)$. The relations among the selected sample nodes are preserved. The selected nodes and relations form an instance of the pattern in $G_x$.
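The two-stage detection described above (a whole-graph threshold $\varepsilon_1$, then a per-node threshold $\varepsilon_2$) can be sketched as follows, assuming $f(G_x|Z)$ and the per-node likelihoods in the style of Eq. (25) are precomputed; the names are illustrative.

```python
# Sketch of the detection step of Section 5: accept the graph only if its
# overall score clears eps1, then keep the nodes whose likelihoods clear eps2.

def detect(f_gx, node_scores, eps1, eps2):
    """Return indices of pattern nodes, or None if no instance is found."""
    if f_gx <= eps1:
        return None
    return [k for k, s in enumerate(node_scores) if s > eps2]

hit = detect(f_gx=0.9, node_scores=[0.8, 0.1, 0.7], eps1=0.5, eps2=0.6)
```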

6 Experimental Results

We applied the proposed approach to the problem of unsupervised visual pattern extraction. The image samples are segmented using a segmentation algorithm [10] and are represented as ARGs. Each image segment is represented as a node. The attributes of a node are the mean and variance of the color (RGB) features of the corresponding image segment. Adjacency relationships between the image segments are considered. The attributes of the relationships in the sample ARGs are either 1 (adjacent) or 0 (non-adjacent). During the learning process, the attributes of the relationships are updated as continuous variables in the range [0, 1]. When the learning procedure stops, a threshold of 0.5 is used to decide whether a relationship should be kept. A model node without any neighbor is deleted.

We first show a simple example. Pictures of the McDonald’s logo were taken against various backgrounds, from different viewpoints, and under two different lighting conditions. Ten images were captured under each lighting condition; some of them are shown in Fig. 3. The observed color features of the McDonald’s logo differ across the samples due to the different lighting conditions, different viewpoints, and noise. Take the ‘m’ in the middle of the logo as an example. The images shown in Fig. 3 (b) and (e) are captured under different lighting conditions. The means of the color features of ‘m’ are (202.4, 138.2, 59.8) and (240.3, 180.1, 109.4) in Fig. 3 (b) and (e), respectively. The images shown in Fig. 3 (a), (b), and (c) are captured under the same lighting condition. The means of the color features of ‘m’ are (208.2, 149.7, 69.1), (202.4, 138.2, 59.8), and (205.7, 144.3, 71.2) in Fig. 3 (a), (b), and (c), respectively.

We made two assumptions. First, the spatial pattern model has two model components. Second, the node PDFs and the relation PDFs are Gaussian with covariance matrices fixed as identity matrices. Both components of the learned model have 8 nodes. The means of the color attributes of the model nodes that correspond to ‘m’ are (207.5, 140.3, 68.6) and (240.2, 179.7, 117.1), respectively. The learning results already include the detection results of the McDonald’s logo in the training images (see Fig. 4).

Fig. 3. The McDonald’s logo. The first row lists three images captured under the first lighting condition. The second row shows three images captured under the second lighting condition.

Fig. 4. Detecting the McDonald’s logo in the training images.

In another experiment, we used images of the ZIP logo against various backgrounds (see Fig. 5). The backgrounds in this experiment are more complicated than those in the previous one. We provide more details for this experiment.

The images are segmented (see Fig. 6) and represented as ARGs (see Fig. 7). The spatial pattern model is assumed to have one component. The node PDFs and the relation PDFs are assumed to be Gaussian with covariance matrices fixed as identity matrices. Fig. 8 shows the detection results on the sample ARGs shown in Fig. 7. Fig. 9 shows the original image regions that correspond to the detected subgraphs in Fig. 8. We also used the learned model to detect the ZIP logo in a new image (see Fig. 10).




Fig. 5. The ZIP logo images.


Fig. 6. The segmentation results of the images shown in Fig. 5. The image segments are automatically painted in pseudo-colors by the segmentation program [10].

As shown in Fig. 9 (d), (e), and (f), the final results depend on the quality of the image segmentation. In fact, if each image pixel is represented as a node in an ARG, our theory can be applied directly to image pixels, so that we avoid using the corrupted information generated by low-level image preprocessing steps (e.g., image segmentation, edge detection, etc.). Nonetheless, this results in high computational complexity if the sample images contain large numbers of pixels. Image segmentation was used in our experiments only to reduce the computational complexity.

Fig. 7. The ARG representations of the images shown in Fig. 5. The nodes represent the image segments. The 2D coordinates of a node in the image plane are those of a randomly selected image pixel in the corresponding image segment; the coordinates of the nodes are used for visualization only. An edge is drawn to connect two nodes if the corresponding image segments are adjacent.

Fig. 8. The detected subgraphs that correspond to the instances of the learned spatial pattern model.

7 Summary and Discussion

We present a statistical learning approach that discovers frequently observed structured information by simultaneously examining multiple samples. We assume that the structured information is governed by a PDF, which is represented as a probabilistic parametric model. The model consists of a set of parametric attributed relational graphs. The learning procedure iteratively finds a local optimum estimate of the parameter set of the model. The learned model summarizes the samples and can be used for pattern detection. We demonstrated the approach by applying it to unsupervised 2D visual spatial pattern extraction. The experimental results show that the learning procedure is able to distinguish the instances of the spatial pattern from their backgrounds, provided that similar backgrounds are not always observed in the samples.

Fig. 9. The original image segments that correspond to the subgraphs in Fig. 8.

Fig. 10. Detecting the ZIP logo in a new image. (a) The image, (b) the segmentation results, (c) the ARG representation, (d) the detected subgraph, and (e) the image segments corresponding to the detected subgraph.

Although the proposed approach was applied only to two-dimensional images in the experiments, it is suitable for general spatial pattern learning and discovery, because an ARG can represent data in a space of any dimension. In addition, our approach can be used for feature selection. By representing the instantiations of feature elements as the nodes of the sample ARGs, we can not only discover the dominant feature space but also capture the relationships between the selected features.

Future work will extend the proposed methodology and theory to temporal-spatial pattern modelling and apply them to real applications (e.g., gene function modelling and detection, network flow modelling, multimodal human-computer interaction, content-based image retrieval, depth information recovery from multiple images, face detection and recognition, etc.).



References

[1] Almohamad, H. A., and S. O. Duffuaa, A Linear Programming Approach for the Weighted Graph Matching Problem, IEEE Trans. Pattern Analysis and Machine Intelligence 15 (1993), 522-525.

[2] Barrow, H. G., and R. J. Popplestone, Relational Descriptions in Picture Processing, Machine Intelligence 6 (1971), 377-396.

[3] Bertsekas, D. P., Nonlinear Programming, Athena Scientific, Belmont, MA, 1995.

[4] Besl, P. J., and R. C. Jain, Three-dimensional object recognition, Computing Surveys, 17 (1985), 75-145.

[5] Bhanu, B., and O. D. Faugeras, Shape Matching of Two-Dimensional Objects, IEEE Trans. Pattern Analysis and Machine Intelligence, 6 (1984), 137-156.

[6] Bledsoe, W. W., and I. Browning, Pattern recognition and reading by machine, Proc. Eastern Joint Computer Conference, 16 (1959), 225-232.

[7] Christmas, W. J., J. Kittler, and M. Petrou, Structural matching in computer vision using probabilistic relaxation, IEEE Trans. Pattern Analysis and Machine Intelligence 17 (1995), no. 8, 749-764.

[8] Cover, T. M., and J. A. Thomas, Elements of Information Theory, New York: Wiley, 1991.

[9] Dempster, A. P., N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Royal Stat. Soc. Ser. B, 39 (1977), no. 1, 1-38.

[10] Felzenszwalb, P. F., and D. P. Huttenlocher, Image segmentation using local variation, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (1998), 98-104.

[11] Frey, B. J., and N. Jojic, Transformed component analysis: Joint estimation of spatial transformations and image components, International Conference on Computer Vision, (1999).

[12] Garey, M. R., and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: W. H. Freeman and Company, 1979.

[13] Hong, P., and T. S. Huang, Extracting the recurring patterns from image, The 4th Asian Conference on Computer Vision, (2000), Taipei, Taiwan.

[14] Hong, P., R. Wang, and T. S. Huang, Learning patterns from images by combining soft decisions and hard decisions, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (2000), Hilton Head Island, South Carolina.

[15] Kittler, J., and J. Föglein, Contextual classification of multispectral pixel data, Image and Vision Computing, 2 (1984), 13-29.

[16] Li, S. Z., Matching: Invariant to translations, rotations and scale changes, Pattern Recognition 25 (1992), 583-594.

[17] Maron, O., and T. Lozano-Pérez, A framework for multiple-instance learning, Neural Information Processing Systems 10 (1998).

[18] Ozer, B., W. Wolf, and A. N. Akansu, A graph based object description for information retrieval in digital image and video libraries, in Proceedings of IEEE Workshop on Content-based Access of Image and Video Libraries, (1999), 79-83.

[19] Ratan, A. L., O. Maron, et al., A framework for learning query concepts in image classification, in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, (1999), 423-429.

[20] Rosenfeld, A., R. Hummel, and S. Zucker, Scene labeling by relaxation operations, IEEE Trans. Systems, Man and Cybernetics, 6 (1976), 420-433.

[21] Shapiro, L. G., and R. M. Haralick, Structural descriptions and inexact matching, IEEE Trans. Pattern Analysis and Machine Intelligence 3 (1981), 504-519.

[22] Tanner, M. A., and W. H. Wong, The calculation of posterior distributions by data augmentation (with discussion), Journal of the American Statistical Association, 82, 805-811.

[23] Tsai, W. H., and K. S. Fu, Error-correcting isomorphism of attributed relational graphs for pattern analysis, IEEE Trans. Sys., Man and Cyb. 9 (1979), 757-768.

[24] Umeyama, S., An eigen-decomposition approach to weighted graph matching problems, IEEE Trans. Pattern Analysis and Machine Intelligence, 10 (1988), 695-703.

[25] Wells, W., Statistical approaches to feature-based object recognition, International Journal of Computer Vision, 19 (1997), 63-98.

[26] Wilson, R. C., and E. R. Hancock, Structural matching by discrete relaxation, IEEE Trans. Pattern Analysis and Machine Intelligence 19 (1997), no. 6, 634-648.

[27] Zhu, S. C., and C. E. Guo, Conceptualization and modeling of visual patterns, International Workshop on Perceptual Organization in Computer Vision, (2001), Vancouver, Canada.

[28] Zhu, S. C., Y. N. Wu, and D. B. Mumford, Minimax Entropy Principle and Its Applications to Texture Modeling, Neural Computation 9 (1997), 1627-1660.



APPENDIX

A Simplify the Maximum-Likelihood Function

We rewrite (14) as

Q(\Theta; \Theta^{(n)}) = \sum_{i=1}^{S} \sum_{q_i=1}^{W} P(q_i \mid G_i, \Theta^{(n)}) \, (L_1 + L_2 + L_3) \qquad (A.1)

where

L_1 = \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \, \log P(q_i) \qquad (A.2)

L_2 = \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \sum_{m=1}^{U_i} \log\bigl( p(o_{im} \mid \omega_{q_i y_{im}}) \, P(y_{im} \mid q_i) \bigr) \qquad (A.3)

L_3 = \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \log p(r_{icd} \mid \psi_{q_i y_{ic} y_{id}}) \qquad (A.4)

We then simplify the above three terms one by one, repeatedly using the fact that \sum_{y_{ik}=0}^{N_{q_i}} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) = 1.

L1 = logP (qi)

Nqi∑yi1=0

· · ·Nqi∑

yiUi=0

Ui∏k=1

P (yik|Gi,Θ(n)qi

)

= logP (qi)Ui∏

k=1

Nqi∑yik=0

P (yik|Gi,Θ(n)qi

)

= logP (qi)

(A.5)



L_2 = \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \sum_{m=1}^{U_i} \log\bigl( p(o_{im} \mid \omega_{q_i y_{im}}) \, P(y_{im} \mid q_i) \bigr)

= \sum_{m=1}^{U_i} \sum_{y_{im}=0}^{N_{q_i}} \log\bigl( p(o_{im} \mid \omega_{q_i y_{im}}) \, P(y_{im} \mid q_i) \bigr) P(y_{im} \mid G_i, \Theta_{q_i}^{(n)}) \Bigl[ \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{i,m-1}=0}^{N_{q_i}} \sum_{y_{i,m+1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1, k \neq m}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \Bigr]

= \sum_{m=1}^{U_i} \sum_{y_{im}=0}^{N_{q_i}} \log\bigl( p(o_{im} \mid \omega_{q_i y_{im}}) \, P(y_{im} \mid q_i) \bigr) P(y_{im} \mid G_i, \Theta_{q_i}^{(n)}) \prod_{k=1, k \neq m}^{U_i} \sum_{y_{ik}=0}^{N_{q_i}} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)})

= \sum_{m=1}^{U_i} \sum_{y_{im}=0}^{N_{q_i}} P(y_{im} \mid G_i, \Theta_{q_i}^{(n)}) \bigl[ \log p(o_{im} \mid \omega_{q_i y_{im}}) + \log P(y_{im} \mid q_i) \bigr] \qquad (A.6)

L_3 = \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \log p(r_{icd} \mid \psi_{q_i y_{ic} y_{id}})

= \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{y_{ic}=0}^{N_{q_i}} \sum_{y_{id}=0}^{N_{q_i}} P(y_{ic} \mid G_i, \Theta_{q_i}^{(n)}) \, P(y_{id} \mid G_i, \Theta_{q_i}^{(n)}) \log p(r_{icd} \mid \psi_{q_i y_{ic} y_{id}}) \times \Bigl[ \sum_{y_{i1}=0}^{N_{q_i}} \cdots \sum_{y_{i,c-1}=0}^{N_{q_i}} \sum_{y_{i,c+1}=0}^{N_{q_i}} \cdots \sum_{y_{i,d-1}=0}^{N_{q_i}} \sum_{y_{i,d+1}=0}^{N_{q_i}} \cdots \sum_{y_{iU_i}=0}^{N_{q_i}} \prod_{k=1, k \neq c,d}^{U_i} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)}) \Bigr]

= \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{y_{ic}=0}^{N_{q_i}} \sum_{y_{id}=0}^{N_{q_i}} P(y_{ic} \mid G_i, \Theta_{q_i}^{(n)}) \, P(y_{id} \mid G_i, \Theta_{q_i}^{(n)}) \log p(r_{icd} \mid \psi_{q_i y_{ic} y_{id}}) \times \prod_{k=1, k \neq c,d}^{U_i} \sum_{y_{ik}=0}^{N_{q_i}} P(y_{ik} \mid G_i, \Theta_{q_i}^{(n)})

= \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{y_{ic}=0}^{N_{q_i}} \sum_{y_{id}=0}^{N_{q_i}} P(y_{ic} \mid G_i, \Theta_{q_i}^{(n)}) \, P(y_{id} \mid G_i, \Theta_{q_i}^{(n)}) \log p(r_{icd} \mid \psi_{q_i y_{ic} y_{id}}) \qquad (A.7)

Finally, we can obtain



Q(\Theta; \Theta^{(n)}) = \sum_{i=1}^{S} \sum_{q_i=1}^{W} P(q_i \mid G_i, \Theta^{(n)}) \Bigl[ \log P(q_i) + \sum_{m=1}^{U_i} \sum_{y_{im}=0}^{N_{q_i}} P(y_{im} \mid G_i, \Theta_{q_i}^{(n)}) \log P(y_{im} \mid q_i) + \sum_{m=1}^{U_i} \sum_{y_{im}=0}^{N_{q_i}} P(y_{im} \mid G_i, \Theta_{q_i}^{(n)}) \log p(o_{im} \mid \omega_{q_i y_{im}}) + \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{y_{ic}=0}^{N_{q_i}} \sum_{y_{id}=0}^{N_{q_i}} P(y_{ic} \mid G_i, \Theta_{q_i}^{(n)}) \, P(y_{id} \mid G_i, \Theta_{q_i}^{(n)}) \log p(r_{icd} \mid \psi_{q_i y_{ic} y_{id}}) \Bigr]

= \sum_{i=1}^{S} \sum_{h=1}^{W} P_{q_i}(h \mid G_i, \Theta^{(n)}) \Bigl[ \log \alpha_h + \sum_{m=1}^{U_i} \sum_{\eta=0}^{N_h} P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) \log \beta_{h\eta} + \sum_{m=1}^{U_i} \sum_{\eta=0}^{N_h} P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) \log p(o_{im} \mid \omega_{h\eta}) + \sum_{c=1}^{U_i} \sum_{d=1}^{U_i} \sum_{\sigma=0}^{N_h} \sum_{\tau=0}^{N_h} P_{y_{ic}}(\sigma \mid G_i, \Theta_h^{(n)}) \, P_{y_{id}}(\tau \mid G_i, \Theta_h^{(n)}) \log p(r_{icd} \mid \psi_{h\sigma\tau}) \Bigr] \qquad (A.8)

where P_{q_i}(h \mid G_i, \Theta^{(n)}) = P(q_i = h \mid G_i, \Theta^{(n)}), P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) = P(y_{im} = \eta \mid G_i, \Theta_h^{(n)}), P(q_i = h) = \alpha_h, and P(y_{im} = \eta \mid q_i = h) = \beta_{h\eta}.

B Derive Expressions for Updating αh and βhη

First, we derive the updating expression for \alpha_h. We introduce the Lagrange multiplier \lambda with the constraint \sum_h \alpha_h = 1, and solve the following equation



\frac{\partial}{\partial \alpha_h} \Bigl[ Q(\Theta; \Theta^{(n)}) + \lambda \Bigl( \sum_{h=1}^{W} \alpha_h - 1 \Bigr) \Bigr]
= \frac{\partial}{\partial \alpha_h} \Bigl[ \sum_{i=1}^{S} \sum_{h=1}^{W} P_{q_i}(h \mid G_i, \Theta^{(n)}) \log \alpha_h + \lambda \Bigl( \sum_{h=1}^{W} \alpha_h - 1 \Bigr) \Bigr]
= \sum_{i=1}^{S} \frac{1}{\alpha_h} P_{q_i}(h \mid G_i, \Theta^{(n)}) + \lambda = 0

\Longrightarrow \sum_{h=1}^{W} \Bigl[ \sum_{i=1}^{S} P_{q_i}(h \mid G_i, \Theta^{(n)}) + \lambda \alpha_h \Bigr] = 0 \Longrightarrow \lambda = -S \Longrightarrow

\alpha_h = \frac{\sum_{i=1}^{S} P_{q_i}(h \mid G_i, \Theta^{(n)})}{S} \qquad (B.1)

Second, we derive the updating expression for \beta_{h\eta}. We introduce the Lagrange multiplier \lambda with the constraint \sum_\eta \beta_{h\eta} = 1, and solve the following equation

\frac{\partial}{\partial \beta_{h\eta}} \Bigl[ Q(\Theta; \Theta^{(n)}) + \lambda \Bigl( \sum_{\eta=0}^{N_h} \beta_{h\eta} - 1 \Bigr) \Bigr]
= \frac{\partial}{\partial \beta_{h\eta}} \Bigl[ \sum_{i=1}^{S} \sum_{h=1}^{W} \sum_{m=1}^{U_i} \sum_{\eta=0}^{N_h} P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) \, P_{q_i}(h \mid G_i, \Theta^{(n)}) \log \beta_{h\eta} + \lambda \Bigl( \sum_{\eta=0}^{N_h} \beta_{h\eta} - 1 \Bigr) \Bigr]
= \sum_{i=1}^{S} \sum_{m=1}^{U_i} \frac{1}{\beta_{h\eta}} P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) \, P_{q_i}(h \mid G_i, \Theta^{(n)}) + \lambda = 0

\Longrightarrow \sum_{\eta=0}^{N_h} \Bigl[ \sum_{i=1}^{S} \sum_{m=1}^{U_i} P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) \, P_{q_i}(h \mid G_i, \Theta^{(n)}) + \lambda \beta_{h\eta} \Bigr] = 0

\Longrightarrow \lambda = - \sum_{i=1}^{S} P_{q_i}(h \mid G_i, \Theta^{(n)}) \, U_i

\Longrightarrow \beta_{h\eta} = \frac{\sum_{i=1}^{S} \sum_{m=1}^{U_i} P_{y_{im}}(\eta \mid G_i, \Theta_h^{(n)}) \, P_{q_i}(h \mid G_i, \Theta^{(n)})}{\sum_{i=1}^{S} P_{q_i}(h \mid G_i, \Theta^{(n)}) \, U_i} \qquad (B.2)

C Derive the Updating Expressions for Gaussian Node PDFs and Gaussian Relation PDFs

If the node PDFs and relation PDFs are Gaussian, we can obtain analytical expressions for updating the parameters of the PDFs in the M-step. Basically, we take the derivatives of Q(\Theta, \Theta^{(n)}) with respect to the parameters of the PDFs, set the derivatives to zero, and solve the resulting equations.



Only the third term of Q(\Theta, \Theta^{(n)}) is related to the node PDFs. Substituting the Gaussian node PDF (18) into the third term of Q(\Theta, \Theta^{(n)}), we obtain

\sum_{i=1}^{S} \sum_{m=1}^{U_i} \sum_{h=1}^{W} \sum_{\eta=0}^{N_h} P_{q_i}(h \mid G_i, \Theta^{(n)}) \, P(\eta \mid G_i, \Theta_h^{(n)}) \log p(o_{im} \mid \omega_{h\eta})

= \sum_{i=1}^{S} \sum_{m=1}^{U_i} \sum_{h=1}^{W} \sum_{\eta=0}^{N_h} P_{q_i}(h \mid G_i, \Theta^{(n)}) \, P(\eta \mid G_i, \Theta_h^{(n)}) \times \Bigl[ -\frac{\varsigma}{2} \log 2\pi - \frac{1}{2} \log |\Sigma_{h\eta}| - \frac{1}{2} (\vec{a}_{im} - \vec{\mu}_{h\eta})^{T} \Sigma_{h\eta}^{-1} (\vec{a}_{im} - \vec{\mu}_{h\eta}) \Bigr] \qquad (C.1)

The above expression is quadratic in the Gaussian parameters, so solving the equation obtained by taking the derivative of (C.1) with respect to a parameter and setting the derivative to zero is a standard optimization problem [3]. We first take the derivative of (C.1) with respect to \vec{\mu}_{h\eta}, set it equal to zero, and obtain the updating expression for \vec{\mu}_{h\eta} as (19). Then, we take the derivative of (C.1) with respect to \Sigma_{h\eta}, set it equal to zero, and obtain the updating expression for \Sigma_{h\eta} as (20).

Similarly, only the fourth term of Q(\Theta, \Theta^{(n)}) is related to the relation PDFs. Substituting the Gaussian relation PDF (18) into the fourth term of Q(\Theta, \Theta^{(n)}), we obtain a quadratic expression with respect to the parameters of the Gaussian relation PDFs. Using the same method described above, we obtain the updating expressions for the parameters of the Gaussian relation PDFs as (22) and (23), respectively.
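Equations (19), (20), (22), and (23) are not reproduced in this appendix, so the following one-dimensional sketch only illustrates the generic form such derivations yield: a responsibility-weighted mean and variance. The function name and weighting scheme are our illustrative assumptions; the paper's updates are the multivariate analogue, with weights built from P(\eta \mid G_i, \Theta_h^{(n)}) and P_{q_i}(h \mid G_i, \Theta^{(n)}).

```python
def weighted_gaussian_update(values, weights):
    """Responsibility-weighted mean and variance for a 1-D Gaussian:
    the generic closed form obtained by differentiating a quadratic
    Q-term and setting the derivative to zero. Sketch only."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, var
```

With uniform weights this reduces to the ordinary sample mean and variance; unequal weights pull the estimates toward the observations with higher responsibility, which is exactly the behavior of the M-step updates.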


