IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 56, NO. 8, AUGUST 2008

Multi-Task Learning for Analyzing and Sorting Large Databases of Sequential Data

    Kai Ni, John Paisley, Lawrence Carin, Fellow, IEEE, and David Dunson

Abstract—A new hierarchical nonparametric Bayesian framework is proposed for the problem of multi-task learning (MTL) with sequential data. The models for multiple tasks, each characterized by sequential data, are learned jointly, and the intertask relationships are obtained simultaneously. This MTL setting is used to analyze and sort large databases composed of sequential data, such as music clips. Within each data set, we represent the sequential data with an infinite hidden Markov model (iHMM), avoiding the problem of model selection (selecting a number of states). Across the data sets, the multiple iHMMs are learned jointly in an MTL setting, employing a nested Dirichlet process (nDP). The nDP-iHMM MTL method allows simultaneous task-level and data-level clustering, with which the individual iHMMs are enhanced and the between-task similarities are learned. Therefore, in addition to improved learning of each of the models via appropriate data sharing, the learned sharing mechanisms are used to infer interdata relationships of interest for data search. Specifically, the MTL-learned task-level sharing mechanisms are used to define the affinity matrix in a graph-diffusion sorting framework. To speed up the MCMC inference for large databases, the nDP-iHMM is truncated to yield a nested Dirichlet-distribution-based HMM representation, which accommodates fast variational Bayesian (VB) analysis for large-scale inference, and the effectiveness of the framework is demonstrated using a database composed of 2500 digital music pieces.

Index Terms—Hierarchical Bayesian modeling, infinite hidden Markov model (iHMM), multi-task learning (MTL), variational Bayesian (VB).

    I. INTRODUCTION

MULTI-TASK LEARNING (MTL) [1] has attracted significant interest in the machine learning community over the last decade [2]–[5]. Much of the research on MTL has exploited ideas in Bayesian hierarchical modeling [6], and such techniques have been successfully applied to information retrieval [2] and computer vision [3]. In most previous MTL studies, a focus has been placed on data-level clustering; for example, the hierarchical Dirichlet process (HDP) [7] may be used to jointly learn mixture models for multiple tasks, and data are

Manuscript received September 19, 2007; revised March 25, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mark Coates.

K. Ni, J. Paisley, and L. Carin are with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA (e-mail: [email protected]; [email protected]; [email protected]).

D. Dunson is with the Department of Electrical and Computer Engineering, Institute of Statistics and Decision Sciences, Duke University, Durham, NC 27708 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TSP.2008.924798

shared between tasks as appropriate (sharing is at the data level, not at the overall task level). However, one may also be interested in task-level clustering, i.e., to learn overall task-level similarities (for example, to learn the degree to which multiple data sets are related to one another). Such inter-data-set relationships may be inferred by such models as the HDP, via measures [8] applied after the individual models are learned. However, such postprocessing model-similarity measures are often expensive to implement. The nested Dirichlet process (nDP) [9] has been developed as a hierarchical Bayesian tool for MTL and intertask sharing, and this framework is extended in the work presented here.

While most work in hierarchical Bayesian modeling addresses clustering multiple exchangeable data sets, little has been done to solve the MTL problem for sequential data. Hidden Markov models (HMMs) have been widely used to analyze sequential data, addressing problems in speech recognition [10], music analysis [8], and multiaspect target detection [11], among many others. In many scenarios, one may have limited sequential data for training. Rather than building HMMs for each task separately, it is desirable to identify relationships between tasks and share information appropriately, thus obtaining more accurate task-dependent models. In the context of HMM-based sequential-data analysis, a key issue involves development of a methodology for finding the appropriate model complexity, i.e., defining the appropriate number of HMM states. However, the data may not be represented by a single “correct” HMM structure, i.e., a fixed number of states. Rather than performing model selection [12] to select a fixed model structure, we employ a nonparametric Bayesian approach [7] in which the number of states is not fixed a priori; the model is termed an infinite HMM (iHMM). To learn multiple iHMMs simultaneously, one for each sequential data set, the base distributions of the iHMMs may be drawn from an nDP [9], thus allowing intertask clustering.

The nDP-iHMM introduced here represents a new hierarchical nonparametric Bayesian model for multi-task learning with sequential data, allowing simultaneous iHMM task-level clustering and data-level modeling. A Markov chain Monte Carlo (MCMC) inference engine has been adopted for the nDP-iHMM. However, such MCMC inference is impractical computationally when considering a large number of tasks. Therefore, another contribution of this paper is the development of a variational Bayesian (VB) [13] inference formulation, which allows relatively fast computation even for analysis of large-scale problems. It requires replacing a truncated iHMM with a Dirichlet-distribution-based HMM, which we call a nested Dirichlet distribution HMM (nDD-HMM).


As demonstrated here, the proposed sequential-data MTL setting scales well, with example results presented on a large digital-music database. Specifically, the nDD-HMM algorithm is employed to simultaneously learn models for a large number of sequential data sets; in addition to learning the individual models, a by-product of the analysis is an intertask affinity matrix, learned via the nDD-HMM sharing mechanisms. This matrix is subsequently normalized to define a random walk on a graph, with graph-diffusion analysis [14] then performed to rank and sort the sequential data. Example results are presented for a database of 2500 pieces of digital music.

The remainder of the paper is organized as follows. In Section II, we briefly review the HDP construction of the iHMM. We introduce the nDP-iHMM framework in Section III, to solve MTL for sequential data. The truncated form of the nDP-iHMM, the nDD-HMM, and its VB inference algorithm are described in Section IV. Experimental results for both synthetic data and a large set of digital music are presented in Section V, with the learned intertask sharing mechanism feeding into a graph-diffusion-based data-search framework. Conclusions are provided in Section VI.

II. THE INFINITE HIDDEN MARKOV MODEL FOR A SINGLE TASK

HMMs have been used extensively for modeling sequential data. A data sequence of length $T$ generated by an HMM yields a sequence of observations $\mathbf{y} = \{y_1, \dots, y_T\}$ and a sequence of hidden underlying states $\mathbf{s} = \{s_1, \dots, s_T\}$, the latter following a Markov process. Consider an HMM with $M$ states and $V$ possible observations, for which the parameters of the model are $\lambda = \{\boldsymbol{\pi}_0, A, B\}$, with $\boldsymbol{\pi}_0$ being the initial-state probability, $A$ the transition matrix representing $p(s_t \mid s_{t-1})$, and $B$ the observation matrix representing $p(y_t \mid s_t)$. HMMs are typically learned via the expectation-maximization (EM) method, implemented by the Baum–Welch algorithm [10] or the variational Bayesian method [13]. However, in both approaches the model structure must be specified initially, i.e., the number of states is fixed. Learning the correct model complexity requires expensive model selection, and in some applications there may exist no such fixed “correct” model (the limited sequential data for a given problem may be best represented via an ensemble of HMMs, with different numbers of states). Here, we address this problem by considering an HMM with a countably infinite state space, namely the iHMM. Beal et al. [15] first proposed the iHMM, and later Teh et al. [7] demonstrated that the iHMM can be recast in the hierarchical Dirichlet process (HDP) framework. Below, we start with a finite-state HMM, then extend this to an infinite state space, and describe how the HDP may be used to constitute the iHMM.

An $M$-state HMM is a set of $M$ coupled mixture models, each with $M$ mixture components that are state-dependent observation models. Given the previous hidden state $s_{t-1} = j$, the conditional distribution of the next observation is

$$p(y_t \mid s_{t-1} = j) = \sum_{k=1}^{M} a_{jk}\, p(y_t \mid s_t = k)$$

where $a_{jk}$ is the $(j,k)$ element in the transition matrix $A$, representing the probability of transitioning from the $j$th state to the $k$th state, and $p(y_t \mid s_t = k)$ is the state-dependent observation model. Therefore, the previous state $s_{t-1}$ indexes a specific row of the transition matrix serving as the mixture weights for choosing the current state $s_t$, and the state-dependent observation models serve as the mixture components generating $y_t$. Note that for different possible visited states at time $t-1$ (different states $s_{t-1}$), the mixture models generating $y_t$ share the same set of mixture components but with different mixture weights.

When considering an infinite state space, the aforementioned mixture models become infinite mixture models, and it is natural to use Dirichlet process mixture models [16] to describe such mixture models with unbounded components. Assume the $j$th infinite mixture model $G_j$ in the iHMM is drawn from a Dirichlet process [17] $\mathrm{DP}(\alpha, G_0)$, where $\alpha$ is a positive real number and $G_0$ is the base distribution for the parameters $\theta$. Ferguson [17] showed that $G_j$ is discrete with probability one, a property made explicit by the stick-breaking construction [18]

$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\,\delta_{\theta_k}, \quad \pi_{jk} = \pi'_{jk} \prod_{l<k} (1 - \pi'_{jl}), \quad \pi'_{jk} \sim \mathrm{Beta}(1, \alpha), \quad \theta_k \sim G_0 \tag{1}$$

where $\delta_{\theta}$ is a discrete measure concentrated at $\theta$ and the procedure of generating $\boldsymbol{\pi}_j = (\pi_{j1}, \pi_{j2}, \dots)$ is denoted $\mathrm{Stick}(\alpha)$. Here $\boldsymbol{\pi}_j$ corresponds to the transition probabilities starting from the $j$th state and the $\theta_k$ are the parameters of the observation model given allocation to component $k$. To deal with the countably infinite state space, the iHMM takes the form of a hierarchical Dirichlet process [7]

$$G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H), \qquad G_j \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \quad j = 1, 2, \dots \tag{2a}$$

$$\theta'_t \mid s_{t-1} \sim G_{s_{t-1}}, \qquad y_t \mid \theta'_t \sim F(\theta'_t), \qquad \text{for } t = 1, \dots, T \tag{2b}$$

where $G_j$ is the infinite mixture model for state $j$, $H$ is the base distribution on the parameters $\theta$, and $F(\theta)$ is the conditional distribution of an observation given the parameter $\theta$ (e.g., for continuous observations $F(\theta)$ may represent a Gaussian with mean and covariance defined by $\theta$, while for discrete observations $\theta$ may represent the parameters of a multinomial distribution). We also note that after the Markovian state transition from time $t-1$ to time $t$, we transit from $s_{t-1}$ to the discrete state $s_t$ with which the atom $\theta'_t$ is associated, and the active infinite mixture model becomes $G_{s_t}$.

A key component of the iHMM is that the base DP distribution $G_0$, from which the atoms $\theta_k$ in all $G_j$ are drawn, is itself a draw from a separate DP. This guarantees that the state-dependent observation parameters $\theta_k$ are identical for all $G_j$, as required for an HMM. Hence, the discreteness of $G_0$ [as shown in (1)] is a key property exploited here. To make these properties explicit, note that (2a) and (2b) imply that

$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\,\delta_{\theta_k}, \qquad \boldsymbol{\pi}_j \mid \alpha, \boldsymbol{\beta} \sim \mathrm{DP}(\alpha, \boldsymbol{\beta}) \tag{3}$$


Fig. 1. Infinite hidden Markov model for single-task learning. For observation $y_t$, $s_{t-1}$ defines the infinite mixture model to be used and $s_t$ selects the mixture component according to the infinite-dimensional weight vector $\boldsymbol{\pi}_{s_{t-1}}$.

where $\boldsymbol{\beta}$ and $\boldsymbol{\pi}_j$ are both infinite-dimensional vectors of probabilities that sum to one almost surely, and $G_0 = \sum_{k=1}^{\infty} \beta_k\,\delta_{\theta_k}$ with $\theta_k \sim H$. While the HDP construction of $\boldsymbol{\beta}$ and $\boldsymbol{\pi}_j$ realizes desired properties concerning the consistency of the state-dependent observation models in the iHMM, it also yields a fairly complicated form for the state-transition parameters. As discussed further below, the hierarchical form in (3) undermines the ability to perform VB inference. Hence, we will subsequently refine the form of $\boldsymbol{\pi}_j$, to preserve the desired discreteness properties, while imposing a simpler structure that yields an expression similar to (3), albeit with the opportunity to perform fast VB inference.
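To make the constructions in (1)–(3) concrete, the following sketch draws a truncated version of the shared base weights $\boldsymbol{\beta} \sim \mathrm{Stick}(\gamma)$ and then draws each state's transition row $\boldsymbol{\pi}_j \sim \mathrm{DP}(\alpha, \boldsymbol{\beta})$, which for a finite base reduces to a Dirichlet distribution with mean $\boldsymbol{\beta}$ (the property exploited in Section IV). This is an illustrative sketch, not the authors' code; the truncation level, hyperparameter values, and variable names are assumptions:

import numpy as np

def stick_breaking(concentration, num_sticks, rng):
    # Truncated Stick(concentration) draw: weights over num_sticks components
    fractions = rng.beta(1.0, concentration, size=num_sticks)
    fractions[-1] = 1.0  # close the final stick so the weights sum to one
    leftover = np.concatenate([[1.0], np.cumprod(1.0 - fractions[:-1])])
    return fractions * leftover

rng = np.random.default_rng(0)
gamma, alpha, L = 2.0, 5.0, 12            # concentrations and truncation level (illustrative values)

beta = stick_breaking(gamma, L, rng)      # shared base weights over the L states, as in (1)
# Each row pi_j ~ DP(alpha, beta); with a finite base this equals Dir(alpha * beta).
# A tiny floor is added only for numerical stability of very small components.
pi = rng.dirichlet(alpha * beta + 1e-6, size=L)

print(beta.round(3))
print(pi.sum(axis=1))                     # every transition row sums to one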

Combining (2) and (3), the graphical form of an iHMM for a single set of sequential data (single-task learning, or STL) is shown in Fig. 1, with parameters defined as

$$y_t \mid s_t \sim F(\theta_{s_t}), \qquad s_t \mid s_{t-1} \sim \mathrm{Mult}(\boldsymbol{\pi}_{s_{t-1}})$$
$$\boldsymbol{\pi}_j \mid \alpha, \boldsymbol{\beta} \sim \mathrm{DP}(\alpha, \boldsymbol{\beta}), \qquad \boldsymbol{\beta} \mid \gamma \sim \mathrm{Stick}(\gamma), \qquad \theta_k \sim H$$

where $\boldsymbol{\pi}_j$ corresponds to the $j$th row of the transition matrix and $F(\theta_k)$ is the conditional distribution of the observation given membership in state $k$. Each observation $y_t$ is represented with a fixed-dimensional feature vector, and the feature vector is assumed to be generated from a signal model $F(\cdot)$. If there are multiple tasks, it has been suggested by [7] that one could induce an additional level in the Bayesian hierarchy, letting a master Dirichlet process couple each of the iHMMs. Such a framework borrows information by allowing two tasks to have identical iHMMs, instead of allowing tasks to have separate but dependent iHMMs, as will be considered below.
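Given weights and observation models like those above, simulating from the resulting HMM is a direct translation of the mixture-of-mixtures view: the previous state selects a row of the transition matrix (the mixture weights) and the state-dependent observation model generates the observation. The sketch below uses discrete (codebook) emissions and arbitrary sizes; it is an illustration under our own assumptions, not the authors' implementation:

import numpy as np

def simulate_hmm(pi0, A, B, T, rng):
    # Simulate T steps from a discrete HMM with initial probabilities pi0,
    # transition matrix A (rows = mixture weights) and emission matrix B.
    states, obs = np.empty(T, dtype=int), np.empty(T, dtype=int)
    s = rng.choice(len(pi0), p=pi0)
    for t in range(T):
        states[t] = s
        obs[t] = rng.choice(B.shape[1], p=B[s])   # state-dependent observation model
        s = rng.choice(A.shape[1], p=A[s])        # next state drawn from row s of A
    return states, obs

rng = np.random.default_rng(1)
L, V, T = 12, 32, 40                              # states, codebook size, sequence length (illustrative)
A = rng.dirichlet(np.full(L, 0.5), size=L)        # stand-in for the transition rows pi_j
B = rng.dirichlet(np.full(V, 0.5), size=L)        # state-dependent multinomial emissions
pi0 = np.full(L, 1.0 / L)
states, obs = simulate_hmm(pi0, A, B, T, rng)
print(obs)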

III. MULTI-TASK LEARNING WITH INFINITE HIDDEN MARKOV MODELS

In Section II, we described how to construct an iHMM for a single set of sequential data (single-task learning). Now we consider a learning problem in which we have sequential data collected from $J$ different but possibly related tasks, $\{\mathbf{y}^{(j)}\}_{j=1}^{J}$, where $\mathbf{y}^{(j)} = \{y^{(j)}_1, \dots, y^{(j)}_{T_j}\}$ is the sequential data from task $j$. For example, each $\mathbf{y}^{(j)}$ may represent the observation sequence of features extracted from the $j$th music clip, from a set of $J$ music clips. We here assume a single sequence with fixed length for each task (music piece), but this can easily be generalized to multiple sequences with different observation lengths.

Our goal is to learn accurate iHMMs for each of the tasks as well as to learn the intertask similarities, the latter used for searching and sorting. For example, one may wish to learn which musical pieces are similar to one another. A naive way to do this is to treat each task separately and learn the individual iHMMs, then apply a post-model-development similarity-measure computation. However, since these tasks may be related, the data from one task may potentially be useful to help build the models for other tasks, which can effectively reduce the amount of required data per task to achieve a particular performance (this is of particular value when there are limited data per task). We wish to discover the appropriate sharing structure between tasks and use the sharing information to enhance the learned iHMMs. Meanwhile, we wish to exploit the sharing structure to obtain the intertask relationships directly when performing the model learning, without requiring the extra step of a postlearning similarity computation.

We propose a new hierarchical nonparametric Bayesian model for the aforementioned sequential-data MTL. At the bottom level each task is modeled with an iHMM as described in Section II, and at the top level the data in the tasks are shared appropriately by imposing an nDP [9] prior on the base distributions of the iHMMs, yielding the nDP-iHMM. There are several appealing properties of the proposed model: i) the problem of selecting an appropriate number of HMM states is avoided; ii) the similarity measurement between tasks is obtained directly; iii) the state-dependent observation models are shared among related tasks and model learning is improved; and iv) upon sharing, each task still maintains its own transition matrix, which can be used to distinguish the tasks.

    A. The Nested Dirichlet Process

The nDP has been proposed [9] to perform intratask and intertask clustering simultaneously. Assume there are $J$ related data sets, with each containing $n_j$ exchangeable observations $\{y_{ji}\}$. The parameter $\theta_{ji}$ associated with $y_{ji}$ is assumed to be drawn from a group distribution $G_j$, and all $G_j$ are linked via an nDP:

$$G_j \sim Q, \qquad Q = \sum_{k=1}^{\infty} w_k\,\delta_{G_k^*} \tag{4}$$

$$G_k^* = \sum_{l=1}^{\infty} \beta_{kl}\,\delta_{\theta_{kl}^*}, \qquad \theta_{kl}^* \sim H \tag{5}$$

with $\mathbf{w} \sim \mathrm{Stick}(\eta)$ and $\boldsymbol{\beta}_k \sim \mathrm{Stick}(\gamma)$. Equation (4) implies that the distribution $Q$ is a stick-breaking process, in which the atoms $G_k^*$ are themselves stick-breaking processes drawn from $\mathrm{DP}(\gamma, H)$. Since $\Pr(G_j = G_{j'}) = 1/(1+\eta) > 0$, the model induces clustering in the space of distributions. Also, the stick-breaking construction of $G_k^*$ ensures that marginally $G_j \sim \mathrm{DP}(\gamma, H)$ for every $j$. In the implementation we will also place Gamma


priors on $\eta$ and $\gamma$, to allow the data to inform more strongly about clustering.

It has been shown [9] that the prior correlation between two distributions $G_j$ and $G_{j'}$ is $\mathrm{Corr}(G_j, G_{j'}) = 1/(1+\eta)$. In addition, the correlation between draws from the process can be calculated from (4) and (5), yielding

$$\mathrm{Corr}(\theta_{ji}, \theta_{ji'}) = \frac{1}{1+\gamma}, \qquad \mathrm{Corr}(\theta_{ji}, \theta_{j'i'}) = \frac{1}{(1+\eta)(1+\gamma)}, \quad j \neq j'.$$

The above indicates that the a priori correlation between observations coming from the same group is larger than the correlation between observations coming from different groups, which is an appealing feature. Therefore, the nDP model simultaneously enables clustering the observations across groups as well as clustering the distributions at the task level. This is different from the HDP, in which only data-level clustering across groups is considered, and task-level clustering does not occur (the task-level clustering in the HDP may be implied indirectly).
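The nested structure of (4) and (5) can be illustrated under truncation: a top-level stick-breaking draw assigns each task to one of a finite set of candidate distributions, and each candidate is itself a truncated stick-breaking measure over atoms from $H$. The truncation levels, the Gaussian base measure, and all names below are illustrative assumptions, not the paper's settings:

import numpy as np

def stick_breaking(concentration, num_sticks, rng):
    v = rng.beta(1.0, concentration, size=num_sticks)
    v[-1] = 1.0
    return v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])

rng = np.random.default_rng(2)
J, K, L = 12, 10, 20                # tasks, top-level and bottom-level truncations (illustrative)
eta, gamma = 1.0, 1.0               # nDP concentrations

w = stick_breaking(eta, K, rng)                  # eq. (4): weights over the candidate distributions G*_k
atoms = rng.normal(0.0, 3.0, size=(K, L))        # eq. (5): atoms theta*_{kl} drawn iid from H (Gaussian here)
base_weights = np.array([stick_breaking(gamma, L, rng) for _ in range(K)])  # weights of each G*_k

z = rng.choice(K, size=J, p=w)                   # task j picks G*_{z_j}; ties induce task-level clusters
print("task-to-cluster assignments:", z)
print("number of occupied clusters:", len(np.unique(z)))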

    B. The nDP-iHMM

Based on the iHMM and the nDP, we propose a new hierarchical model to address the MTL problem for sequential data: the nDP-iHMM. In the nDP-iHMM, each set of sequential data is modeled by an iHMM and those iHMMs are properly shared via an nDP prior. To better understand the MTL setting, reconsider (2a). Each of the $J$ tasks has its own set of DP mixture models $G_i^{(j)}$ for $i = 1, 2, \dots$. To encourage sharing among the tasks, the collection of base distributions $\{G_0^{(j)}\}_{j=1}^{J}$ for the iHMMs are drawn from an nDP, implying the state-dependent observation statistics can be shared across tasks:

$$G_0^{(j)} \sim Q, \qquad Q = \sum_{k=1}^{\infty} w_k\,\delta_{G_k^*}, \qquad \mathbf{w} \sim \mathrm{Stick}(\eta), \qquad G_k^* \sim \mathrm{DP}(\gamma, H) \tag{6a}$$

$$G_i^{(j)} \mid \alpha, G_0^{(j)} \sim \mathrm{DP}(\alpha, G_0^{(j)}), \qquad \theta'^{(j)}_t \mid s^{(j)}_{t-1} \sim G^{(j)}_{s^{(j)}_{t-1}}, \qquad y^{(j)}_t \mid \theta'^{(j)}_t \sim F(\theta'^{(j)}_t), \quad t = 1, \dots, T_j \tag{6b}$$

where the superscript $(j)$ denotes the task, $T_j$ is the associated sequence length, $\mathbf{w}$ is the vector constituting the weights of $Q$, and the remaining variables are analogous to the STL case in (2). We note that to address MTL with task-level sharing, (2a) is replaced by (6a), with the remaining (6b) essentially the same as (2b), only repeated $J$ times for the $J$ tasks. A graphical model of the nDP-iHMM is shown in Fig. 2.

Two tasks $j$ and $j'$ share the same observation models (mixture components defined in the base distribution) if $G_0^{(j)} = G_0^{(j')} = G_k^*$ for some $k$. Note that even though the base distributions $G_0^{(j)}$ and $G_0^{(j')}$ are identical, the consequent iHMMs can still be different, because the rows of the transition matrices are random draws from a Dirichlet process prior centered on the mixture weights of the base distribution. Since $\Pr(G_0^{(j)} = G_0^{(j')}) = 1/(1+\eta) > 0$, the model induces clustering at the task level, and the DP construction of $G_k^*$ provides the desired data-level clustering (at the level of states within the HMM). Therefore, the nDP-iHMM simultaneously enables clustering the observations at the data level as well as clustering the distributions at the task level.

Fig. 2. Graphical representation of the nDP-iHMM. The base distributions are shared via an nDP and the consequent iHMMs are independently generated given the base distribution.

The hyperparameters reflect the prior knowledge of how similar the tasks are. As $\eta \to \infty$, each base distribution $G_0^{(j)}$ is assigned to a different $G_k^*$, so that all of the atoms in the base distributions are task-specific. In this case, borrowing of information across the tasks relies entirely on shrinkage occurring under $H$, which may be sensitive to the specification of $H$. On the other hand, as $\eta \to 0$, all the base distributions are identical, $G_0^{(1)} = \cdots = G_0^{(J)}$, so that a type of pooling occurs and borrowing of information across the tasks is strong. Moreover, as $\gamma \to 0$, each iHMM degenerates to a single-state HMM and the model reduces to parametric-based clustering.

    C. MCMC Inference for nDP-iHMM

The inference for the nDP-iHMM is based on a Gibbs sampler. To simplify the discussion, we consider a single sequence of observations from each task. Let $\mathbf{y}^{(j)} = \{y^{(j)}_1, \dots, y^{(j)}_{T_j}\}$ represent the observation sequence from task $j$ and let the indicator variable $z_j$ denote the atom for which $G_0^{(j)} = G_{z_j}^*$. The nDP-iHMM mixture model can be written as follows:

$$y^{(j)}_t \mid s^{(j)}_t, z_j \sim F(\theta_{z_j, s^{(j)}_t}), \qquad s^{(j)}_t \mid s^{(j)}_{t-1} \sim \mathrm{Mult}(\boldsymbol{\pi}^{(j)}_{s^{(j)}_{t-1}})$$
$$\boldsymbol{\pi}^{(j)}_i \mid \alpha, \boldsymbol{\beta}_{z_j} \sim \mathrm{DP}(\alpha, \boldsymbol{\beta}_{z_j}), \qquad \theta_{kl} \sim H$$
$$\boldsymbol{\beta}_k \mid \gamma \sim \mathrm{Stick}(\gamma)$$
$$z_j \mid \mathbf{w} \sim \mathrm{Mult}(\mathbf{w}), \qquad \mathbf{w} \mid \eta \sim \mathrm{Stick}(\eta) \tag{7}$$

We employ a stick-breaking process [18] for the top-level nDP and a Chinese restaurant process [19] for the bottom-level iHMM. Similar to the work in [20], we truncate the top-level stick-breaking representation to $K$ components. The initialization of the inference includes the nDP mixture weights $\mathbf{w}$, the center indexes $\{z_j\}$, the hidden state sequences $\{\mathbf{s}^{(j)}\}$, the iHMM parameters $\mathrm{iHMM}^{(j)}$ for $j = 1, \dots, J$, and the hyperparameters $\eta$ and $\gamma$. The iHMM parameter set $\mathrm{iHMM}^{(j)}$ includes the transition matrix $A^{(j)}$, the observation matrix $B_{z_j}$ (shared by all tasks assigned to the same component), and the initialization probabilities $\boldsymbol{\pi}_0^{(j)}$. We put Gamma priors on the hyperparameters and repeat the following steps until the Gibbs sampler converges; a code schematic of steps 1) and 5) is sketched after the list.

1) Draw a new center index $z_j$ for every task $j$, with the conditional probability of $z_j = k$ proportional to $w_k$ times the likelihood of $\mathbf{y}^{(j)}$ under the iHMM associated with component $k$.

2) For those tasks whose center indexes have changed ($z_j^{\mathrm{new}} \neq z_j^{\mathrm{old}}$), generate a new hidden state sequence using the parameters of the newly assigned iHMM. Recalculate the unique HMM transition matrix $A^{(j)}$ for $j = 1, \dots, J$ and the shared HMM observation matrix $B_k$ for $k = 1, \dots, K$. Both the $A^{(j)}$'s and the $B_k$'s are represented as counting matrices.

3) Sample a new hidden state sequence for each iHMM from its conditional posterior, which takes a Chinese-restaurant-process form with one expression for transitions to previously visited states and another for transitions to a new state. Here $\mathbf{s}^{(j)}_{-t}$ denotes the hidden state sequence excluding the previously sampled value $s^{(j)}_t$, $n^{(j)}_{kl}$ is the count of transitions from state value $k$ to state value $l$ in task $j$, and the mixing weight for each state given the task is computed using $\boldsymbol{\beta}_{z_j}$.

4) Update the transition counts using the sampled state sequences. Sample new mixture components (observation parameters) according to their posterior distributions. Generate new iHMM parameters from this pair of transition statistics and mixture components.

5) Sample the nDP mixture weights from their stick-breaking conditionals, $w'_k \sim \mathrm{Beta}\big(1 + m_k,\ \eta + \sum_{k' > k} m_{k'}\big)$, where $m_k$ is the number of tasks currently assigned to component $k$.

6) Update the Gamma priors on the hyperparameters, and sample new $\eta$ and $\gamma$.
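The sketch below illustrates steps 1) and 5) of the loop: the task-to-cluster indicators are resampled proportionally to the stick weights times the per-cluster likelihoods, and the truncated stick weights are then resampled from their Beta conditionals as in [20]. The per-task, per-cluster log-likelihoods are supplied here as a random placeholder matrix, and the remaining HMM-specific steps are omitted; all names and values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(3)
J, K, eta = 10, 5, 1.0                        # tasks, truncation, top-level concentration (illustrative)
loglik = rng.normal(size=(J, K))              # placeholder for log p(y^{(j)} | iHMM_k) at the current state

w = np.full(K, 1.0 / K)                       # current stick weights
for sweep in range(50):                       # a few Gibbs sweeps over steps 1) and 5)
    # Step 1): resample each center index z_j proportional to w_k * p(y^{(j)} | iHMM_k)
    logpost = np.log(np.maximum(w, 1e-300)) + loglik
    probs = np.exp(logpost - logpost.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=probs[j]) for j in range(J)])

    # Step 5): resample the truncated stick-breaking weights from their Beta conditionals
    counts = np.bincount(z, minlength=K)
    tail = np.concatenate([np.cumsum(counts[::-1])[::-1][1:], [0]])   # tasks assigned beyond component k
    v = rng.beta(1.0 + counts, eta + tail)
    v[-1] = 1.0
    w = v * np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])

print("cluster sizes:", np.bincount(z, minlength=K))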

IV. NESTED DIRICHLET DISTRIBUTION HMM AND VARIATIONAL INFERENCE

The aforementioned Gibbs sampler for the nDP-iHMM involves simple steps for sampling from standard distributions and is relatively simple to implement. However, such an MCMC inference algorithm is computationally prohibitive when considering a large number of tasks. Therefore, we seek to develop a fast VB inference algorithm to learn our MTL model.

Let $\boldsymbol{\beta}_k \sim \mathrm{Stick}(\gamma)$ represent the probability weights on the atoms within $G_k^*$, and $\boldsymbol{\pi}^{(j)}_i \sim \mathrm{DP}(\alpha, \boldsymbol{\beta}_k)$ represent the probability weights associated with $G^{(j)}_i$ when $z_j = k$. Unfortunately, within the graphical model of the nDP-iHMM, the terms associated with $\boldsymbol{\beta}_k$, the likelihood $\mathrm{DP}(\alpha, \boldsymbol{\beta}_k)$ and the prior $\mathrm{Stick}(\gamma)$, are both beta-distributed, and hence there is no conjugacy or analytical update form for $\boldsymbol{\beta}_k$. Therefore, while MCMC inference for (7) is possible, a direct VB analysis is not. To address this problem, one may perform an approach like Monte Carlo EM [21] to sample $\boldsymbol{\beta}_k$ inside each VB iteration. However, Monte Carlo EM is very expensive computationally given the need to run an MCMC chain within each VB step.

Recall from the discussion above that each iHMM state $i$ has its own multinomial distribution $\mathrm{Mult}(\boldsymbol{\pi}_i)$, where the set of vectors $\{\boldsymbol{\pi}_i\}$ is drawn iid from a Dirichlet process $\mathrm{DP}(\alpha, \boldsymbol{\beta})$, with $\boldsymbol{\beta}$ drawn from a stick-breaking process $\mathrm{Stick}(\gamma)$. The transition from state $i$ to state $i'$ occurs with probability $\pi_{ii'}$, and the observation $y_t$ is drawn from $F(\theta_{s_t})$, where the set of atoms $\{\theta_k\}$ are drawn iid from the base distribution $H$. In this formulation we therefore note that the state-transition parameters $\boldsymbol{\pi}_i$ are controlled by two scalar parameters, $\alpha$ and $\gamma$, and it is this two-parameter dependency that causes the difficulty with a VB implementation.

In a truncated stick-breaking representation [20] of the iHMM, the iHMM is truncated to $L$ states, with improving accuracy with increasing $L$. The finite set of atoms $\{\theta_l\}_{l=1}^{L}$ are again drawn iid from $H$. In the truncated iHMM each of the weight vectors in the set $\{\boldsymbol{\pi}_i\}_{i=1}^{L}$ represents an $L$-dimensional probability weight (rather than infinite-dimensional probability weights, as in the untruncated iHMM), drawn as $\boldsymbol{\pi}_i \sim \mathrm{DP}(\alpha, \boldsymbol{\beta})$. As discussed in [18], a draw from a Dirichlet distribution, $\mathrm{Dir}(\alpha\beta_1, \dots, \alpha\beta_L)$, for an $L$-dimensional base probability weight $\boldsymbol{\beta}$, may be constructed as

$$\boldsymbol{\pi}_i = \sum_{m=1}^{\infty} v_m\,\mathbf{e}_{Z_m}, \qquad \mathbf{v} \sim \mathrm{Stick}(\alpha), \qquad Z_m \sim \mathrm{Mult}(\boldsymbol{\beta})$$

where $\mathbf{e}_l$ denotes the $l$th standard basis vector; this is the same construction as that associated with $\mathrm{DP}(\alpha, \boldsymbol{\beta})$. Therefore, in the context of the truncated stick-breaking representation of the iHMM, with $L$ sticks/states, the state-transition parameters are drawn from a Dirichlet distribution. Hence, the draw from $\mathrm{DP}(\alpha, \boldsymbol{\beta})$, for a finite $L$-dimensional probability weight $\boldsymbol{\beta}$, is equivalent to a draw from $\mathrm{Dir}(\alpha\beta_1, \dots, \alpha\beta_L)$. The two-layer form of the truncated iHMM implies that the $L$-dimensional probability weight $\boldsymbol{\beta}$, the base of the Dirichlet distribution, is drawn from a truncated version of $\mathrm{Stick}(\gamma)$; it is the dependence of $\boldsymbol{\pi}_i$ on $\alpha$ and $\gamma$ that undermines a VB implementation. We therefore here propose the following formulation change. The probability weights $\boldsymbol{\pi}_i$ are now drawn from $\mathrm{Dir}(\alpha/L, \dots, \alpha/L)$, therefore employing a fixed base $\mathbf{g} = (1/L, \dots, 1/L)$ (rather than a draw from a truncated $\mathrm{Stick}(\gamma)$), and the precision $\alpha$ is again drawn from a Gamma distribution. Through use of this framework, the model preserves the desired conjugacy throughout, permitting variational Bayesian inference. To emphasize that the truncated iHMM yields draws from Dirichlet distributions (DDs), we refer to this as a DD-HMM model (rather than the truncated iHMM).

The DD-HMM prior may be summarized as follows. A set of atoms $\{\theta_l\}_{l=1}^{L}$ are drawn iid from $H$. For each of the states $l = 1, \dots, L$, there is an associated probability weight $\boldsymbol{\pi}_l$, with $s_t \sim \mathrm{Mult}(\boldsymbol{\pi}_{s_{t-1}})$ and $y_t \sim F(\theta_{s_t})$. Each of the probability weights $\boldsymbol{\pi}_l$ is drawn iid from $\mathrm{Dir}(\alpha/L, \dots, \alpha/L)$. In the context of multi-task learning, the new formulation is equivalent to a truncated iHMM with the $\boldsymbol{\beta}_k$ in the original nested formulation replaced by $\mathbf{g}$, with $\mathbf{g} = (1/L, \dots, 1/L)$. We therefore refer to this as a nested DD-HMM, or nDD-HMM for short. Note that in this nDD-HMM, we only alter the bottom-level iHMM to a DD-HMM for VB implementation, and the top-level task sharing remains intact via an nDP prior.
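The change from the truncated iHMM to the DD-HMM can be seen directly in code: instead of drawing the base weights from a truncated Stick(γ) process, each transition row is drawn from a symmetric Dirichlet with the fixed uniform base g = (1/L, ..., 1/L), which keeps every conditional conjugate. The truncation levels, hyperparameters, and Gaussian base measure below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(4)
K, L, alpha = 10, 20, 5.0                   # task-level truncation, states, Dirichlet precision (illustrative)

g = np.full(L, 1.0 / L)                     # fixed uniform base, replacing the Stick(gamma) draw
atoms = rng.normal(0.0, 1.0, size=(K, L))   # shared atoms theta_{kl} ~ H for each task cluster (Gaussian H assumed)

# DD-HMM transition rows for one task: pi_l ~ Dir(alpha * g), i.e. a symmetric Dirichlet
transition = rng.dirichlet(alpha * g, size=L)
print(transition.shape, transition.sum(axis=1))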



In this nDD-HMM representation we again have the desired discreteness property of the base distribution, to ensure the sharing and consistency of the observation models, but now the transition probabilities are controlled only by the variable $\alpha$, rather than via the two-level structure that makes direct VB inference intractable. As shown further below, the MCMC nDP-iHMM and the VB nDD-HMM yield comparable results for all test cases considered, except that the VB version scales to large problems significantly better. The excellent performance of the VB-based nDD-HMM relative to the MCMC-based nDP-iHMM suggests the appropriateness of the nDD-HMM; note that these comparisons were done for tens of music pieces, for which the MCMC calculations were feasible. For a large database example with 2500 music pieces, the VB version typically converges in roughly 20 iterations while the MCMC version is infeasible.

    A. Variational Bayesian Inference for the nDD-HMM

With the nDD-HMM framework we are able to obtain a fast inference algorithm via the variational Bayesian (VB) method. To consider standard VB inference [13], the top level is truncated to $K$ “sticks”, and the stick-breaking DP representation for the base distribution is truncated to $L$ sticks. As discussed above, in the nDD-HMM the original infinite-dimensional characteristic of the stick-breaking representation of $\mathrm{DP}(\gamma, H)$ is now replaced by an $L$-dimensional $\mathbf{g}$ corresponding to a uniform probability mass function.

Similar to the nDP-iHMM, the mathematical form of the nDD-HMM mixture model can be written as

$$y^{(j)}_t \mid s^{(j)}_t, z_j \sim F(\theta_{z_j, s^{(j)}_t}), \qquad s^{(j)}_t \mid s^{(j)}_{t-1} \sim \mathrm{Mult}(\boldsymbol{\pi}^{(j)}_{s^{(j)}_{t-1}})$$
$$\boldsymbol{\pi}^{(j)}_l \mid \alpha, \mathbf{g} \sim \mathrm{Dir}(\alpha\mathbf{g}), \qquad \theta_{kl} \sim H$$
$$z_j \mid \mathbf{w} \sim \mathrm{Mult}(\mathbf{w}), \qquad \mathbf{w} \mid \eta \sim \mathrm{Stick}(\eta) \tag{8}$$

The nDD-HMM (8) is almost the same as the nDP-iHMM (7), except that both the top- and bottom-level DPs are represented as truncated stick-breaking processes and $\mathbf{g}$ is set to a constant vector, which is essential to perform the VB inference.

Assume there are $N_j$ observation sequences from task $j$, and therefore in the standard VB derivation we have: (i) the observed data $\mathbf{Y} = \{\mathbf{y}^{(j,n)}\}$; (ii) the hidden variables $\mathbf{S} = \{\mathbf{s}^{(j,n)}\}$, with $\mathbf{s}^{(j,n)}$ representing the $n$th hidden state sequence for task $j$; and (iii) the parameters $\boldsymbol{\Theta} = \{\mathbf{w}, \{z_j, A^{(j)}, \boldsymbol{\pi}_0^{(j)}\}_{j=1}^{J}, \{B_k\}_{k=1}^{K}\}$, where $A^{(j)}$ is the transition matrix, $\boldsymbol{\pi}_0^{(j)}$ is the initial-state probability, and $B_k$ is the observation matrix. The base weights $\mathbf{g}$ are fixed at uniform $L$-dimensional vectors, as indicated above. The triple $\{A^{(j)}, \boldsymbol{\pi}_0^{(j)}, B_{z_j}\}$ defines the unique iHMM for modeling task $j$ given $z_j$. For simplicity of exposition, we have assumed fixed hyperparameters $\eta$ and $\alpha$. However, Gamma priors are placed on the hyperparameters in our experiments.

In VB inference, we wish to maximize a lower bound on the log marginal likelihood with respect to the variational distribution $q$:

$$\log p(\mathbf{Y}) \geq \int q(\mathbf{S}, \boldsymbol{\Theta}) \log \frac{p(\mathbf{Y}, \mathbf{S}, \boldsymbol{\Theta})}{q(\mathbf{S}, \boldsymbol{\Theta})}\, d\mathbf{S}\, d\boldsymbol{\Theta}$$

where $q$ is the variational distribution and is assumed to be fully factorized:

$$q(\mathbf{S}, \boldsymbol{\Theta}) = q(\mathbf{w}) \prod_{j=1}^{J} q(z_j)\, q(\mathbf{S}^{(j)})\, q(A^{(j)})\, q(\boldsymbol{\pi}_0^{(j)}) \prod_{k=1}^{K} q(B_k) \tag{9}$$

where $q(z_j)$ is a multinomial distribution, each stick fraction of $q(\mathbf{w})$ is a Beta distribution, and $\mathbf{g}$ is a constant vector. We set a truncated stick-breaking process as the variational distribution of $\mathbf{w}$, and of each row of $A^{(j)}$, $\boldsymbol{\pi}_0^{(j)}$, and $B_k$. The variational distribution of the hidden states, $q(\mathbf{S}^{(j)})$, is calculated by the forward-backward algorithm, following the idea in [13]. In the VB algorithm, we iteratively update all terms in (9) to maximize the lower bound, and the detailed VB update equations are shown in the Appendix.

    V. EXPERIMENTS

We demonstrate the effectiveness of our proposed methods on both synthetic data and real music data. Our synthetic problem, for which the ground truth is known, demonstrates how the nDP-iHMM discovers the underlying sharing structure among multiple sequential data sets. We then apply both the MCMC-based nDP-iHMM and the VB-based nDD-HMM to a data set of ten music pieces to explore the similarities between tasks. Finally, we employ the nDD-HMM to model a large digital-music database, for which the nDP-iHMM is infeasible to implement, and show how the learned intertask similarities can be used in a graph-diffusion analysis to perform sorting and ranking of sequential databases.

    A. Synthetic Data

We apply the nDP-iHMM to discovering the relationships between 12 synthetic data sets. Each data set contains 50 sequences of length 20, generated from a distinct discrete HMM. All HMMs share the same number of states and the same codebook size. The parameters of the HMMs take structured forms in which a deterministic pattern is perturbed by small random matrices; each nonzero element of these perturbations is independently drawn from a uniform distribution on the interval [0, 0.05].


TABLE I
HMM PARAMETERS FOR THE SYNTHETIC PROBLEM

The HMM parameters are chosen as in Table I; therefore, there are three clusters among these 12 tasks.

The nDP-iHMM is applied to clustering the tasks. The base distribution $H$ is set to be a Dirichlet distribution with parameters all equal to 1, the top-level stick-breaking DP is truncated to a finite number of components, and Gamma priors are imposed on the hyperparameters $\eta$ and $\gamma$. We initialize all tasks with the same center indexes and let the nDP-iHMM infer the true underlying relationships between the tasks. The result is shown by the Hinton diagram plotted in Fig. 3(a). In the Hinton diagram, the size of the green box is proportional to the degree of similarity between two tasks. The similarity measure corresponds to the posterior probability that two tasks are grouped together, which can be calculated by the proportion of Gibbs sampling draws in which the tasks are assigned to the same cluster. Note that this approach relies on soft probabilistic clustering, so that the posterior mean estimates of the base distribution for two tasks always differ, but these estimates eventually converge as the posterior probability of clustering increases. The evolution of the center indexes is shown in Fig. 3(b). It can be seen that the nDP-iHMM clusters the tasks nicely even when we initialize with the same center indexes.
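The between-task similarity shown in the Hinton diagram is simply the empirical co-assignment frequency across Gibbs draws. A minimal sketch of that computation, with randomly generated stand-in samples of the center indexes:

import numpy as np

rng = np.random.default_rng(5)
num_samples, J = 500, 12
# Stand-in for saved Gibbs draws of the center indexes z_j (shape: samples x tasks)
z_samples = rng.integers(0, 3, size=(num_samples, J))

# Posterior probability that two tasks share a cluster = fraction of draws with equal indexes
similarity = np.zeros((J, J))
for draw in z_samples:
    similarity += (draw[:, None] == draw[None, :])
similarity /= num_samples
print(similarity.round(2))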

    B. Real Music Data

Now we demonstrate the application of our proposed methods to analyzing real data, and in particular, music data. In this experiment, we have ten 1-minute music clips extracted from different pieces. The reason for choosing part of each piece instead of the whole piece is that we are relatively confident about the similarities of those clips. In this way we are able to control the ground truth for this real application of our proposed model. These ten clips were chosen deliberately with the following intended clustering: 1) clip 1 is unique in style and instrumentation; 2) clips 2 and 3, 4 and 5, 6 and 7, and 9 and 10 are intended to be paired together; 3) clip 8 is also unique, but is of the same format (instrumentation) as clips 6 and 7 (the names of the pieces are given in Fig. 4).

Fig. 3. (a) Hinton diagram of between-task similarities for the synthetic problem. (b) Evolution of center indexes versus Gibbs iterations for the 12 tasks.

Our goal is to learn the relationships between the clips, i.e., the similarities of these clips. Meanwhile, we wish to simultaneously learn an accurate iHMM for each of the clips. Each music clip is sampled at 22 kHz and 10-dimensional Mel-frequency cepstral coefficient (MFCC) [8] features are extracted for every 25 ms nonoverlapping frame. The feature vectors across all ten clips are concatenated to perform vector quantization (VQ) [22], mapping each feature vector to a code within a VQ codebook of size 32. In our experiment, we choose sequences of 1-second windows, i.e., 40 observations per sequence. Therefore, each music clip is transformed into 60 data sequences with 40 observations inside each sequence.
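A feature-extraction pipeline in the spirit of the one described above might look as follows, using librosa for the MFCCs and k-means as a stand-in for the LBG codebook design of [22]; the library choices, frame settings, and the synthetic stand-in signal are our assumptions, not the authors' implementation:

import numpy as np
import librosa
from sklearn.cluster import KMeans

sr = 22050                                              # 22 kHz sampling, as in the paper
signal = np.random.default_rng(6).standard_normal(60 * sr)   # stand-in for a 1-minute music clip

frame = int(0.025 * sr)                                 # 25 ms non-overlapping frames
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=10,
                            n_fft=frame, hop_length=frame)   # 10-D MFCCs, shape (10, num_frames)
features = mfcc.T                                       # one 10-D feature vector per frame

# Vector quantization: fit a 32-code codebook and map each frame to a code
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(features)
codes = codebook.predict(features)

# Cut into 1-second sequences of 40 observations each
seq_len = 40
sequences = codes[: (len(codes) // seq_len) * seq_len].reshape(-1, seq_len)
print(sequences.shape)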

We compare three methods for iHMM model learning: i) the proposed nDP-iHMM MTL method; ii) a DP-based MTL method, DP-iHMM, for which a master-level DP is used to couple all the iHMMs; and iii) STL-iHMM, the single-task-learning method, for which each clip is analyzed in isolation. Both ii) and iii) can be considered as special cases of the nDP-iHMM with particular choices of hyperparameters.


Fig. 4. (a) Average testing-sequence likelihood using the nDP-iHMM, DP-iHMM, and STL-iHMM. (b) Hinton diagram of between-clip similarities based on sampled center indexes in the nDP-iHMM for the case of 20 training sequences.

The model-learning performance is evaluated by calculating the testing-sequence likelihood, averaged first within each clip and then over all the clips. For each clip, a certain number of training sequences are selected and the remaining sequences are used as testing data for that clip-dependent iHMM. The training data are chosen from the middle of the clip, because there may be a quiet period at the two ends. To have a comprehensive comparison, different training-set sizes, up to 30 sequences, are considered. All three methods are implemented via Gibbs samplers and the Raftery and Lewis test [23] is performed to determine the number of iterations needed for convergence. All results shown in Fig. 4 are based on 20 709 samples obtained after a burn-in period of 2098 iterations. The parameter settings are the same as in Section V-A. Fig. 4(a) shows that the proposed nDP-iHMM method consistently outperforms the other two methods, and the improvement is more dramatic when there is only a small amount of training data available. This is because the nDP-iHMM exploits the intertask relationships when performing model learning. As a result, the nDP-iHMM directly obtains the similarities between tasks, which are shown in Fig. 4(b). It is clear that the nDP-iHMM captures the between-clip similarities quite well.

While the nDP-iHMM provides a mechanism to obtain the intertask similarities directly, the STL-iHMM and the DP-iHMM do not. Nevertheless, one may use an appropriate distance measure to compute the similarity based on the learned iHMMs.

Fig. 5. (a) Between-clip similarity matrix computed by (10) using nDP-iHMM results for the case of six training sequences; (b) and (c) are the same as (a) but using the DP-iHMM and STL-iHMM, respectively.


For this purpose, we use a distance measure similar to that considered by Aucouturier [24]. The distance between two iHMMs is defined as

$$D(\mathrm{iHMM}_i, \mathrm{iHMM}_j) = \log p(\mathbf{Y}^{i} \mid \mathrm{iHMM}_i) + \log p(\mathbf{Y}^{j} \mid \mathrm{iHMM}_j) - \log p(\mathbf{Y}^{i} \mid \mathrm{iHMM}_j) - \log p(\mathbf{Y}^{j} \mid \mathrm{iHMM}_i)$$

where the $\mathbf{Y}^{i}$'s are sequences simulated from $\mathrm{iHMM}_i$ and the $\mathbf{Y}^{j}$'s are sequences simulated from $\mathrm{iHMM}_j$. The similarity between clip $i$ and clip $j$ is then calculated as

$$\mathrm{Sim}(i, j) = \exp\left[-D(\mathrm{iHMM}_i, \mathrm{iHMM}_j)^2 / \sigma^2\right] \tag{10}$$

where the variance $\sigma^2$ is arbitrary. We compute similarities of clips using (10) for the case of six training sequences, and plot the Hinton diagrams for all three methods in Fig. 5(a), (b), and (c), respectively. We observe that the nDP-iHMM [Fig. 5(a)] does the best in discovering the sharing structure of the music clips with limited training data.
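A sketch of the cross-likelihood distance and similarity in (10): each model scores sequences simulated from itself and from the other model, and the similarity is a Gaussian kernel on the resulting distance. The log-likelihood functions and simulators are passed in as callables, since the paper's iHMMs are not reproduced here; the toy categorical "models" and all names are assumptions:

import numpy as np

def cross_likelihood_distance(loglik_i, loglik_j, sims_i, sims_j):
    # Distance between two sequence models in the spirit of [24] and (10).
    # loglik_* : callable mapping a batch of sequences to a total log-likelihood
    # sims_*   : sequences simulated from model i / model j
    return (loglik_i(sims_i) + loglik_j(sims_j)
            - loglik_j(sims_i) - loglik_i(sims_j))

def similarity(distance, sigma=1.0):
    # Gaussian kernel on the model distance; sigma is an arbitrary scale
    return np.exp(-distance ** 2 / sigma ** 2)

# Toy usage with two stand-in "models": iid categorical distributions over a codebook
rng = np.random.default_rng(7)
p_i, p_j = rng.dirichlet(np.ones(32)), rng.dirichlet(np.ones(32))
sims_i = rng.choice(32, size=(10, 40), p=p_i)
sims_j = rng.choice(32, size=(10, 40), p=p_j)
loglik_i = lambda seqs: float(np.sum(np.log(p_i[np.asarray(seqs)])))
loglik_j = lambda seqs: float(np.sum(np.log(p_j[np.asarray(seqs)])))

d = cross_likelihood_distance(loglik_i, loglik_j, sims_i, sims_j)
print(d, similarity(d, sigma=abs(d) + 1e-9))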

While the MCMC-based nDP-iHMM yields pleasing results on MTL for sequential data, the computation is relatively slow. For this ten-music-piece example, the nDP-iHMM requires around 40 hours to perform 20 000 Gibbs sampling steps on a laptop. The expensive computational cost hinders the nDP-iHMM from solving large-scale problems. On the other hand, the VB inference algorithm converges very fast and learns the ten music clips in about 2 min on the same machine. Fig. 6 shows that the VB-based nDD-HMM trained using 20 sequences per clip yields results comparable to the MCMC-based nDP-iHMM, but greatly reduces the computational time and makes large-scale problems solvable. This is further demonstrated below.

    C. Large-Scale Sequential Databases

Consider the application of the nDD-HMM to a large-scale problem of music analysis. We demonstrate that by treating the multiple musical pieces in a multi-task learning setting via the nDD-HMM, we obtain: (i) a direct similarity measure between musical pieces; and (ii) more accurate individual task-dependent iHMMs, while also avoiding the model-selection problem. The similarity matrix learned from (i) is then normalized to define a random walk on a graph, where the nodes on the graph correspond to the different musical pieces; this normalized graph matrix is then applied in a graph-diffusion analysis [14], for music search and sorting.
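A sketch of the normalization and diffusion step: the learned affinity matrix is row-normalized into a Markov (random-walk) matrix, its eigendecomposition gives diffusion coordinates, and pieces can be ranked by diffusion distance to a query. The symmetric affinity below is a random stand-in, and the specific normalization and ranking choices follow the general diffusion-maps recipe of [14] rather than the authors' exact procedure:

import numpy as np

rng = np.random.default_rng(8)
n = 200
W = rng.random((n, n)); W = (W + W.T) / 2.0      # stand-in symmetric affinity between music pieces
np.fill_diagonal(W, 1.0)

P = W / W.sum(axis=1, keepdims=True)             # row-normalize: transition matrix of a random walk
eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

t, d = 2, 5                                      # diffusion time and embedding dimension (illustrative)
coords = eigvecs[:, 1:d + 1] * (eigvals[1:d + 1] ** t)   # diffusion-map coordinates (skip trivial eigvec)

query = 0                                        # rank all pieces by diffusion distance to piece 0
dist = np.linalg.norm(coords - coords[query], axis=1)
print("closest pieces to the query:", np.argsort(dist)[:10])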

We examine the algorithm on a music database with 2500 one-minute clips, including classical, jazz, and rock music. The classical portion of the database contains works spanning over 200 years, varying from small piano pieces to larger orchestral compositions. The jazz portion also varies between soft and upbeat works utilizing a variety of instruments. The rock portion spans primarily the decade 1965–1975 and varies from softer, acoustic works to loud, guitar-driven rock.

Fig. 6. (a) Evolution of the lower bound during VB learning. (b) Interclip similarity matrix learned using the VB nDD-HMM.

The feature extraction is the same as that described in Section V-B, and the VQ codebook size is 200.¹

We conduct two experiments on this large music database. The first experiment is designed for demonstration purposes. We select 151 music clips, including 48 pieces from Bach’s The Well-Tempered Clavier: Book I, 48 pieces from seven albums by Miles Davis, and 55 pieces from four albums by The Beatles. The reason for choosing these clips instead of the whole 2500 pieces is that we are confident in the similarities of those clips; in this way we are able to provide a reasonable approximation to the ground truth for this real application.

We compare three methods for music similarity measurement: i) the proposed nDD-HMM; ii) the STL-iHMM, followed by the distance-measure calculation discussed in (10); and iii) the nDP-iGMM, in which each task is represented by an infinite Gaussian mixture model [25]. All three methods are implemented with the VB algorithm.

¹The sequential features and the raw music clips are available to all interested researchers, upon request to the authors.


For the nDD-HMM, we obtain the between-task similarity using the output of the variational distribution on $z_j$, which indicates how likely the base distribution of task $j$ uses $G_k^*$. We select the index $\hat{z}_j = \arg\max_k q(z_j = k)$ as the cluster indicator for task $j$, and then the between-task similarity is simply defined by these cluster indicators. In our experiment, the nDD-HMM uses fixed truncation levels at the task and state levels; we also considered a larger number of sticks at both levels, with minimal change in the results. The nDP-iGMM uses the same setting as the nDD-HMM to obtain the similarity matrix.
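The similarity used here reduces to comparing the maximum-probability cluster indicators from q(z); one can also form a soft similarity directly from the assignment probabilities. A minimal sketch with a random stand-in for q(z):

import numpy as np

rng = np.random.default_rng(9)
J, K = 151, 10
q_z = rng.dirichlet(np.ones(K), size=J)        # stand-in for the variational distributions q(z_j)

hard = q_z.argmax(axis=1)                      # cluster indicator per task, as described above
hard_similarity = (hard[:, None] == hard[None, :]).astype(float)

soft_similarity = q_z @ q_z.T                  # probability two tasks pick the same cluster under q
print(hard_similarity.shape, soft_similarity[:3, :3].round(2))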

The similarity-measure performance of the three methods is represented in the colormaps shown in Fig. 7(a)–(c). The higher the value in the similarity matrix, the more similar the two clips are. The results are averaged over 20 random VB initializations and all sequences are used for training. We observe that the nDD-HMM result clearly clusters the 151 music clips into three major groups, which conforms to the genre information provided by the ground truth. The STL-iHMM result roughly finds three principal clusters; however, fewer relationships are captured between clips within the same genre. The nDP-iGMM also discovers the major genre information but with some confusion among the groups; this is because the iGMM considers observations as exchangeable data and does not exploit the sequential information.

To compare the learned task-dependent iHMMs, we performed the same test as considered in Fig. 4(a). The results of the nDD-HMM and STL-iHMM are shown in Fig. 7(d), and it can be seen that the proposed nDD-HMM method consistently outperforms the STL-iHMM. The performance improvement is more dramatic when there is a small amount of training data available; this is because of the intrinsic sharing mechanism of the nDD-HMM, in which data-level clustering and clip-level clustering are performed simultaneously. The fact that our proposed MTL method learns better iHMMs than the STL counterpart implies that the nDD-HMM can effectively reduce the number of training samples required to achieve a particular performance. Therefore, with the same computational resources, we can obtain similarity information between more music pieces, which is useful for an on-demand music recommendation system. We show example results in Fig. 8, where we use 20 sequences per task to find the interclip similarities. For a fair comparison, the similarity matrices are obtained by first learning the iHMMs and then applying the distance measure (10), for both the nDD-HMM and the STL-iHMM. As we can see, by using only one third of the data, the nDD-HMM already discovers the relationships among the clips, while the STL-iHMM learns the iHMMs independently and does not provide much information about the intertask similarities.

In our second experiment, all 2500 music clips are used, and our goal is to obtain the between-task similarity and to use it within a graphical analysis for music sorting and search. We apply the nDD-HMM to this data set and use the same setting as in the first experiment. Clips 1–525 are jazz, 526–1525 are rock, and 1526–2500 are classical music (recall from the discussion above that these data constitute a relatively high level of diversity, which provides a reasonable challenge for the search/sort engine). Sixty observation sequences from each clip are used, and the result is shown in Fig. 9(a).

Fig. 7. Comparison of the between-clip similarity matrix: (a) nDD-HMM results; (b) STL-iHMM results obtained by applying (10) to the learned models; (c) nDP-iGMM results; (d) average testing-sequence likelihood using the iHMMs learned from the nDD-HMM and STL-iHMM.

The result clearly groups the 2500 music clips into three clusters, which correspond to the three major genres.


Fig. 8. Between-clip similarity measurement obtained by first learning iHMMs and then applying the distance measure (10). Twenty sequences per task (1/3 of the total data) are used, and two methods are compared: (a) results for the nDD-HMM and (b) results for the STL-iHMM.

In addition, Fig. 9(a) indicates that there are different within-genre similarities, which will be exploited by the graph-diffusion analysis. The similarity matrix reflected in Fig. 9(a) was appropriately normalized, to define the Markovian properties of a random walk on a graph [14]. The resulting eigenvalue spectrum is shown in Fig. 9(b), and the relatively low dimension of the graph-diffusion space indicates that the manifold is relatively simple, even though a fairly large number of pieces was considered. A graph-diffusion analysis was then performed on this matrix [14], and this was then used to sort the music. Representative search/sort results are presented in Table II. Although it is difficult to demonstrate without listening, the results for these relatively well-known pieces speak to the high quality of the sorting observed. The experiment was implemented with Matlab 7.0 on an Intel Xeon processor with a 3 GHz CPU. The required CPU time for one run of the 2500-piece database was approximately 3 hours given the stop threshold of the VB algorithm

Fig. 9. (a) Between-clip similarity measurement using the nDD-HMM on a large database with 2500 music clips. The three blocks correspond to jazz, rock, and classical music. (b) Eigenvalue spectrum of the normalized graph-diffusion matrix [calculated using the result from (a)].

(the relative change of the lower bound falling below a preset tolerance), with 60 sequences used per piece.

    VI. CONCLUSION

We have proposed a new hierarchical Bayesian model for multi-task learning with sequential data. The model aims to simultaneously perform task-level clustering as well as data-level clustering. Each task is modeled by an infinite hidden Markov model (iHMM), avoiding the fundamental model-selection problem in HMM learning, and all the iHMMs are shared appropriately via a nested Dirichlet process (nDP). An MCMC inference engine has been implemented for the nDP-iHMM, and promising results are obtained on a synthetic problem and a real music data set.

To analyze large-scale sequential databases, the nDP-iHMM is truncated to an nDD-HMM, to allow a variational Bayesian inference algorithm. Encouraging nDD-HMM results have been demonstrated on a large-scale database involving 2500 musical clips.


TABLE II
EXAMPLE MUSIC SEARCH-ENGINE RESULTS, BASED ON NDD-HMM AND GRAPH DIFFUSION

The framework effectively learns better models than the competing methods, and the learned intertask (inter-music-piece) relationships are used to define the Markovian properties of a random walk on a graph, on which graph-diffusion-based data sorting is applied. Therefore, the nDD-HMM can be used for modeling and sorting large sequential databases, such as in a music recommendation system.

    APPENDIX

We show the detailed VB update equations for the variational distribution $q$ described in (9). The derivations are based on VB-DP [26] and VB-HMM [13], with modifications to accommodate the nDD-HMM model.

• $q(\mathbf{w})$: To consider a stick-breaking representation, we write the top-level weights in terms of their stick fractions and set a Beta variational distribution on each fraction. The variational distribution is updated using the expected number of tasks assigned to each component and to the components beyond it.

• $q(A^{(j)})$, $q(\boldsymbol{\pi}_0^{(j)})$, $q(B_k)$: We put a truncated stick-breaking prior as the variational distribution on each row of the matrices, and the update equations accumulate the corresponding expected transition, initial-state, and emission counts, where $V$ is the size of the codebook in the iHMMs.

• $q(\mathbf{s}^{(j,n)})$: Similar to the VB-HMM [13], we calculate the expectation of the hidden state sequences using the forward-backward algorithm with the data and modified parameters (see the sketch after this list). The corresponding outputs are the expected state occupancies and the expected transition counts.

• $q(z_j)$: We define a multinomial variational distribution over the cluster indicator of each task, with weights determined by the expected log-likelihood of the task's data under each candidate cluster together with the expected log of the corresponding top-level stick weights.
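The "modified parameters" fed to the forward-backward pass are typically the exponentiated expected log-parameters of the Dirichlet-type posteriors (differences of digamma functions), as in [13]. The sketch below computes them and runs a standard forward pass on one discrete sequence; all array shapes, stand-in counts, and names are illustrative assumptions:

import numpy as np
from scipy.special import digamma

def expected_log_dirichlet(counts):
    # E[log p] under Dirichlet(counts), row-wise; exp of this gives the "modified" parameters
    return digamma(counts) - digamma(counts.sum(axis=-1, keepdims=True))

def forward_loglik(obs, log_pi0, log_A, log_B):
    # Forward pass (log domain) for one discrete sequence under the modified parameters
    alpha = log_pi0 + log_B[:, obs[0]]
    for y in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, y]
    return np.logaddexp.reduce(alpha)

rng = np.random.default_rng(10)
L, V = 20, 32                                   # truncated states and codebook size (illustrative)
# Stand-ins for the variational Dirichlet counts of pi_0, the rows of A, and the rows of B
c_pi0 = rng.random(L) + 1.0
c_A = rng.random((L, L)) + 1.0
c_B = rng.random((L, V)) + 1.0

log_pi0 = expected_log_dirichlet(c_pi0)
log_A = expected_log_dirichlet(c_A)
log_B = expected_log_dirichlet(c_B)

obs = rng.integers(0, V, size=40)
print(forward_loglik(obs, log_pi0, log_A, log_B))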

    REFERENCES

[1] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, pp. 41–75, 1997.

[2] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum, “Hierarchical topic models and the nested Chinese restaurant process,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2004, vol. 16, pp. 17–24.

[3] S. Thrun and J. O’Sullivan, “Discovering structure in multiple learning tasks: The TC algorithm,” in Proc. 13th Int. Conf. Machine Learning, 1996, pp. 489–497.

[4] B. Bakker and T. Heskes, “Task clustering and gating for Bayesian multitask learning,” J. Mach. Learn. Res., vol. 4, pp. 83–99, 2003.

[5] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task learning for classification with Dirichlet process priors,” J. Mach. Learn. Res., pp. 35–63, Jan. 2007.

[6] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis. London, U.K.: Chapman & Hall, 1995.

[7] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” J. Amer. Stat. Assoc., vol. 101, no. 476, pp. 1566–1581, 2006.

[8] B. Logan and A. Salomon, “A music similarity function based on signal analysis,” presented at the Int. Conf. Multimedia Expo, Tokyo, Japan, Aug. 2001.

[9] A. Rodriguez, D. B. Dunson, and A. E. Gelfand, “The nested Dirichlet process (with discussion),” J. Amer. Stat. Assoc., 2007, to be published.

[10] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257–286, 1989.

[11] P. Runkle, P. K. Bharadwaj, L. Couchman, and L. Carin, “Hidden Markov models for multi-aspect target classification,” IEEE Trans. Signal Process., vol. 47, pp. 2035–2040, Jul. 1999.

[12] A. Stolcke and S. Omohundro, “Hidden Markov model induction by Bayesian model merging,” in Adv. Neural Inf. Process. Syst., Denver, CO, 1993, vol. 5, pp. 11–18.

[13] M. J. Beal, “Variational algorithms for approximate Bayesian inference,” Ph.D. dissertation, Gatsby Computational Neuroscience Unit, Univ. College London, London, U.K., 2003.

[14] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, “Diffusion maps, spectral clustering and eigenfunctions of Fokker–Planck operators,” in Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2005, vol. 18, pp. 955–962.

[15] M. J. Beal, Z. Ghahramani, and C. Rasmussen, “The infinite hidden Markov model,” in Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2002, vol. 14, pp. 577–584.

[16] M. D. Escobar and M. West, “Bayesian density estimation and inference using mixtures,” J. Amer. Stat. Assoc., vol. 90, pp. 577–588, 1995.

[17] T. S. Ferguson, “A Bayesian analysis of some nonparametric problems,” Ann. Stat., vol. 1, no. 2, pp. 209–230, 1973.

[18] J. Sethuraman, “A constructive definition of Dirichlet priors,” Statistica Sinica, vol. 4, pp. 639–650, 1994.

[19] D. Aldous, “Exchangeability and related topics,” in Ecole d’Ete de Probabilites de Saint-Flour XIII, 1983, 1985, pp. 1–198.

[20] H. Ishwaran and L. F. James, “Gibbs sampling methods for stick-breaking priors,” J. Amer. Stat. Assoc., vol. 96, pp. 161–173, 2001.

[21] W. Jank, “Implementing and diagnosing the stochastic approximation EM algorithm,” J. Comput. Graph. Stat., vol. 15, pp. 803–829, 2006.

[22] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. COM-28, no. 1, pp. 84–95, Jan. 1980.

[23] A. E. Raftery and S. M. Lewis, “How many iterations in the Gibbs sampler?,” Bayesian Stat., vol. 4, pp. 763–773, 1992.

[24] J. J. Aucouturier and F. Pachet, “Music similarity measures: What’s the use?,” in Proc. Int. Symp. Music Information Retrieval (ISMIR), Paris, France, 2002, pp. 157–163.

[25] C. Rasmussen, “The infinite Gaussian mixture model,” in Adv. Neural Inf. Process. Syst., 2000, vol. 12, pp. 554–560.

[26] D. Blei and M. Jordan, “Variational inference for Dirichlet process mixtures,” Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2005.

Kai Ni was born on August 9, 1980 in Shanghai, China. He received the B.S. degree in electronic and information engineering from Shanghai Jiao Tong University, Shanghai, China, in 2002 and the M.S. and Ph.D. degrees in electrical and computer engineering from Duke University, Durham, NC, in 2004 and 2007, respectively.

His research interests include the general areas of machine learning, classification, and signal processing, with a focus on nonparametric Bayesian methods, sequential data modeling, and multitask learning.

John Paisley received the B.S. and M.S. degrees in electrical and computer engineering from Duke University, Durham, NC, in 2004 and 2007, respectively. He is currently working towards the Ph.D. degree at Duke University.

Lawrence Carin (SM’96–F’01) was born in Washington, DC, on March 25, 1963. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1985, 1986, and 1989, respectively.

In 1989, he joined the Electrical Engineering Department at Polytechnic University, Brooklyn, NY, where he was an Assistant Professor and became an Associate Professor in 1994. In September 1995, he joined the Electrical and Computer Engineering Department at Duke University, Durham, NC, where he is now the William H. Younger Distinguished Professor. His current research interests include signal processing and machine learning for sensing applications.

Dr. Carin has been the principal investigator on several large research programs, including two Multidisciplinary University Research Initiative (MURI) programs. He is the co-founder of the small business Signal Innovations Group (SIG), which was purchased in 2006 by Integrian, Inc. He was an Associate Editor of the IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION from 1996 to 2001. He is a member of the Tau Beta Pi and Eta Kappa Nu honor societies.


David Dunson was born on July 18, 1972 in Townsville, Australia. He earned the B.S. degree in mathematics at the Pennsylvania State University in 1994 and the Ph.D. degree in biostatistics from Emory University in 1997.

In 1997 he joined the Biostatistics Branch of the National Institute of Environmental Health Sciences (NIEHS) as a Research Fellow. In 2000, he became a Tenure Track Principal Investigator at NIEHS and was promoted to Tenured Senior Investigator in 2002. He has accepted a position as Professor in the Department of Statistical Science at Duke University, where he has been an adjunct faculty member since 2000. His current research interests include Bayesian nonparametrics, functional data analysis, model selection, machine learning, and signal processing.

Dr. Dunson is a Fellow of the American Statistical Association and recently won the Mortimer Spiegelman Award for the top public health statistician under age 40. He won a 2007 Gold Medal from the Environmental Protection Agency for Exceptional Service for work on risk assessment using Bayesian methods. He was an Associate Editor of Biometrics from 2000 to 2007, and is currently an Associate Editor of the Journal of the American Statistical Association, Biostatistics, and Psychometrika, while also serving as Co-Editor of Bayesian Analysis.


