A Fast Algorithm for Feature Selection in Conditional Maximum Entropy Modeling

Yaqian Zhou
Computer Science Department
Fudan University
Shanghai 200433, P.R. China
[email protected]

Fuliang Weng
Research and Technology Center
Robert Bosch Corp.
Palo Alto, CA 94304, USA
[email protected]

Lide Wu
Computer Science Department
Fudan University
Shanghai 200433, P.R. China
[email protected]

Hauke Schmidt
Research and Technology Center
Robert Bosch Corp.
Palo Alto, CA 94304, USA
[email protected]

Abstract

This paper describes a fast algorithm that selects features for conditional maximum entropy modeling. Berger et al. (1996) present an incremental feature selection (IFS) algorithm, which computes the approximate gains for all candidate features at each selection stage and is very time-consuming for problems with large feature spaces. In the new algorithm presented here, we instead compute the approximate gains only for the top-ranked features, based on the models obtained in previous stages. Experiments on the WSJ data in the Penn Treebank show that the new algorithm greatly speeds up the feature selection process while maintaining the same quality of selected features. One variant of the new algorithm with look-ahead functionality is also tested, further confirming the quality of the selected features. The new algorithm is easy to implement, and, given a feature space of size F, it uses only O(F) more space than the original IFS algorithm.

1 Introduction

Maximum Entropy (ME) modeling has received a lot of attention in language modeling and natural language processing over the past few years (e.g., Rosenfeld, 1994; Berger et al., 1996; Ratnaparkhi, 1998; Koeling, 2000). One of the main advantages of ME modeling is the ability to incorporate various features in the same framework with a sound mathematical foundation. There are two main tasks in ME modeling: the feature selection process, which chooses from a feature space a subset of good features to be included in the model; and the parameter estimation process, which estimates the weighting factors for each selected feature in the exponential model. This paper is primarily concerned with the feature selection process in ME modeling.

While the majority of the work in ME modeling has focused on parameter estimation, less effort has been made in feature selection. This is partly because feature selection may not be necessary for certain tasks when parameter estimation algorithms are fast. However, when a feature space is large and complex, it is clearly advantageous to perform feature selection: it not only speeds up the probability computation and reduces the memory requirement during application, but also shortens the cycle of model selection during training.

Feature selection is a very difficult optimization task when the feature space under investigation is large. This is because we essentially try to find the best subset from the collection of all possible feature subsets, which has a size of $2^{|W|}$, where $|W|$ is the size of the feature space.

In the past, most researchers resorted to a simple count cutoff technique for selecting features (Rosenfeld, 1994; Ratnaparkhi, 1998; Reynar and Ratnaparkhi, 1997; Koeling, 2000), where only the features that occur in a corpus more often than a pre-defined cutoff threshold get selected. Chen and Rosenfeld (1999) experimented with a feature selection technique that uses a χ² test to decide whether a feature should be included in the ME model, where the χ² statistic is computed using the count from a prior distribution and the count from the real training data. It is a simple and probably effective technique for language modeling tasks. However, since ME models are optimized using their likelihood or likelihood gains as the criterion, it is important to establish the relationship between the χ² test score and the likelihood gain, and such a relationship is absent. Berger et al. (1996) presented an incremental feature selection (IFS) algorithm in which only one feature is added at each selection stage and the estimated parameter values are kept for the features selected in the previous stages. While this greedy search assumption is reasonable, the speed of the IFS algorithm is still an issue for complex tasks. To better understand its performance, we re-implemented the algorithm. Given a task with 600,000 training instances, it takes nearly four days to select 1,000 features from a feature space of a little more than 190,000 features. Berger and Printz (1998) proposed an f-orthogonal condition for selecting k features at the same time without much affecting the quality of the selected features. While this technique is applicable for certain feature sets, such as the word link features reported in their paper, the f-orthogonal condition usually does not hold if part-of-speech tags are dominantly present in a feature subset. Past work, including Ratnaparkhi (1998) and Zhou et al. (2003), has shown that the IFS algorithm uses many fewer features than the count cutoff method while maintaining similar precision and recall on tasks such as prepositional phrase attachment, text categorization and base NP chunking. This leads us to further explore possible improvements to the IFS algorithm.

In section 2, we briefly review the IFS algorithm. A fast feature selection algorithm is then described in section 3. Section 4 presents a number of experiments, which show the massive speed-up and the quality of the features selected by the new algorithm. Finally, we conclude the discussion in section 5.

2 The Incremental Feature Selection Algorithm

For a better understanding of our new algorithm, we start by briefly reviewing the IFS feature selection algorithm. Suppose the conditional ME model takes the following form:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j f_j(x, y)\Big)$$

where $f_j$ are the features, $\lambda_j$ are their corresponding weights, and $Z(x)$ is the normalization factor.

The algorithm makes the approximation that the addition of a feature f to an exponential model affects only its associated weight α, leaving unchanged the λ-values associated with the other features. Here we only present a sketch of the algorithm, in Figure 1; please refer to the original paper for the details.

In the algorithm, we use I for the number of training instances, Y for the number of output classes, and F for the number of candidate features, i.e., the size of the candidate feature set.

0. Initialize: S = ∅, sum[1..I, 1..Y] = 1, z[1..I] = Y
1. Gain computation:
   MaxGain = 0
   for f in feature space F do
       α̂ = argmax_α G_{S∪f}(α)
       ĝ = max_α G_{S∪f}(α)
       if MaxGain < ĝ then MaxGain = ĝ; f* = f; α* = α̂
2. Feature selection: S = S ∪ {f*}
3. If the termination condition is met, then stop.
4. Model adjustment:
   for each instance i such that there is a y with f*(x_i, y) = 1 do
       z[i] -= sum[i, y]
       sum[i, y] *= exp(α*)
       z[i] += sum[i, y]
5. Go to step 1.

Figure 1: A Variant of the IFS Algorithm.


One difference here from the original IFS algorithm is that we adopt a technique from Goodman (2002) for optimizing the parameters in conditional ME training. Specifically, we use the array z to store the normalization factors, and the array sum for all the un-normalized conditional probabilities sum[i, y]. Thus, one only needs to modify those sum[i, y] that satisfy f*(x_i, y) = 1, and to make the corresponding changes to their normalization factors z[i]. In contrast to what is shown in Berger et al. (1996), here is how the different values in this variant of the IFS algorithm are computed.

Let us denote

$$\mathrm{sum}(y \mid x) = \exp\Big(\sum_j \lambda_j f_j(x, y)\Big), \qquad Z(x) = \sum_y \mathrm{sum}(y \mid x).$$

Then the model can be represented by sum(y|x) and Z(x) as follows:

$$p(y \mid x) = \mathrm{sum}(y \mid x) / Z(x),$$

where sum(y|x_i) and Z(x_i) correspond to sum[i, y] and z[i] in Figure 1, respectively.
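To make this bookkeeping concrete, here is a small Python sketch (our own illustration, not code from the paper; the function and variable names are assumptions). It initializes the two arrays and applies the step-4 model adjustment of Figure 1 when a feature f* with weight α* has been selected; active_pairs is assumed to list the (i, y) pairs for which f*(x_i, y) = 1.

    import math

    def init_model(num_instances, num_classes):
        # Uniform starting point: every un-normalized score sum[i, y] is 1,
        # so each normalizer z[i] is simply the number of output classes Y.
        sum_ = [[1.0] * num_classes for _ in range(num_instances)]
        z = [float(num_classes)] * num_instances
        return sum_, z

    def adjust_model(sum_, z, active_pairs, alpha_star):
        # Step 4: only the cells (i, y) with f*(x_i, y) = 1 change, and each
        # affected normalizer z[i] is patched incrementally.
        scale = math.exp(alpha_star)
        for i, y in active_pairs:
            z[i] -= sum_[i][y]
            sum_[i][y] *= scale
            z[i] += sum_[i][y]

    def prob(sum_, z, i, y):
        # Conditional probability p(y | x_i) = sum[i, y] / z[i].
        return sum_[i][y] / z[i]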

Assume the selected feature set is S and feature f is currently being considered. The goal of each selection stage is to select the feature f that maximizes the gain in log likelihood, where the α and the gain of f are derived through the following steps.

Let the log likelihood of the model be

$$L(p) \equiv \sum_{x,y} \tilde{p}(x,y)\,\log\big(p(y \mid x)\big) = \sum_{x,y} \tilde{p}(x,y)\,\log\big(\mathrm{sum}(y \mid x)/Z(x)\big),$$

and the empirical expectation of feature f be

$$E_{\tilde{p}}(f) = \sum_{x,y} \tilde{p}(x,y)\, f(x,y).$$

With the approximation assumption in Berger et al. (1996), the un-normalized component and the normalization factor of the model have the following recursive forms:

$$\mathrm{sum}_{S\cup f,\alpha}(y \mid x) = \mathrm{sum}_S(y \mid x)\cdot e^{\alpha f(x,y)},$$

$$Z_{S\cup f,\alpha}(x) = Z_S(x) - \mathrm{sum}_S(y \mid x) + \mathrm{sum}_{S\cup f,\alpha}(y \mid x).$$

The approximate gain of the log likelihood is computed by

$$G_{S\cup f}(\alpha) \equiv L(p^{\alpha}_{S\cup f}) - L(p_S) = -\sum_x \tilde{p}(x)\,\log\big(Z_{S\cup f,\alpha}(x)/Z_S(x)\big) + \alpha\, E_{\tilde{p}}(f). \qquad (1)$$

The maximum approximate gain and its corresponding α are represented as

$$\tilde{\Delta}L(S,f) = \max_\alpha G_{S\cup f}(\alpha), \qquad \tilde{\alpha}(S,f) = \arg\max_\alpha G_{S\cup f}(\alpha).$$
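As an illustration of Equation (1) (again our own sketch, not the authors' code), the following Python functions evaluate the approximate gain of one candidate feature and maximize it over α. They reuse the sum_/z arrays from the earlier sketch, take active_pairs as the (i, y) pairs with f(x_i, y) = 1, assume the empirical distribution puts weight 1/I on each training instance, and expect E_p̃(f) to be precomputed. The paper solves for α with Newton's method; the golden-section search below is only a stand-in, relying on the gain being concave in α.

    import math

    def approx_gain(sum_, z, active_pairs, alpha, emp_expectation, num_instances):
        # G_{S∪f}(alpha) from Equation (1): per affected instance, accumulate the
        # change to its normalizer, then compare new and old normalizers.
        delta = {}
        scale = math.exp(alpha) - 1.0
        for i, y in active_pairs:
            delta[i] = delta.get(i, 0.0) + sum_[i][y] * scale
        log_ratio = sum(math.log((z[i] + d) / z[i]) for i, d in delta.items())
        return -log_ratio / num_instances + alpha * emp_expectation

    def best_alpha(sum_, z, active_pairs, emp_expectation, num_instances,
                   lo=-5.0, hi=5.0, iters=60):
        # Maximize the concave gain over alpha on a fixed bracket using a
        # golden-section search (a stand-in for the paper's Newton updates).
        phi = (math.sqrt(5.0) - 1.0) / 2.0
        a, b = lo, hi
        for _ in range(iters):
            c, d = b - phi * (b - a), a + phi * (b - a)
            gc = approx_gain(sum_, z, active_pairs, c, emp_expectation, num_instances)
            gd = approx_gain(sum_, z, active_pairs, d, emp_expectation, num_instances)
            if gc < gd:
                a = c
            else:
                b = d
        alpha = (a + b) / 2.0
        return alpha, approx_gain(sum_, z, active_pairs, alpha,
                                  emp_expectation, num_instances)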

3 A Fast Feature Selection Algorithm

The inefficiency of the IFS algorithm is due to the following reasons. The algorithm considers all the candidate features before selecting one of them, and it has to re-compute the gains for every feature at each selection stage. In addition, computing a parameter with Newton's method is not always efficient. Therefore, the total computation for the whole selection process can be very expensive.

Let g(j, k) represent the gain due to the addition of feature f_j to the active model at stage k. In our experiments, we found that even if Δ (the number of additional stages after stage k) is large, for most j the difference g(j, k+Δ) − g(j, k) is a negative number or at most a very small positive number. This leads us to use g(j, k) as an approximate upper bound of g(j, k+Δ).

The intuition behind our new algorithm is that when a new feature is added to a model, the gains of the other features do not change much between before and after the addition. When there are changes, their amounts mostly lie within a narrow range across features, from the top-ranked ones to the bottom-ranked ones. Therefore, we only compute and compare the gains of the features from the top-ranked one downward, until we reach a feature whose gain, computed under the new model, is bigger than the gains of all the remaining features. With few exceptions, the gains of the majority of the remaining features are values that were computed under previous models.

As in the IFS algorithm, we assume that the addition of a feature f only affects its weighting factor α. Because a uniform distribution is assumed as the prior in the initial stage, we can derive closed-form formulas for α(j, 0) and g(j, 0) as follows.


Let

$$E_d(f) = \sum_x \tilde{p}(x)\,\max_y\{f(x,y)\}, \qquad R_e(f) = E_{\tilde{p}}(f)/E_d(f), \qquad p_0 = 1/Y.$$

Then

$$\alpha(j,0) = \log\frac{R_e(f)\,(1-p_0)}{p_0\,(1-R_e(f))},$$

$$g(j,0) = L(p^{\alpha(j,0)}_{\varnothing\cup f}) - L(p_\varnothing) = E_d(f)\Big[R_e(f)\log\frac{R_e(f)}{p_0} + \big(1-R_e(f)\big)\log\frac{1-R_e(f)}{1-p_0}\Big],$$

where ∅ denotes the empty set and p_∅ is the uniform distribution. The remaining steps for computing the gains and selecting the features are given as pseudo-code in Figure 2. Because we only compute gains for a small number of top-ranked features, we call this feature selection algorithm the Selective Gain Computation (SGC) algorithm.
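For a single candidate feature, these closed-form quantities are cheap to evaluate. The sketch below is our own illustration; it assumes E_p̃(f) and E_d(f) have already been computed and that 0 < R_e(f) < 1, and it returns the initial weight α(j, 0) and gain g(j, 0).

    import math

    def initial_gain(emp_expectation, ed, num_classes):
        # Closed-form alpha(j, 0) and g(j, 0) on top of the uniform prior
        # p0 = 1/Y, given E_p~(f) and E_d(f) for one candidate feature.
        p0 = 1.0 / num_classes
        re = emp_expectation / ed                      # R_e(f)
        alpha0 = math.log(re * (1.0 - p0) / (p0 * (1.0 - re)))
        gain0 = ed * (re * math.log(re / p0)
                      + (1.0 - re) * math.log((1.0 - re) / (1.0 - p0)))
        return alpha0, gain0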

In the algorithm, we use the array g to keep the sorted gains and their corresponding feature indices. In practice, we use a binary search tree to maintain the order of the array.

The key difference between the IFS algorithm and the SGC algorithm is that we do not evaluate all the features for the active model at every stage (one stage corresponds to the selection of a single feature). Initially, the feature candidates are ordered according to their gains computed on the uniform distribution. The feature with the largest gain is selected, and it forms the model for the next stage. In the next stage, the gain of the top feature in the ordered list is computed based on the model just formed in the previous stage. This gain is compared with the gains of the rest of the features in the list. If this newly computed gain is still the largest, the feature is added to form the model at stage 3. If the gain is not the largest, the feature is re-inserted into the ordered list so that the order is maintained, and the gain of the next top-ranked feature in the ordered list is re-computed using the model at the current stage, i.e., stage 2.

This process continues until the gain of the top-ranked feature computed under the current model is still the largest gain in the ordered list. Then, the model for the next stage is created with the addition of this newly selected feature. The whole feature selection process stops either when the number of selected features reaches a pre-defined value given in the input, or when the gains become too small to be useful to the model.

0. Initialize: S = ∅, sum[1..I, 1..Y] = 1, z[1..I] = Y, g[1..F] = {g(1,0), …, g(F,0)}
1. Gain computation:
   MaxGain = 0
   Loop:
       j = argmax over candidate features of g[1..F]
       if g[j] ≤ MaxGain then go to step 2
       else
           α̂ = argmax_α G_{S∪f_j}(α)
           ĝ = max_α G_{S∪f_j}(α)
           g[j] = ĝ
           if MaxGain < ĝ then MaxGain = ĝ; f* = f_j; α* = α̂
2. Feature selection: S = S ∪ {f*}
3. If the termination condition is met, then stop.
4. Model adjustment:
   for each instance i such that there is a y with f*(x_i, y) = 1 do
       z[i] -= sum[i, y]
       sum[i, y] *= exp(α*)
       z[i] += sum[i, y]
5. Go to step 1.

Figure 2: Selective Gain Computation Algorithm for Feature Selection.
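One way to organize this lazy re-evaluation is sketched below in Python (our illustration; the paper keeps the ordered gains in a binary search tree, whereas a max-heap is used here because it gives the same behaviour with less code). recompute_gain(j) is assumed to return feature j's approximate gain under the current model, adjust_model(j) to apply step 4 once j is selected, and bookkeeping for the selected weights α* is omitted.

    import heapq

    def sgc_select(initial_gains, recompute_gain, adjust_model, num_to_select):
        # Gains live in a max-heap (negated for Python's min-heap); fresh[j]
        # records whether g[j] was computed under the current model.
        num_features = len(initial_gains)
        heap = [(-g, j) for j, g in enumerate(initial_gains)]
        heapq.heapify(heap)
        fresh = [False] * num_features
        selected = []
        while len(selected) < num_to_select and heap:
            neg_g, j = heapq.heappop(heap)
            if fresh[j]:
                # The top-ranked gain is already valid for the current model, so,
                # treating stale gains as upper bounds, no candidate can beat it.
                selected.append(j)
                adjust_model(j)
                fresh = [False] * num_features  # new model: all stored gains are stale
            else:
                # Re-evaluate only this top-ranked candidate and reinsert it.
                heapq.heappush(heap, (-recompute_gain(j), j))
                fresh[j] = True
        return selected

Popping a stale entry corresponds to re-computing g[j] in step 1 of Figure 2, and popping a fresh entry corresponds to the g[j] ≤ MaxGain exit that triggers the selection in step 2.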

In addition to this basic version of the SGC algorithm, at each stage we may also re-compute additional gains, based on the current model, for a pre-defined number of features listed right after feature f* (obtained in step 2) in the ordered list. This is to make sure that the selected feature f* is indeed the feature with the highest gain within the pre-defined look-ahead distance. We call this variant the look-ahead version of the SGC algorithm.


4 Experiments

A number of experiments have been conducted to verify the rationale behind the algorithm. In particular, we would like to gain a good understanding of the quality of the features selected by the SGC algorithm, as well as of the amount of speed-up, in comparison with the IFS algorithm.

The first set of experiments uses a dataset {(x, y)} derived from the Penn Treebank, where x is a 10-dimensional vector including word, POS tag and grammatical relation tag information from two adjacent regions, and y is the grammatical relation tag between the two regions. Examples of the grammatical relation tags are subject and object, with either the right region or the left region as the head. The total number of different grammatical relation tags, i.e., the size of the output space, is 86. There are a little more than 600,000 training instances generated from sections 02-22 of the WSJ portion of the Penn Treebank, and the test corpus is generated from section 23.

In our experiments, the feature space is partitioned into sub-spaces, called feature templates, where only certain dimensions are included. Considering all the possible combinations in the 10-dimensional space would lead to $2^{10}$ feature templates. To perform a feasible and fair comparison, we use linguistic knowledge to filter out implausible subspaces, so that only 24 feature templates are actually used. With this set of feature templates, we get more than 1,900,000 candidate features from the training data. To speed up the experiments, which is necessary for the IFS algorithm, we use a cutoff of 5 to reduce the feature space to 191,098 features. On average, each candidate feature covers about 485 instances, which accounts for 0.083% of the whole training instance set; the average coverage is computed as

$$ac = \frac{\sum_j \sum_{x,y} f_j(x,y)}{\sum_j 1}.$$

The first experiment compares the speed of the IFS algorithm with that of the SGC algorithm. Theoretically speaking, the IFS algorithm computes the gains for all the features at every stage. This means that it requires O(NF) time to select a feature subset of size N from a candidate feature set of size F. In contrast, the SGC algorithm considers many fewer features: only 24.1 features on average at each stage when selecting a feature from the large feature space in this experiment. Figure 3 shows the average number of features whose gains are computed at the selected points, for the SGC algorithm, SGC with 500 look-ahead, and the IFS algorithm. The averaged number of features is taken over the interval from the initial stage to the current feature selection point, in order to smooth out the fluctuation in the number of features considered at each selection stage. The second algorithm looks at an additional fixed number of features, 500 in this experiment, beyond those considered by the basic SGC algorithm. The last algorithm (IFS) considers a linearly decreasing number of features, because the selected features are not considered again. In Figure 3, the IFS algorithm stops after 1,000 features are selected, because it takes too long for this algorithm to complete the entire selection process. The same holds in Figure 4, which is explained below.
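As a rough back-of-the-envelope comparison (our own arithmetic, ignoring that the IFS candidate pool shrinks by one feature per stage and that individual gain computations differ in cost), the per-stage ratio of gain evaluations in this experiment is about

$$\frac{F}{24.1} = \frac{191{,}098}{24.1} \approx 7.9 \times 10^{3},$$

which is in line with the wall-clock speed-ups of up to several thousand times reported below.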

[Figure: x-axis "Number of Selected Features" (200-10,000); y-axis "Average Considered Feature Number" on a log10 scale; curves: Berger (IFS), SGC-0, SGC-500.]

Figure 3: The log number of features considered in the SGC algorithm, in comparison with the IFS algorithm.

To see the actual amount of time taken by the SGC algorithms and the IFS algorithm with currently available computing power, we used a Linux workstation with dual 1.6 GHz Xeon CPUs and 1 GB of memory to run the two experiments simultaneously. As can be expected, excluding the common startup part of the code shared by the two algorithms, the speed-up from using the SGC algorithm is many orders of magnitude, ranging from more than 100 times to thousands of times, depending on the number of features selected. The results are shown in Figure 4.


[Figure: x-axis "Number of Selected Features" (200-10,000); y-axis "Average Time for each selection step (second)" on a log10 scale; curves: Berger (IFS), SGC-0.]

Figure 4: The log time used by the SGC algorithm, in comparison with the IFS algorithm.

To verify the quality of the features selected by our SGC algorithm, we conducted five experiments: the first uses all the features to build a conditional ME model; the second uses the IFS algorithm to select 1,000 features; the third uses our SGC algorithm; the fourth uses the SGC algorithm with 500 look-ahead; and the fifth takes the top n most frequent features in the training data. The precisions are computed on section 23 of the WSJ data in the Penn Treebank. The results are shown in Figure 5. Three observations can be made from this figure. First, the IFS algorithm and the two SGC algorithms perform similarly. Second, 3,000 features seems to be a dividing line: when the models include fewer than 3,000 selected features, the IFS and SGC algorithms do not perform as well as the model with all the features; when the models include more than 3,000 selected features, their performance significantly surpasses that of the model with all the features. The inferior performance of the all-features model on the right side of the chart is likely due to over-fitting. Third, the simple count cutoff algorithm significantly under-performs the other feature selection algorithms when feature subsets with no more than 10,000 features are considered.

To further confirm the findings regarding precision, we conducted another experiment with base NP recognition as the task. The experiment uses sections 15-18 of the WSJ as the training data and section 20 as the test data. When we select 1,160 features from a simple feature space using our SGC algorithm, we obtain a precision/recall of 92.75%/93.25%. The best reported ME work on this task includes Koeling (2000), with a precision/recall of 92.84%/93.18% using a cutoff of 5, and Zhou et al. (2003), who reached 93.04%/93.31% with a cutoff of 7 and 92.46%/92.74% with 615 features selected using the IFS algorithm. While the results are not directly comparable because of the different feature spaces used in these experiments, our result is competitive with these best numbers. This shows that our new algorithm is both very effective in selecting high-quality features and very efficient in performing the task.

5 Comparison and Conclusion

Feature selection has been an important topic in both ME modeling and linear regression. In the past, most researchers resorted to the count cutoff technique for selecting features for ME modeling (Rosenfeld, 1994; Ratnaparkhi, 1998; Reynar and Ratnaparkhi, 1997; Koeling, 2000). A more refined algorithm, the incremental feature selection algorithm of Berger et al. (1996), allows one feature to be added at each selection stage while keeping the estimated parameter values for the features selected in previous stages. As discussed in Ratnaparkhi (1998), the count cutoff technique works very fast and is easy to implement, but has the drawback of retaining a large number of redundant features.

[Figure: x-axis "Number of Selected Features" (200-10,000); y-axis "Precision (%)" (70-94); curves: All (191098), IFS, SGC-0, SGC-500, Count Cutoff.]

Figure 5: Precision results from models using the whole feature set and the feature subsets obtained through the IFS algorithm, the SGC algorithm, the SGC algorithm with 500 look-ahead, and the count cutoff algorithm.


In contrast, the IFS algorithm removes the redundancy in the selected feature set, but its speed has been a big issue for complex tasks. Having realized this drawback of the IFS algorithm, Berger and Printz (1998) proposed an f-orthogonal condition for selecting k features at the same time without much affecting the quality of the selected features. While this technique is applicable for certain feature sets, such as link features between words, the f-orthogonal condition usually does not hold if part-of-speech tags are dominantly present in a feature subset.

Chen and Rosenfeld (1999) experimented with a feature selection technique that uses a χ² test to decide whether a feature should be included in the ME model, where the χ² statistic is computed using the counts from a prior distribution and the counts from the real training data. It is a simple and probably effective technique for language modeling tasks. However, since ME models are optimized using their likelihood or likelihood gains as the criterion, it is important to establish the relationship between the χ² test score and the likelihood gain, and such a relationship is absent.

There is a large body of literature on feature selection in linear regression, where the least mean squared error measure has been the primary optimization criterion. Two issues need to be addressed in order to use these techniques effectively. One is scalability: most statistical literature on feature selection deals with only dozens or hundreds of features, while our tasks usually involve feature sets of around a million features. The other is the relationship between mean squared error and likelihood, similar to the concern expressed in the previous paragraph. These are important issues that require further investigation.

In summary, this paper presents our new improvement to the incremental feature selection algorithm. The new algorithm runs hundreds to thousands of times faster than the original incremental feature selection algorithm. In addition, the new algorithm selects features of a quality similar to that of the original Berger et al. (1996) algorithm, which has also been shown to be better than the simple cutoff method in some cases.

Acknowledgement

This work was done while the first author was visiting the Center for the Study of Language and Information (CSLI) at Stanford University and the Research and Technology Center of Robert Bosch Corporation. The project was sponsored by the Research and Technology Center of Robert Bosch Corporation. We are grateful for the kind support of Prof. Stanley Peters of CSLI. We also thank the three anonymous reviewers, whose comments improved the quality of the paper.

References

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1):39-71.

Adam L. Berger and Harry Printz. 1998. A Comparison of Criteria for Maximum Entropy / Minimum Divergence Feature Selection. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain.

Stanley Chen and Ronald Rosenfeld. 1999. Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models. In Proceedings of ICASSP-1999, Phoenix, Arizona.

Joshua Goodman. 2002. Sequential Conditional Generalized Iterative Scaling. In Proceedings of the Association for Computational Linguistics (ACL), Philadelphia, Pennsylvania.

Rob Koeling. 2000. Chunking with Maximum Entropy Models. In Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 139-141.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

Ronald Rosenfeld. 1994. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. thesis, Carnegie Mellon University, April.

J. Reynar and A. Ratnaparkhi. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington D.C., 16-19.

Zhou Ya-qian, Guo Yi-kun, Huang Xuan-jing, and Wu Li-de. 2003. Chinese and English BaseNP Recognized by Maximum Entropy. Journal of Computer Research and Development, 40(3):440-446, Beijing.

