Learning "State Models" of Language
Progress Report

Dept. of CIS - Senior Design Project 2009-2010

Kiuk Chung
[email protected]
Univ. of Pennsylvania
Philadelphia, PA

Andres F. Velazquez
[email protected]
Univ. of Pennsylvania
Philadelphia, PA

Lyle Ungar
[email protected]
Univ. of Pennsylvania
Philadelphia, PA

ABSTRACT

This senior design project examines a statistical model (Canonical Correlation Analysis) and uses it to predict different properties of the words in corpora such as Wikipedia and Project Gutenberg: what entity type they denote (e.g., person, place, organization), what the words are, what they link to, and what part of speech they comprise.

We have built a system that, given a large sample text acquired (mostly) from Project Gutenberg, removes unnecessary tags, tokenizes the words (using the de facto Penn Treebank tokenization), tags these words with part-of-speech (POS) tags, creates state vectors for each word, runs regression analysis by plugging the POS and state vectors into a linear model, and finally predicts and disambiguates properties of these words. This last step uses a method based on Canonical Correlation Analysis (CCA), which is a generalization of Principal Component Analysis (PCA) to pairs of matrices. CCA-based methods are an alternative to the more commonly used methods based on Hidden Markov Models (HMMs).

We present a Multi-View Learning (MVL) method based on CCA, an alternative to the currently popular Hidden Markov Model (HMM). Running several trials under this method to predict properties of words provides insight (for certain linear models) into the advantage of employing CCA followed by linear regression.

1. INTRODUCTION

In this project we examine the use of correlation-based methods, specifically CCA (see below), on an unlabeled set of data (pairs of views) to learn the commonality between the views (which we call the state-space vector).

Extracting information relevant to a given task in a semi-supervised or unsupervised manner is one of the fundamental modern challenges in machine learning. In many machine learning applications it is easier to collect raw, unlabeled data than labeled data, because obtaining the latter is expensive: it often involves procedures like human hand-labeling, clinical trials, and costly experiments. How unlabeled data can be used to improve performance and reduce the burden of collecting labeled data has been receiving increasing attention in recent years. The "multi-view" approach has emerged as a paradigm for semi-supervised learning.

At the center of our MVL approach lies CCA, which computes the directions of maximal correlation between a pair of matrices. These matrices, in our case, represent smoothed versions of the training text (a huge excerpt from corpora such as Wikipedia or Project Gutenberg), where the training text is smoothed to capture structural patterns in the language contributed by the words both before (past) and after (future) a given word in the text. By employing CCA on these two "views" (past and future smooths), we obtain a latent structure that is common to both. We then use linear regression to predict part-of-speech labels from the resulting state-space vectors.

The advantages of using MVL with CCA become apparent when it comes to scaling. Since CCA and linear regression are both linear, they allow us to work with samples many times larger than those that can be handled by non-linear methods. The fact that MVL with CCA scales well to large data sets outweighs the disadvantage that certain non-linear methods may provide more accurate results for certain problems. Also, implementing CCA is rather straightforward compared to the complex coding required by some non-linear methods.

The fact that MVL can be used to address problems that are currently modeled by HMMs is yet another advantage of multi-view learning. MVL is optimal under the assumptions of linearity and the Markov property: that given the present, the future does not depend on the past. This is precisely the standard Markovian assumption used in HMMs. Hence MVL has the potential to be an efficient alternative for problems currently modeled by HMMs.

To examine the MVL approach with CCA followed by linear regression, we have implemented a system that can be trained on a large input text and then used to predict part-of-speech labels for a small sample text. In this paper we discuss the following:

i. Overview of the implemented MVL system.

ii. Specifics of training the CCA model and the calculations that yield a prediction.

iii. Evaluation (error calculation) of the efficiency and performance of the MVL with CCA method.

1.1 Markov Property

A stochastic process {X_n} (a sequence of random variables) on the state space (S, 𝒮), where S is the set of states and 𝒮 is the σ-field generated by S, adapted to a filtration {F_n}, is said to have the Markov property if for every set A ∈ 𝒮:

$$P(X_{n+1} \in A \mid \mathcal{F}_n) = P(X_{n+1} \in A \mid X_n) \qquad (1)$$

where P is a probability measure [5]. Loosely speaking, this is the property of having no memory: the conditional probabilities of future states depend only on the present state and are independent of the past states. Such processes may be discrete or continuous in time and discrete or continuous in state. An example of a time- and state-continuous Markovian process (a stochastic process with the Markov property) is Brownian motion. Our interest is in Markov chains, which are time-discrete Markovian processes on a discrete state space.

1.1.1 Markov Chain

A Markov chain is defined on a finite or countable state space S with an initial distribution {ν_i}_{i∈S} (where each ν_i ≥ 0 and Σ_i ν_i = 1) and transition probabilities {p_ij}_{i,j∈S} (where each p_ij ≥ 0 and Σ_{j∈S} p_ij = 1 for each i ∈ S). It can be thought of as the representation of a particle moving randomly in the space S: ν_i is the probability that the particle starts at the point i, and p_ij is the probability that the particle moves from point i to point j [17]. Markov chains are stochastic processes, and can be defined as a sequence of random variables {X_i} on S such that for any natural number n and any i_0, i_1, ..., i_n ∈ S the following holds:

$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \nu_{i_0}\, p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_{n-1} i_n}$$

We note that this is the discrete counterpart of equation (1); hence a Markov chain obeys the Markov property. A Markov process is a continuous-time Markov chain, that is, a continuous-time process on a discrete or countable state space.
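
To make the definition concrete, the short Python sketch below evaluates the joint probability formula above for a made-up two-state chain; the particular values of ν and p_ij are illustrative only and are not taken from this report.

    import numpy as np

    # Hypothetical chain on S = {0, 1}: initial distribution nu and transition matrix p.
    nu = np.array([0.5, 0.5])
    p = np.array([[0.9, 0.1],
                  [0.4, 0.6]])

    def path_probability(path, nu, p):
        """P(X_0 = i_0, ..., X_n = i_n) = nu_{i_0} * p_{i_0 i_1} * ... * p_{i_{n-1} i_n}."""
        prob = nu[path[0]]
        for i, j in zip(path[:-1], path[1:]):
            prob *= p[i, j]
        return prob

    print(path_probability([0, 0, 1, 1], nu, p))   # 0.5 * 0.9 * 0.1 * 0.6 = 0.027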

1.1.2 Hidden Markov Model

The Hidden Markov Model is a stochastic model built on a Markov process. As opposed to the regular Markov model, in which the states are observable and the transition probabilities are known, in an HMM the states are "hidden", meaning they either cannot be observed or are unobserved. The outputs, however, which depend on the state, are observable; hence each state has a probability distribution over the outputs. The Hidden Markov Model was introduced and studied in the late 1960s and early 1970s. It has been receiving increasing attention in recent years due to its rich mathematical structure, which provides a strong theoretical basis and a wide range of applications.

In an HMM we start in state S_t, a state vector that represents a single current state. Additionally we have a transition matrix (transition probabilities), an observation (outcome) set {Y_n}, and emission probabilities p_{j,k}. The transition matrix, which we call M, has (i, j)th entry

$$M_{i,j} = P(S_{t+1} = j \mid S_t = i),$$

where t is the time step. The emission probability p_{j,k} is the likelihood of state S_j outputting observation Y_k [16].

The state vectors in an HMM represent only a single state, so computation is bulkier than with the state vectors of the Kalman filter, where multiple states are represented by a single real-valued state vector. To date, the Hidden Markov Model is the statistical model of choice in applications requiring unsupervised learning.
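
To make the notation concrete, the following sketch samples hidden states and observations from a small HMM; the transition matrix M and emission matrix p are made up for illustration and are not part of the system described in this report.

    import numpy as np

    rng = np.random.default_rng(1)
    M = np.array([[0.7, 0.3],        # M[i, j] = P(S_{t+1} = j | S_t = i)
                  [0.2, 0.8]])
    p = np.array([[0.9, 0.1, 0.0],   # p[j, k] = probability that state S_j emits Y_k
                  [0.1, 0.4, 0.5]])

    def sample_hmm(M, p, T, start=0):
        """Return T hidden states and the T observations they emit."""
        states, obs = [start], []
        for _ in range(T):
            obs.append(rng.choice(p.shape[1], p=p[states[-1]]))      # emit from current state
            states.append(rng.choice(M.shape[1], p=M[states[-1]]))   # then transition
        return states[:-1], obs

    print(sample_hmm(M, p, 5))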

1.2 Multi-view Learning

Multi-view learning is the approach we claim will not only be faster than Markov models (as has been suggested), but will also produce better results. By incorporating, for example, real-valued vectors to represent state, we can obtain a much more compact model than that offered by standard Markov models. For example,

[0, 0, ..., 1, ..., 0]          (Single State)

[0.482, −7.261, ..., 2.272]     (Multi State)

The idea of multi-view learning is that some commonality between two different views of a single object defines a representation which can be used to characterize the object. For example, in the two-view setting there are two co-occurring views X1 and X2 of the data, and a target variable Y of interest. Two natural underlying assumptions arise from this setting.

The first assumption is that either one of the views is sufficient to predict Y. Here, the complexity of the learning problem can be reduced by eliminating subspaces of each view that do not agree with each other. This can be done using unlabeled data, and a wide class of algorithms exists for it ([22] [2] [3] [14] [11] [9] [21] [20] [10] [19]).

The second assumption is that the two views are independent conditioned on some set of "hidden variables". This is meant to capture the intuition that there is "hidden" state that generates the two views, and it is this hidden structure that is relevant for predicting target variables. If we can capture information about this "hidden state" using the correlation structure between the two views, then this information can be used in subsequent supervised learning tasks [2] [1]. We focus on this second idea.

1.3 Canonical Correlation Analysis (CCA)

At the heart of multi-view learning lies an important technique of applied linear algebra known as CCA [8]. It is the analog, for pairs of matrices, of the technique known as PCA.

PCA computes the directions of maximum covariance between elements of a single matrix. It does this by computing the eigenvectors of the correlation matrix of that matrix; thus it can be cast as an eigenvalue problem on a covariance matrix. However, compared to CCA it wastes dimensions (CCA reduces the dimension of the original data set), and hence it is not as efficient as CCA.

We discuss the mathematics behind CCA using arrays (i.e., one-dimensional vectors). Let X = (X_1, X_2, ..., X_m) and Y = (Y_1, Y_2, ..., Y_n) be zero-mean random vectors taking values in R^m and R^n respectively. CCA finds the directions W_X and W_Y that maximize the correlation between the projections x = W_X^t X and y = W_Y^t Y (x and y are often referred to as canonical correlates) [6]. Hence we must maximize the Pearson product-moment correlation coefficient for x and y, defined as follows:

$$\rho_{x,y} = \frac{\mathrm{Cov}(x, y)}{\sigma(x)\,\sigma(y)} = \frac{E(xy) - E(x)E(y)}{\sqrt{E(x^2)\,E(y^2)}} \qquad (2)$$

where E and σ stand for the expected value and standard deviation respectively. Since X and Y are assumed to have mean zero, equation (2) reduces to

$$\rho_{x,y} = \frac{E(xy)}{\sqrt{E(x^2)\,E(y^2)}} \qquad (3)$$

$$= \frac{E(W_X^t X\, Y^t W_Y)}{\sqrt{E(W_X^t X X^t W_X)\, E(W_Y^t Y Y^t W_Y)}} \qquad (4)$$

$$= \frac{W_X^t\, \Gamma_{X,Y}\, W_Y}{\sqrt{W_X^t\, \Gamma_{X,X}\, W_X \; W_Y^t\, \Gamma_{Y,Y}\, W_Y}} \qquad (5)$$

where Γ_{X,Y} in equation (5) is the cross-covariance matrix, defined as the matrix whose (i, j)th element is the covariance between the ith element of X and the jth element of Y (i.e., Cov(X_i, Y_j)).

To maximize ρ_{x,y}, we set the derivatives of ρ_{x,y} (equation (4)) with respect to W_X and W_Y equal to zero [6], and we obtain:

$$\Gamma_{X,X}^{-1}\,\Gamma_{X,Y}\,\Gamma_{Y,Y}^{-1}\,\Gamma_{Y,X}\, W_X = \rho_{x,y}^2\, W_X \qquad (6)$$

$$\Gamma_{Y,Y}^{-1}\,\Gamma_{Y,X}\,\Gamma_{X,X}^{-1}\,\Gamma_{X,Y}\, W_Y = \rho_{x,y}^2\, W_Y \qquad (7)$$

These are eigenvalue equations: ρ_{x,y}^2 is the eigenvalue of the linear transformation Γ_{X,X}^{-1} Γ_{X,Y} Γ_{Y,Y}^{-1} Γ_{Y,X} corresponding to the eigenvector W_X, and of the linear transformation Γ_{Y,Y}^{-1} Γ_{Y,X} Γ_{X,X}^{-1} Γ_{X,Y} corresponding to the eigenvector W_Y [6].

CCA finds a set of basis vectors that are orthonormal and such that the subspaces spanned by these vectors are maximally correlated [7]. Let Π_{X_1} denote the projection of an observation X_1 onto the canonical correlates (referred to as x and y above). The dimension of Π_{X_1} is determined by how many basis vectors are used in the projection. This projection can be viewed as the "state" characterizing the object, since Π_{X_1} is a low-dimensional representation of the observation. This state can then be used either to compute similarity to the states of other observations, or as features in a regression to predict, for example, a label for the observation. CCA looks in many ways similar to the widely used PCA; it can be viewed as computing the principal components of the correlation between the views X and Y. However, CCA offers several advantages over PCA: it is invariant to scale and affine transformations, so, unlike with PCA, one does not need to worry about rescaling the data.
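
As a numerical sketch of equations (6) and (7), the following Python/numpy snippet computes the top-k canonical directions for zero-mean sample matrices with one observation per row. The small ridge term added to the covariance blocks is our own assumption for numerical stability and is not part of the derivation above.

    import numpy as np

    def cca_directions(X, Y, k, ridge=1e-6):
        """Top-k canonical directions W_X, W_Y for zero-mean data matrices."""
        n = X.shape[0]
        Gxx = X.T @ X / n + ridge * np.eye(X.shape[1])   # Gamma_{X,X}
        Gyy = Y.T @ Y / n + ridge * np.eye(Y.shape[1])   # Gamma_{Y,Y}
        Gxy = X.T @ Y / n                                # Gamma_{X,Y}

        # Equation (6): eigenvectors of Gxx^{-1} Gxy Gyy^{-1} Gyx give W_X
        A = np.linalg.solve(Gxx, Gxy) @ np.linalg.solve(Gyy, Gxy.T)
        vals, vecs = np.linalg.eig(A)
        Wx = vecs[:, np.argsort(-vals.real)[:k]].real

        # W_Y can be recovered from W_X (equation (7) yields it directly as well)
        Wy = np.linalg.solve(Gyy, Gxy.T) @ Wx
        return Wx, Wy

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 5)); X -= X.mean(0)
    Y = rng.standard_normal((500, 4)); Y -= Y.mean(0)
    Wx, Wy = cca_directions(X, Y, k=2)
    print(Wx.shape, Wy.shape)   # (5, 2) (4, 2)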

1.4 Exponential Smoothing

Smoothing, in statistics, is a way to eliminate noise and irregularities in time series data (observations that have a time or spatial ordering) while keeping the important intrinsic properties of the data. Exponential smoothing is a specific type of smoothing in which events further away from the "present" are given exponentially decreasing weights. In contrast, simple moving average smoothing is a technique where events are given equal weights and simply averaged over t + 1 observations. Hence if we let {x_n} represent time series data, where each x_n is an observation at time n, then the smooths (which we call {s_n}) are found using the following simple formula:

$$s_n = \frac{1}{t+1} \sum_{k=0}^{t} x_{n-k}$$

The major drawback of this method is that it cannot capture the fact that events far away from the present often have less impact on the current observation, and hence should be given less weight. An improvement is weighted moving average smoothing, in which one first chooses a vector of weighting factors {α_0, α_1, ..., α_t} such that $\sum_{k=0}^{t} \alpha_k = 1$. Then the average is computed by weighting the t + 1 observations according to the corresponding α_k:

$$s_n = \sum_{k=0}^{t} \alpha_k\, x_{n-k}$$

Hence by choosing a decreasing sequence of weighting factors, one can give more importance to observations closer to the present and less to observations further away. However, these moving average methods have the disadvantage that, since the first t observations are required for computing the smooth s_n, they cannot be used on the terms 0, 1, ..., t.

Exponential smoothing addresses these shortcomings by calculating the smooths recursively:

$$s_n = \alpha\, x_{n-1} + (1 - \alpha)\, s_{n-1}, \qquad s_0 = x_0, \quad 0 < \alpha < 1$$

Here α is a smoothing factor (we often call this the "lag"). If α is closer to 1, exponential smoothing gives greater weight to the events that are closer to the "present"; if α is closer to 0, it gives less weight to the recent observations. From the formula above, it is easy to see that exponential smoothing needs only the previous smooth and the most recent observation to compute each new smooth. The Kalman filter (see Section 1.5) is a closely related method that attempts to estimate values close to the observations using time series data.
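
A minimal sketch of the recursion above; the sample series and the value of α are made up for illustration.

    def exponential_smooth(x, alpha):
        """s[0] = x[0]; s[n] = alpha * x[n-1] + (1 - alpha) * s[n-1]."""
        s = [x[0]]
        for n in range(1, len(x)):
            s.append(alpha * x[n - 1] + (1 - alpha) * s[n - 1])
        return s

    series = [2.0, 4.0, 3.0, 5.0, 6.0]
    print(exponential_smooth(series, alpha=0.5))   # [2.0, 2.0, 3.0, 3.0, 4.0]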

1.5 The Kalman Filter

The Kalman filter is a recursive estimator. To compute an estimate of the current state with the Kalman filter, one needs only the estimated state from the previous iteration and the current measurement; no history of observations and estimates is required. The filter has "Predict" and "Update" phases: the "Predict" phase uses the estimate from the previous iteration to produce a current estimate, and the "Update" phase combines the current prediction with the observation at this step to refine the state estimate.
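
A minimal one-dimensional sketch of the predict/update cycle, assuming a random-walk state model; the process-noise and measurement-noise values q and r are illustrative assumptions, not parameters of the system described in this report.

    def kalman_1d(measurements, q=1e-3, r=0.25, x0=0.0, p0=1.0):
        """One-dimensional Kalman filter for a random-walk state model."""
        x, p = x0, p0                  # state estimate and its variance
        estimates = []
        for z in measurements:
            p = p + q                  # predict: uncertainty grows by the process noise
            k = p / (p + r)            # update: Kalman gain blends prediction and measurement
            x = x + k * (z - x)
            p = (1 - k) * p
            estimates.append(x)
        return estimates

    print(kalman_1d([1.1, 0.9, 1.2, 1.0]))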

2. RELATED WORK

A variety of techniques have been proposed over the years to make use of unlabeled or partially labeled data to improve machine learning. Co-training [2] [18] uses two different views of the same data and thus closely resembles our proposed model. However, co-training makes stronger demands on the data, and the algorithms that handle such data are easily broken.

Other multi-task methods take a complementary view, using the same features to predict multiple labels divided into "main" and "auxiliary" tasks. Multi-task methods such as Caruana [1997], Ando and Zhang [2005], and Zhang et al. [2005] assume that the same latent structure that predicts the auxiliary task labels will also be predictive of the main task. Rather than assuming that the features are divided into complementary sets, they assume that the same features can be applied to different tasks.

The method of Ando and Zhang [2007] assumes conditional independence: X1 and X2 are assumed to be independent conditioned on Y. This is closer to the spirit of this proposal. Ando and Zhang [2007] do not explicitly consider CCA, but their results show that CCA can be used to project the views down to a lower-dimensional space such that the projection does not lose the predictive information about the target Y.

2.1 Drawbacks

Ando and Zhang [2007]

The drawback to the assumption in this method is that conditional independence based on the observed target (as opposed to the hidden state) is too stringent.

Kakade and Foster [2007]

The drawback to the redundancy assumption in this method is that the two predictors are often not equally good (e.g., due to occlusion).

In theory we may expect a significant improvement when we have both views; nonetheless, we hope that both views can still be used in an unsupervised manner. We will address both of the drawbacks mentioned above by relaxing our assumptions.

2.2 Advantages of CCA

CCA offers a variety of advantages. As mentioned above, it is invariant to scale and affine transformations. Furthermore, if the views of the data contain a large number of features that are irrelevant to the classification of the object while still being correlated with each other, CCA will capture the features common to the different views and reject the extraneous dimensions. PCA, on the other hand, will waste many dimensions trying to describe irrelevant features. The next example shows the advantage of CCA over PCA on a large data set.

Example 2.2.1

Suppose the two views are a video stream and an audio stream of the same scene. One approach would be to use PCA on a very large single matrix, using a huge number of dimensions to describe the scene. A much more efficient approach would be to use CCA, which would recognize that most of the visual scene is entirely uncorrelated with the audio; such projections would not be included in the canonical basis.

CCA is ideally suited to estimating state for linear dynamical systems such as Kalman filters, since the Markov assumption makes the past and future observations independent conditional on the current state. CCA has been used for time series, for example in the systems identification literature [13] [15] [12]. More recently, work by Hsu et al. [4] shows how CCA can be used to learn HMMs. We will attempt to characterize how the multi-view assumption makes each problem easier, both in terms of better capturing states and in terms of computational complexity.

3. SYSTEM MODEL

At its core, multi-view learning is simply CCA on the two views followed by regression or clustering to predict the label. Different algorithms can be used for the CCA and for the regression/clustering components. We are focusing on sparse CCA and sparse regression and testing a variety of such methods.

In this project we combine three closely coupled aspects:

1. Learning state (an object trained on a large data set using CCA) in language: how it can be efficiently estimated, and what that state represents in terms of syntax (parts of speech) and semantics.

2. Implementation of a pipeline that will efficiently scale to very large data sets.

3. Demonstrations of the efficacy of multi-view learning on machine reading comprehension prediction and disambiguation.

A generic pipeline has been put together such that exchanging different components of the system is easy. This lets us experiment with different algorithms and implementations to improve efficiency and performance. Moreover, by decomposing the pipeline into sensible stages, we have decoupled key aspects so as not to waste time and resources generating multiple copies of the same data or performing unnecessary calculations.

This is crucial for this project because we deal with extremely large quantities of raw data (several gigabytes or more of compressed excerpts from a corpus), so generating objects and models are very expensive operations that should ideally be performed only once per data set. At this scale, functions that seem simple (such as the tokenization of text) can become bottlenecks and can give rise to memory issues if not designed properly.

The key to designing a seamless and efficient pipeline was to break the whole process into stages and tweak its components with the goal of producing an accurate and efficient predictive model that estimates state representations in terms of syntax and semantics. The pipeline we built for this purpose is displayed below (in training mode, Figure 3.1a, and in prediction mode, Figure 3.1b):

Figure 3.1a - Pipeline of Multi-view Learning System (Training)

While a more detailed description of the data flow depicted in this pipeline and the next is provided in the System Implementation section, we note that the above pipeline describes the data flow when the system is run in 'Training' mode, with the intent of building a statistical model and regression coefficients from a large amount of text, whereas the subsequent diagram depicts the more regular flow of data when the system is run in 'Predict' mode, with the intent of obtaining a prediction on a (normally smaller) piece of text.

Both pipelines use the same system components, albeit with different control flow. Figure 3.1a handles the initial training run, in which a CCA model is constructed and written to disk using a sizable sample of English text. On subsequent runs, or prediction runs, the pipeline in Figure 3.1b illustrates the flow followed to generate a prediction from the previously stored statistical CCA model and regression coefficients.

Figure 3.1b - Pipeline of Multi-view Learning System (Predicting)

4. SYSTEM IMPLEMENTATION

The MVL with CCA system that we have built consists of two major components, namely training and predicting. In order to ensure accurate and efficient predictions, we must first train the CCA model as well as the labels (regression coefficients) and store the results on disk. Hence, when one runs the prediction part of the pipeline, the trained CCA model and coefficients are simply loaded from disk and applied to the sample text.

4.1 Detailed Explanation of Pipeline (Figure 3.1a and Figure 3.1b)

1. The input to run_mvl.py is a text file containing file names separated by new lines, and a boolean parameter that specifies whether the run is "training" or "prediction". The file names point to the files from which we draw the text for the run. Naturally, for "training" the text files are large in size, whereas for prediction they are smaller samples. The main function then creates a list structure containing the file names as entries, which we refer to as a "document list", and calls the appropriate type of MVL simulation (missing words or POS).

2. Every word in every file in the document list provided above is scanned and added to a dictionary, and a certain number of them (for example, one hundred) are assigned unique integer IDs. The more words we choose, the less sparse the matrices that represent them become; hence there is a trade-off between how rich a vocabulary the model captures and the computational complexity. Choosing the most common words is optimal from the point of view of capturing the linguistic properties. From this point on these (one hundred) words have numerical identifiers, and the rest of the words are identified as zero. We refer to this as the X_dictionary. The X_dictionary is used to map each word to an integer, producing a sequence of integers which we refer to as int_doc. Depending on the dictionary used (X_dictionary or Y_dictionary), we name the int_doc int_doc_x or int_doc_y respectively. For example, if the text is

It takes its name from the first six characters seen.

and the dictionary maps each word to an integer as follows:

Table 4.1 - Dictionary for sample sentence

It         → 5        the        → 6
takes      → 2        first      → 0
its        → 1        six        → 0
name       → 0        characters → 0
from       → 0        seen       → 3

Then the int_doc produced is

It  takes  its  name  from  the  first  six  characters  seen
 ↓    ↓     ↓    ↓     ↓     ↓     ↓     ↓       ↓        ↓
 5    2     1    0     0     6     0     0       0        3

[5 2 1 0 0 6 0 0 0 3]

Now, to make a matrix from the int_doc produced, we let each row represent a word and each column correspond to one of the integer IDs, placing a 1 in the column matching the word's ID (words mapped to zero are left as all-zero rows). Hence the matrix produced by this example looks as follows:

Figure 4.1 - Matrix for sample sentence

It         → | 0 0 0 0 0 1 0 |
takes      → | 0 0 1 0 0 0 0 |
its        → | 0 1 0 0 0 0 0 |
name       → | 0 0 0 0 0 0 0 |
from       → | 0 0 0 0 0 0 0 |
the        → | 0 0 0 0 0 0 1 |
first      → | 0 0 0 0 0 0 0 |
six        → | 0 0 0 0 0 0 0 |
characters → | 0 0 0 0 0 0 0 |
seen       → | 0 0 0 1 0 0 0 |

3. Now there are two possibilities. If the run is intended for prediction of properties (POS or missing words), we read a pre-computed CCA model from disk and go to step 4. If training is the purpose, it is assumed that the file list we were given is the training data, and we proceed to do the following:

3a. In order to obtain a CCA we require the eigenvectors of a matrix given by a product of square smoothing matrices. To do this, the compute_correlations function uses the X_dictionary and the sample text to build a matrix M as shown in Figure 4.1, then computes a left smooth and a right smooth, referred to as the left view L and the right view R respectively. Given α, the smoothing rate, the exponential smoothing is done as follows:

$$L_{[t,\cdot]} = (1 - \alpha)\, L_{[t-1,\cdot]} + \alpha\, M_{[t,\cdot]}$$

$$R_{[t,\cdot]} = (1 - \alpha)\, R_{[t-1,\cdot]} + \alpha\, M_{[t,\cdot]}$$

where the notation L_{[t,·]} represents the (t+1)th row of the matrix L, R_{[t,·]} represents the (t−1)th row of the matrix R, and M_{[t,·]} represents the tth row of the matrix M.

The dimensions of the left (L) and right (R) view matrices can become rather large, so we multiply them by their transposes to obtain square matrices of smaller dimension, producing the matrices LL, RR, and LR:

LL = L^t L
RR = R^t R
LR = L^t R

These matrices are, respectively, the square (transpose times original) of the left smooth, the square of the right smooth, and the product of the transpose of the left smooth with the right smooth.

With the above in mind, the canonical components are the eigenvectors φ of the matrix (L^t L)^{-1} L^t R (R^t R)^{-1} R^t L; hence, if we let λ be the eigenvalue,

$$(L^t L)^{-1} L^t R\, (R^t R)^{-1} R^t L\, \varphi = \lambda \varphi$$

where, as mentioned above, L is the left view and R is the right view.

3b. The CCA, which we call leftCCs, is essentially a matrix composed of the eigenvectors φ corresponding to the k largest eigenvalues λ by magnitude.

3c. Now we calculate a state matrix (state), in which each row is the estimated state of a token as influenced by the words before it. To obtain state, we multiply L by leftCCs:

state = L · leftCCs

3d. Since we are carrying out training in this sequence of steps, we are now interested in generating our regression coefficients.

The function train_emissions() takes the matrix state and a sample Y matrix built from a dictionary based either on the same logic as the X_dictionary (in the case of missing words) or on data pre-tagged by our POS tagger, TNT. The linear algebra package then returns a coefficient matrix β, to be stored alongside our CCA for future use, by doing a simple linear regression:

$$\beta = (\text{state}^t\, \text{state})^{-1}\, \text{state}^t\, Y$$

4. For predictions, the pre-computed leftCCs and the coefficients β are read from disk, and a left smooth matrix L is computed (as discussed in step 3a) for the sample text input. Multiplying L by the loaded leftCCs, we obtain the state matrix for the sample input.

5. With β and state in hand, we calculate the prediction Ŷ by applying the regression coefficients learned during training to the new state of the sample data:

Ŷ = state · β

This predicted matrix Ŷ may then be translated into labels (depending on whether we are doing POS prediction or missing-word prediction).

6. Lastly, an error check is run by computing the root mean square (RMS) of Y − Ŷ, where Y is the original matrix and Ŷ is the predicted matrix. If Y and Ŷ are [n × m] matrices, then the RMS error is calculated as follows:

$$\mathrm{ERROR}_{rms} = \sqrt{\frac{\sum_{k=1}^{n} \sum_{j=1}^{m} \left(Y_{k,j} - \hat{Y}_{k,j}\right)^2}{n \cdot m}}$$

The RMS measures the magnitude of the deviation and hence gives us a measure of how close our prediction Ŷ is to the actual Y: the smaller the RMS error, the closer we are to the actual values, and the bigger the RMS error, the further our prediction is from them. (A condensed sketch of the calculations in steps 2 through 6 appears after this list.)
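
The following condensed sketch ties together the calculations in steps 2 through 6. It is a simplified, hypothetical stand-in for the actual run_mvl.py, compute_correlations, and train_emissions code: the toy vocabulary and label matrix are made up, the ridge term and the boundary handling of the smooths are our own choices, and we smooth the right view backward through the text as one interpretation of the "future" view (the report lists the same recursion for both views).

    import numpy as np

    def one_hot_matrix(int_doc, vocab_size):
        """Step 2: one row per token, a 1 in the column of its integer ID (ID 0 stays all-zero)."""
        M = np.zeros((len(int_doc), vocab_size))
        for t, word_id in enumerate(int_doc):
            if word_id > 0:
                M[t, word_id] = 1.0
        return M

    def smooth_views(M, alpha):
        """Step 3a: exponential smooths of M giving the left (past) and right (future) views."""
        L, R = np.zeros_like(M), np.zeros_like(M)
        T = M.shape[0]
        L[0] = alpha * M[0]
        for t in range(1, T):
            L[t] = (1 - alpha) * L[t - 1] + alpha * M[t]
        R[T - 1] = alpha * M[T - 1]
        for t in range(T - 2, -1, -1):
            R[t] = (1 - alpha) * R[t + 1] + alpha * M[t]
        return L, R

    def left_ccs(L, R, k, ridge=1e-6):
        """Steps 3a-3b: eigenvectors of (L^t L)^{-1} L^t R (R^t R)^{-1} R^t L for the k largest eigenvalues."""
        LL = L.T @ L + ridge * np.eye(L.shape[1])
        RR = R.T @ R + ridge * np.eye(R.shape[1])
        LR = L.T @ R
        A = np.linalg.solve(LL, LR) @ np.linalg.solve(RR, LR.T)
        vals, vecs = np.linalg.eig(A)
        return vecs[:, np.argsort(-np.abs(vals))[:k]].real

    # Training (steps 3c-3d): state = L . leftCCs, then beta by least squares.
    int_doc = [5, 2, 1, 0, 0, 6, 0, 0, 0, 3]                       # from the example above
    Y = np.eye(3)[np.random.default_rng(0).integers(0, 3, 10)]     # toy label matrix (10 x 3)
    M = one_hot_matrix(int_doc, vocab_size=7)
    L, R = smooth_views(M, alpha=0.5)
    ccs = left_ccs(L, R, k=3)
    state = L @ ccs
    beta, *_ = np.linalg.lstsq(state, Y, rcond=None)

    # Prediction and error check (steps 5-6)
    Y_hat = state @ beta
    print(np.sqrt(np.mean((Y - Y_hat) ** 2)))                      # RMS error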

4.2 Tweaks

Figure 4.1 - Old Pipeline of Multi-view Learning System

[Diagram omitted: stages include Raw Text (Wiki), Tokenize, Cleaned Text, Stanford Tagger, Tagged Text, CCA, State Labeled Text, Regression (Mallet), Predictive Model, and Interpretation of State.]

In this section we discuss some major performance tweaks added to the pipeline along the way in order to allow for the faster, more efficient, and larger runs necessary to generate competitive models from vast amounts of data.

4.2.1 Faster Tagging

The first large bottleneck was Stage 3b (see Figure 4.1), where tokens were being tagged with POS tags. Originally the Stanford POS Tagger was used; however, this proved inefficient on large sample texts. As we can see in Figure 4.2 below, the Stanford tagger takes 163.38 seconds to tag 100 words, and the time grows exponentially as we take larger and larger sample texts. In fact, to tag a whole book (Pride and Prejudice) the Stanford Tagger took 2819.58 seconds (about 47 minutes). This may seem acceptable, but considering that the book was only 701 KB in size, compared to the gigabytes of data we would be working with, the Stanford Tagger would be impractical for our purposes.

Figure 4.2

Aside from the slow speed of the Stanford Tagger, another issue was that it used an internal tokenizer to tokenize the text. This later caused problems when syncing vectors between the state-labeled and POS-tagged files. Moreover, having to tokenize the same text twice, using two potentially different tokenizers, would not scale well when dealing with large data sets. The solution to these problems was to use a different POS tagger. The TNT Tagger (described in Section 4, System Implementation) was not only faster but also decoupled tokenization from tagging. Hence we had control over which tokenization standard to use and only had to tokenize the text once. Figure 4.3 shows the speed of the TNT Tagger when run on the same set of texts as the Stanford Tagger. Notice that, while still exponential in time complexity, the TNT tagger runs several orders of magnitude faster.

Figure 4.3

The main reason for such a dramatic difference in speed between the two taggers is that the Stanford tagger builds a statistical model (maximum entropy or logistic regression), whereas the TNT tagger does a table look-up over the relevant n-grams, and table look-up is faster. On a logarithmic scale we can readily observe (see Figure 4.4 below) that the TNT Tagger is the optimal choice.

Figure 4.4

4.2.2 Faster Linear Algebra

A second bottleneck emerged when generating a CCA (which requires some heavy matrix computations from the linear algebra package): running in interpreted Python was just not cutting it. In order to generate these CCA models from a lot of data (the key to obtaining good state and properly capturing the semantics of our corpora), the implementation of the CCA was changed from Python to C++. The C++ implementation was able to overcome this difficulty and generate much more solid CCA models for use in the system.

4.2.3 An Iterative Approach

One final change made to the computationally intense portion of the pipeline was making the MVL algorithm iterative in nature. In this approach, every input token is accompanied not only by a state vector but also by an attribute vector (estimated as an average of the state vectors computed for that word across iterations, and used as the stopping criterion for the iterative algorithm, thereby giving us a more reliable state). The essence of the iterative approach is as follows (a sketch in code appears after this list):

1. As usual, we take the input text and transform it into an integer sequence representation (alongside a matrix representation), but we keep track of a separate dictionary, referred to as the "attribute dictionary", which initially maps every token to a vector (of the same dimension as the state vector for this run) of random numbers taken uniformly from the interval (0, 1) (this is our initial guess, typical of iterative algorithms).

2. We define the state of each token to be the attribute vector stored in the "attribute dictionary" entry for that token.

3. We now carry out exponential smoothing as in the original algorithm, but do so on the state estimate matrices instead of the matrix representation of the integer sequence. We carry out the smoothing, as usual, for the states both before and after each token to get our usual left and right view pairs. The formula, with smoothing coefficient (or lag) α, is as before:

$$L_{[t,\cdot]} = (1 - \alpha)\, L_{[t-1,\cdot]} + \alpha\, M_{[t,\cdot]}$$

$$R_{[t,\cdot]} = (1 - \alpha)\, R_{[t-1,\cdot]} + \alpha\, M_{[t,\cdot]}$$

4. Now, using the CCA, we estimate (or re-estimate, if this is not the first iteration) the state of each token using the familiar

state = L · leftCCs

5. At this point we estimate the attributes for each token again by averaging the states estimated for that particular token.

6. We compute the change in the attribute vectors (numerically, using the usual Euclidean norm) and halt the algorithm if and only if the change from the previous iteration is bounded by some positive tolerance ε. If it is not, we go back to step 2.
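
A compact sketch of this loop, assuming the hypothetical smooth_views and left_ccs helpers from the Section 4.1 sketch are already defined; the attribute dimension, lag, tolerance, and iteration cap are illustrative choices rather than the system's actual settings.

    import numpy as np

    def iterative_mvl(int_doc, dim, alpha, tol=1e-4, max_iter=50, seed=0):
        """Iteratively re-estimate per-token states and per-word attribute vectors."""
        rng = np.random.default_rng(seed)
        attrs = {w: rng.random(dim) for w in set(int_doc)}            # step 1: random guess in (0, 1)
        ids = np.array(int_doc)
        for _ in range(max_iter):
            M = np.array([attrs[w] for w in int_doc])                 # step 2: token state = attribute
            L, R = smooth_views(M, alpha)                             # step 3: smooth past and future
            state = L @ left_ccs(L, R, k=dim)                         # step 4: project onto the CCA
            new_attrs = {w: state[ids == w].mean(axis=0) for w in attrs}   # step 5: average per word
            change = max(np.linalg.norm(new_attrs[w] - attrs[w]) for w in attrs)
            attrs = new_attrs
            if change < tol:                                          # step 6: stop when attributes settle
                break
        return state, attrs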

5. RESULTS

When tested, the system performs very well. Accuracy has been high in almost all instances, and very little training is required for the actual regression. The real work goes into building the CCA matrix: the larger (and hence more time-consuming) this is, the better the results become. Best of all, this arduous process must be carried out only once; having it run in C++ rather than Python was an excellent decision. To start off, we set the system up to disambiguate whether words are nouns or not. We ran the prediction engine with varied numbers of tokens used to train the regression coefficients and obtained the following accuracy results (with an RMSE lower than random or most-frequent guessing). A graph of the results follows:

Figure 5.1 Accuracy of System for Disambiguating Whether Noun or Not

The zoomed vertical axis serves to show that performance does increase (albeit marginally) with a larger training set for the coefficients (plus or minus random variation, of course, as we can see with the right-most bar). The fact that the changes in performance are minimal as the coefficient training set grows lets us cut down training time for this part of the pipeline, since merely over one thousand tokens suffice for a strong enough linear model most of the time.

For the same dictionary (disambiguating between nouns and non-nouns) we tested the system on 200 popular English works of literature. The following is a boxplot of the accuracy over the 200 books:

Figure 5.2 Whisker Plot for Prediction on 200 English Texts

Even the worst performance was well above the accuracy of guessing one way or the other, and the mean centered in the high 80s to low 90s of accuracy percentage.

While a lot of testing was conducted on this particular property set (noun or not noun), the generic nature of the system allowed us to substitute other dictionaries to attempt prediction over a larger set of possibilities. The following graph (discussed below) illustrates performance over several different types of prediction.

Figure 5.3 Accuracy Across Different Prediction Dictionaries

For comparison, the first bar is the noun-or-not-a-noun dictionary. The second bar attempts another interesting possibility: distinguishing whether each token is a verb or not, and if it is, what type of verb it is (the acronyms VB, VBN, and VBD stand respectively for base-form verb, past-participle verb, and past-tense verb). As one can see, the performance is still good, at nearly 80%. In this example, however, we noticed that always guessing "none of the three" performs at 92% accuracy; this is because these are not all the verb types that the TNT tagger handles (there are other types like VBG and VBZ, which stand respectively for a verb ending in 'ing' and a verb ending in 's'), and even then verbs do not constitute a majority of the words (sentences normally contain only one verb). Because of this, such guessing outperforms the model in this case. The final bar refers to the accuracy of disambiguating among all 42 tags provided by the TNT part-of-speech tagger. While the accuracy seems low at 29%, guessing the most common tag would yield about 24% accuracy, so beating this shows that the system is working well. A much larger CCA, trained over a cluster of computers for days, would perform even better; the CCA we trained was run on a single computer for only several hours.

Overall the graphs reveal some faults, but they demonstrate that the system is able to quickly and accurately disambiguate between linguistic properties of text, as we expected it to.

6. FUTURE WORK

Seeing as the system as it stands produces reliable results, a first idea for future work, which we would have carried out with more time, would be to expedite the computation of the canonical correlation matrix. This algorithm (even the tweaked iterative version) forms the hotspot of the system, and future work would probably focus a lot of effort on it. Running this procedure in as distributed and parallel a fashion as possible would be instrumental in generating much larger and more statistically capable models, and thereby gaining even better prediction capabilities.

Other work would involve making the system more interactive, smoother, and better structured overall. For example, one possibility would be to rewrite the entire system in C++, even though the computationally intense part has already been rewritten in C++. Experimenting with different CCAs is something we unfortunately did not have much time to do, as these are extremely time-consuming to generate. Data from different CCAs would provide great insight into how much better the system can be and what the right trade-offs really are between CCA size and reliability versus the computational effort involved.

7. CONCLUSION

Overall, this year-long project was very exciting. It was a demanding and difficult project, however, and it required an incredible amount of attention to detail and becoming acquainted with legacy code. The legacy code presented a relatively large challenge, as it is something we were not used to encountering in large quantities (the main repository out of which we operated had over ten thousand revisions).

While much work was required, the system performed with a high degree of accuracy and was fast, considering that it involved a lot of disk I/O and was written primarily in Python.

8. REFERENCES

[1] Rie Kubota Ando and Tong Zhang. Two-view feature generation model for semi-supervised learning, 2007.

[2] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training, 1998.

[3] M. Collins and Y. Singer. Unsupervised models for named entity classification, 1999.

[4] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models, 2008.

[5] Rick Durrett. Probability: Theory and Examples. Duxbury Press, third edition, 2004.

[6] Shaogang Gong, Caifeng Shan, and Tao Xiang. Visual inference of human emotion and behaviour. In ICMI '07: Proceedings of the 9th International Conference on Multimodal Interfaces, pages 22–29, New York, NY, USA, 2007. ACM.

[7] David R. Hardoon, Sandor R. Szedmak, and John R. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.

[8] H. Hotelling. Journal of Educational Psychology, 1935.

[9] Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor, and Sandor Szedmak. Two view learning: SVM-2K, theory and practice, 2005.

[10] K. Ganchev, J. Graca, J. Blitzer, and B. Taskar. Multi-view learning over structured and non-identical outputs, 2008.

[11] Sham M. Kakade and Dean P. Foster. Multi-view regression via canonical correlation analysis, 2007.

[12] T. Katayama. Subspace methods for system identification, 2005.

[13] L. Ljung. System identification: Theory for the user, 1987.

[14] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training, 2000.

[15] P. V. Overschee and B. D. Moor. Subspace identification of linear systems, 1996.

[16] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[17] Jeffrey S. Rosenthal. A First Look at Rigorous Probability Theory. World Scientific Publishing Company, second edition, 2006.

[18] Sanjoy Dasgupta, Michael L. Littman, and David McAllester. PAC generalization bounds for co-training, 2001.

[19] Vikas Sindhwani and David Rosenberg. An RKHS for multi-view learning and manifold co-regularization, 2008.

[20] Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel. Efficient co-regularised least squares regression, 2006.

[21] V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularized approach to semi-supervised learning with multiple views, 2005.

[22] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods, 1995.

