Learning "State Models" of Language
Progress Report
Dept. of CIS - Senior Design Project 2009-2010

Kiuk Chung
[email protected]
Univ. of Pennsylvania
Philadelphia, PA

Andres F. Velazquez
[email protected]
Univ. of Pennsylvania
Philadelphia, PA

Lyle Ungar
[email protected]
Univ. of Pennsylvania
Philadelphia, PA

ABSTRACT

This senior design project examines a statistical model, Canonical Correlation Analysis (CCA), and uses it to predict properties of the words in corpora (for example, Wikipedia and Project Gutenberg), such as what entity type (e.g., person, place, organization) the words are, what they link to, and what part of speech they belong to. We have built a system that, given a large sample text acquired (mostly) from Project Gutenberg, removes unnecessary tags, tokenizes the words (using the de facto standard Penn Treebank tokenization), tags these words with part-of-speech (POS) tags, creates state vectors for each word, runs regression analysis by plugging the POS tags and state vectors into a linear model, and finally predicts and disambiguates properties of these words. We do this last part using a method based on CCA, which is a generalization of Principal Component Analysis (PCA) to pairs of matrices. CCA-based methods are an alternative to the more commonly used methods based on Hidden Markov Models (HMMs).

We present a Multi-View Learning (MVL) method based on CCA, an alternative to the currently popular Hidden Markov Model (HMM). Running several trials under this method to predict properties of words will provide insight (for certain linear models) into the advantage of employing CCA followed by linear regression.

1. INTRODUCTION

In this project we examine the use of correlation-based methods, specifically CCA (see below), on an unlabeled set of data (pairs of views) to learn the commonality between the views (which we call the state-space vector).
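For reference, the standard textbook formulation of CCA can be written as follows. The notation is ours, not the report's: X and Y denote the two views, and \Sigma_{XX}, \Sigma_{YY}, \Sigma_{XY} their covariance and cross-covariance matrices. The first pair of canonical directions maximizes the correlation between the projected views:

```latex
\[
(a_1, b_1) \;=\; \arg\max_{a,\,b}\;
\frac{a^{\top} \Sigma_{XY}\, b}
     {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}}
\]
% Assuming \Sigma_{XX} and \Sigma_{YY} are invertible, the maximizers are
% obtained from the singular value decomposition
\[
\Sigma_{XX}^{-1/2}\, \Sigma_{XY}\, \Sigma_{YY}^{-1/2} \;=\; U\, D\, V^{\top},
\qquad
a_i = \Sigma_{XX}^{-1/2} u_i, \quad
b_i = \Sigma_{YY}^{-1/2} v_i ,
\]
% where the diagonal entries of D are the canonical correlations.
```

Taking both views to be the same matrix makes every canonical correlation equal to 1, so CCA is informative only when the two views differ, as the past and future contexts of a word do.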
Extracting information relevant to a given task in a semi-supervised or unsupervised manner is one of the fundamental modern challenges in machine learning. In many machine learning applications, it is easier to collect raw, unlabeled data than labeled data, because obtaining the latter is expensive: it often involves procedures such as human hand labeling, clinical trials, and costly experiments. How unlabeled data can be used to improve performance and reduce the burden of collecting labeled data has received increasing attention in recent years. The "multi-view" approach has emerged as a paradigm for semi-supervised learning.

At the center of our MVL approach lies CCA, which computes the directions of maximal correlation between a pair of matrices. These matrices, in our case, represent smoothed versions of the training text (a large excerpt from a corpus such as Wikipedia or Project Gutenberg), where the training text is smoothed to capture structural patterns in the language from the words both before (past) and after (future) a given word in the text. Hence, by employing CCA on these two "views" (past and future smoothes), we obtain a latent structure that is common to both, which we represent as a state-space vector. Then we use linear regression to predict part-of-speech labels from the state-space vector.

The advantages of using MVL with CCA become apparent when it comes to the problem of scaling. Since CCA and linear regression are both linear, they allow us to work with samples many times larger than those that can be handled by non-linear methods. The fact that MVL with CCA scales well to large data sets outweighs the disadvantage that certain non-linear methods may provide more accurate results for certain problems. Also, implementing CCA is rather straightforward compared to the complex coding required by some non-linear methods.
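The past/future pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the system built for this project. The exponential smoothing scheme and its `alpha` parameter, the ridge term, the choice to concatenate both projections into the state vector, the synthetic toy corpus, and all function names are hypothetical.

```python
import numpy as np


def smooth_views(onehot, alpha=0.5):
    """Exponentially smoothed past and future context for each token.

    The exponential form and `alpha` are assumptions for illustration.
    """
    n, v = onehot.shape
    past = np.zeros((n, v))
    future = np.zeros((n, v))
    for t in range(1, n):
        past[t] = alpha * past[t - 1] + (1 - alpha) * onehot[t - 1]
    for t in range(n - 2, -1, -1):
        future[t] = alpha * future[t + 1] + (1 - alpha) * onehot[t + 1]
    return past, future


def inv_sqrt(cov, ridge):
    """Inverse matrix square root of a ridge-regularized covariance."""
    vals, vecs = np.linalg.eigh(cov + ridge * np.eye(cov.shape[0]))
    return (vecs / np.sqrt(vals)) @ vecs.T


def cca(X, Y, k, ridge=1e-3):
    """Top-k canonical directions for views X and Y (rows are samples)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Wx = inv_sqrt(Xc.T @ Xc / n, ridge)
    Wy = inv_sqrt(Yc.T @ Yc / n, ridge)
    # SVD of the whitened cross-covariance gives the canonical directions.
    U, s, Vt = np.linalg.svd(Wx @ (Xc.T @ Yc / n) @ Wy)
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]


def fit_linear(states, labels_onehot):
    """Least-squares map from state vectors (plus a bias) to label scores."""
    design = np.hstack([states, np.ones((states.shape[0], 1))])
    W, *_ = np.linalg.lstsq(design, labels_onehot, rcond=None)
    return W


def predict(states, W):
    """Predicted label index: the highest-scoring column per row."""
    design = np.hstack([states, np.ones((states.shape[0], 1))])
    return (design @ W).argmax(axis=1)


# Synthetic toy corpus (not real text): tags follow a noisy cycle,
# and each word id depends on its tag.
rng = np.random.default_rng(0)
n_tokens, vocab, n_tags, k = 2000, 8, 3, 4
tags = np.zeros(n_tokens, dtype=int)
for t in range(1, n_tokens):
    step = 1 if rng.random() < 0.8 else rng.integers(0, n_tags)
    tags[t] = (tags[t - 1] + step) % n_tags
words = tags * 2 + rng.integers(0, 3, size=n_tokens)
onehot = np.eye(vocab)[words]

past, future = smooth_views(onehot)
A, B, corrs = cca(past, future, k)
states = np.hstack([(past - past.mean(0)) @ A, (future - future.mean(0)) @ B])
W = fit_linear(states, np.eye(n_tags)[tags])
pred = predict(states, W)
print("canonical correlations:", np.round(corrs, 3))
print("training accuracy:", (pred == tags).mean())
```

Every step is a matrix product, an eigendecomposition, or a least-squares solve on matrices whose size depends on the vocabulary and state dimension, not on the number of tokens. Only the covariance accumulation and the smoothing pass touch every token, which is the scaling argument made above.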
The fact that MVL can be used to address problems that are currently modeled by HMMs is yet another advantage of multi-view learning. MVL is optimal under the assumptions of linearity and the Markov property, namely that given the present, the future does not depend on the past. This is precisely the standard Markovian assumption used in HMMs. Hence MVL has the potential to be an efficient alternative for problems currently modeled by HMMs.

To examine the MVL approach with CCA followed by linear regression, we have implemented a system that can be trained on a large input text and then used to predict part-of-speech labels for a small sample text. In this paper we discuss the following:

i. Overview of the implemented MVL system.
ii. Specifics of training the CCA model and the calculations that yield a prediction.
iii. Evaluation (error calculation) of the efficiency and performance of the MVL with CCA method.

1.1 Markov Property

A stochastic process {Xn} (sequence of random variables) on the state space (S, S ) - where S is the set of states and S

