Lecture 17: Supervised Learning Recap
Machine Learning, April 6, 2010
Last Time
• Support Vector Machines
• Kernel Methods
Today
• Short recap of Kernel Methods
• Review of Supervised Learning
• Unsupervised Learning
  – (Soft) K-means clustering
  – Expectation Maximization
  – Spectral Clustering
  – Principal Components Analysis
  – Latent Semantic Analysis
Kernel Methods
• Feature extraction to higher dimensional spaces.
• Kernels describe the relationship between vectors (points) rather than the new feature space directly.
When can we use kernels?
• Any time training and evaluation are both based on the dot product between two points.
• SVMs
• Perceptron
• k-nearest neighbors
• k-means
• etc.
Kernels in SVMs
• Optimize the αi’s and bias w.r.t. the kernel
• Decision function: f(x) = sign(Σi αi yi K(xi, x) + b)
Kernels in Perceptrons
• Training: on a mistake on xi, increment its dual weight: αi ← αi + 1
• Decision function: f(x) = sign(Σj αj yj K(xj, x))
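As an illustration (not from the slides), here is a minimal kernelized perceptron sketch in NumPy: training stores a per-example dual weight αi rather than an explicit weight vector, and both training and prediction touch the data only through the kernel. The RBF kernel and the toy data are my own choices.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def train_kernel_perceptron(X, y, kernel, epochs=20):
    # alpha[i] counts mistakes made on example i; the weight vector
    # w = sum_i alpha[i] * y[i] * phi(x[i]) is never formed explicitly.
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            s = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
            if y[i] * s <= 0:          # mistake: update the dual weight
                alpha[i] += 1
    return alpha

def predict(X_train, y, alpha, kernel, x):
    s = sum(alpha[j] * y[j] * kernel(X_train[j], x) for j in range(len(X_train)))
    return 1 if s >= 0 else -1

# Toy 1-D data: two separable groups.
X = np.array([[-2.0], [-1.5], [1.5], [2.0]])
y = np.array([-1, -1, 1, 1])
alpha = train_kernel_perceptron(X, y, rbf_kernel)
print(predict(X, y, alpha, rbf_kernel, np.array([1.8])))   # 1
```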
Good and Valid Kernels
• Good: computing K(xi,xj) is cheaper than computing ϕ(xi) explicitly
• Valid:
  – Symmetric: K(xi,xj) = K(xj,xi)
  – Decomposable into ϕ(xi)Tϕ(xj)
• Positive Semi Definite Gram Matrix
• Popular Kernels
  – Linear, Polynomial
  – Radial Basis Function
  – String (technically infinite dimensions)
  – Graph
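One way to sanity-check validity numerically (a sketch of my own, assuming NumPy; the RBF kernel and random data are illustrative) is to build the Gram matrix and confirm it is symmetric with non-negative eigenvalues, i.e., positive semi-definite:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    # Gram matrix G[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
G = rbf_gram(X)

# Symmetric, and eigenvalues >= 0 up to floating-point roundoff.
print(np.allclose(G, G.T))                  # True
print(np.linalg.eigvalsh(G).min() > -1e-10)  # True
```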
Supervised Learning
• Linear Regression
• Logistic Regression
• Graphical Models
  – Hidden Markov Models
• Neural Networks
• Support Vector Machines
  – Kernel Methods
Major concepts
• Gaussian, Multinomial, Bernoulli Distributions
• Joint vs. Conditional Distributions
• Marginalization
• Maximum Likelihood
• Risk Minimization
• Gradient Descent
• Feature Extraction, Kernel Methods
Some favorite distributions
• Bernoulli
• Multinomial
• Gaussian
Maximum Likelihood
• Identify the parameter values that yield the maximum likelihood of generating the observed data.
• Take the partial derivative of the likelihood function
• Set to zero
• Solve
• NB: maximum likelihood parameters are the same as maximum log likelihood parameters
Maximum Log Likelihood
• Why do we like the log function?
• It turns products (difficult to differentiate) into sums (easy to differentiate)
  – log(xy) = log(x) + log(y)
  – log(x^c) = c log(x)
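To make the recipe concrete, a small sketch of my own (assuming NumPy): differentiating the Gaussian log likelihood and setting the derivatives to zero yields the sample mean and the biased sample variance, which we can check against simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)  # true mu=3, sigma^2=4

# Setting d/d(mu) and d/d(sigma^2) of the log likelihood to zero gives
# closed-form maximum likelihood estimates:
mu_hat = data.sum() / len(data)                      # sample mean
var_hat = ((data - mu_hat) ** 2).sum() / len(data)   # biased sample variance

print(mu_hat, var_hat)  # close to the true values 3.0 and 4.0
```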
Risk Minimization
• Pick a loss function
  – Squared loss
  – Linear loss
  – Perceptron (classification) loss
• Identify the parameters that minimize the loss function.
  – Take the partial derivative of the loss function
  – Set to zero
  – Solve
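When the "set to zero and solve" step has no usable closed form, gradient descent can minimize the loss instead. A minimal sketch (my own example, assuming NumPy) for squared loss on linear regression:

```python
import numpy as np

def fit_gd(X, y, lr=0.1, steps=500):
    # Minimize squared loss L(w) = ||Xw - y||^2 / (2n) by gradient descent.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n   # partial derivative of the loss
        w -= lr * grad
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias + feature
y = np.array([1.0, 3.0, 5.0, 7.0])   # y = 1 + 2x, no noise
w = fit_gd(X, y)
print(np.round(w, 2))  # approximately [1. 2.]
```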
Frequentists v. Bayesians
• Point estimates vs. Posteriors
• Risk Minimization vs. Maximum Likelihood
• L2-Regularization
  – Frequentists: Add a constraint on the size of the weight vector
  – Bayesians: Introduce a zero-mean prior on the weight vector
  – Result is the same!
L2-Regularization
• Frequentists:
  – Introduce a cost on the size of the weights
• Bayesians:
  – Introduce a prior on the weights
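A small numerical sketch of the equivalence (my own example, assuming NumPy): the penalized least-squares objective and the zero-mean-Gaussian-prior MAP estimate lead to the same normal equations, so the two solutions coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 1.0  # penalty strength; plays the role of 1 / prior variance

# Frequentist view: minimize ||Xw - y||^2 + lam * ||w||^2.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian view: MAP estimate under a zero-mean Gaussian prior on w
# yields exactly the same normal equations, hence the same solution.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(w_ridge, w_map))  # True
```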
Types of Classifiers
• Generative Models
  – Highest resource requirements
  – Need to approximate the joint probability
• Discriminative Models
  – Moderate resource requirements
  – Typically fewer parameters to approximate than generative models
• Discriminant Functions
  – Can be trained probabilistically, but the output does not include confidence information
Linear Regression
• Fit a line to a set of points
Linear Regression
• Extension to higher dimensions
  – Polynomial fitting
  – Arbitrary function fitting
    • Wavelets
    • Radial basis functions
    • Classifier output
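A sketch of polynomial fitting as linear regression in an extended feature space (my own example, assuming NumPy): mapping a scalar x to [1, x, x², …] lets ordinary least squares fit a polynomial.

```python
import numpy as np

def poly_features(x, degree):
    # Map each scalar x to [1, x, x^2, ..., x^degree]; linear regression
    # in this feature space is polynomial regression in the original space.
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(-1, 1, 20)
y = 2 - x + 3 * x ** 2          # quadratic, no noise
Phi = poly_features(x, 2)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(np.round(w, 2))  # approximately [ 2. -1.  3.]
```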
Logistic Regression
• Fit Gaussians to the data for each class
• The decision boundary is where the PDFs cross
• Setting the gradient to zero has no “closed form” solution
• Gradient descent is used instead
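A minimal gradient-descent sketch for logistic regression (my own example, assuming NumPy; the toy data and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    # No closed-form solution for w, so follow the gradient of the
    # (negative) log likelihood: grad = X^T (sigmoid(Xw) - y) / n.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])  # bias + feature
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
preds = (sigmoid(X @ w) > 0.5).astype(int)
print(preds)  # [0 0 1 1]
```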
Graphical Models
• General way to describe the dependence relationships between variables.
• Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.
Junction Tree Algorithm
• Moralization
  – “Marry the parents”
  – Make undirected
• Triangulation
  – Eliminate chordless cycles of length ≥ 4
• Junction Tree Construction
  – Identify separators such that the running intersection property holds
• Introduction of Evidence
  – Pass slices around the junction tree to generate marginals
Hidden Markov Models
• Sequential Modeling– Generative Model
• Relationship between observations and state (class) sequences
Perceptron
• Step function used for squashing.
• Classifier-as-neuron metaphor.
Perceptron Loss
• Classification Error vs. Sigmoid Error
  – Loss is only calculated on mistakes
  – Perceptrons use strictly classification error
Neural Networks
• Interconnected Layers of Perceptrons or Logistic Regression “neurons”
Neural Networks
• There are many possible configurations of neural networks– Vary the number of layers– Size of layers
Support Vector Machines
• Maximum Margin Classification
[Figure: small-margin vs. large-margin separating hyperplanes]
Support Vector Machines
• Optimization Function
• Decision Function
Visualization of Support Vectors
Questions?
• Now would be a good time to ask questions about Supervised Techniques.
Clustering
• Identify discrete groups of similar data points
• Data points are unlabeled
Recall K-Means
• Algorithm
  – Select K, the desired number of clusters
  – Initialize K cluster centroids
  – For each point in the data set, assign it to the cluster with the closest centroid
  – Update the centroids based on the points assigned to each cluster
  – If any data point has changed clusters, repeat
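The steps above can be sketched directly (my own implementation, assuming NumPy; initializing centroids by sampling data points is one common choice):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each centroid to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # no assignment changed: converged
            break
        centroids = new
    return centroids, labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
centroids, labels = kmeans(X, 2)
print(np.round(np.sort(centroids[:, 0])))  # near [ 0. 10.]
```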
k-means output
Soft K-means
• In k-means, we force every data point to exist in exactly one cluster.
• This constraint can be relaxed.
Minimizes the entropy of cluster assignment
Soft k-means example
Soft k-means
• We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points
• Convergence is based on a stopping threshold rather than changed assignments
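A sketch of soft k-means under these two changes (my own formulation, assuming NumPy; the softmax-over-negative-distances responsibilities and the stiffness parameter beta are one common choice):

```python
import numpy as np

def soft_kmeans(X, k, beta=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Soft assignment: responsibility of cluster j for point i is a
        # softmax over negative squared distances (beta sets the stiffness).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        r = np.exp(-beta * d2)
        r /= r.sum(axis=1, keepdims=True)
        # Each centroid is the responsibility-weighted mean of ALL points.
        new = (r.T @ X) / r.sum(axis=0)[:, None]
        if np.abs(new - centroids).max() < 1e-6:   # stopping threshold
            break
        centroids = new
    return centroids, r

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.3, size=(15, 1)),
               rng.normal(5.0, 0.3, size=(15, 1))])
centroids, r = soft_kmeans(X, 2)
print(np.round(np.sort(centroids[:, 0])))  # near [0. 5.]
```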
Gaussian Mixture Models
• Rather than identifying clusters by “nearest” centroids, fit a set of k Gaussians to the data.
GMM example
Gaussian Mixture Models
• Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution π:
  p(x) = Σk πk p(x | θk),  with πk ≥ 0 and Σk πk = 1
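For a one-dimensional Gaussian mixture this reads p(x) = Σk πk N(x | μk, σk²). A small sketch evaluating that density (my own example, assuming NumPy):

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    # Mixture density: p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2),
    # where the mixing weights pi_k are non-negative and sum to one.
    comps = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return float(np.dot(weights, comps))

weights = np.array([0.3, 0.7])
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])

p = gmm_pdf(2.0, weights, means, stds)
print(round(p, 4))  # 0.054 (both components are equidistant from x=2)
```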
Graphical Models with unobserved variables
• What if you have variables in a Graphical model that are never observed?– Latent Variables
• Training latent variable models is an unsupervised learning application
[Figure: example with observed behaviors (laughing, sweating) and latent states (amused, uncomfortable)]
Latent Variable HMMs
• We can cluster sequences using an HMM with unobserved state variables
• We will train the latent variable models using Expectation Maximization
Expectation Maximization
• The training of both GMMs and graphical models with latent variables is accomplished using Expectation Maximization
  – Step 1: Expectation (E-step)
    • Evaluate the “responsibilities” of each cluster with the current parameters
  – Step 2: Maximization (M-step)
    • Re-estimate parameters using the existing “responsibilities”
• Related to k-means
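A compact E-step/M-step sketch for a two-component one-dimensional GMM (my own implementation, assuming NumPy; initializing the means at the data extremes is a simplification of mine):

```python
import numpy as np

def em_gmm_1d(x, iters=100):
    # Initialize a 2-component mixture; means start at the data extremes.
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])
pi, mu, var = em_gmm_1d(x)
print(np.round(np.sort(mu)))  # near [0. 6.]
```

Note how the E-step is a soft cluster assignment and the M-step a weighted parameter update, which is the sense in which EM is related to (soft) k-means.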
Questions
• One more time for questions on supervised learning…
Next Time
• Gaussian Mixture Models (GMMs)• Expectation Maximization