Machine Learning
Week 1, Lecture 2
Recap: Supervised Learning
[Figure: the supervised learning setup. An unknown target f generates a data set; a learning algorithm picks a hypothesis h from the hypothesis set such that h(x) ≈ f(x). Example data: handwritten digits labeled 5 0 4 1 9 2 1 3 1 4]
[Figure: a separating hyperplane with normal vector w, dividing space into the halfspace where $w^T x > 0$ and the halfspace where $w^T x < 0$]
Classification and Regression
Learning a linear classifier that minimizes errors is NP-hard in general. Assume the data is linearly separable! The perceptron then finds a separating hyperplane; this case is convex.
Today
• Convex optimization
  – Convex sets
  – Convex functions
• Logistic regression
  – Maximum likelihood
  – Gradient descent
• Maximum likelihood and linear regression
Convex Optimization
Optimization problems are in general very hard (if solvable at all)!
For convex optimization problems, theoretical (polynomial-time) and practical solutions exist (most of the time).
Example:
Convex Sets
[Figure: a convex set and a non-convex set]
The line segment from x to y must also be in the set: C is convex if $\theta x + (1 - \theta) y \in C$ for all $x, y \in C$ and $\theta \in [0, 1]$.
Convex Sets
The union of convex sets may not be convex.
The intersection of convex sets is convex.
Convex Functions
[Figure: the chord between (x, f(x)) and (y, f(y)) lies above the graph of f]
f is convex if $f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$ for all x, y and $\theta \in [0, 1]$.
f is concave if −f is convex.
Concave? Convex? Both?
Differentiable Convex Functions
[Figure: the tangent line at (x, f(x)) lies below the graph of f]
A differentiable f is convex if and only if $f(y) \ge f(x) + f'(x)(y - x)$ for all x, y.
Example: for $f(x) = x^2$, $f(y) - f(x) - f'(x)(y - x) = y^2 - x^2 - 2x(y - x) = (y - x)^2 \ge 0$, so the first-order condition holds.
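The same condition can be sanity-checked numerically; a minimal sketch for another convex function, f(x) = exp(x) (our own choice of example, not from the slides):

  % Check f(y) >= f(x) + f'(x)(y - x) on a grid for f(x) = exp(x).
  [Xg, Yg] = meshgrid(-2:0.25:2, -2:0.25:2);
  lhs = exp(Yg);                         % f(y)
  rhs = exp(Xg) + exp(Xg) .* (Yg - Xg);  % f(x) + f'(x)(y - x)
  all(lhs(:) >= rhs(:) - 1e-12)          % prints 1: tangent lies below f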
Twice Differentiable Convex Functions
f is convex if and only if the Hessian $\nabla^2 f(x)$ is positive semidefinite for all x.
A real symmetric matrix A is positive semidefinite if $x^T A x \ge 0$ for all nonzero x.
1D: f is convex if and only if $f''(x) \ge 0$ for all x.
Simple 2D Example
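As a minimal stand-in sketch, using our own choice of quadratic $f(x) = x_1^2 + 3x_2^2$ (not from the slides), the Hessian condition can be checked via eigenvalues:

  H = [2 0; 0 6];             % Hessian of f(x) = x1^2 + 3*x2^2
  lambda = eig((H + H')/2);   % symmetrize to guard against round-off
  all(lambda >= -1e-10)       % prints 1: H is positive semidefinite, f is convex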
More Examples
Quadratic functions: $f(x) = \frac{1}{2} x^T A x + b^T x + c$, with Hessian A. Convex if A is positive semidefinite.
Affine functions: $f(x) = b^T x + c$. The Hessian is 0, so affine functions are both convex and concave.
Convexity of Linear Regression
The least-squares cost is quadratic: $E(w) = \|Xw - y\|^2 = w^T X^T X w - 2 y^T X w + y^T y$, so the Hessian is $2 X^T X$.
Convex if $X^T X$ is positive semidefinite.
Real and symmetric: clearly. And for any v, $v^T X^T X v = \|Xv\|^2 \ge 0$, so linear regression is convex.
Epigraph: the connection between convex sets and convex functions.
$\mathrm{epi}(f) = \{(x, t) : t \ge f(x)\}$; f is convex if and only if epi(f) is a convex set.
Sublevel sets of a convex function.
Define the α-sublevel set: $C_\alpha = \{x : f(x) \le \alpha\}$.
If f is convex, then every $C_\alpha$ is convex.
Convex Optimization
Standard form: minimize $f(x)$ subject to $g_i(x) \le 0$ and $h_j(x) = 0$, where f and the $g_i$ are convex and the $h_j$ are affine.
Local Minima are Global Minima
Sketch: if x were a local minimum with f(y) < f(x) somewhere, then by convexity the points on the segment from x to y arbitrarily close to x would have smaller f-value, a contradiction.
Examples of Convex Optimization
• Linear programming: minimize $c^T x$ subject to $Ax \le b$.
• Quadratic programming: minimize $\frac{1}{2} x^T P x + q^T x$ subject to $Ax \le b$ (P is positive semidefinite).
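A tiny runnable LP instance (our own numbers; linprog requires Matlab's Optimization Toolbox or Octave's optim package):

  % Minimize -x1 - x2 subject to x1 + 2*x2 <= 4 and 0 <= x1, x2 <= 3.
  f = [-1; -1];
  A = [1 2]; b = 4;
  lb = [0; 0]; ub = [3; 3];
  x = linprog(f, A, b, [], [], lb, ub)   % optimum at x = [3; 0.5]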
Summary
Rockafellar stated in his 1993 SIAM Review survey paper: “In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.”
Convex GOOD!!!!
Estimating Probabilities
• Probability of getting cancer given your situation.
• Probability that AGF wins against Viborg given the last 5 results.
• Probability that a loan is not paid back, as a function of credit worthiness.
• Probability of a student getting an A in Machine Learning given his grades.
The data consists of actual events, not probabilities, e.g. some students who failed and some who did not…
Breast Cancer: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
• 1. Sample code number: id number
• 2. Clump Thickness: 1 - 10
• 3. Uniformity of Cell Size: 1 - 10
• 4. Uniformity of Cell Shape: 1 - 10
• 5. Marginal Adhesion: 1 - 10
• 6. Single Epithelial Cell Size: 1 - 10
• 7. Bare Nuclei: 1 - 10
• 8. Bland Chromatin: 1 - 10
• 9. Normal Nucleoli: 1 - 10
• 10. Mitoses: 1 - 10
Input features: attributes 2-10. Class labels: • benign • malignant
Target Function
Predict the probability of benign and malignant for future patients.
Maximum Likelihood
A biased coin with bias θ = probability of heads. Flip it n times independently (Bernoulli trials) and count the number of heads, k.
Fix θ: what is the probability of seeing the data D? $P(D \mid \theta) = \theta^k (1 - \theta)^{n-k}$. Take logs: $\log P(D \mid \theta) = k \log \theta + (n - k) \log(1 - \theta)$.
After seeing the data, what can we infer? The likelihood of the data: $L(\theta) = P(D \mid \theta)$.
Maximum Likelihood
Maximize the likelihood; equivalently, minimize the negative log-likelihood of the data (log is monotone): $\mathrm{NLL}(\theta) = -k \log \theta - (n - k) \log(1 - \theta)$.
Compute the gradient and solve for 0: $-\frac{k}{\theta} + \frac{n - k}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{k}{n}$.
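A quick numeric sanity check of the closed form (the counts n and k are our own illustrative choices):

  n = 100; k = 37;                           % illustrative coin-flip counts
  theta = linspace(0.001, 0.999, 999);
  nll = -k*log(theta) - (n-k)*log(1-theta);  % negative log-likelihood
  [~, idx] = min(nll);
  theta(idx)                                 % approximately k/n = 0.37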
Bayesian Perspective
Bayes' rule: $P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$.
Want: the posterior $P(h \mid D)$. Need: a prior $P(h)$.
The numerator is likelihood × prior; the denominator $P(D)$ is a normalizing factor; the result is the posterior.
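A small discrete worked example of the pieces (the two hypotheses and the counts are our own):

  % Coin bias is either 0.5 or 0.8, with prior 0.5 each;
  % we observe k = 7 heads in n = 10 independent flips.
  thetas = [0.5 0.8]; prior = [0.5 0.5];
  n = 10; k = 7;
  lik = thetas.^k .* (1 - thetas).^(n - k);  % likelihood of the data
  post = (lik .* prior) / sum(lik .* prior)  % posterior, approx [0.37 0.63]; MAP: theta = 0.8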
Bayesian Perspective
• Compute the probability of each hypothesis.
• Pick the most likely and use it for predictions (MAP = maximum a posteriori).
• Or compute expected values (a weighted average over all hypotheses).
Logistic Regression
Assume independent data points and apply maximum likelihood (there is a Bayesian version too).
[Figure: hard threshold (sign function) vs. soft threshold (the logistic sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$)]
Logistic regression can be, and is, used for classification: predict the most likely y.
Maximum Likelihood Logistic Regression
The negative log-likelihood $\mathrm{NLL}(w) = -\sum_i \big[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log(1 - \sigma(w^T x_i)) \big]$ is convex.
But we cannot solve for zero analytically.
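For reference, the gradient that the descent methods below will use (the standard derivation, with $\sigma$ the logistic sigmoid):

$$\nabla \mathrm{NLL}(w) = \sum_{i} \big( \sigma(w^T x_i) - y_i \big)\, x_i$$

Setting this to zero has no closed-form solution, hence the iterative methods that follow.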
Descent Methods
Iteratively move toward a better solution of $\min_x f(x)$, where f is twice continuously differentiable.
• Pick a start point x
• Repeat until the stopping criterion is satisfied:
  – Compute a descent direction v
  – Line search: compute a step size t
  – Update: x = x + t·v
Gradient Descent
Use the negative gradient as the descent direction: $v = -\nabla f(x)$.
Line (Ray) Search
How to compute the step size t:
• Solve analytically (if possible): exact line search.
• Backtracking search: start high and decrease until an improving step is found [SL 9.2]; see the sketch below.
• Fix t to a small constant.
• Use the size of the gradient scaled by a small constant.
• Start with a constant and let it decrease slowly, or decrease it when it is too high.
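A sketch of the backtracking variant (the constants alpha and beta are conventional choices, not from the slides; cf. [SL 9.2]):

  function t = backtrack(f, x, g, v)
    % Backtracking line search: start at t = 1 and shrink the step
    % until a sufficient decrease in f is found.
    % f: function handle, g: gradient at x, v: descent direction (e.g. -g).
    alpha = 0.3; beta = 0.8; t = 1;
    while f(x + t*v) > f(x) + alpha * t * (g' * v)
      t = beta * t;
    end
  end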
Stopping Criteria
• The gradient becomes very small
• The maximum number of iterations is reached
Gradient Descent for Linear Reg.
GD for Linear Regression (Matlab style)

function theta = GD(X, y, theta)
  LR = 0.1;                                         % learning rate
  for i = 1:50
    cost = (1/length(y)) * sum((X*theta - y).^2);   % mean squared error (for monitoring)
    grad = (1/length(y)) * 2 .* X' * (X*theta - y); % gradient of the cost
    theta = theta - LR * grad;                      % gradient descent step
  end
end

Note: we do not scale the gradient to a unit vector.
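A toy usage sketch (the data is made up for illustration; whether 50 iterations at this fixed learning rate suffice depends on feature scaling):

  x = linspace(0, 1, 20)';
  y = 2*x + 1 + 0.1*randn(20, 1);  % noisy line, slope 2, intercept 1
  X = [ones(20, 1) x];             % prepend a column of ones for the intercept
  theta = GD(X, y, zeros(2, 1))    % approaches the least-squares fit, roughly [1; 2]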
[Figure: three plots showing the effect of different learning rates]
Gradient Descent Jumps Around
[Figure: iterates of gradient descent with exact line search, starting from (10, 1)]
Gradient Descent Running Time
• Running time = number of iterations × cost per iteration.
• The cost per iteration is usually not a problem.
• The number of iterations clearly depends on the choice of line search and stopping criterion.
  – Very problem- and data-specific.
  – Giving bounds requires a lot of math; we will not cover it in this course.
Gradient Descent For Logistic Regression
Hand-in 1! Along with a multiclass extension.
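A minimal sketch of the update rule (our own illustration; the learning rate and iteration count are arbitrary choices, and hand-in 1 develops this properly):

  function w = logreg_gd(X, y, w)
    % Gradient descent for logistic regression: minimize the average
    % negative log-likelihood. y has entries in {0, 1}.
    LR = 0.1;
    for i = 1:100
      p = 1 ./ (1 + exp(-X*w));         % sigmoid(X*w): predicted probabilities
      grad = X' * (p - y) / length(y);  % gradient of the average NLL
      w = w - LR * grad;                % descent step
    end
  end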
Stochastic Gradient Descent
Pick a single data point at random and use its gradient.
Mini-Batch Gradient Descent: use K points chosen at random.
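A sketch of mini-batch steps for the linear-regression cost above (the batch size K and learning rate are illustrative choices):

  function theta = minibatch_gd(X, y, theta)
    % Mini-batch gradient descent for the squared-error cost.
    LR = 0.01; K = 16;
    for i = 1:200
      idx = randperm(length(y), K);           % K points chosen at random
      Xb = X(idx, :); yb = y(idx);
      grad = (2/K) .* Xb' * (Xb*theta - yb);  % gradient on the mini-batch only
      theta = theta - LR * grad;
    end
  end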
Linear Classification with K classes
• Use logistic regression, one vs. all:
  – Train K classifiers, one for each class.
  – The input X is the same; y is 1 for all elements from that class and 0 otherwise.
  – Prediction: compute the probability under all K classifiers and output the class with the highest probability (see the sketch below).
• Use softmax regression:
  – An extension of the logistic function to K classes, in some sense.
  – Covered in hand-in 1.
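A sketch of the one-vs-all prediction step (W is our own name for a d-by-K matrix whose k-th column holds classifier k's weights):

  P = 1 ./ (1 + exp(-X*W));   % m-by-K: probability of each class from each classifier
  [~, pred] = max(P, [], 2);  % output the class with the highest probability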
Maximum Likelihood and Linear Regression (Time to spare slide)
Assume: $y_i = w^T x_i + \varepsilon_i$ with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, independently. Maximum likelihood then reduces to least squares.
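A short worked version of the connection (a standard derivation under the stated assumptions):

$$-\log P(y \mid X, w) = \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - w^T x_i)^2 + \frac{m}{2} \log(2\pi\sigma^2)$$

The second term does not depend on w, so maximizing the likelihood is exactly minimizing the squared error, i.e. linear regression.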
Today's Summary
• Convex optimization
  – Many definitions
  – Local optima are global optima
  – Usually theoretically and practically feasible
• Maximum likelihood
  – Use the likelihood $P(D \mid h)$ as a proxy for $P(h \mid D)$
  – Assume independent data
• Gradient descent
  – Minimizes a function
  – Iteratively finds a better solution by local steps based on the gradient
  – A first-order method (uses the gradient)
  – Other methods exist, e.g. second-order methods (use the Hessian)