Supervised Learning (Part II)
Weinan Zhang
Shanghai Jiao Tong University
http://wnzhang.net
2019 EE448, Big Data Mining, Lecture 5
http://wnzhang.net/teaching/ee448/index.html
Content of Supervised Learning
• Introduction to Machine Learning
• Linear Models
• Support Vector Machines
• Neural Networks
• Tree Models
• Ensemble Methods
Content of This Lecture
• Support Vector Machines
• Neural Networks
Linear Classification
• For linearly separable cases, there are multiple valid decision boundaries
• Some separators can be ruled out by considering data noise
• The intuitively optimal decision boundary: the one with the largest margin
Review: Logistic Regression
• Logistic regression is a binary classification model

$$p_\theta(y=1\mid x) = \sigma(\theta^\top x) = \frac{1}{1+e^{-\theta^\top x}}, \qquad p_\theta(y=0\mid x) = \frac{e^{-\theta^\top x}}{1+e^{-\theta^\top x}}$$

• Cross-entropy loss function:
$$L(y, x; p_\theta) = -y\log\sigma(\theta^\top x) - (1-y)\log\big(1-\sigma(\theta^\top x)\big)$$

• Gradient, using $\frac{\partial\sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$ with $z = \theta^\top x$:
$$\frac{\partial L(y,x;p_\theta)}{\partial\theta} = -y\,\frac{1}{\sigma(\theta^\top x)}\,\sigma(z)(1-\sigma(z))\,x - (1-y)\,\frac{-1}{1-\sigma(\theta^\top x)}\,\sigma(z)(1-\sigma(z))\,x = \big(\sigma(\theta^\top x) - y\big)x$$

• Update rule:
$$\theta \leftarrow \theta + \eta\big(y - \sigma(\theta^\top x)\big)x$$
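The gradient update above can be sketched in NumPy; this is a minimal illustration, where the toy dataset, learning rate, and epoch count are arbitrary choices for the example rather than values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.5, epochs=200, seed=0):
    """Stochastic updates: theta <- theta + lr * (y - sigmoid(theta^T x)) * x."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            theta += lr * (y[i] - sigmoid(theta @ X[i])) * X[i]
    return theta

# Toy separable data; first column is a constant 1 acting as the bias feature
X = np.array([[1., 2., 2.], [1., 1., 1.5], [1., -2., -1.], [1., -1., -2.]])
y = np.array([1., 1., 0., 0.])
theta = sgd_logistic_regression(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
```

On this linearly separable toy set, the learned parameters classify all four points correctly.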
Label Decision
• Logistic regression provides the probability
$$p_\theta(y=1\mid x) = \sigma(\theta^\top x) = \frac{1}{1+e^{-\theta^\top x}}, \qquad p_\theta(y=0\mid x) = \frac{e^{-\theta^\top x}}{1+e^{-\theta^\top x}}$$
• The final label of an instance is decided by setting a threshold $h$:
$$\hat y = \begin{cases} 1, & p_\theta(y=1\mid x) > h \\ 0, & \text{otherwise} \end{cases}$$
Logistic Regression Scores
$$p_\theta(y=1\mid x) = \frac{1}{1+e^{-s(x)}}, \qquad s(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
• Score of each point, e.g. $s(x^{(A)}) = \theta_0 + \theta_1 x_1^{(A)} + \theta_2 x_2^{(A)}$ (similarly for points B and C)
• Decision boundary: $s(x) = 0$
• The higher the score, the larger the distance to the decision boundary, and the higher the confidence
[Figure: decision boundary in the $(x_1, x_2)$ plane with example points A, B, C]
Example from Andrew Ng
Linear Classification
• The intuitively optimal decision boundary: the one with the highest confidence
Notations for SVMs
• Feature vector: $x$
• Class label: $y \in \{-1, 1\}$
• Parameters:
  • Intercept: $b$
  • Feature weight vector: $w$
• Label prediction: $h_{w,b}(x) = g(w^\top x + b)$, where
$$g(z) = \begin{cases} +1 & z \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
Logistic Regression Scores
$$p_\theta(y=1\mid x) = \frac{1}{1+e^{-s(x)}}, \qquad s(x) = b + w_1 x_1 + w_2 x_2$$
• Score of each point, e.g. $s(x^{(A)}) = b + w_1 x_1^{(A)} + w_2 x_2^{(A)}$ (similarly for points B and C)
• Decision boundary: $s(x) = 0$
• The higher the score, the larger the distance to the separating hyperplane, and the higher the confidence
[Figure: separating hyperplane in the $(x_1, x_2)$ plane]
Example from Andrew Ng
Margins
• Functional margin:
$$\gamma^{(i)} = y^{(i)}\big(w^\top x^{(i)} + b\big)$$
• Note that the separating hyperplane won't change with the magnitude of $(w, b)$:
$$g(w^\top x + b) = g(2w^\top x + 2b)$$
• Geometric margin:
$$\gamma^{(i)} = y^{(i)}\big(w^\top x^{(i)} + b\big) \quad \text{where } \|w\|_2 = 1$$
[Figure: margin $\gamma^{(B)}$ of point B to the decision boundary in the $(x_1, x_2)$ plane]
Margins
• Decision boundary: projecting $x^{(i)}$ onto the boundary along the direction $w/\|w\|$ gives
$$w^\top\Big(x^{(i)} - \gamma^{(i)} y^{(i)} \frac{w}{\|w\|}\Big) + b = 0$$
$$\Rightarrow\; \gamma^{(i)} = y^{(i)}\,\frac{w^\top x^{(i)} + b}{\|w\|} = y^{(i)}\bigg(\Big(\frac{w}{\|w\|}\Big)^\top x^{(i)} + \frac{b}{\|w\|}\bigg)$$
• Given a training set $S = \{(x^{(i)}, y^{(i)})\}_{i=1\ldots m}$, the smallest geometric margin is
$$\gamma = \min_{i=1\ldots m} \gamma^{(i)}$$
Objective of an SVM
• Find a separating hyperplane that maximizes the minimum geometric margin:
$$\max_{\gamma, w, b}\;\; \gamma \qquad \text{s.t.}\;\; y^{(i)}(w^\top x^{(i)} + b) \ge \gamma,\; i = 1,\ldots,m; \quad \|w\| = 1$$
(non-convex constraint $\|w\| = 1$)
• Equivalent to the normalized functional margin formulation:
$$\max_{\gamma, w, b}\;\; \frac{\gamma}{\|w\|} \qquad \text{s.t.}\;\; y^{(i)}(w^\top x^{(i)} + b) \ge \gamma,\; i = 1,\ldots,m$$
(non-convex objective)
Objective of an SVM
• The functional margin scales with $(w, b)$ without changing the decision boundary, so let's fix the functional margin at $\gamma = 1$.
• The objective is then written as
$$\max_{w,b}\;\; \frac{1}{\|w\|} \qquad \text{s.t.}\;\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i = 1,\ldots,m$$
• Equivalent to
$$\min_{w,b}\;\; \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\;\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i = 1,\ldots,m$$
This optimization problem can be efficiently solved by quadratic programming.
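As a sanity check, the hard-margin problem above can be handed to a generic constrained optimizer. This is only a sketch on a made-up 2-D dataset, using SciPy's SLSQP solver rather than a dedicated QP package; the data and tolerances are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2-D dataset (illustrative)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    w = p[:2]                  # p = (w1, w2, b)
    return 0.5 * w @ w         # (1/2) ||w||^2

# One inequality constraint per example: y_i (w^T x_i + b) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)      # every functional margin should be >= 1
```

A dedicated QP or SMO solver is far more efficient at scale; this merely shows that the formulation is a standard inequality-constrained quadratic program.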
A Digression of Lagrange Duality in Convex Optimization
Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
Lagrangian for Convex Optimization
• A convex optimization problem (with equality constraints):
$$\min_w\; f(w) \qquad \text{s.t.}\;\; h_i(w) = 0,\; i = 1,\ldots,l$$
• The Lagrangian of this problem is defined as
$$\mathcal{L}(w, \beta) = f(w) + \sum_{i=1}^{l} \beta_i h_i(w)$$
where the $\beta_i$ are the Lagrange multipliers.
• Solving
$$\frac{\partial \mathcal{L}(w,\beta)}{\partial w} = 0, \qquad \frac{\partial \mathcal{L}(w,\beta)}{\partial \beta} = 0$$
yields the solution of the original optimization problem.
Lagrangian for Convex Optimization
[Figure: level sets $f(w)=3, 4, 5$ in the $(w_1, w_2)$ plane and the constraint curve $h(w)=0$]
$$\mathcal{L}(w, \beta) = f(w) + \beta h(w)$$
$$\frac{\partial \mathcal{L}(w,\beta)}{\partial w} = \frac{\partial f(w)}{\partial w} + \beta\,\frac{\partial h(w)}{\partial w} = 0$$
i.e., at the optimum the two gradients point in the same direction.
With Inequality Constraints
• A convex optimization problem:
$$\min_w\; f(w) \qquad \text{s.t.}\;\; g_i(w) \le 0,\; i = 1,\ldots,k; \quad h_i(w) = 0,\; i = 1,\ldots,l$$
• The Lagrangian of this problem is defined as
$$\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{l}\beta_i h_i(w)$$
where $\alpha_i$, $\beta_i$ are the Lagrange multipliers.
Primal Problem
• For the convex optimization problem above, with Lagrangian
$$\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k}\alpha_i g_i(w) + \sum_{i=1}^{l}\beta_i h_i(w)$$
define the primal function
$$\theta_P(w) = \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$$
• If a given $w$ violates any constraint, i.e., $g_i(w) > 0$ or $h_i(w) \ne 0$ for some $i$, then $\theta_P(w) = +\infty$
Primal Problem
• Conversely, if all constraints are satisfied for $w$, then $\theta_P(w) = f(w)$. Hence
$$\theta_P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ +\infty & \text{otherwise} \end{cases}$$
Primal Problem
• The minimization problem
$$\min_w \theta_P(w) = \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta)$$
is therefore the same as the original problem
$$\min_w\; f(w) \qquad \text{s.t.}\;\; g_i(w) \le 0,\; i = 1,\ldots,k; \quad h_i(w) = 0,\; i = 1,\ldots,l$$
• Define the value of the primal problem: $p^* = \min_w \theta_P(w)$
Dual Problem
• A slightly different problem: exchange min and max in the primal problem
$$\min_w \theta_P(w) = \min_w \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) \quad \text{(primal)}$$
• Define the value of the dual optimization problem:
$$d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0} \min_w \mathcal{L}(w, \alpha, \beta)$$
Primal Problem vs. Dual Problem
• Weak duality always holds:
$$d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0}\min_w \mathcal{L}(w, \alpha, \beta) \;\le\; \min_w\max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w, \alpha, \beta) = p^*$$
• Proof:
$$\min_w \mathcal{L}(w, \alpha, \beta) \le \mathcal{L}(w, \alpha, \beta), \quad \forall w,\; \alpha \ge 0,\; \beta$$
$$\Rightarrow\; \max_{\alpha,\beta:\,\alpha \ge 0}\min_w \mathcal{L}(w, \alpha, \beta) \le \max_{\alpha,\beta:\,\alpha \ge 0} \mathcal{L}(w, \alpha, \beta), \quad \forall w$$
$$\Rightarrow\; \max_{\alpha,\beta:\,\alpha \ge 0}\min_w \mathcal{L}(w, \alpha, \beta) \le \min_w\max_{\alpha,\beta:\,\alpha \ge 0} \mathcal{L}(w, \alpha, \beta)$$
• But under certain conditions, $d^* = p^*$
Karush-Kuhn-Tucker (KKT) Conditions
• If $f$ and the $g_i$'s are convex, the $h_i$'s are affine, and the $g_i$'s are all strictly feasible, then there must exist $w^*$, $\alpha^*$, $\beta^*$ such that
  • $w^*$ is the solution of the primal problem
  • $\alpha^*$, $\beta^*$ are the solutions of the dual problem
  • and the values of the two problems are equal
• Such $w^*$, $\alpha^*$, $\beta^*$ satisfy the KKT conditions, including the KKT dual complementarity condition $\alpha_i^* g_i(w^*) = 0$
• Moreover, if some $w^*$, $\alpha^*$, $\beta^*$ satisfy the KKT conditions, then they are also solutions to the primal and dual problems
• For more details, please refer to Boyd and Vandenberghe, "Convex Optimization", 2004
Now Back to SVM Problem
Objective of an SVM
• SVM objective: finding the optimal margin classifier
$$\min_{w,b}\;\; \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\;\; y^{(i)}(w^\top x^{(i)} + b) \ge 1,\; i = 1,\ldots,m$$
• Rewrite the constraints as
$$g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0$$
so as to match the standard optimization form
$$\min_w\; f(w) \qquad \text{s.t.}\;\; g_i(w) \le 0,\; i = 1,\ldots,k; \quad h_i(w) = 0,\; i = 1,\ldots,l$$
Equality Cases
• The $g_i(w) = 0$ cases correspond to the training examples whose functional margin is exactly 1:
$$g_i(w) = -y^{(i)}(w^\top x^{(i)} + b) + 1 = 0 \;\Rightarrow\; y^{(i)}\Big(\frac{w^\top x^{(i)}}{\|w\|} + \frac{b}{\|w\|}\Big) = \frac{1}{\|w\|}$$
i.e., their geometric margin is $1/\|w\|$.
Objective of an SVM
• SVM objective: finding the optimal margin classifier
$$\min_{w,b}\;\; \frac{1}{2}\|w\|^2 \qquad \text{s.t.}\;\; -y^{(i)}(w^\top x^{(i)} + b) + 1 \le 0,\; i = 1,\ldots,m$$
• Lagrangian (note there are no $\beta$'s or equality constraints in the SVM problem):
$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m}\alpha_i\big[y^{(i)}(w^\top x^{(i)} + b) - 1\big]$$
Solving
• Setting the derivatives of the Lagrangian to zero:
$$\frac{\partial}{\partial w}\mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)}$$
$$\frac{\partial}{\partial b}\mathcal{L}(w, b, \alpha) = \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
• Substituting back, the Lagrangian is rewritten as
$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\Big\|\sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)}\Big\|^2 - \sum_{i=1}^{m}\alpha_i\big[y^{(i)}(w^\top x^{(i)} + b) - 1\big] = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\, x^{(i)\top}x^{(j)} - b\underbrace{\sum_{i=1}^{m}\alpha_i y^{(i)}}_{=\,0}$$
Solving α*
• Dual problem:
$$\max_{\alpha \ge 0}\theta_D(\alpha) = \max_{\alpha \ge 0}\min_{w,b}\mathcal{L}(w, b, \alpha)$$
$$\max_\alpha\;\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\, x^{(i)\top}x^{(j)}$$
$$\text{s.t.}\;\; \alpha_i \ge 0,\; i = 1,\ldots,m; \qquad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
• α* can be solved with methods such as SMO; we will get back to this solution later
Solving w* and b*
• With α* solved, w* is obtained by
$$w^* = \sum_{i=1}^{m}\alpha_i^* y^{(i)} x^{(i)}$$
• With w* solved, b* is obtained by
$$b^* = -\frac{\max_{i:\,y^{(i)}=-1} w^{*\top}x^{(i)} + \min_{i:\,y^{(i)}=1} w^{*\top}x^{(i)}}{2}$$
• Only the support vectors have $\alpha_i > 0$
Predicting Values
• With the solutions $w^*$ and $b^*$, the predicted value (i.e., functional margin) of each instance is
$$w^{*\top}x + b^* = \Big(\sum_{i=1}^{m}\alpha_i y^{(i)} x^{(i)}\Big)^\top x + b^* = \sum_{i=1}^{m}\alpha_i y^{(i)}\langle x^{(i)}, x\rangle + b^*$$
• We only need to calculate the inner products of $x$ with the support vectors (those with $\alpha_i > 0$)
Non-Separable Cases
• The derivation of the SVM presented so far assumes that the data is linearly separable
• More practical cases are linearly non-separable
Dealing with Non-Separable Cases
• Add slack variables $\xi_i$ (an L1 regularization of the margin violations):
$$\min_{w,b}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i \qquad \text{s.t.}\;\; y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\; i = 1,\ldots,m$$
• Lagrangian:
$$\mathcal{L}(w, b, \xi, \alpha, r) = \frac{1}{2}w^\top w + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\big[y^{(i)}(x^{(i)\top}w + b) - 1 + \xi_i\big] - \sum_{i=1}^{m}r_i\xi_i$$
• Dual problem:
$$\max_\alpha\;\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\, x^{(i)\top}x^{(j)}$$
$$\text{s.t.}\;\; 0 \le \alpha_i \le C,\; i = 1,\ldots,m; \qquad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
Surprisingly, the box constraint $0 \le \alpha_i \le C$ is the only change; the dual is efficiently solved by the SMO algorithm.
SVM Hinge Loss vs. LR Loss
• SVM hinge loss:
$$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\max\big(0,\; 1 - y_i(w^\top x_i + b)\big)$$
• LR log loss:
$$-y_i\log\sigma(w^\top x_i + b) - (1 - y_i)\log\big(1 - \sigma(w^\top x_i + b)\big)$$
• If $y = 1$: the hinge loss is exactly zero once the functional margin exceeds 1, while the log loss stays positive but decays as the score grows
[Figure: the two losses as functions of the score $w^\top x + b$]
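The per-example comparison for $y = 1$ can be computed directly; a minimal NumPy sketch, where the score grid is an arbitrary choice for illustration:

```python
import numpy as np

def hinge_loss(margin):
    # max(0, 1 - y*s), written in terms of the functional margin m = y*s
    return np.maximum(0.0, 1.0 - margin)

def log_loss_pos(score):
    # LR loss for y = 1: -log(sigmoid(score)) = log(1 + exp(-score))
    return np.log1p(np.exp(-score))

scores = np.linspace(-2, 3, 6)   # scores of a positive example (y = 1)
h = hinge_loss(scores)           # zero once the margin exceeds 1
l = log_loss_pos(scores)         # always positive, decays with the score
```

Note that the hinge loss vanishes exactly for confident correct predictions, whereas the log loss never reaches zero.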
Coordinate Ascent (Descent)
• For the optimization problem
$$\max_\alpha\; W(\alpha_1, \alpha_2, \ldots, \alpha_m)$$
• Coordinate ascent algorithm:

    Loop until convergence: {
        For i = 1, ..., m {
            α_i := argmax_{α_i} W(α_1, ..., α_{i-1}, α_i, α_{i+1}, ..., α_m)
        }
    }
Coordinate Ascent (Descent)
A two-dimensional coordinate ascent example
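A tiny numeric illustration of coordinate ascent; the concave objective here is made up for the example (not from the slides), chosen so each coordinate update has a closed form:

```python
# Illustrative concave objective: W(a1, a2) = -(a1-1)^2 - (a2-2)^2 - a1*a2
# Setting dW/da1 = 0 gives a1 = 1 - a2/2; setting dW/da2 = 0 gives a2 = 2 - a1/2.
a1, a2 = 0.0, 0.0
for _ in range(50):          # iterate until (effectively) convergence
    a1 = 1.0 - a2 / 2.0      # argmax over a1 with a2 fixed
    a2 = 2.0 - a1 / 2.0      # argmax over a2 with a1 fixed
```

Each sweep is a contraction here, so the iterates converge to the maximizer $(a_1, a_2) = (0, 2)$.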
SMO Algorithm
• SMO: sequential minimal optimization
• SVM optimization problem:
$$\max_\alpha\;\; W(\alpha) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\, x^{(i)\top}x^{(j)}$$
$$\text{s.t.}\;\; 0 \le \alpha_i \le C,\; i = 1,\ldots,m; \qquad \sum_{i=1}^{m}\alpha_i y^{(i)} = 0$$
• We cannot directly apply the coordinate ascent algorithm, because the equality constraint determines each $\alpha_i$ from the others:
$$\sum_{i=1}^{m}\alpha_i y^{(i)} = 0 \;\Rightarrow\; \alpha_i y^{(i)} = -\sum_{j \ne i}\alpha_j y^{(j)}$$
SMO Algorithm
• Update two variables each time:

    Loop until convergence {
        1. Select some pair α_i and α_j to update next
        2. Re-optimize W(α) w.r.t. α_i and α_j, holding the other α's fixed
    }

• The key advantage of the SMO algorithm is that the update of α_i and α_j (step 2) is efficient
• Convergence test: whether the change of W(α) is smaller than a predefined value (e.g. 0.01)
SMO Algorithm
• Without loss of generality, hold $\alpha_3, \ldots, \alpha_m$ fixed and optimize $W(\alpha)$ w.r.t. $\alpha_1$ and $\alpha_2$. The equality constraint gives
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m}\alpha_i y^{(i)} = \zeta$$
$$\Rightarrow\; \alpha_2 = -\frac{y^{(1)}}{y^{(2)}}\alpha_1 + \frac{\zeta}{y^{(2)}}, \qquad \alpha_1 = \big(\zeta - \alpha_2 y^{(2)}\big)y^{(1)}$$
SMO Algorithm
• With $\alpha_1 = (\zeta - \alpha_2 y^{(2)})y^{(1)}$, the objective is written as
$$W(\alpha_1, \alpha_2, \ldots, \alpha_m) = W\big((\zeta - \alpha_2 y^{(2)})y^{(1)},\, \alpha_2, \ldots, \alpha_m\big)$$
• Thus the original optimization problem is transformed into a quadratic optimization problem w.r.t. $\alpha_2$ alone:
$$\max_{\alpha_2}\;\; W(\alpha_2) = a\alpha_2^2 + b\alpha_2 + c \qquad \text{s.t.}\;\; 0 \le \alpha_2 \le C$$
SMO Algorithm
• Optimizing a one-dimensional quadratic function is very efficient: the unconstrained maximizer is $\alpha_2 = -\frac{b}{2a}$ (with $a < 0$), which is then clipped into the interval $[0, C]$
[Figure: three cases of $W(\alpha_2)$ on $[0, C]$ — the maximizer $-b/2a$ lies inside $[0, C]$, below 0, or above $C$]
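The clipped one-dimensional update can be written in a few lines; the coefficient values below are arbitrary, chosen only to hit each of the three cases in the figure:

```python
def argmax_quadratic_on_box(a, b, C):
    """Maximize W(x) = a*x^2 + b*x + c on [0, C], assuming a < 0 (concave)."""
    unclipped = -b / (2.0 * a)          # unconstrained maximizer
    return min(max(unclipped, 0.0), C)  # clip into the feasible interval

# Three illustrative cases with C = 1:
inside = argmax_quadratic_on_box(-1.0, 1.0, 1.0)   # -b/2a = 0.5, inside [0, 1]
below  = argmax_quadratic_on_box(-1.0, -1.0, 1.0)  # -b/2a = -0.5, clipped to 0
above  = argmax_quadratic_on_box(-1.0, 4.0, 1.0)   # -b/2a = 2.0, clipped to 1
```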
Content of This Lecture
• Support Vector Machines
• Neural Networks
Breaking News of AI in 2016
• AlphaGo beats Lee Sedol (4-1)
https://www.goratings.org/
https://deepmind.com/research/alphago/
Machine Learning in AlphaGo
• Policy Network
  • Supervised learning: predict the best next human move
  • Reinforcement learning: learn to select the next move to maximize the winning rate
• Value Network
  • Expectation of winning given the board state
• Implemented by (deep) neural networks
Neural Networks
• Neural networks are the basis of deep learning
[Figures: Perceptron, Multi-layer Perceptron, Convolutional Neural Network, Recurrent Neural Network]
Real Neurons
• Cell structures: cell body, dendrites, axon, synaptic terminals
Slides credit: Ray Mooney
Neural Communication
• Electrical potential across the cell membrane exhibits spikes called action potentials.
• A spike originates in the cell body, travels down the axon, and causes the synaptic terminals to release neurotransmitters.
• Chemical diffuses across synapse to dendrites of other neurons.
• Neurotransmitters can be excitatory or inhibitory.
• If net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, it fires an action potential.
Slides credit: Ray Mooney
Real Neural Learning
• Synapses change size and strength with experience.
• Hebbian learning: When two connected neurons are firing at the same time, the strength of the synapse between them increases.
• “Neurons that fire together, wire together.”
• These motivate the research of artificial neural nets
Slides credit: Ray Mooney
Brief History of Artificial Neural Nets
• The first wave
• 1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model
• 1958 Rosenblatt introduced the simple single layer networks now called Perceptrons.
• 1969 Minsky and Papert’s book Perceptrons demonstrated the limitation of single layer perceptrons, and almost the whole field went into hibernation.
• The second wave
  • 1986 The back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again.
• The third wave
  • 2006 Deep (neural networks) learning gains popularity and
  • 2012 made significant break-throughs in many applications.
Slides credit: Jun Wang
Artificial Neuron Model [McCulloch and Pitts, 1943]
• Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node $i$ to node $j$, with weight $w_{ji}$
• Model the net input to cell $j$ as
$$net_j = \sum_i w_{ji}\, o_i$$
• Cell output is
$$o_j = \begin{cases} 0 & \text{if } net_j < T_j \\ 1 & \text{if } net_j \ge T_j \end{cases}$$
($T_j$ is the threshold for unit $j$)
[Figure: unit 1 connected to units 2-6 with weights $w_{12}, \ldots, w_{16}$; step output $o_j$ as a function of $net_j$]
Slides credit: Ray Mooney
Perceptron Model
• Rosenblatt's single-layer perceptron [1958]
• Rosenblatt [1958] further proposed the perceptron as the first model for learning with a teacher (i.e., supervised learning)
• Focused on how to find appropriate weights $w_m$ for a two-class classification task
  • y = 1: class one
  • y = -1: class two
• Activation function:
$$\varphi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
• Prediction:
$$\hat y = \varphi\Big(\sum_{i=1}^{m} w_i x_i + b\Big)$$
Training Perceptron
• Rosenblatt's single-layer perceptron [1958]
• Activation function:
$$\varphi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
• Prediction:
$$\hat y = \varphi\Big(\sum_{i=1}^{m} w_i x_i + b\Big)$$
• Training:
$$w_i \leftarrow w_i + \eta(y - \hat y)x_i, \qquad b \leftarrow b + \eta(y - \hat y)$$
• Equivalent to the rules:
  • If the output is correct, do nothing
  • If the output is high, lower the weights on positive inputs
  • If the output is low, increase the weights on active inputs
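The training rule above can be sketched in a few lines of NumPy; the toy dataset, learning rate, and epoch cap are illustrative choices:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Rosenblatt's rule: w <- w + lr*(y - y_hat)*x, b <- b + lr*(y - y_hat)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1.0 if xi @ w + b >= 0 else -1.0
            if y_hat != yi:
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
                errors += 1
        if errors == 0:   # converged: all examples classified correctly
            break
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)
preds = np.where(X @ w + b >= 0, 1.0, -1.0)
```

Because the data is linearly separable, the perceptron convergence theorem guarantees the loop terminates with zero errors.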
Properties of Perceptron
• Rosenblatt's single-layer perceptron [1958]
• Rosenblatt proved the convergence of the learning algorithm if the two classes are linearly separable (i.e., the patterns lie on opposite sides of a hyperplane $w_1x_1 + w_2x_2 + b = 0$)
• Many people hoped that such a machine could be the basis for artificial intelligence
[Figure: class 1 and class 2 separated by the line $w_1x_1 + w_2x_2 + b = 0$ in the $(x_1, x_2)$ plane]
Properties of Perceptron
• The XOR problem:

    x1  x2 | x1 XOR x2
     0   0 | 0
     0   1 | 1
     1   0 | 1
     1   1 | 0

• However, Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron
• Rosenblatt believed the limitations could be overcome with more layers of units, but no learning algorithm was known for obtaining their weights yet
• Due to the lack of learning algorithms, people left the neural network paradigm for almost 20 years
• XOR is non-linearly separable: the two classes (true and false) cannot be separated using a line
Hidden Layers and Backpropagation (1986~)
• Adding hidden layer(s) (an internal representation) allows the network to learn a mapping that is not constrained by a linearly separable decision boundary $x_1w_1 + x_2w_2 + b = 0$
• Each hidden node realizes one of the lines bounding the convex region
• But the solution is quite often not unique (the number in each circle is a threshold)
[Figures: two different two-hidden-unit solutions to the same problem (solution 1, solution 2)]
http://www.cs.stir.ac.uk/research/publications/techreps/pdf/TR148.pdf
http://recognize-speech.com/basics/introduction-to-artificial-neural-networks
Hidden Layers and Backpropagation (1986~)
• For XOR (truth table: 00→0, 01→1, 10→1, 11→0), two lines are necessary to divide the sample space accordingly, using a sign activation function
• A two-layer feedforward neural network realizes this
• Feedforward: messages move forward from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network.
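The XOR construction above can be verified with hand-picked weights (these particular weight values are illustrative choices, not taken from the slides): one hidden unit realizes the OR line, the other the AND line, and the output combines them as OR AND (NOT AND) = XOR.

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

W1 = np.array([[1.0, 1.0],    # hidden unit 1 (OR):  x1 + x2 - 0.5 >= 0
               [1.0, 1.0]])   # hidden unit 2 (AND): x1 + x2 - 1.5 >= 0
b1 = np.array([-0.5, -1.5])
W2 = np.array([1.0, -1.0])    # output: h_OR - h_AND - 0.5 >= 0
b2 = -0.5

def xor_net(x):
    h = step(W1 @ x + b1)     # hidden layer: two bounding lines
    return step(W2 @ h + b2)  # output layer: combine the half-planes

outputs = [float(xor_net(np.array(x))) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

This reproduces the XOR truth table, which no single-layer perceptron can do.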
Hidden Layers and Backpropagation (1986~)
Single / Multiple Layers of Calculation
• Single-layer functions:
$$f_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2, \qquad f_\theta(x) = \sigma(\theta_0 + \theta_1 x + \theta_2 x^2)$$
• Multiple-layer function, with non-linear activation functions:
$$h_1(x) = \tanh(\theta_0 + \theta_1 x + \theta_2 x^2), \qquad h_2(x) = \tanh(\theta_3 + \theta_4 x + \theta_5 x^2)$$
$$f_\theta(x) = f_\theta\big(h_1(x), h_2(x)\big) = \sigma(\theta_6 + \theta_7 h_1 + \theta_8 h_2)$$
where
$$\sigma(x) = \frac{1}{1+e^{-x}}, \qquad \tanh(x) = \frac{1-e^{-2x}}{1+e^{-2x}}$$
Non-linear Activation Functions
• Sigmoid:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
• Tanh:
$$\tanh(z) = \frac{1-e^{-2z}}{1+e^{-2z}}$$
• Rectified Linear Unit (ReLU):
$$\mathrm{ReLU}(z) = \max(0, z)$$
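The three activations above are one-liners in NumPy; a minimal sketch (the evaluation grid is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # same form as on the slide: (1 - e^{-2z}) / (1 + e^{-2z})
    return (1.0 - np.exp(-2.0 * z)) / (1.0 + np.exp(-2.0 * z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
s, t, r = sigmoid(z), tanh(z), relu(z)
```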
Universal Approximation Theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons (i.e., a multilayer perceptron) can approximate continuous functions
  • on compact subsets of $\mathbb{R}^n$
  • under mild assumptions on the activation function, such as sigmoid, tanh and ReLU
[Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366.]
Universal Approximation
• A multi-layer perceptron can approximate any continuous function on a compact subset of $\mathbb{R}^n$, using activations such as
$$\sigma(x) = \frac{1}{1+e^{-x}}, \qquad \tanh(x) = \frac{1-e^{-2x}}{1+e^{-2x}}$$
• One of the efficient algorithms for training multi-layer neural networks is the backpropagation algorithm
• It was re-introduced in 1986 and neural networks regained popularity
Note: backpropagation appears to have been first found by Werbos [1974], and then independently rediscovered around 1985 by Rumelhart, Hinton, and Williams [1986] and by Parker [1985]
Hidden Layers and Backpropagation (1986~)
• Error calculation: compare the outputs with the correct answers to get the error
• Error backpropagation: propagate the error backward through the weight parameters
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial z_k}\,\frac{\partial z_k}{\partial w_{jk}} = \frac{\partial E}{\partial z_k}\, y_j$$
[LeCun, Bengio and Hinton. Deep Learning. Nature 2015.]
Learning NN by Back-Propagation
[Figure: training instances (face / no-face images) fed through a network with inputs $x_1, x_2, \ldots, x_m$, weight parameters, and outputs $y_0, y_1$; targets $d_1 = 1$ for label = face, $d_2 = 0$ for label = no face]
Learning NN by Back-Propagation
• The procedure alternates between making a prediction (feed-forward), error calculation, and error back-propagation
Make a Prediction
Two-layer feedforward neural network: input layer, hidden layer, output layer
[Figure: inputs $x_1, x_2, \ldots, x_m$, hidden units $h_j^{(1)}$, outputs $y_k$, labels $d_k$]
Feed-forward prediction for input $x = (x_1, \ldots, x_m)$:
$$h_j^{(1)} = f^{(1)}\big(net_j^{(1)}\big) = f^{(1)}\Big(\sum_m w_{j,m}^{(1)} x_m\Big), \qquad net_j^{(1)} = \sum_m w_{j,m}^{(1)} x_m$$
$$y_k = f^{(2)}\big(net_k^{(2)}\big) = f^{(2)}\Big(\sum_j w_{k,j}^{(2)} h_j^{(1)}\Big), \qquad net_k^{(2)} = \sum_j w_{k,j}^{(2)} h_j^{(1)}$$
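The feed-forward pass above is two matrix-vector products with an activation in between; a minimal NumPy sketch, where the layer sizes, random weights, and test input are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: m = 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # w^(1)_{j,m}
W2 = rng.normal(size=(2, 4))   # w^(2)_{k,j}

def forward(x, f1=sigmoid, f2=sigmoid):
    net1 = W1 @ x              # net^(1)_j = sum_m w^(1)_{j,m} x_m
    h1 = f1(net1)              # h^(1)_j = f^(1)(net^(1)_j)
    net2 = W2 @ h1             # net^(2)_k = sum_j w^(2)_{k,j} h^(1)_j
    y = f2(net2)               # y_k = f^(2)(net^(2)_k)
    return y

y = forward(np.array([1.0, 0.5, -0.5]))
```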
When Backprop / Learn Parameters
Two-layer feedforward neural network: input layer, hidden layer, output layer
Notation: $net_j^{(1)} = \sum_m w_{j,m}^{(1)} x_m$, $\quad net_k^{(2)} = \sum_j w_{k,j}^{(2)} h_j^{(1)}$
Backprop to learn the parameters, with squared error
$$E(W) = \frac{1}{2}\sum_k (y_k - d_k)^2$$
Output-layer weights:
$$\Delta w_{k,j}^{(2)} = -\eta\,\frac{\partial E(W)}{\partial w_{k,j}^{(2)}} = -\eta(y_k - d_k)\,\frac{\partial y_k}{\partial net_k^{(2)}}\,\frac{\partial net_k^{(2)}}{\partial w_{k,j}^{(2)}} = \eta(d_k - y_k)\,f'^{(2)}\big(net_k^{(2)}\big)\,h_j^{(1)} = \eta\,\delta_k\,h_j^{(1)}$$
i.e. $\Delta w_{k,j}^{(2)} = \eta\cdot\mathrm{Error}_k\cdot\mathrm{Output}_j$ with $\delta_k = (d_k - y_k)f'^{(2)}(net_k^{(2)})$, and
$$w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \Delta w_{k,j}^{(2)}$$
When Backprop / Learn Parameters
Hidden-layer weights:
$$\Delta w_{j,m}^{(1)} = -\eta\,\frac{\partial E(W)}{\partial w_{j,m}^{(1)}} = -\eta\,\frac{\partial E(W)}{\partial h_j^{(1)}}\,\frac{\partial h_j^{(1)}}{\partial w_{j,m}^{(1)}} = \eta\sum_k (d_k - y_k)\,f'^{(2)}\big(net_k^{(2)}\big)\,w_{k,j}^{(2)}\,f'^{(1)}\big(net_j^{(1)}\big)\,x_m = \eta\,\delta_j\,x_m$$
i.e. $\Delta w_{j,m}^{(1)} = \eta\cdot\mathrm{Error}_j\cdot\mathrm{Output}_m$ with $\delta_j = f'^{(1)}(net_j^{(1)})\sum_k \delta_k w_{k,j}^{(2)}$, and
$$w_{j,m}^{(1)} \leftarrow w_{j,m}^{(1)} + \Delta w_{j,m}^{(1)}$$
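The two delta formulas above can be implemented and verified against a numerical gradient; a minimal NumPy sketch with sigmoid activations, where the layer sizes, random weights, input, and targets are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))       # hidden weights w^(1)_{j,m}
W2 = rng.normal(size=(2, 3))       # output weights w^(2)_{k,j}
x = np.array([0.5, -1.0])
d = np.array([1.0, 0.0])           # targets

def forward(W1, W2, x):
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    return h, y

def loss(W1, W2):
    _, y = forward(W1, W2, x)
    return 0.5 * np.sum((y - d) ** 2)

# Backprop deltas: delta_k = (d_k - y_k) f'(net2_k), delta_j = f'(net1_j) sum_k delta_k w2_kj
h, y = forward(W1, W2, x)
delta_k = (d - y) * y * (1 - y)            # sigmoid derivative: f (1 - f)
delta_j = h * (1 - h) * (W2.T @ delta_k)
grad_W2 = -np.outer(delta_k, h)            # dE/dW2 = -delta_k * h_j
grad_W1 = -np.outer(delta_j, x)            # dE/dW1 = -delta_j * x_m

# Numerical gradient check on one entry of W2 (central difference)
eps = 1e-6
W2p, W2m = W2.copy(), W2.copy()
W2p[0, 0] += eps
W2m[0, 0] -= eps
num = (loss(W1, W2p) - loss(W1, W2m)) / (2 * eps)
```

The analytic gradient matches the finite-difference estimate, which is the standard way to sanity-check a backprop implementation.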
An Example for Backprop
• Consider the sigmoid activation function:
$$f_{\mathrm{Sigmoid}}(x) = \frac{1}{1+e^{-x}}, \qquad f'_{\mathrm{Sigmoid}}(x) = f_{\mathrm{Sigmoid}}(x)\,\big(1 - f_{\mathrm{Sigmoid}}(x)\big)$$
• The update rules become:
$$\delta_k = (d_k - y_k)\,f'^{(2)}\big(net_k^{(2)}\big), \qquad \Delta w_{k,j}^{(2)} = \eta\cdot\mathrm{Error}_k\cdot\mathrm{Output}_j = \eta\,\delta_k\,h_j^{(1)}$$
$$\delta_j = f'^{(1)}\big(net_j^{(1)}\big)\sum_k \delta_k w_{k,j}^{(2)}, \qquad \Delta w_{j,m}^{(1)} = \eta\cdot\mathrm{Error}_j\cdot\mathrm{Output}_m = \eta\,\delta_j\,x_m$$
https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
Let us do some calculation
• Consider the simple network below:
[Figure: a small network with given weights]
• Assume that the neurons have a sigmoid activation function, and
  1. Perform a forward pass on the network
  2. Perform a reverse pass (training) once (target = 0.5)
  3. Perform a further forward pass and comment on the result
A demo from Google: http://playground.tensorflow.org/
Non-linear Activation Functions
• Sigmoid:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
• Tanh:
$$\tanh(z) = \frac{1-e^{-2z}}{1+e^{-2z}}$$
• Rectified Linear Unit (ReLU):
$$\mathrm{ReLU}(z) = \max(0, z)$$
[Figure: the sigmoid, linear, and tanh functions and their derivatives]
https://theclevermachine.wordpress.com/tag/tanh-function/
Activation Functions
• Logistic sigmoid:
$$f_{\mathrm{Sigmoid}}(x) = \frac{1}{1+e^{-x}}$$
• Output range [0, 1]
• Motivated by biological neurons; can be interpreted as the probability of an artificial neuron "firing" given its inputs
• However, saturated neurons make gradients vanish (why?)
• Its derivative:
$$f'_{\mathrm{Sigmoid}}(x) = f_{\mathrm{Sigmoid}}(x)\,\big(1 - f_{\mathrm{Sigmoid}}(x)\big)$$
Activation Functions
• Tanh function:
$$f_{\tanh}(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
• Output range [-1, 1]
• Thus strongly negative inputs to the tanh map to negative outputs
• Only zero-valued inputs are mapped to near-zero outputs
• These properties make the network less likely to get "stuck" during training
• Its gradient:
$$f'_{\tanh}(x) = 1 - f_{\tanh}(x)^2$$
https://theclevermachine.wordpress.com/tag/tanh-function/
Activation Functions
• ReLU (rectified linear unit):
$$f_{\mathrm{ReLU}}(x) = \max(0, x)$$
• Another version is noisy ReLU:
$$f_{\mathrm{NoisyReLU}}(x) = \max\big(0,\; x + N(0, \delta(x))\big)$$
• ReLU can be approximated by the softplus function $f_{\mathrm{Softplus}}(x) = \log(1 + e^x)$
• The ReLU gradient doesn't vanish as we increase $x$
• It can be used to model positive numbers
• It is fast, as there is no need to compute the exponential function
• It eliminates the necessity of a "pretraining" phase
• The derivative:
$$f'_{\mathrm{ReLU}}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$$
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40811.pdf
Activation Functions
• ReLU (rectified linear unit): $f_{\mathrm{ReLU}}(x) = \max(0, x)$, approximated by softplus $f_{\mathrm{Softplus}}(x) = \log(1 + e^x)$
• The only non-linearity comes from the path selection, with individual neurons being active or not
• It allows sparse representations: for a given input, only a subset of neurons are active (sparse propagation of activations and gradients)
• Additional activation functions: Leaky ReLU, Exponential LU, Maxout, etc.
http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
Error/Loss Function
• Recall stochastic gradient descent: update from a randomly picked example (but in practice do a batch update)
$$w \leftarrow w - \eta\,\frac{\partial L(w)}{\partial w}$$
• Squared error loss for one binary output $f_w(x)$ with input $x$:
$$L(w) = \frac{1}{2}\big(y - f_w(x)\big)^2$$
Error/Loss Function
• Softmax (cross-entropy loss) for multiple classes:
$$L(w) = -\sum_k \big(d_k \log y_k + (1 - d_k)\log(1 - y_k)\big)$$
where
$$y_k = \frac{\exp\big(\sum_j w_{k,j}^{(2)} h_j^{(1)}\big)}{\sum_{k'} \exp\big(\sum_j w_{k',j}^{(2)} h_j^{(1)}\big)}$$
with one-hot encoded class labels $d_k$ (class labels follow a multinomial distribution)
[Figure: hidden layer $h_j^{(1)}$ and output layer $y_k$ with labels $d_k$]
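The softmax output and its cross-entropy loss can be sketched in NumPy. This is a minimal illustration with made-up scores and a one-hot label; note that with one-hot labels the common multiclass form reduces to $-\log y_{\text{true}}$, dropping the $(1-d_k)$ terms:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y, d):
    # with a one-hot d this reduces to -log(y_true)
    return -np.sum(d * np.log(y))

scores = np.array([2.0, 1.0, 0.1])      # illustrative output-layer scores
y = softmax(scores)                     # a proper distribution over classes
d = np.array([1.0, 0.0, 0.0])           # one-hot label for class 0
L = cross_entropy(y, d)
```

Shifting the scores by their maximum before exponentiating leaves the softmax unchanged but avoids overflow for large inputs.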