Professor Joongheon Kim (CSE@CAU)
http://sites.google.com/site/joongheon

Artificial Intelligence Lectures for Communication and Network Engineers/Researchers (14 April 2017)

Hidden Markov Models (HMM) and Probabilistic Graphical Models (PGM)

Professor Joongheon Kim

School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea

Hidden Markov Models (HMM) and Probabilistic Graphical Models (PGM)
Part 1: Hidden Markov Models

Outline

• Hidden Markov Models
  • Markov
    • Markov Chain
    • Markov Models and Markov Processes
  • Hidden Markov Model (HMM)
  • HMM Applications: Probability Evaluation

Markov (Markov Chain)

[Definition (P_ij)] The fixed probability (one-step transition probability) that the process will next be in state j whenever it is in state i. That is,

P_ij = P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_1 = i_1, X_0 = i_0)

for all states i_0, i_1, ..., i_{n-1}, i, j and all n ≥ 0.

[Note (Markov Property)] For all states i_0, i_1, ..., i_{n-1}, i, j and all n ≥ 0,

P_ij = P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_1 = i_1, X_0 = i_0)
     = P(X_{n+1} = j | X_n = i)

Markov (Markov Chain)

[Note]
• P_ij ≥ 0 where i ≥ 0, j ≥ 0
• Σ_{j=0}^{∞} P_ij = 1 for all i = 0, 1, ...

[Markov Chain diagram: state i with outgoing transitions P_i1, P_i2, ..., P_in to states 1, 2, ..., n, and a self-transition P_ii]

Markov (Markov Chain)

[Note (P)] Let P denote the matrix of one-step transition probabilities, i.e.,

P = [ P_ii  P_ij  P_ik ]
    [ P_ji  P_jj  P_jk ]
    [ P_ki  P_kj  P_kk ]

[Transition diagram: states i, j, k with transitions P_ii, P_ij, P_ik, P_ji, P_jj, P_jk, P_ki, P_kj, P_kk]

Markov (Markov Chain)

[Example] There are two milk companies in South Korea, A and B. Based on last year's statistics, 88% of A's customers are still with A, and the other 12% are now with B. In addition, 85% of B's customers are still with B, and the other 15% are now with A.

[Transition Matrix]

P = [ P_AA  P_AB ] = [ 0.88  0.12 ]
    [ P_BA  P_BB ]   [ 0.15  0.85 ]

[Markov Chain diagram: states A and B with P_AA = 0.88, P_AB = 0.12, P_BA = 0.15, P_BB = 0.85]

[One-Step Transition] If the initial market share is A = 0.25 and B = 0.75, i.e., s_0 = [0.25  0.75], the next market share is:

s_1 = s_0 P = [0.25  0.75] [ 0.88  0.12 ] = [0.3325  0.6675]
                           [ 0.15  0.85 ]
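The one-step update s_1 = s_0 P above can be checked with a few lines of plain Python (the list layout is an illustrative choice, no libraries assumed):

```python
# One-step transition of the milk-market Markov chain from the example.
P = [[0.88, 0.12],   # row A: P_AA, P_AB
     [0.15, 0.85]]   # row B: P_BA, P_BB
s0 = [0.25, 0.75]    # initial market shares of A and B

# s1 = s0 · P (row vector times matrix)
s1 = [sum(s0[i] * P[i][j] for i in range(2)) for j in range(2)]
print([round(x, 4) for x in s1])  # [0.3325, 0.6675]
```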

Markov (Markov Chain)

[Example (Multi-Step Transition)] Given P (from the previous slide), suppose that we are in state i at time t and we have to compute the probability of being in state i at time t + 2 (denoted by P_ii^2). Summing over the intermediate state at time t + 1 (here i, j, or k):

P_ii^2 = P(X_{n+2} = i | X_n = i)
       = P_ii P_ii + P_ij P_ji + P_ik P_ki
       = (P^2)_ii, the (i, i) entry of the matrix product P · P

[X_n = i: state i at time n]

[Diagram: paths from state i at time t through intermediate states i, j, k at time t + 1 back to state i at time t + 2]
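As a sketch, the two-step transition matrix P^2 can be computed for the milk-company chain from the earlier example; the helper `matmul` below is just an illustrative plain-Python matrix product:

```python
# Two-step transition probabilities: P^2 for the milk-company chain.
P = [[0.88, 0.12],
     [0.15, 0.85]]

def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P2 = matmul(P, P)
# (P^2)_AA = P_AA*P_AA + P_AB*P_BA = 0.88*0.88 + 0.12*0.15 = 0.7924
print(round(P2[0][0], 4))
```

Each row of P^2 still sums to 1, as it must for a stochastic matrix.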

Markov (Markov Models and Markov Processes)

• Example of a Markov Model (Weather Forecasting)
  • Weather states: Sunny (S), Rainy (R), Foggy (F)
  • Today's weather q_n depends on the previous weather conditions q_{n-1}, q_{n-2}, ..., q_1:

    P(q_n | q_{n-1}, q_{n-2}, ..., q_1)

  • Example: if the previous three weather conditions are q_{n-1} = S, q_{n-2} = R, and q_{n-3} = F, the probability that today's weather (q_n) is R is:

    P(q_n = R | q_{n-1} = S, q_{n-2} = R, q_{n-3} = F)

Markov (Markov Models and Markov Processes)

• Observations from the previous [Example]
  • A larger n means we have to gather more information.
  • If n = 6, we need to gather 3^(6-1) = 243 weather data points. Therefore, we need an assumption (called the Markov Assumption) that reduces the amount of data to gather.
  • [First-Order Markov Assumption]

    P(q_n = S_j | q_{n-1} = S_i, q_{n-2} = S_k, ...) = P(q_n = S_j | q_{n-1} = S_i)

    so that P(q_1, q_2, ..., q_n) = Π_{i=1}^{n} P(q_i | q_{i-1})

  • [Second-Order Markov Assumption] analogously conditions on the two most recent states:

    P(q_n = S_j | q_{n-1} = S_i, q_{n-2} = S_k, ...) = P(q_n = S_j | q_{n-1} = S_i, q_{n-2} = S_k)

Markov (Markov Models and Markov Processes)

• Observations from the previous [Example] (Continued)
  • With the Markov Assumption, the probability of observing a sequence q_1, q_2, ..., q_n can be written as a joint probability as follows:

P(q_1, q_2, ..., q_n)
  = P(q_1) P(q_2 | q_1) P(q_3 | q_2, q_1) ... P(q_{n-1} | q_{n-2}, ..., q_1) P(q_n | q_{n-1}, ..., q_1)
  = P(q_1) P(q_2 | q_1) P(q_3 | q_2) ... P(q_{n-1} | q_{n-2}) P(q_n | q_{n-1})
  = Π_{i=1}^{n} P(q_i | q_{i-1}), when we assume P(q_0) = 1

Markov (Markov Models and Markov Processes)

• Example (Weather Forecasting)

[Weather State Table] rows: q_{n-1}, columns: q_n

        S     R     F
  S    0.8   0.05  0.15
  R    0.2   0.6   0.2
  F    0.2   0.3   0.5

[Transition Matrix]

P = [ 0.8  0.05  0.15 ]
    [ 0.2  0.6   0.2  ]
    [ 0.2  0.3   0.5  ]

[Transition Diagram: states S, R, F with the transition probabilities above]

Markov (Markov Models and Markov Processes)

• Example (Weather Forecasting)
  • Case Study: Suppose that yesterday's (q_1) weather was Sunny (S). Find the probability that today's (q_2) weather is Sunny (S) and tomorrow's (q_3) weather is Rainy (R).
  • (Solution)

P(q_2 = S, q_3 = R | q_1 = S)
  = P(q_3 = R | q_2 = S, q_1 = S) · P(q_2 = S | q_1 = S)
  = P(q_3 = R | q_2 = S) · P(q_2 = S | q_1 = S)    [Markov Assumption]
  = 0.05 · 0.8 = 0.04

Equivalently, with P(q_1 = S) = 1:

P(q_1 = S, q_2 = S, q_3 = R)
  = P(q_1 = S) · P(q_2 = S | q_1 = S) · P(q_3 = R | q_2 = S, q_1 = S)
  = P(q_1 = S) · P(q_2 = S | q_1 = S) · P(q_3 = R | q_2 = S)    [Markov Assumption]
  = 1.0 · 0.8 · 0.05 = 0.04
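A minimal sketch of the case-study computation, using the transition probabilities from the weather state table (the dictionary layout and the helper `sequence_prob` are illustrative choices, not from the slides):

```python
# First-order Markov chain for the weather example.
# T[prev][next] is the transition probability P(q_n = next | q_{n-1} = prev).
T = {
    "S": {"S": 0.8, "R": 0.05, "F": 0.15},
    "R": {"S": 0.2, "R": 0.6,  "F": 0.2},
    "F": {"S": 0.2, "R": 0.3,  "F": 0.5},
}

def sequence_prob(seq):
    """P(q_2, ..., q_n | q_1) for a state sequence seq = [q_1, ..., q_n]."""
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= T[prev][nxt]
    return p

# Case study: P(q2 = S, q3 = R | q1 = S) = 0.8 * 0.05 = 0.04
print(round(sequence_prob(["S", "S", "R"]), 4))
```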

Outline

• Hidden Markov Models
  • Markov
  • Hidden Markov Model (HMM)
    • Example: Weather
    • Example: Balls in Jars
  • HMM Applications: Probability Evaluation

HMM (Example: Weather)

• [Example (Weather)] You are in a house with no windows. Your friend visits you once a day. You can estimate the weather by checking whether your friend has an umbrella or not. Your friend carries an umbrella with probability 0.1, 0.8, and 0.3 when the weather is S, R, and F, respectively.

Observation: with umbrella (o_i = UO) or without umbrella (o_i = UX). The weather can then be estimated by observing o_i, i ≥ 1.

Therefore, according to Bayes' theorem:

P(q_i | o_i) = P(o_i | q_i) P(q_i) / P(o_i)
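A tiny numeric sketch of the Bayes update above. The likelihoods come from the example; the uniform prior P(q) is an assumption added here for illustration, not something the slides specify:

```python
# Posterior weather estimate from a single umbrella observation.
likelihood = {"S": 0.1, "R": 0.8, "F": 0.3}   # P(o = umbrella | q)
prior = {q: 1 / 3 for q in likelihood}        # ASSUMED uniform prior P(q)

# Evidence P(o = umbrella) = sum over weather states
evidence = sum(likelihood[q] * prior[q] for q in prior)
posterior = {q: likelihood[q] * prior[q] / evidence for q in prior}
print({q: round(p, 3) for q, p in posterior.items()})
```

With a uniform prior, seeing an umbrella makes Rainy the most probable state, as intuition suggests.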

HMM (Example: Weather)

• [Example (Weather), continued] When the sequences of weather states and umbrella observations are given, i.e., q_1, ..., q_n and o_1, ..., o_n, the conditional probability is as follows:

P(q_1, ..., q_n | o_1, ..., o_n) = P(o_1, ..., o_n | q_1, ..., q_n) P(q_1, ..., q_n) / P(o_1, ..., o_n)

HMM (Example: Balls in Jars)

• [Example (Balls in Jars)] A room has a curtain, behind which there are three jars containing balls (colors: red, blue, green, and purple). A person behind the curtain selects one jar and picks one ball from it. The person shows the ball, puts it back into the jar, and repeats.

(Notations)
• b_j(k): the probability of picking a ball of color k from jar j, where k = 1, 2, 3, 4 for red, blue, green, and purple, respectively.
• N: the number of states (i.e., the number of jars): S = {S_1, ..., S_N}
• M: the number of observation symbols (i.e., the number of colors): O = {O_1, ..., O_M}
• State transition matrix A = {a_ij} where a_ij = P(q_{t+1} = S_j | q_t = S_i); this stands for a transition from state i to state j.
• Observation probabilities B = {b_j(k)} where b_j(k) = P(O_t = o_k | q_t = S_j); this stands for observing k in state j.
• Initial state distribution π = {π_i} where π_i = P(q_1 = S_i).


HMM Applications: Probability Evaluation

[Problem Definition (Probability Evaluation)] Given an observation sequence O = o_1, o_2, o_3, ... and an HMM model λ = (A, B, π), find the probability that the observation sequence occurs under the model, i.e., calculate P(O | λ). (Across several candidate models, this tells us which model most likely produced the sequence.)

[Example] We are about to toss a coin with HMM model λ = (A, B, π); we want to find the probability of the observation sequence O = (T, H, T).

HMM Applications: Probability Evaluation

[Example] We toss a coin with HMM model λ = (A, B, π); we want to find the probability of the observation sequence O = (T, H, T). The given HMM model λ = (A, B, π) is as follows:

π = [ 1/3  1/3  1/3 ]

A = [ 1/3  1/3  1/3 ]
    [ 0    1/2  1/2 ]
    [ 0    0    1   ]

B = [ 1    0   ]
    [ 1/2  1/2 ]    (rows: states 1-3; columns: P[H], P[T])
    [ 1/3  2/3 ]

HMM Applications: Probability Evaluation

[Transition Diagram] States 1, 2, and 3 of the coin-toss HMM: state 1 moves to each of states 1, 2, and 3 with probability 1/3; state 2 moves to states 2 and 3 with probability 1/2 each; state 3 stays in state 3 with probability 1. Emission probabilities: state 1: P[H] = 1, P[T] = 0; state 2: P[H] = 1/2, P[T] = 1/2; state 3: P[H] = 1/3, P[T] = 2/3.

HMM Applications: Probability Evaluation

[Trellis] States 1-3 on the vertical axis and time steps t = 0, 1, 2 on the horizontal axis; each path through the trellis corresponds to one state sequence that could emit O = (T, H, T).

HMM Applications: Probability Evaluation

[Probability Evaluation] Since b_1(T) = 0 (and states 2 and 3 never transition back to state 1), only four state sequences can emit O = (T, H, T):

[Case 1] State 2 → State 2 → State 2
P_1(T, H, T) = π_2 b_2(o_1 = T) a_22 b_2(o_2 = H) a_22 b_2(o_3 = T)
             = 1/3 · 1/2 · 1/2 · 1/2 · 1/2 · 1/2 = 0.0104

[Case 2] State 2 → State 2 → State 3
P_2(T, H, T) = π_2 b_2(o_1 = T) a_22 b_2(o_2 = H) a_23 b_3(o_3 = T)
             = 1/3 · 1/2 · 1/2 · 1/2 · 1/2 · 2/3 = 0.0139

[Case 3] State 2 → State 3 → State 3
P_3(T, H, T) = π_2 b_2(o_1 = T) a_23 b_3(o_2 = H) a_33 b_3(o_3 = T)
             = 1/3 · 1/2 · 1/2 · 1/3 · 1 · 2/3 = 0.0185

[Case 4] State 3 → State 3 → State 3
P_4(T, H, T) = π_3 b_3(o_1 = T) a_33 b_3(o_2 = H) a_33 b_3(o_3 = T)
             = 1/3 · 2/3 · 1 · 1/3 · 1 · 2/3 = 0.0494

P(O) = Σ_{i=1}^{4} P_i(T, H, T) = 0.0922
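The four-case enumeration can be checked by brute force, summing over all 3^3 state sequences (the dictionary layout for B is an illustrative choice):

```python
from itertools import product

# Brute-force evaluation of P(O | lambda) for the coin-toss HMM by
# summing over every state sequence (feasible here: only 27 paths).
pi = [1/3, 1/3, 1/3]
A = [[1/3, 1/3, 1/3],
     [0,   1/2, 1/2],
     [0,   0,   1  ]]
B = {"H": [1, 1/2, 1/3],   # b_j(H) for states 1..3
     "T": [0, 1/2, 2/3]}   # b_j(T) for states 1..3
O = ["T", "H", "T"]

total = 0.0
for path in product(range(3), repeat=len(O)):
    p = pi[path[0]] * B[O[0]][path[0]]
    for t in range(1, len(O)):
        p *= A[path[t - 1]][path[t]] * B[O[t]][path[t]]
    total += p

print(round(total, 4))  # 0.0922
```

Paths through state 1 contribute zero (b_1(T) = 0), so only the four cases above survive.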

HMM Applications: Probability Evaluation

• Forward Algorithm for Probability Evaluation
  • Step 1) Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ 3

t = 0:
  i = 1: α_1(1) = π_1 b_1(o_1 = T) = 1/3 · 0 = 0
  i = 2: α_1(2) = π_2 b_2(o_1 = T) = 1/3 · 1/2 = 1/6
  i = 3: α_1(3) = π_3 b_3(o_1 = T) = 1/3 · 2/3 = 2/9

HMM Applications: Probability Evaluation

• Forward Algorithm for Probability Evaluation
  • Step 2) Derivation: α_{t+1}(j) = [Σ_{i=1}^{3} α_t(i) a_ij] b_j(o_{t+1}), 1 ≤ t ≤ 2, 1 ≤ j ≤ 3

t = 1:
  j = 1: α_2(1) = [Σ_{i=1}^{3} α_1(i) a_i1] b_1(o_2 = H) = 0
  j = 2: α_2(2) = [Σ_{i=1}^{3} α_1(i) a_i2] b_2(o_2 = H) = (1/6 · 1/2) · 1/2 = 1/24 = 0.0417
  j = 3: α_2(3) = [Σ_{i=1}^{3} α_1(i) a_i3] b_3(o_2 = H) = (1/6 · 1/2 + 2/9 · 1) · 1/3 = 0.1019

HMM Applications: Probability Evaluation

• Forward Algorithm for Probability Evaluation
  • Step 2) Derivation (continued): α_{t+1}(j) = [Σ_{i=1}^{3} α_t(i) a_ij] b_j(o_{t+1})

t = 2:
  j = 1: α_3(1) = [Σ_{i=1}^{3} α_2(i) a_i1] b_1(o_3 = T) = 0
  j = 2: α_3(2) = [Σ_{i=1}^{3} α_2(i) a_i2] b_2(o_3 = T) = (0.0417 · 1/2) · 1/2 = 0.0104
  j = 3: α_3(3) = [Σ_{i=1}^{3} α_2(i) a_i3] b_3(o_3 = T) = (0.0417 · 1/2 + 0.1019 · 1) · 2/3 = 0.0818

HMM Applications: Probability Evaluation

• Forward Algorithm for Probability Evaluation
  • Step 3) Termination: P(O | λ) = Σ_{i=1}^{3} α_3(i)

P(O | λ) = Σ_{i=1}^{3} α_3(i) = 0 + 0.0104 + 0.0818 = 0.0922
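The three steps of the forward algorithm can be sketched as follows (the dictionary-based emission table is an illustrative layout, not from the slides):

```python
# Forward algorithm for P(O | lambda) on the coin-toss HMM.
pi = [1/3, 1/3, 1/3]
A = [[1/3, 1/3, 1/3],
     [0,   1/2, 1/2],
     [0,   0,   1  ]]
B = {"H": [1, 1/2, 1/3],
     "T": [0, 1/2, 2/3]}

def forward(O):
    # Step 1: initialization, alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[O[0]][i] for i in range(3)]
    # Step 2: derivation over the remaining observations
    for o in O[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(3)) * B[o][j]
                 for j in range(3)]
    # Step 3: termination, P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

print(round(forward(["T", "H", "T"]), 4))  # 0.0922
```

This matches the brute-force enumeration but costs O(N^2 T) instead of O(N^T), which is the point of the algorithm.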

Hidden Markov Models (HMM) and Probabilistic Graphical Models (PGM)
Part 2: Markov Decision Process

Outline

• Markov Decision Process (MDP)
  • Basics
  • Markov Property
  • Policy and Return
  • Value Functions (V, Q)
• Solving MDP
  • Planning
  • Reinforcement Learning (Value-based)
  • Reinforcement Learning (Policy-based): advanced topic (out of scope)

MDP (Basics)

• Markov Decision Process (MDP) Components: <S, A, R, T, γ>
  • S: Set of states
  • A: Set of actions
  • R: Reward function
  • T: Transition function
  • γ: Discount factor

How can we use an MDP to model an agent in a maze?

MDP (Basics)

• The components, illustrated with a robot in a maze:
  • S: Set of states — location (x, y) if the maze is a 2D grid
    • s_0: starting state; s: current state; s': next state; s_t: state at time t
  • A: Set of actions — move up, down, left, or right (each action takes the agent s → s')
  • R: Reward function — how good was the chosen action? r = R(s, a, s')
    • -1 for moving (battery used); +1 for a jewel? +100 for the exit?
  • T: Transition function — where is the robot's new location? T(s' | s, a) (stochastic transition)
  • γ: Discount factor — how much is future reward worth? 0 ≤ γ ≤ 1; γ ≈ 0 means future reward is worth nearly 0 (immediate reward is preferred)

MDP (Markov Property)

• Does s_{t+1} depend on s_0, s_1, ..., s_{t-1}, s_t? No.
  • Memoryless!
  • The future only depends on the present.
  • The current state is a sufficient statistic of the agent's history.
  • No need to remember the agent's history:
    • s_{t+1} depends only on s_t and a_t
    • r_t depends only on s_t and a_t

MDP (Policy and Return)

• Policy
  • π: S → A
  • Maps states to actions
  • Gives an action for every state
• Return
  • Discounted sum of rewards: R_t = Σ_{k=0}^{∞} γ^k r_{t+k}
  • Could be undiscounted for a finite horizon

Our goal: find the policy π that maximizes the expected return!

MDP (Value Functions (V, Q))

• State Value Function (V)
  • Expected return of starting at state s and following policy π
  • How much return do I expect starting from state s?

    V^π(s) = E_π[R_t | s_t = s] = E_π[Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s]

• Action Value Function (Q)
  • Expected return of starting at state s, taking action a, and then following policy π
  • How much return do I expect starting from state s and taking action a?

    Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a]

MDP (Solving MDP: Planning)

• Again, our goal is to find the optimal policy, i.e., the policy that maximizes the expected return from every state: π* = argmax_π V^π(s)
• If T(s' | s, a) and R(s, a, s') are known, this is a planning problem.
• We can use dynamic programming to find the optimal policy.
• Keywords: Bellman equation, value iteration, policy iteration

MDP (Solving MDP: Planning)

• Bellman Equation

  ∀s ∈ S: V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

• Value Iteration

  ∀s ∈ S: V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_i(s')]

• Policy Iteration
  • Policy Evaluation

    ∀s ∈ S: V^{π_k}_{i+1}(s) ← Σ_{s'} T(s, π_k(s), s') [R(s, π_k(s), s') + γ V^{π_k}_i(s')]

  • Policy Improvement

    π_{k+1}(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^{π_k}(s')]
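A minimal sketch of value iteration using the update above. The two-state MDP (states, actions, rewards, transitions) is entirely made up for illustration; only the update rule comes from the slide:

```python
# Value iteration on a tiny HYPOTHETICAL two-state MDP.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "go"]

# T[s][a] = list of (next_state, probability); R[s][a][next_state] = reward.
T = {0: {"stay": [(0, 1.0)], "go": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]}}
R = {0: {"stay": {0: 0.0}, "go": {1: 1.0}},
     1: {"stay": {1: 2.0}, "go": {0: 0.0}}}

V = {s: 0.0 for s in STATES}
for _ in range(200):  # repeat the Bellman backup until (near) convergence
    V = {s: max(sum(p * (R[s][a][s2] + GAMMA * V[s2]) for s2, p in T[s][a])
                for a in ACTIONS)
         for s in STATES}

# Greedy policy extracted from the converged values.
policy = {s: max(ACTIONS,
                 key=lambda a: sum(p * (R[s][a][s2] + GAMMA * V[s2])
                                   for s2, p in T[s][a]))
          for s in STATES}
print(policy)  # {0: 'go', 1: 'stay'}
```

Here V(1) converges to 2 / (1 - γ) = 20, and state 0 prefers "go" (1 + 0.9 · 20 = 19) over "stay".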

MDP (Solving MDP: Reinforcement Learning (Value-based))

• If T(s' | s, a) and R(s, a, s') are unknown, this is a reinforcement learning problem.
• The agent needs to interact with the world and gather experience.
• At each time-step,
  • From state s
  • Take action a (a = π(s) for a deterministic policy)
  • Receive reward r
  • End in state s'
• Value-based: learn an optimal value function from these data

MDP (Solving MDP: Reinforcement Learning (Value-based))

• One way to learn Q(s, a)
  • Use the empirical mean return instead of the expected return
  • Average the sampled returns:

    Q(s, a) = (R_1(s, a) + R_2(s, a) + ... + R_n(s, a)) / n

  • The policy chooses the action that maximizes Q(s, a):

    π(s) = argmax_a Q(s, a)

  • Using V(s) instead requires the model:

    π(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V(s')]
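The empirical-mean estimate of Q(s, a) and the greedy policy can be sketched as follows; the sampled returns and the state/action names are hypothetical:

```python
from collections import defaultdict

# Monte Carlo estimate of Q(s, a): average the sampled returns observed
# after taking action a in state s. The sample data below is made up.
returns = defaultdict(list)  # (s, a) -> list of sampled returns R_i(s, a)

samples = [(("maze_start", "go"), 10.0),    # hypothetical episode returns
           (("maze_start", "go"), 6.0),
           (("maze_start", "stay"), 2.0)]
for key, ret in samples:
    returns[key].append(ret)

# Q(s, a) = (R_1 + ... + R_n) / n
Q = {key: sum(rs) / len(rs) for key, rs in returns.items()}

def policy(state, actions):
    """Greedy policy: pick the action with the highest estimated Q."""
    return max(actions, key=lambda a: Q.get((state, a), float("-inf")))

print(Q[("maze_start", "go")])               # (10.0 + 6.0) / 2 = 8.0
print(policy("maze_start", ["go", "stay"]))  # go
```

Note this needs no model of T or R, which is exactly why it works in the reinforcement learning setting.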

Hidden Markov Models (HMM) and Probabilistic Graphical Models (PGM)
Part 3: Probabilistic Graphical Models

Outline

• Brief Introduction to Probabilistic Graphical Models (PGM)
  • What Is a Graphical Model?
  • Three Key Problems
    • Representation
    • Learning
    • Inference

What Is a Graphical Model?

• A way of representing probabilistic relationships between random variables
• Nodes: random variables
• Edges: statistical dependencies between these variables
  • Undirected edges simply give correlations between variables (Markov Random Field or Undirected Graphical Model)
  • Directed edges give causality relationships (Bayesian Network or Directed Graphical Model)
• Research on Graphical Models: Multivariate Statistics + Data Structures

Three Key Problems

• [Representation] Represent the world as a collection of random variables X_1, ..., X_n with a joint distribution P(X_1, ..., X_n)
• [Learning] Learn the distribution from data
• [Inference] Perform inference (i.e., compute conditional distributions), e.g., P(X_i | X_1 = x_1, ..., X_n = x_n)

Three Key Problems

• [Representation] Represent the world as a collection of random variables X_1, ..., X_n with a joint distribution P(X_1, ..., X_n)
  • Basics of graph theory
  • Families of probability distributions
  • Markov properties and conditional independence
  • Density estimation, classification, and regression

Three Key Problems

• [Learning] Learn the distribution from data
  • Model structure and parameter estimation
  • Complete observations and latent variables

  Structure   Observability   Method
  Known       Full            ML or MAP Estimation
  Known       Partial         EM Algorithm
  Unknown     Full            Model Selection or Model Averaging
  Unknown     Partial         EM + Model Selection or EM + Model Averaging

Three Key Problems

• [Inference] Perform inference (i.e., compute conditional distributions), e.g., P(X_i | X_1 = x_1, ..., X_n = x_n)
  • Inference is the problem of computing P(variables of interest | observed variables)
  • Exact inference vs. approximate inference
    • Exact inference
      • The junction tree and related algorithms, belief propagation
      • Limitation on the model
    • Approximate inference
      • Variational methods, sampling (Monte Carlo methods), maximum entropy approach, loopy BP, Bethe approximation
      • Limitation on the correctness of the solution

Hidden Markov Models (HMM) and Probabilistic Graphical Models (PGM)
Part 4: Support Vector Machine

Outline

• Main Idea
• Hyperplane in n-Dimensional Space
• Brief Introduction to Optimization for Support Vector Machine (SVM)
• SVM for Classification

Main Idea

• How can we classify the given data?

Any of these would be fine. But which is the best?

Main Idea

[Figure: cancer patients and normal patients plotted by Gene X and Gene Y expression, separated by a gap]

Find a linear decision surface (hyperplane) that can separate the patient classes and has the largest distance (i.e., the largest gap, or margin) between the border-line patients (i.e., the support vectors).

Main Idea

[Kernel]

• If a linear decision surface does not exist, the data are mapped into a higher-dimensional space (feature space) where a separating decision surface can be found.
• The feature space is constructed via a mathematical projection (the kernel trick).


Hyperplane in n-Dimensional Space

[Definition (Hyperplane)] A subspace of one dimension less than its ambient space; i.e., a hyperplane in n-dimensional space is an (n-1)-dimensional subspace.

Hyperplane in n-Dimensional Space

• Equations of a Hyperplane
  • An equation of a hyperplane is defined by a point (P_0) and a vector (w) perpendicular to the plane at that point.
  • Define vectors x_0 and x, where P is an arbitrary point on the hyperplane.
  • A condition for P to be on the plane is that the vector x − x_0 is perpendicular to w:

    w · (x − x_0) = 0
    w · x − w · x_0 = 0, and defining b = −w · x_0:
    w · x + b = 0

  • The above equations hold for R^n when n > 3.
  • The distance between two parallel hyperplanes w · x + b_1 = 0 and w · x + b_2 = 0 is D = |b_1 − b_2| / ||w||.

Hyperplane in n-Dimensional Space

• Equations of a Hyperplane (distance derivation)

x_2 = x_1 + t·w, so D = ||t·w|| = |t| ||w||
w · x_2 + b_2 = 0
w · (x_1 + t·w) + b_2 = 0
w · x_1 + t ||w||^2 + b_2 = 0
(w · x_1 + b_1) − b_1 + t ||w||^2 + b_2 = 0
−b_1 + t ||w||^2 + b_2 = 0
t = (b_1 − b_2) / ||w||^2

Therefore, D = |t| ||w|| = |b_1 − b_2| / ||w||.

The distance between two parallel hyperplanes w · x + b_1 = 0 and w · x + b_2 = 0 is D = |b_1 − b_2| / ||w||.
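A quick numeric check of D = |b_1 − b_2| / ||w|| on a hypothetical pair of parallel planes in R^3:

```python
import math

# Distance between the parallel planes w·x + b1 = 0 and w·x + b2 = 0.
w = [1.0, 2.0, 2.0]   # hypothetical normal vector, ||w|| = 3
b1, b2 = 6.0, -3.0

norm_w = math.sqrt(sum(wi * wi for wi in w))
D = abs(b1 - b2) / norm_w
print(D)  # |6 - (-3)| / 3 = 3.0
```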


Professor Joongheon Kim (CSE@CAU)http://sites.google.com/site/joongheon

Artificial Intelligent Lectures for Communication andNetwork Engineers/Researchers (14 April 2017)

Brief Introduction to Optimization for Support Vector Machine

• Now, we understand
  • How to represent data (vectors)
  • How to define a linear decision surface (hyperplane)

• We need to understand
  • How to efficiently compute the hyperplane that separates two classes with the largest gap

Need to understand the basics of relevant optimization theory

Brief Introduction to Optimization for Support Vector Machine

• Convex Functions
  • A function is called convex if, for any two points in its domain, the function lies below the straight line segment connecting those two points.
  • Property: Any local minimum is a global minimum.
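The chord definition can be verified numerically for a concrete convex function. An illustrative sketch; 𝑓 below is an arbitrary convex quadratic chosen for this check:

```python
import numpy as np

# f is an arbitrary convex quadratic picked for illustration; its only
# local minimum (x = 1) is also its global minimum.
f = lambda x: (x - 1.0) ** 2

# Chord test: the function lies below every line segment connecting
# two points on its graph.
rng = np.random.default_rng(0)
for _ in range(1000):
    a, b = rng.uniform(-5.0, 5.0, size=2)
    t = rng.uniform(0.0, 1.0)
    assert f(t * a + (1 - t) * b) <= t * f(a) + (1 - t) * f(b) + 1e-12
```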

Brief Introduction to Optimization for Support Vector Machine

• Quadratic Programming (QP)
  • Quadratic programming (QP) is a special class of optimization problem: the function to optimize (the objective) is quadratic, subject to linear constraints.
  • Convex QP problems have convex objective functions.
  • These problems can be solved easily and efficiently by greedy descent algorithms (because every local minimum is a global minimum).
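Because every local minimum is global, a simple descent method reaches the same answer from any starting point. A minimal sketch with plain gradient descent on a convex quadratic; the target point (1, 2) and step size are arbitrary choices for illustration:

```python
import numpy as np

# Gradient of the convex objective (1/2)||x - c||^2 with c = (1, 2);
# c is an arbitrary target chosen for illustration.
c = np.array([1.0, 2.0])
grad = lambda x: x - c

# Plain gradient descent from two very different starting points:
# both runs converge to the same (global) minimum.
for start in (np.zeros(2), np.array([10.0, -7.0])):
    x = start.copy()
    for _ in range(500):
        x -= 0.1 * grad(x)
    print(np.round(x, 6))  # [1. 2.] both times
```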

Brief Introduction to Optimization for Support Vector Machine

• Quadratic Programming (QP): [Example]

Consider 𝑥 = (𝑥1, 𝑥2).

Minimize (1/2)‖𝑥‖² subject to 𝑥1 + 𝑥2 − 1 ≥ 0
(quadratic objective, linear constraint)

Equivalently: Minimize (1/2)(𝑥1² + 𝑥2²) subject to 𝑥1 + 𝑥2 − 1 ≥ 0
(quadratic objective, linear constraint)
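This toy QP can be solved numerically. A sketch assuming SciPy is available, with its general-purpose SLSQP routine standing in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# minimize (1/2)(x1^2 + x2^2)  subject to  x1 + x2 - 1 >= 0
objective = lambda x: 0.5 * np.dot(x, x)
constraint = {"type": "ineq", "fun": lambda x: x[0] + x[1] - 1.0}

res = minimize(objective, x0=np.zeros(2), constraints=[constraint])
print(np.round(res.x, 4))  # [0.5 0.5]: the constraint is active at the optimum
```

By symmetry the optimum splits the constraint equally, so the minimizer is (0.5, 0.5) with objective value 0.25.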

Outline

• Main Idea

• Hyperplane in 𝑛-Dimensional Space

• Brief Introduction to Optimization for Support Vector Machine (SVM)

• SVM for Classification

SVM for Classification

• SVM for Classification• (Case 1) Linearly Separable Data; Hard-Margin Linear SVM

• (Case 2) Not Linearly Separable Data; Soft-Margin Linear SVM

• (Case 3) Not Linearly Separable Data; Kernel Trick

SVM for Classification

• (Case 1) Linearly Separable Data; Hard-Margin Linear SVM

• Want to find a classifier (hyperplane) to separate negative instances from the positive ones.

• An infinite number of such hyperplanes exist.

• SVMs find the hyperplane that maximizes the gap between the data points on the boundaries (the so-called support vectors).

• If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well.

SVM for Classification

• (Case 1) Linearly Separable Data; Hard-Margin Linear SVM

• The gap is the distance between the two parallel hyperplanes 𝑤 ∙ 𝑥 + 𝑏 = −1 and 𝑤 ∙ 𝑥 + 𝑏 = +1.

• From before, 𝐷 = |𝑏1 − 𝑏2| / ‖𝑤‖; here 𝑏1 = 𝑏 + 1 and 𝑏2 = 𝑏 − 1, so 𝐷 = 2 / ‖𝑤‖.

• Since we want to maximize the gap, we have to minimize ‖𝑤‖, or equivalently, minimize (1/2)‖𝑤‖².

SVM for Classification

• (Case 1) Linearly Separable Data; Hard-Margin Linear SVM

• In addition, we need to impose constraints that all instances are correctly classified. In our case:

𝑤 ∙ 𝑥𝑖 + 𝑏 ≤ −1 if 𝑦𝑖 = −1
𝑤 ∙ 𝑥𝑖 + 𝑏 ≥ +1 if 𝑦𝑖 = +1

or, equivalently, 𝑦𝑖(𝑤 ∙ 𝑥𝑖 + 𝑏) ≥ 1.

In summary:

• Minimize (1/2)‖𝑤‖² subject to 𝑦𝑖(𝑤 ∙ 𝑥𝑖 + 𝑏) ≥ 1, for 𝑖 = 1, ⋯ , 𝑁
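This hard-margin program can be run directly on a small hand-made separable data set. A sketch: the four points are invented, and SciPy's SLSQP routine stands in for a real QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Four invented, linearly separable points (two per class).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Decision variables packed as v = (w1, w2, b); minimize (1/2)||w||^2
# subject to y_i (w . x_i + b) >= 1 for every instance.
objective = lambda v: 0.5 * np.dot(v[:2], v[:2])
constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(np.round(w, 3), round(b, 3))  # w ≈ [1, 0], b ≈ -1: the plane x1 = 1
```

The recovered hyperplane 𝑥1 = 1 sits halfway between the two classes, and the margin width 2/‖𝑤‖ = 2 equals the distance between them.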

SVM for Classification

• (Case 2) Not Linearly Separable Data; Soft-Margin Linear SVM

• What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.

Approach:
• Assign a slack variable 𝜉𝑖 ≥ 0 to each instance, which can be thought of as the distance from the separating hyperplane if the instance is misclassified, and 0 otherwise.

• Minimize (1/2)‖𝑤‖² + 𝐶 ∑𝑖 𝜉𝑖 subject to 𝑦𝑖(𝑤 ∙ 𝑥𝑖 + 𝑏) ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0, for 𝑖 = 1, ⋯ , 𝑁
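The soft-margin program can be written out the same way with the slacks as extra variables. A sketch on invented data (the separable set from before plus one mislabeled outlier), again using SciPy's SLSQP as a stand-in QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Invented data: four separable points plus one mislabeled outlier.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0], [1.8, 0.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0, -1.0])   # last point is the outlier
N, C = len(y), 1.0

# Variables packed as v = (w1, w2, b, xi_1, ..., xi_N);
# minimize (1/2)||w||^2 + C * sum(xi) subject to the slackened constraints.
objective = lambda v: 0.5 * np.dot(v[:2], v[:2]) + C * v[3:].sum()
cons = [{"type": "ineq",
         "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0 + v[3 + i]}
        for i in range(N)]
bounds = [(None, None)] * 3 + [(0.0, None)] * N   # slacks xi_i >= 0

res = minimize(objective, x0=np.zeros(3 + N), constraints=cons, bounds=bounds)
w, b, xi = res.x[:2], res.x[2], res.x[3:]
print(np.round(xi, 2))  # only the outlier needs a sizeable slack
```

The four clean points keep zero slack, while the outlier absorbs the violation through its 𝜉𝑖; raising 𝐶 penalizes that slack more heavily.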

SVM for Classification

• (Case 3) Not Linearly Separable Data; Kernel Trick

• Data that is not linearly separable in the input space can become linearly separable in the feature space obtained by a kernel.
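A minimal illustration of the idea, using an explicit feature map rather than an implicit kernel (the 1-D data below is made up): points that no single threshold on 𝑥 can separate become linearly separable after mapping 𝑥 ↦ (𝑥, 𝑥²):

```python
import numpy as np

# Invented 1-D data: no single threshold on x separates the classes.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Explicit feature map phi(x) = (x, x^2): the classes now split along the
# x^2 axis, so a hyperplane in feature space separates them.
phi = np.stack([x, x ** 2], axis=1)
w, b = np.array([0.0, 1.0]), -2.0    # hyperplane x^2 - 2 = 0 in feature space

pred = np.sign(phi @ w + b)
print(bool((pred == y).all()))  # True
```

Kernel methods achieve the same effect without ever forming 𝜑(𝑥) explicitly, by replacing dot products with kernel evaluations.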

Questions?
