  • TODAY’S LECTURE

    Data collection → Data processing → Exploratory analysis & Data viz → Analysis, hypothesis testing, & ML → Insight & Policy Decision

  • FEATURE SCALING

    Training data:

    Price (in 1000$)   Area (sq. ft.)   # Bathrooms   # Bedrooms   base
    220                1600             2.5           3            1
    180                1400             1.5           3            1
    350                2100             3.5           4            1
    …                  …                …             …            …
    500                2400             4             5            1

  • NORMALIZATION - ZERO MEAN, UNIT STANDARD DEVIATION

    $x_{ij} := \frac{x_{ij} - \mu_j}{\sigma_j}$, for $j$: Area, Bathrooms, Bedrooms

    (applied to the same training table as above)

  • MAX-MIN

    $x_{ij} := \frac{x_{ij} - x_{\min,j}}{x_{\max,j} - x_{\min,j}}$, for $j$: Area, Bathrooms, Bedrooms

    (applied to the same training table as above)
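
    A minimal NumPy sketch of both scalings on the feature columns above (only the four fully visible rows are used; treating them as the whole dataset is an assumption for illustration):

```python
import numpy as np

# Feature columns from the slide: Area, # Bathrooms, # Bedrooms
# (the "…" rows are omitted).
X = np.array([
    [1600, 2.5, 3],
    [1400, 1.5, 3],
    [2100, 3.5, 4],
    [2400, 4.0, 5],
], dtype=float)

# Zero mean, unit standard deviation: x_ij = (x_ij - mu_j) / sigma_j
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Max-min scaling: x_ij = (x_ij - x_min_j) / (x_max_j - x_min_j)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)     # each column now has mean 0 and std 1
print(X_minmax)  # each column now lies in [0, 1]
```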

  • ALL OBSERVATIONS MODEL

    ■ Matrix notation for all observations: $Y = X\theta$

    $\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ 1 & x_{31} & x_{32} & \cdots & x_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$
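
    As a quick illustration, a NumPy sketch that builds the design matrix X with the leading column of ones and computes Y = Xθ in one step (the data values here are hypothetical):

```python
import numpy as np

# Hypothetical raw features: m = 4 observations, n = 2 features each.
features = np.array([
    [1600, 3],
    [1400, 3],
    [2100, 4],
    [2400, 5],
], dtype=float)

m = features.shape[0]

# Design matrix: prepend the column of ones for the intercept theta_0.
X = np.hstack([np.ones((m, 1)), features])

theta = np.array([0.5, 0.1, 2.0])  # [theta_0, theta_1, theta_2], arbitrary values

# All predictions at once: Y = X @ theta
Y = X @ theta
print(Y)
```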

  • GRADIENT DESCENT

    Algorithm for any* hypothesis function h, loss function f, and step size α:

    • Initialize the parameter vector θ
    • Repeat until satisfied (e.g., exact or approximate convergence):
      • Compute the gradient
      • Update the parameters

    *must be reasonably well behaved
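
    A minimal, generic sketch of this loop; the quadratic loss and its gradient below are stand-ins, and any reasonably well-behaved f and grad_f can be plugged in:

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Generic gradient descent: repeat theta := theta - alpha * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_f(theta)              # compute gradient
        theta = theta - alpha * g      # update parameters
        if np.linalg.norm(g) < tol:    # approximate convergence test
            break
    return theta

# Example: minimize f(theta) = ||theta - c||^2, whose gradient is 2*(theta - c).
c = np.array([3.0, -1.0])
print(gradient_descent(lambda th: 2 * (th - c), theta0=np.zeros(2)))  # ≈ [3, -1]
```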

  • GRADIENT DESCENT - MULTIVARIATE

    Initialize θ = 0.

    Repeat {
        $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i$
    } (update θ_j for all j = 0…n simultaneously)

    Note that $\frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i = \frac{\partial}{\partial \theta_j} f(\theta)$.

    Written out per parameter (with $x_0^i = 1$):

    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_0^i$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_1^i$
    $\theta_2 := \theta_2 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_2^i$
    …
    $\theta_n := \theta_n - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_n^i$
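
    A sketch of these simultaneous updates for linear regression ($h_\theta(x) = \theta^T x$), vectorized over all m observations; the data here is randomly generated for illustration:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Linear-regression gradient descent; X already has the leading ones column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)  # initialize theta = 0
    for _ in range(iters):
        residuals = X @ theta - y      # h_theta(x^i) - y^i for all i
        grad = (X.T @ residuals) / m   # (1/m) * sum_i (...) * x_j^i, all j at once
        theta = theta - alpha * grad   # simultaneous update of every theta_j
    return theta

# Illustration on synthetic data with known parameters.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=100)
print(batch_gradient_descent(X, y))  # ≈ [1, 2, -3]
```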

  • GRADIENT DESCENT - MULTIVARIATE

    Initialize θ = 0.

    Repeat {
        $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i$
    } (update θ_j for all j = 0…n simultaneously)

    $\frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i = \frac{\partial}{\partial \theta_j} f(\theta)$

  • STOCHASTIC GRADIENT DESCENT - MULTIVARIATE

    Batch gradient descent (initialize θ = 0):

    Repeat {
        $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i$
    } (update θ_j for all j = 0…n simultaneously)

    Recall $\frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i = \frac{\partial}{\partial \theta_j} f(\theta)$.

    Stochastic gradient descent (initialize θ = 0):

    Repeat {
        i = random index between 1 and m
        $\theta_j := \theta_j - \alpha \left(h_\theta(x^i) - y^i\right) x_j^i$
    } (update θ_j for all j = 0…n)
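
    The same problem with the stochastic update, one randomly chosen observation per step (a sketch; the constant step size is a simplification):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, steps=20_000, seed=0):
    """SGD for linear regression: each step uses one random observation."""
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)  # initialize theta = 0
    for _ in range(steps):
        i = rng.integers(m)                      # random index (0-based here)
        residual = X[i] @ theta - y[i]           # h_theta(x^i) - y^i
        theta = theta - alpha * residual * X[i]  # update every theta_j at once
    return theta
```

    Each step is m times cheaper than a batch step, at the cost of a noisier path to the minimum; in practice α is often decayed over time so the iterates settle.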

  • GRADIENT DESCENT

    Worked example with θ0 = 0.1, θ1 = 0.1, and $\hat{y} = \theta_0 + \theta_1 x$:

    x      y       ŷ = θ0 + θ1x   ½(ŷ − y)²   ∂(SSE)/∂θ0 = ŷ − y   ∂(SSE)/∂θ1 = (ŷ − y)x
    0.2    0.44    0.12           0.0512      −0.32                −0.064
    0.31   0.123   0.131          0.000032    0.008                0.00248
    0.45   0.75    0.145          0.183       −0.605               −0.27225
    0.75   0.39    0.175          0.0231      −0.215               −0.16125
    Sums:                         SSE ≈ 0.26   −1.132              −0.495
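
    A short check of this table in NumPy (x, y, and θ values taken from the slide):

```python
import numpy as np

theta0, theta1 = 0.1, 0.1
x = np.array([0.20, 0.31, 0.45, 0.75])
y = np.array([0.44, 0.123, 0.75, 0.39])

y_hat = theta0 + theta1 * x          # column: y_hat = theta0 + theta1 * x
sse_terms = 0.5 * (y_hat - y) ** 2   # column: (1/2)(y_hat - y)^2
d_theta0 = y_hat - y                 # column: d(SSE)/d(theta0) terms
d_theta1 = (y_hat - y) * x           # column: d(SSE)/d(theta1) terms

print(sse_terms.sum())  # ≈ 0.26   (SSE)
print(d_theta0.sum())   # ≈ -1.132
print(d_theta1.sum())   # ≈ -0.495
```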

  • GRADIENT DESCENT

    Same setup (θ0 = 0.1, θ1 = 0.1, $\hat{y} = \theta_0 + \theta_1 x$), showing a single row:

    x      y       ŷ = θ0 + θ1x   ½(ŷ − y)²   ∂(SSE)/∂θ0 = ŷ − y   ∂(SSE)/∂θ1 = (ŷ − y)x
    0.31   0.123   0.131          0.000032    0.008                0.00248
    …      …       …              …           …                    …

  • STOCHASTIC GRADIENT DESCENT

  • STOCHASTIC GRADIENT DESCENT - MINI BATCH

    Initialize θ = 0.

    Repeat {
        $i_1, \ldots, i_l$ = random indices between 1 and m
        $\theta_j := \theta_j - \alpha \frac{1}{l} \sum_{k=1}^{l} \left(h_\theta(x^{i_k}) - y^{i_k}\right) x_j^{i_k}$
    } (update θ_j for all j = 0…n)
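
    A mini-batch sketch along the same lines (the batch size l is a hyperparameter; sampling without replacement per step is one common choice, assumed here):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.05, l=32, steps=5_000, seed=0):
    """Mini-batch SGD for linear regression: each step averages l random rows."""
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)  # initialize theta = 0
    for _ in range(steps):
        idx = rng.choice(m, size=min(l, m), replace=False)  # i_1, ..., i_l
        residuals = X[idx] @ theta - y[idx]
        grad = (X[idx].T @ residuals) / len(idx)  # (1/l) * sum over the batch
        theta = theta - alpha * grad
    return theta
```

    The batch size l trades gradient noise against per-step cost: l = 1 recovers SGD, l = m recovers batch gradient descent.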

  • CLASSIFICATION PROBLEM

  • GENERAL CLASSIFICATION PROBLEM

    • Can we predict a categorical response/output Y from a set of predictors?

    • For example, an individual’s choice of transportation:
      • Predictors: income, cost, and time
      • Response: car, bike, bus, or train

    • From this classification model (with parameters θ1, θ2, …, θn), an inference task: how do people value price and time when considering a transportation choice?

  • WHY NOT LINEAR REGRESSION

    • For categorical responses with more than two values, where order and scale don’t make sense, it is not a regression problem:

      $Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}$

    • For binary responses, it is a little better:

      $Y = \begin{cases} 0 & \text{if stroke} \\ 1 & \text{if drug overdose} \end{cases}$

  • BINARY RESPONSES

    • We could use linear regression and interpret the response Y as a probability.

    • For example, if $\hat{y} > 0.5$, predict drug overdose.

  • CLASSIFICATION AS PROBABILITY ESTIMATION

    • Instead of modeling the classes 0 or 1 directly, model the conditional class probability $p(Y = 1 \mid X = x)$, and classify based on this probability.

    • Use of discriminant functions (think of scoring).

    • One way to do this is logistic regression.

  • LOGISTIC REGRESSION

    • Basic idea: build a linear model related to p(x).

    • Linear regression directly (i.e. $p(x) = \theta_0 + \theta_1 x$) doesn’t work. Why? (A probability must lie between 0 and 1, but a line is unbounded.)

    • Instead, build a linear model of the log-odds:

      $\log \frac{p(x)}{1 - p(x)} = \theta_0 + \theta_1 x$

  • LOGISTIC REGRESSION - ODDS

    • Odds are equivalent to ratios of probabilities: $\text{odds} = \frac{p(x)}{1 - p(x)}$

    • For example, “two to one odds that the US wins the Women’s soccer World Cup” means the probability that the US wins is double the probability that they lose.

    • So, if odds = 2, probability p(x) = ??

    • If odds = 1/2, p(x) = ??
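
    For reference, inverting the definition of odds gives back the probability (a one-line rearrangement that the slide leaves implicit):

    $\text{odds} = \frac{p(x)}{1 - p(x)} \implies p(x) = \frac{\text{odds}}{1 + \text{odds}}$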

  • LOGISTIC REGRESSION

    • Suppose an individual has a 16% chance of defaulting on their credit card payment. What are the odds that (s)he will default?

    • On average, what fraction of people with odds of 0.37 of defaulting on their credit card payment will in fact default?

  • LOGISTIC REGRESSION - PREDICTIONS

    $\log \frac{p(x)}{1 - p(x)} = \theta_0 + \theta_1 x \;\Longrightarrow\; p(x) = \frac{1}{1 + e^{-\theta^T x}}$

    This is the sigmoid function, given by

    $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

    $y = h_\theta(x)$, with $z = \theta^T x$

  • ESTIMATION OF PARAMETERS

    • Bernoulli probability (think of flipping a coin weighted by p(x)).

    • Estimate the parameters to maximize the likelihood of the observed training data under this binomial model.

    • Equivalently, minimize the negative of the log likelihood of the model.

    • Optimization problem:

      $\min_{\theta_0, \theta_1} \; -\sum_{i=1}^{m} \left[ y_i f(x_i) - \log\left(1 + e^{f(x_i)}\right) \right]$, where $f(x_i) = \theta_0 + \theta_1 x_i$

  • LOGISTIC REGRESSION

    • Hypothesis function / classifier: $0 \le h_\theta(x) \le 1$

      $h_\theta(x) = \frac{1}{1 + e^{-z}}$ (sigmoid function / logistic function)

      $y = h_\theta(x)$, $z = \theta^T x$

  • LOGISTIC REGRESSION

    • Hypothesis function / classifier: $0 \le h_\theta(x) \le 1$

      $h_\theta(x) = \frac{1}{1 + e^{-z}} = p(y = 1 \mid x, \theta)$, with $z = \theta^T x$

      Predict y = 1 if $h_\theta(x) \ge 0.5$; y = 0 if $h_\theta(x) < 0.5$.

  • ESTIMATION OF PARAMETERS

    Training set: $\{(x^1, y^1), (x^2, y^2), \ldots, (x^m, y^m)\}$

    $x = \begin{bmatrix} x_0 & x_1 & \cdots & x_n \end{bmatrix}^T$, with $x_0 = 1$ and $y \in \{0, 1\}$

    $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

    How should we choose the parameters θ?

  • LOSS FUNCTION

    Linear regression: $f(\theta) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}\left(h_\theta(x^i) - y^i\right)^2$

    Logistic regression, with $x = \begin{bmatrix} x_0 & x_1 & \cdots & x_n \end{bmatrix}^T$, $x_0 = 1$, $y \in \{0, 1\}$, and $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$; the parameters θ solve

    $\min_{\theta_0, \theta_1, \ldots, \theta_n} \frac{1}{m} \sum_{i=1}^{m} \left[-y^i \log\left(h_\theta(x^i)\right) - (1 - y^i) \log\left(1 - h_\theta(x^i)\right)\right]$
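
    A sketch of this cross-entropy loss in NumPy (the epsilon clip is a numerical-stability detail, not part of the slide’s formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y, eps=1e-12):
    """f(theta) = (1/m) * sum[-y*log(h) - (1-y)*log(1-h)], h = sigmoid(X @ theta)."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # avoid log(0)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))
```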

  • LOSS FUNCTION

    With $x = \begin{bmatrix} x_0 & x_1 & \cdots & x_n \end{bmatrix}^T$, $x_0 = 1$, $y \in \{0, 1\}$:

    $f(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^i \log\left(h_\theta(x^i)\right) + (1 - y^i) \log\left(1 - h_\theta(x^i)\right)\right]$

    $\min_{\theta_0, \theta_1, \ldots, \theta_n} \frac{1}{m} \sum_{i=1}^{m} \left[-y^i \log\left(h_\theta(x^i)\right) - (1 - y^i) \log\left(1 - h_\theta(x^i)\right)\right]$

    To predict for a new observation x, output $\frac{1}{1 + e^{-\theta^T x}} = p(y = 1 \mid x, \theta)$.

  • GRADIENT DESCENT

    Want $\min_{\theta_0, \theta_1, \ldots, \theta_n} f(\theta)$, where

    $f(\theta) = -\frac{1}{m} \left[\sum_{i=1}^{m} y^i \log\left(h_\theta(x^i)\right) + (1 - y^i) \log\left(1 - h_\theta(x^i)\right)\right]$

    Repeat {
        $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} f(\theta)$
    } (update all θ_j simultaneously)

    $\frac{\partial}{\partial \theta_j} f(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^i) - y^i\right) x_j^i$, where $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ and $\theta = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix}^T$
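
    Putting the pieces together, a sketch of logistic-regression training with exactly these updates (synthetic data for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.5, iters=5_000):
    """Gradient descent for logistic regression; X has a leading ones column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iters):
        h = sigmoid(X @ theta)        # h_theta(x^i) for all i
        grad = (X.T @ (h - y)) / m    # (1/m) * sum_i (h - y) x_j^i for all j
        theta = theta - alpha * grad  # simultaneous update
    return theta

# Synthetic binary data for illustration.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)
print(logistic_gd(X, y))  # roughly recovers true_theta
```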

  • PREDICTION

  • PREDICTION

    • For Iris data, versicolor classification:
      intercept: -4.220
      petal_width coefficient: 2.617

    • Probability that a flower with a given petal width is a versicolor:

      $\hat{p}(1.7) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x)}} = \frac{1}{1 + e^{-(-4.220 + 2.617 \times 1.7)}} \approx 0.55$

      $\hat{p}(2.5) = \frac{1}{1 + e^{-(-4.220 + 2.617 \times 2.5)}} \approx 0.91$
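
    These numbers can be reproduced directly (coefficients taken from the slide; fitting them yourself from scikit-learn’s Iris data would be the fuller exercise):

```python
import numpy as np

theta0, theta1 = -4.220, 2.617  # intercept and petal_width coefficient from the slide

def p_versicolor(petal_width):
    """Estimated probability that a flower with this petal width is versicolor."""
    return 1.0 / (1.0 + np.exp(-(theta0 + theta1 * petal_width)))

print(p_versicolor(1.7))  # ≈ 0.557 (slide shows 0.55)
print(p_versicolor(2.5))  # ≈ 0.911 (slide shows 0.91)
```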

  • MULTIPLE LOGISTIC REGRESSION

    $\log \frac{p(x)}{1 - p(x)} = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n$

  • EXERCISE

    • Suppose we collect data for a group of students in a statistics class, with variables x1 = hours studied, x2 = undergrad GPA, and y = receive an A. We fit a logistic regression and find the following coefficients:

      $\theta_0 = -6, \quad \theta_1 = 0.05, \quad \theta_2 = 1$

      Estimate the probability that a student who studies 40 hours and has an undergraduate GPA of 3.5 gets an A in the class.

    • With the estimated parameters from the previous question, and a GPA of 3.5 as before, how many hours would the student need to study to have a 50% chance of getting an A in the class?

