Chapter ML:I

I. Introduction
- Examples of Learning Tasks
- Specification of Learning Tasks
- Elements of Machine Learning
- Comparative Syntax Overview
- Functions Overview
- Algorithms Overview
- Classification Approaches Overview

Elements of Machine Learning (1) Model Formation: Real World → Model World

[Figure: model formation. Objects o ∈ O are assigned to classes in C via γ(o) and mapped to feature vectors in X via α(o); the learned model function y approximates the target concept on the feature vectors, c(x) ≈ y(x).]
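
A minimal sketch of these mappings in code, assuming a made-up email-classification scenario with hypothetical attribute names: α maps a real-world object to its feature vector, and γ supplies the class label attached to that vector.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EmailObject:            # a real-world object o ∈ O (hypothetical example)
    num_links: int
    num_spam_words: int
    length: int

def alpha(o: EmailObject) -> List[float]:
    """Map an object o to its feature vector x in the feature space X."""
    return [float(o.num_links), float(o.num_spam_words), float(o.length)]

def gamma(o: EmailObject) -> int:
    """Ideal classifier: the true class of o in C = {-1, +1}."""
    # In practice γ is unknown; labels are obtained, e.g., by human annotation.
    return 1 if o.num_spam_words > 5 else -1

o = EmailObject(num_links=3, num_spam_words=7, length=120)
x = alpha(o)                  # feature vector used when learning c(x) ≈ y(x)
label = gamma(o)              # class label paired with x in the training data
print(x, label)
```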

Related questions:

- From what kind of experience should be learned?
- Which level of fidelity is sufficient to solve a certain task?

Elements of Machine Learning (2) Design Choices for Model Function Construction: LMS in a Nutshell

Design choice → LMS instantiation:

- Task: binary classification
- Data: D = {(x_1, c(x_1)), ..., (x_n, c(x_n))} ⊆ X × {−1, 1}
- Model function: linear model, y(x) = w_0 + ∑_{i=1}^{p} w_i x_i
- Hypothesis space: w ∈ R^{p+1}
- Optimization objective: minimize the squared loss; loss: sum of squared residuals; regularization: none
- Optimization approach: stochastic gradient descent
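
A minimal code sketch of this setup, assuming made-up data and illustrative values for the learning rate and the number of epochs: a linear model with squared loss and no regularization, fitted by stochastic gradient descent on examples labeled −1 or +1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: n examples with p features, labels c(x) in {-1, +1}.
n, p = 200, 2
X = rng.normal(size=(n, p))
c = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

w = np.zeros(p + 1)                     # hypothesis space: w in R^(p+1)
eta = 0.01                              # learning rate (illustrative)

def y(x, w):
    """Linear model y(x) = w0 + sum_i w_i * x_i."""
    return w[0] + w[1:] @ x

for epoch in range(50):                 # SGD: one update per training example
    for i in rng.permutation(n):
        residual = c[i] - y(X[i], w)    # drives the squared-loss gradient
        w[0]  += eta * residual         # step for the bias w0 (factor 2 absorbed in eta)
        w[1:] += eta * residual * X[i]  # step for the weights w1..wp

pred = np.sign(w[0] + X @ w[1:])
print("training accuracy:", np.mean(pred == c))
```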

Related questions:

- What are useful classes of model functions?
- What are methods to fit (= learn) model functions?
- What are measures to assess the goodness of fit?
- How does (label) noise affect the learning process?
- How does the number of examples affect the learning process?
- How to deal with extreme class imbalance?

Elements of Machine Learning (3) Feature Space Structure

The feature space is an inner product space.

- An inner product space (also called pre-Hilbert space) is a vector space with an additional structure called "inner product".
- Example: a Euclidean vector space equipped with the dot product.
- Enables algorithms such as gradient descent and support vector machines.

The feature space is a σ-algebra.

- A σ-algebra on a set X is a collection of subsets of X that includes X itself, is closed under complement, and is closed under countable unions.
- Enables probability spaces and statistical learning, such as naive Bayes.

The feature space is a finite set of vectors with nominal dimensions.

- Requires concept learning via set splitting, as done by decision trees.
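
A small sketch of the first case, with illustrative vectors: the dot product serves as the inner product of a Euclidean feature space; it yields lengths and angles and appears, for instance, in linear decision functions of the kind used by support vector machines.

```python
import numpy as np

def inner(u, v):
    """Standard inner product (dot product) on R^p."""
    return float(np.dot(u, v))

x = np.array([1.0, 2.0, -0.5])        # a feature vector (illustrative values)
w = np.array([0.3, -1.0, 2.0])        # a weight vector (illustrative values)
b = 0.1

norm_x = np.sqrt(inner(x, x))         # the inner product induces a norm (length)
decision = np.sign(inner(w, x) + b)   # linear decision function, SVM-style
print(norm_x, decision)
```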

Remarks:

- The aforementioned examples of feature spaces are not meant to be complete. However, they illustrate a broad range of structures underlying the example sets we want to learn from.

- The structure of a feature space constrains the applicable learning algorithms. Usually, this structure is inherently determined by the application domain and cannot be chosen.

Elements of Machine Learning (4) Discriminative vs. Generative Approach to Classification

- Discriminative classifiers (models) learn a boundary between classes.
- Generative classifiers exploit the distributions underlying the classes.

[Figure: two scatter plots of labeled examples (− and +) over the features x1 and x2, together with unlabeled query points (?). One panel illustrates the discriminative view (classification rule / decision boundary), the other the generative view (class membership probability).]
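
A minimal sketch of the two approaches, with made-up two-dimensional data, using scikit-learn's LogisticRegression as a discriminative model and GaussianNB as a generative model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)

# Made-up 2D data: two Gaussian clusters for the classes -1 and +1.
X_neg = rng.normal(loc=[-1.5, -1.0], scale=0.8, size=(50, 2))
X_pos = rng.normal(loc=[+1.5, +1.0], scale=0.8, size=(50, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([-1] * 50 + [+1] * 50)

# Discriminative: models P(Y | X) directly, i.e., learns a class boundary.
disc = LogisticRegression().fit(X, y)

# Generative: models class-conditional densities P(X | Y) and priors P(Y),
# then classifies via Bayes' rule.
gen = GaussianNB().fit(X, y)

query = np.array([[0.2, 0.1]])          # an unlabeled query point ("?")
print("discriminative P(Y | x):", disc.predict_proba(query))
print("generative     P(Y | x):", gen.predict_proba(query))
```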

Remarks:

- When classifying a new example, (1) discriminative classifiers apply a decision rule that was learned via minimizing the misclassification rate given training examples D, while (2) generative classifiers maximize the probability of the combined event P(X=x, Y=y) or, similarly, the a-posteriori probability P(Y=y | X=x), y ∈ {⊖, ⊕}.

- The LMS algorithm computes "only" a decision boundary, i.e., it constructs a discriminative classifier. A Bayes classifier is an example of a generative model.

- Yoav Freund provides an excellent video illustrating the pros and cons of discriminative and generative models respectively. [YouTube]

- Discriminative models may be further differentiated into models that also determine the posterior class probabilities P(Y=y | X=x) (without computing the joint probabilities P(X=x, Y=y)) and those that do not. In the latter case, only a so-called "discriminant function" is computed.
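
A small sketch of this last distinction, with made-up data and scikit-learn models: LogisticRegression also yields posterior class probabilities, whereas LinearSVC computes only a discriminant function, i.e., a signed score without a probabilistic interpretation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # made-up labels

# Discriminative model that provides posterior probabilities P(Y=y | X=x):
logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(X[:1]))          # probabilities for both classes

# Discriminative model that provides only a discriminant function:
svm = LinearSVC().fit(X, y)
print(svm.decision_function(X[:1]))         # a signed score, no probability
```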

Elements of Machine Learning (5) Frequentist vs. Subjectivist Approach to Parameter Estimation

Frequentism:

- There is a (hidden) mechanism that generates D.

- To model this mechanism you consider
  – a family of distributions, or
  – a model function, or
  – a combination of both,
  parameterized by θ. The possible values for θ form the hypothesis space H.

- Select a most probable hypothesis h_ML ∈ H by estimating θ using a sample D′ ⊂ D. h_ML is called maximum likelihood hypothesis.

Estimating θ from D′ yields:   h_ML = argmax_{h ∈ H} P(D′ | h)
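
A minimal sketch of maximum likelihood estimation for the coin-flip setting discussed in the remarks below, with a made-up sample D′ and a grid of candidate parameter values as the hypothesis space H.

```python
import numpy as np
from scipy.stats import binom

# Made-up sample D': 10 coin flips, 7 of which came up heads.
n_flips, n_heads = 10, 7

# Hypothesis space H: candidate values for the parameter theta = p.
H = np.linspace(0.01, 0.99, 99)

# Likelihood P(D' | h) under the binomial model B(n, p).
likelihood = binom.pmf(n_heads, n_flips, H)

h_ml = H[np.argmax(likelihood)]
print("maximum likelihood hypothesis:", round(h_ml, 2))   # 0.7 = 7/10
```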

Remarks:

- θ is a parameter or a parameter vector that is considered as fixed (in particular: not as a random variable), but unknown.

- In the experiment of flipping a coin, one may suppose a Laplace experiment and consider the binomial distribution, B(n, p).

- P(D′ | h) is the probability of observing D′ under h. I.e., it is the probability of observing D′ if the hidden mechanism that generates D′ behaves according to the considered model whose parameter θ is set to h.

Elements of Machine Learning (5) Frequentist vs. Subjectivist Approach to Parameter Estimation (continued)

Subjectivism:

- Consider a model for the mechanism that has generated D.

- There are different beliefs about the parameter (vector) θ that characterizes the model. The possible values for θ form the hypothesis space H.

- Select a most probable hypothesis h_MAP ∈ H by weighting the ML estimates under D with the priors. h_MAP is called maximum a-posteriori hypothesis.

Belief/Prior 1: P(θ1) = 0.95 with θ1: p = 0.5        Belief/Prior 2: P(θ2) = 0.50 with θ2: p = 0.75

θ1 + D → P(D | θ1)
θ2 + D → P(D | θ2)

h_MAP = argmax_{h ∈ {θ1, θ2}} P(h | D) = argmax_{h ∈ {θ1, θ2}} P(D | h) · P(h) / P(D)
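
A minimal sketch of this MAP selection, with made-up trial data; the priors are the two beliefs from above, and the unnormalized scores P(D | h) · P(h) suffice because P(D) is the same for both hypotheses.

```python
from math import comb

# Made-up data D: 10 coin flips, 7 of which came up heads.
n, k = 10, 7

# Hypotheses (success probability p) with their prior beliefs from the slide.
hypotheses = {"theta1": (0.50, 0.95),   # theta1: p = 0.5,  prior 0.95
              "theta2": (0.75, 0.50)}   # theta2: p = 0.75, prior 0.50

def likelihood(p):
    """P(D | h) under a binomial model B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Unnormalized posteriors P(D | h) * P(h); the evidence P(D) cancels in the argmax.
scores = {name: likelihood(p) * prior for name, (p, prior) in hypotheses.items()}
h_map = max(scores, key=scores.get)
print(scores, "->", h_map)
```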

Remarks:

- θ is considered as a random variable. There is prior knowledge about the distribution of θ.

- p is a parameter of the binomial distribution and denotes the success probability for each trial.
  – Belief 1: With a probability of 0.95 the coin is fair (both sides are equally likely).
  – Belief 2: With a probability of 0.5 the odds of preferring a particular side are 3:1.

  Given D from a number of trials, compute P(D | θ1) and P(D | θ2) and the respective values for P(θ1 | D) and P(θ2 | D).

  Disclaimer. While only mild conditions are required for MAP estimation to be a limiting case of Bayes estimation, it is not very representative of Bayesian methods in general. This is because MAP estimates are point estimates, whereas Bayesian methods are characterized by the use of distributions to summarize data and draw inferences. [Wikipedia]

- The subjectivist approach is also called the Bayesian interpretation of probability. The Bayesian interpretation of probability enables by design the integration of prior knowledge, background knowledge, and human expertise. [Wikipedia: probability interpretations, Bayesian interpretation]

- Food for thought: Discuss the use of frequentist and subjectivist approaches to decision making if you had to develop an AI that plays poker.
