Page 1: The Kernel Trick, Gram Matrices, and Feature Extraction

The Kernel Trick, Gram Matrices, and Feature Extraction

CS6787 Lecture 4 — Fall 2021

Page 2: The Kernel Trick, Gram Matrices, and Feature Extraction

Basic Linear Models

• For two-class classification using model vector w

• What is the compute cost of making a prediction in a d-dimensional linear model, given an example x?

• Answer: d multiplies and d adds, to compute the dot product.

$$\text{output} = \operatorname{sign}(w^T x)$$

Page 3: The Kernel Trick, Gram Matrices, and Feature Extraction

Optimizing Basic Linear Models

• For classification using model vector w

• Optimization methods for this task vary; here’s logistic regression

$$\text{output} = \operatorname{sign}(w^T x)$$

$$\text{minimize}_w \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-w^T x_i y_i)\right) \qquad (y_i \in \{-1, 1\})$$

Page 4: The Kernel Trick, Gram Matrices, and Feature Extraction

SGD on Logistic Regression

• Gradient of a training example is

• So SGD update step is

$$\nabla f_i(w) = \frac{-x_i y_i}{1 + \exp(w^T x_i y_i)}$$

$$w_{t+1} = w_t + \alpha_t \frac{x_i y_i}{1 + \exp(w_t^T x_i y_i)}$$

Page 5: The Kernel Trick, Gram Matrices, and Feature Extraction

What is the compute cost of an SGD update?

• For logistic regression on a d-dimensional model

• Answer: 2d multiplies and 2d adds, plus O(1) extra ops
  • d multiplies and d adds to compute the dot product
  • d multiplies and d adds to do the AXPY operation
  • O(1) additional ops for computing the exp, divide, etc.

$$w_{t+1} = w_t + \alpha_t \frac{x_i y_i}{1 + \exp(w_t^T x_i y_i)}$$

Page 6: The Kernel Trick, Gram Matrices, and Feature Extraction

Benefits of Linear Models

• Fast classification: just one dot product

• Fast training/learning: just a few basic linear algebra operations

• Drawback: limited expressivity
  • Can only capture linear classification boundaries → bad for many problems

• How do we let linear models represent a broader class of decision boundaries, while retaining the systems benefits?

Page 7: The Kernel Trick, Gram Matrices, and Feature Extraction

Review: The Kernel Method

• Idea: in a linear model we can think about the similarity between two training examples x and y as being

• This is related to the rate at which a random classifier will separate x and y

• Kernel methods replace this dot-product similarity with an arbitrary Kernel function that computes the similarity between x and y

$$x^T y$$

$$K(x, y) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$$

Page 8: The Kernel Trick, Gram Matrices, and Feature Extraction

Kernel Properties

• What properties do kernels need to have to be useful for learning?

• Key property: kernel must be symmetric

• Key property: kernel must be positive semi-definite

• Can check that the dot product has this property

$$K(x, y) = K(y, x)$$

$$\forall c_i \in \mathbb{R},\; x_i \in \mathcal{X}, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \geq 0$$
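As a quick numerical illustration (not from the slides), the positive semi-definiteness of the dot-product kernel can be checked by confirming that the matrix of pairwise kernel values has no negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))       # 20 arbitrary points in R^5
G = X @ X.T                            # pairwise dot-product kernel values
eigvals = np.linalg.eigvalsh(G)        # eigenvalues of a symmetric matrix
print(eigvals.min() >= -1e-10)         # True: all eigenvalues are (numerically) nonnegative
```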

Page 9: The Kernel Trick, Gram Matrices, and Feature Extraction

Facts about Positive Semidefinite Kernels

• Sum of two PSD kernels is a PSD kernel

• Product of two PSD kernels is a PSD kernel

• Scaling by any function on both sides is a kernel

$$K(x, y) = K_1(x, y) + K_2(x, y) \text{ is a PSD kernel}$$

$$K(x, y) = K_1(x, y)\, K_2(x, y) \text{ is a PSD kernel}$$

$$K(x, y) = f(x)\, K_1(x, y)\, f(y) \text{ is a PSD kernel}$$

Page 10: The Kernel Trick, Gram Matrices, and Feature Extraction

Other Kernel Properties

• Useful property: kernels are often non-negative

• Useful property: kernels are often scaled such that

• These properties capture the idea that the kernel is expressing the similarity between x and y

$$K(x, y) \geq 0$$

$$K(x, y) \leq 1, \quad \text{and} \quad K(x, y) = 1 \Leftrightarrow x = y$$

Page 11: The Kernel Trick, Gram Matrices, and Feature Extraction

Common Kernels

• Gaussian kernel/RBF kernel: the de facto default kernel in machine learning

• We can validate that this is a kernel
  • Symmetric? ✅
  • Positive semi-definite? ✅
  • Non-negative? ✅
  • Scaled so that K(x,x) = 1? ✅

$$K(x, y) = \exp\left(-\gamma \|x - y\|^2\right)$$

Page 12: The Kernel Trick, Gram Matrices, and Feature Extraction

Common Kernels (continued)

• Linear kernel: just the inner product

• Polynomial kernel:

• Laplacian kernel:

$$K(x, y) = x^T y$$

$$K(x, y) = (1 + x^T y)^p$$

$$K(x, y) = \exp\left(-\gamma \|x - y\|_1\right)$$
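A minimal NumPy sketch of these kernels; `gamma` and `p` are user-chosen hyperparameters, and the function names are illustrative:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, p=3):
    return (1.0 + x @ y) ** p

def gaussian_kernel(x, y, gamma=1.0):
    # RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def laplacian_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||_1)
    return np.exp(-gamma * np.sum(np.abs(x - y)))
```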

Page 13: The Kernel Trick, Gram Matrices, and Feature Extraction

Kernels as a feature mapping

• More generally, any function that can be written in the form

$$K(x, y) = \phi(x)^T \phi(y), \qquad \phi : \mathbb{R}^d \to \mathbb{R}^D$$

(where $\phi$ is called a feature map) is a kernel.

• This even works for maps onto an infinite-dimensional Hilbert space
  • And in this case the converse is also true: any kernel has an associated (possibly infinite-dimensional) feature map.

Page 14: The Kernel Trick, Gram Matrices, and Feature Extraction

Classifying with Kernels

• Recall the SGD update is

• Resulting weight vectors will always be in the span of the examples.

• So, our prediction will be:

$$w_{t+1} = w_t + \alpha_t \frac{x_i y_i}{1 + \exp(w_t^T x_i y_i)}$$

$$w = \sum_{i=1}^{n} u_i x_i \;\Rightarrow\; h_w(x) = \operatorname{sign}\left(w^T x\right) = \operatorname{sign}\left(\sum_{i=1}^{n} u_i x_i^T x\right)$$

Page 15: The Kernel Trick, Gram Matrices, and Feature Extraction

Classifying with Kernels

• An equivalent way of writing a linear model on a training set is

• We can kernel-ize this by replacing the dot products with kernel evaluations

$$h_w(x) = \operatorname{sign}\left(\sum_{i=1}^{n} u_i x_i^T x\right)$$

$$h_u(x) = \operatorname{sign}\left(\sum_{i=1}^{n} u_i K(x_i, x)\right)$$

Page 16: The Kernel Trick, Gram Matrices, and Feature Extraction

Learning with Kernels

• An equivalent way of writing linear-model logistic regression is

• We can kernel-ize this by replacing the dot products with kernel evaluations

$$\text{minimize}_u \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-\left(\sum_{j=1}^{n} u_j x_j\right)^T x_i y_i\right)\right)$$

$$\text{minimize}_u \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-\sum_{j=1}^{n} u_j y_i K(x_j, x_i)\right)\right)$$

Page 17: The Kernel Trick, Gram Matrices, and Feature Extraction

The Computational Cost of Kernels

• Recall: benefit of learning with kernels is that we can express a wider class of classification functions

• Recall: another benefit is that linear classifier learning problems are “easy” to solve, because they are convex and gradients are easy to compute

• Major cost of learning naively with kernels: we have to evaluate K(x, y)
  • For SGD, we need to do this effectively n times per update
  • Computationally intractable unless K is very simple

Page 18: The Kernel Trick, Gram Matrices, and Feature Extraction

The Gram Matrix

• Address this computational problem by pre-computing the kernel function for all pairs of training examples in the dataset.

• Transforms the logistic regression learning problem into

• This is much easier than re-computing the kernel at each iteration

$$G_{i,j} = K(x_i, x_j)$$

$$\text{minimize}_u \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i e_i^T G u\right)\right)$$

Page 19: The Kernel Trick, Gram Matrices, and Feature Extraction

Problems with the Gram Matrix

• Suppose we have n examples in our training set.

• How much memory is required to store the Gram matrix G?

• What is the cost of taking the product $e_i^T G u$ to compute a gradient?

• What happens if we have one hundred million training examples?

Page 20: The Kernel Trick, Gram Matrices, and Feature Extraction

Feature Extraction

• Simple case: let’s imagine that X is a finite set {1, 2, …, k}

• We can define our kernel as a matrix

• Since M is positive semidefinite, it has a square root

$$M \in \mathbb{R}^{k \times k}, \qquad M_{i,j} = K(i, j)$$

$$U^T U = M, \qquad \sum_{l=1}^{k} U_{l,i} U_{l,j} = M_{i,j} = K(i, j)$$

Page 21: The Kernel Trick, Gram Matrices, and Feature Extraction

Feature Extraction (continued)

• So if we define a feature mapping $\phi(i) = U e_i$, then

$$\phi(i)^T \phi(j) = \sum_{l=1}^{k} U_{l,i} U_{l,j} = M_{i,j} = K(i, j)$$

• The kernel is equivalent to a dot product in some space

• As we noted above, this is true for all kernels, not just finite ones
  • Just with a possibly infinite-dimensional feature map
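For the finite case X = {1, 2, …, k}, the matrix U (and hence the feature map $\phi(i) = U e_i$) can be computed from an eigendecomposition of M; a minimal sketch under that assumption:

```python
import numpy as np

def feature_map_from_kernel_matrix(M):
    """Return U with U^T U = M, so that phi(i) = U[:, i] and phi(i)^T phi(j) = M[i, j]."""
    eigvals, eigvecs = np.linalg.eigh(M)       # M is symmetric PSD
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative round-off
    U = np.diag(np.sqrt(eigvals)) @ eigvecs.T  # U^T U = V diag(lambda) V^T = M
    return U
```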

Page 22: The Kernel Trick, Gram Matrices, and Feature Extraction

Classifying with feature maps

• Suppose that we can find a finite-dimensional feature map that satisfies

• Then we can simplify our classifier to

$$\phi(i)^T \phi(j) = K(i, j)$$

$$h_u(x) = \operatorname{sign}\left(\sum_{i=1}^{n} u_i K(x_i, x)\right) = \operatorname{sign}\left(\sum_{i=1}^{n} u_i \phi(x_i)^T \phi(x)\right) = \operatorname{sign}\left(w^T \phi(x)\right)$$

Page 23: The Kernel Trick, Gram Matrices, and Feature Extraction

Learning with feature maps

• Similarly we can simplify our learning objective to

• Take-away: this is just transforming the input data, then running a linear classifier in the transformed space!

• Computationally: super efficient
  • As long as we can transform and store the input data in an efficient way

$$\text{minimize}_w \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp\left(-w^T \phi(x_i)\, y_i\right)\right)$$

Page 24: The Kernel Trick, Gram Matrices, and Feature Extraction

Problems with feature maps

• The dimension of the transformed data may be much larger than the dimension of the original data.

• Suppose that the feature map is $\phi : \mathbb{R}^d \to \mathbb{R}^D$ and there are n examples

• How much memory is needed to store the transformed features?

• What is the cost of taking the product $u^T \phi(x_i)$ to compute a gradient?

Page 25: The Kernel Trick, Gram Matrices, and Feature Extraction

Feature maps vs. Gram matrices

• Interesting systems trade-offs exist here.

• When number of examples gets very large, feature maps are better.

• When transformed feature vectors have high dimensionality, Gram matrices are better.

Page 26: The Kernel Trick, Gram Matrices, and Feature Extraction

Another Problem with Feature Maps

• Recall: I said there was always a feature map $\phi$ for any kernel such that

$$\phi(i)^T \phi(j) = K(i, j)$$

• But this feature map is not always finite-dimensional
  • For example, the Gaussian/RBF kernel has an infinite-dimensional feature map
  • Many kernels we care about in ML have this property

• What do we do if $\phi$ has infinite dimensions?
  • We can’t just compute with it normally!

Page 27: The Kernel Trick, Gram Matrices, and Feature Extraction

Solution: Approximate feature maps

• Find a finite-dimensional feature map so that

$$K(x, y) \approx \phi(x)^T \phi(y)$$

• Typically, we want to find a family of feature maps $\phi_D$ such that

$$\phi_D : \mathbb{R}^d \to \mathbb{R}^D, \qquad \lim_{D \to \infty} \phi_D(x)^T \phi_D(y) = K(x, y)$$

Page 28: The Kernel Trick, Gram Matrices, and Feature Extraction

Types of approximate feature maps

• Deterministic feature maps
  • Choose a fixed-a-priori method of approximating the kernel
  • Generally not very popular because of the way they scale with dimensions

• Random feature maps
  • Choose a feature map at random (typically each feature is independent) such that

$$\mathbb{E}\left[\phi(x)^T \phi(y)\right] = K(x, y)$$

  • Then prove that, with high probability, over some region of interest

$$\left|\phi(x)^T \phi(y) - K(x, y)\right| \leq \epsilon$$
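One well-known random feature map with this unbiasedness property is random Fourier features for the Gaussian/RBF kernel (Rahimi and Recht); the slides do not fix a particular construction, so the sketch below shows that standard choice:

```python
import numpy as np

def random_fourier_features(d, D, gamma, seed=0):
    """Random map phi: R^d -> R^D with E[phi(x)^T phi(y)] = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    def phi(x):
        return np.sqrt(2.0 / D) * np.cos(W @ x + b)
    return phi
```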

Page 29: The Kernel Trick, Gram Matrices, and Feature Extraction

Types of Approximate Features (continued)

• Orthogonal randomized feature maps
  • Intuition behind this: if we have a feature map where, for some i and j,

$$e_i^T \phi(x) \approx e_j^T \phi(x),$$

  then we can’t actually learn much from including both features in the map.
  • Strategy: choose the feature map at random, but subject to the constraint that the features be statistically “orthogonal” in some way.

• Quasi-random feature maps
  • Generate features using a low-discrepancy sequence rather than true randomness

Page 30: The Kernel Trick, Gram Matrices, and Feature Extraction

Adaptive Feature Maps

• Everything before this didn’t take the data into account

• Adaptive feature maps look at the actual training set and try to minimize the kernel approximation error, using the training set as a guide
  • For example: we can do a random feature map, and then fine-tune the randomness to minimize the empirical error over the training set

• Gaining in popularity

• Also, neural networks can be thought of as adaptive feature maps.

Page 31: The Kernel Trick, Gram Matrices, and Feature Extraction

Summary: Many Ways to Learn Linear Models

• Options for representing features:
  • Learn with an exact feature map
  • Learn with a kernel
  • Learn with an approximate feature map

• Other choices:
  • Pre-compute feature map/Gram matrix and store in memory
  • Re-compute feature map/kernel value at each iteration

Page 32: The Kernel Trick, Gram Matrices, and Feature Extraction

Systems Tradeoffs

• Lots of tradeoffs here

• Do we spend more work up-front constructing a more sophisticated approximation, to save work on learning algorithms?

• Would we rather scale with the data, or scale to more complicated problems?

• Another task for hyperparameter optimization

Page 33: The Kernel Trick, Gram Matrices, and Feature Extraction

Demo

Page 34: The Kernel Trick, Gram Matrices, and Feature Extraction

Dimensionality reduction

Page 35: The Kernel Trick, Gram Matrices, and Feature Extraction

Linear models are linear in the dimension

• But what if the dimension d is very large?
  • Example: if we have a high-dimensional kernel map

• It can be difficult to run SGD when the dimension is very high, even if the cost is linear
  • This happens for other learning algorithms too

Page 36: The Kernel Trick, Gram Matrices, and Feature Extraction

Idea: reduce the dimension

• If high dimension is the problem, can we just reduce d?

• This is the problem of dimensionality reduction.

• Dimensionality reduction benefits both statistics and systems
  • Statistical side: can help with generalization by identifying an important subset of features
  • Systems side: lowers compute cost

Page 37: The Kernel Trick, Gram Matrices, and Feature Extraction

Techniques for dimensionality reduction

• Feature selection by hand
  • Simple method
  • But costly in terms of human effort

• Principal component analysis (PCA)
  • Identify the directions of highest variance in the dataset
  • Then project onto those directions
  • Many variants: e.g. kernel PCA

Page 38: The Kernel Trick, Gram Matrices, and Feature Extraction

More techniques for dimensionality reduction

• Locality-sensitive hashing (LSH)
  • Hash input items into buckets so that close-by elements map into the same buckets with high probability
  • Many methods of doing this too

• Johnson-Lindenstrauss transform (random projection)
  • General method for reducing the dimensionality of any dataset
  • Just choose a random subspace and project onto that subspace

Page 39: The Kernel Trick, Gram Matrices, and Feature Extraction

Johnson-Lindenstrauss lemma

Given a desired error $\epsilon \in (0, 1)$, a set of $m$ points in $\mathbb{R}^d$, and a reduced dimension $D$ that satisfies $D > 8 \log(m) / \epsilon^2$, there exists a linear map $T$ such that

$$(1 - \epsilon) \cdot \|x - y\|^2 \;\leq\; \|T(x) - T(y)\|^2 \;\leq\; (1 + \epsilon) \cdot \|x - y\|^2$$

for all points x and y in the set.

In fact, a randomly chosen linear map T works with high probability!
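A minimal sketch of such a random linear map, using i.i.d. Gaussian entries scaled so that squared norms are preserved in expectation (consistent with the Gaussian construction described in the notes below):

```python
import numpy as np

def random_projection(X, D, seed=0):
    """Map each row x_i of X to A x_i, with A having i.i.d. N(0, 1/D) entries."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.standard_normal((D, d)) / np.sqrt(D)   # so E[||A x||^2] = ||x||^2
    return X @ A.T
```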

Page 40: The Kernel Trick, Gram Matrices, and Feature Extraction

Consequences of J-L transform

• We only need $O(\log(m) / \epsilon^2)$ dimensions to map a dataset of size m with relative distance accuracy ε.
  • No matter what the dimension of the input data was!

• This is a very useful result for many applications
  • Provides a generic way of reducing the dimension with guarantees

Page 41: The Kernel Trick, Gram Matrices, and Feature Extraction

Autoencoders

• Use deep learning to learn two models
  • The encoder, which maps an example to a dimension-reduced representation
  • The decoder, which maps it back

• Train to minimize the distance between encoded-and-decoded examples and the original example.

If we choose each entry of A as an independent Gaussian random variable (with zero mean and appropriate variance), then the bound in the lemma will hold with high probability, as long as the dimension D is large enough.

An aside: concerns with efficiency. While using Gaussian random variables is sufficient, it’s not very computationally efficient to generate the matrix A, communicate it, and multiply by it. There’s a lot of work on making random projections faster by using other distributions and more structured matrices, so if you want to use random projection at scale, you should consider using these methods.

Principal component analysis (PCA). Idea: instead of using a random linear projection, pick an orthogonal linear map that maximizes the variance of the resulting transformed data. Concretely, if we’re given some data $x_1, \ldots, x_n \in \mathbb{R}^d$, we want to find an orthonormal matrix $A \in \mathbb{R}^{D \times d}$ (i.e. a matrix with orthogonal rows all of norm 1) that maximizes

$$\frac{1}{n} \sum_{i=1}^{n} \left\| A x_i - \frac{1}{n} \sum_{j=1}^{n} A x_j \right\|^2$$

over orthogonal projections $A \in \mathbb{R}^{D \times d}$.

This problem can be solved by forming the empirical covariance matrix $\Sigma$ of the data, where

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \frac{1}{n} \sum_{j=1}^{n} x_j \right) \left( x_i - \frac{1}{n} \sum_{j=1}^{n} x_j \right)^T,$$

and then finding its $D$ largest eigenvectors and using them as the rows of $A$. One downside of this direct approach is that doing so requires $O(d^2)$ space (to store the covariance matrix) and even more time (to do the eigendecomposition). As a result, many methods for fast PCA have been developed, and you should consider using these if you want to use PCA at scale.
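A direct (deliberately non-scalable) NumPy sketch of the procedure just described: form the empirical covariance matrix and take its D largest eigenvectors as the rows of A.

```python
import numpy as np

def pca_projection(X, D):
    """Rows of X are the examples x_i in R^d; returns A in R^{D x d} with orthonormal rows."""
    mean = X.mean(axis=0)
    Xc = X - mean
    Sigma = (Xc.T @ Xc) / X.shape[0]            # empirical covariance, O(d^2) space
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
    A = eigvecs[:, -D:].T                       # top-D eigenvectors as the rows of A
    return A, mean

# Usage: A, mean = pca_projection(X, D); X_reduced = (X - mean) @ A.T
```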

Autoencoders. Idea: use deep learning to learn two nonlinear models, one of which (the encoder, $\phi$) goes from our original data in $\mathbb{R}^d$ to a compressed representation in $\mathbb{R}^D$ for $D < d$, and the other of which (the decoder, $\psi$) goes from the compressed representation in $\mathbb{R}^D$ back to $\mathbb{R}^d$. We want to train in such a way as to minimize the distance between the original examples in $\mathbb{R}^d$ and the “recovered” examples that result from encoding and then decoding the example. Formally, given some dataset $x_1, \ldots, x_n$, we want to minimize

$$\frac{1}{n} \sum_{i=1}^{n} \left\| \psi(\phi(x_i)) - x_i \right\|^2$$

over some parameterized class of nonlinear transformations $\phi : \mathbb{R}^d \to \mathbb{R}^D$ and $\psi : \mathbb{R}^D \to \mathbb{R}^d$ defined by a neural network.
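A minimal sketch of this reconstruction objective for a given encoder φ and decoder ψ; the network definitions and training loop are omitted, and `encoder`/`decoder` are assumed to be callables:

```python
import numpy as np

def reconstruction_loss(X, encoder, decoder):
    """(1/n) * sum_i || psi(phi(x_i)) - x_i ||^2 for a given encoder phi and decoder psi."""
    return np.mean([np.sum((decoder(encoder(x)) - x) ** 2) for x in X])
```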

[Diagram: the encoder maps an original example $x_i \in \mathbb{R}^d$ to an encoded example in $\mathbb{R}^D$; the decoder maps it back to a decoded example $\hat{x}_i \in \mathbb{R}^d$.]

Page 42: The Kernel Trick, Gram Matrices, and Feature Extraction

Questions

• Upcoming things:
  • Paper 1a or 1b review due on Monday
  • Papers 2a/2b in class on Monday
  • Start thinking about the class project
  • It will come faster than you think!

