Online Passive-Aggressive Algorithms
Tirgul 11
Multi-Label Classification
Multilabel Problem: Example
• Mapping Apps to smart folders:
• Assign an installed app (e.g., Candy Crush Saga) to one or more folders.
Goal
• Given 𝐴, an installed app:
• Assign it to 𝒚 ⊆ 𝒴, the set of relevant folders,
• where 𝒴 is the set of all folders.
• Example: 𝒴 = {Games, Social, Music, Photography, Shopping}; Farm Story is assigned to {Games, Social}.
Multilabel Classification
• A variant of the classification problem:
• Multiple target labels must be assigned to each instance.
• Setting:
• There are 𝑘 different possible labels: 𝒴 = {1, …, 𝑘}.
• Every instance 𝒙𝑖 is associated with a set of relevant labels 𝒚𝑖.
• Special case:
• There is a single relevant label for each instance: multiclass (single-label) classification.
Multilabel Classification
• Other uses:
• Text categorization:
• 𝒙𝑖 represents a document.
• 𝒚𝑖 is the set of topics which are relevant to the document,
• chosen from a predefined collection of topics.
• E.g.: a text might be about any of religion, politics, finance, or education at the same time, or none of these.
Task
Assign a user’s installed applications to one or more smart folders automatically.
Task
• Training set of examples:
• Each example (𝐴𝑖, 𝒚𝑖) contains an application and a set of one or more smart folders, e.g., (Farm Story, {Games, Social}).
Model
Find a function 𝒚 = 𝑓𝒘(𝐴) with parameters 𝒘 that assigns a set of folders 𝒚 to an installed app 𝐴.
Model and Inference
𝑦 = 𝑓𝒘(𝐴) = argsort_𝒚 𝒘 ⋅ 𝜙(𝐴, 𝒚)

• 𝒘 is the weight vector and 𝜙 is the feature map.
Feature maps
• TF-IDF of title
• Google Play’s category
• TF-IDF of description
• Representation of related apps
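As a sketch of how a joint feature map 𝜙(𝐴, 𝒚) can be realized (one standard construction, not necessarily the exact one used here; all names are hypothetical), the app's feature vector can be copied into the block that corresponds to the candidate label, so that 𝒘 ⋅ 𝜙(𝐴, 𝑦) reads off label 𝑦's block of 𝒘:

```python
import numpy as np

def joint_feature_map(app_features, label, k):
    """phi(A, y): place the app's d features in block y of a (k*d)-vector."""
    d = len(app_features)
    phi = np.zeros(k * d)
    phi[label * d:(label + 1) * d] = app_features
    return phi

app = np.array([0.2, 0.0, 0.7])          # e.g., a few TF-IDF weights for one app
phi_games = joint_feature_map(app, label=1, k=5)
print(phi_games.shape)                   # (15,)
```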
Inference
• Algorithm’s output upon receiving an instance 𝒙𝑖:
• A score for each of the 𝑘 labels in 𝒴.
• Example: for 𝒙𝑖 = Farm Story and 𝒴 = {Games, Social, Music, Photography, Shopping}, the scores are (0.6, 0.3, 0.7, 0.1, 0.01).

𝑦𝑖 = argsort_𝒚 𝒘 ⋅ 𝜙(𝐴𝑖, 𝒚)
Inference
• The algorithm’s prediction is a vector in ℝ𝑘 where each element in the vector corresponds to the score assigned to the respective label.
• This form of prediction is often referred to as label ranking.
𝑦 = argsort_𝒚 𝒘 ⋅ 𝜙(𝐴, 𝒚) ∈ ℝ𝑘
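A minimal sketch of this inference step, with illustrative scores and label names (`predict_ranking` is a hypothetical helper standing in for argsort over 𝒘 ⋅ 𝜙(𝐴, 𝒚)):

```python
import numpy as np

def predict_ranking(scores, labels):
    """Label ranking: sort labels from highest to lowest score."""
    order = np.argsort(scores)[::-1]        # index of the best label first
    return [labels[i] for i in order]

labels = ["Games", "Social", "Music", "Photography", "Shopping"]
scores = np.array([0.6, 0.3, 0.7, 0.1, 0.01])   # one score per label in Y
print(predict_ranking(scores, labels))
# → ['Music', 'Games', 'Social', 'Photography', 'Shopping']
```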
Inference
• For a pair of labels 𝑟, 𝑠 ∈ 𝒴:
• If score(𝑟) > score(𝑠), label 𝑟 is ranked higher than label 𝑠.
• Goal of the algorithm:
• Rank every relevant label above every irrelevant label.
The Margin
• Example: (𝐴𝑖, 𝒚𝑖) = (Farm Story, {Games, Social}).
• After making the prediction, the algorithm receives the correct set 𝒚𝑖.
1. Find the least probable (lowest-scored) correct folder: min over 𝑟 ∈ 𝒚𝑖.
2. Find the most probable (highest-scored) wrong folder: max over 𝑠 ∉ 𝒚𝑖.
Iterate over examples
• Example: (𝐴𝑖, 𝒚𝑖) = (Farm Story, {Games, Social}).
• After making the prediction, the algorithm receives the correct set 𝒚𝑖.
• Update the weight vector based on the prediction.
The Margin
• We define the margin attained by the algorithm on round 𝑖 for example (𝒙𝑖, 𝒚𝑖) as:

𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) = min_{𝑟∈𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑟) − max_{𝑠∉𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑠)
The Margin
• The margin is positive if all relevant labels are ranked higher than all irrelevant labels.
• We are not satisfied with only a positive margin; we require the margin of every prediction to be at least 1.
𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) = min_{𝑟∈𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑟) − max_{𝑠∉𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑠)
The Loss
• Define a hinge loss:
ℓ(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) =
  0                      if 𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) ≥ 1
  1 − 𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖)    otherwise

where 𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) = min_{𝑟∈𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑟) − max_{𝑠∉𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑠), and a prediction mistake corresponds to 𝕝[𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) < 0].
The Loss
• Could also be written as follows:

ℓ(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) = [1 − 𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖)]₊

• where [𝑎]₊ = max(0, 𝑎).
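The margin and hinge loss can be sketched directly on per-label scores (names hypothetical; `relevant` holds the indices of the labels in 𝒚𝑖):

```python
def margin(scores, relevant):
    """gamma: min score over relevant labels minus max score over irrelevant ones."""
    rel = [scores[j] for j in range(len(scores)) if j in relevant]
    irr = [scores[j] for j in range(len(scores)) if j not in relevant]
    return min(rel) - max(irr)

def hinge_loss(scores, relevant):
    """l = [1 - gamma]_+ : zero exactly when the margin is at least 1."""
    return max(0.0, 1.0 - margin(scores, relevant))

scores = [0.6, 0.3, 0.7, 0.1, 0.01]    # illustrative per-label scores
relevant = {0, 2}                      # indices of the relevant labels
# margin = min(0.6, 0.7) - max(0.3, 0.1, 0.01) = 0.3, so loss = 1 - 0.3 = 0.7
```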
Learning
• First approach:
• Goal:
Multilabel PA Optimization Problem
• An alternative approach. Define:

𝑟𝑖 = argmin_{𝑟∈𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑟),  𝑠𝑖 = argmax_{𝑠∉𝒚𝑖} 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑠),
𝛾(𝒘𝑖, 𝒙𝑖, 𝒚𝑖) = 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑟𝑖) − 𝒘𝑖 ⋅ 𝜙(𝒙𝑖, 𝑠𝑖).

• Set:

𝒘𝑖+1 = argmin_𝒘 ½ ‖𝒘 − 𝒘𝑖‖²
s.t. ℓ(𝒘, 𝒙𝑖, 𝒚𝑖) = [1 − 𝛾(𝒘, 𝒙𝑖, 𝒚𝑖)]₊ = 0,

• i.e., the margin is at least 1.
Passive Aggressive (PA)
• The algorithm is passive whenever the hinge loss is zero:
• ℓ𝑖 = 0 ⟹ 𝒘𝑖+1 = 𝒘𝑖.
• When the loss is positive, the algorithm aggressively forces 𝒘𝑖+1 to satisfy the constraint ℓ(𝒘𝑖+1, 𝒙𝑖, 𝒚𝑖) = 0, regardless of the step size required.
Passive Aggressive
• Solving with Lagrange Multipliers, we get the following update rule:
𝒘𝑖+1 = 𝒘𝑖 + 𝜏𝑖 (𝜙(𝒙𝑖, 𝑟𝑖) − 𝜙(𝒙𝑖, 𝑠𝑖))

𝜏𝑖 = ℓ𝑖 / ‖𝜙(𝒙𝑖, 𝑟𝑖) − 𝜙(𝒙𝑖, 𝑠𝑖)‖²
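The full multilabel PA step can be sketched as follows, assuming 𝜙(𝒙𝑖, 𝑟) is given as one feature vector per label (all names and the random feature data are hypothetical):

```python
import numpy as np

def pa_multilabel_step(w, phi, relevant):
    """One multilabel PA update.

    phi: (k, d) array where phi[r] plays the role of phi(x_i, r);
    relevant: set of relevant label indices (y_i)."""
    scores = phi @ w                              # w . phi(x, r) for each label
    rel = [r for r in range(len(scores)) if r in relevant]
    irr = [s for s in range(len(scores)) if s not in relevant]
    r_i = min(rel, key=lambda r: scores[r])       # lowest-scored relevant label
    s_i = max(irr, key=lambda s: scores[s])       # highest-scored irrelevant label
    loss = max(0.0, 1.0 - (scores[r_i] - scores[s_i]))  # hinge loss on the margin
    if loss == 0.0:
        return w                                  # passive step
    diff = phi[r_i] - phi[s_i]
    tau = loss / (diff @ diff)                    # closed-form step size tau_i
    return w + tau * diff                         # aggressive step

rng = np.random.default_rng(0)
phi = rng.normal(size=(5, 4))                     # 5 labels, 4 features
w = pa_multilabel_step(np.zeros(4), phi, {0, 2})
```

Starting from 𝒘 = 𝟎 the loss is 1, so after the step the score gap between the chosen pair (𝑟𝑖, 𝑠𝑖) is 1 up to rounding.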
Passive Aggressive
• The updated vector 𝒘𝑖+1 classifies example 𝒙𝑖 with ℓ(𝒘𝑖+1, 𝒙𝑖, 𝒚𝑖) = 0.
Ranking
Ranking Problem: Example
• A Prediction Bar:
Predict the apps the user is most likely to use at any given time and location.
• The bar shows the apps in order: the most likely app first, down to the 4th most likely app.
• Example context: at the office, at 12:47, on a Wednesday.
Goal
Find a function 𝑓 that takes as input the current context 𝒙 and predicts the apps the user is likely to click in that context.
Features
• time of day
• day of week
• location
Discriminative Model
• Prediction: each app 𝐴 has its own parameter vector 𝒘𝐴 ∈ ℝ𝑑, and the current context is represented as a feature vector 𝒙 ∈ ℝ𝑑.
New Evaluation
• The performance of the prediction system is measured by the Receiver Operating Characteristic (ROC) curve.

true positive rate = (contexts where the app was in the prediction bar and was clicked) / (total contexts where the app was clicked)

false positive rate = (contexts where the app was in the prediction bar and wasn’t clicked) / (total contexts where the app wasn’t clicked)
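The two rates could be computed per app like this (the log format is a hypothetical one: a record per context, with flags for whether the app was shown in the bar and whether it was clicked):

```python
def roc_point(records):
    """records: list of (in_bar, clicked) booleans for one app, one per context."""
    clicked = [r for r in records if r[1]]
    not_clicked = [r for r in records if not r[1]]
    tpr = sum(1 for in_bar, _ in clicked if in_bar) / len(clicked)
    fpr = sum(1 for in_bar, _ in not_clicked if in_bar) / len(not_clicked)
    return tpr, fpr

# 4 contexts where the app was clicked (3 of them with the app in the bar),
# 5 contexts where it was not clicked (1 of them with the app in the bar).
records = [(True, True), (True, True), (True, True), (False, True),
           (True, False), (False, False), (False, False), (False, False),
           (False, False)]
print(roc_point(records))  # → (0.75, 0.2)
```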
Maximizing AUC
• By the definition of the AUC (Bamber, 1975; Hanley and McNeil, 1982), the AUC is the probability that a randomly chosen context in which the app was clicked is scored higher than a randomly chosen context in which it was not.
Pairwise Dataset
• For an application (e.g., Waze), define two sets of contexts: 𝑋⁺, the contexts in which the app was clicked, and 𝑋⁻, the contexts in which it was not.
Maximizing AUC
• By the definition of the AUC (Bamber, 1975; Hanley and McNeil, 1982), with 𝑋⁺ the contexts in which the app was clicked and 𝑋⁻ those in which it was not:

𝐴𝑈𝐶 = (1 / (|𝑋⁺| |𝑋⁻|)) Σ_{𝒙⁺∈𝑋⁺} Σ_{𝒙⁻∈𝑋⁻} 𝕝[𝑓𝒘(𝒙⁺, 𝐴) > 𝑓𝒘(𝒙⁻, 𝐴)]
Solve using PA
• Set the next 𝒘𝑖 to be the minimizer of the following optimization problem:

min_{𝒘∈ℝ𝑑, 𝜉≥0} ½ ‖𝒘 − 𝒘𝑖−1‖²
s.t. 𝑓𝒘(𝒙𝑖⁺, 𝐴𝑖) − 𝑓𝒘(𝒙𝑖⁻, 𝐴𝑖) ≥ 1
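Assuming the score is linear in the context, 𝑓𝒘(𝒙, 𝐴) = 𝒘𝐴 ⋅ 𝒙 (consistent with 𝒘𝐴, 𝒙 ∈ ℝ𝑑 above, but an assumption here), the solution has the same closed form as the multilabel PA update; a sketch:

```python
import numpy as np

def pa_pairwise_step(w_a, x_pos, x_neg):
    """One PA step on a (clicked, not-clicked) context pair for a single app,
    assuming a linear score f_w(x, A) = w_A . x."""
    loss = max(0.0, 1.0 - (w_a @ x_pos - w_a @ x_neg))  # hinge on pairwise margin
    if loss == 0.0:
        return w_a                                      # constraint already holds
    diff = x_pos - x_neg
    tau = loss / (diff @ diff)                          # closed-form step size
    return w_a + tau * diff

x_pos = np.array([1.0, 0.0])      # context where the app was clicked
x_neg = np.array([0.0, 1.0])      # context where it was not
w_a = pa_pairwise_step(np.zeros(2), x_pos, x_neg)
print(w_a @ x_pos - w_a @ x_neg)  # → 1.0
```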
Implementation
• An online algorithm solves the optimization problem efficiently on huge data.
• Theoretical guarantees of convergence.