PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme Classification
Ian E.H. Yen (1), Xiangru Huang (2), Wei Dai (1), Pradeep Ravikumar (1), Inderjit S. Dhillon (2) and Eric Xing (1)
1Carnegie Mellon University. 2University of Texas at Austin
KDD, 2017
Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit S. Dhillon and Eric Xing. PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme Classification. KDD 2017.
Outline
1 Problem Setting
    Extreme Classification
    Related Works
2 Algorithm
    Separable Loss
    Algorithm Diagram
3 Theory
    Analysis of primal and dual sparsity
4 Experimental Results
Extreme Classification
Goal: learn a function h(x) : ℝ^D → ℝ^K from D input features to K output scores that is consistent with the labels y ∈ {0, 1}^K.

K is large (e.g. 10^3 to 10^6).

Average number of positive labels per sample:

    k_p = (1/N) ∑_{i=1}^{N} ∑_k y_{ik}.

Multiclass: k_p = 1; multilabel: k_p ≪ K.

Average number of positive samples per class:

    n_p = (1/K) ∑_{k=1}^{K} n_p^k = (1/K) ∑_{k=1}^{K} ∑_i y_{ik}.

Note that n_p = N k_p / K ≪ N.
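The label statistics above can be computed directly from a binary label matrix. A minimal sketch (the toy matrix Y below is invented for illustration):

```python
import numpy as np

# Hypothetical toy label matrix: N = 4 samples, K = 5 classes.
Y = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
])
N, K = Y.shape

# Average number of positive labels per sample: k_p = (1/N) sum_i sum_k y_ik
k_p = Y.sum() / N

# Average number of positive samples per class: n_p = (1/K) sum_k sum_i y_ik
n_p = Y.sum() / K

# The identity n_p = N * k_p / K holds by construction.
assert np.isclose(n_p, N * k_p / K)
print(k_p, n_p)  # 1.5 1.2
```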
Extreme Classification
We consider a linear classifier:

    h(x) := W^T x, where W ∈ ℝ^{D×K}.

Challenge: when K is large, training even this simple linear model requires O(NDK) cost.
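To see why this matters, plug in sizes of the order quoted earlier (these particular numbers are illustrative, not taken from the paper):

```python
# One full pass of dense one-vs-all training touches every
# (sample, feature, class) triple once: O(N * D * K) operations.
N, D, K = 1_000_000, 100_000, 1_000_000   # illustrative extreme-scale sizes
flops_per_epoch = N * D * K
print(f"{flops_per_epoch:.1e}")  # 1.0e+17 -- hopeless without exploiting sparsity
```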
Approaches
Approach 1: Structural, e.g. low-rank or tree-hierarchy. Good accuracy when the assumption holds; lower accuracy when it does not.

Approach 2: Parallelized one-vs-all. Good accuracy; slow but parallelizable. Needs days on the largest dataset even with 100 cores.

Approach 3: Primal-Dual Sparse. Good accuracy, fast, but not parallelizable, with an O(DK) memory issue. Needs days on the largest dataset.

This paper: Parallel PD-Sparse. Good accuracy, fast, parallelizable. Needs under 30 minutes on the largest dataset with 100 cores.
We consider the classwise-separable hinge loss

    L(z, y) := ∑_{k=1}^{K} ℓ(z_k, y_k) = ∑_{k=1}^{K} max(1 − y_k z_k, 0).

Minimizing a separable loss is equivalent to one-versus-all:

    min_{W ∈ ℝ^{D×K}} ∑_{i=1}^{N} ∑_{k=1}^{K} ℓ(w_k^T x_i, y_{ik}) = ∑_{k=1}^{K} ( ∑_{i=1}^{N} ℓ(w_k^T x_i, y_{ik}) ).

To obtain sparse iterates, we add an ℓ1-penalty on W and a bias w_{0k} per class. The dual problem of the ℓ1-ℓ2-regularized problem is:

    min_{α_k ∈ ℝ^N}  G(α_k) := (1/2) ‖w(α_k)‖² − ∑_{i=1}^{N} α_{ik}
    s.t.  w(α_k) = prox_λ(X̂^T α_k),
          0 ≤ α_{ik} ≤ 1.                                   (1)
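A minimal dense sketch of solving one sub-problem of (1) by dual coordinate descent, assuming the standard soft-thresholding form of prox_λ and plain cyclic updates; the paper's actual solver restricts all work to sparse active sets, which this sketch omits:

```python
import numpy as np

def soft_threshold(v, lam):
    """prox_lam(v) for the l1 norm: shrink each coordinate toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def dual_cd_one_class(X, y, lam=0.1, epochs=20):
    """Dual coordinate descent sketch for one class k of problem (1).

    X : (N, D) features; y : (N,) labels in {-1, +1} for this class.
    Maintains v = Xhat^T alpha with Xhat_i = y_i * x_i, and w = prox_lam(v).
    """
    N, D = X.shape
    Xhat = y[:, None] * X
    alpha = np.zeros(N)
    v = np.zeros(D)
    sq = (Xhat ** 2).sum(axis=1)       # curvature bound for each coordinate
    for _ in range(epochs):
        for i in range(N):             # cyclic order; random order also works
            if sq[i] == 0.0:
                continue
            w = soft_threshold(v, lam)
            g = Xhat[i] @ w - 1.0      # partial gradient dG/dalpha_i
            new = np.clip(alpha[i] - g / sq[i], 0.0, 1.0)   # box 0 <= alpha <= 1
            v += (new - alpha[i]) * Xhat[i]                 # keep v in sync
            alpha[i] = new
    return soft_threshold(v, lam), alpha

# Tiny separable example: feature 0 carries the label.
X = np.array([[1., 0.], [1., 0.], [-1., 0.], [-1., 0.]])
y = np.array([1., 1., -1., -1.])
w, alpha = dual_cd_one_class(X, y)
print(w)  # ~ [1. 0.]
```

Note that ∇G(α)_i = x̂_i^T w(α) − 1 is exactly the expression used for g above, so each inner step is a projected Newton-like coordinate update.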
Primal-Dual-Sparse Active-set Method
Cost per iteration:

    O( nnz(w_k) · nnz(x_j)  [search]  +  nnz(α_k) · nnz(x_i)  [update + maintain] ).

Apply random sparsification to the (already sparse) w_k before the search.

Update α by coordinate descent within the active set A_k.
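The nnz-dependent search cost comes from evaluating scores with sparse operands only. A minimal sketch using dict-of-index sparse vectors (a simplification of the paper's data structures):

```python
def sparse_dot(w, x):
    """Inner product of two sparse vectors stored as {index: value} dicts.

    Iterates over the smaller support, so the cost depends on the numbers
    of nonzeros rather than on the ambient dimension D.
    """
    if len(w) > len(x):
        w, x = x, w                     # iterate over the smaller support
    return sum(val * x[i] for i, val in w.items() if i in x)

# w_k has 2 nonzeros, x_j has 2; only index 3 overlaps.
score = sparse_dot({0: 2.0, 3: 1.0}, {3: 4.0, 7: 1.0})
print(score)  # 4.0
```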
Parallel & Primal-Dual Sparse Method
Due to the separable loss, the optimization can be embarrassingly parallelized with one-time communication.

The input y_k and output w_k of each sub-problem are sparse.

Can be implemented in a distributed, shared-memory, or two-level parallelization setting.

Space: O(nnz(X) + D).

Nearly linear speedup even with thousands of cores.
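Because the K sub-problems share nothing once the data is loaded, one-vs-all decomposes into an embarrassingly parallel map over classes. A minimal sketch, where train_one_class is a hypothetical stand-in for the per-class dual solver:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def train_one_class(args):
    """Hypothetical stand-in for one per-class dual solve.

    Each class k sees the shared X and its own sparse label column y_k,
    so workers need no communication with each other.
    """
    X, y_k = args
    # ... a real implementation would run dual coordinate descent here ...
    return int((y_k > 0).sum())        # here: just n_p^k, the positive count

X = np.eye(4)                          # toy features
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
jobs = [(X, Y[:, k]) for k in range(Y.shape[1])]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(train_one_class, jobs))
print(results)  # [2, 1]
```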
Theory: Primal and Dual Sparsity
Key insight: the number of positive samples for each class,

    n_p = N k_p / K,

is small. The following results hold if a class-wise bias w_{0k} is added.

Step 1: bound ‖w‖₁ and the optimal ‖α*‖₁ in terms of n_p:

    ‖w_k‖₁ ≤ 2 n_p^k / λ,    ‖α_k^*‖₁ ≤ 4 n_p^k.

Step 2: bound nnz(w) and nnz(α) in terms of ‖w‖₁ and ‖α*‖₁:

    nnz(w̃_k) ≤ ‖w_k‖₁² / δ²,    nnz(α_k^t) ≤ t ≤ 4 ‖α_k^*‖₁² / ε,

where w̃ is the randomly sparsified version of w with δ approximation error in ∇G(α), and ε is the desired precision of the solution.
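The Step 2 bound nnz(w̃_k) ≤ ‖w_k‖₁² / δ² matches a standard unbiased sampling scheme: draw s = (‖w‖₁ / δ)² coordinates with probability proportional to |w_i|. The sketch below assumes that scheme; the paper's exact sampler may differ in details:

```python
import numpy as np

def random_sparsify(w, delta, rng=None):
    """Unbiased random sparsification sketch (one standard scheme, assumed).

    Draws s = ceil((||w||_1 / delta)^2) coordinate samples with probability
    p_i = |w_i| / ||w||_1, each contributing sign(w_i) * ||w||_1 / s, so
    E[w_tilde] = w while nnz(w_tilde) <= s = (||w||_1 / delta)^2.
    """
    rng = rng or np.random.default_rng(0)
    l1 = np.abs(w).sum()
    if l1 == 0:
        return np.zeros_like(w, dtype=float)
    s = max(1, int(np.ceil((l1 / delta) ** 2)))
    p = np.abs(w) / l1
    idx = rng.choice(len(w), size=s, p=p)
    w_tilde = np.zeros_like(w, dtype=float)
    np.add.at(w_tilde, idx, np.sign(w[idx]) * l1 / s)   # accumulate repeats
    return w_tilde

w = np.array([3.0, -1.0, 0.0])
w_tilde = random_sparsify(w, delta=4.0)   # s = 1, so at most one nonzero
```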
Multilabel Classification
Multiclass Classification
Thank you
Xiangru [email protected]