Machine-Learning for Big Data:
Sampling and Distributed On-Line Algorithms
Stéphan Clémençon
LTCI UMR CNRS No. 5141, Telecom ParisTech
Journée Traitement de Masses de Données du Laboratoire JL Lions, UPMC
Goals of Statistical Learning Theory
• Statistical issues cast as M-estimation problems:
  • Classification
  • Regression
  • Density level set estimation
  • ... and their variants
• Minimal assumptions on the distribution
• Build realistic M-estimators for special criteria
• Questions:
  • Optimal elements
  • Consistency
  • Non-asymptotic excess risk bounds
  • Fast rates of convergence
  • Oracle inequalities
Main Example: Classification
• $(X, Y)$ random pair with unknown distribution $P$
  • $X \in \mathcal{X}$ observation vector
  • $Y \in \{-1, +1\}$ binary label/class
• A posteriori probability $\sim$ regression function
  $\forall x \in \mathcal{X}, \quad \eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$
• $g : \mathcal{X} \to \{-1, +1\}$ classifier
• Performance measure = classification error
  $L(g) = \mathbb{P}\{g(X) \neq Y\} \to \min_g$
• Solution: the Bayes rule
  $\forall x \in \mathcal{X}, \quad g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$
• Bayes error $L^* = L(g^*)$
Empirical Risk Minimization
• Sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$
• Class $\mathcal{G}$ of classifiers
• Empirical Risk Minimization principle
  $g_n = \arg\min_{g \in \mathcal{G}} L_n(g) := \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{g(X_i) \neq Y_i\}$
• Best classifier in the class
  $\bar{g} = \arg\min_{g \in \mathcal{G}} L(g)$
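As a concrete illustration (not from the slides), here is a minimal Python sketch of the ERM principle over a small finite class of threshold classifiers; the toy class $\mathcal{G}$, the data-generating process and all names are illustrative assumptions.

```python
import numpy as np

def empirical_risk(g, X, Y):
    # L_n(g): fraction of training points misclassified by g
    return np.mean(g(X) != Y)

def erm(classifiers, X, Y):
    # ERM principle: pick the classifier of the finite class G with smallest empirical risk
    risks = [empirical_risk(g, X, Y) for g in classifiers]
    best = int(np.argmin(risks))
    return classifiers[best], risks[best]

# Toy class G: one-dimensional threshold classifiers g_t(x) = 2*1{x > t} - 1
thresholds = np.linspace(-2.0, 2.0, 41)
G = [lambda X, t=t: np.where(X[:, 0] > t, 1, -1) for t in thresholds]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
Y = np.where(X[:, 0] + 0.3 * rng.normal(size=1000) > 0, 1, -1)

g_n, risk = erm(G, X, Y)
print("empirical risk of the ERM solution:", risk)
```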
Empirical Processes in Classification
• Bias-variance decomposition
  $L(g_n) - L^* \leq \big(L(g_n) - L_n(g_n)\big) + \big(L_n(\bar{g}) - L(\bar{g})\big) + \big(L(\bar{g}) - L^*\big)$
  $\leq 2 \sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \Big(\inf_{g \in \mathcal{G}} L(g) - L^*\Big)$
• Concentration inequality: with probability $1 - \delta$,
  $\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq \mathbb{E}\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| + \sqrt{\frac{2\log(1/\delta)}{n}}$
Classification Theory - Main Results
1. Bayes risk consistency and rate of convergence. Complexity control:
   $\mathbb{E}\sup_{g \in \mathcal{G}} |L_n(g) - L(g)| \leq C\sqrt{\frac{V}{n}}$
   if $\mathcal{G}$ is a VC class with VC dimension $V$.
2. Fast rates of convergence: under variance control, rate faster than $n^{-1/2}$
3. Convex risk minimization
4. Oracle inequalities
Big Data? Big Challenge!
Now, it is much easier
• to collect data, massively and in real time: ubiquity of sensors (cell phones, internet, embedded systems, social networks, ...)
• to store and manage Big (and Complex) Data (distributed file systems, NoSQL)
• to implement massively parallelized and distributed computational algorithms (MapReduce, clouds)
The three features of Big Data analysis
• Velocity: process data in quasi-real time (on-line algorithms)
• Volume: scalability (parallelized, distributed algorithms)
• Variety: complex data (text, signal, image, graph)
How to apply ERM to Big Data?
• Suppose that $n$ is too large to evaluate the empirical risk $L_n(g)$
• Common sense: run your preferred learning algorithm on a subsample of "reasonable" size $B \ll n$, e.g. by drawing with replacement from the original training data set...
• ... but of course, statistical performance is downgraded!
  $1/\sqrt{n} \ll 1/\sqrt{B}$
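A hedged sketch of this "common sense" shortcut: draw $B$ indices with replacement and run ERM on the subsample, reusing the `erm` helper sketched earlier; the deviation term then scales as $1/\sqrt{B}$ instead of $1/\sqrt{n}$.

```python
import numpy as np

def subsample_erm(classifiers, X, Y, B, seed=0):
    # Naive subsampling: B indices drawn with replacement from the n original points,
    # then plain ERM on the reduced sample (statistical accuracy now driven by B, not n)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=B)
    return erm(classifiers, X[idx], Y[idx])
```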
Survey designs: a solution to Big Data learning?
• Framework: massive original sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ viewed as a superpopulation
• Survey plan $R_n$ = probability distribution on the set of all nonempty subsets of $\{1, \ldots, n\}$
• Let $S \sim R_n$ and set $\epsilon_i = 1$ if $i \in S$, $\epsilon_i = 0$ otherwise. The vector $(\epsilon_1, \ldots, \epsilon_n)$ fully describes $S$
• First and second order inclusion probabilities:
  $\pi_i(R_n) = \mathbb{P}\{i \in S\}$ and $\pi_{i,j}(R_n) = \mathbb{P}\{(i, j) \in S^2\}$
• Do not rely on the empirical risk based on the survey sample $\{(X_i, Y_i) : i \in S\}$:
  $\frac{1}{\#S}\sum_{i \in S} \mathbb{I}\{g(X_i) \neq Y_i\}$ is a biased estimate of $L(g)$
Horvitz-Thompson theory
• Consider the Horvitz-Thompson estimator of the risk
  $\bar{L}^{R_n}_n(g) = \frac{1}{n}\sum_{i=1}^n \frac{\epsilon_i}{\pi_i}\,\mathbb{I}\{g(X_i) \neq Y_i\}$
• And the Horvitz-Thompson empirical risk minimizer
  $g^\epsilon_n = \arg\min_{g \in \mathcal{G}} \bar{L}^{R_n}_n(g)$
• It may work if $\sup_{g \in \mathcal{G}} \big|\bar{L}^{R_n}_n(g) - L_n(g)\big|$ is small
• In general, due to the dependence structure, not much can be said about the fluctuations of this supremum
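A minimal sketch (an assumed interface, not the authors' code) of the Horvitz-Thompson risk estimate: `eps` is the vector of inclusion indicators $\epsilon_i$ produced by the survey plan and `pi` the first-order inclusion probabilities $\pi_i$.

```python
import numpy as np

def horvitz_thompson_risk(g, X, Y, eps, pi):
    # \bar{L}^{R_n}_n(g) = (1/n) * sum_i (eps_i / pi_i) * 1{g(X_i) != Y_i}
    errors = (g(X) != Y).astype(float)
    return np.sum(eps * errors / pi) / len(X)
```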
The Poisson case: the $\epsilon_i$'s are independent
• In this case, $\bar{L}^{R_n}_n(g)$ is a simple average of independent r.v.'s
  $\Rightarrow$ back to empirical process theory
• One recovers the same learning rate as if all the data had been used, e.g. in the finite VC dimension case:
  $\mathbb{E}\big[L(g^\epsilon_n) - L^*\big] \leq \big(\kappa_n\sqrt{2} + 4\big)\sqrt{\frac{V\log(n+1) + \log 2}{n}}$
  where $\kappa_n = \sqrt{\sum_{i=1}^n (1/\pi_i^2)}$ (the $\pi_i$'s should not be too small...)
• The upper bound is optimal in the minimax sense.
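For the Poisson case, a sketch of how the $\epsilon_i$'s can be drawn and plugged into Horvitz-Thompson ERM, reusing the toy class `G` and data `X, Y` from the earlier sketches; the uniform 5% inclusion probability is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = len(X)
pi = np.full(n, 0.05)                      # Poisson plan: first-order inclusion probabilities
eps = (rng.random(n) < pi).astype(float)   # independent Bernoulli(pi_i) inclusion indicators

# Horvitz-Thompson empirical risk minimizer over the finite class G
ht_risks = [horvitz_thompson_risk(g, X, Y, eps, pi) for g in G]
g_eps = G[int(np.argmin(ht_risks))]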
The Poisson case: the $\epsilon_i$'s are independent
• Can be extended to more general sampling plans $Q_n$ provided you are able to control
  $d_{TV}(R_n, Q_n) \stackrel{\rm def}{=} \sum_{S \in \mathcal{P}(U_n)} |Q_n(S) - R_n(S)|.$
• A coupling technique (Hajek, 1964) can be used to show that it works for rejective sampling, Rao-Sampford sampling, successive sampling, post-stratified sampling, etc.
Beyond Empirical Processes: U-Statistics as Performance Criteria
• In various situations, the performance criterion is not a basic sample mean statistic any more
• Examples:
  • Clustering: within-cluster point scatter related to a partition $\mathcal{P}$
    $\frac{2}{n(n-1)} \sum_{i<j} D(X_i, X_j) \sum_{\mathcal{C} \in \mathcal{P}} \mathbb{I}\{(X_i, X_j) \in \mathcal{C}^2\}$
  • Graph inference (link prediction)
  • Ranking
  • ...
• The empirical criterion is an average over all possible $k$-tuples: a $U$-statistic of degree $k \geq 2$
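For instance, the within-cluster point scatter above can be computed as the degree-2 U-statistic sketched below; the Euclidean dissimilarity `D` and the label-vector encoding of the partition are assumptions made for the illustration.

```python
import numpy as np
from itertools import combinations

def within_cluster_scatter(X, labels, D=lambda a, b: np.linalg.norm(a - b)):
    # (2 / (n(n-1))) * sum_{i<j} D(X_i, X_j) * 1{X_i and X_j fall in the same cell of P}
    n = len(X)
    total = sum(D(X[i], X[j]) for i, j in combinations(range(n), 2)
                if labels[i] == labels[j])
    return 2.0 * total / (n * (n - 1))
```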
Example: Ranking
• Data with ordinal labels:
  $(X_1, Y_1), \ldots, (X_n, Y_n) \in \big(\mathcal{X} \times \{1, \ldots, K\}\big)^{\otimes n}$
• Goal: rank $X_1, \ldots, X_n$ through a scoring function $s : \mathcal{X} \to \mathbb{R}$ such that $s(X)$ and $Y$ tend to increase/decrease together with high probability
• Quantitative formulation: maximize the criterion
  $L(s) = \mathbb{P}\{s(X^{(1)}) < \ldots < s(X^{(K)}) \mid Y^{(1)} = 1, \ldots, Y^{(K)} = K\}$
• Observations: $n_k$ i.i.d. copies of $X$ given $Y = k$: $X^{(k)}_1, \ldots, X^{(k)}_{n_k}$, with $n = n_1 + \ldots + n_K$
Example: Ranking
• A natural empirical counterpart of $L(s)$ is
  $\hat{L}_n(s) = \frac{\sum_{i_1=1}^{n_1}\cdots\sum_{i_K=1}^{n_K} \mathbb{I}\big\{s(X^{(1)}_{i_1}) < \ldots < s(X^{(K)}_{i_K})\big\}}{n_1 \times \cdots \times n_K}$
• But the number of terms to be summed, $n_1 \times \ldots \times n_K$, is prohibitive!
• Maximization of $\hat{L}_n(s)$ is computationally unfeasible...
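A brute-force sketch of $\hat{L}_n(s)$ (feasible only for tiny samples, which is exactly the point): `samples` is a list $[X_1, \ldots, X_K]$ where $X_k$ holds the observations with label $k$, and `s` is any scoring function; the interface and names are illustrative assumptions.

```python
import numpy as np
from itertools import product

def ranking_criterion(s, samples):
    # \hat{L}_n(s): fraction of the n_1 * ... * n_K tuples (one point per class)
    # whose scores increase with the class label
    scores = [np.array([s(x) for x in Xk]) for Xk in samples]
    n_terms = int(np.prod([len(sc) for sc in scores]))   # prohibitive as soon as the n_k grow
    hits = sum(all(tup[k] < tup[k + 1] for k in range(len(tup) - 1))
               for tup in product(*scores))
    return hits / n_terms
```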
Generalized U-statistics
• $K \geq 1$ samples and degrees $(d_1, \ldots, d_K) \in \mathbb{N}^{*K}$
• $(X^{(k)}_1, \ldots, X^{(k)}_{n_k})$, $1 \leq k \leq K$: $K$ independent i.i.d. samples drawn from $F_k(dx)$ on $\mathcal{X}_k$ respectively
• Kernel $H : \mathcal{X}_1^{d_1} \times \cdots \times \mathcal{X}_K^{d_K} \to \mathbb{R}$, square integrable w.r.t. $\mu = F_1^{\otimes d_1} \otimes \cdots \otimes F_K^{\otimes d_K}$
Generalized U-statistics
Definition. The $K$-sample $U$-statistic of degrees $(d_1, \ldots, d_K)$ with kernel $H$ is
  $U_n(H) = \frac{\sum_{I_1} \ldots \sum_{I_K} H\big(X^{(1)}_{I_1}; X^{(2)}_{I_2}; \ldots; X^{(K)}_{I_K}\big)}{\binom{n_1}{d_1} \times \cdots \times \binom{n_K}{d_K}},$
where $\sum_{I_k}$ refers to summation over all $\binom{n_k}{d_k}$ subsets $X^{(k)}_{I_k} = (X^{(k)}_{i_1}, \ldots, X^{(k)}_{i_{d_k}})$ related to a set $I_k$ of $d_k$ indexes $1 \leq i_1 < \ldots < i_{d_k} \leq n_k$.
It is said to be symmetric when $H$ is permutation symmetric in each set of $d_k$ arguments $X^{(k)}_{I_k}$.
Reference: Lee (1990)
Generalized U-statistics
• Unbiased estimator of
  $\theta(H) = \mathbb{E}\big[H(X^{(1)}_1, \ldots, X^{(1)}_{d_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{d_K})\big]$
  with minimum variance
• Asymptotically Gaussian as $n_k/n \to \lambda_k > 0$ for $k = 1, \ldots, K$
• Its computation requires the summation of $\prod_{k=1}^K \binom{n_k}{d_k}$ terms
• $K$-partite ranking: $d_k = 1$ for $1 \leq k \leq K$,
  $H_s(x_1, \ldots, x_K) = \mathbb{I}\{s(x_1) < s(x_2) < \cdots < s(x_K)\}$
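A sketch of the complete generalized U-statistic of the definition above, exposing the $\prod_k \binom{n_k}{d_k}$ cost; for the $K$-partite ranking kernel $H_s$ one would pass `degrees=(1, ..., 1)`. The interface and names are illustrative assumptions.

```python
from itertools import combinations, product

def complete_U_statistic(H, samples, degrees):
    # Average of H over every choice of a d_k-subset from each of the K samples:
    # the number of terms is prod_k C(n_k, d_k)
    index_sets = [list(combinations(range(len(Xk)), dk))
                  for Xk, dk in zip(samples, degrees)]
    total, count = 0.0, 0
    for choice in product(*index_sets):
        blocks = [tuple(Xk[i] for i in Ik) for Xk, Ik in zip(samples, choice)]
        total += H(*blocks)
        count += 1
    return total / count
```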
Incomplete U-statistics
• Replace $U_n(H)$ by an incomplete version, involving far fewer terms
• Build a set $\mathcal{D}_B$ of cardinality $B$ by sampling with replacement in the set $\Lambda$ of index tuples
  $\big((i^{(1)}_1, \ldots, i^{(1)}_{d_1}), \ldots, (i^{(K)}_1, \ldots, i^{(K)}_{d_K})\big)$
  with $1 \leq i^{(k)}_1 < \ldots < i^{(k)}_{d_k} \leq n_k$, $1 \leq k \leq K$
• Compute the Monte-Carlo version based on $B$ terms
  $\tilde{U}_B(H) = \frac{1}{B}\sum_{(I_1, \ldots, I_K) \in \mathcal{D}_B} H\big(X^{(1)}_{I_1}, \ldots, X^{(K)}_{I_K}\big)$
• An incomplete $U$-statistic is NOT a $U$-statistic
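A sketch of the incomplete version $\tilde{U}_B(H)$: $B$ index tuples are drawn uniformly with replacement from $\Lambda$ (one $d_k$-subset per sample), so the cost is $O(B)$ kernel evaluations regardless of the $n_k$.

```python
import numpy as np

def incomplete_U_statistic(H, samples, degrees, B, seed=0):
    # Monte-Carlo approximation of U_n(H): average of H over B tuples of D_B,
    # drawn with replacement from the full index set Lambda
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(B):
        blocks = []
        for Xk, dk in zip(samples, degrees):
            Ik = np.sort(rng.choice(len(Xk), size=dk, replace=False))
            blocks.append(tuple(Xk[int(i)] for i in Ik))
        total += H(*blocks)
    return total / B
```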
ERM based on incomplete U-statistics
• Replace the criterion by a tractable incomplete version based on $B = O(n)$ terms:
  $\min_{H \in \mathcal{H}} \tilde{U}_B(H)$
• This leads to investigating the maximal deviations
  $\sup_{H \in \mathcal{H}} \big|\tilde{U}_B(H) - U_n(H)\big|$
Main Result
Theorem. Let $\mathcal{H}$ be a VC major class of bounded symmetric kernels of finite VC dimension $V < +\infty$. Set $M_{\mathcal{H}} = \sup_{(H,x) \in \mathcal{H}\times\mathcal{X}} |H(x)|$. Then,
(i) $\mathbb{P}\Big\{\sup_{H \in \mathcal{H}} \big|\tilde{U}_B(H) - U_n(H)\big| > \eta\Big\} \leq 2(1 + \#\Lambda)^V \times e^{-B\eta^2/M^2_{\mathcal{H}}}$
(ii) for all $\delta \in (0,1)$, with probability at least $1 - \delta$, we have:
  $\frac{1}{M_{\mathcal{H}}} \sup_{H \in \mathcal{H}} \Big|\tilde{U}_B(H) - \mathbb{E}\big[\tilde{U}_B(H)\big]\Big| \leq 2\sqrt{\frac{2V\log(1+\kappa)}{\kappa}} + \sqrt{\frac{\log(2/\delta)}{\kappa}} + \sqrt{\frac{V\log(1+\#\Lambda) + \log(4/\delta)}{B}},$
where $\kappa = \min\{\lfloor n_1/d_1 \rfloor, \ldots, \lfloor n_K/d_K \rfloor\}$.
Consequences
• Empirical risk sampling with $B = O(n)$ yields a rate bound of order $O(\sqrt{\log n / n})$
• One suffers no loss in terms of learning rate, while drastically reducing the computational cost
Example: Ranking
Empirical ranking performance for SVMrank based on 1%, 5%, 10%, 20% and 100% of the "LETOR 2007" dataset.
Sketch of Proof
• Set $\epsilon = ((\epsilon_k(I))_{I \in \Lambda})_{1 \leq k \leq B}$, where $\epsilon_k(I)$ is equal to 1 if the tuple $I = (I_1, \ldots, I_K)$ has been selected at the $k$-th draw and to 0 otherwise
• The $\epsilon_k$'s are i.i.d. random vectors
• For all $(k, I) \in \{1, \ldots, B\} \times \Lambda$, the r.v. $\epsilon_k(I)$ has a Bernoulli distribution with parameter $1/\#\Lambda$
• With these notations,
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B}\sum_{k=1}^B Z_k(H),$
  where $Z_k(H) = \sum_{I \in \Lambda} \big(\epsilon_k(I) - 1/\#\Lambda\big) H(X_I)$
• Freezing the $X_I$'s, by virtue of Sauer's lemma:
  $\#\{(H(X_I))_{I \in \Lambda} : H \in \mathcal{H}\} \leq (1 + \#\Lambda)^V.$
Sketch of Proof (continued)
• Conditioned upon the $X_I$'s, $Z_1(H), \ldots, Z_B(H)$ are independent
• The first assertion is thus obtained by applying Hoeffding's inequality combined with the union bound
• Set
  $V_H\big(X^{(1)}_1, \ldots, X^{(1)}_{n_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{n_K}\big) = \kappa^{-1}\Big[ H\big(X^{(1)}_1, \ldots, X^{(1)}_{d_1}, \ldots, X^{(K)}_1, \ldots, X^{(K)}_{d_K}\big) + H\big(X^{(1)}_{d_1+1}, \ldots, X^{(1)}_{2d_1}, \ldots, X^{(K)}_{d_K+1}, \ldots, X^{(K)}_{2d_K}\big) + \ldots + H\big(X^{(1)}_{\kappa d_1 - d_1 + 1}, \ldots, X^{(1)}_{\kappa d_1}, \ldots, X^{(K)}_{\kappa d_K - d_K + 1}, \ldots, X^{(K)}_{\kappa d_K}\big)\Big],$
Sketch of Proof (continued)
• The proof of the second assertion is based on the Hoeffding decomposition
  $U_n(H) = \frac{1}{n_1! \cdots n_K!} \sum_{\sigma_1 \in \mathfrak{S}_{n_1}, \ldots, \sigma_K \in \mathfrak{S}_{n_K}} V_H\big(X^{(1)}_{\sigma_1(1)}, \ldots, X^{(K)}_{\sigma_K(n_K)}\big)$
• The concentration result is then obtained in a classical manner:
  • Convexity (Chernoff's bound)
  • Symmetrization
  • Randomization
  • Application of McDiarmid's bounded difference inequality
Beyond finite VC dimension
• Challenge: develop probabilistic tools and complexity assumptions to investigate the concentration properties of collections of sums of weighted binomials
  $\tilde{U}_B(H) - U_n(H) = \frac{1}{B}\sum_{k=1}^B Z_k(H),$
  with $Z_k(H) = \sum_{I \in \Lambda} \big(\epsilon_k(I) - 1/\#\Lambda\big) H(X_I)$
Some references
• Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. S. Clémençon, S. Robbiano and J. Tressou (2013). In Proceedings of the SIAM International Conference on Data Mining, Austin (USA).
• Empirical processes in survey sampling. P. Bertail, E. Chautru and S. Clémençon (2013). Submitted.
• A statistical view of clustering performance through the theory of U-processes. S. Clémençon (2014). Journal of Multivariate Analysis.
• On Survey Sampling and Empirical Risk Minimization. P. Bertail, E. Chautru and S. Clémençon (2014). ISAIM 2014, Fort Lauderdale (USA).
Introduction
Investigate the binary classification problem in the statistical learning context
• Data are not stored in a central unit but processed by independent agents (processors)
• Aim: not to find a consensus on a common classifier, but to find how to combine the local ones efficiently
• Solution: implement it in an on-line and distributed manner
Outline
Background
Proposed algorithm
Theoretical results
Improvement of agent selection
Numerical experiments
Learning problem
r.v. observation $X \in \mathcal{X} \subset \mathbb{R}^n$ $\longrightarrow$ sign$(H(X))$ $\longrightarrow$ r.v. binary output $Y \in \{-1, +1\}$
Given a training dataset $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$, in high dimension and with unknown joint distribution...
...find the best prediction rule sign$(H^\star)$ such that the classifier function $H^\star$:
  $H^\star = \arg\min_H P_e(H)$ where $P_e(H) = \mathbb{P}[-YH(X) > 0] = \mathbb{E}\big[\mathbb{1}_{\{-YH(X) > 0\}}\big]$
minimizes the probability of error $P_e$.
⚠ But the indicator $\mathbb{1}\{\cdot\}$ is not a differentiable function!
Learning problem
Majorize $\mathbb{E}\big[\mathbb{1}_{\{-YH(X)>0\}}\big]$ by a convex function: convex surrogate
  $\mathbb{E}\big[\mathbb{1}_{\{-YH(X)>0\}}\big] \leq \mathbb{E}\big[\varphi(-YH(X))\big]$
How? Use a cost function $\varphi$ with appropriate properties.
Example: use the quadratic function $\varphi(u) = \frac{(u+1)^2}{2} : \mathbb{R} \to [0, +\infty)$
Learning problem
...find the best prediction rule sign$(H^\star)$ such that the classifier function $H^\star$:
  $H^\star = \arg\min_H R_\varphi(H)$ where $R_\varphi(H) = \mathbb{E}\big[\varphi(-YH(X))\big]$
minimizes the risk function $R_\varphi(H)$.
✓ When $\varphi(u) = \frac{(u+1)^2}{2}$, sign$(H^\star)$ coincides with the Bayes classifier! (a short check follows below)
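A short check, not on the original slides, of why the quadratic surrogate recovers the Bayes rule: conditioning on $X = x$ and writing $\eta(x) = \mathbb{P}\{Y = 1 \mid X = x\}$,
  $\mathbb{E}\big[\varphi(-YH(X)) \mid X = x\big] = \tfrac{1}{2}\,\eta(x)\,(1 - H(x))^2 + \tfrac{1}{2}\,(1-\eta(x))\,(1 + H(x))^2,$
which is minimized pointwise at $H^\star(x) = 2\eta(x) - 1 = \mathbb{E}[Y \mid X = x]$, so that sign$(H^\star(x))$ is exactly the Bayes rule $g^*(x) = 2\,\mathbb{I}\{\eta(x) > 1/2\} - 1$.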
Aggregation of local classifiers
Consider a classification device composed of a set $\mathcal{V}$ of $N$ connected agents.
Each agent $v \in \mathcal{V}$:
• disposes of $\{(X_{1,v}, Y_{1,v}), \ldots, (X_{n_v,v}, Y_{n_v,v})\}$: $n_v$ independent copies of $(X, Y)$
• selects a local soft classifier function from a parametric class $\{h_v(\cdot, \theta_v)\}$
Set $\theta_v = (a_v, b_v)$; the global soft classifier is $H(x, \boldsymbol{\theta}) = \sum_{v \in \mathcal{V}} h_v(x, \theta_v)$,
where $h_v(x, \theta_v) = a_v h_v(x, b_v)$ and $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$.
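A minimal sketch of this aggregation rule; the particular parametric family $h_v(x, \theta_v) = a_v \tanh(\langle b_v, x \rangle)$ is an illustrative assumption, not the one used in the talk.

```python
import numpy as np

def local_classifier(x, theta_v):
    # illustrative local soft classifier h_v(x, theta_v) = a_v * tanh(<b_v, x>)
    a_v, b_v = theta_v
    return a_v * np.tanh(np.dot(b_v, x))

def global_classifier(x, thetas):
    # aggregated soft classifier H(x, theta) = sum_v h_v(x, theta_v)
    return sum(local_classifier(x, theta_v) for theta_v in thetas)

def predict(x, thetas):
    # estimated label: sign of the aggregated soft classifier
    return 1 if global_classifier(x, thetas) > 0 else -1
```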
Problem statement
The problem can be summarized as follows:
• given an observed data point $X$
• obtain the best estimated label $Y$ as sign$(H(X, \boldsymbol{\theta}))$
• where $\boldsymbol{\theta}$ is computed from the optimization problem, using the training data $(\mathbf{X}, \mathbf{Y}) = (X_i, Y_i)_{i=1,\ldots,n}$:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$
Problem statement
Approaches
1. Agreement on a common decision rule [Tsitsiklis-84', Agarwal-10']: consensus approach
   • find an average consensus solution: $\boldsymbol{\theta} = (\theta, \ldots, \theta)$
   • each agent uses the global classifier $H(X, \boldsymbol{\theta})$
2. Mixture of experts: cooperative approach
   • find the best aggregation solution: $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_N)$
   • each agent uses its local classifier $h_v(x, \theta_v)$
✓ Example: set $b_v = 0$, $a_v \geq 0$ and $h_v : \mathcal{X} \to \{-1, +1\}$ the weak classifier:
  $h_v(x, \theta_v) = a_v h_v(x)$
High rate distributed learning
Solve the minimization problem of the parametric risk function:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta}))$
High rate distributed learning
A standard distributed gradient descent iterative approach:
• generates a sequence of estimated parameter vectors $(\boldsymbol{\theta}_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
• at each agent $v$ the update step writes:
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, \mathbb{E}\big[Y \nabla_v h_v(X, \theta_{t,v})\, \varphi'(-Y H(X, \boldsymbol{\theta}_t))\big]$
⚠ but the joint distribution is unknown
High rate distributed learning
A standard distributed and on-line gradient descent iterative approach:
• generates a sequence of estimated parameter vectors $(\boldsymbol{\theta}_t)_{t \geq 1} = (\theta_{t,1}, \ldots, \theta_{t,N})_{t \geq 1}$
• each agent $v$ observes a pair $(X_{t+1,v}, Y_{t+1,v})$
• at each agent $v$ the update step writes (the expectation is replaced by its empirical version):
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \boldsymbol{\theta}_t))$
⚠ evaluating $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ is required at each $t$ and $v$!
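A sketch of one step of this standard scheme with the quadratic surrogate (so $\varphi'(u) = u + 1$), reusing the illustrative `local_classifier`/`global_classifier` family from the earlier sketch; `x_obs[v], y_obs[v]` is the pair observed by agent $v$ at time $t+1$. The point is that each agent needs the global value $H$, hence $N(N-1)$ exchanges.

```python
import numpy as np

def sgd_step_full_broadcast(thetas, x_obs, y_obs, gamma):
    # one on-line update per agent; every agent must evaluate the GLOBAL H at its own observation
    new_thetas = []
    for v, (a_v, b_v) in enumerate(thetas):
        H_glob = global_classifier(x_obs[v], thetas)        # needs all other agents' outputs
        phi_prime = -y_obs[v] * H_glob + 1.0                # phi'(-Y H) for phi(u) = (u+1)^2 / 2
        h_loc = np.tanh(np.dot(b_v, x_obs[v]))
        grad_a = h_loc                                      # d h_v / d a_v
        grad_b = a_v * (1.0 - h_loc ** 2) * x_obs[v]        # d h_v / d b_v
        new_thetas.append((a_v + gamma * y_obs[v] * grad_a * phi_prime,
                           b_v + gamma * y_obs[v] * grad_b * phi_prime))
    return new_thetas
```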
High rate distributed learning
Example (4-node network illustrated on the original slides):
• At iteration $t$, each agent $v \in \mathcal{V}$ holds $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$
• Each node $v$ sends its observation $X_{t,v}$ to all the other nodes
• Each node $v$ obtains the evaluations $h_w(X_{t,v}, \theta_{t,w})$ from all the other nodes and computes the global value, e.g. $H(X_{t,1}, \boldsymbol{\theta}_t) = \sum_{w=1}^4 h_w(X_{t,1}, \theta_{t,w})$
⚠ $N(N-1)$ communications per iteration ($N = 4 \Rightarrow 12$)!
Proposed distributed learning: the OLGA algorithm
✓ Replace the global $H(X_{t+1,v}, \boldsymbol{\theta}_t)$ by a local estimate $Y^{(\mathcal{V})}_{t+1,v}$ at each $v \in \mathcal{V}$ such that:
  $\mathbb{E}\big[Y^{(\mathcal{V})}_{t+1,v} \mid X_{t+1,v}, \boldsymbol{\theta}_t\big] = H(X_{t+1,v}, \boldsymbol{\theta}_t)$
How? Sparse communications with sparsity ratio $p$...
On-line Learning Gossip Algorithm (OLGA)
...for each $v \in \mathcal{V}$ at time $t$, the local gradient descent update writes:
  $\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v}\, Y^{(\mathcal{V})}_{t+1,v})$
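A sketch of one OLGA step under the same illustrative family: the global $H$ is replaced by the unbiased local estimate $Y^{(\mathcal{V})}_{t+1,v}$, built from the agents that answered (each with probability $p$) and reweighted by $1/p$.

```python
import numpy as np

def olga_step(thetas, x_obs, y_obs, gamma, p, rng):
    # one OLGA update per agent, with sparse (gossip) communications of ratio p
    N = len(thetas)
    new_thetas = []
    for v, (a_v, b_v) in enumerate(thetas):
        answered = rng.random(N) < p                        # which other agents respond
        # unbiased estimate of H(x, theta): own term + responding terms reweighted by 1/p
        Y_hat = local_classifier(x_obs[v], thetas[v]) + sum(
            local_classifier(x_obs[v], thetas[w]) / p
            for w in range(N) if w != v and answered[w])
        phi_prime = -y_obs[v] * Y_hat + 1.0                 # quadratic surrogate: phi'(u) = u + 1
        h_loc = np.tanh(np.dot(b_v, x_obs[v]))
        new_thetas.append((a_v + gamma * y_obs[v] * h_loc * phi_prime,
                           b_v + gamma * y_obs[v] * a_v * (1.0 - h_loc ** 2) * x_obs[v] * phi_prime))
    return new_thetas
```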
Proposed distributed learning: the OLGA algorithm
Example (same 4-node network):
• At iteration $t$, each agent $v \in \mathcal{V}$ holds $(X_{t,v}, \theta_{t,v})$ and evaluates its local $h_v(X_{t,v}, \theta_{t,v})$
• Each node $v$ sends its observation $X_{t,v}$ to randomly selected nodes, each selected with probability $p = 1/3$
• Each node obtains the evaluations $h_w(X_{t,v}, \theta_{t,w})$ from the selected nodes only; e.g. node 1 receives $h_4(X_{t,1}, \theta_{t,4})$ and computes its local estimate
  $Y^{(\mathcal{V})}_{t,1} = h_1(X_{t,1}, \theta_{t,1}) + \frac{1}{p}\, h_4(X_{t,1}, \theta_{t,4})$
⚠ $pN(N-1)$ communications per iteration ($N = 4$, $p = 1/3 \Rightarrow 4$, a reduction of 67%)!
Performance analysis
⚠ What is the effect of sparsification?
...study the behaviour of the vector sequence $\boldsymbol{\theta}_t$ as $t \to \infty$:
• the consistency of the final solution given by the algorithm
• quantify the excess of error variance due to the sparsity
Asymptotic behaviour of OLGA
Under suitable assumptions, we prove the following results:
1. Consistency:
  $(\boldsymbol{\theta}_t)_{t \geq 1} \xrightarrow{\rm a.s.} \boldsymbol{\theta}^\star \in \mathcal{L} = \{\nabla R_\varphi(\boldsymbol{\theta}) = 0\}$
2. CLT: conditioned on the event $\{\lim_{t\to\infty} \boldsymbol{\theta}_t = \boldsymbol{\theta}^\star\}$,
  $\gamma_t^{-1/2}(\boldsymbol{\theta}_t - \boldsymbol{\theta}^\star) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Sigma(\Gamma^\star))$
where:
  $\Gamma^\star = \underbrace{\mathbb{E}\big[(H(X, \boldsymbol{\theta}^\star) - Y)^2\, \nabla_v h_v(X, \theta^\star_v)\, \nabla^T_v h_v(X, \theta^\star_v)\big]}_{\text{estimation error in a centralized case}} + \underbrace{\frac{1-p}{p} \sum_{w \neq v} \mathbb{E}\big[h_w(X, \theta^\star_w)^2\, \nabla_v h_v(X, \theta^\star_v)\, \nabla^T_v h_v(X, \theta^\star_v)\big]}_{\text{additional noise term induced by the distributed setting}}$
A best agents selection approach
When...
⚠ the number of agents $N$ grows large → difficult to implement
⚠ redundant agents → avoid similar outputs
...include distributed agent selection!
How? Add an $\ell_1$-penalization term with tuning parameter $\lambda$:
  $\min_{\boldsymbol{\theta} \in \Theta} R_\varphi(H(X, \boldsymbol{\theta})) + \lambda \sum_v |a_v|$
where:
• the weight $a_v = 0$ for an idle agent and $a_v > 0$ when it is active
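The talk does not spell out how the $\ell_1$ term is handled in the update; one standard possibility, shown here as a purely hypothetical sketch, is a proximal (soft-thresholding) step on the weights $a_v$, after which agents whose weight reaches zero are declared idle.

```python
import numpy as np

def soft_threshold_weights(a, gamma, lam):
    # hypothetical proximal step for gamma * lam * sum_v |a_v| with a_v >= 0:
    # shrink every weight toward 0, drop agents whose weight reaches exactly 0
    a = np.maximum(a - gamma * lam, 0.0)
    active = np.nonzero(a > 0.0)[0]
    return a, active
```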
Including best agents selection in the OLGA algorithm
Introduce an update step at each time $t$ of OLGA to track the time-varying set of active nodes $S_t \subset \mathcal{V}$.
The extended algorithm is summarized as follows; at time $t$:
1. obtain the active nodes $S_t$ from the sequence of updated weights $(a_{t,1}, \ldots, a_{t,N})$
2. apply OLGA to the set of active agents $v \in S_t$:
   i) estimate the local $Y^{(S_t)}_{t+1,v}$ from a random selection among the current active nodes
   ii) update the local gradient descent
      $\theta_{t+1,v} = \theta_{t,v} + \gamma_t\, Y_{t+1,v} \nabla_v h_v(X_{t+1,v}, \theta_{t,v})\, \varphi'(-Y_{t+1,v}\, Y^{(S_t)}_{t+1,v})$
Example with simulated data
Binary classification of (+) and (o) data samples with $N = 60$ agents using weak linear classifiers (-). When using distributed selection, the number of active classifiers reduces to 25.
(a) OLGA    (b) OLGA with distributed selection
[scatter plots omitted]
Comparison with real data
Binary classification on the benchmark dataset banana using weak linear classifiers as $N$ increases.
Figure: error rate vs. number of weak learners, comparing a centralized and sequential approach (GentleBoost) with our distributed and on-line algorithm (OLGA, $p = 0.6$).
[plot omitted]
Conclusions
• A fully distributed and on-line algorithm is proposed for binary classification of big datasets, solved by $N$ processors
  ✓ the algorithm is then adapted to select the useful classifiers, reducing $N$
• We obtain theoretical results from the asymptotic analysis of the sequence estimated by OLGA
• Numerical results illustrate a behaviour comparable to a centralized, batch and sequential approach (GentleBoost)