An Efficient Membership-Query Algorithm for Learning DNF with
Respect to the Uniform Distribution
Jeffrey C. Jackson
Presented by: Eitan Yaakobi, Tamar Aizikowitz
Presentation Outline
- Introduction
- Algorithms We Use
  - Estimating Expected Values
  - Hypothesis Boosting
  - Finding Weak-approximating Parity Functions
- Learning DNF with Respect to Uniform
  - Existence of Weak Approximating Parity Functions for every f, D
  - Nonuniform Weak DNF Learning
  - Strongly Learning DNF
Introduction
- DNF is weakly learnable with respect to the uniform distribution, as shown by Kushilevitz and Mansour.
- We show that DNF is weakly learnable with respect to a certain class of nonuniform distributions.
- We then use a method based on Freund's boosting algorithm to produce a strong learner with respect to the uniform distribution.
Algorithms We Use
- Our learning algorithm makes use of several previous algorithms.
- The following is a short reminder of these algorithms.
Estimating Expected Values
- The AMEAN algorithm efficiently estimates the expectation of a random variable.
- It is based on Hoeffding's inequality: let $X_1, \dots, X_m$ be independent random variables such that $X_i \in [a,b]$ and $E[X_i] = \mu$. Then:
$$\Pr\!\left[\,\left|\frac{1}{m}\sum_{i=1}^{m} X_i - \mu\right| \ge \lambda\,\right] \;\le\; 2e^{-2\lambda^2 m/(b-a)^2}$$
The AMEAN Algorithm
- Input: a random variable $X \in [a,b]$; the range $b - a$; $\lambda, \delta > 0$
- Output: $\mu'$ such that $\Pr[\,|E[X] - \mu'| \le \lambda\,] \ge 1 - \delta$
- Running time: $O((b-a)^2 \log(\delta^{-1}) / \lambda^2)$
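A minimal Python sketch of such an estimator, with the sample count chosen via Hoeffding's inequality (the zero-argument sampling oracle `draw` is a hypothetical stand-in for whatever supplies samples of X):

```python
import math

def amean(draw, a, b, lam, delta):
    """Estimate E[X] to within +/-lam with probability >= 1 - delta.

    draw -- hypothetical zero-argument oracle returning one sample of X
    a, b -- known bounds with X always in [a, b]
    """
    # Hoeffding: 2*exp(-2*lam^2*m/(b-a)^2) <= delta once m is this large,
    # matching the O((b-a)^2 * log(1/delta) / lam^2) running time.
    m = math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * lam ** 2))
    return sum(draw() for _ in range(m)) / m
```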
Hypothesis Boosting
- Our algorithm is based on boosting weak hypotheses into a final strong hypothesis.
- We use a boosting method very similar to Freund's boosting algorithm.
- We refer to Freund's original algorithm as F1.
The F1 Boosting Algorithm
- Input: positive $\epsilon$, $\delta$, and $\gamma$; a $(\frac{1}{2} - \gamma)$-approximate PAC learner for a representation class $\mathcal{C}$; $EX(f, D)$ for some $f \in \mathcal{C}$ and any distribution $D$
- Output: an $\epsilon$-approximation for $f$ with respect to $D$, with probability at least $1 - \delta$
- Running time: polynomial in $n$, $s$, $\gamma^{-1}$, $\epsilon^{-1}$, and $\log(\delta^{-1})$
The Idea Behind F1 (1)
- The algorithm generates a series of weak hypotheses $h_i$.
- $h_0$ is a weak approximator for $f$ with respect to the distribution $D$.
- Each subsequent $h_i$ is a weak approximator for $f$ with respect to a distribution $D_i$.
The Idea Behind F1 (2)
- Each distribution $D_i$ focuses weight on those areas where slightly more than half of the hypotheses generated so far were incorrect.
- The final hypothesis $h$ is a majority vote over all the $h_i$'s.
The Idea Behind F1 (3)
- If a sufficient number of weak hypotheses is generated, then $h$ will be an $\epsilon$-approximator for $f$ with respect to the distribution $D$.
- Freund showed that $\frac{1}{2}\gamma^{-2}\ln(\epsilon^{-1})$ weak hypotheses suffice.
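The following Python skeleton illustrates the majority-vote structure just described; it is a schematic of the boosting loop, not Freund's actual reweighting rule (`weak_learn` and `reweight` are hypothetical placeholders for the weak learner and for F1's distribution update):

```python
import math

def f1_boost(weak_learn, reweight, D, gamma, eps):
    """Schematic of F1-style boosting: build weak hypotheses against
    successively reweighted distributions, output their majority vote."""
    k = math.ceil(0.5 * gamma ** -2 * math.log(1.0 / eps))  # Freund's bound
    hypotheses = []
    Di = D
    for _ in range(k):
        h = weak_learn(Di)            # (1/2 - gamma)-approximator w.r.t. Di
        hypotheses.append(h)
        Di = reweight(D, hypotheses)  # focus weight where about half of the
                                      # hypotheses so far err (hypothetical)
    # Final hypothesis: majority vote over the {-1,+1}-valued weak hypotheses.
    return lambda x: 1 if sum(h(x) for h in hypotheses) >= 0 else -1
```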
Finding Weak-approximating Parity Functions
- In order to use the boosting algorithm, we need to be able to generate weak approximators for our DNF $f$ with respect to the distributions $D_i$.
- Our algorithm is based on the Weak Parity algorithm (WP) of Kushilevitz and Mansour.
The WP Algorithm
- Finds the large Fourier coefficients of a Boolean function $f$ on $\{0,1\}^n$ using a membership oracle for $f$:
$$f = \sum_A \hat{f}(A)\,\chi_A, \qquad \hat{f}(A) = E[f\chi_A]$$
- Each coefficient $\hat{f}(A)$ represents the correlation between $f$ and the parity $\chi_A$.
- For each $A$ that has the large-coefficient property, $\chi_A$ is a weak approximator for $f$ with respect to the uniform distribution.
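To make the correlation concrete, here is a small Python sketch that estimates a single coefficient $\hat f(A) = E[f\chi_A]$ by sampling uniform inputs and querying a membership oracle (`mem_f` is a hypothetical oracle returning $f(x) \in \{-1,+1\}$; WP itself locates all large coefficients far more cleverly than this brute-force estimate):

```python
import random

def chi(A, x):
    """Parity chi_A(x) in {-1,+1}; A is a set of indices, x a 0/1 bit list."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def estimate_coefficient(mem_f, A, n, samples=10000):
    """Estimate f_hat(A) = E[f * chi_A] under the uniform distribution."""
    total = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        total += mem_f(x) * chi(A, x)
    return total / samples
```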
The WP' Algorithm (1)
- Our learning algorithm will need to find the large coefficients of a non-Boolean function.
- The basic WP algorithm can be extended to the WP' algorithm, which works for non-Boolean $f$ as well.
- WP' gives us a weak approximator for a non-Boolean $f$ with respect to the uniform distribution.
The WP' Algorithm (2)
- Input: $MEM(f)$ for $f: \{0,1\}^n \to \mathbb{R}$; $\theta, \delta > 0$; $n$; $L(f)$
- Output: with probability at least $1 - \delta$, WP' outputs a set $S$ such that for all $A$:
$$|\hat{f}(A)| \ge \theta \;\Rightarrow\; A \in S, \qquad A \in S \;\Rightarrow\; |\hat{f}(A)| \ge \frac{\theta}{2}$$
- Running time: $O\!\left(nL(f)^6\,\theta^{-6}\log\!\big(nL(f)^2/\theta^2\delta\big)\right)$
Learning DNF with Respect to Uniform
- We now show the main result: DNF is learnable with respect to the uniform distribution.
- We begin by showing that for every DNF $f$ and distribution $D$ there exists a parity function that weakly approximates $f$ with respect to $D$.
- We use this to produce an algorithm for weakly learning DNF with respect to certain nonuniform distributions.
- Finally, we show that this weak learner can be boosted into a strong learner with respect to the uniform distribution.
Existence of Weak Approximating Parity Functions for every f, D (1)
- For every DNF $f$ and every distribution $D$ there exists a parity function that weakly approximates $f$ with respect to $D$.
- If $E_D[f]$ is noticeably different from $0$, then the constant parity $\chi_\emptyset \equiv 1$ or its negation is a weak approximator.
- The more difficult case is when $E_D[f] \approx 0$.
Existence of Weak Approximating Parity Functions for every f, D (2)
- Let $f$ be a DNF such that $E_D[f] \approx 0$.
- Let $s$ be the number of terms in $f$.
- Let $T(x)$ be the $\{-1,+1\}$-valued function equivalent to the term of $f$ best correlated with $f$ with respect to $D$.
Existence of Weak Approximating Parity Functions for every f, D (3)
- $E_D[f] \approx 0 \;\Rightarrow\; \Pr_D[f = 1] \approx \Pr_D[f = -1] \approx \frac{1}{2}$
- Conditioning on the value of $f$:
$$\Pr_D[T(x) = f(x)] \;=\; \Pr_D[T(x) = f(x) \mid f(x) = 1]\,\Pr_D[f(x) = 1] \;+\; \Pr_D[T(x) = f(x) \mid f(x) = -1]\,\Pr_D[f(x) = -1]$$
$$\approx\; \frac{1}{2}\Big(\Pr_D[T(x) = f(x) \mid f(x) = 1] + \Pr_D[T(x) = f(x) \mid f(x) = -1]\Big)$$
Existence of Weak Approximating Parity Functions for every f, D (4)
- $T$ is a term of $f$ $\Rightarrow$ $\Pr_D[T(x) = f(x) \mid f(x) = -1] = 1$ (when $f$ is false, every term of $f$ is false).
- There are $s$ terms in $f$ and $T$ is the one best correlated with $f$ $\Rightarrow$ $\Pr_D[T(x) = f(x) \mid f(x) = 1] \ge \frac{1}{s}$
- Combining with the previous slide:
$$\Pr_D[T(x) = f(x)] \;\approx\; \frac{1}{2}\Big(\Pr_D[T = f \mid f = 1] + \Pr_D[T = f \mid f = -1]\Big) \;\ge\; \frac{1}{2}\Big(1 + \frac{1}{s}\Big)$$
- Hence $E_D[fT] \ge \frac{1}{s}$.
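The last step uses the standard identity relating agreement probability and correlation for $\{-1,+1\}$-valued functions:
$$E_D[fT] \;=\; \Pr_D[T = f] - \Pr_D[T \ne f] \;=\; 2\Pr_D[T = f] - 1 \;\ge\; 2 \cdot \frac{1}{2}\Big(1 + \frac{1}{s}\Big) - 1 \;=\; \frac{1}{s}$$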
Existence of Weak Approximating Parity Functions for every f, D (5)
- $T$ can be represented using the Fourier transform: $T = \sum_A \hat{T}(A)\,\chi_A$, where $\hat{T}(A) \ne 0$ only if $A$ is a subset of the variables in $T$.
- Replace $T$ with its Fourier representation in $E_D[fT]$:
$$E_D[fT] \;=\; \sum_A \hat{T}(A)\,E_D[f\chi_A]$$
- It follows that $\exists A$ s.t. $|E_D[f\chi_A]| \ge \frac{1}{2s+1}$.
Nonuniform Weak DNF Learning (1)
- We have shown that for every DNF $f$ and every distribution $D$ there exists a parity function that is a weak approximator for $f$ with respect to $D$.
- How can we find such a parity function?
- We want an algorithm that, when given a threshold $\theta$ and a distribution $D$, finds a parity $\chi_A$ such that, say, $|E_D[f\chi_A]| \ge \frac{\theta}{2}$.
Nonuniform Weak DNF Learning (2)
$$E_D[f\chi_A] \;=\; \sum_x f(x)\,\chi_A(x)\,D(x) \;=\; \frac{1}{2^n}\sum_x 2^n f(x)\,D(x)\,\chi_A(x)$$
- Define $g(x) = 2^n f(x) D(x)$. Then:
$$\frac{1}{2^n}\sum_x g(x)\,\chi_A(x) \;=\; E_{Unif}[g\chi_A] \;=\; \hat{g}(A)$$
- So $\hat{g}(A) = E_D[f\chi_A]$.
Nonuniform Weak DNF Learning (3)
- We have reduced the problem of finding a well-correlated parity to finding a large Fourier coefficient of $g$.
- $g$ is not Boolean, therefore we use WP'.
- Invocation: WP'($n$, $MEM(g)$, $\theta$, $L(g)$, $\delta$)
- $MEM(g)(x) = 2^n \cdot MEM(f)(x) \cdot D(x)$
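In code, this reduction is a one-liner: a membership oracle for $g$ is simulated from $MEM(f)$ and the distribution oracle (a sketch; `mem_f` and `dist` are hypothetical oracles for $f$ and $D$):

```python
def make_mem_g(mem_f, dist, n):
    """Simulate MEM(g) for g(x) = 2^n * f(x) * D(x) from MEM(f) and D."""
    return lambda x: (2 ** n) * mem_f(x) * dist(x)
```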
The WDNF Algorithm (1)
- We define a new algorithm: Weak DNF (WDNF).
- WDNF finds the large Fourier coefficients of $g(x) = 2^n f(x) D(x)$, thereby finding a parity that is well correlated with $f$ with respect to the distribution $D$.
- WDNF makes use of the WP' algorithm for finding the Fourier coefficients of the non-Boolean $g$.
The WDNF Algorithm (2)
- Let $g(x) = 2^n f(x) D(x)$. By the existence result, $g$ has a coefficient of magnitude at least $\frac{1}{2s+1}$.
- Invocation: WP'($n$, $MEM(g)$, $\theta = \frac{1}{2s+1}$, $L(2^n D)$, $\delta$)
- Output with probability $1 - \delta$: a set containing every $A$ with $|\hat{g}(A)| = |E_D[f\chi_A]| \ge \frac{1}{2s+1}$; each returned $\chi_A$ satisfies $|E_D[f\chi_A]| = \Omega(s^{-1})$.
- Running time: polynomial in $n$, $s$, $\log(\delta^{-1})$, and $L(2^n D)$.
The WDNF Algorithm (3)
- Input: $EX(f, D)$; $MEM(f)$; $D$; $\delta > 0$
- Output: with probability at least $1 - \delta$, a parity function $h$ (possibly negated) s.t. $E_D[fh] = \Omega(s^{-1})$
- Running time: polynomial in $n$, $s$, $\log(\delta^{-1})$, and $L(2^n D)$
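A schematic of how WDNF could chain the pieces together, assuming a routine with the interface of WP' is injected as `wp_prime` (all names here are hypothetical stand-ins for the sketches above, not the paper's actual code):

```python
def wdnf(mem_f, dist, n, s, delta, wp_prime):
    """Sketch: find a parity well correlated with f w.r.t. D by running WP'
    (hypothetical `wp_prime`) on g(x) = 2^n * f(x) * D(x)."""
    mem_g = lambda x: (2 ** n) * mem_f(x) * dist(x)  # MEM(g), as on slide 24
    theta = 1.0 / (2 * s + 1)                        # existence threshold
    # WP' returns a set S of index sets A with |g_hat(A)| >= theta/2; each
    # such A names a parity chi_A with |E_D[f * chi_A]| = Omega(1/s).
    return wp_prime(n, mem_g, theta, delta)
```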
The WDNF Algorithm (4)
- WDNF is polynomial in $L(g) = L(2^n D)$.
- If $D$ is at most $\mathrm{poly}(n, s, \epsilon^{-1})/2^n$ then WDNF runs in time polynomial in the normal parameters.
- Such a $D$ is referred to as polynomially-near uniform.
- WDNF weakly learns DNF with respect to any polynomially-near uniform distribution $D$.
Strongly Learning DNF
- We define the Harmonic Sieve algorithm (HS).
- HS is an application of the F1 boosting algorithm to the weak learner given by WDNF.
- The main difference between HS and F1 is the need to supply WDNF with an oracle for the distribution $D_i$ at each stage of boosting.
The HS Algorithm (1)
- Input: $EX(f, D)$; $MEM(f)$; $D$; $s$; $\epsilon, \delta > 0$
- Output: with probability $1 - \delta$, a hypothesis $h$ s.t. $h$ is an $\epsilon$-approximator of $f$ with respect to $D$
- Running time: polynomial in $n$, $s$, $\epsilon^{-1}$, $\log(\delta^{-1})$, and $L(2^n D)$
The HS Algorithm (2)
- For WDNF to work, and work efficiently, two requirements must be met:
  - An oracle for the distribution must be provided to the learner.
  - The distribution must be polynomially-near uniform.
- We show how to simulate an approximate oracle $D_i'$ that can be provided to the weak learner instead of an exact one.
- We then show that the distributions $D_i$ are in fact polynomially-near uniform.
Simulating $D_i$ (1)
- Define:
$$D_i(x) \;=\; \frac{D(x)\,r_i(x)}{\sum_y D(y)\,r_i(y)}$$
- To provide an exact oracle we would need to compute the denominator $\sum_y D(y)\,r_i(y)$, which could potentially take exponentially long.
- Instead, we estimate its value using AMEAN.
Simulating $D_i$ (2)
- Let $X$ be the random variable obtained by drawing an example $\langle x, f(x)\rangle$ from $EX(f, D)$ and computing $r_i(x)$; then $E[X] = \sum_y D(y)\,r_i(y)$, the denominator.
- Estimating $E[X]$ with AMEAN($X$, $b - a$, $\cdot/3$, $\delta'$) takes polynomial time and guarantees an estimate $\hat{E}$ with:
$$\frac{2}{3}\sum_y D(y)\,r_i(y) \;\le\; \hat{E} \;\le\; 2\sum_y D(y)\,r_i(y)$$
- Define $D_i'(x) = D(x)\,r_i(x)/\hat{E}$. Then $\frac{1}{2}D_i \le D_i' \le \frac{3}{2}D_i$; i.e., there is a constant $c_i \in [\frac{1}{2}, \frac{3}{2}]$ such that $D_i'(x) = c_i\,D_i(x)$.
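A sketch of the simulated oracle, reusing the earlier AMEAN sketch to estimate the denominator once (`dist`, `r_i`, and `draw_example` are hypothetical oracles for $D$, the weighting $r_i$, and $EX(f, D)$; the range assumption $r_i(x) \in [0,1]$ is ours for concreteness):

```python
def make_approx_dist_oracle(dist, r_i, draw_example, amean, lam, delta):
    """Simulate D_i'(x) = D(x) * r_i(x) / E_hat, where E_hat estimates the
    denominator sum_y D(y) * r_i(y) (computing it exactly could take
    exponential time)."""
    def draw_X():
        # X = r_i(x) for <x, f(x)> drawn from EX(f, D);
        # E[X] equals the denominator sum_y D(y) * r_i(y).
        x, _fx = draw_example()
        return r_i(x)
    e_hat = amean(draw_X, 0.0, 1.0, lam, delta)  # assumes r_i(x) in [0, 1]
    # Within the slide's tolerance, the result is c_i * D_i for some
    # constant c_i in [1/2, 3/2], which WDNF tolerates.
    return lambda x: dist(x) * r_i(x) / e_hat
```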
Implications of Using $D_i'$
- Note that:
$$g_i' \;=\; 2^n f D_i' \;=\; 2^n f\,c_i D_i \;=\; c_i\,g_i$$
- Multiplying the distribution oracle by a constant is like multiplying all the coefficients of $g_i$ by the same constant.
- The relative sizes of the coefficients stay the same, so WDNF will still be able to find the large coefficients.
- The running time is not adversely affected.
Bound on Distributions $D_i$
- It can be shown that for each $i$:
$$L(D_i) \;\le\; \frac{3\,L(D)}{\epsilon}$$
- Thus $D_i$ is bounded by a polynomial in $L(D)$ and $\epsilon^{-1}$.
- If $D$ is polynomially-near uniform then $D_i$ is also polynomially-near uniform.
- Therefore, HS strongly learns DNF with respect to the uniform distribution.
Summary
- DNF can be weakly learned with respect to polynomially-near uniform distributions using the WDNF algorithm.
- The HS algorithm strongly learns DNF with respect to the uniform distribution by boosting the WDNF weak learner.