Post on 21-Dec-2015
Fourier Analysis and Boolean Function Learning
Jeff Jackson
Duquesne University
www.mathcs.duq.edu/~jackson
Themes
• Fourier analysis is central to learning-theoretic results in a wide variety of models
– Results are generally the strongest known for learning Boolean function classes with respect to the uniform distribution
• Work on learning problems has led to some new harmonic results
– Spectral properties of Boolean function classes
– Algorithms for approximating Boolean functions
Uniform Learning Model
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Example Oracle EX(f): uniform random examples <x, f(x)>
Learning Algorithm A
Accuracy ε > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
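The model above can be made concrete with a small sketch. The names `example_oracle` and `uniform_error`, and the toy target and hypothesis, are illustrative assumptions rather than anything from the slides; the error is computed by brute-force enumeration, which is only feasible for small n.

```python
import itertools
import random

def example_oracle(f, n, rng):
    """EX(f): draw x uniformly from {0,1}^n and return the labelled pair <x, f(x)>."""
    x = tuple(rng.randint(0, 1) for _ in range(n))
    return x, f(x)

def uniform_error(f, h, n):
    """Exact Pr_{x~U}[f(x) != h(x)], by enumerating all 2^n inputs (small n only)."""
    points = list(itertools.product((0, 1), repeat=n))
    return sum(f(x) != h(x) for x in points) / len(points)

# Hypothetical target: the small DNF x1 OR (x2 AND x3) on n = 3 bits.
def target(x):
    return x[0] | (x[1] & x[2])

# Hypothetical hypothesis that ignores the second term.
def hypothesis(x):
    return x[0]

rng = random.Random(0)
sample = [example_oracle(target, 3, rng) for _ in range(5)]
err = uniform_error(target, hypothesis, 3)  # the two differ only on x = (0, 1, 1)
```

Here `hypothesis` would count as an acceptable output of A for any accuracy parameter ε larger than `err`.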
Circuit Classes
• Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)
• DNF: depth-2 circuit with OR at root
[Figure: circuit with d levels of gates over inputs v1, v2, v3, ..., vn; negations allowed]
Function Size
• Each function representation has a natural size measure:
– CDC, DNF: # of gates
– DT (decision tree): # of leaves
• Size $s_F(f)$ of f with respect to class F is the size of the smallest representation of f within F
– For all Boolean f, $s_{CDC}(f) \le s_{DNF}(f) \le s_{DT}(f)$
Efficient Uniform Learning Model
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Example Oracle EX(f): uniform random examples <x, f(x)>
Learning Algorithm A
Accuracy ε > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
Time poly(n, $s_F$, 1/ε)
Harmonic-Based Uniform Learning
• [LMN]: constant-depth circuits are quasi-efficiently ($n^{\mathrm{polylog}(s/\varepsilon)}$-time) uniform learnable
• [BT]: monotone Boolean functions are uniform learnable in time roughly $2^{\sqrt{n}\log n}$
– Monotone: for all x, i: $f(x|_{x_i=0}) \le f(x|_{x_i=1})$
– Also exponential in 1/ε (so assumes ε constant)
– But independent of any size measure
Notation
• Assume $f : \{0,1\}^n \to \{-1,1\}$
• For all a in $\{0,1\}^n$, $\chi_a(x) \equiv (-1)^{a \cdot x}$
• For all a in $\{0,1\}^n$, the Fourier coefficient $\hat{f}(a)$ of f at a is:
  $\hat{f}(a) \equiv E_{x \sim U}[f(x)\,\chi_a(x)]$
• Sometimes write, e.g., $\hat{f}(\{1\})$ for $\hat{f}(10\ldots0)$
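As a minimal sketch of the notation, the coefficients of a small function can be computed directly from the definition by enumerating the cube. The helper names and the choice of 3-bit majority as the example are assumptions made for illustration.

```python
import itertools

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x)."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier_coefficient(f, a, n):
    """f_hat(a) = E_{x~U}[f(x) chi_a(x)], computed by exact enumeration."""
    points = itertools.product((0, 1), repeat=n)
    return sum(f(x) * chi(a, x) for x in points) / 2 ** n

def spectrum(f, n):
    """All 2^n Fourier coefficients of f, keyed by the index vector a."""
    return {a: fourier_coefficient(f, a, n)
            for a in itertools.product((0, 1), repeat=n)}

# Illustrative target with range {-1, 1}: majority of 3 bits.
def maj3(x):
    return 1 if sum(x) >= 2 else -1

spec = spectrum(maj3, 3)
# Parseval: the squared coefficients of a {-1,1}-valued function sum to 1.
total_weight = sum(c * c for c in spec.values())
```

With this sign convention a monotone function has non-positive degree-1 coefficients, since $\chi_{\{i\}}(x) = -1$ exactly when $x_i = 1$.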
Fourier Properties of Classes
$\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$ if:
• [LMN]: f is a constant-depth circuit of depth d and $S = \{a : |a| < \log^d(s/\varepsilon)\}$ ($|a| \equiv$ # of 1's in a)
• [BT]: f is a monotone Boolean function and $S = \{a : |a| < \sqrt{n}/\varepsilon\}$
Proof Techniques
• [LMN]: Håstad's Switching Lemma + harmonic analysis
• [BT]: Based on [KKL]
– Define $AS(f) \equiv n \cdot \Pr_{x,i}[f(x|_{x_i=0}) \ne f(x|_{x_i=1})]$
– If $S = \{a : |a| < AS(f)/\varepsilon\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
– For monotone f, harmonic analysis + Cauchy-Schwarz shows $AS(f) \le \sqrt{n}$
– Note: This is tight for MAJ
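The average sensitivity AS(f) used in the [BT] argument can be checked numerically on small cases. The sketch below computes it by brute force for majority on a few odd n and compares it with $\sqrt{n}$; the function names are illustrative assumptions.

```python
import itertools
import math

def average_sensitivity(f, n):
    """AS(f) = n * Pr_{x,i}[f(x|x_i=0) != f(x|x_i=1)], by exact enumeration."""
    disagreements = 0
    for x in itertools.product((0, 1), repeat=n):
        for i in range(n):
            x0 = x[:i] + (0,) + x[i + 1:]
            x1 = x[:i] + (1,) + x[i + 1:]
            disagreements += f(x0) != f(x1)
    # disagreements / (n * 2^n) is the probability over uniform (x, i)
    return n * disagreements / (n * 2 ** n)

def majority(x):
    return 1 if 2 * sum(x) > len(x) else -1

as_maj = {n: average_sensitivity(majority, n) for n in (3, 5, 7, 9)}
within_bound = all(as_maj[n] <= math.sqrt(n) for n in as_maj)
```

For these n the ratio AS(MAJ)/√n stays near 0.8, which is consistent with the note that the bound is tight for MAJ up to a constant factor.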
Function Approximation
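• For all Boolean f, $f = \sum_{a \in \{0,1\}^n} \hat{f}(a)\,\chi_a$
• For $S \subseteq \{0,1\}^n$, define $f_S \equiv \sum_{a \in S} \hat{f}(a)\,\chi_a$
• [LMN]: $\Pr_{x \sim U}[f(x) \ne \mathrm{sign}(f_S(x))] \le \sum_{a \notin S} \hat{f}^2(a)$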
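The [LMN] approximation bound follows from a short standard argument, sketched here for completeness. It uses only the definitions of $f_S$ and $\hat{f}$ given in the notation; the intermediate step is a reconstruction, not taken from the slides.

```latex
% Whenever sign(f_S(x)) differs from f(x) \in \{-1,1\}, we have |f(x) - f_S(x)| \ge 1, so
\Pr_{x \sim U}\bigl[f(x) \ne \mathrm{sign}(f_S(x))\bigr]
  \;\le\; E_{x \sim U}\bigl[(f(x) - f_S(x))^2\bigr]
  \;=\; E_{x \sim U}\Bigl[\Bigl(\sum_{a \notin S} \hat{f}(a)\,\chi_a(x)\Bigr)^{2}\Bigr]
  \;=\; \sum_{a \notin S} \hat{f}^2(a).
% The last equality is Parseval: E[\chi_a \chi_b] = 1 if a = b and 0 otherwise.
```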
“The” Fourier Learning Algorithm
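• Given: ε (and perhaps s, d, ...)
• Determine k such that for $S = \{a : |a| < k\}$, $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
• Draw a sufficiently large sample of examples <x, f(x)> to closely estimate $\hat{f}(a)$ for all $a \in S$
– Chernoff bounds: ~$n^k/\varepsilon$ sample size sufficient
• Output $h \equiv \mathrm{sign}(\sum_{a \in S} \tilde{f}(a)\,\chi_a)$, where $\tilde{f}(a)$ is the estimate of $\hat{f}(a)$
• Run time ~ $n^{2k}/\varepsilon$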
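The steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the degree bound k is supplied by hand, the sample size is picked generously instead of from a Chernoff bound, and the target (majority of the first three of five bits) is hypothetical.

```python
import itertools
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x)."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def low_degree_learn(sample, n, k):
    """Estimate f_hat(a) for every |a| < k from labelled examples and return
    h(x) = sign(sum_a estimate(a) * chi_a(x))."""
    low = [a for a in itertools.product((0, 1), repeat=n) if sum(a) < k]
    m = len(sample)
    est = {a: sum(y * chi(a, x) for x, y in sample) / m for a in low}

    def h(x):
        total = sum(c * chi(a, x) for a, c in est.items())
        return 1 if total >= 0 else -1

    return h

# Hypothetical target with all Fourier weight on |a| <= 3.
def target(x):
    return 1 if x[0] + x[1] + x[2] >= 2 else -1

n, k = 5, 4
rng = random.Random(1)
sample = []
for _ in range(4000):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    sample.append((x, target(x)))

h = low_degree_learn(sample, n, k)
cube = list(itertools.product((0, 1), repeat=n))
error = sum(h(x) != target(x) for x in cube) / len(cube)
```

The cost is dominated by estimating one coefficient per index in S, which is where the roughly $n^k$ factor in the running time comes from.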
Halfspaces
• [KOS]: Halfspaces are efficiently uniform learnable (given ε is constant)
– Halfspace: $\exists\, w \in R^{n+1}$ s.t. $f(x) = \mathrm{sign}(w \cdot (x \circ 1))$
– If $S = \{a : |a| < (21/\varepsilon)^2\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
– Apply LMN algorithm
• Similar result applies for an arbitrary function applied to a constant number of halfspaces
– Intersection of halfspaces is a key learning problem
Halfspace Techniques
• [O] (cf. [BKS], [BJTa]):
– Noise sensitivity of f at γ is the probability that corrupting each bit of x with probability γ changes f(x)
– $NS_\gamma(f) \equiv \tfrac12\bigl(1 - \sum_a (1-2\gamma)^{|a|} \hat{f}^2(a)\bigr)$
• [KOS]:
– If $S = \{a : |a| < 1/\gamma\}$ then $\sum_{a \notin S} \hat{f}^2(a) < 3\,NS_\gamma(f)$
– If f is a halfspace then $NS_\gamma(f) < 9\sqrt{\gamma}$
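The two descriptions of noise sensitivity, the probabilistic one and the Fourier formula, can be compared numerically. The following sketch computes both exactly for 3-bit majority; the function names and the choice γ = 0.1 are illustrative assumptions.

```python
import itertools

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x)."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def noise_sensitivity_direct(f, n, gamma):
    """Pr[f(x) != f(x xor r)], x uniform, each r_i = 1 independently w.p. gamma."""
    cube = list(itertools.product((0, 1), repeat=n))
    total = 0.0
    for r in cube:
        flips = sum(r)
        weight = gamma ** flips * (1 - gamma) ** (n - flips)
        changed = sum(f(x) != f(tuple(xi ^ ri for xi, ri in zip(x, r)))
                      for x in cube)
        total += weight * changed / len(cube)
    return total

def noise_sensitivity_fourier(f, n, gamma):
    """(1/2) * (1 - sum_a (1 - 2*gamma)^|a| * f_hat(a)^2)."""
    cube = list(itertools.product((0, 1), repeat=n))
    stable = 0.0
    for a in cube:
        coeff = sum(f(x) * chi(a, x) for x in cube) / len(cube)
        stable += (1 - 2 * gamma) ** sum(a) * coeff ** 2
    return 0.5 * (1 - stable)

def majority(x):
    return 1 if 2 * sum(x) > len(x) else -1

direct = noise_sensitivity_direct(majority, 3, 0.1)
fourier = noise_sensitivity_fourier(majority, 3, 0.1)
```

Both routes give the same value because $E[f(x) f(x \oplus r)] = \sum_a (1-2\gamma)^{|a|} \hat{f}^2(a)$ when the bits of r are independent.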
Monotone DT
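• [OS]: Monotone functions are efficiently learnable given:
– ε is constant
– $s_{DT}(f)$ is used as the size measure
• Techniques:
– Harmonic analysis: for monotone f, $AS(f) \le \sqrt{\log s_{DT}(f)}$
– [BT]: If $S = \{a : |a| < AS(f)/\varepsilon\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
– Friedgut: $\exists\, T$ with $|T| \le 2^{AS(f)/\varepsilon}$ s.t. $\sum_{A \not\subseteq T} \hat{f}^2(A) < \varepsilon$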
Weak Approximators
• KKL also show that if f is monotone, there is an i such that $-\hat{f}(\{i\}) \ge$ log2n/n
• Therefore $\Pr[f(x) = -\chi_{\{i\}}(x)] \ge$ ½ + log2n/2n
• In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f
• If A outputs a weak approximator for every f in F, then F is weakly learnable
Weak Uniform Learning Model
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Example Oracle EX(f): uniform random examples <x, f(x)>
Learning Algorithm A
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \tfrac12 - 1/p(n,s)$
Efficient Weak Learning Algorithm for Monotone Boolean Functions
• Draw a set of ~$n^2$ examples <x, f(x)>
• For i = 1 to n
– Estimate $\hat{f}(\{i\})$
• Output $h \equiv -\chi_{\{i^*\}}$, where $i^* = \arg\max_i\,(-\hat{f}(\{i\}))$
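A minimal sketch of this weak learner follows. It assumes the convention $f : \{0,1\}^n \to \{-1,1\}$ with $\chi_{\{i\}}(x) = (-1)^{x_i}$, so a monotone f correlates with $-\chi_{\{i\}}$; the target function, the sample size, and the names are hypothetical choices for illustration.

```python
import itertools
import random

def weak_learn_monotone(sample, n):
    """Estimate each f_hat({i}) = E[f(x) * (-1)^(x_i)] from the sample and
    return (i, h), where h = -chi_{i} for the i with the most negative estimate."""
    m = len(sample)
    est = [sum(y * (-1) ** x[i] for x, y in sample) / m for i in range(n)]
    best = min(range(n), key=lambda i: est[i])

    def h(x):
        # -chi_{best}(x): +1 exactly when x[best] == 1
        return 1 if x[best] == 1 else -1

    return best, h

# Hypothetical monotone target on n = 4 bits: x3 OR (x1 AND x2); x4 is irrelevant.
def target(x):
    return 1 if (x[2] or (x[0] and x[1])) else -1

n = 4
rng = random.Random(0)
sample = []
for _ in range(400):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    sample.append((x, target(x)))

best, h = weak_learn_monotone(sample, n)
cube = list(itertools.product((0, 1), repeat=n))
agreement = sum(h(x) == target(x) for x in cube) / len(cube)
```

In this toy case the chosen variable happens to give a strong approximator; in general the guarantee from [KKL] is only a small advantage over ½.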
Weak Approximation for MAJ of Constant-Depth Circuits
• Note that adding a single MAJ to a CDC destroys the LMN spectral property
• [JKS]: MAJ of CDC's is quasi-efficiently quasi-weak uniform learnable
– If f is a MAJ of CDC's of depth d, and if the number of gates in f is s, then there is an $A \in \{0,1\}^n$ such that
• $|A| < \log^d s \equiv k$
• $\Pr[f(x) = \chi_A(x)] \ge \tfrac12 + 1/(4sn^k)$
Weak Learning Algorithm
• Compute $k = \log^d s$
• Draw ~$sn^k$ examples <x, f(x)>
• Repeat for |A| < k
– Estimate $\hat{f}(A)$
• Until find A s.t. $\hat{f}(A) > 1/(2sn^k)$
• Output $h \equiv \chi_A$
• Run time ~$n^{\mathrm{polylog}(s)}$
Weak Approximator Proof Techniques
• “Discriminator Lemma” (HMPST)
– Implies one of the CDC’s is a weak approximator to f
• LMN spectral characterization of CDC
• Harmonic analysis
• Beigel result used to extend weak learning to CDC with polylog MAJ gates
Boosting
• In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [FS], …)
– Need to learn weakly with respect to near-uniform distributions
• For near-uniform distribution D, find weak $h_j$ s.t. $\Pr_{x \sim D}[h_j = f] > \tfrac12 + 1/\mathrm{poly}(n,s)$
– Final h is typically a MAJ of weak approximators
Strong Learning for MAJ of Constant-Depth Circuits
• [JKS]: MAJ of CDC is quasi-efficiently uniform learnable
– Show that for near-uniform distributions, some parity function is a weak approximator
– Beigel result again extends to CDC with polylog MAJ gates
• [KP] + boosting: there are distributions for which no parity is a weak approximator
Uniform Learning from a Membership Oracle
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Membership Oracle MEM(f): on query x, returns f(x)
Learning Algorithm A
Accuracy ε > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
Uniform Membership Learning of Decision Trees
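• [KM]
– $L_1(f) \equiv \sum_a |\hat{f}(a)| \le s_{DT}(f)$
– If $S = \{a : |\hat{f}(a)| \ge \varepsilon/L_1(f)\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
– [GL]: Algorithm (membership oracle) for finding $\{a : |\hat{f}(a)| \ge \theta\}$ in time ~$n/\theta^6$
– So can efficiently uniform membership learn DT
– Output h has the same form as LMN: $h \equiv \mathrm{sign}(\sum_{a \in S} \tilde{f}(a)\,\chi_a)$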
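The search for large coefficients can be illustrated with a prefix-bucket sketch in the style of Kushilevitz-Mansour. This is an assumption-laden toy: bucket weights are computed exactly by enumeration, whereas the real algorithm estimates them by sampling with membership queries, and the decision-tree target and all names are hypothetical.

```python
import itertools

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x)."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def bucket_weight(f, n, alpha):
    """Sum of f_hat(a)^2 over all a whose first len(alpha) bits equal alpha.
    Uses E_x[(E_y[f(y.x) chi_alpha(y)])^2], evaluated exactly here."""
    k = len(alpha)
    total = 0.0
    for x in itertools.product((0, 1), repeat=n - k):
        inner = sum(f(y + x) * chi(alpha, y)
                    for y in itertools.product((0, 1), repeat=k))
        total += (inner / 2 ** k) ** 2
    return total / 2 ** (n - k)

def heavy_coefficients(f, n, theta):
    """All a with |f_hat(a)| >= theta: extend prefixes one bit at a time and
    drop any prefix whose bucket weight is below theta^2."""
    prefixes = [()]
    for _ in range(n):
        prefixes = [p + (b,) for p in prefixes for b in (0, 1)
                    if bucket_weight(f, n, p + (b,)) >= theta ** 2]
    return sorted(prefixes)

# Hypothetical 4-leaf decision tree: query x1, then x2 if x1 = 1, else x3.
def tree(x):
    branch = x[1] if x[0] else x[2]
    return 1 if branch else -1

heavy = heavy_coefficients(tree, 3, 0.4)
```

Because the squared coefficients sum to 1, at most $1/\theta^2$ prefixes survive at each level, which is what keeps the search polynomial in n and 1/θ.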
Uniform Membership Learning of DNF
• [J]
– $(\forall$ distributions $D)\ \exists\, \chi_a$ s.t. $\Pr_{x \sim D}[f(x) = \chi_a(x)] \ge \tfrac12 + 1/(6 s_{DNF})$
– Modified [GL] can efficiently locate such $\chi_a$ given an oracle for near-uniform D
• Boosters can provide such an oracle when uniform learning
– Boosting provides strong learning
• [BJTb], [KS], [F]
– For near-uniform D, can find $\chi_a$ in time ~$ns^2$
Uniform Learning from a Random Walk Oracle
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Random Walk Oracle RW(f): random walk examples <x, f(x)>
Learning Algorithm A
Accuracy ε > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
Random Walk DNF Learning
• [BMOS]
– Noise sensitivity and related values can be accurately estimated using a random walk oracle
• $NS_\gamma(f) \equiv \tfrac12\bigl(1 - \sum_a (1-2\gamma)^{|a|} \hat{f}^2(a)\bigr)$
• $T_b(f) \equiv \sum_{a \supseteq b} \rho^{|a|} \hat{f}^2(a)$
– Estimate of Tb(f) is efficient if |b| logarithmic
– Only need logarithmic |b| to learn DNF [BF]
Random Walk Parity Learning
• [JW] (unpub)
– Effectively, [BMOS] is limited to finding “heavy” Fourier coefficients $\hat{f}(a)$ for logarithmic |a|
– Using a “breadth-first” variation of KM, can locate any $|\hat{f}(a)| > \theta$ in time $O(n^{\log 1/\theta})$
– “Heavy” coefficient corresponds to a parity function that weakly approximates
Uniform Learning from a Classification Noise Oracle
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Classification Noise Oracle $EX^{\eta}(f)$: for uniform random x, returns <x, f(x)> with probability 1-η and <x, -f(x)> with probability η
Learning Algorithm A
Accuracy ε > 0
Error rate η > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
Uniform Learning from a Statistical Query Oracle
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Statistical Query Oracle SQ(f): on query (q(), τ), returns $E_U[q(x, f(x))] \pm \tau$
Learning Algorithm A
Accuracy ε > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
SQ and Classification Noise Learning
• [K]
– If F is uniform SQ learnable in time poly(n, $s_F$, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, $s_F$, 1/ε, 1/τ, 1/(1-2η))
– Empirically, almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ is polynomial in the other parameters)
• Exception: $F = PAR_n \equiv \{\chi_a : a \in \{0,1\}^n, |a| \le n\}$
Uniform SQ Hardness for PAR
• [BFJKMR]
– Harmonic analysis shows that for any q, $\chi_a$: $E_U[q(x, \chi_a(x))] = \hat{q}(0^{n+1}) + \hat{q}(a \circ 1)$
– Thus adversarial SQ response to (q, τ) is $\hat{q}(0^{n+1})$ whenever $|\hat{q}(a \circ 1)| < \tau$
– Parseval: $|\hat{q}(b \circ 1)| < \tau$ for all but $1/\tau^2$ Fourier coefficients
– So a ‘bad’ query eliminates only polynomially many coefficients
– Even $PAR_{\log n}$ is not efficiently SQ learnable
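The identity behind the hardness argument can be verified numerically on a small cube. In this sketch the query q is an arbitrary random table over (x, label), and the label -1 is encoded as a final bit equal to 1 so that q becomes a function on $\{0,1\}^{n+1}$; this encoding and all names are assumptions made for illustration.

```python
import itertools
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x)."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

n = 3
cube = list(itertools.product((0, 1), repeat=n))
cube_plus = list(itertools.product((0, 1), repeat=n + 1))

rng = random.Random(0)
# Hypothetical query: a random real-valued table on (x, label bit).
q_table = {z: rng.uniform(-1, 1) for z in cube_plus}

def q_hat(b):
    """Fourier coefficient of q, viewed as a function on {0,1}^(n+1)."""
    return sum(q_table[z] * chi(b, z) for z in cube_plus) / len(cube_plus)

def expected_query_value(a):
    """E_U[q(x, chi_a(x))]; the label chi_a(x) is encoded as the bit a.x mod 2."""
    total = 0.0
    for x in cube:
        label_bit = sum(ai & xi for ai, xi in zip(a, x)) % 2
        total += q_table[x + (label_bit,)]
    return total / len(cube)

zero = (0,) * (n + 1)
gaps = [abs(expected_query_value(a) - (q_hat(zero) + q_hat(a + (1,))))
        for a in cube]
```

Only the single coefficient $\hat{q}(a \circ 1)$ depends on the target parity, which is why a query with small coefficients reveals almost nothing about a.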
Uniform Learning from an Attribute Noise Oracle
Boolean Function Class F (e.g., DNF)
Target function $f : \{0,1\}^n \to \{0,1\}$
Attribute Noise Oracle $EX_{D_N}(f)$: for uniform random x, returns <x ⊕ r, f(x)>, r ~ $D_N$
Noise model $D_N$
Learning Algorithm A
Accuracy ε > 0
Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$
Uniform Learning with Independent Attribute Noise
• [BJTa]:
– LMN algorithm produces estimates of $\hat{f}(a) \cdot E_{r \sim D_N}[\chi_a(r)]$
• Example application
– Assume noise process $D_N$ is a product distribution: $D_N(x) = \prod_i (p_i x_i + (1-p_i)(1-x_i))$
– Assume $p_i < 1/\mathrm{polylog}\, n$ and 1/ε at most quasi-poly(n) (mild restrictions)
– Then modified LMN uniform learns attribute noisy AC0 in quasi-poly time
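The effect of product attribute noise on a coefficient estimate can be checked exactly on a small example. The sketch assumes the noisy example is <x XOR r, f(x)> with each $r_i = 1$ independently with probability $p_i$, in which case $E_r[\chi_a(r)] = \prod_{i : a_i = 1} (1 - 2p_i)$; the target, the noise rates, and the names are hypothetical.

```python
import itertools

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x)."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def noise_prob(r, p):
    """D_N(r) = prod_i (p_i r_i + (1 - p_i)(1 - r_i))."""
    out = 1.0
    for ri, pi in zip(r, p):
        out *= pi if ri else 1 - pi
    return out

def noisy_coefficient(f, a, n, p):
    """E_{x~U, r~D_N}[f(x) chi_a(x xor r)]: the quantity an LMN-style estimate
    converges to when attributes are corrupted (assumed XOR noise model)."""
    cube = list(itertools.product((0, 1), repeat=n))
    total = 0.0
    for x in cube:
        for r in cube:
            noisy_x = tuple(xi ^ ri for xi, ri in zip(x, r))
            total += noise_prob(r, p) * f(x) * chi(a, noisy_x)
    return total / len(cube)

def clean_coefficient(f, a, n):
    cube = list(itertools.product((0, 1), repeat=n))
    return sum(f(x) * chi(a, x) for x in cube) / len(cube)

def attenuation(a, p):
    """E_{r~D_N}[chi_a(r)] for product noise: prod over i in a of (1 - 2 p_i)."""
    out = 1.0
    for ai, pi in zip(a, p):
        if ai:
            out *= 1 - 2 * pi
    return out

def maj3(x):
    return 1 if sum(x) >= 2 else -1

p = (0.1, 0.2, 0.05)  # hypothetical per-attribute flip probabilities
cube3 = list(itertools.product((0, 1), repeat=3))
gaps = [abs(noisy_coefficient(maj3, a, 3, p)
            - clean_coefficient(maj3, a, 3) * attenuation(a, p))
        for a in cube3]
```

If the $p_i$ are known and small, dividing each estimate by the attenuation factor recovers $\hat{f}(a)$, which is one way to read the "modified LMN" step.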
Agnostic Learning Model
Arbitrary Boolean Function
Target function $f : \{0,1\}^n \to \{0,1\}$
Example Oracle EX(f): uniform random examples <x, f(x)>
Learning Algorithm A
Accuracy ε > 0
Hypothesis h in H s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] \le opt_H + \varepsilon$
Agnostic Learning of Halfspaces
• [KKMS]
– Agnostic learning algorithm for H the set of halfspaces
– Algorithm is not Fourier-based (L1 regression)
• However, a somewhat weaker result can be obtained by simple Fourier analysis
Near-Agnostic Learning via LMN
• [KKMS]:
– Let f be an arbitrary Boolean function
– Fix any set $S \subseteq \{1..n\}$ and fix ε
– Let g be any function s.t.
• $\sum_{a \notin S} \hat{g}^2(a) < \varepsilon$ and
• $\Pr[f \ne g]$ (call this η) is minimized over all such g
– Then for h learned by LMN by estimating coefficients of f over S:
• $\Pr[f \ne h] < 4\eta + \varepsilon$
Summary
• Most uniform-learning results for Boolean function classes depend on harmonic analysis
• Learning theory provides motivation for new harmonic observations
• Even very “weak” harmonic results can be useful in learning-theory algorithms
Some Open Problems
• Efficient uniform learning of monotone DNF
– Best to date for small $s_{DNF}$ is [Ser], time ~$n s^{\log s}$ (based on [BT], [M], [LMN])
• Non-uniform learning
– Relatively easy to extend many results to product distributions, e.g. [FJS] extends [LMN]
– Key issue in real-world applicability
Open Problems (cont’d)
• Weaker dependence on ε
– Several algorithms fully exponential (or worse) in 1/ε
• Additional proper learning results
– Allows for interpretation of learned hypothesis
References
• Beigel. When Do Extra Majority Gates Help? ...
• [BFJKMR] Blum, Furst, Jackson, Kearns, Mansour, Rudich. Weakly Learning DNF...
• [BJTa] Bshouty, Jackson, Tamon. Uniform-Distribution Attribute Noise Learnability.
• [BJTb] Bshouty, Jackson, Tamon. More Efficient PAC-learning of DNF...
• [BKS] Benjamini, Kalai, Schramm. Noise Sensitivity of Boolean Functions...
• [BMOS] Bshouty, Mossel, O’Donnell, Servedio. Learning DNF from Random Walks.
• [BT] Bshouty, Tamon. On the Fourier Spectrum of Monotone Functions.
• [F] Feldman. Attribute Efficient and Non-adaptive Learning of Parities...
• [FJS] Furst, Jackson, Smith. Improved Learning of AC0 Functions.
• [FS] Freund, Schapire. A Decision-theoretic Generalization of On-line Learning...
• Friedgut. Boolean Functions with Low Average Sensitivity Depend on Few Coordinates.
• [HMPST] Hajnal, Maass, Pudlak, Szegedy, Turan. Threshold Circuits of Bounded Depth.
• [J] Jackson. An Efficient Membership-Query Algorithm for Learning DNF...
• [JKS] Jackson, Klivans, Servedio. Learnability Beyond AC0.
• [JW] Jackson, Wimmer. In prep.
• [KKL] Kahn, Kalai, Linial. The Influence of Variables on Boolean Functions.
• [KKMS] Kalai, Klivans, Mansour, Servedio. On Agnostic Boosting and Parity Learning.
• [K] Kearns. Efficient Noise-tolerant Learning from Statistical Queries.
• [KM] Kushilevitz, Mansour. Learning Decision Trees using the Fourier Spectrum.
• [KOS] Klivans, O’Donnell, Servedio. Learning Intersections and Thresholds of Halfspaces.
• [KP] Krause, Pudlak. On Computing Boolean Functions by Sparse Real Polynomials.
• [KS] Klivans, Servedio. Boosting and Hard-core Sets.
• [LMN] Linial, Mansour, Nisan. Constant-depth Circuits, Fourier Transform, and Learnability.
• [M] Mansour. An $O(n^{\log\log n})$ Learning Algorithm for DNF...
• [O] O’Donnell. Hardness Amplification within NP.
• [OS] O’Donnell, Servedio. Learning Monotone Functions from Random Examples in Polynomial Time.
• [S] Schapire. The Strength of Weak Learnability.
• [Ser] Servedio. On Learning Monotone DNF under Product Distributions.