Fourier Analysis and Boolean Function Learning Jeff Jackson Duquesne University jackson.


Fourier Analysis and Boolean Function Learning

Jeff Jackson

Duquesne University

www.mathcs.duq.edu/~jackson

Themes

• Fourier analysis is central to learning-theoretic results in a wide variety of models
  – Results generally are the strongest known for learning Boolean function classes with respect to the uniform distribution

• Work on learning problems has led to some new harmonic results
  – Spectral properties of Boolean function classes
  – Algorithms for approximating Boolean functions

Uniform Learning Model

Boolean function class F (e.g., DNF)

Target function $f : \{0,1\}^n \to \{0,1\}$

Example oracle EX(f): supplies uniform random examples <x, f(x)>

Learning algorithm A, given accuracy ε > 0

Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \varepsilon$

Circuit Classes

• Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)

• DNF: depth-2 circuit with OR at root

[Figure: constant-depth circuit with d levels of gates over inputs v1, v2, v3, ..., vn; negations allowed]

Decision Trees

[Figure: decision tree with root variable v3, evaluated on x = 11001 in three steps: at the root, x3 = 0; at the next node, x1 = 1; the leaf reached gives f(x) = 1]

Function Size

• Each function representation has a natural size measure:
  – CDC, DNF: # of gates
  – DT: # of leaves

• Size $s_F(f)$ of f with respect to class F is the size of the smallest representation of f within F
  – For all Boolean f, $s_{CDC}(f) \le s_{DNF}(f) \le s_{DT}(f)$

Efficient Uniform Learning Model

Boolean function class F (e.g., DNF)

Example oracle EX(f): supplies examples <x, f(x)>

Learning algorithm A, given accuracy ε > 0

Time: $\mathrm{poly}(n, s_F, 1/\varepsilon)$

Harmonic-Based Uniform Learning

• [LMN]: constant-depth circuits are quasi-efficiently ($n^{\mathrm{polylog}(s/\varepsilon)}$-time) uniform learnable

• [BT]: monotone Boolean functions are uniform learnable in time roughly $2^{\sqrt{n}\log n}$
  – Monotone: for all x, i: $f(x|x_i=0) \le f(x|x_i=1)$
  – Also exponential in 1/ε (so assumes ε constant)
  – But independent of any size measure

Notation

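• Assume $f : \{0,1\}^n \to \{-1,1\}$

• For all $a \in \{0,1\}^n$, $\chi_a(x) \equiv (-1)^{a \cdot x}$

• For all $a \in \{0,1\}^n$, the Fourier coefficient $\hat{f}(a)$ of f at a is:

  $\hat{f}(a) \equiv E_{x \sim U}[f(x)\,\chi_a(x)]$

• Sometimes write, e.g., $\hat{f}(\{1\})$ for $\hat{f}(10\ldots0)$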
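The definition of the Fourier coefficient can be checked directly on a small function. The sketch below computes coefficients exactly by enumerating all 2^n inputs; the helper names (`chi`, `fourier_coefficient`, `maj3`) are illustrative and not from the slides, and the brute-force enumeration is only practical for small n.

```python
from itertools import product

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for bit tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier_coefficient(f, a, n):
    """Exact f_hat(a) = E_{x~U}[f(x) * chi_a(x)], by enumerating {0,1}^n."""
    total = sum(f(x) * chi(a, x) for x in product((0, 1), repeat=n))
    return total / 2 ** n

def maj3(x):
    """Majority of three bits as a {-1,1}-valued function (+1 when most bits are 1)."""
    return 1 if sum(x) >= 2 else -1

# All eight coefficients of maj3, keyed by the index vector a.
spectrum = {a: fourier_coefficient(maj3, a, 3) for a in product((0, 1), repeat=3)}
```

For this example the only nonzero coefficients are the three singletons and the full parity, and their squares sum to 1, as Parseval's identity requires for a {-1,1}-valued function.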

Fourier Properties of Classes

Spectral properties: $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$ if:

• [LMN]: f is a constant-depth circuit of depth d and $S = \{a : |a| < \log^d(s/\varepsilon)\}$ ($|a| \equiv$ # of 1's in a)

• [BT]: f is a monotone Boolean function and $S = \{a : |a| < \sqrt{n}/\varepsilon\}$

Proof Techniques

• [LMN]: Håstad's Switching Lemma + harmonic analysis

• [BT]: Based on [KKL]
  – Define $AS(f) \equiv n \cdot \Pr_{x,i}[f(x|x_i=0) \ne f(x|x_i=1)]$
  – If $S = \{a : |a| < AS(f)/\varepsilon\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
  – For monotone f, harmonic analysis + Cauchy-Schwarz shows $AS(f) \le \sqrt{n}$
  – Note: This is tight for MAJ
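The average sensitivity bound can be checked numerically for small majority functions. This is a brute-force sketch under the slide's definition of AS(f); the function names are illustrative assumptions.

```python
from itertools import product
from math import sqrt

def maj(x):
    """Majority as a {-1,1}-valued function (+1 when more than half the bits are 1)."""
    return 1 if 2 * sum(x) > len(x) else -1

def average_sensitivity(f, n):
    """AS(f) = n * Pr_{x,i}[f(x|x_i=0) != f(x|x_i=1)], computed exactly."""
    flips = 0
    for x in product((0, 1), repeat=n):
        for i in range(n):
            x0 = x[:i] + (0,) + x[i + 1:]
            x1 = x[:i] + (1,) + x[i + 1:]
            if f(x0) != f(x1):
                flips += 1
    # Pr over uniform (x, i) is flips / (n * 2^n); multiplying by n leaves flips / 2^n.
    return flips / 2 ** n

# Compare AS(MAJ_n) with sqrt(n) for a few odd n.
table = {n: (average_sensitivity(maj, n), sqrt(n)) for n in (1, 3, 5, 7)}
```

For odd n the value grows proportionally to √n, which is the sense in which the monotone bound is tight for MAJ.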

Function Approximation

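• For all Boolean f, $f = \sum_{a \in \{0,1\}^n} \hat{f}(a)\,\chi_a$

• For $S \subseteq \{0,1\}^n$, define $f_S \equiv \sum_{a \in S} \hat{f}(a)\,\chi_a$

• [LMN]: $\Pr_{x \sim U}[f(x) \ne \mathrm{sign}(f_S(x))] \le \sum_{a \notin S} \hat{f}^2(a)$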
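The [LMN] approximation bound follows from a short calculation, sketched here under the slide's conventions: f takes values in {-1,1}, and f_S denotes the Fourier expansion of f restricted to the coefficients in S.

```latex
% If f(x) \ne \mathrm{sign}(f_S(x)), then f_S(x) has the wrong sign (or is zero),
% so |f(x) - f_S(x)| \ge 1 at every such x. Hence
\Pr_{x \sim U}\bigl[f(x) \ne \mathrm{sign}(f_S(x))\bigr]
  \;\le\; E_{x \sim U}\bigl[(f(x) - f_S(x))^2\bigr]
  \;=\; \sum_{a \notin S} \hat{f}^2(a)
% The last step is Parseval's identity applied to
% f - f_S = \sum_{a \notin S} \hat{f}(a)\,\chi_a .
```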

“The” Fourier Learning Algorithm

• Given: ε (and perhaps s, d, ...)

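• Determine k such that for $S = \{a : |a| < k\}$, $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$

• Draw a sufficiently large sample of examples <x, f(x)> to closely estimate $\hat{f}(a)$ for all $a \in S$
  – Chernoff bounds: ~$n^k/\varepsilon$ sample size sufficient

• Output $h \equiv \mathrm{sign}(\sum_{a \in S} \hat{f}(a)\,\chi_a)$, using the estimated coefficients

• Run time ~ $n^{2k}/\varepsilon$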
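The steps above can be sketched in a few lines. This is a minimal illustration, assuming a simulated example oracle and a fixed sample size rather than the Chernoff-derived one; `draw_sample`, `low_degree_learn`, and the choice of `maj3` as target are hypothetical names introduced for the example.

```python
import random
from itertools import combinations, product

def chi(a, x):
    """chi_a(x) = (-1)^(sum of x_i for i in a); here a is a tuple of indices."""
    return -1 if sum(x[i] for i in a) % 2 else 1

def draw_sample(f, n, m, seed=0):
    """Simulate m calls to EX(f): uniform random x paired with f(x)."""
    rng = random.Random(seed)
    xs = (tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m))
    return [(x, f(x)) for x in xs]

def low_degree_learn(sample, n, k):
    """Estimate f_hat(a) for all |a| < k and return h = sign(sum of est * chi_a)."""
    index_sets = [a for size in range(k) for a in combinations(range(n), size)]
    m = len(sample)
    est = {a: sum(y * chi(a, x) for x, y in sample) / m for a in index_sets}

    def h(x):
        # sign(0) is taken to be +1 here (an arbitrary tie-breaking choice).
        return 1 if sum(c * chi(a, x) for a, c in est.items()) >= 0 else -1

    return h

def maj3(x):
    return 1 if sum(x) >= 2 else -1

h = low_degree_learn(draw_sample(maj3, 3, 2000), n=3, k=2)
errors = sum(h(x) != maj3(x) for x in product((0, 1), repeat=3))
```

With k = 2 the hypothesis uses only the constant and singleton coefficients, which is already enough to recover the majority of three bits.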

Halfspaces

• [KOS]: Halfspaces are efficiently uniform learnable (given ε is constant)
  – Halfspace: $\exists\, w \in \mathbb{R}^{n+1}$ s.t. $f(x) = \mathrm{sign}(w \cdot (x \circ 1))$
  – If $S = \{a : |a| < (21/\varepsilon)^2\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
  – Apply LMN algorithm

• Similar result applies for an arbitrary function applied to a constant number of halfspaces
  – Intersection of halfspaces is a key learning problem

Halfspace Techniques

• [O] (cf. [BKS], [BJTa]):
  – Noise sensitivity of f at γ is the probability that corrupting each bit of x with probability γ changes f(x)
  – $NS_\gamma(f) \equiv \tfrac{1}{2}\bigl(1 - \sum_a (1-2\gamma)^{|a|}\,\hat{f}^2(a)\bigr)$

• [KOS]:
  – If $S = \{a : |a| < 1/\gamma\}$ then $\sum_{a \notin S} \hat{f}^2(a) < 3\,NS_\gamma(f)$
  – If f is a halfspace then $NS_\gamma(f) < 9\sqrt{\gamma}$
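The Fourier expression for noise sensitivity can be compared against the probabilistic definition on a small example. The sketch below computes both exactly by enumeration, reading "corrupting" a bit as flipping it; the function names and the choice of the 3-bit majority (a halfspace) are assumptions made for illustration.

```python
from itertools import product

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for bit tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def maj3(x):
    return 1 if sum(x) >= 2 else -1

def ns_direct(f, n, gamma):
    """Pr[f(x) != f(x XOR z)]: x uniform, each z_i = 1 independently with prob. gamma."""
    cube = list(product((0, 1), repeat=n))
    total = 0.0
    for z in cube:
        pz = gamma ** sum(z) * (1 - gamma) ** (n - sum(z))
        changed = sum(f(x) != f(tuple(xi ^ zi for xi, zi in zip(x, z))) for x in cube)
        total += pz * changed / len(cube)
    return total

def ns_fourier(f, n, gamma):
    """(1/2) * (1 - sum_a (1 - 2*gamma)^|a| * f_hat(a)^2)."""
    cube = list(product((0, 1), repeat=n))
    s = 0.0
    for a in cube:
        coeff = sum(f(x) * chi(a, x) for x in cube) / len(cube)
        s += (1 - 2 * gamma) ** sum(a) * coeff ** 2
    return 0.5 * (1 - s)

direct = ns_direct(maj3, 3, 0.1)
spectral = ns_fourier(maj3, 3, 0.1)
```

Both routes give the same value for this function, and it sits well below the 9√γ bound stated for halfspaces.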

Monotone DT

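• [OS]: Monotone functions are efficiently learnable given:
  – ε is constant
  – $s_{DT}(f)$ is used as the size measure

• Techniques:
  – Harmonic analysis: for monotone f, $AS(f) \le \sqrt{\log s_{DT}(f)}$
  – [BT]: If $S = \{a : |a| < AS(f)/\varepsilon\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
  – Friedgut: $\exists\, T$ with $|T| \le 2^{AS(f)/\varepsilon}$ s.t. $\sum_{A \not\subseteq T} \hat{f}^2(A) < \varepsilon$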

Weak Approximators

• KKL also show that if f is monotone, there is an i such that $-\hat{f}(\{i\}) \ge \log^2 n / n$

• Therefore $\Pr[f(x) = -\chi_{\{i\}}(x)] \ge \tfrac{1}{2} + \log^2 n / 2n$

• In general, h s.t. $\Pr[f = h] \ge \tfrac{1}{2} + 1/\mathrm{poly}(n,s)$ is called a weak approximator to f

• If A outputs a weak approximator for every f in F, then F is weakly learnable

Uniform Learning Model

Boolean function class F (e.g., DNF)

Example oracle EX(f): supplies examples <x, f(x)>

Learning algorithm A, given accuracy ε > 0

Weak Uniform Learning Model

Boolean function class F (e.g., DNF)

Example oracle EX(f): supplies examples <x, f(x)>

Learning algorithm A

Hypothesis $h : \{0,1\}^n \to \{0,1\}$ s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] < \tfrac{1}{2} - 1/p(n,s)$

Efficient Weak Learning Algorithm for Monotone Boolean Functions

• Draw a set of ~$n^2$ examples <x, f(x)>

• For i = 1 to n
  – Estimate $\hat{f}(\{i\})$

• Output $h \equiv -\chi_{\{i\}}$ for the i that maximizes the estimate of $-\hat{f}(\{i\})$
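A minimal sketch of this weak learner, assuming a simulated example oracle and a fixed sample size; the names `draw_sample`, `weak_learn_monotone`, and the example target are hypothetical and introduced only for illustration.

```python
import random
from itertools import product

def draw_sample(f, n, m, seed=0):
    """Simulate m calls to EX(f): uniform random x paired with f(x)."""
    rng = random.Random(seed)
    xs = (tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m))
    return [(x, f(x)) for x in xs]

def weak_learn_monotone(sample, n):
    """Pick the variable i with the largest estimated -f_hat({i}); output h = -chi_{i}."""
    m = len(sample)
    # chi_{i}(x) = (-1)^(x_i), so -f_hat({i}) = -E[f(x) * (-1)^(x_i)].
    neg_coeffs = [-sum(y * (-1 if x[i] else 1) for x, y in sample) / m
                  for i in range(n)]
    best = max(range(n), key=lambda i: neg_coeffs[i])

    def h(x):
        return 1 if x[best] else -1  # -chi_{best}(x): +1 exactly when x_best = 1

    return best, h

def target(x):
    """An example monotone function: x0 OR (x1 AND x2), with values in {-1,1}."""
    return 1 if x[0] or (x[1] and x[2]) else -1

best, h = weak_learn_monotone(draw_sample(target, 4, 2000), 4)
cube = list(product((0, 1), repeat=4))
agreement = sum(h(x) == target(x) for x in cube) / len(cube)
```

On this small target the single-variable hypothesis does much better than the worst-case guarantee; the general bound only promises an advantage that shrinks with n.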

Weak Approximation for MAJ of Constant-Depth Circuits

• Note that adding a single MAJ to a CDC destroys the LMN spectral property

• [JKS]: MAJ of CDC's is quasi-efficiently quasi-weak uniform learnable
  – If f is a MAJ of CDC's of depth d, and if the number of gates in f is s, then there is a set $A \in \{0,1\}^n$ such that
    • $|A| < \log^d s \equiv k$
    • $\Pr[f(x) = \chi_A(x)] \ge \tfrac{1}{2} + 1/(4sn^k)$

Weak Learning Algorithm

• Compute $k = \log^d s$

• Draw ~$sn^k$ examples <x, f(x)>

• Repeat for $|A| < k$
  – Estimate $\hat{f}(A)$

• Until find A s.t. $\hat{f}(A) > 1/(2sn^k)$

• Output $h \equiv \chi_A$

• Run time ~$n^{\mathrm{polylog}(s)}$

Weak Approximator Proof Techniques

• "Discriminator Lemma" (HMPST)
  – Implies one of the CDC's is a weak approximator to f

• LMN spectral characterization of CDC

• Harmonic analysis

• Beigel result used to extend weak learning to CDC with polylog MAJ gates

Boosting

• In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [FS], …)
  – Need to learn weakly with respect to near-uniform distributions
    • For near-uniform distribution D, find weak $h_j$ s.t. $\Pr_{x \sim D}[h_j = f] > \tfrac{1}{2} + 1/\mathrm{poly}(n,s)$
  – Final h is typically a MAJ of weak approximators

Strong Learning for MAJ of Constant-Depth Circuits

• [JKS]: MAJ of CDC is quasi-efficiently uniform learnable
  – Show that for near-uniform distributions, some parity function is a weak approximator
  – Beigel result again extends to CDC with polylog MAJ gates

• [KP] + boosting: there are distributions for which no parity is a weak approximator

Uniform Learning from a Membership Oracle

Boolean function class F (e.g., DNF)

Membership oracle MEM(f): on query x from the learner, returns f(x)

Learning algorithm A, given accuracy ε > 0

Uniform Membership Learning of Decision Trees

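• [KM]
  – $L_1(f) \equiv \sum_a |\hat{f}(a)| \le s_{DT}(f)$
  – If $S = \{a : |\hat{f}(a)| \ge \varepsilon/L_1(f)\}$ then $\sum_{a \notin S} \hat{f}^2(a) < \varepsilon$
  – [GL]: Algorithm (membership oracle) for finding $\{a : |\hat{f}(a)| \ge \theta\}$ in time ~$n/\theta^6$
  – So can efficiently uniform membership learn DT
  – Output h has the same form as LMN: $h \equiv \mathrm{sign}(\sum_{a \in S} \hat{f}(a)\,\chi_a)$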
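The search for large coefficients with a membership oracle can be illustrated with a simplified bucket-splitting sketch. It is not the exact procedure or sample bound from [KM] or [GL]: the function names, the fixed sample count, and the θ²/2 pruning threshold are assumptions chosen for illustration. It relies on the identity that the total squared weight of coefficients whose index begins with a prefix α equals E[f(yx) f(zx) χ_α(y) χ_α(z)], for independent uniform prefixes y, z and a shared uniform suffix x.

```python
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for bit tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def bucket_weight(f, n, prefix, samples, rng):
    """Estimate the sum of f_hat(prefix + b)^2 over all suffixes b, using queries to f."""
    k = len(prefix)
    total = 0
    for _ in range(samples):
        x = tuple(rng.randint(0, 1) for _ in range(n - k))  # shared suffix
        y = tuple(rng.randint(0, 1) for _ in range(k))      # first prefix
        z = tuple(rng.randint(0, 1) for _ in range(k))      # second prefix
        total += f(y + x) * f(z + x) * chi(prefix, y) * chi(prefix, z)
    return total / samples

def find_heavy(f, n, theta, samples=4000, seed=0):
    """Return index vectors a whose estimated squared coefficient is at least theta^2 / 2."""
    rng = random.Random(seed)
    frontier = [()]
    for _ in range(n):
        extended = []
        for prefix in frontier:
            for bit in (0, 1):
                candidate = prefix + (bit,)
                # Keep a bucket only if it could still contain a heavy coefficient.
                if bucket_weight(f, n, candidate, samples, rng) >= theta ** 2 / 2:
                    extended.append(candidate)
        frontier = extended
    return frontier

def maj3(x):
    return 1 if sum(x) >= 2 else -1

heavy = sorted(find_heavy(maj3, 3, theta=0.4))
```

Because each surviving bucket carries weight at least θ²/2 and the total weight is 1 by Parseval, the frontier never holds more than about 2/θ² prefixes, which is what keeps this style of search efficient.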

Uniform Membership Learning of DNF

• [J]
  – ($\forall$ distributions D) $\exists\, \chi_a$ s.t. $\Pr_{x \sim D}[f(x) = \chi_a(x)] \ge \tfrac{1}{2} + 1/(6\,s_{DNF})$
  – Modified [GL] can efficiently locate such $\chi_a$ given an oracle for near-uniform D
    • Boosters can provide such an oracle when uniform learning
  – Boosting provides strong learning

• [BJTb], [KS], [F]
  – For near-uniform D, can find $\chi_a$ in time ~$ns^2$

Uniform Learning from a Random Walk Oracle

Boolean function class F (e.g., DNF)

Random walk oracle RW(f): supplies random walk examples <x, f(x)>

Learning algorithm A, given accuracy ε > 0

Random Walk DNF Learning

• [BMOS]
  – Noise sensitivity and related values can be accurately estimated using a random walk oracle

    • $NS_\gamma(f) \equiv \tfrac{1}{2}\bigl(1 - \sum_a (1-2\gamma)^{|a|}\,\hat{f}^2(a)\bigr)$

    • $T_b(f) \equiv \sum_{a \supseteq b} \rho^{|a|}\,\hat{f}^2(a)$

– Estimate of Tb(f) is efficient if |b| logarithmic

– Only need logarithmic |b| to learn DNF [BF]

Random Walk Parity Learning

• [JW] (unpub)
  – Effectively, [BMOS] is limited to finding "heavy" Fourier coefficients $\hat{f}(a)$ for logarithmic |a|
  – Using a "breadth-first" variation of KM, can locate any $|\hat{f}(a)| > \theta$ in time O(nlog 1/θ)
  – A "heavy" coefficient corresponds to a parity function that weakly approximates f

Uniform Learning from a Classification Noise Oracle

Boolean function class F (e.g., DNF)

Classification noise oracle $EX^{\eta}(f)$, with error rate η > 0: draws uniform random x and returns <x, f(x)> with probability 1-η, or <x, -f(x)> with probability η

Learning algorithm A, given accuracy ε > 0

Uniform Learning from a Statistical Query Oracle

Boolean function class F (e.g., DNF)

Statistical query oracle SQ(f): on query (q(), τ) from the learner, returns $E_U[q(x, f(x))] \pm \tau$

Learning algorithm A, given accuracy ε > 0

SQ and Classification Noise Learning

• [K]
  – If F is uniform SQ learnable in time poly(n, $s_F$, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, $s_F$, 1/ε, 1/τ, 1/(1-2η))
  – Empirically, almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ poly in other parameters)

• Exception: $F = PAR_n \equiv \{\chi_a : a \in \{0,1\}^n, |a| \le n\}$
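The 1/(1-2η) factor in the classification-noise bound can be seen in a special case: a query of the form q(x, y) = y·g(x). A noisy label has expectation (1-2η)·f(x), so dividing the noisy average by (1-2η) recovers E[f(x)g(x)]. The sketch below illustrates only this special case, not the general simulation from [K]; all function names are assumptions.

```python
import random

def maj3(x):
    return 1 if sum(x) >= 2 else -1

def noisy_example(f, n, eta, rng):
    """One draw from a simulated classification noise oracle with error rate eta."""
    x = tuple(rng.randint(0, 1) for _ in range(n))
    label = f(x)
    if rng.random() < eta:
        label = -label  # label flipped with probability eta
    return x, label

def correlation_from_noisy(f, g, n, eta, m, seed=0):
    """Estimate E_U[f(x) * g(x)] from m noisy examples, correcting for the noise rate."""
    rng = random.Random(seed)
    total = 0
    for _ in range(m):
        x, label = noisy_example(f, n, eta, rng)
        total += label * g(x)
    # E[label * g(x)] = (1 - 2*eta) * E[f(x) * g(x)], so rescale.
    return (total / m) / (1 - 2 * eta)

def first_bit(x):
    """g(x) = +1 when x_0 = 1 and -1 otherwise."""
    return 1 if x[0] else -1

estimate = correlation_from_noisy(maj3, first_bit, 3, eta=0.2, m=20000)
```

The true correlation here is 0.5. The rescaling amplifies sampling error by 1/(1-2η), which is why the running time picks up a polynomial dependence on that quantity.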

Uniform SQ Hardness for PAR

• [BFJKMR]
  – Harmonic analysis shows that for any q, $\chi_a$: $E_U[q(x, \chi_a(x))] = \hat{q}(0^{n+1}) + \hat{q}(a \circ 1)$
  – Thus adversarial SQ response to (q, τ) is $\hat{q}(0^{n+1})$ whenever $|\hat{q}(a \circ 1)| < \tau$
  – Parseval: $|\hat{q}(b \circ 1)| < \tau$ for all but $1/\tau^2$ Fourier coefficients
  – So a 'bad' query eliminates only poly coefficients
  – Even $PAR_{\log n}$ is not efficiently SQ learnable
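The identity behind this argument can be checked numerically for a small n. The sketch below assumes the label χ_a(x) is passed to q as an extra bit y with (-1)^y = χ_a(x), so that q is a real-valued function on n+1 bits; this encoding and all names are assumptions made for the illustration.

```python
import random
from itertools import product

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for bit tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def max_identity_gap(n=3, seed=0):
    """Largest |E_U[q(x, chi_a(x))] - (q_hat(0^{n+1}) + q_hat(a o 1))| over all a."""
    rng = random.Random(seed)
    cube = list(product((0, 1), repeat=n))
    big_cube = list(product((0, 1), repeat=n + 1))
    q = {z: rng.uniform(-1, 1) for z in big_cube}  # an arbitrary query function

    def q_hat(b):
        return sum(q[z] * chi(b, z) for z in big_cube) / len(big_cube)

    worst = 0.0
    for a in cube:
        # The label bit is y = a . x mod 2, so that (-1)^y = chi_a(x).
        lhs = sum(q[x + (sum(ai & xi for ai, xi in zip(a, x)) % 2,)]
                  for x in cube) / len(cube)
        rhs = q_hat((0,) * (n + 1)) + q_hat(a + (1,))
        worst = max(worst, abs(lhs - rhs))
    return worst

gap = max_identity_gap()
```

The gap is zero up to floating-point error for every a, so the only information a query reveals about the target parity is the single coefficient of q at a∘1.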

Uniform Learning from an Attribute Noise Oracle

Boolean function class F (e.g., DNF)

Attribute noise oracle $EX_{D_N}(f)$, with noise model $D_N$: draws uniform random x and $r \sim D_N$, and returns <x ⊕ r, f(x)>

Learning algorithm A, given accuracy ε > 0

Uniform Learning with Independent Attribute Noise

• [BJTa]:
  – LMN algorithm produces estimates of $\hat{f}(a) \cdot E_{r \sim D_N}[\chi_a(r)]$

• Example application
  – Assume noise process $D_N$ is a product distribution:
    • $D_N(x) = \prod_i (p_i x_i + (1-p_i)(1-x_i))$
  – Assume $p_i < 1/\mathrm{polylog}\, n$, 1/ε at most quasi-poly(n) (mild restrictions)
  – Then modified LMN uniform learns attribute noisy AC0 in quasi-poly time
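The reason the LMN estimates pick up the factor E[χ_a(r)] can be written out in two lines, assuming the noise vector r is drawn independently of x and is applied by XOR, and that p_i is the probability that bit i is flipped:

```latex
% x uniform, r ~ D_N independent of x; the learner sees (x \oplus r, f(x)).
% Since \chi_a(x \oplus r) = \chi_a(x)\,\chi_a(r):
E_{x,r}\bigl[f(x)\,\chi_a(x \oplus r)\bigr]
  = E_{x}\bigl[f(x)\,\chi_a(x)\bigr]\; E_{r}\bigl[\chi_a(r)\bigr]
  = \hat{f}(a)\; E_{r \sim D_N}\bigl[\chi_a(r)\bigr]
% For product noise with \Pr[r_i = 1] = p_i, each coordinate contributes
% E[(-1)^{r_i}] = 1 - 2p_i, so
E_{r \sim D_N}\bigl[\chi_a(r)\bigr] = \prod_{i \,:\, a_i = 1} (1 - 2p_i)
```

When every p_i is small and |a| is at most polylogarithmic, this product stays bounded away from zero, so dividing by it recovers usable estimates of the low-order coefficients.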

Agnostic Learning Model

Arbitrary Boolean function f

Example oracle EX(f): supplies examples <x, f(x)>

Learning algorithm A, given accuracy ε > 0

Hypothesis h in H s.t. $\Pr_{x \sim U}[f(x) \ne h(x)] \le \mathrm{opt}_H + \varepsilon$

Agnostic Learning of Halfspaces

• [KKMS]
  – Agnostic learning algorithm for H the set of halfspaces
  – Algorithm is not Fourier-based (L1 regression)

• However, a somewhat weaker result can be obtained by simple Fourier analysis

Near-Agnostic Learning via LMN

• [KKMS]:
  – Let f be an arbitrary Boolean function
  – Fix any set $S \subseteq 2^{\{1..n\}}$ and fix ε
  – Let g be any function s.t.
    • $\sum_{a \notin S} \hat{g}^2(a) < \varepsilon$ and
    • $\Pr[f \ne g]$ (call this η) is minimized over all such g
  – Then for h learned by LMN by estimating coefficients of f over S:
    • $\Pr[f \ne h] < 4\eta + \varepsilon$

Summary

• Most uniform-learning results for Boolean function classes depend on harmonic analysis

• Learning theory provides motivation for new harmonic observations

• Even very “weak” harmonic results can be useful in learning-theory algorithms

Some Open Problems

• Efficient uniform learning of monotone DNF
  – Best to date for small $s_{DNF}$ is [Ser], time ~$n s^{\log s}$ (based on [BT], [M], [LMN])

• Non-uniform learning
  – Relatively easy to extend many results to product distributions, e.g. [FJS] extends [LMN]
  – Key issue in real-world applicability

Open Problems (cont’d)

• Weaker dependence on ε
  – Several algorithms fully exponential (or worse) in 1/ε

• Additional proper learning results
  – Allows for interpretation of learned hypothesis

References

• Beigel: When Do Extra Majority Gates Help? ...
• [BFJKMR] Blum, Furst, Jackson, Kearns, Mansour, Rudich. Weakly Learning DNF...
• [BJTa] Bshouty, Jackson, Tamon. Uniform-Distribution Attribute Noise Learnability.
• [BJTb] Bshouty, Jackson, Tamon. More Efficient PAC-learning of DNF...
• [BKS] Benjamini, Kalai, Schramm. Noise Sensitivity of Boolean Functions...
• [BMOS] Bshouty, Mossel, O'Donnell, Servedio. Learning DNF from Random Walks.
• [BT] Bshouty, Tamon. On the Fourier Spectrum of Monotone Functions.
• [F] Feldman. Attribute Efficient and Non-adaptive Learning of Parities...
• [FJS] Furst, Jackson, Smith. Improved Learning of AC0 Functions.
• [FS] Freund, Schapire. A Decision-theoretic Generalization of On-line Learning...
• Friedgut: Boolean Functions with Low Average Sensitivity Depend on Few Coordinates.
• [HMPST] Hajnal, Maass, Pudlak, Szegedy, Turan. Threshold Circuits of Bounded Depth.
• [J] Jackson. An Efficient Membership-Query Algorithm for Learning DNF...
• [JKS] Jackson, Klivans, Servedio. Learnability Beyond AC0.
• [JW] Jackson, Wimmer. In prep.
• [KKL] Kahn, Kalai, Linial. The Influence of Variables on Boolean Functions.
• [KKMS] Kalai, Klivans, Mansour, Servedio. Agnostically Learning Halfspaces.
• [K] Kearns. Efficient Noise-tolerant Learning from Statistical Queries.
• [KM] Kushilevitz, Mansour. Learning Decision Trees using the Fourier Spectrum.
• [KOS] Klivans, O'Donnell, Servedio. Learning Intersections and Thresholds of Halfspaces.
• [KP] Krause, Pudlak. On Computing Boolean Functions by Sparse Real Polynomials.
• [KS] Klivans, Servedio. Boosting and Hard-core Sets.
• [LMN] Linial, Mansour, Nisan. Constant-depth Circuits, Fourier Transform, and Learnability.
• [M] Mansour. An $O(n^{\log\log n})$ Learning Algorithm for DNF...
• [O] O'Donnell. Hardness Amplification within NP.
• [OS] O'Donnell, Servedio. Learning Monotone Functions from Random Examples in Polynomial Time.
• [S] Schapire. The Strength of Weak Learnability.
• [Ser] Servedio. On Learning Monotone DNF under Product Distributions.
