Hypothesis Testing for Structured Probability Distributions
Ilias Diakonikolas (USC)
Joint work with Daniel Kane (UCSD) and Vladimir Nikishkin (Edinburgh)
What this talk is about
Basic object of study: probability distributions over an ordered domain, either discrete ([n] = {1, …, n}) or continuous (an interval I = [a, b] ⊆ ℝ).
Notation: p, q denote either a pmf or a pdf.
Menu (explaining the title):
• Let D be a family of probability distributions.
• Identity Testing Problem: given samples from an unknown p ∈ D (e.g., 1, 2, 2, 4, 3, …) and a known or unknown q ∈ D (e.g., 2, 1, 2, 3, 1, …),
  − distinguish between the cases p = q and d_TV(p, q) > ε;
  − minimize sample size and computation time.
• Total Variation Distance: d_TV(p, q) = (1/2)·‖p − q‖_1.
This Talk
Unified framework for identity testing that leads to sample-optimal and computationally efficient estimators for a variety of structured distribution families, with matching information-theoretic lower bounds.
Outline
§ Introduction, Related and Prior Work
§ Framework Overview
§ Testing Identity to a Fixed Distribution
§ Testing Closeness between two Unknown Distributions
§ Future Directions and Concluding Remarks
Distribution Testing (Hypothesis Testing)
Given samples (observations) from one or more unknown probability distributions (the model), decide whether the model satisfies a certain property.
• Introduced by Karl Pearson ('99).
• Classical problem in statistics [Neyman-Pearson'33, Lehmann-Romano'05].
• Last twenty years (TCS): property testing [Goldreich-Ron'00, Batu et al. FOCS'00/JACM'13].
Related Work – Property Testing (I)
Focus has been on arbitrary distributions over a support of size n.
Testing identity to an explicitly known distribution:
• [Goldreich-Ron'00]: O(√n/ε^4) upper bound for uniformity testing (collision statistics).
• [Batu et al., FOCS'01]: O(√n)·poly(1/ε) upper bound for testing identity to any known distribution.
• [Paninski'03]: upper bound of O(√n/ε^2) for uniformity testing, assuming ε = Ω(n^{-1/4}). Lower bound of Ω(√n/ε^2).
• [Valiant-Valiant, FOCS'14; D-Kane-Nikishkin, SODA'15]: upper bound of O(√n/ε^2) for identity testing to any known distribution.
Related Work – Property Testing (II)
Focus has been on arbitrary distributions over a support of size n.
Testing closeness between two unknown distributions:
• [Batu et al., FOCS'00]: O(n^{2/3}·log n/ε^{8/3}) upper bound for testing closeness between two unknown discrete distributions.
• [P. Valiant, STOC'08]: lower bound of Ω(n^{2/3}) for constant error.
• [Chan-D-Valiant-Valiant, SODA'14]: tight upper and lower bound of Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2}).
Summary of Related Work
(Support size: n; total variation distance error: ε.)

Problem             Tight Bound                                Reference
Testing Closeness   Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2})       [Chan-D-Valiant-Valiant'14]
Testing Identity    Θ(n^{1/2}/ε^2)                             [Valiant-Valiant'14, D-Kane-Nikishkin'15]
Learning            Θ(n/ε^2)                                   [folklore]
Estimating Structured Distributions
• Statistical estimation is well-understood for arbitrary discrete distributions.
• How about for structured distributions?
• Long line of work in statistics since the 1950's [Grenander'56, Rao'69, Wegman'70, Birgé'87, …]. Focus has been on density estimation (learning).
• [Batu-Kumar-Rubinfeld, STOC'04]: identity testing of monotone distributions.
• [Daskalakis-D-Servedio-Valiant-Valiant, SODA'13]: generalization to k-modal distributions.
Types of Structured Distributions
• Distributions with "shape restrictions": monotone, bimodal, log-concave, …
• Simple combinations of simple distributions:
  − Mixtures of simple distributions (e.g., mixtures of Gaussians)
  − Sums of simple distributions (e.g., Poisson Binomial Distributions)
Outline
§ Introduction, Related and Prior Work
§ Framework Overview
§ Testing Identity to a Fixed Distribution
§ Testing Closeness between two Unknown Distributions
§ Future Directions and Concluding Remarks
First Step: Changing the metric
Identity Testing Problem for a family D. Given (sample) access to p, q ∈ D:
• Output "YES" (with high probability) if p = q (completeness)
• Output "NO" (with high probability) if ‖p − q‖_1 ≥ ε (soundness)
This reduces to the Identity Testing Problem under the A_k-distance. Given (sample) access to p, q:
• Output "YES" (with high probability) if p = q
• Output "NO" (with high probability) if ‖p − q‖_{A_k} ≥ ε
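Why the reduction is sound (using the structural parameter k = k(D, ε/2) defined on the next slides, for which the A_k-distance approximates the L1 distance over D to within ε/2):

\[
\|p-q\|_1 \ge \varepsilon
\quad\text{and}\quad
\|p-q\|_1 \le \|p-q\|_{A_k} + \varepsilon/2
\quad\Longrightarrow\quad
\|p-q\|_{A_k} \ge \varepsilon/2 .
\]

So an A_k identity tester run with error parameter ε/2 solves the L1 problem.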
A_k-Distance between Distributions (I)
Definition. For p, q : ℝ → [0, 1] and k ≥ 2, we define the A_k-distance between p, q as
  ‖p − q‖_{A_k} = sup_{I = (I_i)_{i=1}^k} Σ_{i=1}^k |p(I_i) − q(I_i)|,
where the supremum is over all partitions of the domain into consecutive intervals I_1, I_2, …, I_k.
Facts:
• For k = 2, the A_k-distance is (essentially) equivalent to the Kolmogorov distance.
• For any k ≥ 2, we have ‖p − q‖_{A_k} ≤ ‖p − q‖_1.
• We have: lim_{k→∞} ‖p − q‖_{A_k} = ‖p − q‖_1.
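For intuition, on a discrete domain the A_k-distance can be computed exactly by a simple dynamic program over interval partitions. The sketch below (plain Python, O(k·n^2) time) is for illustration only; the testers in this talk never compute the distance explicitly.

```python
def ak_distance(p, q, k):
    """A_k-distance between two distributions on [n], given as probability
    lists: max over partitions of [n] into at most k consecutive intervals
    I_1, ..., I_k of sum_i |p(I_i) - q(I_i)|."""
    n = len(p)
    # Prefix sums of the pointwise difference: for I = (s, t],
    # p(I) - q(I) equals pre[t] - pre[s].
    pre = [0.0] * (n + 1)
    for i in range(n):
        pre[i + 1] = pre[i] + (p[i] - q[i])
    # dp[i] = best discrepancy achievable on the prefix (0, i] with the
    # current number of intervals; -inf marks unreachable states.
    dp = [0.0] + [float("-inf")] * n
    for _ in range(k):  # allow one more (possibly empty) interval per round
        dp = [max(dp[s] + abs(pre[i] - pre[s]) for s in range(i + 1))
              for i in range(n + 1)]
    return dp[n]

# Example: splitting intervals never decreases the sum, so A_2 <= A_4,
# and for k = n the A_k-distance coincides with the L1 distance.
p = [0.4, 0.1, 0.4, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
print(ak_distance(p, q, 2), ak_distance(p, q, 4))  # ~0.3 and ~0.6 = ||p - q||_1
```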
A_k-Distance between Distributions (II)
Definition (recalled). For p, q : ℝ → [0, 1] and k ≥ 2, ‖p − q‖_{A_k} = sup_{I = (I_i)_{i=1}^k} Σ_{i=1}^k |p(I_i) − q(I_i)|.
Upper bound on sample complexity: For a family D of one-dimensional distributions and ε > 0, let k = k(D, ε) be the smallest integer such that for each p, q ∈ D it holds that
  ‖p − q‖_1 ≤ ‖p − q‖_{A_k} + ε/2.
Then the parameter k is the "right" complexity measure for estimating a property of the family D.
Overview of Framework
L1-Identity Tester for D, with error parameter ε > 0:
• Approximation (Existential Step): determine k = k(D, ε), the minimum k such that for all p, q ∈ D, ‖p − q‖_1 ≤ ‖p − q‖_{A_k} + ε/2.
• Algorithmic Step: run an identity tester under the A_k-distance with error parameter ε' = ε/2, and output its YES/NO answer.
Second Step: Design an A_k-Distance Tester
Identity Testing Problem under the A_k-distance. Given (sample) access to p, q:
• Output "YES" (with high probability) if p = q
• Output "NO" (with high probability) if ‖p − q‖_{A_k} ≥ ε
Two fundamentally different regimes:
• One of p, q is known explicitly [Testing Identity to a Fixed Distribution].
• Both p, q are unknown [Testing Closeness].
A_k-distance vs L1 distance
(Tight bounds; support size n for the L1 column.)

Problem             Support [n], L1 distance                   A_k-distance
Testing Closeness   Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2})       ?
Testing Identity    Θ(n^{1/2}/ε^2)                             Θ(k^{1/2}/ε^2)
Learning            Θ(n/ε^2)                                   Θ(k/ε^2)  [VC]
Outline
§ Introduction, Related and Prior Work
§ Framework Overview
§ Testing Identity to a Fixed Distribution
§ Testing Closeness between two Unknown Distributions
§ Future Directions and Concluding Remarks
A_k-Testing Identity to a Fixed Distribution
Theorem [D-Kane-Nikishkin'15]. For any ε > 0, k ≥ 2, and any explicit q, there exists a computationally efficient algorithm that distinguishes the case p = q from ‖p − q‖_{A_k} ≥ ε with constant error probability using O(k^{1/2}/ε^2) samples from p. Moreover, this sample size is information-theoretically necessary for this task.
Remark:
• The upper bound holds both for discrete and continuous distributions.
Applications: L1-Identity Testing for Structured Distributions

Distribution Family       A_k parameter            Sample Size
t-flat                    k = O(t)                 O(t^{1/2}/ε^2)
t-piecewise degree-d      k = O(t(d+1))            O((t(d+1))^{1/2}/ε^2)
log-concave               k = O(ε^{-1/2})          O(ε^{-9/4})
log-concave t-mixture     k = O(t·ε^{-1/2})        O(t^{1/2}/ε^{9/4})
t-modal over [n]          k = O(t·log(n)/ε)        O((t·log n)^{1/2}/ε^{5/2})
MHR over [n]              k = O(log(n)/ε)          O((log n)^{1/2}/ε^{5/2})
A_k-Identity Testing: Basic Facts
Lemma: Identity testing reduces to uniformity testing.
Proof idea: Appropriately "stretch" the domain (subdividing it according to the known q) so that q becomes uniform. Henceforth, we focus on uniformity testing.
Observation: If we knew the partition J_1, J_2, …, J_k maximizing the discrepancy,
  ‖p − U‖_{A_k} = Σ_{j=1}^k |p(J_j) − U(J_j)|,
we could reduce to L1-identity testing over a domain of size k.
A_k-Uniformity Testing: First Approach
• Partition the domain into ℓ = 10k/ε intervals I_1, …, I_ℓ of equal length.
• Apply an L1-uniformity tester to the reduced distribution over these intervals.
Claim: ‖p − U‖_{A_k} − ε/2 ≤ Σ_{i=1}^ℓ |p(I_i) − U(I_i)| ≤ ‖p − U‖_{A_k}
Sample Complexity: O(ℓ^{1/2}/ε^2) = O(k^{1/2}/ε^{5/2})
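A minimal sketch of this first approach in Python, for samples from a distribution on [0, 1). The slide leaves the L1-uniformity tester as a black box; here it is instantiated with a simple collision-based statistic, and the acceptance threshold constant is illustrative rather than tuned to the claim above.

```python
import random
from collections import Counter

def collision_uniformity_test(buckets, num_buckets, eps):
    """Collision-based uniformity tester on a domain of size num_buckets.
    E[collisions]/pairs = ||p'||_2^2, which equals 1/num_buckets iff p' is
    uniform; far-from-uniform p' inflates it. Constants are illustrative."""
    s = len(buckets)
    collisions = sum(c * (c - 1) // 2 for c in Counter(buckets).values())
    pairs = s * (s - 1) // 2
    return collisions / pairs <= (1 + eps**2 / 4) / num_buckets

def ak_uniformity_test_first_approach(samples, k, eps):
    """Reduce A_k-uniformity testing on [0,1) to uniformity testing of the
    reduced distribution over ell ~ 10k/eps equal-length intervals."""
    ell = int(10 * k / eps)
    buckets = [min(int(x * ell), ell - 1) for x in samples]
    return collision_uniformity_test(buckets, ell, eps)

# Usage: uniform samples should typically be accepted.
random.seed(0)
samples = [random.random() for _ in range(20000)]
print(ak_uniformity_test_first_approach(samples, k=10, eps=0.5))  # True (w.h.p.)
```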
A_k-Uniformity Testing: Optimal Algorithm
• Construct several oblivious decompositions of the domain.
• Use an L2-uniformity tester over the reduced distributions.
In more detail:
• Consider M = log(1/ε) equal-length interval partitions I^{(j)} of the domain, where partition I^{(j)} consists of ℓ_j = k·2^j intervals.
• For each j, apply an L2-uniformity tester with L2 error ε_j = ε·2^{3j/8}/ℓ_j^{1/2}.
• Accept if and only if all testers accept.
Structural Lemma: If ‖p − U‖_{A_k} ≥ ε, then at least one of the partitions will detect the discrepancy.
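A compact sketch of the multi-scale structure (not the tuned optimal tester): at each scale j, the L2 test estimates ‖p^{(j)} − U‖_2^2 for the reduced distribution over ℓ_j intervals via the collision statistic and compares it against ε_j^2. The logarithm base, thresholds, and constants below are illustrative assumptions.

```python
import math
import random
from collections import Counter

def l2_distance_to_uniform_sq(buckets, num_buckets):
    """Unbiased estimate of ||p' - U||_2^2 from pairwise collisions, using
    E[collisions]/pairs = ||p'||_2^2 and ||p'-U||_2^2 = ||p'||_2^2 - 1/num_buckets."""
    s = len(buckets)
    collisions = sum(c * (c - 1) // 2 for c in Counter(buckets).values())
    return collisions / (s * (s - 1) / 2) - 1.0 / num_buckets

def ak_uniformity_test_multiscale(samples, k, eps):
    """Multi-scale A_k-uniformity tester on [0,1): one oblivious equal-length
    partition per scale j, with ell_j = k * 2^j intervals and L2 error eps_j."""
    M = max(1, math.ceil(math.log2(1.0 / eps)))
    for j in range(M + 1):
        ell = k * 2**j
        eps_j = eps * 2 ** (3 * j / 8) / math.sqrt(ell)
        buckets = [min(int(x * ell), ell - 1) for x in samples]
        if l2_distance_to_uniform_sq(buckets, ell) > eps_j**2 / 2:
            return False  # scale j detected a discrepancy: output "NO"
    return True  # all scales accept: output "YES"

random.seed(1)
print(ak_uniformity_test_multiscale([random.random() for _ in range(50000)], k=8, eps=0.5))
```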
Outline
§ Introduction, Related and Prior Work
§ Framework Overview
§ Testing Identity to a Fixed Distribution
§ Testing Closeness between two Unknown Distributions
§ Future Directions and Concluding Remarks
A_k-distance vs L1 distance
(Tight bounds; support size n for the L1 column.)

Problem             Support [n], L1 distance                   A_k-distance
Testing Closeness   Θ(max{n^{2/3}/ε^{4/3}, n^{1/2}/ε^2})       Θ(max{k^{4/5}/ε^{6/5}, k^{1/2}/ε^2})
Testing Identity    Θ(n^{1/2}/ε^2)                             Θ(k^{1/2}/ε^2)
Learning            Θ(n/ε^2)                                   Θ(k/ε^2)  [VC]
A_k-Equivalence Testing
Theorem. For any ε > 0 and k ≥ 2, and any unknown distributions p, q, there exists a computationally efficient algorithm that distinguishes the case p = q from ‖p − q‖_{A_k} ≥ ε with constant error probability using O(max{k^{4/5}/ε^{6/5}, k^{1/2}/ε^2}) samples. Moreover, this sample size is information-theoretically necessary for this task.
Remarks:
• The upper bound holds both for discrete and continuous distributions.
• The lower bound applies to continuous distributions, or to discrete distributions over a domain of size N ≥ 2^{poly(k)}.
A_k-Closeness Testing: Basic Facts
• No oblivious decomposition can work: the discrepancy may be hidden inside intervals even though the reduced distributions are identical.
• One can partition the domain into "light" intervals and apply a standard closeness tester to the reduced distributions over these intervals.
• This inherently leads to Ω(k)-sample algorithms: we need an adaptive partition in which at least one distribution has small mass per interval.
• How do we obtain an o(k) sample size?
A_k-Closeness Testing Algorithm
Consider the following "order-based" algorithm:
• Let m = O(k^{4/5}/ε^{6/5}). Draw a set S_p of m_1 = Poi(m) samples from p, and a set S_q of m_2 = Poi(m) samples from q.
• Let S be the union of S_p and S_q, sorted in increasing order.
• Let Z = #(pairs of consecutive elements of S from the same distribution) − #(pairs of consecutive elements of S from different distributions).
• If Z > 3√m, return "NO"; otherwise, return "YES."
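The order-based statistic is straightforward to implement. Below is a minimal, self-contained sketch (numpy is used only for Poisson sampling; the threshold constant 3 is taken from the slide, and continuous samples are assumed so that ties have probability zero).

```python
import numpy as np

def order_based_closeness_test(sample_p, sample_q, m):
    """Order-based A_k-closeness tester. Returns True ("YES": consistent with
    p = q) or False ("NO": evidence that ||p - q||_{A_k} is large)."""
    labeled = [(x, 0) for x in sample_p] + [(x, 1) for x in sample_q]
    labeled.sort()  # merge the two labeled sample sets and sort by value
    labels = [lab for _, lab in labeled]
    same = sum(a == b for a, b in zip(labels, labels[1:]))
    diff = len(labels) - 1 - same
    Z = same - diff
    return Z <= 3 * np.sqrt(m)

# Usage sketch with p = q = Uniform[0, 1): expect "YES" (True) typically.
rng = np.random.default_rng(0)
m = 1000
sample_p = rng.random(rng.poisson(m))  # Poi(m) samples from p
sample_q = rng.random(rng.poisson(m))  # Poi(m) samples from q
print(order_based_closeness_test(sample_p, sample_q, m))
```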
Closeness Testing: Sketch of Analysis
• Bound the mean E[Z] and the variance Var[Z], and apply concentration.
• Completeness (p = q): E[Z] = 0 and Var[Z] = 2m − 1.
• Soundness: the main technical step is bounding E[Z] from below.
  − Easy to argue: Var[Z] = O(m).
  − Highly non-trivial: E[Z] = Ω(m^3·ε^3/k^2).
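As a quick empirical sanity check on the completeness case (no substitute for the analysis), one can estimate E[Z] and Var[Z] by simulation under p = q and compare against the stated values:

```python
import numpy as np

def z_statistic(sample_p, sample_q):
    """Z = #(same-source consecutive pairs) - #(different-source pairs)."""
    labeled = sorted([(x, 0) for x in sample_p] + [(x, 1) for x in sample_q])
    labels = [lab for _, lab in labeled]
    same = sum(a == b for a, b in zip(labels, labels[1:]))
    return 2 * same - (len(labels) - 1)  # equals same - diff

rng = np.random.default_rng(0)
m, trials = 200, 5000
zs = [z_statistic(rng.random(rng.poisson(m)), rng.random(rng.poisson(m)))
      for _ in range(trials)]
# Compare against the slide's E[Z] = 0 and Var[Z] = 2m - 1 = 399.
print(np.mean(zs), np.var(zs))
```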
Outline
§ Introduction, Related and Prior Work
§ Framework Overview
§ Testing Identity to a Fixed Distribution
§ Testing Closeness between two Unknown Distributions
§ Future Directions and Concluding Remarks
Future Directions
Unified technique for identity testing: use the A_k-distance as a proxy.
Concrete open problems:
• Understanding the regime log n ≤ k ≤ n [DKN'16].
• Testing other properties of structured distributions: independence, entropy, etc.
A few open-ended challenges:
• Other criteria: privacy, communication.
• High-dimensional structured distributions.
• Tradeoffs between sample size and computational efficiency?

Thank you for your attention!
Sketch of Lower Bound (I)
• Suppose the algorithm only considers the ordering of the samples.
• Consider the following instance: [Figure: hard instance; the domain is split into k buckets of mini-buckets, with mini-bucket masses on the order of ε/(2k) and (1 − ε)/k; within each bucket, either p = q or p and q rearrange the mass across mini-buckets.]
• If fewer than 3 samples land in a mini-bucket, they yield no useful information.
Sketch of Lower Bound (II)
• If fewer than 3 samples land in a mini-bucket, they yield no useful information for an order-based tester.
• Expected number of buckets with ≥ 3 samples: ≈ k·(m/k)^3 = m^3/k^2.
• Need this quantity to be Ω(√m), to overcome the standard deviation of the statistic.
How about for general testers?
• One can embed the above instance into a larger domain, so that order-based testers suffice.
• Non-constructive argument (Ramsey's theorem).
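Spelling out the arithmetic behind the k^{4/5} term (for constant ε):

\[
\frac{m^{3}}{k^{2}} = \Omega(\sqrt{m})
\;\Longleftrightarrow\;
m^{5/2} = \Omega(k^{2})
\;\Longleftrightarrow\;
m = \Omega(k^{4/5}),
\]

matching the k^{4/5}/ε^{6/5} term in the tight closeness-testing bound.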