Co-Training and Expansion: Towards Bridging Theory and Practice
Maria-Florina Balcan, Avrim Blum, Ke Yang
Carnegie Mellon University, Computer Science Department
Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
• Many applications have lots of unlabeled data, but labeled data is rare or expensive:
• Web page and document classification
• OCR, image classification
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
• Transductive SVM
• Co-training
• Graph-based methods
Co-training: method for combining labeled & unlabeled data
• Works in scenarios where examples have distinct, yet sufficient feature sets:
– An example: x = ⟨x1, x2⟩ ∈ X1 × X2
– Belief is that the two parts of the example are consistent, i.e., ∃ c1, c2 such that c1(x1) = c2(x2) = c(x) on examples with non-zero probability
• Each view is sufficient for correct classification
• Works by using unlabeled data to propagate learned information.
[Figure: an example split into its two views, x1 from X1 and x2 from X2]
Co-Training: method for combining labeled & unlabeled data
• For example, if we want to classify web pages:
[Figure: a faculty web page ("My Advisor: Prof. Avrim Blum") seen two ways: x1 = link info (text on hyperlinks pointing to the page), x2 = text info (words on the page itself); x = link info & text info]
Iterative Co-Training
• Have learning algorithms A1, A2 on each of the two views.
• Use labeled data to learn two initial hypotheses h1, h2.
• Look through unlabeled data to find examples where one of the hi is confident but the other is not.
• Have the confident hi label such examples as training data for the other algorithm, A3-i (i.e., A2 if h1 was confident, A1 if h2 was).
• Repeat. (A sketch of this loop follows below.)
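A minimal sketch of this loop in Python, assuming scikit-learn-style classifiers with predict_proba; the classifier choice, the 0.95 confidence threshold, and the synthetic data at the bottom are illustrative assumptions, not the authors' setup.

```python
# Sketch of iterative co-training (illustrative only).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, labeled_idx, rounds=10, threshold=0.95):
    """X1, X2: arrays of shape (n, d1), (n, d2), the two views; y is used only on labeled_idx."""
    A1, A2 = GaussianNB(), GaussianNB()
    labels = {i: y[i] for i in labeled_idx}            # labels we currently trust
    for _ in range(rounds):
        idx = sorted(labels)
        A1.fit(X1[idx], [labels[i] for i in idx])      # h1 on view 1
        A2.fit(X2[idx], [labels[i] for i in idx])      # h2 on view 2
        rest = [i for i in range(len(y)) if i not in labels]
        if not rest:
            break
        p1 = A1.predict_proba(X1[rest]).max(axis=1)
        p2 = A2.predict_proba(X2[rest]).max(axis=1)
        added = 0
        for j, i in enumerate(rest):
            # One view confident, the other not: the confident view labels it.
            if p1[j] >= threshold and p2[j] < threshold:
                labels[i] = A1.predict(X1[[i]])[0]; added += 1
            elif p2[j] >= threshold and p1[j] < threshold:
                labels[i] = A2.predict(X2[[i]])[0]; added += 1
        if added == 0:                                 # no confident new examples
            break
    return A1, A2

# Tiny synthetic usage (assumed data, just to exercise the loop):
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
X1 = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(200, 2))   # view 1
X2 = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(200, 2))   # view 2
h1, h2 = co_train(X1, X2, y, labeled_idx=[0, 1, 100, 101])
```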
Iterative Co-Training, A Simple Example: Learning Intervals
[Figure: learning intervals on a line, with labeled and unlabeled examples and target intervals c1 (view 1) and c2 (view 2). Use labeled data to learn initial hypotheses h1^1 and h2^1; then use unlabeled data to bootstrap, growing them to h1^2, h2^2, and so on. A toy simulation of this example follows below.]
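A toy simulation of the intervals picture, under assumed parameters: the target intervals, sample size, and the "smallest covering interval" learner below are illustrative choices, not from the slides. Each view's hypothesis is the smallest interval containing the positives it is currently confident about, and each round lets one view's confident region label new points for the other.

```python
# Toy intervals co-training (illustrative assumptions throughout).
import random
random.seed(0)

c1 = (0.3, 0.7)   # target interval for view 1 (assumed)
c2 = (0.3, 0.7)   # target interval for view 2 (views are consistent)

def inside(x, interval):
    return interval[0] <= x <= interval[1]

# Draw consistent examples: both views positive or both views negative.
examples = []
while len(examples) < 500:
    x1, x2 = random.random(), random.random()
    if inside(x1, c1) == inside(x2, c2):
        examples.append((x1, x2))

# A few labeled positives give the initial hypotheses h1^1, h2^1:
labeled_pos = [e for e in examples if inside(e[0], c1)][:10]
h1 = [min(e[0] for e in labeled_pos), max(e[0] for e in labeled_pos)]
h2 = [min(e[1] for e in labeled_pos), max(e[1] for e in labeled_pos)]

for rnd in range(10):
    grew = False
    for x1, x2 in examples:
        # If one view is confident the example is positive, it labels the
        # example for the other view, whose interval then grows.
        if inside(x1, h1) and not inside(x2, h2):
            h2 = [min(h2[0], x2), max(h2[1], x2)]; grew = True
        elif inside(x2, h2) and not inside(x1, h1):
            h1 = [min(h1[0], x1), max(h1[1], x1)]; grew = True
    print(f"round {rnd + 1}: h1 = {h1}, h2 = {h2}")
    if not grew:
        break
```

Because every point added to an interval is vouched for by the other view, and the views are consistent, these hypotheses only ever contain true positives; they never generalize incorrectly, which is the kind of learner assumed in the main result later.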
Theoretical/Conceptual Question
• What properties do we need for co-training to work well?
• Need assumptions about:
– the underlying data distribution
– the learning algorithms on the two sides
Theoretical/Conceptual Question
• What property of the data do we need for co-training to work well?
• Previous work assumed:
1) independence given the label
2) weak rule dependence
• Our work: a much weaker assumption about how the data should behave:
• an expansion property of the underlying distribution
• Though we will need a stronger assumption on the learning algorithms compared to (1).
Co-Training, Formal Setting
• Assume that examples are drawn from a distribution D over instance space X = X1 × X2.
• Let c be the target function; assume that each view is sufficient for correct classification:
– c can be decomposed into c1, c2 over each view s.t. D has no probability mass on examples x with c1(x1) ≠ c2(x2)
• Let X+ and X- denote the positive and negative regions of X.
• Let D+ and D- be the marginal distribution of D over X+ and X- respectively.
• Think of D+ as a bipartite graph: the positive regions of the two views (X1+ and X2+) form the two sides, with an edge (x1, x2) whenever the example ⟨x1, x2⟩ has non-zero probability under D.
[Figure: the positive and negative regions D+ and D-]
(Formalization)
• We assume that D+ is expanding.
• Expansion: D+ is ε-expanding if for any S1 ⊆ X1+ and S2 ⊆ X2+,
Pr(S1 ⊕ S2) ≥ ε · min[ Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2) ],
where probabilities are under D+ and S1 ⊕ S2 denotes the event that exactly one of x1 ∈ S1, x2 ∈ S2 holds.
• This is a natural analog of the graph-theoretic notions of conductance and expansion.
[Figure: confident sets S1 and S2 in the two views]
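To make the definition concrete, here is a brute-force check of ε-expansion for a tiny finite positive distribution; an illustrative sketch only, where the example distributions and the exhaustive enumeration over all subset pairs (feasible only for very small supports) are my assumptions.

```python
# Brute-force epsilon-expansion check on a tiny finite D+ (illustrative only).
from itertools import chain, combinations

def subsets(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def expansion(dist):
    """dist: {(x1, x2): prob} describing D+ on a tiny finite support.
    Returns the largest eps such that, for every pair of subsets S1, S2,
    Pr(S1 xor S2) >= eps * min(Pr(S1 and S2), Pr(not S1 and not S2));
    pairs where the min is zero impose no constraint and are skipped."""
    X1 = sorted({a for a, _ in dist})
    X2 = sorted({b for _, b in dist})
    eps = float("inf")
    for s1 in subsets(X1):
        for s2 in subsets(X2):
            S1, S2 = set(s1), set(s2)
            both    = sum(p for (a, b), p in dist.items() if a in S1 and b in S2)
            neither = sum(p for (a, b), p in dist.items() if a not in S1 and b not in S2)
            xor     = sum(p for (a, b), p in dist.items() if (a in S1) != (b in S2))
            if min(both, neither) > 0:
                eps = min(eps, xor / min(both, neither))
    return eps

# A tiny, well-mixed example distribution over positive examples (values assumed):
D_plus = {("a", "u"): 0.25, ("a", "v"): 0.25, ("b", "u"): 0.25, ("b", "v"): 0.25}
print(expansion(D_plus))   # well-mixed support gives a large expansion value

# A non-expanding support: two disconnected blocks that can never help each other.
D_bad = {("a", "u"): 0.5, ("b", "v"): 0.5}
print(expansion(D_bad))    # 0.0: S1 = {a}, S2 = {u} has no xor mass
```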
Property of the underlying distribution
• Necessary condition for co-training to work well:– If S1 and S2 (our confident sets) do not expand, then
we might never see examples for which one hypothesis could help the other.
• We show it is sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions:
– the data is perfectly separable
– we have strong learning algorithms on the two sides
Expansion, Examples: Learning Intervals
[Figure: learning intervals under two distributions, with target intervals c1, c2 and confident sets S1, S2. Non-expanding distribution: zero probability mass in the regions that would let the confident sets grow. Expanding distribution D+: the confident sets keep expanding.]
Weaker than independence given the label & than weak rule dependence
[Figure: confident sets S1 and S2 within D+, with D- alongside]
• E.g., w.h.p. a random degree-3 bipartite graph is expanding, but would NOT satisfy independence given the label or weak rule dependence. (An empirical sketch of this follows below.)
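A quick empirical illustration of that claim under assumed parameters; the graph size, the sampled subset sizes, and the neighborhood-growth proxy for expansion below are illustrative choices, not from the paper. Each left node picks only 3 random right neighbors, so the graph is nowhere near the complete same-class graph that independence given the label would produce, yet small left subsets still reach many right nodes.

```python
# Random degree-3 bipartite graph: sparse, yet small subsets reach many neighbors
# (illustrative; n and the sampled subset sizes are assumed).
import random
random.seed(1)

n = 2000
neighbors = [random.sample(range(n), 3) for _ in range(n)]   # left node -> 3 right nodes

def reach(S):
    out = set()
    for v in S:
        out.update(neighbors[v])
    return out

for size in (10, 50, 200, 800):
    S = random.sample(range(n), size)
    print(f"|S| = {size:4d}   |N(S)| = {len(reach(S))}")
# Each left node touches only 3 of the n right nodes, so the distribution is far
# from independence given the label (which would connect it to essentially all
# same-class right nodes), yet small sets still expand.
```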
Main Result
• Assume D+ is ε-expanding.
• Assume that on each of the two views we have algorithms A1 and A2 for learning from positive data only.
• Assume that we have initial confident sets S1^0 ⊆ X1+ and S2^0 ⊆ X2+ with non-negligible probability under D+.
• Then, after a relatively small number of co-training iterations (governed by the expansion parameter ε; see the paper for the exact bound), the confident sets grow to cover nearly all of D+.
Main Result, Interpretation
• The assumption on A1, A2 implies that they never generalize incorrectly.
• The question is: what needs to be true for them to actually generalize to the whole of D+?
[Figure: positive regions X1+ and X2+ of the two views]
Main Result, Proof Idea
• Expansion implies that at each iteration, there is reasonable probability mass on "new, useful" data.
• The algorithms generalize to most of this new region.
• See the paper for the real proof.
What if assumptions are violated?
• What if our algorithms can make incorrect generalizations and/or the data is not perfectly separable?
What if assumptions are violated?
• Expect "leakage" to negative region.
• If negative region is expanding too, then incorrect generalizations will grow at exponential rate.
• Correct generalization are growing at exponential rate too, but will slow down first.
• Expect overall accuracy to go up then down.
Synthetic Experiments
• Create a 2n-by-2n bipartite graph:
– nodes 1 to n on each side represent positive clusters
– nodes n+1 to 2n on each side represent negative clusters
• Connect each node on the left to 3 nodes on the right:
– each neighbor is chosen with probability 1 - p to be a random node of the same class, and with probability p to be a random node of the opposite class
• Begin with an initial confident set and then propagate confidence through rounds of co-training (a sketch of this procedure follows below):
– monitor the percentage of the positive class covered, the percentage of the negative class mistakenly covered, and the overall accuracy
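A minimal sketch of this experiment, with assumptions flagged: the noise parameter is called p_noise here (the slide's symbol was lost in extraction), the initial confident set is a single left node, and confidence propagates by letting any confident node make its neighbors confident. None of these details are guaranteed to match the authors' code.

```python
# Sketch of the synthetic co-training experiment (illustrative assumptions:
# parameter names, seeding, and the propagation rule are mine, not the paper's).
import random
random.seed(0)

def run(n=5000, d=3, p_noise=0.01, rounds=30):
    N = 2 * n                               # per side: nodes 0..n-1 positive, n..2n-1 negative
    def sample_neighbor(cls):
        if random.random() < p_noise:
            cls = 1 - cls                   # noise edge: opposite class
        lo = 0 if cls == 0 else n
        return lo + random.randrange(n)
    # left node i (class 0 if i < n else 1) -> d right neighbors
    right_nbrs = [[sample_neighbor(0 if i < n else 1) for _ in range(d)] for i in range(N)]
    left_of_right = [[] for _ in range(N)]  # reverse adjacency
    for i, nbrs in enumerate(right_nbrs):
        for j in nbrs:
            left_of_right[j].append(i)

    conf_left = {0}                         # initial confident (believed-positive) set
    conf_right = set()
    for t in range(rounds):
        conf_right |= {j for i in conf_left for j in right_nbrs[i]}
        conf_left |= {i for j in conf_right for i in left_of_right[j]}
        covered_pos = sum(1 for i in conf_left if i < n) / n
        covered_neg = sum(1 for i in conf_left if i >= n) / n
        accuracy = (covered_pos + (1 - covered_neg)) / 2
        print(f"round {t}: pos covered {covered_pos:.3f}, "
              f"neg mistakenly covered {covered_neg:.3f}, accuracy {accuracy:.3f}")

run(n=5000, d=3, p_noise=0.01)
```

Running this with a larger p_noise makes the leakage into the negative clusters kick in sooner, which is the up-then-down accuracy behavior described on the previous slide.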
Synthetic Experiments
[Figure: accuracy vs. rounds of co-training for p = 0.01, n = 5000, d = 3 (left) and p = 0.001, n = 5000, d = 3 (right). The solid line indicates overall accuracy, the green curve accuracy on positives, and the red curve accuracy on negatives.]
Conclusions
• We propose a much weaker assumption on the underlying data distribution: an expansion property.
• It seems to be the “right” condition on the distribution for co-training to work well.
• It directly motivates the iterative nature of many of the practical co-training based algorithms.