Date post: | 14-Dec-2015 |
Guiding Semi-Supervision with
Constraint-Driven Learning
Ming-Wei Chang, Lev Ratinov, Dan Roth
• Semi-supervised learning?
• Scarcity of training data
• What are constraints?
• How/why do they help?
Supervised learning
Labelled data: (X1, Y1), (X2, Y2), (X3, Y3), ..., (Xn, Yn)
What if n is small? Obtaining training data is costly and can be inefficient. Example: fraud detection / anomaly detection.
Domain expertise helps...
Definitions
• X = (X1, X2, X3, X4, ..., Xn)
• Y = (Y1, Y2, Y3, Y4, ..., Yn)
• H : X → Y is a classifier.
• f : X × Y → ℝ maps each (input, output) pair to a real number.
• The output of the classifier is the y that maximizes the value of f: H(x) = argmax_y f(x, y).
• Classification function: f is a linear sum of feature functions.
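The definitions above can be made concrete with a minimal sketch. The feature functions, the weight values, and the two-label toy set are assumptions made for illustration only; the slides do not specify them.

```python
from itertools import product

LABELS = ["Question", "Reflection"]  # toy label set (assumption)

def features(x, y):
    """Hypothetical feature functions: counts of (token, label) and (prev_label, label) pairs."""
    counts = {}
    prev = "<s>"
    for token, label in zip(x, y):
        for key in (("emit", token, label), ("trans", prev, label)):
            counts[key] = counts.get(key, 0) + 1
        prev = label
    return counts

def score(weights, x, y):
    """f(x, y): a linear sum of weighted feature functions."""
    return sum(weights.get(k, 0.0) * v for k, v in features(x, y).items())

def classify(weights, x, labels=LABELS):
    """H(x): the label sequence y that maximizes f(x, y), found by exhaustive search."""
    return max(product(labels, repeat=len(x)), key=lambda y: score(weights, x, y))

# Illustrative weights (assumption): "why" looks like a Question, "so" like a Reflection.
weights = {("emit", "why", "Question"): 2.0, ("emit", "so", "Reflection"): 1.5}
print(classify(weights, ["why", "so"]))
```

Exhaustive search is used here only because the example is tiny; it is the brute-force baseline that the inference discussion tries to improve on.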
Motivational Interviewing
Labels: Support, Reflection, Confrontation, Facilitate, Question
Can we exploit knowledge of constraints in the inference phase?
• Let's assume n items (observations) in a sequence and p labels, e.g., n tokens and p parts of speech, or n tokens and p tags in an NER task.
Brute force: O(p^n), since each of the n items can take any of the p labels
Viterbi: O(n · p^2) for a first-order model
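To compare the two inference strategies, here is a small sketch that scores sequences with additive emission and transition scores. The score tables `emit[t][j]` and `trans[i][j]` and the first-order (previous-label-only) assumption are illustrative choices, not taken from the slides. Brute force enumerates all p^n label sequences, while Viterbi fills an n-by-p table and spends p work per cell.

```python
from itertools import product

def brute_force(emit, trans):
    """Score every one of the p**n label sequences and keep the best one."""
    n, p = len(emit), len(emit[0])

    def total(seq):
        s = emit[0][seq[0]]
        for t in range(1, n):
            s += trans[seq[t - 1]][seq[t]] + emit[t][seq[t]]
        return s

    best = max(product(range(p), repeat=n), key=total)
    return list(best), total(best)

def viterbi(emit, trans):
    """Dynamic programming over positions: O(n * p**2) for a first-order model."""
    n, p = len(emit), len(emit[0])
    best = list(emit[0])          # best[j]: best score of a prefix ending in label j
    back = []                     # back[t-1][j]: best previous label for label j at step t
    for t in range(1, n):
        prev_best, best, ptr = best, [], []
        for j in range(p):
            i = max(range(p), key=lambda k: prev_best[k] + trans[k][j])
            best.append(prev_best[i] + trans[i][j] + emit[t][j])
            ptr.append(i)
        back.append(ptr)
    last = max(range(p), key=lambda j: best[j])
    path = [last]
    for ptr in reversed(back):    # follow back-pointers from the last position
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]
```

On any first-order score table both functions should return the same best sequence; only the amount of work differs.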
Can we go down further? Can we reduce our search space even more?
Further down?
Introducing constraints into the model
• Let C1, C2, ..., CK be the constraints.
• C : X × Y → {0, 1}
• Constraints are of two types:
• Hard (MUST be satisfied)
• Soft (can be relaxed)
• 1_C(x) is the set of label sequences that DON'T violate the constraints.
Constraints come to the rescue
• Let's say x out of the X possible tag sequences violate the constraints (here X and x are counts of sequences).
• The search space shrinks from X to X − x.
• How do we infer?
• Does Viterbi help us?
Example
     A    B    C    D    E    F    G
S1   X1   X1   X1   X1   X1   X1   X1
S2   X10  X10  X10  X10  X10  X10  X10
S3   X11  X11  X11  X11  X11  X11  X11
Motivational Interviewing :
At least ONE reflection
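The "at least one Reflection" rule can be written as a hard constraint C(x, y) with values in {0, 1} and used to prune the search space. The sketch below is an illustration under assumptions: the function names, the three placeholder utterances, and the use of exhaustive enumeration are not from the slides. Note that this constraint depends on the whole label sequence, so it cannot be expressed through local transitions alone, which is one reason plain Viterbi does not apply directly.

```python
from itertools import product

LABELS = ["Support", "Reflection", "Confrontation", "Facilitate", "Question"]

def at_least_one_reflection(x, y):
    """Hard constraint C(x, y): 1 if some item is labelled Reflection, else 0."""
    return int("Reflection" in y)

def feasible_sequences(x, labels, constraints):
    """All label sequences for x that satisfy every hard constraint."""
    return [y for y in product(labels, repeat=len(x))
            if all(c(x, y) for c in constraints)]

def constrained_argmax(x, labels, constraints, score):
    """Best-scoring sequence among those that violate no constraint."""
    return max(feasible_sequences(x, labels, constraints), key=lambda y: score(x, y))

utterances = ["u1", "u2", "u3"]                      # placeholder inputs (assumption)
total = len(LABELS) ** len(utterances)               # size of the full search space
kept = len(feasible_sequences(utterances, LABELS, [at_least_one_reflection]))
print(total, kept)                                   # full space vs. constrained space
```

With 5 labels and 3 items, 4^3 = 64 of the 125 sequences contain no Reflection, so 61 remain after pruning.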
Soft constraints
How do we calculate distance here ?
How do we learn the parameters ?
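One possible answer to the distance question, stated here as an assumption because the slides leave it open, is to measure how far a labelling is from the nearest labelling that satisfies the constraint, using Hamming distance, and to subtract a weighted penalty from the model score. The penalty weights `rho` and all function names are hypothetical.

```python
from itertools import product

def hamming(y1, y2):
    """Number of positions where two equal-length label sequences differ."""
    return sum(a != b for a, b in zip(y1, y2))

def distance_to_feasible(x, y, labels, constraint):
    """Minimum Hamming distance from y to any sequence satisfying the constraint."""
    feasible = [c for c in product(labels, repeat=len(y)) if constraint(x, c)]
    if not feasible:
        raise ValueError("no label sequence satisfies the constraint")
    return min(hamming(y, c) for c in feasible)

def penalized_score(model_score, x, y, labels, constraints, rho):
    """Model score minus rho-weighted distances: violations are discouraged, not forbidden."""
    penalty = sum(r * distance_to_feasible(x, y, labels, c)
                  for c, r in zip(constraints, rho))
    return model_score - penalty

def needs_reflection(x, y):
    """Soft version of the 'at least one Reflection' rule."""
    return "Reflection" in y
```

For example, a labelling of three Questions is one relabelling away from satisfying `needs_reflection`, so with `rho = [2.0]` its score drops by 2.0. The brute-force minimum is only practical for tiny examples.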
Ground truth: Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994.
But the HMM gives this: Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994.
Top-k inference
We choose only the top few candidate sequences and add ALL of them to the training data.
The authors used beam search decoding, but this can be done with any inference procedure.
We label the unlabeled samples and include them in the training data.
Choice: we may include only the high-confidence samples.
Pitfall: then we don't really learn properly and miss out on some characteristics.
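The top-k step can be sketched with a simple beam search that keeps the k best partial labellings at each position, followed by a loop that adds all k labellings of each unlabeled example to the training data. The additive score tables, the `emit_fn` helper that produces per-example emission scores, and the function names are assumptions for illustration; this is not the authors' implementation.

```python
def beam_search_top_k(emit, trans, k):
    """Return up to k (path, score) pairs, best first, keeping k partial paths per step."""
    beam = [([j], s) for j, s in enumerate(emit[0])]
    beam = sorted(beam, key=lambda c: c[1], reverse=True)[:k]
    for t in range(1, len(emit)):
        expanded = [
            (path + [j], score + trans[path[-1]][j] + emit[t][j])
            for path, score in beam
            for j in range(len(emit[t]))
        ]
        beam = sorted(expanded, key=lambda c: c[1], reverse=True)[:k]
    return beam

def pseudo_label(unlabeled, emit_fn, trans, k):
    """Label each unlabeled example and add ALL of its top-k labellings as training pairs."""
    training = []
    for x in unlabeled:
        for path, _ in beam_search_top_k(emit_fn(x), trans, k):
            training.append((x, path))
    return training
```

Beam search is approximate: a prefix that falls out of the beam early cannot be recovered, so the k results are not guaranteed to be the true k best sequences.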
Algorithm: