Decision Trees and Their Use in NLP
Jan Hajič
Additional Lecture to NPFL067
Fall 2018/19
Decision Trees
• Goal: Categorical or numerical predictions
  – i.e., classification (prevalent in NLP) / regression
• Use in NLP (examples)
  – Standard classification
    • POS/morphological tagging: C_POS: W → T
    • Named entity recognition: NE: W → {0,1}^|W|
  – Modeling conditional distributions
    • Language modeling: LM: <w1,…,wi> → P_{i+1}(.)
    • In general: D: H → P(.)
      – H is the “history” (context)
      – P(.) is a probability distribution over the variable of interest (e.g., the “next word”)
Decision Trees – the Idea
• Queries organized in a rooted tree
• Queries evaluated against (input) data
  – More precisely, against the context of the item of interest to be classified
• Value returned → edge to be followed down the tree
• (Global) answer (class, distribution) found at a leaf
  – Example classifier (customer helpdesk; data: the customer’s answers):
Starts when switched on?
  no → Battery inserted?
    no → C: missing battery
    yes → C: empty battery
  yes → Login screen reached after less than 2 minutes?
    no → Hard drive lamp?
      off → …   blinking → …   on → …
    yes → …
Goal: A Distribution
• Generalization (from a categorical answer)
  – Leaves contain a probability distribution over the “classes”:
Fever (body temp. > 39°C)?
  yes → Pain while swallowing?
    yes → Healthy 0.02, Migraine 0.07, Flu 0.17, Tonsilitis 0.69, Fracture 0.01, …
    no → Pain over all body?
      yes → Healthy 0.02, Migraine 0.18, Flu 0.53, Tonsilitis 0.09, Fracture 0.01, …
      no → …
  no → Headache?
    yes → …
    no → Healthy 0.38, Migraine 0.12, Flu 0.07, Tonsilitis 0.03, Fracture 0.08, …
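Such a leaf distribution is typically just the (possibly smoothed) relative frequencies of the classes among the training items that reach the leaf; a minimal sketch, assuming items are (class, context) pairs:

```python
from collections import Counter

def leaf_distribution(items):
    """Relative class frequencies of the training items reaching a leaf."""
    counts = Counter(y for y, _ in items)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# hypothetical items collected at the "no fever, no headache" leaf
print(leaf_distribution([("Healthy", {}), ("Healthy", {}), ("Flu", {})]))
# -> {'Healthy': 0.666..., 'Flu': 0.333...}
```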
Using Decision Trees
• Collect the data needed by each query; evaluate the query
• Follow the returned edge; evaluate the next query
• Repeat until a leaf is reached
(The same tree as on the previous slide; only the traversal is shown.)
  Data: Temp = 39.7°C        → Fever (body temp. > 39°C)? yes
  Data: no swallowing pain   → Pain while swallowing? no
  Data: body pain            → Pain over all body? yes
  → answer: Healthy 0.02, Migraine 0.18, Flu 0.53, Tonsilitis 0.09, Fracture 0.01, …
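Classification is then a mechanical walk from the root to a leaf. A minimal self-contained sketch; the Node representation is an assumption for illustration, not from the lecture:

```python
from dataclasses import dataclass

@dataclass
class Node:
    query: callable = None   # maps data -> edge label (None at a leaf)
    children: dict = None    # edge label -> child Node
    value: object = None     # class label or distribution at a leaf

    @property
    def is_leaf(self):
        return self.query is None

def classify(root, data):
    """Evaluate queries against the data, following edges until a leaf."""
    node = root
    while not node.is_leaf:
        node = node.children[node.query(data)]
    return node.value

# the first query of the medical example above (leaves abbreviated)
flu_leaf = Node(value={"Flu": 0.53, "Migraine": 0.18, "Tonsilitis": 0.09})
ok_leaf = Node(value={"Healthy": 0.38, "Migraine": 0.12})
root = Node(query=lambda d: d["temp"] > 39.0,
            children={True: flu_leaf, False: ok_leaf})
print(classify(root, {"temp": 39.7}))   # follows the 'yes' edge
```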
Constructing (Training) Decision Trees
• Set-of-queries acquisition
  – Manual (given)
  – (Semi-)automatic ([man.] templates → instances)
• Tree building
  – Machine learning (supervised)
  – Objective function
    • Probability of data (maximize) given the (trained) tree (~ model)
    • MERT (minimum error rate training)
      – (if it is) hard to define the probability of data (esp. for categorical classifiers)
  – NP-complete problem
    • Heuristics needed (e.g., greedy search)
    • Approximations
  – Technique: top-down node-splitting
Queries
• Binary (yes/no)
  – two edges going out of a query node
    • Is the previous word “to”?
    • Is the word “car” within the same sentence?
    • Is the relative unigram frequency of the previous word > 0.05?
    • Is the entropy of the trigram distribution P(.|wi-2,wi-1) below 0.1 (at position ‘i’ in the data)?
• General (discrete)
  – N (> 2) edges going out of a query node
    • Number of children (0, 1, 2, >2) → 4 edges down
    • POS of the previous word (N, V, A, …) → 10 (11, 12, …) edges
    • Discretization intervals (0..0.05, ..0.10, …, ..1.00) → 20 edges
  (Both kinds are sketched in code below.)
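In code, a query is naturally a function from the context to an edge label: a binary query returns True/False, a general one returns one of N labels. A sketch with illustrative context keys:

```python
def q_prev_is_to(ctx):
    """Binary: is the previous word 'to'? (two edges: True/False)"""
    return ctx["prev_word"] == "to"

def q_prev_pos(ctx):
    """General: POS of the previous word (one edge per tag)."""
    return ctx["prev_tag"]                 # e.g. 'N', 'V', 'A', ...

def q_freq_interval(ctx):
    """Discretized continuous value: one of 20 interval edges."""
    return min(int(ctx["prev_unigram_freq"] / 0.05), 19)

print(q_prev_is_to({"prev_word": "to"}))             # True
print(q_freq_interval({"prev_unigram_freq": 0.07}))  # interval 1 (0.05..0.10)
```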
Queries: Acquisition
• Defined ahead of time
  – Diagnosis (problem, medical), financials, profiling, …
  – NLP: POS tagging, word sense disambiguation
• Template-based
  – Analogy:
    • Brill’s TBEDL
    • Feature templates in MaxEnt models, perceptrons, …
  – Used for most tasks in NLP
    • Language modeling, other models with conditional distributions
  – Two steps (see the sketch below):
    • Template definition (manual)
    • Query instance generation (template “expansion”) – data-based
    • [Selection (cf. feature selection in MaxEnt): part of tree building]
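A sketch of the two steps: each manual template (“wi = ?”, “ti-1 = ?”) is expanded into one query instance per value observed in the training data; the encoding is illustrative:

```python
words = "John can bring the can to the table".split()
tags = "NN MOD VBF DET NN PRE DET NN".split()

# template "wi = ?" -> one query instance per word type in the data
word_queries = {f"wi={w}?": (lambda ctx, w=w: ctx["word"] == w)
                for w in sorted(set(words))}

# template "ti-1 = ?" -> one query instance per tag observed in the data
tag_queries = {f"ti-1={t}?": (lambda ctx, t=t: ctx["prev_tag"] == t)
               for t in sorted(set(tags))}

pool = {**word_queries, **tag_queries}
print(len(pool), sorted(pool))   # 6 word queries + 5 tag queries
```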
Tree Building: Data
• A (large) vector of pairs (y,x): D = (y_i, x_i), i = 1..|D|
  • y_i – the value of interest (the one being predicted)
  • x_i – the context (‘history’)
• Examples, modeling p(y|x):
  – Language modeling
    • n-gram LM: y_i – current word, x_i – previous (n-1) words
  – POS tagging
    • y_i – POS tag, x_i – words in the sentence & tags to the left (see the sketch below)
  – Word sense disambiguation
    • y_i – sense of the word at position ‘i’ (from a fixed set)
    • x_i – words up to ±50 positions away from position ‘i’
  – (Non-NLP:) disease diagnostics
    • y_i – disease, x_i – vector of symptoms (numerical, categorical)
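For the POS-tagging case, D can be built directly from a tagged sentence; a sketch, assuming the context holds the current word and the tag to the left (the same encoding is reused in the example later):

```python
words = "John can bring the can to the table".split()
tags = "NN MOD VBF DET NN PRE DET NN".split()

D = [(tags[i], {"word": words[i],
                "prev_tag": tags[i - 1] if i > 0 else None})
     for i in range(len(words))]

print(D[1])   # ('MOD', {'word': 'can', 'prev_tag': 'NN'})
```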
Tree Building: the Objective Function (Φ)
• Form
  – A function to maximize/minimize: Φ: T × D → ℝ
    • T – the decision tree being built, D – the training data
• Distribution-based
  – As usual: (max.) probability of data ~ (min.) entropy
    argmax_T Φ(T,D) = P_T(D)   ~   argmin_T Φ(T,D) = -Σ_{y,x} p_T(y,x) log p_T(y|x)
                                                   = -1/|D| Σ_{i=1..|D|} log p_T(y_i|x_i)
• Error-based (~ minimum error rate training, MERT; see the sketch below)
    argmin_T Φ(T,D) = ER = 1/|D| Σ_{i=1..|D|} δ(Classify_T(x_i), y_i)
    Classify_T: X → Y chooses y ∈ Y given context x ∈ X using T
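Both objectives are short sums over the training data. A minimal sketch, assuming the tree exposes a classifier (error-based case) and a conditional probability p_T(y|x) (distribution-based case); the names are illustrative:

```python
import math

def error_rate(tree, D, classify):
    """ER = 1/|D| * sum_i delta(Classify_T(x_i), y_i)."""
    return sum(classify(tree, x) != y for y, x in D) / len(D)

def cross_entropy(tree, D, prob):
    """-1/|D| * sum_i log p_T(y_i | x_i); lower = data more probable."""
    return -sum(math.log(prob(tree, y, x)) for y, x in D) / len(D)
```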
Tree Building: the Algorithm (~ID3)
• Using the training data and the objective function:
    T_final = argmin_T Φ(T,D)
• Too (exponentially) many possible trees (vs. the number of queries)
• Greedy search (Q: pool of possible queries; a code sketch follows):
  1. Start with an ‘empty’ T (a single leaf node); set Φ_0 = Φ(T,D)
  2. Iterate (iteration index k = 1..k_max); set Φmin_k = Φ_{k-1}
     • For all leaves l_i in T, for all q_j ∈ Q:
       – Split l_i into a query node q_j and n_j (≥ 2) leaves → call the result T_{i,j}
       – If Φ(T_{i,j},D) < Φmin_k: set Φmin_k = Φ(T_{i,j},D) and remember i,j
         (i.e., Φmin_k holds the minimal value of Φ(T_{i,j},D) found so far)
  3. Set Φ_k = Φmin_k and T = T_{i,j}; repeat (2) until termination
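A runnable sketch of the greedy loop above, using the error-rate objective. A leaf is represented simply by the list of training items that reach it, and queries by (name, function) pairs; this is an illustration of the algorithm, not the lecture’s code:

```python
from collections import Counter

def errors(items):
    """Errors left when a leaf labels all its items with their majority class."""
    return len(items) - Counter(y for y, _ in items).most_common(1)[0][1]

def build_tree(D, queries, k_max=10):
    """Greedy node-splitting with the error-rate objective.

    D is a list of (y, x) pairs; queries is a list of (name, fn-of-x) pairs.
    Returns the chosen (leaf index, query name) splits and the final error rate."""
    leaves = [list(D)]
    total = errors(D)
    plan = []
    for _ in range(k_max):
        best = None                      # (new total errors, i, name, yes, no)
        for i, items in enumerate(leaves):
            for name, q in queries:
                yes = [d for d in items if q(d[1])]
                no = [d for d in items if not q(d[1])]
                if not yes or not no:    # query does not split this leaf
                    continue
                t = total - errors(items) + errors(yes) + errors(no)
                if best is None or t < best[0]:
                    best = (t, i, name, yes, no)
        if best is None or best[0] >= total:
            break                        # no improvement -> terminate
        total, i, name, yes, no = best
        plan.append((i, name))
        leaves[i:i + 1] = [yes, no]      # replace leaf i by its two children
    return plan, total / len(D)
```

Note that splitting leaf l_i changes only that leaf’s contribution to the total error, so each candidate can be scored locally; the same observation gives the entropy-computation savings mentioned later.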
Tree Building: Termination
• Terminating condition(s):
  – For any new query splitting any leaf node of T:
    • No improvement of the objective function Φ(T,D)
      – i.e., Φmin_k = Φ_{k-1} at the end of the iteration; for example:
        » the entropy does not go down
        » the error rate does not go down
    • Only a small improvement (Φ_{k-1} - Φmin_k < ε)
      – to avoid overtraining
  – Tree too big (or too deep)
    • to avoid overtraining, long running times, space constraints, etc.
    • set k_max to the desired maximum tree size, or watch T’s depth
Example: POS tagging
• Data:
    Positions: 1    2    3     4    5    6    7    8
    X:         John can  bring the  can  to   the  table
    Y:         NN   MOD  VBF   DET  NN   PRE  DET  NN
• Objective function: error rate
    Φ(T,D) = 1/|D| Σ_{i=1..|D|} δ(C_T(x_i), y_i)
    (C_T is the “Classify” function: words (in/with context) → tags)
• Query pool (a brute-force check follows):
    q1..q6: current word (wi = John, …, table)
    q7..q11: previous tag (ti-1 = NN, …, PRE)
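Before walking through the trace, the first few numbers can be checked by brute force; a small self-contained sketch (the context encoding is an assumption, matching the earlier data sketch):

```python
from collections import Counter

words = "John can bring the can to the table".split()
tags = "NN MOD VBF DET NN PRE DET NN".split()
D = [(tags[i], {"w": words[i], "t1": tags[i - 1] if i else None})
     for i in range(len(words))]

def errs(items):
    """Errors when a leaf predicts its majority tag (0 for an empty leaf)."""
    return len(items) - Counter(y for y, _ in items).most_common(1)[0][1] if items else 0

def phi(q):
    """Error rate of the one-split tree: one leaf for q(x)=yes, one for no."""
    yes = [d for d in D if q(d[1])]
    no = [d for d in D if not q(d[1])]
    return (errs(yes) + errs(no)) / len(D)

print(errs(D) / len(D))                    # Phi_0 = 5/8 (single all-NN leaf)
print(phi(lambda c: c["w"] == "bring"))    # q3: 1/2
print(phi(lambda c: c["w"] == "the"))      # q4: 3/8, the iteration-1 winner
```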
Example: POS tagging – trace of the greedy search

Iteration 0
  Tree: a single leaf l1, C: NN (the most frequent tag)
  Φ_0 = Φ(T,D) = 5/8
Iteration 1 (Φmin_1 starts at Φ_0 = 5/8; only leaf l1 can be split)
  Candidate splits T_{1,j} of l1 and their error rates Φ(T_{1,j},D):
    q1 (wi=John?):    yes → C: NN,  no → C: NN    Φ = 5/8
    q2 (wi=can?):     yes → C: MOD, no → C: NN    Φ = 5/8
    q3 (wi=bring?):   yes → C: VBF, no → C: NN    Φ = 1/2   → Φmin_1 = 1/2, i,j: 1,3
    q4 (wi=the?):     yes → C: DET, no → C: NN    Φ = 3/8   → Φmin_1 = 3/8, i,j: 1,4
    q5 (wi=to?):      yes → C: PRE, no → C: NN    Φ = 1/2
    q6 (wi=table?):   yes → C: NN,  no → C: NN    Φ = 5/8
    q7 (ti-1=NN?):    yes → C: MOD, no → C: NN    Φ = 1/2
    q8 (ti-1=MOD?):   yes → C: VBF, no → C: NN    Φ = 1/2
    q9 (ti-1=VBF?):   yes → C: DET, no → C: NN    Φ = 1/2
    q10 (ti-1=DET?):  yes → C: NN,  no → C: DET   Φ = 1/2
    q11 (ti-1=PRE?):  yes → C: DET, no → C: NN    Φ = 1/2
  End of iteration 1: best split i,j: 1,4 (leaf l1, query q4); Φ_1 = 3/8
  Decreasing … go on to the next iteration. Tree after iteration 1:
    wi=the? (q4)
      yes → C: DET (l1)
      no  → C: NN  (l2)
Iteration 2 (Φmin_2 starts at Φ_1 = 3/8; leaves l1 and l2 are tried)
  Splitting leaf l1 (C: DET, reached only by occurrences of “the”):
    q1 (wi=John?):    yes → (empty), no → C: DET  Φ(T_{1,1},D) = 3/8
      Can we ignore the queries with other words? … yes (same wi).
    q7 (ti-1=NN?):    yes → (empty), no → C: DET  Φ(T_{1,7},D) = 3/8
      Can we ignore the queries with tags at i-1? … no (but see below for a more powerful heuristic).
    q11 (ti-1=PRE?):  yes → C: DET, no → C: DET   Φ(T_{1,11},D) = 3/8
      OK, so what is the heuristic? Leaves with absolute classification certainty need not be split.
      This leaf is final: if word_i = the, DET is the only possible tag.
  Splitting leaf l2 (C: NN):
    q1 (wi=John?):    yes → C: NN,  no → C: NN    Φ = 3/8
    q2 (wi=can?):     yes → C: MOD, no → C: NN    Φ = 3/8
    q3 (wi=bring?):   yes → C: VBF, no → C: NN    Φ = 1/4   → Φmin_2 = 1/4, i,j: 2,3
    q4 (wi=the?):     yes → (empty), no → C: NN   Φ = 3/8
      Should I look at the same question again? Queries once used may be excluded … unlike in TBEDL!
    q5 (wi=to?):      yes → C: PRE, no → C: NN    Φ = 1/4
    q6 (wi=table?):   yes → C: NN,  no → C: NN    Φ = 3/8
    q7 (ti-1=NN?):    yes → C: MOD, no → C: NN    Φ = 1/4
    q8 (ti-1=MOD?):   yes → C: VBF, no → C: NN    Φ = 1/4
    q9 (ti-1=VBF?):   yes → C: DET, no → C: NN    Φ = 3/8
    q10 (ti-1=DET?):  yes → C: NN,  no → C: NN    Φ = 3/8
    q11 (ti-1=PRE?):  yes → C: DET, no → C: NN    Φ = 3/8
  End of iteration 2: best split i,j: 2,3 (leaf l2, query q3); Φ_2 = 1/4
  Tree after iteration 2:
    wi=the? (q4)
      yes → C: DET (l1)
      no  → wi=bring? (q3)
        yes → C: VBF (l2)
        no  → C: NN  (l3)
Iteration 3 (Φmin_3 starts at Φ_2 = 1/4)
  Queries already used (q4, q3) are deleted from the pool; l1 and l2 are final (no ambiguity), so only leaf l3 (C: NN) is split:
    q1 (wi=John?):    yes → C: NN,  no → C: NN    Φ(T_{3,1},D) = 1/4
    q2 (wi=can?):     yes → C: MOD, no → C: NN    Φ(T_{3,2},D) = 1/4
    q5 (wi=to?):      yes → C: PRE, no → C: NN    Φ(T_{3,5},D) = 1/8   → Φmin_3 = 1/8, i,j: 3,5
    q6 (wi=table?):   yes → C: NN,  no → C: NN    Φ(T_{3,6},D) = 1/4
    q7 (ti-1=NN?):    yes → C: MOD, no → C: NN    Φ(T_{3,7},D) = 1/8   (no better than Φmin_3)
    q8 (ti-1=MOD?):   yes → (empty), no → C: NN   Φ(T_{3,8},D) = 1/4
    q9 (ti-1=VBF?):   yes → (empty), no → C: NN   Φ(T_{3,9},D) = 1/4
    q10 (ti-1=DET?):  yes → C: NN,  no → C: NN    Φ(T_{3,10},D) = 1/4
    q11 (ti-1=PRE?):  yes → (empty), no → C: NN   Φ(T_{3,11},D) = 1/4
  End of iteration 3: best split i,j: 3,5 (leaf l3, query q5); Φ_3 = 1/8
  Tree after iteration 3 (start of iteration 4):
    wi=the? (q4)
      yes → C: DET (l1)
      no  → wi=bring? (q3)
        yes → C: VBF (l2)
        no  → wi=to? (q5)
          yes → C: PRE (l3)
          no  → C: NN  (l4)
Iteration 4 (Φmin_4 starts at Φ_3 = 1/8; q5 is now also deleted from the pool; l3 is final)
  Splitting leaf l4 (C: NN):
    q1 (wi=John?):    yes → C: NN,  no → C: NN    Φ(T_{4,1},D) = 1/8
    q2 (wi=can?):     yes → C: MOD, no → C: NN    Φ(T_{4,2},D) = 1/8
    q6 (wi=table?):   yes → C: NN,  no → C: NN    Φ(T_{4,6},D) = 1/8
    q7 (ti-1=NN?):    yes → C: MOD, no → C: NN    Φ(T_{4,7},D) = 0   → Φmin_4 = 0, i,j: 4,7
      (Φ = 0 cannot be improved, so the remaining candidates need not be evaluated.)
  Φ_4 = 0 → end of iteration 4, end of training.

The final decision tree:
  wi=the? (q4)
    yes → C: DET (l1)
    no  → wi=bring? (q3)
      yes → C: VBF (l2)
      no  → wi=to? (q5)
        yes → C: PRE (l3)
        no  → ti-1=NN? (q7)
          yes → C: MOD (l4)
          no  → C: NN  (l5)
Generalized Case
• The same algorithm
• Objective function – entropy:
    argmin_T Φ(T,D) = -Σ_{y,x} p_T(y,x) log p_T(y|x)
                    = -1/|D| Σ_{i=1..|D|} log p_T(y_i|x_i)
• Effective computation
  – Compute only the change of entropy at the split node (see the sketch below)
    • Savings (much) greater towards the end of the computation
  – “Final” leaf heuristic:
    • zero entropy of that leaf’s distribution
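A sketch of the local entropy computation: only the split leaf’s weighted contribution to the total entropy changes, so each candidate split can be scored without touching the rest of the tree. Illustrative names, not the lecture’s code:

```python
import math
from collections import Counter

def H(items):
    """Empirical entropy of the class distribution at a (would-be) leaf."""
    n = len(items)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(y for y, _ in items).values())

def entropy_change(items, q):
    """Change in the tree's total entropy if this leaf is split by q."""
    yes = [d for d in items if q(d[1])]
    no = [d for d in items if not q(d[1])]
    n = len(items)
    after = len(yes) / n * H(yes) + len(no) / n * H(no)
    return after - H(items)   # <= 0; the more negative, the better the split
```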
Avoiding Overtraining (Overfitting)
• Stop growing the tree early
  – Set a threshold ε > 0 for the entropy gain
• Smoothing (for the generalized case)
  – Keep a distribution at every node (not just the leaves) + a weight (the “lambda”)
  – Smooth along the path taken from the root to the leaf
  – Train the weights by EM on heldout data
• Tree pruning (see the sketch below)
  – Use heldout data H for the Φ(T,H) computation
  – Remove a node if Φ(T,H) decreases (or stays the same ← minimal complexity principle)
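A sketch of the pruning step on heldout data H, assuming the Node class from the classification sketch earlier and that every inner node also stores a fallback class label in value (so it can act as a leaf when collapsed); illustrative, not the lecture’s exact procedure:

```python
def heldout_error(root, H_data, classify):
    """Phi(T,H): error count of the tree on the heldout data."""
    return sum(classify(root, x) != y for y, x in H_data)

def prune(root, node, H_data, classify):
    """Bottom-up reduced-error pruning: collapse a subtree into a leaf
    whenever the heldout error does not increase (minimal complexity)."""
    if node.is_leaf:
        return
    for child in node.children.values():
        prune(root, child, H_data, classify)
    before = heldout_error(root, H_data, classify)
    saved = (node.query, node.children)
    node.query, node.children = None, None    # tentatively collapse to a leaf
    if heldout_error(root, H_data, classify) <= before:
        return                                # no worse on heldout: keep pruned
    node.query, node.children = saved         # otherwise restore the subtree
```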
Generalizations / Variants
• Historically: C4.5, C5.0
  – Conversion to rules at the end, pruning the rules at any “node” (not just leaves)
  – Ability to use incomplete data for training (unknown values of some attributes)
  – Use of continuous-valued attributes
    • Key: finding (effectively) the threshold t for converting to the query “is value_A < t?” (see the sketch below)
  – Efficient runtime software for many programming languages
• Decision lists
  – “Narrow” graphs; multiple parents allowed, special features
    • in certain cases, like WSD, they avoid too fine a partitioning of the data
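For a continuous attribute it suffices to try the midpoints between consecutive distinct observed values, since any other threshold induces the same split of the data; a small sketch under that assumption:

```python
from collections import Counter

def errs(items):
    """Errors when a leaf predicts its majority class."""
    return len(items) - Counter(y for y, _ in items).most_common(1)[0][1]

def best_threshold(items, attr):
    """items: (label, {attr: value, ...}) pairs; returns (t, error count)."""
    values = sorted({x[attr] for _, x in items})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    def split_errs(t):
        below = [d for d in items if d[1][attr] < t]
        above = [d for d in items if d[1][attr] >= t]
        return errs(below) + errs(above)
    return min(((t, split_errs(t)) for t in candidates), key=lambda p: p[1])

data = [("Flu", {"temp": 39.7}), ("Flu", {"temp": 39.2}),
        ("Healthy", {"temp": 36.8}), ("Healthy", {"temp": 37.1})]
print(best_threshold(data, "temp"))   # -> (38.15, 0): a perfect split
```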
Further Reading
Mitchell, T. M. (1997): Machine Learning. WCB/McGraw-Hill, ISBN 0-07-042807-7. Chapter 3.
Quinlan, J. R. (1986): Induction of Decision Trees. Machine Learning 1(1), 81–106.
Quinlan, J. R. (1993): C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
  – http://rulequest.com (Quinlan’s company)
  – C5.0/2.09 (latest version)
Rokach, L., Maimon, O. (2005): Top-Down Induction of Decision Trees Classifiers – A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 35(4), 476–487.
Google search