Distributed Iterative Training
Kevin Gimpel Shay Cohen Severin Hacker Noah A. Smith
Outline
• The Problem
• Distributed Architecture
• Experiments and Hadoop Issues
Iterative Training
• Many problems in NLP and machine learning require iterating over large training sets many times
  – Training log-linear models (logistic regression, conditional random fields)
  – Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
  – *Online learning (MIRA, perceptron, stochastic gradient descent)
  – Minimum Error-Rate Training in MT
• All of the above except * can be easily parallelized (see the sketch below):
  – Compute statistics on sections of the data independently
  – Aggregate them
  – Update parameters using statistics of the full set of data
  – Repeat until a stopping criterion is met
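A minimal single-process sketch of this compute/aggregate/update pattern in C++ (a toy example: estimating a mean stands in for the model-specific statistics; shard_stats and the data are hypothetical, not from the systems discussed here):

    // Minimal sketch of the compute/aggregate/update pattern (hypothetical
    // toy: estimating a mean stands in for EM expected counts or a gradient).
    #include <cmath>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Per-shard statistics; in real systems this is the expensive step
    // (e.g., running inside-outside on every sentence in the shard).
    std::pair<double, int> shard_stats(const std::vector<double>& shard) {
        double sum = 0.0;
        for (double x : shard) sum += x;
        return {sum, static_cast<int>(shard.size())};
    }

    int main() {
        std::vector<std::vector<double>> shards = {{1, 2}, {3, 4}, {5}};
        double param = 0.0;
        for (int iter = 0; iter < 100; ++iter) {
            double sum = 0.0; int n = 0;
            for (const auto& shard : shards) {       // independent per shard
                auto [s, c] = shard_stats(shard);
                sum += s; n += c;                    // aggregate
            }
            double updated = sum / n;                // update parameters
            if (std::fabs(updated - param) < 1e-9) break;  // stopping criterion
            param = updated;
        }
        std::cout << "estimate: " << param << "\n";  // prints 3
        return 0;
    }

In the distributed setting below, the inner loop over shards becomes the map phase and the summation becomes the reduce phase.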
Dependency Grammar Induction
• Given sentences of natural language text, infer (dependency) parse trees
• State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)
• This talk: scaling up to more and longer sentences using Hadoop!
Dependency Grammar Induction
• Training
  – Input is a set of sentences (actually, POS-tag sequences) and a grammar with initial parameter values
  – Run an iterative optimization algorithm (EM, L-BFGS, etc.) that changes the parameter values on each iteration
  – Output is a learned set of parameter values
• Testing
  – Use the grammar with learned parameters to parse a small set of test sentences
  – Evaluate by computing the percentage of predicted edges that match a human annotator's (see the sketch below)
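The edge-matching evaluation is directed attachment accuracy. A minimal sketch (the helper name and example data are illustrative, not the authors' evaluation code):

    // Hypothetical sketch of the evaluation metric: directed attachment
    // accuracy, the percentage of predicted head indices matching gold.
    #include <iostream>
    #include <vector>

    double attachment_accuracy(const std::vector<int>& predicted_heads,
                               const std::vector<int>& gold_heads) {
        int correct = 0;
        for (size_t i = 0; i < gold_heads.size(); ++i)
            if (predicted_heads[i] == gold_heads[i]) ++correct;
        return 100.0 * correct / gold_heads.size();
    }

    int main() {
        // One 4-token sentence; heads[i] is the index of token i's parent
        // (0 = artificial root, tokens numbered from 1).
        std::vector<int> gold = {2, 0, 2, 3}, pred = {2, 0, 3, 3};
        std::cout << attachment_accuracy(pred, gold) << "%\n";  // 75%
        return 0;
    }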
Outline
• The Problem
• Distributed Architecture
• Experiments and Hadoop Issues
MapReduce for Grammar Induction
• MapReduce was designed for:
  – Large amounts of data distributed across many disks
  – Simple data processing
• We have:
  – (Relatively) small amounts of data
  – Expensive processing and high memory requirements
MapReduce for Grammar Induction
• Algorithms require 50-100 iterations for convergence
  – Each iteration requires a full sweep over all training data
  – The computational bottleneck is computing expected counts for EM on each iteration (the gradient, for L-BFGS)
• Our approach: run one MapReduce job for each iteration (see the mapper sketch below)
  – Map: compute expected counts (gradient)
  – Reduce: aggregate
  – Offline: renormalize (EM) or modify parameter values (L-BFGS)
• Note: renormalization could be done in the reduce tasks for EM given the correct partition functions, but using L-BFGS across multiple reduce tasks is trickier
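A hypothetical sketch of the map step as a streaming program: one POS-tag sequence per input line, one partial expected count per output line. The expected_counts function and its emitted event names are illustrative stubs; the real mapper runs inside-outside under the current parameters, loaded from the distributed cache:

    // Hypothetical Hadoop Streaming mapper sketch: reads one POS-tag
    // sequence per line, emits "event<TAB>expected count" pairs.
    #include <iostream>
    #include <map>
    #include <string>

    // Stub: expected counts for one sentence under the current parameters.
    std::map<std::string, double> expected_counts(const std::string& tags) {
        (void)tags;  // a real implementation runs inside-outside here
        return {{"p_root(NN)", 0.345}};  // placeholder value
    }

    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {   // e.g. "[NNP,NNP,VBZ,NNP]"
            for (const auto& [event, count] : expected_counts(line))
                std::cout << event << '\t' << count << '\n';
        }
        return 0;
    }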
MapReduce Implementation
Data flow for one EM iteration:

Sentences (input, one POS-tag sequence per map record):
[NNP,NNP,VBZ,NNP]
[DT,JJ,NN,MD,VB,JJ,NNP,CD]
[DT,NN,NN,VBZ,RB,VBN,VBN]
…

Map: compute expected counts
p_root(NN) 0.345
p_root(NN) 1.875
p_dep(CD | NN, right) 0.175
p_dep(CD | NN, right) 0.025
p_dep(DT | NN, right) 0.065
…

Reduce: aggregate expected counts
p_root(NN) 2.220
p_dep(CD | NN, right) 0.200
p_dep(DT | NN, right) 0.065
…

Server:
1. Normalize expected counts to get new parameter values
2. Start a new MapReduce job, placing the new parameter values on the distributed cache

New parameter values (log-probabilities, distributed via the cache):
p_root(NN) = -1.91246
p_dep(CD | NN, right) = -2.7175
p_dep(DT | NN, right) = -3.0648
…
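The reduce side is just per-key summation (the "simple summer" of the next slide). A minimal streaming reducer in that spirit (a hypothetical sketch, not the authors' code; it relies on Hadoop Streaming delivering map output sorted by key, so equal keys arrive contiguously):

    // Minimal Hadoop Streaming reducer: sums the values for each key.
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::string line, key, prev_key;
        double value, sum = 0.0;
        bool first = true;
        while (std::getline(std::cin, line)) {
            std::istringstream iss(line);
            // Input lines look like: "p_root(NN)<TAB>0.345"; keys may
            // contain spaces, so split on the tab, not on whitespace.
            if (!std::getline(iss, key, '\t') || !(iss >> value)) continue;
            if (!first && key != prev_key) {
                std::cout << prev_key << '\t' << sum << '\n';
                sum = 0.0;
            }
            sum += value;
            prev_key = key;
            first = false;
        }
        if (!first) std::cout << prev_key << '\t' << sum << '\n';
        return 0;
    }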
Running Experiments

We use Hadoop Streaming for all experiments, with two C++ programs: the server and the mapper (the reducer is a simple summer).

> cd /home/kgimpel/grammar_induction
> hod allocate -d /home/kgimpel/grammar_induction -n 25
> ./dep_induction_server \
    input_file=/user/kgimpel/data/train20-20parts \
    aux_file=aux.train20 output_file=model.train20 \
    hod_config=/home/kgimpel/grammar_induction \
    num_reduce_tasks=5 1> stdout 2> stderr

dep_induction_server runs one MapReduce job on each iteration (sketched below). The input is pre-split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter).
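A hypothetical sketch of the server's outer loop. The streaming jar name, the flag set (-cacheFile and -numReduceTasks are era-appropriate streaming options), the file names, and the renormalize helper are illustrative assumptions, not the actual dep_induction_server internals:

    // Hypothetical sketch of the server's per-iteration loop: launch one
    // streaming job per iteration, then renormalize counts into parameters.
    #include <cstdlib>
    #include <iostream>
    #include <string>

    // Assumed helper (stub): the EM M-step. A real server would read the
    // aggregated counts, normalize each distribution, write new values.
    void renormalize(const std::string& counts_dir, const std::string& params) {
        (void)counts_dir; (void)params;  // stub only
    }

    int main() {
        std::string params = "model.init";        // initial parameter values
        for (int iter = 0; iter < 100; ++iter) {  // ~100 iterations typical
            std::string out = "counts." + std::to_string(iter);
            // New parameters reach the mappers via the distributed cache.
            std::string cmd =
                "hadoop jar hadoop-streaming.jar"
                " -input /user/kgimpel/data/train20-20parts"
                " -output " + out +
                " -mapper dep_induction_map -reducer sum_reduce"
                " -cacheFile " + params + "#params"
                " -numReduceTasks 5";
            if (std::system(cmd.c_str()) != 0) {
                std::cerr << "iteration " << iter << " failed\n";
                return 1;
            }
            renormalize(out, params);  // offline M-step between jobs
        }
        return 0;
    }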
Outline
• The Problem
• Distributed Architecture
• Experiments and Hadoop Issues
Speed-up with Hadoop
• 38,576 sentences• ≤ 40 words / sent.
• 40 nodes• 5 reduce tasks
• Average iteration time reduced from 2039 s to 115 s
• Total time reduced from 3400 minutes to 200 minutes
[Figure: log-likelihood (×10^6, from -2.2 to -1.8) vs. wall clock time (0-3500 minutes), comparing a single node against Hadoop (40 nodes)]
Hadoop Issues
1. Overhead of running a single MapReduce job
2. Stragglers in the map phase
Typical Iteration (40 nodes, 38,576 sentences):

23:17:05 : map 0% reduce 0%
23:17:12 : map 3% reduce 0%
23:17:13 : map 26% reduce 0%
23:17:14 : map 49% reduce 0%
23:17:15 : map 66% reduce 0%
23:17:16 : map 72% reduce 0%
23:17:17 : map 97% reduce 0%
23:17:18 : map 100% reduce 0%
23:18:00 : map 100% reduce 1%
23:18:15 : map 100% reduce 2%
23:18:18 : map 100% reduce 4%
23:18:20 : map 100% reduce 15%
23:18:27 : map 100% reduce 17%
23:18:28 : map 100% reduce 18%
23:18:30 : map 100% reduce 23%
23:18:32 : map 100% reduce 100%

Consistent 40-second delay between the map and reduce phases
• 115 s per iteration total
• 40+ s per iteration of overhead
• When we're running 100 iterations per experiment, 40 seconds per iteration really adds up!

1/3 of execution time is overhead!
• 5 reduce tasks used
• The reduce phase is simply aggregation of values for 2600 parameters

Why does reduce take so long?
Histogram of Iteration Times

[Histogram: count (0-500) vs. iteration time (0-500 seconds); mean = ~115 s]
Histogram of Iteration Times

[Same histogram; mean = ~115 s, but a tail of much slower iterations stands out. What's going on here?]
Slow Iteration (compare the typical iteration above):

23:20:27 : map 0% reduce 0%
23:20:34 : map 5% reduce 0%
23:20:35 : map 20% reduce 0%
23:20:36 : map 41% reduce 0%
23:20:37 : map 56% reduce 0%
23:20:38 : map 74% reduce 0%
23:20:39 : map 95% reduce 0%
23:20:40 : map 97% reduce 0%
23:21:32 : map 97% reduce 1%
23:21:37 : map 97% reduce 2%
23:21:42 : map 97% reduce 12%
23:21:43 : map 97% reduce 15%
23:21:47 : map 97% reduce 19%
23:21:50 : map 97% reduce 21%
23:21:52 : map 97% reduce 26%
23:21:57 : map 97% reduce 31%
23:21:58 : map 97% reduce 32%
23:23:46 : map 100% reduce 32%
23:24:54 : map 100% reduce 46%
23:24:55 : map 100% reduce 86%
23:24:56 : map 100% reduce 100%

3 minutes waiting for the last map tasks (97% → 100%) to complete
Suggestions? (Doesn't Hadoop replicate map tasks to avoid this?)
Questions?