On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Nitish Shirish Keskar¹, Dheevatsa Mudigere², Jorge Nocedal¹, Mikhail Smelyanskiy², Ping Tak Peter Tang²
¹ Northwestern University
² Intel Corporation
ICLR, 2017. Presenter: Tianlu Wang
Outline
1 Introduction
   Batch Size of Stochastic Gradient Methods
2 Drawbacks of Large-Batch Methods
   Main Observations
   Numerical Results
   Parametric Plots
   Sharpness of Minima
3 Success of Small-Batch Methods
   Deterioration with Increasing Batch Size
   Warm-started Large-Batch Experiments
4 Summary
Batch Size of Stochastic Gradient Methods
Non-convex optimization in deep learning:
$$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{M} \sum_{i=1}^{M} f_i(x)$$
Stochastic gradient methods and their variants typically use mini-batches $|B_k| \in \{32, 64, \ldots, 512\}$.
Increasing the batch size to improve parallelism leads to a loss in generalization performance.
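As a concrete (hedged) illustration of the update this slide refers to, the following is a minimal NumPy sketch of one epoch of mini-batch stochastic gradient descent on f(x) = (1/M) Σ f_i(x), with a simple least-squares f_i standing in for a network loss; the data, learning rate, and the helper name minibatch_sgd_epoch are illustrative assumptions, and batch_size plays the role of |B_k|.

    import numpy as np

    rng = np.random.default_rng(0)
    M, n = 10_000, 50                      # number of samples, parameter dimension
    A, b = rng.normal(size=(M, n)), rng.normal(size=M)

    def minibatch_sgd_epoch(x, batch_size, lr=0.01):
        # f_i(x) = 0.5 * (a_i^T x - b_i)^2, so grad f_i(x) = a_i * (a_i^T x - b_i)
        idx = rng.permutation(M)
        for start in range(0, M, batch_size):
            B_k = idx[start:start + batch_size]       # sampled mini-batch B_k
            resid = A[B_k] @ x - b[B_k]
            grad = A[B_k].T @ resid / len(B_k)        # average gradient over B_k
            x = x - lr * grad
        return x

    x = np.zeros(n)
    x = minibatch_sgd_epoch(x, batch_size=64)         # small-batch regime
    x = minibatch_sgd_epoch(x, batch_size=M // 10)    # large-batch: 10% of the data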
Main Observations
Large-batch (LB) methods tend to converge to sharp minimizers of the training function and tend to generalize less well.
Small-batch (SB) methods converge to flat minimizers and are able to escape the basins of attraction of sharp minimizers.
Sharp minimizer $\hat{x}$: the function increases rapidly in a small neighborhood of $\hat{x}$.
Flat minimizer $\bar{x}$: the function varies slowly in a large neighborhood of $\bar{x}$.
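To make the sharp/flat distinction concrete, here is a toy one-dimensional illustration (not from the paper); the two quadratic coefficients are arbitrary choices that only serve to contrast a rapid and a slow increase around minimizers of the same value.

    def sharp(t):
        return 100.0 * t ** 2    # rises rapidly near its minimizer x_hat = 0

    def flat(t):
        return 0.01 * t ** 2     # varies slowly near its minimizer x_bar = 0

    eps = 0.1                    # radius of the neighborhood we probe
    print(max(sharp(eps), sharp(-eps)))   # 1.0    -> large increase within |t| <= eps
    print(max(flat(eps), flat(-eps)))     # 0.0001 -> almost no increase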
Numerical Results
Six multi-class classification networks, mean cross-entropy loss, ADAM optimizer; LB: 10% of the training data per batch, SB: 256 data points per batch.
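A hedged PyTorch sketch of this protocol: the paper's six networks and datasets are replaced here by a small placeholder classifier on synthetic data, and the number of epochs and hidden width are arbitrary assumptions; only the batch-size choices (SB = 256, LB = 10% of the training set), the Adam optimizer, and the mean cross-entropy loss follow the slide.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    X = torch.randn(5_000, 100)                  # synthetic stand-in for training data
    y = torch.randint(0, 10, (5_000,))
    train = TensorDataset(X, y)

    def run(batch_size, epochs=5):
        model = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 10))
        opt = torch.optim.Adam(model.parameters())
        loss_fn = nn.CrossEntropyLoss()          # averaged ("mean") cross entropy
        loader = DataLoader(train, batch_size=batch_size, shuffle=True)
        for _ in range(epochs):
            for xb, yb in loader:
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
        return model

    sb_model = run(batch_size=256)               # small-batch (SB) run
    lb_model = run(batch_size=len(train) // 10)  # large-batch (LB): 10% of the data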
Question
The generalization gap is not due to over-fitting or over-training: testing accuracy plateaus rather than degrading with continued training, so early stopping would not close the gap. What, then, causes it?
Parametric Plots
$x^*_s$ and $x^*_l$: solutions obtained by the SB and LB methods, respectively.
Plot $f(\alpha x^*_l + (1 - \alpha) x^*_s)$ as a function of $\alpha$:
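A minimal sketch of how such a parametric plot can be computed; loss_at is a hypothetical helper that loads a flat parameter vector into the network and returns its loss, and the alpha grid (extending beyond both solutions) is an assumption consistent with the paper's plots.

    import numpy as np

    def parametric_curve(loss_at, x_s_star, x_l_star,
                         alphas=np.linspace(-1.0, 2.0, 25)):
        # alpha = 0 recovers the SB solution x_s*, alpha = 1 the LB solution x_l*;
        # values outside [0, 1] extend the line beyond both minimizers.
        return [loss_at(alpha * x_l_star + (1.0 - alpha) * x_s_star) for alpha in alphas]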
Sharpness of Minima
Motivation: measure the sensitivity of the training function at a given local minimizer. To do so, explore a small neighborhood of the minimizer and compute the largest value that f can attain in that neighborhood.
Sharpness of Minima
Small neighborhood:
p: dimension of the manifold
A: $n \times p$ matrix whose columns are randomly generated
$A^+$: pseudo-inverse of A

Full space: $C_\varepsilon = \{ z \in \mathbb{R}^n : -\varepsilon(|x_i| + 1) \le z_i \le \varepsilon(|x_i| + 1),\ \forall i \in \{1, 2, \ldots, n\} \}$
Random manifold: $C_\varepsilon = \{ z \in \mathbb{R}^p : -\varepsilon(|(A^+ x)_i| + 1) \le z_i \le \varepsilon(|(A^+ x)_i| + 1),\ \forall i \in \{1, 2, \ldots, p\} \}$

Metric 2.1. Given $x \in \mathbb{R}^n$, $\varepsilon > 0$ and $A \in \mathbb{R}^{n \times p}$, the sharpness of f at x is
$$\phi_{x,f}(\varepsilon, A) := \frac{\left(\max_{y \in C_\varepsilon} f(x + Ay)\right) - f(x)}{1 + f(x)} \times 100 \qquad (1)$$
A can be the identity matrix $I_n$ (full-space sharpness).
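A hedged Python sketch of Metric 2.1: the paper solves the inner maximization inexactly with an optimizer (L-BFGS-B), whereas this sketch only probes the box C_eps with random samples, so it returns a lower bound on the sharpness; f is a hypothetical helper that maps a flat parameter vector to the training loss.

    import numpy as np

    def sharpness(f, x, eps=1e-3, A=None, n_samples=100, seed=0):
        # Approximate phi_{x,f}(eps, A) by randomly probing the box C_eps.
        rng = np.random.default_rng(seed)
        if A is None:
            A = np.eye(x.size)                                # full-space case: A = I_n
        bound = eps * (np.abs(np.linalg.pinv(A) @ x) + 1.0)   # half-widths defining C_eps
        f_x = f(x)
        best = -np.inf
        for _ in range(n_samples):
            y = rng.uniform(-bound, bound)                    # a point y in C_eps
            best = max(best, f(x + A @ y))
        return (best - f_x) / (1.0 + f_x) * 100.0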
Sharpness of Minima
Sharpness of minima in the full space (A is the identity matrix $I_n$):
Deterioration with Increasing Batch Size
Note: batch size ≈ 15000 for F2 and batch size ≈ 500 for C1.
There exists a threshold batch size after which there is a deterioration in the quality of the model.
Warm-started Large-Batch Experiments
Train the network with batch size 256 (SB) for 100 epochs and use the iterate after each of these 100 epochs as a starting point for subsequent LB training.
The SB method needs some epochs of exploration before it discovers a flat minimizer.
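A hedged sketch of this warm-start (piggyback) procedure; train_one_epoch and test_accuracy are hypothetical helpers standing in for the paper's training and evaluation code, and the LB batch size of 5000 is a placeholder for "10% of the training data".

    import copy

    def warm_started_lb(model, sb_batch=256, lb_batch=5_000,
                        warm_epochs=100, lb_epochs=100):
        results = []
        for epoch in range(warm_epochs):
            train_one_epoch(model, batch_size=sb_batch)         # SB exploration phase
            lb_model = copy.deepcopy(model)                     # warm start from this epoch
            for _ in range(lb_epochs):
                train_one_epoch(lb_model, batch_size=lb_batch)  # continue with large batches
            results.append((epoch + 1, test_accuracy(lb_model)))
        return results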
Summary
Numerical experiments support the view that convergence to sharp minimizers gives rise to the poor generalization of large-batch methods for deep learning.
SB methods have an exploration phase followed by convergence to a flat minimizer.
Attempts to remedy the problem:
Data augmentation
Conservative training
Adversarial training
Robust optimization