Spark & Machine Learning Meetup
Hyperparameter Optimization - when scikit-learn meets PySpark
Sven Hafeneger
27.10.2016
©2015 IBM Corporation May 2, 20232
Data Science Workflow
Wikipedia https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
knobs to tune !
https://www.okwenclosures.com/en/Potentiometer-Tuning-knobs/Top-Knobs.htm
Data Science Workflow - Modeling
Tuning the model's hyperparameters:
Improves robustness
Influences complexity
Helps with class imbalances
https://www.kvraudio.com/forum/viewtopic.php?t=328938
What is Hyperparameter Optimization?
“… is the problem of choosing a set of hyperparameters for a learning algorithm, …” [1]
Grid search, random search, …
https://openclipart.org/detail/194603/grid-search-pattern
http://25.media.tumblr.com/tumblr_lcelmoEfoX1qbl1tko1_400.jpg
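The difference between the two strategies can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler`. This is only an illustration: the parameter names and candidate values below are assumptions, not the ones used in the talk. Grid search tries every combination, while random search draws a fixed number of combinations from the same space.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space (values chosen only for illustration).
space = {
    "max_depth": [5, 10, 15, 20],
    "n_estimators": [50, 100, 200],
}

# Grid search: every combination of the candidate values (4 x 3 = 12).
grid_points = list(ParameterGrid(space))

# Random search: a fixed budget of combinations drawn from the same space.
random_points = list(ParameterSampler(space, n_iter=4, random_state=0))

print("grid search candidates:  ", len(grid_points))
print("random search candidates:", len(random_points))
for params in random_points:
    print(params)
```

With a fixed budget, random search evaluates fewer candidates than the full grid, which is the trade-off discussed in [1].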
Gridsearch with scikit-learn
Build a classification model
We have some data and a classification problem
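A minimal sketch of such a starting point follows. The talk's dataset and estimator are not shown in the slide text, so the synthetic data and the choice of `RandomForestClassifier` are assumptions (the latter is a guess based on the `max_depth` and `n_estimators` parameters reported later).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "some data"; the real dataset is not given.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline model with default hyperparameters (assumed estimator).
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

baseline_acc = clf.score(X_test, y_test)
print("baseline test accuracy:", baseline_acc)
```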
… well ... yes ... overfitted !
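One common way to spot overfitting is to compare the score on the training data with the score on held-out data. The sketch below assumes noisy synthetic data and an unconstrained random forest; neither comes from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data (flip_y randomly reassigns a fraction of labels).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fully grown trees can memorize the training set, including its noise.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
print("gap:           ", train_acc - test_acc)
```

A large gap between the two scores suggests the model fits the training data much better than unseen data.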
Improve test scores !
~ 500 jobs, ~ 13 mins
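The number of jobs in a grid search is the number of parameter combinations times the number of cross-validation folds. The talk's actual grid is not shown; a 10 x 10 grid with 5-fold cross-validation would be one way to reach 500 fits. The sketch below uses a much smaller, assumed grid so that it runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Small hypothetical grid (3 x 2 = 6 combinations).
param_grid = {
    "max_depth": [3, 5, None],
    "n_estimators": [10, 50],
}
cv_folds = 3

# Each combination is fitted once per fold.
n_fits = len(ParameterGrid(param_grid)) * cv_folds
print("model fits to run:", n_fits)

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv_folds
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```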
Return (best) model
Accuracy: 0.44 => 0.76
max_depth=15, n_estimators=200
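With scikit-learn's `GridSearchCV`, the winning model and its parameters are available after fitting through `best_estimator_` and `best_params_`. The sketch below assumes synthetic data and a small grid, so its scores will not match the accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid; the talk's grid is not shown.
param_grid = {"max_depth": [5, 15], "n_estimators": [20, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is retrained on all
# training data using the best parameter combination.
best_model = search.best_estimator_
tuned_acc = best_model.score(X_test, y_test)

print("best parameters:    ", search.best_params_)
print("mean CV score:      ", search.best_score_)
print("held-out test score:", tuned_acc)
```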
Gridsearch with spark-sklearn
What if you have access to a Spark cluster ?
Distribute the workload on the cluster !
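The spark-sklearn package provides a `GridSearchCV` class that takes a SparkContext as its first argument and is otherwise used like the scikit-learn one. The helper below is a hypothetical wrapper (`build_grid_search` is not part of either library) that falls back to plain scikit-learn when no SparkContext is passed, so the sketch also runs without a cluster. Note that spark-sklearn was written against older scikit-learn versions, so treat the Spark branch as an assumption to verify in your environment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def build_grid_search(estimator, param_grid, sc=None, cv=3):
    """Return a grid search object, distributed if a SparkContext is given.

    Hypothetical helper for illustration only.
    """
    if sc is not None:
        # spark-sklearn: same interface, plus the SparkContext up front.
        from spark_sklearn import GridSearchCV as SparkGridSearchCV
        return SparkGridSearchCV(sc, estimator, param_grid, cv=cv)
    # Local fallback: ordinary scikit-learn grid search.
    from sklearn.model_selection import GridSearchCV
    return GridSearchCV(estimator, param_grid, cv=cv)


X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
param_grid = {"max_depth": [5, 15], "n_estimators": [10, 30]}

# Pass sc=<your SparkContext> to distribute the fits across the cluster.
search = build_grid_search(RandomForestClassifier(random_state=0), param_grid)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Because each parameter combination and fold is an independent fit, the work parallelizes naturally across executors.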
Save time ! Concentrate on more important problems …
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html
Data Science Workflow
Faster cycles !
Try it out
Source: [6]
https://pypi.python.org
References
[1] Bergstra, James; Bengio, Yoshua (2012). "Random Search for Hyper-Parameter Optimization". Journal of Machine Learning Research 13: 281–305.
Thanks !