Winning Kaggle 101: Dmitry Larko's Experiences

Page 1: Winning Kaggle 101: Dmitry Larko's Experiences

KAGGLE COMPETITIONS

DMITRY LARKO

MY EXPERIENCE

FOR MACHINE LEARNING AT BERKELEY

Page 2: Winning Kaggle 101: Dmitry Larko's Experiences

Hi, here is my short bio

About 10 years of working experience in the DB/Data Warehouse field.

4 years ago I learned about Kaggle from my dad, who competes on Kaggle as well (and does it better than me).

My first competition was the Amazon.com Employee Access Challenge; I placed 10th out of 1687, learned a ton of new ML techniques and algorithms in a month, and got addicted to Kaggle.

So far I have participated in 25 competitions, finished 2nd twice, and I am among Kaggle's top-100 Data Scientists.

Currently I'm working as a Lead Data Scientist.

Page 3: Winning Kaggle 101: Dmitry Larko's Experiences

Motivation. Why participate?

• To win – the best possible motivation for a competition;

• To learn – Kaggle is a good place to learn, and the best place to learn on Kaggle is the forums of past competitions;

• Looks good on a resume – well… only if you're constantly winning.

Page 4: Winning Kaggle 101: Dmitry Larko's Experiences

How to start?

1. Learn Python

2. Take a MOOC on machine learning (Coursera, Udacity, Harvard)

3. Participate in Kaggle "Getting Started" and "Playground" competitions

4. Visit finished Kaggle competitions and go through the winners' solutions posted on the competition forums

Page 5: Winning Kaggle 101: Dmitry Larko's Experiences

Kaggle’s Toolset (1 of 2)

• Scikit-Learn (scikit-learn.org). Simple and efficient tools for data mining and data analysis. There are a lot of tutorials, and many ML algorithms have a scikit-learn implementation.

• XGBoost (github.com/dmlc/xgboost). An optimized general-purpose gradient boosting library. The library is parallelized (OpenMP). It implements machine learning algorithms under the gradient boosting framework, including generalized linear models and gradient boosted regression trees (GBDT). (A minimal usage sketch follows this list.)

• Theory behind XGBoost: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

• Tutorial: https://www.kaggle.com/tqchen/otto-group-product-classification-challenge/understanding-xgboost-model-on-otto-data
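
To make these two libraries concrete, here is a minimal sketch (my own illustration, not from the slides) that fits a scikit-learn baseline and an XGBoost model on a synthetic dataset; it assumes reasonably recent scikit-learn and xgboost installs, and the dataset, parameters, and metric are placeholders.

```python
# Minimal sketch: scikit-learn baseline vs. XGBoost on synthetic binary data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import xgboost as xgb

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# scikit-learn baseline: logistic regression
lr = LogisticRegression().fit(X_tr, y_tr)
print("logreg logloss :", log_loss(y_te, lr.predict_proba(X_te)))

# XGBoost via its native interface
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dtest = xgb.DMatrix(X_te, label=y_te)
params = {"objective": "binary:logistic", "eval_metric": "logloss",
          "max_depth": 4, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=200,
                evals=[(dtest, "test")], verbose_eval=False)
print("xgboost logloss:", log_loss(y_te, bst.predict(dtest)))
```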

Page 6: Winning Kaggle 101: Dmitry Larko's Experiences

Kaggle’s Toolset (2 of 2)

• H2O (h2o.ai). Fast, scalable Machine Learning API. Has state-of-the-art models like Random Forest and Gradient Boosting Trees. Allows you to work with really big datasets on a Hadoop cluster. It also works on Spark! Check out Sparkling Water: http://h2o.ai/product/sparkling-water/

• Neural Nets/Deep Learning. Most Python libraries are built on top of Theano (http://deeplearning.net/software/theano/) – a minimal Keras sketch follows this list:

• Keras (https://github.com/fchollet/keras)

• Lasagne (https://github.com/Lasagne/Lasagne)

• Or TensorFlow:

• Skflow (https://github.com/tensorflow/skflow)

• Keras (https://github.com/fchollet/keras)
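
As a quick illustration of the neural-net option, here is a minimal Keras sketch (my own, not from the slides). It assumes a Keras 1.x/2.x-style Sequential API; the architecture, fake data, and hyperparameters are arbitrary, and older Keras versions spell the epochs argument `nb_epoch` instead of `epochs`.

```python
# Minimal sketch: a small fully-connected net for a 9-class problem
# (Otto-style data). Layer sizes and hyperparameters are illustrative only.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Fake data standing in for a real training set
X_train = np.random.rand(1000, 93).astype("float32")
y_train = np.eye(9)[np.random.randint(0, 9, size=1000)]  # one-hot labels

model = Sequential()
model.add(Dense(64, input_dim=93, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(9, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=128, verbose=0)
```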

Page 7: Winning Kaggle 101: Dmitry Larko's Experiences

Kaggle’s Toolset - Advanced

• Vowpal Wabbit (GitHub) – a fast, state-of-the-art online learner. A great tutorial on how to use VW for NLP is available here: https://github.com/hal3/vwnlp

• LibFM (http://libfm.org/) – Factorization machines (FM) are a generic approach that can mimic most factorization models through feature engineering. Works great for sparse, wide datasets; it has a few competitors:

• fastFM - http://ibayer.github.io/fastFM/index.html

• pyFM - https://github.com/coreylynch/pyFM

• Regularized Greedy Forest (RGF) – a tree ensemble learning method that can be better than XGBoost, but you need to know how to cook it.

• Four-leaf clover – +100 to Luck, gives outstanding model performance and a state-of-the-art "golden feature" search technique… just kidding.

Page 8: Winning Kaggle 101: Dmitry Larko's Experiences

Which past competitions to check?

Well, all of them, of course. But if I had to choose, I'd select these 4:

• Greek Media Monitoring Multilabel Classification (WISE 2014). Why?

• Interesting problem, a lot of classes.

• Otto Group Product Classification Challenge. Why?

• Biggest Kaggle competition so far.

• A great dataset to learn and polish your ensembling techniques.

• Caterpillar Tube Pricing. Why?

• A good example of a business problem to solve.

• Needs some work with the data before you can build good models.

• Has a "data leakage" you can find and exploit.

• Driver Telematics Analysis. Why?

• An interesting problem to solve.

• To get experience working with GPS data.

Page 9: Winning Kaggle 101: Dmitry Larko's Experiences

Common steps

• Know your data:

• Visualize;

• Train a linear regression and check the weights;

• Build a decision tree (I prefer to use R's ctree for that);

• Cluster the data and look at what clusters you get out;

• Or train a simple classifier and see what mistakes it makes.

• Know your metric.

• Different metrics can give you a clue how to approach the problem; for example, for logloss you need to add a calibration step for random forest (see the sketch after this list).

• Know your tools.

• Knowing the advantages and disadvantages of the ML algorithms you use can tell you what to do to solve a problem.
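
For the logloss-plus-random-forest point above, here is a minimal sketch (my own illustration, not from the slides) of adding a calibration step with scikit-learn's CalibratedClassifierCV; the data and parameter values are placeholders.

```python
# Minimal sketch: calibrating random forest probabilities for a logloss metric.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

raw_rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
cal_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic", cv=5).fit(X_tr, y_tr)

print("raw RF logloss       :", log_loss(y_te, raw_rf.predict_proba(X_te)))
print("calibrated RF logloss:", log_loss(y_te, cal_rf.predict_proba(X_te)))
```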

Page 10: Winning Kaggle 101: Dmitry Larko's Experiences

Common steps cont’d

• Getting more from data.

• Domain knowledge.

• Feature extraction and feature engineering pipelines. Some ideas:

• Get the most important features and build 2nd-level interactions.

• Add new features such as cluster ID or leaf ID (sketched below).

• Binarize numerical features to reduce noise.

• Ensembling.

• A great Kaggle Ensembling Guide by MLWave.
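
Here is a minimal sketch (my own illustration, not from the slides) of the cluster-ID / leaf-ID idea from the list above, using scikit-learn; the data and model sizes are arbitrary.

```python
# Minimal sketch: appending cluster IDs and tree leaf IDs as extra features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cluster ID: one new categorical column
cluster_id = KMeans(n_clusters=10, random_state=0).fit_predict(X)

# Leaf IDs: one column of leaf indices per tree in a small forest
rf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)
leaf_ids = rf.apply(X)  # shape (n_samples, n_estimators)

# Extended feature matrix; one-hot encode the ID columns before a linear model
X_extended = np.hstack([X, cluster_id.reshape(-1, 1), leaf_ids])
print(X_extended.shape)  # (2000, 20 + 1 + 10)
```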

Page 11: Winning Kaggle 101: Dmitry Larko's Experiences

Teaming strategy

• Usually we agree to team up at a pretty early phase but keep working independently to avoid any bias in our solutions.

• After teaming up:

• Code sharing can help you learn new things, but it is a waste of time from a competition perspective;

• Share data – especially some engineered features;

• Combine your models' outcomes using k-fold stacking and build a second-level meta-learner (stacking; sketched below);

• Continue to iteratively add new features and build new models.

• Teaming up right before the competition ends:

• "Black-box" ensembling – linear and not-so-linear blending using the LB as a validation set.

• Linear: f(x) = alpha*F1(x) + (1-alpha)*F2(x)

• Non-linear: f(x) = F1(x)^alpha * F2(x)^(1-alpha)
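
Here is a minimal sketch (my own illustration, not from the slides) of k-fold stacking and of the linear blend formula above, using scikit-learn; the base models, alpha, and the evaluation on out-of-fold predictions are all simplifications.

```python
# Minimal sketch: out-of-fold (k-fold) stacking with a logistic meta-learner,
# plus a simple linear blend f(x) = alpha*F1(x) + (1-alpha)*F2(x).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

base_models = [RandomForestClassifier(n_estimators=200, random_state=0),
               GradientBoostingClassifier(random_state=0)]

# Out-of-fold predictions from each base model become meta-features
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

meta = LogisticRegression().fit(oof, y)
print("stacked logloss:", log_loss(y, meta.predict_proba(oof)[:, 1]))

# Linear blend of the two base models' out-of-fold predictions
alpha = 0.6  # in a real competition, tuned on a validation set or the LB
blend = alpha * oof[:, 0] + (1 - alpha) * oof[:, 1]
print("blended logloss:", log_loss(y, blend))
```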

Page 12: Winning Kaggle 101: Dmitry Larko's Experiences

Q&A

Thank you!

