Fast data mining flow prototyping using IPython Notebook
2013/01/31
Jimmy Lai
r97922028 [at] ntu.edu.tw
Outline
1. Workflow for data mining
2. What IPython Notebook provides
3. Exemplified by text classification
4. Demo code and Notebook usage
IPython Notebook 2
Workflow for data mining
• Traditional programming workflow:
– Edit -> Compile -> Run
• Data Mining workflow:
– Execute -> Explore
– Consists of many data processing stages; each stage may be tried with several different methods.
– Stages: data parsing, feature extraction, feature selection, model training, prediction, post-processing, etc.
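The execute-and-explore loop can be sketched in a few lines of plain Python. This is only an illustration, not the demo code from the talk: the stage functions (parse, words_raw, words_lower, run_flow) and the two sample lines are hypothetical names and data made up for this sketch. It shows one stage (tokenization, standing in for feature extraction) being swapped between two methods while the rest of the flow stays fixed.

```python
def parse(raw):
    """Data parsing stage: split a 'label<TAB>text' line into (label, text)."""
    label, text = raw.split("\t", 1)
    return label, text


def words_raw(text):
    """Feature extraction, method A: whitespace tokens, case preserved."""
    return text.split()


def words_lower(text):
    """Feature extraction, method B: whitespace tokens, lowercased."""
    return text.lower().split()


def run_flow(lines, tokenizer):
    """Run parsing, then feature extraction with the chosen tokenizer."""
    results = []
    for line in lines:
        label, text = parse(line)
        results.append((label, tokenizer(text)))
    return results


# Made-up sample data in a hypothetical 'label<TAB>text' format.
SAMPLE = ["sci.space\tThe Moon the moon", "rec.autos\tFast cars"]

# Try both methods for the same stage and compare the vocabulary size.
for name, tokenizer in [("raw", words_raw), ("lower", words_lower)]:
    flow = run_flow(SAMPLE, tokenizer)
    vocab = {tok for _, toks in flow for tok in toks}
    print(name, len(vocab))
```

In a notebook, each stage would typically live in its own cell, so only the changed stage and the cells after it need to be re-executed.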
What IPython Notebook provides
• Interactive Web IDE
– Display rich data such as matplotlib plots and LaTeX math symbols
– Code cell for sketching
– Execute pieces of code in arbitrary order
– Browser interface for programming remotely
– Easy to present code and execution results as HTML or PDF
• IPython Notebook makes sketching data analysis easy.
Demo code and Notebook usage
• Demo Code: ipython_demo directory in https://bitbucket.org/noahsark/slideshare
• IPython Notebook:
– Install
$ pip install ipython
– Execution (under ipython_demo dir)
$ ipython notebook --pylab=inline
– Open notebook with browser, e.g. http://127.0.0.1:8888
IPython Notebook Interface
Exemplified by text classification
• Text classification on the 20 newsgroups dataset.
• Dataset:
– Built into sklearn.datasets
– Each article belongs to one of the 20 groups
• Goal: classify each article into one of the newsgroup names.
• Experiment: feature generation using different n-gram parameters.
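To make "different n-gram parameters" concrete, here is a minimal standard-library sketch of bag-of-n-grams feature generation. It is not the code used in the experiment; word_ngrams, ngram_features, and the sample sentence are hypothetical names and data chosen for illustration. The (min_n, max_n) pair plays a role similar in spirit to the ngram_range option of scikit-learn's text vectorizers.

```python
from collections import Counter


def word_ngrams(text, min_n, max_n):
    """Return all word n-grams of text for every n in [min_n, max_n]."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def ngram_features(text, min_n=1, max_n=1):
    """Count n-gram occurrences, giving a bag-of-n-grams feature mapping."""
    return Counter(word_ngrams(text, min_n, max_n))


# Made-up sample sentence, not taken from the dataset.
sample = "peace talks resume as peace talks stall"

# Compare the feature sets produced by different n-gram ranges.
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    feats = ngram_features(sample, *ngram_range)
    print(ngram_range, len(feats), feats.most_common(2))
```

Widening the range adds longer phrases as features, which increases the feature count; whether that helps classification is what the experiment is meant to explore.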
Example article
talk.politics.mideast
Sample result of feature extraction
Table of experiment setups
Experiment Result
Observation from plots