Data Mining 101

Date post: 16-Jul-2015
Upload: ali-septiandri

Data Mining 101

Okiriza Wibisono - @okiriza

Ali Akbar Septiandri - @aliakbars

Outline

Introduction

• Terminology

• Potential application

• Venn diagram

Process overview

• Business understanding

• Data understanding (exploration)

• Data preparation (preprocessing)

• Modeling

• Evaluation

• Deployment (presentation)

Tools & Resources

Introduction – Terminology

Data mining

Knowledge Discovery

in Databases

Big data analytics

Statistics

Data science

The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.

Data Mining - dictionary.reference.com

Introduction – Potential Application

Customer segmentation

Recommendation engine

Social media mining

What should we do?

Where to start? Do I have to get a master's degree in statistics?

http://tomfishburne.com.s3.amazonaws.com/site/wp-content/uploads/2014/01/140113.bigdata.jpg

Data Science Venn Diagram

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

And now the business process…

CRISP-DM Methodology

http://lyle.smu.edu/~mhd/8331f03/crisp.pdf

Business Understanding

CRISP-DM Methodology

Objective Statement

Bottom-up

Top-down

Objective Statement vs Data Problem

Situation Assessment

Inventory of Resources

Requirements, Assumptions, and Constraints

Risks and Contingencies

Terminology

Costs and Benefits

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment – Inventory of Resources

Resource

Data, Knowledge,

Tools

Hardware

Personnel

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment – Requirements, Assumptions, and Constraints

Requirements

Scheduling

Accuracy

Security

Assumptions

Data quality

External factors

Reporting type

Constraints

Legal issues

Budget

Resources

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment – Risks and Contingencies

Contingency Plan

Financial

Organizational

Business

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

Situation Assessment – Terminology

Write down related terminology

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg

Situation Assessment – Costs and Benefits

Money, money, money!

http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf

http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg

How to evaluate the results?

Define your success criteria!

Data Understanding

CRISP-DM Methodology

Data Collection

External vs Internal

Watch out!

visible ≠ accessible ≠ storable ≠ presentable

Victor Lavrenko – Text Technologies

http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf

Data Exploration – Visualization Heuristics

Visualize fast. Visualize reactively.

Go for high information 2D visualizations.

Select data subsets to visualize.

http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Data Exploration – Visualization Heuristics

Never let anomalies pass you by. Dig deeper.

Use your visualizations to inform potential models. Use your potential model to direct your visualizations.

Expect problems in your data.

http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

This is the cheapest and most informative stage of data mining.

Nigel Goddard – DME Visualization

http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf

Data Exploration – Visualization Tools

Column/bar: Large change

Line, curve: Small change, long periods

Histogram: Frequency distribution

https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
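
As a rough sketch of what "Histogram: Frequency distribution" means in practice, the snippet below counts values into equal-width bins and prints a text histogram using only the standard library. The function name `histogram_counts`, the choice of four bins, and the sample ages are hypothetical and exist only for illustration.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1 when all values are identical.
    width = (hi - lo) / n_bins or 1
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, not to a bin past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical sample: ages of ten customers.
ages = [21, 22, 25, 31, 33, 34, 35, 47, 52, 60]
counts = histogram_counts(ages, n_bins=4)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

A plotting library would normally draw this, but the counting step is the part that turns raw values into a frequency distribution.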

Data Preparation

CRISP-DM Methodology

Data Selection

Which one should I include (or exclude)?

Data Cleaning

Dirty Data

Missing value

Incomplete

Outdated

Duplication

Outlier

Remember: Expect problems in your data.
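
A minimal sketch of three of the cleaning steps listed above (duplication, missing values, outliers), using only the standard library. The records, the median imputation, and the "plausible age" range of 0 to 120 are all assumptions made for this example; real projects should pick rules that fit their own data.

```python
from statistics import median

# Hypothetical raw records: (customer_id, age); None marks a missing value.
raw = [(1, 25), (2, None), (3, 31), (3, 31), (4, 29), (5, 240), (6, 27)]


def clean(records, min_age=0, max_age=120):
    """Drop exact duplicates, impute missing ages, and separate outliers."""
    # Duplication: keep the first occurrence of each record, preserving order.
    deduped = list(dict.fromkeys(records))

    # Missing value: fill with the median of the observed ages.
    observed = [age for _, age in deduped if age is not None]
    fill = median(observed)
    filled = [(cid, fill if age is None else age) for cid, age in deduped]

    # Outlier: set aside ages outside the assumed plausible range.
    kept = [r for r in filled if min_age <= r[1] <= max_age]
    outliers = [r for r in filled if not (min_age <= r[1] <= max_age)]
    return kept, outliers


kept, outliers = clean(raw)
print("kept:", kept)
print("outliers:", outliers)
```

The median is used for imputation here because it is not pulled upward by the extreme value of 240, unlike the mean.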

Data Construction

Feature engineering – derived attributes, e.g.:

year from timestamp

quarter from timestamp

BMI from weight and height

Log(x) for skewed data (e.g. house price)
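
The derived attributes above could be computed roughly as follows. This is a sketch with a hypothetical function name, hypothetical argument names, and made-up input values; it assumes weight in kilograms, height in meters, and a strictly positive price.

```python
import math
from datetime import datetime


def derive_features(timestamp, weight_kg, height_m, house_price):
    """Build the derived attributes named on the slide from raw fields."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "year": ts.year,
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        "quarter": (ts.month - 1) // 3 + 1,
        # Body mass index: weight divided by height squared.
        "bmi": weight_kg / height_m ** 2,
        # Natural log compresses the long right tail of skewed data.
        "log_price": math.log(house_price),
    }


features = derive_features("2015-02-14T19:00:00", 70.0, 1.75, 250_000)
print(features)
```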

Data Splitting

Two kinds of data splitting:

Training-Validation-Testing

Cross Validation

Data Splitting – Training-Validation-Testing

Training

• Construct classifier

Validation

• Pick algorithm

• Knob settings (tree depth, k in kNN, C in SVM)

Testing

• Estimate future error rate

Split randomly to avoid bias

http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
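
A random three-way split along these lines might look like the sketch below. The function name, the 60/20/20 proportions, and the fixed seed are assumptions for illustration, not something the slides prescribe.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    # Shuffling first avoids bias from any ordering in the original data.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test set should be touched only once, after the algorithm and its knob settings have been chosen on the validation set.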

Data Splitting – Cross Validation

Every point is both training and testing, never at the same time
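
One way to picture this is to generate the index sets for k folds by hand. The sketch below uses a hypothetical helper `k_fold_indices` and does not shuffle; in practice the data would usually be shuffled first, or a library routine would be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each index lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Across the k rounds every point is tested once and trained on k - 1 times, yet within a single round no point appears on both sides.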

Dimensionality Reduction

Principal Component Analysis vs Linear Discriminant Analysis

Modeling

CRISP-DM Methodology

Machine Learning

Classification Regression Ranking Clustering

Model Selection

Regression Technique

Generalization bound

Linear regression

Kernel ridge regression

Support vector regression

Lasso

Which one should I choose?

Should I use all of them?

It depends on…

Model Selection

Assumptions

• The predictors are linearly independent

• The error is a random variable with a mean of zero conditional on the explanatory variables

• The sample is representative of the population for the inference prediction
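
The first assumption can be checked in a simplified setting. The sketch below looks at just two predictors and ignores the intercept column: it computes the determinant of their 2x2 Gram matrix, which is zero exactly when one predictor is a multiple of the other. The function name and the sample columns are hypothetical.

```python
def gram_determinant(x1, x2):
    """Determinant of the Gram matrix [[x1.x1, x1.x2], [x1.x2, x2.x2]]."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    return s11 * s22 - s12 * s12


# Hypothetical columns: the same height in two units, plus a weight column.
height_cm = [150, 160, 170, 180]
height_mm = [1500, 1600, 1700, 1800]  # exactly 10 * height_cm
weight_kg = [50, 65, 60, 80]

print(gram_determinant(height_cm, height_mm))  # redundant pair
print(gram_determinant(height_cm, weight_kg))  # independent pair
```

A zero determinant means the least-squares coefficients for that pair are not uniquely determined, so one of the two columns should be dropped.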

Interpretability

The understandability of why the model is true or how the model is induced

from https://chenhaot.com/pubs/mldg-interpretability.pdf

Beware of Overfitting!

http://pingax.com/wp-content/uploads/2014/05/underfitting-overfitting.png

Model Assessment

Regression

• (R)MSE

• Mean Absolute Error

• Correlation Coefficient

Classification

• Accuracy

• Precision

• Recall

• F-score

Descriptive

• Std. Error

• p-value

• Confidence Interval
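
For reference, the regression and classification measures listed above can be computed by hand as in the sketch below. The function names and the toy labels and predictions are hypothetical; the correlation coefficient and the descriptive statistics are left out to keep the example short.

```python
import math


def regression_metrics(y_true, y_pred):
    """Mean squared error, its square root, and mean absolute error."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


reg = regression_metrics([3, 5, 7, 9], [2, 5, 8, 11])
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(reg)
print(clf)
```

Accuracy alone can mislead when classes are imbalanced, which is why precision, recall, and F-score appear next to it on the slide.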

Evaluation

CRISP-DM Methodology

Does my model solve the problem?

What is the impact? Is it novel? How useful is the solution?

Deployment

CRISP-DM Methodology

The Tasks

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project

Tools & Resources

Text mining: NLTK, spaCy, OpenNLP

Query expansion & clustering: Carrot2, Weka

Data mining & machine learning: Weka, scikit-learn

Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala

Python lib: Pandas, SciPy, NumPy, scikit-learn

Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark

Visualization: D3.js

Community: Big Data & Open Data Indonesia

Thank you!

Data Mining 101 – Python-ID Meetup February 2015

Okiriza Wibisono - @okiriza

Ali Akbar Septiandri - @aliakbars

