Date posted: 16-Jul-2015
Category: Data & Analytics
Uploaded by: ali-septiandri
Outline
Introduction
•Terminology
•Potential application
•Venn diagram
Process overview
•Business understanding
•Data understanding (exploration)
•Data preparation (preprocessing)
•Modeling
•Evaluation
•Deployment (presentation)
Tools & Resource
Introduction – Terminology
Data mining
Knowledge Discovery in Databases
Big data analytics
Statistics
Data science
“The process of collecting, searching through, and analyzing a large amount of data in a database, as to discover patterns or relationships.”
Data Mining, dictionary.reference.com
Introduction – Potential Application
Customer segmentation
Recommendation engine
Social media mining
Situation Assessment
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Terminology
Costs and Benefits
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Inventory of Resources
Resource
• Data, Knowledge, Tools
• Hardware
• Personnel
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Requirements, Assumptions, and Constraints
Requirements
• Scheduling
• Accuracy
• Security
Assumptions
• Data quality
• External factors
• Reporting type
Constraints
• Legal issues
• Budget
• Resources
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Risks and Contingencies
Contingency Plan
Financial
Organizational
Business
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
Situation Assessment – Terminology
Write down related terminology
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.partnersmn.com/wp-content/uploads/2010/08/5b8567b2b4e2d1cfd1a31b2b8a0ecebc1.jpg
Situation Assessment – Costs and Benefits
Money, money, money!
http://www.ismll.uni-hildesheim.de/lehre/ba-12ss/script/ba03.pdf
http://www.centuryproductsllc.com/wp-content/uploads/holding-money.jpg
“visible ≠ accessible ≠ storable ≠ presentable”
Victor Lavrenko – Text Technologies
http://www.inf.ed.ac.uk/teaching/courses/tts/pdf/crawl-2x2.pdf
Data Exploration – Visualization Heuristics
Visualize fast. Visualize reactively.
Go for high information 2D visualizations.
Select data subsets to visualize.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration – Visualization Heuristics
Never let anomalies pass you by. Dig deeper.
Use your visualizations to inform potential models. Use your potential model to direct your visualizations.
Expect problems in your data.
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
“This is the cheapest and most informative stage of data mining.”
Nigel Goddard – DME Visualization
http://www.inf.ed.ac.uk/teaching/courses/dme/slides2014/visualization-print4up.pdf
Data Exploration – Visualization Tools
Column/bar chart: comparisons between groups, or large changes over time
Line chart: small changes, over short or long periods
Histogram: frequency distribution of a single variable
https://nces.ed.gov/nceskids/help/user_guide/graph/whentouse.asp
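To make "frequency distribution" concrete, here is a minimal sketch of what a histogram computes, using only the Python standard library. The `histogram` helper, the bin width, and the sample ages are all hypothetical, chosen for illustration; a plotting library would normally do this binning for you.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`."""
    counts = Counter()
    for v in values:
        # Left edge of the bin containing v, e.g. 23 -> 20 when bin_width is 10.
        left_edge = (v // bin_width) * bin_width
        counts[left_edge] += 1
    # Sort by bin edge so the bins read from low to high.
    return dict(sorted(counts.items()))


# Hypothetical sample: ages of survey respondents (made-up numbers).
ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 47, 58]
bin_width = 10
age_bins = histogram(ages, bin_width)

# Quick text rendering: one '#' per observation in the bin.
for left_edge, count in age_bins.items():
    print(f"{left_edge:>3}-{left_edge + bin_width - 1:<3} {'#' * count}")
```

A text rendering like this is crude, but it fits the "visualize fast" heuristic: it is often enough to spot a skewed or multi-modal variable before building a proper chart.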
Data Cleaning
Dirty Data
• Missing value
• Incomplete
• Outdated
• Duplication
• Outlier
Remember: Expect problems in your data.
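Three of the dirty-data problems above (duplication, missing values, outliers) can be sketched in a few lines of standard-library Python. The records are made up, and the 1.5 × IQR rule is just one common outlier convention assumed here, not something the slides prescribe.

```python
import statistics

# Hypothetical raw records: (customer_id, age). None marks a missing value.
raw = [
    (1, 34), (2, 29), (2, 29), (3, None), (4, 41),
    (5, 38), (6, 240), (7, 31), (8, 36),
]

# 1. Duplication: keep only the first occurrence of each record.
deduplicated = list(dict.fromkeys(raw))

# 2. Missing value: drop records with no age (imputing a value is another option).
complete = [(cid, age) for cid, age in deduplicated if age is not None]

# 3. Outlier: flag ages more than 1.5 * IQR outside the quartiles (assumed rule).
ages = [age for _, age in complete]
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [(cid, age) for cid, age in complete if not low <= age <= high]
clean = [(cid, age) for cid, age in complete if low <= age <= high]

print("outliers:", outliers)
print("clean records:", len(clean))
```

Whether a flagged value is an error or a genuine extreme is a judgment call, so it is usually safer to inspect the `outliers` list than to delete it blindly.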
Data Construction
Feature engineering – derived attributes,
e.g.:
year from timestamp
quarter from timestamp
BMI from weight and height
Log(x) for skewed data (e.g. house price)
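The derived attributes listed above can be sketched as one small function. The field names (`timestamp`, `weight_kg`, `height_cm`, `house_price`) and the sample record are hypothetical; BMI is weight in kilograms divided by the square of height in metres.

```python
import math
from datetime import datetime


def derive_features(record):
    """Build derived attributes from one raw record (a plain dict)."""
    ts = datetime.fromisoformat(record["timestamp"])
    height_m = record["height_cm"] / 100
    return {
        "year": ts.year,
        # Months 1-3 -> quarter 1, 4-6 -> quarter 2, and so on.
        "quarter": (ts.month - 1) // 3 + 1,
        "bmi": record["weight_kg"] / height_m ** 2,
        # Natural log compresses the long right tail of skewed values.
        "log_price": math.log(record["house_price"]),
    }


# Hypothetical record, made up for illustration.
record = {
    "timestamp": "2015-07-16T09:30:00",
    "weight_kg": 70.0,
    "height_cm": 175.0,
    "house_price": 250000.0,
}
features = derive_features(record)
print(features)
```

The log transform only works for strictly positive values; data containing zeros would need a different transform, such as `math.log1p`.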
Data Splitting – Training-Validation-Testing
Training
• Construct classifier
Validation
• Pick algorithm
• Knob settings (tree depth, k in kNN, C in SVM)
Testing
• Estimate future error rate
Split randomly to avoid bias
http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
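A random three-way split can be sketched with the standard library alone. The function name, the 60/20/20 proportions, and the fixed seed are assumptions made for this example; the slides only ask that the split be random.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle the rows, then cut them into three disjoint subsets."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test


# Hypothetical dataset: 100 row identifiers.
rows = list(range(100))
training, validation, test = train_validation_test_split(rows)
print(len(training), len(validation), len(test))
```

Shuffling before cutting is what guards against ordering bias, for example when the file is sorted by date or by class label. The test subset should be used once, at the very end, so that its error estimate stays honest.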
Model Selection
Regression Technique
Generalization bound
Linear regression
Kernel ridge regression
Support vector regression
Lasso
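As a baseline for the techniques listed above, here is a minimal sketch of linear regression with a single predictor, fitted by the closed-form least-squares solution. The function name and the data points are hypothetical; the other techniques in the list add kernels or penalties on top of this idea and are not shown.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data, roughly following y = 2x + 1 with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

With several predictors the same fit is normally done with a library such as scikit-learn rather than by hand.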
Model Selection
Assumptions
• The predictors are linearly independent
• The error is a random variable with a mean of zero conditional on the explanatory variables
• The sample is representative of the population for the inference prediction
Interpretability
• The understandability of why the model is true or how the model is induced
from https://chenhaot.com/pubs/mldg-interpretability.pdf
Model Assessment
Regression
• (R)MSE
• Mean Absolute Error
• Correlation Coefficient
Classification
• Accuracy
• Precision
• Recall
• F-score
Descriptive
• Std. Error
• p-value
• Confidence Interval
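The regression and classification measures above can be computed directly from their definitions. This is a minimal sketch with hypothetical helper names and made-up predictions; the F-score shown is F1, the harmonic mean of precision and recall, and the descriptive statistics are left out.

```python
import math


def regression_metrics(actual, predicted):
    """MSE, RMSE and mean absolute error for paired numeric values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


def classification_metrics(actual, predicted, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical predictions, made up for illustration.
reg = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 8.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                             [1, 0, 0, 1, 0, 1, 1, 0])
print(reg)
print(clf)
```

Accuracy alone can mislead when classes are imbalanced, which is why precision and recall are reported alongside it.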
Tools & Resource
Text mining: NLTK, spaCy, OpenNLP
Query expansion & clustering: Carrot2, Weka
Data mining & machine learning: Weka, scikit-learn
Language: R, Python, Julia, Java, Matlab, Mathematica, Haskell, Scala
Python lib: Pandas, SciPy, NumPy, scikit-learn
Infrastructure: AWS, Hadoop, Google Cloud, Azure, Apache Spark
Visualization: D3.js
Community: Big Data & Open Data Indonesia