Date post: 05-Feb-2018

Data Science in Action

Peerapon Vateekul, Ph.D.

Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University

Chula Data Science

Outline

Data Science & Data Scientist

Data Mining

Analytics with R

A Framework for Big Data Analytics


Data Science & Data Scientist

What is Data Science?

Data: Facts and statistics collected together for reference or analysis.

Science: A systematic study through observation and experiment.

Data Science: The scientific exploration of data to extract meaning or insight, and the construction of software to utilize such insight in a business context.

Data Preparation

Data Analysis

Data Visualization

Data Product

What is Data Science? (cont.)

Transform data into valuable insights

Transform data into data products

Transform data into interesting stories

What is Data Science? (cont.)

Transform data into valuable insights


Social Influence in Social Advertising: Evidence from Field Experiments (Bakshy et al. 2012)

What is Data Science? (cont.)

Transform data into data products


Service Recommendation

What is Data Science? (cont.)

Transform data into data products


Fraud Detection

What is Data Science? (cont.)

Transform data into data products


Email Classification

Spam Detection

What is Data Science? (cont.)

Transform data into interesting stories

What is Data Science? (cont.)

Transform data into interesting stories

Data Science: Famous Definition

Data Science: Components


Data Science

Statistics

Domain Expertise

Visualization

Data Engineering

Advanced Computing

Data Science Process: Iterative Activity

Data Science Tasks

Data Science with Big Data

Very large raw data sets are now available:

Log files

Sensor data

Sentiment information

With more raw data, we can build better models with improved predictive performance.

To handle these larger datasets, we need a scalable processing platform such as Hadoop with YARN.

Who builds these systems?

Data Scientist: The Sexiest Job of the 21st Century

By Thomas H. Davenport and D.J. Patil

Harvard Business Review, from the October 2012 issue


It is estimated that by 2018, the US could face a shortage of 140,000+ people with advanced analytical skills!

Definition

Computer Scientist

Data collection systems

Machine learning algorithms

Interface design

Design/manage/query database

Data aggregation

Mathematician

Data mining

Statistical models

Evaluation metrics

Predictive analytics

Data visualization

Business Person

Domain expertise

Knowing what questions to ask

Interpreting results for business decisions

Presenting outcomes

Needed Skills

Applied Science

Statistics, applied math

Machine Learning, Data Mining

Tools: Python, R, SAS, SPSS

Data engineering

Database technologies

Computer science

Tools: Java, Scala, Python, C++

Business Analysis

Data Analysis, BI

Business/domain expertise

Tools: SQL, Excel, EDW

Big data engineering

Big data technologies

Statistics and machine learning over large datasets

Tools: Hadoop, Pig, Hive, Cascading, Solr, etc.

The Data Science Team


Data Mining

What is Data Mining (DM)?

An automatic process of discovering useful information in large data repositories using sophisticated algorithms.

[Figure: Data Mining at the overlap of Machine Learning, Statistics, and Database systems]

Data Mining Tasks

Predictive Task (Supervised Learning)

Classification

Regression

Descriptive Task (Unsupervised Learning)

Clustering

Association Rules Mining

Sequence Analysis

Other:

Collaborative filtering (recommendation engines): uses techniques from both the supervised and unsupervised worlds.

Supervised Learning: learning from target

Training dataset (features | target):

57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 | 0

78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 | 1

69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 | 0

18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 | 1

54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 | 1

84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 | 0

89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0 | 1

49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 | 0

40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 | 0

74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 | 1

77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1 | 0

Test dataset (target unknown):

71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 | ?

Classification: predicting a category

Some techniques:

Naïve Bayes

Decision Tree

Logistic Regression

Support Vector Machines

Neural Network

Ensembles

Example: predict which targeted customers tend to buy our product (yes/no).

[Figure: customers plotted by Age and Salary]
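As a rough illustration of the yes/no prediction above, the sketch below fits a one-level decision tree (a "decision stump") to customers described by age and salary. The six customer records, the names `CUSTOMERS`, `fit_stump` and `predict_one`, and the choice of a stump are all assumptions made for this example; they are not taken from the slides.

```python
# Toy data, invented for illustration: (age, salary, bought) per customer.
CUSTOMERS = [
    (25, 20000, 0),
    (30, 25000, 0),
    (45, 30000, 0),
    (35, 60000, 1),
    (50, 80000, 1),
    (28, 70000, 1),
]
FEATURE_NAMES = ("age", "salary")


def predict_one(stump, features):
    """Apply a stump (feature index, threshold, label above threshold)."""
    feature, threshold, label_above = stump
    return label_above if features[feature] > threshold else 1 - label_above


def fit_stump(rows):
    """Return the single split with the fewest training errors."""
    best = None  # (errors, feature index, threshold, label above threshold)
    for feature in range(len(FEATURE_NAMES)):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between neighbouring values.
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for label_above in (0, 1):
                candidate = (feature, threshold, label_above)
                errors = sum(
                    predict_one(candidate, row[:2]) != row[2] for row in rows
                )
                if best is None or errors < best[0]:
                    best = (errors,) + candidate
    return best[1:], best[0]


stump, training_errors = fit_stump(CUSTOMERS)
print(FEATURE_NAMES[stump[0]], stump[1], training_errors)
print(predict_one(stump, (40, 75000)))  # a new, unseen customer
```

On this toy data the stump splits on salary, because no age threshold separates buyers from non-buyers. A full decision tree would keep splitting each side in the same way.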

Regression: predict a continuous value

Some techniques:

Linear Regression / GLM

Decision Trees

Support vector regression

Neural Network

Ensembles

Example: predict the sale price of each house
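To make the house-price idea concrete, here is a minimal sketch of simple linear regression fitted with the closed-form least-squares formulas. The areas and prices are made-up numbers chosen to lie exactly on a line, and `fit_line` is a hypothetical helper name; a real model would use many more features and data points.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented example data: floor area (square metres) and sale price (thousands).
AREAS = [50, 80, 100, 120]
PRICES = [150, 210, 250, 290]

slope, intercept = fit_line(AREAS, PRICES)
predicted_price = intercept + slope * 90  # price estimate for a 90 sqm house
print(slope, intercept, predicted_price)
```

Unlike classification, the output here is a number on a continuous scale rather than a category.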

Predictive Modeling Applications

Database marketing

Financial risk management

Fraud detection

Pattern detection

Unsupervised Learning: detect natural patterns

Training dataset (features | target):

57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 | 0

78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 | 1

69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 | 0

18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 | 1

54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 | 1

84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 | 0

89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0 | 1

49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 | 0

40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 | 0

74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 | 1

77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1 | 0

Test dataset (target unknown):

71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 | ?

Clustering: detect similar instance groupings

Some techniques:

k-means

Spectral clustering

DBSCAN

Hierarchical clustering

Example: Customer Segmentation
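The k-means technique listed above can be sketched in a few lines. This is a simplified version under stated assumptions: the six two-dimensional "customer" points are invented, the initial centroids are simply the first k points (real implementations usually pick them at random or with k-means++), and the function names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iterations=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(p) for p in points[:k]]  # simplification: first k points
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Invented data: (visits per month, average basket size) for six customers.
CUSTOMER_POINTS = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]

labels, centroids = kmeans(CUSTOMER_POINTS, k=2)
print(labels)
print(centroids)
```

No target column is used anywhere: the two segments emerge only from how close the points are to each other.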

Association Rule Discovery

TID  Items

1    Bread, Coke, Milk

2    Beer, Bread

3    Beer, Coke, Diaper, Milk

4    Beer, Bread, Diaper, Milk

5    Coke, Diaper, Milk

Rules Discovered:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}

Store layout design/promotion
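The two discovered rules can be checked against the five transactions by computing their support and confidence, the standard measures used to rank association rules. The transactions below are copied from the table; the function names are hypothetical, and the slide itself does not state which thresholds were used to select the rules.

```python
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for transaction in TRANSACTIONS if itemset <= transaction)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Among transactions with the antecedent, fraction with the consequent."""
    return count(antecedent | consequent) / count(antecedent)


RULES = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]

for antecedent, consequent in RULES:
    print(
        sorted(antecedent), "-->", sorted(consequent),
        "support:", support(antecedent, consequent),
        "confidence:", confidence(antecedent, consequent),
    )
```

For {Milk} --> {Coke}, three of five transactions contain both items (support 0.6) and three of the four Milk transactions also contain Coke (confidence 0.75). For {Diaper, Milk} --> {Beer}, the support is 2/5 and the confidence is 2/3.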

Product recommendation: predicting “preference”


Analytics with R

What is R?

R is a free software environment for statistical computing and graphics.

R can be easily extended with 5,800+ packages available on CRAN (as of 13 Sept 2014).

Many other packages are provided on Bioconductor, R-Forge, GitHub, etc.

R manuals on CRAN

Why R?

R is widely used in both academia and industry.

R was ranked no. 1 in the KDnuggets 2014 poll on Top Languages for analytics, data mining, data science (and was also no. 1 in 2011, 2012 & 2013).

The CRAN Task Views provide collections of packages for different tasks.

Classification with R

Decision trees: rpart, party

Random forest: randomForest, party

SVM: e1071, kernlab

Neural networks: nnet, neuralnet, RSNNS

Performance evaluation: ROCR

Clustering with R

k-means: kmeans(), kmeansruns()

k-medoids: pam(), pamk()

Hierarchical clustering: hclust(), agnes(), diana()

DBSCAN: fpc

BIRCH: birch

Association Rule Mining with R

Association rules: apriori(), eclat() in package arules

Sequential patterns: arulesSequences

Visualization of associations: arulesViz

Text Mining with R

Text mining: tm

Topic modelling: topicmodels, lda

Word cloud: wordcloud

Twitter data access: twitteR

Time Series Analysis with R

Time series decomposition: decomp(), decompose(), arima(), stl()

Time series forecasting: forecast

Time Series Clustering: TSclust

Dynamic Time Warping (DTW): dtw

Social Network Analysis with R

Packages: igraph, sna

Centrality measures: degree(), betweenness(), closeness(), transitivity()

Clusters: clusters(), no.clusters()

Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()

Community detection: fastgreedy.community(), spinglass.community()

R and Big Data

Hadoop

Hadoop (with YARN) - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

R Packages: RHadoop, RHIPE

Spark

Spark - a fast and general engine for large-scale data processing, reported to run some in-memory workloads up to 100 times faster than Hadoop MapReduce

SparkR - R frontend for Spark

H2O

H2O - an open source in-memory prediction engine for big data science

R Package: h2o

MongoDB

MongoDB - an open-source document database

R packages: rmongodb, RMongo

A Framework for Big Data Analytics

Big Data Analytics: Components


Tool: Hadoop R Tools: RHadoop, H2O Tool: R

RHadoop

H2O

1. Regression

2. Classification

3. Clustering

4. Others: Recommendation, Time Series

Big data & Analytic Architecture

[Figure: architecture diagram on Cloudera with the following components]

Data Storage: HDFS (Hadoop Distributed File System)

Data Processing (Batch Processing): YARN (Map Reduce V.2), Distributed Processing Framework

Resource Management: YARN Resource Manager

Client Access: Hive (SQL Query), RHadoop, H2O

Co-ordination, Management: ZooKeeper

YARN enables multiple processing applications


Program List

Management: Cloudera

Hadoop Ecosystem: HDFS, YARN, Hive, ZooKeeper

Language: Java, R

Analytic: RHadoop, RStudio Server, H2O

Use Case: Predict Airline Delays

Every year approximately 20% of airline flights are delayed or cancelled, resulting in significant costs to both travelers and airlines.

Datasets:

Airline delay data (1987-2008)

http://stat-computing.org/dataexpo/2009/

12 GB!

Goal:

Predict delay (delayTime >= 15 mins) in flights
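The goal above turns a delay in minutes into a yes/no target, which can be sketched as a small labeling step. Several things here are assumptions: the slide does not say which delay field it calls delayTime, so the sketch uses the dataset's `ArrDelay` column; the sample rows are invented; and dropping rows with a missing delay (for example cancelled flights, recorded as `NA`) is one possible choice rather than the author's stated method.

```python
import csv
import io

DELAY_THRESHOLD_MINUTES = 15

# Invented sample rows using a subset of the dataset's column names.
SAMPLE_CSV = """Year,Month,DayofMonth,UniqueCarrier,ArrDelay
2008,1,3,WN,-14
2008,1,3,WN,15
2008,1,3,WN,34
2008,1,4,WN,NA
"""


def label_flights(csv_text, delay_column="ArrDelay"):
    """Return (row, is_delayed) pairs, skipping rows without a delay value."""
    labeled = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row[delay_column]
        if raw in ("", "NA"):  # assumption: missing delays are dropped
            continue
        labeled.append((row, float(raw) >= DELAY_THRESHOLD_MINUTES))
    return labeled


delay_labels = [is_delayed for _, is_delayed in label_flights(SAMPLE_CSV)]
print(delay_labels)
```

With the target defined this way, the use case becomes a binary classification problem of the kind described in the Data Mining section.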


Thank you & Any questions?
