Database Data Mining: Practical R Enterprise and Oracle Advanced

Husnu Sensoy, Global Maksimum Data & Information Technologies. October 2, 2012.
Page 1: Database Data Mining: Practical R Enterprise and Oracle Advanced

Introduction Oracle Enterprise R in Practice Wrap up

Database Data Mining: Practical R Enterprise and Oracle Advanced Analytics

Husnu Sensoy
[email protected]

Global Maksimum Data & Information Technologies

October 2, 2012

Page 2: Database Data Mining: Practical R Enterprise and Oracle Advanced

Content

1 Introduction

2 Oracle Enterprise R in Practice
    Data Visualization
    A Bit of Probability and Information Theory
    Optimization
    Text Analysis & Decision Trees

3 Wrap up

Page 3: Database Data Mining: Practical R Enterprise and Oracle Advanced

Who am I?

- Founder at Global Maksimum Data & Information Technologies
- in BI Domain
- Oracle Magazine DBA of the Year in 2009

Page 4: Database Data Mining: Practical R Enterprise and Oracle Advanced

Global Maksimum Data & Information Technologies

- A group of people who know what they are doing, focused mainly on data and on transforming data into information.
- Complex Event Processing
    - 1.2 million events per second on 2x2 socket Nehalem blades
- Data Mining
    - Churn prediction models for telcos
    - Marketing target selection models
- Large-Scale Database Management System Projects
    - 120+ TB Exadata migration from UNIX systems
    - Exadata Master Class across the EMEA region for Exadata customers, Oracle partners, and Oracle staff in the region

Page 9: Database Data Mining: Practical R Enterprise and Oracle Advanced

Advanced Analytics

- In the first generation of BI we simply filter rows, project columns, aggregate them with a few functions, and deliver only what the customer asks for.
- Once the focus shifts to machine-generated data, or Big Data, handling the data the way we did before becomes less and less fruitful.
- That is mainly because only a tiny amount of information is available in this pile of data.
- So it requires better tricks, automation, and post-analysis capabilities.

Page 13: Database Data Mining: Practical R Enterprise and Oracle Advanced

In-database Advanced Analytics

- 80% of enterprise data mining activity is feature engineering.
- Feature engineering requires an iterative process of
    - Filtering data (WHERE)
    - Aggregating data (GROUP BY)
    - Transforming data (CASE, DECODE, COALESCE, etc.)
- It is almost impossible to maintain an integrated mining environment (scripts, files, metafiles, etc.) outside the database.
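The filter / aggregate / transform loop can be sketched in plain R as well. This is only an in-memory analogue of the SQL clauses named on the slide; the `calls` data frame and all of its column names are made up for illustration.

```r
# Hypothetical call records; every name here is an assumption, not from the talk.
calls <- data.frame(
  customer = c("a", "a", "b", "b", "b", "c"),
  minutes  = c(12, 3, 45, NA, 7, 0),
  type     = c("voice", "voice", "data", "voice", "data", "voice"),
  stringsAsFactors = FALSE
)

# Transform (COALESCE): treat a missing duration as zero.
calls$minutes <- ifelse(is.na(calls$minutes), 0, calls$minutes)

# Filter (WHERE type = 'voice')
voice <- subset(calls, type == "voice")

# Aggregate (GROUP BY customer)
features <- aggregate(minutes ~ customer, data = voice, FUN = sum)

# Transform (CASE WHEN minutes >= 10 THEN 'heavy' ELSE 'light' END)
features$segment <- ifelse(features$minutes >= 10, "heavy", "light")

print(features)
```

Each pass through the loop usually adds or changes one of these steps, which is why keeping the whole chain next to the data matters.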

Page 17: Database Data Mining: Practical R Enterprise and Oracle Advanced

Oracle Advanced Analytics Toolkit

- SQL-2003 & Extensions
- Oracle Data Mining
- Oracle Spatial Extensions
- Flow-based mining with SQL Developer
- Oracle R Enterprise

Page 18: Database Data Mining: Practical R Enterprise and Oracle Advanced

- R is a free software environment for statistical computing and graphics.
- Most newcomers (young data scientists) who have recently graduated, or are about to graduate, from top universities use R.
- Batteries are included.
- Runs on all modern platforms.

Page 19: Database Data Mining: Practical R Enterprise and Oracle Advanced

Oracle R Enterprise

- The data you can process with standard R is limited by the amount of memory available on the server running R.
- To work around this problem, people implement their own solutions to keep data outside memory, or fall back on data sampling techniques.
- ORE is an extension to standard R that adds Oracle steroids to it.
- The basic idea is to off-load R commands seamlessly to Oracle Database or Oracle Big Data Appliance.

Page 23: Database Data Mining: Practical R Enterprise and Oracle Advanced

This session

This session is not an R tutorial but rather a flyover of some possible solutions to real-life scenarios using R. If you need an R tutorial, please refer to

- Rob Kabacoff. R in Action. Manning, 2010
- Oracle R Enterprise Training 2 - Introduction to R
- R Studio

Page 24: Database Data Mining: Practical R Enterprise and Oracle Advanced

Data Visualization

- Advanced data analysis usually starts and ends with data visualization.
- Before modeling anything, data scientists use graphs and charts to figure out how the data behaves.
- After modeling, they turn to charts again to report the results.
- R supports dozens of charting and graphing packages. To mention just two of them:
    - lattice is used to generate conditioned graphs (a.k.a. trellis graphs).
    - ggplot2 is used to make graph generation more consistent in R.

Page 25: Database Data Mining: Practical R Enterprise and Oracle Advanced

Data Visualization

Histogram

- Do you see any significant pattern in the distribution?
- Do you like the way the histogram is presented?

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)

    dataset = generateCustomer()

    h = hist(dataset$BillperPeriod, freq=TRUE,
             ylab="Number of Customers",
             xlab="Bill Amount",
             main="Bill Amount Distribution")

Page 26: Database Data Mining: Practical R Enterprise and Oracle Advanced

Data Visualization

Remove the Outliers

- Do you see any significant pattern in the distribution?

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    dataset = generateCustomer()

    # Keep only the rows whose value in `column` is at or below the q-th
    # quantile (strictly below when inclusive=FALSE).
    nooutlier = function(data, column, q=0.99, inclusive=TRUE){
      cutoff = quantile(data[,column], na.rm=TRUE,
                        probs=q, names=FALSE)

      if (inclusive){
        pruned = subset(data, data[,column] <= cutoff)
      } else {
        pruned = subset(data, data[,column] < cutoff)
      }

      pruned
    }

    pruned = nooutlier(dataset, "BillperPeriod", 0.99)

    h = hist(pruned$BillperPeriod, freq=TRUE,
             ylab="Number of Customers",
             xlab="Bill Amount",
             main="Bill Amount Distribution")
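Since `generateCustomer()` lives in a file that is not part of the slides, here is a small self-contained sketch of the same quantile-based pruning on synthetic data. The data set is an assumption made up for illustration: 990 ordinary bills plus 10 extreme ones.

```r
set.seed(42)

# Synthetic stand-in for generateCustomer(): mostly moderate bills
# plus a handful of extreme values that stretch the histogram.
bills <- data.frame(
  BillperPeriod = c(rlnorm(990, meanlog = 3.5, sdlog = 0.4),
                    runif(10, min = 2000, max = 5000))
)

# Cut at the 99th percentile, as on the slide.
cutoff <- quantile(bills$BillperPeriod, probs = 0.99,
                   na.rm = TRUE, names = FALSE)
pruned <- bills[!is.na(bills$BillperPeriod) &
                  bills$BillperPeriod <= cutoff, , drop = FALSE]

cat("rows before:", nrow(bills), " after:", nrow(pruned), "\n")
cat("max before:", round(max(bills$BillperPeriod)),
    " max after:", round(max(pruned$BillperPeriod)), "\n")
```

Dropping the top 1% removes only a few rows but shrinks the x axis dramatically, which is what makes the shape of the remaining distribution visible.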

Page 27: Database Data Mining: Practical R Enterprise and Oracle Advanced

Data Visualization

Conditional Histograms

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    source("~/r-snipplets/oow2012/commons.r", local=TRUE)

    dataset = generateCustomer()

    pruned = nooutlier(dataset, "BillperPeriod", 0.99)

    library(lattice)
    histogram(~ BillperPeriod | UsingServiceX,
              data=pruned)
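A conditioned histogram can be tried without the author's data files. The sketch below assumes a synthetic `customers` data frame with made-up column names; only `histogram()` and its formula syntax come from lattice itself.

```r
library(lattice)

set.seed(1)
# Synthetic customers (hypothetical columns): service users tend to have higher bills.
customers <- data.frame(
  bill         = c(rnorm(300, mean = 40, sd = 8),
                   rnorm(300, mean = 65, sd = 12)),
  uses_service = factor(rep(c("no", "yes"), each = 300))
)

# One panel per level of uses_service, drawn on a shared x axis.
p <- histogram(~ bill | uses_service, data = customers,
               xlab = "Bill Amount", ylab = "Percent of Customers")

# lattice returns a "trellis" object; inside scripts and functions
# it is drawn only when printed explicitly.
print(p)
```

The formula reads "distribution of `bill`, given `uses_service`", so swapping the conditioning variable is a one-word change.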

Page 28: Database Data Mining: Practical R Enterprise and Oracle Advanced

Data Visualization

Too Many Columns to Visualize

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    source("~/r-snipplets/oow2012/commons.r", local=TRUE)

    dataset = generateCustomer()
    head(dataset)

    pruned = nooutlier(dataset, "BillperPeriod", 0.99)

    library(lattice)
    histogram(~ BillperPeriod | CarBrand,
              data=pruned)

Page 29: Database Data Mining: Practical R Enterprise and Oracle Advanced

A Bit of Probability and Information Theory

Comparing Histograms

- We need a way to calculate the similarity between those histograms.
- A strong tool from information theory, the Kullback-Leibler divergence, allows us to define a distance measure between two distributions.

    # Share of rows that fall into each of n quantile-delimited bins.
    # sf is a small smoothing factor that keeps every bin non-zero.
    equiwidth = function(data, col, n=10, sf=1e-6){
      qlist = quantile(data[,col], na.rm=TRUE,
                       probs=(1:n)/n, names=FALSE)

      result = c()
      for (cutoff in qlist){
        result = c(result,
                   nrow(subset(data, data[,col] <= cutoff)) / nrow(data))
      }

      result[1:n] - c(0, result[1:(n-1)]) + rep(sf, n)
    }
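The function on this slide derives its cut points from whatever data it is handed. One possible variant, offered here as an assumption rather than as the talk's method, bins every sample on the baseline's breakpoints so that bin i covers the same value range in both probability vectors. The helper name `bin_probs` and the synthetic samples are hypothetical.

```r
# Hypothetical helper (not from the talk): bin `x` on breakpoints taken
# from `reference` and return smoothed bin probabilities.
bin_probs <- function(x, reference, n = 10, sf = 1e-6) {
  inner  <- quantile(reference, probs = (1:(n - 1)) / n,
                     na.rm = TRUE, names = FALSE)
  breaks <- unique(c(-Inf, inner, Inf))
  counts <- table(cut(x, breaks = breaks, right = TRUE))
  as.numeric(counts) / sum(counts) + sf   # sf keeps empty bins non-zero
}

set.seed(7)
everyone <- rnorm(5000, mean = 50, sd = 10)   # synthetic baseline bills
segment  <- rnorm(500,  mean = 60, sd = 10)   # a segment shifted upwards

baseline_p <- bin_probs(everyone, everyone)
segment_p  <- bin_probs(segment,  everyone)

print(round(baseline_p, 3))
print(round(segment_p, 3))
```

With shared breakpoints the baseline comes out close to uniform, while the shifted segment piles up in the upper bins, which is the kind of difference a divergence measure can then pick up.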

Page 30: Database Data Mining: Practical R Enterprise and Oracle Advanced

A Bit of Probability and Information Theory

KL Divergence & Symmetry

- $D_{KL}(P \| Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}$
- Notice that $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$.
- So we simply take the average of the two to obtain a symmetric measure.

    kldistance = function(dist1, dist2){
      # Base-2 logarithms are used here, whereas the formula uses ln;
      # this only rescales the result by a constant factor.

      # D_KL(dist1 || dist2)
      kl1 = 0.0
      for (i in 1:length(dist1)){
        kl1 = kl1 + dist1[i] * log(dist1[i] / dist2[i], 2)
      }

      # D_KL(dist2 || dist1)
      kl2 = 0.0
      for (i in 1:length(dist1)){
        kl2 = kl2 + dist2[i] * log(dist2[i] / dist1[i], 2)
      }

      (kl1 + kl2) / 2
    }
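The asymmetry is easy to check on a two-bin example. The sketch below is a vectorized restatement using natural logarithms, as in the formula; the function names `kl_div` and `sym_kl` and the two toy distributions are assumptions made for illustration.

```r
# D_KL(P || Q) with natural logs; assumes p and q are strictly positive
# probability vectors of equal length.
kl_div <- function(p, q) sum(p * log(p / q))

# Symmetrized version: the average of the two directions.
sym_kl <- function(p, q) (kl_div(p, q) + kl_div(q, p)) / 2

p <- c(0.5, 0.5)
q <- c(0.9, 0.1)

cat("D_KL(P||Q) =", kl_div(p, q), "\n")
cat("D_KL(Q||P) =", kl_div(q, p), "\n")
cat("symmetric  =", sym_kl(p, q), "\n")
```

The two directions disagree, while the averaged value is the same whichever distribution is passed first. Note that averaging gives symmetry only; it does not make the measure a metric in the strict sense.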

Page 31: Database Data Mining: Practical R Enterprise and Oracle Advanced

A Bit of Probability and Information Theory

Top 5 Car Brands whose Owners Diverge from Baseline

    Brand     KL
    Lancia    8.969125
    Lincoln   8.969125
    Proton    7.572549
    Daewoo    7.572549
    Pontiac   6.421267

    ddf = NULL
    baseline = equiwidth(pruned, "BillperPeriod")
    for (brand in dataset[!duplicated(dataset[,c('CarBrand')]), 1]){
      brandDist = equiwidth(subset(pruned, pruned[,'CarBrand'] == brand),
                            "BillperPeriod")
      ddf = rbind(ddf,
                  data.frame(carbrand=brand,
                             kl=kldistance(baseline, brandDist)))
    }

    # Five brands with the largest divergence from the baseline
    head(ddf[order(ddf$kl, decreasing=TRUE),], n=5)

Page 32: Database Data Mining: Practical R Enterprise and Oracle Advanced

Optimization

Problem Definition

- We have a terrain covered by several stations, and each point on the terrain has one of the following statuses:
    GREEN   Region is in the line of sight (LoS) of at least one station.
    YELLOW  Region is in the LoS of at least one station but far away.
    RED     Region is out of LoS.
- For a fixed number of stations, we need to cover as much of the terrain as we can.

Page 33: Database Data Mining: Practical R Enterprise and Oracle Advanced

Optimization

Model Sketch Up [1]

1 Define a function to calculate the ratio of green zones on the terrain.
2 Give this function to one of R's optimization routines (the Nelder-Mead technique), which can handle non-smooth target functions.
3 Get the optimal station distribution.

    # terr, maxDist, distance(), LoS() and updatestatus() are defined
    # outside this slide.
    targetfunc = function(observer){
      m = matrix(data=observer, ncol=2, byrow=TRUE)

      # Compute merged status of all observers
      mergedstatus <- rep("red", length(terr$height))
      for (i in seq_len(nrow(m))){
        terr$dist2observer = distance(terr, c(m[i,], 7))
        status = LoS(terr, c(m[i,], 7), maxDist)
        mergedstatus = updatestatus(mergedstatus, status)
      }

      # Number of green points; maximizing it also maximizes the ratio
      sum(mergedstatus == "green")
    }

    # Nelder-Mead is optim's default method; fnscale=-1 makes it maximize
    best <- optim(observers, targetfunc,
                  control=list(fnscale=-1, trace=5, REPORT=1))

[1] Refer to LoS Analysis (Part 4)
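The three steps can be tried end to end on a toy problem. The sketch below is not the talk's LoS model: it assumes a flat 20 x 20 grid and treats a point as covered when it lies within a fixed radius of a station. The names `terrain`, `radius`, and `coverage` are hypothetical.

```r
# Toy terrain: a 20 x 20 grid of points (an assumption; the talk uses
# real elevation data and a line-of-sight test instead of a plain radius).
terrain <- expand.grid(x = 1:20, y = 1:20)
radius  <- 5

# Number of grid points within `radius` of at least one station.
# `stations` is a flat vector (x1, y1, x2, y2, ...), as optim() expects.
coverage <- function(stations) {
  m <- matrix(stations, ncol = 2, byrow = TRUE)
  covered <- rep(FALSE, nrow(terrain))
  for (i in seq_len(nrow(m))) {
    d <- sqrt((terrain$x - m[i, 1])^2 + (terrain$y - m[i, 2])^2)
    covered <- covered | (d <= radius)
  }
  sum(covered)
}

start <- c(2, 2, 3, 3)   # two stations crowded into one corner

# fnscale = -1 turns optim() from a minimizer into a maximizer.
fit <- optim(start, coverage, method = "Nelder-Mead",
             control = list(fnscale = -1))

cat("covered at start:   ", coverage(start), "of", nrow(terrain), "\n")
cat("covered after optim:", fit$value, "\n")
print(matrix(fit$par, ncol = 2, byrow = TRUE))
```

Because the target is a step function, the search can stall on flat regions, so in practice it is worth restarting from several different initial layouts and keeping the best result.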

Page 34: Database Data Mining: Practical R Enterprise and Oracle Advanced

Optimization

1 Station (54%)

Page 35: Database Data Mining: Practical R Enterprise and Oracle Advanced

Optimization

3 Stations (83%)

Page 36: Database Data Mining: Practical R Enterprise and Oracle Advanced

Optimization

6 Stations (99%)

Page 37: Database Data Mining: Practical R Enterprise and Oracle Advanced

Text Analysis & Decision Trees

Problem Definition

- Given a string that subscribers have written wrongly, either intentionally or by mistake, how can we build a model that deduces the most probable string among 3 possibilities (or chooses not to make any decision)?

Our legitimate strings are mom, dad, and brother. And we have

    brothe          → brother
    bro             → brother
    brother1        → brother
    p               → ?
    1234            → ?
    mom.i.came.home → mom
    mmomyy          → mom
    dad[atwork]     → dad
    dod             → dad

Page 39: Database Data Mining: Practical R Enterprise and Oracle Advanced

Text Analysis & Decision Trees

Model Sketch Up

1 Do some feature engineering
    - Length of the string
    - Prefix flag (3 attributes, one for each legitimate string)
    - Contains flag (3 attributes, one for each legitimate string)
    - Anything else?
2 Build a classifier to classify those texts based on those features.
3 Evaluate your classifier
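The feature list can be sketched in base R. The column names follow the pattern used in the model formula later in the deck (`prefixMom`, `instrMom`, and so on), but the exact definitions are an assumption: here a prefix flag means "the text starts with the legitimate string" and a contains flag means "the legitimate string occurs somewhere in the text". The helper `make_features` is hypothetical.

```r
targets <- c(Mom = "mom", Dad = "dad", Brother = "brother")

# Hypothetical helper: one row of features per input string.
make_features <- function(text) {
  text  <- tolower(text)
  feats <- data.frame(original = text, length = nchar(text),
                      stringsAsFactors = FALSE)
  for (name in names(targets)) {
    # Prefix flag: does the text start with the legitimate string?
    feats[[paste0("prefix", name)]] <- startsWith(text, targets[[name]])
    # Contains flag: does the legitimate string occur anywhere in the text?
    feats[[paste0("instr", name)]]  <- grepl(targets[[name]], text, fixed = TRUE)
  }
  feats
}

features <- make_features(c("brothe", "bro", "mom.i.came.home",
                            "mmomyy", "dad[atwork]", "dod", "1234"))
print(features)
```

Inputs such as "brothe" and "dod" get no flag at all under these definitions, which is one way to see why the "Anything else?" bullet is on the list.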

Page 44: Database Data Mining: Practical R Enterprise and Oracle Advanced

Text Analysis & Decision Trees

First Model

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    df = generateText()

    library(rpart)

    # grow tree
    fit <- rpart(corrected ~ length + prefixBrother + prefixDad + prefixMom +
                   instrBrother + instrDad + instrMom,
                 method="class", data=df)

    table(pred = predict(fit, df, type="class"),
          true = df$corrected)

             true
    pred       ?  brother  dad  mom
      ?       20        0    0    0
      brother  0       30   10    0
      dad      0        0   10    0
      mom      0        0    0   20
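The "evaluate your classifier" step can be made concrete with the confusion matrix printed on this slide. The sketch below re-enters those counts by hand and derives overall accuracy and per-class recall; the variable names are made up for illustration.

```r
# Confusion matrix copied from the slide: rows are predictions,
# columns are true labels. matrix() fills column by column.
labels <- c("?", "brother", "dad", "mom")
confusion <- matrix(c(20,  0,  0,  0,    # true "?"
                       0, 30,  0,  0,    # true "brother"
                       0, 10, 10,  0,    # true "dad"
                       0,  0,  0, 20),   # true "mom"
                    nrow = 4,
                    dimnames = list(pred = labels, true = labels))

# Share of all rows that sit on the diagonal (80 of 90).
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of each true class that was predicted correctly.
recall <- diag(confusion) / colSums(confusion)

cat("accuracy:", round(accuracy, 3), "\n")
print(round(recall, 2))
```

The overall number hides where the errors are: every mistake comes from true "dad" strings being classified as "brother".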

Page 45: Database Data Mining: Practical R Enterprise and Oracle Advanced

Text Analysis & Decision Trees

More Feature Engineering using Jaro-Winkler Algorithm

Jaro-Winkler is a string similarity measure that can be used as a fuzzy string matching algorithm resilient to typos.

    library(RecordLinkage)

    # Add one similarity score per legitimate string
    enhanced = data.frame(df,
                          momScore     = jarowinkler("mom", df$orginal),
                          dadScore     = jarowinkler("dad", df$orginal),
                          brotherScore = jarowinkler("brother", df$orginal))

             true
    pred       ?  brother  dad  mom
      ?       20        0    0    0
      brother  0       30    0    0
      dad      0        0   20    0
      mom      0        0    0   20
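To see how a fuzzy score separates the hard cases without installing RecordLinkage, the sketch below swaps in a different technique: a normalized Levenshtein (edit distance) similarity built on base R's `adist()`. It is not Jaro-Winkler and will give different numbers; the helper `edit_similarity` is hypothetical.

```r
targets <- c("mom", "dad", "brother")

# Similarity in [0, 1]: 1 - (edit distance / length of the longer string).
# This is a Levenshtein-based stand-in, not the Jaro-Winkler score.
edit_similarity <- function(text, target) {
  1 - as.vector(adist(text, target)) / pmax(nchar(text), nchar(target))
}

texts  <- c("dod", "mmomyy", "brothe")

# One column of scores per legitimate string, one row per input text.
scores <- sapply(targets, function(t) edit_similarity(texts, t))
rownames(scores) <- texts
print(round(scores, 2))

# Highest-scoring legitimate string for each input.
best <- targets[apply(scores, 1, which.max)]
print(best)
```

Scores like these are continuous, so the decision tree can split on thresholds instead of relying only on exact prefix or substring matches.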

Page 46: Database Data Mining: Practical R Enterprise and Oracle Advanced

Conclusion

- R contains lots of libraries that help you model a physical phenomenon in any way you like and visualize it.
- Oracle R Enterprise makes it possible to handle large volumes of data without changing the basics of your R environment.
- Don't treat ODM and Oracle R Enterprise as alternatives to each other, but rather as complementary solutions to the same problem.

Page 49: Database Data Mining: Practical R Enterprise and Oracle Advanced

Question & Answer

Page 50: Database Data Mining: Practical R Enterprise and Oracle Advanced

Stay in Touch

[email protected]
[email protected]
http://husnusensoy.wordpress.com
@husnusensoy

