Assessing Data Mining: The State of the Practice

Assessing Data Mining: The State of the Practice. ©2003 Herbert A. Edelstein, Two Crows Corporation, 10500 Falls Road, Potomac, Maryland 20854. www.twocrows.com, (301) 983-3555
Transcript
Page 1: Assessing Data Mining: The State of the Practice...Data Mining Definitions Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful

Assessing Data Mining: The State of the Practice
©2003

Herbert A. Edelstein
Two Crows Corporation

10500 Falls Road
Potomac, Maryland 20854

www.twocrows.com
(301) 983-3555

Page 2: Objectives

© Two Crows Corporation

- Separate myth from reality
- Interactive session: question driven! The slides are largely to ensure common background.

Page 3: The Key to Value

- The utility of data increases as it spans the business value chain and is integrated
- Information increases as data are related
  - Consolidate similar databases
  - Consolidate different types of databases
- Without data and good analysis all you have are opinions.

Page 4: Data Mining Definitions

- What IT departments call
  - OLAP
  - Query
  - Statistics

Page 5: Data Mining Definitions

- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth)
- KDD is the process, data mining is the application of algorithms
- Includes description and prediction
- Large databases often explicitly added to definition

Page 6: Data Mining Definitions

- Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions
- Exploration and description are required but are not the goal

Page 7: New Statistical Software

- Takes advantage of advances in hardware and software
- Provides new interfaces for a wider class of users
- Comes from statistics, machine learning and information systems

Page 8: Why Data Mining is Taking Off

- Demand for information
- Availability of data
  - Enormous quantity of happenstance data
  - Spread of data warehouses
  - Data is easily accessible through the Web.
- Improved technology
  - Inexpensive, scalable processing
  - Inexpensive storage
  - High bandwidth

Page 9: Why Data Mining Is Needed

- Massive amounts of data
  - Example:
    - 75 million customers
    - 3,000 columns for each customer
- Low signal to noise
  - Subtle relationships
  - Variation
- Allow domain experts to build predictive models

Page 10: Data Mining Products

- Handle large volumes of data
- Reduce dependence on the modeler
  - Model specification
  - Knowing characteristics of variables
- Create hypotheses
- Emphasize prediction
- Simplify model deployment
- Data mining is a productivity tool even for the skilled statistician

Page 11: Attribute Interactions

[Figure: Response Rate (vertical axis, 0.000 to 0.075) against Interest Rate (horizontal axis, Low to High), with points labeled High Fee, Low Interest; Low Fee, Low Interest; Low Fee, High Interest; High Fee, High Interest.]

Page 12: Data Mining Myths

Data mining does NOT
- Find answers to unasked questions
- Explain behavior.
- Continuously monitor data for new patterns
- Eliminate the need to understand your business
- Eliminate the need to collect good data
- Eliminate the need to be a good data analyst

Page 13: Commercial Applications

Industry
- Retail
- Financial
- Manufacturing
- Insurance
- Publishing
- Health care

Application
- Marketing
- Sales force management
- Fraud detection
- Risk management

Page 14: Credit Risk Analysis

- Database checks
  - Data validation: Does an address exist, is the social security number consistent with date and place of birth, etc.
  - Where is Benford’s Law applicable?
  - History checks: when was the last time a property sold?
- Profiling good and bad credit risks – Finding good risks within bad categories.
- Outlier detection – uni-dimensional and multi-dimensional
- Does not replace human follow up

Page 15: Benford’s Law

If the numbers under investigation are not entirely random but somehow socially or naturally related, the distribution of the first digit is not uniform. More accurately, digit D appears as the first digit with the frequency proportional to log10(1 + 1/D). In other words, one may expect 1 to be the first digit of a random number in about 30% of cases, 2 will come up in about 18% of cases, 3 in 12%, 4 in 9%, 5 in 8%, etc.

http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml
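The first-digit frequencies quoted on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original deck: the helper names (`benford_expected`, `first_digit`, `first_digit_frequencies`) are made up for illustration, and the powers of 2 are only a convenient data set that is known to follow the law closely. Because the values log10(1 + 1/D) for D = 1..9 sum to 1, they can be read directly as frequencies.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit frequencies: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. 0.042 -> '4.2...e-02'.
    return int(f"{abs(value):.15e}"[0])

def first_digit_frequencies(values):
    """Observed share of each leading digit among the non-zero values."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

expected = benford_expected()
# Illustrative data only (an assumption, not from the slides): powers of 2.
observed = first_digit_frequencies([2 ** n for n in range(1, 201)])

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

In a credit-risk setting the same comparison could be run on a column of reported amounts; a large gap between observed and expected frequencies is a reason to look closer, not proof of a problem.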

Page 16: Data Mining Process

- Define business problem
- Prepare the data
  - Build data mining database
  - Explore data
  - Prepare data for modeling
- Create models
  - Build models
  - Evaluate models
- Act on the results (implementation)
  - Apply models to new data
  - Integrate results with an application

Page 17: Data Preparation

- Build data mining database
- Explore data
- Prepare data for modeling
- 60% to 95% of the time is spent preparing the data

Page 18: The Tools

- Starting simply - linear regression
- Fancier regressions
- Projections
- Smoothing based regressions
- Survival analysis
- Nearest neighbor methods
- Collaborative filtering
- Trees
- MARS
- Neural networks
- Genetic algorithms

Page 19: Decision Trees

- Build a tree inductively that describes a set of data
- Classification tree: Hierarchical set of rules which classify data
- Regression tree: Hierarchical set of rules which predict values
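As a rough illustration of building a tree inductively, the sketch below grows a small regression tree on one numeric input by repeatedly choosing the threshold that minimizes squared error. It is a simplified stand-in, not the algorithm of any particular product mentioned in the talk; the function names and the interest-rate data are hypothetical.

```python
def best_split(xs, ys):
    """Return the threshold on one numeric input that minimizes squared error."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = None
    points = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t)
    return best[1] if best else None

def build_tree(xs, ys, depth=0, max_depth=2):
    """Grow the tree recursively; each leaf predicts the mean of its y values."""
    t = best_split(xs, ys) if depth < max_depth else None
    if t is None:
        return {"predict": sum(ys) / len(ys)}
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": build_tree([x for x, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([x for x, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, x):
    """Follow the hierarchy of rules down to a leaf."""
    while "predict" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["predict"]

# Hypothetical data: response rate falls as the interest rate rises.
rates = [5, 6, 7, 12, 13, 14]
responses = [0.07, 0.07, 0.06, 0.02, 0.02, 0.01]
tree = build_tree(rates, responses, max_depth=1)
print(tree["threshold"], predict(tree, 6), predict(tree, 13))
```

A classification tree follows the same recursive pattern, with a purity measure in place of squared error and a majority class in place of the leaf mean.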

Page 20: Neural Nets

- Don’t resemble the brain
- A model of memory
- A model of learning

Page 21: NN Don’t Resemble the Brain

- Brain neurons can not only add signals, but they can subtract, multiply, divide, filter, average, etc.
- “The computational toolbox of individual neurons dwarfs the elements available to today’s electronic circuit designers” Christof Koch, Professor, Computation and Neural Systems, Cal Tech

Page 22: Linear Regression Example

[Figure: a perceptron. Inputs (I) x1, x2, x3 and x4 carry weights 0.3, 0.7, -0.2 and 0.4; the constant input x0 carries weight 0.5; the output is y.]

$y = 0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 + 0.5$

Page 23: Logistic Regression Example

[Figure: a perceptron. Inputs (I) x1, x2, x3 and x4 carry weights 0.3, 0.7, -0.2 and 0.4; the constant input x0 carries weight 0.5; the output is f(y).]

$y = 0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 + 0.5$

Page 24: Sigmoid Activation Function

- S shaped
- Continuous approximation of threshold
- Has derivative

$f(y) = \frac{1}{1 + e^{-y}}$

[Figure: plot of f(y) for y from -10 to 8, vertical axis from 0 to 1.2.]
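The three properties listed on this slide can be checked numerically. The sketch below assumes the standard logistic function f(y) = 1/(1 + e^(-y)) and uses the well-known identity f'(y) = f(y)(1 - f(y)) for its derivative; the function names are illustrative.

```python
import math

def sigmoid(y):
    """Logistic function f(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """f'(y) = f(y) * (1 - f(y)), a standard identity for the logistic function."""
    f = sigmoid(y)
    return f * (1.0 - f)

# Far below zero the output approaches 0, far above it approaches 1,
# which is why the curve works as a smooth version of a threshold.
for y in (-10, -1, 0, 1, 8):
    print(y, round(sigmoid(y), 4), round(sigmoid_derivative(y), 4))
```

The derivative is what gradient-based training methods rely on; a hard threshold has no useful derivative at the step.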

Page 25: Logistic Regression Example

[Figure: a perceptron with Input (I) and Output (y). Weights 0.3, 0.7, -0.2 and 0.4 apply to x1 to x4, and weight 0.5 applies to the constant input x0.]

Inputs: x1 = +1, x2 = -1, x3 = +1, x4 = +1, x0 = +1

$I = 0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 + 0.5$

$I = 0.3 - 0.7 - 0.2 + 0.4 + 0.5 = 0.3$

$y = \frac{1}{1 + e^{-0.3}} = 0.57$
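The arithmetic on this slide can be replayed directly. The weights and inputs below are the slide's own numbers; the variable and function names are only illustrative.

```python
import math

weights = [0.3, 0.7, -0.2, 0.4]   # weights on x1..x4, from the slide
bias_weight = 0.5                 # weight on the constant input x0 = +1
inputs = [+1, -1, +1, +1]         # x1..x4, from the slide

def weighted_sum(xs, ws, b):
    """Linear part of the perceptron: sum of weight * input, plus the bias term."""
    return sum(w * x for w, x in zip(ws, xs)) + b

def logistic(i):
    """Sigmoid activation applied to the weighted sum."""
    return 1.0 / (1.0 + math.exp(-i))

I = weighted_sum(inputs, weights, bias_weight)
y = logistic(I)
print(round(I, 2), round(y, 2))
```

Stopping at `I` gives the linear regression form of the earlier slide; passing `I` through the logistic function gives the logistic regression form.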

Page 26: Layered Architecture

[Figure: a layered network with an input layer, a hidden layer, and an output layer.]
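A layered network chains perceptron-style units: the outputs of one layer become the inputs of the next. The sketch below assumes sigmoid units throughout and a 4-2-1 shape; the second hidden unit's weights and all output-layer weights are hypothetical values chosen for illustration, since the slide shows only the layer names.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer(inputs, weights, biases):
    """One layer: each unit takes a weighted sum of all inputs, then applies the sigmoid."""
    return [
        logistic(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: 4 inputs -> 2 hidden units -> 1 output unit.
# The first hidden unit reuses the perceptron weights from the earlier example.
hidden_w = [[0.3, 0.7, -0.2, 0.4], [-0.5, 0.1, 0.6, -0.3]]
hidden_b = [0.5, 0.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [+1, -1, +1, +1]                         # input layer
hidden = layer(x, hidden_w, hidden_b)        # hidden layer
output = layer(hidden, output_w, output_b)   # output layer
print(hidden, output)
```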

Page 27: State of the Industry Report Card

                             1999   2001   2003
Products
  User interface             C      B-     B
  Data preparation           D      C      C
  Data exploration           C-     C      B
  Algorithms                 C      B      B+
  Model deployment           D      C      B
  Robustness                 C      B-     B+
Adoption
  Organizational readiness   C      C      B
  Successful applications    C+     B      A
  Training                   D      C      B
  Available consulting       D      C      B

Page 28: State of the Product Market

- Good products are available
  - Feature rich
  - Mature
  - Reasonably stable
  - Well supported

Page 29: Tools and Technology

- Match tool to application and users.
- Allow time for training and learning
- The tool is not the solution.
- Model building is the fun and easy part.

Page 30: Further Reading

- Herb Edelstein, Introduction to Data Mining and Knowledge Discovery, 2003
- Herb Edelstein, Data Mining Technology Report, 2003
- M. Berry and G. Linoff, Mastering Data Mining Techniques, John Wiley, 1999
- William S. Cleveland, The Elements of Graphing Data, revised, Hobart Press, 1994
- Howard Wainer, Visual Revelations, Copernicus, 1997
- T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Verlag, 2001
- Richard O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley, 2000
- David W. Hosmer Jr., S. Lemeshow, Applied Logistic Regression, John Wiley, 2000
- David W. Hosmer Jr., S. Lemeshow, Applied Survival Analysis, John Wiley, 1999
- David J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, 2001
- Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Wadsworth, 1984
- J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992

