Assessing Data Mining: The State of the Practice
©2003
Herbert A. Edelstein
Two Crows Corporation
10500 Falls Road
Potomac, Maryland 20854
www.twocrows.com
(301) 983-3555
© Two Crows Corporation

Objectives
- Separate myth from reality
- Interactive session: question driven! The slides are largely to ensure common background.

The Key to Value
- The utility of data increases as it spans the business value chain and is integrated
- Information increases as data are related
  - Consolidate similar databases
  - Consolidate different types of databases
- Without data and good analysis all you have are opinions.

Data Mining Definitions
- What IT departments call
  - OLAP
  - Query
  - Statistics

Data Mining Definitions
- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth)
- KDD is the process; data mining is the application of algorithms
- Includes description and prediction
- Large databases often explicitly added to definition

Data Mining Definitions
- Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions
- Exploration and description are required but are not the goal

New Statistical Software
- Takes advantage of advances in hardware and software
- Provides new interfaces for a wider class of users
- Comes from statistics, machine learning and information systems

Why Data Mining is Taking Off
- Demand for information
- Availability of data
  - Enormous quantity of happenstance data
  - Spread of data warehouses
  - Data is easily accessible through the Web.
- Improved technology
  - Inexpensive, scalable processing
  - Inexpensive storage
  - High bandwidth

Why Data Mining Is Needed
- Massive amounts of data
  - Example:
    - 75 million customers
    - 3,000 columns for each customer
- Low signal to noise
  - Subtle relationships
  - Variation
- Allow domain experts to build predictive models

Data Mining Products
- Handle large volumes of data
- Reduce dependence on the modeler
  - Model specification
  - Knowing characteristics of variables
- Create hypotheses
- Emphasize prediction
- Simplify model deployment
- Data mining is a productivity tool even for the skilled statistician

Attribute Interactions
[Figure: response rate (axis from 0.000 to 0.075) plotted against interest rate (low, high). Labeled points: High Fee, Low Interest; Low Fee, Low Interest; Low Fee, High Interest; High Fee, High Interest.]

Data Mining Myths
- Data mining does NOT
  - Find answers to unasked questions
  - Explain behavior.
  - Continuously monitor data for new patterns
  - Eliminate the need to understand your business
  - Eliminate the need to collect good data
  - Eliminate the need to be a good data analyst

Commercial Applications
Industry
- Retail
- Financial
- Manufacturing
- Insurance
- Publishing
- Health care
Application
- Marketing
- Sales force management
- Fraud detection
- Risk management

Credit Risk Analysis
- Database checks
  - Data validation: Does an address exist, is the social security number consistent with date and place of birth, etc.
  - Where is Benford’s Law applicable?
  - History checks: when was the last time a property sold?
- Profiling good and bad credit risks: finding good risks within bad categories.
- Outlier detection: uni-dimensional and multi-dimensional
- Does not replace human follow up

Benford’s Law
If the numbers under investigation are not entirely random but somehow socially or naturally related, the distribution of the first digit is not uniform. More accurately, digit D appears as the first digit with the frequency proportional to log10(1 + 1/D). In other words, one may expect 1 to be the first digit of a random number in about 30% of cases, 2 will come up in about 18% of cases, 3 in 12%, 4 in 10%, 5 in 8%, etc.
http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml
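As a rough illustration of the log10(1 + 1/D) rule, the sketch below computes the expected first-digit shares and compares them with the observed shares in a sample. The function names and the sample (the first 100 powers of 2, a sequence known to follow the law closely) are assumptions made for this example, not part of the original slides.

```python
import math

def benford_expected(digit: int) -> float:
    """Expected share of values whose first digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)

def first_digit(value: float) -> int:
    """Leading non-zero digit of a non-zero number."""
    # Scientific notation always starts with the leading digit, e.g. '4.2e-03'.
    return int(f"{abs(value):.15e}"[0])

def first_digit_shares(values):
    """Observed share of each leading digit 1 to 9 among the non-zero values."""
    counts = {d: 0 for d in range(1, 10)}
    total = 0
    for v in values:
        if v == 0:
            continue  # zero has no leading digit
        counts[first_digit(v)] += 1
        total += 1
    if total == 0:
        raise ValueError("need at least one non-zero value")
    return {d: n / total for d, n in counts.items()}

# Illustrative sample only: the powers 2**1 .. 2**100.
sample = [2 ** k for k in range(1, 101)]
observed = first_digit_shares(sample)

for d in range(1, 10):
    print(f"digit {d}: expected {benford_expected(d):.3f}, observed {observed[d]:.3f}")
```

A large gap between expected and observed shares in data that should follow the law is only a flag for review; as the credit risk slide notes, it does not replace human follow up.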

Data Mining Process
- Define business problem
- Prepare the data
  - Build data mining database
  - Explore data
  - Prepare data for modeling
- Create models
  - Build models
  - Evaluate models
- Act on the results (implementation)
  - Apply models to new data
  - Integrate results with an application

Data Preparation
- Build data mining database
- Explore data
- Prepare data for modeling
- 60% to 95% of the time is spent preparing the data

The Tools
- Starting simply: linear regression
- Fancier regressions
- Projections
- Smoothing based regressions
- Survival analysis
- Nearest neighbor methods
- Collaborative filtering
- Trees
- MARS
- Neural networks
- Genetic algorithms

Decision Trees
- Build a tree inductively that describes a set of data
- Classification tree: Hierarchical set of rules which classify data
- Regression tree: Hierarchical set of rules which predict values
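To make "hierarchical set of rules" concrete, here is a minimal sketch of how a finished tree is applied to a record. It does not show the inductive building step. Both trees, their field names (`income`, `years_at_address`, `age`) and all thresholds and leaf values are hypothetical and hand-written for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    # Class label for a classification tree, number for a regression tree.
    value: Union[str, float]

@dataclass
class Split:
    field: str         # attribute tested at this node
    threshold: float   # rule: go left if record[field] <= threshold
    left: "Node"
    right: "Node"

Node = Union[Leaf, Split]

def predict(node: Node, record: dict) -> Union[str, float]:
    """Follow the rules from the root down to a leaf and return its value."""
    while isinstance(node, Split):
        node = node.left if record[node.field] <= node.threshold else node.right
    return node.value

# Hypothetical classification tree: the leaves hold class labels.
credit_tree = Split(
    "income", 30000,
    Split("years_at_address", 2, Leaf("bad risk"), Leaf("good risk")),
    Leaf("good risk"),
)

# Hypothetical regression tree: the leaves hold predicted values.
spend_tree = Split(
    "income", 30000,
    Leaf(120.0),
    Split("age", 40, Leaf(310.0), Leaf(450.0)),
)

applicant = {"income": 25000, "years_at_address": 5, "age": 52}
print(predict(credit_tree, applicant))  # label from the classification tree
print(predict(spend_tree, applicant))   # value from the regression tree
```

The only difference between the two kinds of tree in this sketch is what the leaves hold; the rule-following logic is shared.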

Neural Nets
- Don’t resemble the brain
- A model of memory
- A model of learning

NN Don’t Resemble the Brain
- Brain neurons can not only add signals, but they can subtract, multiply, divide, filter, average, etc.
- “The computational toolbox of individual neurons dwarfs the elements available to today’s electronic circuit designers” (Christof Koch, Professor, Computation and Neural Systems, Caltech)

Linear Regression Example
[Figure: perceptron diagram. Inputs (I) x_1, x_2, x_3, x_4 carry weights 0.3, 0.7, -0.2, 0.4, and a constant input x_0 carries weight 0.5; the weighted inputs are summed to give the output y.]
$0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 + 0.5 = y$
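The weighted sum in this diagram can be sketched in a few lines. The weights 0.3, 0.7, -0.2, 0.4 and the constant 0.5 come from the slide; the function name and the sample input values are assumptions chosen only to show the arithmetic.

```python
# Weights on x1..x4 and the constant term (the weight on x0 = 1), from the slide.
WEIGHTS = [0.3, 0.7, -0.2, 0.4]
CONSTANT = 0.5

def linear_output(x):
    """Return y = 0.3*x1 + 0.7*x2 - 0.2*x3 + 0.4*x4 + 0.5."""
    if len(x) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} inputs, got {len(x)}")
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + CONSTANT

# Hypothetical inputs: 0.3*1 + 0.7*2 - 0.2*3 + 0.4*4 + 0.5
y = linear_output([1, 2, 3, 4])
print(round(y, 2))
```

Treating the constant as the weight on an input that is always 1 is what lets the diagram draw it as just another connection into the perceptron.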

Logistic Regression Example
[Figure: perceptron diagram. Inputs (I) x_1, x_2, x_3, x_4 carry weights 0.3, 0.7, -0.2, 0.4, and a constant input x_0 carries weight 0.5; the weighted inputs are summed to give y, and the output is f(y).]
$0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 + 0.5 = y$

Sigmoid Activation Function
- S shaped
- Continuous approximation of threshold
- Has derivative
$f(y) = \frac{1}{1 + e^{-y}}$
[Figure: plot of f(y) for y from -10 to 8; vertical axis from 0 to 1.2.]
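The three properties listed on this slide can be checked numerically. The sketch below assumes the standard logistic form f(y) = 1/(1 + e^(-y)) and uses the identity f'(y) = f(y)(1 - f(y)) for the derivative; the function names and the hard `threshold` used for comparison are illustrative choices.

```python
import math

def sigmoid(y: float) -> float:
    """Logistic function 1 / (1 + e^(-y)), written to avoid overflow."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    z = math.exp(y)  # safe for large negative y
    return z / (1.0 + z)

def sigmoid_derivative(y: float) -> float:
    """Derivative of the logistic function: f(y) * (1 - f(y))."""
    f = sigmoid(y)
    return f * (1.0 - f)

def threshold(y: float) -> float:
    """Hard step function that the sigmoid approximates smoothly."""
    return 1.0 if y >= 0 else 0.0

# Same horizontal range as the plot: -10 to 8 in steps of 3.
for y in range(-10, 9, 3):
    print(f"y={y:>3}  f(y)={sigmoid(y):.4f}  f'(y)={sigmoid_derivative(y):.4f}  step={threshold(y)}")
```

Away from zero the sigmoid and the step function agree closely, while near zero the sigmoid changes smoothly, which is what gives it a usable derivative.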

Logistic Regression Example
[Figure: perceptron diagram with weights 0.3, 0.7, -0.2, 0.4 on inputs x_1 to x_4 and weight 0.5 on the constant input x_0; the summed input is I and the output is y.]
Inputs: x_1 = +1, x_2 = -1, x_3 = +1, x_4 = +1, x_0 = +1
$0.3x_1 + 0.7x_2 - 0.2x_3 + 0.4x_4 + 0.5 = I$
$I = 0.3 - 0.7 - 0.2 + 0.4 + 0.5 = 0.3$
$y = \frac{1}{1 + e^{-0.3}} = 0.57$
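The arithmetic of this worked example can be reproduced directly. The weights and input values are taken from the slide; the variable and function names are assumptions for this sketch.

```python
import math

# Weights and inputs from the slide, listed as x1, x2, x3, x4, x0.
weights = [0.3, 0.7, -0.2, 0.4, 0.5]
inputs = [+1, -1, +1, +1, +1]   # x0 is the constant input, always +1

def summed_input(weights, inputs):
    """I = sum of weight * input over all connections."""
    return sum(w * x for w, x in zip(weights, inputs))

def logistic(i: float) -> float:
    """y = 1 / (1 + e^(-I))."""
    return 1.0 / (1.0 + math.exp(-i))

I = summed_input(weights, inputs)
y = logistic(I)

print(f"I = {I:.1f}")   # I = 0.3
print(f"y = {y:.2f}")   # y = 0.57
```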

Layered Architecture
[Figure: layered network with an input layer, a hidden layer, and an output layer.]
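A layered network chains several perceptron-style units: each hidden unit applies a sigmoid to its own weighted sum of the inputs, and the output layer does the same to the hidden values. The sketch below is a minimal forward pass under that assumption; the layer sizes and every weight value are hypothetical, and no training is shown.

```python
import math

def sigmoid(y: float) -> float:
    return 1.0 / (1.0 + math.exp(-y))

def layer(inputs, weights, constants):
    """One layer: each unit takes a weighted sum of all inputs, then a sigmoid.

    `weights` holds one row of weights per unit; `constants` holds one
    constant term (the weight on an always-1 input) per unit.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + c)
        for row, c in zip(weights, constants)
    ]

def forward(inputs, hidden_w, hidden_c, output_w, output_c):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_c)
    return layer(hidden, output_w, output_c)

# Hypothetical network: 4 inputs, 2 hidden units, 1 output unit.
hidden_w = [[0.3, 0.7, -0.2, 0.4],
            [-0.5, 0.1, 0.6, -0.3]]
hidden_c = [0.5, -0.1]
output_w = [[1.2, -0.8]]
output_c = [0.2]

result = forward([1, -1, 1, 1], hidden_w, hidden_c, output_w, output_c)
print(result)
```

With a single unit and no hidden layer this reduces to the logistic regression perceptron from the earlier slides.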

State of the Industry Report Card

                              1999   2001   2003
Products
  User interface              C      B-     B
  Data preparation            D      C      C
  Data exploration            C-     C      B
  Algorithms                  C      B      B+
  Model deployment            D      C      B
  Robustness                  C      B-     B+
Adoption
  Organizational readiness    C      C      B
  Successful applications     C+     B      A
  Training                    D      C      B
  Available consulting        D      C      B

State of the Product Market
- Good products are available
  - Feature rich
  - Mature
  - Reasonably stable
  - Well supported

Tools and Technology
- Match tool to application and users.
- Allow time for training and learning
- The tool is not the solution.
- Model building is the fun and easy part.

Further Reading
- Herb Edelstein, Introduction to Data Mining and Knowledge Discovery, 2003
- Herb Edelstein, Data Mining Technology Report, 2003
- M. Berry and G. Linoff, Mastering Data Mining Techniques, John Wiley, 1999
- William S. Cleveland, The Elements of Graphing Data, revised, Hobart Press, 1994
- Howard Wainer, Visual Revelations, Copernicus, 1997
- T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Verlag, 2001
- Richard O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, John Wiley, 2000
- David W. Hosmer Jr., S. Lemeshow, Applied Logistic Regression, John Wiley, 2000
- David W. Hosmer Jr., S. Lemeshow, Applied Survival Analysis, John Wiley, 1999
- David J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, 2001
- Breiman, Friedman, Olshen, and Stone, Classification and Regression Trees, Wadsworth, 1984
- J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992