Multivariate Statistical Analysis for Data Mining

Date posted: 20-Aug-2015
Transcript:
  1. Introduction to Data Mining. Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology. 1/5/2009
  2. What is Data Mining? Data flood! Data mining is the extraction of meaningful, useful, or interesting patterns from a large volume of data sources (signal, image, time series, transaction, text, web, etc.). Data mining is one of the top ten emerging technologies (MIT's Technology Review, 2004).
  3. DM Fields & Backgrounds. Data mining is an emerging multidisciplinary field: statistics (especially multivariate statistics); machine learning; application background (e.g., biology); pattern recognition; databases; visualization; OLAP and data warehousing; etc.
  4. Commonly Used Language in Data Mining. Data Mining = DM; Knowledge Discovery in Databases = KDD; Massive Data Sets = MD (Very Large Data Base = VLDB); Data Analysis = DA.
  5. Data Mining. DM, MD, DA: is DA + MD = DM? Statistical DM: computationally feasible algorithms; little or no human intervention. Money issue: DA software (~$5-10K), DM software (~$100K).
  6. Statistical Data Mining. Data mining is exploratory data analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find unknown interesting structure.
  7. Data Mining. Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
  8. KDD. Knowledge discovery in databases (KDD) is a multidisciplinary research field for the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data.
  9. KDD & Data Mining. KDD process (DM process): the process of using data mining methods (algorithms) to extract knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations. Data mining (& modeling): a step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce particular patterns or knowledge over the data. Text mining, web mining, etc. Some people treat DM and KDD as equivalent.
  10. Data Mining, Statistics, CS. Data miners: extract useful information from large amounts of raw data. Statisticians: support data mining by mathematical theory and statistical methods. Computer scientists: support data mining by computational algorithms and relevant software. (Friedman)
  11. Applications. Bioinformatics; sales and marketing; health care/medical diagnosis; supply chain management; process control; network intrusion detection; astronomy; sports and entertainment.
  12. Examples of DM Applications. Finance: forecast stock price or movement using neural networks or time series. Telecom: predict churn rate and customer usage using trees, logistic regression, and activity monitoring. Retail: identify cross-selling using association rules, e.g., market basket analysis, RFM analysis. Pharmaceutical: segment customers into different behavior groups using clustering and classification. Banking: customer relationship management (CRM) using clustering and association.
  13. Examples of DM Applications. Hotel/airline: identify potential customers for promotion offers using trees or neural networks. Ocean terminal: operation data mining for efficiency improvement. Sales/demand data mining for inventory planning. UPS: transaction data mining for mail box location design.
  14. Prerequisites for Data Mining. Large amount of data (internal & external), called data warehouse, data mart, etc.: phone calls, web visits, supermarket transactions, weather data, etc.; mega-, giga-, terabytes, ... Information technology advancement. Most companies have these resources. (Friedman)
  15. Prerequisites for Data Mining. Advanced computer technology (big CPU, parallel architecture, etc.): allows fast access to vast amounts of data; allows computationally intensive algorithms and statistical methods. Knowledge in business or subject matter: ask important business questions; understand and verify discovered knowledge.
  16. Data Mining Process
  17. Data Mining (KDD) Process. Determine Business Objectives → Data Preparation → Mining & Modeling → Consolidation and Application.
  18. Formulate Business Objectives. [Process diagram: Formulate Business Objectives → Data Preparation → Data Mining → Consolidation and Application.] Examples for a telecom company: identify important customer traits to keep profitable customers and to predict fraudulent behavior, credit risks, and customer churn; improve programs in target marketing, marketing channel management, micro-marketing, and cross-selling; meet effectively the challenges of new product development.
  19. Data Preparation. [Process diagram: Formulate Business Objectives → Data Preparation → Data Mining → Consolidation and Application.] Diagram elements: Business Objectives; Source Systems (legacy systems, external systems); Identify Data Needed and Sources; Extract Data from Source Systems; Cleanse and Aggregate Data; Model Discovery File; Model Evaluation File.
  20. Data Mining (Mining & Modeling). [Process diagram: Formulate Business Objectives → Data Preparation → Mining & Modeling → Consolidation and Application.] Diagram elements: Model Discovery File; Explore Data; Construct Model; Model Evaluation File; Evaluate Model; Transform Model into Usable Format; outputs: Ideas, Reports, Models.
  21. Consolidation and Application. [Process diagram: Formulate Business Objectives → Data Preparation → Data Mining → Consolidation and Application.] Diagram elements: Ideas, Reports, Models; Communicate/Transport Knowledge; Knowledge Database; Extract Knowledge; Make Business Decisions and Improve Model.
  22. Effort in Data Mining. [Bar chart: effort (%), axis from 0 to 60, for four stages: Business Objective Determination, Data Preparation, Data Mining, Consolidate Discovered Knowledge.]
  23. Data Preparation
  24. Databases & Data Warehouses. Relational database; object-oriented database; transactional database; time series and spatial databases; data warehouse, data cube. SQL = Structured Query Language. OLAP = On-Line Analytical Processing. MOLAP = Multidimensional OLAP; the fundamental data object for MOLAP is the data cube. ROLAP = Relational OLAP using extended SQL.
  25. Data Preparation: Sources of Noise. Faulty data collection instruments, e.g., sensors; transmission errors, e.g., intermittent errors from satellite or internet transmissions; data entry errors; errors from technology limitations; misused naming conventions, e.g., same names but different meanings; incorrect classification.
  26. Data Preparation: Redundant Data. Variables have different names in different databases; a raw variable in one database is a derived variable in another; changes in a variable over time are not reflected in the database; irrelevant variables destroy speed (dimension reduction needed).
  27. Data Preparation: Problem of Missing Data. Missing values in massive data sets; missing data may be irrelevant to the desired result; massive data sets acquired by instrumentation may have few missing values; impute missing data manually or statistically; missing value plot; traditional methods are limited to small data sets.
  28. Data Preparation: Problem of Outliers. Outliers are easy to detect in low dimensions; a high-dimensional outlier may not show up in low-dimensional projections; clustering and other statistical modeling can be used; Fisher information matrix and convex hull peeling are more feasible but still too complex for massive data sets (generally difficult for massive data); traditional methods are limited to small data sets.
  29. Data Cleaning. Duplicate removal (tool based); missing value imputation (manual and statistical); identify and remove data inconsistencies; identify and refresh stale data; create a unique record (case) ID.
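Three of the cleaning steps listed on this slide (duplicate removal, statistical imputation, and unique case IDs) can be illustrated with a minimal Python sketch using only the standard library. The function name `clean_records`, the field names, and the choice of mean imputation are assumptions made for illustration; the slides do not specify a particular method or tool.

```python
from statistics import mean


def clean_records(records, numeric_field):
    """Deduplicate, mean-impute one numeric field, and add case IDs.

    Hypothetical helper for illustration; not taken from the slides.
    """
    # 1. Duplicate removal: keep the first occurrence of each identical record.
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))

    # 2. Statistical imputation: replace None with the mean of observed values
    #    (mean imputation is an assumed choice, one of several possible).
    observed = [r[numeric_field] for r in unique if r[numeric_field] is not None]
    fill = mean(observed)
    for r in unique:
        if r[numeric_field] is None:
            r[numeric_field] = fill

    # 3. Unique record (case) ID.
    for case_id, r in enumerate(unique, start=1):
        r["case_id"] = case_id
    return unique


rows = [
    {"name": "a", "income": 40.0},
    {"name": "a", "income": 40.0},  # exact duplicate
    {"name": "b", "income": None},  # missing value
    {"name": "c", "income": 60.0},
]
cleaned = clean_records(rows, "income")
```

Here the duplicate row is dropped, the missing income is filled with the mean of the two observed values, and the three remaining records receive IDs 1 to 3.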
  30. Database Sampling. KDD systems must be able to assist in selecting the appropriate parts of the databases to be examined. For sampling to work, the data must satisfy certain conditions (e.g., no systematic biases). Sampling can be a very expensive operation, especially when the sample is taken from data stored in databases.
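One common way to limit the cost of sampling from a large table is to draw the sample in a single pass over the rows. The sketch below uses reservoir sampling (Algorithm R), which is a technique brought in here for illustration and is not named in the slides; `reservoir_sample` and its parameters are hypothetical names.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random in one pass over `rows`.

    `rows` may be any iterable, e.g. a database cursor; its length
    does not need to be known in advance.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


subset = reservoir_sample(range(1000), k=10, seed=1)
```

A single pass avoids random access into the stored data, but it does not address the systematic-bias condition mentioned on the slide; that still has to be checked against how the data were collected.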
  31. Database Reduction. Data cube aggregation. Dimension reduction: eliminate irrelevant and redundant attributes. Data compression: encoding mechanisms, quantization, wavelet transformation, principal components, etc.
  32. Data Preparation Using R
  33. Introduction to R. A software package suitable for data analysis and graphic representation. Free and open source. Implementation of many modern statistical methods. Flexible and customizable.
  34. Software Download of R. Go to http://www.cran.r-project.org/. Click on Windows (95 and later). Click on base. Click on R-2.3.0-win32.exe. Press Save to download the file to your computer. Install R by double-clicking on the downloaded file and following the instructions.
  35. Using R. To invoke R, go to Start → Programs → R. To quit R, type q() at the R prompt (>) and press the Enter key, or simply close the window.
  36. Using R. A good introduction to R is available at http://www.cran.r-project.org/: click on Manuals under Documentation. Videos on R: www.decisionsciencenews.com/?p=261. A few important commands: help.start() for a web-based interface to the help system; help(topic) or ?topic for help on topic; help.search(pattern) for help pages containing pattern.
  37. Data Preparation: Data Matrix
  38. Data Preparation: Variable Types
  39. Data Preparation: Missing Values
  40. Data Preparation: Missing Values
  41. Data Preparation
  42. Data Preparation
  43. Data Preparation: Data Dissimilarities (Distances)
  44. Data Preparation: Data Dissimilarities (Distances)
  45. Data Preparation: Data Dissimilarities (Distances)
  46. Data Preparation: Data Dissimilarities (Distances)
  47. Data Preparation: Data Scaling of Mixed-Type Data
  48. Data Preparation: Outlier Detection. Grubbs test (z-score rule, non-resistant):
  49. Data Preparation: Outlier Detection, Other Rules. CV rule: CV (coefficient of variation) = SD / mean; flag an outlier if the CV exceeds a certain threshold. Resistant rule: resistant score = (X - median) / MAD, where MAD = median absolute deviation; flag an outlier if the score exceeds a certain threshold (say 5).
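The rules on this slide can be sketched in a few lines of Python (standard library only). The resistant score and CV follow the definitions given above; the z-score function uses the usual (x - mean) / SD form with the sample standard deviation, which is an assumption because the formula on the preceding slide did not survive in the transcript. The data values and the use of the absolute score for flagging are also illustrative assumptions.

```python
from statistics import mean, median, stdev


def z_scores(xs):
    """Non-resistant score: (x - mean) / SD, using the sample SD."""
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]


def resistant_scores(xs):
    """Resistant score: (x - median) / MAD, as defined on the slide."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    return [(x - med) / mad for x in xs]  # assumes MAD > 0


def coefficient_of_variation(xs):
    """CV = SD / mean."""
    return stdev(xs) / mean(xs)


data = [10, 11, 9, 10, 12, 10, 11, 50]  # made-up values with one gross outlier
threshold = 5  # the example threshold suggested on the slide

flagged = [x for x, s in zip(data, resistant_scores(data)) if abs(s) > threshold]
largest_z = max(abs(z) for z in z_scores(data))
cv = coefficient_of_variation(data)
```

In this toy data only the value 50 is flagged by the resistant rule, while its z-score stays below 3 because the outlier itself inflates the mean and SD. That is the sense in which the z-score rule is non-resistant. If more than half the values are identical, the MAD is zero and the resistant score is undefined, so a real implementation would need to handle that case.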
  50. Data Preparation: Mahalanobis Distance (multi-dimensional data)
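The formula on this slide is not preserved in the transcript. As a hedged illustration, the sketch below computes the standard squared Mahalanobis distance, d² = (x - m)' S⁻¹ (x - m), of each point from the sample mean m using the sample covariance matrix S. It is restricted to two variables so that the 2x2 inverse can be written out by hand in plain Python; the function name and the data are made up.

```python
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = mean([p[0] for p in points])
    my = mean([p[1] for p in points])

    # Entries of the sample covariance matrix S (divisor n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be non-zero (points not collinear)

    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^{-1} (dx, dy)' using the closed-form 2x2 inverse.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


pts = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (9, 1)]  # last point is off-trend
d2 = mahalanobis_sq_2d(pts)
```

Unlike a per-variable z-score, the distance accounts for the correlation between the variables, so the last point, which breaks the upward trend of the others, receives the largest value. For more than two variables a linear algebra library would normally be used to invert S.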
  51. Data Preparation. Home equity loan example:
  52. Data Preparation. 1. Download data. 2. Remove cases with missing data.
  53. Data Preparation. 3. Remove outliers. 3a. Separate the data into two groups. 3b. Compute the Mahalanobis distance for each group.
  54. Data Preparation. 3c. Remove outliers in each of the two groups. 3d. Combine the two groups of data and write to file.
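The R code shown on these slides is not preserved in the transcript. The steps themselves (drop incomplete cases, split into two groups, compute the Mahalanobis distance within each group, remove outliers, recombine, and write out) can be sketched in Python under stated assumptions: the records, the field names `group`, `x`, and `y`, the cutoff value, and the in-memory CSV target are all hypothetical and are not the home equity loan data.

```python
import csv
import io
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from its group mean."""
    n = len(points)
    mx = mean([p[0] for p in points])
    my = mean([p[1] for p in points])
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in points
    ]


def prepare(rows, cutoff):
    # Step 2: remove cases with missing data.
    complete = [r for r in rows if None not in r.values()]
    # Step 3a: separate the data into groups.
    groups = {}
    for r in complete:
        groups.setdefault(r["group"], []).append(r)
    # Steps 3b and 3c: distance within each group, then drop outliers.
    kept = []
    for members in groups.values():
        d2 = mahalanobis_sq_2d([(r["x"], r["y"]) for r in members])
        kept.extend(r for r, d in zip(members, d2) if d <= cutoff)
    # Step 3d (first half): the groups are combined in `kept`.
    return kept


group0 = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (9, 1)]  # (9, 1) is off-trend
group1 = [(1, 1), (2, 3), (3, 2), (4, 5), (5, 4), (6, 6)]
rows = [{"group": 0, "x": x, "y": y} for x, y in group0]
rows += [{"group": 1, "x": x, "y": y} for x, y in group1]
rows.append({"group": 0, "x": None, "y": 3})  # incomplete case

kept = prepare(rows, cutoff=4.0)  # cutoff chosen only for this toy data

# Step 3d (second half): write to a file-like object.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["group", "x", "y"])
writer.writeheader()
writer.writerows(kept)
```

Computing the distance separately within each group keeps a case from being judged against the mean and covariance of the other group. In practice the cutoff is often taken from a chi-square quantile rather than fixed by hand, and `buffer` would be replaced by an opened file.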
  55. Data Preparation. Plots of Mahalanobis distance: