Date posted: 22-Dec-2015
Another Look at Data Mining
Why do we mine?
What do we mine?
How do we mine?
CS753 Dr. Mary Ann Robbert

What is Data Mining
Data mining discovers meaningful new correlations, hidden patterns and relationships in your data
Conceptual descendant of statistics
Combines machine learning, statistics, and databases
Knowledge discovery: process of building and implementing a data mining solution

Data Mining Overview
Knowledge Discovery in Databases (KDD)
No one data mining approach
  Each tool viewed logically as a client application
  Can reside on a separate machine or in a separate process and access the data warehouse
  RDBMS or proprietary OLAP engines embed data mining capabilities deeply within the engine to improve efficiency and add extensions
Requires a good foundation in terms of a data warehouse

Data Mining Overview (cont'd)
Common algorithmic approaches
  association, affinity grouping
  predicting, sequence-based analysis
  clustering
  classification
  estimation
Steps are: data selection, data transformation, data mining, result interpretation.
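
The four steps listed above (data selection, data transformation, data mining, result interpretation) can be sketched on a toy affinity-grouping problem. This is only an illustration under stated assumptions: the records in `raw`, the field layout, and the helper `rule_stats` are hypothetical and do not come from the course material.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer, items bought, store region).
raw = [
    ("c1", "Bread, milk", "north"),
    ("c2", "bread, butter", "north"),
    ("c3", "milk, bread, butter", "south"),
    ("c4", "milk", "north"),
]

# 1. Data selection: keep only the field under study.
selected = [items for _, items, _ in raw]

# 2. Data transformation: normalize case and split into item sets.
baskets = [{i.strip().lower() for i in row.split(",")} for row in selected]

# 3. Data mining: count single items and co-occurring pairs.
item_counts = Counter()
pair_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

# 4. Result interpretation: support and confidence for the rule a -> b.
def rule_stats(a, b):
    pair = tuple(sorted((a, b)))
    support = pair_counts[pair] / len(baskets)
    confidence = pair_counts[pair] / item_counts[a]
    return support, confidence

support, confidence = rule_stats("bread", "milk")
print(f"bread -> milk: support={support:.2f}, confidence={confidence:.2f}")
```

Support is the share of all baskets that contain both items; confidence is the share of baskets containing the first item that also contain the second.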

Strategic Benefit of Data Mining
Direct Marketing
Trend Analysis
Fraud detection
Forecasting in Financial Markets

Why Data Mining Now?
Economics
  Unprecedented affordability of MIPS and MB
Parallel computing
  Enormous amounts of data can be processed
Popularity of data warehouses, data marts
  Relatively clean data available

Data Mining compared to Traditional Analysis
Traditional Analysis
  Did sales of product X increase in Nov.?
  Do sales of product X decrease when there is a promotion on product Y?
Data mining is result oriented
  What are the factors that determine sales of product X?

Data Mining compared to Traditional Analysis (cont'd)
Traditional analysis is incremental
  Does billing level affect turnover?
  Does location affect turnover?
  Analyst builds model step by step
Data mining is result oriented
  Identify the factors and predict turnover
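
To make the contrast concrete, the sketch below screens every candidate factor at once instead of testing one hypothesis at a time. It is a deliberately simple stand-in: ranking by Pearson correlation is an assumption chosen for brevity, not the method the slides prescribe, and the data and field names are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical customers: 1 = left (turnover), 0 = stayed.
turnover = [1, 0, 1, 1, 0, 0]
factors = {
    "billing_level": [90, 20, 80, 70, 30, 10],
    "location_code": [1, 2, 1, 2, 1, 2],
}

# Result-oriented view: score all factors against the outcome, then rank.
scores = {name: pearson(values, turnover) for name, values in factors.items()}
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)

for name in ranked:
    print(f"{name}: r = {scores[name]:+.2f}")
```

A real study would use a proper model and hold-out data; the point here is only that the outcome is fixed first and the factors are discovered from the data.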

Steps in Data Mining
Data Manipulation - can be 70-80% of data mining effort
  data cleaning
  missing values
  data derivation
  merging data
Defining a study
  Supervised - articulating goal, choosing dependent variable or output and specifying data fields
  Unsupervised - group similar types of data or identify exceptions
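
The four data manipulation tasks named above can be illustrated in a few lines. This is a minimal sketch on invented records; filling a missing age with the mean of the known ages is just one possible policy, assumed here for illustration.

```python
from statistics import mean

# Hypothetical source data.
customers = [
    {"id": 1, "name": " Ann ", "age": 34},
    {"id": 2, "name": "BOB", "age": None},
    {"id": 3, "name": "cy", "age": 50},
]
orders = {1: [20.0, 35.0], 2: [10.0], 3: []}  # order amounts by customer id

# Missing values: fill unknown ages with the mean of the known ones.
known_ages = [c["age"] for c in customers if c["age"] is not None]
default_age = mean(known_ages)

prepared = []
for c in customers:
    amounts = orders.get(c["id"], [])  # merging data on the shared id
    prepared.append({
        "id": c["id"],
        "name": c["name"].strip().title(),  # data cleaning
        "age": c["age"] if c["age"] is not None else default_age,
        "total_spend": sum(amounts),        # data derivation
        "order_count": len(amounts),        # data derivation
    })

for row in prepared:
    print(row)
```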

Steps in Data Mining (cont'd)
Reading the data and building the model
  model summarizes large amounts of data by accumulating indicators (frequencies, weight, conjunctions, differentiation)
Understanding the model
  Know the particular model
Prediction
  Choose the best outcome based on historical data

Models
Genetic Algorithms
Neural Nets
Agents
Statistics
Visualization

Genetic Algorithms
Artificial intelligence system that mimics the evolutionary, survival-of-the-fittest processes to generate increasingly better solutions to a problem.
Genetic algorithms produce several generations of solutions, choosing the best of the current set for each new generation.
Examples
  Generating human faces based on a few known features.
  Generating solutions to routing problems.
  Generating stock portfolios.

EVOLUTION IN GENETIC ALGORITHMS
SELECTION - or survival of the fittest. The key is to give preference to better outcomes.
CROSSOVER - combining portions of good outcomes in the hope of creating an even better outcome.
MUTATION - randomly trying combinations and evaluating the success (or failure) of the outcome.
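
Selection, crossover and mutation can be seen working together in a small sketch. The problem solved here (maximize the number of 1 bits in a string) is a toy chosen for illustration, and every parameter value is an assumption. Carrying the best individual over unchanged (elitism) is an extra not mentioned in the slides; it is added so the best score never gets worse.

```python
import random

def fitness(bits):
    """Toy objective: the more 1 bits, the better the solution."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        # SELECTION: only the better half may become parents.
        parents = pop[: pop_size // 2]
        children = [pop[0][:]]  # elitism (assumption): keep the best as is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # CROSSOVER: head of one parent joined to the tail of the other.
            cut = rng.randint(1, n_bits - 1)
            child = a[:cut] + b[cut:]
            # MUTATION: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness), history

best, history = evolve()
print("best fitness per generation:", history)
print("final best:", fitness(best), "of", len(best))
```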

Neural Nets
Mathematical Model of the Way a Brain Functions
Machine learning approach by which historical data can be examined for pattern recognition
A neural network simulates the human ability to classify things based on the experience of seeing many examples.
Pros - Numerical Data
Cons - Opaque, Art or Science

Example
Distinguishing different chemical compounds
Detecting anomalies in human tissue that may signify disease
Reading handwriting
Detecting fraud in credit card use
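
The idea of learning to classify from repeated examples can be shown with the smallest possible network: a single perceptron. This is a sketch only; real applications such as those listed above use many units and layers, and the training data here (the logical AND of two inputs) is an assumption chosen because it is easy to check by hand.

```python
# Training examples: two numeric inputs and the class to learn (logical AND).
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0, 0]
bias = 0

def predict(x):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

# Show the examples repeatedly; adjust the weights after each mistake.
for epoch in range(20):
    mistakes = 0
    for x, target in examples:
        error = target - predict(x)
        if error != 0:
            mistakes += 1
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    if mistakes == 0:
        break  # every example is now classified correctly

print("weights:", weights, "bias:", bias)
print("predictions:", [predict(x) for x, _ in examples])
```

The learned weights are just numbers with no obvious meaning, which is a small-scale view of the "opaque" drawback noted on the slide.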

Intelligent Agents
Software entities that carry out some set of operations on behalf of a user or program with some degree of autonomy and employ some knowledge or representation of the user's goals and desires.
Some common characteristics
  ability to communicate, cooperate and coordinate with other agents
  ability to act autonomously to achieve collective goal of system

Intelligent Agents (cont'd)
Tasks
  automate repetitive tasks
  finding and filtering information
  summarizing complex data
Capability to learn and make recommendations
Black box approach hides complexity and allows for design of scalable system

Comparison
AI System           Problem Type                                Based On                   Starting Information
Expert Systems      Diagnostic or prescriptive                  Strategies of experts      Expert's know-how
Neural Networks     Identification, classification, prediction  The human brain            Acceptable patterns
Genetic Algorithms  Optimal solution                            Biological evolution       Set of possible solutions
Intelligent Agents  Specific and repetitive tasks               One or more AI techniques  Your preferences

Statistics
SAS, SPSS
Pros - Established technology
Cons - Needs assumptions, nominal variable handling, management acceptance?

Visualization
Data visualization refers to technologies that support visualization of information
Includes – digital images, GIS, multi-dimensions, 3-D presentations, animations
http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html

Data Mining is Not a Silver Bullet
It does not:
  Find answers to questions you don’t ask
  Eliminate the need for domain experience
  Remove the need for data analysis skills

Data Mining Software
http://www.kdnuggets.com/software/
http://www.attar.com/ download
http://www.cs.bham.ac.uk/~anp/software.html software listing

Six Rules of Data Quality
by Ken Orr
1. Data that is not used cannot be correct for very long
2. Data quality in an information system is a function of its use, not its collection
3. Data quality will ultimately be no better than its most stringent use
4. Data quality problems tend to become worse with the age of the system
5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change.
6. Information overload affects data quality

Data Quality Software
http://www.rulequest.com/gritbot-info.html

General DW Data transformation
Resolve inconsistent legacy formats
Strip out unwanted fields
Interpret codes into text
Combine data from multiple sources under a common key
Find fields used for multiple purposes and interpret each field's value based on context
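
Several of the warehouse transformations listed above fit in one short sketch. Everything specific in it is hypothetical: the two source extracts, the field names, the `STATUS_TEXT` code table and the two date formats are invented for illustration.

```python
from datetime import datetime

# Hypothetical code table and legacy extracts.
STATUS_TEXT = {"A": "active", "C": "closed"}

billing = [
    {"cust": "0007", "opened": "1999-03-01", "status": "A", "filler": "XX"},
    {"cust": "0012", "opened": "2001-11-15", "status": "C", "filler": "XX"},
]
support = [
    {"customer_id": 7, "opened": "03/01/1999", "tickets": 4},
]

def parse_date(text):
    """Resolve inconsistent legacy date formats into one date type."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

def transform(billing_rows, support_rows):
    # Combine both sources under a common integer customer key.
    tickets = {row["customer_id"]: row["tickets"] for row in support_rows}
    warehouse = []
    for row in billing_rows:
        key = int(row["cust"])
        warehouse.append({
            "customer_id": key,
            "opened": parse_date(row["opened"]),
            "status": STATUS_TEXT.get(row["status"], "unknown"),  # code to text
            "tickets": tickets.get(key, 0),
            # "filler" is deliberately not copied: it is an unwanted field.
        })
    return warehouse

rows = transform(billing, support)
for row in rows:
    print(row)
```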

Data transformation for Data Mining
Flag normal, abnormal, out of bounds or impossible facts
Recognize random or noise values from context and mask out
Apply uniform treatment to NULL values
Flag fact records with changed status
Classify individual record by one of its aggregates
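
Two of the items above, flagging facts and treating NULL values uniformly, might look like the following. The list of NULL spellings and the age thresholds are assumptions picked for the example; a real study would take them from the domain.

```python
# Hypothetical spellings that all mean "no value" in the source systems.
NULL_TOKENS = {"", "null", "n/a", "none", "-999"}

def normalize_null(value):
    """Map every known 'missing' spelling to None; otherwise return clean text."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in NULL_TOKENS else text

def flag_age(raw):
    """Return (age, flag) where flag is missing, impossible, abnormal or normal."""
    text = normalize_null(raw)
    if text is None:
        return None, "missing"
    try:
        age = int(text)
    except ValueError:
        return None, "impossible"   # not a number at all
    if age < 0 or age > 130:        # assumed hard bounds
        return age, "impossible"
    if age > 100:                   # assumed "unusual but possible" range
        return age, "abnormal"
    return age, "normal"

readings = ["34", " N/A ", "-999", "105", "212", "abc", None]
flags = [flag_age(r)[1] for r in readings]
print(list(zip(readings, flags)))
```

Keeping the flag next to the value, rather than silently dropping bad records, lets the mining tool decide later whether to mask or use them.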

Conclusion
For successful data mining:
  data analysis and mining goals must be identified and formulated
  appropriate data must be selected, cleaned and prepared for queries and business analysis
http://www.rulequest.com/cubist-examples.html#BOSTON
http://www.almaden.ibm.com/cs/quest/