8/8/2019 02 - Introduction to Data Mining
Introduction to Data Mining
Business Intelligence, Ingegneria Gestionale, UNIGE
Davide Anguita, SmartLab
DIBE, Facoltà di Ingegneria
Università degli Studi di Genova
Slides from: J. Han & M. Kamber, Data Mining: Concepts and Techniques
Introduction
  Motivation: Why data mining?
  What is data mining?
  Data mining: On what kind of data?
  Data mining functionality
  Are all the patterns interesting?
  Major issues in data mining
Motivation: Necessity is the Mother of Invention
Data explosion problem
  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
  Data warehousing and on-line analytical processing
  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Evolution of Database Technology
1960s:
  Data collection, database creation, IMS and network DBMS
1970s:
  Relational data model, relational DBMS implementation
1980s:
  RDBMS, advanced data models (extended-relational, OO, [...], scientific, engineering, etc.)
1990s-2000s:
  Data mining and data warehousing, multimedia databases, and Web databases
What Is Data Mining?
Data mining (knowledge discovery in databases):
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their inside stories:
  Data mining: a misnomer?
  Knowledge discovery (mining) in databases (KDD), knowledge [...], information harvesting, business intelligence, etc.
What is not data mining?
  (Deductive) query processing
  Expert systems or small ML/statistical programs
Why Data Mining? Potential Applications
Database analysis and decision support
  Market analysis and management
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management
Other Applications
  Text mining (news group, email, documents) and Web analysis
  Intelligent query answering
Market Analysis and Management
Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
  Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
  Associations/co-relations between product sales
  Prediction based on the association information
Market Analysis and Management
Customer profiling
  data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
  identifying the best products for different customers
  use prediction to find what factors will attract new customers
Provides summary information
  various multidimensional summary reports
  statistical summary information (data central tendency and variation)
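The "statistical summary information" item above can be illustrated with a small sketch. The purchase amounts are made-up values and the function name `summarize` is hypothetical; only Python's standard `statistics` module is assumed.

```python
import statistics

def summarize(values):
    """Return central tendency and variation measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),      # central tendency
        "median": statistics.median(values),  # central tendency, robust to extremes
        "stdev": statistics.stdev(values),    # variation (sample standard deviation)
        "range": max(values) - min(values),   # variation (spread between extremes)
    }

# Hypothetical purchase amounts for one customer segment.
purchases = [20, 35, 35, 50, 110]
summary = summarize(purchases)
print(summary)
```

Here the mean (50) sits well above the median (35) because of the single large purchase, which is the kind of contrast a summary report is meant to expose.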
Corporate Analysis and Risk Management
Finance planning
  cash flow analysis and prediction
  cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
Resource planning:
  summarize and compare the resources and spending
monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market
Fraud Detection and Management
Applications
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
  auto insurance: detect a group of people who stage accidents to collect on insurance
  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  medical insurance: detect professional patients and ring of doctors
Fraud Detection and Management
Detecting inappropriate medical treatment
  Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
Detecting telephone fraud
  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  [...] intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
  Analysts estimate that 38% of retail shrink is due to dishonest employees.
Other Applications
Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
Data Mining: A KDD Process
Data mining: the core of knowledge discovery process.
[Figure: the KDD process. Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
  relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
  Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining
  summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
  visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
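The steps above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, field layout, and every function name (`select`, `clean`, `transform`, `mine`, `evaluate`) are hypothetical, and the "mining" step is plain counting (summarization) rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, product, store); None = missing value.
raw = [
    (1, 25, "PC", "Genova"),
    (2, None, "PC", "Genova"),
    (3, 31, "printer", "Milano"),
    (4, 27, "PC", "Milano"),
    (4, 27, "PC", "Milano"),   # exact duplicate record
    (5, 45, "software", "Genova"),
]

def select(records):
    """Create the target data set: drop the field not needed for this task."""
    return [(cid, age, product) for cid, age, product, _store in records]

def clean(records):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out

def transform(records):
    """Data reduction: replace the exact age with a coarser age band."""
    return [(f"{age // 10 * 10}s", product) for _cid, age, product in records]

def mine(records):
    """Data mining step (here: simple summarization by counting)."""
    return Counter(records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = evaluate(mine(transform(clean(select(raw)))))
print(patterns)
```

Most of the code is selection, cleaning, and transformation rather than mining, which mirrors the slide's remark about where the effort goes.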
Data Mining and Business Intelligence
[Figure: layered pyramid; potential to support business decisions increases toward the top]
Layers, top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Architecture of a Typical Data Mining System
[Figure: architecture diagram with the following components]
  Graphical user interface
  Data mining engine
  Pattern evaluation
  Knowledge base
  Database or data warehouse server
  Data cleaning & data integration, Filtering
  Databases, Data Warehouse
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
  Object-oriented and object-relational databases
  Spatial databases
  Time-series data and temporal data
  Text databases and multimedia databases
  Heterogeneous and legacy databases
  WWW
Data Mining Functionalities (1)
Concept description: Characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
  Multi-dimensional vs. single-dimensional association
  age(X, 20..29) ∧ income(X, 20..29K) → buys(X, PC) [support = 2%, confidence = 60%]
  contains(T, computer) → contains(T, software) [1%, 75%]
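The bracketed [support, confidence] figures attached to the rules above can be computed as follows. This is a minimal sketch using the usual definitions (support = fraction of transactions containing all items of the rule, confidence = support of the whole rule divided by support of its left-hand side); the transactions and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket transactions.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

For these five transactions the rule contains(T, computer) → contains(T, software) has support 3/5 and confidence 3/4: three baskets hold both items, and four hold a computer.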
Data Mining Functionalities (2)
Classification and Prediction
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Presentation: decision tree, classification rule, neural network
  Prediction: Predict some unknown or missing numerical values
Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
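As an illustration of cluster analysis, here is a tiny one-dimensional k-means sketch. K-means is one common clustering technique and is not named on the slide; the house prices, the initial centers, and the function name `kmeans_1d` are assumptions made for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to the nearest center, then
    move each center to the mean of the values assigned to it."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); no class labels are given.
prices = [100, 110, 120, 300, 310, 330]
centers, clusters = kmeans_1d(prices, centers=[100, 330])
print(centers, clusters)
```

Values inside each resulting group are close to one another (high intra-class similarity) while the two groups are far apart (low interclass similarity), which is the principle stated above.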
Data Mining Functionalities (3)
Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
Trend and evolution analysis
  Trend and deviation: regression analysis
  Sequential pattern mining, periodicity analysis
  Similarity-based analysis
Other pattern-directed or statistical analyses
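One simple way to flag objects that do not comply with the general behavior of the data is a z-score test, sketched below. The slides do not prescribe a method; the threshold of 2 standard deviations, the call durations, and the function name `z_score_outliers` are assumptions chosen for illustration.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical call durations in minutes; one call is unusually long.
call_minutes = [3, 4, 5, 4, 3, 5, 4, 4, 60]
suspicious = z_score_outliers(call_minutes)
print(suspicious)
```

In a fraud-detection setting such flagged values would be candidates for closer inspection rather than noise to discard.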
Are All the Discovered Patterns Interesting?
A data mining system/query may generate thousands of patterns; not all of them are interesting.
  Suggested approach: Human-centered, query-based, focused mining
Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
  Can a data mining system find all the interesting patterns?
  Association vs. classification vs. clustering
Search for only interesting patterns: Optimization
  Can a data mining system find only the interesting patterns?
  First generate all the patterns and then filter out the uninteresting ones.
  Generate only the interesting patterns: mining query optimization
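The two strategies above can be contrasted on frequent itemsets, taking "interesting" to mean "support at least min_support". This is a hedged sketch, not the method of the slides: the second function uses Apriori-style level-wise pruning, a well-known technique the slide does not name, and the transactions and function names are hypothetical.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def generate_then_filter(transactions, min_support):
    """Enumerate every possible itemset, then keep the frequent ones."""
    items = sorted(set().union(*transactions))
    kept, examined = set(), 0
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            examined += 1
            if support(transactions, frozenset(combo)) >= min_support:
                kept.add(frozenset(combo))
    return kept, examined

def level_wise(transactions, min_support):
    """Grow itemsets one item at a time, extending only frequent ones."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    kept, examined = set(), 0
    while level:
        frequent = []
        for cand in level:
            examined += 1
            if support(transactions, cand) >= min_support:
                frequent.append(cand)
        kept.update(frequent)
        # Next level: unions of two frequent sets that are exactly one item larger.
        level = list({a | b for a in frequent for b in frequent
                      if len(a | b) == len(a) + 1})
    return kept, examined

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"d"}]
all_kept, all_examined = generate_then_filter(transactions, 0.5)
pruned_kept, pruned_examined = level_wise(transactions, 0.5)
print(all_examined, pruned_examined)
```

Both functions return the same frequent itemsets, but the level-wise version examines far fewer candidates because it never extends an itemset that is already infrequent.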
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning, Visualization, Information Science, Other Disciplines]
Data Mining: Classification Schemes
General functionality
  Descriptive data mining
  Predictive data mining
Different views, different classifications
  Kinds of databases to be mined
  Kinds of knowledge to be discovered
  Kinds of techniques utilized
  Kinds of applications adapted
Visual Data Mining & Data Visualization
Integration of visualization and data mining
  data visualization
  data mining result visualization
  data mining process visualization
  interactive visual data mining
Data visualization
  Data in a database or data warehouse can be viewed
    at different levels of granularity or abstraction
    as different combinations of attributes or dimensions
  Data can be presented in various visual forms
Boxplots from Statsoft: multiple variable combinations
Data Mining Result Visualization
Presentation of the results or knowledge obtained from data mining in visual forms
Examples
  Scatter plots and boxplots (obtained from descriptive data mining)
  Decision trees
  Association rules
  Clusters
  Outliers
  Generalized rules
Visualization of data mining results in SAS Enterprise Miner: scatter plots
Visualization of association rules in MineSet 3.0
Visualization of a decision tree in MineSet 3.0
Visualization of cluster groupings in IBM Intelligent Miner
Data Mining Process Visualization
Presentation of the various processes of data mining in visual forms so that users can see
  How the data are extracted
  From which database or data warehouse they are extracted
  How the selected data are cleaned, integrated, preprocessed, and mined
  Which method is selected at data mining
  Where the results are stored
  How they may be viewed
Visualization of Data Mining Processes by Clementine
Interactive Visual Data Mining
Using visualization tools in the data mining process to help users make smart data mining decisions
Example
  Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by either a circle or a set of columns)
  Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be
Interactive Visual Mining by Perception Based Classification (PBC)
Audio Data Mining
Uses audio signals to indicate the patterns of data or the features of data mining results
An interesting alternative to visual mining
An inverse task of mining audio (such as music) databases, which is to find patterns from audio data
Visual data mining may disclose interesting patterns using [...], watching patterns
Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
Is Data Mining a Hype or Will It Be Persistent?
Data mining is a technology
Technological life cycle
  Innovators
  Early adopters
  Chasm
  Early majority
  Late majority
  Laggards
Life Cycle of Technology Adoption
Data mining is at Chasm!?
  Existing data mining systems are too generic
  Need business-specific data mining solutions and smooth integration of business logic with data mining functions
Data Mining: Merely Managers' Business or Everyone's?
Data mining will surely be an important tool for managers' decision making
  Bill Gates: "Business @ the Speed of Thought"
The amount of the available data is increasing, and data mining systems will be more affordable
Multiple personal uses
  Mine your family's medical history to identify genetically related medical conditions
  Mine the records of the companies you deal with
  Mine data on stocks and company performance, etc.
Invisible data mining
  Build data mining functions into many intelligent tools
Social Impacts: Threat to Privacy and Data Security?
Is data mining a threat to privacy and data security?
  "Big Brother", "Big Banker", and "Big Business" are carefully watching you
Profiling information is collected every time
  You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
  You surf the Web, reply to an Internet newsgroup, subscribe to a [...]
  You pay for prescription drugs, or present your medical care number when visiting the doctor
Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse
Protect Privacy and Data Security
Fair information practices
  International guidelines for data privacy protection
  Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
  Purpose specification and use limitation
  Openness: Individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used
Develop and use data security-enhancing techniques
  Blind signatures
  Biometric encryption
  Anonymous databases
Trends in Data Mining (1)
Application exploration
  development of application-specific data mining system
  Invisible data mining (mining as built-in function)
Scalable data mining methods
  Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
Integration of data mining with database systems, data warehouse systems, and Web database systems
Trends in Data Mining (2)
Standardization of data mining language
  A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society
Visual data mining
New methods for mining complex types of data
  [...] methods with existing data analysis techniques for the complex types of data
Web mining
Privacy protection and information security in data mining