Post on 26-Dec-2015
Chapter 1
Initial Description of Data Mining in Business
Prepared by: Dr. Tsung-Nan Tsai
1-2
Contents
Introduces data mining concepts
Presents typical business data applications
Explains the meaning of key concepts
Gives a brief overview of data mining tools
Outlines the remaining chapters of the book
1-3
Definition
DATA MINING: exploration & analysis
• by automatic means
• of large quantities of data
• to discover actionable patterns & rules
Refers to the analysis of the large quantities of data that are stored in computers.
Data mining is a way to use massive quantities of data that businesses generate
GOAL - improve marketing, sales, customer support through better understanding of customers
1-4
Retail Outlets
Bar coding & scanning generate masses of data
• customer service (grocery stores can quickly process the purchases and accurately determine product prices)
• inventory control (determine the quantity of each product on hand; supply chain management)
MICROMARKETING
CUSTOMER PROFITABILITY ANALYSIS
MARKET-BASKET ANALYSIS
1-5
Political Data Mining
Grossman et al., 10/18/2004, Time, 38
2004 Election
Republicans: VoterVault
• From mid-1990s
• About 165 million voters
• Massive get-out-the-vote drive for those expected to vote Republican
Democrats: Demzilla
• Also about 165 million voters
• Names typically have 200 to 400 information items
1-6
Medical Diagnosis
J. Morris, Health Management Technology Nov 2004, 20, 22-24
Electronic Medical Records
Associated Cardiovascular Consultants
• 31 physicians
• 40,000 patients per year, southern New Jersey
Data mined to identify efficient medical practice
• Enhance patient outcomes
• Reduced medical liability insurance
1-7
Mayo Clinic
Swartz, Information Management Journal Nov/Dec 2004, 8
IBM developed EMR program
• Complete records on almost 4.4 million patients.
• Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to particular treatments.
1-8
Business Uses of Data Mining
Toyota used data mining of its data warehouse to determine more efficient transportation routes, reducing time-to-market by an average of 19 days.
Banks used data mining in soliciting credit card customers.
Insurance and telecommunications companies used DM to detect fraud.
Manufacturing firms used DM in quality control.
Many …..
1-9
Business Uses of Data Mining
1. Customer profiling
• Identify profitability of customer subsets
2. Targeting
• Determine characteristics of most profitable customers
3. Market-Basket Analysis
• Determine correlation of purchases by profile (customers)
• Cross-selling
• Part of Customer Relationship Management
1-10
What is needed to do DM?
DM requires the identification of a problem, along with data collection that can lead to a better understanding of the market.
Computer models provide statistical or other means of analysis.
Two general types of DM studies:
1. Hypothesis testing: expressing a theory about the relationship between actions and outcomes.
2. Knowledge discovery: a preconceived notion may not be present; rather, relationships can be identified by looking at the data (correlation analysis).
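The two study types can be illustrated with a small correlation sketch. The variable names and numbers below are made up for illustration and do not come from the slides. Hypothesis testing starts from a theory stated in advance (here, that promotion moves with sales) and checks that one relationship; knowledge discovery scans every pair of variables and ranks whatever relationships it finds.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical weekly observations (illustrative values only).
data = {
    "promotion": [1, 2, 3, 4, 5],
    "sales": [2, 4, 6, 8, 10],
    "rainfall": [2, 9, 4, 7, 3],
}

# Hypothesis testing: one theory stated in advance, one relationship checked.
r_hypothesis = pearson(data["promotion"], data["sales"])

# Knowledge discovery: no theory in advance; rank all pairs by strength.
ranked = sorted(
    ((pair, pearson(data[pair[0]], data[pair[1]])) for pair in combinations(data, 2)),
    key=lambda item: abs(item[1]),
    reverse=True,
)
strongest_pair, strongest_r = ranked[0]
print(strongest_pair, round(strongest_r, 3))
```

A strong correlation found this way is only a candidate relationship; it still has to be turned into a hypothesis and tested.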
1-11
Reasons why Data Mining is now effective
Data are there
Data are warehoused (computerized)
• Walmart: 35 thousand queries per week
Computing economically available
Competitive pressure
Commercial products available
1-12
Trends
Every business is a service
• hotel chains record your preferences
• car rental companies the same
Service versus price
• credit card companies
• long distance providers
• airlines
• computer retailers
1-13
Trends
Information as Product
• Custom Clothing Technology Corporation - fit jeans, other clothing
INFORMATION BROKERING
• IMS - collects prescription data from pharmacies, sells to drug firms
• AC Nielsen - TV
1-14
Trends
Commercial Software Available
using statistical, artificial intelligence tools that have been developed
• Enterprise Miner - SAS
• Intelligent Miner - IBM
• Clementine - SPSS
• PolyAnalyst - Megaputer
• Specialty products
1-15
Fingerhut’s DM models
Fingerhut used segmentation, decision tree, regression analysis, and neural modeling tools, with SAS for the regression analysis tools and SPSS for the neural network tools.
The segmentation model combines order and basic demographic data with Fingerhut’s product offerings.
Neural network models were used to identify patterns in mailings and in the filling of telephone orders.
Goal: Create new mailings targeted at customers with the greatest potential payoff.
Create a catalog containing products that those customers are interested in, such as furniture, telephones…
1-16
How Data Mining Is Being Used
U.S. Government - track down Oklahoma City bombers, Unabomber, many others
Treasury department - international funds transfers, money laundering
Internal Revenue Service
1-17
How Data Mining Is Used
Firefly
• asks members to rate music and movies
• subscribers clustered
• clusters get custom-designed recommendations
1-18
Warranty Claims Routing
Diesel engine manufacturer
• stream of warranty claims
• examine each by expert
• determine whether charges are reasonable & appropriate
• think of expert system to automate claims processing
1-19
Data mining application areas
Application Area            Applications                        Specifics
Retailing                   Affinity positioning                Position products effectively
                            Cross-selling                       Find more products for customers
Banking                     Customer relationship management    Identify customer value; develop programs to maximize revenue
Credit card management      Lift                                Identify effective market segments
                            Churn                               Identify likely customer turnover
                            Fraud detection
Insurance                   Fraud detection                     Identify claims meriting investigation
Telecommunications          Churn                               Identify likely customer turnover
Telemarketing               Online information                  Aid telemarketers with easy data access
Human resource management   Churn                               Identify potential employee turnover
1-20
Retailing
Affinity positioning is based upon the identification of products that the same customer is likely to want.
• Cold medicine, tissues
Cross-selling: The knowledge of products that go together can be used to market the complementary product.
• Grocery stores do that through product shelf location.
Grocery stores generate mountains of cash register data. Current technology enables grocers to look at customers who have defected from a store, their purchase history, and characteristics of other potential defectors.
1-21
Cross-selling
USAA insurance
• doubled number of products held by average customer due to data mining
• detailed records on customers
• predict products they might need
Fidelity Investments
• regression - what makes customer loyal
1-22
Banking
CRM involves the application of technology to monitor customer service, a function that is enhanced through data mining support.
DM applications in finance include predicting the prices of equities involving a dynamic environment with surprise information, some of which might be inaccurate …
Only 3% of the customers at Norwest bank provided 44% of their profits.
CRM products enable banks to define and identify customer and household relationships.
1-23
Retaining Good Customers
Customer loss:
• Banks - Attrition
• Cellular Phone Companies - Churn
Study who might leave, why
Southern California Gas
– customer usage, credit information
– direct mail contact - most likely best billing plan
– who is price sensitive
Who should get incentives, whom to keep
1-24
Credit card management
Bank credit card marketing promotions typically generate 1,000 responses to mailed solicitations, a response rate of about 1%. The rate is improved significantly through data mining analysis.
DM tools used by banks include credit scoring, which is a quantified analysis of credit applicants with respect to predictions of on-time loan repayment. (Data covering deposits, savings, loans, credit card, insurance…).
These credit scores can be used to make accept/reject recommendations, as well as to establish the size of a credit line.
ATMs could be equipped with electronic sales pitches for products that a particular customer is likely to be interested in.
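The response-rate figure on this slide can be worked through as arithmetic. A rate of about 1% from 1,000 responses implies roughly 100,000 solicitations mailed; that mailing size is inferred here, not stated on the slide. The targeted-campaign numbers are hypothetical, chosen only to show how lift (targeted rate divided by baseline rate) would be computed.

```python
def response_rate(responses, mailed):
    """Fraction of mailed solicitations that drew a response."""
    return responses / mailed


# Baseline from the slide: 1,000 responses at about 1% implies ~100,000 mailed (inferred).
baseline = response_rate(1_000, 100_000)

# Hypothetical targeted campaign after data mining (illustrative numbers only).
targeted = response_rate(600, 20_000)

# Lift: how many times better the targeted mailing performs than the baseline.
lift = targeted / baseline
print(f"baseline={baseline:.1%} targeted={targeted:.1%} lift={lift:.1f}")
```

In this made-up case the targeted mailing reaches fewer people but responds at three times the baseline rate, which is the kind of improvement the slide attributes to data mining analysis.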
1-25
Fairbank & Morris
Credit card company’s most valuable asset: INFORMATION ABOUT CUSTOMERS
Signet Banking Corporation
• obtained behavioral data from many sources
• built predictive models
• aggressively marketed balance transfer card
First Union
• who will move soon - improve retention
1-26
Telecommunications
Retention of customers for telemarketing is very difficult. The phenomenon of a customer switching carriers is referred to as churn, a fundamental concept in telemarketing as well as in other fields.
A communications company considered that one-third of churn is due to poor call quality, and up to one-half is due to poor equipment.
A cellular fraud prevention system monitors traffic to spot problems with faulty telephones. When a telephone begins to go bad, telemarketing personnel are alerted to contact the customer and suggest bringing the equipment in for service.
Another way to reduce churn is to protect customers from subscription and cloning (duplication) fraud. Fraud prevention systems provide verification that is transparent to legitimate subscribers.
1-27
Human resource management
Business intelligence is a way to truly understand markets, competitors, and processes.
Software technology such as data warehouses, data marts, online analytical processing (OLAP), and data mining can be used to improve a firm’s profitability.
In HRM, the analysis can lead to the identification of individuals who are liable to leave the company unless additional compensation or benefits are provided.
HRM would identify the right people so that organizations could treat them well and retain them (reduce churn).
1-28
Methodology and Tools
Analyzing data
• Given management goals, and given that management can translate knowledge into action
1-29
Basic Styles
Top-Down: HYPOTHESIS TESTING
• SUPERVISED
• have a theory, experiment to prove or disprove
• SCIENCE
Bottom-Up: KNOWLEDGE DISCOVERY
• UNSUPERVISED
• start with data, see new patterns
• CREATIVITY
1-30
Hypothesis Testing
Generate theory
Determine data needed
Get data
Prepare data
Build computer model
Evaluate model results
• confirm or reject hypotheses
1-31
Generate Theory
Systematically tie different input sources together (MENTAL MODEL)
What causes sales volume?
• sales rep performance
• economy, seasonality
• product quality, price, promotion, location
1-32
Generate Theory
Brainstorm:
• diverse representatives for broad coverage of perspectives (electronic)
• keep under control (keep positive)
• generate testable hypotheses
1-33
Define Data Needed
Determine data needed to test hypothesis
• Lucky - query existing database
• More often - gather: pull together from diverse databases, survey, buy
1-34
Locate Data
Usually scattered or unavailable
Sources:
• warranty claims
• point-of-sale data (cash register records)
• medical insurance claims
• telephone call detail records
• direct mail response records
• demographic data, economic data
PROFILE: counts, summary statistics, cross-tabs, cleanup
1-35
Prepare Data for Analysis
Summarize:
• too much - no discriminant information
• too little - swamped with useless detail
Process for computer: ASCII, spreadsheet
Data encoding: how data are recorded can vary - may have been collected with specific purpose
Textual data: avoid if possible (may need to code)
Missing values: missing salary - use mean?
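The "use mean?" question for missing values can be made concrete with a minimal sketch of mean imputation. The salary figures are hypothetical, and mean imputation is only one possible answer to the question the slide raises; it keeps the column mean unchanged but understates the spread of the data.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute: no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical salary column with one missing entry (illustrative only).
salaries = [40_000, None, 60_000, 50_000]
filled = impute_mean(salaries)
print(filled)
```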
1-36
Build and Evaluate Model
Build Computer Model
• Choose the appropriate modeling tools and algorithms
• Training and test data sets
Determine if hypotheses supported
• statistical practice
• test rule-based systems for accuracy
Requires both business and analytic knowledge
1-37
SUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
Health care fraud
• Use statistics to identify indicators of fraud or abuse
• Can rapidly sort through large databases
• Identify patterns different from norm
• Moderately successful
But only effective on schemes already detected
To benefit firm, need to identify fraud before paying claim
1-38
Knowledge Discovery
Machine learning?
• Usually need intelligent analyst
Directed: explain value of some variable
Undirected: no dependent variable selected
• identify patterns
Use undirected to recognize relationships; use directed to explain once found
1-39
Directed
Goal-oriented
Examples:
• If discount applies, impact on products
• Who is likely to purchase credit insurance?
• Predicted profitability of new customer
• What to bundle with a particular package
Identify sources of preclassified data
Prepare data for analysis
Build & train computer model
Evaluate
1-40
Identify Data Sources
Best - existing corporate data warehouse
• data clean, verified, consistent, aggregated
Usually need to generate
• most data in form most efficient for designed purpose
• historical sales data often purged for dormant customers (but you need that information)
1-41
Prepare Data
Put in needed format for computer
Make consistent in meaning
Need to recognize what data are missing
• change in balance = new – old
• add missing but known-to-be-important data
Divide data into training, test, evaluation
Decide how to treat outliers
• statistically biasing, but may be most important
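Dividing data into training, test, and evaluation sets can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, and the function name are assumptions made for this example; the slides do not specify them.

```python
import random


def split_data(records, train=0.6, test=0.2, seed=0):
    """Shuffle records and split them into training, test, and evaluation sets.

    The default 60/20/20 proportions are an assumption, not from the slides.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


# Hypothetical record ids 0..99 stand in for real customer records.
training, test_set, evaluation = split_data(range(100))
print(len(training), len(test_set), len(evaluation))
```

Shuffling before splitting avoids a split that follows the order in which records were stored, and keeping the three sets disjoint means the evaluation set stays unseen until the final error rate is measured.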
1-42
Build & Train Model
Regression - human builds (selects independent variables, IVs)
Automatic systems train
• give it data, let it hammer
OVERFITTING: fit the data
TEST SET: a means to evaluate model against data not used in training
• tune weights before using to evaluate
1-43
Evaluate Model
ERROR RATE: proportion of classifications in evaluation set that were wrong
too little training: poor fit on training data and poor error rate
optimal training: good fit on both
too much training: great fit on training data and poor error rate
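The error rate defined on this slide is a simple proportion, sketched below. The labels are hypothetical; in practice `predicted` would come from a trained model and `actual` from the evaluation set.

```python
def error_rate(predicted, actual):
    """Proportion of classifications that were wrong."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Hypothetical labels for an evaluation set of eight cases (illustrative only).
actual = ["buy", "buy", "no", "no", "buy", "no", "no", "buy"]
predicted = ["buy", "no", "no", "no", "buy", "buy", "no", "buy"]
rate = error_rate(predicted, actual)  # two of the eight predictions differ
print(rate)
```

Computing the same proportion on the training data and on the evaluation set gives the comparison the slide describes: a low training error paired with a high evaluation error points to too much training.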
1-44
Undirected Discovery
What items sell together? Strawberries & cream
Directed: What items sell with tofu? tabasco
Long distance caller market segmentation
• Uniform usage - weekday & weekend, spikes on holidays
• After segmentation: high & uniform except for several months of nothing
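The "what items sell together" question can be sketched as pair counting over baskets. The baskets below are invented for illustration; the undirected pass counts every pair, while the directed helper `sells_with` (a hypothetical name) looks only at partners of one chosen item.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets (illustrative only).
baskets = [
    {"strawberries", "cream", "bread"},
    {"strawberries", "cream"},
    {"tofu", "tabasco"},
    {"bread", "milk"},
    {"strawberries", "cream", "milk"},
]

# Undirected: count every pair of items bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
support = top_count / len(baskets)  # share of baskets containing both items


def sells_with(item):
    """Directed: count the items that share a basket with the given item."""
    partners = Counter()
    for basket in baskets:
        if item in basket:
            partners.update(basket - {item})
    return partners


print(top_pair, support, dict(sells_with("tofu")))
```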
1-45
UNSUPERVISED
Dorn, National Underwriter Oct 18, 2004, 34,39
Health care fraud
• Look at historical claim submissions
• Build ad hoc model to compare with current claims
Assign similarity score to fraudulent claims
Predict fraud potential
1-46
Undirected Process
Identify data sources
Prepare data
Build & train computer model
Evaluate model
Apply model to new data
Identify potential targets for undirected
Generate new hypotheses to test
1-47
Generate hypotheses
Any commonalities in data?
Are they useful?
• Many adults watch children’s movies
• chaperones are an important market segment
• they probably make final decision
When hypothesis is generated, that determines data needed
1-48
Bank Case Study
Directed knowledge discovery to recognize likely prospects for home equity loan
• training set - current loan holders
• developed model for propensity to borrow
• got continuous scores, ranked customers
• sent top 11% material
Undirected: segmented market into clusters
• in one, 39% had both business & personal accounts
• cluster had 27% of the top 11%
Hypothesis: people use home equity to start business
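The ranking step in this case study can be sketched as follows. The scores and cluster labels are invented for illustration, so the resulting share will not match the 27% on the slide; only the 11% cutoff comes from the source.

```python
def top_fraction(scores, fraction):
    """Return customer ids ranked by score, keeping the top fraction."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = round(len(ranked) * fraction)
    return ranked[:keep]


# Hypothetical propensity-to-borrow scores for 100 customers (illustrative only).
scores = {f"c{i}": i / 100 for i in range(100)}
# Hypothetical cluster labels from a separate, undirected segmentation.
cluster = {f"c{i}": "business+personal" if i % 3 == 0 else "other" for i in range(100)}

mailing_list = top_fraction(scores, 0.11)  # the slide's "top 11%"

# Share of the mailing list that falls in the cluster of interest.
in_cluster = sum(1 for cid in mailing_list if cluster[cid] == "business+personal")
share = in_cluster / len(mailing_list)
print(len(mailing_list), in_cluster, round(share, 2))
```

Comparing that share with the cluster's share of all customers is what suggests a hypothesis such as the one on the slide, which then needs its own test.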
1-49
Data mining products and data sets
A good source to view current DM products is www.KDnuggets.com.
The UCI Machine Learning Repository is a source of very good data mining datasets at www.ics.uci.edu/~mlearn/MLOther.html.
Weka DM software at http://www.cs.waikato.ac.nz/ml/weka/
Tanagra DM software at http://eric.univ-lyon2.fr/~ricco/tanagra/index.html