Final Project
10-2
Data sets
Visit web site:
http://www.kdnuggets.com/datasets/index.html
This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets.
http://kdd.ics.uci.edu/
10-3
Data sets
Data Sets
by application area
by name
by date (reverse chronological)
Machine Learning Repository
Task Files
by task type
by application area
by name
by date (reverse chronological)
by data type
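10-4
Report & Presentation
Written report (50%) + presentation (50%) = final exam grade
Groups of 4 students
Written report: at least 8 pages, cover not included
Presentation: 15 minutes, plus 5 minutes of questions. The presenting students do not ask questions; all other students must answer questions, not necessarily immediately, but before the end of class. One class period is used for discussion and questions, and for fixing the chosen data set in advance (it may be changed within one week).
Business Data Mining Applications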
10-6
Business Data Mining Applications
Partial representative sample of applications
Catalog sales
CRM
Credit scoring
Banking (loans)
Investment risk
Insurance
10-7
Fingerhut
Founded 1948
today sends out 130 different catalogs to over 65 million customers
6 terabyte data warehouse
3000 variables of 12 million most active customers
over 300 predictive models
Focused marketing
10-8
Fingerhut
Purchased by Federated Department Stores for $1.7 billion in 1999 (for database)
Fingerhut had $1.6 to $2 billion business per year, targeted at lower income households
Can mail 400,000 packages per day
Each product line has its own catalog
10-9
Fingerhut
Uses segmentation, decision tree, regression, neural network tools from SAS and SPSS
Segmentation - combines order & demographic data with product offerings
can target mailings to greatest payoff
customers who recently had moved tripled their purchasing 12 weeks after the move
send furniture, telephone, decoration catalogs
10-10
Data for SEGMENTATION
cluster indices
subj age income marital grocery dine out savings
1001 53 80000 wife 180 90 30000
1002 48 120000 husband 120 110 20000
1003 32 90000 single 30 160 5000
1004 26 40000 wife 80 40 0
1005 51 90000 wife 110 90 20000
1006 59 150000 wife 160 120 30000
1007 43 120000 husband 140 110 10000
1008 38 160000 wife 80 130 15000
1009 35 70000 single 40 170 5000
1010 27 50000 wife 130 80 0
10-11
Initial Look at Data
Want to know features of those who spend a lot dining out
INCLUDE AS MANY ACTIONABLE VARIABLES AS POSSIBLE
things you can identify
Manipulate data
sort on most likely indicator (dine out)
10-12
Sorted by Dine Out
cluster indices
subject age income marital grocery dine out savings
1004 26 40000 wife 80 40 0
1010 27 50000 wife 130 80 0
1001 53 80000 wife 180 90 30000
1005 51 90000 wife 110 90 20000
1002 48 120000 husband 120 110 20000
1007 43 120000 husband 140 110 10000
1006 59 150000 wife 160 120 30000
1008 38 160000 wife 80 130 15000
1003 32 90000 single 30 160 5000
1009 35 70000 single 40 170 5000
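
The sort above can be reproduced in a few lines of Python. This is a minimal sketch: the tuple layout and the names `customers`, `DINE_OUT`, and `by_dine_out` are assumptions made for illustration, and the values are copied from the segmentation table.

```python
# Rows copied from the segmentation table:
# (subject, age, income, marital, grocery, dine_out, savings)
customers = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]

DINE_OUT = 5  # column index of the dine-out amount

# sorted() is stable, so customers with equal dine-out spending keep
# their original relative order (for example 1001 before 1005).
by_dine_out = sorted(customers, key=lambda row: row[DINE_OUT])

for row in by_dine_out:
    print(*row)
```

Sorting on the most likely indicator puts the heaviest dine-out spenders at the bottom of the listing, which makes it easy to compare their other variables by eye.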
10-13
Analysis
Best indicators: marital status, groceries
Available: marital status might be easier to get
10-14
Fingerhut
Mailstream optimization
which customers most likely to respond to existing catalog mailings
save near $3 million per year
reversed trend of catalog sales industry in 1998
reduced mailings by 20% while increasing net earnings to over $37 million
10-15
LIFT
LIFT = probability in class by sample divided by probability in class by population
if population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5
Best lift not necessarily best
need sufficient sample size
as confidence increases, longer list but lower lift
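
The lift ratio defined above is a one-line calculation. The sketch below assumes a hypothetical helper named `lift`; it simply divides the sample response rate by the population response rate.

```python
def lift(sample_rate: float, population_rate: float) -> float:
    """Response rate in the selected sample divided by the population rate."""
    if population_rate <= 0:
        raise ValueError("population_rate must be positive")
    return sample_rate / population_rate


# The slide's example: 30% in the sample against 20% in the population.
example = lift(0.30, 0.20)
print(round(example, 2))
```

A lift above 1 means the selected sample responds better than the population as a whole.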
10-16
Lift Example
Product to be promoted
Sampled over 10 identifiable segments of potential buying population
Profit $50 per item sold
Mailing cost $1
Sorted by estimated response rates
10-17
Lift Data
Seg Rate Rev Cost Profit
1 0.042 $2.10 $1 $1.10
2 0.035 $1.75 $1 $0.75
3 0.025 $1.25 $1 $0.25
4 0.017 $0.85 $1 -$0.15
5 0.015 $0.75 $1 -$0.25
6 0.013 $0.65 $1 -$0.35
7 0.009 $0.45 $1 -$0.55
8 0.005 $0.25 $1 -$0.75
9 0.004 $0.20 $1 -$0.80
10 0.001 $0.05 $1 -$0.95
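
The Rev and Profit columns follow from the response rate, the $50 profit per item, and the $1 mailing cost. The sketch below recomputes them and accumulates profit segment by segment. It assumes one mailing per segment (that is, equal-size segments), which the slides do not state; the variable names are illustrative.

```python
# Estimated response rates for segments 1-10, copied from the Lift Data table.
rates = [0.042, 0.035, 0.025, 0.017, 0.015, 0.013, 0.009, 0.005, 0.004, 0.001]
PROFIT_PER_SALE = 50.0  # dollars per item sold
MAILING_COST = 1.0      # dollars per mailing

# Expected revenue and profit per mailing in each segment.
revenue = [r * PROFIT_PER_SALE for r in rates]
profit = [rev - MAILING_COST for rev in revenue]

# Running profit total, assuming one mailing per segment (an assumption).
cum_profit = []
total = 0.0
for p in profit:
    total += p
    cum_profit.append(total)

# Cumulative share of all responses captured after each segment.
cum_response = []
running = 0.0
for r in rates:
    running += r
    cum_response.append(running / sum(rates))

# Segment after which cumulative profit peaks (1-based).
best_cutoff = max(range(len(cum_profit)), key=lambda i: cum_profit[i]) + 1
print(best_cutoff, round(cum_profit[best_cutoff - 1], 2))
```

Under these assumptions cumulative profit peaks after the last segment whose per-mailing profit is still positive, which is where the mailing list would be cut.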
10-18
Lift Chart
[Figure: lift chart. x-axis: Segment (0 to 10); y-axis: Cumulative Proportion (0 to 1.2); series: Cum Response, Random]
10-19
Profit Impact
[Figure: profit chart. x-axis: Segment (0 to 10); y-axis: Dollars (-4 to 12); series: Cum Revenue, Cum Cost, Cum Profit]
10-20
RFM
Recency, Frequency, Monetary
Same purpose as lift
Identify customers more likely to respond
RFM tracks customer transactions by its 3 measures
Code each customer
Often 5 cells for each measure, or 125 combinations
Identify positive response of each of the combinations
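
One way to code customers into RFM cells is sketched below. Everything here is an assumption made for illustration: the helper `quintile_codes`, the ten made-up customers, and the response flags are hypothetical, and rank-based fifths are only one of several possible coding schemes.

```python
from collections import defaultdict


def quintile_codes(values, higher_is_better=True):
    """Rank the values and assign codes 1-5 (5 = best fifth of customers)."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    codes = [0] * len(values)
    for rank, i in enumerate(order):
        codes[i] = 1 + (rank * 5) // len(values)
    return codes


# Hypothetical data for ten customers.
days_since_purchase = [5, 40, 200, 12, 90, 365, 30, 7, 150, 60]
orders = [12, 3, 1, 8, 2, 1, 5, 10, 2, 4]
spend = [900, 150, 40, 600, 120, 30, 300, 800, 90, 200]
responded = [True, False, False, True, False, False, False, True, False, False]

# Recent purchases are better, so fewer days gets the higher code.
r = quintile_codes(days_since_purchase, higher_is_better=False)
f = quintile_codes(orders)
m = quintile_codes(spend)
cells = list(zip(r, f, m))  # each cell is one of 5 * 5 * 5 = 125 combinations

# Response rate observed in each occupied cell.
tally = defaultdict(lambda: [0, 0])
for cell, hit in zip(cells, responded):
    tally[cell][0] += hit
    tally[cell][1] += 1
response_rate = {cell: hits / n for cell, (hits, n) in tally.items()}

for cell in sorted(response_rate, reverse=True):
    print(cell, response_rate[cell])
```

With real data, cells whose response rate covers the cost of contact would be the ones selected for the next promotion.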
10-21
CUSTOMER RELATIONSHIP MANAGEMENT (CRM)
understanding value customer provides to firm
Kathleen Khirallah - The Tower Group
Banks will spend $9 billion on CRM by end of 1999
Deloitte
only 31% of senior bank executives confident that their current distribution mix anticipated customer needs
10-22
Customer Value
Middle age (41-55), 3-9 years on job, 3-9 years in town, savings account
year  annual purchases  profit  discounted  net  (discount rate 1.3)
1 1000 200 153 153
2 1000 200 118 272
3 1000 200 91 363
4 1000 200 70 433
5 1000 200 53 487
6 1000 200 41 528
7 1000 200 31 560
8 1000 200 24 584
9 1000 200 18 603
10 1000 200 14 618
10-23
Younger Customer
Young (21-29), 0-2 years on job, 0-2 years in town, no savings account
year  annual purchases  profit  discounted  net  (discount rate 1.3)
1 300 60 46 46
2 360 72 43 89
3 432 86 39 128
4 518 104 36 164
5 622 124 34 198
6 746 149 31 229
7 896 179 29 257
8 1075 215 26 284
9 1290 258 24 308
10 1548 310 22 331
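
The discounted and net columns in the two customer value tables can be recomputed as below. This is a sketch under stated assumptions: the function name `discounted_profits` is hypothetical, the 1.3 factor is read as a 30% annual discount rate, and the 20% profit margin and 20% yearly purchase growth for the younger customer are inferred from the table values rather than stated on the slides.

```python
def discounted_profits(profits, rate=0.30):
    """Return (discounted, cumulative) lists for a series of yearly profits.

    Year t is divided by (1 + rate) ** t, matching the 1.3 factor in the
    table headers.
    """
    discounted, cumulative, total = [], [], 0.0
    for year, profit in enumerate(profits, start=1):
        value = profit / (1.0 + rate) ** year
        total += value
        discounted.append(value)
        cumulative.append(total)
    return discounted, cumulative


# Middle-aged customer: flat $1000 purchases, $200 profit per year.
middle = [200.0] * 10
# Younger customer: purchases start at $300 and grow 20% a year;
# profit is taken as 20% of purchases (both ratios inferred from the table).
young = [0.20 * 300.0 * 1.2 ** t for t in range(10)]

for name, profits in (("middle-aged", middle), ("younger", young)):
    _, cumulative = discounted_profits(profits)
    print(name, round(cumulative[-1], 1))
```

The tables show whole dollars, so the recomputed figures agree with them only to within about a dollar.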
10-24
Lifetime Value Application
Drew et al. (2001), Journal of Service Research 3:3
Cellular telephone division, major US telecommunications firm
Data on billing, usage, demographics
Neural net model of churn proportion by month of tenure
36 tenure classes
Tested model on 21,500 subscribers April 1998
Trained on 15,000, tested on 6,500
10-25
Customer Tenure Segments
1. Least likely to churn
• Left alone
2. Slight propensity to churn at end of tenure
• Moderate pre-expiration marketing
3. Large spike in churn at expiration
• Concentrated marketing efforts before expiration
4. Highest risk
• Continued competitive offers
10-26
CREDIT SCORING
Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other
Statistical & mathematical models (regression) to predict repayment
10-27
CREDIT SCORING
Bank Loan Applications
Age Income Assets Debts Want On-time
24 55557 27040 48191 1500 1
20 17152 11090 20455 400 1
20 85104 0 14361 4500 1
33 40921 91111 90076 2900 1
30 76183 101162 114601 1000 1
55 80149 511937 21923 1000 1
28 26169 47355 49341 3100 0
20 34843 0 21031 2100 1
20 52623 0 23054 15900 0
39 59006 195759 161750 600 1
10-28
Credit Card Management
Very profitable industry
Card surfing - pay old balance with new card
Promotions typically generate 1000 responses, about 1%
In early 1990s, almost all mass marketing
Data mining improves (lift)
10-29
British Credit Card Company
Monthly credit data
Didn’t want those who paid in full (no profit)
Application scoring
Continued what had been done manually for over 50 years
Behavioral scoring
Monitor revolving credit accounts for early warning
90,000 customers
State variable: cumulative months of missed repayment
Selected sample of 10,000 observations
Initial state all 0 in selected data
Over 70% of customers never left state 0
10-30
Analysis
Clustering
Unsupervised partitioning
K-median to get more stable results
Pattern search
Sought patterns from object grouping
Unexpectedly large number of similar objects
Estimated probability of each case belonging to objects
10-31
Comparison
Compared clustering partitions with pattern search groupings
Pattern search identified those behaving in anomalous manner
10-32
Banking
Among first users of data mining
Used to find out what motivates their customers (reduce churn)
Loan applications
Target marketing
Norwest: 3% of customers provided 44% profits
Bank of America: program cultivating top 10% of customers
10-33
CHURN
Customer turnover
Critical to: telecommunications, banks, human resource management, retailers
10-34
Characteristics of Not On-Time
Age Income Assets Debts Want On-time
28 26169 47355 49341 3100 0
20 52623 0 23054 15900 0
Here, Debts exceed Assets
Age Young
Income Low
BETTER: Base on statistics, large sample
supplement data with other relevant variables
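
The caution about relying on statistics rather than a couple of cases can be checked against the ten Bank Loan Applications rows shown earlier. The sketch below counts how many applicants have debts exceeding assets and how many of those were actually not on time; the tuple layout and variable names are assumptions made for illustration.

```python
# (age, income, assets, debts, want, on_time) rows copied from the
# Bank Loan Applications table.
loans = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]

ASSETS, DEBTS, ON_TIME = 2, 3, 5  # column indices

flagged = [row for row in loans if row[DEBTS] > row[ASSETS]]
late = [row for row in loans if row[ON_TIME] == 0]
flagged_and_late = [row for row in flagged if row[ON_TIME] == 0]

print(len(flagged), len(late), len(flagged_and_late))
```

In this small sample the debts-exceed-assets rule catches both late payers but also flags several applicants who paid on time, which is why a larger sample and additional variables are recommended.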
10-35
Identify Characteristics of Those Who Leave
Age Time-job Time-town min bal checking savings card loan
years months months $
27 12 12 549 x x
41 18 41 3259 x x x
28 9 15 286 x x
55 301 5 2854 x x x
43 18 18 1112 x x x
29 6 3 0 x
38 55 20 321 x x x
63 185 3 2175 x x x
26 15 15 386 x x
46 13 12 1187 x x x
37 32 25 1865 x x x
10-36
Analysis
What are the characteristics of those who leave?
Correlation analysis
Which customers do you want to keep?
Customer value - net present value of customer to the firm
10-37
Correlation
Age Time-Job Time-Town Min-Bal Check Saving Card Loan
Age 1.0 0.6 0.4 -0.4 0.0 0.4 0.2 0.3
Job 1.0 0.9 -0.6 0.1 0.6 0.9 -0.2
Town 1.0 -0.5 -0.1 0.3 0.5 0.4
Min-Bal 1.0 -0.2 0.3 0.6 -0.1
Check 1.0 0.5 0.2 0.2
Saving 1.0 0.9 0.3
Card 1.0 0.5
Loan 1.0
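
Each entry in a correlation matrix like the one above is a Pearson correlation coefficient. The sketch below implements it with the standard library and applies it to the three numeric columns that survive intact in the table of departing customers. The function name `pearson` is an assumption, and the results need not match the slide's matrix, whose underlying data are not given.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Columns copied from the table of customers who left.
age = [27, 41, 28, 55, 43, 29, 38, 63, 26, 46, 37]
time_job = [12, 18, 9, 301, 18, 6, 55, 185, 15, 13, 32]    # months
time_town = [12, 41, 15, 5, 18, 3, 20, 3, 15, 12, 25]      # months

print("age vs time on job: ", round(pearson(age, time_job), 2))
print("age vs time in town:", round(pearson(age, time_town), 2))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.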
10-38
Bankruptcy Prediction
Sung et al. (1999), Journal of MIS 16:1
Late 20th-century, East Asian corporate bankruptcy critical
Models built for normal & crisis conditions
Used decision tree models for explanation
Discriminant analysis applied to benchmark
Korean corporations
Data for all bankrupt corporations on Korean Stock Exchange, 2nd quarter 1997 to 1st quarter 1998
75 such cases - full data on 30 of those
Normal 2nd Qtr 1991 to 1st Qtr 1995
56 firms, full data on 26
10-39
Korean Bankruptcy Study
Matched bankrupt firms with one or two nonbankrupt firms that had similar assets and size
56 financial ratios used
Eliminated 16 due to duplication
10-40
Financial Ratios
Growth (5)
Profitability (13)
Leverage (9)
Efficiency (6)
Productivity (7)
DV 0/1 variable of bankruptcy or not
10-41
Multivariate Discriminant Analysis
Used stepwise procedure
NORMAL PERIOD
Normal = 0.58 * cash flow/assets + 0.0623 * productivity of capital - 0.006 * average inventory turnover
BANKRUPT PERIOD
Bankrupt = 0.053 * cash flow/liabilities + 0.056 * productivity of capital + 0.014 * fixed assets/(equity+LT liab)
10-42
Decision Tree Models
Used C4.5
Applied boosting to improve predictive power, improved prediction success
NORMAL RULES
IF productivity of capital > 19.65 THEN OK
IF cash flow/total assets > 5.64 THEN OK
IF cash flow/total assets ≤ 55.64 & productivity of capital ≤ 19.65 THEN bankrupt
10-43
CRISIS RULES
IF productivity of capital > 20.61 THEN OK
IF cash flow/liabilities > 2.64 THEN OK
IF fixed assets/(equity+long-term invest) > 87.23 THEN OK
IF cash flow/liabilities ≤ 2.64
AND productivity of capital ≤20.61
AND fixed assets/(equity+long-term invest) ≤ 87.23 THEN bankrupt
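
The crisis rules above translate directly into a small classifier. The sketch below is only a transcription of the rules as listed; the function and parameter names are assumptions, and the example inputs are made up.

```python
def crisis_rule(productivity_of_capital: float,
                cash_flow_to_liabilities: float,
                fixed_assets_ratio: float) -> str:
    """Apply the CRISIS RULES; fixed_assets_ratio is
    fixed assets/(equity+long-term invest) as named on the slide."""
    if productivity_of_capital > 20.61:
        return "OK"
    if cash_flow_to_liabilities > 2.64:
        return "OK"
    if fixed_assets_ratio > 87.23:
        return "OK"
    # All three ratios are at or below their thresholds.
    return "bankrupt"


# Made-up firms for illustration only.
print(crisis_rule(25.0, 1.0, 50.0))   # passes the first rule
print(crisis_rule(10.0, 1.0, 50.0))   # fails all three rules
```

Because each rule alone is enough to classify a firm as OK, only firms at or below all three thresholds are predicted bankrupt.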
10-44
Comparison
Model  Correct Bankrupt  Correct OK  Overall  Variables
DA-normal 0.69 0.90 0.82 3
DA-crisis 0.53 0.85 0.74 3
DT-normal 0.72 0.90 0.83 8
DT-crisis 0.67 0.89 0.81 6
10-45
Mortgage Market
Early 1990s - massive refinancing
Need to keep customers happy to retain
Contact current customers who have rates significantly higher than market
a major change in practice
data mining & telemarketing increased Crestar Mortgage’s retention rate from 8% to over 20%
10-46
Country Investment Risk
Outcome categories:
1. Most safe
2. Developed
3. Mature emerging markets
4. New emerging markets
5. Frontier
10-47
Investment Risk Analysis
Becerra-Fernandez et al. (2002), Computers and Industrial Engineering 43
Risk by country
Expert assessment available
Decision tree (C5), neural network models
Data:
Economic indicators (4)
Depth & liquidity (4)
Performance & value (5)
Economic & market risk (4)
Regulation & efficiency (4)
52 samples, so used bootstrapping
10-48
Models
Decision trees
Pruning rate 50%
Pruning rate 75%
Neural networks
Backpropagation
Fuzzy (ARTMAP)
Learning vector quantization
10-49
Results
Decision tree algorithms more accurate
Lower pruning rate - lowest error rate
Neural networks disadvantaged by small data set
Decision tree algorithms consistently optimistic relative to expert ratings
10-50
Banking
Fleet Financial Group
$30 million data warehouse
hired 60 database marketers, statistical/quantitative analysts & DSS specialists
expected to add $100 million in profit by 2001
10-51
Banking
First Union
concentrated on contact point
previously had very focused product groups, little coordination
Developed offers for customers
10-52
INSURANCE
Marketing, as retailing & banking
Special: Farmers Insurance Group - underwriting system
generating $ millions in higher revenues, lower claims
7 databases, 35 million records
better understanding of market niches
lower rates on sports cars, increasing business
10-53
Insurance Fraud
Specialist criminals - multiple personas
InfoGlide specializes in fraud detection products
Similarity search engine
link names, telephone numbers, streets, birthdays, variations
identify 7 times more fraud than exact-match systems
10-54
Insurance Fraud - Link Analysis
claim type  amount  physician  attorney
back 50000 Welby McBeal
neck 80000 Frank Jones
arm 40000 Barnard Fraser
neck 80000 Frank Jones
leg 30000 Schmidt Mason
multiple 120000 Heinrich Feiffer
neck 80000 Frank Jones
back 60000 Schwartz Nixon
arm 30000 Templer White
internal 180000 Weiss Richards
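
A very simple form of link analysis on the claims table is to count how often the same physician and attorney appear together. The sketch below uses the ten rows from the table; the variable names are illustrative, and real link analysis tools do far more than exact-match counting.

```python
from collections import Counter

# (claim type, amount, physician, attorney) rows copied from the table.
claims = [
    ("back", 50000, "Welby", "McBeal"),
    ("neck", 80000, "Frank", "Jones"),
    ("arm", 40000, "Barnard", "Fraser"),
    ("neck", 80000, "Frank", "Jones"),
    ("leg", 30000, "Schmidt", "Mason"),
    ("multiple", 120000, "Heinrich", "Feiffer"),
    ("neck", 80000, "Frank", "Jones"),
    ("back", 60000, "Schwartz", "Nixon"),
    ("arm", 30000, "Templer", "White"),
    ("internal", 180000, "Weiss", "Richards"),
]

# Count how many claims share each physician/attorney pair.
pair_counts = Counter((physician, attorney)
                      for _, _, physician, attorney in claims)

# Pairs linked by more than one claim deserve a closer look.
repeated = {pair: n for pair, n in pair_counts.items() if n > 1}
print(repeated)
```

A pair that recurs with the same claim type and amount is the kind of pattern an investigator would want to examine.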
10-55
Insurance Fraud
Analytics’ NetMap for Claims
uses industrywide database
creates data mart of internal, external data
unusual activity for specific chiropractors, attorneys
HNC Insurance Solutions
workers compensation fraud
VeriComp - predictive software (neural nets) saved Utah over $2 million
10-56
Insurance Data Mining Examples
Smith et al. (2000), Journal of the Operational Research Society 51:5
Large data warehouse system
Recorded every transaction & claim
Data mining to predict average claim costs & frequency, impact on profitability
Pricing
10-57
Customer Retention Analysis
Over 20,000 motor vehicle policies due for renewal in one month
About 7% didn’t renew
Expected reasons: price, service, value of vehicle
10-58
Customer Retention Results
Data Mining
Enterprise Miner
Used data exploration to select variables (13)
Used log transforms for highly skewed data
Performed logistic regression, decision trees, neural networks
Neural network fit test set best
But low correct rate for termination
10-59
Claims Analysis
Recent growth in policies
Lower profitability
Could improve by lowering frequency, reducing claim amounts
Data over a three-year period
Sample size well over 100,000 per quarter
Descriptive statistics:
High growth in young people, insurance over $40,000
10-60
Claims Models
Clustering
Predict group policy claims behavior
Used 50 clusters
K-means algorithm
Identified several clusters with abnormal cost ratios or frequency size
10-61
TELECOMMUNICATIONS
Deregulation - widespread competition
churn
1/3 poor call quality, 1/2 poor equipment
wireless performance monitor tracking
reduced churn about 61%, $580,000/year
cellular fraud prevention
spot problems when cell phones begin to go bad
10-62
Telecommunications
Metapath’s Communications Enterprise Operating System
help identify telephone customer problems
dropped calls, mobility patterns, demographics
to target specific customers
reduce subscription fraud
$1.1 billion
reduce cloning fraud
cost $650 million in 1996
10-63
Telecommunications
Churn Prophet, ChurnAlert
data mining to predict subscribers who cancel
Arbor/Mobile
set of products, including churn analysis
10-64
TELEMARKETING
MCI uses data marts to extract data on prospective customers
typically a 2-month program
20% improvement in sales leads
multimillion investment in data marts & hardware
staff of 45
trend spotting (which approaches specific customers)
10-65
Telemarketing
Australian Tourist Commission
maintained database since 1992
responses to travel inquiries on tours, hotels, airlines, travel agents, consumers
data mine to identify travel agents & consumers responding to various media
sales closure rate at 10% and up
lead lists faxed weekly to productive travel agents
10-66
Telemarketing
Segmentation
Which customers respond to new promotions, to discounts, to new product offers
Determine
whom to offer new service to
those most likely to commit fraud
10-67
Human Resource Management
Identify individuals liable to leave company without additional compensation or benefits
Firm may already know 20% use 80% of offered services
don’t know which 20%
data mining (business intelligence) can identify
Use most talented people in highest priority (or most profitable) business units
10-68
Human Resource Management
Downsizing
identify right people, treat them well
track key performance indicators
data on talents, company needs, competitor requirements
State of Mississippi’s MERLIN network
30 databases (finance, payroll, personnel, capital projects)
Cognos Impromptu system - 230 users
10-69
CASINOS
Casino gaming one of richest data sets known
Harrah’s - incentive programs
about 8 million customers hold Total Gold cards, used whenever the customer spends money in the casino
comprehensive data collection
Trump’s Taj Card similar
10-70
Casinos
Bellagio & Mandalay Bay
strategy of luxury visits
child entertainment
change from old strategy - cheap food
Identify high rollers - cultivate
identify those to discourage from play
estimate lifetime value of players
10-71
ARTS
Computerized box offices lead to high volumes of data
Identify potential consumers for shows
Software to manage shows
similar to airline seating chart software