04/08/23 CS267, Yelick 1
CS 267: Applications of Parallel Computers
Lecture 25:
Data Mining
Kathy Yelick
Material based on lecture by
Vipin Kumar and Mahesh Joshi
http://www-users.cs.umn.edu/~mjoshi/hpdmtut/
Lecture Schedule
• 12/3: 3 things
Projects and performance analysis
(N-body assignment observations)
Data Mining
HKN Review at 3:40
• 12/5: The Future of Parallel Computing
David Bailey
• 12/13: CS267 Poster Session (2-4pm, Woz)
• 12/14: Final Papers due
N-Body Assignment
• Some observations on your N-Body assignments
• Problems and pitfalls to avoid in the final project
• Performance analysis
• Micro-benchmarks are good
• To understand application performance, build up performance model from measured pieces, e.g., network performance
• Noise is expected, but quantifying it is also useful
• Means alone can be confusing
• Median + variance is good
• Carefully select problem sizes
• Are they large enough to justify the # of processors?
• What do real users want?
• Can you vary the problem size in some reasonable way?
N-Body Assignment
• Minor comments on N-Body results
• Describe performance graphs: what is expected, what is surprising
• Sanity check your numbers
• Are you getting more than a factor-of-P speedup on P processors?
• Does the observed running time (“time command”) match total?
• What is your Mflops rate? Is it between 10 and 90% of HW peak?
• Be careful of different timers
• gettimeofday is wall-clock time (you are charged for the OS and other processes)
• Clock is process time (Linux creates a process per thread)
• RT clock on Cray is wall clock time
• Check captions, titles, axes of figures/graphs
• Run spell checker
Outline
• Overview of Data Mining
• Serial Algorithms for Classification
• Parallel Algorithms for Classification
• Summary
Data Mining Overview
• What is Data Mining?
• Data Mining Tasks
• Classification
• Clustering
• Association Rules and Sequential Patterns
• Regression
• Deviation Detection
What is Data Mining?
• Several definitions:
• Search for valuable information in large volumes of data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules
• A step in the Knowledge Discovery in Databases (KDD) process
Knowledge Discovery Process
• Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in data
• Process pipeline (figure): operational databases → clean, collect, summarize → data warehouse → data preparation → training data → data mining → model/patterns → verification and evaluation
Why Mine Data?
• Data is collected and stored at an enormous rate
• Remote sensors on satellites
• Telescope scanning the skies
• Microarrays generating gene expressions
• Scientific simulations
• Traditional techniques infeasible
• Data mining for data reduction
• Cataloging, classifying, segmenting
• Help scientists formulate hypotheses
Data Mining Tasks
• Predictive Methods: Use some variables to predict unknown or future values of other variables
• Classification
• Regression
• Deviation Detection
• Descriptive Methods: Find human-interpretable patterns that describe data
• Clustering
• Association Rule Discovery
• Sequential Pattern Discovery
Classification
• Given a collection of records (training set)
• Each record contains a set of attributes, one of which is the class
• Find a model for the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be accurately assigned a class
• A test set is used to determine accuracy
• Examples:
• Direct marketing: targeted mailings based on a buy/don't-buy class
• Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc.
• Sky survey cataloging: catalog objects as star or galaxy
Classification Example: Sky Survey
• Currently 3K images, each 23K x 23K pixels
• Approach:
• Segment the image
• Measure image attributes – 40 per object
• Model the class (star/galaxy or stage) based on the attributes
Images from: http://aps.umn.edu
Clustering
• Given a set of data points:
• Each has a set of attributes
• A similarity measure among them
• Find clusters such that:
• Points in one cluster are more similar to each other than to points in other clusters
• Similarity measures are problem-specific:
• E.g., Euclidean distance for continuous data
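As a minimal sketch of the similarity idea (function names are mine, not from the slides), Euclidean distance for continuous attributes can be used to assign a point to its closest cluster center:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(point, centroids):
    """Index of the centroid closest to `point` under Euclidean distance."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))
```

For other data types (categorical, text) the distance function would be swapped out, but the assignment step stays the same.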
Clustering Applications
• Market Segmentation:
• Divide market into distinct subsets
• Document clustering:
• Find groups of related documents, based on common keywords
• Used in information retrieval
• Financial market analysis
• Find groups of companies with common stock behavior
Association Rule Discovery
• Given a set of records, each containing a set of items
• Produce dependency rules that predict occurrences of an item based on others
• Applications:
• Marketing, sales promotion, and shelf management
• Inventory management
Example transactions:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}
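The rules above can be checked against the example transactions with a small sketch (the helper names and the support/confidence formulation are mine; the slides only show the rules):

```python
# the five example transactions from the slide
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """How often `rhs` appears in transactions that contain `lhs`."""
    return support(lhs | rhs) / support(lhs)
```

For example, {Milk} → {Coke} holds with confidence 3/4, since Milk occurs in four transactions and Coke co-occurs in three of them.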
Other Data Mining Problems
• Sequential Pattern Discovery
• Given a set of objects, each with a timeline of events
• Find rules that predict sequential dependencies
• Example: patterns in telecommunications alarm logs
• Regression:
• Predict the value of one variable given the others
• Assume a linear or non-linear model of dependence
• Examples:
• Predict sales amounts based on advertising expenditures
• Predict wind velocities based on temperature, pressure, etc.
• Deviation Detection
• Discover the most significant changes in the data from previous values
Serial Algorithms for Classification
• Decision Tree Classifiers
• Overview of Decision Trees
• Tree induction
• Tree pruning
• Rule-based methods
• Memory Based Reasoning
• Neural networks
• Genetic algorithms
• Bayesian networks
(Decision trees: inexpensive, easy to interpret, easy to integrate into DBs)
Decision Tree Algorithms
• Many algorithms
• Hunt's Algorithm
• CART
• ID3, C4.5
• SLIQ, SPRINT
Example training set:

Tid  Refund  Marital  Income  Cheat
1    Yes     S        125K    No
2    No      M        100K    No
3    No      S        70K     No
4    Yes     M        120K    No
5    No      D        95K     Yes
6    No      M        60K     No
7    Yes     D        220K    No
8    No      S        85K     Yes
9    No      M        75K     No
10   No      S        90K     Yes

Resulting tree (figure): split on Refund (Yes → NO); for Refund=No, split on Marital (M → NO); for S or D, split on Income (<=80K → NO, >80K → YES)
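The fitted tree can be hand-coded as a short sketch to classify new records (attribute names follow the table above; the function name is mine):

```python
def classify(record):
    """Walk the example tree: Refund, then Marital status, then Income."""
    if record["Refund"] == "Yes":
        return "No"                       # Refund = Yes -> NO
    if record["Marital"] == "M":
        return "No"                       # married -> NO
    # single or divorced: threshold on income
    return "Yes" if record["Income"] > 80_000 else "No"
```

Running the training records through this function reproduces every Cheat label in the table, which is the sense in which the tree is a model of the class attribute.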
Tree Induction
• Greedy strategy
• Split based on the attribute that optimizes the splitting criterion
• Two phases at each node in the tree
• Split determining phase:
• Which attribute to split
• How to split
– Two-way split of multi-valued attribute (Marital: S,D,M)
– Continuous attributes: discretize in advance, cluster on the fly
• Splitting phase:
• Do the split and create child nodes
GINI Splitting Criterion
• Gini Index: GINI(t) = 1 – Σ_j [p(j | t)]²
where p(j | t) is the relative frequency of class j at node t
• Measures the impurity of a node
• Maximum (1 – 1/n_c) when records are equally distributed across the n_c classes
• Minimum (0) when all records belong to one class, the most informative case
• Other criteria may be better, but similar evaluation
Example class distributions and their Gini values:

C1: 0, C2: 6  →  Gini = 0.000
C1: 1, C2: 5  →  Gini = 0.278
C1: 2, C2: 4  →  Gini = 0.444
C1: 3, C2: 3  →  Gini = 0.500
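A minimal sketch of the formula above, computing Gini impurity from per-class record counts (the function name is mine):

```python
def gini(counts):
    """Gini impurity 1 - sum_j p(j)^2, from per-class record counts at a node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)
```

This reproduces the example values: a pure node (0, 6) gives 0, and an even split (3, 3) gives the two-class maximum 1 - 1/2 = 0.5.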
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT
• Criterion: Minimize GINI index of the Split
• When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = Σ_{j=1}^{k} (n_j / n) GINI(j)

where n_j = number of records at child j, and n = number of records at node p
• To evaluate a split:
• Categorical attributes: compute counts of each class
• Continuous attributes: sort and choose split (1 or more)
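The weighted formula above can be sketched directly (helper names are mine; each child is given as a list of per-class counts):

```python
def gini(counts):
    """Gini impurity from per-class record counts at one node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """GINI_split = sum_j (n_j / n) * GINI(j) over the k children."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)
```

A perfect split of a (3, 3) node into (3, 0) and (0, 3) scores 0, while leaving the node unsplit scores 0.5, so minimizing GINI_split prefers the split.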
Splitting Based on INFO
• Information/Entropy:

INFO(t) = – Σ_j p(j | t) log p(j | t)

• Information Gain:

GAIN_split = INFO(p) – Σ_{j=1}^{k} (n_j / n) INFO(j)
• Measures reduction in entropy; choose split to maximize
• Used in ID3 and C4.5
• Problem: tends to prefer splits with a large number of partitions
• Variations avoid this
• Computation similar to GINI
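Entropy and information gain can be sketched the same way as the Gini code (function names are mine; log base 2 is assumed, and 0 log 0 is taken as 0):

```python
import math

def entropy(counts):
    """INFO(t) = -sum_j p(j|t) log2 p(j|t), from per-class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(parent_counts, children):
    """GAIN_split = INFO(parent) - sum_j (n_j / n) INFO(child_j)."""
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(sum(c) / n * entropy(c) for c in children)
```

Splitting a balanced (3, 3) node into two pure children yields the maximum gain of 1 bit, which is the value this criterion maximizes.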
C4.5 Classification
• Simple depth-first construction of tree
• Sorts continuous attributes at each node
• Needs data to fit in memory
• To avoid an out-of-core sort
• Limits scalability
SLIQ Classification
• Arrays of continuous attributes are pre-sorted
• Classification tree is grown breadth-first
• Class list structure maintains the mapping: record id → tree node
• Split determining phase: the class list is consulted to compute the best split for each attribute (breadth-first)
• Splitting phase: the list of the splitting attribute is used to update the leaf labels in the class list (no physical splitting)
• Problem: the class list is frequently and randomly accessed
• It must be kept in memory for efficient performance
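A toy sketch of the class-list idea, under my own simplified data layout (not the actual SLIQ structures): records keep their leaf label in an in-memory map, and a split only relabels leaves rather than physically partitioning the attribute arrays.

```python
# class_list[rid] -> current leaf for each record (kept in memory by SLIQ)
class_list = {rid: 0 for rid in range(4)}              # all records start at root (node 0)

# presorted attribute list for the chosen splitting attribute: (value, record id)
income_list = [(70, 2), (85, 1), (95, 0), (125, 3)]

def apply_split(attr_list, threshold, node, left, right):
    """Splitting phase: scan the splitting attribute's list and relabel leaves.
    No attribute array is physically split."""
    for value, rid in attr_list:
        if class_list[rid] == node:
            class_list[rid] = left if value <= threshold else right

apply_split(income_list, 80, node=0, left=1, right=2)
```

The random class-list lookups per record are exactly the access pattern the slide flags as the scalability problem.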
SLIQ Example
SPRINT
• Arrays of continuous attributes are presorted
• Sorted order is maintained during splits
• Classification tree is grown breadth-first
• Attribute lists are physically split among nodes
• Split determining phase is just a linear scan of the lists at each node
• Hashing scheme used in splitting phase
• Record IDs from the splitting attribute's list are hashed with the destination tree node
• Remaining attribute arrays are split by querying this hash table
• Problem: the hash table is O(N) at the root
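The hashing scheme can be sketched as follows, under my own simplified layout of SPRINT's attribute lists (value, record id pairs in sorted order):

```python
# presorted attribute lists as (value, record id) pairs
splitting_attr = [(70, 2), (85, 1), (95, 0), (125, 3)]   # list being split on value <= 80
other_attr = [(1, 3), (2, 1), (3, 0), (5, 2)]            # another attribute, also sorted

# splitting phase: hash record id -> child for the splitting attribute's list
rid_to_child = {rid: (0 if value <= 80 else 1) for value, rid in splitting_attr}

# remaining lists are partitioned by probing the hash table; scanning in order
# preserves each child's sorted order, so no re-sort is needed
children = ([], [])
for value, rid in other_attr:
    children[rid_to_child[rid]].append((value, rid))
```

Note the table holds one entry per record at the root, which is the O(N) memory problem the slide points out.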
Parallel Algorithms for Classification
• Driven by the need to handle large data sets
• Larger aggregate memory on parallel machines
• Scales on cluster architectures
• I/O time dominates
• More difficult to analyze benefits (cost/performance) than for a simple MFLOP-limited problem
• I.e., buy disks for parallel bandwidth vs. processors + memory
Parallel Tree Construction: Approach 1
• First approach: partition the data; data-parallel operations across tree nodes
• Global reduction per tree node
• Expensive when the tree has a large number of nodes
Parallel Tree Construction: Approach 2
• Task parallelism: exploit parallelism between tree nodes
• Load imbalance as the number of records per node varies
• Locality: child/parent need same data
Parallel Tree Construction: Hybrid Approach
• Switch from data parallelism to task parallelism (from within a tree node to between tree nodes) when:
total communication cost >= moving cost + load balancing cost
• Splitting ensures:
• Communication cost <= 2 × optimal communication cost
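The switching rule above is just a cost comparison; a sketch (the function name and cost units are mine, and in practice each cost would come from a machine-specific performance model):

```python
def switch_to_task_parallelism(comm_cost, moving_cost, balancing_cost):
    """Hybrid rule: move from data- to task-parallelism once the communication
    cost of staying data-parallel reaches the one-time cost of redistributing
    records plus load balancing."""
    return comm_cost >= moving_cost + balancing_cost
```

Near the root, communication is cheap relative to moving all the data, so the scheme stays data-parallel; deeper in the tree the inequality flips and subtrees are handed to disjoint processor groups.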
Continuous Data
• Parallel mining algorithms for continuous data add:
• Parallel sort
• Essentially a transpose of the data: all-to-all communication
• Parallel hashing
• Random small accesses
• Both are very hard to do efficiently on current machines
Performance Results from ScalParC
• Parallel running time on Cray T3E
Performance Results from ScalParC
• Runtime with constant size per processor, also T3E