Page 1: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University.

Scalable Parallel Data Mining

Eui-Hong (Sam) Han

Department of Computer Science and Engineering
Army High Performance Computing Research Center
University of Minnesota

Research Supported by NSF, DOE, Army Research Office, AHPCRC/ARL

http://www.cs.umn.edu/~han

Joint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh

What is Data Mining?

Many Definitions

Search for Valuable Information in Large Volumes of Data.

Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.

A Step in the KDD Process…

Why Mine Data? Commercial Viewpoint...

Lots of data is being collected and warehoused.
Computing has become affordable.
Competitive pressure is strong:
– Provide better, customized services for an edge.
– Information is becoming a product in its own right.

Why Mine Data? Scientific Viewpoint...

Data collected and stored at enormous speeds (Gbyte/hour):
– remote sensor on a satellite
– telescope scanning the skies
– microarrays generating gene expression data
– scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data.

Data mining for data reduction:
– cataloging, classifying, segmenting data
– helps scientists in hypothesis formation

Data Mining Tasks...

Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]

Classification Example

Training Set

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class).

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.

Classification Application

Direct Marketing

Fraud Detection

Customer Attrition/Churn

Sky Survey Cataloging

Example Decision Tree

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class).

Splitting Attributes: Refund, MarSt, TaxInc

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

The splitting attribute at a node is determined based on the Gini index.

Hunt’s Method

An Example:
Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous)
Class: Cheat, Don’t Cheat

Figure: the tree grown in four steps.
Step 1: a single leaf, Don’t Cheat.
Step 2: Refund? Yes: Don’t Cheat. No: Don’t Cheat.
Step 3: Refund? Yes: Don’t Cheat. No: Marital Status? Single, Divorced: Cheat. Married: Don’t Cheat.
Step 4: Refund? Yes: Don’t Cheat. No: Marital Status? Married: Don’t Cheat. Single, Divorced: Taxable Income? < 80K: Don’t Cheat. >= 80K: Cheat.

Binary Attributes: Computing GINI Index

Splits into two partitions.
Effect of weighting partitions: larger and purer partitions are sought.

Split test: True? Yes goes to Node N1, No goes to Node N2.

      N1  N2
C1     0   4
C2     6   0
Gini = 0.000

      N1  N2
C1     3   4
C2     3   0
Gini = 0.300

      N1  N2
C1     4   2
C2     4   0
Gini = 0.400

      N1  N2
C1     6   2
C2     2   0
Gini = 0.300
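The Gini values in the four count matrices can be reproduced with a short sketch. It assumes the usual definition, Gini of a node = 1 minus the sum of squared class fractions, with the split score being the size-weighted average over the two nodes. The function names are illustrative, not taken from the slides.

```python
def gini(counts):
    """Gini impurity of one node, given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(partitions):
    """Size-weighted Gini index of a split; each partition lists class counts."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)


# The four count matrices from the slide, each written as ([C1, C2] for N1, [C1, C2] for N2).
matrices = [
    ([0, 6], [4, 0]),
    ([3, 3], [4, 0]),
    ([4, 4], [2, 0]),
    ([6, 2], [2, 0]),
]

for n1, n2 in matrices:
    print(n1, n2, round(split_gini([n1, n2]), 3))
```

The weighting is what makes the last matrix score as well as the second one: N1 is larger but still fairly pure, so its impurity is offset by the perfectly pure N2.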

Continuous Attributes: Computing Gini Index...

For efficient computation, for each attribute:
– Sort the attribute on values.
– Linearly scan these values, each time updating the count matrix and computing the Gini index.
– Choose the split position that has the least Gini index.

Sorted values of Taxable Income, with the class (Cheat) of each record:

Taxable Income  60  70  75  85   90   95   100  120  125  220
Cheat           No  No  No  Yes  Yes  Yes  No   No   No   No

Count matrix and Gini index at each split position:

Split position  Yes (<=)  Yes (>)  No (<=)  No (>)  Gini
55              0         3        0        7       0.420
65              0         3        1        6       0.400
72              0         3        2        5       0.375
80              0         3        3        4       0.343
87              1         2        3        4       0.417
92              2         1        3        4       0.400
97              3         0        3        4       0.300
110             3         0        4        3       0.343
122             3         0        5        2       0.375
172             3         0        6        1       0.400
230             3         0        7        0       0.420
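The sort-then-scan procedure can be sketched as follows. This is a minimal illustration rather than the authors' code: it places each candidate threshold at the midpoint between adjacent sorted values (so it reports 97.5 where the slide uses 97) and does not evaluate the two out-of-range positions 55 and 230. The data is the Taxable Income column of the training set.

```python
def gini(counts):
    """Gini impurity of one partition, given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best test `value <= threshold`."""
    pairs = sorted(zip(values, labels))           # one sort per attribute
    n = len(pairs)
    left = {c: 0 for c in set(labels)}            # class counts on the <= side
    right = {c: labels.count(c) for c in left}    # class counts on the > side
    best = (None, float("inf"))
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1                          # incremental count-matrix update
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if value == next_value:
            continue                              # no threshold between equal values
        n_left = i + 1
        score = (n_left / n) * gini(list(left.values())) \
            + ((n - n_left) / n) * gini(list(right.values()))
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best


income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
threshold, score = best_split(income, cheat)
print(threshold, round(score, 3))
```

Because the counts are updated incrementally, the scan costs one pass over the sorted values instead of recounting both sides at every candidate position.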

Need for Parallel Formulations

Need to handle very large datasets.

Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.

Need for scalable, efficient (fast) data mining computations:
– gain competitive advantage.
– handle larger data for greater accuracy in shorter times.

Constructing a Decision Tree in Parallel

Partitioning of data only
– global reduction per node is required
– large number of classification tree nodes gives high communication cost

Figure: n records with m categorical attributes, and an example class histogram for one attribute at a tree node:

        Good  Bad
family  35    50
sport   20    5

Synchronous Tree Construction Approach

Partition Data Across Processors

+ No data movement is required

– Load imbalance
  • can be eliminated by breadth-first expansion

– High communication cost
  • becomes too high in lower parts of the tree
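A minimal single-process sketch of the per-node global reduction follows. It assumes each processor builds a class histogram over its own records and that the histograms are then summed, which is what an all-reduce would do on a real machine. The records, the attribute name `car`, and the function names are hypothetical.

```python
from collections import Counter


def local_histogram(records, attribute):
    """Count (attribute value, class) pairs over one processor's records."""
    return Counter((r[attribute], r["class"]) for r in records)


def global_reduction(histograms):
    """Sum the per-processor histograms; stands in for an all-reduce."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total


# Hypothetical records, partitioned across two simulated processors.
partitions = [
    [{"car": "family", "class": "Good"}, {"car": "sport", "class": "Bad"}],
    [{"car": "family", "class": "Bad"}, {"car": "family", "class": "Good"}],
]

local = [local_histogram(p, "car") for p in partitions]
combined = global_reduction(local)
print(dict(combined))
```

No record leaves its processor; only the histograms travel. One such reduction is needed for every tree node being expanded, which is why the communication cost grows in the lower, bushier parts of the tree.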

Constructing a Decision Tree in Parallel

Partitioning of classification tree nodes
– natural concurrency
– load imbalance as the amount of work associated with each node varies
– child nodes use the same data as used by parent node
– loss of locality
– high data movement cost

Figure: 10,000 training records split into nodes of 7,000 and 3,000 records, which split further into nodes of 2,000, 5,000, 2,000, and 1,000 records.

Partitioned Tree Construction Approach

+ Highly concurrent

- High communication cost due to excessive data movements

- Load imbalance

Partition Data and Nodes

Hybrid Parallel Formulation

Load Balancing

Splitting Criterion

Switch to Partitioned Tree Construction when

$\sum(\text{Communication Cost}) \ge \text{Moving Cost} + \text{Load Balancing}$

Splitting criterion ensures

$\text{Communication Cost} \le 2 \times \text{Communication Cost}_{\text{optimal}}$

– G. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994
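A minimal sketch of the switching test, assuming the criterion reads: switch once the communication cost summed over the tree levels processed so far is at least the moving cost plus the load-balancing cost. The function name and all cost figures are hypothetical; how the costs are estimated is not shown here.

```python
def should_switch(communication_costs, moving_cost, load_balancing_cost):
    """True once the accumulated communication cost reaches the one-time
    cost of moving records and balancing load across processor groups."""
    return sum(communication_costs) >= moving_cost + load_balancing_cost


# Hypothetical per-level communication costs, in arbitrary time units.
per_level = [1.0, 2.0, 4.0]
moving, balancing = 5.0, 1.5

decisions = [
    should_switch(per_level[: level + 1], moving, balancing)
    for level in range(len(per_level))
]
print(decisions)
```

The idea is to keep paying for synchronous reductions only while that is cheaper than the one-time expense of redistributing the data.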

Experimental Results

Data set
– function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96)
– 2 class labels, 3 categorical and 6 continuous attributes

IBM SP2 with 128 processors
– 66.7 MHz CPU with 256 MB real memory
– AIX version 4
– high performance switch

Speedup Comparison of the Three Parallel Algorithms

Figure panels: 0.8 million examples; 1.6 million examples.

Splitting Criterion Verification in the Hybrid Algorithm

$\text{Splitting Criterion Ratio} = \dfrac{\sum(\text{Communication Cost})}{\text{Moving Cost} + \text{Load Balancing}}$

Figure panels: 0.8 million examples on 8 processors; 1.6 million examples on 16 processors.

Speedup of the Hybrid Algorithm with Different Size Data Sets

Summary of Algorithms for Categorical Attributes

Synchronous Tree Construction Approach
– no data movement required
– high communication cost as tree becomes bushy

Partitioned Tree Construction Approach
– processors work independently once partitioned completely
– load imbalance and high cost of data movement

Hybrid Algorithm
– combines good features of two approaches
– adapts dynamically according to the size and shape of trees

Handling Continuous Attributes

Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence undesirable!

Discretize continuous attributes
– CLOUDS (Alsabti, Ranka, and Singh, 1998)
– SPEC (Srivastava, Han, Kumar, and Singh, 1997)

Use a pre-sorted list for each continuous attribute
– SPRINT (Shafer, Agrawal, and Mehta, VLDB’96)
– ScalParC (Joshi, Karypis, and Kumar, IPPS’98)

Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection:
produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule Discovery Application

Marketing and Sales Promotion

Supermarket Shelf Management

Inventory Management

Association Rule Discovery: Support and Confidence

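TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk

Association Rule: $X \Rightarrow_{s,\alpha} y$

Support: $s = \dfrac{\sigma(X \cup y)}{|T|}$, that is, $s = P(X, y)$

Confidence: $\alpha = \dfrac{\sigma(X \cup y)}{\sigma(X)}$, that is, $\alpha = P(y \mid X)$

Example: $\{\text{Diaper}, \text{Milk}\} \Rightarrow_{s,\alpha} \text{Beer}$

$s = \dfrac{\sigma(\text{Diaper}, \text{Milk}, \text{Beer})}{\text{Total Number of Transactions}} = \dfrac{2}{5} = 0.4$

$\alpha = \dfrac{\sigma(\text{Diaper}, \text{Milk}, \text{Beer})}{\sigma(\text{Diaper}, \text{Milk})} = 0.66$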
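The worked example can be checked with a few lines of code. This sketch assumes support is the fraction of transactions containing every item of the rule, and confidence is that count divided by the count of transactions containing the left-hand side. The function names are illustrative.

```python
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(lhs, rhs, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    return support_count(lhs | rhs, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions containing `lhs`, the fraction that also contain `rhs`."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)


s = support({"Diaper", "Milk"}, {"Beer"}, transactions)
c = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(s, round(c, 2))
```

{Diaper, Milk} occurs in transactions 3, 4, and 5, and Beer joins it in transactions 3 and 4, which gives 2/5 for support and 2/3 for confidence.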

Handling Exponential Complexity

Given n transactions and m different items:
– number of possible association rules: $O(m 2^{m-1})$
– computation complexity: $O(n m 2^{m})$

Systematic search for all patterns, based on support constraint [Agrawal & Srikant]:
– If {A,B} has support at least the minimum support, then both A and B have support at least the minimum support.
– If either A or B has support less than the minimum support, then {A,B} has support less than the minimum support.
– Use patterns of size k-1 to find patterns of size k.

Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets)

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets)

Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Triplets (3-itemsets)

Itemset               Count
{Bread,Milk,Diaper}   3
{Milk,Diaper,Beer}    2

If every subset is considered, 6C1 + 6C2 + 6C3 = 41
With support-based pruning, 6 + 6 + 2 = 14
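The level-wise search can be sketched in a few lines. This is a simplified illustration of the Apriori idea, not the authors' implementation: it counts candidates by brute force and prunes any candidate that has an infrequent subset, so it generates fewer triplets than the slide's table. The transactions are the five from the Support and Confidence slide, and all names are illustrative. On that data the sketch counts {Bread, Milk, Diaper} in two transactions (TIDs 4 and 5), whereas the triplet table lists 3, so with a minimum support of 3 the sketch reports no frequent 3-itemset.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        counted = {c: count(c, transactions) for c in current}
        level = {c: n for c, n in counted.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-candidates, then drop any
        # candidate that has an infrequent (k-1)-subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


frequent = apriori(transactions, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Coke and Eggs fall below the threshold at the first level, so no pair containing them is ever counted; that is the saving the slide's 41-versus-14 comparison describes.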

Counting Candidates

Frequent itemsets are found by counting candidates.

Simple way: search for each candidate in each transaction. Expensive!!!

Figure: N transactions matched against M candidates.

Association Rule Discovery: Hash tree for fast access.

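Figure: Candidate Hash Tree.

Hash Function: items 1, 4, 7 hash to the first branch; items 2, 5, 8 to the second; items 3, 6, 9 to the third.

Candidate 3-itemsets stored in the leaves: {1 4 5}, {1 3 6}, {1 5 9}, {1 2 4}, {1 2 5}, {2 3 4}, {3 4 5}, {3 5 6}, {3 5 7}, {3 6 7}, {3 6 8}, {4 5 7}, {4 5 8}, {5 6 7}, {6 8 9}.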

Parallel Formulation of Association Rules

Need:
– Huge transaction datasets (10s of TB)
– Large number of candidates

Data Distribution:
– Partition the transaction database, or
– Partition the candidates, or
– Both

Parallel Association Rules: Count Distribution (CD)

Each processor has the complete candidate hash tree.

Each processor updates its hash tree with local data.

Each processor participates in global reduction to get global counts of candidates in the hash tree.

Multiple database scans are required if the hash tree is too big to fit in the memory.

Page 35: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University.

CD: Illustration

Each of the processors P0, P1, and P2 holds N/p transactions and counts every candidate locally:

Candidate  P0  P1  P2
{1,2}      2   7   0
{1,3}      5   3   2
{2,3}      3   1   8
{3,4}      7   1   2
{5,8}      2   9   6

Global Reduction of Counts
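Count Distribution can be simulated in a single process as a rough sketch: each simulated processor counts every candidate against its own block of transactions, and the local counts are then summed, standing in for the global reduction. The three-way partition and the candidate list below are hypothetical, the brute-force subset test replaces the hash tree, and the transactions are reused from the Support and Confidence slide.

```python
from collections import Counter


def local_counts(transactions, candidates):
    """On one processor: count local transactions containing each candidate."""
    counts = Counter({c: 0 for c in candidates})
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def count_distribution(partitions, candidates):
    """Every processor counts all candidates on its own data block; the
    per-processor counts are then summed (the global reduction step)."""
    total = Counter()
    for part in partitions:
        total.update(local_counts(part, candidates))
    return total


transactions = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Beer", "Diaper", "Bread", "Eggs"}),
    frozenset({"Beer", "Coke", "Diaper", "Milk"}),
    frozenset({"Beer", "Bread", "Diaper", "Milk"}),
    frozenset({"Coke", "Bread", "Diaper", "Milk"}),
]
candidates = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Beer", "Diaper"}),
    frozenset({"Coke", "Milk"}),
]

# Hypothetical split of the transactions across three simulated processors.
partitions = [transactions[0:2], transactions[2:4], transactions[4:5]]
global_counts = count_distribution(partitions, candidates)
print({tuple(sorted(c)): n for c, n in global_counts.items()})
```

Only the counts cross processor boundaries, never the transactions, but every processor must hold the full candidate set, which is the memory limitation the last bullet of the CD slide points to.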

Parallel Association Rules: Data Distribution (DD)

Candidate set is partitioned among the processors.

Once local data has been partitioned, it is broadcast to all other processors.

High communication cost due to data movement.

Redundant work due to multiple traversals of the hash trees.

DD: Illustration

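Figure: processors P0, P1, and P2 each hold N/p local transactions and receive the remote data of the other processors (Data Broadcast). Each processor counts only its own candidates:

Processor  Candidates (count)
P0         {1,2} (9), {1,3} (10)
P1         {2,3} (12), {3,4} (10)
P2         {5,8} (17)

All-to-All Broadcast of Candidates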

Parallel Association Rules: Intelligent Data Distribution (IDD)

Data Distribution using point-to-point communication.

Intelligent partitioning of candidate sets.
– Partitioning based on the first item of candidates.
– Bitmap to keep track of local candidate items.

Pruning at the root of candidate hash tree using the bitmap.

Suitable for single data source such as database server.

With smaller candidate set, load balancing is difficult.

IDD: Illustration

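Figure: processors P0, P1, and P2 each hold N/p local transactions and receive remote data passed from processor to processor (Data Shift). Each processor counts only its own candidates and keeps a bitmask of their first items:

Processor  Bitmask  Candidates (count)
P0         1        {1,2} (9), {1,3} (10)
P1         2,3      {2,3} (12), {3,4} (10)
P2         5        {5,8} (17)

All-to-All Broadcast of Candidates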

Filtering Transactions in IDD

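Figure: the candidate hash tree (candidates {1 4 5}, {1 3 6}, {1 5 9}, {1 2 4}, {1 2 5}, {2 3 4}, {3 4 5}, {3 5 6}, {3 5 7}, {3 6 7}, {3 6 8}, {4 5 7}, {4 5 8}, {5 6 7}, {6 8 9}) traversed with transaction 1 2 3 5 6 on a processor whose bitmask is 1,3,5.

At the root the transaction expands into 1 + 2 3 5 6, 2 + 3 5 6, and 3 + 5 6. The expansion 2 + 3 5 6 is skipped because item 2 is not in the bitmask.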
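The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: candidates are assigned to processors round-robin by first item (a simple stand-in for whatever load-balancing assignment is actually used), a plain set lookup replaces the hash tree, and the bitmap is a Python set of first items. With two simulated processors the first one ends up with first items 1, 3, and 5, matching the bitmask in the figure.

```python
from itertools import combinations


def partition_by_first_item(candidates, num_procs):
    """Assign each candidate (a sorted tuple) to a processor by its first item.
    Round-robin over the distinct first items is an assumption of this sketch."""
    first_items = sorted({c[0] for c in candidates})
    owner = {item: i % num_procs for i, item in enumerate(first_items)}
    parts = [[] for _ in range(num_procs)]
    for c in candidates:
        parts[owner[c[0]]].append(c)
    return parts


def count_local(transaction, local_candidates, size):
    """Return local candidates contained in the transaction, skipping every
    starting item that is not in this processor's bitmap."""
    bitmap = {c[0] for c in local_candidates}
    wanted = set(local_candidates)
    hits = {}
    items = sorted(transaction)
    for pos, first in enumerate(items):
        if first not in bitmap:
            continue  # pruned at the root: no local candidate starts with this item
        for rest in combinations(items[pos + 1:], size - 1):
            cand = (first,) + rest
            if cand in wanted:
                hits[cand] = hits.get(cand, 0) + 1
    return hits


candidates = [
    (1, 4, 5), (1, 3, 6), (3, 4, 5), (3, 6, 7), (3, 6, 8),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (2, 3, 4), (5, 6, 7),
    (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
]
parts = partition_by_first_item(candidates, num_procs=2)
hits = count_local({1, 2, 3, 5, 6}, parts[0], size=3)
print(sorted(hits))
```

For the transaction 1 2 3 5 6, the expansions starting at items 2 and 6 are never generated on this processor, so the redundant traversal work seen in DD is avoided.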

Parallel Association Rules: Hybrid Distribution (HD)

Candidate set is partitioned into G groups to just fit in main memory.
– Ensures good load balance with smaller candidate set.

Logical processor mesh G x P/G is formed.

Perform IDD along the column processors.
– Data movement among processors is minimized.

Perform CD along the row processors.
– Smaller number of processors in the global reduction operation.

HD: Illustration

Figure: a logical processor mesh with G groups of processors and P/G processors per group. Each processor holds N/P transactions and one candidate partition (C0, C1, or C2); each column of the mesh covers N/(P/G) transactions.

Figure labels: IDD along Columns; CD along Rows; All-to-All Broadcast of Candidates.

Parallel Association Rules: Experimental Setup

128-processor Cray T3E
– 600 MHz DEC Alpha (EV4)
– 512MB of main memory per processor
– 3-D torus interconnection network with peak unidirectional bandwidth of 430 MB/sec.

MPI used for communications.

Synthetic data set: avg transaction size 15 and 1000 distinct items.

For larger data sets, multiple reads of transactions in blocks of 1000.

HD switches to CD after 90.7% of the total computation is done.

Scaleup Results (50K, 0.1%)

Speedup Results (N=1.3 million, M=0.7 million)

Sizeup Results (P=64, M=0.7 million)

Response Time with Varying Candidate Size (P=64, N=1.3 million)

SP2 Response Time with Varying Candidate Size (P=64, N=1.3 million)

Parallel Association Rules: Summary of Experiments

HD shows the same linear speedup and sizeup behavior as that of CD.

HD Exploits Total Aggregate Main Memory, while CD does not.

IDD has much better scaleup behavior than DD

Summary

Data mining is a rapidly growing field.

Fueled by enormous data collection rates, and the need for intelligent analysis for business and scientific gains.

The large size and high-dimensional nature of the data require new analysis techniques and algorithms.

Scalable, fast parallel algorithms are becoming indispensable.

Many research and commercial opportunities!!!

Collaborators

George Karypis and Vipin Kumar

Department of Computer Science and Engineering
Army High Performance Computing Research Center
University of Minnesota

Anurag Srivastava

Digital Impact

Vineet Singh

Hewlett Packard Laboratories

http://www.cs.umn.edu/~han

