SPRINT : A Scalable Parallel Classifier for Data Mining
John Shafer, Rakesh Agrawal, Manish Mehta
PATHWAY
Terms
Partition Algorithm
Data Structures
Performing Split
Serial SPRINT
Parallel SPRINT
Results
Terms
Training Data Set
Attributes : Categorical and Continuous
Class Label
Partition Algorithm
Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate split on attribute A
    find best split
    partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
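The recursion above can be sketched in Python on the six-record example from these slides. This is a minimal illustration, not the authors' implementation: the helper names `candidate_tests` and `passes`, the nested-tuple tree representation, and the one-value-versus-rest test for categorical attributes are all assumptions made for brevity. It also re-filters rows at each node, whereas SPRINT works on pre-sorted attribute lists.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_tests(rows, attr):
    """Hypothetical helper: midpoints for numeric attributes, equality tests otherwise."""
    values = sorted({r[attr] for r in rows})
    if all(isinstance(v, (int, float)) for v in values):
        return [("<", (a + b) / 2) for a, b in zip(values, values[1:])]
    return [("==", v) for v in values]  # simplification: one value vs. the rest

def passes(row, attr, op, arg):
    return row[attr] < arg if op == "<" else row[attr] == arg

def partition(rows, attrs, label="Risk"):
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]  # all points in the same class: leaf
    best = None
    for attr in attrs:  # evaluate splits on every attribute
        for op, arg in candidate_tests(rows, attr):
            s1 = [r for r in rows if passes(r, attr, op, arg)]
            s2 = [r for r in rows if not passes(r, attr, op, arg)]
            if not s1 or not s2:
                continue
            g = (gini([r[label] for r in s1]) * len(s1)
                 + gini([r[label] for r in s2]) * len(s2)) / len(rows)
            if best is None or g < best[0]:
                best = (g, attr, op, arg, s1, s2)
    if best is None:  # no usable split: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    _, attr, op, arg, s1, s2 = best
    return (attr, op, arg, partition(s1, attrs, label), partition(s2, attrs, label))

# Training set from the example slide (rows in RID order 0..5).
training = [
    {"Age": 23, "CarT": "Family", "Risk": "High"},
    {"Age": 17, "CarT": "Sports", "Risk": "High"},
    {"Age": 43, "CarT": "Sports", "Risk": "High"},
    {"Age": 68, "CarT": "Family", "Risk": "Low"},
    {"Age": 32, "CarT": "Truck", "Risk": "Low"},
    {"Age": 20, "CarT": "Family", "Risk": "High"},
]
tree = partition(training, ["Age", "CarT"])
```

Each internal node is returned as `(attribute, operator, argument, subtree_if_true, subtree_if_false)`, and each leaf as a class label.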
Data Structures
Attribute Lists
Histograms : Continuous and Categorical
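Finding Split Point
Gini(S) = 1 – Sum( Pj*Pj ), where Pj is the relative frequency of class j in S
Gini Index(S) = Gini(S1)*n1/n + Gini(S2)*n2/n, where S1 and S2 hold n1 and n2 of the n records in S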
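As a worked check of the two formulas, the following sketch computes them for the class labels of the six-record example, split at Age < 27.5. The function names `gini` and `gini_split` are illustrative choices, not names from the source.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_j Pj^2, with Pj the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(s1, s2):
    """Weighted index of a binary split: Gini(S1)*n1/n + Gini(S2)*n2/n."""
    n1, n2 = len(s1), len(s2)
    n = n1 + n2
    return gini(s1) * n1 / n + gini(s2) * n2 / n

# Class labels of the records with Age < 27.5 and Age >= 27.5.
below = ["High", "High", "High"]
above = ["Low", "High", "Low"]

whole = gini(below + above)       # 1 - (4/6)^2 - (2/6)^2 = 4/9
index = gini_split(below, above)  # 0 * 3/6 + (4/9) * 3/6 = 2/9
```

The split lowers the index from 4/9 to 2/9 because one side becomes pure.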
Split on Continuous Attributes
Threshold value : Cabove and Cbelow
Sorted Once and Sequential Scan
Deallocation of Cabove and Cbelow
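A minimal sketch of the sequential scan, assuming candidate thresholds are taken as midpoints between consecutive distinct values (consistent with the Age < 27.5 split shown on the slides). `best_continuous_split` and `gini_counts` are hypothetical names. The list is sorted once up front, and the two histograms are updated incrementally as the cursor advances.

```python
def gini_counts(counts):
    """Gini index computed from a class histogram {class: count}."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attr_list, classes=("High", "Low")):
    """attr_list: (value, class, rid) tuples, already sorted by value."""
    c_below = {c: 0 for c in classes}   # records already scanned
    c_above = {c: 0 for c in classes}   # records not yet scanned
    for _, label, _ in attr_list:
        c_above[label] += 1
    n = len(attr_list)
    best_gini, best_thr = float("inf"), None
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        c_below[label] += 1             # move one record from above to below
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue                    # no threshold separates equal values
        n1 = i + 1
        g = gini_counts(c_below) * n1 / n + gini_counts(c_above) * (n - n1) / n
        if g < best_gini:
            best_gini, best_thr = g, (value + next_value) / 2
    return best_gini, best_thr

# Age attribute list from the slides.
age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
best_gini, best_thr = best_continuous_split(age_list)
```

On this list the scan selects the threshold 27.5, the midpoint between ages 23 and 32.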
Split on Categorical Attributes
Create Count-Matrix
All subsets of attribute values as possible split point
Compute Gini Index
Gini from Count Matrix only
Memory deallocation
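The count-matrix approach can be sketched as follows: one scan of the attribute list builds the matrix, and every subset is then scored from the matrix alone. The names `count_matrix` and `best_categorical_split` are assumptions for illustration, and the exhaustive enumeration shown here grows exponentially with the number of distinct values.

```python
from itertools import combinations

def gini_counts(counts):
    """Gini index computed from a class histogram {class: count}."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def count_matrix(attr_list):
    """One scan of (value, class, rid) entries -> {value: {class: count}}."""
    matrix = {}
    for value, label, _ in attr_list:
        row = matrix.setdefault(value, {})
        row[label] = row.get(label, 0) + 1
    return matrix

def best_categorical_split(matrix):
    """Score every proper, non-empty subset of values using only the matrix."""
    values = sorted(matrix)
    classes = sorted({c for row in matrix.values() for c in row})
    total = {c: sum(matrix[v].get(c, 0) for v in values) for c in classes}
    n = sum(total.values())
    best_gini, best_subset = float("inf"), None
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = {c: sum(matrix[v].get(c, 0) for v in subset) for c in classes}
            right = {c: total[c] - left[c] for c in classes}
            n1 = sum(left.values())
            g = gini_counts(left) * n1 / n + gini_counts(right) * (n - n1) / n
            if g < best_gini:
                best_gini, best_subset = g, set(subset)
    return best_gini, best_subset

# CarT attribute list from the slides.
cart_list = [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]
matrix = count_matrix(cart_list)
best_gini, best_subset = best_categorical_split(matrix)
```

Because the attribute list is never revisited after the matrix is built, the matrix is the only structure that must stay in memory while subsets are evaluated.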
Perform Split and Partitioning
Select splitting attribute and splitting value
Create two child nodes and divide data on RIDs
Optimization using Hashing <RID,child-ptr>
Optimization depending on number of RIDs
Partitioned Hashing for large hash-table
Create new histogram and count-matrix of children
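A rough sketch of the RID-based partitioning step, assuming an in-memory Python dict stands in for the <RID,child-ptr> hash table. The function name `split_attribute_lists` and the 0/1 child encoding are illustrative; partitioned hashing for tables that do not fit in memory is not shown.

```python
def split_attribute_lists(attr_lists, split_attr, goes_left):
    """attr_lists: {name: [(value, class, rid), ...]}; returns (left, right) lists."""
    # Pass 1: scan the splitting attribute's list and record rid -> child.
    rid_to_child = {}
    for value, _, rid in attr_lists[split_attr]:
        rid_to_child[rid] = 0 if goes_left(value) else 1
    # Pass 2: route every entry of every list by probing the hash table.
    # Appending in scan order keeps each child's list in its original order,
    # so sorted continuous lists stay sorted without re-sorting.
    children = ({name: [] for name in attr_lists},
                {name: [] for name in attr_lists})
    for name, entries in attr_lists.items():
        for entry in entries:
            children[rid_to_child[entry[2]]][name].append(entry)
    return children

# Attribute lists of the root node from the slides.
lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarT": [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)],
}
left, right = split_attribute_lists(lists, "Age", lambda age: age < 27.5)
```

The CarT list carries no Age values, so the hash table on RIDs is what tells each of its entries which child it belongs to.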
Parallel SPRINT
Environment : Shared nothing
Data placement and workload balancing
Parallel computation of categorical attribute lists
Repartition of Continuous Attributes
Global Sort
Equal re-partitioning
Relation between Cabove and Cbelow and processor number
Parallel computation of split index
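The relation between the histograms and the processor number can be illustrated with a single-process simulation. It assumes each processor holds a contiguous section of the globally sorted list, starts Cbelow from the class counts of all lower-numbered processors, and starts Cabove with the remainder. The function names, the list-of-sections representation, and the passing of the next section's first value (so that a boundary midpoint can be formed) are assumptions of this sketch, not details from the slides.

```python
def gini_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def local_counts(section, classes):
    counts = {c: 0 for c in classes}
    for _, label, _ in section:
        counts[label] += 1
    return counts

def scan_section(section, c_below, c_above, next_value):
    """One processor's scan; histograms start from the exchanged counts."""
    c_below, c_above = dict(c_below), dict(c_above)
    n = sum(c_below.values()) + sum(c_above.values())
    best = (float("inf"), None)
    for i, (value, label, _) in enumerate(section):
        c_below[label] += 1
        c_above[label] -= 1
        nxt = section[i + 1][0] if i + 1 < len(section) else next_value
        if nxt is None or nxt == value:
            continue  # end of the global list, or equal neighbouring values
        n1 = sum(c_below.values())
        g = gini_counts(c_below) * n1 / n + gini_counts(c_above) * (n - n1) / n
        if g < best[0]:
            best = (g, (value + nxt) / 2)
    return best

def parallel_best_split(sections, classes=("High", "Low")):
    counts = [local_counts(s, classes) for s in sections]  # exchanged among processors
    total = {c: sum(k[c] for k in counts) for c in classes}
    results = []
    for p, section in enumerate(sections):
        c_below = {c: sum(k[c] for k in counts[:p]) for c in classes}
        c_above = {c: total[c] - c_below[c] for c in classes}
        next_value = next((s[0][0] for s in sections[p + 1:] if s), None)
        results.append(scan_section(section, c_below, c_above, next_value))
    return min(results, key=lambda r: r[0])  # global best over local bests

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
best_gini, best_thr = parallel_best_split([age_list[:3], age_list[3:]])
```

With two simulated processors holding three records each, the result agrees with a serial scan of the whole list.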
Split point for Categorical Attributes
Create global matrix at coordinator
Compute split-index
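Building the global matrix amounts to summing the local count matrices cell by cell. The sketch below simulates this in one process; `local_count_matrix` and `global_count_matrix` are hypothetical names, and a `Counter` keyed by (value, class) stands in for the matrix.

```python
from collections import Counter

def local_count_matrix(section):
    """Count matrix for one processor's section of a categorical attribute list."""
    matrix = Counter()
    for value, label, _ in section:
        matrix[(value, label)] += 1
    return matrix

def global_count_matrix(local_matrices):
    """Coordinator step: add up the local matrices cell by cell."""
    total = Counter()
    for matrix in local_matrices:
        total.update(matrix)  # Counter.update adds counts
    return total

cart_list = [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]
# Two simulated processors, three records each.
locals_ = [local_count_matrix(cart_list[:3]), local_count_matrix(cart_list[3:])]
global_matrix = global_count_matrix(locals_)
```

Only the small matrices travel to the coordinator, not the attribute lists, and the split index is then computed from the global matrix.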
Partitioning
Collect RIDs of splitting attributes from processors
Exchange RIDs
Splitting attribute lists on Age < 27.5

Attribute lists for node 0
Age Class Rid
17 High 1
20 High 5
23 High 0
32 Low 4
43 High 2
68 Low 3

CarT Class Rid
Family High 0
Sport High 1
Sport High 2
Family Low 3
Truck Low 4
Family High 5

Attribute lists for node 1 (Age < 27.5)
Age Class Rid
17 High 1
20 High 5
23 High 0

CarT Class Rid
Family High 0
Sport High 1
Family High 5

Attribute lists for node 2 (Age >= 27.5)
Age Class Rid
32 Low 4
43 High 2
68 Low 3

CarT Class Rid
Sport High 2
Family Low 3
Truck Low 4
Attribute List
Age Class Rid
17 High 1
20 High 5
23 High 0
32 Low 4
43 High 2
68 Low 3

Position 0
        H L
Cbelow  0 0
Cabove  4 2

Position 3
        H L
Cbelow  3 0
Cabove  1 2
Attribute List
CarT Class Rid
Family High 0
Sport High 1
Sport High 2
Family Low 3
Truck Low 4
Family High 5

Count Matrix
        H L
family  2 1
sport   2 0
truck   0 1
Breakdown of Response Time
Scaleup of SPRINT
Speedup of SPRINT
Sizeup of SPRINT
Example: Decision Tree
Age CarT Risk
23 Family High
17 Sports High
43 Sports High
68 Family Low
32 Truck Low
20 Family High

Age < 25
    yes: High
    no: CarType=sports
        yes: High
        no: Low