Large Scale Hierarchical Classification: Foundations, Algorithms and Applications
Huzefa Rangwala and Azad Naik
Department of Computer Science
MLBio+ Laboratory
Fairfax, Virginia, USA
KDD Tutorial, Halifax, Canada
13th Aug, 2017
Huzefa Rangwala and Azad Naik George Mason University 13th Aug, 2017 1 / 117
Overview of Tutorial Coverage
Part - I
1 Introduction and Background
    Motivation
    Hierarchical Classification (HC) problem description
    Challenges
    Methods for solving HC
2 State-of-the-Art HC Approaches
    Parent-child regularization
    Cost-sensitive learning
    Package description / software demo
Overview of Tutorial Coverage
Part - II
1 Inconsistent Hierarchy
    Motivation
    Methods for resolving inconsistency
    Optimal hierarchy search in hierarchical space
2 Other HC Methods
    Learning using multiple hierarchies
    Extreme and deep classification
3 Conclusion
Motivation
Exponential growth in data (image, text, video) over time
Big data era - megabytes & gigabytes to terabytes & petabytes
Growth in almost all fields - astronomical, biological, web content
Data Organization
Organize data into a structure
tree, graph [LSHTC, BioASQ and ILSVRC challenges]
Useful in various applications
query search, browsing and categorizing products
Hierarchical Structure
Classes are organized into a hierarchical structure
Generic (↑) to specific (↓) categories in top-down order
Hierarchical Classification
Goal
Given a hierarchy of classes, exploit the hierarchical structure to learn models and classify unlabeled test examples (instances) to one or more nodes in the hierarchy
Solution
(i) Manual Classification
(ii) Automated Classification
Manual Classification
Requires human understanding and expertise
Infeasible for huge data
Automated Classification
A trained system (such as a computer) acts as the expert
Scalable for huge data
Challenges - I
Single label vs. multi-label
Single-label classification - each example belongs exclusively to one class
Multi-label classification - an example may belong to more than one class
Challenges - II
Mandatory leaf node vs. internal node prediction
Example may be assigned to internal nodes
Orphan node detection problem
Challenges - III
Rare categories
Many classes with very few labeled examples
More prevalent in large scale datasets - ≥70% of classes have ≤10 examples
Challenges - IV
Feature selection
Not all features are essential to discriminate between classes
Identifying discriminative features improves classification performance
Other Challenges
Parameter optimization - incorporate relationship (parent-child, sibling) information
Scalability - large numbers of classes, features and examples require distributed computation
Dataset | #Training examples | #Leaf nodes (classes) | #Features | #Parameters | Parameter size (approx.)
DMOZ-2010 | 128,710 | 12,294 | 381,580 | 4,652,986,520 | 18.5 GB
DMOZ-2012 | 383,408 | 11,947 | 348,548 | 4,164,102,956 | 16.5 GB
Inconsistent hierarchy - not suitable for classification (more details later)
Notation
n = # of training examples (instances)
D = dimension of each instance
N = set of nodes in the hierarchy
L = set of leaf nodes (classes)
C(t) = children of node t
π(t) = parent of node t
Classification
Training - learn a mapping function from the training data
Testing - predict the label of a test example
Learning Algorithm: General Formulation
Combination of two terms:
1 Empirical loss - controls how well the learnt model fits the training data
2 Regularization - prevents the model from over-fitting and encodes additional information such as hierarchical relationships
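A minimal sketch of this loss-plus-regularization template (the logistic loss, the squared-L2 regularizer, and all names and values here are illustrative choices, not the tutorial's):

```python
import numpy as np

def objective(w, X, y, lam):
    """General learning objective: empirical loss (here, logistic loss)
    measuring fit to the training data, plus a regularization term
    (here, squared L2) that combats over-fitting. Hierarchical methods
    replace the regularizer with one encoding class relationships."""
    margins = y * (X @ w)                            # y_i in {-1, +1}
    loss = float(np.sum(np.log1p(np.exp(-margins))))
    reg = 0.5 * lam * float(np.dot(w, w))
    return loss + reg

# at w = 0 every example contributes log(2) and the regularizer is 0
X = np.ones((4, 2))
y = np.array([1.0, -1.0, 1.0, -1.0])
val = objective(np.zeros(2), X, y, lam=0.1)
```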
Different Approaches for Solving HC Problem
Flat Classification Approach
Simplest method (ignores hierarchy)
Learn discriminant classifiers for each leaf node in the hierarchy
Unlabeled test example classified using the rule:

    ŷ = arg max_{y ∈ Y} f(x, y | w)
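The arg-max rule above can be sketched with a linear scorer f(x, y | w) = w_y · x (the linear form, class names and data here are assumptions for illustration):

```python
import numpy as np

def flat_predict(x, W, classes):
    """Flat classification: score x against every leaf-class model and
    return the arg-max class; the hierarchy is ignored entirely."""
    scores = W @ x              # f(x, y | w) = w_y . x for each class y
    return classes[int(np.argmax(scores))]

# toy setup: 3 leaf classes, 4 features
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))     # one weight vector per leaf class
x = rng.normal(size=4)
label = flat_predict(x, W, ["Soccer", "Tennis", "Software"])
```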
Local Classification Approach - I
Local Classifier per Node (LCN)
Learn binary classifiers for all non-root nodes
Goal is to effectively discriminate between the siblings
Top-down approach is followed for classifying unlabeled test examples
Local Classification Approach - II
Local Classifier per Parent Node (LCPN)
Learn multi-class classifiers for all non-leaf nodes
Like LCN, the goal is to effectively discriminate between the siblings
Top-down approach is followed for classifying unlabeled test examples
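The top-down prediction used by LCPN can be sketched as follows (the toy hierarchy and the per-parent linear scorers are assumptions for illustration):

```python
import numpy as np

def top_down_predict(x, root, children, models):
    """Top-down prediction: at each internal node, apply that node's
    multi-class model to pick the best-scoring child, and recurse
    until a leaf is reached. `children` maps node -> child list;
    `models` maps internal node -> (W, child_order), where row j of W
    scores x against child_order[j]."""
    node = root
    while children.get(node):                 # stop at a leaf
        W, child_order = models[node]
        node = child_order[int(np.argmax(W @ x))]
    return node

# toy 2-level hierarchy
children = {"root": ["Sports", "Computers"], "Sports": ["Soccer", "Tennis"]}
models = {
    "root":   (np.array([[1.0, 0.0], [0.0, 1.0]]), ["Sports", "Computers"]),
    "Sports": (np.array([[0.0, 1.0], [1.0, 0.0]]), ["Soccer", "Tennis"]),
}
pred = top_down_predict(np.array([1.0, 0.0]), "root", children, models)
```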
Local Classification Approach - III
Local Classifier per Level (LCL)
Learn multi-class classifiers for all levels in the hierarchy
Least popular among local approaches
Prediction inconsistency may occur across levels, hence a post-processing step is required
Global Classification Approach
Learn a global function considering all hierarchical relationships
Often referred to as the Big-Bang approach
Unlabeled test instance is classified using an approach similar to flat or local methods
Evaluation Metrics - I
Flat evaluation measures
Misclassifications treated equally
Common evaluation metrics:
Micro-F1 - gives equal weight to every example; dominated by common classes
Macro-F1 - gives equal weight to each class
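The difference between the two averages can be seen in a small worked example (the labels are made up; the computation follows the standard Micro/Macro-F1 definitions):

```python
from collections import Counter

def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools TP/FP/FN over all classes (frequent classes
    dominate); Macro-F1 averages per-class F1 (equal weight per class)."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(tp_, fp_, fn_):
        return 2 * tp_ / (2 * tp_ + fp_ + fn_) if tp_ else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

micro, macro = micro_macro_f1(
    ["Soccer", "Soccer", "Soccer", "Tennis", "Software"],
    ["Soccer", "Soccer", "Tennis", "Tennis", "Soccer"])
```

Here Micro-F1 rewards the frequent Soccer class, while the completely missed rare Software class drags Macro-F1 down.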
Evaluation Metrics - II
Hierarchical evaluation measures
Hierarchical distance between the true and predicted class is taken into consideration for performance evaluation
Common evaluation metrics:
Hierarchical-F1 - based on common ancestors between true and predicted class
Tree Error - average hierarchical distance between true and predicted class
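One common way to compute a hierarchical F1 from ancestor overlap is sketched below (the toy hierarchy is assumed; the exact definition used by a given benchmark may differ):

```python
def ancestor_set(node, parent):
    """The node itself plus all its ancestors, excluding the root,
    given a child -> parent map with parent[root] = None."""
    out = set()
    while parent.get(node) is not None:
        out.add(node)
        node = parent[node]
    return out

def hierarchical_f1(pairs, parent):
    """hF1 over (true, predicted) class pairs: micro-averaged F1 of
    the ancestor-augmented label sets, so predicting a sibling of the
    true class still earns partial credit for shared ancestors."""
    tp = pred_total = true_total = 0
    for t, p in pairs:
        ta, pa = ancestor_set(t, parent), ancestor_set(p, parent)
        tp += len(ta & pa)
        pred_total += len(pa)
        true_total += len(ta)
    prec, rec = tp / pred_total, tp / true_total
    return 2 * prec * rec / (prec + rec)

parent = {"root": None, "Sports": "root", "Soccer": "Sports",
          "Tennis": "Sports", "Software": "root"}
```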
Multi-Task Learning (MTL)
Involves joint training of multiple related tasks to improve generalization performance
Independent learning problems can utilize the shared knowledge
Exploits inductive biases that are helpful to all the related tasks
    similar set of parameters
    common feature space
Examples
    personal email spam classification - many users share the same spam
    automated driving - brakes and accelerator
Parent-child Regularization, Gopal and Yang, SIGKDD’13
Motivation
Traditional approaches learn a classifier for each leaf node (task) to discriminate one class from the others:

    min_{w_t} (1/2) ||w_t||_2^2 + C Σ_{i=1}^{n} [1 − Y_{it} w_t^T x_i]_+
Works well if:
    dataset is small
    classes are balanced
    sufficient positive examples per class to learn a generalized discriminant function
Drawbacks
    real-world datasets suffer from the rare-categories issue (remember: 70% of classes have fewer than 10 examples per class)
    large number of classes (scalability issue)
Motivation - II
Can we improve the performance of data-sparse leaf nodes by taking advantage of data-rich nodes at higher levels?
Incorporate inter-class dependencies to improve classification
    examples belonging to the Soccer category are less likely to belong to the Software category

    min_{w_t} (1/2) ||w_t − w_{π(t)}||_2^2 + C Σ_{k∈C(t)} Σ_{i=1}^{n} [1 − Y_{ik} w_t^T x_i]_+
Objective
How to effectively incorporate the hierarchical relationships into the objective function to improve generalization performance
Make it scalable to larger datasets
Proposed Formulation
Enforces model parameters (weights) to be similar to the parent's via the regularization term
Proposed state-of-the-art: HR-SVM and HR-LR global formulations
HR-SVM

    min_W Σ_{t∈N} (1/2) ||w_t − w_{π(t)}||_2^2 + C Σ_{k∈L} Σ_{i=1}^{n} [1 − Y_{ik} w_k^T x_i]_+

Internal Node

    min_{w_t} (1/2) ||w_t − w_{π(t)}||_2^2 + (1/2) Σ_{c∈C(t)} ||w_c − w_t||_2^2

Leaf Node

    min_{w_t} (1/2) ||w_t − w_{π(t)}||_2^2 + C Σ_{i=1}^{n} [1 − Y_{it} w_t^T x_i]_+
HR-LR Models
Similar formulation as HR-SVM
Logistic loss instead of hinge loss
HR-LR

    min_W Σ_{t∈N} (1/2) ||w_t − w_{π(t)}||_2^2 + C Σ_{k∈L} Σ_{i=1}^{n} log(1 + exp(−Y_{ik} w_k^T x_i))

Internal Node

    min_{w_t} (1/2) ||w_t − w_{π(t)}||_2^2 + (1/2) Σ_{c∈C(t)} ||w_c − w_t||_2^2

Leaf Node

    min_{w_t} (1/2) ||w_t − w_{π(t)}||_2^2 + C Σ_{i=1}^{n} log(1 + exp(−Y_{it} w_t^T x_i))
Proposed Parallel Implementation
Each node is independent of all other nodes except its neighbours
Objective function is block separable; therefore, parallel block coordinate descent (CD) can be used for optimization
1 Fix odd-level parameters, optimize even levels in parallel
2 Fix even-level parameters, optimize odd levels in parallel
3 Repeat until convergence
Extended to graphs by first finding a minimum graph coloring [NP-hard] and repeatedly optimizing nodes with the same color in parallel during each iteration
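A toy sketch of the odd/even alternation (sequential here; nodes of the same parity could be updated in parallel). Only the regularization part of the internal-node subproblem is used, whose per-node minimizer is the mean of the parent and child weights; leaf updates involving the loss term are omitted:

```python
import numpy as np

def block_cd_internal(parent, children, levels, W, iters=10):
    """Alternating-level block coordinate descent schedule: fix odd
    levels and update even levels, then vice versa. For the pure
    regularization objective (1/2)||w_t - w_parent||^2 +
    (1/2) sum_c ||w_c - w_t||^2, the per-node minimizer is the mean of
    the neighbor (parent + children) weight vectors."""
    for _ in range(iters):
        for parity in (0, 1):             # even levels first, then odd
            for lvl, nodes in enumerate(levels):
                if lvl % 2 != parity:
                    continue
                # nodes in same-parity levels have no edges between
                # them, so this inner loop could run in parallel
                for t in nodes:
                    nbrs = [W[c] for c in children.get(t, [])]
                    if parent.get(t) is not None:
                        nbrs.append(W[parent[t]])
                    if nbrs:
                        W[t] = np.mean(nbrs, axis=0)
    return W

# toy chain hierarchy: root -> a -> b
parent = {"root": None, "a": "root", "b": "a"}
children = {"root": ["a"], "a": ["b"]}
levels = [["root"], ["a"], ["b"]]
W = {"root": np.array([0.0]), "a": np.array([2.0]), "b": np.array([4.0])}
W = block_cd_internal(parent, children, levels, W, iters=20)
```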
Experiments
Dataset description
A wide range of single- and multi-label datasets with varying numbers of features and categories were used for model evaluation
Datasets | #Features | #Categories | Type | Avg #labels (per instance)
CLEF | 89 | 87 | Single-label | 1
RCV1 | 48,734 | 137 | Multi-label | 3.18
IPC | 541,869 | 552 | Single-label | 1
DMOZ-SMALL | 51,033 | 1,563 | Single-label | 1
DMOZ-2010 | 381,580 | 15,358 | Single-label | 1
DMOZ-2012 | 348,548 | 13,347 | Single-label | 1
DMOZ-2011 | 594,158 | 27,875 | Multi-label | 1.03
SWIKI-2011 | 346,299 | 50,312 | Multi-label | 1.85
LWIKI | 1,617,899 | 614,428 | Multi-label | 3.26
Table: Dataset statistics
Comparison Methods
Flat baselines
SVM - one-vs-rest binary support vector machines
LR - one-vs-rest regularized logistic regression
Hierarchical baselines
Top-down SVM (TD) [Liu et al., SIGKDD'05] - a pachinko-machine style SVM
Hierarchical SVM (HSVM) [Tsochantaridis et al., JMLR'05] - a large-margin discriminative method with a path-dependent discriminant function
Hierarchical Orthogonal Transfer (OT) [Lin et al., ICML'11] - a large-margin method enforcing orthogonality between the parent and the children
Hierarchical Bayesian Logistic Regression (HBLR) [Gopal et al., NIPS'12] - a Bayesian method to model hierarchical dependencies among class labels using multivariate logistic regression
Flat Baselines Comparison - I
Figure: Performance improvement: HR-SVM vs. SVM
Flat Baselines Comparison - II
Figure: Performance improvement: HR-LR vs. LR
Hierarchical Baselines Comparison
Datasets | HR-SVM | HR-LR | TD | HSVM | OT | HBLR
CLEF | 80.02 | 80.12 | 70.11 | 79.72 | 73.84 | 81.41
RCV1 | 81.66 | 81.23 | 71.34 | NA | NS | NA
IPC | 54.26 | 55.37 | 50.34 | NS | NS | 56.02
DMOZ-SMALL | 45.31 | 45.11 | 38.48 | 39.66 | 37.12 | 46.03
DMOZ-2010 | 46.02 | 45.84 | 38.64 | NS | NS | NS
DMOZ-2012 | 57.17 | 53.18 | 55.14 | NS | NS | NS
DMOZ-2011 | 43.73 | 42.27 | 35.91 | NA | NS | NA
SWIKI-2011 | 41.79 | 40.99 | 36.65 | NA | NA | NA
LWIKI | 38.08 | 37.67 | NA | NA | NA | NA
[NA - Not Applicable; NS - Not Scalable]
Table: Micro-F1 performance comparison
Runtime Comparison - flat baselines
HR-SVM vs. SVM
HR-LR vs. LR
Runtime Comparison - hierarchical baselines
Datasets | HR-SVM | HR-LR | TD | HSVM | OT | HBLR
CLEF | 0.42 | 1.02 | 0.13 | 3.19 | 1.31 | 3.05
RCV1 | 0.55 | 11.74 | 0.21 | NA | NS | NA
IPC | 6.81 | 15.91 | 2.21 | NS | NS | 31.20
DMOZ-SMALL | 0.52 | 3.73 | 0.11 | 289.60 | 132.34 | 5.22
DMOZ-2010 | 8.23 | 123.22 | 3.97 | NS | NS | NS
DMOZ-2012 | 36.66 | 229.73 | 12.49 | NS | NS | NS
DMOZ-2011 | 58.31 | 248.07 | 16.39 | NA | NS | NA
SWIKI-2011 | 89.23 | 296.87 | 21.34 | NA | NA | NA
LWIKI | 2230.54 | 7282.09 | NA | NA | NA | NA
[NA - Not Applicable; NS - Not Scalable]
Table: Training runtime comparison (in mins)
Cost-sensitive Learning, Charuvaka & Rangwala, ECML’15
Motivation
Drawbacks of Recursive Regularization
    scalable, but more expensive to train than flat classification
    requires specialized implementation and communication between processing nodes
    does not deal with class imbalance directly
Objective
Decouple models so that they can be trained in parallel without dependencies between models
Account for class imbalance in the optimization framework
Hierarchical Regularization Re-examination - I
Hierarchical Regularization Re-examination - II
Opposing learning influences:
    loss term - the model for a node is forced to be dissimilar to all other nodes
    regularization term - the model is forced to be similar to its neighbors, with greater similarity to nearer neighbors
Resultant effect:
    mistakes on negative examples that come from near nodes are less severe than those coming from far nodes, while still taking advantage of the hierarchy
Cost-sensitive Loss
Consider the loss term for class t, which is separable over examples:

    Σ_i loss(y_i, w_t^T x_i)

Each loss value is multiplied by the importance of the example for this class:

    Σ_i loss(y_i, w_t^T x_i) × φ(t, y_i)

This is an example of "instance-based" cost-sensitive learning, with cost c_{ti} = φ(t, y_i)
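A sketch of the instance-weighted loss for one class, using logistic loss as the per-example loss (the names and data are illustrative):

```python
import numpy as np

def weighted_loss(w, X, y, costs):
    """Instance-based cost-sensitive loss for one class t: each
    example's logistic loss is scaled by its cost c_ti = phi(t, y_i),
    so examples deemed important for this class weigh more."""
    margins = y * (X @ w)                      # y_i in {-1, +1}
    per_example = np.log1p(np.exp(-margins))   # logistic loss per example
    return float(np.sum(costs * per_example))

# at w = 0 each example's loss is log(2); costs [1, 2] give 3*log(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
total = weighted_loss(np.zeros(2), X, y, costs=np.array([1.0, 2.0]))
```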
Hierarchical Costs
How to define costs based on hierarchy?
Tree Distance (TrD) - undirected graph distance between nodes
Number of Common Ancestors (NCA) - the number of ancestors common to the target class and the class label
Exponentiated Tree Distance (ExTrD) - squashes tree distance into a suitable range, tuned using validation
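The first two costs can be sketched with networkx (one of the HierCost prerequisites); the toy hierarchy here is assumed for illustration:

```python
import networkx as nx

# toy hierarchy: edges point parent -> child
edges = [("root", "Sports"), ("root", "Computers"),
         ("Sports", "Soccer"), ("Sports", "Tennis"),
         ("Computers", "Software")]
T = nx.Graph(edges)      # undirected view, for tree distance
D = nx.DiGraph(edges)    # directed view, for ancestor queries

def tree_distance(a, b):
    """TrD: undirected shortest-path distance between two nodes."""
    return nx.shortest_path_length(T, a, b)

def num_common_ancestors(a, b):
    """NCA: number of ancestors shared by the two classes."""
    return len(nx.ancestors(D, a) & nx.ancestors(D, b))
```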
Imbalance Costs
Using the same formulation of cost-sensitive learning, data imbalance can also be addressed:

    c_i = 1 + L / (1 + exp(n_i − n0))

where n_i = number of examples in class i, and n0, L are user-defined constants
Due to very large skew, inverse class size can result in extremely large weights; fix using the squashing function shown in the figure
Multiply to combine with hierarchical costs
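A sketch of the squashed imbalance cost (the overflow clamp and the constants n0, L below are illustrative choices, not values from the paper):

```python
import math

def imbalance_cost(n_i, n0=10.0, L=9.0):
    """Squashed imbalance cost c_i = 1 + L / (1 + exp(n_i - n0)):
    rare classes (n_i << n0) get a cost near 1 + L, common classes a
    cost near 1, avoiding the extreme weights that raw inverse class
    size would produce. The exponent is clamped to stay overflow-safe
    for very large classes."""
    return 1.0 + L / (1.0 + math.exp(min(n_i - n0, 50.0)))
```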
Experiments
Dataset
For comparison purposes, the same datasets are used as in the paper [Gopal and Yang, SIGKDD'13]
Comparison Methods
Flat baseline
    LR - one-vs-rest binary logistic regression used in the conventional flat classification setting
Hierarchical baselines
Top-down Logistic Regression (TD-LR) - one-vs-rest multi-class classifier trained at each internal node
HR-LR [Gopal and Yang, SIGKDD'13] - a recursive regularization approach based on hierarchical relationships
Results (Hierarchical Costs)
Datasets | Method | Micro-F1 (↑) | Macro-F1 (↑) | hF1 (↑) | TE (↓)
CLEF | LR | 79.82 | 53.45 | 85.24 | 0.994
CLEF | TrD | 80.02 | 55.51 | 85.39 | 0.984
CLEF | NCA | 80.02 | 57.48 | 85.34 | 0.986
CLEF | ExTrD | 80.22 | 57.55† | 85.34 | 0.982
DMOZ-SMALL | LR | 46.39 | 30.20 | 67.00 | 3.569
DMOZ-SMALL | TrD | 47.52‡ | 31.37‡ | 68.26 | 3.449
DMOZ-SMALL | NCA | 47.36‡ | 31.20‡ | 68.12 | 3.460
DMOZ-SMALL | ExTrD | 47.36‡ | 31.19‡ | 68.20 | 3.456
IPC | LR | 55.04 | 48.99 | 72.82 | 1.974
IPC | TrD | 55.24‡ | 50.20‡ | 73.21 | 1.954
IPC | NCA | 55.33‡ | 50.29‡ | 73.28 | 1.949
IPC | ExTrD | 55.31‡ | 50.29‡ | 73.26 | 1.951
RCV1 | LR | 78.43 | 60.37 | 80.16 | 0.534
RCV1 | TrD | 79.46‡ | 60.61 | 82.83 | 0.451
RCV1 | NCA | 79.74‡ | 60.76 | 83.11 | 0.442
RCV1 | ExTrD | 79.33‡ | 61.74† | 82.91 | 0.466
Table: Performance comparison of hierarchical costs
Results (Imbalance Costs)
Datasets | Method | Micro-F1 (↑) | Macro-F1 (↑) | hF1 (↑) | TE (↓)
CLEF | IMB + LR | 79.52 | 53.11 | 85.19 | 1.002
CLEF | IMB + TrD | 79.92 | 52.84 | 85.59 | 0.978
CLEF | IMB + NCA | 79.62 | 51.89 | 85.34 | 0.994
CLEF | IMB + ExTrD | 80.32 | 58.45 | 85.69 | 0.966
DMOZ-SMALL | IMB + LR | 48.55‡ | 32.72‡ | 68.62 | 3.406
DMOZ-SMALL | IMB + TrD | 49.03‡ | 33.21‡ | 69.41 | 3.334
DMOZ-SMALL | IMB + NCA | 48.87‡ | 33.27‡ | 69.37 | 3.335
DMOZ-SMALL | IMB + ExTrD | 49.03‡ | 33.34‡ | 69.54 | 3.322
IPC | IMB + LR | 55.04 | 49.00 | 72.82 | 1.974
IPC | IMB + TrD | 55.60‡ | 50.45† | 73.56 | 1.933
IPC | IMB + NCA | 55.33 | 50.29 | 73.28 | 1.949
IPC | IMB + ExTrD | 55.67‡ | 50.42 | 73.58 | 1.931
RCV1 | IMB + LR | 78.59‡ | 60.77 | 81.27 | 0.511
RCV1 | IMB + TrD | 79.63‡ | 61.04 | 83.13 | 0.435
RCV1 | IMB + NCA | 79.61 | 61.04 | 82.65 | 0.458
RCV1 | IMB + ExTrD | 79.22 | 61.33 | 82.89 | 0.469
Table: Performance comparison with imbalance cost included
Results (our best with other methods)
Datasets | Method | Micro-F1 (↑) | Macro-F1 (↑) | hF1 (↑) | TE (↓)
CLEF | TD-LR | 73.06 | 34.47 | 79.32 | 1.366
CLEF | LR | 79.82 | 53.45 | 85.24 | 0.994
CLEF | HR-LR | 80.12 | 55.83 | NA | NA
CLEF | HierCost | 80.32 | 58.45† | 85.69 | 0.966
DMOZ-SMALL | TD-LR | 40.90 | 24.15 | 69.99 | 3.147
DMOZ-SMALL | LR | 46.39 | 30.20 | 67.00 | 3.569
DMOZ-SMALL | HR-LR | 45.11 | 28.48 | NA | NA
DMOZ-SMALL | HierCost | 49.03‡ | 33.34‡ | 69.54 | 3.322
IPC | TD-LR | 50.22 | 43.87 | 69.33 | 2.210
IPC | LR | 55.04 | 48.99 | 72.82 | 1.974
IPC | HR-LR | 55.37 | 49.60 | NA | NA
IPC | HierCost | 55.67‡ | 50.42† | 73.58 | 1.931
RCV1 | TD-LR | 77.85 | 57.80 | 88.78 | 0.524
RCV1 | LR | 78.43 | 60.37 | 80.16 | 0.534
RCV1 | HR-LR | 81.23 | 55.81 | NA | NA
RCV1 | HierCost | 79.22‡ | 61.33 | 82.89 | 0.469
Table: Performance comparison of HierCost with other baseline methods
Runtime comparison
Datasets | TD-LR | LR | HierCost
CLEF | <1 | <1 | <1
DMOZ-SMALL | 4 | 41 | 40
IPC | 27 | 643 | 453
RCV1 | 20 | 29 | 48
DMOZ-2010 | 196 | 15,191 | 20,174
DMOZ-2012 | 384 | 46,044 | 50,253
Table: Total training runtimes (in mins)
Demo of Software
Freely available for research and education purposes at:
https://cs.gmu.edu/∼mlbio/HierCost/
Software: implemented in Python using the scikit-learn machine learning and svmlight-loader packages
Other prerequisite packages:
    numpy
    scipy
    networkx
    pandas
Command Line Interface (CLI) Options
Two main CLI scripts are exposed for easy training and classification
train.py
    -d : train data file path
    -t : hierarchy file path
    -m : path to save learned model parameters
    -f : number of features
    -r : regularization parameter (> 0; default = 1)
    -i : to incorporate imbalance cost
    -c : cost function to use (lr, trd, nca, etrd) (default = 'lr')
    -u : for multi-label classification (default = single-label classification)
    -n : set of nodes to train (for parallelization)
predict.py
    -p : path to save predictions for test examples
    -m, -d, -t, -f, -u : similar functionality as in train.py
Part II
Inconsistent Hierarchy (30-minute break)
Motivation
Predefined Hierarchy
Hierarchy defined by the domain experts
Reflects the human view of the domain - may not be optimal for machine learning classification algorithms
Motivation
Flat Classification
Works well for well-balanced datasets with a smaller number of categories
Expensive training/prediction cost
Hierarchical Classification
Performs well for rare categories by leveraging hierarchical structure
Computationally efficient
Preferable for large-scale datasets
Some benchmark datasets show good performance with flat methods (and their variants). Can we improve upon that using hierarchical settings?
Case Study: NG dataset
Different hierarchical structures result in completely different classification performance
[Figure: three candidate hierarchies, with µF1 = 77.04 / MF1 = 77.94, µF1 = 79.42 / MF1 = 79.82, and µF1 = 81.24 / MF1 = 81.94]
Well-known HC Methods
Parent-child regularization [Gopal and Yang, KDD’13]
[HR-LR]    min_W Σ_{k∈N} (1/2) ||w_k − w_{π(k)}||_2^2 + C Σ_{l∈L} Σ_{i=1}^{n} log(1 + exp(−y_i^l w_l^T x_i))
Cost-sensitive learning [Charuvaka and Rangwala, ECML’15]
[HierCost]    min_{w_l} (1/2) ||w_l||_2^2 + C Σ_{i=1}^{n} σ_i log(1 + exp(−y_i^l w_l^T x_i))
HC methods use the hierarchical structure; performance can deteriorate if the hierarchy used is not consistent.
Reason for Inconsistencies within Predefined Hierarchy - I
Hierarchy is designed for the sole purpose of easy search and navigation, without taking classification into consideration
Hierarchy is created based on semantics, which is independent of the data, whereas classification depends on data characteristics such as term frequency
"Our expectation: a data-driven hierarchy can be much more powerful"
A priori, it is not clear to domain experts when to generate new nodes (hierarchy expansion) or merge two or more nodes (link creation) in the hierarchy
Reason for Inconsistencies within Predefined Hierarchy - II
Given a list of categories, different experts may come up with different hierarchies with completely different classification results
A large number of classes with confusing labels poses a unique challenge for the manual design of a good hierarchy
Reason for Inconsistencies within Predefined Hierarchy - III
Dynamic changes can affect hierarchical relationships - "Flood" is a sub-group of the geography class, but during the Chennai flood it became political news
What we want?
“Predefined Hierarchy”
↓“Data-driven Hierarchy”
(for improving classification performance)
Literature Overview
Flattening Strategy
Motivation
For large-scale datasets, top-down (TD) hierarchical models are preferred over flat models due to computational benefits (training and prediction time)
TD model performance suffers from error propagation, i.e., compounding of errors from misclassifications at higher levels, which cannot be rectified at the lower levels
Objective
Modify the predefined hierarchy by removing (flattening) inconsistent nodes to improve the classification performance of TD models
Reduces top-down error propagation, since fewer decisions are needed to classify an unlabeled example
Flatten Hierarchies, Wang and Lu, ICDIM’10
Level Flattening Techniques
Single or multiple levels within the hierarchy are flattened
Based on the level(s) flattened, various methods exist, e.g., TLF, BLF, MLF
Drawback - all nodes in the level(s) are identified as inconsistent, which may not be true, resulting in poor classification performance
Selected Inconsistent Node Removal (Flattening)
Rather than flattening entire level(s), only a subset of the inconsistent nodes is removed from the hierarchy
Criteria to decide inconsistent nodes - degree of error made at the node, margin-based or learning-based strategies
Comparatively better performance than level-flattening methods
Inconsistent Node Removal - I
Local Approach for INR (Level-INR)
An inconsistent set of nodes is determined for each level based on loss function values (such as logistic loss) obtained for nodes at that level
Criterion for flattening nodes - mean and standard deviation per level
Different levels have different thresholds for node flattening
Inconsistent Node Removal - II
Global Approach for INR (Global-INR)
Inconsistent nodes are determined by considering the loss function of all internal nodes in the hierarchy
Criterion for flattening nodes - mean and standard deviation over all nodes
All levels share the same threshold for node flattening
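The Global-INR criterion can be sketched as a mean-plus-k-standard-deviations threshold over all internal-node losses (the multiplier k is an assumed tuning knob for illustration):

```python
import statistics

def global_inr(node_losses, k=1.0):
    """Global-INR sketch: flag an internal node as inconsistent (to be
    flattened) when its loss exceeds mean + k * stdev computed over
    ALL internal nodes, i.e., one shared threshold for every level."""
    losses = list(node_losses.values())
    mu = statistics.mean(losses)
    sigma = statistics.stdev(losses)
    return {n for n, l in node_losses.items() if l > mu + k * sigma}
```

Level-INR would apply the same rule per level, computing the mean and standard deviation only over nodes at that level.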
Comparison Methods
One-vs-rest models are trained for each node (except the root) in the hierarchy
Predictions are made starting from the root node and recursively selecting the best child node until a leaf node is reached
Hierarchical baselines
Top-down Logistic Regression (TD-LR) - predefined hierarchy used for training the models
Level Flattening [Wang and Lu, ICDIM'10] - flattened hierarchy used for training the models; based on the level(s) flattened we have:
    TLF - top level flattened
    BLF - bottom level flattened
    MLF - multiple levels flattened
MTA [Babbar et al., NIPS'13] - hierarchy is modified using the margin value computed at each node
Comparison Against Other Flattening Methods
Level-INR and Global-INR performed comparatively better than the other approaches
The global approach has better performance than the local approach
Name | Metric | TLF | BLF | MLF | MTA | Level-INR | Global-INR
CLEF | µF1 (↑) | 75.84 | 73.76 | X | 74.48 | 75.25 | 77.14 (M)
CLEF | MF1 (↑) | 38.45 | 40.93 | X | 39.53 | 39.89 | 46.54 (N)
DIATOMS | µF1 (↑) | 56.93 | 53.27 | X | 58.36 | 58.32 | 61.31 (N)
DIATOMS | MF1 (↑) | 45.17 | 44.30 | X | 45.21 | 48.77 | 51.85 (N)
IPC | µF1 (↑) | 51.28 | 50.36 | X | 51.36 | 50.40 | 52.30 (M)
IPC | MF1 (↑) | 44.99 | 43.74 | X | 42.80 | 43.26 | 45.65 (M)
DMOZ-SMALL | µF1 (↑) | 45.48 | 44.34 | 45.80 | 46.01 | 45.43 | 46.61 (M)
DMOZ-SMALL | MF1 (↑) | 30.60 | 30.94 | 30.62 | 30.82 | 30.34 | 31.86 (N)
DMOZ-2010 | µF1 (↑) | 41.32 | 40.34 | 41.77 | 41.82 | 40.71 | 42.37
DMOZ-2010 | MF1 (↑) | 29.05 | 28.41 | 29.11 | 29.18 | 28.66 | 30.41
DMOZ-2012 | µF1 (↑) | 50.32 | 50.11 | 48.05 | 50.31 | 49.90 | 50.64
DMOZ-2012 | MF1 (↑) | 29.89 | 29.73 | 27.65 | 30.04 | 30.52 | 30.58
N (and M) indicates that improvements are statistically significant withp-value <0.01 (and <0.05).
Comparison Against Flat Method
Name | #Train examples per class | Global-INR MF1 | Global-INR hF1 | LR MF1 | LR hF1
DMOZ-SMALL | ≤5 | 28.77 | 51.86 | 27.02 | 46.81
DMOZ-SMALL | 6-10 | 55.55 | 67.47 | 54.76 | 65.40
DMOZ-SMALL | 11-50 | 72.26 | 78.74 | 72.60 | 80.12
DMOZ-SMALL | >50 | 69.43 | 86.70 | 71.44 | 88.95
DMOZ-SMALL | avg. | 31.86 (M) | 63.37 | 30.80 | 60.87
DMOZ-2010 | ≤5 | 18.23 | 53.59 | 14.35 | 48.13
DMOZ-2010 | 6-10 | 23.03 | 55.76 | 22.62 | 51.84
DMOZ-2010 | 11-50 | 42.56 | 62.39 | 43.26 | 61.85
DMOZ-2010 | >50 | 70.74 | 77.51 | 73.20 | 81.51
DMOZ-2010 | avg. | 28.41 (M) | 56.17 | 27.06 | 53.94
DMOZ-2012 | ≤5 | 10.28 | 50.56 | 8.78 | 48.01
DMOZ-2012 | 6-10 | 20.37 | 50.71 | 18.84 | 48.82
DMOZ-2012 | 11-50 | 37.19 | 73.16 | 37.98 | 73.24
DMOZ-2012 | >50 | 53.20 | 79.73 | 55.72 | 84.92
DMOZ-2012 | avg. | 29.14 (N) | 68.24 | 27.04 | 66.45
Runtime Comparison (in mins)
Training Time
Method | CLEF | DIATOMS | IPC | DMOZ-SMALL | DMOZ-2010 | DMOZ-2012
Global-INR | 3 | 10 | 830 | 68 | 25,462 | 63,000
LR | 1 | 3 | 658 | 46 | 15,248 | 46,124
Prediction Time
Drawbacks of Flattening Strategy
The flattening strategy, although useful up to a certain extent, has a few limitations
Inability to deal with inconsistencies across different branches of the hierarchy
A rewiring strategy can be used to resolve inconsistencies that occur across different branches
Hierarchy Adjustment using Elementary Operations - I, Tang et al., SIGKDD'06
Elementary operations: promote, demote, merge
Hierarchy Adjustment using Elementary Operation - II
Assumption: the optimal hierarchy lies in the neighborhood of the predefined taxonomy
Search for a constrained optimal hierarchy by applying a sequence of elementary operations, i.e., by searching in the hierarchy space
Proposed Hierarchy Adjustment Algorithm
Wrapper-based approach for hierarchy modification; requires hierarchy evaluation after each modification, which is computationally expensive
Input: Predefined hierarchy (H0), Training data (Dt), Validation data (Dv)
1 Generate neighbor hierarchies for H0
2 Train hierarchical classification models for each neighbor on Dt
3 Evaluate hierarchical classifiers on Dv
4 Pick the best neighbor hierarchy as H0
5 Until no improvement, repeat from step 1
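The steps above amount to greedy hill climbing in hierarchy space. A minimal sketch, where `neighbors` (neighbor generation via elementary operations) and `score` (model training plus validation-set evaluation of a hierarchy) are assumed callbacks, not part of the original work:

```python
# Hypothetical sketch of the wrapper-based hierarchy search (greedy hill climbing).
# `neighbors(h)` yields candidate hierarchies; `score(h)` evaluates one on Dv.

def wrapper_search(h0, neighbors, score):
    """Keep moving to the best strictly-better neighbor until no improvement."""
    best_h, best_s = h0, score(h0)
    improved = True
    while improved:
        improved = False
        for h in neighbors(best_h):
            s = score(h)
            if s > best_s:          # strictly better neighbor found
                best_h, best_s = h, s
                improved = True
    return best_h, best_s
```

Each outer iteration re-expands the neighborhood of the current best hierarchy, which is exactly why the wrapper is expensive: every candidate requires a full train-and-evaluate cycle.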
Dataset for Experimental Evaluation
A sub-branch of the AOL database hierarchy is used for evaluation
Data is small w.r.t. the number of features and classes
Datasets   #Total Node   #Leaf Node   Height   #Training   #Features
Soc                 83           69        4       5,248      34,003
Kids               299          244        5      15,795      48,115
Table: Dataset Statistics
Performance Results
The adjusted hierarchy shows significant performance improvement in comparison to the predefined (original) hierarchy and the hierarchy generated using a clustering approach
Figure: Soc (left) and Kids (right) dataset results
Filter based Rewiring Strategy
Motivation
For large-scale datasets, wrapper-based approaches are intractable due to multiple hierarchy evaluations
Objective
Modify the predefined hierarchy using a filter-based rewiring strategy that does not require multiple hierarchy evaluations
Without significant loss in performance
Proposed Rewiring Strategy
Elementary operations: node creation, parent-child rewiring, node deletion
Proposed Rewiring Strategy Algorithm
Filter based approach for hierarchy modification
Input: Predefined hierarchy (H0), Train data (Dt)
1 Compute pairwise similarity between classes defined in H0 on Dt
2 Group together most similar classes
3 Identify inconsistencies within the hierarchy
4 Apply elementary operations (node creation or parent-child rewiring) to correct inconsistencies and obtain new hierarchy H1
5 Perform post-processing step (node deletion) on H1 to obtain new hierarchy H2
6 Train and evaluate hierarchical classification models on H2
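The filter step (steps 1-3) can be sketched as follows, under the simplifying assumption that class centroids with cosine similarity stand in for the method's actual similarity measure: a class whose most similar class sits under a different parent in H0 is flagged as a rewiring candidate.

```python
# Hypothetical sketch of the filter step: flag classes whose nearest class
# (by cosine similarity of class centroids) has a different parent in H0.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def inconsistencies(centroids, parent):
    """centroids: class -> feature vector; parent: class -> parent in H0."""
    flagged = []
    for c, vec in centroids.items():
        nearest = max((d for d in centroids if d != c),
                      key=lambda d: cosine(vec, centroids[d]))
        if parent[nearest] != parent[c]:    # candidate for rewiring
            flagged.append((c, nearest))
    return flagged
```

Because each class's similarities can be computed independently, this step parallelizes trivially, which is what makes the filter approach scale where the wrapper does not.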
Case Study: NG dataset
Recap: Figure shows different hierarchical structures obtained using flattening and rewiring approaches
µF1 = 77.04, MF1 = 77.94 | µF1 = 79.42, MF1 = 79.82 | µF1 = 81.24, MF1 = 81.94
Performance Results - Flat Measure
The T-Easy method is slightly better due to its brute-force search for the optimal hierarchy
T-Easy is very expensive; not scalable for large-scale datasets
              Evaluation  TD-LR   Agglomerative   Flattening     Rewiring Methods
Name          Metrics             Clustering      Global-INF     T-Easy         rewHier
CLEF          µF1(↑)      72.74   73.24           77.14          78.12          78.00
              MF1(↑)      35.92   38.27           46.54          48.83 (N)      47.10 (N)
DIATOMS       µF1(↑)      53.27   56.08           61.31          62.34 (N)      62.05 (N)
              MF1(↑)      44.46   44.78           51.85          53.81 (N)      52.14 (N)
IPC           µF1(↑)      49.32   49.83           52.30          53.94 (M)      54.28 (M)
              MF1(↑)      42.51   44.50           45.65          46.10 (M)      46.04 (M)
DMOZ-SMALL    µF1(↑)      45.10   45.94           46.61          Not Scalable   48.25 (M)
              MF1(↑)      30.65   30.75           31.86          Not Scalable   32.92 (N)
DMOZ-2010     µF1(↑)      40.22   Not Scalable    42.37          Not Scalable   43.10
              MF1(↑)      28.37   Not Scalable    30.41          Not Scalable   31.21
DMOZ-2012     µF1(↑)      50.13   Not Scalable    50.64          Not Scalable   51.82
              MF1(↑)      29.89   Not Scalable    30.58          Not Scalable   31.24
N (and M) indicates that improvements are statistically significant with p-value < 0.01 (and < 0.05, respectively).
Performance Results - Hierarchical Measure
              Hierarchy   Flattening    Rewiring Methods
Name          used        Global-INF    T-Easy         rewHier
CLEF          Original    79.06         81.43          80.14
              Modified    80.87         81.82          81.28
DIATOMS       Original    62.80         64.28          63.24
              Modified    63.88         66.35          64.27
IPC           Original    64.73         67.23          68.34
              Modified    66.29         68.10          68.36
DMOZ-SMALL    Original    63.37         Not Scalable   66.18
              Modified    64.97         Not Scalable   66.30
DMOZ-2012     Original    73.19         Not Scalable   74.21
Runtime Comparison (in mins)
The T-Easy method is very expensive (∼20 times more expensive for the IPC dataset)
              Baseline   Flattening    Rewiring Methods
Name          TD-LR      Global-INF    T-Easy         rewHier
CLEF              2.5         3.5          59             7.5
DIATOMS           8.5        10           268            24
IPC             607         830        26,432         1,284
DMOZ-SMALL       52          65        Not Scalable     168
DMOZ-2010    20,190      25,600        Not Scalable  42,000
DMOZ-2012    50,040      63,000        Not Scalable  94,800
# Elementary Operation Comparisons
# elementary operations executed                       CLEF   DIATOMS   IPC
Tang et al. (promote, demote, merge)                     52       156   412
Proposed Rewiring Filter Model                           25        34    42
(node creation, PCRewire, node deletion)
Effect of varying % of Training Size
Our proposed rewHier method performs well for smaller % of the training dataset
Comparison to Flat, TD-LR, HierCost Approach
Our proposed hierarchy modification results in the best performance irrespective of the model trained
Figure: (a) Macro-F1, (b) hF1
Comparison to Flat, TD-LR, HierCost Approach
Figure shows the percentage of classes improved over the flat approach
80% of the classes showed improved performance with our rewHier hierarchy and the HierCost approach
Conclusion
Proposed different approaches for flattening inconsistent nodes
Local Approach
Global Approach
Proposed filter-based data-driven rewiring approach; works well, especially for classes with rare categories
Works well for large-scale datasets due to embarrassingly parallel steps
Learning using Multiple Hierarchies (MTL), Charuvaka and Rangwala, ICDM'12
Motivation
Hierarchies are so common that sometimes multiple hierarchies classify similar data
Heterogeneous label views provide additional knowledge, which should be exploited by learners
Examples
protein structure classification - several hierarchical schemes for organizing proteins based on curation process or 3D structure
web-page classification - several hierarchies exist for categorization, such as the DMOZ and Wikipedia datasets
Objective
Utilize multiple hierarchical label views in a multi-task learning context to improve classification performance
Three Different Learning Settings - I
(i) Single Task Learning (STL) - each task's model parameters are learned independently
Three Different Learning Settings - II
(ii) Single Hierarchy Multi-Task Learning (SHMTL) - relationships between tasks within a single hierarchy are modeled individually
Three Different Learning Settings - III
(iii) Multiple Hierarchy Multi-Task Learning (MHMTL) - relationships between tasks from different hierarchies are extracted using common examples
MTL Formulations
General MTL formulation: min_W Σ_t L(W_t; D_t) + λ Ω(W), where Ω(W) is a regularizer coupling the tasks
Different MTL formulations based on the regularization:
Sparse - all tasks share a single set of useful features
  Ω(W) = ||W||_{2,1}
Graph Regularization - related tasks have similar parameters
  Ω(W) = Σ_{(a,b)∈E} ||W_a − W_b||²_2
Trace - task parameters are drawn from a low-dimensional subspace
  Ω(W) = ||W||_* = TraceNorm(W)
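For illustration, the first two regularizers can be computed on a toy parameter matrix W (rows = features, columns = tasks) in plain Python; the trace norm would additionally require an SVD and is omitted here. This is a minimal sketch, not the authors' implementation:

```python
# Minimal sketch of two MTL regularizers on a task-parameter matrix W,
# stored as a list of feature rows; each column holds one task's parameters.
from math import sqrt

def l21_norm(W):
    """||W||_{2,1}: sum over feature rows of the l2 norm across tasks."""
    return sum(sqrt(sum(w * w for w in row)) for row in W)

def graph_reg(W, edges):
    """sum_{(a,b) in E} ||W_a - W_b||^2_2 over task columns a, b."""
    cols = list(zip(*W))                 # column a = parameters of task a
    return sum(sum((x - y) ** 2 for x, y in zip(cols[a], cols[b]))
               for a, b in edges)
```

The l2,1 norm drives entire feature rows to zero (shared feature selection), while the graph term shrinks parameter differences only along the edges E of the task-relationship graph.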
Performance: AUC Comparison
STL, SHMTL and MHMTL Comparison
Extreme Classification
Motivation
Many real-world multi-class and multi-label problems involve an extremely large number of labels (huge output space)
Learning a separate classifier for each label is almost impossible
Inter-label dependencies are not available
Examples
Predict hashtags from tweets
LSHTC (Kaggle competition): predict Wikipedia tags from documents
Objective
Given a huge set of labels, identify the labels that can be assigned to unlabeled instances (examples), efficiently and accurately
Extreme Classification Challenges
Statistical challenges
an increase in the number of classes tends to decrease accuracy, due to the complexity of discriminating between the different classes
Computational challenges
training classifiers for a large number of classes is computationally infeasible
predicting labels for unlabeled test instances is also a compute-intensive task
Eigenpartition trees, Mineiro and Karampatziakis, NIPS workshop'15
Compute a small set of plausible labels using an eigenpartition decomposition at each node in the tree
At each node, try to send each class's examples exclusively left or right
While sending roughly the same number of examples left or right in aggregate
Can be achieved through tree decomposition [Choromanska and Langford, NIPS'15]
Optimization function:
  maximize  w^T (X^T X) w
  s.t.  w^T w ≤ 1
        1^T X w = 0
Invoke expensive classifiers only on the set of plausible labels
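A hypothetical power-iteration sketch of this objective: repeatedly multiply by X^T X, project out m = X^T 1 so the balance constraint 1^T X w = 0 holds, and normalize. The paper's actual solver may differ; this only illustrates the constrained top-eigenvector idea.

```python
# Sketch (assumed, not the paper's code): power iteration for the direction w
# maximizing w^T (X^T X) w with ||w|| <= 1 and 1^T X w = 0 (balanced split).
from math import sqrt

def eigenpartition_direction(X, iters=100):
    d = len(X[0])
    m = [sum(row[j] for row in X) for j in range(d)]   # m = X^T 1
    mm = sum(v * v for v in m)
    w = [1.0] * d
    for _ in range(iters):
        Xw = [sum(r[j] * w[j] for j in range(d)) for r in X]            # X w
        w = [sum(X[i][j] * Xw[i] for i in range(len(X))) for j in range(d)]
        if mm:                                          # project out m => 1^T X w = 0
            c = sum(a * b for a, b in zip(w, m)) / mm
            w = [a - c * b for a, b in zip(w, m)]
        n = sqrt(sum(v * v for v in w)) or 1.0
        w = [v / n for v in w]                          # enforce ||w|| = 1
    return w

def route(X, w):
    """Send each example left (True) or right (False) by sign of its projection."""
    return [sum(a * b for a, b in zip(row, w)) >= 0 for row in X]
```

Examples of a class then mostly share the sign of their projection, so each tree node roughly halves the label set while keeping classes intact.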
Experiments
Dataset description
Large-scale text datasets used for model evaluation
Results - LSHTC dataset
FastXML [Prabhu and Varma, SIGKDD'14] - node partitioning formulation which optimizes an nDCG-based ranking loss over all the labels
X1 [Bhatia et al., NIPS'15] - the effective number of labels is reduced by projecting the high-dimensional label vectors onto a low-dimensional linear subspace
Deep Classification, Xue et al., SIGIR’08
Motivation
Large-scale taxonomies are more prevalent due to the more specific topic-related class information, which is beneficial in several domains
Examples
web search browsing - finding documents relevant to a query
modeling users for personalized web search - "java" means different things to a tourist and a programmer
advertisement matching - finding related ads corresponding to a web page
Traditional algorithms cannot be directly scaled to large-scale problems due to several drawbacks:
large-scale hierarchies
longer training time and
incorporating structural information into the learning framework
Motivation - II
Observations
The number of related categories for a query document is much smaller than the number of unrelated categories
Classification over a smaller set of categories is easier and performs much better compared to the large set of categories
Objective
Given large and deep hierarchies, identify the relevant subset of categories for effectively finding the label of unlabeled test instances
Two Stages for Classification
First stage - Search stage: identify related candidate categories corresponding to the test example
Second stage - Classification stage: select the best candidate category, using a classification algorithm, as the label for the test document
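The two stages can be sketched in a toy form under simplifying assumptions: stage 1 retrieves candidate categories from the k most similar training documents, and stage 2 uses a nearest-centroid scorer as a stand-in for the classifier trained on the candidate categories (the original work trains SVM-style classifiers):

```python
# Hypothetical two-stage sketch: search stage narrows the label space,
# classification stage decides only among the surviving candidates.
from math import sqrt

def cos(u, v):
    d = sum(a * b for a, b in zip(u, v))
    n = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return d / n if n else 0.0

def two_stage_predict(x, docs, labels, k=3):
    # Stage 1 (search): categories of the k most similar training documents
    ranked = sorted(range(len(docs)), key=lambda i: cos(x, docs[i]), reverse=True)
    candidates = {labels[i] for i in ranked[:k]}
    # Stage 2 (classification): score only the candidate categories
    centroids = {}
    for c in candidates:
        members = [docs[i] for i in range(len(docs)) if labels[i] == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return max(candidates, key=lambda c: cos(x, centroids[c]))
```

The point of the split is that stage 2 trains and evaluates over a handful of categories instead of all 130,000, which is what makes deep classification tractable.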
First Stage
The large hierarchy is pruned into a smaller sub-hierarchy containing the candidate categories and their ancestors only
Second Stage
Classifiers are trained on candidate categories
The best category for the test example is selected using the trained classifiers
Dataset
Open Directory Project (ODP) dataset used for evaluation
Dataset statistics
Training dataset - 1,174,586 web pages, 130,000 categories organized into 15 levels
Test dataset - 130,000 web pages
Figure: Data distribution at different levels in the hierarchy
Results - Micro-F1 performance comparison
Search-based Strategy - the best neighbor is chosen as the label for the test example
Hierarchical SVM [Liu et al., SIGKDD'05] - a pachinko-machine-style SVM
Deep Classification - the top ten neighbors are used as the candidate categories
Number of candidate categories selection
As the number of candidate categories chosen by the search stage increases, the chance of finding the correct label for the test example in the classification stage increases
Evaluation time increases with an increasing number of selected categories
Conclusion
Large-scale hierarchical classification is an important research problem in the machine learning community due to its wide applicability across several domains
Discussed various challenges associated with hierarchical classification
Discussed various state-of-the-art existing approaches; demo of the software package developed by the authors
Emerging topics:
Large-scale classification with deep hierarchies
Orphan node prediction
References - I
Gopal, Siddharth, and Yiming Yang. "Recursive regularization for large-scale classification with hierarchical and graphical dependencies." SIGKDD, 2013.
Charuvaka, Anveshi, and Huzefa Rangwala. "HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning." ECML, 2015.
Tang, Lei, Jianping Zhang, and Huan Liu. "Acclimatizing taxonomic semantics for hierarchical content classification." SIGKDD, 2006.
Li, Tao, Shenghuo Zhu, and Mitsunori Ogihara. "Hierarchical document classification using automatically generated hierarchy." Journal of Intelligent Information Systems 29.2 (2007): 211-230.
Charuvaka, Anveshi, and Huzefa Rangwala. "Multi-task learning for classifying proteins using dual hierarchies." ICDM, 2012.
Punera, Kunal, Suju Rajan, and Joydeep Ghosh. "Automatically learning document taxonomies for hierarchical classification." WWW, 2005.
Qi, Xiaoguang, and Brian D. Davison. "Hierarchy evolution for improved classification." CIKM, 2011.
Bennett, Paul N., and Nam Nguyen. "Refined experts: improving classification in large taxonomies." SIGIR, 2009.
References - II
Silla Jr, Carlos N., and Alex A. Freitas. "A survey of hierarchical classification across different application domains." DMKD, 2011.
Naik, Azad, A. Charuvaka, and H. Rangwala. "Classifying documents within multiple hierarchical datasets using multi-task learning." ICTAI, 2013.
Babbar, Rohit, et al. "On flat versus hierarchical classification in large-scale taxonomies." NIPS, 2013.
Wang, Xiao-Lin, and Bao-Liang Lu. "Flatten hierarchies for large-scale hierarchical text categorization." ICDIM, 2010.
Chuang, Shui-Lung, and Lee-Feng Chien. "A practical web-based approach to generating topic hierarchy for text segments." CIKM, 2004.
Fagni, Tiziano, and Fabrizio Sebastiani. "On the selection of negative examples for hierarchical text categorization." LTC, 2007.
Clare, Amanda, and Ross D. King. "Predicting gene function in Saccharomyces cerevisiae." Bioinformatics, 2003.
Koller, Daphne, and Mehran Sahami. "Hierarchically Classifying Documents Using Very Few Words." ICML, 1997.
References - III
Xue et al. "Deep classification in large-scale text hierarchies." SIGIR, 2008.
Tzanetakis, G., and P. Cook. "Musical genre classification of audio signals." IEEE Transactions on Speech and Audio Processing, 2007.
Gopal, Siddharth, et al. "Bayesian models for large-scale hierarchical classification." NIPS, 2012.
Xiao, Lin, Dengyong Zhou, and Mingrui Wu. "Hierarchical classification via orthogonal transfer." ICML, 2011.
Naik, Azad, and Huzefa Rangwala. "A ranking-based approach for hierarchical classification." DSAA, 2015.
Tsochantaridis, Ioannis, et al. "Large margin methods for structured and interdependent output variables." JMLR, 2005.
Liu, Tie-Yan, et al. "Support vector machines classification with a very large-scale taxonomy." SIGKDD, 2005.
Caruana, Rich. "Multitask learning." Machine Learning, 1997.
Anveshi Charuvaka and Huzefa Rangwala. "Approximate block coordinate descent for large scale hierarchical classification." SAC, 2015.
References - IV
Dumais, Susan, and Hao Chen. "Hierarchical classification of Web content." SIGIR, 2000.
Mineiro, Paul, and Karampatziakis, Nikos. "A Hierarchical Spectral Method for Extreme Classification." eprint arXiv:1511.03260 (NIPS workshop), 2015.
Choromanska, Anna, et al. "Extreme Multi Class Classification." NIPS Workshop: eXtreme Classification, 2013.
McCallum, Andrew, et al. "Improving Text Classification by Shrinkage in a Hierarchy of Classes." ICML, 1998.
Babbar, Rohit, et al. "Maximum-margin framework for training data synchronization in large-scale hierarchical classification." NIPS, 2013.
Choromanska, Anna E., and John Langford. "Logarithmic time online multiclass prediction." NIPS, 2015.
Prabhu, Yashoteja, and Manik Varma. "FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning." SIGKDD, 2014.
Bhatia, Kush, et al. "Sparse Local Embeddings for Extreme Multi-label Classification." NIPS, 2015.
References - V
Naik, A., and Rangwala, H. "Filter based taxonomy modification for improving hierarchical classification." http://arxiv.org/abs/1603.00772, 2016.
Peng, Hanchuan, Fuhui Long, and Chris Ding. "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy." PAMI, 2005.
Ding, Chris, and Hanchuan Peng. "Minimum redundancy feature selection from microarray gene expression data." Journal of Bioinformatics and Computational Biology, 2005.
Acknowledgement
Presenters:
Huzefa Rangwala Azad Naik
Slides available for download at:
http://cs.gmu.edu/~mlbio/kdd2017tutorial.html
Thank You!