Large Scale Hierarchical Classification: Foundations, Algorithms and Applications

Huzefa Rangwala and Azad Naik

Department of Computer Science

MLBio+ Laboratory

Fairfax, Virginia, USA

KDD Tutorial, Halifax, Canada

13th Aug, 2017

Huzefa Rangwala and Azad Naik George Mason University 13th Aug, 2017 1 / 117


Overview of Tutorial Coverage

Part - I

1 Introduction and Background
    Motivation
    Hierarchical Classification (HC) problem description
    Challenges
    Methods for solving HC

2 State-of-the-Art HC Approaches
    Parent-child regularization
    Cost-sensitive learning

Package description/Software demo


Overview of Tutorial Coverage

Part - II

1 Inconsistent Hierarchy
    Motivation
    Methods for resolving inconsistency
    Optimal hierarchy search in hierarchical space

2 Other HC Methods
    Learning using multiple hierarchies
    Extreme and deep classification

3 Conclusion


Motivation

Exponential growth in data (image, text, video) over time

Big data era - megabytes & gigabytes to terabytes & petabytes

Growth in almost all fields - astronomical, biological, web content


Data Organization

Organize data into structure

tree, graph [LSHTC, BioASQ and ILSVRC challenge]

Useful in various applications

query search, browsing and categorizing products


Hierarchical Structure

Classes organized into the hierarchical structure

Generic (↑) to specific (↓) categories in top-down order


Hierarchical Classification

Goal

Given a hierarchy of classes, exploit the hierarchical structure to learn models and classify unlabeled test examples (instances) to one or more nodes in the hierarchy

Solution

(i) Manual Classification

(ii) Automated Classification


Manual Classification

Requires human understanding and expertise

Infeasible for huge data


Automated Classification

Trained expert (such as computer)

Scalable for huge data


Challenges - I

Single label vs. multi-label

Single-label classification - each example belongs exclusively to one class

Multi-label classification - an example may belong to more than one class


Challenges - II

Mandatory leaf node vs. internal node prediction

Example may be assigned to internal nodes

Orphan node detection problem


Challenges - III

Rare categories

Many classes with very few labeled examples

More prevalent in large scale datasets - ≥70% have ≤10 examples


Challenges - IV

Feature selection

Not all features are essential to discriminate between classes

Identify features to improve classification performance


Other Challenges

Parameter optimization
incorporate relationships (parent-child, siblings) information

Scalability
large # of classes, features and examples require distributed computation

Dataset     #Training examples   #Leaf nodes (classes)   #Features   #Parameters     Parameter size (approx)
DMOZ-2010   128,710              12,294                  381,580     4,652,986,520   18.5 GB
DMOZ-2012   383,408              11,947                  348,548     4,164,102,956   16.5 GB

Inconsistent hierarchy
not suitable for classification (more details later)


Notation

n = # of training examples (instances)
D = dimension of each instance
N = set of nodes in the hierarchy
L = set of leaf nodes (classes)
C(t) = children of node t
π(t) = parent of node t


Classification

Training - Learn mapping function using training data

Testing - Predict the label of test example


Learning Algorithm: General Formulation

Combination of two terms:

1 Empirical loss - controls how well the learnt models fit the training data

2 Regularization - prevents models from over-fitting and encodes additional information such as hierarchical relationships


Different Approaches for Solving HC Problem


Flat Classification Approach

Simplest method (ignores hierarchy)

Learn discriminant classifiers for each leaf node in the hierarchy

Unlabeled test example classified using the rule:

$$\hat{y} = \arg\max_{y \in Y} f(x, y \mid \mathbf{w})$$
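
As a minimal sketch of this rule, assume a linear scoring function f(x, y | w) = w_y · x (the slide does not fix the form of f). Prediction then reduces to an argmax over per-class scores. The function name, the weight matrix and the example are made-up toy values.

```python
import numpy as np

def flat_predict(W, x):
    """Return the index of the highest-scoring leaf class for one example.

    W holds one weight vector per leaf class (shape: n_classes x n_features).
    The linear score w_y . x is an assumption of this sketch.
    """
    scores = W @ x                 # one score per leaf class
    return int(np.argmax(scores))  # flat rule: pick the best-scoring class

# Toy weights for three leaf classes over two features (made-up numbers).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([0.2, 0.9])
predicted = flat_predict(W, x)     # class 1 has the largest score (0.9)
```

The hierarchy plays no role here: every leaf competes with every other leaf in a single argmax.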


Local Classification Approach - I

Local Classifier per Node (LCN)

Learn binary classifiers for all non-root nodes

Goal is to effectively discriminate between the siblings

Top-down approach is followed for classifying unlabeled test examples


Local Classification Approach - II

Local Classifier per Parent Node (LCPN)

Learn multi-class classifiers for all non-leaf nodes

Like LCN, the goal is to effectively discriminate between the siblings

Top-down approach is followed for classifying unlabeled test examples
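
The top-down prediction used by the local approaches can be sketched as follows. This is an illustrative toy, not the tutorial's implementation: the hierarchy, the node names, the linear per-node scores and the `top_down_predict` helper are all hypothetical.

```python
import numpy as np

# Hypothetical toy hierarchy: each non-leaf node maps to its children.
children = {
    "root": ["sports", "computers"],
    "sports": ["soccer", "tennis"],
    "computers": ["software", "hardware"],
}

# One made-up weight vector per non-root node. The classifier at a parent
# scores each of its children and keeps the best one.
weights = {
    "sports": np.array([1.0, 0.0, 0.0]),
    "computers": np.array([-1.0, 0.0, 0.0]),
    "soccer": np.array([0.0, 1.0, 0.0]),
    "tennis": np.array([0.0, -1.0, 0.0]),
    "software": np.array([0.0, 0.0, 1.0]),
    "hardware": np.array([0.0, 0.0, -1.0]),
}

def top_down_predict(x, node="root"):
    """Walk from the root to a leaf, choosing the best child at each step."""
    path = []
    while node in children:  # stop once a leaf is reached
        node = max(children[node], key=lambda c: float(weights[c] @ x))
        path.append(node)
    return path

path = top_down_predict(np.array([0.5, -2.0, 3.0]))
```

A wrong choice near the root cannot be undone further down, which is the usual error-propagation concern with top-down prediction.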


Local Classification Approach - III

Local Classifier per Level (LCL)

Learn multi-class classifiers for all levels in the hierarchy

Least popular among local approaches

Prediction inconsistency may occur, hence a post-processing step is required


Global Classification Approach

Learn global function considering all hierarchical relationships

Often referred to as the Big-Bang approach

An unlabeled test instance is classified using an approach similar to flat or local methods


Evaluation Metrics - I

Flat evaluation measures

Misclassifications treated equally

Common evaluation metrics:

Micro-F1 - gives equal weight to all examples, dominated by common classes
Macro-F1 - gives equal weight to each class
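
To illustrate the difference between the two averages, here is a small self-contained sketch for the single-label case. The helper names and the toy labels are invented for the example. Micro-F1 pools the counts over all classes, while Macro-F1 averages the per-class F1 scores.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    # Micro: pool counts over classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro

# One common class "a" and two rare classes that are always missed.
y_true = ["a"] * 8 + ["b", "c"]
y_pred = ["a"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
```

With these toy labels Micro-F1 stays high because the common class dominates, while Macro-F1 drops sharply because the two rare classes each score zero.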


Evaluation Metrics - II

Hierarchical evaluation measures

Hierarchical distance between the true and predicted class taken into consideration for performance evaluation

Common evaluation metrics:

Hierarchical-F1 - common ancestors between true and predicted class
Tree Error - average hierarchical distance b/w true and predicted class
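
A minimal sketch of both hierarchical measures on a toy tree. It uses a common definition in which hierarchical precision and recall are computed from the overlap of ancestor sets (each node counted with its ancestors, root excluded); the slide does not spell out the exact variant, so treat that choice, the tree and the function names as assumptions.

```python
# Hypothetical toy tree given as child -> parent.
parent = {"sports": "root", "computers": "root",
          "soccer": "sports", "tennis": "sports",
          "software": "computers"}

def ancestors(node):
    """The node itself plus all of its ancestors, excluding the root."""
    out = set()
    while node != "root":
        out.add(node)
        node = parent[node]
    return out

def hierarchical_f1(y_true, y_pred):
    overlap = sum(len(ancestors(t) & ancestors(p)) for t, p in zip(y_true, y_pred))
    n_pred = sum(len(ancestors(p)) for p in y_pred)
    n_true = sum(len(ancestors(t)) for t in y_true)
    hp, hr = overlap / n_pred, overlap / n_true   # hierarchical precision / recall
    return 2 * hp * hr / (hp + hr) if hp + hr else 0.0

def tree_error(y_true, y_pred):
    """Average number of edges on the tree path between true and predicted class."""
    # Nodes in exactly one of the two ancestor sets correspond one-to-one
    # to the edges on the path through the lowest common ancestor.
    total = sum(len(ancestors(t) ^ ancestors(p)) for t, p in zip(y_true, y_pred))
    return total / len(y_true)

y_true = ["soccer", "soccer"]
y_pred = ["tennis", "software"]
hf1 = hierarchical_f1(y_true, y_pred)
te = tree_error(y_true, y_pred)
```

Predicting "tennis" for a "soccer" example still earns partial credit through the shared "sports" ancestor, whereas "software" earns none and is twice as far away in the tree.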


Multi-Task Learning (MTL)

Involves joint training of multiple related tasks to improve generalization performance

Independent learning problems can utilize the shared knowledge

Exploits inductive biases that are helpful to all the related tasks

similar set of parameters
common feature space

Examples

personal email spam classification - many persons with the same spam
automated driving - brakes and accelerator


Parent-child Regularization, Gopal and Yang, SIGKDD’13

Motivation

Traditional approaches learn a classifier for each leaf node (task) to discriminate one class from the others

$$\min_{\mathbf{w}_t} \; \frac{1}{2}\lVert \mathbf{w}_t \rVert_2^2 + C \sum_{i=1}^{n} \left[ 1 - Y_{it}\, \mathbf{w}_t^{T} \mathbf{x}_i \right]_+$$

Works well if:

Dataset is small
Balanced
Sufficient positive examples per class to learn generalized discriminant function

Drawbacks

Real world datasets suffer from the rare categories issue
Remember: 70% of classes have fewer than 10 examples per class

Large number of classes (scalability issue)


Motivation - II

Can we improve the performance of data-sparse leaf nodes by taking advantage of data-rich nodes at higher levels?

Incorporate inter-class dependencies to improve classification

examples belonging to the Soccer category are less likely to belong to the Software category

$$\min_{\mathbf{w}_t} \; \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + C \sum_{k \in C(t)} \sum_{i=1}^{n} \left[ 1 - Y_{ik}\, \mathbf{w}_t^{T} \mathbf{x}_i \right]_+$$

Objective

How to effectively incorporate the hierarchical relationships into the objective function to improve generalization performance

Make it scalable for larger datasets


Proposed Formulation

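Enforces model parameters (weights) to be similar to the parent in regularization

Proposed state-of-the-art: HR-SVM and HR-LR global formulation

HR-SVM

$$\min_{\mathbf{W}} \; \sum_{t \in N} \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + C \sum_{k \in L} \sum_{i=1}^{n} \left[ 1 - Y_{ik}\, \mathbf{w}_k^{T} \mathbf{x}_i \right]_+$$

Internal Node

$$\min_{\mathbf{w}_t} \; \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + \frac{1}{2} \sum_{c \in C(t)} \lVert \mathbf{w}_c - \mathbf{w}_t \rVert_2^2$$

Leaf Node

$$\min_{\mathbf{w}_t} \; \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + C \sum_{i=1}^{n} \left[ 1 - Y_{it}\, \mathbf{w}_t^{T} \mathbf{x}_i \right]_+$$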
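
A minimal sketch of evaluating the global HR-SVM objective with NumPy. The function name, the dictionary-based data layout and the toy values are assumptions for illustration. The regularizer is summed over non-root nodes only, since the slide does not say how the root's parent term is handled.

```python
import numpy as np

def hr_svm_objective(W, parent, leaves, X, Y, C=1.0):
    """Recursive-regularization objective with hinge loss at the leaves.

    W      : dict node -> weight vector (length D)
    parent : dict non-root node -> its parent node
    leaves : list of leaf nodes (the classes)
    X      : (n, D) feature matrix
    Y      : dict leaf -> array of +1/-1 labels of length n
    """
    # Parent-child regularizer: 0.5 * ||w_t - w_parent(t)||^2 per non-root node.
    reg = sum(0.5 * np.sum((W[t] - W[parent[t]]) ** 2) for t in parent)
    # Hinge loss [1 - y * w.x]_+ summed over leaves and examples.
    loss = sum(np.sum(np.maximum(0.0, 1.0 - Y[k] * (X @ W[k]))) for k in leaves)
    return reg + C * loss

# Toy problem: a root with two leaf children, two examples, two features.
W = {"root": np.zeros(2), "a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
parent = {"a": "root", "b": "root"}
X = np.array([[2.0, 0.0], [0.0, 2.0]])
Y = {"a": np.array([1, -1]), "b": np.array([-1, 1])}
value = hr_svm_objective(W, parent, ["a", "b"], X, Y, C=1.0)
```

In this toy case each leaf contributes 0.5 to the regularizer and a hinge loss of 1 (from its negative example, which sits exactly on the decision boundary), so the objective evaluates to 3.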


HR-LR Models

Similar formulation as HR-SVM

Logistic loss instead of hinge loss

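HR-LR

$$\min_{\mathbf{W}} \; \sum_{t \in N} \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + C \sum_{k \in L} \sum_{i=1}^{n} \log\left( 1 + \exp\left( -Y_{ik}\, \mathbf{w}_k^{T} \mathbf{x}_i \right) \right)$$

Internal Node

$$\min_{\mathbf{w}_t} \; \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + \frac{1}{2} \sum_{c \in C(t)} \lVert \mathbf{w}_c - \mathbf{w}_t \rVert_2^2$$

Leaf Node

$$\min_{\mathbf{w}_t} \; \frac{1}{2}\lVert \mathbf{w}_t - \mathbf{w}_{\pi(t)} \rVert_2^2 + C \sum_{i=1}^{n} \log\left( 1 + \exp\left( -Y_{it}\, \mathbf{w}_t^{T} \mathbf{x}_i \right) \right)$$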


Proposed Parallel Implementation

Each node is independent of all other nodes except its neighbours

Objective function is block separable. Therefore, parallel Block Coordinate Descent (CD) can be used for optimization

1 Fix odd-level parameters, optimize even levels in parallel

2 Fix even-level parameters, optimize odd levels in parallel

3 Repeat until convergence

Extended to graphs by first finding the minimum graph coloring [NP-hard] and repeatedly optimizing nodes with the same color in parallel during each iteration
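
The alternating schedule for trees can be sketched as below. This is a toy under stated assumptions: the tree, the weights and the helper names are hypothetical, the root weight is simply held at zero, and leaf weights are treated as already trained. Only the internal-node update is shown; its closed form follows from setting the gradient of the internal-node subproblem to zero. A leaf update would need an SVM or LR solver and is omitted.

```python
import numpy as np

# Hypothetical toy tree, given both as child -> parent and parent -> children.
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}

def depth(node):
    """Number of edges between the node and the root."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

# Split nodes by level parity: neighbours always land in different groups.
nodes = ["root"] + list(parent)
groups = {0: [t for t in nodes if depth(t) % 2 == 0],
          1: [t for t in nodes if depth(t) % 2 == 1]}

# Made-up weights: root fixed at zero, leaves pretend to be trained already.
W = {t: np.zeros(2) for t in nodes}
W["a1"], W["a2"], W["b1"] = np.array([3.0, 0.0]), np.array([0.0, 3.0]), np.array([2.0, 2.0])

def update_internal(t):
    """Minimizer of 0.5*||w_t - w_parent||^2 + 0.5*sum_c ||w_c - w_t||^2."""
    total = W[parent[t]] + sum(W[c] for c in children[t])
    return total / (1 + len(children[t]))

# One half-sweep: even levels are held fixed while odd-level nodes are updated.
# Each update reads only fixed neighbours, so the iterations are independent
# and could be dispatched to separate workers.
for t in groups[1]:
    W[t] = update_internal(t)
```

Because no two nodes in the same parity group share an edge, the updates within a half-sweep never read a value that another update in the same half-sweep writes.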


Experiments

Dataset description

A wide range of single-label and multi-label datasets with varying numbers of features and categories was used for model evaluation

Datasets     # Features   # Categories   Type           Avg # labels (per instance)
CLEF         89           87             Single-label   1
RCV1         48,734       137            Multi-label    3.18
IPC          541,869      552            Single-label   1
DMOZ-SMALL   51,033       1,563          Single-label   1
DMOZ-2010    381,580      15,358         Single-label   1
DMOZ-2012    348,548      13,347         Single-label   1
DMOZ-2011    594,158      27,875         Multi-label    1.03
SWIKI-2011   346,299      50,312         Multi-label    1.85
LWIKI        1,617,899    614,428        Multi-label    3.26

Table: Dataset statistics


Comparison Methods

Flat baselines

SVM - one-vs-rest binary support vector machines

LR - one-vs-rest regularized logistic regression

Hierarchical baselines

Top-down SVM (TD) [Liu et al., SIGKDD’05] - a pachinko machine style SVM

Hierarchical SVM (HSVM) [Tsochantaridis et al., JMLR’05] - a large-margin discriminative method with path dependent discriminant function

Hierarchical Orthogonal Transfer (OT) [Lin et al., ICML’11] - a large-margin method enforcing orthogonality between the parent and the children

Hierarchical Bayesian Logistic Regression (HBLR) [Gopal et al., NIPS’12] - a Bayesian method to model hierarchical dependencies among class labels using multivariate logistic regression


Flat Baselines Comparison - I

Figure: Performance improvement: HR-SVM vs. SVM


Flat Baselines Comparison - II

Figure: Performance improvement: HR-LR vs. LR


Hierarchical Baselines Comparison

Datasets     HR-SVM   HR-LR   TD      HSVM    OT      HBLR
CLEF         80.02    80.12   70.11   79.72   73.84   81.41
RCV1         81.66    81.23   71.34   NA      NS      NA
IPC          54.26    55.37   50.34   NS      NS      56.02
DMOZ-SMALL   45.31    45.11   38.48   39.66   37.12   46.03
DMOZ-2010    46.02    45.84   38.64   NS      NS      NS
DMOZ-2012    57.17    53.18   55.14   NS      NS      NS
DMOZ-2011    43.73    42.27   35.91   NA      NS      NA
SWIKI-2011   41.79    40.99   36.65   NA      NA      NA
LWIKI        38.08    37.67   NA      NA      NA      NA

[NA - Not Applicable; NS - Not Scalable]

Table: Micro-F1 performance comparison


Runtime Comparison - flat baselines

HR-SVM vs. SVM

HR-LR vs. LR


Runtime Comparison - hierarchical baselines

Datasets     HR-SVM    HR-LR     TD      HSVM     OT       HBLR
CLEF         0.42      1.02      0.13    3.19     1.31     3.05
RCV1         0.55      11.74     0.21    NA       NS       NA
IPC          6.81      15.91     2.21    NS       NS       31.20
DMOZ-SMALL   0.52      3.73      0.11    289.60   132.34   5.22
DMOZ-2010    8.23      123.22    3.97    NS       NS       NS
DMOZ-2012    36.66     229.73    12.49   NS       NS       NS
DMOZ-2011    58.31     248.07    16.39   NA       NS       NA
SWIKI-2011   89.23     296.87    21.34   NA       NA       NA
LWIKI        2230.54   7282.09   NA      NA       NA       NA

[NA - Not Applicable; NS - Not Scalable]

Table: Training runtime comparison (in mins)


Cost-sensitive Learning, Charuvaka & Rangwala, ECML’15

Motivation

Drawbacks of Recursive Regularization

scalable, but more expensive to train than flat classification
requires specialized implementation and communication between processing nodes
does not deal with class imbalance directly

Objective

Decouple models so that they can be trained in parallel without dependencies between models

Account for class imbalance in the optimization framework


Hierarchical Regularization Re-examination - I


Hierarchical Regularization Re-examination - II

Opposing learning influences:

loss term - model for a node is forced to be dissimilar to all other nodes
regularization term - model is forced to be similar to its neighbors; greater similarity to nearer neighbors

Resultant effect:

Mistakes on negative examples that come from near nodes are less severe than those coming from far nodes, while still taking advantage of the hierarchy


Cost-sensitive Loss

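Consider the loss term for class "t", which is separable over examples

$$\sum_{i} \operatorname{loss}\left( y_i, \mathbf{w}_t^{T} \mathbf{x}_i \right)$$

Each loss value is multiplied by the importance of the example for this class

$$\sum_{i} \operatorname{loss}\left( y_i, \mathbf{w}_t^{T} \mathbf{x}_i \right) \times \phi(t, y_i)$$

This is an example of "instance-based" cost-sensitive learning

$$c_{ti} = \phi(t, y_i)$$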
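
A minimal sketch of an instance-weighted loss for one class, using logistic loss as the per-example loss. The function name and the toy data are hypothetical. The costs are passed in as a plain array here; in the cost-sensitive setting they would come from φ(t, y_i).

```python
import numpy as np

def cost_sensitive_logistic_loss(w, X, y, costs):
    """Sum over examples of cost_i * log(1 + exp(-y_i * w.x_i)).

    y holds +1/-1 labels for the class being trained;
    costs holds one non-negative weight per example.
    """
    margins = y * (X @ w)
    # logaddexp(0, -m) computes log(1 + exp(-m)) in a numerically stable way.
    return float(np.sum(costs * np.logaddexp(0.0, -margins)))

# Toy data: with w = 0 every example has loss log(2), so only the costs matter.
X = np.eye(3)
y = np.array([1.0, -1.0, -1.0])
w = np.zeros(3)
plain = cost_sensitive_logistic_loss(w, X, y, np.ones(3))
weighted = cost_sensitive_logistic_loss(w, X, y, np.array([1.0, 2.0, 3.0]))
```

With unit costs the function reduces to the ordinary logistic loss, so the weighting can be added to an existing per-class trainer without changing anything else.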


Hierarchical Costs

How to define costs based on hierarchy?

Tree Distance (TrD) - undirected graph distance between nodes

Number of Common Ancestors (NCA) - the number of ancestors in common to target class and class label

Exponentiated Tree Distance (ExTrD) - squash tree distance into a suitable range using validation
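
The three quantities can be computed on a toy tree as sketched below. The tree and function names are hypothetical. The form exp(distance / k) for ExTrD, with k a scale that would be tuned on validation data, is an assumption; the slide only says the distance is squashed into a suitable range. How each quantity is then mapped to a cost φ(t, y) is not covered here.

```python
import math

# Hypothetical toy tree given as child -> parent.
parent = {"sports": "root", "computers": "root",
          "soccer": "sports", "tennis": "sports",
          "software": "computers"}

def path_to_root(node):
    """The node followed by its ancestors, up to and including the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Number of edges on the undirected tree path between a and b."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = len(set(pa) & set(pb))
    return (len(pa) - common) + (len(pb) - common)

def num_common_ancestors(a, b):
    """Count of strict ancestors (root included) shared by a and b."""
    return len(set(path_to_root(a)[1:]) & set(path_to_root(b)[1:]))

def exp_tree_distance(a, b, k=2.0):
    # k is a hypothetical scale parameter (assumed, not from the slides).
    return math.exp(tree_distance(a, b) / k)

near = tree_distance("soccer", "tennis")      # siblings under "sports"
far = tree_distance("soccer", "software")     # path goes through the root
```

Sibling classes come out close (distance 2, two shared ancestors), while classes in different subtrees come out far (distance 4, only the root shared).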


Imbalance Costs

Using the same formulation of cost-sensitive learning, data imbalance can also be addressed

$$c_i = 1 + \frac{L}{1 + \exp\lvert n_i - n_0 \rvert}$$

Due to very large skew, inverse class size can result in extremely large weights. Fix using squashing function shown in Fig.

Multiply to combine with Hierarchical costs

n_i = number of examples
n_0, L = user-defined constants
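
A small sketch of the squashing function, reading the exponent as an absolute value exactly as printed on the slide. The function name and the default values of n0 and L are hypothetical choices for illustration, not values from the tutorial.

```python
import math

def imbalance_cost(n_i, n0=10, L=20.0):
    """c_i = 1 + L / (1 + exp(|n_i - n0|)), taken literally from the slide.

    n_i : number of training examples in the class
    n0  : user-defined reference class size (assumed default)
    L   : user-defined constant bounding the extra cost (assumed default)
    """
    exponent = min(abs(n_i - n0), 700)  # cap to avoid float overflow in exp
    return 1.0 + L / (1.0 + math.exp(exponent))

at_reference = imbalance_cost(10)       # exponent is 0, so cost is 1 + L/2
very_large = imbalance_cost(10_000)     # extra cost vanishes, cost tends to 1
```

Unlike a raw inverse class size, the cost stays within the interval from 1 to 1 + L/2 regardless of how skewed the class sizes are.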


Experiments

Dataset

For comparison purposes, the same datasets are used as in the paper [Gopal and Yang, SIGKDD’13]

Comparison Methods

Flat baseline

LR - one-vs-rest binary logistic regression is used in the conventional flat classification setting

Hierarchical baselines

Top-down Logistic Regression (TD-LR) - one-vs-rest multi-class classifier trained at each internal node

HR-LR [Gopal and Yang, SIGKDD’13] - a recursive regularization approach based on hierarchical relationships


Results (Hierarchical Costs)

Datasets     Method   Micro-F1 (↑)   Macro-F1 (↑)   hF1 (↑)   TE (↓)
CLEF         LR       79.82          53.45          85.24     0.994
CLEF         TrD      80.02          55.51          85.39     0.984
CLEF         NCA      80.02          57.48          85.34     0.986
CLEF         ExTrD    80.22          57.55†         85.34     0.982
DMOZ-SMALL   LR       46.39          30.20          67.00     3.569
DMOZ-SMALL   TrD      47.52‡         31.37‡         68.26     3.449
DMOZ-SMALL   NCA      47.36‡         31.20‡         68.12     3.460
DMOZ-SMALL   ExTrD    47.36‡         31.19‡         68.20     3.456
IPC          LR       55.04          48.99          72.82     1.974
IPC          TrD      55.24‡         50.20‡         73.21     1.954
IPC          NCA      55.33‡         50.29‡         73.28     1.949
IPC          ExTrD    55.31‡         50.29‡         73.26     1.951
RCV1         LR       78.43          60.37          80.16     0.534
RCV1         TrD      79.46‡         60.61          82.83     0.451
RCV1         NCA      79.74‡         60.76          83.11     0.442
RCV1         ExTrD    79.33‡         61.74†         82.91     0.466

Table: Performance comparison of hierarchical costs


Results (Imbalance Costs)

Datasets     Method        Micro-F1 (↑)   Macro-F1 (↑)   hF1 (↑)   TE (↓)
CLEF         IMB + LR      79.52          53.11          85.19     1.002
CLEF         IMB + TrD     79.92          52.84          85.59     0.978
CLEF         IMB + NCA     79.62          51.89          85.34     0.994
CLEF         IMB + ExTrD   80.32          58.45          85.69     0.966
DMOZ-SMALL   IMB + LR      48.55‡         32.72‡         68.62     3.406
DMOZ-SMALL   IMB + TrD     49.03‡         33.21‡         69.41     3.334
DMOZ-SMALL   IMB + NCA     48.87‡         33.27‡         69.37     3.335
DMOZ-SMALL   IMB + ExTrD   49.03‡         33.34‡         69.54     3.322
IPC          IMB + LR      55.04          49.00          72.82     1.974
IPC          IMB + TrD     55.60‡         50.45†         73.56     1.933
IPC          IMB + NCA     55.33          50.29          73.28     1.949
IPC          IMB + ExTrD   55.67‡         50.42          73.58     1.931
RCV1         IMB + LR      78.59‡         60.77          81.27     0.511
RCV1         IMB + TrD     79.63‡         61.04          83.13     0.435
RCV1         IMB + NCA     79.61          61.04          82.65     0.458
RCV1         IMB + ExTrD   79.22          61.33          82.89     0.469

Table: Performance comparison with imbalance cost included


Results (our best with other methods)

Datasets     Method     Micro-F1 (↑)   Macro-F1 (↑)   hF1 (↑)   TE (↓)
CLEF         TD-LR      73.06          34.47          79.32     1.366
CLEF         LR         79.82          53.45          85.24     0.994
CLEF         HR-LR      80.12          55.83          NA        NA
CLEF         HierCost   80.32          58.45†         85.69     0.966
DMOZ-SMALL   TD-LR      40.90          24.15          69.99     3.147
DMOZ-SMALL   LR         46.39          30.20          67.00     3.569
DMOZ-SMALL   HR-LR      45.11          28.48          NA        NA
DMOZ-SMALL   HierCost   49.03‡         33.34‡         69.54     3.322
IPC          TD-LR      50.22          43.87          69.33     2.210
IPC          LR         55.04          48.99          72.82     1.974
IPC          HR-LR      55.37          49.60          NA        NA
IPC          HierCost   55.67‡         50.42†         73.58     1.931
RCV1         TD-LR      77.85          57.80          88.78     0.524
RCV1         LR         78.43          60.37          80.16     0.534
RCV1         HR-LR      81.23          55.81          NA        NA
RCV1         HierCost   79.22‡         61.33          82.89     0.469

Table: Performance comparison of HierCost with other baseline methods


Runtime comparison

Datasets     TD-LR   LR      HierCost
CLEF         <1      <1      <1
DMOZ-SMALL   4       41      40
IPC          27      643     453
RCV1         20      29      48
DMOZ-2010    196     15191   20174
DMOZ-2012    384     46044   50253

Table: Total training runtimes (in mins)


Demo of Software

Freely available for research and education purpose at:

https://cs.gmu.edu/∼mlbio/HierCost/

Software: implemented in Python using the scikit-learn machine learning library and the svmlight-loader package

Other prerequisite packages:

numpy
scipy
networkx
pandas


Command Line Interface (CLI) Options

Two main CLIs are exposed for easy training and classification

train.py

-d : train data file path
-t : hierarchy file path
-m : path to save learned model parameters
-f : number of features
-r : regularization parameter (> 0; default = 1)
-i : to incorporate imbalance cost
-c : cost function to use (lr, trd, nca, etrd) (default = 'lr')
-u : for multi-label classification (default = single-label classification)
-n : set of nodes to train (for parallelization)

predict.py

-p : path to save predictions for test examples
-m, -d, -t, -f, -u : same meaning as for train.py
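The options listed above can be assembled programmatically. The following is a minimal sketch, not part of the HierCost package: the helper names and file paths are hypothetical, and it assumes that -i and -u are switches that take no value, which the slide suggests but does not state.

```python
import shlex

VALID_COSTS = {"lr", "trd", "nca", "etrd"}  # cost functions listed for -c


def build_train_command(data_path, hierarchy_path, model_path, num_features,
                        reg=1.0, cost="lr", imbalance=False, multilabel=False):
    """Assemble an argument list for train.py (hypothetical helper)."""
    if cost not in VALID_COSTS:
        raise ValueError(f"unknown cost function: {cost}")
    args = ["python", "train.py",
            "-d", data_path,        # train data file path
            "-t", hierarchy_path,   # hierarchy file path
            "-m", model_path,       # where to save learned parameters
            "-f", str(num_features),
            "-r", str(reg),
            "-c", cost]
    if imbalance:
        args.append("-i")           # assumed to be a value-less switch
    if multilabel:
        args.append("-u")           # assumed to be a value-less switch
    return args


def build_predict_command(data_path, hierarchy_path, model_path, num_features,
                          pred_path, multilabel=False):
    """Assemble an argument list for predict.py (hypothetical helper)."""
    args = ["python", "predict.py",
            "-d", data_path,
            "-t", hierarchy_path,
            "-m", model_path,
            "-f", str(num_features),
            "-p", pred_path]        # where to save predictions
    if multilabel:
        args.append("-u")
    return args


# Placeholder paths; print the command line instead of running it.
train_cmd = build_train_command("train.txt", "hierarchy.txt", "model_dir", 1000,
                                cost="etrd", imbalance=True)
print(shlex.join(train_cmd))
```

The resulting list can be passed to `subprocess.run` once the paths point at real files.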


Part II

Inconsistent Hierarchy

(30 minutes break)


Motivation

Predefined Hierarchy

Hierarchy defined by the domain experts

Reflects the human view of the domain - may not be optimal for machine learning classification algorithms


Motivation

Flat Classification

Works well for well-balanced datasets with a smaller number of categories

Expensive train/prediction cost

Hierarchical Classification

Performs well for rare categories by leveraging hierarchical structure

Computationally efficient

Preferable for large-scale datasets

Some benchmark datasets have good performance with the flat method (and its variants). Can we improve upon that using hierarchical settings?


Case Study: NG dataset

Different hierarchical structures result in completely different classification performance

[Figure: three hierarchical structures for the NG dataset, with µF1=77.04, MF1=77.94; µF1=79.42, MF1=79.82; µF1=81.24, MF1=81.94]

Well-known HC Methods

Parent-child regularization [Gopal and Yang, KDD’13]

[HR-LR]

$$\min_{W} \sum_{k \in N} \frac{1}{2} \lVert w_k - w_{\pi(k)} \rVert_2^2 + C \sum_{l \in L} \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i^l w_l^T x_i\right)\right)$$

Cost-sensitive learning [Charuvaka and Rangwala, ECML’15]

[HierCost]

$$\min_{w_l} \left[ \frac{1}{2} \lVert w_l \rVert_2^2 + C \sum_{i=1}^{n} \sigma_i \log\left(1 + \exp\left(-y_i^l w_l^T x_i\right)\right) \right]$$

HC methods use the hierarchical structure. Performance can deteriorate if the hierarchy used is not consistent.
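To make the HierCost objective above concrete, here is a minimal NumPy sketch that evaluates it for one label l. It is an illustration only: the function name is hypothetical, the per-example costs sigma are passed in as given numbers (how they are derived from the hierarchy is not shown), and labels are assumed to be encoded as +1/-1.

```python
import numpy as np


def hiercost_objective(w, X, y, sigma, C=1.0):
    """0.5 * ||w||_2^2 + C * sum_i sigma_i * log(1 + exp(-y_i * w^T x_i)).

    w     : (d,) weight vector for one label
    X     : (n, d) feature matrix
    y     : (n,) labels in {+1, -1} for this label
    sigma : (n,) non-negative per-example costs (assumed given)
    """
    margins = y * (X @ w)
    # logaddexp(0, -m) equals log(1 + exp(-m)) but avoids overflow
    losses = np.logaddexp(0.0, -margins)
    return 0.5 * float(w @ w) + C * float(sigma @ losses)


# Toy data: three examples, two features
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
sigma = np.array([1.0, 2.0, 3.0])
print(hiercost_objective(np.zeros(2), X, y, sigma))
```

Setting every sigma_i to 1 reduces the expression to ordinary L2-regularized logistic regression, which is a quick way to check an implementation.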


Reason for Inconsistencies within Predefined Hierarchy - I

Hierarchy is designed for the sole purpose of easy search and navigation without taking classification into consideration

Hierarchy is created based on semantics which is independent of data

whereas

classification depends on data characteristics such as term frequency

“Our expectation: a data-driven hierarchy can be much more powerful”

A priori, it is not clear to domain experts when to generate new nodes (hierarchy expansion) or merge two or more nodes (link creation) in the hierarchy


Reason for Inconsistencies within Predefined Hierarchy - II

Given a list of categories, different experts may come up with different hierarchies with completely different classification results

A large number of classes with confusing labels poses a unique challenge for manual design of a good hierarchy


Reason for Inconsistencies within Predefined Hierarchy - III

Dynamic changes can affect hierarchical relationships

“Flood” is a sub-group of the geography class, but during the Chennai flood it becomes political news


What we want?

“Predefined Hierarchy”

↓

“Data-driven Hierarchy”

(for improving classification performance)


Literature Overview


Flattening Strategy

Motivation

For large scale datasets, top-down (TD) hierarchical models are preferred over flat models due to computational benefits (training and prediction time)

TD model performance suffers due to error propagation, i.e. compounding of errors from misclassifications at higher levels which cannot be rectified at the lower levels

Objective

Modify predefined hierarchy by removing (flattening) inconsistent nodes to improve the classification performance of TD models

Reduces top-down error propagation because fewer decisions are needed to classify an unlabeled example
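As an illustration of what removing (flattening) a node means structurally, the sketch below deletes one internal node and attaches its children directly to its parent. The hierarchy representation (a dict from node to list of children) and the function name are assumptions made for this example, not the authors' implementation.

```python
def flatten_node(children, node):
    """Return a new child map with `node` removed and its children
    attached to the parent of `node`.

    children : dict mapping each internal node to a list of its children
    node     : the internal (non-root) node to flatten
    """
    parent = next((p for p, kids in children.items() if node in kids), None)
    if parent is None:
        raise ValueError("cannot flatten the root or an unknown node")
    # Copy every entry except the one for the flattened node
    new_children = {p: list(kids) for p, kids in children.items() if p != node}
    # Replace `node` in its parent's child list by the node's own children
    idx = new_children[parent].index(node)
    new_children[parent][idx:idx + 1] = children.get(node, [])
    return new_children


hierarchy = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
print(flatten_node(hierarchy, "A"))
```

After flattening "A", reaching the leaf "a1" takes one decision at the root instead of two, which is the reduction in top-down decisions described above.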


Flatten Hierarchies, Wang and Lu, ICDIM’10

Level Flattening Techniques

Single or multiple levels within the hierarchy are flattened

Based on the level(s) flattened, various methods exist, e.g. TLF, BLF, MLF

Drawback - all nodes in the level(s) are identified as inconsistent, which may not be true, resulting in poor classification performance


Selected Inconsistent Node Removal (Flattening)

Rather than flattening entire level(s), only a subset of the inconsistent nodes is removed from the hierarchy

Criterion to decide inconsistent nodes - degree of error made at the node, margin-based or learning-based strategy

Comparatively better performance than level flattening methods


Inconsistent Node Removal - I

Local Approach for INR (Level-INR)

Inconsistent set of nodes determined for each level based on loss function values (such as logistic loss) obtained for nodes at that level

Criterion for flattening nodes - mean and standard deviation per level

Different levels have different thresholds for node flattening


Inconsistent Node Removal - II

Global Approach for INR (Global-INR)

Inconsistent nodes determined by considering the loss function values of all internal nodes in the hierarchy

Criterion for flattening nodes - mean and standard deviation of all nodes

All levels have the same threshold for node flattening
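The difference between the local and global criteria can be sketched as follows. The slides only say the criterion is based on the mean and standard deviation of the node losses; the specific rule used here (flag a node when its loss exceeds mean plus one standard deviation) is an assumption for illustration, as are the function names and the toy loss values.

```python
import statistics


def inconsistent_nodes(loss_by_node):
    """Flag nodes whose loss exceeds mean + one standard deviation
    of the group (assumed thresholding rule)."""
    losses = list(loss_by_node.values())
    if len(losses) < 2:
        return set()
    threshold = statistics.mean(losses) + statistics.pstdev(losses)
    return {node for node, loss in loss_by_node.items() if loss > threshold}


def level_inr(loss_by_node, level_of):
    """Local variant: a separate threshold is computed for each level."""
    flagged = set()
    for level in set(level_of.values()):
        group = {n: l for n, l in loss_by_node.items() if level_of[n] == level}
        flagged |= inconsistent_nodes(group)
    return flagged


def global_inr(loss_by_node):
    """Global variant: one threshold over all internal nodes."""
    return inconsistent_nodes(loss_by_node)


# Hypothetical per-node losses and levels
losses = {"A": 0.1, "B": 0.1, "C": 0.9, "D": 0.5, "E": 0.5, "F": 0.5, "G": 2.0}
levels = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2, "G": 2}
print(sorted(level_inr(losses, levels)))
print(sorted(global_inr(losses)))
```

On this toy input the per-level thresholds single out a node on each level, whereas the single global threshold is dominated by the large loss on the second level.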


Comparison Methods

One-vs-rest models are trained for each node (except root) in the hierarchy

Predictions are made starting from the root node and recursively selecting the best child node until a leaf node is reached
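The top-down prediction procedure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: each non-root node has a linear one-vs-rest scorer stored in a dict of weight vectors, the "best" child is the one with the highest score, and all names and toy values are hypothetical.

```python
import numpy as np


def predict_top_down(x, children, weights, root="root"):
    """Walk from the root to a leaf, choosing the highest-scoring child
    at every step.

    x        : (d,) feature vector
    children : dict mapping internal nodes to lists of child nodes
    weights  : dict mapping every non-root node to a (d,) weight vector
    """
    node, path = root, []
    while children.get(node):  # stop once the current node has no children
        node = max(children[node], key=lambda c: float(weights[c] @ x))
        path.append(node)
    return node, path


children = {"root": ["A", "B"], "A": ["a1", "a2"]}
weights = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
    "a1": np.array([1.0, 1.0]),
    "a2": np.array([1.0, -1.0]),
}
leaf, path = predict_top_down(np.array([2.0, -1.0]), children, weights)
print(leaf, path)
```

A wrong choice at the root cannot be undone further down, which is the error propagation issue that motivates flattening.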

Hierarchical baselines

Top-down Logistic Regression (TD-LR) - predefined hierarchy used for training the models

Level Flattening [Wang and Lu, ICDIM’10] - flattened hierarchy used for training the models; based on the level flattened we have:

TLF - Top level flattened
BLF - Bottom level flattened
MLF - Multiple levels flattened

MTA [Babbar et al., NIPS’13] - hierarchy is modified using the margin value computed at each node


Comparison Against Other Flattening Methods

Level-INF and Global-INF performed comparatively better than other approaches

Global approach has better performance as compared to local approach

                      Top-down Hierarchical Baselines   Proposed Models
Name         Metric   TLF    BLF    MLF    MTA          Level-INF  Global-INF

CLEF         µF1 (↑)  75.84  73.76  X      74.48        75.25      77.14 (M)
             MF1 (↑)  38.45  40.93  X      39.53        39.89      46.54 (N)

DIATOMS      µF1 (↑)  56.93  53.27  X      58.36        58.32      61.31 (N)
             MF1 (↑)  45.17  44.30  X      45.21        48.77      51.85 (N)

IPC          µF1 (↑)  51.28  50.36  X      51.36        50.40      52.30 (M)
             MF1 (↑)  44.99  43.74  X      42.80        43.26      45.65 (M)

DMOZ-SMALL   µF1 (↑)  45.48  44.34  45.80  46.01        45.43      46.61 (M)
             MF1 (↑)  30.60  30.94  30.62  30.82        30.34      31.86 (N)

DMOZ-2010    µF1 (↑)  41.32  40.34  41.77  41.82        40.71      42.37
             MF1 (↑)  29.05  28.41  29.11  29.18        28.66      30.41

DMOZ-2012    µF1 (↑)  50.32  50.11  48.05  50.31        49.90      50.64
             MF1 (↑)  29.89  29.73  27.65  30.04        30.52      30.58

N (and M) indicates that improvements are statistically significant with p-value <0.01 (and <0.05).


Comparison Against Flat Method

             # Train examples   Best Proposed (Global-INF)   Flat Baseline (LR)
Name         per class          MF1         hF1              MF1     hF1

DMOZ-SMALL   ≤5                 28.77       51.86            27.02   46.81
             6-10               55.55       67.47            54.76   65.40
             11-50              72.26       78.74            72.60   80.12
             >50                69.43       86.70            71.44   88.95
             avg.               31.86 (M)   63.37            30.80   60.87

DMOZ-2010    ≤5                 18.23       53.59            14.35   48.13
             6-10               23.03       55.76            22.62   51.84
             11-50              42.56       62.39            43.26   61.85
             >50                70.74       77.51            73.20   81.51
             avg.               28.41 (M)   56.17            27.06   53.94

DMOZ-2012    ≤5                 10.28       50.56            8.78    48.01
             6-10               20.37       50.71            18.84   48.82
             11-50              37.19       73.16            37.98   73.24
             >50                53.20       79.73            55.72   84.92
             avg.               29.14 (N)   68.24            27.04   66.45


Runtime Comparison (in mins)

Training Time

             CLEF  DIATOMS  IPC  DMOZ-SMALL  DMOZ-2010  DMOZ-2012
Global-INF   3     10       830  68          25,462     63,000
LR           1     3        658  46          15,248     46,124

Prediction Time


Drawbacks of Flattening Strategy

The flattening strategy, although useful to a certain extent, has a few limitations

Inability to deal with inconsistencies in different branches of the hierarchy

A rewiring strategy can be used to resolve inconsistencies that occur in different branches


Hierarchy Adjustment using Elementary Operation - I, Tang et al., SIGKDD’06

Elementary operation: promote, demote, merge


Hierarchy Adjustment using Elementary Operation - II

Assumption: the optimal hierarchy is in the neighborhood of the predefined taxonomy

Search for a constrained optimal hierarchy by applying a sequence of elementary operations and searching in the hierarchy space


Proposed Hierarchy Adjustment Algorithm

Wrapper based approach for hierarchy modification; requires hierarchy evaluation after each modification, which is computationally expensive

Input: Predefined hierarchy (H0), Training data (Dt), Validation data (Dv)

1 Generate neighbor hierarchies for H0
2 Train hierarchical classification models for each neighbor on Dt
3 Evaluate hierarchical classifiers on Dv
4 Pick the best neighbor hierarchy as H0
5 Until no improvement, repeat from step 1
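The five steps above amount to a greedy local search over hierarchies. The sketch below captures only that control flow; `neighbors` (which would apply the elementary operations) and `train_and_score` (which would train on Dt and evaluate on Dv) are hypothetical callables supplied by the caller, and the iteration cap is an added safeguard rather than part of the slide.

```python
def hill_climb_hierarchy(h0, neighbors, train_and_score, max_iters=100):
    """Greedy search: move to the best-scoring neighbor hierarchy until
    no neighbor improves the validation score.

    h0              : starting (predefined) hierarchy, any object
    neighbors       : callable, hierarchy -> iterable of neighbor hierarchies
    train_and_score : callable, hierarchy -> validation score (higher is better)
    """
    best_h, best_score = h0, train_and_score(h0)
    for _ in range(max_iters):
        # Steps 1-3: generate neighbors, train on Dt, evaluate on Dv
        candidates = [(train_and_score(h), h) for h in neighbors(best_h)]
        if not candidates:
            break
        score, h = max(candidates, key=lambda pair: pair[0])
        # Step 5: stop when the best neighbor does not improve
        if score <= best_score:
            break
        # Step 4: the best neighbor becomes the current hierarchy
        best_h, best_score = h, score
    return best_h, best_score


# Toy stand-in: "hierarchies" are integers and the score peaks at 3
result = hill_climb_hierarchy(
    0,
    neighbors=lambda h: [h - 1, h + 1],
    train_and_score=lambda h: -(h - 3) ** 2,
)
print(result)
```

Every call to `train_and_score` stands for a full training and evaluation run, which is why this wrapper style becomes expensive on large hierarchies.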


Dataset for Experimental Evaluation

Sub-branch from AOL database hierarchy is used for evaluation

Data is small w.r.t. the number of features and classes

Dataset  #Total Nodes  #Leaf Nodes  Height  #Training  #Features

Soc      83            69           4       5,248      34,003
Kids     299           244          5       15,795     48,115

Table: Dataset Statistics


Performance Results

Adjusted hierarchy shows significant performance improvement in comparison to the predefined (original) hierarchy and the hierarchy generated using a clustering approach

Figure: Soc (left) and Kids (right) dataset results


Filter based Rewiring Strategy

Motivation

For large scale datasets, wrapper based approaches are intractable due to multiple hierarchy evaluations

Objective

Modify predefined hierarchy using a filter based rewiring strategy that does not require multiple hierarchy evaluations

Without significant loss in performance


Proposed Rewiring Strategy

Elementary operations: node creation, parent-child rewiring, node deletion


Proposed Rewiring Strategy Algorithm

Filter based approach for hierarchy modification

Input: Predefined hierarchy (H0), Train data (Dt)

1 Compute pairwise similarity between classes defined in H0 on Dt
2 Group together most similar classes
3 Identify inconsistencies within the hierarchy
4 Apply elementary operations: node creation or parent-child rewiring to correct inconsistencies and obtain new hierarchy H1
5 Perform post-processing step (node deletion) on H1 to obtain new hierarchy H2
6 Train and evaluate hierarchical classification models on H2
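Steps 1 to 3 above can be illustrated with a small NumPy sketch. The slides do not specify the similarity measure or the grouping rule, so the choices here are assumptions: each class is represented by the mean of its training vectors, similarity is cosine similarity, and a class is reported as a rewiring candidate when its most similar class sits under a different parent. All names and the toy data are hypothetical.

```python
import numpy as np


def class_centroids(X, y):
    """Mean feature vector per class (assumed class representation)."""
    labels = sorted(set(y))
    y_arr = np.asarray(y)
    return labels, np.vstack([X[y_arr == c].mean(axis=0) for c in labels])


def cosine_similarity_matrix(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    U = M / np.where(norms == 0, 1.0, norms)
    return U @ U.T


def rewiring_candidates(X, y, parent_of):
    """Pairs (class, most similar class) that have different parents."""
    labels, centroids = class_centroids(X, y)
    S = cosine_similarity_matrix(centroids)
    np.fill_diagonal(S, -np.inf)  # ignore self-similarity
    pairs = []
    for i, c in enumerate(labels):
        nearest = labels[int(np.argmax(S[i]))]
        if parent_of[c] != parent_of[nearest]:
            pairs.append((c, nearest))
    return pairs


X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]])
y = ["a", "a", "b", "b", "c"]
parent_of = {"a": "P", "b": "Q", "c": "Q"}
print(rewiring_candidates(X, y, parent_of))
```

Because only one pass over the training data is needed to compute the similarities, no hierarchy has to be trained and evaluated until the final step, which is the point of the filter based approach.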


Case Study: NG dataset

Recap: Figure shows different hierarchical structures obtained using flattening and rewiring approaches

[Figure: three hierarchical structures for the NG dataset, with µF1=77.04, MF1=77.94; µF1=79.42, MF1=79.82; µF1=81.24, MF1=81.94]

Performance Results - Flat Measure

T-Easy method is slightly better due to the brute-force method to find optimal hierarchy

T-Easy is very expensive; not scalable for large-scale datasets

                                Agglomerative  Flattening   Rewiring Methods
Name         Metric   TD-LR    Clustering     Global-INF   T-Easy        rewHier

CLEF         µF1 (↑)  72.74    73.24          77.14        78.12         78.00
             MF1 (↑)  35.92    38.27          46.54        48.83 (N)     47.10 (N)

DIATOMS      µF1 (↑)  53.27    56.08          61.31        62.34 (N)     62.05 (N)
             MF1 (↑)  44.46    44.78          51.85        53.81 (N)     52.14 (N)

IPC          µF1 (↑)  49.32    49.83          52.30        53.94 (M)     54.28 (M)
             MF1 (↑)  42.51    44.50          45.65        46.10 (M)     46.04 (M)

DMOZ-SMALL   µF1 (↑)  45.10    45.94          46.61        Not Scalable  48.25 (M)
             MF1 (↑)  30.65    30.75          31.86        Not Scalable  32.92 (N)

DMOZ-2010    µF1 (↑)  40.22    Not Scalable   42.37        Not Scalable  43.10
             MF1 (↑)  28.37    Not Scalable   30.41        Not Scalable  31.21

DMOZ-2012    µF1 (↑)  50.13    Not Scalable   50.64        Not Scalable  51.82
             MF1 (↑)  29.89    Not Scalable   30.58        Not Scalable  31.24

N (and M) indicates that improvements are statistically significant with p-value <0.01 (and <0.05).


Performance Results - Hierarchical Measure

             Hierarchy  Flattening   Rewiring Methods
Name         used       Global-INF   T-Easy        rewHier

CLEF         Original   79.06        81.43         80.14
             Modified   80.87        81.82         81.28

DIATOMS      Original   62.80        64.28         63.24
             Modified   63.88        66.35         64.27

IPC          Original   64.73        67.23         68.34
             Modified   66.29        68.10         68.36

DMOZ-SMALL   Original   63.37        Not Scalable  66.18
             Modified   64.97        Not Scalable  66.30

DMOZ-2012    Original   73.19        Not Scalable  74.21


Runtime Comparison (in mins)

T-Easy method is very expensive (∼20 times more expensive for IPC dataset)

             Baseline  Flattening   Rewiring Methods
Name         TD-LR     Global-INF   T-Easy        rewHier

CLEF         2.5       3.5          59            7.5
DIATOMS      8.5       10           268           24
IPC          607       830          26432         1284
DMOZ-SMALL   52        65           Not Scalable  168
DMOZ-2010    20190     25600        Not Scalable  42000
DMOZ-2012    50040     63000        Not Scalable  94800


# Elementary Operation Comparisons

# elementary operations executed                                          CLEF  DIATOMS  IPC

Tang et al. (promote, demote, merge)                                      52    156      412
Proposed Rewiring Filter Model (node creation, PCRewire, node deletion)   25    34       42


Effect of varying % of Training Size

Our proposed rewHier method performs well for smaller % of training dataset


Comparison to Flat, TD-LR, HierCost Approach

Our proposed hierarchy modification results in best performance irrespective of the model trained

(a) Macro-F1 (b) hF1


Comparison to Flat, TD-LR, HierCost Approach

Figure shows percentage of classes improved over flat approach

80% of the classes showed improved performance with our rewHier hierarchy and HierCost approach


Conclusion

Proposed different approaches for flattening inconsistent nodes

Local Approach
Global Approach

Proposed filter-based data-driven rewiring approach. Works well, especially for rare categories.

Works well for large-scale datasets due to embarrassingly parallel steps.


Learning using Multiple Hierarchies (MTL), Charuvaka and Rangwala, ICDM’12

Motivation

Hierarchies are so common that sometimes multiple hierarchies classify similar data

Heterogeneous label views provide additional knowledge which should be exploited by learners

Examples

protein structure classification - several hierarchical schemes for organizing proteins based on curation process or 3D structure
web-page classification - several hierarchies exist for categorizing web pages, such as the DMOZ and Wikipedia datasets

Objective

Utilize multiple hierarchical label views in a multi-task learning context to improve classification performance


Three Different Learning Settings - I

(i) Single Task Learning (STL) - each task's model parameters are learned independently


Three Different Learning Settings - II

(ii) Single Hierarchy Multi-Task Learning (SHMTL) - relationships between tasks within a hierarchy are combined individually


Three Different Learning Settings - III

(iii) Multiple Hierarchy Multi-Task Learning (MHMTL) - relationships between tasks from different hierarchies are extracted using common examples


MTL Formulations

General MTL formulation:

Different MTL formulation based on regularization:

Sparse - All tasks share a single set of useful features

$$\Omega(W) = \lVert W \rVert_{2,1}$$

Graph Regularization - Related tasks have similar parameters

$$\Omega(W) = \sum_{(a,b) \in E} \lVert W_a - W_b \rVert_2^2$$

Trace - Task parameters are drawn from a low dimensional sub-space

$$\Omega(W) = \lVert W \rVert_* = \mathrm{TraceNorm}(W)$$
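The three regularizers can be computed directly with NumPy, as in the sketch below. The layout of W is an assumption made for this example: rows index features and columns index tasks, so W_a is the column of task a and the 2,1-norm sums the Euclidean norms of the rows. The function names are hypothetical.

```python
import numpy as np


def l21_norm(W):
    """Sparse regularizer: sum of the Euclidean norms of the rows of W,
    which pushes whole feature rows to zero across all tasks."""
    return float(np.linalg.norm(W, axis=1).sum())


def graph_regularizer(W, edges):
    """Graph regularizer: squared distance between the parameter columns
    of every pair of related tasks (a, b) in the edge list."""
    return float(sum(np.sum((W[:, a] - W[:, b]) ** 2) for a, b in edges))


def trace_norm(W):
    """Trace (nuclear) norm: sum of the singular values of W."""
    return float(np.linalg.norm(W, ord="nuc"))


W = np.array([[3.0, 4.0],
              [0.0, 0.0]])  # 2 features x 2 tasks
print(l21_norm(W), graph_regularizer(W, [(0, 1)]), trace_norm(W))
```

On this small W the second feature row is all zeros, so it contributes nothing to the 2,1-norm, and the matrix has a single non-zero singular value.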


Performance: AUC Comparison


STL, SHMTL and MHMTL Comparison


Extreme Classification

Motivation

Many real world multi-class and multi-label problems involve an extremely large number of labels (output space)

Learning a classifier for each label is almost an impossible task

Inter-label dependency not available

Examples

Predict hashtags from tweets
LSHTC (Kaggle competition): Predict Wikipedia tags from documents

Objective

Given a huge set of labels, identify the labels that can be assigned to unlabeled instances (examples), efficiently and accurately


Extreme Classification Challenges

Statistical Challenges

an increase in the number of classes tends to decrease accuracy due to the complexity of discriminating between different classes

Computational Challenges

training classifiers for a large number of classes is computationally infeasible
predicting labels for unlabeled test instances is also a compute-intensive task


Eigenpartition trees, Mineiro and Karampatziakis, NIPS workshop’15

Compute a small set of plausible labels using eigenpartition decomposition at each node in the tree

At each node, try to send each class's examples exclusively left or right
While sending roughly the same number of examples left or right in aggregate

can be achieved through tree decomposition [Choromanska and Langford, NIPS’15]

Optimization function

$$\max_{w} \; w^T (X^T X) w \quad \text{s.t.} \quad w^T w \le 1, \quad \mathbf{1}^T X^T w = 0$$

Invoke expensive classifiers on the set of plausible labels only


Experiments

Dataset description

Large-scale text dataset used for model evaluation


Results - LSHTC dataset

FastXML [Prabhu and Varma, SIGKDD’14] - node partitioning formulation that optimizes an nDCG-based ranking loss over all the labels

X1 [Bhatia et al., NIPS’15] - the effective number of labels is reduced by projecting the high-dimensional label vectors onto a low-dimensional linear subspace
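To make the nDCG-based ranking measure mentioned for FastXML concrete, here is a minimal sketch of nDCG@k with binary relevance (a label is either relevant or not). It only illustrates the metric and is not FastXML's training objective or code. The function names and the toy scores are assumptions.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: the gain at 0-based rank r is divided by log2(r + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(scores, true_labels, k):
    """scores: dict label -> predicted score; true_labels: set of relevant labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    gains = [1.0 if label in true_labels else 0.0 for label in ranked]
    # The ideal ranking places all relevant labels first.
    ideal = [1.0] * min(k, len(true_labels))
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

value = ndcg_at_k({"a": 0.9, "b": 0.8, "c": 0.1}, {"a", "c"}, k=2)
```

Because of the logarithmic discount, placing a relevant label near the top of the ranking matters more than placing it lower down, which is why the measure suits settings where only a few top predictions are shown.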


Deep Classification, Xue et al., SIGIR’08

Motivation

Large scale taxonomies are becoming more prevalent because the more specific topic-related class information they provide is beneficial in several domains

Examples

web search browsing - finding documents relevant to a query

modeling users for personalized web search - "java" means different things to a tourist and a programmer

advertisement matching - finding related ads corresponding to a web page

Traditional algorithms cannot be directly scaled to large scale problems due to several drawbacks

large scale hierarchies

longer training time

incorporating structural information into the learning framework


Motivation - II

Observations

The categories related to a query document are far fewer than the unrelated categories

Classification over a smaller set of categories is easier and performs much better than over the large set of categories

Objective

Given large and deep hierarchies, identify the relevant subset of categories for effectively finding the label of unlabeled test instances


Two Stages for Classification

First stage - Search stage

Identify related candidate categories corresponding to the test example

Second stage - Classification stage

Select the best of the candidate categories, using a classification algorithm, as the label for the test document


First Stage

The large hierarchy is pruned into a smaller sub-hierarchy containing only the candidate categories and their ancestors
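The pruning step can be sketched as follows, assuming the hierarchy is stored as a child-to-parent map with the root mapped to `None`. The function name `prune_hierarchy` and the toy category names are illustrative, not taken from the paper.

```python
def prune_hierarchy(parent, candidates):
    """Keep only the candidate categories and their ancestors.

    parent: dict mapping each category to its parent (root maps to None).
    Returns the child-to-parent map restricted to the kept categories.
    """
    keep = set()
    for node in candidates:
        # Walk up to the root, stopping early on already visited ancestors.
        while node is not None and node not in keep:
            keep.add(node)
            node = parent.get(node)
    return {node: parent.get(node) for node in keep}

parent = {
    "Root": None,
    "Arts": "Root", "Music": "Arts", "Jazz": "Music",
    "Science": "Root", "Physics": "Science",
    "Sports": "Root",
}
pruned = prune_hierarchy(parent, ["Jazz", "Physics"])
```

Here the "Sports" branch is dropped because it holds no candidate category, while "Arts" and "Music" survive as ancestors of "Jazz".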


Second Stage

Classifiers are trained on candidate categories

Best category for test example is selected using trained classifiers
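The two stages can be put together in a small sketch. It makes simplifying assumptions: documents are bag-of-words counts, the search stage takes the categories of the k nearest training documents by cosine similarity, and the classification stage is a nearest-centroid classifier built only from the candidate categories. The nearest-centroid classifier is a stand-in chosen for brevity, not the classifier used in the paper. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search_stage(train, query, k):
    """train: list of (Counter, category). Return categories of the k nearest documents."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    return {category for _, category in ranked[:k]}

def classification_stage(train, query, candidates):
    # Build one centroid per candidate category only, then pick the closest.
    centroids = defaultdict(Counter)
    for vec, category in train:
        if category in candidates:
            centroids[category].update(vec)
    return max(centroids, key=lambda c: cosine(centroids[c], query))

def deep_classify(train, text, k=3):
    query = Counter(text.lower().split())
    candidates = search_stage(train, query, k)
    return classification_stage(train, query, candidates), candidates

docs = [
    ("java island coffee travel", "Travel/Java"),
    ("java travel beach island", "Travel/Java"),
    ("java code compiler class", "Programming/Java"),
    ("java jvm code class", "Programming/Java"),
    ("python code snake", "Programming/Python"),
    ("football goal match", "Sports/Football"),
]
train = [(Counter(text.split()), category) for text, category in docs]
label, candidates = deep_classify(train, "java compiler code", k=3)
```

Only the candidate categories take part in the second stage, so the cost of the more careful classifier depends on k rather than on the size of the full hierarchy.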


Dataset

Open Directory Project (ODP) dataset used for evaluation

Dataset statistics

Training dataset - 1,174,586 web pages, 130,000 categories organized into 15 levels

Test dataset - 130,000 web pages

Figure: Data distribution at different levels in the hierarchy

Results - Micro-F1 performance comparison

Search based Strategy - best neighbor is chosen as the label for test example

Hierarchical SVM [Liu et al., SIGKDD’05] - a pachinko-machine-style SVM

Deep Classification - top ten neighbors are used as the candidate categories


Selecting the number of candidate categories

As the number of candidate categories chosen by the search stage increases, the chance of finding the correct label for the test example in the classification stage increases

Evaluation time also increases with the number of candidate categories selected


Conclusion

Large scale hierarchical classification is an important research problem in the machine learning community due to its wide applicability across several domains

Discussed various challenges associated with hierarchical classification

Discussed various existing state-of-the-art approaches; demo of the software package developed by the authors

Emerging topics:

Large-scale classification with deep hierarchies

Orphan node prediction


References - I

Gopal, Siddharth, and Yiming Yang. "Recursive regularization for large-scale classification with hierarchical and graphical dependencies." SIGKDD, 2013.

Charuvaka, Anveshi, and Huzefa Rangwala. "HierCost: Improving Large Scale Hierarchical Classification with Cost Sensitive Learning." ECML, 2015.

Tang, Lei, Jianping Zhang, and Huan Liu. "Acclimatizing taxonomic semantics for hierarchical content classification." SIGKDD, 2006.

Li, Tao, Shenghuo Zhu, and Mitsunori Ogihara. "Hierarchical document classification using automatically generated hierarchy." Journal of Intelligent Information Systems 29.2 (2007): 211-230.

Charuvaka, Anveshi, and Huzefa Rangwala. "Multi-task learning for classifying proteins using dual hierarchies." ICDM, 2012.

Punera, Kunal, Suju Rajan, and Joydeep Ghosh. "Automatically learning document taxonomies for hierarchical classification." WWW, 2005.

Qi, Xiaoguang, and Brian D. Davison. "Hierarchy evolution for improved classification." CIKM, 2011.

Bennett, Paul N., and Nam Nguyen. "Refined experts: improving classification in large taxonomies." SIGIR, 2009.


References - II

Silla Jr, Carlos N., and Alex A. Freitas. "A survey of hierarchical classification across different application domains." DMKD, 2011.

Naik, Azad, A. Charuvaka, and H. Rangwala. "Classifying documents within multiple hierarchical datasets using multi-task learning." ICTAI, 2013.

Babbar, Rohit, et al. "On flat versus hierarchical classification in large-scale taxonomies." NIPS, 2013.

Wang, Xiao-Lin, and Bao-Liang Lu. "Flatten hierarchies for large-scale hierarchical text categorization." ICDIM, 2010.

Chuang, Shui-Lung, and Lee-Feng Chien. "A practical web-based approach to generating topic hierarchy for text segments." CIKM, 2004.

Fagni, Tiziano, and Fabrizio Sebastiani. "On the selection of negative examples for hierarchical text categorization." LTC, 2007.

Clare, Amanda, and Ross D. King. "Predicting gene function in Saccharomyces cerevisiae." Bioinformatics, 2003.

Koller, Daphne, and Mehran Sahami. "Hierarchically Classifying Documents Using Very Few Words." ICML, 1997.


References - III

Xue et al. "Deep classification in large-scale text hierarchies." SIGIR, 2008.

Tzanetakis, G., and P. Cook. "Musical genre classification of audio signals." IEEE transactions on Speech and Audio Processing, 2007.

Gopal, Siddharth, et al. "Bayesian models for large-scale hierarchical classification." NIPS, 2012.

Xiao, Lin, Dengyong Zhou, and Mingrui Wu. "Hierarchical classification via orthogonal transfer." ICML, 2011.

Naik, Azad, and Huzefa Rangwala. "A ranking-based approach for hierarchical classification." DSAA, 2015.

Tsochantaridis, Ioannis, et al. "Large margin methods for structured and interdependent output variables." JMLR, 2005.

Liu, Tie-Yan, et al. "Support vector machines classification with a very large-scale taxonomy." SIGKDD, 2005.

Caruana, Rich. "Multitask learning." Machine learning, 1997.

Charuvaka, Anveshi, and Huzefa Rangwala. "Approximate block coordinate descent for large scale hierarchical classification." SAC, 2015.


References - IV

Dumais, Susan, and Hao Chen. "Hierarchical classification of Web content." SIGIR, 2000.

Mineiro, Paul, and Karampatziakis, Nikos. "A Hierarchical Spectral Method for Extreme Classification." eprint arXiv:1511.03260 (NIPS workshop), 2015.

Choromanska, Anna, et al. "Extreme Multi Class Classification." NIPS Workshop: eXtreme Classification, 2013.

McCallum, Andrew, et al. "Improving Text Classification by Shrinkage in a Hierarchy of Classes." ICML, 1998.

Babbar, Rohit, et al. "Maximum-margin framework for training data synchronization in large-scale hierarchical classification." NIP, 2013.

Choromanska, Anna E., and John Langford. "Logarithmic time online multiclass prediction." NIPS, 2015.

Prabhu, Yashoteja, and Manik Varma. "FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning." SIGKDD, 2014.

Bhatia, Kush, et al. "Sparse Local Embeddings for Extreme Multi-label Classification." NIPS, 2015.


References - V

Naik, A., and Rangwala, H. "Filter based taxonomy modification for improving hierarchical classification." http://arxiv.org/abs/1603.00772, 2016.

Peng, Hanchuan, Fuhui Long, and Chris Ding. "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy." PAMI, 2005.

Ding, Chris, and Hanchuan Peng. "Minimum redundancy feature selection from microarray gene expression data." Journal of bioinformatics and computational biology, 2005.


Acknowledgement

Presenters:

Huzefa Rangwala, Azad Naik

Slides available for download at:

http://cs.gmu.edu/~mlbio/kdd2017tutorial.html


Thank You!
