Searching For a Few Good Features


Bülent Yener, Rensselaer Polytechnic Institute, Department of Computer Science

Pathology Informatics 2010

The Hard Problem: Bad or just Ugly??

One of the main challenges is discriminating damaged (diseased but not cancerous) tissue from cancerous tissue. Healthy tissue, by contrast, is comparatively easy to tell apart.

We need a few good features!

Brain Tissue - Diffused

Good: healthy

Bad: glioma Ugly: inflammation

Gland based tissue: Prostate

Good (Healthy) Ugly (PIN)

Bad (cancerous)

Gland based tissue: Breast

Ugly (in Situ)

Bad (invasive)

Bone Tissue Images

Healthy (good)

Fracture (ugly)

Osteosarcoma (bad)

Two related problems

• Feature Extraction
– Identify and compute attributes that will characterize the information encoded in the histology images
– Need to quantify!

• Feature Selection
– Identify an optimal subset.

Feature Selection

• Select a subset of the original features
– reduces the number of features (dimensionality reduction)
– removes irrelevant or redundant data (noise reduction)
  • speeding up a data mining algorithm
  • improving prediction accuracy

• It is a hard optimization problem!
• Optimal feature selection requires an exhaustive search of all possible subsets of features of the chosen cardinality.
– Too expensive

• In practice: ad hoc heuristics

Greedy Algorithms

• A local optimum is searched
– evaluate a candidate subset of features
– modify the subset and evaluate it
– if the new subset is an improvement over the old, then take it as current
– else
  • if the algorithm is deterministic, reject the modification (e.g. hill climbing)
  • else accept it with a probability (e.g. simulated annealing).
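The local search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `local_search`, the one-feature "flip" move, the `temperature` parameter, and the toy scoring function are all hypothetical and do not come from the slides.

```python
import math
import random

def local_search(n_features, score, steps=200, temperature=0.0, seed=0):
    """Search feature subsets by flipping one feature at a time.

    temperature == 0: worse subsets are always rejected (hill climbing).
    temperature > 0:  worse subsets are accepted with probability
                      exp(delta / temperature) (simulated-annealing style).
    """
    rng = random.Random(seed)
    current = frozenset()
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        f = rng.randrange(n_features)
        candidate = current ^ {f}  # add f if absent, drop it if present
        cand_score = score(candidate)
        delta = cand_score - current_score
        if delta > 0:
            accept = True
        elif temperature > 0:
            accept = rng.random() < math.exp(delta / temperature)
        else:
            accept = False
        if accept:
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

# Hypothetical toy score: features 1 and 3 help, every other feature hurts.
def toy_score(subset):
    return 2.0 * len(subset & {1, 3}) - 0.5 * len(subset - {1, 3})

best, best_score = local_search(6, toy_score)
```

In a real setting `score` would be something like a cross-validated classifier accuracy, which is what makes each evaluation expensive.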

Methods (partial list)

• Exhaustive search: evaluate all possible subsets.
• Branch and Bound Search: enumerate a fraction of the subsets; can find the optimum, but the worst case is exponential.
• Best features (isolated): evaluate all m features in isolation; no guarantee of an optimum.
• Sequential Forward Selection (SFS): start with the best feature and add one at a time; no backtracking.
• Sequential Backward Selection (SBS): start with all d features and eliminate one at a time; more expensive than SFS and no backtracking either.
• Variants of SFS and SBS: start with the k best features and then delete r of them, etc.
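As a rough illustration of why "no backtracking" matters, the sketch below runs Sequential Forward Selection next to an exhaustive search on a made-up score table. The feature names `a`, `b`, `c` and all score values are hypothetical, chosen so that the best pair does not contain the best single feature.

```python
from itertools import combinations

def sequential_forward_selection(features, score, k):
    """Greedily add the feature that most improves the score; never remove one."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best_f = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Hypothetical subset scores (e.g. validation accuracies).
TOY = {
    frozenset("a"): 0.60, frozenset("b"): 0.50, frozenset("c"): 0.70,
    frozenset("ab"): 0.95, frozenset("ac"): 0.80, frozenset("bc"): 0.75,
}

def toy_score(subset):
    return TOY[frozenset(subset)]

sfs_pair = sequential_forward_selection("abc", toy_score, 2)
best_pair = max(combinations("abc", 2), key=toy_score)
```

Here SFS commits to `c` first and ends with `c` and `a` (score 0.80), while the exhaustive search finds `a` and `b` (score 0.95).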

Types of Algorithms

• Supervised, unsupervised, and semi-supervised (embedded) feature selection algorithms
– e.g. PCA is an unsupervised feature extraction method: it finds a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data.
  • But these features may not be useful for discriminating between data in different classes.

• Wrappers (wrap the selection process around the learning algorithm), Filters (examine intrinsic properties of the data)

• Feature selection algorithms with filter and embedded models may return either a subset of selected features or the weights (measuring feature relevance) of all features.

Relevance and redundancy

• A feature is statistically relevant if its removal from a feature set will reduce the prediction power.

• A feature may be redundant due to the existence of other relevant features, which provide similar prediction power as this feature.

Filter Model

• Filtering is independent of the induction algorithm
• It is a preprocessing step
• Example: Relief method

All d features → Subset selection → m < d features → Induction algorithm

Induction algorithms: algorithms inducing concept descriptions from examples (i.e. learning algorithms)

Relief Method

• It assigns relevance to features based on their ability to disambiguate similar samples
– Similarity is defined by proximity in feature space.
– Relevant features accumulate high positive weights, while irrelevant features retain near-zero weights.
– For each target sample,
  • find the nearest sample in feature space of the same category, the “hit” sample.
  • find the nearest sample of the other category, the “miss” sample.
– The relevance of feature f near the target sample is measured from the differences in f between the target and its hit and miss samples.

Source: K. Kira and L.A. Rendell
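A minimal sketch of a Relief-style weight computation follows. It assumes two classes, features already scaled to [0, 1], and an update that adds |x_f − miss_f| − |x_f − hit_f| averaged over all samples. That absolute-difference update is one common variant and an assumption here; the exact formula shown on the slide is not reproduced in this transcript. The data set is made up.

```python
def relief_weights(X, y):
    """Relief-style feature weights for a two-class data set.

    X: list of feature vectors with values scaled to [0, 1]
    y: list of class labels, one per row of X
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: sq_dist(X[i], X[j]))    # nearest same-class sample
        miss = min(other, key=lambda j: sq_dist(X[i], X[j]))  # nearest other-class sample
        for f in range(d):
            # Reward features that differ on the miss, penalize those that differ on the hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.8, 0.3], [0.9, 0.8], [1.0, 0.5]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
```

On this toy data the separating feature receives a clearly positive weight, while the uninformative one ends up at or below zero.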

Other Filter Algorithms

• Laplacian Score: focuses on the local structure of the data space; computes a score for each feature that reflects its locality-preserving power.
• SPEC: similar, but uses the normalized Laplacian matrix.
• Fisher Score: assigns the highest score to the feature on which the data points of different classes are far from each other.
• Chi-square Score: tests whether the class label is independent of a particular feature.
• Minimum-Redundancy-Maximum-Relevance (mRMR): selects features that are mutually far away from each other while still having "high" correlation with the classification variable (an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable).
• Kruskal-Wallis: non-parametric method based on ranks for comparing the population medians among groups.
• Information Gain: measures the dependence between the feature and the class label.

Source: Zhao et al http://featureselection.asu.edu
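To make one of the filter scores above concrete, here is a small sketch of a per-feature Fisher score, taken here as the between-class scatter divided by the within-class scatter of a single feature. This is one common formulation and an assumption on my part; the implementation at the cited site may differ in normalization. The sample values are made up.

```python
def fisher_score(column, labels):
    """Between-class scatter over within-class scatter for one feature."""
    mean_all = sum(column) / len(column)
    between = 0.0
    within = 0.0
    for c in sorted(set(labels)):
        values = [v for v, label in zip(column, labels) if label == c]
        mean_c = sum(values) / len(values)
        between += len(values) * (mean_c - mean_all) ** 2
        within += sum((v - mean_c) ** 2 for v in values)
    return between / within if within > 0 else float("inf")

labels = [0, 0, 0, 1, 1, 1]
separating = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]     # class means far apart
uninformative = [0.2, 0.9, 0.5, 0.3, 0.8, 0.5]  # class means nearly equal

score_sep = fisher_score(separating, labels)
score_unf = fisher_score(uninformative, labels)
```

A filter of this kind looks at one feature at a time, so it is cheap but cannot detect redundancy between features.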

Wrapper Model

BLogReg: Gavin C. Cawley and Nicola L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348-2355, 2006.

CFS: Mark A. Hall and Lloyd A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, 1999.

Chi-Square: H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In J.F. Vassilopoulos, editor, Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence, November 5-8, 1995, pages 388-391, Herndon, Virginia, 1995. IEEE Computer Society.

FCBF: H. Liu and L. Yu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 856-863, Washington, D.C., 2003.

Fisher Score: R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2nd edition, 2001.

Information Gain: T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.

Kruskal-Wallis: L. J. Wei. Asymptotic conservativeness and efficiency of Kruskal-Wallis test for k dependent samples. Journal of the American Statistical Association, 76(376):1006-1009, December 1981.

mRMR: H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.

Relief: K. Kira and L.A. Rendell. A practical approach to feature selection. In Sleeman and P. Edwards, editors, Proceedings of the Ninth International Conference on Machine Learning (ICML-92), pages 249-256. Morgan Kaufmann, 1992.

SBMLR: Gavin C. Cawley, Nicola L. C. Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. In NIPS, pages 209-216, 2006.

Spectrum: Huan Liu and Zheng Zhao. Spectral feature selection for supervised and unsupervised learning. Proceedings of the 24th International Conference on Machine Learning, 2007.

Feature Space over Histology Images is Large

• Texture based
• Intensity based
• Graph theoretical
– Voronoi graphs
– Cell-graphs

Voronoi Graphs and Their Features

• Minimum spanning tree and its properties

Cell-Graphs

Represent the tissue as a graph:
– A node of the graph represents a cell or cell cluster
– An edge of the graph represents a relation between a pair of nodes (e.g., spatial, ECM)
– generalization of Voronoi graphs

(a) Healthy (b) Damaged (c) Cancerous

What do we gain from Cell-graphs?

• Mathematical representation
– We can apply operations to them using
  • (multi)linear algebra
  • algorithms

• We can quantify the structural properties with mathematically well-defined graph metrics.

• Subgraph mining
– Descriptor subgraphs
– Subgraph search in a large graph
– Subgraph kernels

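Adjacency matrix:

$$A(u,v) = \begin{cases} 1 & \text{if } u \text{ and } v \text{ are adjacent,} \\ 0 & \text{otherwise.} \end{cases}$$

Normalized Laplacian:

$$L(u,v) = \begin{cases} 1 & \text{if } u = v \text{ and } d_v \neq 0, \\ -\dfrac{1}{\sqrt{d_u d_v}} & \text{if } u \text{ and } v \text{ are adjacent,} \\ 0 & \text{otherwise,} \end{cases}$$

where $d_u$ is the degree of node $u$.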
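The two matrices can be built directly from an edge list. The sketch below assumes an undirected, unweighted graph with nodes numbered 0..n-1 and uses the usual convention that a normalized Laplacian diagonal entry is 1 for a node with nonzero degree and 0 for an isolated node; the function names and the small example graph are illustrative only.

```python
import math

def adjacency_matrix(n, edges):
    """A[u][v] = 1 if u and v are adjacent, 0 otherwise (undirected graph)."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A

def normalized_laplacian(A):
    """L[u][v] = 1 on the diagonal (nonzero degree), -1/sqrt(d_u d_v) if adjacent, else 0."""
    n = len(A)
    degree = [sum(row) for row in A]
    L = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v and degree[v] != 0:
                L[u][v] = 1.0
            elif A[u][v]:
                L[u][v] = -1.0 / math.sqrt(degree[u] * degree[v])
    return L

# Path 0 - 1 - 2 plus an isolated node 3.
A = adjacency_matrix(4, [(0, 1), (1, 2)])
L = normalized_laplacian(A)
```

The spectral features listed later (spectral radius, trace, eigenvalue counts) would be computed from the eigenvalues of these matrices, which needs a numerical library and is left out here.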

Cell-graph Features

• Local: cell-level
– Graph theoretical: e.g., degree, clustering coefficient
– Morphological: e.g., shape

• Global: tissue-level
– Graph theoretical
– Spectral

Feature | Description

# of Nodes | Number of cells.

# of Edges | Number of links between cells.

Average Degree | Average number of “neighboring” cells computed over all the nodes in a cell-graph.

Giant Connected Ratio | (Number of nodes in the largest connected component) / (total number of nodes in the graph).

Clustering Coefficient | $C_i = 2E_i / (k(k-1))$, where k is the number of neighbors of node i and $E_i$ is the number of existing links among its neighbors. We exclude the nodes with degree 1 (Dorogovtsev and Mendes, 2002).

% of Isolated Points | Percentage of nodes that have no edges incident to them.

% of End Points | Percentage of nodes that have exactly one edge incident to them.

# of Central Points | A node i is a central point of a graph if its eccentricity equals the minimum eccentricity (i.e., the graph radius). The set of all central points is called the graph center; the cardinality of this set is the definition of this metric.

Eccentricity, Closeness | Given the shortest path lengths between a node i and all of the nodes reachable from it, the eccentricity and the closeness of node i are defined as the maximum and the average of these shortest path lengths, respectively.

Spectral Radius | Maximum absolute value of the eigenvalues in the spectrum of a graph, which is the set of graph eigenvalues.

2nd Eigenvalue | Second largest eigenvalue in the graph spectrum.

Eigen Exponent | The slope of the sorted eigenvalues as a function of their orders in log-log scale.

Trace | Sum of the eigenvalues.

Triangles | Clique of 3 nodes.

Cliques | A (sub)graph such that every pair of nodes is connected by a distinct edge.

Subgraph Density | A bound on the clustering coefficient of a subgraph (e.g., at least 0.9).

Bipartite Cliques | A complete bipartite graph: all possible edges are present.
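A few of the global metrics in this table can be computed with plain breadth-first search. The sketch below covers only average degree, giant connected ratio, average clustering coefficient, and the percentages of isolated and end points. It assumes an undirected graph given as an edge list, and it skips nodes with fewer than two neighbors when averaging the clustering coefficient (degree 0 is excluded as well as degree 1, to avoid dividing by zero). The function and key names are hypothetical.

```python
from collections import deque

def build_neighbors(n, edges):
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def reachable(nbrs, src):
    """Set of nodes reachable from src (breadth-first search)."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def cell_graph_features(n, edges):
    nbrs = build_neighbors(n, edges)
    degrees = [len(s) for s in nbrs]
    coeffs = []
    for i in range(n):
        k = degrees[i]
        if k < 2:
            continue  # clustering coefficient undefined for degree 0 or 1
        # E_i: links that exist among the neighbors of i
        links = sum(1 for u in nbrs[i] for v in nbrs[i] if u < v and v in nbrs[u])
        coeffs.append(2 * links / (k * (k - 1)))
    return {
        "average_degree": sum(degrees) / n,
        "giant_connected_ratio": max(len(reachable(nbrs, i)) for i in range(n)) / n,
        "clustering_coefficient": sum(coeffs) / len(coeffs) if coeffs else 0.0,
        "pct_isolated_points": 100 * degrees.count(0) / n,
        "pct_end_points": 100 * degrees.count(1) / n,
    }

# Triangle 0-1-2, a pendant node 3 attached to node 2, and an isolated node 4.
feats = cell_graph_features(5, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

For this small graph the average degree is 1.6, the giant connected ratio is 0.8, and one node in five is isolated.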

Rich Set of Features for Description and Classification

Cell-graph Feature Selection

• Pairwise correlation of features

Goal: to find a set of features which are pairwise independent.

• Discriminative power

Goal: to find a smaller subset of features which is as expressive as the full feature set.

Pairwise Correlation Graph

• The correlations among the graph features themselves can be represented as a correlation graph.

• The correlation graph can be obtained by the procedure below.
– Calculate the n × n correlation matrix for the n features and obtain the correlation coefficients (n = 20 in this case).
– Create a node for each feature and place the nodes on a circle.
– Set a threshold for correlation and establish an edge between two feature nodes if |correlation coefficient| ≥ threshold (threshold = 0.9 in this case).
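The correlation-graph procedure can be sketched as follows, assuming Pearson correlation (the slides do not say which coefficient is used) and feature columns that are not constant. The feature names and sample values are made up for illustration.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither column is constant

def correlation_graph(columns, threshold=0.9):
    """columns: dict mapping feature name -> list of values over the samples."""
    names = list(columns)
    n = len(names)
    # One node per feature, placed evenly on a unit circle for drawing.
    positions = {
        name: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, name in enumerate(names)
    }
    # An edge wherever |correlation coefficient| >= threshold.
    edges = [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]
    return positions, edges

# Hypothetical feature values measured on five tissue samples.
columns = {
    "num_nodes": [10, 20, 30, 40, 50],
    "num_edges": [18, 41, 62, 79, 102],
    "pct_end_points": [5, 1, 4, 2, 3],
}
positions, edges = correlation_graph(columns)
```

With these values only the node-count and edge-count features are linked, since their correlation is close to 1 while the end-point percentage correlates at about -0.3 with both.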

Correlation Graphs for Healthy Tissue: Breast, Brain

Correlation Graphs for Cancerous Tissue: Breast, Brain

Observations on Correlation Graphs

• The correlation graphs differ greatly depending on tissue type and (dys)functional status.

• The complexity of the correlation graph (number of edges) depends on the tissue type and tissue status.
– Some features in some cases show cluster structures (e.g., node number, edge number, and average degree in healthy breast tissue),
– but a cluster structure may not be present in all cases (e.g., cancerous brain tissue).

• The features are highly correlated.

Interpretation

• The strong correlation means a high dependency between the features, which leads to a complex joint probability density function. Any attempt at probabilistic or statistical modeling should account for this complexity.

• An uncorrelated feature is not necessarily a distinguishing feature. It might not be discriminative for classification.

• The high correlation may indicate that a smaller subset of features might be enough to discriminate the classes, but not always.

Feature Selection: good, bad, and ugly

Breast – Average Degree Brain – Average Degree

Feature Selection - cont

Breast – End Point Percentage Brain – End Point Percentage

Feature Selection

Need a few good features!

Two-phase approach:
– Find the best classifier (MLP)
– Determine the features

Feature Selection

• The data is not linearly separable. Also, the features, as expected, show different distributions in each tissue type.

• 10-fold cross-validation results (accuracy percentages) for breast tissue are obtained with all 20 existing features, to see which classifier is most successful at classifying the data using cell-graph features. The classifiers are
– Adaboost (30 C4.5 trees),
– k-nn (k = 5),
– MLP (1 hidden layer, 12 hidden units, backpropagation).

• These classifiers are used since they are good at separating non-linearly distributed data and they come from different families of classification algorithms.

Feature Selection – next step

• The classification problem is reduced to 2-class problems (healthy vs. cancerous, healthy vs. damaged, damaged vs. cancerous).

• Number of edges and number of nodes are excluded. This exclusion also decreases the runtime of the selection.

Details

• An exhaustive search over 18 features is done using MLP. Since MLP gave the highest accuracy with all features, it is intuitively expected to show higher accuracy than the other classifiers during subset selection.

• The procedure is described below.
– Start with an empty selected feature subset with 0 accuracy percentage. (seq. forward selection alg).
– Repeat the procedure below for all possible feature subsets (2^18).
  • Train the classifier and validate its accuracy with 10-fold cross-validation.
  • If the average 10-fold CV accuracy percentage of the current subset is higher than that of the selected feature subset, assign the current subset as the selected feature subset.
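The exhaustive wrapper loop can be sketched as below. To keep the example self-contained it swaps in a nearest-centroid classifier for the MLP and uses 3 folds on six made-up samples instead of 10-fold cross-validation on real data; those substitutions, the function names, and the data are all assumptions for illustration. The structure of the loop (score every non-empty subset, keep the best) is the part that mirrors the procedure.

```python
from itertools import combinations

def nearest_centroid_accuracy(train, test, subset):
    """Stand-in classifier (assumption): the slides use an MLP here."""
    sums, counts = {}, {}
    for x, label in train:
        acc = sums.setdefault(label, [0.0] * len(subset))
        for j, f in enumerate(subset):
            acc[j] += x[f]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    correct = 0
    for x, label in test:
        pred = min(
            centroids,
            key=lambda c: sum((x[f] - m) ** 2 for f, m in zip(subset, centroids[c])),
        )
        correct += pred == label
    return correct / len(test)

def cv_accuracy(data, subset, folds):
    """Average accuracy over interleaved folds (fold k tests rows k, k+folds, ...)."""
    scores = []
    for k in range(folds):
        test = data[k::folds]
        train = [row for i, row in enumerate(data) if i % folds != k]
        scores.append(nearest_centroid_accuracy(train, test, subset))
    return sum(scores) / folds

def exhaustive_selection(data, n_features, folds=3):
    best_subset, best_acc = (), 0.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            acc = cv_accuracy(data, subset, folds)
            if acc > best_acc:  # strictly better, so the smallest best subset wins
                best_subset, best_acc = subset, acc
    return best_subset, best_acc

# Hypothetical samples: (feature vector, class label); only feature 0 is informative.
data = [
    ([0.0, 0.2, 0.7], 0), ([0.1, 0.9, 0.1], 0), ([0.2, 0.5, 0.4], 0),
    ([0.8, 0.3, 0.6], 1), ([0.9, 0.8, 0.2], 1), ([1.0, 0.5, 0.5], 1),
]
best_subset, best_acc = exhaustive_selection(data, 3)
```

With d features the loop evaluates 2^d − 1 subsets, each requiring a full cross-validation run, which is why the 18-feature search is costly and larger feature sets call for heuristics.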

MLP + Exhaustive Search Results on Breast Cancer

• The results for breast data are given below (no normalization).

Comparison | Features Selected | Accuracy Percentage

Healthy vs. Cancer | Clustering Coefficients; Max and Min Eccentricity; Perc. of Isolated Points; Perc. of End Points; Perc. of Central Points | 84.71 ± 2.7

Healthy vs. Damaged | Average Degree; Excluding Clustering Coeff.; Max Eccentricity 90%; Effective Hop Diameter; Perc. of Isolated Points | 80.52 ± 4.5

Damaged vs. Cancer | 15 features out of 18 | 70.59 ± 3.64

Cell-graph Feature Selection with Relief Method

1. Average degree
2. Average Clustering coefficient
3. Average eccentricity
4. Maximum eccentricity
5. Minimum eccentricity
6. Average effective eccentricity
7. Maximum effective eccentricity
8. Minimum effective eccentricity
9. Average path length (closeness)
10. Giant connected ratio
11. Percentage of isolated points
12. Percentage of end points
13. Number of central points
14. Percentage of central points
15. Number of nodes
16. Number of edges
17. Spectral radius
18. Second largest eigenvalue
19. Trace
20. Energy
21. Number of eigenvalues

Relief based Cell-graph Feature Selection Result

Selected Features for Different Normalization

Modeling Branching Morphogenesis

Problem Definition

• Treated with ROCK (Rho-associated coiled-coil kinase), which regulates branching morphogenesis

• Untreated

• Can we quantify the organizing principles and distinguish between different states of branching process?

An Even Richer Set of Features

1 Average_degree; 2 C; 3 C2; 4 D; 5 Average_eccentricity; 6 Maximum_eccentricity_(diameter); 7 Minimum_eccentricity_(radius); 8 Average_eccentricity_90; 9 Maximum_eccentricity_90; 10 Minimum_eccentricity_90; 11 Average_path_length_(closeness); 12 Giant_connected_ratio; 13 Number_of_Connected_Components; 14 Percentage_of_isolated_points; 15 Percentage_of_end_points; 16 Number_of_central_points; 17 Percentage_of_central_points; 18 Number_of_nodes; 19 Number_of_edges

20 elongation_; 21 area; 22 orientation; 23 eccentricity; 24 perimeter; 25 circularity_; 26 solidity

27 largest_eigen_adjacency_; 28 second_largest_adjacency; 29 trace_adjacency_; 30 energy_adjacency; 31 #of_zeros_normalized_laplacian; 32 slope_0-1_normalized_laplacian; 33 #of_ones_normalized_laplacian; 34 slope_1-2_normalized_laplacian; 35 #of_twos_normalized_laplacian; 36 trace_laplacian; 37 energy_laplacian

38 degree_cluster_1; 39 degree_cluster_2; 40 degree_cluster_3; 41 clustering_coefficient_C_cluster_1; 42 clustering_coefficient_C_cluster_2; 43 clustering_coefficient_C_cluster_3; 44 clustering_coefficient_D_cluster_1; 45 clustering_coefficient_D_cluster_2; 46 clustering_coefficient_D_cluster_3; 47 eccentricity_cluster_1; 48 eccentricity_cluster_2; 49 eccentricity_cluster_3; 50 effective_eccentricity_cluster_1_; 51 effective_eccentricity_cluster_2; 52 effective_eccentricity_cluster_3; 53 closeness_cluster_1; 54 closeness_cluster_2; 55 closeness_cluster_3

Classifier Comparison

• Since MLP has a higher overall accuracy, it is used in later studies in feature selection.

Accuracy | Adaboost | k-nn | MLP

Overall (%) | 67.3 ± 3.27 | 68.13 ± 1.29 | 73.24 ± 1.94

Inflamed (%) | 57.96 ± 9.53 | 65.93 ± 2.65 | 54.07 ± 6.28

Healthy (%) | 73.82 ± 5.23 | 75 ± 1.70 | 78.38 ± 2.78

Cancerous (%) | 67.82 ± 7.5 | 65.21 ± 1.78 | 78.99 ± 2.63

Epithelial vs Mesenchymal comparison in treated tissue samples

Feature Selection Algorithm | Selected Features | Best CV rate

SVM, No Feature Selection | all | 100

Fscore selection | 7,10,26,44,45 | 100

CfsSubsetEval | 7,10,14,15,16,21,25,26,43,44,45 | 100

ConsistencySubsetEval | 10,14 | 95.24

ReliefFAttributeEval | 26,7 | 100

SymmetricalUncertAttributeEval | 14,44 | 100

SVD Based | 12,20,22,23,26,41,42,44,49,52,54,55 | 95.238

Epithelial vs Mesenchymal comparison in untreated tissue samples

Feature Selection Algorithm | Selected Features | Best CV rate

SVM, No Feature Selection | all | 97.619

Fscore selection | 7,26,35 | 97.619

CfsSubsetEval | 6,9,14,15,25,26,43 | 95.2381

ConsistencySubsetEval | 14,21,25 | 97.619

ReliefFAttributeEval | 7,26,35 | 97.619

SymmetricalUncertAttributeEval | 6,7,9,15,25,26,44 | 97.619

SVD Based | 2,4,12,20,22,26,27,28,41,49,52,55 | 88.0952

Treated mesenchymal vs untreated mesenchymal comparison

Feature Selection Algorithm | Selected Features | Best CV rate

Fscore selection | 3,4,21,24,26,27,39,45 | 80.95

CfsSubsetEval | 24 | 76.1905

ConsistencySubsetEval | 24 | 76.1905

ReliefFAttributeEval | 21,24,3,39,45,26,2,27,4,35,33,28,42 | 90.4762

SymmetricalUncertAttributeEval | 24,18,20,19,16,15,17,25,27,26,21 | 76.1905

SVD Based | 12,20,23,24,26,41,44,45,49,52,55 | 69.0476

Treated epithelial vs untreated epithelial comparison

Feature Selection Algorithm | Selected Features | Best CV rate

Fscore selection | 3 | 88.09

CfsSubsetEval | 3,44,45 | 88.09

ConsistencySubsetEval | 3,44 | 85.71

ReliefFAttributeEval | 3,44,45,46,49 | 85.71

SymmetricalUncertAttributeEval | 3,44,45 | 88.09

SVD Based | 1,2,12,20,41,45,46,49,52,53,55 | 76.1905

Concluding Remarks

• Feature extraction and selection are strongly coupled for accuracy
– always room for new features

• Feature selection performance depends on the induction algorithm (i.e., learning algorithm)

• Quantifiable features are not always interpretable
– mapping the features to biology or pathology is a crucial link!

Thank you!
