Post on 14-Jan-2016
transcript
Finding Local Correlations in High Dimensional Data
USTC Seminar
Xiang Zhang, Case Western Reserve University
Finding Latent Patterns in High Dimensional Data
• An important research problem with wide applications: biology (gene expression analysis, genotype-phenotype association studies), customer transactions, and so on.
• Common approaches: feature selection, feature transformation, subspace clustering.
Existing Approaches
• Feature selection: find a single representative subset of features that is most relevant for the data mining task at hand.
• Feature transformation: find a set of new (transformed) features that retain as much of the information in the original data as possible, e.g., Principal Component Analysis (PCA).
• Correlation clustering: find clusters of data points that may not exist in the axis-parallel subspaces but only in the projected subspaces.
Motivating Example
[figure: linearly correlated genes]
Question: how can these local linear correlations be found using existing methods?
Applying PCA – Correlated?
• PCA is an effective way to determine whether a set of features is strongly correlated.
• It is a global transformation applied to the entire dataset: a few eigenvectors describe most of the variance in the dataset, and only a small amount of variance is represented by the remaining eigenvectors; a small residual variance indicates strong correlation.
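To make this concrete, here is a minimal sketch (my own illustration, not from the slides) of the residual-variance test in Python with numpy; the function name residual_variance_ratio is mine:

```python
import numpy as np

def residual_variance_ratio(X, k):
    """Fraction of total variance falling on the k eigenvectors of
    X's covariance matrix that have the smallest eigenvalues."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
    return eigvals[:k].sum() / eigvals.sum()

# Toy data: the third feature is a noisy linear combination of the
# first two, so one eigenvalue of the covariance matrix is near zero.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
X = np.column_stack([a, b, a - b + 0.01 * rng.normal(size=500)])
print(residual_variance_ratio(X, k=1))  # small value => strongly correlated
```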
Applying PCA – Representation?
• The linear correlation is represented as the hyperplane that is orthogonal to the eigenvectors with the minimum variances.
[figure: embedded linear correlations, e.g., with coefficient vector [1, -1, 1], vs. the linear correlations reestablished by full-dimensional PCA]
Applying Bi-clustering or Correlation Clustering Methods
• Correlation clustering: no obvious clustering structure.
• Bi-clustering: no strong pair-wise correlations.
[figure: linearly correlated genes]
Revisiting Existing Work
• Feature selection finds only one representative subset of features.
• Feature transformation performs one and the same transformation for the entire dataset; it does not really eliminate the impact of any original attributes.
• Correlation clustering: the projected subspaces are usually found by applying a standard feature transformation method, such as PCA.
Local Linear Correlations – Formalization
• Idea: formalize local linear correlations as strongly correlated feature subsets.
• Determining whether a feature subset is correlated: small residual variance.
• The correlation may not be supported by all data points (noise, domain knowledge, ...); it should be supported by a large portion of the data points.
Problem Formalization
• Suppose that F (m × n) is a submatrix of the dataset D (M × N).
• Let {λ_1, λ_2, ..., λ_n} be the eigenvalues of the covariance matrix of F, arranged in ascending order.
• F is a strongly correlated feature subset if

  f(F, k) = (Σ_{i=1}^{k} λ_i) / (Σ_{j=1}^{n} λ_j) ≤ ε   (1)   and   m / M ≥ δ   (2)

where the numerator of (1) is the variance on the k eigenvectors having the smallest eigenvalues (the residual variance), the denominator is the total variance, m is the number of supporting data points, and M is the total number of data points.
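Conditions (1) and (2) translate almost directly into code. A minimal sketch, assuming F is an m × n numpy array holding only the supporting points (the names f_value and is_strongly_correlated are mine):

```python
import numpy as np

def f_value(F, k):
    """f(F, k) from condition (1): the residual variance ratio of the
    m x n submatrix F (rows are supporting data points)."""
    lam = np.linalg.eigvalsh(np.cov(F, rowvar=False))  # ascending eigenvalues
    return lam[:k].sum() / lam.sum()

def is_strongly_correlated(F, k, eps, M, delta):
    """Check conditions (1) and (2): f(F, k) <= eps and m / M >= delta."""
    return f_value(F, k) <= eps and F.shape[0] / M >= delta
```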
Problem Formalization
• Suppose that F (m × n) is a submatrix of the dataset D (M × N):

  f(F, k) = (Σ_{i=1}^{k} λ_i) / (Σ_{j=1}^{n} λ_j) ≤ ε

• The larger k and the smaller ε, the stronger the correlation; k and ε together control the strength of the correlation.
[figure: eigenvalues plotted against eigenvalue id, illustrating larger k / smaller ε]
Goal
• Goal: to find all strongly correlated feature subsets.
• Enumerate all sub-matrices? Not feasible (2^M × 2^N sub-matrices in total); an efficient algorithm is needed.
• Any property we can use? Monotonicity of the objective function.
Monotonicity
• Monotonic w.r.t. the feature subset: if a feature subset is strongly correlated, all of its supersets are also strongly correlated. This is derived from the Interlacing Eigenvalue Theorem: if {λ'_1 ≤ ... ≤ λ'_{n+1}} are the eigenvalues of the covariance matrix of a feature set and {λ_1 ≤ ... ≤ λ_n} those of a subset one feature smaller, then λ'_i ≤ λ_i ≤ λ'_{i+1} for every i.
• This allows us to focus on finding the smallest feature subsets that are strongly correlated.
• It enables an efficient algorithm – no exhaustive enumeration is needed.
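As a quick numerical sanity check (my own addition, not from the slides), the interlacing property is easy to verify with numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.cov(rng.normal(size=(200, 4)), rowvar=False)  # 4 x 4 covariance
sub = cov[:3, :3]                                      # principal submatrix

lam_big = np.linalg.eigvalsh(cov)   # ascending: lam'_1 <= ... <= lam'_4
lam_sub = np.linalg.eigvalsh(sub)   # ascending: lam_1  <= ... <= lam_3

# Interlacing: lam'_i <= lam_i <= lam'_{i+1} for every i.
print(np.all(lam_big[:3] <= lam_sub) and np.all(lam_sub <= lam_big[1:]))
```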
The CARE Algorithm
• Selecting the feature subsets: enumerate feature subsets from smaller size to larger size (DFS or BFS), as sketched below.
• If a feature subset is strongly correlated, then its supersets are pruned (by the monotonicity of the objective function).
• Further pruning is possible.
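A simplified sketch of this bottom-up search, reusing the f_value helper from the earlier sketch and checking condition (1) on all data points (the point-selection heuristics that handle partial support come next); this illustrates the pruning idea, not the paper's exact algorithm:

```python
from itertools import combinations

def care_feature_search(D, k, eps, max_size):
    """Enumerate feature subsets from smaller to larger size and report
    the minimal strongly correlated ones; supersets of a reported
    subset are pruned by monotonicity."""
    results = []
    for size in range(k + 1, max_size + 1):       # need more than k features
        for cols in combinations(range(D.shape[1]), size):
            # Prune: any superset of a found subset is also correlated.
            if any(set(r).issubset(cols) for r in results):
                continue
            if f_value(D[:, list(cols)], k) <= eps:
                results.append(cols)
    return results
```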
Monotonicity
• Non-monotonic w.r.t. the point subset: adding (or deleting) a point can increase or decrease the correlation among the features of a feature subset.
• Exhaustive enumeration is infeasible – an effective heuristic is needed.
The CARE Algorithm
• Selecting the point subsets: a feature subset may only be correlated on a subset of the data points.
• If a feature subset is not strongly correlated on all data points, how do we choose the proper point subset?
The CARE Algorithm
• Successive point deletion heuristic: a greedy algorithm – in each iteration, delete the point whose removal results in the maximum increase of the correlation among the subset of features (see the sketch below).
• Inefficient – the objective function must be evaluated for every data point in each iteration.
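A sketch of the greedy loop, reusing the hypothetical f_value helper from above; the stopping rule based on δ is my reading of the support condition (2):

```python
import numpy as np

def successive_deletion(F, k, eps, delta):
    """Greedy heuristic: repeatedly delete the point whose removal
    lowers f(F, k) the most, until condition (1) holds or the support
    would drop below delta * M."""
    M = F.shape[0]
    while f_value(F, k) > eps:
        if F.shape[0] - 1 < delta * M:
            return None  # cannot satisfy both conditions
        # Evaluate the objective once per point -- the expensive part.
        scores = [f_value(np.delete(F, i, axis=0), k) for i in range(F.shape[0])]
        F = np.delete(F, int(np.argmin(scores)), axis=0)
    return F
```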
The CARE Algorithm
• Distance-based point deletion heuristic (sketched below):
Let S1 be the subspace spanned by the k eigenvectors with the smallest eigenvalues.
Let S2 be the subspace spanned by the remaining n-k eigenvectors.
Intuition: try to reduce the variance in S1 as much as possible while retaining the variance in S2.
Directly delete the (1-δ)M points having large variance in S1 and small variance in S2 (refer to the paper for details).
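A rough sketch of the intuition (the exact scoring used in the paper differs in its details; refer to the paper):

```python
import numpy as np

def distance_based_deletion(F, k, delta):
    """Score each point by its variance contribution in S1 (the k
    smallest-eigenvalue directions) relative to S2, then keep only
    the delta * M lowest-scoring points."""
    centered = F - F.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(F, rowvar=False))  # ascending eigenvalues
    proj = centered @ vecs            # point coordinates in the eigenbasis
    var_s1 = (proj[:, :k] ** 2).sum(axis=1)   # contribution in S1
    var_s2 = (proj[:, k:] ** 2).sum(axis=1)   # contribution in S2
    score = var_s1 / (var_s2 + 1e-12)         # high => hurts the correlation
    keep = np.argsort(score)[: int(delta * F.shape[0])]
    return F[keep]
```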
The CARE Algorithm
[figure: a comparison between the two point deletion heuristics – successive vs. distance-based]
Experimental Results (Synthetic)
[figures: embedded vs. reestablished linear correlations under full-dimensional PCA and CARE; pair-wise correlations; embedded linear correlation (hyperplane representation)]
Experimental Results (Synthetic)
[figure: scalability evaluation]
Experimental Results (Wage)
A comparison between the correlation clustering method and CARE (dataset 534 × 11, http://lib.stat.cmu.edu/datasets/CPS_85_Wages):
• Found by both the correlation clustering method and CARE: [garbled correlation equation]
• Found by CARE only: [garbled correlation equation]
Experimental Results
Linearly correlated genes (hyperplane representations) (220 genes for 42 mouse strains):
• Nrg4: cell part; Myh7: cell part, intracellular part; Hist1h2bk: cell part, intracellular part; Arntl: cell part, intracellular part
• Nrg4: integral to membrane; Olfr281: integral to membrane; Slco1a1: integral to membrane; P196867: N/A
• Oazin: catalytic activity; Ctse: catalytic activity; Mgst3: catalytic activity
• Hspb2: cellular physiological process; 2810453L12Rik: cellular physiological process; 1010001D01Rik: cellular physiological process; P213651: N/A
• Ldb3: intracellular part; Sec61g: intracellular part; Exosc4: intracellular part; BC048403: N/A
• Mgst3: catalytic activity, intracellular part; Nr1d2: intracellular part, metal ion binding; Ctse: catalytic activity; Pgm3: metal ion binding
• Hspb2: cellular metabolism; Sec61b: cellular metabolism; Gucy2g: cellular metabolism; Sdh1: cellular metabolism
• Ptk6: membrane; Gucy2g: integral to membrane; Clec2g: integral to membrane; H2-Q2: integral to membrane
An example
[figures: result of applying PCA; result of applying ISOMAP]
Finding local correlations
• Dimension reduction performs a single feature transformation for the entire dataset.
• To find local correlations: first, identify the correlated feature subspaces; then, apply dimension reduction methods to uncover the low dimensional structure.
• Dimension reduction addresses the second aspect; our focus is the first aspect.
Finding local correlations
• Challenges:
Modeling subspace correlations – measurements of pair-wise correlations may not suffice.
Searching algorithm – exhaustive enumeration is too time consuming.
Modeling correlated subspaces
• Intrinsic dimensionality (ID): the minimum number of free variables required to define the data without any significant information loss.
• The correlation dimension is used as the ID estimator (a sketch follows).
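A minimal two-scale version of this estimator might look as follows (my own sketch; practical estimators fit the slope of log C(r) versus log r over many scales):

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, r1, r2):
    """Two-scale correlation-dimension estimate of intrinsic
    dimensionality: the slope of log C(r) vs. log r, where C(r) is
    the fraction of point pairs within distance r."""
    d = pdist(X)                          # all pairwise distances
    c1, c2 = np.mean(d < r1), np.mean(d < r2)
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))

# Points on a line embedded in 3-D: the estimated ID should be near 1.
t = np.random.default_rng(2).uniform(size=(1000, 1))
print(correlation_dimension(np.hstack([t, 2 * t, -t]), 0.1, 0.5))
```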
Modeling correlated subspaces
• Strong correlation: subspace V and feature f_a have a strong correlation if ID(V, f_a) = ID(V ∪ {f_a}) − ID(V) ≤ ε.
• Redundancy: feature f_{v_i} in subspace V is redundant if ID(V \ {f_{v_i}}, f_{v_i}) ≤ ε.
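Using the correlation_dimension sketch above, these two definitions can be transcribed as follows (the helper names are mine, not from the paper):

```python
def id_gain(X, V, fa, r1=0.1, r2=0.5):
    """ID(V, f_a): increase in estimated intrinsic dimensionality when
    column fa is added to the subspace given by the column list V."""
    return (correlation_dimension(X[:, V + [fa]], r1, r2)
            - correlation_dimension(X[:, V], r1, r2))

def strongly_correlated(X, V, fa, eps):
    """Subspace V and feature f_a have a strong correlation."""
    return id_gain(X, V, fa) <= eps

def redundant(X, V, fvi, eps):
    """Feature f_{v_i} is redundant in subspace V."""
    return id_gain(X, [f for f in V if f != fvi], fvi) <= eps
```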
Modeling correlated subspaces
• Reducible Subspace and Core Space: subspace Y is reducible if there exists a subspace V of Y such that
(1) ∀ f_a ∈ Y: ID(V, f_a) ≤ ε – all features in Y are strongly correlated with the core space V;
(2) ∄ U ⊆ Y with |U| < |V| such that U is non-redundant and satisfies (1) – the core space is the smallest non-redundant subspace of Y with which all other features in Y are strongly correlated.
V is the core space of Y, and Y is reducible to V.
Modeling correlated subspaces
• Maximum reducible subspace: Y is a reducible subspace and V is its core space; Y is maximum if it includes all features that are strongly correlated with the core space V.
• Goal: to find all maximum reducible subspaces in the full dimensional space.
Finding reducible subspaces
• General idea:
First, find the overall reducible subspace (OR), which is the union of all maximum reducible subspaces.
Then, identify the individual maximum reducible subspaces (IR) from OR.
Finding OR
• Property: suppose Y is a maximum reducible subspace with core space V; then for any subspace U of Y with |U| = |V|, U is also a core space of Y.
• Let RF_{f_a} be the remaining features in the dataset after deleting f_a. Then we have

  OR = { f_a | ID(RF_{f_a}, f_a) ≤ ε }

• A linear scan of all the features in the dataset can find OR (see the sketch below).
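A sketch of the linear scan, reusing the hypothetical id_gain helper from above:

```python
def find_OR(X, eps):
    """Linear scan for the overall reducible subspace: f_a belongs to
    OR if it is strongly correlated with the remaining features
    RF_{f_a}, i.e., removing it barely changes the estimated ID."""
    OR = []
    for fa in range(X.shape[1]):
        rest = [f for f in range(X.shape[1]) if f != fa]
        if id_gain(X, rest, fa) <= eps:
            OR.append(fa)
    return OR
```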
Finding Individual RS
• Assumption: the maximum reducible subspaces are disjoint.
• Method: enumerate candidate core spaces from size 1 to |OR|; a candidate core space is a subset of OR. Find the features that are strongly correlated with the candidate core space and remove them from OR (sketched below).
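A rough sketch of this peeling procedure under the disjointness assumption, reusing the strongly_correlated helper from above (the paper's pruning and sampling details are omitted):

```python
from itertools import combinations

def find_individual_RS(X, OR, eps):
    """Try candidate core spaces of growing size; whenever a candidate
    attracts strongly correlated features, report the group and remove
    all of its features from OR before continuing."""
    remaining, results, size = list(OR), [], 1
    while size <= len(remaining):
        progress = False
        for core in combinations(remaining, size):
            members = [f for f in remaining if f not in core
                       and strongly_correlated(X, list(core), f, eps)]
            if members:
                results.append((list(core), members))
                remaining = [f for f in remaining
                             if f not in core and f not in members]
                progress = True
                break
        if not progress:
            size += 1
    return results
```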
Finding Individual RS
• Determining whether a feature is strongly correlated with a candidate core space:
ID-based method: quadratic in the number of data points.
Sampling-based method: sample some data points and examine the number of data points distributed around them (see the paper for details).
Experimental result
A synthetic dataset consisting of 50 features with 3 reducible subspaces (RS)
Experimental result
Efficiency evaluation on finding OR
Experimental result
Sampling-based vs. ID-based method for finding individual RS
Experimental result
Reducible subspaces in the NBA dataset (from the ESPN website): 28 features for 200 players
Thank You!