
Dimension Reduction: What? Why? and How?

DIMENSION REDUCTION
Kazi Toufiq Wadud
[email protected]
Twitter: @KaziToufiqWadud
Page 2: Dimension Reduction: What? Why? and How?

WHAT IS DIMENSION REDUCTION?

The process of converting a data set with a large number of dimensions into a data set with fewer dimensions, while ensuring that it conveys similar information concisely.

Page 3: Dimension Reduction: What? Why? and How?

DIMENSION REDUCTION: WHY?

The curse of dimensionality. What is the curse?

Page 4: Dimension Reduction: What? Why? and How?

UNDERSTANDING THE CURSE

Consider a 3-class pattern recognition problem.
1) Start with 1 dimension/feature
2) Divide the feature space into uniform bins
3) Compute the ratio of examples for each class in each bin
4) For a new example, find its bin and choose the predominant class in that bin

In this case, we decide to start with one feature and divide the real line into 3 bins, but there is overlap between the classes, so let's add a 2nd feature to improve discrimination.

Page 5: Dimension Reduction: What? Why? and How?

UNDERSTANDING THE CURSE

2 dimensions: adding a second feature increases the number of bins from 3 to 3^2 = 9. QUESTION: What should we keep constant?

The total number of examples? This results in a 2D scatter plot with reduced overlap but higher sparsity.

To address the sparsity, what about keeping the density of examples per bin constant (say 3)? This increases the number of examples from 9 to at least 27 (9 bins x 3 = 27).

Page 6: Dimension Reduction: What? Why? and How?

UNDERSTANDING THE CURSE

Moving to 3 features, the number of bins grows to 3^3 = 27. For the same number of examples, the 3D scatter plot is almost empty. Constant density: to keep the initial density of 3 examples per bin, 27 x 3 = 81 examples are required.
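The arithmetic above can be summarised in a few lines of R (a minimal sketch; the variable names are ours, not from the slides):

# Bins and examples needed to keep a fixed density of examples per bin
bins_per_feature <- 3
density_per_bin  <- 3
n_features       <- 1:3

bins     <- bins_per_feature ^ n_features   # 3, 9, 27
examples <- bins * density_per_bin          # 9, 27, 81
data.frame(n_features, bins, examples)      # exponential growth with dimensionality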

Page 7: Dimension Reduction: What? Why? and How?

IMPLICATIONS OF THE CURSE OF DIMENSIONALITY

The number of examples required to accurately estimate a function grows exponentially with dimensionality. In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of a classifier will degrade rather than improve.

In most cases, the information that is lost by discarding some features is compensated for by a more accurate mapping in the lower-dimensional space.

Page 8: Dimension Reduction: What? Why? and How?

MULTICOLLINEARITY

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable.

Page 9: Dimension Reduction: What? Why? and How?

UNDERSTANDING MULTI-COLLINEARITY

Let's look at the image shown below. It shows 2 dimensions, x1 and x2, which are, let us say, measurements of several objects in centimetres (x1) and in inches (x2). If you were to use both of these dimensions in machine learning, they would convey similar information and introduce a lot of noise into the system, so we are better off using just one dimension. Here we have converted the data from 2D (x1 and x2) to 1D (z1), which has made the data relatively easier to explain.

Page 10: Dimension Reduction: What? Why? and How?

BENEFITS OF APPLYING DIMENSION REDUCTION

Data compression; reduction of storage space

Less computing; Faster processing

Removal of multi-collinearity (redundant features) to reduce noise for better model fit

Better visualization and interpretation

Page 11: Dimension Reduction: What? Why? and How?

APPROACHES FOR DIMENSION REDUCTION

Feature selection: choosing a subset of all the features

Feature extraction: creating new features by combining existing ones

In either case, the goal is to find a low-dimensional representation of the data that preserves (most of) the information or structure in the data

Page 12: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: MISSING VALUES

While exploring data, if we encounter missing values, what do we do? Our first step should be to identify the reason, and then impute the missing values or drop the variables using appropriate methods.

But what if we have too many missing values? Should we impute them or drop the variables?

Maybe dropping is a good option, because a variable with that many missing values will not carry much detail about the data set and will not help in improving the power of the model.

The next question: is there a threshold of missing values for dropping a variable? It varies from case to case. If the information contained in the variable is not critical, you can drop the variable if it has more than ~40-50% missing values.

Page 13: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: MISSING VALUES

R code: summary(data)
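A minimal sketch of this check-and-drop step, assuming the data frame is called data and using the ~40% rule of thumb above:

# Proportion of missing values per variable
miss_frac <- colMeans(is.na(data))
sort(miss_frac, decreasing = TRUE)

# Drop variables with more than ~40% missing values
data_reduced <- data[, miss_frac <= 0.4]
dim(data_reduced)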

Page 14: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: LOW VARIANCE

Let's think of a scenario where we have a constant variable (all observations have the same value, say 5) in our data set. Do you think it can improve the power of the model?

Of course NOT, because it has zero variance. When the number of dimensions is high, we should drop variables having low variance compared to the others, because these variables will not explain the variation in the target variable.

R function: nearZeroVar (from the caret package)

Page 15: Dimension Reduction: What? Why? and How?

EXAMPLE AND CODE

To identify these types of predictors, the following two metrics can be calculated:

- the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data;

- the "percent of unique values", i.e. the number of unique values divided by the total number of samples (times 100), which approaches zero as the granularity of the data increases.

If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance.
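As a minimal sketch, the two metrics can be computed by hand for a single predictor (the toy vector x is our own example; nearZeroVar reports the same quantities when called with saveMetrics = TRUE):

x <- c(rep(0, 95), 1, 1, 2, 3, 4)            # a highly unbalanced toy predictor

tab <- sort(table(x), decreasing = TRUE)
freq_ratio <- tab[1] / tab[2]                # most prevalent / second most prevalent value
pct_unique <- 100 * length(unique(x)) / length(x)

freq_ratio   # large for unbalanced predictors
pct_unique   # small when the predictor takes few distinct values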

Page 16: Dimension Reduction: What? Why? and How?

NEAR-ZERO VARIANCE

library(caret)

data(mdrr)                                   # loads mdrrDescr and mdrrClass
data.frame(table(mdrrDescr$nR11))            # a descriptor with almost no variation

nzv <- nearZeroVar(mdrrDescr, saveMetrics = TRUE)
nzv[nzv$nzv, ][1:10, ]                       # inspect some near-zero-variance predictors

dim(mdrrDescr)

nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]
dim(filteredDescr)

Page 17: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: DECISION TREE

A decision tree can be used as a single solution to tackle multiple challenges, like missing values, outliers and identifying significant variables. How? We need to understand how a decision tree works and the concepts of entropy and information gain.

Page 18: Dimension Reduction: What? Why? and How?

DECISION TREE – DATA SET

Page 19: Dimension Reduction: What? Why? and How?

DECISION TREE: WHAT DOES IT LOOK LIKE?

Page 20: Dimension Reduction: What? Why? and How?

DECISION TREE AND ENTROPY

Page 21: Dimension Reduction: What? Why? and How?

DECISION TREE: ENTROPY AND INFORMATION GAIN

To build a decision tree, we need to calculate two types of entropy using frequency tables:

a) Entropy using the frequency table of one attribute

Page 22: Dimension Reduction: What? Why? and How?

DECISION TREE: ENTROPY AND INFORMATION GAIN

b) Entropy using the frequency table of two attributes:
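As a minimal sketch of both calculations in R, using the class counts from the classic weather/"Play" example that the later R code reads from weather.csv (the counts here are the textbook values and are an assumption, not taken from the slides):

# Entropy of a frequency table: E = -sum(p * log2(p))
entropy <- function(freq) {
  p <- freq / sum(freq)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# a) Entropy using the frequency table of one attribute (Play: 9 yes, 5 no)
entropy(c(yes = 9, no = 5))                                  # ~0.940

# b) Entropy using the frequency table of two attributes (Play given Outlook),
#    i.e. the weighted average entropy over the Outlook values
outlook <- list(sunny    = c(yes = 2, no = 3),
                overcast = c(yes = 4, no = 0),
                rainy    = c(yes = 3, no = 2))
n <- sum(sapply(outlook, sum))
sum(sapply(outlook, function(f) (sum(f) / n) * entropy(f)))  # ~0.693
# Information gain for Outlook = 0.940 - 0.693 = 0.247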

Page 23: Dimension Reduction: What? Why? and How?

ENTROPY …..CONTD.

Page 24: Dimension Reduction: What? Why? and How?

ENTROPY …..CONTD.

Page 25: Dimension Reduction: What? Why? and How?

ENTROPY CALCULATION

Page 26: Dimension Reduction: What? Why? and How?

ENTROPY CALCULATION

Page 27: Dimension Reduction: What? Why? and How?

DECISION TREE: ENTROPY AND INFORMATION GAIN

Similarly:

Page 28: Dimension Reduction: What? Why? and How?

KEY POINTS: A branch with entropy 0 is a leaf node.

Page 29: Dimension Reduction: What? Why? and How?

KEY POINTS: A branch with entropy more than 0 needs further splitting.

The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified

Page 30: Dimension Reduction: What? Why? and How?

DECISION TREE: PROS AND CONS

Page 31: Dimension Reduction: What? Why? and How?

R CODE

# 'w' is assumed to be the weather data, e.g. w <- read.csv("weather.csv", header = TRUE),
# and 'fit' an rpart decision tree built on it
library(rpart)
library(rattle)                       # provides fancyRpartPlot
fit <- rpart(Play ~ ., data = w, method = "class")
fancyRpartPlot(fit)

## to calculate information gain
library(FSelector)
weights <- information.gain(Play ~ ., data = w)
print(weights)
subset <- cutoff.k(weights, 3)
f <- as.simple.formula(subset, "Play")
print(f)

Page 32: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: RANDOM FOREST

Similar to the decision tree is the random forest. I would also recommend using the built-in feature importance provided by random forests to select a smaller subset of input features.

Just be careful: random forests tend to be biased towards variables that have a larger number of distinct values, i.e. they favour numeric variables over binary/categorical variables.

Page 33: Dimension Reduction: What? Why? and How?

HOW DOES RANDOM FOREST WORK?

Page 34: Dimension Reduction: What? Why? and How?

W/ AND W/O REPLACEMENT

Page 35: Dimension Reduction: What? Why? and How?

HOW A RANDOM FOREST IS BUILT

Page 36: Dimension Reduction: What? Why? and How?

R CODE

w <- read.csv("weather.csv",header=T)library(randomForest)set.seed(12345)w <- as.data.frame(w)w <- read.csv("weather.csv",header=T)str(w)w <- w[names(w)[1:5]]fit1 <- randomForest(Play~ Outlook+Temperature+Humidity+Windy, data=w, importance=TRUE, ntree=20)varImpPlot(fit1)

Page 37: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: HIGH CORRELATION

Dimensions exhibiting high correlation can lower the performance of the model. Moreover, it is not good to have multiple variables carrying similar information or variation, which is also known as "multicollinearity". You can use a Pearson (continuous variables) or polychoric (discrete variables) correlation matrix to identify the variables with high correlation, and select one of them using the VIF (Variance Inflation Factor). Variables having a higher value (VIF > 5) can be dropped.
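As a minimal sketch of correlation-based filtering using caret's findCorrelation on the mdrr descriptors from the earlier example (the 0.75 cutoff is an assumption; VIF-based selection is shown later):

library(caret)
data(mdrr)
descr <- mdrrDescr[, -nearZeroVar(mdrrDescr)]        # drop near-zero-variance columns first

corr_mat <- cor(descr)                               # Pearson correlation matrix
high_cor <- findCorrelation(corr_mat, cutoff = 0.75) # columns to remove
filtered <- descr[, -high_cor]
dim(filtered)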

Page 38: Dimension Reduction: What? Why? and How?

IMPACT OF MULTICOLLINEARITY

R demonstration with example

13 independent variables regressed against the response variable, crime rate.

- A discrepancy is noticeable when studying the regression equation with regard to expenditure on police services in the years 1959 and 1960. Why should police expenditure in one year be associated with an increase in crime rate, and in the previous year with a decrease? It does not make sense.

Page 39: Dimension Reduction: What? Why? and How?

IMPACT OF MULTICOLLINEARITY

Second, even though the F statistic is highly significant, which provides evidence for the presence of a linear relationship between the 13 variables and the response variable, the β coefficients of both expenditure variables (1959 and 1960) have non-significant t ratios. A non-significant t means there is no slope! In other words, police expenditure apparently has no effect whatsoever on crime rate!

Page 40: Dimension Reduction: What? Why? and How?

HIGH CORRELATION - VIF

The most widely used diagnostic for multicollinearity is the variance inflation factor (VIF).

- The VIF may be calculated for each predictor by doing a linear regression of that predictor on all the other predictors, and then obtaining the R2 from that regression. The VIF is just 1/(1-R2).

- It’s called the variance inflation factor because it estimates how much the variance of a coefficient is “inflated” because of linear dependence with other predictors. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor was completely uncorrelated with all the other predictors.

- The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to be to constitute a problem. Personally, I tend to get concerned when a VIF is greater than 2.50, which corresponds to an R2 of .60 with the other variables.

Page 41: Dimension Reduction: What? Why? and How?

VIF – R CODE

# vif() is assumed to come from the car package; 'fit' is a fitted linear model (lm)
library(car)
vif(fit)
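As a minimal sketch of the 1/(1 - R2) definition above (mtcars is used purely as an illustration; it is not the crime-rate data from the slides):

library(car)

fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
vif(fit)

# The same quantity computed by hand for one predictor (disp):
r2_disp <- summary(lm(disp ~ hp + wt, data = mtcars))$r.squared
1 / (1 - r2_disp)        # matches vif(fit)["disp"]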

Page 42: Dimension Reduction: What? Why? and How?

KEY POINTS: High VIFs can be safely ignored when the variables with high VIFs are control variables and the variables of interest do not have high VIFs.

Suppose the sample consists of U.S. colleges. The dependent variable is graduation rate, and the variable of interest is an indicator (dummy) for public vs. private. Two control variables are average SAT scores and average ACT scores for entering freshmen. These two variables have a correlation above .9, which corresponds to VIFs of at least 5.26 for each of them. But the VIF for the public/private indicator is only 1.04. So there is no problem to be concerned about, and no need to delete one or the other of the two controls.

http://statisticalhorizons.com/multicollinearity

Page 43: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: BACKWARD FEATURE ELIMINATION / FORWARD FEATURE SELECTION

In this method, we start with all n dimensions. We compute the sum of squared residuals (SSR) after eliminating each variable in turn (n times), then identify the variable whose removal has produced the smallest increase in the SSR and remove it, leaving us with n-1 input features.

Repeat this process until no other variables can be dropped.

The reverse of this is the "forward feature selection" method. In this method, we start with one variable and analyse the performance of the model as we add another variable at each step. Here, the variable to add is selected on the basis of the largest improvement in model performance.
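As a minimal sketch, stepwise selection on a linear model can be done with R's built-in step(), which uses AIC rather than the raw SSR but follows the same backward/forward idea (mtcars is again only an illustration):

full <- lm(mpg ~ ., data = mtcars)
back <- step(full, direction = "backward", trace = 0)     # backward feature elimination

null <- lm(mpg ~ 1, data = mtcars)
fwd  <- step(null, scope = formula(full), direction = "forward", trace = 0)  # forward selection

formula(back)
formula(fwd)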

Page 44: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: FACTOR ANALYSIS

Let's say some variables are highly correlated. These variables can be grouped by their correlations, i.e. all the variables in a particular group can be highly correlated among themselves but have low correlation with the variables of other group(s). Here each group represents a single underlying construct, or factor. These factors are small in number compared to the large number of dimensions; however, they are difficult to observe directly. There are basically two methods of performing factor analysis:
- EFA (Exploratory Factor Analysis)
- CFA (Confirmatory Factor Analysis)
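As a minimal sketch of exploratory factor analysis with base R's factanal() (using mtcars and 2 factors purely for illustration):

# Group correlated variables into a small number of latent factors
fa <- factanal(mtcars, factors = 2, rotation = "varimax", scores = "regression")
print(fa$loadings, cutoff = 0.4)   # which variables load on which factor
head(fa$scores)                    # the 2-dimensional factor-score representation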

Page 45: Dimension Reduction: What? Why? and How?

COMMON METHODS FOR DIMENSION REDUCTION: PRINCIPAL COMPONENT ANALYSIS (PCA)

In this technique, the variables are transformed into a new set of variables that are linear combinations of the original variables. This new set of variables is known as the principal components. They are obtained in such a way that the first principal component accounts for as much of the variation in the original data as possible, after which each succeeding component has the highest possible remaining variance.

The second principal component must be orthogonal to the first principal component; in other words, it does its best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional dataset, there can be only two principal components.

The principal components are sensitive to the scale of measurement; to fix this issue, we should always standardize the variables before applying PCA. Note also that after applying PCA the original features lose their individual meaning, so if interpretability of the results is important for your analysis, PCA is not the right technique for your project.
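As a minimal sketch with base R's prcomp (scale. = TRUE performs the standardization discussed above; mtcars is only an illustration):

pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
summary(pca)            # proportion of variance explained by each component
head(pca$x[, 1:2])      # the data projected onto the first two principal components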

Page 46: Dimension Reduction: What? Why? and How?

UNDERSTANDING PCA: PREREQUISITES

- Mean

- Standard deviation: "the average distance from the mean of the data set to a point" (statisticians are usually concerned with taking a sample of a population)

- Variance

Page 47: Dimension Reduction: What? Why? and How?

UNDERSTANDING PCA: PREREQUISITES

- Covariance: covariance is always measured between 2 dimensions.

- Covariance matrix

Page 48: Dimension Reduction: What? Why? and How?

EXAMPLE: HOW TO CALCULATE COVARIANCE
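The worked example on this slide is an image; as a minimal sketch of the same calculation in R (the x and y values are illustrative assumptions, not the slide's numbers):

x <- c(2.5, 0.5, 2.2, 1.9, 3.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0)

# Covariance by its definition: sum((x - mean(x)) * (y - mean(y))) / (n - 1)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

cov(x, y)             # built-in covariance, same value
cov(cbind(x, y))      # the 2 x 2 covariance matrix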

Page 49: Dimension Reduction: What? Why? and How?

EIGENVECTORS & EIGENVALUES OF A MATRIX

[Figure: a matrix, one of its eigenvectors, and the corresponding eigenvalue]

Page 50: Dimension Reduction: What? Why? and How?

PROPERTIES:

- Eigenvectors and eigenvalues always come in pairs.

- Eigenvectors can only be found for square matrices, but not all square matrices have eigenvectors.

- For an n x n matrix that does have eigenvectors, there are n of them.

- Another property of eigenvectors is that even if we scale the vector by some amount before multiplying it by the matrix, we still get the same multiple of it as a result. This is because scaling a vector only makes it longer; it does not change its direction.

Page 51: Dimension Reduction: What? Why? and How?

PROPERTIES:

All the eigenvectors of a symmetric matrix (such as a covariance matrix) are perpendicular, i.e. at right angles to each other, no matter how many dimensions you have. By the way, another word for perpendicular, in maths talk, is orthogonal.

This is important:

It means we can express the data in terms of these perpendicular eigenvectors, instead of expressing it in terms of the x and y axes.

Page 52: Dimension Reduction: What? Why? and How?

PROPERTIES:

Another important thing to know is that when mathematicians find eigenvectors, they like to find the eigenvectors whose length is exactly one. This is because, as you know, the length of a vector doesn’t affect whether it’s an eigenvector or not, whereas the direction does. So, in order to keep eigenvectors standard, whenever we find an eigenvector we usually scale it to make it have a length of 1, so that all eigenvectors have the same length. Here’s a demonstration from our example above.
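As a minimal sketch in R (eigen() already returns eigenvectors scaled to length 1; the matrix itself is an assumption for illustration):

A <- matrix(c(2, 3,
              2, 1), nrow = 2, byrow = TRUE)   # a small square matrix
e <- eigen(A)
e$values                       # eigenvalues: 4 and -1
e$vectors                      # eigenvectors, each scaled to length 1
sqrt(colSums(e$vectors^2))     # both columns have length 1

# Check the defining property A %*% v = lambda * v for the first pair
A %*% e$vectors[, 1]
e$values[1] * e$vectors[, 1]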

Page 53: Dimension Reduction: What? Why? and How?

HOW DOES ONE GO ABOUT FINDING THESE MYSTICAL EIGENVECTORS?

- Unfortunately, it is only easy(ish) if you have a rather small matrix.

- The usual way to find the eigenvectors is by some complicated iterative method, which is beyond the scope of this tutorial (in practice, R's eigen() does this for us).

Page 54: Dimension Reduction: What? Why? and How?

REVISE:

Page 55: Dimension Reduction: What? Why? and How?

METHOD

1. Get some data

2. Subtract the mean

3. Calculate the covariance matrix

Page 56: Dimension Reduction: What? Why? and How?

METHOD ….CONTD

4. Calculate the eigenvectors and eigenvalues of the covariance matrix

Page 57: Dimension Reduction: What? Why? and How?

PLOT

Page 58: Dimension Reduction: What? Why? and How?

METHOD

Page 59: Dimension Reduction: What? Why? and How?

METHOD: Step 5: Deriving the new data set

FinalData = RowFeatureVector x RowDataAdjust, where:

RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top;

RowDataAdjust is the mean-adjusted data, transposed, i.e. the data items are in the columns, with each row holding a separate dimension.
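A minimal end-to-end sketch of steps 1-5 in R (the small two-column data set is an assumption; prcomp gives the same result up to sign):

# Steps 1-5 of the method above, by hand, on a small 2D data set
x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
data <- cbind(x, y)                                      # step 1: get some data

adjusted <- scale(data, center = TRUE, scale = FALSE)    # step 2: subtract the mean
C <- cov(adjusted)                                       # step 3: covariance matrix
e <- eigen(C)                                            # step 4: eigenvectors and eigenvalues

RowFeatureVector <- t(e$vectors)       # eigenvectors in rows, most significant first
RowDataAdjust    <- t(adjusted)        # mean-adjusted data, one column per data item
FinalData        <- RowFeatureVector %*% RowDataAdjust   # step 5: the new data set

t(FinalData)[, 1]                      # scores on the first principal component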

Page 60: Dimension Reduction: What? Why? and How?

PCA

