Conceptual and Empirical Exploration of t-SNE
Lucy Chen, Allyson Ling, Shuting Zang
Advisor: Xiaodong Li
Lucy Chen, Allyson Ling, Shuting Zang (Advisor: Xiaodong Li)Conceptual and Empirical Exploration of t-SNE 1 / 24
Seeds Dataset
From UC Irvine Machine Learning Repository
Geometric features of different wheat seeds
210 observations, 7 attributes, 3 classes
Link for dataset: https://archive.ics.uci.edu/ml/datasets/seeds
Seeds Dataset
Seeds Dataset
Figure: t-SNE
Figure: PCA
What is t-SNE?
t-distributed stochastic neighbor embedding
t-SNE is a non-linear dimension reduction technique
Steps: X → D → P → Q → Y
X = Data matrix (≤ 40 attributes)
D = Pairwise distance matrix
P = Probability matrix in high dimension
Q = Probability matrix in low dimension
Y = Data matrix in low dimension
X → Pairwise Distance Matrix → P
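\[ \mathrm{Perp}(P_i) = e^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log p_{j|i} \]
\[ p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \]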
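The X → D → P step can be sketched in NumPy as below. This is a minimal illustration, not the authors' code: the function names `pairwise_sq_dists` and `conditional_p` are made up for this sketch, and the bandwidths `sigmas` are assumed to be given (in t-SNE they are chosen per point from the target perplexity).

```python
import numpy as np

def pairwise_sq_dists(x):
    """Squared Euclidean distance matrix D with D[i, j] = ||x_i - x_j||^2."""
    sq_norms = np.sum(x ** 2, axis=1)
    d = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(d, 0.0)
    return np.maximum(d, 0.0)  # clip tiny negatives caused by rounding

def conditional_p(d, sigmas):
    """Row i holds p_{j|i} for bandwidth sigmas[i]; the diagonal is 0."""
    logits = -d / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # exclude k = i from the sum
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

Each row of the result is a probability distribution over the other points, so the rows sum to 1 while the matrix itself is generally not symmetric.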
Perplexity
\[ \mathrm{Perp}(P_i) = e^{H(P_i)} \]
Measure of effective number of neighbors
Function of Shannon entropy
Measure of uncertainty of an outcome from a set of possible events
Number of bits needed to store information from dataset
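A small sketch of why perplexity reads as an "effective number of neighbors": for a uniform distribution over k outcomes it equals exactly k. The helper names below are illustrative, and entropy is computed with the natural logarithm to match the e^H form above (using log base 2 with 2^H gives the same perplexity).

```python
import math

def shannon_entropy(probs):
    """H = -sum p log p in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perplexity(probs):
    """Perp = e^H, the effective number of equally likely outcomes."""
    return math.exp(shannon_entropy(probs))
```

For instance, `perplexity([0.25] * 4)` is 4 up to rounding, while a distribution concentrated on a single outcome has perplexity 1.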
Pairwise Distance Matrix → P
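\[ \mathrm{Perp}(P_i) = e^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log p_{j|i} \]
For a fixed point $i$, write $d_{ij} = \|x_i - x_j\|^2$ and $\beta_i = 1/(2\sigma_i^2)$, with unnormalized weights $p_j = e^{-\beta_i d_{ij}}$ and $S = \sum_l p_l$, so that
\[ D = \left\{ \frac{p_1}{S}, \ldots, \frac{p_k}{S} \right\} \]
\[
\begin{aligned}
H(D) &= -\sum_j \frac{p_j}{S} \log \frac{p_j}{S} \\
&= -\sum_j \frac{p_j}{S} \left( \log p_j - \log S \right) \\
&= -\sum_j \frac{p_j}{S} \log p_j + \log S \sum_j \frac{p_j}{S} \\
&= \log S - \frac{1}{S} \sum_j p_j \log p_j \\
&= \log S + \frac{\beta_i}{S} \sum_j d_{ij} e^{-\beta_i d_{ij}}
\end{aligned}
\]
The last step uses $\log p_j = -\beta_i d_{ij}$.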
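The closed form H = log S + (β/S) Σ_j d_j e^{-β d_j} makes it cheap to search for the β that gives a target perplexity, since entropy decreases as β grows. The sketch below uses a simple bisection; the function names, tolerance, and iteration cap are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def entropy_and_probs(sq_dists, beta):
    """Entropy (nats) and normalized probabilities for one point.

    sq_dists holds the squared distances from that point to all others.
    """
    d = sq_dists - sq_dists.min()  # a shift leaves H and p unchanged, avoids underflow
    w = np.exp(-beta * d)
    s = w.sum()
    h = np.log(s) + beta * np.dot(d, w) / s
    return h, w / s

def beta_for_perplexity(sq_dists, target_perp, tol=1e-5, max_iter=100):
    """Bisection on beta until e^H is close to target_perp."""
    target_h = np.log(target_perp)
    lo, hi, beta = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        h, _ = entropy_and_probs(sq_dists, beta)
        if abs(h - target_h) < tol:
            break
        if h > target_h:  # distribution too flat: increase beta
            lo = beta
            beta = beta * 2.0 if np.isinf(hi) else (beta + hi) / 2.0
        else:             # distribution too peaked: decrease beta
            hi = beta
            beta = (lo + beta) / 2.0
    return beta, entropy_and_probs(sq_dists, beta)[1]
```

The target perplexity must lie between 1 and the number of neighbors, otherwise no β can reach it.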
P → Q → Y
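First set $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$ and the diagonal entries to 0
Then use gradient descent to minimize the cost
\[ C = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \]
\[ \frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1} \]
\[ q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}} \]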
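The P → Q → Y step can be sketched as follows, assuming Q is normalized over all pairs of distinct points as in van der Maaten and Hinton's paper. The function names are illustrative only, and the sketch omits the momentum and early-exaggeration tricks that practical implementations add.

```python
import numpy as np

def joint_p(cond_p):
    """Symmetrize conditionals: p_ij = (p_{j|i} + p_{i|j}) / (2n), zero diagonal."""
    n = cond_p.shape[0]
    p = (cond_p + cond_p.T) / (2.0 * n)
    np.fill_diagonal(p, 0.0)
    return p

def low_dim_q(y):
    """Student-t similarities q_ij, the kernel values, and the differences y_i - y_j."""
    diff = y[:, None, :] - y[None, :, :]
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum(), inv, diff

def kl_cost(p, y):
    """C = KL(P || Q), summed over pairs with p_ij > 0."""
    q, _, _ = low_dim_q(y)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def kl_gradient(p, y):
    """dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    q, inv, diff = low_dim_q(y)
    return 4.0 * np.einsum("ij,ijk->ik", (p - q) * inv, diff)
```

One plain gradient descent update is then `y = y - learning_rate * kl_gradient(p, y)`, where `learning_rate` is a step size chosen by the user.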
Real Data Implementation
Real Data Implementation 10 / 24
Wisconsin Breast Cancer Dataset
From UC Irvine Machine Learning Repository
Breast Cancer Diagnosis: 357 benign, 212 malignant
Ten real-valued features are computed for each cell nucleus; the attributes are the mean, standard error, and "worst" or largest (mean of the three largest values) of each feature, giving 30 attributes in total
Wisconsin Breast Cancer Dataset
Figure: Original
Figure: Scaled
Wisconsin Breast Cancer Dataset
Figure: 2D PCA
Wisconsin Breast Cancer Dataset
Finding the proper perplexity
For this dataset, higher perplexity values give more clearly separated clusters
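A perplexity comparison like this one can be reproduced roughly with scikit-learn, which ships the same Wisconsin Diagnostic data as `load_breast_cancer`. The sketch below is an assumption about tooling rather than the authors' setup: the helper names, the default perplexity values, and the random seed are all illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def embed_at_perplexities(x, perplexities, seed=0):
    """Standardize x, then return {perplexity: 2-D t-SNE embedding}."""
    x_scaled = StandardScaler().fit_transform(x)
    embeddings = {}
    for perp in perplexities:
        tsne = TSNE(n_components=2, perplexity=perp, init="pca", random_state=seed)
        embeddings[perp] = tsne.fit_transform(x_scaled)
    return embeddings

def breast_cancer_embeddings(perplexities=(5, 30, 50), seed=0):
    """Embed the full dataset; the perplexity values here are illustrative only."""
    data = load_breast_cancer()
    return embed_at_perplexities(data.data, perplexities, seed), data.target
```

Plotting each embedding colored by the returned labels gives one panel per perplexity value for a side-by-side comparison.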
Wisconsin Breast Cancer Dataset
Find the anomalies, and trace them back to the original dataset
Blue for benign, orange for malignant
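The slides do not state how the anomalies were picked out of the plot. One possible heuristic, assumed here purely for illustration, is to flag points whose nearest neighbors in the 2-D embedding mostly carry a different label. The function name, `k`, and the majority threshold are hypothetical choices.

```python
import numpy as np

def flag_label_outliers(embedding, labels, k=5):
    """Indices of points whose k nearest embedded neighbours mostly have another label."""
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    disagree = (labels[neighbours] != labels[:, None]).mean(axis=1)
    return np.flatnonzero(disagree > 0.5)
```

Because t-SNE keeps the row order of the input, the returned indices point directly at the corresponding rows of the original data matrix, which is how a flagged point can be traced back.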
Wisconsin Breast Cancer Dataset
Wisconsin Breast Cancer Dataset
Based on the visualized clusters, we would suggest that the doctor reconsider the diagnosis for these anomalies, since the patient may have been misdiagnosed.
Features about the size of the cancer cell:
Mean/SE/Worst of radius
Mean/SE/Worst of perimeter
Mean/SE/Worst of area
Vertebral Column Dataset
From UC Irvine Machine Learning Repository
Features distinguishing spinal disorders
310 observations, 6 attributes, 3 classes
Link for dataset: https://archive.ics.uci.edu/ml/datasets/Vertebral+Column
Vertebral Column Dataset
Figure: Disk Hernia and Normal
Figure: Spondylolisthesis
Vertebral Column Dataset
Figure: t-SNE
Figure: PCA
Vertebral Column Dataset
Figure: Density plots of each attribute
Vertebral Column Dataset
Vertebral Column Dataset
Figure: Density plots of each attribute with anomalies
References
Maaten, Laurens van der, and Hinton, Geoffrey. Visualizing Data using t-SNE. https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf
Shannon, Claude E. A Mathematical Theory of Communication. http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
Datasets:
Seeds: https://archive.ics.uci.edu/ml/datasets/seeds
Breast Cancer: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
Vertebral Column: https://archive.ics.uci.edu/ml/datasets/Vertebral+Column