Post on 08-Oct-2020
Esteban García-Cuesta – Computer Science Department
Scalable Machine Learning Algorithms and Applications
PhD. Esteban García-Cuesta
Associate Professor & Head of Data Science Laboratory
Universidad Europea de Madrid
Professor and Researcher at Universidad Europea de Madrid
Head of Data Science Lab Research Group
• Machine Learning and data mining
• Affective computing
• Dimensionality reduction and latent spaces
• Social mining
Contact information
Email: esteban.garcia@universidadeuropea.es
Skype: egarciacuesta
Tel: +34 912115163
PhD in Computer Science (Artificial Intelligence) from Universidad Carlos III de Madrid
CANOPY ALS SLMVP
END
CANOPY Clustering
High Dimensional Data
• Given a cloud of data points we want to understand its structure
Clustering Images
• Image segmentation
• Goal: break up the images into meaningful or perceptually similar regions
"Nuclear segmentation in microscope cell images: A hand-segmented dataset and comparison of algorithms" by Luis Pedro Coelho, Aabid Shariff, and Robert F. Murphy; DOI: 10.1109/ISBI.2009.5193098
Clustering Problem: Galaxies (SkyCat)
• A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)
• Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
• Sloan Digital Sky Survey
Clustering is a hard problem!
Why is it hard?
• Clustering in two dimensions looks easy
• Clustering small amounts of data looks easy
• And in most cases, looks are not deceiving
• Many applications involve not 2, but 10 or 10,000 dimensions
• High-dimensional spaces look different: Almost all pairs of points are at about the same distance
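The last bullet can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly at random from the unit hypercube and Euclidean distance (both are choices made for this illustration, not something the slides specify). The helper names `distance_spread` and `relative_contrast` are hypothetical.

```python
import math
import random


def distance_spread(dim, n_points=200, seed=0):
    """Return (min, mean, max) pairwise Euclidean distance of random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    # All unordered pairs (i < j) of points.
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return min(dists), sum(dists) / len(dists), max(dists)


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / mean of the pairwise distances; small means 'all alike'."""
    lo, mean, hi = distance_spread(dim, n_points, seed)
    return (hi - lo) / mean


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the relative contrast shrinks as the dimension grows, which is the sense in which almost all pairs end up at about the same distance.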
Canopy clustering
• A preprocessing step that reduces the number of operations k-means has to perform
• Suitable for large data sets (large number of samples)
• Results are similar to those provided by k-means itself
Assigning points to canopies
Canopy as initial step for k-means
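One way to read this step is that the canopy centers serve as the starting centroids for k-means, so that k is the number of canopies. Below is a minimal sketch of plain Lloyd's k-means that accepts such starting centers. The function name `kmeans` and its parameters are hypothetical, and the sketch shows only the seeding: it does not restrict distance computations to points that share a canopy.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's k-means started from given centers (e.g. canopy centers)."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each center to the mean of its points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment
```

Calling `kmeans(points, centers)` with the centers from a canopy pass returns the refined centers and, for each point, the index of its cluster.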
Summary (Canopy Algorithm)
• Start with a list of data points and two distance thresholds T1 > T2
1. Select any point (at random) from the list to form a canopy center
2. Calculate its distance to all the other points in the list
3. Put all the points that fall within the distance threshold T1 into a canopy
4. Remove from the main list all the points that fall within the threshold T2. These points are excluded from being the center of, or forming, new canopies.
5. Repeat steps 1 to 4 until the original list is empty
Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). ACM, New York, NY, USA, 169-178. DOI=http://dx.doi.org/10.1145/347090.347123
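The summarized steps can be sketched as follows. This is a minimal illustration, not the reference implementation from the paper. It assumes Euclidean distance and points given as tuples, and it follows the slide wording by measuring distances only to points still in the list. The name `canopy_clustering` and the `seed` parameter are hypothetical.

```python
import math
import random


def canopy_clustering(points, t1, t2, seed=0):
    """Return a list of (center, members) canopies; requires t1 > t2."""
    if not t1 > t2:
        raise ValueError("T1 must be greater than T2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Step 1: pick any remaining point as the canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        # Steps 2-3: remaining points within T1 of the center join the canopy.
        members = [center] + [p for p in remaining if math.dist(center, p) < t1]
        # Step 4: points within T2 are dropped from the list, so they can
        # no longer become the center of a new canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        canopies.append((center, members))
    # Step 5 is the loop itself: it ends when the list is empty.
    return canopies
```

Because only points within T2 are removed, a point lying between T2 and T1 of a center stays in the list and may appear in more than one canopy.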
Canopy clustering (Parallelization Summary)
The processing is done in the following MapReduce (M/R) steps:
1. The data is massaged into a suitable input format
2. Each mapper performs canopy clustering on the points in its input set and outputs its canopies’ centers
3. The reducer clusters the canopy centers to produce the final canopy centers
4. The points are then clustered into these final canopies
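The mapper, reducer, and final assignment can be imitated in a single process to show the data flow. This is a minimal sketch in plain Python, not code for any particular MapReduce framework. All function names are hypothetical, the input splits are assumed to be lists of tuples, and the canopy pass picks the first remaining point as center instead of a random one to keep the output deterministic.

```python
import math


def canopy_centers(points, t2):
    """Greedy canopy pass that returns only the canopy centers."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        # Points within T2 of the center cannot start another canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return centers


def map_phase(splits, t2):
    """Each mapper clusters its own split and emits its canopy centers."""
    return [canopy_centers(split, t2) for split in splits]


def reduce_phase(mapper_outputs, t2):
    """The reducer clusters all emitted centers into the final centers."""
    all_centers = [c for output in mapper_outputs for c in output]
    return canopy_centers(all_centers, t2)


def assign_points(points, final_centers, t1):
    """Last step: put each point into every final canopy within T1."""
    return {c: [p for p in points if math.dist(c, p) < t1] for c in final_centers}
```

A run would chain the phases: `final = reduce_phase(map_phase(splits, t2), t2)`, followed by `assign_points(all_points, final, t1)`.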
Canopy (How to Choose K?)
[Plot: cost function (y-axis) versus number of clusters (x-axis); marked values: rule of thumb k = (n/2)^0.5, a better approximation, and the optimal k]
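The rule of thumb k = (n/2)^0.5 is easy to evaluate for a given number of samples n. The sketch below also includes one common choice of cost function, the sum of squared distances from each point to its nearest center. That choice is an assumption, since the slides do not say which cost function is plotted. Both function names are hypothetical.

```python
import math


def rule_of_thumb_k(n_samples):
    """Rule-of-thumb number of clusters, k = (n/2)^0.5, rounded, at least 1."""
    if n_samples < 1:
        raise ValueError("n_samples must be positive")
    return max(1, round(math.sqrt(n_samples / 2)))


def clustering_cost(points, centers):
    """Assumed cost: sum of squared distances to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
```

For example, `rule_of_thumb_k(200)` gives 10. Evaluating `clustering_cost` for several candidate values of k gives the kind of curve a cost-versus-clusters plot is built from.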
Canopy (How to Choose K?)
• Check how good the clusters are for the application at hand, e.g. Portugal market segmentation
[Two plots: weight versus height]
Copyright
Note to the users of the provided slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message.