Scalable Machine Learning Algorithms

projectbasedschool.universidadeuropea.es/blogs/dsl/...

transcript

Esteban García-Cuesta – Computer Science Department

Scalable Machine Learning Algorithms and Applications

PhD. Esteban García-Cuesta

Associate Professor & Head of Data Science Laboratory

Universidad Europea de Madrid

Professor and Researcher at Universidad Europea de Madrid

Head of Data Science Lab Research Group

• Machine Learning and data mining

• Affective computing

• Dimensionality reduction and latent spaces

• Social mining

Contact information

Email: esteban.garcia@universidadeuropea.es

Skype: egarciacuesta

Tel: +34 912115163

PhD in Computer Science (Artificial Intelligence) from Universidad Carlos III de Madrid

CANOPY ALS SLMVP

CANOPY Clustering

High Dimensional Data

• Given a cloud of data points we want to understand its structure

Clustering Images

• Image segmentation

• Goal: break up the images into meaningful or perceptually similar regions

Luis Pedro Coelho, Aabid Shariff, and Robert F. Murphy. "Nuclear segmentation in microscope cell images: A hand-segmented dataset and comparison of algorithms". DOI: 10.1109/ISBI.2009.5193098

Clustering Problem: Galaxies (SkyCat)

• A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)

• Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.

• Sloan Digital Sky Survey

Clustering is a hard problem!

Why is it hard?

• Clustering in two dimensions looks easy

• Clustering small amounts of data looks easy

• And in most cases, looks are not deceiving

• Many applications involve not 2, but 10 or 10,000 dimensions

• High-dimensional spaces look different: Almost all pairs of points are at about the same distance
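The last point can be checked empirically. The following is a minimal sketch, assuming points drawn uniformly at random from the unit hypercube (an assumption made here for illustration, not a dataset from the slides). The helper name `distance_spread` is hypothetical. It reports the smallest and largest pairwise Euclidean distance relative to the mean distance; as the dimension grows, both ratios should move toward 1.

```python
import math
import random

def distance_spread(dim, n_points=100, seed=0):
    """Return (min, max) pairwise Euclidean distance, each divided by the mean."""
    rng = random.Random(seed)
    # Assumption: uniform random points in the unit hypercube [0, 1]^dim.
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    mean = sum(dists) / len(dists)
    return min(dists) / mean, max(dists) / mean

for dim in (2, 10, 1000):
    lo, hi = distance_spread(dim)
    print(f"dim={dim:5d}  min/mean={lo:.2f}  max/mean={hi:.2f}")
```

When the ratios are close to 1, the nearest and farthest neighbours of a point are hard to tell apart, which is what makes distance-based clustering harder in high dimensions.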

Preprocessing step that reduces the number of operations to be performed by k-means

Suitable for large data sets (large number of samples)

Results are similar to those provided by k-means itself

Canopy clustering

Assigning points to canopies

Canopy as initial step for k-means

Summary (Canopy Algorithm)

• Start with a list of data points and two distances T1 > T2

1. Select any point (at random) from the list to form a canopy center

2. Calculate its distance to all the other points in the list

3. Put all the points which fall within the distance threshold of T1 into a canopy

4. Remove from the main dataset list all the points which fall within the threshold of T2. These points are excluded from being the center of, or forming, new canopies.

5. Repeat steps 1 to 4 until the original list is empty

Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '00). ACM, New York, NY, USA, 169-178. DOI=http://dx.doi.org/10.1145/347090.347123
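The five steps of the summary can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: Euclidean distance is used as the metric, "within the threshold" is read as strictly less than, and the function name `canopy_clustering` and the sample points are hypothetical.

```python
import math
import random

def canopy_clustering(points, t1, t2, seed=0):
    """Group points into (possibly overlapping) canopies, with t1 > t2.

    Returns a list of (center, members) pairs.
    """
    if not t1 > t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:  # Step 5: repeat until the list is empty.
        # Step 1: pick a random point from the list as the canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        # Step 2: distance from the center to every other point in the list.
        dists = [(math.dist(center, p), p) for p in remaining]
        # Step 3: points within the loose threshold t1 join the canopy.
        members = [center] + [p for d, p in dists if d < t1]
        # Step 4: points within the tight threshold t2 leave the list,
        # so they cannot become centers of new canopies.
        remaining = [p for d, p in dists if d >= t2]
        canopies.append((center, members))
    return canopies

# Hypothetical sample data: two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.4, 10.0)]
for center, members in canopy_clustering(points, t1=2.0, t2=1.0):
    print(center, members)
```

A point that lies between T2 and T1 of a center stays in the list, so it can appear in more than one canopy. That overlap is what distinguishes canopies from a hard partition.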

The processing is done in the following M/R steps:

1. The data is massaged into suitable input format

2. Each mapper performs canopy clustering on the points in its input set and outputs its canopies’ centers

3. The reducer clusters the canopy centers to produce the final canopy centers

4. The points are then clustered into these final canopies
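The four steps above can be simulated on a single machine. The sketch below uses plain Python lists in place of a real MapReduce framework, so `mapper`, `reducer`, and `assign` are hypothetical names and not calls from any library. It also assumes Euclidean distance, a deterministic choice of centers for reproducibility, and made-up sample data.

```python
import math

def canopy_centers(points, t2):
    """Return canopy centers: each center removes all points closer than t2."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)  # deterministic pick, to keep the sketch reproducible
        centers.append(center)
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return centers

def mapper(partition, t2):
    # Step 2: each mapper clusters only its own input split.
    return canopy_centers(partition, t2)

def reducer(all_local_centers, t2):
    # Step 3: cluster the collected local centers into the final centers.
    return canopy_centers(all_local_centers, t2)

def assign(points, centers, t1):
    # Step 4: a point joins every final canopy whose center is within t1.
    return {c: [p for p in points if math.dist(c, p) < t1] for c in centers}

# Step 1: input already in a suitable format (tuples), split into two partitions.
partitions = [
    [(0.0, 0.0), (0.3, 0.1), (9.0, 9.0)],
    [(0.2, 0.0), (9.2, 9.1), (9.1, 8.8)],
]
t1, t2 = 2.0, 1.0
local = [c for part in partitions for c in mapper(part, t2)]
final_centers = reducer(local, t2)
all_points = [p for part in partitions for p in part]
canopies = assign(all_points, final_centers, t1)
for center, members in canopies.items():
    print(center, members)
```

In this sketch every point ends up less than 2·T2 from some final center (by the triangle inequality), so choosing T1 ≥ 2·T2 guarantees that no point is left outside all canopies. With a smaller T1 some points may remain unassigned.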

Canopy clustering (Parallelization Summary)

Canopy (How to Choose K?)

[Figure: axis "Number of clusters"; labels "Optimal", "Rule of thumb: k = (n/2)^0.5", "A better approximation"]
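The rule of thumb k = (n/2)^0.5 is easy to evaluate for a given number of samples n. The snippet below is a small sketch; rounding to the nearest integer and the lower bound of 1 are assumptions added here, and the function name `rule_of_thumb_k` is hypothetical.

```python
import math

def rule_of_thumb_k(n_samples):
    """Rule-of-thumb cluster count k = (n/2)^0.5, rounded (assumption), at least 1."""
    return max(1, round(math.sqrt(n_samples / 2)))

for n in (200, 5_000, 1_000_000):
    print(n, rule_of_thumb_k(n))
```

The value is only a starting point for k; it ignores the structure of the data, which is why a better approximation and an application-specific check are also worth considering.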

• Check how good the clusters are for the application at hand, e.g. Portugal market segmentation

[Figure: axis labels "Height", "Height"]

Copyright

Note to the users of the provided slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message.
