Scalable Machine Learning Algorithms

Post on 08-Oct-2020

Esteban García-Cuesta – Computer Science Department

Scalable Machine Learning Algorithms and Applications

PhD. Esteban García-Cuesta

Associate Professor & Head of Data Science Laboratory

Universidad Europea de Madrid

Professor and Researcher at Universidad Europea de Madrid

Head of Data Science Lab Research Group
• Machine learning and data mining
• Affective computing
• Dimensionality reduction and latent spaces
• Social mining

Contact information
Email: esteban.garcia@universidadeuropea.es
Skype: egarciacuesta
Tel: +34 912115163

PhD in Computer Science (Artificial Intelligence) from Universidad Carlos III de Madrid

CANOPY ALS SLMVP

END

CANOPY Clustering

High Dimensional Data

• Given a cloud of data points we want to understand its structure

Clustering Images

• Image segmentation
• Goal: break up the images into meaningful or perceptually similar regions

Luis Pedro Coelho, Aabid Shariff, and Robert F. Murphy. "Nuclear segmentation in microscope cell images: A hand-segmented dataset and comparison of algorithms". DOI: 10.1109/ISBI.2009.5193098

Clustering Problem: Galaxies (SkyCat)

• A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands)

• Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.

• Sloan Digital Sky Survey

Clustering is a hard problem!

Why is it hard?

• Clustering in two dimensions looks easy

• Clustering small amounts of data looks easy

• And in most cases, looks are not deceiving

• Many applications involve not 2, but 10 or 10,000 dimensions

• High-dimensional spaces look different: Almost all pairs of points are at about the same distance
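The last bullet can be checked with a small experiment. The sketch below is only an illustration under assumptions that are not in the slides: points are drawn uniformly at random from the unit cube, distances are Euclidean, and the helper name relative_spread is hypothetical. It measures how much pairwise distances vary relative to their mean as the number of dimensions grows.

```python
import math
import random


def relative_spread(n_points, dim, seed=0):
    """Return (max - min) / mean over all pairwise Euclidean distances
    of n_points drawn uniformly from the unit cube in `dim` dimensions."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(p, q)
        for i, p in enumerate(pts)
        for q in pts[i + 1:]
    ]
    mean = sum(dists) / len(dists)
    return (max(dists) - min(dists)) / mean


# The spread shrinks as the dimension grows: distances concentrate
# around their mean, so "near" and "far" become harder to tell apart.
for dim in (2, 10, 1000):
    print(dim, round(relative_spread(50, dim), 3))
```

A smaller value means the pairwise distances are closer to each other, which is the effect the bullet describes.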

Canopy clustering

• Preprocessing step that reduces the number of operations k-means has to perform

• Suitable for large data sets (large number of samples)

• Results are similar to those provided by k-means itself

Canopy clustering

[Figure sequence, frames 1 to 15; images not included in this transcript]

Assigning points to canopies

[Figure sequence, frames 16 to 20; images not included in this transcript]

Canopy as initial step for k-means

[Figure sequence, frames 21 to 27; images not included in this transcript]
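One possible reading of "canopy as initial step for k-means" is sketched below. It is an assumption-laden illustration, not the method shown in the missing figures: canopy centers are used as the initial k-means centroids (so k equals the number of canopies), and each point is compared only against centroids whose canopy contains it. The function names canopies_of and canopy_kmeans, the Euclidean distance, and the fixed iteration count are all hypothetical choices.

```python
import math
import random


def canopies_of(points, t1, t2, seed=0):
    """Build canopies as (center_point, set_of_member_indices), with T1 > T2."""
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    result = []
    while remaining:
        c = remaining[rng.randrange(len(remaining))]
        d = {i: math.dist(points[c], points[i]) for i in remaining}
        result.append((points[c], {i for i in remaining if d[i] < t1}))
        # Points within T2 of the center leave the list (the center included).
        remaining = [i for i in remaining if d[i] >= t2]
    return result


def canopy_kmeans(points, t1, t2, iters=10, seed=0):
    """k-means seeded with canopy centers; distances are only computed
    between a point and the centroids of the canopies that contain it."""
    canopies = canopies_of(points, t1, t2, seed)
    centers = [center for center, _ in canopies]
    # Every point belongs to at least one canopy, so no list here is empty.
    candidates = [
        [j for j, (_, members) in enumerate(canopies) if i in members]
        for i in range(len(points))
    ]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(candidates[i], key=lambda j: math.dist(p, centers[j]))
        for j in range(len(centers)):
            group = [points[i] for i, lab in enumerate(labels) if lab == j]
            if group:  # keep the old centroid if no point chose it
                centers[j] = tuple(sum(x) / len(group) for x in zip(*group))
    return centers, labels


data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = canopy_kmeans(data, t1=1.0, t2=0.5)
print(centers, labels)
```

The saving comes from the candidates list: a point far from a canopy never has its distance to that canopy's centroid computed.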

Summary (Canopy Algorithm)

• Start with a list of data points and two distance thresholds T1 > T2

1. Select any point (at random) from the list to form a canopy center

2. Calculate its distance to all the other points in the list

3. Put all the points which fall within the distance threshold T1 into a canopy

4. Remove from the main dataset list all the points which fall within the threshold T2. These points are excluded from being the center of, and from forming, new canopies.

5. Repeat steps 1 to 4 until the original list is empty

Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00). ACM, New York, NY, USA, 169-178. DOI: http://dx.doi.org/10.1145/347090.347123
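The steps of the summary can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and in-memory data; the function name canopy_clustering and its parameters are hypothetical, and the cited paper uses a cheap approximate distance rather than the exact one used here.

```python
import math
import random


def canopy_clustering(points, t1, t2, seed=0):
    """Return a list of (center_index, member_indices) canopies.

    t1 is the loose threshold and t2 the tight one, with t1 > t2.
    """
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    rng = random.Random(seed)
    remaining = list(range(len(points)))  # the "list" from the summary
    canopies = []
    while remaining:  # step 5: repeat until the list is empty
        # Step 1: pick a random point from the list as canopy center.
        center = remaining[rng.randrange(len(remaining))]
        # Step 2: distance from the center to every point still in the list.
        dist = {i: math.dist(points[center], points[i]) for i in remaining}
        # Step 3: everything within T1 joins the canopy.
        members = [i for i in remaining if dist[i] < t1]
        canopies.append((center, members))
        # Step 4: everything within T2 (the center included) leaves the list.
        remaining = [i for i in remaining if dist[i] >= t2]
    return canopies


sample = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
for center, members in canopy_clustering(sample, t1=1.0, t2=0.5):
    print("center", sample[center], "members", [sample[i] for i in members])
```

Because points between T2 and T1 stay in the list, a point can end up in more than one canopy; only points within T2 of a center are guaranteed to be removed.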

Canopy clustering (Parallelization Summary)

The processing is done as a sequence of MapReduce (M/R) steps:

1. The data is converted into a suitable input format

2. Each mapper performs canopy clustering on the points in its input set and outputs its canopies' centers

3. The reducer clusters the canopy centers to produce the final canopy centers

4. The points are then clustered into these final canopies
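The map and reduce phases can be imitated on a single machine to see how the pieces fit together. The sketch below is an assumption-based illustration, not code from any M/R framework: the splits stand in for mapper inputs, plain function calls stand in for mappers and the reducer, and the names canopy_centers and parallel_canopy are hypothetical. Points are tuples so they can be used as dictionary keys.

```python
import math
import random


def canopy_centers(points, t2, rng):
    """Pick canopy centers: choose a random point, drop everything within T2, repeat."""
    remaining = list(points)
    centers = []
    while remaining:
        c = remaining[rng.randrange(len(remaining))]
        centers.append(c)
        remaining = [p for p in remaining if math.dist(c, p) >= t2]
    return centers


def parallel_canopy(points, t1, t2, n_splits=2, seed=0):
    """Simulate the mapper, reducer, and final assignment phases in sequence."""
    if not t1 > t2 > 0:
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    rng = random.Random(seed)
    # Input preparation: cut the data into splits, one per simulated mapper.
    splits = [points[i::n_splits] for i in range(n_splits)]
    # Map: each mapper clusters its own split and emits only canopy centers.
    local_centers = [c for split in splits for c in canopy_centers(split, t2, rng)]
    # Reduce: cluster the emitted centers to obtain the final canopy centers.
    final_centers = canopy_centers(local_centers, t2, rng)
    # Final pass: put every point into each final canopy within T1 of it.
    return {c: [p for p in points if math.dist(c, p) < t1] for c in final_centers}


data = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0), (5.2, 5.0)]
for center, members in parallel_canopy(data, t1=1.0, t2=0.5).items():
    print(center, members)
```

In this sketch a point can lie up to 2 * T2 from its final center, so choosing T1 of at least 2 * T2 keeps every point inside some canopy; a real implementation may handle uncovered points differently.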

Canopy (How to Choose K?)

[Figure: cost function (vertical axis) versus number of clusters (horizontal axis). Labels: rule of thumb k = (n/2)^0.5; a better approximation; optimal.]
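The rule of thumb and the cost-versus-k curve from this slide can be reproduced numerically. The sketch below makes several assumptions not stated in the slides: the cost function is taken to be the k-means sum of squared distances to the nearest centroid, the data are three synthetic Gaussian blobs, and the helper names rule_of_thumb_k and kmeans_cost are hypothetical.

```python
import math
import random


def rule_of_thumb_k(n):
    """Rule of thumb from the slide: k = (n / 2) ** 0.5, rounded, at least 1."""
    return max(1, round(math.sqrt(n / 2)))


def kmeans_cost(points, k, iters=20, seed=0):
    """Run a basic Lloyd's k-means and return the sum of squared
    distances from each point to its nearest centroid."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # New centroid = mean of the group; keep the old one if the group is empty.
        centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


rng = random.Random(1)
blobs = [
    (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(30)
]
costs = {k: kmeans_cost(blobs, k) for k in range(1, 7)}
print("rule of thumb k for n =", len(blobs), "->", rule_of_thumb_k(len(blobs)))
for k, cost in costs.items():
    print(k, round(cost, 2))
```

Plotting costs against k gives the kind of curve the figure describes; the rule of thumb depends only on n, so it can land far from the bend of the curve when the data have few well-separated groups.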

Canopy (How to Choose K?)

• Check how good the clusters are for the application at hand, e.g. Portugal market segmentation

[Figure: two scatter plots of weight (vertical axis) versus height (horizontal axis)]

Copyright

Note to the users of the provided slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message.