
CS525: Big Data Analytics

Machine Learning on Hadoop

Fall 2013

Elke A. Rundensteiner


Analytics?

• Machine learning, data mining & statistics tools
• Analyze/mine/summarize large datasets
• Extract knowledge from past or streaming data
• Predict trends in future data


ML Today

• Internet search clustering

• Social network analysis

• Taxonomy transformations

• Market analytics

• Recommendation systems

• Log analysis & event filtering

• SPAM filtering

• Fraud detection


Tools & Algorithms

• Collaborative Filtering

• Clustering Techniques

• Classification Algorithms

• Association Rules

• Frequent Pattern Mining

• Statistical libraries (Regression, SVM, …)

• Others…


Common Use Cases


Make It Industry Strength: Big Data

-- Efficient in analyzing/mining data
-- Do not scale

-- Efficient in managing big data
-- Does not analyze or mine data

How to integrate these two worlds?


Some Projects

• Apache Mahout
  • Open-source package on Hadoop for data mining and machine learning

• Revolution R (R-Hadoop or Radoop)
  • Extensions to R package to run on Hadoop


Apache Mahout


Apache Mahout

• Apache Software Foundation project

• Create scalable machine learning libraries

• Why?

• Many Open Source ML libraries either:
  • Lack Community
  • Lack Documentation
  • Lack Scalability
  • Or are research-oriented only


Support Machine Learning


But Must Scale & Perform

• Be as fast as possible

• Scale to as much data as possible


But Must Scale & Perform

• Be as fast as possible, given the algorithm's intrinsic complexity!

• Which algorithms are expressible as MapReduce jobs?

• Work in progress...
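
To make the question of what is expressible as map-reduce jobs concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases. It does not use Hadoop or Mahout; the helper names (map_phase, shuffle, reduce_phase) and the rating records are hypothetical, chosen only for illustration. The example computes an average rating per item, a computation that decomposes cleanly because each key can be reduced independently.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, mimicking what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Run the reducer once per key over all values collected for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical job: average rating per item from (user, item, rating) records.
def rating_mapper(record):
    _user, item, rating = record
    # Emit a partial (sum, count) so partial results could also be combined.
    return [(item, (rating, 1))]


def mean_reducer(_key, values):
    total = sum(rating for rating, _ in values)
    count = sum(n for _, n in values)
    return total / count


ratings = [("u1", "A", 4.0), ("u2", "A", 2.0), ("u1", "B", 5.0)]
item_means = reduce_phase(shuffle(map_phase(ratings, rating_mapper)), mean_reducer)
print(item_means)
```

Algorithms that need many passes over the data (for example iterative clustering) would have to chain several such jobs, which is one reason the fit with map-reduce varies by algorithm.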


C1: Collaborative Filtering


C2: Clustering

• Group similar objects together

• K-Means, Fuzzy K-Means, Density-Based, …

• Different distance measures
  • Manhattan, Euclidean, …
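
The two distance measures named above, and how a clustering algorithm uses them, can be sketched in a few lines. The code below is plain Python, not Mahout's implementation; the function kmeans_step and the sample points are assumptions made for illustration. It performs a single K-Means iteration: assign each point to its nearest centroid, then move each centroid to the mean of its members.

```python
import math


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans_step(points, centroids, distance=euclidean):
    """One K-Means iteration: assignment step followed by centroid update."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
        clusters[nearest].append(p)

    new_centroids = []
    for old, members in zip(centroids, clusters):
        if members:
            dim = len(old)
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        else:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
    return new_centroids, clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids, clusters = kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Passing distance=manhattan swaps the measure without touching the rest of the step. A full run would repeat kmeans_step until the centroids stop moving.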


C3: Classification


FPM: Frequent Pattern Mining

• Find the frequent itemsets
  • <milk, bread, cheese> are sold frequently together

• Very common in market analysis, access pattern analysis, etc.
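
As an illustration of what "frequent itemsets" means, the sketch below counts how many baskets contain each itemset and keeps those that reach a minimum support. This is a brute-force enumeration for small inputs, not the algorithm a scalable library would use; the function name frequent_itemsets, the max_size cap, and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return itemsets (as sorted tuples) that occur in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix an order within the basket
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


baskets = [
    ["milk", "bread", "cheese"],
    ["milk", "bread", "cheese", "eggs"],
    ["bread", "cheese"],
    ["milk", "eggs"],
]
frequent = frequent_itemsets(baskets, min_support=2)
print(frequent[("bread", "cheese", "milk")])
```

Enumerating every subset grows exponentially with basket size, which is why practical miners prune candidates or build compact tree structures instead.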


Matrices and Statistics

• Math libraries
  • Vectors, matrices, etc.

• Noise reduction

• Similarity Functions
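
The slide does not say which similarity functions are meant. As an assumption, the sketch below shows two commonly used ones: cosine similarity for numeric vectors and Jaccard similarity for sets. Both are written in plain Python with hypothetical function names and are not taken from any library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


print(cosine_similarity((1, 0), (0, 1)))
print(jaccard_similarity({"a", "b"}, {"b", "c"}))
```

Cosine similarity ignores vector length, so it suits comparing rating or term-count profiles of different sizes; Jaccard suits comparing sets such as the items two users have bought.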


Apache Mahout

• http://mahout.apache.org/

