
Introducing Apache Mahout

Date post: 14-Feb-2016
Page 1: Introducing Apache Mahout

Introducing Apache Mahout

Scalable Machine Learning for All!

Grant Ingersoll

Page 2: Introducing Apache Mahout

Agenda
• What is Machine Learning?
  – Definitions
  – Types
  – Applications
• Mahout
  – What?
  – Why?
  – How?
  – Who?

Page 3: Introducing Apache Mahout

What is Machine Learning?

[Image: http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg]

Or?

[Image: http://en.wikipedia.org/wiki/Image:Hal-9000.jpg]

NOT!

Page 4: Introducing Apache Mahout

How about?

Google News

Page 5: Introducing Apache Mahout

Or?

Amazon.com

Page 6: Introducing Apache Mahout

Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  – Introduction to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
  – Draws on many other fields: comp sci., biology, math, psychology, etc.

Page 7: Introducing Apache Mahout

Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
  – People still can help

Page 8: Introducing Apache Mahout

Types
• Supervised
  – Using labeled training data, create function that predicts output of unseen inputs
• Unsupervised
  – Using unlabeled data, create function that predicts output
• Semi-Supervised
  – Uses labeled and unlabeled data
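The supervised case can be illustrated with a minimal sketch in plain Python (not Mahout code). It assumes a hypothetical toy data set of 2-D points labeled "small" or "large" and uses a 1-nearest-neighbor rule as the learned function; the names and data are illustrative only.

```python
from math import dist

def nearest_neighbor_predict(training, point):
    """Predict a label for `point` by copying the label of the closest
    labeled training example (1-nearest-neighbor)."""
    _features, label = min(training, key=lambda example: dist(example[0], point))
    return label

# Hypothetical labeled training data: (features, label) pairs.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

# An unseen input is assigned the label of its nearest labeled neighbor.
prediction = nearest_neighbor_predict(training, (7.5, 8.0))
```

An unsupervised method would receive the same points without the labels and would have to discover the two groups on its own.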

Page 9: Introducing Apache Mahout

Classification/Categorization
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy

Page 10: Introducing Apache Mahout

Clustering
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses
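One common way to find such groupings is k-means. The following is a small single-machine sketch in plain Python, not Mahout's distributed implementation; the points and the choice of initial centroids are assumptions made for illustration.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, and repeat."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    # `clusters` holds the assignment made in the final iteration.
    return centroids, clusters

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=(points[0], points[3]))
```

The initial centroids are passed in explicitly so the run is deterministic; real implementations usually pick them randomly or with a seeding step.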

Page 11: Introducing Apache Mahout

Collaborative Filtering
• Recommend people and products
  – User-User
    • User likes X, you might too
  – Item-Item
    • People who bought X also bought Y
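The item-item idea ("people who bought X also bought Y") can be sketched by counting co-occurrences. This is plain Python with hypothetical purchase baskets, not the Taste API; production recommenders normally use a similarity measure rather than raw counts.

```python
from collections import Counter

def also_bought(baskets, item, top_n=3):
    """Return the items that most often appear in the same basket as `item`."""
    co_counts = Counter()
    for basket in baskets:
        if item in basket:
            co_counts.update(other for other in basket if other != item)
    return [other for other, _count in co_counts.most_common(top_n)]

# Hypothetical purchase history: one set of items per customer.
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"lamp", "desk"},
    {"book", "lamp"},
]

suggestions = also_bought(baskets, "book")
```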

Page 12: Introducing Apache Mahout

Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking

Page 13: Introducing Apache Mahout

Other
• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others

Page 14: Introducing Apache Mahout

What is Apache Mahout?
• A Mahout is an elephant trainer/driver/keeper, hence…

[Image] + Machine Learning =

(and other distributed techniques)

Page 15: Introducing Apache Mahout

What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Thus, Mahout’s Goal is:
  – Scalable Machine Learning with Apache License
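The Map/Reduce model that Hadoop provides can be sketched on a single machine. The example below is plain Python, not the Hadoop API: the function names are illustrative, and the shuffle step that Hadoop performs across the cluster is simulated with a dictionary.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["mahout rides hadoop", "hadoop scales"])
```

Because each map call sees only one record and each reduce call sees only one key, both phases can run in parallel on many machines, which is where the scalability comes from.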

Page 16: Introducing Apache Mahout

Why Mahout?
• Many Open Source ML libraries either:
  – Lack Community
  – Lack Documentation and Examples
  – Lack Scalability
  – Lack the Apache License ;-)
  – Or are research-oriented
• Personal: Learn more ML
• Intelligent Apps are the Present and Future
  – See the Hadoop talks tomorrow and Friday!
• Goal: Overcome gaps the Apache Way!

Page 17: Introducing Apache Mahout

Current Status
• Close to Initial release
  – Focused on examples, docs, bug fixes
• What’s in it:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    • Canopy/K-Means/Fuzzy K-Means/Mean-shift
  – Classifiers
    • Naïve Bayes
    • Complementary NB
  – Evolutionary
    • Integration with Watchmaker for fitness function

Page 18: Introducing Apache Mahout

How?
• Examples
  – Taste
  – Clustering
  – Classification
  – Evolutionary

Page 19: Introducing Apache Mahout

Taste: Movie Recommendations

• Given ratings by users of movies, recommend other movies

• http://lucene.apache.org/mahout/taste.html#demo
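The task on this slide, recommending movies from user ratings, can be sketched with a user-based approach. This is a plain Python illustration, not the Taste API: the ratings are hypothetical, and the similarity measure (inverse Euclidean distance over co-rated movies) is one simple choice among many.

```python
from math import sqrt

def similarity(a, b):
    """Similarity of two users' rating dicts: 1 / (1 + Euclidean distance)
    over the movies both have rated; 0 if they share none."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((a[m] - b[m]) ** 2 for m in shared)))

def recommend(ratings, user):
    """Rank movies the user has not rated by the similarity-weighted sum
    of other users' ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + weight * rating
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Alien": 5, "Blade Runner": 4},
    "bob": {"Alien": 5, "Blade Runner": 5, "Brazil": 4},
    "carol": {"Alien": 1, "Blade Runner": 2, "Amelie": 5},
}

recommended = recommend(ratings, "alice")
```

Bob's tastes are closer to Alice's than Carol's are, so the movie only Bob has rated outranks the one only Carol has rated.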

Page 20: Introducing Apache Mahout

Clustering: Synthetic Control Data

• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples
  – o.a.mahout.clustering.syntheticcontrol.*

• Outputs clusters…

Page 21: Introducing Apache Mahout

Classification: NB and CNB Examples

• 20 Newsgroups
  – http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
  – http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
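For readers new to Naïve Bayes, here is a minimal single-machine sketch of a multinomial Naïve Bayes text classifier with add-one smoothing. It is plain Python with a hypothetical four-document training set, not Mahout's implementation, and it does not cover the Complementary NB variant.

```python
from collections import Counter, defaultdict
from math import log

def train(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(model, text):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())

    def score(label):
        total_words = sum(word_counts[label].values())
        result = log(class_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            result += log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        return result

    return max(class_counts, key=score)

# Hypothetical training data.
model = train([
    ("win cash prize now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
])

label = classify(model, "cash prize offer")
```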

Page 22: Introducing Apache Mahout

Evolutionary
• Traveling Salesman
  – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
  – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
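The evolutionary approach to the Traveling Salesman problem can be sketched as follows. This is plain Python, not the Watchmaker or Mahout API: the city coordinates, the swap mutation, and the keep-the-best-half selection are all assumptions chosen to keep the example short. The fitness function here is simply the tour length (shorter is fitter).

```python
import random
from math import dist

def tour_length(tour, cities):
    """Fitness: total length of the closed tour visiting cities in order."""
    return sum(
        dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def mutate(tour, rng):
    """Return a copy of the tour with two randomly chosen cities swapped."""
    child = tour[:]
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def evolve(cities, generations=200, population_size=20, seed=0):
    rng = random.Random(seed)
    base = list(range(len(cities)))
    # Start from the identity tour plus random permutations.
    population = [base] + [rng.sample(base, len(base)) for _ in range(population_size - 1)]
    for _ in range(generations):
        # Selection: keep the fitter half, then refill with mutated copies.
        population.sort(key=lambda t: tour_length(t, cities))
        survivors = population[: population_size // 2]
        population = survivors + [mutate(t, rng) for t in survivors]
    return min(population, key=lambda t: tour_length(t, cities))

# Hypothetical city coordinates.
cities = [(0, 0), (5, 1), (1, 4), (6, 5), (2, 2), (4, 6), (3, 0)]
best_tour = evolve(cities)
```

Because survivors are carried over unchanged, the best tour never gets worse from one generation to the next, although the search can still stall in a local optimum.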

Page 23: Introducing Apache Mahout

What’s Next?
• Release 0.1!
• Shared Amazon Images (others?)
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• HBase and HAMA support
• Normalize I/O format for data
• Solr Integration (SOLR-769)
• Other Algorithms: SVM, Linear Regression, etc.

Page 24: Introducing Apache Mahout

When, Where, Who
• When? Now!
  – Mahout is growing
• Who? You!
  – We want Java programmers who:
    • Are comfortable with math
    • Like to work on large, hard problems
• Where?
  – http://lucene.apache.org/mahout
  – http://cwiki.apache.org/MAHOUT
  – mahout-{user|dev}@lucene.apache.org

Page 25: Introducing Apache Mahout

Resources
• “Programming Collective Intelligence” by Toby Segaran
• “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
• Hadoop - http://hadoop.apache.org
• http://mloss.org/software/

