Orchestrating the Intelligent Web with Apache Mahout

Presented by Aneesha Bakharia

Twitter: aneesha

Email: aneesha.bakharia@gmail.com

What is Apache Mahout?

• Open source
• Machine Learning Java library
• Scalable (Apache Hadoop)
• Framework for developing, testing and deploying large-scale algorithms

http://mahout.apache.org/

What’s in a Name?

• Mahout is Hindi for Elephant Driver

What is Apache Mahout?

• Framework
– Vector Math/Matrices (eg SVD)
– Collections
– Hadoop

• Algorithms
– Classification, Clustering, etc

• Your Application???
– You can orchestrate the intelligent web!!!

A New Breed of Developer

• Key Skills
– Databases
– Programming
– Networking
– Security

• …but now also
– distributed data processing is fast becoming an essential part of the developer’s toolbox.

You never know where you will use Probability and Statistics!!!!

Video snippet from Equilibrium: http://en.wikipedia.org/wiki/Equilibrium_%28film%29

You never know what you will discover!!!!

Where people swear in the United States?

http://flowingdata.com/2011/01/25/where-people-swear-in-the-united-states/

Algorithms in Apache Mahout

• Recommendation (collaborative filtering)
• Clustering
• Classification
• Evolutionary Algorithms

Algorithms in Apache Mahout

• Top 10 algorithms in data mining

Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.

Already supported: k-Means, Apriori (fp-growth), kNN, Naive Bayes, SVM (coming)

Requirements

• Java 1.6
java -version

• Maven 2.2
mvn --version

• Hadoop 0.20

Running Mahout

• Command line launcher
bin/mahout (This shows the list of algorithms)

Valid program names are:
canopy: : Canopy clustering
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
dirichlet: : Dirichlet Clustering
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
ldatopics: : LDA Print Topics
lucene.vector: : Generate Vectors from a Lucene index
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
recommenditembased: : Compute recommendations using item-based collaborative filtering
…

Running Mahout

• Run any algorithm, eg kmeans, locally
bin/mahout kmeans --help

Job-Specific Options:
--input (-i) input
--output (-o) output
--distanceMeasure (-dm) eg SquaredEuclidean
--numClusters (-k) k

Running Mahout

• Scale out
Runs on the cluster as per the conf files in the Hadoop directory

• export HADOOP_HOME=/pathto/hadoop-0.20.2/

• Need to use the driver classes
KMeansDriver.runJob(Path input, Path output ...)

Clustering

• Unsupervised Machine Learning technique
• Organise items into clusters/groups based upon similarity
• Good for finding patterns and exploring data

Clustering

• Lots of Algorithms:
k-means, Fuzzy K-means, Mean Shift, Canopy, Dirichlet Process, Latent Dirichlet Allocation

• Similarity Distance Measures
– Euclidean
– Cosine
– Tanimoto
– Manhattan
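The four distance measures named above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not Mahout's own implementation; the function names are made up for this example, and the vectors are assumed to be non-zero and of equal length.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b / (|a|^2 + |b|^2 - a.b)); assumes the vectors are not both zero.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Cosine and Tanimoto ignore overall vector length to different degrees, which is why they are often preferred for text, where long and short documents on the same subject should still look similar.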

Vectors

• Documents
Bag of words
word1 => 10
word2 => 2
word3 => 4
Resulting vector [10.0, 2.0, 4.0, .... ]
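A minimal Python sketch of the bag-of-words idea above, assuming whitespace tokenization and a fixed, known vocabulary (both are simplifications for illustration; `bag_of_words` is a hypothetical helper, not a Mahout call):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

vocabulary = ["word1", "word2", "word3"]
document = "word1 " * 10 + "word2 " * 2 + "word3 " * 4
vector = bag_of_words(document, vocabulary)
print(vector)
```

The position of each number in the vector is fixed by the vocabulary, so every document maps to a vector of the same length and documents can be compared with a distance measure.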

Range of Vectorization Tools

• Collate multiple words (n-grams)
• Normalization
• TF-IDF
• Stop word removal
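To make these steps concrete, here is a rough Python sketch of stop word removal, bigram generation and TF-IDF weighting. The stop word list is a tiny made-up one, and the weighting is the plain tf × log(N/df) variant, which is only one of several common TF-IDF formulas and not necessarily the one Mahout applies. Normalization is left out to keep the example short.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}  # hypothetical, deliberately tiny

def tokenize(text):
    # Lower-case, split on whitespace, drop stop words.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def bigrams(tokens):
    # Collate adjacent words into 2-grams.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(docs):
    # Returns one {term: weight} dict per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

docs = ["the elephant is big", "the elephant driver", "a big data cluster"]
weights = tf_idf(docs)
```

A term that appears in only one document ("driver") gets a higher weight than one that appears in several ("elephant"), which is the point of the IDF factor.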

kmeans Example

• Set of text files in a directory
• Use seqdirectory to convert the files to SequenceFiles
bin/mahout seqdirectory -i <input> -o <seq-output>
• Use seq2sparse to convert to sparse vectors
bin/mahout seq2sparse -i <seq-output> -o <vector-output>
• Run kmeans with k=5
bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 5
• View output
bin/mahout clusterdump
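What the kmeans step does to the vectors can be sketched in plain Python. This is a single-machine illustration of the basic k-means loop, not Mahout's distributed implementation; it seeds the centroids with the first k points purely to keep the example deterministic, and the sample points are invented.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    # Component-wise average of a non-empty list of points.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, max_iterations=10):
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k points
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignments

points = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [8.0, 9.0], [2.0, 1.0], [9.0, 8.0]]
centroids, assignments = kmeans(points, k=2)
```

The result depends on both k and the distance measure, which is why both have to be chosen with the data in mind.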

Easy enough, but

• How do you know k?
• Data Exploration is required to find the
– k for your purposes
– Similarity distance for your purpose

• Role for the Data Scientist
– Explore, Model, Test and Evaluate

Recommender Engines

• The technique you encounter the most
• Recommend products (books, movies, etc) based upon past actions
• Infer tastes and preferences to identify unknown items of interest

Recommendation

• Algorithms: user-based and item-based recommendation

• Framework for storage, online and offline computation

• Similarity Measures
– Cosine
– Tanimoto
– Pearson
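As a rough illustration of user-based recommendation with Pearson similarity, the sketch below scores the items a user has not rated by the similarity-weighted ratings of other users. The ratings, user names and function names are all invented for this example, and it does not use Mahout's recommender classes.

```python
import math

ratings = {  # hypothetical user -> {item: rating} data
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob": {"book_a": 4.0, "book_b": 2.0, "book_c": 3.0, "book_d": 5.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_d": 2.0},
}

def pearson(prefs_a, prefs_b):
    # Pearson correlation over the items both users have rated.
    shared = [item for item in prefs_a if item in prefs_b]
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(prefs_a[i] for i in shared) / n
    mean_b = sum(prefs_b[i] for i in shared) / n
    cov = sum((prefs_a[i] - mean_a) * (prefs_b[i] - mean_b) for i in shared)
    var_a = sum((prefs_a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((prefs_b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def recommend(user, ratings):
    # Weighted average of other users' ratings, using only positively
    # correlated users, for items the target user has not rated yet.
    scores, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], prefs)
        if sim <= 0:
            continue
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = [(item, scores[item] / weights[item]) for item in scores]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

print(recommend("alice", ratings))
```

Here alice's tastes track bob's and run opposite to carol's, so only bob's rating of the unseen item contributes to the recommendation.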

Frequent Pattern Mining

• Discover interesting patterns based upon how items occur in a sequence

• Example
Sales Transactions
(Bread, Milk and Eggs)
(Nappies, Beer)

• Parallel FPGrowth Algorithm
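The idea of finding items that are frequently bought together can be shown with a brute-force pair count in Python. This is not FP-Growth (which avoids enumerating every candidate and scales far better); it only illustrates what "frequent pattern" and "minimum support" mean. The transactions extend the example above with made-up baskets.

```python
from collections import Counter
from itertools import combinations

transactions = [  # hypothetical baskets
    {"bread", "milk", "eggs"},
    {"nappies", "beer"},
    {"bread", "milk"},
    {"nappies", "beer", "milk"},
]

def frequent_pairs(transactions, min_support):
    # Count every pair of items that appears together in a basket and
    # keep the pairs seen in at least min_support baskets.
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(transactions, min_support=2)
print(pairs)
```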

Classification

• Set of classes/categories (observed pattern)
• Decide if a new input matches a category
• Supervised technique – need training
• Eg spam or not

Classification

• Algorithms: Naive Bayes, Random Forest (decision trees), SVM (coming)

• Learn a model from a manually labelled training dataset

• Predict the class of an unseen object based on its features
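The learn-then-predict cycle can be sketched with a small multinomial Naive Bayes classifier in plain Python. It uses add-one (Laplace) smoothing and ignores words never seen in training; the training messages and labels are invented, and this is an illustration of the technique rather than Mahout's Naive Bayes code.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    # Collect per-class document counts and per-class word counts.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(text, model):
    # Pick the class with the highest log prior + summed log likelihoods.
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # skip words never seen during training
            count = word_counts[label][token] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [  # hypothetical manually labelled examples
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training)
print(predict("cheap money", model))
```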

Latent Dirichlet Allocation

– Convert text to term-document matrix
– LDA produces
• word-theme mapping
• theme-document mapping
• Allows topic overlap
– Need to specify number of Topics (k)

• Tweet 1
• Tweet 2
• Tweet 3

Term-Document Matrix

        Word 1   Word 2   Word n
Doc 1   1        0        2
Doc 2   0        1        0
Doc 3   0        1        1

• LDA (specify the number of themes, k)

Topic to Word Mapping

          Word 1   Word 2   Word n
Topic 1   0.5      0        1
Topic 2   0        0.5      0

Tweet to Topic Mapping

        Topic 1   Topic 2
Doc 1   1         0
Doc 2   0         1
Doc 3   0         1

– Run LDA
bin/mahout lda --input <PATH> --output <PATH> --numTopics 20

– View Topics
bin/mahout ldatopics --input <PATH> --output <PATH> --dictionaryType sequencefile

Suggesting Twitter Lists

– Twitter introduced Lists, which group people you follow so you can see only their timeline of tweets

– Build an application that could recommend people that should be grouped in the same list.

– Use LDA because it allows overlapping list membership. This is great because people talk about multiple topics.

– Twitter API Tasks
• Get list of people that a user follows
• Retrieve tweets for each person
• Save Lists back to Twitter

– Data Processing
• Combine all tweets for a person
• Remove stop words
• Stem words
• Create a user-word matrix

– Web UI
• Authenticate to Twitter
• Display suggested lists (based on estimate of k)
(Could also display the important tweets that place the person in the group?)
• Allow users to change k, ie decide on the number of Lists
• Allow group re-organisation with jquery sortables
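The data-processing steps listed above could look roughly like this in Python. Everything here is an assumption for illustration: the tweets are invented and passed in as plain strings (no Twitter API calls), the stop word list is tiny, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "to", "of", "about", "my", "new"}  # hypothetical

def crude_stem(word):
    # Naive stand-in for a real stemmer: strip a trailing "ing" or "s".
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def user_word_matrix(tweets_by_user):
    # Combine each user's tweets, drop stop words, stem, then count.
    counts = {}
    for user, tweets in tweets_by_user.items():
        tokens = " ".join(tweets).lower().split()
        counts[user] = Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)
    # One column per distinct stemmed word, in sorted order.
    vocabulary = sorted({w for c in counts.values() for w in c})
    matrix = {user: [c[w] for w in vocabulary] for user, c in counts.items()}
    return vocabulary, matrix

tweets_by_user = {  # hypothetical users and tweets
    "user_a": ["Clustering tweets", "hadoop clusters"],
    "user_b": ["cooking cakes", "new cake"],
}
vocabulary, matrix = user_word_matrix(tweets_by_user)
print(vocabulary)
print(matrix)
```

Each row of the resulting matrix plays the role of one document in the LDA input, so the topics LDA finds become the suggested lists.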

Gently Getting into Machine Learning and Data Mining

• Programming Collective Intelligence by Toby Segaran

• Mahout in Action by Owen, Anil, Dunning and Friedman

Summary

• Mahout offers good abstraction for building intelligent web applications

• Skills in data analysis and exploration are now more important than ever

• Mahout is a good platform for distributed algorithm development

Fascinating Algorithms

• My Top 3 algorithms
– Some interesting and some disturbing and interesting at the same time

• No 3 – Identifying Manipulated Images
http://www.technologyreview.com/computing/20423/page1/

• No 2 – Seam Carving
Content Aware Resizing
Example http://swieskowski.net/carve/

Disturbing Algorithms

• No 1 – Digital Face Beautification
http://leyvand.com/research/beautification/dfb_sketch.pdf

Image from Shrek Copyright Dreamworks

Discussion/Questions

• What will you build?
