Foundations of Data Science with Spark
July 16, 2015
@ksankar // doubleclix.wordpress.com
www.globalbigdataconference.com
Twitter : @bigdataconf
o Intro & Setup [8:00-8:20)
  • Goals/non-goals
o Spark & Data Science DevOps [8:20-8:40)
  • Spark in the context of Data Science
o Where Exactly is Apache Spark headed ? [8:40-9:30)
  • Spark Yesterday, Today & Tomorrow
  • Spark Stack
o Break [9:30-10:00)
o DataFrames for the Data Scientist [10:00-11:30)
  • pySpark Classes
  • Walkthru DataFrames
  • Hands-on Notebooks
o [15] Discussions/Slack (11:30-11:45)
Agenda : Introduction To Spark
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
o Review (2:00-2:30)
  • 004-Orders-Homework-Solution
  • MLlib Statistical Toolbox
  • Summary, Correlations
o [20] Linear Regression (2:30-2:45)
o [20] "Mood Of the Union" (2:45-3:15)
  • State of the Union w/ Washington, Lincoln, FDR, JFK, Clinton, Bush & Obama
  • Map reduce, parse text
o Break (3:15-3:30)
o [60] Predicting Survivors with Classification (3:30-4:30)
  • Decision Trees
  • NaiveBayes (Titanic data set)
o Break (4:30-4:45)
o [20] Clustering (4:45-5:05)
  • K-means for GallacticHoppers !
o [20] Recommendation Engine (5:05-5:25)
  • Collab Filtering w/ MovieLens
o [15] Discussions/Slack (5:45-6:00)
Agenda : Data Wrangling w/ DataFrames & MLlib
http://globalbigdataconference.com/52/santa-clara/big-data-developer-conference/schedule.html
Goals & non-goals
Goals
¤ Understand how to program Machine Learning with Spark & Python
¤ Focus on programming & ML application
¤ Give you focused time to work through the examples
  § Work with me. I will wait if you want to catch up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
  § The programs can be optimized
Non-goals
¡ Go deep into the algorithms
  • We don't have sufficient time. The topic could easily fill a 5-day tutorial !
¡ Dive into Spark internals
  • That is for another day
¡ The underlying computation, communication, constraints & distribution is a fascinating subject
  • Paco does a good job explaining it
¡ A passive talk
  • Nope. Interactive & hands-on
About Me
o Data Scientist
  • Decision Data Science & Product Data Science
  • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
  • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA'15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, PyData [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Full-day Spark workshop "Advanced Data Science w/ Spark" @ Spark Summit East '15 [https://goo.gl/7SBKTC]
o Co-author : "Fast Data Processing with Spark", Packt Publishing [http://goo.gl/eNtXpT]
o Reviewer : "Machine Learning with Spark", Packt Publishing
o Have done lots of things :
  • Big Data (Retail, Bioinformatics, Financial, AdTech); starting MS-CFRM, University of WA
  • Written books (Web 2.0, Wireless, Java, …), standards work, some work in AI
  • Guest Lecturer at Naval PG School, …
o Volunteer as Robotics Judge at First Lego League World Competitions
o @ksankar, doubleclix.wordpress.com
Close Encounters
o 1st ◦ This Tutorial
o 2nd ◦ Do More Hands-on Walkthroughs
o 3rd ◦ Listen To Lectures ◦ More competitions …
Spark Installation
o Install Spark 1.4.1 on your local machine
  • https://spark.apache.org/downloads.html
  • Pre-built For Hadoop 2.6 is fine
  • Download & uncompress
  • Remember the path & use it wherever you see /usr/local/spark/
  • I have downloaded into /usr/local & have a softlink spark to the latest version
o Install iPython
Tutorial Materials
o Github : https://github.com/xsankar/global-bd-conf
  • Clone or download the zip
o Open a terminal
o cd ~/global-bd-conf
o IPYTHON=1 IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.0.3
o Notes :
  • I have a soft link "spark" in my /usr/local that points to the Spark version that I use, for example : ln -s spark-1.4.1/ spark
o Click on the iPython dashboard
o Run 000-PreFlightCheck.ipynb
o Run 001-TestSparkCSV.ipynb
o Now you are ready for the workshop !
Spark & Data Science DevOps
8:20
Spark in the context of data science
Data Science :
The art of building a model with known knowns, which when let loose, works with unknown unknowns!
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
[Figure: a 2x2 "knowns" matrix - what You know vs. what the World knows]
o Unknown Knowns : others know, you don't - closing this gap is what we do
o Known Unknowns : potential facts & outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o Unknown Unknowns : facts, outcomes or scenarios we have not encountered, nor considered; "black swans", outliers, long tails of probability distributions; lack of experience, imagination
"There are known knowns - there are things we know that we know. There are known unknowns - that is to say, there are things that we now know we don't know. But there are also unknown unknowns - there are things we do not know we don't know."
The curious case of the Data Scientist
o A Data Scientist is multi-faceted & contextual
o A Data Scientist should be building Data Products
o A Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier ! – Titus Brown
Data Science - Context
[Figure: the data science pipeline - Collect, Store & Transform (Data Management) feeding Reason, Model & Deploy (Data Science), surfaced via Explore, Visualize, Recommend & Predict]
o Collect : Volume; Velocity; Streaming Data
o Store : Canonical form; Data catalog; Data Fabric across the organization; Access to multiple sources of data; Think hybrid - Big Data apps, appliances & infrastructure
o Transform : Metadata; Monitor counters & metrics; Structured vs. multi-structured
o Reason : Flexible & selectable data subsets & attribute sets
o Model : Refine the model with extended data subsets & engineered attribute sets; Validation run across a larger data set
o Deploy : Scalable model deployment; Big Data automation & purpose-built appliances (soft/hard); Manage SLAs & response times
o Data Science layer : Dynamic data sets; 2-way key-value tagging of datasets; Extended attribute sets; Advanced Analytics
o Explore, Visualize, Recommend, Predict : Performance; Scalability; Refresh latency; In-memory analytics; Advanced visualization; Interactive dashboards; Map overlays; Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
Volume, Velocity, Variety
Data Science - Context
Context, Connectedness, Intelligence, Interface, Inference
"Data of unusual size" that can't be brute forced
o The Three Amigos
  • Interface = Cognition
  • Intelligence = Compute (CPU) & Computational (GPU)
  • Inference = Infer significance & causality
Day in the life of a (super) Model
[Figure: the model lifecycle - Data Representation (Feature Selection, Dimensionality Reduction) plus Algorithms, Parameters & Attributes feed Model Selection and Reason & Learn; Model Assessment closes the loop; scoring data flows through the Models out to the Interface layer (Visualize, Recommend, Explore), spanning Intelligence & Inference]
Data Science Maturity Model & Spark
Stages : Isolated Analytics → Integrated Analytics → Aggregated Analytics → Automated Analytics
o Data : Small data → Larger data set → Big Data → Big Data factory model
o Context : Local → Domain → Cross-domain + external → Cross-domain + external
o Model, Reason & Deploy :
  • Isolated : single set of boxes, usually owned by the model builders; departmental
  • Integrated : deploy on central analytics infrastructure; models still owned & operated by modelers; partly enterprise-wide
  • Aggregated : central analytics infrastructure; model & reason by model builders; deploy & operate by ops; residuals and other metrics monitored by modelers; enterprise-wide
  • Automated : distributed analytics infrastructure; AI-augmented models; model & reason by model builders; deploy & operate by ops; data as a monetized service, extending to ecosystem partners
o Interface : Reports → Dashboards → Dashboards + some APIs → Dashboards + well-defined APIs + programming models
o Type : Descriptive & Reactive → + Predictive → + Adaptive → Adaptive
o Datasets : All in the same box → Fixed data sets, usually in temp data spaces → Flexible data & attribute sets → Dynamic datasets with well-defined refresh policies
o Workload : Skunk works → Business-relevant apps with approximate SLAs → High-performance appliance clusters → Appliances and clusters for multiple workloads including real-time apps, plus infrastructure for emerging technologies
o Strategy : Informal definitions → Data definitions buried in the analytics models → Some data definitions → Data catalogue, metadata & annotations; Big Data MDM strategy
The Sense & Sensibility of a Data Scientist DevOps
Factory = Operational
Lab = Investigative
http://doubleclix.wordpress.com/2014/05/11/the-sense-sensibility-of-a-data-scientist-devops/
Where exactly is Apache Spark headed ?
Spark Yesterday, Today & Tomorrow …
“Unified engine across diverse data sources, workloads & environments”
8:40
http://free-stock-illustration.com/winding+road+clip+art
Spark 1.x
• A fast engine for big data processing - fast to run code & fast to write code
• In-memory computation graphs, compatibility with the Hadoop ecosystem and a very usable API
• Iterative & interactive apps that operate on data multiple times - not a good use case for Hadoop
Spark 1.3 & beyond has been the catalyst for a renaissance in Data Science !
Spark 1.4+
• Multi-pass analytics - ML pipelines, GraphX
• Ad-hoc queries - DataFrames
• Real-time stream processing - Spark Streaming
• Parallel machine learning algorithms beyond the basic RDDs
• More types of data sources as input & output
• More integration with R, to span statistical computing beyond "single-node tools"
• More integration with apps like visualization dashboards
• More performance with even larger datasets & complex applications - Project Tungsten
Spark Yesterday, Today & Tomorrow …
Spark Directions
o Data Science
  • DataFrames
  • ML Pipelines
  • SparkR
o Platform APIs
  • Growing the ecosystem
    § Data Sources - uniform access to diverse data sources
    § A pluggable "smart" DataSource API for reading/writing DataFrames while minimizing I/O
    § Spark Packages
    § Deployment utilities for Google Compute, Azure & Job Server
o Streaming, DAG Visualization & Debugging
  • Spark Streaming flow control & optimized state management
o Execution Optimization (Project Tungsten)
  • Focus on CPU efficiency
  • Run-time code generation
  • Cache locality & cache-aware data structures
  • Binary format for aggregations
  • Spark-managed memory
    § Off-heap memory management
Spark - The (simple) Stack
RDD – The workhorse of Core Spark
o Resilient Distributed Datasets
  • A collection that can be operated on in parallel
o Transformations - create RDDs
  • map, filter, …
o Actions - get values
  • collect, take, …
o We will apply these operations during this tutorial, starting with the warm-up below
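A minimal sketch of a transformation/action round trip (runs in the pyspark shell, where sc already exists; the numbers are made up):

  data = sc.parallelize([1, 2, 3, 4, 5])        # create an RDD from a local list
  squares = data.map(lambda x: x * x)           # transformation - lazy, returns a new RDD
  evens = squares.filter(lambda x: x % 2 == 0)  # another transformation
  print evens.collect()                         # action - triggers the computation: [4, 16]
  print squares.take(2)                         # action - first two elements: [1, 4]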
[Figure: the Spark stack - Spark Core beneath Spark SQL, Spark Streaming, SparkR, MLlib, GraphX & Packages; the DataFrame API and ML Pipelines / Advanced Analytics (Neural Networks, Deep Learning, Parameter Server) layered above; language bindings for R, Scala, Java & Python]
Query Optimization-Execution pipeline
[Figure: a SQL query or DataFrame becomes an Unresolved Logical Plan; Analysis (against the Catalog) yields a Logical Plan; Logical Optimization yields an Optimized Logical Plan; Physical Planning yields candidate Physical Plans, ranked by a Cost Model into the Selected Physical Plan; Code Generation emits RDDs]
o Catalyst Optimizer - optimizes the execution plan
o Data Sources - Parquet, Hadoop, Cassandra, JSON, CSV, JDBC, …
o Tungsten Execution
Ref: Spark SQL paper
Spark DataFrames for the Data Scientist
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.
DataFrames ! The Most Massively useful thing a Data Scientist can have …
10:00
Data Science "folk knowledge" (1 of A)
o "If you torture the data long enough, it will confess to anything." - Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It's Generalization that counts
  • The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
  • Induction, not deduction - every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic - one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich, expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
pyspark
o pyspark core : SparkContext(), SparkConf(), RDD(), Broadcast(), Accumulator(), SparkFiles(), StorageLevel()
o pyspark.sql : SQLContext(), DataFrame(), DataFrameNaFunctions(), DataFrameStatFunctions(), DataFrameReader(), DataFrameWriter(), Column(), Row(), functions, types, Window(), WindowSpec(), GroupedData(), HiveContext()
o pyspark.streaming : StreamingContext(), DStream(), kafka, kafka.Broker(), …
o pyspark.mllib : classification, clustering, evaluation, feature, fpm, linalg, random, recommendation, regression, stat, tree, util
o pyspark.ml (ML Pipeline APIs) : Transformer, Estimator, Model, Pipeline, PipelineModel, param, feature, classification, recommendation, regression, tuning, evaluation
1. SparkContext()
2. Read/Write
3. Convert
o pyspark.sql.DataFrame ↔ table :
  • sqlContext.registerDataFrameAsTable(df, "aTable")
  • df2 = sqlContext.table("aTable")
o pyspark.sql.DataFrame ↔ pandas.DataFrame :
  • df = sqlContext.createDataFrame(pandas_df)
  • p_df = df.toPandas()
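Putting the conversions together, a small sketch (assuming the pyspark shell's sqlContext; the pandas frame and its columns are invented):

  import pandas as pd

  p_df = pd.DataFrame({'name': ['a', 'b', 'c'], 'qty': [1, 2, 3]})

  df = sqlContext.createDataFrame(p_df)              # pandas -> Spark DataFrame
  sqlContext.registerDataFrameAsTable(df, "aTable")  # register as a named table
  df2 = sqlContext.table("aTable")                   # table -> DataFrame
  back = df2.toPandas()                              # Spark DataFrame -> pandas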
4. Columns & Rows (1 of 3)
o Select a column
  • by the df["<columnName>"] notation or the df.<columnName> notation
  • The recommended way is df["<columnName>"], the reason being that a column name can collide with a DataFrame method if we use df.<columnName>
o Column-wise operations like +, -, *, /, % (modulo), & and | (&&, || in Scala), <, <=, > and >=
  • e.g. df.withColumn('total', df['price'] * df['qty'])
  • In Scala, the inequality operator is !==, the usual equality operator is === and <=> is an equality test that is safe for null values
o Meta operations - type conversion (cast), alias, not null, …
  • df_cars.mpg.cast("double").alias('mpg')
o Run arbitrary UDFs on a column (see next page)
4. Columns & Rows (2 of 3) : Run arbitrary UDFs on a column
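The original slide is a notebook screenshot; here is a minimal UDF sketch along the same lines (the price column and the conversion rate are made up):

  from pyspark.sql.functions import udf
  from pyspark.sql.types import DoubleType

  to_euros = udf(lambda d: d * 0.9, DoubleType())          # hypothetical conversion
  df2 = df.withColumn('price_eur', to_euros(df['price']))  # adds a derived column
  df2.show(5)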
4. Columns & Rows (3 of 3) : Interesting Operations …
Adding a column…
5. DataFrame : RDD-like Operations
o df.sort(<sort expression>) or df.orderBy(<sort expression>) : returns a sorted DataFrame. There are multiple ways of specifying the sort expression. Use of orderBy is recommended (as the syntax is closer to SQL), for example :
  df_orders_1.groupBy("CustomerID", "Year").count().orderBy('count', ascending=False).show()
o df.filter(<condition>) or df.where(<condition>) : returns a new DataFrame after applying the <condition>. The condition is usually based on a column. Use of the where form is recommended (as the syntax is closer to the SQL world), for example :
  df_orders.where(df_orders['shipCountry'] == 'France')
o df.coalesce(n) : returns a DataFrame with n partitions, same as the coalesce(n) method of RDD
o df.foreach(<function>) : applies a function to all the rows of a DataFrame
o df.map(lambda r: …) : applies the function to all the rows and returns the resulting object
o df.flatMap(lambda r: …) : returns an RDD, flattened, after applying the function to all the rows of the DataFrame
o df.rdd : returns the DataFrame as an RDD of Row objects
o df.na.replace([<values to be replaced>], [<replacing values>], subset=[<columns>]) (also DataFrame.replace() or DataFrameNaFunctions.replace()) : an interesting and very useful function with slightly strange syntax. The recommended form is df.na.replace(), even though the .na namespace throws one off a little. Use subset= for column names. The syntax is different from the Scala syntax.
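A quick combined sketch of these operations (assuming a hypothetical orders DataFrame with CustomerID, Year & shipCountry columns, as in the slides):

  (df_orders.where(df_orders['shipCountry'] == 'France')   # filter
            .groupBy('CustomerID', 'Year')
            .count()
            .orderBy('count', ascending=False)             # sort, most orders first
            .show(10))

  ids = df_orders.rdd.map(lambda row: row.CustomerID)      # drop to the RDD level
  print ids.distinct().count()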
6. DataFrame : Actions
o cache()
o collect(), collectAsList()
o count()
o describe()
o first(), head(), show(), take()
o …
7. DataFrame : Scientific Functions
8. DataFrame : Statistical Functions
The pair-wise frequency (contingency table) of transmission type vs. number of speeds shows interesting observations :
• All automatic cars in the dataset are 3-speed, while most of the manual transmission cars have 4 or 5 speeds
• Almost all the manual cars have 2 barrels, while the automatic cars have 2 and 4 barrels
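The contingency table comes from the df.stat namespace (DataFrameStatFunctions); a sketch with mtcars-style column names assumed (am = transmission, gear = number of speeds):

  df_cars.stat.crosstab('am', 'gear').show()   # pair-wise frequency table

  print df_cars.stat.corr('mpg', 'wt')         # Pearson correlation
  print df_cars.stat.cov('mpg', 'wt')          # covariance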
9. DataFrame : Aggregate Functions
o The pyspark.sql.functions module (and org.apache.spark.sql.functions for Scala) contains the aggregation functions
o There are two types of aggregations : one over column values, and the other over subsets of column values, i.e. values grouped by some other columns
  • pyspark.sql.functions.avg("sales")
  • df.groupBy("year").agg({"sales": "avg"})
o count(), countDistinct()
o first(), last()
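As a sketch (the year and sales columns are assumptions carried over from the bullets above):

  from pyspark.sql import functions as F

  df.select(F.avg('sales'), F.countDistinct('year')).show()   # whole-column aggregates

  df.groupBy('year').agg(F.avg('sales').alias('avg_sales'),   # grouped aggregates
                         F.max('sales').alias('max_sales')).show()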
10. DataFrame : na
o One of the tenets of big data and data science is that data is never fully clean - while we can handle types, formats et al, missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success - it is better than dropping the rows
  • We can replace null with 0
  • A better solution is to replace missing numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
  • We could use the mode or median instead of the mean
  • Another good strategy is to infer the missing value from other attributes, i.e. "evidence from multiple fields"
    § For example, the Titanic data has a name field; for imputing the missing age field, we could use the Mr., Master., Mrs. & Miss. designations from the name and then fill in the average of the age field from the corresponding designation. So a row with a missing age and "Master." in the name would get the average age of all records with "Master."
    § There are also fields for the number of siblings and the number of spouses. We could average the age based on the values of those fields
    § We could even average the ages from the different strategies
10. DataFrame : na (continued)
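The slide itself is a notebook screenshot; a minimal sketch of the df.na API along those lines (the age column is an assumption):

  clean = df.na.drop()                      # drop rows with any nulls - lossy
  filled = df.na.fill(0, subset=['age'])    # replace nulls with a constant

  # A slightly better imputation: fill with the column mean
  mean_age = df.na.drop(subset=['age']).groupBy().avg('age').collect()[0][0]
  imputed = df.na.fill(mean_age, subset=['age'])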
11. Joins/Set Operations a.k.a. Language Integrated Queries
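The examples on this slide are screenshots; a hedged sketch of the join & set-operation calls (the DataFrame and column names are invented):

  joined = df_orders.join(df_customers,     # inner join on CustomerID
                          df_orders['CustomerID'] == df_customers['CustomerID'],
                          'inner')

  both = df_a.intersect(df_b)       # rows in both
  all_rows = df_a.unionAll(df_b)    # rows in either (with duplicates)
  only_a = df_a.subtract(df_b)      # rows only in df_a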
12. SQL on Tables
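A minimal sketch of SQL over a registered table (reusing the registration call from earlier; the query is made up):

  sqlContext.registerDataFrameAsTable(df_orders, 'orders')
  top = sqlContext.sql("SELECT CustomerID, COUNT(*) AS cnt FROM orders "
                       "GROUP BY CustomerID ORDER BY cnt DESC LIMIT 10")
  top.show()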
Hands-On
o 003-DataFrame-For-DS
  • Understand and run the iPython Notebook
o 004-Orders
  • Homework - we will go through the solution when we meet in the afternoon
Data Wrangling with Spark
2:00
Algorithm spectrum
[Figure: algorithms arranged on a spectrum from Machine Learning through "Cute Math" to Artificial Intelligence]
o Regression, Logit, CART, Ensemble : Random Forest
o Clustering, kNN, Genetic Algorithms, Simulated Annealing
o Collaborative Filtering
o SVM, Kernels
o SVD
o NNet, Boltzmann Machines, Feature Learning
Statistical Toolbox
o Sample data : car mileage data
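The walkthrough lives in the notebook; a minimal sketch of the MLlib statistics toolbox on such data (the two columns, mpg & weight, and the numbers are made up):

  from pyspark.mllib.stat import Statistics

  rdd = sc.parallelize([[21.0, 2.62], [22.8, 2.32], [18.7, 3.44], [18.1, 3.19]])

  summary = Statistics.colStats(rdd)              # column summary statistics
  print summary.mean(), summary.variance()

  print Statistics.corr(rdd, method="pearson")    # pair-wise correlation matrix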
Linear Regression
2:30
Linear Regression - API
o LabeledPoint : the features and label of a data point
o LinearModel : weights, intercept
o LinearRegressionModelBase : predict()
o LinearRegressionModel
o LinearRegressionWithSGD : train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=1.0, regType=None, intercept=False)
o LassoModel : least-squares fit with an l_1 penalty term
o LassoWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
o RidgeRegressionModel : least-squares fit with an l_2 penalty term
o RidgeRegressionWithSGD : train(cls, data, iterations=100, step=1.0, regParam=1.0, miniBatchFraction=1.0, initialWeights=None)
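A runnable sketch of that API on toy data (the tiny dataset and the step size are made up; as the next slides show, the step size matters):

  from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

  data = sc.parallelize([LabeledPoint(2.0, [1.0]),    # label ~ 2 * feature
                         LabeledPoint(4.1, [2.0]),
                         LabeledPoint(6.2, [3.0])])

  model = LinearRegressionWithSGD.train(data, iterations=100, step=0.1)

  preds = data.map(lambda p: (p.label, model.predict(p.features)))
  mse = preds.map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
  print model.weights, model.intercept, mse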
Basic Linear Regression
Use the LR model for prediction & calculate MSE
Step size is important - the model can diverge !
Interesting step size
"Mood Of the Union" with TF-IDF
2:45
Scenario - Mood Of the Union
o It has been said that the State of the Union speech by the President of the USA reflects the social challenges faced by the country
o If so, can we infer the mood of the country by analyzing the SOTU ?
o If we embark on this line of thought, how would we do it with Spark & Python ?
o Is it different from Hadoop MapReduce ?
o Is it better ?
POA (Plan Of Action)
o Collect State of the Union speeches by George Washington, Abe Lincoln, FDR, JFK, Bill Clinton, GW Bush & Barack Obama
o Read the 7 SOTUs from the 7 presidents into 7 RDDs
o Create word vectors
o Transform into word-frequency vectors
o Remove stock common words
o Inspect the top n words to see if they reflect the sentiment of the times
o Compute set differences and see how new words have cropped up
o Compute TF-IDF (homework!)
Lookout for these interesting Spark features
o RDD map-reduce
o How to parse input
o Removing common words
o Sorting an RDD by value
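A minimal word-frequency sketch in that spirit (the file path and the truncated stop-word list are placeholders):

  import re

  lines = sc.textFile('sotu/2009-obama.txt')            # hypothetical path
  stop = set(['the', 'and', 'of', 'to', 'a', 'in'])     # truncated stop list

  counts = (lines.flatMap(lambda l: re.split('\W+', l.lower()))
                 .filter(lambda w: w and w not in stop)
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))      # classic map-reduce

  print counts.sortBy(lambda wc: wc[1], ascending=False).take(10)  # sort by value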
Read & Create word vector
iPython notebook at https://github.com/xsankar/cloaked-ironman
Remove Common Words - 1 of 3
Remove Common Words - 2 of 3
Remove Common Words - 3 of 3
FDR vs. Barack Obama as reflected by SOTU
Barack Obama vs. Bill Clinton
GWB vs. Abe Lincoln as reflected by SOTU
Epilogue
o Interesting exercise
o Highlights
  • Map-reduce in a couple of lines !
  • But it is not exactly the same as Hadoop MapReduce (see the excellent blog by Sean Owen¹)
  • Set differences using subtractByKey
  • Ability to sort a map by values (or any arbitrary function, for that matter)
o To explore as homework :
  • TF-IDF in http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
¹ http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
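For the TF-IDF homework, a hedged sketch using MLlib's feature extraction, per the linked docs (documents is assumed to be an RDD of word lists, e.g. the parsed SOTUs):

  from pyspark.mllib.feature import HashingTF, IDF

  tf = HashingTF().transform(documents)   # hashed term frequencies per document
  tf.cache()
  idf = IDF().fit(tf)                     # inverse document frequencies
  tfidf = idf.transform(tf)               # one TF-IDF vector per document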
Break
3:15
Predicting Survivors with Classification
3:30
Data Science "folk knowledge" (Wisdom of Kaggle) - Jeremy's Axioms
o Iteratively explore the data
o Tools
  • Excel format, Perl, Perl Book, Spark !
o Get your head around the data
  • Pivot tables
o Don't over-complicate
o If people give you data, don't assume that you need to use all of it
o Look at pictures !
o Keep a tab on the history of your submissions
o Don't be afraid to submit simple solutions
  • We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Titanic Passenger Metadata
• Small
• 3 predictors : Class, Sex, Age
• Label : Survived?
Classification - Scenario
o This is a knowledge exercise
o Classify survival from the Titanic data
o Gives us a quick dataset to run & test classification
iPython notebook at https://github.com/xsankar/cloaked-ironman
Classifying Classifiers
[Figure: a taxonomy of classifiers]
o Statistical : Regression, Logistic Regression¹, Naïve Bayes, Bayesian Networks
o Structural
  • Rule-based : Production Rules, Decision Trees
  • Distance-based
    § Functional : Linear, Spectral, Wavelet
    § Nearest Neighbor : kNN, Learning Vector Quantization
  • Neural Networks : Multi-layer Perceptron
o Ensemble : Random Forests, Boosting
o SVM
¹ Max Entropy Classifier
Ref: Algorithms of the Intelligent Web, Marmanis & Babenko
[Figure: classifier concepts - regression for continuous variables vs. classification for categorical variables; Decision Trees, k-NN (Nearest Neighbors), CART; bias/variance, model complexity & over-fitting; boosting & bagging]
Classification - Spark API
o Logistic Regression
o SVMWithSGD
o DecisionTrees
o Data as LabeledPoint (we will see this in a moment)
o DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
o Impurity - "entropy" or "gini"
o maxBins = a control to throttle communication at the expense of accuracy
  • Larger = higher accuracy
  • Smaller = less communication (as the # of bins approaches the number of instances)
o Data-adaptive - i.e. the decision tree samples on the driver and figures out the bin spacing, i.e. the places where you slice for binning
o An intelligent framework - you need this for scale
Lookout for these interesting Spark features
o The concept of LabeledPoint & how to create an RDD of LPs
o Print the tree
o Calculate Accuracy & MSE from RDDs
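A compact sketch of that flow on Titanic-like data (the numeric 0/1 coding of the three predictors is an assumption):

  from pyspark.mllib.regression import LabeledPoint
  from pyspark.mllib.tree import DecisionTree

  # label = survived (0/1); features = [class, sex, age bucket], all numeric
  data = sc.parallelize([LabeledPoint(0.0, [3.0, 0.0, 1.0]),
                         LabeledPoint(1.0, [1.0, 1.0, 2.0]),
                         LabeledPoint(1.0, [2.0, 1.0, 0.0]),
                         LabeledPoint(0.0, [3.0, 0.0, 2.0])])

  model = DecisionTree.trainClassifier(data, numClasses=2,
                                       categoricalFeaturesInfo={},
                                       impurity='gini', maxDepth=4, maxBins=100)
  print model.toDebugString()                       # print the tree

  preds = model.predict(data.map(lambda p: p.features))
  labels_preds = data.map(lambda p: p.label).zip(preds)
  accuracy = labels_preds.filter(lambda lp: lp[0] == lp[1]).count() / float(data.count())
  print accuracy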
Read data & extract features
iPython notebook at https://github.com/xsankar/cloaked-ironman
Create the model
Extract labels & features
Calculate Accuracy & MSE
Use the NaiveBayes Algorithm
Decision Tree - Best Practices
o maxDepth : tune with data/model selection
o maxBins : set low, monitor communications, increase if needed
o # RDD partitions : set to # of cores
  • Usually the recommendation is that RDD partitions should be over-partitioned, i.e. "more partitions than cores" - because tasks take different times, we need to utilize the compute power, and in the end they average out
  • But for machine learning, especially trees, all tasks are approximately equally computationally intensive, so over-partitioning doesn't help
  • Joe Bradley's talk (reference below) has interesting insights
https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, impurity="gini", maxDepth=4, maxBins=100)
Future …
o Actually we should split the data into training & test sets
o Then use different feature sets to see if we can increase the accuracy
o Left as homework
o In 1.2 …
  • Random Forest
    § Bagging
    § PR for Random Forest
  • Boosting
  • Alpine lab Sequoia Forest : coordinating merge
  • Model Selection Pipeline ; Design Doc
Boosting
o Goal : Model Complexity (-), Variance (-), Prediction Accuracy (+)
◦ "Output of weak classifiers into a powerful committee"
◦ Final prediction = weighted majority vote
◦ Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
◦ AdaBoost (Adaptive Boosting)
◦ Boosting vs. Bagging
  § Bagging - independent trees <- Spark shines here
  § Boosting - successively weighted
Random Forests+
o Goal : Model Complexity (-), Variance (-), Prediction Accuracy (+)
◦ Builds a large collection of de-correlated trees & averages them
◦ Improves Bagging by selecting i.i.d.* random variables for splitting
◦ Simpler to train & tune
◦ "Do remarkably well, with very little tuning required" - ESLII
◦ Less susceptible to overfitting (than boosting)
◦ Many RF implementations
  § Original version - Fortran-77 ! By Breiman/Cutler
  § Python, R, Mahout, Weka, Milk (ML toolkit for py), matlab
* i.i.d. - independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Ensemble Methods
o Goal : Model Complexity (-), Variance (-), Prediction Accuracy (+)
◦ Two steps
  § Develop a set of learners
  § Combine the results to develop a composite predictor
◦ Ensemble methods can take the form of :
  § Using different algorithms
  § Using the same algorithm with different settings
  § Assigning different parts of the dataset to different classifiers
◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests
o While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two variables - the no. of predictors (typically √k) & the no. of trees (500 for large datasets, 150 for smaller ones)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB data)
  • Aggregate the OOB predictions
  • Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  • Can use this to search for the optimal # of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction. Can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers
Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective : Berk
A Brief Overview of RF by Dan Steinberg
Why didn't RF do better ? Bias/Variance
o High Bias
  • Due to underfitting
  • Add more features
  • Use a more sophisticated model - quadratic terms, complex equations, …
  • Decrease regularization
o High Variance
  • Due to overfitting
  • Use fewer features
  • Use more training samples
  • Increase regularization
[Figure: learning curves of prediction error vs. training error - "need more features or a more complex model to improve" vs. "need more data to improve"]
"Bias is a learner's tendency to consistently learn the same wrong thing." - Pedro Domingos
Ref: Strata 2013 Tutorial by Olivier Grisel
http://www.slideshare.net/ksankar/data-science-folk-knowledge
Break
4:30
Clustering
4:45
Data Science "folk knowledge" (3 of A)
o More data beats a cleverer algorithm
  • Or conversely, select algorithms that improve with data
  • Don't optimize prematurely without getting more data
o Learn many models, not just one
  • Ensembles ! - change the hypothesis space
  • Netflix prize
  • E.g. Bagging, Boosting, Stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation
o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning - by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755
Scenario - Clustering with Spark
o InterGallactic Airlines has the GallacticHoppers frequent-flyer program & has data about the customers who participate in the program
o The airline execs have a feeling that other airlines will poach their customers if they do not keep their loyal customers happy
o So the business wants to customize promotions for the frequent-flyer program
o Can they just have one type of promotion ?
o Should they have different types of incentives ?
o Who exactly are the customers in the GallacticHoppers program ?
o Recently they have deployed an infrastructure with Spark
o Can Spark help with this business problem ?
Clustering - Theory
o Clustering is unsupervised learning
o While computers can dissect a dataset into "similar" clusters, it still needs human direction & domain knowledge to interpret & guide
o Two types :
  • Centroid-based clustering - k-means clustering
  • Tree-based clustering - hierarchical clustering
o Spark implements Scalable K-Means++
  • Paper : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Lookout for these interesting Spark features
o Application of the Statistics toolbox
o Center & scale RDDs
o Filter RDDs
Clustering - API
o from pyspark.mllib.clustering import KMeans
o KMeans.train
o train(cls, data, k, maxIterations=100, runs=1, initializationMode="k-means||")
o k = number of clusters to create, default=2
o initializationMode = the initialization algorithm. This can be either "random", to choose random points as initial cluster centers, or "k-means||", to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||
o KMeansModel.predict
  • Maps a point to a cluster
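A hedged sketch of that API, including the center-&-scale step the slides call out (the two-column customer data is invented):

  from pyspark.mllib.clustering import KMeans
  from pyspark.mllib.feature import StandardScaler

  # Each row: a member's [flights, miles] - made-up GallacticHoppers data
  raw = sc.parallelize([[10.0, 12000.0], [2.0, 800.0], [55.0, 98000.0],
                        [4.0, 2100.0], [48.0, 76000.0], [12.0, 15000.0]])

  # Center & scale, so the large-valued miles column doesn't dominate the distance
  data = StandardScaler(withMean=True, withStd=True).fit(raw).transform(raw)

  model = KMeans.train(data, k=2, maxIterations=100, runs=10,
                       initializationMode="k-means||")
  print model.clusterCenters
  print data.map(lambda p: model.predict(p)).collect()   # cluster per member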
Data
iPython notebook at https://github.com/xsankar/cloaked-ironman
Read Data & Create RDD
Train & Predict
Calculate error
But the data is not even
So let us center & scale the data and try again
Looks good
Let us try with 5 clusters
Let us map the clusters to our data
Interpretation
[Table: cluster # vs. average feature values & interpretation - contents shown in the notebook]
Note :
• This is just a sample interpretation.
• In real life we would "noodle" over the clusters & tweak them to be useful, interpretable and distinguishable.
• Maybe 3 clusters is more suited to create targeted promotions.
Epilogue
o KMeans in Spark has enough controls
o It does a decent job
o We were able to control the clusters based on our experience (2 clusters is too low, 10 is too high, 5 seems to be right)
o We can see that the Scalable KMeans has control over runs, parallelism et al. (homework : explore the scalability)
o We were able to interpret the results with domain knowledge and arrive at a scheme to solve the business opportunity
o Naturally we would tweak the clusters to fit business viability. 20 clusters with corresponding promotion schemes are unwieldy, even if the WSSE is the minimum.
Recommendation Engine
5:05
Recommendation & Personalization - Spark
[Figure: a progression of recommender sophistication]
o Learning models - fit parameters as the system gets more data
o Dynamic models - model selection based on context
o Automated analytics - let the data tell the story; Feature Learning, AI, Deep Learning
o Types of recommenders :
  • Knowledge-based
  • Demographic-based
  • Content-based
  • Collaborative Filtering
    § Item-based
    § User-based
  • Latent-factor-based
o Inputs : user rating; purchased; looked/not purchased
Spark implements the user-based ALS collaborative filtering
Ref: ALS - Collaborative Filtering for Implicit Feedback Datasets; Yifan Hu, AT&T Labs, Florham Park, NJ; Koren, Y.; Volinsky, C.
ALS-WR - Large-Scale Parallel Collaborative Filtering for the Netflix Prize; Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, Rong Pan
Spark Collaborative Filtering API
o ALS.train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1)
o ALS.trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01)
o MatrixFactorizationModel.predict(self, user, product)
o MatrixFactorizationModel.predictAll(self, usersProducts)
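A minimal ALS sketch on made-up (user, product, rating) triples, in the spirit of the MovieLens walkthrough:

  from pyspark.mllib.recommendation import ALS, Rating

  ratings = sc.parallelize([Rating(1, 10, 4.0), Rating(1, 20, 1.0),
                            Rating(2, 10, 5.0), Rating(2, 30, 3.0)])

  model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01)

  pairs = sc.parallelize([(1, 30), (2, 20)])       # unseen (user, product) pairs
  for r in model.predictAll(pairs).collect():
      print r.user, r.product, r.rating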
Read & Parse
Split & Train
Evaluate
Epilogue
o We explored interesting APIs in Spark
o ALS - Collaborative Filtering
o RDD operations
  • Join (HashJoin)
  • In-memory, Grace & Recursive hash joins
http://technet.microsoft.com/en-us/library/ms189313(v=sql.105).aspx
Questions ?
4:45
Reference
1. SF Scala & SF Bay Area Machine Learning, Joseph Bradley: Decision Trees on Spark - http://functional.tv/post/98342564544/sfscala-sfbaml-joseph-bradley-decision-trees-on-spark
2. http://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
3. http://stats.stackexchange.com/questions/19216/variables-are-often-adjusted-e-g-standardised-before-making-a-model-when-is
4. http://funny-pictures.picphotos.net/tongue-out-smiley-face/smile-day.net*wp-content*uploads*2012*01*Tongue-Out-Smiley-Face1.jpg/
5. https://speakerdeck.com/jkbradley/mllib-decision-trees-at-sf-scala-baml-meetup
6. http://www.rosebt.com/1/post/2011/10/big-data-analytics-maturity-model.html
7. http://blogs.gartner.com/matthew-davis/
Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms - by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing - Benjamini, Y. and Hochberg, Y.
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes - James Faghmous
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class - Stanford/Hastie/Tibshirani at their best - Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
References :
o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
  • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
  • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
  • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
  • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf
o Thanks to Ana Crisan for the Titanic inset. Picture courtesy of http://emileeid.com/2012/02/11/titanic-3d-exclusive-posters/
The Beginning As The End
How did we do ?
4:45