Page 1
Data Science: A view from the trenches
Ram Sriharsha (Twitter: @halfabrane)
Vinay Shukla (Twitter: @neomythos)
Page 2
Agenda
• Problems we work on
• Common challenges
• Reductions
• Handling label sparsity
  – Co-Training
  – Adaptive learning
• When you have to be fast and accurate
  – Online clustering
  – Sketches
  – Online learning
• Visualization
Page 3
Some Problems
• Search advertising
  – Click prediction: given a query, an ad, and the user context, how likely is the user to click on the ad?
  – Feature engineering: query/ad categorization, query -> feature vector
• Entity resolution and disambiguation
• Detecting over/under payment of claims
• Document matching
• Login risk detection
Page 4
Common Challenges
• Labeling is expensive and labels are noisy
  – Selectively ask for labels (active learning)
  – Co-Training to expand the label set
• Not enough high-quality implementations of algorithms
  – Modular extensions of base implementations (reductions)
  – Boosting
• Speed of training/scoring matters
  – Online learning
  – Online clustering
  – Sketches
• Freshness of models
  – Online and adaptive learning
• Visualizing performance and feature importance
  – Zeppelin
Page 5
Reductions
• Let A = an algorithm for optimizing 0/1 loss
• One-vs-Rest (OVR): train one copy of A per class; at prediction time, randomize over the classifiers that output "yes"
• Importance weighting: let R = a rejection sampling algorithm; for each example h, sample according to the cost of h and feed the result to the 0/1 classifier A, then map the prediction back (R, then A, then R^-1)
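To make the importance-weighting reduction concrete, here is a minimal Scala sketch (an illustration, not the implementation from the talk): `rejectionSample` plays the role of R, keeping each example with probability proportional to its cost, and `trainBinary` is a hypothetical stand-in for any 0/1 learner A.

```scala
import scala.util.Random

// A cost-weighted binary example: features, a label in {0, 1}, and a non-negative cost.
case class WeightedExample(features: Array[Double], label: Int, cost: Double)

object ImportanceWeightingReduction {
  // Rejection sampling (R): keep each example with probability cost / maxCost,
  // producing an unweighted dataset that a plain 0/1 learner can consume.
  def rejectionSample(data: Seq[WeightedExample], rng: Random): Seq[(Array[Double], Int)] = {
    val maxCost = data.map(_.cost).max
    data.flatMap { ex =>
      if (rng.nextDouble() < ex.cost / maxCost) Some((ex.features, ex.label)) else None
    }
  }

  // The reduction: run R, then hand the result to any 0/1 learner A (`trainBinary`).
  def train(data: Seq[WeightedExample],
            trainBinary: Seq[(Array[Double], Int)] => (Array[Double] => Int),
            seed: Long = 42L): Array[Double] => Int =
    trainBinary(rejectionSample(data, new Random(seed)))
}
```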
Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask for labels on those examples, and feed them back into training
• Choose query points that shrink the space of candidate classifiers rapidly
• Exploit natural structure in the data
[Figure: illustrative proportions 45%, 45%, 5%, 2.5%, 2.5%]
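A common way to pick the points the classifier is least confident about is uncertainty sampling. The sketch below is only an illustration under that assumption: `predictProba` stands in for any model returning P(y = 1 | x), and the pool points whose scores are closest to 0.5 are sent out for labeling.

```scala
object UncertaintySampling {
  // Select the `budget` unlabeled points whose predicted probability is closest
  // to 0.5, i.e. where the current model is least confident.
  def select(pool: Seq[Array[Double]],
             predictProba: Array[Double] => Double,
             budget: Int): Seq[Array[Double]] =
    pool.sortBy(x => math.abs(predictProba(x) - 0.5)).take(budget)
}
```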
Page 7
Co-Training
• Suppose you have two "views" of the data
  – e.g., web pages have content, and hyperlinks pointing to and from them
  – Suppose the problem is to label a web page as about literature or not (binary classification)
• One approach:
  – Label web pages manually; train a classifier that uses both the content text and the hyperlinks as features
  – This requires a large number of labeled pages
• Another approach:
  – Since we have two views, try to learn two classifiers
  – Each classifier learns on a subset of the labeled examples
  – The scores of each classifier are used to label a subset of the unlabeled web pages and extend the labels for the other classifier
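A minimal co-training loop might look like the Scala sketch below. It is a simplification of the scheme described above: `trainA` and `trainB` are hypothetical learners for the two views, and in each round the most confidently scored unlabeled examples are pseudo-labeled and added to a shared training pool rather than being handed only to the other classifier.

```scala
object CoTraining {
  // Each example has two views (e.g. page text features and hyperlink features).
  case class Views(viewA: Array[Double], viewB: Array[Double])

  def coTrain(labeled: Seq[(Views, Int)],
              unlabeled: Seq[Views],
              trainA: Seq[(Array[Double], Int)] => (Array[Double] => Double),
              trainB: Seq[(Array[Double], Int)] => (Array[Double] => Double),
              rounds: Int = 5,
              perRound: Int = 10): (Array[Double] => Double, Array[Double] => Double) = {
    var lab = labeled
    var unlab = unlabeled
    var fA = trainA(lab.map { case (v, y) => (v.viewA, y) })
    var fB = trainB(lab.map { case (v, y) => (v.viewB, y) })
    for (_ <- 1 to rounds if unlab.nonEmpty) {
      // Confidence of an unlabeled point = how far the more confident of the
      // two classifiers' scores is from 0.5.
      val confident = unlab
        .map(v => (v, fA(v.viewA), fB(v.viewB)))
        .sortBy { case (_, pA, pB) => -math.max(math.abs(pA - 0.5), math.abs(pB - 0.5)) }
        .take(perRound)
      // Pseudo-label the confident points and add them to the training set.
      val pseudo = confident.map { case (v, pA, pB) =>
        val p = if (math.abs(pA - 0.5) >= math.abs(pB - 0.5)) pA else pB
        (v, if (p >= 0.5) 1 else 0)
      }
      lab = lab ++ pseudo
      val chosen = pseudo.map(_._1).toSet
      unlab = unlab.filterNot(chosen.contains)
      fA = trainA(lab.map { case (v, y) => (v.viewA, y) })
      fB = trainB(lab.map { case (v, y) => (v.viewB, y) })
    }
    (fA, fB)
  }
}
```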
Page 8
Sketches
• Store a "summary" of the dataset
• Querying the sketch is "almost" as good as querying the dataset
• Example: frequent items in a stream
  – Initialize an associative array A holding at most k-1 keys
  – Process each item j of the stream:
    - if j is in keys(A): A[j] += 1
    - else if |keys(A)| < k - 1: A[j] = 1
    - else: for each l in keys(A), A[l] -= 1; if A[l] == 0, remove l
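The bullets above describe the Misra-Gries frequent-items sketch; a small self-contained Scala version (an illustrative sketch, not a tuned implementation) is:

```scala
import scala.collection.mutable

// Misra-Gries sketch: tracks up to k-1 candidate heavy hitters over a stream.
class FrequentItems[T](k: Int) {
  private val counters = mutable.Map.empty[T, Long]

  def add(item: T): Unit = {
    if (counters.contains(item)) {
      counters(item) += 1
    } else if (counters.size < k - 1) {
      counters(item) = 1
    } else {
      // Decrement every counter; drop any that reach zero.
      for (key <- counters.keys.toList) {
        counters(key) -= 1
        if (counters(key) == 0) counters -= key
      }
    }
  }

  def candidates: Map[T, Long] = counters.toMap
}
```

Any item occurring more than n/k times in a stream of length n is guaranteed to survive among the candidates, although its stored count may be an undercount.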
Page 9
Clustering is not fast enough
• Sample, then cluster
• Do clusters need to adapt dynamically?
  – Online clustering
  – Streaming k-means
Page 10
K-Means
• Initialize cluster centers somehow
  – Randomly
  – k-means++
• Alternate:
  – Assign each point to its closest cluster center
  – Move each cluster center to the average of the points assigned to it
• Stop when a convergence criterion is reached
  – Centers don't move "much"
  – Maximum number of iterations reached
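As a concrete (if naive) rendering of the loop above, here is a small single-machine Scala sketch of Lloyd's algorithm with random initialization; a real workload would typically use a distributed implementation such as the one in Spark MLlib.

```scala
import scala.util.Random

object LloydsKMeans {
  type Point = Array[Double]

  private def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def fit(points: Seq[Point], k: Int, maxIter: Int = 20, seed: Long = 0L): Seq[Point] = {
    val rng = new Random(seed)
    // Initialize cluster centers by picking k random points (k-means++ would be better).
    var centers = rng.shuffle(points).take(k)
    for (_ <- 1 to maxIter) {
      // Assignment step: attach each point to its closest center.
      val assigned = points.groupBy(p => centers.minBy(c => dist2(p, c)))
      // Update step: move each center to the mean of its assigned points.
      centers = centers.map { c =>
        assigned.get(c) match {
          case Some(ps) => ps.transpose.map(_.sum / ps.size).toArray
          case None     => c // no points assigned; keep the center where it is
        }
      }
    }
    centers
  }
}
```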
Page 11
Initialize Cluster Centers
[Figure: three initial cluster centers k1, k2, k3 picked at random in the X-Y plane]
Page 12
Assign Each Point
[Figure: each point assigned to the closest of the cluster centers k1, k2, k3]
Page 13
Recompute Cluster Centers
[Figure: each cluster center k1, k2, k3 moved to the mean of the points in its cluster]
Page 14
Streaming K-Means
• For each new point:
  – Assign it to the closest cluster center
  – Update that cluster center to move incrementally in the direction of the new point
• An online version of Lloyd's algorithm
• Good enough in practice
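A minimal sketch of that incremental update, using a 1/count step size so each center tracks the running mean of the points assigned to it (one choice among several; Spark MLlib's streaming k-means also offers a decay factor):

```scala
// Online/streaming k-means: each new point nudges its closest center toward it.
class StreamingKMeansSketch(initialCenters: Array[Array[Double]]) {
  private val centers = initialCenters.map(_.clone)
  private val counts = Array.fill(centers.length)(0L)

  def update(point: Array[Double]): Int = {
    // Assign the point to its closest center.
    val i = centers.indices.minBy { j =>
      centers(j).zip(point).map { case (c, p) => (c - p) * (c - p) }.sum
    }
    counts(i) += 1
    // Move the center a small step toward the new point; step size 1/count
    // keeps the center at the running mean of its assigned points.
    val eta = 1.0 / counts(i)
    for (d <- centers(i).indices) {
      centers(i)(d) += eta * (point(d) - centers(i)(d))
    }
    i
  }

  def currentCenters: Array[Array[Double]] = centers.map(_.clone)
}
```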
Page 15
Recompute Cluster Centers
[Figure: cluster centers k1, k2, k3 after the incremental updates]
Page 16
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization phase:
  – The first point becomes its own cluster
  – Pick some normalization factor f
• Update phase for a point p:
  – Let d = the distance from p to the closest center so far
  – With probability min(d/f, 1), form a new cluster center at p
  – Otherwise, attach p to its closest center
• Merge phase:
  – Once enough clusters have opened up, or enough cost has accumulated, merge clusters
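The sketch below illustrates only the initialization and update phases described above, with the min(d/f, 1) rule written explicitly; the merge phase and the exact probability and normalization choices from the paper are omitted, so this is a rough illustration rather than the published algorithm.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Illustrative online clustering: a point either attaches to its nearest center
// or opens a new center, with probability growing with its distance.
class OnlineClusteringSketch(f: Double, seed: Long = 0L) {
  private val rng = new Random(seed)
  private val centers = ArrayBuffer.empty[Array[Double]]

  private def distToClosest(p: Array[Double]): Double =
    centers.map(c => math.sqrt(c.zip(p).map { case (x, y) => (x - y) * (x - y) }.sum)).min

  def observe(p: Array[Double]): Unit = {
    if (centers.isEmpty) {
      // Initialization: the first point is its own cluster.
      centers += p
    } else {
      val d = distToClosest(p)
      // With probability min(d/f, 1) open a new cluster at p; otherwise attach.
      if (rng.nextDouble() < math.min(d / f, 1.0)) centers += p
      // (Attaching would update per-cluster statistics; omitted in this sketch.)
    }
  }

  def currentCenters: Seq[Array[Double]] = centers.toSeq
}
```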
Page 17
Properties
• Provably close to optimal in the online setting
• Opens at most O(log(OPT)) clusters and pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that may be indicative of structure, i.e., useful as a machine learning feature
Page 18
My classifier is not fast enough
• Even for batch problems, online learning might be good enough!
• For real-time problems, online learning or incremental learning is needed.
Page 19
What is online learning?
• Batch learning:
  – The classifier sees a set of labeled examples and trains a model
  – It uses the trained model to predict on unseen examples
• Online learning:
  – The classifier sees one example at a time
  – Limited look-back window (often 0)
  – It predicts on the example and is then shown the cost
  – It learns from its mistakes
  – This yields a one-pass batch learning algorithm: simply run the online algorithm over each example in the batch
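The following sketch makes the see-predict-observe-learn loop concrete with online logistic regression trained by stochastic gradient descent; it is a generic illustration, not the specific learner used in the talk.

```scala
// Online logistic regression trained one example at a time with SGD.
class OnlineLogisticRegression(dim: Int, stepSize: Double = 0.1) {
  private val w = Array.fill(dim)(0.0)

  def predictProba(x: Array[Double]): Double = {
    val z = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    1.0 / (1.0 + math.exp(-z))
  }

  // One online step: predict, observe the true label, update from the mistake.
  def learn(x: Array[Double], label: Int): Double = {
    val p = predictProba(x)
    val gradScale = p - label // gradient of log-loss with respect to the score
    for (i <- w.indices) {
      w(i) -= stepSize * gradScale * x(i)
    }
    p
  }
}
```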
Page 20
Challenges of online learning
• Normalization
  – In the batch setting, you can normalize data by making a pass over the full dataset
  – In the online setting, you cannot make a second pass
  – Solution: adaptive normalization
• Late-arriving features
  – In the batch setting, all features are recorded in the dataset
  – In the online setting, different features may arrive at different times
  – Solution: Adagrad (adaptive gradient technique)
• Stochastic gradient descent convergence can be slow
  – More data helps
  – Adaptive normalization improves convergence
  – Adagrad improves convergence and reduces sensitivity to the step size
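For reference, Adagrad keeps a per-coordinate sum of squared gradients and scales each coordinate's step by its inverse square root, so rarely-seen or late-arriving features still receive relatively large updates; a minimal sketch:

```scala
// Adagrad: per-coordinate adaptive step sizes for SGD.
class AdagradUpdater(dim: Int, eta: Double = 0.1, eps: Double = 1e-8) {
  val weights: Array[Double] = Array.fill(dim)(0.0)
  private val gradSquares = Array.fill(dim)(0.0)

  def step(gradient: Array[Double]): Unit = {
    for (i <- weights.indices) {
      gradSquares(i) += gradient(i) * gradient(i)
      // Coordinates with little accumulated gradient mass get relatively larger steps.
      weights(i) -= eta / math.sqrt(gradSquares(i) + eps) * gradient(i)
    }
  }
}
```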
Page 21
Visualization
• Speed up feature discovery
• Intuitive visualization of model performance
• Improve debuggability
Page 22
The Data Science Workflow…
[Workflow diagram, from start to end: Plan (what question am I answering? what data will I need?) -> Acquire the data -> Clean data (analyze data quality; reformat, impute, etc.) -> Analyze data (script and visualize; create features; create model; evaluate results) -> Create report (publish and share) -> Deploy in production]
Page 23
Introducing Apache Zeppelin
• Web-based notebook for interactive analytics
• Use cases: data exploration and discovery, visualization
• Interactive snippet-at-a-time experience
• A "modern data science studio"
Page 24
Zeppelin today in the Data Science Workflow…
[The same workflow diagram as above, annotated to show where Zeppelin fits in the workflow today]
Page 25
Zeppelin – Road Ahead (themes: Enterprise Ready and Ease of Use)
• Operations: deploy to the cluster with Ambari
• Security: authentication against LDAP, SSL, running in a Kerberized cluster, authorization of notebooks
• Sharing/Collaboration: share selected notebooks with selected users/groups; ability to read/publish notebooks to GitHub
• Data Import: visual data import/download; clean data as it comes in
• Usability: summary data (see column summaries); keyboard shortcuts, auto-complete, syntax highlighting, line numbers
• Visualization: pluggable visualizations and more charts, maps, and tables
• R support: harden the SparkR interpreter
Page 26
Upcoming Work
• Entity Resolution package GA
  – Supports entity-graph-based resolution
  – Includes a Random Walk algorithm for computing similarity scores
• Online learning and clustering Spark packages
• Contribute more reduction algorithms to Spark ML
  – Cost-sensitive classification
  – Filter-tree-based multiclass reduction
• Zeppelin GA
• Zeppelin GA
Page 27
Thank you!
• Ram Sriharsha (@halfabrane)
• Vinay Shukla (@neomythos)