Page 1
Data Science: A view from the trenches
Ram Sriharsha (Twitter: @halfabrane)
Vinay Shukla (Twitter: @neomythos)
Page 2
Agenda
• Problems we work on
• Common challenges
• Reductions
• Handling label sparsity
  – Co-Training
  – Adaptive learning
• When you have to be fast and accurate
  – Online clustering
  – Sketches
  – Online learning
• Visualization
Page 3
Some Problems
• Search advertising
  – Click prediction: given a query, an ad, and the user context, how likely is the user to click on the ad?
  – Feature engineering: query/ad categorization, query -> feature vector
• Entity resolution and disambiguation
• Detecting over/under payment of claims
• Document matching
• Login risk detection
Page 4
Common Challenges
• Labeling is expensive and labels are noisy
  – Selectively ask for labels (active learning)
  – Co-Training to expand the label set
• Not enough high-quality implementations of algorithms
  – Modular extensions of base implementations (reductions)
  – Boosting
• Speed of training/scoring matters
  – Online learning
  – Online clustering
  – Sketches
• Freshness of models
  – Online and adaptive learning
• Visualizing performance and feature importance
  – Zeppelin
Page 5
Reductions
• Let A = an algorithm for optimizing 0/1 loss
• One-vs-Rest (OVR): train one copy of A per class; at prediction time, randomize over the classifiers that output "yes"
• Importance weighting: let R = a rejection sampling algorithm; for each example h, sample according to the cost of h and feed the result to the 0/1 classifier A, then map the prediction back (R, then A, then R^-1)
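To make the importance-weighting reduction concrete, here is a minimal Scala sketch (an illustration, not the implementation from the talk): `rejectionSample` plays the role of R, keeping each example with probability proportional to its cost, and `trainBinary` is a hypothetical stand-in for any 0/1 learner A.

```scala
import scala.util.Random

// A cost-weighted binary example: features, a label in {0, 1}, and a non-negative cost.
case class WeightedExample(features: Array[Double], label: Int, cost: Double)

object ImportanceWeightingReduction {
  // Rejection sampling (R): keep each example with probability cost / maxCost,
  // producing an unweighted dataset that a plain 0/1 learner can consume.
  def rejectionSample(data: Seq[WeightedExample], rng: Random): Seq[(Array[Double], Int)] = {
    val maxCost = data.map(_.cost).max
    data.flatMap { ex =>
      if (rng.nextDouble() < ex.cost / maxCost) Some((ex.features, ex.label)) else None
    }
  }

  // The reduction: run R, then hand the result to any 0/1 learner A (`trainBinary`).
  def train(data: Seq[WeightedExample],
            trainBinary: Seq[(Array[Double], Int)] => (Array[Double] => Int),
            seed: Long = 42L): Array[Double] => Int =
    trainBinary(rejectionSample(data, new Random(seed)))
}
```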
Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask for labels on those examples, and feed them back into training
• Choose query points that shrink the space of candidate classifiers rapidly
• Exploit natural structure in the data
[Figure: illustrative proportions 45%, 45%, 5%, 2.5%, 2.5%]
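A common way to pick the points the classifier is least confident about is uncertainty sampling. The sketch below is only an illustration under that assumption: `predictProba` stands in for any model returning P(y = 1 | x), and the pool points whose scores are closest to 0.5 are sent out for labeling.

```scala
object UncertaintySampling {
  // Select the `budget` unlabeled points whose predicted probability is closest
  // to 0.5, i.e. where the current model is least confident.
  def select(pool: Seq[Array[Double]],
             predictProba: Array[Double] => Double,
             budget: Int): Seq[Array[Double]] =
    pool.sortBy(x => math.abs(predictProba(x) - 0.5)).take(budget)
}
```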
Page 7
Co-Training
• Suppose you have two "views" of the data
  – e.g., web pages have content, and hyperlinks pointing to and from them
  – Suppose the problem is to label a web page as about literature or not (binary classification)
• One approach:
  – Label web pages manually; train a classifier that uses both the content text and the hyperlinks as features
  – This requires a large number of labeled pages
• Another approach:
  – Since we have two views, try to learn two classifiers
  – Each classifier learns on a subset of the labeled examples
  – The scores of each classifier are used to label a subset of the unlabeled web pages and extend the labels for the other classifier
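A minimal co-training loop might look like the Scala sketch below. It is a simplification of the scheme described above: `trainA` and `trainB` are hypothetical learners for the two views, and in each round the most confidently scored unlabeled examples are pseudo-labeled and added to a shared training pool rather than being handed only to the other classifier.

```scala
object CoTraining {
  // Each example has two views (e.g. page text features and hyperlink features).
  case class Views(viewA: Array[Double], viewB: Array[Double])

  def coTrain(labeled: Seq[(Views, Int)],
              unlabeled: Seq[Views],
              trainA: Seq[(Array[Double], Int)] => (Array[Double] => Double),
              trainB: Seq[(Array[Double], Int)] => (Array[Double] => Double),
              rounds: Int = 5,
              perRound: Int = 10): (Array[Double] => Double, Array[Double] => Double) = {
    var lab = labeled
    var unlab = unlabeled
    var fA = trainA(lab.map { case (v, y) => (v.viewA, y) })
    var fB = trainB(lab.map { case (v, y) => (v.viewB, y) })
    for (_ <- 1 to rounds if unlab.nonEmpty) {
      // Confidence of an unlabeled point = how far the more confident of the
      // two classifiers' scores is from 0.5.
      val confident = unlab
        .map(v => (v, fA(v.viewA), fB(v.viewB)))
        .sortBy { case (_, pA, pB) => -math.max(math.abs(pA - 0.5), math.abs(pB - 0.5)) }
        .take(perRound)
      // Pseudo-label the confident points and add them to the training set.
      val pseudo = confident.map { case (v, pA, pB) =>
        val p = if (math.abs(pA - 0.5) >= math.abs(pB - 0.5)) pA else pB
        (v, if (p >= 0.5) 1 else 0)
      }
      lab = lab ++ pseudo
      val chosen = pseudo.map(_._1).toSet
      unlab = unlab.filterNot(chosen.contains)
      fA = trainA(lab.map { case (v, y) => (v.viewA, y) })
      fB = trainB(lab.map { case (v, y) => (v.viewB, y) })
    }
    (fA, fB)
  }
}
```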
Page 8
Sketches
• Store a "summary" of the dataset
• Querying the sketch is "almost" as good as querying the dataset
• Example: frequent items in a stream
  – Initialize an associative array A holding at most k-1 keys
  – Process each item j of the stream:
    - if j is in keys(A): A[j] += 1
    - else if |keys(A)| < k - 1: A[j] = 1
    - else: for each l in keys(A), A[l] -= 1; if A[l] == 0, remove l
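The bullets above describe the Misra-Gries frequent-items sketch; a small self-contained Scala version (an illustrative sketch, not a tuned implementation) is:

```scala
import scala.collection.mutable

// Misra-Gries sketch: tracks up to k-1 candidate heavy hitters over a stream.
class FrequentItems[T](k: Int) {
  private val counters = mutable.Map.empty[T, Long]

  def add(item: T): Unit = {
    if (counters.contains(item)) {
      counters(item) += 1
    } else if (counters.size < k - 1) {
      counters(item) = 1
    } else {
      // Decrement every counter; drop any that reach zero.
      for (key <- counters.keys.toList) {
        counters(key) -= 1
        if (counters(key) == 0) counters -= key
      }
    }
  }

  def candidates: Map[T, Long] = counters.toMap
}
```

Any item occurring more than n/k times in a stream of length n is guaranteed to survive among the candidates, although its stored count may be an undercount.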
Page 9
Clustering is not fast enough
• Sample, then cluster
• Do clusters need to adapt dynamically?
  – Online clustering
  – Streaming k-means
Page 10
K-Means
• Initialize cluster centers somehow
  – Randomly
  – k-means++
• Alternate:
  – Assign each point to its closest cluster center
  – Move each cluster center to the average of the points assigned to it
• Stop when a convergence criterion is reached
  – Centers don't move "much"
  – Maximum number of iterations reached
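As a concrete (if naive) rendering of the loop above, here is a small single-machine Scala sketch of Lloyd's algorithm with random initialization; a real workload would typically use a distributed implementation such as the one in Spark MLlib.

```scala
import scala.util.Random

object LloydsKMeans {
  type Point = Array[Double]

  private def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def fit(points: Seq[Point], k: Int, maxIter: Int = 20, seed: Long = 0L): Seq[Point] = {
    val rng = new Random(seed)
    // Initialize cluster centers by picking k random points (k-means++ would be better).
    var centers = rng.shuffle(points).take(k)
    for (_ <- 1 to maxIter) {
      // Assignment step: attach each point to its closest center.
      val assigned = points.groupBy(p => centers.minBy(c => dist2(p, c)))
      // Update step: move each center to the mean of its assigned points.
      centers = centers.map { c =>
        assigned.get(c) match {
          case Some(ps) => ps.transpose.map(_.sum / ps.size).toArray
          case None     => c // no points assigned; keep the center where it is
        }
      }
    }
    centers
  }
}
```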
Page 11
Initialize Cluster Centers
[Figure: three initial cluster centers k1, k2, k3 picked at random in the X-Y plane]
Page 12
Assign Each Point
[Figure: each point assigned to the closest of the cluster centers k1, k2, k3]
Page 13
Recompute Cluster Centers
[Figure: each cluster center k1, k2, k3 moved to the mean of the points in its cluster]
Page 14
Streaming K-Means
• For each new point:
  – Assign it to the closest cluster center
  – Update that cluster center to move incrementally in the direction of the new point
• An online version of Lloyd's algorithm
• Good enough in practice
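A minimal sketch of that incremental update, using a 1/count step size so each center tracks the running mean of the points assigned to it (one choice among several; Spark MLlib's streaming k-means also offers a decay factor):

```scala
// Online/streaming k-means: each new point nudges its closest center toward it.
class StreamingKMeansSketch(initialCenters: Array[Array[Double]]) {
  private val centers = initialCenters.map(_.clone)
  private val counts = Array.fill(centers.length)(0L)

  def update(point: Array[Double]): Int = {
    // Assign the point to its closest center.
    val i = centers.indices.minBy { j =>
      centers(j).zip(point).map { case (c, p) => (c - p) * (c - p) }.sum
    }
    counts(i) += 1
    // Move the center a small step toward the new point; step size 1/count
    // keeps the center at the running mean of its assigned points.
    val eta = 1.0 / counts(i)
    for (d <- centers(i).indices) {
      centers(i)(d) += eta * (point(d) - centers(i)(d))
    }
    i
  }

  def currentCenters: Array[Array[Double]] = centers.map(_.clone)
}
```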
Page 15
Recompute Cluster Centers
[Figure: cluster centers k1, k2, k3 after the incremental updates]
Page 16
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization phase:
  – The first point becomes its own cluster
  – Pick some normalization factor f
• Update phase for a point p:
  – Let d = the distance from p to the closest center so far
  – With probability min(d/f, 1), form a new cluster center at p
  – Otherwise, attach p to its closest center
• Merge phase:
  – Once enough clusters have opened up, or enough cost has accumulated, merge clusters
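The sketch below illustrates only the initialization and update phases described above, with the min(d/f, 1) rule written explicitly; the merge phase and the exact probability and normalization choices from the paper are omitted, so this is a rough illustration rather than the published algorithm.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Illustrative online clustering: a point either attaches to its nearest center
// or opens a new center, with probability growing with its distance.
class OnlineClusteringSketch(f: Double, seed: Long = 0L) {
  private val rng = new Random(seed)
  private val centers = ArrayBuffer.empty[Array[Double]]

  private def distToClosest(p: Array[Double]): Double =
    centers.map(c => math.sqrt(c.zip(p).map { case (x, y) => (x - y) * (x - y) }.sum)).min

  def observe(p: Array[Double]): Unit = {
    if (centers.isEmpty) {
      // Initialization: the first point is its own cluster.
      centers += p
    } else {
      val d = distToClosest(p)
      // With probability min(d/f, 1) open a new cluster at p; otherwise attach.
      if (rng.nextDouble() < math.min(d / f, 1.0)) centers += p
      // (Attaching would update per-cluster statistics; omitted in this sketch.)
    }
  }

  def currentCenters: Seq[Array[Double]] = centers.toSeq
}
```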
Page 17
Properties
• Provably close to optimal in the online setting
• Opens at most O(log(OPT)) clusters and pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that may be indicative of structure, i.e., useful as a machine learning feature
Page 18
My classifier is not fast enough
• Even for batch problems, online learning might be good enough!
• For real-time problems, online learning or incremental learning is needed.
Page 19
What is online learning?
• Batch learning:
  – The classifier sees a set of labeled examples and trains a model
  – It uses the trained model to predict on unseen examples
• Online learning:
  – The classifier sees one example at a time
  – Limited look-back window (often 0)
  – It predicts on the example and is then shown the cost
  – It learns from its mistakes
  – This yields a one-pass batch learning algorithm: simply run the online algorithm over each example in the batch
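The following sketch makes the see-predict-observe-learn loop concrete with online logistic regression trained by stochastic gradient descent; it is a generic illustration, not the specific learner used in the talk.

```scala
// Online logistic regression trained one example at a time with SGD.
class OnlineLogisticRegression(dim: Int, stepSize: Double = 0.1) {
  private val w = Array.fill(dim)(0.0)

  def predictProba(x: Array[Double]): Double = {
    val z = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    1.0 / (1.0 + math.exp(-z))
  }

  // One online step: predict, observe the true label, update from the mistake.
  def learn(x: Array[Double], label: Int): Double = {
    val p = predictProba(x)
    val gradScale = p - label // gradient of log-loss with respect to the score
    for (i <- w.indices) {
      w(i) -= stepSize * gradScale * x(i)
    }
    p
  }
}
```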
Page 20
Challenges of online learning
• Normalization
  – In the batch setting, you can normalize data by making a pass over the full dataset
  – In the online setting, you cannot make a second pass
  – Solution: adaptive normalization
• Late-arriving features
  – In the batch setting, all features are recorded in the dataset
  – In the online setting, different features may arrive at different times
  – Solution: Adagrad (adaptive gradient technique)
• Stochastic gradient descent convergence can be slow
  – More data helps
  – Adaptive normalization improves convergence
  – Adagrad improves convergence and reduces sensitivity to the step size
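For reference, Adagrad keeps a per-coordinate sum of squared gradients and scales each coordinate's step by its inverse square root, so rarely-seen or late-arriving features still receive relatively large updates; a minimal sketch:

```scala
// Adagrad: per-coordinate adaptive step sizes for SGD.
class AdagradUpdater(dim: Int, eta: Double = 0.1, eps: Double = 1e-8) {
  val weights: Array[Double] = Array.fill(dim)(0.0)
  private val gradSquares = Array.fill(dim)(0.0)

  def step(gradient: Array[Double]): Unit = {
    for (i <- weights.indices) {
      gradSquares(i) += gradient(i) * gradient(i)
      // Coordinates with little accumulated gradient mass get relatively larger steps.
      weights(i) -= eta / math.sqrt(gradSquares(i) + eps) * gradient(i)
    }
  }
}
```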
Page 21
Visualization
• Speed up feature discovery
• Intuitive visualization of model performance
• Improve debuggability
Page 22
The Data Science Workflow…
[Workflow diagram, from start to end: Plan (what question am I answering? what data will I need?) -> Acquire the data -> Clean data (analyze data quality; reformat, impute, etc.) -> Analyze data (script and visualize; create features; create model; evaluate results) -> Create report (publish and share) -> Deploy in production]
Page 23
Introducing Apache Zeppelin
• Web-based notebook for interactive analytics
• Use cases: data exploration and discovery, visualization
• Interactive snippet-at-a-time experience
• A "modern data science studio"
Page 24
Zeppelin today in the Data Science Workflow…
[The same workflow diagram as above, annotated to show where Zeppelin fits in the workflow today]
Page 25
Zeppelin – Road Ahead (themes: Enterprise Ready and Ease of Use)
• Operations: deploy to the cluster with Ambari
• Security: authentication against LDAP, SSL, running in a Kerberized cluster, authorization of notebooks
• Sharing/Collaboration: share selected notebooks with selected users/groups; ability to read/publish notebooks to GitHub
• Data Import: visual data import/download; clean data as it comes in
• Usability: summary data (see column summaries); keyboard shortcuts, auto-complete, syntax highlighting, line numbers
• Visualization: pluggable visualizations and more charts, maps, and tables
• R support: harden the SparkR interpreter
Page 26
Upcoming Work
• Entity Resolution package GA
  – Supports entity-graph-based resolution
  – Includes a Random Walk algorithm for computing similarity scores
• Online learning and clustering Spark packages
• Contribute more reduction algorithms to Spark ML
  – Cost-sensitive classification
  – Filter-tree-based multiclass reduction
• Zeppelin GA
• Zeppelin GA
Page 27
Thank you!
• Ram Sriharsha (@halfabrane)
• Vinay Shukla (@neomythos)