DataEngConf SF16 - Methods for Content Relevance at LinkedIn

©2015 LinkedIn Corporation. All Rights Reserved. 1

Methods for approaching content relevance at LinkedIn

Ajit Singh

DataEngConf, April 7th 2016


Who am I?

•  Builds models and the infrastructure that serves those models in production.

•  Background in machine learning

•  Ph.D. Machine Learning, CMU

•  Post-doc, University of Washington.


Who Do I Work With?

Engineering team with expertise in machine learning and systems.


Why attend this talk?

Feed

Slideshare: Rich media (slides and video, with transcripts)

Pulse: Content discovery

Groups: Conversation threads


Article Recommendation


Common Pattern for Recommendations

Keep it Simple

1.  Feature Extraction
    –  What do I know about this content?

2.  Indexing & Search
    –  How do I store and retrieve documents and their features?
    –  What candidates should I consider for this request?

3.  Recommendation
    –  What should I show you?
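
The three steps above can be sketched end to end in a few lines. This is only a toy illustration, not LinkedIn's system: `Document`, `extract_features`, `ToyIndex`, and `recommend` are hypothetical names, the "features" are a plain bag of lowercase terms, and the ranking is a simple overlap count with the member's interests.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    features: dict = field(default_factory=dict)


def extract_features(doc):
    """Step 1: what do I know about this content? (toy: a bag of terms)"""
    doc.features["terms"] = set(doc.text.lower().split())
    return doc


class ToyIndex:
    """Step 2: store documents and retrieve candidates by term (hypothetical)."""

    def __init__(self):
        self._postings = {}  # term -> set of doc_ids
        self._docs = {}      # doc_id -> Document

    def add(self, doc):
        self._docs[doc.doc_id] = doc
        for term in doc.features["terms"]:
            self._postings.setdefault(term, set()).add(doc.doc_id)

    def candidates(self, query_terms):
        ids = set()
        for term in query_terms:
            ids |= self._postings.get(term, set())
        return [self._docs[i] for i in ids]


def recommend(candidates, member_interests, k=2):
    """Step 3: what should I show you? (toy: rank by term overlap)"""
    def score(doc):
        return len(doc.features["terms"] & member_interests)

    # Highest overlap first; ties are broken by doc_id for determinism.
    ranked = sorted(candidates, key=lambda d: (-score(d), d.doc_id))
    return [d.doc_id for d in ranked[:k]]
```

Keeping the three stages as separate functions mirrors the slide: each stage can be swapped out (a better featurizer, a real index, a learned ranker) without touching the others.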



Engineering Stacks

Offline (minutes – days)

•  Hadoop
•  Spark
•  “Big Data” tools

Nearline (seconds – hours)

•  Samza
•  Spark Streaming, Storm

Online (milliseconds)

•  Services
•  DB, Distributed Key-Value Stores


Kinds of Model Features

§  Member Features
    –  What we know about the viewer or guest.
    –  E.g., Industry, Skills, Languages Spoken

§  Document Features
    –  What we know about a candidate item for recommendation.
    –  E.g., For articles: vector-space, topics, social gestures.

§  Engagement
    –  Aggregations of tracking data
    –  E.g., Per-item click-through rates, Dwell time.
    –  Think OLAP cubes.
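
As a rough illustration of engagement features, the sketch below aggregates raw tracking events into per-item click-through rate and mean dwell time. The event schema (`item_id`, `type`, `dwell_seconds`) and the function name `aggregate_engagement` are assumptions made for this example; real tracking data will look different.

```python
from collections import defaultdict


def aggregate_engagement(events):
    """Roll tracking events up into per-item CTR and mean dwell time.

    Each event is a dict with a hypothetical schema:
      {"item_id": str, "type": "impression" | "click", "dwell_seconds": float}
    """
    stats = defaultdict(lambda: {"impressions": 0, "clicks": 0, "dwell_total": 0.0})
    for event in events:
        s = stats[event["item_id"]]
        if event["type"] == "impression":
            s["impressions"] += 1
        elif event["type"] == "click":
            s["clicks"] += 1
            s["dwell_total"] += event.get("dwell_seconds", 0.0)

    features = {}
    for item_id, s in stats.items():
        features[item_id] = {
            # Guard against division by zero for items never shown or clicked.
            "ctr": s["clicks"] / s["impressions"] if s["impressions"] else 0.0,
            "mean_dwell": s["dwell_total"] / s["clicks"] if s["clicks"] else 0.0,
        }
    return features
```

Grouping by more dimensions than `item_id` (for example item and member industry) gives the OLAP-cube flavour mentioned on the slide.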


Document Features




NLP

§  Featurizers which process text one document at a time.
    –  Foundational:
        §  Tokenization, Lemmatization, Stop-word removal => Vector-space models
        §  Text near-deduplication (w-shingling, SpotSigs)
        §  Language detection
    –  Classifiers:
        §  Explicitly categorize the document into one or more label sets.

§  Fast Prototyping
    –  Build a library first
    –  Deploy the library to Hadoop first: e.g., via Pig, Scalding.
    –  Don’t build a near-real time system till you have validated features.
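
To make the near-deduplication bullet concrete, here is a minimal sketch of w-shingling with Jaccard similarity, written as a small library function in the spirit of "build a library first". The whitespace tokenizer, the window size `w=3`, and the `0.7` threshold are illustrative assumptions, not values from the talk.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous token windows) in text."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, w=3, threshold=0.7):
    """Flag two documents as near-duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(text_a, w), shingles(text_b, w)) >= threshold
```

Comparing every pair of documents this way is quadratic; at scale the shingle sets are usually hashed or sketched first, which is presumably what the "Text Hashes" featurizer refers to.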


Near-Real Time Feature Generation

§  Before starting, ask yourself – do I need near-real time features?

[Diagram: article feature flow across the Offline, Nearline, and Online stacks, with steps numbered 1 to 6. Components shown: Article DB, Article Stream, featurizers (Language, Topics, Entity Extractors, Text Hashes), their outputs (Language Features, Topic Features, Entity Features, Hash Features), Other Features, an OfflineProducer, and the Search Index.]


Pre-computed Recommendations

When you can stick to the offline world.

§  In many cases the recommendation problem is constrained: e.g.,
    –  You know which users are likely to visit in a given time period.
    –  Documents being considered will not change quickly.

§  Give me the top-k documents by highest normalized click-through rate.
    –  Recommendations are pushed (e-mail, push notifications)

§  Just pre-compute recommendations for likely-to-visit users
    –  Obvious parallelization via Hadoop.
    –  Agile development.
    –  Send recommendations to a distributed key-value store to serve.

§  Presupposes having good tracking & data pipelines (data lake).
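
A minimal sketch of the pre-computation idea, under stated assumptions: the slide does not define "normalized" click-through rate, so this example substitutes a simple smoothed CTR with a pseudo-count prior; a plain dict stands in for the distributed key-value store; and the per-user language filter is a hypothetical candidate constraint. All function and field names are illustrative.

```python
import heapq


def smoothed_ctr(clicks, impressions, prior_clicks=1.0, prior_impressions=100.0):
    """Assumed stand-in for 'normalized CTR': shrink low-traffic items to a prior."""
    return (clicks + prior_clicks) / (impressions + prior_impressions)


def precompute_recommendations(likely_users, doc_stats, k=2):
    """Build a user_id -> top-k doc_ids table for likely-to-visit users.

    likely_users: {user_id: language}
    doc_stats:    {doc_id: {"lang": str, "clicks": int, "impressions": int}}
    The returned dict plays the role of the key-value store that serves
    recommendations online.
    """
    store = {}
    for user_id, language in likely_users.items():
        scored = [
            (smoothed_ctr(s["clicks"], s["impressions"]), doc_id)
            for doc_id, s in doc_stats.items()
            if s["lang"] == language
        ]
        store[user_id] = [doc_id for _, doc_id in heapq.nlargest(k, scored)]
    return store
```

Because each user is processed independently, the outer loop is the part that parallelizes naturally as a batch job.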



Data Science / Engineering Contracts

§  How do I get my data into the near real-time flow?

§  How do I deploy a model for feature X?

§  What if my model has a large number of parameters?
    –  Language models are notoriously huge.
    –  Memory is often a tighter constraint than CPU.

§  What if I want to A/B test different versions of a feature?

§  What happens if a feature source fails?


Search Indices

§  We use an in-house search system called Galene.

§  Store features as searchable facets within the index.


Why?

§  Flexible candidate selection:
    –  Give me all the English-language documents ingested in the last four days, which also mention the term “Lectra”.
    –  Give me all promoted articles tagged with “Grace Hopper 2016”.

§  Consolidate state management across search & recommendations.
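
The two example queries above amount to filtering on stored facets. The sketch below shows that idea with an in-memory list; it is not Galene's API (which is not described in these slides), and `IndexedDoc`, `select_candidates`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IndexedDoc:
    # Hypothetical facets stored alongside each document.
    doc_id: str
    language: str
    ingested_at: datetime
    terms: frozenset
    tags: frozenset = frozenset()
    promoted: bool = False


def select_candidates(docs, now, language=None, max_age=None,
                      term=None, tag=None, promoted=None):
    """Return doc_ids matching every facet constraint that is not None."""
    selected = []
    for d in docs:
        if language is not None and d.language != language:
            continue
        if max_age is not None and now - d.ingested_at > max_age:
            continue
        if term is not None and term.lower() not in d.terms:
            continue
        if tag is not None and tag not in d.tags:
            continue
        if promoted is not None and d.promoted != promoted:
            continue
        selected.append(d.doc_id)
    return selected
```

With this shape, the first query reads `select_candidates(docs, now, language="en", max_age=timedelta(days=4), term="Lectra")` and the second reads `select_candidates(docs, now, tag="Grace Hopper 2016", promoted=True)`. A real search index answers the same questions with posting lists rather than a linear scan.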


Search Verticals

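[Diagram: components shown are Search, Search Federator, Search Verticals, Recommender, and Pre-computed Recommendations, with boxes labeled Articles, Slides, Groups, Courses, and “…”. Edges are labeled “query”, “formulated query”, and “lookup” (twice).]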


Continuum between Search & Recommendations

[Figure: a continuum running from Search to Recommendation. Labels shown: Navigational, Broad / Exploratory, Guided & Faceted Search, Empty Search.]


Homepage Module

Epsilon-Greedy Explore/Exploit

#1  Target’s Red Ink Runs out in Canada.  0.3241%

#2  Why we need “Economy Wide” Airline Seats.  0.5923%

#3  How much work is too much work?  0.4864%

#4  We can learn from barn raisers.  0.0231%
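
For readers unfamiliar with the technique named on this slide, here is a minimal sketch of epsilon-greedy explore/exploit over a fixed set of candidate articles. It is a textbook version, not the production module: the class name, the default `epsilon=0.1`, and the choice to treat unseen items as CTR 0 are all assumptions.

```python
import random


class EpsilonGreedy:
    """Pick the best-known item most of the time, a random one otherwise."""

    def __init__(self, items, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.impressions = {item: 0 for item in items}
        self.clicks = {item: 0 for item in items}

    def ctr(self, item):
        shown = self.impressions[item]
        return self.clicks[item] / shown if shown else 0.0

    def select(self):
        items = list(self.impressions)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(items)  # explore: uniform random item
        return max(items, key=self.ctr)    # exploit: highest observed CTR

    def update(self, item, clicked):
        """Feed back the outcome of showing `item` once."""
        self.impressions[item] += 1
        if clicked:
            self.clicks[item] += 1
```

Each request calls `select()`, and each tracked outcome calls `update()`, so the counts (and therefore the ranking) keep moving as feedback arrives.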



Key Ideas

§  Online Algorithm
    –  The model continuously updates with feedback from its decisions.
    –  Infrastructure components:
        §  Near real-time counting (OLAP).
        §  Efficient scoring of k candidates per request.
    –  Allows for warm/cold-start models (cf. Thompson sampling)

§  What algorithms do well vs. what humans do well.
    –  Candidate selection by human editors.
    –  Ranking via algorithms.
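
Since Thompson sampling is mentioned only in passing above, here is a small sketch of the standard Beta-Bernoulli form, showing how warm items (with click history) and cold items (with none) can share one ranking rule. The class and method names and the uniform Beta(1, 1) prior are assumptions for illustration; the candidate set would come from editors, as the slide suggests.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a set of candidate items."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.successes = {}  # item -> observed clicks
        self.failures = {}   # item -> observed non-clicks

    def add_item(self, item, clicks=0, impressions=0):
        """Warm start with historical counts, or cold start with zeros."""
        self.successes[item] = clicks
        self.failures[item] = impressions - clicks

    def select(self):
        """Sample a plausible CTR per item from its posterior; show the max."""
        def draw(item):
            # Posterior is Beta(1 + clicks, 1 + non-clicks) under a uniform prior.
            return self.rng.betavariate(1 + self.successes[item],
                                        1 + self.failures[item])
        return max(self.successes, key=draw)

    def update(self, item, clicked):
        if clicked:
            self.successes[item] += 1
        else:
            self.failures[item] += 1
```

A cold item has a wide posterior, so it is occasionally sampled high and gets shown; as counts accumulate its posterior narrows and exploration of it fades without a hand-tuned epsilon.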


Algorithm Aversion

“Although people may be willing to trust an algorithm in the absence of experience with it, seeing it perform—and almost inevitably err—will cause them to abandon it in favor of a human judge. This may occur even when people see the algorithm outperform the human.”

Dietvorst et al., J. Exp. Psychol. Gen., 2014

Man + Machine


Engineering & Data Science

§  The core challenge is that data-driven products are complex
    –  Tracking
    –  Data Warehousing
    –  Offline Infrastructure (Hadoop & Spark)
    –  Modelers
    –  Nearline & Online Infrastructure

§  Craftsmanship
    –  Clear contracts are critical
        §  Are you providing recommendations?
        §  Are you providing data sets?
        §  Are you providing modeling infrastructure to other teams?


Align on Metrics

Know why something matters.

§  True North Metrics:
    –  Track the health of a product.
    –  Usually affected by many aspects of the product, not just relevance.
    –  May not be measurable over short time spans.
        §  e.g., Revenue in a subscription funnel.

§  Signpost Metrics:
    –  Leading indicators of health for a true north.
    –  Measurable, often via an A/B test.

§  Relevance Metrics:
    –  The quantity your optimization problem actually optimizes.
    –  It is rare that you can directly optimize your true north metric.

Date post:	12-Apr-2017
Category:	Technology
Upload:	hakka-labs