Nearline systems to improve Netflix recommendations

Gopal Krishnan

Feb 2015

About me

Gopal Krishnan

Director, Consumer Science Engineering

Netflix, Inc.

Driving innovation through AB testing the member experience.

Twitter: @sgkrishnan

LinkedIn: https://www.linkedin.com/pub/gopal-krishnan/0/7a7/905

Netflix: global streaming video service for TV and movies

Netflix is available on 1000+ devices

More than 57M members globally

• In more than 50 countries

• Planning to launch in all (200+) countries in 2 years.

Netflix consumes 34% of peak downstream bandwidth in North America

Netflix consumes 6% of peak upstream bandwidth in North America

What does my team do?

• Increase the rate of innovation through A/B testing to improve the member experience

• Infrastructure for algorithmic support

– Feature value store to help model training

– Services to store and serve explicit data sources

– Services to collect, process, validate, and serve implicit data sources

– Caching services

• Data improves our understanding of end to end user behavior

Every part of Netflix is personalized

NETFLIX RECOMMENDATIONS WITH ONLINE MICRO SERVICES

Life Cycle of Netflix Recommendation Data

Devices

Data Collection

Offline Big Data Analysis

Netflix recommendation:

online services

Netflix API

Netflix beacon telemetry

Data Collection: explicit inputs

Star ratings

Data Collection: explicit inputs

Virtual plays from new user on-boarding

Outputs from offline analysis

Devices

Data Collection

online services

“Implicit” Data Services

Popularity Targeting

User clustering

Recommendations combine both online and aggregated offline data

Devices

Data Collection

online services

“Explicit” Data Services

My List On Ramp

Taste pref

Popularity Targeting

User clustering

WHY BOTHER WITH NEAR LINE SYSTEMS THEN?

Our algorithms became too complex to compute online without increasing latency.

Near line systems improve our availability story.

Near line systems allow us to innovate at a greater velocity.

Near line systems improve agility and availability

Devices

Data Collection

Big Data Analysis (Hadoop, Teradata)

online services

Pre-computed recommendations

Post-process at run time

Manhattan pre-compute engine
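
The split between pre-computed recommendations and run-time post-processing can be sketched roughly as below. This is an illustration only, not Netflix's implementation: the in-memory `PRECOMPUTED` store, the `recommend` function, the "drop recently played titles" step, and the fallback list are all hypothetical names and assumptions introduced here.

```python
# Hypothetical in-memory stand-in for the pre-computed recommendation store.
PRECOMPUTED = {
    "member-1": ["video-a", "video-b", "video-c", "video-d"],
}


def recommend(member_id, recently_played, fallback, limit=3):
    """Serve pre-computed results, post-processed with request-time state."""
    ranked = PRECOMPUTED.get(member_id)
    if ranked is None:
        # No pre-computed entry: fall back to a non-personalized list (assumed policy).
        ranked = fallback
    played = set(recently_played)
    # Assumed post-processing step: filter out titles the member just played.
    return [video for video in ranked if video not in played][:limit]
```

The expensive ranking happens ahead of time, so the online path is a lookup plus a cheap filter. That is one way such a design could keep latency low and keep serving results when the heavier algorithms are unavailable.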

Manhattan: Netflix pre-compute engine

Video Ranker

Row selection

Similars

Top picks

What data would improve recommendations even further?

All UI Events from all key platforms

• Moving beyond explicit inputs from users, we would like to track all member activity to derive deeper insights.

• Challenges include:

– 1000s of device platforms

– Non-standardized UIs across different platforms

– Lack of earlier focus on tracking the browse experience

Patterns arise in aggregate

Challenges with collecting UI Events

• Consistent data semantics across lots of device and UI platforms.

• Scaling to handle billions of events.

• Near real-time semantic data quality and validation

• Dealing with data loss (low power devices, loss at the network, etc.)

Canaries for data quality

Near real time feedback and validation on data quality.
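
A data-quality canary of this kind could look something like the following sketch. It assumes a made-up event schema (`REQUIRED_FIELDS`) and a made-up alert threshold; the slides do not describe the actual validation rules, so every field name and function here is hypothetical.

```python
from collections import defaultdict

# Hypothetical required fields; the real event schema is not given in the slides.
REQUIRED_FIELDS = {"event_type": str, "device_platform": str, "timestamp": int}


def is_valid(event):
    """Return True when every required field is present with the expected type."""
    return all(
        isinstance(event.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def failing_platforms(events, max_invalid_ratio=0.05):
    """Return platforms whose share of invalid events exceeds the threshold."""
    totals = defaultdict(int)
    invalid = defaultdict(int)
    for event in events:
        platform = str(event.get("device_platform", "unknown"))
        totals[platform] += 1
        if not is_valid(event):
            invalid[platform] += 1
    return sorted(
        platform
        for platform, total in totals.items()
        if invalid[platform] / total > max_invalid_ratio
    )
```

Run over a small, recent batch of events, a check like this gives per-platform feedback within minutes of a bad client release rather than after an offline job.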

“Trending” on Netflix

Now being AB tested

Near line systems for Netflix recommendations

Devices

Data Collection

Big Data Analysis (Hadoop, Teradata)

online services

Pre-computed recommendations

Post-process at run time

Near line data processing and serving systems

“Trending on Netflix” near line system

Take rates (play/impression), Kafka stream

Cassandra

dashboards

Stream processing (ETA: low number of minutes)

Play start (Kafka stream)

1000s / sec

Impressions (Kafka stream)

millions / sec

“Trending on Netflix” near line system

Play start (Kafka stream)

1000s / sec

Impressions (Kafka stream)

millions / sec

Stream processing

• Windowed operations.

• Small batches.

• Merging streams.

• Flexibility.

Take rates

Impressions rollup

Personalized Ranked videos

Merged to generate “Trending on Netflix”
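
The take-rate computation (plays divided by impressions, per window) can be sketched in plain Python as below. This is not the Spark Streaming job itself: the 60-second tumbling window, the `(timestamp, video_id)` event shape, and the tie-breaking merge with a personalized rank are assumptions made for illustration, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # assumed window size; the slides only say "low # of minutes"


def window_of(timestamp, size=WINDOW_SECONDS):
    """Map an event timestamp (seconds) to the start of its tumbling window."""
    return timestamp - (timestamp % size)


def take_rates(plays, impressions, size=WINDOW_SECONDS):
    """Merge two (timestamp, video_id) streams into per-window take rates.

    Returns {window_start: {video_id: plays / impressions}}.
    """
    play_counts = defaultdict(Counter)
    impression_counts = defaultdict(Counter)
    for ts, video in plays:
        play_counts[window_of(ts, size)][video] += 1
    for ts, video in impressions:
        impression_counts[window_of(ts, size)][video] += 1

    rates = {}
    for window, shown in impression_counts.items():
        rates[window] = {
            video: play_counts[window][video] / count
            for video, count in shown.items()
        }
    return rates


def trending(rates_for_window, personalized_rank, top_n=3):
    """Order videos by take rate; ties go to the better personalized rank."""
    return sorted(
        rates_for_window,
        key=lambda v: (-rates_for_window[v], personalized_rank.get(v, float("inf"))),
    )[:top_n]
```

Only videos that were actually shown in a window get a take rate, which avoids dividing by zero and mirrors the play/impression definition on the slide.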

Spark Streaming at Netflix

• Collaborating with Databricks to make sure Spark (batch and streaming) works well in a cloud environment

– Resiliency and scalability testing

• Actively studying the scaling requirements of our algorithms on both Spark batch and Spark streaming.

Spark at Netflix

• Several different use cases where we are interested in Spark – both batch and streaming.

• Largest Spark batch production cluster is 150 m3.2xl instances for personalization.

• Netflix has both Spark batch and Spark streaming in production.

Spark at Netflix

• Integrating with Spark through Scala (mostly), Python, and some SQL.

• Python typically via IPython notebook integration.

• Running in standalone mode or on Mesos.

Spark: areas to watch

• We have not really tested the multi-tenancy boundaries yet. For now we mostly spin up purpose-built clusters.

• Tuning jobs and optimizing their performance remains a challenge as we make steady inroads.

• Incrementally getting better with stability and scale as we tackle larger use cases this year.

Netflix Tech Blog

• Tech blog post about the “Trending on Netflix” row published today.

• Watch for an upcoming Netflix tech blog post on near line systems, and another about Spark, in the coming weeks.

Now Hiring leaders and engineers!

Talk to me in person or at

Twitter: @sgkrishnan

LinkedIn: https://www.linkedin.com/pub/gopal-krishnan/0/7a7/905