Luigi PyData presentation

Elias Freider
freider@spotify.com

Batch data processing in Python

Background

Billions of log messages (several TBs) every day
Usage and backend stats, debug information

What we want to do

AB-testing
Music recommendations
Monthly/daily/hourly reporting
Business metric dashboards

We experiment a lot – need quick development cycles

We have a lot of data

Why did we build Luigi?

How to do it?

Hadoop
Data storage
Clean, filter, join and aggregate data

Postgres
Semi-aggregated data for dashboards

Cassandra
Time series data

A bunch of data processing tasks with inter-dependencies

Defining a data flow

Input is a list of Timestamp, Track, Artist tuples

Example: Artist Toplist

Streams Artist Aggregation

Top 10 Database

Naive (non-luigi) approach

Example: Artist Toplist
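
The code from this slide is not part of the transcript. As a rough stand-in, a naive version of the flow might look like the sketch below: plain functions run back to back, with no record of which steps have finished. The function names and the sample data are made up for illustration.

```python
from collections import Counter

# Hypothetical sample input: (timestamp, track, artist) tuples.
STREAMS = [
    ("2013-03-05T10:00:00", "Track A", "Artist 1"),
    ("2013-03-05T10:03:00", "Track B", "Artist 2"),
    ("2013-03-05T10:07:00", "Track C", "Artist 1"),
]


def aggregate_artists(streams):
    """Count streams per artist."""
    return Counter(artist for _timestamp, _track, artist in streams)


def top_artists(counts, n=10):
    """Return the n most streamed artists as (artist, count) pairs."""
    return counts.most_common(n)


def insert_into_database(rows):
    """Stand-in for the database step: just format the rows."""
    return ["%s\t%d" % (artist, count) for artist, count in rows]


def main():
    # Every step runs every time; a failure halfway means starting over.
    counts = aggregate_artists(STREAMS)
    toplist = top_artists(counts)
    return insert_into_database(toplist)


if __name__ == "__main__":
    for line in main():
        print(line)
```

Nothing here knows whether a step already succeeded, which is the gap the following slides address.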

If a failed step leaves behind broken data, we want to clean that up

Errors will occur
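
One common way to avoid leaving broken data behind is to write to a temporary file and rename it into place only on success. The sketch below shows the idea with the standard library; the helper name atomic_write is hypothetical, and this is not Luigi's own implementation.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_write(path):
    """Yield a file object; the data appears at `path` only on success."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            yield f
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.remove(tmp_path)  # clean up the partial output
        raise
```

If the body of the with block raises, the target path is never created, so a later run cannot mistake partial output for a finished step.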

The second step fails, you fix it, then you want to resume

Avoid duplicate work
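
Resuming without redoing finished work comes down to checking whether a step's output already exists before running it. A minimal sketch, with the hypothetical helper name run_if_missing and a made-up processing step:

```python
import os
import tempfile


def run_if_missing(output_path, step):
    """Run `step(output_path)` only if its output does not exist yet.

    Returns True if the step ran, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False
    step(output_path)
    return True


def write_counts(path):
    # Stand-in for a real processing step.
    with open(path, "w") as f:
        f.write("Artist 1\t2\n")


if __name__ == "__main__":
    out = os.path.join(tempfile.mkdtemp(), "artist_counts.tsv")
    print(run_if_missing(out, write_counts))  # first call does the work
    print(run_if_missing(out, write_counts))  # second call is skipped
```

The check is only trustworthy when outputs never exist in a half-written state.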

To use data flows as command line tools

Parametrization

You want to run the dataflow for a set of similar inputs

Repeat!

Dead-simple dependency definitions (think GNU make)
Emphasis on Hadoop/HDFS integration
Atomic file operations
Powerful task templating using OOP
Data flow visualization
Command line integration

Main features
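
The make-like dependency idea can be sketched without Luigi installed. The stand-in Task class below has the same three hooks a Luigi task defines (requires, output, run) plus an output-exists completeness check, but the build function, the file layout and the sample data are assumptions for illustration, not Luigi's scheduler. In the talk Streams is the input data; here it writes two sample rows so the sketch runs on its own.

```python
import os
import tempfile


class Task:
    """Stand-in with the same three hooks a Luigi task defines."""

    def requires(self):
        return []  # tasks that must be complete before this one runs

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


def build(task, log):
    """Run `task` after its dependencies, skipping anything complete."""
    if task.complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = tempfile.mkdtemp()


class Streams(Task):
    def output(self):
        return os.path.join(WORKDIR, "streams.tsv")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("t1\tTrack A\tArtist 1\nt2\tTrack B\tArtist 1\n")


class AggregateArtists(Task):
    def requires(self):
        return [Streams()]

    def output(self):
        return os.path.join(WORKDIR, "artist_counts.tsv")

    def run(self):
        counts = {}
        with open(Streams().output()) as f:
            for line in f:
                artist = line.rstrip("\n").split("\t")[2]
                counts[artist] = counts.get(artist, 0) + 1
        with open(self.output(), "w") as f:
            for artist, count in sorted(counts.items()):
                f.write("%s\t%d\n" % (artist, count))
```

Calling build(AggregateArtists(), []) runs Streams first and then the aggregation; calling it again does nothing because both outputs exist.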

A Python framework for data flow definition and execution

Introducing Luigi

Luigi - Aggregate Artists

Run on the command line:

$ python dataflow.py AggregateArtists

DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74375] Running AggregateArtists()
INFO: [pid 74375] Done AggregateArtists()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Running Hadoop MapReduce utilizing Hadoop Streaming or custom jar-files
Running Hive and (soon) Pig queries
Inserting data sets into Postgres

Luigi comes with pre-implemented Tasks for...

Reduces code repetition by using Luigi’s object-oriented data model

Task templates and targets

Writing new ones is as easy as defining an interface and implementing run()

Built-in Hadoop Streaming Python framework

Hadoop MapReduce

Very slim interface
Fetches error logs from Hadoop cluster and displays them to the user
Class instance variables can be referenced in MapReduce code, which makes it easy to supply extra data in dictionaries etc. for map side joins
Easy to send along Python modules that might not be installed on the cluster
Basic support for secondary sort
Runs on CPython so you can use your favorite libs (numpy, pandas etc.)

Features
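
As a rough illustration of the mapper/reducer style used with Hadoop Streaming, the sketch below runs a map, sort and reduce pass locally in one process. It is a stand-alone simulation with a hypothetical class name and made-up input, not Luigi's Hadoop job class.

```python
from itertools import groupby
from operator import itemgetter


class ArtistCount:
    """Count streams per artist from tab-separated lines."""

    def mapper(self, line):
        _timestamp, _track, artist = line.rstrip("\n").split("\t")
        yield artist, 1

    def reducer(self, key, values):
        yield key, sum(values)


def run_local(job, lines):
    """Simulate map -> shuffle/sort -> reduce without a cluster."""
    mapped = [pair for line in lines for pair in job.mapper(line)]
    mapped.sort(key=itemgetter(0))  # Hadoop sorts by key between phases
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(job.reducer(key, (value for _key, value in group)))
    return results


LINES = [
    "2013-03-05T10:00:00\tTrack A\tArtist 1\n",
    "2013-03-05T10:03:00\tTrack B\tArtist 2\n",
    "2013-03-05T10:07:00\tTrack C\tArtist 1\n",
]
```

Running run_local(ArtistCount(), LINES) groups the mapped pairs by artist and sums them, which is the same contract a streaming job follows on the cluster.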

So you can use local modules remotely on the cluster

Attach Python modules

Set up local variables for map-side joins etc.

Local state is transferred
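
The idea of shipping local state for a map-side join can be sketched as follows: an object holds a small lookup dictionary, its attributes are serialized and restored elsewhere, and the mapper joins each record against the dictionary. The class, field names and data are hypothetical, and the slides do not describe how Luigi performs the transfer; the pickle round trip here is an assumption used only for illustration.

```python
import pickle


class JoinArtistNames:
    """Mapper that enriches records using a small in-memory table."""

    def __init__(self, artist_names):
        # Small side table; it travels with the job's state.
        self.artist_names = artist_names

    def mapper(self, line):
        artist_id, count = line.rstrip("\n").split("\t")
        name = self.artist_names.get(artist_id, "unknown")
        yield name, int(count)


job = JoinArtistNames({"a1": "Artist 1", "a2": "Artist 2"})

# Simulate the transfer: serialize the instance's attributes, then
# rebuild the job from them as a remote process would.
payload = pickle.dumps(vars(job))
remote_job = JoinArtistNames(**pickle.loads(payload))

joined = [pair for line in ["a1\t2\n", "a3\t5\n"]
          for pair in remote_job.mapper(line)]
```

Because the table lives in memory on every mapper, this only suits side data that is small compared with the main input.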

Top 10 artists - Wrapped arbitrary Python code
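
The top-10 step itself is ordinary Python. A minimal sketch, assuming the aggregation step produced tab-separated artist/count lines (the line format and the function name are assumptions):

```python
import heapq


def top_artists(lines, n=10):
    """Return the n (count, artist) pairs with the highest counts."""
    pairs = []
    for line in lines:
        artist, count = line.rstrip("\n").split("\t")
        pairs.append((int(count), artist))
    return heapq.nlargest(n, pairs)
```

Putting the count first in each pair lets heapq.nlargest order by count without a key function.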

Completing the top list

Basic functionality for exporting to Postgres. Cassandra support is in the works

Database support

Running it all...

DEBUG: Checking if ArtistToplistToDatabase() is complete
INFO: Scheduled ArtistToplistToDatabase()
DEBUG: Checking if Top10Artists() is complete
INFO: Scheduled Top10Artists()
DEBUG: Checking if AggregateArtists() is complete
INFO: Scheduled AggregateArtists()
DEBUG: Checking if Streams() is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
INFO: [pid 74811] Running AggregateArtists()
INFO: [pid 74811] Done AggregateArtists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 74811] Running Top10Artists()
INFO: [pid 74811] Done Top10Artists()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 74811] Running ArtistToplistToDatabase()
INFO: Done writing, importing at 2013-03-13 15:41:09.407138
INFO: [pid 74811] Done ArtistToplistToDatabase()
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time

Light-weight scheduler daemon with a web interface

Data flow visualization

Imagine how cool this would be with real data...

The results

Tasks have implicit __init__

Task Parameters

Generates command line interface with typing and documentation

Class variables with some magic

$ python dataflow.py AggregateArtists --date 2013-03-05
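
To illustrate how class-level parameter declarations can drive both an implicit constructor and a command line interface, here is a stand-alone sketch built on argparse. The Parameter, DateParameter and parse_task names mimic the idea only; Luigi's own classes are more elaborate and differ in detail.

```python
import argparse
import datetime


class Parameter:
    """Declares one task parameter as a class attribute."""

    def __init__(self, default=None, description=""):
        self.default = default
        self.description = description

    def parse(self, text):
        return text


class DateParameter(Parameter):
    def parse(self, text):
        return datetime.datetime.strptime(text, "%Y-%m-%d").date()


class Task:
    @classmethod
    def params(cls):
        """Collect the Parameter objects declared on the class."""
        return {name: value for name, value in vars(cls).items()
                if isinstance(value, Parameter)}

    def __init__(self, **kwargs):
        # The "implicit __init__": every declared parameter becomes an
        # instance attribute, falling back to its default.
        for name, param in self.params().items():
            setattr(self, name, kwargs.get(name, param.default))


class AggregateArtists(Task):
    date = DateParameter(description="day to aggregate (YYYY-MM-DD)")


def parse_task(task_cls, argv):
    """Build a parser from the declared parameters and instantiate."""
    parser = argparse.ArgumentParser(prog=task_cls.__name__)
    for name, param in task_cls.params().items():
        parser.add_argument("--" + name, type=param.parse,
                            default=param.default, help=param.description)
    return task_cls(**vars(parser.parse_args(argv)))


task = parse_task(AggregateArtists, ["--date", "2013-03-05"])
```

The parameter's parse method doubles as the argparse type, so the value arrives on the task as a date object rather than a string.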

Combined usage example

Task Parameters

Makes it really easy to create aggregates over time

Task Parameters
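
An aggregate over time can be expressed as one task per day plus a task that depends on all of them. The sketch below shows only the date expansion for an ISO week such as 2013-W11; the function name is hypothetical, and the parsing relies on the %G, %V and %u directives available in Python 3.6 and later.

```python
import datetime


def dates_in_week(week):
    """Expand an ISO week string like '2013-W11' into its seven dates."""
    # Append '-1' so the parsed date is the Monday of that ISO week.
    monday = datetime.datetime.strptime(week + "-1", "%G-W%V-%u").date()
    return [monday + datetime.timedelta(days=i) for i in range(7)]


days = dates_in_week("2013-W11")
```

A weekly task would then require one daily task per date in this list.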

Basic multi-processing

Multiple workers

$ python dataflow.py --workers 3 AggregateArtists --date_interval 2013-W11

Large data flows

[Figure: task dependency graph from the Luigi overview page (localhost:8081, 3/16/13). Nodes include UserRecs, IngestUserRecs, AccumulateUserMatrices, AccumulateByArtists, AggregateUserMatrices, AggregateByArtists, AggregateTracks, JoinArtistGids, UserLocationDay, UserLocationPeriod, PlaylistVectors, RelatedArtistsTC, MasterMetadata, EndSongCleaned and ArtistFollows, each labeled with its parameters (for example test=False, date=2013-03-16, test_users=False), for dates from 2013-03-06 to 2013-03-17.]

(Screenshot from web interface)

Visualization needs some work

Really large...

Great for automated execution

Error notifications

Prevents two identical tasks from running simultaneously

Process Synchronization

luigid
Simple task synchronization

[Diagram: Data flow 1 and Data flow 2 both depend on a common dependency task]
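
The synchronization idea can be sketched as a central registry that hands each task id to at most one worker at a time. This is an in-process illustration with hypothetical names and a made-up task id, not the luigid protocol.

```python
import threading


class TaskRegistry:
    """Grants each task id to a single worker until it is released."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = {}  # task id -> worker currently running it

    def claim(self, task_id, worker):
        """Return True if `worker` may run `task_id` now."""
        with self._lock:
            if task_id in self._running:
                return False
            self._running[task_id] = worker
            return True

    def release(self, task_id, worker):
        """Mark the task as no longer running, if `worker` holds it."""
        with self._lock:
            if self._running.get(task_id) == worker:
                del self._running[task_id]


registry = TaskRegistry()
first = registry.claim("CommonDependency(date=2013-03-16)", "data flow 1")
second = registry.claim("CommonDependency(date=2013-03-16)", "data flow 2")
```

The second data flow is refused while the first holds the task, so the shared dependency runs once rather than twice in parallel.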

Not just another Hadoop streaming framework!

What we use Luigi for

Hadoop Streaming
Java Hadoop MapReduce
Hive
Pig
Local (non-hadoop) data processing
Import/Export data to/from Postgres
Insert data into Cassandra
scp/rsync/ftp data files and reports
Dump and load databases

Others using it with Scala MapReduce and MRJob as well

Originated at Spotify
Based on many years of experience with data processing
Recent contributions by Foursquare and Bitly
Open source since September 2012

http://github.com/spotify/luigi

Luigi is open source

For more information feel free to reach out at http://github.com/spotify/luigi

Thank you!

Oh, and we’re hiring – http://spotify.com/jobs
