Hadoop, Pig, and Python (PyData NYC 2012)

Overview of this session

Why Python on Hadoop?
Fast Hadoop overview
Jython
Python
MrJob
Pig
(How they work, challenges, efficiency, how to start)

Too much data for one machine

Data doubles every 18 months

ETL / Munging

Cleanse
Format
Simple calculations

Social Graph

Predict

Detect

Genetics

Hadoop: rapid overview

MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
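To make the programming model concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases using word count. It runs on one machine and is not Hadoop code; the function names are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: collapse the values for one key into a single result."""
    return word, sum(counts)


def run_word_count(lines):
    """Chain the three phases over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

On a cluster, the map and reduce calls run on many machines and the framework handles the shuffle; only the two user-supplied functions resemble what is written here.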

Hadoop: rapid overview

Hadoop implements MapReduce (Java) (Doug Cutting)
Incubated at Yahoo
Indexing, spam detection, more

Hadoop: problems

Difficult
Not much Python
Batch only (...or it was)

Hadoop: future

YARN
MapReduce optional
Generic management + distributed apps
Impala

Hadoop and Python

Jython on Hadoop (map)

Jython on Hadoop (reduce; 1st half)

Jython on Hadoop (reduce; 2nd half)

Jython on Hadoop

Python on Hadoop

Streaming

(Works with any language, not just
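A streaming job hands each task plain text lines and collects the lines it writes back. The sketch below assumes the common tab-separated key/value convention and assumes the reducer's input arrives sorted by key; the function names are illustrative. Both steps take any iterable of lines, so they can be tried locally before being wired to stdin and stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts for each word.

    Assumes the input is already sorted by key, as it is when the
    framework sorts map output before handing it to the reduce phase.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map, sort, reduce on one machine for a quick check."""
    return list(reducer(sorted(mapper(lines))))
```

In a real job each function would live in its own script that loops over `sys.stdin` and prints each result, which is why the same approach works for any language that can read and write text.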

MrJob (Python) on Hadoop

Streaming + local / EMR / your Hadoop

MrJob (Python) on Hadoop

Multi-step jobs
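A multi-step job feeds the output of one map/reduce pass into the next. The sketch below illustrates that idea in plain Python and deliberately does not use MrJob's own classes; `run_step` and every other name here are hypothetical helpers. Step 1 counts words, step 2 funnels all counts to a single key and picks the largest.

```python
from itertools import groupby


def run_step(records, mapper, reducer):
    """Run one map / sort / reduce pass over (key, value) records in memory."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    mapped.sort(key=lambda pair: pair[0])
    output = []
    for key, group in groupby(mapped, key=lambda pair: pair[0]):
        output.extend(reducer(key, [value for _, value in group]))
    return output


# Step 1: count words.
def split_words(_, line):
    for word in line.split():
        yield word, 1


def sum_counts(word, counts):
    yield word, sum(counts)


# Step 2: send every (word, count) to one key and keep the largest count.
def to_single_key(word, count):
    yield "max", (count, word)


def pick_max(_, count_word_pairs):
    count, word = max(count_word_pairs)
    yield word, count


def most_common_word(lines):
    """Chain the two steps: the output of step 1 is the input of step 2."""
    step1 = run_step([(None, line) for line in lines], split_words, sum_counts)
    step2 = run_step(step1, to_single_key, pick_max)
    return step2[0]
```

Routing everything to one key in the second step means a single reducer sees all the counts, which is fine for small intermediate results but would be a bottleneck for large ones.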

Pig on Hadoop

Less code
Expressive code

Pig: brief, expressive

(thanks: Twitter Hadoop World presentation)

For serious: The Same Script, In

Pig on Hadoop

Less code
Expressive code
Compiles to MR
Insulates from API
Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford

Pig on Hadoop

Works with Jython
Not Python
Stream, no types
UDF read stdin
UDF deserialize, no types
Serialize for Pig
Write to stdout
Exceptions

Pig + Python on Hadoop
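The streaming steps listed above (read stdin, deserialize untyped fields, serialize for Pig, write to stdout, handle exceptions) can be sketched as follows. The three-field schema of user, clicks, and views is a hypothetical example, and tab-delimited records are an assumption about how the stream is configured.

```python
def parse_record(line):
    """Deserialize one tab-delimited line.

    Streamed fields arrive as untyped text, so the script casts them itself.
    """
    user, clicks, views = line.rstrip("\n").split("\t")
    return user, int(clicks), int(views)


def click_rate(clicks, views):
    """The actual calculation; guards against division by zero."""
    return clicks / views if views else 0.0


def process(lines):
    """Turn input lines into tab-delimited output lines for Pig to read back."""
    for line in lines:
        try:
            user, clicks, views = parse_record(line)
        except ValueError:
            # Wrong field count or a non-numeric value. This sketch skips the
            # row; a real script might log it or emit it to an error stream.
            continue
        yield f"{user}\t{click_rate(clicks, views):.4f}"
```

In a streamed script, `process(sys.stdin)` would be looped over and each result printed, so one bad row does not kill the whole task.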

Hadoop won’t magically parallelize your algorithm

Hadoop + Python: not actually magic

Don’t stream Java-based languages
• Jython
• Pig + Jython

Streaming has ~30% overhead
• Python
• MrJob
• Pig + Python

Hadoop + Python: efficiency

Well... 90-95% of time isn’t spent on algos

Hadoop + Python: excited?

Get Hadoop running
Software where it needs to be
Processes communicating
Data available

Hadoop + Python, hard stuff: setup

Learn
Project structure, modularity
Dev environment like production

Hadoop + Python, hard stuff: develop

Syntax check
Packages available
Data readable
Data writable
Without long waits for failure

Hadoop + Python, hard stuff: validate
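Some of the validation checks above can be run locally in well under a second, before a job is ever submitted. The sketch below uses only the standard library; the function names are illustrative, and the path check is a local stand-in, since data on a cluster usually lives in HDFS or S3 and would need that system's own tooling.

```python
import ast
import importlib.util
import os


def check_syntax(source):
    """Return None if the source parses, else a short error description."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"syntax error on line {err.lineno}: {err.msg}"
    return None


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def unreadable_paths(paths):
    """Return the local paths that do not exist or cannot be read."""
    return [path for path in paths if not os.access(path, os.R_OK)]
```

Running checks like these on the submitting machine only helps if that machine resembles the cluster nodes, which ties back to keeping dev and production environments alike.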

Distributed execution is hard to debug

Hadoop + Python, hard stuff: debug

Data processing is hard to test
But critical

Hadoop + Python, hard stuff: test
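One way to make data processing testable is to keep each transformation a pure function of its input record, so it can be exercised without a cluster. The `cleanse` function and its two-field CSV format below are hypothetical, chosen only to illustrate the pattern.

```python
import unittest


def cleanse(line):
    """Hypothetical ETL step: trim fields, lowercase the name, cast the amount.

    Returns None for malformed rows so the caller can drop them.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 2:
        return None
    name, amount = parts
    try:
        return name.lower(), float(amount)
    except ValueError:
        return None


class CleanseTest(unittest.TestCase):
    def test_well_formed_row(self):
        self.assertEqual(cleanse(" Ann , 3.50 "), ("ann", 3.5))

    def test_malformed_rows_are_dropped(self):
        self.assertIsNone(cleanse("only-one-field"))
        self.assertIsNone(cleanse("bob,not-a-number"))
```

Saved in a file, these tests run with `python -m unittest`, with no Hadoop installation involved.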

Environments identical
Code correctly deployed
Configuration changes
Non-disruptive

Hadoop + Python, hard stuff: deploy

Stats about prior runs
What code was run?
What’s changed?

Hadoop + Python, hard stuff: history

Distributed logs hard to make sense of
Hadoop logs hard to understand
Ephemeral clusters lose logs

Hadoop + Python, hard stuff: logs

Setup: PaaS, pip installation, connectors
Develop: learning, structure, instant dev env
Validate: fast validate
Debug: printf, more coming
Test: Rails-like test suites
Deploy: one-button deploy

Hadoop + Python, hard stuff: Mortar’s approach

K Young
