Post on 06-May-2015
PyData NYC 2012
Hadoop, Pig, and Python
Overview OF THIS SESSION
Why Python on Hadoop?
Fast Hadoop overview
Jython
Python
MrJob
Pig
(How they work, challenges, efficiency, how to start)
Too much data FOR ONE MACHINE
Data doubles every 18 months
ETL / Munging
Cleanse
Format
Simple calculations
Social Graph
Predict
Detect
Genetics
Hadoop RAPID OVERVIEW
MapReduce programming model from Google
(Jeff Dean and Sanjay Ghemawat)
Hadoop RAPID OVERVIEW
Hadoop implements MapReduce (Java)
(Doug Cutting)
Incubated at Yahoo
Indexing, Spam detection, more
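To make the programming model concrete, here is a minimal single-process sketch of the three phases (map, shuffle, reduce) using word count. It is plain Python, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are made up for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the job the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))
```

On a cluster the map and reduce calls run on many machines and the shuffle moves data between them; the sketch only shows the shape of the computation.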
Hadoop PROBLEMS
Difficult
Not much Python
Batch only (...or it was)
Hadoop FUTURE
YARN
MapReduce optional
Generic management + distributed apps
Impala
Hadoop AND PYTHON
Jython ON HADOOP (MAP)
Jython ON HADOOP (REDUCE; 1ST HALF)
Jython ON HADOOP (REDUCE; 2ND HALF)
Jython ON HADOOP
Python ON HADOOP
Streaming
(Works with any language, not just
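A streaming job is just a pair of programs that read lines on stdin and write tab-separated key/value lines on stdout. The sketch below is one possible word-count mapper and reducer in a single file; the file name `wc.py` and the `map` / `reduce` command-line argument are assumptions for illustration, not part of Hadoop.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Turn each input line into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts per word; relies on input already sorted by key,
    which the framework does between the map and reduce steps."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical entry point: `python wc.py map` or `python wc.py reduce`.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

The same script can be checked locally without a cluster: `cat input.txt | python wc.py map | sort | python wc.py reduce`.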
MrJob (Python) ON HADOOP
Streaming + local / EMR / your Hadoop
MrJob (Python) ON HADOOP
Multi-step jobs
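A multi-step job feeds the output of one map/reduce pass into the next. The sketch below shows that chaining in plain Python so it runs anywhere; it deliberately does not use MrJob's own classes, and `run_step` and the other names are hypothetical. Step 1 counts words, step 2 picks the most frequent one.

```python
from collections import defaultdict


def run_step(pairs, mapper, reducer):
    """Run one map/shuffle/reduce pass over (key, value) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


# Step 1: count how often each word occurs.
def split_words(_, line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, counts):
    yield word, sum(counts)


# Step 2: send every (count, word) to a single key and keep the largest.
def to_single_key(word, count):
    yield "all", (count, word)


def pick_max(_, count_word_pairs):
    yield max(count_word_pairs)


def most_common_word(lines):
    """Chain the two steps: step 1 output is step 2 input."""
    step1 = run_step(((None, line) for line in lines), split_words, sum_counts)
    step2 = run_step(step1, to_single_key, pick_max)
    return step2[0]
```

In a real multi-step job each step is a separate pass over the cluster, so every extra step adds a full round of reading, shuffling, and writing.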
Pig ON HADOOP
Less code
Expressive code
Pig BRIEF, EXPRESSIVE
(thanks: Twitter Hadoop World presentation)
The Same Script, In
FOR SERIOUS
Pig ON HADOOP
Less code
Expressive code
Compiles to MR
Insulates from API
Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford
Pig ON HADOOP
Works with Jython
Not Python
Stream, no types
UDF read stdin
UDF deserialize, no types
Serialize for Pig
Write to stdout
Exceptions
Pig + Python ON HADOOP
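The streaming steps listed above (read stdin, deserialize untyped fields, serialize for Pig, write to stdout, handle exceptions) might look roughly like this. It is a sketch under assumptions: rows are taken to be tab-delimited `user<TAB>clicks` lines, and the field names, the 100-click threshold, and all function names are invented for illustration.

```python
import sys


def parse_row(line):
    """Deserialize one streamed row; every field arrives as an untyped string."""
    user, clicks = line.rstrip("\n").split("\t")
    return user, int(clicks)  # cast by hand, since the stream carries no types


def format_row(*fields):
    """Serialize fields back into one tab-delimited line for Pig to read."""
    return "\t".join(str(field) for field in fields)


def process(lines, err=sys.stderr):
    """Label each row 'heavy' or 'light'; report and skip rows that fail to parse."""
    for number, line in enumerate(lines, start=1):
        try:
            user, clicks = parse_row(line)
        except ValueError as exc:
            # A bad row is logged to stderr instead of killing the whole task.
            print(f"skipping line {number}: {exc}", file=err)
            continue
        yield format_row(user, clicks, "heavy" if clicks >= 100 else "light")


def main():
    """Wire the processing loop to stdin and stdout."""
    for row in process(sys.stdin):
        sys.stdout.write(row + "\n")
```

Calling `main()` at the bottom of the script turns it into a filter that a Pig streaming step can pipe rows through. Whether to skip bad rows or fail fast is a design choice; skipping keeps long jobs alive but needs the stderr output to be checked afterwards.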
Hadoop won’t magically parallelize your algorithm
Hadoop + Python NOT ACTUALLY MAGIC
Don’t stream Java-based languages
• Jython
• Pig + Jython
Streaming has ~30% overhead
• Python
• MrJob
• Pig + Python
Hadoop + Python EFFICIENCY
Well... 90-95% of time isn’t spent on algos
Hadoop + Python EXCITED?
Get Hadoop running
Software where it needs to be
Processes communicating
Data available
Hadoop + Python HARD STUFF: SETUP
Learn
Project structure, modularity
Dev environment like Production
Hadoop + Python HARD STUFF: DEVELOP
Syntax check
Packages available
Data readable
Data writable
Without long waits for failure
Hadoop + Python HARD STUFF: VALIDATE
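The validation checks above can be run on the submitting machine before any cluster time is spent. The following is a minimal sketch, assuming local file paths (data on HDFS or S3 would need a different readability check); `preflight` and its parameters are hypothetical names.

```python
import ast
import importlib.util
import os


def preflight(source, packages, inputs, output_dir):
    """Return a list of problems found before the job is submitted."""
    problems = []

    # Syntax check: parse the script without executing it.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        problems.append(f"syntax error on line {exc.lineno}: {exc.msg}")

    # Packages available: top-level package names only.
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"package not installed: {name}")

    # Data readable and data writable (local paths only in this sketch).
    for path in inputs:
        if not os.access(path, os.R_OK):
            problems.append(f"input not readable: {path}")
    if not os.access(output_dir, os.W_OK):
        problems.append(f"output not writable: {output_dir}")

    return problems
```

An empty list means the cheap checks passed; anything else fails in seconds rather than after a long wait for the cluster to report it.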
Distributed execution is hard to debug
Hadoop + Python HARD STUFF: DEBUG
Data processing is hard to test
But critical
Hadoop + Python HARD STUFF: TEST
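One way to make data processing testable is to keep each transformation a pure function and test it on a handful of hand-written rows before it ever touches a cluster. A small sketch with the standard `unittest` module follows; `clean_record` and its `name,age` input format are invented for illustration.

```python
import unittest


def clean_record(line):
    """Hypothetical cleansing step: 'name,age' -> (Name, age), or None if invalid."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 2 or not parts[1].isdigit():
        return None
    return parts[0].title(), int(parts[1])


class CleanRecordTest(unittest.TestCase):
    def test_valid_row_is_normalized(self):
        self.assertEqual(clean_record(" ada lovelace , 36 "), ("Ada Lovelace", 36))

    def test_malformed_rows_are_rejected(self):
        for bad in ["", "no-age", "bob,abc", "a,1,extra"]:
            self.assertIsNone(clean_record(bad))
```

Saved as a module, the suite runs with `python -m unittest`. The malformed-row cases matter most, since real input tends to contain rows the happy path never sees.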
Environments identical
Code correctly deployed
Configuration changes
Non-disruptive
Hadoop + Python HARD STUFF: DEPLOY
Stats about prior runs
What code was run?
What’s changed?
Hadoop + Python HARD STUFF: HISTORY
Distributed logs hard to make sense of
Hadoop logs hard to understand
Ephemeral clusters lose logs
Hadoop + Python HARD STUFF: LOGS
Setup: PaaS, pip installation, connectors
Develop: learning, structure, instant dev env
Validate: fast validate
Debug: printf, more coming
Test: Rails-like test suites
Deploy: one-button deploy
Hadoop + Python HARD STUFF: MORTAR’S APPROACH
K Young
@kky