
Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Description:
Presentation given at PHP Day 2011 in Verona, Italy.
Transcript
Page 1: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

LARGE-SCALE DATA PROCESSING WITH HADOOP AND PHP

Page 2: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

David Zülke

Page 3: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

David Zuelke

Page 4: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)
Page 5: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Founder

Page 8: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Lead Developer

Page 11: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

FROM 30,000 FEET
Distributed and Parallel Computing

Page 12: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

we want to process data

Page 13: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

how much data exactly?

Page 14: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

SOME NUMBERS

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

• 12 TB (March 2010)

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

Page 15: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

what if you have that much data?

Page 16: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

what if you have just 1% of that amount?

Page 17: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

“no problemo”, you say? (or, if you’re Italian, “nessun problema”)

Page 18: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

reading 180 GB sequentially off a disk will take ~45 minutes
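The ~45 minute figure is easy to sanity-check. A quick back-of-the-envelope in PHP, assuming a sustained sequential read rate of about 68 MB/s (a plausible figure for a 2011-era 7200 rpm disk; the assumed rate is not from the slides):

```php
<?php
// Back-of-the-envelope check for the ~45 minute figure.
// Assumption: ~68 MB/s sustained sequential read throughput.
$dataMb   = 180 * 1024;          // 180 GB expressed in MB
$mbPerSec = 68;                  // assumed disk read rate
$minutes  = $dataMb / $mbPerSec / 60;
printf("~%.0f minutes\n", $minutes); // roughly 45 minutes
```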

Page 19: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

but you only have 16 GB or so of RAM per computer

Page 20: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

data can be processed much faster than it can be read

Page 21: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

solution: parallelize your I/O

Page 22: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

but now you need to coordinate what you’re doing

Page 23: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

and that’s hard

Page 24: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

what if a node dies?

Page 25: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

is data lost? will other nodes in the grid have to restart?

how do you coordinate this?

Page 26: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

ENTER: OUR HERO
Introducing MapReduce

Page 27: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

in the olden days, the workload was distributed across a grid

Page 28: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

but the data was shipped around between nodes

Page 29: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

or even stored centrally on something like a SAN

Page 30: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

I/O bottleneck

Page 31: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

along came a Google publication in 2004

Page 32: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 33: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

now the data is distributed

Page 34: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

computing happens on the nodes where the data already is

Page 35: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

processes are isolated and don’t communicate (share-nothing)

Page 36: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 37: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 38: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 39: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

REDUCER WILL RECEIVE THIS

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

212.122.174.13 43

74.119.8.111 91272

74.119.8.111 8371

Page 40: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 41: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

PSEUDOCODE

function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}

Input:

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

Output:

212.122.174.13  210238
74.119.8.111    99643
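emit() and parse_apache_log() in the pseudocode are framework helpers, not plain PHP. As a sketch, the whole flow can be simulated in one self-contained script: map each log line to an <ip, bytes> pair, group by key (the shuffle/sort Hadoop does for us), then reduce by summing. The simplistic field-splitting stands in for a real log parser.

```php
<?php
// Self-contained simulation of the map/reduce flow on the sample log.
$log = [
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272',
    '74.119.8.111 - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371',
    '212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43',
];

// Map phase: emit one <ip, bytes> pair per record.
$pairs = [];
foreach ($log as $line) {
    // crude stand-in for parse_apache_log(): first field is the IP,
    // last field the number of bytes transferred
    $fields  = explode(' ', $line);
    $pairs[] = [$fields[0], (int) end($fields)];
}

// Shuffle/sort phase: group all values by key, as Hadoop would.
$grouped = [];
foreach ($pairs as [$ip, $bytes]) {
    $grouped[$ip][] = $bytes;
}

// Reduce phase: sum the bytes per IP.
$totals = [];
foreach ($grouped as $ip => $values) {
    $totals[$ip] = array_sum($values);
}

print_r($totals); // totals: 210238 and 99643 bytes
```

The result matches the “after reduction” table: 210238 bytes for 212.122.174.13 and 99643 for 74.119.8.111.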

Page 42: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

A YELLOW ELEPHANT
Introducing Apache Hadoop

Page 44: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Hadoop is a MapReduce framework

Page 45: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

it allows us to focus on writing Mappers, Reducers etc.

Page 46: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

and it works extremely well

Page 47: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

how well exactly?

Page 48: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOP AT FACEBOOK (I)

• Predominantly used in combination with Hive (~95%)

• 8400 cores with ~12.5 PB of total storage

• 8 cores, 12 TB storage and 32 GB RAM per node

• 1x Gigabit Ethernet for each server in a rack

• 4x Gigabit Ethernet from rack switch to core

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hadoop is aware of racks and locality of nodes

Page 49: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOP AT FACEBOOK (II)

• Daily stats:

• 25 TB logged by Scribe

• 135 TB of compressed data scanned

• 7500+ Hive jobs

• ~80k compute hours

• New data per day:

• I/08: 200 GB

• II/09: 2 TB (compressed)

• III/09: 4 TB (compressed)

• I/10: 12 TB (compressed)

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Page 50: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest cluster:

• 4000 Nodes

• 2x4 CPU cores each

• 16 GB RAM each

• Over 40% of jobs run using Pig

http://wiki.apache.org/hadoop/PoweredBy

Page 51: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

OTHER NOTABLE USERS

• Twitter (storage, logging, analysis; heavy users of Pig)

• Rackspace (log analysis; data pumped into Lucene/Solr)

• LinkedIn (friend suggestions)

• Last.fm (charts, log analysis, A/B testing)

• The New York Times (converted 4 TB of scans using EC2)

Page 52: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

there’s just one little problem

Page 53: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

you need to write Java code

Page 54: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

however, there is hope...

Page 55: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

STREAMING
Hadoop Won’t Force Us To Use Java

Page 56: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Hadoop Streaming can use any script as Mapper or Reducer

Page 57: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

many configuration options (parsers, formats, combining, …)

Page 58: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

it works using STDIN and STDOUT

Page 59: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Mappers are streamed the records (usually by line: <line>\n)

and emit key/value pairs: <key>\t<value>\n
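As a sketch of such a mapper for the access.log example: it reads records from STDIN and writes tab-separated pairs to STDOUT. The log-parsing regex is a deliberately simplified assumption for illustration; a real parser would be more robust.

```php
#!/usr/bin/env php
<?php
// Minimal Hadoop Streaming mapper: one record (line) in on STDIN,
// one "<key>\t<value>\n" pair out on STDOUT.
// The regex grabs the first field (client IP) and the trailing
// number (bytes transferred) -- a simplification, not a full parser.
while (($line = fgets(STDIN)) !== false) {
    if (preg_match('/^(\S+) .* (\d+)\s*$/', $line, $m)) {
        echo $m[1], "\t", $m[2], "\n"; // IP as key, bytes as value
    }
}
```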

Page 60: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 61: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Caution: no separate Reducer processes per key (but keys are sorted)
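That means a streaming reducer must track key boundaries itself. A minimal sketch, assuming the tab-separated format shown above and summing values per key:

```php
#!/usr/bin/env php
<?php
// Minimal Hadoop Streaming reducer. Input arrives sorted by key,
// but as one continuous stream, so we must notice when the key
// changes and flush the running total ourselves.
$currentKey = null;
$sum = 0;

while (($line = fgets(STDIN)) !== false) {
    [$key, $value] = explode("\t", rtrim($line, "\n"), 2);
    if ($key !== $currentKey) {
        if ($currentKey !== null) {
            echo $currentKey, "\t", $sum, "\n"; // flush previous key
        }
        $currentKey = $key;
        $sum = 0;
    }
    $sum += (int) $value;
}
if ($currentKey !== null) {
    echo $currentKey, "\t", $sum, "\n"; // flush the last key
}
```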

Page 62: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS
Hadoop Distributed File System

Page 63: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS

• Stores data in blocks (default block size: 64 MB)

• Designed for very large data sets

• Designed for streaming rather than random reads

• Write-once, read-many (although appending is possible)

• Capable of compression and other cool things

Page 64: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS CONCEPTS

• Large blocks minimize the number of seeks and maximize throughput

• Blocks are stored redundantly (3 replicas by default)

• Aware of infrastructure characteristics (nodes, racks, ...)

• Datanodes hold blocks

• Namenode holds the metadata

The Namenode is a critical component of an HDFS cluster (a single point of failure; high availability requires extra measures)

Page 65: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

JOB PROCESSING
How Hadoop Works

Page 66: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Just like I already described! It’s MapReduce! \o/

Page 67: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

BASIC RULES

• Uses Input Formats to split up your data into single records

• You can optimize using combiners to reduce locally on a node

• Only possible in some cases, e.g. for max(), but not avg()

• You can control partitioning of map output yourself

• Rarely useful; the default partitioner (key hash) is enough

• And a million other things that really don’t matter right now ;)
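To see why a combiner is safe for max() but not for a naive avg(): max is associative, so combining local maxima still yields the global maximum, while averaging partial averages weights the partitions incorrectly. A quick illustration:

```php
<?php
// Why a combiner works for max() but not for naive avg().
$partitions = [[1, 9, 5], [2, 4]];

// max(): combining local maxima gives the correct global result.
$localMaxima = array_map('max', $partitions);
$combinedMax = max($localMaxima);                  // max(9, 4) = 9
$globalMax   = max(array_merge(...$partitions));   // also 9
assert($combinedMax === $globalMax);               // holds for any split

// avg(): averaging local averages weights partitions wrongly.
$avg = fn(array $xs) => array_sum($xs) / count($xs);
$combinedAvg = $avg(array_map($avg, $partitions)); // (5 + 3) / 2 = 4
$globalAvg   = $avg(array_merge(...$partitions));  // 21 / 5 = 4.2
assert($combinedAvg !== $globalAvg);
```

The standard fix for avg() is to have the combiner emit (sum, count) pairs and only divide in the final reducer.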

Page 68: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

oh and, if you’re wondering how Hadoop got its name

Page 69: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Doug Cutting

Page 70: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

STREAMING WITH PHP
Introducing HadooPHP

Page 71: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HADOOPHP

• A little framework to help with writing mapred jobs in PHP

• Takes care of input splitting, can do basic decoding et cetera

• Automatically detects and handles Hadoop settings such as key length or field separators

• Packages jobs as one .phar archive to ease deployment

• Also creates a ready-to-rock shell script to invoke the job

Page 72: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

written by

Page 73: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)
Page 75: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HANDS-ON
Hadoop Streaming & PHP in action

Page 76: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

THE HADOOP ECOSYSTEM
A Little Tour

Page 77: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

APACHE AVRO
Efficient Data Serialization System With Schemas

(compare: Facebook’s Thrift)

Page 78: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

CLOUDERA FLUME
Distributed Data Collection System

(compare: Facebook’s Scribe)

Page 79: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

APACHE HBASE
Like Google’s BigTable, Only That You Can Have It, Too!

Page 80: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HDFS
Your Friendly Distributed File System

Page 81: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

HIVE
Data Warehousing Made Simple With An SQL Interface

Page 82: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

PIG
A High-Level Language For Modeling Data Processing Tasks

(fulfills the same purpose as Hive)

Page 83: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

ZOOKEEPER
Your Distributed Applications, Coordinated

Page 84: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

The End

Page 85: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

RESOURCES

• http://www.cloudera.com/developers/learn-hadoop/

• Tom White: Hadoop. The Definitive Guide. O’Reilly, 2009

• http://www.cloudera.com/hadoop/

• Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …

Page 86: Large-Scale Data Processing with Hadoop and PHP (PHPDAY11 2011-05-14)

Questions?

