Hadoop summit-ams-2014-04-03

Post on 21-Aug-2014

description

Criteo slides from the Hadoop Summit in Amsterdam

transcript

HADOOP, FROM LAB TO 24/7 PRODUCTION

http://criteolabs.com/jobs

Jean-Baptiste NOTE

jb.note@criteo.com

Ana DIN

a.din@criteo.com

From the Criteo HPC Team (+ Loïc / Serge / Maxime / Samuel / Yann / Stuart)

ABOUT US

CRITEO ?

6 DATA CENTERS, 4 CONTINENTS. 120 BILLION REQUESTS/DAY*.

* EVERY DAY CRITEO IS CALLED MORE THAN 100 BILLION TIMES BY ADVERTISERS AND PUBLISHERS

54 OPEN POSITIONS IN PARIS’ R&D: http://criteolabs.com/jobs

TALES OF A TECHNOLOGY ADOPTION

« Anything that can go wrong will go wrong » -- Murphy’s Law

USAGE GROWTH

Usage of Hadoop is growing exponentially

• Learning curve is real
• Analysts discover interesting things with raw data
  – Which causes them to ask more questions
• Increased insight leads to a better product
  – Which leads to more data
• Data gains in value and more is kept (and studied!)
• YOU (the admin) are the bottleneck!

TOPICS

• Administration automation
• Hadoop configuration tuning
• Network
• Multitenancy

ADMINISTRATION AUTOMATION

AUTOMATING DEPLOYMENTS

Rack and load!

• Machine is racked, cabled and provisioned for a role
• Chef is our one-stop shop for automation
• Diskless system install

INSTA-CLUSTER!

OS DISKLESS: WHY

• Learn from the past
• Previous cluster: 1.5 years of operation
• 78% failure rate on /dev/sda at restart
• Disk usage symmetry
• Guaranteed statelessness

OS DISKLESS: HOW

• PXE boot on a custom CentOS image
• Automated Chef bootstrap
• Everything done by Chef
  – Inventory
  – Firmware updates
  – OS / service deployment

MAINTENANCE

• Evolutionary maintenance (version bumps)
• Not much to do in normal operations
• Most frequent issue is a flaky / slow-performing host
• Use Preprod / Prod for infra changes
• Progressive rollout vs. blackout

MONITORING

• User-facing interfaces
• JobTracker
• Fsimage checkpointing
• HDFS usage and local disk usage
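The fsimage checkpointing item can be illustrated with a small check on the age of the newest fsimage file. This is only a sketch under assumptions not stated in the slides: the directory layout (files whose names start with `fsimage`), the one-hour checkpoint period, and the "two periods" alert threshold are all illustrative, and the function names are hypothetical.

```python
import os
import time
from typing import Optional

# Assumption: checkpoints are expected roughly every hour, and we alert
# once the newest fsimage is older than two periods.
CHECKPOINT_PERIOD_S = 3600
MAX_AGE_S = 2 * CHECKPOINT_PERIOD_S


def newest_fsimage_age(name_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the most recent fsimage file, or None if none exists."""
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(name_dir)
        if entry.is_file() and entry.name.startswith("fsimage")
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def checkpoint_is_stale(name_dir: str, now: Optional[float] = None,
                        max_age_s: float = MAX_AGE_S) -> bool:
    """True when no fsimage exists or the newest one is older than max_age_s."""
    age = newest_fsimage_age(name_dir, now)
    return age is None or age > max_age_s
```

A monitoring probe could call `checkpoint_is_stale()` on the NameNode's metadata directory and raise an alert when it returns `True`.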

HADOOP CONFIG TUNING

SYSTEM CONFIGS

• Hadoop is a DDoS on your infrastructure
  – Increase ARP retention (L2-specific)
  – Use NSCD
• Increase read-ahead
• Disable THP compaction
• MTU: jumbo frames
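Several of these system settings can be audited by reading Linux sysfs files. The sketch below is one possible check, not the team's tooling: the thresholds (1024 KB read-ahead, MTU 9000), the device and interface names, and the function names are assumptions, and the THP path differs on some distributions.

```python
from pathlib import Path
from typing import List, Optional


def active_option(text: str) -> str:
    """Return the bracketed (active) choice, e.g. 'never' from 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active option in {text!r}")


def read_setting(path: str) -> Optional[str]:
    """Read a sysfs value, or return None when the file is absent."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


def evaluate(thp_defrag: Optional[str], read_ahead_kb: Optional[str], mtu: Optional[str],
             min_read_ahead_kb: int = 1024, min_mtu: int = 9000) -> List[str]:
    """List the settings that differ from the advice above (thresholds are assumptions)."""
    findings = []
    if thp_defrag is not None and active_option(thp_defrag) != "never":
        findings.append("THP compaction (defrag) is enabled")
    if read_ahead_kb is not None and int(read_ahead_kb) < min_read_ahead_kb:
        findings.append(f"read-ahead is {read_ahead_kb} KB")
    if mtu is not None and int(mtu) < min_mtu:
        findings.append(f"MTU is {mtu}, jumbo frames not enabled")
    return findings


def audit(device: str = "sda", interface: str = "eth0") -> List[str]:
    """Audit one block device and one network interface (names are placeholders)."""
    return evaluate(
        read_setting("/sys/kernel/mm/transparent_hugepage/defrag"),
        read_setting(f"/sys/block/{device}/queue/read_ahead_kb"),
        read_setting(f"/sys/class/net/{interface}/mtu"),
    )
```

Keeping `evaluate()` separate from the file reads makes the rules easy to test without touching a real host.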

CLUSTER CONFIGS

• Adjust log settings (default is INFO,console)
• Increase handler counts (JT, NN, DN)
• Use dfs.namenode.service.handler.count
• Watch out for checkpointing loops
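As an illustration of the handler-count item, the following sketch renders a Hadoop-style `<configuration>` fragment from a dictionary. The property names are the ones found in Hadoop 2's `hdfs-default.xml` (older releases may use different names); the values are placeholders, not recommendations from the talk, and `render_site_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Union

# Illustrative values only; appropriate counts depend on cluster size.
HDFS_HANDLERS: Dict[str, int] = {
    "dfs.namenode.handler.count": 64,          # client-facing NameNode RPC threads
    "dfs.namenode.service.handler.count": 32,  # DataNode/service RPC threads
    "dfs.datanode.handler.count": 16,          # DataNode server threads
}


def render_site_xml(properties: Dict[str, Union[str, int]]) -> str:
    """Render properties as a Hadoop *-site.xml <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(HDFS_HANDLERS))
```

Note that the service handler count only takes effect when a separate service RPC address (`dfs.namenode.servicerpc-address`) is configured, so DataNode traffic stops competing with client requests for the same threads.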

NETWORK

NETWORK TOPOLOGY

• One datacenter topology will not fit all
• Web traffic vs. Hadoop traffic
• Historical fat-tree hierarchy with layer 2 routing
• Switched to a meshed design (soon layer 3)

HADOOP TOPOLOGY

• Rack awareness (of course!)
  – Performance
  – Reliability
  – Maintenance (e.g. relocation)
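Rack awareness is usually wired in through a topology script: Hadoop invokes the script configured as `net.topology.script.file.name` (`topology.script.file.name` in Hadoop 1) with one or more host addresses and reads one rack path per argument from standard output. A minimal sketch, assuming racks map to subnets; the subnet table and rack names are invented for illustration.

```python
import ipaddress
import sys

# Hypothetical subnet-to-rack mapping; a real one would come from inventory.
RACKS = {
    "10.0.1.0/24": "/dc1/rack1",
    "10.0.2.0/24": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host: str) -> str:
    """Map an IP address to its rack path, falling back to the default rack."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Hostnames are not resolved in this sketch.
        return DEFAULT_RACK
    for cidr, rack in RACKS.items():
        if addr in ipaddress.ip_network(cidr):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Hadoop passes several addresses per call; print one rack per argument.
    for arg in sys.argv[1:]:
        print(resolve_rack(arg))
```

Encoding the datacenter in the rack path (`/dc1/rack1`) keeps the mapping readable when hosts are relocated, which is the maintenance case mentioned above.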

MULTITENANCY

• HDFS quotas
• Scheduling (user-facing)
• Map / Reduce ratio
• Use YARN!
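For the HDFS quotas item, one detail worth showing is that the space quota is charged in raw bytes, replicas included, so a tenant's logical allowance has to be multiplied by the replication factor. The sketch below builds the corresponding `hdfs dfsadmin` command lines; the directory, sizes, and the helper name `quota_commands` are hypothetical.

```python
from typing import List


def quota_commands(directory: str, logical_bytes: int, max_names: int,
                   replication: int = 3) -> List[List[str]]:
    """Build dfsadmin commands for a name quota and a replica-aware space quota."""
    raw_bytes = logical_bytes * replication  # space quota counts every replica
    return [
        ["hdfs", "dfsadmin", "-setQuota", str(max_names), directory],
        ["hdfs", "dfsadmin", "-setSpaceQuota", str(raw_bytes), directory],
    ]


if __name__ == "__main__":
    # Example: 10 TiB of logical data and one million names for one team.
    for cmd in quota_commands("/user/team-a", 10 * 1024**4, 1_000_000):
        print(" ".join(cmd))
```

The commands are returned as argument lists so they can be passed to `subprocess.run` without shell quoting issues.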

SECURITY

KERBEROS SETUP

• Dedicated KDC / realm
• Dedicated service principals
• Cross-realm trusts
• Delegate user management to your IT
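The naming conventions behind this setup can be sketched in a few lines. Service principals follow the `service/host@REALM` pattern, and a one-way cross-realm trust relies on a `krbtgt/SERVICE.REALM@USER.REALM` principal created in both KDCs. The realm and host names below are placeholders, and the helper functions are hypothetical.

```python
def service_principal(service: str, host: str, realm: str) -> str:
    """Build a per-host service principal such as hdfs/nn1.example.com@HADOOP.EXAMPLE."""
    return f"{service}/{host.lower()}@{realm.upper()}"


def cross_realm_principal(service_realm: str, user_realm: str) -> str:
    """Principal that lets users of user_realm obtain tickets for services in service_realm."""
    return f"krbtgt/{service_realm.upper()}@{user_realm.upper()}"


if __name__ == "__main__":
    # Placeholder names: a dedicated Hadoop realm trusting the corporate realm.
    print(service_principal("hdfs", "nn1.example.com", "hadoop.example"))
    print(cross_realm_principal("hadoop.example", "corp.example"))
```

With this split, the Hadoop realm holds only service principals, while user accounts stay in the realm managed by IT.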

HTTPFS PROXIES

• Use multiple proxies
• Easy way to interconnect to the outside world
• Data injection / read with a simple curl
• High bandwidth transfers
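HttpFS exposes the WebHDFS REST API, so the "simple curl" amounts to an HTTP request against a `/webhdfs/v1/<path>?op=...` URL. The sketch below builds such URLs and rotates over several proxies. The host names are placeholders, port 14000 is the usual HttpFS default, and `user.name` is the pseudo-authentication parameter; on a Kerberized cluster a client would authenticate with SPNEGO instead.

```python
from itertools import cycle
from urllib.parse import quote, urlencode

# Placeholder proxy addresses; rotating over them spreads the load.
PROXIES = cycle(["httpfs1.example.com:14000", "httpfs2.example.com:14000"])


def webhdfs_url(proxy: str, path: str, op: str, user: str, **params: str) -> str:
    """Build a WebHDFS/HttpFS URL for the given HDFS path and operation."""
    query = urlencode({"op": op, "user.name": user, **params})
    return f"http://{proxy}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Each call picks the next proxy in round-robin order.
    print(webhdfs_url(next(PROXIES), "/logs/2014-04-03/part-0.json.gz", "OPEN", "alice"))
    print(webhdfs_url(next(PROXIES), "/logs/2014-04-03", "LISTSTATUS", "alice"))
```

The printed URL can be passed directly to `curl` to read the file through the chosen proxy.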

FILE FORMATS

• Multiple use cases (ML, BI analytics)
• Baseline JSON (+gzip) is OK
• Don’t optimize too early
• We still use it(*) at petabyte scale

(*) some teams also use Parquet and contributed to Hive integration
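The baseline format is easy to produce and consume with standard tooling. A minimal sketch, assuming one JSON object per line inside a gzip file; the field names in the example are invented.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def write_events(path: str, events: Iterable[Dict]) -> None:
    """Write one JSON object per line into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")


def read_events(path: str) -> Iterator[Dict]:
    """Stream JSON objects back from a gzip-compressed, line-delimited file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    write_events("events.json.gz", [{"id": 1, "kind": "display"}, {"id": 2, "kind": "click"}])
    print(list(read_events("events.json.gz")))
```

One trade-off to keep in mind: gzip files are not splittable, so each file is read by a single map task, and file sizes matter more than with a columnar format.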

QUESTIONS ?

Did I say we’re hiring?

We’re hiring lots of engineers in 2014. Come join us!

http://criteolabs.com/jobs

MY FELLOW CRITEOS WOULD KILL ME…