Date post: | 19-Oct-2014 |
Category: |
Technology |
YahooPresentationBigDataEvent
1 2011 IBM Corporation
Eric Baldeschwieler, VP, Hadoop Software
YAHOO & HADOOP
USING AND IMPROVING APACHE HADOOP AT YAHOO!
AGENDA
- Brief Overview
- Hadoop @ Yahoo!
- Hadoop Momentum
- The Future of Hadoop
what's happening
- Big Data is here!
- unstructured data
- petabyte scale
- operationally critical
Flickr : sub_lime79
turning data into insights
- machine learning
- time series
- content clustering
- factorization models
- logic regression
Flickr : NASA Goddard Photo and Video
algorithms
- user interest prediction
- ad inventory modeling
making YAHOO relevant
Flickr : ogimogi
hadoop: Powering Yahoo!
science + big data + insight = personal relevance = VALUE
Flickr : DDFic
WHAT IS HADOOP?
HDFS
MapReduce
Pig, Hive
Commodity Computers, Network
Focus on: Simplicity, Redundancy, Scale, Availability
Transforms commodity equipment into a service that:
- HDFS: Stores petabytes of data reliably
- Map-Reduce: Allows huge distributed computations
Key Attributes
- Redundant and reliable: Doesn't stop or lose data even as hardware fails
- Easy to program: Our rocket scientists use it directly!
- Very powerful: Allows the development of big data algorithms & tools
- Batch processing centric
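To make "huge distributed computations" concrete, here is a minimal single-process sketch of the MapReduce idea as a word count in plain Python. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines with the input read from HDFS.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Chain the three steps; hypothetical helper for this sketch only."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data is here", "big data at petabyte scale"]))
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many commodity machines, which is the property the slide relies on.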
WHAT HADOOP ISN'T
- A replacement for relational and data warehouse systems
- A transactional / online / serving system
- A low latency or streaming solution
HADOOP IN THE ENTERPRISE
[Diagram: Hadoop in the enterprise]
- Business Applications: Transactions, Structured Data
- Web Logs, Server Logs, Social Media, etc.: Interactions, Semi-Structured or Un-Structured Data
- RDBMS, EDW, Data Marts
- HADOOP CLUSTER(S)
- Business Intelligence Applications
HADOOP @ YAHOO!
HADOOP @ YAHOO!
Where Science meets Data
HADOOP CLUSTERS: Tens of thousands of servers
DATA PIPELINES
CONTENT
DIMENSIONAL DATA
PRODUCTS
- Data Analytics
- Content Optimization
- Content Enrichment
- Yahoo! Mail Anti-Spam
- Advertising Products
- Ad Optimization
- Ad Selection
- Big Data Processing & ETL
APPLIED SCIENCE
- User Interest Prediction
- Ad inventory prediction
- Machine learning - search ranking
- Machine learning - ad targeting
- Machine learning - spam filtering
Terabytes / Day (compressed)
10s of Petabytes
FROM PROJECT TO CORE PLATFORM
[Chart: Thousands of Servers and Petabytes by year, 2006 to 2010]
170 PB Storage
40K+ Servers
5M+ Monthly Jobs
HADOOP POWERS THE YAHOO! NETWORK
- advertising optimization
- ad selection
- Yahoo! Homepage
- machine learning search ranking
- ad inventory prediction
- Yahoo! Mail anti-spam
- user interest prediction
- audience, ad and search pipelines
- advertising data systems
- Content Optimization
- data analytics
CASE STUDY: YAHOO! HOMEPAGE
Personalized for each visitor
Result: twice the engagement
- +160% clicks vs. one size fits all
- +79% clicks vs. randomly selected
- +43% clicks vs. editor selected
Recommended links, News Interests, Top Searches
CASE STUDY: YAHOO! HOMEPAGE
Serving Maps (Users - Interests): Five Minute Production
Categorization models: Weekly
[Diagram: user behavior flows through two Hadoop clusters to the serving systems]
- SCIENCE HADOOP CLUSTER: USER BEHAVIOR, CATEGORIZATION MODELS (weekly)
- PRODUCTION HADOOP CLUSTER: USER BEHAVIOR, SERVING MAPS (every 5 minutes)
- SERVING SYSTEMS: ENGAGED USERS
Identify user interests using Categorization models
Machine learning to build ever better categorization models
Build customized home pages with latest data (thousands / second)
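The slide does not show how a serving map is computed, so the following is only a toy sketch of the step "identify user interests using categorization models". Everything in it is an assumption made for illustration: the model is reduced to a hypothetical lookup table from clicked item to category, and `build_serving_map` is an invented helper, not Yahoo!'s code. In the setup described above, the model would be rebuilt weekly and the map every five minutes.

```python
from collections import Counter

# Hypothetical stand-in for a weekly-trained categorization model:
# maps a clicked item to an interest category.
CATEGORIZATION_MODEL = {
    "nba-finals": "sports",
    "world-cup": "sports",
    "stock-rally": "finance",
    "new-phone": "technology",
}


def build_serving_map(click_events, model):
    """Return {user: top interest category} from recent (user, item) clicks."""
    per_user = {}
    for user, item in click_events:
        category = model.get(item)
        if category is None:
            continue  # item unknown to the model; skip it
        per_user.setdefault(user, Counter())[category] += 1
    # Keep only the most frequent category for each user.
    return {user: counts.most_common(1)[0][0] for user, counts in per_user.items()}


if __name__ == "__main__":
    recent_clicks = [
        ("u1", "nba-finals"),
        ("u1", "world-cup"),
        ("u1", "stock-rally"),
        ("u2", "new-phone"),
    ]
    print(build_serving_map(recent_clicks, CATEGORIZATION_MODEL))
```

Splitting the work this way keeps the expensive model training on a slow cadence while the cheap lookup step can be rerun often on fresh behavior data.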
CASE STUDY: YAHOO! MAIL
Enabling quick response in the spam arms race
- 450M mail boxes
- 5B+ deliveries/day
- Antispam models retrained every few hours on Hadoop
- 40% less spam than Hotmail and 55% less spam than Gmail
SCIENCE
PRODUCTION
YAHOO! & APACHE HADOOP
Yahoo! has contributed 70+% of Apache Hadoop code to date
Hadoop is not our business, but Hadoop is key to our business
Yahoo! benefits from open source eco-system around Hadoop
Hadoop drives revenue at Yahoo! by making our core products better
We need Hadoop to be rock solid
We invest heavily in core Hadoop development
We focus on scalability, reliability, availability
We fix bugs before you see them
We run very large clusters
We have a large QA effort
We run a huge variety of workloads
We are good Apache Hadoop citizens
We contribute our work to Apache
We share the exact code we run
HADOOP MOMENTUM
HADOOP IS GOING MAINSTREAM
[Chart: 2007 to 2010. Source: The Datagraph Blog]
THE PLATFORM EFFECT: BIRTH OF AN ECOSYSTEM
Apache Hadoop
- and other Early Adopters: Scale and productize Hadoop
- Orgs with Internet Scale Problems: Add tools / frameworks, enhance Hadoop
- Mainstream / Enterprise adoption: Drive further development, enhancements
- Service Providers: Grow ecosystem - Training, support, enhancements
Enhance Hadoop Ecosystem
Virtuous Circle! Investment -> Adoption, Adoption -> Investment
THE FUTURE OF HADOOP
MAKING HADOOP ENTERPRISE-READY: WHAT'S NEXT
- Hadoop is far from done
- Current implementation is showing its age
- Need to address several deficiencies in scalability, flexibility, ease of use & performance
- Yahoo! is working on Next Generation of Hadoop
- MapReduce: Rewrite to improve performance; pluggable support for new programming models
- HDFS: Adding volumes to improve scalability; Flush & sync support for applications that log to HDFS
- Apache should remain the hub of Hadoop ecosystem
- Yahoo! contributes all Hadoop changes back to Apache Hadoop
- Everyone benefits from shared neutral foundation
Questions?