© 2013 Geoff Fawkes. All Rights Reserved. 1
Introduction to Analytics and Big Data - Hadoop
The University of British Columbia
Computer Science Alumni/Industry Lecture Series
Geoff Fawkes
November, 2013
/ 450
Who am I?
- Director Engineering, Teradata
- HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc.: various engineering roles
- Technology executive, mentor, software engineer
- B.Sc. Comp Sci (UBC), MBA Executive (SFU)
- Interruptive (disruptive?) personality
- Please ask questions of me and of each other as we go along
- I don’t have all the answers – you do!
Credits: Rob Pegler, SNIA Education Storage Networking Industry Association, 2012
Who’s paying attention: 450 slides in the page count? Not that “big”: about 50.
Big Data and Hadoop
- History
- Data Challenges
- Why Hadoop?
Customer Challenges: The Data Deluge
Big Data is Different than Business Intelligence
Questions From Business Will Vary
Web 2.0 is “Data Driven”
The World of Data-Driven Applications
Attributes of Big Data
Top Ten Common Big Data Problems
Industries Are Embracing Big Data
Why Hadoop?
Storage and Memory B/W Lagging CPU
Commodity Hardware Economics
What is Hadoop?
- Hadoop Adoption
- HDFS
- MapReduce
- Examples
- Ecosystem Projects
Hadoop Adoption in the Industry
What is Hadoop?
HDFS 101 – The Data Set System
HDFS Organization and Replication
Hadoop Server Roles - Multiple
Hadoop Cluster
HDFS File Write Operation - Instance
HDFS File Read Operation - Instance
HDFS File Operation R/W Replication
MapReduce 101 – Functional Programming Meets Distributed Processing
What is MapReduce?
Key MapReduce Terminology
MapReduce Basic Concepts
Example 1: MapReduce Operation
Example 2: Sample Dataset
MapReduce Paradigm – UNIX Cmd
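A minimal Python sketch of the UNIX analogy, assuming the comparison is to a shell pipeline of the form "extract | sort | uniq -c". The stage names, the sample log lines, and the choice of the second field as the key are all hypothetical and not taken from the slides.

```python
from itertools import groupby


def map_stage(lines):
    """Like `grep`/`cut`: emit one key per usable input record."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[1]  # hypothetical: the second field is the key


def sort_stage(keys):
    """Like `sort`: bring identical keys next to each other (the shuffle)."""
    return sorted(keys)


def reduce_stage(sorted_keys):
    """Like `uniq -c`: collapse each run of equal keys into a count."""
    for key, run in groupby(sorted_keys):
        yield key, sum(1 for _ in run)


def run_pipeline(lines):
    """Chain the three stages the way a shell pipe would."""
    return list(reduce_stage(sort_stage(map_stage(lines))))


# Hypothetical sample input; the malformed last line is dropped by map_stage.
log = ["10:01 GET /a", "10:02 POST /b", "10:03 GET /c", "bad"]
print(run_pipeline(log))
```

In a real cluster each stage runs in parallel on many machines; here everything runs in one process to show only the data flow.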
Example 3: Count Words
Ex. 3: Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
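The slide's code did not survive extraction, so the following is only a sketch of the word-count idea in Python, not the author's program and not the Hadoop API. `map_fn`, `reduce_fn`, and `run_job` are hypothetical names; `run_job` stands in for the framework, which in Hadoop would group map output by key across the cluster.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map function: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce function: sum all the counts emitted for one word."""
    yield word, sum(counts)


def run_job(records):
    """Toy single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # stands in for the shuffle

    results = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            results[out_key] = out_value
    return results


# Hypothetical input: (line number, line text) pairs.
lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = run_job(enumerate(lines))
print(counts)
```

Only the two small functions are specific to the problem; the grouping step in the middle is what the framework provides.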
Input Splits
Map Wave 1
Reduce Wave 1
Map Wave 2
Reduce Wave 2
Time
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
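One rough way to reason about the splits and waves question, as a back-of-the-envelope Python sketch. It assumes one map task per input split, a split size equal to the block size, and a fixed number of task slots in the cluster; it ignores file boundaries, data locality, and speculative execution. The function name and all numbers are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def plan_job(input_bytes, split_bytes, map_slots, reduce_tasks, reduce_slots):
    """Estimate splits, map tasks, and waves under simplifying assumptions."""
    splits = ceil_div(input_bytes, split_bytes)
    return {
        "input_splits": splits,
        "map_tasks": splits,  # assumption: one map task per split
        "map_waves": ceil_div(splits, map_slots),
        "reduce_waves": ceil_div(reduce_tasks, reduce_slots),
    }


MiB = 1024 * 1024

# Hypothetical job: 10 GiB of input, 64 MiB splits, 80 map slots,
# 20 reduce tasks requested, 10 reduce slots available.
plan = plan_job(
    input_bytes=10 * 1024 * MiB,
    split_bytes=64 * MiB,
    map_slots=80,
    reduce_tasks=20,
    reduce_slots=10,
)
print(plan)
```

With these numbers the job needs two map waves and two reduce waves, because there are more tasks than slots in each phase.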
MapReduce Job Configuration Parameters
190+ parameters in Hadoop
Set manually or defaults are used
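A small Python sketch of "set manually or defaults are used". This is not Hadoop's configuration API; it only models the lookup order. The three property names follow Hadoop 1.x naming, and the default values shown should be treated as illustrative and checked against the defaults file of the release in use. `effective_config` is a hypothetical helper.

```python
from collections import ChainMap

# Illustrative subset of the 190+ parameters (assumed Hadoop 1.x-style names).
DEFAULTS = {
    "mapred.reduce.tasks": 1,
    "io.sort.mb": 100,
    "dfs.replication": 3,
}


def effective_config(overrides):
    """Return the settings a job would see: overrides first, then defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Catch typos early instead of silently ignoring them.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return dict(ChainMap(overrides, DEFAULTS))


# Manually set one parameter; the other two fall back to defaults.
conf = effective_config({"mapred.reduce.tasks": 20})
print(conf)
```

Most jobs override only a handful of settings, which is why the quality of the defaults matters so much for performance.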
Putting it all Together: MapReduce + HDFS
Hadoop Ecosystem Projects
- Interactive SQL Query & Modeling
- Data flow for tedious MapReduce Jobs
- Columnar NoSQL Store
Compare: Hadoop, SQL, Massively Parallel Processing (MPP)
Compare: RDBMS and MapReduce
Hadoop Use Cases
- Set Top Cable TV Boxes
- Pay Per View Advertising
- Bank Risk Modelling
- Product Sentiment Analysis
Example 1: Set Top Cable TV Boxes
Example 2: Pay Per View Advertising
Example 3: Bank Risk Modelling
Example 4: Product Sentiment Analysis
More Reading?
World Economic Forum: “Personal Data: The Emergence of a New Asset Class” 2011
McKinsey Global Institute: Big Data: The next frontier for innovation, competition, and productivity
Big Data: Harnessing a game-changing asset
IDC: 2011 Digital Universe Study: Extracting Value from Chaos
The Economist: Data, Data Everywhere
Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field
O’Reilly – What is Data Science?
O’Reilly – Building Data Science Teams
O’Reilly – Data for the public good
Obama Administration “Big Data Research and Development Initiative.”
Introduction to Analytics and Big Data – Hadoop
Q&A
Geoff Fawkes http://www.linkedin.com/pub/geoff-fawkes/1/269/202 @gfawkes
November, 2013