Intro to Big Data and Hadoop: UBC CS Lecture Series - G. Fawkes

Post on 25-Jan-2015

© 2013 Geoff Fawkes. All Rights Reserved. 1

Introduction to Analytics and Big Data - Hadoop

The University of British Columbia

Computer Science Alumni/Industry Lecture Series

Geoff Fawkes

November, 2013

/ 450

Who am I?

- Director Engineering, Teradata

- HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc.: various engineering roles

- Technology executive, mentor, software engineer

B.Sc. Comp Sci (UBC), MBA Executive (SFU)

Interruptive (disruptive?) personality

Please ask questions of me / each other as we go along

I don’t have all the answers – you do!

Credits: Rob Pegler, SNIA Education, Storage Networking Industry Association, 2012

Who’s paying attention: a page count of 450 slides? Not that “big”; it is about 50.

Big Data and Hadoop

- History

- Data Challenges

- Why Hadoop?

Customer Challenges: The Data Deluge

Big Data is Different than Business Intelligence

Questions From Business Will Vary

Web 2.0 is “Data Driven”

The World of Data-Driven Applications

Attributes of Big Data

Top Ten Common Big Data Problems

Industries Are Embracing Big Data

Why Hadoop?

Why Hadoop?

Storage and Memory B/W Lagging CPU

Commodity Hardware Economics

What is Hadoop?

- Hadoop Adoption

- HDFS

- MapReduce

- Examples

- Ecosystem Projects

Hadoop Adoption in the Industry

What is Hadoop?

What is Hadoop?

HDFS 101 – The Data Set System

HDFS Organization and Replication

Hadoop Server Roles - Multiple

Hadoop Cluster

HDFS File Write Operation - Instance

HDFS File Read Operation - Instance

HDFS File Operation R/W Replication

MapReduce 101 – Functional Programming Meets Distributed Processing
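
The functional-programming half of this title can be illustrated with Python's built-in `map` and `functools.reduce`. This is only a single-machine sketch with made-up input values; it shows the two primitives the model borrows its name from, not Hadoop itself.

```python
from functools import reduce

# Hypothetical input: a handful of numeric records.
records = [3, 1, 4, 1, 5, 9]

# "Map" step: apply the same pure function to every record independently.
# Because no record depends on another, this step can be parallelized.
squared = list(map(lambda x: x * x, records))

# "Reduce" step: fold the mapped values into one result with an
# associative operation (addition here).
total = reduce(lambda acc, x: acc + x, squared, 0)

print(squared)
print(total)
```

The distributed-processing half comes from running the map step on many machines at once, each over its own slice of the data.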

What is MapReduce?

Key MapReduce Terminology

MapReduce Basic Concepts

Example 1: MapReduce Operation

Example 2: Sample Dataset

MapReduce Paradigm – UNIX Cmd
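
The body of this slide is not in the transcript. A commonly used version of the comparison lines up a shell pipeline such as `cat input | grep ERROR | sort | uniq -c` with the input, map, shuffle/sort, and reduce stages. The sketch below assumes that analogy and uses hypothetical log lines; each stage is annotated with the UNIX command it stands in for.

```python
from itertools import groupby

# Hypothetical log lines standing in for the contents of an input file.
lines = [
    "ERROR disk full",
    "INFO job started",
    "ERROR disk full",
    "WARN slow node",
    "ERROR timeout",
]

# cat input | grep ERROR  -> "map": scan every record, keep the matches
mapped = [line for line in lines if line.startswith("ERROR")]

# | sort                  -> "shuffle and sort": bring equal keys together
shuffled = sorted(mapped)

# | uniq -c               -> "reduce": collapse each run of equal keys to a count
counts = [(key, len(list(group))) for key, group in groupby(shuffled)]

print(counts)
```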

Example 3: Count Words

Map function

Reduce function

Run this program as a MapReduce job

Ex. 3: Lifecycle of a MapReduce Job
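
The slide's code for the map and reduce functions did not survive in the transcript. As a rough stand-in, the word-count example can be sketched in plain Python. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and the "job" runs in one process rather than on a cluster.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all the counts collected for one word."""
    return word, sum(counts)

def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce on one machine."""
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_fn(line):
            groups[word].append(one)   # shuffle: group values by key
    return dict(reduce_fn(w, c) for w, c in sorted(groups.items()))

result = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(result)
```

On a real cluster the framework, not the user, performs the grouping step between map and reduce; the user supplies only the two functions.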

Map Wave 1

Reduce Wave 1

Map Wave 2

Reduce Wave 2

Input Splits

Time

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?

Ex. 3: Lifecycle of a MapReduce Job
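
One part of the question above can be worked through with arithmetic. The sketch below assumes the split-size rule used by Hadoop's newer `FileInputFormat` API, `max(min_size, min(max_size, block_size))`, one map task per split, and a fixed number of concurrent task slots. The file size, block size, and slot count are hypothetical.

```python
import math

def split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size: max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, size):
    """Approximate number of input splits (one map task per split)."""
    return math.ceil(file_size / size)

def num_waves(tasks, slots):
    """Waves needed when only `slots` tasks can run at the same time."""
    return math.ceil(tasks / slots)

MB = 1024 * 1024
# Hypothetical job: a 10 GB file, 128 MB blocks, 40 concurrent map slots.
size = split_size(128 * MB)
maps = num_splits(10 * 1024 * MB, size)
waves = num_waves(maps, 40)
print(maps, waves)
```

With these numbers the job needs 80 map tasks, which run in two waves, matching the two map waves drawn on the slide. The split count is approximate because Hadoop lets the final split run slightly over the target size.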

190+ parameters in Hadoop

Set manually or defaults are used

MapReduce Job Configuration Parms
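
The "set manually or defaults are used" behaviour can be pictured as a layered lookup. The sketch below is not Hadoop's `Configuration` class; it mimics the idea with Python's `ChainMap`. The property names are Hadoop 2.x names and the default values are the ones I understand the stock `*-default.xml` files to ship with, so treat them as assumptions and check your own version.

```python
from collections import ChainMap

# A few Hadoop property names with their usual documented defaults
# (assumed here; check the *-default.xml files for your Hadoop version).
defaults = {
    "dfs.replication": 3,
    "mapreduce.job.reduces": 1,
    "mapreduce.task.io.sort.mb": 100,
}

# Hypothetical values a user sets manually for one job.
job_overrides = {"mapreduce.job.reduces": 8}

# Lookups hit the overrides first, then fall back to the defaults.
conf = ChainMap(job_overrides, defaults)

print(conf["mapreduce.job.reduces"])  # manually set value
print(conf["dfs.replication"])        # default value
```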

Putting it all Together: MapReduce + HDFS

Hadoop Ecosystem Projects

- Interactive SQL Query & Modeling

- Data flow for tedious MapReduce Jobs

- Columnar NoSQL Store

Compare: Hadoop, SQL, Massively Parallel Processing (MPP)

Compare: RDBMS and MapReduce

Hadoop Use Cases

- Set Top Cable TV Boxes

- Pay Per View Advertising

- Bank Risk Modelling

- Product Sentiment Analysis

Example 1: Set Top Cable TV Boxes

Example 2: Pay Per View Advertising

Example 3: Bank Risk Modelling

Example 4: Product Sentiment Analysis

More Reading?

World Economic Forum: “Personal Data: The Emergence of a New Asset Class” 2011

McKinsey Global Institute: Big Data: The next frontier for innovation, competition, and productivity

Big Data: Harnessing a game-changing asset

IDC: 2011 Digital Universe Study: Extracting Value from Chaos

The Economist: Data, Data Everywhere

Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field

O’Reilly – What is Data Science?

O’Reilly – Building Data Science Teams

O’Reilly – Data for the public good

Obama Administration “Big Data Research and Development Initiative.”

Introduction to Analytics and Big Data – Hadoop

Q&A

Geoff Fawkes

http://www.linkedin.com/pub/geoff-fawkes/1/269/202

@gfawkes

November, 2013