
Paula Ta-Shma, IBM Haifa Research

1 “Advanced Topics on Storage Systems” - Spring 2013, Tel-Aviv University http://www.eng.tau.ac.il/semcom

Big Data and Map Reduce

Paula Ta-Shma

IBM Haifa Research

Storage Systems

1/5/2013


Outline

Historical Context behind Map Reduce

What is Big Data?

The Map Reduce Framework

Connections with Storage Cloud


Historical Context

Relational Database Management Systems (RDBMS)

– Researched in 70s, products in 80s and beyond

– Relational (tabular) data model

– Query Language: SQL

- Efficient Query Processing: Indexing, Query Evaluation Strategies

– Transactions, Consistency

– Concurrency Control

– Security and Authorization

– Can be implemented on top of file systems

- Provide higher level of abstraction and functionality than file systems

Example Use Cases

– Banking, Stock trading, Personnel Management, Inventory Management, Manufacturing Data, etc.

– The list is very long

SELECT Name
FROM Accounts
GROUP BY Name
HAVING SUM(Balance) < 0

Accounts

Name     Balance ($)
Bob      5000.00
Alice    -389.27
Fred     -800.00
Alice    2980000.00
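The query and the Accounts rows above can be tried out with a minimal sketch like the following. It assumes Python's built-in sqlite3 module as a stand-in for an RDBMS (the slide names no particular product), and the function name overdrawn_customers is purely illustrative. The point it shows is that GROUP BY with HAVING filters on the per-name total, not on individual rows: Alice has one negative row, but her two rows sum to a positive balance, so only Fred is returned.

```python
import sqlite3

# Rows copied from the Accounts table on the slide.
ROWS = [
    ("Bob", 5000.00),
    ("Alice", -389.27),
    ("Fred", -800.00),
    ("Alice", 2980000.00),
]

# The query from the slide, unchanged.
QUERY = """
SELECT Name
FROM Accounts
GROUP BY Name
HAVING SUM(Balance) < 0
"""


def overdrawn_customers(rows):
    """Return the names whose balances sum to less than zero.

    Illustrative helper (not from the slides): loads the rows into a
    throwaway in-memory SQLite database and runs the slide's query.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Accounts (Name TEXT, Balance REAL)")
        conn.executemany("INSERT INTO Accounts VALUES (?, ?)", rows)
        return [name for (name,) in conn.execute(QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    print(overdrawn_customers(ROWS))  # prints ['Fred']
```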


Historical Context Cont.

Business Intelligence

– Extract value from large amounts of data

– Banking use case example

- Identify and actively retain and pursue profitable customers

- Analyze the performance of sales personnel, tellers and account managers

- etc.

– Massive query processing to analyze data across multiple dimensions

- Requires read access to large amounts of data

- Typically long running queries, can interfere with transactions

– Work on a snapshot of data

- Deployed as physically separate Data Warehousing systems

- Mission critical

- Data warehousing products in early 90s


New Requirements in Internet Era

Massive amounts of data

Unstructured (e.g. text) and semi-structured data (e.g. XML)

Analysis capabilities beyond what is possible in SQL

LOW COST

$$$: Capital Expenses and Operational Expenses

Hardware

– Capital Expenses: Use commodity hardware, scale out instead of scale up.

– Operational Expenses: Make it easy to manage hardware which will fail often. Treat failure case as the norm, automatic failover.

Software

– Capital Expenses: DBMS software is complex and expensive; transactions, concurrency control etc. are not needed for many tasks.

– Operational Expenses: Make it easy to write ‘queries’ on a distributed infrastructure.


Map Reduce

Invented by Google

– Inspired by the map and reduce functions of functional programming languages

– Seminal paper: Dean, Jeffrey & Ghemawat, Sanjay (OSDI 2004), "MapReduce: Simplified Data Processing on Large Clusters"

Used at Google to completely regenerate Google's index of the World Wide Web.

– It replaced the old ad hoc programs that updated the index and ran the various analyses.

Uses:

– distributed pattern-based searching, distributed sorting, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation

Hadoop:

– Open source implementation that follows the design published by Google
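To make the map and reduce idea on this slide concrete, here is a minimal single-process sketch in Python. It is not Hadoop's or Google's API: the names map_reduce, word_count_mapper and word_count_reducer are hypothetical, and a real framework would spread the map and reduce calls across many machines and handle failures. The sketch only shows the data flow: a mapper emits (key, value) pairs, the framework groups values by key, and a reducer folds each group into a result.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run a map/group/reduce pipeline in memory (illustrative only).

    inputs  : iterable of (key, value) pairs
    mapper  : function (key, value) -> iterable of (out_key, out_value)
    reducer : function (out_key, list_of_values) -> result
    """
    # Map phase: call the mapper on every input record and group the
    # emitted values by key (the "shuffle" step of a real framework).
    grouped = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)

    # Reduce phase: one reducer call per distinct intermediate key.
    return {k: reducer(k, vs) for k, vs in sorted(grouped.items())}


def word_count_mapper(doc_name, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the partial counts for one word."""
    return sum(counts)


# Made-up input documents, used only for this example.
DOCS = [
    ("doc1", "big data and map reduce"),
    ("doc2", "map reduce on big clusters"),
]

if __name__ == "__main__":
    print(map_reduce(DOCS, word_count_mapper, word_count_reducer))
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both phases can in principle run in parallel on separate machines, which is what makes the model a fit for commodity clusters.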


Source: IBM InfoSphere BigInsights slides, by Bruce Brown https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf


Source: IBM InfoSphere BigInsights slides, by Bruce Brown https://www-950.ibm.com/events/wwe/grp/grp004.nsf/vLookupPDFs/Bruce%20Brown%20-%20BigInsights-1-16-12-external/$file/Bruce%20Brown%20-%20BigInsights-1-16-12-external.pdf


Map Reduce In Detail

Map Reduce material taken from Distributed Systems Course, MapReduce lecture by Paul Krzyzanowski

– http://www.seas.gwu.edu/~gparmer/courses/f12_3411/distrib-5-mapreduce.pdf


HDFS Architecture

Source: http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html


Integrating Hadoop with Object Storage

Implement Hadoop FileSystem API

Leave MapReduce framework unchanged

– => no changes needed for user applications

– => work with Hadoop based technologies

- Hive, Pig Latin, HBase, Jaql, and others

Diagram (layers, top to bottom):

Application, HBase, Jaql, …

Hadoop Map Reduce

– invokes the Hadoop FileSystem API (create, open, close, read, write, seek, get block locations…)

Hadoop FileSystem API

– implemented by Hadoop Distributed File System (HDFS), S3FileSystem, CDMIFileSystem
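The layering on this slide can be sketched as follows. This is a Python analogue under stated assumptions, not Hadoop's actual Java FileSystem API: the class names FileSystem, LocalDiskFileSystem and InMemoryObjectStore, the reduced method set (create and open only), and the job count_lines are all hypothetical. What it illustrates is the design choice: the job is written against the abstract interface only, so swapping the storage backend needs no change to the job.

```python
import io
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import BinaryIO


class FileSystem(ABC):
    """Hypothetical, much-reduced stand-in for a pluggable file system API."""

    @abstractmethod
    def create(self, path: str, data: bytes) -> None:
        """Write data to a new file at path."""

    @abstractmethod
    def open(self, path: str) -> BinaryIO:
        """Return a readable, seekable binary stream for path."""


class LocalDiskFileSystem(FileSystem):
    """Backend that stores files under a local directory."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def _resolve(self, path: str) -> Path:
        return self._root / path.lstrip("/")

    def create(self, path: str, data: bytes) -> None:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def open(self, path: str) -> BinaryIO:
        return self._resolve(path).open("rb")


class InMemoryObjectStore(FileSystem):
    """Backend that mimics an object store: whole objects keyed by name."""

    def __init__(self) -> None:
        self._objects = {}

    def create(self, path: str, data: bytes) -> None:
        self._objects[path] = bytes(data)

    def open(self, path: str) -> BinaryIO:
        return io.BytesIO(self._objects[path])


def count_lines(fs: FileSystem, path: str) -> int:
    """A 'job' that depends only on the abstract FileSystem interface."""
    with fs.open(path) as stream:
        return sum(1 for _ in stream)


def demo():
    """Run the same job, unchanged, against both backends."""
    data = b"a\nb\nc\n"
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for fs in (LocalDiskFileSystem(tmp), InMemoryObjectStore()):
            fs.create("/input/part-0", data)
            results.append(count_lines(fs, "/input/part-0"))
    return results


if __name__ == "__main__":
    print(demo())  # prints [3, 3]
```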


Amazon Elastic Map Reduce

Source: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html


The End

