Map Reduce

Date post: 10-May-2015
Upload: michel-bruley
Description:
• What is MapReduce? • What are MapReduce implementations? Faced with these questions, I did some personal research and produced a synthesis, which helped me clarify some ideas. The attached presentation is not intended to be exhaustive, but it may offer some useful insights.
Transcript
Page 1: Map Reduce

www.decideo.fr/bruley

MapReduce

[email protected]

April 2012

Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …

Page 2: Map Reduce

What is MapReduce?

Restricted parallel programming model meant for large clusters

– User implements Map() and Reduce() functions

Parallel computing framework

– Libraries take care of EVERYTHING else

• Parallelization

• Fault Tolerance

• Data Distribution

• Load Balancing

Useful model for many practical tasks

Page 3: Map Reduce

Map and Reduce

The idea of Map and Reduce is 40+ years old

– Present in all Functional Programming Languages.

– See, e.g., APL, Lisp and ML

Alternate names for Map: Apply-All

Higher Order Functions

– take function definitions as arguments, or

– return a function as output

Map and Reduce are higher-order functions.
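
To make the higher-order idea concrete, here is a minimal Python sketch using the built-in `map` and `functools.reduce`. The helper names `apply_all`, `fold` and `make_scaler` are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from functools import reduce


def apply_all(fn, items):
    """Map ("Apply-All"): takes a function as an argument and applies it to every item."""
    return list(map(fn, items))


def fold(fn, items, initial):
    """Reduce: takes a two-argument function and combines all items into one value."""
    return reduce(fn, items, initial)


def make_scaler(factor):
    """A higher-order function of the second kind: it returns a function as output."""
    return lambda x: x * factor


squares = apply_all(lambda x: x * x, [1, 2, 3, 4])   # [1, 4, 9, 16]
total = fold(lambda acc, x: acc + x, squares, 0)     # 30
doubled = apply_all(make_scaler(2), [1, 2, 3])       # [2, 4, 6]
```

Both `apply_all` and `fold` take a function definition as an argument, while `make_scaler` returns one, which covers the two cases listed above.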

Page 4: Map Reduce

Map and Reduce Functions

Functions borrowed from functional programming languages (e.g. Lisp)

Map()

– Processes a key/value pair to generate intermediate key/value pairs

Reduce()

– Merges all intermediate values associated with the same key

Page 5: Map Reduce

Example: Counting Words

Map()

– Input: <filename, file text>

– Parses the file and emits <word, count> pairs

• e.g. <"hello", 1>

Reduce()

– Sums all values for the same key and emits <word, TotalCount>

• e.g. <"hello", (3 5 2 7)> => <"hello", 17>
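
The word-count example can be sketched in a few lines of single-process Python. This is only an illustration of the Map() and Reduce() roles described above, not the code of any real framework; the function names `map_words`, `reduce_counts` and `word_count`, the lower-casing of words, and the sample file names are assumptions made for the sketch.

```python
from collections import defaultdict


def map_words(filename, text):
    """Map(): takes <filename, file text> and emits a <word, 1> pair per word."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce(): sums all values for the same key and emits <word, TotalCount>."""
    return word, sum(counts)


def word_count(files):
    """Group the intermediate pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_words(filename, text):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


# Hypothetical input files, used only to exercise the sketch.
files = {"a.txt": "hello world hello", "b.txt": "Hello MapReduce"}
totals = word_count(files)
```

In a real cluster the grouping step in `word_count` is what the framework does between the map and reduce phases; the user only writes the two small functions.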

Page 6: Map Reduce

Execution on Clusters

1. Input files split (M splits)

2. Assign Master & Workers

3. Map tasks

4. Writing intermediate data to disk (R regions)

5. Intermediate data read & sort

6. Reduce tasks

7. Return
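
The steps above can be sketched as a sequential, single-process simulation. This is a teaching sketch under stated assumptions, not how Google's implementation or Hadoop is written: there is no real master or worker assignment (step 2 is omitted), the "disk" regions are in-memory lists, keys are assumed to be strings, and the names `run_job`, `num_splits` and `num_regions` are hypothetical.

```python
import zlib
from itertools import groupby


def run_job(records, map_fn, reduce_fn, num_splits=3, num_regions=2):
    # Step 1: split the input into M splits (here: round-robin slices).
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Steps 3-4: run one map task per split and write each intermediate
    # pair into one of R regions, chosen by a hash of the key.
    regions = [[] for _ in range(num_regions)]
    for split in splits:
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                index = zlib.crc32(out_key.encode("utf-8")) % num_regions
                regions[index].append((out_key, out_value))

    # Steps 5-6: each reduce task reads its region, sorts it by key,
    # and calls reduce_fn once per distinct key.
    output = {}
    for region in regions:
        region.sort(key=lambda pair: pair[0])
        for out_key, group in groupby(region, key=lambda pair: pair[0]):
            output[out_key] = reduce_fn(out_key, [v for _, v in group])

    # Step 7: return the combined result to the caller.
    return output


# Hypothetical word-count job used to exercise the pipeline.
records = [("doc1", "a b a"), ("doc2", "b c"), ("doc3", "a")]
result = run_job(
    records,
    map_fn=lambda name, text: [(word, 1) for word in text.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

Because every pair with the same key hashes to the same region, each key is reduced exactly once, whatever values are chosen for the number of splits and regions.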

Page 7: Map Reduce

Map/Reduce Cluster Implementation

[Figure: input files (split 0 to split 4) are processed by M map tasks, which write intermediate files; R reduce tasks read these and write the output files (Output 0, Output 1).]

Several map or reduce tasks can run on a single computer

Each intermediate file is divided into R partitions by a partitioning function

Each reduce task corresponds to one partition
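
As an illustration of the partitioning step, the sketch below assumes a hash-based partitioning function (hash of the key modulo R), which is one common choice; the slides do not say which function is used. The names `partition`, `split_into_partitions` and `gather_for_reducer` are hypothetical. `zlib.crc32` is used instead of Python's built-in `hash` because string hashes from `hash` can differ between processes, while every map task must agree on the partition of a key.

```python
import zlib


def partition(key, num_partitions):
    """Assumed partitioning function: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_into_partitions(pairs, num_partitions):
    """Divide one map task's intermediate output into R partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition(key, num_partitions)].append((key, value))
    return parts


def gather_for_reducer(map_outputs, r, num_partitions):
    """Reduce task r reads partition r from every map task's output."""
    return [
        pair
        for pairs in map_outputs
        for pair in split_into_partitions(pairs, num_partitions)[r]
    ]


# Hypothetical intermediate output of two map tasks.
map_outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = [gather_for_reducer(map_outputs, r, 2) for r in range(2)]
```

Since the partition depends only on the key, both `("a", 1)` pairs end up with the same reduce task even though they were produced by different map tasks.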

Page 8: Map Reduce

Map Reduce vs. Parallel Databases

Map Reduce is widely used for parallel processing

– Google, Yahoo, and hundreds of other companies

– Example uses: computing PageRank, building keyword indices, analyzing web click logs, …

Database people say:

– But parallel databases have been doing this for decades

Map Reduce people say:

– We operate at scales of thousands of machines

– We handle failures seamlessly

– We allow procedural code in map and reduce, and we allow data of any type

Page 9: Map Reduce

Typical MapReduce Cluster

Page 10: Map Reduce

Map Reduce Implementations

Google

– Not available outside Google

Hadoop

– An open-source implementation in Java

– Uses HDFS for stable storage

– Download: http://lucene.apache.org/hadoop/

Teradata Aster

– Cluster-optimized SQL database that also implements MapReduce

• IITB alumnus among founders

And several others, such as Cassandra at Facebook, etc.

Page 11: Map Reduce

MapReduce v. Hadoop

                        MapReduce    Hadoop
Org                     Google       Yahoo/Apache
Impl                    C++          Java
Distributed File Sys    GFS          HDFS
Data Base               Bigtable     HBase
Distributed lock mgr    Chubby       ZooKeeper

Page 12: Map Reduce

Solutions Stack for Teradata Aster

[Figure: solutions stack with the layer labels Aster Data Ecosystem, Aster Data Platform and Infrastructure. Other labels shown: Aster Data nCluster; Business Intelligence Tools; Analytics Specialists; Data Integration / ETL; Systems Management; Security; Query Tools; Servers; Storage; Operating System; Cloud Infrastructure.]

Page 13: Map Reduce

Teradata Aster Platform Infrastructure

For physical infrastructure (non-cloud) deployments

Server Hardware: Certified commodity (x86) server hardware with internal storage

Operating System: Certified Linux operating system

Aster Data Analytic Platform: Aster Data nCluster packaged software

Page 14: Map Reduce

Teradata Aster Infrastructure

For cloud deployments

Compute Instance: Compute instance from cloud provider (e.g. Amazon Web Services EC2) (diagram labels: CC, xLarge)

Storage: Storage connected to cloud computing capacity (diagram labels: EBS, Ephemeral)

Operating System: Linux operating system

Aster Data Analytic Platform: Aster Data nCluster packaged software

Page 15: Map Reduce

Teradata Aster Architecture for Analytics

Your Analytics & Advanced Reporting Applications

Aster Data nCluster

Massively Parallel Data Stores

• Hybrid row/column DBMS

• Linear, incremental scalability

• Commodity hardware

• Standard SQL interface

• MapReduce processing integrated with SQL via SQL-MapReduce interface

• Rich libraries of MapReduce analytics from Aster Data and partners

• Visual development environment--develop in hours

Unified Interface

SQL SQL-MapReduce

Analytic Functions and Frameworks

• Optimized SQL engine

• Fully-integrated in-database MapReduce

Analytics Processing Engines

App App App App

SQL MapReduce …

• Support for in-database processing of custom applications written in broad variety of languages

• Integration with third-party packaged software via ODBC/JDBC or in-database integration

Page 16: Map Reduce

Teradata Aster Ecosystem

Partner | Product | Product release | Platform for Certification
MicroStrategy | Intelligence Server | 9.2.1 32-bit | Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
SAP | Business Objects | XI 3.1 | Windows 2008, 32-bit
Informatica | Powercenter | 9.0.1 | Client: Windows 2003/2008 Server 32 bit. Server: Windows 2003/2008 Server 32 bit and 64 bit
IBM | Cognos | 10.1FP1 | n/a
Tableau | Tableau Server | 6 | Windows (SS: TBU)
Microsoft | SSLS, SSAS, SSFS, SSIS; .NET Framework | SQL Server 2008; 2.0 | Windows Server, 2008 64-bit; Windows 2003, 32-bit

*Oracle BIEE certification currently in process
