
Mrdp reduce side_join

Date post: 07-Aug-2015
Upload: edureka

View MR Design Patterns course details at www.edureka.co/mapreduce-design-patterns

Application of JOIN Pattern

MAP Reduce Design PATTERN

Slide 2 www.edureka.co/mapreduce-design-patterns

Objectives

At the end of this module, you will understand:

Why Design Patterns in MR

Who should know Map-Reduce Design patterns

Available Design Patterns in MR

Join pattern

Slide 3

Why Design Patterns in MR?

General, reusable, optimized solutions to the most common problems

Templates for solving problems that recur in different situations

Speed up the development process

Tried and tested design principles

An initial guideline for solving the most common problems in MR

Help build sophisticated, well-optimized solutions

Slide 4

Who should know MR Design Pattern?

A Java developer who wants to explore the world of Big Data

A MapReduce programmer who wants to deepen their MR expertise

One who aims to become a Hadoop Architect

Slide 5

Available Design Patterns in MR

Summarization Pattern

Filtering Pattern

Data Organization Pattern

Join Pattern

Meta Pattern

Input & Output Pattern

Slide 6

Join Patterns – What is it?

Datasets generally exist in multiple sources

Deriving their full value requires merging them together

Join Patterns are used for this purpose

Performing joins on the fly on Big Data can be costly in terms of time

Example: Joining StackOverflow data from Comments & Posts on UserId

Slide 7

Join Patterns – What is it?

The join patterns we will talk about are:

» Reduce Side Join/Repartition Join

» Reduce Side Join with Bloom Filter

» Replicated Join

» Composite Join

» Cartesian Product

Slide 8

Join – Refresher

Inner Join

Outer Join

» Left Outer Join

» Right Outer Join

» Full Outer Join

Anti Join

Cartesian Product
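
As a quick illustration of the join types listed above, here is a minimal Python sketch (not part of the original slides). The `users` and `comments` tables and their values are made up, each key appears at most once per table for simplicity, and "anti join" is taken here to mean the full outer join minus the inner join, which is an assumption about the intended definition.

```python
# Two tiny made-up tables keyed by user id (illustrative data only).
users = {1: "Pune", 2: "Delhi", 3: "Mumbai"}   # user id -> location
comments = {2: 10, 3: 4, 4: 7}                 # user id -> upvotes


def join(left, right, how):
    """Return sorted (key, left_value, right_value) rows for one join type."""
    if how == "inner":
        keys = left.keys() & right.keys()    # keys present in both tables
    elif how == "left":
        keys = left.keys()                   # every left key, matched or not
    elif how == "right":
        keys = right.keys()                  # every right key, matched or not
    elif how == "full":
        keys = left.keys() | right.keys()    # every key from either table
    elif how == "anti":
        keys = left.keys() ^ right.keys()    # keys present in exactly one table
    else:
        raise ValueError(f"unknown join type: {how}")
    # dict.get returns None for the side that has no match
    return [(k, left.get(k), right.get(k)) for k in sorted(keys)]


def cartesian(left, right):
    """Pair every left row with every right row, ignoring keys."""
    return [(lk, lv, rk, rv)
            for lk, lv in sorted(left.items())
            for rk, rv in sorted(right.items())]


if __name__ == "__main__":
    for how in ("inner", "left", "right", "full", "anti"):
        print(how, join(users, comments, how))
    print("cartesian rows:", len(cartesian(users, comments)))
```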

Slide 9

Reduce Side Join – Description

Easiest join pattern to implement, but can take the longest to execute

Supports all types of join operation

Can join multiple data sources, but expensive in terms of network resources & time

All data is transferred across the network

Example: Join the PostLinks table data in StackOverflow to the Posts data

Slide 10

Reduce Side Join – Description (Contd.)

Applicability – Use it when

» Multiple large data sets need to be joined

» (If one of the data sources is small, consider a replicated join instead)

» Different data sources are linked by a foreign key

» You want all join operations to be supported

Slide 11

Reduce Side Join – Structure

Slide 12

Reduce Side Join – Structure (Contd.)

Mapper

» Output key should reflect the foreign key

» Value can be the whole record plus a tag identifying its source

» Use projection and output only the required fields

Combiner

» Not required; provides no additional benefit
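
The mapper rules above can be sketched in plain Python (a simulation, not Hadoop code). The record layout borrows the users/comments tables from the SQL analogy; the tag values "U" and "C" and the function names are hypothetical choices for illustration.

```python
# Source tags (hypothetical values): they let the reducer tell the inputs apart.
USER_TAG, COMMENT_TAG = "U", "C"


def map_user(record):
    """Mapper for the users source: key is users.ID, value is (tag, projected field)."""
    yield record["ID"], (USER_TAG, record["Location"])


def map_comment(record):
    """Mapper for the comments source: key is the foreign key comments.UserID."""
    # Projection: only upVotes is needed downstream, so the comment text is dropped
    yield record["UserID"], (COMMENT_TAG, record["upVotes"])


if __name__ == "__main__":
    user = {"ID": 2, "Location": "Delhi", "Reputation": 150}
    comment = {"Id": 77, "UserID": 2, "upVotes": 10, "Text": "Nice answer"}
    print(list(map_user(user)))
    print(list(map_comment(comment)))
```

Both mappers emit the same key for matching rows, which is what lets the shuffle bring them together on one reducer.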

Slide 13

Reduce Side Join – Structure (Contd.)

Partitioner

» Use a custom Partitioner if required

Reducer

» Reducer logic is based on the type of join required

» Reducer receives the data from all the different sources per key
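
A minimal Python sketch of the partition, shuffle, and reduce steps follows (a simulation under stated assumptions, not Hadoop code). The tags "U" and "C", the CRC32-based `partition` function, and the sample pairs are all hypothetical; a real job would rely on the framework's own partitioner and shuffle.

```python
import zlib
from collections import defaultdict
from itertools import product


def partition(key, num_reducers):
    """Hash-style partitioner: the same key always lands on the same reducer."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(pairs):
    """Group mapper output by key, as the framework does before reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_values, how="inner"):
    """Split one key's values by source tag, then join according to `how`."""
    left = [v for tag, v in tagged_values if tag == "U"]
    right = [v for tag, v in tagged_values if tag == "C"]
    # Outer joins pad the missing side with None; an inner join emits nothing
    if left and not right and how in ("left", "full"):
        right = [None]
    if right and not left and how in ("right", "full"):
        left = [None]
    return [(key, l, r) for l, r in product(left, right)]


if __name__ == "__main__":
    pairs = [(1, ("U", "Pune")), (2, ("U", "Delhi")),
             (2, ("C", 10)), (2, ("C", 3)), (4, ("C", 7))]
    for key, values in sorted(shuffle(pairs).items()):
        print(key, "-> reducer", partition(key, 3), reduce_join(key, values, "full"))
```

Only the two padding conditions change between join types, which is why this pattern can support every join operation.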

Slide 14

Reduce Side Join – Analogy

Resemblances

» SQL

SELECT users.ID, users.Location, comments.upVotes
FROM users [INNER|LEFT|RIGHT] JOIN comments
ON users.ID = comments.UserID

» Pig: supports inner & outer joins

» Inner Join

A = JOIN comments BY userID, users BY userID;

» Outer Join

A = JOIN comments BY userID [LEFT|RIGHT|FULL] OUTER, users BY userID;

Slide 15

Reduce Side Join – Performance

Performance

» The entire dataset moves across the network to the reducers

» You can optimize by using projection and sending only the required fields

» The number of reducers is typically higher than normal

» If you can use any other Join type for your problem, use that instead
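
To get a rough feel for why projection matters, the sketch below compares the bytes a mapper would send to the reducers with and without it. JSON is used only as a stand-in for the real serialization format, and the comment records are made up; actual savings depend on the data and on the Writable types used.

```python
import json


def shuffle_bytes(pairs):
    """Approximate shuffle volume as the JSON-encoded size of every (key, value) pair."""
    return sum(len(json.dumps([key, value]).encode("utf-8")) for key, value in pairs)


# 100 made-up comment records, each carrying a bulky Text field
comments = [{"Id": i, "UserID": i % 3, "upVotes": i, "Text": "x" * 200}
            for i in range(100)]

# Without projection the whole record crosses the network
full = [(c["UserID"], c) for c in comments]

# With projection only the field the join output needs is sent
projected = [(c["UserID"], {"upVotes": c["upVotes"]}) for c in comments]

if __name__ == "__main__":
    print("full records:", shuffle_bytes(full), "bytes")
    print("projected   :", shuffle_bytes(projected), "bytes")
```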

Slide 16

Reduce Side Join – Use Cases

Join tweets with user personal information for Behavioral Analysis

Join PostLinks and Posts tables from StackOverflow to have all related posts in one place

Slide 17

Reduce Side Join Example – Problem

Your dataset is the StackOverflow dataset. Look at the PostLinks.xml & Posts.xml files. Join the two tables based on PostId in PostLinks & Id in Posts.

» Use the MultipleInputs class

» Projection on PostLinks to output only PostId & RelatedPostId fields
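
The problem above can be prototyped in plain Python before writing the Hadoop job. This is a sketch, not the demo's actual solution: the sample `<row .../>` lines are invented, the attribute names are assumed to follow the public StackOverflow dump layout, an inner join is assumed, and calling one mapper function per input only mimics what MultipleInputs does in a real job.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample rows; attribute names are assumed to match the dump files.
POSTS_XML = [
    '<row Id="1" PostTypeId="1" Title="How do joins work?" />',
    '<row Id="2" PostTypeId="1" Title="Reduce side join example" />',
]
POSTLINKS_XML = [
    '<row Id="10" PostId="1" RelatedPostId="2" LinkTypeId="1" />',
    '<row Id="11" PostId="1" RelatedPostId="7" LinkTypeId="3" />',
]


def map_post(line):
    """Mapper for Posts.xml: key is Id, value is tagged with its source."""
    row = ET.fromstring(line).attrib
    yield row["Id"], ("POST", row.get("Title", ""))


def map_post_link(line):
    """Mapper for PostLinks.xml: projection keeps only PostId and RelatedPostId."""
    row = ET.fromstring(line).attrib
    yield row["PostId"], ("LINK", row["RelatedPostId"])


def run_join(post_lines, link_lines):
    """Simulate map, shuffle, and an inner-join reduce over the two inputs."""
    groups = defaultdict(list)
    for lines, mapper in ((post_lines, map_post), (link_lines, map_post_link)):
        for line in lines:
            for key, value in mapper(line):
                groups[key].append(value)
    joined = []
    for key in sorted(groups):
        titles = [v for tag, v in groups[key] if tag == "POST"]
        related = [v for tag, v in groups[key] if tag == "LINK"]
        # Inner join: emit a row only when both sources have the key
        joined.extend((key, t, r) for t in titles for r in related)
    return joined


if __name__ == "__main__":
    for row in run_join(POSTS_XML, POSTLINKS_XML):
        print(row)
```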

Slide 18

DEMO

Reduce Side Join Example

Slide 19

Questions
