App Engine MapReduce - CSU East Bayalgebra.sci.csueastbay.edu/.../appengine_mapreduce.pdf ·...

Date post: 02-Jun-2020
Page 1: App Engine MapReduce - CSU East Bayalgebra.sci.csueastbay.edu/.../appengine_mapreduce.pdf · 2012-05-09 · Launching Shuffler Functionality In-memory, user-space, task-driven shuffle
App Engine MapReduce

Mike Aizatsky
11 May 2011

Hashtags: #io2011 #AppEngine Feedback: http://goo.gl/SnV2i

Agenda

● MapReduce Computational Model

● Mapper library

● Announcement

● Technical bits:

○ Files API

○ User-space shuffling

● MapReduce & Pipeline API

● Examples and Demos

MapReduce Computational Model

MapReduce

● A model to do efficient distributed computing over large data sets.

● Used at Google for years

● Every project uses MapReduce!

MapReduce Computational Model

Map

● Input: user data

● Output: (key, value) pairs

● User code

Shuffle

● Collates values with the same key

● Input: (key, value) pairs

● Output: (key, [value]) pairs

● No user code

Reduce

● Input: (key, [value]) pairs

● Output: user data

● User code
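
The three stages above can be sketched as a single-process loop. This is only an illustration of the computational model, not the App Engine library: run_mapreduce, length_map and length_reduce are hypothetical names, and everything runs sequentially in memory.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map, shuffle and reduce sequentially in one process."""
    # Map: user code turns each input record into (key, value) pairs.
    pairs = []
    for record in inputs:
        pairs.extend(map_fn(record))

    # Shuffle: no user code; collate the values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    # Reduce: user code turns each (key, [value]) pair into output data.
    output = []
    for key, values in grouped.items():
        output.extend(reduce_fn(key, values))
    return output


# Hypothetical user code: group words by their length.
def length_map(word):
    yield (len(word), word)


def length_reduce(length, words):
    yield (length, sorted(words))


result = run_mapreduce(["map", "shuffle", "reduce", "key"],
                       length_map, length_reduce)
```

Only map_fn and reduce_fn are user code; the shuffle step in the middle is generic, which is why a library can provide it.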

MapReduce Computational Model

Common App Engine Approach

● Take what works for us at Google

● Give it to people

App Engine & Google's MapReduce

● Additional scaling dimension:

○ Lots and lots of applications

○ Many of them will run MapReduce at the same time

● Isolation: one application shouldn't affect the performance of another

App Engine & Google's MapReduce

● Rate limiting: you don't want to burn a whole day's resources in 15 minutes and kill your online traffic

● Very slow execution: free apps want to go really slow, staying under their resource limits

● Protection: from malicious App Engine users

Mapper

Mapper Library

● Released at Google I/O 2010

● Heavily used by developers outside and inside Google (admin console, new indexer pipeline, etc.)

● Has seen lots of improvements since

Mapper Library Improvements

● Control API - start your jobs programmatically (and transactionally)

● Custom mutation pools - batch work between map function calls

● Namespaces support - iterate over data in different namespaces or over namespaces themselves

● Better sharding with scatter indices

● And more!

Mapper => MapReduce?

● Storage system for intermediate data:

○ Files API, released in 1.4.3 (March 2011)

● Shuffler

● Lots of glue code

Launching Shuffler Functionality

● In-memory, user-space, task-driven shuffle for small (~100 MB) datasets.

● Trusted-tester access to the big shuffler.

● All the integration pieces needed to run your own MapReduce jobs are part of the Mapper library.

● Mapper library => Mapreduce library!

● Python today, Java soon.

http://mapreduce.appspot.com

Examples

Example 1: Word Count

# Map
def map(line):
    for w in clean(line).split():
        yield (w, '')

# Reduce
def reduce(key, values):
    yield (key, len(values))

Zed's dead, baby, Zed's dead!

("zed's", ''), ('dead', ''), ('baby', ''), ("zed's", ''), ('dead', '')

("zed's", ['', '']), ('dead', ['', '']), ('baby', [''])

("zed's", 2), ('dead', 2), ('baby', 1)
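
The trace above can be reproduced locally with a small sketch. The slide does not define clean(), so the version here (lowercase, keep letters and apostrophes) is an assumption, and shuffle() is a hypothetical in-memory stand-in for the library's shuffler.

```python
import re
from collections import defaultdict


def clean(line):
    # Assumed helper: lowercase, then blank out everything except
    # letters, apostrophes and spaces.
    return re.sub(r"[^a-z' ]", " ", line.lower())


def word_count_map(line):
    for w in clean(line).split():
        yield (w, '')


def word_count_reduce(key, values):
    yield (key, len(values))


def shuffle(pairs):
    # Hypothetical in-memory shuffle: collate values by key,
    # keeping keys in first-seen order.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped.items())


mapped = list(word_count_map("Zed's dead, baby, Zed's dead!"))
shuffled = shuffle(mapped)
counts = [pair
          for key, values in shuffled
          for pair in word_count_reduce(key, values)]
```

The mapped values are empty strings because the reducer only needs to know how many values arrived for each key.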

Demo

Example 2: Inverse Index

# Map
def map(line, filename):
    for w in clean(line).split():
        yield (w, filename)

# Reduce
def reduce(key, values):
    yield (key, list(set(values)))
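
A minimal local run of the same idea, under stated assumptions: clean() is a guessed helper, the two documents are made-up sample data, and the reducer uses sorted(set(values)) rather than list(set(values)) purely so that the output order is deterministic.

```python
import re
from collections import defaultdict


def clean(line):
    # Assumed helper: lowercase and keep letters, apostrophes and spaces.
    return re.sub(r"[^a-z' ]", " ", line.lower())


def index_map(line, filename):
    for w in clean(line).split():
        yield (w, filename)


def index_reduce(key, values):
    # sorted() instead of list() only to make the result reproducible.
    yield (key, sorted(set(values)))


# Made-up sample input: filename -> lines.
documents = {
    "a.txt": ["the quick fox", "the lazy dog"],
    "b.txt": ["a quick dog"],
}

# In-memory stand-in for the shuffle stage.
grouped = defaultdict(list)
for filename, lines in documents.items():
    for line in lines:
        for word, name in index_map(line, filename):
            grouped[word].append(name)

index = dict(pair
             for word, names in grouped.items()
             for pair in index_reduce(word, names))
```

The set() in the reducer removes duplicates when a word occurs several times in the same file, as "the" does in a.txt.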

Demo

MapReduce: Technical Bits

Technical Bits

● Files API: solution to MapReduce storage problem

● User-space shuffler

Files API

Mapreduce Storage

● Mapreduce jobs generate lots of intermediate data.

● Datastore: expensive, 1MB entity limit

● Blobstore: read-only

● Memcache: small, volatile

Files API

● Familiar, files-like interface to various virtual file systems.

● Released in 1.4.3, integrated with Mapper library.

● Considered to be a low-level API.

Files API

● Files have two states: writable and readable.

● Start in writable. Moved to readable by "finalization".

● A writable file can't be read; a readable file can't be written.

● Write is append-only, atomic and fully serializable between concurrent clients.

● Concrete filesystems might have their own reliability constraints and/or additional APIs.
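
The two-state lifecycle above can be modelled with a toy in-memory class. ToyFile is hypothetical and is not the Files API; it only shows the rules: append while writable, finalize once, read only afterwards.

```python
class ToyFile:
    """In-memory stand-in for a file with writable and readable states."""

    def __init__(self):
        self._chunks = []
        self._finalized = False  # every file starts out writable

    def append(self, data):
        # Writes are append-only and allowed only before finalization.
        if self._finalized:
            raise IOError("file is finalized; it can no longer be written")
        self._chunks.append(data)

    def finalize(self):
        # One-way transition from writable to readable.
        self._finalized = True

    def read(self):
        # Reads are allowed only after finalization.
        if not self._finalized:
            raise IOError("file is still writable; finalize it first")
        return "".join(self._chunks)


f = ToyFile()
f.append("key1,value1\n")
f.append("key2,value2\n")

try:
    f.read()
    read_before_finalize = "allowed"
except IOError:
    read_before_finalize = "rejected"

f.finalize()
contents = f.read()

try:
    f.append("late\n")
    write_after_finalize = "allowed"
except IOError:
    write_after_finalize = "rejected"
```

Because finalization is one-way, a reader never sees a file that is still growing, which is what lets a later MapReduce stage consume an earlier stage's output safely.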

Blobstore Filesystem

● Write directly to blobstore.

● Files can be >2G.

● Finalized files are durable.

● Writable files are not (just restart your MapReduce)

● Can fetch a blob key for finalized files and use blobstore api.

Blobstore Filesystem Python Example

# __future__ imports must come first in the module.
from __future__ import with_statement
from google.appengine.api import files

# Create the file.
file_name = files.blobstore.create()

# Open the file for append.
with files.open(file_name, 'a') as f:
    f.write('data')

# All data is in. Finalize the file.
files.finalize(file_name)

Blobstore Filesystem Python Example

# Open the file for read.
with files.open(file_name, 'r') as f:
    data = f.read(4)

# Fetch blobkey for blobstore api.
blob_key = files.blobstore.get_blob_key(file_name)

Mapper Integration

# mapreduce.yaml
...
mapper:
  output_writer: mapreduce.output_writers.BlobstoreOutputWriter

# Handler function
def map(entity):
    yield entity.to_csv_line() + '\n'

Low-level Features

● Exclusive locks: files can be opened exclusively by a single client only.

● Sequence keys: each write can have a "sequence key" attached. Our backends make sure that they only increase.
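
One way to picture sequence keys is as a guard against a retried task writing the same batch twice. The sketch below is a toy model, not the Files API: SequencedLog is hypothetical, and the rule that a write is rejected unless its key is strictly greater than the last accepted key is an assumption about what "only increase" means.

```python
class SequencedLog:
    """Toy append-only log that accepts only increasing sequence keys."""

    def __init__(self):
        self.records = []
        self.last_key = None

    def write(self, data, sequence_key):
        # Assumed rule: reject any key that does not strictly increase.
        if self.last_key is not None and sequence_key <= self.last_key:
            return False
        self.records.append(data)
        self.last_key = sequence_key
        return True


log = SequencedLog()
results = [
    log.write("batch-1", sequence_key="0001"),
    log.write("batch-2", sequence_key="0002"),
    log.write("batch-2", sequence_key="0002"),  # retry of the same write
    log.write("batch-3", sequence_key="0003"),
]
```

In this model the retried write is dropped, so the log holds each batch once even though the writer ran the second step twice.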

Future Plans

● "Tempfile" file system: much faster, much cheaper, but not durable, several days of storage only (geared specifically towards MapReduce)

● Integrations with other Google storage technologies and other reliability guarantees

User-Space Shuffler

User-Space Shuffler

● Consolidates values for the same key together.

● [(key, value)] => [(key, [value])]

● Should be reasonably fast, scalable and efficient.

● User-space: full source code, no new AppEngine components.

Take 1

● Load all data into memory

● Sort

● Read sorted array
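
A minimal sketch of Take 1 in plain Python, assuming the whole input fits in one process's memory. shuffle_in_memory is a hypothetical name, not a library function.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_in_memory(pairs):
    # Load all data into memory and sort it by key.
    data = sorted(pairs, key=itemgetter(0))
    # Read the sorted array: equal keys are now adjacent, so one pass
    # is enough to collect each key's values.
    for key, group in groupby(data, key=itemgetter(0)):
        yield (key, [value for _, value in group])


shuffled = list(shuffle_in_memory([("b", 1), ("a", 2), ("b", 3), ("a", 4)]))
```

sorted() is stable, so values for the same key keep their input order.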

Take 1

Take 1 Properties

● Memory-bound

● No parallelism

Take 2

● Sort chunks of data and store them back to Files API

● Merge-sort all chunks (or merge-read)
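
Take 2 can be sketched with the standard library. Here Python lists stand in for sorted chunk files stored through the Files API, and sort_chunks and merge_read are hypothetical names; in a real job each chunk could be sorted by a separate task.

```python
import heapq
from itertools import groupby, islice
from operator import itemgetter


def sort_chunks(pairs, chunk_size):
    # Sort fixed-size chunks independently; each chunk fits in memory.
    it = iter(pairs)
    chunks = []
    while True:
        chunk = sorted(islice(it, chunk_size), key=itemgetter(0))
        if not chunk:
            return chunks
        chunks.append(chunk)


def merge_read(chunks):
    # Merge-read: stream all sorted chunks as one sorted sequence,
    # holding only one item per chunk at a time.
    merged = heapq.merge(*chunks, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield (key, [value for _, value in group])


pairs = [("b", 1), ("a", 2), ("c", 3), ("a", 4), ("b", 5)]
chunks = sort_chunks(pairs, chunk_size=2)
shuffled = list(merge_read(chunks))
```

The merge is a single sequential reader over every chunk, which is the bottleneck that Take 2's properties point out.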

Take 2

Take 2 Properties

● No longer memory-bound

● Sorting is parallel

● Merge phase is not parallel

● Difficult (and slow) to read from too many files

Take 3

● Shard mapper output by key hash code

● Sort each shard into chunks

● Merge-read each shard
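
A sketch of Take 3 under the same assumptions as before: lists stand in for files, the function names are hypothetical, and the shards are processed in a loop here although each one is independent and could run as its own task. zlib.crc32 is used as the key hash only because it is stable across processes; the slides do not say which hash the library uses.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def shard_for(key, num_shards):
    # Stable hash of the key, so every occurrence of a key lands
    # in the same shard.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def shuffle_sharded(pairs, num_shards, chunk_size):
    # Shard mapper output by key hash code.
    shards = [[] for _ in range(num_shards)]
    for key, value in pairs:
        shards[shard_for(key, num_shards)].append((key, value))

    results = []
    for shard in shards:
        # Sort each shard into chunks.
        chunks = [
            sorted(shard[i:i + chunk_size], key=itemgetter(0))
            for i in range(0, len(shard), chunk_size)
        ]
        # Merge-read each shard on its own.
        merged = heapq.merge(*chunks, key=itemgetter(0))
        results.append([
            (key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))
        ])
    return results


pairs = [("b", 1), ("a", 2), ("c", 3), ("a", 4), ("b", 5)]
per_shard = shuffle_sharded(pairs, num_shards=2, chunk_size=2)
combined = {key: values for shard in per_shard for key, values in shard}
```

Since a key never spans two shards, the per-shard merges need no coordination, and each merge reads fewer files than the single merge in Take 2.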

Take 3

Take 3 Properties

● No longer memory-bound

● Sorting is parallel

● Merge phase is now parallel

● This is the shuffler we release today.

MapReduce & Pipeline

Pipeline API

● New API to chain complex work together.

● A glue which holds Mapper + Shuffler + Reducer together.

● MapReduce library is fully integrated with Pipeline.

● For an in-depth look, see the "Large-scale Data Analysis Using the App Engine Pipeline API" talk later today.

More Complex Example

Example 3: Distinguishing Phrases

# Map
def map(text, filename):
    for words in ngrams(text):
        yield (words, filename)

# Reduce
def reduce(key, values):
    if len(values) < 10:
        return
    for filename, count in count_occurences(values):
        if count > len(values) / 2:
            yield (key, filename)
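
The slide leaves ngrams() and count_occurences() undefined. The sketch below fills them in with assumed implementations (word bigrams and a per-file tally) and runs the job in memory on made-up sample text, so the details are illustrative only.

```python
from collections import Counter, defaultdict


def ngrams(text, n=2):
    # Assumed helper: lowercase word n-grams (bigrams by default).
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def count_occurences(values):
    # Assumed helper: (filename, count) pairs for a list of filenames.
    return Counter(values).items()


def phrase_map(text, filename):
    for words in ngrams(text):
        yield (words, filename)


def phrase_reduce(key, values):
    # Ignore rare phrases, then keep a phrase only if one file
    # accounts for more than half of its occurrences.
    if len(values) < 10:
        return
    for filename, count in count_occurences(values):
        if count > len(values) / 2:
            yield (key, filename)


# Made-up sample input: filename -> text.
documents = {
    "a.txt": "call me ishmael " * 8,
    "b.txt": "call me maybe " * 4,
}

# In-memory stand-in for the shuffle stage.
grouped = defaultdict(list)
for filename, text in documents.items():
    for phrase, name in phrase_map(text, filename):
        grouped[phrase].append(name)

distinguishing = sorted(
    pair
    for phrase, names in grouped.items()
    for pair in phrase_reduce(phrase, names)
)
```

With this input only "call me" reaches the 10-occurrence threshold (8 in a.txt, 4 in b.txt), and a.txt holds more than half of them, so that phrase is attributed to a.txt.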

Demo

Summary

● Small & Medium MapReduce jobs can be run by anyone today!

● Contact us to get access to large MapReduce jobs.

http://mapreduce.appspot.com

Questions?

Hashtags: #io2011 #AppEngine Feedback: http://goo.gl/SnV2i
