Date post: 01-Oct-2020
Computer Science 61C Spring 2018 Wawrzynek and Weaver

Warehouse Scale Computing


Agenda

• Warehouse-Scale Computing
• Cloud Computing
• Request-Level Parallelism (RLP)
• Map-Reduce Data Parallelism
• And, in Conclusion …

11/8/17 2



Google’s WSCs


Ex: In Oregon


WSC Architecture

1U Server: 8 cores, 16 GiB DRAM, 4x1 TB disk

Rack: 40-80 servers, local Ethernet (1-10 Gbps) switch ($30/1 Gbps/server)

Array (aka cluster): 16-32 racks, expensive switch (10x bandwidth → 100x cost)


WSC Storage Hierarchy

Each entry lists capacity, latency, bandwidth.

1U Server:
  DRAM: 16 GB, 100 ns, 20 GB/s
  Disk: 2 TB, 10 ms, 200 MB/s

Rack (80 servers):
  DRAM: 1 TB, 300 µs, 100 MB/s
  Disk: 160 TB, 11 ms, 100 MB/s

Array (30 racks):
  DRAM: 30 TB, 500 µs, 10 MB/s
  Disk: 4.80 PB, 12 ms, 10 MB/s
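To get a feel for what these numbers mean, here is a rough sketch that estimates how long it takes to fetch a block of data from each level. The latency and bandwidth figures are copied from the slide; the 1 GB block size and the simple "latency plus size over bandwidth" cost model are assumptions made for illustration only.

```python
# Latency (seconds) and bandwidth (bytes/second) per level, from the slide.
LEVELS = {
    "local DRAM": (100e-9, 20e9),
    "local disk": (10e-3, 200e6),
    "rack DRAM": (300e-6, 100e6),
    "rack disk": (11e-3, 100e6),
    "array DRAM": (500e-6, 10e6),
    "array disk": (12e-3, 10e6),
}


def transfer_seconds(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost model: one access latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    block = 1e9  # hypothetical 1 GB block
    for name, (latency, bandwidth) in LEVELS.items():
        print(f"{name:11s} {transfer_seconds(block, latency, bandwidth):9.3f} s")
```

Under this model, bandwidth dominates for large transfers, so remote DRAM in another rack can be slower to stream from than the local disk.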


Google Server Internals


Power Usage Effectiveness

• Energy efficiency
  • Primary concern in the design of WSC
  • Important component of the total cost of ownership
• Power Usage Effectiveness (PUE):

  PUE = Total Building Power / IT Equipment Power

  • Power efficiency measure for WSC
  • Not considering efficiency of servers, networking
  • Perfection = 1.0
  • Google WSC’s PUE = 1.2


Power Usage Effectiveness

[Diagram: total power into the datacenter is split between IT equipment (servers, storage, networks) and infrastructure (air conditioning, power distribution, UPS, …). Two cases are shown, one with PUE = 2 and one with PUE = 1.5.]

PUE = Total Power / IT Power
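The PUE ratio can be worked through with a few lines of code. This is only a sketch: the function names and the kilowatt figures in the usage note are hypothetical, chosen so that the results match the PUE values quoted on the slides.

```python
def pue(total_facility_power_kw, it_equipment_power_kw):
    """Power Usage Effectiveness: total facility power divided by IT power."""
    if it_equipment_power_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_power_kw / it_equipment_power_kw


def infrastructure_overhead_kw(it_equipment_power_kw, pue_value):
    """Power spent on cooling, power distribution, etc. for a given PUE."""
    return it_equipment_power_kw * (pue_value - 1.0)
```

For example, a hypothetical site with 1000 kW of IT load and 1000 kW of infrastructure load has `pue(2000, 1000) == 2.0`, while the same IT load at a PUE of 1.2 leaves roughly 200 kW for infrastructure.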


Cheating on Cooling

• Normally cooling the air requires big air-conditioning units
  • These suck a lot of power and still consume a lot of water
  • Evaporation of water to dissipate the energy
• Cheat #1: Heat-exchange to a water source
  • Locate your data center on a river or the ocean
  • Heat up water rather than air
• Cheat #2: Just have things open to the air!
  • Ups the failure rate, but if the power savings exceed the costs incurred by additional machines dying, it is a win!


Energy Proportionality

Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum.

It is surprisingly hard to achieve high levels of utilization of typical servers (and your home PC or laptop is even worse).

“The Case for Energy-Proportional Computing,” Luiz André Barroso, Urs Hölzle, IEEE Computer December 2007


Energy-Proportional Computing


Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.

“The Case for Energy-Proportional Computing,” Luiz André Barroso, Urs Hölzle, IEEE Computer December 2007

Energy Efficiency = Utilization/Power


Energy Proportionality

Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.

“The Case for Energy-Proportional Computing,” Luiz André Barroso, Urs Hölzle, IEEE Computer December 2007

Design for wide dynamic power range and active low power modes

Energy Efficiency = Utilization/Power
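The Utilization/Power ratio can be explored numerically. The sketch below assumes a linear power model (power rises in a straight line from an idle level to peak), which is a simplification and not something the slides state. The idle fraction of 0.5 follows the "about half its full power" caption; the idle fraction of 0.1 is a hypothetical value for a more energy-proportional server.

```python
def relative_power(utilization, idle_fraction):
    """Power as a fraction of peak, under an assumed linear idle-to-peak model."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return idle_fraction + (1.0 - idle_fraction) * utilization


def energy_efficiency(utilization, idle_fraction):
    """Utilization / Power, normalized so the value is 1.0 at peak load.

    Assumes idle_fraction > 0, so the denominator is never zero.
    """
    return utilization / relative_power(utilization, idle_fraction)


if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5, 1.0):
        print(f"u={u:.1f}  idle=50%: {energy_efficiency(u, 0.5):.2f}"
              f"  idle=10%: {energy_efficiency(u, 0.1):.2f}")
```

With a 50 percent idle draw, efficiency at 10 percent utilization is under 20 percent of peak; with a 10 percent idle draw, the same model stays above 50 percent there and above 80 percent at 30 percent utilization.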


Scaled Communities, Processing, and Data


Cloud Distinguished by …

• Shared platform with illusion of isolation
  • Collocation with other tenants
  • Exploits technology of VMs and hypervisors (next lectures!)
  • At best “fair” allocation of resources, but not true isolation
• Attraction of low-cost cycles
  • Economies of scale driving move to consolidation
  • Statistical multiplexing to achieve high utilization/efficiency of resources
• Elastic service
  • Pay for what you need, get more when you need it
  • But no performance guarantees: assumes uncorrelated demand for resources


Cloud Services

• SaaS: deliver apps over Internet, eliminating need to install/run on customer's computers, simplifying maintenance and support
  • E.g., Google Docs, Win Apps in the Cloud
• PaaS: deliver computing “stack” as a service, using cloud infrastructure to implement apps. Deploy apps without cost/complexity of buying and managing underlying layers
  • E.g., Hadoop on EC2, Apache Spark on GCP
• IaaS: Rather than purchasing servers, software, data center space or net equipment, clients buy resources as an outsourced service. Billed on utility basis. Amount of resources consumed/cost reflect level of activity
  • E.g., Amazon Elastic Compute Cloud, Google Compute Engine


Request-Level Parallelism (RLP)

• Hundreds of thousands of requests per second
  • Popular Internet services like web search, social networking, …
  • Such requests are largely independent
  • Often involve read-mostly databases
  • Rarely involve read-write sharing or synchronization across requests
• Computation easily partitioned across different requests and even within a request
• Can often "load balance" just at the DNS level: Just tell different people to use a different computer
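The DNS-level idea in the last bullet can be sketched as a resolver that hands out a different address on each lookup. This is a toy model, not how any particular DNS server is implemented: the class name, the hostname, and the addresses (taken from the documentation-only 192.0.2.0/24 range) are all hypothetical, and real DNS adds caching and TTLs that are ignored here.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy resolver that rotates through the addresses registered for a name."""

    def __init__(self, records):
        # records: dict mapping a hostname to a list of server addresses
        self._cycles = {name: cycle(list(addrs)) for name, addrs in records.items()}

    def resolve(self, name):
        """Return the next address for this name, wrapping around at the end."""
        return next(self._cycles[name])


if __name__ == "__main__":
    resolver = RoundRobinResolver({"search.example": ["192.0.2.1", "192.0.2.2"]})
    for _ in range(4):
        print(resolver.resolve("search.example"))
```

Because requests are largely independent, it does not matter which server a given client is told to use, which is what makes this simple scheme workable.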


Google Query-Serving Architecture


Web Search Result


Anatomy of a Web Search (1/3)

• Google “Nicholas Weaver”
1. Direct request to “closest” Google Warehouse-Scale Computer
2. Front-end load balancer directs request to one of many clusters of servers within WSC
3. Within cluster, select one of many Google Web Servers (GWS) to handle the request and compose the response pages
4. GWS communicates with Index Servers to find documents that contain the search words, “Nicholas”, “Weaver”, uses location of search as well as user information
5. Send information about this search to the node in charge of tracking [email protected]
6. Return document list with associated relevance score


Anatomy of a Web Search (2/3)

• In parallel,
  • Ad system: if anyone has bothered to advertise for me
  • Customization based on my account
• Use docids (document IDs) to access indexed documents to get snippets of stuff
• Compose the page
  • Result document extracts (with keyword in context) ordered by relevance score
  • Sponsored links (along the top) and advertisements (along the sides)


Anatomy of a Web Search (3/3)

• Implementation strategy
  • Randomly distribute the entries
  • Make many copies of data (aka “replicas”)
  • Load balance requests across replicas
• Redundant copies of indices and documents
  • Breaks up hot spots, e.g., “Justin Bieber”
  • Increases opportunities for request-level parallelism
  • Makes the system more tolerant of failures
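A minimal sketch of load balancing across replicas, including the fault-tolerance benefit: a failed replica is simply skipped. The class and replica names are hypothetical, and random selection is only one possible policy; the slides do not say which policy is actually used.

```python
import random


class ReplicaSet:
    """Toy load balancer over redundant copies of the same data."""

    def __init__(self, replicas, seed=None):
        self.replicas = list(replicas)
        self.healthy = set(self.replicas)
        self.rng = random.Random(seed)  # seeded only to make runs repeatable

    def mark_failed(self, replica):
        """Stop sending requests to a replica that has died."""
        self.healthy.discard(replica)

    def pick(self):
        """Choose a healthy replica at random to serve the next request."""
        candidates = [r for r in self.replicas if r in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy replica available")
        return self.rng.choice(candidates)
```

With three replicas and one marked as failed, every later `pick()` returns one of the two survivors, so requests keep being served and a hot key is still spread over more than one machine.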


Administrivia

• Project 4 Out
  • Due Monday
  • Project Party This Wednesday!!!!
• HW4 due Friday
• Final: If you have a conflict, fill out the form now if you haven't yet
• Clicker Question: What is your favorite letter?


Data-Level Parallelism (DLP)

• SIMD
  • Supports data-level parallelism in a single machine
  • Additional instructions & hardware (e.g., AVX)
  • E.g., matrix multiplication in memory
• DLP on WSC
  • Supports data-level parallelism across multiple machines
  • MapReduce & scalable file systems


Problem Statement

• How do we process large amounts of raw data (crawled documents, request logs, …) every day to compute derived data (inverted indices, page popularity, …), when the computation is conceptually simple but the input data is large and distributed across 100s to 1000s of servers, so that it finishes in reasonable time?

• Challenge: Parallelize computation, distribute data, and tolerate faults without obscuring the simple computation with complex code to deal with these issues


Solution: MapReduce

• Simple data-parallel programming model and implementation for processing large datasets
• Users specify the computation in terms of
  • a map function, and
  • a reduce function
• Underlying runtime system
  • Automatically parallelizes the computation across large-scale clusters of machines
  • Handles machine failure
  • Schedules inter-machine communication to make efficient use of the networks


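Inspiration: Map & Reduce Functions, ex: Python

Calculate: $\sum_{n=1}^{4} n^2$

    from functools import reduce  # reduce is not a builtin in Python 3

    A = [1, 2, 3, 4]

    def square(x):
        return x * x

    def sum(x, y):  # shadows the builtin sum, as on the slide
        return x + y

    reduce(sum, map(square, A))  # 1 + 4 + 9 + 16 = 30

Input:        1  2  3  4
Squares:      1  4  9  16
Partial sums: 5  25
Result:       30

Divide and Conquer!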


MapReduce Programming Model

• Map: (in_key, in_value) → list(interm_key, interm_val)

    map(in_key, in_val):
        // DO WORK HERE
        emit(interm_key, interm_val)

  • Slice data into “shards” or “splits” and distribute to workers
  • Compute set of intermediate key/value pairs

• Reduce: (interm_key, list(interm_value)) → list(out_value)

    reduce(interm_key, list(interm_val)):
        // DO WORK HERE
        emit(out_key, out_val)

  • Combines all intermediate values for a particular key
  • Produces a set of merged output values (usually just one)
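The programming model can be imitated in a single process to see how the two user-supplied functions fit together. This is a sketch under loud assumptions: `run_mapreduce`, `square_map`, and `sum_reduce` are hypothetical names, everything runs sequentially in memory, and none of the distribution or fault handling of a real runtime is modeled. The example job reuses the sum-of-squares calculation from the Python inspiration slide.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over every (in_key, in_value), group by key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for interm_key, interm_val in map_fn(in_key, in_value):
            groups[interm_key].append(interm_val)  # stands in for the shuffle
    return {key: list(reduce_fn(key, vals)) for key, vals in sorted(groups.items())}


def square_map(in_key, in_value):
    """Emit one intermediate pair per number, all under the same key."""
    for x in in_value:
        yield ("sum_sq", x * x)


def sum_reduce(interm_key, interm_values):
    """Merge all values for a key into a single output value."""
    yield sum(interm_values)


if __name__ == "__main__":
    print(run_mapreduce([("A", [1, 2, 3, 4])], square_map, sum_reduce))
```

Note that the user code never mentions machines or failures; that separation is the point of the model.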


MapReduce Execution

Fine granularity tasks: many more map tasks than machines

2000 servers => ≈ 200,000 Map Tasks, ≈ 5,000 Reduce tasks

Bucket sort to get same keys together


MapReduce Word Count Example

Input: that that is is that that is not is not is that it it is

Distribute and Map:
  Map 1: that 1, that 1, is 1
  Map 2: is 1, that 1, that 1
  Map 3: is 1, not 1, is 1, not 1
  Map 4: is 1, that 1, it 1, it 1, is 1

Local Sort:
  Map 1: is 1, that 1, that 1
  Map 2: is 1, that 1, that 1
  Map 3: is 1, is 1, not 1, not 1
  Map 4: is 1, is 1, it 1, it 1, that 1

Shuffle:
  Reduce 1: is 1,1,1,1,1,1; it 1,1
  Reduce 2: not 1,1; that 1,1,1,1,1

Reduce:
  Reduce 1: is 6; it 2
  Reduce 2: not 2; that 5

Collect: is 6; it 2; not 2; that 5


MapReduce Word Count Example

User-written Map function reads the document data and parses the words. For each word, it writes the (key, value) pair of (word, 1). The word is treated as the intermediate key and the associated value of 1 means that we saw the word once.

Map phase: (doc name, doc contents) → list(word, count)

    // “I do I learn” → [(“I”, 1), (“do”, 1), (“I”, 1), (“learn”, 1)]
    map(key, value):
        for each word w in value:
            emit(w, 1)


MapReduce Word Count Example

Intermediate data is then sorted by MapReduce by keys and the user’s Reduce function is called for each unique key. In this case, Reduce is called with a list of a "1" for each occurrence of the word that was parsed from the document. The function adds them up to generate a total word count for that word.

Reduce phase: (word, list(counts)) → (word, count_sum)

    // (“I”, [1, 1]) → (“I”, 2)
    reduce(key, values):
        result = 0
        for each v in values:
            result += v
        emit(key, result)
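The word count map and reduce phases translate almost directly into runnable Python. The sketch below is a single-process imitation with hypothetical function names; the grouping dictionary stands in for the sort-by-key step that the MapReduce runtime performs between the two phases.

```python
from collections import defaultdict


def map_words(doc_name, doc_contents):
    """Map phase: emit (word, 1) for every word in the document."""
    for word in doc_contents.split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Reduce phase: add up the list of 1s collected for one word."""
    return (word, sum(counts))


def word_count(documents):
    """Group intermediate pairs by key, then call reduce once per unique key."""
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_words(name, contents):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count({"doc": "I do I learn"}))
```

Running it on the input "I do I learn" yields a count of 2 for "I" and 1 each for "do" and "learn", matching the comments in the pseudocode.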


The Combiner (Optional)

• One missing piece for our first example:
  • Many times, the output of a single mapper can be “compressed” to save on bandwidth and to distribute work (usually more map tasks than reduce tasks)
  • To implement this, we have the combiner:

    combiner(interm_key, list(interm_val)):
        // DO WORK (usually like reducer)
        emit(interm_key2, interm_val2)


Our Final Execution Sequence

• Map – Apply operations to all input key, val
• Combine – Apply reducer operation, but distributed across map tasks
• Reduce – Combine all values of a key to produce desired output


MapReduce Processing Example: Count Word Occurrences

• Pseudo Code: for each word in input, generate <key=word, value=1>
• Reduce sums all counts emitted for a particular word across all mappers

    map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");  // Produce count of words

    combiner: (same as below reducer)

    reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);  // get integer from key-value
        Emit(output_key, result);


MapReduce Word Count Example (with Combiner)

Input: that that is is that that is not is not is that it it is

Distribute, Map, and Local Sort:
  Map 1: is 1, that 1, that 1
  Map 2: is 1, that 1, that 1
  Map 3: is 1, is 1, not 1, not 1
  Map 4: is 1, is 1, it 1, it 1, that 1

Combine:
  Map 1: is 1, that 2
  Map 2: is 1, that 2
  Map 3: is 2, not 2
  Map 4: is 2, it 2, that 1

Shuffle:
  Reduce 1: is 1,1,2,2; it 2
  Reduce 2: not 2; that 2,2,1

Reduce:
  Reduce 1: is 6; it 2
  Reduce 2: not 2; that 5

Collect: is 6; it 2; not 2; that 5
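The combiner example above can be reproduced in a few lines. This sketch assumes the four input splits implied by the diagram and uses hypothetical function names; it runs in one process and only counts how many intermediate pairs would have to cross the network.

```python
from collections import defaultdict

# The four input splits implied by the diagram, one per map task.
SPLITS = ["that that is", "is that that", "is not is not", "is that it it is"]


def map_words(split):
    """Map: emit (word, 1) for every word in one split."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Combiner: sum counts locally on the map side, one pair per word."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_all(combined_outputs):
    """Reduce: sum the partial counts arriving from every map task."""
    totals = defaultdict(int)
    for pairs in combined_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


if __name__ == "__main__":
    combined = [combine(map_words(s)) for s in SPLITS]
    print(combined)
    print(reduce_all(combined))
```

Without the combiner, 15 pairs are shuffled (one per word); with it, only 9 are, and the final counts are unchanged. This works here because addition is associative and commutative, so partial sums can be merged in any order.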

MapReduce Processing

[Figure: MapReduce execution diagram, repeated on each of the following steps; label: Shuffle phase]

1. MR first splits the input files into M “splits”, then starts many copies of the program on servers.

2. One copy, the master, is special. The rest are workers. The master picks idle workers and assigns each 1 of M map tasks or 1 of R reduce tasks.

3. A map worker reads the input split. It parses key/value pairs of the input data and passes each pair to the user-defined map function. (The intermediate key/value pairs produced by the map function are buffered in memory.)

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.

5. When a reduce worker has read all intermediate data for its partition, it bucket sorts using intermediate keys so that occurrences of same keys are grouped together. (The sorting is needed because typically many different keys map to the same reduce task.)

6. Reduce worker iterates over sorted intermediate data and for each unique intermediate key, it passes key and corresponding set of values to the user’s reduce function. The output of the reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. The MapReduce call in user program returns. Output of MR is in R output files (1 per reduce task, with file names specified by user); often passed into another MR job, so don’t concatenate.
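The partitioning and sorting steps above (writing buffered pairs into R regions, then grouping equal keys on the reduce side) can be sketched as follows. Several choices here are assumptions, not taken from the slides: hashing the key modulo R is a common default partitioning function, CRC32 is used only so the result is the same on every run, and Python's built-in sort stands in for the bucket sort mentioned in the text. All function names are hypothetical.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reduce_tasks):
    """Pick a reduce region for a key: stable hash of the key modulo R."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def write_regions(pairs, num_reduce_tasks):
    """Map side: split one map task's buffered pairs into R regions."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


def sort_and_group(pairs):
    """Reduce side: sort pairs by key, then collect the values of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]
```

Because the region depends only on the key, every occurrence of a word ends up at the same reduce task no matter which map task emitted it. Several different keys can still share a region, which is why the reduce side has to sort before grouping.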


Big Data Frameworks: Hadoop & Spark

• Apache Hadoop
  • Open-source MapReduce Framework
  • Hadoop Distributed File System (HDFS)
  • MapReduce Java APIs
• Apache Spark
  • Fast and general engine for large-scale data processing.
  • Originally developed in the AMP lab at UC Berkeley
  • Running on HDFS
  • Provides Java, Scala, Python APIs for
    • Database
    • Machine learning
    • Graph algorithm


WordCount in Hadoop’s Java API


Word Count in Spark’s Python API

    # RDD: primary abstraction of a distributed collection of items
    # sc is assumed to be an existing SparkContext (for example, the one the pyspark shell provides)
    file = sc.textFile("hdfs://…")

    # Two kinds of operations:
    #   Actions: RDD → Value
    #   Transformations: RDD → RDD
    #   e.g. flatMap, map, reduceByKey
    file.flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

See http://spark.apache.org/examples.html


MapReduce Processing Time Line

• Master assigns map + reduce tasks to “worker” servers
• As soon as a map task finishes, worker server can be assigned a new map or reduce task
• Data shuffle begins as soon as a given Map finishes
• Reduce task begins as soon as all data shuffles finish
• To tolerate faults, reassign task if a worker server “dies”
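The reassignment rule in the last bullet can be illustrated with a toy scheduling loop. Everything here is a simplification under stated assumptions: the function and worker names are hypothetical, tasks run one at a time, and the set of failing workers is known in advance, whereas a real master would have to detect a dead worker at run time.

```python
from collections import deque


def schedule(tasks, workers, failing):
    """Assign each task to the next idle worker; re-queue it if the worker dies."""
    pending = deque(tasks)
    idle = deque(workers)
    completed = {}
    while pending:
        if not idle:
            raise RuntimeError("all workers have died")
        task = pending.popleft()
        worker = idle.popleft()
        if worker in failing:
            pending.append(task)  # reassign later; the dead worker is dropped
            continue
        completed[task] = worker
        idle.append(worker)  # the worker is idle again and can take a new task
    return completed


if __name__ == "__main__":
    print(schedule(["m1", "m2", "m3"], ["w1", "w2", "w3"], failing={"w2"}))
```

In the usage example, task m2 is first handed to the failing worker w2, goes back on the queue, and is later completed by w1, so all three tasks still finish.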


Show MapReduce Job Running

• ~41 minutes total
  • ~29 minutes for Map tasks & Shuffle tasks
  • ~12 minutes for Reduce tasks
  • 1707 worker servers used
• Map (Green) tasks read 0.8 TB, write 0.5 TB
• Shuffle (Red) tasks read 0.5 TB, write 0.5 TB
• Reduce (Blue) tasks read 0.5 TB, write 0.5 TB


Critical Limitations...

• This only works for specific classes of problems
  • Need parallel compute over data and parallel reduction steps
• Spark can be even more limited
  • Hadoop at least allows some more flexibility
• HUGE overhead!
  • Hadoop Distributed File System: 3x+ redundant storage
  • Lots of startup and control overhead: So unless you have multiple terabytes of data, don't bother!
• For many cases, you are better served throwing a Big F-n Database machine at the problem
  • Gazillion cores, a TON of memory, and a lot of SSD running Postgres or Oracle


And, in Conclusion ...

• Warehouse-Scale Computers (WSCs)
  • New class of computers
  • Scalability, energy efficiency, high failure rate
• Cloud Computing
  • Benefits of WSC computing for third parties
  • “Elastic” pay as you go resource allocation
• Request-Level Parallelism
  • High request volume, each largely independent of other
  • Use replication for better request throughput, availability
• MapReduce Data Parallelism
  • Map: Divide large data set into pieces for independent parallel processing
  • Reduce: Combine and process intermediate results to obtain final result
  • Hadoop, Spark
