Page 1: Big Data in a Public Cloud

Big Data in a public cloud

Maximize RAM without paying for overhead CPU, and other price/performance tricks

Page 2: Big Data in a Public Cloud

Big Data

• Big Data: just another way of saying large data sets

• These data sets are so large that it’s difficult to manage them with traditional tools

• Distributed computing is an approach to solve that problem

• First the data needs to be Mapped

• Then it can be analyzed – or Reduced

Page 3: Big Data in a Public Cloud

MapReduce

• Instead of talking about it, let’s just see some code

Page 4: Big Data in a Public Cloud

Map/Reduce – distributed work
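
A minimal sketch of the two phases as Hadoop Streaming scripts in Python (the word-count example and the mapper.py / reducer.py file names are illustrative, not from the original deck):

```python
#!/usr/bin/env python3
# mapper.py – the Map phase: emit one (key, value) pair per word.
# Hadoop Streaming feeds raw input lines on stdin and expects
# tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py – the Reduce phase: sum the counts per word.
# Hadoop sorts the mapper output by key, so all values for a
# given word arrive as consecutive lines.
import sys
from itertools import groupby

def pairs(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        yield key, int(value)

for word, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
    print(f"{word}\t{sum(value for _, value in group)}")
```

Locally the pipeline can be tested with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`; on a cluster, Hadoop runs the same two scripts on many nodes in parallel and handles the shuffle/sort between them.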

Page 5: Big Data in a Public Cloud

But that’s perfect for public cloud!

Page 6: Big Data in a Public Cloud

CPU / RAM

• Mapping is not CPU intensive

• Reducing is (usually) not CPU intensive

• Speed: keep the data in RAM, or it hits the HDD and creates iowait

Page 7: Big Data in a Public Cloud

First point – a lot of RAM

Page 8: Big Data in a Public Cloud

Ah…now they have RAM instances

Page 9: Big Data in a Public Cloud

I wanted more RAM, I said ☹

Page 10: Big Data in a Public Cloud

What about MapReduce PaaS?

• Be aware of data lock-in

• Be aware of a forced tool set – it limits your workflow

• …and I just don’t like it so much as a developer; it’s a control thing

Page 11: Big Data in a Public Cloud

There is another way

Page 12: Big Data in a Public Cloud

What we do

Page 13: Big Data in a Public Cloud

What is compute like – why commodity?

Page 14: Big Data in a Public Cloud

Now let’s get to work

• Chef & Puppet – deploy and config

• Clone; scale up and down

• Problem solved?

Page 15: Big Data in a Public Cloud

And then there is reality…iowait

Page 16: Big Data in a Public Cloud

ioWAIT

• 1,000 IOPS on a high-end enterprise HDD

• 500,000 IOPS on a high-end SSD – roughly a 500x gap

• Public cloud on HDD should be forbidden

Page 17: Big Data in a Public Cloud

Announcing: SSD priced as HDD

Page 18: Big Data in a Public Cloud

Who runs these workloads, and how?

• CERN – the LHC experiments to find the Higgs boson

• 200 petabytes per year

• 200,000 CPU-days per day on hundreds of partner grids and clouds

• Monte Carlo simulations (CPU intensive)

Page 19: Big Data in a Public Cloud

The CERN case

• Golden VM images with the tools installed, from which they clone

• A set of coordinator servers that scale the worker nodes up and down via the provisioning API (clone, Puppet config)

• The coordinator servers manage the workload on each worker node running Monte Carlo simulations

• Federated cloud brokers such as Enstratus and Slipstream

• Self-healing architecture – the uptime of a worker node is not critical; just spin up a new worker node instance (a sketch of such a loop follows this list)
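
A hedged sketch of what such a self-healing coordinator loop could look like; the provisioning helpers (clone_golden_image, node_is_healthy, destroy_node) are hypothetical placeholders for whatever API the cloud provider exposes, not CERN’s actual tooling:

```python
# Hypothetical self-healing coordinator loop (illustrative only).
# The three provisioning helpers stand in for whatever clone /
# health-check / delete calls the cloud provider exposes.
import time

DESIRED_WORKERS = 100          # target size of the worker pool
CHECK_INTERVAL_SECONDS = 60    # how often the coordinator reconciles

def clone_golden_image():
    """Clone the golden VM image and return the new worker's id (stub)."""
    raise NotImplementedError("call the provider's provisioning API here")

def node_is_healthy(node_id):
    """Report whether the worker node still responds (stub)."""
    raise NotImplementedError("poll the worker's health endpoint here")

def destroy_node(node_id):
    """Release a dead worker so it stops consuming resources (stub)."""
    raise NotImplementedError("call the provider's delete API here")

def reconcile(workers):
    # Worker uptime is not critical: drop any node that stopped responding...
    for node_id in list(workers):
        if not node_is_healthy(node_id):
            destroy_node(node_id)
            workers.discard(node_id)
    # ...and clone replacements until the pool is back at the desired size.
    while len(workers) < DESIRED_WORKERS:
        workers.add(clone_golden_image())

if __name__ == "__main__":
    pool = set()
    while True:
        reconcile(pool)
        time.sleep(CHECK_INTERVAL_SECONDS)
```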

Page 20: Big Data in a Public Cloud

The loyalty program case

• The customer runs various loyalty programs

• Plain vanilla Hadoop Map/Reduce (a minimal sketch follows this list):

• Chef and Puppet to deploy and configure the worker nodes

• Lots of RAM, little CPU

• Self-healing architecture – the uptime of a worker node is not critical; just spin up a new worker node instance
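
As an illustration only (the mrjob library and the input format of one "member_id,points" record per line are assumptions, not details from the talk), a plain-vanilla Hadoop job for such a workload could be as small as:

```python
# Sum loyalty points per member with a plain Hadoop Map/Reduce job.
# Run locally with `python3 loyalty_points.py transactions.csv`,
# or on a Hadoop cluster by adding `-r hadoop`.
from mrjob.job import MRJob

class TotalPointsPerMember(MRJob):
    def mapper(self, _, line):
        # Map phase: each input line is assumed to be "member_id,points".
        member_id, points = line.strip().split(",")
        yield member_id, int(points)

    def reducer(self, member_id, points):
        # Reduce phase: all points for one member arrive together; sum them.
        yield member_id, sum(points)

if __name__ == "__main__":
    TotalPointsPerMember.run()
```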

Page 21: Big Data in a Public Cloud

Find IaaS suppliers with

• Unbundled resources

• Short billing cycles

• New equipment: what’s your life cycle?

• SSD as compute storage

• Allows cross-connects from your equipment in the DC

• CloudSigma, ElasticHost, ProfitBricks to mention a few

Page 22: Big Data in a Public Cloud

What we do

