Sebastien Goasguen, January 29th
@sebgoa
Cloud and Big Data
A view on Big Data
http://www.economist.com/node/15557443?story_id=15557443
SKA
How did we get there?
A natural evolution
New Distributed systems for:
Large-scale datasets
• From scientific instruments
• From web application logs
Complex datasets
• Not necessarily large
Object stores
• S3 clones
BigData and map-reduce
• BigData is often associated with HDFS, but Map-Reduce is the programming model used to parallelize the data processing.
• BigData ≠ Map-Reduce ≠ HDFS
• Map-Reduce is a way to express embarrassingly parallel work easily.
• You can do Map-Reduce without HDFS.
  • e.g. Basho map-reduce on Riak CS
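To illustrate the point that Map-Reduce does not need HDFS, here is a minimal sketch of a word count written as map, shuffle and reduce steps over an in-memory list. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are made up for illustration, and a thread pool stands in for a cluster; this is not how Hadoop or Riak implement it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, workers=4):
    # Each line is mapped independently of the others, which is what makes
    # the work embarrassingly parallel. The thread pool is only a stand-in
    # for the worker nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, lines))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    logs = ["GET /index.html", "GET /about.html", "POST /login"]
    print(word_count(logs))
```

The input here is a plain Python list; the same three steps apply whether the data comes from HDFS, an object store, or local files.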
A really quick view on Clouds
Open Source IaaS
Today
BigData at peak
History
2003 – Google File System
2005 – Hadoop
2006 – Hadoop enters ASF incubator (Feb)
2006 – S3 launched
2007 – Paper on Amazon Dynamo
2009 – EMR launched
2013 – CloudStack becomes an ASF TLP (March)
2013 – Spark/Mesos enters ASF incubator
The Apache Software Foundation
Apache Software Foundation
35 projects in incubation:
• 12 Hadoop related
• ~30% Big Data related
• Spark
117 top-level projects:
• ~16 cloud or Big Data related (over 10%)
• Deltacloud, Libcloud, Whirr, jclouds
• Hadoop, CouchDB, Cassandra, Mesos
• Bigtop, Accumulo, Lucene, UIMA
• CloudStack
Hadoop Ecosystem
+ Up-coming next generation BD systems
Big Data and Cloud (Stack)s
Clouds and BigData
• Object store + compute IaaS to build EC2+S3 clone
• BigData solutions as storage backends for image catalogue and large scale instance storage.
• BigData solutions as workloads to CloudStack based clouds.
EC2, S3 clone
• An open source IaaS with an EC2 wrapper, e.g. OpenNebula
• Deploy an S3-compatible object store separately, e.g. Riak CS
• Two independent distributed systems deployed
Cloud = EC2 + S3
Big Data as IaaS backend
“Big Data” solutions can be used as secondary storage.
Example
• Open source IaaS + EC2 wrapper, e.g. CloudStack
• Deploy an S3-compatible object store, e.g. Riak CS, Ceph or GlusterFS
• Use S3 as image store
• Your EC2 service is a customer of your S3 service
• Logstash + Elasticsearch for logs/monitoring
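As a rough sketch of "use S3 as image store", the snippet below pushes a template image into an S3-compatible object store with boto3. The endpoint URL, bucket name, key layout and the helper names `image_key` and `upload_image` are all hypothetical, chosen for illustration; this is not how CloudStack itself talks to its secondary storage.

```python
def image_key(template_name, image_format):
    """Build the object key for a template image (hypothetical layout)."""
    return "templates/{}.{}".format(template_name, image_format)


def upload_image(path, template_name, image_format,
                 endpoint_url="http://objectstore.example.org:8080",
                 bucket="image-store"):
    """Upload a local image file to an S3-compatible store and return its key."""
    # boto3 is imported here so that image_key() works without it installed.
    import boto3

    # endpoint_url points the client at the S3-compatible store (for example
    # Riak CS or Ceph) instead of AWS. Credentials are read from the
    # environment or the usual boto3 configuration files.
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    key = image_key(template_name, image_format)
    s3.upload_file(path, bucket, key)
    return key
```

Because the store speaks the S3 API, the same client code should work against any of the backends listed above by changing only the endpoint.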
Even use Bare Metal
Big Data as a Workload to the Cloud
Mesos, Spark are EC2 native
• ec2_deploy.py
• ec2_deploy.sh
• …
Tools
“PaaS”
Dev Pipeline
Conclusions
• Big Data is “catching up”
• Tackle the big three head on:
  • BigData, Cloud and DevOps
• Add a big data backend to your cloud from the start
• Provide Big Data services on your cloud
Still behind!
Final Thoughts
Who manages my data transfers?
Event
ApacheCON + CloudStack Collaboration Conference
Denver, April 7-11
Get Involved with Apache CloudStack
Web: http://cloudstack.apache.org/
Mailing Lists: cloudstack.apache.org/mailing-lists.html
IRC: irc.freenode.net:6667, #cloudstack and #cloudstack-dev
Twitter: @cloudstack
LinkedIn: www.linkedin.com/groups/CloudStack-Users-Group-3144859
If it didn’t happen on the mailing list, it didn’t happen.