Big Data Computations Using Elastic Data Processing in OpenStack Cloud
Sergey Lukjanov (Mirantis), Alexander Ignatov (Mirantis), Trevor McKay (Red Hat)
Agenda
• OpenStack Data Processing Overview
• EDP Architecture & Technical Concepts
• Live Demo
OpenStack Data Processing: Sahara
Mission: To provide a scalable data processing stack and associated management interfaces.
• Provision and operate Hadoop clusters
• Schedule and operate Hadoop jobs
Hadoop - Big Data Platform
© http://hortonworks.com/hadoop/yarn/
Trends
http://www.google.com/trends/
Architecture overview
[Architecture diagram. Labelled components: Horizon (Savanna Pages), Savanna Python Client, REST API, Keystone (Auth), Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Data Sources, Job Sources, Vendors Plugins, Data Access Layer, Hadoop VMs, Swift, Heat, Nova, Glance, Cinder, Neutron, Trove, DB]
Sahara status
• Official integrated OpenStack project
• Supported Hadoop distros:
  • Vanilla Apache Hadoop
  • Hortonworks Data Platform
  • Intel Distribution
  • Cloudera Distribution in blueprint
• Included into OpenStack distros:
  • RDO - openstack.redhat.com
  • Mirantis OpenStack - software.mirantis.com
Contributors
Elastic Data Processing
• EDP: an API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)
• Supported data sources: Swift, HDFS, Ceph
• Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive
• Oozie for Hadoop job workflow management
• Supports both Hadoop 1 & 2
• Job executions on transient clusters
EDP Use Cases
• Simplified task executions. You don’t need to know Hadoop!
• Bursty workloads: ad-hoc queries that require significant resources only for a short period
• Utilization of free IaaS capacity for Hadoop tasks
EDP - Data Sources
[Diagram labels: Swift (INPUT, OUTPUT), Sahara EDP, Hadoop VMs]
• Input: swift://some_container/INPUT
• Output: swift://some_container/OUTPUT
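The data source URLs above follow a swift://container/object pattern. As a minimal sketch of how such a URL breaks down, the hypothetical helper below (split_swift_url is an illustrative name, not part of Sahara) separates the container from the object path using only the Python standard library.

```python
from urllib.parse import urlparse


def split_swift_url(url):
    """Split a swift://container/object URL into (container, object_path).

    Illustrative helper only; Sahara's own handling may differ.
    """
    parsed = urlparse(url)
    if parsed.scheme != "swift":
        raise ValueError(f"not a swift URL: {url!r}")
    # The container is the authority part; the object is the rest of the path.
    return parsed.netloc, parsed.path.lstrip("/")


print(split_swift_url("swift://some_container/INPUT"))
# → ('some_container', 'INPUT')
```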
EDP - Job Binaries
[Diagram labels: Swift, Sahara DB, Sahara EDP]
• internal-db://script.pig
• swift://some_container/mapreduce.jar
1. Pig, Hive scripts
2. Executable Jar files
3. Pluggable binaries and libraries
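The two example URLs suggest that the scheme selects where a job binary is stored: internal-db:// for the Sahara DB and swift:// for Swift. A rough sketch of that dispatch follows; the function name and the returned labels are assumptions made for illustration, not Sahara's API.

```python
def binary_location(url):
    """Return (store, name) for a job binary URL.

    The store labels "sahara-db" and "swift" are illustrative names.
    """
    scheme, sep, rest = url.partition("://")
    if not sep or not rest:
        raise ValueError(f"malformed job binary URL: {url!r}")
    if scheme == "internal-db":
        return "sahara-db", rest
    if scheme == "swift":
        return "swift", rest
    raise ValueError(f"unsupported scheme: {scheme!r}")


for url in ("internal-db://script.pig", "swift://some_container/mapreduce.jar"):
    print(binary_location(url))
```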
EDP - Job Execution. Step 1
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig]
EDP - Job Execution. Step 2
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Hadoop VMs; JobTracker; Oozie]
EDP - Job Execution. Step 3
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Hadoop VMs; JobTracker; Oozie: Execute a job]
EDP - Job Execution. Step 4
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Hadoop VMs; JobTracker; Oozie]
EDP - Job Execution. Step 5
[Diagram labels: Sahara EDP; Swift: INPUT; DB: Jar, Pig; Hadoop VMs; JobTracker; Oozie; workflow.xml]
workflow.xml contains:
1. Job-specific configurations
2. URLs to binaries
3. URLs for data sources
4. Credentials
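To make the four items carried by workflow.xml more concrete, here is a minimal sketch that builds a small Oozie workflow for a Pig action with Python's xml.etree. It is not the file Sahara actually generates. The workflow and node names, the INPUT/OUTPUT parameter names, and the credential property names and values passed in the usage example are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_pig_workflow(script, input_url, output_url, configs):
    """Build a minimal Oozie workflow.xml string for one Pig action."""
    app = ET.Element(
        "workflow-app",
        {"xmlns": "uri:oozie:workflow:0.2", "name": "edp-sketch"},
    )
    ET.SubElement(app, "start", {"to": "job-node"})
    action = ET.SubElement(app, "action", {"name": "job-node"})
    pig = ET.SubElement(action, "pig")
    ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
    ET.SubElement(pig, "name-node").text = "${nameNode}"
    # 1. Job-specific configurations and 4. credentials go in as properties.
    conf = ET.SubElement(pig, "configuration")
    for name, value in configs.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # 2. Reference to the job binary (here, a Pig script).
    ET.SubElement(pig, "script").text = script
    # 3. URLs for the data sources, passed as script parameters.
    ET.SubElement(pig, "param").text = f"INPUT={input_url}"
    ET.SubElement(pig, "param").text = f"OUTPUT={output_url}"
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


workflow_xml = build_pig_workflow(
    "script.pig",
    "swift://some_container/INPUT",
    "swift://some_container/OUTPUT",
    {
        # Assumed property names and placeholder values.
        "fs.swift.service.sahara.username": "demo",
        "fs.swift.service.sahara.password": "secret",
    },
)
print(workflow_xml)
```

Keeping the data source URLs as parameters rather than hard-coding them in the script lets the same binary run against different inputs and outputs.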
EDP - Job Execution. Step 6
[Diagram labels: Sahara EDP; Swift: INPUT, OUTPUT; DB: Jar, Pig; Hadoop VMs: Data Processing; JobTracker; Oozie; workflow.xml]
workflow.xml contains:
1. Job-specific configurations
2. URLs to binaries
3. URLs for data sources
4. Credentials
EDP - Job Execution. Step 7
[Diagram labels: Sahara EDP; Swift: INPUT, OUTPUT; DB: Jar, Pig; Hadoop VMs: Data Processing; JobTracker; Oozie; workflow.xml]
workflow.xml contains:
1. Job-specific configurations
2. URLs to binaries
3. URLs for data sources
4. Credentials
EDP BigPetStore Demo
BigPetStore is now part of Apache Bigtop
• Test/demo laboratory for all things Hadoop
• Actively developed with integration testing
• Generates and processes data of arbitrary size
• git clone git://git.apache.org/bigtop.git
• Filed under bigtop/bigtop-bigpetstore
EDP BigPetStore Demo
What are we going to do?
• Generate 1M records of pet supply purchases
• Clean the data (“dirty CSV”)
• Extract cumulative counts by state
• Demonstrates Sahara EDP objects
  • Job Binaries
  • Jobs (Java and Pig)
  • Data Sources
EDP BigPetStore Sample Data
Generated Data (first job)
$ hadoop fs -cat bigpetstore/gen/part-r-00000 | more
BigPetStore,storeCode_AK,1 deanna,booker,Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food
BigPetStore,storeCode_AK,10 erica,buck,Thu Dec 25 16:29:28 GMT+00:00 1969,10.5,dog-food
Cleaned Data (second job)
$ hadoop fs -cat bigpetstore/clean/part-m-00000 | more
BigPetStore storeCode_AK 1 deanna booker Sun Jan 18 20:50:06 GMT+00:00 1970 7.5 cat-food
BigPetStore storeCode_AK 10 erica buck Thu Dec 25 16:29:28 GMT+00:00 1969 10.5 dog-food
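The cleaning step in the demo is a Hadoop job. Purely as an illustration of the transformation visible in the two samples, the Python sketch below turns one generated record into one cleaned record. It assumes the generated line is a comma-separated key and a comma-separated value divided by whitespace, and that the cleaned output is tab-separated; neither detail is stated on the slide.

```python
def clean_record(line):
    """Convert one generated "dirty CSV" line into a tab-separated record."""
    # Split once on the first whitespace: key part, then value part.
    key, value = line.strip().split(None, 1)
    # Both parts are comma-separated; the date keeps its internal spaces.
    fields = key.split(",") + value.split(",")
    return "\t".join(fields)


raw = ("BigPetStore,storeCode_AK,1 deanna,booker,"
       "Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food")
print(clean_record(raw))
```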
EDP BigPetStore Sample Data
Summed Data for Products by State (third job)
$ hadoop fs -cat bigpetstore/analyze_rel/part-r-00000 | more
US-AK cat-food 24837
US-AK dog-food 24994
US-AK fuzzy-collar 25145
US-AK antelope-caller 25024
US-AZ cat-food 25106
US-AZ dog-food 25064
US-AZ leather-collar 24870
US-AZ snake-bite ointment 24960
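The third job's output is a count of purchases per state and product. The sketch below reproduces that aggregation in plain Python over a handful of cleaned records, as an illustration only (the demo itself runs on Hadoop). The tab-separated field layout and the mapping from storeCode_AK to US-AK are assumptions inferred from the samples.

```python
from collections import Counter


def count_by_state_and_product(cleaned_lines):
    """Count records per (state, product) from tab-separated cleaned lines."""
    counts = Counter()
    for line in cleaned_lines:
        fields = line.rstrip("\n").split("\t")
        store_code, product = fields[1], fields[-1]
        # Assumed mapping: "storeCode_AK" becomes "US-AK".
        state = "US-" + store_code.split("_", 1)[1]
        counts[(state, product)] += 1
    return counts


sample = [
    "BigPetStore\tstoreCode_AK\t1\tdeanna\tbooker\tSun Jan 18 20:50:06 GMT+00:00 1970\t7.5\tcat-food",
    "BigPetStore\tstoreCode_AK\t10\terica\tbuck\tThu Dec 25 16:29:28 GMT+00:00 1969\t10.5\tdog-food",
]
for (state, product), n in sorted(count_by_state_and_product(sample).items()):
    print(state, product, n)
```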
What Next for EDP
Potential Areas for Development within EDP
• Pluggable Job Execution Model
  • Allows Sahara to run jobs with additional execution engines
  • Current Oozie offerings become one of multiple options
• Expand Capabilities via Oozie
  • Support upload of user-written Oozie workflows
  • Support for coordinated jobs
• Enhanced Usability
  • Better Error Reporting
  • User Experience (UI, CLI, API)
Please send us your feedback! Ideas are always welcome
• #openstack-sahara on freenode
• [email protected] with [openstack-dev][sahara] subject
Design Summit Sessions
7 Sessions: Thursday 1:30 - Friday 10:30
http://goo.gl/lQXtUS
Q&A
Thank you!