Date posted: 05-Dec-2014
Category: Technology
Uploaded by: daniel-krook
© 2014 IBM Corporation
Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift
Gil Vernik, IBM Research - Haifa
Topics Covered in This Talk
§ OpenStack Swift
§ Apache Spark
§ Basic integration between Spark and Swift
§ Advanced integration between Spark and Swift using the Storlets technology
Digital Universe
More than 1.8 zettabytes (1.8 trillion gigabytes)
Growing rapidly
80% owned by enterprises
75% generated by individuals
Source: IDC iView, "Extracting Value from Chaos"
Map-Reduce, databases, etc.
Data needs to be replicated, at a cost in time, money, etc.
Can we do it better?
OpenStack Swift
§ A massively scalable object store
§ Known to work with thousands of servers and to store petabytes of data
§ Exposes a REST API
§ Features:
  – Storage policies
  – Erasure codes
  – Data replication
  – ...
[Diagram: a PUT request, proxy nodes, storage nodes]
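Because Swift exposes a REST API, storing an object comes down to an HTTP PUT against a URL of the form `/v1/{account}/{container}/{object}`. The sketch below only builds such a request with Python's standard library and does not send it. The endpoint, account, container, object name, and token are hypothetical placeholders.

```python
from urllib.parse import quote
from urllib.request import Request


def swift_object_url(endpoint, account, container, obj):
    """Build the URL of an object: <endpoint>/v1/<account>/<container>/<object>."""
    # Slashes are kept in the object name so that pseudo-folders survive.
    parts = [quote(account, safe=""), quote(container, safe=""), quote(obj)]
    return endpoint.rstrip("/") + "/v1/" + "/".join(parts)


def build_put_request(endpoint, account, container, obj, data, token):
    """Prepare (but do not send) a PUT request that would upload `data`."""
    url = swift_object_url(endpoint, account, container, obj)
    # X-Auth-Token carries the token obtained from the auth service.
    return Request(url, data=data, method="PUT", headers={"X-Auth-Token": token})


# Hypothetical values, for illustration only.
req = build_put_request(
    "https://swift.example.com", "AUTH_demo", "photos", "2014/img001.jpg",
    data=b"...image bytes...", token="hypothetical-token",
)
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the upload against a real cluster.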
Apache Spark
§ Apache Spark™ is a fast and general engine for large-scale data processing
  – Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
§ Combines SQL, streaming, and complex analytics
§ Can read existing Hadoop data
§ Most active project in Apache today
Swift enablement for data retrieval in Spark
§ Apache Spark implements the Hadoop interfaces and can use HDFS or Amazon S3 as a data source.
§ IBM Research enabled Spark to access data stored in OpenStack Swift.
[Diagram: Swift, Network]
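In open-source Spark, access to Swift goes through the Hadoop OpenStack connector: objects are addressed as `swift://<container>.<provider>/<path>`, and credentials are supplied as `fs.swift.service.<provider>.*` Hadoop properties. The sketch below assembles those settings in plain Python. The provider name, Keystone URL, and credentials are hypothetical, and it is an assumption that the integration described in this talk is configured the same way.

```python
def swift_uri(container, provider, path):
    """Address an object the way the Hadoop Swift connector expects."""
    return "swift://{}.{}/{}".format(container, provider, path.lstrip("/"))


def swift_hadoop_conf(provider, auth_url, tenant, username, password):
    """Return the Hadoop properties for one Swift provider as a dict."""
    prefix = "fs.swift.service.{}.".format(provider)
    return {
        "fs.swift.impl": "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
        prefix + "auth.url": auth_url,
        prefix + "tenant": tenant,
        prefix + "username": username,
        prefix + "password": password,
        prefix + "public": "true",
    }


# Hypothetical provider and credentials, for illustration only.
conf = swift_hadoop_conf(
    "demo", "https://keystone.example.com:5000/v2.0/tokens",
    "demo-tenant", "demo-user", "secret",
)
uri = swift_uri("logs", "demo", "/2014/12/app.log")
print(uri)
```

With PySpark available, each property could be passed to the job as a `spark.hadoop.<key>` setting and the data read with `sc.textFile(uri)`.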
What do we analyze?
[Diagram: Swift, Network]
Stored Data    Input to Analytics
Images         EXIF metadata
PDF            Hidden metadata
Logs           Only 'ERROR' records
...            ...
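The table suggests that the analytics job often needs only a small part of what is stored. A minimal sketch of the "Only 'ERROR' records" row, written as a plain Python generator: the log format and the sample lines are made up, and the sketch says nothing about where the filter runs.

```python
import io


def error_records(lines, marker="ERROR"):
    """Yield only the lines that contain the marker, without trailing newlines."""
    for line in lines:
        if marker in line:
            yield line.rstrip("\n")


# Made-up sample log, standing in for a stored log object.
sample_log = io.StringIO(
    "2014-12-05 10:00:01 INFO service started\n"
    "2014-12-05 10:00:07 ERROR disk /dev/sdb not responding\n"
    "2014-12-05 10:00:09 INFO retrying\n"
    "2014-12-05 10:00:12 ERROR replication failed\n"
)
errors = list(error_records(sample_log))
print(len(errors), "of 4 lines are needed by the analytics job")
```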
Yes! We can do it better.
Storlets: Flexibly Extending Swift
Advanced data processing inside Swift
§ Storlets are a way to 'extend' the computational capabilities of the cloud
§ A storlet is compiled code that is deployed to Swift and, when triggered, is executed by the Storlet Engine directly on the storage nodes.
§ The Storlet Engine is responsible for executing every storlet in a secure environment
§ A storlet is standard Java code
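From the client's side, triggering a storlet can look like an ordinary object request with one extra header. The sketch below describes such a request without sending it. It assumes the `X-Run-Storlet` header used by the open-source OpenStack Storlets project, which may differ from the engine shown in this talk; the object URL, storlet name, and token are hypothetical.

```python
def storlet_get_request(object_url, storlet_name, token):
    """Describe a GET that asks the object store to run a storlet on the object.

    Returns (method, url, headers) without sending anything.
    """
    headers = {
        "X-Auth-Token": token,
        # Assumed trigger header; names the deployed storlet to execute.
        "X-Run-Storlet": storlet_name,
    }
    return "GET", object_url, headers


# Hypothetical object, storlet name, and token, for illustration only.
method, url, headers = storlet_get_request(
    "https://swift.example.com/v1/AUTH_demo/photos/img001.jpg",
    "exif-extractor-1.0.jar",
    "hypothetical-token",
)
print(method, url, headers["X-Run-Storlet"])
```

The response body would then be the storlet's output rather than the stored object itself.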
Storlets extend an object store by moving computation to the data (filtering, transforming, analyzing) instead of bringing the data to the computation.
Swift Storlets: How do they benefit Spark?
[Diagram: Swift objects plus a storlet filter, then the network, then data processing]
Storlets Enable Extending the Functionality of Spark
Example: analyzing EXIF metadata from photos
§ An object store is a natural repository for photos
§ Photos contain rich capture metadata
§ Analyzing this metadata for a set of photos can show how the camera is used
Example: Analyzing EXIF metadata
Storlets can extract the metadata and return it as JSON, rather than having Spark process the binary data directly.
[Diagram: 10MB photo in, 1KB of JSON out]
Example: Analyzing EXIF metadata.
• Spark accesses images via the storlet
• No change to Spark; only the URI changes
• The JSON file returned by the storlet defines the schema
• SQL from Spark processes the metadata
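The last two points can be illustrated without a cluster: small JSON records come back, their keys give the schema, and SQL runs over the result. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL. The field names and sample records are invented and are not the storlet's real output.

```python
import json
import sqlite3

# Invented JSON records, standing in for what a storlet might return per photo.
storlet_output = """
{"name": "img001.jpg", "camera": "Canon EOS 70D", "iso": 100, "focal_length_mm": 35.0}
{"name": "img002.jpg", "camera": "Canon EOS 70D", "iso": 800, "focal_length_mm": 50.0}
{"name": "img003.jpg", "camera": "Nikon D5300", "iso": 200, "focal_length_mm": 35.0}
"""

records = [json.loads(line) for line in storlet_output.splitlines() if line.strip()]

# The keys of the first record define the table schema.
columns = list(records[0].keys())
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE exif ({})".format(", ".join(columns)))
db.executemany(
    "INSERT INTO exif VALUES ({})".format(", ".join("?" for _ in columns)),
    [tuple(r[c] for c in columns) for r in records],
)

# How is the camera used? Count photos per focal length, most common first.
usage = db.execute(
    "SELECT focal_length_mm, COUNT(*) FROM exif "
    "GROUP BY focal_length_mm ORDER BY COUNT(*) DESC"
).fetchall()
print(usage)
```

In Spark, the same kind of query would run over a table whose schema is inferred from the JSON.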
Summary
§ OpenStack Swift is the most popular open source object store
§ Apache Spark is the next big thing in data analytics
§ Spark and Swift can be integrated
§ Storlets in Swift provide clear benefits for analytics use cases.
Thank you!
More information
Gil Vernik, IBM Research - Haifa [email protected]