Date post: | 26-Jan-2015 |
clearstorydata.com
Using Spark and Shark for
Fast Cycle Analysis on Diverse Data
12.2.13
Vaibhav Nivargi
About ClearStory Data
Analysis in the New Data Landscape
New use cases seen in all industries.
• Live situational analysis requiring fast-cycle
analysis across internal data and sources of
external data
• Multi-source analysis with insights that refresh
as data from the sources evolves
• Large-scale analysis of structured and
unstructured data combined in integrated
insights
Example: Interactive Multi-source Analysis
More data and more people change the analysis.
Data Intelligence: interactive analysis on diverse
internal & external data
• Facebook: Shares, Likes, Comments
• News Coverage: Online, Print, Television
• Twitter: Followers, Tweets, Retweets
• Donations: New Members, Donations
• Website Traffic: Traffic, Referrals, Content
• Corporate Sponsors: Corporate Engagement, New Inquiries
Today’s Need is Speed, Scale & Ad Hoc Flexibility
With more sources, more data and more people.
Why Spark and Shark?
• RDDs
  – Low latency & scale
  – Iterative and interactive computation
• Lineage and fault tolerance
  – Able to re-derive data
• Expressive power of Scala and SQL
  – Operations beyond aggregations, joins, and statistical operators
  – Advanced: ML, data mining, segmentation, approximate queries, graphs …
• Support for structured and semi-structured data
• BDAS Stack & AMPLab
  – Tachyon, MLBase, BlinkDB, GraphX …
• Community and adoption
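The "able to re-derive data" point can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark's API: the class `ToyDataset` and its methods are hypothetical names chosen for illustration. Each derived dataset keeps a reference to its parent and the transformation that produced it, so a lost or evicted result can be recomputed from that lineage instead of being restored from a replica.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyDataset:
    """Hypothetical stand-in for an RDD: it remembers how it was derived."""
    source: Optional[List[int]] = None      # set only on root datasets
    parent: Optional["ToyDataset"] = None   # the dataset this one came from
    transform: Optional[Callable[[List[int]], List[int]]] = None
    _cache: Optional[List[int]] = field(default=None, repr=False)

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Compute on demand by walking the lineage back toward the root.
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = self.transform(self.parent.collect())
        return self._cache

    def evict(self):
        # Simulate losing this dataset's in-memory result.
        self._cache = None


base = ToyDataset(source=[1, 2, 3, 4])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 10)
first = derived.collect()
derived.evict()               # the cached result is gone
second = derived.collect()    # re-derived from the recorded lineage
```

In this sketch `first` and `second` hold the same rows; the only thing that survived the eviction was the recipe, which is the idea behind lineage-based fault tolerance.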
The ClearStory Solution
[Architecture diagram, three columns: Data Sources, ClearStory Platform, ClearStory Application]
• Data Inference & Profiling
• Harmonization
• In-Memory Data Units
• Visualization
• Collaboration
Where do Spark & Shark fit?
[Architecture diagram; component labels]
• User Application
• ClearStory API
• Harmonization Engine and Blended Data Processing
• Spark Cluster + ClearStory IP
• Data Access, Inference and Lineage
• Data Source API
• Data sources: Public, Premium, Web, RDBMS, Hadoop, Files
How we leverage Spark & Shark
• User intent captured and translated to custom API
• Harmonization-as-a-Service
• Manages Spark and Shark query execution
• Read cached data from HDFS
• RESTful
• Merges datasets (RDDs) on the fly – on user request
• Support conversion of user actions to backend queries
• Query optimizations
• Performance optimizations
• Mixed-mode execution (sql2rdd & Spark native)
• Caching
• Pre-computation
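Merging datasets on the fly can be pictured as a keyed join. The following is a minimal sketch in plain Python rather than ClearStory's or Spark's actual code: `join_on_key` is a hypothetical helper, and the sample rows are made-up values that borrow the donations and Twitter sources from the earlier example. The output shape, `(key, (left_value, right_value))`, mirrors what a join on Spark pair RDDs returns.

```python
from collections import defaultdict


def join_on_key(left, right):
    """Inner-join two lists of (key, value) pairs on their keys."""
    # Index the right side so each left row finds its matches directly.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in index.get(key, [])]


# Made-up illustrative rows keyed by date.
donations = [("2013-12-01", 500), ("2013-12-02", 750)]
retweets = [("2013-12-02", 42), ("2013-12-03", 7)]

merged = join_on_key(donations, retweets)
```

Only the date present in both inputs survives the inner join, so `merged` contains a single row pairing that day's donation total with its retweet count.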
How we leverage Spark & Shark
• Query results returned to the application for
scalable visualization and ClearStory-specific viz
techniques
• RDDs cached/un-cached and materialized at
strategic points based on usage patterns and
signals
• Data updates automatically processed as source
data changes
• ClearStory’s own deployment, packaging, and
integrated monitoring for operations at scale
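Caching "based on usage patterns and signals" could take many forms, and the slides do not say which one is used. As one hedged illustration in plain Python, the sketch below keeps a result in memory only after it has been requested a threshold number of times. `UsageBasedCache`, `threshold`, and `compute` are hypothetical names, and access counting is an assumed signal, not the policy described in the talk.

```python
from collections import Counter


class UsageBasedCache:
    """Hypothetical policy: cache a result once it has been requested often enough."""

    def __init__(self, compute, threshold=2):
        self.compute = compute        # function that derives a dataset by name
        self.threshold = threshold    # requests needed before caching (assumed signal)
        self.hits = Counter()
        self.store = {}

    def get(self, name):
        self.hits[name] += 1
        if name in self.store:
            return self.store[name]
        result = self.compute(name)
        # Materialize only datasets that have proven popular.
        if self.hits[name] >= self.threshold:
            self.store[name] = result
        return result

    def uncache(self, name):
        # Drop the stored result and reset the usage signal.
        self.store.pop(name, None)
        self.hits[name] = 0


calls = []


def compute(name):
    calls.append(name)    # record each real computation
    return name.upper()


cache = UsageBasedCache(compute, threshold=2)
cache.get("sales")    # computed, not yet cached
cache.get("sales")    # computed again, now cached
cache.get("sales")    # served from the cache
```

After these three requests the underlying computation has run twice; the third request is served from memory.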
Spark Developments – What We Like
• Query cancellation, progress indication (0.8.1 and
beyond)
• More performance breakthroughs
• Workload Management
• BlinkDB
• MLBase
• Tachyon
• GraphX
We’re Hiring!
• Working with the community, giving back
• Lots of exciting new developments
• This is like the early days of Hadoop – massive
momentum gathering
The First Spark Summit!
More Meet-ups!