How to deploy Apache Spark in a multi-tenant, on-premises environment


Adoption of Apache Spark is accelerating

• Spark adoption is growing rapidly – the number of contributors and end users is increasing at a substantial rate

• Spark is expanding beyond Hadoop – Spark is an integral component of new big data platforms, with support for pipelines, streaming, statistical analysis, SQL, and more

• A variety of use cases are being implemented – including recommendation systems, data warehousing, log processing, and more

• The programming paradigm is expanding – supported languages include Java, Scala, Python, SQL, R, and more

Source: Spark Survey Report, 2015 (Databricks)

Top roles using Spark in the enterprise

• Data engineers: 41%

• Data scientists: 22.2%

• Architects: 17.2%

• Management: 10.6%

• Academia: 6.2%

• Other: 2.4%

Source: Spark Survey Report, 2015 (Databricks)

Spark infrastructure patterns

• Individual developers or data scientists who build their own infrastructure from VMs or bare-metal machines

• A bottom-up approach where everyone gets the same infrastructure/platform, irrespective of skill or use case

Developers / data scientists and Spark

• Mostly self-starters who identify a use case

• They build their own systems on laptops, VMs, or servers

• The complexity soon overwhelms them and restricts adoption

• They need help to scale deployment beyond the initial use case

Rigid on-premises infrastructure

• Infrastructure is often built by IT for generic use cases

• Flexibility to cater to different usage scenarios is lost

• Spark users' needs are always changing

• Upgrades become a challenge

Common Deployment Patterns

Most common Spark deployment environments (cluster managers):

• Standalone mode: 48%

• YARN: 40%

• Mesos: 11%

Source: Spark Survey Report, 2015 (Databricks)
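The cluster manager choice shows up directly in how a job is pointed at the cluster. The PySpark sketch below illustrates this; the host names are hypothetical, and only the master URL changes between standalone mode, YARN, and Mesos.

from pyspark.sql import SparkSession

# Standalone mode: point the session at the standalone master's URL
# (spark-master.example.com is a hypothetical host).
spark = (SparkSession.builder
         .appName("cluster-manager-demo")
         .master("spark://spark-master.example.com:7077")
         .getOrCreate())

# Only the master URL changes for the other cluster managers:
#   .master("yarn")                                   # YARN (reads HADOOP_CONF_DIR)
#   .master("mesos://mesos-master.example.com:5050")  # Mesos (hypothetical host)

print(spark.range(1000).count())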

Scalable, self-service infrastructure

• IT controls machines, network, storage, and security

• Users create their own tenants and Spark clusters

• Teams can upgrade and scale their clusters independently

Big Data New Realities

Traditional assumptions vs. new realities, and the benefits they unlock:

• Bare metal → containers and VMs → Big-Data-as-a-Service

• Data locality → compute and storage separation → agility and cost savings

• HDFS on local disks → in-place access on remote data stores (e.g., NFS, object storage) → faster time-to-insights
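As a concrete example of in-place access, the PySpark sketch below reads directly from an S3-compatible object store over the s3a connector rather than first copying data into local HDFS. It assumes the hadoop-aws/s3a libraries are on the classpath; the endpoint, bucket, and column names are hypothetical.

from pyspark.sql import SparkSession

# Hypothetical S3-compatible object store endpoint; in practice credentials
# come from the environment or a credentials provider, not hard-coded values.
spark = (SparkSession.builder
         .appName("remote-store-demo")
         .config("spark.hadoop.fs.s3a.endpoint", "http://objectstore.example.com:9000")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

# Read the data in place over the s3a connector instead of copying it into
# a local HDFS cluster first.
events = spark.read.parquet("s3a://analytics/events/")
events.groupBy("event_type").count().show()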

BlueData EPIC Software Platform

IOBoost™ - Extreme performance and scalability

ElasticPlane™ - Self-service, multi-tenant clusters

DataTap™ - In-place access to enterprise data stores

[Diagram: BlueData EPIC 2.0 platform hosting separate tenants (Marketing, R&D, Sales, Manufacturing, Support), accessed by BI/analytics tools, with in-place access to NFS, Gluster, object stores, remote HDFS, and Ceph]

Deployment flexibility for Spark

• Physical machines or VMs as hosts

• Docker containers as nodes (see the sketch after this list)

• Networking and security enabled

• Standalone or YARN-based deployment
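As a rough illustration of the containers-as-nodes idea, the commands below stand up a two-node standalone Spark cluster with Docker. This is a minimal sketch assuming the apache/spark image from Docker Hub; a platform like EPIC automates this provisioning along with the networking and security.

docker network create spark-net

# Master node: a container running the standalone Master (web UI on port 8080).
docker run -d --name spark-master --hostname spark-master --network spark-net \
  -p 8080:8080 apache/spark \
  /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master

# Worker node: a second container that registers with the master over the
# container network; add more workers the same way to scale out.
docker run -d --name spark-worker --network spark-net apache/spark \
  /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
  spark://spark-master:7077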

Support for all types of Spark users

• Integrated web-based notebook support for data analysts

• Command-line support for data engineers and data scientists

• API support for building custom pipelines (see the sketch after this list)

• Multiple language support, including SQL, R, and streaming

• JDBC support for business intelligence tools
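As one example of pipeline-building through the API, the PySpark sketch below chains feature extraction and a classifier with spark.ml's Pipeline; the training data and column names are hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Hypothetical training data: free text plus a binary label.
training = spark.createDataFrame(
    [("spark makes analytics fast", 1.0), ("the batch job failed again", 0.0)],
    ["text", "label"])

# Chain tokenization, feature hashing, and logistic regression into one pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

model = pipeline.fit(training)          # fits all stages in order
model.transform(training).select("text", "prediction").show()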

Simple and easy Spark cluster creation

Instant Spark analysis and visualization

• Web-based notebook with an integrated Spark cluster

• Support for multiple languages and Zeppelin interpreters

• Fully provisioned Hadoop Distributed File System (HDFS)

• Support for persistent tables (see the example after this list)

• Iterative analysis and visualization
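A typical notebook flow against such a cluster might look like the PySpark sketch below; the input path and table name are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-demo").getOrCreate()

# Load raw data (hypothetical path) and save it as a persistent table so that
# later notebook sessions can query it without re-reading the source files.
clicks = spark.read.json("/data/clickstream/")
clicks.write.mode("overwrite").saveAsTable("clicks")

# Iterate: refine and re-run queries against the persistent table, then
# visualize the result in the notebook.
spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM clicks
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""").show()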

App Store for Spark and Big Data tools

One-click Big Data app deployment

www.bluedata.com