Date post: 24-Mar-2020
VIRT1351BE #VMworld #VIRT1351BE New Architectures for Virtualizing Spark and Big Data Workloads on vSphere Justin Murray Mohan Potheri VMworld 2017 Content: Not for publication or distribution
Transcript
Page 1

New Architectures for Virtualizing Spark and Big Data Workloads on vSphere

Justin Murray, Mohan Potheri

Page 2

Disclaimer

• This presentation may contain product features that are currently under development.

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

• Technical feasibility and market demand will affect final delivery.

• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Page 3

Agenda

1 Introductions

2 Existing and New Approaches in the Big Data World

3 Traditional Deployment Reference Architectures

4 New Architectures – Changing the Paradigm

5 Proof of Concept: Testing in the VMware Solutions Lab

6 Introduction to Machine Learning

7 Conclusions

Page 4

Why the Interest in Big Data?

• Enterprises want to get off existing costly data platforms

• Older data warehouse technology is not serving your needs

• Want to do queries and analytics against many different forms of data (structured, unstructured, streaming)

• Provide data access to our end customers

• Integrate systems that have been islands till now

– Single source of truth for the enterprise

• Exploit new application architectures for developer productivity

• Want to do data science, machine learning, deep learning


Page 5

The Existing Hadoop Architecture

[Diagram: a Client submits a job to the ResourceManager (master scheduler). The NameNode holds the master file system index. Three worker nodes each run a NodeManager and a DataNode holding one HDFS block (Blocks 1 to 3). The job runs on the workers as AppMaster - 1, Container - 2 and Container - 3.]

Page 6

High Level View of Spark


Page 7

The Spark Architecture – Standalone

[Diagram: a Driver submits a Job to three worker nodes. Each worker node runs two Executors, and each Executor runs in its own JVM.]

Page 8

The Spark Architecture (on YARN)

[Diagram: a Job is submitted to the ResourceManager; the NameNode serves HDFS metadata. Three worker nodes each run a NodeManager and a DataNode holding one HDFS block (Blocks 1 to 3). The Spark Driver and two Executors are shown with AppMaster - 1, Container - 2 and Container - 3 on the workers.]

Page 9

Traditional Reference Architectures


Page 10

Two Virtual Machines on a Host Server

[Diagram: one vSphere host server runs two virtual machines, Hadoop Node 1 and Hadoop Node 2. Each virtual machine runs a DataNode and a NodeManager over several Ext4 file systems on VMDKs. The VMDKs reside on local DAS disks/devices allocated to the virtual machine.]

Page 11

Data/Compute Separation (with External Access to HDFS)

[Diagram: a virtualization host runs three Hadoop virtual nodes, each with an OS image VMDK and Ext4-formatted VMDKs for temp data. Hadoop Virtual Node 1 runs the ResourceManager; the other virtual nodes run NodeManagers. HDFS requests go to an external Isilon system, which provides the NameNode (NN) and DataNode functions.]

Page 12

Concerns with HDFS (The Hadoop Distributed File System)

• Difficult to separate compute from data storage concerns

• Three-way block replication for each 256MB data block (or 512MB block)

– At least triples the stored size of the input data, in exchange for safety

• Re-balance of data when you add new data node processes

• Data must be ingested into HDFS from legacy systems (can be time consuming)

• Site-to-site replication not inherent

• NameNode process (which holds the central index of all files) can be sensitive to higher numbers of small files
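
To put rough numbers on the replication and small-file concerns above, here is a minimal Python sketch. The 256 MB block size and three-way replication come from the slide; the helper name `hdfs_footprint`, the 1 TB input size and the 1 MB small-file size are illustrative assumptions only.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4

def hdfs_footprint(file_sizes, block_bytes=256 * MB, replication=3):
    """Hypothetical helper: return (logical blocks, raw bytes stored).

    Every file needs at least one block entry in the NameNode index, and
    every byte is stored `replication` times across the DataNodes.
    """
    blocks = sum(max(1, math.ceil(size / block_bytes)) for size in file_sizes)
    raw_bytes = replication * sum(file_sizes)
    return blocks, raw_bytes

# One 1 TB file versus the same 1 TB stored as 1 MB files (assumed sizes).
big_blocks, big_raw = hdfs_footprint([1 * TB])
small_blocks, small_raw = hdfs_footprint([1 * MB] * (TB // MB))

print(big_blocks, big_raw / TB)      # block count and raw TB for the large file
print(small_blocks, small_raw / TB)  # same raw size, many more index entries
```

Both layouts store three times the input size, but the small-file layout gives the NameNode far more entries to track, which is the sensitivity the last bullet describes.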


Page 13

Developers and Data Scientists

• Work on their code or on their data analysis model

• Don’t need a multi-tenant cluster

• Don’t care about job scheduling for other users

• Want to scale out to see the effect on their work

• Want to use the latest tools and newer versions (Python, R, Scala, ML kits)

• Experiment with different data models, code, algorithms, data sets

• Training the analysis model is separated from testing it – interested in the time taken for each

• May not need the full Hadoop cluster set


Page 14

New Architectures for Big Data


Page 15

Key Trends in Big Data Infrastructure

• Decoupling of Compute and Storage Clusters

– Separate compute virtual machines from storage VMs

– Data is processed and scaled independently of compute

• Dynamic Scaling of compute nodes used for analysis from dozens to hundreds

• SPARK and other newer Big Data platforms can work with regular filesystems

• Newer platforms store and process data in memory

• New platforms can leverage Distributed Filesystems that can use local or shared storage

• Need for High Availability & Fault Tolerance for master components


Page 16

Apache Spark Platform Capabilities

• Open-source cluster computing framework

• In-memory data processing engine

• ETL, analytics, ML and graph processing

• Batch and stream processing

• Rich APIs for Scala, Python, Java, R, and SQL

• Distributed platform for complex multi-stage applications

Reference: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-overview.html

Page 17

HDFS Replacement Needed for the Next-Generation Distributed File System

• What candidates present themselves?

– S3, Ceph, Gluster, etc.

• GlusterFS used in POC:

– Mature solution

– Native GlusterFS filesystem for Linux

– Layers on top of any traditional storage

– Truly distributed and resilient file system

– Supports many common client protocols

Page 18

GlusterFS

• GlusterFS is a scale-out distributed filesystem that can support thousands of clients

• The filesystem can run on DAS or shared storage

• Fault-tolerant distributed file system

• Provides multiprotocol support

– Native

– NFS

– CIFS

– HDFS

– S3

– FTP

https://www.slideshare.net/shubhendutripathi040980/glusterfs-hadoop

Page 19

HDFS vs Ceph vs Gluster IOZONE Performance Comparison

http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf

Page 20

SPARK with GlusterFS POC Architecture on Pure FC SAN

[Diagram: four VMware vSphere hosts run one Spark Master, eight Spark Workers and three Gluster Nodes as virtual machines. The Gluster Nodes form GlusterFS, which is backed by Pure M50 storage on Fibre Channel.]

Page 21

SPARK with GlusterFS POC Architecture on Virtual SAN

[Diagram: four VMware vSphere + VSAN hosts run one Spark Master, eight Spark Workers and three Gluster Nodes as virtual machines. The Gluster Nodes form GlusterFS, which is backed by the clustered VSAN datastore.]

Page 22

TPC-DS on SPARK on GlusterFS


Page 23

TPC-DS with Spark-SQL and Apache SPARK

• IBM has helped integrate the TPC-DS benchmark (v2) into the spark-sql-perf test kit

• The 99 queries were generated using the TPC-DS query generator and are based on the 100-GB scale factor.

• The spark-sql-perf test kit can be used to evaluate and compare the performance of the infrastructure.

• We used a subset of the TPC-DS queries to evaluate our POC and solution

Page 24

Test Setup

• SPARK Nodes:

– 1 Master and 8 Slave (worker) nodes with 16 vCPU and 128 GB of memory each

– 3-node GlusterFS cluster with a 2 TB shared filesystem mounted across all SPARK nodes

• Storage (two use cases):

1. GlusterFS backed by Pure Storage LUNs (16 Gbps FC fabric with Pure M50 array)

2. GlusterFS backed by vSAN (Western Digital NVMe cache, high-capacity flash for persistence)

• TPC-DS Data Set

– 5 TB

• Queries

– Interactive TPC-DS query set (q19, q42, q52, q55, q63, q68, q73 and q98)
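
As a quick sanity check on the sizing above, the following Python sketch totals the worker resources and compares them with the data set size. The node counts, vCPU and memory figures come from the slide; the `NodeGroup` class is a hypothetical helper, and treating 5 TB as 5 × 1024 GB is an assumption about units.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeGroup:
    """Hypothetical description of a group of identically sized VMs."""
    name: str
    count: int
    vcpus_per_node: int
    memory_gb_per_node: int

    @property
    def total_vcpus(self) -> int:
        return self.count * self.vcpus_per_node

    @property
    def total_memory_gb(self) -> int:
        return self.count * self.memory_gb_per_node

# Worker sizing as stated on the slide (the master is left out of the totals).
workers = NodeGroup("spark-worker", count=8, vcpus_per_node=16, memory_gb_per_node=128)

dataset_gb = 5 * 1024  # 5 TB data set, assuming binary units
data_to_memory_ratio = dataset_gb / workers.total_memory_gb

print(workers.total_vcpus, workers.total_memory_gb, data_to_memory_ratio)
```

Under these assumptions the data set is several times larger than the combined worker memory, so the queries cannot be served from memory alone and the storage layer under GlusterFS still matters.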


Page 25

Apache SPARK Web Console


Page 26

SPARK Job Details


Page 27

TPC-DS test results (5TB Data Set)

[Bar chart: Query Time Comparison between FC SAN and vSAN. Series: Pure, VSAN. X axis: queries q19, q42, q52, q55, q63, q68, q73, q98. Y axis: 0 to 3 (units not shown).]

Page 28

TPC-DS (vSAN on Premises versus VMware Cloud on AWS)

[Bar chart: TPC-DS On Premises vs VMware Cloud on AWS. Series: On-Prem, VMware Cloud on AWS. X axis: queries q19, q42, q52, q55, q63, q68, q73, q98. Y axis: 0 to 3.5 (units not shown).]

Page 29

Demo


Page 30

Section-Conclusion

• Modern Big Data platforms like SPARK are mostly memory resident

• GlusterFS provides a high performance distributed filesystem for SPARK and newer big data workloads

• GlusterFS supports a wide range of protocols that make it the ideal storage platform for data lakes

• Layering GlusterFS on top of shared storage or VSAN helps leverage all the vSphere platform features

• Dedicated HW with local storage is no longer required for modern big data applications.

• TPC-DS testing showed similar performance for SPARK-SQL on VSAN and FC.


Page 31

Introduction to Machine Learning


Page 33

What Is Machine Learning?

• Machine learning algorithms try to make predictions based on training data that is given to a mathematical model (e.g. a linear regression algorithm)

• Training finds the model parameters that minimize the difference between the model’s predictions and the already known outcomes (that is, it minimizes the loss or objective function)

[Diagram: Training Data (Big), drawn as samples from history, is used for training and testing the Mathematical Model. A New Sample of Transaction Data is then passed to the model, which outputs a Classification or Prediction.]
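
The idea of minimizing a loss function can be illustrated with a small linear regression fitted by gradient descent. This is a minimal sketch in plain Python rather than anything shown in the session: the data points, learning rate, step count and function names are all assumptions chosen for illustration.

```python
def mean_squared_error(w, b, xs, ys):
    """Loss: average squared gap between predictions w*x + b and known outcomes y."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_linear(xs, ys, learning_rate=0.05, steps=2000):
    """Minimize the loss with plain gradient descent, starting from w = b = 0."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Made-up training data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

loss_before = mean_squared_error(0.0, 0.0, xs, ys)
w, b = fit_linear(xs, ys)
loss_after = mean_squared_error(w, b, xs, ys)

print(round(w, 3), round(b, 3), loss_before, loss_after)
```

Each step moves the weight and bias a little in the direction that reduces the loss, so the fitted parameters approach the slope and intercept that generated the data.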


Page 34

Example: Machine Learning Model for “A Customer Applies for Credit”

• Training data contains many features that have each been given a numeric value (e.g. zip code = 99)

• Several models are used against the training data and the best one is chosen (minimal loss or error)

• One kind of outcome is a binary classification (a good credit application or a bad one)

[Diagram: Training Data (Big) trains the Mathematical Model. A new application for credit is passed to the model, which outputs a Classification or Prediction.]

Page 35

Training Data

Rows are examples x_i; columns are features (feature variables). The columns up to Passed Valid Check are knowns; Model’s Estimate as Valid and Error (Loss) are computed/learned.

Acct Number   Txn ID   Txn Location Code   Age   Home Zip Code   Balance   Annual Salary   Passed Valid Check   Model’s Estimate as Valid   Error (Loss)
1234          45       94312               21    94304           100       80              Y                    N                           1
5678          89       UK                  31    12116           5000      110             N                    Y                           1
9012          150      12126               61    31024           1400      50              Y                    Y                           0
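
The Error (Loss) column in this table behaves like a zero-one loss: it is 1 when the model's estimate disagrees with the known outcome and 0 when they agree. A minimal Python sketch using the three rows from the slide follows; the field names and the `zero_one_loss` helper are illustrative assumptions.

```python
# The three example rows from the slide, reduced to the columns that matter here.
rows = [
    {"acct": 1234, "passed_valid_check": "Y", "model_estimate": "N"},
    {"acct": 5678, "passed_valid_check": "N", "model_estimate": "Y"},
    {"acct": 9012, "passed_valid_check": "Y", "model_estimate": "Y"},
]

def zero_one_loss(known, predicted):
    """Return 0 when the prediction matches the known outcome, otherwise 1."""
    return 0 if known == predicted else 1

losses = [zero_one_loss(r["passed_valid_check"], r["model_estimate"]) for r in rows]
error_rate = sum(losses) / len(losses)

print(losses, error_rate)
```

The per-row losses reproduce the Error (Loss) column, and their average is the quantity a training procedure would try to drive down.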


Page 36

Test Data Should Always Be Separated from Training Data

GOLDEN RULE: Don’t TEST on your TRAINING DATA

[Table: the slide repeats the example table from page 35 (accounts 1234, 5678 and 9012, with the same ten columns) and marks part of it as Training Data and part as Test Data.]
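
One common way to follow this rule is a holdout split: shuffle the examples once, set a fraction aside as test data, and never train on that fraction. The sketch below uses only the Python standard library; the function name, the 25% test fraction and the fixed seed are assumptions for illustration, not values from the session.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in examples; in practice each item would be one row of features.
examples = list(range(100))
train, test = train_test_split(examples)

print(len(train), len(test))
print(set(train).isdisjoint(test))  # no example appears on both sides
```

Because the two lists are disjoint, an error measured on `test` reflects how the model handles examples it has not seen during training.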


Page 37

Example: A Linear Classifier

f(x_i, W, b) = W x_i + b

x: example data

W: weights

b: bias

Source: Stanford University class cs231n
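
The score function f(x_i, W, b) = W x_i + b can be computed directly with nested lists. This is a minimal sketch in plain Python: the two-class weight matrix, the bias vector and the example input are made-up numbers, and `linear_scores` is a hypothetical helper name.

```python
def linear_scores(W, x, b):
    """Compute W x + b: one score per class (per row of W)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]

# Made-up parameters: two classes, two features.
W = [[0.5, -1.0],
     [1.5, 1.0]]
b = [0.25, -0.5]
x = [1.0, 2.0]  # one example x_i

scores = linear_scores(W, x, b)
predicted_class = max(range(len(scores)), key=scores.__getitem__)

print(scores, predicted_class)
```

The predicted class is simply the row of W with the highest score; training adjusts W and b so that the correct class scores highest across the training examples.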


Page 38

Deployment Platform for Machine Learning

• Spark is the runtime platform for the models and for ingestion of the training data

• Different machine learning algorithms are available from the MLlib library that comes with Spark

• The application and data are distributed out to many nodes (virtual machines)

[Diagram: the credit-application flow (Training Data (Big), Mathematical Model, a new application for credit, Classification or Prediction) shown running on Spark.]

Page 39

Introducing vSphere Scale-Out for Big Data and HPC Workloads

New package that provides all the core features required for scale-out workloads at an attractive price point

• Features: Hypervisor, vMotion, vShield Endpoint, Storage vMotion, Storage APIs, Distributed Switch, I/O Controls & SR-IOV, Host Profiles / Auto Deploy and more

• Packaging: Sold in packs of 8 CPUs at a cost-effective price point

• Licensing: EULA enforced for use with Big Data/HPC workloads only

Page 40

Conclusions

• New architectures for big data are emerging beyond the existing documented ones

• Spark changes the profile of I/O and persistence for the newer applications

• This lends itself well to virtualization and separation of compute from data

• Traditional values in vSphere can be used in a big data context

• We would like to explore how these new architectural ideas will fit in your environment



Page 43

BACKUP SLIDES – NOT FOR PRESENTATION


Page 44

Placeholder : Key Requirements for Big Data Architecture

• Performance

• Scaling

– to dozens or hundreds of nodes (VMs)

• Robustness – distributed file system, no one process is a single point of failure

• High Availability

• Fault Tolerance

• Capable of handling new workloads with new compute demands


Page 45

Placeholder : Key Requirements for Big Data Architecture

• Can we use a distributed file system that is not HDFS?

• Use a lighter weight framework than full Hadoop – e.g. Spark?

• Can we keep as much data in memory as possible and avoid I/O? Avoid spills

• Are shared file systems like VSAN useful?

• How to achieve the performance requirements without losing functionality?


Page 46

vSAN Optimization


Page 47

Hardware Configuration

All-Flash vSAN

• (4) Node Dell™ R730XD

– (2) E5-2699V4 – 22-core 2.2GHz

– 1TB Memory

– (4) 10 Gb/s Ethernet connections

– PERC H730mini

– SDCard System Drive

– vSphere 6.5 Update 1

• VSAN disk configuration

– (2) Disk groups per node

  • (1) 1.6TB* Ultrastar SN100 cache drive

  • (2) 3.84TB Optimus MAX capacity drives

* 1TB=1,000GB, 1GB=1,000,000,000 bytes. Actual usable capacity less.
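
The cluster-wide totals implied by this configuration can be worked out with a few multiplications. The Python sketch below assumes that the cache and capacity drive counts are per disk group and that the CPU and memory figures are per node; the slide layout suggests this but does not state it. Sizes use the decimal units from the footnote, and the figures are raw totals, not usable capacity.

```python
NODES = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 22
MEMORY_TB_PER_NODE = 1

DISK_GROUPS_PER_NODE = 2
CACHE_DRIVES_PER_GROUP = 1      # assumed to be per disk group
CAPACITY_DRIVES_PER_GROUP = 2   # assumed to be per disk group
CACHE_DRIVE_GB = 1600           # 1.6 TB, decimal units per the footnote
CAPACITY_DRIVE_GB = 3840        # 3.84 TB, decimal units per the footnote

def cluster_totals():
    """Return raw cluster-wide totals; usable vSAN capacity will be lower."""
    disk_groups = NODES * DISK_GROUPS_PER_NODE
    return {
        "cores": NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET,
        "memory_tb": NODES * MEMORY_TB_PER_NODE,
        "disk_groups": disk_groups,
        "cache_gb": disk_groups * CACHE_DRIVES_PER_GROUP * CACHE_DRIVE_GB,
        "raw_capacity_gb": disk_groups * CAPACITY_DRIVES_PER_GROUP * CAPACITY_DRIVE_GB,
    }

totals = cluster_totals()
print(totals)
```

Usable capacity depends on the vSAN storage policy in effect (for example, how many copies of each object are kept), which the slide does not specify.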


Page 48

vSAN Disk Group Configuration


Page 49

vSAN - Network

Dual vSAN VMKernel Adapters

• These are not necessarily for redundancy (like an “Air-Gap” network with redundant physical interfaces routed to multiple VMKs) but for performance, pulling from two physical interfaces at once.

[Diagram: a virtual switch with two port groups.]

Page 50

vSAN VMK Configuration


Page 51

vSAN Port Group Uplink Maps

• vDS Contained 4 Uplinks

– 2 dedicated to normal operation

– 2 dedicated to vSAN communication

• vDS-Comp01-Private

– Active Uplink: dvUplink3

– Standby Uplink: dvUplink4

• vDS-Comp01-Private2

– Active Uplink: dvUplink4

– Standby Uplink: dvUplink3
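
The uplink map above is a mirrored active/standby arrangement: each port group prefers a different physical uplink and falls back to the other. The Python sketch below records the mapping as plain data and checks that property. The port group and uplink names come from the slide; the dictionary layout and the `is_mirrored` helper are illustrative assumptions and not a vSphere API.

```python
# Uplink teaming as listed on the slide, recorded as plain data.
port_groups = {
    "vDS-Comp01-Private": {"active": "dvUplink3", "standby": "dvUplink4"},
    "vDS-Comp01-Private2": {"active": "dvUplink4", "standby": "dvUplink3"},
}

def is_mirrored(first, second):
    """True when the two port groups swap active and standby uplinks."""
    return (
        first["active"] != second["active"]
        and first["active"] == second["standby"]
        and first["standby"] == second["active"]
    )

mirrored = is_mirrored(
    port_groups["vDS-Comp01-Private"],
    port_groups["vDS-Comp01-Private2"],
)
active_uplinks = sorted(pg["active"] for pg in port_groups.values())

print(mirrored, active_uplinks)
```

With this layout, traffic from the two port groups normally leaves on different physical uplinks, and either port group can still fail over to the remaining uplink.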


Page 52

HCIBench – Results – Network

[Chart: 100% Read IOPs and Latency, vSAN 6.6.1. X axis: Block Size (4K, 8K, 32K, 64K). Y axes: IOPs (0 to 700000) and latency in ms (0 to 4). Series: Baseline, Multiple vSAN VMK, 1500 MTU, 10Gb Ethernet, 10Gb Eth Multiple vSAN VMK, each with a matching latency (Lat) series.]

Page 53

What Have We Seen so Far?

• We can use a file system other than HDFS for big data

• With the right storage, we can use the vMotion/DRS/HA/FT features of vSphere

• VSAN can provide the storage underpinning big data (particularly for newer workloads)

• A number of different workloads were exercised on this new architecture

– Analytical queries, batch jobs and machine learning

• Testing is still in progress on all the above – more to come