
Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Date post: 14-Apr-2017
Page 1: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Leveraging C* for real-time multi-dc public cloud analytics

Julien Anguenot VP Software Engineering

@anguenot

Page 2: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

1 iland cloud story & use case

2 data & domain constraints

3 deployment, hardware, configuration and architecture overview

4 lessons learned

5 future platform extensions

Page 3: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

iland cloud story & use case

Page 4: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Who are we?

• public, private, DRaaS and BaaS cloud provider
• Cisco CMSP
• VMware VSPP for 7+ years
• 20+ years in business
• HQ in Houston, TX
• http://www.iland.com

Page 5: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Yet another cloud provider? Well, …

• performance and stability
• custom SLA
• compliance
• security
• DRaaS
• global datacenter footprint: US, UK and Singapore
• dedicated support staff!
• iland cloud platform, Web management console and API

Page 6: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

The iland cloud platform

Page 7: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

iland cloud platform essentially

• data warehouse running across multiple datacenters
• monitoring (resource consumption / performance)
• billing (customers and internal use)
• alerting
• predictive analytics
• cloud management
• cloud services (backups, DR, etc.)
• desktop and mobile management consoles
• API
• Cassandra powered!

Page 8: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

The iland cloud Web management console

Page 18: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

So, why did we do all this?

• Initial motivations (v1)
  • vendor software (VMware vCloud Director) lacking:
    • performance analytics (real-time and historical)
    • billing
    • alerts
    • cross-datacenter visibility
  • more private-cloud-style transparency
  • abstract ourselves from vendors and integrate an umbrella of heterogeneous services
  • modern UX and good-looking UI

Page 19: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

data and domain constraints

Page 20: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Constraints

• write latency
• high throughput
• precision (used for billing)
• availability
• multi-datacenter
• scalability: tens of thousands of VMs
• agent-less
• pull/poll vs push
• high-latency environments (multi-DC)

Page 21: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Pipeline

• collection of real-time data
• store
• aggregation
• correlation
• rollups (historical)
• processing
  • alerting
  • billing
• reporting
• querying

Page 22: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Real-time collected perf counters

• 20-second samples
• compute, storage, network
• 15+ perf counters collected
  • ~50 data points per minute and per VM
• time series
  • (timestamp, value)
• metadata
  • unit
  • interval
  • etc.
• 1 year TTL

Page 23: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

VM CPU 20 seconds perf counters

Group  Name       Type
CPU    USAGE      AVERAGE
CPU    USAGE_MHZ  AVERAGE
CPU    READY      SUMMATION

Page 24: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

VM memory 20 seconds perf counters

Group  Name         Type
MEM    ACTIVE       AVERAGE
MEM    CONSUMED     AVERAGE
MEM    VM_MEM_CTRL  SUMMATION

Page 25: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

VM network 20 seconds perf counters

Group  Name         Type
NET    RECEIVED     AVERAGE
NET    TRANSMITTED  AVERAGE
NET    USAGE        AVERAGE

Page 26: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

VM disk 20 seconds perf counters

Group  Name                   Type
DISK   READ                   AVERAGE
DISK   WRITE                  AVERAGE
DISK   MAX_TOTAL_LATENCY      LATEST
DISK   USAGE                  AVERAGE
DISK   PROVISIONED            LATEST
DISK   USED                   LATEST
DISK   NUMBER_WRITE_AVERAGED  AVERAGE
DISK   NUMBER_READ_AVERAGED   AVERAGE
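
As a rough cross-check of the "~50 data points per minute and per VM" figure: the four counter tables above list 17 counters, and a 20-second interval gives 3 samples per minute. The sketch below only does that arithmetic. The 10,000-VM fleet size is a hypothetical value picked for illustration (the deck only says "tens of thousands"), and the constant and function names are not from the original.

```python
SAMPLE_INTERVAL_S = 20

# Counters listed in the CPU, memory, network and disk tables.
COUNTERS_PER_VM = 3 + 3 + 3 + 8


def points_per_minute(counters: int = COUNTERS_PER_VM,
                      interval_s: int = SAMPLE_INTERVAL_S) -> int:
    """Data points written per VM per minute."""
    return counters * (60 // interval_s)


def writes_per_second(vm_count: int) -> float:
    """Sustained raw-sample write rate for a fleet of vm_count VMs."""
    return vm_count * points_per_minute() / 60


# One-year TTL on raw samples, expressed in seconds as Cassandra expects.
ONE_YEAR_TTL_S = 365 * 24 * 3600

if __name__ == "__main__":
    print(points_per_minute())        # data points per VM per minute
    print(writes_per_second(10_000))  # hypothetical 10,000-VM fleet
    print(ONE_YEAR_TTL_S)
```

Under these assumptions a 10,000-VM fleet produces a few thousand raw writes per second before any aggregation or rollup, which is why write latency and throughput head the constraints list.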

Page 27: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

More counters collected for 3rd party services

Page 28: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

VM to time series bindings

• binding on VM UUID
• series UUID
  • <VM_UUID>:disk:numberReadAveraged:average
• simple, fast and easy to construct at application level
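
A minimal sketch of how such a series identifier could be built at application level. Only the resulting shape `<VM_UUID>:disk:numberReadAveraged:average` comes from the slide; the function names and the upper-snake-case to camelCase conversion are assumptions made for illustration.

```python
import uuid


def to_camel(name: str) -> str:
    """NUMBER_READ_AVERAGED -> numberReadAveraged (assumed naming rule)."""
    head, *tail = name.lower().split("_")
    return head + "".join(part.capitalize() for part in tail)


def series_key(vm_uuid: uuid.UUID, group: str, name: str, rollup_type: str) -> str:
    """Build '<VM_UUID>:<group>:<counterName>:<type>' from counter metadata."""
    return ":".join([str(vm_uuid), group.lower(), to_camel(name), rollup_type.lower()])


if __name__ == "__main__":
    vm = uuid.uuid4()  # hypothetical VM identifier
    print(series_key(vm, "DISK", "NUMBER_READ_AVERAGED", "AVERAGE"))
```

Because the key is derived purely from values the application already holds, no lookup is needed to find the series for a given VM and counter.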

Page 31: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

VM containment and aggregation of real-time samples

• what’s this?
• resource pool vs. instance-based $$
• 20-second samples aggregated from VM to VDC top level
• separated tables
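
The containment aggregation can be pictured as walking each VM sample up its parent chain. This is a sketch under stated assumptions: the VM to vApp to VDC parent map and the entity names are hypothetical, and summing is assumed as the aggregation rule (it suits counters such as CPU MHz or consumed memory; the deck does not say how each counter is combined).

```python
from collections import defaultdict
from typing import Dict

# Hypothetical containment tree: child -> parent.
PARENT = {
    "vm-1": "vapp-a",
    "vm-2": "vapp-a",
    "vm-3": "vapp-b",
    "vapp-a": "vdc-1",
    "vapp-b": "vdc-1",
}


def aggregate_up(vm_samples: Dict[str, float], parent: Dict[str, str]) -> Dict[str, float]:
    """Sum one 20-second sample per VM into every ancestor container."""
    totals: Dict[str, float] = defaultdict(float)
    for vm, value in vm_samples.items():
        node = parent.get(vm)
        while node is not None:
            totals[node] += value
            node = parent.get(node)
    return dict(totals)


if __name__ == "__main__":
    samples = {"vm-1": 500.0, "vm-2": 250.0, "vm-3": 100.0}
    print(aggregate_up(samples, PARENT))
```

Each container level would then be written to its own table, in line with the "separated tables" point.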

Page 32: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Historical rollups and intervals

• VM, VAPP, VDC, ORG and network
• 1 minute (TTL = 1 year)
• 1 hour (used for billing)
• 1 day
• 1 week
• 1 month
• separated tables
• new performance counter types created
• TTL > 3 years for 1h samples for compliance & billing reasons
• application level responsibilities
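
A minimal sketch of an application-level rollup from 20-second samples into larger buckets. The type names AVERAGE, SUMMATION and LATEST come from the counter tables; how the platform actually reduces each type is an assumption here, and the function name is illustrative.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def rollup(samples: Iterable[Tuple[int, float]], bucket_s: int, kind: str) -> Dict[int, float]:
    """Group (epoch_seconds, value) samples into bucket_s-wide buckets and reduce each one."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in sorted(samples):
        buckets[ts - ts % bucket_s].append(value)  # key = bucket start time

    reducers = {
        "AVERAGE": mean,
        "SUMMATION": sum,
        "LATEST": lambda values: values[-1],  # samples were sorted by time above
    }
    reduce = reducers[kind]
    return {start: reduce(values) for start, values in sorted(buckets.items())}


if __name__ == "__main__":
    raw = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 40.0)]
    print(rollup(raw, 60, "AVERAGE"))
```

The same function can feed the coarser intervals by rolling up the previous level (for example, 1-minute values into 1-hour buckets with `bucket_s=3600`), which keeps reads against the raw samples to a minimum.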

Page 33: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

1 minute rollups processing

• processed to trigger alerts (usage, billing)
• processed to compute real-time billing

Page 34: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

1 hour rollups processing

• processed for final billing computation
• leveraging salesforce.com collected data

Page 35: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Data sources essentially

• compute
• storage
• network
• management
• users
• cloud configuration
• salesforce.com
• 3rd party services: backups, DR, etc.
• pluggable: add / upgrade / remove services

Page 36: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Cassandra is the sole record keeper

Page 37: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

deployment, configuration, hardware and architecture overview

Page 38: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

iland cloud platform foundation

• Cisco UCS
• VMware ESXi
• VMware vSphere (management)
• our Cassandra cluster runs on the exact same base foundation as our customer public clouds

Page 39: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Simplified architecture (each DC)

[Architecture diagram. Components shown: HAProxy, Apache, KeyCloak, Wildfly AS, Wildfly AS Resteasy API, Wildfly AS cluster, Postgres, Apache Lucene, NFS, Redis Sentinel, AMQP, syslog-ng, AngularJS / API and the Cassandra ring. Data sources: compute, storage, network, 3rd parties, Salesforce and the iland cloud.]

Page 40: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Cassandra version history

• late 2014: 2.1.x
• early 2014: 2.0.x w/ Java CQL driver
• late 2013: 2.0 beta w/ Astyanax (CQL3) (v1)
  • empty cluster
• early 2013: 1.2.x w/ Astyanax (initial proto)

Page 41: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

iland’s Cassandra cluster overall

• 6 datacenters
• 1 (one) keyspace
• 27 nodes
• 1.5 to 2TB per node (TTL)

Page 42: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

[Map of the 6 datacenters]

• US: Reston, VA; LA, CA; Dallas, TX
• EU: London, UK; Manchester, UK
• Asia: Singapore

Page 43: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Each DC

• 1 or 2 C* rack(s) of 3 Cassandra nodes
• endpoint_snitch: RackInferringSnitch
• RF=3
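
RackInferringSnitch derives a node's datacenter and rack from its IP address: the second octet names the datacenter and the third octet the rack. The sketch below re-implements that rule in Python purely for illustration (it is not Cassandra's code), and the addresses used are hypothetical.

```python
import ipaddress
from typing import Tuple


def infer_location(ip: str) -> Tuple[str, str]:
    """Return (datacenter, rack) the way RackInferringSnitch reads an IPv4 address."""
    octets = ipaddress.IPv4Address(ip).packed  # 4 raw bytes
    return str(octets[1]), str(octets[2])


if __name__ == "__main__":
    # Hypothetical addressing plan: 10.<dc>.<rack>.<node>
    for node in ("10.20.1.11", "10.20.2.11", "10.30.1.11"):
        dc, rack = infer_location(node)
        print(f"{node}: dc={dc} rack={rack}")
```

With this snitch the addressing plan is the topology, so each datacenter and rack needs its own octet values before the first node is bootstrapped.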

Page 45: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Each node

• VM
• Ubuntu 14.04 LTS
• Apache Cassandra Open Source distribution
• 32GB of RAM
• 16 CPUs
• 3 disks: system, commit logs, data

Page 46: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Hardware

• Cisco UCS B200 M3
  • not very expensive
• Disks
  • initially 10K SAS disks
  • now hybrid array (accelerated SSD)
    • reads off SSD (75/25)
    • boot time
    • maintenance ops
• Cassandra is CPU and RAM intensive
  • no need to get crazy on disks initially
  • C* really runs well on non-SSD

Page 47: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Network

• 1G and 10G lines (currently switching all to 10G)
• Cassandra is chatty but performs well in high-latency environments
• network usage is pretty much constant
  • 25 Mb/s in between DCs:
    • default C* 2.1 outbound throttle
    • increase when streaming a node is needed
• permanent VPN in between DCs (no C* SSL)

Page 49: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

ultimately an API for everything and everywhere

Page 50: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

[Diagram: iland core platform and iland ReST API instances with their Cassandra write (C* W) and read (C* R) paths.]

C* R only deployed in: Dallas, TX; London, UK; Singapore

Page 51: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Lessons learned

Page 52: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Tuning Cassandra node: JVM

• Java 8
• MAX_HEAP_SIZE="8G"
• HEAP_NEWSIZE="2G"
• still using CMS but eager to switch to G1 w/ latest Cassandra version
• no magic bullet
  • test and monitor
• 2.0.x to 2.1.x: had to revisit drastically

Page 53: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Tuning Cassandra node: some config opts

• concurrent_writes / concurrent_reads
  • nodetool tpstats
• concurrent_compactors
  • nodetool compactionstats
  • ++
• auto_snapshot
• batch_size_warn_threshold_in_kb
  • monitor
• no magic bullet
  • test and monitor

Page 54: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Minimize C* reads (with Redis in our case)

• writes are great / reads are good
• application level optimizations
• 16G of cached data in every DC
  • very little in Redis: bindings and alerts
• in-memory only (no save on disk)
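
One common way to keep reads off Cassandra is a read-through cache in front of it. The sketch below shows the pattern only: a plain dict stands in for Redis, the loader stands in for a Cassandra query, and the class and function names are hypothetical. The deck does not describe how iland's caching layer is actually implemented.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """Serve repeated lookups from memory; fall back to the loader on a miss."""

    def __init__(self, load: Callable[[str], Optional[str]]) -> None:
        self._store: Dict[str, str] = {}  # stand-in for Redis (in-memory only)
        self._load = load                 # stand-in for a Cassandra read
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._store:
            return self._store[key]
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._store[key] = value  # later calls skip the backing store
        return value


if __name__ == "__main__":
    def load_binding(vm_id: str) -> Optional[str]:
        # Hypothetical lookup of a VM-to-series binding.
        return f"{vm_id}:cpu:usage:average"

    cache = ReadThroughCache(load_binding)
    cache.get("vm-1")
    cache.get("vm-1")
    print(cache.misses)
```

Because the cache is not persisted, it has to be cheap to rebuild from Cassandra after a restart, which fits small, hot data such as bindings and alert definitions.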

Page 55: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Migration

• went live with 2.1.1 because of UDT
  • suggest waiting for at least 5 or 6 dot releases
• 2.0.x / 2.1.x
  • have to re-tune the whole cluster
  • new features can be an issue initially (drivers)
  • Python driver very slow for data migration

Page 56: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Don’ts

• secondary indexes (or make sure you know what you’re doing)
• IN operator
• don’t forget TTL
  • no easy way around range deletes
• complex “relational” type of models

Page 57: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Do’s

• design a simple data model
• query-driven data model
• writes are cheap: duplicate data to accommodate queries
• prepared statements
• batches
• minimize reads from C*
• UDT

Page 58: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

#pain

• bootstrapping new DC
  • streaming very hard to complete OK w/ 2.0
  • temp node tuning during streaming
  • Cassandra 2.2 should help with bootstrap resume
• repairs
  • very long and costly op
  • incremental repairs broken until late 2.1.x

Page 59: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

future platform extensions

Page 60: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Issue with in-app server aggregations and rollups

• JEE container works great but…
• lack of traceability / monitoring around jobs
• separation of concerns
• need to minimize reads against Cassandra
  • in-memory computation
• code base growing fast (200k+ Java LOC)

Page 61: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Spark for aggregations and rollups

• tackling the issues from the previous slide
• multiple new use cases:
  • for instance, heavy throughput data for network analysis
• machine learning
• Kafka & Spark Streaming
• currently experimenting

Page 62: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Multiple Keyspaces

• compliance / data isolation
• lower network traffic

Page 63: Leveraging Cassandra for real-time multi-datacenter public cloud analytics

Thank you

