Date posted: 15-Apr-2017
Category: Technology
Uploaded by: datastax-academy
Azure + DSE Powers O365 Per-User Store
© 2015. All Rights Reserved.
Overview
1 Introduction
2 What We Built
3 What to Pay Close Attention To
4 Deployment
5 Wrap Up
Introduction
Sean Usher, Office 365
Email: [email protected]
Twitter: @seanushermsft
Mahesh Thiagarajan, Microsoft Azure
Email: [email protected]
Twitter: @_cloudguy
Ben Lackey, DataStax
Email: [email protected]
Introduction – Office 365
• Email
• Collaboration
• Document Authoring
• Social Networking
• Calendaring
• File Storage
• Business Intelligence
• Etc…
Introduction – Azure
Azure is Microsoft’s cloud computing platform, a growing collection of integrated services (analytics, computing, database, mobile, networking, storage, and web) for moving faster, achieving more, and saving money.
What We Built - Overview
A way to understand our users and organizations at a deeper level!
Questions:
• Are users happy with the service they are receiving?
• Are users fully utilizing the services they are paying us for?
• Are users hitting issues that we can proactively help them with?
• How has a user’s experience been over their lifetime?
• Can we discover insights that we aren’t even aware of?
This requires ingesting and storing a lot of data. We need to be able to perform fast, scalable analytics on that data, or we will discover issues too late!
What We Built – Why Cassandra
The Good
• Low Latency ✓
• Linear Scale ✓
• Highly Available ✓
• Aggregations (Spark/Spark Streaming) ✓
• Machine Learning (Spark ML) ✓
• No Enforcement of Full Consistency ✓ ✓ ✓
The Not-So-Good
• No Hosted Option in Azure ✗
• Have to Install and Configure it Ourselves ✗
What We Built – DSE Clusters
Cluster 1:
• Cassandra: 12 Nodes
• Analytics: 12 Nodes
• VM Size: G4
• Heap Size: 30 GB
• GC: G1
• Ingestion: 20k – 50k events/sec
• Data on ephemeral SSD drives.
• RF = 3 in both DCs
Cluster 2:
• Cassandra: 30 Nodes
• Analytics: 15 Nodes (30 within 1 month)
• VM Size: G4
• Heap Size: 30 GB
• GC: G1
• Ingestion: 200k+ events/sec
• Data on ephemeral SSD drives.
• RF = 3 in both DCs
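As a rough sizing check on the cluster figures above, the replica write rate each Cassandra node absorbs can be estimated as events/sec × RF ÷ nodes. This is a sketch only: it assumes one write per event and an even token distribution, neither of which the slides state, and the function name is illustrative.

```python
def replica_writes_per_node(events_per_sec: float, replication_factor: int, nodes: int) -> float:
    """Estimate replica writes/sec landing on each node, assuming even distribution."""
    return events_per_sec * replication_factor / nodes


# 12-node Cassandra cluster at its quoted peak of 50k events/sec, RF = 3
small_cluster = replica_writes_per_node(50_000, 3, 12)
# 30-node Cassandra cluster at 200k events/sec, RF = 3
large_cluster = replica_writes_per_node(200_000, 3, 30)

print(f"12 nodes: {small_cluster:,.0f} replica writes/sec per node")
print(f"30 nodes: {large_cluster:,.0f} replica writes/sec per node")
```

Under these assumptions the larger cluster carries more load per node even though it has more nodes, which is one reason to watch the per-node metrics covered later.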
What We Built - Pipeline Evolution
[Diagram: two ingestion pipelines, labeled Cluster 1 and Cluster 2.]
First pipeline components: REST API, O365, Event Hub, Ingestion Worker (Azure worker role using DataStax C# driver), C*, Analytics
Second pipeline components: REST API, O365, Kafka, C*/Spark Streaming, Analytics
VM size labels: G4 – Local SSD; Kafka: G4 – Data Disk; ZooKeeper: A7 – Data Disk; PaaS Small; G4 – Local SSD
What to Pay Close Attention To – Azure Disks
VHD Storage: No more than 40 VMs per storage account
“… and for a Standard Tier VM, it is about 40 (20,000/500 IOPS per disk)…..”
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/
Disk Choice:
1. Local SSD (Ephemeral) – Fast, but data can be lost.
2. Data Disk (Standard Storage) – No data loss; network-attached, which can add latency. 20k IOPS account limit.
3. Data Disk (Premium Storage) – No data loss; network-attached, which can add latency. Per-disk IOPS limit.
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-how-to-attach-disk/
https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/#storage-limits
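The "about 40" figure follows directly from the two limits quoted above. A minimal sketch of that arithmetic, with a helper for how many Standard storage accounts a given number of VHDs would need; the helper names and the 45-disk example are illustrative assumptions, not from the slides:

```python
ACCOUNT_IOPS_LIMIT = 20_000  # Standard storage account limit quoted above
IOPS_PER_DISK = 500          # Standard tier per-disk figure quoted above


def max_disks_per_account(account_limit: int = ACCOUNT_IOPS_LIMIT,
                          per_disk: int = IOPS_PER_DISK) -> int:
    """Disks that fit in one account before the account IOPS limit is reached."""
    return account_limit // per_disk


def accounts_needed(disk_count: int) -> int:
    """Storage accounts required for disk_count VHDs (ceiling division)."""
    per_account = max_disks_per_account()
    return -(-disk_count // per_account)


print(max_disks_per_account())  # 20,000 / 500
print(accounts_needed(45))      # hypothetical: 45 VMs with one VHD each
```

Spreading VHDs across accounts this way keeps any single account under its limit even if every disk is driven at its maximum rate.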
[Diagram: a VM with its local SSD at /dev/sdb and its OS disk at /dev/sda. The OS disk and the data disk are each backed by a storage account.]
What to Pay Close Attention To – Azure VM Size
VM Size: We chose G4 nodes, but are investigating moving to D14 nodes. A larger number of smaller nodes allows faster rebuilds, which can reduce recovery time.
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-size-specs/
What to Pay Close Attention To – Azure Networking
Networking: Virtual Network (VNet) vs Public IP
1. Public IPs – Default limit of 5 per subscription. Allows geo-redundant replication over the Internet.
2. VNet – Define your own subnets and IP ranges. Allows geo-redundant replication via Gateways/Express Route. No bandwidth limit within a VNet.
   1. Standard Gateway – Max 100 Mbps.
   2. High-Performance Gateway – Max 200 Mbps.
   3. Express Route – Max 10 Gbps.
https://azure.microsoft.com/en-us/documentation/articles/virtual-networks-instance-level-public-ip/
https://azure.microsoft.com/en-us/documentation/articles/vpn-gateway-vnet-vnet-rm-ps/
https://msdn.microsoft.com/en-us/library/azure/mt586720.aspx
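To see what the gateway limits above mean for cross-region replication or rebuilds, it helps to convert them into transfer time. The sketch below assumes a hypothetical 1 TB (1000 GB) of data, treats each limit as sustained throughput, and ignores protocol overhead, so real transfers would take longer:

```python
# Maximum rates from the list above, in megabits per second
LINKS_MBPS = {
    "Standard Gateway": 100,
    "High-Performance Gateway": 200,
    "Express Route": 10_000,
}


def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Hours needed to move data_gb (decimal gigabytes) at link_mbps."""
    megabits = data_gb * 8 * 1000  # GB -> gigabits -> megabits
    return megabits / link_mbps / 3600


DATA_GB = 1000  # hypothetical data volume, not from the slides
for name, mbps in LINKS_MBPS.items():
    print(f"{name}: {transfer_hours(DATA_GB, mbps):.1f} hours")
```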
What to Pay Close Attention To – Azure Networking
Test performance of every dependency and see if it meets the expectations of your application.
Network Performance: iperf (https://iperf.fr/) – Test bandwidth between two VMs within various DCs
[Diagram: two VMs in one VNet. VM 10.1.0.10 runs "iperf -s"; VM 10.1.0.11 runs "iperf -c 10.1.0.10".]
user@machine:~$ iperf -c 10.1.0.10
------------------------------------------------------------
Client connecting to 10.1.0.10, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 10.1.0.10 port 42892 connected with 10.1.0.10 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  45.7 GBytes  39.2 Gbits/sec
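When running this test across many VM pairs, it can be convenient to extract the bandwidth figure automatically and compare it to what the application needs. A minimal sketch, assuming the classic iperf text format shown above; the function name and the 1 Gbit/sec requirement are hypothetical:

```python
import re

_PREFIX = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def iperf_bandwidth_bps(output: str) -> float:
    """Return the bandwidth on the last summary line of iperf output, in bits/sec."""
    matches = re.findall(r"([\d.]+)\s+([KMG]?)bits/sec", output)
    if not matches:
        raise ValueError("no bandwidth line found in iperf output")
    value, prefix = matches[-1]
    return float(value) * _PREFIX[prefix]


sample = "[  3]  0.0-10.0 sec  45.7 GBytes  39.2 Gbits/sec"
measured = iperf_bandwidth_bps(sample)

REQUIRED_BPS = 1e9  # hypothetical application requirement
print("meets requirement" if measured >= REQUIRED_BPS else "too slow")
```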
What to Pay Close Attention To – Azure Storage
Test performance of every dependency and see if it meets the expectations of your application.
Disk: SysBench (https://wiki.gentoo.org/wiki/Sysbench) – Test write throughput and IOPS
user@machine:/mnt$ sysbench --test=fileio --file-total-size=1000G --file-test-mode=rndrw --init-rng=on --max-time=300 --max-requests=0 run
sysbench 0.4.12: multi-threaded system evaluation benchmark

<….. Excess Logging Removed….>

Operations performed: 402240 Read, 268160 Write, 858065 Other = 1528465 Total
Read 6.1377Gb Written 4.0918Gb Total transferred 10.229Gb (34.917Mb/sec)
2234.67 Requests/sec executed

Test execution summary:
    total time: 300.0002s
    total number of events: 670400
    total time taken by event execution: 16.1526
    per-request statistics:
        min: 0.00ms
        avg: 0.02ms
        max: 2.20ms
        approx. 95 percentile: 0.05ms

Threads fairness:
    events (avg/stddev): 670400.0000/0.00
    execution time (avg/stddev): 16.1526/0.00
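The headline numbers in this output can be re-derived from the raw counts, which is a useful sanity check when comparing runs across disk types. The sketch below uses only figures from the output above, plus one assumption: a 16 KiB block size (believed to be the sysbench fileio default, and consistent with the reported totals).

```python
reads, writes = 402_240, 268_160
total_time_s = 300.0002
block_kib = 16  # assumed sysbench fileio block size; not shown in the output

# IOPS: read and write operations per second of wall-clock time
requests_per_sec = (reads + writes) / total_time_s

# Share of operations that were reads in this rndrw run
read_fraction = reads / (reads + writes)

# Throughput in MiB/sec, which sysbench 0.4 labels "Mb/sec"
throughput_mib_s = requests_per_sec * block_kib / 1024

print(f"{requests_per_sec:.2f} requests/sec")
print(f"{read_fraction:.0%} reads")
print(f"{throughput_mib_s:.3f} MiB/sec")
```

The results line up with the 2234.67 requests/sec and 34.917Mb/sec reported by sysbench.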
What to Pay Close Attention To – Cassandra
Metrics!
Need to tune? Al Tobey can help: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
What to Pay Close Attention To – Cassandra
SSTable Count
• Too many SSTables can lead to OOM errors and nodes becoming unavailable.
• Watch count and balance compaction throughput with system limits.
• SSTable count may spike during repairs if data is inconsistent.
Dropped Mutations
• Dropped mutations mean more repairs need to be done.
• Impact of dropped mutations can be controlled by tuning write consistency.
• Check iostat to see if disk queue is building up or write latency is high.
    • iostat -x /dev/sdb 1 5
• Do drops only happen when Spark Jobs batch write? Tune Spark write throughput (https://github.com/datastax/spark-cassandra-connector/blob/v1.2.5/doc/FAQ.md)
See memtables & flushing in Al’s Tuning Guide.
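One way to tune Spark write throughput is through the connector's output settings. The sketch below builds spark-submit arguments from three properties documented for the 1.2 connector line; the values are hypothetical starting points rather than recommendations from the slides, and the helper name is illustrative.

```python
def spark_submit_conf_args(settings: dict) -> list:
    """Turn a settings dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical throttling values; adjust while watching dropped mutations
throttle = {
    "spark.cassandra.output.throughput_mb_per_sec": 5,
    "spark.cassandra.output.concurrent.writes": 2,
    "spark.cassandra.output.batch.size.rows": 100,
}

print(" ".join(spark_submit_conf_args(throttle)))
```

Lowering these values trades job duration for less write pressure on Cassandra, so change one at a time and compare the dropped mutation counts between runs.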
What to Pay Close Attention To – Cassandra
Pending Compactions
• If you aren’t keeping up with compactions, performance will suffer.
• Too many SSTables impact read speed, but also can lead to hitting OS limits. See:
    • /etc/sysctl.conf – vm.max_map_count
    • /etc/security/limits.d/cassandra.conf – nofile
    • /etc/init.d/dse – Certain DSE versions overwrite nofile with: FD_LIMIT=100000
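The OS limits listed above can be checked from a script before a node runs into them. This is a minimal sketch for Linux: the 100000 file-descriptor minimum mirrors the FD_LIMIT value quoted above, while the 1048575 threshold for vm.max_map_count is an assumed target that does not come from the slides.

```python
import resource
from pathlib import Path

# Assumed minimums: 100000 matches the FD_LIMIT quoted above; 1048575 is an
# illustrative max_map_count target, not a value given in the slides.
MIN_NOFILE = 100_000
MIN_MAX_MAP_COUNT = 1_048_575


def check_limits(max_map_count, nofile_soft,
                 min_map_count=MIN_MAX_MAP_COUNT, min_nofile=MIN_NOFILE):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if max_map_count is not None and max_map_count < min_map_count:
        problems.append(f"vm.max_map_count={max_map_count} is below {min_map_count}")
    unlimited = nofile_soft == resource.RLIM_INFINITY
    if not unlimited and nofile_soft < min_nofile:
        problems.append(f"nofile soft limit={nofile_soft} is below {min_nofile}")
    return problems


def current_limits():
    """Read the live values for the current process (Linux only)."""
    path = Path("/proc/sys/vm/max_map_count")
    max_map_count = int(path.read_text()) if path.exists() else None
    nofile_soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max_map_count, nofile_soft


if __name__ == "__main__":
    for problem in check_limits(*current_limits()) or ["limits look sufficient"]:
        print(problem)
```

The nofile value reported is for the process running the script, so run it as the same user and through the same init path as Cassandra to see the limit the database actually gets.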
Heap Used
• Heap usage changes over time. What works in week 1 may not work in week 10.
• We used a 20 GB heap until nodes started hitting OOM when they needed 25 GB.
• Use G1 if at all possible to see GC times decrease, and use a large (25 – 30 GB) heap.
• Let G1 tune your young generation heap size.
What to Pay Close Attention To – Spark
We are still learning!
[Screenshots: scheduler output (annotated "NOT CRON!"), Spark UI, and Spark job logs.]
If you don’t enable Spark UI for security reasons, ship your Spark logs off box for analysis.
You may also find that jobs fail to read data because partitions are missing or nodes are timing out. This can indicate you are overwhelming Cassandra.
Deployment
Use the Azure/DataStax Template
Azure will be investing in building more features into the Azure template, and you will get those more easily if you use the existing template.
https://www.youtube.com/watch?v=vacp267zLBA&noredirect=1
https://github.com/DSPN/azure-resource-manager-dse
We Didn’t Use the Template because it wasn’t ready yet. We had to write our own logic to deploy nodes, and we need to transition to the template so we can get all of these new features. We are scheduling time to do this because it will save us a lot of work!
Consider Security and Compliance: This will influence how you deploy (VNet vs Public IP), what Cassandra configuration you use (internode encryption, require_client_auth: true), and what OS configuration you use (CIS standards).
C* Hardening: http://thelastpickle.com/blog/2015/09/30/hardening-cassandra-step-by-step-part-1-server-to-server.html
CIS Standards: https://benchmarks.cisecurity.org/downloads/show-single/?file=ubuntu1404.100
Power of Repeatability
Azure Templates can:
• Ensure Idempotency
• Simplify Orchestration
• Simplify Roll-back
• Provide Cross-Resource Configuration and Update Support
Azure Templates are:
• Source file, checked-in
• Specifies resources and dependencies (VMs, WebSites, DBs) and connections (config, LB sets)
• Parameterized input/output
[Diagram: a template is instantiated as a repeatable configuration in a resource group containing SQL-A, a Website (with SQL config), and Virtual Machines (2x). The Website and the VMs depend on SQL.]
Azure VM Extensions
• Extending the power of your VM
• Enable easier management
• Support partner ecosystem
• Full control still with you!
[Diagram: IaaS extended. Azure, curated extensions, agent.]
Thank you
Sean Usher, Office 365
Email: [email protected]
Twitter: @seanushermsft
Mahesh Thiagarajan, Microsoft Azure
Email: [email protected]
Twitter: @_cloudguy
Ben Lackey, DataStax
Email: [email protected]