Dr. Christian Geuer-Pollmann
@chgeuer
http://blog.geuer-pollmann.de/
Lessons learned:
Hosting large-scale backends like the “Eurovision Song Contest” on Microsoft Azure
Architecture Overview
Operations Security
Load Testing Performance Connectivity
Agenda
Kampf der Orchester
(SRF)
Eurovision Song Contest 2015
(EBU / ORF)
Quizduell im Ersten
(Das Erste / NDR)
Spiel für Dein Land
(Das Erste / SRF / ORF)
Projects
Technology Partner
Next broadcast:
Sat, 12 Dec 2015 | 20:15
• Support 2+ million concurrent connections
• Sub-second in-app notifications
• Voting and fast aggregation
• Web Sockets for bi-directional communications
• Built on Azure "Cloud Services" ("PaaS v1"),
ASP.NET, SignalR
Solution Overview
General Architecture
Patterns
• KISS!!!
• Cloud Services - Affinity, Network, CPU,
Memory only.
• Reduce moving parts. If you can eliminate
third-party services, do so.
• Call potentially blocking or failing
components asynchronously.
• Retries against the data store
must not block the critical path.
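The last two bullets can be sketched as follows. This is a minimal illustration, not the code used in the project: `BackgroundWriter` and `store_write` are hypothetical names, and the retry count, backoff, and queue capacity are arbitrary assumptions. The request handler only enqueues; retries against the data store happen on a worker thread.

```python
import queue
import threading
import time


class BackgroundWriter:
    """Accepts writes without blocking the caller; retries run on a worker thread."""

    def __init__(self, store_write, max_attempts=5, base_delay=0.01, capacity=10_000):
        self._store_write = store_write      # hypothetical callable that may raise
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._pending = queue.Queue(maxsize=capacity)
        self.dropped = 0                     # writes that could not be queued or stored
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item):
        """Called on the critical path: never waits for the data store."""
        try:
            self._pending.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self):
        while True:
            item = self._pending.get()
            try:
                if item is None:             # sentinel: stop the worker
                    return
                self._write_with_retry(item)
            finally:
                self._pending.task_done()

    def _write_with_retry(self, item):
        for attempt in range(self._max_attempts):
            try:
                self._store_write(item)
                return
            except Exception:
                # Exponential backoff; only the worker thread sleeps.
                time.sleep(self._base_delay * (2 ** attempt))
        self.dropped += 1

    def close(self):
        """Flush outstanding writes and stop the worker."""
        self._pending.put(None)
        self._pending.join()
```

A bounded queue is a deliberate choice in this sketch: when the store is down for a long time, the role sheds writes and counts them instead of running out of memory.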
Paranoia – Trust no one
https://github.com/chgeuer/RedisCloudService
No external dependencies
Multi-paradigm fallbacks
• Realtime updates via
WebSockets,
• Fallback to CDN.
Paranoia – Don't trust your own solution
• Quorum in PaaS v1 Cloud Services is "difficult";
on paper, Compute v1 has only 2 fault domains (FDs)
• New Compute v2 (ARM, Service Fabric) provides 3 FDs.
Unfortunately, v2 was not available at the end of CY14
Quorum and Fault Domains
• Don't let all web roles hammer
the backend. Reduce traffic to
central DB
• Aggregate in role
• Constant load on backend
• Shared-Access Signatures for
Profile Pictures
Reduce Load on Backend
http://blog.smarx.com/posts/architecting-scalable-counters-with-windows-azure
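A minimal sketch of "aggregate in role", assuming each web role keeps vote counts in memory and a timer flushes them to the backend at a fixed interval. `VoteAggregator`, `flush`, and `backend_increment` are hypothetical names for illustration, not taken from the project. The backend then sees one small batch per role per interval, regardless of how many votes arrive.

```python
import threading
from collections import Counter


class VoteAggregator:
    """Collects votes in memory on one web role and hands them off in batches."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def record(self, candidate):
        """Called once per incoming vote; touches only local memory."""
        with self._lock:
            self._counts[candidate] += 1

    def drain(self):
        """Return the counts gathered since the last drain and reset them."""
        with self._lock:
            batch, self._counts = self._counts, Counter()
        return dict(batch)


def flush(aggregator, backend_increment):
    """Send one batch to the backend; meant to run on a fixed timer per role."""
    batch = aggregator.drain()
    for candidate, count in batch.items():
        backend_increment(candidate, count)   # hypothetical backend call
    return len(batch)
```

With this shape, backend load depends on the number of roles and the flush interval, not on the vote rate.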
• Establishing TCP connections is expensive and strains the
TCP/IP stack
• Closed TCP connections are expensive too (they linger in TIME_WAIT)
• UX: Minimize realtime delay and latency
• WebSockets have no poll interval
• Polling means authenticating each request
HTTP Polling versus WebSockets
Don't use plain HTTP polling
Votes via POST
Status via GET
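A rough back-of-the-envelope calculation shows why plain polling is unattractive at this scale. The 2 million clients come from the requirements above; the 5-second poll interval is an assumed value for illustration only.

```python
def polling_requests_per_second(clients, poll_interval_s):
    """Steady-state request rate when every client polls once per interval."""
    return clients / poll_interval_s


def average_polling_delay_s(poll_interval_s):
    """Expected wait until the next poll sees an update that arrives at a random time."""
    return poll_interval_s / 2


CLIENTS = 2_000_000          # from the requirements slide
POLL_INTERVAL_S = 5          # assumption for illustration

rate = polling_requests_per_second(CLIENTS, POLL_INTERVAL_S)
delay = average_polling_delay_s(POLL_INTERVAL_S)
print(f"{rate:,.0f} requests/s, average delay {delay} s")
```

Under these assumptions the backend would serve 400,000 requests per second, each one authenticated, and clients would still see updates 2.5 seconds late on average. Shortening the interval improves latency only by raising the request rate proportionally.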
Automate everything!
Network Tweaking
• HTTP
  • http.sys max connections
  • Concurrent requests per CPU
  • Request queue limit
• TCP
  • TIME_WAIT
  • max. TCP retransmissions
• Windows OS
https://github.com/chgeuer/Quizzer/blob/master/Quizzer.Web/SetupScripts/install2.ps1
Network Tweaking – Receive Side Scaling
• “Receive side scaling (RSS) is a
network driver technology that
enables the efficient distribution
of network receive processing
across multiple CPUs in
multiprocessor systems.” [1]
• https://github.com/chgeuer/Quizzer/blob/master/Quizzer.Web/SetupScripts/install2.ps1#L190-L216
[1] https://msdn.microsoft.com/en-us/library/windows/hardware/ff556942(v=vs.85).aspx
• Egress data volume for client is high.
Questions and Answers can have image attachments.
• Individually encrypted questions, zipped JSON in CDN
• Change distribution time, path and costs
• Goal: Separate bulk data and realtime traffic
• 500k people * 100kB == 50GB.
• 500k people * question ID + key only == 1MB.
Traffic Volume Optimization
https://github.com/chgeuer/SelectiveFieldConfidentiality
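The arithmetic behind these two figures, written out. All numbers are the ones on the slide; the function names are just for illustration. The point is the ratio: moving the bulk payload to the CDN ahead of time leaves only a tiny fraction of the volume for the realtime path.

```python
CLIENTS = 500_000
BULK_BYTES_PER_CLIENT = 100_000      # 100 kB per client, from the slide
REALTIME_TOTAL_BYTES = 1_000_000     # 1 MB in total for question IDs and keys, as stated on the slide


def bulk_total_bytes(clients, bytes_per_client):
    """Total volume if every client downloads the full payload."""
    return clients * bytes_per_client


def reduction_factor(bulk_total, realtime_total):
    """How much smaller the realtime traffic is than the bulk traffic."""
    return bulk_total / realtime_total


bulk = bulk_total_bytes(CLIENTS, BULK_BYTES_PER_CLIENT)
print(bulk / 1e9, "GB bulk via CDN")
print(reduction_factor(bulk, REALTIME_TOTAL_BYTES), "x less data on the realtime path")
```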
• There's no general sizing guidance, as usage patterns vary heavily
• Load Test is the (only) answer!
SignalR (and other) Performance Guidance
https://github.com/SignalR/SignalR/wiki/Performance
• There is never enough time for testing.
• High # of concurrent users
• Long-lived connections
• Each public IP can establish a theoretical maximum of
64k connections to http://target:80/
• Custom protocol on top of SignalR
Developed our own load test framework ("bot net")
Load Testing Challenges
https://github.com/chgeuer/AzureDistributedRunner
Load Test Setup
Spin up 60 individual nodes (unique source IPs)
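The per-IP limit follows from the 16-bit source port: towards a single target IP and port, one source IP can hold at most 65,535 connections. A small sketch of the resulting sizing, using the 2 million connection target from the requirements. This is a theoretical lower bound only; how many connections a real load-test node sustains depends on its ephemeral port range, memory, and CPU, which this sketch does not model.

```python
import math

USABLE_PORTS_PER_IP = 65_535   # 16-bit source port, port 0 excluded; the slide rounds this to "64k"


def min_source_ips(target_connections, ports_per_ip=USABLE_PORTS_PER_IP):
    """Fewest distinct source IPs that can open this many connections to one ip:port."""
    return math.ceil(target_connections / ports_per_ip)


def max_connections(source_ips, ports_per_ip=USABLE_PORTS_PER_IP):
    """Theoretical ceiling for a given number of source IPs."""
    return source_ips * ports_per_ip


print(min_source_ips(2_000_000))   # lower bound on load-test nodes
print(max_connections(60))         # ceiling for the 60-node setup
```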
Security Rule #1 – Know your threat model
• Quizduell prize draw ("Gewinnspiel")
• QD “Hall of Fame”
• QD Double-voting
• ESC votes via SMS
Caution: Don't generalize that specific decision!
• Used TLS for registration and login only
• TLS is a burden on the CPU; we did custom authN (HMAC only).
• Different APIs might have different security requirements &
protocols (possible due to closed system nature)
Performance vs. Security
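A minimal sketch of HMAC-only request authentication, under stated assumptions: the client obtained a per-user secret during the TLS-protected login, and each later request carries a timestamp and a signature. The message layout, the `sign_request` and `verify_request` names, and the 30-second skew window are illustrative choices, not the protocol actually used. Note that HMAC provides authenticity and integrity but no confidentiality, which is why this trade-off should not be generalized.

```python
import hashlib
import hmac
import time


def sign_request(secret: bytes, user_id: str, method: str, path: str,
                 body: bytes, timestamp: int) -> str:
    """Compute an HMAC-SHA256 signature over the parts of the request that matter."""
    message = "\n".join([
        user_id,
        method.upper(),
        path,
        str(timestamp),
        hashlib.sha256(body).hexdigest(),   # binds the signature to the payload
    ]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, user_id: str, method: str, path: str,
                   body: bytes, timestamp: int, signature: str,
                   now: int = None, max_skew_s: int = 30) -> bool:
    """Reject stale requests, then compare signatures in constant time."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > max_skew_s:
        return False
    expected = sign_request(secret, user_id, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

The timestamp check limits replay of captured requests to a short window; a stricter design would also track nonces, at the cost of server-side state.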
Security reviews you didn't ask for…
Your client implementation is never private
3 days before "go live"
http://quizduellforum.de/index.php?topic=478.0
Live API provides status,
and links to bulk data in
CDN
A manifest
// http://qd-prod.appsfactory.de/api/info
{
  "AgbChange": "2015-10-23T18:55:00",
  "DsChange": "2015-10-23T18:55:00",
  "TbChange": "2015-10-23T18:55:00",
  "Live": false,
  "IdxVersion": 2,
  "RankingBlobUrl": "https://az692393.vo.msecnd.net/rankings/top50.zip",
  "RankingTimestamp": 1446231308743,
  "Capped": false,
  "PlayAlong": false,
  "Apps": [
    { "OS": "iOS", "Version": "1.6", "Force": false },
    { "OS": "Android", "Version": "1.2.7", "Force": false },
    { "OS": "Windows", "Version": "1.8.0.0", "Force": false },
    { "OS": "WindowsTablet", "Version": "1.8.0.0", "Force": false }
  ],
  "Duells": [
    "https://az692393.vo.msecnd.net/duells/1179.zip",
    "https://az692393.vo.msecnd.net/duells/1253.zip",
    "https://az692393.vo.msecnd.net/duells/2274.zip",
    "https://az692393.vo.msecnd.net/duells/2275.zip"
  ]
}
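A sketch of how a client might consume such a manifest: one small call to the live API, then bulk downloads from the CDN. The field names (`Live`, `Duells`, `RankingBlobUrl`) are taken from the manifest above; their exact semantics and the helper names `is_live` and `bulk_downloads` are assumptions made for illustration.

```python
import json


def is_live(manifest_text):
    """True when the manifest reports a running show (assumed meaning of "Live")."""
    return bool(json.loads(manifest_text).get("Live", False))


def bulk_downloads(manifest_text):
    """Collect the CDN URLs a client would prefetch: duel archives, then the ranking blob."""
    manifest = json.loads(manifest_text)
    urls = list(manifest.get("Duells", []))
    ranking = manifest.get("RankingBlobUrl")
    if ranking:
        urls.append(ranking)
    return urls
```

Because the manifest is tiny and the archives sit behind the CDN, the API servers never carry the bulk traffic themselves.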
Thanks for the voluntary analysis
"For all further access to the web API, however, we have to
supply a so-called user token so that the server responds to us
at all. We only receive this user token after authenticating
with our Google account via Google's OAuth 2 service."
http://quizduellforum.de/index.php?topic=478.0
"After downloading [...] we discover [...] a catalog of all
questions, but the questions are encrypted. [...] The key is
delivered to the players [...] at the start of each question round. Until
then the questions stay locked away, because the symmetric AES
encryption scheme in use is unbreakable. [...]
So cheating is not an option, unless you count knowing the second and
third questions at the start of a round, while others see only the
first question, as cheating."
Thanks for the voluntary analysis (2)
http://quizduellforum.de/index.php?topic=478.0
"Thanks to the clever advance downloads of the questions and
team photos, only cryptographic keys and some metadata
have to be exchanged during a live duel. This is certainly a
significant reduction in the volume of data transferred.
In addition, so-called WebSockets are used, which offer many
performance advantages over the old app for the live
synchronization of the game."
Thanks for the voluntary analysis (3)
http://quizduellforum.de/index.php?topic=478.0
Pre-heating your app
Heating up
Production
Autoscaling
How we handled auto-scaling
We did not auto-scale!
• Instrument your infrastructure. Know what the load is on your
nodes.
• Using Microsoft standard logging (Performance Counters) helps.
• Monitor everything (VMs, CDN)
• Realtime-logging for startup tasks:
chgeuer/UnorthodoxAzureLogging
Logging and Instrumentation
Schedule