Kubernetes and lastminute.com group: our course towards better scalability and processes
Milan, 25-26 November 2016
The inspiring travel company
lastminute.com group in numbers
40 countries, 17 languages
10M travellers per year*
€ 2.5B GTV*, € 250M revenue*
43M users per month*
*data as of 31st December 2015; icons from http://www.flaticon.com
A tech company to the core
Tech department: 300+ people
Modules: ~100
Database: 150 schemas, 3300 tables, terabytes of data
Instances: 1400+
Locations: Chiasso, Milan, Madrid, London, Bengaluru
https://www.pexels.com/photo/turtle-walking-on-sand-132936/
“Business thinks developers are slow”
lastminute.com group: an agile company
● Scrum and Kanban
● TDD
● clean code
● continuous integration
● code review
● internal communities
Starting from the monolith ...
https://www.flickr.com/photos/southtopia/5702790189
https://www.pexels.com/photo/gray-pebbles-with-green-grass-51168/
... broken into microservices
The improvements needed
● alignment
● real pipelines
● infrastructure
● resilience
● monitoring
● remove constraints
A year-long endeavour
● build a new, modern infrastructure
● migrate the search (flight/hotel) product there
... without:
● impacting the business
● throwing away our whole datacenter
TODO list
● company framework
● docker
● kubernetes
How? Teams and people
New teams
https://www.pexels.com/photo/blue-lego-toy-beside-orange-and-white-lego-toy-standing-during-daytime-105822/
Our infrastructure and technology
https://www.pexels.com/photo/colorful-toothed-wheels-171198/
Docker containers
● build once, run everywhere
● externalised configuration
registry.intra/application:v2-090025112016
[image layers: BASE OS → JAVA SDK → START/STOP SCRIPTS → JAR APPLICATION]
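A Dockerfile matching the layer stack above might look like this (the base image, script paths and jar name are assumptions, not the actual build):

```dockerfile
# BASE OS + JAVA SDK (assumed base image)
FROM openjdk:8-jdk

# START/STOP SCRIPTS layer
COPY scripts/start.sh scripts/stop.sh /opt/app/

# JAR APPLICATION layer
COPY target/application.jar /opt/app/application.jar

CMD ["/opt/app/start.sh"]
```

Because each instruction produces its own layer, rebuilding a new application version only replaces the top layers while the OS and JDK layers stay cached — the “build once, run everywhere” part.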
Kubernetes
● independent from OS/hosts
● isolated env, managed at scale
● self-healing
● externalised configuration
Omega paper: http://research.google.com/pubs/pub41684.html
https://www.pexels.com/photo/red-toy-truck-24619/
“Your infrastructure on wheels”
Kubernetes: physical representation
[cluster diagram: NODE1, NODE2, … NODE28; each node runs DOCKER, ETCD, K8S and FLANNEL]
Kubernetes: logical representation
[cluster diagram: namespaces with resource quotas]
NAMESPACE1: CPU 10, MEM 40GB
NAMESPACE2: CPU 20, MEM 50GB
NAMESPACE3: CPU 80, MEM 60GB
NAMESPACE4: CPU 5, MEM 5GB
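Per-namespace limits like these map onto Kubernetes ResourceQuota objects; a minimal sketch for the first namespace (object and namespace names are assumed):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: namespace1      # NAMESPACE1: CPU 10, MEM 40GB
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 40Gi
```

With the quota in place, the scheduler rejects new pods whose aggregate requests would exceed the namespace’s CPU or memory budget.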
Kubernetes: our architecture
[diagram: two clusters]
production: APP1-PRODUCTION, APP2-PRODUCTION, APP3-PRODUCTION
nonproduction: APP1-PREVIEW, APP1-DEVELOPMENT, APP1-QA, APP1-STRESSTEST
Kubernetes: our architecture and choices
[diagram: in the production cluster, namespace APP1-PRODUCTION holds a deployment, which manages a replica-set, which manages POD1, POD2 and POD3]
Kubernetes: our architecture and choices
[diagram: deployment → replica-set → POD1, POD2, POD3, plus a secret and a configmap in APP1-PRODUCTION]
Kubernetes: our architecture and choices
[diagram: deployment → replica-set → POD1, POD2, POD3, with a secret and a configmap; the pods are exposed through loadbalancer-app1 at app1.lastminute.intra]
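The objects in this diagram could be declared roughly as follows (names, ports and namespace are assumptions; the image tag is taken from the earlier slide):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  namespace: app1-production
spec:
  replicas: 3                  # deployment -> replica-set -> POD1..POD3
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: app1
        image: registry.intra/application:v2-090025112016
        envFrom:
        - configMapRef:        # externalised configuration
            name: app1-config
        - secretRef:
            name: app1-secret
---
apiVersion: v1
kind: Service
metadata:
  name: loadbalancer-app1     # reachable as app1.lastminute.intra
  namespace: app1-production
spec:
  selector:
    app: app1
  ports:
  - port: 80
    targetPort: 8080
```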
Kubernetes: our architecture and choices
[diagram: inside each POD of APP1-PRODUCTION, the application container runs alongside fluentd and collectd]
Kubernetes: what’s left outside?
● datastores
● distributed caches
● distributed locking
● pub-sub
● logs and metrics storage
1st try (with a test app): it seemed to work
https://www.flickr.com/photos/26516072@N00/2194001232
Self-healing
“The term self-healing describes any application, service, or system that can discover that it is not working correctly and, without any human intervention, make the necessary changes to restore itself to the normal or designed state.”
ref: https://technologyconversations.com/2016/01/26/self-healing-systems
Kubernetes agnostic interfaces
“When a container is dead I will restart it”
“When a container is ready I will forward traffic to it”
Kubernetes probes: liveness & readiness
Two questions for developers:
● when can I consider my container alive?
● when can I consider my container ready to receive traffic?
deployment.yaml:
spec:
  containers:
  - livenessProbe:
      httpGet:
        path: /liveness
      successThreshold: 3
      failureThreshold: 2
    readinessProbe:
      httpGet:
        path: /readiness
      successThreshold: 3
      failureThreshold: 2
/liveness:
● the Tomcat container is up
● the ratio of active to max threads is below a threshold
/readiness:
● all the startup jobs have run
● no termination request has been received
.. ongoing never-ending research ..
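A hypothetical sketch of the /liveness rule above (not the actual framework code): the container reports itself alive only while Tomcat is up and the active/max thread ratio stays below a threshold.

```java
// Sketch of the /liveness decision described on the slide.
// Names, signature and threshold are illustrative assumptions.
public final class LivenessCheck {

    static boolean isAlive(boolean tomcatUp, int activeThreads,
                           int maxThreads, double threshold) {
        if (!tomcatUp || maxThreads <= 0) {
            return false;                       // Tomcat down: not alive
        }
        // alive while the active/max ratio is below the threshold
        return (double) activeThreads / maxThreads < threshold;
    }

    public static void main(String[] args) {
        // 150 of 200 threads busy, 0.9 threshold: still alive
        System.out.println(isAlive(true, 150, 200, 0.9));
    }
}
```

An HTTP /liveness endpoint would then return 200 when this check passes and a 5xx otherwise, letting the kubelet restart the container once failureThreshold is reached.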
Our choices: framework - k8s
● zero downtime during rollout
● monitoring in place
● alerting
● centralized logging
● legacy infrastructure to the rescue in case of problems
2nd try (with production traffic)
... failure ... the big one!
https://www.flickr.com/photos/ghost_of_kuji/2763674926
Problems
● configuration
● infrastructure
● tools
● manual mistakes
● (external) scalability
● temporary team focus on objective
● automation
● monitoring
● go deeper into Docker/Kubernetes
Another improvement step
Pipeline: a huge step forward
microservice = factory.newDeployRequest().withArtifact("com.lastminute.application1", 2)
lmn_deployCanaryStrategy(microservice, "qa")
lmn_deployStableStrategy(microservice, "preview")
lmn_deployCanaryStrategy(microservice, "production")
[diagram: the pipeline deploys to the pods of the APP1-PRODUCTION namespace]
Monitoring: grafana/graphite/nagios
[diagram: the application and collectd inside the cluster send metrics to graphite; Grafana dashboards and nagios alerting read from graphite]
icons from http://www.flaticon.com
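The collectd → graphite leg of this picture is typically wired through collectd’s write_graphite plugin; a sketch, with host and prefix values assumed:

```conf
# Ship collectd metrics to graphite over the plaintext protocol
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.intra"
    Port "2003"
    Protocol "tcp"
    Prefix "collectd."
  </Node>
</Plugin>
```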
“Go” deep .. whatever language it takes
https://www.pexels.com/photo/sea-man-person-ocean-2859/
There’s a light .. at the end
https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/
... benefits
● lead and migration time
● resilience
● root cause analysis
● speed of deployment
● instant scaling
Give me the numbers!
● 1300 req/sec in the new cluster
● 25 micro-services migrated in 4 months
● 1 week to migrate an application
● 10 minutes to create a new environment
● 11 min to gracefully roll out a new version with 55 instances
● whole pipeline runs in 16 min
● 1.5M metrics/minute flow
Yes, we’re hiring!
THANKS
www.lastminutegroup.com