Kubernetes and lastminute.com group: our course towards better scalability and processes
Milan, 25-26 November 2016
The inspiring travel company
lastminute.com group in numbers
40 countries, 17 languages
10M travellers per year*
€ 2.5B GTV*, € 250M revenue*
43M users per month*
*data as of 31st December 2015; icons from http://www.flaticon.com
A tech company to the core
Tech department: 300+ people
Modules: ~100
Database: 150 schemas, 3300 tables, terabytes of data
Instances: 1400+
Locations: Chiasso, Milan, Madrid, London, Bengaluru
https://www.pexels.com/photo/turtle-walking-on-sand-132936/
“Business thinks developers are slow”
lastminute.com group: an agile company
● Scrum and Kanban
● TDD
● clean code
● continuous integration
● code review
● internal communities
Starting from the monolith ...
https://www.flickr.com/photos/southtopia/5702790189
https://www.pexels.com/photo/gray-pebbles-with-green-grass-51168/
... broken into microservices
The improvements needed
● alignment
● real pipelines
● infrastructure
● resilience
● monitoring
● remove constraints
A year-long endeavour
● build a new, modern infrastructure
● migrate the search (flight/hotel) product there
... without:
● impacting the business
● throwing away our whole datacenter
TODO list
● company framework
● docker
● kubernetes
How? Teams and people
New teams
https://www.pexels.com/photo/blue-lego-toy-beside-orange-and-white-lego-toy-standing-during-daytime-105822/
Our infrastructure and technology
https://www.pexels.com/photo/colorful-toothed-wheels-171198/
Docker containers
● build once, run everywhere
● externalised configuration
registry.intra/application:v2-090025112016
[image layers: BASE OS → JAVA SDK → START/STOP SCRIPTS → JAR APPLICATION]
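A Dockerfile matching the layer stack above might look like this (the base image, script paths and jar name are assumptions, not the actual build):

```dockerfile
# BASE OS + JAVA SDK (assumed base image)
FROM openjdk:8-jdk

# START/STOP SCRIPTS layer
COPY scripts/start.sh scripts/stop.sh /opt/app/

# JAR APPLICATION layer
COPY target/application.jar /opt/app/application.jar

CMD ["/opt/app/start.sh"]
```

Because each instruction produces its own layer, rebuilding a new application version only replaces the top layers while the OS and JDK layers stay cached — the “build once, run everywhere” part.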
Kubernetes
● independent from OS/hosts
● isolated env, managed at scale
● self-healing
● externalised configuration
Omega paper: http://research.google.com/pubs/pub41684.html
https://www.pexels.com/photo/red-toy-truck-24619/
“Your infrastructure on wheels”
Kubernetes: physical representation
[cluster diagram: NODE1, NODE2, … NODE28; each node runs DOCKER, ETCD, K8S and FLANNEL]
Kubernetes: logical representation
[cluster diagram: namespaces with resource quotas]
NAMESPACE1: CPU 10, MEM 40GB
NAMESPACE2: CPU 20, MEM 50GB
NAMESPACE3: CPU 80, MEM 60GB
NAMESPACE4: CPU 5, MEM 5GB
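Per-namespace limits like these map onto Kubernetes ResourceQuota objects; a minimal sketch for the first namespace (object and namespace names are assumed):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: namespace1      # NAMESPACE1: CPU 10, MEM 40GB
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 40Gi
```

With the quota in place, the scheduler rejects new pods whose aggregate requests would exceed the namespace’s CPU or memory budget.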
Kubernetes: our architecture
[diagram: two clusters]
production: APP1-PRODUCTION, APP2-PRODUCTION, APP3-PRODUCTION
nonproduction: APP1-PREVIEW, APP1-DEVELOPMENT, APP1-QA, APP1-STRESSTEST
Kubernetes: our architecture and choices
[diagram: in the production cluster, namespace APP1-PRODUCTION holds a deployment, which manages a replica-set, which manages POD1, POD2 and POD3]
Kubernetes: our architecture and choices
[diagram: deployment → replica-set → POD1, POD2, POD3, plus a secret and a configmap in APP1-PRODUCTION]
Kubernetes: our architecture and choices
[diagram: deployment → replica-set → POD1, POD2, POD3, with a secret and a configmap; the pods are exposed through loadbalancer-app1 at app1.lastminute.intra]
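The objects in this diagram could be declared roughly as follows (names, ports and namespace are assumptions; the image tag is taken from the earlier slide):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  namespace: app1-production
spec:
  replicas: 3                  # deployment -> replica-set -> POD1..POD3
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: app1
        image: registry.intra/application:v2-090025112016
        envFrom:
        - configMapRef:        # externalised configuration
            name: app1-config
        - secretRef:
            name: app1-secret
---
apiVersion: v1
kind: Service
metadata:
  name: loadbalancer-app1     # reachable as app1.lastminute.intra
  namespace: app1-production
spec:
  selector:
    app: app1
  ports:
  - port: 80
    targetPort: 8080
```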
Kubernetes: our architecture and choices
[diagram: inside each POD of APP1-PRODUCTION, the application container runs alongside fluentd and collectd]
Kubernetes: what’s left outside?
● datastores
● distributed caches
● distributed locking
● pub-sub
● logs and metrics storage
1st try (with a test app): it seemed to work
https://www.flickr.com/photos/26516072@N00/2194001232
Self-healing
“The term self-healing describes any application, service, or system that can discover that it is not working correctly and, without any human intervention, make the necessary changes to restore itself to the normal or designed state.”
ref: https://technologyconversations.com/2016/01/26/self-healing-systems
Kubernetes agnostic interfaces
“When a container is dead I will restart it”
“When a container is ready I will forward traffic to it”
Kubernetes probes: liveness & readiness
Two questions for developers:
● when can I consider my container alive?
● when can I consider my container ready to receive traffic?
deployment.yaml:
spec:
  containers:
  - livenessProbe:
      httpGet:
        path: /liveness
      successThreshold: 3
      failureThreshold: 2
    readinessProbe:
      httpGet:
        path: /readiness
      successThreshold: 3
      failureThreshold: 2
/liveness:
● the Tomcat container is up
● the ratio of active to max threads is below a threshold
/readiness:
● all the startup jobs have run
● no termination request has been received
.. ongoing never-ending research ..
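A hypothetical sketch of the /liveness rule above (not the actual framework code): the container reports itself alive only while Tomcat is up and the active/max thread ratio stays below a threshold.

```java
// Sketch of the /liveness decision described on the slide.
// Names, signature and threshold are illustrative assumptions.
public final class LivenessCheck {

    static boolean isAlive(boolean tomcatUp, int activeThreads,
                           int maxThreads, double threshold) {
        if (!tomcatUp || maxThreads <= 0) {
            return false;                       // Tomcat down: not alive
        }
        // alive while the active/max ratio is below the threshold
        return (double) activeThreads / maxThreads < threshold;
    }

    public static void main(String[] args) {
        // 150 of 200 threads busy, 0.9 threshold: still alive
        System.out.println(isAlive(true, 150, 200, 0.9));
    }
}
```

An HTTP /liveness endpoint would then return 200 when this check passes and a 5xx otherwise, letting the kubelet restart the container once failureThreshold is reached.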
Our choices: framework - k8s
● zero downtime during rollout
● monitoring in place
● alerting
● centralized logging
● legacy infrastructure to the rescue in case of problems
2nd try (with production traffic)
... failure ... the big one!
https://www.flickr.com/photos/ghost_of_kuji/2763674926
Problems
● configuration
● infrastructure
● tools
● manual mistakes
● (external) scalability
● temporary team focus on objective
● automation
● monitoring
● go deeper into Docker/Kubernetes
Another improvement step
Pipeline: a huge step forward
microservice = factory.newDeployRequest().withArtifact("com.lastminute.application1", 2)
lmn_deployCanaryStrategy(microservice, "qa")
lmn_deployStableStrategy(microservice, "preview")
lmn_deployCanaryStrategy(microservice, "production")
[diagram: the pipeline deploys to the pods of the APP1-PRODUCTION namespace]
Monitoring: grafana/graphite/nagios
[diagram: the application and collectd inside the cluster send metrics to graphite; Grafana dashboards and nagios alerting read from graphite]
icons from http://www.flaticon.com
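The collectd → graphite leg of this picture is typically wired through collectd’s write_graphite plugin; a sketch, with host and prefix values assumed:

```conf
# Ship collectd metrics to graphite over the plaintext protocol
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "graphite">
    Host "graphite.intra"
    Port "2003"
    Protocol "tcp"
    Prefix "collectd."
  </Node>
</Plugin>
```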
“Go” deep .. whatever language it takes
https://www.pexels.com/photo/sea-man-person-ocean-2859/
There’s a light .. at the end
https://www.pexels.com/photo/grayscale-photography-of-person-at-the-end-of-tunnel-211816/
... benefits
● lead and migration time
● resilience
● root cause analysis
● speed of deployment
● instant scaling
Give me the numbers!
● 1300 req/sec in the new cluster
● 25 micro-services migrated in 4 months
● 1 week to migrate an application
● 10 minutes to create a new environment
● 11 min to gracefully roll out a new version with 55 instances
● whole pipeline runs in 16 min
● 1.5M metrics/minute flow
Yes, we’re hiring!
THANKS
www.lastminutegroup.com