Date post: | 16-Apr-2017 |
Upload: | dataworks-summithadoop-summit |
Welcome!
‣(let’s define pipeline)
‣Background
‣Docker improving engineering experience
‣Docker piece of puzzle to handle growth
‣Practical advice
Spotify & me
‣Spotify
  ‣Streaming music
  ‣Celebrates 10 years this summer
  ‣30m subscribers, most users on free tier
  ‣Millions of concurrent users
‣...me
  ‣At Spotify for 6 years
  ‣From fewer than 50 engineers to more than 1000
  ‣Operations engineering
  ‣Backend development
  ‣Free Software
  ‣Data Infrastructure
Humble beginnings
‣Counting stream playbacks
‣Stack of servers in the Fußball-room
‣Hadoop Streaming, Python
‣Quick excursion to Amazon in 2012
Spotify engineering org
‣A lot of autonomy
‣Big data touches many different teams
  ‣Finance
  ‣Analytics
  ‣Feature development (A/B testing)
  ‣Recommendations
  ‣Payments and fraud
Shared resources, packaging
‣Started out with some shared edge nodes
  ‣chaos ensued
‣More edge nodes!
  ‣more chaos? more chaos!
‣Shared execution environment
  ‣from .deb to .jar
  ‣still a lot of one-off edge nodes
Brief introduction to docker
‣Containers seem like virtual machines
‣docker run -it <image_name>
‣Filesystem reset between invocations
‣Typically built using a Dockerfile
‣Image inheritance
Docker at Spotify
‣Big bet on docker for services: helios
‣Lots of useful infrastructure
‣Solves some immediate packaging problems
What does Docker provide?
‣Useful abstraction to reason about
‣an incremental way out of dependency hell
‣Artefact distribution, caching
‣Image inheritance mechanism for sharing infrastructure
Switching to docker in practice
‣Previously
  ‣maven project with java, python, cron file
  ‣build step to upload resulting jar to artifactory
  ‣build step to copy cron file to execution cluster
‣Now
  ‣add Dockerfile, data infrastructure base image
  ‣build step to build and upload image
Problems with cron cluster execution
‣Implicit deployment via CI/CD declaration
‣Status reported via output materialising
‣Who / what triggered job X?
‣Where does it run?
‣Debugging is a pain
Our solution: execution as a service
‣RESTful API for pipeline execution
‣List your job invocations
‣Explicitly schedule execution on node
‣Don’t rerun successful execution
‣Interface: docker image
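The bullets above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual service: the class names, the `runner` callable, and the job, image, and node names are all hypothetical. It records every invocation, schedules each run on an explicitly named node, and returns the earlier record instead of rerunning a job that already succeeded for the same partition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Invocation:
    job: str          # logical pipeline name
    partition: str    # what the run covers, e.g. a date
    image: str        # docker image that implements the job
    node: str         # node the run was explicitly scheduled on
    status: Status


@dataclass
class ExecutionService:
    # Hypothetical hook: (image, partition, node) -> True on success.
    # A real runner might start the image with `docker run` on the node.
    runner: Callable[[str, str, str], bool]
    history: List[Invocation] = field(default_factory=list)

    def trigger(self, job, partition, image, node):
        # Don't rerun an execution that already succeeded.
        for inv in self.history:
            if (inv.job, inv.partition) == (job, partition) and inv.status is Status.SUCCEEDED:
                return inv
        ok = self.runner(image, partition, node)
        inv = Invocation(job, partition, image, node,
                         Status.SUCCEEDED if ok else Status.FAILED)
        self.history.append(inv)
        return inv

    def invocations(self, job):
        # Answers "who / what triggered job X, and where did it run?"
        return [inv for inv in self.history if inv.job == job]
```

Because the only contract between the service and a pipeline is the image name, the service needs no knowledge of what is inside the image.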
Scaling is hard
‣2000 nodes
‣100 PB storage
‣800 000 000 files in HDFS
‣180GB heap, 10G young generation
‣Adding 100TB data per day
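A rough back-of-the-envelope reading of these numbers, as a sketch only. It assumes decimal units, that the 100 PB is spread across the 800 000 000 files (the slide does not say whether the figure includes replication), and that the 180 GB heap belongs to the process tracking those files, which the slide implies but does not state.

```python
# Decimal units are an assumption: 1 PB = 10**15 bytes.
PB, TB, GB = 10**15, 10**12, 10**9

storage_bytes = 100 * PB
file_count = 800_000_000
heap_bytes = 180 * GB
daily_growth_bytes = 100 * TB

# Average bytes of storage per file.
avg_file_bytes = storage_bytes // file_count

# Heap available per tracked file.
heap_bytes_per_file = heap_bytes // file_count

# Days of growth at the current rate needed to add another 100 PB.
days_to_double = storage_bytes // daily_growth_bytes

print(avg_file_bytes, heap_bytes_per_file, days_to_double)
```

Under those assumptions the average file is about 125 MB, there are roughly 225 bytes of heap per file, and the daily intake is 0.1% of total storage.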
Docker as vehicle for migration
‣Our path forward: Google Cloud
‣Decouple storage from compute
‣Transparent switch from on-premises Hadoop to Dataproc and Cloud Storage
‣Entry point executable in base image
‣Auth, config, dynamic cluster allocation
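One way to picture the entry point in the base image is a small resolver that pipelines call instead of hard-coding a cluster or file system. The sketch below is an assumption about the shape of such a wrapper, not the real executable: the environment variable names `PIPELINE_BACKEND` and `PIPELINE_BUCKET` and the `Target` type are hypothetical, and auth and cluster allocation are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    compute: str       # "hadoop" (on premises) or "dataproc"
    storage_root: str  # URI prefix that logical paths are resolved against


def resolve_target(env):
    """Pick compute and storage from the environment (hypothetical variable names)."""
    backend = env.get("PIPELINE_BACKEND", "onprem")
    if backend == "onprem":
        return Target("hadoop", "hdfs://")
    if backend == "gcp":
        return Target("dataproc", f"gs://{env['PIPELINE_BUCKET']}")
    raise ValueError(f"unknown backend: {backend!r}")


def resolve_path(target, logical_path):
    """Map a logical path such as /logs/2017-04-16 onto the chosen storage."""
    return f"{target.storage_root}/{logical_path.lstrip('/')}"
```

With this split, the pipeline code only ever sees logical paths, so switching a job from HDFS to Cloud Storage is a change in how the image is run rather than a change in the pipeline itself.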
Where are we now?
‣Two squads are using dockerized pipelines in production
‣Still using luigi, pull based dependencies
‣Styx, execution as service soon in prod
‣Google cloud migration as we speak
‣Docker drives transparent migration
Some practical docker advice
‣Reproducible normalised builds
‣Explicit versioning
‣Split code, configuration, secrets
‣github.com/spotify/dockerfile-maven
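As a small illustration of explicit versioning and of keeping code, configuration, and secrets apart, here is a hedged Python sketch. The function names, the `PIPELINE_OUTPUT_ROOT` variable, and the `api_token` file name are hypothetical: the image carries only code, configuration arrives through the environment at run time, and secrets are read from files mounted into the container.

```python
from pathlib import Path


def image_ref(repository, version):
    """Return an explicitly versioned image reference; mutable tags are rejected."""
    if not version or version == "latest":
        raise ValueError("pin an explicit version instead of a mutable tag")
    return f"{repository}:{version}"


def load_settings(env, secrets_dir):
    """Read configuration from the environment and secrets from mounted files."""
    return {
        # Hypothetical variable name, supplied at run time rather than baked into the image.
        "output_root": env.get("PIPELINE_OUTPUT_ROOT", "/tmp/output"),
        # Hypothetical file name; the secret lives outside both the image and the environment.
        "api_token": (Path(secrets_dir) / "api_token").read_text().strip(),
    }
```

Keeping the three apart means the same image can be promoted unchanged between environments, with only the run-time configuration and mounted secrets differing.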
Thank you!
Don’t be a [email protected]
@blippie