Building a distributed data-platform - A perspective on current trends in computing

Data, dev-ops, and cloud services

Building a distributed data-platform

Charles Care

Engineering Team
Kasabi / Talis

Talk overview

● About me...
● What Kasabi is,
    ● what we are trying to do
    ● how we are working to achieve that
    ● a quick walk-through
● Discussion of the Kasabi platform team
    ● Our technology / architecture
    ● Our engineering culture
    ● Lessons learnt

Views are mine...

…and not necessarily those of my (current/past) employers

About me...


● 2001-2004 – BSc Computer Science (Warwick)
● 2004-2008 – PhD Computer Science (Warwick)
● 2007-2011 – BT Plc
    ● Technical risk analyst – BT Global MPLS Network
    ● Software Engineer – Infrastructure for Financial Markets
    ● Senior Software Engineer – Central software standards and tools
● 2011-Present – Talis/Kasabi
    ● Software Engineer – Semantic web platform

About Kasabi


● Data marketplace
● Bringing together data...
    ● owners
    ● consumers

● Lowering the barrier for data-driven apps to enter the market

● Enabling new opportunities for aggregating and mixing data

Data licensing today

Data Owners Data Consumers

Bespoke, expensive contracts

Kasabi as a data platform

Data Owners

Third-party services

Application Developers

Data enthusiasts

Data engineers

API developers

About Kasabi

● Publish datasets using standard APIs
● Access data using standard APIs
    ● Query a dataset using SPARQL
    ● Search a dataset using a simple full-text search

● Define, contribute, and share your own APIs
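
As a rough illustration of querying a dataset with SPARQL over HTTP, the sketch below builds a GET request URL in the style of the standard SPARQL Protocol, which carries the query text in a `query` parameter. The endpoint URL and the helper name `build_sparql_url` are placeholders for illustration only; they are not Kasabi's actual API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "http://example.org/dataset/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL carrying the query in the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_sparql_url(ENDPOINT, QUERY)
print(url)
```

The URL could then be fetched with any HTTP library; authentication and result formats depend on the service and are left out here.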

Data marketplace

http://www.kasabi.com/


A dataset

Access data using standard APIs

Contribute custom APIs

Example – contributed APIs

Current organisation

● Product development
● Data engineering
● Customer operations
● Platform development


Platform architecture

Data Platform

Load balancing and routing

Update services Search services Query services

Datasets

● Need to store and update datasets
● Access data via various services
● Must scale with load and increasing data
● Must be tolerant of failure
● Extensible
    ● Should be easy to add new services over time

To distribute...

...or not to distribute

Distributed Platform

[Diagram: a routing layer in front of a dynamic gossip network of update, search, and SPARQL services, with room for a new service, alongside sequence, storage, and monitoring services]


Distributed Platform – updates

[Diagram: routing layer; update, search, and SPARQL services; sequence service; storage service; monitoring services]

- Updates are sequenced
- Data stored in distributed storage
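
The sequencing step could be sketched as follows: a sequence service hands out strictly increasing numbers, and each update is written to storage under its number. `SequenceService`, `Storage`, and `submit_update` are hypothetical names, and an in-memory dictionary stands in for the distributed storage service; the real platform's interfaces are not shown in the slides.

```python
import itertools
import threading


class SequenceService:
    """Hands out strictly increasing sequence numbers (1, 2, 3, ...)."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next(self) -> int:
        # The lock keeps numbers unique when several threads ask at once.
        with self._lock:
            return next(self._counter)


class Storage:
    """In-memory stand-in for the distributed storage service."""

    def __init__(self):
        self._updates = {}

    def put(self, seq: int, payload: str) -> None:
        self._updates[seq] = payload

    def get(self, seq: int) -> str:
        return self._updates[seq]


def submit_update(sequencer: SequenceService, storage: Storage, payload: str) -> int:
    """Assign the update a sequence number, store it, and return the number."""
    seq = sequencer.next()
    storage.put(seq, payload)
    return seq


sequencer = SequenceService()
storage = Storage()
first = submit_update(sequencer, storage, "add triple A")
second = submit_update(sequencer, storage, "add triple B")
```

Giving every update a single global order is what lets each downstream service replay the same changes in the same sequence.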


Distributed Platform – updates

[Diagram: routing layer; update, search, and SPARQL services; monitoring services]

- Updates are gossiped around network
- Here a SPARQL node realises that it should apply the update
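
A toy, single-process simulation of this idea is sketched below: each node remembers the updates it has heard about, applies them strictly in sequence order, and passes new ones on to its peers. The `Node` class and its fields are hypothetical; a real gossip network would run over the network and typically forward to a random subset of peers rather than all of them.

```python
class Node:
    """A service node that applies sequenced updates strictly in order."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.seen = {}        # seq -> payload, everything heard so far
        self.applied = []     # payloads applied, in sequence order
        self.next_seq = 1     # next sequence number this node may apply

    def receive(self, seq, payload):
        if seq in self.seen:
            return            # already known, so stop re-gossiping
        self.seen[seq] = payload
        # Apply any contiguous run of updates that is now available.
        while self.next_seq in self.seen:
            self.applied.append(self.seen[self.next_seq])
            self.next_seq += 1
        # Pass the news on to every peer.
        for peer in self.peers:
            peer.receive(seq, payload)


update, search, sparql = Node("update"), Node("search"), Node("sparql")
update.peers = [search]
search.peers = [sparql]
sparql.peers = [update]

# Updates arrive out of order; every node still applies them in order.
update.receive(2, "add triple B")
update.receive(1, "add triple A")
```

Until update 1 arrives, update 2 is only buffered, which is one simple way to picture the eventual consistency the platform has to manage.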


Distributed Platform – query

[Diagram: routing layer; update, search, and SPARQL services; monitoring services]

- SPARQL queries will now reflect the update that was submitted

Monolithic vs distributed

● Monolithic
    ● Easy to synchronise events and data
    ● Consistent views and queries
    ● Less inter-process communication / less network overhead
    ● Easier to optimise for high throughput
    ● Single code-base
    ● Fewer processes to monitor
● Distributed
    ● Service-oriented - separate concerns run in isolated processes (and can be scaled independently)
    ● Development is component-based
        – Changes are more focussed / helps avoid scope-creep
    ● Deployment can be localised to avoid downtime
    ● Failure is more likely – so you need to plan for it
    ● Easier to integrate out-of-the-box software – e.g. using standard Apache Solr

Distributed data platform

● Separate services for each API

● Communication via Gossip messages

● Have to manage eventual consistency

● Highly scalable

● Easy to add new services

● Use standard protocols and open-source components
    ● HTTP libraries / REST / ZeroMQ / Apache Thrift
    ● RDF and SPARQL using Apache Jena
    ● Search using Apache Solr
    ● Avoid modification and forks

● Deploy into Amazon EC2 (also using: S3, EMR, and ELB)

Benefits of using cloud services

Consider a start-up in 2002

● Have an idea...

● Get funding (development, op-ex, cap-ex)

● Acquire servers
● Set up your servers
    – mail, web, source code repo, build systems
    – development, staging, live

● Some 'cloud' services

– …, SourceForge, shared servers, etc

● Build and go to market
    ● Probably embedding open-source components

● Delivery based on full-stack, monolithic, architectures

Consider a start-up in 2012

● Have an idea...

● Get funding (development capital, op-ex)
    ● You will probably not get cap-ex
● Use cloud services... rent rather than buy
    ● SaaS – Software as a Service
        – Why would you run your own (chat/email etc)?
        – Host your code in GitHub/Bitbucket etc
    ● PaaS – Platform as a Service
        – Do you need to control the full stack?
        – Could you leverage platforms like Heroku, Joyent, App Engine etc?
        – Amazon RDS
    ● IaaS – Infrastructure as a Service
        – Cloud services to provide 'bare metal'
● Build and go to market quickly
● Scale elastically over time

But what about the enterprise?

● Benefits of cloud services are already transforming the enterprise
    ● Private clouds
    ● Virtual appliances
    ● Cloud bursting
    ● Independent scaling
    ● Separation of concerns
    ● SOA architecture
● And in future...
    ● Appetite for IaaS is growing
    ● PaaS and SaaS will follow.
    ● Perimeter security will be replaced by localised security boundaries

So how do we build this stuff...?

How it all happens

● Constantly iterating through...
    ● Requirements
    ● Development (Test-driven)
    ● Testing/Review
    ● Deployment
    ● Operation
● We're an Agile, dev-ops team... so all the above is a shared responsibility

Being a dev-ops team...

● Removing barriers between development and operations

● Shared responsibilities rather than distrust

● Everyone has root access

● Developers are responsible for operating the systems they build

● Everyone is free to make changes

...and responsible for managing the roll-out of those changes

● Ops/Deployment/Monitoring are automated

● Everyone should have full-stack awareness

● Read more...
    ● http://dev2ops.org/blog/2010/2/22/what-is-devops.html
    ● http://www.jedi.be/blog/
    ● http://en.wikipedia.org/wiki/Devops
    ● http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr


Life-cycle of a change

Requirements and Planning

● Identification of requirement
● Planning
    ● Break down big changes into smaller tasks
        – Can the change be deployed in small steps?
        – Can the change be dark-deployed?
    ● Understand the wider impact
    ● Find middle ground between generic and specific
● Team is self-organising
    ● People pull work from the prioritised, planned stories

Branch based development

● One branch per change, squash before merge

Writing the code

● Work on a branch
    ● don't know if/when you'll merge
● Test-driven
    ● Unit tests first
    ● Do acceptance tests need to change?
● What technology? Which tool-sets?
● Smoke testing
    ● How do you know it works?
    ● What's different in production?
    ● What are the risks of failure?
● Feature flags?
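
One way to picture a feature flag is sketched below: new behaviour ships in the code but stays switched off until the flag is flipped, so a change can be deployed dark and enabled later without another release. The flag name, the `FLAGS` dictionary, and `rank_results` are all invented for this illustration; a real system would more likely read flags from configuration.

```python
# Hypothetical flag store; in practice this might come from configuration.
FLAGS = {"new_search_ranking": False}


def is_enabled(flag: str) -> bool:
    """Unknown flags default to off, so a typo never enables anything."""
    return FLAGS.get(flag, False)


def rank_results(results):
    if is_enabled("new_search_ranking"):
        # New behaviour, deployed dark: shortest titles first.
        return sorted(results, key=len)
    # Existing behaviour: plain alphabetical order.
    return sorted(results)


before = rank_results(["pear", "fig", "apple"])
FLAGS["new_search_ranking"] = True   # flip the flag, no redeploy needed
after = rank_results(["pear", "fig", "apple"])
```

Turning the flag back off is also the quickest rollback if the new path misbehaves in production.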

Tests run: 110, Failures: 0, Errors: 0, Skipped: 2

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39 seconds
[INFO] Finished at: Sat Feb 18 15:20:36 GMT 2012
[INFO] Final Memory: 33M/240M
[INFO] ------------------------------------------------------------------------

Writing the code

● Avoid unnecessary scope-creep
    ● “I'll just fix this...”
    ● “It would be much cleaner if I re-factored this...”
    ● “It would be neat if I also added this...”
    ● …however, these observations can be written as new stories
    ● …and sometimes it's good to fix things before they cause pain
    ● …if extra changes are really necessary, can they be implemented separately?
    ● …team should be empowered to fix technical debt
    ● ...managing scope-creep is a shared responsibility
● Be prepared to abandon a change if it's taking too long. Maybe it needs more planning?

● Should you be pairing?

● Should you demo your work?

Code review

● Code review possible with tools for distributed teams (e.g. Gerrit or ReviewBoard)

● If you're not following a strict pairing policy, code-review is vital

● Useful to make others aware of changes

● Gerrit
    ● Build agent automatically builds your change and runs tests – verify +/- 1
    ● Invite others to review your code; they can give it a score between -2 and +2
    ● Can only deploy code once at least one person has given a +2
    ● Work-flow is customisable
    ● Self-organising... anyone can review
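
The gating rule described above can be written down as a small predicate. This is only a sketch of the rule as stated on the slide, not Gerrit's own implementation: the function name `can_submit` is invented, and treating a failed build or a -2 score as blocking is an assumption (the work-flow is customisable, so a team may choose differently).

```python
def can_submit(verified, code_review):
    """Decide whether a change may be merged and deployed.

    verified:    votes from the build agent, each -1 or +1
    code_review: scores from human reviewers, each between -2 and +2
    """
    build_ok = 1 in verified and -1 not in verified   # assumption: any failed build blocks
    approved = 2 in code_review                       # at least one reviewer gave +2
    vetoed = -2 in code_review                        # assumption: a -2 acts as a veto
    return build_ok and approved and not vetoed


print(can_submit(verified=[1], code_review=[1, 2]))
print(can_submit(verified=[1], code_review=[1, 1]))
```

Note that two +1 scores do not add up to a +2 in this sketch; someone has to be confident enough to approve outright.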

$> git commit
$> git review

Code review (2)

Code review (3)

Merge / Deployment

● Merge & Deployment
    ● One-click deployment
    ● Developer should press the button
    ● Code is merged into the master/release branch
    ● Build server automatically checks out the code and builds, tags, and uploads the release to an artefact repository
    ● Package is automatically deployed on all servers
        – Extra orchestration for external-facing services to avoid “thundering-herd” problems
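
The slides do not say what the extra orchestration looks like. One common approach, shown here purely as an assumption, is to roll the package out in small batches with a pause between them, so that external-facing servers do not all restart and reconnect at the same moment. `rolling_deploy` and its parameters are hypothetical names.

```python
import time


def rolling_deploy(servers, deploy, batch_size=2, pause_seconds=0.0):
    """Deploy to servers in small batches so they do not all restart at once."""
    batches = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            deploy(server)
        batches.append(batch)
        # Pause between batches (but not after the last one) so the
        # freshly deployed servers can warm up before the next group goes.
        if start + batch_size < len(servers):
            time.sleep(pause_seconds)
    return batches


deployed = []
batches = rolling_deploy(
    ["web1", "web2", "web3", "web4", "web5"],
    deploy=deployed.append,   # stand-in for the real deployment step
)
```

In a real roll-out the pause would be non-zero, and each batch would usually be health-checked before the next one starts.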

Managing infrastructure

● Puppet or Chef

● Build packages (e.g. DEB or RPM)

● Centralise configuration management

● Utilising cloud compute infrastructure
    ● Amazon EC2
    ● Amazon S3
    ● Elastic load balancers
    ● Elastic Map-Reduce
● Application monitoring
    ● Metrics
    ● Log analysis
    ● Internal monitoring
    ● External checks

Lessons learnt

(again, my views!)

Technical lessons learnt

● Use distributed SOA-based services to reduce tight-coupling

● Monitor everything...
● Leverage cloud offerings
    ● wrap them with well-defined interfaces to avoid lock-in
● Design systems to scale
● Use open and unmodified components where possible
    ● Standard components fronting external APIs
    ● E.g. Jena, Solr, HAProxy, Apache
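
Wrapping a cloud offering behind a well-defined interface might look like the sketch below. Application code depends only on a small `BlobStore` interface, so the backing service can be swapped without touching callers. All names here (`BlobStore`, `InMemoryBlobStore`, `archive_dataset`) are hypothetical; an S3-backed class with the same two methods would be a second implementation and is not shown.

```python
from typing import Dict, Protocol


class BlobStore(Protocol):
    """The only storage interface the rest of the application sees."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Simple implementation for tests and local development."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_dataset(store: BlobStore, name: str, content: bytes) -> str:
    """Application code: knows about BlobStore, not about any vendor."""
    key = f"datasets/{name}"
    store.put(key, content)
    return key


store = InMemoryBlobStore()
key = archive_dataset(store, "example", b"<s> <p> <o> .")
```

A side benefit is that the in-memory version makes unit tests fast and keeps them independent of any cloud account.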

Practices that have helped us

● Dev-ops culture
● Pragmatic approach to agile development
    ● Task allocation should be 'pull', rather than 'push'
    ● Teams should be self-organising
    ● Pairing when working on new problems
● Test-Driven-Development (TDD)
● Continuous integration
● Peer-review of code
● Continuous deployment

…so, in summary...

Conclusion

● Isolate your design into components
● Empower your team to release small changes frequently
● Leverage hosted/cloud offerings

Thanks for listening!

Credits

● Thanks for the invite to speak
● Thanks to Kasabi / Talis Systems Ltd

● Sign up at http://www.kasabi.com

Graphics from http://www.iconarchive.com/, http://www.oxygen-icons.org and http://www.icons-land.com



Questions?

Date post:	19-Jan-2015
Category:	Technology
Upload:	charles-care