
No SQL at The Guardian

Date post: 07-Dec-2014
Upload: mat-wall
Description:
Presentation given at No:SQL EU conference describing architectures past, present & future for guardian.co.uk
Transcript
Page 1: No SQL at The Guardian

NoSql at guardian.co.uk

Matthew Wall

Simon Willison

Page 2: No SQL at The Guardian
Page 3: No SQL at The Guardian

!

Page 4: No SQL at The Guardian

SQL

Page 5: No SQL at The Guardian
Page 6: No SQL at The Guardian
Page 7: No SQL at The Guardian
Page 8: No SQL at The Guardian

Not only SQL

Page 9: No SQL at The Guardian

Guardian journalism online: 1995

Page 10: No SQL at The Guardian

Guardian journalism online: 1999

Page 11: No SQL at The Guardian

Guardian journalism online: 2000

Page 12: No SQL at The Guardian

Guardian journalism online: 2010

Page 13: No SQL at The Guardian

Read all about it!

Page 14: No SQL at The Guardian

I bring you NEWS!!!

App server App server App server

Web server Web server Web server

CMS Data feeds

Oracle

Memcached (20Gb)

Page 15: No SQL at The Guardian

I bring you NEWS!!!

App server App server App server

Web server Web server Web server

CMS Data feeds

Oracle

Memcached

Why RDBMS?

5 years ago, fewer alternatives

Understand operations procedures

Can easily recruit DBAs / devs

Developer/ops tools

Business critical system: a safe choice

Page 16: No SQL at The Guardian
Page 17: No SQL at The Guardian
Page 18: No SQL at The Guardian
Page 19: No SQL at The Guardian
Page 20: No SQL at The Guardian

Related content from search engine

Page 21: No SQL at The Guardian

Introduction of memcached

Related content from search engine

Page 22: No SQL at The Guardian

Introduction of memcached

Big traffic spike

Related content from search engine

Page 23: No SQL at The Guardian

Distributed memcached

Protects database from peak load

Entities explicitly decached

Queries given TTL

memcached = database supercharger

Page 24: No SQL at The Guardian

Now we have a stable “broadcast” platform

We know how to scale it

SQL running effectively at core

We’ve finished, right?

Page 25: No SQL at The Guardian

Digital journalism is changing

We can’t cover everything

We can’t compete with everyone

Need to be “part of the web” not just “on the web”

Page 26: No SQL at The Guardian

Mutualise the news!

Page 27: No SQL at The Guardian

Mutualised news!

Mutualisation of journalism

No longer only broadcasting content

User engagement & contribution: journalism, data, software

Data curation / linked data

Support engaged developers with data and APIs

Page 28: No SQL at The Guardian

Mutualised news!

Be a part of the data fabric of the internet

Page 29: No SQL at The Guardian

Mutualised news!

Platform strategy

Out: Release our data to the world via APIs

In: Rapidly build new functionality outside the core

Write: Ingest, store & present arbitrary data

Page 30: No SQL at The Guardian

Mutualised news!

Data Out

Content API

Page 31: No SQL at The Guardian

Mutualised news!

Content API

Delivered using Apache Solr

Document oriented search engine

Loose schema: records, fields, facets

Fields can be multi-value

Supports dynamic field generation

Can apply multiple facets in queries faster than RDBMS
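
To make "records, fields, facets" concrete, here is a small pure-Python sketch of what a facet count over loosely structured, multi-valued records amounts to. It does not use Solr; the records and field names are invented for illustration.

```python
from collections import Counter

# Hypothetical records: no fixed schema, and "tags" holds several values.
documents = [
    {"id": "a1", "section": "politics", "tags": ["election", "labour"]},
    {"id": "a2", "section": "politics", "tags": ["election"]},
    {"id": "a3", "section": "sport", "tags": ["football"], "byline": "A. Writer"},
]


def field_values(doc, field):
    """Return a field's values as a list, whether it is single- or multi-valued."""
    values = doc.get(field, [])
    return values if isinstance(values, list) else [values]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        counts.update(set(field_values(doc, field)))  # once per document
    return counts


def filter_by(docs, field, value):
    """Keep documents whose `field` equals or contains `value`."""
    return [doc for doc in docs if value in field_values(doc, field)]
```

Filtering by one facet and then counting another (for example, tags within the politics section) is the drill-down pattern that faceted search supports directly.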

Page 32: No SQL at The Guardian

Mutualised news!

Page 33: No SQL at The Guardian

Mutualised news!

Page 34: No SQL at The Guardian

Mutualised news!

Page 35: No SQL at The Guardian

Mutualised news!

Is Solr a database?

Page 36: No SQL at The Guardian

Mutualised news!

Can perform complex queries, including full text search

Can filter results with facets (WHERE clause)

ANYTHING can be a facet. Very powerful.

On our dataset most queries are of a similar cost

Scales very well horizontally

Handles millions of documents
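
As a rough sketch of how a full-text query, a filter and facets combine in one Solr request, the helper below builds a select URL from Solr's standard `q`, `fq`, `facet`, `facet.field` and `wt` parameters. The host, core name and field names are assumptions, not the Guardian's real configuration.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, filters=None, facet_fields=()):
    """Build a Solr select URL with filter queries and facet fields."""
    params = [("q", text), ("wt", "json")]
    for field, value in (filters or {}).items():
        # fq narrows the result set, playing the role of a WHERE clause.
        # Values containing double quotes would need escaping first.
        params.append(("fq", f'{field}:"{value}"'))
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))  # ask for counts per value
    return f"{base_url}/select?{urlencode(params)}"


url = build_solr_query(
    "http://localhost:8983/solr/content",  # hypothetical host and core name
    "budget",
    filters={"section": "politics"},       # hypothetical field names
    facet_fields=["tag", "contributor"],
)
```

Each extra facet is one more `facet.field` parameter rather than another join or GROUP BY, which is one way to read the slide's point that anything can be a facet.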

Page 37: No SQL at The Guardian

Mutualised news!

No transactions

Excellent for certain types of queries

Not truly general purpose

Schema design very important

Search index not really persistence

Page 38: No SQL at The Guardian

[Architecture diagram. Labels: App server; Web servers; CMS; Memcached (20Gb); rdbms; Core; M/Q; Solr (six instances); Cloud, EC2; Api]

Page 39: No SQL at The Guardian

Mutualised news!

API

Currently powering iPad app

Site components

External applications

Editors’ tools

More to follow

Page 40: No SQL at The Guardian

Mutualised news!

Data In

Application framework

Page 41: No SQL at The Guardian

Mutualised news!

Application framework

Simple REST/ HTTP framework allows lightweight development

Applications proxied for performance

Apps generally hosted in the cloud, hot deployment into production

No RDBMS provided for storage

Can develop in news timeline
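
A lightweight HTTP app of the kind described might look like the sketch below. The talk does not say which language or framework was used; Python's WSGI interface, the `/poll/results` path and the placeholder payload are all assumptions made for illustration.

```python
import json


def app(environ, start_response):
    """Minimal WSGI app: one JSON resource, cacheable by a front-end proxy."""
    is_get = environ.get("REQUEST_METHOD") == "GET"
    if not is_get or environ.get("PATH_INFO") != "/poll/results":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    body = json.dumps({"yes": 0, "no": 0}).encode("utf-8")  # placeholder payload
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # A short max-age lets the proxy absorb repeat requests.
        ("Cache-Control", "public, max-age=30"),
    ]
    start_response("200 OK", headers)
    return [body]
```

For local testing it can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. The app holds no relational database connection, in keeping with the "No RDBMS provided" constraint.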

Page 42: No SQL at The Guardian

[Architecture diagram. Labels: App server; Web servers; CMS; Memcached (20Gb); rdbms; Core; M/Q; App (six instances); Apps; Proxy; external hosting, app engine etc]

Page 43: No SQL at The Guardian

NoSQL for journalism

Page 44: No SQL at The Guardian

Some useful characteristics

• Scale down as well as up

• Support rapid production-ready prototyping: turn projects around in hours or days

• Handle massive traffic spikes

Page 45: No SQL at The Guardian

Desktop analysis

• Leaked BNP membership list

• Load postcodes to constituencies mapping in to Redis

• Generate heatmaps by looking up all 12,000 postcodes
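
The lookup step can be sketched as below. A plain dict stands in for Redis so the example runs without a server; with Redis the mapping would be loaded with one `SET` per postcode and read with `GET`. The postcodes and constituency names are made up.

```python
from collections import Counter

# Stand-in for Redis: each postcode key maps to a constituency name.
# All values below are invented for illustration.
postcode_to_constituency = {
    "AB1 2CD": "Constituency A",
    "EF3 4GH": "Constituency A",
    "IJ5 6KL": "Constituency B",
}


def normalise(postcode):
    """Upper-case and collapse whitespace so lookups match the loaded keys."""
    return " ".join(postcode.upper().split())


def members_per_constituency(postcodes, lookup):
    """Count list entries per constituency, skipping unknown postcodes."""
    counts = Counter()
    for postcode in postcodes:
        constituency = lookup.get(normalise(postcode))
        if constituency is not None:
            counts[constituency] += 1
    return counts
```

The per-constituency counts are the input a heatmap needs; each lookup is a single key read, so 12,000 of them is a small job for a desktop machine.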

Page 46: No SQL at The Guardian

MPs’ expenses

Page 47: No SQL at The Guardian

MPs’ expenses

SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()

Page 48: No SQL at The Guardian

v2 used Redis

Page 49: No SQL at The Guardian

v2 used Redis

Set difference: Labour MP pages - reviewed pages

SRANDMEMBER
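
A sketch of the set-based approach, with Python sets standing in for Redis sets so it runs without a server. The comments name the Redis commands each step corresponds to; the page IDs and function names are hypothetical.

```python
import random

# Python sets stand in for Redis sets; page IDs are hypothetical.
labour_mp_pages = {"page:101", "page:102", "page:103", "page:104"}
reviewed_pages = {"page:102", "page:104"}


def next_page_to_review(candidates, reviewed, rng=random):
    """Pick a random unreviewed page, or None when nothing is left."""
    remaining = candidates - reviewed        # Redis: SDIFF / SDIFFSTORE
    if not remaining:
        return None
    return rng.choice(sorted(remaining))     # Redis: SRANDMEMBER


def mark_reviewed(page, reviewed):
    """Record that a page has been reviewed."""
    reviewed.add(page)                       # Redis: SADD
```

By contrast, `ORDER BY RAND()` typically makes the database assign a random value to every matching row and sort them all just to hand back one, which gets expensive as traffic and row counts grow.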

Page 50: No SQL at The Guardian

BigTable: Zeitgeist

Page 51: No SQL at The Guardian

Zeitgeist stores pre-calculated results in BigTable

• Data comes in from stats system, comments system and OneRiot real-time search API

• AppEngine cron tasks populate task queues

• Task queues recalculate hotness levels

• “Live” BigTable queries are simple SELECT / SORT
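
The pipeline above separates slow background calculation from a cheap read path. A minimal sketch of that split follows; the signal names, weights and scoring formula are assumptions (the talk does not give the real hotness calculation), and plain dicts stand in for the datastore and task queue.

```python
# Hypothetical inputs: per-article signals gathered from several feeds.
signals = {
    "article-1": {"page_views": 1200, "comments": 40, "mentions": 5},
    "article-2": {"page_views": 300, "comments": 90, "mentions": 30},
    "article-3": {"page_views": 50, "comments": 2, "mentions": 0},
}

# Assumed weights; the source does not give the real hotness formula.
WEIGHTS = {"page_views": 1, "comments": 10, "mentions": 20}

hotness_table = {}  # stand-in for the stored, pre-calculated results


def enqueue_all():
    """Scheduled job: queue one recalculation task per article."""
    return list(signals)


def recalculate_hotness(article_id):
    """Background task: compute one article's score and store it."""
    counts = signals[article_id]
    hotness_table[article_id] = sum(WEIGHTS[name] * counts[name] for name in WEIGHTS)


def hottest(limit):
    """Live read path: sort the stored scores, no calculation at request time."""
    ranked = sorted(hotness_table.items(), key=lambda item: item[1], reverse=True)
    return [article_id for article_id, _ in ranked[:limit]]


for task in enqueue_all():
    recalculate_hotness(task)
```

Because every score is computed ahead of time, the request-time work stays a simple sorted read regardless of how complex the scoring becomes.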

Page 52: No SQL at The Guardian

Live debate poll

• Over a million votes cast in an hour

• Stretched limits of BigTable / AppEngine

• Sharded counter pattern to handle writes
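
The sharded counter pattern splits one hot counter into several independently written shards and sums them on read. The sketch below shows the idea in plain Python; in a datastore each shard would be a separate stored entity. The shard count and poll options are illustrative assumptions.

```python
import random


class ShardedCounter:
    """Spread increments over several shards so writes rarely collide."""

    def __init__(self, num_shards=20, rng=None):
        self._shards = [0] * num_shards  # each would be its own stored entity
        self._rng = rng or random.Random()

    def increment(self, amount=1):
        # Pick a shard at random; concurrent writers mostly hit different ones.
        index = self._rng.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self):
        # Reads cost more than writes: they sum every shard.
        return sum(self._shards)


votes = {"yes": ShardedCounter(), "no": ShardedCounter()}  # hypothetical options
for _ in range(1000):
    votes["yes"].increment()
for _ in range(250):
    votes["no"].increment()
```

More shards raise the sustainable write rate at the cost of a slower total, which suits a poll where votes far outnumber result reads (and where the total can itself be cached).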

Page 53: No SQL at The Guardian

Spreadsheets are NoSQL too...

Page 54: No SQL at The Guardian

Google Docs powered infographics

Page 55: No SQL at The Guardian

The Datablog

Page 56: No SQL at The Guardian

• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets

• Retrieve data as CSV, XLS, JSON, Atom...

• “Make a copy” and run your own analysis

Page 57: No SQL at The Guardian

Mutualised news!

Write

Arbitrary data

Page 58: No SQL at The Guardian

Mutualised news!

Create schema-free database alongside RDBMS

Index in Solr

Provide access in API

Investigating: CouchDB

Page 59: No SQL at The Guardian

[Architecture diagram. Labels: App server; Web servers; CMS; Data feeds; Memcached (20Gb); rdbms; CouchDB?; Core; M/Q; Out: Solr (six instances), Cloud, EC2; In: App (six instances), Proxy, external hosting, app engine etc]

