MongoDB and Python as a Market Data Platform

Date post: 27-Aug-2014
Upload: james-blackburn
Description:
This talk covers using MongoDB as a low-latency, high-throughput key-value store for financial market data (or any other domain-specific data). By compressing and chunking data you can easily build a versioned key-value store on top of Mongo. This store outperforms data platforms that are significantly more expensive, and scales to terabytes of data. Storing data denormalized in a key-value store reduces the burden of storage and retrieval, and allows applications to access, update and store data in its natural shape. The result is a system that can both ship data at high throughput to a cluster of machines for batch processing and provide low latency to interactive users. https://www.youtube.com/watch?v=FVyIxdxsyok
Transcript
Page 1: MongoDB and Python as a Market Data Platform

Python and MongoDB as a Market Data Platform

Scalable storage of time series data

2014

Page 2: MongoDB and Python as a Market Data Platform

Opinions expressed are those of the author and may not be shared by all personnel of Man Group plc (‘Man’).  These opinions are subject to change without notice, and are for information purposes only and do not constitute an offer or invitation to make an investment in any financial instrument or in any product to which any member of Man’s group of companies provides investment advisory or any other services.  Any forward-looking statements speak only as of the date on which they are made and are subject to risks and uncertainties that may cause actual results to differ materially from those contained in the statements.  Unless stated otherwise this information is communicated by Man Investments Limited and AHL Partners LLP which are both authorised and regulated in the UK by the Financial Conduct Authority. 

2

Legalese…

Page 3: MongoDB and Python as a Market Data Platform

3

The Problem

Page 4: MongoDB and Python as a Market Data Platform

Financial data comes in different sizes…
• ~1MB: 1x a day price data
• ~1GB x 1000s: 9,000 x 9,000 data matrices
• ~40GB: 1-minute data
• ~30TB: Tick data
• > even larger data sets (options, …)

… and different shapes
• Time series of prices
• Event data
• News data
• What’s next?

4

Overview – Data shapes

Page 5: MongoDB and Python as a Market Data Platform

Quant researchers
• Interactive work – latency sensitive
• Batch jobs run on a cluster – maximize throughput
• Historical data
• New data
• ... want control of storing their own data

Trading system
• Auditable – SVN for data
• Stable
• Performant

5

Overview – Data consumers

Page 6: MongoDB and Python as a Market Data Platform

6

The Research Problem – Scale

lib.read('Equity Prices')
Out[4]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9605 entries, 1983-01-31 21:30:00 to 2014-02-14 21:30:00
Columns: 8103 entries, AST10000 to AST9997
dtypes: float64(8631)

Equity Prices: 77M float64s, 593MB of data = 4,744Mbits! (~600 MB)
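The size quoted on this slide can be checked with a few lines of arithmetic. This is a small sketch that assumes every cell is an 8-byte float64 and that "MB" on the slide means binary megabytes; the variable names are illustrative only.

```python
rows, columns = 9605, 8103          # index entries and columns from the output above
bytes_per_value = 8                 # a float64 occupies 8 bytes

values = rows * columns             # number of float64 cells in the frame
size_bytes = values * bytes_per_value
size_mib = size_bytes / 2.0 ** 20   # binary megabytes (assumed unit of the slide's "MB")
size_mbits = size_mib * 8           # the same payload expressed in megabits

print("%.1fM values, %.1f MiB, %.0f Mbit" % (values / 1e6, size_mib, size_mbits))
```

The result lands at about 77.8M values and a little under 594 binary megabytes, which is in line with the slide's rounded 77M and 593MB figures.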

Page 7: MongoDB and Python as a Market Data Platform

Many different existing data stores
• Relational databases
• Tick databases
• Flat files
• HDF5 files
• Caches

7

Overview – Databases

Page 8: MongoDB and Python as a Market Data Platform

8

Can we build one system to rule them all?

Overview – Databases

Page 9: MongoDB and Python as a Market Data Platform

Goals
• 10 years of 1 minute data in <1s
• 200 instruments x all history x once a day data in <1s
• Single data store for all data types
  • From 1x day data to Tick Data
• Data versioning + Audit

Requirements
• Fast – most data in-memory
• Complete – all data in single location
• Scalable – unbounded in size and number of clients
• Agile – rapid iterative development

9

Project Goals

Page 10: MongoDB and Python as a Market Data Platform

10

Implementation

Page 11: MongoDB and Python as a Market Data Platform

Impedance mismatch between Python/Pandas/Numpy and existing databases
- Machine cluster operating on data blocks
  vs.
- Database doing the analytical work

MongoDB:
- Developer productivity
  - Document ≈ Python dictionary
- Fast out of the box
  - Low latency
  - High throughput
  - Predictable performance
- Sharding / Replication for growth and scale out
- Free
- Great support
- Most widely used NoSQL DB

11

Implementation – Choosing MongoDB

Page 12: MongoDB and Python as a Market Data Platform

12

Implementation – System Architecture

[Architecture diagram. Labels: multiple Python clients; three mongos routers; three config servers; five replica sets rs0 to rs4, each a mongod with 500GB.]

{'_id': ObjectId('…'),
 'c': 47,
 'columns': {'PRICE': {'data': Binary('...', 0),
                       'dtype': 'float64',
                       'rowmask': Binary('...', 0)},
             'SIZE': {'data': Binary('...', 0),
                      'dtype': 'int64'}},
 'endSeq': -1L,
 'index': Binary('...', 0),
 'segment': 1296568173000L,
 'sha': abcd123456,
 'start': 1296568173000L,
 'end': 1298569664000L,
 'symbol': 'AST1209',
 'v': 2}

Page 13: MongoDB and Python as a Market Data Platform

Data bucketed into named Libraries
• One minute
• Daily
• User-data: jbloggs.EOD
• Metadata Index

Pluggable library types:
• VersionStore
• TickStore
• Metadata store
• … others …

© Man 2013 13

Implementation – Mongoose

Page 14: MongoDB and Python as a Market Data Platform

Mongoose key-value store

14

Implementation – Mongoose API

from ahl.mongo import Mongoose

m = Mongoose('research')                     # Connect to the data store

m.list_libraries()                           # What data libraries are available
library = m['jbloggs.EOD']                   # Get a Library
library.list_symbols()                       # List symbols

library.write('SYMBOL', <TS or other data>)  # Write
library.read('SYMBOL', version=…)            # Read, with an optional version

library.snapshot('snapshot-name')            # Create a named snapshot of the library
library.list_snapshots()                     # List the named snapshots
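To make the semantics of this API concrete, here is a minimal in-memory sketch of a versioned key-value library. `ToyLibrary` is a hypothetical stand-in and not the Mongoose implementation: it keeps data in Python dictionaries instead of MongoDB, and it assumes that `version` may be either a version number or a snapshot name, which the slide does not confirm.

```python
class ToyLibrary:
    """Hypothetical in-memory stand-in for a versioned library (not Mongoose itself)."""

    def __init__(self):
        self._versions = {}    # symbol -> list of items; index i holds version i + 1
        self._snapshots = {}   # snapshot name -> {symbol: version number}

    def list_symbols(self):
        return sorted(self._versions)

    def write(self, symbol, item):
        """Append a new version instead of overwriting; return its version number."""
        history = self._versions.setdefault(symbol, [])
        history.append(item)
        return len(history)

    def read(self, symbol, version=None):
        """Read the latest version, a numbered version, or one pinned by a snapshot."""
        history = self._versions[symbol]
        if version is None:
            number = len(history)
        elif isinstance(version, str):          # assumed: a string names a snapshot
            number = self._snapshots[version][symbol]
        else:
            number = version
        if not 1 <= number <= len(history):
            raise KeyError("no version %r for %s" % (number, symbol))
        return history[number - 1]

    def snapshot(self, name):
        """Record the current version of every symbol under a name."""
        if name in self._snapshots:
            raise ValueError("snapshot already exists: %s" % name)
        self._snapshots[name] = {s: len(h) for s, h in self._versions.items()}

    def list_snapshots(self):
        return sorted(self._snapshots)


library = ToyLibrary()
library.write('SYMBOL', [1.0, 2.0])
library.snapshot('snapshot-name')
library.write('SYMBOL', [1.0, 2.0, 3.0])

latest = library.read('SYMBOL')
pinned = library.read('SYMBOL', version='snapshot-name')
first = library.read('SYMBOL', version=1)
```

Because writes only ever append, a snapshot is just a small map from symbol to version number, so taking one costs almost nothing and never copies data.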

Page 15: MongoDB and Python as a Market Data Platform

15

Implementation – Version Store

[Diagram. Labels: snapshots Snap A and Snap B; symbol versions Sym1 v1, Sym2 v3, Sym2 v4.]

Page 16: MongoDB and Python as a Market Data Platform

16

Implementation – VersionStore: A chunk

Page 17: MongoDB and Python as a Market Data Platform

17

Implementation – VersionStore: A version

Page 18: MongoDB and Python as a Market Data Platform

18

Implementation – VersionStore: Bringing it together

Page 19: MongoDB and Python as a Market Data Platform

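# Python 2, as on the slide
import cPickle
import lz4
from bson.binary import Binary

_CHUNK_SIZE = 15 * 1024 * 1024  # 15MB

class PickleStore(object):

    def write(self, collection, version, symbol, item):
        # Try to pickle it. This is best effort
        pickled = lz4.compressHC(cPickle.dumps(item))

        for i in xrange(len(pickled) / _CHUNK_SIZE + 1):
            segment = {'data': Binary(pickled[i * _CHUNK_SIZE : (i + 1) * _CHUNK_SIZE])}
            segment['segment'] = i
            # checksum() is a helper that is not shown on the slide
            sha = checksum(symbol, segment)
            # Upsert keyed on content hash: an unchanged chunk is reused and
            # only gains the new version as an extra parent
            collection.update({'symbol': symbol, 'sha': sha},
                              {'$set': segment,
                               '$addToSet': {'parent': version['_id']}},
                              upsert=True)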

Implementation – Arbitrary Data

Page 22: MongoDB and Python as a Market Data Platform

class PickleStore(object):

    def read(self, collection, version, symbol):
        # Fetch every chunk that belongs to this version, in segment order
        data = ''.join([x['data'] for x in collection.find(
            {'symbol': symbol, 'parent': version['_id']},
            sort=[('segment', pymongo.ASCENDING)])])
        return cPickle.loads(lz4.decompress(data))

22

Implementation – Arbitrary Data
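The write and read paths on the Arbitrary Data slides can be tried without a database. The following is a sketch under stated assumptions rather than the presented implementation: a plain dictionary stands in for the MongoDB collection, `zlib` stands in for lz4, the chunk size is shrunk for the demo, and `checksum` is an assumed SHA-1 over symbol, segment index and chunk bytes (the slides do not show the real helper).

```python
import hashlib
import pickle
import random
import zlib

CHUNK_SIZE = 64 * 1024  # small for the demo; the slides use 15MB


def checksum(symbol, index, data):
    """Assumed content hash of one chunk; the real helper is not shown in the slides."""
    header = ("%s:%d:" % (symbol, index)).encode("utf-8")
    return hashlib.sha1(header + data).hexdigest()


def write(collection, version_id, symbol, item):
    """Compress the pickled item, split it into chunks, and upsert each chunk."""
    blob = zlib.compress(pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL))
    for index, start in enumerate(range(0, len(blob), CHUNK_SIZE)):
        data = blob[start:start + CHUNK_SIZE]
        key = (symbol, checksum(symbol, index, data))
        # Emulates update(..., upsert=True) with $set and $addToSet
        doc = collection.setdefault(key, {"parent": set()})
        doc["data"] = data
        doc["segment"] = index
        doc["parent"].add(version_id)


def read(collection, version_id, symbol):
    """Collect the chunks of one version in segment order and rebuild the item."""
    docs = [doc for (sym, _), doc in collection.items()
            if sym == symbol and version_id in doc["parent"]]
    docs.sort(key=lambda doc: doc["segment"])
    blob = b"".join(doc["data"] for doc in docs)
    return pickle.loads(zlib.decompress(blob))


# Deterministic, incompressible payload so the blob spans several chunks
payload = random.Random(0).getrandbits(8 * 300000).to_bytes(300000, "little")
item = {"prices": payload}

collection = {}
write(collection, "v1", "SYM", item)
chunks_after_v1 = len(collection)
write(collection, "v2", "SYM", item)   # identical data: no new chunks are stored
chunks_after_v2 = len(collection)
restored = read(collection, "v2", "SYM")
```

Keying each chunk on its content hash is what makes versioning cheap here: writing unchanged data under a new version adds a parent reference to the existing chunks instead of storing them again.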

Page 23: MongoDB and Python as a Market Data Platform

23

Implementation – DataFrames

def do_write(df, version):
    records = df.to_records()
    version['dtype'] = str(records.dtype)
    chunk_size = _CHUNK_SIZE / records.dtype.itemsize
    ... chunk_and_store ...

def do_read(version):
    ... read_chunks ...
    data = ''.join(chunks)
    dtype = np.dtype(version['dtype'])
    recs = np.fromstring(data, dtype=dtype)
    return DataFrame.from_records(recs)
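The idea behind `do_write` and `do_read` is to store rows as fixed-width binary records, keep the record layout in the version document, and cut chunks on whole-record boundaries. The sketch below illustrates that idea with the standard-library `struct` module standing in for NumPy record arrays; the format string `'<qd'` (an int64 timestamp and a float64 price), the tiny chunk size and the list-of-chunks return value are all assumptions made for the demo.

```python
import struct

CHUNK_SIZE = 1024  # bytes; deliberately tiny for the demo (the slides use 15MB)


def do_write(rows, fmt):
    """Pack fixed-width rows and split them into chunks that end on row boundaries."""
    record = struct.Struct(fmt)
    rows_per_chunk = max(1, CHUNK_SIZE // record.size)  # same role as chunk_size on the slide
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        block = rows[start:start + rows_per_chunk]
        chunks.append(b"".join(record.pack(*row) for row in block))
    version = {"dtype": fmt}  # layout stored with the version, like version['dtype']
    return version, chunks


def do_read(version, chunks):
    """Join the chunks and decode them using the layout saved at write time."""
    record = struct.Struct(version["dtype"])
    data = b"".join(chunks)
    return [record.unpack_from(data, offset)
            for offset in range(0, len(data), record.size)]


# 500 one-minute rows: (epoch milliseconds, price)
rows = [(1296568173000 + 60000 * i, 100.0 + 0.25 * i) for i in range(500)]
version, chunks = do_write(rows, "<qd")
restored = do_read(version, chunks)
```

Dividing the chunk size by the record size means no row is ever split across two chunks, so any single chunk can be decoded without its neighbours.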

Page 24: MongoDB and Python as a Market Data Platform

24

Results

Page 25: MongoDB and Python as a Market Data Platform

Flat files on NFS – Random market

25

Results – Performance Once a Day Data

Page 26: MongoDB and Python as a Market Data Platform

HDF5 files – Random instrument

26

Results – Performance One Minute Data

Page 27: MongoDB and Python as a Market Data Platform

Random E-Mini S&P contract from 2013

27

Results – TickStore – 8 parallel

Page 28: MongoDB and Python as a Market Data Platform

Random E-Mini S&P contract from 2013

28

Results – TickStore

Page 29: MongoDB and Python as a Market Data Platform

Random E-Mini S&P contract from 2013

29

Results – TickStore Throughput

Page 30: MongoDB and Python as a Market Data Platform

Random E-Mini S&P contract from 2013

30

Results – System Load

[Chart. Legend: OtherTick, Mongo (x2); N Tasks = 32.]

Page 31: MongoDB and Python as a Market Data Platform

Built a system to store data of any shape and size
- Reduced impedance between Python language and the data store

Low latency:
- 1xDay data: 4ms for 10,000 rows (vs. 2,210ms from SQL)
- OneMinute / Tick data: 1s for 3.5M rows Python (vs. 15s – 40s+ from OtherTick)
- 1s for 15M rows Java

Parallel Access:
- Cluster with 256+ concurrent data access
- Consistent throughput – little load on the Mongo server

Efficient:
- 10-15x reduction in network load
- Negligible decompression cost (lz4: 1.8Gb/s)

31

Conclusions

Page 32: MongoDB and Python as a Market Data Platform

32

Questions?

