Date post: | 26-Jan-2015 |
Category: | Technology |
Upload: | datastax |
MONGODB TO CASSANDRA ARCHITECTURAL LESSONS
Jon Haddad & Blake Eggleston
Overview
• Differences in DB Architectures
• SHIFT Platform
• SHIFT Media Manager
• Intro to cqlengine
MongoDB Architecture
Important Concepts
• Replica set (master / slave)
• Shard (replica set within a cluster)
• Config server (topology)
• mongos (router)
• Shard key: an indexed field that determines the shard a particular document belongs to
sources: http://docs.mongodb.org/manual/core/sharded-cluster-architectures-production/, http://docs.mongodb.org/manual/core/sharding-shard-key/
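To make the role of the shard key concrete, here is a minimal sketch of range-based routing in plain Python. It is not MongoDB's API: the chunk boundaries, the shard names, and the `route` function are hypothetical, chosen only to show that a query carrying the shard key can be sent to one shard while a query without it has to be sent to all of them.

```python
import bisect

# Hypothetical chunk map: each shard owns a contiguous range of shard-key values.
# Upper bounds are exclusive; the last shard takes everything at or above "p".
CHUNK_UPPER_BOUNDS = ["h", "p"]
SHARDS = ["shard0", "shard1", "shard2"]


def route(query, shard_key="user_id"):
    """Return the shards a router would need to contact for this query."""
    if shard_key in query:
        # Targeted query: the shard key value picks exactly one range.
        index = bisect.bisect_right(CHUNK_UPPER_BOUNDS, query[shard_key])
        return [SHARDS[index]]
    # No shard key in the query: scatter-gather across every shard.
    return list(SHARDS)
```

For example, `route({"user_id": "jon"})` targets a single shard, while `route({"email": "jon@example.com"})` fans out to the whole cluster.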
Cassandra Architecture
• Only 1 type of server (Cassandra)
• Ring Based Replication (no master or slave)
• No single point of failure
• Key hashes to a location in the ring
• Replication Factor (RF=3)
• Limited query flexibility (always select by key)
• Each query has a consistency level
source: http://developer.rackspace.com/images/2013-03-27-rackspace-service-registry-status-update/vnodes.png
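The ring idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions: one token per node (no vnodes), MD5 as a stand-in hash, and replicas placed on consecutive nodes walking clockwise. The `Ring` class and `token` function are hypothetical names, not Cassandra code. The `quorum` helper shows the arithmetic behind the QUORUM consistency level.

```python
import bisect
import hashlib


def token(key):
    """Hash a key to a position on the ring (MD5 here, purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        # Each node owns one token; sort so we can walk the ring clockwise.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        """Return the nodes holding a key: the first node at or past the
        key's token, followed by its clockwise successors, RF nodes in all."""
        positions = [t for t, _ in self.tokens]
        start = bisect.bisect_left(positions, token(key)) % len(self.tokens)
        count = min(self.rf, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


def quorum(replication_factor):
    """Number of replicas that must respond at QUORUM consistency."""
    return replication_factor // 2 + 1
```

With RF=3, `quorum(3)` is 2, so a QUORUM read and a QUORUM write always overlap on at least one replica.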
Cassandra Storage
source: http://developer.rackspace.com/images/2013-03-27-rackspace-service-registry-status-update/vnodes.png
• SSTables are immutable
• Each column includes a timestamp of when it was written
• The same column can exist for a given key in multiple SSTables
• Deletes are written as tombstones
• SSTables are periodically merged (compaction)
• Compaction keeps the column with the latest timestamp on conflicts
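The storage rules above amount to "last write wins". A minimal sketch, assuming each SSTable is modelled as a dict from (key, column) to (timestamp, value) and a delete is a sentinel value; `compact`, `read`, and `TOMBSTONE` are illustrative names. Real compaction eventually purges tombstones, which this sketch does not do.

```python
TOMBSTONE = object()  # marker standing in for a delete


def compact(*sstables):
    """Merge SSTables, keeping the newest (timestamp, value) per (key, column)."""
    merged = {}
    for sstable in sstables:
        for cell, (timestamp, value) in sstable.items():
            if cell not in merged or timestamp > merged[cell][0]:
                merged[cell] = (timestamp, value)
    return merged


def read(merged, key, column):
    """Return the live value of a cell, or None if it is missing or deleted."""
    cell = merged.get((key, column))
    if cell is None or cell[1] is TOMBSTONE:
        return None
    return cell[1]
```

Because the winner is chosen by timestamp, the result does not depend on the order in which the SSTables are merged.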
Cassandra Writes
• Writes are sent to any node in the cluster (the coordinator), which figures out where they should go
• Writes are saved in memory to a “memtable”, and written to a commit log.
• Memtables are flushed to disk periodically as SSTables.
source: http://www.datastax.com/docs/_images/write_access.png
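The write path on a single node can be sketched as follows. This is a toy model, not Cassandra internals: the `Node` class is hypothetical, and the flush trigger here is a simple cell count (`flush_threshold`), which is an assumption made to keep the example short.

```python
class Node:
    def __init__(self, flush_threshold=2):
        self.commit_log = []     # append-only record, replayed after a crash
        self.memtable = {}       # in memory, mutable
        self.sstables = []       # stands in for disk; never modified after a flush
        self.flush_threshold = flush_threshold

    def write(self, key, column, value, timestamp):
        # Durability first: append to the commit log, then update the memtable.
        self.commit_log.append((key, column, value, timestamp))
        self.memtable[(key, column)] = (timestamp, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a new SSTable and start a fresh one.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
        # Flushed data no longer needs the commit log for recovery.
        self.commit_log = []
```

Note that a write never reads or rewrites existing data: it is one append plus one in-memory update.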
Cassandra Reads
• Any server may be queried
• Acts as coordinator
• Data is pulled from SSTables and merged
• Contacts nodes with the requested key
• Performs read repair if necessary
• Reads are a more time consuming operation than writes.
source: http://www.datastax.com/docs/_images/write_access.png
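A rough sketch of a coordinator read with read repair, assuming each replica is modelled as a dict from key to (timestamp, value). The function name `coordinator_read` and the integer `consistency` argument are hypothetical simplifications; the sketch only repairs the replicas it contacted.

```python
def coordinator_read(replicas, key, consistency=2):
    """Read a key from `consistency` replicas, return the newest value,
    and write that value back to any contacted replica that was stale."""
    contacted = replicas[:consistency]
    responses = [(replica, replica.get(key)) for replica in contacted]
    found = [cell for _, cell in responses if cell is not None]
    if not found:
        return None
    # The cell with the highest timestamp wins.
    newest = max(found, key=lambda cell: cell[0])
    for replica, cell in responses:
        if cell != newest:
            replica[key] = newest  # read repair
    return newest[1]
```

The extra round trips and merging involved are one reason a read costs more than a write.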
MongoDB Advantages
• Very Flexible Documents
• Very Flexible Queries
• Full text search (2.4)
• Aggregation Framework
• Geospatial Indexes / Queries
• Really good documentation
MongoDB Pitfalls
• Many queries will route to entire cluster
• Overwriting documents / changing doc sizes causes memory fragmentation problems (db repair)
• Query language is awkward for humans
• Queries that go to disk pay an enormous penalty
• Max size of 256GB per collection
source: https://blog.serverdensity.com/map-reduce-and-mongodb/
Cassandra Advantages
• Multi data center aware & reliable
• Fewer moving parts
• No DB / table locking
• Unbelievable with time series data (stats)
• Performance scales linearly as you add servers
• Optimized compaction options for traditional spinning disks and SSDs
• Lots of control over how your data is stored on disk.
Cassandra Pitfalls
• Secondary Indexes have hidden costs
• Individual reads (single rows) are not as fast as other DBs
• JVM can be intimidating (GC)
• Data modeling requires more planning
• Generally need to construct a table per query you intend to run
• Ad hoc queries or queries with lots of permutations can be very difficult to model
• We complement Cassandra with Elasticsearch for these types of queries (also Solr & DS Enterprise are good choices)
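The "table per query" pitfall is easier to see with a small example. The sketch below uses plain dicts as stand-ins for two tables; the table names and the `save_ad` function are hypothetical. The point is that each write goes to every table that serves a query, so each read is a single key lookup.

```python
from collections import defaultdict

# Two hypothetical "tables", each keyed for exactly one query.
ads_by_campaign = defaultdict(dict)   # query: all ads in a campaign
ads_by_type = defaultdict(dict)       # query: all ads of a given type


def save_ad(campaign_id, ad_id, ad_type):
    """Write the ad once per query table so reads never need a scan or a join."""
    ad = {"campaign_id": campaign_id, "ad_id": ad_id, "ad_type": ad_type}
    ads_by_campaign[campaign_id][ad_id] = ad
    ads_by_type[ad_type][(campaign_id, ad_id)] = ad
```

Every new query shape means another table and another write, which is why ad hoc queries are hard to support this way.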
Media Manager Social Analytics
What is Media Manager?
• Ad buying and management tool for Facebook, Twitter
• We sync ~2 billion ad stats a month
• We roll up stats at multiple levels in real time
• 10 node C* cluster, AWS high I/O
• Peaked at 150K queries / second
• Approx 150GB of data, growing 10% / week
Real time Rollups
• A single row per parent object type & date
• For any object (teams, folders, campaign) we can perform a rollup for a given date by accessing only a single row. This limits our I/O and is extremely efficient.
• New ad stats are propagated up immediately in rollups with very few reads.
row key: campaign+date
  columns: ad1    ad2    ad3
  values:  stats  stats  stats
rollup into
row key: folder+date
  columns: campaign1  campaign2  campaign3
  values:  stats      stats      stats
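A minimal sketch of the rollup layout, assuming one dict entry per (parent, date) row with one column per child. The hierarchy in `PARENTS`, the single `clicks` stat, and the function names are hypothetical, and the use of blind increments is an assumption; the slides do not say how the stats are actually accumulated.

```python
from collections import defaultdict

# Hypothetical hierarchy: ad -> campaign -> folder.
PARENTS = {"ad1": "campaign1", "ad2": "campaign1", "campaign1": "folder1"}

# One row per (parent, date); one column per child holding that child's stats.
rollup_rows = defaultdict(lambda: defaultdict(int))


def record_stat(obj, date, clicks):
    """Add clicks for obj and propagate the increment to every ancestor's row."""
    child = obj
    while child in PARENTS:
        parent = PARENTS[child]
        rollup_rows[(parent, date)][child] += clicks
        child = parent


def rollup(parent, date):
    """Total for a parent on a date, read from a single row."""
    return sum(rollup_rows[(parent, date)].values())
```

Recording a stat touches one row per level of the hierarchy, and a rollup for any object and date reads exactly one row.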
Why Cassandra?
• Almost our entire DB is in our working set.
• We have rows on disk that are inconsistently sized, so heuristics on doc size for preallocation are not useful.
• We could not tolerate unpredictable query behavior due to disk access.
SHIFT.com Collaboration Platform
Real time Collaboration
• Built for Marketers
• Allows communication across departments and organizations
• 3rd Party Applications
Messaging
• Messages are fanned out to an entire team
• Teams may have hundreds of members
• Each member has a perspectival view of their messages and their own metadata on those messages (tags & unread)
Message Inbox
user timeuuid1 timeuuid2 timeuuid3
jon msg1 msg2 msg3
blake msg3 msg1 msg2
• When a message is sent or replied to, we insert a record with a timeuuid into each person's stream, which points to the message.
• Timeuuids are stored on disk in reverse order of the embedded timestamp
• We can easily query the row for the first N items in the user's inbox
• We store multiple views as tags for each user to quickly surface messages in different contexts.
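The inbox layout can be sketched with Python's standard `uuid` module, whose version 1 UUIDs embed a timestamp like a timeuuid. Everything else here is an assumption for illustration: the in-memory `inboxes` dict stands in for the table, and `deliver`, `fan_out`, and `first_n` are hypothetical names.

```python
import uuid
from collections import defaultdict

# user -> list of (timeuuid, message_id), kept newest first.
inboxes = defaultdict(list)


def deliver(user, message_id, timeuuid=None):
    """Insert a pointer to a message into one user's stream."""
    timeuuid = timeuuid or uuid.uuid1()  # version 1 UUIDs embed a timestamp
    inbox = inboxes[user]
    inbox.append((timeuuid, message_id))
    # UUID.time is the embedded 60-bit timestamp; sort newest first.
    inbox.sort(key=lambda entry: entry[0].time, reverse=True)


def fan_out(team, message_id):
    """Give every team member their own inbox entry for the message."""
    for user in team:
        deliver(user, message_id)


def first_n(user, n):
    """The N most recent message ids: a slice from the front of the row."""
    return [message_id for _, message_id in inboxes[user][:n]]
```

Because each user has their own entries, the same message can sit at different positions in different inboxes, as in the table above where jon and blake see msg1 to msg3 in different orders.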
CQLENGINE python CQL3 mapper
cqlengine features
• CQL3 Object Mapper for Python
• Supports Cassandra 1.2
• Builds queries supporting the following:
• TTLs
• Per Query Consistency
• Blind Table Updates
• Batch Queries
• Counters
• Maps, sets, lists
• Schema management
• Per table compaction settings
• Table Polymorphism
Table Polymorphism
• In a single table we can have heterogeneous objects
• We use this on Media Manager for Ad types
campaign  ad  type
1         1   page_post
1         2   mobile_ad
1         3   application_ad
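The idea behind table polymorphism is that one column acts as a discriminator that selects the Python class for each row. The sketch below shows that mechanism in plain Python using the rows from the table above. It is not cqlengine's API: the `Ad` base class, its `registry`, and `from_row` are hypothetical.

```python
class Ad:
    """Base class; subclasses register themselves under a discriminator value."""
    registry = {}
    ad_type = None

    def __init__(self, campaign, ad):
        self.campaign = campaign
        self.ad = ad

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Ad.registry[cls.ad_type] = cls

    @classmethod
    def from_row(cls, row):
        """Pick the concrete class from the row's `type` column."""
        return cls.registry[row["type"]](row["campaign"], row["ad"])


class PagePost(Ad):
    ad_type = "page_post"


class MobileAd(Ad):
    ad_type = "mobile_ad"


class ApplicationAd(Ad):
    ad_type = "application_ad"
```

Loading `{"campaign": 1, "ad": 2, "type": "mobile_ad"}` through `Ad.from_row` yields a `MobileAd` instance, so type-specific behaviour can live on the subclass while all ads share one table.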
Upcoming Features
• Work seamlessly with multiple clusters
• Native driver integration
• Key cache / row cache configuration
• Cassandra 2.0 features
• Third party plugins: session, flask, identity map
THANK YOU
PALO ALTO 650.804.8319
NEW YORK 646.649.2972
CHICAGO 312.465.2152
www.shift.com
SANTA MONICA 310.310.8315
Jon Haddad @rustyrazorblade
Blake Eggleston @beggleston