Tools for Social Networking Infrastructures
1
Cassandra - a decentralised structured storage system
● hundreds of millions of users
● distributed infrastructure
● inbox changes constantly
● easily scalable
● dealing with failures
● keeping the cost low
2
Problem: Facebook Inbox Search
Cassandra - a decentralised structured storage system
● Coda, Ficus
● Google File System
● Bayou
● Dynamo
● Bigtable
3
Existing Solutions
Cassandra - a decentralised structured storage system
Table - multidimensional map indexed by a key
Row key - a string with no size restrictions, typically 16-36 bytes
Column family - a group of columns
Super column family - a group within a group of columns
Sorting - Columns can be sorted by name or date
It is possible to have multiple tables in one cluster.
4
Data Model
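As a minimal illustration of this nested-map model (the names and values are invented, and this is not Cassandra's actual API), a column family and a super column family can be sketched as sorted Java maps, which also reflects that columns are kept in sorted order:

import java.util.TreeMap;

// Illustrative sketch only: a column family maps a row key to a set of
// columns (name -> value) kept in sorted order; a super column family
// adds one more level of nesting.
public class DataModelSketch {
    // row key -> (column name -> column value)
    static TreeMap<String, TreeMap<String, byte[]>> columnFamily = new TreeMap<>();

    // row key -> (super column name -> (column name -> column value))
    static TreeMap<String, TreeMap<String, TreeMap<String, byte[]>>> superColumnFamily = new TreeMap<>();

    public static void main(String[] args) {
        columnFamily
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .put("displayName", "Alice".getBytes());

        superColumnFamily
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .computeIfAbsent("inbox", k -> new TreeMap<>())
            .put("msg:2024-01-01T10:00", "hello".getBytes());
    }
}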
Cassandra - a decentralised structured storage system
● a read/write request can go to any node in the cluster; that node determines the replicas for the key
● write: the request is routed to the replicas and waits until a quorum has acknowledged
● read
○ “weak”: the first response is sent back
○ “strong”: wait for a quorum of responses
5
System Architecture
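A rough sketch of the quorum behaviour above, assuming a made-up 3-replica setup (not Cassandra's real coordinator code): a “weak” read returns the first reply, while a “strong” read, like a write, waits for a quorum of acknowledgements.

import java.util.List;
import java.util.concurrent.*;

// Illustrative sketch of quorum coordination, not Cassandra's real code.
public class QuorumSketch {
    static final int REPLICAS = 3;
    static final int QUORUM = REPLICAS / 2 + 1;   // 2 of 3

    // Pretend each replica answers a read after some delay.
    static CompletableFuture<String> readFromReplica(int id, Executor pool) {
        return CompletableFuture.supplyAsync(() -> "value-from-replica-" + id, pool);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(REPLICAS);
        List<CompletableFuture<String>> replies = List.of(
            readFromReplica(0, pool), readFromReplica(1, pool), readFromReplica(2, pool));

        // "Weak" read: return whichever replica answers first.
        String weak = (String) CompletableFuture.anyOf(replies.toArray(new CompletableFuture[0])).get();

        // "Strong" read (and writes): block until a quorum has acknowledged.
        // (A real coordinator counts acknowledgements as they arrive.)
        int acked = 0;
        for (CompletableFuture<String> r : replies) {
            r.get();
            if (++acked >= QUORUM) break;
        }
        System.out.println("weak read: " + weak + ", quorum reached with " + acked + " replies");
        pool.shutdown();
    }
}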
Cassandra - a decentralised structured storage system
● consistent hashing
● each node has a randomly assigned position on the ring
● low impact of arrival/departure of nodes
● load balancing to alleviate heavily loaded nodes
6
System Architecture - Partitioning
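A toy consistent-hashing ring illustrating the idea (the hash function, node names and key names are placeholders): a key is served by the first node clockwise from the key's position, so adding or removing a node only remaps keys on the adjacent arc.

import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hashing ring; purely illustrative.
public class RingSketch {
    // position on the ring -> node name
    static final TreeMap<Integer, String> ring = new TreeMap<>();

    static int position(String s) {
        // Placeholder hash; a real system would use MD5 or similar.
        return s.hashCode() & 0x7fffffff;
    }

    static void addNode(String node)    { ring.put(position(node), node); }
    static void removeNode(String node) { ring.remove(position(node)); }

    // The coordinator for a key is the first node clockwise from the key's position.
    static String coordinatorFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(position(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        addNode("nodeA"); addNode("nodeB"); addNode("nodeC");
        System.out.println(coordinatorFor("user:42"));
        // Removing a node only remaps the keys it owned, not the whole key space.
        removeNode("nodeB");
        System.out.println(coordinatorFor("user:42"));
    }
}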
Cassandra - a decentralised structured storage system
● allows for high availability and durability
● the coordinator node is in charge of the replicas
● various replication options
○ “Rack Unaware” - use the N-1 consecutive nodes in the ring after the coordinator
● replication metadata is kept in Zookeeper
● replication across multiple data centers allows for no downtime in case of a crash
7
System Architecture - Replication
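Building on the toy ring above, “Rack Unaware” placement can be sketched as the coordinator plus the N-1 distinct nodes that follow it clockwise (illustrative only; Cassandra also offers rack- and datacenter-aware strategies).

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative "Rack Unaware" replica selection on the toy ring.
public class ReplicaSketch {
    static List<String> replicasFor(TreeMap<Integer, String> ring, int keyPosition, int n) {
        List<String> replicas = new ArrayList<>();
        Integer pos = ring.ceilingKey(keyPosition);
        if (pos == null) pos = ring.firstKey();
        // Walk clockwise at most once around the ring, collecting distinct nodes.
        for (int step = 0; step < ring.size() && replicas.size() < n; step++) {
            String node = ring.get(pos);
            if (!replicas.contains(node)) replicas.add(node);
            pos = ring.higherKey(pos);
            if (pos == null) pos = ring.firstKey();   // wrap around
        }
        return replicas;
    }
}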
Cassandra - a decentralised structured storage system
● Scuttlebutt - an efficient Gossip-based mechanism
● Φ Accrual Failure Detector
○ the module emits a value (Φ) which represents the suspicion level for a node
○ Φ is calculated from the arrival times of gossip messages using an exponential distribution
○ allows setting a threshold above which a node is suspected to be down
8
System Architecture - Membership
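A sketch of the accrual calculation under the exponential-distribution assumption mentioned above (the window size and threshold are made-up values): with mean inter-arrival time m, the probability that a heartbeat arrives later than t is e^(-t/m), so Φ = -log10(e^(-t/m)) = (t/m)·log10(e), and Φ keeps growing until the next gossip message arrives.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative Phi accrual failure detector under an exponential
// inter-arrival assumption; window size and threshold are made up.
public class PhiSketch {
    static final int WINDOW = 1000;
    final Deque<Long> intervals = new ArrayDeque<>();
    long lastHeartbeatMillis = -1;

    // Record a gossip heartbeat and remember the inter-arrival interval.
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // Phi = -log10(P(a heartbeat arrives later than now)); with an
    // exponential distribution this is (t / mean) * log10(e).
    double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double t = nowMillis - lastHeartbeatMillis;
        return (t / mean) * Math.log10(Math.E);
    }

    boolean suspected(long nowMillis, double threshold) {
        return phi(nowMillis) > threshold;   // threshold picked by the operator
    }
}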
Cassandra - a decentralised structured storage system
● data is stored both in memory and on the file system
● a write consists of:
○ 1. file system: commit log update
○ 2. memory: in-memory data structure update
● the in-memory data structure is saved to a data file on disk once it crosses a size threshold
● all writes are sequential and generate an index for lookup
● a merge process runs in the background to collate the data files
● lookup:
○ first check memory
○ then check the files on disk, from newest to oldest
■ a bloom filter is used to check whether a file contains the key
9
System Architecture - Persistence
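A condensed sketch of the write and lookup paths above (the log format, flush threshold and in-memory structures are invented; a real implementation also keeps per-file indexes and Bloom filters):

import java.io.*;
import java.util.*;

// Illustrative write/lookup path: commit log first, then an in-memory
// table that is flushed to an immutable data file once it grows too large.
public class PersistenceSketch {
    static final int FLUSH_THRESHOLD = 10_000;
    final Writer commitLog;
    final TreeMap<String, byte[]> memtable = new TreeMap<>();
    final List<SortedMap<String, byte[]>> dataFiles = new ArrayList<>(); // newest first

    PersistenceSketch(File logFile) throws IOException {
        commitLog = new BufferedWriter(new FileWriter(logFile, true));
    }

    void write(String key, byte[] value) throws IOException {
        commitLog.write(key + "\n");        // 1. durability: append the mutation (simplified to the key)
        commitLog.flush();
        memtable.put(key, value);           // 2. update the in-memory structure
        if (memtable.size() >= FLUSH_THRESHOLD) {
            dataFiles.add(0, new TreeMap<>(memtable));  // flush as an immutable, sorted "data file"
            memtable.clear();
        }
    }

    byte[] read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);    // memory first
        for (SortedMap<String, byte[]> file : dataFiles) {          // then newest to oldest
            // A real system would consult a Bloom filter here before touching the file.
            if (file.containsKey(key)) return file.get(key);
        }
        return null;
    }
}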
Cassandra - a decentralised structured storage system
● moving data from MySQL to Cassandra using Map/Reduce
○ 7 TB of inbox data for over 100M users
● different failure detectors produce very different detection times
○ the Φ accrual detector detects failures in about 15 s vs. over 120 s with other detectors
● Cassandra is decentralized, but uses Zookeeper for some coordination
● Inbox Search
○ 50+ TB of data stored on a 150-node cluster
10
Practical experiences
Kafka: a Distributed Messaging System for Log Processing
● Log processing has become a critical component of the data pipeline for consumer internet companies.
● Activity data is a part of the production data pipeline used directly in site features.
● Every day, China Mobile collects 5–8TB of phone call records and Facebook gathers almost 6TB of various user activity events.
● the system should be distributed, scalable and offer high throughput
● log consumption should be possible in real time
11
Problem: managing large amounts of “log” data
Kafka: a Distributed Messaging System for Log Processing
● Early systems for processing this kind of data relied on physically scraping log files off production servers
● Most systems are designed for collecting and loading the log data into a data warehouse for offline consumption.
● Systems allowing online consumption are usually overcomplicated, which results in lower performance
● almost no existing systems support a “pull” consumption model
12
Existing solutions
Kafka: a Distributed Messaging System for Log Processing
● Pub/Sub system
○ a producer sends messages to topics
○ a consumer consumes from topics
○ messages are transferred via brokers
● To balance load, a topic is divided into partitions and each broker stores one or more of those partitions.
13
Architecture
Kafka: a Distributed Messaging System for Log Processing
● Messages can be packed together to reduce overhead
● Multiple producers and consumers can publish and retrieve messages at the same time
● Messages are evenly distributed among consumer streams
14
API
Sample producer code:
producer = new Producer(...);
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);
Sample consumer code:
streams[] = Consumer.createMessageStreams("topic1", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}
Kafka: a Distributed Messaging System for Log Processing
● simple storage
○ each partition corresponds to a logical log
○ messages have no id other than their offset in the log file
○ messages are consumed in order
■ the consumer keeps the state
○ messages are deleted after a retention period
● efficient transfer
○ a pull request retrieves multiple messages
○ uses the Linux sendfile API
○ no application-level cache
15
Architecture...
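A sketch of the offset-addressed log described above (the data structures and method names are invented): a message's only “id” is its offset in the partition's log, and the consumer, not the broker, remembers how far it has read.

import java.util.ArrayList;
import java.util.List;

// Illustrative partition log: an append-only sequence where a message is
// addressed by its byte offset and the consumer tracks its own position.
public class PartitionLogSketch {
    final List<byte[]> messages = new ArrayList<>();
    final List<Long> offsets = new ArrayList<>();   // byte offset of each message
    long nextOffset = 0;

    void append(byte[] payload) {
        messages.add(payload);
        offsets.add(nextOffset);
        nextOffset += payload.length;
    }

    // A pull request starting at 'offset' returns up to 'max' messages;
    // the consumer advances its own offset by the bytes it consumed.
    List<byte[]> pull(long offset, int max) {
        List<byte[]> batch = new ArrayList<>();
        int i = offsets.indexOf(offset);
        while (i >= 0 && i < messages.size() && batch.size() < max) {
            batch.add(messages.get(i++));
        }
        return batch;
    }
}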
Kafka: a Distributed Messaging System for Log Processing
● All messages from one partition are consumed by a single consumer
● consumers coordinate using Zookeeper
○ detecting consumer/broker changes
○ triggering the rebalance process
○ keeping track of the consumed offset
● rebalancing on broker/consumer change
● delivery
○ at-least-once delivery guarantee
○ in-order delivery from one partition
○ no broker redundancy
16
Distributed Coordination
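One way to picture the deterministic rebalance (illustrative; Kafka's actual algorithm divides the sorted partition list into contiguous ranges): because every consumer sees the same sorted partition and group-member lists in Zookeeper, each one can compute its own share without a master.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative deterministic assignment during rebalance: all consumers
// sort the same lists, so each can compute its own share independently.
public class RebalanceSketch {
    static List<String> partitionsFor(String consumer, List<String> consumers, List<String> partitions) {
        List<String> sortedConsumers = new ArrayList<>(consumers);
        List<String> sortedPartitions = new ArrayList<>(partitions);
        Collections.sort(sortedConsumers);
        Collections.sort(sortedPartitions);

        int me = sortedConsumers.indexOf(consumer);
        int n = sortedConsumers.size();
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < sortedPartitions.size(); i++) {
            if (i % n == me) mine.add(sortedPartitions.get(i));   // simple round-robin share
        }
        return mine;
    }
}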
Kafka: a Distributed Messaging System for Log Processing
● a Kafka cluster is co-located with each datacenter
● services publish to local Kafka brokers
● a hardware load balancer distributes the publish requests evenly
● online consumers run within the same datacenter
● a separate datacenter is used for offline analysis
● statistics:
○ end-to-end latency ~10 seconds
○ hundreds of gigabytes of data
○ a billion messages a day
17
LinkedIn usage
Kafka: a Distributed Messaging System for Log Processing
18
Experimental results
Workload Analysis of a Large-Scale Key-Value Store
● typical usage is as a layer in the data-retrieval hierarchy
● Memcached exposes data in RAM to clients over the network
● capacity is expanded by adding RAM or more servers
● consistent hashing determines the server for each key
● stored items can have different sizes
● memory is divided into slab classes, and objects are stored in the matching class
● an LRU policy is used for cache eviction
19
Memcached - a distributed hash table
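A sketch of slab-class selection (the growth factor and chunk sizes are made-up values): an item goes into the smallest class whose chunk size fits it, and each class maintains its own LRU list for eviction.

// Illustrative slab-class selection: chunk sizes grow geometrically and an
// item is stored in the smallest class whose chunk size fits it.
public class SlabSketch {
    static final double GROWTH_FACTOR = 1.25;   // made-up value
    static final int SMALLEST_CHUNK = 64;       // bytes, made-up value
    static final int LARGEST_CHUNK = 1 << 20;   // 1 MB

    static int slabClassFor(int itemSizeBytes) {
        int chunk = SMALLEST_CHUNK;
        int slabClass = 0;
        while (chunk < itemSizeBytes && chunk < LARGEST_CHUNK) {
            chunk = (int) Math.ceil(chunk * GROWTH_FACTOR);
            slabClass++;
        }
        return slabClass;   // each class keeps its own LRU list for eviction
    }

    public static void main(String[] args) {
        System.out.println(slabClassFor(100));   // small value -> small class
        System.out.println(slabClassFor(5000));  // larger value -> larger class
    }
}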
Workload Analysis of a Large-Scale Key-Value Store
20
Methodology & Pools used in the study
● a kernel module was used for sniffing the traffic
● the captured traces are 3-7 TB
● Apache HIVE was used for the analysis
● the traces were compared with the logs for verification
Workload Analysis of a Large-Scale Key-Value Store
21
Pools used in the study
Workload Analysis of a Large-Scale Key-Value Store
22
Key and value size distributions for all traces
The sizes of keys, up to Memcached’s limit of 250 B (not shown).
The sizes of values. Aggregated value sizes by the total amount of data they use in the cache.
Workload Analysis of a Large-Scale Key-Value Store
Figure 3: Request rates at different dates and times of day, in Coordinated Universal Time (UTC).
23
Temporal Patterns
Workload Analysis of a Large-Scale Key-Value Store
● hit rates and reasons for misses
● locality
○ repeating keys
○ locality over time
■ how many keys do not repeat within a time window
○ reuse period
■ the time between consecutive accesses to the same key
24
Cache behaviour
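One way such a reuse-period measurement could be computed from an access trace (the field names and bucketing are invented): remember each key's previous access time and histogram the gaps between consecutive accesses to the same key.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative reuse-period measurement over an access trace: the gap
// between consecutive accesses to the same key, bucketed per hour.
public class ReusePeriodSketch {
    final Map<String, Long> lastSeen = new HashMap<>();
    final TreeMap<Long, Long> histogram = new TreeMap<>();   // gap in hours -> count

    void access(String key, long timestampMillis) {
        Long previous = lastSeen.put(key, timestampMillis);
        if (previous != null) {
            long hours = (timestampMillis - previous) / 3_600_000L;
            histogram.merge(hours, 1L, Long::sum);
        }
    }
}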
Workload Analysis of a Large-Scale Key-Value Store
25
Hit rates & miss categories
Table 3: Miss categories in the last 24 hours of the ETC trace.
Workload Analysis of a Large-Scale Key-Value Store
26
Workload Analysis of a Large-Scale Key-Value Store
29
Statistical modelling
Workload Analysis of a Large-Scale Key-Value Store
Hit rates are inversely correlated with the pool size
Hit rates are not correlated with the locality
Possible improvements to hit rates:
- increase RAM
- use a different cache eviction policy
30
Discussion
Making clear graphs
31
sidenote
Questions
1. The main concern of Cassandra is write throughput; what are the tradeoffs?
2. Why is Kafka faster than the other systems to which it was compared?
3. What are the possible causes of data loss in Kafka?
4. What are the advantages of using a consensus service (Zookeeper) vs. a replicated master node?
5. How is churn handled in the different systems?
6. What is the problem with using Memcached as persistent storage (e.g. USR)?
32