Tools for Social Networking Infrastructures
1
Cassandra - a decentralised structured storage system
● hundreds of millions of users
● distributed infrastructure
● inbox changes constantly
● easily scalable
● dealing with failures
● keeping the cost low
2
Problem: Facebook Inbox Search
Cassandra - a decentralised structured storage system
● Coda, Ficus
● Google File System
● Bayou
● Dynamo
● Bigtable
3
Existing Solutions
Cassandra - a decentralised structured storage system
Table - multidimensional map indexed by a key
Row key - a string with no size restrictions, typically 16-36 bytes
Column family - a group of columns
Super column family - a group within a group of columns
Sorting - Columns can be sorted by name or date
It is possible to have multiple tables in one cluster.
4
Data Model
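As a minimal illustration of this nested-map model (the names and values are invented, and this is not Cassandra's actual API), a column family and a super column family can be sketched as sorted Java maps, which also reflects that columns are kept in sorted order:

import java.util.TreeMap;

// Illustrative sketch only: a column family maps a row key to a set of
// columns (name -> value) kept in sorted order; a super column family
// adds one more level of nesting.
public class DataModelSketch {
    // row key -> (column name -> column value)
    static TreeMap<String, TreeMap<String, byte[]>> columnFamily = new TreeMap<>();

    // row key -> (super column name -> (column name -> column value))
    static TreeMap<String, TreeMap<String, TreeMap<String, byte[]>>> superColumnFamily = new TreeMap<>();

    public static void main(String[] args) {
        columnFamily
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .put("displayName", "Alice".getBytes());

        superColumnFamily
            .computeIfAbsent("user:42", k -> new TreeMap<>())
            .computeIfAbsent("inbox", k -> new TreeMap<>())
            .put("msg:2024-01-01T10:00", "hello".getBytes());
    }
}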
Cassandra - a decentralised structured storage system
● a read/write request can go to any node in the cluster; that node determines the replicas for the key
● write: the request is routed to the replicas and waits until a quorum has acknowledged
● read
○ “weak”: the first response is sent back
○ “strong”: wait for a quorum of responses
5
System Architecture
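A rough sketch of the quorum behaviour above, assuming a made-up 3-replica setup (not Cassandra's real coordinator code): a “weak” read returns the first reply, while a “strong” read, like a write, waits for a quorum of acknowledgements.

import java.util.List;
import java.util.concurrent.*;

// Illustrative sketch of quorum coordination, not Cassandra's real code.
public class QuorumSketch {
    static final int REPLICAS = 3;
    static final int QUORUM = REPLICAS / 2 + 1;   // 2 of 3

    // Pretend each replica answers a read after some delay.
    static CompletableFuture<String> readFromReplica(int id, Executor pool) {
        return CompletableFuture.supplyAsync(() -> "value-from-replica-" + id, pool);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(REPLICAS);
        List<CompletableFuture<String>> replies = List.of(
            readFromReplica(0, pool), readFromReplica(1, pool), readFromReplica(2, pool));

        // "Weak" read: return whichever replica answers first.
        String weak = (String) CompletableFuture.anyOf(replies.toArray(new CompletableFuture[0])).get();

        // "Strong" read (and writes): block until a quorum has acknowledged.
        // (A real coordinator counts acknowledgements as they arrive.)
        int acked = 0;
        for (CompletableFuture<String> r : replies) {
            r.get();
            if (++acked >= QUORUM) break;
        }
        System.out.println("weak read: " + weak + ", quorum reached with " + acked + " replies");
        pool.shutdown();
    }
}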
Cassandra - a decentralised structured storage system
● consistent hashing
● each node has a randomly assigned position on the ring
● low impact of arrival/departure of nodes
● load balancing to alleviate heavily loaded nodes
6
System Architecture - Partitioning
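A toy consistent-hashing ring illustrating the idea (the hash function, node names and key names are placeholders): a key is served by the first node clockwise from the key's position, so adding or removing a node only remaps keys on the adjacent arc.

import java.util.SortedMap;
import java.util.TreeMap;

// Toy consistent-hashing ring; purely illustrative.
public class RingSketch {
    // position on the ring -> node name
    static final TreeMap<Integer, String> ring = new TreeMap<>();

    static int position(String s) {
        // Placeholder hash; a real system would use MD5 or similar.
        return s.hashCode() & 0x7fffffff;
    }

    static void addNode(String node)    { ring.put(position(node), node); }
    static void removeNode(String node) { ring.remove(position(node)); }

    // The coordinator for a key is the first node clockwise from the key's position.
    static String coordinatorFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(position(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        addNode("nodeA"); addNode("nodeB"); addNode("nodeC");
        System.out.println(coordinatorFor("user:42"));
        // Removing a node only remaps the keys it owned, not the whole key space.
        removeNode("nodeB");
        System.out.println(coordinatorFor("user:42"));
    }
}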
Cassandra - a decentralised structured storage system
● allows for high availability and durability
● the coordinator node is in charge of the replicas
● various replication options
○ “Rack Unaware” - use the N-1 consecutive nodes in the ring after the coordinator
● replication metadata is kept in Zookeeper
● replication across multiple data centers allows for no downtime in case of a crash
7
System Architecture - Replication
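Building on the toy ring above, “Rack Unaware” placement can be sketched as the coordinator plus the N-1 distinct nodes that follow it clockwise (illustrative only; Cassandra also offers rack- and datacenter-aware strategies).

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative "Rack Unaware" replica selection on the toy ring.
public class ReplicaSketch {
    static List<String> replicasFor(TreeMap<Integer, String> ring, int keyPosition, int n) {
        List<String> replicas = new ArrayList<>();
        Integer pos = ring.ceilingKey(keyPosition);
        if (pos == null) pos = ring.firstKey();
        // Walk clockwise at most once around the ring, collecting distinct nodes.
        for (int step = 0; step < ring.size() && replicas.size() < n; step++) {
            String node = ring.get(pos);
            if (!replicas.contains(node)) replicas.add(node);
            pos = ring.higherKey(pos);
            if (pos == null) pos = ring.firstKey();   // wrap around
        }
        return replicas;
    }
}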
Cassandra - a decentralised structured storage system
● Scuttlebutt - an efficient Gossip-based mechanism
● Φ Accrual Failure Detector
○ the module emits a value (Φ) which represents the suspicion level for a node
○ Φ is calculated from the arrival times of gossip messages using an exponential distribution
○ allows setting a threshold above which a node is suspected to be down
8
System Architecture - Membership
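A sketch of the accrual calculation under the exponential-distribution assumption mentioned above (the window size and threshold are made-up values): with mean inter-arrival time m, the probability that a heartbeat arrives later than t is e^(-t/m), so Φ = -log10(e^(-t/m)) = (t/m)·log10(e), and Φ keeps growing until the next gossip message arrives.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative Phi accrual failure detector under an exponential
// inter-arrival assumption; window size and threshold are made up.
public class PhiSketch {
    static final int WINDOW = 1000;
    final Deque<Long> intervals = new ArrayDeque<>();
    long lastHeartbeatMillis = -1;

    // Record a gossip heartbeat and remember the inter-arrival interval.
    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervals.addLast(nowMillis - lastHeartbeatMillis);
            if (intervals.size() > WINDOW) intervals.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    // Phi = -log10(P(a heartbeat arrives later than now)); with an
    // exponential distribution this is (t / mean) * log10(e).
    double phi(long nowMillis) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double t = nowMillis - lastHeartbeatMillis;
        return (t / mean) * Math.log10(Math.E);
    }

    boolean suspected(long nowMillis, double threshold) {
        return phi(nowMillis) > threshold;   // threshold picked by the operator
    }
}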
Cassandra - a decentralised structured storage system
● data is stored both in memory and on the file system
● a write consists of:
○ 1. file system: commit log update
○ 2. memory: in-memory data structure update
● the in-memory data structure is saved to a data file on disk once it crosses a size threshold
● all writes are sequential and generate an index for lookup
● a merge process runs in the background to collate the data files
● lookup:
○ first check memory
○ then check the files on disk, from newest to oldest
■ a bloom filter is used to check whether a file contains the key
9
System Architecture - Persistence
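A condensed sketch of the write and lookup paths above (the log format, flush threshold and in-memory structures are invented; a real implementation also keeps per-file indexes and Bloom filters):

import java.io.*;
import java.util.*;

// Illustrative write/lookup path: commit log first, then an in-memory
// table that is flushed to an immutable data file once it grows too large.
public class PersistenceSketch {
    static final int FLUSH_THRESHOLD = 10_000;
    final Writer commitLog;
    final TreeMap<String, byte[]> memtable = new TreeMap<>();
    final List<SortedMap<String, byte[]>> dataFiles = new ArrayList<>(); // newest first

    PersistenceSketch(File logFile) throws IOException {
        commitLog = new BufferedWriter(new FileWriter(logFile, true));
    }

    void write(String key, byte[] value) throws IOException {
        commitLog.write(key + "\n");        // 1. durability: append the mutation (simplified to the key)
        commitLog.flush();
        memtable.put(key, value);           // 2. update the in-memory structure
        if (memtable.size() >= FLUSH_THRESHOLD) {
            dataFiles.add(0, new TreeMap<>(memtable));  // flush as an immutable, sorted "data file"
            memtable.clear();
        }
    }

    byte[] read(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);    // memory first
        for (SortedMap<String, byte[]> file : dataFiles) {          // then newest to oldest
            // A real system would consult a Bloom filter here before touching the file.
            if (file.containsKey(key)) return file.get(key);
        }
        return null;
    }
}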
Cassandra - a decentralised structured storage system
● moving data from MySQL to Cassandra using Map/Reduce
○ 7 TB of inbox data for over 100M users
● different failure detectors produce very different detection times
○ the Φ accrual detector detects failures in about 15 s vs. over 120 s with other detectors
● Cassandra is decentralized, but uses Zookeeper for some coordination
● Inbox Search
○ 50+ TB of data stored on a 150-node cluster
10
Practical experiences
Kafka: a Distributed Messaging System for Log Processing
● Log processing has become a critical component of the data pipeline for consumer internet companies.
● Activity data is a part of the production data pipeline used directly in site features.
● Every day, China Mobile collects 5–8TB of phone call records and Facebook gathers almost 6TB of various user activity events.
● the system should be distributed, scalable and offer high throughput
● log consumption should be possible in real time
11
Problem: managing large amounts of “log” data
Kafka: a Distributed Messaging System for Log Processing
● Early systems for processing this kind of data relied on physically scraping log files off production servers
● Most systems are designed for collecting and loading the log data into a data warehouse for offline consumption.
● Systems allowing online consumption are usually overcomplicated, which results in lower performance
● almost no existing systems support a “pull” consumption model
12
Existing solutions
Kafka: a Distributed Messaging System for Log Processing
● Pub/Sub system
○ a producer sends messages to topics
○ a consumer consumes from topics
○ messages are transferred via brokers
● To balance load, a topic is divided into partitions and each broker stores one or more of those partitions.
13
Architecture
Kafka: a Distributed Messaging System for Log Processing
● Messages can be packed together to reduce overhead
● Multiple producers and consumers can publish and retrieve messages at the same time
● Messages are evenly distributed among consumer streams
14
API
Sample producer code:
producer = new Producer(...);
message = new Message("test message str".getBytes());
set = new MessageSet(message);
producer.send("topic1", set);
Sample consumer code:
streams[] = Consumer.createMessageStreams("topic1", 1);
for (message : streams[0]) {
    bytes = message.payload();
    // do something with the bytes
}
Kafka: a Distributed Messaging System for Log Processing
● simple storage
○ each partition corresponds to a logical log
○ messages have no id other than their offset in the log file
○ messages are consumed in order
■ the consumer keeps the state
○ messages are deleted after a retention period
● efficient transfer
○ a pull request retrieves multiple messages
○ uses the Linux sendfile API
○ no application-level cache
15
Architecture...
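A sketch of the offset-addressed log described above (the data structures and method names are invented): a message's only “id” is its offset in the partition's log, and the consumer, not the broker, remembers how far it has read.

import java.util.ArrayList;
import java.util.List;

// Illustrative partition log: an append-only sequence where a message is
// addressed by its byte offset and the consumer tracks its own position.
public class PartitionLogSketch {
    final List<byte[]> messages = new ArrayList<>();
    final List<Long> offsets = new ArrayList<>();   // byte offset of each message
    long nextOffset = 0;

    void append(byte[] payload) {
        messages.add(payload);
        offsets.add(nextOffset);
        nextOffset += payload.length;
    }

    // A pull request starting at 'offset' returns up to 'max' messages;
    // the consumer advances its own offset by the bytes it consumed.
    List<byte[]> pull(long offset, int max) {
        List<byte[]> batch = new ArrayList<>();
        int i = offsets.indexOf(offset);
        while (i >= 0 && i < messages.size() && batch.size() < max) {
            batch.add(messages.get(i++));
        }
        return batch;
    }
}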
Kafka: a Distributed Messaging System for Log Processing
● All messages from one partition are consumed by a single consumer
● consumers coordinate using Zookeeper
○ detecting consumer/broker changes
○ triggering the rebalance process
○ keeping track of the consumed offset
● rebalancing on broker/consumer change
● delivery
○ at-least-once delivery guarantee
○ in-order delivery from one partition
○ no broker redundancy
16
Distributed Coordination
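One way to picture the deterministic rebalance (illustrative; Kafka's actual algorithm divides the sorted partition list into contiguous ranges): because every consumer sees the same sorted partition and group-member lists in Zookeeper, each one can compute its own share without a master.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative deterministic assignment during rebalance: all consumers
// sort the same lists, so each can compute its own share independently.
public class RebalanceSketch {
    static List<String> partitionsFor(String consumer, List<String> consumers, List<String> partitions) {
        List<String> sortedConsumers = new ArrayList<>(consumers);
        List<String> sortedPartitions = new ArrayList<>(partitions);
        Collections.sort(sortedConsumers);
        Collections.sort(sortedPartitions);

        int me = sortedConsumers.indexOf(consumer);
        int n = sortedConsumers.size();
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < sortedPartitions.size(); i++) {
            if (i % n == me) mine.add(sortedPartitions.get(i));   // simple round-robin share
        }
        return mine;
    }
}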
Kafka: a Distributed Messaging System for Log Processing
● a Kafka cluster is co-located with each datacenter
● services publish to local Kafka brokers
● a hardware load balancer distributes the publish requests evenly
● online consumers run within the same datacenter
● a separate datacenter is used for offline analysis
● statistics:
○ end-to-end latency ~10 seconds
○ hundreds of gigabytes of data
○ a billion messages a day
17
LinkedIn usage
Kafka: a Distributed Messaging System for Log Processing
18
Experimental results
Workload Analysis of a Large-Scale Key-Value Store
● typical usage is as a layer in the data-retrieval hierarchy
● Memcached exposes data in RAM to clients over the network
● capacity is expanded by adding RAM or more servers
● consistent hashing determines the server for each key
● stored items can have different sizes
● memory is divided into slab classes, and objects are stored in the matching class
● an LRU policy is used for cache eviction
19
Memcached - a distributed hash table
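A sketch of slab-class selection (the growth factor and chunk sizes are made-up values): an item goes into the smallest class whose chunk size fits it, and each class maintains its own LRU list for eviction.

// Illustrative slab-class selection: chunk sizes grow geometrically and an
// item is stored in the smallest class whose chunk size fits it.
public class SlabSketch {
    static final double GROWTH_FACTOR = 1.25;   // made-up value
    static final int SMALLEST_CHUNK = 64;       // bytes, made-up value
    static final int LARGEST_CHUNK = 1 << 20;   // 1 MB

    static int slabClassFor(int itemSizeBytes) {
        int chunk = SMALLEST_CHUNK;
        int slabClass = 0;
        while (chunk < itemSizeBytes && chunk < LARGEST_CHUNK) {
            chunk = (int) Math.ceil(chunk * GROWTH_FACTOR);
            slabClass++;
        }
        return slabClass;   // each class keeps its own LRU list for eviction
    }

    public static void main(String[] args) {
        System.out.println(slabClassFor(100));   // small value -> small class
        System.out.println(slabClassFor(5000));  // larger value -> larger class
    }
}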
Workload Analysis of a Large-Scale Key-Value Store
20
Methodology & Pools used in the study
● a kernel module was used for sniffing the traffic
● the captured traces are 3-7 TB
● Apache HIVE was used for the analysis
● the traces were compared with the logs for verification
Workload Analysis of a Large-Scale Key-Value Store
21
Pools used in the study
Workload Analysis of a Large-Scale Key-Value Store
22
Key and value size distributions for all traces
The sizes of keys, up to Memcached’s limit of 250 B (not shown).
The sizes of values. Aggregated value sizes by the total amount of data they use in the cache.
Workload Analysis of a Large-Scale Key-Value Store
Figure 3: Request rates at different dates and times of day, in Coordinated Universal Time (UTC).
23
Temporal Patterns
Workload Analysis of a Large-Scale Key-Value Store
● hit rates and reasons for misses
● locality
○ repeating keys
○ locality over time
■ how many keys do not repeat within a time window
○ reuse period
■ the time between consecutive accesses to the same key
24
Cache behaviour
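One way such a reuse-period measurement could be computed from an access trace (the field names and bucketing are invented): remember each key's previous access time and histogram the gaps between consecutive accesses to the same key.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative reuse-period measurement over an access trace: the gap
// between consecutive accesses to the same key, bucketed per hour.
public class ReusePeriodSketch {
    final Map<String, Long> lastSeen = new HashMap<>();
    final TreeMap<Long, Long> histogram = new TreeMap<>();   // gap in hours -> count

    void access(String key, long timestampMillis) {
        Long previous = lastSeen.put(key, timestampMillis);
        if (previous != null) {
            long hours = (timestampMillis - previous) / 3_600_000L;
            histogram.merge(hours, 1L, Long::sum);
        }
    }
}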
Workload Analysis of a Large-Scale Key-Value Store
25
Hit rates & miss categories
Table 3: Miss categories in the last 24 hours of the ETC trace.
Workload Analysis of a Large-Scale Key-Value Store
26
Workload Analysis of a Large-Scale Key-Value Store
29
Statistical modelling
Workload Analysis of a Large-Scale Key-Value Store
Hit rates are inversely correlated with the pool size
Hit rates are not correlated with the locality
Possible improvements to hit rates:
- increase RAM
- use a different cache eviction policy
30
Discussion
Making clear graphs
31
sidenote
Questions
1. The main concern of Cassandra is write throughput; what are the tradeoffs?
2. Why is Kafka faster than the other systems to which it was compared?
3. What are the possible causes of data loss in Kafka?
4. What are the advantages of using a consensus service (Zookeeper) vs. a replicated master node?
5. How is churn handled in the different systems?
6. What is the problem with using Memcached as persistent storage (e.g. USR)?
32