Date post: 22-Sep-2020
Tools for Social Networking Infrastructures


Cassandra - a decentralised structured storage system

Problem: Facebook Inbox Search

● hundreds of millions of users
● distributed infrastructure
● inbox changes constantly
● easily scalable
● dealing with failures
● keeping the cost low

Existing Solutions

● Coda, Ficus
● Google File System
● Bayou
● Dynamo
● Bigtable

Data Model

Table - a multidimensional map indexed by a key

Row key - a string with no size restrictions, typically 16 to 36 bytes long

Column family - a group of columns

Super column family - a column family within a column family

Sorting - columns can be sorted by name or by time

It is possible to have multiple tables in one cluster.

System Architecture

● a read/write request can go to any node in the cluster; that node determines the replicas for the key
● write: the request is routed to the replicas, and the node waits until a quorum has acknowledged
● read
    ○ “weak”: the first response is sent back
    ○ “strong”: wait for a quorum of responses
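The quorum write and the weak/strong read paths on this slide can be sketched as follows. This is a minimal in-memory illustration, not Cassandra's implementation: the `Replica` class, the `up` flag used to simulate a failed node, the majority quorum size, and the last-write-wins timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; `up` simulates node failure."""
    up: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value, ts):
        if not self.up:
            return False                      # no acknowledgement
        # Keep only the newest version (last-write-wins by timestamp).
        if key not in self.data or self.data[key][0] < ts:
            self.data[key] = (ts, value)
        return True

    def read(self, key):
        if not self.up:
            return None                       # no response
        return self.data.get(key, (0, None))  # (timestamp, value)


def quorum_write(replicas, key, value, ts):
    """Route the write to all replicas; succeed if a majority acknowledges."""
    acks = sum(r.write(key, value, ts) for r in replicas)
    return acks >= len(replicas) // 2 + 1


def coordinator_read(replicas, key, strong):
    """Weak read: first response wins. Strong read: newest value in a quorum."""
    responses = [resp for resp in (r.read(key) for r in replicas) if resp is not None]
    if not strong:
        return responses[0][1] if responses else None
    quorum = len(replicas) // 2 + 1
    if len(responses) < quorum:
        raise RuntimeError("quorum not reached")
    return max(responses[:quorum], key=lambda resp: resp[0])[1]
```

A weak read can return a stale value when the first replica to answer missed an earlier write; the strong read pays for a quorum of responses to avoid that.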

System Architecture - Partitioning

● consistent hashing
● each node has a randomly assigned position on the ring
● low impact of arrival/departure of the nodes
● load balancing to alleviate heavily loaded nodes
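A small sketch of consistent hashing shows why node arrival has low impact: only the keys that fall just before the new node's position change owner. The `Ring` class and the use of MD5 to derive positions are assumptions for illustration; hashing the node name stands in for the random position assignment mentioned on the slide.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Sorted (position, node) pairs form the ring.
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [p for p, _ in self._ring]

    def owner(self, key: str) -> str:
        """First node clockwise from the key's position, wrapping around."""
        i = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[i % len(self._ring)][1]


keys = [f"user{i}" for i in range(200)]
before = Ring(["A", "B", "C"])
after = Ring(["A", "B", "C", "D"])
# Keys whose owner changed when node D joined the ring.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
```

Every key in `moved` is now owned by `D`; no key migrates between the three original nodes.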

System Architecture - Replication

● Allows for high availability and durability
● Coordinator node is in charge of the replicas
● various replication options
    ○ “Rack Unaware” - use N-1 consecutive nodes in the ring after the coordinator
● replication metadata is kept in Zookeeper
● replication across multiple data centers allows for no downtime in case of a crash
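The “Rack Unaware” rule can be written in a few lines. The sketch assumes the ring is given as a list of node names in ring order; the function name and the error handling are hypothetical.

```python
def rack_unaware_replicas(ring_order, coordinator, n):
    """Return the coordinator plus the N-1 nodes that follow it on the ring."""
    if n > len(ring_order):
        raise ValueError("replication factor exceeds cluster size")
    start = ring_order.index(coordinator)
    # Walk clockwise from the coordinator, wrapping at the end of the list.
    return [ring_order[(start + i) % len(ring_order)] for i in range(n)]


replicas = rack_unaware_replicas(["A", "B", "C", "D"], "C", 3)
```

With a replication factor of 3 and coordinator `C`, the replica set wraps around the ring to `C`, `D`, `A`.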

System Architecture - Membership

● Scuttlebutt - efficient gossip-based mechanism
● Φ Accrual Failure Detector
    ○ module emits a value (Φ) which represents the suspicion level for a node
    ○ Φ is calculated from the arrival times of gossip messages, using the exponential distribution
    ○ allows for setting a threshold for suspecting that a node is down
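One way to compute Φ under the exponential model is sketched below. It assumes that the probability of a gap longer than t between gossip messages is exp(-t / mean) and that Φ is the negative base-10 logarithm of that probability; the class name, the sliding-window size, and the threshold in the usage note are illustrative assumptions.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window=1000):
        self._intervals = deque(maxlen=window)  # recent inter-arrival times
        self._last = None

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self._last is not None:
            self._intervals.append(now - self._last)
        self._last = now

    def phi(self, now):
        """Suspicion level: grows the longer the node stays silent."""
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self._last
        # -log10(exp(-elapsed / mean)) simplifies to elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

A caller would compare `phi(now)` against a chosen threshold (say 5, a hypothetical value here) and mark the node as suspected once the threshold is exceeded.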

System Architecture - Persistence

● Data is stored in memory and on the file system
● a write consists of:
    ○ 1. file system: commit log update
    ○ 2. memory: data structure update
● the in-memory data structure is saved to a data file on disk once it crosses a threshold
● all writes are sequential and generate an index for lookup
● a merge process runs in the background to collate the data files
● lookup:
    ○ first check memory
    ○ then check all files on disk from newest to oldest
        ■ a bloom filter is used to check whether a file contains the key
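The write and lookup paths on this slide can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: the commit log and data files are Python lists rather than files, the flush threshold counts keys, the bloom filter sizes are arbitrary, and the background merge process is not modelled.

```python
import hashlib


class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.field >> p) & 1 for p in self._positions(key))


class Store:
    def __init__(self, flush_threshold=2):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.memtable = {}     # in-memory data structure
        self.data_files = []   # oldest first; each entry is (bloom, data)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log update
        self.memtable[key] = value            # 2. in-memory update
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.data_files.append((bloom, dict(sorted(self.memtable.items()))))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                       # first check memory
            return self.memtable[key]
        for bloom, data in reversed(self.data_files):  # newest to oldest
            if bloom.might_contain(key) and key in data:
                return data[key]
        return None
```

The bloom filter lets a lookup skip data files that cannot contain the key, which matters once many files have accumulated between merges.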

Practical experiences

● Moving data from MySQL to Cassandra using Map/Reduce
    ○ 7 TB of inbox data for over 100M users
● Different failure detectors produce very different detection times
    ○ the Φ detector detects a failure in about 15 s vs. over 120 s with other detectors
● Cassandra is decentralized, but uses Zookeeper for some coordination
● Inbox Search
    ○ 50+ TB of data stored on a 150-node cluster


Kafka: a Distributed Messaging System for Log Processing

Problem: managing large amounts of “log” data

● Log processing has become a critical component of the data pipeline for consumer internet companies.
● Activity data is a part of the production data pipeline used directly in site features.
● Every day, China Mobile collects 5–8 TB of phone call records and Facebook gathers almost 6 TB of various user activity events.
● System should be distributed, scalable and offer high throughput
● Log consumption should be possible in real time

Existing solutions

● Early systems for processing this kind of data relied on physically scraping log files off production servers
● Most systems are designed for collecting and loading the log data into a data warehouse for offline consumption
● Systems allowing online consumption are usually overcomplicated, which results in lower performance
● Almost no systems support a “pull” model

Architecture

● Pub/Sub system
    ○ Producer sends messages to topics
    ○ Consumer consumes from topics
    ○ Messages are transferred via broker
● To balance load, a topic is divided into partitions and each broker stores one or more of those partitions.
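As an illustration of how a topic's partitions might be spread over brokers, the sketch below assigns partition ids round-robin. The round-robin policy, the function name, and the broker names are assumptions; the slide only states that each broker stores one or more partitions.

```python
from collections import defaultdict


def assign_partitions(brokers, num_partitions):
    """Spread partition ids over brokers round-robin (an illustrative policy)."""
    layout = defaultdict(list)
    for p in range(num_partitions):
        layout[brokers[p % len(brokers)]].append(p)
    return dict(layout)


layout = assign_partitions(["broker1", "broker2"], 5)
```

Here `broker1` ends up with partitions 0, 2 and 4, and `broker2` with partitions 1 and 3.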

API

● Messages can be packed together to reduce overhead
● Multiple producers and consumers can publish and retrieve messages at the same time
● Messages are evenly distributed among consumer streams

Sample producer code:

    producer = new Producer(...);
    message = new Message("test message str".getBytes());
    set = new MessageSet(message);
    producer.send("topic1", set);

Sample consumer code:

    streams[] = Consumer.createMessageStreams("topic1", 1);
    for (message : streams[0]) {
        bytes = message.payload();
        // do something with the bytes
    }

Architecture (continued)

● Simple Storage
    ○ each partition corresponds to a logical log
    ○ messages have no explicit id; each is addressed by its offset in the log
    ○ messages are consumed in order
        ■ consumer keeps the state
    ○ messages are deleted after a retention period
● Efficient transfer
    ○ a pull request retrieves multiple messages
    ○ Linux sendfile API usage
    ○ no application-level cache (the file system page cache is used instead)
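The storage model can be sketched as an append-only log addressed by offset, with the consumer tracking its own position. This is an in-memory stand-in under stated assumptions: the class names are hypothetical, offsets advance by payload length only (no header), and retention-based deletion and sendfile transfer are not modelled.

```python
class PartitionLog:
    """In-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._messages = []     # list of (offset, payload)
        self._next_offset = 0

    def append(self, payload: bytes) -> int:
        offset = self._next_offset
        self._messages.append((offset, payload))
        self._next_offset += len(payload)  # offsets grow by message length
        return offset

    def fetch(self, offset: int, max_bytes: int):
        """Return consecutive payloads starting at `offset`, up to `max_bytes`."""
        batch, size = [], 0
        for start, payload in self._messages:
            if start < offset:
                continue
            if size + len(payload) > max_bytes:
                break
            batch.append(payload)
            size += len(payload)
        return batch


class Consumer:
    def __init__(self, log):
        # The consumer, not the broker, remembers how far it has read.
        self.log, self.offset = log, 0

    def poll(self, max_bytes=1024):
        batch = self.log.fetch(self.offset, max_bytes)
        self.offset += sum(len(p) for p in batch)
        return batch
```

Because the position lives in the consumer, rewinding is just a matter of setting `offset` back to an earlier value and polling again.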

Distributed Coordination

● All messages from one partition are consumed by a single consumer within a consumer group
● Consumers coordinate using Zookeeper
    ○ detecting consumer/broker changes
    ○ triggering rebalance process
    ○ keeping track of the consumed offset
● Rebalancing on broker/consumer change
● Delivery
    ○ at-least-once delivery guarantee
    ○ in-order delivery from one partition
    ○ no broker redundancy
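A rebalance can be sketched as a deterministic split of the sorted partition list among the sorted consumers, so that every consumer computes the same answer independently. The range-style split and the function name are assumptions for illustration; in the real system the partition and consumer lists would come from Zookeeper rather than from function arguments.

```python
def assign(partitions, consumers, me):
    """Partitions owned by consumer `me` after a rebalance (range-style split)."""
    partitions, consumers = sorted(partitions), sorted(consumers)
    i = consumers.index(me)
    base, extra = divmod(len(partitions), len(consumers))
    # The first `extra` consumers each take one additional partition.
    start = i * base + min(i, extra)
    return partitions[start:start + base + (1 if i < extra else 0)]


partitions = ["b1-0", "b1-1", "b2-0", "b2-1", "b3-0"]
two = {c: assign(partitions, ["c1", "c2"], c) for c in ["c1", "c2"]}
three = {c: assign(partitions, ["c1", "c2", "c3"], c) for c in ["c1", "c2", "c3"]}
```

Each partition has exactly one owner before and after the third consumer joins, which matches the single-consumer-per-partition rule on the slide.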

LinkedIn usage

● a Kafka cluster is co-located with each datacenter
● services publish to local Kafka brokers
● a hardware load-balancer distributes the publish requests evenly
● online consumers run within the same datacenter
● separate datacenter for offline analysis
● Statistics:
    ○ end-to-end latency ~10 seconds
    ○ hundreds of gigabytes of data
    ○ close to a billion messages a day


Experimental results


Workload Analysis of a Large-Scale Key-Value Store

Memcached - a distributed hash table

● Typical usage is a layer in a data-retrieval hierarchy
● Memcached exposes the data in RAM to clients over the network
● Expanded by adding RAM or more servers
● Consistent hashing to determine the server per key
● Stored items can have different sizes
● Memory is divided into slab classes, and objects are stored in the matching class
● LRU method used for cache eviction
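Slab-class selection and LRU eviction can be sketched separately. The chunk sizes (starting at 64 bytes and doubling), the number of classes, and both names are hypothetical values chosen for the example, not Memcached's actual configuration; the sketch also uses a single LRU queue rather than one per slab class.

```python
from collections import OrderedDict


def slab_class(size, smallest=64, factor=2, classes=8):
    """Index of the smallest slab class whose chunk size fits `size` bytes."""
    chunk = smallest
    for index in range(classes):
        if size <= chunk:
            return index
        chunk *= factor
    raise ValueError("item too large for any slab class")


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # least recently used first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```

Rounding every item up to its class's chunk size trades some wasted memory for simple allocation; LRU then decides which item leaves when the cache is full.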

Methodology & Pools used in the study

● kernel module used for sniffing
● captured traces are 3-7 TB
● Apache HIVE used for analysis
● comparison of the traces with the logs for verification


Pools used in the study


Key and value size distributions for all traces

The sizes of keys, up to Memcached’s limit of 250 B (not shown).

The sizes of values. Aggregated value sizes by the total amount of data they use in the cache.

Temporal Patterns

Figure 3: Request rates at different dates and times of day, Coordinated Universal Time (UTC).

Cache behaviour

● Hit rates and reasons for misses
● Locality
    ○ Repeating keys
    ○ Locality over time
        ■ how many keys do not repeat in time proximity
    ○ Reuse period
        ■ time between consecutive accesses to a key
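The two locality measures can be computed from a request trace as sketched below. The trace format (time-ordered `(timestamp, key)` pairs) and the function names are assumptions, and the second function is only one possible reading of "keys that do not repeat in time proximity": it treats the whole trace passed in as one time window.

```python
from collections import Counter


def reuse_periods(trace):
    """Time gaps between consecutive accesses to the same key."""
    last_seen, gaps = {}, []
    for ts, key in trace:
        if key in last_seen:
            gaps.append(ts - last_seen[key])
        last_seen[key] = ts
    return gaps


def non_repeating_fraction(trace):
    """Share of distinct keys that are requested exactly once in the window."""
    counts = Counter(key for _, key in trace)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


trace = [(0, "a"), (1, "b"), (4, "a"), (5, "c"), (9, "a")]
gaps = reuse_periods(trace)
fraction = non_repeating_fraction(trace)
```

In this toy trace key `a` is reused after 4 and then 5 time units, while `b` and `c` never repeat, so two of the three distinct keys are non-repeating.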


Hit rates & miss categories

Table 3: Miss categories in last 24 hours of the ETC trace.


Statistical modelling

Discussion

Hit rates are inversely correlated with the pool size

Hit rates are not correlated with the locality

Improvements of hit rates:

- Increase RAM
+ different cache eviction policy

Sidenote

Making clear graphs

Questions

1. The main concern of Cassandra is write throughput; what are the trade-offs?
2. Why is Kafka faster than the other services to which it was compared?
3. What are the possible causes of data loss in Kafka?
4. What are the advantages of using a consensus service (Zookeeper) vs. a replicated master node?
5. How is churn handled in different systems?
6. What is the problem with using Memcached as persistent storage (e.g. USR)?