PeopleStore: blazing-fast storage for 2.6 billion profiles
And other Cassandra use cases @ MyHeritage
Ran Peled, Chief Architect
Tech Talk, Dec 2016
Who are we?
MyHeritage is a leading destination for discovering, preserving, and sharing family history.
Recently added: DNA for genealogy.
Personal profiles in family trees
Family trees: a complex network of people, each with personal info, life events, and connections to relatives.
Personal profiles in family trees – Sharding MySQL
Family trees: a complex network of people, each with personal info, life events, and connections to relatives. Many interconnected MySQL tables. Millions of daily updates.
[Diagram: a single family site's interconnected MySQL tables: Individual, Family, ChildInFamily, Event, FamilyEvent, Tags, Photos]
Good response time for single family site access, using MySQL database sharding.
Over 650 shards on more than 20 physical hosts, and growing.
[Diagram: family sites spread across the shards (Shard 1 ... Shard 650), each shard holding the full set of per-site tables]
The issue with RDBMS sharding
Sharding becomes problematic when multiple shards are needed at once, for example to display search results and profile matches coming from many family trees.
It is also costly to scale for more readers.
Options:
• Build a custom parallel-fetch aggregator service
• NoSQL
Cassandra to the rescue
Cassandra recap:
• Key-value store
• Ring-based consistent-hashing cluster
• Support for clusters split between data centers
• Data redundancy and consistency at a user-controlled level (illustrated below)
• Append-only, high write throughput
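To illustrate the user-controlled consistency, here is a minimal sketch using the DataStax Java driver; the contact point, the query values, and the table (from the PeopleStore schema shown later in this deck) are placeholder assumptions. Each statement can request its own consistency level, trading latency against how many replicas must acknowledge.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyDemo {
    public static void main(String[] args) {
        // Placeholder contact point; production clients list several nodes.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // With RF=3, QUORUM means 2 of the 3 replicas must acknowledge the read.
            SimpleStatement read = new SimpleStatement(
                    "SELECT name FROM peoplestore.people "
                    + "WHERE site_id = 1 AND tree_id = 1 AND individual_id = 1");
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);
        }
    }
}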
PeopleStore
PeopleStore: Overview
• Store 2.6 billion profiles (growing by over a million a day)
• Provide very fast read access
• Shadows the MySQL source of truth (at least for the foreseeable future)
• Data consistency is critical
• Store each person as one aggregated record in Cassandra, including ALL info for typical uses, to minimize nested/follow-up queries: get all information needed at once
• Decision point: replicate relatives, or point to their record?
PeopleStore: Architecture
[Architecture diagram: PHP web servers call a pool of PeopleStore microservices. MySQL, the highly sharded RDBMS, remains the source of truth; online flows apply synchronous updates through the microservices to the Cassandra cluster, which serves multi-item fetches. A Hadoop cluster performs the batch first load / reload (mass loading).]
PeopleStore: Schema
CREATE TABLE peoplestore.people (
    site_id int,
    tree_id int,
    individual_id int,
    adopted_child_in_family_id int,
    child_in_family_id int,
    foster_child_in_family_id int,
    gender text,
    is_alive boolean,
    privacy_level int,
    last_update int,
    loading_mode int,
    loading_time timestamp,
    thumbnail text,
    name text,
    events text,
    photos text,
    relatives text,
    PRIMARY KEY (site_id, tree_id, individual_id)
) WITH ... compaction = {'class': '...LeveledCompactionStrategy'};
6 hosts, RF=3. The columns fall into three groups: the ID (primary key), metadata, and JSON blobs.
• JSON: flexibility of structure (stored as text; not the Cassandra 2.2 JSON support)
• Split fields: flexibility to fetch only the fields needed (see the fetch sketch below)
• Not using a collection for plural fields, due to the Cassandra limitation on using an IN clause on a table with collection fields (a non-issue for us)
• Future: use User Defined Types
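A minimal sketch of the multi-item fetch this schema enables, using the DataStax Java driver 2.1; the contact point and the ID values are illustrative assumptions. Selecting specific columns exploits the split fields, and binding a list to IN ? fetches many profiles in a single query.

import java.util.Arrays;
import java.util.List;
import com.datastax.driver.core.*;

public class MultiItemFetch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("peoplestore")) {
            // Select only the split fields this flow needs.
            PreparedStatement ps = session.prepare(
                    "SELECT individual_id, name, events FROM people "
                    + "WHERE site_id = ? AND tree_id = ? AND individual_id IN ?");
            // Illustrative IDs; a List binds directly to the "IN ?" marker.
            List<Integer> ids = Arrays.asList(17, 42, 99);
            for (Row row : session.execute(ps.bind(101, 1, ids))) {
                String name = row.getString("name");      // plain text field
                String events = row.getString("events");  // JSON blob, parsed by the service
                System.out.println(row.getInt("individual_id") + " " + name + " " + events);
            }
        }
    }
}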
The relatives blob holds only minimal relatives info (ID + name); fetching full relative data requires another query.
We started with Size-Tiered Compaction; it generated thousands of SSTables and slowed query time. Moving to Leveled Compaction solved the issue.
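The switch itself is a one-line schema change; a sketch, assuming an open Java driver Session named session. Cassandra then rewrites the existing SSTables into levels in the background.

// One-off migration, assuming an open driver Session `session`.
session.execute("ALTER TABLE peoplestore.people WITH compaction = "
        + "{'class': 'LeveledCompactionStrategy'}");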
PeopleStore: microservice
Clients
• Control exposure, read/write per flow
• Discover services by listing DNS SRV records (see the discovery sketch below)
• Clients do round-robin on these services

Services
• A Spring Boot Java REST server
• Deployed as a Docker container managed by Mesos & Marathon
• Mesos manages DNS entries
• Mesos monitors service health
• Metrics sent to JMX

Failure recovery needed despite redundancy
• In write, for consistency; in read, for availability
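A minimal sketch of SRV-based discovery with client-side round-robin. The real clients are PHP; this Java version, the SRV name, and the JNDI DNS lookup are illustrative assumptions.

import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;

public class SrvRoundRobin {
    private final List<String> endpoints = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    // List the SRV records published for the service by Mesos-DNS.
    public SrvRoundRobin(String srvName) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        InitialDirContext ctx = new InitialDirContext(env);
        Attribute srv = ctx.getAttributes(srvName, new String[] {"SRV"}).get("SRV");
        for (int i = 0; i < srv.size(); i++) {
            // Each SRV record looks like: "priority weight port target"
            String[] parts = srv.get(i).toString().split(" ");
            endpoints.add(parts[3] + ":" + parts[2]);
        }
    }

    // Simple round-robin over the discovered service instances.
    public String nextEndpoint() {
        return endpoints.get(Math.abs(next.getAndIncrement()) % endpoints.size());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical SRV name; one record exists per running container.
        SrvRoundRobin rr = new SrvRoundRobin("_peoplestore._tcp.myheritage.local");
        System.out.println("calling " + rr.nextEndpoint());
    }
}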
[Diagram: PHP web servers resolve PeopleStore microservice instances (Java) through the Mesos + Marathon DNS, round-robin between them, and perform write failure recovery.]
PeopleStore: Mass Loading
To bootstrap the system, and in case of major schema/logic changes, we had to load 2.2 billion person profiles at once.
Evaluated:
• Cassandra's sstableloader tool
• hdfs2cass from Spotify
Cons:
• Use SSTableSimpleWriter and Cassandra streaming
• Very sensitive to the C* version
Selected: Hadoop + online Cassandra updates
PeopleStore: Mass Loading with Hadoop
[Diagram: MySQL shards feed an Extract and Aggregate step (MySQL extractor + Pig flow, producing Avro), followed by a Load step (Crunch + the Cassandra driver) on the Hadoop cluster.]
• Tested logged/unlogged BATCH writes: they do NOT help performance
• Had to implement write retries to reach 0 failures (see the retry sketch below)
• Collect stats into Hadoop counters
• Load time: 2.2 billion items, 6 Hadoop nodes, 6 C* nodes; ~30k writes per second; ~17 hours of loading plus hours of compaction time. The impact on read latency was very reasonable.
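A minimal sketch of the write-retry loop mentioned above, using the DataStax Java driver; the attempt limit and backoff are illustrative assumptions, not the production values.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.DriverException;

public class RetryingWriter {
    // Retry each write a few times with a short backoff before giving up.
    static void writeWithRetries(Session session, BoundStatement stmt) throws InterruptedException {
        int attempts = 0;
        while (true) {
            try {
                session.execute(stmt);
                return; // success
            } catch (DriverException e) {
                if (++attempts >= 5) throw e;  // illustrative limit; failures surface in Hadoop counters
                Thread.sleep(100L * attempts); // illustrative linear backoff
            }
        }
    }
}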
PeopleStore: Mass loading with online updates
Mass loading takes time; in the meantime, online updates continue. The batch load must not overwrite newer online updates.
Tested: lightweight transactions (INSERT ... IF NOT EXISTS / UPDATE ... IF update_time < <value>).
Result: major slowdown, due to the massive read-before-write they impose.
Solution: an updated_people table: a small table indicating only the people that changed online while batch loading is running. Read-before-write is viable because the table is small; >99% of the queries return an empty set. Insignificant slowdown.
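A sketch of the guard; the updated_people columns are assumptions (the talk does not show that schema). The online flow records each change, and the batch loader skips any row it finds in the table.

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class UpdatedPeopleGuard {
    private final PreparedStatement mark;
    private final PreparedStatement check;

    // Assumed schema: updated_people(site_id, tree_id, individual_id).
    public UpdatedPeopleGuard(Session session) {
        mark = session.prepare(
            "INSERT INTO peoplestore.updated_people (site_id, tree_id, individual_id) VALUES (?, ?, ?)");
        check = session.prepare(
            "SELECT individual_id FROM peoplestore.updated_people "
            + "WHERE site_id = ? AND tree_id = ? AND individual_id = ?");
    }

    // Online flow: record that this person changed during the batch load.
    public void markUpdated(Session session, int siteId, int treeId, int individualId) {
        session.execute(mark.bind(siteId, treeId, individualId));
    }

    // Batch loader: cheap read-before-write; >99% of lookups come back empty.
    public boolean changedOnline(Session session, int siteId, int treeId, int individualId) {
        return session.execute(check.bind(siteId, treeId, individualId)).one() != null;
    }
}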
[Diagram: during mass loading, the Hadoop cluster and the online PeopleStore microservices (backed by MySQL) write to Cassandra in parallel.]
PeopleStore: JVM Tuning
We experienced long GC pauses on the Cassandra nodes.
Upgrading from Java 1.7 to 1.8.0_65 and switching from the CMS to the G1 garbage collector brought a major improvement. G1 is the default in Cassandra 3.0.
We also tuned the JVM params (/etc/cassandra/conf/cassandra-env.sh).
See https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
# highlights:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=16"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=16"
PeopleStore: Other issues
We experienced unexplained missing rows on read (CASSANDRA-10801). Upgrading the Cassandra nodes from 2.1.11 to 2.1.12 and the Java driver from 2.1.5 to 2.1.9 solved the issue.
Cassandra driver: Spring @Query annotations cannot handle IN queries. Instead, we used CassandraTemplate to build a native query (sketched below).
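A hedged sketch of the same idea, building the IN query natively with the driver's QueryBuilder; the actual code went through CassandraTemplate, whose exact API depends on the spring-data-cassandra version, and the values here are illustrative.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import com.datastax.driver.core.querybuilder.Select;

public class InQueryDemo {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("peoplestore")) {
            // Build the IN query natively instead of via a @Query annotation.
            Select.Where select = QueryBuilder.select("individual_id", "name")
                    .from("people")
                    .where(QueryBuilder.eq("site_id", 101))  // illustrative values
                    .and(QueryBuilder.eq("tree_id", 1))
                    .and(QueryBuilder.in("individual_id", 17, 42, 99));
            for (Row row : session.execute(select)) {
                System.out.println(row.getInt("individual_id") + " " + row.getString("name"));
            }
        }
    }
}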
PeopleStore: Results
Reduced latency:
• Matches page: over 50% reduction in load time
• Search results page: 40% reduction in load time
• 90% of microservice calls < 100ms
Reduced load on the MySQL databases:
• From hundreds of queries per page to just a few
AccountStore
EVERY page on myheritage.com needs access to:
• Summarized user (account) information from multiple sources: used for marketing tracking, affiliate programs, and retargeting; includes properties and counters coming from various sources
• A/B test data: participation and variant selection, for guests and registered members
AccountStore needs: fast user properties and counters
• Latency: less than 10ms of slowdown for any page
• Data must be fresh
• Stored also for guests: lots of data
• Make the data available to BI systems
• Aggregating the data at runtime is too slow, so live aggregated data must be maintained, at a high update rate
Example:
var gtmDataLayer = [{
    "site_plan": "premium-plus",
    "data_subscription": "no-data-subscription",
    "active_paying": "not-actively-paying",
    "site_visits": 3509,
    "last_mobile_sighting": "2016-02-07 11:10:25",
    ...
}];
Use Cassandra to "store it as you read it": keep updated aggregate information and counters on users and guests.
Event subscribers update the aggregate data online as it changes, in two tables: data and counters (counters must sit in a separate table, a C* limitation). For example, num_individuals_in_trees changes online as a family tree is modified, and subscription_expiration_date changes as a user becomes a paying subscriber.
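A minimal sketch of such a subscriber, assuming the AccountStore tables shown in the schema slide below, the DataStax Java driver 2.1, and an illustrative "visit" event.

import java.util.Date;
import java.util.UUID;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class VisitEventSubscriber {
    private final PreparedStatement bumpVisits;
    private final PreparedStatement touchLastVisit;

    public VisitEventSubscriber(Session session) {
        // Counters live in their own table (the C* limitation above), so one
        // event may touch both the counters table and the data table.
        bumpVisits = session.prepare(
            "UPDATE accounts.account_store_counters SET num_visits = num_visits + 1 "
            + "WHERE account_uid = ?");
        touchLastVisit = session.prepare(
            "UPDATE accounts.account_store_data SET last_visit = ? WHERE account_uid = ?");
    }

    // Called for each site-visit event as it arrives.
    public void onVisit(Session session, UUID accountUid, Date when) {
        session.execute(bumpVisits.bind(accountUid));
        session.execute(touchLastVisit.bind(when, accountUid));
    }
}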
A separate Cassandra table maps guests to users as they convert and register.
AccountStore: Overview
Requirement: allow BI systems to collect the data, without putting the BI load on the production cluster.
Solution: create a fictitious data center in the cluster; both logical data centers live in the same physical datacenter.
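A sketch of what that split looks like at the keyspace level, executed here via the Java driver; the data center names "App" and "BI" and the replica counts are illustrative assumptions.

// Assuming an open driver Session `session`. With NetworkTopologyStrategy,
// each logical data center gets its own replica count, so BI clients reading
// with a LOCAL_* consistency level in the "BI" DC never load the "App" DC.
session.execute("ALTER KEYSPACE accounts WITH replication = "
        + "{'class': 'NetworkTopologyStrategy', 'App': 3, 'BI': 1}");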
AccountStore and A/B test cluster topology
[Diagram: application clients talk to the App Cassandra data center; the BI system reads from the BI Cassandra data center; both belong to the same cluster.]
Secondary indexes are used for the non-typical flows.
Converted guests keep their UUID, plus a mapping to/from account_id.
AccountStore: Schema
CREATE TABLE accounts.account_store_data (
    account_uid uuid PRIMARY KEY,
    creation_time timestamp,
    device_types set<text>,  -- element type lost in extraction; assumed text
    highest_site_plan int,
    last_visit timestamp,
    . . .
) WITH ...;

CREATE TABLE accounts.account_id_guest_id (
    account_id int,
    guest_id ascii,
    guest_creation_time timestamp,
    updated_at timestamp,
    uuid uuid,
    PRIMARY KEY ((account_id, guest_id))
) WITH ...;
CREATE INDEX account_id_guest_id_updated_at_idx ON accounts.account_id_guest_id (updated_at);
CREATE INDEX account_id_guest_id_uuid_idx ON accounts.account_id_guest_id (uuid);

CREATE TABLE accounts.account_store_counters (
    account_uid uuid PRIMARY KEY,
    num_individuals_in_all_trees counter,
    num_visits counter,
    . . .
) WITH ...;
A/B tests
Scale: millions of active users and hundreds of active experiments: billions of rows.
Latency: must not slow down the application; many pages have multiple experiments active on them.
Must allow time-based collection into BI systems.
The classic implementation would be sharded MySQL. We already have a cluster sharded by family site ID; we do not want another MySQL cluster, sharded by user ID.
Decision: a natural addition to the AccountStore Cassandra cluster.
AccountStore: A/B tests schema
CREATE TABLE ab_test.member_to_experiment_ts (
    uuid_bucket int,
    day int,
    hour int,
    experiment_id int,
    uuid uuid,
    created_at timestamp,
    variant_id int,
    PRIMARY KEY ((uuid_bucket, day, hour), experiment_id, uuid)
) WITH ...;

CREATE TABLE ab_test.member_to_experiment (
    account_uid uuid,
    experiment_id int,
    created_at timestamp,
    created_at_ts bigint,
    variant_id int,
    PRIMARY KEY (account_uid, experiment_id)
) WITH ...;
CREATE INDEX member_to_experiment_experiment_id_idx ON ab_test.member_to_experiment (experiment_id);
Simple lookup of the experiment variant for a user.
Secondary lookup by experiment, via the secondary index.
Preventing hotspots in the time-based table: uuid bucketing spreads each (day, hour) across partitions; reading requires going over all buckets (see the sketch below).
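A minimal sketch of the bucketing, with an illustrative bucket count; writes hash the uuid into a bucket so one hour's traffic spreads over many partitions, and a time-based reader must visit every bucket of that hour.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class AbTestTimeline {
    static final int NUM_BUCKETS = 32; // illustrative; must be a fixed, shared constant

    // Deterministic spread of users across partitions for the same (day, hour).
    static int bucketOf(UUID uuid) {
        return Math.abs(uuid.hashCode() % NUM_BUCKETS);
    }

    // Time-based collection: read the full hour by visiting every bucket.
    static List<Row> readHour(Session session, int day, int hour) {
        PreparedStatement ps = session.prepare(  // in production, prepare once and reuse
            "SELECT experiment_id, uuid, variant_id FROM ab_test.member_to_experiment_ts "
            + "WHERE uuid_bucket = ? AND day = ? AND hour = ?");
        List<Row> rows = new ArrayList<>();
        for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
            rows.addAll(session.execute(ps.bind(bucket, day, hour)).all());
        }
        return rows;
    }
}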
Full dump: using sstable2json
Performance
Cassandra:
• Write: 99% < 1ms
• Read: 99% < 2ms
App:
• 99% < 6ms
Other Cassandra projects
• Activity feed
• Metrics using OpenTSDB
• Titan Graph Database