Lessons Learned from Building a Big Data Technology Stack
Haggai Shachar Director, Data Services [email protected]
{ name: "Haggai Shachar",
  work: [
    { employer: "LivePerson", title: "Director, Data Services" },
    { employer: "NuConomy", title: "Co-Founder, CTO" },
    { employer: "Israeli Intelligence Corps", title: "n/a" } ],
  likes: ["data", "machine learning", "cycling", "diving"],
  wife: "Orit",
  kids: [ { gender: "female", age: -0.2, name: undefined } ],
  todos: ["buy a stroller"] }
Hello World!
LivePerson ("you do something with chat, right?")
1990s: Click-to-Chat, user initiated
2000: Proactive, based on real-time behavior
2010: Real-time prediction, multichannel; predictive intelligence
Today: Engage everywhere (Web, Social, Native Apps, SMS, Email)
▪ 40 TB raw data
▪ 22M interactions
▪ 2B visits
* monthly figures
LivePerson Data stack
LiveEngage Console
MONITORING CHAT/VOICE system
APACHE KAFKA: Batch track / Real-Time track
STORM
COMPLEX EVENT PROCESSING
PERPETUAL STORE
BUSINESS INTELLIGENCE
ANALYTICAL DB
Serving layer (Data Producers): Monitoring, Engagement systems
Middleware using Kafka: Batch Track, (near) Real Time Track
CEP using Storm: Real Time computation, Real Time data aggregation
Rich Business Intelligence: Pre-defined dashboards, Drill down to the record level, Ad-hoc and self-service BI
Data Repositories: DSPT, Analytics, RT Aggregation
LiveEngage backoffice
RT REPOSITORIES
Forget data, let's talk cars. What's the ultimate vehicle?
1. Choosing the right tool
2. Organization-wide schema
3. Decouple producers from consumers
4. Write Optimized vs Read Optimized Models
5. Freshness vs Correctness
Lessons Learned
Since the beginning of mankind <-> ~2004
LL#1 Choosing the right tool
2004 - Today
1. Problem fit
2. Scaling fit
3. Query language (SQL is not going anywhere)
4. Aggregation framework
5. By-key R/W throughput
6. Community
LL#1 Choosing the right tool
            Scaling   Query language   Aggregation framework   By-key throughput   Community
Hadoop      Great     MR, Hive         Robust but slow         n/a                 Huge
Cassandra   Great     CQL, Thrift      Sucks                   Awesome             Big
MySQL       Medium    SQL              Ok                      Ok                  Huge
Vertica     Good      SQL, R           Awesome                 Ok                  Small
▪ 150 developers
▪ 20 scrum teams
▪ 50 services
▪ 3 floors
▪ 4 development languages (Java, Scala, Python, Javascript)
▪ 3-5 deployments a week
▪ Marketing terms keep on changing
LL#2 Organization-wide data model
Tower of Babel by Pieter Bruegel the Elder Jacob's Ladder by William Blake
OR
Apache Avro to the rescue
▪ A schema-based serialization/deserialization framework
▪ Strong Hadoop integration & efficient storage
▪ Backward & forward compatibility
▪ Rich data structures (primitives, records, maps, arrays, enums)
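As a sketch of why backward/forward compatibility matters, here is a minimal pure-Python illustration of Avro-style schema resolution. The `ChatEvent` schema and field names are invented for the example, and real code would use the Avro libraries rather than this hand-rolled resolver; the point is only the rule that new fields must carry defaults so readers can decode old records.

```python
# Avro-style schema resolution sketch: a reader with a newer schema can
# still decode records written with an older one, because fields added
# later carry defaults. Schemas below are invented for illustration.

WRITER_SCHEMA = {  # v1: what old producers wrote
    "name": "ChatEvent",
    "fields": [{"name": "visitor_id", "type": "string"},
               {"name": "ts", "type": "long"}],
}

READER_SCHEMA = {  # v2: consumers added a field, with a default
    "name": "ChatEvent",
    "fields": [{"name": "visitor_id", "type": "string"},
               {"name": "ts", "type": "long"},
               {"name": "channel", "type": "string", "default": "web"}],
}

def resolve(record, reader_schema):
    """Fill fields missing from an old record using the reader's defaults."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for " + field["name"])
    return out

old_record = {"visitor_id": "v-42", "ts": 1385000000}   # written with v1
print(resolve(old_record, READER_SCHEMA))               # decoded with v2
```

Removing a field works symmetrically (forward compatibility): old readers simply ignore data for fields they do not know, as long as their own required fields are still present or defaulted.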
                      Protobuf        Thrift                Avro
Created               2001 (2008)     2007                  2009
Creator / Maintainer  Google / Google Facebook / Apache     Doug Cutting / Apache
Hadoop support        No              No                    Yes
Used by               Google          Facebook, Cassandra   Hadoop, LivePerson
Lang support          Good            Great                 Good
LL#3 Producers / Consumers decoupling
▪ Flexibility of development / deployment
▪ Publisher, multiple subscribers
PRODUCER → MULTIPLE CONSUMERS
▪ Predicting the exact future architecture and project needs is hard
▪ Use a middleware layer to simplify the interface between producers and consumers.
▪ Happily extend and modify each of the tiers independently
LL#3 Decouple producers from consumers
Producer | Producer | Producer
→ middleware →
Hadoop | Storm | External
Apache Kafka
▪ Distributed pub-sub system
▪ Developed at LinkedIn, maintained by Apache
▪ Very high throughput (~300K messages/sec)
▪ Horizontally scalable
▪ Multiple subscribers for topics
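A toy sketch (not the real Kafka API) of the idea behind these bullets: a topic is an append-only log, and each subscriber owns its own read offset. That is what makes consumption rewindable and lets the batch track and the real-time track read the same stream independently.

```python
# Toy model of Kafka's core abstraction: an append-only log per topic,
# consumed by offset. Unlike push-model log aggregators, the broker keeps
# no per-consumer state beyond the log; consumers pull and can rewind.

class Topic:
    def __init__(self):
        self.log = []                   # append-only message log

    def publish(self, msg):
        self.log.append(msg)
        return len(self.log) - 1        # offset of the message

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0                 # position is owned by the consumer

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

    def rewind(self, offset=0):
        self.offset = offset            # replay history at will

events = Topic()
batch_track = Consumer(events)          # e.g. the Hadoop loader
rt_track = Consumer(events)             # e.g. a Storm topology

events.publish("visit")
events.publish("chat-start")

print(rt_track.poll())                  # both subscribers see all messages
print(batch_track.poll())
rt_track.rewind()
print(rt_track.poll())                  # same messages again after rewind
```

In the real system, partitioning the log across brokers is what provides the horizontal scalability; this sketch has a single partition for clarity.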
Message queues (ActiveMQ, TIBCO):
• Low throughput
• Secondary indexes
• Tuned for low latency

Log aggregators (Flume, Scribe):
• Focus on HDFS
• Push model
• No rewindable consumption

KAFKA
LL#4 Write Optimized vs Read Optimized
Writers like:
▪ Write fast
Readers like:
▪ Pre-defined aggregations
▪ Denormalized dimensions
▪ Data duplication
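A minimal sketch of the trade-off, with invented table and field names: a write-optimized store just appends raw events (cheap writes, expensive queries), while a read-optimized model maintains pre-defined aggregations over a denormalized key at write time, duplicating data so that a query becomes a single lookup.

```python
# Write-optimized vs read-optimized, side by side on the same ingest path.
# Names ("account", "day") are illustrative, not from the real system.

raw_events = []         # write-optimized: append and move on
agg_by_account = {}     # read-optimized: pre-aggregated, denormalized view

def ingest(event):
    raw_events.append(event)                        # O(1) write
    key = (event["account"], event["day"])          # denormalized key
    agg_by_account[key] = agg_by_account.get(key, 0) + 1

ingest({"account": "acme", "day": "2013-11-20", "type": "chat"})
ingest({"account": "acme", "day": "2013-11-20", "type": "chat"})

# Read-optimized: one lookup answers the question.
print(agg_by_account[("acme", "2013-11-20")])

# Write-optimized: the same answer requires scanning every raw event.
print(sum(1 for e in raw_events
          if e["account"] == "acme" and e["day"] == "2013-11-20"))
```

Keeping both is the usual compromise: raw events remain the source of truth, and the aggregated view can always be rebuilt from them if the pre-defined questions change.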
Not all data needs are created equal
LL#5: Freshness vs Correctness
Freshness:
▪ High freshness is the key
▪ Minor inaccuracy is acceptable
▪ Fire & forget or eventually consistent
▪ NoSQL

Correctness:
▪ It's all about accuracy
▪ Billable data
▪ Batch oriented
▪ Transactional
▪ RDBMS
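The two columns above can be sketched as one pipeline, with invented names and illustrative numbers: a fire-and-forget real-time counter that tolerates dropped messages (fresh but slightly inaccurate), alongside a batch recount from the raw log (hours late but exact) that later overrides the real-time figure.

```python
# Freshness vs correctness: best-effort real-time counter plus an exact
# batch recount from the durable raw log. The drop rate here is contrived
# to show one lost message out of a thousand.

raw_log = []        # perpetual store: every event lands here, always
rt_counter = 0      # real-time counter, updated fire-and-forget

def track(event, rt_delivery_ok=True):
    global rt_counter
    raw_log.append(event)        # durable path, never skipped
    if rt_delivery_ok:           # fire & forget: a drop is acceptable
        rt_counter += 1

for i in range(1000):
    track("visit", rt_delivery_ok=(i % 1000 != 0))   # drops i == 0

print(rt_counter)                # 999: fresh, but one event was lost

def batch_recount():
    return len(raw_log)          # exact recount from the raw data

rt_counter = batch_recount()     # the batch run corrects the figure
print(rt_counter)                # 1000: correct, but hours later
```

The same split shows up in the store choice: the approximate counters live in a NoSQL store tuned for fast writes, while billable figures come only from the batch, transactional side.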
MONITORING CHAT/VOICE system
APACHE KAFKA: Batch track / Real-Time track
PERPETUAL STORE
RT REPOSITORIES
300K events/sec
STORM
CEP
ANALYTICAL DB
Real Time counters: accuracy 99.9%, latency ~300ms
Raw data & aggregations: accuracy 100%, latency ~2h
LL#4 Write Optimized vs Read Optimized
LL#5: Freshness vs Correctness
Read optimized
1. Choosing the right tool
2. Organization-wide schema
3. Decouple producers from consumers
4. Write Optimized vs Read Optimized Models
5. Freshness vs Correctness
So, what did we have?