Post on 03-Feb-2015
Big Data and Me
Bhupesh Bansal, Feb 3, 2012
Relational Model Architecture
Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2006
Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Relational model
• The relational model is a triumph of computer science:
  – General
  – Concise
  – Well understood
• But then again:
  – SQL is a pain
  – Hard to build re-usable data structures
  – Hides performance issues/details
Specialized Systems Architecture
Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2007
Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Specialized systems
• Specialized systems are efficient (10-100x)
  – Search: inverted index
  – Offline: Hadoop, Teradata, Oracle DWH
  – Memcached
  – In-memory systems (social graph)
• Specialized systems are scalable
• New data and problems
  – Graphs, sequences, and text
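The inverted index mentioned above is the core structure behind search. A minimal single-machine sketch (an illustration of the idea, not any of the systems named here):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "Big Data systems", 2: "relational data model"}
index = build_inverted_index(docs)
index["data"]  # documents containing "data" -> {1, 2}
```

Looking up a term is a set lookup rather than a scan over every document, which is why a specialized index beats a general-purpose table for this workload.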
Batch Driven Architecture
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Motivation I : Big Data
Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data Driven Features
Motivation III: Makes Money
Proprietary & Confidential
Motivation IV: Big Data is cool
Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
Big Data Challenges
• Large scale data processing
  – Use all available signals, e.g. weblogs, social signals (Twitter/Facebook/LinkedIn)
• Data driven applications
  – Refine data and push it back to the user for consumption
• Near real time feedback loop
  – Keep continuously improving
Why is this hard?
• Large scale data processing
  – TB/PB of data
  – Traditional storage systems cannot handle the scale
• Data driven applications
  – Need to run complex machine learning algorithms at this data scale
• Near real time analysis
  – Improves application performance and usage
Some good news!!
• Hadoop
  – Biggest single driver for the large scale data economy
  – Scales, works, easy to use
• Memcached
  – Works, scales and is fast
• Open source world
  – Lots of awesome people working on awesome systems, e.g. HBase, Memcached, Voldemort, Kafka, Mahout
• Sharing across companies
  – Common practices/knowledge sharing across companies
What works!!
• Simplicity
  – Go with the simplest design possible
• Near real time
  – Async/batch processing
  – Push computation to the background as much as possible
• Duplicate data everywhere
  – Build a customized solution for each problem
  – Duplicate data as needed
• Data river
  – Publish events and let all systems consume at their own pace
• Monitoring/alerting
  – Keep a close eye on things and build a strong dev-ops team
What doesn't work!!
• Magic systems
  – Auto configure, auto tuning
  – Very hard to get right; instead have easy configuration and better monitoring
• Open source
  – If not supported by a strong engineering team internally
  – Be ready to have folks spend 30-40% of their time understanding and helping open source components
• Silver bullets
  – One system to solve all scaling problems, e.g. HBase
  – Build separate systems for separate problems
• Central data source
  – Don't lock your data; let it flow
  – Use Kafka, Scribe, or any publish/subscribe system
Open source
• Very, very important for any company today
  – Do not reinvent the wheel
• Do not write a line of code if not needed
  – 90/10% rule
• Pick up open source solutions, fix what is broken
  – Big plus for hiring
  – Stand on the shoulders of the crowd
Open source: Storage
• Problem: You want to store TBs of data for user consumption in real time
  – Latency < 50 ms
  – Scale: 10,000+ QPS
• Solutions
  – Bigtable design, e.g. HBase
  – Amazon Dynamo design, e.g. Voldemort
  – Cache with persistence, e.g. Membase
  – Document based storage, e.g. MongoDB
Open source: Publish/Subscribe
• Problem: A data river for all other systems to get their feed
• Solutions
  – Strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ
  – Log feeds, e.g. Scribe, Flume
  – Kafka: a great mix of both worlds
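The "data river" idea above can be sketched as an append-only log where each consumer tracks its own offset, so every system reads the same feed at its own pace. A toy in-process version (Kafka's actual broker/partition model is much richer):

```python
class DataRiver:
    """Sketch of a Kafka-style append-only log. The log keeps every event;
    consumers hold their own offsets, so slow and fast consumers coexist."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)

    def consume(self, offset, max_events=100):
        """Return a batch starting at offset, plus the next offset to use."""
        batch = self.events[offset:offset + max_events]
        return batch, offset + len(batch)

river = DataRiver()
for e in ["signup", "page_view", "connection_added"]:
    river.publish(e)

# A consumer reads two events and remembers where it stopped.
batch, next_offset = river.consume(0, max_events=2)  # -> (["signup", "page_view"], 2)
```

Because the producer never waits for consumers, publishing stays fast and new downstream systems can be added without touching the upstream ones.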
Open source: Real time analysis
• Problem: Analyze a stream of data and do simple analysis/reporting
• Solutions
  – Splunk: general purpose but high maintenance, expensive analysis tool
  – OpenTSDB: simple but scalable metrics reporting
  – Yahoo S4 / Twitter Storm: online map-reduce-ish; new systems will need lots of love and care
Open source: Search
• Problem: Unstructured queries on data
• Solutions
  – Lucene: the most tested common search (but just a) library
  – Solr: old system with lots of users but bad design
  – ElasticSearch: very well designed but a new system
  – LinkedIn search open source systems: SenseiDB, Zoie
Open source: Batch computation
• Problem: You want to process TBs of data
• The solution is simple: use Hadoop
  – Hadoop workflow managers: Azkaban, Oozie
  – Query: native Java code, Cascading, Hive, Pig
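Whatever the query layer, Hadoop jobs boil down to a map phase and a reduce phase. A single-machine word-count sketch of that model (Hadoop itself runs these phases in Java across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (word, 1) for every word, as a Hadoop mapper would.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a Hadoop reducer would after the shuffle.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = chain.from_iterable(map_phase(l) for l in ["big data", "big systems"])
reduce_phase(pairs)  # -> {"big": 2, "data": 1, "systems": 1}
```

Hive and Pig compile queries down to exactly this kind of map/reduce pipeline; Cascading and native Java code express it directly.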
Open source: Other
• Serialization
  – Avro, Thrift, Protocol Buffers
• Compression
  – Snappy, LZO
• Monitoring
  – Ganglia
My personal picks!!
• Storage
  – Pure key-value lookup: Voldemort
  – Range queries, Hadoop job support: HBase
  – Batch-generated read-only data serving: Voldemort
• Publish/subscribe
  – HornetQ or Kafka
• Search
  – ElasticSearch
• Hadoop
  – Azkaban
  – Hive and native Java code
Jeff Dean’s Thoughts
• Very practical advice on building good, reliable distributed systems. Highlights:
  – Back of the envelope calculations
    • Understand your base numbers well
  – Scale for 10X, not 100X
  – Embrace chaos/failure and design around it
  – Monitoring/status hooks at all levels
  – Important not to try to be all things for everybody
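A back-of-the-envelope calculation in that spirit, using rough order-of-magnitude latency figures (the numbers below are common ballpark assumptions, not measurements from any system here):

```python
# Rough order-of-magnitude base numbers (assumptions):
DISK_SEEK_MS = 10.0      # one random disk seek
DC_ROUND_TRIP_MS = 0.5   # network round trip within a datacenter

# Envelope: render a page that needs 1,000 small key lookups.
disk_plan_ms = 1000 * DISK_SEEK_MS       # random reads from disk
cache_plan_ms = 1000 * DC_ROUND_TRIP_MS  # one memcached round trip each

speedup = disk_plan_ms / cache_plan_ms   # -> 20.0
```

Ten seconds of disk seeks versus half a second of cache round trips: knowing the base numbers settles the design question before any code is written.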
Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
How Voldemort was born?
References: 1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 2) http://www.slideshare.net/adorepump/voldemort-nosql
Why NoSQL?
• TBs of data
• Sharding is the only way to scale
  – No joins possible (data is split across machines)
  – Specialized systems, e.g. search and network feed, break the relational model
  – Constraints, triggers, etc. disappear
  – Lots of denormalization
• Latency is key
  – Relational DBs depend on a caching layer to achieve high throughput and low latency
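Why sharding kills joins can be seen from the routing function itself. A minimal hash-based router (an illustration; `member:…` keys are made up for the example):

```python
import hashlib

def shard_for(key, num_shards):
    """Stable hash-based routing: each key lives on exactly one shard.
    (md5 used so the mapping is the same on every machine and run.)"""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# All rows for one key always land on the same shard...
same = shard_for("member:42", 8) == shard_for("member:42", 8)
# ...but two related keys may hash to different machines, so a SQL join
# across them would need a cross-machine query - hence "no joins possible".
a = shard_for("member:42", 8)
b = shard_for("member:7", 8)
```

Denormalization is the usual answer: store the joined result under one key so a single-shard lookup suffices.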
Inspired by Amazon Dynamo & Memcached
• Amazon's Dynamo storage system
  – Works across data centers
  – Eventual consistency
  – Commodity hardware
• Memcached
  – Actually works
  – Really fast
  – Really simple
ACID vs CAP
• ACID
  – Great for a single centralized server
• CAP theorem
  – Consistency (strict), Availability, Partition tolerance
  – Impossible to achieve all three at the same time on a distributed platform
  – Can choose 2 out of 3
  – Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency
Consistent Hashing
• Key space is partitioned
  – Many small partitions
• Partitions never change
  – Partition ownership can change
• Replication
  – Each partition is stored by 'N' nodes
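The scheme above (fixed partitions, movable ownership, N-way replication) can be sketched as follows; the partition count, node names, and ring layout are illustrative choices, not Voldemort's actual defaults:

```python
import hashlib

NUM_PARTITIONS = 32   # fixed forever: partitions never change
REPLICATION = 2       # 'N': each partition is stored by N nodes

# Ownership map: partition index -> owning node. Only this mapping moves
# when nodes join or leave; keys never need to be rehashed.
ring = ["node%d" % (p % 4) for p in range(NUM_PARTITIONS)]

def partition_for(key):
    """Stable hash of the key into a fixed partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def preference_list(key):
    """Walk the ring from the key's partition, collecting N distinct nodes
    that will each store a replica of the value."""
    start = partition_for(key)
    owners = []
    for i in range(NUM_PARTITIONS):
        node = ring[(start + i) % NUM_PARTITIONS]
        if node not in owners:
            owners.append(node)
        if len(owners) == REPLICATION:
            break
    return owners
```

Because the key-to-partition mapping is fixed, rebalancing means reassigning whole partitions to nodes, never rehashing individual keys.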
R+W > N
• N: the replication factor
• R: the number of blocking reads
• W: the number of blocking writes
• If R + W > N
  – We have a quorum-like algorithm
  – Guarantees that we will read the latest write OR fail
• R, W, N can be tuned for different use cases
  – W = 1: highly available writes
  – R = 1: read intensive workloads
  – Knobs to tune performance, durability and availability
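The R + W > N guarantee follows from set overlap: any R nodes must intersect any W nodes, so some node in every read set saw the latest write. A small sketch that checks this by brute force:

```python
from itertools import combinations

def quorum_ok(r, w, n):
    """R + W > N guarantees every read set overlaps every write set."""
    return r + w > n

def always_overlap(r, w, n):
    """Brute-force check: does every R-node subset of the N replicas
    intersect every W-node subset?"""
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(nodes, r)
               for ws in combinations(nodes, w))

quorum_ok(2, 2, 3)       # -> True
always_overlap(2, 2, 3)  # -> True: reads always see the latest write
always_overlap(1, 1, 3)  # -> False: a read may miss the write (stale data)
```

This is why W = 1 buys write availability at the cost of read freshness unless R is raised to compensate.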
Versioning & Conflict Resolution
• Eventual consistency allows multiple versions of a value
  – Need a way to understand which value is latest
  – Need a way to say values are not comparable
• Solutions
  – Timestamps
  – Vector clocks
    • Provide an ordering of writes with no locking or blocking necessary
Vector Clock
• Vector clocks [Lamport] provide a way to order events in a distributed system
• A vector clock is a tuple {t1, t2, ..., tn} of counters
• Each value update has a master node
  – When data is written with master node i, it increments ti
  – All the replicas will receive the same version
  – Helps resolve consistency between writes on multiple replicas
• If you get network partitions
  – You can have a case where two vector clocks are not comparable
  – In this case Voldemort returns both values to clients for conflict resolution
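The comparison rule can be sketched directly: one clock precedes another only if no counter went backwards; if each clock is ahead on some counter, the writes are concurrent and both values go back to the client. A minimal illustration (node names `i`/`j` are made up):

```python
def compare(vc1, vc2):
    """Compare two vector clocks (dicts of node -> counter).
    Returns 'before', 'after', 'equal', or 'concurrent' (not comparable)."""
    nodes = set(vc1) | set(vc2)
    less = any(vc1.get(n, 0) < vc2.get(n, 0) for n in nodes)
    greater = any(vc1.get(n, 0) > vc2.get(n, 0) for n in nodes)
    if less and greater:
        return "concurrent"  # conflict: return both values to the client
    if less:
        return "before"
    if greater:
        return "after"
    return "equal"

compare({"i": 2, "j": 1}, {"i": 1, "j": 1})  # -> "after": a later write on node i
compare({"i": 1}, {"j": 1})                  # -> "concurrent": the partition case
```

Unlike a single timestamp, this ordering never silently drops a conflicting write; incomparable versions surface to the application, which knows how to merge them.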
Client API
• Data is organized into "stores", i.e. tables
  – Key-value only
  – But values can be arbitrarily rich or complex: maps, lists, nested combinations ...
• Four operations
  – PUT (Key K, Value V)
  – GET (Key K)
  – MULTI-GET (Iterator<Key> K)
  – DELETE (Key K) / (Key K, Version ver)
  – No range scans
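The four operations above can be mimicked with a toy in-memory store (a sketch of the API shape only: real Voldemort clients are Java, and the version handling is omitted here for brevity; the `member:…` keys are invented for the example):

```python
class Store:
    """Toy in-memory 'store' exposing the four key-value operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def multi_get(self, keys):
        # One call for many keys; missing keys are simply absent.
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key):
        self._data.pop(key, None)

store = Store()
# Values can be arbitrarily rich: maps, lists, nested combinations.
store.put("member:1", {"name": "Ada", "skills": ["java", "hadoop"]})
store.get("member:1")
store.multi_get(["member:1", "member:2"])  # only member:1 comes back
```

Note there is no way to ask for "all keys between A and B": the hash partitioning that makes lookups scale is also why range scans are not offered.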
Voldemort Physical Deployment
Read-only storage engine
• Throughput vs. latency
• Index building done in Hadoop
• Fully parallel transfer
• Very efficient on-disk structure
• Heavy reliance on OS pagecache
• Rollback!
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
What do we use Hadoop/Voldemort for?
Batch Driven Architecture
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Data Flow Driven Architecture
Reference: http://sna-projects.com/blog/2011/08/kafka/
Questions