Post on 03-Feb-2015
Big Data and Me
Bhupesh Bansal, Feb 3, 2012
Relational Model Architecture
Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2006
Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Relational model
• The relational model is a triumph of computer science:
  – General
  – Concise
  – Well understood
• But then again:
  – SQL is a pain
  – Hard to build re-usable data structures
  – Hides performance issues/details
Specialized Systems Architecture
Reference: http://www.slideshare.net/adorepump/voldemort-nosql
LinkedIn 2007
Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Specialized systems
• Specialized systems are efficient (10-100x)
  – Search: inverted index
  – Offline: Hadoop, Teradata, Oracle DWH
  – Memcached
  – In-memory systems (social graph)
• Specialized systems are scalable
• New data and problems
  – Graphs, sequences, and text
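The inverted index mentioned above is the core structure behind search. A minimal single-machine sketch (an illustration of the idea, not any of the systems named here):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "Big Data systems", 2: "relational data model"}
index = build_inverted_index(docs)
index["data"]  # documents containing "data" -> {1, 2}
```

Looking up a term is a set lookup rather than a scan over every document, which is why a specialized index beats a general-purpose table for this workload.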
Batch Driven Architecture
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Motivation I : Big Data
Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data Driven Features
Motivation III: Makes Money
Proprietary & Confidential
Motivation IV: Big Data is cool
Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
Big Data Challenges
• Large scale data processing
  – Use all available signals, e.g. weblogs, social signals (Twitter/Facebook/LinkedIn)
• Data driven applications
  – Refine data and push it back to the user for consumption
• Near real time feedback loop
  – Keep continuously improving
Why is this hard?
• Large scale data processing
  – TB/PB of data
  – Traditional storage systems cannot handle the scale
• Data driven applications
  – Need to run complex machine learning algorithms at this data scale
• Near real time analysis
  – Improves application performance and usage
Some good news!!
• Hadoop
  – Biggest single driver for the large scale data economy
  – Scales, works, easy to use
• Memcached
  – Works, scales and is fast
• Open source world
  – Lots of awesome people working on awesome systems, e.g. HBase, Memcached, Voldemort, Kafka, Mahout
• Sharing across companies
  – Common practices/knowledge sharing across companies
What works!!
• Simplicity
  – Go with the simplest design possible
• Near real time
  – Async/batch processing
  – Push computation to the background as much as possible
• Duplicate data everywhere
  – Build a customized solution for each problem
  – Duplicate data as needed
• Data river
  – Publish events and let all systems consume at their own pace
• Monitoring/alerting
  – Keep a close eye on things and build a strong dev-ops team
What doesn't work!!
• Magic systems
  – Auto configure, auto tuning
  – Very hard to get right; instead have easy configuration and better monitoring
• Open source
  – If not supported by a strong engineering team internally
  – Be ready to have folks spend 30-40% of their time understanding and helping open source components
• Silver bullets
  – One system to solve all scaling problems, e.g. HBase
  – Build separate systems for separate problems
• Central data source
  – Don't lock your data; let it flow
  – Use Kafka, Scribe, or any publish/subscribe system
Open source
• Very, very important for any company today
  – Do not reinvent the wheel
• Do not write a line of code if not needed
  – 90/10% rule
• Pick up open source solutions, fix what is broken
  – Big plus for hiring
  – Stand on the shoulders of the crowd
Open source: Storage
• Problem: You want to store TBs of data for user consumption in real time
  – Latency < 50 ms
  – Scale: 10,000+ QPS
• Solutions
  – Bigtable design, e.g. HBase
  – Amazon Dynamo design, e.g. Voldemort
  – Cache with persistence, e.g. Membase
  – Document based storage, e.g. MongoDB
Open source: Publish/Subscribe
• Problem: A data river for all other systems to get their feed
• Solutions
  – Strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ
  – Log feeds, e.g. Scribe, Flume
  – Kafka: a great mix of both worlds
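The "data river" idea above can be sketched as an append-only log where each consumer tracks its own offset, so every system reads the same feed at its own pace. A toy in-process version (Kafka's actual broker/partition model is much richer):

```python
class DataRiver:
    """Sketch of a Kafka-style append-only log. The log keeps every event;
    consumers hold their own offsets, so slow and fast consumers coexist."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)

    def consume(self, offset, max_events=100):
        """Return a batch starting at offset, plus the next offset to use."""
        batch = self.events[offset:offset + max_events]
        return batch, offset + len(batch)

river = DataRiver()
for e in ["signup", "page_view", "connection_added"]:
    river.publish(e)

# A consumer reads two events and remembers where it stopped.
batch, next_offset = river.consume(0, max_events=2)  # -> (["signup", "page_view"], 2)
```

Because the producer never waits for consumers, publishing stays fast and new downstream systems can be added without touching the upstream ones.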
Open source: Real time analysis
• Problem: Analyze a stream of data and do simple analysis/reporting
• Solutions
  – Splunk: general purpose but high maintenance, expensive analysis tool
  – OpenTSDB: simple but scalable metrics reporting
  – Yahoo S4 / Twitter Storm: online map-reduce-ish; new systems will need lots of love and care
Open source: Search
• Problem: Unstructured queries on data
• Solutions
  – Lucene: the most tested common search (but just a) library
  – Solr: old system with lots of users but bad design
  – ElasticSearch: very well designed but a new system
  – LinkedIn search open source systems: SenseiDB, Zoie
Open source: Batch computation
• Problem: You want to process TBs of data
• The solution is simple: use Hadoop
  – Hadoop workflow managers: Azkaban, Oozie
  – Query: native Java code, Cascading, Hive, Pig
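Whatever the query layer, Hadoop jobs boil down to a map phase and a reduce phase. A single-machine word-count sketch of that model (Hadoop itself runs these phases in Java across many machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (word, 1) for every word, as a Hadoop mapper would.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a Hadoop reducer would after the shuffle.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = chain.from_iterable(map_phase(l) for l in ["big data", "big systems"])
reduce_phase(pairs)  # -> {"big": 2, "data": 1, "systems": 1}
```

Hive and Pig compile queries down to exactly this kind of map/reduce pipeline; Cascading and native Java code express it directly.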
Open source: Other
• Serialization
  – Avro, Thrift, Protocol Buffers
• Compression
  – Snappy, LZO
• Monitoring
  – Ganglia
My personal picks!!
• Storage
  – Pure key-value lookup: Voldemort
  – Range queries, Hadoop job support: HBase
  – Batch-generated read-only data serving: Voldemort
• Publish/subscribe
  – HornetQ or Kafka
• Search
  – ElasticSearch
• Hadoop
  – Azkaban
  – Hive and native Java code
Jeff Dean’s Thoughts
• Very practical advice on building good, reliable distributed systems. Highlights:
  – Back of the envelope calculations
    • Understand your base numbers well
  – Scale for 10X, not 100X
  – Embrace chaos/failure and design around it
  – Monitoring/status hooks at all levels
  – Important not to try to be all things for everybody
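A back-of-the-envelope calculation in that spirit, using rough order-of-magnitude latency figures (the numbers below are common ballpark assumptions, not measurements from any system here):

```python
# Rough order-of-magnitude base numbers (assumptions):
DISK_SEEK_MS = 10.0      # one random disk seek
DC_ROUND_TRIP_MS = 0.5   # network round trip within a datacenter

# Envelope: render a page that needs 1,000 small key lookups.
disk_plan_ms = 1000 * DISK_SEEK_MS       # random reads from disk
cache_plan_ms = 1000 * DC_ROUND_TRIP_MS  # one memcached round trip each

speedup = disk_plan_ms / cache_plan_ms   # -> 20.0
```

Ten seconds of disk seeks versus half a second of cache round trips: knowing the base numbers settles the design question before any code is written.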
Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
How Voldemort was born?
References: 1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 2) http://www.slideshare.net/adorepump/voldemort-nosql
Why NoSQL?
• TBs of data
• Sharding is the only way to scale
  – No joins possible (data is split across machines)
  – Specialized systems, e.g. search and network feed, break the relational model
  – Constraints, triggers, etc. disappear
  – Lots of denormalization
• Latency is key
  – Relational DBs depend on a caching layer to achieve high throughput and low latency
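Why sharding kills joins can be seen from the routing function itself. A minimal hash-based router (an illustration; `member:…` keys are made up for the example):

```python
import hashlib

def shard_for(key, num_shards):
    """Stable hash-based routing: each key lives on exactly one shard.
    (md5 used so the mapping is the same on every machine and run.)"""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# All rows for one key always land on the same shard...
same = shard_for("member:42", 8) == shard_for("member:42", 8)
# ...but two related keys may hash to different machines, so a SQL join
# across them would need a cross-machine query - hence "no joins possible".
a = shard_for("member:42", 8)
b = shard_for("member:7", 8)
```

Denormalization is the usual answer: store the joined result under one key so a single-shard lookup suffices.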
Inspired by Amazon Dynamo & Memcached
• Amazon's Dynamo storage system
  – Works across data centers
  – Eventual consistency
  – Commodity hardware
• Memcached
  – Actually works
  – Really fast
  – Really simple
ACID vs CAP
• ACID
  – Great for a single centralized server
• CAP theorem
  – Consistency (strict), Availability, Partition tolerance
  – Impossible to achieve all three at the same time on a distributed platform
  – Can choose 2 out of 3
  – Dynamo chooses high availability and partition tolerance, relaxing strict consistency to eventual consistency
Consistent Hashing
• Key space is partitioned
  – Many small partitions
• Partitions never change
  – Partition ownership can change
• Replication
  – Each partition is stored by 'N' nodes
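The scheme above (fixed partitions, movable ownership, N-way replication) can be sketched as follows; the partition count, node names, and ring layout are illustrative choices, not Voldemort's actual defaults:

```python
import hashlib

NUM_PARTITIONS = 32   # fixed forever: partitions never change
REPLICATION = 2       # 'N': each partition is stored by N nodes

# Ownership map: partition index -> owning node. Only this mapping moves
# when nodes join or leave; keys never need to be rehashed.
ring = ["node%d" % (p % 4) for p in range(NUM_PARTITIONS)]

def partition_for(key):
    """Stable hash of the key into a fixed partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def preference_list(key):
    """Walk the ring from the key's partition, collecting N distinct nodes
    that will each store a replica of the value."""
    start = partition_for(key)
    owners = []
    for i in range(NUM_PARTITIONS):
        node = ring[(start + i) % NUM_PARTITIONS]
        if node not in owners:
            owners.append(node)
        if len(owners) == REPLICATION:
            break
    return owners
```

Because the key-to-partition mapping is fixed, rebalancing means reassigning whole partitions to nodes, never rehashing individual keys.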
R+W > N
• N: the replication factor
• R: the number of blocking reads
• W: the number of blocking writes
• If R + W > N
  – We have a quorum-like algorithm
  – Guarantees that we will read the latest write OR fail
• R, W, N can be tuned for different use cases
  – W = 1: highly available writes
  – R = 1: read intensive workloads
  – Knobs to tune performance, durability and availability
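The R + W > N guarantee follows from set overlap: any R nodes must intersect any W nodes, so some node in every read set saw the latest write. A small sketch that checks this by brute force:

```python
from itertools import combinations

def quorum_ok(r, w, n):
    """R + W > N guarantees every read set overlaps every write set."""
    return r + w > n

def always_overlap(r, w, n):
    """Brute-force check: does every R-node subset of the N replicas
    intersect every W-node subset?"""
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(nodes, r)
               for ws in combinations(nodes, w))

quorum_ok(2, 2, 3)       # -> True
always_overlap(2, 2, 3)  # -> True: reads always see the latest write
always_overlap(1, 1, 3)  # -> False: a read may miss the write (stale data)
```

This is why W = 1 buys write availability at the cost of read freshness unless R is raised to compensate.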
Versioning & Conflict Resolution
• Eventual consistency allows multiple versions of a value
  – Need a way to understand which value is latest
  – Need a way to say values are not comparable
• Solutions
  – Timestamps
  – Vector clocks
    • Provide an ordering of writes with no locking or blocking necessary
Vector Clock
• Vector clocks [Lamport] provide a way to order events in a distributed system
• A vector clock is a tuple {t1, t2, ..., tn} of counters
• Each value update has a master node
  – When data is written with master node i, it increments ti
  – All the replicas will receive the same version
  – Helps resolve consistency between writes on multiple replicas
• If you get network partitions
  – You can have a case where two vector clocks are not comparable
  – In this case Voldemort returns both values to clients for conflict resolution
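The comparison rule can be sketched directly: one clock precedes another only if no counter went backwards; if each clock is ahead on some counter, the writes are concurrent and both values go back to the client. A minimal illustration (node names `i`/`j` are made up):

```python
def compare(vc1, vc2):
    """Compare two vector clocks (dicts of node -> counter).
    Returns 'before', 'after', 'equal', or 'concurrent' (not comparable)."""
    nodes = set(vc1) | set(vc2)
    less = any(vc1.get(n, 0) < vc2.get(n, 0) for n in nodes)
    greater = any(vc1.get(n, 0) > vc2.get(n, 0) for n in nodes)
    if less and greater:
        return "concurrent"  # conflict: return both values to the client
    if less:
        return "before"
    if greater:
        return "after"
    return "equal"

compare({"i": 2, "j": 1}, {"i": 1, "j": 1})  # -> "after": a later write on node i
compare({"i": 1}, {"j": 1})                  # -> "concurrent": the partition case
```

Unlike a single timestamp, this ordering never silently drops a conflicting write; incomparable versions surface to the application, which knows how to merge them.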
Client API
• Data is organized into "stores", i.e. tables
  – Key-value only
  – But values can be arbitrarily rich or complex: maps, lists, nested combinations ...
• Four operations
  – PUT (Key K, Value V)
  – GET (Key K)
  – MULTI-GET (Iterator<Key> K)
  – DELETE (Key K) / (Key K, Version ver)
  – No range scans
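The four operations above can be mimicked with a toy in-memory store (a sketch of the API shape only: real Voldemort clients are Java, and the version handling is omitted here for brevity; the `member:…` keys are invented for the example):

```python
class Store:
    """Toy in-memory 'store' exposing the four key-value operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def multi_get(self, keys):
        # One call for many keys; missing keys are simply absent.
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key):
        self._data.pop(key, None)

store = Store()
# Values can be arbitrarily rich: maps, lists, nested combinations.
store.put("member:1", {"name": "Ada", "skills": ["java", "hadoop"]})
store.get("member:1")
store.multi_get(["member:1", "member:2"])  # only member:1 comes back
```

Note there is no way to ask for "all keys between A and B": the hash partitioning that makes lookups scale is also why range scans are not offered.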
Voldemort Physical Deployment
Read-only storage engine
• Throughput vs. latency
• Index building done in Hadoop
• Fully parallel transfer
• Very efficient on-disk structure
• Heavy reliance on OS pagecache
• Rollback!
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
What do we use Hadoop/Voldemort for?
Batch Driven Architecture
Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Data Flow Driven Architecture
Reference: http://sna-projects.com/blog/2011/08/kafka/
Questions