New York Storm Users Group 2014-01-28 - Using Storm with MapR M7 for Real-Time Predictive Modeling


N E W Y O R K S T O R M U S E R S G R O U P

Using Storm with MapR M7 for Real-Time Predictive Modeling

January 28, 2014

• Introductions

• About Velos

• Our Use Cases

• Requirements

• Why Storm?

• Why MapR M7?

• How Did We Get Here?

• Architecture

• Quick Storm Introduction

• Our Topologies

• Performance & Learnings

• Road Map

• Q&A

A G E N D A

Gna Phetsarath

Director of Technology

@sourigna

http://www.linkedin.com/in/sourignaphetsarath/

I N T R O D U C T I O N S

• Velos provides Predictive Analytics lifecycle and scaling solutions for Enterprise companies

• Formerly Sociocast, a SaaS solution with use-case specific ad tech and e-commerce models on our own hardware

• Velos provides an on-premise platform supporting any models on various production runtimes, such as Hadoop, Storm, Spark and others

• Customers can easily automate ETL, feature engineering, model evaluation and production deployment and monitoring, as well as relearning and adaptation

• Plug in existing Python, Java, and R models

A B O U T V E L O S

• Real-Time Predictive Modeling

• Real-Time Metrics

• Atomic Counters

• Unique Probabilistic Counting (HyperLogLog Plus)

• Group Membership (Bloom Filters)

• Page Parsing - NLP Feature Extraction

• Event/Entity Attribute Maintenance

O U R U S E C A S E S
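To make the "Group Membership (Bloom Filters)" use case concrete, here is a minimal Bloom filter sketch in plain Python. It is an illustration only, not the Velos implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 for hashing are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from two halves of one SHA-256 digest
        # (double hashing), instead of running several independent hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical audience segment keyed by user id.
segment = BloomFilter()
for user_id in ("u-1001", "u-1002", "u-1003"):
    segment.add(user_id)

print("u-1002" in segment)  # True: added items are always reported as present
```

A lookup for an id that was never added normally returns False, but it can return True with a small probability that grows as the filter fills up. That trade-off is what keeps the structure small enough to answer membership queries within a tight latency budget.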

• < 50 ms response time

• Random access to a large data set (> 1B keys)

• Near real-time/streaming

• Distributed

• Scalable

• Fault tolerant

• Reliable

R E Q U I R E M E N T S

• Simple API

• Scalable

• Fault tolerant

• Guarantees data processing

• Handles parallelization, partitioning, and retrying on failures when necessary

• Easy to deploy and operate

• Free and open source

W H Y S T O R M ?

• Configuration is simpler than with HBase

• No region servers

• No compactions, since the underlying file system is read-write

• Recovery from a cold start is easier. If HBase goes down and has to be restarted, recovery takes a long time (hours), whereas M7 recovers in minutes. We haven't had to go through a full cold start, but we did have a ZooKeeper failure and had to bounce each node, and that was quick.

• NFS Gateway is very useful

• There are plenty of features we haven't taken advantage of yet

• MapR Admin UI is easy to use

W H Y M A P R M 7 ?

• Amazon Elastic MapReduce

• Cloudera Hadoop on Amazon Web Services

• MapR M3 (Hadoop MapReduce) on Managed Hosting Service

• MapR M3, Riak, Storm, Kafka, Redis, Play on Managed Hosting Service

• MapR M3, MapR M7 (HBase), Storm, Kafka, Redis, Play on Managed Hosting Service

H O W D I D W E G E T H E R E ?

A R C H I T E C T U R E - Q 4 2 0 1 3

API - Play Framework

Dashboard - Play Framework

Storm Topologies

Redis

MapR M7

MapR M3

PostgreSQL

S T O R M C O N C E P T S

Topology

Tuples

(key,fields,...)

• Tuple - a named list of values

• Streams - unbounded sequences of tuples

• Spouts - sources of streams

• Bolts - process any number of input streams and produce any number of output streams

• Topologies - a network of spouts and bolts

• Reliability - guarantees that every tuple will be fully processed

• Workers - processes that each execute a subset of a topology

• Tasks - executed by workers for bolts/spouts

Q U I C K S T O R M O V E R V I E W
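The reliability guarantee rests on tracking every tuple that descends from a spout tuple. Storm's acker does this with a single XOR checksum per spout tuple: each tuple id is XORed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when the whole tree has been processed. The sketch below is a simplified single-process model of that idea, not Storm's code; the `AckTracker` class and its method names are hypothetical.

```python
import random


def new_id() -> int:
    """Random non-zero 64-bit id, standing in for a Storm tuple id."""
    return random.getrandbits(64) or 1


class AckTracker:
    """Tracks each tuple tree with one XOR value per root (spout) tuple."""

    def __init__(self):
        self.ack_val = {}  # root id -> XOR of ids that are still outstanding

    def spout_emit(self, root_id: int, tuple_id: int) -> None:
        # The spout tuple is the first outstanding tuple in the tree.
        self.ack_val[root_id] = tuple_id

    def bolt_ack(self, root_id: int, input_id: int, child_ids=()) -> bool:
        # Acking the input and registering its children happens in one
        # update, so the value cannot reach zero while children are pending.
        update = input_id
        for child in child_ids:
            update ^= child
        self.ack_val[root_id] ^= update
        return self.is_complete(root_id)

    def is_complete(self, root_id: int) -> bool:
        return self.ack_val.get(root_id) == 0


tracker = AckTracker()
root, t1, t2, t3 = new_id(), new_id(), new_id(), new_id()

tracker.spout_emit(root, t1)            # spout emits t1
tracker.bolt_ack(root, t1, [t2, t3])    # first bolt acks t1, emits t2 and t3
tracker.bolt_ack(root, t2)              # downstream bolt acks t2
print(tracker.bolt_ack(root, t3))       # True: every id has been XORed twice
```

The appeal of this scheme is constant memory per spout tuple, regardless of how many tuples a bolt fans out. If the value has not reached zero before a timeout, the spout tuple is treated as failed and can be replayed.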

• Entity Observe

• Kafka Spout

• Bot Detection Bolt

• Entity Observe Bolt

• Real-time Counter Bolt

• Predictive Model Update Bolt

• NLP Feature Extraction of HTML Content

• Entity/Event Attribute Maintenance

O U R T O P O L O G I E S
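As a rough picture of how the Entity Observe stages chain together, the sketch below models the spout and three of the bolts as Python generators. Everything inside the functions is assumed for illustration: the event fields, the user-agent rule for bot detection, and the in-memory stores are hypothetical stand-ins, and the Predictive Model Update Bolt is left out because the slides do not describe what it does. Real Storm bolts also run in parallel across workers, while this runs sequentially in one process.

```python
from collections import Counter


def kafka_spout(messages):
    """Stand-in for the Kafka spout: yields one tuple (a dict) per message."""
    for message in messages:
        yield dict(message)


def bot_detection_bolt(stream, bot_markers=("bot", "crawler", "spider")):
    """Drops events whose user agent looks automated (hypothetical rule)."""
    for event in stream:
        agent = event.get("user_agent", "").lower()
        if not any(marker in agent for marker in bot_markers):
            yield event


def entity_observe_bolt(stream, entities):
    """Records the most recent event per entity id, then passes it on."""
    for event in stream:
        entities[event["entity_id"]] = event
        yield event


def realtime_counter_bolt(stream, counters):
    """Increments a counter per (entity id, event type) pair."""
    for event in stream:
        counters[(event["entity_id"], event["event_type"])] += 1
        yield event


events = [
    {"entity_id": "e1", "event_type": "view", "user_agent": "Mozilla/5.0"},
    {"entity_id": "e1", "event_type": "view", "user_agent": "Googlebot/2.1"},
    {"entity_id": "e2", "event_type": "click", "user_agent": "Mozilla/5.0"},
    {"entity_id": "e1", "event_type": "view", "user_agent": "Mozilla/5.0"},
]

entities, counters = {}, Counter()
stream = realtime_counter_bolt(
    entity_observe_bolt(bot_detection_bolt(kafka_spout(events)), entities),
    counters,
)
processed = sum(1 for _ in stream)

print(processed)                 # 3: the Googlebot event is filtered out
print(counters[("e1", "view")])  # 2
```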

E N T I T Y O B S E R V E T O P O L O G Y

R E A L - T I M E C O U N T E R B O L T

P E R F O R M A N C E M E T R I C S

Play / Kafka ~ 3000 ops/node

Kafka / Storm ~ 1650 ops/node

Storm / MapR M7 ~ 5000 ops/node
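One way to read these per-node figures is as a sizing aid. The sketch below estimates how many nodes each stage would need for a given target rate. It assumes the figures are operations per second and that throughput scales linearly with node count, neither of which the slide states; the 10,000 ops target is a hypothetical number chosen for the example.

```python
import math

# Per-node throughput from the slide (assumed to be operations per second).
OPS_PER_NODE = {
    "Play / Kafka": 3000,
    "Kafka / Storm": 1650,
    "Storm / MapR M7": 5000,
}


def nodes_needed(target_ops: int, ops_per_node: int) -> int:
    """Smallest node count whose combined throughput covers the target."""
    return math.ceil(target_ops / ops_per_node)


def plan(target_ops: int) -> dict:
    return {stage: nodes_needed(target_ops, rate)
            for stage, rate in OPS_PER_NODE.items()}


print(plan(10_000))
# {'Play / Kafka': 4, 'Kafka / Storm': 7, 'Storm / MapR M7': 2}
```

Under these assumptions the Kafka-to-Storm hop is the stage that needs the most nodes, since it has the lowest per-node rate of the three.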

Workload    Cassandra 2.0.x    MapR M7

1M Put      1,900 ops/node     15,000 ops/node

1M RW       2,000 ops/node     5,000 ops/node

1B Load     N/A                7,000 ops/node

C O M P A R I N G M 7 W I T H C A S S A N D R A

YCSB benchmark on a 5-node cluster with 24 cores, 192 GB RAM, and 24 disks per node

Cassandra 2.0.x; MapR M7 Pre-Release 3.00

closer to what we see in production

T A M I N G S T O R M

• Use monit to keep Nimbus & Supervisors running smoothly

• Local queues that periodically write operational stats (e.g. processing throughput) to Redis & alert the Ops team

• Shaded jars & deployment scripts to keep topologies up to date

• ScBaseRichBolt

• Write your own base classes to trap framework exceptions and handle them properly

• Reduce boilerplate code

• Use Murmur Hash to make jobs more efficient by distributing keys more evenly (true for MapReduce as well)

• Storm UI is not reliable (v0.8.2), so you need to roll your own stats; the Storm 0.9 UI should be more reliable

• DataDog used for dashboards and alerts
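The Murmur Hash tip can be illustrated with a small partitioner. The sketch below is a pure-Python version of the 32-bit MurmurHash3 function, used to map keys onto a fixed number of partitions. The `partition_for` helper, the key format, and the partition count are hypothetical; a production topology would more likely call a library implementation on the JVM than hand-roll the hash.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant), returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    length = len(data)
    tail_start = length - (length % 4)

    # Body: mix in the input four bytes at a time (little-endian).
    for i in range(0, tail_start, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: mix in the remaining one to three bytes, if any.
    tail = data[tail_start:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: force the bits to avalanche.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical helper: pick a partition (or task) for a key."""
    return murmur3_32(key.encode("utf-8")) % num_partitions


# Sequential ids are a typical case where a weak hash can cluster keys.
load = Counter(partition_for(f"user-{i}", 8) for i in range(8000))
print(sorted(load.items()))
```

With an even spread, no single bolt task or reducer ends up holding a disproportionate share of the keys, which is where the efficiency gain comes from.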

F E A T U R E S

• Deep learning for feature detection

• Anomaly detection

• Automation of full data science lifecycle, from exploration and modeling to production and relearning

• R and Python custom algorithm support

• Automated model training and optimization

R O A D M A P

T E C H N O L O G Y

• Storm 0.9.0

• Kafka 0.8.0

• Apache Spark

• Play 2.2.x

• Cascading

• Spring XD - eXtreme Data

• Spring Reactor

• Spring Boot

R O A D M A P

Thank you!