Post on 06-May-2015
NEW YORK STORM USERS GROUP
Using Storm with MapR M7 for Real-Time Predictive Modeling
January 28, 2014
AGENDA
• Introductions
• About Velos
• Our Use Cases
• Requirements
• Why Storm?
• Why MapR M7?
• How Did We Get Here?
• Architecture
• Quick Storm Introduction
• Our Topologies
• Performance & Learnings
• Road Map
• Q&A
INTRODUCTIONS
Gna Phetsarath
Director of Technology
@sourigna
http://www.linkedin.com/in/sourignaphetsarath/
ABOUT VELOS
• Velos provides predictive analytics lifecycle and scaling solutions for enterprise companies
• Formerly Sociocast, a SaaS solution with use-case-specific ad tech and e-commerce models on our own hardware
• Velos provides an on-premise platform supporting any model on various production runtimes, such as Hadoop, Storm, Spark and others
• Customers can easily automate ETL, feature engineering, model evaluation, production deployment and monitoring, as well as relearning and adaptation
• Plug in existing Python, Java, and R models
OUR USE CASES
• Real-Time Predictive Modeling
• Real-Time Metrics
  • Atomic Counters
  • Unique Probabilistic Counting (HyperLogLog Plus)
  • Group Membership (Bloom Filters)
• Page Parsing - NLP Feature Extraction
• Event/Entity Attribute Maintenance
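To make the group-membership use case concrete, here is a minimal Bloom filter sketch in plain Python. It is an illustration only, not the implementation used in the talk: the class name, the sizes, and the SHA-256-based double hashing are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing). This hash choice is an assumption for the sketch.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical audience segment keyed by user id.
segment = BloomFilter()
for user_id in ("user-1", "user-2", "user-3"):
    segment.add(user_id)

print("user-2" in segment)  # True: keys that were added are always reported present
```

A key that was never added is reported absent with high probability. The false-positive rate depends on the number of bits, the number of hash functions, and how many keys have been inserted.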
REQUIREMENTS
• < 50 ms response time
• Random access to a large data set (> 1B keys)
• Near real-time/streaming
• Distributed
• Scalable
• Fault tolerant
• Reliable
WHY STORM?
• Simple API
• Scalable
• Fault tolerant
• Guarantees data processing
• Handles parallelization, partitioning, and retrying on failures when necessary
• Easy to deploy and operate
• Free and open source
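The "guarantees data processing" and "retrying on failures" points can be illustrated with a small at-least-once replay loop. This is plain Python, not Storm's API: the function names, the `max_attempts` parameter, and the flaky handler are hypothetical, and the ack/fail comments are only an analogy to what a framework like Storm does for you.

```python
from collections import deque


def process_at_least_once(messages, handler, max_attempts=3):
    """Run handler on every message, replaying failures up to max_attempts."""
    pending = deque((msg, 1) for msg in messages)
    done, dead = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            done.append(msg)                        # "ack": fully processed
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # "fail": replay later
            else:
                dead.append(msg)                    # give up after max_attempts
    return done, dead


seen = set()


def flaky_handler(msg):
    # Hypothetical handler that fails the first time it sees an even number.
    if msg % 2 == 0 and msg not in seen:
        seen.add(msg)
        raise RuntimeError("transient failure")


done, dead = process_at_least_once(range(6), flaky_handler)
print(sorted(done), dead)  # [0, 1, 2, 3, 4, 5] []
```

Because a failed message is replayed, a handler may see the same message more than once, so downstream updates (counters in particular) need to tolerate or de-duplicate replays.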
WHY MAPR M7?
• Configuration is simpler than with HBase
• No region servers
• No compactions, since it sits on a read-write file system
• Recovery from a cold start is easier. If HBase goes down and has to be restarted, recovery can take hours, whereas M7 recovers in minutes. We haven't had to go through a cold start ourselves, but we did have a ZooKeeper failure and had to bounce each node, and that was quick.
• NFS Gateway is very useful
• There are plenty of features we haven't taken advantage of yet
• MapR Admin UI is easy to use
HOW DID WE GET HERE?
• Amazon Elastic MapReduce
• Cloudera Hadoop on Amazon Web Services
• MapR M3 (Hadoop MapReduce) on Managed Hosting Service
• MapR M3, Riak, Storm, Kafka, Redis, Play on Managed Hosting Service
• MapR M3, MapR M7 (HBase), Storm, Kafka, Redis, Play on Managed Hosting Service
ARCHITECTURE - Q4 2013
[Diagram. Components: API (Play Framework), Dashboard (Play Framework), Kafka, Storm Topologies, Redis, MapR M7, MapR M3, PostgreSQL]
STORM CONCEPTS
[Diagram. Labels: Spout, Bolt, Topology, Tuples (key,fields,...)]
QUICK STORM OVERVIEW
• Tuple - a named list of values
• Streams - streams of tuples
• Spouts - sources of streams
• Bolts - process any number of input streams and produce any number of output streams
• Topologies - a network of spouts and bolts
• Reliability - guarantees that every tuple will be fully processed
• Workers - each executes a subset of a topology
• Tasks - executed by workers for bolts/spouts
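The concepts above can be sketched without Storm at all. The toy below models tuples as dicts and spouts and bolts as Python generators, wired into a linear "topology". It is a single-process illustration under those assumptions and says nothing about Storm's actual API, parallelism, or acking. The word-count example and all function names are made up for the sketch.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (here a fixed list rather than Kafka)."""
    for sentence in ("storm processes streams", "streams of tuples"):
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consumes one stream and emits a new stream of word tuples."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps running counts and emits the updated count per word."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


# Topology: spout -> split bolt -> count bolt, wired as a chain of generators.
topology = count_bolt(split_bolt(sentence_spout()))
final_counts = {tup["word"]: tup["count"] for tup in topology}
print(final_counts)
```

In a real cluster each bolt would run as many parallel tasks spread across workers, and the framework would decide which task receives each tuple.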
OUR TOPOLOGIES
• Entity Observe
  • Kafka Spout
  • Bot Detection Bolt
  • Entity Observe Bolt
  • Real-time Counter Bolt
  • Predictive Model Update Bolt
• NLP Feature Extraction of HTML Content
• Entity/Event Attribute Maintenance
ENTITY OBSERVE TOPOLOGY
REAL-TIME COUNTER BOLT
PERFORMANCE METRICS
Play / Kafka ~ 3,000 ops/node
Kafka / Storm ~ 1,650 ops/node
Storm / MapR M7 ~ 5,000 ops/node
COMPARING M7 WITH CASSANDRA
1M Put     1,900 ops/node    15,000 ops/node
1M RW      2,000 ops/node     5,000 ops/node
1B Load    N/A                7,000 ops/node
YCSB benchmark on a 5-node cluster with 24 cores, 192 GB RAM, and 24 disks per node
Cassandra 2.0.x; MapR M7 Pre-Release 3.00
closer to what we see in production
TAMING STORM
• Use monit to keep Nimbus & Supervisors running smoothly
• Local queues that periodically write operational stats to Redis (e.g. processing throughput) & alert the Ops team
• Shaded jars & deployment scripts to keep topologies up to date
• ScBaseRichBolt
  • Write your own base classes to trap framework exceptions and handle them properly
  • Reduce boilerplate code
• Use Murmur Hash to make jobs more efficient by distributing keys more evenly (true for MapReduce as well)
• Storm UI is not reliable (v0.8.2), so you need to roll your own stats; the Storm 0.9 UI should be more reliable
• DataDog used for dashboards and alerts
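As an illustration of the Murmur Hash point, the sketch below hashes keys with a pure-Python MurmurHash3 (32-bit) and maps them onto a fixed number of partitions. The slides do not say which Murmur variant was used or how keys were routed, so the variant, the `partition` helper, the key format, and the partition count are all assumptions.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def _mix_block(k: int) -> int:
    """Scramble one 32-bit block (shared by the body and tail steps)."""
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32  # rotate left by 15
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, 32-bit variant, returned as an unsigned integer."""
    h = seed & MASK32
    n = len(data)
    rounded = n - (n % 4)
    for i in range(0, rounded, 4):
        h ^= _mix_block(int.from_bytes(data[i:i + 4], "little"))
        h = ((h << 13) | (h >> 19)) & MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[rounded:]
    if tail:
        h ^= _mix_block(int.from_bytes(tail, "little"))
    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def partition(key: str, num_partitions: int) -> int:
    """Hypothetical helper: choose a partition/task index for a key."""
    return murmur3_32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(10_000)]
load = Counter(partition(k, 8) for k in keys)
print(sorted(load.items()))  # per-partition key counts
```

The per-partition counts should come out close to one another even though the keys share a long common prefix, which is the property that keeps any single task or reducer from becoming a hot spot.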
ROADMAP
FEATURES
• Deep learning for feature detection
• Anomaly detection
• Automation of full data science lifecycle, from exploration and modeling to production and relearning
• R and Python custom algorithm support
• Automated model training and optimization
ROADMAP
TECHNOLOGY
• Storm 0.9.0
• Kafka 0.8.0
• Apache Spark
• Play 2.2.x
• Cascading
• Spring XD - eXtreme Data
• Spring Reactor
• Spring Boot
Q&A
Thank you!