Date post: | 15-Dec-2014 |
Category: | Technology |
Upload: | chris-huang |
04/10/20231 Confidential | Copyright 2013 TrendMicro Inc.
Chris Huang
Sr. Manager, Core Tech
2014/9/24
Real-time Big Data Applications with Hadoop Ecosystem
About – Chris Huang
• Chris Huang
  – SPN Solution Developer Manager
  – SPN Hadoop Architect
  – Hadoop.TW Active Member
• Believes cloud, services, software, and big data are critical factors for Taiwan’s future economic development
Conference Talks
Hot Keywords in Hadoop Community
• Real-time: Impala, Stinger
• Computing Framework: YARN, Tez
• In Memory: Spark
• Streaming: Kafka, Storm
Big Data Applications
• Operational
  – Real-time
  – Near Real-time
• Analytical
  – Batch
  – Interactive
  – Near Real-time
  – Streaming
An Online Music Example
• Operational
  – Recent N login times (listen duration)
  – Recent N albums/artists the user browsed
  – Recent N keywords the user searched
  – Recent N songs/albums/artists the user listened to (bought)
  – Recent N months of the user’s purchase amount
• Analytical
  – Recommend the right song/album/artist to the right user at the right time
  – Correlate similar songs/albums/artists (CDDB or user behavior)
  – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
  – Know regional music trends
  – Calculate regional leaderboard
  – Connect user with social network
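The operational items above all share a "recent N" shape. A minimal in-memory sketch of that shape follows; the class name `RecentN`, the event kind `"listen"`, and the sample data are hypothetical, and a real deployment would keep this state in a store such as HBase rather than in process memory.

```python
from collections import defaultdict, deque


class RecentN:
    """Keep only the N most recent events per (user, event kind)."""

    def __init__(self, n):
        self.n = n
        # deque(maxlen=n) drops the oldest item once it is full.
        self._events = defaultdict(lambda: deque(maxlen=self.n))

    def record(self, user, kind, value):
        self._events[(user, kind)].append(value)

    def recent(self, user, kind):
        # Newest first, which is how a "recent N" view is usually read.
        return list(reversed(self._events[(user, kind)]))


store = RecentN(n=3)
for song in ["song-a", "song-b", "song-c", "song-d"]:
    store.record("user-1", "listen", song)
print(store.recent("user-1", "listen"))
```

The bounded deque keeps each lookup cheap regardless of how long the user's history is, which is the property an operational (low-latency) read needs.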
An Online Banking Example
• Operational
  – Recent N login times / frequency
  – Recent N items purchased by credit card
  – Recent N months of balance amount
  – Recent N transfer in/out amounts
  – Recent N investment events
  – Recent N months of investment balance
• Analytical
  – Know more about the user’s profile (assets/debts/shopping habits/family)
  – Recommend the right product to the right user (investment, credit card, loan)
  – Know seasonal trends (tax month/year end/back to school/X’mas)
  – Know regional investment product leaderboard (by different age)
  – Recommend products by similar user profile
Building Your Big Data Applications
• Think about your data
  – Entity or Event?
• Think about your use case
  – Operational or Analytic?
• Think about your data user
  – External or Internal?
Slides from “Apache HBase Application Archetypes”, HBaseCon 2014
You can replace HBase with similar alternatives; the concepts are the same
Think About Your Data
Think About Your Use Case
Operational Use Case 1
[Diagram labels: MR / Spark; HDFS; Batch; Real-time]
HBase: No Secondary Index (yet)
• Search index building (row key)
• Use Solr to make text data searchable
  – Snapshot & clone table
  – Index column qualifier text
  – Record row-key in Solr document
  – Use HBase client to fetch data
• Usually takes less than a few seconds
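The index-then-fetch pattern above can be sketched without either system. In the sketch below a plain dictionary stands in for the HBase table and an inverted index stands in for Solr; the row keys, the `title` qualifier, and both function names are hypothetical, and no real Solr or HBase client API is shown.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> {column qualifier: text}.
table = {
    "row-001": {"title": "hadoop ecosystem overview"},
    "row-002": {"title": "real-time search with solr"},
    "row-003": {"title": "hadoop batch jobs"},
}


def build_index(rows):
    """Index column text and record the row key with each term."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        for text in columns.values():
            for term in text.split():
                index[term].add(row_key)
    return index


def search(index, rows, term):
    """Query the index for row keys, then fetch the rows by key."""
    row_keys = sorted(index.get(term, set()))
    return [(key, rows[key]) for key in row_keys]


index = build_index(table)
hits = search(index, table, "hadoop")
print([key for key, _ in hits])
```

The index holds only row keys, so the table remains the single source of the data and the index can be rebuilt from a snapshot at any time.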
Operational Use Case 2 (SPN)
[Diagram labels: Feed App; Flume; HDFS; MapReduce; Pig; Solr Client (Index Query); Get, Scan; low latency; high throughput; Real-time; Batch]
Operational Use Case 3 (Mixed)
[Diagram labels: Feed App; Flume; HDFS; MapReduce; Pig; MR / Spark (Batch); Solr Client (Index Query); HBase Client (Gets, Short scan); HBase Client (Put, Incr, Append); HBase Client (Bulk Import); HBase Replication; Solr; Get, Scan; low latency; high throughput; Real-time; Batch]
HBase or HDFS?
• Depends on your data
  – Entity or Event?
• Depends on your workload
  – Low latency?
  – Random read/write?
  – Short/full scan?
  – Sequential read/write?
  – Update?
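The workload questions above can be read as a rough checklist. The sketch below encodes one common rule of thumb, which is an assumption and not something the slide states: low-latency, random-access, and update-heavy workloads lean toward HBase, while full scans and sequential reads or writes lean toward plain HDFS files. The function name and its parameters are hypothetical.

```python
def suggest_store(low_latency=False, random_access=False, updates=False,
                  full_scan=False, sequential=False):
    """Rough rule of thumb for a first guess, not a sizing tool."""
    hbase_signals = sum([low_latency, random_access, updates])
    hdfs_signals = sum([full_scan, sequential])
    if hbase_signals > hdfs_signals:
        return "HBase"
    if hdfs_signals > hbase_signals:
        return "HDFS"
    return "either (look closer at the workload)"


print(suggest_store(low_latency=True, random_access=True))
print(suggest_store(full_scan=True, sequential=True))
```

Mixed answers are normal; the mixed use case in this deck keeps both stores and routes each workload to the one that fits it.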
Wait… Batch for Operational?
Yes, Why not?
Operational: Batch + Real-time
• Bridge the gap between batch and now
• 80/20 rule
  – HDFS/MapReduce/Spark solves 80% easily
  – Remaining 20% takes 80% of the effort
• Go as close as possible, don’t overdo it!
What is Real-time?
• Real-time is NOT always “faster than batch”
  – If you have really BIG DATA
• Most of the time, we want Timely Information
• Minimize the gap between scheduled batch jobs
[Timeline labels: Hourly Job; Hourly Job; Hourly Job]
How to get a result at 1:33?
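One way to answer the 1:33 question is to serve the last batch result plus whatever arrived after it. The sketch below assumes a simple play-count use case; the data, the 1:00 batch boundary, and the function name `merged_view` are all hypothetical.

```python
from collections import Counter


def merged_view(batch_counts, recent_events):
    """Combine the last batch result with events that arrived after it."""
    view = Counter(batch_counts)
    view.update(recent_events)  # adds one per occurrence of each event key
    return view


# Result of the hourly job that finished at 1:00 (hypothetical data).
batch_at_1_00 = {"song-a": 120, "song-b": 75}
# Events seen between 1:00 and 1:33, kept by the real-time path.
events_since_batch = ["song-a", "song-c", "song-a"]

view_at_1_33 = merged_view(batch_at_1_00, events_since_batch)
print(view_at_1_33["song-a"], view_at_1_33["song-b"], view_at_1_33["song-c"])
```

When the 2:00 job finishes, its output replaces the batch part and the real-time part is reset, so the real-time side only ever has to hold up to one hour of data.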
Batch/streaming compute
Near real-time/interactive delivery
Analytical Use Case
Near Real-time Interactive
Recommendation System
The Online Music Example
• Operational
  – Recent N login times (listen duration)
  – Recent N albums/artists the user browsed
  – Recent N keywords the user searched
  – Recent N songs/albums/artists the user listened to (bought)
  – Recent N months of the user’s purchase amount
• Analytical
  – Recommend the right song/album/artist to the right user at the right time
  – Correlate similar songs/albums/artists (CDDB or user behavior)
  – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
  – Know regional music trends
  – Calculate regional leaderboard
  – Connect user with social network
Do you really want an analytical result (recommendation) EVERY 50 milliseconds?
Analytical Use Case 1
[Diagram labels: HDFS; Batch; Solr Client (Index Query); Real-time]
Analytical Use Case 2 (SPN)
“A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase”, HBaseCon 2014
You Need an Interactive Analytic Engine
Stinger
Impala Architecture
[Diagram labels: NN, JT, HM (Active); NN, JT, HM (Standby); four worker nodes, each with Datanode, Tasktracker, Regionserver, and impala daemon; State store; Catalog; Hive Metastore]
Apache Pig (MapReduce)
• Do hourly count on Akamai log
  – A = load 'date://2014/07/20/00' using AkamaiRCLoader();
    B = foreach (group A all) generate COUNT_STAR(A);
    dump B;
  – …
    0% complete
    100% complete
    (194202349)
• Elapsed time: 4 mins, 28 sec
Too Slow for Interactive
Using Impala
• No memory cache: 96.46s
  – > select count(*) from akafast where day=20140720 and hour=0
  – 194202349
• With OS cache: 9.07s
• Do a further query: 6.57s
  – select count(*) from akafast where day=20140720 and hour=00 and c='US';
  – 41118019
Makes Sense Now
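To put the timings in proportion, the short calculation below compares the Pig run with the two Impala runs of the same count query. It assumes the slide's timings pair with the bullets in order (96.46s with no memory cache, 9.07s with OS cache); the 6.57s figure is left out because it belongs to a different, filtered query.

```python
pig_seconds = 4 * 60 + 28  # "4mins, 28sec" from the Pig slide

# Assumed pairing of timings to runs, in the order they appear on the slide.
impala_seconds = {
    "no memory cache": 96.46,
    "with OS cache": 9.07,
}

speedups = {label: pig_seconds / s for label, s in impala_seconds.items()}
for label, speedup in speedups.items():
    print(f"{label}: about {speedup:.1f}x faster than Pig")
```

Under that pairing the cold run is a little under 3x faster than Pig and the cached run is close to 30x faster, which is the difference between waiting minutes and waiting seconds for an interactive query.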
Don’t Connect Analytic Engine with Operational Use Case
Analytical Use Case 3
[Diagram labels: Feed App; Flume; HDFS; MR / Spark (Batch); Impala/Stinger (Interactive); HBase Client (Gets, Short scan); HBase Client (Put, Incr, Append); HBase Client (Bulk Import); low latency; high throughput; Real-time; Customer; Analyst]
Streaming Use Cases
TME – Trend Message Exchange
http://trendmicro.github.io/tme/
Streaming Operational Use Case
[Diagram labels: Kafka/Storm (Streaming); HDFS; HBase Client (Put, Incr, Append); HBase Client (Gets, Short scan); Solr Client (Index Query); low latency; high throughput; Real-time]
Streaming Analytical Use Case
[Diagram labels: Feed App; Flume; Kafka/Storm (Streaming); HDFS; HBase Client (Put, Incr, Append); HBase Client (Gets, Short scan); Impala/Stinger (Interactive); low latency; high throughput; Real-time; Customer; Analyst]
Think About Your Data User
Data User
• External
  – Customer
  – Partner
• Internal
  – Business report user
  – Data researcher
  – Data analyst
  – Algorithm developer
• They want instant response
• They don’t know (and don’t care) whether the recommendation was computed 1 hour ago or 50 ms ago
• Interactive or near real-time is enough
• Sometimes they can even wait for batch (make the data small and analyze)
• Of course, everyone wants results faster, but it depends on your investment $$
No Silver Bullet
For Real-time or Big Data Applications
Q&A