Real-time “OLAP” for Big Data (+ use cases) - bigdata.ro 2013

Post on 09-May-2015

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Cosmin Lehene | Adobe

#bigdataro - 30 January 2013

Real-time “OLAP” for Big Data (+ use cases)

What we needed … and built

OLAP Semantics

Low Latency Ingestion

High Throughput

Real-time Query API

“Physical” Building Blocks

Logical Building Blocks

Dimensions, Metrics

Aggregations

Roll-up, drill-down, slicing and dicing, sorting

OLAP 101 – Queries example

Date        Country  City     OS       Browser  Sale
2012-05-21  USA      NY       Windows  FF        0.0
2012-05-21  USA      NY       Windows  FF       10.0
2012-05-22  USA      SF       OSX      Chrome   25.0
2012-05-22  Canada   Ontario  Linux    Chrome    0.0
2012-05-23  USA      Chicago  OSX      Safari   15.0

5 visits, 3 days

2 countries: USA: 4, Canada: 1

4 cities: NY: 2, SF: 1

3 OS-es: Win: 2, OSX: 2

3 browsers: FF: 2, Chrome: 2

50.0 total sales (3 sales)

OLAP 101 – Queries example

Rolling up to country level:

SELECT COUNT(visits), SUM(sales)
GROUP BY country

Country  Visits  Sales
USA      4       $50
Canada   1       0

“Slice” by browser:

SELECT COUNT(visits), SUM(sales)
GROUP BY country
HAVING browser = 'FF'

Country  Visits  Sales
USA      2       $10
Canada   0       0

Top browsers by sales:

SELECT SUM(sales), COUNT(visits)
GROUP BY browser
ORDER BY sales DESC

Browser  Sales  Visits
Chrome   $25    2
Safari   $15    1
FF       $10    2
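The three queries can be reproduced on the five sample visits with a few lines of Python. This is only an illustrative sketch: the `aggregate` helper and its `where` parameter are hypothetical names introduced here, not part of SaasBase or of the talk.

```python
from collections import defaultdict

# Sample visits copied from the "OLAP 101" table:
# (date, country, city, os, browser, sale)
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sale")


def aggregate(rows, group_by, where=None):
    """Return {group value: (visit count, total sales)} for one dimension."""
    key_idx = COLUMNS.index(group_by)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if where is not None and not where(record):
            continue  # row filtered out by the slice predicate
        bucket = totals[row[key_idx]]
        bucket[0] += 1
        bucket[1] += record["sale"]
    return {key: (visits, sales) for key, (visits, sales) in totals.items()}


# Roll-up to country level.
rollup = aggregate(VISITS, "country")
# "Slice" by browser: only Firefox visits, grouped by country.
ff_slice = aggregate(VISITS, "country", where=lambda r: r["browser"] == "FF")
# Top browsers by sales, highest first.
top_browsers = sorted(
    aggregate(VISITS, "browser").items(),
    key=lambda item: item[1][1],
    reverse=True,
)
```

Unlike the slide's result table, groups with no matching rows (Canada in the FF slice) are simply absent from the output rather than listed with zeros.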

OLAP – Runtime Aggregation vs. Pre-aggregation

Aggregate at runtime

Most flexible

Fast – scatter gather

Space efficient

But:

I/O, CPU intensive

Slow for larger data

Low throughput

Pre-aggregate

Fast

Efficient – O(1)

High throughput

But:

More effort to process (latency)

Combinatorial explosion (space)

No flexibility
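The trade-off can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not SaasBase code: `runtime_sum` scans all rows for every query, while `build_cube` (a hypothetical helper) pre-computes a total for every subset of dimensions so that a query becomes a single lookup.

```python
from itertools import combinations

DIMENSIONS = ("country", "os", "browser")

# Rows reduced to the three dimensions above plus the sale amount.
ROWS = [
    {"country": "USA", "os": "Windows", "browser": "FF", "sale": 0.0},
    {"country": "USA", "os": "Windows", "browser": "FF", "sale": 10.0},
    {"country": "USA", "os": "OSX", "browser": "Chrome", "sale": 25.0},
    {"country": "Canada", "os": "Linux", "browser": "Chrome", "sale": 0.0},
    {"country": "USA", "os": "OSX", "browser": "Safari", "sale": 15.0},
]


def runtime_sum(rows, group_dims, group_values):
    """Aggregate at query time: scan every row for each query."""
    return sum(
        row["sale"]
        for row in rows
        if tuple(row[d] for d in group_dims) == group_values
    )


def build_cube(rows, dimensions):
    """Pre-aggregate: one total per (dimension subset, values) pair."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] = cube.get(key, 0.0) + row["sale"]
    return cube


CUBE = build_cube(ROWS, DIMENSIONS)

# Query time is a single dictionary lookup ...
usa_total = CUBE[(("country",), ("USA",))]
# ... but the cube holds cells for every dimension subset: 2**3 = 8 groupings.
groupings = {dims for dims, _ in CUBE}
```

With d dimensions the number of groupings is 2^d, which is the "combinatorial explosion" in space; a query that needs a grouping or filter that was not pre-computed cannot be answered from the cube at all, which is the loss of flexibility.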

SaasBase Map

SaasBase Domain Model Mapping

SaasBase - Domain Model Mapping

SaasBase - Ingestion, Processing, Indexing, Querying

SaasBase - Ingestion, Processing, Indexing, Querying

Ingestion

Ingestion (ETL): throughput vs. latency

Historical data (large batches): optimize for throughput

Increments (latest data, smaller): optimize for latency

Processing

Processing

Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality), and generating cubes that can be queried in real time.

“Super Processor” code runs in Storm, MapReduce, and HBase.
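The slides do not show the “Super Processor” code itself. One plausible reason the same aggregation logic can run both in batch and on a stream is that the aggregates are mergeable: a cube built from a large batch can be combined with a cube built from a small increment. The sketch below illustrates that idea only; `Partial`, `process`, and `merge_cubes` are hypothetical names, and the cell keys are simplified to a country.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Mergeable partial aggregate for one cube cell (hypothetical type)."""
    count: int = 0
    total: float = 0.0

    def add(self, value):
        return Partial(self.count + 1, self.total + value)

    def merge(self, other):
        return Partial(self.count + other.count, self.total + other.total)

    @property
    def average(self):
        # AVG is derived from SUM and COUNT, so it stays mergeable.
        return self.total / self.count if self.count else 0.0


def process(events, cube=None):
    """Fold (cell_key, value) events into a cube of partial aggregates."""
    cube = dict(cube or {})
    for key, value in events:
        cube[key] = cube.get(key, Partial()).add(value)
    return cube


def merge_cubes(left, right):
    """Combine two cubes, e.g. a historical batch and a fresh increment."""
    merged = dict(left)
    for key, partial in right.items():
        merged[key] = merged.get(key, Partial()).merge(partial)
    return merged


batch = [("USA", 0.0), ("USA", 10.0), ("USA", 25.0)]
increment = [("Canada", 0.0), ("USA", 15.0)]

all_at_once = process(batch + increment)
incremental = merge_cubes(process(batch), process(increment))
```

Both paths produce the same cube, so historical data can be processed for throughput and recent increments for latency without changing the aggregation code.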

Processing for OLAP semantics

GROUP BY (process, query)

COUNT, SUM, AVG, etc. (process, query)

SORT (process, query)

HAVING (mostly query, can define pre-process constraints)

SaasBase vs. SQL Views Comparison

Query Engine

Always reads indexed, compact data

Query parsing

Scan strategy

Single vs. multiple scans

Start/stop rows (prefixes, index positions, etc.)

Index selection (volatile indexes with incremental processing)

Deserialization

Post-aggregation, sorting, fuzzy-sorting etc.

Paging

Custom dimension/metric class loading
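The "start/stop rows (prefixes ...)" item can be sketched as follows. In a store with lexicographically sorted row keys, such as HBase, all keys sharing a prefix form one contiguous range, so a prefix query becomes a single range scan. The row-key layout and the helper names below are assumptions made for illustration, and a sorted Python list stands in for the table.

```python
from bisect import bisect_left


def prefix_scan_range(prefix: bytes):
    """Return (start_row, stop_row) covering every key that starts with prefix.

    stop_row is exclusive; None means "scan to the end of the table".
    """
    stop = bytearray(prefix)
    # Drop trailing 0xFF bytes, then increment the last remaining byte.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, None
    stop[-1] += 1
    return prefix, bytes(stop)


def scan(sorted_keys, start_row, stop_row):
    """Slice a sorted key list the way a range scan would."""
    lo = bisect_left(sorted_keys, start_row)
    hi = len(sorted_keys) if stop_row is None else bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Hypothetical row-key layout: <cube>|<date>|<dimension value>
KEYS = sorted([
    b"visits_by_country|2012-05-21|USA",
    b"visits_by_country|2012-05-22|Canada",
    b"visits_by_country|2012-05-22|USA",
    b"visits_by_country|2012-05-23|USA",
    b"visits_by_os|2012-05-21|Windows",
])

# All country cells for one day are read with a single contiguous scan.
start, stop = prefix_scan_range(b"visits_by_country|2012-05-22|")
matches = scan(KEYS, start, stop)
```

A query spanning several non-adjacent prefixes would need one such range per prefix, which is the "single vs. multiple scans" decision in the list above.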

Adobe Business Catalyst

Online business presence: e-commerce, marketing, web analytics etc.

Use case: Web Analytics (visitors, channels, content, e-commerce, campaigns, etc.)

BC - Workflow

Adobe Business Catalyst - Stats

3 active datacenters

Raw data ~6TB (from ~1TB 18 months ago)

Visits table: ~1TB each (compressed)

OLAP cubes (stats): 49GB – 64GB (compressed)

~30 minutes latency (from actual pageview/sale to chart in UI)

10s – 100s of milliseconds latency for queries

~3000/s max concurrent OLAP queries (actual traffic is much lower)

Adobe Pass for TV Everywhere

Authentication & Authorization

Single sign-on to Programmer content (e.g. Turner, NBC, Hulu, MTV, etc.) with Cable operator credentials (e.g. Comcast, Dish, etc.)

Adobe Pass – Use Case

Analytics use case: Operational metrics (users, devices, latencies, etc.)

Real-time ingestion in HBase

High-frequency MapReduce jobs (every 2 minutes)

Adobe Pass - Stats (London Olympics 2012)

67M streams ~ 5.3M hours

1.5M concurrent streams

> 7M unique users

1 Technical & Engineering Emmy Award ;)

Adobe Primetime – Real-time Video Analytics

Unified video platform (acquisition, transcoding, broadcast, ads, analytics)

Adobe Primetime – Use Case

Use cases:

Audience metrics – minutes of latency is OK

Ads metrics – seconds to minutes is OK

Streaming QoS metrics – seconds is a must

Requirements:

Massive throughput (millions of streams, multiple heartbeats every 10 seconds)

Low latency (end-to-end)
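To get a feel for what "massive throughput" means here, the ingest rate can be estimated from the number of concurrent streams and the heartbeat frequency. The input values below are assumptions chosen for illustration, not figures from the talk, and the function name is hypothetical.

```python
def heartbeat_events_per_second(concurrent_streams, heartbeats_per_interval,
                                interval_seconds=10):
    """Estimate ingest rate from stream count and heartbeat frequency."""
    return concurrent_streams * heartbeats_per_interval / interval_seconds


# Assumed inputs, not figures from the talk: 1 million concurrent streams,
# each sending 2 heartbeats per 10-second interval.
rate = heartbeat_events_per_second(1_000_000, 2)
```

Under these assumptions the system has to absorb on the order of hundreds of thousands of events per second, before any aggregation takes place.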

Conclusions

OLAP semantics on a simple data model

Data as a first-class citizen

Domain Specific “Language” for Dimensions, Metrics, Aggregations

Framework for vertical analytics systems

Tunable performance, resource allocation

Thank you! Cosmin Lehene (@clehene)

http://hstack.org

Related

http://www.hbasecon.com/sessions/low-latency-olap-with-hbase/

http://www.slideshare.net/clehene/low-latency-olap-with-hbase-hbasecon-2012
