Viki Analytics Infrastructure (BigData Singapore Meetup, Oct 2013)
Transcript
Page 1: Viki_bigdatameetup_2013_10

Viki Analytics Infrastructure

BigData Singapore Meetup

Oct 2013

Page 2: Viki_bigdatameetup_2013_10

Viki’s Data Pipeline

Page 3: Viki_bigdatameetup_2013_10

1. Collecting Data

Page 4: Viki_bigdatameetup_2013_10

What data do we collect?

• Clickstream data
• An event is a user interaction or something product-related
• A client (web/mobile) sends these events as HTTP calls
• Format: JSON
  – Schema-less
  – Flexible

{
  "origin": "tv_show_show",
  "app_ver": "2.9.3.151",
  "uuid": "80833c5a760597bf1c8339819636df04",
  "user_id": "5298933u",
  "vs_id": "1008912v-1380452660-7920",
  "app_id": "100004a",
  "event": "video_play",
  "timed_comment": "off",
  "stream_quality": "variable",
  "bottom_subtitle": "en",
  "device_size": "tablet",
  "feature": "auto_play",
  "video_id": "1008912v",
  "subtitle_completion_percent": "100",
  "device_id": "iPad2,1|6.1.3|apple",
  "t": "1380452846",
  "ip": "99.232.169.246",
  "country": "ca",
  "city_name": "Toronto",
  "region_name": "ON"
}
…

Page 5: Viki_bigdatameetup_2013_10

How to keep this data clean?

• Problem: Clients often send erroneous data, e.g. a missing parameter.
• Solution: We write client libraries for each client to enforce “world peace”.

P.S.: there is no such thing as “world peace”.

Page 6: Viki_bigdatameetup_2013_10

How to collect > 60M events a day?

• fluentd
  – Scalable
  – Extensible
  – Lets you send data to Hadoop, MongoDB, PostgreSQL, etc.
• Writes to Hadoop (Treasure Data), Amazon S3, MongoDB (config sketch below)
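The deck doesn't show its fluentd configuration, so here is a minimal sketch, assuming a forward input fanned out with the standard copy/s3/mongo plugins; the tag, bucket, and host names are hypothetical, not Viki's actual setup:

<source>
  type forward          # accept events forwarded from app servers
  port 24224
</source>

<match viki.events.**>
  type copy             # fan the same event out to multiple stores
  <store>
    type s3             # backup copy (hypothetical bucket)
    aws_key_id  YOUR_KEY
    aws_sec_key YOUR_SECRET
    s3_bucket   viki-event-backup
    path        events/
  </store>
  <store>
    type mongo          # real-time store (hypothetical host/db)
    host       mongo.internal
    database   analytics
    collection events
  </store>
</match>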

Page 7: Viki_bigdatameetup_2013_10

Where do we store it?

• Hadoop (Treasure Data)
  – It’s fast and easy to set up!
  – We don’t have the money or time to hire a Hadoop engineer.
  – We retrieve data from Hadoop in batch jobs.
• Amazon S3: Backup
• MongoDB: Real-time data

Page 8: Viki_bigdatameetup_2013_10

2. Retrieving & Processing Data

Page 9: Viki_bigdatameetup_2013_10

2. Retrieving & Processing Data

• Centralizing All Data Sources

• Cleaning Data

• Transforming Data

• Managing Job Dependencies

Page 11: Viki_bigdatameetup_2013_10

Getting All Data To One Place

• Port data from different production databases into PG
• Retrieve click-stream data from Hadoop to PG

a) Production Databases → Analytics DB (PostgreSQL):

thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1

thor db:cp --source A --destination B -t reporting.video_plays --increment

Page 12: Viki_bigdatameetup_2013_10

{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a”, "event":”video_play","timed_comment":"off”, "stream_quality":"variable”, "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", ”subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246”, "country":"ca", "city_name":"Toronto”, "region_name":"ON"}…

date source partner event video_id country cnt

2013-09-29 ios viki video_play 1008912v ca 2

2013-09-29 android viki video_play 1008912v us 18

b) Click-stream Data (Hadoop) Analytics DB:

Hadoop

PostgreSQL

Aggregation (Hive)

Export Output / Sqoop
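The diagram names Sqoop for the export step. A hedged sketch of such an export (the connection URL, table, and HDFS path are hypothetical, and Treasure Data's own result export would be an alternative route):

sqoop export \
  --connect jdbc:postgresql://analytics-db/analytics \
  --username analytics \
  --table video_plays_daily \
  --export-dir /user/hive/warehouse/video_plays_daily \
  --input-fields-terminated-by '\001'   # Hive's default field delimiter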

Page 13: Viki_bigdatameetup_2013_10

SELECT
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ) AS `date_d`,
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'],
  COUNT(1) AS cnt
FROM events
WHERE TIME_RANGE( time, '2013-09-29', '2013-09-30' )
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ),
  v['source'],
  v['partner'],
  v['event'],
  v['video_id'],
  v['country'];

Simple Aggregation SQL

Page 14: Viki_bigdatameetup_2013_10

The Data Is Not Clean!

But… event properties and names change as we develop:

Old Version: {"user_id": "152", "country_code": "sg"}

New Version: {"user_id": "152u", "country": "sg"}

Page 15: Viki_bigdatameetup_2013_10

SELECT
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ) AS `date_d`,
  v['app_id'] AS `app_id`,
  CASE
    WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
    WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
    WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
    WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
    WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
    ELSE LOWER( v['partner'] )
  END AS `partner`,
  CASE
    WHEN ( v['app_id'] = '65535a' AND v['site'] IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'direct'
    WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
    WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'embed'
    WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
    WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
    ELSE TRIM( v['source'] )
  END AS `source`,
  LOWER( CASE
    WHEN LENGTH( TRIM( COALESCE( v['country'], v['country_code'] ) ) ) = 2
    THEN TRIM( COALESCE( v['country'], v['country_code'] ) )
    ELSE NULL
  END ) AS `country`,
  COALESCE( v['device_size'], v['device'] ) AS `device`,
  COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400
  AND time <= 1380412799
  AND v['event'] = 'video_play'
GROUP BY
  SUBSTR( FROM_UNIXTIME( time ), 0, 10 ),
  v['app_id'],
  CASE
    WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
    WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
    WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
    WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
    WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
    WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
    ELSE LOWER( v['partner'] )
  END,
  CASE
    WHEN ( v['app_id'] = '65535a' AND v['site'] IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'direct'
    WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
    WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com', 'viki.com', 'www.viki.mx', 'viki.mx', '' ) ) THEN 'embed'
    WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
    WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
    ELSE TRIM( v['source'] )
  END,
  LOWER( CASE
    WHEN LENGTH( TRIM( COALESCE( v['country'], v['country_code'] ) ) ) = 2
    THEN TRIM( COALESCE( v['country'], v['country_code'] ) )
    ELSE NULL
  END ),
  COALESCE( v['device_size'], v['device'] );

(Not so) simple Aggregation SQL (Hive, on Hadoop)

Page 16: Viki_bigdatameetup_2013_10

UPDATE "reporting"."cl_main_2013_09"SET source = 'embed', partner = ’partner1'WHERE app_id = '100105a' AND (source != 'embed' OR partner != ’partner1')

UPDATE "reporting"."cl_main_2013_09"SET app_id = '100105a'WHERE (source = 'embed' AND partner = ’partner1') AND (app_id != '100105a')

UPDATE reporting.cl_main_2013_09SET user_id = user_id || 'u’WHERE RIGHT(user_id, 1) ~ '[0-9]’

UPDATE "reporting"."cl_main_2013_09"SET app_id = '100106a'WHERE (source = 'embed' AND partner = ’partner2') AND (app_id != '100106a')

UPDATE reporting.cl_main_2013_09SET source = 'raynor', partner = 'viki', app_id = '100000a’WHERE event = 'pv’ AND source IS NULL AND partner IS NULL AND app_id IS NULL

…even after import

PostgreSQL

Page 17: Viki_bigdatameetup_2013_10

Cleaning Up Data Takes Lots of Time

Import data: 30% | Clean up data: 70%

Page 18: Viki_bigdatameetup_2013_10

Transforming Data

• Centralizing All Data Sources

• Cleaning Data

• Transforming Data

• Managing Job Dependencies

Page 19: Viki_bigdatameetup_2013_10

Transforming Data

Table A → Table B

Analytics DB (PostgreSQL)

Page 20: Viki_bigdatameetup_2013_10

a) Reducing Table Size By Dropping a Dimension (PostgreSQL)

video_plays_with_video_id (20M records):

date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18

video_plays (4M records):

date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
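A minimal sketch of the roll-up, assuming the two table layouts above (the exact column types and job wrapper are not shown in the slides):

-- Drop the video_id dimension: group by every remaining dimension
-- and sum the counts, shrinking 20M rows to 4M.
INSERT INTO video_plays (date, source, partner, event, country, cnt)
SELECT date, source, partner, event, country, SUM(cnt)
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;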

Page 21: Viki_bigdatameetup_2013_10

b) Injecting Extra Fields For Analysis (PostgreSQL)

containers (1) -- (n) videos

containers, before:

id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho

containers, after:

id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
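A sketch of the injection, assuming videos reference their container through a container_id column (that column name is a guess; the slides only show the 1-to-n relationship):

-- Denormalize: precompute each container's video count so analysts
-- don't have to join against the videos table.
ALTER TABLE containers ADD COLUMN video_count integer;

UPDATE containers c
SET video_count = (
  SELECT COUNT(*)
  FROM videos v
  WHERE v.container_id = c.id   -- hypothetical foreign key
);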

Page 23: Viki_bigdatameetup_2013_10

Chunk Tables By Month

video_plays (parent table)
  video_plays_2013_06
  video_plays_2013_07
  video_plays_2013_08
  video_plays_2013_09

ALTER TABLE video_plays_2013_09 INHERIT video_plays;

ALTER TABLE video_plays_2013_09
  ADD CONSTRAINT video_plays_2013_09_date_check
  CHECK (date >= '2013-09-01' AND date < '2013-10-01');
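Those CHECK constraints are what let the planner skip irrelevant months; a quick illustration of a query against the parent table that touches only one child:

-- constraint_exclusion defaults to 'partition' since PostgreSQL 8.4,
-- so this scans only video_plays_2013_09, whose CHECK constraint
-- overlaps the WHERE clause.
SELECT source, SUM(cnt)
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01'
GROUP BY source;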

Page 24: Viki_bigdatameetup_2013_10

Managing Job Dependency

• Centralizing All Data Sources

• Cleaning Data

• Transforming Data

• Managing Job Dependencies

Page 25: Viki_bigdatameetup_2013_10

Managing Job Dependency

tableA → tableB

Analytics DB (PostgreSQL)

Page 27: Viki_bigdatameetup_2013_10

Azkaban: cron dependency management

(Viki Cron Dependency Graph)
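For context, Azkaban wires jobs together with small .job property files; a hedged sketch (the job names and commands here are made up, not Viki's actual jobs):

# import.job -- pull yesterday's events out of Hadoop (hypothetical command)
type=command
command=./import_events.sh

# aggregate.job -- Azkaban runs this only after import.job succeeds
type=command
command=./aggregate_video_plays.sh
dependencies=import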

Page 28: Viki_bigdatameetup_2013_10

Data Presentation

Page 30: Viki_bigdatameetup_2013_10

Dashboard

• Yes, a dashboard on Rails.
• We have a daily logship process to port the data over to the dashboard server:

thor db:logship -t big_table

Page 31: Viki_bigdatameetup_2013_10

Data Visualization

• Tableau is slow when working directly on PostgreSQL
• Export compressed CSVs to the Tableau server (Windows) instead (example below)
• Line charts do solve most problems
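A minimal sketch of such an export, assuming a plain psql-plus-gzip pipeline (the slides don't say which tool Viki used; the table and file names are hypothetical):

psql analytics -c "\copy (SELECT * FROM reporting.video_plays) TO STDOUT WITH CSV HEADER" \
  | gzip > video_plays.csv.gz   # ship this file to the Tableau server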

Page 32: Viki_bigdatameetup_2013_10

Engineering involvement in report creation

• Bad idea!
• Enter Query Reports: fast report churn rate.

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe” – Abraham Lincoln

Page 33: Viki_bigdatameetup_2013_10

Query Reports

Page 35: Viki_bigdatameetup_2013_10

Summary report

• Higher-level view of metrics
• See changes over time

(screenshot)

Page 36: Viki_bigdatameetup_2013_10

Data Explorer

“The world is your oyster”

Page 37: Viki_bigdatameetup_2013_10

One more thing! (Viki Live)

Page 38: Viki_bigdatameetup_2013_10

Recap

Page 39: Viki_bigdatameetup_2013_10

Lessons Learnt

• Line charts can solve most problems

• Chart your data quickly

• Our dataset is not that big

Page 40: Viki_bigdatameetup_2013_10

Simple DIY Suggestions

• Put QueryReports on top of your database. Or Tableau Desktop.
• Use Mixpanel/KISSMetrics for product analytics
• fluentd CAN write data to Postgres (hstore); see the sketch below
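A sketch of what that can look like on the Postgres side, assuming events land in an hstore column named v to mirror the Hive queries above (the raw_events layout is a guess; fluentd would need a Postgres output plugin):

CREATE EXTENSION IF NOT EXISTS hstore;

-- One row per raw event; the schema-less properties live in the hstore column.
CREATE TABLE raw_events (
  time integer,
  v    hstore
);

-- The '->' operator pulls a value out of the hstore, much like v['...'] in Hive.
SELECT v -> 'event' AS event, COUNT(*) AS cnt
FROM raw_events
WHERE v -> 'country' = 'sg'
GROUP BY v -> 'event';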

Page 41: Viki_bigdatameetup_2013_10

We are hiring!

Page 43: Viki_bigdatameetup_2013_10

Viki’s Data Pipeline