
(ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

Date post: 30-Jun-2015
Upload: amazon-web-services
Description:
Delivering deep insight into advertising metrics and giving customers easy access to data become harder as scale increases. In this session, Neustar, a global provider of real-time analytics, shows how it uses Redshift to help advertisers and agencies reach the highest-performing customers using data science at scale. Neustar dives into the queries it uses to determine how best to target ads based on their real reach, how much to pay for ads using multi-touch attribution, and how frequently to show ads. Finally, Neustar discusses how it operates a fleet of Redshift clusters to run workloads in parallel and generate daily reports on billions of events within hours. The session also covers how Neustar provides daily feeds of event-level data to its customers for ad-hoc data science.
November 13, 2014 | Las Vegas, NV Timon Karnezos, Director Infrastructure, Neustar Vidhya Srinivasan, Sr. Manager Software Development, Amazon Redshift
Transcript
Page 1: (ADV403) Dynamic Ad Perf. Reporting w/ Redshift: Data Science, Queries at Scale | AWS re:Invent 2014

November 13, 2014 | Las Vegas, NV

Timon Karnezos, Director Infrastructure, Neustar

Vidhya Srinivasan, Sr. Manager Software Development, Amazon Redshift

Page 2:

Petabyte scale

Massively parallel

Relational data warehouse

Fully managed; zero admin

Page 3:

10 GigE (HPC)

Ingestion, Backup, Restore

JDBC/ODBC

Page 4:

Ad Tech

Use Cases

Page 7:

692.8s

34.9s

< 0.76%

Page 9:

[Figure: Space filling Curve for Two Dimensions. Both axes are labeled 00, 01, 10, 11; a point P is marked.]
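
The slide does not name the curve. Assuming it is a Z-order (Morton) curve, a common way to map two dimensions onto a single sort order, the bit interleaving for the 2-bit axes shown (00 to 11) might be sketched as follows. The function name `z_order_index` and the choice of which axis takes the higher bit are assumptions made for illustration.

```python
def z_order_index(x: int, y: int, bits: int = 2) -> int:
    """Interleave the bits of x and y into one index (assumed Z-order layout)."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)       # x bit lands on an even position
        index |= ((y >> i) & 1) << (2 * i + 1)   # y bit lands on an odd position
    return index


# Walk the 4x4 grid in curve order: nearby cells tend to get nearby indexes.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda cell: z_order_index(*cell))
for x, y in cells:
    print(f"x={x:02b} y={y:02b} index={z_order_index(x, y):04b}")
```

Sorting rows by such an index keeps rows that are close in both dimensions close together on disk, which is the usual motivation for a space-filling curve in a sort key.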

Page 15:

Frequency

Page 17:

Attribution

Page 19:

Overlap

Page 21:

Ad-hoc

Page 25:

0.7B / day

2B / week

8B / month

21B / quarter

Page 26:

-- Number of ads seen per user
WITH frequency_intermediate AS (
    SELECT user_id,
           SUM(1)       AS impression_count,
           SUM(cost)    AS cost,
           SUM(revenue) AS revenue
    FROM impressions
    WHERE record_date BETWEEN <...>
    GROUP BY 1
)
-- Number of people who saw N ads
SELECT impression_count, SUM(1), SUM(cost), SUM(revenue)
FROM frequency_intermediate
GROUP BY 1;
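
The two-step aggregation in the query above (count impressions per user, then count users per impression count) can be traced on a handful of rows. This is a minimal Python sketch with invented data, not the speakers' pipeline; the toy `impressions` list and its values are assumptions.

```python
from collections import defaultdict

# Toy impression log: (user_id, cost, revenue). All values are invented.
impressions = [
    ("u1", 2, 0), ("u1", 3, 10),
    ("u2", 1, 0),
    ("u3", 4, 0), ("u3", 1, 5),
]

# Step 1: number of ads seen per user (mirrors the frequency_intermediate CTE).
per_user = defaultdict(lambda: [0, 0, 0])  # impression_count, cost, revenue
for user_id, cost, revenue in impressions:
    row = per_user[user_id]
    row[0] += 1
    row[1] += cost
    row[2] += revenue

# Step 2: number of people who saw N ads, with cost and revenue per bucket.
histogram = defaultdict(lambda: [0, 0, 0])  # users, cost, revenue
for count, cost, revenue in per_user.values():
    bucket = histogram[count]
    bucket[0] += 1
    bucket[1] += cost
    bucket[2] += revenue

for n in sorted(histogram):
    users, cost, revenue = histogram[n]
    print(f"{n} ads: {users} users, cost={cost}, revenue={revenue}")
```

Here one user saw a single ad and two users saw two ads each, so the result has two buckets.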

Page 29:

CREATE TABLE <...> (
    record_date      date   ENCODE <...> NOT NULL,
    campaign_id      bigint ENCODE <...> NOT NULL,
    site_id          bigint ENCODE <...> NOT NULL,
    user_id          bigint ENCODE <...> NOT NULL DISTKEY,
    impression_count int    ENCODE <...> NOT NULL,
    cost             bigint ENCODE <...> NOT NULL,
    revenue          bigint ENCODE <...> NOT NULL
) SORTKEY (<...>, <...>, <...>, <...>);

Page 30:

WITH user_frequency AS (
    SELECT user_id, campaign_id, site_id,
           SUM(impression_count) AS frequency,
           SUM(cost)             AS cost,
           SUM(revenue)          AS revenue
    FROM frequency_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1,2,3
)
SELECT campaign_id, site_id, frequency,
       SUM(1), SUM(cost), SUM(revenue)
FROM user_frequency
GROUP BY 1,2,3;

Page 36:

-- Basic sessionization query, assemble user activity
-- that ended in a conversion into a timeline.
SELECT <...>
FROM impressions i
JOIN conversions c ON
    i.user_id = c.user_id AND
    i.record_date < c.record_date
ORDER BY i.record_date;

Page 37:

[Figure: timeline of impressions before a conversion, labeled Position: 1, Position: 2, Position: 3.]

Page 38:

[Figure: the same timeline, labeled Position: 1, 2, 3 and Hour offset: 3, 12, 16.]

Page 39:

-- Sessionize user activity per conversion, partition by campaign (45-day lookback window)
SELECT c.record_date AS conversion_date,
       c.event_id    AS conversion_id,
       i.campaign_id AS campaign_id,
       i.site_id     AS site_id,
       i.user_id     AS user_id,
       c.revenue     AS conversion_revenue,
       DATEDIFF('hour', i.record_date, c.record_date) AS hour_offset,
       SUM(1) OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id
                    ORDER BY i.record_date DESC ROWS UNBOUNDED PRECEDING) AS position
FROM impressions i
JOIN conversions c ON
    i.user_id = c.user_id AND
    i.campaign_id = c.campaign_id AND
    i.record_date < c.record_date AND
    i.record_date > (c.record_date - interval '45 days') AND
    c.record_date BETWEEN <...>;

Page 40:

-- Compute statistics on sessions (funnel placement, last-touch, site-count, etc...)
SELECT campaign_id,
       site_id,
       conversion_date,
       AVG(position) AS average_position,
       SUM(conversion_revenue * (position = 1)::int) AS lta_attributed,
       AVG(COUNT(DISTINCT site_id)
           OVER (PARTITION BY i.user_id, i.campaign_id, c.event_id
                 ORDER BY i.record_date ASC
                 ROWS UNBOUNDED PRECEDING)) AS average_unique_preceding_site_count
FROM sessions
GROUP BY 1,2,3;
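
To make the sessionization and last-touch steps concrete, here is a small Python sketch that assigns `position` and `hour_offset` to each impression inside the 45-day lookback window and then credits the conversion revenue to the position-1 site. The event data is invented, and the hour offset is computed as whole elapsed hours, which only approximates `DATEDIFF('hour', ...)`.

```python
from collections import defaultdict
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=45)

# Toy events; every value here is invented for illustration.
impressions = [  # (user_id, campaign_id, site_id, timestamp)
    ("u1", "c1", "siteA", datetime(2014, 11, 1, 8)),
    ("u1", "c1", "siteB", datetime(2014, 11, 1, 12)),
    ("u1", "c1", "siteA", datetime(2014, 11, 1, 21)),
    ("u1", "c1", "siteC", datetime(2014, 9, 1, 0)),  # outside the lookback window
]
conversions = [  # (event_id, user_id, campaign_id, timestamp, revenue)
    ("e1", "u1", "c1", datetime(2014, 11, 2, 0), 100),
]

sessions = []
for event_id, user, campaign, conv_ts, revenue in conversions:
    touches = [i for i in impressions
               if i[0] == user and i[1] == campaign
               and conv_ts - LOOKBACK < i[3] < conv_ts]
    # Position 1 is the impression closest to the conversion (ORDER BY ... DESC).
    touches.sort(key=lambda i: i[3], reverse=True)
    for position, (_, _, site, ts) in enumerate(touches, start=1):
        hour_offset = int((conv_ts - ts).total_seconds() // 3600)
        sessions.append((event_id, site, position, hour_offset, revenue))

# Last-touch attribution: all conversion revenue goes to the position-1 site.
lta_attributed = defaultdict(int)
for _, site, position, _, revenue in sessions:
    if position == 1:
        lta_attributed[site] += revenue

for row in sessions:
    print(row)
print(dict(lta_attributed))
```

The filter on `position == 1` plays the same role as the `(position = 1)::int` multiplier in the query.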

Page 44:

         Site A   Site B   Site C
Site A            20%      60%
Site B                     90%
Site C
CPM      $0.06    $1.05    $9.50

Page 47:

CREATE TABLE <...> (
    user_id bigint ENCODE <...> NOT NULL DISTKEY,
    site_id bigint ENCODE <...> NOT NULL
) SORTKEY (<...>);

Page 48:

WITH co_occurences AS (
    SELECT
        oi.site_id  AS site1,
        oi2.site_id AS site2
    FROM overlap_intermediate oi
    JOIN overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id
)
SELECT site1, site2, SUM(1)
FROM co_occurences
GROUP BY 1,2;
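
The self-join above counts, for each pair of sites, the users seen on both. A minimal Python sketch of the same counting on invented data follows; the toy `overlap_intermediate` set and its numeric site ids are assumptions.

```python
from collections import Counter
from itertools import combinations

# Toy (user_id, site_id) pairs, already deduplicated; values are invented.
overlap_intermediate = {
    ("u1", 1), ("u1", 2), ("u1", 3),
    ("u2", 1), ("u2", 2),
    ("u3", 3),
}

sites_by_user = {}
for user_id, site_id in overlap_intermediate:
    sites_by_user.setdefault(user_id, set()).add(site_id)

# Count users shared by each site pair. Emitting (larger, smaller) mirrors the
# join condition oi.site_id > oi2.site_id, so each pair is counted only once.
co_occurrences = Counter()
for sites in sites_by_user.values():
    for low, high in combinations(sorted(sites), 2):
        co_occurrences[(high, low)] += 1

for (site1, site2), users in sorted(co_occurrences.items()):
    print(site1, site2, users)
```

Ordering each pair is what keeps the output triangular, like the overlap matrix on the earlier slide.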

Page 49:

CREATE TABLE <...> (
    record_date date   ENCODE <...> NOT NULL,
    campaign_id bigint ENCODE <...> NOT NULL,
    site_id     bigint ENCODE <...> NOT NULL,
    user_id     bigint ENCODE <...> NOT NULL DISTKEY
) SORTKEY (<...>, <...>);

Page 50:

WITH
site_overlap_intermediate AS (
    SELECT user_id, site_id, campaign_id
    FROM overlap_intermediate
    WHERE record_date BETWEEN <...>
    GROUP BY 1,2,3
),
site_co_occurences AS (
    SELECT oi.campaign_id AS c_id, oi.site_id AS site1, oi2.site_id AS site2
    FROM site_overlap_intermediate oi
    JOIN site_overlap_intermediate oi2 ON
        oi.site_id > oi2.site_id AND
        oi.user_id = oi2.user_id AND
        oi.campaign_id = oi2.campaign_id
)
SELECT c_id, site1, site2, SUM(1)
FROM site_co_occurences
GROUP BY 1,2,3;

Page 56:

8 fact tables

26 dimension tables

7 mapping tables

Page 57:

42 views

121 joins

1100 sloc

Page 58:

$ pg_dump -Fc some_file --table=foo --table=bar
$ pg_restore --schema-only --clean -Fc some_file > schema.sql
$ pg_restore --data-only --table=foo -Fc some_file > foo.tsv

$ aws s3 cp schema.sql s3://metadata-bucket/YYYYMMDD/schema.sql
$ aws s3 cp foo.tsv s3://metadata-bucket/YYYYMMDD/foo.tsv

> \i schema.sql
> COPY foo FROM 's3://metadata-bucket/YYYYMMDD/foo.tsv' <...>

# or combine 'COPY <..> FROM <...> SSH' and pg_restore/psql

Page 59:

UNLOAD
('
    SELECT i.*
    FROM impressions i
    JOIN client_to_campaign_mapping m ON
        m.campaign_id = i.campaign_id
    WHERE i.record_date >= \'{{yyyy}}-{{mm}}-{{dd}}\' - interval \'1 day\' AND
          i.record_date <  \'{{yyyy}}-{{mm}}-{{dd}}\' AND
          m.client_id = <...>
')
TO 's3://{{bucket}}/us_eastern/{{yyyy}}/{{mm}}/{{dd}}/dsdk_events/{{vers}}/impressions/'
WITH CREDENTIALS 'aws_access_key_id={{key}};aws_secret_access_key={{secret}}'
DELIMITER ',' NULL '\\N' ADDQUOTES ESCAPE GZIP MANIFEST;

Page 64:

Workload                                    Node Count  Node Type    Restore  Maint.  Exec.
Frequency & Attribution & Overlap & Ad Hoc  16          dw2.8xlarge  2h       1h      6h

= $691.20

Page 65:

Workload     Node Count  Node Type    Restore  Maint.  Exec.
Frequency    8           dw2.8xlarge  1.5h     0.5h    2.5h
Attribution  8           dw2.8xlarge  1.5h     0.5h    2h
Overlap      8           dw2.8xlarge  1h       0.5h    2.5h
Ad-hoc       8           dw2.8xlarge  0h       0.5h    1.5h

= $556.80 (-19%)
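
The two totals can be cross-checked from the node counts and hours on the slides. The sketch below assumes each cluster is billed per node-hour for restore, maintenance, and execution time combined. The hourly rate is not stated on the slides, so it is inferred here from the first total.

```python
# Hours per cluster: restore + maintenance + execution, as listed on the slides.
single_cluster = {"nodes": 16, "hours": 2 + 1 + 6}
parallel_clusters = [
    ("Frequency",   8, 1.5 + 0.5 + 2.5),
    ("Attribution", 8, 1.5 + 0.5 + 2.0),
    ("Overlap",     8, 1.0 + 0.5 + 2.5),
    ("Ad-hoc",      8, 0.0 + 0.5 + 1.5),
]

# Assumption: the per-node hourly rate is inferred from the $691.20 total,
# not taken from the slides or from a price list.
single_node_hours = single_cluster["nodes"] * single_cluster["hours"]
rate = 691.20 / single_node_hours

parallel_node_hours = sum(nodes * hours for _, nodes, hours in parallel_clusters)
parallel_cost = parallel_node_hours * rate
savings = 1 - parallel_cost / 691.20

print(f"single cluster: {single_node_hours} node-hours at ${rate:.2f} per node-hour")
print(f"parallel fleet: {parallel_node_hours} node-hours = ${parallel_cost:.2f} ({-savings:.0%})")
```

Under that assumption the four smaller clusters use 116 node-hours against 144 for the single cluster, which reproduces the $556.80 figure and the 19% saving.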

Page 67:

http://bit.ly/awsevals

