Validating big data at scale

Date post: 22-Jun-2015
Upload: amplitude-mobile-analytics
Description:
When you're collecting data from hundreds of millions of devices simultaneously, things get noisy. We go over key problems and solutions for collecting and validating data at scale.
Validating Data at Scale Spenser Skates CEO at Amplitude
Transcript
Page 1: Validating big data at scale

Validating Data at Scale Spenser Skates

CEO at Amplitude

Page 2: Validating big data at scale

Doing things at scale is noisy

• Code is supposed to run the same way every time. But if you run the same loop a million times on a million different machines, how confident are you that it will always behave the same?

Page 3: Validating big data at scale

Data from phones is noisier

• Running on tens of thousands of different platforms with hundreds of thousands of different software configurations on hundreds of millions of phones

• Platforms have the craziest settings

Page 4: Validating big data at scale

How data can get messed up

• HTTP requests get mangled in transit

• Phone might not get the acknowledgement from the server

• People’s clocks are off

• People are running weird versions of Android

• Memory/disk corruption

• Gamma ray events

Page 5: Validating big data at scale

You can’t trust data from the client

Page 6: Validating big data at scale

Problem: Data gets mangled in transit

• Parameters from POST requests get dropped

• Within a parameter, a chunk of the data may never reach the server

Page 7: Validating big data at scale

Solution: Checksumming

• Send a checksum that’s a function of all the fields

• If the checksum is wrong or missing, you know that you haven’t received all the data. Tell the phone the upload wasn’t successful

• The phone will attempt to reupload the data
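The checksum round trip above could be sketched roughly as follows. This is a minimal illustration, not Amplitude's implementation: the payload layout (`fields`, `checksum`), the function names, and the choice of MD5 over a sorted-key JSON serialization are all assumptions made for the example.

```python
import hashlib
import json


def compute_checksum(fields: dict) -> str:
    """Hash a canonical serialization of every field in the upload."""
    # Sorting keys makes the serialization independent of dict ordering,
    # so client and server hash exactly the same bytes.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    # MD5 is an assumption here; it detects accidental damage, not tampering.
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def build_upload(fields: dict) -> dict:
    """Client side (hypothetical): attach the checksum to the payload."""
    return {"fields": fields, "checksum": compute_checksum(fields)}


def accept_upload(payload: dict) -> bool:
    """Server side (hypothetical): True only if the checksum is present and matches."""
    fields = payload.get("fields")
    checksum = payload.get("checksum")
    if not isinstance(fields, dict) or checksum is None:
        return False  # a parameter was dropped in transit
    return compute_checksum(fields) == checksum
```

When `accept_upload` returns False, the server would report a failed upload so that the phone keeps the data and retries later.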

Page 8: Validating big data at scale

Problem: Client sends the same data twice

• How does the phone know that the server has received the data, so that it doesn’t upload the same piece of data twice? It gets an acknowledgement back

• How does the server know that the phone has received the acknowledgement? It doesn’t!

• Equivalent to the two generals problem

• For 5% of the requests the server receives successfully, the acknowledgement never reaches the phone

• That means all counts are inflated by about 5%!
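To see where the roughly 5% figure comes from, here is a small back-of-the-envelope sketch. It assumes (this is not stated in the slides) that the phone retries until it sees an acknowledgement and that each acknowledgement is lost independently with the same probability; the function name is hypothetical.

```python
def expected_copies(ack_loss_rate: float) -> float:
    """Expected number of times the server stores one event without deduplication.

    Assumes the phone retries until acknowledged and that each
    acknowledgement is lost independently with probability ack_loss_rate.
    The number of stored copies is then geometric with mean 1 / (1 - p).
    """
    if not 0.0 <= ack_loss_rate < 1.0:
        raise ValueError("ack_loss_rate must be in [0, 1)")
    return 1.0 / (1.0 - ack_loss_rate)


# With a 5% loss rate each event is stored about 1.05 times on average.
inflation = expected_copies(0.05) - 1.0
```

Under these assumptions the inflation is slightly above 5%, because a retried upload can itself lose its acknowledgement.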

Page 9: Validating big data at scale

Solution: Deduplication

• Your system must be idempotent at the event level: it must be able to receive an event it has already received without changing its state

• Create a unique key for every event that is sent

• When you see an event, check your list of keys. If the key is already present, discard the event
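A minimal sketch of event-level idempotency, assuming each event carries a client-generated unique key (the `event_key` field and the `EventStore` class are hypothetical names). A production system would keep the keys in a persistent store rather than an in-memory set.

```python
class EventStore:
    """Toy ingest path that ignores events it has already seen."""

    def __init__(self):
        self._seen_keys = set()  # keys of every event stored so far
        self.events = []         # the deduplicated event log

    def ingest(self, event: dict) -> bool:
        """Store the event unless its key was seen before.

        Returns True if the event was stored, False if it was a duplicate.
        Either way the upload can be acknowledged as successful.
        """
        key = event["event_key"]
        if key in self._seen_keys:
            return False  # duplicate: state is left unchanged
        self._seen_keys.add(key)
        self.events.append(event)
        return True


store = EventStore()
store.ingest({"event_key": "device42-0001", "type": "app_open"})
# The phone never saw the acknowledgement, so it uploads the event again.
store.ingest({"event_key": "device42-0001", "type": "app_open"})
```

After both calls `store.events` holds a single event, so the retried upload does not inflate any counts.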

Page 10: Validating big data at scale

Problem: Clocks are off

• Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time each event occurred

• But people’s clocks are often off, occasionally by years!

• We can’t simply timestamp events with the upload time: 5% of data is uploaded more than 24 hours after the event happened

Page 11: Validating big data at scale

Solution: Get an estimate of the actual time an event was logged

• Timestamp the upload from the phone

• For each event, let’s compare:

    • The difference between the phone event timestamp and the server upload time

    • The difference between the phone upload timestamp and the server upload time

Page 12: Validating big data at scale
Page 13: Validating big data at scale
Page 14: Validating big data at scale

Solution: Get an estimate of the actual time an event was logged

• For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
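The correction can be written as a one-line formula. The sketch below uses hypothetical parameter names and assumes that all timestamps are in seconds, that the phone's clock error stays constant between the event and the upload, and that network latency is negligible.

```python
def estimate_event_time(event_ts_phone: float,
                        upload_ts_phone: float,
                        upload_ts_server: float) -> float:
    """Estimate when an event really happened, in server time.

    clock_offset is how far ahead of the server the phone's clock is
    (negative if the phone is behind), measured at upload time.
    """
    clock_offset = upload_ts_phone - upload_ts_server
    return event_ts_phone - clock_offset


# Example: the phone's clock runs one hour (3600 s) fast.
# True event time is 1000, true upload time is 2000.
estimated = estimate_event_time(
    event_ts_phone=4600,    # 1000 + 3600
    upload_ts_phone=5600,   # 2000 + 3600
    upload_ts_server=2000,
)
```

Here `estimated` comes out as 1000, the true event time, even though every timestamp reported by the phone was an hour off.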

Page 15: Validating big data at scale

Other Problems

• People are running weird versions of Android

    • MD5 library

• Memory/disk corruption

• Gamma ray events

Page 16: Validating big data at scale

Clean Data

Page 17: Validating big data at scale

Questions?

Always happy to talk about analytics problems!

[email protected]

blog.amplitude.com

twitter: @amplitudemobile

MOBILE ANALYTICS FOR DECISION MAKERS

