Validating big data at scale

Post on 22-Jun-2015

description

When you're collecting data from hundreds of millions of devices simultaneously, things get noisy. We go over key problems and solutions for collecting and validating data at scale.

transcript

Validating Data at Scale

Spenser Skates

CEO at Amplitude

Doing things at scale is noisy

• Code is supposed to run the same way every time. But if you run the same loop a million times on a million different machines, how confident are you that it will always behave the same?

Data from phones is noisier

• Running on tens of thousands of different platforms, with hundreds of thousands of different software configurations, on hundreds of millions of phones

• Platforms have the craziest settings

How data can get messed up

• HTTP requests get mangled in transit

• Phone might not get the acknowledgement from the server

• People’s clocks are off

• People are running weird versions of Android

• Memory/disk corruption

• Gamma ray events

You can’t trust data from the client

Problem: Data gets mangled in transit

• Parameters from POST requests get dropped

• Within a parameter, a chunk of the data may never reach the server

Solution: Checksumming

• Send a checksum that’s a function of all the fields

• If the checksum is wrong or missing, you know you haven’t received all the data. Tell the phone the upload wasn’t successful

• The phone will attempt to re-upload the data
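A minimal sketch of the checksum round trip described above. The payload layout, the function names, and the choice of MD5 over a sorted-key JSON encoding are assumptions for illustration (the slides mention only "a checksum" and, later, an MD5 library), not Amplitude's actual wire format.

```python
import hashlib
import json


def compute_checksum(fields: dict) -> str:
    """Hash every field; sorted keys keep the byte encoding deterministic."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def build_upload(fields: dict) -> dict:
    """Client side: attach a checksum computed over all fields."""
    return {"fields": fields, "checksum": compute_checksum(fields)}


def server_accepts(upload: dict) -> bool:
    """Server side: reject the upload if the checksum is missing or wrong."""
    fields = upload.get("fields")
    checksum = upload.get("checksum")
    if not isinstance(fields, dict) or checksum is None:
        return False
    return compute_checksum(fields) == checksum


upload = build_upload({"event": "app_open", "device_id": "abc123"})
print(server_accepts(upload))  # True: nothing was lost in transit

# Simulate a dropped parameter: the checksum no longer matches.
mangled = {"fields": {"event": "app_open"}, "checksum": upload["checksum"]}
print(server_accepts(mangled))  # False: server asks the phone to retry
```

When `server_accepts` returns False, the server would answer with a failure status so the phone keeps the data and tries again later.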

Problem: Client sends the same data twice

• How does the phone know the server has received the data, so that it doesn’t upload the same data twice? It gets an acknowledgement back

• How does the server know that the phone has received the acknowledgement? It doesn’t!

• Equivalent to the two generals problem

• For about 5% of the requests the server receives successfully, the acknowledgement never makes it back to the phone

• That means all counts are inflated by about 5%!
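The arithmetic behind the "about 5%" figure can be checked with a short calculation. It assumes a simplified model that is not stated in the slides: every upload reaches the server, each acknowledgement is lost independently with the same probability, and the phone retries until it gets one.

```python
def expected_uploads_per_event(ack_loss_rate: float) -> float:
    """Expected copies of one event the server receives under the model above.

    The number of attempts until the first delivered acknowledgement is
    geometric, so its mean is 1 / (1 - ack_loss_rate).
    """
    if not 0.0 <= ack_loss_rate < 1.0:
        raise ValueError("ack_loss_rate must be in [0, 1)")
    return 1.0 / (1.0 - ack_loss_rate)


inflation = expected_uploads_per_event(0.05) - 1.0
print(f"{inflation:.1%}")  # prints 5.3%
```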

Solution: Deduplication

• Your system must be idempotent at the event level: it must be able to receive an event it has already received without changing its state

• Create a unique key for every event that is sent

• When you see an event, check your list of keys. If the key is already present, discard the event
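A minimal sketch of event-level deduplication as described above. The `EventStore` class and the `event_id` field are hypothetical names; the sketch assumes the client generates the key (for example a UUID) once per event and resends the same key on every retry. A production system would keep the keys in a shared, persistent store rather than an in-memory set.

```python
import uuid


class EventStore:
    """Ingests events idempotently: a repeated key never changes state."""

    def __init__(self) -> None:
        self._seen_keys = set()
        self.events = []

    def ingest(self, event: dict) -> bool:
        """Store the event and return True, or return False for a duplicate."""
        key = event["event_id"]
        if key in self._seen_keys:
            return False  # already counted; discard
        self._seen_keys.add(key)
        self.events.append(event)
        return True


store = EventStore()
event = {"event_id": str(uuid.uuid4()), "type": "app_open"}

print(store.ingest(event))  # True: first time this key is seen
print(store.ingest(event))  # False: retry after a lost acknowledgement
print(len(store.events))    # 1
```

Either way the server can acknowledge the upload, so the phone stops retrying and the count stays correct.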

Problem: Clocks are off

• Phones are often offline, so an analytics SDK needs to cache data locally before uploading, including the time each event occurred

• But people’s clocks are often off, occasionally by years!

• We can’t simply use the upload time as the timestamp: 5% of data is uploaded more than 24 hours after the event happened

Solution: Get an estimate of the actual time an event was logged

• Timestamp the upload from the phone

• For each event, let’s compare:

    • The difference between the phone event timestamp and the server upload time

    • The difference between the phone upload timestamp and the server upload time

• For each event timestamp, subtract the difference between the phone’s upload time and the server’s upload time
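The correction above can be sketched in a few lines. The function and parameter names are hypothetical, timestamps are in seconds, and the sketch assumes that network latency is negligible and that the phone's clock error stays constant between the event and the upload.

```python
def estimate_event_time(
    event_ts_phone: float,
    upload_ts_phone: float,
    upload_ts_server: float,
) -> float:
    """Shift a phone-reported event time onto the server's clock."""
    # How far the phone's clock is ahead of the server's (negative if behind).
    clock_offset = upload_ts_phone - upload_ts_server
    return event_ts_phone - clock_offset


# Phone clock runs 3600 s fast. The event truly happened at t=1000 and was
# uploaded at t=5000, so the phone reports 4600 and 8600 respectively.
print(estimate_event_time(4600, 8600, 5000))  # 1000
```

Because only the difference between the two phone timestamps is used, the estimate comes out the same even if the phone's clock is off by years.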

Other Problems

• People are running weird versions of Android

    • MD5 library

• Memory/disk corruption

• Gamma ray events

Clean Data

Questions?

Always happy to talk about analytics problems!

spenser@amplitude.com

blog.amplitude.com

twitter: @amplitudemobile

MOBILE ANALYTICS FOR DECISION MAKERS