Date posted: 17-Jan-2017
Category: Technology
Uploaded by: hakka-labs
Data Asserts: Defensive Data Science
Tommy Guy
Microsoft
Observation: Complexity In Pipeline
Our pipeline:
DATA!!!
Insight! Direction! Strategy!
Our pipeline in reality: bugs tend to compound
DATA!!!
How do Engineers Manage Complexity?
Encapsulate: create functions/classes/subsystems with clear APIs. This helps isolate complexity.
Integration Tests: ensure that the components interact correctly. This helps identify breaking changes.
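A hypothetical sketch of both ideas, in Python; the function names (`clean_events`, `count_active_users`) and record shapes are illustrative, not from the talk:

```python
# Encapsulation: each pipeline step sits behind a small, clear API.

def clean_events(rows):
    """Drop rows with a missing user id; all cleaning logic lives in one place."""
    return [r for r in rows if r.get("user_id")]

def count_active_users(rows):
    """Consumes clean_events output; the API boundary hides cleaning details."""
    return len({r["user_id"] for r in rows})

# Integration test: verify the two components interact correctly.
raw = [{"user_id": "u1"}, {"user_id": None}, {"user_id": "u1"}, {"user_id": "u2"}]
assert count_active_users(clean_events(raw)) == 2
```

Because each step has a narrow contract, a breaking change in one component shows up in the integration test rather than compounding silently downstream.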
Data introduces a few complications
Pipelines take many upstream dependencies
Researchers' use cases are frequently unknown and unanticipated by data providers.
Pushing requirements upstream to all producers is Sisyphean.
We are not talking about data pipeline tests
The data pipeline teams ask:
• Are all rows that are produced actually stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
• Are availability SLAs being met?
• Progressive server-client merging
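The counter-field and sentinel-event checks above can be sketched as follows; the shapes and names here are illustrative assumptions, not the production implementation:

```python
# Counter fields: the producer emits a row count alongside its data, and the
# consumer verifies that no rows were dropped in transit.
def rows_complete(produced_counter, stored_rows):
    return produced_counter == len(stored_rows)

# Sentinel events: inject known marker rows and measure what fraction of them
# survive a join, as a proxy for join fidelity on real rows.
def join_fidelity(sentinel_ids, joined_ids):
    if not sentinel_ids:
        return 1.0
    return len(sentinel_ids & joined_ids) / len(sentinel_ids)

assert rows_complete(3, ["r1", "r2", "r3"])
assert join_fidelity({"s1", "s2"}, {"s1", "s2", "x"}) == 1.0
```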
Data Scientists Require Semantic Correctness
Does this field mean what I think it does?
How do Data Scientists identify potential errors?
Some follow-on fact is absurd…
… which leads to investigation …
… which finds a broader problem
If [potential conclusion], then we must have 3 billion OneDrive users…
… because my user table doesn’t have a primary key …
… so I should aggregate by user.
What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:
Field — Assumptions
User Id — Logged and PII-encrypted similarly in Outlook and OneDrive; timestamp for Office purchase logged correctly; User Id isn't empty or missing.
OneDrive activity — Wasn't automated traffic [identified by a certain flag].
Email activity — Mobile client identifiers are correct.
All — Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners.
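The silent assumptions above can be made explicit as runnable checks. A minimal sketch, assuming a hypothetical record shape (the field names `user_id`, `product`, `is_automated`, `purchase`, `timestamp` are illustrative):

```python
# Turn silent assumptions into explicit, checkable ones before analysis begins.
def check_assumptions(rows):
    failures = []
    for i, r in enumerate(rows):
        # Assumption: User Id isn't empty or missing.
        if not r.get("user_id"):
            failures.append((i, "user_id empty or missing"))
        # Assumption: OneDrive activity isn't automated traffic.
        if r.get("product") == "OneDrive" and r.get("is_automated"):
            failures.append((i, "automated OneDrive traffic"))
        # Assumption: Office purchases carry a correctly logged timestamp.
        if r.get("purchase") and r.get("timestamp") is None:
            failures.append((i, "purchase without timestamp"))
    return failures

rows = [
    {"user_id": "u1", "product": "OneDrive", "is_automated": False},
    {"user_id": "", "product": "Outlook"},
]
assert check_assumptions(rows) == [(1, "user_id empty or missing")]
```

Running the checks up front means an absurd follow-on conclusion is caught as a failing assertion rather than discovered after the analysis ships.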
What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties:
• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
Assumption — Why does it matter?
Never null/empty — Causes job-breaking data skew issues.
Users are 1:* with Tenants — Logical constraint: a violation is a sign you are missing something.
Very high cardinality — If this isn't true, it's unlikely to be a user id.
All rows in event data join to it — Otherwise, your data is incomplete.
Matches a certain regex — Sanity check: if this isn't true, it's unlikely to be a user id.
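Several of these properties can be re-checked continuously instead of only at pipeline setup. A hedged sketch; the cardinality threshold and the id regex below are illustrative assumptions, not the real OfficeId format:

```python
import re

# Hypothetical id pattern; the real format would come from the data producer.
USER_ID_RE = re.compile(r"^[0-9a-f]{8}$")

def sanity_check_user_id(values, min_cardinality_ratio=0.5):
    """Re-run the sanity checks from the table on a column of candidate user ids."""
    return {
        # Never null/empty: avoids job-breaking data skew.
        "never_null": all(v for v in values),
        # Very high cardinality: a low distinct ratio suggests this isn't a user id.
        "high_cardinality": len(set(values)) / max(len(values), 1) >= min_cardinality_ratio,
        # Matches a certain regex: cheap structural sanity check.
        "matches_regex": all(USER_ID_RE.match(v) for v in values if v),
    }

ids = ["deadbeef", "cafef00d", "deadbeef"]
result = sanity_check_user_id(ids)
assert result["never_null"] and result["matches_regex"]
```

Scheduling checks like these on every pipeline run addresses the observation above: the assumptions stay verified long after setup.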
Data Asserts: Defensive Data Science
Data Asserts: Maintain Quality
Data Asserts: Clear Trust Boundaries
These should match!
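One way to read "these should match": an assert at the trust boundary comparing a metric the producer reports against what the consumer actually received. A minimal sketch; the metric (distinct user count) and record shape are illustrative assumptions:

```python
# Assert at the trust boundary: the upstream-reported count must equal what
# we can recompute from the rows we actually received.
def assert_boundary_match(upstream_count, downstream_rows):
    downstream_count = len({r["user_id"] for r in downstream_rows})
    assert upstream_count == downstream_count, (
        f"trust boundary violated: upstream {upstream_count} "
        f"!= downstream {downstream_count}"
    )
    return downstream_count

rows = [{"user_id": "u1"}, {"user_id": "u2"}]
assert assert_boundary_match(2, rows) == 2
```

A failure here immediately localizes the problem to the boundary rather than letting it surface as an absurd conclusion downstream.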
Data Asserts: Defensive Data Science
Data Asserts in Production: A Few Observations
• Most of the analysis-impacting assertion failures we've seen were actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we've had to modularize our code in order to produce testable chunks that get re-used in pipelines.
• Data Asserts is the backbone of data provenance: a data conclusion can directly link to all of the assumptions we made about the input.