
DataEngConf SF16 - Data Asserts: Defensive Data Science

Date post: 17-Jan-2017
Upload: hakka-labs
Transcript
Page 1: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts: Defensive Data Science

Tommy Guy

Microsoft

Page 2: DataEngConf SF16 - Data Asserts: Defensive Data Science

Observation: Complexity In Pipeline

Page 3: DataEngConf SF16 - Data Asserts: Defensive Data Science

Our pipeline:

DATA!!!

Insight! Direction! Strategy!

Page 4: DataEngConf SF16 - Data Asserts: Defensive Data Science

Our pipeline in reality: bugs tend to compound

DATA!!!

Page 5: DataEngConf SF16 - Data Asserts: Defensive Data Science

How do Engineers Manage Complexity?

Encapsulate: create functions/classes/subsystems with clear APIs. This helps isolate complexity.

Integration Tests: ensure that the components interact correctly. This helps identify breaking changes.

Page 6: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data introduces a few complications

Pipelines take many upstream dependencies

Researcher use cases are frequently unknown and unanticipated by data providers.

Pushing requirements upstream to all producers is Sisyphean.

Page 7: DataEngConf SF16 - Data Asserts: Defensive Data Science

We are not talking about data pipeline tests

The data pipeline teams:

Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity

Are availability SLAs being met?
• Progressive server-client merging
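For contrast with the semantic checks that follow, a pipeline-level "counter field" check might look roughly like the sketch below. It assumes each client numbers its events 1, 2, 3, ... and that rows arrive as (client_id, counter) pairs; both the convention and the name find_dropped_counters are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def find_dropped_counters(rows):
    """Return {client_id: sorted missing counter values} for clients with gaps.

    Assumes (hypothetically) that each row is a (client_id, counter) pair and
    that every client numbers its events consecutively starting at 1.
    """
    seen = defaultdict(set)
    for client_id, counter in rows:
        seen[client_id].add(counter)

    gaps = {}
    for client_id, counters in seen.items():
        expected = set(range(1, max(counters) + 1))
        missing = sorted(expected - counters)
        if missing:
            gaps[client_id] = missing
    return gaps


# Illustrative data: client "a" never had counter 3 stored.
stored_rows = [("a", 1), ("a", 2), ("a", 4), ("b", 1), ("b", 2)]
dropped = find_dropped_counters(stored_rows)
```

A check like this says whether rows were lost in transit. It says nothing about whether the fields in those rows mean what the analyst thinks they mean.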

Page 8: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Scientists Require Semantic Correctness

Does this field mean what I think it does?

Page 9: DataEngConf SF16 - Data Asserts: Defensive Data Science

How do Data Scientists identify potential errors?

Page 10: DataEngConf SF16 - Data Asserts: Defensive Data Science

How do Data Scientists identify potential errors?

Some follow-on fact is absurd…

… which leads to investigation …

… which finds a broader problem

If [potential conclusion], then we must have 3 billion OneDrive users…

… because my user table doesn’t have a primary key …

… so I should aggregate by user.
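The OneDrive example above could have been caught before the absurd user count appeared. A minimal sketch, assuming the user table is a list of dicts with hypothetical user_id and files_uploaded fields: first test whether user_id really behaves like a primary key, then aggregate by user if it does not.

```python
from collections import Counter


def duplicated_keys(rows, key):
    """Return {key value: row count} for every key that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}


def aggregate_by_user(rows, key="user_id", value="files_uploaded"):
    """Collapse the table to one row per user by summing the value column."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


# Illustrative data only: "u1" appears twice, so user_id is not a primary key.
user_rows = [
    {"user_id": "u1", "files_uploaded": 3},
    {"user_id": "u1", "files_uploaded": 2},
    {"user_id": "u2", "files_uploaded": 7},
]

dupes = duplicated_keys(user_rows, "user_id")
totals = aggregate_by_user(user_rows)
```

Counting rows here would report three users, while counting keys after aggregation reports two. That is the same kind of inflation as the 3 billion figure, at toy scale.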

Page 11: DataEngConf SF16 - Data Asserts: Defensive Data Science

What are your Assumptions?

If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:

Field | Assumptions

User Id | Logged and PII-encrypted similarly in Outlook and OneDrive; correctly logging timestamp for Office purchase; User Id isn’t empty or missing

OneDrive activity | Wasn’t automated traffic [identified by a certain flag].

Email Activity | Mobile client identifiers are correct.

All | Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners.
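Some of the assumptions listed above can be written down as executable checks rather than left silent. The sketch below is one possible shape, not the speaker's implementation. The field names user_id and is_automated, the sample rows, and the use of id overlap as a rough proxy for "encrypted similarly" are all assumptions made for illustration.

```python
def assert_user_ids_present(rows, source):
    """Fail loudly if any row has an empty or missing user_id."""
    missing = [r for r in rows if not r.get("user_id")]
    assert not missing, f"{source}: {len(missing)} rows have an empty or missing user_id"


def drop_automated(rows):
    """Keep only rows not flagged as automated traffic (hypothetical flag name)."""
    return [r for r in rows if not r.get("is_automated", False)]


def id_overlap(left_rows, right_rows):
    """Fraction of distinct user ids on the left that also appear on the right.

    A very low value suggests the two sources encode user ids differently.
    """
    left = {r["user_id"] for r in left_rows}
    right = {r["user_id"] for r in right_rows}
    return len(left & right) / len(left) if left else 0.0


# Illustrative rows only.
onedrive = [
    {"user_id": "u1", "is_automated": False},
    {"user_id": "u2", "is_automated": True},
    {"user_id": "u3", "is_automated": False},
]
outlook = [{"user_id": "u1"}, {"user_id": "u3"}, {"user_id": "u4"}]

assert_user_ids_present(onedrive, "OneDrive")
assert_user_ids_present(outlook, "Outlook")
human_onedrive = drop_automated(onedrive)
overlap = id_overlap(human_onedrive, outlook)
```

What counts as an acceptable overlap depends on the products being joined, so any threshold is a judgment call that should be recorded next to the check.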

Page 12: DataEngConf SF16 - Data Asserts: Defensive Data Science

What are your Sanity Checks?

• If a column “OfficeId” is really a user id, it has certain known properties:

Assumption | Why does it matter?
Never null/empty | Causes job-breaking data skew issues
Users are 1:* with Tenants | Logical constraint: sign you are missing something.
Very high cardinality | If this isn’t true, it’s unlikely that it’s a user-id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex | Sanity check: if this isn’t true, it’s unlikely that it’s a user-id.

• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
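Because these properties are cheap to state, they can be re-checked on every run instead of only at setup. The sketch below covers the null, regex, cardinality, and join rows of the table; the tenant constraint is left out. The function name, the 8-hex-character pattern, and the 0.9 distinct-ratio threshold are hypothetical placeholders, since the slides do not give the real values.

```python
import re


def check_user_id_column(user_rows, event_rows, column="OfficeId",
                         pattern=r"[0-9a-f]{8}", min_distinct_ratio=0.9):
    """Return the names of the failed assumptions (empty list means all passed)."""
    failures = []
    values = [row.get(column) for row in user_rows]

    # Never null/empty.
    if any(v is None or v == "" for v in values):
        failures.append("never null/empty")

    # Matches a certain regex (placeholder pattern, applied to non-empty values).
    present = [v for v in values if v]
    regex = re.compile(pattern)
    if not all(regex.fullmatch(v) for v in present):
        failures.append("matches regex")

    # Very high cardinality: most non-empty values should be distinct.
    if not present or len(set(present)) / len(present) < min_distinct_ratio:
        failures.append("very high cardinality")

    # All rows in event data join to it.
    known = set(present)
    if any(row.get(column) not in known for row in event_rows):
        failures.append("all event rows join")

    return failures


# Illustrative data: the second event row has no matching user.
users = [{"OfficeId": "0a1b2c3d"}, {"OfficeId": "4e5f6a7b"}]
events = [{"OfficeId": "0a1b2c3d"}, {"OfficeId": "ffffffff"}]
failures = check_user_id_column(users, events)
```

Returning a list of failures rather than raising on the first one keeps the whole picture visible, which helps when the real fault is a wrong assumption rather than a broken pipeline.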

Page 13: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts: Defensive Data Science

Page 14: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts: Maintain Quality

Page 15: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts: Clear Trust Boundaries

Page 16: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts: Defensive Data Science

These should match!

Page 17: DataEngConf SF16 - Data Asserts: Defensive Data Science

Data Asserts in Production: A few Observations

• Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.

• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.

• Data Asserts is the backbone of data provenance. A data conclusion can link directly to all of the assumptions we made about its input.
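One way to picture the provenance point is a log that records every assertion run on the inputs and is attached to the conclusion built on them. This is only a sketch under assumed names (AssertLog, Conclusion, conclude); the talk does not describe the actual tooling.

```python
from dataclasses import dataclass, field


@dataclass
class AssertLog:
    """Records the name and outcome of each data assertion that was run."""
    records: list = field(default_factory=list)

    def check(self, name, passed):
        self.records.append((name, bool(passed)))
        return bool(passed)

    def all_passed(self):
        return all(passed for _, passed in self.records)


@dataclass
class Conclusion:
    """A statement together with the assumptions that were verified for it."""
    statement: str
    assumptions: list


def conclude(statement, log):
    """Return a Conclusion only if every logged assertion passed."""
    failed = [name for name, passed in log.records if not passed]
    if failed:
        raise AssertionError(f"assumptions failed: {failed}")
    return Conclusion(statement, [name for name, _ in log.records])


# Illustrative usage with placeholder outcomes.
log = AssertLog()
log.check("user_id never empty", True)
log.check("automated traffic excluded", True)
result = conclude("[potential conclusion]", log)
```

Anyone reading the result can then see which input assumptions were checked before the statement was made.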

