Hannah Aizenman - Get To Know Your Data

Post on 08-Jul-2015

description

A recent article in the New York Times estimates that data scientists spend between 50% and 80% of their time "collecting and preparing unruly digital data" before they ever get to the analysis. Data is often badly labeled, inconsistently sampled, incorrect in strange places, or missing outright, and this host of errors leads to the "garbage in, garbage out" problem. While detecting the myriad ways in which data is broken can be difficult, traditional visualization techniques, exploratory data analysis, and cluster analysis can help. This talk will discuss some typical methods for sanity checking small data sets: visualization, simple statistics, and some basic combinations of the two. It will then veer into machine learning techniques for exploring the underlying structure of larger data sets, both to verify the occurrence of known patterns and to detect outliers that could be due to errors rather than to the occurrence of something interesting.
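As a rough illustration of the "simple statistics" style of sanity check described above, the sketch below summarizes one column of numbers and flags suspicious values. It is not taken from the talk. The function names `summarize` and `flag_outliers`, the sample `readings`, and the assumption that missing observations are encoded as `None` or NaN are all hypothetical. The outlier rule is a modified z-score based on the median absolute deviation, a simple stand-in for the machine learning techniques the talk mentions, not a reproduction of them.

```python
import math
import statistics


def summarize(values):
    """Return (summary, present) for one column of numbers.

    Assumption: missing observations are encoded as None or NaN.
    """
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present:
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    return summary, present


def flag_outliers(present, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Uses the median absolute deviation (MAD), which is less distorted
    by the outliers themselves than the mean and standard deviation.
    The 0.6745 scale factor and the 3.5 cutoff are common conventions,
    not values taken from the talk.
    """
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    if mad == 0:
        # More than half the values are identical; the rule is undefined.
        return []
    return [v for v in present if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical temperature readings with two gaps and one likely typo.
readings = [21.5, 22.0, None, 21.8, 22.3, float("nan"), 21.9, 220.0, 22.1]
summary, present = summarize(readings)
outliers = flag_outliers(present)
print(summary)
print(outliers)
```

Here the mean is pulled far above the median by the single value 220.0, which is the kind of mismatch that a quick look at min, max, mean, and median is meant to surface. A flagged value still has to be checked by hand, since it may be an error or something genuinely interesting.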

transcript

Get To Know Your Data

Hannah Aizenman @story645

Unprocessed Data

Missing Observations

Misused Technique

Start?

Research

Explore Attributes

Take Snapshots

Plot

Label

Rearrange

Higher D Data: Plot 1 Dim

Plot Another Dim (or 2)

Fix that Plot

Histogram

Min, Max, Mean, Median

Too Much Data

Multivariate Relationships

Multivariate Relationships With Classes

Known Patterns

Expected Values

Look For Structure

Incorporate Outside Knowledge

Weave it All Together