ORIE 4741: Learning with Big Messy Data
Exploratory Data Analysis
Professor Udell
Operations Research and Information Engineering, Cornell
October 1, 2020
1 / 21
Announcements
I If you’re taking lecture async: remember to submit a participation post after each class!
I Sections start Wednesday. They are optional; attend any one you prefer. Section this week is a Julia tutorial.
I Office hours: Zoom links and times are posted on the course website.
I Gradescope is open for submission of hw0, due Thursday 9-9-19 9:30am.
I First quiz this week! It should occupy about 20 minutes; you’ll have up to half an hour to complete it. Start it anytime between 6:15pm Thursday and 9pm Friday.
(All times ET)
2 / 21
Questions from Campuswire
I search for your question before posting a new question
I approximate date of voter registration is fine
3 / 21
Why julia?
I the two language problem
I julia is fast (JIT-compiled)
I julia has pleasant syntax, esp for linear algebra (MATLAB-like, but more principled)
I julia supports efficient parallelism (including multithreading)
I the julia ecosystem
for this class: you can use any language you’d like (which your TAs can read), but the course staff will only support Julia.
4 / 21
Topics to review
We will cover (most of) these in section, too:
I Linear algebra: invertible matrices, rank, norm, basic matrix identities. When is a matrix invertible?
I QR factorization
I Gradients (multivariate derivative)
I Projections
I SVD
I Maximum likelihood estimation
I Union bound
I Computational complexity
5 / 21
Why look at the data?
I detect errors in data
I check assumptions
I select appropriate models
I understand relationships among the features
I understand relationships between features and labels
6 / 21
How to look at the data?
I inspect raw data
I summary statistics
I visualize
7 / 21
American community survey
2013 ACS:
I 3M respondents, 87 economic/demographic survey questions
    I income
    I cost of utilities (water, gas, electric)
    I weeks worked per year
    I hours worked per week
    I home ownership
    I looking for work
    I use foodstamps
    I education level
    I state of residence
    I . . .
I 1/3 of responses missing
find it at https://people.orie.cornell.edu/mru8/orie4741/data/acs_2013.csv
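A figure like "1/3 of responses missing" can be checked by streaming the CSV and counting empty cells per column, without loading the whole file. A minimal Python sketch, assuming missing entries appear as empty strings or "NA" (the encoding actually used in acs_2013.csv is not stated here); the function name `missing_fraction` and the three-row sample are hypothetical.

```python
import csv
import io

# Assumption: these strings mark a missing entry; check the real file's encoding.
MISSING_TOKENS = {"", "NA"}

def missing_fraction(lines):
    """Stream CSV rows and return {column: fraction of missing entries}."""
    reader = csv.DictReader(lines)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    n_rows = 0
    for row in reader:  # one row in memory at a time
        n_rows += 1
        for name in counts:
            value = row[name]
            # Short rows yield None for the columns they lack.
            if value is None or value.strip() in MISSING_TOKENS:
                counts[name] += 1
    return {name: (c / n_rows if n_rows else 0.0) for name, c in counts.items()}

# Hypothetical three-row sample standing in for the real survey file.
sample = io.StringIO(
    "HHINCOME,COSTELEC,EDUC\n"
    "52000,120,10\n"
    ",95,NA\n"
    "31000,,6\n"
)
fractions = missing_fraction(sample)
print(fractions)
```

For a file on disk, pass an open file handle instead of the sample, e.g. `open(path, newline="")`.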
8 / 21
How do computers work?
on a laptop:
I hard disk: usually ≤ 500 GB
I memory (RAM): usually ≤ 16 GB
I many programs (e.g., Excel): substantially more limited
don’t load a giant file into memory. your computer will crash.
how big is ACS data? 3M respondents × ~100 questions = 300M numbers ≈ 300 MB (at roughly 1 byte per number)
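The estimate above amounts to about one byte per number. A small Python sketch of the same arithmetic, plus the footprint if every value were parsed into an 8-byte float (an assumption about how a program might hold the data in memory, not a claim about any particular tool); the function name is hypothetical.

```python
def estimated_size_gb(n_rows, n_cols, bytes_per_value):
    """Rough size in GB (10**9 bytes) of an n_rows x n_cols table."""
    return n_rows * n_cols * bytes_per_value / 1e9

n_rows, n_cols = 3_000_000, 100  # figures from the slide

# About 1 byte per value reproduces the slide's estimate.
compact_gb = estimated_size_gb(n_rows, n_cols, 1)
# Assumption: every value is stored as an 8-byte float once loaded.
float64_gb = estimated_size_gb(n_rows, n_cols, 8)

print(f"1 byte per value:  {compact_gb:.1f} GB")
print(f"8 bytes per value: {float64_gb:.1f} GB")
```

Under the 8-byte assumption the table takes about 2.4 GB, a noticeable share of a 16 GB laptop's memory.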
9 / 21
Inspect raw data
solution for large files: technology from the 70s!
bash shell:
I “how big are these files?”: ls -lh
I “show me some lines from the file”: head, tail, less
I “how many lines are in the file?”: wc -l
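The same three questions can be answered from a script without reading the whole file into memory. A minimal Python sketch of rough equivalents of `ls -l`, `head`, and `wc -l`; the helper names and the small temporary file are hypothetical, created only so the example runs anywhere.

```python
import os
import tempfile
from itertools import islice

def file_size_bytes(path):
    """Size on disk, like `ls -l`; the file contents are not read."""
    return os.path.getsize(path)

def head(path, n=5):
    """First n lines, like `head -n`; reading stops after n lines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

def count_lines(path):
    """Line count, like `wc -l` for a file ending in a newline.

    Iterates over the file one line at a time instead of storing it.
    """
    with open(path, "rb") as f:
        return sum(1 for _ in f)

# Hypothetical demo file: a header plus 1000 rows.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("id,income\n")
    for i in range(1000):
        tmp.write(f"{i},{1000 + i}\n")
    demo_path = tmp.name

size = file_size_bytes(demo_path)
first_lines = head(demo_path, 3)
n_lines = count_lines(demo_path)
os.remove(demo_path)  # clean up the demo file

print(size, "bytes")
print(first_lines)
print(n_lines, "lines")
```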
10 / 21
American Community Survey
Variable    Description                      Type
HHTYPE      household type                   categorical
STATEICP    state                            categorical
OWNERSHP    own home                         Boolean
COMMUSE     commercial use                   Boolean
ACREHOUS    house on ≥ 10 acres              Boolean
HHINCOME    household income                 real
COSTELEC    monthly electricity bill         real
COSTWATR    monthly water bill               real
COSTGAS     monthly gas bill                 real
FOODSTMP    on food stamps                   Boolean
HCOVANY     have health insurance            Boolean
SCHOOL      currently in school              Boolean
EDUC        highest level of education       ordinal
GRADEATT    highest grade level attained     ordinal
EMPSTAT     employment status                categorical
LABFORCE    in labor force                   Boolean
CLASSWKR    class of worker                  Boolean
WKSWORK2    weeks worked per year            ordinal
UHRSWORK    usual hours worked per week      real
LOOKING     looking for work                 Boolean
MIGRATE1    migration status                 categorical
11 / 21
Julia and Jupyter
I Julia is a programming language: it parses human-readable code to machine-readable code, executes it, returns the answer
I Jupyter is a protocol for interacting with a programming language.
    I Jupyter stores inputs and outputs as .ipynb files.
    I Jupyter notebooks display inputs and outputs in a browser
    I JuliaBox is an interface to a webserver running Julia
how to access?
I recommended: install anaconda distribution, then julia (go to section or see section materials for details)
I not recommended: use Google Colab
https://discourse.julialang.org/t/julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/15319
12 / 21
Summary statistics
univariate
I mean, median, mode
I max, min, range
I variance
I . . .
explore via Julia + Jupyter notebook
https://github.com/ORIE4741/demos/blob/master/eda.ipynb
multi- (but usually just bi-)variate
I correlation, covariance
I . . .
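A minimal Python sketch of the statistics listed above, using only the standard library. The two feature vectors are hypothetical values, not ACS data, and the variance and covariance use the sample (n − 1) convention, which is an assumption.

```python
import math
import statistics

# Hypothetical values for two features (not taken from the ACS file).
hours = [40, 35, 40, 20, 60, 40, 45]    # usual hours worked per week
income = [52, 41, 58, 18, 90, 47, 63]   # income, thousands of dollars

# Univariate summaries of one feature.
summary = {
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "mode": statistics.mode(hours),
    "min": min(hours),
    "max": max(hours),
    "range": max(hours) - min(hours),
    "variance": statistics.variance(hours),  # sample variance (n - 1)
}

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

cov_hi = covariance(hours, income)
corr_hi = correlation(hours, income)

print(summary)
print(f"covariance = {cov_hi:.2f}, correlation = {corr_hi:.3f}")
```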
13 / 21
The perils of summary statistics
same mean, variance, correlation, line of best fit. . .
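The figure for this slide did not survive extraction. The classic example of this point is Anscombe's quartet, and the sketch below assumes that is what was shown. It computes the listed statistics for each of the four datasets in plain Python; the helper name `summarize` is hypothetical.

```python
import math

# Anscombe's quartet (Anscombe, 1973).
# Assumption: the slide's lost figure showed these four datasets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Means, sample variances, correlation, and least-squares line for (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }

results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in results.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four summaries agree to about two decimal places, yet scatter plots of the four datasets look nothing alike.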
14 / 21
The perils of summary statistics
15 / 21
The perils of summary statistics: modern update
https://www.autodeskresearch.com/publications/samestats
16 / 21
What to visualize?
I plot examples across all features (usually not)
I plot features across all examples (much more common)
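Plotting one feature across all examples usually means a histogram. A minimal text-only Python sketch, so it needs no plotting library; the values and the function name `histogram_counts` are hypothetical, and integer-valued data is assumed.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count integer values per bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical usual-hours-worked-per-week values, one per example.
hours = [40, 35, 40, 20, 60, 40, 45, 38, 50, 12, 40, 55]
bin_width = 10
hist = histogram_counts(hours, bin_width)

# One row per bin: the bar length is the number of examples in that bin.
for lower, count in hist.items():
    print(f"{lower:>3}-{lower + bin_width - 1:<3} {'#' * count}")
```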
17 / 21
Best practices
I Always label your axes.
I Ensure all marks on plot are meaningful.
I Beware of pie charts; bar charts are often easier to read.
I Beware of line plots; if your data is not continuous, try a scatter plot instead.
I Consider which curves to plot on same axes. Make comparisons easy!
18 / 21
Beware of bad data
19 / 21
Take away
I always look at (some of) your data
I decide what question you want to answer
20 / 21
Questions?
https://docs.google.com/spreadsheets/d/1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk
21 / 21