
02 Process

CS 109: Data Science

Process, Data, and Visual Attributes

Hanspeter Pfister, Joe Blitzstein
Page 2: 02 Process

This Week

• HW0 - due next Tuesday (not graded)

• Install Anaconda & IPython frameworks

• Sign up for Piazza and introduce yourself

• Fill out survey

• Friday lab 10-11:30 am in MD G115

• Intro to Python with Ian Stokes-Rees

• Make sure to have IPython installed and ready

• Readings - post comments on Piazza

Page 3: 02 Process

Outline

• Process & Process Books

• What makes visualizations effective?

• Data Sources & Data Cleanup

Page 4: 02 Process

Process

Page 5: 02 Process

Data Exploration

Not always sure what we are looking for (until we find it)

Page 6: 02 Process

Ask an interesting question.

Get the data.

Explore the data.

Model the data.

Communicate and visualize the results.

What is the scientific goal? What would you do if you had all the data? What do you want to predict or estimate?

How were the data sampled? Which data are relevant? Are there privacy issues?

Plot the data. Are there anomalies? Are there patterns?

Build a model. Fit the model. Validate the model.

What did we learn? Do the results make sense? Can we tell a story?

Page 7: 02 Process

What do analysts do?

Page 8: 02 Process

What do analysts do?

Page 9: 02 Process

What do analysts do?

I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis. Most of the time I’m lucky if I get to do any analysis. Most of the time once you transform the data you just do an average... the insights can be scarily obvious. It’s fun when you get to do something somewhat analytical.

Page 10: 02 Process

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John Tukey

Exploratory Data Analysis

Page 11: 02 Process

Anscombe’s Quartet

Anscombe ’73

Same mean, variance, correlation, and linear regression line

Page 12: 02 Process

Anscombe’s Quartet

Same mean, variance, correlation, and linear regression line

Anscombe ’73
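The claim on this slide can be checked numerically. The following is a minimal sketch using only the standard library; the values are the published quartet from Anscombe (1973), while the helper name `summarize` is made up for illustration.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, correlation, and least-squares line for (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

The four summaries agree to about two decimal places, which is why plotting the data is the only way to see how different the datasets are.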

Page 13: 02 Process

Example: Antibiotics

Will Burtin, 1951

Page 14: 02 Process

Effectiveness of Antibiotics

Page 15: 02 Process

Data & Questions

• What are the data types?

• What are possible questions?

Page 16: 02 Process

Data

• Genus & species of bacteria [string]

• Antibiotic name [string]

• Gram staining? [pos/neg]

• Minimum inhibitory concentration (mg/ml) [float] (lower == more effective)
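One way to hold these four fields in code is sketched below. The class name `Measurement`, the helper `most_effective`, and every row value are hypothetical placeholders, not entries from Burtin's table; only the field types and the "lower == more effective" rule come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    bacteria: str        # genus & species [string]
    antibiotic: str      # antibiotic name [string]
    gram_positive: bool  # Gram staining [pos/neg]
    mic: float           # minimum inhibitory concentration; lower == more effective


# Placeholder rows for illustration only (NOT values from Burtin's data).
rows = [
    Measurement("Bacterium A", "Drug X", True, 0.5),
    Measurement("Bacterium A", "Drug Y", True, 10.0),
    Measurement("Bacterium B", "Drug X", False, 100.0),
    Measurement("Bacterium B", "Drug Y", False, 2.0),
]


def most_effective(rows):
    """Map each bacterium to the antibiotic with the lowest MIC."""
    best = {}
    for r in rows:
        if r.bacteria not in best or r.mic < best[r.bacteria].mic:
            best[r.bacteria] = r
    return {name: r.antibiotic for name, r in best.items()}


print(most_effective(rows))
```

Writing the types down this way already suggests questions: the two string fields and the Gram stain are things to group by, and the MIC is the number to compare.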

Page 17: 02 Process

What Questions?

Page 18: 02 Process

M. Bostock, Protovis, after W. Burtin, 1951

How effective are the drugs?

P & N N

Page 19: 02 Process

Wainer & Lysen, “That’s funny...” American Scientist, 2009

Adapted from Brian Schmotzer

Not a streptococcus! (realized ~30 years later)

Really a streptococcus! (realized ~20 years later)

How do the bacteria compare?

Page 20: 02 Process

Wainer & Lysen, “That’s funny...” American Scientist, 2009

How do the bacteria compare?

Page 21: 02 Process

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John Tukey

Page 22: 02 Process

Process Books

Page 23: 02 Process

Process Books

[Varun Bansal, Cici Cao, Sofia Hou, CS171, 2013]

[Blake Walsh, Gabriel Trevino, Antony Bett, CS171, 2013]

Page 24: 02 Process

IPython Notebooks

http://nbviewer.ipython.org/

Page 25: 02 Process

IPython is Great (for large-scale computation, data exploration, and creating reproducible research artifacts)

Mike Roberts, Stanford University

[http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter1_Introduction/Chapter1_Introduction.ipynb]

Page 26: 02 Process

Python is great.

[http://xkcd.com/353/]

Page 27: 02 Process

E.g.: https://github.com/jrjohansson/scientific-python-lectures

Intro to Python Lab this Friday!!

10-11:30 am, MD G115

Page 28: 02 Process

You probably all know the default Python interpreter.

Don’t bother with

Page 29: 02 Process

IPython is a more powerful interactive Python interpreter.

write and execute Python code in snippets

[http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter1_Introduction/Chapter1_Introduction.ipynb]

Page 30: 02 Process

IPython is a more powerful interactive Python interpreter.

comments can include LaTeX math

[http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter1_Introduction/Chapter1_Introduction.ipynb]

Page 31: 02 Process

IPython is a more powerful interactive Python interpreter.

any plots generated by your code are displayed inline

[http://nbviewer.ipython.org/urls/raw.github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/master/Chapter1_Introduction/Chapter1_Introduction.ipynb]

Page 32: 02 Process

Write Python code interactively in a web browser instead of a terminal window.

Page 33: 02 Process

IPython has a decoupled client-server architecture.

[http://communities.intel.com/community/datastack/blog/2011/05/02/top-10-reasons-to-setup-a-client-server-network]

Page 34: 02 Process

server executes Python code

IPython has a decoupled client-server architecture.

[http://communities.intel.com/community/datastack/blog/2011/05/02/top-10-reasons-to-setup-a-client-server-network]

Page 35: 02 Process

clients see results of computation in real-time

IPython has a decoupled client-server architecture.

[http://communities.intel.com/community/datastack/blog/2011/05/02/top-10-reasons-to-setup-a-client-server-network]

Page 36: 02 Process

You can stay productive on any computer with an internet connection and a web browser.

[image source unknown]

Page 37: 02 Process

IPython integrates seamlessly with Matplotlib, making it well-suited for data exploration.

[http://matplotlib.org/gallery.html]

Page 38: 02 Process

Examples

[image source unknown]

Page 40: 02 Process

Good Practices

• IPython notebooks to document your process

• Visualizations for data exploration

• Comment your code!

• Modularity - breaking down code into small functional, composable pieces

• Array-oriented computing

• Using assert statements and tests

• Version control (svn, git, github)
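Two of the practices above, modularity and assert statements, can be sketched together. This is an illustrative example with made-up function names, not code from the course: each function does one small job, the functions compose, and assertions state what each one expects and guarantees.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    assert len(values) > 0, "mean() needs at least one value"
    return sum(values) / len(values)


def center(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def sum_of_squares(values):
    """Sum of squared values."""
    return sum(v * v for v in values)


def sample_variance(values):
    """Sample variance, composed from the small pieces above."""
    assert len(values) > 1, "sample_variance() needs at least two values"
    return sum_of_squares(center(values)) / (len(values) - 1)


# Lightweight tests that live next to the code and run on every execution.
assert mean([1, 2, 3]) == 2
assert center([1, 2, 3]) == [-1, 0, 1]
assert sample_variance([1, 2, 3]) == 1
```

Because every piece is tested on its own, a failing assertion points at the one small function that is wrong rather than at a long script.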

Page 41: 02 Process

Data Types

Page 42: 02 Process

Ben Shneiderman, 1996

• 1D (sequences)

• Temporal

• 2D (maps)

• 3D (shaped)

• nD (relational)

• Trees (hierarchical)

• Networks (graphs)

• Others?

The Eyes Have It: A Task by Data Type Taxonomy for Information Visualization [Shneiderman, 96]

Page 43: 02 Process


Tamara Munzner, 2013

Page 44: 02 Process

Semantics vs. Types

• Data Semantics: The real-world meaning

e.g., company name, day of the month, person height, etc.

• Data Type: Interpretation in terms of scales of measurements

e.g., quantity or category, sensible mathematical operations, data structure, etc.

Page 45: 02 Process
Page 46: 02 Process

Nominal / Categorical / Qualitative

Ordinal

Interval

Ratio

On the theory of scales and measurements [S. Stevens, 46]

Page 47: 02 Process

Data Types

• Nominal (Categorical) (N): Are = or ≠ to other values

Apples, Oranges, Bananas, ...

• Ordinal (O): Obey a < relationship

Small, medium, large

• Quantitative (Q): Can do arithmetic on them

10 inches, 23 inches, etc.

On the theory of scales and measurements [S. Stevens, 46]

Page 48: 02 Process

Data Types

• Q - Interval (location of zero arbitrary)

Dates: Jan 19; Location: (Lat, Long)

Like a geometric point. Cannot compare directly.

Only differences (i.e., intervals) can be compared

• Q - Ratio (zero fixed)

Measurements: Length, Mass, Temp, ...

Origin is meaningful, can measure ratios & proportions

Like a geometric vector, origin is meaningful

On the theory of scales and measurements [S. Stevens, 46]
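Python's standard `datetime` module happens to enforce the interval-scale rule for dates, which makes for a quick illustration. In this sketch the year 2015 and the helper `can_add` are assumptions added for the example; the lengths reuse the 10 and 23 inches mentioned earlier.

```python
from datetime import date, timedelta

# Interval scale: dates have an arbitrary zero, so only differences are meaningful.
a = date(2015, 1, 19)  # "Jan 19"; the year is an assumption
b = date(2015, 1, 12)

gap = a - b  # a difference (interval) between two points in time


def can_add(x, y):
    """Return True if x + y is a supported operation."""
    try:
        x + y
    except TypeError:
        return False
    return True


dates_addable = can_add(a, b)               # adding two dates has no meaning
shifted = b + timedelta(days=7)             # point + interval is another point

# Ratio scale: zero is fixed, so proportions make sense.
length_ratio = 23 / 10                      # 23 inches is 2.3 times 10 inches
gap_in_days = gap / timedelta(days=1)       # intervals themselves are ratio data

print(gap, dates_addable, shifted, length_ratio, gap_in_days)
```

The same distinction is why "twice as late" is not a sensible statement about two dates, while "twice as long" is a sensible statement about two durations.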

Page 49: 02 Process

Data Types

• N - Nominal (labels)

Operations: =, ≠

• O - Ordinal (ordered)

Operations: =, ≠, >, <

• Q - Interval (location of zero arbitrary)

Operations: =, ≠, >, <, +, − (distance)

• Q - Ratio (zero fixed)

Operations: =, ≠, >, <, +, −, ×, ÷ (proportions)

On the theory of scales and measurements [S. Stevens, 46]
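The list above is cumulative: each scale keeps the operations of the one before it and adds two more. A small sketch of that structure follows; the names `Scale`, `NEW_OPS`, `allowed_ops`, and `permits` are invented for this example, and the operations are written with ASCII operator symbols.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Stevens' scales of measurement, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Operations that become meaningful at each level.
NEW_OPS = {
    Scale.NOMINAL: {"==", "!="},
    Scale.ORDINAL: {"<", ">"},
    Scale.INTERVAL: {"+", "-"},
    Scale.RATIO: {"*", "/"},
}


def allowed_ops(scale):
    """All operations meaningful at `scale`, including those of weaker scales."""
    ops = set()
    for level in Scale:
        if level <= scale:
            ops |= NEW_OPS[level]
    return ops


def permits(scale, op):
    """Return True if `op` is meaningful for data measured on `scale`."""
    return op in allowed_ops(scale)


print(sorted(allowed_ops(Scale.ORDINAL)))
print(permits(Scale.INTERVAL, "/"))
```

A check like `permits` could guard an analysis step, for example refusing to average a column that is only ordinal.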

Page 50: 02 Process

Semantics

Page 51: 02 Process

Item

Page 52: 02 Process

Attribute, a.k.a. Feature

Page 53: 02 Process

1 = Quantitative, 2 = Nominal, 3 = Ordinal

Page 54: 02 Process

1 = Quantitative, 2 = Nominal, 3 = Ordinal

Page 55: 02 Process

Nominal / Ordinal = Dimensions: Describe the data, independent variables

Quantitative = Measures: Numbers to be analyzed, dependent variables

Page 56: 02 Process

Data vs. Conceptual Model

• Data Model: Low-level description of the data

Set with operations, e.g., floats with +, -, /, *

• Conceptual Model: Mental construction

Includes semantics, supports reasoning

Data                    Conceptual
1D floats               temperature
3D vector of floats     space

Page 57: 02 Process

Data vs. Conceptual Model

• From data model...

32.5, 54.0, -17.3, … (floats)

• using conceptual model...

Temperature

• to data type

Continuous to 4 significant figures (Q)

Hot, warm, cold (O)

Burned vs. Not burned (N)

Based on slide from Munzner
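The same floats can be turned into each of the three data types with a few lines of code. In this sketch the function names and all thresholds (10, 40, and 50 degrees) are assumptions chosen only for illustration; the slide does not define where "warm" or "burned" begins.

```python
def as_quantitative(t):
    """Q: keep the number, rounded to 4 significant figures."""
    return float(f"{t:.4g}")


def as_ordinal(t, warm_from=10.0, hot_from=40.0):
    """O: bin into ordered labels (thresholds are assumed, not from the slide)."""
    if t >= hot_from:
        return "hot"
    if t >= warm_from:
        return "warm"
    return "cold"


def as_nominal(t, burn_threshold=50.0):
    """N: two unordered categories (threshold is assumed, not from the slide)."""
    return "burned" if t >= burn_threshold else "not burned"


readings = [32.5, 54.0, -17.3]  # the floats from the slide

for t in readings:
    print(as_quantitative(t), as_ordinal(t), as_nominal(t))
```

Each step down from Q to O to N discards information, so the choice of data type decides which questions the data can still answer.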

Page 58: 02 Process

Data Dimensions

Page 59: 02 Process

Univariate Data

Based on slide from M. Agrawala

http://www.smartmoney.com/marketmap/

Page 60: 02 Process

Bivariate Data

Based on slide from M. Agrawala

Scatterplot is common

Page 61: 02 Process

Trivariate Data

Based on slide from M. Agrawala

Do NOT use 3D scatterplots!

Page 62: 02 Process

Trivariate Data

Based on slide from M. Agrawala

Map the third dimension to some other visual attribute
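Mapping a third variable to a visual attribute usually comes down to rescaling the data into the attribute's range. Below is a minimal sketch for marker size; the function name `linear_scale`, the data values, and the output range are all made up for illustration.

```python
def linear_scale(values, out_min, out_max):
    """Linearly rescale `values` so min maps to out_min and max maps to out_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: use the middle of the output range for every mark.
        return [(out_min + out_max) / 2 for _ in values]
    span = out_max - out_min
    return [out_min + (v - lo) / (hi - lo) * span for v in values]


# Made-up third variable for four points in a 2D scatterplot.
z = [1.0, 2.0, 3.0, 5.0]

# Marker sizes between 20 and 200 (arbitrary units chosen for the example).
sizes = linear_scale(z, 20.0, 200.0)
print(sizes)
```

A list like `sizes` could then be handed to a plotting library, for example as the `s` argument of Matplotlib's `scatter`, so that x and y stay on the plane and the third variable shows up as marker size.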

Page 63: 02 Process

Multivariate Data

Based on slide from M. Agrawala

Give each attribute its own display (small multiples)

Page 64: 02 Process

Multivariate Data Representations

Page 65: 02 Process

Data Reduction

• Filtering: Eliminate some items or attributes

e.g., select range of interest, zoom in, remove outliers, etc.

• Aggregation: Represent a group of elements by a new derived element

e.g., take average, min, max, count, sum

Attribute aggregation a.k.a. dimensionality reduction
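Filtering and item aggregation can be sketched in a few lines of plain Python. The records, field names, and helper functions below are invented for illustration; the sketch covers aggregation of items only, not attribute aggregation (dimensionality reduction).

```python
from collections import defaultdict

# Made-up records for illustration.
records = [
    {"city": "A", "year": 2010, "value": 3.0},
    {"city": "A", "year": 2011, "value": 5.0},
    {"city": "B", "year": 2010, "value": 2.0},
    {"city": "B", "year": 2011, "value": 4.0},
    {"city": "B", "year": 2012, "value": 6.0},
]


def filter_items(records, key, lo, hi):
    """Filtering: keep only items whose `key` lies in the range [lo, hi]."""
    return [r for r in records if lo <= r[key] <= hi]


def aggregate(records, group_key, value_key):
    """Aggregation: replace each group of items by derived summary values."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    return {
        g: {
            "count": len(v),
            "sum": sum(v),
            "min": min(v),
            "max": max(v),
            "mean": sum(v) / len(v),
        }
        for g, v in groups.items()
    }


in_range = filter_items(records, "year", 2010, 2011)
summary = aggregate(in_range, "city", "value")
print(summary)
```

Here five items are first filtered to four and then aggregated to two derived elements, one per city.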

Page 66: 02 Process

Mapping Data to Images

Page 67: 02 Process


Image

Visual language is a sign system

Images perceived as a set of signs

Sender encodes information in signs

Receiver decodes information from signs

Semiology of Graphics, 1983

Jacques Bertin

Jacques Bertin

• French cartographer [1918-2010]

• Semiology of Graphics [1967]

• Theoretical principles for visual encodings

Page 68: 02 Process

Bertin’s Visual Attributes

Marks: Points, Lines, Areas

Channels: Position, Size, (Grey) Value, Texture, Color, Orientation, Shape

Bertin, Semiology of Graphics, 1967

Page 69: 02 Process

Large Design Space (Visual Metaphors)

J. Bertin, “Graphics and Graphic Information Processing”, 1981

Page 70: 02 Process

Example

W. Playfair, 1786

Page 71: 02 Process

Example

W. Playfair, 1786

x-axis: Year (Q)

y-axis: Currency (Q)

Color: Imports / Exports (N, O)

Page 72: 02 Process

Effective Visual Attributes

Page 74: 02 Process

Order These Colors

Based on slide from Stasko

Page 75: 02 Process

Order These Colors

Based on slide from Stasko

Page 76: 02 Process

Order These Colors

Based on slide from Stasko

Page 77: 02 Process

Perceived as Ordered

Brightness, Saturation

Hue: not as much

Page 78: 02 Process

Visual Attributes per Data Type

Jock Mackinlay “Automating The Design of Graphical Presentations.” 1986

Bertin, 1967

William S. Cleveland; Robert McGill, “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” 1984

Cleveland / McGill, 1984 Mackinlay, 1986

Bertin, Semiology of Graphics, 1967

Page 79: 02 Process

Most Efficient

Least Efficient

C. Mulbrandon, VisualizingEconomics.com

Quantitative

Ordinal

Nominal


Page 80: 02 Process

Most Effective

VisualizingEconomics.com

Page 81: 02 Process

Less Effective

VisualizingEconomics.com

Page 83: 02 Process

Effective Visualizations

Page 84: 02 Process

Not Effective...

Sources: US Treasury and WHO reports

Page 85: 02 Process

Also not effective...

Source: Nature

Page 86: 02 Process

Much better...

Sources: US Treasury, WHO, Nature

