
Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Description:
Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices. Peter Fox, Data Analytics – ITWS-4963/ITWS-6965, Week 2a, January 28, 2014, SAGE 3101. Admin info (keep/print this slide). Class: ITWS-4963/ITWS-6965. Hours: 12:00pm-1:50pm Tuesday/Friday. Location: SAGE 3101.
Transcript
Page 1: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

1

Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 2a, January 28, 2014, SAGE 3101

Data and Information Resources, Role of Hypothesis, Synthesis

and Model Choices

Page 2: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.

2

Page 3: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Contents
• Back to the data sources
– Cyber
– Human
• “Munging”
• Beginning with hypothesis -> synthesis
• Distributions…
• Scoping out analysis and model choices

3

Page 4: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Lower layers in the Analytics Stack

4

Page 5: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

“Cyber Data” …

5

Page 6: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

“Human Data” …

6

Page 7: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Descriptive / Inferential
• Descriptive statistics: numerical summaries of samples
– i.e., what was observed, distributions
– The ‘sample’ may be exhaustive, i.e., identical to the population
• Inferential statistics: from samples to populations
– i.e., what could have been or will be observed in a larger population
• Descriptive (report) to Inferential (model suggestion) is a key process in analytics
• So often NOT a linear process...
• Sample bias – choice and awareness

7
Adapted from Marshall Ma (and other sources)
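A minimal sketch of the descriptive-to-inferential step, using only the Python standard library. The population here is synthetic (an assumption for illustration, not course data), and the 95% interval relies on the usual normal approximation.

```python
import math
import random
import statistics

# Hypothetical population: 10,000 synthetic scores (not course data).
rng = random.Random(42)
population = [rng.gauss(60.0, 12.0) for _ in range(10_000)]

# Descriptive: summarize what was actually observed in a sample.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Inferential: say something about the population from the sample,
# here an approximate 95% confidence interval for the population mean.
std_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"sample mean = {sample_mean:.2f}, sample sd = {sample_sd:.2f}")
print(f"approx. 95% CI for population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is only as good as the sample: a biased sample gives a tight interval around the wrong value.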

Page 8: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Populations and samples
• A population is defined
– We must be able to say, for every object, if it is in the population or not
– We must be able, in principle, to find every individual of the population
A geographic example of a population is all pixels in a multi-spectral satellite image
• A sample is a subset of a population
– We must be able to say, for every object in the population, if it is in the sample or not
– Sampling is the process of selecting a sample from a population
• E.g., 2010EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)

8
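The two membership requirements above can be sketched in Python. The 100 x 100 image size and the sample size of 500 are assumptions chosen for illustration.

```python
import random

# Hypothetical population: every pixel (row, col) in a 100 x 100 image.
population = [(row, col) for row in range(100) for col in range(100)]
population_set = set(population)

def in_population(obj):
    """For every object, we can say whether it is in the population."""
    return obj in population_set

# Simple random sampling without replacement (seeded so it is repeatable).
rng = random.Random(0)
sample = rng.sample(population, 500)
sample_set = set(sample)

def in_sample(obj):
    """For every object in the population, we can say whether it is in the sample."""
    return obj in sample_set

print(in_population((0, 0)), in_population((100, 0)), len(sample_set))
```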

Page 9: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Election prediction
• Exit polls versus election results
– Human versus cyber
• How is the “population” defined here?
• What is the sample, how chosen?
– What is described and how is that used to predict?
– Are results categorized? (where from, M/F, age)
• What is the uncertainty?
– It is reflected in the “sample distribution”
– And controlled/constrained by “sampling theory”

9
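One way sampling theory constrains the uncertainty is the margin of error for a sample proportion. This is a sketch under the normal approximation and assumes a simple random sample, which real exit polls rarely are; the poll numbers are hypothetical.

```python
import math

def proportion_margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical exit poll: 1,000 respondents, 52% for candidate A.
moe = proportion_margin_of_error(0.52, 1000)
print(f"52% +/- {100 * moe:.1f} points")

# Quadrupling the sample size halves the margin of error.
moe_4x = proportion_margin_of_error(0.52, 4000)
print(f"with 4,000 respondents: +/- {100 * moe_4x:.1f} points")
```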

Page 10: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Bias difference: between cyber and human data

• Election results and exit polls
– What are examples of bias in election results?
– In exit polls?

10

Page 11: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Hypothesis
• What are you exploring?
• Regular data analytics features ~ well defined hypotheses
– Big Data messes that up
• E.g. Stock market performance / trends versus unusual events (crash/boom):
– Populations versus samples – which is which?
– Why?
• E.g. Election results are predictable from exit polls

11

Page 12: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Distributions
• http://www.quantitativeskills.com/sisa/rojo/alldist.zip
• Shape
• Character
• Parameter(s)

12

Page 13: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Plotting these distributions
• Histograms and binning
• Getting used to log scales
• Going beyond 2-D
• More of this on Friday (in more detail)

13
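A sketch of why binning and log scales matter, in standard-library Python. The data are synthetic log-normal values (an assumption), and the helper names are illustrative rather than from any plotting library.

```python
import math
import random

def histogram(values, edges):
    """Count values in bins [edges[i], edges[i+1]); the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            upper_ok = v < edges[i + 1] or (i == last and v == edges[i + 1])
            if edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    return counts

def linear_edges(lo, hi, n_bins):
    """Bin edges equally spaced on a linear scale."""
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins)] + [hi]

def log_edges(lo, hi, n_bins):
    """Bin edges equally spaced in log10; requires lo > 0."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / n_bins
    return [lo] + [10 ** (log_lo + i * step) for i in range(1, n_bins)] + [hi]

# Synthetic skewed data, restricted to the range we bin over.
rng = random.Random(1)
values = [rng.lognormvariate(0.0, 1.5) for _ in range(5000)]
values = [v for v in values if 0.01 <= v <= 100.0]

lin_counts = histogram(values, linear_edges(0.01, 100.0, 8))
log_counts = histogram(values, log_edges(0.01, 100.0, 8))
print("linear bins:", lin_counts)  # nearly everything lands in the first bin
print("log bins:   ", log_counts)  # the shape of the distribution becomes visible
```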

Page 14: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

In applications
• Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html
• R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
• Matlab: http://www.mathworks.com/help/stats/_brn2irf.html
• Excel: HAH!

14
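The packages above all expose the same three basic operations on a distribution: density, cumulative probability, and quantile. As a dependency-free stand-in (an assumption made so the sketch runs anywhere with Python 3.8+), the standard library's `statistics.NormalDist` shows the idea for the normal distribution only; the linked packages cover many more distributions.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Density, cumulative probability, and quantile (inverse CDF).
# Rough counterparts: dnorm / pnorm / qnorm in R,
# scipy.stats.norm.pdf / .cdf / .ppf in SciPy.
density_at_zero = standard_normal.pdf(0.0)
prob_below = standard_normal.cdf(1.96)
quantile = standard_normal.inv_cdf(0.975)

print(f"pdf(0) = {density_at_zero:.4f}")
print(f"cdf(1.96) = {prob_below:.4f}")
print(f"inv_cdf(0.975) = {quantile:.4f}")
```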

Page 15: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Heavy-tail distributions
• are probability distributions whose tails are not exponentially bounded
• Common – long-tail… human v. cyber…
[Figure labels: “Few that dominate” / “More that add up” / “Equal areas”]
http://en.wikipedia.org/wiki/Heavy-tailed_distribution

15
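“Not exponentially bounded” can be seen numerically by comparing tail probabilities P(X > x). The sketch below uses a Pareto distribution as the heavy-tailed example; the choice of Pareto and the parameter values are assumptions for illustration.

```python
import math

def exponential_tail(x, rate=1.0):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

def pareto_tail(x, x_min=1.0, alpha=2.0):
    """P(X > x) for a Pareto distribution, a standard heavy-tailed example."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

# The ratio of the two tails keeps growing: the Pareto tail decays
# more slowly than any exponential, so extreme values stay plausible.
for x in (2.0, 10.0, 50.0):
    ratio = pareto_tail(x) / exponential_tail(x)
    print(f"x = {x:5.1f}  pareto = {pareto_tail(x):.2e}  "
          f"exponential = {exponential_tail(x):.2e}  ratio = {ratio:.2e}")
```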

Page 16: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Spatial example

16

Page 17: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Spatial roughness…

17

Page 18: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Compare median, mean, mode

18
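A small sketch of how the three measures separate on skewed data. The values are made up; one large value is enough to pull the mean away from the median and mode.

```python
import statistics

# Hypothetical right-skewed values: one large value pulls the mean upward.
values = [1, 2, 2, 3, 4, 5, 40]

mean_value = statistics.mean(values)      # sensitive to the extreme value
median_value = statistics.median(values)  # middle of the sorted values
mode_value = statistics.mode(values)      # most frequent value

print(f"mean = {mean_value:.2f}, median = {median_value}, mode = {mode_value}")
```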

Page 19: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Huh, we have Big Data?
• Why would we care about samples?
– Let’s take it all?
• It gets messy == quality, gaps, …
• Very often goes beyond known patterns, i.e. out of the range of previous values
– Anyone remember the financial crisis in 2008?
• Data becomes more subjective than objective and especially human v. cyber..
• To start: let’s take a look at EPI data that you started to explore last week (cyber)

19

Page 20: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Munging
• Missing values, null values, etc.
• E.g. in EPI_data – they use “--”
– Most data applications provide built-ins for these higher-order functions – in R, “NA” is used, and functions such as is.na(var) provide powerful filtering options (we’ll cover these on Friday)
• Of course, different variables are often missing “different” values
• In R – higher-order functions such as Reduce, Filter, Map, Find, Position and Negate will become your enemies and then friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/

20
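A sketch of the same ideas in Python rather than R: turn the “--” marker into an explicit missing value, then filter and aggregate with higher-order functions. The column values are hypothetical, and `map`, `filter` and `functools.reduce` are only rough Python analogues of R's Map, Filter and Reduce.

```python
from functools import reduce

MISSING_MARKER = "--"  # the marker noted above for EPI_data

# Hypothetical raw column as read from a spreadsheet export.
raw_column = ["58.3", "--", "71.0", "--", "44.6"]

def parse_value(text):
    """Convert a raw cell to a float, or None if it is the missing marker."""
    return None if text.strip() == MISSING_MARKER else float(text)

parsed = list(map(parse_value, raw_column))               # analogue of Map
present = list(filter(lambda v: v is not None, parsed))   # analogue of Filter
total = reduce(lambda acc, v: acc + v, present, 0.0)      # analogue of Reduce

# Index of the first missing value (0-based here; R's Position is 1-based).
first_missing = next((i for i, v in enumerate(parsed) if v is None), None)

print(parsed, present, round(total, 1), first_missing)
```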

Page 21: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

21

Page 22: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

22

Page 23: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

23

Page 24: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

24

Page 25: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Patterns and Relationships
• Stepping from elementary/distribution analysis to algorithmic-based analysis
• I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models
• Relations – associations between/among populations
• Outcome: model and an evaluation of its fitness for purpose

25

Page 26: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

More munging
• Bad values, outliers, corrupted entries, thresholds …
• Noise reduction – low-pass filtering, binning
• A few examples today, but the labs will bring this into view soon
• REMEMBER: when you munge you MUST record what you did (and why) and save copies of pre- and post- operations…

26
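A sketch of thresholding plus a simple low-pass filter, with the record-keeping the last bullet asks for. The measurements, the -999.0 sentinel, the thresholds and the window size are all assumptions for illustration.

```python
import copy

# Hypothetical measurements; -999.0 is assumed to be a bad-reading sentinel.
raw = [10.2, 10.8, -999.0, 11.1, 10.5, 250.0, 10.9, 11.3]
original = copy.deepcopy(raw)  # keep the pre-munging copy untouched
munging_log = []               # record what was done, in order

def drop_outside(values, low, high, log):
    """Keep only values inside [low, high] and log how many were dropped."""
    kept = [v for v in values if low <= v <= high]
    log.append(f"dropped {len(values) - len(kept)} values outside [{low}, {high}]")
    return kept

def moving_average(values, window, log):
    """Simple low-pass filter: mean of each full window of consecutive values."""
    smoothed = [sum(values[i:i + window]) / window
                for i in range(len(values) - window + 1)]
    log.append(f"applied moving average, window={window}")
    return smoothed

cleaned = drop_outside(raw, 0.0, 100.0, munging_log)
smoothed = moving_average(cleaned, 3, munging_log)

print(cleaned)
print([round(v, 2) for v in smoothed])
print(munging_log)
```

The "why" still has to come from you: the log records operations, not the reasoning behind the thresholds.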

Page 27: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

27

Page 28: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

28

Page 29: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Populations within populations
• In the EPI example:
– Geographic regions (GEO_subregion)
– EPI_regions
– Eco-regions (EDC v. LEDC – know what that is?)
– Primary industry(ies)
– Climate region
• What would you do to start exploring?

29
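One possible starting point is to split the data by a grouping variable and summarize each sub-population separately. The sketch below groups by GEO_subregion; the region names and scores are made up, not taken from the EPI file.

```python
from collections import defaultdict
import statistics

# Hypothetical rows: region names and EPI scores are invented for illustration.
rows = [
    {"GEO_subregion": "Region A", "EPI": 62.0},
    {"GEO_subregion": "Region A", "EPI": 70.0},
    {"GEO_subregion": "Region B", "EPI": 48.0},
    {"GEO_subregion": "Region B", "EPI": 52.0},
    {"GEO_subregion": "Region B", "EPI": 44.0},
]

# Collect the scores of each sub-population.
groups = defaultdict(list)
for row in rows:
    groups[row["GEO_subregion"]].append(row["EPI"])

# Describe each sub-population on its own.
summary = {
    region: {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
    }
    for region, scores in groups.items()
}

for region, stats in summary.items():
    print(region, stats)
```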

Page 30: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

30

Page 31: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

31

Or, a twist – n=1 but many attributes?

The item of interest in relation to its attributes

Page 32: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Summary: explore
• Going from preliminary to initial analysis…
• Determining if there is one or more common distributions involved – i.e. parametric statistics (assumes or asserts a probability distribution)
• Fitting that distribution
• Or NOT
– A hybrid or
– Non-parametric (statistics) approaches are needed – more on this to come

32
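A sketch of the parametric route: assert a normal distribution, fit its two parameters, then check one prediction against the data. The data are synthetic, and the normal model is an assumption of the example; real data may call for another family or a non-parametric approach.

```python
import random
from statistics import NormalDist

# Synthetic data standing in for an observed variable.
rng = random.Random(7)
data = [rng.gauss(50.0, 5.0) for _ in range(1000)]

# Fit: estimate the mean and standard deviation from the sample.
fitted = NormalDist.from_samples(data)

# Quick sanity check: fraction of data at or below a threshold,
# observed versus predicted by the fitted distribution.
threshold = 55.0
empirical = sum(v <= threshold for v in data) / len(data)
predicted = fitted.cdf(threshold)

print(f"fitted mean = {fitted.mean:.2f}, fitted sd = {fitted.stdev:.2f}")
print(f"P(X <= {threshold}): empirical = {empirical:.3f}, fitted = {predicted:.3f}")
```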

Page 33: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Models
• Assumptions are often used when considering models, e.g. as being representative of the population – since they are so often derived from a sample – this should be starting to make sense (a bit)
• Two key topics:
– N=all and the open world assumption
– Model of the thing of interest versus model of the data (data model; structural form)
• “All models are wrong but some are useful” (generally attributed to the statistician George Box)

33

Page 34: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Conceptual, logical and physical models

34

Applied to a database:

However our models will be mathematical, statistical, or a combination.

The concept of the model comes from the hypothesis

The implementation of the physical model comes from the data ;-)

Page 35: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Art or science?
• The model, incorporating the hypothesis, determines a “form”
• Thus, as much art as science, because it depends both on your world view and on what the data is telling you (or not)
• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc…

35

Page 36: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Goodness of fit
• And, we cannot take the models at face value, we must assess how fit they may be:
– Chi-Square
– One-sided and two-sided Kolmogorov-Smirnov tests
– Lilliefors tests
– Ansari-Bradley tests
– Jarque-Bera tests
• Just a preview…

36
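As a preview of one of these, the sketch below computes the one-sample, two-sided Kolmogorov-Smirnov statistic by hand: the largest gap between the empirical CDF and a model CDF. It computes the statistic only, not a p-value, and the data and candidate models are synthetic assumptions.

```python
import random
from statistics import NormalDist

def ks_statistic(data, cdf):
    """Largest absolute gap between the empirical CDF of data and a model CDF."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        model = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - model, model - i / n)
    return d

# Synthetic data drawn from a standard normal.
rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]

d_good = ks_statistic(data, NormalDist(0.0, 1.0).cdf)  # a plausible model
d_bad = ks_statistic(data, NormalDist(2.0, 1.0).cdf)   # a badly shifted model

print(f"D against N(0,1): {d_good:.3f}")
print(f"D against N(2,1): {d_bad:.3f}")
```

A smaller statistic means the model CDF tracks the data more closely; turning it into a test requires the Kolmogorov distribution, which statistical packages provide.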

Page 37: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

37

Summary

• Cyber and Human data; quality, uncertainty and bias
• Distributions – the common and not-so-common ones and how cyber and human data can have distinct distributions
• How simple statistical distributions can mislead us
• Populations and samples and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail)
• Big Data and some consequences
• Munging toward exploratory analysis
• Toward models!

Page 38: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7) 10% (lab; individual);
• Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);
• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);
• Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);
• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

38

Page 39: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

How are the software installs going?

• R/Scipy (et al)/Matlab

• Data infrastructure

• Exercises?

• More on Friday…

39

Page 40: Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices

Assignment 1 – how is it going?

• Choose a DA case study from a) readings, or b) your choice (must be approved by me)

• Read it and provide a short written review/ critique (business case, area of application, approach/ methods, tools used, results, actions, benefits).

• Be prepared to discuss it in class this Friday 31st. Hand in the written report by 5pm that day.

40

