Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices


Peter Fox

Data Analytics – ITWS-4963/ITWS-6965

Week 2a, January 28, 2014, SAGE 3101


Admin info (keep/print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Lakshmi Chenicheri chenil@rpi.edu
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.

Contents
• Back to the data sources
– Cyber
– Human
• “Munging”
• Beginning with hypothesis -> synthesis
• Distributions…
• Scoping out analysis and model choices

Lower layers in the Analytics Stack

“Cyber Data” …

“Human Data” …

Descriptive / Inferential
• Descriptive statistics: numerical summaries of samples
– i.e., what was observed, distributions
– The ‘sample’ may be exhaustive, i.e., identical to the population
• Inferential statistics: from samples to populations
– i.e., what could have been or will be observed in a larger population
• Descriptive (report) to Inferential (model suggestion) is a key process in analytics
• So often NOT a linear process...
• Sample bias – choice and awareness
Adapted from Marshall Ma (and other sources)
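A minimal sketch of the step from description to inference, using only the Python standard library. The population, its parameters, and the sample size are made up for illustration, and the 1.96 multiplier assumes a normal approximation is reasonable.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up scores (not EPI data).
population = [random.gauss(60.0, 12.0) for _ in range(10_000)]

# Draw a simple random sample of 100 objects from the population.
sample = random.sample(population, 100)

# Descriptive statistics: numerical summaries of what was observed.
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
sample_sd = statistics.stdev(sample)

# Inferential statistics: an approximate 95% confidence interval for the
# population mean, using the normal approximation (z = 1.96).
std_error = sample_sd / math.sqrt(len(sample))
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"sample mean={sample_mean:.2f} median={sample_median:.2f} sd={sample_sd:.2f}")
print(f"approx. 95% CI for population mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"population mean (known here only because we built it): "
      f"{statistics.mean(population):.2f}")
```

The first print is the descriptive report; the interval is the inferential claim about objects that were never observed.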

Populations and samples
• A population is defined
– We must be able to say, for every object, if it is in the population or not
– We must be able, in principle, to find every individual of the population
A geographic example of a population is all pixels in a multi-spectral satellite image
• A sample is a subset of a population
– We must be able to say, for every object in the population, if it is in the sample or not
– Sampling is the process of selecting a sample from a population
• E.g. 2010EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
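The two membership requirements can be sketched with the satellite-image example. The image size (100 x 80) and the sample size (500) are assumptions chosen only to keep the example small.

```python
import random

random.seed(0)

# Hypothetical 100 x 80 image: the population is every pixel coordinate.
ROWS, COLS = 100, 80
population = [(r, c) for r in range(ROWS) for c in range(COLS)]

def in_population(pixel):
    """Decide, for any object, whether it belongs to the population."""
    r, c = pixel
    return 0 <= r < ROWS and 0 <= c < COLS

# Sampling: select 500 pixels from the population without replacement.
sample = set(random.sample(population, 500))

def in_sample(pixel):
    """Decide, for any object in the population, whether it was sampled."""
    return pixel in sample

print(len(population), len(sample))
print(in_population((5, 5)), in_population((100, 0)))
```

Because the population is enumerable, every individual can in principle be found, which is what makes drawing a sample from it well defined.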

Election prediction
• Exit polls versus election results
– Human versus cyber
• How is the “population” defined here?
• What is the sample, how chosen?
– What is described and how is that used to predict?
– Are results categorized? (where from, M/F, age)
• What is the uncertainty?
– It is reflected in the “sample distribution”
– And controlled/constrained by “sampling theory”
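One way to see the uncertainty question concretely is to simulate an exit poll. This is a sketch under strong assumptions: the electorate, the true share (0.52), and simple random sampling are all invented here, and real exit polls are not simple random samples.

```python
import math
import random

random.seed(1)

# Hypothetical electorate: 1 = voted for candidate A, 0 = otherwise.
TRUE_SHARE = 0.52  # assumed for illustration only
electorate = [1 if random.random() < TRUE_SHARE else 0 for _ in range(100_000)]

def poll(n):
    """Share for candidate A in a simple random sample of n voters."""
    return sum(random.sample(electorate, n)) / n

n = 1_000
estimate = poll(n)

# Sampling theory: standard error of a sample proportion.
std_error = math.sqrt(estimate * (1 - estimate) / n)

# Repeating the poll shows the spread of estimates directly.
repeats = [poll(n) for _ in range(200)]
spread = max(repeats) - min(repeats)

print(f"one poll: {estimate:.3f} +/- {1.96 * std_error:.3f} (approx. 95%)")
print(f"range over 200 repeated polls: {spread:.3f}")
```

The formula and the repeated polls describe the same thing: how much an estimate moves when only the sample changes.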

Bias difference: between cyber and human data
• Election results and exit polls
– What are examples of bias in election results?
– In exit polls?

Hypothesis
• What are you exploring?
• Regular data analytics features ~ well defined hypotheses
– Big Data messes that up
• E.g. Stock market performance / trends versus unusual events (crash/boom):
– Populations versus samples – which is which?
– Why?
• E.g. Election results are predictable from exit polls

Distributions
• http://www.quantitativeskills.com/sisa/rojo/alldist.zip
• Shape
• Character
• Parameter(s)

Plotting these distributions
• Histograms and binning

• Getting used to log scales

• Going beyond 2-D

• More of this on Friday (in more detail)
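To illustrate why binning and log scales matter, here is a small standard-library sketch that bins the same skewed data with linear and with logarithmic bin edges. The lognormal data and the choice of 10 bins are assumptions for illustration.

```python
import random

random.seed(7)

# Hypothetical positive, skewed data standing in for real values.
values = [random.lognormvariate(0.0, 1.5) for _ in range(5_000)]

def histogram(data, edges):
    """Count values in bins [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for x in data:
        if x < edges[0] or x > edges[-1]:
            continue  # outside the binned range
        for i in range(len(counts)):
            if x < edges[i + 1] or i == last:
                counts[i] += 1
                break
    return counts

lo, hi = min(values), max(values)
n_bins = 10

# Linear edges are equally spaced; log edges are equally spaced in log(x).
linear_edges = [lo + (hi - lo) * i / n_bins for i in range(n_bins)] + [hi]
log_edges = [lo * (hi / lo) ** (i / n_bins) for i in range(n_bins)] + [hi]

linear_counts = histogram(values, linear_edges)
log_counts = histogram(values, log_edges)

print("linear bins:", linear_counts)
print("log bins:   ", log_counts)
```

With linear bins nearly everything piles into the first bin; with log bins the shape of the distribution becomes visible.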

In applications
• Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html
• R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
• Matlab: http://www.mathworks.com/help/stats/_brn2irf.html
• Excel: HAH!

Heavy-tail distributions
• are probability distributions whose tails are not exponentially bounded
• Common – long-tail… human v. cyber…
[Figure labels: “Few that dominate”, “More that add up”, “Equal areas”]
http://en.wikipedia.org/wiki/Heavy-tailed_distribution

Spatial example

Spatial roughness…

Compare median, mean, mode
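A rough sketch of how mean, median, and mode separate when the tail is heavy. The two distributions (exponential with rate 1, Pareto with shape 1.5) and the 0.1-wide rounding used to get a mode from continuous data are illustrative assumptions.

```python
import random
import statistics
from collections import Counter

random.seed(3)
n = 10_000

# Light tail: exponential. Heavy tail: Pareto with shape 1.5.
light = [random.expovariate(1.0) for _ in range(n)]
heavy = [random.paretovariate(1.5) for _ in range(n)]

def binned_mode(data, digits=1):
    """Most common value after rounding, a crude mode for continuous data."""
    return Counter(round(x, digits) for x in data).most_common(1)[0][0]

def top_share(data, fraction=0.01):
    """Fraction of the total contributed by the largest `fraction` of values."""
    ordered = sorted(data, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)

for name, data in (("exponential", light), ("pareto", heavy)):
    print(f"{name:12s} mean={statistics.mean(data):.2f} "
          f"median={statistics.median(data):.2f} "
          f"mode~{binned_mode(data):.1f} "
          f"top 1% share={top_share(data):.1%}")
```

In the heavy-tailed case a few values dominate the total, so the mean is pulled well above the median and is a poor summary of a typical value.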

Huh, we have Big Data?
• Why would we care about samples?
– Let’s take it all?
• It gets messy == quality, gaps, …
• Very often goes beyond known patterns, i.e. out of the range of previous values
– Anyone remember the financial crisis in 2008?

• Data becomes more subjective than objective and especially human v. cyber..

• To start: let’s take a look at EPI data that you started to explore last week (cyber)

Munging
• Missing values, null values, etc.
• E.g. in EPI_data – they use “--”
– Most data applications provide built-ins for handling these – in R, “NA” is used and functions such as is.na(var) provide powerful filtering options (we’ll cover these on Friday)

• Of course, different variables often are missing “different” values

• In R – higher-order functions such as: Reduce, Filter, Map, Find, Position and Negate will become your enemies and then friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/
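The same ideas can be sketched in Python, whose built-in map and filter and functools.reduce are rough analogues of R's Map, Filter and Reduce. The rows below are invented; only the “--” marker comes from the EPI data, and the column label "EPI" is a hypothetical stand-in.

```python
from functools import reduce

# Hypothetical rows in the style of the EPI spreadsheet, where "--" marks
# a missing value. Country names and numbers are made up.
rows = [
    {"country": "A", "EPI": "71.4"},
    {"country": "B", "EPI": "--"},
    {"country": "C", "EPI": "58.0"},
    {"country": "D", "EPI": ""},
]

MISSING_MARKERS = {"--", "", "NA"}

def parse_value(text):
    """Convert a cell to float, or None when it holds a missing-value marker."""
    return None if text.strip() in MISSING_MARKERS else float(text)

# Map: parse every cell. Filter: keep only the observed values.
parsed = list(map(lambda row: parse_value(row["EPI"]), rows))
observed = list(filter(lambda v: v is not None, parsed))

# Reduce: fold the observed values into a sum, then a mean.
total = reduce(lambda acc, v: acc + v, observed, 0.0)
mean_epi = total / len(observed)

# Position-style lookup: index of the first missing value, if any.
# Note this index is 0-based, whereas R's Position is 1-based.
first_missing = next((i for i, v in enumerate(parsed) if v is None), None)

print(parsed, observed, round(mean_epi, 2), first_missing)
```

Converting every marker to a single missing value first keeps later steps from silently treating “--” as data.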

Patterns and Relationships
• Stepping from elementary/distribution analysis to algorithmic-based analysis

• I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models

• Relations – associations between/among populations

• Outcome: model and an evaluation of its fitness for purpose

More munging
• Bad values, outliers, corrupted entries, thresholds …
• Noise reduction – low-pass filtering, binning
• A few examples today but the labs will bring this into view soon

• REMEMBER: when you munge you MUST record what you did (and why) and save copies of pre- and post- operations…
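A small sketch of that discipline: keep the original, apply a threshold and a simple low-pass filter, and log each step. The series, the 5-MAD threshold, and the 3-point window are assumptions chosen for the example, not recommended defaults.

```python
import copy
import statistics

# Hypothetical noisy series with one corrupted entry (999.0).
raw = [10.2, 10.8, 9.9, 10.5, 999.0, 10.1, 9.7, 10.4, 10.9, 10.0]

original = copy.deepcopy(raw)  # pre-operation copy, never modified
log = []                       # record of what was done and why

# Threshold rule (an assumption for this sketch): drop values more than
# 5 median absolute deviations (MAD) away from the median.
med = statistics.median(raw)
mad = statistics.median([abs(x - med) for x in raw])
cleaned = [x for x in raw if abs(x - med) <= 5 * mad]
log.append(f"removed {len(raw) - len(cleaned)} value(s) beyond 5 MAD "
           f"of median {med:.2f}: suspected corrupted entries")

def moving_average(data, window=3):
    """Simple low-pass filter: mean of each run of `window` consecutive values."""
    return [statistics.mean(data[i:i + window])
            for i in range(len(data) - window + 1)]

smoothed = moving_average(cleaned, window=3)
log.append("applied 3-point moving average to reduce noise")

for entry in log:
    print(entry)
```

The log plus the untouched original make it possible to undo or question any step later.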

Populations within populations
• In the EPI example:
– Geographic regions (GEO_subregion)
– EPI_regions
– Eco-regions (EDC v. LEDC – know what that is?)
– Primary industry(ies)
– Climate region

• What would you do to start exploring?
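One possible first step is to split the data by a grouping variable and summarize each sub-population separately. The records below are invented; GEO_subregion is the column named on the slide, while the "EPI" key and all values are hypothetical.

```python
import statistics
from collections import defaultdict

# Made-up records; the values do not come from the EPI spreadsheet.
records = [
    {"country": "A", "GEO_subregion": "Western Europe", "EPI": 73.1},
    {"country": "B", "GEO_subregion": "Western Europe", "EPI": 69.5},
    {"country": "C", "GEO_subregion": "South Asia", "EPI": 48.3},
    {"country": "D", "GEO_subregion": "South Asia", "EPI": 51.9},
    {"country": "E", "GEO_subregion": "South Asia", "EPI": 44.0},
]

# Group the values of interest by sub-population.
groups = defaultdict(list)
for rec in records:
    groups[rec["GEO_subregion"]].append(rec["EPI"])

# Descriptive summary per sub-population.
summary = {
    region: {
        "n": len(vals),
        "mean": statistics.mean(vals),
        "median": statistics.median(vals),
    }
    for region, vals in groups.items()
}

for region, stats in summary.items():
    print(f"{region}: n={stats['n']} mean={stats['mean']:.1f} "
          f"median={stats['median']:.1f}")
```

Comparing the per-group summaries with the overall summary is a quick check on whether the whole dataset behaves like one population or several.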

Or, a twist – n=1 but many attributes?

The item of interest in relation to its attributes

Summary: explore
• Going from preliminary to initial analysis…
• Determining if there is one or more common distributions involved – i.e. parametric statistics (assumes or asserts a probability distribution)
• Fitting that distribution
• Or NOT
– A hybrid or
– Non-parametric (statistics) approaches are needed – more on this to come
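The parametric and non-parametric routes can be contrasted on a single question, such as the 90th percentile. This sketch assumes normally distributed made-up data and uses statistics.NormalDist for the fit; with real data the normal assumption would need checking first.

```python
import random
import statistics
from statistics import NormalDist

random.seed(11)

# Hypothetical data; the generating parameters are arbitrary.
data = [random.gauss(50.0, 8.0) for _ in range(2_000)]

# Parametric: assert a normal distribution and estimate its parameters.
fitted = NormalDist.from_samples(data)
parametric_p90 = fitted.inv_cdf(0.90)

# Non-parametric: read the same quantile straight from the ordered data.
deciles = statistics.quantiles(data, n=10)  # 9 cut points
empirical_p90 = deciles[8]

print(f"fitted normal: mean={fitted.mean:.2f} sd={fitted.stdev:.2f}")
print(f"90th percentile: parametric={parametric_p90:.2f} "
      f"empirical={empirical_p90:.2f}")
```

When the asserted distribution is a good description the two answers agree closely; a large gap is a hint that the parametric assumption is wrong.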

Models
• Assumptions are often used when considering models, e.g. as being representative of the population – since they are so often derived from a sample – this should be starting to make sense (a bit)
• Two key topics:
– N=all and the open world assumption
– Model of the thing of interest versus model of the data (data model; structural form)
• “All models are wrong but some are useful” (generally attributed to the statistician George Box)

Conceptual, logical and physical models

Applied to a database:

However our models will be mathematical, statistical, or a combination.

The concept of the model comes from the hypothesis

The implementation of the physical model comes from the data ;-)

Art or science?
• The model, incorporating the hypothesis, determines a “form”
• Thus, as much art as science because it depends both on your world view and what the data is telling you (or not)
• We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.

Goodness of fit
• And, we cannot take the models at face value, we must assess how fit they may be:
– Chi-Square
– One-sided and two-sided Kolmogorov-Smirnov tests
– Lilliefors tests
– Ansari-Bradley tests
– Jarque-Bera tests
• Just a preview…
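As a taste of what such a test measures, the two-sided one-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDF and the model CDF. The sketch below computes only the statistic, not a p-value, on made-up data; the helper name ks_statistic is hypothetical.

```python
import random
from statistics import NormalDist

def ks_statistic(data, cdf):
    """One-sample K-S statistic: max |empirical CDF - model CDF|."""
    ordered = sorted(data)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        model = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - model), abs(model - i / n))
    return d

random.seed(5)

# Hypothetical samples: one roughly normal, one clearly skewed.
normal_data = [random.gauss(0.0, 1.0) for _ in range(1_000)]
skewed_data = [random.expovariate(1.0) for _ in range(1_000)]

# Fit a normal to each sample, then measure how well that normal fits.
d_normal = ks_statistic(normal_data, NormalDist.from_samples(normal_data).cdf)
d_skewed = ks_statistic(skewed_data, NormalDist.from_samples(skewed_data).cdf)

print(f"K-S statistic, normal data vs fitted normal: {d_normal:.3f}")
print(f"K-S statistic, skewed data vs fitted normal: {d_skewed:.3f}")
```

Because the normal parameters here are estimated from the same data, the standard Kolmogorov-Smirnov critical values do not apply directly; that situation is what the Lilliefors test is designed for.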

Summary
• Cyber and Human data; quality, uncertainty and bias
• Distributions – the common and not-so-common ones and how cyber and human data can have distinct distributions
• How simple statistical distributions can mislead us
• Populations and samples and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail)
• Big Data and some consequences
• Munging toward exploratory analysis
• Toward models!

Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7) 10% (lab; individual);

• Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);

• Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);

• Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);

• Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);

• Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

How are the software installs going?

• R/Scipy (et al)/Matlab

• Data infrastructure

• Exercises?

• More on Friday…

Assignment 1 – how is it going?

• Choose a DA case study from a) readings, or b) your choice (must be approved by me)

• Read it and provide a short written review/ critique (business case, area of application, approach/ methods, tools used, results, actions, benefits).

• Be prepared to discuss it in class this Friday 31st. Hand in the written report by 5pm that day.
