Date post: 01-Apr-2015
Upload: carley-dry
CIS 2033. Based on the textbook: F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. A Modern Introduction to Probability and Statistics: Understanding Why and How. Instructor: Dr. Longin Jan Latecki

Chapter 15 Exploratory data analysis: graphical summaries

The set of observations is called a dataset.

By exploring the dataset we can gain insight into what probability model suits the phenomenon.

To graphically represent univariate datasets, consisting of repeated measurements of one particular quantity, we discuss the classical histogram, the more recently introduced kernel density estimates, and the empirical distribution function.

To represent a bivariate dataset, which consists of repeated measurements of two quantities, we use the scatterplot.


15.2 Histograms

The term histogram appears to have been used first by Karl Pearson.


Histogram construction and pdf

Denote a generic (univariate) dataset of size n by $x_1, x_2, \ldots, x_n$.

First we divide the range of the data into intervals. These intervals are called bins and are denoted by $B_1, B_2, \ldots, B_m$.

The length of an interval $B_i$ is denoted by $|B_i|$ and is called the bin width.

We want the area under the histogram on each bin $B_i$ to reflect the number of elements in $B_i$. Since the total area 1 under the histogram then corresponds to the total number of elements n in the dataset, the area under the histogram on a bin $B_i$ is equal to the proportion of elements in $B_i$:

$$\frac{\#\{x_j \in B_i\}}{n}.$$

The height of the histogram on bin $B_i$ must therefore be equal to

$$\frac{\#\{x_j \in B_i\}}{n\,|B_i|}.$$

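As we know from Ch. 13.4, the histogram approximates the pdf f. In particular, for a bin centered at point a, $B_a = (a-h, a+h]$, we have

$$f(a) \approx \frac{\#\{x_j \in B_a\}}{n\,|B_a|} = \frac{\#\{x_j \in B_a\}}{2hn} = H_a,$$

where $H_a$ is the height of the histogram on $B_a$.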
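
The histogram construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset and the bin edges are made up, and the function name `histogram_heights` is hypothetical (it is not part of the course material). Each bin is the half-open interval (left, right], and each height is the bin count divided by n times the bin width, so the areas add up to 1.

```python
def histogram_heights(data, edges):
    """Return the histogram height on each bin (edges[i], edges[i+1]]."""
    n = len(data)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Count the elements that fall in the half-open bin (left, right].
        count = sum(1 for x in data if left < x <= right)
        # Height = proportion of elements in the bin / bin width.
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical dataset and bins (0, 2], (2, 4], (4, 7].
data = [0.5, 1.2, 1.9, 2.4, 3.1, 3.3, 4.8, 6.0]
edges = [0, 2, 4, 7]

heights = histogram_heights(data, edges)
areas = [h * (right - left)
         for h, left, right in zip(heights, edges[:-1], edges[1:])]

print("heights:", heights)
print("total area:", sum(areas))
```

Bins of unequal width are handled the same way, because the division by the bin width is done per bin.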


The function g in blue is a mixture of two Gaussians. We draw 200 samples from it, which are shown as blue dots.

We use the samples to generate the histogram (yellow) and the kernel density estimate f (red). The Matlab script is twoGaussKernelDensity1.m.

In Matlab:

binwidth = 0.5;
bincenters = 0.5:binwidth:9.5;
hx = hist(x, bincenters) / (200*binwidth);  % divide the counts by n*binwidth so that the total area is 1


Choice of the bin width

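Consider a histogram with bins of equal width. In that case the bins are of the form

$$B_i = (r + (i-1)b,\; r + ib] \quad \text{for } i = 1, 2, \ldots, m,$$

where r is some reference point smaller than the minimum of the dataset and b denotes the bin width. Mathematical research has provided some guidelines for a data-based choice of b or m, where s is the sample standard deviation and n is the size of the dataset:

$$m = 1 + 3.3 \log_{10}(n) \quad \text{or} \quad b = 3.49\, s\, n^{-1/3}.$$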
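
As a rough illustration, the two rules of thumb given in the textbook, m = 1 + 3.3 log10(n) for the number of bins and b = 3.49 s n^(-1/3) for the bin width, can be evaluated as follows. The function names and the sample data are hypothetical, and the sketch assumes that s is the sample standard deviation with the n - 1 denominator, which is what `statistics.stdev` computes.

```python
import math
import statistics


def number_of_bins(n):
    """Rule of thumb for the number of bins: m = 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)


def bin_width(data):
    """Rule of thumb for the bin width: b = 3.49 * s * n^(-1/3)."""
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return 3.49 * s * len(data) ** (-1 / 3)


# Hypothetical dataset, used only to show the calls.
data = [1, 2, 3, 4, 5, 6, 7, 8]
print("suggested number of bins:", number_of_bins(len(data)))
print("suggested bin width:", bin_width(data))
```

In practice the suggested number of bins is rounded to an integer, and either value is a starting point rather than a final choice.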


15.3 Kernel density estimates


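A kernel K is a function $K: \mathbb{R} \to \mathbb{R}$. A kernel K typically satisfies the following conditions:

(K1) K is a probability density, i.e., $K(u) \ge 0$ and $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(K2) K is symmetric around zero, i.e., $K(u) = K(-u)$;
(K3) $K(u) = 0$ for $|u| > 1$.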
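
The conditions usually imposed on a kernel (nonnegative, integrating to 1, symmetric around zero, and zero outside [-1, 1]) can be checked numerically for two commonly used kernels. The sketch below is an illustration only: the choice of the Epanechnikov and triangular kernels, the function names, and the midpoint-rule integrator are assumptions made here, not taken from the slides.

```python
def epanechnikov(u):
    """Epanechnikov kernel: (3/4)(1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def triangular(u):
    """Triangular kernel: 1 - |u| on [-1, 1], zero elsewhere."""
    return 1.0 - abs(u) if abs(u) <= 1.0 else 0.0


def integrate(f, a=-1.0, b=1.0, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


for kernel in (epanechnikov, triangular):
    print(kernel.__name__,
          "integral:", round(integrate(kernel), 6),
          "symmetric at 0.3:", kernel(0.3) == kernel(-0.3),
          "zero at 1.5:", kernel(1.5) == 0.0)
```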


Examples of Kernel Construction


Scaling the kernel K

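Scale the kernel K into the function

$$t \mapsto \frac{1}{h} K\!\left(\frac{t}{h}\right),$$

where h > 0 is the bandwidth. Then put a scaled kernel around each element $x_i$ in the dataset:

$$t \mapsto \frac{1}{h} K\!\left(\frac{t - x_i}{h}\right).$$

The kernel density estimate is the sum of the scaled kernels divided by n:

$$f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right).$$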
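
A minimal sketch of this construction: one scaled kernel is placed around each data point, and the kernel density estimate at t is their average, (1/(nh)) times the sum of K((t - x_i)/h). The Epanechnikov kernel, the dataset, and the bandwidths below are assumptions chosen for illustration, and `kernel_density_estimate` is a hypothetical name.

```python
def epanechnikov(u):
    """Epanechnikov kernel: (3/4)(1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density_estimate(t, data, h, kernel=epanechnikov):
    """Average of the kernels scaled by bandwidth h and centered at each x_i."""
    n = len(data)
    return sum(kernel((t - x) / h) for x in data) / (n * h)


# Hypothetical dataset; evaluate the estimate for three bandwidths.
data = [1.0, 2.0, 2.5, 4.0]
for h in (0.3, 1.0, 3.0):
    values = [round(kernel_density_estimate(t, data, h), 3)
              for t in (1.0, 2.0, 3.0, 4.0)]
    print(f"h = {h}: {values}")
```

A small h gives a spiky estimate concentrated near the data points, while a large h smooths the estimate out, which is the trade-off behind the choice of bandwidth.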


The bandwidth is too small.

The bandwidth is too big.



15.4 The empirical distribution function

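Another way to graphically represent a dataset is to plot the data in a cumulative manner. This can be done by using the empirical cumulative distribution function $F_n$, which at a point x gives the proportion of elements in the dataset that are less than or equal to x:

$$F_n(x) = \frac{\#\{x_i \le x\}}{n}.$$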
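
The empirical distribution function at a point x is the fraction of data points less than or equal to x. A small sketch, with a made-up dataset and a hypothetical function name `ecdf`:

```python
def ecdf(data, x):
    """Empirical distribution function: fraction of elements <= x."""
    return sum(1 for value in data if value <= x) / len(data)


# Hypothetical dataset; note the repeated value 1.
data = [3, 1, 4, 1, 5]
for x in (0, 1, 3.5, 5):
    print(f"F_n({x}) = {ecdf(data, x)}")
```

The function is a step function: it jumps by 1/n at each data point (by 2/n at the repeated value here), starts at 0 to the left of the minimum, and reaches 1 at the maximum.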


Empirical distribution function Continued


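Example

15.6. Given is the following information about a histogram. Compute the value of the empirical distribution function at the point t = 7:

Bin        Height
(0, 2]     0.245
(2, 4]     0.130
(4, 7]     0.050
(7, 11]    0.020
(11, 15]   0.005

By: Wanwisa Smith

Because (2 - 0) * 0.245 + (4 - 2) * 0.130 + (7 - 4) * 0.050 + (11 - 7) * 0.020 + (15 - 11) * 0.005 = 1, there are no data points outside the listed bins. Hence the value of the empirical distribution function at t = 7 is the area under the histogram to the left of 7:

$$F_n(7) = (2-0)\cdot 0.245 + (4-2)\cdot 0.130 + (7-4)\cdot 0.050 = 0.49 + 0.26 + 0.15 = 0.90.$$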
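
The arithmetic of this exercise can be checked with a short sketch. The bins and heights are the ones appearing in the sum above; the function name `ecdf_from_histogram` is hypothetical. The computation is valid only when t is a bin boundary, because a histogram does not record where the points lie inside a bin.

```python
# (left, right, height) for each bin (left, right], taken from the sum above.
bins = [
    (0, 2, 0.245),
    (2, 4, 0.130),
    (4, 7, 0.050),
    (7, 11, 0.020),
    (11, 15, 0.005),
]


def ecdf_from_histogram(bins, t):
    """Area under the histogram to the left of t, where t is a bin boundary."""
    return sum((right - left) * height
               for left, right, height in bins
               if right <= t)


total_area = sum((right - left) * height for left, right, height in bins)
print("total area:", round(total_area, 10))
print("F_n(7):", round(ecdf_from_histogram(bins, 7), 10))
```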


Relation between histogram and empirical cdf

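15.11. Given is a histogram and the empirical distribution function $F_n$ of the same dataset. Show that the height of the histogram on a bin (a, b] is equal to

$$\frac{F_n(b) - F_n(a)}{b - a}.$$

By: Wanwisa Smith

The height of the histogram on a bin $B_i = (a, b]$ is

$$\frac{\#\{x_j \in (a, b]\}}{n(b-a)} = \frac{\#\{x_j \le b\} - \#\{x_j \le a\}}{n(b-a)}.$$

Hence the height is equal to

$$\frac{F_n(b) - F_n(a)}{b - a}.$$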
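
The identity can also be checked numerically on a small example. The dataset, the bin (2, 4], and the function names below are assumptions made for illustration; the sketch compares the histogram height computed from the bin count with the difference quotient of the empirical distribution function.

```python
def ecdf(data, x):
    """Empirical distribution function: fraction of elements <= x."""
    return sum(1 for value in data if value <= x) / len(data)


def histogram_height(data, a, b):
    """Height of the histogram on the bin (a, b]."""
    count = sum(1 for value in data if a < value <= b)
    return count / (len(data) * (b - a))


# Hypothetical dataset and bin.
data = [0.5, 1.2, 1.9, 2.4, 3.1, 3.3, 4.8, 6.0]
a, b = 2, 4

from_counts = histogram_height(data, a, b)
from_ecdf = (ecdf(data, b) - ecdf(data, a)) / (b - a)
print("height from counts:", from_counts)
print("height from F_n:   ", from_ecdf)
```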


15.5 Scatterplot


In some situations we might want to investigate the relationship between two or more variables. In the case of two variables x and y, the dataset consists of pairs of observations:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).$$

We call such a dataset a bivariate dataset, in contrast to a univariate dataset.

The plot of the points $(x_i, y_i)$ for i = 1, 2, …, n is called a scatterplot.

