
Sampling and Descriptive Statistics

Date post: 03-Apr-2018
Upload: tran-anh-tu

Transcript
  • 7/28/2019 Sampling and Descriptive Statics

    1/49

    Sampling and Descriptive Statistics

    Berlin Chen

    Department of Computer Science & Information Engineering

    National Taiwan Normal University

    Reference:

    1. W. Navidi. Statistics for Engineers and Scientists. Chapter 1 & Teaching Material

    Sampling (1/2)

    Definition: A population is the entire collection of objects or outcomes about which information is sought
      All NTNU students

    Definition: A sample is a subset of a population, containing the objects or outcomes that are actually observed
      Choose the 100 students from the rosters of football or basketball teams (appropriate?)
      Choose the 100 students living in a certain dorm or enrolled in the statistics course (appropriate?)

    Statistics-Berlin Chen 2

    Sampling (2/2)

    Definition: A simple random sample (SRS) of size n is a sample chosen by a method in which each collection of n population items is equally likely to comprise the sample, just as in a lottery

    Definition: A sample of convenience is a sample that is not drawn by a well-defined random method

    Things to consider with convenience samples:
      Only use when it is not feasible to draw a random sample
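A minimal Python sketch of drawing a simple random sample. The roster of 5000 student IDs and the helper name `simple_random_sample` are assumptions made for illustration; the only library call used is the standard `random.Random.sample`, which selects without replacement so that every subset of size n is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of size n without replacement."""
    rng = random.Random(seed)
    # rng.sample gives every n-item subset the same chance of being chosen.
    return rng.sample(list(population), n)

# Hypothetical roster: IDs 1..5000 stand in for "all NTNU students".
roster = range(1, 5001)
sample = simple_random_sample(roster, 100, seed=1)
print(len(sample), len(set(sample)))  # 100 100 (100 distinct students)
```

A convenience sample would instead take, say, the first 100 IDs on the roster, which is not a random method.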

    More on SRS (1/3)

    Definition: A conceptual population consists of all the values that might possibly have been observed
      It is in contrast to a tangible population
      E.g., a geologist weighs a rock several times on a sensitive scale. Each time, the scale gives a slightly different reading
      The population consists of all the readings that the scale could in principle produce

    More on SRS (2/3)

    An SRS is not guaranteed to reflect the population perfectly

    SRSs always differ in some ways from each other; occasionally a sample is substantially different from the population

    Two different samples from the same population will vary from each other as well

    This phenomenon is known as sampling variation
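Sampling variation can be seen in a small simulation. This is only a sketch: the population of 10,000 heights is generated artificially, and the sample size of 50 is an arbitrary choice.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population: 10,000 heights in cm, generated only for illustration.
population = [rng.gauss(170, 8) for _ in range(10_000)]

# Draw several simple random samples and compare their means.
sample_means = [statistics.mean(rng.sample(population, 50)) for _ in range(5)]
print("population mean:", round(statistics.mean(population), 2))
print("sample means:   ", [round(m, 2) for m in sample_means])
```

Each sample mean lands near the population mean, but no two samples agree exactly.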

    More on SRS (3/3)

    The items in a sample are independent if knowing the values of some of the items does not help to predict the values of the others

    (A Rule of Thumb) Items in a simple random sample may be treated as independent in many cases encountered in practice
      The exception occurs when the population is finite and the sample comprises a substantial fraction (more than 5%) of the population

    However, it is possible to make a population behave as though it were infinitely large, by replacing each item after it is sampled
      Sampling With Replacement

    Other Sampling Methods

    Weighted Sampling
      Some items are given a greater chance of being selected than others

    Stratified Sampling
      The population is divided up into subpopulations, called strata
      A simple random sample is drawn from each stratum
      Supervised (?)

    Cluster Sampling
      Items are drawn from the population in groups or clusters
      E.g., U.S. government agencies use cluster sampling to sample the U.S. population to measure sociological factors such as income and unemployment
      Unsupervised (?)
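Stratified sampling can be sketched as one simple random sample per stratum. The student records, the college labels, and the function name `stratified_sample` below are hypothetical and serve only to show the mechanics.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum items from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)   # split the population into strata
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical population: (student_id, college) pairs.
students = [(i, "science" if i % 3 else "arts") for i in range(1, 301)]
chosen = stratified_sample(students, stratum_of=lambda s: s[1],
                           n_per_stratum=10, seed=2)
print({name: len(group) for name, group in chosen.items()})
```

Weighted sampling could be sketched in a similar way with `random.choices(items, weights=...)`, which draws with replacement.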

    Types of Experiments

    One-Sample Experiment
      There is only one population of interest
      A single sample is drawn from it

    Multi-Sample Experiment
      There are two or more populations of interest
      A sample is drawn from each population
      The usual purpose is to make comparisons among populations

    Types of Data

    Numerical or quantitative if a numerical quantity is assigned to each item in the sample
      Height
      Weight
      Age

    Categorical or qualitative if the sample items are placed into categories
      Gender
      Hair color
      Blood type

    Summary Statistics

    The summary statistics are sometimes called descriptive statistics because they describe the data

    Numerical Summaries
      Sample mean, median, trimmed mean, mode
      Sample standard deviation (variance), range
      Skewness, kurtosis

    Graphical Summaries
      Stem and leaf plot
      Dotplot
      Histogram (more commonly used)
      Scatterplot

    Numerical Summaries (1/4)

    Definition: Sample Mean (the center of the data)
      Let $X_1, \ldots, X_n$ be a sample. The sample mean is
      $$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
      It's customary to use a letter with a bar over it to denote a sample mean

    Definition: Sample Variance (how spread out the data are)
      Let $X_1, \ldots, X_n$ be a sample. The sample variance is
      $$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
      Which is equivalent to
      $$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)$$

    10, 20, 30, 40, 50
    20, 29, 30, 31, 32
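The two forms of the sample variance can be checked against each other numerically. This is a small sketch; the list `narrow` is an illustrative sample chosen here (same mean as `wide`, much smaller spread) and is not taken from the slides.

```python
def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n, xbar = len(xs), sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Equivalent form: (sum of X_i^2 - n * Xbar^2) / (n - 1)."""
    n, xbar = len(xs), sample_mean(xs)
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

wide = [10, 20, 30, 40, 50]
narrow = [29, 30, 30, 30, 31]   # illustrative values, same mean as `wide`
print(sample_mean(wide), sample_variance(wide))       # 30.0 250.0
print(sample_mean(narrow), sample_variance(narrow))   # 30.0 0.5
```

Both samples have the same center, but the variance separates them clearly.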

    Numerical Summaries (2/4)

    Actually, we are interested in
      Population mean
      Population deviation: measuring the spread of the population around the population mean

    The population mean is unknown, so we use the sample mean to replace it

    Mathematically, the deviations around the sample mean tend to be a bit smaller than the deviations around the population mean
      So when calculating the sample variance, dividing by $n-1$ rather than $n$ provides the right correction
      To be proved later on!

    Numerical Summaries (3/4)

    Definition: Sample Standard Deviation
      Let $X_1, \ldots, X_n$ be a sample. The sample standard deviation is
      $$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$
      Which is equivalent to
      $$s = \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)}$$
      The sample standard deviation also measures the degree of spread in a sample (having the same units as the data)

    Numerical Summaries (4/4)

    If $X_1, \ldots, X_n$ is a sample, and $Y_i = a + bX_i$, where $a$ and $b$ are constants, then $\bar{Y} = a + b\bar{X}$

    If $X_1, \ldots, X_n$ is a sample, and $Y_i = a + bX_i$, where $a$ and $b$ are constants, then $s_y^2 = b^2 s_x^2$ and $s_y = |b| s_x$

    Definition: Outliers
      Sometimes a sample may contain a few points that are much larger or smaller than the rest (mainly resulting from data entry errors)
      Such points are called outliers
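The effect of a linear change of units on the mean, variance and standard deviation can be verified with the standard `statistics` module, whose `variance` and `stdev` use the n − 1 divisor. The sample and the constants `a` and `b` are illustrative assumptions; a negative `b` is chosen to show why the absolute value matters for the standard deviation.

```python
import statistics

xs = [12.0, 15.5, 9.0, 20.0, 14.5]   # illustrative sample, not from the slides
a, b = 32.0, -1.8                     # Y_i = a + b * X_i
ys = [a + b * x for x in xs]

mean_ok = abs(statistics.mean(ys) - (a + b * statistics.mean(xs))) < 1e-9
var_ok = abs(statistics.variance(ys) - b ** 2 * statistics.variance(xs)) < 1e-9
std_ok = abs(statistics.stdev(ys) - abs(b) * statistics.stdev(xs)) < 1e-9
print(mean_ok, var_ok, std_ok)   # True True True
```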

    More on Numerical Summaries (1/2)

    Definition: The median is another measure of center of a sample $X_1, \ldots, X_n$, like the mean
      To compute the median, items in the sample have to be ordered by their values
      If $n$ is odd, the sample median is the number in position $(n+1)/2$
      If $n$ is even, the sample median is the average of the numbers in positions $n/2$ and $n/2 + 1$

    The median is an important (robust) measure of center for samples containing outliers
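The position rule for the median translates directly into code. This is a sketch with made-up data; the value 900 plays the role of an outlier.

```python
def sample_median(xs):
    """Median by position: (n+1)/2 if n is odd, else the average of n/2 and n/2 + 1."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]          # positions are 1-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

data = [3, 1, 4, 1, 5]
print(sample_median(data))            # 3
print(sample_median(data + [900]))    # 3.5
```

Adding the outlier moves the median from 3 to 3.5, while the mean of the same data jumps from 2.8 to about 152.3.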

    More on Numerical Summaries (2/2)

    Definition: The trimmed mean of one-dimensional data is computed by
      First, arranging the sample values in (ascending or descending) order
      Then, trimming an equal number of them from each end, say, p%
      Finally, computing the sample mean of those remaining
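The three steps can be sketched as follows. Rounding the number of trimmed values down when p% of n is not a whole number is an assumption of this sketch, and the data are made up.

```python
def trimmed_mean(xs, percent):
    """Drop percent% of the values from each end, then average the rest."""
    ordered = sorted(xs)
    k = int(len(ordered) * percent / 100)   # number trimmed per end (assumption: round down)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

data = [2, 4, 5, 5, 6, 6, 7, 7, 8, 95]
print(trimmed_mean(data, 10))   # 6.0
print(trimmed_mean(data, 0))    # 14.5 (the ordinary sample mean)
```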

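    More on mean

    Arithmetic mean
      $$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
    Geometric mean
      $$\left( \prod_{i=1}^{n} X_i \right)^{1/n}$$
    Harmonic mean
      $$\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{X_i} \right)^{-1}$$
    Generalized mean
      $$\left( \frac{1}{n} \sum_{i=1}^{n} X_i^m \right)^{1/m}$$
      $m \to \infty$: maximum; $m \to -\infty$: minimum
      $m = 1$: arithmetic mean; $m = -1$: harmonic mean
      $m = 2$: quadratic mean
      $m \to 0$: geometric mean
    Arithmetic mean $\ge$ Geometric mean $\ge$ Harmonic mean (for positive values)
    Weighted arithmetic mean
      $$\frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$
    http://en.wikipedia.org/wiki/Mean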
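The arithmetic, geometric and harmonic means are all special cases of one formula with an exponent m. A small sketch, assuming positive data; the function name `power_mean` is chosen here, and the case m = 0 is handled through logarithms because the formula itself is only defined there as a limit.

```python
import math

def power_mean(xs, m):
    """Generalized mean of positive values with exponent m (m = 0: geometric mean)."""
    n = len(xs)
    if m == 0:
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** m for x in xs) / n) ** (1 / m)

xs = [1.0, 2.0, 4.0]
arithmetic = power_mean(xs, 1)    # 7/3
geometric = power_mean(xs, 0)     # cube root of 8, i.e. 2
harmonic = power_mean(xs, -1)     # 12/7
print(arithmetic, geometric, harmonic)
```

For these values the ordering harmonic ≤ geometric ≤ arithmetic holds, as expected for positive data.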

    Quartiles

    Definition: the quartiles of a sample $X_1, \ldots, X_n$ divide it as nearly as possible into quarters. The sample values have to be ordered from the smallest to the largest
      The first quartile is found by computing the value 0.25(n+1)
      The second quartile is found by computing the value 0.5(n+1)
      The third quartile is found by computing the value 0.75(n+1)

    Example 1.14: Find the first and third quartiles of the following data
      30 75 79 80 80 105 126 138 149 179 179 191
      223 232 232 236 240 242 245 247 254 274 384 470
      n = 24
      To find the first quartile, compute 0.25(n+1) = 6.25, then average the values in positions 6 and 7: (105+126)/2 = 115.5
      To find the third quartile, compute 0.75(n+1) = 18.75, then average the values in positions 18 and 19: (242+245)/2 = 243.5
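The position rule can be sketched as a small function and checked on the data of Example 1.14. The sketch assumes the 19th ordered value is 245, which is the value the third-quartile calculation (242+245)/2 uses. Clamping positions that fall outside 1..n is also an assumption made here, since the rule does not cover that case.

```python
import math

def percentile(xs, p):
    """pth percentile by the (p/100) * (n + 1) position rule."""
    ordered = sorted(xs)
    n = len(ordered)
    pos = p * (n + 1) / 100            # 1-based position, possibly fractional
    lo = math.floor(pos)
    if lo < 1:
        return ordered[0]              # assumption: clamp at the smallest value
    if lo >= n:
        return ordered[-1]             # assumption: clamp at the largest value
    if pos == lo:
        return ordered[lo - 1]         # integer position: take that value
    return (ordered[lo - 1] + ordered[lo]) / 2   # otherwise average the neighbors

data = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
        223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
print(percentile(data, 25), percentile(data, 75))   # 115.5 243.5
```

The same function gives the median as `percentile(data, 50)`.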

    Percentiles

    Definition: The pth percentile of a sample $X_1, \ldots, X_n$, for a number p between 0 and 100, divides the sample so that as nearly as possible p% of the sample values are less than the pth percentile
      Order the sample values from smallest to largest
      Then compute the quantity (p/100)(n+1), where n is the sample size
      If this quantity is an integer, the sample value in this position is the pth percentile. Otherwise, average the two sample values on either side

    Note that the first quartile is the 25th percentile, the median is the 50th percentile, and the third quartile is the 75th

    Mode and Range

    Mode
      The sample mode is the most frequently occurring value in a sample

    Range
      The difference between the largest and smallest values in a sample
      It is a measure of spread that depends only on the two extreme values
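Both summaries take only a few lines of Python. This is a sketch on illustrative values; returning every most-frequent value is a choice made here, because a sample can have more than one mode.

```python
from collections import Counter

def sample_modes(xs):
    """All values that occur most often in the sample."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def sample_range(xs):
    """Largest value minus smallest value."""
    return max(xs) - min(xs)

data = [30, 75, 79, 80, 80, 105, 126]   # illustrative values
print(sample_modes(data), sample_range(data))   # [80] 96
```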

    Numerical Summaries for Categorical Data

    For categorical data, each sample item is assigned a category rather than a numerical value

    Two Numerical Summaries for Categorical Data
      Definition: (Relative) Frequencies
        The frequency of a given category is simply the number of sample items falling in that category
      Definition: Sample Proportions (also called relative frequency)
        The sample proportion is the frequency divided by the sample size
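Frequencies and sample proportions for categorical data can be tallied with the standard `collections.Counter`. The blood-type sample below is made up for illustration.

```python
from collections import Counter

blood_types = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]  # made-up sample
frequencies = Counter(blood_types)           # category -> number of items
n = len(blood_types)
proportions = {category: count / n for category, count in frequencies.items()}
print(frequencies["O"], proportions["O"])    # 4 0.4
```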

    Sample Statistics and Population Parameters (1/2)

    A numerical summary of a sample is called a statistic

    A numerical summary of a population is called a parameter
      If a population is finite, the methods used for calculating the numerical summaries of a sample can be applied for calculating the numerical summaries of the population (each value or outcome occurs with probability? See Chapter 2)
      Exceptions are the variance and standard deviation (?)
      Normal: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

    However, sample statistics are often used to estimate parameters (to be taken as estimators)
      In practice, the entire population is never observed, so the population parameters cannot be calculated directly

    Sample Statistics and Population Parameters (2/2)

    A Schematic Depiction

    [Figure: a sample is drawn from the population; statistics are computed from the sample; inference leads from the statistics to the population parameters]

    Graphical Summaries

    Recall that the mean, median and standard deviation, etc., are numerical summaries of a sample or of a population

    On the other hand, graphical summaries are used as well to help visualize a list of numbers (or the sample items). Methods to be discussed include:
      Stem-and-leaf plot
      Dotplot
      Histogram (more commonly used)
      Boxplot (more commonly used)
      Scatterplot

    Stem-and-leaf Plot (1/3)

    A simple way to summarize a data set

    Each item in the sample is divided into two parts
      stem, consisting of the leftmost one or two digits
      leaf, consisting of the next significant digit

    The stem-and-leaf plot is a compact way to represent the data
      It also gives us some indication of the shape of our data

    Stem-and-leaf Plot (2/3)

    Example: Duration of dormant periods of the geyser Old Faithful in minutes
      Let's look at the first line of the stem-and-leaf plot. This represents measurements of 42, 45, and 49 minutes
      A good feature of these plots is that they display all the sample values. One can reconstruct the data in its entirety from a stem-and-leaf plot (however, the order in which the items were sampled is lost)

    Stem-and-leaf Plot (3/3)

    Another Example: Particulate matter (PM) emissions for 62 vehicles driven at high altitude

    [Figure: stem-and-leaf plot with three columns: cumulative frequency column, stem (tens digits), leaf (ones digits)]
      The cumulative frequency column contains a count of the number of items at or above this line
      This stem contains the median
      The cumulative frequency column contains a count of the number of items at or below this line

    Dotplot

    A dotplot is a graph that can be used to give a rough impression of the shape of a sample
      Where the sample values are concentrated
      Where the gaps are

    It is useful when the sample size is not too large and when the sample contains some repeated values

    Good method, along with the stem-and-leaf plot, to informally examine a sample

    Not generally used in formal presentations

    Figure 1.7 Dotplot of the geyser data in Table 1.3

    Histogram (1/3)

    A graph that gives an idea of the shape of a sample
      Indicates regions where samples are concentrated or sparse

    The first step is to construct a frequency table
      Choose boundary points for the class intervals
      Compute the frequencies and relative frequencies for each class
        Relative frequency = frequency/sample size
      Compute the density for each class according to the formula
        Density = relative frequency/class width
        Density can be thought of as the relative frequency per unit
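The frequency-table steps can be sketched in a few lines. The data values and class boundaries are made up for illustration, and treating each class as the half-open interval [left, right) is an assumption of this sketch.

```python
def frequency_table(xs, boundaries):
    """Rows of (left, right, frequency, relative frequency, density) per class."""
    n = len(xs)
    rows = []
    for left, right in zip(boundaries, boundaries[1:]):
        freq = sum(1 for x in xs if left <= x < right)   # class is [left, right)
        rel = freq / n
        rows.append((left, right, freq, rel, rel / (right - left)))
    return rows

data = [1.2, 1.9, 2.5, 3.1, 3.3, 4.8, 5.0, 6.7, 9.4, 14.0]   # made-up values
table = frequency_table(data, boundaries=[1, 3, 5, 7, 15])
for row in table:
    print(row)

# Each rectangle has area density * width, so the areas add up to 1.
total_area = sum(density * (right - left) for left, right, _, _, density in table)
```

Because the last class is four times as wide as the others, its density is much lower than its relative frequency alone would suggest.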

    Histogram (2/3)

    Table 1.4 A frequency table

    Draw a rectangle for each class, whose height is equal to the density

    The total area of the rectangles is equal to 1

    Figure 1.8

    Histogram (3/3)

    A common rule of thumb for constructing the histogram of a sample
      It is good to have more intervals rather than fewer
      But it is also good to have large numbers of sample points in the intervals
      Striking the proper balance is a matter of judgment and of trial and error
      It is reasonable to take the number of intervals roughly equal to the square root of the sample size

    Histogram with Equal Class Widths

    Default setting of most software packages

    Example: a histogram with equal class widths for Table .
      The total area of the rectangles is equal to 1

    Figure 1.9
      few (7) data points

    Compared to Figure 1.9, Figure 1.8 presents a smoother appearance and better enables the eye to appreciate the structure of the data set as a whole

    Histogram, Sample Mean and Sample Variance (1/2)

    Definition: The center of mass of the histogram is
      $$\sum_i (\text{center of class interval } i) \times (\text{relative frequency of class interval } i)$$
      where the relative frequency of a class is its density times its class width

    An approximation to the sample mean
      E.g., the center of mass of the histogram in Figure 1.8 is
      $$2(0.194) + 4(0.177) + \cdots + 20(0.065) = 6.730$$
      While the sample mean is 6.596

    The narrower the rectangles (intervals), the closer the approximation is (in the limit, each rectangle contains only items of the same value)

    [Figure: two histograms of the sample 1, 1, 1, 2, 3, 4, one with class boundaries 0.5, 3.5, 4.5 and one with class boundaries 0.5, 1.5, 2.5, 3.5, 4.5]

    Histogram, Sample Mean and Sample Variance (2/2)

    Definition: The moment of inertia for the entire histogram is
      $$\sum_i (\text{center of class interval } i - \text{center of mass of histogram})^2 \times (\text{relative frequency of class interval } i)$$
      where the relative frequency of a class is its density times its class width

    An approximation to the sample variance
      E.g., the moment of inertia for the entire histogram in Figure 1.8 is
      $$(2 - 6.730)^2 (0.194) + (4 - 6.730)^2 (0.177) + \cdots + (20 - 6.730)^2 (0.065) = 20.25$$
      While the sample variance is 20.42

    The narrower the rectangles (intervals) are, the closer the approximation is
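A small sketch of both histogram approximations on the sample 1, 1, 1, 2, 3, 4. It assumes that each class center is weighted by the relative frequency of the class (density times class width), which is what the values 0.194, 0.177, ..., 0.065 in the Figure 1.8 calculation appear to be. The function name and the half-open classes [left, right) are choices made here.

```python
def histogram_summaries(xs, boundaries):
    """Center of mass and moment of inertia of a histogram with classes [left, right)."""
    n = len(xs)
    centers, rel_freqs = [], []
    for left, right in zip(boundaries, boundaries[1:]):
        centers.append((left + right) / 2)
        rel_freqs.append(sum(1 for x in xs if left <= x < right) / n)
    center_of_mass = sum(c * r for c, r in zip(centers, rel_freqs))
    inertia = sum((c - center_of_mass) ** 2 * r for c, r in zip(centers, rel_freqs))
    return center_of_mass, inertia

sample = [1, 1, 1, 2, 3, 4]                       # sample mean is 2
coarse = histogram_summaries(sample, [0.5, 3.5, 4.5])
fine = histogram_summaries(sample, [0.5, 1.5, 2.5, 3.5, 4.5])
print(coarse[0], fine[0])   # about 2.33 with two classes, about 2.0 with four
```

With unit-width classes every rectangle holds items of a single value, so the center of mass matches the sample mean. The moment of inertia is then 8/6, close to but not equal to the sample variance 8/5, because it divides by n rather than n − 1.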

    Symmetry and Skewness (1/2)

    A histogram is perfectly symmetric if its right half is a mirror image of its left half
      E.g., heights of random men

    Histograms that are not symmetric are referred to as skewed

    A histogram with a long right-hand tail is said to be skewed to the right, or positively skewed
      E.g., incomes are right skewed (?)

    A histogram with a long left-hand tail is said to be skewed to the left, or negatively skewed
      Grades on an easy test are left skewed (?)

    Symmetry and Skewness (2/2)

    [Figure: three histograms: skewed to the left, nearly symmetric, skewed to the right]

    There is also another term called kurtosis that is also widely used for descriptive statistics
      It characterizes the flatness of the distribution of a population

    More on Skewness and Kurtosis (1/3)

    Skewness can be used to characterize the symmetry of a data set (sample)

    Given a sample $X_1, X_2, \ldots, X_n$:
      Skewness is defined by
      $$\text{Skewness} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{(n-1)\, s^3}$$
      Symmetric distribution shape => $\text{Skewness} = 0$
      $\text{Skewness} > 0$: skewed to the right
      $\text{Skewness} < 0$: skewed to the left

    More on Skewness and Kurtosis (2/3)

    Kurtosis can be used to characterize the flatness of a data set (sample)

    Given a sample $X_1, X_2, \ldots, X_n$:
      Kurtosis is defined by
      $$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{(n-1)\, s^4}$$
      A standard normal distribution has $\text{Kurtosis} = 3$
      A larger kurtosis value indicates a peaked distribution
      A smaller kurtosis value indicates a flat distribution
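Both quantities can be computed directly from a sample. This sketch assumes the denominators (n − 1)s³ and (n − 1)s⁴, with s the sample standard deviation; other references divide by n instead, so values from software packages may differ slightly. The two samples are made up.

```python
import math

def skewness_kurtosis(xs):
    """Skewness and kurtosis with assumed (n - 1) * s**3 and (n - 1) * s**4 denominators."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))   # sample std. deviation
    skew = sum((x - xbar) ** 3 for x in xs) / ((n - 1) * s ** 3)
    kurt = sum((x - xbar) ** 4 for x in xs) / ((n - 1) * s ** 4)
    return skew, kurt

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 12]
print(skewness_kurtosis(symmetric)[0])          # 0.0
print(skewness_kurtosis(right_tailed)[0] > 0)   # True
```

Mirroring the second sample (negating every value) flips the sign of its skewness and leaves its kurtosis unchanged.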

    More on Skewness and Kurtosis (3/3)

    [Figure: example distributions, one panel labeled "standard normal"]

    http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

    Unimodal and Bimodal Histograms (1/2)

    Definition: Mode
      Can refer to the most frequently occurring value in a sample
      Or refer to a peak or local maximum for a histogram or other curves

    [Figure: a unimodal histogram and a bimodal histogram]

    A bimodal histogram, in some cases, indicates that the sample can be divided into two subsamples that differ from each other

    Unimodal and Bimodal Histograms (2/2)

    Example: Durations of dormant periods (in minutes) and of the preceding eruptions
      long: more than 3 minutes
      short: less than 3 minutes

    [Figure: three panels: histogram of all 60 durations; histogram of the durations following short eruption; histogram of the durations following long eruption]

    Histogram with Height Equal to Frequency

    Till now, we have used the term histogram for a graph in which the heights of rectangles represent densities

    However, histograms can also be drawn with the heights of rectangles equal to the frequencies

    Example: a histogram with the heights equal to the frequencies
      Figure 1.13 (cf. Figure 1.8)
      Exaggerates the proportion of vehicles in the intervals

    Boxplot (1/4)

    A boxplot is a graph that presents the median, the first and third quartiles, and any outliers present in the sample

    [Figure: anatomy of a boxplot, with distances of 1.5 IQR marked]

    The interquartile range (IQR) is the difference between the third and first quartile. This is the distance needed to span the middle half of the data

    Boxplot (2/4)

    Steps in the Construction of a Boxplot
      Compute the median and the first and third quartiles of the sample. Indicate these with horizontal lines. Draw vertical lines to complete the box
      Find the largest sample value that is no more than 1.5 IQR above the third quartile, and the smallest sample value that is no more than 1.5 IQR below the first quartile. Extend vertical lines (whiskers) from the quartile lines to these points
      Points more than 1.5 IQR above the third quartile, or more than 1.5 IQR below the first quartile, are designated as outliers. Plot each outlier individually
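The construction steps can be sketched as a function that returns the numbers a boxplot would draw. The quartiles use the p(n + 1) position rule; clamping positions outside 1..n is an assumption of this sketch. The data are the 24 values of the quartile example, taking the 19th ordered value as 245.

```python
import math

def quartile(ordered, p):
    """p * (n + 1) position rule; average the two neighbors if the position is fractional."""
    n = len(ordered)
    pos = p * (n + 1)
    lo, hi = max(math.floor(pos), 1), min(math.ceil(pos), n)   # assumption: clamp to 1..n
    return (ordered[lo - 1] + ordered[hi - 1]) / 2

def boxplot_stats(xs):
    ordered = sorted(xs)
    q1, med, q3 = (quartile(ordered, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    # Whiskers end at the most extreme values that are still inside the fences.
    return {"q1": q1, "median": med, "q3": q3,
            "whiskers": (inside[0], inside[-1]), "outliers": outliers}

data = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
        223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]
stats = boxplot_stats(data)
print(stats)
```

For these data the IQR is 243.5 − 115.5 = 128, so the upper fence is 435.5 and the value 470 is plotted individually as an outlier.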

    Boxplot (3/4)

    Example: A boxplot for the geyser data presented in Table 1.5
      Notice there are no outliers in these data
      The sample values are comparatively densely packed between the median and the third quartile
      The lower whisker is a bit longer than the upper one, indicating that the data have a slightly longer lower tail than an upper tail
      The distance between the first quartile and the median is greater than the distance between the median and the third quartile
      This boxplot suggests that the data are skewed to the left (?)

    Boxplot (4/4)

    Another Example: Comparative boxplots for PM emissions data for vehicles driving at high versus low altitudes

    Scatterplot (1/2)

    Data for which each item consists of a pair of values is called bivariate

    The graphical summary for bivariate data is a scatterplot
      Display of a scatterplot (strength of titanium welds vs. its chemical contents)

    Scatterplot (2/2)

    Example: Speech feature sample (Dimensions 1 & 2) of male (blue) and female (red) speakers after LDA transformation

    [Figure: scatterplot of feature 2 versus feature 1]

    Summary

    We discussed types of data

    We looked at sampling, mostly SRS

    We studied summary (descriptive) statistics
      We learned about numeric summaries
      We examined graphical summaries (displays of data)