Lecture 1 - National Tsing Hua University

Lecture 1: Introduction
BMIR Lecture Series in Probability and Statistics

Ching-Han Hsu, BMES, National Tsing Hua University
© 2015 by Ching-Han Hsu, Ph.D., BMIR Lab

1 Statistics: Why and What?

Why statistics?
As you begin this course, you may wonder:

• What is statistics?
• Why should I study statistics?
• How can studying statistics help me in my profession?

Statistics is everywhere
Almost every day you are exposed to statistics.

• "The number of Americans with diabetes will nearly double in the next 25 years." (Source: Diabetes Care)
• "The NRF expects holiday sales to decline 1% versus a 3.4% drop in holiday sales the previous year." (Source: National Retail Federation)
• "EIA projects total U.S. natural gas consumption will decline by 2.6 percent in 2009 and increase by 0.5 percent in 2010." (Source: Energy Information Administration)

Definition 1. Data consist of information coming from observations, counts, measurements, or responses.

Statistics

Definition 2. Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

Definition 3. A population is the collection of all outcomes, responses, measurements, or counts that are of interest. A sample is a subset, or part, of a population.

Probability and Statistics 2/14 Fall, 2015

Statistics: Example

Example 4. In a recent survey, 1500 adults in the United States were asked if they thought there was solid evidence of global warming. Eight hundred fifty-five of the adults said yes. Identify the population and the sample. Describe the sample data set.

• The population consists of the responses of all adults in the United States.
• The sample consists of the responses of the 1500 adults in the United States in the survey.
• The sample is a subset of the responses of all adults in the United States.
• The sample data set consists of 855 yes’s and 645 no’s.

Statistic and Parameter

Definition 5. A parameter is a numerical description of a population characteristic. A statistic is a numerical description of a sample characteristic.

Example 6. Decide whether the numerical value describes a population parameter or a sample statistic.

• A recent survey of 200 college career centers reported that the average starting salary for petroleum engineering majors is $83,121. (sample statistic)
• The 2182 students who accepted admission offers to Northwestern University in 2009 have an average SAT score of 1442. (population parameter)
• In a random check of a sample of retail stores, the Food and Drug Administration found that 34% of the stores were not storing fish at the proper temperature. (sample statistic)

Branches of Statistics

Definition 7. Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.

Definition 8. Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

Branches of Statistics

Example 9. Decide which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics?

• A large sample of men, aged 48, was studied for 18 years. For unmarried men, approximately 70% were alive at age 65. For married men, 90% were alive at age 65.
• In a sample of Wall Street analysts, the percentage who incorrectly forecasted high-tech earnings in a recent year was 44%.

BMES, NTHU. BMIR © Ching-Han Hsu, Ph.D.

Model            Suggested retail price
Focus Sedan      $15,995
Fusion           $19,270
Mustang          $20,995
Edge             $26,920
Flex             $28,495
Escape Hybrid    $32,260
Expedition       $35,085
F-450            $44,145

Data Types

Definition 10. Qualitative data consist of attributes, labels, or non-numerical entries.

Definition 11. Quantitative data consist of numerical measurements or counts.

Branches of Statistics

Example 12. The suggested retail prices of several Ford vehicles are shown in the table. Which data are qualitative data and which are quantitative data?

The model names are qualitative data. The suggested retail prices are quantitative data.

2 Numerical Summaries

Sample Mean

Definition 13. If the $n$ observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1}$$

• The sample mean is the average value of all observations.
• There is also a population mean $\mu$ associated with the probability distribution of the random variable $X$ from which the sample was drawn.
• The sample mean $\bar{x}$ is a reasonable estimate of the population mean $\mu$.

Sample Variance

Definition 14. If $x_1, x_2, \ldots, x_n$ is a sample of $n$ observations, the sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{2}$$

The sample standard deviation, $s$, is the positive square root of the sample variance.

• Analogous to the sample variance $s^2$, the variability in the population is defined by the population variance $\sigma^2$.
• $\sigma$ is the population standard deviation.

x1  12.6    x2  12.9    x3  13.4    x4  12.3
x5  13.6    x6  13.5    x7  12.6    x8  13.1

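• Computation of $s^2$

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \tag{3}$$

$$\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} \bar{x}^2}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\left(\sum_{i=1}^{n} x_i / n\right)^2}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}
\end{aligned}$$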

Example: Sample Mean and Variance

Example 15. Assume that there are eight observations (listed in the table above). Compute the sample mean and variance.

• The sample mean is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_8}{8} = \frac{12.6 + 12.9 + \cdots + 13.1}{8} = \frac{104}{8} = 13.0$$

• The sample variance is

$$s^2 = \frac{\sum_{i=1}^{8} (x_i - \bar{x})^2}{8 - 1} = \frac{1.6}{7} = 0.2286$$

(The sample standard deviation is $s = \sqrt{0.2286} = 0.48$.)
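As a quick numerical check of Example 15, the following Python sketch computes the sample mean, the sample variance by both the definition in Eq. (2) and the shortcut in Eq. (3), and the sample standard deviation. The variable names are illustrative and not part of the lecture.

```python
import math

# The eight observations from Example 15.
x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
n = len(x)

# Sample mean, Eq. (1).
x_bar = sum(x) / n

# Sample variance from the definition, Eq. (2).
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Sample variance from the computational shortcut, Eq. (3).
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

# Sample standard deviation: positive square root of the sample variance.
s = math.sqrt(s2_definition)

print(f"mean                  = {x_bar:.1f}")
print(f"variance (definition) = {s2_definition:.4f}")
print(f"variance (shortcut)   = {s2_shortcut:.4f}")
print(f"standard deviation    = {s:.2f}")
```

Both variance formulas agree up to floating-point rounding. The shortcut needs only the running sums of $x_i$ and $x_i^2$, which is convenient when the data arrive one value at a time.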

Sample Range

Definition 16. If the $n$ observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample range is

$$r = \max(x_i) - \min(x_i) \tag{4}$$

Example 17. Given the eight observations of Example 15, the sample range is $13.6 - 12.3 = 1.3 > 0.48 = s$.

Parameter Estimation

• The sample mean can be used as an estimate of the population mean.
• The sample variance can also be used as an estimate of the population variance.
• These are referred to as parameter estimation.
• The sample variance $s^2$ is based on $n - 1$ degrees of freedom. The degrees of freedom result from the fact that the $n$ deviations $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$ always sum to zero. Note that this is a constraint: once any $n - 1$ of the deviations are known, the remaining one is determined.
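The zero-sum constraint behind the $n - 1$ degrees of freedom can be checked numerically. This minimal sketch reuses the eight observations of Example 15; the variable names are hypothetical.

```python
# The eight observations from Example 15.
x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
n = len(x)
x_bar = sum(x) / n

# The n deviations from the sample mean.
deviations = [xi - x_bar for xi in x]

# Their sum is zero, up to floating-point rounding.
total = sum(deviations)

# Because of this constraint, the last deviation can be recovered
# from the first n - 1 deviations alone.
last_from_others = -sum(deviations[:-1])

print(f"sum of deviations        = {total:.2e}")
print(f"last deviation (direct)  = {deviations[-1]:.4f}")
print(f"last deviation (implied) = {last_from_others:.4f}")
```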

3 Frequency Distribution and Histogram

Boat Image

Figure 1: Digital image of boat: 512×512 image pixels. Each pixel uses 8 bits, i.e., a pixel value between 0 and 255.

Independent Pixel Approach: IID Assumption

• Each pixel uses 8 bits, which gives $2^8 = 256$ possible combinations.
• The most common usage is to represent the values from 0 to 255.
• One can assume that each image pixel can be modeled as a uniform random variable with 256 possible outcomes.
• For a 512×512 image, there are 262,144 independent random variables in total, i.e., $X_1, \ldots, X_{262144}$. These RVs are assumed to be independent and identically distributed (i.i.d.).
• The boat image is simply a possible outcome of these 262,144 RVs, with probability:

$$P(X_1, \ldots, X_{262144}) = \frac{1}{256^{262144}}$$
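To get a feel for how small this probability is, one can work on a logarithmic scale, since the value itself is far below the smallest positive floating-point number. The sketch below is only an illustration of the formula above; the variable names are assumptions.

```python
import math

levels = 2 ** 8        # 256 possible values for an 8-bit pixel
pixels = 512 * 512     # number of pixels, i.e., number of i.i.d. RVs

# log10 of 1 / 256**262144, computed without forming the huge power.
log10_probability = -pixels * math.log10(levels)

print(f"number of pixels = {pixels}")
print(f"log10(P)         = {log10_probability:.1f}")
```

Under the uniform i.i.d. model every one of the $256^{262144}$ possible images is equally likely, so any particular image, the boat included, has a probability with several hundred thousand zeros after the decimal point.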

Ensemble Image Interpretation

• The whole boat image can also be modeled as an ensemble of a single random variable $X$ which represents the intensity value.
• $X$ can take integer values from 0 to 255.
• How do we define $P(X = k)$?

$$P(X = k) = \frac{\text{number of pixels with value equal to } k}{\text{total number of pixels}}$$

This is also called the relative frequency approach.

Ensemble Image Interpretation

• Consider the following pixel intensities and the corresponding pixel counts:

Value  Count    Value  Count    Value  Count
0      7        34     620      120    1433
147    5677     168    2432     255    2

• The corresponding probabilities using relative frequency:

$P(X = 0) = \frac{7}{262144} = 2.67 \times 10^{-5}$
$P(X = 34) = \frac{620}{262144} = 2.37 \times 10^{-3}$
$P(X = 120) = \frac{1433}{262144} = 5.47 \times 10^{-3}$
$P(X = 147) = \frac{5677}{262144} = 2.17 \times 10^{-2}$
$P(X = 168) = \frac{2432}{262144} = 9.28 \times 10^{-3}$
$P(X = 255) = \frac{2}{262144} = 7.63 \times 10^{-6}$
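The relative frequency approach can be sketched in a few lines of Python. The function name and the six-pixel toy image are hypothetical and serve only as an illustration; the second part recomputes the probabilities from the pixel counts listed above.

```python
from collections import Counter

def relative_frequencies(pixels):
    """Estimate P(X = k) as (number of pixels equal to k) / (total pixels)."""
    counts = Counter(pixels)
    total = len(pixels)
    return {value: count / total for value, count in counts.items()}

# Hypothetical toy image with six pixels, flattened into a list.
toy_image = [0, 34, 34, 120, 34, 255]
toy_probs = relative_frequencies(toy_image)
print(toy_probs)

# Probabilities for the boat image from the pixel counts in the table.
total_pixels = 512 * 512
boat_counts = {0: 7, 34: 620, 120: 1433, 147: 5677, 168: 2432, 255: 2}
boat_probs = {k: c / total_pixels for k, c in boat_counts.items()}
for k, p in boat_probs.items():
    print(f"P(X = {k:3d}) = {p:.2e}")
```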

Boat Image: Histogram

Figure 2: The histogram of boat image

Construction of a Histogram
The steps to construct a histogram:

• Label the bin (class interval) boundaries on a horizontal scale.
• Mark and label the vertical scale with the frequencies or the relative frequencies.
• Above each bin, draw a rectangle whose height is equal to the frequency (or relative frequency) corresponding to that bin.

Another popular variation of the histogram is the cumulative frequency plot, for which the height of each bar is the total number of observations that are less than or equal to the upper limit of the bin.
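The bar heights for a histogram and a cumulative frequency plot can be sketched as follows. The data are the eight observations of Example 15; the bin edges and the convention that each bin includes its upper limit but not its lower limit are assumptions made for this illustration.

```python
from itertools import accumulate

def bin_counts(values, edges):
    """Count the values falling in each bin (edges[i], edges[i + 1]].

    A value equal to the lowest edge is not counted, so the lowest edge
    should be chosen strictly below the smallest observation.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] < v <= edges[i + 1]:
                counts[i] += 1
                break
    return counts

x = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
edges = [12.0, 12.5, 13.0, 13.5, 14.0]   # assumed bin boundaries

frequencies = bin_counts(x, edges)                  # histogram bar heights
relative = [c / len(x) for c in frequencies]        # relative frequencies
cumulative = list(accumulate(frequencies))          # cumulative frequencies

for lo, hi, f, r, c in zip(edges, edges[1:], frequencies, relative, cumulative):
    print(f"({lo}, {hi}]  freq={f}  rel={r:.3f}  cum={c}")
```

With bins closed on the right, each cumulative count equals the number of observations less than or equal to the upper limit of the bin, which matches the description of the cumulative frequency plot.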

Boat Image: Histogram with Different Bin Numbers

Figure 3: The histogram of boat image with different bin numbers

Relative Frequency vs Population

Ensemble Image Interpretation
Summary of image intensities:

• Sample Mean: 129.708
• Sample STD: 46.677
• Min: 0
• Max: 255
• Mode: 148 (5796)

Boat Image: Relative Frequency vs Gauss Model

Distribution Shapes

Boat Image: Cumulative Frequency

4 Stem-And-Leaf Diagram

Stem-And-Leaf Diagram: Example

Figure 4: Relationship between relative frequency and population.

Example 18. Given the height measurements (in cm) from a group of students,

{147, 117, 101, 149, 145, 105, 93, 94, 114, 104, 136, 140, 121, 145, 120, 142, 98, 135, 135, 132}

plot the stem-and-leaf diagram. Use the hundreds and tens digits as stems.

Stems  Leaves        Frequency
 9     3 4 8         3
10     1 4 5         3
11     4 7           2
12     0 1           2
13     2 5 5 6       4
14     0 2 5 5 7 9   6

Stem-And-Leaf Diagram: Construction Steps

• A stem-and-leaf diagram is a good way to obtain an informative display of a data set $x_1, x_2, \ldots, x_n$, where each number $x_i$ consists of at least two digits.
• To construct a stem-and-leaf diagram, use the following steps:

1. Divide each number $x_i$ into two parts: a stem consisting of one or more of the leading digits; and a leaf consisting of the remaining digit.
2. List the stem values in a vertical column.
3. Record the leaf for each observation beside its stem.
4. Write the units for stems and leaves on the display.

• If the stems are arranged in order, the diagram is called an ordered stem-and-leaf diagram.
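The construction steps can be sketched in Python for integer data, taking the last digit as the leaf and all leading digits as the stem. The function name is hypothetical, and stems with no observations are simply omitted, which is a simplification.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (leading digits) with the last digit as leaf."""
    groups = defaultdict(list)
    for v in sorted(values):          # sorting keeps the leaves in order
        stem, leaf = divmod(v, 10)    # step 1: split into stem and leaf
        groups[stem].append(leaf)     # step 3: record the leaf beside its stem
    return dict(sorted(groups.items()))   # step 2: stems in a sorted column

# Height measurements (in cm) from Example 18.
heights = [147, 117, 101, 149, 145, 105, 93, 94, 114, 104,
           136, 140, 121, 145, 120, 142, 98, 135, 135, 132]

diagram = stem_and_leaf(heights)
print("Stem | Leaves (stem: tens of cm, leaf: cm)")   # step 4: state the units
for stem, leaves in diagram.items():
    print(f"{stem:>4} | {' '.join(str(leaf) for leaf in leaves)}  ({len(leaves)})")
```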

Figure 5: The relative probability of boat image versus the Gaussian Model $\sim N(\mu, \sigma^2)$.

Figure 6: Histogram for symmetric and skewed distributions

Example: Alloy Compressive Strength

Example 19. Consider the alloy compressive strength data illustrated in Fig. 8.

Compressive Strength: Histogram

Compressive Strength: Cumulative Frequency

Stem-And-Leaf Diagram: Compressive Strength

Ordered Stem-And-Leaf Diagram: Compressive Strength

Median, Mode and Quartile

• The sample median is a measure of central tendency that divides the data into two equal parts, half below the median and half above.
• If the number of observations is even, the median is halfway between the two central values. If the number is odd, the median is the central value.
• The sample mode is the most frequently occurring data value.
• The mode in the compressive strength example is 158.

Figure 7: The cumulative frequency of boat image

Figure 8: The Compressive Strength of 80 Alloy Specimens

Quartile

• When an ordered set of data is divided into four equal parts, the division points are called quartiles.
• The first or lower quartile, $q_1$, is a value that has approximately 25% of the observations below it.
• The second quartile, $q_2$, has approximately 50% of the observations below it and is equal to the median.
• The third or upper quartile, $q_3$, has approximately 75% of the observations below it.
• The interquartile range is defined as $\mathrm{IQR} = q_3 - q_1$, which is a measure of variability.
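The median, mode, quartiles, and interquartile range can be computed with Python's standard `statistics` module. The sketch below uses the student heights of Example 18. Several conventions exist for interpolating quartiles; the `method="inclusive"` option used here is an assumption and may differ slightly from the convention behind the lecture's figures.

```python
import statistics

# Height measurements (in cm) from Example 18.
heights = [147, 117, 101, 149, 145, 105, 93, 94, 114, 104,
           136, 140, 121, 145, 120, 142, 98, 135, 135, 132]

# Median: halfway between the two central values, since n = 20 is even.
median = statistics.median(heights)

# Mode(s): multimode returns every value that attains the highest frequency.
modes = sorted(statistics.multimode(heights))

# Quartiles q1, q2, q3 (n=4 requests the three cut points).
q1, q2, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

print(f"median = {median}")
print(f"modes  = {modes}")
print(f"q1 = {q1}, q2 = {q2}, q3 = {q3}, IQR = {iqr}")
```

Two heights tie for the highest frequency in this data set, which is why `multimode` is used instead of a function that returns a single mode.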

Trimmed Mean

• The trimmed mean is a measure of center.
• It is designed to be unaffected by outliers.
• The trimmed mean is computed by arranging the sample values in order, "trimming" an equal number of them from each end, and computing the mean of those remaining.
• If p% of the data are trimmed from each end, the resulting trimmed mean is called the "p% trimmed mean".
• There are no hard-and-fast rules on how many values to trim.
• The most commonly used trimmed means are the 5%, 10%, and 20% trimmed means.

Figure 9: Histogram of compressive strength.

Figure 10: A cumulative distribution plot of compressive strength.

Trimmed Mean

Example 20. Assume that there are 24 observations:

30   75   79   80   80   105  126  138
149  179  179  191  223  232  232  236
240  242  245  247  254  274  384  470

Compute the mean, median, and 5%, 10%, and 20% trimmed means.

• The sample mean is

$$\frac{30 + 75 + \cdots + 384 + 470}{24} = 195.42$$

• The median is the average of the 12th and 13th numbers:

$$\frac{191 + 223}{2} = 207$$

• To compute the 5% trimmed mean, round off (0.05)(24) = 1.2 to 1. Drop 1 observation from each end, and then average the remaining 22 numbers:

$$\frac{75 + 79 + \cdots + 274 + 384}{22} = 190.45$$

• To compute the 10% trimmed mean, round off (0.1)(24) = 2.4 to 2. Drop 2 observations from each end, and then average the remaining 20:

$$\frac{79 + 80 + \cdots + 254 + 274}{20} = 186.55$$

• To compute the 20% trimmed mean, round off (0.2)(24) = 4.8 to 5. Drop 5 observations from each end, and then average the remaining 14:

$$\frac{105 + 126 + \cdots + 242 + 245}{14} = 194.07$$

Figure 11: The stem-and-leaf diagram of Fig. 8. Stem: tens and hundreds digits; Leaf: ones digits.
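The calculations of Example 20 can be reproduced with a short Python sketch. The helper `trimmed_mean` is a hypothetical name; it follows the rule used in the example, rounding $np/100$ to the nearest integer to decide how many observations to drop from each end.

```python
import statistics

# The 24 observations from Example 20.
data = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
        223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

def trimmed_mean(values, percent):
    """Mean after dropping round(n * percent / 100) values from each end."""
    ordered = sorted(values)
    k = round(len(ordered) * percent / 100)   # number trimmed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

mean = statistics.mean(data)
median = statistics.median(data)

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
for p in (5, 10, 20):
    print(f"{p:2d}% trimmed mean = {trimmed_mean(data, p):.2f}")
```

The two largest values, 384 and 470, pull the ordinary mean upward; trimming them reduces their influence, which is the purpose of the trimmed mean.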

5 Box Plots

Box Plots

Figure 12: Description of a box plot

• The box plot is a graphical display that describes several data features like center, spread, departure from symmetry, and outliers.
• A box plot displays the three quartiles, the minimum, and the maximum of the data on a rectangular box.
• The box width is the interquartile range.
• The left edge is located at $q_1$, while the right edge is located at $q_3$.
• A line between the two edges is drawn at $q_2$.
• A line, or so-called whisker, extends from each end of the box.
• The lower whisker is a line from $q_1$ to the smallest data point within 1.5 interquartile ranges from $q_1$.
• The upper whisker is a line from $q_3$ to the largest data point within 1.5 interquartile ranges from $q_3$.
• Data outside the whiskers are plotted as individual points.
• A point beyond a whisker but less than 3 interquartile ranges from the box edge is called an outlier.
• A point more than 3 interquartile ranges from the box edge is called an extreme outlier.
• Box plots are sometimes called box-and-whisker plots.
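The quantities a box plot displays can be computed directly from the rules above. The sketch below applies them to the 24 observations of Example 20; the function name is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since other conventions give slightly different box edges.

```python
import statistics

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 IQR and 3 IQR rules."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1

    # Whiskers end at the most extreme data points within 1.5 IQR of the box edges.
    lower_whisker = min(v for v in ordered if v >= q1 - 1.5 * iqr)
    upper_whisker = max(v for v in ordered if v <= q3 + 1.5 * iqr)

    # Points beyond the whiskers are outliers; beyond 3 IQR they are extreme outliers.
    beyond = [v for v in ordered if v < lower_whisker or v > upper_whisker]
    extreme = [v for v in beyond if v < q1 - 3 * iqr or v > q3 + 3 * iqr]
    outliers = [v for v in beyond if v not in extreme]

    return {"q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
            "whiskers": (lower_whisker, upper_whisker),
            "outliers": outliers, "extreme_outliers": extreme}

# The 24 observations from Example 20.
data = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
        223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For these data the largest observation lies beyond the upper whisker but within 3 interquartile ranges of the box, so it would be drawn as an individual outlier point rather than as an extreme outlier.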

Box Plot: Compressive Strength Example

Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
76.0    144.5    161.5   162.7   181.0    245.0

Figure 13: Box plot for the compressive strength example.
