Date post: | 03-Apr-2018 |
7/29/2019 Ch 2 BB Basic Statistics
1/82
Chapter 2:
Basic Statistics
Part A
Types of Data, Data Quality and Data Collection

Data
Data are facts or figures related to any characteristic (also called a variable) of an individual (e.g. a machine, a year, a casting, a dimension, a person).

Power station outages (up to 31/03/01 since commissioning)

Station  Date of        Availability  No. of   Avg. duration of    Avg. loss per outage (hours)  Main cause  Capacity
         commissioning  (%)           outages  non-stop operation  Forced        Planned         of outage   utilization
                                               (days)
C:15     12/11/98       92.59         30       27                  64            52              Leakage     High
C:16     10/05/97       93.04         47       28                  52            52              Leakage     Mod.
D        12/10/78       88.32         124      58                  261           164             Gen*        V. Low
E        31/12/84       82.77         116      42                  440           158             Gen*        Low
F        29/09/88       89.23         82       50                  379           79              Gen*        High

The columns are VARIABLES; the rows are INDIVIDUALS.
* Gen = Generator stator / rotor problem

Types of Data/Variable
Data/Variable
- Numerical/Quantitative: Continuous, Discrete
- Categorical/Qualitative: Ordinal, Nominal

Types of Data - Examples
Continuous: An infinite number of values (positive or negative) are possible, e.g. measurements of weight, length, chemical composition.
Discrete: The variable can take the values 0, 1, 2, 3, ..., e.g. counts or frequencies (# of defects, breakdowns etc.).
Ordinal: Data classified in ordered categories, e.g. quality of service classified as poor, moderate, good, or yearly rainfall classified as very low, low, moderate, good and very good.
Nominal: Data classified in categories having no inherent or explicit order, e.g. location classified as east, west, north, south, or names of departments.

Types of Data - Outage Data Example
Variable Name                                          Variable Type
1. Date of commissioning
2. Availability (%)
3. Number of outages since commissioning
4. Average duration of non-stop operation (days)
5. Average loss per outage (hours)
6. Main cause of outage
7. Capacity utilization

Types of Data - Further Considerations
Continuous data may appear as discrete either due to rounding (see the outage data example) or due to measurement limitations. We should treat such data as continuous unless the number of levels in the data set is very small (say 2-4).
However, hourly records of steam pressure at turbine inlet (station F) show that the values are either 126 or 127 or 128. Great care must be exercised while analyzing such data.
Discrete data having seven or more levels may be treated as continuous data.
Dichotomous data (O.K/Not O.K, Pass/Fail etc.) may be treated as discrete data after coding the two categories as 1 (O.K) and 0 (Not O.K).

Variable and Attribute Data
In the field of Quality Control, the various types of data are classified as
- VARIABLE DATA: Continuous data
- ATTRIBUTE DATA: All others, i.e. discrete data and counts of items falling in various categories (dichotomous, ordinal and nominal)
Henceforth we shall use this latter classification.

Data Gateway
Problem/Hypothesis -> [DATA COLLECTION] -> Data -> [DATA ANALYSIS] -> Solution/Fact
Quality problems cannot be solved merely on the basis of experience.
Any claim not backed by data is only a hypothesis.
Data Gates: The quality of the data gates and their placement at appropriate locations of a process are extremely important for process control.
Data Quality: The data collection step is vital: garbage in, garbage out.

Information Content in Data for Process Control
Source of Data                                   Attribute Data   Variable Data
General literature                               Very low         Low
Past data: In-house routine Q.C records          Low              Moderate
Past data: Statistically designed experiments    Moderate         High
Live data: Passive observation of the process    Moderate         High
Live data: Statistically designed experiments    High             Very high
Do not transform variable data to attribute data. That would be like burning diamonds for heat.

Data Collection Process
Data layout (INDIVIDUALS in rows, VARIABLES in columns):
          Var. 1   Var. 2   Var. 3   . . .   Var. p
Ind. 1    Data     Data     Data             Data
Ind. 2    Data     Data     Data             Data
Ind. 3    Data     Data     Data             Data
. . .
Ind. n    Data     Data     Data             Data
Steps: Population -> Sample -> Measurement -> Recording -> Editing, Storage, Retrieval

Linking Data Quality to Data Collection Process
[Matrix: process elements in rows against data quality problems in columns]
Columns: Wrong, Noisy, Irrelevant, Inadequate, Hard, Redundant
Rows (process elements):
- Population: Individual, Variables
- Sample: Procedure, Size
- Measurement: Gauge, Appraiser, Others
- Recording: Format, Recorder
- Editing, Storage, Retrieval: Issues related to database mgmt.

Measurement Related Causes for Poor Data Quality
[Cause-and-effect diagram; effect: Poor Data Quality (Measurement)]
- Gauges
  - Calibration: status (not done, done long back); results (not used, not traceable)
  - Number: many (variable least count, different makes)
  - Capability: operating range (beyond limit); type of data (unwanted); precision (low repeatability, low least count)
  - Operation: malfunctioning, breakdown
- Appraisers: bias, inadvertent error, number, reproducibility
- Measurand: unstable, inhomogeneous
- Method: standard procedure (not available, not followed), communication

Data Collection Planning - Principle of Inverse Loading
The Planning Questions [arrows on the slide: Plan, Execute]
1) What do you want to know?
2) How do you want to see what it is that you need to know?
3) What type of tool will generate what it is that you need to see?
4) What type of data is required of the selected tool?
5) Where can you get the required type of data?
Illustration
1) Has X any effect on Y?
2) [Sketches: distributions of Y at X1, X2, X3; plot of Y against X]
3) Histogram; scatter diagram
4) For the histogram: columns X1, X2, X3 holding Y11 ... Y1n, Y21 ... Y2p, Y31 ... Y3q. For the scatter diagram: pairs (X1, Y1) ... (Xn, Yn)
5) Final inspection and production log book; nowhere, to be collected

Check Sheet and Data Sheet
Check Sheet: Checks (/, ✓, x etc.) are made against a category of a variable or a combination of categories of several variables. Used primarily for collecting attribute data.
Data Sheet: Measurement results are recorded against an individual and its characteristics. Used for collecting both attribute and variable data.
Many consider all check sheets to be data sheets and vice versa. However, we shall distinguish between the two as above.

Process Distribution Check Sheet: Power Generation Process (Moving Target)
Month: September
Process average (Y1 bar): 420 MW
Characteristic: Y1 = Total generation (MW), Y2 = System demand
Sampling interval: Every 3.5 hours
Target: Min(420, Y1)    Data: Target - Y1 bar
[Check sheet with columns Class Interval, Check, Frq; only the fragment "55.01  4" survives]
Total no. of observations: 206
Import limit = +20 (wasteful import due to lack of control)
Export limit = -10 (wasteful export due to lack of control)
Defect rate = 27 %

Causes for Wasteful Import of Power
[Figure: Run chart of half-hourly readings of generation at station C15 in September 2001; vertical axis 0.0 to 35.0, horizontal axis readings 1 to 1340; regions marked A, B, C and D]
A: Process failure    B: Process deficiency    C: Early slow down    D: Late pick up

Defect Cause Check Sheet
Month: September, 2001    Data: # of hours of generation affected
Defect \ Station      C15   C16   D     E     F     Total
Process failure                                     52
Process deficiency                                  81
Early slow down                                     15
Late pick up                                        34
Total                 54    22    65    21    20    182
Note: The criticality of the defects is not the same over all stations.

Identifying Critical Causes for Wasteful Import

Hours of low generation
      C15     C16     D       E       F
PF    30      0       15      7       0
PD    11      9       36      14      11
ES    2       2       9       0       2
LP    11      11      5       0       7

x Average generation loss at each instant
      C15     C16     D       E       F
PF    29.0    29.5    107.0   103.5   110.0
PD    5       2       10      5       5
ES    10      4       30      -       15
LP    10      4       30      -       15

= Total generation loss (MWH)
      C15     C16     D       E       F       Total
PF    870     0       1605    725     0       3200
PD    55      18      360     70      55      558
ES    20      8       270     0       30      328
LP    110     44      150     0       105     409
Total 1055    70      2385    795     190     4495

PF = Process failure
PD = Process deficiency
ES = Early slow down
LP = Late pick up
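
The cell-by-cell multiplication behind the three tables can be checked with a short sketch. The dictionary layout and the helper name round_half_up are illustrative choices, and the "-" entries for station E are coded as 0.0 here (an assumption; they are multiplied by zero hours, so the result is unaffected).

```python
import math

stations = ["C15", "C16", "D", "E", "F"]

# Hours of low generation, per cause and station (first table).
hours = {
    "PF": [30, 0, 15, 7, 0],
    "PD": [11, 9, 36, 14, 11],
    "ES": [2, 2, 9, 0, 2],
    "LP": [11, 11, 5, 0, 7],
}

# Average generation loss at each instant (second table).
# Assumption: "-" entries for station E are coded as 0.0.
avg_loss = {
    "PF": [29.0, 29.5, 107.0, 103.5, 110.0],
    "PD": [5, 2, 10, 5, 5],
    "ES": [10, 4, 30, 0.0, 15],
    "LP": [10, 4, 30, 0.0, 15],
}


def round_half_up(value):
    """Round to the nearest integer, with halves rounded up (724.5 -> 725)."""
    return math.floor(value + 0.5)


# Total generation loss (MWH) = hours x average loss, cell by cell.
total_loss = {
    cause: [round_half_up(h * a) for h, a in zip(hours[cause], avg_loss[cause])]
    for cause in hours
}

cause_totals = {cause: sum(values) for cause, values in total_loss.items()}
station_totals = [
    sum(total_loss[cause][j] for cause in total_loss) for j in range(len(stations))
]
grand_total = sum(cause_totals.values())

for cause, values in total_loss.items():
    print(cause, values, cause_totals[cause])
print("Total", station_totals, grand_total)
```

Ranking the cause totals and station totals from this product, rather than from the hours alone, is what identifies the critical causes.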

Other Types of Check Sheets
Defective item check sheet
- Checks are made against various causes of rejection/rework of an item.
Defect location check sheet
- Instead of a table, a diagram is made of the defect space.
- Checks are made at the location where the defect occurs.
- Locational segregation of defects, if any, provides a valuable clue.
- Examples: leakage in a cooling system, cracks in castings, wear out of moving parts.

Other Types of Check Sheets (..Contd.)
Check-up confirmation check sheet
- Used to make a comprehensive check-up of product/process quality (usually at the final stage).
- Preprinted items of checks avoid duplication and omission of the tests to be performed.
- It is a variation of the check list, which is used for checking whether all the tasks have been performed or not.
C-E diagram check sheet
- Checks are made against the cause of a problem in the C-E diagram.

Data Sheet - General Format
Title
Common relevant information
Individual   Var. 1   Var. 2   . . .   Var. p   Remark
Ind. 1
Ind. 2
. . .
Ind. n
Important summary of data
Notes:

Data Sheet - Example
Up-load detention report for the month of July, 2001

Rake  Date  Arrival  Quality   # of    Form  Form   Depart.  Depart.  Deten.  Demur.  Reason      Actual unloading
No.         time               wagons  date  time   date     time     hours   hours               time (Hr.)
01    01    19.45    Enviro    58      02    05.35  02       15.30    09.55   -       -           09.00
.     .     .        .         .       .     .      .        .        .       .       .           .
20    14    07.50    Du. hill  58      15    16.45  16       00.20    07.35   23      S(19)+I(4)  14.30
.     .     .        .         .       .     .      .        .        .       .       .           .
42    31    20.20    .         .       .     .      .        .        .       .       .           14.45

Purpose?
- Estimation of demurrage hours
- Control of demurrage hours
Important reasons cited are receipt in quick succession, successive detentions and wet coal. These are beyond the control of the coal handling section.
Inadequate Data!
Part B
Summarization of Data

Data Analysis - Getting Started
Half-hourly record of generation by station E from 19/9/01 (10 hrs.) to 21/9/01 (1.30 hrs.) under normal operating conditions

Hours            Generation (MW)
10.00 - 13.30    102.8  105.2  103.2  104.0  105.2  104.8  105.6  105.0
14.00 - 17.30    105.0  104.0  104.0  105.2  106.0  106.4  103.2  104.2
18.00 - 21.30    102.0  103.6  103.8  105.0  105.2  105.2  106.0  105.0
22.00 - 01.30    103.0  103.2  103.0  103.0  104.2  105.8  105.4  104.8
02.00 - 05.30    104.8  105.2  105.2  106.0  104.0  104.2  103.8  104.4
06.00 - 09.30    104.0  102.2  103.4  104.4  104.4  104.2  104.8  106.2
10.00 - 13.30    106.4  104.8  102.8  103.6  104.8  104.4  104.8  104.0
14.00 - 17.30    104.0  104.0  104.0  104.0  104.4  104.0  102.6  103.0
18.00 - 21.30    104.8  102.8  104.0  103.4  103.6  104.0  104.0  103.4
22.00 - 01.30    106.0  104.4  104.4  102.4  102.8  105.0  105.2  105.2

What are your conclusions?

Frequency Distribution - Analyzing a large data set on the same variable
Generation data set (previous slide): the eighty observations are grouped in eight classes of equal length.

Class Interval    Tally    Frequency
101.7 - 102.3              02
102.3 - 102.9              06
102.9 - 103.5              10
103.5 - 104.1              19
104.1 - 104.7              11
104.7 - 105.3              22
105.3 - 105.9              03
105.9 - 106.5              07
Total                      80

Does the frequency distribution provide better insight into the process?
DATA + ANALYSIS = INFORMATION
Data are not information

Constructing Frequency Distributions - Variable Data
Data set
- Number of observations (N): about 100 on the same variable.
Formation of the classes (first column)
Number of classes (k)
- Too many classes obscure the pattern of the distribution due to sampling fluctuations. Details are lost with too few classes. The optimum number of classes is given by $k = 1 + 3.3 \log_{10}(N)$.
- The simpler formula $k = \sqrt{N}$ also works well in practice.
- For better visual impact, it is preferable to have $5 \le k \le 12$.
- For the generation data set we have N = 80. Therefore, k = 1 + 3.3*log10(80) = 7.3. This means the number of classes should be either 7 or 8. We have chosen 7 classes.
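
As a quick check of the two rules of thumb for the number of classes, here is a minimal sketch. The function names are illustrative, and the simpler rule is taken to be the square-root rule, k = sqrt(N).

```python
import math


def sturges_classes(n):
    """Number of classes from k = 1 + 3.3 * log10(N)."""
    return 1 + 3.3 * math.log10(n)


def sqrt_classes(n):
    """Number of classes from the square-root rule (assumed form of the simpler formula)."""
    return math.sqrt(n)


n = 80  # size of the generation data set
k_sturges = sturges_classes(n)
k_sqrt = sqrt_classes(n)

print(f"log rule: {k_sturges:.1f}, square-root rule: {k_sqrt:.1f}")
```

Both values fall inside the preferred band of 5 to 12 classes; the logarithmic rule points to 7 or 8 classes for N = 80.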

Constructing Frequency Distributions (..Contd.)
Class width (h)
- $h = (R + w) / k$, where R = range of the observations = Maximum - Minimum, and w = least count of measurement.
- Next, h is rounded to the nearest integer multiple of w. This means that if the least unit of measurement (w) is 0.1, then h = 2.312 should be rounded to 2.3. However, if w = 0.2, then the same h should be rounded to 2.4.
- In our generation data example, R = 106.4 - 102.0 = 4.4, and w = 0.2. Thus, h = (4.4 + 0.2) / 7 = 0.657, which is rounded to 0.6. We shall explain later why taking h = 0.7 would be erroneous.
- Note that if h is rounded down then we shall need (k+1) classes to cover the whole range of the observations. How many classes shall we need if h is rounded up?
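
The rounding of the class width to a multiple of the least count can be sketched as follows. The function names are illustrative; note that Python's built-in round() sends exact halves to the nearest even integer, which does not matter for the values used here.

```python
def round_to_multiple(h, w):
    """Round h to the nearest integer multiple of the least count w."""
    multiples = round(h / w)
    # The outer round() only removes floating-point noise such as 0.6000000000000001.
    return round(multiples * w, 10)


def class_width(maximum, minimum, w, k):
    """Class width h = (R + w) / k, rounded to a multiple of w."""
    raw = (maximum - minimum + w) / k
    return round_to_multiple(raw, w)


print(round_to_multiple(2.312, 0.1))
print(round_to_multiple(2.312, 0.2))
print(class_width(106.4, 102.0, 0.2, 7))
```

The three calls correspond to the two rounding illustrations (w = 0.1 and w = 0.2) and to the generation data example.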

Constructing Frequency Distributions (..Contd.)
Class limits
- The minimum value of the generation data is 102.0 and the class width has been determined as 0.6. So we can form the classes as
  102.0 - 102.6, 102.7 - 103.3, 103.4 - 104.0, . . .
- The problem with the above classification is that there is a gap between two successive class intervals. This is not desirable since we are dealing with continuous data.
- Discontinuity can be removed by forming the classes as
  102.0 - 102.6, 102.6 - 103.2, 103.2 - 103.8, . . .
- However, this classification has another problem. Suppose we have an observation 102.6. In which class shall we place it, first or second?
- In order to avoid such confusion we take
  Lower limit of the first class = Minimum - w/2
  and then successively add the class width to this lower limit to obtain the other class limits.

Constructing Frequency Distributions (..Contd.)
Class limits (..Contd.)
Thus, for the generation data (lower limit of the first class = 102.0 - 0.2/2 = 101.9) we have the classes as
  101.9 - 102.5   102.5 - 103.1   103.1 - 103.7   103.7 - 104.3
  104.3 - 104.9   104.9 - 105.5   105.5 - 106.1   106.1 - 106.7
Note that now we have
- 8 classes (since h has been rounded down from 0.657 to 0.6),
- no confusion in classification (since no observation falls on a class limit), and
- an extended last class (ideally the upper limit of the last class should have been 106.5).
In the example, we have extended the first class instead of the last one, since this brings out the process abnormalities better. Thus the eight classes used are
  101.7 - 102.3, 102.3 - 102.9, ..., 105.9 - 106.5

Constructing Frequency Distributions (..Contd.)
Tally marking (second column)
- Start with the first observation. Find the class to which the observation belongs. Put a tally against the class.
- Classify all the remaining observations as above.
- Tally marks are grouped in fives, with the fifth tally crossed through the previous four. This provides a better visual display and helps in counting the frequency of each class.
- Note that all the observations get classified as we go through them only once. If instead we concentrate on one class at a time and try to find the number of observations in it, we have to go through the observations k times. This not only consumes more time but also increases the chance of committing an error.
Counting frequency (third column)
- The frequency (f) of each class is obtained simply by counting the tallies.
Other columns
- Columns giving cumulative frequency (f1, f1+f2, ..) and relative frequency (f1/N, f2/N, ..) may also be added, if required.
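
The single-pass tallying described above can be sketched for the generation data, using the eight classes that start at 101.7 with width 0.6. The variable names are illustrative; computing the class index by integer truncation is safe here only because no observation falls on a class limit.

```python
from itertools import accumulate

# Half-hourly generation (MW) of station E, 80 observations.
generation = [
    102.8, 105.2, 103.2, 104.0, 105.2, 104.8, 105.6, 105.0,
    105.0, 104.0, 104.0, 105.2, 106.0, 106.4, 103.2, 104.2,
    102.0, 103.6, 103.8, 105.0, 105.2, 105.2, 106.0, 105.0,
    103.0, 103.2, 103.0, 103.0, 104.2, 105.8, 105.4, 104.8,
    104.8, 105.2, 105.2, 106.0, 104.0, 104.2, 103.8, 104.4,
    104.0, 102.2, 103.4, 104.4, 104.4, 104.2, 104.8, 106.2,
    106.4, 104.8, 102.8, 103.6, 104.8, 104.4, 104.8, 104.0,
    104.0, 104.0, 104.0, 104.0, 104.4, 104.0, 102.6, 103.0,
    104.8, 102.8, 104.0, 103.4, 103.6, 104.0, 104.0, 103.4,
    106.0, 104.4, 104.4, 102.4, 102.8, 105.0, 105.2, 105.2,
]

lower_limit = 101.7  # lower limit of the first class
width = 0.6          # class width h
k = 8                # number of classes

# One pass over the data: each observation adds one tally to its class.
frequency = [0] * k
for x in generation:
    class_index = int((x - lower_limit) / width)
    frequency[class_index] += 1

cumulative = list(accumulate(frequency))
relative = [f / len(generation) for f in frequency]

for i, f in enumerate(frequency):
    lo = lower_limit + i * width
    print(f"{lo:.1f} - {lo + width:.1f}  {f:2d}  {cumulative[i]:2d}  {relative[i]:.3f}")
```

Each observation is visited exactly once, which is the point made about tallying versus scanning the data k times.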

Constructing Frequency Distributions - Getting the class intervals right
Why class width (h) is rounded to the nearest integer multiple of w
- Consider the same generation data example. Here w = 0.2. Assume that h = 0.657 is rounded to 0.7 (which is not an integer multiple of 0.2) instead of 0.6. Thus the classes will be 101.9 - 102.6, 102.6 - 103.3, ..
- Now, in order to overcome the problem of classifying observations like 102.6, we are forced to consider w = 0.1 and have the classes as
  101.95 - 102.65, 102.65 - 103.35, 103.35 - 104.05, 104.05 - 104.75, 104.75 - 105.45, 105.45 - 106.15, 106.15 - 106.85
- Note that the number of observation units covered by each class is not the same. For example, the second class covers three units (102.8, 103.0 and 103.2) but the third class covers four units (103.4, 103.6, 103.8 and 104.0). As a result, the frequency distribution is likely to show many peaks.
Balancing end points
- Assuming w = 0.1, the seven classes shown above should be appropriate. However, note that the last class is extended by four units beyond the maximum observed value of 106.4. It is desirable to distribute this imbalance between the two end classes by starting the first class from 101.75 and ending the last at 106.65.

Frequency Distribution of the Generation Data - Further analysis
The frequency distribution shows an abnormal pattern (peaks in nearly alternate classes). Does this mean the process mean is jumping randomly by about 1.2 units?
The following two frequency distributions, constructed from the same data, provide some additional clues.

Fractional part    Frequency
.0                 27
.2                 18
.4                 15
.6                 5
.8                 15
Total              80
0s occur more frequently at the cost of 6s. Does this indicate measurement bias?

Class interval     Frequency
101.7 - 102.7      04
102.7 - 103.7      17
103.7 - 104.7      26
104.7 - 105.7      25
105.7 - 106.7      08
Total              80
Smooth pattern (left skewed). Smoothness has been achieved not only by reducing the number of classes but also by including the adjacent 0s and 6s in the same interval.

Histogram
A histogram is a graphical representation of a frequency distribution of variable data.
The histogram of the generation data having five classes is shown below.
[Figure: histogram; horizontal axis: Generation in E station (MW), marked 101.7, 103.7, 105.7; vertical axis: Frequency, 0 to 30]
- Bars of equal width (= class width)
- Heights of the bars are proportional to the frequencies of the classes
- Bar width of about 1 cm (7-10 classes)
- Horizontal axis is about 1.6 times longer than the vertical axis
Central tendency: About 104.2.
Pattern of variation: Slightly left skewed.
Specification limits: Should be shown wherever applicable.
Class mid-point: Marking the class mid-points may be helpful in certain cases.
Open ended classes: Avoid adding too many classes at the ends having zero or very low frequencies. Shown as open ended bars with arbitrarily reduced heights.

Construction of Histogram - An exercise
A half-hourly record of power (MW) generated by station E from 29.9.2001 (10.00 hours) to 30.9.2001 (24.00 hours) gives us the following data.
6.4 6.4 6.8 6.0 5.2 4.8 6.4 4.4 5.2 6.0
7.6 8.0 7.4 6.6 8.0 5.6 7.2 7.2 7.0 4.0
6.4 8.0 8.0 6.0 6.0 6.4 7.8 7.6 7.6 7.4
7.6 7.6 7.4 4.6 4.2 4.8 6.0 5.6 5.4 5.0
6.2 7.8 7.4 7.2 7.4 7.8 6.6 6.4 6.8 6.8
6.8 6.8 6.6 6.8 6.6 6.8 6.8 6.8 7.0 7.0
6.0 5.6 4.4 4.6 4.6 4.8 6.2 7.0 6.6 6.4
5.2 5.2 7.2 7.4 6.0 5.0 7.0 7.6 7.6 7.4
5.2 7.2 7.2 7.0 7.2 6.8 6.0 6.0 6.0 5.2
Construct a histogram of the above data set. Compare it with the histogram for the period 19.9.01 to 21.9.01 (previous slide) and offer your comments.
Row labels: 29/9 (10 hrs.) at the start of the data, 30/9 (24 hrs.) at the end.

Commonly Observed Histogram Patterns
- Single peak, symmetric, bell shaped: the commonly observed pattern of a stable process
- Single peak, positively skewed (long tail on the right)
- Single peak, negatively skewed (long tail on the left)
  Many characteristics follow such patterns. We have already seen that generation data is negatively skewed while breakdown data is positively skewed. However, such shapes may also indicate process instability.
- Single peak, thick tail [LSL and USL marked on the sketch]
- Two peaks (bi-modal)

Frequency Distribution of Discrete Data
Number of plant outages in each year since commissioning

Station  Period              Type of outage  # of outages in a year
D        1978-79 to 2000-01  Forced          2, 3, 1, 0, 3, 2, 1, 0, 2, 2, 0, 2, 3, 0, 2, 1, 2, 1, 1, 0, 1, 0, 2
                             Planned         3, 5, 1, 4, 2, 5, 2, 1, 6, 3, 7, 7, 4, 7, 6, 5, 6, 4, 2, 2, 2, 6, 2
E        1985-86 to 2000-01  Forced          2, 2, 5, 3, 0, 0, 1, 0, 1, 0, 2, 1, 1, 0, 1, 4
                             Planned         15, 7, 8, 3, 7, 5, 2, 6, 3, 8, 7, 4, 5, 4, 3, 4
F        1988-89 to 2000-01  Forced          4, 1, 1, 0, 0, 1, 1, 2, 0, 1, 0, 1, 6
                             Planned         3, 11, 6, 12, 4, 0, 1, 2, 8, 2, 4, 4, 6

Ideally we should construct six frequency distributions (one for each type of outage in each station). However, due to shortage of data we shall construct only two: one for forced outages and the other for planned outages.
What can you say about the occurrence of the two types of outages from the above data set?
Summary Measures

Summary Measures of a Univariate Data Set
Type                                 Commonly Used Measures*
Measures of Location or Centre       Mean, Median, Mode, Trimmed Mean, Geometric Mean
Measures of Spread or Variability    Range, Standard Deviation, Entropy (for nominal data)
Measures of Shape                    Skewness, Kurtosis
General Measure                      Quartiles
* There are a host of other measures developed for specific applications

Arithmetic Mean
- Usually referred to as MEAN or AVERAGE
- May be used for ordinal data but not for nominal data
- Sensitive to extreme values
MEAN = (Sum of all the observations) / (Number of observations) = $(X_1 + X_2 + X_3 + \dots + X_{N-1} + X_N) / N$
Notation: $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
Example: In a rising voltage test, the alternating breakdown voltages (kV) of 24 samples of an insulation arrangement were found to be as follows:
210; 208; 208; 175; 182; 206; 190; 194; 198; 205; 212; 200; 205; 202; 207;
210; 202; 201; 188; 205; 209; 201; 216; 196
MEAN = [210 + 208 + ... + 216 + 196] / 24 = 201.25 kV
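
The arithmetic in this example is easy to verify with a few lines; the variable names are illustrative.

```python
# Alternating breakdown voltage (kV) of the 24 insulation samples.
voltages = [
    210, 208, 208, 175, 182, 206, 190, 194, 198, 205, 212, 200,
    205, 202, 207, 210, 202, 201, 188, 205, 209, 201, 216, 196,
]

# Mean = sum of all the observations / number of observations.
mean_voltage = sum(voltages) / len(voltages)
print(f"N = {len(voltages)}, sum = {sum(voltages)}, mean = {mean_voltage} kV")
```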

Mean of Grouped Data
Notations
- Class: i (= 1, 2, ..., k)
- Frequency of the ith class: $f_i$
- Value of the ith class: $M_i$ (class mid-point if class width > least count)
Formula: $\bar{X} = \sum_{i=1}^{k} f_i M_i / \sum_{i=1}^{k} f_i$
Example: The observations {1.3, 1.3, 1.5, 3.3, 3.5, 3.5, 3.5, 3.6, 5.4, 5.4, 5.8, 7.3, 7.4, 7.4, 9.1} are grouped as follows:

i        Class Interval   Mi      fi    fi * Mi
1        1.25 - 3.25      2.25    3     06.75
2        3.25 - 5.25      4.25    5     21.25
3        5.25 - 7.25      6.25    3     18.75
4        7.25 - 9.25      8.25    4     33.00
Total (Σ)                         15    79.75

Mean = 79.75/15 = 5.32
Mean of ungrouped data = 4.62. Thus the error due to grouping is 5.32 - 4.62 = 0.7, which is close to the maximum value possible, i.e. (class width/2) = 1.0. WHY?
In general, the error will not be so large. Nevertheless, it is recommended to use the individual observations for computing the mean whenever possible.
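
The grouped-data formula can be applied directly to the mid-points and frequencies of the table above; this is a minimal sketch with illustrative variable names.

```python
# Class mid-points (Mi) and frequencies (fi) from the grouped table.
mid_points = [2.25, 4.25, 6.25, 8.25]
frequencies = [3, 5, 3, 4]

# Grouped mean = sum(fi * Mi) / sum(fi).
weighted_total = sum(f * m for f, m in zip(frequencies, mid_points))
n = sum(frequencies)
grouped_mean = weighted_total / n

print(f"sum(fi*Mi) = {weighted_total}, N = {n}, mean = {grouped_mean:.2f}")
```

Every observation in a class is replaced by the class mid-point, which is why the grouped mean can differ from the mean of the individual observations.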

Interpretation of Mean
[Figure: Dot plot of the breakdown voltage data (previous slide); axis from 170 to 220; Mean = 201.25 marked]
- Mean is the balance point (or fulcrum) for the distribution of the values.
- Mean is analogous to the centre of gravity.
- In the case of a unimodal and symmetric distribution, the mean also indicates the central tendency of the distribution and may be interpreted as a TYPICAL VALUE.
- In the above example, the observations are not symmetrically distributed around the mean. The distribution is skewed to the left. Consequently, the mean should be interpreted here as a measure of centre or location and not as a measure of central tendency or typical value.

Misuse of Mean
[Figure labels: Landfill Site, Dioxin]
Presently, WHO has classified dioxin as a known human carcinogen.
Question: Are the people in the neighborhood of the landfill site safe with respect to exposure to dioxin?
Data: Dioxin content in soil samples taken from a large residential area in the neighborhood of the site.
Answer: Yes, since the average dioxin content in the samples is found to be less than the permissible limit.
Critique: Individuals are not exposed to average soil levels; they are exposed to the dioxins/furans present in the air they breathe, the food they eat and the water they drink. The higher exposure of residents living in the vicinity of the site is not averaged out by the lower exposure of residents ten miles away.

Properties of Mean
P1: The sum of the deviations of all the observations from the mean is always zero. In notation, $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$, i.e. the sum of the negative deviations equals the sum of the positive deviations in magnitude.
P2: Data transformation
(i) Let $Y_i = X_i - k$. Then $\bar{Y} = \bar{X} - k$.
(ii) Let $Y_i = k X_i$. Then $\bar{Y} = k \bar{X}$.
(iii) Let $Y_i = X_i / k$. Then $\bar{Y} = \bar{X} / k$.
These three properties are frequently used to reduce the size of the data values, which in turn reduces both computational load and error.
An example follows.

Properties of Mean (Contd.)
Example of P2: Data Transformation
Outer diameter (X) of tubular glass shell (Specification: 37.5 ± 0.8 mm)

i       Xi       Yi = 37.5 - Xi    Zi = Yi*100
1       37.46     0.04               4
2       36.66     0.84              84
3       37.44     0.06               6
4       37.85    -0.35             -35
5       37.36     0.14              14
6       36.95     0.55              55
7       37.62    -0.12             -12
8       36.96     0.54              54
9       37.12     0.38              38
10      37.36     0.14              14
TOTAL                              269 - 47 = 222

Thus $\bar{Z}$ = 222/10 = 22.2.
Since Yi = Zi/100, using property (iii) of P2 we have $\bar{Y}$ = 22.2/100 = 0.222.
Further, since Xi = 37.5 - Yi, using property (i) of P2 we have $\bar{X}$ = 37.5 - 0.222 = 37.278.
In this case the Zi values are very large because the least count of measurement used is too small. Using a gauge having a least count of 0.1 mm and recording the deviations from the TARGET would have been better.
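
The coding trick can be checked against the direct calculation with a short sketch. Variable names are illustrative, and round() is used only to recover the whole-number codes Zi from floating-point arithmetic.

```python
# Outer diameter (mm) of the ten tubular glass shells.
x = [37.46, 36.66, 37.44, 37.85, 37.36, 36.95, 37.62, 36.96, 37.12, 37.36]
target = 37.5

# Coded values: Zi = (37.5 - Xi) * 100, which are small whole numbers.
z = [round((target - xi) * 100) for xi in x]

# Work with the coded values, then undo the transformation (properties iii and i).
z_mean = sum(z) / len(z)
y_mean = z_mean / 100
x_mean_from_codes = target - y_mean

# Direct calculation on the raw data, for comparison.
x_mean_direct = sum(x) / len(x)

print(z, sum(z))
print(round(x_mean_from_codes, 3), round(x_mean_direct, 3))
```

Both routes give the same mean; the coded route only needs additions of small integers.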

Properties of Mean (Contd.)
P3: The sum of the squared deviations of a set of observations is minimum when the deviations are taken from the mean of the observations. In notation, $\sum (X_i - \bar{X})^2 < \sum (X_i - M)^2$ for every $M \ne \bar{X}$.
Implication: Suppose we have production figures for the last twenty days and want to predict the production of the 21st day, assuming the production conditions remain the same. Then the best prediction is the average of the past twenty days' data, provided the loss due to prediction error is proportional to the square of the error.
P4: Sample mean is more stable than other possible measures of center. We shall see this later.
P5: Mean is strongly affected by extreme values. This is a disadvantage of the mean compared with other measures of center. However, routine trimming of extreme values is not recommended unless the measurements are subjective in nature. Genuine outliers must, of course, be eliminated from the data set.

Pooled Mean
              No. of observations   Average                              Total
Data Set 1    $n_1$                 $\bar{X}_1$                          $n_1 \bar{X}_1$
Data Set 2    $n_2$                 $\bar{X}_2$                          $n_2 \bar{X}_2$
. . .
Data Set k    $n_k$                 $\bar{X}_k$                          $n_k \bar{X}_k$
All (Pooled)  $\sum n_i$            $\sum n_i \bar{X}_i / \sum n_i$      $\sum n_i \bar{X}_i$

Example: Process averages in three shifts are found to be 15, 12 and 13 based on 30, 40 and 20 observations respectively. Then the process average for the day is
(15*30 + 12*40 + 13*20) / (30 + 40 + 20) = 1190/90 = 13.22 [whereas (15 + 12 + 13)/3 = 13.33]
Pooled mean: $\sum n_i \bar{X}_i / \sum n_i = \sum \bar{X}_i / k$ (WHEN?)
Note that the formula for the mean of grouped data is similar to the above.
A related concept is that of the weighted mean. An application example follows.
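
For the three-shift figures (averages 15, 12 and 13 from 30, 40 and 20 observations), the pooled mean can be computed as below. This is a minimal sketch with illustrative names; the weighted total works out to 450 + 480 + 260 = 1190.

```python
# Shift-wise sample sizes and averages.
sizes = [30, 40, 20]
averages = [15, 12, 13]

# Pooled mean = sum(ni * mean_i) / sum(ni).
weighted_total = sum(n * m for n, m in zip(sizes, averages))
pooled_mean = weighted_total / sum(sizes)

# Simple (unweighted) average of the shift means, for comparison.
simple_mean = sum(averages) / len(averages)

print(f"pooled = {weighted_total}/{sum(sizes)} = {pooled_mean:.2f}, simple = {simple_mean:.2f}")
```

The two averages differ because the shifts contribute unequal numbers of observations.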

Weighted Mean - An Application Example
[Figure: process flow diagram with stages labelled Molten glass, Drawing & Cutting, Tube, Blowing, Glass shells, Glazing and Sorting (Reject / Accept)]
Assume the total number of shells produced in a shift is 24000. In a particular shift 8% of the shells produced are found to be rejected. We want to estimate the average outer diameter of all the shells produced in the shift.
Samples cannot be taken before sorting. So 50 shells are randomly selected from each of the two streams (reject and accept). The average diameters of the 50 shells in the reject and accept groups are found to be 37.7 mm and 37.6 mm respectively.
Shift average = Weighted mean of the averages of the two streams = (0.08 * 37.7 + 0.92 * 37.6) / (0.08 + 0.92) = 37.61.
OD Sensor
Weighted Mean: $\bar{X}_w = \sum w_i X_i / \sum w_i$
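
A small helper makes the weighted-mean formula concrete for the two-stream example with 50 shells per stream; the function name weighted_mean is illustrative.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(wi * Xi) / sum(wi)."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Stream averages (mm) and the proportion of production in each stream.
stream_averages = [37.7, 37.6]   # reject, accept
stream_weights = [0.08, 0.92]    # 8% rejected, 92% accepted

shift_average = weighted_mean(stream_averages, stream_weights)
print(f"Shift average = {shift_average:.2f} mm")
```

Using the production shares as weights keeps the small reject stream from counting as much as the accept stream, even though both were sampled equally.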

Weighted Mean - An Application Example (Contd.)
However, it would have been better to take more samples from the reject stream. (WHY?) Because of the higher variation expected in this stream.
Assume 100 shells (instead of 50) were selected from the reject stream and gave the same average (37.7 mm).
Now the weight of each sampled shell is given by $w_i = p_i N / f_i$. (WHY?)
- $p_i$ = proportion of the ith category
- N = total sample size
- $f_i$ = sample size of the ith category
If the samples were selected randomly from the total population, the number of samples expected in the ith category would be $p_i N$. Since we have selected $f_i$ samples, we must compensate for this by a factor of $p_i N / f_i$.
In our example, p1 = 0.08, p2 = 0.92, f1 = 100, f2 = 50, N = 150. Thus w1 = (0.08 * 150) / 100 = 0.12 and w2 = (0.92 * 150) / 50 = 2.76. Applying these weights to every sampled shell gives the shift average as (0.12 * 100 * 37.7 + 2.76 * 50 * 37.6) / (0.12 * 100 + 2.76 * 50) = 5641.2 / 150 = 37.61, the same as before: the larger reject sample improves the precision of the reject-stream average, not its weight.

Median and Mode
Median
- Ordinal data: the category containing the (N+1)/2 th case.
- Numerical data: the (N+1)/2 th ordered observation when N is odd, and the average of the N/2 th and (N/2)+1 th ordered observations when N is even.
- Can be computed even for open ended classes at the extremes, provided each of the end classes contains less than 50% of the observations.
- Insensitive to outliers.
Mode
- The category or value occurring with the greatest frequency.
- The only measure of center for nominal data.
- May not be unique, and is highly sensitive to how the classes or categories are formed.
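
As an illustration, the median and mode of the 24 breakdown voltages from the arithmetic mean example can be computed with Python's standard statistics module. Reusing that data set here is a choice made for this sketch, not something the slide does.

```python
import statistics

# Breakdown voltage data (kV), N = 24 (even).
voltages = [
    210, 208, 208, 175, 182, 206, 190, 194, 198, 205, 212, 200,
    205, 202, 207, 210, 202, 201, 188, 205, 209, 201, 216, 196,
]

ordered = sorted(voltages)
n = len(ordered)

# N is even: average of the N/2 th and (N/2)+1 th ordered observations.
median_by_rule = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

median = statistics.median(voltages)
mode = statistics.mode(voltages)
mean = statistics.mean(voltages)

print(f"median = {median}, mode = {mode}, mean = {mean}")
```

The median and mode both lie above the mean, which is consistent with a few low values pulling the mean down.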

Caveat: Don't Trust Centre Alone
[Figure: a person of height H wading into a lake; mean depth = D < H]
Statisticians tell the story of people who drowned by wading into a lake with an average depth of 3 feet.
[Figure: distributions of marks obtained by students of two schools, School A and School B, with the median marked on each. Which school is better?]
Mean may not tell you all you need to know. Pay attention to variation as well.

Standard Deviation
Standard deviation is the most important measure of variability in a data set.
Let {X1, X2, ..., Xn} be a sample data set and $\bar{X}$ be the mean of the observations.
Variability is measured in terms of the deviations of the observations from the mean. For our data set, the deviations are $(X_i - \bar{X})$, i = 1, 2, ..., n.
Next, these deviations are summarized to obtain a single value for reporting variability.
Recall from property P1 of the mean that the sum of the deviations is always zero. So we cannot summarize by simply taking the average of the deviations.
The mathematical trick used to get rid of this difficulty (negative deviations) is to square the deviations and then average the squares. So we compute $\sum (X_i - \bar{X})^2 / (n - 1)$. The reason for using (n - 1) instead of n as the divisor will be explained later.
Finally, the effect of squaring is neutralized by taking the square root of the above average to obtain the quantity called standard deviation. So we have
Sample standard deviation: $s = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$

Computing Standard Deviation
Read downwards (Root - Mean - Square - Deviation); compute upwards (deviation first):
Root:       $\sqrt{\sum (X_i - \bar{X})^2 / (n - 1)}$
Mean:       $\sum (X_i - \bar{X})^2 / (n - 1)$
Square:     $(X_i - \bar{X})^2$
Deviation:  $X_i - \bar{X}$

A Numerical Example
i       Xi     Xi - X̄    (Xi - X̄)²
1       4      -3         9
2       7       0         0
3       2      -5         25
4       5      -2         4
5       11      4         16
6       2      -5         25
7       10      3         9
8       7       0         0
9       15      8         64
10      9       2         4
11      5      -2         4
Total   77      0         160

Mean Square Deviation = 160 / (11 - 1) = 160 / 10 = 16
Root Mean Square Deviation or Standard Deviation = $\sqrt{16}$ = 4.
Shorter Method
$\sum (X_i - \bar{X})^2 = \sum X_i^2 - (\sum X_i)^2 / n = 699 - 77^2/11 = 160$
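
The numerical example can be reproduced step by step, with the standard library's statistics.stdev (which also uses the n - 1 divisor) as a cross-check. Variable names are illustrative.

```python
import math
import statistics

data = [4, 7, 2, 5, 11, 2, 10, 7, 15, 9, 5]
n = len(data)
mean = sum(data) / n

# Long method: deviation -> square -> mean (divisor n - 1) -> root.
deviations = [x - mean for x in data]
sum_sq_dev = sum(d ** 2 for d in deviations)
mean_sq_dev = sum_sq_dev / (n - 1)
std_dev = math.sqrt(mean_sq_dev)

# Shorter method: sum(Xi^2) - (sum(Xi))^2 / n gives the same sum of squares.
shortcut_sum_sq = sum(x * x for x in data) - sum(data) ** 2 / n

print(f"mean = {mean}, sum of squared deviations = {sum_sq_dev}")
print(f"shortcut = {shortcut_sum_sq}, s = {std_dev}")
print(f"statistics.stdev = {statistics.stdev(data)}")
```

The shorter method avoids computing each deviation, but on data with a large mean and small spread it can lose precision in floating-point arithmetic, so the long form is the safer default in code.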

Interpretation of Standard Deviation
Let us be honest. It is not easy to interpret standard deviation.
Literally speaking, standard deviation is a measure of the closeness of the data values to their mean. However, the difficulty in interpretation arises because the closeness depends on two things: the range of the data values and the distribution of the values within the range.
[Figure: six dot plots on a common axis from 2 to 9, each labelled with its standard deviation]
The six cases have identical mean (= 5.5) and range (= 7), but the maximum s.d. is about twice the minimum value. Compare the distributions having s.d. of 2.07 and 2.14.
Application of Standard Deviation - Rake weight data
[Figure: dot plots of rake weight on a scale from 3100 to 3500 Ton, one panel each for Indigenous and Imported coal]
Dot plots of weight of 24 rakes of coal received during January - June 2002. Four rakes have been selected randomly from each of the six months. All the rakes in the sample consist of 58 wagons.
Source       Mean (Ton)   Range (Ton)   $\sigma_{n-1}$ (Ton)
Indigenous   3211.5       219.5         68.8
Imported     3401.8       189.6         40.7
The ranges of the two distributions do not differ as much as the standard deviations do. We shall see later that higher variation of indigenous coal implies higher inventory cost.
Part C
Population, Sample and Probability Distribution
Population
A statistical population is a set of values or attributes
of the characteristic(s)
of a set of well defined objects
belonging to a specified group and/or period
                 Example 1        Example 2
Characteristic   Height           Ash content
Object           of adult males   in lots of coal
Group            of India         received in Oct 2010
Types of Population
Finite and real
Infinite and hypothetical
Ash content in a particular lot can be thought of as an observation from an infinite and hypothetical population of all possible values of ash content.
Continuous
Power generated by a power station, a tank of liquid chemical. Such populations need to be suitably discretized for the purpose of measurement.
Population and Sample
A (random) sample is a subset of the population obtained in such a manner that each object (unit) of the population (or of a subpopulation) has equal probability of being included in the subset.
Samples must be distinguished from specimens. A specimen is merely a convenient subset of the population.
The purpose of sampling is to draw conclusions about a target population economically, with acceptable limits of error.
                     Population   Sample
Mean                 $\mu$        $\bar{X}$
Standard Deviation   $\sigma$     $s$
Variation in sample and population
Histogram of plate thickness (sample values)
Probability distribution of thickness (for the population)
Frequency polygon: an estimate of the population distribution
Larger sample -> more classes -> smaller class interval -> smoother frequency polygon, closer to the population distribution
In case of a hypothetical population, the distribution of a characteristic in the population will never be known. Normal distribution is frequently assumed for a population distribution.
Discrete Probability Distribution
Sample space: {1, 2, 3, 4, 5, 6}
Random variable X takes values x = {1, 2, 3, 4, 5, 6}
$p(x) = P(X = x)$: Probability mass function
$P(X = 1) = p(1) = 1/6$
[Figure: p(x) plotted against x = 1, 2, ..., 6, each of height 1/6]
Continuous Probability Distribution
Sample space: measurement of diameter, from $x_1$ to $x_2$
Random variable X takes values $x_1 \le x \le x_2$
$F(x) = P(X \le x)$: Probability distribution function
$f(x) = F'(x)$: Probability density function
$f(x)$ does not give the probability of $X = x$
[Figure: density curve f(x) plotted against x, between x1 and x2]
Bernoulli and Hypergeometric
Sample space
$P(x = 0) = p(0) = 0.8$
$P(x = 1) = p(1) = 0.2$
X follows Bernoulli Distribution having parameter p = 0.2
N = 10, d = 2, n = 3; x = 0, 1 or 2
X follows Hypergeometric Distribution with parameters N = 10, n = 3 and d = 2
p(0) = ?, p(1) = ?, p(2) = ?
Hypergeometric Distribution
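$P(x = r) = {}^{d}C_{r} \times {}^{N-d}C_{n-r} \, / \, {}^{N}C_{n}$
N = 10, n = 3, d = 2
$P(x = 0) = {}^{10-2}C_{3-0} \times {}^{2}C_{0} \, / \, {}^{10}C_{3} = (56 \times 1) / 120 = 0.467$
$P(x = 1) = {}^{10-2}C_{3-1} \times {}^{2}C_{1} \, / \, {}^{10}C_{3} = (28 \times 2) / 120 = 0.467$
$P(x = 2) = {}^{10-2}C_{3-2} \times {}^{2}C_{2} \, / \, {}^{10}C_{3} = (8 \times 1) / 120 = 0.067$
$p(0) + p(1) + p(2) = 0.467 + 0.467 + 0.067 \approx 1$ (exactly 1 before rounding)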
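The three hypergeometric probabilities can be reproduced with Python's standard library. A minimal sketch follows; the function name `hypergeom_pmf` is an illustrative choice, and the parameter names N, n, d and r follow the slide.

```python
from math import comb

def hypergeom_pmf(r, N, n, d):
    """P(x = r): r marked items in a sample of n drawn without replacement
    from N items, d of which are marked."""
    return comb(d, r) * comb(N - d, n - r) / comb(N, n)

# Parameters from the slide: N = 10, n = 3, d = 2.
probs = [hypergeom_pmf(r, N=10, n=3, d=2) for r in range(3)]
for r, p in enumerate(probs):
    print(f"p({r}) = {p:.3f}")
print("total =", round(sum(probs), 10))
```

Working with the unrounded values shows that the probabilities add to exactly 1; the small excess in the hand calculation comes only from rounding each term to three decimals.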
Binomial Distribution
p = 0.2, n = 3; x = 0, 1, 2 or 3
X follows Binomial Distribution with parameters n = 3 and p = 0.2
Hypergeometric
Finite population
Sampling without replacement
Binomial
Infinite population OR
Sampling with replacement
Binomial Distribution: Distribution of no. of defectives in samples drawn from a process under control (p = constant)
Computing Binomial Probability
n = 3, p = 0.2
$p(0) = {}^{3}C_{0} \times (0.2)^{0} \times (0.8)^{3-0} = 1 \times 1 \times 0.512 = 0.512$
$p(1) = {}^{3}C_{1} \times (0.2)^{1} \times (0.8)^{3-1} = 3 \times 0.2 \times 0.64 = 0.384$
$p(2) = {}^{3}C_{2} \times (0.2)^{2} \times (0.8)^{3-2} = 3 \times 0.04 \times 0.8 = 0.096$
$p(3) = {}^{3}C_{3} \times (0.2)^{3} \times (0.8)^{3-3} = 1 \times 0.008 \times 1 = 0.008$
$p(0) + p(1) + p(2) + p(3) = 0.512 + 0.384 + 0.096 + 0.008 = 1.000$
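The same binomial probabilities can be computed in Python. This is a small sketch using only the standard library; `binomial_pmf` is an illustrative name, not something defined in the original material.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials, each with probability p of a defective."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Parameters from the slide: n = 3, p = 0.2.
probs = [binomial_pmf(x, n=3, p=0.2) for x in range(4)]
for x, q in enumerate(probs):
    print(f"p({x}) = {q:.3f}")
print("total =", round(sum(probs), 10))
```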
Poisson Distribution
As an approximation to Binomial probability
Small p (say < 0.1)
Large n
As a distribution in its own right
Count of defects/unit
Infinite opportunities of occurrence
Rare events: accidents, flaws in cloth, instances of power outages, absenteeism in large organizations, no. of production stoppages
Many opportunities and maximum of 1 defect per opportunity
Defects are randomly distributed
Defect rate constant and proportional to area
No location preference
Poisson Probability
$P(x = r) = p(r) = \dfrac{e^{-\lambda} \lambda^{r}}{r!}, \quad r = 0, 1, 2, \ldots$
Example: The no. of errors in bills raised by the billing department follows Poisson distribution. Mean error rate per bill is 0.5. A bill is selected at random. What is the probability that the bill will contain (i) exactly two errors, (ii) at most two errors and (iii) at least two errors?
(i) $\lambda = 0.5$, $p(2) = \exp(-0.5) \times (0.5)^2 / 2! = 0.6065 \times 0.25 / 2 = 0.076$
(ii) $p(\le 2) = p(0) + p(1) + p(2) = 0.6065 + 0.3033 + 0.076 = 0.986$, where
$p(0) = \exp(-0.5) \times (0.5)^0 / 0! = 0.6065 \times 1 / 1 = 0.6065$
$p(1) = \exp(-0.5) \times (0.5)^1 / 1! = 0.6065 \times 0.5 / 1 = 0.3033$
(iii) $p(\ge 2) = 1 - p(\le 1) = 1 - p(0) - p(1) = 1 - 0.6065 - 0.3033 = 0.09$
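The billing example can be verified numerically. The sketch below assumes only the Poisson formula and the mean rate of 0.5 given in the example; the names `poisson_pmf`, `exactly_two`, `at_most_two` and `at_least_two` are illustrative.

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(x = r) for a Poisson count with mean lam."""
    return exp(-lam) * lam ** r / factorial(r)

lam = 0.5  # mean error rate per bill, from the example

exactly_two = poisson_pmf(2, lam)
at_most_two = sum(poisson_pmf(r, lam) for r in range(3))
# "At least two" is the complement of "zero or one".
at_least_two = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

print(f"(i)   exactly two : {exactly_two:.3f}")
print(f"(ii)  at most two : {at_most_two:.3f}")
print(f"(iii) at least two: {at_least_two:.3f}")
```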
Normal Distribution
[Figure: bell-shaped curve of f(x) against x, with the inflection point marked]
Symmetric
Unimodal
Bell shaped
Range: $-\infty$ to $+\infty$
Area under curve = 1
Also known as Gaussian distribution
Arises naturally in many physical, biological and social measurements
Non-normal $\ne$ Abnormal
All cases are approximations only: most measurements are non-negative
Normal Characteristics - Examples
"The normal law of error stands out in the experience of mankind as one of the broadest generalizations of natural philosophy. It serves as the guiding instrument in researches in the physical and social sciences and in medicine, agriculture and engineering. It is an indispensable tool for the analysis and interpretation of the basic data obtained by observation and experiment." - W. J. Youden
Machined dimensions
Fill volume/weight
Colour density
Wear-out failure time
Germination at a given ageing
Height of Indian tribals
No. of single girls in a bar (1 - 2 P.M)
Return from a diversified portfolio
Central Limit Theorem
Distribution of an average ($\bar{X}$) or a sum ($\sum X$) tends to be normal, irrespective of the distributional form of X.
Many statistical procedures are based on the assumption of Normality. CLT acts as a safeguard for the validity of such applications.
[Figure] Aggregation of numerous small but independent random events. In this case eight events, each producing a small random displacement either to the left or to the right.
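The eight-event illustration can be made concrete with an exact calculation. The sketch below rests on assumptions that are not stated on the slide: each event moves the position by exactly one unit, left or right with probability 1/2, and the events are independent. The names `exact_prob` and `normal_approx` are illustrative. It compares the exact distribution of the net displacement with a normal curve having the same mean (0) and variance (8).

```python
from math import comb, exp, pi, sqrt

STEPS = 8  # eight independent events, as in the illustration

def exact_prob(position, steps=STEPS):
    """Exact probability of a given net displacement after `steps` unit moves,
    each to the left or right with probability 1/2 (an assumption)."""
    if abs(position) > steps or (position + steps) % 2:
        return 0.0
    rights = (position + steps) // 2  # number of moves to the right
    return comb(steps, rights) / 2 ** steps

def normal_approx(position, steps=STEPS):
    """Normal approximation with mean 0 and variance `steps`.
    Possible positions are 2 units apart, so the density is multiplied by 2."""
    density = exp(-position ** 2 / (2 * steps)) / sqrt(2 * pi * steps)
    return 2 * density

positions = list(range(-STEPS, STEPS + 1, 2))
for pos in positions:
    print(f"{pos:+d}  exact = {exact_prob(pos):.4f}  normal = {normal_approx(pos):.4f}")
```

Even with only eight events, the exact probabilities and the normal approximation differ by less than 0.01 at every position under these assumptions.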
Normal Density Function
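$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, where Mean $= \mu$ and Variance $= \sigma^2$
[Figure: normal curve with the points x, x1 and x2 marked on the x axis]
$P(X < x) = F(x)$
$P(x_1 < X < x_2) = F(x_2) - F(x_1)$
$P(X > x) = 1 - F(x)$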
Normal Probability
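[Figure: normal curve f(x) against x with the following areas marked]
$\mu \pm \sigma$: 68.27%
$\mu \pm 2\sigma$: 95.45%
$\mu \pm 3\sigma$: 99.73%
Popularly known as the 68-95-99.73 rule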
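The three percentages can be reproduced from the error function in Python's standard library, using the identity $P(|Z| < k) = \mathrm{erf}(k / \sqrt{2})$ for a standard normal Z. A minimal sketch; `prob_within` is an illustrative name.

```python
from math import erf, sqrt

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} sigma: {100 * p:.2f}%")
```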
Standard Normal Distribution
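[Figure: normal curve with two horizontal axes. X axis: $\mu - 3\sigma, \ldots, \mu, \ldots, \mu + 3\sigma$. Z axis: -3, -2, -1, 0, 1, 2, 3]
For Z: $\mu = 0$, $\sigma = 1$
$f(z) = \dfrac{1}{\sqrt{2\pi}} \, e^{-z^2 / 2}$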
Z - Transform
$Z = (X - \mu) / \sigma$
Standard Normal Table
z      0.00      0.01      ...   0.09
0.0    0.50000   0.50399   ...   0.53586
0.1    0.53983   0.54379   ...   0.57534
...
1.0    0.84134   0.84375   ...   0.86214
...
2.0    0.97725   0.97778   ...   0.98169
...
3.0    0.99865   0.99869   ...   0.99900
...
3.9    0.99995   0.99995   ...   0.99997
The table gives P(Z < z), the area under the standard normal curve to the left of z.
Other tables may give probabilities between 0 and z > 0, so be careful.
Tables giving probabilities for negative values of z are convenient but are not essential.
P(z > -2.01) = ?
P(-1.09 < z < 2) = ?
P(-0.19 < z < -0.01) = ?
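The three table look-ups can be checked with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is a sketch for verification only; the variable names are illustrative. By symmetry, a cumulative probability at a negative z equals one minus the table value at the corresponding positive z, which is why tables for negative z are not essential.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_a = 1 - Z.cdf(-2.01)             # P(z > -2.01), same as the table value at 2.01
p_b = Z.cdf(2) - Z.cdf(-1.09)      # P(-1.09 < z < 2)
p_c = Z.cdf(-0.01) - Z.cdf(-0.19)  # P(-0.19 < z < -0.01)

print(f"P(z > -2.01)         = {p_a:.5f}")
print(f"P(-1.09 < z < 2)     = {p_b:.5f}")
print(f"P(-0.19 < z < -0.01) = {p_c:.5f}")
```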
Normal Distribution - Exercise
The specification on viscosity of a chemical produced by a batch process is given as 16.5 ± 2.5. Viscosities of 10 consecutive batches produced in the immediate past are given below:
14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2
(a) Assuming viscosity follows Normal distribution, find the expected rejection percent of batches. [Ans: 6.2%]
(b) Note that none of the 10 sample batches is rejected. Still, is there any cause for concern?
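Part (a) can be worked through in Python as follows. This is one possible sketch, assuming the population mean and standard deviation are estimated by the sample mean and the sample standard deviation (n - 1 divisor) of the ten batches; the variable names are illustrative.

```python
from statistics import NormalDist, mean, stdev

viscosity = [14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2]
lower, upper = 16.5 - 2.5, 16.5 + 2.5  # specification limits

# Fit a normal distribution using the sample mean and sample s.d. (n - 1 divisor).
fitted = NormalDist(mean(viscosity), stdev(viscosity))

reject_low = fitted.cdf(lower)        # expected fraction below the lower limit
reject_high = 1 - fitted.cdf(upper)   # expected fraction above the upper limit
reject_percent = 100 * (reject_low + reject_high)

print(f"mean = {fitted.mean:.2f}, s = {fitted.stdev:.4f}")
print(f"below lower limit: {100 * reject_low:.2f}%")
print(f"above upper limit: {100 * reject_high:.2f}%")
print(f"expected rejection: {reject_percent:.1f}%")
```

Printing the two tails separately shows which specification limit contributes most of the expected rejections, which may help when thinking about part (b).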
Normal Probability Plotting
Purpose: To examine (based on sample data) whether the population distribution is Normal or not.
Method:
Rank the sample observations from smallest to largest (R = 1, 2, ..., n). Try to have n > 25.
Compute the observed relative cumulative frequency F(x) = (R - 0.5)/n [or F(x) = R/(n + 1)] for each x, where R is the rank of observation x.
Plot [x, F(x)] on Normal probability paper.
If the points fall approximately along a straight line, then the underlying distribution can be considered Normal.
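The ranking and plotting-position steps can be sketched in Python. Normal probability paper has a vertical scale that is linear in the standard normal quantile, so the sketch also converts each F(x) to a z value with `statistics.NormalDist().inv_cdf`; plotting x against z on ordinary axes is then equivalent. The function name `plotting_positions` is illustrative, and the data are the ten viscosity values from the exercise in this chapter.

```python
from statistics import NormalDist

def plotting_positions(sample):
    """Return (x, rank, F, z) for each observation, using F = (R - 0.5) / n."""
    n = len(sample)
    std_normal = NormalDist()
    rows = []
    # Tied observations simply receive consecutive ranks.
    for rank, x in enumerate(sorted(sample), start=1):
        f = (rank - 0.5) / n
        z = std_normal.inv_cdf(f)  # position on the normal probability scale
        rows.append((x, rank, f, z))
    return rows

viscosity = [14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2]
rows = plotting_positions(viscosity)
for x, rank, f, z in rows:
    print(f"{x:5.1f}  R = {rank:2d}  F = {f:.2f}  z = {z:+.3f}")
```

If the (x, z) pairs lie close to a straight line, the sample gives no strong reason to doubt Normality; with only 10 observations (fewer than the suggested 25) the judgement is necessarily rough.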
NPP - Example
Viscosity data: Specification 16.5 ± 2.5
14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2
Observation (x)   Rank (R)   (R - 0.5)/10
14.2              1          .05
14.5              2          .15
14.8              3          .25
14.9              4          .35
15.2              5          .45
15.6              6          .55
15.6              7          .65
15.7              8          .75
16.9              9          .85
17.0              10         .95
[Figure: Probability Plot of Viscosity, Normal - 95% CI. x axis: Viscosity (12 to 19); y axis: Percent (1 to 99). Mean 15.44, StDev 0.9348, N 10, AD 0.359, P-Value 0.374]