Ch 2 BB Basic Statistics

7/29/2019 Ch 2 BB Basic Statistics


Chapter 2:

Basic Statistics



Part A

Types of Data, Data Quality and Data Collection



Data

Data are facts or figures related to any characteristic (also called a variable) of an individual.

Examples of individuals: a machine, a year, a casting, a dimension, a person.

Power station outages (up to 31/03/01 since commissioning)

Station | Date of commissioning | Availability (%) | No. of outages | Average duration of non-stop operation (days) | Average loss per outage (hours), Forced | Average loss per outage (hours), Planned | Main cause of outage | Capacity utilization
C:15 | 12/11/98 | 92.59 | 30 | 27 | 64 | 52 | Leakage | High
C:16 | 10/05/97 | 93.04 | 47 | 28 | 52 | 52 | Leakage | Mod.
D | 12/10/78 | 88.32 | 124 | 58 | 261 | 164 | Gen* | V. Low
E | 31/12/84 | 82.77 | 116 | 42 | 440 | 158 | Gen* | Low
F | 29/09/88 | 89.23 | 82 | 50 | 379 | 79 | Gen* | High

* Generator stator / rotor problem

The columns of the table are the VARIABLES; the rows (stations) are the INDIVIDUALS.



Types of Data/Variable

Data/Variable
- Numerical/Quantitative: Continuous, Discrete
- Categorical/Qualitative: Ordinal, Nominal



Types of Data - Examples

Continuous: An infinite number of values (positive or negative) are possible, e.g. measurements of weight, length, chemical composition.

Discrete: The variable can take the values 0, 1, 2, 3, ..., e.g. counts or frequencies (# of defects, breakdowns etc.)

Ordinal: Data classified in ordered categories, e.g. quality of service provided is classified as poor, moderate, good, or yearly rainfall is classified as very low, low, moderate, good and very good.

Nominal: Data classified in categories having no inherent or explicit order, e.g. location classified as east, west, north, south, or names of departments.



Types of Data - Outage Data Example

Variable Name | Variable Type
1. Date of commissioning |
2. Availability (%) |
3. Number of outages since commissioning |
4. Average duration of non-stop operation (days) |
5. Average loss per outage (hours) |
6. Main cause of outage |
7. Capacity utilization |



Types of Data - Further Considerations

Continuous data may appear as discrete either due to rounding (see the outage data example) or due to measurement limitations. We should treat such data as continuous unless the number of levels in the data set is very small (say 2-4).

For example, hourly records of steam pressure at the turbine inlet (station F) show that the values are either 126, 127 or 128. Great care must be exercised while analyzing such data.

Discrete data having seven or more levels may be treated as continuous data.

Dichotomous data (O.K/Not O.K, Pass/Fail etc.) may be treated as discrete data after coding the two categories as 1 (O.K) and 0 (Not O.K).



Variable and Attribute Data

In the field of Quality Control, the various types of data are classified as

- VARIABLE DATA: Continuous data

- ATTRIBUTE DATA: All others, i.e. discrete data and counts of items falling in various categories (dichotomous, ordinal and nominal)

Henceforth we shall use this latter classification.



Data Gateway

Problem/Hypothesis --[DATA COLLECTION]--> Data --[DATA ANALYSIS]--> Solution/Fact

Quality problems cannot be solved merely on the basis of experience.

Any claim not backed by data is only a hypothesis.

Data Gates: The quality of the data gates and their placement at appropriate locations of a process are extremely important for process control.

Data Quality: The data collection step is vital: garbage in, garbage out.



Information Content in Data for Process Control

Source of Data | Attribute Data | Variable Data
General literature | Very low | Low
Past data: In-house routine Q.C. records | Low | Moderate
Past data: Statistically designed experiments | Moderate | High
Live data: Passive observation of the process | Moderate | High
Live data: Statistically designed experiments | High | Very High

Do not transform variable data to attribute data.

That will be like burning diamond for heat.



Data Collection Process

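Population -> Sample -> Measurement -> Recording -> Editing, Storage, Retrieval

Data layout (INDIVIDUALS as rows, VARIABLES as columns):

INDIVIDUALS | Var. 1 | Var. 2 | Var. 3 | . . . | Var. p
Ind. 1 | Data | Data | Data | . . . | Data
. . . | . . . | . . . | . . . | . . . | . . .
Ind. n | Data | Data | Data | . . . | Data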



Linking Data Quality to Data Collection Process

Data quality problems (columns of the matrix): Wrong, Noisy, Irrelevant, Inadequate, Hard, Redundant

Process elements (rows of the matrix):
- Population: Individual, Variables
- Sample: Procedure, Size
- Measurement: Gauge, Appraiser, Others
- Recording: Format, Recorder
- Editing, Storage, Retrieval: issues related to database management



Measurement Related Causes for Poor Data Quality

[Cause-and-effect diagram. Effect: Poor data quality (measurement)]

- Gauges
  - Calibration
    - Status: not done; done long back
    - Results: not used; not traceable
  - Number: many; variable least count; different makes
  - Capability
    - Operating range: beyond limit
    - Type of data: unwanted
    - Precision: low repeatability; low least count
  - Operation: malfunctioning; breakdown
- Appraisers: bias; inadvertent error; number; reproducibility
- Measurand: unstable; inhomogeneous
- Method
  - Standard procedure: not available; not followed
  - Communication



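Data Collection Planning - Principle of Inverse Loading

Plan: the planning questions are asked in the order 1 to 5. Execute: the answers are acted upon in the reverse order.

The Planning Questions (with an illustration)

1) What do you want to know?
   Illustration: Has X any effect on Y?

2) How do you want to see what it is that you need to know?
   Illustration: [Figure: sketches of the distribution of Y at different levels of X, and of Y plotted against X]

3) What type of tool will generate what it is that you need to see?
   Illustration: Histogram, scatter diagram

4) What type of data is required of the selected tool?
   Illustration: for the histograms, values of Y recorded under each level X1, X2, X3 (Y11 ... Y1n, Y21 ... Y2p, Y31 ... Y3q); for the scatter diagram, pairs (X1, Y1), ..., (Xn, Yn)

5) Where can you get the required type of data?
   Illustration: Final inspection and production log book; or nowhere, in which case the data are to be collected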



Check Sheet and Data Sheet

Check Sheet: Checks (/, x etc.) are made against a category of a variable or a combination of categories of several variables. Used primarily for collecting attribute data.

Data Sheet: Measurement results are recorded against an individual and its characteristics. Used for collecting both attribute and variable data.

Many consider all check sheets as data sheets and vice versa. However, we shall distinguish between the two as above.



Process Distribution Check Sheet: Power Generation Process (Moving Target)

Month: September

Process average (Y1 bar): 420 MW

Characteristic: Y1= Total generation (MW), Y2= System demand

Sampling interval: Every 3.5 hours

Target: Min(420, Y1) Data: Target - Y1 bar

Class Interval Check Frq

55.01 4

Total No. of observations: 206

Import limit = +20

Export limit = -10

Wasteful import

due to lack of control

Wasteful export

due to lack of control

Defect rate = 27 %



Causes for Wasteful Import of Power

[Figure: Run chart of half-hourly readings of generation at station C15 in September 2001, with four episodes marked A to D]

A: Process failure
B: Process deficiency
C: Early slow down
D: Late pick up



Defect Cause Check Sheet

Month: September, 2001. Data: # of hours of generation affected.

Defect | C15 | C16 | D | E | F | Total
Process failure | 30 | 0 | 15 | 7 | 0 | 52
Process deficiency | 11 | 9 | 36 | 14 | 11 | 81
Early slowdown | 2 | 2 | 9 | 0 | 2 | 15
Late pick up | 11 | 11 | 5 | 0 | 7 | 34
Total | 54 | 22 | 65 | 21 | 20 | 182

Note: Criticality of the defects is not the same over all stations.



Identifying Critical Causes for Wasteful Import

Hours of low generation

Cause | C15 | C16 | D | E | F
PF | 30 | 0 | 15 | 7 | 0
PD | 11 | 9 | 36 | 14 | 11
ES | 2 | 2 | 9 | 0 | 2
LP | 11 | 11 | 5 | 0 | 7

multiplied by

Average generation loss at each instant

Cause | C15 | C16 | D | E | F
PF | 29.0 | 29.5 | 107.0 | 103.5 | 110.0
PD | 5 | 2 | 10 | 5 | 5
ES | 10 | 4 | 30 | - | 15
LP | 10 | 4 | 30 | - | 15

equals

Total generation loss (MWH)

Cause | C15 | C16 | D | E | F | Total
PF | 870 | 0 | 1605 | 725 | 0 | 3200
PD | 55 | 18 | 360 | 70 | 55 | 558
ES | 20 | 8 | 270 | 0 | 30 | 328
LP | 110 | 44 | 150 | 0 | 105 | 409
Total | 1055 | 70 | 2385 | 795 | 190 | 4495

PF = Process failure
PD = Process deficiency
ES = Early slow down
LP = Late pick up
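The cell-by-cell multiplication behind the total-loss table can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and variable names are assumptions, and the "-" entries for station E are coded as 0.0. The one non-integer product (7 x 103.5 = 724.5) is kept unrounded here, whereas the table shows it as 725, so the totals differ from the table by 0.5.

```python
# Hours of low generation per cause and station (first table).
stations = ["C15", "C16", "D", "E", "F"]
hours = {
    "PF": [30, 0, 15, 7, 0],
    "PD": [11, 9, 36, 14, 11],
    "ES": [2, 2, 9, 0, 2],
    "LP": [11, 11, 5, 0, 7],
}
# Average generation loss at each instant (second table).
# A "-" in the source (no occurrence at station E) is coded as 0.0 here.
avg_loss = {
    "PF": [29.0, 29.5, 107.0, 103.5, 110.0],
    "PD": [5.0, 2.0, 10.0, 5.0, 5.0],
    "ES": [10.0, 4.0, 30.0, 0.0, 15.0],
    "LP": [10.0, 4.0, 30.0, 0.0, 15.0],
}

# Cell-by-cell product gives the total generation loss (MWH).
loss = {c: [h * a for h, a in zip(hours[c], avg_loss[c])] for c in hours}
cause_totals = {c: sum(values) for c, values in loss.items()}
station_totals = [sum(loss[c][j] for c in loss) for j in range(len(stations))]
grand_total = sum(cause_totals.values())

# Rank the causes by their share of the total loss.
for cause, total in sorted(cause_totals.items(), key=lambda kv: -kv[1]):
    share = 100 * total / grand_total
    print(f"{cause}: {total:.1f} MWH ({share:.0f}% of total)")
```

Ranking by MWH rather than by hours shows why the criticality of a defect differs between stations: process failure accounts for fewer hours than process deficiency but for most of the loss.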



Other Types of Check Sheets

Defective item check sheet
- Checks are made against various causes of rejection/rework of an item.

Defect location check sheet
- Instead of a table, a diagram is made of the defect space.
- Checks are made at the location where a defect occurs.
- Locational segregation of defects, if any, provides a valuable clue.
- Examples: leakage in a cooling system, cracks in castings, wear out of moving parts.



Other Types of Check Sheets (..Contd.)

Check-up confirmation check sheet
- Used to make a comprehensive check-up of product/process quality (usually at the final stage).
- Preprinted check items avoid duplication and omission of the tests to be performed.
- It is a variation of the check list, which is used for checking whether all the tasks have been performed.

C-E diagram check sheet
- Checks are made against the cause of a problem in the C-E diagram.



Data Sheet - General Format

Title
Common relevant information

Individual | Var. 1 | Var. 2 | . . . | Var. p | Remark
Ind. 1 | | | | |
Ind. 2 | | | | |
. . . | | | | |
Ind. n | | | | |

Important summary of data

Notes:



Data Sheet - Example

Up-load detention report for the month of July, 2001

Rake No. | Date | Arrival time | Quality | # of wagons | Form date | Form time | Depart. date | Depart. time | Deten. hours | Demur. hours | Reason | Actual unloading time (Hr.)
01 | 01 | 19.45 | Enviro | 58 | 02 | 05.35 | 02 | 15.30 | 09.55 | - | - | 09.00
. | . | . | . | . | . | . | . | . | . | . | . | .
20 | 14 | 07.50 | Du. hill | 58 | 15 | 16.45 | 16 | 00.20 | 07.35 | 23 | S(19)+I(4) | 14.30
. | . | . | . | . | . | . | . | . | . | . | . | .
42 | 31 | 20.20 | . | . | . | . | . | . | . | . | . | 14.45

Purpose?
- Estimation of demurrage hours
- Control of demurrage hours

Important reasons cited are receipt in quick succession, successive detentions and wet coal. These are beyond the control of the coal handling section.

Inadequate Data!



Part B

Summarization of Data



Data Analysis - Getting Started

Half-hourly record of generation by station E during 19/9/01 (10 hrs.) to 21/9/01 (1.30 hrs.) under normal operating condition

Hours | Generation (MW)
10.00 - 13.30 | 102.8 105.2 103.2 104.0 105.2 104.8 105.6 105.0
14.00 - 17.30 | 105.0 104.0 104.0 105.2 106.0 106.4 103.2 104.2
18.00 - 21.30 | 102.0 103.6 103.8 105.0 105.2 105.2 106.0 105.0
22.00 - 01.30 | 103.0 103.2 103.0 103.0 104.2 105.8 105.4 104.8
02.00 - 05.30 | 104.8 105.2 105.2 106.0 104.0 104.2 103.8 104.4
06.00 - 09.30 | 104.0 102.2 103.4 104.4 104.4 104.2 104.8 106.2
10.00 - 13.30 | 106.4 104.8 102.8 103.6 104.8 104.4 104.8 104.0
14.00 - 17.30 | 104.0 104.0 104.0 104.0 104.4 104.0 102.6 103.0
18.00 - 21.30 | 104.8 102.8 104.0 103.4 103.6 104.0 104.0 103.4
22.00 - 01.30 | 106.0 104.4 104.4 102.4 102.8 105.0 105.2 105.2

What are your conclusions?



Frequency Distribution - Analyzing a large data set on the same variable

Class Interval | Tally | Frequency
101.7 - 102.3 | | 02
102.3 - 102.9 | | 06
102.9 - 103.5 | | 10
103.5 - 104.1 | | 19
104.1 - 104.7 | | 11
104.7 - 105.3 | | 22
105.3 - 105.9 | | 03
105.9 - 106.5 | | 07
Total | | 80

Generation data set (previous slide): the eighty observations are grouped in eight classes of equal length.

Does the frequency distribution provide better insight into the process?

DATA + ANALYSIS = INFORMATION

Data are not information
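As an illustration of how such a frequency distribution can be produced from the raw readings, here is a minimal Python sketch. The function name and the choice to work in integer tenths of a MW (to avoid floating-point trouble at class limits) are assumptions made for this example, not part of the slides.

```python
# The eighty half-hourly generation readings (MW) of station E.
generation = [
    102.8, 105.2, 103.2, 104.0, 105.2, 104.8, 105.6, 105.0,
    105.0, 104.0, 104.0, 105.2, 106.0, 106.4, 103.2, 104.2,
    102.0, 103.6, 103.8, 105.0, 105.2, 105.2, 106.0, 105.0,
    103.0, 103.2, 103.0, 103.0, 104.2, 105.8, 105.4, 104.8,
    104.8, 105.2, 105.2, 106.0, 104.0, 104.2, 103.8, 104.4,
    104.0, 102.2, 103.4, 104.4, 104.4, 104.2, 104.8, 106.2,
    106.4, 104.8, 102.8, 103.6, 104.8, 104.4, 104.8, 104.0,
    104.0, 104.0, 104.0, 104.0, 104.4, 104.0, 102.6, 103.0,
    104.8, 102.8, 104.0, 103.4, 103.6, 104.0, 104.0, 103.4,
    106.0, 104.4, 104.4, 102.4, 102.8, 105.0, 105.2, 105.2,
]

def frequency_distribution(data, lower_tenths, width_tenths, k):
    """Count observations per class; limits are given in tenths of a unit."""
    freqs = [0] * k
    for x in data:
        tenths = round(x * 10)  # e.g. 102.8 -> 1028, exact integer
        freqs[(tenths - lower_tenths) // width_tenths] += 1
    return freqs

# Eight classes of width 0.6 starting at 101.7, as in the table.
freqs = frequency_distribution(generation, 1017, 6, 8)
for i, f in enumerate(freqs):
    lo = (1017 + 6 * i) / 10
    hi = (1017 + 6 * (i + 1)) / 10
    print(f"{lo:.1f} - {hi:.1f}  {'|' * f}  {f:02d}")
```

Each observation is classified in a single pass through the data, which mirrors the tally-marking procedure used for manual construction.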



Constructing Frequency Distributions - Variable Data

Data set

Number of observations (N): About 100 on the same variable.

Formation of the classes (first column)

Number of classes (k)

Too many classes obscure the pattern of the distribution due to sampling fluctuations. Details are lost with too few classes. The optimum number of classes is given by $k = 1 + 3.3 \log_{10}(N)$.

The simpler formula $k = \sqrt{N}$ also works well in practice.

For better visual impact, it is preferable to have $5 \le k \le 12$.

For the generation data set we have N = 80. Therefore, $k = 1 + 3.3 \log_{10}(80) = 7.3$. This means the number of classes should be either 7 or 8. We have chosen 7 classes.
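The rule of thumb for the number of classes is easy to check numerically. The sketch below is illustrative only; the function names are made up for this example, and the square-root rule is included on the assumption that it is the "simpler formula" referred to above.

```python
import math

def classes_log_rule(n):
    """Number of classes from k = 1 + 3.3 * log10(N)."""
    return 1 + 3.3 * math.log10(n)

def classes_sqrt_rule(n):
    """Number of classes from the square-root rule (assumed simpler formula)."""
    return math.sqrt(n)

n = 80
k = classes_log_rule(n)
print(f"log rule:  k = {k:.1f} -> use {math.floor(k)} or {math.ceil(k)} classes")
print(f"sqrt rule: k = {classes_sqrt_rule(n):.1f}")
```

Both rules give a value inside the recommended range of 5 to 12 classes for N = 80.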



Constructing Frequency Distributions (..Contd.)

Class width (h): $h = (R + w) / k$

where R = Range of the observations = Maximum - Minimum, and w = Least count of measurement.

Next, h is rounded to the nearest integer multiple of w. This means, if the least unit of measurement (w) is 0.1, then h = 2.312 should be rounded to 2.3. However, if w = 0.2, then the same h should be rounded to 2.4.

In our generation data example, R = 106.4 - 102.0 = 4.4, and w = 0.2. Thus, h = (4.4 + 0.2) / 7 = 0.657, which is rounded to 0.6. We shall explain later why taking h = 0.7 would be erroneous.

Note that if h is rounded down then we shall need (k+1) classes to cover the whole range of the observations. How many classes shall we need if h is rounded up?
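The class-width calculation, including the rounding to a multiple of the least count, can be sketched as follows. The helper names are hypothetical, and the final `round(..., 10)` is there only to strip floating-point noise from the printed result.

```python
def round_to_multiple(h, w):
    """Round h to the nearest integer multiple of the least count w."""
    return round(round(h / w) * w, 10)

def class_width(minimum, maximum, w, k):
    """Return the raw width h = (R + w) / k and its rounded value."""
    raw = (maximum - minimum + w) / k
    return raw, round_to_multiple(raw, w)

# The two rounding illustrations from the text.
print(round_to_multiple(2.312, 0.1))  # least count 0.1
print(round_to_multiple(2.312, 0.2))  # least count 0.2

# Generation data: minimum 102.0, maximum 106.4, w = 0.2, k = 7.
raw_h, h = class_width(102.0, 106.4, 0.2, 7)
print(f"raw h = {raw_h:.3f}, rounded h = {h}")
```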


Constructing Frequency Distributions (..Contd.)

Class limits

The minimum value of the generation data is 102.0 and the class width has been determined as 0.6. So we can form the classes as

102.0 - 102.6, 102.7 - 103.3, 103.4 - 104.0, . . .

The problem with the above classification is that there is a gap between two successive class intervals. This is not desirable since we are dealing with continuous data.

Discontinuity can be removed by forming the classes as

102.0 - 102.6, 102.6 - 103.2, 103.2 - 103.8, . . .

However, this classification has another problem. Suppose we have an observation 102.6. In which class shall we place it, first or second?

In order to avoid such confusion we take

Lower limit of the first class = Minimum - w/2

and then successively add the class width to this lower limit to obtain the other class limits.



Class limits (..Contd.)

Thus, for the generation data we have the classes as

101.9 - 102.5, 102.5 - 103.1, 103.1 - 103.7, 103.7 - 104.3,
104.3 - 104.9, 104.9 - 105.5, 105.5 - 106.1, 106.1 - 106.7

Note that now we have
- 8 classes (since h has been rounded down from 0.657 to 0.6)
- no confusion in classification (since there are no observations which fall on the class limits) and
- an extended last class (ideally the upper limit of the last class should have been 106.5).

In the example, we have extended the first class instead of the last one since this has brought out the process abnormalities better. Thus the eight classes used are

101.7 - 102.3, 102.3 - 102.9, ..., 105.9 - 106.5



Tally marking (second column)
- Start with the first observation. Find the class to which the observation belongs. Put a tally against the class.
- Classify all the remaining observations as above.
- Tally marks are grouped in fives, with the fifth tally crossed through the previous four tallies. This provides a better visual display and helps in counting the frequency of each class.
- Note that all the observations get classified as we go through them only once. However, if we concentrate on a class and then try to find out the number of observations in the class, then we have to go through the observations k times. This not only consumes more time but also increases the chance of committing an error.

Counting frequency (third column)
- The frequency (f) of each class is obtained simply by counting the tallies.

Other columns
- Columns giving cumulative frequency (f1, f1+f2, ..) and relative frequency (f1/N, f2/N, ..) may also be added, if required.
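For illustration, the optional cumulative and relative frequency columns can be derived from the class frequencies of the generation data as in the short Python sketch below (variable names are chosen for this example only).

```python
from itertools import accumulate

# Class frequencies of the generation data (eight classes).
freqs = [2, 6, 10, 19, 11, 22, 3, 7]
n = sum(freqs)

cumulative = list(accumulate(freqs))   # f1, f1+f2, ...
relative = [f / n for f in freqs]      # f1/N, f2/N, ...

print("f   cum  rel")
for f, c, r in zip(freqs, cumulative, relative):
    print(f"{f:2d}  {c:3d}  {r:.3f}")
```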



Constructing Frequency Distributions - Getting the class intervals right

Why class width (h) is rounded to the nearest integer multiple of w

Consider the same generation data example. Here w = 0.2. Assume that h = 0.657 is rounded to 0.7 (which is not an integer multiple of 0.2) instead of 0.6. Thus the classes will be 101.9 - 102.6, 102.6 - 103.3, ..

Now in order to overcome the problem of classifying observations like 102.6, we are forced to consider w = 0.1 and have the classes as
101.95 - 102.65, 102.65 - 103.35, 103.35 - 104.05, 104.05 - 104.75, 104.75 - 105.45, 105.45 - 106.15, 106.15 - 106.85

Note that the number of observation units covered by each class is not the same. For example, the second class covers three units (102.8, 103.0 and 103.2) but the third class covers four units (103.4, 103.6, 103.8 and 104.0). As a result the frequency distribution is likely to show many peaks.

Balancing end points

Assuming w = 0.1, the seven classes shown above should be appropriate. However, note that the last class is extended by four units beyond the maximum observed value of 106.4. It is desirable to distribute this imbalance to the two end classes by starting the first class from 101.75 and ending at 106.65.
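The unequal coverage can be verified by counting how many recordable values (multiples of 0.2 between the observed minimum 102.0 and maximum 106.4) fall in each class. The sketch below is an illustration under that assumption; all quantities are expressed in hundredths of a MW so that the integer arithmetic is exact, and the function name is hypothetical.

```python
def units_per_class(first_lower, width, k, values):
    """Count how many recordable values fall in each of k classes."""
    counts = [0] * k
    for v in values:
        counts[(v - first_lower) // width] += 1
    return counts

# Recordable values 102.00, 102.20, ..., 106.40 in hundredths of a MW.
possible = list(range(10200, 10641, 20))

# h = 0.7 (not a multiple of w = 0.2), classes starting at 101.95.
coverage_h07 = units_per_class(10195, 70, 7, possible)
# h = 0.6 (a multiple of w = 0.2), classes starting at 101.9.
coverage_h06 = units_per_class(10190, 60, 8, possible)

print("h = 0.7:", coverage_h07)
print("h = 0.6:", coverage_h06)
```

With h = 0.7 the classes alternate between four and three recordable values, which by itself tends to produce alternating peaks; with h = 0.6 every full class covers three values.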


Frequency Distribution of the Generation Data - Further analysis

The frequency distribution shows an abnormal pattern (nearly alternating peaks). Does this mean the process mean is jumping randomly by about 1.2 units?

The following two frequency distributions, constructed out of the same data, provide some additional clues.

Fractional part | Frequency
.0 | 27
.2 | 18
.4 | 15
.6 | 5
.8 | 15
Total | 80

0s occur more frequently at the cost of 6s. Does this indicate measurement bias?

Class interval | Frequency
101.7 - 102.7 | 04
102.7 - 103.7 | 17
103.7 - 104.7 | 26
104.7 - 105.7 | 25
105.7 - 106.7 | 08
Total | 80

Smooth pattern (left skewed). Smoothness has been achieved not only by reducing the number of classes but also by including the adjacent 0s and 6s in the same interval.



Histogram

Histogram is a graphical representation of a frequency distribution of variable data.

The histogram of the generation data having five classes is shown below.

[Figure: Histogram of the generation data with five classes. Horizontal axis: Generation in E station (MW), class limits from 101.7; vertical axis: Frequency, 0 to 30]

- Bars of equal width (= class width)
- Heights of the bars are proportional to the frequencies of the classes
- Bar width of about 1 cm (7-10 classes)
- Horizontal axis is about 1.6 times longer than the vertical axis

Central tendency: About 104.2.

Pattern of variation: Slightly left skewed.

Specification limits: Should be shown wherever applicable.

Class mid-point: Marking the class mid-points may be helpful in certain cases.

Open ended classes: Avoid adding too many classes at the ends having zero or very low frequencies. Shown as open ended bars with arbitrarily reduced heights.


Construction of Histogram - An exercise

Half-hourly record of power (MW) generated by station E during 29.9.2001 (10.00 hours) to 30.9.2001 (24.00 hours) gives us the following data.

6.4 6.4 6.8 6.0 5.2 4.8 6.4 4.4 5.2 6.0

7.6 8.0 7.4 6.6 8.0 5.6 7.2 7.2 7.0 4.0

6.4 8.0 8.0 6.0 6.0 6.4 7.8 7.6 7.6 7.4

7.6 7.6 7.4 4.6 4.2 4.8 6.0 5.6 5.4 5.0

6.2 7.8 7.4 7.2 7.4 7.8 6.6 6.4 6.8 6.8

6.8 6.8 6.6 6.8 6.6 6.8 6.8 6.8 7.0 7.0

6.0 5.6 4.4 4.6 4.6 4.8 6.2 7.0 6.6 6.4

5.2 5.2 7.2 7.4 6.0 5.0 7.0 7.6 7.6 7.4

5.2 7.2 7.2 7.0 7.2 6.8 6.0 6.0 6.0 5.2

Construct a histogram of the above data set. Compare with the histogram for the period 19.9.01 to 21.9.01 (previous slide) and offer your comments.

The data run row-wise from 29/9 (10 hrs.) to 30/9 (24 hrs.).


Commonly Observed Histogram Patterns

- Single peak, symmetric, bell shaped: commonly observed pattern of a stable process
- Single peak, positively skewed (long tail on the right)
- Single peak, negatively skewed (long tail on the left)
- Single peak, thick tail
- Two peaks (bi-modal)

Many characteristics follow skewed patterns. We have already seen that generation data is negatively skewed while breakdown data is positively skewed. However, such shapes may also indicate process instability.

[Figure: sketches of the above patterns, with LSL and USL marked]


Frequency Distribution of Discrete Data

Number of plant outages in each year since commissioning

Station | Period | Type of outage | # of outages in a year
D | 1978-79 to 2000-01 | Forced | 2, 3, 1, 0, 3, 2, 1, 0, 2, 2, 0, 2, 3, 0, 2, 1, 2, 1, 1, 0, 1, 0, 2
D | 1978-79 to 2000-01 | Planned | 3, 5, 1, 4, 2, 5, 2, 1, 6, 3, 7, 7, 4, 7, 6, 5, 6, 4, 2, 2, 2, 6, 2
E | 1985-86 to 2000-01 | Forced | 2, 2, 5, 3, 0, 0, 1, 0, 1, 0, 2, 1, 1, 0, 1, 4
E | 1985-86 to 2000-01 | Planned | 15, 7, 8, 3, 7, 5, 2, 6, 3, 8, 7, 4, 5, 4, 3, 4
F | 1988-89 to 2000-01 | Forced | 4, 1, 1, 0, 0, 1, 1, 2, 0, 1, 0, 1, 6
F | 1988-89 to 2000-01 | Planned | 3, 11, 6, 12, 4, 0, 1, 2, 8, 2, 4, 4, 6

Ideally we should construct six frequency distributions (one for each type of outage in each station). However, due to shortage of data we shall construct only two: one for forced outages and the other for planned outages.

What can you say about the occurrence of the two types of outages from the above data set?
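One way to build the two pooled frequency distributions is sketched below in Python. The dictionary layout and the function name are assumptions made for this illustration; for discrete data each distinct count is simply its own class.

```python
from collections import Counter

# Yearly outage counts by station, copied from the table.
forced = {
    "D": [2, 3, 1, 0, 3, 2, 1, 0, 2, 2, 0, 2, 3, 0, 2, 1, 2, 1, 1, 0, 1, 0, 2],
    "E": [2, 2, 5, 3, 0, 0, 1, 0, 1, 0, 2, 1, 1, 0, 1, 4],
    "F": [4, 1, 1, 0, 0, 1, 1, 2, 0, 1, 0, 1, 6],
}
planned = {
    "D": [3, 5, 1, 4, 2, 5, 2, 1, 6, 3, 7, 7, 4, 7, 6, 5, 6, 4, 2, 2, 2, 6, 2],
    "E": [15, 7, 8, 3, 7, 5, 2, 6, 3, 8, 7, 4, 5, 4, 3, 4],
    "F": [3, 11, 6, 12, 4, 0, 1, 2, 8, 2, 4, 4, 6],
}

def pooled_distribution(by_station):
    """Pool the yearly counts of all stations into one frequency table."""
    counts = Counter()
    for yearly_counts in by_station.values():
        counts.update(yearly_counts)
    return counts

forced_dist = pooled_distribution(forced)
planned_dist = pooled_distribution(planned)

for name, dist in (("Forced", forced_dist), ("Planned", planned_dist)):
    print(name)
    for outages in range(max(dist) + 1):
        print(f"  {outages:2d} outages/year: {dist[outages]:2d} station-years")
```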



Summary Measures



Summary Measures of a Univariate Data Set

Type | Commonly Used Measure*
Measures of Location or Centre | Mean, Median, Mode, Trimmed Mean, Geometric Mean
Measures of Spread or Variability | Range, Standard Deviation, Entropy (for nominal data)
Measures of Shape | Skewness, Kurtosis
General Measure | Quartiles

* There are a host of other measures developed for specific applications



Arithmetic Mean

- May be used for ordinal data but not for nominal data
- Sensitive to extreme values
- Usually referred to as MEAN or AVERAGE

$$\text{MEAN} = \frac{\text{Sum of all the observations}}{\text{Number of observations}} = \frac{X_1 + X_2 + X_3 + \dots + X_{N-1} + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$$

Notation: $\bar{X}$

Example: In a rising voltage test the alternating breakdown voltages (kV) of 24 samples of an insulation arrangement were found to be as follows:

210; 208; 208; 175; 182; 206; 190; 194; 198; 205; 212; 200; 205; 202; 207; 210; 202; 201; 188; 205; 209; 201; 216; 196

MEAN = [210 + 208 + ... + 216 + 196] / 24 = 201.25 kV
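The calculation can be reproduced with a few lines of Python; the variable names are chosen for this illustration only.

```python
# Breakdown voltages (kV) of the 24 insulation samples.
voltages = [
    210, 208, 208, 175, 182, 206, 190, 194, 198, 205, 212, 200,
    205, 202, 207, 210, 202, 201, 188, 205, 209, 201, 216, 196,
]

n = len(voltages)
total = sum(voltages)
mean = total / n
print(f"N = {n}, sum = {total}, mean = {mean} kV")
```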



Mean of Grouped Data

Notations
- Class: i (= 1, 2, ..., k)
- Frequency of the ith class: $f_i$
- Value of the ith class: $M_i$ (class mid-point if class width > least count)

Formula

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i M_i}{\sum_{i=1}^{k} f_i}$$

Example: The observations {1.3, 1.3, 1.5, 3.3, 3.5, 3.5, 3.5, 3.6, 5.4, 5.4, 5.8, 7.3, 7.4, 7.4, 9.1} are grouped as follows:

i | Class Interval | Mi | fi | fi * Mi
1 | 1.25 - 3.25 | 2.25 | 3 | 06.75
2 | 3.25 - 5.25 | 4.25 | 5 | 21.25
3 | 5.25 - 7.25 | 6.25 | 3 | 18.75
4 | 7.25 - 9.25 | 8.25 | 4 | 33.00
Total | | | 15 | 79.75

Mean = 79.75/15 = 5.32

Mean of ungrouped data = 4.62. Thus the error due to grouping is 5.32 - 4.62 = 0.7, which is close to the maximum value possible, i.e. (class width/2) = 1.0. WHY?

In general, the error will not be so large. Nevertheless, it is recommended to use the individual observations for computing the mean, whenever possible.
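The comparison between the grouped and ungrouped means can be sketched as below. One assumption is made and should be read with care: the class frequencies total fifteen, so the data list here uses fifteen observations with 7.4 occurring twice, which is the value that reproduces the stated ungrouped mean of 4.62.

```python
# Fifteen observations (assumption: 7.4 occurs twice, see note above).
observations = [1.3, 1.3, 1.5, 3.3, 3.5, 3.5, 3.5, 3.6,
                5.4, 5.4, 5.8, 7.3, 7.4, 7.4, 9.1]

lower, width, k = 1.25, 2.0, 4
midpoints = [lower + width * (i + 0.5) for i in range(k)]

# Frequency of each class.
freqs = [0] * k
for x in observations:
    freqs[int((x - lower) // width)] += 1

grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)
ungrouped_mean = sum(observations) / len(observations)

print("frequencies:", freqs)
print(f"grouped mean   = {grouped_mean:.2f}")
print(f"ungrouped mean = {ungrouped_mean:.2f}")
print(f"grouping error = {grouped_mean - ungrouped_mean:.2f}")
```

Printing the observations next to their class mid-points shows that almost every value sits close to the lower limit of its class, which is why the mid-points overstate the data here.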



Interpretation of Mean

[Figure: Dot plot of the breakdown voltage data (previous slide), scale 170 to 220 kV, with the mean of 201.25 marked]

- Mean is the balance point (or fulcrum) for the distribution of the values.
- Mean is analogous to centre of gravity.
- In case of a unimodal and symmetric distribution, the mean also indicates the central tendency of the distribution and may be interpreted as a TYPICAL VALUE.
- In the above example, the observations are not symmetrically distributed around the mean. The distribution is skewed to the left. Consequently the mean should be interpreted here as a measure of centre or location and not as that of central tendency or typical value.



Misuse of Mean

[Figure: a landfill site releasing dioxin]

Presently, WHO has classified dioxin as a known human carcinogen.

Question: Are the people in the neighborhood of the landfill site safe with respect to exposure to dioxin?

Data: Dioxin content in the soil samples taken from a large residential area in the neighborhood of the site.

Answer: Yes, since the average dioxin content in the samples is found to be less than the permissible limit.

Critique: Individuals are not exposed to average soil levels; they are exposed to the dioxins/furans present in the air they breathe, the food they eat and the water they drink. The higher exposure of residents living in the vicinity of the site is not averaged out by the lower exposure of residents ten miles away.



Properties of Mean

P1: Sum of the deviations of all the observations from the mean is always zero. In notation, we have

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0$$

Equivalently, |Sum of negative deviations| = Sum of positive deviations.

P2: Data transformation, with k a constant:

(i) Let $Y_i = X_i \pm k$. Then $\bar{Y} = \bar{X} \pm k$.
(ii) Let $Y_i = k X_i$. Then $\bar{Y} = k \bar{X}$.
(iii) Let $Y_i = X_i / k$. Then $\bar{Y} = \bar{X} / k$.

These three properties are frequently used to reduce the size of the data values, which in turn reduces both computational load and error.

An example follows.



Properties of Mean (Contd.)

Example of P2: Data Transformation

Outer diameter (X) of tubular glass shell (Specification: 37.5 ± 0.8 mm)

i | Xi | Yi = 37.5 - Xi | Zi = Yi*100
1 | 37.46 | 0.04 | 4
2 | 36.66 | 0.84 | 84
3 | 37.44 | 0.06 | 6
4 | 37.85 | -0.35 | -35
5 | 37.36 | 0.14 | 14
6 | 36.95 | 0.55 | 55
7 | 37.62 | -0.12 | -12
8 | 36.96 | 0.54 | 54
9 | 37.12 | 0.38 | 38
10 | 37.36 | 0.14 | 14
TOTAL | | | 269 - 47 = 222

Thus $\bar{Z}$ = 222/10 = 22.2

Since Yi = Zi/100, using property (iii) of P2 we have $\bar{Y}$ = 22.2/100 = 0.222

Further, since Xi = 37.5 - Yi, using property (i) of P2 we have $\bar{X}$ = 37.5 - 0.222 = 37.278

In this case the Zi values are very large because the least count of measurement used is too small. Using a gauge having a least count of 0.1 mm and recording the deviations from the TARGET would have been better.
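The coding used in this example can be checked against the direct mean with a short Python sketch. Variable names are illustrative; the only point demonstrated is that working with the coded values Zi and then undoing the transformation gives the same mean as averaging the Xi directly.

```python
import math

target = 37.5
x = [37.46, 36.66, 37.44, 37.85, 37.36, 36.95, 37.62, 36.96, 37.12, 37.36]

# Coded values: deviation from target, expressed in hundredths of a mm.
z = [round((target - xi) * 100) for xi in x]

z_bar = sum(z) / len(z)      # mean of the coded values
y_bar = z_bar / 100          # undo the scaling, property (iii)
x_bar = target - y_bar       # undo the shift, property (i)

direct_mean = sum(x) / len(x)
print("coded values:", z)
print(f"mean via coding = {x_bar:.3f}, direct mean = {direct_mean:.3f}")
```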



Properties of Mean (Contd.)

P3: The sum of the squared deviations of a set of observations is minimum when the deviations are taken from the mean of the observations.

In notation, we have $\sum (X_i - \bar{X})^2 < \sum (X_i - M)^2$ for every $M \ne \bar{X}$.

Implication: Suppose we have production figures for the last twenty days. We want to predict the production of the 21st day, assuming the production condition remains the same. Then the best prediction is the average of the past twenty days' data, provided the loss due to prediction error is proportional to the square of the error.

P4: Sample mean is more stable than other possible measures of center.

We shall see this later.

P5: Mean is strongly affected by extreme values.

This is a disadvantage of the mean compared with other measures of center. However, routine trimming of extreme values is not recommended unless the measurements are subjective in nature. Genuine outliers must, of course, be eliminated from the data set.



Pooled Mean

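Data set | No. of observations | Average | Total
Data Set 1 | $n_1$ | $\bar{X}_1$ | $n_1 \bar{X}_1$
Data Set 2 | $n_2$ | $\bar{X}_2$ | $n_2 \bar{X}_2$
. . . | . . . | . . . | . . .
Data Set k | $n_k$ | $\bar{X}_k$ | $n_k \bar{X}_k$
All (Pooled) | $\sum n_i$ | $\sum n_i \bar{X}_i / \sum n_i$ | $\sum n_i \bar{X}_i$

Example: Process averages in three shifts are found to be 15, 12 and 13 based on 30, 40 and 20 observations respectively. Then the process average for the day is

(15*30 + 12*40 + 13*20) / (30 + 40 + 20) = 1190/90 = 13.22 [simple average: (15 + 12 + 13)/3 = 13.33]

Pooled mean $\sum n_i \bar{X}_i / \sum n_i = \sum \bar{X}_i / k$ (WHEN?)

Note that the formula for mean of grouped data is similar to the above.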
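A minimal Python sketch of the pooled mean for the three-shift example follows; the function name is an assumption made for this illustration.

```python
def pooled_mean(sizes, means):
    """Mean of all observations, given each data set's size and mean."""
    total = sum(n * m for n, m in zip(sizes, means))
    return total / sum(sizes)

shift_sizes = [30, 40, 20]
shift_means = [15, 12, 13]

pooled = pooled_mean(shift_sizes, shift_means)
simple = sum(shift_means) / len(shift_means)
print(f"pooled mean = {pooled:.2f}, simple average of the means = {simple:.2f}")

# With equal sample sizes the two coincide.
print(pooled_mean([30, 30, 30], shift_means) == simple)
```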

A related concept is that of weighted mean. An application example follows.


Weighted Mean - An Application Example

[Figure: process flow with the stages Molten glass, Drawing & Cutting (Tube), Blowing (Glass shells), OD Sensor, Sorting (Reject / Accept), Glazing]

Assume the total number of shells produced in a shift is 24000. In a particular shift 8% of the shells produced are found to be rejected. We want to estimate the average outer diameter of all the shells produced in the shift.

Samples cannot be taken before sorting. So 50 shells are randomly selected from each of the two streams (reject and accept). The average diameters of the 50 shells in the reject and accept groups are found to be 37.7 mm and 37.6 mm respectively.

Shift average = Weighted mean of the averages of the two streams = (0.08 * 37.7 + 0.92 * 37.6) / (0.08 + 0.92) = 37.61.

Weighted Mean

$$\bar{X}_w = \frac{\sum w_i X_i}{\sum w_i}$$



Weighted Mean - An Application Example (Contd.)

However, it would have been better to take more samples from the reject stream. (WHY?)

Because of the higher variation expected in this stream.

Assume 100 shells (instead of 50) were selected from the reject stream and gave the same average (37.7 mm).

Now the weights are given by $w_i = p_i N / f_i$. (WHY?)

$p_i$ = Proportion of the ith category, N = Total sample size, $f_i$ = Sample size of the ith category.

If the samples are selected randomly from the total population, then the number of samples expected in the ith category is $p_i N$. Since we have selected $f_i$ samples, we must compensate for this by a factor of $p_i N / f_i$.

In our example, p1 = 0.08, p2 = 0.92, f1 = 100, f2 = 50, N = 150. Thus w1 = (0.08 * 150) / 100 = 0.12 and w2 = (0.92 * 150) / 50 = 2.76. This gives the shift average as (37.7 * 0.12 + 37.6 * 2.76) / (0.12 + 2.76) = 37.60.
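The weight calculations can be followed in the hedged Python sketch below; the function and variable names are made up for this illustration. It reproduces the proportion-weighted shift average and the weights 0.12 and 2.76, and evaluates the last expression both as written (weights applied to the two stream averages) and with the weights applied to each sampled shell, which is an alternative reading added here for comparison.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

stream_means = [37.7, 37.6]   # reject, accept (mm)
proportions = [0.08, 0.92]    # share of each stream in the shift

# Equal sample sizes: weight the stream averages by their proportions.
shift_average = weighted_mean(stream_means, proportions)

# Unequal sample sizes: per-observation weights w_i = p_i * N / f_i.
sample_sizes = [100, 50]
n_total = sum(sample_sizes)
obs_weights = [p * n_total / f for p, f in zip(proportions, sample_sizes)]

# (a) Weights applied directly to the two stream averages, as in the text.
on_stream_means = weighted_mean(stream_means, obs_weights)
# (b) Weights applied to every sampled shell: each stream average then
#     counts f_i times, so its total weight is f_i * w_i = p_i * N.
per_shell = weighted_mean(
    stream_means, [f * w for f, w in zip(sample_sizes, obs_weights)]
)

print(f"shift average (equal samples)    = {shift_average:.3f}")
print(f"weights w1, w2                   = {obs_weights[0]:.2f}, {obs_weights[1]:.2f}")
print(f"(a) weights on stream averages   = {on_stream_means:.3f}")
print(f"(b) weights on individual shells = {per_shell:.3f}")
```

Reading (b) returns the same value as the equal-sample case, as one would expect when the stream averages themselves are unchanged.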



Median and Mode

Median
- Ordinal data: Category containing the (N+1)/2 th case.
- Numerical data: (N+1)/2 th ordered observation when N is odd, and the average of the N/2 th and (N/2)+1 th ordered observations when N is even.
- Can be computed even for open ended classes at the extremes, provided each of the end classes contains less than 50% of the observations.
- Insensitive to outliers.

Mode
- Category or value occurring with the greatest frequency.
- Only measure of center for nominal data.
- May not be unique, and is highly sensitive to how the classes or categories are formed.
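As an illustration, the median and mode of the breakdown voltage data used earlier for the mean can be obtained with Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The manual median is included only to show the N-even rule at work.

```python
import statistics

voltages = [
    210, 208, 208, 175, 182, 206, 190, 194, 198, 205, 212, 200,
    205, 202, 207, 210, 202, 201, 188, 205, 209, 201, 216, 196,
]

ordered = sorted(voltages)
n = len(ordered)

# N is even here, so the median is the average of the N/2 th and
# (N/2)+1 th ordered observations (1-based positions).
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

median = statistics.median(voltages)
modes = statistics.multimode(voltages)  # list, since the mode may not be unique

print(f"median = {median} kV (manual: {manual_median} kV)")
print(f"mode(s) = {modes}")
```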



Caveat: Don't Trust Centre Alone

Statisticians tell the story of people who got themselves drowned by wading into a lake with an average depth of 3 feet.

[Figure: cross-section of a lake with mean depth D less than a person's height H]

[Figure: distributions of marks obtained by students of two schools, School A and School B, with the median marked on each. Which school is better?]

Mean may not tell you all you need to know. Pay attention to variation as well.



Standard Deviation

Standard Deviation is the most important measure of variability in a data set.

Let {X1, X2, ..., Xn} be a sample data set and $\bar{X}$ the mean of the observations.

Variability is measured in terms of the deviations of the observations from the mean. For our data set, the deviations are $(X_i - \bar{X})$, i = 1, 2, ..., n.

Next, these deviations are summarized to obtain a single value for reporting variability.

Recall from property P1 of mean that the sum of the deviations will always be zero. So we cannot summarize by simply taking the average of the deviations.

The mathematical trick used to get rid of this difficulty (negative deviations) is to square the deviations and then average these squares. So we compute $\sum (X_i - \bar{X})^2 / (n - 1)$. The reason for using (n - 1) instead of n as the divisor will be explained later.

Finally, the effect of squaring is neutralized by taking the square root of the above average to obtain the quantity called Standard Deviation. So we have

$$s = \text{Sample Standard Deviation} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$



Computing Standard Deviation

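Steps (the name "root mean square deviation" read backwards):
- Deviation: $X_i - \bar{X}$
- Square: $(X_i - \bar{X})^2$
- Mean: $\sum (X_i - \bar{X})^2 / (n - 1)$
- Root: $\sqrt{\sum (X_i - \bar{X})^2 / (n - 1)}$

A Numerical Example

i (read) | Xi (read) | Xi - X̄ (compute) | (Xi - X̄)^2 (compute)
1 | 4 | -3 | 9
2 | 7 | 0 | 0
3 | 2 | -5 | 25
4 | 5 | -2 | 4
5 | 11 | 4 | 16
6 | 2 | -5 | 25
7 | 10 | 3 | 9
8 | 7 | 0 | 0
9 | 15 | 8 | 64
10 | 9 | 2 | 4
11 | 5 | -2 | 4
Total | 77 | 0 | 160

Mean Square Deviation = 160 / (11 - 1) = 160 / 10 = 16

Root Mean Square Deviation or Standard Deviation = $\sqrt{16}$ = 4.

Shorter Method

$$\sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n} = 699 - \frac{77^2}{11} = 160$$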
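Both routes to the sum of squared deviations can be checked with a small Python sketch; variable names are chosen for this illustration only.

```python
import math

x = [4, 7, 2, 5, 11, 2, 10, 7, 15, 9, 5]
n = len(x)
mean = sum(x) / n

# Direct method: deviation -> square -> mean (divisor n - 1) -> root.
deviations = [xi - mean for xi in x]
ss_direct = sum(d ** 2 for d in deviations)
s = math.sqrt(ss_direct / (n - 1))

# Shorter method: sum of squares minus (sum)^2 / n.
ss_short = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

print(f"mean = {mean}, sum of deviations = {sum(deviations)}")
print(f"sum of squared deviations: direct = {ss_direct}, shorter = {ss_short}")
print(f"standard deviation s = {s}")
```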



Interpretation of Standard Deviation

Let us be honest. It is not easy to interpret standard deviation.

Literally speaking, standard deviation is a measure of the closeness of the data values to their mean. However, the difficulty in interpretation arises because the closeness depends on two things: the range of the data values and also the distribution of the values within the range.

[Figure: six dot plots on a common scale from 2 to 9]

The six cases have identical mean (= 5.5) and range (= 7). But the maximum s.d. is about twice the minimum s.d. Compare the distributions having s.d. of 2.07 and 2.14.


Application of Standard Deviation - Rake weight data

[Figure: dot plots of rake weight (Ton) for indigenous and imported coal, scale 3100 to 3500]

Dot plots of weight of 24 rakes of coal received during January - June 2002. Four rakes have been selected randomly from each of the six months. All the rakes in the sample consist of 58 wagons.

Source | Mean (Ton) | Range (Ton) | $\sigma_{n-1}$ (Ton)
Indigenous | 3211.5 | 219.5 | 68.8
Imported | 3401.8 | 189.6 | 40.7

The ranges of the two distributions do not differ as much as the standard deviations do. We shall see later that the higher variation of indigenous coal implies higher inventory cost.



Part C

Population, Sample and Probability Distribution


Population

A statistical population is a set of values or attributes
- of the characteristic(s)
- of a set of well defined objects
- belonging to a specified group and/or period

 | Example 1 | Example 2
Characteristic | Height | Ash content
Object | of adult males | in lots of coal
Group | of India | received in Oct 2010


Types of Population

Finite and real

Infinite and hypothetical

Ash content in a particular lot can be thought of as an observation from an infinite and hypothetical population of all possible values of ash content.

Continuous

Power generated by a power station, a tank of liquid chemical. Such populations need to be suitably discretized for the purpose of measurement.

P l ti d S l


62/82


Population and Sample

A (random) sample is a subset of the population obtained in such a manner that each object (unit) of the population (or of a subpopulation) has equal probability of being included in the subset.

Samples must be distinguished from specimens. A specimen is merely a convenient subset of the population.

The purpose of sampling is to draw conclusions about a target population economically, within acceptable limits of error.

                     Population   Sample
Mean                 $\mu$        $\bar{X}$
Standard Deviation   $\sigma$     $s$
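The difference between a random sample and a specimen can be sketched in Python. Everything here is hypothetical: the population is just the unit labels 1 to 100, and the seed is fixed only so that the draw can be repeated.

```python
import random

# Hypothetical finite population: 100 units labelled 1..100.
population = list(range(1, 101))
n = 10

# Random sample: every unit has the same chance of inclusion.
rng = random.Random(2002)          # fixed seed, for repeatability only
sample = rng.sample(population, n)

# Specimen: merely a convenient subset, here the first ten units.
specimen = population[:n]

print("random sample:", sorted(sample))
print("specimen     :", specimen)
```

`random.Random.sample` draws without replacement, so no unit appears twice in the sample.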


63/82


Variation in sample and population

Histogram of plate thickness (sample values)

Frequency polygon: an estimate of the population distribution

Probability distribution of thickness (for the population)

A larger sample allows more classes and a smaller class interval, giving a smoother frequency polygon that is closer to the population distribution.

In the case of a hypothetical population, the distribution of a characteristic in the population will never be known. A Normal distribution is frequently assumed for the population distribution.


64/82

Discrete Probability Distribution

Sample space: the six outcomes 1, 2, 3, 4, 5, 6

Random variable X takes values x = {1, 2, 3, 4, 5, 6}

P(X = x) = p(x): probability mass function

P(X = 1) = p(1) = 1/6

[Figure: p(x) plotted against x = 1, ..., 6, with every value at height 1/6.]


65/82

Continuous Probability Distribution

Sample space: measurements of diameter, from x1 to x2

Random variable X takes values $x_1 \le x \le x_2$

$F(x) = P(X \le x)$: probability distribution function

$f(x) = F'(x)$: probability density function

[Figure: f(x) plotted against x between x1 and x2.]

f(x) does not give the probability of X = x.


66/82

Bernoulli and Hypergeometric

Sample space

P(x = 0) = p(0) = 0.8

P(x = 1) = p(1) = 0.2

X follows a Bernoulli distribution with parameter p = 0.2.

N = 10, d = 2, n = 3

Possible values: x = 0, x = 1, x = 2

X follows a Hypergeometric distribution with parameters N = 10, n = 3 and d = 2.

p(0) = ?, p(1) = ?, p(2) = ?


67/82

Hypergeometric Distribution

$P(x = r) = \dbinom{N-d}{n-r} \dbinom{d}{r} \Big/ \dbinom{N}{n}$

N = 10, n = 3, d = 2

$P(x = 0) = \binom{8}{3} \binom{2}{0} / \binom{10}{3} = (56 \times 1) / 120 = 0.467$

$P(x = 1) = \binom{8}{2} \binom{2}{1} / \binom{10}{3} = (28 \times 2) / 120 = 0.467$

$P(x = 2) = \binom{8}{1} \binom{2}{2} / \binom{10}{3} = (8 \times 1) / 120 = 0.067$

$p(0) + p(1) + p(2) = (56 + 56 + 8) / 120 = 1$
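The three probabilities can be reproduced with Python's standard library. This is a sketch; `hypergeom_pmf` is an illustrative helper name, not a library function.

```python
from math import comb


def hypergeom_pmf(r, N, n, d):
    """P(x = r) when n items are drawn without replacement from a lot
    of N items, d of which are of the type being counted."""
    return comb(N - d, n - r) * comb(d, r) / comb(N, n)


N, n, d = 10, 3, 2
probs = [hypergeom_pmf(r, N, n, d) for r in range(min(n, d) + 1)]

for r, p in enumerate(probs):
    print(f"p({r}) = {p:.3f}")
print("total =", sum(probs))
```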


68/82

Binomial Distribution

p = 0.2, n = 3

Possible values: x = 0, x = 1, x = 2, x = 3

X follows a Binomial distribution with parameters n = 3 and p = 0.2.

Hypergeometric: finite population, sampling without replacement

Binomial: infinite population OR sampling with replacement

Binomial distribution: distribution of the number of defectives in samples drawn from a process under control (p = constant)


69/82

Computing Binomial Probability

n = 3, p = 0.2

$p(0) = \binom{3}{0} (0.2)^0 (0.8)^{3-0} = 1 \times 1 \times 0.512 = 0.512$

$p(1) = \binom{3}{1} (0.2)^1 (0.8)^{3-1} = 3 \times 0.2 \times 0.64 = 0.384$

$p(2) = \binom{3}{2} (0.2)^2 (0.8)^{3-2} = 3 \times 0.04 \times 0.8 = 0.096$

$p(3) = \binom{3}{3} (0.2)^3 (0.8)^{3-3} = 1 \times 0.008 \times 1 = 0.008$

$p(0) + p(1) + p(2) + p(3) = 0.512 + 0.384 + 0.096 + 0.008 = 1.000$
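The same arithmetic as a short Python sketch; `binom_pmf` is an illustrative helper name.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x) for a Binomial distribution with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 3, 0.2
probs = [binom_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(probs):
    print(f"p({x}) = {prob:.3f}")
print(f"total = {sum(probs):.3f}")
```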


70/82

Poisson Distribution

As an approximation to Binomial probability
  Small p (say < 0.1)
  Large n

As a distribution in its own right
  Count of defects/unit
  Infinite opportunities of occurrence
  Rare events: accidents, flaws in cloth, instances of power outages, absenteeism in large organizations, no. of production stoppages

Many opportunities and a maximum of 1 defect per opportunity

Defects are randomly distributed

Defect rate constant and proportional to area

No location preference


71/82

Poisson Probability

$P(x = r) = p(r) = \dfrac{e^{-\lambda} \lambda^r}{r!}, \quad r = 0, 1, 2, \ldots$

Example: The number of errors in bills raised by the billing department follows a Poisson distribution. The mean error rate per bill is 0.5. A bill is selected at random. What is the probability that the bill will contain (i) exactly two errors, (ii) at most two errors and (iii) at least two errors?

(i) $\lambda = 0.5$, $p(2) = e^{-0.5} (0.5)^2 / 2! = 0.6065 \times 0.25 / 2 = 0.076$

(ii) $p(\le 2) = p(0) + p(1) + p(2) = 0.6065 + 0.3033 + 0.076 = 0.986$, where

$p(0) = e^{-0.5} (0.5)^0 / 0! = 0.6065 \times 1 \times 1 = 0.6065$

$p(1) = e^{-0.5} (0.5)^1 / 1! = 0.6065 \times 0.5 \times 1 = 0.3033$

(iii) $p(\ge 2) = 1 - p(\le 1) = 1 - p(0) - p(1) = 1 - 0.6065 - 0.3033 = 0.09$
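The billing example can be verified with a short Python sketch; `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial


def poisson_pmf(r, lam):
    """P(x = r) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** r / factorial(r)


lam = 0.5  # mean number of errors per bill

p_exactly_2 = poisson_pmf(2, lam)
p_at_most_2 = sum(poisson_pmf(r, lam) for r in range(3))
p_at_least_2 = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

print(f"(i)   exactly two  : {p_exactly_2:.3f}")
print(f"(ii)  at most two  : {p_at_most_2:.3f}")
print(f"(iii) at least two : {p_at_least_2:.3f}")
```

Part (iii) uses the complement because the event "at least two" has infinitely many terms, while "at most one" has only two.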


72/82

Normal Distribution

[Figure: curve of f(x) against x with the inflection points marked.]

Symmetric, unimodal, bell shaped

Range: $-\infty$ to $+\infty$

Area under curve = 1

Also known as the Gaussian distribution

Arises naturally in many physical, biological and social measurements

Non-normal $\ne$ abnormal

All cases are approximations only: most measurements are non-negative


73/82

Normal Characteristics - Examples

"The normal law of error stands out in the experience of mankind as one of the broadest generalizations of natural philosophy. It serves as the guiding instrument in researches in the physical and social sciences and in medicine, agriculture and engineering. It is an indispensable tool for the analysis and interpretation of the basic data obtained by observation and experiment."
- W. J. Youden

Machined dimensions

Fill volume/weight

Colour density

Wear-out failure time

Germination at a given ageing

Height of Indian tribals

No. of single girls in a bar (1 - 2 P.M)

Return from a diversified portfolio


74/82

Central Limit Theorem

The distribution of an average ($\bar{X}$) or a sum ($\sum X$) tends to be Normal as the number of independent observations increases, irrespective of the distributional form of X.

Many statistical procedures are based on the assumption of Normality. The CLT acts as a safeguard for the validity of such applications.

[Figure] Aggregation of numerous small but independent random events. In this case there are eight events, each producing a small random displacement either to the left or to the right.
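The eight-event illustration can be worked out exactly in Python. This sketch assumes (the slide does not say so) that each displacement is one unit and that left and right are equally likely; it then enumerates all 2^8 possible sequences.

```python
from collections import Counter
from itertools import product

steps = 8  # eight independent events, as in the illustration above

# Assumption: each event shifts the position by -1 (left) or +1 (right)
# with equal probability, so all 2**8 sequences are equally likely.
counts = Counter(sum(path) for path in product((-1, 1), repeat=steps))
total = 2 ** steps

for position in sorted(counts):
    share = counts[position] / total
    print(f"{position:+d}  {share:.4f}  {'#' * counts[position]}")
```

Even with only eight two-valued events, the printed bars already show a symmetric, single-peaked, bell-like shape.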


75/82

Normal Density Function

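$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$

$\mu$ = Mean, $\sigma^2$ = Variance

[Figure: Normal curves with the points x, x1 and x2 marked on the x axis.]

$P(X < x) = F(x)$

$P(x_1 < X < x_2) = F(x_2) - F(x_1)$

$P(X > x) = 1 - F(x)$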


76/82

Normal Probability

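[Figure: Normal curves of f(x) against x showing the area within one, two and three standard deviations of the mean.]

$\mu \pm \sigma$: 68.27%

$\mu \pm 2\sigma$: 95.45%

$\mu \pm 3\sigma$: 99.73%

Popularly known as the 68-95-99.73 rule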
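The three percentages follow from the standard Normal distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area within k standard deviations of the mean: F(k) - F(-k).
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within {k} s.d. of the mean: {100 * area:.2f}%")
```

Because the areas are expressed in standard deviations, they hold for any Normal distribution, whatever its mean and variance.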


77/82

Standard Normal Distribution

[Figure: Normal curve with an X axis and a Z axis; Z is marked at -3, -2, -1, 0, 1, 2, 3.]

$\mu = 0$, $\sigma = 1$

$f(z) = \dfrac{1}{\sqrt{2\pi}} \, e^{-z^2 / 2}$


78/82

Z - Transform


79/82

Standard Normal Table

z     0.00      0.01      ...   0.09
0.0   0.50000   0.50399   ...   0.53586
0.1   0.53983   0.54379   ...   0.57534
...
1.0   0.84134   0.84375   ...   0.86214
...
2.0   0.97725   0.97778   ...   0.98169
...
3.0   0.99865   0.99869   ...   0.99900
...
3.9   0.99995   0.99995   ...   0.99997

Table entries give $P(Z \le z)$.

Other tables may give probabilities between 0 and z (z > 0), so be careful.

Tables giving probabilities for negative values of z are convenient but are not essential.

P(z > -2.01) = ?

P(-1.09 < z < 2) = ?

P(-0.19 < z < -0.01) = ?
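One way to check table look-ups for the three questions is to compute the same areas in Python. This is a sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal: mean 0, standard deviation 1

# P(z > -2.01): by symmetry this equals F(2.01).
p_a = 1 - Z.cdf(-2.01)

# P(-1.09 < z < 2) = F(2) - F(-1.09), with F(-1.09) = 1 - F(1.09).
p_b = Z.cdf(2) - Z.cdf(-1.09)

# P(-0.19 < z < -0.01): by symmetry this equals F(0.19) - F(0.01).
p_c = Z.cdf(-0.01) - Z.cdf(-0.19)

print(f"{p_a:.4f}  {p_b:.4f}  {p_c:.4f}")
```

Each result needs only entries for non-negative z (0.01, 0.19, 1.09, 2.00 and 2.01), which is why a table of negative z values is convenient but not essential.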


80/82

Normal Distribution - Exercise

The specification on viscosity of a chemical produced by a batch process is given as 16.5 ± 2.5. The viscosities of 10 consecutive batches produced in the immediate past are given below:

14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2

(a) Assuming viscosity follows a Normal distribution, find the expected rejection percent of batches. [Ans: 6.2%]

(b) Note that none of the 10 sample batches is rejected. Still, is there any cause for concern?
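Part (a) can be sketched in Python as follows. The sketch assumes that the population mean and standard deviation are estimated by the sample mean and the sample standard deviation (n - 1 divisor) of the ten batches; the variable names are illustrative.

```python
from statistics import NormalDist, mean, stdev

viscosity = [14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2]

# Specification 16.5 +/- 2.5 gives the limits 14.0 and 19.0.
lower, upper = 16.5 - 2.5, 16.5 + 2.5

# Assumption: the fitted Normal uses the sample mean and sample s.d.
x_bar = mean(viscosity)
s = stdev(viscosity)
fitted = NormalDist(x_bar, s)

p_below = fitted.cdf(lower)        # batches expected below 14.0
p_above = 1 - fitted.cdf(upper)    # batches expected above 19.0
reject_pct = 100 * (p_below + p_above)

print(f"mean = {x_bar:.2f}, s.d. = {s:.4f}")
print(f"expected rejection = {reject_pct:.1f}%")
```

Printing `p_below` and `p_above` separately shows which specification limit accounts for the expected rejections.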


81/82

Normal Probability Plotting

Purpose: To examine (based on sample data) whether the population distribution is Normal or not.

Method:

Rank the sample observations from smallest to largest (R = 1, 2, ..., n). Try to have n > 25.

Compute the observed relative cumulative frequency F(x) = (R - 0.5)/n [or F(x) = R/(n + 1)] for each x, where R is the rank of observation x.

Plot [x, F(x)] on Normal probability paper.

If the points fall approximately along a straight line, then the underlying distribution can be considered as Normal.
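The ranking and plotting-position steps can be sketched in Python, using the ten viscosity values from the exercise. Normal probability paper is replaced here by an assumed equivalent: each F(x) is converted to a standard Normal quantile z, so that plotting x against z on ordinary axes plays the same role.

```python
from statistics import NormalDist

viscosity = [14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2]
n = len(viscosity)
std_normal = NormalDist()

rows = []
for rank, x in enumerate(sorted(viscosity), start=1):
    f = (rank - 0.5) / n          # observed relative cumulative frequency
    z = std_normal.inv_cdf(f)     # position of f on the Normal probability scale
    rows.append((x, rank, f, z))

print("   x   R   F(x)      z")
for x, rank, f, z in rows:
    print(f"{x:5.1f} {rank:3d}  {f:5.2f}  {z:6.3f}")
```

Tied observations (15.6 appears twice) simply take consecutive ranks in this sketch.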


82/82

NPP - Example

Viscosity data: Specification 16.5 ± 2.5

14.8, 15.6, 16.9, 17.0, 14.9, 15.6, 14.5, 15.2, 15.7, 14.2

Observation (x)   Rank (R)   (R - 0.5)/10
14.2              1          0.05
14.5              2          0.15
14.8              3          0.25
14.9              4          0.35
15.2              5          0.45
15.6              6          0.55
15.6              7          0.65
15.7              8          0.75
16.9              9          0.85
17.0              10         0.95

[Figure: Probability Plot of Viscosity, Normal - 95% CI. Horizontal axis: Viscosity (12 to 19); vertical axis: Percent (1 to 99). Legend: Mean 15.44, StDev 0.9348, N 10, AD 0.359, P-Value 0.374.]

Date post:	03-Apr-2018
Upload:	derivatives-rd