Module 1 Statistical Inference


7/29/2019 Module 1 Statistical Inference

1/67

Statistical Inference

Dr. Basheer Ahmad Samim

8:16 PM


2/67

Course Outline

1. Review of Descriptive Statistics and SPSS

2. Random Variable and Mathematical Expectation

3. Discrete Probability Distributions (Binomial, Poisson)

4. Continuous Probability Distribution (Normal)

5. Sampling Theory

6. Confidence Intervals

7. Hypothesis Testing

8. Goodness of Fit

9. Regression and Correlation with ANOVA

10. Multiple Regression

11. All the topics will be SPSS oriented


3/67

Recommended Readings (Books)

Introduction to Statistics, Walpole, R. E., 3rd Edition (2000)

Statistical Methods for Practice and Research, by Ajai S. Gaur and Sanjaya S. Gaur


4/67

Attendance Policy

16 weeks of teaching, 16 lectures (32 attendances)

Roll call twice: once before the break and once after the break

At least 80% (24) attendance is compulsory to be eligible for the Final Examination

No roll call after the first Ten (5) minutes


5/67

Mode of Teaching

Lecture

SPSS Workshop

Discussion Session


6/67

Mode of Assessment

Quizzes (15%)

Assignments (15%)

Class Performance (5%)

Mid Term Test (25%)

Final Examination (40%)


7/67

Questionnaire



8/67

Variable

A characteristic or property that varies from individual to individual.


9/67

Constant

A characteristic or property that does not change from individual to individual.


10/67

Types of Variables

Variables are either Qualitative or Quantitative.

Quantitative variables are either Discrete or Continuous.


11/67

Nominal Scale

Variable categories are mutually exclusive and exhaustive.

Variable categories have no logical order.

Examples: Eye Color, Hair Color, Gender.


12/67

Ordinal Scale

Data categories are mutually exclusive and exhaustive.

Data classifications are ranked or ordered according to the particular trait they possess.

Example: Level of Knowledge about SPSS.


13/67

Interval Scale

Data categories are mutually exclusive and exhaustive.

Data classifications are ranked or ordered according to the particular trait they possess.

Equal differences in the characteristic are represented by equal differences in the measurements.

Examples: Temperature, Shoe Size and IQ scores.


14/67

Ratio Scale

Data categories are mutually exclusive and exhaustive.

Data classifications are ranked or ordered according to the particular trait they possess.

Equal differences in the characteristic are represented by equal differences in the measurements.

The zero point is the absence of the characteristic.

Examples: Height, Weight, Distance.


15/67

Measurement Scales

Nominal: Data may only be classified. Examples: Eye Color, Hair Color, Gender.

Ordinal: Data are ranked. Example: Level of Knowledge about SPSS.

Interval: True zero point does not exist. Examples: Temperature, Shoe Size, IQ Scores.

Ratio: Meaningful zero point and ratio between values. Examples: Height, Weight, Distance.


16/67

Data

The information collected for any kind of investigation.

Usually numerical, but can be qualitative.


17/67

Primary Data

The initial material collected during the research process.

The information collected directly from the respondent.

Personal Investigation, Through Investigator, Through Questionnaire, Through Local Sources, Through Telephone.


18/67

Secondary Data

The information collected and processed by people other than the researcher.

Government Organizations, Semi-Government Organizations.


19/67

Data Collection

Any of the following methods may be adopted:

(a) Personal interview

(b) Direct observation

(c) Mail interview (internet interview)

(d) Telephone interview

What are the pros and cons of each?


20/67

Data management

Office Editing,

Post Coding,

Data entry and Verification.



21/67

Data organization and Analysis

Preparing data for analysis,

Extracting descriptive measures from the data,

Using advanced statistical techniques to analyze the data and draw inferences therefrom.


22/67

Measures of Central Tendency

Arithmetic Mean

Quantiles (Median, Quartiles, Deciles, Percentiles)

Mode


23/67

Arithmetic Mean

A value obtained by dividing the sum of all the observations by their number.

$$\text{Arithmetic Mean} = \frac{\text{Sum of all the observations}}{\text{Number of the observations}}$$

If $X_1, X_2, \ldots, X_n$ are $n$ observations of a variable $X$, then

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$


24/67

Arithmetic Mean

The marks obtained by 8 students are:

67 72 68 70 65 68 75 63

$$\bar{X} = \frac{67 + 72 + \cdots + 63}{8} = \frac{548}{8} = 68.5 \text{ Marks}$$
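The arithmetic on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `arithmetic_mean` and the variable names are illustrative choices, not part of the course material (which is SPSS oriented).

```python
# Marks of the 8 students, copied from the slide.
marks = [67, 72, 68, 70, 65, 68, 75, 63]


def arithmetic_mean(values):
    """Sum of all the observations divided by their number."""
    return sum(values) / len(values)


mean_marks = arithmetic_mean(marks)
print(f"Sum = {sum(marks)}, n = {len(marks)}, mean = {mean_marks} marks")
```

The printed sum and mean should agree with the 548 and 68.5 shown in the worked example.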


25/67

Quantiles

For individual observations or a discrete frequency distribution, the ith quartile, jth decile and kth percentile are located in the array or discrete frequency distribution by the following relations:

$$Q_i = \frac{i(n+1)}{4}\text{th observation in the distribution}, \quad i = 1, 2, 3$$

$$D_j = \frac{j(n+1)}{10}\text{th observation in the distribution}, \quad j = 1, 2, \ldots, 9$$

$$P_k = \frac{k(n+1)}{100}\text{th observation in the distribution}, \quad k = 1, 2, \ldots, 99$$


26/67

Quartiles

The weekly TV Watching times (Hours):

25 41 27 32 43 66 35 31 15 5

34 26 32 38 16 30 38 30 20 21

The array of the above data is given below:

5 15 16 20 21 25 26 27 30 30

31 32 32 34 35 37 38 41 43 66


27/67

Quartiles

$$Q_1 = \frac{1(20+1)}{4}\text{th observation in the distribution} = 5.25\text{th observation in the distribution}$$

$$= 5\text{th obs.} + 0.25\{6\text{th obs.} - 5\text{th obs.}\} = 21 + 0.25\{25 - 21\} = 22.0 \text{ Hours}$$


28/67

Quartiles

$$Q_2 = \frac{2(20+1)}{4}\text{th observation in the distribution} = 10.50\text{th observation in the distribution}$$

$$= 10\text{th obs.} + 0.50\{11\text{th obs.} - 10\text{th obs.}\} = 30 + 0.50\{31 - 30\} = 30.5 \text{ Hours}$$
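The (n+1) location rule and the linear interpolation between neighbouring observations can be sketched in Python as follows. The helper name `quantile` and its arguments are hypothetical, chosen only for illustration. The list is the ordered array as printed on the slides, which shows 37 and 38 where the unsorted listing shows 38 twice; the worked values follow the array.

```python
def quantile(sorted_values, i, parts):
    """i-th quantile out of `parts` equal parts, using the (n+1) rule.

    The position i*(n+1)/parts is split into a whole part and a fraction;
    the value is interpolated between the two neighbouring observations.
    """
    pos = i * (len(sorted_values) + 1) / parts
    whole = int(pos)
    frac = pos - whole
    if whole < 1:                      # position falls before the 1st obs.
        return float(sorted_values[0])
    if whole >= len(sorted_values):    # position falls at or after the last obs.
        return float(sorted_values[-1])
    lower = sorted_values[whole - 1]   # the `whole`-th observation (1-based)
    upper = sorted_values[whole]       # the next observation
    return lower + frac * (upper - lower)


# Ordered array of weekly TV watching times (hours), as printed on the slide.
tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

q1 = quantile(tv_hours, 1, 4)   # position 5.25
q2 = quantile(tv_hours, 2, 4)   # position 10.50
print(f"Q1 = {q1} hours, Q2 = {q2} hours")
```

The same helper covers deciles and percentiles by passing `parts=10` or `parts=100`.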


29/67

Quantiles


30/67

Mode

The mode is the value which occurs most frequently in a set of data. Or, the mode is the value that occurs the maximum number of times in a sequence of observations.


31/67

Mode

The total automobile sales (in millions) in the United States for the last 14 years.

9.0 8.2 8.0 9.1 10.3 11.0 11.5

10.3 10.5 9.8 9.3 8.2 8.2 8.5

Mode = 8.2 million
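Counting how often each value occurs makes the mode explicit. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative only.

```python
from collections import Counter

# Automobile sales (in millions) for the 14 years, copied from the slide.
sales = [9.0, 8.2, 8.0, 9.1, 10.3, 11.0, 11.5,
         10.3, 10.5, 9.8, 9.3, 8.2, 8.2, 8.5]

counts = Counter(sales)
# most_common(1) returns the (value, frequency) pair with the highest count.
mode_value, mode_count = counts.most_common(1)[0]
print(f"Mode = {mode_value} million (occurs {mode_count} times)")
```

Here 8.2 occurs three times and 10.3 twice, so the mode is unique; with ties, `most_common` would return only one of the tied values.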


32/67

Measures of variation measure the variation present among the values of a data set, so measures of variation are measures of spread of values in the data.


33/67

Absolute Measures of Dispersion

Range

Quartile Deviation

Mean (Average) Deviation

Variance and Standard Deviation


34/67

Relative Measures of Dispersion

Coefficient of Range

Coefficient of Quartile Deviation

Coefficient of Mean Deviation

Coefficient of Variation (CV)


35/67

Range

Difference between the largest and the smallest observations

$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$


36/67

Disadvantages of the Range

Ignores the way in which data are distributed

[Figure: two data sets plotted on the same axis, 7 8 9 10 11 12; each has Range = 12 - 7 = 5]

Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5

Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120

Range = 120 - 1 = 119
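The sensitivity of the range to a single extreme value can be reproduced directly. A minimal sketch, with the hypothetical helper name `value_range`:

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


# The two data sets from the slide; they differ only in the last value.
without_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                   2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
with_outlier = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]

print("Range without outlier:", value_range(without_outlier))
print("Range with outlier:   ", value_range(with_outlier))
```

Changing one observation out of 25 moves the range from 4 to 119, while the other 24 values are untouched.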


37/67

Inter-quartile Range (IQR)

Inter-quartile range = 3rd quartile - 1st quartile = Q3 - Q1

IQR is independent of outliers


38/67

Inter-quartile Range

[Figure: number line with Xminimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Xmaximum = 70; each of the four segments contains 25% of the data]

Inter-quartile Range (IQR) = 57 - 30 = 27


39/67

The Mean (Absolute) Deviation

Mean Deviation is the average of the absolute deviations taken from the mean value.

X        X - Mean    |X - Mean|
8        3           3
5        0           0
2        -3          3
Total    0           6

$$\text{M.D.} = \frac{\sum |X - \bar{X}|}{n} = \frac{6}{3} = 2$$
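The three-value example can be recomputed step by step. This is a sketch with an illustrative function name, `mean_deviation`; it follows the definition on the slide (deviations are taken from the mean).

```python
def mean_deviation(values):
    """Average of the absolute deviations taken from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


x = [8, 5, 2]                          # values from the slide
mean_x = sum(x) / len(x)               # 5.0
deviations = [v - mean_x for v in x]   # 3, 0, -3: these always sum to 0

print("Deviations:", deviations, "sum =", sum(deviations))
print("Mean deviation:", mean_deviation(x))
```

The signed deviations cancel out, which is why the absolute values are averaged instead.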


40/67

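Variance

Variance is the average of the squared deviations taken from the mean value.

X (cm)    (X - Mean)^2    X^2
4         36              16
6         16              36
9         1               81
12        4               144
13        9               169
16        36              256
60        102             702      (totals)

$$\text{(i)}\quad S^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{102}{6} = 17 \text{ cm}^2$$

$$\text{(ii)}\quad S^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{702}{6} - \left(\frac{60}{6}\right)^2 = 17 \text{ cm}^2$$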
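Both forms of the variance, the definitional one and the computational shortcut, can be checked against each other. A minimal sketch using the six values from the table; variable names are illustrative, and the divisor is n (not n - 1), as in the slide.

```python
# Six observations in cm, copied from the table.
x_cm = [4, 6, 9, 12, 13, 16]
n = len(x_cm)
mean = sum(x_cm) / n                                   # 60 / 6 = 10.0

# (i) Definition: average of the squared deviations from the mean.
var_definition = sum((x - mean) ** 2 for x in x_cm) / n

# (ii) Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in x_cm) / n - (sum(x_cm) / n) ** 2

# Standard deviation is the positive square root of the variance.
sd = var_definition ** 0.5

print(f"Variance (i) = {var_definition} cm^2, (ii) = {var_shortcut} cm^2")
print(f"Standard deviation = {sd:.3f} cm")
```

The shortcut form avoids computing each deviation, which is convenient for hand calculation from the column totals.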


41/67

Comparing Standard Deviations

[Figure: three dot plots on a common axis from 11 to 21]

Data A: Mean = 15.5, S = 3.338

Data B: Mean = 15.5, S = 0.926

Data C: Mean = 15.5, S = 4.567

The smaller the standard deviation, the more tightly clustered the scores around the mean.

The larger the standard deviation, the more spread out the scores from the mean.


42/67

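Relative Measures of Variation

$$\text{Coefficient of Range} = \frac{X_{\text{Largest}} - X_{\text{Smallest}}}{X_{\text{Largest}} + X_{\text{Smallest}}}$$

$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

$$\text{Coefficient of Mean Deviation} = \frac{\text{MD}}{\text{Mean}}$$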


43/67

Coefficient of Variation (CV)

Can be used to compare two or more sets of data measured in different units, or in the same units but with different average size.

$$CV = \frac{S}{\bar{X}} \times 100\%$$


44/67

Use of Coefficient of Variation

Stock A:

Average price last year = $50

Standard deviation = $5

$$CV_A = \frac{S}{\bar{X}} \times 100\% = \frac{5}{50} \times 100\% = 10\%$$

Stock B:

Average price last year = $100

Standard deviation = $5

$$CV_B = \frac{S}{\bar{X}} \times 100\% = \frac{5}{100} \times 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price.
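The stock comparison reduces to one formula applied twice. A minimal sketch, with the hypothetical helper name `coefficient_of_variation`:

```python
def coefficient_of_variation(sd, mean):
    """CV = standard deviation as a percentage of the mean."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)   # Stock B

print(f"CV of stock A = {cv_a}%, CV of stock B = {cv_b}%")
print("Less variable relative to its price:", "B" if cv_b < cv_a else "A")
```

Because the CV is a pure percentage, it would give the same comparison if the two prices were quoted in different currencies.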


45/67

Appropriate Choice of Measure of Variability

If data are symmetric, with no serious outliers, use range and standard deviation.

If data are skewed, and/or have serious outliers, use IQR.

If comparing variation across two data sets, use coefficient of variation (C.V.).


46/67

Five Number Summary

The five number summary of a data set consists of the minimum value, the first quartile, the second quartile, the third quartile and the maximum value, written in that order: Min, Q1, Q2, Q3, Max.

From the three quartiles we can obtain a measure of central tendency (the median, Q2) and measures of variation of the two middle quarters of the distribution: Q2 - Q1 for the second quarter and Q3 - Q2 for the third quarter.


47/67

Five Number Summary

The weekly TV viewing times (in hours).

25 41 27 32 43 66 35 31 15 5

34 26 32 38 16 30 38 30 20 21

The array of the above data is given below:

5 15 16 20 21 25 26 27 30 30

31 32 32 34 35 37 38 41 43 66


48/67

Five Number Summary

Minimum value = 5.0

LOCATION of Q1: 1(20+1)/4 th obs. in the data = 5.25th obs.

VALUE of Q1: 5th obs. + 0.25{6th obs. - 5th obs.} = 21 + 0.25{25 - 21} = 22.0 Hrs

LOCATION of Q2: 2(20+1)/4 th obs. in the data = 10.50th obs.

VALUE of Q2: 10th obs. + 0.50{11th obs. - 10th obs.} = 30 + 0.50{31 - 30} = 30.5 Hrs

LOCATION of Q3: 3(20+1)/4 th obs. in the data = 15.75th obs.

VALUE of Q3: 15th obs. + 0.75{16th obs. - 15th obs.} = 35 + 0.75{37 - 35} = 36.5 Hrs

Maximum value = 66.0
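The whole five number summary can be produced in one pass. The sketch below repeats the (n+1) location rule so that it runs on its own; the function names `quartile` and `five_number_summary` are illustrative, and the data are the ordered array as printed on the slide.

```python
def quartile(sorted_values, i):
    """i-th quartile located at the i*(n+1)/4-th observation, interpolated."""
    pos = i * (len(sorted_values) + 1) / 4
    whole = int(pos)                   # whole part of the position
    frac = pos - whole                 # fractional part of the position
    lower = sorted_values[whole - 1]   # the `whole`-th observation (1-based)
    upper = sorted_values[whole]       # the following observation
    return lower + frac * (upper - lower)


def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max) in that order."""
    ordered = sorted(values)
    return (ordered[0],
            quartile(ordered, 1),
            quartile(ordered, 2),
            quartile(ordered, 3),
            ordered[-1])


# Ordered array of weekly TV viewing times (hours), as printed on the slide.
tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

summary = five_number_summary(tv_hours)
print("Min, Q1, Q2, Q3, Max =", summary)
```

This `quartile` helper assumes the data set is large enough that positions 1 to n-1 are hit, which holds for the 20 observations here.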


49/67

Box and Whisker Diagram

A box and whisker diagram, or box-plot, is a graphical means of displaying the five number summary of a set of data. In a box-plot the first quartile is placed at the lower hinge and the third quartile is placed at the upper hinge. The median is placed in between these two hinges. The two lines emanating from the box are called whiskers. The box and whisker diagram was introduced by Professor John W. Tukey.


50/67

Construction of Box-Plot

1. Start the box from Q1 and end at Q3

2. Within the box draw a line to represent Q2

3. Draw the lower whisker from Min. Value up to Q1

4. Draw the upper whisker from Q3 up to Max. Value

[Figure: box-plot with Min. Value, Q1, Q2, Q3 and Max. Value labeled]


51/67

Construction of Box-Plot

1. Q1 = 22.0, Q3 = 36.5

2. Q2 = 30.5

3. Minimum Value = 5.0

4. Maximum Value = 66.0

[Figure: box-plot of the TV viewing data on a vertical axis from 0 to 70]


52/67

Interpretation of Box-Plot

[Figure: box-plot on a vertical axis from 0 to 70]

A Box-Whisker Plot is useful to identify:

Maximum and Minimum Values in the data

Median of the data

IQR = Q3 - Q1; a lengthy box indicates more variability in the data

Shape of the data, from the position of the line within the box:

Line at the center of the box: Symmetrical

Line above the center of the box: Negatively skewed

Line below the center of the box: Positively skewed

Detection of Outliers in the data


53/67

Outliers

An outlier is a value that falls well outside the overall pattern of the data. It might be

the result of a measurement or recording error,

a member from a different population,

simply an unusual extreme value.

An extreme value need not be an outlier; it might, instead, be an indication of skewness.


54/67

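Inner and Outer Fences

If Q1 = 22.0, Q2 = 30.5, Q3 = 36.5, then IQR = Q3 - Q1 = 14.5.

Inner Fences:

$$\text{Lower Inner Fence} = Q_1 - 1.5\,\text{IQR} = 0.25$$

$$\text{Upper Inner Fence} = Q_3 + 1.5\,\text{IQR} = 58.25$$

Outer Fences:

$$\text{Lower Outer Fence} = Q_1 - 3\,\text{IQR} = -21.5$$

$$\text{Upper Outer Fence} = Q_3 + 3\,\text{IQR} = 80.0$$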


55/67

Identification of the Outliers

1. The values that lie within the inner fences are normal values.

2. The values that lie outside the inner fences but inside the outer fences are possible/suspected/mild outliers.

3. The values that lie outside the outer fences are sure outliers.

Plot each suspected outlier with an asterisk and each sure outlier with a hollow dot.

[Figure: box-plot on a vertical axis from 0 to 80, with 66 marked by an asterisk]

Only 66 is a mild outlier.
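The fence calculation and the three-way classification can be sketched as below. The quartiles are taken as given on the slides (Q1 = 22.0, Q3 = 36.5); the function name `classify` and the label strings are illustrative assumptions, and values lying exactly on a fence are treated here as inside it.

```python
# Quartiles of the TV viewing data, as given on the slides.
q1, q3 = 22.0, 36.5
iqr = q3 - q1                      # 14.5

lower_inner = q1 - 1.5 * iqr       # 0.25
upper_inner = q3 + 1.5 * iqr       # 58.25
lower_outer = q1 - 3 * iqr         # -21.5
upper_outer = q3 + 3 * iqr         # 80.0


def classify(value):
    """Label a value as normal, mild outlier, or sure outlier."""
    if value < lower_outer or value > upper_outer:
        return "sure outlier"
    if value < lower_inner or value > upper_inner:
        return "mild outlier"
    return "normal"


# Ordered array of weekly TV viewing times (hours), as printed on the slide.
tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

mild = [v for v in tv_hours if classify(v) == "mild outlier"]
sure = [v for v in tv_hours if classify(v) == "sure outlier"]
print("Inner fences:", (lower_inner, upper_inner))
print("Outer fences:", (lower_outer, upper_outer))
print("Mild outliers:", mild, "Sure outliers:", sure)
```

The smallest value, 5, stays inside the lower inner fence of 0.25, so only the largest value is flagged.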


56/67

Uses of Box and Whisker Diagram

Box plots are especially suitable for comparing two or more data sets. In such a situation the box plots are constructed on the same scale.

[Figure: side-by-side box plots for Male and Female]


57/67

Standardized Variable

A variable that has mean 0 and variance 1 is called a standardized variable.

Values of a standardized variable are called standard scores.

Values of a standardized variable, i.e. standard scores, are unit-less.

Construction:

$$Z = \frac{\text{Variable} - \text{Mean of Variable}}{\text{Standard Deviation of Variable}}$$


58/67

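Standardized Variable

X        (X - Mean)^2    Z          (Z - Mean)^2
3        25              -1.3624    1.8561
6        4               -0.5450    0.2970
11       9               0.8174     0.6682
12       16              1.0899     1.1879
32       54              0          4.009      (totals)

$$\bar{X} = \frac{\sum X}{n} = \frac{32}{4} = 8, \qquad S_x^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{54}{4} = 13.5, \qquad S_x = 3.67$$

$$Z = \frac{X - \bar{X}}{S_x}$$

$$\bar{Z} = \frac{\sum Z}{n} = \frac{0}{4} = 0, \qquad S_z^2 = \frac{\sum (Z - \bar{Z})^2}{n} = \frac{4.009}{4} \approx 1$$

Variable Z has mean 0 and variance 1, so Z is a standard variable.

Standard Score at X = 11 is

$$Z = \frac{X - \bar{X}}{S_x} = \frac{11 - 8}{3.67} = 0.8174$$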
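The standardization of the four values can be verified numerically. A minimal sketch with illustrative variable names. Note one assumption about rounding: the code keeps the standard deviation at full precision (about 3.6742), whereas the slide rounds it to 3.67 before dividing, so the printed z-scores differ from the table in the third decimal place.

```python
import math

x = [3, 6, 11, 12]                       # values from the table
n = len(x)

mean_x = sum(x) / n                      # 32 / 4 = 8.0
var_x = sum((v - mean_x) ** 2 for v in x) / n   # 54 / 4 = 13.5
sd_x = math.sqrt(var_x)                  # about 3.6742

# Standard scores: subtract the mean, divide by the standard deviation.
z = [(v - mean_x) / sd_x for v in x]

mean_z = sum(z) / n
var_z = sum((v - mean_z) ** 2 for v in z) / n

print("z-scores:", [round(v, 4) for v in z])
print(f"mean of Z = {mean_z:.4f}, variance of Z = {var_z:.4f}")
```

Whatever the original units of X, the resulting Z values are unit-less with mean 0 and variance 1.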


59/67

Performance evaluation by z-scores

The industry in which sales rep Mr. Atif works has mean annual sales = $2,500 and standard deviation = $500.

The industry in which sales rep Mr. Asad works has mean annual sales = $4,800 and standard deviation = $600.

Last year Mr. Atif's sales were $4,000 and Mr. Asad's sales were $6,000.

Which of the representatives would you hire if you have one sales position to fill?


60/67

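Performance evaluation by z-scores

Sales rep. Atif: industry mean = $2,500, standard deviation S_B = $500, own sales X_B = $4,000

$$Z_B = \frac{X_B - \bar{X}_B}{S_B} = \frac{4{,}000 - 2{,}500}{500} = 3$$

Sales rep. Asad: industry mean = $4,800, standard deviation S_P = $600, own sales X_P = $6,000

$$Z_P = \frac{X_P - \bar{X}_P}{S_P} = \frac{6{,}000 - 4{,}800}{600} = 2$$

Mr. Atif is the better choice.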
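The comparison of the two sales representatives is a direct application of the standard score. A minimal sketch; the function name `z_score` and the dictionary layout are illustrative only.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# Industry mean, industry standard deviation and own sales, from the slide.
reps = {
    "Atif": {"sales": 4000, "mean": 2500, "sd": 500},
    "Asad": {"sales": 6000, "mean": 4800, "sd": 600},
}

scores = {name: z_score(r["sales"], r["mean"], r["sd"]) for name, r in reps.items()}
best = max(scores, key=scores.get)   # rep with the highest z-score

print("z-scores:", scores)
print("Higher relative performance:", best)
```

Mr. Asad sold more in absolute terms, but Mr. Atif is further above his own industry's average when measured in standard deviations.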


61/67

The Empirical Rule

For a bell-shaped distribution:

$$\bar{X} \pm 1S \text{ contains about 68\% of values}$$

$$\bar{X} \pm 2S \text{ contains about 95\% of values}$$

$$\bar{X} \pm 3S \text{ contains about 99.7\% of values}$$
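The three percentages can be compared with the exact areas under a normal curve. The sketch below assumes a normal distribution and uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} S of the mean: {100 * share_within(k):.2f}%")
```

The exact normal areas round to the 68%, 95% and 99.7% quoted in the rule; for data that are not bell-shaped the proportions can differ.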


62/67

Measures of Skewness

A distribution in which the values equidistant from the centre have equal frequencies is defined to be symmetrical, and any departure from symmetry is called skewness.

For a symmetrical distribution:

1. Length of Right Tail = Length of Left Tail

2. Mean = Median = Mode

3. Sk = 0, where

a) Sk = (Mean - Mode)/SD

b) Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1)


63/67

Measures of Skewness

A distribution is positively skewed if the observations tend to concentrate more at the lower end of the possible values of the variable than the upper end. A positively skewed frequency curve has a longer tail on the right hand side.

1. Length of Right Tail > Length of Left Tail

2. Mean > Median > Mode

3. Sk > 0


64/67

Measures of Skewness

A distribution is negatively skewed if the observations tend to concentrate more at the upper end of the possible values of the variable than the lower end. A negatively skewed frequency curve has a longer tail on the left side.

1. Length of Right Tail < Length of Left Tail

2. Mean < Median < Mode

3. Sk < 0
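The two skewness coefficients listed earlier, (Mean - Mode)/SD and (Q3 - 2Q2 + Q1)/(Q3 - Q1), can be sketched as small functions. The function names are illustrative, and the quartile values are those computed for the TV viewing data on the earlier slides.

```python
def mode_based_skewness(mean, mode, sd):
    """Sk = (Mean - Mode) / SD."""
    return (mean - mode) / sd


def quartile_skewness(q1, q2, q3):
    """Sk = (Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)


def describe(sk):
    """Translate the sign of Sk into the shape named on the slides."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"


# Quartiles of the TV viewing data from the earlier slides.
sk_tv = quartile_skewness(q1=22.0, q2=30.5, q3=36.5)
print(f"Quartile-based Sk = {sk_tv:.4f} ({describe(sk_tv)})")
```

The quartile-based coefficient only looks at the middle half of the data, so it is not influenced by the extreme value 66 in that data set.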


65/67

Measures of Kurtosis

Kurtosis is the degree of peakedness or flatness of a unimodal (single humped) distribution.

When the values of a variable are highly concentrated around the mode, the peak of the curve becomes relatively high; the curve is Leptokurtic.

When the values of a variable have low concentration around the mode, the peak of the curve becomes relatively flat; the curve is Platykurtic.

A curve which is neither very peaked nor very flat-topped, and which is taken as a basis for comparison, is called Mesokurtic/Normal.


66/67




67/67


$$\text{Coefficient of Kurtosis} = \frac{n \sum (X - \bar{X})^4}{\left[\sum (X - \bar{X})^2\right]^2}$$

1. If Coefficient of Kurtosis > 3, the curve is Leptokurtic.

2. If Coefficient of Kurtosis = 3, the curve is Mesokurtic.

3. If Coefficient of Kurtosis < 3, the curve is Platykurtic.
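The coefficient of kurtosis, n times the sum of fourth-power deviations divided by the square of the sum of squared deviations, can be sketched as follows. The function names are illustrative, and the example reuses the six values from the variance slide purely as sample input.

```python
def kurtosis_coefficient(values):
    """n * sum((x - mean)^4) / (sum((x - mean)^2))^2."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)    # sum of squared deviations
    sum_4th = sum((x - mean) ** 4 for x in values)   # sum of 4th-power deviations
    return n * sum_4th / sum_sq ** 2


def kurtosis_type(coefficient):
    """Apply the three rules: > 3, = 3, < 3."""
    if coefficient > 3:
        return "Leptokurtic"
    if coefficient < 3:
        return "Platykurtic"
    return "Mesokurtic"


x = [4, 6, 9, 12, 13, 16]            # sample input from the variance slide
k = kurtosis_coefficient(x)
print(f"Coefficient of Kurtosis = {k:.4f} ({kurtosis_type(k)})")
```

With only six observations the label is illustrative rather than a reliable description of shape; in practice the exact "= 3" case will rarely occur for real data.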
