Series on Multivariate Analysis • Vol. 3

MATHEMATICAL METHODS IN SAMPLE SURVEYS

Howard G. Tucker
University of California, Irvine

World Scientific
Singapore • New Jersey • London • Hong Kong
SERIES ON MULTIVARIATE ANALYSIS
Editor: M M Rao
Published
Vol. 1: Martingales and Stochastic Analysis J. Yeh
Vol. 2: Multidimensional Second Order Stochastic Processes Y. Kakihara
Forthcoming
Convolution Structures and Stochastic Processes R. Lasser
Topics in Circular Statistics S. R. Jammalamadaka and A. SenGupta
Abstract Methods in Information Theory Y. Kakihara
Published by
World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data Tucker, Howard G.
Mathematical methods in sample surveys / Howard G. Tucker. p. cm. — (Series on multivariate analysis : vol. 3)
Includes bibliographical references (p. - ) and index. ISBN 9810226179 1. Sampling (Statistics) I. Title. II. Series.
QA276.6.T83 1998 519.5'2--dc21 98-29452
CIP
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
First published 1998 Reprinted 2002
Copyright © 1998 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
Printed in Singapore by Uto-Print
Preface

As the title of this book suggests, it is a textbook about some mathematical methods in sample surveys. It is not about the nuts and bolts of setting up a sample survey, but it does introduce students (or readers) to some basic methodology of doing sample surveys. The mathematics is both elementary and rigorous, and it requires as a prerequisite the satisfactory experience of one or two years of university mathematics courses. It is suitable for a one year junior-senior level course for mathematics and statistics majors; it is also suitable for students in the social sciences who are not handicapped by a fear of proofs in mathematics. It requires no previous knowledge of statistics, and it could actually serve as both an intuitive and mathematically rigorous introduction to statistics. A sizable part of the book covers only those topics in discrete probability that are needed for the sampling methods treated here. Topics in sampling that are covered in depth include simple random sampling with and without replacement, sampling with unequal probabilities, various linear relationships, stratified sampling, cluster sampling and two stage sampling.
There is just enough material included here for a one year undergraduate course, and it has been used as such at the University of California at Irvine for the last twenty years. The first five chapters cover the discrete probability needed for the next six chapters; these can be covered in an academic quarter. It should be pointed out that a usual one quarter course in discrete probability cannot replace what is developed in these five chapters. For one thing, considerable emphasis on working with multivariate discrete densities was needed because of the dependence that arises when the sampling is done without replacement. Also the material on conditional expectation and conditional variance and conditional covariance as random variables is rarely, if at all, treated at the elementary level as it is here. It is this body of results that is so important in developing the material in the sample survey
part of the book, and without any handwaving. This is particularly true for Chapters 7 through 11. It should also be stated that there is no fat in Chapters 1 through 5. Indeed, the topics covered in these chapters were not settled upon until the material in Chapters 6 through 11 was finally in place, and great care was taken to ensure that Chapters 1 through 5 contained the minimal amount of material needed for the remaining chapters.
There is no doubt as to the importance of the topics covered in this text for students specializing in statistics and biostatistics. Awareness of them is also important for students in the social sciences and in the various areas of business administration. But I would like to include some comments on the importance of a course based on this text for students majoring in pure mathematics. Except for the unproved central limit theorem in Chapter 5 (which is not invoked in the proofs of any of the results following that chapter), this text can be claimed to be an example of an undergraduate course that teaches utmost mathematical rigor. What is more, the development is a vertical one, and very few of the chapters can be taken out of order. I call everyone's attention to Chapter 4 where results on conditional expectation and conditional variance as random variables are developed. In this chapter conditional expectation is defined as a number and as a random variable. As a random variable, all properties that are usually obtained by a certain amount of measure-theoretic prowess elsewhere are here obtained by rather elementary methods. In addition, in this setting basic results are obtained on conditional variance and conditional covariance which culminate with the Rao-Blackwell theorem.
I have two hopes connected with this text and the course it serves. One hope is that the student who is primarily applications oriented will appreciate and enjoy the mathematical ideas behind the problems of estimation in sample surveys. At the same time I hope that those who are primarily oriented in the direction of pure and abstract mathematics will see that one can keep this orientation and at the same time enjoy how well it touches on real life.
I wish to express my appreciation to Mrs. Mary Moore who did the original LaTeX typesetting for almost all of this document. Professors Mark Finkelstein and Jerry A. Veeh contributed greatly to my entrance
into the age of computer typesetting; indeed, the completion of this document might never have taken place without their help.
This book is dedicated to my wife, Marcia.
Howard G. Tucker Irvine, California November 20, 1997
Contents

1 Events and Probability 1
  1.1 Introduction to Probability 1
  1.2 Combinatorial Probability 3
  1.3 The Algebra of Events 9
  1.4 Probability 17
  1.5 Conditional Probability 20
2 Random Variables 27
  2.1 Random Variables as Functions 27
  2.2 Densities of Random Variables 31
  2.3 Some Particular Distributions 41
3 Expectation 47
  3.1 Properties of Expectation 47
  3.2 Moments of Random Variables 51
  3.3 Covariance and Correlation 56
4 Conditional Expectation 65
  4.1 Definition and Properties 65
  4.2 Conditional Variance 72
5 Limit Theorems 83
  5.1 The Law of Large Numbers 83
  5.2 The Central Limit Theorem 86
6 Simple Random Sampling 91
  6.1 The Model 91
  6.2 Unbiased Estimates for Y and Ȳ 99
  6.3 Estimation of Sampling Errors 103
  6.4 Estimation of Proportions 107
  6.5 Sensitive Questions 112
7 Unequal Probability Sampling 117
  7.1 How to Sample 117
  7.2 WR Probability Proportional to Size Sampling 122
  7.3 WOR Probability Proportional to Size Sampling 128
8 Linear Relationships 135
  8.1 Linear Regression Model 135
  8.2 Ratio Estimation 138
  8.3 Unbiased Ratio Estimation 144
  8.4 Difference Estimation 148
  8.5 Which Estimate? An Advanced Topic 150
9 Stratified Sampling 155
  9.1 The Model and Basic Estimates 155
  9.2 Allocation of Sample Sizes to Strata 161
10 Cluster Sampling 169
  10.1 Unbiased Estimate of the Mean 169
  10.2 The Variance 175
  10.3 An Unbiased Estimate of Var(Y) 177
11 Two-Stage Sampling 183
  11.1 Two-Stage Sampling 183
  11.2 Sampling for Non-Response 189
  11.3 Sampling for Stratification 196
A The Normal Distribution 203
Index 205
Chapter 1
Events and Probability
1.1 Introduction to Probability
The notion of the probability of an event may be approached by at least three methods. One method, perhaps the first historically, is to repeat an experiment or game (in which a certain event might or might not occur) many times under identical conditions and compute the relative frequency with which the event occurs. This means: divide the total number of times that the specific event occurs by the total number of times the experiment is performed or the game is played. This ratio is called the relative frequency and is really only an approximation of what would be considered as the probability of the event. For example, if one tosses a penny 25 times, and if it comes up heads exactly 13 times, then we would estimate the probability that this particular coin will come up heads when tossed is 13/25 or 0.52. Although this method of arriving at the notion of probability is the most primitive and unsophisticated, it is the most meaningful to the practical individual, in particular, to the working scientist and engineer who have to apply the results of probability theory to real-life situations. Accordingly, whatever results one obtains in the theory of probability and statistics, one should be able to interpret them in terms of relative frequency. A second approach to the notion of probability is from an axiomatic point of view. That is, a minimal list of axioms is set down which assumes certain properties of probabilities. From this minimal set of assumptions
the further properties of probability are deduced and applied. A third approach to the notion of probability is limited in application but is sufficient for our study of sample surveys. This approach is that of probability in the "equally likely" case. Let us consider some game or experiment which, when played or performed, has among its possible outcomes a certain event E. For example, in tossing a die once, the event E might be: the outcome is an even number. In general, we suppose that the experiment or game has a certain number of mutually exclusive "equally likely" outcomes. Let us further suppose that a certain event E can occur in any one of a specified number of these "equally likely" outcomes. Then the probability of the event is defined to be the number of "equally likely" ways in which the event can occur divided by the total number of possible "equally likely" outcomes. It must be emphasized here that the number of equally likely ways in which the event can occur must be from among the total number of equally likely outcomes. For example, if, as above, the experiment or game is the single toss of a fair die in which the "equally likely" outcomes are the numbers {1,2,3,4,5,6}, and if the event E considered is that the outcome is an even number, i.e., is 2, 4 or 6, then the probability of E here is defined to be 3/6 or 1/2. This approach is limited, as was mentioned above, because in many games and experiments the possible outcomes are not equally likely.

The probability model used in this course is the "equally likely" model.
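The two notions above can be illustrated with a short sketch (not part of the text): the "equally likely" probability of an even die outcome, and the relative-frequency estimate from the penny example.

```python
from fractions import Fraction

# Equally likely model: P(E) = (# outcomes in E) / (# outcomes in the sure event).
# Illustration: one toss of a fair die, E = "the outcome is even".
omega = {1, 2, 3, 4, 5, 6}
even = {x for x in omega if x % 2 == 0}
p_even = Fraction(len(even), len(omega))
print(p_even)  # 1/2

# Relative frequency from repeated trials only approximates a probability:
# 13 heads in 25 tosses of a penny gives the estimate 13/25.
print(13 / 25)  # 0.52
```

The `Fraction` type is used only so the ratio prints exactly; ordinary division would do as well.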
EXERCISES
1. A (possibly loaded) die was tossed 150 times. The number 1 came up 27 times, 2 came up 26 times, 3 came up 24 times, 4 came up 20 times, 5 came up 29 times and 6 came up 24 times.
a) Compute the relative frequency of the event that on the toss of this die the outcome is 1.
b) Find the relative frequency of the event that the outcome is even.
c) Find the relative frequency of the event that the outcome is not less than 5.
2. Twenty numbered tags are in a hat. The number 1 is on 7 of the tags, the number 2 is on 5 of the tags, and the number 3 is on 8 of the tags. The experiment is to stir the tags without looking and to select one tag "at random".
a) What are the total number of equally likely outcomes of the experiment?
b) From among these 20 equally likely outcomes what is the total number of ways in which the outcome is the number 1?
c) Compute the probability of selecting a tag numbered 1. Do the same for 2 and 3.
d) What is the sum of the probabilities obtained in (c)?
1.2 Combinatorial Probability

We now consider the computation of probabilities in the "equally likely" case. Let us suppose that we have n different objects, and we want to arrange k of these in a row (where, of course, k < n). We wish to know in how many ways this can be accomplished. As an example, suppose there are five members of a committee, call them A, B, C, D, E, and we want to know in how many ways we can select a chairman and a secretary. When we select the arrangement (C, A), we mean that C is the chairman and A is the secretary. In this case n = 5 and k = 2. The different arrangements are listed as follows:
(A,B) (A,C) (A,D) (A,E) (B,A) (B,C) (B,D) (B,E) (C,A) (C,B) (C,D) (C,E) (D,A) (D,B) (D,C) (D,E) (E,A) (E,B) (E,C) (E,D)
One sees that there are 20 such arrangements. The number 20 can also be obtained by the following reasoning: there are five ways in which the chairman can be selected (which accounts for the five horizontal rows of pairs), and for each chairman selected there are four ways of selecting the secretary (which accounts for the four vertical columns).
Consequently there are 20 such pairs. In general, if we want to determine in how many ways we can arrange k out of n objects, we reason as follows. There are n ways of selecting the first object. For each way we select the first object there are n − 1 ways of selecting the second object. Hence the total number of ways in which the first two objects can be selected is n(n − 1). For every way in which the first two objects are selected there are n − 2 ways of selecting the third object. Thus the number of ways in which the first three objects can be selected is n(n − 1)(n − 2). From this one can easily conclude that the number of ways in which k out of n objects can be laid in a row is n(n − 1)(n − 2) ⋯ (n − (k − 1)), which can be written as the ratio of factorials: n!/(n − k)! (Recall: 5! = 1 × 2 × 3 × 4 × 5.) This is also referred to as the number of permutations of n things taken k at a time.
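The count of 20 arrangements, and the general formula n!/(n − k)!, can be checked with a small sketch (an illustration, not part of the text):

```python
from itertools import permutations
from math import factorial

# Arrangements (permutations) of k out of n objects, as in the committee
# example: n = 5 members, k = 2 offices (chairman, secretary).
committee = ["A", "B", "C", "D", "E"]
pairs = list(permutations(committee, 2))
print(len(pairs))  # 20

# The formula n!/(n - k)! gives the same count.
n, k = 5, 2
print(factorial(n) // factorial(n - k))  # 20
```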
In the above arrangements (or permutations) of n things taken k at a time, we counted each way in which we could arrange the same k objects in a row. Suppose, however, that one is interested only in the number of ways k objects can be selected out of n objects and is not interested in order or arrangement. In the case of the committee discussed above, the ways in which two members can be selected out of the five to form a subcommittee are as follows:
(A,B) (A,C) (A,D) (A,E) (B,C) (B,D) (B,E) (C,D) (C,E) (D,E)

We do not list (D, B) as before, because the subcommittee denoted by (D, B) is the same as that denoted by (B, D), which is already listed. Thus, now we have only half the number of selections. In general, if we want to find the number of ways in which one can select k objects out of n objects, we reason as follows. As before, there are n!/(n − k)! ways of arranging (or permuting) n objects taken k at a time. However, all k! ways of arranging each k objects are included here. Hence we must divide the n!/(n − k)! ways of arranging k out of n objects by k! to obtain the number of ways in which we can make the k selections. This number of ways in which we can select k objects out of n objects without regard to order is usually referred to as the number of combinations of n objects or things taken k at a time. It is usually denoted by the
binomial coefficient:

    (n choose k) = n!/(k!(n − k)!).

This binomial coefficient is encountered in the binomial theorem which states:

    (a + b)ⁿ = Σₖ₌₀ⁿ (n choose k) aᵏ bⁿ⁻ᵏ,
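A brief sketch (an illustration, not part of the text) confirms the combination count for the subcommittee example and the binomial theorem for one particular choice of a, b, n:

```python
from itertools import combinations
from math import comb

# Combinations: choosing k of n objects without regard to order.
# The subcommittee example: n = 5, k = 2 gives 10 selections.
committee = ["A", "B", "C", "D", "E"]
print(len(list(combinations(committee, 2))))  # 10
print(comb(5, 2))  # 10

# Binomial theorem check: (a + b)^n equals the sum of C(n,k) a^k b^(n-k).
a, b, n = 3, 7, 5
lhs = (a + b) ** n
rhs = sum(comb(n, k) * a**k * b**(n - k) for k in range(n + 1))
print(lhs == rhs)  # True
```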
where 0! is defined to be 1.

Now we apply these two notions to some combinatorial probability problems, i.e., the computation of probabilities in the "equally likely" case. In each problem, the cautious approach is first to determine the number of equally likely outcomes in the game or experiment. Then one computes the number of equally likely ways from among these in which the particular event can occur. Then the ratio of this second number to the first number is computed in order to obtain the probability of the event.
Example 1. The numbers 1, 2, ⋯, n are arranged in random order, i.e., the n! ways in which these numbers can be arranged are assumed to be equally likely. We are to find the probability that the numbers 1 and 2 appear as neighbors with 1 followed by 2.

As was mentioned in the problem, there are n! equally likely outcomes. In order to compute the number of these ways in which the indicated event can occur, we reason as follows: there are n − 1 positions permitted for 1; for each position available for 1 there is only one position available for 2, and for every selection of positions for 1 and 2, there are (n − 2)! ways of arranging the remaining n − 2 integers in the remaining n − 2 positions. Consequently, there are (n − 1) · 1 · (n − 2)! ways in which this event can occur, and its probability is

    p = (n − 1) · 1 · (n − 2)!/n! = (n − 1)!/n! = 1/n.
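Example 1 can be checked by brute-force enumeration for a small n (a sketch, not part of the text):

```python
from itertools import permutations
from fractions import Fraction

# Among all n! orderings of 1..n, count those in which 1 is immediately
# followed by 2, and compare with the answer 1/n.
n = 5
favorable = 0
total = 0
for perm in permutations(range(1, n + 1)):
    total += 1
    i = perm.index(1)
    if i + 1 < n and perm[i + 1] == 2:
        favorable += 1
print(Fraction(favorable, total))  # 1/5, i.e. 1/n
```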
Before beginning Example 2, we should explain what is meant by selecting a random digit (or random number). In effect, one takes 10 tags and marks 0 on the first tag, 1 on the second tag, 2 on the third tag, • • •, and 9 on the tenth tag. Then these tags are put into a hat
(or urn). If we say "select n random digits" or "sample n times with replacement", we mean that one selects a tag "at random", notes the number on it and records it, returns it to the container, and repeats this action n — 1 times more.
Example 2. We are to find the probability p that among k random digits neither 0 nor 1 appears.

The total number of possible outcomes is obtained as follows. There are 10 possibilities for selecting the first digit. For each way in which the first digit is selected there are 10 ways of selecting the second digit. So there are 10^2 ways of selecting the first two digits. In general, then, the number of ways in which the first k digits can be selected is 10^k. Now we consider the event: neither 0 nor 1 appears. In how many "equally likely" ways from among the 10^k possible outcomes can this event occur? In selecting the k random digits, it is clear that with the first random digit there are eight ways in which it can occur. The same goes for the second, third, and on up to the kth random digit. Hence, out of the 10^k total possible "equally likely" outcomes there are 8^k outcomes in which this event can occur. Thus p = 8^k/10^k.
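Example 2 can likewise be verified by enumerating all 10^k digit strings for a small k (an illustrative sketch, not part of the text):

```python
from itertools import product
from fractions import Fraction

# Among all 10^k strings of k random digits, count those containing
# neither 0 nor 1, and compare with 8^k/10^k.
k = 3
favorable = sum(1 for digits in product(range(10), repeat=k)
                if 0 not in digits and 1 not in digits)
p = Fraction(favorable, 10 ** k)
print(p)                               # 64/125 (= 8^3/10^3 in lowest terms)
print(p == Fraction(8 ** k, 10 ** k))  # True
```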
Example 3. Now let us determine the probability P that among k random digits the digit zero appears exactly 3 times (where 3 ≤ k).

Again, the total number of equally likely outcomes is 10^k. Among the k trials (i.e., k different objects) there are (k choose 3) ways of selecting the 3 trials in which the zeros appear. For each way of selecting the 3 trials in which only zeros occur there are 9^(k−3) ways in which the outcomes of the remaining k − 3 trials can occur. Thus P = (k choose 3) 9^(k−3)/10^k.
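Example 3 admits the same kind of enumeration check for a small k (a sketch, not part of the text):

```python
from itertools import product
from fractions import Fraction
from math import comb

# Probability that 0 appears exactly 3 times among k random digits,
# checked by enumeration against (k choose 3) 9^(k-3) / 10^k.
k = 4
favorable = sum(1 for digits in product(range(10), repeat=k)
                if digits.count(0) == 3)
formula = Fraction(comb(k, 3) * 9 ** (k - 3), 10 ** k)
print(Fraction(favorable, 10 ** k) == formula)  # True
```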
Example 4. A box contains 90 white balls and 10 red balls. If 9 balls are selected at random without replacement, what is the probability P that 6 of them are white?

In this problem there are (100 choose 9) ways of selecting the 9 balls out of 100. Since there are (90 choose 6) ways of selecting 6 white balls out of 90 white balls, and since for each way one selects 6 white balls there are (10 choose 3) ways of selecting 3 red balls out of the 10 red balls, we see that there are (90 choose 6)(10 choose 3) ways of getting 6 white balls when we select 9 without replacement. Consequently,

    P = (90 choose 6)(10 choose 3)/(100 choose 9).
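The probability in Example 4 (a hypergeometric probability) can be evaluated exactly with `math.comb`; summing over all possible white-ball counts also confirms the probabilities add to 1 (a sketch, not part of the text):

```python
from fractions import Fraction
from math import comb

# Example 4: exactly 6 white balls when drawing 9 without replacement
# from 90 white and 10 red.
p = Fraction(comb(90, 6) * comb(10, 3), comb(100, 9))
print(p)
print(float(p))

# Sanity check: over all possible numbers of white balls drawn (0..9),
# the probabilities sum to one.
total = sum(Fraction(comb(90, w) * comb(10, 9 - w), comb(100, 9))
            for w in range(10))
print(total == 1)  # True
```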
Example 5. There are n men standing in a row, among whom are two men named A and B. We would like to find the probability P that there are r people between A and B.

There are two ways of solving this problem. In the first place there are (n choose 2) ways in which one can select two places for A and B to stand, and among these there are n − r − 1 ways in which one can pick two positions with r positions between them. So P = (n − r − 1)/(n choose 2). Another way of solving this problem is to observe that there are n! ways of arranging the n men, and that among these n! ways there are two ways of selecting one of the men A or B. For each way of selecting one of A or B there are n − r − 1 ways of placing him, and for each way of selecting one of A or B and for each way of placing him there is one way in which the other man can be placed in order that there be r men between them, and there are (n − 2)! ways of arranging the remaining n − 2 men. So

    P = 2(n − r − 1)(n − 2)!/n! = (n − r − 1)/(n choose 2).
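Both solutions of Example 5 can be confirmed by enumeration for a small n (an illustrative sketch, not part of the text):

```python
from itertools import permutations
from fractions import Fraction
from math import comb, factorial

# Probability that exactly r others stand between A and B in a random
# row of n people; persons 0 and 1 play the roles of A and B.
n, r = 6, 2
favorable = sum(1 for row in permutations(range(n))
                if abs(row.index(0) - row.index(1)) == r + 1)
p = Fraction(favorable, factorial(n))
print(p == Fraction(n - r - 1, comb(n, 2)))  # True
```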
EXERCISES
1. An urn contains 4 black balls and 6 white balls. Two balls are selected without replacement. What is the probability that
a) one ball is black and one ball is white? b) both balls are black? c) both balls are white? d) both balls are the same color?
2. In tossing a pair of fair dice what is the probability of throwing a 7 or an 11?
3. Two fair coins are tossed simultaneously. What is the probability that
a) they are both heads?
b) they match?
c) one is heads and one is tails?
4. The numbers 1,2, • • •, n are placed in random order in a straight line. Find the probability that
a) the numbers 1,2,3 appear as neighbors in the order given, and
b) the numbers 1,2,3 appear as neighbors in any order.
5. Among k random digits find the probability that
a) no even digit appears, b) no digit divisible by 3 appears.
6. Among k random digits (k > 5) find the probability that
a) the digit 1 appears exactly five times,
b) The digit 0 appears exactly two times and the digit 1 appears exactly three times.
7. A box contains 10 white tags and 5 black tags. Three tags are selected at random without replacement. What is the probability that two are black and one is white?
8. There are n people standing in a circle, among whom are two people named A and B. What is the probability that there are r people between them?
9. Six random digits are selected. In the pattern that emerges, find the probability that the pattern will contain the sequence 4,5,6.
1.3 The Algebra of Events

Before we may adequately discuss probabilities of events we must discuss the algebra of events. Then we are able to establish the properties of probability.

Connected with any game or experiment is a set or space of all possible individual outcomes. We shall consider only those games or experiments where these individual outcomes are equally likely. Such a collection of all possible individual outcomes is called a fundamental probability set or sure event. It will be denoted by the Greek letter omega, Ω. We shall also use the expression fundamental probability set (or sure event) for any representation we might construct of all individual outcomes. For example, in a game consisting of one toss of an unbiased coin, a fundamental probability set consists of two individual outcomes which can be conveniently referred to as H (for heads) and T (for tails). If the game consists in tossing a fair coin twice, then the fundamental probability set consists of four individual outcomes. One of these outcomes could be denoted by (T, H), which means that tails occurs on the first toss of the coin and heads occurs on the second toss. The remaining three individual outcomes may be denoted by (H, H), (H, T) and (T, T). In general, an arbitrary individual outcome will be denoted by ω and will be referred to as an elementary event. Thus, Ω denotes the set of all elementary events.
An event is simply a collection of certain elementary events. Different events are different collections of elementary events. Consider the game again where a fair coin is tossed twice. Then, as indicated above, the sure event consists of the following four elementary events:
(H,H) (H,T) (T,H) (T,T).
If A denotes the event: [heads occurs in the first toss], then A consists of two elementary events, (H, H) and (H, T), and we write this as
A = {(H,H),(H,T)}.
If B denotes the event: [at least one head appears], then B consists of the three elementary events (H, H), (H, T) and (T, H), i.e.,
B = {(H,H),(H,T),(T,H)}.
If C denotes the event: [no heads appear], then C consists of one elementary event, i.e., C = {(T, T)}. If D denotes the event: [at least three heads occur], this is clearly impossible and is an empty collection of elementary events; we denote this by D = ∅, where ∅ always means the empty set.
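The events A, B, C, D of the two-coin game can be sketched as Python sets of elementary outcomes (an illustration, not part of the text):

```python
# The sure event for two tosses of a coin, with ("H", "T") meaning heads
# on the first toss and tails on the second.
omega = {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}
A = {w for w in omega if w[0] == "H"}   # heads occurs in the first toss
B = {w for w in omega if "H" in w}      # at least one head appears
C = omega - B                           # no heads appear
D = set()                               # at least three heads: impossible
print(sorted(A))   # [('H', 'H'), ('H', 'T')]
print(C)           # {('T', 'T')}
print(D == set())  # True
```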
In general, we shall denote the fact that an elementary event ω belongs to the collection of elementary events which determine an event A by ω ∈ A. If an elementary event ω occurs, and if ω ∈ A, then we say that the event A occurs. It might be noted at this point that just because an event A occurs, it does not mean that no other events occur. In the example above, if (H, H) occurs, then A occurs and so does B. The fundamental probability set Ω is also called the sure event for the basic reason that whatever elementary event ω does occur, always ω ∈ Ω.
We now introduce some algebraic operations on events. If A is an event, then Aᶜ will denote the event that A does not occur. Thus Aᶜ consists of all those elementary events in the fundamental probability set which are not in A. For every elementary event ω in the fundamental probability set and for every event A, one and only one of the following is true: ω ∈ A or ω ∈ Aᶜ. An equivalent way of writing ω ∈ Aᶜ is ω ∉ A, and we say that ω is not in A. Also, Aᶜ is called the negation of A or the complement of A.
If A and B are events, then A ∪ B will denote the event that at least one of the two events A, B occurs. By this we mean that A can occur and B not occur, or B can occur and A not occur, or both A and B can occur. In the previous example, if E denotes the event that heads occurs in the second trial, then

    E = {(H, H), (T, H)}

and

    A ∪ E = {(H, H), (H, T), (T, H)}.
In other words, A ∪ E is the event that heads occurs at least once, and we may write A ∪ E = B. In general, if A₁, ⋯, Aₙ are any n events, then

    A₁ ∪ A₂ ∪ ⋯ ∪ Aₙ

denotes the event that at least one of these n events occurs. This event will also be written as

    ⋃ᵢ₌₁ⁿ Aᵢ.
Suppose A and B are events which cannot both occur, i.e., if ω ∈ A, then ω ∉ B, and if ω ∈ B, then ω ∉ A. In this case, A and B are said to be incompatible or disjoint or mutually exclusive. Events A₁, ⋯, Aₙ are said to be disjoint if and only if every pair of these events has this property.
The notation A ⊂ B means: if event A occurs, then event B occurs. Other ways of stating this are: A implies B and B is implied by A. Thus A ⊂ B is true if and only if for every ω ∈ A, then ω ∈ B. In any situation where it is desired to prove A ⊂ B, one should select an arbitrary ω ∈ A and prove that this implies ω ∈ B. We define the equality of two events A and B, namely A = B, to occur if A ⊂ B and B ⊂ A, i.e., A and B share the same elementary events. Finally we define the event that A and B both occur, which we denote by A ∩ B, to be the event consisting of all elementary events ω in both A and B. This is frequently referred to as the intersection of A and B. If A₁, ⋯, Aₙ are any n events, then the event that they all occur is denoted in two ways by

    A₁ ∩ A₂ ∩ ⋯ ∩ Aₙ = ⋂ⱼ₌₁ⁿ Aⱼ.

We sometimes write AB instead of A ∩ B and A₁A₂ ⋯ Aₙ instead of A₁ ∩ A₂ ∩ ⋯ ∩ Aₙ.
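The operations just introduced correspond directly to set operations; a brief sketch with the two-coin events (an illustration, not part of the text):

```python
# The algebra of events for the two-coin game, using Python sets.
omega = {("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")}
A = {("H", "H"), ("H", "T")}                # heads on the first toss
E = {("H", "H"), ("T", "H")}                # heads on the second toss
B = {("H", "H"), ("H", "T"), ("T", "H")}    # at least one head

print(A | E == B)   # union: A ∪ E = B, so True
print(A & E)        # intersection: {('H', 'H')}
print(omega - A)    # the complement of A within the sure event
print(A <= B)       # A ⊂ B, i.e. A implies B: True
```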
We now prove some propositions on the algebra of events.
PROPOSITION 1. For every event A, A ⊂ A.

Proof: Let ω ∈ A. Then this same ω ∈ A. Hence every elementary event in the left event is an elementary event in the right event.
PROPOSITION 2. If A, B, C are events, if A ⊂ B and if B ⊂ C, then A ⊂ C.

Proof: Let ω ∈ A; we must show that ω ∈ C. Since A ⊂ B and ω ∈ A, then ω ∈ B. Now, since B ⊂ C and since ω ∈ B, then ω ∈ C.
PROPOSITION 3. For every event A, A ∩ A = A, A ∪ A = A, and (Aᶜ)ᶜ = A.

Proof: These are obvious.

PROPOSITION 4. If A is any event, then ∅ ⊂ A ⊂ Ω.

Proof: The trick here involves the fact that any ω one might find in ∅ is certainly in A, since ∅ contains no ω's. The implication A ⊂ Ω is obvious.
We noted above that if A and B are two events, and if we wished to prove A ⊂ B, then we should take an arbitrary elementary event ω in A and prove that it is in B. Now suppose we have two events A and B, and suppose we wish to prove A = B. Because of the definition of equality of two events given above, one is required to do the following: (i) take an arbitrary ω ∈ A and prove that ω ∈ B, and (ii) take an arbitrary ω ∈ B and prove that ω ∈ A.
PROPOSITION 5. If A₁, A₂, ⋯, Aₙ are events, then

    (⋃ᵢ₌₁ⁿ Aᵢ)ᶜ = ⋂ᵢ₌₁ⁿ Aᵢᶜ  and  (⋂ᵢ₌₁ⁿ Aᵢ)ᶜ = ⋃ᵢ₌₁ⁿ Aᵢᶜ.

(These are known as the DeMorgan formulae.)

Proof: In order to prove the first equation, let ω be any elementary event in the left hand side. Then ω is not in the event that at least one of A₁, ⋯, Aₙ occurs. This means that ω is not an element of any of the Aᵢ, i.e., ω ∉ Aᵢ for i = 1, 2, ⋯, n. Hence ω ∈ Aᵢᶜ for all i, i.e., ω ∈ ⋂ᵢ₌₁ⁿ Aᵢᶜ. Thus we have shown that (⋃ᵢ₌₁ⁿ Aᵢ)ᶜ ⊂ ⋂ᵢ₌₁ⁿ Aᵢᶜ. Now let ω be any elementary event in ⋂ᵢ₌₁ⁿ Aᵢᶜ. Then ω ∈ Aᵢᶜ for all i. Hence ω ∉ Aᵢ for i = 1, 2, ⋯, n, and when this happens, then certainly ω ∉ ⋃ᵢ₌₁ⁿ Aᵢ. But this means ω ∈ (⋃ᵢ₌₁ⁿ Aᵢ)ᶜ. Thus ⋂ᵢ₌₁ⁿ Aᵢᶜ ⊂ (⋃ᵢ₌₁ⁿ Aᵢ)ᶜ, and since we have proved the reverse inclusion, we therefore have the first equality of our proposition. In order to prove the second equation, we use the first one that is already proved and replace each Aᵢ by Aᵢᶜ. Thus we obtain (⋃ᵢ₌₁ⁿ Aᵢᶜ)ᶜ = ⋂ᵢ₌₁ⁿ (Aᵢᶜ)ᶜ, and by Proposition 3, this becomes (⋃ᵢ₌₁ⁿ Aᵢᶜ)ᶜ = ⋂ᵢ₌₁ⁿ Aᵢ. Now take the complement of both sides to obtain the second equation. Q.E.D.
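The DeMorgan formulae are easy to check mechanically on small events represented as sets (a sketch, not part of the text's proof):

```python
import random

# Check both DeMorgan formulae on a few randomly chosen events inside a
# small sure event omega.
random.seed(0)
omega = set(range(10))
events = [set(random.sample(sorted(omega), 4)) for _ in range(3)]

def complement(A):
    return omega - A

union = set().union(*events)
intersection = set(omega)
for A in events:
    intersection &= A

# (U Ai)^c = intersection of the Ai^c
rhs1 = set(omega)
for A in events:
    rhs1 &= complement(A)
print(complement(union) == rhs1)  # True

# (intersection of Ai)^c = U Ai^c
rhs2 = set().union(*(complement(A) for A in events))
print(complement(intersection) == rhs2)  # True
```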
PROPOSITION 6. ∅ᶜ = Ω and Ωᶜ = ∅.

Proof: Since ∅ contains no elementary events, its complement must be the set of all elementary events, i.e., Ω. Also, the negation or complement of the sure event is the event consisting of no elementary events. Q.E.D.
PROPOSITION 7. If A and B are events, and if A ⊂ B, then A ∩ B = A.

Proof: Since A ⊂ B, every elementary event in both A and B is in A; also every elementary event in A is also in B, and is therefore in both A and B. Q.E.D.
PROPOSITION 8. If A and B are events, then A ∩ B ⊂ A.

Proof: If ω ∈ A ∩ B, then ω ∈ A and ω ∈ B, which implies ω ∈ A.
PROPOSITION 9. If A and B are events, and if A ⊂ B, then Bᶜ ⊂ Aᶜ. If A = B, then Aᶜ = Bᶜ.

Proof: Let ω ∈ Bᶜ. Then ω ∉ B. This implies that ω ∉ A, since if to the contrary ω ∈ A, then A ⊂ B would imply ω ∈ B, which contradicts the fact established above that ω ∉ B. But ω ∉ A implies ω ∈ Aᶜ. Thus Bᶜ ⊂ Aᶜ. Next, if A = B, then A ⊂ B and B ⊂ A, which, by the first conclusion already proved, imply Bᶜ ⊂ Aᶜ and Aᶜ ⊂ Bᶜ, which in turn imply Bᶜ = Aᶜ. Q.E.D.
PROPOSITION 10. If A and B are events, then A ∪ B = B ∪ A and A ∩ B = B ∩ A.

Proof: If ω ∈ A ∪ B, then ω ∈ A or ω ∈ B. If ω ∈ A, then ω is in at least one of the events B, A, namely A; if ω ∈ B, then ω is in at least one of the events B, A, namely B. Thus ω ∈ B ∪ A, and we have shown A ∪ B ⊂ B ∪ A. Since this holds for any two events, we may replace A above by B and B by A, and we obtain B ∪ A ⊂ A ∪ B. These two inclusions imply A ∪ B = B ∪ A. In order to obtain the second equation, replace A and B by Aᶜ and Bᶜ respectively in the first equation, take the complements of both sides and apply Propositions 3 and 5. Q.E.D.
PROPOSITION 11. If A, B and C are events, then

A ∪ (B ∪ C) = (A ∪ B) ∪ C
and
A ∩ (B ∩ C) = (A ∩ B) ∩ C.

Proof: If ω ∈ A ∪ (B ∪ C), then ω ∈ A or ω ∈ B ∪ C. If ω ∈ A, then ω is an element of at least one of A, B, namely A. Hence ω ∈ A ∪ B. This in turn implies ω is in at least one of A ∪ B and C, namely A ∪ B. Thus ω ∈ (A ∪ B) ∪ C. If ω ∈ B ∪ C, then ω is in at least one of B or C. If it is in C, then it is in (A ∪ B) ∪ C. If it is in B, then it is in A ∪ B, and hence in (A ∪ B) ∪ C. Thus we have established the inclusion

A ∪ (B ∪ C) ⊂ (A ∪ B) ∪ C.
In order to establish the reverse inclusion, and hence the first equation, we use Proposition 10 and the above inclusion to obtain
(A ∪ B) ∪ C = C ∪ (A ∪ B) = C ∪ (B ∪ A) ⊂ (C ∪ B) ∪ A = (B ∪ C) ∪ A = A ∪ (B ∪ C).

In order to establish the second equation, replace A, B and C in both sides of the first equation by A^c, B^c and C^c respectively, take the complements of both sides, and apply Propositions 9 and 5 to obtain the conclusion. Q.E.D.
PROPOSITION 12. If A, B and C are events, then

A ∩ (B ∪ C) = AB ∪ AC
and

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

Proof: If ω ∈ A ∩ (B ∪ C), then ω ∈ A and ω is in at least one of B, C. If ω ∈ B, then ω ∈ AB; if ω ∈ C, then ω ∈ AC. Hence ω ∈ AB ∪ AC. If ω ∈ AB ∪ AC, then ω is in at least one of AB and AC. If ω ∈ AB, then ω ∈ A and ω ∈ B; now ω ∈ B implies ω ∈ B ∪ C, and hence ω ∈ A ∩ (B ∪ C). If ω ∈ AC, then replace B by C in the previous sentence to obtain the same conclusion. In order to prove the second equation, replace A, B and C in the first equation by A^c, B^c and C^c respectively, take the complements of both sides and apply Propositions 9 and 5. Q.E.D.
PROPOSITION 13. If A and B are events, then A ∪ B = A ∪ A^cB, and A and A^cB are disjoint.

Proof: If ω ∈ A ∪ B, then ω ∈ A or ω ∈ B. If ω ∈ A, then ω ∈ A ∪ A^cB. If ω ∈ B, then two cases occur: ω ∈ A also, or ω ∉ A. In the first case ω ∈ A ∪ A^cB. In the second case, ω ∈ A^c while yet ω ∈ B, i.e., ω ∈ A^cB, and thus ω ∈ A ∪ A^cB. Thus A ∪ B ⊂ A ∪ A^cB. Now let ω ∈ A ∪ A^cB. Then ω ∈ A or ω ∈ A^cB. If ω ∈ A, then ω ∈ A ∪ B. If ω ∈ A^cB, then ω ∈ B, and hence ω ∈ A ∪ B. Thus A ∪ A^cB ⊂ A ∪ B, and the equation is established. Also, A and A^cB are disjoint, since if ω ∈ A^cB, then ω ∈ A^c, i.e., ω ∉ A. Q.E.D.
EXERCISES
1. Prove: if B is any event, then φ ∩ B = φ and B ∩ Ω = B. (See Propositions 4 and 7.)

2. If A is an event, use Problem 1 and Propositions 3 and 6 to prove φ ∪ A = A and Ω ∪ A = Ω.

3. Use Propositions 10 and 8 to prove: if A and B are events, then A ∩ B ⊂ B.

4. Use Problem 3 and two propositions of this section to prove: if C and D are events, then C ⊂ C ∪ D.
5. Prove: if A, B, C and D are events, if A ⊂ C and if B ⊂ D, then AB ⊂ CD.

6. Let A_1, A_2, A_3, A_4, A_5, A_6, A_7 be events. Match the three events A_1^cA_2^cA_3, A_6A_7^c and A_2A_5 with the following statements:

(i) A_2 and A_5 both occur,

(ii) A_3 is the first among the seven events to occur, and

(iii) A_6 is the last event to occur.

7. Let A_1, A_2 and A_3 be events, and define B_i to be the event that A_i is the first of these events to occur, i = 1, 2, 3. Write each of B_1, B_2, B_3 in terms of A_1, A_2, A_3 and prove that A_1 ∪ A_2 ∪ A_3 = B_1 ∪ B_2 ∪ B_3.

8. Prove: if A is any event, then A ∪ A^c = Ω and A ∩ A^c = φ.

9. Prove: if A and B are events, then B = AB ∪ A^cB. (Hint: Use Problem 8, Proposition 12 and Problem 1.)

10. In Problem 6, construct the event: A_5 is the last of these events to occur.
11. Five tags, numbered 1 through 5, are in a bowl. The game is to select a tag at random and, without replacing it, select a second tag. (I.e., take a sample of size two without replacement.) After you list all 20 elementary events in Ω, list the elementary events in each of the following events:
A: the sum of the two numbers is < 6
B: the sum of the numbers is 5
C: the larger of the two numbers is < 3
D: the smaller of the two numbers is 2
E: the first number selected is 5
F: the second number selected is 4 or 5.
12. In Problem 11, list the elementary events in each of the following events: A ∪ B, A ∩ C, D^c, (A ∪ E)^c, E ∩ F.

13. Prove Proposition 3.

14. Prove the converse to the second statement in Proposition 9: If A and B are events, and if A^c = B^c, then A = B.
15. Prove: If A, B and C are events, and if A and B are disjoint, then AC and BC are disjoint.
16. Prove: If A, H_1, ..., H_n are events, if A ⊂ ∪_{i=1}^n H_i, and if H_1, ..., H_n are disjoint, then AH_1, ..., AH_n are disjoint, and A = ∪_{i=1}^n AH_i.
1.4 Probability

The only notion of probability that we shall use in this course is that where the elementary events are all equally likely. In most cases these equally likely outcomes will be apparent. In others, they will be difficult to find, but in most of these cases we shall not have to find them.
In any game or experiment, if N denotes the total number of equally likely outcomes (in Ω), and if N_A denotes the number of equally likely outcomes in the event A, then we define the probability of A by

P(A) = N_A / N.

Concrete examples of this were given in Section 1.2. The following propositions will be used repeatedly in this course.
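As a quick computational illustration (not part of the text; the two-dice game below is a hypothetical example), the counting definition P(A) = N_A/N can be evaluated by enumerating the equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses of a fair die.
omega = list(product(range(1, 7), repeat=2))
N = len(omega)                        # N = 36

# Event A: the sum of the two faces is 7.
A = [w for w in omega if sum(w) == 7]
N_A = len(A)                          # N_A = 6

P_A = Fraction(N_A, N)                # P(A) = N_A / N
print(P_A)                            # 1/6
```

Exact rational arithmetic (`Fraction`) keeps the probabilities in the same form the propositions below use.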
PROPOSITION 1. If A is an event, then 0 ≤ P(A) ≤ 1.

Proof: Since 0 ≤ N_A ≤ N, divide through by N to obtain 0 ≤ P(A) ≤ 1. Q.E.D.

PROPOSITION 2. If A is an event, then P(A^c) = 1 − P(A).
Proof: Since N = N_A + N_{A^c}, we have, upon dividing through by N, 1 = P(A) + P(A^c), from which the conclusion follows. Q.E.D.

PROPOSITION 3. P(Ω) = 1 and P(φ) = 0.

Proof: This follows from the fact that N_φ = 0 and N_Ω = N. Q.E.D.
PROPOSITION 4. If A_1, ..., A_r are disjoint events, then

P(∪_{i=1}^r A_i) = Σ_{i=1}^r P(A_i).

Proof: Disjointness of A_1, ..., A_r implies

N_{A_1 ∪ ··· ∪ A_r} = N_{A_1} + ··· + N_{A_r}.

Dividing through by N yields the result. Q.E.D.
PROPOSITION 5. If A and B are events, then P(A) = P(BA) + P(B^cA).

Proof: Because Ω = B ∪ B^c, we have A = A ∩ Ω = A(B ∪ B^c) = AB ∪ AB^c. Since AB and AB^c are disjoint, it follows that P(A) = P(AB ∪ AB^c) = P(AB) + P(AB^c). Q.E.D.
PROPOSITION 6. If A and B are events, then
P(A ∪ B) = P(A) + P(A^cB).

Proof: By Proposition 13 in Section 1.3, A ∪ B = A ∪ A^cB, and A and A^cB are disjoint. Applying Proposition 4 above we obtain P(A ∪ B) = P(A) + P(A^cB). Q.E.D.
PROPOSITION 7. If A and B are events, then

P(A ∪ B) = P(A) + P(B) − P(AB).
Proof: By Proposition 6, P(A ∪ B) = P(A) + P(A^cB). By Proposition 5, P(B) = P(AB) + P(A^cB), or P(A^cB) = P(B) − P(AB). Substituting this into the first formula, we get the result. Q.E.D.

PROPOSITION 8. (Boole's inequality). If A and B are events, then P(A ∪ B) ≤ P(A) + P(B).

Proof: By Propositions 7 and 1, P(A ∪ B) = P(A) + P(B) − P(AB) ≤ P(A) + P(B). Q.E.D.
EXERCISES
1. Prove: if A, B and C are events, then
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(AB) − P(AC) − P(BC) + P(ABC).

2. Prove: if A, B and C are events, then P(A ∪ B ∪ C) ≤ P(A) + P(B) + P(C).
3. Use the principle of mathematical induction to prove: if A_1, ..., A_n are events, then P(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n P(A_i).

4. Prove: if A and B are events, and if A ⊂ B, then P(A) ≤ P(B).

5. Prove: if A and B are events, then P(AB) ≤ P(A) ≤ P(A ∪ B).

6. Prove: if A and B are events, if A ⊂ B, and if P(B) ≤ P(A), then P(A) = P(B).

7. Prove: if A_1, A_2 and A_3 are events, then P(A_1 ∪ A_2 ∪ A_3) = P(A_1) + P(A_1^cA_2) + P(A_1^cA_2^cA_3).
8. Another way to prove Proposition 6 is as follows. First prove that A ∩ (A ∪ B) = A and A^c(A ∪ B) = A^cB. Then apply Proposition 5. Now fill in complete details of this proof.
1.5 Conditional Probability

We now define the conditional probability that an event A occurs, given that a certain event B occurs. Since we are given that B occurs, we can only define this conditional probability when P(B) > 0 or N_B > 0. Since we are given that B occurs, the total possible number of equally likely outcomes is N_B, which we place in the denominator. Among these we wish to find the total number of equally likely ways in which A occurs. This is seen to be N_{A∩B} or N_{AB}. Thus, the conditional probability that A occurs, given that B occurs, which we denote by P(A|B), is obtained from the formula

P(A|B) = N_{AB} / N_B.
REMARK 1. If A and B are events, and if P(B) > 0, then

P(A|B) = P(AB) / P(B).

Proof: By the definition above, we have

P(A|B) = N_{AB} / N_B = (N_{AB}/N) / (N_B/N) = P(AB) / P(B).
Q.E.D.
REMARK 2. If A and B are events with positive probabilities, then
P(AB) = P(A|B)P(B) = P(B|A)P(A).
Proof: The first equality follows from Remark 1, and the second is the same as the first with A replaced by B and B by A. Q.E.D.
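Remarks 1 and 2 can be checked by direct counting on a small space; the dice events below are hypothetical illustrations, not from the text:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two tosses of a fair die
N = len(omega)

A = {w for w in omega if w[0] + w[1] == 5}     # the sum is 5
B = {w for w in omega if w[0] == 3}            # the first face is 3
AB = A & B

# P(A|B) = N_AB / N_B, which equals P(AB) / P(B) after dividing
# numerator and denominator by N.
P_A_given_B = Fraction(len(AB), len(B))
assert P_A_given_B == Fraction(len(AB), N) / Fraction(len(B), N)
print(P_A_given_B)                             # 1/6
```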
The three most useful theorems in connection with conditional probabilities will now be presented along with some applications.
THEOREM 1. (The Multiplication Rule). If A_0, A_1, ..., A_n are any n + 1 events such that P(A_0A_1 ··· A_{n−1}) > 0, then

P(A_0A_1 ··· A_n) = P(A_0)P(A_1|A_0)P(A_2|A_0A_1) ··· P(A_n|A_0 ··· A_{n−1}).

Proof: We prove this by induction on n. For n = 1, P(A_0) > 0, so P(A_1|A_0) = P(A_0A_1)/P(A_0), or P(A_0A_1) = P(A_0)P(A_1|A_0). Now let us assume the theorem is true for n − 1 (where n ≥ 2); we shall show it is also true for n. By the induction hypothesis,

P(A_0A_1 ··· A_{n−1}) = P(A_0)P(A_1|A_0) ··· P(A_{n−1}|A_0 ··· A_{n−2}).

By Remark 1, letting B = A_0 ··· A_{n−1} and A = A_n, we obtain

P(A_0A_1 ··· A_n) = P(A_0 ··· A_{n−1})P(A_n|A_0 ··· A_{n−1}) = P(A_0)P(A_1|A_0) ··· P(A_n|A_0 ··· A_{n−1}),

which proves that the theorem holds for all n. Q.E.D.
Example 1. (Polya Urn Scheme). An urn contains r red balls and b black balls originally. At each trial, one selects a ball at random from the urn, notes its color and replaces it along with c balls of the same color. Let R_i denote the event that a red ball is selected at the i-th trial, and let B_i denote the event that a black ball is obtained at the i-th trial. We wish to compute P(R_1B_2B_3). Using the multiplication rule, we obtain
P(R_1B_2B_3) = P(R_1)P(B_2|R_1)P(B_3|R_1B_2)
= [r / (r + b)] · [b / (r + b + c)] · [(b + c) / (r + b + 2c)].
Example 2. (Sampling Without Replacement). An urn contains N tags, numbered 1 through N. One selects at random three tags without replacement. This means: one first selects a tag at random, then without replacing it one selects a tag at random from those remaining, and again, without replacing it, one selects yet another tag from the
N − 2 remaining tags. If i_1, i_2, i_3 are three distinct positive integers not greater than N, then, by the multiplication rule, the probability that i_1 is selected on the first trial, i_2 on the second trial and i_3 on the third trial is

(1/N) · (1/(N − 1)) · (1/(N − 2)).
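Both examples lend themselves to a mechanical check. The sketch below (illustrative only; the starting counts r = 3, b = 2, c = 1 and the choice N = 5 are arbitrary) applies the multiplication rule with exact rational arithmetic:

```python
from fractions import Fraction
from itertools import permutations

def polya_sequence_prob(colors, r, b, c):
    """Multiplication-rule probability of a color sequence ('R'/'B') in a
    Polya urn starting with r red and b black balls; c balls of the drawn
    color are added after each trial."""
    red, black, p = Fraction(r), Fraction(b), Fraction(1)
    for col in colors:
        total = red + black
        if col == 'R':
            p *= red / total
            red += c
        else:
            p *= black / total
            black += c
    return p

r, b, c = 3, 2, 1                      # arbitrary illustrative counts
p = polya_sequence_prob('RBB', r, b, c)
# Agrees with P(R1 B2 B3) = [r/(r+b)] [b/(r+b+c)] [(b+c)/(r+b+2c)]:
assert p == Fraction(r, r + b) * Fraction(b, r + b + c) * Fraction(b + c, r + b + 2 * c)

# Example 2: each ordered triple of distinct tags is equally likely.
N = 5
ordered = list(permutations(range(1, N + 1), 3))
assert Fraction(1, len(ordered)) == Fraction(1, N) * Fraction(1, N - 1) * Fraction(1, N - 2)
```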
THEOREM 2. (Theorem of Total Probabilities). If H_1, ..., H_n are disjoint events with positive probabilities, and if A is an event satisfying A ⊂ ∪_{i=1}^n H_i, then

P(A) = Σ_{i=1}^n P(A|H_i)P(H_i).
Proof: First note, using Propositions 7 and 12 in Section 1.3, that
A = A ∩ (∪_{i=1}^n H_i) = ∪_{i=1}^n AH_i.

Further, AH_1, ..., AH_n are disjoint; this comes from the hypothesis that H_1, ..., H_n are disjoint. Hence, by Proposition 4 in Section 1.4,

P(A) = P(∪_{i=1}^n AH_i) = Σ_{i=1}^n P(AH_i).
Now by Remark 1 or 2 above,
P(AH_i) = P(A|H_i)P(H_i)

for 1 ≤ i ≤ n. Thus

P(A) = Σ_{i=1}^n P(A|H_i)P(H_i).
Q.E.D.
Example 3. In the Polya urn scheme in Example 1,
P(R_1) = r / (r + b) and P(B_1) = b / (r + b).
In order to compute P(R_2), we first note that R_2 ⊂ R_1 ∪ B_1 and that R_1 and B_1 are disjoint. Hence by the theorem of total probabilities,

P(R_2) = P(R_2|R_1)P(R_1) + P(R_2|B_1)P(B_1).
Since P(R_2|R_1) = (r + c) / (r + b + c) and P(R_2|B_1) = r / (r + b + c), we have

P(R_2) = [(r + c) / (r + b + c)] · [r / (r + b)] + [r / (r + b + c)] · [b / (r + b)] = r / (r + b).
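A numerical check of this computation, with arbitrary illustrative values of r, b, c:

```python
from fractions import Fraction

r, b, c = 3, 2, 1                               # illustrative values only
P_R1, P_B1 = Fraction(r, r + b), Fraction(b, r + b)
P_R2_given_R1 = Fraction(r + c, r + b + c)
P_R2_given_B1 = Fraction(r, r + b + c)

# Theorem of total probabilities:
P_R2 = P_R2_given_R1 * P_R1 + P_R2_given_B1 * P_B1
assert P_R2 == Fraction(r, r + b)               # the same as P(R1)
print(P_R2)                                     # 3/5
```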
Example 4. In Example 2 above on sampling without replacement, let us compute the probability that 1 is selected on the second trial. Using the theorem of total probabilities we obtain
P[1 in trial #2] = Σ_{i=2}^N P([1 in trial #2] | [i in trial #1]) P([i in trial #1])
= (N − 1) · [1 / (N − 1)] · (1/N) = 1/N.
THEOREM 3. (Bayes' Theorem). If H_1, ..., H_n are disjoint events with positive probabilities, if A is an event satisfying A ⊂ ∪_{i=1}^n H_i, and if P(A) > 0, then for j = 1, 2, ..., n,

P(H_j|A) = P(A|H_j)P(H_j) / Σ_{i=1}^n P(A|H_i)P(H_i).
Proof: By the definition of conditional probability, we have, by our hypotheses and by Theorem 2, that
P(H_j|A) = P(AH_j) / P(A) = P(A|H_j)P(H_j) / Σ_{i=1}^n P(A|H_i)P(H_i).
Q.E.D.
In rather loose terminology, Bayes' theorem is applied in this general situation. An event A is known to have occurred. There are n disjoint events, called the possible causes of A, and since A has occurred it is known that one of H_1, ..., H_n "caused it" (i.e., A ⊂ ∪_{i=1}^n H_i). If one wishes to determine which of the possible causes really caused it, one might wish to evaluate P(H_j|A), for 1 ≤ j ≤ n, and select as a possible cause an H_j for which P(H_j|A) is maximum.

Example 5. Consider the Polya urn scheme again. Suppose one observes that the event R_2 has occurred and wishes to determine the probability that B_1 was the "cause" of it, i.e., to evaluate P(B_1|R_2). By Bayes' theorem we find that
P(B_1|R_2) = P(R_2|B_1)P(B_1) / [P(R_2|B_1)P(B_1) + P(R_2|R_1)P(R_1)] = b / (r + b + c).

One should note that P(B_1|R_2) = P(B_2|R_1).
Example 6. Consider the sampling without replacement of Examples 2 and 4. Suppose one observes that 1 is selected in the second trial and wishes to find the probability that selecting 3 in the first trial is its "cause", i.e., to evaluate

P([3 in 1st trial] | [1 in 2nd trial]).

Using Bayes' theorem this turns out to be

P([1 in 2nd trial] | [3 in 1st trial]) P([3 in 1st trial]) / Σ_{j=2}^N P([1 in 2nd trial] | [j in 1st trial]) P([j in 1st trial]) = 1 / (N − 1).
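The general pattern of a Bayes computation (prior times likelihood, renormalized by the total-probability denominator) can be sketched as follows; the Polya urn numbers r = 3, b = 2, c = 1 are an arbitrary illustration:

```python
from fractions import Fraction

r, b, c = 3, 2, 1
prior = {'R1': Fraction(r, r + b), 'B1': Fraction(b, r + b)}
likelihood = {'R1': Fraction(r + c, r + b + c),   # P(R2 | R1)
              'B1': Fraction(r, r + b + c)}       # P(R2 | B1)

# Denominator of Bayes' theorem (the theorem of total probabilities):
total = sum(likelihood[h] * prior[h] for h in prior)
posterior = {h: likelihood[h] * prior[h] / total for h in prior}

assert posterior['B1'] == Fraction(b, r + b + c)  # b/(r+b+c), as in Example 5
assert sum(posterior.values()) == 1
```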
EXERCISES
1. In the hypothesis of Theorem 1, it was assumed that P(A_0A_1 ··· A_{n−1}) > 0 so that the last conditional probability was well-defined. Prove that this assumption implies that P(∩_{j=0}^k A_j) > 0 for 0 ≤ k ≤ n − 2, so that all the other conditional probabilities are also well-defined.
2. In the proof of the theorem of total probabilities, the statement is made that since H_1, ..., H_n are disjoint, then AH_1, ..., AH_n are disjoint. Prove this statement.
3. In sampling without replacement considered in Examples 2 and 4, suppose a simple random sample of size 3 is selected. Prove that the probability of getting a 1 in the third trial is 1/N.
4. In the Polya urn scheme, find P(R_3).

5. An urn contains four objects: A, B, C, D. Each trial consists of selecting at random an object from the urn, and, without replacing it, proceeding to the next trial. If X is one of those four objects, and if i = 1 or 2 or 3 or 4, let X_i denote the event that X is selected at the i-th trial. Compute the following:

i) P(A_1),

ii) P(A_2|A_1),

iii) P(A_2|B_1),

iv) P(A_2) and

v) P(B_3).
6. In the Polya urn scheme, compute P(R_1|R_2).

7. In the Polya urn scheme, compute

i) P(R_3|R_1),

ii) P(R_3|R_2) and

iii) P(R_1R_3).
8. An urn contains 2 black balls and 4 white balls. At each trial a ball is selected at random from the urn and is not replaced for the next trial. Let B_i denote the event that the first black ball selected is on the i-th trial. Compute

i) P(B_2),

ii) P(B_1),
iii) P(B_5) and

iv) P(B_6).

9. In Problem 8, let C_i denote the event that the second black ball selected is selected at the i-th trial. Compute

i) P(C_2), ii) P(C_3), iii) P(B_1C_3), iv) P(B_2|C_3) and v) P(C_1).
10. An absent-minded professor has five keys on a key ring. What is the probability he will have to try all five of them in order to open his office door?
11. Urn # 1 contains 2 white balls and 4 black balls, and urn # 2 contains 5 white balls and 4 black balls. An urn is selected at random, and then a ball is selected at random from it. What is the probability that the ball selected is white?
12. In Example 2, find the probability that 2 is selected in the third trial, where N = 5.
Chapter 2
Random Variables
2.1 Random Variables as Functions

In a sample survey, when we select individuals at random, we are really not interested in the particular individuals selected. Rather, we are interested in some numerical characteristic (or characteristics) of the individual selected. This numerical characteristic is a function, in that to each elementary event selected there is a number assigned to it.

Definition. A random variable X is a function which assigns to every element ω ∈ Ω a real number X(ω).
The following examples illustrate the idea of random variable.
(i) Take a sample of size one from the set Ω of all registered students at your university. In this case, X(ω) might be the grade point average of ω. Thus, corresponding to student ω ∈ Ω is the number X(ω).

(ii) Sample three times without replacement from the set of all registered students at your university. In this case, Ω will consist of the set of all ordered triples (ω_1, ω_2, ω_3), where no repetitions are allowed. If Y denotes the age of the third student selected, i.e., if Y assigns to (ω_1, ω_2, ω_3) the number: the age of ω_3, then Y is a random variable.
(iii) In example (ii), if Z assigns to (ω_1, ω_2, ω_3) the total indebtedness of ω_1, ω_2 and ω_3, then Z is a random variable.

We are usually interested in the values that random variables take and the probabilities with which these values are taken. Thus we have the following definition.

Definition. If X is a random variable defined over some fundamental probability space Ω, then the range of X, which we denote by range(X), is defined as the set of numbers X(ω) for all ω ∈ Ω, i.e., range(X) = {X(ω) : ω ∈ Ω}. This is also denoted by X(Ω) or {x : x = X(ω) for some ω ∈ Ω}.

Since Ω is finite, the range of a random variable X is finite and has at most as many members as does Ω. Random variables are functions, and so, like functions, they admit algebraic operations. These are given in the following definition.

Definition. If Ω is a fundamental probability space, if X and Y are random variables defined over Ω, and if c is a constant, then we define the random variables X + Y, XY, X/Y, cX, max{X, Y} and min{X, Y} as follows:
(i) X + Y assigns to every ω ∈ Ω the number X(ω) + Y(ω),

(ii) XY assigns to every ω ∈ Ω the product X(ω)Y(ω) of the numbers X(ω) and Y(ω),

(iii) X/Y assigns to every ω ∈ Ω the quotient X(ω)/Y(ω); if Y(ω) = 0 for at least one ω ∈ Ω, then X/Y is not defined,

(iv) cX assigns to ω the number cX(ω),

(v) max{X, Y} assigns to each ω ∈ Ω the larger of the numbers X(ω), Y(ω), and

(vi) min{X, Y} assigns to each ω ∈ Ω the smaller of the numbers X(ω) and Y(ω).
In general, if X, Y, Z are random variables, and if f(u, v, w) is any function of three variables, then f(X, Y, Z) is a random variable which assigns to every ω ∈ Ω the number f(X(ω), Y(ω), Z(ω)).

Among the random variables defined over Ω, some very important ones are the indicator random variables, defined as follows.

Definition. If A ⊂ Ω is an event, the indicator of A, denoted by I_A, is defined as the random variable that assigns to ω ∈ Ω the number 1 if ω ∈ A and 0 if ω ∉ A, i.e.,

I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 if ω ∉ A.

PROPOSITION 1. If A is an event, then I_A^2 = I_A.

Proof: If ω ∈ A, then I_A^2(ω) = 1^2 = 1 = I_A(ω), and if ω ∉ A, then I_A^2(ω) = 0^2 = 0 = I_A(ω). Q.E.D.
PROPOSITION 2. If A and B are events, then I_AI_B = I_{AB}.

Proof: If ω ∈ A and ω ∈ B, then ω ∈ AB, and thus

I_AI_B(ω) = I_A(ω)I_B(ω) = 1 · 1 = 1 = I_{AB}(ω).

If ω is not in AB, then I_{AB}(ω) = 0 and, since ω is not in at least one of A, B, at least one of the numbers I_A(ω), I_B(ω) is zero, in which case I_AI_B(ω) = I_A(ω)I_B(ω) = 0 = I_{AB}(ω). Q.E.D.
PROPOSITION 3. If A and B are events, then I_{AB} = min{I_A, I_B}.

Proof: Note that the minimum of I_A(ω) and I_B(ω) is 1 if and only if I_A(ω) = 1 and I_B(ω) = 1, which is true if and only if ω ∈ A and ω ∈ B, i.e., ω ∈ AB or I_{AB}(ω) = 1. Q.E.D.

PROPOSITION 4. If A and B are events, then I_{A∪B} = max{I_A, I_B}.

Proof: I_{A∪B}(ω) = 1 if and only if ω ∈ A ∪ B, which is true if and only if ω is in at least one of A, B. This means at least one of I_A(ω), I_B(ω) is 1. Q.E.D.
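Propositions 1-4 can be spot-checked by computing indicators on a small hypothetical space (the events A and B below are arbitrary choices for illustration):

```python
# A small fundamental probability space and two events, as Python sets.
omega = list(range(10))
A = {w for w in omega if w % 2 == 0}
B = {w for w in omega if w < 5}

def I(E):
    """Indicator of the event E: 1 on E, 0 off E."""
    return {w: (1 if w in E else 0) for w in omega}

IA, IB, IAB, IAuB = I(A), I(B), I(A & B), I(A | B)
for w in omega:
    assert IA[w] ** 2 == IA[w]              # Proposition 1
    assert IA[w] * IB[w] == IAB[w]          # Proposition 2
    assert min(IA[w], IB[w]) == IAB[w]      # Proposition 3
    assert max(IA[w], IB[w]) == IAuB[w]     # Proposition 4
```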
In our notation we generally suppress the symbol ω. Thus, if X is a random variable, we shall write [X = x] instead of {ω ∈ Ω : X(ω) = x}, we shall write [X ≤ x] instead of {ω ∈ Ω : X(ω) ≤ x} and, in general, for any set of numbers A, we shall write [X ∈ A] instead of {ω ∈ Ω : X(ω) ∈ A}. An example of this is the game where Ω denotes the set of outcomes of three tosses of an unbiased coin. In this case Ω consists of

(HHH) (HHT) (HTH) (HTT) (THH) (THT) (TTH) (TTT).

Let X denote the number of heads in the three tosses. For example, if ω = (HTH), then X(ω) = 2. Also X(THH) = 2, while X(TTT) = 0. Thus

[X = 2] = {(HHT), (HTH), (THH)},

and

[X ≤ 1] = {(HTT), (THT), (TTH), (TTT)}.
EXERCISES
1. Let X denote the sum of the outcomes of two tosses of a fair die, and let Y be their product. Let (u, v) denote the outcome of the two tosses in which the first number that comes up is u and the second number that comes up is v.

(i) List all 36 elementary events in Ω.

(ii) What are the elementary events in each of these events: [X < 4], [X > 30] and [X ∈ (1, 4]]?
2. A box contains five tags, numbered 1 through 5. A tag is selected at random, and, without replacing it, a second tag is selected.
(i) List all 20 ordered pairs of outcomes in this game of sampling twice without replacement.
(ii) Let X denote the first number selected, and let Y denote the second number selected. List the elementary events in each of the following events: [X = 2], [Y = 2], [X = 3], [Y = 3] and [X + Y < 4].
3. Prove: if A is an event, then I_{A^c} = 1 − I_A.

4. Prove: if A and B are events, then they are disjoint if and only if I_{A∪B} = I_A + I_B.
5. Prove: if X and Y are random variables, and if x ∈ range(X), then

[X = x] = ∪{[X = x][Y = y] : y ∈ range(Y)}.
6. Prove: if X is a random variable, and if A is any set of numbers, then
[X ∈ A] = ∪{[X = x] : x ∈ A ∩ range(X)}.
7. Prove: if X is a random variable, then

X = Σ_x x I_{[X=x]},

where the symbol Σ_x means the sum over all x ∈ range(X).
2.2 Densities of Random Variables

What we are primarily interested in are the probabilities with which a random variable takes certain values. This set of probabilities is usually referred to as the density of the random variable, which we now define.

Definition. If X is a random variable, its density f_X(x) is defined by

f_X(x) = P[X = x] if x ∈ range(X), and f_X(x) = 0 if x ∉ range(X).
In the example presented at the end of Section 2.1 (where X denotes the number of heads occurring in three tosses of an unbiased coin) the density of X is the following:
f_X(0) = P[X = 0] = P{(TTT)} = 1/8,
f_X(1) = P[X = 1] = P{(HTT), (THT), (TTH)} = 3/8,
f_X(2) = P[X = 2] = P{(HHT), (HTH), (THH)} = 3/8,
f_X(3) = P[X = 3] = P{(HHH)} = 1/8, and
f_X(x) = 0 if x ∉ {0, 1, 2, 3}.
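This density can be recomputed by enumerating Ω; the following sketch is an illustration, not part of the text:

```python
from fractions import Fraction
from itertools import product

omega = list(product('HT', repeat=3))   # the 8 equally likely outcomes
N = len(omega)

def X(w):
    return w.count('H')                 # number of heads in the three tosses

def f_X(x):
    """Density of X: P[X = x] by counting."""
    return Fraction(sum(1 for w in omega if X(w) == x), N)

assert [f_X(x) for x in range(4)] == [Fraction(1, 8), Fraction(3, 8),
                                      Fraction(3, 8), Fraction(1, 8)]
assert f_X(5) == 0                      # 5 is outside range(X)
```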
Note that for every x ∈ range(X), f_X(x) > 0. We shall also need a definition of the range of two or more random variables considered jointly (or as a random vector).

Definition. If X and Y are random variables, their (joint) range, denoted by range(X, Y), is defined by

range(X, Y) = {(X(ω), Y(ω)) : ω ∈ Ω}.
In general, if X_1, ..., X_n are n random variables, then we define

range(X_1, ..., X_n) = {(X_1(ω), ..., X_n(ω)) : ω ∈ Ω}.

One should note that range(X, Y) is a set of number pairs, i.e., a set of points in 2-dimensional Euclidean space R^2. Likewise, range(X_1, ..., X_n) is a set of points in n-dimensional Euclidean space R^n.

Definition. If X and Y are random variables, their joint density f_{X,Y}(x, y) is defined by

f_{X,Y}(x, y) = P([X = x][Y = y]) if (x, y) ∈ range(X, Y), and f_{X,Y}(x, y) = 0 otherwise.
This is referred to as a bivariate density. As an example, consider an urn with the following nine number pairs:

(1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3)
A number pair is selected at random. Let X denote the smaller of the two numbers (e.g., X assigns to (3,1) the number 1), and let Y denote the sum of the two numbers (for example, Y assigns to (3,2) the number 5). We shall find the joint density of X and Y .
First note that range(X) = {1, 2, 3}, range(Y) = {2, 3, 4, 5, 6}, and range(X, Y) = {(1,2), (1,3), (1,4), (2,4), (2,5), (3,6)}. Next observe that

[X = 1][Y = 2] = {(1,1)}
[X = 1][Y = 3] = {(1,2), (2,1)}
[X = 1][Y = 4] = {(1,3), (3,1)}
[X = 2][Y = 4] = {(2,2)}
[X = 2][Y = 5] = {(2,3), (3,2)}
[X = 3][Y = 6] = {(3,3)}.

Thus, f_{X,Y}(1,2) = 1/9, f_{X,Y}(1,3) = 2/9, f_{X,Y}(1,4) = 2/9, f_{X,Y}(2,4) = 1/9, f_{X,Y}(2,5) = 2/9, f_{X,Y}(3,6) = 1/9, and f_{X,Y}(x, y) = 0 for all other pairs (x, y). Notice in this bivariate case that f_{X,Y}(2,3) = 0 although 2 ∈ range(X) and 3 ∈ range(Y). What matters now is range(X, Y).
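The same joint density can be produced by enumeration (an illustrative sketch, not part of the text):

```python
from fractions import Fraction

# The nine equally likely number pairs in the urn.
omega = [(a, b) for a in (1, 2, 3) for b in (1, 2, 3)]
N = len(omega)

X = lambda w: min(w)    # the smaller of the two numbers
Y = lambda w: sum(w)    # the sum of the two numbers

def f_XY(x, y):
    """Joint density of X and Y by counting."""
    return Fraction(sum(1 for w in omega if X(w) == x and Y(w) == y), N)

assert f_XY(1, 3) == Fraction(2, 9)   # from {(1,2), (2,1)}
assert f_XY(3, 6) == Fraction(1, 9)   # from {(3,3)}
assert f_XY(2, 3) == 0                # (2,3) is not in range(X, Y)
```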
Joint densities of more than two random variables are similarly defined. For example, if X, Y, Z are random variables, their joint density is defined, for all (x, y, z) ∈ range(X, Y, Z), by

f_{X,Y,Z}(x, y, z) = P([X = x][Y = y][Z = z]);

otherwise we define f_{X,Y,Z}(x, y, z) = 0.

It is important to keep in mind the notation used from here on. If U_1, ..., U_r are random variables, the joint density of U_1, ..., U_r is denoted by

f_{U_1,...,U_r}(u_1, ..., u_r).

The subscripts are U_1, ..., U_r; they indicate the random variables of which it is the joint density. Within the parentheses are the points (u_1, ..., u_r) in the range of U_1, ..., U_r at which the density is evaluated.
Many times we have (or start out with) a joint density of several random variables, but we wish to have the densities of each single random variable. This can be accomplished by using theorems like the following.
THEOREM 1. If X and Y are random variables with joint density f_{X,Y}(x, y), then the densities f_X(x) and f_Y(y) are

f_X(x) = Σ{f_{X,Y}(x, y) : y ∈ range(Y)} if x ∈ range(X)
and
f_Y(y) = Σ{f_{X,Y}(x, y) : x ∈ range(X)} if y ∈ range(Y).
Proof: We first observe that for each x ∈ range(X),

[X = x] = ∪{[X = x][Y = y] : y ∈ range(Y)}.

Indeed, if ω is in the right hand side, then X(ω) = x; and if ω is in the left hand side, i.e., X(ω) = x, then Y(ω) ∈ range(Y), say Y(ω) = y_1, in which case

ω ∈ [X = x][Y = y_1] ⊂ ∪{[X = x][Y = y] : y ∈ range(Y)}.

Since the right hand side is a disjoint union, we have

f_X(x) = P[X = x] = Σ{P([X = x][Y = y]) : y ∈ range(Y)} = Σ{f_{X,Y}(x, y) : y ∈ range(Y)}.
The proof of the second equation of the theorem is similar. Q.E.D.
In Theorem 1, f_X(x) is called a marginal or marginal density of f_{X,Y}(x, y). As an example, consider the random variables X, Y whose joint density is given by:

f_{X,Y}(1,1) = 1/8, f_{X,Y}(2,1) = 1/8, f_{X,Y}(2,2) = 1/4, f_{X,Y}(3,2) = 1/8, f_{X,Y}(3,3) = 1/4, f_{X,Y}(4,1) = 1/8.
Graphically, this is represented as follows (the mass at each point (x, y)):

  y = 3:                   1/4
  y = 2:            1/4    1/8
  y = 1:    1/8     1/8            1/8
           x = 1   x = 2  x = 3   x = 4
The marginals for X and Y are:

f_X(1) = 1/8, f_X(2) = 3/8, f_X(3) = 3/8, f_X(4) = 1/8,

and

f_Y(1) = 3/8, f_Y(2) = 3/8, f_Y(3) = 1/4.
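Theorem 1's marginalization can be carried out mechanically; the joint density below is the one in the example above, and the code is only an illustrative sketch:

```python
from fractions import Fraction

f_XY = {(1, 1): Fraction(1, 8), (2, 1): Fraction(1, 8), (2, 2): Fraction(1, 4),
        (3, 2): Fraction(1, 8), (3, 3): Fraction(1, 4), (4, 1): Fraction(1, 8)}

def marginal(joint, axis):
    """Sum the joint density over the other coordinate (Theorem 1)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], Fraction(0)) + p
    return out

f_X, f_Y = marginal(f_XY, 0), marginal(f_XY, 1)
assert f_X == {1: Fraction(1, 8), 2: Fraction(3, 8),
               3: Fraction(3, 8), 4: Fraction(1, 8)}
assert f_Y == {1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 4)}
```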
Individual and joint densities of random variables are useful in computing the probabilities that certain functions of these random variables "take values" in certain sets. We consider particular cases of this in the following theorems.
THEOREM 2. If X is any random variable, and if A is any set of real numbers, then

P[X ∈ A] = Σ{f_X(x) : x ∈ A ∩ range(X)}.
Proof: It is clear that
[X ∈ A] = ∪{[X = x] : x ∈ A ∩ range(X)}

and that the right hand side is a disjoint union. Taking probabilities of both sides, we get

P[X ∈ A] = Σ{P[X = x] : x ∈ A ∩ range(X)} = Σ{f_X(x) : x ∈ A ∩ range(X)},
which proves the theorem. Q.E.D.
One can prove in a similar fashion the following theorem whose proof we omit.
THEOREM 3. If X and Y are random variables, and if S is any subset of 2-dimensional Euclidean space R^2, then P[(X, Y) ∈ S] = Σ{f_{X,Y}(x, y) : (x, y) ∈ S and (x, y) ∈ range(X, Y)}.
Another result of this type that we shall use is the following.
THEOREM 4. If X and Y are random variables, and if g(x, y) is a function defined over range(X, Y), then

P[g(X, Y) = z] = Σ{f_{X,Y}(x, y) : g(x, y) = z and (x, y) ∈ range(X, Y)}.

Proof: First observe that the right hand side of the above equation is summed over all number pairs (x, y) such that (x, y) ∈ range(X, Y) and g(x, y) = z. One can easily verify that

[g(X, Y) = z] = ∪{[X = x][Y = y] : g(x, y) = z and (x, y) ∈ range(X, Y)}.
One is also able to verify that the union on the right side is a disjoint union. Taking probabilities of both sides yields the conclusion of the theorem. Q.E.D.
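Theorem 4 amounts to grouping the joint mass by the value of g. A sketch, using the joint density of the earlier nine-pair urn example and the hypothetical choice g(x, y) = x + y:

```python
from fractions import Fraction

f_XY = {(1, 2): Fraction(1, 9), (1, 3): Fraction(2, 9), (1, 4): Fraction(2, 9),
        (2, 4): Fraction(1, 9), (2, 5): Fraction(2, 9), (3, 6): Fraction(1, 9)}

def density_of_g(joint, g):
    """P[g(X, Y) = z] = sum of f_XY(x, y) over pairs with g(x, y) = z."""
    out = {}
    for (x, y), p in joint.items():
        z = g(x, y)
        out[z] = out.get(z, Fraction(0)) + p
    return out

f_Z = density_of_g(f_XY, lambda x, y: x + y)   # Z = X + Y
assert f_Z[7] == Fraction(2, 9)                # only (2,5) maps to 7
assert sum(f_Z.values()) == 1
```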
The above theorem shows how to obtain the density of a function of one or more random variables. We next need to develop the idea of independence of random variables.
Definition. If X_1, ..., X_m are random variables defined over Ω, we shall call them independent if and only if, for every y_i ∈ range(X_i), 1 ≤ i ≤ m, the events [X_1 = y_1], ..., [X_m = y_m] are independent.
THEOREM 5. If X_1, ..., X_m are random variables, they are independent if and only if

f_{X_1,...,X_m}(y_1, ..., y_m) = Π_{j=1}^m f_{X_j}(y_j)

for all y_i ∈ range(X_i), 1 ≤ i ≤ m.
Proof: The equation is a consequence of the definition of independence. The converse is obtained, starting from the equation given, by summing both sides over various indices y_i ∈ range(X_i) to obtain

f_{X_{i_1},...,X_{i_r}}(y_{i_1}, ..., y_{i_r}) = Π_{j=1}^r f_{X_{i_j}}(y_{i_j})

for 1 ≤ i_1 < ··· < i_r ≤ m, 1 ≤ r ≤ m. Q.E.D.
THEOREM 6. If X_1, ..., X_n are independent random variables, and if c_1, ..., c_n are constants, then c_1X_1, ..., c_nX_n are independent.

Proof: First assume that none of the c_i's are zero. Then for 1 ≤ j_1 < ··· < j_r ≤ n, by the hypothesis of independence of X_1, ..., X_n, we have

P(∩_{ℓ=1}^r [c_{j_ℓ}X_{j_ℓ} = u_ℓ]) = P(∩_{ℓ=1}^r [X_{j_ℓ} = u_ℓ/c_{j_ℓ}]) = Π_{ℓ=1}^r P[X_{j_ℓ} = u_ℓ/c_{j_ℓ}] = Π_{ℓ=1}^r P[c_{j_ℓ}X_{j_ℓ} = u_ℓ].

If c_i = 0, then P[c_iX_i = 0] = 1 and P([c_iX_i = 0] ∩ A) = P(A) for any event A. If u_ℓ ≠ 0, then P[c_iX_i = u_ℓ] = 0, and P([c_iX_i = u_ℓ] ∩ A) = 0 for any event A. Putting all these statements together gives us the theorem. Q.E.D.
The most concrete examples of both independent random variables and non-independent (i.e., dependent) random variables occur in sample survey theory. Let us look at the basic model. We have a population U of N units denoted by U_1, ..., U_N, all of which are equally likely to be selected under random sampling. If we sample n times with replacement, then the fundamental probability space Ω is the set of all possible ordered n-tuples of units from U, with repetition of units in each n-tuple allowed. Let X be a function defined over U which assigns to the unit U_i the number u_i, 1 ≤ i ≤ N. The u_i's need not be distinct and usually are not. For 1 ≤ j ≤ n let X_j be a function (or random variable) defined over Ω by assigning to the n-tuple (U_{i_1}, ..., U_{i_n}) the number u_{i_j}, i.e.,

X_j((U_{i_1}, ..., U_{i_n})) = X(U_{i_j}) = u_{i_j}.

The random variables X_1, ..., X_n are referred to as a sample of size n on X, where the sampling is done with replacement. The fundamental probability space Ω contains N^n equally likely elementary events.
THEOREM 7. In sampling n times with replacement as defined above, the random variables X_1, ..., X_n are independent, and all have the same density.

Proof: Let N_x denote the number of units U_i such that X(U_i) = x, i.e.,

N_x = #{U ∈ U : X(U) = x}.
Then the joint density

f_{X_1,...,X_n}(x_1, ..., x_n) = (Π_{i=1}^n N_{x_i}) / N^n = Π_{i=1}^n (N_{x_i}/N) = Π_{i=1}^n P[X_i = x_i] = Π_{i=1}^n f_{X_i}(x_i).

By Theorem 5, X_1, ..., X_n are independent. Clearly, because of the replacement after each selection, all densities f_{X_i}(x) are the same. Q.E.D.
If the sampling is done without replacement, then it is clear that the random variables X_1, ..., X_n are not independent. However, there is the following interesting and useful result.

THEOREM 8. In sampling n times without replacement, all univariate densities {f_{X_i}(x)} are the same, and all bivariate densities {f_{X_i,X_j}(u, v), i ≠ j} are the same.
Proof: Both conclusions follow from this remark: in sampling n times without replacement from U, the set of N(N − 1) ··· (N − n + 1) equally likely outcomes is the same as that obtained by recording the selection of the i-th unit first, then the j-th unit (i ≠ j), and then the remaining n − 2 units from left to right. Q.E.D.
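Theorem 8 can be verified exhaustively on a small hypothetical population by listing all ordered samples (the X-values below are arbitrary and deliberately not all distinct):

```python
from fractions import Fraction
from itertools import permutations

values = [10, 10, 20, 30]     # X-values of a hypothetical population, N = 4
n = 3
# All equally likely ordered samples of size n without replacement.
samples = list(permutations(range(len(values)), n))

def f_Xi(i, x):
    """Univariate density of X_i, the value recorded on draw i (0-based)."""
    return Fraction(sum(1 for s in samples if values[s[i]] == x), len(samples))

def f_XiXj(i, j, u, v):
    """Bivariate density of (X_i, X_j)."""
    return Fraction(sum(1 for s in samples
                        if values[s[i]] == u and values[s[j]] == v), len(samples))

# All univariate densities are the same:
for x in set(values):
    assert f_Xi(0, x) == f_Xi(1, x) == f_Xi(2, x)

# All bivariate densities (i != j) are the same:
for u in set(values):
    for v in set(values):
        assert f_XiXj(0, 1, u, v) == f_XiXj(1, 2, u, v) == f_XiXj(0, 2, u, v)
```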
EXERCISES
1. Consider a game in which an unbiased coin is tossed four times.
(i) List the set Ω of all equally likely outcomes.
(ii) Let X denote the number of heads in the four tosses. Find the density f_X(·) of X.
(iii) Let U denote the smaller of the number of heads and the number of tails in the four tosses. Find the density f_U(·) of U.
(iv) Find the joint density of X and U.
2. Let X and Y be random variables whose joint density is given by
f_{X,Y}(0,0) = 1/3, f_{X,Y}(0,1) = 1/4, f_{X,Y}(1,1) = 1/6, f_{X,Y}(1,2) = 1/6 and f_{X,Y}(2,0) = 1/12.
(i) Find the density, f_X(·), of X. (ii) Find the density, f_Y(·), of Y. (iii) If U = min(X, Y), find the joint density of U and X, f_{U,X}(·, ·).
3. Compute P([Y = 1] | [X = 1]) and P([X = 2] | [Y = 0]), where X and Y are as in Problem 2.
4. In Problem 2, find the density of X + Y.
5. An urn contains 3 red balls and 4 black balls. One ball after another is drawn without replacement from the urn until the first black ball is drawn. Let X denote the number of balls drawn from the urn at the time the first black ball is drawn. Find the density of X.
40 CHAPTER 2. RANDOM VARIABLES
6. Prove: if X is a random variable, and if g is a function defined over range(X), then the density of g(X) is
f_{g(X)}(z) = Σ{f_X(x) : g(x) = z}.
7. Prove: if X, Y and Z are independent random variables, if g is a function defined over range(Z), and if h is a function defined over range(X, Y), then h(X, Y) and g(Z) are independent random variables.
8. Prove: if X, Y and Z are random variables, if φ(x, y) is a function defined over range(X, Y), and if ψ(x, y, z) is a function defined over range(X, Y, Z), then the joint density, f(u, v), of the random variables φ(X, Y), ψ(X, Y, Z) is

f(u, v) = Σ{f_{X,Y,Z}(x, y, z) : φ(x, y) = u, ψ(x, y, z) = v}.
9. Let U: U_1, ..., U_N be a population with N units, and let X be a real-valued function defined over U. Sampling is done n times without replacement. Let X_i denote the value of X for the i-th unit selected. If 3 ≤ n ≤ N, prove that the joint densities of all triples (X_i, X_j, X_k) of observations are the same when i, j, k are distinct.
10. Let U, X and N of Problem 9 be as follows:

U: U_1  U_2  U_3  U_4  U_5  U_6  U_7
X: 2.3  2.1  3.0  2.3  3.0  3.5  2.1
One samples three times without replacement. Letting X_1, X_2, X_3 be as in Problem 9, determine the joint densities of (i) (X_1, X_2, X_3), (ii) (X_1, X_2) (by taking a suitable marginal), (iii) (X_2, X_3), (iv) (X_3, X_1), (v) X_1, (vi) X_2 and (vii) X_3.
11. Compute P[X_2 = 2.1] and P([X_2 = 2.1] | [X_1 = 2.1]), where X_1 and X_2 are as in Problem 10. Ponder over the intuitive reason for the difference.
12. Let X, Y, Z be random variables with joint density as follows:
f_{X,Y,Z}(0,0,0) = 1/21, f_{X,Y,Z}(0,0,1) = 2/21, f_{X,Y,Z}(0,1,1) = 1/7, f_{X,Y,Z}(2,1,0) = 4/21, f_{X,Y,Z}(2,0,1) = 5/21, f_{X,Y,Z}(1,0,2) = 2/7.
(i) Find the joint density of X and Z.
(ii) Find the marginals f_X(·), f_Y(·), f_Z(·).
13. An urn contains 3 red, 2 white and 2 blue balls. Sampling is done without replacement until no balls are left in the urn. Let X denote the trial number at which the first blue ball is selected, let Y be the trial number at which the first red ball is selected, and let Z be the trial number at which the first white ball is selected.
(i) Determine the joint density of X, Y, Z. (ii) Determine the joint density of X, Y. (iii) Compute the probability that blue is the last color selected. (iv) Compute the probability that white appears before red, i.e., P[Z < Y].
14. Prove: If X_1, ..., X_n are independent random variables, if a_1, ..., a_n are constants, and if Y_j = X_j + a_j, 1 ≤ j ≤ n, then Y_1, ..., Y_n are independent random variables.
2.3 Some Particular Distributions

Some particular formulae for densities arise in practice again and again. Those that appear in sample survey mathematics are singled out here.
Definition. A sequence of Bernoulli trials refers to repeated plays of a game such that
i) with each play there are two disjoint outcomes, S and F , of which exactly one occurs,
ii) P(S) does not vary from play to play, and
iii) the outcomes of the plays are independent events.
An example of a sequence of Bernoulli trials is the repeated tossing of a die where, say, S denotes the outcome that the die comes up 1 or 2 and F denotes the outcome that the die comes up 3,4,5 or 6. Independence of outcomes is clear, and P(S) = 1/3 for each toss of the die.
Definition. Let X denote the number of times S occurs in n Bernoulli trials, and let p = P(S). Then the random variable X is said to have the binomial distribution, denoted by B(n, p).
We shall sometimes write: X is B(n, p), or X ~ B(n, p). We now find the density of X when X is B(n, p).
THEOREM 1. If X is B(n, p), then

f_X(x) = (n choose x) p^x (1 − p)^(n−x) if 0 ≤ x ≤ n, and f_X(x) = 0 otherwise.
Proof: The event [X = x] is the disjoint union of (n choose x) events, namely, those represented by all n-tuples of patterns of S and F such that each pattern contains exactly x S's. Each such pattern occurs with probability p^x (1 − p)^(n−x). Thus P[X = x] = (n choose x) p^x (1 − p)^(n−x) for x = 0, 1, 2, ..., n. Q.E.D.
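The density of Theorem 1 is easy to compute directly; a minimal sketch (plain Python, with arbitrarily chosen n and p) is:

```python
from math import comb

def binomial_pmf(x, n, p):
    """Density of X ~ B(n, p) as in Theorem 1: (n choose x) p^x (1-p)^(n-x)."""
    if 0 <= x <= n:
        return comb(n, x) * p**x * (1 - p)**(n - x)
    return 0.0  # zero outside 0 <= x <= n

# The masses sum to 1, which is just the binomial theorem applied to
# (p + (1 - p))^n.
assert abs(sum(binomial_pmf(x, 10, 1/3) for x in range(11)) - 1) < 1e-12
```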
THEOREM 2. If X and Y are independent random variables, if X is B(m, p) and if Y is B(n, p), then X + Y is B(m + n, p).
Proof: Consider a sequence of m + n Bernoulli trials in which P(S) = p, and let Z denote the number of times S occurs. Then Z is B(m + n, p). Let X′ denote the number of times S occurs in the first m trials, and let Y′ denote the number of times S occurs in the last n trials. Now X′ and Y′ are independent and have, respectively, the same distributions
as X and Y. Hence X + Y has the same distribution as X′ + Y′ = Z, which is B(m + n, p). Q.E.D.
The binomial distribution occurs in sample survey theory when sampling is done with replacement. In the more frequent case when sampling is done without replacement, the following distribution arises.
Definition. If an urn has r red balls and b black balls, if n balls are selected at random without replacement, and if X denotes the number of red balls found among the n (n ≤ r + b), then X is said to have the hypergeometric distribution.
THEOREM 3. If X is a random variable with the hypergeometric distribution of the last definition, then

f_X(x) = (r choose x)(b choose n−x) / (r+b choose n) if max{0, n − b} ≤ x ≤ min{n, r}, and f_X(x) = 0 otherwise.
Proof: There are (r choose x) equally likely ways of selecting x red balls out of the r red balls in the urn. For each way in which x red balls are selected from the r there are (b choose n−x) ways of selecting n − x black balls. Thus the total number of equally likely ways in which x red balls and n − x black balls can be selected is (r choose x)(b choose n−x). There are (r+b choose n) equally likely ways of selecting n balls out of the total of r + b balls. Thus

P[X = x] = (r choose x)(b choose n−x) / (r+b choose n) if x ∈ range(X), and P[X = x] = 0 otherwise.
We must always verify the range of values of X. The most important part of computing a hypergeometric distribution is not computing the ratio involving three binomial coefficients, but computing the allowable values of x, i.e., range(X). Clearly x ≥ 0 always. However, if b < n, then the minimal number of red balls in a sample of size n has to be n − b. Thus we see that the smallest value of x is max{0, n − b}. Again, it is clear that x ≤ n. But also x ≤ r. Hence x ≤ min{n, r}. Q.E.D.
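The range check emphasized above translates directly into code; a sketch (the urn counts are illustrative choices):

```python
from math import comb

def hypergeom_pmf(x, n, r, b):
    """Density of the number of red balls among n draws without replacement
    from an urn with r red and b black balls (Theorem 3)."""
    # Check range(X) first, as the text insists:
    if max(0, n - b) <= x <= min(n, r):
        return comb(r, x) * comb(b, n - x) / comb(r + b, n)
    return 0.0

# With r = 3, b = 4, n = 6, the allowable values are
# max{0, 6 - 4} = 2, ..., min{6, 3} = 3 only.
assert hypergeom_pmf(1, 6, 3, 4) == 0.0
assert abs(sum(hypergeom_pmf(x, 6, 3, 4) for x in range(7)) - 1) < 1e-12
```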
One further important univariate distribution is the uniform distribution.
Definition. A random variable X is said to have the uniform distribution over {1, ..., N} if

P[X = i] = 1/N if 1 ≤ i ≤ N, and P[X = i] = 0 otherwise.
A most important multivariate distribution is the multinomial distribution.
Definition. Suppose a game has r + 1 disjoint outcomes A_0, A_1, ..., A_r, with ∪_{i=0}^r A_i = Ω (i.e., one and only one of the A's must occur). Suppose the game is played n times under identical conditions, so that the n outcomes are independent. Let X_i denote the number of times A_i occurs in the n plays, 1 ≤ i ≤ r, and denote p_i = P(A_i). Then X_1, ..., X_r are said to have the multinomial distribution, which we denote by MN(n, p_1, ..., p_r).
It should be noted that p_1 + ··· + p_r ≤ 1. Also, if r = 1, then MN(n, p_1) and B(n, p_1) are the same. We conclude this section by determining the joint density of multinomially distributed random variables.
THEOREM 4. If X_1, ..., X_r are MN(n, p_1, ..., p_r), then their joint density is

f_{X_1,...,X_r}(x_1, ..., x_r) = [n! / (x_1! ··· x_r! (n − Σ_{i=1}^r x_i)!)] (Π_{i=1}^r p_i^{x_i}) (1 − Σ_{i=1}^r p_i)^{n − Σ_{i=1}^r x_i}

for x_1 ≥ 0, ..., x_r ≥ 0, x_1 + ··· + x_r ≤ n.
Proof: The event ∩_{i=1}^r [X_i = x_i] can be written as a disjoint union of all n-tuples involving x_1 A_1's, ..., x_r A_r's. Each of these n-tuples has probability

p_1^{x_1} ··· p_r^{x_r} (1 − p_1 − ··· − p_r)^{n − x_1 − ··· − x_r}.
Now how many such n-tuples do we have? There are (n choose x_1) selections of trial numbers possible for A_1 to occur. For each choice of x_1 trial numbers for A_1 to occur there are (n−x_1 choose x_2) selections of trial numbers for A_2 to occur. Continuing, we see that the total number of n-tuple outcomes in which there are x_1 A_1's, ..., x_r A_r's is

Π_{i=1}^r ((n − Σ_{t<i} x_t) choose x_i) = n! / (x_1! ··· x_r! (n − Σ_{i=1}^r x_i)!).
Q.E.D.
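The density of Theorem 4 can be coded directly; a sketch (the probabilities are illustrative), where the coefficient is built exactly as the proof counts the n-tuples:

```python
from math import comb, factorial

def multinomial_pmf(xs, n, ps):
    """Joint density of (X_1, ..., X_r) ~ MN(n, p_1, ..., p_r) (Theorem 4).
    xs, ps list x_1, ..., x_r and p_1, ..., p_r; A_0 accounts for the rest."""
    x0 = n - sum(xs)      # number of occurrences of the remaining outcome A_0
    p0 = 1 - sum(ps)
    if x0 < 0 or any(x < 0 for x in xs):
        return 0.0
    coef = factorial(n)   # n! / (x_1! ... x_r! x_0!)
    for x in list(xs) + [x0]:
        coef //= factorial(x)
    prob = p0 ** x0
    for x, p in zip(xs, ps):
        prob *= p ** x
    return coef * prob

# With r = 1 the multinomial density reduces to the binomial density.
assert abs(multinomial_pmf([2], 5, [0.3]) - comb(5, 2) * 0.3**2 * 0.7**3) < 1e-12
```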
EXERCISES
1. In the Polya urn scheme, let X denote the number of times a red ball is selected in the first two trials. Find the density of X.
2. If X and Y are independent random variables, if X is B(m, p) and if Y is B(n, p), find P([X = k] | [X + Y = r]).
3. If X and Y are independent random variables, and if each is uniformly distributed over {1, ..., N}, find the density of max{X, Y}.
4. If X and Y are independent and uniformly distributed over {0,1,2}, find the density of X + Y.
5. In Problem 4, find the density of XY.
6. Prove that x (n choose x) = n (n−1 choose x−1).
7. In a sequence of Bernoulli trials with P(S) = p, let T denote the trial at which S occurs for the first time, and let X = min{T, 3}. Find the density of X.
8. An absent-minded professor has five keys on a key ring. When he wishes to open his office door, he tries one key after another until he finds the one that works. Let X denote the number of tries needed. Find the density of X. (Note: He doesn't try the same key twice; he is not that absent-minded.)
9. Urn #1 has two white balls and three black balls, and urn #2 contains two white balls and two black balls. An urn is selected at random, and two balls are selected from it without replacement. Let X denote the number of white balls in the selection. Find the density of X.
10. If X_1, X_2, X_3 are MN(n, p_1, p_2, p_3), find P([X_1 = u] | [X_1 + X_2 = v]) for 0 ≤ u ≤ v ≤ n.
11. In Problem 10, prove that X_1, X_3 are MN(n, p_1, p_3).
12. An urn contains r red balls, w white balls and b blue balls. One samples n times with replacement. Let X denote the number of red balls selected, and let Y denote the number of white balls selected. Find the joint density of X, Y.
13. Solve problem 12 if the sampling is done without replacement.
14. Let X, Y be independent random variables, each uniformly distributed over {0, 1, 2}, and let U = X + Y, V = X − Y, R = min{X, Y} and S = max{X, Y}. Find
i) the joint density of U, X,

ii) the joint density of U, V, and

iii) the joint density of R, S.
15. Prove: if X, Y are random variables with the same joint density as random variables X′, Y′, and if g is a function defined over range(X, Y), then the densities of g(X, Y) and g(X′, Y′) are the same.
Chapter 3
Expectation
3.1 Properties of Expectation

Expectation is a weighted average. Its definition is justified by the law of large numbers, which will be encountered later.
Definition. If X is a random variable with density f_X(x), we define its expectation, EX or E(X), by

E(X) = Σ{x f_X(x) : x ∈ range(X)}

or, more simply,

E(X) = Σ_x x f_X(x).
As an example, suppose X is a random variable with density
f_X(−2.3) = 1/10, f_X(0) = 1/5, f_X(1.34) = 3/10, f_X(2.79) = 2/5, f_X(x) = 0 for x ∉ {−2.3, 0, 1.34, 2.79}.
Then, in accordance with the above definition,
EX = (−2.3)(1/10) + 0 · (1/5) + (1.34)(3/10) + (2.79)(2/5) = 1.288.
We might have random variables which are functions of one or more random variables. The following theorem gives the formulae for their expectations.
THEOREM 1. If X and Y are random variables, if h is a function defined over the range of X, and if g is a function defined over the range of X, Y, then
Eh(X) = Σ_x h(x) f_X(x)

and

Eg(X, Y) = Σ_{x,y} g(x, y) f_{X,Y}(x, y).
Proof: Using the definition and Exercise 6 in Section 2.2, we obtain
Eh(X) = Σ_z z P[h(X) = z]
= Σ_z z Σ{f_X(x) : h(x) = z}
= Σ_x h(x) f_X(x).
The proof of the second formula is accomplished in a similar manner. Q.E.D.
As an example, let X be a random variable with density given just above the statement of Theorem 1, and let h be the function defined by h(x) = 2e^x + x^2. Then

Eh(X) = h(−2.3)(1/10) + h(0)(1/5) + h(1.34)(3/10) + h(2.79)(2/5)
= (2e^{−2.3} + (−2.3)^2)(1/10) + (2e^0 + 0^2)(1/5) + (2e^{1.34} + (1.34)^2)(3/10) + (2e^{2.79} + (2.79)^2)(2/5)
≈ 19.918.
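The two computations above (EX from the definition, and Eh(X) via Theorem 1 without ever finding the density of h(X)) can be sketched as:

```python
from math import exp

# The density of the running example.
f = {-2.3: 1/10, 0.0: 1/5, 1.34: 3/10, 2.79: 2/5}

# E(X) = sum of x f_X(x) over the range of X.
EX = sum(x * p for x, p in f.items())
assert abs(EX - 1.288) < 1e-9

def h(x):
    return 2 * exp(x) + x**2

# By Theorem 1, E h(X) = sum of h(x) f_X(x); the density of h(X) is not needed.
EhX = sum(h(x) * p for x, p in f.items())
assert abs(EhX - 19.918) < 5e-3   # 19.918 is the value rounded in the text
```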
THEOREM 2. If X and Y are random variables, and if C is a constant, then
i) E(X + Y) = E(X) + E(Y), and
ii) E(CX) = CE(X).
Proof: By Theorem 1 above and by Theorem 1 of Section 2.2, if we take the function g defined by g(x, y) = x + y, we have
E(X + Y) = Σ_{x,y} (x + y) f_{X,Y}(x, y)
= Σ_x Σ_y x f_{X,Y}(x, y) + Σ_y Σ_x y f_{X,Y}(x, y)
= Σ_x x (Σ_y f_{X,Y}(x, y)) + Σ_y y (Σ_x f_{X,Y}(x, y))
= Σ_x x f_X(x) + Σ_y y f_Y(y) = E(X) + E(Y).

Again, by Theorem 1, if h(x) = Cx, then

E(CX) = Σ_x C x f_X(x) = C Σ_x x f_X(x) = C E(X).
Q.E.D.
THEOREM 3. If X is B(n, p), then EX = np.
Proof: By the definition of B(n, p) and of expectation we have

E(X) = Σ_{x=0}^n x (n choose x) p^x (1 − p)^{n−x},

and, letting k = x − 1, we have (using the binomial theorem)

E(X) = np Σ_{k=0}^{n−1} (n−1 choose k) p^k (1 − p)^{n−1−k}
= np (p + (1 − p))^{n−1} = np.
Q.E.D.
Two little examples should be mentioned now.
Example 1. If X = I_A for some event A, then, because

[I_A = 1] = A and [I_A = 0] = A^c,

we have E(I_A) = 1 · P(A) + 0 · P(A^c) = P(A).
Example 2. If X is uniformly distributed over {1, 2, ..., N}, then P[X = x] = 1/N for x = 1, 2, ..., N, and hence

E(X) = Σ_{x=1}^N x · (1/N) = (1/N) · N(N + 1)/2 = (N + 1)/2.
EXERCISES
1. If X and Y are random variables with joint density given by f_{X,Y}(0,0) = 1/36, f_{X,Y}(0,1) = 1/18, f_{X,Y}(0,2) = 1/12, f_{X,Y}(1,2) = 1/9, f_{X,Y}(2,2) = 5/36, f_{X,Y}(2,1) = 1/6, f_{X,Y}(2,0) = 7/36, and f_{X,Y}(1,0) = 2/9, compute (i) E(X), (ii) E(Y), (iii) E(X^2), (iv) E(XY), (v) E(2X + Y + 1), (vi) E(^y) and (vii) E(max{X, Y}).
2. Prove: if a and b are constants, and if X and Y are random variables, then E(aX + bY) = aE(X) + bE(Y).
3. Prove: If X and Y are random variables which satisfy X(ω) ≤ Y(ω) for all ω ∈ Ω, then E(X) ≤ E(Y).
4. Prove: if X is a random variable, then
min{range(X)} ≤ E(X) ≤ max{range(X)}.
5. If X and Y are independent random variables, and if each is uniformly distributed over {0, 1, 2, 3}, compute (i) E(min{X, Y}) and (ii) E(XY).
6. Prove: E(X - E(X)) = 0 for any random variable X.
3.2 Moments of Random Variables

Powers of random variables are also random variables, and their expectations are of considerable interest. In this section we discuss moments, central moments, variance and standard deviation.
Definition. If X is a random variable, and if n is a nonnegative integer, we define the n-th moment of X (or of the distribution or density of X) by m_n = E(X^n). We define the n-th central moment of X by μ_n = E((X − EX)^n).
The first moment, E(X), is really the center of gravity of the mass distributed by f_X(x). Of special interest is the second central moment and its square root.
Definition. If X is a random variable, its second central moment, μ_2, is called its variance and is denoted by Var X or Var(X). The standard deviation of X, denoted by s.d.X or s.d.(X), is defined by s.d.(X) = √(Var X).
The following theorems are constantly used in all of probability and statistics.
THEOREM 1. If X is a random variable, then Var(X) = E(X^2) − (E(X))^2.
Proof: By the definition of variance and properties of expectation from Section 3.1, we have
Var(X) = E((X − E(X))^2) = E(X^2 − 2E(X)X + (EX)^2) = E(X^2) − 2E(X)E(X) + (E(X))^2 = E(X^2) − (E(X))^2.
Q.E.D.
THEOREM 2. If X is a random variable, and if C is a constant, then Var(CX) = C^2 Var X.
Proof: By Theorem 1, and by Theorem 2 of Section 3.1, we have
Var(CX) = E((CX)^2) − (E(CX))^2
= E(C^2X^2) − C^2(E(X))^2
= C^2(E(X^2) − (E(X))^2) = C^2 Var(X).
Q.E.D.
Thus, if Var X = 10, and C = 3, then Var(3X) = 3^2 Var(X) = 90; also Var(−2X) = (−2)^2 Var X = 40.
THEOREM 3. If X and C are as in Theorem 2, then Var(X + C) = Var(X).
Proof: By the definition of variance,
Var(X + C) = E((X + C − E(X + C))^2) = E((X + C − E(X) − C)^2) = E((X − E(X))^2) = Var(X).
Q.E.D.
Theorem 3 tells us that variance is not changed by adding a constant to the random variable.
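Theorems 2 and 3 can be checked by direct enumeration over a small density; a sketch (the density on {0, 1, 3} is an arbitrary choice for illustration):

```python
# An arbitrary density on a few points.
f = {0: 0.2, 1: 0.5, 3: 0.3}

def mean(density):
    return sum(x * p for x, p in density.items())

def var(density):
    m = mean(density)
    return sum((x - m)**2 * p for x, p in density.items())

C = 3
fC  = {C * x: p for x, p in f.items()}    # density of CX
fpC = {x + C: p for x, p in f.items()}    # density of X + C

assert abs(var(fC) - C**2 * var(f)) < 1e-9   # Var(CX) = C^2 Var(X)
assert abs(var(fpC) - var(f)) < 1e-9         # Var(X + C) = Var(X)
```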
THEOREM 4. If X and Y are independent random variables, then E(XY) = E(X)E(Y).
Proof: In Theorem 1 of Section 3.1, let g(x,y) = xy. In addition we use Theorem 5 of Section 2.2. Thus
E(XY) = Eg(X, Y) = Σ_{x,y} g(x, y) f_{X,Y}(x, y)
= Σ_x Σ_y x y f_X(x) f_Y(y)
= (Σ_x x f_X(x)) (Σ_y y f_Y(y)) = E(X)E(Y).
Q.E.D.
THEOREM 5. If X_1, ..., X_n are independent random variables, then

Var(Σ_{i=1}^n X_i) = Σ_{i=1}^n Var(X_i).
Proof: By Theorems 1 and 4,

Var(Σ_{j=1}^n X_j) = E((Σ_{j=1}^n X_j)^2) − (E(Σ_{j=1}^n X_j))^2
= E(Σ_j X_j^2 + Σ_{u≠v} X_u X_v) − (Σ_j E(X_j))^2
= Σ_j E(X_j^2) + Σ_{u≠v} E(X_u X_v) − Σ_j (E(X_j))^2 − Σ_{u≠v} E(X_u)E(X_v)
= Σ_{j=1}^n {E(X_j^2) − (E(X_j))^2} = Σ_{j=1}^n Var(X_j),

since by Theorem 4 the cross terms satisfy E(X_u X_v) = E(X_u)E(X_v) for u ≠ v and cancel.
Q.E.D.

Note that, by Theorems 2, 4 and 5, if X and Y are independent random variables, and if a, b and c are constants, then Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y). Thus, Var(2X − 15Y + 376) = 4Var(X) + 225Var(Y). (A very common mistake committed by unwary students is the following: if they are given that X and Y are independent, they might write that Var(X − 3Y) = Var(X) − 9Var(Y) instead of Var(X − 3Y) = Var(X) + (−3)^2 Var Y = Var(X) + 9Var(Y).) Note: sometimes, when X, Y are not independent, it still happens that Var(X + Y) = Var(X) + Var(Y). This occurs only when X and Y are uncorrelated, which will be discussed in the next section. There will be problems at the end of this section in which it will occur that Var(X + Y) ≠ Var(X) + Var(Y).
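The rule Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) for independent X, Y can be verified by exact enumeration over a product joint density; a sketch (the marginal densities below are arbitrary illustrative choices):

```python
from itertools import product

# Independent X and Y with small arbitrary densities; independence means the
# joint density is the product of the marginals.
fX = {0: 0.5, 1: 0.3, 2: 0.2}
fY = {-1: 0.4, 4: 0.6}

def var_of(g):
    """Variance of g(X, Y) under the product (independent) joint density."""
    m = sum(g(x, y) * px * py
            for (x, px), (y, py) in product(fX.items(), fY.items()))
    return sum((g(x, y) - m)**2 * px * py
               for (x, px), (y, py) in product(fX.items(), fY.items()))

a, b, c = 2, -15, 376
lhs = var_of(lambda x, y: a*x + b*y + c)
rhs = a**2 * var_of(lambda x, y: x) + b**2 * var_of(lambda x, y: y)
assert abs(lhs - rhs) < 1e-9   # Var(2X - 15Y + 376) = 4 Var(X) + 225 Var(Y)
```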
THEOREM 6. If X is B(n,p), then Var(X) = np(l -p).
Proof: Let A_i denote the event that S occurs on the i-th trial, 1 ≤ i ≤ n. By the definition of Bernoulli trials, A_1, ..., A_n are independent events, P(A_i) = p for 1 ≤ i ≤ n, and I_{A_1}, ..., I_{A_n} are independent random variables. Now I_{A_i}^2 = I_{A_i}, so E(I_{A_i}^2) = E(I_{A_i}) = p. Hence

Var(I_{A_i}) = E(I_{A_i}^2) − (E(I_{A_i}))^2 = p(1 − p).

By Theorem 5, since X = Σ_{i=1}^n I_{A_i}, we have

Var(X) = Σ_{i=1}^n Var(I_{A_i}) = np(1 − p).
Q.E.D.
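Both binomial moment formulas (EX = np from Theorem 3 of Section 3.1 and Var X = np(1 − p) from Theorem 6) can be confirmed numerically from the density of Theorem 1 of Section 2.3; a sketch with arbitrarily chosen n and p:

```python
from math import comb

n, p = 12, 0.35
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

mean = sum(x * q for x, q in enumerate(pmf))
var = sum(x * x * q for x, q in enumerate(pmf)) - mean**2

assert abs(mean - n * p) < 1e-9            # E(X) = np
assert abs(var - n * p * (1 - p)) < 1e-9   # Var(X) = np(1 - p)
```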
Another result we anticipate using in the future is the following.
THEOREM 7. In sampling from a population n times without replacement, if X_1, ..., X_n denote the observed outcomes, then E(X_iX_j) = E(X_1X_2) for i ≠ j.
Proof: We apply Theorem 1 of Section 3.1 with g(x,y) = xy and Theorem 8 of Section 2.2 to obtain
E(X_iX_j) = Σ_{u,v} u v f_{X_i,X_j}(u, v)
= Σ_{u,v} u v f_{X_1,X_2}(u, v) = E(X_1X_2).
Q.E.D.
EXERCISES
1. If X is a random variable, prove that the value of t that minimizes E((X − t)^2) is t = E(X).
2. Prove: If X is a random variable with s.d.X > 0, then Var(X / s.d.X) = 1.
3. Prove: If X is a random variable, then (E(X))^2 ≤ E(X^2).
4. Prove: If X is a random variable, then Var(X) ≥ 0.
5. Prove: If X is B(n, p), then the value of p that maximizes Var(X) is p = 1/2.
6. Prove: If X is a random variable, and if Y is defined by

Y = (X − E(X)) / s.d.X,

then E(Y) = 0 and Var(Y) = 1.
7. If X and Y are independent random variables, if E(X^2) = 5, E(X) = 2, E(Y^2) = 10 and E(Y) = 3, compute (i) Var(X + Y), (ii) Var(X + 2Y), and (iii) Var(X − 3Y).
8. Let X and Y be random variables with the following joint density:

f_{X,Y}(0,0) = 1/36, f_{X,Y}(0,1) = 1/18, f_{X,Y}(0,2) = 1/12, f_{X,Y}(1,0) = 2/9, f_{X,Y}(1,2) = 1/9, f_{X,Y}(2,0) = 7/36, f_{X,Y}(2,1) = 1/6, f_{X,Y}(2,2) = 5/36.

Compute Var(X), Var(Y) and Var(2X − 3Y). (Note: X and Y are not independent.)
9. Prove: If X is a random variable, then

μ_n = Σ_{j=0}^n (n choose j) m_{n−j} m_1^j (−1)^j,

where, as in this section, m_j denotes the j-th moment of X and μ_n denotes the n-th central moment of X.
10. Use the fact that

Σ_{j=1}^n j^2 = n(n + 1)(2n + 1)/6

to derive the formula for the variance of a random variable which has the uniform distribution over {1, ..., n}.
11. Prove: if X_1, ..., X_n are independent random variables, then

E(X_1 ··· X_n) = Π_{j=1}^n E(X_j).
12. Prove that Σ_{x=0}^n (x − np)^2 (n choose x) p^x (1 − p)^{n−x} = np(1 − p), where 0 < p < 1.
3.3 Covariance and Correlation

Covariances are needed when one wishes to compute the variance of a sum of random variables which are not independent. Correlation is sometimes correctly used and is sometimes misused as a measure of the degree of dependence between two random variables. Both covariance and correlation are important, and their definitions and basic properties are developed in this section. In order to do so, we shall need Schwarz's inequality.
LEMMA 1. If X is a random variable such that E(X^2) = 0, then P[X = 0] = 1.
Proof: If the conclusion were not true, then P[X ≠ 0] > 0. Hence there exists a number x_0 ∈ range(X) such that x_0 ≠ 0 and P[X = x_0] > 0. Then

E(X^2) = Σ_x x^2 f_X(x) ≥ x_0^2 f_X(x_0) > 0,

which contradicts the hypothesis that E(X^2) = 0. Q.E.D.
THEOREM 1. (Schwarz's Inequality). If X and Y are random variables, then

(E(XY))^2 ≤ E(X^2)E(Y^2),

with equality holding if and only if there are real numbers a and b, not both zero, such that P[aX + bY = 0] = 1.
Proof: If either random variable is zero, then the theorem is easily seen to be true. Hence we need to prove the theorem when neither random variable is the zero random variable. We first recall that the quadratic equation ax^2 + 2bx + c = 0, with a ≠ 0, has a double real root if and only if b^2 − ac = 0 and has no real roots if and only if b^2 − ac < 0. Now, for any real t, P[(tX + Y)^2 ≥ 0] = 1, and hence E((tX + Y)^2) ≥ 0 for all t. Thus the second degree polynomial in t,

E((tX + Y)^2) = E(X^2)t^2 + 2E(XY)t + E(Y^2),

is always non-negative, i.e., it either has a double real root or no real root. In either case, b^2 − ac = (E(XY))^2 − E(X^2)E(Y^2) ≤ 0, in which case (E(XY))^2 ≤ E(X^2)E(Y^2). Equality holds if and only if the polynomial has one real double root, i.e., there exists a value of t, call it t_0, such that E((t_0X + Y)^2) = 0. By Lemma 1, this is true if and only if P[t_0X + Y = 0] = 1. Q.E.D.
An easy application of the theorem is an alternate proof of the fact that for every random variable X, (E(X))^2 ≤ E(X^2). Indeed, let Y = 1 with probability 1. Then X = XY and E(Y^2) = 1, so (E(X))^2 = (E(XY))^2 ≤ E(X^2)E(Y^2) = E(X^2).
Definition. If X and Y are random variables, we define the covariance of X and Y by
Cov(X, Y) = E((X − E(X))(Y − E(Y))).
THEOREM 2. If X and Y are random variables, then Cov(X, Y) = E(XY) − E(X)E(Y) and Cov(X, X) = Var(X).
Proof: Using properties of expectation, we have
Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY − E(X)Y − E(Y)X + E(X)E(Y)) = E(XY) − E(X)E(Y).
The second conclusion follows from the two definitions. Q.E.D.
Definition. If X and Y are non-constant random variables, then their correlation or correlation coefficient, ρ_{X,Y}, is defined by

ρ_{X,Y} = Cov(X, Y) / (s.d.(X) s.d.(Y)).
THEOREM 3. If X and Y are independent (and non-constant) random variables, then Cov(X, Y) = 0 and ρ_{X,Y} = 0.
Proof: Using the definition of covariance, Theorem 4 of Section 3.2 and Theorem 2 above, we have
Cov(X, Y) = E(XY) − E(X)E(Y) = E(X)E(Y) − E(X)E(Y) = 0.

This implies ρ_{X,Y} = 0. Q.E.D.
It is important to note that the converse is not necessarily true, namely, ρ_{X,Y} = 0 does not necessarily imply that X and Y are independent. There are examples (see Exercise 1) where X and Y are not independent and yet ρ_{X,Y} = 0.
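The standard example of this phenomenon (the one in Exercise 1 below: X uniform on {−2, −1, 0, 1, 2} and Y = X^2) can be sketched as:

```python
# X uniform on {-2, -1, 0, 1, 2} and Y = X^2: Cov(X, Y) = 0 even though
# Y is a function of X.
support = [-2, -1, 0, 1, 2]
p = 1 / len(support)

EX = sum(x * p for x in support)          # = 0 by symmetry
EY = sum(x**2 * p for x in support)
EXY = sum(x * x**2 * p for x in support)  # E(XY) with Y = X^2; odd, so 0

cov = EXY - EX * EY
assert abs(cov) < 1e-12   # uncorrelated ...

# ... yet not independent: P[X = 2, Y = 4] = 1/5, while
# P[X = 2] P[Y = 4] = (1/5)(2/5).
assert abs(p - p * (2 * p)) > 1e-12
```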
THEOREM 4. If X and Y are non-constant random variables, then −1 ≤ ρ_{X,Y} ≤ 1. Further, ρ_{X,Y} = 1 if and only if Y = aX + b for some constants a > 0 and b, and ρ_{X,Y} = −1 if and only if Y = cX + d for some constants c < 0 and d.
Proof: In Theorem 1 (Schwarz's inequality) replace X and Y by X − E(X) and Y − E(Y) respectively, and one immediately obtains ρ_{X,Y}^2 ≤ 1, or −1 ≤ ρ_{X,Y} ≤ 1. By Theorem 1, ρ_{X,Y}^2 = 1 if and only if there are real numbers u and v, not both zero, such that

P[u(X − E(X)) + v(Y − E(Y)) = 0] = 1.

Since by hypothesis X and Y are non-constant, this last condition implies that both u and v are non-zero. Hence we may write Y = aX + b when ρ_{X,Y}^2 = 1. Now
Cov(X, Y) = E((X − E(X))(Y − E(Y))) = aE((X − E(X))^2) + bE(X − E(X)) = aVar(X).
Since Var(X) > 0, we may conclude
(i) ρ_{X,Y} = 1 if and only if Cov(X, Y) > 0, which is true if and only if a > 0, and

(ii) ρ_{X,Y} = −1 if and only if Cov(X, Y) < 0, which is true if and only if a < 0. Q.E.D.
In sample survey theory we shall need to know the formula for the correlation coefficient of two multinomially connected random variables.
LEMMA 2. If X_1, ..., X_r are random variables whose joint distribution is MN(n, p_1, ..., p_r), then (X_1, X_2) are MN(n, p_1, p_2) and X_1 is B(n, p_1).
Proof: In the definition of multinomial distribution in Section 2.3, we could consider the events B_0 = A_0 ∪ (∪_{i=3}^r A_i), B_1 = A_1 and B_2 = A_2. Then X_1 and X_2 are the number of times B_1 occurs in n trials and the number of times that B_2 occurs in n trials, respectively. Further B_0, B_1 and B_2 are disjoint, and one of them must occur. Thus, by definition (X_1, X_2) are MN(n, p_1, p_2). Also, X_1 is the number of times event A_1 occurs in n trials. The definition of Bernoulli trials is satisfied, and thus X_1 is B(n, p_1). Q.E.D.
THEOREM 5. If X_1, ..., X_r are MN(n, p_1, ..., p_r) (r ≥ 2), then Cov(X_1, X_2) = −np_1p_2, and

ρ_{X_1,X_2} = −√(p_1p_2 / ((1 − p_1)(1 − p_2))).
Proof: By Lemma 2, the random variables X_1, X_2 are MN(n, p_1, p_2). Now X_i denotes the number of times A_i occurs in n trials, i = 1, 2. Let C_i denote the event that A_1 occurs in the i-th trial, and let D_i denote the event that A_2 occurs in the i-th trial, 1 ≤ i ≤ n. Hence X_1 = Σ_{i=1}^n I_{C_i} and X_2 = Σ_{i=1}^n I_{D_i}, and

E(X_1X_2) = E((Σ_{i=1}^n I_{C_i})(Σ_{j=1}^n I_{D_j})).

For each i, C_i and D_i are disjoint, so I_{C_i} I_{D_i} = 0. For the n^2 − n pairs {(i, j) : i ≠ j}, since C_i and D_j are independent,

E(I_{C_i} I_{D_j}) = P(C_i ∩ D_j) = P(C_i)P(D_j) = p_1p_2.

Hence E(X_1X_2) = (n^2 − n)p_1p_2. Also, by Lemma 2, X_i is B(n, p_i), and hence E(X_i) = np_i. Thus, by the above and Theorem 3 of Section 3.1 we have

Cov(X_1, X_2) = E(X_1X_2) − E(X_1)E(X_2) = (n^2 − n)p_1p_2 − np_1 · np_2 = −np_1p_2,
which yields the first formula. By Theorem 6 of Section 3.2, Var(X_i) = np_i(1 − p_i), i = 1, 2, and thus

ρ_{X_1,X_2} = Cov(X_1, X_2) / √(Var(X_1)Var(X_2))
= −np_1p_2 / √(np_1(1 − p_1) · np_2(1 − p_2))
= −√(p_1p_2 / ((1 − p_1)(1 − p_2))).
Q.E.D.
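The covariance formula Cov(X_1, X_2) = −np_1p_2 can be confirmed by exhaustive enumeration of all sequences of n independent plays; a sketch (n and the probabilities are illustrative choices):

```python
from itertools import product

# Enumerate every sequence of n plays with outcomes A_0, A_1, A_2.
n = 4
probs = {0: 0.5, 1: 0.2, 2: 0.3}   # p_0, p_1, p_2

E_X1X2 = E_X1 = E_X2 = 0.0
for seq in product([0, 1, 2], repeat=n):
    pr = 1.0
    for outcome in seq:            # independent plays: multiply probabilities
        pr *= probs[outcome]
    x1, x2 = seq.count(1), seq.count(2)
    E_X1 += x1 * pr
    E_X2 += x2 * pr
    E_X1X2 += x1 * x2 * pr

cov = E_X1X2 - E_X1 * E_X2
assert abs(cov - (-n * 0.2 * 0.3)) < 1e-12   # Cov(X_1, X_2) = -n p_1 p_2
```

The negative sign reflects the intuition: in n fixed plays, the more often A_1 occurs, the less room remains for A_2.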
EXERCISES
1. If X is uniformly distributed over {−2, −1, 0, 1, 2}, and if Y = X^2, find the joint density of X, Y (draw a graph of it), and show that X, Y are not independent.
2. In Problem 1, compute ρ_{X,Y}.
3. Let X, Y be random variables with joint density given by
f_{X,Y}(0,2) = 1/6, f_{X,Y}(5,1) = 1/3, f_{X,Y}(10,0) = 1/2.

Compute ρ_{X,Y}.
4. If X and Y are random variables whose joint density is graphed below, compute E(X), E(Y), Cov(X, Y), Var(X), Var(Y), ρ_{X,Y} and Var(X + Y).
[Figure: graph of the joint density of X and Y, showing point masses 1/4, 1/6, 1/12, 1/5, 1/10 and 1/5.]
5. Let X and Y be random variables whose joint density is given by

f_{X,Y}(−5,−1) = 2/13, f_{X,Y}(−2,1) = 3/13, f_{X,Y}(1,3) = 5/13, f_{X,Y}(5,5) = 3/13.

Compute ρ_{X,Y}.
6. Determine whether the following polynomials have two distinct real zeros, one double real zero or no real zeros:
(i) 2x^2 − 3x − 1

(ii) x^2 + x + 1

(iii) x^2 + 25x + 625/4.
7. Prove: if Z is a random variable, and if Z ≥ 0, then E(Z) ≥ 0.
8. Prove: if X and Y are random variables, then Var(X + Y) = Var(X) + Var(Y) if and only if ρ_{X,Y} = 0.
9. If X and Y are random variables, and if Var(X+Y) = Var(X) + Var(Y), does this imply that X and Y are independent?
Chapter 4
Conditional Expectation
4.1 Definition and Properties
A frequent goal in statistical inference is to determine as accurately as possible the expectation of an observable random variable. Many times all one can do is observe the random variable and declare that this is the best one can do. However, on occasion one might be able to observe the conditional expectation of the random variable, given prior information. This has a tendency to be closer to the expectation that one wishes to ascertain. The Rao-Blackwell theorem at the end of this chapter renders these remarks more precise.
We shall define two kinds of conditional expectation. One conditional expectation of a random variable X, given a value y of another random variable Y, is a number which we denote by E(X|Y = y). Another conditional expectation, that of a random variable X, given a random variable Y, is a random variable E(X|Y) which assigns to each ω ∈ Ω the number E(X|Y)(ω). These two are interrelated and determine each other.
Definition. If X and Y are random variables, and if y ∈ range(Y), we define E(X|Y = y), the conditional expectation of X given Y = y, by

E(X|Y = y) = Σ_x x P([X = x] | [Y = y]).
The definition above has a corollary which is essentially the theorem of total probabilities extended to conditional expectation.
THEOREM 1. If X and Y are random variables, then E(X) = Σ_y E(X|Y = y) P[Y = y].
Proof: Using the identities X = Σ_x x I_{[X=x]} and 1 = Σ_y I_{[Y=y]}, we have

E(X) = E((Σ_x x I_{[X=x]})(Σ_y I_{[Y=y]}))
= Σ_y E(Σ_x x I_{[X=x]} I_{[Y=y]})
= Σ_y (Σ_x x P([X = x] ∩ [Y = y]))
= Σ_y (Σ_x x P([X = x] | [Y = y])) P[Y = y]
= Σ_y E(X|Y = y) P[Y = y].
Q.E.D.
Loosely speaking, then, if one knows the conditional expectation of X given any particular value of Y, and if one knows the distribution (i.e., density) of Y, then one is able to compute E(X).
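This recipe can be sketched in exact arithmetic on a small joint density (the density below is an arbitrary illustrative choice):

```python
from fractions import Fraction as F

# An arbitrary joint density on a few points (x, y).
joint = {(0, 0): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 6), (2, 1): F(1, 3)}

# Marginal density of Y.
pY = {}
for (x, y), pr in joint.items():
    pY[y] = pY.get(y, F(0)) + pr

def E_given(y):
    """E(X | Y = y) = sum of x P([X=x] | [Y=y]) over x."""
    return sum(x * pr for (x, yy), pr in joint.items() if yy == y) / pY[y]

# Theorem 1: E(X) = sum over y of E(X | Y = y) P[Y = y].
EX_direct = sum(x * pr for (x, y), pr in joint.items())
EX_total = sum(E_given(y) * pY[y] for y in pY)
assert EX_direct == EX_total
```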
Remark 1. If X and Y are random variables, and if g is a function defined over range(X), then

E(g(X)|Y = y) = Σ_x g(x) P([X = x] | [Y = y]).
Proof: By the definition of conditional expectation and by Theorem 4 in Section 2.2,

E(g(X)|Y = y) = Σ_z z P([g(X) = z] | [Y = y])
= Σ_z z P([g(X) = z] ∩ [Y = y]) / P[Y = y]
= Σ_z z Σ{P([X = x] | [Y = y]) : g(x) = z}
= Σ_x g(x) P([X = x] | [Y = y]).
Q.E.D.
LEMMA 1. If X and Y are random variables, and if y ∈ range(Y), then

E(X|Y = y) = (1 / P[Y = y]) E(X I_{[Y=y]}).
Proof: By the definition of conditional probability and the above definition,

E(X|Y = y) = Σ_x x P([X = x] ∩ [Y = y]) / P[Y = y]
= (1 / P[Y = y]) Σ_x x E(I_{[X=x]} I_{[Y=y]})
= (1 / P[Y = y]) E((Σ_x x I_{[X=x]}) I_{[Y=y]})
= (1 / P[Y = y]) E(X I_{[Y=y]}).
Q.E.D.
Example. In the joint density that is pictured below, one easily obtains E(X|Y = 0) = 3/4, E(X|Y = 1) = 4/3, and E(X|Y = 2) = 19/12. (See if you can work these out intuitively.)
[Figure: joint density with masses 1/28 and 3/28 at (x, y) = (0, 0) and (1, 0); masses 2/28, 4/28 and 6/28 at (0, 1), (1, 1) and (2, 1); and masses 5/28 and 7/28 at (1, 2) and (2, 2).]
THEOREM 2. If X, Y and Z are random variables, and if a and b are constants, then for every z ∈ range(Z),

E(aX + bY|Z = z) = aE(X|Z = z) + bE(Y|Z = z).
Proof: By Lemma 1 and properties of expectation we have

E(aX + bY|Z = z) = (1 / P[Z = z]) E((aX + bY) I_{[Z=z]})
= a(1 / P[Z = z]) E(X I_{[Z=z]}) + b(1 / P[Z = z]) E(Y I_{[Z=z]})
= aE(X|Z = z) + bE(Y|Z = z).
Q.E.D.
THEOREM 3. If X and Y are random variables, and if g is a function defined over range(X, Y), then for y ∈ range(Y),

E(g(X, Y)|Y = y) = E(g(X, y)|Y = y).
Proof: Again by Lemma 1,

E(g(X, Y)|Y = y) = (1 / P[Y = y]) E(g(X, Y) I_{[Y=y]})
= (1 / P[Y = y]) E(g(X, y) I_{[Y=y]})
= E(g(X, y)|Y = y).
Q.E.D.
THEOREM 4. If X and Y are independent random variables, if g is a function defined over range(X), and if y ∈ range(Y), then

E(g(X)|Y = y) = E(g(X)).
Proof: Remark 1 and the hypothesis of independence of X and Y imply
E(g(X)|Y = y) = Σ_x g(x) P([X = x] | [Y = y])
= Σ_x g(x) P[X = x] = E(g(X)).
Q.E.D.
We now define conditional expectation as a random variable.
Definition. If X and Y are random variables, the conditional expectation of X given Y is defined to be the random variable

E(X|Y) = Σ_y E(X|Y = y) I_{[Y=y]},

where the summation is taken over all y ∈ range(Y). (Note that E(X|Y) is a function of Y.)
THEOREM 5. If X,Y and Z are random variables, and if a and b are constants, then
E(aX + bY|Z) = aE(X|Z) + bE(Y|Z).
Proof: By the definition and Theorem 2,
E(aX + bY|Z) = Σ_z E(aX + bY|Z = z) I_[Z=z]
= Σ_z (aE(X|Z = z) + bE(Y|Z = z)) I_[Z=z]
= a Σ_z E(X|Z = z) I_[Z=z] + b Σ_z E(Y|Z = z) I_[Z=z]
= aE(X|Z) + bE(Y|Z).
Q.E.D.
Since E(X|Y) is a random variable, we shall be concerned about its expectation.
THEOREM 6. If X and Y are random variables, then E(E(X|Y)) = E(X).
Proof: By the definitions of expectation and conditional expectation, and by Theorem 1, we have
E(E(X|Y)) = E(Σ_y E(X|Y = y) I_[Y=y])
= Σ_y E(X|Y = y) P[Y = y] = E(X).
Q.E.D.
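Theorem 6 can be verified exactly on the example density of Section 4.1. An illustrative sketch (the density and the use of Python are ours, not the text's):

```python
from fractions import Fraction as F

# The example joint density of (X, Y) from Section 4.1.
density = {
    (0, 0): F(1, 28), (1, 0): F(3, 28),
    (0, 1): F(2, 28), (1, 1): F(4, 28), (2, 1): F(6, 28),
    (1, 2): F(5, 28), (2, 2): F(7, 28),
}

p_Y = {}
for (x, y), p in density.items():
    p_Y[y] = p_Y.get(y, 0) + p  # marginal density of Y

# E(X | Y = y) for each y, then E(E(X|Y)) = Σ_y E(X|Y = y) P[Y = y].
cond = {y: sum(x * p for (x, yy), p in density.items() if yy == y) / p_Y[y]
        for y in p_Y}
lhs = sum(cond[y] * p_Y[y] for y in p_Y)
rhs = sum(x * p for (x, y), p in density.items())  # E(X) computed directly
assert lhs == rhs  # Theorem 6: E(E(X|Y)) = E(X)
print(lhs)  # → 19/14
```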
THEOREM 7. If X and Y are random variables, and if g is a function over range(X), then
E(g(X)Y|X) = g(X)E(Y|X).
Proof: Using the definitions and the properties already proved of expectation and conditional expectation, we have
E(g(X)Y|X) = Σ_x E(g(X)Y|X = x) I_[X=x]
= Σ_x E(g(x)Y|X = x) I_[X=x]
= Σ_x g(x) E(Y|X = x) I_[X=x]
= Σ_x g(X) E(Y|X = x) I_[X=x]
= g(X) Σ_x E(Y|X = x) I_[X=x]
= g(X)E(Y|X). Q.E.D.
COROLLARY TO THEOREM 7. If X is a random variable, and if g is a function defined over range(X), then E(g(X)|X) = g(X).
Proof: This follows by taking Y = 1 and noting that E(Y|X) = 1. Q.E.D.
EXERCISES
1. Prove: if X and Y are random variables, if a and b are constants, and if Y = aX + b, then E(Y|X) = Y.
2. If f_{X,Y}(x, y) is as displayed graphically below, compute E(Y|X = 0), E(Y|X = 1), E(Y|X = 2), E(Y|X = 3) and E(Y).
        X = 0   X = 1   X = 2   X = 3
Y = 3    1/2     -       -       -
Y = 2    1/8    1/8      -       -
Y = 1    1/24   1/24    1/24     -
Y = 0    1/32   1/32    1/32    1/32
3. In Problem 2, compute E(Y²|X = 2).
4. In Problem 2, find the density of the random variable E(Y|X).
5. In Problem 2, compute E(X²|X + Y = 2).
6. Prove: if Y = c, where c is some constant, and if X is a random variable, then E(Y\X) = c.
7. Prove: if X is a random variable, then X I_[X=c] = c I_[X=c].
8. Prove: if X is a random variable, then
X = Σ_x x I_[X=x].
4.2 Conditional Variance
Loosely speaking, the conditional expectation of a random variable given another replaces more evenly spread probability masses by more concentrated point masses but leaves the expectation the same. A happy consequence is that the variance is decreased, which is something much to be desired in sample survey theory. We shall see just how that happens in this section. The notion of conditional variance is a principal tool used in multi-stage sampling, and thus what is about to unfold is of utmost importance in sample survey theory.
Definition. If U, V and W are random variables, then the conditional covariance of U, V given W = w, Cov(U, V|W = w), is defined by
Cov(U, V|W = w) = E(UV|W = w) − E(U|W = w)E(V|W = w).
An equivalent definition of Cov(U, V|W = w) is given by the following theorem.
THEOREM 1. If U, V, W are random variables, then
Cov(U, V|W = w) = E((U − E(U|W = w))(V − E(V|W = w))|W = w).
Proof: By Theorem 2 in Section 4.1, the right hand side of the above equation becomes
E(UV − U E(V|W = w) − E(U|W = w)V + E(U|W = w)E(V|W = w)|W = w)
= E(UV|W = w) − E(U|W = w)E(V|W = w)
= Cov(U, V|W = w).
Q.E.D.
Definition. If X and W are random variables, the conditional variance of X given W = w, Var(X|W = w), is defined by Var(X|W = w) = Cov(X, X|W = w).
THEOREM 2. If X and W are random variables, then
Var(X|W = w) = E((X − E(X|W = w))²|W = w).
Proof: This is a direct consequence of the definition of conditional variance and of Theorem 1. Q.E.D.
Corollary to Theorem 2. If X and W are random variables, then
Var(X|W = w) ≥ 0.
Proof: By Theorem 2, using the definition of conditional expectation and the facts that I_[W=w]² = I_[W=w] and E(Y²) ≥ 0 for any random variable Y, we have
Var(X|W = w) = E((X − E(X|W = w))²|W = w)
= (1/P[W = w]) E(((X − E(X|W = w)) I_[W=w])²) ≥ 0.
Q.E.D.
Conditional variance has much the same properties as does variance, plus: any function of the conditioning random variable behaves very much like a constant.
THEOREM 3. If X and W are random variables, and if c is a constant, then Var(c + X|W = w) = Var(X|W = w) and Var(cX|W = w) = c²Var(X|W = w).
Proof: Since E(c + X|W = w) = c + E(X|W = w), we apply Theorem 2 to obtain the first conclusion. Also, since E(cX|W = w) = cE(X|W = w), we again apply Theorem 2 to obtain the second equation. Q.E.D.
THEOREM 4. If X, Y and W are random variables, and if X is any function of W, say, X = f(W), then Var(X + Y|W = w) = Var(Y|W = w), and Var(XY|W = w) = (f(w))²Var(Y|W = w).
Proof: By Theorem 3 in Section 4.1,
E(f(W) + Y|W = w) = E(f(w) + Y|W = w) = f(w) + E(Y|W = w),
and
E((f(W) + Y − E(f(W) + Y|W = w))²|W = w) = E((Y − E(Y|W = w))²|W = w) = Var(Y|W = w).
Also
E(f(W)Y|W = w) = f(w)E(Y|W = w), and
Var(f(W)Y|W = w) = E((f(W)Y − f(w)E(Y|W = w))²|W = w) = (f(w))²Var(Y|W = w).
Q.E.D.
A result to be used frequently in multi-stage sampling is the following.
THEOREM 4A. If X and Y are random variables, and if f(x, y) is any function of two variables, then
Var(f(X, Y)|Y = y) = Var(f(X, y)|Y = y)
for all y ∈ range(Y).
Proof: Using the definition of conditional variance and Theorem 3 of Section 4.1, we have
Var(f(X, Y)|Y = y) = E((f(X, Y))²|Y = y) − (E(f(X, Y)|Y = y))²
= E((f(X, y))²|Y = y) − (E(f(X, y)|Y = y))²
= Var(f(X, y)|Y = y)
for all y ∈ range(Y). Q.E.D.
Definition. If U, V and H are random variables, then the conditional covariance of U and V, given H, is a random variable defined by
Cov(U, V|H) = E(UV|H) − E(U|H)E(V|H).
An immediate corollary of this definition is the following result.
THEOREM 5. If U, V and H are random variables, then
Cov(U, V|H) = Σ_h Cov(U, V|H = h) I_[H=h].
Proof: If h′ ≠ h″, then easily I_[H=h′] I_[H=h″] = 0. Thus,
E(U|H)E(V|H) = (Σ_{h′} E(U|H = h′) I_[H=h′])(Σ_{h″} E(V|H = h″) I_[H=h″])
= Σ_h E(U|H = h)E(V|H = h) I_[H=h].
Also by the definition,
E(UV|H) = Σ_h E(UV|H = h) I_[H=h].
Thus, by the definition above,
Cov(U, V|H) = Σ_h (E(UV|H = h) − E(U|H = h)E(V|H = h)) I_[H=h]
= Σ_h Cov(U, V|H = h) I_[H=h].
Q.E.D.
Analogous to Theorem 1 is the following result for conditional covariance given a random variable.
THEOREM 6. If U, V and H are random variables, then Cov(U, V|H) = E((U − E(U|H))(V − E(V|H))|H).
Proof: Remembering that E(X|Y) is a function of Y, then by Theorems 5 and 7 of Section 4.1 we have
E((U − E(U|H))(V − E(V|H))|H)
= E(UV − E(U|H)V − E(V|H)U + E(U|H)E(V|H)|H)
= E(UV|H) − E(E(U|H)V|H) − E(E(V|H)U|H) + E(E(U|H)E(V|H)|H)
= E(UV|H) − E(U|H)E(V|H) − E(V|H)E(U|H) + E(U|H)E(V|H)
= E(UV|H) − E(U|H)E(V|H) = Cov(U, V|H).
Q.E.D.
The fundamental theorem of this section is the following.
THEOREM 7. If U, V and H are random variables, then Cov(U, V) = E(Cov(U, V|H)) + Cov(E(U|H), E(V|H)).
Proof: By Theorem 6 of Section 4.1, E(UV) = E(E(UV|H)), E(U) = E(E(U|H)) and E(V) = E(E(V|H)). Thus
Cov(U, V) = E(UV) − E(U)E(V)
= E(E(UV|H)) − E(E(U|H))E(E(V|H))
= E(E(UV|H)) − E(E(U|H)E(V|H)) + E(E(U|H)E(V|H)) − E(E(U|H))E(E(V|H))
= E(Cov(U, V|H)) + Cov(E(U|H), E(V|H)). Q.E.D.
Definition. If X and H are random variables, then the conditional variance of X given H is the random variable defined by Var(X|H) = Cov(X, X|H).
THEOREM 8. If X and H are random variables then
(i) Var(X|H) = Σ_h Var(X|H = h) I_[H=h],
(ii) Var(X|H) = E((X − E(X|H))²|H), and
(iii) Var(X) = E(Var(X|H)) + Var(E(X|H)).
Proof: These three results are special cases of Theorems 5,6 and 7. Q.E.D.
Conclusion (iii) in Theorem 8 is applied again and again in multistage methods in sample survey theory. The following theorem should be given here since it is an immediate corollary to Theorem 8 and is widely used in mathematical statistics.
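Conclusion (iii) can be checked exactly on the example density of Section 4.1, with H = Y. An illustrative sketch (the helper names are ours):

```python
from fractions import Fraction as F

# The example joint density of (X, Y) from Section 4.1.
density = {
    (0, 0): F(1, 28), (1, 0): F(3, 28),
    (0, 1): F(2, 28), (1, 1): F(4, 28), (2, 1): F(6, 28),
    (1, 2): F(5, 28), (2, 2): F(7, 28),
}

p_Y = {}
for (x, y), p in density.items():
    p_Y[y] = p_Y.get(y, 0) + p  # marginal density of Y

def cond_E(g, y):  # E(g(X) | Y = y)
    return sum(g(x) * p for (x, yy), p in density.items() if yy == y) / p_Y[y]

E_X = sum(x * p for (x, y), p in density.items())
Var_X = sum(x * x * p for (x, y), p in density.items()) - E_X ** 2

cond_exp = {y: cond_E(lambda x: x, y) for y in p_Y}
cond_var = {y: cond_E(lambda x: x * x, y) - cond_exp[y] ** 2 for y in p_Y}

E_cond_var = sum(cond_var[y] * p_Y[y] for y in p_Y)                    # E(Var(X|Y))
Var_cond_exp = sum(cond_exp[y] ** 2 * p_Y[y] for y in p_Y) - E_X ** 2  # Var(E(X|Y))

assert Var_X == E_cond_var + Var_cond_exp  # Var(X) = E(Var(X|Y)) + Var(E(X|Y))
print(Var_X, E_cond_var, Var_cond_exp)  # → 87/196 31/84 11/147
```

Both summands are non-negative, which is exactly the observation behind the Rao-Blackwell theorem that follows.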
THEOREM 9. (Rao-Blackwell Theorem). If X and Y are random variables then
Var(X) ≥ Var(E(X|Y)).
Proof: Since Var(X|Y = y) ≥ 0, it follows that Var(X|Y) ≥ 0 and thus E(Var(X|Y)) ≥ 0. The conclusion follows now from Theorem 8. Q.E.D.
The following extension of Theorem 4 is a standard tool used in sample survey theory.
THEOREM 10. If X, Y and Z are random variables, and if X is a function of Z, then
Var(X + Y|Z) = Var(Y|Z)
and Var(XY|Z) = X²Var(Y|Z).
Proof: Suppose X = f(Z). Then by Theorems 4 and 8,
Var(X + Y|Z) = Σ_z Var(f(Z) + Y|Z = z) I_[Z=z]
= Σ_z Var(Y|Z = z) I_[Z=z] = Var(Y|Z),
and
Var(XY|Z) = Σ_z Var(f(Z)Y|Z = z) I_[Z=z]
= Σ_z (f(z))²Var(Y|Z = z) I_[Z=z]
= Σ_z (f(Z))²Var(Y|Z = z) I_[Z=z]
= (f(Z))² Σ_z Var(Y|Z = z) I_[Z=z]
= X²Var(Y|Z).
Q.E.D.
Finally, an indispensable tool in certain parts of sample survey theory is that of conditional independence.
Definition. If U_1, ..., U_n, Z are random variables, then U_1, ..., U_n are said to be conditionally independent given Z (or with respect to Z) if
P(∩_{i=1}^n [U_i = u_i] | [Z = z]) = Π_{i=1}^n P([U_i = u_i]|[Z = z])
for all u_i ∈ range(U_i), 1 ≤ i ≤ n, and all z ∈ range(Z).
Conditional variance relates to conditional independence in just the same way that variance relates to independence.
THEOREM 11. If U_1, ..., U_n, Z are random variables, and if U_1, ..., U_n are conditionally independent given Z, then
Var(U_1 + ··· + U_n|Z = z) = Σ_{i=1}^n Var(U_i|Z = z)
for all z ∈ range(Z), and
Var(Σ_{i=1}^n U_i|Z) = Σ_{i=1}^n Var(U_i|Z).
Proof: We first note that for n ≥ 2, if i ≠ j, then Ui and Uj are conditionally independent given Z. Thus
E(U_iU_j|Z = z) = Σ_{u,v} uv P([U_i = u][U_j = v]|[Z = z])
= Σ_{u,v} uv P([U_i = u]|[Z = z]) P([U_j = v]|[Z = z])
= (Σ_u u P([U_i = u]|[Z = z]))(Σ_v v P([U_j = v]|[Z = z]))
= E(U_i|Z = z)E(U_j|Z = z).
Using this we obtain, for i ≠ j,
Cov(U_i, U_j|Z = z) = 0.
Thus
Var(U_1 + ··· + U_n|Z = z) = Σ_{i=1}^n Var(U_i|Z = z) + Σ_{i≠j} Cov(U_i, U_j|Z = z)
= Σ_{i=1}^n Var(U_i|Z = z),
thus establishing the first conclusion. The second conclusion follows from the first by multiplying both sides by I_[Z=z] and summing over all z ∈ range(Z). Q.E.D.
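Theorem 11 can be checked by brute-force enumeration on a small invented joint density of (U1, U2, Z): given Z = z, U1 and U2 are independent, each equal to 0 or z with probability 1/2. This example and the helper names are ours, not the text's:

```python
from fractions import Fraction as F
from itertools import product

# Joint density of (U1, U2, Z): P(Z = z) = 1/2 for z in {1, 2}, and given
# Z = z the coordinates U1, U2 are independent, each 0 or z with prob. 1/2.
density = {}
for z, u1, u2 in product((1, 2), (0, 1), (0, 1)):
    density[(u1 * z, u2 * z, z)] = F(1, 8)

def cond_var(value, z):
    """Var(value(key) | Z = z) by direct enumeration over the density."""
    pts = [(k, p) for k, p in density.items() if k[2] == z]
    pz = sum(p for _, p in pts)
    m1 = sum(value(k) * p for k, p in pts) / pz
    m2 = sum(value(k) ** 2 * p for k, p in pts) / pz
    return m2 - m1 ** 2

for z in (1, 2):
    lhs = cond_var(lambda k: k[0] + k[1], z)                          # Var(U1+U2|Z=z)
    rhs = cond_var(lambda k: k[0], z) + cond_var(lambda k: k[1], z)   # sum of variances
    assert lhs == rhs  # Theorem 11
print("conditional variances add for z = 1, 2")
```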
EXERCISES
1. Prove: If £7, V and W are random variables, and if W is a function of V, then
E(U\W) = E(E{U\V)\W).
2. Let X, Y be random variables whose joint density is given by the graph below. Compute (i) Var(E(X|Y)), (ii) the density of Var(X|Y) and (iii) E(Var(X|Y)).
[Figure: joint density of X and Y with seven points, each of mass 1/7 — one point on the line Y = 3, three on Y = 2, and three on Y = 1, with X-values among 0, 1, 2, 3.]
3. Prove: If X and Y are random variables, then
(E(X|Y))² = Σ_y (E(X|Y = y))² I_[Y=y].
4. Prove: If X, Y and H are random variables, and if X and Y are conditionally independent given H, then
E(XY|H) = E(X|H)E(Y|H).
5. Prove: If X, Y and Z are random variables, and if {X, Y} and Z are independent, then E(X|Y, Z) = E(X|Y).
6. Let X and Y be random variables with joint density given by
f_{X,Y}(3, 6) = f_{X,Y}(4, 6) = f_{X,Y}(5, 6) = f_{X,Y}(−2, 0) = f_{X,Y}(−4, 0) = f_{X,Y}(−6, 0) = 1/6.
(i) Compute Var(X|Y = 6) and Var(X|Y = 0).
(ii) Find the density of Var(X|Y).
(iii) Compute E(X|Y = 6) and E(X|Y = 0).
(iv) Find the density of E(X|Y).
(v) Compute E(Var(X|Y)) and Var(E(X|Y)).
(vi) Verify for this example that
Var(X) = Var(E(X|Y)) + E(Var(X|Y)).
Chapter 5
Limit Theorems
5.1 The Law of Large Numbers
Two limit theorems, known as the law of large numbers and the central limit theorem, occupy key positions in statistical inference. The law of large numbers provides a method of estimating certain unknown constants. The central limit theorem, among its many uses, gives us a means of determining how accurate these estimates are. This section is devoted to a most accessible law of large numbers.
LEMMA 1. (Chebishev's Inequality). If X is a random variable, then for every ε > 0,
P([|X − E(X)| > ε]) ≤ Var(X)/ε².
Proof: We easily observe that
E(X²) = Σ_x x² P[X = x]
≥ Σ_{x: |x| > ε} x² P[X = x]
≥ ε² Σ{ P[X = x] : |x| > ε } = ε² P[|X| > ε].
Thus, P[|X| > ε] ≤ E(X²)/ε². Now, since this inequality holds for every random variable X, replace X by X − E(X) to obtain
P[|X − E(X)| > ε] ≤ Var(X)/ε².
Q.E.D.
Chebishev's inequality gives loose confidence intervals for the expectation of an observable random variable when one knows its variance. Namely, if one wishes to find an interval (X − ε, X + ε) for the expectation of an observable random variable X when one knows its variance, one uses the following equivalent form of Chebishev's inequality.
THEOREM 1. If X is a random variable and if ε > 0, then
P[X − ε ≤ E(X) ≤ X + ε] ≥ 1 − Var(X)/ε².
Proof: By Chebishev's inequality,
1 − P[|X − E(X)| > ε] ≥ 1 − Var(X)/ε².
But
1 − P[|X − E(X)| > ε] = P[|X − E(X)| ≤ ε]
= P[−ε ≤ X − E(X) ≤ ε]
= P[X − ε ≤ E(X) ≤ X + ε].
Substituting this into the above inequality yields the theorem. Q.E.D.
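Chebishev's inequality can be compared with an exactly computable probability. A sketch using a fair die (our choice of example, not the text's): here X is uniform on {1, ..., 6}, so E(X) = 7/2 and Var(X) = 35/12.

```python
from fractions import Fraction as F

# X uniform on {1, ..., 6}; compare P[|X - E(X)| > eps] with Var(X)/eps^2.
density = {x: F(1, 6) for x in range(1, 7)}
EX = sum(x * p for x, p in density.items())
VarX = sum(x * x * p for x, p in density.items()) - EX ** 2

eps = F(2)
exact = sum(p for x, p in density.items() if abs(x - EX) > eps)
bound = VarX / eps ** 2
assert exact <= bound  # Chebishev's inequality
print(exact, bound)  # → 1/3 35/48
```

The bound 35/48 is well above the true probability 1/3, illustrating why the text calls the resulting confidence intervals "loose."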
THEOREM 2. (The Law of Large Numbers) Let X_1, ..., X_n be independent random variables with the same density, i.e., a sample of size n when sampling is done with replacement. Then for every ε > 0,
lim_{n→∞} P([|(X_1 + ··· + X_n)/n − E(X_1)| > ε]) = 0.
Proof: Let X̄_n = (X_1 + ··· + X_n)/n. Then E(X̄_n) = (1/n)E(X_1 + ··· + X_n) = (1/n)·nE(X_1) = E(X_1), and, by Theorem 5 of Section 3.2,
Var(X̄_n) = (1/n²)Var(Σ_{i=1}^n X_i) = (1/n²) Σ_{i=1}^n Var(X_i)
= (1/n²)·nVar(X_1) = (1/n)Var(X_1).
Now by Chebishev's inequality,
0 ≤ P[|X̄_n − E(X_1)| > ε] ≤ Var(X̄_n)/ε² = Var(X_1)/(nε²) → 0 as n → ∞.
Thus P[|X̄_n − E(X_1)| > ε] → 0 as n → ∞. Q.E.D.
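The theorem can be illustrated by simulation (a Monte Carlo sketch of ours, not from the text): sample means of fair-die rolls concentrate around E(X_1) = 3.5 as n grows.

```python
import random

random.seed(1)

def sample_mean(n):
    """Mean of n independent fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

eps = 0.1
for n in (10, 100, 2000):
    # estimate P[|(X_1 + ... + X_n)/n - 3.5| > eps] from 500 replications
    hits = sum(abs(sample_mean(n) - 3.5) > eps for _ in range(500))
    print(n, hits / 500)
```

The printed frequencies fall toward 0 as n increases, in line with the limit in Theorem 2.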
The law of large numbers is popularly known as the "law of averages". Our next theorem provides a rigorous justification for the first approach to probability described in Section 1.1.
THEOREM 3. (Bernoulli's theorem) In a sequence of Bernoulli trials involving the outcome S, possible at each trial, whose probability is p, if S_n denotes the number of times S occurs in the first n trials, then
lim_{n→∞} P([|S_n/n − p| > ε]) = 0
for every ε > 0.
Proof: Let A_i denote the event that S occurs in the ith trial. Then I_{A_1}, ..., I_{A_n} are independent random variables, all with the same density, namely
f_{I_{A_i}}(x) = p if x = 1, and 1 − p if x = 0.
Their common expectation is E(I_{A_i}) = p. We observe that
S_n = I_{A_1} + ··· + I_{A_n}.
Hence, by the Law of Large Numbers,
P[|S_n/n − p| > ε] → 0 as n → ∞
for every ε > 0. Q.E.D.
EXERCISES
1. Prove: if a_1, ..., a_n, b_1, ..., b_n are positive numbers, and if a_j > ε > 0 for 1 ≤ j ≤ n, then Σ_{j=1}^n a_j b_j > ε Σ_{j=1}^n b_j.
2. Prove: if X is a random variable, then
(i) Σ_{x: x² > ε} x² f_X(x) ≥ ε P[X² > ε], and
(ii) Σ_{x: |x| > ε} x² f_X(x) ≥ ε² P[|X| > ε].
3. If Y is a random variable, and if Var(Y) = 1, find a value of ε > 0 such that
P([E(Y) ∈ (Y − ε, Y + ε)]) ≥ .95.
4. Prove: if A_1, ..., A_n are independent events with the same probability p, then I_{A_1}, ..., I_{A_n} are independent random variables with the same density.
5.2 The Central Limit Theorem
The central limit theorem is of utmost importance in statistical inference. Its proof is fairly advanced and is the only theorem in this text whose proof will not be given. It is proved in more advanced courses.
THEOREM 1. (Central Limit Theorem) If X_1, ..., X_n satisfy the hypothesis of the Law of Large Numbers, and if we denote S_n = Σ_{j=1}^n X_j and σ² = Var(X_1), then, for every real number x,
lim_{n→∞} P[(S_n − E(S_n))/(σ√n) ≤ x] = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.
The integral in Theorem 1 cannot be evaluated in closed form. However, it is tabulated and appears in standard statistical tables. Let us denote
Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.
We should point out that Φ(∞) = 1. In order to prove this, we shall prove that
∫_{−∞}^∞ e^{−t²/2} dt = √(2π).
We do this by writing the square of the left-hand side as
(∫_{−∞}^∞ e^{−u²/2} du)(∫_{−∞}^∞ e^{−v²/2} dv),
then writing this product as a double integral
∫_{−∞}^∞ ∫_{−∞}^∞ e^{−(u²+v²)/2} du dv,
then changing to polar coordinates via the change of variables u = r cos θ, v = r sin θ (and replacing du dv by r dr dθ) to obtain
∫_0^{2π} ∫_0^∞ e^{−r²/2} r dr dθ = 2π.
Clearly Φ(−∞) = 0, and since the integrand is positive, it follows that Φ is non-decreasing, i.e., if −∞ < x_1 < x_2 < ∞, then
0 ≤ Φ(x_1) ≤ Φ(x_2) ≤ 1.
What is more, the integrand of Φ is an even function, so that 1 − Φ(x) = Φ(−x) for all x > 0.
Values of Φ are given in the Appendix of this book. The function Φ is called the normal distribution. An observation should be made about limit theorems such as, for example, the central limit theorem. A limit theorem helps one in an approximation problem, as we now illustrate. Suppose one plans on taking 100 observations on a population by sampling with replacement. The observations then become independent random variables X_1, ..., X_100. We consider the problem: given E(X_i) = 10 and Var(X_i) = 9 for 1 ≤ i ≤ 100, to find an approximation of the probability
P[S_100 ≤ 1,038.46]
where S_100 = X_1 + ··· + X_100. We first observe that E(S_100) = Σ_{j=1}^100 E(X_j) = 1,000 and Var(S_100) = Σ_{j=1}^100 Var(X_j) = 900. Hence
P[S_100 ≤ 1,038.46] = P[(S_100 − E(S_100))/√Var(S_100) ≤ (1,038.46 − 1,000)/√900]
= P[(S_100 − E(S_100))/√Var(S_100) ≤ 1.282],
which by the central limit theorem is approximated by Φ(1.282). The table of the normal distribution in the Appendix of this book yields Φ(1.282) = .9001, and thus an approximate value of P[S_100 ≤ 1,038.46] is .9001.
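This arithmetic can be reproduced without the table. A sketch using the standard identity Φ(x) = (1 + erf(x/√2))/2, which relates Φ to the error function (the identity is standard mathematics, not something stated in the text):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal distribution function Φ."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Standardize as in the worked example: (1,038.46 - E(S100)) / sqrt(Var(S100)).
z = (1038.46 - 1000) / sqrt(900)
print(round(z, 3), round(Phi(z), 4))  # → 1.282 0.9001
```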
In problems connected with proportions in sample survey theory, the following special case of the central limit theorem will be of use.
THEOREM 2. (Laplace-DeMoivre theorem) If S_n denotes the number of times S occurs in n Bernoulli trials where P(S) = p, then
P[(S_n − np)/√(np(1 − p)) ≤ x] → Φ(x) as n → ∞.
Proof: Let A_i denote the event that S occurs in the ith trial. Then I_{A_1}, ..., I_{A_n} are independent random variables with the same density
f_{I_{A_i}}(x) = p if x = 1, 1 − p if x = 0, and 0 if x ∉ {0, 1}.
Clearly E(I_{A_i}) = p, Var(I_{A_i}) = p(1 − p), and S_n = I_{A_1} + ··· + I_{A_n}. Thus, as noted before, E(S_n) = np and Var(S_n) = np(1 − p). Applying the central limit theorem we obtain the conclusion of the theorem. Q.E.D.
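The quality of this approximation can be checked against the exact binomial probability; the particular n, p and k below are illustrative choices of ours:

```python
from math import comb, erf, sqrt

def Phi(x):  # standard normal distribution function via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Laplace-DeMoivre: P[Sn <= k] is approximated by Φ((k - np)/sqrt(np(1-p))).
n, p, k = 100, 0.5, 55
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))
approx = Phi((k - n * p) / sqrt(n * p * (1 - p)))
print(round(exact, 4), round(approx, 4))
```

For moderate n the uncorrected approximation is a little off (here roughly 0.86 exact versus 0.84 approximate); replacing k by k + 1/2, the usual continuity correction, tightens it considerably.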
EXERCISES
1. If T_64 denotes the number of times a head turns up in 64 tosses of an unbiased coin, find an approximation of P[T_64 ≤ 35].
2. An unbiased die is tossed 100 times. Let S_100 denote the sum of the 100 faces that turn up. Find an approximate value of P[S_100 ≤ 330].
3. It costs two dollars to enter a game. A coin is tossed. If it comes up heads, you win $3.90. Assuming the coin to be fair, find the probability that you will suffer no total loss at the end of 20 plays.
Chapter 6
Simple Random Sampling
6.1 The Model
Sample survey theory deals with populations of units. The population under consideration might be a set of people, in which case each person in this set is a unit. At other times the population might be the set of family farms in the state of Kansas, in which case each farm is a unit. We shall always denote the population, i.e., the set of units, by U. The letter N will always denote the number of units in a population. Usually, the value of N is known; there are cases where it is not. The individual units are denoted by U_1, U_2, ..., U_N, and
U : U_1 U_2 ··· U_N
will denote the population and its individual units. Associated with each population that we shall deal with in this
course is a function y that assigns to each unit Ui a number YJ, i.e., y(Ui) = Yi, 1 < i < N. This function, in the case of a population of people, might mean that Y{ is the total income of person Ui during the previous year. In case U is the population of family farms in the state of Kansas, Y{ might denote the size in acres of the individual farm £/*. In practice, one usually has a list of the units (called a frame) but does not know any of the Y 's. One can determine Yi only if one observes (in some sense) the unit Ui directly. The population U and the numerical
characteristic y are sometimes displayed completely as
U : U_1 U_2 ··· U_N
y : Y_1 Y_2 ··· Y_N.
Given such a population as that above and the limitation on determining the values {Y_i, 1 ≤ i ≤ N}, the problem before us is to determine the sum of all the Y_i's, i.e., to determine Y, where Y is defined by
Y = Σ_{i=1}^N Y_i,
or to determine Ȳ, the average of the Y_i's, i.e.,
Ȳ = (1/N) Σ_{i=1}^N Y_i = Y/N.
In the first example given above, Y denotes the total annual income of the particular population. In the second example, Y denotes the total acreage of all family farms in Kansas. This problem of determining the value of Y is completely and most satisfactorily solved when it is possible to observe each unit U_i and measure or determine the value Y_i. This is what happens in a complete census.
If a complete census is impossible, then one has to estimate Y based on some selection of a few of the units and on the determination of the values of y for these units. This is what the theory of sample surveys is about. An element of randomness is introduced in the selection of these few units, and from the y-values obtained one uses some formula which depends on the randomizing procedure to compute an estimate Ŷ of Y. The quantity Ŷ is an observable random variable. The problems that will beset us are these. We shall want the distribution of Ŷ to be centered (in some sense) around the unknown number Y. Usually, this is done by requiring that its expectation be Y, i.e., E(Ŷ) = Y. Among procedures which will do this for us, we shall seek a procedure such that "most" values of Ŷ are "close" to Y. This is usually done by requiring a procedure for which the variance of Ŷ, Var(Ŷ), is small. Finally, we shall wish to have a formula for estimating the maximum error we can make in reporting Ŷ to be equal to the unknown Y. There
are yet other problems that might arise, say, that of minimizing cost or designing the randomizing procedure to keep the entire survey within a certain cost limitation.
Before beginning our study of these randomization procedures we should make sure that we know how to make use of a random number generator, which is a program in a hand-held calculator or computer which generates random numbers. In actual fact, these programs generate what are called pseudo-random numbers. In practice we act or pretend that they are truly random and are obtained as follows. Consider a bowl with ten tags in it, numbered 0, 1, 2, ..., 9. Let X_1, ..., X_k denote a simple random sample taken with replacement. This means: take a tag at random, let X_1 denote the number observed on the tag, replace it, and repeat this process k − 1 more times. Clearly, the random variables X_1, ..., X_k are independent, and each is uniformly distributed over {0, 1, ..., 9}, i.e., the density of each X_i is
f(x) = 1/10 if x = 0, 1, ..., 9, and 0 otherwise.
Then the random number obtained is
X_1/10 + X_2/10² + ··· + X_k/10^k.
This number appears in decimal form as .X_1X_2···X_k just as a constant number (2/10) + (5/100) + (4/1000) appears as .254. If k = 3, then the above procedure yields a number selected at random from {.000, .001, ..., .998, .999}. The probability then of selecting any particular number, such as .682, is .001, and the probability of selecting any number less than .682 is .682. (Don't forget .000.) In general, if one obtains a random number with k decimal places, then the probability of selecting any fixed constant x, expressed in decimal form as
x = .x_1x_2···x_k,
is 1/10^k, and the probability of selecting a number less than x is the very same
x = .x_1x_2···x_k.
The above shows how to select an n-digit random number in the unit interval [0,1).
Now, suppose we have N distinct units which one can number from 1 to N. The problem we now address is how to select a number at random from the set {1, 2, ..., N}. If N is a power of 10, then the above procedure provides us with an answer. For example, if N = 10⁴, and if the random number generator produced a random number .X_1X_2···X_8, we might decide (ahead of time) always to select the first four digits X_1X_2X_3X_4, with the outcome where the first four digits are zeros to be designated 10,000. However, if the random number generator on one's calculator only produces random numbers to three decimal places, then one needs only to take two random numbers .X_1X_2X_3 and .Y_1Y_2Y_3 and conclude that the number selected at random from {1, 2, ..., 10⁴} is X_1X_2X_3Y_1 if at least one of the four digits is not zero, and it is 10⁴
if all four digits are zeros. The problem we face is how to select a number at random from {1, 2, ..., N} by using a random number generator which generates a random number in [0, 1) to n decimal places. We shall show, with proper reservations, that the answer is to select an n-digit random number X^(n), multiply it by N, take the largest integer equal to or less than X^(n)N and add 1 to it, i.e., select
[X^(n)N] + 1, where [x] means the largest integer ≤ x.
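A sketch of the selection rule [X^(n)N] + 1. Forming the n digits as an integer D = X_1X_2···X_n (so X^(n) = D/10^n) lets [NX^(n)] be computed in exact integer arithmetic as (N·D) // 10^n, avoiding floating-point edge cases; the function names below are ours, not the text's:

```python
import random

def random_integer(N, n, rng):
    """A draw from {1, ..., N} via an n-digit random number: [N·X^(n)] + 1."""
    # D = X1 X2 ... Xn as an integer, so X^(n) = D / 10^n.
    D = sum(rng.randint(0, 9) * 10 ** (n - 1 - j) for j in range(n))
    return (N * D) // 10 ** n + 1

rng = random.Random(7)
draws = [random_integer(10, 8, rng) for _ in range(10_000)]
# Every value 1..10 should occur with relative frequency near 1/10.
print(sorted(set(draws)))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

When N divides 10^n exactly (as here), the draw is exactly uniform; Propositions 1 and 2 below cover the general case in the limit n → ∞.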
PROPOSITION 1. If c = .c_1c_2··· is a number (in decimal form) in [0, 1), and if X^(n) is an n-digit random number in [0, 1), then P[X^(n) < c] → c as n → ∞.
Proof: It is easy to verify that
[X^(n) < .c_1c_2···c_n] ⊂ [X^(n) < c] ⊂ [X^(n) < .c_1···c_n + 1/10^n].
Taking probabilities of the three events we have
.c_1···c_n ≤ P[X^(n) < c] ≤ .c_1···c_n + 1/10^n.
Now 0 ≤ c − .c_1···c_n ≤ 1/10^n for every n, which implies that .c_1···c_n → c as n → ∞. Taking limits in the above displayed inequality as n → ∞ yields c ≤ lim_{n→∞} P[X^(n) < c] ≤ c, which yields the conclusion. Q.E.D.
PROPOSITION 2. If N is a positive integer, if 1 ≤ k ≤ N, and if Y^(n) = [NX^(n)] + 1, then P[Y^(n) = k] → 1/N as n → ∞.
Proof: We observe that
[Y^(n) = k] = [[NX^(n)] = k − 1] = [k − 1 ≤ NX^(n) < k]
= [(k − 1)/N ≤ X^(n) < k/N],
so
P[Y^(n) = k] = P[(k − 1)/N ≤ X^(n) < k/N].
Now one easily verifies that
P[X^(n) < k/N] = P[X^(n) < (k − 1)/N] + P[(k − 1)/N ≤ X^(n) < k/N],
or
P[(k − 1)/N ≤ X^(n) < k/N] = P[X^(n) < k/N] − P[X^(n) < (k − 1)/N].
Substituting this into the expression above for P[Y^(n) = k] and applying Proposition 1, we obtain the conclusion. Q.E.D.
Proposition 2 shows that if one wishes to select a number at random from {1, 2, ..., N} one should select an n-digit random number X^(n) with n as large as the calculator or machine can support, multiply it by N, then take the integer part of it and add 1. Thus, if we are confronted with N units denoted by U_1, U_2, ..., U_N, and if we wished to take a simple random sample with replacement of size n from it, we would perform the above procedure n times to get an ordered n-tuple of units (U_{k_1}, ..., U_{k_n}). The probability of obtaining any particular n-tuple of units is (1/N)^n, because of the independence of successive draws of a random number.
Now suppose we wish to take a simple random sample from {1, 2, ..., N} without replacement. If these were just numbered tags in a bowl, and if we selected n of them at random without replacement, we know that the probability of selecting the distinct integers k_1, ..., k_n in this order is 1/(N!/(N − n)!). The problem arises on how
one could use a random number generator to select a simple random sample without replacement of size n. Let us consider this procedure. Select a number at random from {1, 2, ..., N} using the random number generator; suppose this number is k_1. Now select a second random number. If it is different from k_1, call it k_2. If it is equal to k_1, disregard it and select another random number. If it is unequal to k_1, call it k_2; if not, continue sampling until you obtain a second distinct number. One continues selecting random numbers from {1, ..., N} until n distinct integers k_1, ..., k_n are obtained (n ≤ N).
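The rejection procedure just described can be sketched in a few lines (an illustrative sketch of ours; `rng.randint` stands in for the random-number-generator step):

```python
import random
from collections import Counter

def srs_wor(N, n, rng):
    """Ordered simple random sample of size n from {1, ..., N} without replacement."""
    chosen = []
    while len(chosen) < n:
        k = rng.randint(1, N)
        if k not in chosen:  # a repeat is disregarded and sampling continues
            chosen.append(k)
    return tuple(chosen)

rng = random.Random(3)
counts = Counter(srs_wor(5, 2, rng) for _ in range(20_000))
# Proposition 3 below asserts each ordered pair from {1,...,5} has
# probability 1/(5!/3!) = 1/20, so all 20 pairs should appear about equally.
print(len(counts))  # → 20
```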
PROPOSITION 3. By sampling with the above procedure, the probability of obtaining the ordered n-tuple k_1, ..., k_n of distinct numbers from {1, 2, ..., N} is
1/(N!/(N − n)!).
Proof: In this one instance in this book we make use of the countable additivity of probability, which was not covered in Chapter 1. In Chapter 1 we demonstrated that if {A_1, ..., A_n} is a finite set of disjoint events, then P(∪_{j=1}^n A_j) = Σ_{j=1}^n P(A_j). Now we make use of an extended property which is: if {A_1, A_2, ...} is an infinite sequence of disjoint events, then
P(∪_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).
Let [k_1, ..., k_n] denote the event that in continued sampling from {1, 2, ..., N} by simple random sampling with replacement the first integer selected is k_1, the second distinct integer selected is k_2, ..., and the nth distinct integer selected is k_n, where, as stated in the hypothesis, k_1, ..., k_n are distinct numbers from {1, 2, ..., N}. One easily sees that the event [k_1, ..., k_n] can be represented as the following countable union of disjoint events:
[k_1, ..., k_n] = ∪_{m_1=0}^∞ ··· ∪_{m_{n−1}=0}^∞ A(m_1, ..., m_{n−1}),
where A(m_1, ..., m_{n−1}) denotes this event: k_1 is selected on the first trial, which is followed by m_1 k_1's, which is followed by k_2, which is
followed by m_2 numbers consisting only of k_1's and k_2's, followed by k_3, followed by m_3 numbers consisting only of k_1's, k_2's and k_3's, followed by ···, followed by k_n. Because of independence,
P(A(m_1, ..., m_{n−1})) = (1/N)(1/N)^{m_1}(1/N)(2/N)^{m_2}(1/N) ··· ((n − 1)/N)^{m_{n−1}}(1/N)
= (1/N)^n Π_{q=1}^{n−1} (q/N)^{m_q}.
Since the events {A(m_1, ..., m_{n−1}) : m_1 ≥ 0, ..., m_{n−1} ≥ 0} are disjoint, it follows that
P([k_1, ..., k_n]) = Σ_{m_1=0}^∞ ··· Σ_{m_{n−1}=0}^∞ (1/N)^n Π_{q=1}^{n−1} (q/N)^{m_q}
= (1/N)^n Π_{q=1}^{n−1} Σ_{m=0}^∞ (q/N)^m
= (1/N)^n Π_{q=1}^{n−1} 1/(1 − q/N)
= (1/N)^n Π_{q=1}^{n−1} N/(N − q)
= 1/(N(N − 1) ··· (N − n + 1)) = 1/(N!/(N − n)!).
Q.E.D.
Proposition 3 thus validates the algorithm laid out before it for taking a simple random sample without replacement.
As noted earlier, in our model
U : U_1 U_2 ··· U_N
y : Y_1 Y_2 ··· Y_N,
y as a function whose domain is U assigns to U_i the number Y_i, i.e., y(U_i) = Y_i, 1 ≤ i ≤ N. We shall henceforth let y_j denote the random variable that gives the y-value of the jth unit selected in a random sample of size n. Thus, y_j assigns to the elementary event (U_{i_1}, ..., U_{i_n}) the number Y_{i_j}. In Section 2.2 we observed that in sampling with replacement y_1, ..., y_n are independent and have the same density, namely
P[y_ℓ = x] = #{i : Y_i = x}/N, 1 ≤ ℓ ≤ n.
We also observed that in the case of sampling without replacement, y_1, ..., y_n are not independent, yet they all have the same density as in the case of independence. In the sequel we shall refer to the random variables y_1, ..., y_n as a sample of size n.
We shall use only two abbreviations: WR will denote with replacement, and WOR will denote without replacement.
EXERCISES
1. Suppose the population to be sampled is
U: U_1  U_2  U_3  U_4  U_5  U_6  U_7  U_8
Y: 1.2  2.3  4.0  1.2  1.2  2.3  4.0  1.2.
Let y_1, y_2, y_3 denote a sample of size three WOR from the population. Compute

i) the joint density of y_1, y_2, y_3,
ii) the joint densities of y_1, y_2, of y_2, y_3, and of y_1, y_3,
iii) the densities of y_1, of y_2 and of y_3,
iv) E(y_1), E(y_2), E(y_3),
v) Var(y_1), Var(y_2), Var(y_3),
vi) the correlation coefficients ρ_{y_1,y_2}, ρ_{y_2,y_3} and ρ_{y_1,y_3},
vii) Y and Ȳ.
2. In using a random number generator to obtain a sample of size 2 from {1,2,3,4,5} WOR, find the probability that the sample is 1,4 and that it is obtained on or before the fourth random number is observed.
3. In using a random number generator to obtain a sample of size three WOR from {2,3,4,5}, find the probability that the sample is 3,1,4 and that sampling terminates when the fifth random number is observed.
4. In using a random number generator to obtain a sample of size two WOR from {2,3,4,5}, compute the probability that at least five successive random numbers must be observed in order to obtain the sample.
5. Prove: if x is any real number, then
[x] + 1 = [x + 1].
6.2 Unbiased Estimates for Y and Ȳ

We shall obtain here unbiased estimates for Y and Ȳ in WR and WOR simple random sampling. Then we shall derive the formulae for their variances and show that greater precision is obtained in WOR sampling.
Definition: If z is an observable random variable, and if A is some real number, then z is called an unbiased estimate of A if E{z) = A, whatever the value of A.
Now consider the population U and the numerical characteristic y:
U: U_1  U_2  ...  U_N
y: Y_1  Y_2  ...  Y_N.
If the function y : U_i -> Y_i is regarded as a random variable defined over the fundamental probability space U, then the density of y is

$$f_y(t) = \#\{i : Y_i = t\}/N.$$

Now let y_1, ..., y_n be a sample of size n on this population. (This was defined in Section 6.1.) If the sampling is done either with or without replacement, then, by Theorems 7 and 8 of Section 2.2, all y_i's have the same density as y. We shall use the following notation:

$$\bar{y} = (y_1 + \cdots + y_n)/n.$$
THEOREM 1. In sampling WR or WOR, Nȳ is an unbiased estimate of Y and ȳ is an unbiased estimate of Ȳ.
Proof: Regarding y as a random variable, we have
$$E(y) = \sum_t t\,f_y(t) = \frac{1}{N}\sum_{i=1}^{N} Y_i = \bar{Y}.$$
Thus, E(ȳ) = (1/n) Σ_{i=1}^n E(y_i) = (1/n)·nȲ = Ȳ, i.e., ȳ is an unbiased estimate of Ȳ. Now, E(Nȳ) = N·E(ȳ) = N·Ȳ = Y, and hence Nȳ is an unbiased estimate of Y. Q.E.D.

Among sample survey statisticians the ratio n/N is referred to as the sampling fraction. This will appear in our formulae from time to time. A basic numerical characteristic of our population is S_y², defined by

$$S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{Y})^2.$$

THEOREM 2. In sampling WR or WOR,

$$Var(y_i) = \frac{N-1}{N}S_y^2, \quad 1 \le i \le n.$$

Proof: Each y_i has the same density as does y, and

$$Var(y) = \sum_{i=1}^{N}\frac{1}{N}(Y_i-\bar{Y})^2 = \frac{N-1}{N}S_y^2.$$

Q.E.D.

THEOREM 3. In sampling WR,

$$Var(\bar{y}) = \frac{1}{n}\cdot\frac{N-1}{N}S_y^2 \quad\text{and}\quad Var(N\bar{y}) = \frac{1}{n}N(N-1)S_y^2.$$

Proof: This follows from Theorem 2 and the fact that in WR sampling, y_1, ..., y_n are independent random variables, all with the same density. Q.E.D.

THEOREM 4. In WOR simple random sampling,

$$Var(\bar{y}) = \frac{1}{n}\left(1-\frac{n}{N}\right)S_y^2 \quad\text{and}\quad Var(N\bar{y}) = \frac{N^2}{n}\left(1-\frac{n}{N}\right)S_y^2.$$
Proof: Let us denote z_i = y_i - Ȳ, 1 <= i <= n. Then z_1, ..., z_n is a sample of size n on y - Ȳ WOR and E(z_i) = 0, 1 <= i <= n. By Theorems 3 and 7 of Section 3.2, Var(z_i) = Var(y_i) for 1 <= i <= n, and E(z_i z_j) = E(z_1 z_2) for i ≠ j. Thus
$$Var(\bar{y}) = E(\bar{y}-\bar{Y})^2 = \frac{1}{n^2}E\left(\left(\sum_{i=1}^{n}(y_i-\bar{Y})\right)^2\right) = \frac{1}{n^2}E\left(\left(\sum_{i=1}^{n}z_i\right)^2\right) = \frac{1}{n^2}\left\{nE(z_1^2) + n(n-1)E(z_1z_2)\right\}.$$
Note that this formula for Var(ȳ) holds for every value of n <= N. If we select n = N, then ȳ = Ȳ is a constant, and thus, when n = N, Var(ȳ) = 0. In other words, when n = N, E(z_1²) + (N-1)E(z_1 z_2) = 0. Thus
$$E(z_1z_2) = -\frac{1}{N-1}E(z_1^2) = -\frac{1}{N-1}\cdot\frac{1}{N}\sum_{j=1}^{N}(Y_j-\bar{Y})^2 = -\frac{1}{N}S_y^2.$$

Substituting this in the formula for Var(ȳ) above, we obtain
$$Var(\bar{y}) = \frac{1}{n^2}\left\{n\frac{N-1}{N}S_y^2 - n(n-1)\frac{1}{N}S_y^2\right\},$$
which yields the first conclusion. The second follows from
$$Var(N\bar{y}) = N^2\,Var(\bar{y}) = \frac{N^2}{n}\left(1-\frac{n}{N}\right)S_y^2.$$
Q.E.D.
We observe that, in both WR and WOR simple random sampling, ȳ is an unbiased estimate of Ȳ and Nȳ is an unbiased estimate of
Y. However, as shown in Theorems 3 and 4, the variance of ȳ in WOR sampling is smaller than that in WR by a multiplicative factor of (1 - n/N). In Section 6.3 we shall show why the square root of the variance is a measure of the error and shall show how to estimate the maximum error made in using Nȳ to estimate Y.
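The (1 - n/N) factor can be checked numerically. The following Python sketch (ours, not from the text; the small population is hypothetical) computes Var(ȳ) from Theorems 3 and 4 and verifies both values against exact enumeration of all equally likely ordered samples:

```python
from itertools import permutations, product
from statistics import mean

Y = [1.2, 2.3, 4.0, 1.2]          # a small illustrative population (assumed)
N, n = len(Y), 2
Ybar = mean(Y)
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)

# Theorems 3 and 4: Var(ybar) under WR and WOR sampling.
var_wr = (1 / n) * (N - 1) / N * S2
var_wor = (1 / n) * (1 - n / N) * S2

# Exact check: every ordered WR sample (N^n of them) and every ordered
# WOR sample (N!/(N-n)! of them) is equally likely.
wr_means = [mean(s) for s in product(Y, repeat=n)]
wor_means = [mean(s) for s in permutations(Y, n)]
ev_wr = mean((m - Ybar) ** 2 for m in wr_means)
ev_wor = mean((m - Ybar) ** 2 for m in wor_means)
assert abs(var_wr - ev_wr) < 1e-9 and abs(var_wor - ev_wor) < 1e-9
```

The WOR variance is smaller by exactly the factor (1 - n/N), here 1/2.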
EXERCISES
1. If X is an observable random variable whose distribution is B(n, p), where n is known, and p is unknown, prove that X/n is an unbiased estimate of p.
2. If X is an observable random variable whose distribution is uniform over {1, 2, ..., N}, where N is unknown, show that 2X - 1 is an unbiased estimate of N.
3. In simple random sampling, find the value of the sampling fraction so that the standard deviation of y in WOR sampling is one half that of WR sampling.
4. Prove: in sampling WOR, if n = N, then Nȳ = Y.
5. Consider the following population:
U: U_1  U_2  U_3  U_4  U_5  U_6
y: 10  11  9  10  12  11.
(i) Compute Y and Ȳ. (Remember: these are unknown to the statistician in real life.)
(ii) Compute the variance of the unbiased estimate Ŷ = Nȳ in a sample of size 3 when the sampling is done WR and also when it is done WOR.
6. Compute a value of Ŷ in WR sampling in Problem 5 when the units selected are U_5, U_2, U_5.
7. In Problem 5 compute the value of Ŷ in WOR sampling when the units selected are U_5, U_1, U_3.
6.3 Estimation of Sampling Errors

In any estimation problem in sample surveys we shall wish to have an estimate of the maximum error made. Indeed, one of the advantages of a sample survey over a complete census is that we have a means of estimating errors.
We should recall in our present setting the central limit theorem stated in Chapter 5. In our present notation and model it states: In WR sampling, if y_1, ..., y_n is a sample of size n, then

$$\lim_{n\to\infty} P\left[\frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le x\right] = \Phi(x)$$

for all real x. The function

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt$$
that appears on the right side of the limit expression is the standard normal distribution function, and its values for different values of x are given in the table in the Appendix. Particular values to keep in mind are Φ(1.645) = .95, Φ(-1.645) = .05, Φ(1.96) = .975, Φ(-1.96) = .025, Φ(2.57) = .995 and Φ(-2.57) = .005. Now, the value to us of a limit theorem like the central limit theorem above is that certain functions of n may be approximated for large values of n by the limit quantity. For x > 0,
$$P\left[-x \le \frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le x\right] = P\left[\frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le x\right] - P\left[\frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} < -x\right]$$

is approximated by Φ(x) - Φ(-x). Thus, for large values of n,

$$P\left[-1.96 \le \frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le 1.96\right]$$
is approximated by Φ(1.96) - Φ(-1.96) = .95, a probability close to one. Also,
$$P\left[-2.57 \le \frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le 2.57\right]$$
is approximated by Φ(2.57) - Φ(-2.57) = .99, a probability even closer to one. Thus it is almost certain that the following inequality holds:
$$-3 \le \frac{n\bar{y}-n\bar{Y}}{\sqrt{n\,Var(y_1)}} \le 3,$$
or (after using a little algebra)
$$|N\bar{y} - Y| \le 3N\sqrt{Var(y_1)/n}.$$
Let us recall Chebyshev's inequality from Chapter 5: if Z is a random variable, then for every ε > 0,

$$P([|Z-E(Z)| < \varepsilon]) \ge 1 - \frac{Var(Z)}{\varepsilon^2}.$$

If we replace ε by √(Var(Z)/t), we obtain

$$P\left(\left[|Z-E(Z)| < \sqrt{\frac{Var(Z)}{t}}\,\right]\right) \ge 1 - t.$$

If we let t = 0.01, then

$$P([|Z - E(Z)| < 10\sqrt{Var(Z)}]) \ge .99,$$
which gives us a larger bound to the error of estimating E(Z) by Z, namely 10√Var(Z).

In any case, we see that if we obtain an unbiased estimate Z for some unknown constant, the smaller the variance, the smaller the error. If we wish to be right in stating the maximum error of estimates in at least 99 out of every 100 cases, we would state that the error is less than 10√Var(Z), provided we know the value Var(Z). In practice, we should prefer to use 2.57√Var(Z), for rather good theoretical reasons.
Thus it becomes necessary to be able to estimate the variance of an unbiased estimate if we wish to estimate the maximum error in using it.
THEOREM 1. In WR sampling, if y_1, ..., y_n is a sample of size n, and if s_y² is defined by

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2,$$

then s_y² is an unbiased estimate of Var(y_1).

Proof: We may write

$$s_y^2 = \frac{1}{n-1}\left\{\sum_{j=1}^{n} y_j^2 - n\bar{y}^2\right\}.$$

Now E(y_j) = E(y_1) and E(y_j²) = E(y_1²) for 1 <= j <= n. Since y_1, ..., y_n are independent, then

$$E\left(\sum_{j=1}^{n} y_j^2\right) = nE(y_1^2),$$

and

$$E(\bar{y}^2) = \frac{1}{n^2}E\left(\left(\sum_{j=1}^{n}y_j\right)^2\right) = \frac{1}{n^2}\left\{nE(y_1^2) + n(n-1)(E(y_1))^2\right\}.$$

Thus, with a little algebra we obtain E(s_y²) = Var(y_1). Q.E.D.
COROLLARY 1. In WR sampling, an unbiased estimate of the variance of the unbiased estimate Ŷ = Nȳ is (N²/n)s_y².

Proof: In WR sampling, Var(Nȳ) = N²Var(y_1)/n. By Theorem 1, E(s_y²) = Var(y_1). Thus N²s_y²/n is an unbiased estimate of Var(Nȳ). Q.E.D.
The situation for WOR is a bit more complicated.
THEOREM 2. In WOR sampling, if s_y² = (1/(n-1)) Σ_{j=1}^n (y_j - ȳ)², then E(s_y²) = S_y², where S_y² = (1/(N-1)) Σ_{i=1}^N (Y_i - Ȳ)².

Proof: We first observe that (n-1)s_y² = Σ_{j=1}^n (y_j - ȳ)² = Σ_{j=1}^n y_j² - nȳ². Since Var(ȳ) = E(ȳ²) - (E(ȳ))², we use Theorem 4 in Section 2 to obtain E(ȳ²) = Var(ȳ) + (E(ȳ))² = (1/n)(1 - n/N)S_y² + Ȳ². We also have

$$E(y_j^2) = Var(y_j) + (E(y_j))^2 = \frac{N-1}{N}S_y^2 + \bar{Y}^2.$$

Using these two equations we have

$$(n-1)E(s_y^2) = E\left(\sum_{j=1}^{n}y_j^2\right) - nE(\bar{y}^2) = \frac{n(N-1)}{N}S_y^2 - \left(1-\frac{n}{N}\right)S_y^2 = (n-1)S_y^2.$$

This yields E(s_y²) = S_y². Q.E.D.

COROLLARY. In WOR sampling, an unbiased estimate of the variance of Ŷ = Nȳ is V̂ar(Ŷ) = (N²/n)(1 - n/N)s_y², and an unbiased estimate of Var(ȳ) is V̂ar(ȳ) = (1 - n/N)s_y²/n.

Proof: By Theorem 2 above and Theorem 4 of Section 2,

$$E\left\{\frac{N^2}{n}\left(1-\frac{n}{N}\right)s_y^2\right\} = \frac{N^2}{n}\left(1-\frac{n}{N}\right)E(s_y^2) = \frac{N^2}{n}\left(1-\frac{n}{N}\right)S_y^2 = Var(N\bar{y}).$$
The second conclusion follows by known properties of variance. Q.E.D.
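As an illustration (a Python sketch of ours, not from the text; the sample values and the population size are hypothetical), the corollary's estimates can be computed as:

```python
from statistics import mean

def estimate_total_wor(sample, N):
    """Point estimate N*ybar for the population total Y, together with the
    unbiased variance estimate (N^2/n)(1 - n/N)s^2 from the Corollary."""
    n = len(sample)
    ybar = mean(sample)
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)   # sample variance s^2
    var_hat = (N ** 2 / n) * (1 - n / N) * s2             # estimated Var(N*ybar)
    return N * ybar, var_hat

# hypothetical WOR sample of n = 4 observations from a population of N = 100
total_hat, var_hat = estimate_total_wor([10, 11, 9, 12], N=100)
assert total_hat == 1050.0
```

The square root of `var_hat`, scaled by 2.57 or 3 as in Section 6.3, bounds the likely error of the estimated total.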
EXERCISES
1. Consider the following population:
U: U_1  U_2  U_3  U_4  U_5  U_6  U_7  U_8
y: 1.2  1.5  1.4  1.6  1.3  1.4  1.2  1.4
(i) Compute the values of Y and Ȳ; remember, in real life these values are unknown to the statistician.

(ii) Compute S_y².

(iii) Compute Var(y).
2. Suppose in Problem 1 a sample of size 4 is taken WR and the units selected turn out to be U_5, U_7, U_7, U_4.
(i) Use the sample to compute a value of the unbiased estimate of Y.

(ii) Use the sample to compute the value of the unbiased estimate of Var(Ŷ).
3. Do Problem 2 when the sampling is done WOR and the units selected are U_3, U_5, U_1, U_7.
6.4 Estimation of Proportions

The present section provides special applications of the previous two sections. Here we are interested in the proportion of a population that satisfies a certain property. For example, we might be interested in the proportion of a population that is unemployed or the proportion of a population in a certain age group. We shall let A denote the subset of U which has the particular property in question, and we shall let π denote the proportion of units in U that are in A, i.e., the number of units in A divided by N. Corresponding to U_i in U we shall take Y_i = 1 if U_i ∈ A and Y_i = 0 if U_i ∉ A. Thus we see that Y = Σ_{i=1}^N Y_i is
the number of units in A, i.e., the total number of units in U with the property in question, and it also turns out that Ȳ = π.
Now consider a random sample of size n taken WR. If y_1, ..., y_n are the observations on y, then they are independent, and P[y_i = 1] = π and P[y_i = 0] = 1 - π for 1 <= i <= n. Thus, Σ_{i=1}^n y_i is B(n, π), and ȳ is the proportion of units sampled that are in A.
THEOREM 1. In a WR sample of size n to estimate the proportion π of a population in a subset A, the proportion π̂ of units in the sample that are in A is an unbiased estimate of π, its variance is π(1 - π)/n, and an unbiased estimate of the variance of π̂ is π̂(1 - π̂)/(n - 1).
Proof: From Theorem 1 of Section 2, π̂ = ȳ is an unbiased estimate of Ȳ = π. Now Var(ȳ) = Var(y_1)/n, and by Theorem 1 in Section 3, s_y² is an unbiased estimate of Var(y_1). Thus, an unbiased estimate of Var(π̂) is V̂ar(π̂) = s_y²/n. But

$$s_y^2 = \frac{1}{n-1}\left\{\sum_{i=1}^{n}y_i^2 - n\bar{y}^2\right\}.$$

Now y_i² = y_i for 1 <= i <= n, and thus

$$s_y^2 = \frac{1}{n-1}\{n\bar{y} - n\bar{y}^2\} = \frac{n}{n-1}\hat{\pi}(1-\hat{\pi}).$$

From this we have shown that V̂ar(π̂) defined by

$$\widehat{Var}(\hat{\pi}) = \frac{1}{n-1}\hat{\pi}(1-\hat{\pi})$$

is an unbiased estimate of Var(π̂). Q.E.D.
We next turn our attention to WOR simple random sampling.
THEOREM 2. In WOR sampling of size n to estimate the proportion π of a population in a subset A of U, the proportion π̂ of units in
the sample that are in A is an unbiased estimate of π, its variance is

$$Var(\hat{\pi}) = \frac{N-n}{N-1}\cdot\frac{\pi(1-\pi)}{n},$$

and an unbiased estimate of this variance is

$$\widehat{Var}(\hat{\pi}) = \frac{1}{n-1}\left(1-\frac{n}{N}\right)\hat{\pi}(1-\hat{\pi}).$$

Proof: Again by Theorem 1 of Section 2, π̂ is an unbiased estimate of Ȳ = π. By Theorem 4 of Section 2,

$$Var(\hat{\pi}) = \frac{1}{n}\left(1-\frac{n}{N}\right)S_y^2.$$

But

$$S_y^2 = \frac{1}{N-1}\left\{\sum_{i=1}^{N}Y_i^2 - N\bar{Y}^2\right\} = \frac{1}{N-1}(N\pi - N\pi^2) = \frac{N}{N-1}\pi(1-\pi),$$

and thus

$$Var(\hat{\pi}) = \frac{1}{n}\left(1-\frac{n}{N}\right)\frac{N}{N-1}\pi(1-\pi) = \frac{N-n}{N-1}\cdot\frac{\pi(1-\pi)}{n}.$$

By Theorem 2 of Section 3, E(s_y²) = S_y², and thus an unbiased estimate of Var(π̂) is

$$\widehat{Var}(\hat{\pi}) = \frac{1}{n}\left(1-\frac{n}{N}\right)s_y^2.$$

But, as was shown in the proof of Theorem 1, s_y² = (n/(n-1))π̂(1 - π̂), and thus

$$\widehat{Var}(\hat{\pi}) = \frac{1}{n-1}\left(1-\frac{n}{N}\right)\hat{\pi}(1-\hat{\pi}).$$

Q.E.D.
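Theorems 1 and 2 can be sketched together in one small Python function (ours, not from the text; the numeric inputs are illustrative):

```python
def proportion_estimates(successes, n, N=None):
    """pi-hat with its unbiased variance estimate: Theorem 1 (WR) when N is
    None, Theorem 2 (WOR) when the population size N is supplied."""
    p_hat = successes / n
    if N is None:                      # WR:  pi-hat(1 - pi-hat)/(n - 1)
        v_hat = p_hat * (1 - p_hat) / (n - 1)
    else:                              # WOR: (1/(n-1))(1 - n/N) pi-hat(1 - pi-hat)
        v_hat = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
    return p_hat, v_hat

# e.g., 15 units with the property among n = 40 drawn WOR from N = 1000
p_hat, v_wor = proportion_estimates(successes=15, n=40, N=1000)
assert p_hat == 0.375
```

The WOR variance estimate differs from the WR one only by the factor (1 - n/N), the finite population correction.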
We now relate the results we have obtained to results in a different area. We first show how the results of this section give us the expectation and variance of the hypergeometric distribution that we discussed in section 2.3. Recall that if an urn contains r red balls and b black
balls, if n of these are selected WOR, and if X denotes the number of red balls in the sample, then the density of X is
$$P([X = x]) = \frac{\binom{r}{x}\binom{b}{n-x}}{\binom{r+b}{n}}$$
if max{0, n - b} <= x <= min{r, n} and is zero otherwise. The proportion of red balls in the population (urn) is π = r/(r + b), and the proportion of red balls in the sample is π̂ = X/n. By Theorem 2 above,
$$E(\hat{\pi}) = \pi, \quad\text{or}\quad n^{-1}E(X) = r/(r+b).$$

Thus, E(X) = nr/(r + b).
Also by Theorem 2,

$$Var(X) = Var(n\hat{\pi}) = n^2\,Var(\hat{\pi}) = \frac{n(r+b-n)}{r+b-1}\,\pi(1-\pi),$$

or

$$Var(X) = \frac{n(r+b-n)}{r+b-1}\cdot\frac{rb}{(r+b)^2}.$$
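These two formulae can be verified against the density directly for a small urn (a Python check of ours, not from the text; the urn sizes are arbitrary):

```python
from math import comb

r, b, n = 5, 7, 4                 # a small urn, so the density can be enumerated
N = r + b
px = {x: comb(r, x) * comb(b, n - x) / comb(N, n)
      for x in range(max(0, n - b), min(r, n) + 1)}

EX = sum(x * p for x, p in px.items())
VarX = sum((x - EX) ** 2 * p for x, p in px.items())
assert abs(EX - n * r / N) < 1e-9                        # E(X) = nr/(r+b)
assert abs(VarX - n * (N - n) / (N - 1) * r * b / N**2) < 1e-9
```

The enumeration agrees with E(X) = nr/(r+b) and with the displayed formula for Var(X).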
Thus we have formulae for the expectation and variance.

We last show how the results of Section 2 can be used to obtain formulae for the mean and variance of the Wilcoxon distribution, which is defined as follows. Let m and n be positive integers and let W denote the sum of n integers selected at random WOR from the set {1, 2, ..., m + n}. The random variable W is said to have the Wilcoxon distribution. Here we see that N = m + n and Y_i = i, 1 <= i <= m + n. If we denote the sample by y_1, ..., y_n, then E(y_1) = (1/(m+n)) Σ_{j=1}^{m+n} j = (1/(m+n)) · (m+n)(m+n+1)/2, or Ȳ = (m + n + 1)/2. Now W = y_1 + ... + y_n, so E(W) = Σ_{i=1}^n E(y_i) = n(m + n + 1)/2, i.e.,

$$E(W) = \frac{n(m+n+1)}{2}.$$
Since W = y_1 + ... + y_n = nȳ, it follows that Var(W) = n²Var(ȳ). By Theorem 4 of Section 2, Var(ȳ) = (1/n)(1 - n/(m+n))S_y². Now

$$\bar{Y} = \frac{1}{m+n}\sum_{j=1}^{m+n} j = \frac{1}{m+n}\cdot\frac{(m+n)(m+n+1)}{2} = \frac{m+n+1}{2},$$

and, from a known identity usually proved by induction,

$$\sum_{j=1}^{m+n} j^2 = \frac{(m+n)(m+n+1)(2m+2n+1)}{6}.$$

Thus,

$$S_y^2 = \frac{1}{m+n-1}\left\{\frac{(m+n)(m+n+1)(2m+2n+1)}{6} - \frac{(m+n)(m+n+1)^2}{4}\right\}.$$

After some algebraic simplification, we obtain

$$S_y^2 = \frac{(m+n+1)(m+n)}{12}.$$

Thus,

$$Var(W) = n^2\,Var(\bar{y}) = n^2\cdot\frac{1}{n}\left(1-\frac{n}{m+n}\right)S_y^2 = \frac{nm}{m+n}\cdot\frac{(m+n+1)(m+n)}{12},$$

or

$$Var(W) = \frac{mn(m+n+1)}{12}.$$
The formulae for E(W) and Var(W) are useful in non-parametric statistical inference when one wishes to approximate the distribution of W for large m and n.
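For small m and n the two formulae can be confirmed by exact enumeration (a Python check of ours, not in the text):

```python
from itertools import combinations
from statistics import mean

m, n = 3, 4                       # small enough to enumerate every sample
M = m + n
# W depends only on which n integers are chosen, and under WOR every
# n-subset of {1, ..., m+n} is equally likely.
sums = [sum(c) for c in combinations(range(1, M + 1), n)]

EW = mean(sums)
VarW = mean((w - EW) ** 2 for w in sums)
assert abs(EW - n * (M + 1) / 2) < 1e-9            # E(W) = n(m+n+1)/2
assert abs(VarW - m * n * (M + 1) / 12) < 1e-9     # Var(W) = mn(m+n+1)/12
```

Here E(W) = 16 and Var(W) = 8, exactly as the formulae give.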
EXERCISES
1. Let U denote the sum of 10 numbers selected at random WOR from {11,12, • • •, 29,30}. Compute E(U) and Var(U) .
2. In a population of 1000 units, 40 units were selected at random WOR. Among these 40 units, 15 units had a certain property. Compute a value of the unbiased estimate of the proportion of units in the entire population with this property, and compute a value of the unbiased estimate of the variance.
3. In an area considered for urban renewal, time and money were available to inspect 50 out of 420 homes. Among 50 of these 420 homes selected at random WOR, 18 were found to be substandard. Compute an estimate of the total number of substandard homes among the 420 and an estimate of the variance.
4. Among the integers {1, 2, ..., m + n}, let W denote the sum of n of them selected at random without replacement, and let S denote the sum of m of them selected at random without replacement. Prove that Var(W) = Var(S) and that E(W) = (m+n)(m+n+1)/2 - E(S).
6.5 Sensitive Questions

We pause now for a special application of results of Section 4. In Section 4 we assumed that if we were sampling a human population, and if we asked each person in our sample a simple yes/no question, then the response would always be truthful. But there are many situations in which a truthful answer could not be elicited from a sizable proportion of the sample. Suppose one is conducting a survey on a university campus to determine what proportion of students cheat during examinations whenever the opportunity presents itself. If you took a simple random sample of 300 students and asked them the question in person, the answers would undoubtedly be all no's. The same would be the case if the question were: do you practice safe sex, except in this case you would receive a unanimous 'yes' answer. However, if the respondent
can be made to feel anonymous in some way, he or she would feel no danger with a truthful answer. Thus the statistician is challenged to come up with a survey technique which provides the respondent with a feeling of safety and yet provides an unbiased estimate of the proportion of individuals in that population who have what we might refer to as a "secret sin". Whatever the sensitive question we consider in a survey, we shall use the generic expression "secret sinner" to denote an individual who would be reluctant to give a truthful answer to a direct question.
Let us begin with a population of N individuals; N might or might not be known. Let N_0 denote the number of secret sinners among the N. The problem is: based on a sample of size n, to estimate the ratio N_0/N. One method of eliciting a response from an individual in the sample is to provide the respondent with an urn containing a known number r of red balls and a known number b of black balls with r ≠ b; the composition of the urn is the same for all respondents. The instructions to each individual in the sample are these:
(i) Select a ball at random from the urn out of sight of the interviewer, note its color and return it to the urn.
(ii) If the ball you selected was red, answer T (for true) or F (for false) to the statement: I am a secret sinner.
(iii) If the ball you selected was black, answer T or F to the statement: I am not a secret sinner.
Our assumption here is that the person interviewed will respond truthfully, since he or she knows that the interviewer does not know the color of the ball selected and hence does not know which statement the answer T or F applies to. However, the statistician, having at his or her disposal the size n of the sample, the ratio r/(r + b) and the number of T answers is able to obtain an unbiased estimate of the proportion of secret sinners in the population.
Let us denote π = N_0/N and p = r/(r + b). If one individual is selected at random from the population, this person will respond T if and only if one of the two disjoint events A and B occurs: A is the event that the individual selected is a secret sinner and that he or she
selected a red ball, and B is the event that the respondent is not a secret sinner and that he or she selected a black ball. If we let φ denote the probability of obtaining a T answer from an individual at random, then

$$\varphi = P(A) + P(B).$$

Since the events of being a secret sinner and selecting a red ball at random are independent, it follows that

$$P(A) = p\pi$$

and

$$P(B) = (1-p)(1-\pi).$$

Hence φ = pπ + (1 - p)(1 - π). Solving for π we obtain

$$\pi = \frac{\varphi - (1-p)}{2p-1}.$$

Notice that the denominator is not zero, since p = r/(r + b) ≠ 1/2 by the foresight of having r ≠ b. If φ̂ denotes the proportion of T answers from the sample of size n, then no matter whether the sampling was done with or without replacement, E(φ̂) = φ. Thus π̂ defined by

$$\hat{\pi} = \frac{\hat{\varphi} - (1-p)}{2p-1}$$
is an unbiased estimate of π. We may formalize this development with the following theorem.
THEOREM. An unbiased estimate of π is

$$\hat{\pi} = \frac{\hat{\varphi} - (1-p)}{2p-1}.$$

If the sampling is done without replacement, then an unbiased estimate of Var(π̂) is

$$\widehat{Var}(\hat{\pi}) = \frac{1}{(2p-1)^2}\cdot\frac{1}{n-1}\left(1-\frac{n}{N}\right)\hat{\varphi}(1-\hat{\varphi}),$$
and if the sampling is done with replacement, then an unbiased estimate of Var(π̂) is

$$\widehat{Var}(\hat{\pi}) = \frac{1}{(2p-1)^2}\cdot\frac{\hat{\varphi}(1-\hat{\varphi})}{n-1}.$$

Proof: This follows from the above development and Theorems 1 and 2 of Section 4. Q.E.D.
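A Python sketch of the estimate in the Theorem (ours, not from the text; the counts are hypothetical):

```python
def randomized_response_estimate(t_answers, n, r, b, N=None):
    """pi-hat = (phi-hat - (1-p))/(2p - 1) for the urn scheme above, with the
    Theorem's variance estimate (WOR if N is given, WR otherwise); needs r != b."""
    p = r / (r + b)
    phi_hat = t_answers / n
    pi_hat = (phi_hat - (1 - p)) / (2 * p - 1)
    fpc = (1 - n / N) if N is not None else 1.0   # finite population correction
    var_hat = fpc * phi_hat * (1 - phi_hat) / ((n - 1) * (2 * p - 1) ** 2)
    return pi_hat, var_hat

# e.g., 120 T answers from 300 respondents, urn with r = 3 red, b = 1 black (WR)
pi_hat, var_hat = randomized_response_estimate(t_answers=120, n=300, r=3, b=1)
assert abs(pi_hat - 0.3) < 1e-9
```

The (2p - 1)² in the denominator shows the price of anonymity: the closer p is to 1/2, the larger the variance of π̂.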
EXERCISES
Suppose the procedure outlined in this section is changed somewhat. Suppose the urn contains r red balls, w white balls and b blue balls. The procedure is for the person interviewed to draw a ball at random from the urn and out of sight of the interviewer, to note its color and then to return it to the urn. If the ball drawn is a red ball, the respondent is to answer T or F to the statement: I am a secret sinner. If the color of the ball drawn is white or blue, he or she is instructed to respond T or F to the statement: the color of the ball that I obtained was white. Let π and φ be as already defined.
1. Prove that

$$\varphi = \frac{r\pi}{r+w+b} + \frac{w}{r+w+b}.$$
2. Prove that (r + w + b)φ > w.
3. Find an unbiased estimate π̂ of π, and show that π̂ could have negative values.
4. Find the formulas for an unbiased estimate of Var(π̂) when the sampling is done with replacement and without replacement.
Chapter 7
Unequal Probability Sampling
7.1 How to Sample

In Chapter 6, whenever we sampled from a population of N units, whether WR or WOR, each item had the same probability of being selected. For reasons to be elaborated on later, we shall sometimes find it desirable to sample with unequal probabilities. The purpose of this section is to explain how this is done when the sampling is done WR and when it is done WOR.
In simple random sampling on the population

U : U_1  U_2  ...  U_N
y : Y_1  Y_2  ...  Y_N

we sample so that for each i (1 <= i <= N) the probability of selecting U_i on the first observation is 1/N. Now we wish to sample in such a way as to give some units larger probabilities of being selected than others. For example, suppose our population U has four units, U_1, U_2, U_3, U_4, and suppose we wish to take a sample of size 1 so that U_1 and U_2 are equally likely, U_3 and U_4 are equally likely, but U_2 is twice as likely to be selected as U_4 is. One way of taking such a sample is to decide ahead of time that if, upon tossing an unbiased die once, it comes up 1 or 2, then U_1 is selected, if it comes up 3 or 4, then U_2 is selected, if it comes up 5, then U_3 is selected, and if it comes up 6, U_4 is selected.
Thus we see that the probability of selecting U_2 is 1/3, which is twice the probability of selecting U_4, which is 1/6. The relative sizes of the units in this example are 2, 2, 1 and 1; dividing each size by the sum of the sizes gives us the probability of selecting that particular unit.
Now consider the general case. We have a population of N units, U_1, ..., U_N, and suppose we wish to select one unit, where the probability of selecting unit U_j is p_j > 0, where p_1, p_2, ..., p_N are given nonnegative numbers which satisfy p_1 + ... + p_N = 1 and, for practical purposes, are all representable by k-digit decimals for some positive integer k. Let us denote t_0 = 0, t_1 = p_1, t_2 = p_1 + p_2, ..., t_N = p_1 + ... + p_N = 1. Then select a k-digit random number X^(k), as discussed in Chapter 6. If X^(k) is observed to be in the interval [t_{j-1}, t_j), then one decides that the unit selected is U_j. One easily sees that, proceeding in this manner, the probability of selecting unit U_i is p_i for 1 <= i <= N. If sampling is done WR, then all n units in the sample are selected in this manner. The probabilities p_1, ..., p_N usually come about in the following way. Associated with each unit U_i is a positive number X_i that is known before any sampling is contemplated. We shall wish to sample so that the probability p_i of obtaining unit U_i on the first observation is proportional to X_i. Thus p_i must satisfy
$$p_i = X_i/X,$$

where X = Σ_{j=1}^N X_j. If we sample like this n times, it is called WR probability proportional to size sampling. In this method of sampling we essentially return the unit to the population after each observation and take the next observation in the same manner. The sizes referred to in this method are the positive numbers X_1, ..., X_N.
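The cumulative-interval selection described above can be sketched in Python (our illustration, not from the text; names are ours):

```python
import random
from bisect import bisect_right

def pps_wr(sizes, n, rng=None):
    """WR probability-proportional-to-size sampling: unit j is chosen when the
    random number falls in [t_{j-1}, t_j), with t_j = (X_1 + ... + X_j)/X."""
    rng = rng or random.Random()
    X = sum(sizes)
    t, acc = [], 0.0
    for x in sizes:
        acc += x
        t.append(acc / X)          # cumulative probabilities t_1, ..., t_N
    # bisect_right maps a uniform u in [t_{j-1}, t_j) to 0-based index j-1
    return [bisect_right(t, rng.random()) for _ in range(n)]

picks = pps_wr([2, 2, 1, 1], n=5, rng=random.Random(7))
assert all(0 <= j < 4 for j in picks)
```

With sizes 2, 2, 1, 1 this reproduces the die example: units are drawn with probabilities 1/3, 1/3, 1/6, 1/6.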
We now describe what is meant by WOR probability proportional to size sampling. Again the units are U_1, ..., U_N, and the corresponding "sizes" are known positive numbers X_1, ..., X_N. Let n denote the number of observations desired in the sample. The first unit is selected in the same manner as described above. Suppose that the unit selected on this first observation is U_{i_1}. Now the remaining units are U \ {U_{i_1}} = {U_1, ..., U_N} \ {U_{i_1}}. A unit is now selected from among these with probability proportional to size, i.e., unit U_j is selected with probability X_j/(X - X_{i_1}) for 1 <= j <= N, j ≠ i_1. If the unit selected on this
second observation is U_{i_2}, then the third unit is selected with probability proportional to its size among the sizes of the remaining N − 2 units, i.e., unit U_r is selected with probability X_r/(X − X_{i_1} − X_{i_2}) for 1 ≤ r ≤ N, r ∉ {i_1, i_2}. This is continued until the nth (necessarily distinct) unit is selected. Thus at the time the kth unit is selected, the conditional probability of selecting unit U_{i_k}, given that units U_{i_1}, ⋯, U_{i_{k−1}} have been selected in the k − 1 previous trials, is X_{i_k}/(X − X_{i_1} − ⋯ − X_{i_{k−1}}).
THEOREM 1. If i_1, ⋯, i_n are n distinct integers in {1, ⋯, N}, then the probability of obtaining a sample of units U_{i_1}, ⋯, U_{i_n} (in this order) from among units U_1, ⋯, U_N by WOR probability proportional to sizes X_1, ⋯, X_N is given by

P(U_{i_1}, ⋯, U_{i_n}) = ∏_{k=1}^{n} X_{i_k} / (X − ∑_{r=1}^{k−1} X_{i_r}).

Proof: Let P(U_{i_k} | U_{i_1}, ⋯, U_{i_{k−1}}) denote the conditional probability that U_{i_k} is selected on the kth observation given that U_{i_1}, ⋯, U_{i_{k−1}} were selected in the first k − 1 trials. As shown above,

P(U_{i_k} | U_{i_1}, ⋯, U_{i_{k−1}}) = X_{i_k} / (X − ∑_{r=1}^{k−1} X_{i_r}).

Applying the multiplication rule, we obtain the conclusion of the theorem. Q.E.D.
The question next arises of how to use a random number generator to obtain a sample of size n from among units U_1, ⋯, U_N by means of WOR probability proportional to sizes X_1, ⋯, X_N. One method which appears as obvious is to generate n random numbers; let r_1, r_2, ⋯, r_n denote these numbers. Now define t_j = (X_1 + ⋯ + X_j)/X, 1 ≤ j ≤ N, and suppose r_1 ∈ [t_{j_1−1}, t_{j_1}), where we take t_0 = 0. In this case we shall declare that unit U_{j_1} is the first unit selected. Next, with unit U_{j_1} removed from the population, use the remaining units with their corresponding sizes and r_2 in the same way as above to select a second unit, call it U_{j_2}. Proceed to select U_{j_3}, ⋯, U_{j_n} in the same manner.
120 CHAPTER 7. UNEQUAL PROBABILITY SAMPLING
However, the re-computation of the t-values needed in order to select each unit can be cumbersome. Since random numbers r_1, r_2, ⋯ are easy to come by, the skipping method of WOR probability proportional to size sampling, on which we now elaborate, will prove to be convenient; the correctness of this procedure will be proved in Theorem 2. The skipping method is as follows. Again, let t_j = (X_1 + ⋯ + X_j)/X, and if a random number r falls in the interval [t_{j−1}, t_j), we declare that unit U_j is selected. The sampling is continued in this manner with as many random numbers as needed until n distinct units have been selected; a random number that falls in the interval of an already-selected unit is simply skipped. We designate these n units as our sample. Justification of the use of this skipping method is achieved by comparing the following theorem, a generalization of Proposition 1 in Section 6.1, with Theorem 1.
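As a rough illustration (the function and variable names are ours, not the text's), the skipping method might be coded as follows; note that the probability intervals are computed once and never updated:

```python
import random

def sample_wor_pps_skipping(sizes, n, rng=random.random):
    """WOR PPS sample of n distinct unit indices by the skipping method.

    The intervals [t_{j-1}, t_j) are fixed once; a random number that
    lands in the interval of an already-selected unit is skipped, so
    no t-values need to be recomputed.
    """
    N = len(sizes)
    if n > N:
        raise ValueError("cannot select more distinct units than exist")
    X = sum(sizes)
    # cumulative endpoints t_1, ..., t_N
    t, acc = [], 0.0
    for Xj in sizes:
        acc += Xj / X
        t.append(acc)
    selected = []
    while len(selected) < n:
        r = rng()
        # find j with r in [t_{j-1}, t_j)
        j = next(k for k, tk in enumerate(t, start=1) if r < tk)
        if j not in selected:      # otherwise skip and draw again
            selected.append(j)
    return selected
```

Theorem 2 below shows that this procedure selects an ordered sample with exactly the probability given in Theorem 1.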
THEOREM 2. Using the skipping method described above, the probability of selecting n distinct units U_{i_1}, ⋯, U_{i_n} (in this order) is

P(U_{i_1}, ⋯, U_{i_n}) = ∏_{k=1}^{n} X_{i_k} / (X − ∑_{r=1}^{k−1} X_{i_r}).

Proof: Let
B(j_2, j_3, ⋯, j_n) denote the event that the first random number selected is in the interval [t_{i_1−1}, t_{i_1}), the next j_2 random numbers are in [t_{i_1−1}, t_{i_1}), the next random number is in [t_{i_2−1}, t_{i_2}) where i_2 ≠ i_1, the next j_3 random numbers are in

[t_{i_1−1}, t_{i_1}) ∪ [t_{i_2−1}, t_{i_2}),

followed by a random number in

[t_{i_3−1}, t_{i_3}),

where i_3 ≠ i_2, i_3 ≠ i_1, and continuing until the (n + ∑_{r=2}^{n} j_r)th random number is in [t_{i_n−1}, t_{i_n}). It is easy to see that the event that the distinct units U_{i_1}, ⋯, U_{i_n} are selected is equal to the following disjoint union:

⋃_{j_2=0}^{∞} ⋯ ⋃_{j_n=0}^{∞} B(j_2, ⋯, j_n).
Thus, since each random number falls in [t_{i−1}, t_i) with probability p_i = X_i/X, independently of the others,

P(U_{i_1}, ⋯, U_{i_n}) = ∑_{j_2=0}^{∞} ⋯ ∑_{j_n=0}^{∞} P(B(j_2, ⋯, j_n))

= ∑_{j_2=0}^{∞} ⋯ ∑_{j_n=0}^{∞} (X_{i_1}/X) ∏_{r=2}^{n} ( ((X_{i_1} + ⋯ + X_{i_{r−1}})/X)^{j_r} (X_{i_r}/X) )

= (X_{i_1}/X) ∏_{r=2}^{n} ( (X_{i_r}/X) · 1/(1 − (X_{i_1} + ⋯ + X_{i_{r−1}})/X) )

= ∏_{k=1}^{n} X_{i_k} / (X − ∑_{r=1}^{k−1} X_{i_r}),

the next-to-last equality by summing each geometric series. Q.E.D.
Theorem 2 justifies the use of the skipping method in WOR probability proportional to size sampling, and this method eliminates the need to recompute the probabilities and probability intervals for each successive observation. A few words of a preliminary nature are in order about the desirability of this kind of sampling. Most frequently one has other, or previous, information about every unit of the population, and in many cases the numerical results of this previous information are roughly proportional to the numerical characteristics one anticipates observing. Schematically this is represented as follows:
U:  U_1  U_2  ⋯  U_N
x:  X_1  X_2  ⋯  X_N
y:  Y_1  Y_2  ⋯  Y_N.
Here, every X_i is already known. The function x might be the results of a previous census, and hence it is already known. If y is approximately proportional to x, and if one wishes to estimate Y = ∑_{i=1}^{N} Y_i, one would certainly expect more accuracy by sampling the units not with equal probabilities but with probabilities proportional to the values of y, i.e., probabilities proportional to the values of x. This advantage will be given in a more precise form in a later section.
EXERCISES
1. Suppose a population U, preliminary information x, and y are as follows:
U:  U_1   U_2   U_3   U_4   U_5   U_6   U_7
x:  31.4  62.9  17.1  18.2  33.0  71.1  65.2
y:  48.1  85.2  30.6  25.1  51.3  100.1 99.6
(i) If one is selecting a sample of size 3 from U by WR probability proportional to size, find the probability that the sample is (U_6, U_3, U_6).

(ii) If one is selecting a sample of size 3 from U by WOR probability proportional to size, find the probability that the sample is (U_2, U_6, U_3).
2. In Problem 1 (ii) find the probability that sampling is terminated on or before the fifth observation when the skipping method is used.
3. In Problem 1 (ii) find the probability that U_6 is obtained before the fourth trial and U_3 is selected after the fourth trial.
7.2 WR Probability Proportional to Size Sampling
If we have a population
U:  U_1  ⋯  U_N
x:  X_1  ⋯  X_N
y:  Y_1  ⋯  Y_N,
where we have reason to believe that x is "almost" proportional to y, so that we wish to sample with probabilities proportional to size, then we shall refer to x as a predictor of y. We shall abbreviate our notation for the above population by simply writing (U; x, y). In this section we obtain a formula for an unbiased estimate of Y when our
7.2. W R PROBABILITY PROPORTIONAL TO... 123
probability proportional to size sampling is done WR. We shall also derive a formula for the variance of this estimate and then obtain an unbiased estimate of this variance.
Given (U; x, y) above, we assign to unit U_i the probability p_i = X_i/X. The function y may be regarded as a random variable; its density is seen to be

P[y = t] = ∑ {p_i : Y_i = t},

and its expectation is

E(y) = ∑_{r=1}^{N} Y_r X_r / X.
Let y_1, ⋯, y_n denote the numerical outcomes of our sample of size n in WR probability proportional to size sampling. Then y_1, ⋯, y_n are independent random variables, each with the same density as y. In addition, let u_i denote the number of the unit selected on the ith trial, 1 ≤ i ≤ n. Then u_1, ⋯, u_n are seen to be independent random variables with common density

P[u_1 = i] = X_i/X = p_i, 1 ≤ i ≤ N.
For 1 ≤ i ≤ n, let us define the random variable p_i^* by

p_i^* = ∑_{r=1}^{N} p_r I_{[u_i = r]}.

One might refer to p_i^* as the probability of selecting the unit obtained in the ith observation.
THEOREM 1. An unbiased estimate of Y in WR probability proportional to size sampling on (U; x, y) is Ŷ defined by

Ŷ = (1/n) ∑_{i=1}^{n} y_i/p_i^*,

and its variance is

Var(Ŷ) = (1/n) ∑_{r=1}^{N} p_r (Y_r/p_r − Y)².
Proof: For each i, the random variables y_i and p_i^* depend only on the unit selected on the ith trial. Since the sampling is WR, it follows that the n random variables y_1/p_1^*, ⋯, y_n/p_n^* are independent and all have the same density. Observe that

E(y_1/p_1^*) = ∑_{r=1}^{N} (Y_r/p_r) p_r = ∑_{r=1}^{N} Y_r = Y.

Thus

E(Ŷ) = (1/n) ∑_{r=1}^{n} E(y_r/p_r^*) = Y,

proving that Ŷ is an unbiased estimate of Y. Note that

Var(y_1/p_1^*) = E((y_1/p_1^* − Y)²) = ∑_{r=1}^{N} p_r (Y_r/p_r − Y)²,

from which the formula for Var(Ŷ) in the theorem follows. Q.E.D.
REMARK 1. Let us consider the special case when X_i = 1, 1 ≤ i ≤ N. In this case p_r = 1/N, 1 ≤ r ≤ N, and p_i^* is the constant 1/N. In this case,

Ŷ = (N/n) ∑_{i=1}^{n} y_i = N ȳ,

and

Var(Ŷ) = (N²/n) · (1/N) ∑_{r=1}^{N} (Y_r − Ȳ)², where Ȳ = Y/N.
These formulae are the same as those given in Theorems 1 and 3 in Section 6.2.
In WR sampling, an unbiased estimate of Var(Y) is fairly easy to obtain.
THEOREM 2. An unbiased estimate of Var(Ŷ) is the observable random variable Var̂(Ŷ) defined by

Var̂(Ŷ) = (1/(n(n−1))) ∑_{j=1}^{n} (y_j/p_j^* − Ŷ)².

Proof: Since y_1/p_1^*, ⋯, y_n/p_n^* are independent and identically distributed (i.e., they all have the same density), it follows that

Var(Ŷ) = (1/n²) ∑_{j=1}^{n} Var(y_j/p_j^*) = (1/n) Var(y_1/p_1^*).

By Theorem 1 in Section 6.3, the random variable

(1/(n−1)) ∑_{j=1}^{n} (y_j/p_j^* − (1/n) ∑_{i=1}^{n} y_i/p_i^*)²

is an unbiased estimate of Var(y_1/p_1^*). From this it follows that E(Var̂(Ŷ)) = Var(Ŷ). Q.E.D.
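The estimates of Theorems 1 and 2 can be sketched as follows, assuming the WR PPS sample is given as the observed y-values together with their selection probabilities p_i^* (the function name is illustrative, not from the text):

```python
def pps_wr_estimate(y_obs, p_obs):
    """Unbiased estimate of Y (Theorem 1) and of its variance (Theorem 2).

    y_obs -- observed y-values y_1, ..., y_n of a WR PPS sample
    p_obs -- corresponding selection probabilities p_i* = X_i / X
    """
    n = len(y_obs)
    ratios = [y / p for y, p in zip(y_obs, p_obs)]
    Y_hat = sum(ratios) / n
    # unbiased variance estimate: (1/(n(n-1))) * sum (y_j/p_j* - Y_hat)^2
    var_hat = sum((r - Y_hat) ** 2 for r in ratios) / (n * (n - 1))
    return Y_hat, var_hat
```

Observe that if Y_i is exactly proportional to X_i, every ratio y_j/p_j^* equals Y and the variance estimate is zero, in line with the discussion following Theorem 3.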
There is a formula for Var(Ŷ) other than that given in Theorem 1, to which we now turn our attention.
LEMMA 1. If a_1, ⋯, a_r, b_1, ⋯, b_r are any numbers, then

(∑_{j=1}^{r} a_j)(∑_{k=1}^{r} b_k) = ∑_{i=1}^{r} a_i b_i + ∑_{j≠k} a_j b_k.

Proof: One observes that

(∑_{j=1}^{r} a_j)(∑_{k=1}^{r} b_k) = ∑_{j=1}^{r} ∑_{k=1}^{r} a_j b_k
= ∑_{i=1}^{r} a_i b_i + ∑_{j≠k} a_j b_k,
from which the result follows. Q.E.D.
LEMMA 2. If A_1, ⋯, A_r, P_1, ⋯, P_r are real numbers, if P_i > 0 for 1 ≤ i ≤ r, and if A = ∑_{i=1}^{r} A_i and ∑_{j=1}^{r} P_j = 1, then

∑_{i=1}^{r} P_i (A_i/P_i − A)² = ∑_{i<j} (A_i/P_i − A_j/P_j)² P_i P_j.

Proof: We observe that

∑_{i=1}^{r} P_i (A_i/P_i − A)² = ∑_{i=1}^{r} A_i²/P_i − 2A ∑_{i=1}^{r} A_i + A² ∑_{i=1}^{r} P_i

= ∑_{i=1}^{r} A_i²/P_i − 2A² + A² = ∑_{i=1}^{r} A_i²/P_i − A².

On the other hand, by easy algebra and Lemma 1,

∑_{i<j} (A_i/P_i − A_j/P_j)² P_i P_j = ∑_{j≠k} ( (A_j²/P_j) P_k − A_j A_k )

= (∑_{j=1}^{r} A_j²/P_j)(∑_{k=1}^{r} P_k) − ∑_{i=1}^{r} A_i² − ( (∑_{j=1}^{r} A_j)² − ∑_{i=1}^{r} A_i² )

= ∑_{j=1}^{r} A_j²/P_j − A².
These two strings of equalities establish the equation in the lemma. Q.E.D.
THEOREM 3. If Ŷ is as in Theorem 1, then

Var(Ŷ) = (1/n) ∑_{i<j} (Y_i/p_i − Y_j/p_j)² p_i p_j.

Proof: This follows immediately from Theorem 1 and Lemma 2 by substituting in the latter p_i for P_i, Y_i for A_i and Y for A. Q.E.D.

The advantage of unequal probability sampling is seen by means of the formula for Var(Ŷ) given in Theorem 3. Suppose that Y_i is approximately K X_i for some constant K. Then Y_i/p_i = (Y_i/X_i) X ≈ KX for every i, so for every i < j the difference Y_i/p_i − Y_j/p_j is very close to zero, from which it follows that Var(Ŷ) is quite small.
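Since Theorems 1 and 3 give two expressions for the same variance, a small numerical check can be written. This sketch is our own illustration under the notation of this section (it is not code from the text):

```python
def var_pps_wr_theorem1(Y_vals, X_vals, n):
    """Var(Y_hat) by Theorem 1: (1/n) * sum_r p_r (Y_r/p_r - Y)^2."""
    X, Y = sum(X_vals), sum(Y_vals)
    p = [x / X for x in X_vals]
    return sum(pr * (Yr / pr - Y) ** 2 for Yr, pr in zip(Y_vals, p)) / n

def var_pps_wr_theorem3(Y_vals, X_vals, n):
    """Var(Y_hat) by Theorem 3:
    (1/n) * sum_{i<j} (Y_i/p_i - Y_j/p_j)^2 p_i p_j."""
    X = sum(X_vals)
    p = [x / X for x in X_vals]
    N = len(Y_vals)
    total = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            total += (Y_vals[i] / p[i] - Y_vals[j] / p[j]) ** 2 * p[i] * p[j]
    return total / n
```

For any population the two functions agree up to rounding, and both return zero when y is exactly proportional to x, which is the content of Lemma 2 and the remark above.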
EXERCISES
1. Consider the population
U:  U_1   U_2   U_3   U_4   U_5   U_6   U_7
x:  17.1  21.3  30.1  15.6  40.9  25.0  35.2
y:  12.3  13.2  19.8  12.1  28.3  15.2  26.8

Let y_1, y_2, y_3 denote a sample of size three on y in WR probability proportional to size sampling using the predictor x.

(i) Find the density of p_1^*.
(ii) Compute E(p_1^*).
(iii) Compute Y.
(iv) Find the density of y_2/p_2^*.
(v) Compute E(y_2/p_2^*) and Var(y_2/p_2^*).
(vi) Compute E(Ŷ).
(vii) Compute Var(Ŷ) by the formulae in Theorems 1 and 3.
2. In Problem 1, suppose one does simple random sampling WR. In this case N ȳ is an unbiased estimate of Y. Compute Var(N ȳ), and compare it with the answer you obtained in 1(vii).
3. For population (U; x, y) with probability proportional to size sampling, prove that E(y_1) = ∑_{i=1}^{N} Y_i X_i / X.
4. Suppose (U; x, y) satisfies the following: there exists a constant K > 0 and a small number ε > 0 such that

K(1 − ε) ≤ Y_i/X_i ≤ K(1 + ε) for 1 ≤ i ≤ N

(i.e., the ratios {Y_i/X_i} are all close to K). If Ŷ is the unbiased estimate of Y given in Theorem 1, prove that

Var(Ŷ) ≤ (2/n) K² X² ε².
7.3 WOR Probability Proportional to Size Sampling
Surely we must be able to do better by doing probability proportional to size sampling without replacement rather than with replacement. We show in this section that this can be done. We shall let (U; x, y), {p_i}, {u_i} and {y_i} be as in Section 7.2. Our basic assumptions are that Y_i/X_i is not identically constant in i, that we are taking a sample of size n by WOR probability proportional to size sampling, and that X_i > 0 for all i.
To begin with, let us define

t_1 = y_1/p_1^*,

where, as before,

p_1^* = ∑_{j=1}^{N} p_j I_{[u_1 = j]}.
PROPOSITION 1. If t_1 is as defined above, then

t_1 = ∑_{i=1}^{N} (Y_i / (X_i/X)) I_{[u_1 = i]} and E(t_1) = Y.
7.3. WOR PROBABILITY PROPORTIONAL TO ... 129
Proof: Recall that y_1 can be written as y_1 = ∑_{j=1}^{N} Y_j I_{[u_1 = j]}. Thus for any elementary event ω, if ω ∈ [u_1 = j], then y_1(ω) = Y_j and p_1^*(ω) = p_j = X_j/X, i.e., (y_1/p_1^*)(ω) = Y_j/(X_j/X), which yields the first conclusion. The second conclusion follows from Theorem 1 in Section 7.2 by taking n = 1. Q.E.D.
Next, let us define

t_2 = y_1 + ∑_{i=1, i≠u_1}^{N} (Y_i / (X_i/(X − X_{u_1}))) I_{[u_2 = i]},

and, in general, for 2 ≤ j ≤ n we define

t_j = y_1 + ⋯ + y_{j−1} + ∑_{i∉{u_1, ⋯, u_{j−1}}} (Y_i / (X_i/(X − X_{u_1} − ⋯ − X_{u_{j−1}}))) I_{[u_j = i]}.

Let t̄_j = (t_1 + ⋯ + t_j)/j. We should pause to note what t_j means. It is the sum of the y-values of the first j − 1 units removed plus, by Proposition 1, an unbiased estimate of the sum of the Y-values of the remaining N − j + 1 units.
THEOREM 1. For 2 ≤ j ≤ n, t_j and t̄_j are unbiased estimates of Y.
Proof: Let us compute E(t_j | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}). Note that for 1 ≤ i ≤ j − 1,

E(y_i | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}) = Y_{k_i}.

Also, for i ∉ {k_1, ⋯, k_{j−1}},

P([u_j = i] | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}) = X_i / (X − ∑_{r=1}^{j−1} X_{k_r}).

Thus we obtain

E(t_j | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}) = Y_{k_1} + ⋯ + Y_{k_{j−1}} + ∑_{i∉{k_1, ⋯, k_{j−1}}} Y_i = Y.
Hence E(t_j) = Y. This proves that each t_j is unbiased. Easily,

E(t̄_j) = E((t_1 + ⋯ + t_j)/j) = Y.
Q.E.D.
The statistic t̄_n is what we intend to use as an estimate of Y. The first question that arises is whether this estimate is "better" than the estimate Ŷ obtained in Section 7.2, where the sampling was done WR. By the definition of Ŷ in Section 7.2 it follows that Var(Ŷ) = Var(t_1)/n. Thus we shall have proved t̄_n to be a "better" estimate than Ŷ when we have proved that Var(t̄_n) < Var(t_1)/n.
THEOREM 2. If {t_i, 1 ≤ i ≤ n} are as defined above and if n ≥ 2, then t_i and t_j have correlation zero for i ≠ j, and

Var(t̄_n) < (1/n) Var(t_1).
Proof: Let 2 ≤ j ≤ n, and let k_1, ⋯, k_{j−1} be j − 1 distinct integers in {1, ⋯, N}. By what was proved in the proof of Theorem 1,

E(t_j | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}) = Y for 2 ≤ j ≤ n.

Thus the random variable

E(t_j | u_1, ⋯, u_{j−1}) = ∑ E(t_j | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}) I_{∩_{r=1}^{j−1} [u_r = k_r]} = Y ∑ I_{∩_{r=1}^{j−1} [u_r = k_r]} = Y,

where both sums extend over all choices of distinct k_1, ⋯, k_{j−1}; i.e., E(t_j | u_1, ⋯, u_{j−1}) is a constant. We use this fact now to show that if 1 ≤ i < j ≤ n, then Cov(t_i, t_j) = 0. Indeed, by properties of conditional expectation established in Chapter 4, since t_i is a function of u_1, ⋯, u_i, and since i < j, we have

E(t_i t_j) = E(E(t_i t_j | u_1, ⋯, u_{j−1})) = E(t_i E(t_j | u_1, ⋯, u_{j−1})) = E(t_i Y) = Y E(t_i) = Y²,
this last equality by Theorem 1. Since E(t_i) = E(t_j) = Y, then Cov(t_i, t_j) = E(t_i t_j) − E(t_i)E(t_j) = Y² − Y² = 0, i.e., the covariance of t_i and t_j is zero. We next show that for 2 ≤ i ≤ n, Var(t_i) < Var(t_1). We first recall the theorem proved in Chapter 4: if U is a random variable, and if H is any vector random variable, then

Var(U) = E(Var(U|H)) + Var(E(U|H)).

To apply this result here, let U = t_i and let H be the random vector whose coordinates are u_1, ⋯, u_{i−1}. As shown earlier in this proof, E(t_i | u_1, ⋯, u_{i−1}) = Y, which is a constant random variable, and thus Var(E(t_i | u_1, ⋯, u_{i−1})) = 0. This implies by the above-recalled result that
Var(t_i) = E(Var(t_i | u_1, ⋯, u_{i−1}))

= E( ∑ { E(t_i² | u_1 = k_1, ⋯, u_{i−1} = k_{i−1}) − (E(t_i | u_1 = k_1, ⋯, u_{i−1} = k_{i−1}))² } I_{∩_{r=1}^{i−1} [u_r = k_r]} ).
For fixed k_1, ⋯, k_{i−1}, the expression inside the curly brackets, {·}, is the variance of y_i/p_i^* when sampling WOR and with probability proportional to size after units U_{k_1}, ⋯, U_{k_{i−1}} have been removed. By Theorem 3 in Section 7.2, this conditional variance equals

∑_{1≤u<v≤N, u,v∉{k_1, ⋯, k_{i−1}}} X_u X_v (Y_u/X_u − Y_v/X_v)²,

since the conditional selection probabilities are p′_w = X_w/(X − X_{k_1} − ⋯ − X_{k_{i−1}}) and p′_u p′_v (Y_u/p′_u − Y_v/p′_v)² = X_u X_v (Y_u/X_u − Y_v/X_v)². By the same computation,

Var(t_1) = ∑_{1≤u<v≤N} p_u p_v (Y_u/p_u − Y_v/p_v)² = ∑_{1≤u<v≤N} X_u X_v (Y_u/X_u − Y_v/X_v)².

Hence the expression in the curly brackets is a sum over fewer pairs than the sum for Var(t_1). Because Y_u/X_u is assumed to be not identically constant in u, there is a pair u < v with X_u X_v (Y_u/X_u − Y_v/X_v)² > 0; for any choice of k_1, ⋯, k_{i−1} containing u, this positive term is omitted, so with positive probability the expression in the curly brackets is strictly less than

∑_{1≤u<v≤N} X_u X_v (Y_u/X_u − Y_v/X_v)² = Var(t_1).
Thus

Var(t_i) = E(Var(t_i | u_1, ⋯, u_{i−1})) < Var(t_1) for 2 ≤ i ≤ n.

Now, putting together everything that we have proved so far, we have

Var(t̄_n) = (1/n²) Var(∑_{i=1}^{n} t_i)

= (1/n²) ( ∑_{i=1}^{n} Var(t_i) + 2 ∑_{1≤i<j≤n} Cov(t_i, t_j) )

= (1/n²) ∑_{i=1}^{n} Var(t_i) < (1/n²) · n Var(t_1) = (1/n) Var(t_1).
Q.E.D.
Comparing this theorem with Theorem 1 in Section 7.2, we see that WOR sampling yields a smaller variance for the unbiased estimate than would have been obtained for the unbiased estimate in WR sampling obtained in Section 7.2. Last, we shall need an unbiased estimate of Var(t̄_n).
THEOREM 3. An unbiased estimate of Var(t̄_n) is

Var̂(t̄_n) = (1/(n(n−1))) ∑_{i=1}^{n} (t_i − t̄_n)².

Proof: Recall that in the proof of Theorem 2 we proved that E(t_i t_j) = Y² for i ≠ j. Let us define

Ỹ² = (1/(n(n−1))) ∑_{i≠j} t_i t_j.

Then E(Ỹ²) = Y². Now let us define Var̂(t̄_n) = t̄_n² − Ỹ². Then

E(Var̂(t̄_n)) = E(t̄_n²) − Y² = E(t̄_n²) − (E(t̄_n))² = Var(t̄_n).
But

t̄_n² − Ỹ² = (1/n²)(∑_{i=1}^{n} t_i)² − (1/(n(n−1))) ( (∑_{i=1}^{n} t_i)² − ∑_{i=1}^{n} t_i² )

= (1/(n²(n−1))) ( n ∑_{i=1}^{n} t_i² − (∑_{i=1}^{n} t_i)² )

= (1/(n(n−1))) ( ∑_{i=1}^{n} t_i² − n t̄_n² )

= (1/(n(n−1))) ∑_{i=1}^{n} (t_i − t̄_n)².

Q.E.D.
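The estimators t_1, ⋯, t_n, t̄_n and the variance estimate of Theorem 3 can be sketched as follows, assuming the WOR PPS sample is given as an ordered list of selected unit indices (the helper and its names are ours, not the text's):

```python
def wor_pps_t_estimates(Y_vals, X_vals, selected):
    """Compute t_1, ..., t_n and t_bar_n for a WOR PPS sample.

    selected -- 0-based indices of the units drawn, in order
    Returns (t_list, t_bar, var_hat) where var_hat is the unbiased
    estimate (1/(n(n-1))) * sum (t_i - t_bar)^2 from Theorem 3.
    """
    X = sum(X_vals)
    t_list = []
    removed_y, removed_x = 0.0, 0.0
    for idx in selected:
        # t_j: y-values already removed, plus the current observation
        # divided by its conditional selection probability X_i/(X - removed)
        p_cond = X_vals[idx] / (X - removed_x)
        t_list.append(removed_y + Y_vals[idx] / p_cond)
        removed_y += Y_vals[idx]
        removed_x += X_vals[idx]
    n = len(t_list)
    t_bar = sum(t_list) / n
    var_hat = (sum((t - t_bar) ** 2 for t in t_list) / (n * (n - 1))
               if n > 1 else float("nan"))
    return t_list, t_bar, var_hat
```

When y is exactly proportional to x, every t_j equals Y and the variance estimate is zero, which is consistent with Theorem 2's comparison against WR sampling.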
EXERCISES
1. Prove: If U and V are random variables, and if E(U | V = v) = c for all v ∈ range(V), where c is some constant, then E(U|V) = c.
2. Prove: If Z is a random variable, if

X = ∑_{z∈range(Z)} x_z I_{[Z=z]},

and if

Y = ∑_{z∈range(Z)} y_z I_{[Z=z]},

where y_z ≠ 0 for all z ∈ range(Z), then

X/Y = ∑_{z∈range(Z)} (x_z/y_z) I_{[Z=z]}.
3. Consider the population
W: J7i «73 % ^4 #5 #6 x : 8.1 12.2 20.1 10.3 8.2 6.8 y : 11.9 19.3 28.1 17.0 11.1 10.3.
Suppose one were to sample twice WOR with probability proportional to size sampling.

(i) Compute Var(t̄_2), where t̄_2 is the unbiased estimate of Y.

(ii) Suppose in sampling (WOR) the units selected were first U_3 and second U_1. Compute t_1, t_2 and t̄_2.
4. Prove: If 1 ≤ i ≤ j − 1, then

E(y_i | u_1 = k_1, ⋯, u_{j−1} = k_{j−1}) = Y_{k_i}.
Chapter 8
Linear Relationships
8.1 Linear Regression Model
One of the happiest circumstances there can be in the problem of estimating Y for a population (U; x, y) with a predictor variable x is that in which a linear relationship connects x and y. This means that there are constants a and b such that y = a + bx. Of course, in practice this never occurs in a precise manner; yet there are many times when there is good reason to believe such a relationship exists in an approximate manner. This chapter is devoted to some advantages that can be reaped when there is reason to believe that such a linear relationship is close to reality. In this section we consider the problem of estimating a and b by means of a simple random sample on U.
Given a simple random sample u_1, ⋯, u_n on U, either WR or WOR, we address ourselves to the problem of finding numbers a and b such that y and a + bx are "close" to each other in some sense, so that we could almost write y = a + bx. Now y_i is recalled to be defined by y_i(U_r) = Y_r, 1 ≤ i ≤ n, 1 ≤ r ≤ N, or y_i(u_s) = Y_{u_s}, 1 ≤ s ≤ n. We might define x_i by x_i(u_s) = X_{u_s}. Thus a and b should be such that we could almost write y_i = a + bx_i for 1 ≤ i ≤ n. The wish to write these equations is the same as that of making all absolute values of differences |y_i − a − bx_i| as small as possible. One mathematically precise way of stating this is the following problem: to find a and b as functions of x_1, ⋯, x_n, y_1, ⋯, y_n that will minimize the sum of the squares of the
136 CHAPTER 8. LINEAR RELATIONSHIPS
errors, Q, where

Q = ∑_{i=1}^{n} (y_i − a − bx_i)².
It is this problem that we solve now.
LEMMA 1. If c_1, ⋯, c_n are real numbers, and if c̄ is defined by c̄ = (c_1 + ⋯ + c_n)/n, then the value of m that minimizes ∑_{i=1}^{n} (c_i − m)² is m = c̄.
Proof: We denote T = ∑_{i=1}^{n} (c_i − m)². Then

T = ∑_{i=1}^{n} ((c_i − c̄) + (c̄ − m))² = ∑_{i=1}^{n} (c_i − c̄)² + 2(c̄ − m) ∑_{i=1}^{n} (c_i − c̄) + n(c̄ − m)².

Since ∑_{i=1}^{n} (c_i − c̄) = 0, it follows that the smallest value of T is achieved when m = c̄. Q.E.D.

Let us denote

x̄ = (1/n) ∑_{i=1}^{n} x_i and ȳ = (1/n) ∑_{i=1}^{n} y_i,

σ_x² = (1/n) ∑_{i=1}^{n} x_i² − x̄² and σ_x = √(σ_x²),

σ_y² = (1/n) ∑_{i=1}^{n} y_i² − ȳ² and σ_y = √(σ_y²), and

ρ_{x,y} = ( (1/n) ∑_{i=1}^{n} x_i y_i − x̄ ȳ ) / (σ_x σ_y).

THEOREM 1. The values of b and a which minimize Q above are given, respectively, by

b̂ = (σ_y/σ_x) ρ_{x,y} and â = ȳ − b̂ x̄.
8.1. LINEAR REGRESSION MODEL 137
This "regression equation" can be written in the form

(y − ȳ)/σ_y = ρ_{x,y} (x − x̄)/σ_x.
Proof: Rewriting Q as Q = ∑_{i=1}^{n} (y_i − bx_i − a)², and denoting c_i = y_i − bx_i, we obtain from Lemma 1 that whatever value b is, Q is minimized when a is set equal to

(1/n) ∑_{i=1}^{n} (y_i − bx_i) = ȳ − b x̄.

Thus we need only find the value of b that minimizes

Q = ∑_{i=1}^{n} ((y_i − ȳ) − b(x_i − x̄))².

Now

Q = ∑_{i=1}^{n} (y_i − ȳ)² − 2b ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) + b² ∑_{i=1}^{n} (x_i − x̄)².

Since Q is a sum of squares, Q ≥ 0. But Q is quadratic in b, i.e., Q is of the form

Q = Ab² + 2Bb + C.

We can rearrange this expression for Q to obtain

Q = A(b + B/A)² + (AC − B²)/A.

Since A > 0, it follows that the value of b that minimizes Q is b = −B/A. Thus

b̂ = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ∑_{i=1}^{n} (x_i − x̄)².
Since ∑_{i=1}^{n} (x_i − x̄) = 0, we obtain, by easy algebra, b̂ = (σ_y/σ_x) ρ_{x,y}, which completes the proof of the first conclusion of the theorem. The second conclusion follows by some easy algebraic manipulation. Q.E.D.
It should be noticed that â and b̂ are functions defined over the fundamental probability space over which the random variables x_1, ⋯, x_n, y_1, ⋯, y_n are defined, and thus are themselves random variables. They are called least squares solutions for a and b.
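The least squares solutions of Theorem 1 can be sketched as follows (an illustrative helper of our own; it uses the fact that (σ_y/σ_x) ρ_{x,y} equals the ratio of the sample covariance of x and y to σ_x²):

```python
def least_squares(xs, ys):
    """Least squares solutions a_hat, b_hat minimizing
    Q = sum (y_i - a - b x_i)^2, as in Theorem 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    var_x = sum(x * x for x in xs) / n - x_bar ** 2      # sigma_x^2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
    b_hat = cov_xy / var_x       # equals (sigma_y/sigma_x) * rho_{x,y}
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat
```

If the data lie exactly on a line y = a + bx, the function recovers a and b exactly, up to rounding.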
EXERCISES
1. The equation y = a + bx considered in this section is called the regression line of y on x. Find the least squares estimates of the constants in the regression line of x on y.
2. In our least squares treatment of x, y, it might be more reasonable to assume that y is a second degree polynomial function of x, i.e., y = a + bx + cx². If so, find the least squares estimates â, b̂, ĉ of a, b, c.

3. Prove: if c_1, ⋯, c_n are numbers, and if c̄ = (c_1 + ⋯ + c_n)/n, then ∑_{i=1}^{n} (c_i − c̄) = 0.
4. Prove: if X is a random variable, then the value of m that minimizes E((X - m)2) is m = E(X).
8.2 Ratio Estimation

In this and the subsequent sections we shall concern ourselves with special cases of the linear regression model y = a + bx plus a small error, as considered in Section 1. In this section we shall deal with the case a = 0. When this situation holds, y = bx, or Y_i = bX_i plus a small error. This is just the situation where probability proportional to size sampling is so effective. Another effective method here, if one chooses to do simple random sampling either WR or WOR, is the ratio estimate, which is the subject of this section. As before, we consider the population (U; x, y) with predictor. A sample of size n is taken, yielding observations (x_1, y_1), ⋯, (x_n, y_n), and the problem is to estimate Y.
8.2. RATIO ESTIMATION 139
Definition: The ratio estimate Ŷ of Y is defined to be

Ŷ = (ȳ/x̄) X.

Roughly speaking, ȳ estimates Ȳ, x̄ estimates X̄, so ȳ/x̄ estimates Ȳ/X̄ = Y/X, and hence the ratio estimate should estimate Y. A word of caution is in order here. It is not necessarily true that the expectation of a quotient of two random variables is equal to the quotient of the expectations. Hence the ratio estimate Ŷ of Y is not necessarily unbiased. In Theorem 2 we shall give conditions under which the ratio estimate is unbiased.
The literature on ratio estimates appears to some as unsatisfying. The results are arrived at without sufficient rigor. Also, the approximations are found to be wanting. In Theorem 1 below we shall show that if the assumption of the model that y/x is uniformly close to b is true, then the ratio estimate Y of Y attains remarkable precision and has very small variance. This theorem is not of much practical use, but an understanding of it should give the student the courage to use ratio estimates when the model above applies.
But first we need a lemma.
LEMMA 1. If V is a random variable satisfying P[a ≤ V ≤ b] = 1, then

Var(V) ≤ (b − a)².
Proof: Suppose the density of V is P[V = v_i] = p_i, 1 ≤ i ≤ r. Then ∑_{i=1}^{r} p_i = 1 and a ≤ v_i ≤ b for 1 ≤ i ≤ r. Thus a p_i ≤ v_i p_i ≤ b p_i, and summing over i, we obtain a ≤ E(V) ≤ b. These last two inequalities yield

a − b ≤ v_i − E(V) ≤ b − a for 1 ≤ i ≤ r.

Thus

(v_i − E(V))² ≤ (b − a)² for 1 ≤ i ≤ r,

and

Var(V) = ∑_{i=1}^{r} (v_i − E(V))² p_i ≤ (b − a)² ∑_{i=1}^{r} p_i = (b − a)².

Q.E.D.
THEOREM 1. Let (U; x, y) be a population with positive predictor x-values and positive y-values, and suppose a sample of size n is obtained, either WR or WOR. Suppose there exist numbers δ, K, where 0 < δ < 1 and K > 0, such that (1 − δ)K ≤ y/x ≤ (1 + δ)K. Then the ratio estimate Ŷ = (ȳ/x̄)X of Y satisfies

(i) ((1 − δ)/(1 + δ)) Y ≤ Ŷ ≤ ((1 + δ)/(1 − δ)) Y,

and

(ii) Var(Ŷ) ≤ (4δ/(1 − δ²))² Y².
Proof: The hypotheses imply

(1 − δ)K X_i ≤ Y_i ≤ (1 + δ)K X_i, 1 ≤ i ≤ N.

Summing both sides of each inequality from 1 through N, we have

(1 − δ)KX ≤ Y ≤ (1 + δ)KX.

Similar relationships are seen to hold for the observations x_1, ⋯, x_n, y_1, ⋯, y_n, namely, (1 − δ)K x_i ≤ y_i ≤ (1 + δ)K x_i for 1 ≤ i ≤ n, from which one obtains

(1 − δ)K x̄ ≤ ȳ ≤ (1 + δ)K x̄,

or

(1 − δ)K ≤ ȳ/x̄ ≤ (1 + δ)K.

Multiplying this last inequality through by X we obtain

(1 − δ)KX ≤ Ŷ ≤ (1 + δ)KX.

Now, from the above inequalities on Ŷ and Y,

Ŷ ≤ (1 + δ)KX = ((1 + δ)/(1 − δ)) (1 − δ)KX ≤ ((1 + δ)/(1 − δ)) Y,
and

Ŷ ≥ (1 − δ)KX = ((1 − δ)/(1 + δ)) (1 + δ)KX ≥ ((1 − δ)/(1 + δ)) Y,

from which we obtain conclusion (i). We finally apply Lemma 1 to conclusion (i) to obtain

Var(Ŷ) ≤ ( ((1 + δ)/(1 − δ)) Y − ((1 − δ)/(1 + δ)) Y )² = (4δ/(1 − δ²))² Y².

Q.E.D.
Truly, more general theorems than Theorem 1 can be proved. However, Theorem 1 should give a good indication of how accurate the ratio estimate can be when x and y have an almost constant ratio. It is important to note that in using the ratio estimate we are taking advantage of the fact that y/x is very close to being constant, while y itself need not be.
It should be noted that the ratio estimate need not be unbiased, and examples can easily be constructed where ȳ/x̄ is not an unbiased estimate of Y/X. Just for curiosity's sake we can state and prove a theorem giving conditions under which E(ȳ/x̄) = Y/X. One should continue to keep in mind that in our model (U; x, y) of a population with predictor, all X_i are assumed to be positive.
THEOREM 2. Let (U; x, y) be a population with a predictor, and suppose that a simple random sample (x_1, y_1), …, (x_n, y_n) of size n is selected WR. Suppose there exists a constant β > 0 such that if z_i is defined by z_i = y_i − βx_i, then z_i and x_i are independent random variables, and E(z_i) = 0, 1 ≤ i ≤ n. Then the ratio estimate Ŷ = (ȳ/x̄)X is an unbiased estimate of Y.
Proof: Recall that for each i, E(y_i) = Ȳ, E(x_i) = X̄ and (by hypothesis) E(z_i) = 0, from which it follows that E(ȳ) = Ȳ, E(x̄) = X̄ and E(z̄) = 0 (where z̄ = (z_1 + ⋯ + z_n)/n). Now, from the fact that y_i = βx_i + z_i, 1 ≤ i ≤ n, we obtain ȳ = βx̄ + z̄, and, taking expectations, we have

Ȳ = βX̄, or Y = βX.
Thus

E(ȳ/x̄) = E(β + z̄/x̄) = β + E(z̄/x̄).

We now apply three results from Chapter 4. We first observe that

E(z̄/x̄) = E(E(z̄/x̄ | x_1, …, x_n)).

Next, since 1/x̄ is a function of x_1, …, x_n, it follows that

E(z̄/x̄ | x_1, …, x_n) = (1/x̄) E(z̄ | x_1, …, x_n),

or

E(z̄/x̄) = E((1/x̄) E(z̄ | x_1, …, x_n)).

Since by hypothesis {x_1, …, x_n} and {z_1, …, z_n} are independent, it follows that z̄ and {x_1, …, x_n} are independent, and thus

E(z̄ | x_1, …, x_n) = E(z̄) = 0.

From this we obtain E(z̄/x̄) = 0, and hence E(ȳ/x̄) = β. But earlier we established that β = Y/X, and thus ȳ/x̄ is an unbiased estimate of Y/X; hence E(Ŷ) = E(ȳ/x̄)X = (Y/X)X = Y. Q.E.D.
Models do exist that satisfy the hypothesis of Theorem 2. For such a model, z is a random error whose expectation is zero. Thus unbiasedness of the ratio estimate occurs when y is a constant multiple of x plus an independent random error. An example of such a population is given in the following set of exercises.
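The situation of Theorem 2 is also easy to simulate. The sketch below builds a hypothetical 9-unit population (our own numbers, not from the text) in which z = y − 1.5x takes each of the values −2, 0, 2 once for every x-value, so that under a WR draw z and x are independent with E(z) = 0; a Monte Carlo check then shows the ratio estimate averaging out to Y.

```python
import random
random.seed(0)

# Hypothetical 9-unit population built so that z = y - 1.5x is independent
# of x and has mean zero -- the hypothesis of Theorem 2 with beta = 1.5.
beta = 1.5
units = [(x, beta * x + z) for x in (30, 40, 50) for z in (-2.0, 0.0, 2.0)]
X_total = sum(x for x, _ in units)      # 360
Y_total = sum(y for _, y in units)      # 540 = beta * X_total

n, reps = 3, 100_000
acc = 0.0
for _ in range(reps):
    sample = [random.choice(units) for _ in range(n)]   # WR simple random sample
    xbar = sum(x for x, _ in sample) / n
    ybar = sum(y for _, y in sample) / n
    acc += ybar / xbar * X_total                        # ratio estimate
print(acc / reps, Y_total)   # the Monte Carlo mean should be close to Y
```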
EXERCISES
1. Consider the following population:
U: U_1 U_2 U_3 U_4 U_5
x: 3.1 13.2 24.1 41.6 62.1
y: 3.5 14.8 27.5 46.2 71.42
A sample of size three is observed WOR.
(i) List all ten outcomes of selecting three units from among those in U.

(ii) Compute X and Y.

(iii) Find the smallest and largest values of the unbiased estimate Nȳ of Y.

(iv) Find the smallest and largest values of the ratio estimate (ȳ/x̄)X of Y.
2. Verify that the population in Problem 1 satisfies the hypotheses of Theorem 1 for K = 1.13 and δ = .03.
3. Consider the following population:
U: U_1 U_2 U_3
x: 2 3 5
y: 4 6 11

A sample of size two is taken WOR. Compute E(ȳ/x̄) and Y/X, and compare.
4. Consider the population:
U: U_1 U_2 U_3 U_4 U_5 U_6 U_7 U_8 U_9
x: 30 40 50 30 40 50 30 40 50
y: 44 59 74 45 60 75 46 61 76

Define the random variable z by y = 1.5x + z, i.e., z = y − 1.5x.

(i) Find the density of x.

(ii) Find the density of z.

(iii) Find the joint density of x and z.

(iv) Verify that x and z are independent random variables.

(v) Verify that E(z) = 0.
8.3 Unbiased Ratio Estimation

In Section 8.2 we treated the ratio estimate obtained by simple random sampling. It was not necessarily unbiased; we could not compute its variance, nor could we provide an unbiased estimate for its variance. However, if we are willing to modify our sampling by selecting our first unit by means of probability proportional to size sampling and then selecting n − 1 distinct units by WOR simple random sampling from the remaining N − 1 units, then it turns out that the ratio estimate is unbiased, we can determine the formula for the variance of this estimate, and we can find an unbiased estimate of this variance.
DEFINITION: Let (U; x, y) be a population with predictor. We shall say that a sample of size n is selected by p.p.a.s. sampling (probability proportional to aggregate size) if the first unit is selected by probability proportional to size sampling and the remaining n − 1 units of the sample are selected from the remaining N − 1 units of the population by WOR simple random sampling.
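The p.p.a.s. selection rule itself is straightforward to implement; the following is a minimal sketch (the function name and the example x-values are our own, not from the text).

```python
import random

def ppas_sample(indices, x, n, rng=random):
    # Sketch of p.p.a.s. selection: the first unit is drawn with probability
    # proportional to its x-value; the remaining n - 1 units are drawn WOR
    # by simple random sampling from the other N - 1 units.
    first = rng.choices(indices, weights=x, k=1)[0]
    rest = [i for i in indices if i != first]
    return [first] + rng.sample(rest, n - 1)

# Hypothetical example: five units labelled 1..5 with the x-values below.
x_values = [3.1, 13.2, 24.1, 41.6, 62.1]
print(ppas_sample(list(range(1, 6)), x_values, 3))
```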
Of use here and in subsequent results is the following lemma.
LEMMA 1. If a_1, …, a_N, b_1, …, b_N are any numbers, and if f(u, v, x, y) is any function of four variables, then, for fixed n ≤ N,

Σ_{1≤k_1<⋯<k_n≤N} Σ_{i=1}^n a_{k_i} = (N−1 choose n−1) Σ_{i=1}^N a_i

and

Σ_{1≤k_1<⋯<k_n≤N} Σ_{u<v} f(a_{k_u}, a_{k_v}, b_{k_u}, b_{k_v}) = (N−2 choose n−2) Σ_{i<j} f(a_i, a_j, b_i, b_j).
Proof: For each i ∈ {1, …, N}, there are (N−1 choose n−1) ways of selecting n − 1 other distinct integers in {1, …, N} \ {i} such that the n of them form an n-tuple k_1 < k_2 < ⋯ < k_n. Thus a_i is included in (N−1 choose n−1) sums on the left-hand side of the first equation. Similarly, take arbitrary
i, j with 1 ≤ i < j ≤ N; the pair a_i, a_j (and b_i, b_j) occurs together in (N−2 choose n−2) of the tuples (k_1, …, k_n), where 1 ≤ k_1 < ⋯ < k_n ≤ N. Q.E.D.
THEOREM 1. In p.p.a.s. sampling, the ratio estimate Ŷ = (ȳ/x̄)X is an unbiased estimate of Y.
Proof: Let {U_{i_1}, …, U_{i_n}} denote the event that units U_{i_1}, …, U_{i_n} are selected in any order, and let B_i denote the event that U_i is the unit selected first (by probability proportional to size sampling). Then
P({U_{i_1}, …, U_{i_n}}) = Σ_{r=1}^n P(B_{i_r} ∩ {U_{i_1}, …, U_{i_n}}) = Σ_{r=1}^n (X_{i_r}/X) · (1/(N−1 choose n−1)) = (Σ_{r=1}^n X_{i_r}) / (X (N−1 choose n−1)).
By a property of conditional expectation given in Chapter 4, by Lemma 1, and by this last equation we have
E(Ŷ) = Σ_{1≤i_1<⋯<i_n≤N} E(Ŷ | {U_{i_1}, …, U_{i_n}}) P({U_{i_1}, …, U_{i_n}})

= Σ_{1≤i_1<⋯<i_n≤N} (X Σ_{m=1}^n Y_{i_m} / Σ_{r=1}^n X_{i_r}) · (Σ_{r=1}^n X_{i_r}) / (X (N−1 choose n−1))

= (1/(N−1 choose n−1)) Σ_{1≤i_1<⋯<i_n≤N} Σ_{m=1}^n Y_{i_m} = Σ_{i=1}^N Y_i = Y,

the last step by Lemma 1.

Q.E.D.
We shall omit computation of Var(Ŷ) and shall proceed directly to obtain an unbiased estimate of Var(Ŷ). It should be noted that in so doing we need only find an unbiased estimate z of Y² and consider V̂ar(Ŷ) = Ŷ² − z. We shall need to define the random variable p*(N, n); it is a random variable that equals P({U_{i_1}, …, U_{i_n}}) whenever the event {U_{i_1}, …, U_{i_n}} occurs. This event was defined in the proof of Theorem 1. Thus
p*(N, n) = Σ_{1≤i_1<⋯<i_n≤N} P({U_{i_1}, …, U_{i_n}}) I_{{U_{i_1}, …, U_{i_n}}}.
THEOREM 2. In p.p.a.s. sampling, an unbiased estimate of Var(Ŷ) is

V̂ar(Ŷ) = Ŷ² − z/p*(N, n),

where

z = (1/(N−1 choose n−1)) Σ_{r=1}^n y_r² + (2/(N−2 choose n−2)) Σ_{u<v} y_u y_v.

Proof: We compute E(z/p*(N, n)) in two stages. By Lemma 1, we have
E((1/(N−1 choose n−1)) Σ_{r=1}^n y_r² / p*(N, n)) = (1/(N−1 choose n−1)) Σ_{1≤i_1<⋯<i_n≤N} Σ_{r=1}^n Y_{i_r}² = Σ_{i=1}^N Y_i².

Also, by Lemma 1,

E((2/(N−2 choose n−2)) Σ_{u<v} y_u y_v / p*(N, n)) = (2/(N−2 choose n−2)) Σ_{1≤i_1<⋯<i_n≤N} Σ_{u<v} Y_{i_u} Y_{i_v} = 2 Σ_{r<s} Y_r Y_s.

Thus E(z/p*(N, n)) = Σ_{i=1}^N Y_i² + 2 Σ_{r<s} Y_r Y_s = Y² = (E(Ŷ))². Hence

E(V̂ar(Ŷ)) = E(Ŷ²) − (E(Ŷ))² = Var(Ŷ).
Q.E.D.
The ratio estimate obtained for p.p.a.s. sampling thus has a decided advantage if one is able to take that first observation by probability proportional to size sampling.
EXERCISES
1. Prove: If N ≥ n ≥ 3, and if d_1, …, d_N are real numbers, then

Σ_{1≤i_1<⋯<i_n≤N} Σ_{u,v,w distinct} d_{i_u} d_{i_v} d_{i_w} = (N−3 choose n−3) Σ_{u,v,w distinct} d_u d_v d_w.
2. Prove: In p.p.a.s. sampling,
3. Consider the following population:
U: U_1 U_2 U_3 U_4 U_5 U_6
x: 20 23 17 14 31 26
y: 39 48 33 30 60 40

Suppose that in taking a sample of size 4 by p.p.a.s., the units obtained are U_5, U_3, U_6, U_1.
(i) Compute the unbiased estimate Ŷ of Y.

(ii) Compute the unbiased estimate, V̂ar(Ŷ), of Var(Ŷ).
4. In the proof of Theorem 1, prove that the n events
B_{i_1} ∩ {U_{i_1}, …, U_{i_n}}, …, B_{i_n} ∩ {U_{i_1}, …, U_{i_n}}

are disjoint and that {U_{i_1}, …, U_{i_n}} = ∪_{r=1}^n (B_{i_r} ∩ {U_{i_1}, …, U_{i_n}}).
5. The proof given for Theorem 2 relies on the following observation: if A_1, …, A_r are r disjoint events such that Σ_{j=1}^r P(A_j) = 1, if U and V are random variables defined by

U = Σ_{j=1}^r u_j I_{A_j} and V = Σ_{j=1}^r v_j I_{A_j},

where u_1, …, u_r, v_1, …, v_r are constants, and if v_j ≠ 0 for 1 ≤ j ≤ r, then

E(U/V) = Σ_{j=1}^r (u_j/v_j) P(A_j).
Prove that this observation is true.
8.4 Difference Estimation

We continue our investigation of estimation of Y when x and y are connected linearly, i.e., when the equation y = a + bx is almost true. In this section we shall deal with the special case where b = 1, i.e., when the difference y − x is almost constant, or Y_i − X_i is almost constant for 1 ≤ i ≤ N. If y − x is almost constant, its variance will be small. Hence the variance of an unbiased estimate of the difference Y − X ought to be small; adding this estimate to X, which we already know, would give us an unbiased estimate of Y with a relatively small variance.
THEOREM 1. In WOR simple random sampling on a population (U; x, y) with predictor, the observable random variable

Ŷ = (N/n) Σ_{i=1}^n (y_i − x_i) + X

is an unbiased estimate of Y, and

Var(Ŷ) = N² (1/n)(1 − n/N)(S_x² + S_y² − 2ρ S_x S_y),

where ρ is the correlation coefficient of x and y.
Proof: Since

Ŷ = N(ȳ − x̄) + X,

and since E(ȳ) = Ȳ and E(x̄) = X̄, we obtain E(Ŷ) = NȲ − NX̄ + X = Y − X + X = Y. Now

Var(Ŷ) = N² Var(ȳ − x̄) = N² (Var(ȳ) + Var(x̄) − 2 Cov(x̄, ȳ)).
By Theorem 4 in Section 6.2,

Var(ȳ) = (1/n)(1 − n/N) S_y²

and

Var(x̄) = (1/n)(1 − n/N) S_x²,

where

S_y² = (1/(N − 1)) Σ_{i=1}^N (Y_i − Ȳ)²

and

S_x² = (1/(N − 1)) Σ_{i=1}^N (X_i − X̄)².

Letting ρ denote the correlation coefficient of x and y, we have

Cov(x̄, ȳ) = ρ √(Var(x̄) Var(ȳ)) = (1/n)(1 − n/N) ρ S_x S_y.

Thus,

Var(Ŷ) = N² (1/n)(1 − n/N)(S_x² + S_y² − 2ρ S_x S_y).
Q.E.D.
THEOREM 2. An unbiased estimate of Var(Ŷ), where Ŷ is the difference estimate in Theorem 1, is

V̂ar(Ŷ) = N² {(1/n)(1 − n/N)(s_x² + s_y²) − 2(x̄ȳ − X̄ȳ)}.
Proof: Since the sampling is WOR, E(s_x²) = S_x² and E(s_y²) = S_y². Also,

E(x̄ȳ − X̄ȳ) = E(x̄ȳ) − X̄E(ȳ) = E(x̄ȳ) − X̄Ȳ = Cov(x̄, ȳ).

Thus E(V̂ar(Ŷ)) = Var(Ŷ). Q.E.D.
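The difference estimate and the variance estimate of Theorem 2 can be computed directly from a sample. The sketch below uses hypothetical numbers (not the exercise data) and assumes, as in the theorem, that the x-values, and hence X and X̄, are known for the whole population.

```python
from statistics import mean, variance

def difference_estimate(xs, ys, N, X_total):
    # Sketch of the WOR difference estimate Y^ = N(ybar - xbar) + X and the
    # unbiased variance estimate of Theorem 2:
    #   Var^(Y^) = N^2 [ (1/n)(1 - n/N)(s_x^2 + s_y^2) - 2(xbar - Xbar) ybar ].
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    Y_hat = N * (ybar - xbar) + X_total
    sx2, sy2 = variance(xs), variance(ys)     # sample variances, n - 1 divisor
    Xbar = X_total / N
    var_hat = N**2 * ((1 - n / N) * (sx2 + sy2) / n - 2 * (xbar - Xbar) * ybar)
    return Y_hat, var_hat

# Hypothetical population of N = 6 units with X = 120; WOR sample of size 3:
print(difference_estimate([10, 20, 30], [12, 21, 33], N=6, X_total=120))
```

Note that the covariance term x̄ȳ − X̄ȳ can make the estimate negative in unlucky samples; it is unbiased, not non-negative.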
EXERCISES
1. In Theorem 1, assume that the sampling is done WR, show that the difference estimate given there is still unbiased and derive the formula for the variance of Y.
2. Consider the following population:

U: U_1 U_2 U_3 U_4 U_5 U_6
x: 81 110 50 92 120 65
y: 84 111 52 96 122 68
A sample of size 3 is taken WOR, and the units selected are U_6, U_1, U_3.
(i) Compute the value of Nȳ.

(ii) Compute the value of the difference estimate,

Ŷ = (N/n) Σ_{i=1}^n (y_i − x_i) + X.

(iii) Compute V̂ar(Ŷ).
8.5 Which Estimate? An Advanced Topic

This section is for advanced readers only, namely, those who are acquainted with the multivariate normal distribution and the theory of linear models. Others may proceed directly to the next chapter.
Let us consider a very large population with predictor, (U; x, y), for which we know as usual the values of N, X_1, …, X_N and hence of X and X̄. We shall assume that a simple random sample of n units has been taken with replacement, yielding observations (x_1, y_1), (x_2, y_2), …, (x_n, y_n). Our problem as usual is to use this sample of n pairs to estimate the value of Y. We might have good reason in some cases to suspect that x and y are almost related by a linear relationship y = a + bx for some unknown constants a and b. More precisely, this relationship could be of the form
y = a + bx + z,
where z is an "error" random variable that is independent of x. If this were the case, and if we could use the n observations to conclude that a = 0, then by Theorem 2 in Section 8.2 the ratio estimate Ŷ = (ȳ/x̄)X would be an unbiased estimate of Y and could provide us with greater precision than Ŷ = Nȳ. Also, if this is the case, and if we could use the n observations to determine that b = 1, then we could use the difference estimate Ŷ = N(ȳ − x̄) + X as an unbiased estimate with greater precision than the usual Ŷ = Nȳ. Thus there are two tests of hypothesis that we would wish to make.
We are especially able to make these two tests of hypothesis if it is not too unreasonable to assume that (x_1, y_1), …, (x_n, y_n) is a sample from a bivariate normal population with unknown mean vector and unknown covariance matrix Σ. (Of course, this is never true, but it could be close, in which case the following is quite reasonable for practical purposes.) From this bivariate normality assumption it is known that
y_i = a + bx_i + z_i, 1 ≤ i ≤ n,
where a and b are constants, x_i and z_i are independent, and z_1, …, z_n are independent and identically distributed with common distribution function being normal with mean 0 and unknown variance σ² > 0. In order to determine if either linear model outlined above is true, we shall perform two tests of hypothesis: the first is to test the null hypothesis H_0: a = 0 vs. the alternative a ≠ 0, and the second is to test the null hypothesis H_0: b = 1 vs. the alternative b ≠ 1. In what follows below we shall use this notation:
y = (y_1, …, y_n)ᵀ, X = the n × 2 matrix whose ith row is (1, x_i), z = (z_1, …, z_n)ᵀ, and β = (a, b)ᵀ.
Thus the n equations written above may be written in matrix form as
y = Xβ + z.
In order to test the null hypothesis H_0: a = 0 vs. the alternative a ≠ 0, let us denote
X_0 = (x_1, …, x_n)ᵀ and γ = (b).
If the null hypothesis Ho : a = 0 is true, then the linear model becomes
y = X_0 γ + z.
The least squares estimate β̂ of β under the general linear model and γ̂ of γ under the null hypothesis are given by

β̂ = (XᵀX)⁻¹Xᵀy and γ̂ = (X_0ᵀX_0)⁻¹X_0ᵀy,

respectively.
respectively. The F-statistic is
\\y-Xf)\p and under the null hypothesis the distribution of T is the F-distribution with (l,ra — 2) degrees of freedom. The test is to reject HQ : a = 0 if T is too large. Note: if we accept Ho : a = 0, and if our assumptions hold, then by Theorem 2 in section 8.2 the ratio estimate of Y will have very little bias.
The test for the null hypothesis Ho : b = 1 has an extra ingredient. Under the assumption that (x, y) is bivariate normal, it follows that (y — x, x) is also bivariate normal. Hence one is able to write
y_i − x_i = c + d x_i + w_i, 1 ≤ i ≤ n,
where w_1, …, w_n are independent and identically distributed with common distribution being normal with mean zero and unknown variance τ² > 0, and where x_i and w_i are independent for 1 ≤ i ≤ n. In order to test the null hypothesis that b = 1 against the alternative that b ≠ 1, we may do the same thing by testing the null hypothesis H_0: d = 0 against the alternative d ≠ 0. Let us denote
w = (w_1, …, w_n)ᵀ, θ = (c, d)ᵀ and δ = (c).

Now our general linear model is written as

y − x = Xθ + w,
and when H0 : d = 0 is true, we have
y − x = 1_n δ + w,

where 1_n denotes the n × 1 column of ones.
The least squares estimates θ̂ of θ and δ̂ of δ are given by

θ̂ = (XᵀX)⁻¹Xᵀ(y − x)

and

δ̂ = (1_nᵀ1_n)⁻¹1_nᵀ(y − x) = ȳ − x̄.
The F-statistic is
lly-^-^r and, again, under the null hypothesis Ho : d = 0, T has the F -distribution with ( l , n — 2) degree of freedom. The test is to reject Ho : d = 0 if T is too large.
If both hypotheses are rejected decisively, it would undoubtedly be best to use Ŷ = Nȳ as an estimate of Y.
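Both tests reduce to a standard comparison of nested least-squares fits. The sketch below uses hypothetical data lying close to y = 2x, so that H_0: a = 0 should be retained while H_0: d = 0 should be rejected; in practice each T would be compared with an F(1, n − 2) critical value.

```python
import numpy as np

def f_stat(y, X_full, X_null):
    # F-statistic for a nested linear-model comparison with one constraint:
    #   T = (||y - X0 g^||^2 - ||y - X b^||^2) / (||y - X b^||^2 / (n - 2)).
    n = len(y)
    rss = lambda M: np.sum((y - M @ np.linalg.lstsq(M, y, rcond=None)[0]) ** 2)
    rss1, rss0 = rss(X_full), rss(X_null)
    return (rss0 - rss1) / (rss1 / (n - 2))

# Hypothetical data roughly on the line y = 2x:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.9, 12.1])
ones = np.ones_like(x)

# Test H0: a = 0 in y = a + bx + z       (full: [1, x]; null: [x]).
T_a = f_stat(y, np.column_stack([ones, x]), x[:, None])
# Test H0: d = 0 in y - x = c + dx + w   (full: [1, x]; null: [1]).
T_d = f_stat(y - x, np.column_stack([ones, x]), ones[:, None])
print(T_a, T_d)   # T_a small (retain a = 0), T_d large (reject d = 0)
```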
Chapter 9
Stratified Sampling
9.1 The Model and Basic Estimates

In Chapter 7 we discovered that it is possible to decrease substantially the variance of an unbiased estimate of Y by using predictor data x whenever such are available. Without the predictor data it is possible to obtain an unbiased estimate of Y with a much smaller variance than that of the usual unbiased estimate, Nȳ, by means of stratification of the population U. Let us consider the following extreme example to introduce this method. Suppose one has a population,

U: U_1 ⋯ U_50 U_51 ⋯ U_150 U_151 ⋯ U_300
y: u ⋯ u v ⋯ v w ⋯ w,
of 300 units. Suppose it is known that units numbered 1 through 50 have the same (unknown) y-value, u, that units numbered 51 through 150 have the same (unknown) y-value, v, and that units numbered 151 through 300 have the same (unknown) y-value, w. If you are allowed to take a sample of size three on U in order to estimate Y, you would be foolish to take a simple random sample of size three, either WR or WOR, especially if there were vast differences suspected between the pairs of values among u, v, w. The smartest thing to do would be to select one unit from among U_1, …, U_50, one from among U_51, …, U_150 and one from among U_151, …, U_300. Doing this, you would have observed the values of u, v, w, and you would get the exact value of Y by calculating Y = 50u + 100v + 150w.
Although the above example appears to be extreme, traces of it occur in problems met by the sample survey statistician. One wishes to estimate Y by means of an estimate Ŷ that is as close to Y as possible. Sometimes the population U can be represented as a disjoint union of subsets where the y-values for the units of each subset are almost equal. In such a situation, it seems sensible to take a few observations on each subset, compute the sample mean for the observations on each subset, multiply these sample means by the corresponding sizes of these subsets and then add these products to obtain an estimate of Y. Decomposition of U into such a union of subsets is known as stratification. Each subset of nearly uniform units is called a stratum.
Let (U; y) be a population of units with the associated y-values. We shall assume that U is the disjoint union of L subsets U_1, …, U_L, each subset being a stratum, i.e., U = ∪_{h=1}^L U_h. We assume that the units (and hence the number of units) in each stratum are known. Our original population, recall, is
U: U_1 ⋯ U_N
y: Y_1 ⋯ Y_N,
where y(U_i) = Y_i for 1 ≤ i ≤ N. We shall let N_h denote the number of units in the hth stratum, so that N = Σ_{h=1}^L N_h. The hth stratum and corresponding y-values are represented now by
U_h: U_{h1} ⋯ U_{hN_h}
y|U_h: Y_{h1} ⋯ Y_{hN_h},
where y|U_h(U_{hi}) = Y_{hi}. The sum of the y-values for the hth stratum, U_h, is denoted by

Y_h = Σ_{i=1}^{N_h} Y_{hi},

from which it necessarily follows that Y = Σ_{h=1}^L Y_h. The size of the sample observed on the hth stratum will be denoted by n_h. In other words, just as a sample of size n is taken WR or WOR on U, as in Chapter 6, so now, under stratification, a sample of size n_h is taken (WR or WOR) on U_h. The observations on any one stratum are assumed
(because they are so taken) to be independent of the sets of observations taken on all the other strata. In other words, letting y_{h1}, …, y_{hn_h} denote the n_h observations on U_h, the L sets of random variables (y_{11}, …, y_{1n_1}), …, (y_{L1}, …, y_{Ln_L}) are L independent sets of random variables, i.e., are L independent random vectors. (Note: In the terminology of Chapter 2, the two random vectors (U, V) and (X, Y, Z) are independent random vectors if the joint density of all five random variables factors as follows:

f_{U,V,X,Y,Z}(u, v, x, y, z) = f_{U,V}(u, v) f_{X,Y,Z}(x, y, z).
Note that independence of these random vectors does not mean that the random variables within any random vector are independent.) If the sampling on each stratum is done WR, then, for fixed h, y_{h1}, …, y_{hn_h} are independent and identically distributed random variables; if the sampling is done WOR, they are not independent.
We yet need further notation. Denote

n = Σ_{h=1}^L n_h,

ȳ'_h = (1/n_h) Σ_{i=1}^{n_h} y_{hi},

Ŷ = Σ_{h=1}^L N_h ȳ'_h,

Ȳ_h = (1/N_h) Σ_{i=1}^{N_h} Y_{hi},

S_yh² = (1/(N_h − 1)) Σ_{i=1}^{N_h} (Y_{hi} − Ȳ_h)²,

and

s_yh² = (1/(n_h − 1)) Σ_{i=1}^{n_h} (y_{hi} − ȳ'_h)².

The meanings of these are clear. The total size of the sample is n. The sample mean for the hth stratum is ȳ'_h. The sum of products referred to previously is just what Ŷ is. The average of the y-values of all units
in U_h is Ȳ_h. Specializing S_y² of Chapter 6 to U_h gives us S_yh², and s_yh² is the sample variance for the sample from U_h.
Definition. We shall say that the sampling on a stratified population is obtained WR if the sampling on each stratum is obtained WR , i.e., if, for 1 < h < L, the nh observations on (Uh,y\Uh) are obtained WR. The same definition holds for sampling WOR on a stratified population.
THEOREM 1. In stratified sampling, either WR or WOR, Ŷ is an unbiased estimate of Y.
Proof: Applying Theorem 1 of Section 6.2 to (U_h, y|U_h) we obtain E(N_h ȳ'_h) = Y_h. Thus,

E(Ŷ) = E(Σ_{h=1}^L N_h ȳ'_h) = Σ_{h=1}^L E(N_h ȳ'_h) = Σ_{h=1}^L Y_h = Y.

Q.E.D.
THEOREM 2. In stratified sampling WR the variance of the unbiased estimate Ŷ of Y is

Var(Ŷ) = Σ_{h=1}^L (N_h(N_h − 1)/n_h) S_yh²,

and an unbiased estimate of Var(Ŷ) is

V̂ar(Ŷ) = Σ_{h=1}^L (N_h²/n_h) s_yh².
Proof: Applying Theorem 3 of Section 6.2 to (U_h, y|U_h) we obtain

Var(N_h ȳ'_h) = (N_h(N_h − 1)/n_h) S_yh².
Because of the independence in sampling on the different strata, N_1 ȳ'_1, …, N_L ȳ'_L are independent random variables. Thus

Var(Ŷ) = Var(Σ_{h=1}^L N_h ȳ'_h) = Σ_{h=1}^L Var(N_h ȳ'_h) = Σ_{h=1}^L (N_h(N_h − 1)/n_h) S_yh².
If we apply Corollary 1 to Theorem 1 in Section 6.3 to (U_h, y|U_h) we obtain E(N_h² s_yh²/n_h) = Var(N_h ȳ'_h). Thus

E(V̂ar(Ŷ)) = Σ_{h=1}^L E(N_h² s_yh²/n_h) = Σ_{h=1}^L Var(N_h ȳ'_h) = Var(Σ_{h=1}^L N_h ȳ'_h) = Var(Ŷ).
Q.E.D.
THEOREM 3. In stratified sampling WOR, the variance of the unbiased estimate Ŷ of Y is

Var(Ŷ) = Σ_{h=1}^L (N_h²/n_h)(1 − n_h/N_h) S_yh²,

and an unbiased estimate of Var(Ŷ) is

V̂ar(Ŷ) = Σ_{h=1}^L (N_h²/n_h)(1 − n_h/N_h) s_yh².

Proof: Applying Theorem 4 in Section 6.2 to (U_h, y|U_h) we obtain

Var(N_h ȳ'_h) = (N_h²/n_h)(1 − n_h/N_h) S_yh²,
and, because of the independence of sampling between strata, we have

Var(Ŷ) = Var(Σ_{h=1}^L N_h ȳ'_h) = Σ_{h=1}^L Var(N_h ȳ'_h) = Σ_{h=1}^L (N_h²/n_h)(1 − n_h/N_h) S_yh².
Next, applying Theorem 2 in Section 6.3 we obtain

E(V̂ar(Ŷ)) = Σ_{h=1}^L (N_h²/n_h)(1 − n_h/N_h) E(s_yh²) = Σ_{h=1}^L (N_h²/n_h)(1 − n_h/N_h) S_yh² = Var(Ŷ).
Q.E.D.

It might be possible to stratify a population (U; x, y) based on the values of a predictor x. In this case it is possible to do probability proportional to size sampling within each stratum. This is not terribly exciting now, since we would essentially be re-doing Chapter 7 word for word for the stratified model.
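The estimate Ŷ = Σ N_h ȳ'_h and the WOR variance estimate of Theorem 3 take only a few lines of code; the two strata below are hypothetical.

```python
from statistics import mean, variance

def stratified_estimate_wor(strata):
    # strata: list of (N_h, sample_h) pairs.  Sketch of
    #   Y^ = sum N_h ybar'_h   and
    #   Var^(Y^) = sum (N_h^2 / n_h)(1 - n_h / N_h) s_yh^2   (Theorem 3).
    Y_hat = sum(N_h * mean(s) for N_h, s in strata)
    var_hat = sum((N_h**2 / len(s)) * (1 - len(s) / N_h) * variance(s)
                  for N_h, s in strata)
    return Y_hat, var_hat

# Hypothetical two-stratum population, N_1 = 8 and N_2 = 7, with WOR
# samples of size 3 from each stratum:
print(stratified_estimate_wor([(8, [12, 14, 13]), (7, [36, 35, 37])]))
```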
EXERCISES
1. Consider the following stratified population:
U_1: U_11 U_12 U_13 U_14 U_15 U_16 U_17 U_18
y|U_1: 12 15 13 10 12 13 14 16

U_2: U_21 U_22 U_23 U_24 U_25 U_26 U_27
y|U_2: 35 38 36 39 34 37 34
(i) Compute Ȳ_1, Ȳ_2, S_y1² and S_y2².

(ii) If a sample of size three is taken WOR from each stratum, if U_13, U_17 and U_11 are the units selected from U_1, and if U_23, U_22 and U_27 are the units selected from U_2, evaluate Ŷ and V̂ar(Ŷ).
2. In Problem 1, suppose the population was not stratified, suppose a random sample of size six WOR was taken, and suppose the sample consisted of the same units as in 1(ii). Compute Ŷ and V̂ar(Ŷ).

3. Compute Var(Ŷ) for both Problems 1(ii) and 2, and explain the difference in values.
9.2 Allocation of Sample Sizes to Strata

So far in this text we have never mentioned how one selects n. Sometimes the value of n depends on how small one wishes the value of Var(Ŷ) to be. This is hard to do unless one has taken a preliminary sample and has made an estimate of the various terms that appear in the formula for the variance of Ŷ. Quite often the sample size is determined by simple cost considerations. Suppose the initial cost of performing a survey is c_0 and the cost of each observation is c_1. Then the total cost of performing the survey is C = c_0 + n c_1. If the total cost is decided in advance, then one solves for n in the above equation to decide on the sample size. In the case of a stratified population, the cost of an observation can be different for different strata. Thus, if the cost C of the entire survey is given in advance, if the initial cost of performing the survey is c_0, and if the cost of each observation in the hth stratum is c_h (where c_0 and c_1, …, c_L are given in advance), the problem becomes one of deciding values of n_1, …, n_L, and therefore of n = Σ_{h=1}^L n_h, which satisfy the equation
C = c_0 + Σ_{h=1}^L c_h n_h.
The particular problem we consider here is that of determining, among all sets of values n_1, …, n_L which satisfy the above equation, a set which minimizes the variance of the unbiased estimate Ŷ of Y.
The following lemma was proved in a random variable form in Section 3.3. It is given here with proof for the sake of completeness and in order to emphasize its importance.
LEMMA 1. (Cauchy–Schwarz Inequality). If a_1, …, a_r, b_1, …, b_r are real numbers, then

(Σ_{i=1}^r a_i b_i)² ≤ (Σ_{i=1}^r a_i²)(Σ_{i=1}^r b_i²),

with equality holding if and only if there exist numbers u, v, not both zero, such that u a_i + v b_i = 0 for 1 ≤ i ≤ r.
Proof: First recall that the quadratic equation ax² + 2bx + c = 0 with a > 0 has a double real root if and only if b² − ac = 0, and it has no real roots if and only if b² − ac < 0. If a_i = b_i = 0, 1 ≤ i ≤ r, there is nothing to prove. If this is not the case, then we may assume that, say, not all a_i's are zero. In this case, Σ_{i=1}^r a_i² > 0. For all real t, it is true that Σ_{i=1}^r (t a_i + b_i)² ≥ 0, or

(Σ_{i=1}^r a_i²) t² + 2 (Σ_{i=1}^r a_i b_i) t + Σ_{i=1}^r b_i² ≥ 0.

Since, as a function of t, this parabola is non-negative, the quadratic has either no real roots or a double real root. In either case, by the above discussion on quadratic equations, it follows that

(Σ_{i=1}^r a_i b_i)² − (Σ_{i=1}^r a_i²)(Σ_{i=1}^r b_i²) ≤ 0.

Equality holds if and only if there is a double root, t_0; in this case

Σ_{i=1}^r (t_0 a_i + b_i)² = 0.

But this last equation holds if and only if each summand is zero, i.e., t_0 a_i + b_i = 0, 1 ≤ i ≤ r. Q.E.D.
None of the following theorems will elicit enthusiasm for applicability until we reach the discussion involving a predictor variable.
THEOREM 1. In WOR stratified sampling with linear cost function C = c_0 + Σ_{h=1}^L c_h n_h, Var(Ŷ) is minimized for preassigned C, and C is
minimized for preassigned value of Var(Ŷ), if and only if there is some positive constant K such that

n_h = K N_h S_yh/√c_h, 1 ≤ h ≤ L.
Proof: By Theorem 3 in Section 1,

Var(Ŷ) = Σ_{h=1}^L (N_h²/n_h) S_yh² − Σ_{h=1}^L N_h S_yh²,

from which we obtain

Var(Ŷ) + Σ_{h=1}^L N_h S_yh² = Σ_{h=1}^L (N_h²/n_h) S_yh².

Letting

Q = (Var(Ŷ) + Σ_{h=1}^L N_h S_yh²)(C − c_0),

we have

Q = (Σ_{h=1}^L (N_h²/n_h) S_yh²)(Σ_{h=1}^L c_h n_h).

If C is a preassigned constant, then Var(Ŷ) is minimized by minimizing the right-hand side of the above equation. (Note: the term Σ_{h=1}^L N_h S_yh² in the definition of Q does not depend on n_1, …, n_L.) The right side of the above equation is just the right side of the following application of the Cauchy–Schwarz inequality (Lemma 1), with a_h = N_h S_yh/√n_h and b_h = √(c_h n_h):

(Σ_{h=1}^L N_h S_yh √c_h)² ≤ (Σ_{h=1}^L (N_h²/n_h) S_yh²)(Σ_{h=1}^L c_h n_h).

Since the left term in this inequality does not really depend on the n_h's (the √n_h's cancel), the right side is minimized over n_1, …, n_L when the inequality becomes an equality. By the Cauchy–Schwarz inequality this occurs if and only if there exists a constant K such that √(c_h n_h) = K N_h S_yh/√n_h for 1 ≤ h ≤ L, or n_h = K N_h S_yh/√c_h. The same condition is seen to be necessary and sufficient for minimizing C when Var(Ŷ) is a given constant. Q.E.D.
This theorem does not solve our practical problem for the simple reason that in contemplating the sample survey, we do not know the values of K and n. This result however is one step in the right direction. Before we proceed further we state the theorem in the case of WR stratified sampling.
THEOREM 2. In WR stratified sampling with linear cost function C = c_0 + Σ_{h=1}^L c_h n_h, Var(Ŷ) is minimized for preassigned C, and C is minimized for a preassigned value of Var(Ŷ), if and only if there is some constant K such that

n_h = K √(N_h(N_h − 1)) S_yh/√c_h, 1 ≤ h ≤ L.
The proof of this theorem depends on Theorem 2 of Section 1 in just the same way that the proof of Theorem 1 depended on Theorem 3 of Section 1, and the steps of the proof are similar. It is left as an exercise.
THEOREM 3. (Neyman–Tschuprow Allocation Theorem). In WOR stratified sampling with linear cost function C = c_0 + Σ_{h=1}^L c_h n_h and with C given, the values of n, n_1, …, n_L which minimize Var(Ŷ) are

n = (C − c_0) (Σ_{m=1}^L N_m S_ym/√c_m) / (Σ_{h=1}^L N_h S_yh √c_h)

and

n_h = n (N_h S_yh/√c_h) / (Σ_{m=1}^L N_m S_ym/√c_m).
Proof: By Theorem 1,

n_h = K N_h S_yh/√c_h, 1 ≤ h ≤ L,

from which we obtain

n = Σ_{h=1}^L n_h = K Σ_{h=1}^L N_h S_yh/√c_h.
Solving for K, we find that

K = n / (Σ_{h=1}^L N_h S_yh/√c_h).

Thus

n_h = n (N_h S_yh/√c_h) / (Σ_{m=1}^L N_m S_ym/√c_m).

Since C − c_0 = Σ_{h=1}^L c_h n_h, we have

C − c_0 = n (Σ_{h=1}^L N_h S_yh √c_h) / (Σ_{m=1}^L N_m S_ym/√c_m),

from which we solve for n to obtain

n = (C − c_0) (Σ_{m=1}^L N_m S_ym/√c_m) / (Σ_{h=1}^L N_h S_yh √c_h).

Q.E.D.
Theorem 3 gives a solution to the allocation problem except for one thing: we do not know the values of S_yh², 1 ≤ h ≤ L. However, if we have a predictor variable x of which y is a linear function, then we shall be able to solve for n_h and n in the allocation problem.
Consider now a stratified population with a predictor variable x. We denote by X_{hi} the value that x assigns to the unit U_{hi} in the hth stratum. As in Chapter 7, all numbers

{X_{hi}, 1 ≤ h ≤ L, 1 ≤ i ≤ N_h}

are known. Analogous to Y, we denote

X = Σ_{h=1}^L Σ_{i=1}^{N_h} X_{hi}

and

X̄_h = (1/N_h) Σ_{i=1}^{N_h} X_{hi},

and, analogous to S_yh²,

S_xh² = (1/(N_h − 1)) Σ_{i=1}^{N_h} (X_{hi} − X̄_h)².
THEOREM 4. In WOR stratified sampling with linear cost function C = c_0 + Σ_{h=1}^L c_h n_h and with C given, if a predictor x is available and if y = ax + b for some constants a and b, then the values of n, n_1, …, n_L which minimize Var(Ŷ) are

n = (C − c_0) (Σ_{m=1}^L N_m S_xm/√c_m) / (Σ_{h=1}^L N_h S_xh √c_h)

and

n_h = n (N_h S_xh/√c_h) / (Σ_{m=1}^L N_m S_xm/√c_m).
Proof: From Theorem 3 we see that Theorem 4 will easily follow if we can show that S_yh = |a| S_xh for all h. Now y = ax + b simply means that Y_{hi} = a X_{hi} + b for all h and i. Thus

S_yh² = (1/(N_h − 1)) Σ_{i=1}^{N_h} (Y_{hi} − Ȳ_h)² = (1/(N_h − 1)) Σ_{i=1}^{N_h} (a X_{hi} + b − (a X̄_h + b))² = a² (1/(N_h − 1)) Σ_{i=1}^{N_h} (X_{hi} − X̄_h)² = a² S_xh²,

i.e., S_yh = |a| S_xh; the positive factor |a| cancels from the formulas of Theorem 3. Q.E.D.

Theorem 4 gives us the possibility of practical application. If we have previous information in the form of x and if y is anywhere near to being a linear function of x, then Theorem 4 shows us how to allocate sample sizes to the strata in order to keep within a cost restriction and minimize the variance of Ŷ.
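A sketch of the allocation of Theorem 4 follows; the strata and costs are hypothetical, and the real-valued n_h it returns would be rounded to integers in practice.

```python
from math import sqrt
from statistics import stdev

def neyman_allocation(strata_x, costs, C, c0):
    # Sketch of Theorem 4: with the x-values of every stratum known, allocate
    # n_h proportional to N_h S_xh / sqrt(c_h), scaled so that the linear
    # cost c0 + sum c_h n_h equals the given budget C.
    w = [len(xs) * stdev(xs) / sqrt(c) for xs, c in zip(strata_x, costs)]
    n = (C - c0) * sum(w) / sum(wh * ch for wh, ch in zip(w, costs))
    return [n * wh / sum(w) for wh in w]

# Hypothetical strata of known x-values and per-unit observation costs:
strata_x = [[3, 5, 4, 6], [20, 24, 22], [50, 55, 60, 45]]
alloc = neyman_allocation(strata_x, costs=[1.0, 4.0, 9.0], C=100.0, c0=10.0)
print(alloc)   # the resulting cost c0 + sum c_h n_h equals C exactly
```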
EXERCISES
1. Prove Theorem 2.
2. In Theorem 4, if WOR is replaced by WR, correct the conclusions of the theorem, and prove it.
3. Consider the following stratified population with predictor:
U_1: U_11 U_12 U_13 U_14 U_15 U_16 U_17
x: 14 18 20 15 12 15 18
y: 21 26 32 20 20 22 29

U_2: U_21 U_22 U_23 U_24 U_25 U_26
x: 41 50 45 40 44 51
y: 58 76 64 63 69 72

U_3: U_31 U_32 U_33 U_34 U_35 U_36 U_37 U_38
x: 61 59 64 60 58 65 60 70
y: 88 91 92 93 90 95 89 100
Each observation in U_1 costs $10, each in U_2 costs $14 and each in U_3 costs $18. The initial cost of the survey is $20, and the cost of the survey may not exceed $200. Find the sample sizes for the strata that minimize the variance of Ŷ for WOR sampling.
Chapter 10
Cluster Sampling
10.1 Unbiased Estimate of the Mean

In cluster sampling the population U is represented as a disjoint union of subsets, each subset being called a cluster. We shall denote the clusters by C_1, …, C_N, and thus

U = ∪_{i=1}^N C_i.
As with the whole population, it is assumed that one knows which units are in each C_i and thus knows the number of units in each cluster. The sampling procedure consists of selecting n clusters by simple random sampling WOR and then taking a WOR simple random sample from each cluster selected. With these data, the problem consists of obtaining an unbiased estimate Ŷ of Y, then to obtain Var(Ŷ), and, last, to find an unbiased estimate, V̂ar(Ŷ), of Var(Ŷ). These three tasks are accomplished in this chapter. We shall not dwell too long here on the cases when cluster sampling is needed. Suffice it to say that there are cases when, once an observation is made on a unit in a population, it is cheaper, easier and/or quicker to take a number of observations on the cluster in which it is located than to take even one other observation on the population at large. An example of this is when there are great distances between clusters but negligible distances between any two units within a cluster. It is important to observe the difference
between cluster sampling and stratified sampling. In cluster sampling, the units in each cluster need not have values that are relatively close to each other. The cluster is there because of the easy accessibility of observations once the cluster or a unit in the cluster has been selected. In stratified sampling, observations are taken on each stratum; this is not true in cluster sampling. One could say that stratified sampling is a special case of cluster sampling in the case when the sample of clusters in cluster sampling consists of all clusters.
Let us establish once and for all the notation to be used in this chapter. We shall denote the number of units in cluster C_i by M_i. Thus M, defined by

M = \sum_{i=1}^{N} M_i,

is the total number of units in U (for this chapter only). The units in the ith cluster C_i will be denoted by U_{ij}, 1 \le j \le M_i, and we shall denote y(U_{ij}) = Y_{ij}.

The following notation will be used:

Y_{i\cdot} = \sum_{j=1}^{M_i} Y_{ij}

is the sum of the y-values for the ith cluster,

\bar Y_{i\cdot} = \frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij}

is the average of the y-values of the ith cluster,

\bar Y_{cl} = \frac{1}{N}\sum_{i=1}^{N} Y_{i\cdot}

is the average total y-value per cluster, and

Y = \sum_{i=1}^{N} Y_{i\cdot}, \qquad \bar Y = \frac{1}{M}\sum_{u=1}^{N}\sum_{v=1}^{M_u} Y_{uv},

and

S^2 = \frac{1}{M-1}\sum_{u=1}^{N}\sum_{v=1}^{M_u}(Y_{uv}-\bar Y)^2
are as in previous chapters. A sample of size n clusters will be selected WOR, and we shall refer to the numbers of the clusters so selected by \nu_1, \ldots, \nu_n, where 1 \le \nu_1 < \cdots < \nu_n \le N. These are of course random variables with joint density given by

P\left(\bigcap_{i=1}^{n}[\nu_i = k_i]\right) = \begin{cases} 1/\binom{N}{n} & \text{if } 1 \le k_1 < \cdots < k_n \le N, \\ 0 & \text{otherwise.} \end{cases}
Given that cluster C_i is in our sample, then a sample of size m_i is taken WOR of units in C_i; these observations (random variables) are assumed to be independent of the observations taken on any other clusters and, indeed, will be independent of the other clusters selected. This can be expressed mathematically as follows. If 1 \le \nu_1 < \cdots < \nu_n \le N are the clusters selected, and if y_{k_i 1}, \ldots, y_{k_i m_{k_i}} denote the m_{k_i} observations taken on C_{k_i}, then

P\left(\bigcap_{i=1}^{n}[y_{k_i 1}=t_{i1},\ldots,y_{k_i m_{k_i}}=t_{i m_{k_i}}]\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right) = \prod_{i=1}^{n} P\left([y_{k_i 1}=t_{i1},\ldots,y_{k_i m_{k_i}}=t_{i m_{k_i}}]\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right).
Let \{y_{i1}, \ldots, y_{im_i}\} denote a sample of size m_i on C_i. We shall further denote

\bar y_{i\cdot} = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}.

Let us finally denote

S_{wi}^2 = \frac{1}{M_i-1}\sum_{g=1}^{M_i}(Y_{ig}-\bar Y_{i\cdot})^2.
Note that, for 1 \le i \le n, 1 \le k_1 < \cdots < k_n \le N,

Var\left(\bar y_{\nu_i\cdot}\,\Big|\,\bigcap_{j=1}^{n}[\nu_j=k_j]\right) = \frac{1}{m_{k_i}}\left(1-\frac{m_{k_i}}{M_{k_i}}\right)S_{w k_i}^2.
In addition, because of the independence assumption above, \bar y_{\nu_1\cdot}, \ldots, \bar y_{\nu_n\cdot} are independent relative to the conditional probability

P\left(\,\cdot\,\Big|\,\bigcap_{j=1}^{n}[\nu_j=k_j]\right),

and thus we have, e.g.,

Var\left(\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot}\,\Big|\,\bigcap_{j=1}^{n}[\nu_j=k_j]\right) = \sum_{j=1}^{n}\frac{M_{k_j}^2}{m_{k_j}}\left(1-\frac{m_{k_j}}{M_{k_j}}\right)S_{w k_j}^2.
This last fact will be made use of in Section 2. It should be noted that in case m_i = M_i for 1 \le i \le N, i.e., we take a complete census of each cluster in the sample, then we are back in the case of WOR simple random sampling that was discussed in Chapter 6. Thus, we may state without proof the following theorem.
THEOREM 1. In WOR cluster sampling, if m_i = M_i, 1 \le i \le N, then an unbiased estimate of Y is

\hat Y = \frac{N}{n}\sum_{i=1}^{n} Y_{\nu_i\cdot},

its variance is

Var(\hat Y) = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)\frac{1}{N-1}\sum_{i=1}^{N}(Y_{i\cdot}-\bar Y_{cl})^2,

and an unbiased estimate of Var(\hat Y) is

\widehat{Var}(\hat Y) = \frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\sum_{i=1}^{n}\left(Y_{\nu_i\cdot}-\frac{1}{n}\sum_{j=1}^{n}Y_{\nu_j\cdot}\right)^2.

We now look at the general case. As indicated above, the numbers
of the clusters obtained in our sample of n clusters are \nu_1, \ldots, \nu_n, where 1 \le \nu_1 < \nu_2 < \cdots < \nu_n \le N. In the \nu_jth cluster selected, which has M_{\nu_j} units [note: the subscript is a random variable], a sample of size m_{\nu_j} is selected WOR, and thus M_{\nu_j}\bar y_{\nu_j\cdot} should be an unbiased estimate
of the sum of y-values of the \nu_jth cluster, \frac{1}{n}\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot} should be an unbiased estimate of the average of the cluster totals of y-values, and, finally, \frac{N}{n}\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot} should be an unbiased estimate of Y. We now prove this last "should be" in the following theorem.
THEOREM 2. In cluster sampling, an unbiased estimate of Y is

\hat Y = \frac{N}{n}\sum_{j=1}^{n} M_{\nu_j}\bar y_{\nu_j\cdot}.

Proof: We shall apply results from Chapter 4. We observe that

E(\hat Y) = E(E(\hat Y\mid \nu_1,\ldots,\nu_n))
= \sum_{1\le k_1<\cdots<k_n\le N} E(\hat Y\mid \nu_1=k_1,\ldots,\nu_n=k_n)\,P\left(\bigcap_{j=1}^{n}[\nu_j=k_j]\right)
= \frac{N}{n}\sum_{1\le k_1<\cdots<k_n\le N}\sum_{j=1}^{n}E(M_{\nu_j}\bar y_{\nu_j\cdot}\mid \nu_1=k_1,\ldots,\nu_n=k_n)\,\frac{1}{\binom{N}{n}}
= \frac{N}{n}\frac{1}{\binom{N}{n}}\sum_{1\le k_1<\cdots<k_n\le N}\sum_{j=1}^{n}E(M_{k_j}\bar y_{k_j\cdot}\mid \nu_1=k_1,\ldots,\nu_n=k_n)
= \frac{N}{n}\frac{1}{\binom{N}{n}}\sum_{1\le k_1<\cdots<k_n\le N}\sum_{j=1}^{n}M_{k_j}\bar Y_{k_j\cdot}.

Now by Lemma 1 of Section 8.3,

E(\hat Y) = \frac{N}{n}\cdot\frac{n}{N}\sum_{i=1}^{N}M_i\bar Y_{i\cdot} = \sum_{i=1}^{N}Y_{i\cdot} = Y.

Q.E.D.
EXERCISES
1. Consider the following population consisting of six clusters:
C1: U11 U12 U13 U14 U15 U16 U17 U18
 y:  20  31  42  39  25  28  28  34

C2: U21 U22 U23 U24 U25 U26 U27
 y:  31  28  16  20  20  31  50

C3: U31 U32 U33 U34 U35 U36 U37 U38 U39
 y:  21  28  45  31  18  42  36  38  51

C4: U41 U42 U43 U44 U45 U46 U47 U48 U49 U4,10
 y:  43  22  49  33  36  41  52  16  18  20

C5: U51 U52 U53 U54 U55 U56 U57 U58
 y:  23  81  46  52  31  46  40  35

C6: U61 U62 U63 U64 U65 U66 U67 U68
 y:  41  52  35  43  45  47  52  56

The plan is to select three clusters WOR. If cluster C1 is in the sample, 4 units are to be selected from it WOR; if C2 is in the sample, 3 units are selected from it WOR; if C3 is included, 4 units are selected from it WOR; if C4 is included, 5 units are selected WOR; if C5 is included, 3 units are selected WOR; and this last holds for C6.

i) If the sample of clusters consists of C2, C4, C5, if U24, U26 and U22 are the units selected in C2, if U47, U48, U42, U45 and U41 are the units selected in C4, and if U55, U52 and U57 are the units selected in C5, compute \hat Y.

ii) Compute \bar Y_{i\cdot} and Y_{i\cdot}, 1 \le i \le N, and \bar Y_{cl}.

2. Prove: E(\bar y_{\nu_i\cdot}\mid \nu_1=k_1,\ldots,\nu_n=k_n) = \bar Y_{k_i\cdot}.

3. Evaluate: E(\nu_1 + \cdots + \nu_n).
10.2 The Variance

In this section we derive the formula for the variance of \hat Y. This will involve

S_{wj}^2 = \frac{1}{M_j-1}\sum_{g=1}^{M_j}(Y_{jg}-\bar Y_{j\cdot})^2,

which was given in Section 1, and S_b^2, which is defined by

S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_{i\cdot}-\bar Y_{cl})^2.
THEOREM 1. In cluster sampling, if \hat Y is defined by

\hat Y = \frac{N}{n}\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot},

then

Var(\hat Y) = \frac{N}{n}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2 + N^2\frac{1}{n}\left(1-\frac{n}{N}\right)S_b^2.
Proof: We shall prove this using a result from Chapter 4, namely by using

Var(\hat Y) = E(Var(\hat Y\mid \nu_1,\ldots,\nu_n)) + Var(E(\hat Y\mid \nu_1,\ldots,\nu_n)).

Indeed, by the independence assumption and its consequence given in Section 1,

E(Var(\hat Y\mid \nu_1,\ldots,\nu_n)) = \sum_{1\le k_1<\cdots<k_n\le N} Var\left(\hat Y\,\Big|\,\bigcap_{j=1}^{n}[\nu_j=k_j]\right)P\left(\bigcap_{j=1}^{n}[\nu_j=k_j]\right)
= \sum_{1\le k_1<\cdots<k_n\le N} Var\left(\frac{N}{n}\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot}\,\Big|\,\bigcap_{j=1}^{n}[\nu_j=k_j]\right)\frac{1}{\binom{N}{n}}
= \sum_{1\le k_1<\cdots<k_n\le N}\frac{N^2}{n^2}\sum_{j=1}^{n}\frac{M_{k_j}^2}{m_{k_j}}\left(1-\frac{m_{k_j}}{M_{k_j}}\right)S_{w k_j}^2\cdot\frac{1}{\binom{N}{n}}.
Now we use Lemma 1 of Section 8.3 to obtain

E(Var(\hat Y\mid \nu_1,\ldots,\nu_n)) = \frac{N^2}{n^2}\cdot\frac{n}{N}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2 = \frac{N}{n}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2.

We next wish to evaluate Var(E(\hat Y\mid \nu_1,\ldots,\nu_n)). We first note that for fixed j, 1 \le j \le n, by what was proved in the proof of Theorem 2 of Section 1,
E(M_{\nu_j}\bar y_{\nu_j\cdot}\mid \nu_1,\ldots,\nu_n) = \sum_{1\le k_1<\cdots<k_n\le N} E\left(M_{\nu_j}\bar y_{\nu_j\cdot}\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right)I_{\bigcap_{i=1}^{n}[\nu_i=k_i]}
= \sum_{1\le k_1<\cdots<k_n\le N} M_{k_j}\bar Y_{k_j\cdot}\,I_{\bigcap_{i=1}^{n}[\nu_i=k_i]} = Y_{\nu_j\cdot}.

Hence
Var(E(\hat Y\mid \nu_1,\ldots,\nu_n)) = Var\left(\frac{N}{n}\sum_{j=1}^{n}Y_{\nu_j\cdot}\right).

Now Y_{\nu_1\cdot},\ldots,Y_{\nu_n\cdot} constitute a simple random sample WOR of size n on the total y-values of the clusters. Hence, by the results of Chapter 6,

Var\left(\frac{N}{n}\sum_{j=1}^{n}Y_{\nu_j\cdot}\right) = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)S_b^2.

Thus by the formula for Var(\hat Y) stated at the beginning of this proof, we obtain our conclusion.
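The two terms of this variance formula can be checked numerically. In the sketch below (Python; the data and names are illustrative, not from the book), a complete census of each selected cluster makes the within-cluster term vanish, and the remaining between-cluster term agrees with the variance obtained by enumerating every possible sample.

```python
from itertools import combinations

def cluster_variance(N, n, M, m, Sw2, totals):
    """Var(Y_hat) = (N/n) * sum_i (M_i^2/m_i)(1 - m_i/M_i) S_wi^2
                    + N^2 (1/n)(1 - n/N) S_b^2  (Theorem 1 of this section)."""
    Ybar_cl = sum(totals) / N
    Sb2 = sum((t - Ybar_cl) ** 2 for t in totals) / (N - 1)
    within = (N / n) * sum(Mi**2 / mi * (1 - mi / Mi) * s
                           for Mi, mi, s in zip(M, m, Sw2))
    between = N**2 * (1 / n) * (1 - n / N) * Sb2
    return within + between

totals = [21, 30, 9]            # cluster totals Y_i.
N, n = 3, 2
M = m = [2, 2, 2]               # complete census: m_i = M_i, within term = 0
var = cluster_variance(N, n, M, m, [0.5, 2.0, 0.5], totals)

# Brute force: variance of Y_hat over all C(3,2) equally likely samples.
ests = [(N / n) * sum(totals[i] for i in s) for s in combinations(range(N), n)]
mean = sum(ests) / len(ests)
brute = sum((e - mean) ** 2 for e in ests) / len(ests)
print(var, brute)  # the two agree
```

The S_wi^2 values passed in are irrelevant here because the factor (1 - m_i/M_i) is zero under a complete census; with genuine subsampling they would contribute the within-cluster term.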
EXERCISES
1. In Problem 1 of Section 1, compute Var(\hat Y) for the sampling procedure outlined.
2. Prove: for fixed j, 1 \le j \le n,
10.3 An Unbiased Estimate of Var(\hat Y)

Let us establish the notation

s_j^2 = \frac{1}{m_j-1}\sum_{i=1}^{m_j}(y_{ji}-\bar y_{j\cdot})^2.

THEOREM 1. An unbiased estimate of Var(\hat Y) is given by

\widehat{Var}(\hat Y) = \frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\sum_{j=1}^{n}\left(M_{\nu_j}\bar y_{\nu_j\cdot}-\frac{1}{n}\sum_{i=1}^{n}M_{\nu_i}\bar y_{\nu_i\cdot}\right)^2 + \frac{N}{n}\sum_{j=1}^{n}\frac{M_{\nu_j}^2}{m_{\nu_j}}\left(1-\frac{m_{\nu_j}}{M_{\nu_j}}\right)s_{\nu_j}^2.
Proof: We recall from Chapter 6 that, in the case of simple random sampling WOR, E(s^2) = S^2. Thus, in cluster sampling, for fixed j,

E\left(s_{\nu_j}^2\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right) = S_{w k_j}^2.

Taking the expectation of the second term in the formula for \widehat{Var}(\hat Y), we obtain
E\left(\frac{N}{n}\sum_{j=1}^{n}\frac{M_{\nu_j}^2}{m_{\nu_j}}\left(1-\frac{m_{\nu_j}}{M_{\nu_j}}\right)s_{\nu_j}^2\right) = \sum_{1\le k_1<\cdots<k_n\le N}\frac{N}{n}\sum_{j=1}^{n}\frac{M_{k_j}^2}{m_{k_j}}\left(1-\frac{m_{k_j}}{M_{k_j}}\right)S_{w k_j}^2\,\frac{1}{\binom{N}{n}},

and by Lemma 1 of Section 8.3, the above becomes

\frac{N}{n}\cdot\frac{n}{N}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2 = \sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2.
We now take the expectation of the first expression in the formula for \widehat{Var}(\hat Y), which we denote by F. We may write

F = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}(A-B),

where

A = \sum_{j=1}^{n}(M_{\nu_j}\bar y_{\nu_j\cdot})^2

and

B = \frac{1}{n}\left(\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot}\right)^2.

Now, by properties of conditional expectation,

E(A) = \sum_{1\le k_1<\cdots<k_n\le N}E\left(A\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right)P\left(\bigcap_{i=1}^{n}[\nu_i=k_i]\right).

Then we use the formula

E(X^2\mid H=h) = (E(X\mid H=h))^2 + Var(X\mid H=h)

to obtain

E\left((M_{k_j}\bar y_{k_j\cdot})^2\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right) = Y_{k_j\cdot}^2 + \frac{M_{k_j}^2}{m_{k_j}}\left(1-\frac{m_{k_j}}{M_{k_j}}\right)S_{w k_j}^2.

This and Lemma 1 of Section 8.3 imply

E(A) = \frac{n}{N}\sum_{i=1}^{N}\left(Y_{i\cdot}^2 + \frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2\right).

To find E(B), we first observe

E(B) = \frac{1}{\binom{N}{n}}\sum_{1\le k_1<\cdots<k_n\le N}\frac{1}{n}E\left(\left(\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot}\right)^2\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right).

As in the computation of E(A), we obtain

E\left(\left(\sum_{j=1}^{n}M_{\nu_j}\bar y_{\nu_j\cdot}\right)^2\,\Big|\,\bigcap_{i=1}^{n}[\nu_i=k_i]\right) = \left(\sum_{j=1}^{n}Y_{k_j\cdot}\right)^2 + \sum_{j=1}^{n}\frac{M_{k_j}^2}{m_{k_j}}\left(1-\frac{m_{k_j}}{M_{k_j}}\right)S_{w k_j}^2
= \sum_{j=1}^{n}Y_{k_j\cdot}^2 + \sum_{u\ne v}Y_{k_u\cdot}Y_{k_v\cdot} + \sum_{j=1}^{n}\frac{M_{k_j}^2}{m_{k_j}}\left(1-\frac{m_{k_j}}{M_{k_j}}\right)S_{w k_j}^2.

Now, using Lemma 1 in Section 8.3 (and its analogue for pairs of distinct indices), we obtain

E(B) = \frac{1}{n}\left(\frac{n}{N}\sum_{i=1}^{N}Y_{i\cdot}^2 + \frac{n(n-1)}{N(N-1)}\sum_{u\ne v}Y_{u\cdot}Y_{v\cdot} + \frac{n}{N}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2\right).

Since

\left(\sum_{i=1}^{N}Y_{i\cdot}\right)^2 = \sum_{i=1}^{N}Y_{i\cdot}^2 + \sum_{u\ne v}Y_{u\cdot}Y_{v\cdot},

and since \bar Y_{cl} = Y/N, we have

E(B) = \frac{N-n}{N(N-1)}\sum_{i=1}^{N}Y_{i\cdot}^2 + \frac{N(n-1)}{N-1}\bar Y_{cl}^2 + \frac{1}{N}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2.

The first term on the right hand side becomes, using \sum_{i=1}^{N}Y_{i\cdot}^2 = (N-1)S_b^2 + N\bar Y_{cl}^2 (Exercise 3),

\frac{N-n}{N(N-1)}\sum_{i=1}^{N}Y_{i\cdot}^2 = \left(1-\frac{n}{N}\right)S_b^2 + \frac{N-n}{N-1}\bar Y_{cl}^2.

Hence

E(B) = \left(1-\frac{n}{N}\right)S_b^2 + n\bar Y_{cl}^2 + \frac{1}{N}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2.

Let us work some more on the expression for E(A) which we obtained above. As it was in the case for E(B), we have

E(A) = \frac{n(N-1)}{N}S_b^2 + n\bar Y_{cl}^2 + \frac{n}{N}\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2.

Substituting these formulae for E(A) and E(B) in the expression for E(\widehat{Var}(\hat Y)), we see that the coefficient of \bar Y_{cl}^2 is 0, the coefficient of S_b^2 is

\frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\left(\frac{n(N-1)}{N}-\left(1-\frac{n}{N}\right)\right),

which becomes

\frac{N^2}{n}\left(1-\frac{n}{N}\right),

and the coefficient of

\sum_{i=1}^{N}\frac{M_i^2}{m_i}\left(1-\frac{m_i}{M_i}\right)S_{wi}^2

becomes

\frac{N^2}{n}\left(1-\frac{n}{N}\right)\frac{1}{n-1}\cdot\frac{n-1}{N} + 1,

which in turn is N/n. Putting all this together we obtain E(\widehat{Var}(\hat Y)) = Var(\hat Y). Q.E.D.
EXERCISES
1. Prove: If X and H are random variables, then
E(X^2\mid H=h) = Var(X\mid H=h) + (E(X\mid H=h))^2.

2. In Problem 1(i) of Section 1, evaluate \widehat{Var}(\hat Y).

3. Prove: \sum_{i=1}^{N}Y_{i\cdot}^2 = \sum_{i=1}^{N}(Y_{i\cdot}-\bar Y_{cl})^2 + N\bar Y_{cl}^2.
Chapter 11
Two-Stage Sampling
11.1 Two-Stage Sampling
Two-stage or double sampling involves sampling the population according to some scheme and then sampling this sample. In each such procedure we have a population (U, x, y) with a predictor. However, no complete knowledge of the predictor is available as in Chapter 7. In this chapter three such procedures are presented.
For the procedure considered in this section we know neither the x-values nor the y-values of the units before we start sampling. As usual, we wish to estimate Y, and it is assumed that y and x are roughly linearly related, just as in Chapter 7. However, it might be very easy or cheap to determine the x-value of a unit, but expensive or difficult to determine its y-value. Thus we take an initial simple random sample WOR of size n' and observe the x-values of the units selected: x'_1, x'_2, \ldots, x'_{n'}. From the units in this first sample we select a sample of size n by WR probability proportional to size. We shall use the following notation: y'_1, \ldots, y'_{n'} will denote the y-values of the units selected in the first sample corresponding to the x-values x'_1, \ldots, x'_{n'}. Not all of these y-values are observable. We let x' denote the sum of the x-values of the units in the first sample; this value is observable. The random variables u'_1, \ldots, u'_{n'} will denote the subscripts of the units selected in the first sample, 1 \le u'_1 < u'_2 < \cdots < u'_{n'} \le N. These n' random variables might or might not be observable; their values do
not enter into the formulae for the estimates of Y and Var(\hat Y) that we shall obtain. We shall let u_1, \ldots, u_n, where 1 \le u_1 < \cdots < u_n \le N, denote the subscripts of the units selected in the second sample from among the units selected in the first sample. As before, they might or might not be observable. We shall let x_1, \ldots, x_n and y_1, \ldots, y_n denote the corresponding x- and y-values of these units.
We pause to develop the intuitive idea behind the unbiased estimate \hat Y to be obtained for Y. Referring back to Chapter 7, it appears reasonable that, for 1 \le i \le n, x'\,y_i/x_i should be an unbiased estimate of y' = \sum_{i=1}^{n'}y'_i, the unobservable sum of the y-values of the first sample. Hence \frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i} should also be an unbiased estimate of y'. The first sample should be a representative sample of the entire population, and Ny'/n' should be an unbiased estimate of Y. Hence \hat Y defined by \hat Y = \frac{N}{n'}\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i} should be an unbiased estimate of Y, and it is this that we shall prove in our first theorem.
THEOREM 1. In the double sampling procedure described above, an unbiased estimate \hat Y of Y is given by

\hat Y = \frac{N}{n'}\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}.

Proof: By Theorem 1 of Section 7.2,

E\left(x'\frac{y_i}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = \sum_{j=1}^{n'}y'_j = y' \quad \text{for } 1 \le i \le n.

Hence

E\left(\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = y',

and

E\left(\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}\right) = E(y') = n'\bar Y.

Hence E(\hat Y) = \frac{N}{n'}\,n'\bar Y = N\bar Y = Y. Q.E.D.
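As a small numerical sketch (Python; all numbers are made up for illustration), the estimate of Theorem 1 is simply N/n' times the pps-WR estimate of the first-sample total y':

```python
def double_sample_estimate(N, x_first, ratios):
    """Y_hat = (N/n') * (x'/n) * sum of y_i/x_i over the second-stage sample.
    `x_first` holds the x-values of the first-stage WOR sample (so x' = sum);
    `ratios` holds the observed y_i/x_i of the n second-stage pps-WR draws."""
    n_prime, n = len(x_first), len(ratios)
    x_total = sum(x_first)                  # x'
    return (N / n_prime) * (x_total / n) * sum(ratios)

# First sample of n' = 4 units with x-values 1,2,3,4; second sample of n = 2
# units on which y happens to be exactly 2x, so both ratios are 2.
print(double_sample_estimate(10, [1, 2, 3, 4], [2.0, 2.0]))
```

When y is exactly proportional to x, every second-stage draw gives the same ratio, and the estimate reduces to (N/n') * 2x', with zero conditional variance, which is the intuition behind using a predictor that is roughly linear in y.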
THEOREM 2. In the double sampling procedure defined above, the variance of the unbiased estimate \hat Y = \frac{N}{n'}\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i} is given by

Var(\hat Y) = \frac{N(n'-1)}{n'\,n\,(N-1)}\sum_{1\le i<j\le N}X_iX_j\left(\frac{Y_i}{X_i}-\frac{Y_j}{X_j}\right)^2 + N^2\frac{1}{n'}\left(1-\frac{n'}{N}\right)S^2.

Proof: In the proof of Theorem 1 we found that

E(\hat Y\mid u'_1,\ldots,u'_{n'}) = \frac{N}{n'}y'.

Thus from the results in Chapter 6,

Var(E(\hat Y\mid u'_1,\ldots,u'_{n'})) = N^2\frac{1}{n'}\left(1-\frac{n'}{N}\right)S^2,

which is the second term in the right side of the formula stated in the theorem. We now wish to obtain E(Var(\hat Y\mid u'_1,\ldots,u'_{n'})). By Theorem 3 of Section 7.2,

Var(\hat Y\mid u'_1,\ldots,u'_{n'}) = \frac{N^2}{(n')^2}Var\left(\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = \frac{N^2}{(n')^2}\frac{1}{n}\sum_{1\le i<j\le n'}x'_ix'_j\left(\frac{y'_i}{x'_i}-\frac{y'_j}{x'_j}\right)^2.

Hence by Lemma 1 of Section 8.3,

E(Var(\hat Y\mid u'_1,\ldots,u'_{n'})) = \frac{N^2}{(n')^2}\frac{1}{n}\cdot\frac{\binom{N-2}{n'-2}}{\binom{N}{n'}}\sum_{1\le i<j\le N}X_iX_j\left(\frac{Y_i}{X_i}-\frac{Y_j}{X_j}\right)^2.

Since \binom{N-2}{n'-2}/\binom{N}{n'} = \frac{n'(n'-1)}{N(N-1)}, this becomes

E(Var(\hat Y\mid u'_1,\ldots,u'_{n'})) = \frac{N(n'-1)}{n'\,n\,(N-1)}\sum_{1\le i<j\le N}X_iX_j\left(\frac{Y_i}{X_i}-\frac{Y_j}{X_j}\right)^2.
Q.E.D.
THEOREM 3. If \hat Y is as in Theorem 1, then \widehat{Var}(\hat Y), an unbiased estimate of Var(\hat Y), is equal to the following:

\widehat{Var}(\hat Y) = \frac{N^2}{(n')^2}\frac{(x')^2}{n(n-1)}\left\{\sum_{i=1}^{n}\frac{y_i^2}{x_i^2}-\frac{1}{n}\left(\sum_{i=1}^{n}\frac{y_i}{x_i}\right)^2\right\} + \frac{N(N-n')}{n'(n'-1)}\left\{\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i^2}{x_i}-\frac{(x')^2}{n'\,n(n-1)}\sum_{i\ne j}\frac{y_i}{x_i}\frac{y_j}{x_j}\right\}.
Proof: We first establish two formulas. The first is that

E\left(\frac{x'}{n}\sum_{i=1}^{n}\frac{y_i^2}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = \sum_{i=1}^{n'}(y'_i)^2.

This comes about by replacing the function y by y^2 and using results from Chapter 7. The second formula is:

E\left((x')^2\frac{y_i}{x_i}\frac{y_j}{x_j}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = (y')^2 \quad \text{for } i \ne j,

where y' = \sum_{i=1}^{n'}y'_i. In order to prove this, we make use of the fact that E(x'\,y_i/x_i\mid u'_1,\ldots,u'_{n'}) = \sum_{i=1}^{n'}y'_i = y', properties of conditional expectation, the fact that x' is a function of u'_1,\ldots,u'_{n'}, and the conditional independence of y_i/x_i and y_j/x_j given u'_1,\ldots,u'_{n'}, to obtain

E\left((x')^2\frac{y_i}{x_i}\frac{y_j}{x_j}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = x'E\left(\frac{y_i}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right)\cdot x'E\left(\frac{y_j}{x_j}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = (y')^2.
Now let B denote the second term on the right side in the formula given in the statement of the theorem. By the two formulas given above,

E(B\mid u'_1,\ldots,u'_{n'}) = \frac{N(N-n')}{n'(n'-1)}\left(\sum_{i=1}^{n'}(y'_i)^2-\frac{(y')^2}{n'}\right) = \frac{N(N-n')}{n'(n'-1)}\sum_{i=1}^{n'}(y'_i-\bar y')^2,

where \bar y' = (y'_1+\cdots+y'_{n'})/n'. Since the first stage sampling was simple random sampling WOR, then, using a property of conditional expectation, we have

E(B) = E(E(B\mid u'_1,\ldots,u'_{n'})) = \frac{N(N-n')}{n'}S^2 = N^2\frac{1}{n'}\left(1-\frac{n'}{N}\right)S^2,

which is the second term in the expression for Var(\hat Y) in Theorem 2. Now let us denote by A the first term in the statement of the theorem; we need to prove

E(A) = \frac{N(n'-1)}{n'\,n\,(N-1)}\sum_{1\le i<j\le N}X_iX_j\left(\frac{Y_i}{X_i}-\frac{Y_j}{X_j}\right)^2.

We first rewrite A as

A = \frac{N^2}{(n')^2}\frac{(x')^2}{n(n-1)}\left\{\left(1-\frac{1}{n}\right)\sum_{i=1}^{n}\frac{y_i^2}{x_i^2}-\frac{1}{n}\sum_{i\ne j}\frac{y_i}{x_i}\frac{y_j}{x_j}\right\}.

By the definition of conditional variance,

Var\left(\frac{y_i}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = E\left(\frac{y_i^2}{x_i^2}\,\Big|\,u'_1,\ldots,u'_{n'}\right)-\left(E\left(\frac{y_i}{x_i}\,\Big|\,u'_1,\ldots,u'_{n'}\right)\right)^2.
Applying Theorem 1 of Section 7.2 to the right side of the above, we obtain

E\left((x')^2\frac{y_i^2}{x_i^2}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = x'\sum_{i=1}^{n'}\frac{(y'_i)^2}{x'_i}.

Also, since, for i \ne j, y_i/x_i and y_j/x_j are conditionally independent given u'_1,\ldots,u'_{n'}, and since x' is a function of u'_1,\ldots,u'_{n'}, we again apply Theorem 1 of Section 7.2, a property of conditional expectation, and the second formula established earlier to obtain

E\left((x')^2\frac{y_i}{x_i}\frac{y_j}{x_j}\,\Big|\,u'_1,\ldots,u'_{n'}\right) = (y')^2.

Hence

E(A\mid u'_1,\ldots,u'_{n'}) = \frac{N^2}{(n')^2}\frac{1}{n}\left\{x'\sum_{i=1}^{n'}\frac{(y'_i)^2}{x'_i}-(y')^2\right\}.

By Lemma 2 of Section 7.2,

E(A\mid u'_1,\ldots,u'_{n'}) = \frac{N^2}{(n')^2}\frac{1}{n}\sum_{1\le i<j\le n'}x'_ix'_j\left(\frac{y'_i}{x'_i}-\frac{y'_j}{x'_j}\right)^2.

Now, applying Lemma 1 of Section 8.3, the definition of conditional expectation given a random variable and its properties, and the fact that x'_i = X_{u'_i} and y'_i = Y_{u'_i}, we have

E(A) = E(E(A\mid u'_1,\ldots,u'_{n'}))
= \sum_{1\le k_1<\cdots<k_{n'}\le N}\frac{N^2}{(n')^2}\frac{1}{n}\sum_{1\le i<j\le n'}X_{k_i}X_{k_j}\left(\frac{Y_{k_i}}{X_{k_i}}-\frac{Y_{k_j}}{X_{k_j}}\right)^2\frac{1}{\binom{N}{n'}}
= \frac{N^2}{(n')^2}\frac{1}{n}\cdot\frac{\binom{N-2}{n'-2}}{\binom{N}{n'}}\sum_{1\le i<j\le N}X_iX_j\left(\frac{Y_i}{X_i}-\frac{Y_j}{X_j}\right)^2
= \frac{N(n'-1)}{n'\,n\,(N-1)}\sum_{1\le i<j\le N}X_iX_j\left(\frac{Y_i}{X_i}-\frac{Y_j}{X_j}\right)^2,

which is the first term in the expression for Var(\hat Y) in Theorem 2. Q.E.D.
EXERCISES
1. Find the joint density of u'_1, \ldots, u'_{n'}.

2. Find the joint density of u_1, \ldots, u_n.

3. Use Lemma 1 in Section 8.1 to prove directly that E(N\bar y) = Y in simple random sampling WOR on

y: Y_1 \cdots Y_N.
11.2 Sampling for Non-Response

It sometimes happens that a simple random sample of a population (U; y) does not elicit a one hundred percent response. This occurs in the case of questionnaires sent out by mail. It also occurs when sampling is done by visiting households, since it is possible that the occupants are not at home. There is sometimes reason to believe that those units of a population that do not respond are actually a sample from a stratum of units that would not respond and that is qualitatively different from the stratum of units that would respond. In any case, the statistician
then takes a simple random sample from the non-respondents, going to considerable effort and expense to obtain a 100% response from this subsample. An unbiased estimate of Y is then possible.
So much for generalities; now we shall spell out everything in detail. Our population as usual is given by
U: U_1 \cdots U_N
We assume that U can be represented as a disjoint union of two sets, call them \mathcal{R} and \mathcal{N}, where \mathcal{R} contains every unit of U that would respond if it were included in the initial sample, and \mathcal{N} contains every unit of U that would not respond. Neither \mathcal{R} nor \mathcal{N} is known ahead of time. We shall let N_1 denote the number of units in \mathcal{R}, and N_2 will denote the number of units in \mathcal{N}; neither N_1 nor N_2 is known. Since U = \mathcal{R}\cup\mathcal{N} is a disjoint union, it follows that N = N_1 + N_2. The first stage consists of taking a simple random sample WOR from U of size n. We shall let n_1 denote the number of units in the sample that respond and n_2 the number of units that do not respond in this initial sample; n_1 and n_2 are observable and are random variables.
Clearly n = n_1 + n_2. One easily observes that n_1 is a random variable that has the hypergeometric distribution which was defined in Chapter 2. Thus E(n_1) = nN_1/N and E(n_2) = nN_2/N. We shall denote the observable y-values of those units in the sample that do respond by y'_1, \ldots, y'_{n_1}. One should note that n_1 could equal zero with positive probability if n \le N_2. If r \ge 1, and if the event [n_1 = r] occurs, then y'_1, \ldots, y'_r is a simple random sample WOR from \mathcal{R}. This means: the joint conditional density of y'_1, \ldots, y'_r given the event [n_1 = r] is that of a simple random sample WOR from \mathcal{R}. Formally, we define

\bar y'_{n_1} = \frac{1}{n_1}\sum_{i=1}^{n_1}y'_i\,I_{[n_1\ge 1]}.
We also let y''_1, \ldots, y''_{n_2} denote the unobservable y-values of the units from \mathcal{N} in this first stage sample, and we likewise define

\bar y''_{n_2} = \frac{1}{n_2}\sum_{i=1}^{n_2}y''_i\,I_{[n_2\ge 1]}.
If \bar y denotes the arithmetic mean (it is not observable) of the y-values of the n units in the first stage sample, then

\bar y = \frac{n_1\bar y'_{n_1}+n_2\bar y''_{n_2}}{n}.
From the n_2 units that do not respond we shall select a simple random sample WOR of size u. The size u is assumed to be a function of n_2 that satisfies:

u = 0 \quad \text{if } n_2 = 0,
1 \le u \le n_2 \le n \quad \text{if } n_2 \ge 1.

For example, u is sometimes selected as a fraction of n_2, say, u = [\tfrac{1}{2}n_2]+1, where [x] denotes the largest integer \le x. We shall let y_1, \ldots, y_u denote the y-values of the u units in this second sample and shall denote

\bar y_u = \frac{1}{u}\sum_{i=1}^{u}y_i.
We shall let 1 \le u'_1 < u'_2 < \cdots < u'_{n_1} \le N_1 denote the numbers (subscripts) of the n_1 units that are among the n units obtained in the first stage sample and that are in \mathcal{R}, and we shall let 1 \le u''_1 < u''_2 < \cdots < u''_{n_2} \le N_2 denote the subscripts of the n_2 units from \mathcal{N} in the first-stage sample. Note that u'_1, u'_2, \ldots, u'_{n_1} are random variables (actually, a random number of them as well, since n_1 is a random variable) and

P[u'_1=k_1,\ldots,u'_r=k_r,n_1=r] = \frac{1}{\binom{N_1}{r}}P[n_1=r]

if 1 \le k_1 < \cdots < k_r \le N_1, 1 \le r \le N_1, and U_{k_1}\in\mathcal{R},\ldots,U_{k_r}\in\mathcal{R}, and = 0 otherwise. A similar statement can be made for u''_1, \ldots, u''_{n_2}.
We shall need in the proofs that follow a convention on some notation for summation. If the conditioning event is

[u'_1=m_1,\ldots,u'_{n-r}=m_{n-r},u''_1=k_1,\ldots,u''_r=k_r,n_2=r],

the summation will be over all indices

m_1,\ldots,m_{n-r},k_1,\ldots,k_r,r

satisfying

\max\{0,n-N_1\} \le r \le \min\{n,N_2\},
1 \le m_1 < \cdots < m_{n-r} \le N_1,
1 \le k_1 < \cdots < k_r \le N_2,
U_{m_1}\in\mathcal{R},\ldots,U_{m_{n-r}}\in\mathcal{R},
U_{k_1}\in\mathcal{N},\ldots,U_{k_r}\in\mathcal{N}.
LEMMA 1. The following holds:

E(\bar y_u\mid u''_1,\ldots,u''_{n_2},n_2) = \bar y''_{n_2}.

Proof: For fixed r \ge 1, we obtain from Chapter 6 and the definition of u that

E(\bar y_u\mid u''_1=k_1,\ldots,u''_r=k_r,n_2=r) = \frac{1}{r}\sum_{i=1}^{r}Y_{k_i}.

Now, if \ell = \max\{0,n-N_1\} and m = \min\{n,N_2\}, and upon observing that E(\bar y_u I_{[n_2=0]}) = 0, we have

E(\bar y_u\mid u''_1,\ldots,u''_{n_2},n_2) = \sum E(\bar y_u\mid u''_1=k_1,\ldots,u''_r=k_r,n_2=r)\,I_{[u''_1=k_1,\ldots,u''_r=k_r,n_2=r]}
= \sum_{\ell\le r\le m}\left(\frac{1}{r}\sum_{i=1}^{r}Y_{k_i}\right)I_{[u''_1=k_1,\ldots,u''_r=k_r,n_2=r]}
= \frac{1}{n_2}\sum_{i=1}^{n_2}y''_i\,I_{[n_2\ge 1]} = \bar y''_{n_2}.

Q.E.D.
LEMMA 2. In the notation already established,

E(\bar y_u\mid u'_1,\ldots,u'_{n_1},u''_1,\ldots,u''_{n_2},n_2) = \bar y''_{n_2}.

Proof: This lemma has the same proof as Lemma 1, except the conditioning set is now

[u'_1=m_1,\ldots,u'_{n-r}=m_{n-r},u''_1=k_1,\ldots,u''_r=k_r,n_2=r].
THEOREM 1. An unbiased estimate of Y is given by

\hat Y = N\,\frac{n_1\bar y'_{n_1}+n_2\bar y_u}{n}.

Proof: We observe that

E(\hat Y) = \frac{N}{n}\left(E(n_1\bar y'_{n_1})+E(n_2\bar y_u)\right).

Now, by Lemma 1,

E(n_2\bar y_u) = E(E(n_2\bar y_u\mid u''_1,\ldots,u''_{n_2},n_2)) = E(n_2 E(\bar y_u\mid u''_1,\ldots,u''_{n_2},n_2)) = E(n_2\bar y''_{n_2}).

Thus

E(\hat Y) = \frac{N}{n}\left(E(n_1\bar y'_{n_1})+E(n_2\bar y''_{n_2})\right) = N E\left(\frac{n_1\bar y'_{n_1}+n_2\bar y''_{n_2}}{n}\right).

But

\frac{1}{n}(n_1\bar y'_{n_1}+n_2\bar y''_{n_2}) = \bar y,

the arithmetic mean of a simple random sample of size n WOR, its expectation is \bar Y, and hence E(\hat Y) = N\bar Y = Y. Q.E.D.
In order to emphasize that u is a random variable that is a function of n2, we shall sometimes denote it by u(n2).
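A short computational sketch of Theorem 1 (Python; the survey numbers are invented): the estimate just replaces the unobservable mean of the n_2 non-respondents by the mean of the follow-up subsample.

```python
def nonresponse_estimate(N, n, y_respondents, ybar_subsample):
    """Y_hat = N * (n1 * ybar'_{n1} + n2 * ybar_u) / n, where ybar_u is the
    mean of the WOR follow-up subsample of the n2 non-respondents."""
    n1 = len(y_respondents)
    n2 = n - n1
    ybar1 = sum(y_respondents) / n1 if n1 else 0.0
    return N * (n1 * ybar1 + n2 * ybar_subsample) / n

# Mail-survey sketch: 7 of the n = 10 sampled units respond (each with
# y = 1.0); a follow-up subsample of the 3 non-respondents has mean 2.0.
print(nonresponse_estimate(100, 10, [1.0] * 7, 2.0))
```

If the non-respondents really do form a qualitatively different stratum (here, mean 2.0 against the respondents' 1.0), ignoring them and scaling up the respondent mean alone would have given 100 instead of 130, which is exactly the bias this procedure removes.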
THEOREM 2. The variance of \hat Y = N(n_1\bar y'_{n_1}+n_2\bar y_u)/n is given by

Var(\hat Y) = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)S_y^2 + \frac{N^2}{n^2}S_{y''}^2\,E\left(n_2\left(\frac{n_2}{u(n_2)}-1\right)I_{[n_2\ge 2]}\right),

where

S_{y''}^2 = \frac{1}{N_2-1}\sum_{\{i:\,U_i\in\mathcal{N}\}}\left(Y_i-\frac{1}{N_2}\sum_{\{k:\,U_k\in\mathcal{N}\}}Y_k\right)^2.
Proof: Let B = \{u'_1,\ldots,u'_{n_1},u''_1,\ldots,u''_{n_2},n_2\}. Then

E(\hat Y\mid B) = \frac{N}{n}E(n_1\bar y'_{n_1}\mid B)+\frac{N}{n}E(n_2\bar y_u\mid B).

Since n_1, n_2 and \bar y'_{n_1} are functions of the random variables in B, it follows that

E(\hat Y\mid B) = \frac{N}{n}\{n_1\bar y'_{n_1}+n_2 E(\bar y_u\mid B)\}.

By Lemma 2 we obtain

E(\hat Y\mid B) = \frac{N}{n}\{n_1\bar y'_{n_1}+n_2\bar y''_{n_2}\} = N\bar y.

Now applying the formula for the variance of the arithmetic mean of a simple random sample WOR, we have

Var(E(\hat Y\mid B)) = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)S_y^2.

We next wish to compute E(Var(\hat Y\mid B)). Since n_1\bar y'_{n_1} and n_2 are functions of the random variables in B, we obtain (using results from Section 4.2) that

Var\left(\frac{N}{n}(n_1\bar y'_{n_1}+n_2\bar y_u)\,\Big|\,B\right) = \frac{N^2 n_2^2}{n^2}Var(\bar y_u\mid B).

Now

Var(\bar y_u\mid B) = \sum Var(\bar y_u\mid B=b)\,I_{[B=b]},
where the sum is taken over all b of the form

b = \{m_1,\ldots,m_{n-r},k_1,\ldots,k_r,r\}.

By results from Chapter 6,

Var(\bar y_u\mid B=b) = \frac{1}{u(r)}\left(1-\frac{u(r)}{r}\right)\frac{1}{r-1}\sum_{i=1}^{r}\left(Y_{k_i}-\frac{1}{r}\sum_{j=1}^{r}Y_{k_j}\right)^2.

If we define

s_{y''}^2 = \frac{1}{n_2-1}\sum_{i=1}^{n_2}(y''_i-\bar y''_{n_2})^2\,I_{[n_2\ge 2]},

the above expressions for Var(\bar y_u\mid B) and Var(\bar y_u\mid B=b) imply

Var(\bar y_u\mid B) = \frac{1}{u(n_2)}\left(1-\frac{u(n_2)}{n_2}\right)s_{y''}^2.

Note that s_{y''}^2 is the sample variance of a sample of size n_2 from \mathcal{N}. Thus, from Chapter 6, for r \ge 2,

E(s_{y''}^2\mid n_2=r) = S_{y''}^2,

where S_{y''}^2 is as defined in the statement of the theorem, and E(s_{y''}^2\mid n_2=r) = 0 if r = 0 or 1 by the definition above of s_{y''}^2. Thus

E(s_{y''}^2\mid n_2) = \sum_{r=\max\{0,n-N_1\}}^{\min\{n,N_2\}}E(s_{y''}^2\mid n_2=r)\,I_{[n_2=r]} = S_{y''}^2\,I_{[n_2\ge 2]}.

Finally,

E(Var(\hat Y\mid B)) = \frac{N^2}{n^2}E\left(n_2^2\,\frac{1}{u(n_2)}\left(1-\frac{u(n_2)}{n_2}\right)E(s_{y''}^2\mid n_2)\right) = \frac{N^2}{n^2}S_{y''}^2\,E\left(n_2\left(\frac{n_2}{u(n_2)}-1\right)I_{[n_2\ge 2]}\right),
and, since u(1) = 1, the term corresponding to n_2 = 1 vanishes in any case. Since Var(\hat Y) = Var(E(\hat Y\mid B)) + E(Var(\hat Y\mid B)), we obtain the conclusion of the theorem. Q.E.D.
An unbiased estimate of Var(Y) is beyond the scope of this course.
EXERCISES
1. Prove: E(\bar y'_{n_1}) = \frac{1}{N_1}P[n_1\ge 1]\sum\{Y_i : U_i\in\mathcal{R}\}.

2. Prove: If \tau is a non-negative integer-valued random variable, say range(\tau) = \{0,1,2,\ldots,r\}, if y_1,\ldots,y_n is a simple random sample taken WOR, and if \tau and \{y_1,\ldots,y_n\} are independent, then

E(y_1+\cdots+y_\tau) = E(\tau)E(y_1).

3. Prove that n_1\bar y'_{n_1}+n_2\bar y''_{n_2} is the sum of a simple random sample of size n on (U; y).

4. Prove: \bar y_u = \bar y''_{n_2} on [n_2 = 1].
5. Prove that E(n2) = nN2/N.
11.3 Sampling for Stratification

In the stratified sampling considered in Chapter 9, it was assumed that we knew which units were in which strata. In this section we shall stratify a population by means of a predictor variable x. However, unlike previous cases where we knew the values of x for all the units, we now know the value of x for a unit only if we can observe that unit. It turns out that it is sometimes quick, cheap and easy to take a large sample of units in order to observe the values of x. Then, upon stratification of this sample, we may take a subsample of the sample within each stratum and thus take some advantage of uniform values
of y within each stratum. Let us begin to make these general remarks a little more precise.
We assume that we are dealing with a population of units with a predictor variable, x, namely

U: U_1 \cdots U_N
x: X_1 \cdots X_N
y: Y_1 \cdots Y_N
We do not know the x-value of any unit until we obtain it in a sample. Before any sampling is done, we decide on numbers a_1 < a_2 < \cdots < a_{L-1} by which the population is to be stratified. This means that \mathcal{U}_1 will consist of those units in U whose x-values are equal to or less than a_1, i.e.,

\mathcal{U}_1 = \{U_i\in U : x(U_i) = X_i \le a_1\},

then

\mathcal{U}_j = \{U_i\in U : a_{j-1} < x(U_i) = X_i \le a_j\}

for 2 \le j \le L-1, and finally

\mathcal{U}_L = \{U_i\in U : a_{L-1} < x(U_i)\}.
Thus U = \bigcup_{j=1}^{L}\mathcal{U}_j. We shall let N_h denote the number of units in stratum \mathcal{U}_h; note that we do not know the values of N_1, \ldots, N_L. However, N = N_1 + \cdots + N_L. The notation S_y^2 and S_{yh}^2 used in Chapter 9 will continue to be used in this section.
We now take a simple random sample WOR of size n from this population and observe the x-values of the n units obtained. These observations of the x-values are used only to decide which strata the units in the sample came from. We shall let n_h denote the number of units in this sample that are in stratum \mathcal{U}_h, 1 \le h \le L. Thus n = n_1 + \cdots + n_L. It should be emphasized that each n_h is a random variable; indeed, each n_h has a hypergeometric distribution, and the joint density of n_1, \ldots, n_{L-1} is that of a multivariate hypergeometric distribution. Notice that it is possible for n_h to be zero.
For each h let \nu_h be a function of n_h which satisfies the following two requirements:

i) \nu_h = 0 if n_h = 0;

ii) if n_h > 0, then 1 \le \nu_h \le n_h.

Thus \nu_h is a random variable that satisfies [\nu_h = 0] = [n_h = 0]. In order to emphasize that \nu_h is a deterministic function of n_h we shall sometimes write \nu_h(n_h). Thus, when n_h takes the value i, then \nu_h takes the value \nu_h(i). In the second stage of our sampling we take a simple random sample WOR of size \nu_h from among the n_h units that we selected in \mathcal{U}_h, and we observe the y-values of these units.
We shall denote the unit numbers from the initial sample of size n that are in \mathcal{U}_h by 1 \le u'_{h1} < u'_{h2} < \cdots < u'_{hn_h} \le N_h. Note that these unit numbers are observable random variables. For each h we shall let y'_{h1}, \ldots, y'_{hn_h} denote the y-values of these units, none of which is observable until after the subsample of size \nu_h is taken. We shall let y_{h1}, \ldots, y_{h\nu_h} denote the observable y-values of the \nu_h units obtained from the n_h units. Further notation used is:

i) y_1, \ldots, y_n will denote the unobservable y-values of the n units selected in the first stage of our sampling,

ii) s_{y'h}^2 = \frac{1}{n_h-1}\sum_{i=1}^{n_h}(y'_{hi}-\bar y'_{n_h})^2\,I_{[n_h\ge 2]}, where \bar y'_{n_h} = \frac{1}{n_h}\sum_{i=1}^{n_h}y'_{hi}\,I_{[n_h\ge 1]}, and

iii) \bar y_h = \frac{1}{\nu_h}\sum_{i=1}^{\nu_h}y_{hi}\,I_{[n_h\ge 1]}.

It should be noticed that two-stage sampling for non-response is but a special case of the scheme considered here. For the former scheme, L = 2 and \nu_1 = n_1 if we let \mathcal{U}_1 = \mathcal{R} and \mathcal{U}_2 = \mathcal{N}.
THEOREM 1. An unbiased estimate of Y is \hat Y = N\frac{1}{n}\sum_{h=1}^{L}n_h\bar y_h.

Proof: Using results obtained in Chapter 4 on conditional expectation, we have, for 1 \le h \le L,

E(n_h\bar y_h) = E(E(n_h\bar y_h\mid u'_{h1},\ldots,u'_{hn_h},n_h)) = E(n_h E(\bar y_h\mid u'_{h1},\ldots,u'_{hn_h},n_h)).

By Lemma 1 in Section 2,

E(\bar y_h\mid u'_{h1},\ldots,u'_{hn_h},n_h) = \bar y'_{n_h}.

Thus

E(\hat Y) = N\frac{1}{n}\sum_{h=1}^{L}E(n_h\bar y'_{n_h}) = N E\left(\frac{1}{n}\sum_{h=1}^{L}n_h\bar y'_{n_h}\right) = N E\left(\frac{1}{n}\sum_{i=1}^{n}y_i\right) = N E(\bar y) = N\bar Y = Y.

Q.E.D.
THEOREM 2. If \hat Y = N\frac{1}{n}\sum_{h=1}^{L}n_h\bar y_h, then

Var(\hat Y) = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)S_y^2 + \frac{N^2}{n^2}\sum_{h=1}^{L}S_{yh}^2\,E\left(n_h\left(\frac{n_h}{\nu_h(n_h)}-1\right)I_{[n_h\ge 2]}\right).
Proof: We shall let B denote the collection of random variables consisting of all unit numbers in the first stage sample, 1 \le u_1 < u_2 < \cdots < u_n \le N. Clearly every random variable in \{u_1,\ldots,u_n\} is a function of the random variables in

\bigcup_{h=1}^{L}\{u'_{h1},\ldots,u'_{hn_h},n_h\},

and every random variable in the latter is a function of u_1,\ldots,u_n. Thus we may use either set of random variables as a definition of B as far as conditioning is concerned. By Lemma 2 in Section 2, and by properties of conditional expectation,

E(n_h\bar y_h\mid B) = n_h E(\bar y_h\mid B) = n_h\bar y'_{n_h}.

Thus

E(\hat Y\mid B) = N\frac{1}{n}\sum_{h=1}^{L}E(n_h\bar y_h\mid B)
= N\frac{1}{n}\sum_{h=1}^{L}n_h\bar y'_{n_h} = N\frac{1}{n}\sum_{i=1}^{n}y_i = N\bar y.

Thus, from Chapter 6 we have

Var(E(\hat Y\mid B)) = N^2 Var(\bar y) = N^2\frac{1}{n}\left(1-\frac{n}{N}\right)S_y^2.
We next wish to evaluate E(Var(\hat Y\mid B)). Because of the way in which our sampling is done, namely, the joint distribution of y_{h1},\ldots,y_{h\nu_h} is independent of the joint distribution of observations on other strata once n_1,\ldots,n_L, and hence \nu_1,\ldots,\nu_L, are known, it follows that n_1\bar y_1,\ldots,n_L\bar y_L are conditionally independent given B. (For the definition and the next step in connection with conditional independence, see Section 4.2.) Thus,

Var(\hat Y\mid B) = \frac{N^2}{n^2}Var\left(\sum_{h=1}^{L}n_h\bar y_h\,\Big|\,B\right) = \frac{N^2}{n^2}\sum_{h=1}^{L}Var(n_h\bar y_h\mid B),

and since n_h is a function of the random variables in B,

Var(\hat Y\mid B) = \frac{N^2}{n^2}\sum_{h=1}^{L}n_h^2 Var(\bar y_h\mid B).

In the proof of Theorem 2 in Section 2, we proved that

Var(\bar y_h\mid B) = \frac{1}{\nu_h}\left(1-\frac{\nu_h}{n_h}\right)s_{y'h}^2.

Thus

E(n_h^2 Var(\bar y_h\mid B)) = E(E(n_h^2 Var(\bar y_h\mid B)\mid n_h)).
In the proof of Theorem 2 in Section 2 we also showed that

E(s_{y'h}^2\mid n_h) = S_{yh}^2\,I_{[n_h\ge 2]}.

Hence

E(n_h^2 Var(\bar y_h\mid B)) = S_{yh}^2\,E\left(n_h\left(\frac{n_h}{\nu_h}-1\right)I_{[n_h\ge 2]}\right).

Thus

E(Var(\hat Y\mid B)) = \frac{N^2}{n^2}\sum_{h=1}^{L}S_{yh}^2\,E\left(n_h\left(\frac{n_h}{\nu_h}-1\right)I_{[n_h\ge 2]}\right).

Now we make use of the result that states

Var(\hat Y) = Var(E(\hat Y\mid B)) + E(Var(\hat Y\mid B))

in order to obtain the conclusion. Q.E.D.
An unbiased estimate for Var(\hat Y) is beyond the scope of this course.
EXERCISES
1. Find the formula for the joint density of n_1, \ldots, n_{L-1}.

2. If L \ge 3, find the formula for the joint density of \nu_1, \nu_2.

3. Find the joint density of (u'_{h1}, \ldots, u'_{hn_h}, n_h).

4. Write n_h as a function of u_1, \ldots, u_n.

5. Write u_i as a function of the random variables in

\bigcup_{h=1}^{L}\{u'_{h1},\ldots,u'_{hn_h},n_h\}.
Appendix A
The Normal Distribution
Table of the Standard Normal Distribution Function Φ(z) = P(Z ≤ z). The row label gives z to one decimal place; the column heading supplies the second decimal.

  z    .00   .01   .02   .03   .04   .05   .06   .07   .08   .09
0.00  .5000 .5040 .5080 .5120 .5160 .5199 .5239 .5279 .5319 .5359
0.10  .5398 .5438 .5478 .5517 .5557 .5596 .5636 .5675 .5714 .5753
0.20  .5793 .5832 .5871 .5910 .5948 .5987 .6026 .6064 .6103 .6141
0.30  .6179 .6217 .6255 .6293 .6331 .6368 .6406 .6443 .6480 .6517
0.40  .6554 .6591 .6628 .6664 .6700 .6736 .6772 .6808 .6844 .6879
0.50  .6915 .6950 .6985 .7019 .7054 .7088 .7123 .7157 .7190 .7224
0.60  .7257 .7291 .7324 .7357 .7389 .7422 .7454 .7486 .7517 .7549
0.70  .7580 .7611 .7642 .7673 .7704 .7734 .7764 .7794 .7823 .7852
0.80  .7881 .7910 .7939 .7967 .7995 .8023 .8051 .8078 .8106 .8133
0.90  .8159 .8186 .8212 .8238 .8264 .8289 .8315 .8340 .8365 .8389
1.00  .8413 .8438 .8461 .8485 .8508 .8531 .8554 .8577 .8599 .8621
1.10  .8643 .8665 .8686 .8708 .8729 .8749 .8770 .8790 .8810 .8830
1.20  .8849 .8869 .8888 .8907 .8925 .8944 .8962 .8980 .8997 .9015
1.30  .9032 .9049 .9066 .9082 .9099 .9115 .9131 .9147 .9162 .9177
1.40  .9192 .9207 .9222 .9236 .9251 .9265 .9279 .9292 .9306 .9319
1.50  .9332 .9345 .9357 .9370 .9382 .9394 .9406 .9418 .9429 .9441
1.60  .9452 .9463 .9474 .9484 .9495 .9505 .9515 .9525 .9535 .9545
1.70  .9554 .9564 .9573 .9582 .9591 .9599 .9608 .9616 .9625 .9633
1.80  .9641 .9649 .9656 .9664 .9671 .9678 .9686 .9693 .9699 .9706
1.90  .9713 .9719 .9726 .9732 .9738 .9744 .9750 .9756 .9761 .9767
2.00  .9772 .9778 .9783 .9788 .9793 .9798 .9803 .9808 .9812 .9817
2.10  .9821 .9826 .9830 .9834 .9838 .9842 .9846 .9850 .9854 .9857
2.20  .9861 .9864 .9868 .9871 .9875 .9878 .9881 .9884 .9887 .9890
2.30  .9893 .9896 .9898 .9901 .9904 .9906 .9909 .9911 .9913 .9916
2.40  .9918 .9920 .9922 .9925 .9927 .9929 .9931 .9932 .9934 .9936
2.50  .9938 .9940 .9941 .9943 .9945 .9946 .9948 .9949 .9951 .9952
2.60  .9953 .9955 .9956 .9957 .9959 .9960 .9961 .9962 .9963 .9964
2.70  .9965 .9966 .9967 .9968 .9969 .9970 .9971 .9972 .9973 .9974
2.80  .9974 .9975 .9976 .9977 .9977 .9978 .9979 .9979 .9980 .9981
2.90  .9981 .9982 .9982 .9983 .9984 .9984 .9985 .9985 .9986 .9986
3.00  .9987 .9987 .9987 .9988 .9988 .9989 .9989 .9989 .9990 .9990
3.10  .9990 .9991 .9991 .9991 .9992 .9992 .9992 .9992 .9993 .9993
3.20  .9993 .9993 .9994 .9994 .9994 .9994 .9994 .9995 .9995 .9995
3.30  .9995 .9995 .9995 .9996 .9996 .9996 .9996 .9996 .9996 .9997
3.40  .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9997 .9998
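Each entry of this table is a value of the standard normal distribution function, Φ(z) = (1 + erf(z/√2))/2, where erf is the error function. As a quick check (a minimal Python sketch, not part of the original text; the function name `std_normal_cdf` is ours), any entry can be reproduced from the standard library:

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """Return Phi(z) = P(Z <= z) for a standard normal random variable Z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Row 1.90, column 0.06 of the table gives Phi(1.96):
print(round(std_normal_cdf(1.96), 4))  # 0.975
```

The same identity also runs in reverse: for values of z beyond the table's range of 3.49, Φ(z) rounds to 1.0000 at four decimal places, which is why the table stops where it does.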
Index
Bin(n, p) 42
Bayes' theorem 23
Bernoulli trials 41
Bernoulli's theorem 85
binomial coefficient 5
binomial distribution 42
binomial theorem 5
bivariate density 32
bivariate normal population 153
Cauchy-Schwarz inequality 164
central limit theorem 86
central moment 51
Chebishev's inequality 83, 104
cluster sampling 171
combinatorial probability 3
complement of a set or event 10
conditional covariance as a number 72
conditional covariance as a random variable 75
conditional expectation as a number 65
conditional expectation as a random variable 69
conditional independence 78
conditional probability, definition of 20
conditional variance as a number 73
conditional variance as a random variable 77
correlation coefficient 58
covariance 58
DeMorgan formulae 12
density of a random variable, definition of 31
disjoint events 11
double sampling 185
elementary event 9
empty set 10
equality of two events, definition of 11
event 9
expectation, definition of 47
fundamental probability set 9
fundamental probability space 38
hypergeometric distribution 43, 110
independent random variables, definition of 36
indicator of an event 29
individual outcomes 9
intersection 11
Laplace-DeMoivre theorem 88
law of large numbers 84
marginal or marginal density 34
moments of a random variable 51
multinomial distribution 44
multiplication rule 21
Neyman-Tchuprow theorem 166
normal distribution 87
p.p.a.s. sampling 146
permutation 4
Polya urn scheme 21, 45
population, definition of 91
probability of an event 5
probability proportional to aggregate size sampling 146
probability proportional to size sampling WOR 120
probability proportional to size sampling WR 119, 125
proportions, estimations of 108
random number 5, 93
random number generator 93
random variable, definition of 27
range of a random variable, definition of 28
ratio estimation 140
relative frequency, definition of 1
sample of size n 38
sampling without replacement 21
Schwarz' inequality 57
skipping method 122
standard deviation 51
standard normal distribution 103
stratification 158
stratified sampling 157
stratum 158
subset, definition of 11
sure event 9, 10
total probabilities, theorem of 22
two-stage sampling 185
two-stage sampling for stratification 198
unbiased estimate, definition of 99
uniform distribution 44
variance of a random variable 51
Wilcoxon distribution 111
WOR means 'without replacement' 98
WR means 'with replacement' 98