
1 Harvard’s STAT 110 class (whose lectures are available on youtube) is a highly recommended introduction to probability. See also these lecture notes from MIT’s “Mathematics for Computer Science” course.

17 Probability Theory 101

“God doesn’t play dice with the universe”, Albert Einstein

“Einstein was doubly wrong … not only does God definitely play dice, but He sometimes confuses us by throwing them where they can’t be seen.”, Stephen Hawking

“‘The probability of winning a battle’ has no place in our theory because it does not belong to any [random experiment]. Probability cannot be applied to this problem any more than the physical concept of work can be applied to the ‘work’ done by an actor reciting his part.”, Richard Von Mises, 1928 (paraphrased)

“I am unable to see why ‘objectivity’ requires us to interpret every probability as a frequency in some random experiment; particularly when in most problems probabilities are frequencies only in an imaginary universe invented just for the purpose of allowing a frequency interpretation.”, E.T. Jaynes, 1976

Before we show how to use randomness in algorithms, let us do a quick review of some basic notions in probability theory. This is not meant to replace a course on probability theory, and if you have not seen this material before, I highly recommend you look at additional resources to get up to speed.1 Fortunately, we will not need many of the advanced notions of probability theory, but, as we will see, even the so-called “simple” setting of tossing 𝑛 coins can lead to very subtle and interesting issues.

Compiled on 6.5.2018 21:56

Learning Objectives:

• Review the basic notions of probability theory that we will use.

• Sample spaces, and in particular the space {0, 1}^𝑛.

• Events, probabilities of unions and intersections.

• Random variables and their expectation, variance, and standard deviation.

• Independence and correlation for both events and random variables.

• Markov, Chebyshev and Chernoff tail bounds (bounding the probability that a random variable will deviate from its expectation).

382 introduction to theoretical computer science

17.1 RANDOM COINS

The nature of randomness and probability is a topic of great philosophical, scientific and mathematical depth. Is there actual randomness in the world, or does it proceed in a deterministic clockwork fashion from some initial conditions set at the beginning of time? Does probability refer to our uncertainty of beliefs, or to the frequency of occurrences in repeated experiments? How can we define probability over infinite sets?

These are all important questions that have been studied and debated by scientists, mathematicians, statisticians and philosophers. Fortunately, we will not need to deal directly with these questions here. We will be mostly interested in the setting of tossing 𝑛 random, unbiased and independent coins. Below we define the basic probabilistic objects of events and random variables when restricted to this setting. These can be defined for much more general probabilistic experiments or sample spaces, and later on we will briefly discuss how this can be done. However, the 𝑛-coin case is sufficient for almost everything we’ll need in this course.

If instead of “heads” and “tails” we encode the sides of each coin by “zero” and “one”, we can encode the result of tossing 𝑛 coins as a string in {0, 1}^𝑛. Each particular outcome 𝑥 ∈ {0, 1}^𝑛 is obtained with probability 2^{−𝑛}. For example, if we toss three coins, then we obtain each of the 8 outcomes 000, 001, 010, 011, 100, 101, 110, 111 with probability 2^{−3} = 1/8 (see also Fig. 17.1). We can describe the experiment of tossing 𝑛 coins as choosing a string 𝑥 uniformly at random from {0, 1}^𝑛, and hence we’ll use the shorthand 𝑥 ∼ {0, 1}^𝑛 for 𝑥 that is chosen according to this experiment.

Figure 17.1: The probabilistic experiment of tossing three coins corresponds to making 2 × 2 × 2 = 8 choices, each with equal probability. In this example, the blue set corresponds to the event 𝐴 = {𝑥 ∈ {0, 1}^3 | 𝑥0 = 0} where the first coin toss is equal to 0, and the pink set corresponds to the event 𝐵 = {𝑥 ∈ {0, 1}^3 | 𝑥1 = 1} where the second coin toss is equal to 1 (with their intersection having a purplish color). As we can see, each of these events contains 4 elements (out of 8 total) and so has probability 1/2. The intersection of 𝐴 and 𝐵 contains two elements, and so the probability that both of these events occur is 2/8 = 1/4.

An event is simply a subset 𝐴 of {0, 1}^𝑛. The probability of 𝐴, denoted by Pr_{𝑥∼{0,1}^𝑛}[𝐴] (or Pr[𝐴] for short, when the sample space is understood from the context), is the probability that an 𝑥 chosen uniformly at random will be contained in 𝐴. Note that this is the same as |𝐴|/2^𝑛 (where |𝐴| as usual denotes the number of elements in the set 𝐴). For example, the probability that 𝑥 has an even number of ones is Pr[𝐴] where 𝐴 = {𝑥 ∶ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 = 0 mod 2}. In the case 𝑛 = 3, 𝐴 = {000, 011, 101, 110}, and hence Pr[𝐴] = 4/8 = 1/2. It turns out this is true for every 𝑛:

Lemma 17.1

Pr_{𝑥∼{0,1}^𝑛} [ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 is even ] = 1/2   (17.1)

P To test your intuition on probability, try to stop here and prove the lemma on your own.

Proof of Lemma 17.1. Let 𝐴 = {𝑥 ∈ {0, 1}^𝑛 ∶ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 = 0 mod 2}. Since every 𝑥 is obtained with probability 2^{−𝑛}, to show this we need to show that |𝐴| = 2^𝑛/2 = 2^{𝑛−1}. For every 𝑥0, … , 𝑥𝑛−2, if ∑_{𝑖=0}^{𝑛−2} 𝑥𝑖 is even then (𝑥0, … , 𝑥𝑛−2, 0) ∈ 𝐴 and (𝑥0, … , 𝑥𝑛−2, 1) ∉ 𝐴. Similarly, if ∑_{𝑖=0}^{𝑛−2} 𝑥𝑖 is odd then (𝑥0, … , 𝑥𝑛−2, 1) ∈ 𝐴 and (𝑥0, … , 𝑥𝑛−2, 0) ∉ 𝐴. Hence, for every one of the 2^{𝑛−1} prefixes (𝑥0, … , 𝑥𝑛−2), there is exactly a single continuation of (𝑥0, … , 𝑥𝑛−2) that places it in 𝐴. ∎
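The lemma is also easy to check by brute force for small 𝑛. The following short Python sketch (ours, not part of the text; the function name is our own) enumerates {0, 1}^𝑛 and confirms that exactly 2^{𝑛−1} strings have an even number of ones:

```python
from itertools import product

def parity_count(n):
    """Number of strings in {0,1}^n whose coordinates sum to an even number."""
    return sum(1 for x in product([0, 1], repeat=n) if sum(x) % 2 == 0)

# |A| = 2^(n-1) for every n, as the lemma claims
for n in range(1, 10):
    assert parity_count(n) == 2 ** (n - 1)
```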

We can also use the intersection (∩) and union (∪) operators to talk about the probability of both event 𝐴 and event 𝐵 happening, or the probability of event 𝐴 or event 𝐵 happening. For example, the probability 𝑝 that 𝑥 has an even number of ones and 𝑥0 = 1 is the same as Pr[𝐴 ∩ 𝐵] where 𝐴 = {𝑥 ∈ {0, 1}^𝑛 ∶ ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 = 0 mod 2} and 𝐵 = {𝑥 ∈ {0, 1}^𝑛 ∶ 𝑥0 = 1}. This probability is equal to 1/4. (It is a great exercise for you to pause here and verify that you understand why this is the case.)

Because intersection corresponds to considering the logical AND of the conditions that two events happen, while union corresponds to considering the logical OR, we will sometimes use the ∧ and ∨ operators instead of ∩ and ∪, and so write this probability 𝑝 = Pr[𝐴 ∩ 𝐵] defined above also as

Pr_{𝑥∼{0,1}^𝑛} [ ∑_𝑖 𝑥𝑖 = 0 mod 2 ∧ 𝑥0 = 1 ] .   (17.2)

If 𝐴 ⊆ {0, 1}^𝑛 is an event, then its complement 𝐴̄ = {0, 1}^𝑛 ∖ 𝐴 corresponds to the event that 𝐴 does not happen. Since |𝐴̄| = 2^𝑛 − |𝐴|, we get that

Pr[𝐴̄] = |𝐴̄|/2^𝑛 = (2^𝑛 − |𝐴|)/2^𝑛 = 1 − |𝐴|/2^𝑛 = 1 − Pr[𝐴]   (17.3)


2 In many probability texts a random variable is always defined to have values in the set ℝ of real numbers, and this will be our default option as well. However, in some contexts in theoretical computer science we can consider random variables mapping to other sets such as {0, 1}^∗.

This makes sense: since 𝐴̄ happens if and only if 𝐴 does not happen, the probability of 𝐴̄ should be one minus the probability of 𝐴.

R Remember the sample space. While the above definition might seem very simple and almost trivial, the human mind seems not to have evolved for probabilistic reasoning, and it is surprising how often people can get even the simplest settings of probability wrong. One way to make sure you don’t get confused when trying to calculate probability statements is to always ask yourself the following two questions: (1) Do I understand what is the sample space that this probability is taken over?, and (2) Do I understand what is the definition of the event that we are analyzing?

For example, suppose that I were to randomize seating in my course, and then it turned out that students sitting in row 7 performed better on the final: how surprising should we find this? If we started out with the hypothesis that there is something special about the number 7 and chose it ahead of time, then the event that we are discussing is the event 𝐴 that students sitting in row 7 had better performance on the final, and we might find it surprising. However, if we first looked at the results and then chose the row whose average performance is best, then the event we are discussing is the event 𝐵 that there exists some row where the performance is higher than the overall average. 𝐵 is a superset of 𝐴, and its probability (even if there is no correlation between sitting and performance) can be quite significant.

17.1.1 Random variables
Events correspond to Yes/No questions, but often we want to analyze finer questions. For example, if we make a bet at the roulette wheel, we don’t want to just analyze whether we won or lost, but also how much we’ve gained. A (real valued) random variable is simply a way to associate a number with the result of a probabilistic experiment. Formally, a random variable is simply a function 𝑋 ∶ {0, 1}^𝑛 → ℝ that maps every outcome 𝑥 ∈ {0, 1}^𝑛 to a real number 𝑋(𝑥).2 For example, the function 𝑠𝑢𝑚 ∶ {0, 1}^𝑛 → ℝ that maps 𝑥 to the sum of its coordinates (i.e., to ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖) is a random variable.

The expectation of a random variable 𝑋, denoted by 𝔼[𝑋], is the average value that this number takes, taken over all draws from the probabilistic experiment. In other words, the expectation of 𝑋 is defined as follows:

𝔼[𝑋] = ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛} 𝑋(𝑥) .   (17.4)
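To make the definition concrete, here is a short Python sketch (ours, not from the text) that computes Eq. (17.4) by direct enumeration, using Python’s built-in `sum` as the 𝑠𝑢𝑚 random variable:

```python
from itertools import product

def expectation(X, n):
    """E[X] over x ~ {0,1}^n, computed directly as in Eq. (17.4)."""
    return sum(X(x) for x in product([0, 1], repeat=n)) / 2 ** n

# The 'sum' random variable maps each outcome to the sum of its coordinates.
print(expectation(sum, 3))  # → 1.5
```

As the linearity argument below shows, this matches the general formula 𝑛/2 for the expected number of ones.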


If 𝑋 and 𝑌 are random variables, then we can define 𝑋 + 𝑌 as simply the random variable that maps a point 𝑥 ∈ {0, 1}^𝑛 to 𝑋(𝑥) + 𝑌(𝑥). One basic and very useful property of the expectation is that it is linear:

Lemma 17.2 — Linearity of expectation.

𝔼[𝑋 + 𝑌] = 𝔼[𝑋] + 𝔼[𝑌]   (17.5)

Proof.

𝔼[𝑋 + 𝑌] = ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛} (𝑋(𝑥) + 𝑌(𝑥)) = ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛}𝑋(𝑥) + ∑_{𝑥∈{0,1}^𝑛} 2^{−𝑛}𝑌(𝑥) = 𝔼[𝑋] + 𝔼[𝑌]   (17.6)

Similarly, 𝔼[𝑘𝑋] = 𝑘 𝔼[𝑋] for every 𝑘 ∈ ℝ. For example, using the linearity of expectation, it is very easy to show that the expectation of the sum of the 𝑥𝑖’s for 𝑥 ∼ {0, 1}^𝑛 is equal to 𝑛/2. Indeed, if we write 𝑋 = ∑_{𝑖=0}^{𝑛−1} 𝑥𝑖 then 𝑋 = 𝑋0 + ⋯ + 𝑋𝑛−1 where 𝑋𝑖 is the random variable 𝑥𝑖. Since for every 𝑖, Pr[𝑋𝑖 = 0] = 1/2 and Pr[𝑋𝑖 = 1] = 1/2, we get that 𝔼[𝑋𝑖] = (1/2) ⋅ 0 + (1/2) ⋅ 1 = 1/2 and hence 𝔼[𝑋] = ∑_{𝑖=0}^{𝑛−1} 𝔼[𝑋𝑖] = 𝑛 ⋅ (1/2) = 𝑛/2.

P If you have not seen discrete probability before, please go over this argument again until you are sure you follow it; it is a prototypical simple example of the type of reasoning we will employ again and again in this course.

If 𝐴 is an event, then 1𝐴 is the random variable such that 1𝐴(𝑥) equals 1 if 𝑥 ∈ 𝐴, and 1𝐴(𝑥) = 0 otherwise. Note that Pr[𝐴] = 𝔼[1𝐴] (can you see why?). Using this and the linearity of expectation, we can show one of the most useful bounds in probability theory:

Lemma 17.3 — Union bound. For every two events 𝐴, 𝐵, Pr[𝐴 ∪ 𝐵] ≤ Pr[𝐴] + Pr[𝐵].

P Before looking at the proof, try to see why the union bound makes intuitive sense. We can also prove it directly from the definition of probabilities and the cardinality of sets, together with the equation |𝐴 ∪ 𝐵| ≤ |𝐴| + |𝐵|. Can you see why the latter equation is true? (See also Fig. 17.2.)


Proof of Lemma 17.3. For every 𝑥, 1_{𝐴∪𝐵}(𝑥) ≤ 1𝐴(𝑥) + 1𝐵(𝑥). Hence, Pr[𝐴 ∪ 𝐵] = 𝔼[1_{𝐴∪𝐵}] ≤ 𝔼[1𝐴 + 1𝐵] = 𝔼[1𝐴] + 𝔼[1𝐵] = Pr[𝐴] + Pr[𝐵]. ∎

The way we often use this in theoretical computer science is to argue that, for example, if there is a list of 100 bad events that can happen, and each one of them happens with probability at most 1/10000, then with probability at least 1 − 100/10000 = 0.99, no bad event happens.

Figure 17.2: The union bound tells us that the probability of 𝐴 or 𝐵 happening is at most the sum of the individual probabilities. We can see it by noting that for every two sets |𝐴 ∪ 𝐵| ≤ |𝐴| + |𝐵| (with equality only if 𝐴 and 𝐵 have no intersection).
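As a quick sanity check (not part of the original text), the following sketch verifies both the indicator identity Pr[𝐴] = 𝔼[1𝐴] and the union bound by enumeration over {0, 1}^3; the choice of events is ours:

```python
from itertools import product

space = list(product([0, 1], repeat=3))

A = {x for x in space if sum(x) % 2 == 0}   # even number of ones
B = {x for x in space if x[0] == 1}         # first coin is 1

def prob(event):                            # Pr[A] = |A| / 2^n
    return len(event) / len(space)

# Indicator identity: Pr[A] equals the average of 1_A over the sample space
assert prob(A) == sum(1 if x in A else 0 for x in space) / len(space)
# Union bound: Pr[A ∪ B] <= Pr[A] + Pr[B]
assert prob(A | B) <= prob(A) + prob(B)
```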

17.1.2 Distributions over strings
While most of the time we think of random variables as having as output a real number, we sometimes consider random variables whose output is a string. That is, we can think of a map 𝑌 ∶ {0, 1}^𝑛 → {0, 1}^∗ and consider the “random variable” 𝑌 such that for every 𝑦 ∈ {0, 1}^∗, the probability that 𝑌 outputs 𝑦 is equal to (1/2^𝑛) |{𝑥 ∈ {0, 1}^𝑛 | 𝑌(𝑥) = 𝑦}|. To avoid confusion, we will typically refer to such string-valued random variables as distributions over strings. So, a distribution 𝑌 over strings {0, 1}^∗ can be thought of as a finite collection of strings 𝑦0, … , 𝑦𝑀−1 ∈ {0, 1}^∗ and probabilities 𝑝0, … , 𝑝𝑀−1 (which are non-negative numbers summing up to one), so that Pr[𝑌 = 𝑦𝑖] = 𝑝𝑖.

Two distributions 𝑌 and 𝑌′ are identical if they assign the same probability to every string. For example, consider the following two functions 𝑌, 𝑌′ ∶ {0, 1}^2 → {0, 1}^2. For every 𝑥 ∈ {0, 1}^2, we define 𝑌(𝑥) = 𝑥 and 𝑌′(𝑥) = 𝑥0(𝑥0 ⊕ 𝑥1) where ⊕ is the XOR operation. Although these are two different functions, they induce the same distribution over {0, 1}^2 when invoked on a uniform input. The distribution


3 TODO: add exercise on simulating die tosses and choosing a random number in [𝑚] by coin tosses

4 Another thorny issue is of course the difference between correlation and causation. Luckily, this is another point we don’t need to worry about in our clean setting of tossing 𝑛 coins.

𝑌(𝑥) for 𝑥 ∼ {0, 1}^2 is of course the uniform distribution over {0, 1}^2. On the other hand, 𝑌′ is simply the map 00 ↦ 00, 01 ↦ 01, 10 ↦ 11, 11 ↦ 10, which is a permutation over {0, 1}^2; here 𝑌 is the map 𝐹 ∶ {0, 1}^2 → {0, 1}^2 defined as 𝐹(𝑥0𝑥1) = 𝑥0𝑥1 and 𝑌′ is the map 𝐺 ∶ {0, 1}^2 → {0, 1}^2 defined as 𝐺(𝑥0𝑥1) = 𝑥0(𝑥0 ⊕ 𝑥1).

17.1.3 More general sample spaces
While in this lecture we assume that the underlying probabilistic experiment corresponds to tossing 𝑛 independent coins, everything we say easily generalizes to sampling 𝑥 from a more general finite or countable set 𝑆 (and not-so-easily generalizes to uncountable sets 𝑆 as well). A probability distribution over a finite set 𝑆 is simply a function 𝜇 ∶ 𝑆 → [0, 1] such that ∑_{𝑥∈𝑆} 𝜇(𝑥) = 1. We think of this as the experiment where we obtain every 𝑥 ∈ 𝑆 with probability 𝜇(𝑥), and sometimes denote this as 𝑥 ∼ 𝜇. An event 𝐴 is a subset of 𝑆, and the probability of 𝐴, which we denote by Pr_𝜇[𝐴], is ∑_{𝑥∈𝐴} 𝜇(𝑥). A random variable is a function 𝑋 ∶ 𝑆 → ℝ, where the probability that 𝑋 = 𝑦 is equal to ∑_{𝑥∈𝑆 s.t. 𝑋(𝑥)=𝑦} 𝜇(𝑥).


17.2 CORRELATIONS AND INDEPENDENCE

One of the most delicate but important concepts in probability is the notion of independence (and the opposing notion of correlations). Subtle correlations are often behind surprises and errors in probability and statistical analysis, and several mistaken predictions have been blamed on miscalculating the correlations between, say, housing prices in Florida and Arizona, or voter preferences in Ohio and Michigan. See also Joe Blitzstein’s aptly named talk “Conditioning is the Soul of Statistics”.4

Two events 𝐴 and 𝐵 are independent if the fact that 𝐴 happens makes 𝐵 neither more nor less likely to happen. For example, if we think of the experiment of tossing 3 random coins 𝑥 ∈ {0, 1}^3, and we let 𝐴 be the event that 𝑥0 = 1 and 𝐵 the event that 𝑥0 + 𝑥1 + 𝑥2 ≥ 2, then if 𝐴 happens it is more likely that 𝐵 happens, and hence these events are not independent. On the other hand, if we let 𝐶 be the event that 𝑥1 = 1, then because the second coin toss is not affected by the result of the first one, the events 𝐴 and 𝐶 are independent.

The formal definition is that events 𝐴 and 𝐵 are independent if Pr[𝐴 ∩ 𝐵] = Pr[𝐴] ⋅ Pr[𝐵]. If Pr[𝐴 ∩ 𝐵] > Pr[𝐴] ⋅ Pr[𝐵] then we say that 𝐴 and 𝐵 are positively correlated, while if Pr[𝐴 ∩ 𝐵] < Pr[𝐴] ⋅ Pr[𝐵] then we say that 𝐴 and 𝐵 are negatively correlated (see Fig. 17.1).

Figure 17.3: Two events 𝐴 and 𝐵 are independent if Pr[𝐴 ∩ 𝐵] = Pr[𝐴] ⋅ Pr[𝐵]. In the two figures above, the empty 𝑥 × 𝑥 square is the sample space, and 𝐴 and 𝐵 are two events in this sample space. In the left figure, 𝐴 and 𝐵 are independent, while in the right figure they are negatively correlated, since 𝐵 is less likely to occur if we condition on 𝐴 (and vice versa). Mathematically, one can see this by noticing that in the left figure the areas of 𝐴 and 𝐵 respectively are 𝑎 ⋅ 𝑥 and 𝑏 ⋅ 𝑥, and so their probabilities are (𝑎 ⋅ 𝑥)/𝑥^2 = 𝑎/𝑥 and (𝑏 ⋅ 𝑥)/𝑥^2 = 𝑏/𝑥 respectively, while the area of 𝐴 ∩ 𝐵 is 𝑎 ⋅ 𝑏, which corresponds to the probability (𝑎 ⋅ 𝑏)/𝑥^2. In the right figure, the area of the triangle 𝐵 is (𝑏 ⋅ 𝑥)/2, which corresponds to a probability of 𝑏/(2𝑥), but the area of 𝐴 ∩ 𝐵 is (𝑏′ ⋅ 𝑎)/2 for some 𝑏′ < 𝑏. This means that the probability of 𝐴 ∩ 𝐵 is (𝑏′ ⋅ 𝑎)/(2𝑥^2) < (𝑏/(2𝑥)) ⋅ (𝑎/𝑥), or in other words Pr[𝐴 ∩ 𝐵] < Pr[𝐴] ⋅ Pr[𝐵].

If we consider the above examples on the experiment of choosing 𝑥 ∈ {0, 1}^3, then we can see that

Pr[𝑥0 = 1] = 1/2
Pr[𝑥0 + 𝑥1 + 𝑥2 ≥ 2] = Pr[{011, 101, 110, 111}] = 4/8 = 1/2   (17.7)

but

Pr[𝑥0 = 1 ∧ 𝑥0 + 𝑥1 + 𝑥2 ≥ 2] = Pr[{101, 110, 111}] = 3/8 > 1/2 ⋅ 1/2   (17.8)

and hence, as we already observed, the events {𝑥0 = 1} and {𝑥0 + 𝑥1 + 𝑥2 ≥ 2} are not independent and in fact are positively correlated. On the other hand, Pr[𝑥0 = 1 ∧ 𝑥1 = 1] = Pr[{110, 111}] = 2/8 = 1/2 ⋅ 1/2, and hence the events {𝑥0 = 1} and {𝑥1 = 1} are indeed independent.

R Disjointness vs independence. People sometimes confuse the notions of disjointness and independence, but these are actually quite different. Two events 𝐴 and 𝐵 are disjoint if 𝐴 ∩ 𝐵 = ∅, which means that if 𝐴 happens then 𝐵 definitely does not happen. They are independent if Pr[𝐴 ∩ 𝐵] = Pr[𝐴] Pr[𝐵], which means that knowing that 𝐴 happens gives us no information about whether 𝐵 happened or not. If 𝐴 and 𝐵 have nonzero probability, then being disjoint implies that they are not independent, since in particular it means that they are negatively correlated.


5 We use {𝑋 = 𝑢} as shorthand for {𝑥 | 𝑋(𝑥) = 𝑢}.

Conditional probability: If 𝐴 and 𝐵 are events, and 𝐴 happens with nonzero probability, then we define the probability that 𝐵 happens conditioned on 𝐴 to be Pr[𝐵|𝐴] = Pr[𝐴 ∩ 𝐵]/Pr[𝐴]. This corresponds to calculating the probability that 𝐵 happens if we already know that 𝐴 happened. Note that 𝐴 and 𝐵 are independent if and only if Pr[𝐵|𝐴] = Pr[𝐵].

More than two events: We can generalize this definition to more than two events. We say that events 𝐴1, … , 𝐴𝑘 are mutually independent if knowing that any set of them occurred or didn’t occur does not change the probability that an event outside the set occurs. Formally, the condition is that for every subset 𝐼 ⊆ [𝑘],

Pr[∧_{𝑖∈𝐼} 𝐴𝑖] = ∏_{𝑖∈𝐼} Pr[𝐴𝑖].   (17.9)

For example, if 𝑥 ∼ {0, 1}^3, then the events {𝑥0 = 1}, {𝑥1 = 1} and {𝑥2 = 1} are mutually independent. On the other hand, the events {𝑥0 = 1}, {𝑥1 = 1} and {𝑥0 + 𝑥1 = 0 mod 2} are not mutually independent, even though every pair of these events is independent (can you see why? see also Fig. 17.4).

Figure 17.4: Consider the sample space {0, 1}^𝑛 and the events 𝐴, 𝐵, 𝐶, 𝐷, 𝐸 corresponding to 𝐴: 𝑥0 = 1, 𝐵: 𝑥1 = 1, 𝐶: 𝑥0 + 𝑥1 + 𝑥2 ≥ 2, 𝐷: 𝑥0 + 𝑥1 + 𝑥2 = 0 mod 2 and 𝐸: 𝑥0 + 𝑥1 = 0 mod 2. We can see that 𝐴 and 𝐵 are independent, 𝐶 is positively correlated with 𝐴 and positively correlated with 𝐵, the three events 𝐴, 𝐵, 𝐷 are mutually independent, and while every pair out of 𝐴, 𝐵, 𝐸 is independent, the three events 𝐴, 𝐵, 𝐸 are not mutually independent since their intersection has probability 2/8 = 1/4 instead of 1/2 ⋅ 1/2 ⋅ 1/2 = 1/8.
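The pairwise-but-not-mutual example above can be checked mechanically. The sketch below (ours, not from the text) enumerates {0, 1}^3 and tests the three events 𝐴, 𝐵, 𝐸:

```python
from itertools import product

space = list(product([0, 1], repeat=3))

def p(event):                         # Pr of an event, by enumeration
    return len([x for x in space if event(x)]) / len(space)

A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
E = lambda x: (x[0] + x[1]) % 2 == 0  # x0 + x1 = 0 mod 2

# Every pair of the three events is independent...
assert p(lambda x: A(x) and B(x)) == p(A) * p(B)
assert p(lambda x: A(x) and E(x)) == p(A) * p(E)
assert p(lambda x: B(x) and E(x)) == p(B) * p(E)
# ...but the triple is not mutually independent: 2/8 != 1/8.
assert p(lambda x: A(x) and B(x) and E(x)) != p(A) * p(B) * p(E)
```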

17.2.1 Independent random variables
We say that two random variables 𝑋 ∶ {0, 1}^𝑛 → ℝ and 𝑌 ∶ {0, 1}^𝑛 → ℝ are independent if for every 𝑢, 𝑣 ∈ ℝ, the events {𝑋 = 𝑢} and {𝑌 = 𝑣} are independent.5 In other words, 𝑋 and 𝑌 are independent if Pr[𝑋 = 𝑢 ∧ 𝑌 = 𝑣] = Pr[𝑋 = 𝑢] Pr[𝑌 = 𝑣] for every 𝑢, 𝑣 ∈ ℝ. For example, if


two random variables depend on the result of tossing different coins, then they are independent:

Lemma 17.4 Suppose that 𝑆 = {𝑠0, … , 𝑠𝑘−1} and 𝑇 = {𝑡0, … , 𝑡𝑚−1} are disjoint subsets of {0, … , 𝑛 − 1} and let 𝑋, 𝑌 ∶ {0, 1}^𝑛 → ℝ be random variables such that 𝑋 = 𝐹(𝑥_{𝑠0}, … , 𝑥_{𝑠𝑘−1}) and 𝑌 = 𝐺(𝑥_{𝑡0}, … , 𝑥_{𝑡𝑚−1}) for some functions 𝐹 ∶ {0, 1}^𝑘 → ℝ and 𝐺 ∶ {0, 1}^𝑚 → ℝ. Then 𝑋 and 𝑌 are independent.

P The notation in the lemma’s statement is a bit cumbersome, but at the end of the day, it simply says that if 𝑋 and 𝑌 are random variables that depend on two disjoint sets 𝑆 and 𝑇 of coins (for example, 𝑋 might be the sum of the first 𝑛/2 coins, and 𝑌 might be the largest consecutive stretch of zeroes in the second 𝑛/2 coins), then they are independent.

Proof of Lemma 17.4. Let 𝑎, 𝑏 ∈ ℝ, and let 𝐴 = {𝑥 ∈ {0, 1}^𝑘 ∶ 𝐹(𝑥) = 𝑎} and 𝐵 = {𝑥 ∈ {0, 1}^𝑚 ∶ 𝐺(𝑥) = 𝑏}. Since 𝑆 and 𝑇 are disjoint, we can reorder the indices so that 𝑆 = {0, … , 𝑘 − 1} and 𝑇 = {𝑘, … , 𝑘 + 𝑚 − 1} without affecting any of the probabilities. Hence we can write Pr[𝑋 = 𝑎 ∧ 𝑌 = 𝑏] = |𝐶|/2^𝑛 where 𝐶 = {𝑥0, … , 𝑥𝑛−1 ∶ (𝑥0, … , 𝑥𝑘−1) ∈ 𝐴 ∧ (𝑥𝑘, … , 𝑥𝑘+𝑚−1) ∈ 𝐵}. Another way to write this using string concatenation is that 𝐶 = {𝑥𝑦𝑧 ∶ 𝑥 ∈ 𝐴, 𝑦 ∈ 𝐵, 𝑧 ∈ {0, 1}^{𝑛−𝑘−𝑚}}, and hence |𝐶| = |𝐴||𝐵|2^{𝑛−𝑘−𝑚}, which means that

|𝐶|/2^𝑛 = (|𝐴|/2^𝑘) ⋅ (|𝐵|/2^𝑚) ⋅ (2^{𝑛−𝑘−𝑚}/2^{𝑛−𝑘−𝑚}) = Pr[𝑋 = 𝑎] Pr[𝑌 = 𝑏].   (17.10) ∎

Note that if 𝑋 and 𝑌 are independent random variables then (if we let 𝑆𝑋, 𝑆𝑌 denote all the numbers that have positive probability of being the output of 𝑋 and 𝑌, respectively) it holds that:

𝔼[𝑋𝑌] = ∑_{𝑎∈𝑆𝑋, 𝑏∈𝑆𝑌} Pr[𝑋 = 𝑎 ∧ 𝑌 = 𝑏] ⋅ 𝑎𝑏 =(1) ∑_{𝑎∈𝑆𝑋, 𝑏∈𝑆𝑌} Pr[𝑋 = 𝑎] Pr[𝑌 = 𝑏] ⋅ 𝑎𝑏 =(2) ( ∑_{𝑎∈𝑆𝑋} Pr[𝑋 = 𝑎] ⋅ 𝑎 ) ( ∑_{𝑏∈𝑆𝑌} Pr[𝑌 = 𝑏] ⋅ 𝑏 ) =(3) 𝔼[𝑋] 𝔼[𝑌]   (17.11)

where the first equality (=(1)) follows from the independence of 𝑋 and 𝑌, the second equality (=(2)) follows by “opening the parentheses” of the righthand side, and the third equality (=(3)) follows from the definition of expectation. (This is not an “if and only if”; see Exercise 17.2.)


Another useful fact is that if 𝑋 and 𝑌 are independent random variables, then so are 𝐹(𝑋) and 𝐺(𝑌) for all functions 𝐹, 𝐺 ∶ ℝ → ℝ. This is intuitively true since learning 𝐹(𝑋) can only provide us with less information than does learning 𝑋 itself. Hence, if learning 𝑋 does not teach us anything about 𝑌 (and so also about 𝐺(𝑌)) then neither will learning 𝐹(𝑋). Indeed, to prove this we can write for every 𝑎, 𝑏 ∈ ℝ:

Pr[𝐹(𝑋) = 𝑎 ∧ 𝐺(𝑌) = 𝑏] = ∑_{𝑥 s.t. 𝐹(𝑥)=𝑎, 𝑦 s.t. 𝐺(𝑦)=𝑏} Pr[𝑋 = 𝑥 ∧ 𝑌 = 𝑦]
= ∑_{𝑥 s.t. 𝐹(𝑥)=𝑎, 𝑦 s.t. 𝐺(𝑦)=𝑏} Pr[𝑋 = 𝑥] Pr[𝑌 = 𝑦]
= ( ∑_{𝑥 s.t. 𝐹(𝑥)=𝑎} Pr[𝑋 = 𝑥] ) ⋅ ( ∑_{𝑦 s.t. 𝐺(𝑦)=𝑏} Pr[𝑌 = 𝑦] )
= Pr[𝐹(𝑋) = 𝑎] Pr[𝐺(𝑌) = 𝑏].   (17.12)

17.2.2 Collections of independent random variables
We can extend the notions of independence to more than two random variables: we say that the random variables 𝑋0, … , 𝑋𝑛−1 are mutually independent if for every 𝑎0, … , 𝑎𝑛−1 ∈ ℝ,

Pr [𝑋0 = 𝑎0 ∧ ⋯ ∧ 𝑋𝑛−1 = 𝑎𝑛−1] = Pr[𝑋0 = 𝑎0] ⋯ Pr[𝑋𝑛−1 = 𝑎𝑛−1].(17.13)

And similarly, we have that

Lemma 17.5 — Expectation of product of independent random variables. If 𝑋0, … , 𝑋𝑛−1 are mutually independent then

𝔼[ ∏_{𝑖=0}^{𝑛−1} 𝑋𝑖 ] = ∏_{𝑖=0}^{𝑛−1} 𝔼[𝑋𝑖].   (17.14)

Lemma 17.6 — Functions preserve independence. If 𝑋0, … , 𝑋𝑛−1 are mutually independent, and 𝑌0, … , 𝑌𝑛−1 are defined as 𝑌𝑖 = 𝐹𝑖(𝑋𝑖) for some functions 𝐹0, … , 𝐹𝑛−1 ∶ ℝ → ℝ, then 𝑌0, … , 𝑌𝑛−1 are mutually independent as well.

P We leave proving Lemma 17.5 and Lemma 17.6 as Exercise 17.3 and Exercise 17.4. It is a good idea for you to stop now and do these exercises to make sure you are comfortable with the notion of independence, as we will use it heavily later on in this course.


17.3 CONCENTRATION

The name “expectation” is somewhat misleading. For example, suppose that you and I place a bet on the outcome of 10 coin tosses, where if they all come out to be 1’s then I pay you 100,000 dollars and otherwise you pay me 10 dollars. If we let 𝑋 ∶ {0, 1}^10 → ℝ be the random variable denoting your gain, then we see that

𝔼[𝑋] = 2^{−10} ⋅ 100000 − (1 − 2^{−10}) ⋅ 10 ∼ 90.   (17.15)

But we don’t really “expect” the result of this experiment to be for you to gain 90 dollars. Rather, 99.9% of the time you will pay me 10 dollars, and you will hit the jackpot 0.1% of the time.

However, if we repeat this experiment again and again (with fresh and hence independent coins), then in the long run we do expect your average earning to be 90 dollars, which is the reason why casinos can make money in a predictable way even though every individual bet is random. For example, if we toss 𝑛 coins, then as 𝑛 grows, the number of coins that come up ones will be more and more concentrated around 𝑛/2 according to the famous “bell curve” (see Fig. 17.5).
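Eq. (17.15) is a two-line computation (our sketch, using the numbers from the text):

```python
p_win = 2 ** -10                           # probability that all 10 tosses are 1
gain = p_win * 100_000 - (1 - p_win) * 10  # expected gain, as in Eq. (17.15)
print(round(gain, 2))  # → 87.67, i.e. the "~90 dollars" of the text
```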

Figure 17.5: The probabilities that we obtain a particular sum when we toss 𝑛 = 10, 20, 100, 1000 coins converge quickly to the Gaussian/normal distribution.

Much of probability theory is concerned with so called concentration or tail bounds, which are upper bounds on the probability that a random variable 𝑋 deviates too much from its expectation. The first and simplest one of them is Markov’s inequality:

Theorem 17.7 — Markov’s inequality. If 𝑋 is a non-negative random variable then Pr[𝑋 ≥ 𝑘 𝔼[𝑋]] ≤ 1/𝑘.

P Markov’s Inequality is actually a very natural statement (see also Fig. 17.6). For example, if you know that the average (not the median!) household income in the US is 70,000 dollars, then in particular you can deduce that at most 25 percent of households make more than 280,000 dollars, since otherwise, even if the remaining 75 percent had zero income, the top 25 percent alone would cause the average income to be larger than 70,000. From this example you can already see that in many situations, Markov’s inequality will not be tight and the probability of deviating from expectation will be much smaller: see the Chebyshev and Chernoff inequalities below.

Proof of Theorem 17.7. Let 𝜇 = 𝔼[𝑋] and define 𝑌 = 1_{𝑋≥𝑘𝜇}. That is, 𝑌(𝑥) = 1 if 𝑋(𝑥) ≥ 𝑘𝜇 and 𝑌(𝑥) = 0 otherwise. Note that by definition, for every 𝑥, 𝑌(𝑥) ≤ 𝑋(𝑥)/(𝑘𝜇). We need to show 𝔼[𝑌] ≤ 1/𝑘. But this follows since 𝔼[𝑌] ≤ 𝔼[𝑋/(𝑘𝜇)] = 𝔼[𝑋]/(𝑘𝜇) = 𝜇/(𝑘𝜇) = 1/𝑘. ∎

Figure 17.6: Markov’s Inequality tells us that a non-negative random variable 𝑋 cannot be much larger than its expectation, with high probability. For example, if the expectation of 𝑋 is 𝜇, then the probability that 𝑋 > 4𝜇 must be at most 1/4, as otherwise just the contribution from this part of the sample space will be too large.
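Markov’s inequality can also be checked exactly for a small example of our own choosing: for 𝑋 the number of ones in 10 fair coin tosses and 𝑘 = 1.5, the exact tail probability is 56/1024 ≈ 0.055, far below the bound 1/𝑘 ≈ 0.67, illustrating how loose Markov can be:

```python
from math import comb

n, k = 10, 1.5
mu = n / 2                       # E[X] for X = number of ones in n fair tosses
# Exact tail probability Pr[X >= k * mu], summing the binomial distribution
tail = sum(comb(n, j) for j in range(n + 1) if j >= k * mu) / 2 ** n
assert tail <= 1 / k             # Markov's bound holds (and is far from tight)
print(tail)  # → 0.0546875
```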

Going beyond Markov’s Inequality: Markov’s inequality says that a (non-negative) random variable 𝑋 can’t go too crazy and be, say, a million times its expectation, with significant probability. But


ideally we would like to say that with high probability, 𝑋 should be very close to its expectation, e.g., in the range [0.99𝜇, 1.01𝜇] where 𝜇 = 𝔼[𝑋]. This is not generally true, but does turn out to hold when 𝑋 is obtained by combining (e.g., adding) many independent random variables. This phenomenon, variants of which are known as the “law of large numbers”, “central limit theorem”, “invariance principles” and “Chernoff bounds”, is one of the most fundamental in probability and statistics, and is one that we heavily use in computer science as well.

17.3.1 Chebyshev’s Inequality
A standard way to measure the deviation of a random variable from its expectation is by using its standard deviation. For a random variable 𝑋, we define the variance of 𝑋 as Var[𝑋] = 𝔼[(𝑋 − 𝜇)^2] where 𝜇 = 𝔼[𝑋]; i.e., the variance is the average squared distance of 𝑋 from its expectation. The standard deviation of 𝑋 is defined as 𝜎[𝑋] = √Var[𝑋]. (This is well-defined since the variance, being an average of a square, is always a non-negative number.)

Using Chebyshev’s inequality, we can control the probability that a random variable is too many standard deviations away from its expectation.

Theorem 17.8 — Chebyshev’s inequality. Suppose that 𝜇 = 𝔼[𝑋] and 𝜎^2 = Var[𝑋]. Then for every 𝑘 > 0, Pr[|𝑋 − 𝜇| ≥ 𝑘𝜎] ≤ 1/𝑘^2.

Proof. The proof follows from Markov’s inequality. We define the random variable 𝑌 = (𝑋 − 𝜇)^2. Then 𝔼[𝑌] = Var[𝑋] = 𝜎^2, and hence by Markov the probability that 𝑌 > 𝑘^2𝜎^2 is at most 1/𝑘^2. But clearly (𝑋 − 𝜇)^2 ≥ 𝑘^2𝜎^2 if and only if |𝑋 − 𝜇| ≥ 𝑘𝜎. ∎

One example of how to use Chebyshev’s inequality is the setting where 𝑋 = 𝑋1 + ⋯ + 𝑋𝑛 where the 𝑋𝑖’s are independent and identically distributed (i.i.d for short) variables with values in [0, 1] where each has expectation 1/2. Since 𝔼[𝑋] = ∑𝑖 𝔼[𝑋𝑖] = 𝑛/2, we would like to say that 𝑋 is very likely to be in, say, the interval [0.499𝑛, 0.501𝑛]. Using Markov’s inequality directly will not help us, since it will only tell us that 𝑋 is very likely to be at most 100𝑛 (which we already knew, since it always lies between 0 and 𝑛). However, since 𝑋1, … , 𝑋𝑛 are independent,

Var[𝑋1 + ⋯ + 𝑋𝑛] = Var[𝑋1] + ⋯ + Var[𝑋𝑛] . (17.16)

(We leave showing this to the reader as Exercise 17.5.)

For every random variable 𝑋𝑖 in [0, 1], Var[𝑋𝑖] ≤ 1 (if the variable is always in [0, 1], it can’t be more than 1 away from its expectation), and hence Eq. (17.16) implies that Var[𝑋] ≤ 𝑛 and hence 𝜎[𝑋] ≤ √𝑛.


6 Specifically, for a normal random variable 𝑋 of expectation 𝜇 and standard deviation 𝜎, the probability that |𝑋 − 𝜇| ≥ 𝑘𝜎 is at most 2𝑒^{−𝑘^2/2}.

For large 𝑛, √𝑛 ≪ 0.001𝑛, and in particular if √𝑛 ≤ 0.001𝑛/𝑘, we can use Chebyshev’s inequality to bound the probability that 𝑋 is not in [0.499𝑛, 0.501𝑛] by 1/𝑘^2.

17.3.2 The Chernoff bound
Chebyshev’s inequality already shows a connection between independence and concentration, but in many cases we can hope for a quantitatively much stronger result. If, as in the example above, 𝑋 = 𝑋1 + … + 𝑋𝑛 where the 𝑋𝑖’s are bounded i.i.d random variables of mean 1/2, then as 𝑛 grows, the distribution of 𝑋 would be roughly the normal or Gaussian distribution, that is, distributed according to the bell curve (see Fig. 17.5 and Fig. 17.7). This distribution has the property of being very concentrated in the sense that the probability of deviating 𝑘 standard deviations from the mean is not merely 1/𝑘^2 as is guaranteed by Chebyshev, but rather is roughly 𝑒^{−𝑘^2}.6 That is, we have an exponential decay of the probability of deviation.

Figure 17.7: In the normal distribution or the bell curve, the probability of deviating 𝑘 standard deviations from the expectation shrinks exponentially in 𝑘², and specifically with probability at least 1 − 2𝑒^{−𝑘²/2}, a random variable 𝑋 of expectation 𝜇 and standard deviation 𝜎 satisfies 𝜇 − 𝑘𝜎 ≤ 𝑋 ≤ 𝜇 + 𝑘𝜎. This figure gives more precise bounds for 𝑘 = 1, 2, 3, 4, 5, 6. (Image credit: Imran Baghirov)

The following extremely useful theorem shows that such exponential decay occurs every time we have a sum of independent and bounded variables. This theorem is known under many names in different communities, though it is mostly called the Chernoff bound in the computer science literature:


7 TODO: maybe add an example application of Chernoff. Perhaps a probabilistic method proof using Chernoff+Union bound.

Theorem 17.9 — Chernoff/Hoeffding bound. If 𝑋0, … , 𝑋𝑛−1 are i.i.d. random variables such that 𝑋𝑖 ∈ [0, 1] and 𝔼[𝑋𝑖] = 𝑝 for every 𝑖, then for every 𝜖 > 0

Pr[ |∑_{𝑖=0}^{𝑛−1} 𝑋𝑖 − 𝑝𝑛| > 𝜖𝑛 ] ≤ 2 ⋅ 𝑒^{−2𝜖²𝑛}. (17.17)

We omit the proof, which appears in many texts, and uses Markov’s inequality on i.i.d. random variables 𝑌0, … , 𝑌𝑛−1 that are of the form 𝑌𝑖 = 𝑒^{𝜆𝑋𝑖} for some carefully chosen parameter 𝜆. See Exercise 17.8 for a proof of the simple (but highly useful and representative) case where each 𝑋𝑖 is {0, 1} valued and 𝑝 = 1/2. (See also Exercise 17.9 for a generalization.)
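To get a feel for the quantitative gap between Eq. (17.17) and Chebyshev, the sketch below (not from the text; 𝑛 = 1000 and 𝜖 = 0.05 are illustrative choices) compares the Chernoff bound for 𝑝 = 1/2 with an empirical estimate over fair coin tosses:

```python
import math
import random

def chernoff_bound(n, eps):
    # Right-hand side of Eq. (17.17), valid for any p.
    return 2 * math.exp(-2 * eps ** 2 * n)

def empirical_tail(n, eps, trials, seed=1):
    """Empirical frequency of |sum X_i - n/2| > eps*n for fair bits X_i (p = 1/2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = bin(rng.getrandbits(n)).count("1")  # number of ones among n fair bits
        if abs(x - n / 2) > eps * n:
            hits += 1
    return hits / trials

n, eps = 1000, 0.05
print(chernoff_bound(n, eps))        # 2*exp(-5), about 0.0135
print(empirical_tail(n, eps, 5000))  # comfortably below the bound
```

For the same deviation, Chebyshev with Var[𝑋] = 𝑛/4 would only give a bound of 0.1, while the Chernoff bound is already an order of magnitude smaller and improves exponentially as 𝑛 grows.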


17.4 LECTURE SUMMARY

• A basic probabilistic experiment corresponds to tossing 𝑛 coins or choosing 𝑥 uniformly at random from {0, 1}ⁿ.

• Random variables assign a real number to every result of a coin toss. The expectation of a random variable 𝑋 is its average value, and there are several concentration results showing that under certain conditions, random variables deviate significantly from their expectation only with small probability.

17.5 EXERCISES

Exercise 17.1 Give an example of random variables 𝑋, 𝑌 ∶ {0, 1}³ → ℝ such that 𝔼[𝑋𝑌] ≠ 𝔼[𝑋] 𝔼[𝑌]. □

Exercise 17.2 Give an example of random variables 𝑋, 𝑌 ∶ {0, 1}³ → ℝ such that 𝑋 and 𝑌 are not independent but 𝔼[𝑋𝑌] = 𝔼[𝑋] 𝔼[𝑌]. □

Exercise 17.3 — Product of expectations. Prove Lemma 17.5. □

Exercise 17.4 — Transformations preserve independence. Prove Lemma 17.6. □

Exercise 17.5 — Variance of independent random variables. Prove that if 𝑋0, … , 𝑋𝑛−1 are independent random variables then Var[𝑋0 + ⋯ + 𝑋𝑛−1] = ∑_{𝑖=0}^{𝑛−1} Var[𝑋𝑖]. □
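The identity in this exercise can be sanity-checked by exact computation, with no sampling at all. The sketch below (an illustration using dice, which are not part of the text) convolves finite distributions to obtain the distribution of a sum of independent variables, and compares the variance of the sum with the sum of the variances:

```python
def variance(dist):
    """Exact variance of a finite distribution given as {value: probability}."""
    mean = sum(v * p for v, p in dist.items())
    return sum(p * (v - mean) ** 2 for v, p in dist.items())

def sum_of_independent(dists):
    """Distribution of the sum of independent finite random variables,
    computed by repeated convolution."""
    out = {0: 1.0}
    for d in dists:
        new = {}
        for v1, p1 in out.items():
            for v2, p2 in d.items():
                new[v1 + v2] = new.get(v1 + v2, 0.0) + p1 * p2
        out = new
    return out

die = {v: 1 / 6 for v in range(1, 7)}    # a fair six-sided die, Var = 35/12
dists = [die] * 3
lhs = variance(sum_of_independent(dists))  # Var[X0 + X1 + X2]
rhs = sum(variance(d) for d in dists)      # Var[X0] + Var[X1] + Var[X2]
print(abs(lhs - rhs) < 1e-9)  # True
```

Note that independence is essential: running the same check on, say, 𝑋 and 𝑋 itself (perfectly correlated variables) would fail.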

Exercise 17.6 — Entropy (challenge). Recall the definition of a distribution 𝜇 over some finite set 𝑆. Shannon defined the entropy of a distribution 𝜇, denoted by 𝐻(𝜇), to be ∑_{𝑥∈𝑆} 𝜇(𝑥) log(1/𝜇(𝑥)). The idea is that if 𝜇 is a distribution of entropy 𝑘, then encoding members of 𝜇 will require 𝑘 bits, in an amortized sense. In this exercise we justify this definition. Let 𝜇 be such that 𝐻(𝜇) = 𝑘.


8 While you don’t need this to solve this exercise, this is the function that maps 𝑝 to the entropy (as defined in Exercise 17.6) of the 𝑝-biased coin distribution over {0, 1}, which is the function 𝜇 ∶ {0, 1} → [0, 1] s.t. 𝜇(0) = 1 − 𝑝 and 𝜇(1) = 𝑝.

9 Hint: Use Stirling’s formula for approximating the factorial function.

10 Hint: Bound the number of tuples 𝑗0, … , 𝑗𝑛−1 such that every 𝑗𝑖 is even and ∑ 𝑗𝑖 = 𝑘.

11 Hint: Set 𝑘 = 2⌈𝜖²𝑛/1000⌉ and then show that if the event |∑ 𝑌𝑖| ≥ 𝜖𝑛 happens then the random variable (∑ 𝑌𝑖)^𝑘 is a factor of 𝜖^{−𝑘} larger than its expectation.

12 Hint: Think of 𝑥 ∈ {0, 1}ⁿ as choosing 𝑘 numbers 𝑦1, … , 𝑦𝑘 ∈ {0, … , 2^{⌈log 𝑀⌉} − 1}. Output the first such number that is in {0, … , 𝑀 − 1}.

1. Prove that for every one-to-one function 𝐹 ∶ 𝑆 → {0, 1}∗, 𝔼_{𝑥∼𝜇} |𝐹(𝑥)| ≥ 𝑘.

2. Prove that for every 𝜖, there is some 𝑛 and a one-to-one function 𝐹 ∶ 𝑆ⁿ → {0, 1}∗ such that 𝔼_{𝑥∼𝜇ⁿ} |𝐹(𝑥)| ≤ 𝑛(𝑘 + 𝜖), where 𝑥 ∼ 𝜇ⁿ denotes the experiment of choosing 𝑥0, … , 𝑥𝑛−1 each independently from 𝑆 using the distribution 𝜇. □
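In code, Shannon’s definition is a one-liner. The sketch below (not part of the text) uses base-2 logarithms, matching the “bits” interpretation of the exercise:

```python
import math

def entropy(mu):
    """Shannon entropy H(mu) = sum over x of mu(x) * log2(1/mu(x)),
    for a finite distribution given as {outcome: probability}.
    Zero-probability outcomes contribute nothing and are skipped."""
    return sum(p * math.log2(1 / p) for p in mu.values() if p > 0)

# A uniform distribution over 8 elements has entropy 3: encoding a
# sample takes exactly 3 bits, matching the amortized-coding intuition.
uniform8 = {x: 1 / 8 for x in range(8)}
print(entropy(uniform8))  # 3.0
```

A fair coin has entropy 1, and any distribution concentrated on a single point has entropy 0, the two extremes of the “bits needed per sample” scale.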

Exercise 17.7 — Entropy approximation to binomial. Let 𝐻(𝑝) = 𝑝 log(1/𝑝) + (1 − 𝑝) log(1/(1 − 𝑝)).8 Prove that for every 𝑝 ∈ (0, 1) and 𝜖 > 0, if 𝑛 is large enough then9

2^{(𝐻(𝑝)−𝜖)𝑛} ≤ (𝑛 choose 𝑝𝑛) ≤ 2^{(𝐻(𝑝)+𝜖)𝑛} (17.18)

where (𝑛 choose 𝑘) is the binomial coefficient 𝑛!/(𝑘!(𝑛 − 𝑘)!), which is equal to the number of 𝑘-size subsets of {0, … , 𝑛 − 1}. □

Exercise 17.8 — Chernoff using Stirling. 1. Prove that Pr_{𝑥∼{0,1}ⁿ}[∑ 𝑥𝑖 = 𝑘] = (𝑛 choose 𝑘) ⋅ 2^{−𝑛}.

2. Use this and Exercise 17.7 to prove the Chernoff bound for the case that 𝑋0, … , 𝑋𝑛−1 are i.i.d. random variables over {0, 1}, each equaling 0 and 1 with probability 1/2. □

Exercise 17.9 — Poor man’s Chernoff. Let 𝑋0, … , 𝑋𝑛−1 be i.i.d. random variables with 𝔼[𝑋𝑖] = 𝑝 and Pr[0 ≤ 𝑋𝑖 ≤ 1] = 1. Define 𝑌𝑖 = 𝑋𝑖 − 𝑝.

1. Prove that for every 𝑗0, … , 𝑗𝑛−1 ∈ ℕ, if there exists one 𝑖 such that 𝑗𝑖 is odd then 𝔼[∏_{𝑖=0}^{𝑛−1} 𝑌𝑖^{𝑗𝑖}] = 0.

2. Prove that for every 𝑘, 𝔼[(∑_{𝑖=0}^{𝑛−1} 𝑌𝑖)^𝑘] ≤ (10𝑘𝑛)^{𝑘/2}.10

3. Prove that for every 𝜖 > 0, Pr[|∑𝑖 𝑌𝑖| ≥ 𝜖𝑛] ≤ 2^{−𝜖²𝑛/(10000 log 1/𝜖)}.11 □

Exercise 17.10 — Simulating distributions using coins. Our model for probability involves tossing 𝑛 coins, but sometimes algorithms require sampling from other distributions, such as selecting a uniform number in {0, … , 𝑀 − 1} for some 𝑀. Fortunately, we can simulate this with an exponentially small probability of error: prove that for every 𝑀, if 𝑛 > 𝑘⌈log 𝑀⌉, then there is a function 𝐹 ∶ {0, 1}ⁿ → {0, … , 𝑀 − 1} ∪ {⊥} such that (1) the probability that 𝐹(𝑥) = ⊥ is at most 2^{−𝑘} and (2) the distribution of 𝐹(𝑥) conditioned on 𝐹(𝑥) ≠ ⊥ is equal to the uniform distribution over {0, … , 𝑀 − 1}.12 □
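One natural construction for this exercise is rejection sampling, following the hint. The sketch below is an assumption about the intended construction, not the text’s own solution (`sample_mod_M` and its parameters are made-up names); Python’s `None` plays the role of ⊥:

```python
import math
import random

def sample_mod_M(M, k, seed=None):
    """Draw up to k blocks of ceil(log2 M) fair bits; output the first block
    whose value is below M, and None (i.e. the symbol ⊥) if all k blocks
    are rejected."""
    rng = random.Random(seed)
    b = max(1, math.ceil(math.log2(M)))
    for _ in range(k):
        y = rng.getrandbits(b)  # uniform over {0, ..., 2**b - 1}
        if y < M:
            # Conditioned on acceptance, y is uniform over {0, ..., M-1}.
            return y
    return None

# For M >= 2 we have 2**(b-1) < M <= 2**b, so each block is rejected with
# probability (2**b - M)/2**b <= 1/2, giving Pr[None] <= 2**-k as required.
print(sample_mod_M(10, k=64, seed=0))
```

The design choice is the usual one for rejection sampling: trade a small, exponentially decaying failure probability for an output distribution that is exactly (not approximately) uniform.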

Exercise 17.11 — Sampling. Suppose that a country has 300,000,000 citizens, 52 percent of whom prefer the color “green” and 48 percent of whom prefer the color “orange”. Suppose we sample 𝑛 random citizens and ask them their favorite color (assume they will answer truthfully). What is the smallest value of 𝑛 among the following choices so that the probability that the majority of the sample does not answer “green”


13 TODO: add some exercise about the probabilistic method

is at most 0.05? a. 1,000 b. 10,000 c. 100,000 d. 1,000,000 □

Exercise 17.12 Would the answer to Exercise 17.11 change if the country had 300,000,000,000 citizens? □

Exercise 17.13 — Sampling (2). Under the same assumptions as Exercise 17.11, what is the smallest value of 𝑛 among the following choices so that the probability that the majority of the sample does not answer “green” is at most 2^{−100}? a. 1,000 b. 10,000 c. 100,000 d. 1,000,000 e. It is impossible to get such a low probability since there are fewer than 2^{100} citizens. □
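The last three exercises can be sanity-checked by simulation (a sketch, not part of the text; the trial counts are arbitrary). Modeling each sampled citizen as an independent coin that comes up “green” with probability 0.52 also makes clear why the population size, 300,000,000 or otherwise, is irrelevant:

```python
import random

def minority_freq(n, trials, p_green=0.52, seed=0):
    """Estimate the probability that at most half of n sampled citizens
    answer 'green', treating each answer as an independent biased coin."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        greens = sum(rng.random() < p_green for _ in range(n))
        if greens <= n / 2:
            bad += 1
    return bad / trials

for n, trials in [(1000, 1000), (10000, 500)]:
    print(n, minority_freq(n, trials))
```

For the 2^{−100} target of Exercise 17.13 simulation is of course hopeless, and one has to reason with the Chernoff bound directly; the simulation only helps calibrate intuition at moderate failure probabilities.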


17.6 BIBLIOGRAPHICAL NOTES

17.7 FURTHER EXPLORATIONS

Some topics related to this lecture that might be accessible to advanced students include: (to be completed)

17.8 ACKNOWLEDGEMENTS

