Basic Statistics Session 1

    Basic Statistics Introductory Workshop MSBAPM

    Note that much of the material and definitions are from A First Course in Statistics

    Section One - Overview of terms

    Some definitions

    Descriptive Statistics - Methods for describing a dataset as it is.

    Inferential Statistics - Methods for reaching conclusions beyond the immediate data

    Sample data insight --> infers population insight

    Parameter - summary measure for a population

    Statistic - summary measure for a sample

    More definitions

    Population - The set of units that we are interested in studying (e.g., registered voters, college students, MLB players)

    Sample - A subset of the population

    Variables - Characteristics of the population or sample

    Measures - Assignment of a value for each characteristic

    Could be continuous (e.g., age, weight) or categorical (A, B, C, etc.)

    Why inferential statistics?

    Population vs sample: measuring the whole population may not be practical or may be too expensive

    Ex. We want to measure likely voter outcomes. Polling every voter is not practical, so we poll a random sample of voters and then infer what we would get if we had polled ALL voters

    There is a potential for error

    Exercise One

    A survey shows that the average age of TV viewers is 50. Fox believes the average age of their viewers is less than 50. To test this, they sample 100 viewers

    Describe the population

    Describe the variable of interest

    Describe the sample

    Describe the inference

    Exercise One Answers

    A survey shows that the average age of TV viewers is 50. Fox believes the average age of their viewers is less than 50. To test this, they sample 100 viewers

    Describe the population (All Fox viewers)

    Describe the variable of interest (Age)

    Describe the sample (100 of the Fox viewers)

    Describe the inference (The inference is that the average age of the SAMPLE approximates the average age of the population)

    As noted earlier, there could be error in a sample statistic correctly approximating a population parameter (i.e., whether the Fox sample's average age correctly approximates the average age of all Fox viewers)

    We use measures of reliability to reflect the degree of uncertainty in the inference

    This is often referred to as the margin of error or confidence interval

    We will discuss these in more detail later

    Data types

    Exercise Two

    The Army Corps of Engineers is measuring toxic levels in fish. They captured a total of 144 fish and noted the following variables:

    River/creek where fish was captured

    Length

    Weight

    Toxic concentration in ppm

    Which are quantitative and which are qualitative?

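    Exercise Two Answers

    The Army Corps of Engineers is measuring toxic levels in fish. They captured a total of 144 fish and noted the following variables:

    River/creek where fish was captured (Qualitative)

    Length (Quantitative)

    Weight (Quantitative)

    Toxic concentration in ppm (Quantitative)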

    Section Two - Descriptive Statistics

    Exercise Three

    (Table: Patient # and insurance status, coded Insured (1), Medicare/Medicaid (2), Uninsured (3))

    What is the frequency of each of the insurance categories?

    What type of variable is "Patient #"?

    Let's assume the following column of values: 4 5 9 2 1




    What is n?

    What value would correspond to x2?

    What is the value of Σxi?

    Let's assume the following column of values: 4 5 9 2 1




    What is n? n is 5

    What value would correspond to x2? 5

    What is the value of Σxi? 21
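    The notation above can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and not part of the workshop material.

```python
# Values from the slide, in the order they appear in the column.
values = [4, 5, 9, 2, 1]

n = len(values)        # n: the number of observations
x_2 = values[1]        # x2: the second observation (Python counts from 0)
total = sum(values)    # Σxi: add up every observation

print(f"n = {n}, x2 = {x_2}, sum = {total}")
```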

    Exercise Four

    Assume a dataset of 1,2,3,4,5


    Find Σxi²


    Exercise Five









    What is the mean? Median?


    Excel for central measures

    Open up a new Excel workbook

    Enter the numbers from Exercise 5

    Use Excel functions to calculate mean, median and mode

    SAVE this workbook...we will use this later

    Sample mean

    Formula for the sample mean, which is denoted as x̄

    x̄ = Σxi / n, where n is the number of observations

    x̄ is usually referred to as "x bar" (Population mean is "mu")

    The rounding of x̄ is subject to the degree of accuracy necessary

    When n is odd, the median is simply the middle value after arranging all observations either in ascending or descending order

    When n is even, then we use the average of the two middle numbers

    Ex: 1,5,7,9 The median would be 6 (average of the two middle values, 5 and 7)
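    As a minimal sketch of the odd/even rule described above, the two central measures can be written out by hand. The function names and the sample lists are illustrative assumptions, not part of the slides.

```python
def mean(values):
    """Sample mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def median(values):
    """Middle value when n is odd, average of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = [7, 1, 5]        # n is odd: the median is the middle value
even_example = [9, 1, 7, 5]    # n is even: average the two middle values

print(mean(odd_example), median(odd_example), median(even_example))
```

    Sorting first matters: the "middle" value is only meaningful once the observations are in order.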

    Mean or Median?

    If the mean and median are close, we typically use mean

    However, the mean is sensitive to extreme values (outliers) whereas the median is not. In that case, median is often used.

    Often, the median is used when considering household incomes, which can have extreme outliers (Virginia senators)

    Symmetric vs Skewed

    When the mean is to the right of the median, the distribution is right skewed (the skewness number would be positive)

    When the mean is to the left of the median, the distribution is left skewed (the skewness number would be negative)
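    One way to check the sign of the skewness number is to compute it directly. The sketch below uses the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is one of several common definitions; the income figures are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes (in thousands) with one large outlier on the high side,
# so the mean (110) sits to the right of the median (40).
incomes = [30, 35, 40, 45, 400]
mirrored = [-x for x in incomes]   # same shape flipped, tail on the low side

print(skewness(incomes), skewness(mirrored))
```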

    Some other measures of quantitative values

    Max and Min - the highest and lowest values

    Range - the difference between the highest value and the lowest value

    Upper half - all values to the right of the median

    Exercise Six

    Consider {_, 7, _, 5, 2, _, 2, _, _, _} (several values are illegible) Calculate:

    1st quartile

    3rd quartile

    Section Three - Variance and Standard Deviation

  • 7/25/2019 Basic Statistics Session 1



    Central measures are only part of the story

    We are also interested in the "spread" of the distribution

    We earlier considered range

    Now we want to consider the variance and standard deviation




    The distance from an observation to the mean of all observations is called a deviation

    The deviation of each observation would be noted as xi - x̄ (note that this could be positive or negative)

    Imagine a plot with three points {1,2,3}. The mean is 2, so the deviation for 1 is negative one and the deviation for 3 is plus one (the deviation for 2 is 0)

    Example of variance

    Now consider these two data sets {1,9,2,3,4} and {9,2,2,2,0}

    How would you describe these datasets?

    What are the raw deviations for each number in each set?

    Why can't we simply "average out" the deviations?

    Example of variance solution

    {1,9,2,3,4} and {9,2,2,2,0}

    How would you describe these datasets? Set 1 has a mean of 3.8 and set 2 has a mean of 3

    The ranges are 8 and 9, respectively

    They both have 5 members, but there is greater variance in set 2

    What are the raw deviations for each number in each set? {-2.8,5.2,-1.8,-0.8,0.2} and {6,-1,-1,-1,-3}

    Why can't we simply "average out" the deviations to get a single measure of variability?

    It would always equal zero, because the positive and negative deviations cancel out

    There are two ways to handle this: use absolute values (which can cause problems) or square each deviation

    Calculating variance

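    Remember our earlier three-point set? We had deviations of -1, 0 and 1. If we summed them up, they would equal zero, suggesting no spread, but that is clearly not the case as there is a distribution

    Instead we square each deviation (1, 0, 1), sum the squares (2) and divide by n - 1 (2)

    Therefore our variance is 2/2=1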

    Our formula looks like this:

    s² = Σ(xi - x̄)² / (n - 1)

    Standard deviation of sample

    Our standard deviation is the square root of our variance

    In our last case, our variance was 1, so the square root is 1
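    The calculation can be spelled out step by step. This is a sketch that assumes the sample formulas (dividing by n - 1, which matches the 2/2 = 1 result above) and uses the three-point set {1,2,3} from the deviation example; the function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))


points = [1, 2, 3]   # deviations are -1, 0, 1; squared they are 1, 0, 1
print(sample_variance(points), sample_std_dev(points))
```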

    Six sigma


    Standard Deviation and Variance in Excel

    Reopen our previous workbook

    We can use the STDEV and VAR functions

    Exercise seven

    Using excel

    Enter the following numbers into a column: 1 2 3 4 5


    Using Excel functions, return the mean, median, variance and standard deviation
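    For readers without Excel at hand, the same four measures can be obtained from Python's standard `statistics` module. This is offered as a rough equivalent only; `statistics.variance` and `statistics.stdev` are the sample (n - 1) versions, as are Excel's VAR and STDEV.

```python
import statistics

data = [1, 2, 3, 4, 5]

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)       # square root of the sample variance

print(mean, median, variance, round(std_dev, 4))
```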

    The need to visualize

    The measures discussed (mean, std deviation, etc) don't always tell the whole story

    Anscombe's quartet

    Lunch Break

    Why do professional poker players always seem to win?


    Some luck, but they also have a great understanding of probability and odds



    Simple probability - the coin toss

    Imagine tossing a coin

    What is the probability of the toss resulting in "Heads"?

    The answer: .5 or 50%

    Simple probability

    Let's take our last example: we flipped a coin and it came up heads

    We now want to flip the coin again

    What is the probability that it will come up heads again?

    If we flip it 10 times and it is heads all ten times, what is the probability that it will come up heads the 11th time?




    These coin tosses are acts or observations that lead to a single outcome, but that cannot be predicted with certainty.

    Each observation is called a sample point or simple event

    Sample points

    In the case of the coin, there were two sample points: heads and tails

    What are the sample points on a single die (dice)?

    We have to be careful when considering the sample points that are possible (see next slide)

    Sample points

    Problem: List all of the sample points if we flip TWO coins

    We might think there are three sample points when flipping these coins: HH, HT, TT





    Sample points --- sample space

    But there are four: HH, HT, TH, TT (HT and TH are different outcomes)

    The set of sample points is referred to as a sample space and, as observed above, is denoted as S:

    {HH, HT, TH, TT}
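    Listing a sample space by hand is where mistakes like the missing TH creep in, so it can help to enumerate it mechanically. A minimal sketch using `itertools.product` from the standard library; the helper name `sample_space` is an illustrative assumption.

```python
from itertools import product


def sample_space(faces, tosses):
    """Every ordered outcome of tossing an object with the given faces."""
    return ["".join(outcome) for outcome in product(faces, repeat=tosses)]


one_coin = sample_space("HT", 1)
two_coins = sample_space("HT", 2)

print(one_coin)    # two sample points
print(two_coins)   # four sample points, with HT and TH listed separately
```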

    Exercise eight

    Denote the sample space of a single die

    Denote the sample space of the face cards (J, ...)

    Probability defined

    The probability of an event is the likelihood that the outcome will occur when the experiment is performed

    Probability is usually denoted as "P"

    What is the likelihood that when we roll a standard die the roll will result in a "1"?

    It is 1 in 6, or approx. 16.7%

    Probability expanded

    Remember the 10 consecutive coin flips that resulted in heads?

    If we were to conduct that experiment (flipping the coin) one million times, we would very likely see a result that suggests roughly half the flips were heads and half the flips were tails.

    This is the law of large numbers, which states that the relative frequency of an outcome approaches its theoretical probability the more times you repeat the experiment
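    The law of large numbers is easy to see in a simulation. The sketch below assumes a fair coin and uses a fixed seed so that runs are repeatable; the function name and the flip counts are illustrative choices, not from the slides.

```python
import random


def heads_frequency(num_flips, seed=0):
    """Relative frequency of heads in num_flips simulated fair coin flips."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips


# The relative frequency tends to settle near 0.5 as the number of flips grows.
for flips in (10, 1_000, 100_000):
    print(flips, heads_frequency(flips))
```

    With only 10 flips the frequency can be far from 0.5; the long-run behavior is what the law describes, not any short streak.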

    Unclear sample spaces

    You are starting a business. What is the probability that it will succeed?

    We could simply state that it will either succeed or fail, as if the two were equally likely

    However, we know that is not true, so how would we assign a probability?

    We could consider experience in running a similar business

    We could look at success rates of similar businesses

    We could apply statistical techniques using variables such as location, etc.

    Whichever techniques we use, the final assessment of probability is still subjective

    Probability rules

    ALL probabilities, whether subjective or not, must follow 2 basic rules.

    The probability of a sample point MUST lie between 0 and 1

    The probabilities of all sample points within a sample space must sum to 1
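    Both rules can be checked mechanically for any proposed assignment of probabilities. This is a sketch only; the function name and the two example assignments are hypothetical.

```python
import math


def is_valid_assignment(probabilities):
    """Check the two rules: each p lies in [0, 1] and all p together sum to 1."""
    values = list(probabilities.values())
    each_in_range = all(0 <= p <= 1 for p in values)
    sums_to_one = math.isclose(sum(values), 1.0)
    return each_in_range and sums_to_one


fair_coin = {"H": 0.5, "T": 0.5}
broken = {"H": 0.7, "T": 0.5}    # each value is in range, but the total is 1.2

print(is_valid_assignment(fair_coin), is_valid_assignment(broken))
```

    `math.isclose` is used for the sum because decimal probabilities rarely add up to exactly 1.0 in floating point.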

    Exercise nine - Class discussion

    Suppose you are traveling to Orange County, California and are interested in staying at a hotel that has a conservation program.

    What are the sample points for this experiment?

    How would you go about assigning the probabilities to your sample points?

    First, what are the sample points for a single die?

    S: {1,2,3,4,5,6}

    Suppose that instead of the probability of a single sample point, we were interested in something like the probability that the number will be even (or odd). This is called an event

    An event is typically noted as A.

    Event A ("even" in this case) contains sample points 2, 4 and 6, all with probabilities of 1/6. We simply add them to get P of A of 1/2

    Steps for calculating P of event

    Define the experiment, that is, describe the process used to make an observation and the type of observation that will be recorded

    List the sample points

    Assign probabilities to the sample points

    Determine the collection of sample points contained in the event

    Sum the sample point probabilities to get the probability of the event
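    The steps above can be sketched in code for the roll of a single fair die and the event "the number is even". The fair-die assumption and the variable names are illustrative; `Fraction` is used so the result stays an exact fraction.

```python
from fractions import Fraction

# Steps 1 and 2: the experiment is one roll of a fair die; list its sample points.
sample_points = [1, 2, 3, 4, 5, 6]

# Step 3: assign probabilities (a fair die gives each face 1/6).
probabilities = {point: Fraction(1, 6) for point in sample_points}

# Step 4: collect the sample points contained in the event "even".
event_even = [point for point in sample_points if point % 2 == 0]

# Step 5: sum the probabilities of those sample points.
p_even = sum(probabilities[point] for point in event_even)

print(event_even, p_even)
```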

    Exercise ten

    This is a study of divorced people

    Group   Description                                          Proportion
    PP      Joint custody and get along well                     .12
    CC      Occasional conflict                                  .38
    AA      Cooperate on children, conflict otherwise            .25
    FF      Hostile to each other, in conflict on every issue    .25

    Suppose that 100 couples are selected at random:

    List the sample points

    Assign probabilities to the sample points

    What is the probability that the spouses fall into the FF category?

    What is the probability of at least some conflict?

    Identifying sample points from groups

    So far, we have seen small combinations, so identifying the sample points was easy

    But let's assume we wanted to select any sample of 5 items from a jar of 100 items. To understand the probability of any ONE group of 5 items, we need to know how many different groups of 5 there are.

    We could try and list them all, but that would be tedious and doesn't scale well

    So what do we do?

    Combinatorial math

    We utilize a combination formula

    From our last scenario, let's assign N = 100 (total number of items in the jar) and n = 5 (total number of items we select)

    The formula we use is*:

    N! / (n!(N-n)!)

    The exclamation point is called a "factorial" (see next slide)

    *This assumes once an item is selected it is NOT replaced. If it was replaced, the formula would be 100^5 (counting ordered draws), but it wouldn't make sense in this context because we need 5 DIFFERENT items


    Factorial simply means that we multiply a number by each positive whole number before it

    Ex: 5! This means 5 x 4 x 3 x 2 x 1 (0! is 1 by definition)

    So 5! = 120
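    A factorial is simple to compute with a loop. The hand-written function below is a sketch for illustration; in practice Python's built-in `math.factorial` does the same job.

```python
import math


def factorial(n):
    """Multiply n by every positive whole number below it; 0! is 1 by definition."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


print(factorial(5), factorial(0), math.factorial(5))
```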

    Back to combinations

    So in our example, N = 100 and n = 5

    So we would have 100! / (5!(100 - 5)!)

    This involves REALLY big numbers, so let's open up Excel and use the "fact" function to calculate it

    You should get 75,287,520

    That means there are over 75 million combinations

    The probability of selecting any ONE of the combinations is 1/75,287,520
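    The combination count can be reproduced outside a spreadsheet. This sketch applies the formula N! / (n!(N-n)!) directly and cross-checks it against `math.comb` (available in Python 3.8 and later); the function and argument names are illustrative.

```python
import math
from fractions import Fraction


def combinations(total_items, chosen):
    """Number of distinct groups of `chosen` items drawn without replacement."""
    return math.factorial(total_items) // (
        math.factorial(chosen) * math.factorial(total_items - chosen)
    )


groups = combinations(100, 5)        # 5 items from a jar of 100
p_one_group = Fraction(1, groups)    # chance of any ONE specific group

print(groups, p_one_group)
```

    Integer division (`//`) keeps the result exact, which matters because the intermediate factorials are far too large for floating point.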

    Exercise eleven

    Calculate how many samples of 5 items out of a possible 20 there are. (Answer is 15,504)

    Let's calculate LOTTO!!

    Assume 53 numbers and you need the right combination of 6 to win. What is the probability?

    Multiplicative probability of independent events

    What if we have multiple events occurring?

    For example, the P of surviving heart surgery is [...] and the p of surviving the recovery is [...]. What is the probability that if you have heart surgery, you will actually go home?

    We simply multiply the probabilities, so the p of going home is [...] x [...] = [...] (So the hospital would say procedures have an [...]% success rate)
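    The multiplication rule for independent events can be sketched as below. The two probabilities used here (0.9 and 0.8) are made-up illustrative values, not the figures from the slide, and the function name is likewise an assumption.

```python
def probability_of_all(*probabilities):
    """P(all events occur) for independent events: multiply their probabilities."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


# Hypothetical values: 0.9 chance of surviving surgery, 0.8 of surviving recovery.
p_home = probability_of_all(0.9, 0.8)

print(round(p_home, 2))
```

    The rule only applies when the events are independent; if surviving recovery depended on how the surgery went, conditional probabilities would be needed instead.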

    Exercise (Lotto)

    53! / (6!(53 - 6)!) equals 22,957,480

    The probability is 1/22,957,480

    The ticket would state that the odds of winning are roughly 1 in 23 million

    Discrete probability distribution

    In our sample point problems, we were able to list, either manually or using combinatorial math, the sample points.

    The fact that we can "list" these makes them what we refer to as discrete random variables. This term means that there is a finite (or at least countable) number of distinct possible values (or sample points)

    This differs from continuous values, which are infinite and lead to the continuous probability distribution to be discussed shortly

    Discrete probability distribution

    Using our example of two coins

    (Chart: Toss of two coins, with bars for HH, H,T and TT)

    Remember from our earlier example that there were four sample points {HH, HT, TH, TT}

    If we treat HT and TH as a single value, we can calculate the p of that value as ¼ + ¼ or ½

    The graph above depicts our discrete probability distribution

    We can also depict this with a formula

    Discrete Probability Distribution

    The probability distribution of a discrete random variable is a graph, table or formula that specifies the PROBABILITY associated with each value that the random variable can assume

    There are two conditions that must be met:

    p(x) must be greater than or equal to 0 for ALL values of x

    The sum of p(x) over all values of x must equal 1

    Discrete Probability Distribution



    Sometimes, the distribution is discovered after many observations and is not known a priori

    For example, U of AZ researchers used historical records of drought in Texas to show that the distribution of x, where x = number of years that must be sampled until a dry year is observed, could be shown by the formula: p(x) = (.3)(.7)^(x-1), x = 1,2,3,...

    What is the p that exactly 3 consecutive years must be sampled before a dry year is observed?

    P(3) = (.3)(.7)^(3-1) = (.3)(.7)^2 = (.3)(.49) = .147

    Thus there is roughly a 15% chance that, for any 3 consecutive years, the first dry year is the third one
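    A formula-based distribution like this one is straightforward to evaluate in code. The sketch below assumes the reading p(x) = (.3)(.7)^(x-1), that is, a .3 chance of a dry year in any given year; the function name is illustrative.

```python
def p_first_dry_year(x, p_dry=0.3):
    """Probability that the first dry year is observed on the x-th year sampled."""
    return p_dry * (1 - p_dry) ** (x - 1)


p_three = p_first_dry_year(3)

# The probabilities over x = 1, 2, 3, ... should add up to (essentially) 1.
total = sum(p_first_dry_year(x) for x in range(1, 201))

print(round(p_three, 3), round(total, 6))
```

    Summing over many values of x is a quick sanity check that the formula satisfies the second condition for a discrete probability distribution.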

    Mean and std deviation of a discrete random variable

    (Chart: Toss of two coins, with bars for HH (0), H,T (1) and TT (2))

    Let's assign values of 0, 1, 2 to the possible outcomes. While we can look at the graph to see that the mean appears to be 1, we can confirm that by multiplying our values by their respective p

    So 0(.25) + 1(.5) + 2(.25) = 1. Thus the mean is one.

    This mean is often referred to as the "expected value" and denoted E(x). You will see this later in the BAPM program
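    The expected value calculation generalizes to any discrete distribution: multiply each value by its probability and add the products. A minimal sketch, with the two-coin distribution stored as a dictionary (the function and variable names are illustrative):

```python
def expected_value(distribution):
    """E(x): sum of each value times its probability."""
    return sum(x * p for x, p in distribution.items())


# Values 0, 1, 2 assigned to the outcomes HH, (HT or TH), TT.
two_coin_distribution = {0: 0.25, 1: 0.5, 2: 0.25}

mean = expected_value(two_coin_distribution)
print(mean)
```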

    Exercise twelve

    Let's say you work for an insurance company and sell a one year $10,000 policy with a premium of $290. The policy pays out if the customer dies, but the probability of that happening is .001. What is the expected gain on this transaction?

    Exercise twelve Answers

    Gain                        Sample point      p
    $290                        Customer lives    .999
    -$9,710 ($290 - $10,000)    Customer dies     .001

    The expected gain is 290(.999) + (-9,710)(.001) = $280

    If the company were to sell a very large number of policies it would net, on average, $280 per sale.
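    The insurance calculation can be reproduced as a short script. The figures below ($290 premium, $10,000 payout, .001 probability of death) are taken as read from the slide and should be treated as assumptions; the variable names are illustrative.

```python
premium = 290        # collected on every policy
payout = 10_000      # paid only if the customer dies
p_death = 0.001

gain_if_lives = premium             # company keeps the premium
gain_if_dies = premium - payout     # premium minus the payout (a loss)

expected_gain = gain_if_lives * (1 - p_death) + gain_if_dies * p_death

print(gain_if_dies, round(expected_gain, 2))
```

    The expected gain is a long-run average per policy; on any single policy the company either gains the premium or takes the large loss.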

    The variance (and std deviation) of a random variable

    The variance of a random variable x is:

    σ² = E[(x-μ)²] = Σ(x-μ)² * p(x)

    This is referred to as the expected value of the squared distance from the mean

    The std deviation is simply the square root of σ²

    Exercise thirteen

    A certain type of chemotherapy is successful 70% of the time. Let x equal the number of successful cures out of 5.

    So this means that for any five treated patients, 2 will survive 13.2% of time, 4 will survive 36% of time, and so on

    Find the mean, variance and standard deviation of this distribution

    x       0      1      2      3      4      5
    p(x)    .002   .029   .132   .309   .360   .168

    Exercise thirteen Answers

    E(x) = 0(.002) + 1(.029) + 2(.132) + 3(.309) + 4(.360) + 5(.168) = 3.5

    This means that the number of successful cures, on average, for five patients will be 3.5 (70% success rate)

    Variance σ² = E[(x-μ)²] = Σ(x-μ)² * p(x) = (0-3.5)² * (.002) + (1-3.5)² * (.029) + ... + (5-3.5)² * (.168) = 1.05

    Std deviation is the square root of 1.05 = 1.02
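    The three summary measures can be computed directly from the probability table. This sketch assumes the distribution reads .002, .029, .132, .309, .360, .168 for x = 0 through 5; the variable names are illustrative.

```python
import math

# p(x) for x = number of successful cures among five patients.
p = {0: 0.002, 1: 0.029, 2: 0.132, 3: 0.309, 4: 0.360, 5: 0.168}

mean = sum(x * px for x, px in p.items())                    # E(x)
variance = sum((x - mean) ** 2 * px for x, px in p.items())  # E[(x - mean)^2]
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```

    Note that the variance here weights each squared distance by its probability, rather than dividing by n - 1 as the sample variance does.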

    Conclusion of Session

    Website with statistical symbols

    Greek letters

    Z score table

