
Abilene Christian University

Department of Mathematics

The Statistical Process: Data Collection and Visualization
Section 15.1-15.2

Dr. John Ehrke
Department of Mathematics

Fall 2012


The Statistical Process

In our discussions of probability we referred to each of our problems as a random experiment. In general, when you plan an experiment of any kind where you will record observations or collect data, it is essential to know that the results can be analyzed. This analysis is the intersection of probability with statistics and is the focal point of our discussions in this unit. We list below a few of the main statistical methods of analyzing designed experiments that we will consider.

Data Collection: Data collection analysis answers questions of appropriate samples, experimental design, data verification, and outlier identification.

Data Visualizations: Discrete collections of data give rise to counting methods and are generally well suited to displays in tables and graphs. We will consider various types of data visualization.

Statistical Analysis: Sometimes visualization is not enough and we must determine quantitative properties of a data set. We will consider measures of center, dispersion, and relative standing.

Regression: This method is used to study a causal relationship between two variables, such as in obtaining a dosage-patient response relationship.

Hypothesis Testing: The pinnacle of statistical testing, this type of analysis seeks to determine whether the results obtained in a random experiment are statistically significant. That usually involves calculating the probability that observed differences between two data sets could have arisen by chance sampling variation. We will talk about this type of testing briefly at the end of this unit.

This lecture focuses on the first two areas described above: data collection and visualization.

Slide 2/16 — Dr. John Ehrke — Lecture 6 — Fall 2012


This Just In, 98.6 Not Normal

After believing for more than a century that 98.6 was the normal body temperature for humans, researchers now say normal is not normal anymore.

For some people at some hours of the day, 99.9 degrees could be fine. And readings as low as 96 turn out to be highly human.

The 98.6 standard was derived by a German doctor in 1868. Some physicians have always been suspicious of the good doctor’s research. His claim: 1 million readings–in an era without computers.

So Mackowiak & Co. took temperature readings from 148 healthy people over a three-day period and found that the mean temperature was 98.2 degrees. Only 8 percent of the readings were 98.6.

– The Press Enterprise



Questions Abound

(1) How did the researchers select the 148 people?

(2) How can we be sure the results of these 148 people accurately reflect the human race as a whole?

(3) How did the researchers arrive at the “high” and “low” temperatures given in the article?

(4) How did the German doctor record 1 million temperatures in 1868?

When you read articles such as this, do you ever question where the numbers come from, and whether they are accurate or the result of a mathematically sound process? It is the job of the statistician to apply meaning to data and to ensure the process of reporting numbers such as these is valid.



The Pitfalls of Collecting Data

Dealing with data involves two basic steps: collecting it and interpreting it. It may seem as though collecting data is no big deal. However, a plethora of pitfalls surrounds such a process, and it is into these pits that we now peer.

• One of the easiest ways for survey data to be misleading and inaccurate is if the surveyed people lie.

• Suppose we want to take a survey in which we wish to acquire accurate data on an embarrassing or controversial topic.

• How can we devise a method of taking the survey in which no one feels any danger of their privacy being invaded?

Example

One of the most serious problems facing colleges and universities today is the problem of alcohol abuse among students. Let’s suppose the question we want to ask is: “Have you been drunk during the last week?” The goal is to structure the survey so that we can deduce approximately what fraction of the students got drunk, yet cannot tell definitely which individual students got drunk.



Survey Method Guaranteeing Anonymity

The Setup: Suppose that each person secretly flips two coins. If both coins land heads up, then that person reports the opposite of the truthful answer. Anyone who flips at least one tails answers truthfully.

The Situation: Suppose in a class of 60 students, 24 people answer yes, and 36 people answer no.

Remarks:

• There is no information about any individual available. In fact, we could have students answer by raising their hands.

• If someone said yes he or she could be telling the truth or not, and if another person answered no he or she could be telling the truth or not.

• There are four equally likely outcomes, {HH, HT, TH, TT}. These outcomes occur randomly across the class, so we expect the number of students who got drunk and flipped HH to be the same as the number who got drunk and flipped HT, TH, or TT, respectively.



The Blinded Survey Result

Let’s suppose that the number of students who got drunk and flipped HH is D, and the number of sober students who flipped HH is S. This means that we expect 4D students in the class to have been drunk (D of them flipped HH, D of them flipped HT, and so on), and likewise 4S students to have been sober.

• Which people answered yes? Any student who flipped HH answers the opposite of the truth, so the S sober students in that group answered yes. The students who saw at least one tails answered truthfully; there are three such groups of drunk students, so 3D + S = 24 students said yes in total.

• Which people answered no? The rest of the students answered no: the D drunk students who flipped HH and the 3S sober students who flipped at least one tails. That leaves D + 3S = 36.

So now we have two equations with two unknowns, and we can deal with them by solving for S in the first equation (S = 24 − 3D) and substituting it into the second:

D + 3S = D + 3(24 − 3D) = 36.

Solving for D gives D = 4.5. In total we expect the number of drunk students to be 4D, so about 4 × 4.5 = 18 of the students surveyed were drunk. This estimate becomes more reliable as our pool of students gets larger.
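The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `estimate_yes_total` and `simulate_survey`, the 30% true rate, and the class sizes in the loop are hypothetical choices, not part of the lecture. The first function solves 3D + S = yes and D + 3S = no for D and returns 4D; the second simulates the two-coin survey so the estimate can be compared for small and large classes.

```python
import random
from fractions import Fraction


def estimate_yes_total(yes_count, no_count):
    """Estimate how many respondents truly belong to the 'yes' group.

    Eliminating S from 3D + S = yes_count and D + 3S = no_count gives
    8D = 3 * yes_count - no_count. The estimated total is 4D.
    """
    d = Fraction(3 * yes_count - no_count, 8)
    return 4 * d


def simulate_survey(n, true_rate, rng):
    """Simulate n respondents using the two-coin rule; return (yes, no)."""
    yes = 0
    for _ in range(n):
        truth = rng.random() < true_rate  # hypothetical true answer
        both_heads = rng.random() < 0.5 and rng.random() < 0.5
        answer = (not truth) if both_heads else truth  # HH flips the answer
        yes += answer
    return yes, n - yes


# The class from the slides: 24 yes, 36 no.
print(estimate_yes_total(24, 36))  # 18

# Assumed true rate of 30%; the estimated fraction should settle near 0.3
# as the class size grows.
rng = random.Random(2012)
for n in (60, 600, 60000):
    yes, no = simulate_survey(n, 0.30, rng)
    print(n, float(estimate_yes_total(yes, no) / n))
```

With a class of 60 the estimated fraction can land well away from the true rate, which is why the estimate is only trusted for larger pools.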



Other Sources of Data Error

For data and the interpretation of that data to have any real meaning, one must be convinced that the data is accurate, not biased, and sufficiently robust to be representative of the whole population. Often attaining such clean data is not an easy task. Dirty data often leads to dirty stats. The following is a list of ways in which data can be corrupted.

• Human error (transcription error)
• Source of the data lies
• Sample is biased
• Sample is not representative
• Sample is not large enough to draw meaningful conclusions
• Data is fabricated
• Non-response error (not being able to interview people eligible)
• Measurement error (data does not measure what it is supposed to)



Populations and Samples

• Population: Set containing all the people or objects whose properties are to be described and analyzed by the data collector.

• Sample: Subset or subgroup of the population formed by a specified number of measurements or data.

• Representative Sample: A sample that exhibits characteristics typical of those possessed by the target population.

For the previous article the sample is the set of body-temperature measurements for the 148 healthy people chosen. The hope of the researchers is that this sample is representative of the population: all healthy humans in the world.

So which is more important, the sample or the population? In most cases, we are only interested in the population, but when the population is difficult, impractical, or impossible to measure, a representative sample is needed in order to describe or predict the behavior of the population.



Descriptive vs. Inferential Statistics

• Statistics: Method of collecting, organizing, analyzing and interpreting data, as well as drawing conclusions based on data. Methodology is divided into two main areas.

• Descriptive Statistics: The data collection side of statistics. Data is organized, summarized, and presented in tables, graphs, or other visual displays.

• Inferential Statistics: This side of statistics applies meaning to the data. Inferential statistics makes generalizations and draws conclusions from the data collected.

If the set of measurements is the entire population, you need only draw conclusions based on descriptive statistics. For various reasons, though, you might only have a sample, and by looking at the sample you want to answer questions about the population. This is the domain of inferential statistics.



Data Types

• Variable: A variable is a characteristic that changes or varies over time and/or for different individuals or objects under consideration.

• Qualitative Variable: A variable which measures a quality or characteristic of the individuals or objects under consideration.

• Quantitative Variable: A variable which measures a numerical quantity or amount associated with the individuals or objects under consideration.

• Discrete Variable: A discrete variable can assume only a finite or countably infinite number of values.

• Continuous Variable: A continuous variable can assume the infinitely many values corresponding to the points on a line interval.



Visualizing Data: Histogram

The most basic and perhaps best thing to do with data is to look at it. Visualizing data through graphs and pictures often reveals important structure.

Example

Suppose we were interested in collecting data about the heights of students in this classroom and arranging them in 3-inch intervals:

4′0′′ − 4′2′′, 4′3′′ − 4′5′′, ..., 5′0′′ − 5′2′′, ..., 6′3′′ − 6′5′′, 6′6′′ − 6′8′′.

How might we best represent this data?

Use your iPhone to respond to this question by selecting the appropriate category for your height.

In statistics, a histogram is a graphical representation that gives a visual impression of the distribution of data. It is an estimate of the probability distribution of a continuous random variable.

The categories into which data is sorted are called classes. When creating a histogram we will sort data into classes of the same width. To determine the class width for a set of data we use the formula:

\[ \text{class width} = \frac{\text{largest data value} - \text{smallest data value}}{\text{number of classes desired}} \]
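As a quick check of the formula, the height example can be worked through in Python. The sketch below assumes heights are recorded in whole inches, so the intervals from 4′0′′ to 6′8′′ run from 48 to 80 inches in 11 classes. Rounding the raw quotient up to the next whole inch is a common convention and an assumption here; the slides do not state a rounding rule. The function name `class_width` is hypothetical.

```python
import math


def class_width(largest, smallest, num_classes):
    """Raw class width: range of the data divided by the number of classes."""
    return (largest - smallest) / num_classes


# Heights from 4'0" (48 in) to 6'8" (80 in), sorted into 11 classes.
raw_width = class_width(80, 48, 11)

# Assumed convention: round up so the classes cover the whole range.
rounded_width = math.ceil(raw_width)

print(round(raw_width, 3), rounded_width)  # 2.909 3
```

The rounded value of 3 inches matches the 3-inch intervals used in the example.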



Statistical Tables

After data has been collected, it can be consolidated and summarized to display the data graphically as a data distribution. For this purpose, we use a statistical table. When the variable of interest is qualitative, the statistical table is a list of the categories being considered along with a measure of how often each value occurred. You can measure how often in three different ways:

• The frequency, or number of measurements in each category.

• The relative frequency, or proportion of measurements in each category.

• The percentage of measurements in each category.

If you let n be the total number of measurements in the set, you can find the relative frequency and percentage using these formulas:

\[ \text{Relative Frequency} = \frac{\text{frequency}}{n}, \qquad \text{Percentage} = \text{Relative Frequency} \times 100\% \]

You will find that the sum of the frequencies is always n, the sum of the relative frequencies is 1, and the sum of the percentages is 100%. The categories for a qualitative variable should be chosen so that (1) a measurement will belong to one and only one category, and (2) each measurement has a category to which it can be assigned.
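The three measures can be computed together, as in the following minimal sketch. The function name `frequency_table` and the eight-item color sample are hypothetical and serve only to illustrate the formulas; they are not data from the lecture.

```python
from collections import Counter


def frequency_table(observations):
    """Map each category to (frequency, relative frequency, percentage)."""
    n = len(observations)
    counts = Counter(observations)
    return {
        category: (freq, freq / n, 100 * freq / n)
        for category, freq in sorted(counts.items())
    }


# Hypothetical qualitative sample (illustration only).
sample = ["red", "blue", "blue", "green", "red", "blue", "yellow", "blue"]
table = frequency_table(sample)

for category, (freq, rel, pct) in table.items():
    print(f"{category:<8}{freq:>4}{rel:>8.3f}{pct:>8.1f}%")

# The column totals should be n, 1, and 100%.
print(sum(f for f, _, _ in table.values()),
      sum(r for _, r, _ in table.values()),
      sum(p for _, _, p in table.values()))
```

Because `Counter` assigns each observation to exactly one key, the two conditions on categories (each measurement belongs to exactly one category, and every measurement has a category) hold automatically.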



Frequency Distributions

The data in the table below are the GPAs of 30 Abilene Christian University freshmen, recorded at the end of freshman year. In this exercise we will consider what are called “grouped frequency” distributions.

2.0  3.1  1.9  2.5  1.9
2.3  2.6  3.1  2.5  2.1
2.9  3.0  2.7  2.5  2.4
2.7  2.5  2.4  3.0  3.4
2.6  2.8  2.5  2.7  2.9
2.7  2.8  2.2  2.7  2.1

Table: Grade Point Averages of 30 ACU Freshmen

• Create a frequency distribution for this data set. What are some problems with representing the data in this way?

• Create a grouped frequency distribution and histogram for this data set with class width 0.5 beginning with a GPA of 0.0. What are some problems with representing the data in this way?
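Both parts of the exercise can be explored with a short Python sketch using the 30 GPAs from the table. Two assumptions are made that the slides do not specify: each class includes its lower endpoint and excludes its upper endpoint (so 2.5 falls in the class from 2.5 to 3.0), and GPAs are converted to whole tenths to avoid floating-point boundary problems. The function names are hypothetical.

```python
from collections import Counter

GPAS = [
    2.0, 3.1, 1.9, 2.5, 1.9,
    2.3, 2.6, 3.1, 2.5, 2.1,
    2.9, 3.0, 2.7, 2.5, 2.4,
    2.7, 2.5, 2.4, 3.0, 3.4,
    2.6, 2.8, 2.5, 2.7, 2.9,
    2.7, 2.8, 2.2, 2.7, 2.1,
]


def frequency_distribution(values):
    """Ungrouped distribution: one row per distinct value."""
    return dict(sorted(Counter(values).items()))


def grouped_frequency(values, start_tenths=0, width_tenths=5):
    """Grouped distribution with classes [lower, upper), measured in tenths."""
    counts = Counter(
        (round(v * 10) - start_tenths) // width_tenths for v in values
    )
    result = {}
    for index in range(max(counts) + 1):  # include empty classes
        lower = start_tenths + index * width_tenths
        result[(lower / 10, (lower + width_tenths) / 10)] = counts[index]
    return result


ungrouped = frequency_distribution(GPAS)
grouped = grouped_frequency(GPAS)

print(len(ungrouped), "distinct GPA values")
for (lower, upper), count in grouped.items():
    # A crude text histogram: one asterisk per student.
    print(f"{lower:.1f}-{upper:.1f} {'*' * count}")
```

The ungrouped table has many rows with small counts, while the grouped version starting at 0.0 leaves the first three classes empty and puts more than half of the students in a single class, which hints at the problems the exercise asks about.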



Stem and Leaf Plots

For the GPA data, if you use the decimal point as the dividing line you will have only three stems, corresponding to 1, 2, or 3. This is a generally unappealing diagram. A better diagram, which splits each stem across several lines, would look something like this.

1 | 9 9
2 | 0 1 1
2 | 2 3
2 | 4 4 5 5 5 5 5
2 | 6 6 7 7 7 7 7
2 | 8 8 9 9
3 | 0 0 1 1
3 |
3 | 4

If you turn the stem and leaf plot sideways, so that the vertical line is horizontal, you can see that the data are piled up in a “mound-shaped” pattern. The gap in the plot shows there were no GPAs between 3.1 and 3.4.
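A split-stem plot of this kind can be generated programmatically. The sketch below assumes five lines per stem (leaves 0-1, 2-3, 4-5, 6-7, 8-9), which is the layout the diagram appears to use. The function name `split_stem_and_leaf` is hypothetical, and the GPA list is copied from the frequency distribution table.

```python
from collections import defaultdict

GPAS = [
    2.0, 3.1, 1.9, 2.5, 1.9,
    2.3, 2.6, 3.1, 2.5, 2.1,
    2.9, 3.0, 2.7, 2.5, 2.4,
    2.7, 2.5, 2.4, 3.0, 3.4,
    2.6, 2.8, 2.5, 2.7, 2.9,
    2.7, 2.8, 2.2, 2.7, 2.1,
]


def split_stem_and_leaf(values, lines_per_stem=5):
    """Return the plot as a list of text rows, including empty rows."""
    leaves_per_line = 10 // lines_per_stem
    rows = defaultdict(list)
    for value in values:
        stem, leaf = divmod(round(value * 10), 10)  # 2.7 -> stem 2, leaf 7
        rows[(stem, leaf // leaves_per_line)].append(leaf)

    keys = sorted(rows)
    (stem, part), last = keys[0], keys[-1]
    lines = []
    while (stem, part) <= last:
        leaves = " ".join(str(leaf) for leaf in sorted(rows.get((stem, part), [])))
        lines.append(f"{stem} | {leaves}".rstrip())
        part += 1
        if part == lines_per_stem:  # move on to the next stem
            stem, part = stem + 1, 0
    return lines


plot = split_stem_and_leaf(GPAS)
print("\n".join(plot))
```

Empty rows are kept on purpose, since a row with no leaves is what makes the gap between 3.1 and 3.4 visible.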



Data Collection Example

Based on a sample of M&M® candies, create a statistical table of the number of candies of each color. The last three columns of your table should give the three different measurements of how often each category occurred.

• Is our sample representative?

• Is our sample size large enough to accurately reflect the population?

• For each of the three columns, describe another way of representing the data visually.

Color     Frequency   Relative Frequency   Expected %   Observed %
Blue                                       24%
Brown                                      14%
Green                                      16%
Orange                                     20%
Red                                        13%
Yellow                                     14%

