Date post: | 04-Jan-2016 |
Category: |
Documents |
Upload: | adela-byrd |
Biostatistics course, Part 3
Data, summary and presentation
Dr. en C. Nicolas Padilla Raygoza
Department of Nursing and Obstetrics
Division of Health Sciences and Engineering
Campus Celaya Salvatierra
University of Guanajuato Mexico
Biosketch
Medical Doctor, Autonomous University of Guadalajara.
Pediatrician certified by the Mexican Council of Certification in Pediatrics.
Postgraduate Diploma in Epidemiology, London School of Hygiene and Tropical Medicine, University of London.
Master of Science in Epidemiology, Atlantic International University.
Doctor of Science in Epidemiology, Atlantic International University.
Titular Professor A, full time, University of Guanajuato.
Level 1, National Researcher System.
[email protected] [email protected]
Competencies
The reader will describe the types of variables.
The reader will analyze how the different types of variables are summarized.
The reader will calculate measures of central tendency and locate them on charts.
The reader will calculate measures of dispersion and locate them on charts.
Definitions
Data are collected on specific characteristics of each subject, and groups are formed to be compared.
These characteristics are called variables, because they can differ from one subject to another.
A variable is recorded for one of two reasons:
It is a result of interest: the dependent variable.
It explains the dependent variable: a risk factor, or independent variable.
Type of data
Classification by measurement scale:
Qualitative
  Binary (dichotomous)
  Ordinal
  Nominal
Quantitative
  Discrete
  Continuous
Type of data - Examples
Qualitative
  Dichotomous (binary)
    Gender: male or female.
    Employment status: employed or unemployed.
  Ordinal
    Socioeconomic level: high, medium, low.
  Nominal
    Place of residence: Center, North, South, East, West.
    Civil status: single, married, widowed, divorced, free union.
Quantitative
  Discrete
    Number of offspring: 1, 2, 3, 4.
  Continuous
    Blood glucose level: 110 mg/dl, 145 mg/dl.
Data summary
Generally, we want to show the data in summary form.
We are interested in the number of times an event occurs, because it shows us the distribution of the variable.
We can build a frequency list for quantitative or qualitative data.
Summary of categorical data
We can obtain the frequencies of categorical data and summarize them in a table or a chart.
Example: we have 21 parasitic disease agents isolated from children.
Giardia lamblia
Entamoeba histolytica
Ascaris lumbricoides
Enterobius vermicularis
Ascaris lumbricoides
Enterobius vermicularis
Giardia lamblia
Giardia lamblia
Entamoeba histolytica
Ascaris lumbricoides
Enterobius vermicularis
Ascaris lumbricoides
Enterobius vermicularis
Giardia lamblia
Giardia lamblia
Entamoeba histolytica
Ascaris lumbricoides
Enterobius vermicularis
Ascaris lumbricoides
Enterobius vermicularis
Giardia lamblia
Summary of categorical data
The list of detected parasites gives us an idea of the frequency of each parasite, but the picture is not clear.
If we sort the list, the picture becomes clearer.
Giardia lamblia
Giardia lamblia
Giardia lamblia
Giardia lamblia
Giardia lamblia
Giardia lamblia
Ascaris lumbricoides
Ascaris lumbricoides
Ascaris lumbricoides
Ascaris lumbricoides
Ascaris lumbricoides
Ascaris lumbricoides
Enterobius vermicularis
Enterobius vermicularis
Enterobius vermicularis
Enterobius vermicularis
Enterobius vermicularis
Enterobius vermicularis
Entamoeba histolytica
Entamoeba histolytica
Entamoeba histolytica
Summary of categorical data
We can show the results in a frequency distribution.
Frequency distribution of intestinal parasites detected in children from CAISES Celaya, n=21

Parasite                   n
Giardia lamblia            6
Ascaris lumbricoides       6
Enterobius vermicularis    6
Entamoeba histolytica      3
Total                     21

Source: Laboratory report
Summary of categorical data
It is useful to show the frequency of each category expressed as a percentage of the total frequency. This is called the distribution of relative frequencies.

Frequency distribution of intestinal parasites detected in children from CAISES Celaya, n=21

Parasite                   n        %
Giardia lamblia            6    28.57
Ascaris lumbricoides       6    28.57
Enterobius vermicularis    6    28.57
Entamoeba histolytica      3    14.29
Total                     21   100.00

Source: Laboratory report
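The absolute and relative frequencies in the tables above can be reproduced with a short script. This is only a minimal sketch in Python; the function name `frequency_table` is an illustrative choice, not part of the course material.

```python
from collections import Counter

# The 21 isolates from the example (the order does not affect the counts).
isolates = (
    ["Giardia lamblia"] * 6
    + ["Ascaris lumbricoides"] * 6
    + ["Enterobius vermicularis"] * 6
    + ["Entamoeba histolytica"] * 3
)


def frequency_table(values):
    """Return (category, n, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(cat, n, round(100 * n / total, 2)) for cat, n in counts.most_common()]


for category, n, pct in frequency_table(isolates):
    print(f"{category:<25}{n:>3}{pct:>8.2f}")
```

Each percentage is the category count divided by the total (21) and multiplied by 100.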
Summary of categorical data
Sometimes the number of categories is large, and we should reduce it, for example by combining the less frequent categories under "Other".

Distribution by cause of death in Celaya, Gto., during 2012

Cause of death                      n        %
Cardiovascular disease         12,525    21.96
Cancer                         10,321    18.10
Lower respiratory infections    8,745    15.34
Other                          25,435    44.60
Total                          57,026   100.00

Source: Death certificates
Frequency distributions for quantitative data
With quantitative data, we need to group the data before showing them in a table of frequencies or relative frequencies.

Frequency distribution by age of students of FEOC who have smoked at least once, n=354

Age (years)     n        %
19             52    14.70
20             32     9.00
21             46    12.99
22             67    18.94
23             26     7.35
24             77    21.76
25             54    15.26
Total         354   100.00

Source: Health survey
Frequency distributions for quantitative data
With quantitative data, it is also useful to calculate the cumulative frequency.

Frequency distribution by age of students of the Campus who have smoked at least once, n=354

Age (years)     n        %   Cumulative %
19             52    14.70          14.70
20             32     9.00          23.70
21             46    12.99          36.69
22             67    18.94          55.63
23             26     7.35          62.98
24             77    21.76          84.74
25             54    15.26         100.00
Total         354   100.00

Source: Health survey
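The cumulative column is a running total of the percentage column. A minimal sketch in Python, using the percentages exactly as printed in the table (the variable names are illustrative):

```python
from itertools import accumulate

ages = [19, 20, 21, 22, 23, 24, 25]
# Relative frequencies (%) as printed in the table.
percents = [14.70, 9.00, 12.99, 18.94, 7.35, 21.76, 15.26]

# Running total, rounded to two decimals to absorb floating-point noise.
cumulative = [round(total, 2) for total in accumulate(percents)]

for age, pct, cum in zip(ages, percents, cumulative):
    print(f"{age:>3}{pct:>8.2f}{cum:>9.2f}")
```

The last cumulative value must always be 100%, which is a quick check that no row was lost.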
Frequency distributions for grouped quantitative data
Quantitative data often take many distinct values, so we group them into intervals.

Frequency distribution of ages of children with acute streptococcal pharyngotonsillitis

Age (years)     n        %
<1              2     0.51
1               8     2.00
2              13     3.30
3              29     7.36
4              37     9.39
5              44    11.17
6              51    12.94
7              50    12.69
8              49    12.44
9              32     8.12
10             25     6.35
11             22     5.58
12             14     3.55
13              9     2.28
14              7     1.78
15              2     0.51
Total         394   100.00

Source: Padilla N, Moreno M. Comparison between clarithromycin, azithromycin and propicillin in the management of acute streptococcal pharyngotonsillitis in children. Archivos de Investigación Pediátrica de México 2005; 8:5-11. (In Spanish)

Frequency distribution for grouped quantitative data

Frequency distribution of ages of children with acute streptococcal pharyngotonsillitis

Age (years)     n        %
<1 - 3         52    13.20
4 - 6         132    33.50
7 - 9         131    33.25
10 - 12        61    15.48
13 - 15        18     4.57
Total         394   100.00

Source: Padilla N, Moreno M. Comparison between clarithromycin, azithromycin and propicillin in the management of acute streptococcal pharyngotonsillitis in children. Archivos de Investigación Pediátrica de México 2005; 8:5-11. (In Spanish)
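The grouped table can be derived from the single-year table by adding the counts that fall inside each interval. The following Python sketch assumes the key 0 stands for the "<1" row; the names `counts_by_age` and `group_counts` are illustrative.

```python
# Single-year counts from the ungrouped table; key 0 stands for "<1".
counts_by_age = {0: 2, 1: 8, 2: 13, 3: 29, 4: 37, 5: 44, 6: 51, 7: 50,
                 8: 49, 9: 32, 10: 25, 11: 22, 12: 14, 13: 9, 14: 7, 15: 2}

# Inclusive interval limits: <1 to 3, 4 to 6, 7 to 9, 10 to 12, 13 to 15.
limits = [(0, 3), (4, 6), (7, 9), (10, 12), (13, 15)]


def group_counts(counts, intervals):
    """Return (low, high, n, percent) for each inclusive interval."""
    total = sum(counts.values())
    rows = []
    for low, high in intervals:
        n = sum(c for age, c in counts.items() if low <= age <= high)
        rows.append((low, high, n, round(100 * n / total, 2)))
    return rows


for low, high, n, pct in group_counts(counts_by_age, limits):
    print(f"{low:>2} - {high:<3}{n:>5}{pct:>8.2f}")
```

Because the intervals do not overlap and cover every age, the grouped counts add up to the same total of 394.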
To group data
Guide
Obtain the minimum and maximum values and decide the number of intervals.
Use between 5 and 15 intervals.
Define the interval limits clearly, so that each value belongs to exactly one interval.
Make sure that all intervals have the same width.
Avoid leaving the first or last interval open.
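The guide above can be sketched as a small helper for whole-number data. This is one possible reading of the guide, not a prescribed method; `make_intervals` is a hypothetical name, and the rule for choosing the width (rounding up so the whole range is covered) is an assumption.

```python
import math


def make_intervals(minimum, maximum, k):
    """Split whole-number data from minimum to maximum into k closed
    intervals of equal width. The last interval may extend past maximum."""
    if not 5 <= k <= 15:
        raise ValueError("the guide suggests between 5 and 15 intervals")
    # Number of distinct whole values per interval, rounded up.
    width = math.ceil((maximum - minimum + 1) / k)
    return [(minimum + i * width, minimum + (i + 1) * width - 1) for i in range(k)]


print(make_intervals(1, 15, 5))
```

With a minimum of 1, a maximum of 15 and 5 intervals, each interval holds three whole values, and neither the first nor the last interval is open.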
Categorical data: bar chart, pie chart.
Quantitative data: histogram, frequency polygon.
Charts
Bar chart
The frequency or relative frequency of a categorical variable can be shown easily in a bar chart.
It is used with categorical or discrete numerical data.
Each bar represents one category, and its height is the frequency or relative frequency.
Bars should be separated.
It is very important that the Y axis begins at 0.
Bar chart
[Figure: bar chart "Gastrointestinal infections". X axis: Agents (Cryptos., E. histolyt., E. coli, Giardia, Rotavirus, Shigella). Y axis: Frequency, 0 to 7.]
Grouped bar chart
If the data are also divided by a second nominal variable with two categories (for example, males and females), we can show them in a grouped bar chart.
It allows easy comparison between groups.
Grouped bar chart
[Figure: grouped bar chart "Gastrointestinal infections". X axis: Agents (Crypt., E. histolyt., E. coli, Giardia, Rotavirus, Shigella). Y axis: Frequency, 0 to 5. Series: Males, Females.]
Pie chart
It is an alternative way to show a categorical variable.
Each slice of the pie corresponds to the frequency or relative frequency of one category of the variable.
Each pie chart shows only one variable.
If we want to make comparisons, we need to build two or more pie charts.
Pie chart
[Figure: pie chart "Civil status of women in a community". Single 28%, Married 44%, Divorced 11%, Widowed 8%, Free union 9%.]
[Figure: pie chart "Civil status of men in a community". Single 31%, Married 41%, Divorced 11%, Widowed 1%, Free union 16%.]
Frequency distribution charts: histograms
They are useful for quantitative variables.
There are no spaces between bars.
The area of each bar, not its height, represents its frequency.
The X axis should be continuous.
The Y axis should begin at 0.
The width of each bar represents the interval of its group.
[Figure: histogram "Number of children of women from Celaya". X axis: Number of children (1, 2, 3, 4, 5, 6, 7, 8+). Y axis: Number of women, 0 to 700.]
Frequency distribution charts: frequency polygon
It is another way to show the frequency distribution of a numerical variable.
It is built by joining the midpoints of the tops of the bars of the histogram.
We should take into account the width of each bar.
We can plot more than one polygon on the same chart to make comparisons.
[Figure: frequency polygon "Number of children of women from Celaya". X axis: Number of children (1, 2, 3, 4, 5, 6, 7, 8+). Y axis: Number of women, 0 to 700.]
Frequency distributions: cumulative histogram
We can plot it directly from a table of cumulative frequencies.
It is not necessary to adjust the height of the bars, because each cumulative frequency represents the total frequency up to and including the upper limit of its interval.
[Figure: cumulative histogram "Cumulative frequency of birthweight" in newborns. X axis: Weight (501-, 1001-, 1501-, 2001-, 2501-, 3001-, 3501-, 4001-, 4501-, 5000+). Y axis: Cumulative frequency (%), 0 to 120.]
Frequency distributions: cumulative frequency polygon
We use it to see the proportion of observations below or above a given point on the curve.
We can read the median and the percentiles directly from it.
If the distribution is symmetrical, the curve has a symmetrical S shape.
If the distribution is skewed to the right or to the left, the curve is flattened on that side.
[Figure: cumulative frequency polygon "Cumulative frequencies of birthweight" in newborns. X axis: Weight (501-, 1001-, 1501-, 2001-, 2501-, 3001-, 3501-, 4001-, 4501-, 5000+). Y axis: Cumulative frequency (%), 0 to 120.]
Other charts: stem and leaf
We use it to show quantitative data directly, or as a preliminary step in building a frequency distribution.
We organize the data by choosing the number of divisions (5 to 15).
We draw a vertical line and put the first digit of each value to the left of the line (the stem) and the second digit to the right of the line (the leaves).
Other charts: stem and leaf

Patient   Age
1         54
2         35
3         49
4         61
5         58
6         64
7         32
8         57
9         43
10        42

Stem | Leaves
3    | 5 2
4    | 9 3 2
5    | 4 8 7
6    | 1 4
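The stem-and-leaf display above can be built by splitting each age into its tens digit (stem) and units digit (leaf). A minimal Python sketch, assuming two-digit values; `stem_and_leaf` is an illustrative name:

```python
from collections import defaultdict

# Ages of the ten patients, in the order listed.
ages = [54, 35, 49, 61, 58, 64, 32, 57, 43, 42]


def stem_and_leaf(values):
    """Map each tens digit (stem) to its units digits (leaves), in data order."""
    plot = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(sorted(plot.items()))


for stem, leaves in stem_and_leaf(ages).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

The leaves are kept in the order the patients appear; sorting each list of leaves would give an ordered display.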
Other charts: box and whisker
We draw a vertical line that represents the range of the distribution.
We draw one horizontal line at the third quartile and another at the first quartile; together they form the box.
The midpoint of the distribution (the median) is shown as a horizontal line inside the box.
[Figure: box-and-whisker plot. Vertical axis from 500 to 5500.]
Location measures
For categorical variables: percentage.
For quantitative variables:
Measures of central tendency: mean, median, mode.
Dispersion measures: standard deviation, percentiles, range.
Measures of central tendency
Mean
It is the conventional arithmetic mean. If we have n observations, each with a value x_i, the mean is:
$\bar{X} = \frac{\sum x_i}{n}$
In a frequency distribution, each value x_i occurs with a frequency f_i, and n = \sum f_i, so:
$\bar{X} = \frac{\sum x_i f_i}{n}$
In a grouped distribution, we use the midpoint of each interval as the x_i value.
Measures of central tendency in a frequency distribution

Interval    Midpoint    Frequency (f_i)
1 - 3        2          18
4 - 6        5          27
7 - 9        8          34
10 - 12     11          22
13 - 15     14          13
Total                  114

Example of the mean for a grouped distribution
$\bar{X} = \frac{(2 \times 18) + (5 \times 27) + (8 \times 34) + (11 \times 22) + (14 \times 13)}{18 + 27 + 34 + 22 + 13} = \frac{36 + 135 + 272 + 242 + 182}{114} = \frac{867}{114} = 7.61$
Mean = 7.61 years
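The same calculation can be checked with a few lines of Python. This is a minimal sketch; `grouped_mean` is an illustrative name.

```python
# Midpoints and frequencies of the five intervals in the example.
midpoints = [2, 5, 8, 11, 14]
frequencies = [18, 27, 34, 22, 13]


def grouped_mean(x, f):
    """Mean of grouped data: sum of midpoint times frequency over total frequency."""
    return sum(xi * fi for xi, fi in zip(x, f)) / sum(f)


print(round(grouped_mean(midpoints, frequencies), 2))
```

The result is an approximation of the true mean, because every observation in an interval is treated as if it were equal to the midpoint.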
Measures of central tendency
Median
It is the value that divides the ordered distribution into two equal parts.
If the number of observations is even, the two central values are added and the sum is divided by two.
51.2, 53.5, 55.6, 65.0, 74.2: the median is the middle value, so Median = 55.6
51.2, 53.5, 55.6, 61.4, 65.0, 74.2: Median = (55.6 + 61.4) / 2 = 58.5
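Both cases (odd and even number of observations) can be handled by one small function. This Python sketch is only illustrative; the standard library function `statistics.median` gives the same result.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = [51.2, 53.5, 55.6, 65.0, 74.2]
even_case = [51.2, 53.5, 55.6, 61.4, 65.0, 74.2]
print(median(odd_case), median(even_case))
```

The data must be sorted before the middle position is taken, which is why the function sorts its input first.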
Measures of central tendency for frequency distributions
Median
It is the value at which the cumulative frequency reaches 50%.
[Figure: cumulative frequency polygon "Cumulative frequency of birthweight" in newborns. X axis: Weight (501-, 1001-, 1501-, 2001-, 2501-, 3001-, 3501-, 4001-, 4501-, 5000+). Y axis: Cumulative frequency (%), 0 to 120.]
Measures of central tendency
Mode
It is the value that occurs most frequently.

Interval    Midpoint    Frequency (f_i)
1 - 3        2          18
4 - 6        5          27
7 - 9        8          34
10 - 12     11          22
13 - 15     14          13
Total                  114
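For grouped data, the mode is usually reported as the interval with the highest frequency. A minimal Python sketch, with `modal_interval` as an illustrative name:

```python
# (lower limit, upper limit, frequency) for each interval in the table above.
intervals = [(1, 3, 18), (4, 6, 27), (7, 9, 34), (10, 12, 22), (13, 15, 13)]


def modal_interval(rows):
    """Return (lower, upper, midpoint) of the interval with the highest frequency."""
    low, high, _ = max(rows, key=lambda row: row[2])
    return low, high, (low + high) / 2


print(modal_interval(intervals))
```

In this table the interval 7 - 9 has the highest frequency (34), so its midpoint, 8, can be taken as the mode.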
Measures of central tendency
Properties
The mean is sensitive to the tails of the distribution; the median and the mode are not.
The mode can be affected by small changes in the data; the median and the mean are not.
The mode and the median can be found on a chart.
The three measures are the same in a Normal distribution.
Measures of central tendency
Which measure should we use?
For skewed distributions, we use the median.
For statistical analysis or inference, we use the mean.
Dispersion measures
Range
It shows the minimum and maximum values and the difference between them.
51.2, 53.5, 55.6, 61.4, 65.0, 74.2
The range of this distribution is 51.2 to 74.2 kg, a difference of 23.0 kg.
However, the extreme values of this distribution are far from its center, which obscures the fact that most of the data lie between 53.5 and 65.0 kg.
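The range needs only the smallest and largest observations, as this short Python sketch shows (`value_range` is an illustrative name):

```python
# Weights (kg) from the example.
weights = [51.2, 53.5, 55.6, 61.4, 65.0, 74.2]


def value_range(values):
    """Return (minimum, maximum, maximum minus minimum)."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest


lowest, highest, spread = value_range(weights)
print(f"{lowest} to {highest} kg, difference {spread:.1f} kg")
```

Because only two observations enter the calculation, a single extreme value changes the range completely.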
Dispersion measures
Percentiles
A percentile (or centile) is the value below which a given percentage of the data falls.
See the distribution of stature in this population. What are the range, the median, and the 25th and 75th percentiles?

Stature (cm)     n    Relative frequency (%)    Cumulative frequency (%)
151              2      0.7                       0.7
152              3      1.1                       1.8
153              6      2.2                       4.0
154             12      4.5                       8.5
155             27     10.0                      18.5
157             29     10.8                      29.3
158             26      9.7                      39.0
159             33     12.3                      51.3
163             37     13.8                      65.1
164             16      5.9                      71.0
165             24      8.9                      79.9
168             18      6.7                      86.6
169             14      5.2                      91.8
171              6      2.2                      94.0
174              7      2.6                      96.6
175              1      0.4                      97.0
177              4      1.5                      98.5
179              2      0.7                      99.2
184              1      0.4                      99.6
185              1      0.4                     100.0
Total          269    100.0
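One way to answer the question is to read each percentile as the first stature whose cumulative frequency reaches the requested percentage. The Python sketch below follows that rule, which is one of several accepted conventions. The third row is taken as 153 cm, which is an assumption (it lies between the 152 and 154 rows); its label does not affect the answers. The name `percentile` is illustrative.

```python
# (stature in cm, n) rows of the table. The third stature, 153, is an assumption.
rows = [
    (151, 2), (152, 3), (153, 6), (154, 12), (155, 27), (157, 29), (158, 26),
    (159, 33), (163, 37), (164, 16), (165, 24), (168, 18), (169, 14), (171, 6),
    (174, 7), (175, 1), (177, 4), (179, 2), (184, 1), (185, 1),
]


def percentile(table, p):
    """Smallest value whose cumulative percentage is at least p (0 < p <= 100)."""
    total = sum(n for _, n in table)
    cumulative = 0
    for value, n in table:
        cumulative += n
        if 100 * cumulative >= p * total:
            return value
    raise ValueError("p must be between 0 and 100")


print("range:", rows[0][0], "to", rows[-1][0], "cm")
print("P25:", percentile(rows, 25))
print("median:", percentile(rows, 50))
print("P75:", percentile(rows, 75))
```

With this rule the range is 151 to 185 cm, the median is 159 cm, the 25th percentile is 157 cm and the 75th percentile is 165 cm, which agrees with reading the cumulative column of the table.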
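Dispersion measures
Standard deviation
It is the most common way to quantify the variability of a distribution.
It measures the distance between each value and the mean.

Subject    Height (m)    Deviation (x_i - \bar{X})
1          1.6           -0.1
2          1.7            0
3          1.8           +0.1
Mean       1.7

$\text{Mean deviation} = \frac{\sum (x_i - \bar{X})}{n} = \frac{(-0.1) + (0) + (+0.1)}{3} = 0$
The positive and negative deviations cancel, so the mean of the deviations is always 0 and cannot be used to measure variability.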
Dispersion measures
Standard deviation
We are interested in the magnitude of the deviations, not in their sign.
If we square each deviation, we obtain positive values.
If we divide the sum of the squared deviations by n - 1, we obtain the variance; its square root is the standard deviation.

Subject    Height (m)    Squared deviation (x_i - \bar{X})^2
1          1.6           0.01
2          1.7           0
3          1.8           0.01
Mean       1.7

$\text{Standard deviation} = \sqrt{\frac{\sum (x_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{0.02}{2}} = 0.1$
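The steps can be followed in code for the three heights of the example. This is a minimal Python sketch; `sample_sd` is an illustrative name, and the standard library function `statistics.stdev` performs the same calculation. Each deviation from the mean of 1.7 m is 0.1 m or 0, so each squared deviation is 0.01 or 0 and the result is about 0.1 m.

```python
import math

# Heights (m) of the three subjects in the example.
heights = [1.6, 1.7, 1.8]


def sample_sd(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)
    return math.sqrt(variance)


print(round(sample_sd(heights), 2))
```

Dividing by n - 1 rather than n gives the sample standard deviation, which is the form used on the slide.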
Dispersion measures for grouped data
Standard deviation
It uses the midpoint of each interval as the x_i value and the frequency of the interval as f_i.
$\text{Standard deviation} = \sqrt{\frac{\sum f_i (x_i - \bar{X})^2}{\sum f_i - 1}}$
It can also be expressed as:
$\text{Standard deviation} = \sqrt{\frac{\sum f_i x_i^2 - (\sum f_i x_i)^2 / \sum f_i}{\sum f_i - 1}}$
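The two expressions are algebraically equivalent. The Python sketch below applies both to the grouped age table used earlier for the mean (midpoints 2, 5, 8, 11, 14 with frequencies 18, 27, 34, 22, 13); the function names are illustrative.

```python
import math

midpoints = [2, 5, 8, 11, 14]
frequencies = [18, 27, 34, 22, 13]


def grouped_sd_definition(x, f):
    """Weighted squared deviations from the grouped mean, divided by (sum of f) - 1."""
    n = sum(f)
    mean = sum(xi * fi for xi, fi in zip(x, f)) / n
    total = sum(fi * (xi - mean) ** 2 for xi, fi in zip(x, f))
    return math.sqrt(total / (n - 1))


def grouped_sd_shortcut(x, f):
    """Same quantity computed from the sums of f*x and f*x^2."""
    n = sum(f)
    sum_fx = sum(fi * xi for xi, fi in zip(x, f))
    sum_fx2 = sum(fi * xi ** 2 for xi, fi in zip(x, f))
    return math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))


print(round(grouped_sd_definition(midpoints, frequencies), 2))
print(round(grouped_sd_shortcut(midpoints, frequencies), 2))
```

The second form is convenient for hand calculation because it needs only three column totals; the first form follows the definition more directly.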
Dispersion measures
For data with a Normal distribution:
Around 68% of the data lie within 1 standard deviation of the mean (between -1 and +1).
Around 95% of the data lie within 2 standard deviations of the mean (between -2 and +2).
Around 99.7% of the data lie within 3 standard deviations of the mean (between -3 and +3).
The standard deviation is a measure of the width of the distribution.
If the standard deviation changes, the shape of the distribution changes as well.
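These percentages can be verified from the Normal distribution itself. The Python sketch below uses the error function from the standard library: for a Normal variable, the proportion within k standard deviations of the mean is erf(k / sqrt(2)). The name `coverage` is illustrative.

```python
import math


def coverage(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * coverage(k):.1f}%")
```

The values are approximately 68.3%, 95.4% and 99.7%, and they do not depend on the mean or the standard deviation of the particular Normal distribution.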
Bibliography
1. Kirkwood BR. Essentials of medical statistics. Oxford: Blackwell Science; 1988.
2. Altman DG. Practical statistics for medical research. Boca Raton: Chapman & Hall/CRC; 1991.