Basic Statistics.pdf

Date post: 04-Jun-2018
Upload: umair-adnan
Transcript
  • 8/14/2019 Basic Statistics.pdf

    1/55

    RESEARCH METHODS

    Dr. M. Shakaib Akram
    Email: [email protected]

    BASIC STATISTICS

    Raw Data

    Raw data have not been manipulated or treated in any way beyond their original collection.

    Ex: speeds (mph) of 105 vehicles

    Frequency Distribution

    A table that divides the data values into classes and shows the number of observed values that fall into each class.

    Frequency Distribution (Cont.)

    Relative frequency distribution: the proportion or percentage of data values that fall within each category.

    Cumulative relative frequency distribution: the proportion or percentage of observations that are within or below each of the classes.
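The three kinds of distribution can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table`, the class width, and the speed values are all assumptions made here, not the 105-vehicle data set from the slides.

```python
from collections import Counter

def frequency_table(values, class_width, start):
    """Return rows of (class label, frequency, relative, cumulative relative).

    Assumes integer values that are all >= start.
    """
    # Class k covers the half-open interval [start + k*width, start + (k+1)*width)
    counts = Counter((v - start) // class_width for v in values)
    n = len(values)
    rows = []
    cumulative = 0
    for k in range(max(counts) + 1):
        lower = start + k * class_width
        freq = counts.get(k, 0)
        cumulative += freq
        label = f"{lower} to under {lower + class_width}"
        rows.append((label, freq, freq / n, cumulative / n))
    return rows

# Hypothetical speeds in mph
speeds = [52, 55, 58, 61, 61, 63, 64, 67, 68, 72]
table = frequency_table(speeds, class_width=5, start=50)
for label, freq, rel, cum_rel in table:
    print(f"{label:>14}  {freq:>2}  {rel:.2f}  {cum_rel:.2f}")
```

The last cumulative relative frequency is always 1, which is a quick sanity check on any such table.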

    Histogram

    The histogram describes a frequency distribution by using a series of adjacent rectangles, each of which has a length that is proportional to the frequency of the observations within the range of values it represents.

    The Stem-and-Leaf Display

    A variant of the frequency distribution that uses a subset of the original digits as class descriptors.

    Ex: The raw data are the numbers of Congressional bills vetoed during the administrations of seven U.S. presidents, from Johnson to Clinton.

    Stem-and-leaf diagram
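A stem-and-leaf display can be built by splitting each value into a tens digit (the stem) and a units digit (the leaf). The sketch below assumes non-negative integers; the function name and the seven counts are made up for illustration and are not the veto figures from the slide.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted list of leaves (units)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    # Keep empty stems so that gaps in the data stay visible
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

# Hypothetical counts
counts = [21, 25, 34, 38, 38, 52, 57]
display = stem_and_leaf(counts)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Unlike a frequency table, the display keeps every original value recoverable from its stem and leaf.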

    Bar Chart

    A bar chart represents frequencies according to the relative lengths of a set of rectangles.

    Bar chart vs. histogram:

    Histogram: quantitative/continuous data

    Bar chart: qualitative/categorical data

    Adjacent rectangles in the histogram share a common side, while those in the bar chart have a gap between them.

    The Scatterplot

    A scatter diagram is a two-dimensional plot of data representing values of two quantitative variables.

    x, the independent variable, on the horizontal axis

    y, the dependent variable, on the vertical axis

    Four ways in which two variables can be related:

    1. Direct

    2. Inverse

    3. Curvilinear

    4. No relationship

    Coefficient of correlation (r)

    Direction of the relationship: direct (r > 0) or inverse (r < 0).

    Strength of the relationship: When r is close to 1 or -1, the linear relationship between x and y is strong. When r is close to 0, the linear relationship between x and y is weak. When r = 0, there is no linear relationship between x and y.

    Coefficient of determination, r^2: the proportion (often quoted as a percent) of total variation in y that is explained by variation in x.
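As a rough sketch, r can be computed from the deviations of x and y about their means. The slides do not state a formula, so the standard Pearson definition is assumed here; the function name `pearson_r` and the paired observations are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y_direct = [2, 4, 5, 4, 5]      # tends to rise with x
y_inverse = [10, 8, 6, 4, 2]    # falls exactly linearly with x

r = pearson_r(x, y_direct)
r_squared = r ** 2              # share of variation in y explained by x
print(round(r, 3), round(r_squared, 3))
print(pearson_r(x, y_inverse))
```

A perfectly linear inverse relationship gives r = -1, and squaring r discards the direction and keeps only the strength.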

    The Center: Mean

    The Center: Median

    To find the median:

    1. Put the data in an ordered array (smallest to largest).

    2A. If the data set has an ODD number of values, the median is the middle value.

    2B. If the data set has an EVEN number of values, the median is the AVERAGE of the middle two values.

    (Note that the median of an even-sized set of data values is not necessarily a member of the set of values.)

    The median is particularly useful if there are outliers in the data set, which otherwise tend to influence the value of an arithmetic mean.

    The Center: Mode

    The mode is the most frequent value.

    While there is just one value for the mean and one value for the median, there may be more than one value for the mode of a data set.

    The mode tends to be less frequently used than the mean or the median.
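The three measures of center can be checked with Python's standard `statistics` module. The scores are the Team A quiz scores used in the standard deviation example; the extra low score of 5 is a hypothetical outlier added here to show why the median is more robust than the mean.

```python
from statistics import mean, median, multimode

scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

print(mean(scores))       # arithmetic mean
print(median(scores))     # even count: average of the two middle values (81 and 83)
print(multimode(scores))  # every value tied for most frequent

# One extreme value pulls the mean down far more than the median
with_outlier = scores + [5]
print(round(mean(with_outlier), 2), median(with_outlier))
```

Here the data set has two modes (80 and 85), and the median of the ten scores, 82, is not itself one of the scores.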

    The Spread: Range

    The range is the distance between the smallest and the largest data value in the set.

    Range = largest value - smallest value

    Sometimes the range is reported as an interval, anchored between the smallest and largest data value, rather than the actual width of that interval.

    The Spread: Variance

    How far a set of numbers is spread out

    Variance is one of the most frequently used measures of spread

    The Spread: Standard Deviation

    A measure of the dispersion of a set of data from its mean

    Mathematically: the square root of variance

    for a population, $\sigma = \sqrt{\sigma^2}$

    for a sample, $s = \sqrt{s^2}$
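For reference, the usual textbook definitions behind these square roots are written out below. The symbols $\mu$, $N$, $\bar{x}$ and $n$ are notation assumed here rather than taken from the slide; the sample version divides by $n - 1$, which matches the worked example in the slides.

```latex
% Population variance and standard deviation (N values, population mean \mu)
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2,
\qquad \sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation (n values, sample mean \bar{x})
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,
\qquad s = \sqrt{s^2}
```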

    Example: Standard Deviation

    Two classes took a recent quiz. There were 10 students in each class, and each class had an average score of 81.5.

    Since the averages are the same, can we assume that the students in both classes all did pretty much the same on the exam?

    The answer is no. The average (mean) does not tell us anything about the distribution or variation in the grades.

    Here are dot plots of the grades in each class:

    [Figure: dot plots of the grades in each class, with the mean marked]

    So, we need to come up with some way of measuring not just the average, but also the spread of the distribution of our data.

    Why not just give an average and the range of data (the highest and lowest values) to describe the distribution of the data?

    Well, for example, let's say that from a set of data, the average is 17.95 and the range is 23. But what if the data looked like this:

    [Figure: a data set with the average and the range marked]

    Most of the numbers lie in one area and are not evenly distributed throughout the range.

    The standard deviation is a number that measures how far away each number in a set of data is from their mean.

    If the standard deviation is large, it means the numbers are spread out from their mean.

    If the standard deviation is small, it means the numbers are close to their mean.

    Here are the scores on the math quiz for Team A:

    72, 76, 80, 80, 81, 83, 84, 85, 85, 89

    Average: 81.5

    The standard deviation measures how far away each number in a set of data is from their mean. For example, start with the lowest score, 72. How far away is 72 from the mean of 81.5?

    72 - 81.5 = -9.5

    Or, start with the highest score, 89. How far away is 89 from the mean of 81.5?

    89 - 81.5 = 7.5

    To find the standard deviation:

    1. Find all the distances from the mean (score - mean).

    2. Square each of the distances to turn them all into positive numbers.

    3. Add up all of the squared distances.

    4. Divide by (n - 1), where n represents the number of values you have.

    5. Finally, take the square root of the average squared distance. This is the standard deviation.

    Score   Distance from Mean   Distance Squared
    72      -9.5                 90.25
    76      -5.5                 30.25
    80      -1.5                  2.25
    80      -1.5                  2.25
    81      -0.5                  0.25
    83       1.5                  2.25
    84       2.5                  6.25
    85       3.5                 12.25
    85       3.5                 12.25
    89       7.5                 56.25

    Sum of squared distances: 214.5

    214.5 / (10 - 1) = 23.8

    Square root of 23.8 = 4.88
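The same procedure for Team A can be followed in code. This is a minimal sketch; the function name `sample_standard_deviation` is chosen here for illustration, and the division by n - 1 follows the slides.

```python
from math import sqrt

def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n
    distances = [v - mean for v in values]   # distance of each score from the mean
    squared = [d ** 2 for d in distances]    # squaring makes every distance positive
    total = sum(squared)                     # sum of the squared distances
    variance = total / (n - 1)               # divide by n - 1
    return sqrt(variance)                    # square root gives the standard deviation

team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
sd_a = sample_standard_deviation(team_a)
print(round(sd_a, 2))
```

Rounded to two decimals, the result agrees with the 4.88 worked out by hand.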

    Now find the standard deviation for the other class's grades (Team B):

    Score   Distance from Mean   Distance Squared
    57      -24.5                600.25
    65      -16.5                272.25
    83        1.5                  2.25
    94       12.5                156.25
    95       13.5                182.25
    96       14.5                210.25
    98       16.5                272.25
    93       11.5                132.25
    71      -10.5                110.25
    63      -18.5                342.25

    Sum of squared distances: 2280.5

    2280.5 / (10 - 1) = 253.4

    Square root of 253.4 = 15.92

    Now, let's compare the two classes again:

                           Team A   Team B
    Average on the quiz    81.5     81.5
    Standard deviation     4.88     15.92
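The comparison can be reproduced with the standard library, whose `statistics.stdev` uses the same n - 1 divisor as the worked example. The dictionary layout below is just one convenient way to hold the result.

```python
from statistics import mean, stdev

teams = {
    "Team A": [72, 76, 80, 80, 81, 83, 84, 85, 85, 89],
    "Team B": [57, 65, 83, 94, 95, 96, 98, 93, 71, 63],
}

# (average, sample standard deviation rounded to two decimals) per team
summary = {name: (mean(scores), round(stdev(scores), 2))
           for name, scores in teams.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: average {avg}, standard deviation {sd}")
```

Both teams share the same average, but Team B's standard deviation (about 15.92 when rounded) is more than three times Team A's, so its scores are far more spread out.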

    Relative Position - Quartiles

    Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations.

    To find the first, second, and third quartiles:

    1. Arrange the N data values from smallest to largest.

    2. First quartile, Q1 = data value at position (N + 1)/4

    3. Second quartile, Q2 = data value at position 2(N + 1)/4

    4. Third quartile, Q3 = data value at position 3(N + 1)/4

    Interquartile range = Q3 - Q1

    Finding the median, quartiles and inter-quartile range

    Example 1: Find the median and quartiles for the data below.

    12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10

    Order the data:

    4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12

    Lower quartile (Q1) = 5

    Median (Q2) = 8

    Upper quartile (Q3) = 9

    Inter-quartile range = Q3 - Q1 = 9 - 5 = 4
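The (N + 1)/4 position rule corresponds to the `"exclusive"` method of `statistics.quantiles` in the Python standard library. One caveat: when a position is fractional, that function interpolates between the two neighbouring values, whereas the worked example appears to take the value at the whole-number position (Q1 at position 3.25 is read as the 3rd value, 5). That reading of the slide is an assumption.

```python
from statistics import quantiles

data = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]

# quantiles() sorts the data itself; n=4 asks for the three quartile cut points
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

With interpolation Q1 comes out as 5.25 rather than 5, while the median and Q3 match the example. Different textbooks and packages use slightly different quartile conventions, so small discrepancies like this are common.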

    Normality

    A normal distribution is assumed by many statistical procedures.

    Normal distributions take the form of a symmetric bell-shaped curve.

    The standard normal distribution is one with a mean of 0 and a standard deviation of 1.

    Standard scores, also called z-scores or standardized data, are scores which have had the mean subtracted and which have been divided by the standard deviation to yield scores which have a mean of 0 and a standard deviation of 1.
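Standardizing is a two-step transformation: subtract the mean, then divide by the standard deviation. The sketch below uses the population standard deviation (`pstdev`) so that the resulting scores have a standard deviation of exactly 1. That choice is an assumption; the slide does not say which version to divide by.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the standardized version of each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; assumed choice, see the note above
    return [(v - mu) / sigma for v in values]

scores = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
z = z_scores(scores)

print([round(value, 2) for value in z])
print(round(mean(z), 10), round(pstdev(z), 10))
```

Standardizing changes the scale of the data but not the shape of its distribution, so it does not by itself make data normal.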

    Skewness

    Skewness measures the deviation of the distribution from symmetry.

    If the skewness is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical.

    Positive skew: left-leaning (the peak sits to the left, with a longer tail on the right); the mean is greater than the median.

    Negative skew: right-leaning (the peak sits to the right, with a longer tail on the left); the median is greater than the mean.

    Distribution Shape and Measures of Central Tendency

    If mean = median = mode, the shape of the distribution is symmetric.

    If mean < median < mode, the shape of the distribution is negatively skewed.

    If mode < median < mean, the shape of the distribution is positively skewed.

    Kurtosis

    Kurtosis is the peakedness of a distribution.

    A common rule-of-thumb test for normality is to run descriptive statistics to get skewness and kurtosis, then use the criterion that kurtosis should be within the +2 to -2 range when the data are normally distributed.

    Positive kurtosis indicates a relatively peaked distribution.

    Negative kurtosis indicates a relatively flat distribution.
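Skewness and kurtosis can both be sketched from central moments. The version below uses the simple moment-based definitions and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores 0), which is the convention the +2 to -2 rule seems to assume. Statistical packages usually apply small-sample corrections, so their output may differ slightly. The function names and both data sets are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the average of (value - mean) ** k."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]   # long tail to the right: positive skew

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```

The symmetric set has zero skewness and negative excess kurtosis (it is flat), while the right-tailed set has positive skewness, consistent with its mean exceeding its median.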

    Outliers

    Simple outlier: cases with extreme values with respect to a single variable.

    Cases which are more than plus or minus three standard deviations from the mean of the variable.

    Outliers can radically alter the outcome of analysis and are also violations of normality.

    Multivariate outlier: cases with extreme values with respect to multiple variables.
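The three-standard-deviation rule for simple outliers can be sketched as a filter. The function name, the default threshold parameter, and the data are illustrative assumptions; the sample standard deviation is used here.

```python
from statistics import mean, stdev

def simple_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` SDs from the mean."""
    mu = mean(values)
    sd = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sd]

# Hypothetical data: 21 ordinary values plus one extreme case
data = [9, 10, 11] * 7 + [60]
outliers = simple_outliers(data)
print(outliers)
```

Note that the outlier itself inflates both the mean and the standard deviation, so in very small samples no value can reach three standard deviations and the rule will flag nothing.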

    Types of Variable

    Nominal: A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works).

    Ex: region, zip code, and religious affiliation

    Ordinal: A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied).

    Ex: attitude scores representing degree of satisfaction or confidence and preference rating scores

    Scale: A variable can be treated as scale (continuous) when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate.

    Ex: age in years and income in thousands of dollars

    Statistical Significance (p-values)

    The statistical significance of a result (its p-value) is the probability that a relationship (e.g., between variables) or a difference (e.g., between means) at least as strong as the one observed in a sample would occur by pure chance ("luck of the draw") if, in the population from which the sample was drawn, no such relationship or difference existed.

    The statistical significance of a result tells us something about the degree to which the result is "true" (in the sense of being "representative of the population").
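One way to make the "luck of the draw" idea concrete is an exact permutation test. If group labels are unrelated to the values, every way of splitting the pooled data into two groups is equally likely, and the p-value is the share of splits whose difference in means is at least as large as the one observed. This is an illustrative technique chosen here, not a method named in the slides; the function name and the data are hypothetical, and integer data are assumed.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = group_a + group_b
    k = len(group_a)
    n = len(pooled)
    total = sum(pooled)

    def statistic(sum_a):
        # |mean_a - mean_b| multiplied by k * (n - k); for integer data this
        # keeps the comparison in exact integer arithmetic
        return abs(sum_a * (n - k) - (total - sum_a) * k)

    observed = statistic(sum(group_a))
    splits = [statistic(sum(c)) for c in combinations(pooled, k)]
    return sum(s >= observed for s in splits) / len(splits)

# Hypothetical measurements for two small groups
a = [12, 14, 15, 16]
b = [8, 9, 10, 11]
p = permutation_p_value(a, b)
print(round(p, 4))
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so chance alone would rarely produce a difference this large.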

    Estimation Methods for Replacing Missing Values

    Series mean. Replaces missing values with the mean for the entire series.

    Mean of nearby points. Replaces missing values with the mean of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the mean.

    Median of nearby points. Replaces missing values with the median of valid surrounding values. The span of nearby points is the number of valid values above and below the missing value used to compute the median.

    Linear interpolation. Replaces missing values using a linear interpolation. The last valid value before the missing value and the first valid value after the missing value are used for the interpolation. If the first or last case in the series has a missing value, the missing value is not replaced.

    Linear trend at point. Replaces missing values with the linear trend for that point. The existing series is regressed on an index variable scaled 1 to n. Missing values are replaced with their predicted values.
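Two of these methods, series mean and linear interpolation, are sketched below. Missing values are represented as `None`, which is an assumption of this example, and the function names and the series are hypothetical; this is not the implementation used by any particular statistics package.

```python
def fill_series_mean(series):
    """Replace each missing value with the mean of all valid values."""
    valid = [v for v in series if v is not None]
    m = sum(valid) / len(valid)
    return [m if v is None else v for v in series]

def fill_linear_interpolation(series):
    """Fill gaps on a straight line between the neighbouring valid values."""
    filled = list(series)
    valid_idx = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(valid_idx, valid_idx[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    # Missing values at the very start or end have no valid neighbour on one
    # side, so they are left unreplaced
    return filled

series = [None, 10, None, None, 16, 18, None]
print(fill_series_mean(series))
print(fill_linear_interpolation(series))
```

Series mean ignores where in the series the gap falls, while interpolation respects the local trend but cannot fill gaps at either end.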

