
Descriptive Statistics

Date post: 02-Jan-2016
Upload: callum-tucker
Description:
Descriptive Statistics. Statistics. Faculty of Information Technology King Mongkut’s University of Technology North Bangkok. Content. Data Preparation Data Presentation Descriptive Statistics. Data Preparation. Data checking for accuracy Data cleaning - PowerPoint PPT Presentation
Transcript

Quantitative Analysis

Descriptive Statistics
Statistics
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok

Content
- Data Preparation
- Data Presentation
- Descriptive Statistics

Data Preparation
- Data checking for accuracy
- Data cleaning
  - Removal of inaccurate data, errors, and outliers
  - Dealing with missing data
- Data transformation
  - Application of a deterministic mathematical function to each point in a data set
  - The function used to transform the data is invertible and is generally continuous

Data Transformation
- To comply with the requirements of statistical analysis
- For better understanding of graphs
- Ease of interpretation of data
- Common methods
  - The logarithm and square root transformations are commonly used for positive data
  - The multiplicative inverse (reciprocal) transformation can be used for non-zero data

Example
- Populations
- See http://en.wikipedia.org/wiki/Data_transformation_(statistics)
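The three common transformations named above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are hypothetical and do not come from the slides. It also checks the invertibility property by undoing the logarithm with the exponential.

```python
import math

def log_transform(values):
    """Natural-log transform; only defined for strictly positive data."""
    return [math.log(v) for v in values]

def sqrt_transform(values):
    """Square-root transform; only defined for non-negative data."""
    return [math.sqrt(v) for v in values]

def reciprocal_transform(values):
    """Multiplicative inverse; only defined for non-zero data."""
    return [1.0 / v for v in values]

# Hypothetical, strongly right-skewed sample (not taken from the slides).
sample = [1.0, 10.0, 100.0, 1000.0]

logged = log_transform(sample)
# The transformation is invertible: exp() recovers the original values.
restored = [math.exp(v) for v in logged]

print(logged)
print(sqrt_transform(sample))
print(reciprocal_transform(sample))
```

After the log transform the sample values are evenly spaced, which is one reason the transformation can make a skewed graph easier to read.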

Example: fuel consumption
- Kilometers per litre: 10 km/l
- Reciprocal: litres per 100 kilometers, 10 l/100 km
- Why?

Data Presentation
- Text
- Table
- Graphical
  - Pictograph
  - Bar Chart
  - Pie Chart
  - Line Chart
  - Histogram
  - Stem and Leaf
  - Scatter Plot
  - Box Plot
- What is the difference between a Bar Chart and a Histogram?


Normal Curve and Skewed Curves

Positive Skewed Curve

Negative Skewed Curve

Normal or Symmetrical Curve

J-Shaped Curve

J-Reversed Shaped Curve

U-Shaped Curve

Bimodal Curve

Multimodal Curve

Cumulative Frequency Curve

Stem and Leaf

Scatter Plot

Box Plot
- Shows data distribution and skewness
- Normal
- Right/Positive Skewed
- Left/Negative Skewed

Descriptive Statistic
A descriptive statistic is a numerical index that describes or summarizes some characteristic of a frequency or relative frequency distribution. (Frank & Althoen, Statistics: Concepts and Applications, 1994)

Descriptive Statistics
- Frequency distribution table
- Describe
  - Location of distribution: Mode, Median, Mean
  - Dispersion of distribution: Range, SD, Variance
  - Shape of distribution: Skewness, Kurtosis
  - Individuals in distributions: Percentile, Decile, Quartile
- Joint distributions of data
  - Scatter Diagram
  - Correlation Coefficient
  - Linear Regression

Frequency Distribution
- Raw data (ungrouped): 42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60, ...
- Grouped:

Score      Frequency
0 - 4      0
5 - 9      0
10 - 14    0
15 - 19    0
20 - 24    0
25 - 29    0
30 - 34    1
35 - 39    1
40 - 44    3
45 - 49    6
50 - 54    5
55 - 59    10
60 - 64    10
65 - 69    10
70 - 74    7
75 - 79    4
80 - 84    6
85 - 89    6
90 - 94    3
95 - 99    0
100        0

- Can be visualized using graphs and charts
- Determining the number of intervals: k = 1 + 3.3 log N (base-10 logarithm)
- Interval width = Range / k

Frequency Distribution Table
- One-way: one variable, often used with percentages
- Two-way: two variables, shows a rough relation between the two variables
- Etc.

Department        IT              DN              MIS
Plan              Male   Female   Male   Female   Male   Female
Thesis            3      5        4      3        2      3
Master Project    29     34       27     22       32     35

Describing Location of Distribution
Mode
- The value with the highest frequency
- Applicable to the nominal scale (and higher scales)
- There can be more than one mode for one set of data
- fx: MODE
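As a rough sketch of the frequency-distribution steps and the mode described above, the following Python uses only the eleven raw entries that are listed on the slide. The helper names are hypothetical, integer scores are assumed, and rounding k to the nearest whole number is an assumption the slide does not state.

```python
import math
from statistics import multimode

def sturges_intervals(n):
    """Number of class intervals, k = 1 + 3.3 * log10(n), rounded (assumption)."""
    return round(1 + 3.3 * math.log10(n))

def grouped_frequencies(data, width, start=0):
    """Count integer entries per interval (lower, lower + width - 1)."""
    counts = {}
    for value in data:
        lower = start + ((value - start) // width) * width
        key = (lower, lower + width - 1)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))

# The raw entries shown on the slide (the slide's list continues beyond these).
raw = [42, 45, 82, 32, 91, 76, 55, 58, 55, 62, 60]

k = sturges_intervals(len(raw))
suggested_width = math.ceil((max(raw) - min(raw)) / k)

# Group with the slide's interval width of 5 (0 - 4, 5 - 9, ...).
table = grouped_frequencies(raw, width=5)
modes = multimode(raw)  # a list, because a data set can have several modes

print(k, suggested_width)
print(table)
print(modes)
```

`multimode` returns every value that shares the highest frequency, which matches the point that a data set can have more than one mode.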

Arithmetic Mean
- Considered the best among the three
- Sum of the values divided by the total frequency
- Can be affected by extreme values
- Changing the value of any entry also changes the mean
- Adding or subtracting a constant from every entry changes the mean by the same constant
- Multiplying or dividing every entry by a constant multiplies or divides the mean by the same constant
- The sum of the differences between each entry and the mean is always zero
- For grouped data, use the sum of the products of each interval's midpoint and its frequency, divided by the total frequency
- fx: AVERAGE
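The properties of the mean listed above can be checked numerically. The sketch below uses the first five raw entries from the frequency-distribution slide; `grouped_mean` is a hypothetical helper and the two intervals passed to it are made up for illustration.

```python
from statistics import fmean

data = [42, 45, 82, 32, 91]  # first five raw entries from the slides

mean = fmean(data)
shifted_mean = fmean([x + 10 for x in data])    # mean moves by the same +10
scaled_mean = fmean([x * 3 for x in data])      # mean is multiplied by 3
deviation_sum = sum(x - mean for x in data)     # zero, up to rounding error

def grouped_mean(intervals):
    """Mean of grouped data; intervals is a list of (lower, upper, frequency)."""
    total = sum(freq for _, _, freq in intervals)
    weighted = sum((lower + upper) / 2 * freq for lower, upper, freq in intervals)
    return weighted / total

print(mean, shifted_mean, scaled_mean, deviation_sum)
# Hypothetical grouped data: midpoints 12 and 17, two entries each.
print(grouped_mean([(10, 14, 2), (15, 19, 2)]))
```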

Median
- Better for data with extreme values
- Ungrouped data
  - The value in the middle of the distribution after sorting
  - N is odd: the value at position (N + 1) / 2
  - N is even: the average of the two middle values, at positions N/2 and N/2 + 1
  - fx: MEDIAN
- Grouped data
  - See percentile
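The odd and even cases for ungrouped data can be written as a short function. This is a minimal sketch with a hypothetical function name; the sample values are the ones the slides use in the percentile examples, and the value 1000 is a made-up extreme entry.

```python
def median(values):
    """Median of ungrouped data, following the odd/even rule on the slide."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indexes mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([18, 29, 31, 32, 33]))    # odd N
print(median([18, 29, 31, 32]))        # even N
print(median([18, 29, 31, 32, 1000]))  # an extreme value does not move the median
```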

Describing Dispersion
Range
- Ungrouped: Max - Min (fx: MAX - fx: MIN)
- Grouped: true upper bound of the highest interval - true lower bound of the lowest interval
  - The true upper bound is the average of the interval's upper bound and the (expected) lower bound of the next higher interval
  - The true lower bound is the average of the interval's lower bound and the (expected) upper bound of the next lower interval

Standard Deviation & Variance
- Standard deviation (SD, S.D., or S) is the most popular measure of dispersion
- It is the square root of the average squared difference between each entry and the arithmetic mean (a higher value means the data are more dispersed)
- [Formula images not preserved in the transcript: the slide showed formulas labelled N >= 30 and N < 30, with alternatives joined by "OR"]

Standard Deviation & Variance
- SD >= 0 always
- An SD of 0 means that all data entries have the same value
- Adding or subtracting a constant from all entries does not affect the SD
- Multiplying or dividing all entries by a value m multiplies or divides the SD by the absolute value of m
- Variance is equal to SD^2
- Only the positive value of SD is of interest
- fx: STDEV and VARA
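The SD properties above can be verified with Python's `statistics` module. The formula images are missing from the transcript, so the mapping used here is an assumption: the N >= 30 case is taken to divide by N (`pstdev`) and the N < 30 case by N - 1 (`stdev`). The data values are the ones used in the percentile examples.

```python
from statistics import pstdev, pvariance, stdev

data = [18, 29, 31, 32, 33]

sd_n = pstdev(data)          # divides by N (assumed to match the N >= 30 formula)
sd_n_minus_1 = stdev(data)   # divides by N - 1 (assumed to match the N < 30 formula)
variance_n = pvariance(data)

shifted_sd = pstdev([x + 10 for x in data])  # unchanged by adding a constant
scaled_sd = pstdev([x * -3 for x in data])   # multiplied by |m| = 3

print(sd_n, sd_n_minus_1, variance_n)
print(shifted_sd, scaled_sd)
```

The N - 1 version is always slightly larger than the N version for the same data, and the difference shrinks as N grows.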

Shape of Distribution
Skewness
- 0 means there is no skewness (symmetric, as in the normal distribution)
- A positive value means positive/right skewed
- A negative value means negative/left skewed
- Calculation? Just use Excel or SPSS
- fx: SKEW


Shape of Distribution
Kurtosis
- 0 means the same peakedness as the normal distribution
- A positive value means very peaked (less dispersed)
- A negative value means less peaked (more dispersed)
- Calculation? Just use Excel or SPSS
- fx: KURT
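For readers who want to see what is being computed, here is a minimal sketch of moment-based skewness and excess kurtosis. It is not the formula Excel or SPSS uses: those tools apply small-sample corrections, so their results can differ from these values, especially for small N. The function names and sample data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all entries."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]       # hypothetical data
right_skewed = [1, 1, 1, 2, 10]   # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```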


Describing Individuals in Distributions
- Percentile
- Quartile
- Decile
- Performed on data sorted in ascending order
- Divide the data into 100, 4, or 10 parts and identify the value at the desired position

Percentile Rank
- The percentile rank of any particular score x is the percentage of observations equal to or less than x
- Divide the sorted data set into 100 parts
- cent = 100, thus percent = /100
- Percentile rank of entry x_i = 100 * (cumulative frequency_i / N)
- e.g. 18, 29, 31, 32, 33: percentile rank of 31 = 100 * (3/5) = 60
- Be careful!
  - Percentile rank determines the rank from a data value
  - Excel uses 0.00 - 1.00 for fx: PERCENTRANK
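The percentile-rank definition above translates directly into code. This sketch follows the slide's definition (percentage of observations at or below x); the function name is hypothetical, and the result is not claimed to match Excel's PERCENTRANK.

```python
def percentile_rank(values, x):
    """Percentage of observations that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

scores = [18, 29, 31, 32, 33]  # the example data from the slide
print(percentile_rank(scores, 31))
```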

Percentile
- The kth percentile is the x-value at or below which k percent of observations fall
- Roughly, position of the data entry at the kth percentile = k(n + 1) / 100
- e.g. 18, 29, 31, 32, 33: 80th percentile position = 80/100 * (5 + 1) = 4.8, rounded to the 5th position
- Be careful!
  - Percentile determines the data value from a percentile rank
  - Excel uses 0.00 - 1.00 for fx: PERCENTILE

Quartile
- The kth quartile is the x-value at or below which k quarters of observations fall
- Roughly, position of the data entry at the kth quartile = k(n + 1) / 4
- e.g. 18, 29, 31, 32, 33: 3rd quartile position = 3/4 * (5 + 1) = 4.5, between the 4th and 5th positions
- fx: QUARTILE

Decile
- The kth decile is the x-value at or below which k tenths of observations fall
- Roughly, position of the data entry at the kth decile = k(n + 1) / 10
- e.g. 18, 29, 31, 32, 33: 5th decile position = 5/10 * (5 + 1) = 3rd position
- Excel does not have a direct decile function
- Use fx: PERCENTILE with 0.1, 0.2, 0.3, ... instead
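The percentile, quartile, and decile positions all follow the same k(n + 1) / parts rule, so one function can cover the three cases. In the sketch below the function names are hypothetical, and taking a value at a fractional position by linear interpolation is an assumption: the slides only say "roughly" and round 4.8 to the 5th position.

```python
def quantile_position(k, n, parts):
    """1-based position k(n + 1) / parts; parts is 100, 4, or 10."""
    return k * (n + 1) / parts

def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    n = len(sorted_values)
    position = min(max(position, 1), n)  # clamp to the available positions
    lower = int(position)                # whole part of the position
    fraction = position - lower
    if lower >= n:
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)

scores = [18, 29, 31, 32, 33]  # the example data from the slides, already sorted
n = len(scores)

p80 = quantile_position(80, n, 100)  # 80th percentile
q3 = quantile_position(3, n, 4)      # 3rd quartile
d5 = quantile_position(5, n, 10)     # 5th decile

print(p80, value_at_position(scores, p80))
print(q3, value_at_position(scores, q3))
print(d5, value_at_position(scores, d5))
```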

Percentile for Grouped Data
P = L + I{(n * r / 100) - f} / fr
- r: the percentile
- P: data value at the given percentile r
- L: true lower bound of the interval in which percentile r falls
- I: interval width
- n: number of data entries
- f: cumulative frequency of the intervals below L
- fr: frequency of the interval whose true lower bound is L
- Determine the interval in which the percentile falls using (n * r) / 100

Example: 60th percentile

Interval   Freq   Cumulative
0 - 10     0      0
11 - 20    0      0
21 - 30    1      1
31 - 40    4      5
41 - 50    11     16
51 - 60    20     36
61 - 70    17     53
71 - 80    10     63
81 - 90    9      72
91 - 100   0      72

n = 72, thus P60 is at around 60/100 * 72 = 43rd entry, which falls in interval 61 - 70
Thus P60 = 60.5 + 10{(60/100 * 72) - 36} / 17 = 64.74
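The worked example can be reproduced with a short function. This is a sketch under stated assumptions: the function name is hypothetical, interval bounds are integers, the true lower bound is taken as the lower bound minus 0.5, and the interval width is computed as upper - lower + 1 (which gives 10 for 61 - 70, as in the example).

```python
def grouped_percentile(intervals, r):
    """Percentile r for grouped data; intervals is a sorted list of (lower, upper, freq)."""
    n = sum(freq for _, _, freq in intervals)
    target = n * r / 100          # position of the percentile among the n entries
    cumulative_below = 0          # f: cumulative frequency below the current interval
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative_below + freq >= target:
            true_lower = lower - 0.5      # L
            width = upper - lower + 1     # I
            return true_lower + width * (target - cumulative_below) / freq
        cumulative_below += freq
    raise ValueError("percentile must be between 0 and 100")

# Frequency table from the example slide.
table = [
    (0, 10, 0), (11, 20, 0), (21, 30, 1), (31, 40, 4), (41, 50, 11),
    (51, 60, 20), (61, 70, 17), (71, 80, 10), (81, 90, 9), (91, 100, 0),
]

p60 = grouped_percentile(table, 60)
p50 = grouped_percentile(table, 50)  # the grouped median
print(round(p60, 2), p50)
```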

Median

Joint Distribution of Data
- Scatter Diagram

[Figure: three scatter diagrams labelled "Negatively related", "Not related", and "Positively related", with an imaginary line showing the relation]

Correlation Coefficient
- Pearson Product-Moment Correlation Coefficient
- Denoted as rxy or r
- Measures the correlation between two data sets
- Can take values from -1 to 1
  - Value of 1: the two data sets have a perfect positive linear relation
  - Value of -1: the two data sets have a perfect negative linear relation
  - Value of 0: the two data sets have no linear relation

Correlation Coefficient
- Formula: [formula image not preserved in the transcript]
- fx: PEARSON (do not use in MS Excel earlier than 2003)
- fx: CORREL
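Since the formula image is missing, here is a sketch of the standard raw-sums form of the Pearson coefficient, r = (N ΣXY - ΣX ΣY) / sqrt((N ΣX² - (ΣX)²)(N ΣY² - (ΣY)²)). It may or may not be the exact form the slide showed. The function name and the sample data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: y rises (or falls) exactly linearly with x.
xs = [1, 2, 3, 4]
print(pearson_r(xs, [3, 5, 7, 9]))   # perfect positive relation
print(pearson_r(xs, [9, 7, 5, 3]))   # perfect negative relation
```

The function divides by zero if either variable is constant, because correlation is undefined in that case.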

Correlation for Ordinal Scale
- Spearman Rank Correlation Coefficient: two variables
- Kendall's Tau Rank Correlation Coefficient: three or more variables
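The slides name the Spearman coefficient without showing its formula. As an illustration, the sketch below uses the common shortcut 1 - 6 Σd² / (n(n² - 1)), where d is the difference between the two ranks of each entry. It assumes there are no tied values; the function names and data are hypothetical.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: y increases with x, but not linearly.
xs = [1, 2, 3, 4, 5]
print(spearman_rho(xs, [1, 8, 27, 64, 125]))
print(spearman_rho(xs, [50, 40, 30, 20, 10]))
```

Because only the ranks are used, any strictly increasing relation scores 1 even when it is not a straight line.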

Linear Regression
- Describes the relation between two interval-scale variables in the form of a regression equation
  - y = bx + a (straight line)
  - y = a + bx + cx^2 (parabola)
  - y = ab^x (exponential)
- x: independent variable
- y: dependent variable
- a: Y-intercept (where the line crosses the Y axis)
- b: slope


Simple Linear Regression
- Find b, then a
  - b = (N ΣXY - ΣX ΣY) / (N ΣX² - (ΣX)²)
  - a = (mean of Y) - b * (mean of X)
- Then write the equation y = bx + a
- E.g. b = 31.4, a = 4.52: y = 31.4x + 4.52

Example
The table shows the time each student spends reading for the exam and his/her score.

Student   Time   Score
1         70     25
2         85     33
3         45     30
4         150    50
5         100    47
6         90     36
7         135    52
8         120    44
9         60     42
10        180    54

b = {10(45885) - (1035)(413)} / {10(123375) - (1035)^2} = 31395 / 162525 = 0.1932
a = 41.3 - (0.1932)(103.5) = 21.3038
y = 0.1932x + 21.3038

Meaning
- Each additional minute spent reading increases the predicted score by 0.1932 marks
- If you do not read at all, the predicted score is 21.3038 marks
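The example's arithmetic can be reproduced in Python from the ten (time, score) pairs in the table. The function name `fit_line` is hypothetical. The intercept printed here differs from the slide's 21.3038 in the third decimal place, because the slide rounds b to 0.1932 before computing a.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = bx + a, from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Reading time and exam score for the ten students in the example table.
times = [70, 85, 45, 150, 100, 90, 135, 120, 60, 180]
scores = [25, 33, 30, 50, 47, 36, 52, 44, 42, 54]

a, b = fit_line(times, scores)
predicted_for_100 = b * 100 + a  # predicted score for 100 units of reading time

print(round(b, 4), round(a, 4))
print(round(predicted_for_100, 2))
```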

Multiple Linear Regression
- More than one independent variable
- Equation: Y = a + b1x1 + b2x2 + b3x3
- Requirements
  - Normal distribution
  - No multicollinearity (independent variables do not depend on each other)
- Selecting independent variables
  - All Entry: when you are not sure which variables have an effect
  - Stepwise: only use variables tested to be significant


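[Figure: regression output with the following annotations]
- Simple correlation
- How much of the dependent variable can be explained by the independent variable
- Is the model good (significant)? (Yes, if Sig. < 0.05)
- Coefficient labels: a, b; a, b1, b2

Summary
[Table: measures applicable to each scale of measurement; the positions of the check marks were not preserved in the transcript]
- Columns
  - Freq. Distr.: F, %
  - Describing Location: Mean, Median, Mode
  - Individual: P, Q, D
  - Dispersion: Range, SD, Variance
- Rows: Nominal, Ordinal, Interval, Ratio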

