Page 1: Problems with using Microsoft Excel for Statistical ...biostat.mc.vanderbilt.edu/wiki/pub/Main/TheresaScott/StatsInExcel... · Problems with using Microsoft Excel for Statistical

Problems with using Microsoft Excel for Statistical Analysis & Graphics

Theresa A Scott, MS

Department of Biostatistics

[email protected]

http://biostat.mc.vanderbilt.edu/TheresaScott

- CRC Research Skills Workshop -

Friday December 4, 2009


Microsoft Excel

• Widely available as part of Microsoft Office.
  – Automatically packaged with (ie, loaded on) new computers.
  – Version 2007 now available.
• Commonly used for data entry, storage, & management.
• Also commonly used for computation & graphics.
  – Basic installation contains a plethora of (built-in) functions, including some statistical functions.
  – Can also install a “Data Analysis Toolpak” add-in.
  – “Charting” capabilities also come with basic installation.
• These reasons encourage its use for statistical analysis.


Statistical Analysis in Excel

• “[I]t is quite possible that more basic statistical calculations are done worldwide in Excel than in all statistical packages combined”. – Wilkinson (1994)
• Unfortunate, since Excel is not a statistical package.
• Has been known for a long time (in the small world of statistical computing) that serious errors exist in Excel’s statistical procedures.
  – Since 1994, when Sawitzki published two manuscripts in Computational Statistics and Data Analysis.
  – Remained Microsoft’s dark secret in the larger world.


Statistical Analysis in Excel, cont’d

• Many textbooks with titles like “Statistics with Excel” have been written by professional statisticians.
• Many professional statisticians (continue to) use Excel on a daily basis for quick and easy statistical calculations.
• In turn, a generation of students learned to do statistics with Excel.
  – “Surely”, the student reasoned, “it is safe to use Excel for statistics. If it weren’t, my professor would have chosen a different software package.”
  – Subsequently went on to use Excel in the professional world.
• Big question: Has Microsoft addressed/fixed these errors?


Microsoft’s track record

• Typically, when errors are found in a software package (like SAS or Stata), the developer fixes the errors, & the matter is done.
  – No further need to re-evaluate the software on this set of tests.
• Not so with Microsoft & Excel.
  – Occasionally fixes errors, more often ignores them, & sometimes fixes them incorrectly.
  – Flaws identified by Sawitzki in 1994 have never been fixed.
  – List of errors of which Microsoft is aware is not available.
• Consequently, every time there is a new version of Excel, the tests must be repeated – “Is it safe to use Excel?”
• Also, not all statistical flaws have been found – only a fraction of the statistical functions & procedures have been tested.


Individual Problems

• NOTE: The following is not a comprehensive list.

Numerical Accuracy

• Problem (from Excel 97): The algorithms Excel uses to calculate statistical distributions do not agree with better algorithms for those same distributions at the third decimal place & beyond.
  – So, p-values are approximately correct, but not exact.
  – Harmful for hypothesis tests if the third decimal place is of concern (eg, a p-value of 0.056 vs 0.057).
  – Of most concern when constructing confidence intervals (eg, 35*1.96 = 68.6 compared to 35*1.97 = 69.0).
• Example of another general problem that affects (at least some versions of) Excel 2007:
  – Formula “=850*77.1” returns the value 100,000 rather than the correct value of 65,535 (although it sometimes does return the correct value)!
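For reference, the same product can be evaluated in any environment that uses IEEE 754 double-precision arithmetic. The short Python sketch below is not part of the original slides; it only shows what the arithmetic should give and makes no claim about the cause of the Excel 2007 behaviour.

```python
import math

# The product from the slide, evaluated in double precision.
product = 850 * 77.1

# 77.1 has no exact binary representation, so the stored product may
# differ from 65,535 by a rounding error far below display precision.
print(repr(product))
print(math.isclose(product, 65535.0, rel_tol=1e-12))
print(round(product))
```

Any result that is displayed to a few decimal places should therefore read 65,535.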


Some Descriptive Statistics

• Problem (with Excel 2003 & older versions): The STDEV (standard deviation; SD) & VAR (variance) functions employed an internal formula that calculated the wrong result under specific situations – for variables whose mean is large compared to their SD.
  – Affected calculation of other descriptive statistics (using other statistical functions).
  – Seems OK in Excel 2007.

Observation            X
1                      10,000,000,001
2                      10,000,000,002
3                      10,000,000,003
4                      10,000,000,004
5                      10,000,000,005
6                      10,000,000,006
7                      10,000,000,007
8                      10,000,000,008
9                      10,000,000,009
10                     10,000,000,010
AVERAGE                10,000,000,005.5
STDEV (in old Excel)   0.000
Correct STDEV          3.028
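A commonly cited cause of this kind of failure is the one-pass "calculator" formula for the variance, which subtracts two nearly equal, very large numbers. The Python sketch below is not from the slides, and it assumes (without confirmation from the source) that this formula is representative of what older Excel versions did. The function names are hypothetical. It compares the one-pass formula with a two-pass calculation on the data from the table.

```python
import math
import statistics

# Data from the table above: 10,000,000,001 through 10,000,000,010.
xs = [10_000_000_000.0 + i for i in range(1, 11)]

def one_pass_variance(values):
    """Calculator formula: (sum(x^2) - (sum x)^2 / n) / (n - 1)."""
    count = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return (sum_x2 - sum_x * sum_x / count) / (count - 1)

def two_pass_variance(values):
    """Compute the mean first, then sum squared deviations from it."""
    count = len(values)
    mean = sum(values) / count
    return sum((v - mean) ** 2 for v in values) / (count - 1)

naive_var = one_pass_variance(xs)
stable_var = two_pass_variance(xs)
stable_sd = math.sqrt(stable_var)

# The one-pass result is unreliable here: it may be 0, negative, or huge.
print("one-pass variance:", naive_var)
print("two-pass SD:", round(stable_sd, 3))
print("statistics.stdev:", round(statistics.stdev(xs), 3))
```

The two-pass result agrees with the correct SD in the table because the deviations from the mean are small numbers that double precision can represent exactly.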


Some Descriptive Statistics, cont’d

• “Problem” (with all versions): Excel calculates percentiles differently from the way other statistical packages (may) calculate them.
  – Using the QUARTILE function in Excel, 1st & 3rd quartile = 130 & 157.5.
  – Using another stat package (SPSS), 1st & 3rd quartile = 125 & 160.
• Discrepancies have to do with how Excel calculates ranks – does not take into account “tied” ranks.
  – Thus, there can also be discrepancies in the calculated value of the median (50th percentile).

Observation   X
1             120
2             125
3             125
4             145
5             145
6             150
7             150
8             160
9             170
10            175
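Both pairs of quartiles quoted above can be reproduced from the table with two different textbook rules. The Python sketch below is not from the slides; that Excel and SPSS internally use exactly these rules is an assumption made for illustration.

```python
import statistics

# Data from the table above (already sorted).
xs = [120, 125, 125, 145, 145, 150, 150, 160, 170, 175]

# Rule 1: linear interpolation at position (n - 1) * p, which
# statistics.quantiles calls the "inclusive" method.
q1_interp, _, q3_interp = statistics.quantiles(xs, n=4, method="inclusive")

# Rule 2: split the sorted data in half and take the median of each half.
half = len(xs) // 2
q1_halves = statistics.median(xs[:half])
q3_halves = statistics.median(xs[-half:])

print(q1_interp, q3_interp)   # 130.0 157.5
print(q1_halves, q3_halves)   # 125 160
```

Neither rule is wrong in itself; the practical point is to know which definition a package uses before comparing its output with another package.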


Missing Data

• Problem: Missing values handled inconsistently & incorrectly.
  – In older versions, any blank cell considered to be zero.
  – At some point, changed so that blank cells were ignored (but still not in all cases).
• However, paired t-tests, ANOVA, Regression & other Data Analysis Toolpak tools deal badly with missing values.
  – Seems not to be the case for built-in statistical functions.

X1   X2
1    1
2    2
3    3
4    4
5    5
6    5
7    4
8    3
9    2
10   1
10
10

(Correct) p-value from TTEST (PAIRED)   0.0448
p-value from Toolpak                    0.2044
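The usual way to handle missing values in a paired test is to drop every pair that has a missing member before computing the differences. The Python sketch below is not from the slides: the function name `paired_t` and the data are hypothetical, missing values are represented by `None`, and only the t statistic and degrees of freedom are computed (the standard library has no t distribution for the p-value).

```python
import math

def paired_t(x1, x2):
    """Paired t statistic and degrees of freedom, using only complete pairs."""
    # Keep a difference only when both members of the pair are present.
    diffs = [a - b for a, b in zip(x1, x2) if a is not None and b is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("differences have zero variance")
    t_stat = mean_d / math.sqrt(var_d / n)
    return t_stat, n - 1

# Hypothetical data (not the values from the slide).
before = [1.0, 2.0, None, 4.0, 5.0]
after = [2.0, 2.0, 3.0, None, 7.0]

t_stat, df = paired_t(before, after)
print(t_stat, df)
```

Only three of the five pairs are complete here, so the test has 2 degrees of freedom; a tool that silently keeps the incomplete pairs would report a different statistic.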


Other Misleading/incorrect Methods

• CONFIDENCE function (even in Excel 2007) uses incorrect constant in calculating a 95% confidence interval (CI).
  – General formula for 95% CI is mean ± constant*SE, where SE = SD/√n.
  – Excel uses 1.96 as the constant (ie, assumes a normal distribution).
  – Valid only if the population variance is known - never true for experimental data.
  – Thus, CIs computed on sample data will be too narrow.
  – Constant based on t-distribution should be used instead.
• Excel is inconsistent in the type of p-values it uses (by default).
  – Most often returns one-sided p-values.
  – However, the TINV function assumes a 2-sided p-value.
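The effect of the constant is easy to see on a small sample. The Python sketch below is not from the slides; the sample values are hypothetical, and the t-based constant for 9 degrees of freedom (2.262) is hard-coded from a standard t table because the standard library has no t distribution.

```python
import math
import statistics

# Hypothetical sample of n = 10 measurements (illustrative values only).
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.5, 4.9, 5.2, 4.7]
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)

# Normal-based constant, approximately 1.96.
z_const = statistics.NormalDist().inv_cdf(0.975)

# t-based constant for n - 1 = 9 degrees of freedom (two-sided 95%),
# taken from a standard t table.
T_CONST_DF9 = 2.262

z_interval = (mean - z_const * se, mean + z_const * se)
t_interval = (mean - T_CONST_DF9 * se, mean + T_CONST_DF9 * se)

print("normal-based 95% CI:", z_interval)
print("t-based 95% CI:     ", t_interval)
```

With 10 observations the t-based interval is roughly 15% wider than the normal-based one; the gap shrinks as the sample size grows.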


Other Misleading/incorrect Methods

• Bad linear regression algorithm(s).
  – Until Excel 2003, Excel would give completely incorrect coefficient estimates.
  – Algorithm was improved in Excel 2003 – both Excel 2003 & 2007 report correct coefficients.
  – However, regression routines are still incorrect for multicollinear data – thus, (still) return incorrect coefficient estimates.
• A good statistics package will report errors due to correlations among the X variables (covariates; predictors).
  – Excel does not compute any collinearity measures, & does not warn the user when collinearity is present, but does report (often nonsensical) coefficient estimates.
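One simple collinearity measure of the kind referred to above is the variance inflation factor (VIF). The Python sketch below is not from the slides: the function names and data are hypothetical, it covers only the special case of two predictors (where VIF = 1/(1 − r²)), and the threshold of 10 is a common rule of thumb rather than a formal test.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when a model has exactly two predictors."""
    r2 = pearson_r(x1, x2) ** 2
    # Perfect collinearity: the regression coefficients are not identifiable.
    return math.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

# Hypothetical predictors: x2 is almost exactly 2 * x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]

vif = vif_two_predictors(x1, x2)
if vif > 10:
    print(f"warning: predictors are nearly collinear (VIF = {vif:.0f})")
```

A check like this, run before fitting, gives the warning that the slide says Excel omits.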


Excel’s Charts

• Problem:
  – Default chart types in Excel 2007 (and older) violate standards of good (statistical) graphics (defined by Tufte, Cleveland, etc).
  – Instead, these charts create chartjunk (ie, extraneous graphical elements) that hinders your ability to comprehend the data.
  – Examples: pie & donut charts; area charts; bubble charts; cylinder, cone, & pyramid bar & column charts; & any “3-D” charts.
• Solution:
  – “Get to know the principles of good graphing well enough so you know how to choose appropriate options to override defaults.”
  – “Just because you can doesn’t mean you should.”


Excel’s Charts – Examples

[Figure: four example Excel charts: line chart, pie chart, area chart, & bubble chart.]


Other General Remarks

• Many statistical methods (& graphs) simply not available in Excel.
  – Histograms & boxplots;
  – test of significance for a correlation coefficient;
  – Spearman’s & Kendall’s rank correlation coefficients;
  – non-parametric tests of association (eg, Wilcoxon rank-sum & Kruskal-Wallis);
  – 2-way ANOVA with unequal sample sizes (ie, unbalanced data);
  – GLM (generalized linear models);
  – Survival analysis methods; &
  – regression diagnostics.
• Makes it difficult to use it for more than computing summary statistics, simple linear regression, & (some) hypothesis testing.


Other General Remarks, cont’d

• Inconsistencies between Data Analysis Toolpak add-in & built-in statistical functions, including many more errors unique to Toolpak.
• Some general issues specific to Toolpak:
  – Data organization differs according to analysis – must reorganize your data in many ways.
  – Example: Requires the X variables (ie, predictors; covariates) to be in contiguous columns in order to input them to the regression procedure.
  – Many analyses can only be done one column at a time, making it inconvenient to do the same analysis on many columns.
  – Output is poorly organized, sometimes inadequately labeled, & there is no record of how an analysis was accomplished.


Other General Remarks, cont’d

• Many reported (& some allegedly fixed) bugs necessarily provoke suspicion that others still exist or may have been introduced.
• No information about the nature of the numerical algorithms employed is generally provided or can be found.
• Help-files provided, but are often confusing, provide inaccurate statistical information, and/or are not helpful.
• Excel does not provide any record (ie, log or history) of what is done (including any changes to your data), making it virtually impossible to document or duplicate what is done.
  – Such a record is vital for serious analysis & key to reproducible research.


Conclusions & Recommendations

• Excel is a poor choice for statistical analyses beyond the simplest descriptive statistics, or for more than a very few columns.
  – Don’t assume that Excel will give the correct answer.
• The “Data Analysis Toolpak” is not worth bothering with either.
  – No easier to use than most statistics packages, has very limited capability, & also has (many) known bugs.
• Use a real statistical package when you need to do statistics.
  – SPSS, Stata, or SAS (see Vandy ITS for price), or R (free).
• Spreadsheet alternatives: OpenOffice Calc & Gnumeric (both free).
  – Both read in Excel spreadsheets (ie, .xls files).
  – Some Excel errors not an issue, but no formal testing done yet.


Other Stats Resources at Vandy

• Biostatistics Clinics
  – Daily from 12:00 – 1:15 PM.
  – See http://biostat.mc.vanderbilt.edu/Clinics for more info.
• VICTR Resource Request
  – VICTR = Vanderbilt Institute for Clinical & Translational Research.
  – Supported by the Vanderbilt Office of Research & an NIH-sponsored Clinical & Translational Science Award (CTSA).
  – One option, a “Voucher”, grants you $2000, which can be used to work with a Biostatistician (covers ~20 hours of their time).
  – Can be used before your data is collected and/or once your data is collected.
  – See the “Funding Support” tab after logging into the StarBrite webpage (http://www.mc.vanderbilt.edu/starbrite) for more info & to submit a request (usually approved within 2-3 business days of submission).


References

• Allen, E (2001) The Role of Excel for Statistical Analysis (http://roger.babson.edu/rao/eatalk.ppt)
• Cox, N (2000) Use of Excel for Statistical Analysis (http://www.agresearch.cri.nz/Science/Statistics/exceluse.htm)
• Cryer, JD (2001) Problems With Using Microsoft Excel for Statistics (http://www.stat.wisc.edu/~clayton/stat575/cryerexcel.pdf)
• Goldwater, E (2007) Using Excel for Statistical Data Analysis – Caveats (http://www-unix.oit.umass.edu/~evagold/excel.html)
• Heiser, DA, Microsoft Excel 2000, 2003, and 2007 Faults, Problems, Workarounds and Fixes (http://www.daheiser.info/excel/frontpage.html)
• Helsel, DR (2009) Is Microsoft Excel an Adequate Statistics Package? (http://www.practicalstats.com/xlsstats/excelstats.html)
• McCullough, BD & Heiser, DA (2008) On the accuracy of statistical procedures in Microsoft Excel 2007. Computational Statistics and Data Analysis, 52(10), 4570-4578.
• McCullough, BD & Wilson, B (2005) On the accuracy of statistical procedures in Microsoft Excel 2003. Computational Statistics and Data Analysis, 49(4), 1244-1252.
• McCullough, BD (2002) Proceedings of the 2001 Joint Statistical Meeting [CD-ROM]: Does Microsoft fix errors in Excel?
• McCullough, BD & Wilson, B (2002) On the accuracy of statistical procedures in Microsoft Excel 2000 and Excel XP. Computational Statistics and Data Analysis, 40(4), 713-721.


References, cont’d

• McCullough, BD & Wilson, B (1999) On the accuracy of statistical procedures in Microsoft Excel 97. Computational Statistics and Data Analysis, 31(1), 27-37.
• McCullough, BD (1999) Assessing the reliability of statistical software: Part II. The American Statistician, 53(2), 149-159.
• McCullough, BD (1998) Assessing the reliability of statistical software: Part I. The American Statistician, 52(4), 358-366.
• Pottel, H (2001) Statistical flaws in Excel (http://www.coventry.ac.uk/ec//~nhunt/pottel.pdf)
• Sawitzki, G (1994a) Testing the Numerical Reliability of Data Analysis Systems. Computational Statistics and Data Analysis, 18, 269-286.
• Sawitzki, G (1994b) Report on the Numerical Reliability of Data Analysis Systems. Computational Statistics and Data Analysis, 18, 289-301.
• Simonoff, J (2008) Statistical analysis using Microsoft Excel (http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf)
• Statistics Service Centre, The University of Reading (2000) Using Excel for Statistics – Tips and Warnings (http://www.reading.ac.uk/ssc/publications/guides/xfs.pdf)
• Should Microsoft Excel Software Be Used for Statistical Analysis or Graphics? (http://andrologi-indonesia-pandi.org/_UPLOAD_/article_43817_Excel.pdf)
• http://biostat.mc.vanderbilt.edu/ExcelProblems
