Excel project

profiley73t20851
UsingExcelforDataManipulationandStatisticalAnalysis_1.pdf

2010

Stanford University Social Science Data and Software

Using Excel for Data Manipulation and Statistical

Analysis: How-to’s and Cautions

This document describes how to perform some basic statistical procedures in Microsoft Excel.

Microsoft Excel is spreadsheet software that is used to store information in columns and rows,

which can then be organized and/or processed. Excel is a powerful program with an intuitive

user interface, and can be a great option for entering, organizing, and cleaning data.

In addition to its spreadsheet functions, Excel provides a number of standard statistical and

graphing procedures. However, these should be approached with caution, as statisticians have

found numerous errors in Excel’s statistical routines and distributions. Moreover, in recent

years, professional statistical packages such as SPSS (a.k.a. PASW) and Stata have developed

easy-to-use, point-and-click interfaces, complete with drop-down menus and dialogue boxes,

making them easier to use for those not familiar with the command-line interface. For these

reasons, we do not recommend using Excel for statistical analysis, beyond very basic descriptive

statistics and getting a feel for your data. If you choose to enter and clean your data initially in

Excel, we recommend transferring it to another program, such as Stata or SPSS, before

conducting analyses. SSDS provides resources and individual consulting to assist with

transferring data and with learning these statistical software packages.

This document begins with a brief review of the literature on the accuracy of Excel’s statistical

routines, and then offers suggestions on several procedures that can be run in Excel with

confidence.

Table of Contents

Caveats and Considerations .......................................................................................... 1

Using Formulas in Excel ................................................................................................. 2

Sorting ............................................................................................................................. 3

Filtering ........................................................................................................................... 6

Conditional Statements: Using IF, AND, OR .................................................................. 9

Caveats and Considerations Professional statisticians have been critical of statistical procedures in Excel for many years, at

least since the 1997 distribution. Recent assessments have found that many of the errors in

Excel’s algorithms persist in the 2007 release. Yalta (2008) assessed Excel’s computation of

several statistical distributions, and found substantive errors in almost all. He finds that Excel

will report more significant figures in its answer than it has accurately calculated. He compared

Excel to two free, open-source programs, Gnumeric 1.7.11 and OpenOffice.org Calc 2.3.0, and

found both of these to be more accurate than Excel.

2 – Using Excel for Data Manipulation and Statistical Analysis

McCullough and Heiser (2008) further find, “Excel 2007, like its predecessors, fails a

standard set of intermediate-level accuracy tests in three areas: statistical distributions, random

number generation, and estimation” (4570). Discussing Excel’s procedure for exponential

smoothing, the authors find it is “grievously flawed; we wonder how such obvious errors could

have been made.” They find however, that Excel’s procedures for univariate, ANOVA, and

linear regression analysis are acceptable, but strenuously caution against using the Solver

optimization tool. They additionally recommend against using the LOGEST and GROWTH

functions (which are described as “bad”) as well as the Normal Probability Plot, used to check

the residuals for normality; the authors find the help file for the latter to be “particularly useless.”

Finally, they cite others’ work showing inaccurate t-test results in the presence of missing values,

inaccurate p-values from a t-test, and incorrect labeling of t-test and z-test tables.

Beyond these considerable problems with the accuracy of statistics Excel reports, other critics

decry misleading visuals in many Excel graphical features. (See

http://www.stat.uiowa.edu/~jcryer/JSMTalk2001.pdf for a summary.)

Further Reading:

McCullough, B.A. and David A. Heiser. 2008. “On the accuracy of statistical procedures in

Microsoft Excel 2007.” Computational Statistics and Data Analysis 52: 4570–4578

Yalta, A. Talha. 2008. “The accuracy of statistical distributions in Microsoft Excel 2007.”

Computational Statistics and Data Analysis 52: 4579–4586

http://www.practicalstats.com/xlsstats/excelstats.html

http://people.umass.edu/evagold/excel.html

http://www.stat.uiowa.edu/~jcryer/JSMTalk2001.pdf

Using Formulas in Excel

Excel can be used with confidence to gain a feel for a dataset through basic descriptive statistics, such as

mean, median, mode, maximum, and minimum. All of these functions can be accessed through Excel’s

formula function.

To enter a formula, choose an empty cell. In this cell, type equals “=”. Whatever you type after the “=”

is the formula. For example, you can type

= A1 + A2

and then type Enter. The cell will now display the sum of cells A1 and A2. You can achieve the same

result by typing “=”, then clicking on cell A1, typing “+”, and then clicking on A2 and hitting Enter.

NOTE: If either cell A1 or A2 contains non-numeric values, then the formula cell will display

“#VALUE!”; this is generally an indication of an error in your formula.

3 – Using Excel for Data Manipulation and Statistical Analysis

Excel also provides a SUM function, which allows you to calculate a sum for a range of cells. To use the

SUM function on the first ten rows of column A, type in an empty cell:

=SUM(A1:A10)

You can use the SUM function on a row the same way:

=SUM(A1:M1)

You can also use the SUM function on a contiguous block of cells, for example, rows 1-5 of columns A-

M:

=SUM(A1:M5)

Notice that as you type the range of cells into the formula cell, Excel outlines the range in color.

Instead of typing the range, you can select it by clicking and dragging the mouse. To do this, type:

=SUM(

in the formula cell. Then click and drag to select the desired range. Excel will show the selected range in

the formula cell:

=SUM(A1:M5

End by typing the closing parenthesis “)”.

The formula interface can be used in exactly the same way on the following functions:

AVERAGE: the arithmetic mean of the selected data

MEDIAN: the value at the 50 th percentile of the selected data

MODE: the most commonly occurring value in the selected data

MIN: the smallest value in the selected data

MAX: the largest value in the selected data

It can also be used for a wide range of statistical and probability functions that we do not endorse.

Sorting

The SORT function will arrange your data in increasing, decreasing, alphabetical, or reverse-alphabetical

order.

Be careful when sorting: if you sort only one row or column, you will effectively “scramble” these

data relative to the rest of the spreadsheet. If the relationship between data in different rows or

columns must be preserved, always select the entire spreadsheet before sorting! And remember,

you can always undo a bad sort by typing “ctrl-Z” before you save.

4 – Using Excel for Data Manipulation and Statistical Analysis

Here is some gdp data for several countries in the western hemisphere. To sort, highlight the desired

selection (likely the whole spreadsheet). With the “Home” tab selected on the top right, select the “Sort

and Filter” menu from the top left.

Notice that whichever cell you last clicked in is white (below it’s cell A1). If you select “Sort A to Z” or

“Sort A to A” from this menu, Excel will sort your data in ascending or descending order, respectively,

depending on the value in the column with the white cell.

You can also choose “Custom Sort” from the “Sort and Filter” Menu, which will open the following box:

5 – Using Excel for Data Manipulation and Statistical Analysis

Notice that our data in this example has headers (“2005”, “2006”, “2007”, etc.). We don’t want these to

be treated as values and mixed in with the sorting, so select the “My data has headers” checkbox in the

top right corner.

Now in the “Sort by” drop-down menu, choose the column you would like to sort by. Leave the “Sort

On” menu set to “Values”, and choose an order from the “Order” drop-down menu. Then click OK.

If your data has some duplicate values, and you want to further sort within those, then you can use the

“Add level” selection:

This selection would cause Excel to first sort according to country name (Column A) and then for any

duplicates, sort those according to 2005 value.

Filtering

The FILTER function allows you to select a subset of your data to display. From the same “Sort and

Filter” menu used above, choose “Filter”. There will now be a small box on the first cell of each column.

If you click on one of these boxes, a dialogue box will open:

6 – Using Excel for Data Manipulation and Statistical Analysis

Initially, all values are selected. You can

deselect a value by clicking on the checkbox

next to it. If you click on the “(Select All)”

check box, you can select or deselect all.

7 – Using Excel for Data Manipulation and Statistical Analysis

Here, we have manually selected the values:

0.31, 0.84, 1.19, 3.15, and 3.25.

Now only those rows with the selected values

for 2009 are visible. All other rows are hidden

(but not deleted!).

To restore all values, click on the Filter button

on the 2009 column, and again Select All.

You can achieve the same effect a second way:

8 – Using Excel for Data Manipulation and Statistical Analysis

Click on the 2009 Filter button, and choose “Number Filters”. A second menu will open off to the side.

From this, choose “Less Than”.

All the values selected above were less than 4.

We can choose the same values here by

selecting rows where the value is less than 4.

9 – Using Excel for Data Manipulation and Statistical Analysis

There is a corresponding “Text Filters” menu for

columns that have text values, such as the name

column here.

Conditional Statements: Using IF, AND, OR

The formula interface can be used for conditional statements, using the IF function. These can be very

useful in cleaning data, for example checking for matching values in a range of cells. This comes in

handy if you have cut-and-pasted selections from two different spreadsheets, and you want to verify that

an ID column from each selection matches.

The general syntax for the IF function is:

=IF(condition, value if true, value if false)

If you want to check that values in column E match values in column F you can type in an empty cell:

=IF(E2=F2, 1,0)

(Note: if you didn’t have a header row, you would use E1 and F1.)

Then Enter.

10 – Using Excel for Data Manipulation and Statistical Analysis

Notice that there is a 0 in the formula cell because, in this case, E2 and F2 are not equal. Now click again

on the cell in which you just typed this formula. Notice that Excel highlights this cell by outlining it in

black, with a small black square on the bottom right corner. Click and hold the square, and drag it down

as many rows as you wish. This will carry the formula down through these rows; each new cell will

display a 1 or 0, indicating whether the corresponding cells from columns E and F match.

In this case, no cells from columns E and F match, so all formula cells are “0”.

Similarly, you can use the IF statement to look for duplicates. First sort your data. Be careful when

sorting! (see above) Exercising due caution, sort the spreadsheet in either increasing or decreasing order

(either is fine) according to the column of interest.

Let’s assume you want to check for duplicates in column A. After sorting, in an empty cell in the top

row, type:

=IF(A1=A2,1,0)

11 – Using Excel for Data Manipulation and Statistical Analysis

Now, select this cell, click on the small square in the bottom right corner, and drag it down to match the

length of column A. Any 1’s in your new column will indicate that the corresponding cell in column A

matches the cell below it. If the new column contains only 0’s, then there are no duplicates in column A.

Linking the AND and OR functions with IF allows you to evaluate sophisticated conditionals. AND

checks whether two logical statements are both true, while OR checks whether either is true. Building on

the prior example, suppose you want to check whether the value from column E matches the value from

column F, and at the same time, whether E1 equals 2.

Recall that

=IF(E1=F1, 1,0)

tells whether the values in A1 and G1 match. To check whether the value is 2, type in an empty cell:

=AND(IF(E1=F1,1,0), IF(E1=2,1,0))

This will display TRUE if both statements are true (i.e. if E1 equals F1, and E1=2), and FALSE

otherwise.

To check whether either statement is true, use OR:

=OR(IF(E1=F1,1,0), IF(E1=2,1,0))

This will display TRUE if either statement is true, and FALSE otherwise.

Note: it is important to have the 1’s and 0’s in the right order in your IF statements. Excel equates 1

with TRUE and 0 with FALSE. In an AND or OR statement, it does not directly check whether the

statements are true, only whether the IF statement returned a “1” or a “0”. When evaluating an AND

statement, it will check whether both IF statements returned “1”; when evaluating an OR statement, it will

check whether either IF statement returned “1”. If you type:

=OR(IF(E1=F1,0,1), IF(E1=2,0,1))

you will not obtain correct results.

For More Information and Assistance

Social Science Software Consulting

Software consultants are available during drop-in hours 3pm-5pm Monday through Thursday

throughout the academic quarter, and by appointment throughout the year. Please visit our

website for more information or to make an appointment:

https://www.stanford.edu/group/ssds/cgi-bin/drupal/

Note: This document is based on Excel 2007 for Windows

12 – Using Excel for Data Manipulation and Statistical Analysis

Copyright © 2004-2010, by the Board of Trustees of the Leland Stanford Junior University. Permission granted to copy for non-commercial purposes, provided we receive acknowledgment and a copy of the document in which our material appears. No right is granted to quote from or use any material in this document for purposes of promoting any product or service.

Social Science Data and Software, Document revised: 6/16/10