statistics

Faris69
12.StatisticalMethodsforSummarizingData.pptx

Statistics is both the science of uncertainty and the technology of extracting information from data.

A statistic is a summary measure of data.

Descriptive statistics are methods that describe and summarize data.

Microsoft Excel supports statistical analysis in two ways:

1. Statistical functions

2. Analysis Toolpak add-in

Statistical Methods for Summarizing Data

A frequency distribution is a table that shows the number of observations in each of several nonoverlapping groups.

Categorical variables naturally define the groups in a frequency distribution.

To construct a frequency distribution, we need only count the number of observations that appear in each category.

This can be done using the Excel COUNTIF function.

Frequency Distributions for Categorical Data

Example 3.16: Constructing a Frequency Distribution for Items in the Purchase Orders Database

List the item names in a column on the spreadsheet.

Use the function =COUNTIF($D$4:$D$97,cell_reference), where cell_reference is the cell containing the item name

Example 3.16: Constructing a Frequency Distribution for Items in the Purchase Orders Database

Construct a column chart to visualize the frequencies.

Relative frequency is the fraction, or proportion, of the total.

If a data set has n observations, the relative frequency of category i is:

We often multiply the relative frequencies by 100 to express them as percentages.

A relative frequency distribution is a tabular summary of the relative frequencies of all categories.

Relative Frequency Distributions

Example 3.17: Constructing a Relative Frequency Distribution for Items in the Purchase Orders Database

First, sum the frequencies to find the total number (note that the sum of the frequencies must be the same as the total number of observations, n).

Then divide the frequency of each category by this value.

For numerical data that consist of a small number of discrete values, we may construct a frequency distribution similar to the way we did for categorical data; that is, we simply use COUNTIF to count the frequencies of each discrete value.

Frequency Distributions for Numerical Data

In the Purchase Orders data, the A/P terms are all whole numbers 15, 25, 30, and 45.

Example 3.18: Frequency and Relative Frequency Distribution for A/P Terms

A graphical depiction of a frequency distribution for numerical data in the form of a column chart is called a histogram.

Frequency distributions and histograms can be created using the Analysis Toolpak in Excel.

Click the Data Analysis tools button in the Analysis group under the Data tab in the Excel menu bar and select Histogram from the list.

Excel Histogram Tool

Specify the Input Range corresponding to the data. If you include the column header, then also check the Labels box so Excel knows that the range contains a label. The Bin Range defines the groups (Excel calls these “bins”) used for the frequency distribution.

Histogram Dialog

If you do not specify a Bin Range, Excel will automatically determine bin values for the frequency distribution and histogram, which often results in a rather poor choice.

If you have discrete values, set up a column of these values in your spreadsheet for the bin range and specify this range in the Bin Range field.

Using Bin Ranges

We will create a frequency distribution and histogram for the A/P Terms variable in the Purchase Orders database.

We defined the bin range below the data in cells H99:H103 as follows:

Month

15

25

30

45

Example 3.19: Using the Histogram Tool

Histogram tool results:

Example 3.19: Using the Histogram Tool

For numerical data that have many different discrete values with little repetition or are continuous, a frequency distribution requires that we define by specifying

the number of groups,

the width of each group, and

the upper and lower limits of each group.

Choose between 5 to 15 groups, and the range of each should be equal.

Choose the lower limit of the first group (LL) as a whole number smaller than the minimum data value and the upper limit of the last group (UL) as a whole number larger than the maximum data value.

Histograms for Numerical Data

The data range from a minimum of $68.75 to a maximum of $127,500; set the lower limit of the first group to $0 and the upper limit of the last group to $130,000.

If we select 5 groups, using equation (3.2) the width of each group is ($130,000 - 0) / 5 = $26,000

Example 3.20: Constructing a Frequency Distribution and Histogram for Cost per Order

Ten-group histogram

Example 3.20: Constructing a Frequency Distribution and Histogram for Cost per Order

Set the cumulative relative frequency of the first group equal to its relative frequency. Then add the relative frequency of the next group to the cumulative relative frequency.

For, example, the cumulative relative frequency in cell D3 is computed as =D2+C3 = 0.000 + 0.447 = 0.447.

Example 3.21 Computing Cumulative Relative Frequencies

The kth percentile is a value at or below which at least k percent of the observations lie. The most common way to compute the kth percentile is to order the data values from smallest to largest and calculate the rank of the kth percentile using the formula:

Statistical software use different methods that often involve interpolating between ranks instead of rounding, thus producing different results.

The Excel function PERCENTILE.INC(array, k) computes the kth percentile of data in the range specified in the array field, where k is in the range 0 to 1, inclusive (i.e., including 0 and 1).

Percentiles

Compute the 90th percentile for Cost per order in the Purchase Orders data.

Rank of kth percentile = nk/100 + 0.5

n = 94; k = 90

For the 90th percentile, the rank is

= 94(90)/100+0.5 = 85.1 (round to 85)

Value of the 85th observation = $74,375

Using the Excel function PERCENTILE.INC(G4:G97,0.9), the 90th percentile is $73,737.50, which is different from using formula (3.3).

Examples 3.22 and 3.23: Computing Percentiles

Data >

Data Analysis >

Rank and Percentile

90.3rd percentile

= $74,375

(same result as

manually computing

the 90th percentile)

Example 3.24 Excel Rank and Percentile Tool

The Excel value of the 90th percentile that was computed in Example 3.23 as $74,375 is the 90.3rd percentile value.

Quartiles break the data into four parts.

The 25th percentile is called the first quartile,Q1;

the 50th percentile is called the second quartile, Q2;

the 75th percentile is called the third quartile, Q3; and

the 100th percentile is the fourth quartile, Q4.

One-fourth of the data fall below the first quartile, one-half are below the second quartile, and three-fourths are below the third quartile.

Excel function QUARTILE. INC(array, quart), where array specifies the range of the data and quart is a whole number between 1 and 4, designating the desired quartile.

Quartiles

Compute the Quartiles of the Cost per Order data

First quartile: =QUARTILE.INC(G4:G97,1) = $6,757.81

Second quartile: =QUARTILE.INC(G4:G97,2) = $15,656.25

Third quartile: =QUARTILE.INC(G4:G97,3) = $27,593.75

Fourth quartile: =QUARTILE.INC(G4:G97,4) = $127,500.00

Example 3.25 Computing Quartiles in Excel

A cross-tabulation is a tabular method that displays the number of observations in a data set for different subcategories of two categorical variables.

A cross-tabulation table is often called a contingency table.

The subcategories of the variables must be mutually exclusive and exhaustive, meaning that each observation can be classified into only one subcategory, and, taken together over all subcategories, they must constitute the complete data set.

Cross-Tabulations

Sales Transactions database

Count the number (and compute the percentage) of books and DVDs ordered by region.

Example 3.26: Constructing a Cross-Tabulation

Cross-Tabulation Visualization: Chart of Regional Sales by Product