Discussion 15 - 205

wtfbm69
Cengage_EBA_Chapter02.pptx

Chapter 2

Descriptive Statistics

1

Vb

Bm

Mbm

Overview of Using Data: Definitions and Goals

2

Vb

Bm

Mbm

Overview of Using Data: Definitions and Goals

Data: The facts and figures collected, analyzed, and summarized for presentation and interpretation.

Variable: A characteristic or a quantity of interest that can take on different values.

Observation: Set of values corresponding to a set of variables.

Variation: The difference in a variable measured over observations.

Random variable/uncertain variable: A quantity whose values are not known with certainty.

3

Vb

Bm

Mbm

When we collect data, we are gathering past observed values, or realizations of a variable.

By collecting these past realizations of one or more variables, our goal is to learn more about the variation of a particular business situation.

3

Table 2.1 - Data for Dow Jones Industrial Index Companies

4

Vb

Bm

Mbm

For the data in Table 2.1:

Variables: Symbol, Industry, Share Price, and Volume

Observation: Each row in Table 2.1

Variation: Time, customers, items, etc.

4

Types of Data

5

Vb

Bm

Mbm

Types of Data

Population: All elements of interest

Sample: Subset of the population

Random sampling - A sampling method to gather a representative sample of the population data.

Quantitative data: Data on which numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed.

Categorical data: Data on which arithmetic operations cannot be performed.

6

Vb

Bm

Mbm

Examples:

Population: With the thousands of publicly traded companies in the United States, tracking and analyzing all of these stocks every day would be too

time consuming and expensive.

Sample: The Dow represents a sample of 30 stocks of large public companies based in the United States, and it is often interpreted to represent

the larger population of all publicly traded companies.

Quantitative data: The values for Volume in the Dow data in Table 2.1 can be summed to calculate a total volume of all shares traded by companies included in the Dow.

Categorical data: The data in the Industry column in Table 2.1 are categorical - the number of companies in the Dow that are in the telecommunications industry can be counted.

6

Types of Data

Cross-sectional data: Data collected from several entities at the same, or approximately the same, point in time.

Time series data: Data collected over several time periods.

Graphs of time series data are frequently found in business and economic publications.

Help analysts understand what happened in the past, identify trends over time, and project future levels for the time series.

7

Vb

Bm

Mbm

Example:

Cross-sectional data: The data in Table 2.1 are cross-sectional because they describe the 30 companies that comprise the Dow at the same point in time (April 2013).

7

Figure 2.1 - Dow Jones Index Values Since 2002

8

Vb

Bm

Mbm

Example: (contd.)

Time series data: Figure 2.1 illustrates that the DJI was near 10,000 in 2002 and climbed to above 14,000 in 2007. However, the financial crisis in 2008 led to a significant decline in the DJI to between 6000 and 7000 by 2009. Since 2009, the DJI has been generally increasing and topped 14,000 in April 2013.

8

Types of Data

9

Sources of data

Experimental study - A variable of interest is first identified.

Then one or more other variables are identified and controlled or manipulated so that data can be obtained about how they influence the variable of interest.

Nonexperimental study or observational study - Make no attempt to control the variables of interest.

A survey is perhaps the most common type of observational study.

Vb

Bm

Mbm

Example:

Experimental study:

If a pharmaceutical firm is interested in conducting an experiment to learn about how a new drug affects blood pressure, then blood pressure is the variable of interest in the study.

The dosage level of the new drug is another variable that is hoped to have a causal effect on blood pressure.

To obtain data about the effect of the new drug, researchers select a sample of individuals.

The dosage level of the new drug is controlled as different groups of individuals are given different dosage levels.

Before and after the study, data on blood pressure are collected for each group.

Statistical analysis of these experimental data can help determine how the new drug affects blood pressure.

9

Figure 2.2 - Customer Opinion Questionnaire used by Chops City Grill Restaurant

10

Vb

Bm

Mbm

Example: (contd.)

Nonexperimental study:

Figure 2.2 shows a customer opinion questionnaire used by Chops City Grill in Naples, Florida.

Note that the customers who fill out the questionnaire are asked to provide ratings for 12 variables, including overall experience, the greeting by hostess, the table visit by the manager, overall service, and so on.

The response categories of excellent, good, average, fair, and poor provide categorical data that enable Chops City Grill management to maintain high standards for the restaurant’s food and service.

In some cases, the data needed for a particular application already exist from an experimental or observational study already conducted.

Companies maintain a variety of databases about their employees, customers, and business operations.

10

Modifying Data in Excel

11

Vb

Bm

Mbm

Table 2.2 - Top 20 Selling Automobiles in United States in March 2011

12

Vb

Bm

Mbm

12

Figure 2.3 - Top 20 Selling Automobiles Data entered into Excel with Percent Change in Sales from 2010

13

Vb

Bm

Mbm

Figure 2.3 shows the data from Table 2.2 entered into an Excel spreadsheet, and the percent change in sales for each model from March 2010 to March 2011 has been calculated.

This is done by entering the formula = (D2-E2)/E2 in cell F2 and then copying the contents of this cell to cells F3 to F20.

13

Modifying Data in Excel

Sorting and filtering data in excel

Illustration - To sort the automobiles by March 2010 sales

Step 1: Select cells A1:F21

Step 2: Click the DATA tab in the Ribbon

Step 3: Click Sort in the Sort & Filter group

Step 4: Select the check box for My data has headers

Step 5: In the first Sort by dropdown menu, select Sales (March 2010)

Step 6: In the Order dropdown menu, select Largest to Smallest

Step 7: Click OK

14

Vb

Bm

Mbm

Figure 2.4 - Using Excel’s Sort Function to Sort the Top Selling Automobiles Data

15

Vb

Bm

Mbm

15

Figure 2.5 - Top Selling Automobiles Data Sorted by Sales in March 2010 Sales

16

Vb

Bm

Mbm

The result of using Excel’s Sort function for the March 2010 data is shown in Figure 2.5.

Although the Honda Accord was the best-selling automobile in March 2011, both the Toyota Camry and the Toyota Corolla/Matrix outsold the Honda Accord in March 2010.

Note that while Sales (March 2010), which is in column E, is sorted, the data in all other columns are adjusted accordingly.

16

Modifying Data in Excel

Sorting and filtering data in excel

Illustration - Using Excel’s Filter function to see the sales of models made by Toyota.

Step 1: Select cells A1:F21

Step 2: Click the DATA tab in the Ribbon

Step 3: Click Filter in the Sort & Filter group

Step 4: Click on the Filter Arrow in column B, next to Manufacturer

Step 5: Select only the check box for Toyota. You can easily deselect all choices by unchecking (Select All)

17

Vb

Bm

Mbm

Figure 2.6 - Top Selling Automobiles Data Filtered to Show Only Automobiles Manufactured by Toyota

18

Vb

Bm

Mbm

The result (Figure 2.6) is a display of only the data for models made by Toyota.

Of the 20 top-selling models in March 2011, Toyota made three of them.

Further filter the data by choosing the down arrows in the other columns.

All data can be made visible again by clicking on the down arrow in column B and checking (Select All) or by clicking Filter in the Sort & Filter Group again from the DATA tab.

18

Modifying Data in Excel

Conditional Formatting of Data in Excel: Makes it easy to identify data that satisfy certain conditions in a data set.

Illustration - To identify the automobile models in Table 2.2 for which sales had decreased from March 2010 to March 2011.

Step 1: Starting with the original data shown in Figure 2.3, select cells F1:F21

Step 2: Click on the HOME tab in the Ribbon

19

Vb

Bm

Mbm

19

Modifying Data in Excel

Illustration (contd.)

Step 3: Click Conditional Formatting in the Styles group

Step 4: Select Highlight Cells Rules, and click Less Than from the dropdown menu

Step 5: Enter 0% in the Format cells that are LESS THAN: box Step 6: Click OK

20

Vb

Bm

Mbm

20

Figure 2.7 - Using Conditional Formatting in Excel to Highlight Automobiles with Declining Sales from March 2010

21

Vb

Bm

Mbm

Here, the models with decreasing sales (Toyota Camry, Ford Focus, Chevrolet Malibu, and Nissan Versa) are now clearly visible.

21

Figure 2.8 - Using Conditional Formatting in Excel to Generate Data Bars for the Top Selling Automobiles Data

22

Vb

Bm

Mbm

We can choose Data Bars from the Conditional Formatting dropdown menu in the Styles Group of the HOME tab in the Ribbon.

Data bars are essentially a bar chart input into the cells that show the magnitude of the cell values.

The width of the bars in this display are comparable to the values of the variable for which the bars have been drawn;

a value of 20 creates a bar twice as wide as that for a value of 10.

Negative values are shown to the left side of the axis; positive values are shown to the right.

Cells with negative values are shaded in a color different from that of cells with positive values.

22

Creating Distributions from Data

23

Vb

Bm

Mbm

23

Creating Distributions from Data

Frequency distributions for categorical data

Frequency distribution: A summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins, when dealing with distributions.

24

Vb

Bm

Mbm

Table 2.3 - Data from a Sample of 50 Soft Drink Purchases

25

Vb

Bm

Mbm

The data in Table 2.3 is taken from a sample of 50 soft drink purchases.

Each purchase is for one of five popular soft drinks, which define the five bins: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and Sprite.

25

Table 2.4 - Frequency Distribution of Soft Drink Purchases

26

The frequency distribution summarizes information about the popularity of the five soft drinks:

Coca-Cola is the leader, Pepsi is second, Diet Coke is third, and Sprite and Dr. Pepper are tied for fourth.

Vb

Bm

Mbm

The frequency distribution of soft drink purchases (table 2.4) is obtained by counting the number of times each soft drink appears in Table 2.3.

Coca-Cola appears 19 times, Diet Coke appears 8 times, Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5 times.

This frequency distribution provides a summary of how the 50 soft drink purchases are distributed across the five soft drinks.

26

Figure 2.9 - Creating a Frequency Distribution for Soft Drinks Data in Excel

27

Vb

Bm

Mbm

Figure 2.9 shows the sample of 50 soft drink purchases in an Excel spreadsheet.

Column D contains the five different soft drink categories as the bins.

In cell E2, enter the formula =COUNTIF($A$2:$B$26, D2), where A2:B26 is the range for the sample data, and D2 is the bin (Coca-Cola) to match.

The COUNTIF function in Excel counts the number of times a certain value appears in the indicated range.

In this case, we want to count the number of times Coca-Cola appears in the sample data. The result is a value of 19 in cell E2, indicating that Coca-Cola appears 19 times in the sample data.

The formula from cell E2 to cells E3 to E6 can be copied to get frequency counts for Diet Coke, Pepsi, Dr. Pepper, and Sprite. By using the absolute reference $A$2:$B$26 in the formula.

27

Creating Distributions from Data

Relative frequency and percent frequency distributions

Relative frequency distribution: It is a tabular summary of data showing the relative frequency for each bin.

Percent frequency distribution: Summarizes the percent frequency of the data for each bin.

Used to provide estimates of the relative likelihoods of different values of a random variable.

28

Vb

Bm

Mbm

28

Table 2.5 - Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases

29

Vb

Bm

Mbm

Table 2.4 shows that the relative frequency for Coca-Cola is 19/50 = 0.38, the relative frequency for Diet Coke is 8/50 = 0.16, and so on.

From the percent frequency distribution, it is seen that 38 percent of the purchases were Coca-Cola, 16 percent of the purchases were Diet Coke, and so on.

Note that 38 percent + 26 percent + 16 percent = 80 percent of the purchases were the top three soft drinks.

29

Creating Distributions from Data

Frequency distributions for quantitative data

Three steps necessary to define the classes for a frequency distribution with quantitative data:

1. Determine the number of nonoverlapping bins.

2. Determine the width of each bin.

3. Determine the bin limits.

30

Vb

Bm

Mbm

Number of bins:

Bins are formed by specifying the ranges used to group the data.

Generally, use between 5 and 20 bins.

Small number of data items - five or six bins.

Larger number of data items - more bins are required.

The goal is to use enough bins to show the variation in the data, but not so many classes that some contain only a few data items.

Width of the bins:

It should be the same for each bin.

Thus the choices of the number of bins and the width of bins are not independent decisions.

A larger number of bins means a smaller bin width and vice versa.

Approximate bin width =

Bin limits:

Bin limits must be chosen so that each data item belongs to one and only one class.

The lower bin limit identifies the smallest possible data value assigned to the bin.

The upper bin limit identifies the largest possible data value assigned to the class.

30

Creating Distributions from Data

31

Table 2.6 - Year-End Audit Times (Days)

Table 2.7 - Frequency, Relative Frequency, and Percent Frequency

Distributions for the Audit Time Data

Vb

Bm

Mbm

Number of bins:

The number of data items in Table 2.6 is relatively small (n = 20). Hence, we choose to develop a frequency distribution with five bins.

Width of bins:

The largest data value is 33, and the smallest data value is 12.

Because we decided to summarize the data with five classes, using the expression “Approximate bin width = ”, provides an approximate bin width of (33 – 12)/5 = 4.2.

We therefore decided to round up and use a bin width of five days in the frequency distribution.

Bin limits:

We selected 10 days as the lower bin limit and 14 days as the upper bin limit for the first class. This bin is denoted 10–14 in Table 2.7.

The smallest data value, 12, is included in the 10–14 bin. We then selected 15 days as the lower bin limit and 19 days as the upper bin limit of the next class.

We continued defining the lower and upper bin limits to obtain a total of five classes: 10–14, 15–19, 20–24, 25–29, and 30–34.

The difference between the upper bin limits of adjacent bins is the bin width. Using the first two upper bin limits of 14 and 19, we see that the bin width is 19 – 14 = 5.

With the number of bins, bin width, and bin limits determined, a frequency distribution can be obtained by counting the number of data values belonging to each bin.

Using the frequency distribution in Table 2.7, we can observe that:

The most frequently occurring audit times are in the bin of 15–19 days.

Eight of the 20 audit times are in this bin.

Only one audit required 30 or more days.

31

Figure 2.10 - Using Excel to Generate a Frequency Distribution for Audit Times Data

32

Vb

Bm

Mbm

Figure 2.10 shows the data from Table 2.6 entered into an Excel Worksheet.

The sample of 20 audit times is contained in cells A2:D6.

The upper limits of the defined bins are in cells A10:A14.

We can use the FREQUENCY function in Excel to count the number of observations in each bin:

Step 1. Select cells B10:B14

Step 2. Enter the formula =FREQUENCY(A2:D6, A10:A14). The range A2:D6 defines the data set, and the range A10:A14 defines the bins

Step 3. Press CTRL+SHIFT+ENTER

Excel will then fill in the values for the number of observations in each bin in cells B10 through B14 because these were the cells selected in Step 1 above.

32

Creating Distributions from Data

Histogram: A common graphical presentation of quantitative data

Constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis.

The frequency measure of each class is shown by drawing a rectangle whose base is determined by the class limits on the horizontal axis and whose height is the corresponding frequency measure.

33

Vb

Bm

Mbm

33

Figure 2.11 - Histogram for the Audit Time Data

34

Vb

Bm

Mbm

In figure 2.11, note that the class with the greatest frequency is shown by the rectangle appearing above the class of 15–19 days.

The height of the rectangle shows that the frequency of this class is 8.

34

Figure 2.12 - Creating a Histogram for the Audit Time Data using Data Analysis Toolpak in Excel

35

Vb

Bm

Mbm

Histograms can be created in Excel using the Data Analysis ToolPak. Following are the steps to create histogram in Excel.

Step 1. Click the DATA tab in the Ribbon

Step 2. Click Data Analysis in the Analysis group

Step 3. When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK

In the Input Range: box, enter A2:D6

In the Bin Range: box, enter A10:A14

Under Output Options:, select New Worksheet Ply:

Select the check box for Chart Output

Click OK

35

Figure 2.13 - Completed Histogram for the Audit Time Data using Data Analysis ToolPak in Excel

36

Vb

Bm

Mbm

In figure 2.13, we have modified the bin ranges in column A by typing the values shown in Figure 2.13 into cells A2:A6 so that the chart created by Excel shows both the lower and upper limits for each bin.

We have also removed the gaps between the columns in the histogram in Excel to match the traditional format of histograms.

To remove the gaps between the columns in the Histogram created by Excel, follow these steps:

Step 1. Right-click on one of the columns in the histogram

Select Format Data Series…

Step 2. When the Format Data Series pane opens, click the Series Options button.

Set the Gap Width to 0%

36

Creating Distributions from Data

Histogram provides information about the shape, or form, of a distribution.

Skewness: Lack of symmetry

Important characteristic of the shape of a distribution

37

Vb

Bm

Mbm

37

Figure 2.14 - Histograms Showing Distributions with Different Levels of Skewness

38

Vb

Bm

Mbm

Panel A: Moderately skewed to the left

Here, tail extends farther to the left than to the right.

Example: Exam scores, with no scores above 100 percent, most of the scores above 70 percent, and only a few really low scores.

Panel B: Moderately skewed to the right

Tail extends farther to the right than to the left.

Example: Housing prices; a few expensive houses create the skewness in the right tail.

Panel C: Symmetric

The left tail mirrors the shape of the right tail.

Example: Data for SAT scores, the heights and weights of people, and so on lead to histograms that are roughly symmetric.

Panel D: Highly skewed to the right

Example: Data on housing prices, salaries, purchase amounts, and so on often result in histograms skewed to the right.

38

Creating Distributions from Data

Cumulative Distributions

Cumulative frequency distribution: A variation of the frequency distribution that provides another tabular summary of quantitative data.

Uses the number of classes, class widths, and class limits developed for the frequency distribution.

Shows the number of data items with values less than or equal to the upper class limit of each class.

39

Vb

Bm

Mbm

39

Table 2.8 - Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data

40

Vb

Bm

Mbm

Consider the class with the description “Less than or equal to 24.”

The cumulative frequency for this class is simply the sum of the frequencies for all classes with data values less than or equal to 24.

The sum of the frequencies for classes 10–14, 15–19, and 20–24 indicates that 4 + 8 + 5 = 17 data values are less than or equal to 24. Hence, the

cumulative frequency for this class is 17.

In addition, the cumulative frequency distribution in Table 2.8 shows that four audits were completed in 14 days or less and that 19 audits were completed in 29 days or less.

The cumulative relative frequency distribution can be computed either by summing the relative frequencies in the relative frequency distribution or by dividing the cumulative frequencies by the total number of items.

Using the latter approach, we found the cumulative relative frequencies in column 3 of Table 2.8 by dividing the cumulative frequencies in column 2 by the total number of items (n = 20).

The cumulative percent frequencies were again computed by multiplying the relative frequencies by 100.

The cumulative relative and percent frequency distributions show that 0.85 of the audits, or 85 percent, were completed in 24 days or less, 0.95 of the audits, or 95 percent, were completed in 29 days or less, and so on.

40

Measures of Location

41

Vb

Bm

Mbm

41

Measures of Location

Mean/Arithmetic mean

Average value for a variable.

The mean is denoted by .

n = sample size

= value of variable x for the first observation

= value of variable x for the second observation

= value of variable x for the nth observation

42

Sample mean, = =

Vb

Bm

Mbm

Arithmetic Mean:

Provides a measure of central location for the data.

Denoted by for sample data.

Denoted by µ for population data.

42

Table 2.9 - Data on Home Sales in Cincinnati, Ohio, Suburb

Illustration: Computation of the mean home selling price for the sample of 12 home sales:

43

Vb

Bm

Mbm

43

Computation of Sample Mean

Illustration: Computation of the mean home selling price for the sample of 12 home sales:

= =

=

=

= 219,937.50

44

Vb

Bm

Mbm

44

Measures of Location

Median: Value in the middle when the data are arranged in ascending order.

Middle value, for an odd number of observations

Average of two middle values, for an even number of observations

45

Vb

Bm

Mbm

45

Computation of Sample Median

Illustration - When the number of observations are odd

Consider the class size data for a sample of five college classes:

46 54 42 46 32

Arrange the class size data in ascending order .

32 42 46 46 54

Middlemost value in the data set = 46.

Median is 46.

46

Vb

Bm

Mbm

Because n = 5 is odd, median is the middlemost value in the data set, 46.

46

Computation of Sample Median

Illustration - When the number of observations are even

Consider the data on home sales in Cincinnati, Ohio, Suburb:

47

Vb

Bm

Mbm

47

Computation of Sample Median

Illustration (contd.) - When the number of observations are even

Arrange the data in ascending order:

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

Median = average of two middle values = = 203,750

48

Middle Two Values

Vb

Bm

Mbm

Because n = 12 is even, the median is the average of the middle two values: 199,500 and 208,000.

48

Measures of Location

Mode: Value that occurs most frequently in a data set.

Consider the class size data:

32 42 46 46 54

Observe - 46 is the only value that occurs more than once.

Mode is 46.

Multimodal data - Data contain at least two modes.

Bimodal data - Data contain exactly two modes.

49

Vb

Bm

Mbm

49

Figure 2.15 - Calculating the Mean, Median, and Modes for the Home Sales Data using Excel

50

Vb

Bm

Mbm

The mean can be found in Excel using the AVERAGE function. The value for the mean in cell E2 is calculated using the formula =AVERAGE(B2:B13).

The median of a data set can be found in Excel using the function MEDIAN. The value for the median in cell E3 is found using the formula =MEDIAN(B2:B13).

The Excel MODE.SNGL function will return only a single most-often-occurring value.

For multimodal distributions, we must use the MODE.MULT command in Excel to return more than one mode.

To find both of the modes in Excel, we take these steps:

Step 1. Select cells E4 and E5

Step 2. Enter the formula =MODE.MULT(B2:B13)

Step 3. Press CTRL+SHIFT+ENTER

Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and $254,000.

50

Measures of Location

Geometric mean: nth root of the product of n values

Used in analyzing growth rates in financial data.

Sample geometric mean:

= [

51

Vb

Bm

Mbm

51

Table 2.10 - Percentage Annual Returns and Growth Factors for the Mutual Fund Data

Illustration - Consider the percentage annual returns and growth factors for the mutual fund data over the past 10 years.

We will determine the mean rate of growth for the fund over the 10-year period.

52

Vb

Bm

Mbm

Computation of Geometric Mean

Solution:

Product of the growth factors:

(.779)1.287)(1.109)(1.049)(1.158)(1.055)(.630)(1.265)(1.151)(1.021)

= 1.335

Geometric mean of the growth factors:

= = 1.029

Conclude that annual returns grew at an average annual rate of (1.029 – 1)100% or 2.9%.

53

Vb

Bm

Mbm

53

Figure 2.16 - Calculating the Geometric Mean for the Mutual Fund Data Using Excel

54

Vb

Bm

Mbm

In Figure 2.16, the value for the geometric mean in cell C13 is found using the formula =GEOMEAN(C2:C11).

54

Measures of Variability

55

Vb

Bm

Mbm

Measures of Variability

Range: Found by subtracting the smallest value from the largest value in a data set.

Illustration: Consider the data on home sales in Cincinnati, Ohio, Suburb:

56

Vb

Bm

Mbm

56

Computation of Range

Illustration (contd.):

Largest home sales price - $456,250

Smallest home sales price - $108,000

Range = Largest value – Smallest value

= $456,250 – $108,000

= $348,250

Drawback: Range is based on only two of the observations and thus is highly influenced by extreme values.

57

Vb

Bm

Mbm

57

Measures of Variability

Variance: Measure of variability that utilizes all the data.

It is based on the deviation about the mean, which is the difference between the value of each observation (xi) and the mean.

The deviations about the mean are squared while computing the variance.

Sample variance, =

Population variance , =

58

Vb

Bm

Mbm

58

Table 2.12 - Computation of Deviations and Squared Deviations about the Mean for the Class Size Data

59

= = =

Computation of Sample Variance:

Vb

Bm

Mbm

59

Measures of Variability

Standard deviation: Positive square root of the variance

Measured in the same units as the original data.

For sample , s =

For population, σ =

Coefficient of variation:

Measures the standard deviation relative to the mean.

Expressed as a percentage.

60

Vb

Bm

Mbm

60

Computation of Coefficient of Variation

Illustration:

Consider the class size data:

46 54 42 46 32

Mean, = 44

Standard deviation, s = 8

Coefficient of variation = % = 18.2%

61

Vb

Bm

Mbm

The coefficient of variation tells that the sample standard deviation is 18.2 percent of the value of the sample mean.

61

Figure 2.18 - Calculating Variability Measures for the Home Sales Data in Excel

62

Vb

Bm

Mbm

Calculating the variability measures in Excel:

Range:

The range can be calculated in Excel using the MAX and MIN functions.

The range value in cell E7 of Figure 2.18 calculates the range using the formula =MAX(B2:B13) - MIN(B2:B13).

This subtracts the smallest value in the range B2:B13 from the largest value in the range B2:B13.

Variance:

The variance in cell E8 is calculated using the formula =VAR.S(B2:B13).

Excel calculates the variance of the sample of 12 home sales to be 9,037,501,420.

Standard deviation:

The sample standard deviation in cell E9 is calculated using the formula =STDEV.S(B2:B13).

Excel calculated the sample standard deviation for the home sales to be $95,065.77.

Coefficient of Variation:

It is calculated in cell E11 using the formula =E9/E2, which divides the standard deviation by the mean. The coefficient of variation for the home sales data is 43.22 percent.

62

Analyzing

Distributions

63

Vb

Bm

Mbm

Analyzing Distributions

Percentile: Value of a variable at which a specified (approximate) percentage of observations are below that value.

The pth percentile tells us the point in the data where:

Approximately p percent of the observations have values less than the pth percentile;

Approximately (100 – p) percent of the observations have values greater than the pth percentile.

64

Vb

Bm

Mbm

64

Analyzing Distributions

Steps to calculate the pth percentile:

Arrange the data in ascending order (smallest to largest value).

Compute k = (n + 1) × p.

Divide k into its integer component, i, and its decimal component, d.

If d = 0, find the kth largest value in the data set. This is the pth percentile.

(contd.)

65

Vb

Bm

Mbm

65

Analyzing Distributions

b. If d > 0, the percentile is between the values in positions i and i + 1 in the sorted data. To find this percentile, we must interpolate between these two values.

Calculate the difference between the values in positions i and i + 1 in the sorted data set. We define this difference between the two values as m.

Multiply this difference by d: t = m × d.

To find the pth percentile, add t to the value in position i of the sorted data.

66

Vb

Bm

Mbm

66

Analyzing Distributions

Illustration: To determine the 85th percentile for the home sales data in Table 2.9.

1. Arrange the data in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

2. Compute k = (n + 1) × p = (12 + 1) × 0.85 = 11.05.

3. Dividing 11.05 into the integer and decimal components gives us i = 11 and d = 0.05.

d > 0, interpolate between the values in the 11th and 12th positions in the sorted data.

67

Vb

Bm

Mbm

67

Analyzing Distributions

Illustration (contd.): To determine the 85th percentile for the home sales data in Table 2.9.

The value in the 11th position is 298,000, and

The value in the 12th position is 456,250.

i. m = 456,250 – 298,000 = 158,250

ii. t = m × d = 158,250 × 0.05 = 7912.5

iii. pth percentile = 298,000 + 7912.5 = 305,912.5

$305,912.50 represents the 85th percentile of the home sales data.

68

Vb

Bm

Mbm

68

Analyzing Distributions

Quartiles:

When the data is divided into four equal parts:

Each part contains approximately 25% of the observations.

Division points are referred to as quartiles.

= first quartile, or 25th percentile

= second quartile, or 50th percentile (also the median)

= third quartile, or 75th percentile

69

Vb

Bm

Mbm

69

Analyzing Distributions

z-score:

Measures the relative location of a value in the data set.

Helps to determine how far a particular value is from the mean relative to the data set’s standard deviation.

Standardized value

If , , . . . , is a sample of n observations

=

= z-score for

= sample mean

s = sample standard deviation

70

Vb

Bm

Mbm

The z -score can be interpreted as the number of standard deviations is from the mean .

70

Table 2.13 - z-Scores for the Class Size Data

For class size data, = 44 and s = 8.

For observations with a value > mean, z-score > 0.

For observations with a value < mean, z-score < 0.

71

Vb

Bm

Mbm

= .25 would indicate that is .25 standard deviations greater than the sample mean.

71

Figure 2.19 - Calculating z-Scores for the Home Sales Data in Excel

72

Vb

Bm

Mbm

To calculate the z-scores, the mean and standard deviation for the data set must be provided in the arguments of the STANDARDIZE function.

For instance, the z-score in cell C2 is calculated with the formula =STANDARDIZE(B2, $B$15, $B$16), where cell B15 contains the mean of the home sales data and cell B16 contains the standard deviation of the home sales data.

Then, this formula can be copy and pasted into cells C3:C13.

72

Analyzing Distributions

Empirical rule:

For data having a bell-shaped distribution:

Within 1 standard deviation – approximately 68% of the data values.

Within 2 standard deviations – approximately 95% of the data values.

Within 3 standard deviations – almost all the data values.

Identifying outliers:

Outliers: Extreme values in a data set.

It can be identified using standardized values (z-scores).

Any data value with a z-score less than –3 or greater than +3 is an outlier.

73

Vb

Bm

Mbm

Empirical rule:

The height of adult males in the United States has a bell-shaped distribution with a mean of approximately 69.5 inches and standard deviation of approximately 3 inches.

Using the empirical rule, the following conclusions can be drawn.

Approximately 68 percent of adult males in the United States have heights between 69.5 – 3 = 66.5 and 69.5 + 3 = 72.5 inches.

Approximately 95 percent of adult males in the United States have heights between 63.5 and 75.5 inches.

Almost all adult males in the United States have heights between 60.5 and 78.5 inches.

Outliers:

An outlier may be a data value that has been incorrectly recorded; if so, it can be corrected before further analysis.

It may also be from an observation that doesn’t belong to the population we are studying and was incorrectly included in the data set; if so, it can be removed.

It may be an unusual data value that has been recorded correctly and is a member of the population we are studying. In such cases, the observation should remain.

73

Analyzing Distributions

Box plot: Graphical summary of the distribution of data.

Developed from the quartiles for a data set.

74

Figure 2.21 - Box Plot for the Home Sales Data

Vb

Bm

Mbm

Steps used to construct the box plot for the home sales data:

A box is drawn with the ends of the box located at the first and third quartiles. For the home sales data, Q1 = 139,000 and Q3 = 256,625. This box contains the middle 50 percent of the data.

A vertical line is drawn in the box at the location of the median (203,750 for the home sales data).

By using the interquartile range, IQR = Q3 – Q1, limits are located. The limits for the box plot are 1.5(IQR) below Q1 and 1.5(IQR) above Q3. For the home sales data, IQR = Q3 – Q1 = 256,625 – 139,000 = 117,625. Thus, the limits are 139,000 – 1.5(117,625) = –37,437.5 and 256,625 + 1.5(117,625) = 433,062.5. Data outside these limits are considered outliers.

The dashed lines in Figure 2.21 are called whiskers. The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits computed in step 3. Thus, the whiskers end at home sales values of 108,000 and 298,000.

Finally, the location of each outlier is shown with an asterisk (*). In Figure 2.21, we see one outlier, 456,250.

74

Figure 2.22 - Box Plots Comparing Home Sale Prices in Different Communities

75

Vb

Bm

Mbm

To compare home sales from several different communities, we have created box plots for recent home sales in each community.

The most expensive houses appear to be in Shadyside and the cheapest houses in Hamilton.

The median home selling price in Groton is about the same as the median home selling price in Irving. However, home sales prices in Irving have much greater variability.

Homes appear to be selling in Irving for many different prices, from very low to very high.

Home selling prices have the least variation in Groton and Hamilton.

Unusually expensive home sales (relative to the respective distribution of home sales vales) have occurred in Fairview, Groton, and Irving, which appear as outliers.

Groton is the only location with a low outlier, but note that most homes sell for very similar prices in Groton, so the selling price does not have to be too far from the median to be considered an outlier.

75

Measures of Association Between Two Variables

76

Vb

Bm

Mbm

Measures of Association Between Two Variables

Scatter Charts: Useful graph for analyzing the relationship between two variables.

Covariance: Descriptive measure of the linear association between two variables.

Sample covariance for a sample of size n with the observations

(, ), (, ), and so on:

=

Population covariance, =

77

Vb

Bm

Mbm

Measures of Association Between Two Variables

Correlation coefficient: Measures the relationship between two variables.

Not affected by the units of measurement for x and y.

Sample correlation coefficient denoted by .

=

= sample covariance =

= sample standard deviation of x =

= sample standard deviation of y =

78

Vb

Bm

Mbm

Population correlation coefficient is denoted by .

=

= population covariance = .

= population standard deviation of x = .

= population standard deviation of y = .

78

Interpretation of Correlation Coefficient

–1 ≤ r ≤ +1

r value Relationship between the x and y variables
< 0 Negative linear
Near 0 No linear relationship
> 0 Positive linear

79

Vb

Bm

Mbm

The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope).

The closer the correlation coefficient is to –1, the closer the x and y values are to forming a straight line with negative slope.

79

Table 2.14 - Data for Bottled Water Sales at Queensland Amusement Park for a Sample of 14 Summer Days

80

Vb

Bm

Mbm

Figure 2.23 - Chart Showing the Positive Linear Relation Between Sales and High Temperatures

81

Vb

Bm

Mbm

The scatter chart in the figure suggests that higher daily high temperatures are associated with higher bottled water sales.

This is an example of a positive relationship because when one variable (high temperature) increases, the other variable (sales of bottled water) generally also increases.

The scatter chart also suggests that a straight line could be used as an approximation for the relationship between high temperature and sales of bottled water.

81

Table 2.15 - Sample Covariance Calculations for Daily High Temperature and Bottled Water Sales at Queensland Amusement Park

82

Vb

Bm

Mbm

Because the covariance is greater than 0, it indicates a positive relationship between the high temperature and sales of bottled water.

This verifies the relationship in the scatter chart in Figure 2.23 that as the high temperature for a day increases, sales of bottled water generally increase.

If the covariance is near 0, then the x and y variables are not linearly related.

If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases.

82

Figure 2.25 - Scatter Diagrams and Associated Covariance Values for Different Variable Relationships

83

(a)

Positive:

(x and y are positively

linearly related)

(b)

Approximately 0:

(x and y are not

linearly related)

(c)

Negative:

(x and y are negatively

linearly related)

Vb

Bm

Mbm

83

Computation of Correlation Coefficient

Illustration - To determine the sample correlation coefficient for bottled water sales at Queensland Amusement Park:

= = = 0.93

There is a very strong linear relationship between high temperature and sales.

84

Vb

Bm

Mbm

We have already obtained the value of as 12.8.

Using data in Table 2.14, we can compute sample standard deviations for x and y.

= 4.36

= 3.15

84

Figure 2.26 - Example of Nonlinear Relationship Producing a Correlation Coefficient Near Zero

85

Vb

Bm

Mbm

The scatter diagram in Figure 2.26 shows the relationship between the amount spent by a small retail store for environmental control (heating and cooling) and the daily high outside temperature over 100 days.

The sample correlation coefficient for these data is rxy = –0.007 and indicates that there is no linear relationship between the two variables.

However, Figure 2.26 provides strong visual evidence of a nonlinear relationship.

That is, we can see that as the daily high outside temperature increases, the money spent on environmental control first decreases as less heating is required and then increases as greater cooling is required.

85

Figure 2.24 - Calculating Covariance and Correlation Coefficient for Bottled Water Sales Using Excel

86

Vb

Bm

Mbm

Computation of covariance and correlation coefficient in Excel:

Figure 2.24 shows the data from Table 2.14 entered into an Excel Worksheet.

The covariance is calculated in cell B17 using the formula =COVARIANCE.S(A2:A15, B2:B15). The A2:A15 defines the range for the x variable (high temperature), and B2:B15 defines the range for the y variable (sales of bottled water).

The correlation coefficient in Figure 2.24 is computed in cell B18 for the sales of bottled water using the formula =CORREL(A2:A15, B2:B15) where A2:A15 defines the range for the x variable and B2:B15 defines the range for the y variable.

86