excel

profilemaladam
EvansStatsReview21.pptx

Business Statistics Review

1

Data, Elements, Variables, Observations, Types of Data

Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.

Elements (data objects) are the entities on which data are collected.

A variable (attribute) is a characteristic of interest for the elements.

The set of measurements obtained for a particular

element is called an observation.

Data can be classified as being categorical or quantitative.

Quantitative data indicate how many or how much:

- discrete, if measuring how many

- continuous, if measuring how much

2

Categorical data:

- Nominal data – described by some categories according to specified characteristics

- Ordinal data – nominal but with naturally ordered categories

Quantitative data:

- Interval data – has a meaningless zero

- Ratio data – has a meaningful zero

Note: Most examined quantitative data will have a ratio scale of measurement

Scales of Measurements

3

Example 1.3: Classifying Data

Ordinal

Nominal

Nominal

Ratio

Nominal

Ratio

Ratio

Ratio

Interval

Interval

4

Population - all items of interest for a particular decision or investigation

- all subscribers to Netflix

Sample - a subset of the population

- a list of individuals who rented a comedy from

Netflix in the past year

The purpose of sampling is to obtain sufficient information to draw a valid inference about a population.

Populations and Samples

5

Population mean: µ (typically unknown)

Sample mean:

Excel function: =AVERAGE(data range)

Property of the mean:

Outliers can affect the value of the mean.

Measures of Location: Arithmetic Mean

6

Purchase Orders database

Using formula:

Mean = $2,471,760/94

= $26,295.32

Excel Function:

=AVERAGE(B2:B95)

Example 4.1: Computing Mean Cost per Order

7

The median specifies the middle value when the data are arranged from least to greatest.

Half the data are below the median, and half the data are above it.

For an odd number of observations, the median is the middle of the sorted numbers.

For an even number of observations, the median is the mean of the two middle numbers.

We could use the Sort option in Excel to rank-order the data and then determine the median. The Excel function =MEDIAN(data range) could also be used.

The median is meaningful for ratio, interval, and ordinal data.

Not affected by outliers.

Measures of Location: Median

8

Sort the data from smallest to largest. Since we have 94 observations, the median is the average of the 47th and 48th observation.

Example 4.2: Finding the Median Cost per Order

Median =

($15,562.50 + $15,750.00)/2 = $15,656.25

= MEDIAN(B2:B95)

9

The variance is the “average” of the squared deviations from the mean.

Population variance: (typically unknown)

Sample variance:

In Excel: =VAR.S(data range)

Measures of Dispersion: Variance

10

Purchase Orders Cost per order data

Example 4.8 Computing the Variance

11

The standard deviation is the positive square root of the variance.

Note that the dimension of the variance is the square of the dimension of the observations, whereas the dimension of the standard deviation is the same as the data. This makes the standard deviation more practical to use in applications.

Population standard deviation: σ (typically unknown)

Sample standard deviation:

In Excel: =STDEV.S(data range)

Measures of Dispersion: Standard Deviation

12

Purchase Orders Cost per order data

Using the results of Example 4.8, take the square root of the variance:

Alternatively, use the STDEV.S function for the data range.

Example 4.9 Computing the Standard Deviation

14

A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement.

The z-score for the ith observation in a data set is calculated as follows:

Excel function: =STANDARDIZE(x, mean, standard_dev).

Standardized Values

14

Empirical Rule

About 68% of the data values are within 1 standard

deviation of its mean. (About 68% of the z-scores are

between -1 and 1.)

About 95% of the data values are within 2 standard

deviations of its mean. (About 68% of the z-scores are

between -2 and 2.)

Almost all of the data values are within 3 standard

deviations of its mean. (Almost all z-scores are

between -3 and 3.)

For data having a bell-shaped distribution:

15

An observation xi is an outlier if its z-score

is greater than +3 or less than -3

Distribution Shape: Skewness

An important measure of the shape of a distribution is called skewness.

The formula for the skewness of sample data is

Skewness can be computed using Excel’s function SKEW(data cell range).

16

Distribution Shape: Skewness

Symmetric (not skewed)

Relative Frequency

.05

.10

.15

.20

.25

.30

.35

0

Skewness = 0

Skewness is zero.

Mean and median are equal.

17

Relative Frequency

.05

.10

.15

.20

.25

.30

.35

0

Distribution Shape: Skewness

Moderately Skewed Left

Skewness = - 0.31

Skewness is negative.

Mean will usually be less than the median.

18

Distribution Shape: Skewness

Moderately Skewed Right

Relative Frequency

.05

.10

.15

.20

.25

.30

.35

0

Skewness = 0.31

Skewness is positive.

Mean will usually be more than the median.

19

Distribution Shape: Skewness

Highly Skewed Right

Relative Frequency

.05

.10

.15

.20

.25

.30

.35

0

Skewness = 1.25

Skewness is positive (often above 1.0).

Mean will usually be more than the median.

20

Excel Descriptive Statistics Tool

This tool provides a summary of numerical statistical measures for sample data.

Data >

Data Analysis >

Descriptive Statistics

Enter Input Range

Labels (optional)

Check Summary Statistics box

The data must be in a single row or column. If the data are in multiple columns, the tool treats each row or column as a separate data set

21

Example 4.15: Using the Descriptive Statistics Tool

Purchase Orders database

22

Correlation is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement.

Correlation is measured by the correlation coefficient, also known as the Pearson product moment correlation coefficient.

Population correlation coefficient: (typically unknown)

Sample correlation coefficient:

The correlation coefficient is always between -1 and 1.

Excel function: =CORREL(array1,array2)

Correlation Coefficient

23

Examples of Correlation

24

Example 4.22: Using the Correlation Tool in Data Analysis of Excel

Colleges and Universities data

Moderate negative correlation between acceptance rate and graduation rate, indicating that schools with lower acceptance rates have higher graduation rates.

Acceptance rate is also negatively correlated with the median SAT and Top 10% HS, suggesting that schools with lower acceptance rates have higher student profiles.

The correlations with Expenditures/Student suggest that schools with higher student profiles spend more money per student.

25

A random variable is a numerical description of the outcome of an experiment.

A discrete random variable is one for which the number of possible outcomes can be counted.

A continuous random variable has outcomes over one or more continuous intervals of real numbers.

Random Variables

26

A cumulative distribution function, F(x), specifies the probability that the random variable X assumes a value less than or equal to a specified value x; that is,

F(x) = P(X ≤ x)

A probability density function, f(x), is a mathematical function that characterizes a continuous random variable

Continuous Probability Distributions

27

Properties of density functions f(x):

f(x) ≥ 0 for all values x

Total area under the density function equals 1

P(X = x) = 0; the probability that a continuous random variable X takes on a particular value is always zero

Probabilities are only defined over intervals.

P(a ≤ X ≤ b) is the area under the density function between a and b

Continuous Probability Distributions

28

Normal density function f(x) is a bell-shaped curve

Characterized by 2 parameters:

 (mean)

 > 0 (standard deviation)

Properties:

Symmetric

Mean = Median = Mode

Range of X is unbounded

Excel function:

=NORM.DIST(x, mean, standard_deviation, cumulative).

NORM.DIST(x, mean, standard_deviation, TRUE) calculates the cumulative probability F(x) = P(X ≤ x)

Normal Distribution

29

A standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1.

A standard normal random variable is denoted by Z.

The scale along the z-axis represents the number of standard deviations from the mean of zero.

The Excel function =NORM.S.DIST(z) finds cumulative probability P(Z ≤ z) for the standard normal distribution.

Standard Normal Distribution

30

Table 1 of Appendix A

We may compute probabilities for any normal random variable X having a mean m and standard deviation s by converting it to a standard normal random variable Z:

Using Standard Normal Distribution Tables

31

1. If the sample size is large enough, then the sampling distribution of the sample mean is:

- approximately normally distributed regardless

of the distribution of the population

- has the mean μ and the standard deviation (standard error)

2. If the population is normally distributed, then the sampling distribution is also normally distributed for any sample size.

The central limit theorem allows us to use the theory we learned about calculating probabilities for normal distributions to draw conclusions about sample means.

Central Limit Theorem

32

A confidence interval is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter.

This probability is called the level of confidence, denoted by

1 - a, where a is typically a small probability as 0.01; 0.05 or 0.10

The level of confidence is usually expressed as a percent; common values are 90%, 95%, or 99%.

For a 95% confidence interval, if we chose 100 different samples, leading to 100 different interval estimates, we would expect that 95% of them would contain the true population mean.

Confidence Intervals

33

Sample mean ± margin of error

zα/2 is the value of the standard normal random variable for an upper tail area of α/2 (or a lower tail area of 1 − α/2).

zα/2 is computed as =NORM.S.INV(1 – a/2)

Example: if a = 0.05 (for a 95% confidence interval), then NORM.S.INV(0.975) = 1.96;

Example: if a = 0.10 (for a 90% confidence interval), then NORM.S.INV(0.95) = 1.645,

The margin of error can also be computed by =CONFIDENCE.NORM(alpha, standard_deviation, size).

Confidence Interval for the Population Mean μ with Known Population Standard Deviation σ

34

Confidence Interval for a Population Mean: σ Known

90% 0.10 0.05 0.9500 1.645

95% 0.05 0.025 0.9750 1.960

99% 0.01 0.005 0.9950 2.576

Confidence Table A.1

Level a a/2 Look-up Area za/2

/2

/2

1 - 

-

35

As the level of confidence, 1 - a, increases, za/2 increases, the margin of error increases, and the confidence interval becomes wider.

For example, a 90% confidence interval will be narrower than a 95% confidence interval. Similarly, a 99% confidence interval will be wider than a 95% confidence interval.

Essentially, you must trade off a higher level of accuracy with the risk that the confidence interval does not contain the true mean.

To reduce the risk, you should consider increasing the sample size.

Confidence Interval Properties

36

The t-distribution is a family of probability distributions with a shape similar to the standard normal distribution. Different t-distributions are distinguished by an additional parameter, degrees of freedom (df).

As the number of degrees of freedom increases, the t-distribution converges to the standard normal distribution

The t-Distribution

37

where tα/2 is the value of the t-distribution with df = n − 1

for an upper tail area of α/2.

t values are found in Table 2 of Appendix A or with the Excel function T.INV(1 – a/2, n – 1).

The Excel function

=CONFIDENCE.T(alpha, standard_deviation, size)

can be used to compute the margin of error

Confidence Interval for the Population Mean μ with Unknown Population Standard Deviation σ

38

Excel file Credit Approval Decisions. Find a 95% confidence interval estimate of the mean revolving balance of homeowner applicants

Sample mean = = $13,939.40; s = $9009.52; standard error = $1274.14; t0.025, 49 = 2.0096.

Example 6.9: Computing a Confidence Interval with Unknown Standard Deviation

13,939.40 ± 2.0106(9009.52/√50) =

13,939.40 ± 2.0096(1274.14) =

13,939.40 ± 2560.48 =

from 11,378.92 to 16,499.88

39

Example 6.9 continued

Excel’s Partial Output of Descriptive Statistics in Data Analysis:

40

We are 95% confident that the mean revolving balance for all homeowners is somewhere between $11,378.92 and $16,499.88

(6.3)

= 13939.40, s = 9009.52

df = n - 1 = 50 – 1 = 49

α = 0.05

= = 2.0096

95% margin of error

Revolving Balance for Homeowners
Mean 13939.40
Standard Error 1274.14
Standard Deviation 9009.52
Sample Variance 81171455
Count 50
Confidence Level(95.0%) 2560.48

Microsoft Office User (MOU) - Important for the assignment

Hypothesis testing involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.

H0: Null hypothesis: typically describes an existing theory (status quo)

H1: Alternative hypothesis: the complement of H0

Using sample data, we either:

- reject H0 and conclude the sample data provides

sufficient evidence to support H1, or

- fail to reject H0 and conclude the sample data does not

support H1.

Hypothesis Testing

41

41

Steps in conducting a hypothesis test:

Identify the population parameter and formulate the hypotheses to test.

Select a level of significance α (the risk of drawing an incorrect conclusion).

Determine the decision rule on which to base a conclusion.

Collect data and calculate a test statistic.

Apply the decision rule and draw a conclusion.

Hypothesis Testing Procedure

42

One-tailed test

(lower (left) tail)

One-tailed test

(upper (right) tail)

Two-tailed test

Summary of Forms for Null and Alternative Hypotheses about a Population Mean

The sign “=“ must appear in the null hypothesis.

In general, a hypothesis test about the population

mean takes one of the following three forms

(where 0 is the hypothesized value of ):

43

H1:  >

H0:  ≤ 

H1:  <

H0:  ≥ 

H0:  =

H1:  ≠ 

Hypothesis testing always assumes that H0 is true and uses sample data to determine whether H1 is more likely to be true.

Statistically, we cannot “prove” that H0 is true; we can only fail to reject it.

Rejecting the null hypothesis provides strong evidence (in a statistical sense) that the null hypothesis is not true and that the alternative hypothesis is true.

We have to provide sufficiently strong evidence for rejecting H0

Determining the Proper Form of Hypotheses

44

CadSoft receives calls for technical support. In the past, the average response time has been at least 25 minutes. It believes the average response time can be reduced to less than 25 minutes.

If the new information system makes a difference, then, data should be able to confirm that the mean response time is less than 25 minutes; this defines the alternative hypothesis, H1.

H0: mean response time ≥ 25

H1: mean response time < 25

Example 7.2: Formulating a One-Sample Test of Hypothesis for a Population Mean

45

Type I and Type II Errors

Because hypothesis tests are based on sample data,

we must allow for the possibility of errors.

A Type I error is rejecting H0 when it is actually true.

The allowable probability of making a Type I error

is called the level of significance, and is denoted by a;

a is typically 0.01, 0.05 or 0.10, that is, 1%, 5% or 10%.

A Type II error is not rejecting H0 when it is actually

false.

Two possible conclusions in hypothesis testing:

- Reject H0 (support for H1) or

- Do not reject H0 (lack of support for H1).

46

Type I and Type II Errors

Correct

Decision

Type II Error

Correct

Decision

Type I Error

Reject H0

Fail to reject H0

H0 is actually true

H0 is actually false

Conclusion:

Population Condition:

44

The decision to reject or fail to reject a null hypothesis is based on computing a test statistic value from the sample data.

The test statistic used depends on the type of hypothesis test.

Test statistics for one-sample hypothesis tests for population means:

Selecting the Test Statistic

48

In the CadSoft example, sample data for 44 customers revealed a mean response time of 21.91 minutes and a sample standard deviation of 19.49 minutes.

t = -1.05 indicates that the sample mean of 21.91 is 1.05 standard errors below the hypothesized mean of 25 minutes.

Example 7.4: Computing the Test Statistic

49

In the CadSoft example, μ = mean response time and we use α = 0.05.

H0: μ ≥ 25

H1: μ < 25

n = 44; df = n −1 = 43

t = −1.05

Critical value = tα/2, n−1 = T.INV(1− α, n −1) = T.INV(0.95, 43) = 1.68

t = −1.05 does not fall in the rejection region.

Conclusion: Fail to reject H0.

Example 7.5: Using the Critical Value Approach to Draw a Conclusion

Even though the sample mean of 21.91 is well below 25, we have too much sampling error to conclude the that the true population mean is less than 25 minutes.

50

50

Rejection Regions for the Critical Value Approach

H0: parameter = constant

H1: parameter ≠ constant

H0: parameter ≤ constant

H1: parameter > constant

H0: parameter ≥ constant

H1: parameter < constant

For a one-tailed test, if H1 is stated as <, the rejection region is in the lower tail; if H1 is stated as >, the rejection region is in the upper tail (just think of the inequality as an arrow pointing to the proper tail direction).

For a two-tailed test (H1 is stated as ≠), the rejection region is in both tails.

51

A p-value (observed significance level) is the probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true.

An alternative approach to Step 3 of a hypothesis test uses the p-value rather than the critical value:

Reject H0 if the p-value < α

The p-Value Approach

For a lower one-tailed test, the p-value is the probability to the left of the test statistic t in the t-distribution, and is found using the Excel function: = T.DIST(t, n-1, TRUE).

In the CadSoft example: p-value = P(T43 ≤ - 1.05) =

T.DIST(-1.05, 43, 1) = 0.1498. Since p-value ≥ α = 0.05, the conclusion is: fail to reject H0

52

Excel file Vacation Survey

Test whether μ = the mean age of respondents is equal to 35.

H0: μ = 35

H1: μ ≠ 35

n = 34; sample mean = = 38.677; sample standard deviation = s = = 7.858.

Test statistic:

Positive critical value = T.INV.2T(0.05, 33) =

= 2.0345

p-value = P( |T33 | ≥ 2.73) =

= T.DIST.2T(2.73, 33) = 0.0101

Conclusion: Reject H0.

Example 7.6: Conducting a Two-Tailed Hypothesis Test for the Population Mean

53

Purchase Orders database

With α = 0.05, determine if the mean lead time for Alum Sheeting (µ1) is greater than the mean lead time for Durrable Products (µ2).

Two-Sample Tests for Difference in Population Means Example 7.9: Comparing Supplier Performance

54

54

t-Test: Two-Sample Assuming Unequal Variances

Variable 1 Range: Alum Sheeting data

Variable 2 Range: Durrable Products data

Example 7.10: Testing the Hypotheses for Supplier Lead-Time Performance

55

Results

The test statistic is assumed to be t distributed with df = 10. If the value of the test statistic is nonnegative, then the p-value in the output is the correct p-value for an upper-tail test.

Example 7.10 Continued

t = 3.83 > 1.81 = Critical value

p-value = 0.00166 < 0.05 = α

Reject H0

The mean lead time for

Alum Sheeting is greater

than for Durrable Products

56

Used to compare the means of two or more population groups.

ANOVA derives its name from the fact that we are analyzing variances in the data.

ANOVA measures variation between groups relative to variation within groups.

Each of the population groups is assumed to come from a normally distributed population.

Analysis of Variance (ANOVA)

57

With α = 0.05, determine whether any significant differences exist in satisfaction among individuals with different levels of education.

The variable of interest is called a factor. In this example, the factor is the educational level, and we have three levels of this factor, college graduate, graduate degree, and some college.

Example 7.13: Difference in Insurance Survey Data

58

Data Analysis tool: ANOVA: Single Factor

The input range of the data must be in contiguous columns

Example 7.14: Applying the Excel ANOVA Tool

59

Results:

F = 3.92

Fcrit = 3.47

F > Fcrit

p-value = 0.0356

0.0356 < 0.05

With α = 0.05, reject H0.

Example 7.14: Continued

60

The F statistic is assumed to be F distributed with 2 degrees

of freedom in the numerator, and 21 degrees of

freedom in the denominator.

60

The F-distribution has two degrees of freedom, one associated with the numerator of the F-statistic, 1, and one associated with the denominator of the F-statistic, 2.

The distribution is not symmetric, and Table A.4 in Appendix A provides upper-tail critical values.

F-Distribution

61

The m groups or factor levels being studied represent populations whose outcome measures

1. are randomly and independently obtained,

2. are normally distributed, and

3. have equal variances.

If these assumptions are violated, then the level of significance and the power of the test can be affected.

Assumptions of ANOVA

62



3

)2)(1(

Skewness

s

xx

nn

n

i