excel
Business Statistics Review
1
Data, Elements, Variables, Observations, Types of Data
Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation.
Elements (data objects) are the entities on which data are collected.
A variable (attribute) is a characteristic of interest for the elements.
The set of measurements obtained for a particular
element is called an observation.
Data can be classified as being categorical or quantitative.
Quantitative data indicate how many or how much:
- discrete, if measuring how many
- continuous, if measuring how much
2
Categorical data:
- Nominal data – described by some categories according to specified characteristics
- Ordinal data – nominal but with naturally ordered categories
Quantitative data:
- Interval data – has a meaningless zero
- Ratio data – has a meaningful zero
Note: Most examined quantitative data will have a ratio scale of measurement
Scales of Measurements
3
Example 1.3: Classifying Data
Ordinal
Nominal
Nominal
Ratio
Nominal
Ratio
Ratio
Ratio
Interval
Interval
4
Population - all items of interest for a particular decision or investigation
- all subscribers to Netflix
Sample - a subset of the population
- a list of individuals who rented a comedy from
Netflix in the past year
The purpose of sampling is to obtain sufficient information to draw a valid inference about a population.
Populations and Samples
5
Population mean: µ (typically unknown)
Sample mean:
Excel function: =AVERAGE(data range)
Property of the mean:
Outliers can affect the value of the mean.
Measures of Location: Arithmetic Mean
6
Purchase Orders database
Using formula:
Mean = $2,471,760/94
= $26,295.32
Excel Function:
=AVERAGE(B2:B95)
Example 4.1: Computing Mean Cost per Order
7
The median specifies the middle value when the data are arranged from least to greatest.
Half the data are below the median, and half the data are above it.
For an odd number of observations, the median is the middle of the sorted numbers.
For an even number of observations, the median is the mean of the two middle numbers.
We could use the Sort option in Excel to rank-order the data and then determine the median. The Excel function =MEDIAN(data range) could also be used.
The median is meaningful for ratio, interval, and ordinal data.
Not affected by outliers.
Measures of Location: Median
8
Sort the data from smallest to largest. Since we have 94 observations, the median is the average of the 47th and 48th observation.
Example 4.2: Finding the Median Cost per Order
Median =
($15,562.50 + $15,750.00)/2 = $15,656.25
= MEDIAN(B2:B95)
9
The variance is the “average” of the squared deviations from the mean.
Population variance: (typically unknown)
Sample variance:
In Excel: =VAR.S(data range)
Measures of Dispersion: Variance
10
Purchase Orders Cost per order data
Example 4.8 Computing the Variance
11
The standard deviation is the positive square root of the variance.
Note that the dimension of the variance is the square of the dimension of the observations, whereas the dimension of the standard deviation is the same as the data. This makes the standard deviation more practical to use in applications.
Population standard deviation: σ (typically unknown)
Sample standard deviation:
In Excel: =STDEV.S(data range)
Measures of Dispersion: Standard Deviation
12
Purchase Orders Cost per order data
Using the results of Example 4.8, take the square root of the variance:
Alternatively, use the STDEV.S function for the data range.
Example 4.9 Computing the Standard Deviation
14
A standardized value, commonly called a z-score, provides a relative measure of the distance an observation is from the mean, which is independent of the units of measurement.
The z-score for the ith observation in a data set is calculated as follows:
Excel function: =STANDARDIZE(x, mean, standard_dev).
Standardized Values
14
Empirical Rule
About 68% of the data values are within 1 standard
deviation of its mean. (About 68% of the z-scores are
between -1 and 1.)
About 95% of the data values are within 2 standard
deviations of its mean. (About 68% of the z-scores are
between -2 and 2.)
Almost all of the data values are within 3 standard
deviations of its mean. (Almost all z-scores are
between -3 and 3.)
For data having a bell-shaped distribution:
15
An observation xi is an outlier if its z-score
is greater than +3 or less than -3
Distribution Shape: Skewness
An important measure of the shape of a distribution is called skewness.
The formula for the skewness of sample data is
Skewness can be computed using Excel’s function SKEW(data cell range).
16
Distribution Shape: Skewness
Symmetric (not skewed)
Relative Frequency
.05
.10
.15
.20
.25
.30
.35
0
Skewness = 0
Skewness is zero.
Mean and median are equal.
17
Relative Frequency
.05
.10
.15
.20
.25
.30
.35
0
Distribution Shape: Skewness
Moderately Skewed Left
Skewness = - 0.31
Skewness is negative.
Mean will usually be less than the median.
18
Distribution Shape: Skewness
Moderately Skewed Right
Relative Frequency
.05
.10
.15
.20
.25
.30
.35
0
Skewness = 0.31
Skewness is positive.
Mean will usually be more than the median.
19
Distribution Shape: Skewness
Highly Skewed Right
Relative Frequency
.05
.10
.15
.20
.25
.30
.35
0
Skewness = 1.25
Skewness is positive (often above 1.0).
Mean will usually be more than the median.
20
Excel Descriptive Statistics Tool
This tool provides a summary of numerical statistical measures for sample data.
Data >
Data Analysis >
Descriptive Statistics
Enter Input Range
Labels (optional)
Check Summary Statistics box
The data must be in a single row or column. If the data are in multiple columns, the tool treats each row or column as a separate data set
21
Example 4.15: Using the Descriptive Statistics Tool
Purchase Orders database
22
Correlation is a measure of the linear relationship between two variables, X and Y, which does not depend on the units of measurement.
Correlation is measured by the correlation coefficient, also known as the Pearson product moment correlation coefficient.
Population correlation coefficient: (typically unknown)
Sample correlation coefficient:
The correlation coefficient is always between -1 and 1.
Excel function: =CORREL(array1,array2)
Correlation Coefficient
23
Examples of Correlation
24
Example 4.22: Using the Correlation Tool in Data Analysis of Excel
Colleges and Universities data
Moderate negative correlation between acceptance rate and graduation rate, indicating that schools with lower acceptance rates have higher graduation rates.
Acceptance rate is also negatively correlated with the median SAT and Top 10% HS, suggesting that schools with lower acceptance rates have higher student profiles.
The correlations with Expenditures/Student suggest that schools with higher student profiles spend more money per student.
25
A random variable is a numerical description of the outcome of an experiment.
A discrete random variable is one for which the number of possible outcomes can be counted.
A continuous random variable has outcomes over one or more continuous intervals of real numbers.
Random Variables
26
A cumulative distribution function, F(x), specifies the probability that the random variable X assumes a value less than or equal to a specified value x; that is,
F(x) = P(X ≤ x)
A probability density function, f(x), is a mathematical function that characterizes a continuous random variable
Continuous Probability Distributions
27
Properties of density functions f(x):
f(x) ≥ 0 for all values x
Total area under the density function equals 1
P(X = x) = 0; the probability that a continuous random variable X takes on a particular value is always zero
Probabilities are only defined over intervals.
P(a ≤ X ≤ b) is the area under the density function between a and b
Continuous Probability Distributions
28
Normal density function f(x) is a bell-shaped curve
Characterized by 2 parameters:
(mean)
> 0 (standard deviation)
Properties:
Symmetric
Mean = Median = Mode
Range of X is unbounded
Excel function:
=NORM.DIST(x, mean, standard_deviation, cumulative).
NORM.DIST(x, mean, standard_deviation, TRUE) calculates the cumulative probability F(x) = P(X ≤ x)
Normal Distribution
29
A standard normal distribution is a normal distribution with a mean of 0 and standard deviation of 1.
A standard normal random variable is denoted by Z.
The scale along the z-axis represents the number of standard deviations from the mean of zero.
The Excel function =NORM.S.DIST(z) finds cumulative probability P(Z ≤ z) for the standard normal distribution.
Standard Normal Distribution
30
Table 1 of Appendix A
We may compute probabilities for any normal random variable X having a mean m and standard deviation s by converting it to a standard normal random variable Z:
Using Standard Normal Distribution Tables
31
1. If the sample size is large enough, then the sampling distribution of the sample mean is:
- approximately normally distributed regardless
of the distribution of the population
- has the mean μ and the standard deviation (standard error)
2. If the population is normally distributed, then the sampling distribution is also normally distributed for any sample size.
The central limit theorem allows us to use the theory we learned about calculating probabilities for normal distributions to draw conclusions about sample means.
Central Limit Theorem
32
A confidence interval is a range of values between which the value of the population parameter is believed to be, along with a probability that the interval correctly estimates the true (unknown) population parameter.
This probability is called the level of confidence, denoted by
1 - a, where a is typically a small probability as 0.01; 0.05 or 0.10
The level of confidence is usually expressed as a percent; common values are 90%, 95%, or 99%.
For a 95% confidence interval, if we chose 100 different samples, leading to 100 different interval estimates, we would expect that 95% of them would contain the true population mean.
Confidence Intervals
33
Sample mean ± margin of error
zα/2 is the value of the standard normal random variable for an upper tail area of α/2 (or a lower tail area of 1 − α/2).
zα/2 is computed as =NORM.S.INV(1 – a/2)
Example: if a = 0.05 (for a 95% confidence interval), then NORM.S.INV(0.975) = 1.96;
Example: if a = 0.10 (for a 90% confidence interval), then NORM.S.INV(0.95) = 1.645,
The margin of error can also be computed by =CONFIDENCE.NORM(alpha, standard_deviation, size).
Confidence Interval for the Population Mean μ with Known Population Standard Deviation σ
34
Confidence Interval for a Population Mean: σ Known
90% 0.10 0.05 0.9500 1.645
95% 0.05 0.025 0.9750 1.960
99% 0.01 0.005 0.9950 2.576
Confidence Table A.1
Level a a/2 Look-up Area za/2
/2
/2
1 -
-
35
As the level of confidence, 1 - a, increases, za/2 increases, the margin of error increases, and the confidence interval becomes wider.
For example, a 90% confidence interval will be narrower than a 95% confidence interval. Similarly, a 99% confidence interval will be wider than a 95% confidence interval.
Essentially, you must trade off a higher level of accuracy with the risk that the confidence interval does not contain the true mean.
To reduce the risk, you should consider increasing the sample size.
Confidence Interval Properties
36
The t-distribution is a family of probability distributions with a shape similar to the standard normal distribution. Different t-distributions are distinguished by an additional parameter, degrees of freedom (df).
As the number of degrees of freedom increases, the t-distribution converges to the standard normal distribution
The t-Distribution
37
where tα/2 is the value of the t-distribution with df = n − 1
for an upper tail area of α/2.
t values are found in Table 2 of Appendix A or with the Excel function T.INV(1 – a/2, n – 1).
The Excel function
=CONFIDENCE.T(alpha, standard_deviation, size)
can be used to compute the margin of error
Confidence Interval for the Population Mean μ with Unknown Population Standard Deviation σ
38
Excel file Credit Approval Decisions. Find a 95% confidence interval estimate of the mean revolving balance of homeowner applicants
Sample mean = = $13,939.40; s = $9009.52; standard error = $1274.14; t0.025, 49 = 2.0096.
Example 6.9: Computing a Confidence Interval with Unknown Standard Deviation
13,939.40 ± 2.0106(9009.52/√50) =
13,939.40 ± 2.0096(1274.14) =
13,939.40 ± 2560.48 =
from 11,378.92 to 16,499.88
39
Example 6.9 continued
Excel’s Partial Output of Descriptive Statistics in Data Analysis:
40
We are 95% confident that the mean revolving balance for all homeowners is somewhere between $11,378.92 and $16,499.88
(6.3)
= 13939.40, s = 9009.52
df = n - 1 = 50 – 1 = 49
α = 0.05
= = 2.0096
95% margin of error
| Revolving Balance for Homeowners | |
| Mean | 13939.40 |
| Standard Error | 1274.14 |
| Standard Deviation | 9009.52 |
| Sample Variance | 81171455 |
| Count | 50 |
| Confidence Level(95.0%) | 2560.48 |
Microsoft Office User (MOU) - Important for the assignment
Hypothesis testing involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.
H0: Null hypothesis: typically describes an existing theory (status quo)
H1: Alternative hypothesis: the complement of H0
Using sample data, we either:
- reject H0 and conclude the sample data provides
sufficient evidence to support H1, or
- fail to reject H0 and conclude the sample data does not
support H1.
Hypothesis Testing
41
41
Steps in conducting a hypothesis test:
Identify the population parameter and formulate the hypotheses to test.
Select a level of significance α (the risk of drawing an incorrect conclusion).
Determine the decision rule on which to base a conclusion.
Collect data and calculate a test statistic.
Apply the decision rule and draw a conclusion.
Hypothesis Testing Procedure
42
One-tailed test
(lower (left) tail)
One-tailed test
(upper (right) tail)
Two-tailed test
Summary of Forms for Null and Alternative Hypotheses about a Population Mean
The sign “=“ must appear in the null hypothesis.
In general, a hypothesis test about the population
mean takes one of the following three forms
(where 0 is the hypothesized value of ):
43
H1: >
H0: ≤
H1: <
H0: ≥
H0: =
H1: ≠
Hypothesis testing always assumes that H0 is true and uses sample data to determine whether H1 is more likely to be true.
Statistically, we cannot “prove” that H0 is true; we can only fail to reject it.
Rejecting the null hypothesis provides strong evidence (in a statistical sense) that the null hypothesis is not true and that the alternative hypothesis is true.
We have to provide sufficiently strong evidence for rejecting H0
Determining the Proper Form of Hypotheses
44
CadSoft receives calls for technical support. In the past, the average response time has been at least 25 minutes. It believes the average response time can be reduced to less than 25 minutes.
If the new information system makes a difference, then, data should be able to confirm that the mean response time is less than 25 minutes; this defines the alternative hypothesis, H1.
H0: mean response time ≥ 25
H1: mean response time < 25
Example 7.2: Formulating a One-Sample Test of Hypothesis for a Population Mean
45
Type I and Type II Errors
Because hypothesis tests are based on sample data,
we must allow for the possibility of errors.
A Type I error is rejecting H0 when it is actually true.
The allowable probability of making a Type I error
is called the level of significance, and is denoted by a;
a is typically 0.01, 0.05 or 0.10, that is, 1%, 5% or 10%.
A Type II error is not rejecting H0 when it is actually
false.
Two possible conclusions in hypothesis testing:
- Reject H0 (support for H1) or
- Do not reject H0 (lack of support for H1).
46
Type I and Type II Errors
Correct
Decision
Type II Error
Correct
Decision
Type I Error
Reject H0
Fail to reject H0
H0 is actually true
H0 is actually false
Conclusion:
Population Condition:
44
The decision to reject or fail to reject a null hypothesis is based on computing a test statistic value from the sample data.
The test statistic used depends on the type of hypothesis test.
Test statistics for one-sample hypothesis tests for population means:
Selecting the Test Statistic
48
In the CadSoft example, sample data for 44 customers revealed a mean response time of 21.91 minutes and a sample standard deviation of 19.49 minutes.
t = -1.05 indicates that the sample mean of 21.91 is 1.05 standard errors below the hypothesized mean of 25 minutes.
Example 7.4: Computing the Test Statistic
49
In the CadSoft example, μ = mean response time and we use α = 0.05.
H0: μ ≥ 25
H1: μ < 25
n = 44; df = n −1 = 43
t = −1.05
Critical value = tα/2, n−1 = T.INV(1− α, n −1) = T.INV(0.95, 43) = 1.68
t = −1.05 does not fall in the rejection region.
Conclusion: Fail to reject H0.
Example 7.5: Using the Critical Value Approach to Draw a Conclusion
Even though the sample mean of 21.91 is well below 25, we have too much sampling error to conclude the that the true population mean is less than 25 minutes.
50
50
Rejection Regions for the Critical Value Approach
H0: parameter = constant
H1: parameter ≠ constant
H0: parameter ≤ constant
H1: parameter > constant
H0: parameter ≥ constant
H1: parameter < constant
For a one-tailed test, if H1 is stated as <, the rejection region is in the lower tail; if H1 is stated as >, the rejection region is in the upper tail (just think of the inequality as an arrow pointing to the proper tail direction).
For a two-tailed test (H1 is stated as ≠), the rejection region is in both tails.
51
A p-value (observed significance level) is the probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true.
An alternative approach to Step 3 of a hypothesis test uses the p-value rather than the critical value:
Reject H0 if the p-value < α
The p-Value Approach
For a lower one-tailed test, the p-value is the probability to the left of the test statistic t in the t-distribution, and is found using the Excel function: = T.DIST(t, n-1, TRUE).
In the CadSoft example: p-value = P(T43 ≤ - 1.05) =
T.DIST(-1.05, 43, 1) = 0.1498. Since p-value ≥ α = 0.05, the conclusion is: fail to reject H0
52
Excel file Vacation Survey
Test whether μ = the mean age of respondents is equal to 35.
H0: μ = 35
H1: μ ≠ 35
n = 34; sample mean = = 38.677; sample standard deviation = s = = 7.858.
Test statistic:
Positive critical value = T.INV.2T(0.05, 33) =
= 2.0345
p-value = P( |T33 | ≥ 2.73) =
= T.DIST.2T(2.73, 33) = 0.0101
Conclusion: Reject H0.
Example 7.6: Conducting a Two-Tailed Hypothesis Test for the Population Mean
53
Purchase Orders database
With α = 0.05, determine if the mean lead time for Alum Sheeting (µ1) is greater than the mean lead time for Durrable Products (µ2).
Two-Sample Tests for Difference in Population Means Example 7.9: Comparing Supplier Performance
54
54
t-Test: Two-Sample Assuming Unequal Variances
Variable 1 Range: Alum Sheeting data
Variable 2 Range: Durrable Products data
Example 7.10: Testing the Hypotheses for Supplier Lead-Time Performance
55
Results
The test statistic is assumed to be t distributed with df = 10. If the value of the test statistic is nonnegative, then the p-value in the output is the correct p-value for an upper-tail test.
Example 7.10 Continued
t = 3.83 > 1.81 = Critical value
p-value = 0.00166 < 0.05 = α
Reject H0
The mean lead time for
Alum Sheeting is greater
than for Durrable Products
56
Used to compare the means of two or more population groups.
ANOVA derives its name from the fact that we are analyzing variances in the data.
ANOVA measures variation between groups relative to variation within groups.
Each of the population groups is assumed to come from a normally distributed population.
Analysis of Variance (ANOVA)
57
With α = 0.05, determine whether any significant differences exist in satisfaction among individuals with different levels of education.
The variable of interest is called a factor. In this example, the factor is the educational level, and we have three levels of this factor, college graduate, graduate degree, and some college.
Example 7.13: Difference in Insurance Survey Data
58
Data Analysis tool: ANOVA: Single Factor
The input range of the data must be in contiguous columns
Example 7.14: Applying the Excel ANOVA Tool
59
Results:
F = 3.92
Fcrit = 3.47
F > Fcrit
p-value = 0.0356
0.0356 < 0.05
With α = 0.05, reject H0.
Example 7.14: Continued
60
The F statistic is assumed to be F distributed with 2 degrees
of freedom in the numerator, and 21 degrees of
freedom in the denominator.
60
The F-distribution has two degrees of freedom, one associated with the numerator of the F-statistic, 1, and one associated with the denominator of the F-statistic, 2.
The distribution is not symmetric, and Table A.4 in Appendix A provides upper-tail critical values.
F-Distribution
61
The m groups or factor levels being studied represent populations whose outcome measures
1. are randomly and independently obtained,
2. are normally distributed, and
3. have equal variances.
If these assumptions are violated, then the level of significance and the power of the test can be affected.
Assumptions of ANOVA
62
3
)2)(1(
Skewness
s
xx
nn
n
i