Reflection paper Applied Business Analytics
Basic Statistical Concepts
POPULATIONS
Population – the entire collection of elements about which information is desired.
The size of a finite population – i.e., a population that contains a finite number of elements – is denoted by the symbol N.
Elements can also be referred to as units of analysis, and units of analysis are usually people, places, things, or times – e.g., individuals, countries, organizations, students, years, semesters, or even student-semesters.
We are interested in studying properties of some numerical characteristic of the population elements.
Table 2.1 lists examples of population elements and numeric characteristics (another example might be students’ grade point averages).
Once we have identified a population of numerical values of interest, we often take a sample to estimate the following parameters:
1. Mean – denoted as µ, is the average of the values in the population.
2. Range – denoted as RNG, is the difference between the largest value and the smallest value in the population.
3. Variance – denoted as σ², is the average of the squared deviations of the values in the population from the population mean µ.
4. Standard deviation – denoted as σ, is the positive square root of the population variance.
Note: We take a sample because, in most situations, obtaining data on every element of a population would be too costly (e.g., the time associated with measuring the GPAs of all students in the United States).
Statistical inference – the science of using sample data to make a generalization about a population.
Statistical estimation – the science of using information contained in a sample to 1) find an estimate of an unknown population parameter and 2) place a bound on how far the estimate might deviate from that unknown population parameter.
Example 1:
Consider the population of data analytics students enrolled during the 2016 spring semester. Solve for the mean, range, variance, and standard deviation of GPA, the numerical characteristic of each element (a student enrolled in the course).
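The notes do not list the students' GPAs, so the sketch below uses a hypothetical population of N = 5 GPAs purely to show how the four population parameters are calculated. It is written in Python rather than Stata, and the variable names are illustrative assumptions.

```python
import math

# Hypothetical GPAs for a tiny population of N = 5 students (illustrative only,
# not data from the course).
gpas = [2.8, 3.0, 3.4, 3.6, 3.2]

N = len(gpas)
mu = sum(gpas) / N                               # population mean
rng = max(gpas) - min(gpas)                      # range: largest minus smallest value
sigma2 = sum((y - mu) ** 2 for y in gpas) / N    # population variance (divide by N)
sigma = math.sqrt(sigma2)                        # population standard deviation

print(f"mean={mu:.2f} range={rng:.2f} variance={sigma2:.2f} sd={sigma:.4f}")
```

Note that the population variance divides by N, because every element of the population is included in the calculation.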
PROBABILITY
The concept of probability is employed in describing populations and in using sample information to make statistical inferences.
Experiment – any process of observation that has an uncertain outcome (flipping a coin, drawing a card, rolling a die, asking for a number, etc.).
Event – an experimental outcome that may or may not take place (heads, ace of spades, etc.).
Probability – a number that measures the chance that the event will occur when the experiment is performed.
If A is an event that may or may not occur when an experiment is performed (e.g., a coin landing on heads when performing the experiment of flipping), then the probability that the event A will occur is denoted P(A).
If the experiment is performed n_EXP times and the event A occurs n_A of these n_EXP times, then the proportion of the time event A has occurred is
\[ \frac{n_A}{n_{EXP}} \]
If we repeat the experiment a number of times approaching infinity, the limit of the sequence of numbers obtained by calculating the ratio n_A/n_EXP after each repetition is the probability P(A), or
\[ P(A) = \lim_{n_{EXP} \to \infty} \frac{n_A}{n_{EXP}} \]
So, what’s P(A) for the experiment of flipping a coin an infinite number of times?
For all practical purposes, the probability of an event is simply the proportion of the time the event would occur if the experiment were performed a large number of times (since we can’t perform an experiment an infinite number of times).
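A minimal simulation sketch of this idea, in Python rather than Stata: it flips a simulated fair coin and reports the proportion of heads for increasing numbers of repetitions. The function name and the fixed seed are illustrative assumptions.

```python
import random

def relative_frequency_of_heads(n_exp, seed=0):
    """Flip a simulated fair coin n_exp times and return n_A / n_EXP for A = heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_a = sum(1 for _ in range(n_exp) if rng.random() < 0.5)
    return n_a / n_exp

# The proportion of heads should settle near 0.5 as the number of flips grows.
for n_exp in (10, 1_000, 100_000):
    print(n_exp, relative_frequency_of_heads(n_exp))
```

With only 10 flips the proportion can be far from 0.5; with 100,000 flips it should be very close.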
Note: 0 ≤ P(A) ≤ 1 (an impossible event has probability 0 and a certain event has probability 1).
We can also estimate probabilities from experience – subjective probability.
We can also estimate probabilities from continuous probability distributions – i.e., if we know how a variable is distributed, we can solve for the probability of observing certain values.
RANDOM SAMPLES AND SAMPLE STATISTICS
Remember, the calculation of population parameters requires knowledge of all the values in the population. If we don’t know all the values, we must randomly select a sample of n values from the population.
We can use sample data to make inferences about the population.
Random sample – we obtain a random sample of n elements from the population if each element has the same probability, or chance, of being selected in the sample.
With replacement – elements selected on a particular selection are placed back into the population for future selections.
Without replacement – we do not place selected elements back into the population for future selections (this is the preferred method).
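The two selection schemes can be sketched in Python (not the Stata used elsewhere in these notes). The population of ten labeled elements and the seed are hypothetical.

```python
import random

population = list(range(1, 11))  # hypothetical population of N = 10 labeled elements
rng = random.Random(1)           # fixed seed so the run is repeatable
n = 4

# With replacement: a selected element goes back, so it may be drawn again.
with_replacement = rng.choices(population, k=n)

# Without replacement: a selected element is removed, so no element repeats.
without_replacement = rng.sample(population, k=n)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```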
If we have randomly selected a sample of n elements, the values of the numerical characteristic of interest possessed by these elements make up a randomly selected sample of n values.
For i = 1, 2, …, n, we let y_i denote the value of the numerical characteristic under study possessed by the ith randomly selected element. The set
\[ \{y_1, y_2, \ldots, y_n\} \]
denotes the randomly selected sample of n numerical values.
Sample statistic – a descriptive measure of the randomly selected sample of numerical values.
We use sample statistics as point estimates, or estimates that are single numerical values, of population parameters.
Sample statistics commonly used as point estimates include the mean, variance, and standard deviation.
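Suppose that the sample
\[ \{y_1, y_2, \ldots, y_n\} \]
has been randomly selected from a population.
The sample mean is calculated as
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]
and is a point estimate of the population mean µ.
The sample variance is defined as
\[ s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \]
and is a point estimate of the population variance σ².
The sample standard deviation is defined to be
\[ s = \sqrt{s^2} \]
and is a point estimate of the population standard deviation σ.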
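As a sketch of these three calculations, the Python snippet below (Python rather than Stata) computes the sample mean, variance, and standard deviation by hand and cross-checks them against the standard library. The five sample values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical random sample of n = 5 values (illustrative only).
sample = [30.0, 31.0, 32.0, 33.0, 34.0]

n = len(sample)
y_bar = sum(sample) / n                                  # sample mean
s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)     # sample variance (divide by n - 1)
s = s2 ** 0.5                                            # sample standard deviation

# Cross-check: statistics.variance and statistics.stdev also divide by n - 1.
print(y_bar, statistics.mean(sample))
print(s2, statistics.variance(sample))
print(s, statistics.stdev(sample))
```

Here the squared deviations are 4, 1, 0, 1, and 4, so the sample variance is 10/4 = 2.5.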
Note: Make sure that you can reproduce sample statistic calculations in the text. ;-)
CONTINUOUS PROBABILITY DISTRIBUTIONS
Before we randomly select a value y from a population, y potentially can be any of the values in the population.
We can use continuous probability distributions to calculate probabilities concerning the value y might attain.
Consider the closed interval from a to b (a < b) on the real number line. Denoting this interval as [a, b], we often wish to find
\[ P(y \text{ will be in the interval } [a, b]) \]
which can be written more simply as
\[ P(a \le y \le b) \]
This probability can be interpreted as the proportion of values in the population that are greater than or equal to a and less than or equal to b.
Continuous probability distributions assign probabilities to intervals of numbers on the real number line.
Suppose f(v) is a continuous function of numbers on the real number line. Consider the continuous curve that results when f(v) is graphed.
We can say that the curve f(v) is the continuous probability distribution of y if the probability that y will be in the interval [a,b] is the area under the curve f(v) corresponding to the interval [a,b].
As an example, suppose that the curve f(v) illustrated in Figure 2.1 is the continuous probability distribution of y, the mileage of a randomly selected automobile.
Assume that we wish to find the probability that y will be between 31 and 33 mpg. Then we would find the area under the curve f(v) corresponding to the interval [31,33].
If this area is equal to .7023, then 70.23% of all mileages are between 31 and 33 mpg. Below is a hypothetical continuous probability curve f(v):
Suppose that the curve f(v) is the continuous probability distribution of y.
Then we say that the population of values from which y will be randomly selected is distributed according to the continuous probability curve f(v).
In other words, the population has the continuous probability distribution defined by f(v).
The height of the curve f(v) at a given point on the real line represents the relative probability, or chance, that y will be in a small interval of numbers around the given point.
In other words, the height represents the relative proportion of values in the population that are in a small interval of numbers around the given point.
Two properties satisfied by continuous probability distributions:
1. For any number v, f(v) ≥ 0. Remember that the height of the curve represents a relative probability, and probabilities cannot be negative. (The height itself can exceed 1; only areas under the curve are probabilities, and those must be between 0 and 1.)
2. The total area under a continuous probability curve equals 1. This holds because the total area under f(v) equals the probability that y will fall between -∞ and +∞, and y is sure to fall between -∞ and +∞.
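A small numerical sketch of the two properties, in Python rather than Stata. The curve f(v) = 2v on [0, 1] is a hypothetical density chosen for illustration (it is not the mileage curve in these notes); areas are approximated with a midpoint Riemann sum.

```python
def f(v):
    """Hypothetical density: f(v) = 2v on [0, 1], and 0 elsewhere."""
    return 2.0 * v if 0.0 <= v <= 1.0 else 0.0

def area(a, b, steps=100_000):
    """Approximate the area under f over [a, b] with a midpoint Riemann sum."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total = area(0.0, 1.0)   # total area under the curve (f is 0 outside [0, 1])
prob = area(0.5, 1.0)    # P(0.5 <= y <= 1) as an area

print(f"total area = {total:.4f}")
print(f"P(0.5 <= y <= 1) = {prob:.4f}")
```

The curve is never negative, its total area is 1, and the probability of the interval [0.5, 1] is the area 0.75. The height reaches 2 at v = 1, which is fine because heights are not probabilities.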
THE NORMAL PROBABILITY DISTRIBUTION
Many of the variables we study in business and economics are approximately normally distributed. The normal distribution is a continuous probability distribution that is defined by the probability curve
\[ f(v) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(v - \mu)^2}{2\sigma^2}} \]
If a population is distributed according to a normal distribution, we say that y is normally distributed with mean µ and standard deviation σ.
The normal probability distribution is denoted by N(µ,σ), which means that the shape of the normal probability curve depends on the mean and standard deviation of the population.
Important properties of the normal curve are:
1. The normal curve is centered at the population mean µ.
2. The mean µ corresponds to the highest point on the normal curve.
3. The normal curve is symmetrical around the mean (50% of the area lies above the mean and 50% lies below it).
4. The total area under the normal curve is equal to 1 (since it is a probability distribution).
The population mean centers the normal curve on the real number line.
The variance measures the spread of the normal curve.
Note: Draw two normal curves with different means but the same variance. Draw two normal curves with the same mean but different variances.
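As a numerical companion to the drawing exercise, the Python sketch below (Python rather than Stata) evaluates the standard normal density formula at a few points. The values µ = 31.5 and σ = 0.5 or 0.8 are illustrative choices, and normal_pdf is a hypothetical helper name.

```python
import math

def normal_pdf(v, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at v."""
    return math.exp(-((v - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu = 31.5
for sigma in (0.5, 0.8):
    peak = normal_pdf(mu, mu, sigma)          # highest point, located at the mean
    left = normal_pdf(mu - 1.0, mu, sigma)    # one unit below the mean
    right = normal_pdf(mu + 1.0, mu, sigma)   # one unit above the mean
    print(f"sigma={sigma}: peak={peak:.4f} left={left:.4f} right={right:.4f}")
```

The left and right heights match (symmetry), and the curve with the smaller standard deviation has the taller, narrower peak.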
If a variable y is normally distributed with mean µ and standard deviation σ, then
\[ P(a \le y \le b) \]
equals the area under the normal curve with mean µ and standard deviation σ corresponding to the interval [a,b]. This is depicted below:
Three important areas under the normal curve are emphasized in the figure below.
1. As you can see, P(µ − σ ≤ y ≤ µ + σ) = .6826, or 68.26% of the values in the population are within plus or minus 1 standard deviation from the population mean.
2. P(µ − 2σ ≤ y ≤ µ + 2σ) = .9544, or 95.44% of the values in the population are within plus or minus 2 standard deviations from the population mean.
3. P(µ − 3σ ≤ y ≤ µ + 3σ) = .9973, or 99.73% of the values in the population are within plus or minus 3 standard deviations from the population mean.
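These three areas can be checked numerically. The sketch below uses Python's statistics.NormalDist (available in Python 3.8 and later) rather than Stata; area_within is a hypothetical helper name. Because the areas do not depend on µ and σ, the standard normal curve is used.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The computed areas agree with the values above to about three decimal places. The first two differ slightly in the fourth decimal (.6827 and .9545 rather than .6826 and .9544), which reflects rounding in printed normal tables.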
Example 2:
Suppose the population mean is 31.5 and the population standard deviation is .8. Solve for the 68.26%, 95.44%, and 99.73% intervals.
What if we don’t know the true values of the population mean and standard deviation? Can we obtain sample estimates of the mean and standard deviation to solve for the 68.26%, 95.44%, and 99.73% intervals? Yes!
Perform these calculations if the sample mean is 31.2 and the sample standard deviation is .7517.
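One way to organize both sets of calculations is sketched below in Python (rather than Stata). The function name k_sigma_intervals is a hypothetical helper; the inputs are the population values (31.5 and .8) and the sample values (31.2 and .7517) given in Example 2.

```python
def k_sigma_intervals(center, spread):
    """Return the intervals center ± k*spread for k = 1, 2, 3, rounded to 4 decimals."""
    return {
        k: (round(center - k * spread, 4), round(center + k * spread, 4))
        for k in (1, 2, 3)
    }

coverage = {1: "68.26%", 2: "95.44%", 3: "99.73%"}

population = k_sigma_intervals(31.5, 0.8)       # known mu and sigma
estimated = k_sigma_intervals(31.2, 0.7517)     # sample mean and sample standard deviation

for k in (1, 2, 3):
    print(coverage[k], "population:", population[k], "sample-based:", estimated[k])
```

For the population values, the intervals are [30.7, 32.3], [29.9, 33.1], and [29.1, 33.9]. The sample-based intervals are estimates of these and rely on the same normality assumption.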
Note: The results in Example 2 depend on what’s referred to as the normality assumption, or the assumption that the variable under analysis is normally distributed.
We can check the normality assumption using a large sample (n > 30 observations) and graphical analyses.
GRAPHICAL ANALYSES
Use mileages.dta to create a histogram and a stem-and-leaf diagram.
A frequency distribution allows us to examine how a variable is distributed, and therefore whether the normality assumption is appropriate.
A frequency distribution is simply an arrangement of the values of a variable in ascending order. Each entry in the table contains the frequency of the occurrences of values within an interval.
In Stata we can create a frequency table using the “tabulate” command. Using the “plot” option can give you an even better feel for the data.
Note: Do this using mileages.dta
Before you can create a histogram using Stata, please move to the correct directory.
First, create a “projects” folder on your c: drive (go to your c: drive, right-click, and create a folder named “projects”).
In your projects folder, create a folder for our class – e.g., mgt6460.
In your mgt6460 folder, create a “statsreview” folder that contains mileages.dta, our “mileages” data.
Once you do this, you can create a histogram in Stata using the following commands:
. cd c:\projects\mgt6460\statsreview
. use mileages, clear
. list
. histogram mileage
In a histogram, the frequency of values of a variable within each interval is measured on the vertical axis, and the intervals, or bins, are laid out along the horizontal axis.
We can use a frequency distribution to “see” the chances of occurrence of a particular value of a variable, and how the chances change from one bin to the next.
This can be easily accomplished by creating a relative frequency distribution.
In a relative frequency distribution, the percentage of observations falling in each interval is plotted on the vertical axis, and the intervals are again laid out along the horizontal axis.
Note: The relative frequency for a bin is simply the frequency of the occurrences of values within that interval divided by the sample size.
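A short sketch of this calculation in Python (rather than Stata), using the six class frequencies that appear in the mileage frequency table in these notes:

```python
# Class frequencies from the mileage frequency table in these notes.
frequencies = [3, 9, 12, 13, 9, 3]

n = sum(frequencies)                              # sample size
relative = [count / n for count in frequencies]   # relative frequency of each bin
percent = [round(100 * r, 1) for r in relative]   # same values expressed as percentages

print("n =", n)
print("percent per bin:", percent)
```

The relative frequencies sum to 1, so the percentages sum to 100 apart from rounding.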
Note: The number of intervals, or “bins,” should be the smallest integer K such that 2^K > n, where n is the total number of observations. A trick is to set 2^K = n, solve for K = log2(n), and round up to the next integer. For the n = 49 mileages tabulated below, log2(49) ≈ 5.6, so K = 6.
Note: Interval widths, also known as “class” widths, are formed by dividing the range of the data by the number of class intervals. The class width for mileage is found as
\[ \frac{33.3 - 29.8}{6} = \frac{3.5}{6} \approx 0.583, \]
which is rounded up to 0.6.
Note: The class width is the distance between the lower limits of consecutive classes.
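The two rules above can be sketched in Python (rather than Stata). The function names are hypothetical, and rounding the raw width up to the precision of the data (0.1 mpg) is an assumption that matches the 0.6 width used in these notes. The inputs (n = 49, smallest mileage 29.8, largest mileage 33.3) are read from the frequency table and stem-and-leaf display in the notes.

```python
import math

def number_of_classes(n):
    """Smallest integer K such that 2**K > n."""
    k = 0
    while 2 ** k <= n:
        k += 1
    return k

def class_width(smallest, largest, k, precision=0.1):
    """Range divided by K, rounded up to the measurement precision (an assumption)."""
    raw = (largest - smallest) / k
    return round(math.ceil(raw / precision) * precision, 10)

K = number_of_classes(49)
width = class_width(29.8, 33.3, K)
print("K =", K, "class width =", width)
```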
Once we have the class width, we can create the first interval by adding 0.6 to the smallest value of the variable, or the smallest mileage.
The result is the lower class limit of the next interval. Repeat this process until the full table is constructed as follows:
| Interval | Frequency |
| [29.8, 30.4) | 3 |
| [30.4, 31.0) | 9 |
| [31.0, 31.6) | 12 |
| [31.6, 32.2) | 13 |
| [32.2, 32.8) | 9 |
| [32.8, 33.4) | 3 |
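The table can be rebuilt from the raw data. The Python sketch below (Python rather than Stata) uses the 49 mileages read off the stem-and-leaf display in these notes, stored in tenths of a mile per gallon so that class boundaries such as 30.4 are compared exactly. The function name frequency_table is a hypothetical helper.

```python
# Mileages in tenths of a mpg (298 means 29.8), read off the stem-and-leaf display.
mileage_tenths = [
    298,
    301, 303, 304, 304,
    305, 306, 306, 306, 308, 308, 309,
    310, 310, 311, 312, 313, 313, 314, 314, 314,
    315, 315, 315, 316, 316, 317, 317, 317, 318, 318, 319,
    320, 320, 320, 321, 321, 322, 322, 323, 324, 324,
    325, 325, 326, 327, 328, 328,
    333,
]

def frequency_table(values, start, width, k):
    """Count values in k classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * k
    for value in values:
        counts[(value - start) // width] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i]) for i in range(k)]

# Smallest mileage 29.8, class width 0.6, K = 6 classes (all in tenths).
table = frequency_table(mileage_tenths, start=298, width=6, k=6)
for lower, upper, count in table:
    print(f"[{lower / 10:.1f}, {upper / 10:.1f})  {count}")
```

Working in integer tenths avoids floating-point surprises at the class boundaries, where a value such as 30.4 must fall in the second class rather than the first.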
Or, because mileages are recorded to the nearest 0.1, you can state each class with inclusive limits by adding 0.5 to its lower limit:
| Interval | Frequency |
| 29.8 to 30.3 | 3 |
| 30.4 to 30.9 | 9 |
| 31.0 to 31.5 | 12 |
| 31.6 to 32.1 | 13 |
| 32.2 to 32.7 | 9 |
| 32.8 to 33.3 | 3 |
This means that, to include the smallest measurement and the largest measurement in the K = 6 classes, each class should contain 6 possible measurement values, moving from its lower class limit to its upper class limit in increments of 0.1.
For example, the first class includes the smallest measurement (its lower limit), its upper limit, and all values in between in increments of 0.1: 29.8, 29.9, 30.0, 30.1, 30.2, and 30.3.
Finally, use Stata’s “stem” command to create a stem-and-leaf diagram as follows:
. use mileages, clear
. stem mileage
33* 3
32. 556788
32* 0001122344
31. 55566777889
31* 001233444
30. 5666889
30* 1344
29. 8
[Figure: histogram of mileage; vertical axis: Percent (0 to 25); horizontal axis: mileage (30 to 33)]
[Figure: histogram of mileage; vertical axis: Frequency (0 to 15); horizontal axis: mileage (30 to 33)]