Statistics
BUS308 – Week 1 Lecture 1
Making Sense of Data
This class is about ways to gather information from data sets – how we can summarize the data, how we can make decisions with them, and how we can understand patterns within them. This is often called data analysis; some courses discuss this approach with the term “data-based decision making.” However it is described, data is the language of business, as seen in courses such as Finance, Accounting, Operations Management, Auditing, etc. Even courses not considered data-based, such as Human Resources, often have a data component. It is extremely rare that management of any functional area within business does not have to deal with data – even if just the budget. Some examples of these data-based issues include:
• Changes in manufacturing defect rates
• Employee satisfaction survey response patterns
• Changes in monthly expenses
• Increased costs of warranty/insurance claims
• Changes in average delivery times
• Costs involved in re-training versus hiring new employees
These all need to be analyzed in different ways to find out what the data is telling us.
Data analysis
Data analysis, whether statistical, financial, operational, etc., is much like solving a mystery, and those who work with these tools are like TV detectives. The crime we focus on presents itself as some outcome (results of a manufacturing process, differences in customer satisfaction ratings, financial outcomes, etc.) that we do not fully understand. In this course, we will have a single “crime” to focus on. The Federal Equal Pay Act requires companies to pay males and females the same pay if they are doing (substantially) the same work. We will be taking the role of data analysts in a company that has received some evidence that it is going to have a Federal audit of its pay practices due to a complaint about unfair pay practices. Our “job” – the basis of the class assignments – is to determine whether or not we comply with the Equal Pay Act. We will be using a single set of data, described below, for all our assignments.
In real life on the job or with assignments, we often, as do TV detectives, have an overwhelming amount of data that we need to sift through to get to our clues. We then interpret the clues to get the information we need to answer our questions about what happened with the process or outcome we are looking at. The information that we are first presented with is typically a bunch of numbers that measure, count, and code lots of things. Note that, as we talk about data, we have three kinds we are concerned with:
• Measures tell us how much exists; such as a salary measure would tell us how many dollars someone is paid.
• Counts tell us how many exist, such as counting how many employees have a master’s degree.
• Codes tell us about a characteristic; for example, we might code being male as 0 and being female as 1. These numbers do not mean one gender is somehow “better” or “higher” than the other; they merely show a difference. More about this later.
So, as data detectives, we approach any question by finding numbers (measures, counts, and codes) that somehow relate to the issue and the question we need to answer about the situation. Once we have this data, we need to sort through it to find the clues that need to be examined to understand the situation or outcome. For this class, clues are what we get once we have done some statistical work on the data. This work, as we will see throughout the class, starts with relatively simple summaries – average values for different groups or things, measures of how consistent things are, etc. These summary measures become our first clues. And, just as with any good detective story, not all the clues are meaningful and some are not immediately apparent. The detective/analyst needs to find out what happened and what the clues mean to understand and “solve” the crime.
Before we start with the data and how to tease clues from it, we need to understand a couple of concepts:
• Population: includes all of the “things” we are interested in; for example, the population of the US would include everyone living in the country.
• Sample: involves only a selected sub-group of the population; for example, those selected for a national opinion poll.
• Random Sample: a sample where every member of the population has an initial equal chance of being selected; this is the only way to obtain a sample that is truly representative of the entire population. Details on how to conduct a random survey are covered in research courses; we will assume the data we will be working with comes from a random sample of a company’s exempt employees. Note: an exempt employee, AKA salaried employee, does not get overtime pay for working more than 40 hours in a week (“exempt from overtime requirements”).
• Parameter: a characteristic of the population; the average age of everyone in the US would be a parameter.
• Statistic: a characteristic of a sample; the average age of everyone you know who attends school would be a statistic as the group is a sub-group of all students.
• Descriptive Statistics: measures that summarize characteristics of a group.
• Inferential Statistics: measures that summarize the characteristics of a random sample and are used to infer the value of the population parameters.
• Statistical test: a quantitative technique to make a judgement about population parameters based upon random sample outcomes (statistics).
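To make the difference between a parameter and a statistic concrete, here is a minimal sketch in Python (the class itself uses Excel, so this is only an illustration). The population of ages is made up, and the seed value and variable names are assumptions chosen for this example.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (made-up values, illustration only).
rng = random.Random(308)  # fixed seed so the sketch gives the same result each run
population_ages = [rng.randint(22, 65) for _ in range(1000)]

# Parameter: a characteristic of the entire population.
population_mean_age = statistics.mean(population_ages)

# Random sample: every member of the population has the same chance of selection.
sample_ages = rng.sample(population_ages, 50)

# Statistic: the same characteristic, measured on the sample only.
sample_mean_age = statistics.mean(sample_ages)

print(f"Parameter (population mean age): {population_mean_age:.1f}")
print(f"Statistic (sample mean age):     {sample_mean_age:.1f}")
```

The two averages will usually be close but not identical; inferential statistics deal with how far a sample statistic can reasonably be expected to sit from the population parameter.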
Class Design
Statistics is often presented as a series of unrelated tools, each with a set of practice problems that are also unrelated to what has been done before. This never made sense, as statistical analysis is more appropriately seen as a set of tools that work together to give us insight into what the data is hiding. As we go through this class, the case will tie all the various tools and techniques together. We will see that while generally illuminating, not all analytical tools, working alone, will provide as much insight as we hope for – just as with the different clues a TV detective gets while working to find their answers to who did what and why.
As we work through the class, we will generally start each week, in Lecture One, with where we are in the story of our understanding the data, and the question(s) we want to answer. Then the new concepts for the week will be presented in an overview, showing what they are designed to do for us. In general, two distinct concepts or issues will be presented each week. Each week’s Lecture Two will present one concept in more detail and discuss how we can use Excel to perform the analysis and get our clues from the data in terms of statistical test outcomes. In the third Lecture for each week, we will discuss the other key concept for the week. Each lecture will provide an example of how to work the questions asked in the homework using a different measure than what the assignment will ask you to use. The lectures will discuss how to interpret the statistical outcome and how to translate this information back to our initial research question: do males and females get equal pay for equal work? (Note, due to the need to get some basics covered, our week one lectures will not follow this exact pattern.)
The Case
Our class, as a group of data analysts/detectives, will play the role of helping a Human Resources Department (in our assumed “company”) prepare for an audit from the government about our pay practices. This routine audit will focus on the question of equal pay for equal work, as required by both State and Federal statutes. Specifically, these require that if males and females are performing substantially the same job (equal work), then they should be paid equally.
Of course, nothing is quite that simple. The laws do allow some differences based on company policies calling for pay differences due to performance, seniority, education, and – with some companies – functional areas. Our company does have policies saying we pay for organizational level (different kinds of work), performance, seniority, experience, and educational achievements.
Our first step is to decide upon some questions that need to be answered, as questions lead to the data we need to collect. The overall question, also known as the Research Question, is simply: “Do males and females receive equal pay for equal work?” This just means that if a male and female are doing the same work for the company, are they paid the same? As straightforward as this question seems, it is very difficult to answer directly. So, after brainstorming, secondary or intermediate (more basic) questions have been identified as needing to be answered as we build our case towards the final answer. Some of these secondary questions (which will be addressed throughout the course) include:
• Do we have any measures that show pay comparisons between males and females?
• Since different factors influence pay, do males and females fare differently on them, such as age, service, education, performance ratings, etc.?
• How do the various salary related inputs interact with each other? How do they impact pay levels?
These general questions lead to our collecting data from a random sample of employees. Note that a random sample (covered in research) is the best approach to give us a sample that closely represents the actual employee population. The sample consists of 25 males and 25 females. The following data was collected on each employee selected:
• Salary, rounded to the nearest $100 and measured in thousands of dollars; for example, an annual salary of $38,825 is recorded as 38.8.
• Age, rounded (up or down) to the age as of the employee’s nearest birthday.
• Seniority, rounded (up or down) to the nearest hiring anniversary.
• Performance Appraisal Rating, based on a 100-point scale.
• Raise – the percent of their last performance merit increase.
• Job grade – groups of jobs that are considered substantially similar work (for equal work purposes) that are grouped into classifications ranging from A (the lowest grade) through F (the highest grade). Note: all employees in this study are exempt employees – paid with a salary and not eligible for overtime payments. They are considered middle management and professional level employees.
• Midpoint – the middle of the salary range assigned to each Job Grade level. The midpoint is considered to be the average market rate that companies pay for jobs within each grade.
• Degree – the educational achievement level, coded as 0 for those having a Bachelor’s degree and 1 for those having a Master’s degree or higher.
• Gender – coded as M for Males, and F for Females, also coded 0 (Males) and 1 (Females) for use in an advanced statistical technique introduced in week 4.
In addition to these collected measures, the HR Compensation Department suggested that we construct a measure called Comparison-ratio or Compa-ratio. The Compa-ratio is defined as the salary divided by the employee’s grade midpoint. For example, an employee with a salary of $50,000 and a company salary range midpoint of $48,000 would have a Compa-ratio of 50/48 = 1.042 (rounded to 3 decimal places). Employees with a Compa-ratio greater than (>) 1.0 are paid more than the market rate for their job, while employees with a Compa-ratio less than (<) 1.0 are paid less than the prevailing market rate. Compensation professionals use Compa-ratios to examine the spread and relative pay levels of employees while the impact of grade is removed from the picture.
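The Compa-ratio calculation can be sketched in a few lines of Python. This is only an illustration of the definition above; the function names `compa_ratio` and `market_position` are assumptions made up for this sketch and are not part of the course materials.

```python
def compa_ratio(salary, midpoint):
    """Return salary divided by the grade midpoint, rounded to 3 decimal places.

    Both values must be in the same units (here, thousands of dollars).
    """
    if midpoint <= 0:
        raise ValueError("midpoint must be a positive number")
    return round(salary / midpoint, 3)


def market_position(ratio):
    """Describe pay relative to the market rate implied by the grade midpoint."""
    if ratio > 1.0:
        return "paid more than the market rate"
    if ratio < 1.0:
        return "paid less than the market rate"
    return "paid at the market rate"


# The example from the text: a salary of $50,000 against a midpoint of $48,000.
example = compa_ratio(50, 48)
print(example, "->", market_position(example))  # 1.042 -> paid more than the market rate
```

Because the ratio divides out the midpoint, employees in different grades can be compared on the same scale.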
Here is the data collected that will be used in the lecture examples and discussions.
ID  Salary  Compa-ratio  Mid  Age  Perf App.  Service  Gender  Raise  Deg.  Gender 1  Grade
 1      58        1.017   57   34         85        8       0    5.7     0  M         E
 2      27        0.870   31   52         80        7       0    3.9     0  M         B
 3      34        1.096   31   30         75        5       1    3.6     1  F         B
 4      66        1.157   57   42        100       16       0    5.5     1  M         E
 5      47        0.979   48   36         90       16       0    5.7     1  M         D
 6      76        1.134   67   36         70       12       0    4.5     1  M         F
 7      41        1.025   40   32        100        8       1    5.7     1  F         C
 8      23        1.000   23   32         90        9       1    5.8     1  F         A
 9      77        1.149   67   49        100       10       0      4     1  M         F
10      22        0.956   23   30         80        7       1    4.7     1  F         A
11      23        1.000   23   41        100       19       1    4.8     1  F         A
12      60        1.052   57   52         95       22       0    4.5     0  M         E
13      42        1.050   40   30        100        2       1    4.7     0  F         C
14      24        1.043   23   32         90       12       1      6     1  F         A
15      24        1.043   23   32         80        8       1    4.9     1  F         A
16      47        1.175   40   44         90        4       0    5.7     0  M         C
17      69        1.210   57   27         55        3       1      3     1  F         E
18      36        1.161   31   31         80       11       1    5.6     0  F         B
19      24        1.043   23   32         85        1       0    4.6     1  M         A
20      34        1.096   31   44         70       16       1    4.8     0  F         B
21      76        1.134   67   43         95       13       0    6.3     1  M         F
22      57        1.187   48   48         65        6       1    3.8     1  F         D
23      23        1.000   23   36         65        6       1    3.3     0  F         A
24      50        1.041   48   30         75        9       1    3.8     0  F         D
25      24        1.043   23   41         70        4       0      4     0  M         A
26      24        1.043   23   22         95        2       1    6.2     0  F         A
27      40        1.000   40   35         80        7       0    3.9     1  M         C
28      75        1.119   67   44         95        9       1    4.4     0  F         F
29      72        1.074   67   52         95        5       0    5.4     0  M         F
30      49        1.020   48   45         90       18       0    4.3     0  M         D
31      24        1.043   23   29         60        4       1    3.9     1  F         A
32      28        0.903   31   25         95        4       0    5.6     0  M         B
33      64        1.122   57   35         90        9       0    5.5     1  M         E
34      28        0.903   31   26         80        2       0    4.9     1  M         B
35      24        1.043   23   23         90        4       1    5.3     0  F         A
36      23        1.000   23   27         75        3       1    4.3     0  F         A
37      22        0.956   23   22         95        2       1    6.2     0  F         A
38      56        0.982   57   45         95       11       0    4.5     0  M         E
39      35        1.129   31   27         90        6       1    5.5     0  F         B
40      25        1.086   23   24         90        2       0    6.3     0  M         A
41      43        1.075   40   25         80        5       0    4.3     0  M         C
42      24        1.043   23   32        100        8       1    5.7     1  F         A
43      77        1.149   67   42         95       20       1    5.5     0  F         F
44      60        1.052   57   45         90       16       0    5.2     1  M         E
45      55        1.145   48   36         95        8       1    5.2     1  F         D
46      65        1.140   57   39         75       20       0    3.9     1  M         E
47      62        1.087   57   37         95        5       0    5.5     1  M         E
48      65        1.140   57   34         90       11       1    5.3     1  F         E
49      60        1.052   57   41         95       21       0    6.6     0  M         E
50      66        1.157   57   38         80       12       0    4.6     0  M         E
What kind of data do we have?
Just as all clues and information uncovered on mystery shows are not equally valuable, or even useful, not all data is equally useful in answering questions. But all data has some value. As we look at this data set, it is clear that not all the data is the same. We have some measures (salary, seniority, etc.) but we also have some labels (ID, for example, merely identifies different employees in the data set and is not useful for much else). We have some data that are clearly codes, gender and degree for example. In general, our data set can be sorted into four kinds of data (NOIR):
• Nominal: these are basically names or labels. For example, in our data set we have Gender labeled as M and F (for males and females). Other examples of nominal data include names of cars (Ford, Chevrolet, Dodge, etc.), cities and states, flowers, etc. Anything where the name/label just indicates a difference from something else that is similar is nominal level data. Now, we can “code” with words and letters (such as Male or M) but we can also code using 0 and 1 (for male and female). Regardless of one looking like a label (letters) and one looking like a measurement (numbers), both of these are simply ways to label males and females – they indicate a difference between the groups only – not that one is somehow higher than the other (as we typically think of 1 as higher or more than 0). Nominal level data are used in two ways. First, we can count them – how many males and females exist in the group, for example. Second, we can use them as group labels to identify different groups, and list other characteristics in each group; a list of all male and female compa-ratios will be quite helpful in our analysis, for example.
• Ordinal: these variables add a sense of order to the difference, but where the differences are not the same between levels. Often, these variables are based on judgement calls creating labels that can be placed in a rank order, such as good, better, best. The grade and degree variables in our data set are ordinal. We cannot assume that the amount of work to get the higher degree or higher job grade is the same for all differences. Note: Even though we only show education as bachelor and graduate, we could include no high school diploma, high school diploma on the low end and doctoral degree and professional certification on the upper end.
• Interval: these variables have a constant difference between successive values. Temperature is a common example – the difference between, for example, 45 and 46 degrees is the same amount of heat as between 87 and 88 degrees. Note: Often, analysts will assume that personal judgement scores such as Performance Appraisal ratings or responses on a questionnaire scale using scores of 1 to 5 are ordinal as it cannot be proven the differences are constant. Other researchers have suggested that these measures can be considered interval in nature for analysis purposes. We will consider performance appraisal ratings to be interval level data for our analysis purposes.
• Ratio: these are interval measures that add a 0 point that means none. For example, 0 dollars in your wallet means no money, while a temperature of 0 degrees does not mean no heat. Ratio level variables include salary, compa-ratio, midpoint, age, service, and raise – even if our measurements do not go down to 0, each measure does have a 0 point that would mean none.
These differences are important, as we can do different kinds of analysis with each level, and attempting to use the wrong level of data in an approach will result in misleading or wrong outcomes. Within our data set, our variables fit into these groups:
• Nominal: ID, gender, gender1 (Merely labels showing a difference)
• Ordinal: Grade, Degree (Can be ordered from low to high; e.g., grade A is the lowest and grade F is the highest grade.)
• Interval: Performance Rating (Constant difference between values, but no meaningful 0 score)
• Ratio: Salary, Compa-ratio, Midpoint, Seniority, Age, Raise (All have a 0 point that means none)
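One way to keep the four levels straight is to record the level of each variable and check it before choosing an analysis. The sketch below is a hypothetical helper written in Python for illustration only. The operation names ("count", "rank", "difference", "ratio") are assumptions made up for this example, reflecting the common textbook view that each level keeps the abilities of the levels before it.

```python
# Levels in NOIR order; each level keeps the abilities of the levels before it.
LEVEL_ORDER = ["nominal", "ordinal", "interval", "ratio"]

# The lowest level at which each kind of operation is meaningful (illustrative names).
MINIMUM_LEVEL = {
    "count": "nominal",        # how many cases carry each label
    "rank": "ordinal",         # ordering from low to high
    "difference": "interval",  # differences between values are constant
    "ratio": "ratio",          # "twice as much" needs a 0 point that means none
}

# The classification of the lecture data set variables given in the text.
VARIABLE_LEVEL = {
    "ID": "nominal", "Gender": "nominal", "Gender1": "nominal",
    "Grade": "ordinal", "Degree": "ordinal",
    "Performance Rating": "interval",
    "Salary": "ratio", "Compa-ratio": "ratio", "Midpoint": "ratio",
    "Seniority": "ratio", "Age": "ratio", "Raise": "ratio",
}


def supports(variable, operation):
    """Return True if the variable's level is high enough for the operation."""
    level = VARIABLE_LEVEL[variable]
    needed = MINIMUM_LEVEL[operation]
    return LEVEL_ORDER.index(level) >= LEVEL_ORDER.index(needed)


print(supports("Salary", "ratio"))              # True
print(supports("Performance Rating", "ratio"))  # False
print(supports("Gender", "rank"))               # False
```

A check like this will not pick the right analysis for us, but it can flag a mismatch, such as trying to rank a nominal code, before it produces a misleading result.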
Difference Between Lectures and Assignments
Now, the lectures and the weekly assignments will BOTH present clues (statistical results) that are needed to answer the research question, and each will answer exactly the same questions. This is to provide examples on how to perform the homework questions. However, the lecture examples will approach the questions with a different focus. The lectures will use the compa-ratio measure of pay, while the class homework will focus on the salary as the examined measure of pay. Both views will help us answer the question of equal pay for equal work, and your interpretation of what the data is telling us each week should include both the lecture and your homework results.
Additionally, while the lectures will use the data set shown above, your homework will have a different data set sample from the company. While some of the values (particularly the salary and compa-ratio measures) will differ, we can consider both samples as representative of the company’s workforce and both equally useful in answering our questions. The reason for this is simple and somewhat unfortunate. In the past, students have copied answers from websites that had the weekly answers rather than doing their own work. To stop this, the class homework data set changes periodically so that previous answers are not correct. However, changing the lecture examples each time the data set changes is a time-consuming task that we elected not to do. So, the lecture examples stay the same, and the weekly assignments differ over time. This has no practical impact on anyone in the class – use the class assigned homework file and read the lectures, and you should be fine.
Week 1 Clues
As suggested above, the first question we need to ask is “do we have any measures that show pay comparisons between our males and females?” To answer this question, our first step when confronted with a mass of data is to develop summary statistics: descriptive statistics and their close cousin, inferential statistics. As the name suggests, descriptive statistics describe the data set. Typical descriptive measures include:
• Measures of centrality such as the average (AKA mean), the median (middle point), and mode (most often occurring value, if it exists).
• Measures of consistency such as range (largest value minus the smallest value), variance, and standard deviation (more on these in lecture 2).
• Measures of location showing where a single data point is within the data set, such as percentile and rank.
• Measures of likelihood showing the probability of obtaining specific outcomes.
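As a rough illustration of these measures, the Python sketch below summarizes the performance appraisal ratings of employees 1 through 10 in the lecture data set. The class performs these steps in Excel, so this is only a sketch; the helper names `rank_from_top` and `percentile_rank` are assumptions made up here, and percentile conventions vary between tools.

```python
import statistics

# Performance appraisal ratings for employees 1 through 10 in the lecture data set.
ratings = [85, 80, 75, 100, 90, 70, 100, 90, 100, 80]

# Measures of centrality
mean_rating = statistics.mean(ratings)      # 87
median_rating = statistics.median(ratings)  # 87.5
mode_rating = statistics.mode(ratings)      # 100 (occurs three times)

# Measure of consistency: largest value minus the smallest value
rating_range = max(ratings) - min(ratings)  # 30


# Measures of location for a single data point
def rank_from_top(values, x):
    """Rank of x when values are ordered from highest to lowest (1 = highest)."""
    return sum(v > x for v in values) + 1


def percentile_rank(values, x):
    """Percent of values at or below x (one common convention among several)."""
    return 100 * sum(v <= x for v in values) / len(values)


# Measure of likelihood: chance that a randomly picked rating here is 90 or more
share_at_least_90 = sum(r >= 90 for r in ratings) / len(ratings)

print(mean_rating, median_rating, mode_rating, rating_range)
print(rank_from_top(ratings, 85), percentile_rank(ratings, 85), share_at_least_90)
```

With only ten employees these figures are not yet clues about the whole company; they simply show what each kind of measure reports.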
Descriptive statistics describe a particular data set, and can only be used for that data set. However, often we want to use a sample to infer back to a larger population. In this case, we would use inferential statistics. Most measures, except for variance and standard deviation, are calculated the same way. We will see the difference for those two in lecture 2. The key to whether we have descriptive statistics or inferential statistics lies with the group we are taking the measures on. If we are only concerned with that group, we use descriptive statistics. If, however, we want to use that group to make inferences, claims, and conclusions about a larger population, then we take a random sample from the population and use inferential statistics (allowing us to infer back to the population). Our class data sets – both the lecture and homework – are random samples from a larger population, so we will basically be using inferential statistical measures.
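The variance and standard deviation difference mentioned above can be previewed with Python's standard `statistics` module, which provides both versions. Under the usual convention, the descriptive (population) variance divides the sum of squared deviations by n, while the inferential (sample) variance divides by n - 1. The sketch below is only an illustration using the performance ratings of employees 1 through 10 in the lecture data set; lecture 2 remains the reference for how the class handles these measures.

```python
import statistics

# Performance appraisal ratings for employees 1 through 10 in the lecture data set.
ratings = [85, 80, 75, 100, 90, 70, 100, 90, 100, 80]

# Descriptive view: treat these ten employees as the entire group of interest.
descriptive_variance = statistics.pvariance(ratings)  # divides by n
descriptive_std_dev = statistics.pstdev(ratings)

# Inferential view: treat them as a random sample from a larger population.
inferential_variance = statistics.variance(ratings)   # divides by n - 1
inferential_std_dev = statistics.stdev(ratings)

print(f"Descriptive: variance = {descriptive_variance:.2f}, "
      f"standard deviation = {descriptive_std_dev:.2f}")
print(f"Inferential: variance = {inferential_variance:.2f}, "
      f"standard deviation = {inferential_std_dev:.2f}")
```

The inferential versions come out slightly larger, and the gap shrinks as the sample size grows.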
Please ask your instructor if you have any questions about this material.
When you have finished with this lecture, please respond to Discussion thread 1 for this week with your initial response and responses to others over a couple of days before reading the second lecture for the week.