Week 3 -- PPOL 505 Exercise 2
Analyzing Performance Measures
Think of data as the raw materials that you convert into information. Data by themselves, however, are not likely to be very useful. Column after column of numbers mean very little. To make the data meaningful you need to organize and present them, that is, the data need to become information. As you skim a newspaper, listen to a presentation, or read a report you may mistakenly assume that creating the graphs and statistics took little work. This is not true. Someone thought through how to present the data so that you and others could quickly understand and interpret them.
One of your tasks as a program manager is to decide how to organize and present data. There is no single best way to graph or analyze a set of data. You may create several graphs and try different statistics as you search for patterns that make sense. In this chapter we focus on the basic tasks for organizing performance data: entering data into a spreadsheet, creating tables and graphs, and describing variations in individual variables. The same skills apply to surveys, program evaluations, and community assessment, which we cover in later chapters. Also note that in Chapter 8 we will discuss analyzing relationships between and among variables. First, however, we will cover the terminology of measurement scales. Familiarity with these terms will facilitate our discussion of various statistics in this chapter and later.
Measurement scales or levels of measurement describe the relationship among the values of a variable. You will find the terminology associated with measurement scales useful as you decide what statistics to use. The basic scales are nominal, ordinal, interval, and ratio scales.
Nominal scales identify and label the values of a variable. You cannot place the values of a nominal variable along a continuum; nor can you rank individual cases according to their values. Even though numbers are sometimes assigned, these numbers have no particular importance beyond allowing you to classify and count how many cases belong in each category. For example, imagine an organization, the Happy Housing Center, records why people seek its services. The variable Reason for Seeking Services has four values: “Laid off or lost job,” “Rental housing needs repairs,” “Rent increased,” “Eviction.” A nominal scale reports how many requests for services are in each category:
1 = Laid off or lost job
2 = Rental housing needs repairs
3 = Rent increased
4 = Eviction
The numbers are simply a device to identify categories; letters of the alphabet or other symbols could replace the numbers and the meaning of the scale would be unchanged. Remember, too, that values of nominal scales are not ranked. Thus, the numbering system in our example does not imply that an eviction has a greater or lesser value than being laid off.
Ordinal scales identify and categorize values of a variable and put the values in rank order. Ordinal scales rank the values without regard to the distance between values. Ordinal scales report that one case has more or less of the characteristic than another case does. If you can rank values but do not know how far apart they are, you have an ordinal scale. You assign numbers to the values in the same order as the ranking implied by the scale. For example, the value represented by 3 is greater than the value represented by 2, and the value represented by 2 is greater than the value represented by 1. The numbers indicate only that one value is more or less than another; they do not imply that a value represents an amount. Let’s look at how you could assign numbers to respondents’ answers to the statement “The Happy Housing Center staff provided me with accurate information.”
5 = Strongly agree
4 = Agree
3 = Neither agree nor disagree
2 = Disagree
1 = Strongly disagree
You can use other numbering schemes as long as the numbers preserve the rank order of the categories. For example, you could reverse the order and number Strongly agree as 1 and Strongly disagree as 5. Alternatively, you could skip numbers and number the categories 10, 8, 6, 4, and 2. Because you cannot determine the distance between values, you cannot argue that a client who answers “Strongly agree” to all items is five times more satisfied than a client who answers “Strongly disagree” to all items.
Rankings commonly produce an ordinal scale. For example, a supervisor may rank 10 employees and give the best employee a 10 and the worst a 1. The persons rated 10 and 9 may be exceptionally good, and the supervisor may have a hard time deciding which one is better. The employee rated with 8 may be good, but not nearly as good as the top two. Hence, the difference between employee 10 and employee 9 may be very small and much less than the difference between employee 9 and employee 8.
Interval and ratio scales assign numbers corresponding to the magnitude of the variable being measured. Interval scales do not have an absolute zero; ratio scales do. The most common example of an interval scale is the temperature scale. We know that 40°F is 20° warmer than 20°F, but we cannot say that 40° is twice as warm as 20°. In the Fahrenheit temperature scale, zero is an arbitrary point; heat exists at 0°F.
The numbers you assign to a ratio scale could be the actual number of persons working in an agency, the number of homeless persons in a given year, or the amount of per capita income in a city. You can add or subtract the values in ratio scale. If the Happy Housing Center had 100 service requests in January and 50 in February, you can say that the number of requests fell by 50 in February. You may also note that the center had half as many requests in February. And at zero, there are no service requests. Table 5.1 summarizes information on these four levels of measurement.
In practice, the boundaries between ordinal and interval scales and interval and ratio scales may be blurred. If an ordinal scale has a large number of values, analysts may assume that it approximates an interval scale. Similarly, the summed values from a set of questions, such as the six questions that measured orientation quality in Chapter 2, may be treated as a ratio scale.
You may mistakenly assume that categories consisting of numerical data form a ratio scale. They do not. Rather, such a scale would be ordinal. For example, the following categories constitute an ordinal scale: under 20 years old, 20 to 29 years old, 30 to 39 years old, and so forth. The exact distance, that is, the age difference between any two people, cannot be determined. While we know a person who checks off 20 to 29 years old is younger than a person who checks off 30 to 39 years old, the age difference between them could be a few days or nearly 20 years.
TABLE 5.1
Levels of Measurement
ENTERING DATA ON A SPREADSHEET
The first step in organizing data is to enter them into a spreadsheet. A spreadsheet may be a piece of paper with rows and columns or a software tool, such as Microsoft Excel. When you open an electronic spreadsheet a workbook with rows and columns appears on the computer screen. Each column can represent a variable and each row can represent a case. You can enter numbers or text into each cell. Entering data is relatively easy. In addition to storing data most spreadsheet programs can calculate basic statistics and create graphs. The graphs can be moved into text documents as part of a report. The data can be readily transferred into a sophisticated statistical software package for more intense analysis.
Before you enter the data you should have a plan for how you will use them. You will want to decide whether to enter words or letters as opposed to numbers. Statistical analysis is easier with numbers, but words and letters are easier to understand. To illustrate other common decisions, consider the data on Robert and Anne, two clients who visited a Happy Housing Center. The counselor who worked with them recorded the following.
Name | Gender | Age | Reason for Requesting Services
F. Robert | Male | 54 | Needs Help Obtaining Apartment
P. Anne | Female | 39 | Received Eviction Notice
The names and the values of the variables could all be entered as text (age is already a number), but statistical analysis may be easier if numbers represent the values for gender and reason for requesting services. A “1” may be entered for Male and a “2” for Female; “1” may be assigned to Needs Help Obtaining Apartment, “2” to Received Eviction Notice, and so on. If numbers were used, the information for Robert and Anne would look like this:
Name | Gender | Age | Reason for Requesting Services
F. Robert | 1 | 54 | 1
P. Anne | 2 | 39 | 2
You can decide what numbers to assign to a value. Instead of “1” for males and “2” for females, other numbers could be used. If the values constitute an ordinal scale, the numbers assigned should correspond to a continuum. Responses ranging from Very satisfied to Very dissatisfied could just as easily have been coded “5” for Very dissatisfied and “1” for Very satisfied.
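The coding step described above can be sketched in a few lines of Python. This is only an illustration: the dictionary names `GENDER_CODES` and `REASON_CODES` and the function `encode` are hypothetical, and the numeric codes simply follow the example in the text.

```python
# Hypothetical codebooks; the numeric codes follow the example in the text.
GENDER_CODES = {"Male": 1, "Female": 2}
REASON_CODES = {
    "Needs Help Obtaining Apartment": 1,
    "Received Eviction Notice": 2,
}

# The two client records as the counselor wrote them down.
clients = [
    {"name": "F. Robert", "gender": "Male", "age": 54,
     "reason": "Needs Help Obtaining Apartment"},
    {"name": "P. Anne", "gender": "Female", "age": 39,
     "reason": "Received Eviction Notice"},
]


def encode(client):
    """Return a copy of one record with text values replaced by codes."""
    return {
        "name": client["name"],
        "gender": GENDER_CODES[client["gender"]],
        "age": client["age"],  # already a number, left unchanged
        "reason": REASON_CODES[client["reason"]],
    }


coded = [encode(c) for c in clients]
for row in coded:
    print(row)
```

Keeping the codebook in one place makes it easy to change the numbering later or to translate the codes back into words for a report.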
Data for a given case may be incomplete. Data in existing records may be missing, a respondent may have refused to answer a question, or some items on a form may not be legible. A missing value may be represented by “missing,” or “not applicable,” or a 9, 99, or another number that is noticeably larger than the other values. A large number may enable you to easily identify and track missing data. It also reduces the risk of including missing data in statistical calculations, because a 9 or 99 will often yield a statistical result that doesn’t make sense. For example, assume that the Housing Center also recorded the number of times a client had been evicted. If the arithmetic average for this turned out as a high number, say 65, then you would know something was wrong. However, if a missing answer for this variable was entered as 0, then you might not notice that this value was included in calculating the mean.
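To see why a large missing-value code is easy to catch, consider the short Python sketch below. The eviction counts are invented for illustration, and the sentinel value 99 is only one of the conventions mentioned above.

```python
from statistics import mean

MISSING = 99  # sentinel for a missing answer, one convention from the text

# Hypothetical eviction counts for eight clients; two answers are missing.
evictions = [0, 1, 0, 2, MISSING, 1, 0, MISSING]

# Keep only the cases with a real answer.
valid = [x for x in evictions if x != MISSING]

naive_mean = mean(evictions)  # wrongly treats 99 as a real count
clean_mean = mean(valid)      # excludes the missing answers

print(f"Mean with sentinel included: {naive_mean:.2f}")
print(f"Mean with sentinel excluded: {clean_mean:.2f}")
print(f"Valid cases: {len(valid)} of {len(evictions)}")
```

The first mean is far larger than any plausible number of evictions, which signals that missing codes slipped into the calculation. Had the missing answers been entered as 0, the error would have been much harder to spot.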
COUNTING THE VALUES: FREQUENCY DISTRIBUTIONS
The first step in any analysis is to obtain a frequency distribution for each variable. A frequency distribution lists the values or categories for the variable and the number of cases with each value. For example, a frequency distribution of gender in the database containing Robert’s and Anne’s information would report the number of males and females. A frequency distribution for age would report the number of respondents who were 22, 23, 24 years old, and so on. More likely you would combine the age categories, for example, less than 20 years old, 20–29 years old, 30–39 years old, and so on. The categories must be set so that you count each case in one and only one category.
Some variables lend themselves to grouping. If the differences between some groups’ values are of little interest, you may combine responses. For example, you may combine “Strongly Agree” with “Agree” and “Strongly Disagree” with “Disagree.” A variable, such as income, may have so many values that you cannot easily discern how it varies without grouping the values in some way. Presenting each value of income separately may result in a large number of values, most of which will have few, if any, cases. Therefore, you need to decide how many categories to create and how wide each category should be. The categories should not be so wide that important differences are overlooked. Nor should they be so narrow that many intervals are required. Using equal intervals to group ages, by decades as shown in Table 5.2, is a common way to create categories, but for some variables equal intervals may mask important differences among cases. A frequency distribution for income will often have many cases in the lower income categories and few in the higher income categories. To give a more accurate picture you may want to create narrower categories at the lower income levels and wider intervals at the higher income levels.
TABLE 5.2
Frequency Distribution of Ages of Happy Housing Center Clients
Age (in Years) | Number of Cases
Less than 20 | 2
20–29 | 6
30–39 | 12
40–49 | 22
50–59 | 20
60–69 | 17
70–79 | 3
Total | 82
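A frequency distribution like Table 5.2 can be produced from individual ages with a short Python sketch. The ten ages below are hypothetical (the raw data behind Table 5.2 are not given), and `age_group` is an illustrative helper that assumes decade-wide categories.

```python
from collections import Counter


def age_group(age):
    """Map an age in years to a decade-wide category label."""
    if age < 20:
        return "Less than 20"
    low = (age // 10) * 10
    return f"{low}–{low + 9}"


# Hypothetical ages for ten clients.
ages = [54, 39, 19, 22, 45, 47, 61, 33, 58, 70]

# Each case falls into one and only one category.
freq = Counter(age_group(a) for a in ages)

for label in ["Less than 20", "20–29", "30–39", "40–49",
              "50–59", "60–69", "70–79"]:
    print(f"{label:<14}{freq[label]:>3}")
print(f"{'Total':<14}{sum(freq.values()):>3}")
```

Because the function returns exactly one label per age, the category counts always add up to the number of cases.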
Relative Frequency Distribution
Relative frequency distributions report the percentage of cases for each value. (A percent is calculated by dividing the frequency of one value of the variable by the total number of cases and multiplying the result by 100.) The percentages allow you to compare the frequency of different values in the same distribution or to compare values in two or more frequency distributions that have different numbers of cases. Almost everyone is familiar with percentages and can quickly interpret them. In our experience politicians and the public are comfortable with findings that are either preceded by a dollar sign or followed by a percent. So don’t hesitate to report percentages and make them the focus of a presentation.
You can report frequency distributions and relative frequency distributions in the same table. Uncluttered tables are easier to read; therefore, you may want to report the relative frequency for each value and include enough information so that a reader can determine the number of cases represented by each percent. You may report the total number of cases as a column total as in Table 5.2, in the table title, or as part of the row or column label as in Table 5.3.
Cumulative Relative Frequency Distribution
You may want to show more than just the percentage of cases in a specific category, or you may want to indicate the percentage of the cases below a given value. A cumulative relative frequency distribution provides this information. To obtain the cumulative percentage, the percentages up to and including the given value are added together. Table 5.3 includes the relative frequency distribution and cumulative percentage distribution for the data in Table 5.2. The heading of the Percentage of Cases column gives the total number of cases (N = 82), so you can multiply the total number of cases by the percentage to learn how many cases are found in each category. Remember to convert the percentage to a decimal before multiplying. The last column reports the cumulative percentage for the distribution of ages. Note, for example, that 24.4 percent of the clients seeking housing assistance were under the age of 40. Note also that the Number of Cases column of Table 5.2 is omitted from Table 5.3.
TABLE 5.3
Distribution of Client Age in Happy Housing Center
Age (in Years) | Percentage of Cases (N = 82) | Cumulative Percentage
Less than 20 | 2.4 | 2.4
20–29 | 7.3 | 9.8
30–39 | 14.6 | 24.4
40–49 | 26.8 | 51.2
50–59 | 24.4 | 75.6
60–69 | 20.7 | 96.3
70–79 | 3.7 | 100.0
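The percentages can be reproduced from the counts in Table 5.2 with the Python sketch below. It accumulates the counts first and converts to a percentage afterward; this is a design choice made here so that rounding errors do not pile up, which means a cumulative figure can differ by 0.1 from the sum of the rounded percentages.

```python
# Counts from Table 5.2, in category order.
counts = {
    "Less than 20": 2,
    "20–29": 6,
    "30–39": 12,
    "40–49": 22,
    "50–59": 20,
    "60–69": 17,
    "70–79": 3,
}

total = sum(counts.values())

rows = []
running = 0  # cumulative number of cases so far
for label, n in counts.items():
    running += n
    percent = round(100 * n / total, 1)
    cumulative = round(100 * running / total, 1)
    rows.append((label, percent, cumulative))

print(f"{'Age (in Years)':<16}{'Percent':>9}{'Cumulative':>12}")
for label, percent, cumulative in rows:
    print(f"{label:<16}{percent:>9.1f}{cumulative:>12.1f}")
```

To go the other way, multiply the total by the percentage expressed as a decimal: 82 × 0.146 is about 12 cases in the 30–39 category.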
Once you know the frequencies, you need to decide how to present them. Visual presentations of data often illustrate points more clearly than do verbal descriptions. Some people who are not comfortable with tables and statistics seem to have no trouble interpreting a well-done graph. Visual displays1 should
■ serve a clear purpose: to describe, explore, or elaborate;
■ show the data and make them coherent;
■ encourage the eye to compare different pieces of data;
■ entice the reader or listener to think about the information;
■ avoid distorting what the data have to say; and
■ enhance the statistical and verbal descriptions of the data.
A graph should be able to stand on its own. You should give each graph a descriptive title and label its variables and their values. Use footnotes to clarify any terms that need an explanation to be interpreted correctly and to identify the source of the data. You will find it useful to indicate the date when a graph was produced because, as you analyze the data, you may create multiple versions as you correct errors and make changes in how you combine values. If graphs are not dated you can lose time trying to determine which version is the most recent.
A pie chart consists of a complete circle, or pie, with wedges. The circle represents 100 percent of the values of the variable displayed. The size of each wedge or “slice of the pie” corresponds to each value’s percentage of the total. Figure 5.1 depicts a pie chart showing the primary reason why people requested services from the Happy Housing Center.
A common convention is to place the largest slice of the pie at the 12 o’clock position. The other slices should follow in a clockwise direction according to size, with the smallest slice last. This rule, however, does not apply if you create two pie charts to make a comparison, for example, if you want to compare the reasons for requesting services from the Happy Housing Center last year and 10 years ago.
Pie charts work well for oral presentations and short written reports, because an audience can quickly understand the depicted information. They may effectively illustrate differences among organizations, locations, and dates. You should avoid using a pie chart that requires a large number of slices. An audience may find it difficult to differentiate between the slices, especially if it is comparing two charts. Furthermore, with many slices you may have trouble finding distinctive colors or patterns to distinguish one slice from another.
FIGURE 5.1 Reason for Visiting Homeless Office – Pie Chart
Bar graphs are alternatives to pie charts. You place the value of the variable along one axis and the frequency or percentage of cases along the other. The length of the bar indicates the number or percentage of cases possessing each value of the variable. Figure 5.2 depicts a bar graph that includes the same data as the Figure 5.1 pie chart.
Whether the bars are vertical or horizontal depends on which arrangement communicates more effectively and clearly. All bars in a graph should have the same width. If you use different widths for the bars, you risk implying that some values are more important than others, which is misleading. You can also use bar charts to compare organizations, locations, or dates. For example, if you had historical data you could compare this year’s data in Figure 5.2 to data from 10 years ago. To do this you would place a bar representing the percentage laid off 10 years ago next to the first column. The percentage whose rental house needs repairs would go next to the second column and so on.
A histogram represents ratio variables. Figure 5.3 shows an example of a histogram. Each column of the histogram represents a range of values; for example, in Figure 5.3 the second column represents the range 20–24 years old, and the third column the range 25–29 years old. The columns adjoin one another because the range of values is continuous. The variable and its values are displayed along the horizontal axis, and the frequency or percentage of cases is displayed along the vertical axis. A histogram is similar to a bar graph, but unlike the bars of a bar graph its columns can vary in width. For example, the widths would vary if the age groupings for the columns were as follows: less than 20, 20–24, 25–29, 30–44, 45–59, and 60 and over.
FIGURE 5.2 Reason for Visiting Homeless Office – Bar Chart
FIGURE 5.3 Age of Clients of Homeless Office
To track performance over time you will want to use time series graphs. They are valuable to monitor performance, show changes, and demonstrate the impact of a policy. Users can easily and quickly discern changes from one time period to another, trends over time, and the frequency and extent of irregular fluctuations. (Recall that Chapter 4 discussed the changes over time you should look for.) To create a time series graph you put time, whether it is days, months, or years, on the horizontal axis, and the values of the variable on the vertical axis. For each time period place a dot at the intersection of the time period and the variable’s value, and draw a line to connect the dots. Figure 5.4 shows an example of a time series graph.
FIGURE 5.4 Number of Clients by Month
We were tempted to title this section “putting your elementary school math to use.” Calculation of rates and percentage changes requires only basic math skills. Both rates and percentage changes contain valuable information that policy makers and the public can understand and react to.
Rates report the number of cases experiencing an event as a proportion of the number of cases that could have experienced that event over a specific time period. Commonly reported rates include the unemployment rate and the crime rate. The unemployment rate reports the number of unemployed individuals as a percentage of the number who could have been employed (employed + unemployed). (The discerning reader will note that the number who “could have been employed” must be carefully defined. One common definition includes the “employable population actively looking for work.” This would exclude those who are not seeking employment.) You may report rates as percents or use a base number other than 100. For example, cities, states, and nations report the annual rates of violent crimes as the number of occurrences for every 1,000 residents.
Assume that a county agency wants to compare the extent of homelessness in its community with that of other jurisdictions. Knowing the number of homeless may be valuable, but knowing how the problem in large cities compares with that in small cities or suburbs is also important. Rates allow such comparisons even though the cities may vary greatly in size.
An important decision is what to put in the denominator, that is, the number who could have been homeless. For many rates, the denominator is population size, but not always. As our definition of rates implies, the denominator for the unemployment rate excludes certain population groups, such as the very young. The selection of the denominator may appear to be somewhat arbitrary. Take, for example, contraceptive use. Comparing data on contraceptive use by putting the entire population (which also includes men and children) in the denominator would not give an accurate picture. A far better method would be to use either the number of women of childbearing age or the number of married women. Either denominator more accurately estimates the number at risk; at risk is another way of thinking about the number of possible occurrences. Deciding between the number of women of childbearing age and the number of married women may largely depend on the availability of data.
Consider two counties, Moburg and Robus, and the number of infant deaths in each. In one year Moburg had 104 infant deaths and Robus had 20. Which community had the greater problem? Directly comparing the number of deaths would be misleading because Moburg has 511,400 inhabitants and Robus has 106,000. Dividing the frequency of infant deaths in each county by the county population produces a more useful comparison. For Moburg we divided 104 by 511,400, which equals 0.0002034. For Robus, we divided 20 by 106,000 and obtained 0.0001887. The decimal values are so small that they might be ignored or interpreted incorrectly. Multiplying each decimal by 10,000 converts the data into figures that are more easily understood. We would report that Moburg has 2.034 infant deaths per 10,000 inhabitants and Robus has 1.887 infant deaths per 10,000 inhabitants.
The equation to compute a rate is
Rate = (N1 / N2) × Base number
where
N1 = count for variable of interest
N2 = population or another indicator of number of cases at risk
Base number = a multiple of 10
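The rate calculation for Moburg and Robus can be sketched in Python as follows. The function name `rate` and its parameter names are illustrative; the counts, populations, and base number of 10,000 come from the example in the text.

```python
def rate(count, at_risk, base=10_000):
    """Return the number of events per `base` cases at risk.

    count   -- N1, the count for the variable of interest
    at_risk -- N2, the population or other number of cases at risk
    base    -- the base number, a multiple of 10
    """
    if at_risk <= 0:
        raise ValueError("the number of cases at risk must be positive")
    return count / at_risk * base


# Infant deaths and county populations from the example.
moburg = rate(104, 511_400)
robus = rate(20, 106_000)

print(f"Moburg: {moburg:.3f} infant deaths per 10,000 inhabitants")
print(f"Robus:  {robus:.3f} infant deaths per 10,000 inhabitants")
```

Using the same default base for both counties keeps the two rates directly comparable.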
The following conventions apply in selecting a base number. Remember that these are conventions, not absolute rules.
■ Be consistent and report rates in common use for a specific variable. Crime rates, for example, are usually reported as crimes per 1,000 of the population. Homicide rates, however, may be reported per 100,000 of the population.
■ Select a base number that produces rates with a whole number with at least one digit and not more than four digits.
■ Use the same base number when calculating rates for comparison. For instance, in the example just mentioned, you should not use a base number of 1,000 for Moburg and a base number of 10,000 for Robus.
■ Note that a rate is meaningful only if it is specified for a particular time period, usually a year.
As noted earlier, you should also consider whether the entire population is the appropriate denominator for a rate.
The percentage change measures the amount of change between two points in time. For example, organizations in Moburg County have a campaign to reduce homelessness every year over the next decade. If they know the number of homeless people in any two years, they can report the change as a percent.
The formula for percentage change is
Percentage change = ((N2 − N1) / N1) × 100
where
N1 = value of the variable at time 1
N2 = value of the variable at time 2
For example, assume that 5,000 individuals were homeless in the first year (time 1) and 4,325 the second year (time 2). The calculations to determine the percentage change would be as follows:
Percentage change = ((4,325 − 5,000) / 5,000) × 100 = (−675 / 5,000) × 100 = −13.5%
The percentage change can be positive or negative. If the number of homeless in the second year was 5,075, the percentage change would be 1.5%, indicating a 1.5 percent increase in the homeless population.
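Both homelessness examples can be checked with a small Python sketch. The function name `percentage_change` is illustrative; the sketch assumes the change is measured relative to the value at time 1, which is consistent with the 1.5% result reported above.

```python
def percentage_change(n1, n2):
    """Return the change from n1 (time 1) to n2 (time 2) as a percent of n1."""
    if n1 == 0:
        raise ValueError("the value at time 1 must not be zero")
    return (n2 - n1) / n1 * 100


decrease = percentage_change(5_000, 4_325)  # homeless count fell
increase = percentage_change(5_000, 5_075)  # homeless count rose

print(f"5,000 to 4,325: {decrease:+.1f}%")
print(f"5,000 to 5,075: {increase:+.1f}%")
```

A negative result indicates a decrease and a positive result an increase.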
CHARACTERISTICS OF A DISTRIBUTION
While a frequency distribution includes useful information, you may want a simple statistic to summarize its content. Measures of central tendency reduce the distribution to a value that represents a typical case, the center of the distribution, or both. Measures of central tendency give an incomplete picture of how typical the typical case is. Focusing only on a typical case may be misleading, since cases may be widely different from one another. Measures of variability fill in this gap and add to your knowledge about the distribution. Measures of variability show how representative the typical case is by giving you information on how spread out or dispersed all of the cases are and how far they are from a central point. Several statistics are used to measure central tendency and variation. The choice of a statistic depends on the level of measurement of the variable and what information you will find most valuable. Note that measures of central tendency and variability are the same for interval and ratio scales, so to simplify our discussion we will refer only to ratio scales.
Measures of central tendency indicate the value that is representative, most typical, or central in the distribution. The most common measures are the mode, median, and arithmetic mean.
Mode: The simplest summary of a variable’s frequency distribution is to indicate which category or value is the most common. The mode is the value or category of a variable that occurs most often. In a frequency distribution it is the value with the highest frequency. Table 5.4 shows the distribution of a variable of interest to the Happy Housing Center, Reason for Requesting Services. The mode is Needs Help Finding an Apartment. More clients came for that reason than for any other. Of course that reason has the highest relative frequency as well.
The mode can be determined for all measurement scales: nominal, ordinal, interval and ratio. If two values have the same frequency, the variable has two modes and is said to be bi-modal. A common mistake is to confuse the frequency of the modal category with the mode. For instance, the mode for the Reason for Requesting Services in Table 5.4 is Needs Help Finding an Apartment. It is not 13. For nominal and many ordinal variables, the category that occurs most often will have a name; for ratio variables, the value of the mode will be a number.
TABLE 5.4
Frequency and Percentage Distributions for Reason For Requesting Services From Housing Center.
Reason for Requesting Services | Number of Clients (N = 50) | Percent
Has Been Evicted | 7 | 14
Needs Help Finding an Apartment | 13 | 26
Rent Has Been Increased | 8 | 16
Lost Job | 12 | 24
Rental House Needs Repairs | 10 | 20
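The distinction between the mode and the frequency of the modal category can be illustrated with a short Python sketch using the counts in Table 5.4. The variable names are illustrative. The sketch returns a list so that a bi-modal distribution would report both modes.

```python
# Frequencies from Table 5.4.
counts = {
    "Has Been Evicted": 7,
    "Needs Help Finding an Apartment": 13,
    "Rent Has Been Increased": 8,
    "Lost Job": 12,
    "Rental House Needs Repairs": 10,
}

# The highest frequency is NOT the mode; it only identifies the mode.
highest_frequency = max(counts.values())

# The mode is the category (or categories) with that frequency.
modes = [category for category, n in counts.items()
         if n == highest_frequency]

print("Mode:", ", ".join(modes))
print("Frequency of the modal category:", highest_frequency)
```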
Median: The median is the value or category of the case that is in the middle of a distribution in which the cases have been ordered along a continuum. It is the value of the case that divides the distribution in two; one-half of the cases have values less than the median and one-half of them have values greater than the median. The median requires that variables be measured at the ordinal or ratio level. To find the median, as mentioned, you must order the case values along a continuum. It makes no sense to find the middle case if the cases have not been ordered according to their values on the variable of interest.
To find the median, you locate the middle case in a distribution. If the number of cases is odd, then the median is the value of a specific case. If the number of cases is even, then the median is estimated as the value halfway between two cases. For example, with 11 cases, the median is the value of the 6th case. If there are 12 cases the median is halfway between the value of case number 6 and case number 7. The formula for finding the middle case is (N + 1)/2.
Two examples follow. Table 5.5 shows the distribution of an ordinal variable. The table reports 11 clients’ ratings of a Happy Housing Center transportation program.
The middle case is case number 6, which is determined by dividing the number of cases plus 1 by 2: (N + 1)/2 = (11 + 1)/2 = 6. This case has the value Neither Good Nor Poor. The median for this variable, then, is Neither Good Nor Poor. One-half of the cases rated the transportation program as Neither Good Nor Poor or better, and one-half rated it as Neither Good Nor Poor or worse. Table 5.5 has two values with the largest number of ratings, Neither Good Nor Poor and Very Poor, each with three ratings. This is an example of a bi-modal distribution with both Neither Good Nor Poor and Very Poor as modes.
TABLE 5.5
Ranking of Transportation Support
Client | Rating | Code in Excel File
B | Very Good | 5
E | Very Good | 5
H | Good | 4
G | Neither Good Nor Poor | 3
C | Neither Good Nor Poor | 3
J | Neither Good Nor Poor | 3
D | Poor | 2
F | Poor | 2
A | Very Poor | 1
I | Very Poor | 1
K | Very Poor | 1
Table 5.6 shows the frequency distribution for a ratio variable, the number of months spent in temporary shelter. The data have been arranged according to increasing values. Since the number of cases, 16, is even, the middle case is case 8.5; the middle of the distribution is between case number 8 (12 months) and case number 9 (13 months). Therefore, the value of the median is 12.5 months.
TABLE 5.6
Number of Months in Temporary Shelter
Number of Months | 7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18
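Both median examples can be reproduced with Python's standard `statistics` module. In this sketch `median` averages the two middle values when the number of cases is even, while `median_low` returns an actual data value, which is the safer choice for ordinal codes. The variable names are illustrative.

```python
from statistics import median, median_low

# Ratio variable: months in temporary shelter (Table 5.6), 16 cases.
months = [7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18]
middle_position = (len(months) + 1) / 2   # 8.5, between cases 8 and 9
months_median = median(months)            # halfway between 12 and 13

# Ordinal variable: transportation ratings (Table 5.5), 11 cases.
labels = {5: "Very Good", 4: "Good", 3: "Neither Good Nor Poor",
          2: "Poor", 1: "Very Poor"}
codes = [5, 5, 4, 3, 3, 3, 2, 2, 1, 1, 1]
rating_median = labels[median_low(codes)]  # value of the 6th ordered case

print("Middle position (months):", middle_position)
print("Median months in shelter:", months_median)
print("Median rating:", rating_median)
```

Both functions sort the data internally, so the cases do not have to be ordered before the call.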
The mode for Table 5.6 is 7 months. If we only use the mode, the additional information provided by the median is lost. The median is the preferred measure of central tendency for ordinal variables and for ratio variables that have a few extreme values. A distribution with a few extreme values is said to be skewed. Since the median is little affected by extreme numerical values, it gives a more accurate picture of central tendency than the arithmetic mean, which is discussed next.
Arithmetic Mean: A third measure of central tendency is the arithmetic average or arithmetic mean. When you first learned this measure in school you simply called it “the average.” The mean is the appropriate measure of central tendency for variables measured at the ratio level. To calculate the mean, add the values of the variable for all cases and divide the sum by the number of cases. The formula for the arithmetic mean is
Mean = Σxi / N
where Σxi indicates the sum of the values of all the cases and N indicates the number of cases. The data in Table 5.6 have a mean of 12.38 months (sum of the values of all cases/number of cases = 198/16).
If you are measuring a variable representing resources, beware that the mean can seriously misrepresent the data. Such resources include income, value of stocks, bonds or real estate, or size of land holdings. If a data set in a small community included a person with an income well over $1,000,000 the mean may greatly overestimate the typical income of the remaining less-fortunate members of the community. With measures of resources and other skewed data the median is the appropriate measure of central tendency.
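The Python sketch below computes the mean for the Table 5.6 data and then illustrates how a single extreme value pulls the mean away from the typical case. The five incomes are invented for illustration and are not from the text.

```python
from statistics import mean, median

# Months in temporary shelter from Table 5.6.
months = [7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18]
months_mean = sum(months) / len(months)   # 198 / 16

# Hypothetical incomes in a small community; one is well over $1,000,000.
incomes = [28_000, 31_000, 35_000, 40_000, 1_200_000]
income_mean = mean(incomes)
income_median = median(incomes)

print(f"Mean months in shelter: {months_mean:.2f}")
print(f"Mean income:   ${income_mean:,.0f}")
print(f"Median income: ${income_median:,.0f}")
```

In this made-up community the mean income is several times larger than what four of the five residents earn, while the median stays close to the typical case.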
Measures of Variation and Dispersion
You may find that policy makers and the general public are uneasy with just one value to summarize a variable. Through education, experience, or skepticism they may suspect that a median or mean doesn’t tell the whole story. So in addition to measures of central tendency you want to see how far the values spread out from the central part of the distribution. Measures of dispersion describe the similarity of the data. Relatively smaller values of the measure of dispersion for a variable imply more uniformity, whereas relatively larger values imply more diversity or variation. Some measures indicate only the difference between two observations in an ordered set of values. Other measures consider all observations in a distribution.
Consider the average number of months that clients in two cities lived in temporary shelter. If you calculate statistics using the data in Table 5.7 you will find the average stays of clients in temporary housing in City A and City B to be very similar. In analyzing the data, however, you can see that individual stays are very different in these two cities. Measures of dispersion will provide more complete information and avoid implying that the lengths of stays in the two cities are virtually identical. Maximum variation for ratio variables occurs when all cases are equally divided between two extreme values. Maximum variation for nominal and ordinal variables occurs when cases are evenly distributed across all categories. The more the cases are clustered in one category, the less the variation. If all cases were in one category, then variation would be zero. Common measures of dispersion are the range, the inter-quartile range, and standard deviation.
TABLE 5.7
Number of Months Clients Spent in Temporary Shelter by City
City A | 7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18
City B | 3, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 16, 17, 21, 22, 24
Range: The simplest measure of dispersion is the range, which is the difference between the highest value and the lowest value in a distribution. For City A, the range is 18 − 7, or 11 months. For City B the range is larger, 24 − 3 or 21 months. Whether you are reporting variations among regions or respondents, users can easily interpret the range; for some purposes you may find that reporting the highest and lowest values is clearer or more effective. You could report, for example, that the length of stay for City A was from 7 to 18 months and for City B from 3 to 24 months.
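The range calculation is easy to reproduce. This is a minimal sketch using the Table 5.7 data; the function name `value_range` is our own.

```python
# Table 5.7: months clients spent in temporary shelter, by city
city_a = [7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18]
city_b = [3, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 16, 17, 21, 22, 24]

def value_range(values):
    """Return the difference between the highest and lowest values."""
    return max(values) - min(values)

print(value_range(city_a))    # 11
print(value_range(city_b))    # 21

# Reporting the endpoints themselves is sometimes clearer than the difference
print(f"City A: {min(city_a)} to {max(city_a)} months")
print(f"City B: {min(city_b)} to {max(city_b)} months")
```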
Inter-quartile Range: A range can be unduly affected by one value that is well above or well below the other values in the distribution. The inter-quartile range (IQR) identifies the values for the middle 50 percent of cases and is not affected by extreme values. The first quartile is that value below which 25 percent of the cases are found. The third quartile is that value below which 75 percent of the cases are found. The second quartile, of course, is the median. In determining the inter-quartile range, the lowest 25 percent of the observations and the highest 25 percent are omitted (see Figure 5.5). As with the median, determining the quartiles requires that you order the cases according to their values.
Let’s look at City A in Table 5.7. The range of the middle 50 percent of values includes the eight values between 10 and 15. To find the first quartile drop the lowest 25 percent of the 16 cases, that is, the lowest 4 cases. To find the third quartile drop the upper 25 percent of the 16 cases, that is, the 4 cases with the highest values. For this distribution the inter-quartile range would be 15 − 10 = 5. The spread of the middle 50 percent of the cases is 5 months. (Technically the value of the cutoff for the first quartile would be 9.25—25 percent of the way between 9 and 10—and for the third quartile it would be 75 percent of the way between 15 and 16. However, quartiles are often estimated as shown here.)
FIGURE 5.5 Identifying Quartiles
The IQR for City B is 16 − 7 = 9 months. (Its third quartile is 75 percent of the way between 16 and 17, but presenting a rough estimate of the IQR is acceptable.) The middle 50 percent of the cases are spread over a range of 9 months.
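Both the rough estimate and the interpolated quartiles can be sketched in code. The helper `rough_iqr` is our own name for the drop-the-outer-quarters shortcut. The interpolated values rely on `statistics.quantiles`, whose default "exclusive" method happens to use the same between-cases interpolation described for the technical cutoffs; other software may use slightly different rules, so treat the comparison as illustrative.

```python
from statistics import quantiles

# Table 5.7: months clients spent in temporary shelter, by city
city_a = [7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18]
city_b = [3, 3, 4, 5, 7, 8, 10, 11, 12, 13, 14, 16, 17, 21, 22, 24]

def rough_iqr(values):
    """Estimate the IQR by dropping the lowest and highest quarter of cases."""
    ordered = sorted(values)
    drop = len(ordered) // 4                    # 4 of the 16 cases at each end
    middle = ordered[drop:len(ordered) - drop]  # the middle 50 percent
    return middle[-1] - middle[0]

print(rough_iqr(city_a))    # 5
print(rough_iqr(city_b))    # 9

# Interpolated quartile cutoffs (first quartile, median, third quartile)
q1_a, _, q3_a = quantiles(city_a, n=4)
q1_b, _, q3_b = quantiles(city_b, n=4)
print(q1_a, q3_a)    # 9.25 15.75
print(q1_b, q3_b)    # 5.5 16.75
```

The interpolated IQR (for example, 15.75 − 9.25 = 6.5 months for City A) is somewhat wider than the rough estimate, because the cutoffs fall between cases rather than on them.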
By itself the inter-quartile range may have little value. It is most useful to compare distributions from one time to another or between two or more groups. For example, policy makers can examine the change in the annual rates of infant mortality for all counties in one state over time. Or they can compare how those rates varied in any of the 50 states at one time. In both cases the inter-quartile range will provide more information than a measure of central tendency and the range.
Audiences who are less comfortable with statistics or those who are visually oriented seem to become engaged with box plots. A box plot draws a box spanning the inter-quartile range, marks the median (or sometimes the mean) inside the box, and extends lines out to the lowest and highest values. Figure 5.6 presents the city data as box plots.
FIGURE 5.6 Boxplots of Number of Months in Temporary Shelter
Standard Deviation: If you have taken a basic statistics course, you are familiar with the standard deviation. It is the most common measure of the variation in a distribution. The standard deviation is calculated by
■ subtracting the mean from each individual value. These deviation values show how much the value of each case deviates from the mean of the distribution;
■ squaring each deviation value;
■ adding the squared deviations;
■ dividing the sum of squared deviations by the number of cases (N);
■ taking the square root of the divided sum of squared deviations.
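The five steps above can be sketched directly in code. This minimal example uses the City A data from Table 5.7 and divides by N, as the steps describe; the variable names are our own.

```python
from math import sqrt
from statistics import pstdev

# City A, Table 5.7: months in temporary shelter
city_a = [7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18]

n = len(city_a)
mean_a = sum(city_a) / n                      # 12.375

deviations = [x - mean_a for x in city_a]     # step 1: subtract the mean
squared = [d ** 2 for d in deviations]        # step 2: square each deviation
sum_squared = sum(squared)                    # step 3: add them, 203.75
variance = sum_squared / n                    # step 4: divide by N
sd = sqrt(variance)                           # step 5: take the square root

print(sum_squared)      # 203.75
print(round(sd, 2))     # 3.57
print(round(pstdev(city_a), 2))    # 3.57, the library's population formula agrees
```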
Table 5.8 uses the City A data from Table 5.7 to calculate the standard deviation. The mean stay in temporary shelter was 12.38 months.
TABLE 5.8
Calculating Standard Deviation
A major use of the standard deviation is to estimate the variation in a population. It is important in statistics, because you can use data from a sample to estimate the distribution of a population. To estimate the variation in the population you have to assume (or verify) that the frequency distribution is reasonably described by the normal curve. A normal curve is bell-shaped and symmetrical so that the mode, median, and mean all have the same value. When working with a sample you will want to put n − 1 in the denominator to calculate the standard deviation; n − 1 gives a less biased estimate of the population variance. For example, assume that the data represent a random sample of residents staying in temporary shelter in City A. We would put 15 in the denominator, and the recalculated standard deviation would be 3.69. Using the mean and the standard deviation you may estimate the following from the data in Table 5.7:
1. 50 percent of the observations are above the mean and 50 percent are below it.
2. 34.1 percent of the observations fall between the mean and one standard deviation above it, and 34.1 percent fall between the mean and one standard deviation below it. So you would estimate that 68.2 percent of the people who stay at a temporary shelter in City A stay there between 8.69 and 16.1 months.
3. 47.7 percent of the observations fall between the mean and two standard deviations above it, and 47.7 percent fall between the mean and two standard deviations below it; that is, roughly 95 percent of the observations are within two standard deviations of the mean. So you would estimate that 95 percent of the people who stay at a temporary shelter in City A stay there between 5 and 19.8 months.
4. 49.8 percent of the observations fall between the mean and three standard deviations above it, and 49.8 percent fall between the mean and three standard deviations below it; that is, over 99 percent of the observations are within three standard deviations of the mean. So you would estimate that 99 percent of the people who stay at a temporary shelter in City A stay there between 1.31 and 23.4 months.
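The sample standard deviation and the intervals in the list can be reproduced with a short sketch. It assumes, as the text does, that the City A data are a random sample and that the normal curve is a reasonable description. Results may differ in the second decimal place from hand calculations that start from the rounded mean and standard deviation.

```python
from statistics import mean, stdev

# City A, Table 5.7, treated here as a random sample
city_a = [7, 7, 7, 9, 10, 11, 11, 12, 13, 14, 15, 15, 16, 16, 17, 18]

m = mean(city_a)     # 12.375
s = stdev(city_a)    # n - 1 in the denominator
print(round(s, 2))   # 3.69

# Approximate share of a normal distribution within k standard deviations
shares = {1: "about 68 percent", 2: "about 95 percent", 3: "over 99 percent"}
intervals = {}
for k, share in shares.items():
    low, high = m - k * s, m + k * s
    intervals[k] = (low, high)
    print(f"{share}: between {low:.1f} and {high:.1f} months")
```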
As you organize data you begin a learning process. If you began with a logical model you can use the tools discussed in this chapter to describe a program’s inputs, activities, outcomes, and outputs. This information may motivate you and others to identify parts of the program that are working better than expected and parts that are under-performing. You may begin to wonder why the various components are the way they are and what needs to be done to strengthen some components and to mirror the success of others.
You may be stimulated to take action and monitor whether it had an impact.
Frequencies, rates, percentage changes, and measures of central tendency are tools that you will apply time and time again whether you are simply describing a data set or conducting more elaborate studies to see what works and under what conditions. The graphs may be less useful in conducting your own analysis, but they are an essential tool in communicating your findings to others.
Best, J. Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists (Los Angeles, CA: University of California Press, 2001).
Rumsey, D. Statistics for Dummies (Hoboken, NJ: Wiley Publishing, Inc., 2009).
O’Sullivan, E., G. Rassel, and M. Berner. Research Methods for Public Administrators, Fifth edition (New York: Pearson/Longman, 2008). Chapter 11.
Salkind, Neil. Statistics for People Who (Think They) Hate Statistics, Second edition/Excel 2007 (Thousand Oaks, CA: Sage Publications, Inc., 2010).
CHAPTER 5 EXERCISES
There are two separate exercises for Chapter 5. Each exercise develops your competence in interpreting and applying measurement concepts.
• Exercise 5.1 Fresh Start Center focuses on a database for a community job training program. The exercise presents a partial database. You are asked to use graphs and quantitative measures to describe the variables and decide on effective strategies to present the information.
• Exercise 5.2 Purple Flower Neighborhood Association asks you to apply basic skills to examine some crime data.
EXERCISE 5.1 Fresh Start Center
Scenario
Fresh Start Center is a community partnership of a community college, county government, local businesses, and nonprofits to offer job training to unemployed workers. It offers culinary training (training to work in restaurants or with caterers), automotive repairs, and carpentry. You have been asked to help the center develop a plan for monitoring its performance. Table 5.9 includes 25 cases from the first round of data collection. The data report the number of months of employment prior to entering training and the employment status of June 30 graduates as of August 1.
Section A: Getting Started
You may enter the Table 5.9 data in Excel to carry out the following tasks.
1. Create a frequency and relative frequency distribution for each variable. As appropriate, combine categories.
2. Create pie charts for perceived technical skills and employment status.
3. Create bar charts for perceived technical skills and employment status.
4. Create a histogram for months unemployed.
5. For each variable indicate (a) its level of measurement (nominal, ordinal, interval/ratio) and (b) the value of the mode, median, and mean, as appropriate. (Directions are at the end of the exercise.)
6. Create a box plot for months unemployed and perceived technical skills. What do they suggest about the variation (dispersion) of the two variables?
7. Use the data to write a paragraph for the center’s annual report. Would you include graphics? Which one(s)?
8. Which variables should the agency track? Explain your choices.
TABLE 5.9
Fresh Start June 30 Graduates
Skill Rating: Just prior to graduation the trainees rated their agreement with the following statements (4 = Strongly agree, 3 = Agree, 2 = Disagree, 1 = Strongly disagree):
I can choose the right tool or equipment for the job at hand.
I know how to take care of the equipment.
I have the skills I need to do my work.
I have the training needed to do my work.
Section B: Small Group or Class Discussion
1. One of the challenges is to sort through a data set and decide what is important. Although Table 5.9 represents only a partial data set, based on your analysis consider each variable. Which of the following would you use to present the data: a frequency distribution, a graph (which type), a measure of central tendency, a measure of dispersion? Would you categorize the values of Months Unemployed and Skill Rating? Why or why not? If you would categorize them, how would you organize the data?
2. Which variables would you suggest that the center collect regularly? For each variable consider how the center should report the data (as a frequency, percentage, median or mean, or something else). Explain the thinking behind your recommendations.
EXERCISE 5.2 Purple Flower Neighborhood Association
Scenario
The Purple Flower Neighborhood Association’s mission is to “improve neighborhood safety, beautification, and education and to represent community interests to the city of Lilac Fields (population 94,700).” Each year association committees recommend objectives for the next year. The safety committee examined the problems of burglary and vandalism.
Section A: Getting Started
1. Based on newspaper reports of a rash of burglaries, community residents contend that the neighboring towns of Maple Leaf (population 51,600) and Elm Hill (population 12,700) are safer. Last year Lilac Fields had 705 burglaries; Maple Leaf had 215; Elm Hill had 72.
Compare the burglary rates in these three towns. Two years ago Lilac Fields had 549 burglaries. What was the percentage increase?
2. The committee gathered data on reports of vandalism in the neighborhood. Table 5.10 displays these data.
a. Draw a time series graph of these data.
b. Identify the variations in the time series.
c. If a new development opened in Purple Flower in 1999, should the committee track the number of incidents or something else? Explain.
d. To calculate the vandalism rate the committee debates using (i) the number of residences, (ii) the population, or (iii) the number of residents between 10 and 19 years of age for the denominator in the rate formula. Which would you suggest? Why?
TABLE 5.10
Reported Incidents of Vandalism in Purple Flower by Year
Year | Reported Incidents
1994 | 53
1995 | 52
1996 | 51
1997 | 53
1998 | 56
1999 | 57
2000 | 56
2001 | 56
2002 | 57
2003 | 58
2004 | 59
2005 | 58
2006 | 57
2007 | 59
2008 | 60
2009 | 61
CALCULATING STATISTICS WITH EXCEL
After you input the data you can access routines to calculate statistics using the Function Wizard. Depending on your version of Excel, you may have to hunt for the Function Wizard; on some versions you may find function (fx) under “Insert.”
Click on fx; a dialogue box will appear.
Click on arrow by “Search for a Function” or “Select a category”; a drop down menu listing Financial, Statistical, Math & Trig, and so on will appear.
Click on Statistical; a drop down menu listing statistical routines, such as AVERAGE, MEDIAN, and STDEV will appear.
Click on the function you want; another dialogue box will appear.
Enter the range of cells containing the data to analyze. For example, if the Months Unemployed data are in column B, rows 2–26, put B2:B26 in the first line in the dialogue box. You should get 5.04 for the value of the arithmetic mean of the number of months unemployed.
NOTE
1. Edward R. Tufte, The Visual Display of Quantitative Information (Cheshire, CT: Graphics Press, 1983), p. 13.
Describing Relationships Among Variables
After you have received completed surveys and forms and entered them in a database you must decide how to analyze them. You may start with the tools we covered in Chapter 5 and examine the frequency distribution of each variable. At a minimum this enables you to catch data entry errors and to identify insensitive measures, if any. In this chapter we cover tools you can use to examine relationships between two or more variables. We start with contingency tables, which effectively display relationships between variables measured at the ordinal or nominal level. We next cover linear regression, which examines the relationship among variables measured at the ratio or interval level. In connection with linear regression we also introduce correlation coefficients (r and R), measures of association that summarize the strength of a relationship. This chapter contains tools that can be appropriately used to analyze data from an entire population, a probability sample, or a nonprobability sample. Chapter 9 covers inferential statistics, which allow us to infer from a probability sample whether a relationship exists in the population.
ANALYZING NOMINAL AND ORDINAL VARIABLES: CONTINGENCY TABLES
You will find that exploring how variables are related is the most interesting part of research. After you first posed a research question you may have hypothesized how the values of one variable related to the values of another. You may wonder if the values of the independent variable are associated with the values of the dependent variable. For data measured at the ordinal or nominal level you may get the most information by organizing them in a contingency table. A contingency table shows the frequency or relative frequency of each value of the dependent variable for each value of the independent variable.
By convention, values of the independent variable head the columns and the values of the dependent variable head the rows. If the variables are ordinal, you should arrange the values along a continuum from low to high values. If the variables are nominal, however, no statistical consideration exists that recommends one arrangement over another. Instead, you can use your judgment to order the values. To avoid table clutter, you may want to include the number of respondents as part of the column label and report the relative frequencies, the percentages, in each cell so the column percentages sum to 100. If readers want to know the number of cases in each cell, they can multiply the column total by the cell percentage. We illustrate this approach in Table 8.1, a contingency table relating respondents’ opinion about local government service with their support for creating a fund to build affordable housing.
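The arithmetic behind a contingency table can be sketched in a few lines. The responses below are hypothetical and were made up for illustration; they are not the survey data behind Table 8.1. The function name `column_percentages` is also our own. Each column (a value of the independent variable) is treated as 100 percent.

```python
from collections import Counter

# Hypothetical (opinion of services, support for fund) pairs; illustrative only
responses = (
    [("Excellent or Good", "For")] * 6
    + [("Excellent or Good", "Against")] * 4
    + [("Poor", "For")] * 3
    + [("Poor", "Against")] * 7
)

def column_percentages(pairs):
    """Return {independent value: {dependent value: percent of that column}}."""
    cell_counts = Counter(pairs)
    column_totals = Counter(independent for independent, _ in pairs)
    table = {}
    for (independent, dependent), count in cell_counts.items():
        column = table.setdefault(independent, {})
        column[dependent] = 100 * count / column_totals[independent]
    return table

table = column_percentages(responses)
for independent, column in table.items():
    print(independent, column)
# Excellent or Good {'For': 60.0, 'Against': 40.0}
# Poor {'For': 30.0, 'Against': 70.0}
```

Because each column sums to 100, a reader can recover the cell counts by multiplying the column total by the cell percentage, for example 10 × 0.60 = 6 supporters in the first column.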
Managers and researchers use contingency tables to analyze relationships between variables. They may do this by identifying how attitudes, opinions, and support vary among different groups. For example, affordable housing advocates wanted to know if public opinion about local government was related to support for a building fund. They were most interested in the difference between those who gave local government services high ratings and those who did not. They grouped opinions about local government services into three groups: “Excellent or Good,” “Fair,” and “Poor.” The total in each group was treated as 100 percent, and the relative frequency of each group that responded with “For,” “No opinion,” and “Against” was computed. Each reader may take away different information from a contingency table. For example, write down what Table 8.1 shows you about the relationship between the two variables—“support for a fund to build affordable housing” and “opinion of local government service.” Did you write “51 percent of the respondents favored the housing fund”? Or did you write that “55 percent of respondents who considered local services to be excellent or good supported the housing fund”? In both cases you focused on a single figure and overlooked the rich information contained in the table as a whole.
To learn more from the table, you should focus on the percentages and the percentage differences between the values of the independent variable. Read across a row and note how the percentages change. Table 8.1 shows a direct relationship between support for creating the housing fund and opinion of local government services. The percentage of residents supporting the housing fund decreases as opinion of local government services decreases. The percentage of residents opposed to the housing fund increases as the opinion of local service decreases. Another way to interpret the table is to focus on the center of gravity, represented by the median, for the different values of the independent variable. As the opinion of local government services becomes more negative, the median shifts further downward, away from support for the fund.
TABLE 8.1
Support for Creating Housing Fund by Opinion of Local Government Services
Note: Percentages may not add up to 100% due to rounding.
The percentage difference, the difference in percentages when subtracting across the rows, indicates the strength of relationship between two variables. The difference can range from 0, for no difference, to 100, indicating maximum difference. The approach is clearest when there are only two columns for the independent variable. Table 8.1 shows that those who gave highest ratings to government services were most likely to support the housing fund. For example, the percentage supporting the fund varies from 55 percent for those who gave government service a high rating to a low of 36 percent for those rating government services as “Poor”; the percentage difference is 19.
Contingency tables are often referred to by size—the number of rows for the dependent variable and the number of columns for the independent variable. A two-by-four table has two rows for the dependent variable and four columns for the independent variable. Its title may list the dependent variable name first. For example, the title for Table 8.1 is Support for Creating Housing Fund by Opinion of Local Government Services; it is a three-by-three table. Depending on the information you want you can calculate other percentages, such as the percentage of the total respondents in each cell, in which case you could say that “33 percent of all respondents both support the housing fund and gave local government services a high rating.”
You may include a control variable to see if it impacts the relationship between the original two variables. For example, the housing fund advocates may wish to see if the relationship between service rating and support for the housing fund is the same for city residents and suburban residents. You would make residency the control variable, divide the original group of 647 cases into two groups—city residents and suburban residents—and create a table for each. The result would be Table 8.2.
Table 8.2 shows that the relationship between support for the housing fund and service rating is different for city residents and suburban residents. Those in the city were likely to be in favor of the housing fund if they rated government services as “Good” (65 percent), whereas suburban residents were unlikely to be for the fund even if they rated government services as Good (1 percent).
TABLE 8.2
Support for Housing Fund by Opinion of Local Government Services by Residence
ANALYZING RATIO VARIABLES: LINEAR REGRESSION AND CORRELATION
If you have nominal or ordinal variables you may create contingency tables to examine relationships. The percents in the tables show how the values of one variable, the dependent variable, correspond to different values of the other variable, the independent variable. You can also create contingency tables for variables measured at the ratio level. Examples of ratio level variables include the number of hours of training, the number of volunteers who build houses for low income families, and the ages of people seeking help to locate safe, affordable housing. To create a contingency table for ratio variables, however, you may have to combine many values; otherwise you may end up with a large table that is virtually impossible to interpret. Because you will lose information if you combine values, linear regression and correlation may be better suited to analyze ratio level variables. Both regression and correlation are used across disciplines; they efficiently summarize data, handle a relatively large number of variables, and can be easily interpreted.
Consider an organization that builds houses for low income families and depends on volunteers to help on weekends. Knowing in advance how many volunteers will show up on any Saturday is important for scheduling and planning: materials, meals, and transportation all need to be arranged. The director of the organization suspects that the number of volunteers is related to how hot or cold it has been during the week. She asks a staff member to create a spreadsheet reporting the number of volunteers who showed up each weekend and the average daily high temperature for the week. Note that technically Fahrenheit temperature is an interval variable; however, you can treat variables measured at the interval level as ratio variables. Table 8.3 shows the spreadsheet data.
TABLE 8.3
Number of Building Volunteers and Average Daily High Temperature
Week | Average Temperature (°F) | Number of Volunteers
1 | 65 | 130
2 | 56 | 98
3 | 64 | 123
4 | 72 | 138
5 | 54 | 132
6 | 78 | 150
7 | 51 | 83
8 | 62 | 115
9 | 45 | 72
10 | 80 | 149
You should first graph the data to see if a relationship exists. The graph, called a scatterplot, plots the points that represent the values of temperature and the number of volunteers. You should display temperature, the independent variable, on the horizontal or x axis and number of volunteers, the dependent variable, on the vertical or y axis. You may think of the independent variable as the input variable, and the dependent variable as the output variable.
Note that the points on the scatterplot (Figure 8.1) fall along a line and as the average weekly high temperatures increase so do the numbers of volunteers. If the points can be enclosed in an imaginary envelope like a “thick cigar” then the relationship is said to be linear. The narrower the cigar, the stronger the relationship. The line in Figure 8.1 is called the regression line. Its formula is
$$Y = a + bX$$
where
Y = the value of the dependent variable;
X = the value of the independent variable;
a = the Y intercept or constant, the value of Y when X is zero;
b = the slope or regression coefficient.
The regression coefficient represents the change in Y for each unit increase in the value of X. The regression line, which can be calculated by statistical software, fits the data points better than any other straight line. Statisticians refer to this line as the ordinary least squares (OLS) line, because the sum of the squared differences between each point and the regression line is smaller than the sum of the squared differences between each point and any other line.
FIGURE 8.1 Number of Volunteers by Temperature
The regression line in Figure 8.1 is Y = –6.9 + 2X. The director can use the regression equation Y = –6.9 + 2X to estimate how many volunteers will show up. If the weather forecast for the coming week is an average high of 60°F, the director could substitute 60 for X in the formula and calculate the value of Y, the number of volunteers. She would estimate that 113 volunteers will show up next weekend [Y = –6.9 + 2.0(60) = 113.1]. Of course, the value of Y is only an estimate: some weeks fewer volunteers than predicted will show up and other weeks more.
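The substitution can be written as a small function. This sketch simply plugs a temperature into the regression equation reported in the text; the function and parameter names are our own.

```python
def predict_volunteers(avg_high_f, intercept=-6.9, slope=2.0):
    """Estimate volunteers from the week's average high temperature (°F).

    The default intercept and slope are the values reported for Figure 8.1.
    """
    return intercept + slope * avg_high_f

estimate = predict_volunteers(60)
print(round(estimate, 1))    # 113.1
```

The same function can be reused for any forecast temperature, as long as it stays within the range of temperatures used to fit the line.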
A regression line can be computed for any set of data. Your key concern is deciding how well the line describes or fits the data. The closer the points are to the regression line the better the fit. In Figure 8.1 the fit is quite good, although for one point (x = 54°F) the equation underestimated the number of volunteers by a large amount. The distance between a data point and the regression line for the same x value is called the residual. The larger the sum of the squared residuals, the poorer the fit (the residuals are squared so that positive and negative values do not cancel each other). The poorer the fit, the less you can trust the estimate for Y.
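Residuals are easy to compute once you have the equation. The sketch below applies the regression equation reported for Figure 8.1 to the Table 8.3 data and finds the week with the largest miss; squaring the residuals before summing keeps positive and negative misses from canceling. The variable names are our own.

```python
# Table 8.3: average daily high temperature (°F) and number of volunteers
temps = [65, 56, 64, 72, 54, 78, 51, 62, 45, 80]
volunteers = [130, 98, 123, 138, 132, 150, 83, 115, 72, 149]

def predicted(x):
    """Regression equation reported for Figure 8.1."""
    return -6.9 + 2.0 * x

# Residual = observed value minus the value predicted by the line
residuals = [y - predicted(x) for x, y in zip(temps, volunteers)]

# Locate the data point the line misses by the most
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(temps[worst], round(residuals[worst], 1))    # 54 30.9

# Sum of squared residuals: smaller values indicate a better fit
sum_squared_residuals = sum(r ** 2 for r in residuals)
print(round(sum_squared_residuals, 1))
```

The positive residual at 54°F confirms that the equation underestimated the number of volunteers that week.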
You should not use the regression equation to estimate the value of Y for values of X that extend very far beyond the data used to calculate the regression equation. For example, the high temperatures used to calculate the regression equation for Figure 8.1 ranged from 45°F to 80°F. You should not assume that the line will give you an accurate estimate of how many volunteers will show up at the end of a very cold or a very hot week. If the temperatures edged toward 100°F or higher, you should expect the equation to seriously overestimate the number of volunteers.
Pearson’s r (or simply r), a correlation coefficient, gives us more information about the relationship between two ratio variables. Pearson’s r varies from 0, indicating no relationship between the variables, to +1.0 indicating a perfect positive or direct relationship, to −1.0 indicating a perfect negative or inverse relationship. The sign of r indicates the direction of the relationship. A minus sign indicates an inverse relationship and a plus sign a direct one. One standard way to write the formula for computing r is:
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
The value of r for the data in Figure 8.1 is 0.89. To assess how strong the relationship is you should square r. In our example, r² is 0.79, or 0.89². This tells you that 79 percent of the variation in the number of volunteers is explained by the variation in the week’s average high temperature. This means that 21 percent of the variation is unexplained. The value of r² is another indicator of how well the regression line fits the data points. Methodologists do not agree on what constitutes a weak, moderate, or strong relationship. We have heard different rules of thumb, such as an r of less than 0.25 describes a weak relationship and anything over 0.70 a strong one. We are somewhat skeptical of such rules of thumb. The strength of the relationship depends on the research question, findings from other studies, and how the data will be used.
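The values of r and r² can be verified from the Table 8.3 data. The sketch below computes Pearson’s r from deviations about the means, one common form of the calculation; the function name `pearson_r` is our own.

```python
from math import sqrt

# Table 8.3: average daily high temperature (°F) and number of volunteers
temps = [65, 56, 64, 72, 54, 78, 51, 62, 45, 80]
volunteers = [130, 98, 123, 138, 132, 150, 83, 115, 72, 149]

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(temps, volunteers)
print(round(r, 2))         # 0.89
print(round(r ** 2, 2))    # 0.79
```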
Ideally you should use the scatterplot, regression equation, and the correlation coefficient together to assess the relationship between two interval variables. You should make it a practice to view the scatterplot; it will help you determine if the relationship is linear. If the relationship is not linear the correlation and regression statistics will not accurately portray the true relationship. You should be careful using regression with small data sets, since the regression line and correlation coefficient can be influenced by a single extreme value.
Multiple Regression: Regression and Correlation with More Than One Independent Variable
Regression is a powerful and versatile technique. Multiple regression expands the regression model to examine relationships with more than one independent variable. Although you may not need to use multiple regression in your own analysis, you are likely to run into it as you read reports and professional journals.
A multiple regression equation with two independent variables has the form
$$Y = a + b_1X_1 + b_2X_2$$
In a multiple regression equation each regression coefficient (b1 … bn) controls for the effect of the other independent variables. That is, in the two-variable example, b1 measures the change in Y if X2 were held constant; b2 measures the change in Y if X1 were held constant.
Imagine that the director of the organization in our earlier example had asked a radio station to broadcast public service announcements (PSAs) soliciting volunteers, which the station did as its schedule allowed. The director’s assistant consulted the station’s records and added the number of weekly PSAs into the data set. For week 1 the data set showed 130 volunteers, an average high of 65°F, and 7 PSAs. The director used statistical software, which reported the following regression equation:
where
Y = the number of volunteers;
X1 = the average high temperature;
X2 = the number of PSAs.
The b1 regression coefficient tells us that for every unit increase in average temperature, the number of volunteers increases by 1.6, controlling for the effect of the number of PSAs. The b2 regression coefficient tells us that each PSA increases the number of volunteers by 3.7, controlling for the effect of the temperature. Similar to the two-variable linear regression model, the resulting y value, the number of volunteers, is an estimate. The quality of the estimate depends on how well the multiple regression model fits the data.
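A small sketch may help show what “controlling for” means in practice. It uses the two coefficients reported in the text (1.6 and 3.7). The intercept is set to a placeholder of 0.0 purely as an assumption so that the code runs, which means the estimated levels below are not meaningful; only the differences between estimates are, because the intercept cancels out of them.

```python
B1_TEMPERATURE = 1.6    # change in volunteers per 1°F, PSAs held constant
B2_PSA = 3.7            # change in volunteers per PSA, temperature held constant

def estimate(avg_high_f, psas, intercept):
    """Multiple regression estimate of the form Y = a + b1*X1 + b2*X2."""
    return intercept + B1_TEMPERATURE * avg_high_f + B2_PSA * psas

a = 0.0    # placeholder intercept (an assumption for illustration only)

# Raise temperature by one degree while holding PSAs at 7
change_from_temperature = estimate(66, 7, a) - estimate(65, 7, a)

# Add one PSA while holding temperature at 65°F
change_from_psa = estimate(65, 8, a) - estimate(65, 7, a)

print(round(change_from_temperature, 1))    # 1.6
print(round(change_from_psa, 1))            # 3.7
```

Whatever value the intercept takes, each difference equals the corresponding regression coefficient, which is exactly what holding the other variable constant implies.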
R, the multiple correlation coefficient, is a multivariate measure of association; it measures the strength of the joint relationship between the independent variables and the dependent variable. We can judge how well the regression model fits the data by squaring R. R in this example is 0.93, so R² is 0.86. That is, knowing the values of the two variables, weekly average high temperature and number of PSAs, explains 86 percent of the variation in the number of volunteers, an improvement of about 7 percentage points over the 79 percent explained by the weekly high temperature alone. This R² value suggests that the multiple regression equation fits the data quite well.
If the independent variables are measured in different units or have widely different ranges, standardized regression coefficients make it easier to compare them. Standardized coefficients adjust for different measurement units and scales, so you can quickly identify which independent variables have the strongest relationships with the dependent variable and which have the weakest. In our example temperature ranges from the low 40s to 80; the number of public service announcements ranges from 2 to 10. (The data on public service announcements and number of volunteers by week are shown in Table 8.5 for Exercise 8.2.) The standardized regression coefficient for X1 is 0.69 and for X2 it is 0.33. The larger coefficient has the greater impact; in this example weekly average high temperature is more closely related to the number of volunteers who show up than the number of PSAs is.
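One common way to obtain a standardized coefficient is to multiply the unstandardized coefficient by the ratio of the standard deviation of the independent variable to the standard deviation of the dependent variable. The Python sketch below illustrates the idea with the Table 8.5 data only (PSAs and volunteers), because the weekly temperature data are not reproduced here; the function names are illustrative, not part of any statistical package. With a single independent variable the standardized coefficient equals r.

```python
from math import sqrt

# Table 8.5: weekly public service announcements (X) and volunteers (Y)
psas = [7, 2, 6, 10, 8, 9, 5, 6, 5, 7]
volunteers = [130, 98, 123, 138, 132, 150, 83, 115, 76, 149]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def slope(x, y):
    """Unstandardized regression coefficient b for one independent variable."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def standardized_coefficient(b, x, y):
    """Rescale b so that X and Y are both expressed in standard deviation units."""
    return b * sample_sd(x) / sample_sd(y)


b = slope(psas, volunteers)
beta = standardized_coefficient(b, psas, volunteers)
print(round(b, 1), round(beta, 2))  # 8.5 0.74
```

The unstandardized value, 8.5 volunteers per PSA, depends on the units of measurement; the standardized value, 0.74, does not, which is what makes standardized coefficients comparable across variables.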
The standardized and unstandardized regression coefficients have different roles. The standardized regression coefficient allows you to directly compare the impact of the separate independent variables. The unstandardized regression coefficient allows you to estimate the value of Y. To estimate the number of volunteers for a week that has an average high of 65°F and in which the radio station ran 7 PSAs, you would plug the respective temperature and PSA data into the equation with the unstandardized coefficients:
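The substitution can be sketched in Python as follows. The coefficients 1.6 and 3.7 come from the text; the intercept a is not shown in this excerpt, so the hypothetical function below takes it as a required argument rather than assuming a value.

```python
def estimate_volunteers(intercept, temperature, psas, b1=1.6, b2=3.7):
    """Estimate Y = a + b1*X1 + b2*X2 using the unstandardized coefficients.

    intercept   -- a, taken from the statistical software output (not given here)
    temperature -- X1, weekly average high temperature in degrees F
    psas        -- X2, number of public service announcements that week
    """
    return intercept + b1 * temperature + b2 * psas


# Intercept set to 0 only to isolate the contribution of the two slope terms.
slope_terms = estimate_volunteers(0.0, temperature=65, psas=7)
print(round(slope_terms, 1))  # 129.9, that is 1.6*65 + 3.7*7
```

The full estimate for that week is the reported intercept plus 129.9.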
In our examples the independent variables had a direct relationship with the dependent variable. Consequently, the values for the regression coefficients, the standardized regression coefficients, and r were positive. Had the relationship been inverse, the values would have been negative, but the strength would have stayed the same; that is, only the direction of the coefficients, not their absolute values, which indicate strength, would have been altered. For the purpose of illustrating this point let’s imagine that the relationship between temperature and number of volunteers was r = −0.89 instead of +0.89. We would still say that we could explain 79 percent of the variation in number of volunteers. The change in our estimate would be that as temperatures go up the estimated number of volunteers would go down.
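A quick check of this point, as a minimal Python sketch: squaring r removes the sign, so a direct and an inverse relationship of the same strength explain the same share of the variation. The function name is illustrative.

```python
def explained_variation(r):
    """Share of the variation in Y explained by X, given the correlation r."""
    return r ** 2


for r in (0.89, -0.89):
    direction = "direct" if r > 0 else "inverse"
    print(r, direction, round(explained_variation(r), 2))
```

Both lines report 0.79; only the direction label differs.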
Regression describes a statistical relationship. When we say that a regression model fits the data, we mean that regression is an appropriate statistical model. However, a good fit does not necessarily mean that changes in the independent variable(s) caused changes in the dependent variable. To causally link variables we look beyond a statistical model to identify and eliminate other possible causes of an outcome. We discuss this in Chapter 11.
In deciding how to analyze and present data you need to consider your research question, how your data are measured, and your audience. Contingency tables are appropriate for nominal and ordinal data. They also have the advantage of being easily understood. We have found that inexperienced researchers ignore the power of percentages: a clear, uncluttered table reporting relative frequencies is usually easier to interpret than one that only reports the raw numbers. After many years of teaching, we have learned that not all people interpret contingency tables—even ones with relative frequencies—accurately. If you use contingency tables in your report, it’s wise to include a sentence summarizing the findings.
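To illustrate why relative frequencies are easier to read than raw counts, the Python sketch below builds a small contingency table with column percentages. The records are HYPOTHETICAL and invented for illustration; they are not the Table 8.4 data, and the function name is our own.

```python
from collections import Counter

# HYPOTHETICAL records: (region, participates in program). Not the Table 8.4 data.
records = [
    ("Mountains", "Yes"), ("Mountains", "Yes"), ("Mountains", "Yes"), ("Mountains", "No"),
    ("Plains", "Yes"), ("Plains", "No"), ("Plains", "No"), ("Plains", "No"), ("Plains", "No"),
]


def column_percentages(pairs):
    """Return ({column: {row: percent of that column}}, {column: number of cases})."""
    cell_counts = Counter(pairs)                       # cases in each cell
    column_totals = Counter(col for col, _ in pairs)   # cases in each column
    table = {}
    for (col, row), count in cell_counts.items():
        table.setdefault(col, {})[row] = 100 * count / column_totals[col]
    return table, column_totals


table, totals = column_percentages(records)
for region, cells in table.items():
    shares = ", ".join(f"{row}: {pct:.0f}%" for row, pct in sorted(cells.items()))
    print(f"{region} (n={totals[region]}): {shares}")

# Percentage difference in participation between the two regions
difference = table["Mountains"]["Yes"] - table["Plains"]["Yes"]
print(f"Difference: {difference:.0f} percentage points")
```

Percentages are computed within each category of the independent variable (the columns), with the number of cases reported at the head of each column, so that regions with different numbers of agencies can be compared directly.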
Regression, especially multiple regression, is a powerful tool that can handle ratio variables efficiently. It has the distinct advantage of being able to include a number of independent variables in the same equation. Probably the greatest potential misuse of regression is to use it to analyze a small data set, where the addition of a few cases can greatly alter the findings.
Fink, Arlene, How to Conduct Surveys: A Step-by-Step Guide, Fourth Edition (Thousand Oaks, CA: Sage Publications, Inc., 2009). See especially Chapter 6.
Lee, E. S., Analyzing Complex Survey Data, Second Edition (Thousand Oaks, CA: Sage Publications, Inc., 2006).
O’Sullivan, E., G. Rassel, and M. Berner, Research Methods for Public Administrators, Fifth Edition (New York: Pearson/Longman, 2008). Chapter 13.
Rea, Louise, and Richard Parker, Conducting and Designing Survey Research: A Comprehensive Guide, Third Edition (San Francisco, CA: Jossey Bass, 2005).
Tufte, Edward R., Data Analysis for Politics and Policy (Englewood Cliffs, NJ: Prentice Hall, 1974) has a clear discussion with excellent examples on the use and interpretation of linear correlation and regression.
CHAPTER 8 EXERCISES
Analyzing Data to Find Relationships Exercises
There are four sets of exercises for this chapter.
• Exercise 8.1 Nonprofit Participation in Experimental Financial Assistance Program is designed to give you practice in creating and interpreting contingency tables.
• Exercise 8.2 Do Public Service Announcements Yield More Volunteers? asks you to apply your knowledge of linear regression.
• Exercise 8.3 Physicians for Access also asks you to apply your knowledge of linear regression.
• Exercise 8.4 Fresh Start Center revisits the data presented in Table 5.8, and asks you to suggest strategies for identifying relationships.
EXERCISE 8.1 Nonprofit Participation in Experimental Financial Assistance Program
Scenario
One hundred (100) statewide nonprofits are asked to participate in an experimental program to deliver financial assistance. The local board of directors of each agency decides whether the organization will participate in the program. Table 8.4 contains the data on each nonprofit’s region and whether it will participate in the program.
TABLE 8.4
Database on Participation in Experimental Program
Section A: Getting Started
1. Organize the data in a contingency table with three columns and two rows. Make Region of State the independent variable and Experimental Welfare Program the dependent variable.
a. Tally the number of agencies in each region that are in the program and the number that are not. Report the number in each column. Fully label the table.
b. In a second table, calculate the percentage of organizations in each region that are in the program and the percentage that are not. Enter only the percentages in the cells of the second table with the number of cases in each column at the head of the column. Fully label the table.
c. What can you tell the state director about the relationship between the decision to participate in the Experimental Program and Region of the State?
2. What percentage of the agencies are in each region of the state?
3. What percentage of the agencies will participate in the experimental program? What percentage will not? What percentage of the agencies in the Plains region are participating?
4. Calculate and report the percentage difference between the percentage of agencies in the Mountains and the percentage in the Plains that are participating in the experimental program.
EXERCISE 8.2 Do Public Service Announcements Yield More Volunteers?
Scenario
A radio station runs public service announcements for an organization that has volunteers build houses for low-income families. The database representing the number of volunteers and the number of public service announcements is in Table 8.5.
Section A: Getting Started
1. Prepare a scatterplot to show the relationship between the two variables. Remember that “number of volunteers” is the dependent variable. Describe the direction of the relationship. Is it linear?
2. Compare your scatterplot for public service announcements to the scatterplot in Figure 8.1. How are they similar? How do they differ? Which relationship do you think is the stronger? Why?
3. The regression and correlation statistics for the relationship between number of volunteers and public service announcements are a = 63.9; b = 8.5; r = 0.74.
a. Use this information to draw a regression line in your scatterplot.
b. Use this information to write the regression equation and estimate the number of volunteers expected if the number of public service announcements is 4.
TABLE 8.5
Database on Number of Volunteers and Public Service Announcements
| Week | Number of Volunteers | Public Service Announcements |
| 1 | 130 | 7 |
| 2 | 98 | 2 |
| 3 | 123 | 6 |
| 4 | 138 | 10 |
| 5 | 132 | 8 |
| 6 | 150 | 9 |
| 7 | 83 | 5 |
| 8 | 115 | 6 |
| 9 | 76 | 5 |
| 10 | 149 | 7 |
EXERCISE 8.3 Physicians for Access
Scenario
Physicians for Access, a nonprofit organization, provides funds to increase the general population’s access to medical care. To help in deciding how to distribute resources, the organization’s director commissioned a study of the relationship between life expectancy and the number of people per physician in various areas. The organization assumes that life expectancy in a government jurisdiction is an indicator of general health conditions. Table 8.6 contains data from a sample of countries.
Section A: Getting Started
1. Create a scatterplot for the relationship between life expectancy and number of people per physician.
2. Describe the relationship shown in the scatterplot. Is it linear? positive or negative? weak, moderate, strong?
3. The regression and correlation statistics are a = 72; b = −.002; r = −.61.
a. Use these statistics to further interpret the relationship.
b. Write the regression equation relating the dependent variable, life expectancy, to the independent variable, number of people per physician.
c. Calculate the life expectancy for a country with 450 people per physician.
d. The actual life expectancy for the country in 3c is 76.5 years. Calculate the residual for this country.
TABLE 8.6
Database on Average Life Expectancy and Number of People per Physician
| Country | Life Expectancy (Years) | Number of People per Physician |
| A | 71 | 370 |
| B | 65 | 684 |
| C | 70 | 640 |
| D | 78 | 400 |
| E | 57 | 2,470 |
| F | 77 | 233 |
| G | 61 | 7,610 |
| H | 56 | 2,360 |
| I | 65 | 1,060 |
| J | 69 | 259 |
| K | 78 | 275 |
| L | 70 | 1,190 |
| M | 76 | 611 |
| N | 64 | 2,990 |
| O | 75 | 570 |
4. Suggest two additional independent variables that the researchers should consider for including in a multiple regression analysis of life expectancy.
EXERCISE 8.4 Fresh Start Center
Scenario
In Chapter 5 you were introduced to the Fresh Start Center, a community partnership that offers training in culinary arts, automotive repairs, and carpentry to unemployed workers. You have been asked to help the center develop a plan for monitoring its performance. Table 8.7 includes 25 cases from the first round of data collection. The data report the number of months of unemployment prior to entering training and the employment status of June 30 graduates as of August 1.
TABLE 8.7
Database on Fresh Start Participants
Section A: Getting Started
1. Create a contingency table to examine the relationship between training program and employment status.
a. Write a sentence to describe the relationship shown in the table.
b. Add education level as a control variable (you will want to combine values to create two groups). Does the control variable affect the original relationship? How?
2. Write a draft memo “Recommendations for Analyzing the Fresh Start Data.” Remember, the actual analysis will be conducted on a large database. Your memo should indicate how you suggest analyzing the data to
a. describe the clientele in Fresh Start’s programs;
b. describe how the clientele view what they learned in the program;
c. identify the factors associated with the various outcomes.
Section B: Small Group and Class Exercises
1. Working with a group of two to four classmates develop a strategy for analyzing the database.
a. Identify the tables that should be included.
b. Use the database to construct an example illustrating what each table would look like.
c. For each table explain the value of the information it would produce.
2. Review how each small group handled (i) describing the clientele, (ii) describing the clientele’s perception of what they learned, (iii) factors associated with outcomes.
a. Assess the value of each proposed table.
b. Consider how the data can be organized most effectively.
APPENDIX A: Using Excel to Obtain Regression and Correlation Statistics
You can use Excel to calculate the correlation and regression statistics and to create a scatterplot. After entering the data in the spreadsheet
■ click on the Tools menu;
■ click on Data Analysis;
■ click on Regression, then click OK. A dialogue box will appear.
■ In the box for Input Y range, enter the range of cells for the dependent variable.
■ In the box for Input X range, put in the range of cells containing the data for the independent variable. Click OK.
Try this with the data from Table 8.5 (in Exercise 8.2).
■ In the Input Y range box, enter the cell range for the number of volunteers data.
■ In the Input X range box, enter the cell range for the number of public service announcements data. Click OK.
■ The output will include three boxes of statistics.
■ First box: Multiple R. This is the absolute value of r; Excel reports it without a sign.
■ The third box will contain the regression statistics.
■ Look for the column headed Coefficients.
■ The first value under the heading “Coefficients” is the value of a; to the left is its label “Intercept.”
■ In the second row, you will find the value of b; to the left is its label “X variable.”
You may need to determine the direction of the relationship from the sign of b, the regression coefficient, since Excel does not report a sign for the correlation coefficient. (This is another reason why viewing the scatterplot is important.) The b coefficient is also called the slope coefficient; it tells us the slope of the regression line. As noted earlier, the b coefficient shows the change in Y, the dependent variable, for a one-unit increase in X, the independent variable. For a straight line that slope is either positive (as X increases, Y also increases) or negative (as X increases, Y decreases).
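If you do not have Excel available, the same three statistics can be computed directly. The sketch below is one way to do so in plain Python from the Table 8.5 data; the function name is illustrative. Unlike Excel's Multiple R, the r computed here carries its sign.

```python
from math import sqrt

# Table 8.5: X = public service announcements, Y = number of volunteers
psas = [7, 2, 6, 10, 8, 9, 5, 6, 5, 7]
volunteers = [130, 98, 123, 138, 132, 150, 83, 115, 76, 149]


def simple_regression(x, y):
    """Return (a, b, r): intercept, slope, and correlation for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)   # variation in X
    syy = sum((yi - mean_y) ** 2 for yi in y)   # variation in Y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope: change in Y per one-unit increase in X
    a = mean_y - b * mean_x       # intercept: estimated Y when X is 0
    r = sxy / sqrt(sxx * syy)     # correlation coefficient, sign included
    return a, b, r


a, b, r = simple_regression(psas, volunteers)
print(round(a, 1), round(b, 1), round(r, 2))  # 63.9 8.5 0.74
```

These values match the statistics given in Exercise 8.2, which is a useful check that the data were entered correctly.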
You can obtain a scatterplot by marking the two columns of data and clicking on the icon for “scatter”.
Note: If the Data Analysis module is not available on the Tools menu, you will have to add it. For 1997–2003 versions of Excel, click on the Tools menu, then Add-Ins; then check Analysis ToolPak. For the 2007 version click on Excel Options; on the drop-down menu select Add-Ins. Click on Go at the bottom of the page. In the Add-Ins box, check Analysis ToolPak.