
Introduction: What Is Data Analysis?

What is the wealth of the United States? Who’s got it? And how is it changing? What are the consequences of an experimental drug? Does it work, or does it not, or does its effect depend on conditions? What is the direction of the stock market? Is there a pattern?

What is the historical trend of world climate? Is there evidence of

global warming? — This is a diverse lot of questions with a common

element: The answers depend, in part, on data. Human beings ask lots

of questions and sometimes, particularly in the sciences, facts help.

Data analysis is a body of methods that help to describe facts, detect patterns,

develop explanations, and test hypotheses. It is used in all of the sciences. It

is used in business, in administration, and in policy.

The numerical results provided by a data analysis are usually

simple: It finds the number that describes a typical value and it finds

differences among numbers. Data analysis finds averages, like the

average income or the average temperature, and it finds differences

like the difference in income from group to group or the differences in

average temperature from year to year. Fundamentally, the numerical

answers provided by data analysis are that simple.

But data analysis is not about numbers — it uses them. Data

analysis is about the world, asking, always asking, “How does it

work?” And that’s where data analysis gets tricky.


Macintosh HD:DA:DA XI:Volume I:006 Intro (What is the wealth) June 10, 1996

Introduction to Data analysis: The Rules of Evidence Joel H. Levine

For example: Between 1790 and 1990 the population of the United States increased by 245 million people, from 4 million to 249 million people. Those are the facts. But if I were to interpret those numbers and report that the population grew at an average rate of 1.2 million people per year, 245 million people divided by 200 years, the report would be wrong. The facts would be correct and the arithmetic would be correct: 245 million people divided by 200 years is approximately 1.2 million people per year. But the interpretation “grew at an average rate of 1.2 million people per year” would be wrong, dead wrong. The U.S. population did not grow that way, not even approximately.

For example: The average number of students per class at my university is 16. That is a fact. It is also a fact that the average number of classmates a student will find in his or her classes is 37. That too is a fact. The numerical results are correct in both cases, both 16 and 37 are correct even though one number is more than twice the magnitude of the other, no tricks. But the two different numbers respond to two subtly different questions about how the world (my university) works, subtly different questions that lead to large differences in the result.
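The gap between the two averages can be reproduced with a small sketch. The class sizes below are hypothetical, chosen only to make the effect visible; they are not the university's data and will not give 16 and 37. The point is that averaging over classes counts each class once, while averaging over students counts each class once per enrolled student.

```python
# Hypothetical class sizes, NOT the author's university data:
# nine small seminars of 10 students and one lecture of 100.
class_sizes = [10] * 9 + [100]

# Question 1: what is the average size of a class?
# Each class counts once.
mean_per_class = sum(class_sizes) / len(class_sizes)

# Question 2: what class size does the average student sit in?
# Each class counts once per enrolled student, so large classes weigh more.
total_seats = sum(class_sizes)
mean_per_student = sum(n * n for n in class_sizes) / total_seats

print(f"average class size, counted per class:   {mean_per_class:.1f}")
print(f"average class size, counted per student: {mean_per_student:.1f}")
```

Both numbers are correct summaries of the same list. They differ because most students sit in the one large lecture, so the student's-eye average is pulled toward 100.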

The tools of the trade for data analysis begin with just two ideas:

Writers begin their trade with their A, B, C’s. Musicians begin with

their scales. Data analysts begin with lines and tables. The first of

these two ideas, the straight line, is the kind of thing I can construct on

a graph using a pencil and a ruler, the same idea I can represent

algebraically by the equation “y = mx + b”. So, for example, the line

constructed on the graph in Figure 1 expresses a hypothetical relation

between education, left to right, and income, bottom to top. It says

that a person with no education has an income of $10,000 and that the

rest of us have an additional $3,000 for each year of education that is

completed (a relation that may or may not be true).
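As a minimal sketch, the hypothetical line can be written as a function. The names `INTERCEPT`, `SLOPE`, and `predicted_income` are illustrative choices, and the relation itself is the hypothetical one described above, not an estimate from data.

```python
INTERCEPT = 10_000   # b: income with no education, in dollars
SLOPE = 3_000        # m: additional dollars per completed year of education

def predicted_income(years_of_education):
    """Return y = m*x + b for the hypothetical education-income line."""
    return SLOPE * years_of_education + INTERCEPT

for years in (0, 12, 16):
    print(f"{years:2d} years of education -> ${predicted_income(years):,}")
```

Under this line, each additional year adds the same $3,000 whether it is the first year or the sixteenth.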


[Figure 1: line graph. Horizontal axis: Years of Education Completed (1 to 20). Vertical axis: Personal Income ($10,000 to $70,000). Line: Income = $10,000 plus $3,000 per year of education. Intercept: b = $10,000. Slope: m = $3,000 per year of education (a rise of $3,000 for a run of 1 year).]

Figure 1 Hypothetical Linear Relation Between Income and Education

The hypothetical line shows an intercept, b, equal to $10,000 and a slope, which is the rise in dollars divided by the run in years, that is equal to $3,000 per year.


This first idea, the straight line, is the best tool that data analysts

have for figuring out how things work. The second idea is the table or,

more precisely, the “additive model”. The first idea, the line, is

reserved for data we can plot on a graph, while this second idea, the

additive model, is used for data we organize in tables. For example,

the table in Figure 2 represents daily mean temperatures for two cities

and two dates: The two rows of the table show mean temperature for

the two cities, the two columns show mean temperatures for the two

dates.

The additive model analyzes each datum, each of the quantities in

the table, into four components — one component applying to the

whole table, a second component specific to the row, a third

component specific to the column, and a fourth component called a

“residual” — a leftover that picks up everything else. In this example

the additive model analyzes the temperature in Phoenix in July into

1: 64.5° to establish an average for the whole table, both cities and both dates,

2: plus 7.5° above average for Phoenix, in the first row,

3: plus 21° above average for July, in the second column,

4: minus 1° as a residual to account for the difference between the sum of the first three numbers (93°) and the data (92°).

Adding it up,

Observed equals All Effect plus Phoenix Effect plus July Effect plus Residual.

That is,

92° = 64.5° + 7.5° + 21° − 1°
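The decomposition can be checked with a short sketch that fits the additive model from row and column means. Fitting by means is an assumption made here for illustration, since the text does not say which fitting procedure was used; the temperatures are the four values from Figure 2.

```python
# Normal daily mean temperatures (degrees Fahrenheit) from Figure 2.
table = {
    "Phoenix":          {"January": 52, "July": 92},
    "Washington, D.C.": {"January": 35, "July": 79},
}
cities = list(table)
months = ["January", "July"]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# One component for the whole table.
all_effect = mean(table[c][m] for c in cities for m in months)

# Row and column components, in degrees above or below the overall average.
row_effect = {c: mean(table[c][m] for m in months) - all_effect for c in cities}
col_effect = {m: mean(table[c][m] for c in cities) - all_effect for m in months}

# Residual: whatever is left after the three fitted components.
residual = {
    (c, m): table[c][m] - (all_effect + row_effect[c] + col_effect[m])
    for c in cities for m in months
}

print("all effect:", all_effect)
print("row effects:", row_effect)
print("column effects:", col_effect)
print("residuals:", residual)
```

Computed this way, the three fitted components for Phoenix in July sum to 93° against an observed 92°, so that cell's residual is −1°.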


                      January    July     Row Effects for Cities
Phoenix                 52°       92°       +7.5°
Washington, D.C.        35°       79°       −7.5°
Column Effects
for Months             −21°      +21°       All Effect: 64.5°

All Effect: expressed as the average for "all" cities (both of them) and "all" dates (both of them).

Row Effects for Cities and Column Effects for Months: expressed in degrees above or below average.

Datum = All Effect + Row Effect + Column Effect + Residual

92° = 64.5° + 7.5° + 21° − 1°

Figure 2 Normal Daily Mean Temperatures in Degrees Fahrenheit

From the Statistical Abstract of the United States, 1987, Table 346, from the original by the U.S. National Oceanic and Atmospheric Administration, Climatography of the United States, No. 81, Sept., 1982. Also note John Tukey’s Exploratory Data Analysis, Addison Wesley, 1970, p. 333.


There you are, lines and tables: That is data analysis, or at least a

good beginning. So what is it that fills up books and fills up the

careers of data analysts and statisticians? Things begin to get

“interesting”, that is to say, problematical, because even the best-behaved

data show variance: Measure a twenty gram weight on a

scale, measure it 100 times, and you will get a variety of answers —

same weight, same scale, but different answers. Find out the incomes

of people who have completed college and you will get a variety of

answers. Look at the temperatures in Phoenix in July, and you will get

a variety, day to day, season to season, and year to year. Variation

forces us to employ considerable care in the use of the linear model

and the additive model.

And life gets worse — or more interesting: Truth is that lots of

things just are not linear: Adding one more year of elementary school,

increasing a person’s years of education from five to six, doesn’t really

have the same impact on income as adding one more year of college,

increasing a person’s years of education from fifteen to sixteen —

while completing a college degree. So the number of dollars gained for

each extra year of education is not constant, which means that,

often, the linear model doesn’t work in its simplest form, not even

when you allow for variation. And with tables of numbers, the

additive model doesn’t always add up to something that is useful.

So what do we do with a difficult problem? This may be the

single most important thing we teach in data analysis: Common sense

would tell you that you tackle a difficult problem with a difficult

technique. Common sense would also tell you that the best data

analyst is the one with the largest collection of difficult “high

powered” techniques. But common sense is wrong on both points: In

data analysis the real “trick” is to simplify the problem and the best data

analyst is the one who gets the job done, and done well, with the most

simple methods.

Data analysts do not build more complicated techniques for more

complicated problems — not if we can help it. For example, what

would we do with the numbers graphed in Figure 3? Here the

numbers double at each step, doubling from 1, to 2, to 4, to 8, which is

certainly not the pattern of a straight line. In this example the trick is


to simplify the problem by using logarithms or the logarithmic graph

paper shown in Figure 4 so that, now, we can get the job done with

simple methods. Now, on this new graph, the progression, 1, 2, 4, 8,…

is a straight line.
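A minimal sketch of the same trick in code: take logarithms of the doubling sequence and compare the step sizes before and after. Base 2 is chosen here only because it makes the result easy to read; any base would straighten the sequence.

```python
import math

ys = [1, 2, 4, 8]                       # the doubling progression from the text
log_ys = [math.log2(y) for y in ys]     # base 2 is an illustrative choice

# Step from each value to the next, on the raw and on the logarithmic scale.
steps_raw = [b - a for a, b in zip(ys, ys[1:])]
steps_log = [b - a for a, b in zip(log_ys, log_ys[1:])]

print("raw values:", ys, "steps:", steps_raw)
print("log values:", log_ys, "steps:", steps_log)
```

On the raw scale the steps grow at each stage, so no single slope describes them. On the logarithmic scale every step is the same size, which is exactly what a straight line requires.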

[Figure 3: the points (x=1, y=1), (x=2, y=2), (x=3, y=4), (x=4, y=8) plotted on ordinary axes. X Axis from 0 to 6, Y Axis from 0 to 8.]

Figure 3 Non-Linear Relation Between X and Y

[Figure 4: the same points plotted with a logarithmic scale on the Y Axis (1, 2, 4, 8, 16). X Axis from 0 to 6.]

Figure 4 Non-Linear Exponential Relation Between X and Y Made Linear Using a Semi-Logarithmic Graph

“Tricks” like this enormously extend the range of things that an

experienced data analyst can analyze while staying with the basics of

lines and tables. In sociology, which is my field, this means learning to

use things like “log people”. In business and economics it means

learning to use things like “log dollars”. In biology it means learning

to use things like the square root of the number of beasties in a drop of

pond water or the cube root of the weight of an organism. Learning

what these things mean is perhaps the most time consuming part of an

introduction to data analysis. And the payoff is that these techniques

extend the ability of simple tools, of the line and the table, to make

sense of a complicated world.


And what are the Rules of data analysis? Some of the rules are

clear and easy to state, but these are rather like the clear and easy rules

of writing: Very specific and not very helpful — the equivalent of

reminders to dot your “i’s” and cross your “t’s”. The real rules, the

important ones, exist but there is no list — only broad strategies with

respect to which the tactics must be improvised. Nevertheless it is

possible to at least name some of these “rules.” I’ll try the list from

different angles. So:

1. Look At the Data / Think About the Data / Think About the

Problem / Ask what it is you Want to Know

Think about the data. Think about the problem. Think about

what it is you are trying to discover. That would seem obvious,

“Think.” But, trust me, it is the most important step and often omitted

as if, somehow, human intervention in the processes of science were a

threat to its objectivity and to the solidity of the science. But, no,

thinking is required: You have to interpret evidence in terms of your

experience. You have to evaluate data in terms of your prior

expectations (and you had better have some expectations). You have to

think about data in terms of concepts and theories, even though the

concepts and theories may turn out to be wrong.

2. Estimate the Central Tendency of the Data.

The “central tendency” can be something as simple as an average:

The average weight of these people is 150 pounds. Or it can be something

more complicated like a rate: The rate of growth of the population is two

percent per annum. Or it can be something sophisticated, something

based on a theory: The orbit of this planet is an ellipse. And why would

you have thought to estimate something as specific as a rate of growth

or the trace of an ellipse? Because you thought about the data, about

the problem, and about where you were going (Rule 1).

3. Look at the Exceptions to the Central Tendency


If you’ve measured a median, look at the exceptions that lie above

and below the median. If you’ve estimated a rate, look at the data that

are not described by the rate. The point is that there is always, or

almost always, variation: You may have measured the average but,

almost always, some of the cases are not average. You may have

measured a rate of change but, almost always, some numbers are large

compared to the average rate, some are small. And these exceptions

are not usually just the result of embarrassingly human error or

regrettable sloppiness: On the contrary, often the exceptions contain

information about the process that generated the data. And sometimes

they tell you that the original idea (to which the variations are the

exception) is wrong, or in need of refinement. So, look at the

exceptions which, as you can see, brings us back to rule 1, except that

this time the data we look at are the exceptions.
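The cycle between central tendency and exceptions can be sketched in a few lines. The weights below are hypothetical values invented for illustration, and ranking cases by distance from the median is just one simple way to decide which exceptions to look at first.

```python
import statistics

# Hypothetical weights in pounds, invented for illustration only.
weights = [128, 135, 142, 150, 151, 158, 163, 171, 240]

# Rule 2: estimate the central tendency.
center = statistics.median(weights)

# Rule 3: look at the exceptions, here each case's distance from the median.
deviations = [(w, w - center) for w in weights]
largest = max(deviations, key=lambda pair: abs(pair[1]))

print("median:", center)
for weight, deviation in deviations:
    print(f"{weight:4d}  {deviation:+d}")
print("largest exception:", largest)
```

The case that stands furthest from the median is the one to think about next: it may be an error, or it may say something about how the data were generated.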

That circle of three rules describes one of the constant practices of

analysis, cycling between the central tendencies and the exceptions as

you revise the ideas that are guiding your analysis. Trying to describe

the Rules from another angle, another theme that organizes the rules of

evidence can be introduced by three key words: falsifiability, validity,

and parsimony.

1. Falsifiability

Falsifiability requires that there be some sort of evidence which,

had it been found, your conclusions would have had to be judged

false. Even though it’s your theory and your evidence, it’s up to

you to go the additional step and formulate your ideas so they can

be tested — and falsified if they are false. More, you yourself

have to look for the counter evidence. This is another way to

describe one of the previous rules which was “Look at the

Exceptions”.

2. Validity


Validity, in the scientific sense, requires that conclusions be more

than computationally correct. Conclusions must also be

“sensible” and true statements about the world: For example, I

noted earlier that it would be wrong to report that the population

of the United States had grown at an average rate of 1.2 million

people per year. Wrong, even though the population grew by

245 million people over an interval of 200 years. Wrong even

though 245 divided by 200 is (approximately) 1.2. Wrong because

it is neither sensible nor true that the American population of 4

million people in the United States in 1790 could have increased

to 5.2 million people in just twelve months. That would have

been a thirty percent increase in one year — which is not likely

(and didn’t happen). It would be closer to the truth, more valid,

to describe the annual growth using a percentage, stating that the

population increased by an average of 2 percent per year — 2

percent per year when the population was 4 million (as it was in

1790), 2 percent per year when the population was 250 million (as

it was in 1990). That’s better.
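The arithmetic behind this comparison can be sketched as follows, using the rounded figures from the text (4 million in 1790, 249 million in 1990). The variable names are illustrative, and the compound rate is computed as the constant yearly percentage that would carry the starting population to the ending one.

```python
start_pop = 4.0     # millions of people in 1790 (rounded, from the text)
end_pop = 249.0     # millions of people in 1990 (rounded, from the text)
years = 200

# "Constant number of people per year" description.
linear_step = (end_pop - start_pop) / years

# What that description implies as a percentage, early and late.
first_year_pct = 100 * linear_step / start_pop
last_year_pct = 100 * linear_step / (end_pop - linear_step)

# "Constant percentage per year" description.
compound_rate = (end_pop / start_pop) ** (1 / years) - 1

print(f"linear step: {linear_step:.3f} million people per year")
print(f"implied growth in the first year: {first_year_pct:.1f} percent")
print(f"implied growth in the last year:  {last_year_pct:.2f} percent")
print(f"constant compound rate: {100 * compound_rate:.2f} percent per year")
```

The linear description implies a growth rate of about thirty percent at the start and under one percent at the end, while the percentage description uses a single rate of roughly 2 percent throughout.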

3. Parsimony

Parsimony is the analyst’s version of the phrase “Keep It Simple.”

It means getting the job done with the simplest tools, provided

that they work. In military terms you might think about weapons

that provide the maximum “bang for the buck”. In the sciences

our “weapons” are ideas and we favor simple ideas with

maximum effect. This means that when we choose among

equations that predict something or use them to describe facts, we

choose the simplest equation that will do the job. When we

construct explanations or theories we choose the most general

principles that can explain the detail of particular events. That’s

why sociologists are attracted to broad concepts like social class

and why economists are attracted to theories of rational

individual behavior — except that a simple explanation is no

explanation at all unless it is also falsifiable and valid.


I will be specific about the more easily specified rules of data

analysis. But make no mistake, it is these broad and not-well-specified

principles that generate the specific rules we follow: Think about the

data. Look for the central tendency. Look for the variation. Strive for

falsifiability, validity, and parsimony. Perhaps the most powerful rule

is the first one, “Think”. The data are telling us something about the

real world, but what? Think about the world behind the numbers and

let good sense and reason guide the analysis.

Reading:

Stephen D. Berkowitz, Introduction to Structural Analysis, Chapter 1,

“What is Structural Analysis,” Butterworths, Toronto, 1982; revised

edition forthcoming, Westview, Denver, circa 1997.

Stephen J. Gould, “The Median Isn’t the Message,” Discover, June, 1985.

Charles S. Peirce, “The Fixation of Belief”, reprinted in Bronstein,

Krikorian, and Wiener, The Basic Problems of Philosophy, 1955, Prentice

Hall, pp. 40-50. Original, Popular Science Monthly, 1877.
