Consider a population of households who choose a consumption bundle q1, . . . , qJ to maximize preferences, represented by a utility function of the form
5 Nonsampling Error
Nonsampling error is a catch-all term for the variety of errors, other than sampling uncertainty, that may arise during the course of all survey-related activities.
Unlike sampling error, which reflects sample-to-sample variability and is only present in sample surveys, nonsampling error can also be present in censuses and administrative data.
You should worry about nonsampling error because it may cause systematic distortion or bias in estimation of population parameters of interest. Further, unlike sampling error, we cannot generally reduce this bias by gathering more data.
Nonsampling error may arise because of:
• sample selection, which occurs when the sampling design systematically oversamples or undersamples a fraction of the population; an extreme form of sample selection is noncoverage, which occurs when the sampling design systematically excludes a fraction of the population;
• nonresponse, which results from failure to obtain answers to some or all survey questions (respectively, item and unit nonresponse);
• measurement error, which occurs when data are incorrectly requested, provided, received or recorded (because of poor questionnaire design, interview bias, respondent errors, or problems with the survey process), or when they are incorrectly coded, edited or imputed.
Sample selection (and noncoverage) may arise either because of design decisions by the survey statistician or the investigator, or because of self-selection by the individual units under investigation.
5.1 Noncoverage
Noncoverage may be caused by defects in the sampling frame, such as inaccuracy, incompleteness, duplications, inadequacy or obsolescence, or by field procedures (e.g., while a survey is conducted, the interviewer systematically misses certain types of households or persons). This is especially a problem for rare or elusive populations, such as homeless people.
A classic example of defective sampling frame occurs in telephone surveys, as individuals and families without phones (and, increasingly, those with cellular phones but no landline) do not enter the sampling frame.
Online web surveys may also suffer from noncoverage problems, as internet usage depends largely on factors such as access and age.
When the available sampling frame is defective, the researcher faces two alternatives:
• redefine the target population to better fit the frame; this may be subject to criticism because the frame population no longer coincides with the target population;
• admit the possibility of noncoverage bias in statistics describing the target population; this bias arises because the population parameters of interest are “confounded” with parameters that determine the coverage probability.
5.1.1 Truncated and censored samples
Truncated and censored samples are forms of noncoverage arising when the sample design systematically excludes a fraction of the population in a way that depends on the value of some variable of interest Z.
Let D be a binary indicator of coverage, equal to one if Z assumes values in some region B and equal to zero otherwise. The fraction of the population subject to sampling is $P[D = 1] = P[Z \in B]$. I say that a sample is censored if P[D = 1] is known or may accurately be estimated from the available data, otherwise I say that the sample is truncated.
In the case of a censored sample, I further distinguish between fixed and random censoring depending on whether the region B, outside which D = 0, is fixed or is itself random, e.g., depends on the value of some other random variable W.
To illustrate these concepts, I now consider two examples: top-coding (Section 5.1.2) and self-selection into the labor force (Section 5.1.3).
5.1.2 Top-coding
To protect the privacy of high-income individuals, income data are often subject to top-coding, that is, the actual income amount is recorded if it falls below a certain threshold, otherwise the survey only contains a flag indicating that the income is top-coded at the threshold value. A leading example is the public-use releases of the US Current Population Survey (see, e.g., http://cps.ipums.org/cps/topcodes_tables.shtml).
Sampling is censored if the population proportion of individuals with income below the top-coding threshold is known or may be accurately estimated, otherwise it is truncated.
To analyze the problems caused by top-coding, let the variability of income in the population be represented by a nonnegative continuous random variable $Z^*$, let the income threshold be equal to some constant $c > 0$, and let $D$ be a binary indicator equal to one if $Z^* \in B = [0, c)$ and equal to zero otherwise, so $D = 1[Z^* < c]$. The fraction of the population for which actual income may be observed is equal to
\[ P[D = 1] = P[Z^* < c] = F(c), \]
where $F$ denotes the DF of $Z^*$. Although actual income is unobservable when it exceeds c, measured income is often conventionally set equal to c in this case. With this convention, measured income may be represented by the random variable
\[ Z = D Z^* + (1 - D)\, c = \min\{Z^*, c\}, \]
corresponding to a right-censored version of $Z^*$ with fixed censoring at the point c.
Truncated samples
Let the data $\{(D_i, Z_i)\}_{i=1}^n$ consist of n independent observations on (D, Z).
A truncated sample only contains individuals with income below the threshold c, so $D_i = 1$ for all i.
Conditional on $D_i = 1$, the DF of an observation $Z_i$ from a truncated sample is
\[ G(z \mid D_i = 1) = P[Z^* \le z \mid Z^* < c] = \begin{cases} F(z)/F(c), & \text{if } z < c, \\ 1, & \text{otherwise} \end{cases} \tag{5.1} \]
(Figure 13). Since F(c) is unknown, it is generally impossible to identify F on the basis of a truncated sample. Thus, the EDF of the observed data provides a good estimate of the ratio F(z)/F(c) for z < c, but not of F(z).
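The following is a minimal simulation sketch of this point, not part of the original argument. It assumes, purely for illustration, that $Z^*$ is unit exponential and that the threshold is c = 1.5; the names `edf`, `truncated` and `z0` are hypothetical. It compares the EDF of a truncated sample with both F(z)/F(c) and F(z) at a single point.

```python
import math
import random


def edf(sample, z):
    """Empirical distribution function of `sample` evaluated at z."""
    return sum(1 for v in sample if v <= z) / len(sample)


def F(z):
    """DF of Z* under the illustrative assumption Z* ~ Exponential(1)."""
    return 1.0 - math.exp(-z)


rng = random.Random(0)   # fixed seed so the sketch is reproducible
c = 1.5                  # hypothetical top-coding threshold

# Draw from the population, then keep only units with D = 1 (Z* < c).
population = [rng.expovariate(1.0) for _ in range(200_000)]
truncated = [z for z in population if z < c]

z0 = 1.0
edf_at_z0 = edf(truncated, z0)    # what the truncated data reveal
ratio_at_z0 = F(z0) / F(c)        # the ratio in (5.1)
F_at_z0 = F(z0)                   # the population DF, not recovered

print(f"EDF of truncated sample at z0: {edf_at_z0:.3f}")
print(f"F(z0)/F(c):                    {ratio_at_z0:.3f}")
print(f"F(z0):                         {F_at_z0:.3f}")
```

Under these assumptions the EDF should track F(z0)/F(c) closely and sit well above F(z0). Without knowledge of F(c), the two cannot be separated.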
Figure 13: Population DF F(z) and conditional DF G(z | Di = 1) for a truncated sample (horizontal axis: z; vertical axis: DF value between 0 and 1).
Censored samples
A censored sample contains individuals with income both below c and above c, so both $D_i = 0$ and $D_i = 1$ are observed. Although the actual income of those with $D_i = 0$ is unknown, you have sufficient information for estimating $P[D = 1]$, for example by $\bar{D} = n^{-1} \sum_{i=1}^n D_i$.
The conditional DF of $Z_i$ given $D_i = 1$ is equal to (5.1), while the DF of a top-coded observation is degenerate with all its mass concentrated at the point c, so $G(z \mid D_i = 0) = 1[z \ge c]$. Because the marginal distribution of $D_i$ is
\[ P[D_i = d] = \begin{cases} 1 - F(c), & \text{if } d = 0, \\ F(c), & \text{if } d = 1, \end{cases} \]
the joint DF of $(D_i, Z_i)$ is
\[ G(d, z) = G(z \mid D_i = d)\, P[D_i = d] = \begin{cases} 1[z \ge c]\, (1 - F(c)), & \text{if } d = 0, \\ 1[z < c]\, F(z) + 1[z \ge c]\, F(c), & \text{if } d = 1. \end{cases} \]
The marginal DF of a sample observation $Z_i$ is therefore
\[ G(z) = G(0, z) + G(1, z) = \begin{cases} F(z), & \text{if } z < c, \\ 1, & \text{if } z \ge c \end{cases} \]
(Figure 14), which is the DF of a mixed (continuous-discrete) distribution assigning probability mass $1 - F(c)$ to the single point z = c and spreading the remaining mass F(c) over the interval $(-\infty, c)$ according to the DF F(z). Thus, a censored sample contains enough information to identify F(z) for z < c, but no information to identify F(z) for $z \ge c$.
It is easily seen that, if $\mu = E[Z^*]$ exists, then the mean of $Z_i$ is
\[ E[Z_i] = \mu - \int_c^\infty [1 - F(z)]\, dz < \mu. \]
Although knowledge of F(z) for z < c is generally insufficient to identify the mean of $Z^*$, it is nevertheless sufficient to recover all the quantiles $\zeta_p$ of $Z^*$ such that $p \le F(c)$. In particular, it is enough to identify the median of $Z^*$ whenever $F(c) > 1/2$, that is, c is above the median.
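As a rough numerical check of these statements, here is a small simulation sketch. The unit exponential distribution for $Z^*$ and the threshold c = 1.5 are my own illustrative assumptions, as are all variable names. In this special case $\mu = 1$, the median of $Z^*$ is $\ln 2$, and the mean formula above reduces to $E[Z_i] = 1 - e^{-c}$.

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the sketch is reproducible
c = 1.5                  # hypothetical top-coding threshold
n = 200_000

z_star = [rng.expovariate(1.0) for _ in range(n)]   # actual income Z*
d = [1 if v < c else 0 for v in z_star]             # coverage indicator D
z = [min(v, c) for v in z_star]                     # measured income Z = min{Z*, c}

d_bar = sum(d) / n                     # estimates P[D = 1] = F(c)
mean_z = statistics.fmean(z)           # estimates E[Z], which is below mu
median_z = statistics.median(z)        # estimates the median of Z*, since F(c) > 1/2

mu = 1.0                               # E[Z*] for the unit exponential
F_c = 1.0 - math.exp(-c)               # F(c)
mean_z_theory = 1.0 - math.exp(-c)     # mu minus the integral of 1 - F(z) over (c, infinity)
median_theory = math.log(2.0)          # median of the unit exponential

print(f"D-bar = {d_bar:.3f}  vs F(c) = {F_c:.3f}")
print(f"mean of Z = {mean_z:.3f}  vs mu = {mu:.3f}")
print(f"median of Z = {median_z:.3f}  vs median of Z* = {median_theory:.3f}")
```

The sample mean of the top-coded data should fall visibly below $\mu$, while the sample median should be close to the population median, in line with the identification argument above.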
Figure 14: Population DF F(z) and DF G(z) for a censored sample (horizontal axis: z; vertical axis: DF value between 0 and 1).
5.1.3 Random censoring
In the prototypical model of self-selection into the labor force (see e.g. Heckman 1978), people work (D = 1) if their potential market wage $Z^*$ exceeds their reservation wage $W^*$, and do not work (D = 0) otherwise. The probability of a person working is therefore
\[ P[D = 1] = P[W^* < Z^*] = P[C < 0], \]
where $C = W^* - Z^*$. The model also assumes that the observed wage coincides with the potential wage for those who work and is conventionally set to zero for those who do not work. The observed wage may then be represented by the nonnegative random variable
\[ Z = D Z^* = 1[W^* < Z^*]\, Z^*, \]
which corresponds to a left-censored version of $Z^*$. Here censoring is not fixed but depends on $W^*$, as $Z^*$ is only observed when it exceeds $W^*$.
Let the data $\{(D_i, Z_i)\}_{i=1}^n$ consist of n independent observations on (D, Z). A truncated sample only contains people who work, so $D_i = 1$ for all i. If F(z, c) denotes the joint DF of $(Z^*, C)$, the DF of $Z_i$ for those who work is
\[ G(z \mid D = 1) = P[Z^* \le z \mid D = 1] = \frac{P[Z^* \le z,\, C < 0]}{P[C < 0]} = \frac{F(z, 0)}{P[C < 0]}. \]
A censored sample contains instead both people who work and people who do not work, so it contains enough information to estimate P[D = 1]. The DF of Zi for those who work is G(z | D = 1), whereas for those who do not work it is degenerate with all its mass concentrated at zero, that is,
\[ G(z \mid D_i = 0) = P[Z_i \le z \mid D_i = 0] = 1[z \ge 0]. \]
The joint DF of $(D_i, Z_i)$ in a censored sample is therefore
\[ G(d, z) = G(z \mid D_i = d)\, P[D_i = d] = \begin{cases} F(z, 0), & \text{if } d = 1, \\ 1[z \ge 0]\, P[C \ge 0], & \text{if } d = 0. \end{cases} \]
Given knowledge of G(d, z), you can then compute the marginal DF of observed wages, $G(z) = G(0, z) + G(1, z)$, and compare it to the marginal DF of potential wages, $F(z) = \lim_{c \to \infty} F(z, c)$.
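To see how random censoring distorts the wage distribution, consider the following simulation sketch. It is only an illustration under assumptions of my own: potential and reservation wages are independent standard lognormal variables, and all variable names are hypothetical. It compares the DF of potential wages, F(z), with the DF of wages for those who work, G(z | D = 1), at the point z = 1.

```python
import math
import random

rng = random.Random(4)   # fixed seed so the sketch is reproducible
n = 200_000

potential = []   # Z*: potential market wage
works = []       # D = 1[W* < Z*]
observed = []    # Z = D Z*: observed wage, zero for those who do not work
for _ in range(n):
    z_star = math.exp(rng.gauss(0.0, 1.0))   # assumed lognormal potential wage
    w_star = math.exp(rng.gauss(0.0, 1.0))   # assumed lognormal reservation wage
    d = 1 if w_star < z_star else 0
    potential.append(z_star)
    works.append(d)
    observed.append(d * z_star)

share_working = sum(works) / n                          # estimates P[D = 1]
worker_wages = [z for z, d in zip(observed, works) if d == 1]

mean_potential = sum(potential) / n                     # estimates E[Z*]
mean_workers = sum(worker_wages) / len(worker_wages)    # estimates E[Z | D = 1]

F_at_1 = sum(1 for z in potential if z <= 1.0) / n
G_at_1 = sum(1 for z in worker_wages if z <= 1.0) / len(worker_wages)

print(f"share working:        {share_working:.3f}")
print(f"mean potential wage:  {mean_potential:.3f}")
print(f"mean wage of workers: {mean_workers:.3f}")
print(f"F(1) = {F_at_1:.3f}, G(1 | D = 1) = {G_at_1:.3f}")
```

Under these assumptions about half the population works, F(1) is about 1/2 and G(1 | D = 1) is about 1/4, so low wages are underrepresented among workers and their mean wage overstates the mean potential wage.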
5.2 Sample selection
Simple random sampling (Section 4.3.1) is of fundamental importance in theoretical statistics, because it guarantees that the model representing variability at the population level coincides with the model representing variability at the sample level. However, many sample designs of practical importance are far from this ideal.
To formalize the distinction between the models of variability at the population and at the sample level, I represent the population by a random variable Z with density function f(z) and the sample by a collection Z1, . . . , Zn of IID random variables with common density function g(z) = w(z) f(z), where w(z) is a nonnegative function that represents the sampling design.
Because the density function g must integrate to one, the weight function w must satisfy
\[ 1 = \int g(z)\, dz = \int w(z)\, f(z)\, dz = E[w(Z)]. \]
This formulation assumes that the sample design consists of n independent replications of the same chance experiment, but allows the model of variability to differ at the sample and at the population level.
Simple random sampling corresponds to the case when w(z) = 1. A more general sample design introduces a systematic difference between the probability distribution of a sample value and that of the population. When this occurs, I say that sampling is selective. In particular:
• values of Z such that w(z) > 1 are oversampled;
• values of Z such that 0 < w(z) < 1 are undersampled;
• values of Z such that w(z) = 0 are systematically missing in the data, as with noncoverage.
Sections 5.2.1 and 5.2.2 illustrate with two examples.
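Before turning to these examples, a small numerical sketch may help fix ideas. It is based on a hypothetical design of my own choosing, not one from the text: the population is uniform on (0, 1), so f(z) = 1, and the weight function is w(z) = 2z. The helper names `integrate` and `status` are illustrative. The code checks the normalization E[w(Z)] = 1 and classifies a few values of Z.

```python
def f(z):
    """Population density: Uniform(0, 1) (illustrative assumption)."""
    return 1.0


def w(z):
    """Hypothetical weight function representing the sampling design."""
    return 2.0 * z


def g(z):
    """Density of a sample observation, g(z) = w(z) f(z)."""
    return w(z) * f(z)


def integrate(h, a=0.0, b=1.0, m=10_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    step = (b - a) / m
    return step * sum(h(a + (k + 0.5) * step) for k in range(m))


def status(z):
    """Classify a value of Z according to the weight function."""
    weight = w(z)
    if weight == 0.0:
        return "missing"
    if weight > 1.0:
        return "oversampled"
    if weight < 1.0:
        return "undersampled"
    return "proportionally sampled"


total_mass = integrate(g)                         # should equal E[w(Z)] = 1
pop_mean = integrate(lambda z: z * f(z))          # mean of Z in the population
sample_mean = integrate(lambda z: z * g(z))       # mean of a sample observation

print(f"integral of g: {total_mass:.6f}")
print(f"population mean: {pop_mean:.4f}, sample mean: {sample_mean:.4f}")
for point in (0.0, 0.25, 0.5, 0.75):
    print(point, status(point))
```

With this design, values above 1/2 are oversampled and values below 1/2 are undersampled, so the mean of a sample observation exceeds the population mean.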
Figure 15: Population density f, density g of a sample observation, and weight function w = g/f representing the sampling design (horizontal axis: z, from 0 to 1).
5.2.1 Response-based sampling
Let Z = (X, Y ) be a random vector, and suppose that the outcome of interest Y is discrete and takes values 1, . . . , S. Also let As = {(x, y): y = s} be the population stratum for which Y = s. For concreteness, X may be a vector of observable individual characteristics and Y may represent the mode of transportation (automobile, bus, train, etc.).
Response-based sampling consists in drawing at random ns units from each stratum As and recording the associated value of X. This sampling scheme is motivated by the fact that sampling bus riders at a bus stop, train riders at a train station or car drivers at a parking lot, and recording their characteristics X, is often simpler and less expensive than interviewing people at their homes.
More generally, whenever the population units are physically clustered on the basis of the alternative they choose, response-based sampling offers economies of scale that may not be possible under other sampling processes.
If $\omega(s)$ denotes the sampling fraction from the sth population stratum and f(y) denotes the population probability that Y = y, then the probability of observing Y = y in the sample is
\[ g(y) = \frac{\omega(y)\, f(y)}{\sum_{s=1}^S \omega(s)\, f(s)} = w(y)\, f(y), \]
where the weight function
\[ w(y) = \frac{\omega(y)}{\sum_{s=1}^S \omega(s)\, f(s)} \]
is generally different from one. Thus, $g(y) \ne f(y)$.
A notable exception is when the sampling fraction $\omega(s)$ is the same for all strata, which corresponds to self-weighting or proportional allocation.
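A short worked example may make the formula concrete. The population shares and sampling fractions below are invented for illustration only, and the function name `sample_probabilities` is hypothetical.

```python
def sample_probabilities(omega, f):
    """Return (w, g) with w(y) = omega(y) / sum_s omega(s) f(s) and g(y) = w(y) f(y)."""
    denom = sum(omega[s] * f[s] for s in f)
    w = {y: omega[y] / denom for y in f}
    g = {y: w[y] * f[y] for y in f}
    return w, g


# Hypothetical population shares of each transportation mode.
f = {"car": 0.7, "bus": 0.2, "train": 0.1}

# Hypothetical sampling fractions: riders of rarer modes are sampled more intensively.
omega = {"car": 0.01, "bus": 0.05, "train": 0.10}
w, g = sample_probabilities(omega, f)

# Equal sampling fractions (proportional allocation) as a comparison case.
omega_equal = {"car": 0.02, "bus": 0.02, "train": 0.02}
w_equal, g_equal = sample_probabilities(omega_equal, f)

for mode in f:
    print(f"{mode:5s} f = {f[mode]:.3f}  w = {w[mode]:.3f}  g = {g[mode]:.3f}  "
          f"g (equal fractions) = {g_equal[mode]:.3f}")
```

In this example car users make up 70 percent of the population but only about a quarter of the sample, whereas with equal sampling fractions the sample shares coincide with the population shares.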
5.2.2 Flow and stock sampling
Let Z be a nonnegative continuous random variable representing the duration of an unemployment spell, and let $f$, $\mu$ and $\sigma^2$ respectively denote the density function, the mean and the variance of Z.
There are two ways of gathering information about the distribution of Z:
• Flow sampling: consists of drawing a random sample from the population of those who become unemployed during a specified time period.
• Stock sampling: consists of drawing a random sample from the population of those who are unemployed at a given point in time.
Suppose you observe unemployment durations Z1, . . . , Zn for a random sample of individuals. For simplicity, assume that these data correspond to completed unemployment spells.
Under flow sampling, the density of a completed unemployment spell Zi for a new entrant is the same as the density f(z) of Z.
Next consider stock sampling. Write the completed duration of an unemployment spell for the ith person as $Z_i = U_i + V_i$, where $U_i$ is the elapsed duration (the duration up to the time of the survey) and $V_i$ is the successive duration. For the ith person to be registered as unemployed it is necessary that $U_i > 0$, that is, elapsed duration must be left-censored. It can be shown that, under stock sampling, the density of $Z_i$ is
\[ g(z) = \frac{z}{\mu}\, f(z). \]
In this case w(z) = z/µ, so stock sampling oversamples spells longer than the mean µ and undersamples spells shorter than µ. In turn, this leads to incorrect inferences about the distribution of Z at the population level. For this reason, stock sampling is said to be length-biased.
For example, under stock sampling, the mean of the observed durations is
\[ E[Z_i] = \int_0^\infty z\, g(z)\, dz = \mu \left( 1 + \frac{\sigma^2}{\mu^2} \right) > \mu. \]
Thus, stock sampling leads to an upward biased measure of mean unemployment duration, and the relative bias $(E[Z_i] - \mu)/\mu$ equals the square of the coefficient of variation $\sigma/\mu$ of Z.
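Length bias is easy to reproduce by simulation. The sketch below rests on assumptions that are mine rather than the text's: spell durations are exponential with mean 1 (so $\sigma^2 = \mu^2$ and the predicted stock-sample mean is $2\mu$), spells start uniformly over a window of length 20, and the stock sample is taken at the end of the window. Variable names are hypothetical.

```python
import random

rng = random.Random(2)     # fixed seed so the sketch is reproducible
mu = 1.0                   # mean spell duration (exponential, so sigma^2 = mu^2)
horizon = 20.0             # spells start uniformly on [0, horizon]
survey_time = horizon      # date at which the stock of unemployed is sampled
n_spells = 400_000

flow = []    # completed durations of all entrants (flow sample)
stock = []   # completed durations of spells in progress at survey_time
for _ in range(n_spells):
    start = rng.uniform(0.0, horizon)
    z = rng.expovariate(1.0 / mu)
    flow.append(z)
    # A spell enters the stock sample only if it is ongoing at the survey date.
    if start <= survey_time < start + z:
        stock.append(z)

flow_mean = sum(flow) / len(flow)
stock_mean = sum(stock) / len(stock)
sigma2 = mu ** 2                                   # variance of the exponential
predicted_stock_mean = mu * (1.0 + sigma2 / mu ** 2)

print(f"flow-sample mean:     {flow_mean:.3f} (mu = {mu})")
print(f"stock-sample mean:    {stock_mean:.3f}")
print(f"mu (1 + sigma^2/mu^2): {predicted_stock_mean:.3f}")
```

Long spells are more likely to be in progress at any given date, which is why the stock-sample mean should come out close to twice the flow-sample mean in this setup.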
5.3 Nonresponse
Nonresponse is a typical problem with microeconomic surveys. Because of nonresponse, the available data do not conform to the ideal case of complete data for which standard statistical methods and software packages are typically designed.
Nonresponse not only implies an efficiency loss relative to the complete-data case, due to the smaller effective sample size, but also makes it harder to identify population parameters. As shown by Horowitz and Manski (1998), the seriousness of the problem is directly proportional to the amount of nonresponse.
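One simple way to see the proportionality is through worst-case bounds on a mean when the outcome is known to lie in a bounded interval. The sketch below is my own illustration of that standard bounding argument, not a reproduction of the Horowitz and Manski analysis; the function name `worst_case_bounds` and the data are hypothetical.

```python
def worst_case_bounds(observed, n_missing, y_min=0.0, y_max=1.0):
    """Bounds on the population mean of an outcome known to lie in [y_min, y_max].

    The missing outcomes are replaced in turn by the smallest and the
    largest logically possible value, with no assumption on why they are missing.
    """
    n_obs = len(observed)
    p_missing = n_missing / (n_obs + n_missing)
    mean_observed = sum(observed) / n_obs
    lower = (1.0 - p_missing) * mean_observed + p_missing * y_min
    upper = (1.0 - p_missing) * mean_observed + p_missing * y_max
    return lower, upper


# Hypothetical binary outcome: four respondents and one nonrespondent.
responses = [1.0, 0.0, 1.0, 1.0]
lower, upper = worst_case_bounds(responses, n_missing=1)
width = upper - lower

# Same respondents, but with four nonrespondents instead of one.
lower_more, upper_more = worst_case_bounds(responses, n_missing=4)
width_more = upper_more - lower_more

print(f"1 missing: [{lower:.2f}, {upper:.2f}], width {width:.2f}")
print(f"4 missing: [{lower_more:.2f}, {upper_more:.2f}], width {width_more:.2f}")
```

The width of the interval equals the nonresponse rate times the range of the outcome, so without further assumptions the identification problem grows in direct proportion to the amount of nonresponse.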
It is useful to distinguish between three types of nonresponse:
• unit nonresponse (Section 5.3.1), which occurs when no information is available from a sample unit;
• attrition and new entry (Section 5.3.2), which is a special case of unit nonresponse arising in panel data;
• item nonresponse (Section 5.3.3), which occurs when no information is available from a sample unit on a particular item included in the survey.
5.3.1 Unit nonresponse
Unit (or complete) nonresponse occurs when no information is available from a sample unit. It may be caused by refusal or inability of the unit to participate, or by a missing survey instrument.
There is a trend towards increasing unit nonresponse rates in micro-level surveys (see, e.g., Meyer, Mok and Sullivan 2015).
I will now consider separately three types of unit nonresponse with somewhat different characteristics:
• noncontact, i.e. failure to deliver the survey request because the sample unit cannot be located or reached;
• inability to participate, e.g. a contacted person cannot understand the language of the questionnaire;
• refusal to participate, i.e. the survey request is delivered but is rejected.
Noncontact
Many factors influence whether sample units can be contacted. The situation is different between face-to-face and telephone surveys on the one hand, and mail, e-mail and web surveys on the other hand.
In face-to-face and telephone surveys, it is not known exactly when people are at home and accessible. So interviewers are generally instructed to make multiple contact attempts on a unit until contact is achieved. Notice that:
• contact attempts in the evening and on weekends are more successful than at other times;
• households where someone is almost always at home are the easiest to contact;
• the percentage of successful attempts tends to decline with each successive attempt.
Mail, e-mail and web surveys make the survey request accessible to the sample unit continuously after it is sent. In these cases, it is more difficult to differentiate between noncontact (the request has not been seen) and refusal to participate (the request has been seen and rejected).
Inability to participate
Sometimes, sample units are successfully contacted and would be willing to cooperate, but cannot for a number of reasons. For example:
• cannot understand the language(s) of the survey;
• are incapable of understanding the questions or retrieving from memory the information requested;
• have physical health problems;
• do not have the information requested (this sometimes occurs in business surveys).
Refusal to participate
Factors influencing refusal to participate include:
• Social environment: in household surveys, large urban areas tend to generate more refusals, while households with more than one member tend to generate fewer refusals than single-person households.
• Personal characteristics: men tend to generate more refusals than women.
• Interviewer characteristics: more experienced interviewers tend to obtain lower refusal rates.
• Survey design: rewards (monetary or in-kind) tend to reduce refusal rates.
While the first two factors are outside the control of a researcher, the last two can be manipulated to increase response rates. See Groves, Cialdini and Couper (1992) and Groves and Couper (1998) for more detail.
Theories of refusal behavior (see Groves et al. 2004, pp. 176–177):
• The social exchange theory, or opportunity cost hypothesis (see also Hill and Willis 2001), posits a “social exchange” between the survey interviewer, who desires information that the respondent has, and the respondent, who decides how much information to reveal. People refuse to participate if the act of participation involves costs (actual or perceived) that exceed the expected rewards.
• The leverage-salience theory: different people attach different importance to features of the survey, such as topic, burden, incentives, authority of sponsor, etc. When a sample person is approached, one or more of these features are made salient in the interaction with the interviewer or with the survey material provided to the person. Depending on what is made salient and how the person values these features, the result could be a refusal or an acceptance.
5.3.2 Panel attrition and new entry
A panel is called balanced if all sample units are observed at exactly the same times. Otherwise, it is called unbalanced. Unbalanced panels are the most common case.
Attrition and new entry are typical sources of unbalancedness. Thus, the study of attrition and new entry plays a key role in evaluating what you can learn from a panel survey about population characteristics of interest.
Attrition
Attrition arises when units initially included in a panel survey are lost through time. Attrition affects, more or less seriously, all microeconomic panels.
An important distinction is between exogenous and endogenous attrition:
• attrition is exogenous if it is unrelated to the outcome of interest Y; exogenous attrition does not bias the information carried by the sample about the population regression function of Y and typically only causes efficiency problems;
• attrition is endogenous if it is systematically related to Y ; endogenous attrition may cause bias if not properly controlled for.
For example, in panels of individuals, attrition is typically associated with important transitions in a person’s life: going to college, finding a new job, marriage or divorce, retirement, etc. If these events are the main object of study, then attrition is endogenous. This may lead to invalid inference about the population of interest, even when attrition rates are modest.
Attrition depends on both individual behavior and the procedures followed by the agency collecting the data. This distinction is important but often ignored.
It is also useful to distinguish between monotone and intermittent attrition, depending on whether or not a panel participant, who at some point leaves, returns to the panel at a later date. In the first case units are lost permanently, in the second they are lost only temporarily.
New entry
New entry is another kind of unit nonresponse arising in panel surveys, and is defined as failure to obtain data from a sample unit at any wave before its addition to the initial sample.
Depending on its characteristics, new entry may reduce or exacerbate the effects of attrition on the representativeness of a panel.
5.3.3 Item nonresponse
Item (or partial) nonresponse occurs when a unit refuses to answer, or fails to provide a useful response, to a particular question included in the survey.
Qualitatively, the impact of item nonresponse is the same as that of unit nonresponse. There are, however, important differences:
• The damage tends to be limited when item nonresponse is confined to a few variables, but can be substantial when the cumulative effect of item nonresponse to several related questions is considered (e.g., the effect on total income of item nonresponse to questions about specific income components).
• The causes of item and unit nonresponse tend to be di↵erent. Among those that have been studied are:
– inadequate comprehension of the intent of a question;
– failure to retrieve adequate information;
– lack of willingness or motivation to disclose the information requested.
The rate of item nonresponse tends to differ across variables, and is typically found to be highest for questions on income and assets.
5.4 Missing-data mechanisms
In microeconomic data, nonresponse is one of the main reasons why data may be missing. As Kline and Santos (2013) argue:
“Despite major advances in the design and collection of survey and administrative data, missing and incomplete records remain a pervasive feature of virtually every modern economic data set.”
To formalize the general missing data problem, let M be a missing data indicator, equal to one if the information on some outcome of interest Y is missing and to zero otherwise, and let X be a vector of covariates that may help predict missingness and are always observed. The available data consist of a sample $\{(M_i, X_i, Y_i)\}_{i=1}^n$ from (M, X, Y), with $Y_i$ missing if $M_i = 1$.
What variables may be included in X depends on the specific missing data problem. For example, in the case of unit nonresponse the only available information is from the frame and the fieldwork. In the case of item nonresponse one has the additional information on the nonmissing variables. In the case of attrition one also has the information from previous waves.
Let f(m, z) = f(m, x, y) denote the joint density of (M, Z) = (M, X, Y ) and let
p(m | z) = P[M = m | Z = z], m = 0, 1, denote the conditional probability of M.
Following Rubin (1976), Little and Rubin (2002) and Seaman (2013), I distinguish between three types of missing-data mechanisms depending on the nature of the joint density f(m, z) or, equivalently, on the nature of the conditional probability p(m | z). The equivalence between the two formulations follows from the fact that
f(m, y | x) f(x) = f(m, z) = p(m | z) f(z).
5.4.1 Missing completely at random
Data are missing completely at random (MCAR) if M does not depend on the variables in Z, that is, neither on X nor on Y, also written $M \perp\!\!\!\perp Z$ or $M \perp\!\!\!\perp (X, Y)$. Thus, MCAR is equivalent to the condition that
f(m, z) = f(z) p(m),
or that p(m | z) does not depend on z.
5.4.2 Missing at random
Data are missing at random (MAR) if, after conditioning on the covariates in X, there is no dependence between M and Y, also written $(M \perp\!\!\!\perp Y) \mid X$. Thus, MAR is equivalent to the condition that
f(m, y | x) = f(y | x) p(m | x).
Because it is always true that
f(m, y | x) = p(m | x, y) f(y | x) = f(y | m, x) p(m | x),
MAR implies that f(y | m, x) = f(y | x)
and p(m | x, y) = p(m | x).
MAR does not hold if either of these two conditions fails.
5.4.3 Missing not at random
Data are missing not at random (MNAR) if, even after conditioning on the covariates in X, there is dependence between M and Y , that is,
$f(m, y \mid x) \ne f(y \mid x)\, p(m \mid x)$.
Equivalently, data are MNAR if p(m | z) depends on y and possibly also on x.
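The practical difference between the three mechanisms can be illustrated with a small simulation. The data-generating process and the missingness probabilities below are assumptions of mine, chosen only for illustration: X is a binary covariate, Y = X + standard normal noise (so E[Y] = 0.5 and E[Y | X = 1] = 1), and the function `p_missing` is a hypothetical choice of p(m | z) for each mechanism.

```python
import random

rng = random.Random(3)   # fixed seed so the sketch is reproducible
n = 100_000

# Complete data (X, Y): X is always observed, Y may be missing.
data = []
for _ in range(n):
    x = 1 if rng.random() < 0.5 else 0
    y = x + rng.gauss(0.0, 1.0)
    data.append((x, y))


def p_missing(mechanism, x, y):
    """Hypothetical probability that Y is missing (M = 1) under each mechanism."""
    if mechanism == "MCAR":
        return 0.3                        # depends on neither X nor Y
    if mechanism == "MAR":
        return 0.5 if x == 1 else 0.1     # depends on X only
    if mechanism == "MNAR":
        return 0.6 if y > 0.5 else 0.1    # depends on Y itself
    raise ValueError(mechanism)


def mean(values):
    return sum(values) / len(values)


cc_mean = {}     # complete-case estimate of E[Y]
cell_mean = {}   # complete-case estimate of E[Y | X = 1]
for mech in ("MCAR", "MAR", "MNAR"):
    observed = [(x, y) for x, y in data if rng.random() >= p_missing(mech, x, y)]
    cc_mean[mech] = mean([y for _, y in observed])
    cell_mean[mech] = mean([y for x, y in observed if x == 1])
    print(f"{mech}: mean of observed Y = {cc_mean[mech]:.3f}, "
          f"mean of observed Y given X = 1: {cell_mean[mech]:.3f}")
```

In this setup the complete-case mean is close to E[Y] only under MCAR. Under MAR the overall mean is distorted but the mean within the cell X = 1 is still recovered, which is what f(y | m, x) = f(y | x) implies. Under MNAR even the within-cell mean is off.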
5.4.4 Ignorable and nonignorable missingness
A missing-data mechanism is ignorable if data are MCAR or MAR, and is nonignorable if data are MNAR.
The dominant approach to missing data assumes ignorability. However, as argued by Kline and Santos (2013), this is
“. . . an assumption whose popularity owes more to convenience than plausibility. Even in settings where it is reasonable to believe that nonresponse is approximately ignorable, the prevalence of missing values in modern economic data suggests that economists ought to assess the sensitivity of their conclusions to small deviations from this assumption.”
In what follows I shall consider three different approaches to missing data:
• Design the survey to reduce nonresponse (Section 5.4.4). This ex-ante approach is obvious but may be costly and is unfeasible if the data have already been collected.
• If the missing data mechanism is ignorable, ex-post approaches include:
– reweighting the data to compensate for the units lost to nonresponse (Section 5.5.1);
– replacing the missing values with imputations based on a statistical model (Section 5.5.2).
• If the missing data mechanism is nonignorable, ex-post approaches include:
– modeling missingness explicitly (Section 5.6.1),
– measuring the uncertainty that missingness creates (Section 5.6.2).
Survey design to reduce missingness
A number of aspects of survey design can influence the probability of missing data and may therefore be adjusted to reduce missingness. These include:
• Survey content: a survey on drugs or on financial matters needs careful ordering and wording of the questions, or some randomized response technique.
• Design of the survey instrument: very important. For example, the complexity of a questionnaire is known to negatively affect response rates and data quality. Important dimensions of complexity are: (i) how long it takes to fill the questionnaire, (ii) the wording of the questions, (iii) the reference period for the questions, and (iv) the branching and skip patterns.
• Characteristics and training of the interviewers: they matter a lot.
• Data collection methods: telephone, mail, fax and internet surveys tend to have lower response rates than in-person surveys.
• Time of the survey: vacations, weekends, time of the day may matter.
• Survey introduction: the respondent should be told clearly what the purpose of the survey is and should be assured about anonymity and confidentiality.
• Rewards: monetary and nonmonetary rewards appear to increase cooperation, but their effect on data quality is not clear.
5.5 Approaches under ignorability
5.5.1 Weighting
Surveys with complex design usually display unequal selection probabilities, different response rates across subgroups, and departures from known distribution of key variables (e.g. gender) for the target population. To compensate for these features, survey agencies typically provide survey weights.
There are four main types of weights:
• First-stage ratio adjustment weights $w_i^{(1)}$ (for stratified multistage designs): To stabilize estimates over different selections of PSUs in a stratum. They are usually equal to the ratio between the population size of a stratum (from the sampling frame) and its survey estimate.
• Base (design) weights $w_i^{(2)}$: For differential selection probabilities. They are usually equal to the inverse of the selection probabilities.
• Nonresponse weights $w_i^{(3)}$: To adjust for unit nonresponse. They are usually equal to the inverse of the (estimated) probabilities of unit response.
• Post-stratification weights $w_i^{(4)}$: To ensure that survey estimates of sub-totals (e.g. by gender or age group) agree with external sources.
Both the first-stage ratio adjustment weights and the post-stratification weights are typically based on information from the target population, not on sample information.
Once the various weights have been computed, they are combined into a final weight
\[ w_i = w_i^{(1)}\, w_i^{(2)}\, w_i^{(3)}\, w_i^{(4)}, \qquad i = 1, \ldots, n. \]
Often, surveys only provide the final weights, not their individual components.
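As a rough sketch of how the components combine, the function below builds a final weight for a single unit. The function name `final_weight`, its arguments and the numbers are hypothetical. The base weight is taken to be the inverse of the selection probability and the nonresponse weight the inverse of the estimated response probability, while the two ratio-type adjustments are passed in directly because they depend on external population information.

```python
def final_weight(ratio_adjustment, p_selection, p_response, post_stratification):
    """Final weight as the product of the four component weights.

    ratio_adjustment    : first-stage ratio adjustment weight w(1)
    p_selection         : selection probability, so w(2) = 1 / p_selection
    p_response          : estimated response probability, so w(3) = 1 / p_response
    post_stratification : post-stratification weight w(4)
    """
    if not (0.0 < p_selection <= 1.0 and 0.0 < p_response <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    base_weight = 1.0 / p_selection
    nonresponse_weight = 1.0 / p_response
    return ratio_adjustment * base_weight * nonresponse_weight * post_stratification


# Hypothetical unit: selected with probability 1/100, in a group where half
# of the sampled units respond, with a 10 percent post-stratification increase.
w_i = final_weight(ratio_adjustment=1.0, p_selection=0.01,
                   p_response=0.5, post_stratification=1.1)
print(f"final weight: {w_i:.1f}")
```

When a survey releases only the final weight, this decomposition cannot be undone, so the separate roles of the four components are no longer visible to the user.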
Estimates of population size, population total and population mean
Using the final weights, an estimate of the population size is
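\[ \hat{N}_w = \sum_{i=1}^n w_i, \]
an estimate of the population total of Z is
\[ \hat{\tau}_w = \sum_{i=1}^n w_i Z_i, \]
and the resulting estimate of the population mean of Z is
\[ \hat{\mu}_w = \frac{\hat{\tau}_w}{\hat{N}_w} = \frac{\sum_{i=1}^n w_i Z_i}{\sum_{i=1}^n w_i}. \]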
Notice that µ̂w does not change if the weights are rescaled up or down. Thus, for estimation of the population mean, it does not matter if the weights add up to the total size of the population, or they add up to one.
All three estimators are constructed by analogy with the estimators presented in Section 4.3.4. Because of the complex nature of the final weights, however, their sampling properties tend to be quite complicated, especially if the missing-data mechanism is nonignorable.
In general, weights based on sample information tend to increase the sampling variance of survey estimators, but are used in order to reduce biases resulting from noncoverage and nonresponse.
On the other hand, weights using frame population information (such as the first-stage adjustment weights and the post-stratification weights) tend to reduce the sampling variance of survey estimators.
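The three weighted estimators are straightforward to compute. The sketch below uses made-up values and weights and a hypothetical function name, `weighted_estimates`. It also checks that the estimate of the mean is unchanged when the weights are rescaled.

```python
def weighted_estimates(z, w):
    """Return weighted estimates of population size, total and mean."""
    if len(z) != len(w):
        raise ValueError("z and w must have the same length")
    n_hat = sum(w)                                    # estimated population size
    tau_hat = sum(wi * zi for wi, zi in zip(w, z))    # estimated population total
    mu_hat = tau_hat / n_hat                          # estimated population mean
    return n_hat, tau_hat, mu_hat


# Hypothetical sample values and final weights.
z = [10.0, 20.0, 30.0]
w = [100.0, 200.0, 700.0]
n_hat, tau_hat, mu_hat = weighted_estimates(z, w)

# Rescale the weights so that they add up to one.
w_scaled = [wi / sum(w) for wi in w]
n_hat_scaled, tau_hat_scaled, mu_hat_scaled = weighted_estimates(z, w_scaled)

print(f"size = {n_hat:.0f}, total = {tau_hat:.0f}, mean = {mu_hat:.2f}")
print(f"mean with rescaled weights = {mu_hat_scaled:.2f}")
```

Rescaling changes the estimates of the population size and total, but not the estimate of the mean, because the scale factor cancels in the ratio.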
5.5.2 Imputations
Survey weights only adjust for unit nonresponse. Adjusting for item nonresponse is typically done through imputation. This procedure consists of filling in values that are missing, either because of item nonresponse or because the response is considered incorrect or implausible.
When a missing value is imputed, the dataset should in principle be supplemented by an indicator (or “flag”) showing whether the response was measured or imputed. Sometimes, this is not done.
Imputing missing values is often advocated on the ground that:
• it makes the data easier to analyze, for standard methods (e.g. OLS) may then be applied to the completed data;
• it reduces nonresponse bias relative to the alternative of excluding all observations with missing values.
In some cases, imputation may indeed reduce the bias relative to the case where missing values are simply ignored. However, as pointed out by Dempster and Rubin (1983):
“the idea of imputation is both seductive and dangerous. It is seductive because it can lull the user into the pleasurable state of believing that the data are complete after all, and is dangerous because it lumps together situations where the problem is sufficiently minor that it can legitimately be handled in this way and situations where standard estimators applied to real and imputed data have substantial bias”.
A key problem with the imputation methods discussed in the next two sections is that they all rely on the assumption that the data are MAR, conditional on the auxiliary variables employed in the imputation procedure. This assumption is untestable unless external information is available, such as administrative records or a follow-up of nonrespondents.
If the MAR assumption does not hold, then other imputation methods should be considered that explicitly model the nonresponse mechanism (Greenlees, Reece and Zieschang 1982).
Deterministic imputations
I distinguish between deterministic and random imputation methods.
Deterministic imputation methods replace missing values with means or other non-random values. Examples include:
• Mean imputation: Replaces the missing value of a variable Y for a nonrespondent with the mean value of Y for the respondents.
• Cell mean imputation: The sample is first partitioned into cells, based on the values of a set of auxiliary variables Z. Then the missing values of Y in a cell are replaced with the mean value of Y for the respondents in that cell.
• Hot-deck imputation: Replaces the missing value of Y for a nonrespondent (or recipient) with the value of Y for a “similar” respondent (or donor).
• Regression imputation: Replaces the missing values of Y with the predicted values from a regression of Y on a set of auxiliary variables Z using the nonmissing data. Mean imputation is the special case when Z only contains the constant term. Regression imputation requires that all auxiliary variables in Z have nonmissing values. When this is not the case, then the missing auxiliary variables are imputed through some preliminary imputation procedure.
• Cold-deck imputation: Replaces the missing value of Y with values from a previous survey or a different data set, e.g. historical data.
Hot-deck methods are popular (for example, they are used by the Bureau of the Census for imputing missing values in the US Current Population Survey) because they do not rely on model fitting and replace missing values with values that are “plausible” because they come from respondents. See Lillard, Smith and Welch (1986), Andridge and Little (2010), and Bollinger et al. (2015) for more details.
The main problem with deterministic imputation is underestimation of the variability of the variable of interest.
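The following minimal Python sketch illustrates this point for mean imputation. The data are simulated and purely hypothetical (Gaussian Y, 30% of values missing completely at random); the sketch only shows that replacing every missing value with the respondent mean leaves the mean unchanged but shrinks the sample standard deviation.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: 1,000 draws of Y, about 30% missing completely at random.
y = [random.gauss(50.0, 10.0) for _ in range(1000)]
missing = [random.random() < 0.3 for _ in y]
observed = [v for v, m in zip(y, missing) if not m]

# Mean imputation: every missing value is replaced by the respondent mean.
resp_mean = statistics.fmean(observed)
imputed = [resp_mean if m else v for v, m in zip(y, missing)]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)
print(f"SD among respondents:     {sd_observed:.2f}")
print(f"SD after mean imputation: {sd_imputed:.2f}")
```

The completed data have the same sum of squared deviations as the respondents' data but a larger nominal sample size, so the estimated variability necessarily falls.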
Random imputations
These methods replace missing values with randomly chosen values and are designed to avoid underestimation of the variability of the variable of interest.
Examples include:
• Random mean imputation: Replaces the missing values of Y with random draws from a $N(\bar{Y}, s^2)$ distribution, where $\bar{Y}$ and $s^2$ are the sample mean and variance of Y for the respondents.
• Random cell mean imputation: Replaces the missing values of Y in cell c with random draws from a $N(\bar{Y}_c, s_c^2)$ distribution, where $\bar{Y}_c$ and $s_c^2$ are the sample mean and variance of Y for the responding units in cell c.
• Random regression imputation: Replaces the missing values of Y with the predicted value from an auxiliary regression of Y on a set of auxiliary variables Z using the nonmissing data, plus a randomly generated error term (typically, a random draw from the set of residuals of the auxiliary regression).
• Random hot-deck imputation: A donor i is selected at random from the set of respondents (or donor pool) and the imputed value of Y for the jth recipient is computed as $Y_j^* = Y_i$. The imputed value $Y_j^*$ is therefore a random draw from the empirical distribution function (EDF) of Y for the respondents. To preserve key covariances, the same donor may be used for imputation of related missing variables.
• Random hot-deck imputation within classes: The sample is first partitioned into a number of “imputation classes” defined on the basis of auxiliary variables Z. A donor i is then selected at random from the set of respondents within the class containing the jth case, and the imputed value of Y for the jth recipient is computed as $Y_j^* = Y_i$. The imputed value $Y_j^*$ is therefore a random draw from the EDF of Y within the class. To preserve key covariances, the same donor may be used for imputation of related missing variables within the class.
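Random hot-deck imputation within classes can be sketched in a few lines of Python. The function name `hot_deck_within_classes`, the record layout and the toy data are hypothetical, chosen only for illustration; the sketch also adds the imputation flag recommended earlier.

```python
import random
from collections import defaultdict

def hot_deck_within_classes(records, class_key, target, rng):
    """Replace missing `target` values (None) with the value of a donor drawn
    at random from the respondents in the same imputation class.
    Returns new records carrying an 'imputed' flag; the input is not modified."""
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[rec[class_key]].append(rec[target])
    completed = []
    for rec in records:
        new = dict(rec)
        new["imputed"] = rec[target] is None
        if new["imputed"]:
            pool = donors.get(rec[class_key])
            if not pool:
                raise ValueError(f"no donors in class {rec[class_key]!r}")
            new[target] = rng.choice(pool)  # random draw from the class EDF
        completed.append(new)
    return completed

# Toy data: two imputation classes defined by region.
records = [
    {"region": "north", "income": 30.0},
    {"region": "north", "income": 34.0},
    {"region": "north", "income": None},
    {"region": "south", "income": 20.0},
    {"region": "south", "income": None},
]
completed = hot_deck_within_classes(records, "region", "income", random.Random(1))
for rec in completed:
    print(rec)
```

A class without any respondent raises an error here; in practice one would merge classes or fall back to a coarser partition, a design choice the sketch leaves open.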
When are imputations adequate?
In general, survey statisticians try to include in the imputation all the relevant explanatory variables (see Schafer 1997). Unfortunately, considering all the variables that are relevant may be impractical because of multicollinearity or degrees-of-freedom problems, or simply impossible.
This means that imputation procedures necessarily impose exclusion restrictions that may be at odds with the restrictions imposed by applied researchers in their model of interest.
I suggest checking whether your model of interest includes any relevant regressor which has been omitted from the imputation procedure. If all your regressors are used as auxiliary variables in the imputation, then I suggest using the imputed data, possibly correcting the estimator’s variance using multiple imputation methods.
A second problem arises when the model of interest involves a nonlinear transformation of the outcome variable (e.g., the log transformation or a categorization). Thus, let D be a binary indicator equal to 0 if Y is missing, let $Y^*$ denote the imputed value of Y, and let $g(\cdot)$ be a nonlinear transformation. An imputation such that
$$E[Y^* \mid Z, D = 0] = E[Y \mid Z]$$
does not imply that
$$E[g(Y) \mid Z] = E[g(Y^*) \mid Z, D = 0].$$
So, for example, if the imputation model for Y is a linear regression, a model involving a nonlinear monotone transformation of Y may lead to inconsistent estimates. More generally, if the imputation model is linear in Y, then an imputed data estimator based on a moment function that is nonlinear in Y may be inconsistent.
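As a worked illustration (an added example, not part of the original argument), take $g = \log$ with $Y > 0$ and suppose that missing values are imputed deterministically by the conditional mean, $Y^* = E[Y \mid Z]$. Jensen's inequality then gives the direction of the discrepancy, and a lognormal assumption gives its size.

```latex
% Assumptions for illustration only: Y > 0, g = \log,
% and Y^* = E[Y \mid Z] for the units with D = 0.
E[\log Y^* \mid Z, D = 0] = \log E[Y \mid Z] \;\ge\; E[\log Y \mid Z].

% If, in addition, \log Y \mid Z \sim N(m(Z), s^2), then
\log E[Y \mid Z] = m(Z) + \tfrac{1}{2} s^2,
\qquad
E[\log Y \mid Z] = m(Z),

% so the imputed values overstate the conditional mean of \log Y by s^2 / 2.
```

The inequality is strict unless Y is degenerate given Z, so the imputation is correct for the mean of Y but not for the mean of log Y.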
Following Meng (1994), I say that the model of interest and the imputation models are uncongenial if they are based either on different sets of explanatory variables or on different parametric assumptions. If the two models are uncongenial, biases are likely to arise.
Multiple imputations
When imputed values are treated as “real values”, standard errors of estimates are underestimated, leading to confidence intervals that are too narrow and test statistics that are too large.
Multiple imputation (Rubin 1987) works as follows:
• Define an imputation model for the relationship between the potentially missing Y variables and a vector of regressors X.
• Estimate the imputation model using the nonmissing data.
• Replace the missing values with random draws from the estimated imputation model, thus obtaining a completed dataset. Repeat $m \ge 2$ times.
• Analyze each of the m completed datasets by standard complete-data methods and obtain m estimates of the statistics of interest.
• Take the sample variance of these m estimates as a measure of the uncertainty due to the presence of missing values.
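The last step is usually embedded in Rubin's combining rules, which add the between-imputation variance B (the sample variance of the m estimates) to the average within-imputation variance W, giving a total variance T = W + (1 + 1/m)B. The sketch below assumes a scalar parameter; the function name `rubin_combine` and the numbers are hypothetical.

```python
import statistics

def rubin_combine(estimates, variances):
    """Combine m complete-data estimates of a scalar parameter.
    Returns (pooled estimate, within variance W, between variance B, total T)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average complete-data variance
    between = statistics.variance(estimates)  # sample variance, divisor m - 1
    total = within + (1.0 + 1.0 / m) * between
    return pooled, within, between, total

# Hypothetical results from m = 5 completed datasets.
estimates = [2.10, 1.90, 2.00, 2.20, 1.80]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, within, between, total = rubin_combine(estimates, variances)
print(pooled, within, between, total)
```

Treating the imputed values as real corresponds to reporting only W, which is why the resulting standard errors are too small.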
5.6 Approaches under nonignorability
I first show what bias may arise assuming ignorability (i.e., MCAR or MAR) when the missing data mechanism is in fact nonignorable. Let Zi = (Xi, Yi) and suppose that the complete data obey the classical linear model
$$Y_i = \alpha + \beta X_i + U_i, \qquad i = 1, \ldots, n, \qquad (5.2)$$
where $U_i$ has zero mean, finite variance and is mean independent of $X_i$. Thus, if there were no missing data, an OLS regression of $Y_i$ on a constant and $X_i$ would give unbiased and consistent estimates of $\alpha$ and $\beta$.
Now consider the OLS estimator $\hat{\beta}$ in a regression of $Y_i$ on a constant and $X_i$ using the available nonmissing data, namely those with $M_i = 0$. There are two problems with this estimator of $\beta$:
• The effective sample size is $n(1 - \bar{M}) < n$, where $\bar{M} = n^{-1} \sum_{i=1}^{n} M_i$ is the fraction of missing data, so $\hat{\beta}$ is less efficient than the infeasible estimator based on the complete data.
• More importantly, the model for the conditional mean of the available nonmissing data is not (5.2) but
$$E[Y_i \mid X_i, M_i = 0] = \alpha + \beta X_i + h_i, \qquad i = 1, \ldots, n,$$
where $h_i = E[U_i \mid X_i, M_i = 0]$. Thus, $\hat{\beta}$ may suffer from omitted variable bias due to omission of the term $h_i$. Specifically, $\hat{\beta}$ is biased and inconsistent for $\beta$ unless missingness is ignorable, that is, one of the following two conditions holds:
– $h_i = 0$; this is the case when $E[U_i \mid X_i, M_i = 0] = E[U_i \mid X_i]$, that is, $U_i$ is mean independent of $M_i$ conditionally on $X_i$;
– $h_i \neq 0$, but $X_i$ and $h_i$ are uncorrelated.
5.6.1 Modeling missingness
When the data are MNAR, bias may be avoided by explicitly modeling the missing data mechanism.
To illustrate this approach, consider again model (5.2) with Y missing because of item nonresponse. Following the “opportunity cost hypothesis” (Section 5.3.1) assume that $M_i = 1$ whenever providing information on Y is perceived by the respondent as a net cost, that is, whenever
$$C_i^* = \eta_i + V_i > 0,$$
where $\eta_i$ is a systematic component that may depend on observable variables and $V_i$ is a random component that is independent of $\eta_i$ but potentially correlated with $U_i$. For simplicity, normalize $V_i$ to have zero mean and unit variance. Both $C_i^*$ and $V_i$ are typically unobservable.
To derive a closed-form expression for $h_i = E[U_i \mid X_i, M_i = 0]$, let $(U_i, V_i)$ be jointly Gaussian with zero means, finite variances and covariance $\gamma$, so $U_i = \gamma V_i + \epsilon_i$, where $\epsilon_i$ has zero mean and is independent of $V_i$.
Under this set of assumptions, the probability that Y is not missing is described by the probit model (see Section 6.2.3)
$$P[M_i = 0] = P[C_i^* \le 0] = P[V_i \le -\eta_i] = 1 - \Phi(\eta_i),$$
where $\Phi$ denotes the standard Gaussian DF. Notice that $P[M_i = 0] \to 1$ if $\eta_i \to -\infty$ and $P[M_i = 0] \to 0$ if $\eta_i \to \infty$. Further,
$$
\begin{aligned}
h_i &= E[U_i \mid X_i, V_i \le -\eta_i] = \gamma \, E[V_i \mid X_i, V_i \le -\eta_i] \\
&= \gamma \int_{-\infty}^{-\eta_i} v \, \frac{\phi(v)}{\Phi(-\eta_i)} \, dv \\
&= -\frac{\gamma}{1 - \Phi(\eta_i)} \int_{-\infty}^{-\eta_i} \phi'(v) \, dv \\
&= -\gamma \, \frac{\phi(\eta_i)}{1 - \Phi(\eta_i)},
\end{aligned}
$$
where $\phi$ denotes the standard Gaussian density and I used the fact that $\phi'(v) = -v \, \phi(v)$ and $\lim_{v \to -\infty} \phi(v) = 0$. The ratio $\lambda(u) = \phi(u)/[1 - \Phi(u)]$ is often called the inverse Mills ratio. Clearly $\lambda(u) \to 0$ if $u \to -\infty$.
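The closed-form result can be checked numerically. The sketch below, using only the Python standard library, computes the inverse Mills ratio from the Gaussian density and DF and compares the simulated mean of a standard Gaussian V truncated to V ≤ −η with −λ(η). The value η = 0.5 and the number of draws are arbitrary choices.

```python
import math
import random

def norm_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def inverse_mills(u):
    """lambda(u) = phi(u) / [1 - Phi(u)]; numerically unreliable for very large u."""
    return norm_pdf(u) / (1.0 - norm_cdf(u))

rng = random.Random(0)
eta = 0.5
draws = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
kept = [v for v in draws if v <= -eta]   # the units with M = 0

simulated = sum(kept) / len(kept)        # estimate of E[V | V <= -eta]
closed_form = -inverse_mills(eta)
print(f"simulated:   {simulated:.4f}")
print(f"closed form: {closed_form:.4f}")
```

Multiplying either number by the covariance between U and V gives the selection term h for a unit with this value of η.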
Heckman two-step estimator
Thus, under the assumptions of the model, the conditional mean of Y for units with nonmissing data is
$$E[Y_i \mid X_i, M_i = 0] = \alpha + \beta X_i - \gamma \, \lambda(\eta_i), \qquad i = 1, \ldots, n.$$
Now suppose $\eta_i = \delta^\top W_i$, where $W_i$ is a vector of observable variables that includes a constant term. Under the model assumptions, the parameter $\delta$ may be estimated consistently by the probit ML estimator $\hat{\delta}$, so $\eta_i$ may be estimated by $\hat{\eta}_i = \hat{\delta}^\top W_i$.
Heckman (1979) showed that, under the model assumptions, a consistent and asymptotically normal estimator of $\theta = (\alpha, \beta, \gamma)$ may be obtained by an OLS regression of $Y_i$ on a constant term, $X_i$ and $-\hat{\lambda}_i$, where $\hat{\lambda}_i = \lambda(\hat{\eta}_i)$.
The resulting estimator, known as Heckman 2-step, or Heckit estimator, exemplifies the control function approach, in which an additional regressor is introduced to “control” for the bias in the regression of $Y_i$ on a constant term and $X_i$ due to nonignorable missing data.
Notice that, in this model, $h_i = 0$ whenever $\gamma = 0$, so a test of $\gamma = 0$ is a test of ignorable missingness.
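The control function idea can be illustrated by simulation. The sketch below is not a full Heckit: to stay short it skips the first-step probit and plugs in the true index η, an assumption made only for illustration. All parameter values are hypothetical, the variable `w` plays the role of an excluded variable that affects missingness but not the outcome, and `ols` is a small helper written for this example.

```python
import math
import random

def inverse_mills(u):
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return pdf / (1.0 - cdf)

def ols(rows, y):
    """OLS coefficients from the normal equations (Gauss-Jordan elimination)."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(42)
alpha, beta, cov_uv = 1.0, 2.0, 1.5   # hypothetical true values
naive_rows, cf_rows, y_obs = [], [], []
for _ in range(20_000):
    x, w, v = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    eta = x + w                        # true index, assumed known here
    u = cov_uv * v + rng.gauss(0, 1)   # U = gamma * V + epsilon
    if eta + v > 0:                    # M = 1: Y is missing
        continue
    y_obs.append(alpha + beta * x + u)
    naive_rows.append([1.0, x])
    cf_rows.append([1.0, x, -inverse_mills(eta)])

naive = ols(naive_rows, y_obs)         # constant, slope
control = ols(cf_rows, y_obs)          # constant, slope, covariance term
print("complete-case OLS slope:", round(naive[1], 3))
print("control-function slope: ", round(control[1], 3))
print("estimated covariance:   ", round(control[2], 3))
```

In this design the complete-case slope is biased toward zero because the selection term is correlated with x, while adding the control function recovers a slope close to the true value. With an estimated index, the second-step standard errors would also need adjusting.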
Problems with the Heckit estimator
Under the model assumptions, the errors in the population regression of $Y_i$ on a constant, $X_i$ and $\hat{\lambda}_i$ are heteroskedastic. This is because
$$
\begin{aligned}
Y_i &= \alpha + \beta X_i - \gamma \lambda_i + \varepsilon_i \\
&= \alpha + \beta X_i - \gamma \hat{\lambda}_i + \varepsilon_i^*, \qquad i = 1, \ldots, n,
\end{aligned}
$$
where $\lambda_i = \lambda(\eta_i)$ and $\varepsilon_i^* = \gamma(\hat{\lambda}_i - \lambda_i) + \varepsilon_i$ is an error that has zero mean conditional on $X_i$ and $M_i = 0$ but is heteroskedastic.
A more serious problem is identification of the model parameters $\alpha$, $\beta$ and $\gamma$. This is typically achieved through one of three methods:
• Nonlinearity of $\lambda(u)$ (identification via functional form). This method is unsatisfactory because
$$\lim_{u \to \infty} \lambda(u) = \lim_{u \to \infty} \frac{-\phi'(u)}{\phi(u)} = \lim_{u \to \infty} \frac{u \, \phi(u)}{\phi(u)} = \lim_{u \to \infty} u,$$
so $\lambda(u)$ is approximately linear for large values of u (Figure 16), which may create problems if $W_i$ only consists of the constant term and $X_i$.
• By assuming that $W_i$ includes at least one variable, sometimes called an instrumental variable, that is not in $X_i$ (identification via exclusion restrictions). Finding a justification for the proposed exclusion restrictions is often not easy. In sample surveys, the characteristics of the interview process may sometimes provide credible exclusion restrictions.
• By focusing on the subsample of respondents with predicted probability of nonresponse less than some threshold close to zero (identification at infinity). For this subsample there is essentially no bias, so this method requires no exclusion restrictions. Unfortunately, one may end up with a very small subsample.
An even more serious problem is the fact that the Heckit estimator is inconsistent if the joint distribution of $U_i$ and $V_i$ is not Gaussian.
Figure 16: Inverse Mills ratio $\lambda(u) = \phi(u)/[1 - \Phi(u)]$. [Plot of lambda (vertical axis, from 0 to 4) against u (horizontal axis, from −4 to 4).]
5.6.2 Measuring uncertainty due to missingness
To illustrate this approach, suppose you are interested in learning about $\mu_i = E[Y_i \mid X_i]$, $i = 1, \ldots, n$. Let
$$\mu_{im} = E[Y_i \mid X_i, M_i = m], \qquad m = 0, 1,$$
and notice that
$$\mu_i = \mu_{i0}(1 - \pi_i) + \mu_{i1} \pi_i,$$
where $\pi_i = P[M_i = 1 \mid X_i]$ is the missing data probability for the ith unit. If you always observe $M_i$ but only observe $Y_i$ when $M_i = 0$, then you can in principle estimate $\mu_{i0}$ and $\pi_i$ from the data but not $\mu_{i1}$, and therefore not $\mu_i$.
To deal with this problem, nonsample or prior information must be available. You may for example assume that $\mu_{i0} = \mu_{i1}$ (ignorable missingness). This is weaker than the MAR assumption, as it corresponds to mean independence between $Y_i$ and $M_i$ conditionally on $X_i$, but still restrictive.
As an alternative, Manski (1989) proposed to exploit prior information on the distribution of $Y_i$ given $M_i = 1$. Specifically, suppose you know that
$$a_i \le \mu_{i1} \le b_i,$$
where $a_i$ and $b_i$ are known or estimable from the data. Then $\mu_i$ must necessarily belong to the closed interval $[\mu_{iL}, \mu_{iU}]$, where
$$\mu_{iL} = \mu_{i0}(1 - \pi_i) + a_i \pi_i,$$
$$\mu_{iU} = \mu_{i0}(1 - \pi_i) + b_i \pi_i.$$
The length of this interval,
$$\mu_{iU} - \mu_{iL} = (b_i - a_i) \pi_i,$$
may be taken as a measure of your uncertainty about the population mean of $Y_i$ given $X_i$ caused by the missing data. The smaller the range $b_i - a_i$ or the missing data probability $\pi_i$, the more informative the data are.
If $Y_i$ is a binary indicator, then its mean is a probability, so $\mu_{i1}$ is naturally bounded between 0 and 1. In this case, putting $a_i = 0$ and $b_i = 1$ leads to the worst-case bounds
$$\mu_{i0}(1 - \pi_i) \le \mu_i \le \mu_{i0}(1 - \pi_i) + \pi_i,$$
so the missing data probability $\pi_i$ is a direct measure of your uncertainty about $\mu_i$.
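The bounds are simple enough to compute directly. In the sketch below, the function name `manski_bounds` and the numbers (a respondent mean of 0.60 for a binary outcome and a 20% missing data probability) are hypothetical.

```python
def manski_bounds(mu_observed, pi_missing, lower, upper):
    """Bounds on the overall mean when the mean among the missing units
    is only known to lie in [lower, upper]."""
    if not 0.0 <= pi_missing <= 1.0:
        raise ValueError("pi_missing must be a probability")
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    base = mu_observed * (1.0 - pi_missing)
    return base + lower * pi_missing, base + upper * pi_missing

# Worst-case bounds for a binary outcome (a = 0, b = 1).
low, high = manski_bounds(mu_observed=0.60, pi_missing=0.20, lower=0.0, upper=1.0)
print(low, high, high - low)
```

With these inputs the interval runs from about 0.48 to about 0.68, and its width equals the missing data probability, as the formula above implies.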
5.7 Measurement and reporting errors
I have so far assumed that the data correspond to exact measurements of the variables of interest.
In practice, however, the data may be “dirty”, i.e., subject to measurement errors or contaminated in various ways. The nature of the measurement errors or the form of the contamination process generally affects the information contained in the data about the population features of interest.
In some cases, one may associate with each population element a true value of the variable of interest and view a measurement error as a deviation from the true value.
This approach is not always suitable. For example, respondents may be observed to change their initial response when reinterviewed. In these cases, it is more useful to think of measurement error as response variability.
5.7.1 Sources of measurement error
Measurement error may arise from many different sources:
• interviewer effect;
• respondent effect: for example, individuals who are less willing to participate in a survey may give inaccurate or biased responses;
• instrument effect: basic mistakes in the questionnaire, etc.;
• mode effect: face-to-face, phone, internet, etc.
5.7.2 Cognitive aspects of response behavior
There is broad agreement on the fact that measurement errors can generally be traced to problems in the survey response process.
Psychologists regard survey response as a sequential process consisting of:
• comprehension: interpretation of the question;
• recall: retrieval of information from memory;
• estimation and judgment: combining or supplementing what has been retrieved from memory;
• reporting: selecting and communicating an answer.
Example: Reported income
Comprehension problems may arise from the use of technical terms or difficulty with the income concept. A typical case is the tendency to exclude capital losses, spousal income, or income from second or third jobs from household income.
A more precise wording may in principle avoid or reduce these difficulties. However, this may lead to questions that are longer to read or to an increase in the number of questions (e.g. separate questions by income source or by person).
If income cannot be retrieved easily, then the survey instrument might suggest that the interviewer look up the information (from tax statements, bank statements, etc.). This, however, increases costs and may affect interviewer behavior.
If reported income is the result of guessing, then the variance of income may be increased. Guessing may also lead to bias. For example, if guessing is about a multi-component quantity (such as household income), then biases may arise from the fact that certain components may be forgotten or neglected.
Problems in answering survey questions
Flaws in the cognitive operations involved in producing an answer may cause errors in the responses.
The following problems in the response process can give rise to errors in survey reports:
• failure to encode the information sought: the information sought is not stored in memory in an accessible form; people cannot provide information they do not have;
• misinterpretation of the question due to grammatical ambiguity, excessive complexity, faulty presupposition, vague concepts, vague quantifiers, unfamiliar terms, or false inferences;
• forgetting and other memory problems: mismatches between the terms used in the question and the terms used to encode the events initially, distortions in the representation of the events over time, retrieval failure, reconstruction errors;
• flawed judgement or estimation strategies: when respondents do not have an estimate/judgement they can draw on, their answer may be affected by the wording of the question or by the context in which the question is placed;
• problems in formatting an answer: problems in translating a judgement into an acceptable format given the type of question (open-ended questions with numerical answers, closed questions with ordered response scales, closed questions with categorical response options);
• misreporting: more frequent with “sensitive questions”, namely those that are likely to be seen as intrusive or embarrassing; self-administration seems to increase reporting in these cases;
• navigational errors: failure to follow instructions may be more serious when the questions are self-administered.
Some of these problems may be avoided or reduced by a careful design of the questionnaire.
Recalling facts and events
A unified theory of memory is still lacking.
There is broad agreement on the fact that the information asked in surveys is stored in long term memory (LTM). It is not clear, however, how it is stored in LTM, how LTM works, and how information is retrieved from LTM.
One idea is that retrieval is better when the question asked matches the way in which information is stored.
Another idea is that the ability to recall events decays over time, although there is no mechanical time discounting. For example, isolated events are easier to remember than sequences of similar events.
Additional factors that seem to play a role:
• proximity to temporal boundaries or “landmarks”;
• distinctiveness: the better events are described in a question, the easier it is to remember them; major events are easier to remember than frequent, smaller events;
• emotional impact: Events with a strong emotional impact are easier to remember.
Recalling dates
The main issue is how dates are attached to information about events stored in LTM. It appears that the two pieces of information are stored separately, which requires some inferential strategy.
The quality of date reports depends on:
• how well the event itself is remembered;
• the relationship between the event of interest and other dateable events and periods (birthdays, anniversaries, etc.).
A practical implication is that it helps to give people landmark events or structured diaries to list events.
Common problems in recalling dates:
• seam effects: these occur in panel surveys with repeated interviews, and depend on the fact that information tends to be asked backwards and events far from the interview date are less likely to be remembered;
• telescoping: past events tend to be remembered as more recent than they actually were (forward telescoping) or as more distant than they actually were (backward telescoping);
• durations: short durations are poorly remembered; not much is known about how recall of these short durations may be improved; there is some evidence that recall of short durations is completely dominated by the experienced utility (or disutility); in general, even for relatively short durations, it seems better to ask initial and end dates.
Influence of the survey instrument on recall
The design of the survey instrument may facilitate recall. In particular:
• Recall order: it appears that remembering forward (from the distant past) is easier. This is especially true if there is a natural causal order. If there is no natural causal ordering, asking backwards may be preferable (but the evidence on this is weak). Asking without an order tends to be bad.
• Time on task: giving more time to think improves the quality of recall (unless respondents know the answer).
• Decomposition of complex events or facts: this strategy may lead to serious nonresponse problems, because totals cannot be reconstructed when any of the components is missing.
• Recall clues: this is related to distinctiveness. It seems to work well only for simple specific events, not with sequences of multiple events.
5.7.3 Differential item functioning
Data on subjective assessments of health or wellbeing are important to economists, who consider them as proxies for utility. They are also important to policy-makers, who recently have been looking for alternatives to standard economic indicators (see, e.g., Stiglitz, Sen and Fitoussi 2009).
Given the increasing use of this kind of data, the role of potential measurement issues related to the data collection process is receiving attention.
Survey respondents are often asked to rate their health or wellbeing on an ordered scale, sometimes called a rating scale or Likert scale. An example is the question: “In general, would you say that your health is excellent, very good, good, fair, or poor?”, or some minor variant thereof, which is often asked in health surveys. Household surveys ask similar questions on happiness or life satisfaction, while consumer surveys ask similar questions on customer satisfaction.
When asked to rate their own health on a given ordered scale, people may answer differently for two different reasons:
• because their true or perceived health differs;
• because they interpret differently the various levels of the scale.
As a consequence, differences in self-reports between otherwise similar individuals may depend on differences in response style, i.e., the mapping of true or perceived health into reported health (Figure 17).
Lack of interpersonal comparability of responses to subjective survey questions is often referred to as “response category differential item functioning” (DIF), a term that originated in the educational testing literature (see e.g. Holland and Wainer 1993), where a test question is said to have DIF if equally able individuals have unequal probabilities of answering the question correctly.
From the viewpoint of statistical modeling, the DIF problem is essentially one of identification in ordered response models where the observed responses are derived from latent continuous random variables discretized through a set of heterogeneous thresholds or cutoff points (see e.g. Peracchi and Rossetti 2012).
Figure 17: Comparison of self-assessed pain in two groups when there are differences in response scales. [Two panels, Group A and Group B, each showing the density of the latent pain level, with the response categories None, Mild and Serious marked along the latent pain level axis.]
Anchoring vignettes
Following the seminal paper of King et al. (2004), anchoring vignettes have been developed as a new component of survey instruments that may be used to solve the DIF problem.
They are brief descriptions of hypothetical people or situations that survey respondents are asked to evaluate on the same scale they use to rate their own situation. Because the people or situations described in the vignettes are the same for all respondents, vignettes have the potential to identify individual variation in subjective thresholds.
A number of social surveys have introduced specific modules with vignette questions. Examples include the Survey of Health, Ageing and Retirement in Europe (SHARE), the U.S. Health and Retirement Study (HRS), the English Longitudinal Study of Ageing (ELSA), and the World Health Organization’s World Health Surveys (WHS).
Introducing anchoring vignettes implies substantial costs in terms of survey design and also reduces the time available for collecting other information.
Testing the key assumptions behind anchoring vignettes
Although vignettes are increasingly employed by researchers in various fields, reliability of this approach hinges crucially on the validity of two key assumptions (King et al. 2004):
• response consistency: it assumes that individuals use the available response categories in the same way when assessing their own situation and the hypothetical situations in the vignettes;
• vignette equivalence: it assumes that the hypothetical situation in a vignette is perceived by all respondents in the same way and on the same uni-dimensional scale, apart from random error.
As pointed out by Deaton (2010), the vignette approach replaces the assumption that there are no differences in the way people rank themselves on a subjective scale with the alternative assumption (response consistency) that there are no differences in their capacity for empathy with other people’s conditions.
In addition, vignette equivalence assumes that there are no systematic differences in the way people perceive the situations represented in each vignette. The latter is also a very strong assumption, for example because of problems with translation of the same vignette in different languages.
Hence, testing these two key assumptions is a critical step in evaluating the validity of the vignette approach (see Peracchi and Rossetti 2013).
5.7.4 Sensitive questions
Many policies are evaluated by analyzing the answers to sensitive questions. Examples include policies that try to change people’s attitudes toward harmful traditional practices, fiscal policies whose evaluation requires measuring tax evasion, and anti-racism policies whose evaluation requires measuring expressions of racism or acceptance of different cultures.
Various methods have been proposed to guarantee that respondents reveal their true attitudes. Their basic idea is that if a sensitive question is asked indirectly, the respondent may reveal a truthful response.
For example, the list experiment (also called the item count or unmatched count method; Miller 1984, Blair and Imai 2012) works by (i) randomly dividing the survey respondents in two groups, the treated and the controls, and (ii) aggregating the sensitive item with a list of other nonsensitive items.
The controls receive a list of J nonsensitive yes/no items. They are asked to report how many of the listed items they agree with, but not which items. The treated instead receive the same list of nonsensitive items, plus a sensitive yes/no item (J + 1 items in total). As for the controls, the treated are also asked to report the number of listed items they agree with.
This design relies on three key assumptions:
• no liars: all respondents give truthful answers to the sensitive item;
• randomization of the treatment: potential responses are jointly independent of the treatment variable;
• no design effect: the addition of the sensitive item does not change the sum of affirmative answers to the control items.
If all three assumptions hold, then an unbiased estimator of the population proportion of those who agree with the sensitive item is the mean difference
$$\hat{\tau} = \bar{Y}_1 - \bar{Y}_0,$$
where $\bar{Y}_1$ and $\bar{Y}_0$ respectively denote the mean number of listed items the treated and the controls agree with.
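A small simulation shows the estimator at work. The sketch below assumes that all three conditions hold by construction; the prevalence of 0.30, the number of control items J = 4 and the agreement probabilities are hypothetical values.

```python
import random
import statistics

rng = random.Random(7)
J = 4               # number of nonsensitive items
true_share = 0.30   # hypothetical population share agreeing with the sensitive item

control, treated = [], []
for _ in range(20_000):
    # Number of nonsensitive items the respondent agrees with.
    nonsensitive = sum(rng.random() < 0.5 for _ in range(J))
    sensitive = int(rng.random() < true_share)
    if rng.random() < 0.5:                      # randomized treatment assignment
        treated.append(nonsensitive + sensitive)
    else:
        control.append(nonsensitive)

tau_hat = statistics.fmean(treated) - statistics.fmean(control)
print(f"estimated share: {tau_hat:.3f}")
```

Only the counts are recorded, so no individual answer to the sensitive item is revealed; the price is a noisier estimate than a direct question would give if answered truthfully.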
5.7.5 The classical measurement error model
This model regards the observed data as realizations of the random vector
$$Z = Z^* + V,$$
where $Z^*$ and $V$ are independent random vectors with finite variances. The random vector $Z^*$ represents the variability of the error-free measurement or “signal”, whereas $V$ represents the measurement error or “noise”.
A motivation for this model is often the inaccuracy of the instruments with which $Z^*$ is measured.
The assumption of independence between $Z^*$ and $V$ implies that
$$\mathbb{V}[Z] = \mathbb{V}[Z^*] + \mathbb{V}[V],$$
that is, the observed data are more variable than the correct measurements.
Under the additional assumption that $V$ has mean zero, the model implies that
$$E[Z \mid Z^*] = Z^*,$$
that is, the measurements of $Z^*$ are inaccurate but not systematically distorted. If $\mu = E[Z^*]$, then
$$E[Z] = E[Z^*] = \mu.$$
Given a sample from the distribution of $Z$, the sample mean $\bar{Z}$ is unbiased for $\mu$ and the only effect of measurement error is to inflate its sampling variance relative to the no measurement error case.
This result is very special and is generally not valid in regression contexts.
Attenuation bias
To illustrate, suppose that
$$Y^* = \alpha + \beta X^* + \epsilon,$$
$$Y = Y^* + \nu,$$
$$X = X^* + \xi,$$
where $X^*$, $Y^*$, $\epsilon$, $\nu$ and $\xi$ are unobservable random variables with finite variances, denoted by $\sigma^2_{X^*}$, $\sigma^2_{Y^*}$, etc., and $X$ and $Y$ are error-ridden surrogates of $X^*$ and $Y^*$. The relationship between the observable $X$ and $Y$ may be written
$$Y = \alpha + \beta X + U,$$
$$U = \epsilon + \nu - \beta \xi.$$
Under the classical measurement error model, $\epsilon$, $\nu$ and $\xi$ have zero mean and are uncorrelated with each other and with $X^*$. Thus, $E[Y] = E[Y^*] = \mu_Y$, $E[X] = E[X^*] = \mu_X$, $\sigma^2_X = \sigma^2_{X^*} + \sigma^2_\xi$ and $\sigma^2_Y = \sigma^2_{Y^*} + \sigma^2_\nu$. Further
$$\mathbb{C}[X, Y] = \mathbb{C}[X^*, Y^*] = \beta \, \sigma^2_{X^*}, \qquad E[XU] = -\beta \, \sigma^2_\xi,$$
with $E[XU] = 0$ if either $\beta = 0$ (i.e., $Y^*$ does not depend on $X^*$) or $\sigma^2_\xi = 0$ (i.e., $X^*$ is measured without error).
Under these assumptions, the best linear predictor (BLP) of $Y$ given $X$ is $E^*[Y \mid X] = \alpha_0 + \beta_0 X$, where $\alpha_0 = \mu_Y - \beta_0 \mu_X$, $\beta_0 = \mathbb{C}[X, Y]/\sigma^2_X = \lambda \beta$, and
$$\lambda = \frac{\sigma^2_{X^*}}{\sigma^2_X} = \frac{\sigma^2_{X^*}}{\sigma^2_{X^*} + \sigma^2_\xi} = \frac{1}{1 + \sigma^2_\xi / \sigma^2_{X^*}}$$
is called the attenuation factor or reliability ratio. If $\sigma^2_\xi > 0$ and $\beta \neq 0$, then $0 < \lambda < 1$, so
$$|\beta_0| < |\beta|, \qquad (5.3)$$
called the attenuation bias due to measurement error. Notice that the problem arises because of measurement errors in the regressor, not in the outcome.
In addition, if $\mu_X \neq 0$, then
$$\alpha_0 - \alpha = (\beta - \beta_0) \mu_X = \beta (1 - \lambda) \mu_X \neq 0.$$
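Attenuation is easy to see in a simulation. The sketch below uses hypothetical parameter values (a true slope of 2, unit variance for the error-free regressor and a measurement error standard deviation of 0.5, so the reliability ratio is 0.8) and compares the OLS slope on the error-ridden regressor with the attenuated slope predicted by the formula above.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope of the OLS regression of y on a constant and x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(123)
alpha, beta = 1.0, 2.0          # hypothetical true parameters
sd_xstar, sd_xi = 1.0, 0.5      # signal and measurement-error std. deviations
reliability = sd_xstar**2 / (sd_xstar**2 + sd_xi**2)

x_star = [rng.gauss(0.0, sd_xstar) for _ in range(50_000)]
# The outcome noise lumps together the equation error and the error in Y.
y = [alpha + beta * xs + rng.gauss(0.0, 1.0) for xs in x_star]
x = [xs + rng.gauss(0.0, sd_xi) for xs in x_star]

slope_clean = ols_slope(x_star, y)   # infeasible regression on the true regressor
slope_noisy = ols_slope(x, y)        # feasible regression on the noisy regressor
print(f"slope on true regressor:  {slope_clean:.3f}")
print(f"slope on noisy regressor: {slope_noisy:.3f}")
print(f"predicted attenuated slope: {reliability * beta:.3f}")
```

Noise in the outcome alone leaves the slope consistent, which is why only the regressor's measurement error enters the reliability ratio.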
Bounding the slope of the population regression function
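Let $\{(X_i, Y_i)\}_{i=1}^n$ be a sample from the distribution of $(X, Y)$ and let $\hat{\alpha}$ and $\hat{\beta}$ be the OLS estimators of $\alpha$ and $\beta$ in the regression of $Y_i$ on a constant and $X_i$. As $n \to \infty$,
$$\hat{\alpha} \overset{p}{\to} \alpha_0 = \alpha + \beta (1 - \lambda) \mu_X, \qquad \hat{\beta} \overset{p}{\to} \beta_0 = \lambda \beta.$$
If $\lambda > 0$ is known, correcting $\hat{\alpha}$ and $\hat{\beta}$ for their asymptotic bias gives the following consistent estimators of $\alpha$ and $\beta$
$$\tilde{\alpha} = \hat{\alpha} - \tilde{\beta} (1 - \lambda) \bar{X}, \qquad \tilde{\beta} = \frac{\hat{\beta}}{\lambda}.$$
These two estimators remain consistent if $\lambda$ is unknown but is replaced by some consistent estimator $\hat{\lambda} > 0$.
What can you learn about $\beta$ when $\lambda$ is not known or cannot consistently be estimated? It turns out that the inequality (5.3) is not the only information that the observed data contain about the population regression parameter $\beta$.
To see this, consider the BLP of $X$ given $Y$ (or reverse regression), $E^*[X \mid Y] = \gamma_0 + \delta_0 Y$, where $\gamma_0 = \mu_X - \delta_0 \mu_Y$ and
$$\delta_0 = \frac{\mathbb{C}[X, Y]}{\sigma^2_Y} = \frac{\mathbb{C}[X^*, Y^*]}{\sigma^2_Y} = \frac{\beta \, \sigma^2_{X^*}}{\sigma^2_{Y^*} + \sigma^2_\nu}.$$
By the Cauchy-Schwarz inequality, $\mathbb{C}[X^*, Y^*]^2 \le \sigma^2_{X^*} \sigma^2_{Y^*}$, so $\beta^2 \le \sigma^2_{Y^*} / \sigma^2_{X^*}$. Hence, for $\beta > 0$,
$$\frac{1}{\delta_0} = \frac{\sigma^2_{Y^*} + \sigma^2_\nu}{\beta \, \sigma^2_{X^*}} \ge \beta + \frac{\sigma^2_\nu}{\beta \, \sigma^2_{X^*}},$$
and symmetrically for $\beta < 0$, that is, $|1/\delta_0| \ge |\beta|$. Combining this result with (5.3) gives
$$|\beta_0| \le |\beta| \le \left| \frac{1}{\delta_0} \right|. \qquad (5.4)$$
Thus, although the classical measurement error model does not allow point identification of $\beta$, it still provides information about $\beta$ through the bound (5.4).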
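The two bounds can be estimated from the direct and the reverse regression. The sketch below simulates data with hypothetical parameter values and a positive true slope, an assumption that fixes the ordering of the bounds, and checks that the true slope lies between the direct OLS slope and the reciprocal of the reverse regression slope.

```python
import random
import statistics

rng = random.Random(2024)
beta = 2.0                       # hypothetical true slope, assumed positive
n = 50_000
x_star = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + beta * xs + rng.gauss(0.0, 1.0) for xs in x_star]
x = [xs + rng.gauss(0.0, 0.7) for xs in x_star]   # error-ridden regressor

mx, my = statistics.fmean(x), statistics.fmean(y)
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

direct_slope = sxy / sxx         # estimates the attenuated slope
reverse_slope = sxy / syy        # estimates the reverse regression slope
lower, upper = direct_slope, 1.0 / reverse_slope
print(f"bounds on the slope: [{lower:.3f}, {upper:.3f}]")
```

The interval is informative about the sign and order of magnitude of the slope but can be wide when either the regressor or the outcome is noisy.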
The Wald estimator
To point identify $\beta$, additional information is needed.
Consider the simple linear model
\[ Y_i = \alpha + \beta X_i^* + V_i, \qquad i = 1, \ldots, n, \tag{5.5} \]
where $V_i$ has mean zero and is uncorrelated with $X_i^*$ (in the notation of Section 5.7.5, $V_i = \epsilon_i + \nu_i$). Suppose, as before, that instead of the unobserved $X_i^*$ you only have available the error-ridden measurement
\[ X_i = X_i^* + \xi_i, \]
where $\xi_i$ has mean zero and is uncorrelated with $X_i^*$, but clearly correlated with $X_i$. This gives the following linear model
\[ Y_i = \alpha + \beta X_i + U_i, \qquad i = 1, \ldots, n, \]
where $U_i = V_i - \beta\xi_i$ is now correlated with $X_i$, so the OLS estimator in a regression of $Y_i$ on a constant and $X_i$ is inconsistent for $\beta$.
Partition the data into groups $A_1$ and $A_2$ of size $n_1$ and $n_2$ respectively, and let $\bar{Y}_j$ and $\bar{X}_j$ be the sample averages of $Y_i$ and $X_i$ for group $A_j$. A Wald estimator (or grouping estimator) of $\beta$ is
\[ \tilde\beta = \frac{\bar{Y}_1 - \bar{Y}_2}{\bar{X}_1 - \bar{X}_2} = \frac{\sum_{i=1}^n W_i Y_i}{\sum_{i=1}^n W_i X_i}, \]
where
\[ W_i = \begin{cases} n/n_1 = 1 + n_2/n_1, & \text{if } i \in A_1, \\ -n/n_2 = -1 - n_1/n_2, & \text{if } i \in A_2 \end{cases} \]
is a binary indicator with zero sample mean. This estimator, first proposed by Wald (1940), is a simple IV estimator that uses $W_i$ as instrument.
The Wald estimator $\tilde\beta$ is consistent for $\beta$ if:
• grouping does not depend on $V_i$ or $\xi_i$ (therefore not on the observed $Y_i$ or $X_i$);
• $\mathrm{E}[X_i^* \mid A_1] \neq \mathrm{E}[X_i^* \mid A_2]$, that is, grouping depends on some variable (different from $Y_i$ or $X_i$) known to be correlated with $X_i^*$.
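The two expressions for the Wald estimator, and the two consistency conditions, can be illustrated with a small simulation. In this sketch the grouping variable `g` is a hypothetical binary variable that shifts the mean of $X_i^*$ but is independent of $V_i$ and $\xi_i$; the parameter values and distributions are assumptions made only for illustration.

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(7)
n = 20_000
alpha, beta = 1.0, 2.0                                   # assumed true parameters

g = [random.random() < 0.5 for _ in range(n)]            # hypothetical grouping variable
x_star = [1.5 * gi + random.gauss(0.0, 1.0) for gi in g] # group shifts the mean of X*
x = [xs + random.gauss(0.0, 1.0) for xs in x_star]       # X = X* + xi
y = [alpha + beta * xs + random.gauss(0.0, 1.0) for xs in x_star]

# Difference-in-means form of the Wald estimator
y1 = mean([yi for yi, gi in zip(y, g) if gi])
y2 = mean([yi for yi, gi in zip(y, g) if not gi])
x1 = mean([xi for xi, gi in zip(x, g) if gi])
x2 = mean([xi for xi, gi in zip(x, g) if not gi])
wald = (y1 - y2) / (x1 - x2)

# IV form with instrument W_i = n/n1 in A1 and -n/n2 in A2
n1 = sum(g)
n2 = n - n1
w = [n / n1 if gi else -n / n2 for gi in g]
wald_iv = sum(wi * yi for wi, yi in zip(w, y)) / sum(wi * xi for wi, xi in zip(w, x))

# OLS slope on the mismeasured regressor, for comparison
mx, my = mean(x), mean(y)
ols_slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))

print(f"Wald {wald:.3f}, IV form {wald_iv:.3f}, OLS {ols_slope:.3f}")
```

The two Wald expressions should agree up to rounding error and be close to the assumed $\beta = 2$, whereas the OLS slope should be attenuated toward zero.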
5.7.6 Nonclassical measurement error models
The classical measurement error model is widely used, but is not the only measurement error model and may actually be inappropriate in certain situations. So one must be careful in drawing general conclusions from this model.
The “best guess model”
Consider again the simple linear model (5.5), where $X_i^*$ is not observed, but now assume that what you observe is $X_i = \mathrm{E}^*(X_i^* \mid Z_i) = \pi_0 + \pi_1 Z_i$, that is, the BLP of $X_i^*$ given the information contained in the variable $Z_i$. Stock and Watson (2015) refer to this case as the “best guess model”.
It follows from the definition of BLP that $X_i^* = X_i + \varepsilon_i$, where $\mathrm{E}[\varepsilon_i X_i] = 0$, so
\[ Y_i = \alpha + \beta X_i + \omega_i, \qquad i = 1, \ldots, n, \]
where $\omega_i = V_i + \beta\varepsilon_i$ is uncorrelated with $X_i$. In this case, the OLS estimators $\hat\alpha$ and $\hat\beta$ in a regression of $Y_i$ on a constant and $X_i$ are consistent for $\alpha$ and $\beta$ respectively, despite the fact that $X_i$ is an error-ridden measurement of $X_i^*$.
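A short simulation makes the contrast with the classical model concrete. In this sketch the values of $\alpha$, $\beta$, $\pi_0$, $\pi_1$ and the normal distributions are assumptions; $X_i^*$ is generated as $X_i + \varepsilon_i$ with $\varepsilon_i$ independent of $Z_i$, so that $X_i$ is the BLP of $X_i^*$ given $Z_i$ by construction.

```python
import random

def ols(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(99)
n = 20_000
alpha, beta = 1.0, 2.0        # assumed true parameters
pi0, pi1 = 0.5, 1.5           # assumed BLP coefficients of X* on Z

z = [random.gauss(0.0, 1.0) for _ in range(n)]
x = [pi0 + pi1 * zi for zi in z]                      # observed "best guess"
x_star = [xi + random.gauss(0.0, 1.0) for xi in x]    # X* = X + eps, eps independent of Z
y = [alpha + beta * xs + random.gauss(0.0, 1.0) for xs in x_star]

a_hat, b_hat = ols(x, y)
print(f"intercept {a_hat:.3f}, slope {b_hat:.3f}")
```

Here the OLS estimates should be close to the assumed $\alpha = 1$ and $\beta = 2$, with no attenuation, because the error $\varepsilon_i$ is uncorrelated with the observed regressor rather than with the unobserved one.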
Misclassification
The assumption that the “signal” $Z^*$ and the “noise” $V$ are independent is inappropriate when some of the elements of $Z$, and the corresponding elements of $Z^*$, are not continuous but discrete. In this case, the problem of measurement error is better viewed as a problem of misclassification.
The simplest case is when you are interested in the distribution of an unobserved binary random variable $Z^*$ (e.g., an indicator for college degree) taking values one and zero with probability $\pi$ and $1 - \pi$ respectively, where $0 < \pi < 1$. What you instead observe is another 0–1 random variable $Z$ that depends on $Z^*$ through the relationships
\[ \mathrm{P}[Z = 0 \mid Z^* = 1] = \eta, \qquad \mathrm{P}[Z = 1 \mid Z^* = 0] = \nu, \]
where $\eta$ and $\nu$ are the probabilities of misclassification.
Although this model may also be written $Z = Z^* + V$, where $V = Z - Z^*$ is the measurement error, this is not a classical measurement error model. To see this, notice that
\[ \mathrm{E}[Z \mid Z^* = 0] = \mathrm{P}[Z = 1 \mid Z^* = 0] = \nu, \qquad \mathrm{E}[Z \mid Z^* = 1] = \mathrm{P}[Z = 1 \mid Z^* = 1] = 1 - \eta, \]
so
\[ \mathrm{E}[V \mid Z^* = z] = \mathrm{E}[Z \mid Z^* = z] - z = \begin{cases} \nu, & \text{if } z = 0, \\ -\eta, & \text{if } z = 1. \end{cases} \]
Therefore
\[ \mathrm{E}[V] = \mathrm{E}[V \mid Z^* = 0]\,\mathrm{P}[Z^* = 0] + \mathrm{E}[V \mid Z^* = 1]\,\mathrm{P}[Z^* = 1] = \nu(1 - \pi) - \eta\pi, \]
whereas $\mathrm{E}[V Z^*] = \mathrm{E}[V \mid Z^* = 1]\,\mathrm{P}[Z^* = 1] = -\eta\pi$.
Hence $\mathrm{C}[V, Z^*] = \mathrm{E}[V Z^*] - \mathrm{E}[V]\,\mathrm{E}[Z^*] = -\pi(1 - \pi)(\eta + \nu)$.
Thus, the measurement error $V$ is not independent of $Z^*$. In particular, $V$ is negatively correlated with $Z^*$ and its conditional mean given $Z^*$ is different from zero.
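The moments derived above can be verified exactly by enumerating the four possible outcomes of $(Z^*, Z)$. The sketch below uses exact rational arithmetic; the function name and the values chosen for $\pi$, $\eta$ and $\nu$ are illustrative assumptions.

```python
from fractions import Fraction as F

def misclassification_moments(pi, eta, nu):
    """Return (E[V], C[V, Z*]) for V = Z - Z*, by enumerating (Z*, Z)."""
    joint = {                       # keys are (z_star, z), values are probabilities
        (1, 1): pi * (1 - eta),
        (1, 0): pi * eta,
        (0, 1): (1 - pi) * nu,
        (0, 0): (1 - pi) * (1 - nu),
    }
    e_v = sum(p * (z - zs) for (zs, z), p in joint.items())
    e_v_zs = sum(p * (z - zs) * zs for (zs, z), p in joint.items())
    e_zs = sum(p * zs for (zs, z), p in joint.items())
    return e_v, e_v_zs - e_v * e_zs

pi, eta, nu = F(3, 10), F(1, 10), F(1, 20)   # assumed illustrative values
mean_v, cov_v_zstar = misclassification_moments(pi, eta, nu)

print(mean_v, nu * (1 - pi) - eta * pi)               # E[V] two ways
print(cov_v_zstar, -pi * (1 - pi) * (eta + nu))       # C[V, Z*] two ways
```

Each printed pair should coincide, and the covariance is negative whenever at least one misclassification probability is positive.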
The contaminated sampling model
This model regards the data as realizations of a random variable $Z$ which is equal to the “signal” $Z^*$ with probability $1 - \pi$, and with probability $\pi$ is equal to some extraneous random variable $W$. Formally
\[ Z = (1 - D) Z^* + D W = Z^* + D (W - Z^*), \]
where $D$ is an unobservable binary random variable, distributed independently of $Z^*$ and $W$, and equal to zero with probability $1 - \pi$ and to one with probability $\pi$ ($0 \le \pi < 1$). Realizations of $Z$ such that $D = 0$ correspond to error-free measurements of $Z^*$.
The contaminated sampling model is often motivated with reference to coding errors (e.g., digit transposition) and may be useful in situations where measurement errors occur with positive probability but not always. Alternatively, this model may be used to represent the fact that a statistical model is only an approximation to the actual data generation process. As such, it may be good for the bulk of the data but not for the whole sample.
If $F$ and $H$ are the df’s of $Z^*$ and $W$ respectively, then the df of $Z$ is
\[ G(z) = (1 - \pi) F(z) + \pi H(z), \qquad 0 \le \pi < 1. \]
The distribution of $Z$ is therefore a mixture of the distributions of $Z^*$ and $W$, with mixing parameter $\pi$ equal to the probability of measurement error. If both $Z^*$ and $W$ have finite mean, then the mean of an observed data point is
\[ \mathrm{E}[Z] = (1 - \pi)\,\mathrm{E}[Z^*] + \pi\,\mathrm{E}[W]. \]
If both $Z^*$ and $W$ have finite variance, then the variance of an observed data point can be shown to be
\[ \mathrm{V}[Z] = (1 - \pi)\,\mathrm{V}[Z^*] + \pi\,\mathrm{V}[W] + \pi(1 - \pi)\,(\mathrm{E}[Z^*] - \mathrm{E}[W])^2. \]
If $Z^*$ and $W$ have the same mean, then
\[ \mathrm{V}[Z] = (1 - \pi)\,\mathrm{V}[Z^*] + \pi\,\mathrm{V}[W]. \]
Thus, unless both $\pi$ and the distribution of $W$ are known, contaminated sampling does not enable one to identify the distribution of $Z^*$ or interesting aspects of this distribution, such as its mean and variance.
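The mixture formulas for the mean and variance can be checked exactly on a small discrete example. In this sketch the two probability mass functions and the value of $\pi$ are made-up illustrative inputs, and the helper names are hypothetical.

```python
from fractions import Fraction as F

def mean_var(pmf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    m = sum(p * v for v, p in pmf.items())
    return m, sum(p * (v - m) ** 2 for v, p in pmf.items())

def mixture(pmf_signal, pmf_noise, pi):
    """pmf of Z = (1 - D) Z* + D W with P[D = 1] = pi."""
    out = {}
    for v, p in pmf_signal.items():
        out[v] = out.get(v, 0) + (1 - pi) * p
    for v, p in pmf_noise.items():
        out[v] = out.get(v, 0) + pi * p
    return out

f = {0: F(1, 2), 2: F(1, 2)}      # assumed distribution of the signal Z*
h = {10: F(1, 2), 20: F(1, 2)}    # assumed distribution of the contaminant W
pi = F(1, 10)                     # assumed contamination probability

m_z, v_z = mean_var(mixture(f, h, pi))
m_s, v_s = mean_var(f)
m_w, v_w = mean_var(h)

formula_mean = (1 - pi) * m_s + pi * m_w
formula_var = (1 - pi) * v_s + pi * v_w + pi * (1 - pi) * (m_s - m_w) ** 2
print(m_z == formula_mean, v_z == formula_var)
```

Even a 10% contamination rate moves the mean and variance of $Z$ far from those of $Z^*$ in this example, which is why the observed moments alone say little about the signal when $\pi$ and the distribution of $W$ are unknown.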
5.8 Data privacy and confidentiality
The modern proliferation of data and the advances in computing technology have led to new concerns about data privacy.
Statistical disclosure limitation (SDL) consists of a variety of methods used by public and private data providers to protect the privacy and confidentiality of identifiable information on individuals and businesses.
Privacy and confidentiality protection is based on two key principles:
• individual information should only be used for the statistical purposes for which it was collected;
• individual information shared with a data provider should not be used in a way that might harm the individual.
An example of SDL is top-coding of income (Section 5.1.2). This method protects privacy and confidentiality but it also limits what you can learn from the data. As shown by Burkhauser et al. (2012), the top-coded public-use releases of the CPS are fine for measuring the evolution of the IDR, but are completely useless for measuring the evolution of incomes among the top 1% of households.
SDL methods may lead to bias and wrong inference about population parameters of interest.
As Abowd and Schmutte (2015) argue:
“Advances in SDL have unambiguously made more data available than ever before, while protecting the privacy and confidentiality of identifiable information on individuals and businesses. But modern SDL intrinsically distorts the underlying data in ways that are generally not clear to the researcher and that may compromise economic analyses, depending on the specific hypotheses under study.”
5.8.1 Methods of statistical disclosure limitation
The main types of SDL include:
• suppression, which is used to eliminate an entire record from the data or to eliminate an entire attribute;
• aggregation or coarsening, which refers to the coarsening of values a variable can take, or the combination of information from multiple variables;
• noise infusion, a method in which the underlying microdata are distorted using either additive or multiplicative noise;
• data swapping, the practice of switching the values of a selected set of attributes for one data record with the values reported in another record;
• synthetic data, involving the publication of a data set with the same structure as the confidential data, in which the published data are drawn from the same data-generating process as the confidential data but some or all of the confidential data have been suppressed and imputed.
Abowd and Schmutte (2015) argue that “researchers, and agencies, should prefer SDL methods whose details can be made publicly available”.
5.8.2 Ignorable vs. nonignorable SDL
An SDL is said to be ignorable if the analyst can consistently estimate the population object of interest and make correct inferences using the published data without explicitly accounting for SDL.
An SDL is said to be nonignorable if the data analyst cannot consistently estimate the population object of interest without the parameters of the SDL model.
If an SDL is nonignorable, the analyst needs to take the SDL into account. The analyst can only do so if either (i) the SDL is known, that is, the data provider publishes sufficient details of the SDL model’s application to the confidential data, or (ii) the SDL is discoverable, that is, the analyst can recover the parameters of the SDL model based on prior information and the published data.
For example, in the top-coding example, the SDL is known if the value of the top-coding threshold is published, and is discoverable if it can be inferred from the data.
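Whether top-coding matters depends on the object of interest, as a toy calculation shows. The income values and the threshold below are made-up numbers, and `top_code` is a hypothetical helper; the sketch only illustrates the definitions and is not a description of any actual data release.

```python
import statistics

def top_code(values, threshold):
    """Replace every value above the threshold with the threshold itself."""
    return [min(v, threshold) for v in values]

incomes = [12, 18, 25, 31, 40, 55, 70, 90, 150, 900]   # hypothetical confidential data
threshold = 100                                        # hypothetical top-coding threshold
released = top_code(incomes, threshold)

# The median is unaffected as long as the threshold lies above it ...
print(statistics.median(incomes), statistics.median(released))
# ... but the mean of the released data understates the true mean.
print(statistics.mean(incomes), statistics.mean(released))

# If at least one value was top-coded, the threshold shows up as the sample maximum.
inferred_threshold = max(released)
print(inferred_threshold)
```

In this example top-coding leaves the median unchanged but biases the mean downward, and the threshold could be guessed from the pile-up of observations at the maximum of the released data.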
5.9 Stata commands
Sample selection models
heckman
This command fits regression models with selection by using either Heckman’s two-step estimator or full maximum likelihood.
heckprob
This command fits probit models with sample selection by maximum likelihood.
Weights
Most Stata commands can deal with weighted data.
Stata allows four kinds of weights:
• fweights, or frequency weights, indicate replicated data. The weight tells the command how many observations each observation really represents. fweights allow data to be stored more parsimoniously. The weighting variable contains positive integers. The result of the command is the same as if you duplicated each observation however many times and then ran the command unweighted.
• pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included because of the sample design. Commands that allow pweights typically provide a cluster() option. These can be combined to produce estimates for unstratified cluster-sampled data.
• aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; i.e., the variance of the $i$th observation is assumed to be $\sigma^2/w_i$, where $w_i$ are the weights. Typically, the observations represent averages and the weights are the number of elements that gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant; Stata internally rescales them to sum to $n$, the number of observations in your data, when it uses them.
• iweights, or importance weights, are weights that indicate the “importance” of the observation in some vague sense. iweights have no formal statistical definition; any command that supports iweights will define exactly how they are treated. Usually, they are intended for use by programmers who want to produce a certain computation.
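Two of the statements above can be illustrated outside Stata with a weighted mean: frequency weights are equivalent to duplicating observations, and rescaling analytic weights to sum to $n$ leaves a weighted mean unchanged. The Python sketch below uses made-up values and a hypothetical helper `weighted_mean`; it covers only point estimates of a mean and does not reproduce how Stata computes standard errors under each weight type.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w * v divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

values = [2.0, 5.0, 9.0]          # made-up data

# Frequency weights: each observation stands for w identical observations.
fw = [3, 1, 2]
expanded = [v for v, w in zip(values, fw) for _ in range(w)]
mean_fw = weighted_mean(values, fw)
mean_expanded = sum(expanded) / len(expanded)

# Analytic weights: rescaling so that the weights sum to n changes nothing here.
aw = [0.5, 1.5, 4.0]
n = len(values)
aw_rescaled = [w * n / sum(aw) for w in aw]
mean_aw = weighted_mean(values, aw)
mean_aw_rescaled = weighted_mean(values, aw_rescaled)

print(mean_fw, mean_expanded)
print(mean_aw, mean_aw_rescaled)
```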
Imputations and multiple imputations
impute
This command fills in missing values using regression imputation.
mi
This is a suite of commands for multiple imputation.
Consider the following simple example, consisting of six commands:
. webuse mheart5 (load the data)
. mi set mlong (set the data to be mi)
. mi register imputed age bmi (inform mi which variables need imputations)
. set seed 29390 (set the random-number seed)
. mi impute mvn age bmi = attack smokes hsgrad female, add(10) (create m = 10 imputations)
. mi estimate: logistic attack smokes age bmi hsgrad female (fit the desired model separately on each of the 10 imputed datasets and combine the results)
What this example does is to fit
. logistic attack smokes age bmi hsgrad female
with the age and bmi variables containing missing values. Fitting the model by typing logistic ... would drop all cases with missing values, thus ignoring some of the information in the data. Multiple imputation attempts to use that information. The method imputes m values to fill in each of the missing values. The model is then fit separately on each of the m imputed datasets and the results are combined. The goal is to obtain better estimates of parameters and their standard errors.
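The combination step follows the standard multiple-imputation combining rules (Rubin's rules): the point estimate is the average of the m completed-data estimates, and its variance adds the average within-imputation variance to an inflated between-imputation variance. The Python sketch below shows the arithmetic for a single coefficient; the function name and the input numbers are hypothetical and are not output from the mheart5 example.

```python
import statistics

def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and their variances.

    Returns (pooled estimate, total variance), where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    within = statistics.fmean(variances)     # average within-imputation variance
    between = statistics.variance(estimates) # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical coefficient estimates and squared standard errors from m = 3 imputations
q_bar, total_var = rubin_combine([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(q_bar, total_var ** 0.5)               # pooled estimate and its standard error
```

The pooled standard error exceeds the average completed-data standard error because it also reflects the uncertainty due to the missing values.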
For more details, see the Stata Multiple-Imputation Reference Manual [MI], in particular [MI] mi set, [MI] mi impute, [MI] mi import, and [MI] mi estimate.