QUANTIFYING THE DISCOVERABILITY OF IDENTITY ATTRIBUTES IN

INTERNET-BASED PUBLIC RECORDS: IMPACT ON IDENTITY THEFT

AND KNOWLEDGE-BASED AUTHENTICATION

by

Margaret S. Leary

RICHARD YELLEN, Ph.D., Faculty Mentor and Chair

DANIELLE BABB, Ph.D., Committee Member

SALIM ZAFAR, Ph.D., Committee Member

Barbara Butts Williams, Ph.D., Interim Dean, School of Business & Technology

A Dissertation Presented in Partial Fulfillment

Of the Requirements for the Degree

Doctor of Philosophy

Capella University

November 2008

3336833

© Margaret Leary, 2008

Abstract

This study explored the comparative discoverability of identity attributes in

Internet-accessible public records. It offers a methodological framework for ascribing a

“discoverability factor” to identity attributes commonly used with knowledge-based

authentication systems. The study also sought to determine if correlations exist between

the frequency with which identity attributes are published in public records and reported

identity theft. Following a comprehensive literature review, a quantitative research

methodology employing content analysis was performed on a total of 6,598 public

records, enumerating the number and types of different identity attributes that were found

in easily accessible public records. Public records were selected from those available on

SearchSystems.net, using a stratified purposive sampling approach. Descriptive statistics

were first performed and reported on the coded data, after which the identity attributes

were grouped using principal components analysis (PCA) to reduce the data. Correlations

to the Federal Trade Commission’s 2008 reported identity theft rates were performed

using Spearman’s rho at a significance level of .05. The findings from this research

supported the overall hypothesis of a relationship between the amount of identity

information in online public records and identity theft. The findings did not, however,

establish a cause-and-effect relationship between the two, and sampling limitations somewhat weaken

generalization. Identity attribute indices were assigned based on the frequency with

which they were found, with a discussion provided on the possible use of discoverability

as a factor to be considered when developing an identity confidence scoring algorithm.

Dedication

This work is dedicated to family, friends, and coworkers at both Northern

Virginia Community College and Nortel Government Solutions, all of whom provided

me with limitless support during the completion of this study.

Acknowledgments

The successful completion of this degree is due in no small part to the efforts of

many people, including my family, who suffered along with me every step of the way

during the interminably long period of time it took for me to complete this dissertation

and degree. It is with deep gratitude and appreciation that I acknowledge the following

individuals for their contributions.

First, I gratefully acknowledge the support I received from the administration at

Northern Virginia Community College (NVCC). Without their support of my 7-year

externship with Nortel Government Solutions, I would not have had access to the

industry and resources that made this work possible. Specifically among my colleagues at

NVCC, I would like to thank Professor Kevin Reed for allowing himself to be used as a

sounding board—I offer to return the favor as he is now performing his own doctoral

work. Too, I must thank my friend, Maryann Daimler, for her constant prodding to get

the study completed.

I would also like to thank my mentor and committee chair, Dr. Richard Yellen, to

whom I certainly lied when I informed him I was “low maintenance” 5 years ago when I

started in the program. I also wish to thank my other committee members, Dr. Danielle

Babb and Dr. Salim Zafar, who never seemed surprised to hear from me no matter how

many months (years) had passed since I’d last contacted them.

Table of Contents

Acknowledgments iv

List of Tables viii

List of Figures ix

CHAPTER 1. INTRODUCTION 1

Background of the Study 3

Statement of the Problem 7

Significance of the Study 7

Purpose of the Study 7

Rationale 8

Research Questions 9

Nature of the Study 11

Definition of Terms 11

Assumptions and Limitations 14

CHAPTER 2. LITERATURE REVIEW 16

Overview of the Chapter 16

Identity Theft 16

Identity Data Aggregation 20

Knowledge-Based Authentication 23

Identity Authentication in Federated Environments 27

Measuring the Effectiveness of Authentication by Knowledge 30

Measuring the Effectiveness of Authentication Through Possession 34

Measuring the Effectiveness of Biometric Authentication 35

Measuring the Effectiveness of KBA 37

Evaluating KBA Using Guessability 38

Evaluating KBA Through False Acceptance and Rejection Rates 39

Evaluating KBA Through Other Methods 41

CHAPTER 3. METHODOLOGY 44

Introduction 44

Research Approach 44

Sampling Design 48

Data Collection 51

Measurement Strategy 52

Data Analysis Strategy 52

Data Display 53

CHAPTER 4. RESULTS 54

Introduction 54

Data Collection 54

Data Coding and Categorization 55

Descriptive Statistics 57

Tests of Research Questions 62

Summary 74

CHAPTER 5. RESULTS, CONCLUSIONS, AND RECOMMENDATIONS 76

Summary of the Study 76

Summary of the Research Findings 76

Implications of the Study 79

Contributions of the Study 82

Limitations 85

Recommendations for Future Research 87

Conclusion 89

REFERENCES 91

APPENDIX A. CODEBOOK 99

APPENDIX B. CODER FORM 102

List of Tables

Table 1. Descriptive Statistics of Public Records Sites Across the 50 States 58

Table 2. Descriptive Statistics of Identity Attribute Types by Category 59

Table 3. Frequencies and Percentages of Identity Attributes 61

Table 4. Descriptive Statistics of Identity Attributes Within Each of the 50 States 62

Table 5. Variance Explained by Resulting Components 65

Table 6. Components and Respective Identity Attributes 66

Table 7. Comparative Discoverability Index for the Groups of Identity Attributes 66

Table 8. Comparative Discoverability Index for the Identity Attributes 67

Table 9. FTC 2007 State Identity Theft Rankings 69

Table 10. Spearman Rho Correlations Between Identity Attributes and Identity Theft Ranking 71

Table 11. Spearman Rho Correlations Between Attribute Groups and Identity Theft Ranking 72

Table 12. Summary of the Hypotheses Testing 80

List of Figures

Figure 1. Conceptual framework 11

Figure 2. Total number of public records sites by record category 58

Figure 3. Scree plot from resulting EFA procedure 66

Figure B1. Codebook 103

CHAPTER 1. INTRODUCTION

In January 2001, the Chief Information Officer’s Council revealed a database

cataloguing over 1,300 federal electronic government initiatives (Martin, 2001)

developed with the intention of meeting the President’s Management Agenda of

providing online access for citizens and businesses to interact with government. Citizens

and businesses are increasingly utilizing these services, as demonstrated by the IRS’s

Free File E-Government solution that saw more than 4 million tax returns submitted in

fiscal year 2007 at a cost savings of $9.2 million to the government (Office of

Management and Budget [OMB], 2008, p. 5).

Paralleling this rapid rise in citizen-centric Web applications is a requirement for

higher identity authentication assurance levels as citizens and businesses attempt to

access these online services. The OMB directed that these electronic transactions be

secure and maintain citizen and agency privacy, thus requiring “some type of identity

verification or authentication” (Bolton, 2003, p. 1). In related guidance documents, the

National Institute of Standards and Technology (NIST) defined the processes for

establishing confidence in user identities presented electronically as “E-Authentication”

(as cited in Burr, Dodson, & Polk, 2006, p. vi) and provided guidelines for selecting

authentication technologies and protocols suitable to application assurance levels defined

in the previously referenced OMB directive.

Traditional user authentication systems, such as Personal Identification Numbers

(PINs) and passwords, have proven problematic for federal agencies and their citizen

application users. Separate registration processes and credential (PIN, password, etc.)

issuance are expensive for agencies to maintain and can be burdensome for citizens who

may have infrequent dealings with an agency.

As a possible solution, one of the authentication technologies under consideration

for use with e-government by NIST is Knowledge-Based Authentication (KBA;

Chokhani, Dodson, Hastings, Burr, & Polk, 2006). KBA authenticates user identities on

the basis of “shared secrets” using the individual’s personal attributes, such as last name,

first name, Social Security Number (SSN), and date of birth. Depending on the assurance

level required by the application, a multistep approach may be used that first verifies that

electronically presented identities are valid based on these personal attributes and then

uses a separate set of more difficult questions based on financial or personal information

(such as car payment or mortgage amount) obtained from proprietary sources or public

records to bind the identity to the individual presenting the identity for access to the

application.

KBA is usually performed through intermediary services referred to as knowledge

brokers or commercial data resellers that purchase and mine personal identity attributes

and information from both public and purchased proprietary records. These services then

resell these personal data to government agencies and corporations in the form of

background check services or, more recently, identity authentication services for Web-

enabled applications. As an authentication technology, KBA systems afford an easy

method for citizens to authenticate to a Web-based application, especially where there is

not a previously established relationship with the application’s organization.

The effectiveness of KBA is difficult to evaluate and its use gives rise to privacy

concerns where access may be granted to an application containing information that can

be used to facilitate identity theft, such as with a birth certificate. Does the prevalence of

this personal identity information in public records contribute to identity theft? To what

extent does the level of trust an organization can place in KBA depend on the difficulty

with which this personal identity information can be discovered in public records? This

exploratory study sought to address these questions in order to provide guidance to

government agencies interested in using KBA to meet their authentication requirements.

Background of the Study

NIST defined electronic authentication as “the process of establishing confidence

in user identities electronically presented to an information system” (as cited in Burr et

al., 2006, p. vi). The level of confidence in presented identity credentials is contingent on

the processes used to validate that the claimed identity actually exists and is bound to the

identity claimant at the time of enrollment.

Traditional shared-knowledge authentication systems rely on a previously

established relationship between the authenticator and identity claimant that has already

verified identity in some manner. A shared secret, such as a PIN or password, is then

used to bind the identity to the identity claimant when access to an electronic system is

required. Having to establish and maintain a relationship in order to manage a PIN or

password is not an effective approach for identity management in cases where the user

infrequently requires services. In February 2004, NIST hosted a symposium, “Knowledge

Based Authentication: Is It Quantifiable?” Conference information posted by NIST noted

that in instances where infrequent access is needed to conduct business with federal

agencies, “other authentication tools such as passwords and PKI certificates can be

expensive to administer for the application provider and difficult to use for the remote

individual” (NIST, 2004). NIST suggested that KBA could prove to be a useful

authentication tool in these instances.

KBA services generally function by presenting a series of questions at the time of

login to the identity claimant. The identity claimant must answer all or some of the

questions accurately in order to be successfully authenticated to the system. Basic

questions such as first name, last name, middle name, SSN, and address are asked within

the services and can then be followed by more challenging questions such as the make

and model of an individual’s car, car payment amount, questions regarding home

improvements performed on the house, or the amount of money for which a homeowner

purchased a home. These systems can be customized to application owner requirements

in the number and type of questions asked during authentication. Static questions

(name, SSN, or address) are easier for the application user to answer; however, they tend

to be populated more frequently in accessible databases. Temporal questions are based on

attributes that are less frequently disclosed, such as “Which of the following

individuals lived at your residence at 12345 Brown Street?” These are also usually

presented with multiple-choice answers as they tend to be more difficult for visitors to

accurately answer.
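
The question-and-answer flow described above can be illustrated with a minimal sketch. It is not drawn from any vendor's product: the `Question` class, the `authenticate` function, the `min_correct` threshold, and the sample attribute values are all hypothetical, and commercial KBA services use proprietary scoring that is not public.

```python
from dataclasses import dataclass


@dataclass
class Question:
    attribute: str  # hypothetical attribute name, e.g. "last_name"
    expected: str   # answer held by the (hypothetical) KBA service
    kind: str       # "static" or "temporal"


def normalize(answer: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not fail a match."""
    return " ".join(answer.strip().lower().split())


def authenticate(questions, answers, min_correct):
    """Grant access when at least min_correct questions are answered correctly."""
    correct = sum(
        1
        for q in questions
        if normalize(answers.get(q.attribute, "")) == normalize(q.expected)
    )
    return correct >= min_correct


# Illustrative data only; none of these values come from the study.
questions = [
    Question("last_name", "Doe", "static"),
    Question("date_of_birth", "1970-01-31", "static"),
    Question("prior_resident", "J. Smith", "temporal"),
]
answers = {
    "last_name": "doe ",
    "date_of_birth": "1970-01-31",
    "prior_resident": "A. Jones",
}

print(authenticate(questions, answers, min_correct=2))  # True: two static answers match
print(authenticate(questions, answers, min_correct=3))  # False: the temporal answer is wrong
```

In this sketch, raising `min_correct` corresponds to requiring all rather than some of the questions to be answered accurately, which is one way an application owner might tune the service to a higher assurance level.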

Many of these data are culled from public records, with knowledge brokers (also

known as commercial data resellers) sending personnel out to copy paper records into

their databases where electronic records are not available (“After the Breach,” 2005) or

from private companies with customer data to sell (as with utility companies or insurance

agencies). Knowledge-based authentication services rely on this information being

limited in its distribution such that only the legitimate identity claimant would

successfully be able to answer a series of these questions; however, the ease with which

the majority of identity data can be accessed through public records raises a concern that

knowledge-based authentication systems are establishing identity on the basis of “pseudo

secrets.” In some states, even an individual’s SSN can still be found in public records.

Alabama recently signed into law a bill requiring an individual’s consent to having their

SSN revealed on state documents prior to their release; however, exceptions are made for

liens, conviction records, and bankruptcy filings (State of Alabama, 2006).

The effectiveness of KBA as an authentication technology has proven elusive to

validate. KBA vendors cite the use of proprietary algorithms, some supplying confidence

scores based on whether or not the data were retrieved from public sources versus records

purchased from proprietary sources. These algorithms, however, are not made public for

academic review and testing such as is common with cryptographic algorithms. Some

KBA services make reference to “false negatives” and “false positives” when discussing

validation effectiveness (Barrett, 2004). The term “false negative” is used in these cases

to reference a valid user who has been denied access to a service because the input the

user provided did not match that stored in the service’s database. The term “false

positive” signifies an event where a person who does not rightfully own the identity has

been provided access, presumably due to being able to guess or discover sufficient

answers to questions posed by the KBA service provider. The use of false positives and

negatives to quantify authentication effectiveness is more appropriate to technologies

such as biometrics, because KBA lacks an inherent ability to test and gather metrics on

false positive rates. While false negative metrics can be captured through customer

service complaints from authentic identities that have been incorrectly denied access, it is

unlikely that application owners would receive notices from someone intentionally

spoofing another’s identity. Other proposed approaches by independent researchers to

evaluating KBA have included applying probability metrics based on the ease with which

the data can be guessed (Chokhani, 2003) as well as more recent efforts to develop a

generic KBA model based on probabilistic models (Chen, 2007). Both of these methods,

however, fail to factor in identity attribute discoverability.
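
The false positive and false negative measures discussed above can be made concrete with a short sketch. It assumes that every authentication attempt carries a ground-truth label saying whether the claimant was legitimate. That assumption holds in controlled biometric testing but, as noted above, generally not for KBA, where successful impostors do not report themselves. The function name `error_rates` and the sample counts are hypothetical.

```python
def error_rates(attempts):
    """Return (false_acceptance_rate, false_rejection_rate).

    attempts is a sequence of (is_legitimate, was_granted) boolean pairs.
    A false acceptance (false positive) is an impostor who was granted access.
    A false rejection (false negative) is a legitimate user who was denied.
    """
    attempts = list(attempts)
    legit = [granted for is_legit, granted in attempts if is_legit]
    impostor = [granted for is_legit, granted in attempts if not is_legit]
    frr = legit.count(False) / len(legit) if legit else float("nan")
    far = impostor.count(True) / len(impostor) if impostor else float("nan")
    return far, frr


# Illustrative labeled attempts: 10 legitimate (2 denied), 4 impostors (1 granted).
attempts = (
    [(True, True)] * 8
    + [(True, False)] * 2
    + [(False, False)] * 3
    + [(False, True)] * 1
)

far, frr = error_rates(attempts)
print(far)  # 0.25
print(frr)  # 0.2
```

Without the impostor labels, only the false rejection rate could be estimated, for example from customer service complaints.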

In a graduate information security course at Johns Hopkins University, students

used electronic public records to gather over a million records, with hundreds of

thousands of individuals (Zeller, 2005). This proliferation of personal identity

information in government public records is a growing concern of politicians and

government agencies. A Government Accountability Office (GAO) study reported finding

SSNs in records of more than 41 states. While federal agencies are prohibited from

posting SSN information publicly, the GAO (2004) found that an estimated 15–28% of the

nation’s 3,141 counties did make them publicly accessible over the Internet. In an earlier

report, the GAO found that identity-theft-related crimes were enhanced by the growth of

the Internet as it “increases the availability and accessibility of personal identifying

information” (2002, p. 6), linking this increase in availability to an increase of identity

theft by aliens. The GAO reports, however, do not offer empirical data to support their

statement and no studies could be found attempting to correlate identity theft rates to the

availability of electronic public record databases. A study is necessary to determine the

impact that electronically available public records have on identity assurance and KBA

through a correlation to identity theft rates.

Statement of the Problem

The effectiveness of most authentication technologies can be measured either by

their susceptibility to guessing or through performance testing to determine the

percentage of users who were incorrectly denied access or incorrectly granted access.

Both of these measurement methods fall short in their ability to assess the effectiveness

of authentication systems that use personal identity attributes that are accessible by the

general public. The goal of this research was to determine the extent to which this

authentication information is discoverable on the Internet and the impact discoverability

has on identity theft and assurance.

Significance of the Study

This study serves to lay the groundwork for the development of a KBA

assessment methodology. To accurately evaluate the effectiveness of KBA systems, the

discoverability of personal identity attributes and their impact on identity theft must first

be quantified. The identification of the frequencies with which identity attributes are

found in public records can assist in assigning a “discoverability factor” to each attribute.

This discoverability factor can be used by government agencies to map the selection of

specific identity attributes to appropriate e-authentication assurance levels.
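
One simple way to turn attribute frequencies into a per-attribute factor is sketched below. It assumes that the factor is the proportion of sampled records in which an attribute appears. That proportional definition, the function name `discoverability_index`, and the sample records are illustrative assumptions rather than the index reported in the results chapter.

```python
from collections import Counter


def discoverability_index(records):
    """Map each attribute to the fraction of records in which it appears.

    records is a list of collections of attribute names, one per coded record.
    Each attribute is counted at most once per record.
    """
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(attr for rec in records for attr in set(rec))
    return {attr: n / total for attr, n in counts.items()}


# Illustrative coded records; not data from the study.
records = [
    {"name", "address"},
    {"name", "address", "date_of_birth"},
    {"name"},
    {"name", "address"},
]

index = discoverability_index(records)
for attr in sorted(index, key=index.get, reverse=True):
    print(attr, index[attr])
# name 1.0
# address 0.75
# date_of_birth 0.25
```

Under this assumed definition, an attribute with a value near 1.0 would be a weak choice for a KBA question, because it appears in almost every record an attacker could retrieve.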

Purpose of the Study

The purpose of this quantitative study was twofold:

1. Examine the comparative discoverability of identity attributes in online public records, associating a “discoverability factor” to individual and linked identity attributes where specific combinations of identity attributes occur.

2. Determine if correlations exist between the frequencies with which identity attributes can be found in public records and instances of identity theft.

Rationale

The vulnerabilities that a lack of strong identity management practices present to

national security are significant. The ability of citizens to weakly authenticate to

government services gives terrorists access to “breeder documents,” or documents

that are used to obtain other documents for identity, such as drivers’ licenses, social

security cards, and birth certificates. With these documents, long-term identities can be

established and maintained. This was the case with the 9/11 terrorists, all of whom had

valid drivers’ licenses and were, in some cases, even registered as U.S. citizens to vote

(Johnson, 2004). David Temoshok (2005), the Director of the General Services

Administration’s (GSA) Identity Policy and Management Office, which has been tasked with

implementing the E-Authentication Portal, suggested that trust in the identity verification

procedures is one of the critical issues of federated identity.

In a public response to NIST’s draft version of Federal Information Processing

Standard (FIPS) 201, the Information Technology Association of America (ITAA, 2004)

requested that NIST consider the use of knowledge-based authentication, incorporating

it where appropriate into the identity verification procedures used to provide

identification cards to all government employees, contractors, and their affiliates. This

identity card is mandatory for access to all government facilities and systems. The use of

an identity authentication technology that has not been adequately evaluated could

present a risk to the security of national systems.

Government agencies are already in the process of implementing KBA to meet

critical electronic authentication needs. The Social Security Administration (SSA) is

presently using KBA to allow beneficiaries to change mailing addresses for their Social

Security checks, as well as check their Social Security benefits and apply for direct

deposit using the Internet (Office of the Inspector General [OIG], 2004). Furthermore, the

access to personal information with even seemingly low-risk applications such as those

used by SSA creates an opportunity for an identity thief to collect, or aggregate, identity

information that can contribute to identity theft. For this reason, access to personal

information must be protected using authentication technologies appropriate to the level

of risk as defined in the e-Authentication Guidelines.

Research Questions

Using Cooper’s management research hierarchy (Cooper & Schindler, 2003), the

following management dilemma and research questions were defined and examined

within this study.

Management Dilemma

KBA provides a cost-effective and quick method for authenticating citizens and businesses to government applications; however, it may not provide sufficient identity assurance to meet OMB e-Authentication guidelines, as much of the authenticating data may be easily discovered on the Internet.

Management Questions

1. What personal identity attributes offer higher levels of assurance when selecting the questions asked with KBA services?

2. To what extent do online public records provide information that can be used

by an identity thief to build a more comprehensive identity profile of their target, increasing the likelihood of spoofed identities with KBA services?

Research Questions

1. Can the frequency with which identity attributes are accessible on the Internet be used as an indicator of discoverability?

2. Is there a correlation between reported identity theft rates and the availability of personal data in public databases?

Investigative Questions

1. Who are the major KBA service providers in the industry?

2. What personal information do the major KBA service providers require for authentication?

Measurement Questions

1. What identity attributes appear both singly and in combination with other attributes most frequently in public records databases?

2. Does a correlation exist between the publication of personal identity attributes in public records and identity theft?

Nature of the Study

This study enumerated the frequency with which personal identity attributes

reside in public records databases and explored the impact of their discoverability on

identity theft. The methodology used in this study is conceptualized in Figure 1.

Figure 1. Conceptual framework.

Research Problem: Governments and organizations are increasingly using KBA service providers to authenticate individuals to online services using personal data as identity verifiers. What level of identity assurance can KBA provide given the availability of personal data to identity thieves? Does the use of KBA serve to increase the likelihood of identity fraud/theft in its ability to provide access to “breeder documents”?

Research Question 1: What is the frequency, or discoverability, of personal identity attributes in public records databases?

Research Question 2: Is there a correlation between identity theft rates and the availability of personal data in public databases?

Outcome of Research: Ascribe a probability, or discoverability factor, to individual and linked identity attributes.

Research Method: Quantitative – using content analysis to categorize and code Web-based public records content.

Research Method: Quantitative – examine results for a correlation between discoverability and identity theft rates.

Test Hypothesis: Greater rates of personal data in online government public records will correlate to higher incidences of identity theft/fraud.

Independent Variable (IV): Personal data available in government-provided online records.

Dependent Variable (DV): Rates of identity theft.
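
The correlation test named in the framework can be sketched with a rank correlation between the independent and dependent variables. The sketch below computes Spearman's rho as the Pearson correlation of average ranks, which handles ties. It omits the significance test, and the per-state values are invented for illustration only; they are not FTC figures or results from this study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five states: attribute counts (IV) and theft rates (DV).
attribute_counts = [10, 20, 30, 40, 50]
theft_rates = [1.0, 3.0, 2.0, 5.0, 4.0]

print(round(spearman_rho(attribute_counts, theft_rates), 3))  # 0.8
```

A positive rho close to 1 would be consistent with the test hypothesis, but a rank correlation alone does not establish that greater availability of personal data causes higher identity theft rates.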

Definition of Terms

Following are definitions for terms that were used in this study.

Authentication. The process of binding a user identity to an individual with a

specific level of assurance.

Credential Service Provider. An organization or service that issues identity

credentials (e.g., passwords or tokens) after the identity has been verified.

Credit Card Fraud. The use of a credit card by an unauthorized party facilitated

by inadvertent disclosure of identity information.

Discoverability Factor. The degree to which specific identity attributes can be

found by individuals other than the target identity.

E-Authentication. The process of binding a user identity to an individual with a

specific level of assurance during an online, or electronic, transaction.

FIPS-201. Federal Information Processing Standard entitled “Personal Identity

Verification of Federal Employees and Contractors.” Specifies the identity proofing,

credentialing, and personal identity verification card requirements for federal employees,

contractors, and their affiliates.

Identity Assurance. The level of trust that can be placed in an identity presented

by an identity claimant.

Identity Attributes. Characteristics associated with an identity that must be

presented to authenticate one’s identity. Examples include last name, first name, SSN,

date of birth, and so forth.

Identity Claimant. An individual who is presenting an identity for verification

during the authentication process.

Identity-Proofing. The process of validating that an identity exists and verifying,

or binding, an identity to an identity claimant.

Identity Theft. The loss or disclosure of identity attributes sufficient for another

individual to impersonate that individual. Usually performed to enable the thief to

commit a crime while using the impersonated individual’s identity.

Identity Validation. The process in which the identity presented by an identity

claimant is checked to ensure that the identity is a real one. Usually includes checks

against databases to ensure that the addresses used are real and the individual is not

deceased.

Identity Verification. The process in which the identity claimant provides

sufficient proof (i.e., by answering questions that only that individual should know) to

effectively “bind” the identity to the claimant.

Personally Identifiable Information (PII). Information that identifies an individual

either directly or by reference using an individual’s unique identity characteristics, or

attributes, such as name, date of birth, mailing address, telephone number, SSN, e-mail

address, or other information that links the individual to that identity.

President’s Management Agenda. A strategy established by President George W.

Bush in the Summer of 2001 for improving federal government services in five areas of

management weaknesses.

Static Attributes. Personal characteristics that do not frequently change (e.g., date

of birth).

Temporal Attributes. Personal characteristics that do frequently change (e.g.,

address, employment, individuals living at address). Also referred to as “dynamic”

attributes.

Transitive Trust. The acceptance of an identity verified by one system at a

different system without additional verification.

Assumptions and Limitations

As will be discussed in chapter 3, the dynamic nature of the Internet provides

significant challenges to researchers when attempting to reproduce Web-based content

analysis. It was assumed that the majority of information contained in electronically

accessible public records would be present within databases, rather than as other types

of documents or objects. While the expected lifetime of an online database record has

been demonstrated to be longer than that of other Web-based content (Koehler, 2004), the

dynamic nature of the Web limits the life expectancy of the content analyzed within this

study. Additionally, this study focused on those public records that can be electronically

discovered using the Internet. It did not address the discoverability of public records

freely available to the public in nonelectronic forms, nor did it address databases

containing personal information that are held by private entities, such as retail stores or

insurance companies, that are sold to data aggregators. These records are more difficult to

sample; however, they should not be overlooked in their contribution to identity

aggregation. Finally, as a result of several very public data breaches at data aggregators,

there was a heightened awareness of identity theft and identity attribute aggregation at the

time this study was conducted. As a result, the legislative landscape is rapidly evolving

and, in some cases, is in direct conflict with state and federal goals of electronically

enabling public records. Pending legislation may limit the availability of personally

identifiable information in the future.

CHAPTER 2. LITERATURE REVIEW

Overview of the Chapter

This review focused on examining existing literature for research relating to the

impact of easy access to identity attributes on identity theft. A discussion of identity theft

is provided, as well as the impact of identity data aggregation for use with knowledge-

based authentication. Literature has been selected that can provide a foundation for the

comparison of methodologies used to assess the effectiveness of other authentication

technologies to those based on personal identity attributes. A literature review for the

research methodology used in the study is provided separately in chapter 3.

Identity Theft

In a final rule issued in October 2004, the Federal Trade Commission (FTC)

defined the term identity theft as “a fraud that is committed or attempted, using a person’s

identifying information without authority” (2004, p. 1). It is important to distinguish that

the use of another individual’s information is illegal only if used for fraudulent purposes.

The FTC proposed that “identifying information” should be synonymous with “means of

identification” cited in the federal criminal statute relating to identity fraud (Public Law

105-318). Identification information includes “any name or number that may be used,

alone or in conjunction with any other information, to identify a specific individual”

(Identity Theft and Assumption Deterrence Act, 1998, p. 2). Specifically cited in the

statute are name, SSN, birth date, driver’s license or identification numbers, alien

registration number, government passport number, employer or taxpayer identification

number, and e-mail address, among other biometric and telecommunications information

(Identity Theft and Assumption Deterrence Act).

While the Privacy Act of 1974 prevents the disclosure of personally identifiable

information on citizens held in federal government databases, federal court systems are

exempt from this requirement and are even allowed to disclose citizen SSNs in public

records (GAO, 2004). The SSA suggested that misuse of identity information will be

difficult to reverse while this information is available to the public (“Social Security

Number,” 2004), certainly supporting the hypothesis that the prevalence of these data in

public records is a contributing factor to identity theft.

While many of these records do not contain all of the information necessary to

successfully complete authentication to an online application using KBA, the existence of

multiple sources of information containing pieces of the authentication puzzle allows an

identity thief to compile data and profile a target. Solove described the problem

associated with data aggregation wherein, in isolation, a particular piece of information

may not be invasive of one’s privacy; however, when such pieces of data are amassed,

they effectively form a digital biography, or digital “dossier” (2004, p. 1) on the

individual.

As a result of the proliferation of personal data on the Internet, KBA may not

provide agencies with sufficient identity confidence and may, in fact, increase the

problem of personal data aggregation that can lead to an increased likelihood of identity

theft by providing access to additional “breeder documents” such as birth

certificates or marriage licenses. Eventually, sufficient identity information can be

amassed that will allow an attacker to “spoof” the identity of a valid user in online

applications, such as government applications.

In the FTC’s annual commissioned study on identity theft complaints, 258,427

consumers reported having their identity stolen in 2007, 11% of these being stolen to

facilitate government documents/benefits fraud (FTC, 2008). In a report to the Senate, IRS

Commissioner Mark Everson placed the figure at almost 1.5 million individuals who had

their personal information misused to obtain government documents, tax forms, or tax

refunds (U.S. Senate Finance Committee, 2007).

Some reports, however, suggest that identity theft is not on the increase and point

to reasons other than disclosure through public records as its principal source. Javelin

Strategy and Research has performed several studies on the topic. The first of their

reports, the 2006 Identity Fraud Survey Report released by Javelin Strategy and Research

with the Council of Better Business Bureaus, challenged the belief that access to personal

identity information leads to identity fraud. The report cited a marginal decline in overall

identity theft rates from 4.7% to 4.0% from 2003 to 2006, consistent with the overall

decline of reported identity fraud complaints in 2006 cited by the FTC (2006). The

Javelin report stated that the majority of identity theft occurs as a result of “traditional”

reasons, such as lost or stolen wallets or credit cards, and not from the Internet. Furthermore,

Javelin’s survey results found that almost half (47%) of the reported fraud

incidents were perpetrated by someone the victim knew.

Javelin’s original study had several weaknesses associated with it. First, sponsors

of the survey included CheckFree, Visa, and Wells Fargo & Company—financial

services companies that may have a less-than-impartial interest in the outcome as their

intent is to instill trust in their online services. Secondly, Javelin’s narrow interpretation

of identity fraud as “the unauthorized use of another’s personal information to achieve

illicit financial gain” (2006, p. 11) makes it only applicable to financial accounts. Despite

this constraint, Javelin generalized the outcome to all forms of identity fraud, indicating

that, contrary to growing fears on the subject, identity fraud and data compromise were

contained and less widespread than thought.

The report was updated in 2007 and re-released in February 2008. In the latest

report, Javelin differentiated identity theft from identity fraud, stating that “Identity theft

happens when your personal information is accessed by someone else without your

explicit permission” (2008, p. 5). Personal information is defined as “Social Security

number, bank or credit card account numbers, passwords, telephone calling card

numbers, birth dates, name, address and so on” (Javelin, p. 5). In this report, Javelin

stated that “with even the most basic information, a criminal can either take over your

existing financial accounts or use your identity to create new ones” (p. 5). As with the

first report, Javelin emphasized the role that traditional methods play in identity theft

(79% in those cases where the victim knew how the criminal obtained the identity

information), defining it in this report as “when a criminal can make direct contact with

the consumer’s personal identification” (p. 5).

It is significant to note, however, that only 155 (35%) of the 445 victims surveyed

in the study actually knew how the data were accessed (Javelin, 2008). The traditional

methods discussed earlier were therefore responsible for only 123 (27.6%) of the 445

total incidents of identity theft. Consequently, Javelin’s reports, which Javelin

claimed are the largest ever on identity fraud, do not provide substantial evidence that

identity theft is most often perpetrated by individuals personally knowing or having

traditional access to the victim’s personal data.
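
The proportions discussed above can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative, and the only inputs are the three counts quoted from the Javelin (2008) survey.

```python
# Counts quoted in the text from the Javelin (2008) survey.
total_victims = 445        # victims surveyed
knew_access_method = 155   # victims who knew how their data were accessed
traditional_cases = 123    # cases attributed to "traditional" methods

# Share of all victims who knew how the data were accessed.
share_knowing = knew_access_method / total_victims

# Share of traditional methods among victims who knew the access method.
share_traditional_of_known = traditional_cases / knew_access_method

# Share of traditional methods among all surveyed victims.
share_traditional_of_all = traditional_cases / total_victims

print(f"Knew access method:             {share_knowing:.1%}")
print(f"Traditional, of known cases:    {share_traditional_of_known:.1%}")
print(f"Traditional, of all 445 cases:  {share_traditional_of_all:.1%}")
```

The same 123 cases represent roughly four fifths of the victims who knew the access method but only about one quarter of all victims surveyed, which is the distinction the argument above relies on.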

In actuality, while there are many reports on the number of victims and the impact

of identity theft, there has been little academic research performed regarding the

methodology used by criminals to perpetrate the crime. One study, performed in October

2007 by the Center for Identity Management and Information Protection at Utica College,

acknowledged the dearth of research into this area and refuted other findings of

preestablished relationships between the identity thief and the victim, stating that “while

there were instances in which relatives and friends proved to be the perpetrators, they

were in the minority” (Gordon, Rebovich, Choo, & Gordon, 2007, p. 66). The study

examined 517 closed identity theft cases collected from the Secret Service. Among other

findings, the study revealed that in approximately 20% of the 102 total cases from 2001–

2004, the Internet was used in some manner to commit the crime. In 27 of these cases, it

was specifically used to “search databases” (p. 51).

Identity Data Aggregation

In an article frequently referenced by privacy advocates, Solove (2004) posited

a scenario in which the government compels individuals to provide personal data about

themselves, places these data into public records databases, and then makes the

information freely available on the Internet or provides it for commercial use upon

request. Solove proceeded to enumerate some of the public records maintained by

federal, state, and local governments, including births, deaths, marriages, divorces,

professional licensure, voting information, bankruptcy records, and so forth, arguing that

we are seeing the creation of architectures of vulnerability that leave individuals

susceptible to identity theft. Solove is not alone in his belief that the ready access to the

thousands of databases provides for digital profiling. In extensive work on privacy, including

work for the Department of Homeland Security, Carnegie Mellon

researcher Dr. Latanya Sweeney successfully demonstrated that even when databases are

devoid of explicit identifiers such as name, address, or SSN, certain combinations of

identifiers, termed quasi-identifying, provide sufficient information to link an individual

to a record containing explicit identifiers. As an example, birth date, gender, and zip code

information combine to uniquely identify 87% of the population of the United States

(Sweeney, 2000). Sweeney additionally noted that more than half (53%) of the U.S.

population are likely to be uniquely identified through only the combination of the city,

town, or municipality in which they reside, their gender, and date of birth. With only the

county, gender, and date of birth, 18% of the U.S. population can be identified. Malin

provided an example in which quasi-identifying data from two tables can be used for

identification:

Given two tables, Wi(name, date of birth, gender, zip code) and Wj(year of birth, gender, IP address). Under the assumption that the IP address has not been spoofed, a relationship between the IP address of a computer and the geographic zip code can be established. As such, the linkage attribute set Sij is defined as: Sij={<date of birth, year of birth>, <gender, gender>, <zip code, IP address>}. (2002, p. 7)

In this example, a linkage between the IP address and the geographic zip code can

be made that makes the other data relevant to identity. While Sweeney’s recent work has

focused on assisting medical organizations to comply with new privacy requirements, it

nonetheless illustrates the ease with which data records can be combined to compile

the digital dossier that Solove addressed.
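
The linkage described in Malin’s example can be illustrated with a small sketch. Everything in it is hypothetical: the two tables, the names, the IP-to-zip lookup (which stands in for the assumption that an unspoofed IP address can be mapped to a geographic zip code), and the function name link_tables. It is not Malin’s or Sweeney’s implementation, only one way the attribute pairs (date of birth to year of birth, gender to gender, zip code to IP address) could be joined.

```python
from datetime import date

# Hypothetical table W_i: explicit identifier (name) plus quasi-identifiers.
W_i = [
    {"name": "Alice Example", "date_of_birth": date(1970, 3, 14), "gender": "F", "zip_code": "20001"},
    {"name": "Bob Example", "date_of_birth": date(1970, 8, 2), "gender": "M", "zip_code": "20001"},
    {"name": "Carol Example", "date_of_birth": date(1985, 1, 9), "gender": "F", "zip_code": "37201"},
]

# Hypothetical table W_j: no names, only year of birth, gender, and IP address.
W_j = [
    {"year_of_birth": 1970, "gender": "F", "ip_address": "192.0.2.10"},
    {"year_of_birth": 1985, "gender": "F", "ip_address": "198.51.100.7"},
]

# Assumed lookup from IP address to zip code (illustrative values only).
IP_TO_ZIP = {"192.0.2.10": "20001", "198.51.100.7": "37201"}


def link_tables(w_i, w_j, ip_to_zip):
    """Match each W_j row to the names in W_i that share its quasi-identifiers."""
    # Index W_i by (year of birth, gender, zip code).
    index = {}
    for row in w_i:
        key = (row["date_of_birth"].year, row["gender"], row["zip_code"])
        index.setdefault(key, []).append(row["name"])

    # Translate each W_j row into the same key space and look it up.
    matches = []
    for row in w_j:
        key = (row["year_of_birth"], row["gender"], ip_to_zip.get(row["ip_address"]))
        matches.append((row["ip_address"], index.get(key, [])))
    return matches


links = link_tables(W_i, W_j, IP_TO_ZIP)
for ip_address, names in links:
    print(ip_address, "->", names)
```

In this toy data each anonymous row resolves to exactly one name, which is the outcome the linkage attribute set is meant to capture; with real data a key may match several individuals or none.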

While the growth in data aggregation is related to inexpensive computers with

large storage capacities (Sweeney, 2000), it is also certainly tied to the growth of the

Internet and enhanced search-engine capabilities that facilitate the rapid searching of

large databases. Several Internet records-search businesses now specialize in the

aggregation of public records databases. KnowX, a ChoicePoint company, provides

access to “documents compiled by various public offices and agencies which are made

available to the general public. Examples of public records include real estate records,

lien filings, business entity filings, lawsuit information and court dockets, court decisions

and death records.” Other records are available under KnowX’s Professional fee-paid

services (KnowX, 2005). Search Systems provides access to more than 36,000 searchable

public records for a present cost of less than $5 per month.

Knowledge brokers, or companies that specialize in the aggregation and sale of

personal data, resell the data from these online data repositories, providing personal

information to law enforcement agencies performing background investigations and to

commercial entities extending credit or checking references for rental housing. The Drug

Enforcement Administration (DEA) regularly provides information from its Controlled

Substances Act (CSA) database to knowledge brokers such as KnowX. This database,

considered public information, contains the doctor’s name, licensure, license status, and

the location from which the DEA has authorized the physician to dispense controlled

substances to patients (usually a medical office). Interestingly, Solove (2004) observed

that not only is the government a supplier of this information to the private sector, but, in

turn, it purchases the services of these knowledge brokers to generate information about

individuals, enumerating ChoicePoint’s contracts with the Justice Department, FBI,

IRS, and other federal agencies to substantiate this claim.

Knowledge-Based Authentication

In addition to background investigations, knowledge brokers use these databases

to validate and verify identity for KBA services (Willox, 2001). KBA service providers

rely on the availability of personal identity attributes contained within public and

proprietary records to provide authentication services to Web-based applications and for

credit-granting purposes. The importance of public records for identity verification was

supported in a position paper written by the Property Records Industry Association

(PRIA), which stated, “In order to grant credit rapidly and appropriately, the collection of

information about consumers through public records is necessary for businesses to make

fair and objective risk decision” (2006, p. 10).

In all cases, whether for performing background checks or for verifying

identity at the time the user is attempting to authenticate to a site or system, knowledge

brokers use the data gleaned from these same databases as “source,” or valid, data.

ChoicePoint and LexisNexis Group are considered to be two of the nation’s largest data

aggregators (Olsen, 2005). While LexisNexis has recently acquired ChoicePoint, their

data sources can vary. In a document that described ChoicePoint’s methodology for a

State of Tennessee project, ChoicePoint reported performing identity checks using the

following data sources: a “composite” file consisting of credit header data, property tax

records, casualty insurance records, driver’s license file from 35 state agencies,

residential phone listings, and address records from the U.S. Postal Service National

Change of Address file (as cited in State of Tennessee, 2003).

ChoicePoint has also claimed to use more robust data sources, “not just wallet-

based or financial history information” (2004, p. 1) than other competing products;

however, the information observed on sample screens at their ProID Web site does not

indicate that all of the questions being asked would be difficult to answer using public

sources. Questions such as “On which of the following streets have you NEVER lived or

owned property?” can be answered using property records searches for those individuals

who have been long-time property owners.

Other service providers may also rely on equally discoverable data. During the

NIST symposium on Quantifying Knowledge-Based Authentication, Experian

representative Kim Cartwright (2004) provided the following identifiers that are used

with their services during the identity validation process: address, phone number, SSN,

driver’s license, and date of birth, comparing these data to those gathered from its own

consumer credit records, consumer demographics, vehicle ownership, property

ownership, and other unspecified reference files. LexisNexis advertised their InstantID

product as a powerful tool that “simultaneously searches multiple independent

databases—containing 4 billion consumer and 300 million business records—for

information that can verify and validate a person’s identity” (2005, p. 1). The tool

validates name, address, SSN, date of birth, and phone number. Credit header data, sold

to KBA vendors separately from the credit history file, is one of the few sources outside

of court records by which SSNs can be validated. Although it is part of the credit reporting

bureau’s file, the FTC has determined that it is not a part of the credit history and so does

not fall under the Fair Credit Reporting Act (“Protecting Consumers’ Data,” 2005).

Some vendors run these data through consistency checks to confirm that the

phone number area code matches that assigned within the city

and state, the driver’s license number is consistent with the format as issued by the state,

the SSN is not listed on the Social Security Death Index, and that the address is not that

of a designated “high risk” location, such as a nightclub, drop box, and so forth. Of

particular concern, however, is that the bulk of the data against which identity is verified

(name, address, phone number, and date of birth) is easily retrievable on the Internet.

Even property tax records and vehicle records that underlie challenge-response type

questions used by KBA vendors (e.g., how much did you pay for your house?) are

publicly available through public records aggregators such as Search Systems. While it is

becoming increasingly difficult (although not impossible) to find SSNs on the

Internet, research has demonstrated that identity thieves possessing enough personal data

can easily retrieve the victim’s SSN, then use the SSN as a key to access the victim’s

financial benefits (“Identity Theft and Social Security Numbers,” 2004).
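
The vendor consistency checks described above can be sketched as a small rule set. This is an illustration only, not any vendor’s actual method: the lookup tables, the license-number pattern, the fictitious state code “ZZ,” the sample records, and the function name consistency_flags are all hypothetical.

```python
import re

# Hypothetical reference data; a real service would draw on far larger sources.
AREA_CODES_BY_LOCALITY = {("Exampleville", "ZZ"): {"555"}}
LICENSE_PATTERN_BY_STATE = {"ZZ": re.compile(r"[A-Z]\d{8}")}  # assumed format
DEATH_INDEX_SSNS = {"000-00-0001"}
HIGH_RISK_ADDRESSES = {"1 Nightclub Way"}


def consistency_flags(record):
    """Return the list of consistency checks that a claimed identity fails."""
    flags = []

    # Area code must be one assigned to the claimed city and state.
    valid_codes = AREA_CODES_BY_LOCALITY.get((record["city"], record["state"]), set())
    if record["phone"][:3] not in valid_codes:
        flags.append("area_code_mismatch")

    # License number must follow the issuing state's format.
    pattern = LICENSE_PATTERN_BY_STATE.get(record["state"])
    if pattern is None or not pattern.fullmatch(record["license_number"]):
        flags.append("license_format_mismatch")

    # SSN must not appear on the death index.
    if record["ssn"] in DEATH_INDEX_SSNS:
        flags.append("ssn_on_death_index")

    # Address must not be a designated high-risk location.
    if record["address"] in HIGH_RISK_ADDRESSES:
        flags.append("high_risk_address")

    return flags


consistent = {"phone": "5550100000", "city": "Exampleville", "state": "ZZ",
              "license_number": "A12345678", "ssn": "000-00-0002",
              "address": "10 Main Street"}
suspicious = {"phone": "4440100000", "city": "Exampleville", "state": "ZZ",
              "license_number": "12345", "ssn": "000-00-0001",
              "address": "1 Nightclub Way"}

print(consistency_flags(consistent))
print(consistency_flags(suspicious))
```

Checks of this kind confirm only that the submitted attributes agree with one another; they do not establish that the person submitting them is their rightful owner.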

The emphasis on enabling e-government services will continue to create

additional personal data repositories. As the availability of personal data in public records

increases, so does the ability of the KBA service provider to successfully match

identity data. When access to public records is limited, so too is the ability to assemble a

digital dossier on an individual. LexisNexis stated that the paucity of data makes it

difficult to detect the international identity thief (as cited in Willox, 2001). Paradoxically,

an increase in the availability of personal data increases their susceptibility to discovery,

undermining the confidence that can be placed in that identity’s verification. This

problem has been examined from both sides. Solove’s (2004) proposed

solution was to regulate access to public records and to remove identifying information

from the records, while fellow attorney Lynn LoPucki (2003) proposed the creation of a

Public Identification System that publishes most of an individual’s personal data,

eliminating their use in providing proof of identity—requiring instead that identification

be determined by public claim and personal contact.

Both analysts agreed that in order to protect privacy, the secrecy paradigm must

be abandoned. Solove, however, suggested that “by taking obscure facts and making

them widely available, privacy can be violated” (2004, p. 143), arguing that “an SSN,

mother’s maiden name, and birth date should be prohibited as the method by which

access can be obtained to accounts” (p. 143). LoPucki’s (2001) arguments were certainly

more convincing, suggesting that by making all identity information completely public,

any artificial value placed on any of the identifiers is removed and the identifier becomes

essentially worthless to the identity thief, as it can no longer be used for financial gain as

an authenticator; simply knowing an individual’s SSN would not provide access to other

records. LoPucki’s proposal that publicly listed contact information be provided as a

means to authenticate identity fell somewhat short, however, as he did not address the

problem resulting from a compromise of this information wherein the contact information

is changed by the identity thief, much in the same manner as credit bureau data are

altered by identity thieves today.

An illusion of secrecy has been created around personal data that were

never intended to be secret, resulting in the commercialization of these private

data. The use of these pseudo-secret data has extended to all Web sites, prompting one

financial executive to lament that the misuse of personal data, such as mother’s maiden

name, for authentication at Web sites and elsewhere, has resulted in there being relatively

few authenticators left to banks to use today to secure online transactions (Archer, 2004).

Regardless of whether the fault lies with the citizen for indiscriminately sharing his or her

personal data with all and sundry, or with the financial industry for substituting the

convenience of using an existing shared secret for the expense and security of creating

and disseminating a truly shared secret, once the cat is out of the bag there is no putting it

back in. Once private knowledge becomes public knowledge in any manner, knowledge-

based authentication may fail to provide a sufficient level of identity confidence in

digitally presented credentials.

Identity Authentication in Federated Environments

David Temoshok, the Director of the GSA’s Identity Policy and Management

Office that has been tasked with implementing the E-Authentication Portal, defined

federated identity as “rules, agreements, and standards, technologies that make identity

and entitlements portable across autonomous domains” (2005, p. 4), and suggested that

trust in the identity verification procedures is one of the critical issues of federated identity.

Transitive trust relationships are prevalent authentication models for federated

environments, such as those participating in the Liberty Alliance and GSA’s

E-Authentication Portal. These federated environments provide for transparent movement

between participating systems, agreeing to accept the credentials of users who have been

identity-proofed prior to accessing the present system (Electronic Authentication

Partnership, 2004). Once identity is authenticated and a Credential Service Provider

issues a credential, the credential is accepted by any participant within the federation

during the same browser session. Interoperable credentials within a federated

environment ensure that the user does not have to remember different passwords for

every site that the user visits within the federation. The degradation of identity confidence

will not be apparent to the systems participating in a transitive trust relationship with the

system performing the initial identity authentication.

To combat the threat of spoofed identities with government e-authentication

applications, in December 2003, the Office of Management and Budget (OMB, 2003)

issued Memorandum M-04-04, providing guidance to the heads of all departments and

agencies within the federal government on securing authentication to online government

services. Citing the National Research Council’s (NRC) report on authentication, Who

Goes There? Authentication Through the Lens of Privacy, OMB deferred to the NRC’s

definition of e-authentication as “the process of establishing confidence in user identities

electronically presented to an information system” (OMB, p. 3). The level of confidence,

or assurance, in electronically presented identities is directly linked to the strength of the

identity authentication technology or protocol used by the identity claimant attempting

authentication, and is determined by the level of risk presented to the application or user

in the transaction. Risk assessments are to be performed by each agency on its e-

government application to determine the risks and impact upon compromise presented by

each electronic transaction type (e.g., a request for information or submission of data to be

added to an online database). As applications become increasingly risky to the

agency or individual participating in the transaction, authentication to the application

becomes more rigorous.

Authentication is largely based on three factors:

1. Something one knows, such as a PIN, password, or a combination of personal information;

2. Something one has, such as a hard token, or smart card; or

3. Something one is, as generally demonstrated by a biometric such as a fingerprint or retinal scan.

Any of these single authentication factors presents certain vulnerabilities that can lower

identity authentication confidence; however, when factors are combined in “multifactor

authentication” (e.g., a doctor is required to present both a password and a biometric when

authenticating to write a prescription for controlled substances), there exists a greater

confidence that the individual presenting the identity claim has been effectively bound to

that identity.

In supplemental guidance from NIST, authentication technologies are mapped to

the four assurance levels defined in the OMB memorandum, ranging from Level 1

(providing the least identity authentication confidence) to Level 4 (providing a high level

of confidence that the identity claimant is bound to the presented identity; Burr et al.,

2006). As a part of the risk assessment process, online government applications are

assigned a required authentication assurance level and, thus, a required minimum

authentication technology standard.

While the NIST guidance states that it addresses the most widely implemented

forms of authentication protocols for remote authentication based on secrets, it directs

that applications requiring Level 4 assurance use hardware-based cryptographic tokens

that link the identity to “something they have” (e.g., PKI-based “smart cards” or other

hardware devices). NIST also informs readers that they are continuing to study both the

topics of knowledge-based authentication—which they define as authentication based on

the claimant correctly answering many personal, but not secret, questions—as well as the

use of biometrics that authenticate claimants based on physical characteristics or

identifiers possessed by the claimant (“something they are”). It is important to note that,

for the purposes of this discussion, the use of the term knowledge-based authentication,

or KBA, refers to the definition used here by NIST and does not refer to PIN/password

technologies.

Measuring the Effectiveness of Authentication by Knowledge

Authentication technologies are only as effective as their ability to ensure that

only authorized users are able to access an application or system. The National Computer

Security Center, in its identification and authentication “best practices” guidance,

suggested that effective authentication technologies must “uniquely and unforgeably

identify an individual” (1991, p. 5). The ease with which an imposter can impersonate an

authorized user during authentication varies considerably with the technology used. How

is authentication effectiveness measured among the myriad forms of authentication

technologies? Is there an existing measurement methodology that can be applied to

knowledge-based authentication?

Authentication technologies based on shared secrets such as PINs or passwords

rely on the authorized user maintaining the secrecy of the authentication data. Factors that

impact the effectiveness of PIN/password authentication technologies, therefore, include

their susceptibility to being guessed or discovered by an imposter.

NIST’s guidelines discuss the vulnerability of passwords to guessing attacks—

comparing randomly selected passwords to user-selected passwords of the same length.

They defined password-guessing entropy as an “estimate of the average amount of work

required to guess the password of a selected user” (Burr et al., 2006, p. 46) when applied

to a distribution of passwords. NIST argued that password-guessing entropy is the most

critical measure of the strength of a password system, since it largely determines the

resistance to password-guessing attacks.

Password strength is a function of the length of the required password and of the

character types used within the password. Limiting the characters to numbers provides

only 10 different digits (0–9) upon which the PIN can be based. Four-digit PINs,

therefore, can have only a maximum of 10,000 unique combinations, as calculated by

10^4. Passwords are typically much longer and more complex than PINs. The National

Security Agency’s (2006) most recent guidelines on the use of passwords within

government agencies recommended the use of 12-character passwords. These secure

passwords must consist of all of the following: upper and lower case alpha characters,

numbers, and special characters, such as the question mark, found on the keyboard. This

results in 94^12, or in excess of 475 sextillion (4.759203148 × 10^23), possible combinations, or

permutations. A related study on the strength of passwords and cryptography indicated

that an 840 MHz Pentium III can cycle through 250,000 passwords per second in an

offline dictionary attack (Song, Wagner, & Tian, 2001). At that rate, it would take

roughly 60 billion years to exhaust all possible combinations, were there not new

developments that significantly reduce the time. A password-cracking algorithm,

Rainbow Crack, has been developed that pregenerates password-combination tables and

uses a more sophisticated searching algorithm (Bragg, 2004). Additionally, distributed

password-cracking approaches are being utilized that combine idle computer CPU time

from many PCs, enabling an increase in the amount of computer power available for

password cracking. One such implementation, recently revealed to be in use by the

Secret Service, links more than 4,000 of the agency’s employee computers to

attempt more than 1 million password combinations per second (Krebs, 2005).
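
The keyspace arithmetic in this discussion can be reproduced with a brief sketch. The function names and the 365.25-day year are assumptions made for illustration; the alphabet sizes, lengths, and guessing rates are the figures quoted above. The calculation assumes a naive exhaustive search of every combination, so it ignores precomputed tables and the much smaller space of passwords that users actually choose.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length


def keyspace(alphabet_size, length):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


def years_to_exhaust(combinations, guesses_per_second):
    """Time to try every combination once at a constant guessing rate."""
    return combinations / guesses_per_second / SECONDS_PER_YEAR


pin_space = keyspace(10, 4)         # four-digit PIN
password_space = keyspace(94, 12)   # 12 characters from 94 printable symbols

single_cpu_years = years_to_exhaust(password_space, 250_000)     # one PC
distributed_years = years_to_exhaust(password_space, 1_000_000)  # pooled PCs

print(f"PIN combinations:        {pin_space:,}")
print(f"Password combinations:   {password_space:.4e}")
print(f"Years at 250,000/s:      {single_cpu_years:.2e}")
print(f"Years at 1,000,000/s:    {distributed_years:.2e}")
```

Under these assumptions the full 12-character space takes on the order of tens of billions of years to exhaust at either rate, so the practical threat comes from shrinking the search space rather than from raw guessing speed.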

Complicating the ability to measure password effectiveness is that while

computer-generated randomly selected passwords of the 94 printable keyboard characters

are much more effective than user-selected passwords, most users do not select from this

full range of characters when allowed to create their own passwords. Studies have

demonstrated that users tend to favor certain English-language character frequency

distributions, significantly reducing the possible combinations to a more easily and

quickly searched size. Bruce Schneier, author of Applied Cryptography, stated that

English has 1.3 bits of entropy per character; a 30-character English passphrase has as much security as a 40-bit key. Random passwords have less than 4 bits of entropy per character. A 12-character password is more secure than a 40 bit key. (1999, p. 27)

NIST estimates agree with Schneier’s and also calculate that were the 12-character user-

selected password chosen from the full 94 available “complex” character set previously

discussed, rather than being limited to only the 10 numbers plus upper-/lower-case

alphabet, it would result in a password-guessing entropy of 79 bits (Burr et al., 2006).

While it still can be considered “computationally unfeasible” to crack strong passwords

such as those cited within anyone’s given lifespan, tools such as these remind network

administrators that password authentication effectiveness continues to be an evolving

landscape.
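
The entropy figures cited above can be approximated with a short calculation. This sketch assumes that every character of a randomly generated password is drawn independently and uniformly from the alphabet, so its entropy is the length multiplied by the base-2 logarithm of the alphabet size; the function name is illustrative, and the 1.3 bits per character for English text is the figure quoted from Schneier.

```python
import math


def random_password_entropy_bits(alphabet_size, length):
    """Entropy in bits of a uniformly random string over the alphabet."""
    return length * math.log2(alphabet_size)


# 30-character English passphrase at the quoted 1.3 bits per character.
english_passphrase_bits = 30 * 1.3

# 12 characters drawn at random from the 94 printable keyboard characters.
random_12_char_bits = random_password_entropy_bits(94, 12)

# Four-digit PIN drawn at random from the digits 0-9.
pin_bits = random_password_entropy_bits(10, 4)

print(f"30-character English passphrase: {english_passphrase_bits:.1f} bits")
print(f"12 random characters of 94:      {random_12_char_bits:.1f} bits")
print(f"Four-digit random PIN:           {pin_bits:.1f} bits")
```

The passphrase lands near the 40-bit key in Schneier’s comparison, and the 12-character result rounds to the 79 bits cited from the NIST guidance.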

While the susceptibility of PIN/passwords to guessing attacks is quantifiable, the

ability to quantify PIN/password susceptibility to discovery proves more elusive and can

be compared to discoverability vulnerabilities associated with KBA. When used as the

sole means for authentication, distribution of the shared secret completely defeats the

technology’s ability to identify an unauthorized user. Distribution occurs when passwords

are shared among coworkers, friends, or relatives (often for the purposes of accessing

services in the absence of the principal user). Unintentional distribution occurs when the

password is written down near the system and found by someone, or socially engineered

from the user.

Some online government applications, such as government-funded student loans,

issue four-digit PINs to financial aid recipients so that they can check student loan processing

status. The PIN, issued by the Department of Education, is mailed in an “out-of-band”

transaction to the student. This process is vulnerable to the risk of discoverability as a

result of mail theft; however, that is not a risk factor that can be easily quantified as the

legitimate user oftentimes is not even aware that the password has been discovered and,

once compromised, the imposter has the same access privileges as the user. For this

reason, a recently released report from the Federal Deposit Insurance Corporation

concluded that single-factor password-based credentials no longer provide sufficient

security for remote access to critical infrastructure; “two-factor authentication should be

considered as the new security baseline for remote access to computer systems” (2004, p.

36).

Measuring the Effectiveness of Authentication Through Possession

Several types of authentication devices exist that require the identity claimant to

prove possession of the device at the time of authentication. These devices are usually

provided to the user through a controlled issuance process at the time of enrollment. As

with PIN/password systems, this requires a preestablished enrollment or issuance process

and would not be a suitable technology for authenticating citizens to government service

sites unless the identified transaction risks warranted requiring a higher level of identity

assurance.

Such devices include USB tokens containing a CPU and digital signature and

encryption keys that plug into a user’s PC—one-time password devices that are

synchronized with the system to which the user is authenticating—and “smart cards.”

Smart cards are credit-card-sized devices that contain a CPU and memory. Such cards are

also called Personal Identity Verifier (PIV) cards; NIST defined the PIV as

A physical artifact (e.g., identity card, “smart” card) issued to an individual that contains stored identity credentials (e.g., photograph, cryptographic keys, biometric data) so that the claimed identity of the cardholder can be verified against the stored credentials by another person (human readable and verifiable) or an automated process (computer readable and verifiable). (as cited in Chandramouli et al., 2008, p. 1)

The Department of Defense has presently issued more than 4 million PIV cards to

its personnel, and all federal agencies were directed by OMB to issue PIV cards to all

employees, contractors and affiliates (OMB, 2004) by October 2006. These cards will not

only contain biometrics, they will also contain radio-frequency identification chips that

provide for contactless authentication. Due to the variable nature and complexity of the

data stored on the card, conformance testing focuses largely on interoperability and the

ability of the card reader’s application to effectively access the data from the card and

validate that the data are correct and that the card has been issued by an authorized

source. Conformance testing for existing Government Smart Card (GSC) tests

interoperability, as defined in the GSC Interoperability Specification document drafted by

NIST (Schwarzhoff et al., 2003), and validates that the smart-card product conforms to

these specifications. Obviously, since possession of the card, by itself, in no way

guarantees the identity of the person presenting the card, testing for identity

authentication effectiveness is not feasible unless the card possesses a biometric or

requires a PIN/password also be used in two-factor authentication. Bank ATM cards,

requiring that both the card and the PIN be presented, are the most common example of

how two-factor authentication is implemented.

Measuring the Effectiveness of Biometric Authentication

Biometric authentication provides access to the user when a stored template of a

physical characteristic, such as an iris scan, fingerprint, or facial or voice scan, is matched

to the physical characteristic presented by the identity claimant at the time of

authentication. Unlike PIN/password and smart-card authentication technologies, the

effectiveness of biometric technologies, in terms of identification accuracy, can be

measured and the methodology for its measurement is consistent across the biometric

technologies. Although the effectiveness of biometric authentication can be measured,

these technologies are not 100% accurate. There are a total of four possible outcomes at

the time of authentication (Jain, Bolle, & Pankanti, 1999):

Outcome 1: An authorized user is correctly accepted;

Outcome 2: An imposter is correctly rejected;

Outcome 3: An authorized user is incorrectly rejected; and

Outcome 4: An imposter is incorrectly accepted.

The probability of an authorized user being rejected is known as the False

Rejection Rate, or FRR, while the probability of an imposter being

incorrectly authenticated as a valid user is known as the False Acceptance Rate, or FAR. A

false rejection of a valid user, while an inconvenience to the user and a burden on

customer service, is not as serious to application security as a false acceptance of an

imposter, in which the imposter is granted all of the rights and privileges of the

valid user. Jain et al. noted that, in principle, the FAR and the FRR can be used, as well

as an Equal Error Rate (EER), to estimate the identification accuracy of a biometric

system. The EER is the error rate at the operating point where the two probabilities (the FAR and

the FRR) are equal. As an example, were the EER to be 3%, it would

mean that 3% of authorized users were incorrectly denied access while 3% of the identity

imposters were incorrectly authenticated to the system.
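The relationship between the four outcomes and the two error rates can be illustrated with a minimal sketch. The class name, function names, and counts below are hypothetical and are not drawn from Jain et al. or from any tested system; they are chosen only so that both rates come out to the 3% used in the example above.

```python
from dataclasses import dataclass


@dataclass
class OutcomeCounts:
    """Tallies of the four authentication outcomes (hypothetical test data)."""
    genuine_accepted: int   # Outcome 1: authorized user correctly accepted
    imposter_rejected: int  # Outcome 2: imposter correctly rejected
    genuine_rejected: int   # Outcome 3: authorized user incorrectly rejected
    imposter_accepted: int  # Outcome 4: imposter incorrectly accepted


def false_rejection_rate(c: OutcomeCounts) -> float:
    """FRR = rejected genuine attempts / all genuine attempts."""
    return c.genuine_rejected / (c.genuine_accepted + c.genuine_rejected)


def false_acceptance_rate(c: OutcomeCounts) -> float:
    """FAR = accepted imposter attempts / all imposter attempts."""
    return c.imposter_accepted / (c.imposter_rejected + c.imposter_accepted)


# Assumed counts: 1,000 genuine attempts and 1,000 imposter attempts.
counts = OutcomeCounts(genuine_accepted=970, imposter_rejected=970,
                       genuine_rejected=30, imposter_accepted=30)
frr = false_rejection_rate(counts)
far = false_acceptance_rate(counts)
print(f"FRR = {frr:.1%}, FAR = {far:.1%}")
```

Because the two rates are equal for these assumed counts, this operating point would correspond to the equal error rate.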

In actuality, the determination of risk, rather than the EER, plays the largest role

in tuning a biometric system, that is, configuring the system for higher or lower

sensitivity levels that result in higher or lower FRRs or FARs. Obviously, as the system is

tuned to greater sensitivity levels, demanding a positive match on more data points stored

in the template, the system will screen out more imposters, yet will generate more false

rejections of valid users (resulting in higher FRRs and lower FARs). Conversely, if

customer service is of greater significance than authentication assurance, the systems can

be tuned to accept more weakly matched templates, resulting in higher FARs and lower

FRRs. Critical applications would more likely be tuned so that fewer FARs resulted, at

the expense of a greater number of FRRs.
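The tuning trade-off described here can be sketched by sweeping an acceptance threshold over match scores. The scores, the threshold values, and the rule that a score at or above the threshold is accepted are all assumptions made for illustration; real biometric systems define their own scoring and matching logic.

```python
def rates_at_threshold(genuine_scores, imposter_scores, threshold):
    """Return (FAR, FRR) when any score >= threshold is accepted."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    return far, frr


# Hypothetical match scores in [0, 1]; higher means a closer template match.
genuine = [0.91, 0.85, 0.78, 0.66, 0.95, 0.72, 0.88, 0.59, 0.81, 0.93]
imposter = [0.12, 0.35, 0.48, 0.61, 0.22, 0.70, 0.30, 0.55, 0.18, 0.41]

# Raising the threshold corresponds to tuning for greater sensitivity.
sweep = {t: rates_at_threshold(genuine, imposter, t) for t in (0.5, 0.6, 0.7, 0.8)}
for t, (far, frr) in sweep.items():
    print(f"threshold={t:.1f}  FAR={far:.0%}  FRR={frr:.0%}")
```

With these assumed scores, each increase in the threshold lowers the FAR and raises the FRR, which is the behavior a critical application would accept in exchange for fewer false acceptances.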

Jain et al. (1999) stressed the necessity of more descriptive performance metrics

during testing. An instance is cited in which vendor-asserted performance claims of an

FRR of 0.3% and FAR of 0.1% were not substantiated during independent testing

performed by Sandia National Laboratory, which found that the same system had an FRR

of 25% and an unknown FAR (Jain et al.). They also emphasized that in order to obtain

fair and honest test results, enough samples representative of the population of all four

categories (imposters and genuine) should be made available for testing.

Measuring the Effectiveness of KBA

None of the test methods specified previously sufficiently evaluates KBA. In

contrast to PIN/password authentication, KBA relies on the individual accurately

answering several questions about him- or herself that are then correlated to answers

culled from public and private records, as previously discussed. A series of validation

checks is then performed against the data, eliminating users who provide inconsistent

data (i.e., SSNs do not match the name in the file or are found on the Social Security

Death Index, addresses do not match, etc.) or in which the user provides incorrect

answers to the questions.

The customization that KBA companies provide with the number and types of

questions asked, as well as the thresholds that can be set for acceptable authentication

(based on the number of correctly answered questions), makes quantifying the

effectiveness of KBA extremely difficult. This has resulted in the examination of the

feasibility of quantifying KBA by NIST, which has been examining the technology for its

suitability with e-government services. At their 2004 symposium, several methodologies

were proposed to address the issue.

Evaluating KBA Using Guessability

Using a model similar to that referenced in the NIST guidelines, cryptography

researcher Dr. Santosh Chokhani (2003) proposed a methodology in which effectiveness

is based on how susceptible an individual’s identity attribute is to being guessed.

Attributes are categorized as being either static or temporal. Examples of static identifiers

include birth date, while temporal identifiers are more dynamic and include bank account

balances and payroll amounts. Chokhani contended that the extent to which an identifier

is susceptible to guessing is partially dependent on the individual doing the guessing.

Someone close to the identity claimant, or with intimate knowledge of the claimant (for

example, an estranged spouse), may have personal knowledge sufficient to accurately

answer the questions, allowing him or her to effectively masquerade as the authorized

individual. Chokhani also provided both formula and tables to calculate the probability of

compromising KBA based on the claimant type and specific identifier guessability.

Chokhani provided a matrix used for the calculation of guessability metrics wherein the

questions asked are based on assumptions and the likelihood of the answers being

guessed by an individual without any prior knowledge. Date of birth, for instance, is

given a 1 in 18,250, or approximately 1 in 2^14, probability that someone other than a family member,

employer, friend, or professional acquaintance might be able to guess it given Chokhani’s

assumption that someone “can be assumed to be between 20 and 70 years of age” (p. 2).
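The date-of-birth figure can be reproduced with a short calculation. The sketch below assumes a 365-day year and a uniform distribution of birth dates across the 50-year window; those simplifications, and the variable names, are illustrative and are not taken from Chokhani's matrix.

```python
import math

# The claimant is assumed to be between 20 and 70 years of age.
MIN_AGE, MAX_AGE = 20, 70
DAYS_PER_YEAR = 365  # simplifying assumption; leap days are ignored

candidate_birth_dates = (MAX_AGE - MIN_AGE) * DAYS_PER_YEAR
p_single_guess = 1 / candidate_birth_dates
equivalent_bits = math.log2(candidate_birth_dates)

print(f"candidate dates: {candidate_birth_dates:,}")
print(f"probability of one blind guess: {p_single_guess:.6f}")
print(f"equivalent guessing space: about {equivalent_bits:.1f} bits")
```

Under these assumptions the guessing space is a little over 14 bits, which holds only for a guesser with no prior knowledge of the claimant.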

Based on application identity assurance requirements, Chokhani (2003)

recommended specific mixes of temporal and static identifiers (temporal identifiers, such

as bank balances, being obviously more difficult for someone, even the identity claimant,

to accurately provide). Chokhani later mentioned that the claimant’s desire to

masquerade—as well as the valid user’s personality-based factors and network and size

of personal and professional relationship—must be considered; however, he did not

factor these considerations into his formula or proposed metrics.

While Chokhani’s recommendations, based more on common sense than on his

probability metrics, have considerable value to identifier selection, the guessability

approach can only partially gauge KBA effectiveness as it does not factor in

discoverability. Largely drawn from public records data sources, attributes used in KBA

are susceptible to data-mining techniques or targeted attacks whereby an identity thief

builds a digital dossier on the victim (Solove, 2004) and is then able to successfully

authenticate as that user with the information obtained from the Web.

Evaluating KBA Through False Acceptance and Rejection Rates

Other approaches to evaluating KBA attempt to treat the service as a form of

biometric authentication technology, attempting to define FAR and FRR, and even

providing the ability to tune service to acceptable thresholds (Cartwright, 2004;

ChoicePoint, 2004). Experian and ChoicePoint use similar models in that companies can

determine the type and number of initial personal data input to be validated (name, SSN,

address, date of birth) and, based on a predetermined score “cut-off,” or threshold, go on

to be asked more challenging questions or be denied access and referred to a customer

service desk for exception processing. In their example provided at the symposium,

Experian suggested that 90.10% of accounts pass the initial score, while 9.90% are

referred for exception processing. Experian also stated pass rates of 90.24% for “good

accounts,” resulting in an FRR of 9.76%, a rather high FRR by most biometric standards.

Dr. George Datesman, consultant for Mitretek, also proposed a model similar to

Experian’s in which the goal of KBA is error-free authentication, or “100% assurance

that a user is who he/she claims to be” (2004, p. 4). With Datesman’s model, errors are

classified similarly to biometric errors into two types: type I errors that identify the false

rejection of a claimed identity (FRR) and type II errors that identify false acceptance of a

claimed identity (FAR). Datesman discussed the necessity of standardization of error

measurement techniques as opposed to identity authentication methods as well as

establishing minimum acceptable error rates and confidence intervals at each assurance

level. Datesman’s discussion was, however, absent of any guidance on how these

measurements can be determined.

Herein lies the inherent problem with treating KBA as a biometric technology in

measuring effectiveness. While companies will almost assuredly get feedback from angry

customers denied access to services from which they can capture FRR, the FRR can also

be predicted through formal test procedures in which a file provided to the service is

“seeded” with valid and nonvalid identities. Assuming imposters and genuine identities

were made available for testing, the sample would contain a selection of valid users

presenting valid data that should be accurately authenticated consistent with Outcome 1

(Jain et al., 1999). Those valid users who did not authenticate correctly would provide an

estimation of the FRR (Outcome 3). Measuring Outcome 2 (an imposter is

correctly rejected) and Outcome 4 (an imposter is incorrectly accepted) proves to be more

difficult. Outcome 4, the basis for determining FAR, is critical to information security,

yet is impossible to calculate because, as with PIN/password technologies, testing would

require that enough valid personal knowledge be provided to spoof a valid user on the

system, unless an identity is simply crafted out of vapor. To this extent, while FRRs

should be calculated to determine burden on users and customer service staff, attempting

to measure KBA effectiveness using biometric testing methods is not comprehensive

enough to satisfy most information security or application assurance requirements.

Evaluating KBA Through Other Methods

Some KBA service vendors advertise the use of identity scoring models

(Cartwright, 2004) that, based on the quality and quantity of data and data source, can

provide a probability of identity confidence (or identity score). These models require that

the data owners determine the thresholds that are acceptable to meet application

assurance requirements. Unfortunately, KBA vendors choose not to share these

algorithms, and so, as with proprietary cryptographic algorithms, they must remain

suspect in their ability to accurately perform as advertised unless mathematicians or

statisticians can test their accuracy rates. Since the models are not shared, it is not known

to what extent the likelihood of discoverability is addressed.

In a recent academic project resulting from the KBA symposium, researchers at

the University of Wisconsin-Madison (Chen, 2007) proposed a KBA framework based on

Bayesian networks, considering causal and probabilistic relationships between identity

attributes. While the approach had definite merit in its potential ability to determine

outcomes and adapt responses accordingly, it, too, only minimally considers the obvious

vulnerability of prior discovery.

Despite having been studied by leading authentication technology researchers, KBA

remains an ethereal technology in its ability to be evaluated for its identity authentication

effectiveness. While both KBA and PIN/passwords are susceptible to guessability

attacks, the use of PIN/password technologies proves to be less susceptible to

discoverability than KBA, as the likelihood is slim that the PIN/password authentication

data are published in public records databases on the Web as are most KBA identifiers.

Furthermore, the attempts to quantify KBA’s effectiveness in a manner similar to

biometric technologies are inadequate in that only FRRs can be estimated. Estimates on

the number of imposters possessing sufficient information to masquerade as legitimate

users are impossible to determine using this same approach.

In summary, a review of the literature found one study released by the Council of

Better Business Bureaus and Javelin Strategy and Research that supported the null

hypothesis that the availability of personal identity information cannot be correlated to

identity theft; however, the Javelin study’s narrow definition of identity fraud necessitates

a new study. Additionally, concerns expressed by government agencies and scholars

linking the misuse of identity information to the proliferation of personal data provide

support for the hypothesis. The literature review revealed that existing measurement

protocols are incomplete in evaluating the effectiveness of KBA as they do not consider

the discoverability of these attributes. Because KBA services have already been implemented

for use with online applications providing access to identity breeder documents, such as

birth certificates, there is an immediate need for a study to evaluate the extent of the

discoverability of attributes used in these services.

CHAPTER 3. METHODOLOGY

Introduction

The structure of content on the Web has been compared to a library whose

“collection is distributed haphazardly on the shelves, with no underlying classification

scheme, bibliographic control, or accession catalog, and a substantial portion of the

material is incomplete, transitory, or simply disappears from the shelves after a short

time” (O’Neill, McClain, & Lavoie, 1998, p. 1). This haphazard collection of digital

information continues to proliferate at a staggering rate. Between 2000 and 2003,

“surface” Web content (Web content indexed by search engines such as Google) tripled

from 50 terabytes to an estimated 167 terabytes in size (Lyman & Varian, 2003). “Deep”

Web content, consisting primarily of databases and other media types that are not

routinely crawled and indexed by search engines, is estimated to be 500 times the size of

the surface Web. This deep Web content resides on approximately 200,000 Web sites—

95% of them publicly accessible (Bergman, 2001). Developing a research framework to

allow the enumeration of personal identifiers located on a targeted subset of this deep

Web can prove to be challenging for researchers in terms of selecting an appropriate

research methodology and sampling plan. This chapter addresses the research approach,

population, data collection, and analysis techniques that were used to perform this study.

Research Approach

A research methodology is a systematic process that moves the researcher from

inquiry and hypotheses to data collection and analysis. The type of data collected is

intertwined with the selected research methodology and drives the manner in which the

data are collected (Myers, 1997). Leedy and Ormrod (2001) agreed with Myers, stating

that the type of data being analyzed may lead the researcher to a specific research

approach, suggesting that a quantitative research methodology is appropriate in instances

where there exists an objective reality that can be measured and specific methods for

measuring variables are defined and collected from a sample of data that can be

converted to numerical representation.

Quantitative research characterizes the problem under study in terms of how

many or how often and results in the numeric analysis of data that prove, or disprove, a

researcher’s assumptions. While data can be gathered through a variety of methods,

including the use of surveys, results are scored so that they yield statistically measurable

results. An example, as applied to the study of Web content (Bergman, 2001), is that an

analysis of scored survey results can provide a measurement of user concern regarding

identity theft resulting from publicly accessible online databases. Quantitative research,

in this case, would not have discerned why users were concerned about this phenomenon

unless this had been previously hypothesized by the researcher prior to the onset of data

collection. In this respect, quantitative research design is considered a fixed approach in

that the data collection process must be constructed in advance to specifically address the

researcher’s hypotheses. The ability to perform comparisons and statistical analysis on

the collected data, however, is a decided advantage to quantitative approaches when

dealing with large amounts of empirical data.

A content analysis of the personal identifiers stored in online deep Web databases,

as discussed previously, can be used within the constructs of either a quantitative or

qualitative approach, or both, using a mixed research approach. Kaid and Wadsworth

(1989) deferred to Berelson’s definition of content analysis as being the most widely

accepted by researchers. By definition, a content analysis is a method in which content

can be analyzed in a systematic, objective, and quantitative manner for the purpose of

measuring variables. Berelson (1952) characterized content analysis as systematic in that

in order to reduce generalization errors in the content being analyzed, uniform coding and

analysis procedures are defined in advance of the data gathering process. It must be

objective in that researcher bias must be absent from the study or from the sample

selection process. Finally, it must quantitatively and accurately represent the body of

material being examined, leading independent researchers to the same conclusions. This

detailed and systematic methodology, as well as the frequency tabulation of

authentication attributes found in the documents, provides for the quantitative analysis of

the information contained in the examined documents (Robson, 2002).

Typically applied to human communication such as newspapers, video, books,

television, art, music, or transcripts to identify patterns, themes, or biases (Leedy &

Ormrod, 2001), content analysis can be used for making numerical comparisons among

and within documents, as long as the information is available to be reanalyzed for

reliability checks. This enables researchers to sift through large volumes of data with

relative ease in a systematic fashion (GAO, 1996). These characteristics indicate that

content analysis is a suitable tool to quantitatively examine Web content by statistically

measuring the frequency and location with which personal identity attributes are found.

Some considerations, however, must be addressed. While performing a content

analysis allows a researcher to study the raw data and arrive at conclusions relating to a

hypothesis, can content analysis, by itself, be used to lead the researcher to a holistic

analysis of the data that it categorizes? As cited in Kaid and Wadsworth (1989), some

researchers such as Krippendorff view content analysis as little more than a data

gathering tool that enables inferences to be made from the data to their context while

others, such as Holsti, consider it to be much more powerful. Kaid and Wadsworth

suggested that quantitative examinations of the content found on the Web are, in

themselves, without much meaning unless the researcher can make comparisons and

draw relationships from the data. It would be of little research value, therefore, to simply

relate the frequency with which personal identity attributes can be found in public records

databases. A relationship linking identity theft rates to the frequency with which these

data are discoverable, however, serves the goal of achieving a more holistic analysis of

this Web content.

Content analysis categorizations can also present problems if the category

definitions are faulty or found to be nonmutually exclusive or nonexhaustive (Stempel &

Westley, 1981). Researchers concur that reliability in content analysis has been achieved

when there is repeatability in recoding the same data over time and researchers

performing the same research, using the same methodology, derive the same results

(GAO, 1996). Reliability, in this context, is dependent on information availability for

reexamination. Koehler (2004) performed a longitudinal study of Web pages over a

period of 6 years that established the ephemeral nature of the Web, concluding that the

Web is not particularly stable for publication of long-term information and the

maintenance of individual objects or items. Koehler did differentiate, however, between

the materials published to the Web and material “for which the Web serves as a conduit

for access” (Conclusions section, ¶ 1)—citing the longer half-lives for online databases,

as compared with the much shorter half-lives of other published Web documents. It

would, therefore, be important for a researcher to note any impact to long-term reliability

when engaging in a content analysis of Web-based content. For these reasons, a

quantitative approach employing content analysis was deemed by the researcher to be the

most appropriate research approach to the problem under study.

Sampling Design

Personally identifiable data exist in large volumes within online databases and

Web pages. Sampling is necessary as the body of material is too extensive to be analyzed

in its entirety (GAO, 1997). Selecting an adequate and representative sample of the

population of interest from the millions of available Web pages can be a

challenging task for a researcher. In the article “A Methodology for Sampling the World

Wide Web,” researchers O’Neill et al. explained, “Compiling a random sample of Web

sites is not a straightforward exercise, largely because enumeration of all Web sites is not

available” (1997, Sampling the Web: A Basic Strategy section, ¶ 6). Robson (2002)

concurred, stating that it is “usually necessary to reduce your task to manageable

dimensions by sampling from the population of interest” (p. 353).

Sampling in content analysis is performed similarly to survey sampling, with care

taken to ensure that the sample is representative of its population, with each unit having

an equal chance of being represented in the sample (Stempel & Westley, 1981). Stempel

also suggested that there are additional considerations for sampling in content analysis,

such as document availability, that may lead the researcher to use stratified or purposive

sampling. Robson (2002) substantiated the unsuitability of random sampling if a full list

of the population is unattainable.

Researchers Liddle, Yau, and Embley (2001), in their research efforts to

categorize deep Web database content through structured queries, also found that

randomly selected fields provided for uneven coverage in their collection process, so they

proposed a stratified sampling method to extract data from the deep Web in order to

ensure better coverage and an adequate number of representative sample fields from their

queries.

To facilitate a comparison to FTC identity theft rates, which are categorized by

state, a stratified purposive sampling approach was employed to ensure that all states had

an equal chance for representation in the study. Stratified sampling divides the population

into separate groups (referred to as “strata”) based on a shared characteristic, such as size,

gender, educational level, income, or, in this case, geographic location. Purposive

sampling, as indicated earlier by Stempel (1981), is useful for content analysis when

specific documents or records need to be selected (i.e., San Bernardino County property

records databases were examined for personal identity information) and is indicated when

resources or records availability are limited, yet require justification of the sample

selection process. As cited in Tashakkori and Teddlie (2003), Kemper, Stringfield, and

Teddlie offer that purposive sampling provides the ability to focus the sample on

information-rich cases and minimize the sample size in a nonrandom method. They

further assert that while particularly useful in qualitative approaches, purposive sampling

can be used with quantitative approaches as well. While a proportional probability

stratified sampling methodology would also have been suitable due to the disparate

population of records (one county government may not have a population of citizens or

records equal to another government), a lack of resources limited its usefulness in this

study.

SearchSystems.net hosts the largest publicly accessible directory of public records

databases. As such, it was considered to be the basis of the target population. While many

links at the site are already categorized by state, other links are categorized by the type of

record. Consistent with a stratified purposive sampling approach, data were stratified by

state and then by the record type using the procedures described in chapter 4 and the

codebook in Appendix A. An outline is presented, as follows, of the steps that were

undertaken for the sample in this study:

1. The sample frame was established as a listing of public records accessible within each state, as provided by Search Systems. Search Systems bills itself as the largest repository of public records databases, aggregating links to more than 40,000 accessible public records databases. Many (but not all) of these records contain PII. A paid subscription that allowed access to the aggregator’s links was procured at a fee of $4.95 per month. Premium records costing additional fees, such as bankruptcy records, were not included in the population from which the sample was drawn. Search Systems does not include U.S. territories and possessions, with the exception of Washington, DC, and these were excluded as they were not a part of the FTC’s datasets for later comparison purposes. A manual inspection of the records descriptions was performed to eliminate databases serving only historical purposes (those holding records prior to 1930, after which date even the U.S. Census publishes information contained in household census records). Only records databases providing PII on living individuals that were accessible freely, and without requiring registration, were considered valid for the purposes of this study.

2. Using a purposive approach, categories of public records that are commonly used for identification and authentication purposes were defined and recategorized after the results of a pilot test to ensure mutual exclusivity so that a single record could not be counted multiple times. Appendix A contains specific search queries that were used to extract the public records links from Search Systems.
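The two steps above can be sketched as a filter followed by a two-level grouping. The directory entries, field names, and record-type labels below are hypothetical stand-ins; the actual search queries and categories are those given in Appendix A, not the ones shown here.

```python
from collections import defaultdict

# Hypothetical directory entries; names and field values are illustrative only.
directory = [
    {"name": "County A property records", "state": "CA", "record_type": "property",
     "premium": False, "registration_required": False, "historical_only": False},
    {"name": "District B bankruptcy filings", "state": "CA", "record_type": "bankruptcy",
     "premium": True, "registration_required": False, "historical_only": False},
    {"name": "County C court index", "state": "VA", "record_type": "court",
     "premium": False, "registration_required": False, "historical_only": False},
    {"name": "County D deeds before 1930", "state": "VA", "record_type": "property",
     "premium": False, "registration_required": False, "historical_only": True},
    {"name": "City E court docket", "state": "VA", "record_type": "court",
     "premium": False, "registration_required": True, "historical_only": False},
]


def is_eligible(entry):
    """Step 1 rules: no premium fee, no registration, not purely historical."""
    return not (entry["premium"] or entry["registration_required"]
                or entry["historical_only"])


def stratify(entries, record_types):
    """Step 2: group eligible entries by state, then by purposively chosen type."""
    strata = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        if is_eligible(entry) and entry["record_type"] in record_types:
            strata[entry["state"]][entry["record_type"]].append(entry["name"])
    return strata


strata = stratify(directory, record_types={"property", "court"})
for state in sorted(strata):
    for record_type, names in sorted(strata[state].items()):
        print(state, record_type, names)
```

Keeping each database in exactly one state and one record-type bucket mirrors the mutual-exclusivity requirement, so that a single record cannot be counted more than once.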

Robson (2002) discussed the difficulty with prespecifying the number of

observations required in a flexible design study, stating that it is appropriate to continue

until saturation is reached (an apparently subjective goal). Larger sample sizes result in

fewer generalization errors (Robson). This sampling approach provided the ability to

enumerate a large, disparate grouping of data that facilitated later correlation to identity

theft rate data. It also provided a repeatable methodology that affords future researchers

with an equal chance for all databases to be represented within the sample, thereby

eliminating researcher bias.

Data Collection

Data collection within the inspected records was performed using a

documentation content analysis of personal attribute categories contained in the Internet

record. Identity attributes are defined for this study as information that identifies an

individual or links to other information that would be used to identify an individual.

Microsoft Excel was used to store the data to facilitate their exportation into SPSS for

later statistical analysis. To avoid violating privacy rights, personal data on individuals

were not collected for the purposes of this study. Instead, publicly accessible databases

were enumerated and categorized to determine the amount and type of personal data at

each discovered site that are commonly used with KBA systems. Chapter 4 further

discusses the data collection process.

Measurement Strategy

Content analysis provides a systematic, replicable technique for categorizing data

based on explicit coding rules (Berelson, 1952). Consistent with content analysis, a priori

content categories and recording units were established and operationalized prior to

collection, coding, and analysis. Colleagues provided a review of the proposed a priori

categories, and revisions were made to ensure mutual exclusivity and exhaustiveness

(Weber, 1990) and that the categories were saturated (Leedy & Ormrod, 2001). Coding

began only after the final units of data collection were defined, tested, and refined.

Collected data were categorized according to the defined a priori categorization of

content. From the sample of sites, the frequencies with which the defined units of data (authenticators previously determined to be common to the majority of KBA service providers) occurred within a record type in each state were counted. Acting on the assumption that most authentication systems weight personal attributes as equal in importance (e.g., knowing a house’s square footage is as important as knowing the names of its previous owners), the data collected from the sites were measured on an ordinal scale and ranked by the frequency with which they were found at different sites. Additionally, the frequency with which these attributes (units) occurred jointly with one another (collocations) was measured.
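The counting described above can be illustrated with a minimal sketch. It assumes each examined site has been coded as a set of attribute names; the sample records, function names, and variable names are hypothetical and are not drawn from the study data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical coded records: each entry is the set of identity attributes
# observed at one public records site (illustrative values, not study data).
coded_sites = [
    {"name", "home address", "property value"},
    {"name", "date of birth", "physical description"},
    {"name", "home address"},
]


def attribute_frequencies(sites):
    """Count how many sites publish each attribute."""
    counts = Counter()
    for attributes in sites:
        counts.update(attributes)
    return counts


def collocation_frequencies(sites):
    """Count how many sites publish each unordered pair of attributes together."""
    pairs = Counter()
    for attributes in sites:
        # Sorting gives every pair a single canonical ordering.
        pairs.update(combinations(sorted(attributes), 2))
    return pairs


frequencies = attribute_frequencies(coded_sites)
collocations = collocation_frequencies(coded_sites)

# Ordinal ranking: attributes ordered from most to least frequently found.
ranking = [attribute for attribute, _ in frequencies.most_common()]
```

Pairs are used here as the simplest form of collocation; larger combinations could be counted the same way by changing the combination size.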

Data Analysis Strategy

The statistical software program SPSS was used to analyze the data collected.

Based on the frequency with which combinations of identity attributes were found at

these sites, a “discoverability metric,” or index, was derived from the analysis that


allowed these attributes to be comparatively ranked by availability. Data were then analyzed to determine the frequency, distribution, and mode of the authenticators within the data sets. Using Spearman’s rho, the relative frequencies of the authenticators by category (location and type of site) were assessed for statistical significance. Lastly, data gathered in the first part of this study were compared with statistics reported by the FTC for identity theft incidence by state as a simple correlational study.

Data Display

Personal attributes from frequency tables were summarized and displayed using bar charts and tables because, as Robson (2002) explained, these are the preferred methods for displaying data associated with frequency tables and are quickly and easily understood by most readers. Data are also presented in a priori tables and graphs.

In conclusion, while a qualitative research approach affords flexibility in an

exploratory study, a quantitative analysis is more appropriate for an enumeration of the

content within online public databases. Research has further demonstrated that content

analysis is an appropriate research methodology with either quantitative or qualitative

research and is suitable for the data collection of Web content using a purposive stratified

sampling approach that ensures an adequate and reliable sample.


CHAPTER 4. RESULTS

Introduction

There were two main objectives to the current study. The first was to examine the

comparative discoverability of identity attributes in online public records, associating a

discoverability index, or factor, to individual and linked identity attributes where specific

combinations of identity attributes occur. The second objective was to determine if

correlations exist between the frequencies with which identity attributes can be found in

public records and instances of identity theft.

To meet the goals of this study, a content analysis was conducted, after which

descriptive statistics for each of the identity attributes, as well as the number of identity

attributes per type of record searched, were generated. Nonparametric correlations were

then conducted to assess whether there was an association between the identity attributes

and identity theft rankings.

This chapter presents the results of the data collection, categorization, and content

analysis relative to this study. The first section of the chapter discusses the data collection

and preparation processes. Following that is a description of the data sampled, followed

by findings from an analysis of the data that address the research questions presented in

chapter 3. Conclusions and recommendations are presented in chapter 5.

Data Collection

As of May 1, 2008, the date on which the data collection process was completed, Search Systems had registered more than 41,754 public records sites. The number of sites


accessible from the aggregator increases daily, so the process of enumeration is analogous to shooting at a moving target. Initially, Xenu, a software tool with URL-harvesting capabilities, was used to retrieve the public records links for the purpose of generating the sample. The software proved to be of little use because Search Systems obscures the links through redirection. An attempt was then made to enumerate the total number of public records sites for each state using Search Systems’ “Search United States Public Records by State.” This initial approach was discarded because many of the links are informational ones, such as school performance reports, that would not provide information relevant to this study. The approach was modified to limit the data collection to 12 categories of public records, eliminating business-related records (e.g., professional licenses and corporate filings) and informational sites. Public records

categories used by knowledge brokers to authenticate online identity were selected based

on a review of public records sources at LexisNexis’s site. Categorical searches in Search

Systems were performed on selected public records categories and the “Advanced

Search” function was also used to find records for which no category had been

predefined.

Data Coding and Categorization

Knowledge gained through the investigator’s professional experiences testing a

number of KBA systems served as the foundation for the selection of common identity

attributes that serve as identity authenticators. Selected identity attributes included name,

address, phone number, date of birth, marriage, place of birth, and property-specific


information such as square footage, property value, mortgage holder, and improvements

made on the property. A description of these attributes appears in Appendix A.

A test of the data collection process was performed to ensure that the a priori

categories contained collected data that were mutually exclusive and exhaustive. Two

colleagues were provided a data collection form and, after brief training on collection

procedures, were asked to retrieve data from Alabama. When the collected data were reviewed, it became apparent that several categories needed to be combined because, in many cases, court cases contained probate records and other recorded documents that otherwise risked being recorded multiple times. As a result, five categories were eliminated to reduce record replication. The remaining seven categories were accident reports, birth certificates, court records, inmate/arrest records, marriage records, property records/deeds, and voter registrations. Appendix A provides detailed enumeration criteria. A subsequent test using Arkansas displayed no evidence of overlap in the documents in the sample, satisfying the requirement for exclusivity.

Consistent with a purposive sampling strategy that would ensure as accurate a count as possible, the description of each record was reviewed, and records that were informational only and did not contain personal data were excluded. Only government-provided or -contracted, freely accessible sites were of interest, as it was assumed that a registration process would deter most identity thieves from accessing data on sites that required one.

Identity attributes contained in examined records were noted on the data

collection forms. Identity attributes were aggregated as cumulative totals within each

category of public record by state. As an example, if one marriage record in Ohio

displayed only the name and address of the bride and groom, and another site within the


state included date of birth, the data were coded to show that Ohio marriage records

contained name, address, and date of birth.
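The aggregation rule in the Ohio example amounts to taking the union of attribute types across all sites within a state and record category. A minimal sketch follows, assuming site-level observations are available as tuples; the data structure and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical site-level observations mirroring the Ohio example:
# (state, record category, attributes displayed at that site).
observations = [
    ("Ohio", "marriage records", {"name", "address"}),
    ("Ohio", "marriage records", {"name", "date of birth"}),
]


def aggregate_by_state_and_category(rows):
    """Union the attribute types seen across all sites in a state and category."""
    coded = defaultdict(set)
    for state, category, attributes in rows:
        coded[(state, category)] |= attributes
    return dict(coded)


coded = aggregate_by_state_and_category(observations)
ohio_marriage = sorted(coded[("Ohio", "marriage records")])
```

Because a set union is used, an attribute that appears at several sites in the same state and category is coded only once.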

Coded data were inspected and sorted in Excel to identify coding errors. The

cleaned data were imported into SPSS 14.0 for analysis. Attribute totals, by state, were

transformed into a new ordinal variable to facilitate additional correlation to ID theft rates

as reported by the FTC’s (2007) Consumer Fraud and Identity Theft Complaint Data:

January 2007 through December 2007 report.

Descriptive Statistics

A total of 8,659 sites were identified as containing data of interest to the study. Sites that required separate payment or registration, or that were unavailable at the time of review, could not be enumerated. This excluded 2,061 sites, reducing the number of sites examined for identity attributes to 6,598. The range, mean, and standard deviation by record category across the 50 states are shown in Table 1, and a graph depicting the total number of sites per record category is shown in Figure 2.

The findings in Table 1 reveal that property records sites were the most numerous

category of records per state (M = 83.46), with the number of property records sites found

within each state ranging from as few as 1 site located in Wyoming to a total of 409 sites

accessible in Texas. Arrest records were the next most numerous type of public record

per state (M = 28.96), followed by court records sites (M = 15.02). It should be noted that

while there were more sites containing court records (N = 1,612) than arrest records (N =

1,562), 861 court records sites required registration or a paid subscription, compared to

only 114 arrest records sites that could not be freely and anonymously enumerated.
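The arithmetic behind these figures can be sketched as follows. The counts are those reported in the text; the assumption, consistent with the totals shown in Figure 2, is that accessible sites equal identified sites minus those that could not be freely enumerated, and that each mean is the accessible total divided by 50 states. Function names are illustrative.

```python
NUM_STATES = 50

# Identified and excluded site counts reported in the text for two categories.
identified = {"court records": 1612, "arrest records": 1562}
excluded = {"court records": 861, "arrest records": 114}


def accessible_sites(identified_counts, excluded_counts):
    """Sites that could be freely and anonymously enumerated, per category."""
    return {
        category: identified_counts[category] - excluded_counts[category]
        for category in identified_counts
    }


def mean_per_state(totals, num_states=NUM_STATES):
    """Mean number of accessible sites per state, per category."""
    return {category: total / num_states for category, total in totals.items()}


accessible = accessible_sites(identified, excluded)
means = mean_per_state(accessible)
```

This shows why arrest records outrank court records in accessible sites per state even though more court records sites were identified overall.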


Table 1. Descriptive Statistics of Public Records Sites Across the 50 States

Variable               Range     M        SD
Accident reports       0–8       .64      1.88
Birth certificates     0–7       .46      1.34
Court records          0–192     15.02    32.68
Arrest records         0–225     29.00    44.72
Marriage records       0–23      2.50     4.46
Property records       1–409     83.46    92.27
Voter registrations    0–13      .92      2.23

[Bar chart: Total Number of Accessible Sites Per Category. Accident reports, 32; birth certificates, 23; court records, 751; arrest records, 1,448; marriage records, 125; property records, 4,173; voter registrations, 46.]

Figure 2. Total number of public records sites by record category.

Accident reports were among the least common types of records located (M = .64); however, these reports often contained a number of identity attributes unique to this report type (e.g., vehicle identification number [VIN], driver’s license number, home address). Attribute uniqueness was not a variable examined separately in this study.

Descriptive Statistics of Identity Attribute Types by Category

Accessible sites within each category of public records (e.g., accident reports) were reviewed to determine the types of identity attributes published within each state. The range, mean, and standard deviation for the number of identity attribute types for each record category are shown in Table 2.

Table 2. Descriptive Statistics of Identity Attribute Types by Category

Variable               Range    M       SD
Accident reports       0–6      .74     1.76
Birth certificates     0–5      .62     1.51
Court records          0–6      2.12    1.51
Arrest records         0–5      2.88    1.19
Marriage records       0–6      1.02    1.45
Property records       5–6      5.02    .14
Voter registrations    0–4      .72     1.20


The findings in Table 2 indicate that property records yielded the largest number

of different identity attributes (M = 5.02), followed by arrest records (M = 2.88), and then

by court records (M = 2.12). All other record categories yielded about 1 identity attribute.

Frequency Total for Each Attribute

The frequencies and percentages for the identity attribute types are presented in Table 3. As the table shows, the most frequently published attribute was an individual’s name (30%), followed by an individual’s home address (17%), and then by an individual’s date of birth (13.5%). An individual’s physical description was published only 8.4% of the time. An individual homeowner’s property value, property tax, and number of square feet were each accessible 7.7% of the time. All other identity attributes were not frequently found online. No SSNs were present in the examined records.

Frequency of Attributes per State

The total number of identity attribute types published varied by state, ranging

from a minimum of 5 attributes (Wyoming) to a maximum of 26 (Ohio). The range,

mean, and mode for each of the attributes are detailed in Table 4. The attribute with the

highest mode was an individual’s name (mode = 4), followed by an individual’s date of

birth and home address (mode = 2). The mode for an individual’s physical description,

home’s property value, property tax, and number of square feet was only 1. The mode for

all the other attributes was zero.


Table 3. Frequencies and Percentages of Identity Attributes

Variable                   F      %
Name                       196    30.0
Date of birth              88     13.5
Birth year                 14     2.1
Mother’s maiden name       9      1.4
Place of birth             10     1.5
Home address               111    17.0
SSN                        0      0.0
Last four digits of SSN    1      0.2
Home phone number          3      0.5
Driver’s license number    7      1.1
VIN                        8      1.2
Property value             50     7.7
Property tax               50     7.7
Number of square feet      50     7.7
Physical description       55     8.4


Table 4. Descriptive Statistics of Identity Attributes Within Each of the 50 States

Variable                   Range    M       Mode
Name                       1–7      3.92    4
Date of birth              0–6      1.76    2
Birth year                 0–3      .28     0
Mother’s maiden name       0–1      .18     0
Place of birth             0–2      .20     0
Home address               1–5      2.22    2
SSN                        0–1      .02     0
Last four digits of SSN    0–1      .02     0
Home phone number          0–1      .06     0
Driver’s license number    0–1      .14     0
VIN                        0–1      .16     0
Property value             1–1      1.00    1
Property tax               1–1      1.00    1
Number of square feet      1–1      1.00    1
Physical description       0–3      1.10    1

Tests of Research Questions

The purpose of this study was twofold. First, it explored the extent to which

identity attributes used to authenticate individuals in online transactions using


knowledge-based authentication services are discoverable in public records databases.

Second, the study examined these frequencies to determine whether they correlate with identity theft rates. Consistent with this, the following research questions are addressed in this section:

1. What is the comparative discoverability of identity attributes in online public records?

2. Is there an association between the frequencies with which identity attributes can be found in public records and identity theft?

RQ1: Discoverability Metrics of Identity Attribute: What is the Comparative Frequency, or Discoverability, of Personal Identity Attributes in Public Records Databases?

To meet this study objective, the data were analyzed to determine the frequency with which identity attributes are discoverable in public records. The findings were useful in establishing how accessible specific identity attributes are to identity claimants during online transactions. This information will assist government agencies and commercial services relying on knowledge-based services in selecting specific identity attributes relative to application risk.

The comparative discoverability metrics, or indices, for each identity attribute and for groups of attributes are described in this section. Initially, regression procedures were conducted with the intent to use the regression coefficient for each attribute as the index of discoverability. This approach was discarded because the data were not normally distributed and remained markedly skewed even after transformation, which would have negatively affected reliability. Instead, the discoverability index for each attribute was determined by dividing the frequency of that attribute by the total number of attributes (that is, the summed frequency of all attributes). The index for each group of attributes was the frequency for the whole group of attributes divided by the total number of attributes. Identity attributes were first grouped together using an exploratory factor analysis (EFA) procedure, which is appropriate for use where no hypothesis is present.
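The index calculation can be sketched as follows, using the frequencies reported in Table 3 and the personal information grouping shown in Table 6. The denominator is assumed to be the sum of all attribute frequencies, which is consistent with the percentages in Table 3; the function name is illustrative. Because the published indices are rounded, values computed this way may differ from the tables in the last digit.

```python
# Attribute frequencies as reported in Table 3.
frequencies = {
    "Name": 196,
    "Date of birth": 88,
    "Birth year": 14,
    "Mother's maiden name": 9,
    "Place of birth": 10,
    "Home address": 111,
    "SSN": 0,
    "Last four digits of SSN": 1,
    "Home phone number": 3,
    "Driver's license number": 7,
    "VIN": 8,
    "Property value": 50,
    "Property tax": 50,
    "Number of square feet": 50,
    "Physical description": 55,
}


def discoverability_index(freqs, attributes):
    """Frequency of the given attribute(s) divided by the total of all frequencies.

    Assumption: the "total number of attributes" is the sum of the
    frequencies of every attribute in the table.
    """
    total = sum(freqs.values())
    return sum(freqs[attribute] for attribute in attributes) / total


# Index for a single attribute.
name_index = discoverability_index(frequencies, ["Name"])

# Index for a group of attributes (the personal information component).
personal_index = discoverability_index(
    frequencies, ["Name", "Date of birth", "Physical description"]
)
```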

Principal components analysis (PCA) was used to extract the components to

determine the number of identity attributes to retain. PCA is preferred over principal

factor analysis (PFA) for purposes of data reduction. An orthogonal Varimax procedure

was specified for the rotation procedure.

The resulting Cattell scree plot is presented in Figure 3, while the percentage of variance accounted for by each of the components is shown in Table 5. Inspection of the scree plot in Figure 3 and of the proportion of variance each component explained (refer to Table 5) showed a large gap between the seventh (eigenvalue = 1.00) and eighth (eigenvalue = .64) components. The first seven components appeared to be distinct from the other eight components. Further, the eigenvalue of the eighth component was below the acceptable criterion of 1.00. As such, principal components analysis was deemed to yield seven components, with the first component extracted (personal information) accounting for 23.02% of the total variance. The seven components and the attributes that loaded highly onto them are displayed in Table 6.
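The retention decision can be illustrated with a short sketch that applies the eigenvalue criterion of 1.00 to the eight eigenvalues reported in the text and in Table 5. The variance calculation assumes a principal components analysis of 15 standardized attributes, so that the total variance equals 15; this is an assumption rather than a detail stated in the study, and because the published eigenvalues are rounded, the computed percentages only approximate those in Table 5.

```python
# Eigenvalues reported for the first eight components (Table 5 and the text).
eigenvalues = [3.45, 2.86, 2.11, 1.23, 1.01, 1.00, 1.00, 0.64]

# Assumption: one standardized variable per identity attribute in Table 3.
NUM_ATTRIBUTES = 15


def retained_components(values, threshold=1.0):
    """Return 1-based indices of components whose eigenvalue meets the criterion."""
    return [index for index, value in enumerate(values, start=1) if value >= threshold]


def percent_variance(value, num_variables=NUM_ATTRIBUTES):
    """Percentage of total variance explained by one component."""
    return 100.0 * value / num_variables


retained = retained_components(eigenvalues)
first_share = percent_variance(eigenvalues[0])
```

Under these assumptions, seven components are retained and the first accounts for roughly 23% of the variance.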

The comparative discoverability indices for the group of attributes are shown in

Table 7. The comparative discoverability indices for identity attributes are presented in

Table 8.


Figure 3. Scree plot from resulting EFA procedure.

Table 5. Variance Explained by Resulting Components

Component                    Eigenvalue total    % variance explained
1. Personal information      3.45                23.02
2. Home information          2.86                19.04
3. Driving information       2.11                14.05
4. Verification questions    1.23                8.18
5. Birth year                1.01                6.76
6. SSN                       1.00                6.69
7. Last 4 digits of SSN      1.00                6.69


Table 6. Components and Respective Identity Attributes

Component                  Attributes
Personal information       Name, date of birth, physical description
Home information           Home address, property value, property tax, square feet
Driving information        License number, VIN, home phone
Verification questions     Mother’s maiden name, place of birth
Birth year                 Birth year
SSN                        SSN
Last four digits of SSN    Last four digits of SSN

Table 7. Comparative Discoverability Index for the Groups of Identity Attributes

Group of attributes        Index
Personal information       .52
Home information           .41
Driving information        .03
Verification questions     .03
Birth year                 .02
SSN                        .00
Last four digits of SSN    .00


Table 8. Comparative Discoverability Index for the Identity Attributes

Identity attribute         Index
Name                       .30
Home address               .17
Date of birth              .14
Physical description       .08
Property value             .08
Property tax               .08
Number of square feet      .08
Driver’s license number    .01
VIN                        .01
Home phone number          .01
Mother’s maiden name       .01
Place of birth             .02
Birth year                 .02
SSN                        .00
Last four digits of SSN    .00

Recommendations are made in chapter 5 of this study for the application of these

indices to KBA service offerings.


RQ2: Correlation Between Frequency of Attributes and Identity Theft Rankings: Is There a Correlation Between Identity Theft Rates and the Availability of Personal Data in Public Databases?

Based on recent FTC and other news media reports of identity theft being facilitated by public records, as cited earlier in this study, several hypotheses relating to RQ2 were formed. The null form of each hypothesis is stated below, together with the results of its test. Nonparametric correlations were conducted to test these hypotheses against the FTC’s identity theft rankings for 2007. Correlations were performed using Spearman’s rho, the most commonly used nonparametric statistic for measuring association between ranked data. As a directional relationship was hypothesized, one-tailed tests were employed at a significance level of .05.

H1a: There is no correlation between identity theft rates and the total number of

Web-accessible public records. Table 9 displays ranked identity theft rates by state as published in the FTC report. Lower ranks represent higher rates of reported identity theft, with Arizona having the highest rate (rank = 1) and North Dakota the lowest (rank = 50). These data were tested

against the total number of public records sites as reported by Search Systems, ranked in

a similar manner to the FTC data, with lower ranks indicating a higher number of

published records.

The result of the analysis for H1a indicated that there was a moderate positive

relationship between state identity theft rankings and the number of published sites at

each state (rho = .443, p = .001). That is, states with more public records sites tended to

have more incidences of reported identity theft. Thus, the null hypothesis for H1a was

rejected.
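As an illustration of the calculation, the rho for H1a can be recomputed from the two rank columns of Table 9. The sketch below uses the simplified Spearman formula, which applies because neither rank column in Table 9 contains ties; it is a hand calculation for illustration, not the SPSS procedure used in the study, and the function name is hypothetical.

```python
# Rank by total number of sites for each state, listed in order of
# FTC identity theft rank (1 through 50), as given in Table 9.
site_rank_by_ftc_order = [
    30, 6, 39, 1, 2, 8, 15, 16, 40, 44,
    4, 38, 14, 17, 5, 43, 22, 13, 18, 19,
    35, 9, 11, 20, 23, 21, 3, 37, 28, 27,
    32, 25, 29, 36, 12, 42, 31, 45, 48, 26,
    7, 33, 49, 41, 34, 24, 46, 10, 50, 47,
]


def spearman_rho_no_ties(x_ranks, y_ranks):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x_ranks)
    d_squared = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


ftc_rank = list(range(1, 51))
rho = spearman_rho_no_ties(ftc_rank, site_rank_by_ftc_order)
```

Because both rankings run in the same direction (rank 1 is the highest theft rate and the largest number of sites), a positive rho indicates that states with more sites tend to report more identity theft.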


Table 9. FTC 2007 State Identity Theft Rankings

State             FTC identity theft rank    Rank by total no. of sites    No. of sites (N = 8,659)
Arizona           1                          30                            91
California        2                          6                             394
Nevada            3                          39                            66
Texas             4                          1                             780
Florida           5                          2                             462
New York          6                          8                             377
Georgia           7                          15                            229
Colorado          8                          16                            192
New Mexico        9                          40                            53
Maryland          10                         44                            37
Illinois          11                         4                             431
New Jersey        12                         38                            69
Washington        13                         14                            229
Pennsylvania      14                         17                            188
Michigan          15                         5                             395
Delaware          16                         43                            37
Alabama           17                         22                            142
Virginia          18                         13                            238
Connecticut       19                         18                            175
Oregon            20                         19                            159
Missouri          21                         35                            79
North Carolina    22                         9                             298
Massachusetts     23                         11                            273
Tennessee         24                         20                            153
Oklahoma          25                         23                            132
Indiana           26                         21                            142
Ohio              27                         3                             450
Louisiana         28                         37                            74
Kansas            29                         28                            109
South Carolina    30                         27                            120
Utah              31                         32                            89
Mississippi       32                         25                            122
Arkansas          33                         29                            108
Rhode Island      34                         36                            74
Minnesota         35                         12                            265
Idaho             36                         42                            38
New Hampshire     37                         31                            90
Alaska            38                         45                            35
Hawaii            39                         48                            23
Nebraska          40                         26                            122
Wisconsin         41                         7                             382
Kentucky          42                         33                            87
Wyoming           43                         49                            20
Montana           44                         41                            40
Maine             45                         34                            86
West Virginia     46                         24                            131
Vermont           47                         46                            33
Iowa              48                         10                            293
South Dakota      49                         50                            16
North Dakota      50                         47                            30

H1b: There is no correlation between identity theft rates and individual identity

attributes published in Web-accessible public records. The results of the correlations

between the frequency of identity attributes and state identity theft ranks are presented in

Table 10. As evidenced in the data, six of the identity attributes were significantly and negatively associated with the FTC’s identity theft ranking. These attributes were name (rho = –.50, p = .000), date of birth (rho = –.43, p = .001), birth year (rho = –.24, p = .048), home address (rho = –.46, p = .000), VIN (rho = –.25, p = .038), and physical description (rho = –.27, p = .028). For these variables, the more frequently they were found in public records, the lower the state’s identity theft ranking (lower rankings indicate higher incidences of reported identity theft in the state). As a result, the null hypothesis for H1b was rejected for these identity attributes.

Table 10. Spearman Rho Correlations Between Identity Attributes and Identity Theft Ranking

Variable                   Theft rank rho    Sig.
Name                       –.50              .000
Date of birth              –.43              .001
Birth year                 –.24              .048
Mother’s maiden name       –.03              .426
Place of birth             –.03              .419
Home address               –.46              .000
SSN                        –.06              .329
Last four digits of SSN    –.04              .406
Home phone number          –.01              .476
Driver’s license number    –.11              .224
VIN                        –.25              .038
Physical description       –.27              .028


H1c: There is no correlation between identity theft rates and identity attribute

groups published in Web-accessible public records. The findings of the nonparametric

correlations between the attribute groups and the total attribute sum, on the one hand, and

identity theft ranking, on the other, are shown in Table 11. The findings indicate that the

group of personal information attributes was significantly and negatively associated with

identity theft ranking (rho = –.45, p = .000). Thus, the easier it was to access personal information, the lower the identity theft ranking of the state (the higher the incidence of identity theft). The group of home information attributes was also significantly and negatively associated with identity theft ranking (rho = –.46, p = .000). Again, the easier it was to access home information, the lower the identity theft ranking of the state. As the p values for both driving information and verification questions were greater than .05, the null was not rejected for those attribute groups. Thus, it was concluded that the personal and home information attribute groups are more strongly associated with identity theft rates than are driving information or verification questions.

Table 11. Spearman Rho Correlations Between Attribute Groups and Identity Theft Ranking

Variable                  Theft rank rho    Sig.
Personal information      –.45              .000
Home information          –.46              .000
Driving information       –.22              .065
Verification questions    –.02              .435


H1d: There is no correlation between identity theft rates and the total sum of

different identity attributes published in Web-accessible public records at each state.

Lastly, the sum of identity attributes published by each state was significantly and negatively associated with identity theft ranking (rho = –.443, p = .001). Thus, the greater the number of different identity attributes published in these public records categories by a state, the higher the incidence of identity theft, resulting in a rejection of the null hypothesis. This statistic should be interpreted with some caution, as the totals represent only the different types of identity attributes found across the seven categories of records for each state, as previously discussed. Although this limits the use of the statistic in examining the impact of the total number of identity attributes in all public records at each site, it was significant to the study results in that it indicated that states that publish many different identity attributes in each record may be contributing to the incidence of identity theft in that state.

Summary

The purpose of this analysis was first and foremost to examine the extent to which personally identifiable information can be discovered in public records. In this respect, the researcher was able to derive the frequency with which identity attributes are found in public records and compute a comparative discoverability metric from the data. This allowed the researcher to provide recommendations in chapter 5 for the use of this metric in computing assurance levels for knowledge-based authentication services. The second objective, determining whether these online data correlate with state identity theft rates, was also met.


Several hypotheses were tested, and the results suggest a positive relationship between the amount of personally identifiable data published in online records and state identity theft rates. This relationship held not only for the total number of published sites within a state but also for specific attributes (name, date of birth, birth year, home address, VIN, and physical description) whose publication frequency correlated with higher identity theft rates. Other attributes, such as mother’s maiden name, driver’s license number, and property tax information, showed no significant correlation. The impact of these findings on knowledge-based authentication and the recommendations drawn from the data analysis in this chapter are discussed in chapter 5.


CHAPTER 5. RESULTS, CONCLUSIONS, AND RECOMMENDATIONS

Summary of the Study

The principal goal of this exploratory study was to assess the discoverability of

identity attributes in Internet-based public records. A quantitative research methodology

employing content analysis was used to assess both the quantity and type of identity

information resident in the records. A frequency analysis was performed to identify how

susceptible identity attributes are to discovery by would-be identity thieves so that

recommendations could be made for the use of knowledge-based authentication systems

that heavily rely on public records to authenticate individuals to government Web sites. A

secondary goal of the study was to assess whether or not the type and amount of these

data correlated to recorded identity theft rates. Chapters 1–4 presented the study’s

objective, its significance, conceptual framework, hypotheses, sampling and data

collection methodology, and data analysis. This chapter provides a summary of the

research conducted and the interpretations and implications of the data analysis. It further

identifies recommendations to government agencies for the use of KBA to perform online

authentication, as well as study limitations and considerations for future research.

Summary of the Research Findings

The findings from the research consist of (a) the results from the content analysis

performed on electronic records, and (b) the correlation results between the frequency

with which these data are published and reported identity theft rates. These findings are

summarized below.


The Discoverability of Identity Attributes in Online Public Records

The following research question was developed to examine the frequency with

which personal identity attributes are published on the Internet. Research Question 1:

Discoverability Metrics of Identity Attribute: What is the comparative frequency, or

discoverability, of personal identity attributes in public records databases? To address this question, the study examined a total of 6,598 public records sites containing identity attributes and analyzed the data to determine the frequency with which these attributes can be discovered in public records. To represent this frequency comparatively, a discoverability index for each attribute was determined by dividing the frequency of that attribute by the total number of attributes. Descriptive statistics performed on the results

revealed that property records yielded the greatest number of different identity attributes,

while also being the most numerous type of public record available over the Internet.

Arrest records were the next most numerous, and oftentimes contained full physical

descriptions of the individual, as well as a photograph. Court records often consisted of

traffic and accident reports, both containing a wealth of personally identifiable data. The

significance of these results is discussed in the Implications section of this chapter.

Correlation Between Identity Attribute Publishing Frequencies and Identity Theft

The following research question and hypotheses were developed to determine if a

correlation existed between the FTC’s reported identity theft rates and (a) the total

number of public records sites published by each state, (b) the different types of identity

attributes found in each public record, (c) linked, or grouped, identity attributes published

at each public record site, and (d) the total number of different identity attributes


published in each site. Research Question 2: Is there a correlation between identity theft

rates and the availability of personal data in public databases?

Hypothesis 1(a) (null). The null hypothesis stated that there is not a correlation

between identity theft rates and the total number of Web-accessible public records sites.

Data were collected and analyzed using nonparametric statistics to measure ranked data.

Based on the Spearman’s rho results, the null for H1(a) was rejected at the 5%

significance level. Results indicated a moderate positive correlation between reported

identity theft rates and the number of published sites at each state.
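
The rank correlation used for this hypothesis can be sketched in a few lines of Python. The sketch below assumes the standard definition of Spearman's rho as the Pearson correlation of average ranks; the per-state figures are invented for illustration, and no significance test is included, so it does not reproduce the study's analysis.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-state values (not the study's data).
sites_per_state = [12, 45, 30, 8, 60]
theft_rate = [55.0, 90.2, 70.1, 60.3, 120.5]

rho = spearman_rho(sites_per_state, theft_rate)
print(f"Spearman's rho = {rho:.2f}")
```

A value near +1 indicates that states ranked higher on one measure tend to rank higher on the other; deciding whether a given rho is significant at the .05 level requires a separate test.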

Hypothesis 1(b) (null). The null hypothesis stated that there is not a correlation

between identity theft rates and the type of individual identity attributes published in

Web-accessible public records. Twelve identity attributes were collected from public

records sites. Of these, 6 (name, date of birth, birth year, home address, and vehicle

identification) were significantly correlated to the FTC’s identity theft state rankings,

indicating that the more frequently they were found in public records, the higher the

reported identity theft rate. This resulted in a rejection of the null hypothesis for

these identity attributes.

Hypothesis 1(c) (null). The null hypothesis stated that there is not a correlation

between identity theft rates and grouped identity attributes published in Web-accessible

public records. The collected identity attributes were grouped together using an

exploratory factor analysis and a principal components analysis to reduce and extract the

components that would be retained. Findings from the Spearman’s rho indicated that

attributes combined into personal information (name, date of birth and physical

description) and those regarding home information (home address, property value,

property tax, and square feet) were significantly associated with identity theft, resulting in

a rejection of the null hypothesis for those groups of attributes.

Hypothesis 1(d) (null). The null hypothesis stated that there is not a correlation

between identity theft rates and the total sum of different identity attributes published in

Web-accessible public records at each state. Based on the results of the Spearman’s rho,

the null hypothesis was rejected, indicating that the greater the number of different

identity attributes published at each site, the greater the incidence of reported identity

theft. While the total sum in the analysis only represented the sum total of the different

types of identity attributes found across the seven categories of records for each state, it

was significant to the study results in that it indicated that states that publish many

different identity attributes in each record may be contributing to the incidence of identity

theft in that state.

Implications of the Study

This study revealed several implications for identity theft that could be of interest

to government agencies or other organizations considering the use of KBA to

authenticate individuals to online applications. The findings from the research supported

the overall hypothesis that there is a relationship between the prevalence of identity

information in public records and identity theft. The study results, however, did not establish

a cause-and-effect relationship between the two solely on the basis of these correlation

statistics. These findings are consistent with an article in the Journal of Economic Crime

Management (Pinheiro, 2004) that postulated that, by knowing just several pieces of

personal information, one can match that information to a credit report to authenticate

the identity to a prospective creditor, thereby allowing an identity thief to “take over”

another person’s identity.

Table 12. Summary of the Hypotheses Testing

Hypothesis (null) | Results | Conclusion

H1a: There is not a correlation between identity theft rates and the total number of Web-accessible public records sites. | Rejected | States with larger numbers of published public records sites are associated with more reported incidences of identity theft.

H1b: There is not a correlation between identity theft rates and the type of individual identity attributes published in Web-accessible public records. | Partially rejected | A relationship to identity theft rates is supported with the following six attributes: name, date of birth, birth year, home address, and vehicle identification.

H1c: There is not a correlation between identity theft rates and grouped identity attributes published in Web-accessible public records. | Partially rejected | A relationship to identity theft rates is supported with groups of attributes containing personal and home information.

H1d: There is not a correlation between identity theft rates and the total sum of different identity attributes published in Web-accessible public records at each state. | Rejected | A relationship is supported between identity theft rates and the total sum of different attributes found in online public records.

During the enumeration process, it was evident that many counties still publish

full images of marriage licenses and applications, as well as of birth certificates, most

containing mothers’ maiden name information as well as dates of birth commonly used to

authenticate individuals in online transactions. A GAO (2004) study on the availability of

SSNs in public records identified names and birth dates, together with SSNs, as the three

personal identifiers that are most often sought by identity thieves. Also of

concern with property and court records is the uniqueness of some of the identity

information contained in these types of reports that cannot be found elsewhere, including

vehicle identification and driver’s license data that can help to build an identity profile. In

their study, the GAO concluded that few state agencies posted SSNs on the Internet;

however, they estimated that “local government offices in as many as 15–28% of

counties do make SSNs available through the Internet” (p. 4). The GAO study pointed

out that these offices have begun restricting SSNs in online and other public records

overall, which could explain why no SSNs were found in the Internet records examined

for the present study, 4 years after the GAO study was published.

With this information so readily accessible, what is the real value of pseudo-

secrets culled from public records to any identity authentication or verification system

associated with any level of risk to the business, government agency, or consumer? In a

response to the FTC’s Identity Theft Task Force’s request for public comments,

ChoicePoint (2007), a leading knowledge broker, suggested that KBA effectively

confirms the existence of an identity through the verification and correlation of multiple

data elements in public records databases. However, while the identity can be confirmed

as one that exists in the record, ChoicePoint acknowledged that it is more difficult to

prove that the identity claimant is actually who he or she is claiming to be. “Fraudulent

use of the SSN and similarly issued (identity credential) tokens and breeder documents—

such as driver’s licenses, birth certificates, and so forth—perpetuates identity fraud and

threatens to undermine important credentialing efforts designed to make us more secure”

(ChoicePoint, p. 6). The findings from this study indicated that the ready accessibility of

identity information in Internet-based records is related in some way to identity theft, thus

diminishing the value of using public records information as a primary source of

identity verification or authentication.

Contributions of the Study

This study’s contribution is twofold. First, the study’s methodology can be used

as a foundation for future researchers of public records databases and knowledge-based

authentication systems, in which there still exists a dearth of research. As identity theft

becomes a politically charged issue, many industries that rely on public records data are

presenting papers in opposition to proposed limits or access restrictions on the records

and the personally identifiable information contained within them. Oftentimes, these

studies and papers do not have substantive research supporting the conclusions. As an

example, in a statement presented to the Ways and Means Committee of the U.S. House

of Representatives, PRIA argued for the continued use of SSNs in public records,

contending that “it is a common misconception that easy access to public records has

facilitated identity theft” (“Protecting the Privacy of the SSN,” 2007, ¶ 7). PRIA did not,

however, provide any data on which to base their assertion that would refute the findings

of this present study, instead basing their contention on a lack of evidence to the

contrary and a 2003 Synovate study prepared for the FTC in which PRIA stated it “did

not identify a correlation between public records access and the three categories of

identity theft” (2006, p. 15). An inspection of the Synovate report indicated, however,

that approximately half of the 4,057 ID theft victims surveyed stated that “they did not

know how the person who misused their personal information obtained it” (Synovate,

2003, p. 9). The term public records does not appear in the Synovate study and no

correlation analysis of public records and identity theft rates was performed. The absence

of objective research prevents legislators and agency administrators from making

informed decisions on the proper measures to take to prevent identity theft. By building

on and improving the approach used in this current study, other researchers will be able

to provide a more comprehensive understanding of how public records information

impacts identity theft rates.

This study also extends the discussion of quantifying knowledge-based

authentication to include discoverability as a factor that must be considered when

assessing its use within an authentication technology framework. From the indices

derived from the frequency with which identity attributes are found in public records, a

Discoverability Factor (DF) can be ascribed to the combinations of attributes that may be

used in knowledge-based authentication. This discoverability factor can be used by

government agencies to map the selection of specific identity attributes to appropriate e-

authentication assurance levels.

A DF can be calculated by multiplying any combined attributes’ index numbers.

The greater the resulting number, the higher the likelihood of discoverability. This serves

to reduce the level of confidence, or trust, that an application owner should place in the

identity-proofing process. Two examples are provided below of how identity attribute

indices would be combined to calculate the DF:

Example A. Name (.30) x Property Tax (.08) x Place of Birth (.02) results in a DF = .00048.

Example B. Name (.30) x Home Address (.17) x Date of Birth (.14) results in a DF = .00714.

Of the two examples, Example A results in a selection of identity attributes that,

when combined with each other, provide a greater assurance level of identity

authentication than Example B as it has a lower DF and is therefore less susceptible to

discovery. Lower DFs would be more appropriate for use with online applications where

greater authentication-related risks exist. Examples of applications where stronger levels

of identity authentication may be required are those that provide access to the identity

claimant to breeder documents containing additional identity information or sensitive

health-related records.
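
The two examples above can be reproduced with a short Python sketch. The index values are those quoted in Examples A and B; the function and variable names are illustrative only.

```python
from math import prod

# Discoverability indices quoted in Examples A and B.
INDEX = {
    "name": 0.30,
    "property_tax": 0.08,
    "place_of_birth": 0.02,
    "home_address": 0.17,
    "date_of_birth": 0.14,
}


def discoverability_factor(attributes, index=INDEX):
    """Multiply the indices of the selected attributes to obtain the DF."""
    return prod(index[name] for name in attributes)


df_a = discoverability_factor(["name", "property_tax", "place_of_birth"])
df_b = discoverability_factor(["name", "home_address", "date_of_birth"])

print(f"Example A: DF = {df_a:.5f}")
print(f"Example B: DF = {df_b:.5f}")
print("Less discoverable combination:", "A" if df_a < df_b else "B")
```

Because every index is a proportion no greater than 1, adding an attribute to a combination can never raise the DF, which is consistent with asking more questions for higher-risk applications.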

To fully assess the effectiveness of knowledge used as an authenticator, the

discoverability of the knowledge must be comprehensively assessed. Previous research in

this field has neglected to include the impact of discoverability on KBA and, as such,

identity confidence algorithms, as proposed by Chokhani (2003) for NIST and discussed

in chapter 2 of this study, should be recalculated to include a discoverability factor for

both public and proprietary information. Consistent with Chokhani’s approach,

ChoicePoint (2007) proposed to the FTC that KBA blend in a mix of static (SSNs),

dynamic (addresses), and highly dynamic (banking or other transaction records) attributes

to verify and authenticate the identity of individuals seeking to conduct secure

transactions with either the public or private sector. In practice, this suggested approach

will likely prove problematic as proprietary records (usually purchased from utility

companies, banks, warranty registrations, shoppers’ discount cards, etc.) holding

transactional information that would be used for authentication are very limited with

respect to the population of citizens that can be authenticated, even as property records,

for example, are limited in scope to authenticating property owners.

Limitations

Several limitations were identified during the course of this research. The most

serious limitation was that the total number of records within each public records

database examined could not be ascertained using today’s technology. These databases

could contain as few as one record, or could have captured an entire county’s population

with hundreds of thousands of records. Advancements in data-mining techniques with

databases residing in this “deep Web” environment hold the promise for researchers to

resolve this limitation, enabling a more complex statistical analysis.

Reliability, or the extent to which a measuring procedure will produce the same

results when repeated (Carmines & Zeller, 1979), was also a limitation of this study.

Reproducing the same sample of public records sites will prove to be difficult for

subsequent researchers as it is analogous to shooting a moving target. While new records

are being posted at an alarming rate to the Search Systems site, many more are being

taken offline or modified by states and counties that are becoming increasingly sensitive

to the increase in identity theft rates and public perception. An example of this was found

during the course of enumeration in this study with the state of Colorado. Many Colorado

sites linked in Search Systems generated a page error or posted a disclaimer that these

records had been removed. This limitation is inherent to Web-based content analyses, as

discussed in chapter 3 of this study.

Reliability limitations due to intercoder, or inter-rater, inconsistencies are also

acknowledged where more than one coder enters, or categorizes, the data in a content

analysis. This limitation was mitigated within the study through the use of a pilot test to

identify coding and categorization issues prior to a single coder entering all of the data in

the final data collection. An additional way to ensure reliability is to “measure a construct

that is very clearly and even narrowly defined” (Muijs, 2004, p. 74). Unlike other studies

that use content analysis to quantify elements contained within literature or speech,

content rules for the purposes of this study simply noted the presence of predefined

identity attributes in an electronic record. There remains, however, the possibility that

data will be coded differently in subsequent studies by different researchers.

Similar to limitations with intercoder reliability is a limitation associated with the

exploratory factor analysis used to reduce identity attributes. Another researcher

analyzing the same data could select different factors. Both of these limitations could

serve to make it difficult to generalize the findings of this study at a significant

confidence level.

Finally, it has recently been brought to public attention that Maryland’s, and

possibly other states’, online traffic records contain out-of-state license information,

in many cases from states that previously used the SSN as the driver’s license number (Krebs, 2008). While

no full SSN fields were listed in any of the records examined during this study, this study

did not examine the format of the number in the driver’s license field to determine if it

was in the same format as an SSN. Additionally, Departments of Motor Vehicles have

universally abandoned the use of SSNs as the driver’s license number, so the records with

SSNs will be interspersed with records that no longer display SSNs. This makes it more

difficult to exactly enumerate the percentage of records within a single database

containing SSNs unless all of the records can be enumerated. The presence of SSNs in

other identifier fields warrants a reinvestigation before discoverability indices can be

reliably calculated.
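
A format check of the kind this limitation calls for could be sketched as follows. The Python sketch flags driver's license values that have the nine-digit shape of an SSN; the pattern, the function name, and the sample values are assumptions for illustration, and a match shows only that a value is SSN-shaped, not that it is an SSN.

```python
import re

# Nine digits, optionally written as 3-2-4 with hyphens or spaces.
SSN_SHAPE = re.compile(r"(\d{3})[- ]?(\d{2})[- ]?(\d{4})")


def looks_like_ssn(value):
    """Return True if the value has the shape of an issuable SSN."""
    match = SSN_SHAPE.fullmatch(value.strip())
    if match is None:
        return False
    area, group, serial = match.groups()
    # Area numbers 000, 666, and 900-999 are not issued as SSNs,
    # nor are group 00 or serial 0000.
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


# Made-up license field values for illustration.
samples = ["123-45-6789", "123456789", "000-12-3456", "S-514-778-123-456"]
flagged = [value for value in samples if looks_like_ssn(value)]
print(flagged)
```

Some states issue nine-digit license numbers that are unrelated to the SSN, so any flagged value would still need manual review before being counted.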

Recommendations for Future Research

This study was an exploratory study whose findings suggest an association

between identity theft and certain identity attributes and groups of attributes that warrants additional

investigation. Future research should take into consideration the limitations encountered

in this study. Additionally, the following are suggestions for further examination of this

subject.

False Negative and False Positive Rates Associated With the Use of KBA

Many proponents of KBA have cited how KBA can be used to reduce false

positive rates. Separate tests performed by the researcher with several KBA services

indicated that the testing of false positive rates, in which unauthorized users are provided

access after successfully passing the authentication questions, has proven difficult. False

negative rates, in which individuals are denied access to an application because the knowledge

broker providing KBA services does not have sufficient data to effectively authenticate

the individual, also appear to be higher than expected. A methodology should be

developed and comprehensive testing of knowledge-based services should be performed

to derive industry-standard-acceptable false positive and false negative rates, as is the

case with other authentication technologies.
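
Once such testing produces counts, the two rates defined above reduce to simple proportions. The following Python sketch assumes a test design in which known impostor attempts and known genuine attempts are tallied separately; the function name and the counts are hypothetical.

```python
def kba_error_rates(impostor_attempts, impostors_accepted,
                    genuine_attempts, genuine_rejected):
    """Return (false positive rate, false negative rate) as proportions."""
    if impostor_attempts <= 0 or genuine_attempts <= 0:
        raise ValueError("both attempt counts must be positive")
    # False positive: an unauthorized user passes the authentication questions.
    false_positive_rate = impostors_accepted / impostor_attempts
    # False negative: a legitimate individual is denied access.
    false_negative_rate = genuine_rejected / genuine_attempts
    return false_positive_rate, false_negative_rate


# Hypothetical test counts for illustration only.
fpr, fnr = kba_error_rates(
    impostor_attempts=200, impostors_accepted=6,
    genuine_attempts=500, genuine_rejected=40,
)
print(f"False positive rate: {fpr:.1%}")
print(f"False negative rate: {fnr:.1%}")
```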

The Impact of Registration and Fee-Based Access on Limiting Identity Theft From Public Records

While the examination of public records in this study demonstrated that some

property and court systems required user self-registration and fees, no studies have been

performed that indicate whether or not registration is a successful deterrent to an identity

thief who may be building a dossier on a target identity. It did prove, however, to be a

deterrent for the purposes of this study, as the researcher did not enumerate those fields—

which could have conceivably held more sensitive data than those that were freely

enumerated. It is recommended that subsequent researchers ensure that sufficient

funding exists to provide for the examination of fee-based sites and record those

results such that a comparison can be made between fee-based and freely accessible sites

with respect to the type of data found in each record.

The Impact of Public Records, Both Paper-Based and Internet-Based, on Identity Theft

The GAO survey-based study performed in 2004 provided a methodology that

can be utilized to examine more than simply the availability of SSNs in public records.

The data collection can be modified to extend to all personally identifiable information in

both paper-based and electronic records, and to track the data by state for comparison

with FTC identity theft rates.

Develop a Methodology for Calculating an Identity Confidence Algorithm That Factors in Discoverability

KBA services that rely, even in part, on publicly accessible data must factor

discoverability into their identity confidence scoring algorithms, wherever these may exist.

In the absence of verifiable, testable algorithms, researchers should consider developing a

methodology that combines discoverability with guessability (Chokhani, 2003) to

develop a more accurate scoring mechanism for assessing authentication risk. An

alternative approach proposed by Dr. Peter Alterman, Chair of the Federal Public Key

Infrastructure Steering Committee, suggested that while “combining personal information

available in multiple databases with common identity credentials does offer reasonable

assurance of identity” (2003, Abstract), the reliability of that identity is influenced by the

number and relationship of identity credentials generated over time, as well as the level to

which the identity verification service is indemnified from liability. A mathematical

algorithm was presented in his paper as a model for what would be considered an identity

confidence scoring engine. While the model largely related to presented credentials (i.e.,

birth certificate, driver’s license, or passport), components of his algorithm might be

relevant to a similar algorithm applied to KBA—as an example, an identity confidence

score based on the number of corresponding identity attributes, indemnification

considerations, as well as discoverability and guessability of the identity attributes.

Impact of Geographic Location on Identity Theft

No statistical analysis is necessary to observe the disproportionate number of

southern border states numbering among the top 10 states with the highest incidences of

identity theft (FTC, 2008). A study should be performed to further assess the relationship

between geographic location and identity theft.

Conclusion

While the research performed in this study provides a framework for more

comprehensive testing of the impact of public records on identity theft, the purpose of

this research was not to build a case for the containment of public records. Identity theft

will not be resolved by limiting access to personally identifiable information; once the

genie is out of the bottle, so to speak, the information cannot be made private. Pseudo-

secrets that were never intended to be kept private (DOB, mother’s maiden name, SSN,

etc.) should not be used to prove identity, as they are vulnerable to discovery if only because

the owner of the information is free to share these secrets with anyone the

owner chooses. As such, breeder documents (birth certificates, social security cards,

driver’s licenses) that are obtainable using knowledge-based authentication heavily

reliant on Internet-accessible public records should not be used to bind an identity to an

identity claimant for access to high-risk applications. Even biometric identification

systems used in secure facilities may have accomplished nothing more than binding a

physical characteristic, such as a fingerprint or retina scan, to a false identity claimant, if

the identity is authenticated by correlating personal knowledge of the identity to

discoverable facts in public records.

While this study provides a framework for further research, the full impact of

public records on identity theft and, therefore, KBA systems will likely continue to prove

elusive to quantify given the magnitude of the data that exist in non-normalized databases

on the Internet, which makes data mining this information difficult. The only saving grace is

that large-scale data mining of this information is also difficult for identity thieves to

perform, for now preventing the additional increase in identity theft that will likely result

when technology matures to resolve the challenge. Acknowledging the discoverability of

this information is the first step towards developing realistic and accurate algorithms that

help agencies select appropriate authentication questions or alternative authentication

technologies.

REFERENCES

After the breach: how secure and accurate is consumer information held by ChoicePoint and other data aggregators: Hearings before the California Senate Banking, Finance and Insurance Committee (2005) (testimony of Chris Jay Hoofnagle).

Alterman, P. (2003). On the reliability of authentication of identity. Retrieved August 14, 2008, from http://www.cio.gov/fpkipa/documents/ReliabilityAuthenticationIdentity.pdf

Archer, J. (2004, November). Initiatives for protecting financial institution customers. Statement presented at Inside ID Conference and Expo, Washington, DC.

Barrett, J. (2004, February). Information sources and metrics: Authentication processes and risk decisions. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD.

Berelson, B. (1952). Content analysis in communication research. New York: Hafner.

Bergman, M. K. (2001, July). The deep Web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1), 97–99. Retrieved May 1, 2005, from University of Michigan Web site: http://www.press.umich.edu/jep/07-01/bergman.html

Bolton, J. B. (2003). E-authentication guidance for federal agencies. Memorandum to the heads of all departments and agencies (OMB-04-04). Retrieved October 25, 2004, from http://www.whitehouse.gov/omb/egov/legislation_memo.htm

Bragg, R. (2004, July). Rainbow crack—not a new street drug. Retrieved May 1, 2005, from http://redmondmag.com/columns/article.asp?EditorialsID=736

Burr, W., Dodson, D., & Polk, T. (2006). Electronic authentication guideline (NIST Special Publication 800-63, 2006 Ed.). Retrieved November 16, 2008, from http://csrc.nist.gov/publications/nistpubs/800-63/SP800-63V1_0_2.pdf

Carmines, E., & Zeller, R. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage.

Cartwright, K. (2004, February). Information sources and metrics. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD. Retrieved May 16, 2005, from http://csrc.nist.gov/kba/Presentations/Day%201/Cartwright-Info%20Sources.pdf

Chandramouli, R., Dray, J., Ferraiolo, H., Guthery, S., MacGregor, W., & Mehta, K. (2008). Interfaces for personal identity verification—part 4: The PIV transitional interface and data model specification (NIST Special Publication 800-73, 2008 Ed.). Retrieved November 16, 2008, from http://csrc.nist.gov/publications/nistpubs/800-73-2/sp800-73-2_part4_transitional-specification-final.pdf

Chen, Y. (2007). A Bayesian network model of knowledge-based authentication. Retrieved November 16, 2008, from http://research.bus.wisc.edu/yechen/Publications_files/chen-thesis.pdf

ChoicePoint. (2004). Business solutions/authentication solutions: ProID. Retrieved October 28, 2004, from http://www.choicepoint.com/business/authen/proid.html

ChoicePoint. (2007). Federal Identity Theft Task Force, project no. P065410 [Letter to Donald S. Clark, Secretary, Federal Trade Commission]. Retrieved July 15, 2008, from http://www.idtheft.gov/comments/102.pdf

Chokhani, S. (2003, February). Knowledge-based authentication metrics. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD. Retrieved December 14, 2005, from http://csrc.nist.gov/archive/kba/Presentations/Day%202/Chokhani-Attachment.pdf

Chokhani, S., Dodson, D., Hastings, N., Burr, W., & Polk, T. (2006). Special publication 800-63 part 2: Knowledge-based electronic authentication guidelines: Draft.

Cooper, D., & Schindler, P. (2003). Business research methods (8th ed.). New York: McGraw Hill/Irwin.

Datesman, G. (2004, February). Standard metrics for knowledge-based authentication. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD. Retrieved May 16, 2005, from http://csrc.nist.gov/kba/Presentations/Day%201/Cartwright-Info%20Sources.pdf

Electronic Authentication Partnership. (2004, October). Report on technical interoperability. Retrieved May 7, 2005, from www.eapartnership.org/docs/Oct2004/Oct2004_D_Interoperability_Report.doc

Federal Deposit Insurance Corporation. (2004, December). Putting an end to account-hijacking identity theft. Retrieved May 5, 2005, from http://www.fdic.gov/consumers/consumer/idtheftstudy/identity_theft.pdf

Federal Trade Commission. (2004). FTC issues final rules on FACTA identity theft definitions, active duty alert duration, and appropriate proof of identity. Retrieved May 28, 2005, from http://www.ftc.gov/opa/2004/10/facataidtheft.htm

Federal Trade Commission. (2006). Consumer fraud and identity theft complaint data January–December 2005. Retrieved May 28, 2005, from http://www.consumer.gov/sentinel/pubs/Top10Fraud2005.pdf

Federal Trade Commission. (2008). Consumer fraud and identity theft complaint data January–December 2007. Retrieved March 13, 2008, from http://www.ftc.gov/sentinel/reports/sentinel-annual-reports/sentinel-cy2007.pdf

General Accounting Office. (1996). Content analysis: A methodology for structuring and analyzing written material (GAO/PEMD-10.3.1). Retrieved April 18, 2005, from http://archive.gao.gov/d48t13/138426.pdf

General Accounting Office. (1997). General policies/procedures and communications manual (GAO/GPPM-97). Retrieved April 18, 2005, from http://www.gao.gov/policy/gppm-cm.pdf

General Accounting Office. (2002). Identity fraud: Prevalence and links to alien illegal activities (GAO-02-830T). Retrieved June 6, 2006, from http://www.consumer.gov/idtheft/pdf/gao-d02830t.pdf

General Accounting Office. (2004). Social Security numbers: Governments could do more to reduce display in public records and on identity cards (GAO-05-59). Retrieved June 6, 2006, from http://purl.access.gpo.gov/GPO/LPS55812

Gordon, G., Rebovich, D., Choo, K., & Gordon, J. (2007). Identity fraud trends and patterns: Building a data-based foundation for proactive enforcement. Retrieved October 30, 2007, from Utica College, Center for Identity Management and Information Protection Web site: http://www.utica.edu/academic/institutes/ecii/publications/media/cimip_id_theft_study_oct_22_noon.pdf

Identity Theft and Assumption Deterrence Act of 1998, Pub. L. No. 105-318 Stat. 3007 (1998). Retrieved June 1, 2006, from http://www.ftc.gov/os/statutes/itada/itadact.pdf

Identity theft and social security numbers: Hearings before the Subcommittee on Commerce, Trade, and Consumer Protection of the House Committee on Energy and Commerce, 108th Cong. (2004) (prepared statement of Thomas B. Leary).

International Technology Association of America. (2004). Comments on NIST FIPS 201 draft: Personal Identity Verification (PIV) for federal employees and contractors. Retrieved June 10, 2006, from www.itaa.org/es/docs/nistpivcomments.pdf

Jain, A., Bolle, R., & Pankanti, S. (1999). Biometrics: Personal identification in networked society. Retrieved April 15, 2005, from http://www.cse.msu.edu/~cse891/Sect601/textbook/1.pdf

Javelin Strategy and Research. (2006). The 2006 Identity Fraud Survey report. Retrieved April 27, 2006, from the Council of Better Business Bureau Web site: http://www.javelinstrategy.com/products/AD35BA/27/delivery.pdf

Javelin Strategy and Research. (2008). 2008 Identity Fraud Survey report: Consumer version—How consumers can protect themselves. Retrieved July 15, 2008, from http://www.idsafety.net/803.R_2008%20Identity%20Fraud%20Survey%20Report_Consumer%20Version.pdf

Johnson, S. (2004). Defending our borders is central to fighting terror. Retrieved May 16, 2005, from http://www.samjohnson.house.gov/News/DocumentSingle.aspx?DocumentID=20720

Kaid, L., & Wadsworth, A. (1989). Content analysis. In P. Emmert & L. L. Barker (Eds.), Measurement of communication behavior (pp. 197–217). New York: Longman.

KnowX. (2005). Standard: Public record info. Retrieved May 16, 2005, from http://www.knowx.com/home.exe?form=home/fa1_pr_about1.htm

Koehler, W. (2004). A longitudinal study of Web pages continued: A report after six years. Information Research, 9(2). Retrieved May 5, 2005, from http://InformationR.net/ir/9-2/paper174.html

Krebs, B. (2005). DNA key to decoding human factor: Secret Service’s Distributed Computing Project aimed at decoding encrypted evidence. Retrieved May 3, 2005, from http://www.washingtonpost.com/wp-dyn/articles/A6098-2005Mar28.html

Krebs, B. (2008). Speeding in Maryland could be hazardous to your identity. Retrieved August 20, 2008, from http://voices.washingtonpost.com/securityfix/2008/07/maryland_traffic_site_lists_so.html

Leedy, P., & Ormrod, J. (2001). Practical research planning and design (7th ed.). Upper Saddle River, NJ: Merrill Prentice Hall.

LexisNexis. (2005). InstantID. Retrieved May 12, 2005, from http://www.lexisnexis.com/instantid/printerfriendly.asp

Liddle, S., Yau, S., & Embley, D. (2001, November). On the automatic extraction of data from the hidden Web. Proceedings of the International Workshop on Data Semantics in Web Information Systems, Yokohama, Japan. Retrieved May 11, 2005, from www.deg.byu.edu/papers/daswis01.pdf

95

LoPucki, L. (2001). Human identification theory and the identity theft problem. Texas Law Review, 80, 89–134. Retrieved April 29, 2005, from http://ssrn.com/abstract=

263213 LoPucki, L. (2003). Did privacy cause identity theft? (Research Paper No. 03-5).

Retrieved April 28, 2005, from http://ssrn.com/abstract=386881 Lyman, P., & Varian, H. (2003). How much information? Retrieved April 30, 2005, from

http://www.sims.berkeley.edu/how-much-info-2003 Malin, B. (2002, December). Compromising privacy with trail re-identification: The

REIDIT algorithms (CMU-CALD-02-108). Pittsburgh, PA: Carnegie Mellon University, School of Computer Science.

Martin, E. (2001). GSA and Federal CIO Council launch e-gov inventory online.

Retrieved November 16, 2008, from http://www.gsa.gov/Portal/gsa/ep/content View.do?contentType=GSA_BASIC&contentId=9204&noc=T

Muijs, D. (2004). Doing quantitative research in education with SPSS. London: Sage. Myers, M. (1997, June). Qualitative research in information systems. MIS Quarterly,

21(2), 241–242. Retrieved April 30, 2005, from http://www.qual.auckland.ac.nz/ National Computer Security Center. (1991, September). Guide to understanding

identification and authentication in trusted systems (NCSC-TG-017 Library No. 5-235,479 Version 1). Retrieved April 30, 2005, from http://www.radium.ncsc .mil/tpep/library/rainbow/

National Institute of Standards and Technology. (2004). Knowledge based

authentication: Is it quantifiable? Retrieved June 9, 2006, from http://csrc.nist .gov/archive/kba/index.html

National Security Agency, Systems and Network Attack Center. (2006). The 60 minute

network security guide (first steps towards a secure network environment). Retrieved July 15, 2006, from http://www.nsa.gov/snac/support/I33-011R-2006 .pdf

Office of the Inspector General. (2004). Current practices in electronic records

authentication. Retrieved May 17, 2006, from http://www.ssa.gov/oig/ADOBEPDF/audittxt/A-04-04-24004.htm

Office of Management and Budget. (2003). E-authentication guidance for federal

agencies. Retrieved May 7, 2005, from http://www.whitehouse.gov/omb/ memoranda/fy04/m04-04.pdf

96

Office of Management and Budget. (2004). Policy for a common identification standard for federal employees and contractors (HSPD-12). Retrieved May 7, 2005, from http://www.whitehouse.gov/omb/memoranda/fy2005/m05-24.pdf

Office of Management and Budget. (2008, January). Report to Congress on the benefits

of the President’s e-government initiatives. Retrieved August 24, 2008, from http://www.whitehouse.gov/omb/egov/documents/FY08_Benefits_Report.pdf

Olsen, F. (2005, April 25). Shopping for data. Lawmakers have tough questions for

largely unregulated data firms. Retrieved May 8, 2005, from http://www.fcw .com/article88676-04-25-05-Print

O’Neill, E., McClain, P., & Lavoie, B. (1998). A methodology for sampling the World

Wide Web. Retrieved May 5, 2005, from http://digitalarchive.oclc.org/da/ViewObjectMain.jsp?objid=0000003447&frame= true

Pinheiro, R. (2004). Preventing identity theft using trusted authenticators. Journal of

Economic Crime Management, 2(1), 1–16. Retrieved August 3, 2008, from http:// www.utica.edu/academic/institutes/ecii/jecm/articles.cfm?action=issue&id=15 Property Records Industry Association. (2006). Privacy and public land records: Making

practical policy. Retrieved August 5, 2008, from http://www.pria.us/Papers/PRIA WhitePaperFinal010406.pdf

Protecting consumers’ data: Policy issues raised by ChoicePoint: Hearings before the

Subcommittee on Commerce, Trade, and Consumer Protection of the House Committee on Energy and Commerce, 109th Cong. (2005) (prepared statement of Deborah Majoras).

Protecting the privacy of the social security number from identity theft: Hearing before

the Subcommittee on Social Security, of the House Committee of Ways and Means, 109th Congress. (2007) (statement of the Property Records Industry Association (PRIA). Retrieved August 5, 2008, from http://waysandmeans.house .gov/hearings.asp?formmode=view&id=6348

Robson, C. (2002). Real world research: A resource for social scientists and

practitioner-researchers (2nd ed.). Malden, MA: Blackwell. Schneier, B. (1999, July). Mistakes and blunders: A hacker looks at cryptography.

Keynote presentation at Black Hat USA, Las Vegas, NV. Retrieved May 5, 2005, from http://blackhat.com/html/bh-media-archives/ bh-archives-97-98-99.html

97

Schwarzhoff, T., Dray, J., Wack, J., Dalci, E., Goldfine, A., & Iorga, M. (2003). Government smart card interoperability specification (Version 2.1). Retrieved November 16, 2008, from http://csrc.nist.gov/publications/nistir/nistir-6887.pdf

Social security number high-risk issues: Hearings before the Subcommittee on Social

Security of the House Committee of Ways and Means, 109th Cong. (2006) (prepared statement of Patrick P. O’Carroll, Jr.).

Solove, D. J. (2004). The digital person: Technology and privacy in the digital world.

New York: NYU Press. Song, D. X., Wagner, D., & Tian, X. (2001, August). Timing analysis of keystrokes and

timing attacks on SSH. Paper presented at the 10th USENIX Security Symposium, Washington, DC. Retrieved May 5, 2005, from www.usenix.org/events/sec01/full _papers/song/song.ps

State of Alabama, Office of the Governor. (2006). Governor Riley signs law to protect

Social Security numbers. Retrieved June 5, 2006, from http://www.governorpress .alabama.gov/pr/pr-2006-04-27-01-protectssn.asp

State of Tennessee. (2003, December 30). Letter to Lt. Gov. John S. Wilder TennCare

Enrollee Database Verification Project summary report. Retrieved May 13, 2005, from http://www.tennessee.gov/tenncare/pdf/ChoicePointreport123103.pdf

Stempel, G. & Westley, B. (1981) Research methods in mass communication. Englewood

Cliffs, NJ: Prentice- Hall. Sweeney, L. (2000). Uniqueness of simple demographics in the U.S. population (LIDAP-

WP4). Pittsburgh, PA: Carnegie Mellon University, Laboratory for International Data Privacy.

Synovate. (2003). Federal Trade Commission—Identity Theft Survey report. Retrieved

September 12, 2007, from http://www.ftc.gov/os/2003/09/synovatereport.pdf Tashakkori, A., Teddlie, C. (2003). Handbook of Mixed Methods in Social & Behavioral

Research. Newbury Park, CA: Sage. Temoshok, D. (2005, May). E-authentication: Creating an environment of trust. Paper

presented at the Postsecondary Electronic Standards Council 2nd Annual Conference on Standards and Technology, Washington, DC. Retrieved May 16, 2005, from http://www.pesc.org/events/ACTS2/presentations/Lunch%20E-Auth.

%20Temoshok.ppt U.S. Senate Finance Committee. (2007). Filing your taxes: An ounce of prevention is

worth a pound of cure. Retrieved December 17, 2007, from http://www.senate

98

.gov/~finance/hearings/testimony/2007test/041207testme.pdf Weber, R. P. (1990). Basic content analysis (2nd ed.). Newbury Park, CA: Sage. Willox, N. (2001). Identity theft: Authentication as a solution revisited. Retrieved May

12, 2005, from www.lexisnexis.com/risksolutions/conference/docs/authentication .pdf Zeller, T. (2005, May 18). Personal data for the taking. New York Times. Retrieved

November 16, 2008, from http://www.unm.edu/~pre/law/articles_advise/ technology.html

99

APPENDIX A. CODEBOOK

This codebook provides a listing of record types, attribute codes, and instructions

to enable the researcher to count identity attributes, or variables, present in online public

records. Coding was input directly into an Excel spreadsheet, which was set up to collect

the following data.

State: The name of the state from which the records were reviewed.

Number of Sites: The total number of sites for a selected state returned by the

keyword search. This includes all sites, whether freely accessible or not. This number is

collected immediately after the search results are returned following the instructions

below.

Number of Sites with IDA: The total number of sites within a state, after

exclusions, containing identity attributes.

Number of IDA by Record Type: The total number of different identity attributes

by record type. This variable is counted once per record type, irrespective of the number

of sites within the state’s record type containing the information.

A Priori Record Description: A description of each public record examined for

the purposes of this analysis. Records meeting any of the following conditions will be

excluded from examination, regardless of record type:

1. Archive or historical records. Records must be currently reported by the government office. Archive and historical records for individuals over 100 years old were not examined.

2. Fee-based records or those requiring a paid subscription.

3. Sites requiring registration.

4. Library, newspaper, or genealogical records.
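The four exclusions above amount to a simple filter applied to each site before it is coded. The following is a minimal sketch of that filter, not the procedure actually used in the study (coding was done by hand in Excel). The `Site` class, its field names, and the sample site names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical description of one site returned by a keyword search."""
    name: str
    is_archive_or_historical: bool = False
    requires_fee: bool = False
    requires_registration: bool = False
    is_library_newspaper_or_genealogy: bool = False


def is_excluded(site: Site) -> bool:
    """Return True if any of the four codebook exclusions applies."""
    return (
        site.is_archive_or_historical
        or site.requires_fee
        or site.requires_registration
        or site.is_library_newspaper_or_genealogy
    )


# Illustrative sites only; these are not records from the study.
sites = [
    Site("County court docket"),
    Site("State archive index", is_archive_or_historical=True),
    Site("Paid background search", requires_fee=True),
]

# Only sites that pass every exclusion are examined for identity attributes.
eligible = [s for s in sites if not is_excluded(s)]
```

Under these assumptions, the length of `sites` corresponds to "Number of Sites," and the sites remaining in `eligible` are the candidates for "Number of Sites with IDA."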

The following public records categories will be examined during the enumeration:

1. Accident Reports—Retrieved using the keywords “accident” and the state name in the Advanced Search - Match all keywords function.

2. Birth Records—Birth certificates or indices from official public records sources. Retrieved from Search Systems using Search Public Records by Type of Record -> Births.

3. Court Records—Civil or criminal court filings and case dispositions, civil suits, judgments, deed transfers, property liens, divorce records, and traffic court records. Retrieved using Search Public Records by Type of Record -> Court Records.

4. Inmate/Arrest Records—Retrieved using Advanced Search, with the state name in Match all keywords and the terms “arrest” and “inmate” in Match any keyword.

5. Marriage Records—Includes licenses and marriage certificates. Retrieved using Search Public Records by Type of Record -> Marriages.

6. Property Records/Deeds—County property tax assessment records. Retrieved using the Search Public Records by Type of Record -> Property – U.S.

7. Voter Registrations—Current data from the state on registered and inactive voters. Retrieved using the Search Public Records by Type of Record -> Voters.

Identity Attributes (IDA): Each attribute is counted when the record is reviewed

by placing a “1” in the field to indicate the presence of any of the following identity

attributes:

1. Name—First name, last name, middle name or middle initial, or combination of any of these

2. DOB—Any combination of the month, day, and year of birth

3. Birth Yr—Year of birth or the age is present

4. Mother’s Maiden Name—Mother’s maiden name

5. POB—City, state, or both of birth place

6. Address—Home address

7. SSN—Full 9-digit Social Security Number

8. Last 4 of SSN—Last 4 digits of the individual’s Social Security Number

9. Home Phone—Home phone number of an individual – with or without the area code

10. Driver’s License Number—Individual’s driver’s license number

11. Vehicle ID—Boat, car, plane, or other VIN, including license plate

12. Property Value/Sale Price—A property’s assessed value, mortgage amount, deed transfer amount

13. Last Year’s Prop. Tax—Current or last year’s assessed property tax amount

14. Sq. Ft.—Square Footage (Finished Area) of House

15. Phys. Des.—Details of an individual’s physical description, such as any combination of sex, gender, hair color, or eye color
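The codebook describes, in effect, one spreadsheet row per state and record type, with a 0/1 flag for each of the 15 identity attributes. A minimal sketch of that layout follows, assuming a dictionary per row. The function names and the sample state and counts are hypothetical; the attribute and record type labels are taken from the lists above.

```python
IDENTITY_ATTRIBUTES = [
    "Name", "DOB", "Birth Yr", "Mother's Maiden Name", "POB", "Address",
    "SSN", "Last 4 of SSN", "Home Phone", "Driver's License Number",
    "Vehicle ID", "Property Value/Sale Price", "Last Year's Prop. Tax",
    "Sq. Ft.", "Phys. Des.",
]

RECORD_TYPES = [
    "Accident Reports", "Birth Records", "Court Records",
    "Inmate/Arrest Records", "Marriage Records",
    "Property Records/Deeds", "Voter Registrations",
]


def new_coding_row(state, record_type, n_sites=0, n_sites_with_ida=0):
    """Create an empty coding row with every attribute flag set to 0."""
    if record_type not in RECORD_TYPES:
        raise ValueError(f"Unknown record type: {record_type}")
    row = {
        "State": state,
        "Record Type": record_type,
        "Number of Sites": n_sites,
        "Number of Sites with IDA": n_sites_with_ida,
    }
    row.update({attribute: 0 for attribute in IDENTITY_ATTRIBUTES})
    return row


def mark_present(row, attribute):
    """Place a 1 in the attribute's field; repeated marks leave it at 1."""
    if attribute not in IDENTITY_ATTRIBUTES:
        raise KeyError(f"Unknown identity attribute: {attribute}")
    row[attribute] = 1


def ida_count(row):
    """Number of different identity attributes found for this record type."""
    return sum(row[attribute] for attribute in IDENTITY_ATTRIBUTES)


# Illustrative values only; not data from the study.
row = new_coding_row("Virginia", "Voter Registrations", n_sites=5, n_sites_with_ida=3)
mark_present(row, "Name")
mark_present(row, "Address")
mark_present(row, "Name")  # a second sighting does not change the flag
```

Here `ida_count(row)` corresponds to "Number of IDA by Record Type": it counts each attribute once, irrespective of how many sites contained it.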

APPENDIX B. CODER FORM

An electronic Excel coding form, similar to that in Figure B1, was used to collect

data. The following steps were used to transfer data to the coder form.

1. For each state assigned to the coder, a search was performed at Search Systems within each category (Record Description) of public records. The following information was recorded as the data was reviewed:

a. The number of sites reported by Search Systems from the search in the column labeled “# of Sites.”

b. The number of sites containing public records after the application of the exclusions in the column labeled “# of sites with info.”

2. Each site was examined for the presence of identity attributes (IDA). For each site with identity information present, a “1” was placed in the appropriate IDA column on the coding form to indicate what type of information was found. At no time was actual personally identifiable data found at the site recorded onto the spreadsheet. Each identity attribute was enumerated only once per record category in each state. For instance, if there were 24 total sites containing information, with 8 of those sites containing birth dates, a single “1” was placed in the column under DOB to indicate that the date of birth was found in that record category.

3. For records that required a name lookup, a generic name, such as Jones or Smith, was used to retrieve a record. In some cases where property tax records were examined, a lookup in 411.com was performed, or a generic address (e.g., 101 Main St.) was provided to retrieve a record.
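The once-per-category rule in step 2 can be illustrated with the example given there: 24 sites with information, 8 of which contain birth dates, yield a single "1" under DOB. The following is a minimal sketch under the assumption that each site is summarized as a set of attribute labels; the function name and the sample findings are hypothetical. Only attribute labels are held, never personally identifiable data.

```python
def code_record_category(site_findings, attributes):
    """Collapse per-site findings into one 0/1 flag per attribute.

    site_findings: one set of attribute labels per site examined.
    attributes: the attribute columns on the coding form.
    """
    found = set()
    for site in site_findings:
        found |= set(site)
    # An attribute is flagged once if any site contained it.
    return {attribute: int(attribute in found) for attribute in attributes}


# 24 sites in one record category: 8 show names and birth dates, 16 show names only.
site_findings = [{"Name", "DOB"}] * 8 + [{"Name"}] * 16
flags = code_record_category(site_findings, ["Name", "DOB", "SSN"])
```

Under these assumptions, `flags` holds a 1 for Name and DOB and a 0 for SSN, regardless of how many of the 24 sites carried each attribute.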

Figure B1. Codebook.
