QUANTIFYING THE DISCOVERABILITY OF IDENTITY ATTRIBUTES IN
INTERNET-BASED PUBLIC RECORDS: IMPACT ON IDENTITY THEFT
AND KNOWLEDGE-BASED AUTHENTICATION
by
Margaret S. Leary
RICHARD YELLEN, Ph.D., Faculty Mentor and Chair
DANIELLE BABB, Ph.D., Committee Member
SALIM ZAFAR, Ph.D., Committee Member
Barbara Butts Williams, Ph.D., Interim Dean, School of Business & Technology
A Dissertation Presented in Partial Fulfillment
Of the Requirements for the Degree
Doctor of Philosophy
Capella University
November 2008
© Margaret Leary, 2008
Abstract
This study explored the comparative discoverability of identity attributes in Internet-
accessible public records. It offers a methodological framework for ascribing a
“discoverability factor” to identity attributes commonly used with knowledge-based
authentication systems. The study also sought to determine if correlations exist between
the frequency with which identity attributes are published in public records and reported
identity theft. Following a comprehensive literature review, a quantitative content
analysis was performed on a total of 6,598 public
records, enumerating the number and types of identity attributes found
in easily accessible public records. Public records were selected from those available on
SearchSystems.net, using a stratified purposive sampling approach. Descriptive statistics
were first computed and reported on the coded data, after which the identity attributes
were grouped using principal components analysis (PCA) to reduce the data. Correlations
to the Federal Trade Commission’s 2008 reported identity theft rates were performed
using Spearman’s rho at a significance level of .05. The findings from this research
supported the overall hypothesis of a relationship between the amount of identity
information in online public records and identity theft. The findings did not, however,
establish a cause-and-effect relationship between the two, and sampling limitations
somewhat weaken their generalizability. Identity attribute indices were assigned based on the frequency with
which they were found, with a discussion provided on the possible use of discoverability
as a factor to be considered when developing an identity confidence scoring algorithm.
Dedication
This work is dedicated to family, friends, and coworkers at both Northern
Virginia Community College and Nortel Government Solutions, all of whom provided
me with limitless support during the completion of this study.
Acknowledgments
The successful completion of this degree is due in no small part to the efforts of
many people, including my family, who suffered along with me every step of the way
during the interminably long period of time it took for me to complete this dissertation
and degree. It is with deep gratitude and appreciation that I acknowledge the following
individuals for their contributions.
First, I gratefully acknowledge the support I received from the administration at
Northern Virginia Community College (NVCC). Without their support of my 7-year
externship with Nortel Government Solutions, I would not have had access to the
industry and resources that made this work possible. Specifically among my colleagues at
NVCC, I would like to thank Professor Kevin Reed for allowing himself to be used as a
sounding board—I offer to return the favor as he is now performing his own doctoral
work. Too, I must thank my friend, Maryann Daimler, for her constant prodding to get
the study completed.
I would also like to thank my mentor and committee chair, Dr. Richard Yellen, to
whom I certainly lied when I informed him I was “low maintenance” 5 years ago when I
started in the program. I also wish to thank my other committee members, Dr. Danielle
Babb and Dr. Salim Zafar, who never seemed surprised to hear from me no matter how
many months (years) had passed since I’d last contacted them.
Table of Contents
Acknowledgments
List of Tables
List of Figures
CHAPTER 1. INTRODUCTION
Background of the Study
Statement of the Problem
Significance of the Study
Purpose of the Study
Rationale
Research Questions
Nature of the Study
Definition of Terms
Assumptions and Limitations
CHAPTER 2. LITERATURE REVIEW
Overview of the Chapter
Identity Theft
Identity Data Aggregation
Knowledge-Based Authentication
Identity Authentication in Federated Environments
Measuring the Effectiveness of Authentication by Knowledge
Measuring the Effectiveness of Authentication Through Possession
Measuring the Effectiveness of Biometric Authentication
Measuring the Effectiveness of KBA
Evaluating KBA Using Guessability
Evaluating KBA Through False Acceptance and Rejection Rates
Evaluating KBA Through Other Methods
CHAPTER 3. METHODOLOGY
Introduction
Research Approach
Sampling Design
Data Collection
Measurement Strategy
Data Analysis Strategy
Data Display
CHAPTER 4. RESULTS
Introduction
Data Collection
Data Coding and Categorization
Descriptive Statistics
Tests of Research Questions
Summary
CHAPTER 5. RESULTS, CONCLUSIONS, AND RECOMMENDATIONS
Summary of the Study
Summary of the Research Findings
Implications of the Study
Contributions of the Study
Limitations
Recommendations for Future Research
Conclusion
REFERENCES
APPENDIX A. CODEBOOK
APPENDIX B. CODER FORM
List of Tables
Table 1. Descriptive Statistics of Public Records Sites Across the 50 States
Table 2. Descriptive Statistics of Identity Attribute Types by Category
Table 3. Frequencies and Percentages of Identity Attributes
Table 4. Descriptive Statistics of Identity Attributes Within Each of the 50 States
Table 5. Variance Explained by Resulting Components
Table 6. Components and Respective Identity Attributes
Table 7. Comparative Discoverability Index for the Groups of Identity Attributes
Table 8. Comparative Discoverability Index for the Identity Attributes
Table 9. FTC 2007 State Identity Theft Rankings
Table 10. Spearman Rho Correlations Between Identity Attributes and Identity Theft Ranking
Table 11. Spearman Rho Correlations Between Attribute Groups and Identity Theft Ranking
Table 12. Summary of the Hypotheses Testing
List of Figures
Figure 1. Conceptual framework
Figure 2. Total number of public records sites by record category
Figure 3. Scree plot from resulting EFA procedure
Figure B1. Codebook
CHAPTER 1. INTRODUCTION
In January 2001, the Chief Information Officer’s Council revealed a database
cataloguing over 1,300 federal electronic government initiatives (Martin, 2001)
developed with the intention of meeting the President’s Management Agenda of
providing online access for citizens and businesses to interact with government. Citizens
and businesses are increasingly utilizing these services, as demonstrated by the IRS’s
Free File E-Government solution that saw more than 4 million tax returns submitted in
fiscal year 2007 at a cost savings of $9.2 million to the government (Office of
Management and Budget [OMB], 2008, p. 5).
Paralleling this rapid rise in citizen-centric Web applications is a requirement for
higher identity authentication assurance levels as citizens and businesses attempt to
access these online services. The OMB directed that these electronic transactions be
secure and maintain citizen and agency privacy, thus requiring “some type of identity
verification or authentication” (Bolton, 2003, p. 1). In related guidance documents, the
National Institute of Standards and Technology (NIST) defined the processes for
establishing confidence in user identities presented electronically as “E-Authentication”
(as cited in Burr, Dodson, & Polk, 2006, p. vi) and provided guidelines for selecting
authentication technologies and protocols suitable to application assurance levels defined
in the previously referenced OMB directive.
Traditional user authentication systems, such as Personal Identification Numbers
(PINs) and passwords, have proven problematic for federal agencies and their citizen
application users. Separate registration processes and credential (PIN, password, etc.)
issuance are expensive for agencies to maintain and can be burdensome for citizens who
may have infrequent dealings with an agency.
As a possible solution, one of the authentication technologies under consideration
for use with e-government by NIST is Knowledge-Based Authentication (KBA;
Chokhani, Dodson, Hastings, Burr, & Polk, 2006). KBA authenticates user identities on
the basis of “shared secrets” using the individual’s personal attributes, such as last name,
first name, Social Security Number (SSN), and date of birth. Depending on the assurance
level required by the application, a multistep approach may be used that first verifies that
electronically presented identities are valid based on these personal attributes and then
uses a separate set of more difficult questions based on financial or personal information
(such as car payment or mortgage amount) obtained from proprietary sources or public
records to bind the identity to the individual presenting the identity for access to the
application.
KBA is usually performed through intermediary services referred to as knowledge
brokers or commercial data resellers that purchase and mine personal identity attributes
and information from both public and purchased proprietary records. These services then
resell these personal data to government agencies and corporations in the form of
background check services or, more recently, identity authentication services for Web-
enabled applications. As an authentication technology, KBA systems afford an easy
method for citizens to authenticate to a Web-based application, especially where there is
not a previously established relationship with the application’s organization.
The effectiveness of KBA is difficult to evaluate and its use gives rise to privacy
concerns where access may be granted to an application containing information that can
be used to facilitate identity theft, such as with a birth certificate. Does the prevalence of
this personal identity information in public records contribute to identity theft? To what
extent does the level of trust an organization can place in KBA depend on the difficulty
with which this personal identity information can be discovered in public records? This
exploratory study sought to address these questions in order to provide guidance to
government agencies interested in using KBA to meet their authentication requirements.
Background of the Study
NIST defined electronic authentication as “the process of establishing confidence
in user identities electronically presented to an information system” (as cited in Burr et
al., 2006, p. vi). The level of confidence in presented identity credentials is contingent on
the processes used to validate that the claimed identity actually exists and is bound to the
identity claimant at the time of enrollment.
Traditional shared-knowledge authentication systems rely on a previously
established relationship between the authenticator and identity claimant that has already
verified identity in some manner. A shared secret, such as a PIN or password, is then
used to bind the identity to the identity claimant when access to an electronic system is
required. Having to establish and maintain a relationship in order to manage a PIN or
password is not an effective approach for identity management in cases where the user
infrequently requires services. In February 2004, NIST hosted a symposium, “Knowledge
Based Authentication: Is It Quantifiable?” Conference information posted by NIST noted
that in instances where infrequent access is needed to conduct business with federal
agencies, “other authentication tools such as passwords and PKI certificates can be
expensive to administer for the application provider and difficult to use for the remote
individual” (NIST, 2004). NIST suggested that KBA could prove to be a useful
authentication tool in these instances.
KBA services generally function by presenting a series of questions at the time of
login to the identity claimant. The identity claimant must answer all or some of the
questions accurately in order to be successfully authenticated to the system. Basic
questions such as first name, last name, middle name, SSN, and address are asked within
the services and can then be followed by more challenging questions such as the make
and model of an individual’s car, car payment amount, questions regarding home
improvements performed on the house or the amount of money for which a homeowner
purchased a home. These systems can be customized to application owner requirements
in the number and type of questions asked during authentication. Static questions
(name, SSN, or address) are easier for the application user to answer; however, they tend
to be populated more frequently in accessible databases. Temporal questions are based on
attributes that are disclosed less frequently, such as “Which of the following
individuals lived at your residence at 12345 Brown Street?” These are usually
presented with multiple-choice answers, as they tend to be more difficult for users to
answer accurately.
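The question-and-answer flow described above can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's implementation: the `Question` type, the `normalize` helper, the `authenticate` function, and the `min_correct` threshold are all hypothetical names introduced here, and real KBA services use proprietary matching and scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One KBA challenge (hypothetical structure, for illustration only)."""
    prompt: str
    expected: str  # answer held by the assumed KBA service
    kind: str      # "static" or "temporal"


def normalize(answer: str) -> str:
    """Lowercase the answer and collapse runs of whitespace before comparing."""
    return " ".join(answer.lower().split())


def authenticate(questions, answers, min_correct):
    """Return True when at least `min_correct` answers match the stored values."""
    correct = sum(
        normalize(given) == normalize(q.expected)
        for q, given in zip(questions, answers)
    )
    return correct >= min_correct


# Fictional example data: two static questions and one temporal question.
questions = [
    Question("Last name?", "Doe", "static"),
    Question("Date of birth?", "1970-01-01", "static"),
    Question("Monthly car payment?", "312", "temporal"),
]
answers = ["doe", "1970-01-01", "299"]
passed = authenticate(questions, answers, min_correct=2)
```

Requiring all answers (`min_correct=len(questions)`) corresponds to the stricter configuration mentioned in the text, while a lower threshold tolerates legitimate users who misremember a temporal attribute.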
Many of these data are culled from public records, with knowledge brokers (also
known as commercial data resellers) sending personnel out to copy paper records into
their databases where electronic records are not available (“After the Breach,” 2005) or
from private companies with customer data to sell (as with utility companies or insurance
agencies). Knowledge-based authentication services rely on this information being
limited in its distribution such that only the legitimate identity claimant would
successfully be able to answer a series of these questions; however, the ease with which
the majority of identity data can be accessed through public records raises a concern that
knowledge-based authentication systems are establishing identity on the basis of “pseudo
secrets.” In some states, even an individual’s SSN can still be found in public records.
Alabama recently signed into law a bill requiring an individual’s consent to having their
SSN revealed on state documents prior to their release; however, exceptions are made for
liens, conviction records, and bankruptcy filings (State of Alabama, 2006).
The effectiveness of KBA as an authentication technology has proven elusive to
validate. KBA vendors cite the use of proprietary algorithms, some supplying confidence
scores based on whether or not the data were retrieved from public sources versus records
purchased from proprietary sources. These algorithms, however, are not made public for
academic review and testing such as is common with cryptographic algorithms. Some
KBA services make reference to “false negatives” and “false positives” when discussing
validation effectiveness (Barrett, 2004). The term “false negative” is used in these cases
to reference a valid user who has been denied access to a service because the input the
user provided did not match that stored in the service’s database. The term “false
positive” signifies an event where a person who does not rightfully own the identity has
been provided access, presumably due to being able to guess or discover sufficient
answers to questions posed by the KBA service provider. The use of false positives and
negatives to quantify authentication effectiveness is more appropriate to technologies
such as biometrics, as KBA lacks an inherent ability to test and gather metrics on
false positive rates. While metrics can be captured through customer
service complaints from authentic identities that have been incorrectly denied access, it is
unlikely that application owners would receive notices from someone intentionally
spoofing another’s identity. Other proposed approaches by independent researchers to
evaluating KBA have included applying probability metrics based on the ease with which
the data can be guessed (Chokhani, 2003) as well as more recent efforts to develop a
generic KBA model based on probabilistic models (Chen, 2007). Both of these methods,
however, fail to factor in identity attribute discoverability.
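To make the idea of a guessing-based metric concrete, the sketch below computes the chance that an impostor passes a multiple-choice KBA challenge by guessing uniformly at random. It is an illustrative binomial calculation under stated assumptions (independent questions, equally likely choices), not the model proposed by Chokhani (2003) or Chen (2007); the function name and parameters are hypothetical.

```python
from math import comb


def guess_success_probability(n_questions, n_choices, min_correct):
    """Probability of getting at least `min_correct` of `n_questions` right
    when each question has `n_choices` equally likely options and the
    impostor guesses independently and uniformly at random."""
    p = 1.0 / n_choices
    return sum(
        comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
        for k in range(min_correct, n_questions + 1)
    )


# Four questions with five choices each, all of which must be answered correctly.
all_four = guess_success_probability(4, 5, 4)
# The same challenge when three of four correct answers are accepted.
three_of_four = guess_success_probability(4, 5, 3)
```

A calculation of this kind treats every wrong choice as equally plausible to the attacker. If an answer can be looked up in a public record, the effective per-question probability is far higher than `1 / n_choices`, which is the gap that a discoverability measure is meant to address.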
In a graduate information security course at Johns Hopkins University, students
used electronic public records to gather over a million records covering hundreds of
thousands of individuals (Zeller, 2005). This proliferation of personal identity
information in government public records is a growing concern of politicians and
government agencies. A Government Accountability Office (GAO) study reported finding
SSNs in records of more than 41 states. While federal agencies are prohibited from
posting SSN information publicly, the GAO (2004) found that an estimated 15–28% of the
nation’s 3,141 counties did make them publicly accessible over the Internet. In an earlier
report, the GAO found that identity-theft-related crimes were enhanced by the growth of
the Internet as it “increases the availability and accessibility of personal identifying
information” (2002, p. 6), linking this increase in availability to an increase of identity
theft by aliens. The GAO reports, however, do not offer empirical data to support their
statement and no studies could be found attempting to correlate identity theft rates to the
availability of electronic public record databases. A study is necessary to determine the
impact that electronically available public records have on identity assurance and KBA
through a correlation to identity theft rates.
Statement of the Problem
The effectiveness of most authentication technologies can be measured either by
their susceptibility to guessing or through performance testing to determine the
percentage of users who were incorrectly denied access or accurately granted access.
Both of these measurement methods fall short in their ability to assess the effectiveness
of authentication systems that use personal identity attributes that are accessible by the
general public. The goal of this research was to determine the extent to which this
authentication information is discoverable on the Internet and the impact discoverability
has on identity theft and assurance.
Significance of the Study
This study serves to lay the groundwork for the development of a KBA
assessment methodology. To accurately evaluate the effectiveness of KBA systems, the
discoverability of personal identity attributes and their impact on identity theft must first
be quantified. The identification of the frequencies with which identity attributes are
found in public records can assist in assigning a “discoverability factor” to each attribute.
This discoverability factor can be used by government agencies to map the selection of
specific identity attributes to appropriate e-authentication assurance levels.
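One simple way to turn coded frequencies into a per-attribute factor is to take the share of sampled records in which each attribute appears. The sketch below assumes that definition purely for illustration; the study's own indices are reported in its results tables and may be computed differently. The function name and the sample data are hypothetical.

```python
from collections import Counter


def discoverability_factors(records):
    """Map each attribute to the fraction of records that contain it.

    `records` is an iterable of collections of attribute names, one per
    coded public record. An attribute is counted once per record.
    """
    records = [set(rec) for rec in records]
    if not records:
        return {}
    counts = Counter(attr for rec in records for attr in rec)
    total = len(records)
    return {attr: n / total for attr, n in counts.items()}


# Fictional coded records, each listing the attributes found in it.
coded = [
    {"last_name", "ssn"},
    {"last_name"},
    {"last_name", "date_of_birth"},
    {"date_of_birth"},
]
factors = discoverability_factors(coded)
```

Under this assumed definition, a higher factor means the attribute is easier to find and is therefore a weaker candidate for a KBA question at higher assurance levels.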
Purpose of the Study
The purpose of this quantitative study was twofold:
1. Examine the comparative discoverability of identity attributes in online public records, associating a “discoverability factor” to individual and linked identity attributes where specific combinations of identity attributes occur.
2. Determine if correlations exist between the frequencies with which identity attributes can be found in public records and instances of identity theft.
Rationale
The vulnerabilities that a lack of strong identity management practices present to
national security are significant. Weak authentication of citizens to
government services can give terrorists access to “breeder documents,” or documents
that are used to obtain other documents for identity, such as drivers’ licenses, social
security cards, and birth certificates. With these documents, long-term identities can be
established and maintained. This was the case with the 9/11 terrorists, all of whom had
valid drivers’ licenses and were, in some cases, even registered as U.S. citizens to vote
(Johnson, 2004). David Temoshok (2005), the Director of the General Services
Administration’s (GSA) Identity Policy and Management Office, which has been tasked with
implementing the E-Authentication Portal, suggested that trust in the identity verification
procedures is one of the critical issues of federated identity.
In a public response to NIST’s draft version of Federal Information Processing
Standard (FIPS) 201, the Information Technology Association of America (ITAA, 2004)
requested that NIST consider the use of knowledge-based authentication, incorporating
it where appropriate into the identity verification procedures used to provide
identification cards to all government employees, contractors, and their affiliates. This
identity card is mandatory for access to all government facilities and systems. The use of
an identity authentication technology that has not been adequately evaluated could
present a risk to the security of national systems.
Government agencies are already in the process of implementing KBA to meet
critical electronic authentication needs. The Social Security Administration (SSA) is
presently using KBA to allow beneficiaries to change mailing addresses for their Social
Security checks, as well as check their Social Security benefits and apply for direct
deposit using the Internet (Office of the Inspector General [OIG], 2004). Furthermore, the
access to personal information with even seemingly low-risk applications such as those
used by SSA creates an opportunity for an identity thief to collect, or aggregate, identity
information that can contribute to identity theft. For this reason, access to personal
information must be protected using authentication technologies appropriate to the level
of risk as defined in the e-Authentication Guidelines.
Research Questions
Using Cooper’s management research hierarchy (Cooper & Schindler, 2003), the
following management dilemma and research questions were defined and examined
within this study.
Management Dilemma
KBA provides a cost-effective and quick method for authenticating citizens and businesses to government applications; however, it may not provide sufficient identity assurance to meet OMB e-Authentication guidelines, as much of the authenticating data may be easily discovered on the Internet.
Management Questions
1. What personal identity attributes offer higher levels of assurance when selecting the questions asked with KBA services?
2. To what extent do online public records provide information that can be used by an identity thief to build a more comprehensive identity profile of their target, increasing the likelihood of spoofed identities with KBA services?
Research Questions
1. Can the frequency with which identity attributes are accessible on the Internet be used as an indicator of discoverability?
2. Is there a correlation between reported identity theft rates and the availability of personal data in public databases?
Investigative Questions
1. Who are the major KBA service providers in the industry?
2. What personal information do the major KBA service providers require for authentication?
Measurement Questions
1. What identity attributes appear both singly and in combination with other attributes most frequently in public records databases?
2. Does a correlation exist between the publication of personal identity attributes in public records and identity theft?
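The abstract states that correlations were tested with Spearman’s rho. As a rough sketch of what that statistic measures, the code below ranks two variables (using average ranks for ties) and computes the Pearson correlation of the ranks. It uses only the Python standard library, omits the significance test, and all function names and the sample figures are hypothetical rather than taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


# Fictional per-state figures: attribute counts and identity theft rates.
attribute_counts = [120, 340, 90, 410, 275]
theft_rates = [55.0, 81.2, 60.3, 97.5, 70.1]
rho = spearman_rho(attribute_counts, theft_rates)
```

Because the statistic depends only on rank order, it suits data such as state rankings, where the relationship may be monotonic without being linear.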
Nature of the Study
This study enumerated the frequency with which personal identity attributes
reside in public records databases and explored the impact of their discoverability on
identity theft. The methodology used in this study is conceptualized in Figure 1.
Figure 1. Conceptual framework.
Research Problem: Governments and organizations are increasingly using KBA service providers to authenticate individuals to online services using personal data as identity verifiers. What level of identity assurance can KBA provide given the availability of personal data to identity thieves? Does the use of KBA serve to increase the likelihood of identity fraud/theft in its ability to provide access to “breeder documents”?
Research Question 1: What is the frequency, or discoverability, of personal identity attributes in public records databases?
Research Question 2: Is there a correlation between identity theft rates and the availability of personal data in public databases?
Outcome of Research: Ascribe a probability, or discoverability factor, to individual and linked identity attributes.
Research Method: Quantitative, using content analysis to categorize and code Web-based public records content.
Research Method: Quantitative, examining results for a correlation between discoverability and identity theft rates.
Test Hypothesis: Greater rates of personal data in online government public records will correlate to higher incidences of identity theft/fraud.
Independent Variable (IV): Personal data available in government-provided online records.
Dependent Variable (DV): Rates of identity theft.
Definition of Terms
Following are definitions for terms that were used in this study.
Authentication. The process of binding a user identity to an individual with a
specific level of assurance.
Credential Service Provider. An organization or service that issues identity
credentials (e.g., passwords or tokens) after the identity has been verified.
Credit Card Fraud. The use of a credit card by an unauthorized party facilitated
by inadvertent disclosure of identity information.
Discoverability Factor. The degree to which specific identity attributes can be
found by individuals other than the target identity.
E-Authentication. The process of binding a user identity to an individual with a
specific level of assurance during an online, or electronic, transaction.
FIPS-201. Federal Information Processing Standard entitled “Personal Identity
Verification of Federal Employees and Contractors.” Specifies the identity proofing,
credentialing, and personal identity verification card requirements for federal employees,
contractors, and their affiliates.
Identity Assurance. The level of trust that can be placed in an identity presented
by an identity claimant.
Identity Attributes. Characteristics associated with an identity that must be
presented to authenticate one’s identity. Examples include last name, first name, SSN,
and date of birth.
Identity Claimant. An individual who is presenting an identity for verification
during the authentication process.
Identity-Proofing. The process of validating that an identity exists and verifying,
or binding, an identity to an identity claimant.
Identity Theft. The loss or disclosure of identity attributes sufficient for another
individual to impersonate that individual. Usually performed to enable the thief to
commit a crime while using the impersonated individual’s identity.
Identity Validation. The process in which the identity presented by an identity
claimant is checked to ensure that the identity is a real one. Usually includes checks
against databases to ensure that the addresses used are real and the individual is not
deceased.
Identity Verification. The process in which the identity claimant provides
sufficient proof (i.e., by answering questions that only that individual should know) to
effectively “bind” the identity to the claimant.
Personally Identifiable Information (PII). Information that identifies an individual
either directly or by reference using an individual’s unique identity characteristics, or
attributes, such as name, date of birth, mailing address, telephone number, SSN, e-mail
address, or other information that links the individual to that identity.
President’s Management Agenda. A strategy established by President George W.
Bush in the summer of 2001 for improving federal government services in five areas of
management weakness.
Static Attributes. Personal characteristics that do not frequently change (e.g., date
of birth).
Temporal Attributes. Personal characteristics that do frequently change (e.g.,
address, employment, individuals living at address). Also referred to as “dynamic”
attributes.
Transitive Trust. The acceptance of an identity verified by one system at a
different system without additional verification.
Assumptions and Limitations
As will be discussed in chapter 3, the dynamic nature of the Internet provides
significant challenges to researchers when attempting to reproduce Web-based content
analysis. It was assumed that the majority of information contained in electronically
accessible public records would reside in online databases, rather than in other types
of documents or objects. While the expected lifetime of an online database record has
been demonstrated to be longer than that of other Web-based content (Koehler, 2004), the
dynamic nature of the Web limits the life expectancy of the content analyzed within this
study. Additionally, this study focused on those public records that can be electronically
discovered using the Internet. It did not address the discoverability of public records
freely available to the public in nonelectronic forms, nor did it address databases
containing personal information that are held by private entities, such as retail stores or
insurance companies, that are sold to data aggregators. These records are more difficult to
sample; however, they should not be overlooked in their contribution to identity
aggregation. Finally, as a result of several very public data breaches at data aggregators,
there was a heightened awareness of identity theft and identity attribute aggregation at the
time this study was conducted. As a result, the legislative landscape is rapidly evolving
and, in some cases, is in direct conflict with state and federal goals of electronically
enabling public records. Pending legislation may limit the availability of personally
identifiable information in the future.
CHAPTER 2. LITERATURE REVIEW
Overview of the Chapter
This review focused on examining existing literature for research relating to the
impact of easy access to identity attributes on identity theft. A discussion of identity theft
is provided, as well as the impact of identity data aggregation for use with knowledge-
based authentication. Literature has been selected that can provide a foundation for the
comparison of methodologies used to assess the effectiveness of other authentication
technologies to those based on personal identity attributes. A literature review for the
research methodology used in the study is provided separately in chapter 3.
Identity Theft
In a final rule issued in October 2004, the Federal Trade Commission (FTC)
defined the term identity theft as “a fraud that is committed or attempted, using a person’s
identifying information without authority” (2004, p. 1). It is important to distinguish that
the use of another individual’s information is illegal only if used for fraudulent purposes.
The FTC proposed that “identifying information” should be synonymous with “means of
identification” cited in the federal criminal statute relating to identity fraud (Public Law
105-318). Identification information includes “any name or number that may be used,
alone or in conjunction with any other information, to identify a specific individual”
(Identity Theft and Assumption Deterrence Act, 1998, p. 2). Specifically cited in the
statute are name, SSN, birth date, driver’s license or identification numbers, alien
registration number, government passport number, employer or taxpayer identification
number, and e-mail address, among other biometric and telecommunications information
(Identity Theft and Assumption Deterrence Act).
While the Privacy Act of 1974 prevents the disclosure of personally identifiable
information on citizens held in federal government databases, federal court systems are
exempt from this requirement and are even allowed to disclose citizen SSNs in public
records (GAO, 2004). The SSA suggested that misuse of identity information will be
difficult to reverse while this information is available to the public (“Social Security
Number,” 2004), which supports the hypothesis that the prevalence of these data in
public records is a contributing factor to identity theft.
While many of these records do not contain all of the information necessary to
successfully complete authentication to an online application using KBA, the existence of
multiple sources of information containing pieces of the authentication puzzle allows an
identity thief to compile data and profile a target. Solove described the problem
associated with data aggregation wherein, in isolation, a particular piece of information
may not be invasive of one’s privacy; however, when such pieces of data are amassed,
they effectively form a digital biography, or digital “dossier” (2004, p. 1) on the
individual.
As a result of the proliferation of personal data on the Internet, KBA may not
provide agencies with sufficient identity confidence and may, in fact, increase the
problem of personal data aggregation that can lead to an increased likelihood of identity
theft through providing access to additional “breeder documents” such as birth
certificates or marriage licenses. Eventually, an attacker can amass sufficient identity
information to “spoof” the identity of a valid user at online applications, including
government applications.
In the FTC’s annual commissioned study on identity theft complaints, 258,427
consumers reported having their identity stolen in 2007; in 11% of these cases, the identity
was stolen to facilitate government documents/benefits fraud (FTC, 2008). In a report to the
Senate, IRS Commissioner Mark Everson placed the figure at almost 1.5 million individuals who had
their personal information misused to obtain government documents, tax forms, or tax
refunds (U.S. Senate Finance Committee, 2007).
Some reports, however, suggest that identity theft is not on the increase and point
to reasons other than disclosure through public records as its principal source. Javelin
Strategy and Research has performed several studies on the topic. The first of their
reports, the 2006 Identity Fraud Survey Report released by Javelin Strategy and Research
with the Council of Better Business Bureaus, challenged the belief that access to personal
identity information leads to identity fraud. The report cited a marginal decline in overall
identity theft rates from 4.7% to 4.0% from 2003 to 2006, consistent with the overall
decline of reported identity fraud complaints in 2006 cited by the FTC (2006). The
Javelin report stated that the majority of identity theft occurs through “traditional”
means, such as lost or stolen wallets or credit cards, and not through the Internet. Furthermore,
Javelin’s survey results indicated that almost half (47%) of the reported fraud
incidents were perpetrated by someone the victim knew.
Javelin’s original study had several weaknesses associated with it. First, sponsors
of the survey included CheckFree, Visa, and Wells Fargo & Company—financial
services companies that may have a less-than-impartial interest in the outcome as their
intent is to instill trust in their online services. Secondly, Javelin’s narrow interpretation
of identity fraud as “the unauthorized use of another’s personal information to achieve
illicit financial gain” (2006, p. 11) makes it only applicable to financial accounts. Despite
this constraint, Javelin generalized the outcome to all forms of identity fraud, indicating
that, contrary to growing fears on the subject, identity fraud and data compromise were
contained and less widespread than thought.
The report was updated in 2007 and re-released in February 2008. In the latest
report, Javelin differentiated identity theft from identity fraud, stating that “Identity theft
happens when your personal information is accessed by someone else without your
explicit permission” (2008, p. 5). Personal information is defined as “Social Security
number, bank or credit card account numbers, passwords, telephone calling card
numbers, birth dates, name, address and so on” (Javelin, p. 5). In this report, Javelin
stated that “with even the most basic information, a criminal can either take over your
existing financial accounts or use your identity to create new ones” (p. 5). As with the
first report, Javelin emphasized the role that traditional methods play in identity theft
(79% in those cases where the victim knew how the criminal obtained the identity
information), defining it in this report as “when a criminal can make direct contact with
the consumer’s personal identification” (p. 5).
It is significant to note, however, that only 155 (35%) of the 445 victims surveyed
in the study actually knew how the data were accessed (Javelin, 2008). The traditional
methods discussed earlier were therefore responsible for only 123 (27.6%) of the
total 445 incidents of identity theft. As a result, Javelin’s reports, which Javelin
claimed are the largest ever on identity fraud, do not provide substantial evidence that
identity theft is most often perpetrated by individuals personally knowing or having
traditional access to the victim’s personal data.
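The arithmetic behind these percentages can be checked directly. The short Python sketch below uses only the figures quoted above (445 victims surveyed, 155 who knew how their data were accessed, and a 79% traditional share among those 155). The variable names are illustrative, and rounding the fractional case count up to a whole case is an assumption made to match the count of 123.

```python
import math

surveyed = 445             # victims surveyed (Javelin, 2008)
knew_method = 155          # victims who knew how their data were accessed
traditional_share = 0.79   # share of those cases attributed to traditional methods

# Assumption: the fractional case count (about 122.45) is rounded up to a whole case.
traditional_cases = math.ceil(knew_method * traditional_share)

share_knew = knew_method / surveyed
share_traditional_of_all = traditional_cases / surveyed

print(f"victims who knew the method: {share_knew:.1%}")
print(f"traditional cases: {traditional_cases}")
print(f"traditional share of all victims: {share_traditional_of_all:.1%}")
```

Expressed against the full sample rather than against the 155 informed victims, the traditional-method share falls from 79% to under 28%.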
In actuality, while there are many reports on the number of victims and the impact
of identity theft, there has been little academic research performed regarding the
methodology used by criminals to perpetrate the crime. One study, performed in October
2007 by the Center for Identity Management and Information Protection at Utica College,
acknowledged the dearth of research into this area and refuted other findings of
preestablished relationships between the identity thief and the victim, stating that “while
there were instances in which relatives and friends proved to be the perpetrators, they
were in the minority” (Gordon, Rebovich, Choo, & Gordon, 2007, p. 66). The study
examined 517 closed identity theft cases collected from the Secret Service. Among other
findings, the study revealed that in approximately 20% of the 102 total cases from 2001–
2004, the Internet was used in some manner to commit the crime. In 27 of these cases, it
was specifically used to “search databases” (p. 51).
Identity Data Aggregation
In a frequently referenced article among privacy advocates, Solove (2004) posited
a scenario in which the government compels individuals to provide personal data about
themselves, places these data into public records databases, and then makes the
information freely available on the Internet or provides it for commercial use upon
request. Solove proceeded to enumerate some of the public records maintained by
federal, state, and local governments, including births, deaths, marriages, divorces,
professional licensure, voting information, bankruptcy records, and so forth, arguing that
we are seeing the creation of architectures of vulnerability that leave individuals
susceptible to identity theft. Solove is not alone in his belief that the ready access to the
thousands of databases provides for digital profiling. In extensive works on the topic for
the Department of Homeland Security and on the topic of privacy, Carnegie Mellon
researcher Dr. Latanya Sweeney successfully demonstrated that even when databases are
devoid of explicit identifiers such as name, address, or SSN, certain combinations of
identifiers, termed quasi-identifying, provide sufficient information to link an individual
to a record containing explicit identifiers. As an example, birth date, gender, and zip code
information combine to uniquely identify 87% of the population of the United States
(Sweeney, 2000). Sweeney additionally noted that more than half (53%) of the U.S.
population are likely to be uniquely identified through only the combination of the city,
town, or municipality in which they reside, their gender, and date of birth. With only the
county, gender, and date of birth, 18% of the U.S. population can be identified. Malin
provided an example in which quasi-identifying data from two tables can be used for
identification:
Given two tables, W_i(name, date of birth, gender, zip code) and W_j(year of birth, gender, IP address). Under the assumption that the IP address has not been spoofed, a relationship between the IP address of a computer and the geographic zip code can be established. As such, the linkage attribute set S_ij is defined as: S_ij = {<date of birth, year of birth>, <gender, gender>, <zip code, IP address>}. (2002, p. 7)
In this example, a linkage between the IP address and the geographic zip code can
be made that makes the other data relevant to identity. While Sweeney’s recent work has
focused on helping medical organizations comply with new privacy requirements, it
nonetheless illustrates the ease with which data records can be combined to compile
the digital dossier that Solove addressed.
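The linkage that Malin described can be sketched in a few lines of Python. Everything in the sketch is hypothetical: the records, the names, the IP-to-zip lookup table, and the `link` function are invented for illustration and are not drawn from Malin's or Sweeney's data. It assumes, as the quoted passage does, that an IP address can be mapped to a zip code, and it treats a record as re-identified only when exactly one named row matches.

```python
from datetime import date

# Hypothetical table W_i: explicit identifier (name) plus quasi-identifiers.
W_i = [
    {"name": "A. Example", "dob": date(1970, 3, 14), "gender": "F", "zip": "22030"},
    {"name": "B. Sample", "dob": date(1982, 7, 2), "gender": "M", "zip": "22030"},
    {"name": "C. Placeholder", "dob": date(1970, 11, 30), "gender": "F", "zip": "20001"},
]

# Hypothetical table W_j: no names, only year of birth, gender, and IP address.
W_j = [
    {"yob": 1970, "gender": "F", "ip": "192.0.2.10"},
    {"yob": 1982, "gender": "M", "ip": "192.0.2.11"},
]

# Assumed IP-to-zip mapping; a real one would come from a geolocation source.
IP_TO_ZIP = {"192.0.2.10": "22030", "192.0.2.11": "22030"}


def link(w_i, w_j, ip_to_zip):
    """Return (name, ip) pairs where exactly one W_i row matches a W_j row."""
    links = []
    for anon in w_j:
        zip_code = ip_to_zip.get(anon["ip"])
        candidates = [
            person for person in w_i
            if person["dob"].year == anon["yob"]      # <date of birth, year of birth>
            and person["gender"] == anon["gender"]    # <gender, gender>
            and person["zip"] == zip_code             # <zip code, IP address>
        ]
        if len(candidates) == 1:  # a unique match re-identifies the record
            links.append((candidates[0]["name"], anon["ip"]))
    return links


print(link(W_i, W_j, IP_TO_ZIP))
```

Neither table alone ties a name to an IP address; the join on the three attribute pairs does.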
While the growth in data aggregation is related to inexpensive computers with
large storage capacities (Sweeney, 2000), it is also certainly tied to the growth of the
Internet and enhanced search-engine capabilities that facilitate the rapid searching of
large databases. Several Internet records-search businesses now specialize in the
aggregation of public records databases. KnowX, a ChoicePoint company, provides
access to “documents compiled by various public offices and agencies which are made
available to the general public. Examples of public records include real estate records,
lien filings, business entity filings, lawsuit information and court dockets, court decisions
and death records.” Other records are available under KnowX’s Professional fee-paid
services (KnowX, 2005). Search Systems provides access to more than 36,000 searchable
public records for a present cost of less than $5 per month.
Knowledge brokers, or companies that specialize in the aggregation and sale of
personal data, resell the data from these online data repositories, providing personal
information to law enforcement agencies performing background investigations and to
commercial entities extending credit or checking references for rental housing. The Drug
Enforcement Administration (DEA) regularly provides information from its Controlled
Substances Act (CSA) database to knowledge brokers such as KnowX. This database,
considered public information, contains the doctor’s name, licensure, license status, and
the location from which the DEA has authorized the physician to dispense controlled
substances to patients (usually a medical office). Solove (2004) observed
that not only is the government a supplier of this information to the private sector, but, in
turn, it purchases the services of these knowledge brokers to generate information about
individuals; he enumerated ChoicePoint’s contracts with the Justice Department, FBI,
IRS, and other federal agencies to substantiate this claim.
Knowledge-Based Authentication
In addition to background investigations, knowledge brokers use these databases
to validate and verify identity for KBA services (Willox, 2001). KBA service providers
rely on the availability of personal identity attributes contained within public and
proprietary records to provide authentication services to Web-based applications and for
credit-granting purposes. The importance of public records for identity verification was
supported in a position paper written by the Property Records Industry Association
(PRIA), which stated, “In order to grant credit rapidly and appropriately, the collection of
information about consumers through public records is necessary for businesses to make
fair and objective risk decision” (2006, p. 10).
In all cases, whether for use with performing background checks or for verifying
identity at the time the user is attempting to authenticate to a site or system, knowledge
brokers use the data gleaned from these same databases as “source,” or valid, data.
ChoicePoint and LexisNexis Group are considered to be two of the nation’s largest data
aggregators (Olsen, 2005). While LexisNexis has recently acquired ChoicePoint, their
data sources can vary. In a document that described ChoicePoint’s methodology for a
State of Tennessee project, ChoicePoint reported performing identity checks using the
following data sources: a “composite” file consisting of credit header data, property tax
records, casualty insurance records, driver’s license file from 35 state agencies,
residential phone listings, and address records from the U.S. Postal Service National
Change of Address file (as cited in State of Tennessee, 2003).
ChoicePoint has also claimed to use more robust data sources, “not just wallet-
based or financial history information” (2004, p. 1) than other competing products;
however, the information observed on sample screens at their ProID Web site does not
indicate that all of the questions being asked would be difficult to answer using public
sources. Questions such as “On which of the following streets have you NEVER lived or
owned property?” can be answered using property records searches for those individuals
who have been long-time property owners.
Other service providers may also rely on equally discoverable data. During the
NIST symposium on Quantifying Knowledge-Based Authentication, Experian
representative Kim Cartwright (2004) listed the following identifiers that are used
with their services during the identity validation process: address, phone number, SSN,
driver’s license, and date of birth. These data are compared with those gathered from
Experian’s own consumer credit records, consumer demographics, vehicle ownership, property
ownership, and other unspecified reference files. LexisNexis advertised their InstantID
product as a powerful tool that “simultaneously searches multiple independent
databases—containing 4 billion consumer and 300 million business records—for
information that can verify and validate a person’s identity” (2005, p. 1). The tool
validates name, address, SSN, date of birth, and phone number. Credit header data, sold
to KBA vendors separately from the credit history file, is one of the few sources outside
of court records by which SSNs can be validated. Although the credit header is a part of
the credit reporting bureau’s file, the FTC has determined that it is not a part of the
credit history and so does not fall under the Fair Credit Reporting Act (“Protecting
Consumers’ Data,” 2005).
Some vendors run these data through consistency checks to confirm
that the phone number area code matches that assigned within the city
and state, the driver’s license number is consistent with the format as issued by the state,
the SSN is not listed on the Social Security Death Index, and the address is not that
of a designated “high risk” location, such as a nightclub or drop box. Of
particular concern, however, is that the bulk of the data against which identity is verified
(name, address, phone number, and date of birth) is easily retrievable on the Internet.
Even property tax records and vehicle records that underlie challenge-response type
questions used by KBA vendors (e.g., how much did you pay for your house?) are
publicly available through public records aggregators such as Search Systems. While it is
becoming increasingly difficult (although not impossible) to find SSNs on the
Internet, research has demonstrated that identity thieves possessing enough personal data
can easily retrieve the victim’s SSN, then use the SSN as a key to access the victim’s
financial benefits (“Identity Theft and Social Security Numbers,” 2004).
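The vendor consistency checks described above can be pictured with a minimal Python sketch. It is not any vendor's implementation: the reference tables, the assumed license format, and the `consistency_failures` function are hypothetical placeholders that mirror the four checks named in the text.

```python
import re

# All reference data below are hypothetical placeholders, not vendor data.
AREA_CODES_BY_STATE = {"VA": {"703", "571", "804"}}
LICENSE_FORMAT_BY_STATE = {"VA": re.compile(r"^[A-Z]\d{8}$")}  # assumed format
DEATH_INDEX_SSNS = {"000-00-0001"}
HIGH_RISK_ADDRESSES = {"1 NIGHTCLUB WAY"}


def consistency_failures(record):
    """Return the names of the checks that the submitted record fails."""
    failures = []
    state = record["state"]
    # Check 1: the phone area code is one assigned within the claimed state.
    if record["phone"][:3] not in AREA_CODES_BY_STATE.get(state, set()):
        failures.append("area_code")
    # Check 2: the license number follows the (assumed) state format.
    pattern = LICENSE_FORMAT_BY_STATE.get(state)
    if pattern is None or not pattern.match(record["license"]):
        failures.append("license_format")
    # Check 3: the SSN does not appear on the death index.
    if record["ssn"] in DEATH_INDEX_SSNS:
        failures.append("ssn_deceased")
    # Check 4: the address is not a designated high-risk location.
    if record["address"].upper() in HIGH_RISK_ADDRESSES:
        failures.append("high_risk_address")
    return failures


claim = {
    "state": "VA",
    "phone": "7035550100",
    "license": "A12345678",
    "ssn": "000-00-0002",
    "address": "10 Main St",
}
print(consistency_failures(claim))
```

Checks of this kind test only whether the submitted attributes agree with one another and with reference files. A claim assembled entirely from public records would pass all four, which is the concern raised in this section.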
The emphasis on enabling e-government services will continue to create
additional personal data repositories. As the availability of personal data in public records
increases, it increases the ability of the KBA service provider to successfully match
identity data. When access to public records is limited, so too is the ability to assemble a
digital dossier on an individual. LexisNexis stated that the paucity of data makes it
difficult to detect the international identity thief (as cited in Willox, 2001). Paradoxically,
an increase in the availability of personal data increases their susceptibility to discovery,
undermining the confidence that can be placed in that identity’s verification. This
problem has been looked at from both sides of the coin. Solove’s (2004) proposed
solution was to regulate access to public records and to remove identifying information
from the records, while fellow attorney Lynn LoPucki (2003) proposed the creation of a
Public Identification System that publishes most of an individual’s personal data,
eliminating their use in providing proof of identity—requiring instead that identification
be determined by public claim and personal contact.
Both analysts agreed that in order to protect privacy, the secrecy paradigm must
be abandoned. Solove, however, suggested that “by taking obscure facts and making
them widely available, privacy can be violated” (2004, p. 143), arguing that “an SSN,
mother’s maiden name, and birth date should be prohibited as the method by which
access can be obtained to accounts” (p. 143). LoPucki’s (2001) arguments were certainly
more convincing, suggesting that by making all identity information completely public,
any artificial value placed on any of the identifiers is removed and each identifier becomes
essentially worthless to the identity thief, as it can no longer be used for financial gain as
an authenticator; simply knowing an individual’s SSN would not provide access to other
records. LoPucki’s proposal that publicly listed contact information be provided as a
means to authenticate identity fell somewhat short, however, as he did not address the
problem resulting from a compromise of this information wherein the contact information
is changed by the identity thief, much in the same manner as credit bureau data are
altered by identity thieves today.
An illusion of secrecy has been created around personal data that were
never intended to be secret, and it has resulted in the commercialization of these private
data. The use of these pseudo-secret data has extended to all Web sites, prompting one
financial executive to lament that the misuse of personal data, such as mother’s maiden
name, for authentication at Web sites and elsewhere, has resulted in there being relatively
few authenticators left to banks to use today to secure online transactions (Archer, 2004).
Regardless of whether the fault lies with the citizen for indiscriminately sharing his or her
personal data with all and sundry, or with the financial industry for substituting the
convenience of using an existing shared secret for the expense and security of creating
and disseminating a truly shared secret, once the cat is out of the bag there is no putting it
back in. Once private knowledge becomes public knowledge in any manner, knowledge-
based authentication may fail to provide a sufficient level of identity confidence in
digitally presented credentials.
Identity Authentication in Federated Environments
David Temoshok, the Director of the GSA’s Identity Policy and Management
Office that has been tasked with implementing the E-Authentication Portal, defined
federated identity as “rules, agreements, and standards, technologies that make identity
and entitlements portable across autonomous domains” (2005, p. 4), and suggested that
trust in the identity verification procedures is one of the critical issues of federated identity.
Transitive trust relationships are prevalent authentication models for federated
environments, such as with those participating with Liberty Alliance and GSA’s e-
Authentication Portal. These federated environments provide for transparent movement
between participating systems, agreeing to accept the credentials of users who have been
identity-proofed prior to accessing the present system (Electronic Authentication
Partnership, 2004). Once identity is authenticated and a Credential Service Provider
issues a credential, the credential is accepted by any participant within the federation
during the same browser session. Interoperable credentials within a federated
environment ensure that the user does not have to remember different passwords for
every site that the user visits within the federation. The degradation of identity confidence
will not be apparent to the systems participating in a transitive trust relationship with the
system performing the initial identity authentication.
To combat the threat of spoofed identities with government e-authentication
applications, in December 2003, the Office of Management and Budget (OMB, 2003)
issued Memorandum M-04-04, providing guidance to the heads of all departments and
agencies within the federal government on securing authentication to online government
services. Citing the National Research Council’s (NRC) report on authentication, Who
Goes There? Authentication Through the Lens of Privacy, OMB deferred to the NRC’s
definition of e-authentication as “the process of establishing confidence in user identities
electronically presented to an information system” (OMB, p. 3). The level of confidence,
or assurance, in electronically presented identities is directly linked to the strength of the
identity authentication technology or protocol used by the identity claimant attempting
authentication, and is determined by the level of risk presented to the application or user
in the transaction. Risk assessments are to be performed by each agency on its e-government
application to determine the risks and impact upon compromise presented by
each electronic transaction type (e.g., a request for information or submission of data to be
added to an online database). As applications become increasingly risky to the
agency or individual participating in the transaction, authentication to the application
becomes more rigorous.
Authentication is largely based on three factors:
1. Something one knows, such as a PIN, password, or a combination of personal information;
2. Something one has, such as a hard token, or smart card; or
3. Something one is, as generally demonstrated by a biometric such as a fingerprint or retinal scan.
Any of these single authentication factors presents certain vulnerabilities that can lower
identity authentication confidence; however, when combined into the use of “multifactor
authentication” (e.g., a doctor is required to present both a password and biometric when
authenticating to write a prescription for controlled substances), there exists a greater
confidence that the individual presenting the identity claim has been effectively bound to
that identity.
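The distinction between presenting several credentials and presenting several factors can be made concrete with a small Python sketch. The credential names and the `is_multifactor` helper are hypothetical; the three categories are the ones listed above.

```python
from enum import Enum


class Factor(Enum):
    KNOWLEDGE = "something one knows"
    POSSESSION = "something one has"
    INHERENCE = "something one is"


# Hypothetical mapping from credential type to factor category.
CREDENTIAL_FACTOR = {
    "password": Factor.KNOWLEDGE,
    "pin": Factor.KNOWLEDGE,
    "smart_card": Factor.POSSESSION,
    "hard_token": Factor.POSSESSION,
    "fingerprint": Factor.INHERENCE,
    "retinal_scan": Factor.INHERENCE,
}


def is_multifactor(verified_credentials):
    """True when the verified credentials span at least two factor categories."""
    categories = {CREDENTIAL_FACTOR[c] for c in verified_credentials}
    return len(categories) >= 2


print(is_multifactor(["password", "fingerprint"]))  # knowledge + inherence
print(is_multifactor(["password", "pin"]))          # knowledge only
```

Under this definition a password combined with a PIN remains single-factor, because both are things the claimant knows and share the same vulnerabilities.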
In supplemental guidance from NIST, authentication technologies are mapped to
the four assurance levels defined in the OMB memorandum, ranging from Level 1
(providing the least identity authentication confidence) to Level 4 (providing a high level
of confidence that the identity claimant is bound to the presented identity; Burr et al.,
2006). Based on the results of the risk assessment, online government applications are
assigned a required authentication assurance level and, thus, a required minimum
authentication technology standard.
While the NIST guidance states that it addresses the most widely implemented
forms of authentication protocols for remote authentication based on secrets, it directs
that applications requiring Level 4 assurance use hardware-based cryptographic tokens
that link the identity to “something they have” (e.g., PKI-based “smart cards” or other
hardware devices). NIST also informs readers that it is continuing to study both the
topics of knowledge-based authentication—which they define as authentication based on
the claimant correctly answering many personal, but not secret, questions—as well as the
use of biometrics that authenticate claimants based on physical characteristics or
identifiers possessed by the claimant (“something they are”). It is important to note that,
for the purposes of this discussion, the use of the term knowledge-based authentication,
or KBA, refers to the definition used here by NIST and does not refer to PIN/password
technologies.
Measuring the Effectiveness of Authentication by Knowledge
Authentication technologies are only as effective as their ability to ensure that
only authorized users are able to access an application or system. The National Computer
Security Center, in its identification and authentication “best practices” guidance,
suggested that effective authentication technologies must “uniquely and unforgeably
identify an individual” (1991, p. 5). The ease with which an imposter can impersonate an
authorized user during authentication varies considerably with the technology used. How
is authentication effectiveness measured among the myriad forms of authentication
technologies? Is there an existing measurement methodology that can be applied to
knowledge-based authentication?
Authentication technologies based on shared secrets such as PINs or passwords
rely on the authorized user maintaining the secrecy of the authentication data. Factors that
impact the effectiveness of PIN/password authentication technologies, therefore, include
their susceptibility to being guessed or discovered by an imposter.
NIST’s guidelines discuss the vulnerability of passwords to guessing attacks—
comparing randomly selected passwords to user-selected passwords of the same length.
They defined password-guessing entropy as an “estimate of the average amount of work
required to guess the password of a selected user” (Burr et al., 2006, p. 46) when applied
to a distribution of passwords. NIST argued that password-guessing entropy is the most
critical measure of the strength of a password system, since it largely determines the
resistance to password-guessing attacks.
Password strength is a function of the required password length and of the
character types used within the password. Limiting the characters to numbers provides
only 10 different digits (0–9) upon which the PIN can be based. Four-digit PINs,
therefore, can have a maximum of only 10,000 unique combinations, as calculated by
10^4. Passwords are typically much longer and more complex than PINs. The National
Security Agency’s (2006) most recent guidelines on the use of passwords within
government agencies recommended the use of 12-character passwords. These secure
passwords must consist of all of the following: upper and lower case alpha characters,
numbers, and special characters, such as the question mark, found on the keyboard. This
results in 94^12, or in excess of 475 sextillion (4.759203148 × 10^23), possible combinations, or
permutations. A related study on the strength of passwords and cryptography indicated
that an 840 MHz Pentium III can cycle through 250,000 passwords per second in an
offline dictionary attack (Song, Wagner, & Tian, 2001). Conceivably, at that rate it would take
roughly 60 billion years to exhaust all possible combinations, were there not new
developments that significantly reduce the time. A password-cracking algorithm,
Rainbow Crack, has been developed that pregenerates password-combination tables and
uses a more sophisticated searching algorithm (Bragg, 2004). Additionally, distributed
password-cracking approaches are being utilized that combine idle computer CPU time
from many PCs, enabling an increase in the amount of computer power available for
password cracking. In one such implementation, recently revealed, the
Secret Service has linked more than 4,000 employee computers to
attempt more than 1 million password combinations per second (Krebs, 2005).
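The keyspace arithmetic in this discussion can be reproduced with a few lines of Python. The sketch is illustrative only: the function names are hypothetical, and the guess rates are the two figures cited above (250,000 and 1 million per second). It assumes a constant guess rate with no precomputed tables or smarter search ordering, so it gives an upper bound on the work required rather than a realistic cracking time.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def keyspace(alphabet_size, length):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


def years_to_exhaust(combinations, guesses_per_second):
    """Worst-case time to try every combination at a constant guess rate."""
    return combinations / guesses_per_second / SECONDS_PER_YEAR


pin_space = keyspace(10, 4)        # four-digit PIN
password_space = keyspace(94, 12)  # 12 characters from 94 printable characters

print(pin_space)
print(f"{password_space:.3e}")
print(f"{years_to_exhaust(password_space, 250_000):.2e} years at 250,000 guesses/s")
print(f"{years_to_exhaust(password_space, 1_000_000):.2e} years at 1,000,000 guesses/s")
```

Because the keyspace grows exponentially with length, quadrupling the guess rate shortens the exhaustive search only by a factor of four, whereas adding a single character from the 94-character set lengthens it by a factor of 94.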
Complicating the ability to measure password effectiveness is that while
computer-generated randomly selected passwords of the 94 printable keyboard characters
are much more effective than user-selected passwords, most users do not select from this
full range of characters when allowed to create their own passwords. Studies have
demonstrated that users tend to favor certain English-language character frequency
distributions, significantly reducing the possible combinations to a more easily and
quickly searched size. Bruce Schneier, author of Applied Cryptography, stated that
English has 1.3 bits of entropy per character; a 30-character English passphrase has as much security as a 40-bit key. Random passwords have less than 4 bits of entropy per character. A 12-character password is more secure than a 40 bit key. (1999, p. 27)
NIST estimates agree with Schneier’s and also calculate that were the 12-character user-
selected password chosen from the full 94 available “complex” character set previously
discussed, rather than being limited to only the 10 numbers plus upper-/lower-case
alphabet, it would result in a password-guessing entropy of 79 bits (Burr et al., 2006).
While it still can be considered “computationally unfeasible” to crack strong passwords
such as those cited within a human lifespan, tools such as these remind network
administrators that password authentication effectiveness continues to be an evolving
landscape.
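The entropy figures quoted in this discussion follow from a simple relationship: a string of n characters drawn uniformly at random from an alphabet of k characters carries n × log2(k) bits. The Python sketch below works through that relationship. The function names are hypothetical, and the 1.3 bits-per-character rate for English is taken from the Schneier passage quoted above rather than derived here.

```python
import math


def random_entropy_bits(alphabet_size, length):
    """Entropy in bits of a string chosen uniformly at random."""
    return length * math.log2(alphabet_size)


def estimated_entropy_bits(bits_per_character, length):
    """Entropy estimate from an assumed per-character rate."""
    return bits_per_character * length


# 12 characters chosen at random from the 94 printable keyboard characters.
print(round(random_entropy_bits(94, 12), 1))
# 30-character English passphrase at the quoted 1.3 bits per character.
print(round(estimated_entropy_bits(1.3, 30), 1))
# Four-digit PIN chosen at random.
print(round(random_entropy_bits(10, 4), 1))
```

The first value rounds to the 79 bits cited from NIST, and the second is close to the 40-bit key used in Schneier's comparison. User-selected passwords fall well below the uniform-random figure because their characters are not chosen independently or with equal probability.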
While the susceptibility of PIN/passwords to guessing attacks is quantifiable, the
ability to quantify PIN/password susceptibility to discovery proves more elusive and can
be compared to discoverability vulnerabilities associated with KBA. When used as the
sole means for authentication, distribution of the shared secret completely defeats the
technology’s ability to identify an unauthorized user. Distribution occurs when passwords
are shared among coworkers, friends, or relatives (often for the purposes of accessing
services in the absence of the principal user). Unintentional distribution occurs when the
password is written down near the system and found by someone, or socially engineered
from the user.
Some online government applications, such as government-funded student loans,
issue four-digit PINs to financial aid recipients so that they can check student loan processing
status. The PIN, issued by the Department of Education, is mailed in an “out-of-band”
transaction to the student. This process is vulnerable to the risk of discoverability as a
result of mail theft; however, that is not a risk factor that can be easily quantified as the
legitimate user oftentimes is not even aware that the password has been discovered and,
once compromised, the imposter has the same access privileges as the user. For this
reason, a recently released report from the Federal Deposit Insurance Corporation
concluded that single-factor password-based credentials no longer provide sufficient
security for remote access to critical infrastructure; “two-factor authentication should be
considered as the new security baseline for remote access to computer systems” (2004, p.
36).
Measuring the Effectiveness of Authentication Through Possession
Several types of authentication devices exist that require the identity claimant to
prove possession of the device at the time of authentication. These devices are usually
provided to the user through a controlled issuance process at the time of enrollment. As
with PIN/password systems, this requires a preestablished enrollment or issuance process
and would not be a suitable technology for authenticating citizens to government service
sites unless the identified transaction risks warranted requiring a higher level of identity
assurance.
Such devices include USB tokens that plug into a user’s PC and contain a CPU along with
digital signature and encryption keys; one-time password devices that are
synchronized with the system to which the user is authenticating; and “smart cards.”
Smart cards are credit-card-sized devices that contain a CPU and memory. In the federal
context these are called Personal Identity Verification (PIV) cards; NIST defined the PIV card as
A physical artifact (e.g., identity card, “smart” card) issued to an individual that contains stored identity credentials (e.g., photograph, cryptographic keys, biometric data) so that the claimed identity of the cardholder can be verified against the stored credentials by another person (human readable and verifiable) or an automated process (computer readable and verifiable). (as cited in Chandramouli et al., 2008, p. 1)
The Department of Defense has presently issued more than 4 million PIV cards to
its personnel, and all federal agencies were directed by OMB to issue PIV cards to all
employees, contractors and affiliates (OMB, 2004) by October 2006. These cards will not
only contain biometrics but will also contain radio-frequency identification chips that
provide for contactless authentication. Due to the variable nature and complexity of the
data stored on the card, conformance testing focuses largely on interoperability and the
ability of the card reader’s application to effectively access the data from the card and
validate that the data are correct and that the card has been issued by an authorized
source. Conformance testing for the existing Government Smart Card (GSC) tests
interoperability, as defined in the GSC Interoperability Specification document drafted by
NIST (Schwarzhoff et al., 2003), and validates that the smart-card product conforms to
these specifications. Obviously, since possession of the card, by itself, in no way
guarantees the identity of the person presenting the card, testing for identity
authentication effectiveness is not feasible unless the card possesses a biometric or
requires a PIN/password also be used in two-factor authentication. Bank ATM cards,
requiring that both the card and the PIN be presented, are the most common example of
how two-factor authentication is implemented.
Measuring the Effectiveness of Biometric Authentication
Biometric authentication provides access to the user when a stored template of a
physical characteristic, such as an iris scan, fingerprint, or facial or voice scan, is matched
to the physical characteristic presented by the identity claimant at the time of
authentication. Unlike PIN/password and smart-card authentication technologies, the
effectiveness of biometric technologies, in terms of identification accuracy, can be
measured and the methodology for its measurement is consistent across the biometric
technologies. Although the effectiveness of biometric authentication can be measured,
these technologies are not 100% accurate. There are a total of four possible outcomes at
the time of authentication (Jain, Bolle, & Pankanti, 1999):
Outcome 1: An authorized user is correctly accepted;
Outcome 2: An imposter is correctly rejected;
Outcome 3: An authorized user is incorrectly rejected; and
Outcome 4: An imposter is incorrectly accepted.
The probability rate of an authorized user being rejected is known as a False
Rejection Rate, or FRR, while the probability, or frequency, rate of an imposter being
incorrectly authenticated as a valid user is known as a False Acceptance Rate, or FAR. A
false rejection of a valid user, while an inconvenience to the user and a burden on
customer service, is not as serious to application security as a false acceptance of an
imposter, in which the imposter is granted all of the rights and privileges of the
valid user. Jain et al. noted that, in principle, the FAR and the FRR can be used, as well
as an Equal Error Rate (EER), to estimate the identification accuracy of a biometric
system. The EER is the error rate at the operating point where the two probabilities (the
FAR and the FRR) are equal. As an example, were the EER to be 3%, it would
mean that 3% of authorized users were incorrectly denied access while 3% of the identity
imposters were incorrectly authenticated to the system.
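The four outcomes and the two error rates defined above can be illustrated with a short sketch. The counts are hypothetical and were chosen only so that both rates equal 3%, mirroring the EER example; the function and parameter names are illustrative and do not come from Jain et al.

```python
def error_rates(genuine_accepted, genuine_rejected,
                imposter_rejected, imposter_accepted):
    """Return (FRR, FAR) from counts of the four authentication outcomes."""
    genuine_total = genuine_accepted + genuine_rejected      # Outcomes 1 and 3
    imposter_total = imposter_rejected + imposter_accepted   # Outcomes 2 and 4
    frr = genuine_rejected / genuine_total    # share of valid users turned away
    far = imposter_accepted / imposter_total  # share of imposters let in
    return frr, far


# Hypothetical trial: 1,000 genuine attempts and 1,000 imposter attempts.
frr, far = error_rates(genuine_accepted=970, genuine_rejected=30,
                       imposter_rejected=970, imposter_accepted=30)
print(f"FRR = {frr:.1%}, FAR = {far:.1%}")
```

Because the two rates use different denominators (genuine attempts versus imposter attempts), a test sample must contain enough of both populations for either rate to be meaningful.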
In actuality, the determination of risk, rather than the EER, plays the largest role
in tuning a biometric system, that is, configuring the system for higher or lower
sensitivity levels that result in higher or lower FRRs or FARs. Obviously, as the system is
tuned to greater sensitivity levels, demanding a positive match on more data points stored
in the template, the system will screen out more imposters, yet will generate more false
rejections of valid users (resulting in higher FRRs and lower FARs). Conversely, if
customer service is of greater significance than authentication assurance, the systems can
be tuned to accept more weakly matched templates, resulting in higher FARs and lower
FRRs. Critical applications would more likely be tuned to produce fewer false acceptances,
at the expense of a greater number of false rejections.
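The tuning trade-off described above can be sketched by sweeping an acceptance threshold over match scores. The scores, the 0 to 100 scale, and the rule that a score at or above the threshold is accepted are all assumptions made for illustration; real biometric systems expose tuning in vendor-specific ways.

```python
# Hypothetical match scores (0-100); higher means a closer template match.
genuine_scores = [62, 70, 75, 81, 85, 88, 90, 93, 95, 98]
imposter_scores = [10, 18, 25, 33, 40, 47, 55, 64, 72, 83]


def rates_at(threshold):
    """Return (FRR, FAR) when a score at or above `threshold` is accepted."""
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    return frr, far


# Raising the threshold (greater sensitivity) trades false acceptances
# for false rejections.
for threshold in (50, 65, 80, 95):
    frr, far = rates_at(threshold)
    print(f"threshold {threshold}: FRR = {frr:.0%}, FAR = {far:.0%}")

# Approximate the equal error rate as the threshold where FRR and FAR
# are closest to one another.
eer_threshold = min(range(0, 101),
                    key=lambda t: abs(rates_at(t)[0] - rates_at(t)[1]))
print("approximate EER threshold:", eer_threshold)
```

A critical application would choose a threshold above the equal-error point, while a service that prioritizes customer convenience would choose one below it.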
Jain et al. (1999) stressed the necessity of more descriptive performance metrics
during testing. An instance is cited in which vendor-asserted performance claims of an
FRR of 0.3% and FAR of 0.1% were not substantiated during independent testing
performed by Sandia National Laboratory, which found that the same system had an FRR
of 25% and an unknown FAR (Jain et al.). They also emphasized that in order to obtain
fair and honest test results, enough samples representative of the population of all four
categories (imposters and genuine) should be made available for testing.
Measuring the Effectiveness of KBA
None of the test methods specified previously sufficiently evaluates KBA. In
contrast to PIN/password authentication, KBA relies on the individual accurately
answering several questions about him- or herself that are then correlated to answers
culled from public and private records, as previously discussed. A series of validation
checks is then performed against the data, eliminating users who provide inconsistent
data (e.g., SSNs that do not match the name in the file or are found on the Social Security
Death Index, or addresses that do not match) or who provide incorrect
answers to the questions.
The customization that KBA companies provide with the number and types of
questions asked, as well as the thresholds that can be set for acceptable authentication
(based on the number of correctly answered questions), makes quantifying the
effectiveness of KBA extremely difficult. This difficulty led NIST, which has been
examining the technology for its suitability with e-government services, to study the
feasibility of quantifying KBA. At NIST’s 2004 symposium, several methodologies
were proposed to address the issue.
Evaluating KBA Using Guessability
Using a model similar to that referenced in the NIST guidelines, cryptography
researcher Dr. Santosh Chokhani (2003) proposed a methodology in which effectiveness
is based on how susceptible an individual’s identity attribute is to being guessed.
Attributes are categorized as being either static or temporal. Examples of static identifiers
include birth date, while temporal identifiers are more dynamic and include bank account
balances and payroll amounts. Chokhani contended that the extent to which an identifier
is susceptible to guessing is partially dependent on the individual doing the guessing.
Someone close to the identity claimant, or with intimate knowledge of the claimant (for
example, an estranged spouse), may have personal knowledge sufficient to accurately
answer the questions, allowing him or her to effectively masquerade as the authorized
individual. Chokhani also provided both formulas and tables to calculate the probability of
compromising KBA based on the claimant type and specific identifier guessability.
Chokhani provided a matrix used for the calculation of guessability metrics wherein the
questions asked are based on assumptions and the likelihood of the answers being
guessed by an individual without any prior knowledge. Date of birth, for instance, is
given a 1 in 18,250, or approximately 2^14, probability that someone other than a family member,
employer, friend, or professional acquaintance might be able to guess it given Chokhani’s
assumption that someone “can be assumed to be between 20 and 70 years of age” (p. 2).
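The date-of-birth figure can be reproduced with simple arithmetic. The sketch below assumes that the 18,250 figure is the product of 365 days and the 50-year span between ages 20 and 70; the source does not show the calculation, so this reconstruction is an assumption.

```python
import math

DAYS_PER_YEAR = 365          # assumption: leap days are ignored
MIN_AGE, MAX_AGE = 20, 70    # Chokhani's stated age range

# Number of equally likely birth dates an uninformed guesser must consider.
candidate_birth_dates = DAYS_PER_YEAR * (MAX_AGE - MIN_AGE)
single_guess_probability = 1 / candidate_birth_dates

# The same guess space expressed in bits, for comparison with key lengths.
equivalent_bits = math.log2(candidate_birth_dates)

print(candidate_birth_dates)        # 18250
print(round(equivalent_bits, 2))    # 14.16
```

Under this assumption, the guess space is slightly larger than 2^14 (16,384), which is why the power-of-two figure is best read as an approximation.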
Based on application identity assurance requirements, Chokhani (2003)
recommended specific mixes of temporal and static identifiers (temporal identifiers, such
as bank balances, being obviously more difficult for someone, even the identity claimant,
to accurately provide). Chokhani later mentioned that the claimant’s desire to
masquerade, as well as the valid user’s personality-based factors and the size of his or
her network of personal and professional relationships, must be considered; however, he
did not factor these considerations into his formula or proposed metrics.
While Chokhani’s recommendations, based more on common sense than on his
probability metrics, have considerable value to identifier selection, the guessability
approach can only partially gauge KBA effectiveness as it does not factor in
discoverability. Largely drawn from public records data sources, attributes used in KBA
are susceptible to data-mining techniques or targeted attacks whereby an identity thief
builds a digital dossier on the victim (Solove, 2004) and is then able to successfully
authenticate as that user with the information obtained from the Web.
Evaluating KBA Through False Acceptance and Rejection Rates
Other approaches to evaluating KBA attempt to treat the service as a form of
biometric authentication technology, attempting to define FAR and FRR, and even
providing the ability to tune service to acceptable thresholds (Cartwright, 2004;
ChoicePoint, 2004). Experian and ChoicePoint use similar models in which the client
company determines the type and number of initial personal data inputs to be validated
(name, SSN, address, date of birth). Based on a predetermined score “cut-off,” or
threshold, the user is then either asked more challenging questions or denied access and
referred to a customer service desk for exception processing. In their example provided at
the symposium,
Experian suggested that 90.10% of accounts pass the initial score, while 9.90% are
referred for exception processing. Experian also stated pass rates of 90.24% for “good
accounts,” resulting in an FRR of 9.76%, a rather high FRR by most biometric standards.
Dr. George Datesman, consultant for Mitretek, also proposed a model similar to
Experian’s in which the goal of KBA is error-free authentication, or “100% assurance
that a user is who he/she claims to be” (2004, p. 4). With Datesman’s model, errors are
classified similarly to biometric errors into two types: type I errors that identify the false
rejection of a claimed identity (FRR) and type II errors that identify false acceptance of a
claimed identity (FAR). Datesman discussed the necessity of standardization of error
measurement techniques as opposed to identity authentication methods as well as
establishing minimum acceptable error rates and confidence intervals at each assurance
level. Datesman’s discussion was, however, absent of any guidance on how these
measurements can be determined.
Herein lies the inherent problem with treating KBA as a biometric technology in
measuring effectiveness. While companies will almost assuredly get feedback from angry
customers denied access to services from which they can capture FRR, the FRR can also
be predicted through formal test procedures in which a file provided to the service is
“seeded” with valid and nonvalid identities. Assuming imposters and genuine identities
were made available for testing, the sample would contain a selection of valid users
presenting valid data that should be accurately authenticated consistent with Outcome 1
(Jain et al., 1999). Those valid users who did not authenticate correctly would provide an
estimation of the FRR (Outcome 3). Measuring to determine Outcome 2 (an imposter is
correctly rejected) and Outcome 4 (an imposter is incorrectly accepted) proves to be more
difficult. Outcome 4, the basis for determining FAR, is critical to information security,
yet is impossible to calculate because, as with PIN/password technologies, testing would
require that enough valid personal knowledge be provided to spoof a valid user on the
system, unless an identity is simply crafted out of vapor. To this extent, while FRRs
should be calculated to determine burden on users and customer service staff, attempting
to measure KBA effectiveness using biometric testing methods is not comprehensive
enough to satisfy most information security or application assurance requirements.
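The seeded-file test described above can be sketched as follows. The record structure, field names, and results are hypothetical; the sketch only shows how an FRR estimate falls out of such a test and why no comparable FAR estimate does.

```python
from dataclasses import dataclass


@dataclass
class SeededResult:
    identity_id: str
    is_valid_identity: bool   # True if the seeded record is a genuine identity
    authenticated: bool       # outcome returned by the KBA service


def estimate_frr(results):
    """Estimate FRR as the share of valid seeded identities that were rejected."""
    valid = [r for r in results if r.is_valid_identity]
    if not valid:
        raise ValueError("seed file contains no valid identities")
    rejected = sum(not r.authenticated for r in valid)
    return rejected / len(valid)


# Hypothetical results for a very small seed file.
results = [
    SeededResult("v1", True, True),
    SeededResult("v2", True, True),
    SeededResult("v3", True, False),   # Outcome 3: valid user rejected
    SeededResult("v4", True, True),
    SeededResult("n1", False, False),  # fabricated identity, rejected
    SeededResult("n2", False, False),  # fabricated identity, rejected
]
print(f"Estimated FRR: {estimate_frr(results):.0%}")
```

The sketch deliberately omits a FAR estimate. The nonvalid records are fabricated identities, so their rejection says nothing about an imposter who has gathered a real person's attributes, which is the case that matters for Outcome 4.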
Evaluating KBA Through Other Methods
Some KBA service vendors advertise the use of identity scoring models
(Cartwright, 2004) that, based on the quality and quantity of data and data source, can
provide a probability of identity confidence (or identity score). These models require that
the data owners determine the thresholds that are acceptable to meet application
assurance requirements. Unfortunately, KBA vendors choose not to share these
algorithms, and so, as with proprietary cryptographic algorithms, they must remain
suspect in their ability to accurately perform as advertised unless mathematicians or
statisticians can test their accuracy rates. Since the models are not shared, it is not known
to what extent the likelihood of discoverability is addressed.
In a recent academic project resulting from the KBA symposium, researchers at
the University of Wisconsin-Madison (Chen, 2007) proposed a KBA framework based on
Bayesian networks, considering causal and probabilistic relationships between identity
attributes. While the approach had definite merit in its potential ability to determine
outcomes and adapt responses accordingly, it, too, only minimally considers the obvious
vulnerability of prior discovery.
Having been studied by leading authentication technology researchers, KBA
remains an ethereal technology in its ability to be evaluated for its identity authentication
effectiveness. While both KBA and PIN/passwords are susceptible to guessability
attacks, the use of PIN/password technologies proves to be less susceptible to
discoverability than KBA, as the likelihood is slim that the PIN/password authentication
data are published in public records databases on the Web as are most KBA identifiers.
Furthermore, the attempts to quantify KBA’s effectiveness in a manner similar to
biometric technologies are inadequate in that only FRRs can be estimated. Estimates on
the number of imposters possessing sufficient information to masquerade as legitimate
users are impossible to determine using this same approach.
In summary, a review of the literature found one study released by the Council of
Better Business Bureaus and Javelin Strategy and Research that supported the null
hypothesis that the availability of personal identity information cannot be correlated to
identity theft; however, the Javelin study’s narrow definition of identity fraud necessitates
a new study. Additionally, concerns expressed by government agencies and scholars
linking the misuse of identity information to the proliferation of personal data provide
support for the hypothesis. The literature review revealed that existing measurement
protocols are incomplete in evaluating the effectiveness of KBA as they do not consider
the discoverability of these attributes. As KBA services have already been implemented
for use with online applications providing access to identity breeder documents, such as
birth certificates, it affirms the immediate need for a study to evaluate the extent of the
discoverability of attributes used in these services.
CHAPTER 3. METHODOLOGY
Introduction
The structure of content on the Web has been compared to a library whose
“collection is distributed haphazardly on the shelves, with no underlying classification
scheme, bibliographic control, or accession catalog, and a substantial portion of the
material is incomplete, transitory, or simply disappears from the shelves after a short
time” (O’Neill, McClain, & Lavoie, 1998, p. 1). This haphazard collection of digital
information continues to proliferate at a staggering rate. Between 2000 and 2003,
“surface” Web content (Web content indexed by search engines such as Google) tripled
from 50 terabytes to an estimated 167 terabytes in size (Lyman & Varian, 2003). “Deep”
Web content, consisting primarily of databases and other media types that are not
routinely crawled and indexed by search engines, is estimated to be 500 times the size of
the surface Web. This deep Web content resides on approximately 200,000 Web sites—
95% of them publicly accessible (Bergman, 2001). Developing a research framework to
allow the enumeration of personal identifiers located on a targeted subset of this deep
Web can prove to be challenging for researchers in terms of selecting an appropriate
research methodology and sampling plan. This chapter addresses the research approach,
population, data collection, and analysis techniques that were used to perform this study.
Research Approach
A research methodology is a systematic process that moves the researcher from
inquiry and hypotheses to data collection and analysis. The type of data collected is
intertwined with the selected research methodology and drives the manner in which the
data are collected (Myers, 1997). Leedy and Ormrod (2001) agreed with Myers, stating
that the type of data being analyzed may lead the researcher to a specific research
approach, suggesting that a quantitative research methodology is appropriate in instances
where there exists an objective reality that can be measured and specific methods for
measuring variables are defined and collected from a sample of data that can be
converted to numerical representation.
Quantitative research characterizes the problem under study in terms of how
many or how often and results in the numeric analysis of data that prove, or disprove, a
researcher’s assumptions. While data can be gathered through a variety of methods,
including the use of surveys, results are scored so that they yield statistically measurable
results. An example, as applied to the study of Web content (Bergman, 2001), is that an
analysis of scored survey results can provide a measurement of user concern regarding
identity theft resulting from publicly accessible online databases. Quantitative research,
in this case, would not have discerned why users were concerned about this phenomenon
unless this had been previously hypothesized by the researcher prior to the onset of data
collection. In this respect, quantitative research design is considered a fixed approach in
that the data collection process must be constructed in advance to specifically address the
researcher’s hypotheses. The ability to perform comparisons and statistical analysis on
the collected data, however, is a decided advantage to quantitative approaches when
dealing with large amounts of empirical data.
A content analysis of the personal identifiers stored in online deep Web databases,
as discussed previously, can be used within the constructs of either a quantitative or
qualitative approach, or both, using a mixed research approach. Kaid and Wadsworth
(1989) deferred to Berelson’s definition of content analysis as being the most widely
accepted by researchers. By definition, a content analysis is a method in which content
can be analyzed in a systematic, objective, and quantitative manner for the purpose of
measuring variables. Berelson (1952) characterized content analysis as systematic in that
in order to reduce generalization errors in the content being analyzed, uniform coding and
analysis procedures are defined in advance of the data gathering process. It must be
objective in that researcher bias must be absent from the study or from the sample
selection process. Finally, it must quantitatively and accurately represent the body of
material being examined, leading independent researchers to the same conclusions. This
detailed and systematic methodology, as well as the frequency tabulation of
authentication attributes found in the documents, provides for the quantitative analysis of
the information contained in the examined documents (Robson, 2002).
Typically applied to human communication such as newspapers, video, books,
television, art, music, or transcripts to identify patterns, themes, or biases (Leedy &
Ormrod, 2001), content analysis can be used for making numerical comparisons among
and within documents, as long as the information is available to be reanalyzed for
reliability checks. This enables researchers to sift through large volumes of data with
relative ease in a systematic fashion (GAO, 1996). These characteristics indicate that
content analysis is a suitable tool to quantitatively examine Web content by statistically
measuring the frequency and location with which personal identity attributes are found.
Some considerations, however, must be addressed. While performing a content
analysis allows a researcher to study the raw data and arrive at conclusions relating to a
hypothesis, can content analysis, by itself, be used to lead the researcher to a holistic
analysis of the data that it categorizes? As cited in Kaid and Wadsworth (1989), some
researchers such as Krippendorff view content analysis as little more than a data
gathering tool that enables inferences to be made from the data to their context while
others, such as Holsti, consider it to be a much more powerful tool. Kaid and Wadsworth
suggested that quantitative examinations of the content found on the Web are, in
themselves, without much meaning unless the researcher can make comparisons and
draw relationships from the data. It would, therefore, be of little research value to simply
report the frequency with which personal identity attributes can be found in public records
databases. A relationship linking identity theft rates to the frequency with which these
data are discoverable, however, serves the goal of achieving a more holistic analysis of
this Web content.
Content analysis categorizations can also present problems if the category
definitions are faulty or found to be nonmutually exclusive or nonexhaustive (Stempel &
Westley, 1981). Researchers concur that reliability in content analysis has been achieved
when there is repeatability in recoding the same data over time and researchers
performing the same research, using the same methodology, derive the same results
(GAO, 1996). Reliability, in this context, is dependent on information availability for
reexamination. Koehler (2004) performed a longitudinal study of Web pages over a
period of 6 years that established the ephemeral nature of the Web, concluding that the
Web is not particularly stable for publication of long-term information and the
maintenance of individual objects or items. Koehler did differentiate, however, between
the materials published to the Web and material “for which the Web serves as a conduit
for access” (Conclusions section, ¶ 1)—citing the longer half-lives for online databases,
as compared with the much shorter half-lives of other published Web documents. It
would, therefore, be important for a researcher to note any impact to long-term reliability
when engaging in a content analysis of Web-based content. For these reasons, a
quantitative approach employing content analysis was deemed by the researcher to be the
most appropriate research approach to the problem under study.
Sampling Design
Personally identifiable data exist in large volumes within online databases and
Web pages. Sampling is necessary as the body of material is too extensive to be analyzed
in its entirety (GAO, 1997). Selecting an adequate and representative sample of the
population of interest from the millions of available Web pages can be a
challenging task for a researcher. In the article “A Methodology for Sampling the World
Wide Web,” researchers O’Neill et al. explained, “Compiling a random sample of Web
sites is not a straightforward exercise, largely because enumeration of all Web sites is not
available” (1997, Sampling the Web: A Basic Strategy section, ¶ 6). Robson (2002)
concurred, stating that it is “usually necessary to reduce your task to manageable
dimensions by sampling from the population of interest” (p. 353).
Sampling in content analysis is performed similarly to survey sampling, with care
taken to ensure that the sample is representative of its population, with each unit having
an equal chance of being represented in the sample (Stempel & Westley, 1981). Stempel
also suggested that there are additional considerations for sampling in content analysis,
such as document availability, that may lead the researcher to use stratified or purposive
sampling. Robson (2002) substantiated the unsuitability of random sampling if a full list
of the population is unattainable.
Researchers Liddle, Yau, and Embley (2001), in their research efforts to
categorize deep Web database content through structured queries, also found that
randomly selected fields provided for uneven coverage in their collection process, so they
proposed a stratified sampling method to extract data from the deep Web in order to
ensure better coverage and an adequate number of representative sample fields from their
queries.
To facilitate a comparison to FTC identity theft rates, which are categorized by
state, a stratified purposive sampling approach was employed to ensure that all states had
an equal chance for representation in the study. Stratified sampling divides the population
into separate groups (referred to as “strata”) based on a shared characteristic, such as size,
gender, educational level, income, or, in this case, geographic location. Purposive
sampling, as indicated earlier by Stempel (1981), is useful for content analysis when
specific documents or records need to be selected (e.g., San Bernardino County property
records databases were examined for personal identity information) and is indicated when
resources or records availability are limited, yet require justification of the sample
selection process. As cited in Tashakkori and Teddlie (2003), Kemper, Stringfield, and
Teddlie offer that purposive sampling provides the ability to focus the sample on
information-rich cases and minimize the sample size in a nonrandom method. They
further assert that while particularly useful in qualitative approaches, purposive sampling
can be used with quantitative approaches as well. While a proportional probability
stratified sampling methodology would also have been suitable due to the disparate
population of records (one county government may not have a population of citizens or
records equal to another government), a lack of resources limited its usefulness in this
study.
SearchSystems.net hosts the largest publicly accessible directory of public records
databases. As such, it was considered to be the basis of the target population. While many
links at the site are already categorized by state, other links are categorized by the type of
record. Consistent with a stratified purposive sampling approach, data were stratified by
state and then by the record type using the procedures described in chapter 4 and the
codebook in Appendix A. An outline is presented, as follows, of the steps that were
undertaken for the sample in this study:
1. The sample frame was established as a listing of public records accessible within each state, as provided by Search Systems. Search Systems bills itself as the largest repository of public records databases, aggregating links to more than 40,000 accessible public records databases. Many (but not all) of these records contain PII. A paid subscription that allowed access to the aggregator’s links was procured at a fee of $4.95 per month. Premium records costing additional fees, such as bankruptcy records, were not included in the population from which the sample was drawn. Search Systems does not include U.S. territories and possessions, with the exception of Washington, DC, and these were excluded as they were not a part of the FTC’s datasets for later comparison purposes. A manual inspection of the records descriptions was performed to eliminate databases serving only historical purposes (those records prior to 1930, a period for which even the U.S. Census publishes information contained in household census records). Only records databases providing PII on living individuals that were accessible freely, and without requiring registration, were considered valid for the purposes of this study.
2. Using a purposive approach, categories of public records that are commonly used for identification and authentication purposes were defined and recategorized after the results of a pilot test to ensure mutual exclusivity so that a single record could not be counted multiple times. Appendix A contains specific search queries that were used to extract the public records links from Search Systems.
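The two steps above can be sketched as a filter followed by a two-level grouping. The field names, record-type labels, and sample entries are hypothetical stand-ins; the actual categories and search queries are those defined in the codebook in Appendix A.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseLink:
    state: str
    record_type: str
    free: bool
    requires_registration: bool
    historical_only: bool   # covers only records prior to 1930


def eligible(link):
    """Apply the inclusion criteria described in step 1."""
    return link.free and not link.requires_registration and not link.historical_only


def stratify(links):
    """Group eligible links by state (first stratum), then by record type."""
    strata = defaultdict(lambda: defaultdict(list))
    for link in links:
        if eligible(link):
            strata[link.state][link.record_type].append(link)
    return strata


# Hypothetical directory entries.
links = [
    DatabaseLink("CA", "property", True, False, False),
    DatabaseLink("CA", "property", True, True, False),    # needs registration
    DatabaseLink("CA", "court", True, False, False),
    DatabaseLink("VA", "property", True, False, False),
    DatabaseLink("VA", "vital", False, False, False),     # premium record
    DatabaseLink("VA", "census", True, False, True),      # historical only
]
strata = stratify(links)
for state, by_type in sorted(strata.items()):
    print(state, {record_type: len(group) for record_type, group in by_type.items()})
```

Assigning each link exactly one record type is what keeps the categories mutually exclusive, so that a single record cannot be counted more than once.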
Robson (2002) discussed the difficulty with prespecifying the number of
observations required in a flexible design study, stating that it is appropriate to continue
until saturation is reached (an apparently subjective goal). Larger sample sizes result in
fewer generalization errors (Robson). This sampling approach provided the ability to
enumerate a large, disparate grouping of data that facilitated later correlation to identity
theft rate data. It also provided a repeatable methodology that gives future researchers
a sample in which all databases have an equal chance of being represented, thereby
eliminating researcher bias.
Data Collection
Data collection within the inspected records was performed using a
documentation content analysis of personal attribute categories contained in the Internet
record. Identity attributes are defined for this study as information that identifies an
individual or links to other information that would be used to identify an individual.
Microsoft Excel was used to store the data to facilitate their exportation into SPSS for
later statistical analysis. To avoid violating privacy rights, personal data on individuals
were not collected for the purposes of this study. Instead, publicly accessible databases
were enumerated and categorized to determine the amount and type of personal data at
each discovered site that are commonly used with KBA systems. Chapter 4 further
discusses the data collection process.
Measurement Strategy
Content analysis provides a systematic, replicable technique for categorizing data
based on explicit coding rules (Berelson, 1952). Consistent with content analysis, a priori
content categories and recording units were established and operationalized prior to
collection, coding, and analysis. Colleagues provided a review of the proposed a priori
categories, and revisions were made to ensure mutual exclusivity and exhaustiveness
(Weber, 1990) and that the categories were saturated (Leedy & Ormrod, 2001). Coding
began only after the final units of data collection were defined, tested, and refined.
Collected data were categorized according to the defined a priori categorization of
content. From the sample of sites, the frequencies with which the defined units of data
(authenticators previously determined to be common to the majority of KBA service
providers) occurred within a record type in that state were counted. Acting on the
assumption that personal attributes are weighted by most authentication systems as being
equal in importance (e.g., knowing your house’s square footage is as important as
knowing the names of the previous owners), the data collected from the sites were
measured on an ordinal scale and ranked by the frequency with which they were found at
different sites. Additionally, the frequency with which these attributes (units) jointly
occurred with one another was measured (collocations).
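The frequency and collocation counts described above can be sketched with standard counting tools. The attribute names and site contents are hypothetical, and every attribute is given equal weight, consistent with the stated assumption; the study itself recorded its data in Excel and analyzed them in SPSS.

```python
from collections import Counter
from itertools import combinations

# Each set lists the attribute categories observed at one hypothetical site.
sites = [
    {"name", "address", "date_of_birth"},
    {"name", "address"},
    {"name", "address", "square_footage"},
    {"name", "date_of_birth"},
]

attribute_counts = Counter()
pair_counts = Counter()
for attributes in sites:
    attribute_counts.update(attributes)
    # Sorting gives each pair a single canonical ordering before counting.
    pair_counts.update(combinations(sorted(attributes), 2))

# Ordinal ranking of attributes, from most to least frequently found.
ranking = [name for name, _ in attribute_counts.most_common()]

print(ranking)
print(pair_counts.most_common(3))
```

The pair counts correspond to the collocations: attributes that frequently appear together are the ones an attacker could harvest from a single source.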
Data Analysis Strategy
The statistical software program SPSS was used to analyze the data collected.
Based on the frequency with which combinations of identity attributes were found at
these sites, a “discoverability metric,” or index, was derived from the analysis that
allowed these attributes to be comparatively ranked by availability. Data were then
analyzed to determine the frequency, distribution, and mode of the authenticators within
the data sets. Using Spearman’s rho, the relative frequencies of the authenticators by
category (location and type of site) were assessed for statistical significance. Lastly, data
gathered in the first part of this study were compared with statistics reported by the FTC
for identity theft incidences by state as a simple correlational study.
Data Display
Personal attributes from frequency tables were summarized and displayed using
bar charts and tables since, as Robson (2002) explained, they are the preferable methods
for displaying data associated with frequency tables and are quickly and easily
understood by most readers. Data are also presented in a priori tables and graphs.
In conclusion, while a qualitative research approach affords flexibility in an
exploratory study, a quantitative analysis is more appropriate for an enumeration of the
content within online public databases. Research has further demonstrated that content
analysis is an appropriate research methodology with either quantitative or qualitative
research and is suitable for the data collection of Web content using a purposive stratified
sampling approach that ensures an adequate and reliable sample.
CHAPTER 4. RESULTS
Introduction
There were two main objectives to the current study. The first was to examine the
comparative discoverability of identity attributes in online public records, associating a
discoverability index, or factor, to individual and linked identity attributes where specific
combinations of identity attributes occur. The second objective was to determine if
correlations exist between the frequencies with which identity attributes can be found in
public records and instances of identity theft.
To meet the goals of this study, a content analysis was conducted, after which
descriptive statistics for each of the identity attributes, as well as the number of identity
attributes per type of record searched, were generated. Nonparametric correlations were
then conducted to assess whether there was an association between the identity attributes
and identity theft rankings.
This chapter presents the results of the data collection, categorization, and content
analysis relative to this study. The first section of the chapter discusses the data collection
and preparation processes. Following that is a description of the data sampled, followed
by findings from an analysis of the data that address the research questions presented in
chapter 3. Conclusions and recommendations are presented in chapter 5.
Data Collection
As of May 1, 2008, the date on which the data collection process was completed,
Search Systems had registered more than 41,754 public records sites. The number of sites
accessible from the aggregator increases daily, so the process of enumeration is
analogous to shooting at a moving target. Initially, software with URL harvester
capabilities, Xenu, was used to retrieve the public records links for the purpose of
generating the sample. The software proved to be of little use as Search Systems obscures
the links through the use of redirection. An attempt was made to enumerate the total
number of public records sites at each state using Search Systems’ “Search United States
Public Records by State.” This initial approach was discarded as many of the links are
informational ones, such as school performance reports, that would not provide
information relative to this study. The approach was modified to limit the data collection
to 12 different categories of public records that eliminated business-related records (i.e.,
professional licenses and corporate filings) and informational sites. Public records
categories used by knowledge brokers to authenticate online identity were selected based
on a review of public records sources at LexisNexis’s site. Categorical searches in Search
Systems were performed on selected public records categories and the “Advanced
Search” function was also used to find records for which no category had been
predefined.
Data Coding and Categorization
Knowledge gained through the investigator’s professional experiences testing a
number of KBA systems served as the foundation for the selection of common identity
attributes that serve as identity authenticators. Selected identity attributes included name,
address, phone number, date of birth, marriage, place of birth, and property-specific
information such as square footage, property value, mortgage holder, and improvements
made on the property. A description of these attributes appears in Appendix A.
A test of the data collection process was performed to ensure that the a priori
categories contained collected data that were mutually exclusive and exhaustive. Two
colleagues were provided a data collection form and, after brief training on collection
procedures, were asked to retrieve data from Alabama. When the collected data were
reviewed, it became apparent that several categories needed combining as, in many cases,
court cases contained probate records and other recorded documents that otherwise risked
being recorded multiple times. As a result, five categories were eliminated in order to
reduce record replication. The remaining seven categories consisted of the following
types: accident reports, birth certificates, court records, inmate/arrest records, marriage
records, property records/deeds, and voter registrations. Appendix A provides detailed
enumeration criteria. A subsequent test using Arkansas displayed no evidence of overlap
in the documents in the sample, satisfying the requirement for exclusivity.
Consistent with a purposive sampling strategy that would ensure as accurate a
count as possible, the descriptions for each record were reviewed, excluding those that
were information only and did not contain personal data. Only government-provided or
-contracted, freely accessible sites were of interest, as it was assumed that a registration
process would deter most identity thieves from accessing data on sites that required one.
Identity attributes contained in examined records were noted on the data
collection forms. Identity attributes were aggregated as cumulative totals within each
category of public record by state. As an example, if one marriage record in Ohio
displayed only the name and address of the bride and groom, and another site within the
state included date of birth, the data were coded to show that Ohio marriage records
contained name, address, and date of birth.
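The aggregation rule in the Ohio example amounts to taking the union of the attribute types seen across all sites within a state and record category. A minimal Python sketch of that rule follows; the `observations` list and the `code_attributes` helper are hypothetical illustrations, not the study's actual data collection form.

```python
from collections import defaultdict

def code_attributes(observations):
    """Union the attribute types observed per (state, record category)."""
    coded = defaultdict(set)
    for state, category, attributes in observations:
        coded[(state, category)] |= set(attributes)
    return dict(coded)

# Hypothetical observations mirroring the Ohio example in the text.
observations = [
    ("Ohio", "marriage records", {"name", "address"}),
    ("Ohio", "marriage records", {"date of birth"}),
]

coded = code_attributes(observations)
print(sorted(coded[("Ohio", "marriage records")]))
```

Coding by union means that an attribute type is counted once per state and record category, no matter how many sites publish it.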
Coded data were inspected and sorted in Excel to identify coding errors. The
cleaned data were imported into SPSS 14.0 for analysis. Attribute totals, by state, were
transformed into a new ordinal variable to facilitate additional correlation to ID theft rates
as reported by the FTC’s (2007) Consumer Fraud and Identity Theft Complaint Data:
January 2007 through December 2007 report.
Descriptive Statistics
A total of 8,659 sites were identified as containing data of interest to the study.
Sites that required separate payment or registration, or that were unavailable at the time of
review, could not be enumerated. This excluded 2,061 sites, reducing the number of sites
examined for identity attributes to 6,598, for which the range, mean, and standard
deviation by record category across the 50 states are shown in Table 1, and a graph
depicting the total number of sites per record category is shown in Figure 2.
The findings in Table 1 reveal that property records sites were the most numerous
category of records per state (M = 83.46), with the number of property records sites found
within each state ranging from as few as 1 site located in Wyoming to a total of 409 sites
accessible in Texas. Arrest records were the next most numerous type of public record
per state (M = 28.96), followed by court records sites (M = 15.02). It should be noted that
while there were more sites containing court records (N = 1,612) than arrest records (N =
1,562), 861 court records sites required registration or a paid subscription, compared to
only 114 arrest records sites that could not be freely and anonymously enumerated.
Table 1. Descriptive Statistics of Public Records Sites Across the 50 States
Variable               Range     M        SD
Accident reports       0–8       .64      1.88
Birth certificates     0–7       .46      1.34
Court records          0–192     15.02    32.68
Arrest records         0–225     29.00    44.72
Marriage records       0–23      2.50     4.46
Property records       1–409     83.46    92.27
Voter registrations    0–13      .92      2.23
[Bar chart: “Total Number of Accessible Sites Per Category” (horizontal axis: number of sites, 0 to 5,000; vertical axis: classes of records). Plotted values: accident reports, 32; birth certificates, 23; court records, 751; arrest records, 1,448; marriage records, 125; property records, 4,173; voter registrations, 46.]
Figure 2. Total number of public records sites by record category.
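As a consistency check, the per-state means in Table 1 can be recovered by dividing the category totals plotted in Figure 2 by the 50 states. The short Python sketch below does this; the variable names are illustrative, and the totals are copied from the figure.

```python
# Accessible sites per record category, as plotted in Figure 2.
accessible_sites = {
    "Accident reports": 32,
    "Birth certificates": 23,
    "Court records": 751,
    "Arrest records": 1448,
    "Marriage records": 125,
    "Property records": 4173,
    "Voter registrations": 46,
}
NUM_STATES = 50

# Total number of sites examined across all categories.
total_sites = sum(accessible_sites.values())

# Mean number of accessible sites per state, by category.
mean_per_state = {
    category: round(count / NUM_STATES, 2)
    for category, count in accessible_sites.items()
}

print(total_sites)
for category, mean in mean_per_state.items():
    print(f"{category}: M = {mean:.2f}")
```

The category totals sum to the 6,598 sites examined. The arrest records mean computes to 28.96, the value quoted in the text, whereas Table 1 lists it as 29.00.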
The least common type of records located were accident reports (M = .64);
however, these reports often contained a number of identity attributes unique to this
report type (i.e., vehicle identification number [VIN], driver’s license number, home
address). Attribute uniqueness was not a variable examined separately in this study.
Descriptive Statistics of Identity Attribute Types by Category
Accessible sites within each category of public records (e.g., accident reports)
were reviewed to determine the types of identity attributes published within each state.
The range, mean, and standard deviation for the number of identity attribute types for
each record category are shown in Table 2.
Table 2. Descriptive Statistics of Identity Attribute Types by Category
Variable               Range     M        SD
Accident reports       0–6       .74      1.76
Birth certificates     0–5       .62      1.51
Court records          0–6       2.12     1.51
Arrest records         0–5       2.88     1.19
Marriage records       0–6       1.02     1.45
Property records       5–6       5.02     .14
Voter registrations    0–4       .72      1.20
The findings in Table 2 indicate that property records yielded the largest number
of different identity attributes (M = 5.02), followed by arrest records (M = 2.88), and then
by court records (M = 2.12). All other record categories yielded about one identity attribute type or fewer on average (M = .62 to 1.02).
Frequency Total for Each Attribute
The frequencies and percentages for the identity attribute types are presented in
Table 3. As can be gleaned from the table, the most frequently published attribute was an
individual’s name (30%), followed by an individual’s home address (17%), and then by
an individual’s date of birth (13.5%). An individual’s physical description was published
only 8.4% of the time. An individual homeowner’s property value, property tax, and the
number of square feet were each accessible 7.7% of the time. All other identity attributes were
not frequently found online. No SSNs were present in the examined records.
Frequency of Attributes per State
The total number of identity attribute types published varied by state, ranging
from a minimum of 5 attributes (Wyoming) to a maximum of 26 (Ohio). The range,
mean, and mode for each of the attributes are detailed in Table 4. The attribute with the
highest mode was an individual’s name (mode = 4), followed by an individual’s date of
birth and home address (mode = 2). The mode for an individual’s physical description,
home’s property value, property tax, and number of square feet was only 1. The mode for
all the other attributes was zero.
Table 3. Frequencies and Percentages of Identity Attributes
Variable                    F       %
Name                        196     30.0
Date of birth               88      13.5
Birth year                  14      2.1
Mother’s maiden name        9       1.4
Place of birth              10      1.5
Home address                111     17.0
SSN                         0       0.0
Last four digits of SSN     1       0.2
Home phone number           3       0.5
Driver’s license number     7       1.1
VIN                         8       1.2
Property value              50      7.7
Property tax                50      7.7
Number of square feet       50      7.7
Physical description        55      8.4
Table 4. Descriptive Statistics of Identity Attributes Within Each of the 50 States
Variable                    Range    M       Mode
Name                        1–7      3.92    4
Date of birth               0–6      1.76    2
Birth year                  0–3      .28     0
Mother’s maiden name        0–1      .18     0
Place of birth              0–2      .20     0
Home address                1–5      2.22    2
SSN                         0–1      .02     0
Last four digits of SSN     0–1      .02     0
Home phone number           0–1      .06     0
Driver’s license number     0–1      .14     0
VIN                         0–1      .16     0
Property value              1–1      1.00    1
Property tax                1–1      1.00    1
Number of square feet       1–1      1.00    1
Physical description        0–3      1.10    1
Tests of Research Questions
The purpose of this study was twofold. First, it explored the extent to which
identity attributes used to authenticate individuals in online transactions using
knowledge-based authentication services are discoverable in public records databases.
Secondly, the study examined these frequencies to determine if there is any correlation to
identity theft rates. Consistent with this, the following research questions are addressed in
this section:
1. What is the comparative discoverability of identity attributes in online public records?
2. Is there an association between the frequencies with which identity attributes can be found in public records and identity theft?
RQ1: Discoverability Metrics of Identity Attribute: What is the Comparative Frequency, or Discoverability, of Personal Identity Attributes in Public Records Databases?
To meet this study objective, the data were analyzed to determine the frequency
with which identity attributes are discoverable in public records. The findings were useful in
establishing how accessible specific identity attributes are to identity claimants during
online transactions. This will assist government agencies and commercial services relying
on knowledge-based services to select specific identity attributes relative to application
risk.
The comparative discoverability metrics, or indices, for each identity attribute and
for groups of attributes are described in this section. Initially, regression procedures were
conducted with the intent to use the regression coefficient for each attribute as the index
of discoverability. This approach was discarded because the data were not normally
distributed and remained markedly skewed, even after transformation, which would have
negatively impacted reliability. Instead, the discoverability index for each attribute was
determined by dividing the frequency of that attribute by the total frequency of all
attributes. The index for each group of attributes was the frequency for the whole group
of attributes divided by the total frequency of all attributes. Identity attributes were first
grouped together using an exploratory factor analysis (EFA) procedure, which is
appropriate for use where no hypothesis is present.
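The index calculation described in this section can be sketched in Python as follows. The frequencies are those reported in Table 3, and the groupings follow Table 6; the function and variable names are hypothetical, and the study itself performed its analysis in SPSS. Values computed this way may differ from the published indices in the second decimal place because of rounding.

```python
# Attribute frequencies as reported in Table 3.
frequencies = {
    "Name": 196, "Date of birth": 88, "Birth year": 14,
    "Mother's maiden name": 9, "Place of birth": 10, "Home address": 111,
    "SSN": 0, "Last four digits of SSN": 1, "Home phone number": 3,
    "Driver's license number": 7, "VIN": 8, "Property value": 50,
    "Property tax": 50, "Number of square feet": 50,
    "Physical description": 55,
}

# Multi-attribute groups as listed in Table 6.
groups = {
    "Personal information": ["Name", "Date of birth", "Physical description"],
    "Home information": ["Home address", "Property value", "Property tax",
                         "Number of square feet"],
    "Driving information": ["Driver's license number", "VIN",
                            "Home phone number"],
    "Verification questions": ["Mother's maiden name", "Place of birth"],
}

# Total frequency of all attributes (the common denominator).
total = sum(frequencies.values())

def discoverability_index(attribute_names):
    """Share of all attribute occurrences accounted for by the given attributes."""
    return sum(frequencies[name] for name in attribute_names) / total

attribute_index = {name: discoverability_index([name]) for name in frequencies}
group_index = {group: discoverability_index(members)
               for group, members in groups.items()}

for name, value in sorted(attribute_index.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.2f}")
for group, value in group_index.items():
    print(f"{group}: {value:.2f}")
```

Because every index shares the same denominator, the indices are directly comparable across single attributes and attribute groups.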
Principal components analysis (PCA) was used to extract the components to
determine the number of identity attributes to retain. PCA is preferred over principal
factor analysis (PFA) for purposes of data reduction. An orthogonal Varimax rotation
was specified.
The resulting Cattell scree plot is presented in Figure 3, while the percentage of
variance accounted for by each of the components is shown in Table 5. Upon closer
inspection of the scree plot in Figure 3 and the proportion of variance each factor
explained (refer to Table 5), there appeared to be a large gap between the seventh
(Eigenvalue = 1.00) and eighth (Eigenvalue = .64) components. The first seven
components appeared to be distinct from the remaining eight components. Further, the
Eigenvalue of the eighth component was below the acceptable criterion of 1.00. As such,
the principal components analysis was deemed to yield seven components, with the first
component, personal information (Eigenvalue = 3.45), accounting for 23.02%
of the total variance. The seven components and the attributes that loaded highly onto the
components are displayed in Table 6.
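The retention decision described above follows an eigenvalue threshold of 1.00 (often called the Kaiser criterion). A minimal Python sketch using the eigenvalues reported in the text and in Table 5 is shown below. It assumes, as is conventional for PCA on standardized variables, that the total variance equals the number of attributes (15); the study does not state this explicitly. Eigenvalues for components nine through fifteen are not reported and are therefore omitted.

```python
# Eigenvalues for the first eight components (Table 5 and the text).
eigenvalues = [3.45, 2.86, 2.11, 1.23, 1.01, 1.00, 1.00, 0.64]
NUM_ATTRIBUTES = 15   # assumed total variance for standardized variables
CRITERION = 1.00      # retention threshold used in the study

# Keep only components whose eigenvalue meets the criterion.
retained = [value for value in eigenvalues if value >= CRITERION]

# Percentage of total variance explained by each retained component.
percent_variance = [100 * value / NUM_ATTRIBUTES for value in retained]

print(len(retained))
print([round(p, 1) for p in percent_variance])
print(round(sum(percent_variance), 1))
```

The percentages computed this way agree with Table 5 to within a few hundredths of a point; the small differences presumably reflect rounding of the published eigenvalues.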
The comparative discoverability indices for the group of attributes are shown in
Table 7. The comparative discoverability indices for identity attributes are presented in
Table 8.
Figure 3. Scree plot from resulting EFA procedure.
Table 5. Variance Explained by Resulting Components
Component                    Eigenvalue total    % variance explained
1. Personal information      3.45                23.02
2. Home information          2.86                19.04
3. Driving information       2.11                14.05
4. Verification questions    1.23                8.18
5. Birth year                1.01                6.76
6. SSN                       1.00                6.69
7. Last 4 digits of SSN      1.00                6.69
Table 6. Components and Respective Identity Attributes
Component                  Attributes
Personal information       Name, date of birth, physical description
Home information           Home address, property value, property tax, square feet
Driving information        License number, VIN, home phone
Verification questions     Mother’s maiden name, place of birth
Birth year                 Birth year
SSN                        SSN
Last four digits of SSN    Last four digits of SSN
Table 7. Comparative Discoverability Index for the Groups of Identity Attributes
Group of attributes        Index
Personal information       .52
Home information           .41
Driving information        .03
Verification questions     .03
Birth year                 .02
SSN                        .00
Last four digits of SSN    .00
Table 8. Comparative Discoverability Index for the Identity Attributes
Identity attribute          Index
Name                        .30
Home address                .17
Date of birth               .14
Physical description        .08
Property value              .08
Property tax                .08
Number of square feet       .08
Driver’s license number     .01
VIN                         .01
Home phone number           .01
Mother’s maiden name        .01
Place of birth              .02
Birth year                  .02
SSN                         .00
Last four digits of SSN     .00
Recommendations are made in chapter 5 of this study for the application of these
indices to KBA service offerings.
RQ2: Correlation Between Frequency of Attributes and Identity Theft Rankings: Is There a Correlation Between Identity Theft Rates and the Availability of Personal Data in Public Databases?
Based on recent FTC and other news media reports of identity theft being
facilitated by public records, as cited earlier in this study, several hypotheses relating to
RQ2 were formed. The null form of each hypothesis is stated below, together with the
results of its test. Nonparametric correlations were conducted to test these hypotheses against the
FTC’s identity theft rankings for 2007. Correlations were performed using Spearman’s
rho, the most commonly used nonparametric statistic to measure ranked data. As a
directional relationship was hypothesized, one-tailed tests were employed and set to a
significance level of .05.
H1a: There is no correlation between identity theft rates and the total number of
Web-accessible public records. Table 9 displays ranked identity theft rates by state as
published in the FTC report. Lower rank numbers represent higher reported
identity theft rates, with Arizona having the highest reported rate
(rank = 1) and North Dakota the lowest (rank = 50). These data were tested
against the total number of public records sites as reported by Search Systems, ranked in
a similar manner to the FTC data, with lower ranks indicating a higher number of
published records.
The result of the analysis for H1a indicated that there was a moderate positive
relationship between state identity theft rankings and the number of published sites at
each state (rho = .443, p = .001). That is, states with more public records sites tended to
have more incidences of reported identity theft. Thus, the null hypothesis for H1a was
rejected.
Table 9. FTC 2007 State Identity Theft Rankings
State             FTC identity theft rank    Rank by total no. of sites    No. of sites (N = 8,659)
Arizona           1                          30                            91
California        2                          6                             394
Nevada            3                          39                            66
Texas             4                          1                             780
Florida           5                          2                             462
New York          6                          8                             377
Georgia           7                          15                            229
Colorado          8                          16                            192
New Mexico        9                          40                            53
Maryland          10                         44                            37
Illinois          11                         4                             431
New Jersey        12                         38                            69
Washington        13                         14                            229
Pennsylvania      14                         17                            188
Michigan          15                         5                             395
Delaware          16                         43                            37
Alabama           17                         22                            142
Virginia          18                         13                            238
Connecticut       19                         18                            175
Oregon            20                         19                            159
Missouri          21                         35                            79
North Carolina    22                         9                             298
Massachusetts     23                         11                            273
Tennessee         24                         20                            153
Oklahoma          25                         23                            132
Indiana           26                         21                            142
Ohio              27                         3                             450
Louisiana         28                         37                            74
Kansas            29                         28                            109
South Carolina    30                         27                            120
Utah              31                         32                            89
Mississippi       32                         25                            122
Arkansas          33                         29                            108
Rhode Island      34                         36                            74
Minnesota         35                         12                            265
Idaho             36                         42                            38
New Hampshire     37                         31                            90
Alaska            38                         45                            35
Hawaii            39                         48                            23
Nebraska          40                         26                            122
Wisconsin         41                         7                             382
Kentucky          42                         33                            87
Wyoming           43                         49                            20
Montana           44                         41                            40
Maine             45                         34                            86
West Virginia     46                         24                            131
Vermont           47                         46                            33
Iowa              48                         10                            293
South Dakota      49                         50                            16
North Dakota      50                         47                            30
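The H1a coefficient can be reproduced from the two rank columns in Table 9. The Python sketch below applies the standard no-ties formula for Spearman's rho, rho = 1 − 6Σd²/(n(n² − 1)), to those ranks. The variable and function names are illustrative. The study itself used SPSS, and the t statistic shown is the usual approximation for testing rho, not a value reported by the author.

```python
import math

# "Rank by total no. of sites" from Table 9, listed in order of FTC
# identity theft rank (index 0 is Arizona, FTC rank 1; index 49 is
# North Dakota, FTC rank 50).
site_ranks = [
    30, 6, 39, 1, 2, 8, 15, 16, 40, 44,
    4, 38, 14, 17, 5, 43, 22, 13, 18, 19,
    35, 9, 11, 20, 23, 21, 3, 37, 28, 27,
    32, 25, 29, 36, 12, 42, 31, 45, 48, 26,
    7, 33, 49, 41, 34, 24, 46, 10, 50, 47,
]
ftc_ranks = list(range(1, len(site_ranks) + 1))

def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho for two rankings that contain no tied ranks."""
    n = len(ranks_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

rho = spearman_rho(ftc_ranks, site_ranks)
n = len(site_ranks)

# Approximate t statistic with n - 2 degrees of freedom.
t_stat = rho * math.sqrt((n - 2) / (1 - rho ** 2))

print(f"rho = {rho:.3f}")  # rho = 0.443
print(f"t({n - 2}) = {t_stat:.2f}")
```

Because both columns are ranked in the same direction (rank 1 is the highest rate or the most sites), a positive rho here corresponds to states with more sites having more reported identity theft.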
H1b: There is no correlation between identity theft rates and individual identity
attributes published in Web-accessible public records. The results of the correlations
between the frequency of identity attributes and state identity theft ranks are presented in
Table 10. As evidenced in the data, six of the identity attributes were significantly and
negatively associated with the FTC’s identity theft ranking. These attributes were name
(rho = –.50, p = .000), date of birth (rho = –.43, p = .001), birth year (rho = –.24, p =
.048), home address (rho = –.46, p = .000), VIN (rho = –.25, p = .038), and physical
description (rho = –.27, p = .028). For these variables, the more frequently they are found
in public records, the lower the state’s identity theft ranking (lower rankings indicate
higher incidences of reported identity theft in the state). As a result, the null hypothesis
for H1b was rejected for these identity attributes.
Table 10. Spearman Rho Correlations Between Identity Attributes and Identity Theft Ranking
Variable                    Theft rank rho    Sig.
Name                        –.50              .000
Date of birth               –.43              .001
Birth year                  –.24              .048
Mother’s maiden name        –.03              .426
Place of birth              –.03              .419
Home address                –.46              .000
SSN                         –.06              .329
Last four digits of SSN     –.04              .406
Home phone number           –.01              .476
Driver’s license number     –.11              .224
VIN                         –.25              .038
Physical description        –.27              .028
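The decision rule for H1b can be sketched as a simple filter over the one-tailed significance values in Table 10, using the study's significance level of .05. The Python below is illustrative only; the dictionary transcribes the table, and the names are hypothetical.

```python
ALPHA = 0.05  # one-tailed significance level used in the study

# (rho, one-tailed significance) per attribute, transcribed from Table 10.
# Values printed as .000 in the table are rounded to three decimals.
correlations = {
    "Name": (-0.50, 0.000),
    "Date of birth": (-0.43, 0.001),
    "Birth year": (-0.24, 0.048),
    "Mother's maiden name": (-0.03, 0.426),
    "Place of birth": (-0.03, 0.419),
    "Home address": (-0.46, 0.000),
    "SSN": (-0.06, 0.329),
    "Last four digits of SSN": (-0.04, 0.406),
    "Home phone number": (-0.01, 0.476),
    "Driver's license number": (-0.11, 0.224),
    "VIN": (-0.25, 0.038),
    "Physical description": (-0.27, 0.028),
}

# Attributes for which the null hypothesis is rejected.
significant = [name for name, (rho, p) in correlations.items() if p < ALPHA]

print(len(significant))
print(significant)
```

The filter returns six attributes: name, date of birth, birth year, home address, VIN, and physical description.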
H1c: There is no correlation between identity theft rates and identity attribute
groups published in Web-accessible public records. The findings of the nonparametric
correlations between the attribute groups and the total attribute sum, on the one hand, and
identity theft ranking, on the other, are shown in Table 11. The findings indicate that the
group of personal information attributes was significantly and negatively associated with
identity theft ranking (rho = –.45, p = .000). Thus, the easier it was to access personal
information, the lower the identity theft ranking of the state (the higher the incidence of
identity theft). The group of home information attributes was also significantly and
negatively associated with identity theft ranking (rho = –.46, p = .000). Again, the easier
it was to access home information, the lower the identity theft ranking of the state. As the
p values for both driving information and verification questions were greater than .05, the
null hypothesis was not rejected for those attribute groups. Thus, it was concluded that personal and
home information attribute groups are associated to a greater degree with identity theft rates
than are driving information or verification questions.
Table 11. Spearman Rho Correlations Between Attribute Groups and Identity Theft Ranking
Variable                   Theft rank rho    Sig.
Personal information       –.45              .000
Home information           –.46              .000
Driving information        –.22              .065
Verification questions     –.02              .435
H1d: There is no correlation between identity theft rates and the total sum of
different identity attributes published in Web-accessible public records at each state.
Lastly, the sum of identity attributes published at each state was significantly and
negatively associated with identity theft ranking (rho = –.443, p = .001). Thus, the greater
the number of different identity attributes published in these public records categories by
a state, the higher the incidence of identity theft, resulting in a rejection of the null
hypothesis. This statistic should be interpreted with some caution, as the totals
represent only the different types of identity attributes found across the seven
categories of records for each state, as previously discussed. Although this limits the use of the
statistic in examining the impact of the total number of identity attributes in all public
records at each site, the result was significant to the study in that it indicated that states
that publish many different identity attributes in each record may be contributing to
the incidence of identity theft in that state.
Summary
The purpose of this analysis was to first and foremost examine the extent to which
personally identifiable information can be discovered in public records. In this respect,
the researcher was able to derive the frequency with which identity information attributes
are found in public records and compute a comparative discoverability metric from the
data. This allowed the researcher to provide recommendations in chapter 5 for the use of
this metric in computing assurance levels for the use of knowledge-based authentication
services. The second objective, determining whether these online data correlate with
state identity theft rates, was also met.
Several hypotheses were tested, and the results suggest a positive relationship
between the amount of personally identifiable data published in online records and state
identity theft rates. This relationship held not only for the total number of published
sites within a state but also for specific attributes, namely name, date of birth,
birth year, home address, VIN, and physical description, whose availability correlated
with higher identity theft rates. Other attributes, such as mother’s maiden name, driver’s
license number, and property tax information, did not evidence a correlation. The
impact of these findings on knowledge-based authentication and the
recommendations drawn from the data analysis contained in this chapter are discussed in
chapter 5.
CHAPTER 5. RESULTS, CONCLUSIONS, AND RECOMMENDATIONS
Summary of the Study
The principal goal of this exploratory study was to assess the discoverability of
identity attributes in Internet-based public records. A quantitative research methodology
employing content analysis was used to assess both the quantity and type of identity
information resident in the records. A frequency analysis was performed to identify how
susceptible identity attributes are to discovery by would-be identity thieves so that
recommendations could be made for the use of knowledge-based authentication systems
that heavily rely on public records to authenticate individuals to government Web sites. A
secondary goal of the study was to assess whether or not the type and amount of these
data correlated to recorded identity theft rates. Chapters 1–4 presented the study’s
objective, its significance, conceptual framework, hypotheses, sampling and data
collection methodology, and data analysis. This chapter provides a summary of the
research conducted and the interpretations and implications of the data analysis. It further
identifies recommendations to government agencies for the use of KBA to perform online
authentication, as well as study limitations and considerations for future research.
Summary of the Research Findings
The findings from the research consist of (a) the results from the content analysis
performed on electronic records, and (b) the correlation results between the frequency
with which these data are published and reported identity theft rates. These findings are
summarized below.
The Discoverability of Identity Attributes in Online Public Records
The following research question was developed to examine the frequency with
which personal identity attributes are published on the Internet. Research Question 1:
Discoverability Metrics of Identity Attribute: What is the comparative frequency, or
discoverability, of personal identity attributes in public records databases? To address this
question, the study examined a total of 6,598 public records sites containing identity
attributes and analyzed the data to determine the frequency with which the attributes can be
discovered in public records. To represent this frequency comparatively, a discoverability
index for each attribute was determined by calculating the frequency of that attribute
divided by the total number of attributes. Descriptive statistics performed on the results
revealed that property records yielded the greatest number of different identity attributes,
while also being the most numerous type of public record available over the Internet.
Arrest records were the next most numerous, and oftentimes contained full physical
descriptions of the individual, as well as a photograph. Court records often consisted of
traffic and accident reports, both containing a wealth of personally identifiable data. The
significance of these results is discussed in the Implications section of this chapter.
Correlation Between Identity Attribute Publishing Frequencies and Identity Theft
The following research question and hypotheses were developed to determine if a
correlation existed between the FTC’s reported identity theft rates and (a) the total
number of public records sites published by each state, (b) the different types of identity
attributes found in each public record, (c) linked, or grouped, identity attributes published
at each public record site, and (d) the total number of different identity attributes
published in each site. Research Question 2: Is there a correlation between identity theft
rates and the availability of personal data in public databases?
Hypothesis 1(a) (null). The null hypothesis stated that there is not a correlation
between identity theft rates and the total number of Web-accessible public records sites.
Data were collected and analyzed using nonparametric statistics to measure ranked data.
Based on the Spearman’s rho results, the null for H1(a) was rejected at the 5%
significance level. Results indicated a moderate positive correlation between reported
identity theft rates and the number of published sites at each state.
Hypothesis 1(b) (null). The null hypothesis stated that there is not a correlation
between identity theft rates and the type of individual identity attributes published in
Web-accessible public records. Twelve identity attributes were collected from public
records sites. Of these, six (name, date of birth, birth year, home address, vehicle
identification number, and physical description) were significantly correlated with the
FTC’s identity theft state rankings, indicating that the more frequently they were found in
public records, the higher the incidence of identity theft. This resulted in a rejection of the null hypothesis for
these identity attributes.
Hypothesis 1(c) (null). The null hypothesis stated that there is not a correlation
between identity theft rates and grouped identity attributes published in Web-accessible
public records. The collected identity attributes were grouped together using an
exploratory factor analysis and a principal components analysis to reduce and extract the
components that would be retained. Findings from the Spearman’s rho indicated that
attributes combined into personal information (name, date of birth and physical
description) and those regarding home information (home address, property value,
property tax, and square feet) were significantly associated with identity theft, resulting in
a rejection of the null hypothesis for those groups of attributes.
Hypothesis 1(d) (null). The null hypothesis stated that there is not a correlation
between identity theft rates and the total sum of different identity attributes published in
Web-accessible public records at each state. Based on the results of the Spearman’s rho,
the null hypothesis was rejected, supporting the conclusion that the greater the number of
different identity attributes published at each site, the greater the incidence of reported
identity theft. Although the total sum in the analysis represented only the different
types of identity attributes found across the seven categories of records for each state, it
was significant to the study results in that it indicated that states that publish many
different identity attributes in each record may be contributing to the incidence of
identity theft in that state.
Implications of the Study
This study revealed several implications for identity theft that could be of interest
to government agencies or other organizations considering the use of KBA to
authenticate individuals to online applications. The findings from the research supported
the overall hypothesis that there is a relationship between the prevalence of identity
information in public records and identity theft. The study results, however, did not
establish a cause-and-effect relationship between the two, as correlation statistics alone
cannot demonstrate causation. These findings are consistent with an article in the Journal
of Economic Crime Management (Pinheiro, 2004), which postulated that a person who knows
just several pieces of personal information can match that information to a credit report
to authenticate the identity to a prospective creditor, thereby allowing an identity thief
to “take over” another person’s identity.
Table 12. Summary of the Hypotheses Testing
Hypothesis (null) | Results | Conclusion
H1a: There is not a correlation between identity theft rates and the total number of Web-accessible public records sites. | Rejected | States with larger numbers of published public records sites are associated with more reported incidences of identity theft.
H1b: There is not a correlation between identity theft rates and the type of individual identity attributes published in Web-accessible public records. | Partially rejected | A relationship to identity theft rates is supported with the following attributes: name, date of birth, birth year, home address, and vehicle identification.
H1c: There is not a correlation between identity theft rates and grouped identity attributes published in Web-accessible public records. | Partially rejected | A relationship to identity theft rates is supported with groups of attributes containing personal and home information.
H1d: There is not a correlation between identity theft rates and the total sum of different identity attributes published in Web-accessible public records at each state. | Rejected | A relationship is supported between identity theft rates and the total sum of different attributes found in online public records.
During the enumeration process, it was evident that many counties still publish
full images of marriage licenses and applications, as well as of birth certificates, most
containing mothers’ maiden name information as well as dates of birth commonly used to
authenticate individuals in online transactions. A GAO (2004) study on the availability of
SSNs in public records identified names and birth dates, together with SSNs, as the three
personal identifiers most often sought by identity thieves. Also of
concern with property and court records is the uniqueness of some of the identity
information contained in these types of reports that cannot be found elsewhere, including
vehicle identification and driver’s license data that can help to build an identity profile. In
their study, the GAO concluded that few state agencies posted SSNs on the Internet;
however, they estimated that “local government offices in as many as 15–28% of
counties do make SSNs available through the Internet” (p. 4). The GAO study pointed
out that these offices have begun restricting SSNs in online and other public records
overall, which could explain why no SSNs were found in the Internet records examined
for the present study, conducted 4 years after the GAO study was published.
With this information so readily accessible, what is the real value of pseudo-
secrets culled from public records to any identity authentication or verification system
associated with any level of risk to the business, government agency, or consumer? In a
response to the FTC’s Identity Theft Task Force’s request for public comments,
ChoicePoint (2007), a leading knowledge broker, suggested that KBA effectively
confirms the existence of an identity through the verification and correlation of multiple
data elements in public records databases. However, while the identity can be confirmed
as one that exists in the record, ChoicePoint acknowledged that it is more difficult to
prove that the identity claimant is actually who he or she is claiming to be. “Fraudulent
use of the SSN and similarly issued (identity credential) tokens and breeder documents—
such as driver’s licenses, birth certificates, and so forth—perpetuates identity fraud and
threatens to undermine important credentialing efforts designed to make us more secure”
(ChoicePoint, p. 6). The findings from this study indicated that the ready accessibility of
identity information in Internet-based records is related in some way to identity theft,
which diminishes the value of using public records information as a primary source of
identity verification or authentication.
Contributions of the Study
This study’s contribution is twofold. First, the study’s methodology can be used
as a foundation for future researchers of public records databases and knowledge-based
authentication systems, areas in which research remains scarce. As identity theft
becomes a politically charged issue, many industries that rely on public records data are
presenting papers in opposition to proposed limits or access restrictions on the records
and the personally identifiable information contained within them. Oftentimes, these
studies and papers do not have substantive research supporting the conclusions. As an
example, in a statement presented to the Ways and Means Committee of the U.S. House
of Representatives, PRIA argued for the continued use of SSNs in public records,
contending that “it is a common misconception that easy access to public records has
facilitated identity theft” (“Protecting the Privacy of the SSN,” 2007, ¶ 7). PRIA did not,
however, provide any data to support this assertion or to refute the findings of the
present study, instead basing its contention on a lack of evidence to the contrary and on
a 2003 Synovate study prepared for the FTC, which PRIA stated “did
not identify a correlation between public records access and the three categories of
identity theft” (2006, p. 15). An inspection of the Synovate report indicated, however,
that approximately half of the 4,057 ID theft victims surveyed stated that “they did not
know how the person who misused their personal information obtained it” (Synovate,
2003, p. 9). The term public records does not appear in the Synovate study and no
correlation analysis of public records and identity theft rates was performed. The absence
of objective research prevents legislators and agency administrators from making
informed decisions on the proper measures to take to prevent identity theft. By building
and improving on the approach used in this current study, other researchers will be able
to provide a more comprehensive understanding of how public records information
impacts identity theft rates.
This study also extends the discussion of quantifying knowledge-based
authentication to include discoverability as a factor that must be considered when
assessing its use within an authentication technology framework. From the indices
derived from the frequency with which identity attributes are found in public records, a
Discoverability Factor (DF) can be ascribed to the combinations of attributes that may be
used in knowledge-based authentication. This discoverability factor can be used by
government agencies to map the selection of specific identity attributes to appropriate e-
authentication assurance levels.
A DF can be calculated by multiplying together the index numbers of the attributes being
combined. The greater the resulting number, the higher the likelihood of discoverability,
and the lower the level of confidence, or trust, that an application owner should place in
the identity-proofing process. Two examples are provided below of how identity attribute
indices would be combined to calculate the DF:
Example A. Name (.30) x Property Tax (.08) x Place of Birth (.02) results in a DF = .00048.
Example B. Name (.30) x Home Address (.17) x Date of Birth (.14) results in a DF = .00714.
Of the two examples, Example A results in a selection of identity attributes that,
when combined with each other, provide a greater assurance level of identity
authentication than Example B as it has a lower DF and is therefore less susceptible to
discovery. Lower DFs would be more appropriate for use with online applications where
greater authentication-related risks exist. Examples of applications where stronger levels
of identity authentication may be required are those that provide access to the identity
claimant to breeder documents containing additional identity information or sensitive
health-related records.
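The DF arithmetic in Examples A and B can be reproduced with a short sketch. The index values are those quoted in the two examples; the dictionary keys and the function name are hypothetical labels chosen for illustration, not part of the study.

```python
from math import prod

# Index values as quoted in Examples A and B; the keys are illustrative labels.
ATTRIBUTE_INDEX = {
    "name": 0.30,
    "property_tax": 0.08,
    "place_of_birth": 0.02,
    "home_address": 0.17,
    "date_of_birth": 0.14,
}

def discoverability_factor(attributes, index=ATTRIBUTE_INDEX):
    """Multiply together the index numbers of the selected attributes."""
    return prod(index[name] for name in attributes)

example_a = discoverability_factor(["name", "property_tax", "place_of_birth"])
example_b = discoverability_factor(["name", "home_address", "date_of_birth"])

print(f"Example A DF = {example_a:.5f}")
print(f"Example B DF = {example_b:.5f}")

# The combination with the lower DF is the less discoverable of the two.
less_discoverable = "A" if example_a < example_b else "B"
print(f"Less discoverable combination: Example {less_discoverable}")
```

Because each index is below 1, adding another attribute to a combination can only lower the product, which matches the intuition that a thief must discover every attribute in the set.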
To fully assess the effectiveness of knowledge used as an authenticator, the
discoverability of the knowledge must be comprehensively assessed. Previous research in
this field has neglected to include the impact of discoverability on KBA and, as such,
identity confidence algorithms, as proposed by Chokhani (2003) for NIST and discussed
in chapter 2 of this study, should be recalculated to include a discoverability factor for
both public as well as proprietary information. Consistent with Chokhani’s approach,
ChoicePoint (2007) proposed to the FTC that KBA blend in a mix of static (SSNs),
dynamic (addresses), and highly dynamic (banking or other transaction records) attributes
to verify and authenticate the identity of individuals seeking to conduct secure
transactions with either the public or private sector. In practice, this suggested approach
will likely prove problematic because proprietary records holding the transactional
information that would be used for authentication (usually purchased from utility
companies, banks, warranty registrations, shoppers’ discount cards, etc.) cover only a
limited portion of the population of citizens to be authenticated, just as property records,
for example, are limited in scope to authenticating property owners.
Limitations
Several limitations were identified during the course of this research. The most
serious limitation was that the total number of records within each public records
database examined could not be ascertained using today’s technology. These databases
could contain as few as one record, or could have captured an entire county’s population
with hundreds of thousands of records. Advancements in data-mining techniques for
databases residing in this “deep Web” environment hold promise for resolving this
limitation and enabling a more complex statistical analysis.
Reliability, or the extent to which a measuring procedure will produce the same
results when repeated (Carmines & Zeller, 1979), was also a limitation of this study.
Reproducing the same sample of public records sites will prove to be difficult for
subsequent researchers, as it is analogous to shooting at a moving target. While new records
are being posted at an alarming rate to the Search Systems site, many more are being
taken offline or modified by states and counties that are becoming increasingly sensitive
to rising identity theft rates and public perception. An example of this was found
during the course of enumeration in this study with the state of Colorado. Many Colorado
sites linked in Search Systems generated a page error or posted a disclaimer that these
records had been removed. This limitation is inherent to Web-based content analyses, as
discussed in chapter 3 of this study.
Reliability limitations due to intercoder, or inter-rater, inconsistencies are also
acknowledged where more than one coder enters, or categorizes, the data in a content
analysis. This limitation was mitigated within the study through the use of a pilot test to
identify coding and categorization issues prior to a single coder entering all of the data in
the final data collection. An additional way to ensure reliability is to “measure a construct
that is very clearly and even narrowly defined” (Muijs, 2004, p. 74). Unlike other studies
that use content analysis to quantify elements contained within literature or speech,
content rules for the purposes of this study simply noted the presence of predefined
identity attributes in an electronic record. There remains, however, the possibility that
data will be coded differently in subsequent studies by different researchers.
Similar to limitations with intercoder reliability is a limitation associated with the
exploratory factor analysis used to reduce identity attributes. Another researcher
analyzing the same data could select different factors. Both of these limitations could
serve to make it difficult to generalize the findings of this study at a significant
confidence level.
Finally, it has recently been brought to public attention that Maryland’s, and
possibly other states’, online traffic records contain out-of-state license information,
including many licenses that previously used the SSN as the driver’s license number
(Krebs, 2008). While no full SSN fields were listed in any of the records examined during
this study, this study did not examine the format of the number in the driver’s license
field to determine if it was in the same format as an SSN. Additionally, Departments of
Motor Vehicles have universally abandoned the use of SSNs as the driver’s license number,
so the records with SSNs will be interspersed with records that no longer display SSNs.
This makes it more difficult to enumerate exactly the percentage of records within a single
database containing SSNs unless all of the records can be enumerated. The presence of SSNs
in other identifier fields warrants a reinvestigation before discoverability indices can be
reliably calculated.
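A reinvestigation of this kind could begin with a simple format check on the driver's license field. The following is a minimal sketch under stated assumptions: the pattern treats any nine-digit value, optionally grouped 3-2-4 with a consistent hyphen or space, as SSN-shaped; the function names and sample values are hypothetical. A match shows only that a value has the same format as an SSN, not that it is one.

```python
import re

# Assumption: an SSN-shaped value is nine digits, optionally grouped 3-2-4
# with the same separator (hyphen or space) in both positions.
SSN_SHAPED = re.compile(r"^\d{3}([- ]?)\d{2}\1\d{4}$")

def looks_like_ssn(license_number: str) -> bool:
    """Return True if the field value has the same format as an SSN."""
    return bool(SSN_SHAPED.match(license_number.strip()))

def share_ssn_shaped(license_numbers):
    """Fraction of the enumerated records whose license field is SSN-shaped."""
    if not license_numbers:
        return 0.0
    hits = sum(looks_like_ssn(value) for value in license_numbers)
    return hits / len(license_numbers)

# Hypothetical field values, made up for illustration only.
sample = ["123-45-6789", "M-123-456-789-012", "987654321", "A1234567"]
print(f"SSN-shaped share of sample: {share_ssn_shaped(sample):.2f}")
```

As the paragraph above notes, such a share describes only the records that were actually enumerated; it cannot be extended to a whole database whose total record count is unknown.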
Recommendations for Future Research
This study was an exploratory study with findings that suggest an association
between identity theft and certain identity attributes and groups of attributes, an
association that warrants additional investigation. Future research should take into
consideration the limitations encountered
in this study. Additionally, the following are suggestions for further examination of this
subject.
False Negative and False Positive Rates Associated With the Use of KBA
Many proponents of KBA have cited how KBA can be used to reduce false
positive rates. Separate tests performed by the researcher with several KBA services
indicated that the testing of false positive rates, in which unauthorized users are provided
access after successfully passing the authentication questions, has proven difficult. False
negative rates, in which individuals are denied access to an application as the knowledge
broker providing KBA services does not have sufficient data to effectively authenticate
the individual, also appear to be higher than expected. A methodology should be
developed and comprehensive testing of knowledge-based services should be performed
to derive acceptable industry-standard false positive and false negative rates, as is the
case with other authentication technologies.
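The two rates defined above can be computed from a log of authentication attempts, as in the minimal sketch below. The `Attempt` record, the function name, and the sample log are hypothetical; the sketch assumes the tester knows independently whether each claimant was the true identity owner, which is the part the text describes as difficult to arrange in practice.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    legitimate: bool  # True if the claimant really is the identity owner
    granted: bool     # True if the KBA service granted access

def error_rates(attempts):
    """Return (false_positive_rate, false_negative_rate).

    False positive: an unauthorized claimant is granted access.
    False negative: a legitimate claimant is denied access.
    A rate is None when there are no attempts of the relevant kind.
    """
    impostor = [a for a in attempts if not a.legitimate]
    genuine = [a for a in attempts if a.legitimate]
    fp_rate = sum(a.granted for a in impostor) / len(impostor) if impostor else None
    fn_rate = sum(not a.granted for a in genuine) / len(genuine) if genuine else None
    return fp_rate, fn_rate

# Hypothetical test log: 4 impostor attempts (1 granted), 5 genuine (2 denied).
log = (
    [Attempt(legitimate=False, granted=True)]
    + [Attempt(legitimate=False, granted=False)] * 3
    + [Attempt(legitimate=True, granted=True)] * 3
    + [Attempt(legitimate=True, granted=False)] * 2
)
fp_rate, fn_rate = error_rates(log)
print(f"False positive rate: {fp_rate:.2f}, false negative rate: {fn_rate:.2f}")
```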
The Impact of Registration and Fee-Based Access on Limiting Identity Theft From Public Records
While the examination of public records in this study demonstrated that some
property and court systems required user self-registration and fees, no studies have been
performed that indicate whether or not registration is a successful deterrent to an identity
thief who may be building a dossier on a target identity. It did prove, however, to be a
deterrent for the purposes of this study, as the researcher did not enumerate those fields—
which could have conceivably held more sensitive data than those that were freely
enumerated. It is recommended that subsequent researchers both ensure that sufficient
funding exists to provide for the examination of fee-based sites and record those
results so that a comparison can be made between fee-based and freely accessible sites
with respect to the type of data found in each record.
The Impact of Public Records, Both Paper-Based and Internet-Based, on Identity Theft
The GAO survey-based study performed in 2004 provided a methodology that
can be utilized to examine more than simply the availability of SSNs in public records.
The data collection can be modified to extend to all personally identifiable information in
both paper-based and electronic records, as well as track the data by state for comparison
purposes to FTC identity theft rates.
Develop a Methodology for Calculating an Identity Confidence Algorithm That Factors in Discoverability
KBA services that rely, even in part, on publicly accessible data must factor in
discoverability to their identity confidence scoring algorithms, wherever these may exist.
In the absence of verifiable, testable algorithms, researchers should consider developing a
methodology that combines discoverability with guessability (Chokhani, 2003) to
develop a more accurate scoring mechanism for assessing authentication risk. An
alternative approach proposed by Dr. Peter Alterman, Chair of the Federal Public Key
Infrastructure Steering Committee, suggested that while “combining personal information
available in multiple databases with common identity credentials does offer reasonable
assurance of identity” (2003, Abstract), the reliability of that identity is influenced by the
number and relationship of identity credentials generated over time, as well as the level to
which the identity verification service is indemnified from liability. A mathematical
algorithm was presented in his paper as a model for what would be considered an identity
confidence scoring engine. While the model largely related to presented credentials (i.e.,
birth certificate, driver’s license, or passport), components of his algorithm might be
relevant to a similar algorithm applied to KBA—as an example, an identity confidence
score based on the number of corresponding identity attributes, indemnification
considerations, as well as discoverability and guessability of the identity attributes.
Impact of Geographic Location on Identity Theft
No statistical analysis is necessary to observe the disproportionate number of
southern border states among the top 10 states with the highest incidences of
identity theft (FTC, 2008). A study should be performed to further assess the relationship
between geographic location and identity theft.
Conclusion
While the research performed in this study provides a framework for more
comprehensive testing of the impact of public records on identity theft, the purpose of
this research was not to build a case for the containment of public records. Identity theft
will not be resolved by limiting access to personally identifiable information; once the
genie is out of the bottle, so to speak, the information cannot be made private again.
Pseudo-secrets that were never intended to be kept private (DOB, mother’s maiden name, SSN,
etc.) should not be used to prove identity, as they are vulnerable to discovery if only
because the owner of the information is free to share these secrets with anyone the
owner chooses. As such, breeder documents (birth certificates, social security cards,
driver’s licenses) that are obtainable using knowledge-based authentication heavily
reliant on Internet-accessible public records should not be used to bind an identity to an
identity claimant for access to high-risk applications. Even biometric identification
systems used in secure facilities may have accomplished nothing more than binding a
physical characteristic, such as a fingerprint or retina scan, to a false identity claimant, if
the identity is authenticated by correlating personal knowledge of the identity to
discoverable facts in public records.
While this study provides a framework for further research, the full impact of
public records on identity theft and, therefore, on KBA systems will likely continue to
prove elusive to quantify, given that the magnitude of the data held in non-normalized
databases on the Internet makes mining this information difficult. The only saving grace is
that large-scale data mining of this information is also difficult for identity thieves to
perform, which for now holds back the additional increase in identity theft that will likely
result when technology matures to resolve the challenge. Acknowledging the discoverability of
this information is the first step towards developing realistic and accurate algorithms that
help agencies select appropriate authentication questions or alternative authentication
technologies.
REFERENCES
After the breach: how secure and accurate is consumer information held by ChoicePoint and other data aggregators: Hearings before the California Senate Banking, Finance and Insurance Committee (2005) (testimony of Chris Jay Hoofnagle).
Alterman, P. (2003). On the reliability of authentication of identity. Retrieved August 14, 2008, from http://www.cio.gov/fpkipa/documents/ReliabilityAuthenticationIdentity.pdf
Archer, J. (2004, November). Initiatives for protecting financial institution customers. Statement presented at Inside ID Conference and Expo, Washington, DC.
Barrett, J. (2004, February). Information sources and metrics: Authentication processes and risk decisions. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD.
Berelson, B. (1952). Content analysis in communication research. New York: Hafner.
Bergman, M. K. (2001, July). The deep Web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1), 97–99. Retrieved May 1, 2005, from University of Michigan Web site: http://www.press.umich.edu/jep/07-01/bergman.html
Bolton, J. B. (2003). E-authentication guidance for federal agencies. Memorandum to the heads of all departments and agencies (OMB-04-04). Retrieved October 25, 2004, from http://www.whitehouse.gov/omb/egov/legislation_memo.htm
Bragg, R. (2004, July). Rainbow crack—not a new street drug. Retrieved May 1, 2005, from http://redmondmag.com/columns/article.asp?EditorialsID=736
Burr, W., Dodson, D., & Polk, T. (2006). Electronic authentication guideline (NIST Special Publication 800-63, 2006 Ed.). Retrieved November 16, 2008, from http://csrc.nist.gov/publications/nistpubs/800-63/SP800-63V1_0_2.pdf
Carmines, E., & Zeller, R. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage.
Cartwright, K. (2004, February). Information sources and metrics. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD. Retrieved May 16, 2005, from http://csrc.nist.gov/kba/Presentations/Day%201/Cartwright-Info%20Sources.pdf
Chandramouli, R., Dray, J., Ferraiolo, H., Guthery, S., MacGregor, W., & Mehta, K. (2008). Interfaces for personal identity verification—part 4: The PIV transitional interface and data model specification (NIST Special Publication 800-73, 2008 Ed.). Retrieved November 16, 2008, from http://csrc.nist.gov/publications/nistpubs/800-73-2/sp800-73-2_part4_transitional-specification-final.pdf
Chen, Y. (2007). A Bayesian network model of knowledge-based authentication. Retrieved November 16, 2008, from http://research.bus.wisc.edu/yechen/Publications_files/chen-thesis.pdf
ChoicePoint. (2004). Business solutions/authentication solutions: ProID. Retrieved October 28, 2004, from http://www.choicepoint.com/business/authen/proid.html
ChoicePoint. (2007). Federal Identity Theft Task Force, project no. P065410 [Letter to Donald S. Clark, Secretary, Federal Trade Commission]. Retrieved July 15, 2008, from http://www.idtheft.gov/comments/102.pdf
Chokhani, S. (2003, February). Knowledge-based authentication metrics. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD. Retrieved December 14, 2005, from http://csrc.nist.gov/archive/kba/Presentations/Day%202/Chokhani-Attachment.pdf
Chokhani, S., Dodson, D., Hastings, N., Burr, W., & Polk, T. (2006). Special publication 800-63 part 2: Knowledge-based electronic authentication guidelines: Draft.
Cooper, D., & Schindler, P. (2003). Business research methods (8th ed.). New York: McGraw Hill/Irwin.
Datesman, G. (2004, February). Standard metrics for knowledge-based authentication. Paper presented at the “Knowledge Based Authentication: Is it Quantifiable?” symposium, Gaithersburg, MD. Retrieved May 16, 2005, from http://csrc.nist.gov/kba/Presentations/Day%201/Cartwright-Info%20Sources.pdf
Electronic Authentication Partnership. (2004, October). Report on technical interoperability. Retrieved May 7, 2005, from www.eapartnership.org/docs/Oct2004/Oct2004_D_Interoperability_Report.doc
Federal Deposit Insurance Corporation. (2004, December). Putting an end to account-hijacking identity theft. Retrieved May 5, 2005, from http://www.fdic.gov/consumers/consumer/idtheftstudy/identity_theft.pdf
Federal Trade Commission. (2004). FTC issues final rules on FACTA identity theft definitions, active duty alert duration, and appropriate proof of identity. Retrieved May 28, 2005, from http://www.ftc.gov/opa/2004/10/facataidtheft.htm
Federal Trade Commission. (2006). Consumer fraud and identity theft complaint data January–December 2005. Retrieved May 28, 2005, from http://www.consumer.gov/sentinel/pubs/Top10Fraud2005.pdf
Federal Trade Commission. (2008). Consumer fraud and identity theft complaint data January–December 2007. Retrieved March 13, 2008, from http://www.ftc.gov/sentinel/reports/sentinel-annual-reports/sentinel-cy2007.pdf
General Accounting Office. (1996). Content analysis: A methodology for structuring and analyzing written material (GAO/PEMD-10.3.1). Retrieved April 18, 2005, from http://archive.gao.gov/d48t13/138426.pdf
General Accounting Office. (1997). General policies/procedures and communications manual (GAO/GPPM-97). Retrieved April 18, 2005, from http://www.gao.gov/policy/gppm-cm.pdf
General Accounting Office. (2002). Identity fraud: Prevalence and links to alien illegal activities (GAO-02-830T). Retrieved June 6, 2006, from http://www.consumer.gov/idtheft/pdf/gao-d02830t.pdf
General Accounting Office. (2004). Social Security numbers: Governments could do more to reduce display in public records and on identity cards (GAO-05-59). Retrieved June 6, 2006, from http://purl.access.gpo.gov/GPO/LPS55812
Gordon, G., Rebovich, D., Choo, K., & Gordon, J. (2007). Identity fraud trends and patterns: Building a data-based foundation for proactive enforcement. Retrieved October 30, 2007, from Utica College, Center for Identity Management and Information Protection Web site: http://www.utica.edu/academic/institutes/ecii/publications/media/cimip_id_theft_study_oct_22_noon.pdf
Identity Theft and Assumption Deterrence Act of 1998, Pub. L. No. 105-318 Stat. 3007 (1998). Retrieved June 1, 2006, from http://www.ftc.gov/os/statutes/itada/itadact.pdf
Identity theft and social security numbers: Hearings before the Subcommittee on Commerce, Trade, and Consumer Protection of the House Committee on Energy and Commerce, 108th Cong. (2004) (prepared statement of Thomas B. Leary).
International Technology Association of America. (2004). Comments on NIST FIPS 201 draft: Personal Identity Verification (PIV) for federal employees and contractors. Retrieved June 10, 2006, from www.itaa.org/es/docs/nistpivcomments.pdf
Jain, A., Bolle, R., & Pankanti, S. (1999). Biometrics: Personal identification in networked society. Retrieved April 15, 2005, from http://www.cse.msu.edu/~cse891/Sect601/textbook/1.pdf
Javelin Strategy and Research. (2006). The 2006 Identity Fraud Survey report. Retrieved April 27, 2006, from the Council of Better Business Bureau Web site: http://www.javelinstrategy.com/products/AD35BA/27/delivery.pdf
Javelin Strategy and Research. (2008). 2008 Identity Fraud Survey report: Consumer version—How consumers can protect themselves. Retrieved July 15, 2008, from http://www.idsafety.net/803.R_2008%20Identity%20Fraud%20Survey%20Report_Consumer%20Version.pdf
Johnson, S. (2004). Defending our borders is central to fighting terror. Retrieved May 16, 2005, from http://www.samjohnson.house.gov/News/DocumentSingle.aspx?DocumentID=20720
Kaid, L., & Wadsworth, A. (1989). Content analysis. In P. Emmert & L. L. Barker (Eds.), Measurement of communication behavior (pp. 197–217). New York: Longman.
KnowX. (2005). Standard: Public record info. Retrieved May 16, 2005, from http://www.knowx.com/home.exe?form=home/fa1_pr_about1.htm
Koehler, W. (2004). A longitudinal study of Web pages continued: A report after six years. Information Research, 9(2). Retrieved May 5, 2005, from http://InformationR.net/ir/9-2/paper174.html
Krebs, B. (2005). DNA key to decoding human factor: Secret Service’s Distributed Computing Project aimed at decoding encrypted evidence. Retrieved May 3, 2005, from http://www.washingtonpost.com/wp-dyn/articles/A6098-2005Mar28.html
Krebs, B. (2008). Speeding in Maryland could be hazardous to your identity. Retrieved August 20, 2008, from http://voices.washingtonpost.com/securityfix/2008/07/maryland_traffic_site_lists_so.html
Leedy, P., & Ormrod, J. (2001). Practical research planning and design (7th ed.). Upper Saddle River, NJ: Merrill Prentice Hall.
LexisNexis. (2005). InstantID. Retrieved May 12, 2005, from http://www.lexisnexis.com/instantid/printerfriendly.asp
Liddle, S., Yau, S., & Embley, D. (2001, November). On the automatic extraction of data from the hidden Web. Proceedings of the International Workshop on Data Semantics in Web Information Systems, Yokohama, Japan. Retrieved May 11, 2005, from www.deg.byu.edu/papers/daswis01.pdf
LoPucki, L. (2001). Human identification theory and the identity theft problem. Texas Law Review, 80, 89–134. Retrieved April 29, 2005, from http://ssrn.com/abstract=263213
LoPucki, L. (2003). Did privacy cause identity theft? (Research Paper No. 03-5). Retrieved April 28, 2005, from http://ssrn.com/abstract=386881
Lyman, P., & Varian, H. (2003). How much information? Retrieved April 30, 2005, from http://www.sims.berkeley.edu/how-much-info-2003
Malin, B. (2002, December). Compromising privacy with trail re-identification: The REIDIT algorithms (CMU-CALD-02-108). Pittsburgh, PA: Carnegie Mellon University, School of Computer Science.
Martin, E. (2001). GSA and Federal CIO Council launch e-gov inventory online. Retrieved November 16, 2008, from http://www.gsa.gov/Portal/gsa/ep/contentView.do?contentType=GSA_BASIC&contentId=9204&noc=T
Muijs, D. (2004). Doing quantitative research in education with SPSS. London: Sage.
Myers, M. (1997, June). Qualitative research in information systems. MIS Quarterly, 21(2), 241–242. Retrieved April 30, 2005, from http://www.qual.auckland.ac.nz/
National Computer Security Center. (1991, September). Guide to understanding identification and authentication in trusted systems (NCSC-TG-017 Library No. 5-235,479 Version 1). Retrieved April 30, 2005, from http://www.radium.ncsc.mil/tpep/library/rainbow/
National Institute of Standards and Technology. (2004). Knowledge based authentication: Is it quantifiable? Retrieved June 9, 2006, from http://csrc.nist.gov/archive/kba/index.html
National Security Agency, Systems and Network Attack Center. (2006). The 60 minute network security guide (first steps towards a secure network environment). Retrieved July 15, 2006, from http://www.nsa.gov/snac/support/I33-011R-2006.pdf
Office of the Inspector General. (2004). Current practices in electronic records authentication. Retrieved May 17, 2006, from http://www.ssa.gov/oig/ADOBEPDF/audittxt/A-04-04-24004.htm
Office of Management and Budget. (2003). E-authentication guidance for federal agencies. Retrieved May 7, 2005, from http://www.whitehouse.gov/omb/memoranda/fy04/m04-04.pdf
Office of Management and Budget. (2004). Policy for a common identification standard for federal employees and contractors (HSPD-12). Retrieved May 7, 2005, from http://www.whitehouse.gov/omb/memoranda/fy2005/m05-24.pdf
Office of Management and Budget. (2008, January). Report to Congress on the benefits
of the President’s e-government initiatives. Retrieved August 24, 2008, from http://www.whitehouse.gov/omb/egov/documents/FY08_Benefits_Report.pdf
Olsen, F. (2005, April 25). Shopping for data. Lawmakers have tough questions for
largely unregulated data firms. Retrieved May 8, 2005, from http://www.fcw.com/article88676-04-25-05-Print
O’Neill, E., McClain, P., & Lavoie, B. (1998). A methodology for sampling the World
Wide Web. Retrieved May 5, 2005, from http://digitalarchive.oclc.org/da/ViewObjectMain.jsp?objid=0000003447&frame=true
Pinheiro, R. (2004). Preventing identity theft using trusted authenticators. Journal of
Economic Crime Management, 2(1), 1–16. Retrieved August 3, 2008, from http://www.utica.edu/academic/institutes/ecii/jecm/articles.cfm?action=issue&id=15
Property Records Industry Association. (2006). Privacy and public land records: Making
practical policy. Retrieved August 5, 2008, from http://www.pria.us/Papers/PRIAWhitePaperFinal010406.pdf
Protecting consumers’ data: Policy issues raised by ChoicePoint: Hearings before the
Subcommittee on Commerce, Trade, and Consumer Protection of the House Committee on Energy and Commerce, 109th Cong. (2005) (prepared statement of Deborah Majoras).
Protecting the privacy of the social security number from identity theft: Hearing before
the Subcommittee on Social Security of the House Committee on Ways and Means, 109th Cong. (2007) (statement of the Property Records Industry Association [PRIA]). Retrieved August 5, 2008, from http://waysandmeans.house.gov/hearings.asp?formmode=view&id=6348
Robson, C. (2002). Real world research: A resource for social scientists and
practitioner-researchers (2nd ed.). Malden, MA: Blackwell.
Schneier, B. (1999, July). Mistakes and blunders: A hacker looks at cryptography.
Keynote presentation at Black Hat USA, Las Vegas, NV. Retrieved May 5, 2005, from http://blackhat.com/html/bh-media-archives/bh-archives-97-98-99.html
Schwarzhoff, T., Dray, J., Wack, J., Dalci, E., Goldfine, A., & Iorga, M. (2003). Government smart card interoperability specification (Version 2.1). Retrieved November 16, 2008, from http://csrc.nist.gov/publications/nistir/nistir-6887.pdf
Social security number high-risk issues: Hearings before the Subcommittee on Social
Security of the House Committee on Ways and Means, 109th Cong. (2006) (prepared statement of Patrick P. O’Carroll, Jr.).
Solove, D. J. (2004). The digital person: Technology and privacy in the digital world.
New York: NYU Press.
Song, D. X., Wagner, D., & Tian, X. (2001, August). Timing analysis of keystrokes and
timing attacks on SSH. Paper presented at the 10th USENIX Security Symposium, Washington, DC. Retrieved May 5, 2005, from www.usenix.org/events/sec01/full_papers/song/song.ps
State of Alabama, Office of the Governor. (2006). Governor Riley signs law to protect
Social Security numbers. Retrieved June 5, 2006, from http://www.governorpress.alabama.gov/pr/pr-2006-04-27-01-protectssn.asp
State of Tennessee. (2003, December 30). Letter to Lt. Gov. John S. Wilder TennCare
Enrollee Database Verification Project summary report. Retrieved May 13, 2005, from http://www.tennessee.gov/tenncare/pdf/ChoicePointreport123103.pdf
Stempel, G., & Westley, B. (1981). Research methods in mass communication. Englewood
Cliffs, NJ: Prentice-Hall.
Sweeney, L. (2000). Uniqueness of simple demographics in the U.S. population (LIDAP-WP4). Pittsburgh, PA: Carnegie Mellon University, Laboratory for International Data Privacy.
Synovate. (2003). Federal Trade Commission: Identity Theft Survey report. Retrieved
September 12, 2007, from http://www.ftc.gov/os/2003/09/synovatereport.pdf
Tashakkori, A., & Teddlie, C. (2003). Handbook of mixed methods in social & behavioral
research. Newbury Park, CA: Sage.
Temoshok, D. (2005, May). E-authentication: Creating an environment of trust. Paper
presented at the Postsecondary Electronic Standards Council 2nd Annual Conference on Standards and Technology, Washington, DC. Retrieved May 16, 2005, from http://www.pesc.org/events/ACTS2/presentations/Lunch%20E-Auth.%20Temoshok.ppt
U.S. Senate Finance Committee. (2007). Filing your taxes: An ounce of prevention is
worth a pound of cure. Retrieved December 17, 2007, from http://www.senate.gov/~finance/hearings/testimony/2007test/041207testme.pdf
Weber, R. P. (1990). Basic content analysis (2nd ed.). Newbury Park, CA: Sage.
Willox, N. (2001). Identity theft: Authentication as a solution revisited. Retrieved May
12, 2005, from www.lexisnexis.com/risksolutions/conference/docs/authentication.pdf
Zeller, T. (2005, May 18). Personal data for the taking. New York Times. Retrieved
November 16, 2008, from http://www.unm.edu/~pre/law/articles_advise/technology.html
APPENDIX A. CODEBOOK
This codebook provides a listing of record types, attribute codes, and instructions
to enable the researcher to count identity attributes, or variables, present in online public
records. Coding was input directly into an Excel spreadsheet, which was set up to collect
the following data.
State: The name of the state from which the records were reviewed.
Number of Sites: The total number of sites for a selected state returned by the
keyword search. This includes all sites, whether freely accessible or not. This number is
collected immediately after the search results are returned following the instructions
below.
Number of Sites with IDA: The total number of sites within a state, after
exclusions, containing identity attributes.
Number of IDA by Record Type: The total number of different identity attributes
by record type. This variable is counted once per record type, irrespective of the number
of sites within the state’s record type containing the information.
A Priori Record Description: A description of each public record examined for
the purposes of this analysis. Records meeting any of the following conditions will be
excluded from examination:
1. Archive or historical records. Records must be currently reported by the government office. Archive and historical records for individuals over 100 years old were not examined.
2. Fee-based records or those requiring a paid subscription.
3. Sites requiring registration.
4. Library, newspaper, or genealogical records.
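The four exclusion conditions above amount to a simple filter applied to each site before it is counted. The following is only an illustrative sketch of that filter: the `Site` class, its field names, and the helper functions are hypothetical and were not part of the original Excel-based procedure, in which the coder applied the exclusions by inspection.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical description of one site returned by a search.

    Each flag corresponds to one exclusion condition in the codebook.
    """
    is_archive_or_historical: bool = False           # exclusion 1
    requires_fee: bool = False                       # exclusion 2
    requires_registration: bool = False              # exclusion 3
    is_library_newspaper_or_genealogy: bool = False  # exclusion 4


def is_excluded(site: Site) -> bool:
    """Return True if any one of the four exclusion conditions applies."""
    return (
        site.is_archive_or_historical
        or site.requires_fee
        or site.requires_registration
        or site.is_library_newspaper_or_genealogy
    )


def eligible_sites(sites):
    """Keep only the sites that survive all four exclusions."""
    return [site for site in sites if not is_excluded(site)]
```

A site is dropped as soon as any single condition holds, which matches the wording of the list: the conditions are alternatives, not a combined test.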
The following public records categories will be examined during the enumeration:
1. Accident Reports—Retrieved using the keywords “accident” and the state name in the Advanced Search - Match all keywords function.
2. Birth Records—Birth certificates or indices from official public records sources. Retrieved from Search Systems using Search Public Records by Type of Record -> Births.
3. Court Records—Civil or criminal court filings and case dispositions, civil suits, judgments, deed transfers, property liens, divorce records, and traffic court records. Retrieved using Search Public Records by Type of Record -> Court Records.
4. Inmate/Arrest Records—Retrieved using Advanced Search to Match all keywords using the state name and also including the terms “arrest” and “inmate” in the Match any keyword.
5. Marriage Records—Includes licenses and marriage certificates. Retrieved using Search Public Records by Type of Record -> Marriages.
6. Property Records/Deeds—County property tax assessment records. Retrieved using the Search Public Records by Type of Record -> Property – U.S.
7. Voter Registrations—Current data from the state on registered and inactive voters. Retrieved using the Search Public Records by Type of Record -> Voters.
Identity Attributes (IDA): Each attribute is counted when the record is reviewed
by placing a “1” in the field to indicate the presence of any of the following identity
attributes:
1. Name—First name, last name, middle name or middle initial, or combination of any of these
2. DOB—Any combination of the month, day, and year of birth
3. Birth Yr—Year of birth or the age is present
4. Mother’s Maiden Name—Mother’s maiden name
5. POB—City, state, or both of birth place
6. Address—Home address
7. SSN—Full 9-digit Social Security Number
8. Last 4 of SSN—Last 4 digits of the individual’s Social Security Number
9. Home Phone—Home phone number of an individual – with or without the area code
10. Driver’s License Number—Individual’s driver’s license number
11. Vehicle ID—Boat, car, plane, or other VIN, including license plate
12. Property Value/Sale Price—A property’s assessed value, mortgage amount, deed transfer amount
13. Last Year’s Prop. Tax—Current or last year’s assessed property tax amount
14. Sq. Ft.—Square Footage (Finished Area) of House
15. Phys. Des.—Details of an individual’s physical description, such as any combination of sex, gender, hair color, or eye color
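For readers who prefer to see the codebook as a data structure, the record categories and attribute codes listed above can be written out as two lists, with one coding row per state and record type. This is a minimal sketch under stated assumptions: the study used an Excel spreadsheet, and the function names `blank_row` and `ida_count` are hypothetical. The category and attribute labels are taken from the lists above.

```python
# Record categories and identity attribute (IDA) codes from the codebook.
RECORD_TYPES = [
    "Accident Reports", "Birth Records", "Court Records",
    "Inmate/Arrest Records", "Marriage Records",
    "Property Records/Deeds", "Voter Registrations",
]

IDA_CODES = [
    "Name", "DOB", "Birth Yr", "Mother's Maiden Name", "POB", "Address",
    "SSN", "Last 4 of SSN", "Home Phone", "Driver's License Number",
    "Vehicle ID", "Property Value/Sale Price", "Last Year's Prop. Tax",
    "Sq. Ft.", "Phys. Des.",
]


def blank_row(state, record_type):
    """Create an empty coding row (hypothetical layout) for one state and record type."""
    if record_type not in RECORD_TYPES:
        raise ValueError(f"Unknown record type: {record_type}")
    row = {
        "State": state,
        "Record Type": record_type,
        "Number of Sites": 0,
        "Number of Sites with IDA": 0,
    }
    # Every attribute starts at 0; the coder sets it to 1 when it is present.
    row.update({code: 0 for code in IDA_CODES})
    return row


def ida_count(row):
    """Number of IDA by Record Type: how many distinct attributes were flagged."""
    return sum(row[code] for code in IDA_CODES)
```

For example, a row for Virginia birth records in which only `Name` and `DOB` are flagged would give an `ida_count` of 2, regardless of how many sites contained those attributes.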
APPENDIX B. CODER FORM
An electronic Excel coding form, similar to that in Figure B1, was used to collect
data. The following steps were used to transfer data to the coder form.
1. For each state assigned to the coder, a search was performed at Search Systems within each category (Record Description) of public records. The following information was recorded as the data was reviewed:
a. The number of sites reported by Search Systems from the search, in the column labeled “# of Sites.”
b. The number of sites containing public records after the application of the exclusions, in the column labeled “# of sites with info.”
2. Each site was examined for the presence of identity attributes (IDA). For each site with identity information present, a “1” was placed in the appropriate IDA column on the coding form to indicate what type of information was found. At no time was actual personally identifiable data found at the site recorded onto the spreadsheet. Each identity attribute was enumerated only once per record category in each state. For instance, if there were 24 total sites containing information, with 8 of those sites containing birth dates, a single “1” was placed in the column under DOB to indicate that the date of birth was found in that record category.
3. For records that required a name lookup, a generic name, such as Jones or Smith, was used to retrieve a record. In some cases where property tax records were examined, a lookup on 411.com was performed, or a generic address (e.g., 101 Main St.) was provided to retrieve a record.
Figure B1. Codebook.
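The once-per-category rule in step 2 can be illustrated with a short sketch. It assumes, purely for illustration, that each examined site is represented as a set of the attribute codes found on it; the function name `code_record_category` is hypothetical and does not appear in the original procedure, which was carried out by hand on the coding form.

```python
def code_record_category(sites_attributes):
    """Flag each identity attribute at most once for a record category.

    sites_attributes: an iterable with one set of attribute codes per site.
    Returns a dict mapping every attribute found on at least one site to 1.
    """
    found = set()
    for attributes in sites_attributes:
        found |= set(attributes)  # union: repeats across sites add nothing
    return {code: 1 for code in sorted(found)}


# The example from step 2: 24 sites with information, 8 of which show birth dates.
sites = [{"Name", "DOB"}] * 8 + [{"Name"}] * 16
flags = code_record_category(sites)
# flags["DOB"] is 1, not 8: the attribute is recorded once for the category.
```

Because the result records presence rather than frequency, the coded data describe how many different attribute types a record category exposes in a state, not how many sites expose them.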