answer the questions based on the article
The Catch data warehouse: support for community health
care decision-making
Donald J. Berndt a , Alan R. Hevner
a,*, James Studnicki b
a Information Systems and Decision Sciences Department, College of Business Administration, 4202 Fowler Ave., CIS1040,
University of South Florida, Tampa, FL 33620, USA b College of Public Health, University of South Florida, Tampa, FL 33620, USA
Accepted 1 April 2002
Abstract
The measurement and assessment of health status in communities throughout the world is a massive information technology
challenge. Comprehensive Assessment for Tracking Community Health (CATCH) provides systematic methods for community-
level assessment that is invaluable for resource allocation and health care policy formulation. CATCH is based on health status
indicators from multiple data sources, using an innovative comparative framework and weighted evaluation process to produce a
rank-ordered list of critical community health care challenges. The community-level focus is intended to empower local decision-
makers by providing a clear methodology for organizing and interpreting relevant health care data. Extensive field experience
with the CATCH methods, in combination with expertise in data warehousing technology, has led to an innovative application of
information technology in the health care arena. The data warehouse allows a core set of reports to be produced at a reasonable
cost for community use. In addition, online analytic processing (OLAP) functionality can be used to gain a deeper understanding
of specific health care issues. The data warehouse in conjunction with Web-enabled dissemination methods allows the infor-
mation to be presented in a variety of formats and to be distributed more widely in the decision-making community. In this paper,
we focus on the technical challenges of designing and implementing an effective data warehouse for health care information.
Illustrations of actual data designs and reporting formats from the CATCH data warehouse are used throughout the discussion.
Ongoing research directions in health care data warehousing and community health care decision-making conclude the paper.
D 2002 Elsevier Science B.V. All rights reserved.
Keywords: Health care information systems; Data warehousing; Data staging; Online analytic processing (OLAP); Decision support systems;
Community decision-making; Data quality
1. Introduction
The United States spends over a trillion dollars
annually on health expenditures. Both as a percentage
of national productivity and per capita, health care
spending by the United States exceeds that of any other
nation in the world. However, this tremendous expen-
diture has not secured the U.S. a rank among the
‘healthiest’ nations. In fact, for many health indicators,
such as infant mortality and measles immunizations,
the U.S. ranks below some countries characterized as
underdeveloped [23,29]. Prolonged public debates on
health care policy in the United States have focused on
0167-9236/02/$ - see front matter D 2002 Elsevier Science B.V. All rights reserved.
doi:10.1016/S0167-9236(02)00114-8
* Corresponding author.
E-mail addresses: dberndt@coba.usf.edu (D.J. Berndt),
ahevner@coba.usf.edu (A.R. Hevner), jstudnic@hsc.usf.edu
(J. Studnicki).
www.elsevier.com/locate/dsw
Decision Support Systems 35 (2003) 367–384
insurance coverage and medical care financing pro-
grams without any serious examination of the true
health status of the nation.
The need to assess the health status of U.S.
communities in a comprehensive and systematic
manner has been widely recognized within the health
professions. The Institute of Medicine (IOM) of the
National Academy of Sciences has acknowledged the
importance of a population-based perspective in two
influential reports, emphasizing the need for a regular
and systematic collection, assemblage, and analysis
of the health status of our nation’s communities
[16,18]. A community health profile is comprised
of socio-demographic characteristics, health status
and quality of life indicators, health risk factors,
and health resource measures. The intent of such a
comprehensive health profile is to assist a community
in developing, refining, and monitoring a long-term
strategic view of its overall health status. Although
there are many sources of health data, there are no
standard data definitions, formats, or reports across
the health care industry. Thus, health care data are
widely used (and misused) in an ad-hoc manner to
justify managerial objectives of health institutions
and agencies, a maze of mandated categorical fund-
ing, and a variety of political agendas. Sound infor-
mation and accepted analytic techniques are even
more important as funding is consolidated in block
grants and local community decision-making is
emphasized.
As part of the ongoing clarification of the public
health role at the community level and the transition
from a disease to a health focus and from a treatment
to a prevention strategy, there has been recognition
that partnerships and collaboration are necessary to
support effective action [17,21]. Health organizations,
public sector agencies, medical care providers, busi-
nesses, the religious community, educational institu-
tions, and other community organizations are
interdependent components of a multi-sectoral com-
munity health environment. The overall community
must be empowered to make the necessary, and
sometimes difficult, resource allocation choices to
improve health through information, education,
behavior change, and social support [7]. Such collab-
orative action at the community level must be
informed by unbiased data describing the communi-
ty’s health status, needs, and resources. The ability is
also needed to track progress over time to meet the
community’s health care goals [24].
The gap between current practice in community
health care spending and the above goals of collabo-
rative community health care decision-making is vast.
The availability and quality of health indicators are
problematic. There is little empirical evidence on the
use, sharing, or integration of health data into deci-
sion-making to provide guidance to community health
organizations. While most of the literature on collab-
orative leadership and community engagement
emphasizes the process [4,5], little attention has been
focused on the effect of the availability of a common
set of data, such as the community health profile, on
the quality and inclusiveness of decision-making.
There is also scant information about the use of data
and information technology to support and monitor
the process.
The purpose of this paper is to present an overview
of the Comprehensive Assessment for Tracking Com-
munity Health (CATCH) methods [25] and then focus
on the construction of a comprehensive health care
data warehouse that provides automated support for
CATCH. The combination of extensive field experi-
ence with CATCH and the application of current data
warehousing technology make this an innovative
interdisciplinary research effort. Section 2 briefly
presents the CATCH methods and our motivation
for building a data warehouse. In Section 3, we
present a detailed discussion of the technical chal-
lenges in designing and implementing the data ware-
house. Twin star data staging, an effective approach
for ensuring quality as data are entered into the
warehouse, is highlighted in Section 4. Section 5
discusses the use of the data warehouse for advanced
health care applications. The paper concludes with
future research directions on data warehousing tech-
nical challenges and the use of health profiles to
support improved community health care decision-
making.
2. The CATCH methods of community assessment
The University of South Florida’s Center for
Health Outcomes Research (CHOR) developed
CATCH to provide comprehensive and objective
health status data for community health planning
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384368
purposes. CATCH collects, organizes, analyzes, pri-
oritizes, and reports data on over 250 health and social
indicators on a local community basis. The CATCH
methods have been tested, refined, and validated in
the field over the past 10 years. Reports have been
prepared for more than 20 U.S. counties both within
and outside of Florida.
The CATCH process can be briefly described as
shown in Fig. 1. Community health indicator data are
gathered from a variety of sources. Secondary data
sources include health care data reported by hospitals,
local, state, and federal health agencies, and national
health care groups. Primary data sources would involve
data gathered from door-to-door or mail-in surveys. All
health care data are translated into common formats and
integrated with other data warehouse components to
support the production of health care report cards.
Over 250 indicators are used within CATCH and
are organized into 10 indicator categories. These
indicators and categories represent a wide spectrum
of health care issues and have evolved through both
research and field practice. Table 1 lists the 10
indicator categories and presents a few representative
indicators to lend a sense of perspective to the level of
detail provided in CATCH reports. These indicators
are collected from a variety of sources.
Each indicator value is compared against the state
average, an average from a peer group of counties,
and other interesting values (e.g., a national goal for
that indicator) [26]. The results of these comparisons
are organized into a multi-dimensional matrix based
on favorable or unfavorable comparisons against each
comparison dimension. Fig. 1 shows a 2-by-2 com-
parison matrix based on state averages and peer
Fig. 1. The CATCH process.
Table 1
Ten indicator groups with representative indicators
Demographic
Characteristics
Health Status: Morbidity
and Mortality
Total Population Breast Cancer
Racial Composition Cardiovascular Disease
Net Migration Stroke
Socioeconomic
Characteristics
Sentinel Events
Rubella
Employment Measles
High School Dropouts Late Stage Cancer
Per Capita Income Avoidable Hospitalizations
Maternal and Child Health Health Resource Availability
Infant Mortality Licensed Hospital Beds
Low Birthweight Licensed Medical Doctors
Birth Defects Mortality Licensed Registered Nurses
Social and Mental Health Infectious Disease
Domestic Violence Syphilis
Homicide Rate AIDS
Psychiatric Admissions Hepatitis
Physical Environmental Health Behavioral Risk Factors
Foodborne Outbreaks Smoking
Contaminated Wells Obesity
Lead Poisoning Mammograms
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 369
averages. Community indicators that demonstrate
unfavorable comparisons on all dimensions are high-
lighted as community health challenges. After this
simple comparison, the health care challenges are
prioritized using a set of five filters.
Number Affected—number of persons in the
community affected by the indicator.
Economic Impact—an estimate of the direct cost
per case for individuals affected by the indicator.
Availability of Efficacious Intervention—an esti-
mate of the relative degree to which treatment or
prevention is likely to be effective.
Magnitude of Difference—the degree to which the
community indicator is worse than the dimensional
comparisons.
Trend Analysis—for a 5-year period is the trend
favorable or unfavorable and what is the magnitude
of change in the trend direction?
The community stakeholders are given an oppor-
tunity to weight the importance of each of the above
factors. The final product of the CATCH methodology
is a comprehensive, prioritized listing of community
health care challenges. A more detailed description of
the CATCH methodology with a complete listing of
health care indicators can be found in Ref. [25].
2.1. Limitations
While the value of CATCH is incontrovertible, the
ultimate deployment of CATCH throughout Florida
and the nation has been constrained by several serious
limitations: . The handcrafted process is labor-intensive and
slow. Hundreds of individual sources of data must be
identified and contacted. Data are often provided in
hard copy formats and must be manually checked,
validated, and entered into spreadsheets. With manual
methods, it takes 3–4 months to complete a CATCH
report for a single county. . Longitudinal trend analyses over many years are
cost prohibitive for most communities. Since each
application is expensive and time-consuming, the
capability to fund and produce annual assessments
in a single community is limited. . Most public health funding comes from state and
federal governments. A statewide CATCH assessment
would help to prioritize funding and serve to enable
effective program evaluation based on quantifiable
outcome assessment. Since nearly all data indicators
available in Florida are available in most other states,
there is reason to be confident that CATCH will be
expanded nationally and even internationally. . With the massive amount of health care data
involved, many interesting relationships and correla-
tions between health status indicators can be found
and investigated. In the manual system, such discov-
ery was not feasible. A comprehensive and integrated
data warehouse provides the infrastructure for such
data mining efforts.
2.2. CATCH data warehouse challenges
The application of data warehousing technologies
for the automated support of CATCH holds tremen-
dous promise. The remainder of this paper describes
our work to construct an effective and efficient data
warehouse solution, enabling both cost-effective
report generation and ad-hoc analyses of critical
health care issues. The construction of a data ware-
house for public health care data poses major chal-
lenges beyond those required for the construction of a
commercial data warehouse (e.g., retail sales). Such
challenges include the following. . Data come from a very diverse set of sources.
Health care data are published in a wide variety of
formats with differing semantics. There are currently
few standards in the health care field for such data.
The data integration task to build the data warehouse
requires significant effort. . CATCH reports are disseminated to a diverse
and geographically distributed set of stakeholders. . The data warehouse is required to support the
activities of public policy formulation. The socio-
political issues of health care planning impact design
features such as security, availability, data quality, and
performance.
3. The CATCH data warehouse
The goals of the CATCH data warehouse include
the support and enhancement of the CATCH methods,
the provision of cost-effective and thorough reports to
communities, and the creation of a rich environment
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384370
for more detailed research into critical health care
issues. In addition, a focus on data quality makes the
data warehouse an especially valuable asset over time
as a rich and trustworthy historical repository is built.
Lastly, the data warehouse lends itself to a variety of
dissemination strategies based on hardcopy reports,
interactive access, and Web-enabled information
delivery. The different access technologies allow a
diverse group of community planners and stakehold-
ers to investigate important health care issues using
comparable data. All of these characteristics make the
CATCH data warehouse a unique application of
technology in the field of public health. In fact, the
implementation of this type of data warehouse and its
use in monitoring, as well as improving health status,
will become a primary role of public health agencies
in the future.
The CATCH data warehouse includes a variety of
components arranged in three broad categories:
reporting tables for direct support of the CATCH
methods, aggregated dimensional structures, and
fine-grained or transaction-oriented dimensional struc-
tures. In the sections that follow, examples of these
data warehouse components are presented. All of the
components draw on the dimensional model or star
schema, some components with more than a dozen
dimensions and some with a few simple dimensions.
3.1. The dimensional model
Important missions of a data warehouse include the
support of decision-making activities and the creation
of an infrastructure for ad-hoc exploration of very
large collections of data. Decision-makers should be
able to pursue many of their investigations using
browsing tools, without relying on database program-
mers to construct queries. The emphasis on end-user
data access places a premium on an understandable
database design that provides an intuitive basis for
navigating through the data. The star schema or
dimensional model has been recognized as an effec-
tive structure for organizing many data warehouse
components [12,15,19]. The star schema is character-
ized by a center fact table, which usually contains
numeric information that can be used in summary
reports. Radiating from the fact table are dimension
tables that provide a rich query environment. This
structure provides a logical data cube, with dimen-
sions such as time and location identifying a set of
numeric measurements within the cube. Fig. 2 con-
tains a fragment from the hospital discharge trans-
action-oriented star schema discussed in this paper.
3.1.1. Fact tables
The most appropriate facts are additive numeric data
items that can be summed, averaged, or combined in
other ways across the dimensions to form summary
statistics. The only way to compress the millions of data
points and produce a reasonably sized answer set is to
present some mathematical summarization. No human
will want thousands, let alone millions, of items in
answer to their queries. As Kimball [19] pointsout, ‘‘the
best and most useful facts are numeric, continuously
valued, and additive.’’ The CATCH data warehouse
includes facts such as counts of hundreds of different
health events, population-based rates, age-adjusted
rates, and even fine-grained financial data in the case
of the hospital discharge data depicted in Fig. 2. For
example, using the hospital discharge star it is possible
to focus on a single hospital (using the hospital dimen-
sion), select a single disease (using the ICD DIAGNOSIS
dimension), and investigate how the length of stay has
varied over a specified time period. Using the hierarch-
ical nature of the dimensions, it is also possible to ‘roll-
up’ to compare types of hospitals, disease categories,
or even patient age bands. While the dimensional
structure is simple and readily understandable, it sup-
ports a large and very useful universe of queries.
3.1.2. Dimension tables
The dimensions define the query environment, the
richer the set of dimensions the more ways the data
can be accessed via queries. Two of the important
characteristics of dimensions are the richness of the
attributes that describe the dimension and the hier-
archical nature of the dimension. For example, the
COUNTY dimension in the CATCH data warehouse
includes attributes that describe whether a county is
coastal, wealthy, urban, dense, large in area, or
includes a military base. Therefore, the counties can
be organized by any value in this attribute set. Some
of the attributes lend themselves to hierarchical
organization. In the case of COUNTY, there is natural
geographic hierarchy that includes groups of counties
that form regions within the state and the state itself.
The county is also composed of finer geographic units
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 371
such as communities, ZIP codes, and census tracts. The
dimension hierarchies enable roll-up and drill-down
operations that control the level of detail in queries.
These formally defined hierarchies also provide the
framework for navigation or data browsing.
In order to describe the dimension hierarchies suc-
cinctly to both end-users and developers, dimension
hierarchy diagrams have been utilized in the CATCH
data warehouse design process. These diagrams show
the hierarchical nature so that end-users have an
uncluttered view of how they can navigate and design-
ers can easily understand the dimensional structures.
Fig. 3 illustrates an important health care dimension
based on the International Classification of Disease
(ICD) codes. Currently, we are using versions 9 and 10
of the ICD codes. These codes are divided into chapters
and sections, which provides a natural hierarchy for the
codes. Fig. 3 shows the hierarchical structure using
separate tables, but these tables can be easily denor-
malized to enhance query performance. In addition,
there are several other tables that provide alternative
hierarchies for this important dimension. This ICD
PROCEDURE dimension is combined with many other
dimensions such as patient age, gender, mortality risk,
and severity of illness to form star schemas (see Fig. 2)
with rich query environments.
3.2. Data warehouse design: the data access pyramid
The mission of the CATCH data warehouse is to
support the automated and cost-effective application
of CATCH, as well as to enable more detailed
Fig. 2. Hospital discharge star schema (not all dimensions are shown).
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384372
analyses that were not possible using the coarse-
grained data that typified past CATCH reports. In
order to meet these goals, the data warehouse design
includes several levels of data granularity, from the
coarse-grained data used in generic report production
to actual event-level data, such as hospital discharges.
The data warehouse design includes major compo-
nents at all three levels of granularity as illustrated in
the data access pyramid found in Fig. 4.
Report indicators—Reporting tables with derived
or highly aggregated data are used to support the core
CATCH reports, including comparisons between a
target county and peer counties. These tables also
provide fast response for interactive access via data
browsing tools and can provide the foundation for
simple community-wide Internet access. In addition,
the metadata play an important role at the reporting
level, providing indicator definitions, state or federal
goals, and expert domain knowledge for priority
filters (e.g., economic impact and treatment availabil-
ity). This report level of the data warehouse may not
be needed in all data warehouse applications but
provides important support for rapid generation of
community CATCH reports.
Aggregate data—There are families of star sche-
mas that provide true dimensional data warehouse
capabilities, such as interactive roll-up and drill-down
operations. These components have carefully designed
dimensions that can be utilized by more sophisticated
data browsing tools. The star schemas are populated
using thorough data staging and quality procedures that
usually involve processing detailed data sets extracted
by various health care agencies and organizations.
Typically, the data are aggregated and transformed for
loading into a family of related star schemas—a con-
stellation—that share important dimensions and sup-
port interactive online analytic processing (OLAP)
techniques.
Transaction data—For certain types of informa-
tion, the design calls for retaining very fine-grained or
even event level data. An example is the hospital
discharge data that includes each hospital discharge
event for the more than 200 hospitals that are man-
dated to report such information in Florida. These data
are retained at the transaction level because of the rich
set of facts and dimensions available for analysis and
the density of potential aggregations that result in
negligible space savings.
These three levels of aggregation within the data
warehouse combine to meet a wide range of reporting
requirements and performance goals, thus providing a
flexible basis for disseminating health care informa-
tion to community decision-makers. The following
two sections (Sections 3.3 and 3.4) provide some
examples of the major data warehouse components.
At the aggregate data level, a coarse-grained compo-
nent based on the Public Health Information Data
System (PHIDS) is used to support CATCH report
production and high-level browsing. A second exam-
Fig. 3. ICD PROCEDURE dimension hierarchy.
Fig. 4. Data access pyramid.
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 373
ple aggregate is procedure volume information
formed from the underlying hospital discharge data.
The original hospital discharge data provide an exam-
ple of transaction-oriented data that supports detailed
analyses, along with other data such as vital statistics
(e.g., births and deaths) and specific disease registries.
3.3. Aggregated Florida Department of Health Data
An example of a highly aggregated data warehouse
component is the Public Health Information Data
System (PHIDS) star schema. The Florida Department
of Health collects, analyzes, and reports a large
number of public health indicators. These items have
always provided critical assessment measures within
CATCH. The importance of the PHIDS indicators
made them obvious candidates for inclusion in the
data warehouse and a natural resource for automation
of the traditional CATCH report.
The PHIDS indicators are clearly not the fine-
grained data that support a detailed OLAP environ-
ment. The data are highly aggregated and provided
annually at the county level. Therefore, the data set is
suitable for generating the traditional CATCH report,
but unsuitable for more specific analyses. Essentially,
the construction of the data warehouse has been a
search for both fine and coarse data that can provide
synergies through integration. The simple star schema
used to implement the PHIDS-based data warehouse
component has only the year reported and the county
as explicit dimensions. Currently, many of the PHIDS
indicators are maintained using spreadsheets at the
Florida Department of Health. For use in the data
warehouse, the data are first extracted from the
spreadsheets, reformatted using custom staging pro-
grams, and then loaded via a bulk loader utility. The
twin star staging process, as described in Section 4, is
used to ensure data quality. Data correctness is veri-
fied by sampling the data and comparing the data
warehouse values with published PHIDS reports.
3.4. Transaction-oriented hospital discharge data
Florida hospital discharge transactions are col-
lected by the Agency for Health Care Administration
(AHCA) from the more than 200 short-term acute care
hospitals in the state. These hospitals report every
discharge transaction, regardless of payer, throughout
the state. Hospital discharge data are used to derive
several CATCH indicators such as avoidable hospital-
izations due to diabetes and other chronic diseases.
Typically, the large volume of hospital discharge
transactions is scanned to form derived or aggregated
data for CATCH indicators. However, the broader
mission of the CATCH data warehouse is both to
support the CATCH methods and enable more
detailed investigations of critical local health care
issues. It is the ability to fully explore issues at
appropriate levels of detail that make the fine-grained
components so important. While first staging and
preprocessing the hospital discharge data for use in
forming CATCH indicators, the value of the discharge
transactions themselves became very apparent. The
hospital discharge transactions provide an interesting
set of numeric data items, such as length of stay and a
breakdown of revenues, which are very well suited for
a data warehousing approach. In addition, the trans-
actions include a rich set of attributes that provide
many natural dimensions for use in formulating
queries.
Transaction-based star schemas can provide very
useful functionality within a data warehouse frame-
work, making the hospital discharge star an important
component of the CATCH data warehouse. The hos-
pital discharge data includes over 20 interesting
dimensions such as the discharging hospital character-
istics, admission criteria, diagnostic codes, procedure
codes, reimbursement categories, time, geographic
location, and many others. Furthermore, many of
these dimensions are hierarchical in nature, easily
supporting important roll-up/drill-down operations.
Fig. 2 is a partial representation of the discharge star
schema. The discharge star is equally rich in additive
numeric facts. For instance, length of patient stay is a
particularly important measurement for analysis.
There is also a measurement indicating elapsed days
until the medical procedure. Finally, there is a total
revenue item that provides important cost information.
In fact, there is also a large text field with embedded
revenue items that provides a breakdown of the
various costs from room charges to laboratory fees.
Procedures to parse this text field have been devel-
oped as part of the data staging activities and are used
to extract revenue items, providing nearly 30 interest-
ing numeric facts for each transaction. It is not
uncommon to have useful information buried in text
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384374
fields that must be preprocessed using data staging
tools or customized procedures. This can be a chal-
lenging task since the source database has no under-
standing of the structure embedded in such text fields
and therefore, simple query access is impossible. In
this case, the rich set of facts and highly dimensional
structure of the hospital discharge data make it a
powerful warehouse component for detailed investi-
gations and customized analyses.
The hospital discharge star has repeating groups for
diagnoses (ICD DX 1–10) and procedures (ICD
PROCEDURE 1–10). This design mirrors the underlying
data and simplifies the data staging process for the
millions of discharge records used in the project. An
alternative design without repeating groups might
simplify some queries, but this fine-grained data is
at the bottom of the data access pyramid and is
typically aggregated for most query processing. The
original positional representation also conveys infor-
mation relevant to health care coding practitioners and
is used in several ancillary algorithms. For many
purposes, the primary diagnosis or procedure is used
in calculating higher-level health care indicators, so
this structure is maintained in the transaction-oriented
data [28].
It is sometimes preferable to store the actual trans-
actions rather than lightly aggregated data that has
been derived from the underlying transactions. Kim-
ball [19] uses the term sparsity failure to describe the
size explosion that can occur when creating aggregate
data from a sparsely populated fact table. Detailed fact
data such as hospital discharge transactions will
probably not have all combinations of the dimensions
present in the actual data. In other words, not all
diseases occur in all hospitals during a particular year
and therefore the effect with regard to size is not
multiplicative. If we consider only the cardinality of
the actual dimensions then the possible combinations
of dimension key values is very large for the hospital
discharge data. For example, consider the following
four dimensions with approximate cardinalities, hos-
pitals (250), ICD codes (15,000), severity ratings (5),
and payers (10). This could result in 187.5 million
dimension key combinations. Further, we can define
density as the actual number of records (roughly 2
million/year for discharges) divided by the potential
combinations of dimension keys, yielding a density of
2/187.5 or roughly 1.07%. This remarkably low
density makes intuitive sense since the very fine
ICD distinctions lead to sparse usage. Imagine that
we decide to construct an aggregate table by creating
150 disease categories that summarize the 15,000 ICD
codes, reducing the dimension size by a factor of 100.
In this case, all 150 categories may appear for each
hospital (a reasonable assumption) giving a density of
100% and roughly 1.9 million rows. This rather
insignificant space savings comes at the expense of
losing the richness of the original ICD codes and the
flexibility of having individual cost data for each
transaction. Therefore, in the CATCH data warehouse
and many other applications, transaction-oriented
components make good sense. In fact, to really under-
stand the implications for tasks such as data ware-
house capacity planning it is often necessary to
sample the data to discover the actual distribution of
dimension values. The design challenge is to carefully
consider the number of fine-grained items that are
summarized to form the aggregate data and look for a
factor of 10 or more as a reasonable compression ratio
[19].
3.5. Performance issues
The large volumes of data contained in the CATCH
data warehouse coupled with demanding queries can
conspire to produce some truly awful performance. As
in any database project, good design is the most
effective tool for enhancing performance. The
CATCH data warehouse design continues to evolve
in response to new challenges. In addition to design
changes, three other techniques offer avenues for
improving performance: aggregate tables, star schema
indexing strategies, and physical table partitions.
3.5.1. Aggregates
Many data warehouse designers identify aggre-
gates as one of the most effective strategies for
improving performance. Kimball [19] notes that
‘‘aggregates can have a very significant effect on
performance, in some cases speeding queries by a
factor of 100 or even 1000.’’ If the aggregate data are
useful, having the data physically ready and waiting
will certainly improve query speeds. In addition, if
sparsity failure is avoided, then the amount of data
required may also be substantially reduced. That is,
benefits from both reduced space and previously
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 375
handled computations can accrue through the use of
aggregates. In addition, many data warehousing nav-
igation tools are aggregate-aware, making the aggre-
gate structures transparent to the end user. However,
there are a potentially large number of aggregates that
are possible given a rich set of dimensions. The choice
of which aggregate tables to build is based on the type
of queries being executed and will naturally change
over time [14].
Aggregates play an important role in the CATCH
data warehouse. Some data are extracted and loaded
in aggregate form, such as the PHIDS indicators
discussed above, and other aggregates are derived
from more detailed data warehouse components. For
instance, vital statistics such as death and birth certif-
icates are used to derive a collection of aggregated
mortality and birth-related indicators. There are two
somewhat different purposes for aggregates. Highly
aggregated data are used to directly support traditional
CATCH report production, while lightly aggregated
data are used to improve query performance. The
continual re-evaluation of aggregates is an important
task in data warehouse administration.
3.5.2. Indexing
Many database management systems intended for
data warehousing support bitmap index structures.
Bitmap indexes are especially suited to low cardin-
ality dimensions such as admission quarter, day of the
week, gender, and others. These indexes are space
efficient and speed the star queries that characterize
access to fine-grained structures such as the hospital
discharge data. Another technique is to cache the
smaller dimension tables in memory for improved
query performance. All of these techniques have been
employed and performance tuning continues to be an
ongoing activity as the user community grows and
explores new uses for the data warehouse.
3.5.3. Partitioning
The third important performance tuning technique
is the use of physical table partitioning [6]. The use of
table partitions is important both for query perform-
ance, as well as data warehouse management. Since
the data are loaded or staged at different times, these
activities can be isolated through partitioning. This
also allows preprocessing and data quality procedures
to be run on separate partitions. In addition, parti-
tioned indexes can also be used. One of the most
important benefits of partitioned tables is the oppor-
tunity for the optimizer to exclude large portions of
the data when queries include restrictions on parti-
tioning attributes. An excellent example of partitioned
tables in the CATCH data warehouse is the hospital
discharge data. In recent years, there have been
roughly 2 million discharge transactions/year. The
goal is to keep at least 10 years of discharge data or
20 million transactions available for analysis, but
often only a few years are necessary for any given
query, thereby creating an ideal parameter for parti-
tioning. The hospital discharge data is partitioned by
year, with roughly 1.5–2 million rows per partition. If
a query specifies a single year or a small range of
years, the optimizer can create an execution plan that
only searches the required partitions, leaving the vast
majority of data untouched. Since most of the detailed
interactive analyses fit this mold, the performance
tends to be quite good. However, the entire collection
of data is still available for queries that cover a wide
range of years, it just takes more time.
4. Data staging and quality assurance
The extraction, transformation, and loading (ETL)
functions in a data warehouse are considered the most
time-consuming and expensive portion of the devel-
opment lifecycle [22]. These processes are concerned
with the extraction of data from legacy systems,
transformation and preprocessing requirements to
produce useful, integrated data, and the transportation
of the data into the actual data warehouse structures.
The CATCH data warehouse involves somewhat
unusual challenges with regard to data staging activ-
ities. The data are drawn from multiple organizations,
which usually apply in-house transformations to data
collected by yet another layer of organizations. For
instance, the hospital discharge data are originally
collected by hospitals and reported to the Florida
Department of Health. These data are then integrated,
preprocessed, and provided to other interested organ-
izations, including the CATCH data warehouse proj-
ect. In the case of demographic data, population levels
are extracted from the Florida Governor’s Office and
the Census Bureau. Overall, the data warehouse has
continued to grow without the need for a data purging
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384376
strategy. However, as the size continues to increase
and finer geographic levels are used, a purging strat-
egy will become necessary in the near term. The
design is already multi-level, as described in Fig. 4,
and it is the base of the data pyramid that accounts for
most of the space. As space becomes an issue, the
earlier years of fine-grained data will be purged and
retained offline. These structures are maintained as
physical partitions, so the purging operations can be
conducted without disrupting data access and data can
easily be re-introduced.
Two innovative techniques, twin star data staging
and data quality filters, have been developed to
manage the ETL processing required in the CATCH
data warehouse.
4.1. Twin star staging
Fig. 5 outlines the twin star staging process and its
three component stages. The approach is designed to
utilize the power of commercial database systems,
especially referential integrity constraints and excep-
tion processing. The various stages use a combination
of scripting languages, bulk-loading tools, and data-
base procedures.
4.1.1. Stage 1
The process begins with file-based preprocessing
and cleansing activities. These procedures can be
written in any programming language, but AWK
and Perl have been especially useful in the CATCH
project with their built-in parsing and pattern match-
ing capabilities. Data transformation, quality checks,
and simple reports can all be performed on the initial
data file. Even though many checks will be repeated
throughout the data staging process, the presence of
redundant checks is an asset with regard to data
quality. Stage 1 of the twin star strategy involves
using a bulk loader to move the data into a staging
table within the database system. The staging table is
designed for maximum flexibility in storing data,
minimizing data type conflicts, and providing a work-
bench for database-resident transformation proce-
dures. Typically this includes additional attributes
that are created as part of the preprocessing and
cleansing tasks. Bulk loading utilities are used to
quickly populate the staging table and capture prob-
lematic data in a series of log files. Data type, unique-
ness, and ‘‘not null’’ checks for critical staging table
attributes can be used to control the thoroughness of
this data staging step. With care, many simple data
quality issues can be resolved at this early stage.
4.1.2. Stage 2
The temporary star shares the critical data dimen-
sions with the permanent star, and is essentially a
‘twin’ of the permanent star (though there may be
different supporting dimensions for particular tasks).
The fact table attributes and important dimensions
should be exact duplicates so that any operations or
referential integrity checks will be consistent between
the stars. Stage 2 entails moving the data from the
staging table to the temporary star. Attribute data
types should be compatible and referential integrity
constraints can be used to check for valid dimension
keys. The referential integrity constraints are disabled
and later re-enabled sequentially after the load to
perform the checks in one sweep, thereby improving
processing time. Most database systems provide a
method of capturing invalid rows and it is important
to make use of such capabilities during both the StageFig. 5. Twin star staging.
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 377
2 and 3 transfers. Since the temporary star is the
functional equivalent of the permanent star, just much
smaller, the interface and data browsing tools devel-
oped for the actual data warehouse can be used to
exercise the temporary star. Test reports, browsing by
power users, and sanity checks based on comparisons
with previously loaded data in the permanent star are
all useful methods of ensuring high quality data in the
temporary star.
4.1.3. Stage 3
The permanent star is the long-term storage area
for the data warehouse. This star must be carefully
indexed, distributed across storage devices to avoid I/
O bottlenecks, and possibly partitioned. As noted
earlier, partitioned tables can provide performance
improvements by distributing information across
physical devices and by allowing the query optimizer
to select only the relevant partitions. In addition,
partitioned tables ease data warehouse management
tasks through creation, loading, and archiving of
independent partitions. The Stage 3 transfer from the
temporary star to the permanent star should be fast
and free of data type and referential integrity viola-
tions. The simple transfer will allow large volumes to
be processed within most load windows. Redundant
referential integrity constraints can be used as a final
check (again disabling and re-enabling for efficiency).
The resulting exception tables should be empty, but
any offending rows are a clear sign that somehow
problems survived Stages 1 and 2. This provides a last
opportunity to postpone releasing or publishing the
data.
4.2. Data quality filters
The data quality issues that surface while initially
constructing a data warehouse are among the most
challenging obstacles, contributing significantly to the
time spent in data staging activities. As noted above,
the ETL processes and quality assurance procedures
can account for the majority of time and resource
commitments in a data warehouse project. This has
been the case in the CATCH data warehouse project,
where there are a large number of data sources and
many intermediate stages for errors to be introduced.
In addition, the challenge of producing a truly inte-
grated design requires translations to common defi-
nitions and shared dimensions. Rather than any
‘‘magic bullet,’’ a long-term effort to develop a
comprehensive set of preprocessing procedures will
produce the best data quality. The procedures under
development on the CATCH project include a meas-
ure of redundancy to provide added insurance against
quality problems surviving various phases of the ETL
process. As more procedures have been added to the
quality assurance arsenal, an interesting structure has
emerged, mirroring the natural structure of the data
warehouse, with procedures falling into the following
categories of quality filters. . Fact filters are the quality procedures used to
check the fine-grained data, such as hospital discharge
transactions. For example, any discrepancies between
itemized fees and total charges should be flagged.
Quality procedures at this level compare attributes
within a fact table row, or may compare between two
rows, but the focus is on fine-grained data. . Aggregate filters include quality checks that
become possible only when the focus is on summaries
of fact-level answer sets. As we have seen, aggregates
are important for boosting performance, but they also
present data quality assurance opportunities. At this
level, ‘roll-up’ operations over important dimensions
allow aggregate averages, maximums, or other sum-
maries to be compared. With regard to hospital dis-
charge transactions, comparisons of average lengths
of stay, maximum costs, or diagnostic volumes can all
be usefully compared by hospital and by year. That is,
large hospitals can be verified against each other and
new data can be compared against previous years.
Aggregate filters can be the basis for some very
powerful data quality procedures, effectively using
the capabilities existing in the data warehouse. . Dimension filters are the procedures used to
investigate ‘dirty’ dimensions. For instance, many
business-oriented data warehouses include a customer
dimension that can be very large and may have severe
data quality problems [19]. Duplicate customer
entries, household matching, and data obsolescence
issues are among the problems inherent in such
dimensions [3,10]. In the CATCH data warehouse,
dimensions that must be carefully monitored include
hospitals, practitioners, and geographic entities such
as counties and communities. Dimension filters can be
used to monitor many problems with regard to chang-
ing dimensions.
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384378
These three broad categories of quality filters can
be further refined based on the type of comparison
being used. For instance, the intratuple filters involve
comparisons between attributes within a single record.
Of course, the record itself may be at an aggregate
level and represent a summary of a fact-level answer
set. For example, average pharmacy costs may have a
fairly predictable relationship with total charges for a
given disease. This type of comparison could be used
as a quality check within a given aggregate hospital
discharge record, an example of an intratuple aggre-
gate filter.
Comparisons across records, or intertuple filters,
provide a rich set of quality assurance opportunities
that examine relationships between fact table rows or
aggregates of these rows. An example of this type of
filter would be comparisons between disease volumes
by year. Unlikely disease distributions, after account-
ing for population growth, might indicate a data
quality problem with new data. The distinction
between intertuple and intratuple comparisons, com-
bined with the major filter categories, leads to six
interesting filter categories that seem to naturally
describe the many types of quality procedures being
built into our data warehouse.
An additional quality assurance strategy involves
comparing data warehouse aggregates with known
summaries published by outside sources. This process
can best be described as a quality benchmark, where
externally derived data is used to check internal data
warehouse procedures. This type of quality procedure
usually includes permanent data quality tables popu-
lated with externally produced data summaries based
on published reports or spreadsheet calculations. For
instance, state-level reports on the number of specific
disease occurrences provide a benchmark for data
warehouse aggregates based on the underlying hospi-
tal data. Automated comparison procedures report
only reasonably large departures based on user-
defined thresholds. Quality benchmarks are particu-
larly important as ongoing development activities
yield both larger volumes of data, as well as new
aggregation procedures. Before new versions of pro-
cedures or interface tools are deployed, historical
quality benchmarks can be used to evaluate their
performance. Both quality benchmarks and filters
are part of the substantial infrastructure necessary to
meet data quality goals. These tools account for a
significant portion of the CATCH data warehouse
development effort [2].
5. CATCH data warehouse applications
The data warehouse is used to support a variety of
activities, from automating the original CATCH
reports to supporting current health care research
initiatives. The CATCH methods provided a solid
foundation for the initial implementation efforts, but
as components have been added the synergies have
opened new application opportunities. Clearly, the
human–computer interface is of paramount impor-
tance in the data warehouse environment and the
primary determinant of success from the end-user
perspective [1]. In order to support analysis and
reporting tasks, the data warehouse must have high
quality data and make that data accessible through
effective interface technologies. The act of releasing
data in a warehouse is in a very real sense the same as
publishing that data in printed form—retractions in
both media can be very painful.
5.1. Producing CATCH reports
CATCH reports have been refined over the past
decade in the field. The field expertise available in the
interdisciplinary research team infused the require-
ments process and provided a clearly identifiable goal
as the first step in data warehouse construction.
Hundreds of stored procedures, as well as the design
itself, implement many aspects of this domain exper-
tise. The stored procedures generate the health status
indicators and move them upward in the data access
pyramid for final report production. The reports allow
quick and easy access to comprehensive summaries
and more detailed collections of information from the
data warehouse using standard report writing technol-
ogies. This type of pre-defined and thorough reporting
is critical for implementing a more automated CATCH
report and will probably be the preferred format for
many users. For example, the comparison between
target counties and peer counties, as well as state
averages, are fundamental components of the original
CATCH reports and important tools for community
health care planners. In addition, current and historical
trend information is provided on fact sheets for each
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 379
health indicator. The final reports are really reference
books (numbering over 300 pages) with several major
parts, such as comparisons, fact sheets, and prioritized
lists.
New features that move beyond the original
CATCH reports include components that enable
user-defined communities to supplement the tradi-
tional county-level perspective. Users can define
smaller communities based on geographic or demo-
graphic criteria, with community fact sheets providing
an exploded view of selected health status indicators.
While CATCH has traditionally focused on large
hardcopy reports, the reports are now produced
directly in Web-friendly formats for electronic distri-
bution. The advantage of this approach is that a strong
methodological structure can be retained as the reports
are much more widely distributed. The interested
reader can refer to the Center for Health Outcomes
Research (CHOR) Web site for examples of current
reports [chor.hsc.usf.edu].
In addition to static reports, the high-level compo-
nents of the data warehouse can be accessed dynam-
ically using data browsing tools. It is usually possible
to constrain the navigation, while still providing
enough freedom to explore many more perspectives
than can be accommodated in a traditional report. Fig.
6 shows an online analytic processing (OLAP) tool
being used to browse through trend information for
specific indicators at the county level. Most of these
tools can support both desktop and Web browser
access, making this an important new avenue for data
dissemination.
5.2. Investigating health care issues
Data warehouse browsing tools provide star query-
like access through a flexible menu-based interface,
with pull-down menus representing important dimen-
sions. These types of tools are easy to use and support
some ad-hoc exploration, but are usually controlled
Fig. 6. Browsing screen for community indicators.
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384380
through an administrative layer that determines the
data available to end-users. In developing a flexible
interface, there is a tradeoff between the ability to
express ad-hoc queries and the ease-of-use that results
from pre-defined constructs implemented by data
warehouse designers and administrators. Of course,
SQL can provide an ad-hoc query facility, but requires
some care in the data warehouse environment with
very large tables and ill-formed queries conspiring to
sharply degrade performance. In addition, use of SQL
by casual users often produces incorrect queries
resulting in erroneous results from the data ware-
house. As noted above, OLAP tools can be used to
empower standard report users and allow simple
navigation through many more views than can be
produced using traditional reporting tools, yet still
curtail unwanted operations.
A second, and in some ways more important role
for the browsing tools is to provide a flexible interface
for more customized analysis. Health care issues high-
lighted by preliminary reports can be investigated
more fully using the finer levels of detail maintained
in the data warehouse. These tasks might entail query-
ing the true dimensional star schemas that include age,
gender, race, and other dimensions, or even the event-
oriented data, such as hospital discharges. These data
warehouse resources support much more detailed
analyses, allowing the user to focus on issues such as
differences in age or race with regard to specific health
status indicators. Once decision-makers review the
CATCH report, they may have community-specific
issues that relate to the diverse population groupings
that inevitably fall within somewhat arbitrary political
boundaries. Dealing effectively with such important
Fig. 7. Browsing screen for hospital disease indicators.
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 381
issues requires a more careful and focused analysis that
is precluded at the higher levels of aggregation that
make up the generic CATCH reports.
A current research initiative involves the explora-
tion of volume and cost-related issues in health out-
comes. Data browsing tools are used for exploratory
analysis. Used in this manner, OLAP tools provide a
first step in the data mining process [13]. Fig. 7
illustrates a browsing screen in which detailed vol-
ume, length of stay, and cost data are presented for
specific hospitals, or groups of hospitals. In addition
to tabular representations, these tools provide graphic
capabilities that support simple data visualization.
5.3. Security issues
Currently, dynamic access to the CATCH data
warehouse is restricted to the development team and
associated health care researchers. Obviously, some of
the data may be sensitive in nature. Data security is an
important issue in the health care environment and the
CATCH data warehouse attempts to balance the
information requirements for local health care plan-
ning with critical security issues. It is important to
note that no patient identifiers of any kind are incor-
porated within the data warehouse. Security issues are
mostly concerned with reporting rare events in geo-
graphic areas that might allow a person to be identi-
fied through other data sources. Detailed security
policies provide guidance for the manipulation and
reporting of health care data in the data warehouse.
6. Conclusions
The CATCH data warehouse can have an important
impact on our health status by making rigorous, quan-
titative information available to health care decision
makers in local, state, national, and international com-
munities. In this paper, we have described some of the
technical challenges faced in designing and implement-
ing a data warehouse for health care information. We
have presented innovative research contributions in the
areas of data warehouse design, data staging for ETL
processing, data quality assurance, and health care data
warehouse applications.
The CATCH data warehouse is now fully func-
tional. For example, it has been recently used to
produce a comprehensive CATCH report for Miami–
Dade COUNTY, Florida’s largest county. As part of
this report we were asked to provide more detailed
assessments of the eight commission districts within
the county. The flexibility of the data warehouse to
provide customized reporting allowed us to provide
these analyses rapidly and effectively. Because this
report was the first to be fully automated, we verified
the accuracy of the report with a complete hand check
of every data table. The discrepancies between the
automated data tables and the manually derived tables
were minimal and easily reconciled.
The CATCH data warehouse remains a work in
progress. We are pursuing an active research agenda
to enhance the technical data warehousing capabilities
and community health care applications. We invite the
reader to follow the progress of the data warehouse at
our CHOR web site [chor.hsc.usf.edu]. In the next
sections (Sections 6.1 and 6.2), we briefly review our
current research directions.
6.1. Data warehouse research directions
The CATCH data warehouse provides a rich
research environment for focused investigation in
the following areas. . Data Warehouse Design—The variety and vola-
tility of health care data sources make the mainte-
nance of the data warehouse design a true challenge.
Changes to source data formats frequently require the
updating of dimension table schemas. Often historical
data cannot be placed into the new format without
information loss. Finding solutions for maintaining
historical accuracy while providing efficient use of all
data in current applications is difficult. We are
researching design techniques to minimize the impact
of dimension table changes on the maintenance and
operations of the data warehouse [2]. . Data Staging—As presented in Section 3, we
have implemented an innovative twin star data staging
procedure. Ongoing research will study the perform-
ance of twin star data staging on various data loads.
Enhancements to the procedure will be proposed and
implemented. . Data Quality—Issues of data quality dominate
our research agenda. The health care field places
particular emphasis on data accuracy, timeliness, pri-
vacy, and ease of use [27]. We are in close contact
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384382
with the communities who receive the CATCH reports
and have interviewed a number of users to elicit data
quality requirements. This information will be used to
drive our research to improve data quality in the
CATCH data warehouse. . Data Dissemination—The technologies for dis-
seminating CATCH reports to communities are rap-
idly evolving. The requirements of the receiving
communities and the capabilities of the data ware-
house system will drive future research directions. . Data Mining—We are aggressively investigating
several areas of knowledge discovery in the CATCH
data warehouse [11]. We have a unique capability to
perform detailed studies in such areas as physician and
hospital volume, racial disparities in health care, and
environmental impacts on community health status.
6.2. Community decision-making with CATCH data
The CATCH data warehouse will result in wide-
spread distribution of data previously unavailable to
most communities, as well as online access for spe-
cialized inquiry. Many issues arise as to how the
communities will make the most effective use of the
CATCH data for health care decision-making. This is
an area with considerable research potential.
There is a rich literature on the decision-making
process both with and without information technol-
ogy. The study of group decision support systems and
environments has a strong tradition in the manage-
ment information systems field [8]. In many ways,
this important body of work is appropriate to health
care decision-making, which is usually group-ori-
ented. For example, Dennis et al. [9] study the effects
of minority influence on decision-making and find
that the presence or absence of technology has very
different effects. Another important contributing area
would be the political process and its ramifications to
decision-making [20]. Certainly, policy making in
health care is very much a political process.
The use of the CATCH methodology and state-of-
the-art data warehousing technology across many
Florida communities will provide a rich research
opportunity for studying interesting issues on group
decision-making in community health care organiza-
tions. Such issues would include the composition of
the decision-making group, the community stakehold-
ers and their political influence, the decision-making
process, and dissemination patterns of health care
information in the community. The complexities and
the interrelationships among these issues make the
design of research studies both a challenge and an
opportunity. As the automated CATCH reports are
produced for various communities in Florida, we will
study how effectively the CATCH information is used
for health care planning.
Acknowledgements
The authors gratefully recognize the U.S. Depart-
ment of Commerce, which has provided funding
through a Technology Opportunities Program (TOP)
grant. The Florida Department of Health has been a
research partner. Research collaborators in the College
of Public Health include R. Campbell, E. Gilbert, S.
Luther, B. Myers, and B. Steverson. Contributing
graduate students in the College of Business Admin-
istration include S. Hedge-Desai, R. Marsh, D.
McCorkel, M. Nevrekar, M. Pearl, R. Rajendrababu,
and J. Slayton. The authors also thank Oracle
Corporation for making their state-of-the-art develop-
ment tools available through the Oracle Academic
Initiative.
References
[1] D. Berndt, A. Hevner, J. Studnicki, Data warehouse dissemina-
tion strategies for community health assessments, informatik/
informatique, Journal of the Swiss Informatics Society (1) (Feb-
ruary 2001) 27–33.
[2] D. Berndt, J. Fisher, A. Hevner, J. Studnicki, Healthcare data
warehousing and quality assurance, IEEE Computer 34 (12)
(December 2001) 33–42.
[3] D. Berndt, R. Satterfield, Customer and household matching:
resolving entity identity in data warehouses, Proceedings of
AeroSense 2000, Conference on Data Mining and Knowledge
Discovery, Orlando (April 2000).
[4] Center for Disease Control and Prevention, Principles of Com-
munity Engagement, 1997, Atlanta.
[5] D. Chrislip, C. Larson, Collaborative Leadership: How Citi-
zens and Civic Leaders Can Make a Difference, Jossey-Bass,
San Francisco, 1994.
[6] M. Corey, M. Abbey, Oracle Data Warehousing, Oracle Press
and Osborne McGraw-Hill, New York, 1997.
[7] S. Cropper, Collaborative working and the issue of sustainabil-
ity, in: C. Huxham (Ed.), Creating Collaborative Advantage,
SAGE Publishers, London, 1996.
[8] A. Dennis, Information exchange and use in group decision
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 383
making: you can lead a group to information but you can’t
make it think, MIS Quarterly 20 (4) (1996) 433–458.
[9] A. Dennis, K. Hilmer, N. Taylor, Information exchange and
use in GSS and verbal group decision making, Journal of MIS
14 (3) (1998) 61–88.
[10] D. Dey, S. Sarkar, P. De, A probabilistic decision model for
entity matching in heterogeneous databases, Management Sci-
ence 44 (10) (October 1998) 1379–1396.
[11] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy
(Eds.), Advances in Knowledge Discovery and Data Mining,
The AAAI Press, Menlo Park, CA, 1996.
[12] P. Gray, H. Watson, Decision Support in the Data Warehouse,
Prentice-Hall, Englewood Cliffs, NJ, 1998.
[13] J. Han, M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, San Francisco, 2001.
[14] V. Harinarayan, A. Rajaraman, J. Ullman, Implementing data
cubes efficiently, Proceedings of the 1996 ACM SIGMOD,
Montreal (June 1996).
[15] W. Inmon, Building the Data Warehouse, Wiley, New York,
1992.
[16] Institute of Medicine, Summary of recommendations, in: W.
Waterfall (Ed.), The Future of Public Health, National Acad-
emy Press, Washington, DC, 1988.
[17] Institute of Medicine, Healthy Communities: New Partner-
ships for the Future of Public Health, National Academy Press,
Washington, DC, 1996.
[18] Institute of Medicine, Measurement tools for a community
health improvement process, in: J. Durch, L. Bailey, M. Stoto
(Eds.), Improving Health in the Community, a Role for Per-
formance Monitoring, National Academy Press, Washington,
DC, 1997.
[19] R. Kimball, The Data Warehouse Toolkit, Wiley, New York,
1996.
[20] H. Mintzberg, The Nature of Managerial Work, Harper and
Row, New York, 1973.
[21] H. Nakajima, Editorial: new players for a new era, World
Health 50 (3) (1997) 3.
[22] J. Srivastava, P. Chen, Warehouse creation—a potential road-
block to data warehousing, IEEE Transactions on Knowledge
and Data Engineering 11 (1) (1999) 118–126.
[23] B. Starfield, Primary care and health: a cross-national compar-
ison, Journal of the American Medical Association 266 (16)
(1991) 2268–2271.
[24] J. Studnicki, Evaluating the performance of public health
agencies: information needs, American Journal of Preventive
Medicine, Research and Measurement in Public Health Prac-
tice 11 (6) (1995) 74–80.
[25] J. Studnicki, B. Steverson, B. Myers, A. Hevner, D. Berndt,
Comprehensive assessment for tracking community health
(CATCH), Best Practices and Benchmarking in Healthcare 2
(5) (September/October 1997) 196–207.
[26] J. Studnicki, A. Hevner, D. Berndt, S. Luther, Comparing
alternative methods for composing community peer groups:
a data warehouse application, Journal of Public Health Man-
agement and Practice 7 (6) (November 2001) 87–94.
[27] R. Wang, M. Ziad, Y. Lee, Data Quality, Kluwer Academic
Publishing, New York, 2001.
[28] J. Weissman, C. Gatsonis, A. Epstein, Rates of avoidable hos-
pitalization by insurance status in Massachusetts and Mary-
land, Journal of the American Medical Association 268 (17)
(November 1992) 2388–2394.
[29] World Health Organization, The World Health Report 1995:
Bridging the Gaps, Report of the Director-General, Geneva,
1995.
Donald J. Berndt is an Assistant Professor
in the Information Systems and Decision
Sciences Department in the College of Busi-
ness Administration at the University of
South Florida. His research interests include
data warehousing, knowledge discovery,
and data mining. Dr. Berndt received a
PhD in Information Systems from the Stern
School of Business at New York University.
He is a member of Beta Gamma Sigma,
ACM, AIS, and INFORMS.
Alan R. Hevner is an Eminent Scholar and
Professor in the Information Systems and
Decision Sciences Department in the Col-
lege of Business Administration at the
University of South Florida. He holds the
Salomon Brothers/HRCP Chair of Distrib-
uted Technology. His research interests
include software engineering, software test-
ing, distributed database systems, and health
care information systems. He received a
PhD in Computer Science from Purdue
University. Dr. Hevner is a member of ACM, IEEE, AIS, and
INFORMS.
James Studnicki is a Professor of Health
Policy and Management at the University
of South Florida College of Public Health.
His research interests include measuring the
health status of communities, evaluating
alternative treatment outcomes, and study-
ing the influence of managed care penetra-
tion on the utilization and quality of health
services. Dr. Studnicki received a ScD from
Johns Hopkins University.
D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384384
- Introduction
- The CATCH methods of community assessment
- Limitations
- CATCH data warehouse challenges
- The CATCH data warehouse
- The dimensional model
- Fact tables
- Dimension tables
- Data warehouse design: the data access pyramid
- Aggregated Florida Department of Health Data
- Transaction-oriented hospital discharge data
- Performance issues
- Aggregates
- Indexing
- Partitioning
- Data staging and quality assurance
- Twin star staging
- Stage 1
- Stage 2
- Stage 3
- Data quality filters
- CATCH data warehouse applications
- Producing CATCH reports
- Investigating health care issues
- Security issues
- Conclusions
- Data warehouse research directions
- Community decision-making with CATCH data
- Acknowledgements
- References