answer the questions based on the article

tareq1144

BerndtCATCHdatawarehouseDecision-Support-Systems-2.pdf

Home >Information Systems homework help >answer the questions based on the article

The Catch data warehouse: support for community health

care decision-making

Donald J. Berndt a , Alan R. Hevner

a,*, James Studnicki b

a Information Systems and Decision Sciences Department, College of Business Administration, 4202 Fowler Ave., CIS1040,

University of South Florida, Tampa, FL 33620, USA b College of Public Health, University of South Florida, Tampa, FL 33620, USA

Accepted 1 April 2002

Abstract

The measurement and assessment of health status in communities throughout the world is a massive information technology

challenge. Comprehensive Assessment for Tracking Community Health (CATCH) provides systematic methods for community-

level assessment that is invaluable for resource allocation and health care policy formulation. CATCH is based on health status

indicators from multiple data sources, using an innovative comparative framework and weighted evaluation process to produce a

rank-ordered list of critical community health care challenges. The community-level focus is intended to empower local decision-

makers by providing a clear methodology for organizing and interpreting relevant health care data. Extensive field experience

with the CATCH methods, in combination with expertise in data warehousing technology, has led to an innovative application of

information technology in the health care arena. The data warehouse allows a core set of reports to be produced at a reasonable

cost for community use. In addition, online analytic processing (OLAP) functionality can be used to gain a deeper understanding

of specific health care issues. The data warehouse in conjunction with Web-enabled dissemination methods allows the infor-

mation to be presented in a variety of formats and to be distributed more widely in the decision-making community. In this paper,

we focus on the technical challenges of designing and implementing an effective data warehouse for health care information.

Illustrations of actual data designs and reporting formats from the CATCH data warehouse are used throughout the discussion.

Ongoing research directions in health care data warehousing and community health care decision-making conclude the paper.

Keywords: Health care information systems; Data warehousing; Data staging; Online analytic processing (OLAP); Decision support systems;

Community decision-making; Data quality

1. Introduction

The United States spends over a trillion dollars

annually on health expenditures. Both as a percentage

of national productivity and per capita, health care

spending by the United States exceeds that of any other

nation in the world. However, this tremendous expen-

diture has not secured the U.S. a rank among the

‘healthiest’ nations. In fact, for many health indicators,

such as infant mortality and measles immunizations,

the U.S. ranks below some countries characterized as

underdeveloped [23,29]. Prolonged public debates on

health care policy in the United States have focused on

doi:10.1016/S0167-9236(02)00114-8

* Corresponding author.

E-mail addresses: [email protected] (D.J. Berndt),

[email protected] (A.R. Hevner), [email protected]

(J. Studnicki).

www.elsevier.com/locate/dsw

Decision Support Systems 35 (2003) 367–384

insurance coverage and medical care financing pro-

grams without any serious examination of the true

health status of the nation.

The need to assess the health status of U.S.

communities in a comprehensive and systematic

manner has been widely recognized within the health

professions. The Institute of Medicine (IOM) of the

National Academy of Sciences has acknowledged the

importance of a population-based perspective in two

influential reports, emphasizing the need for a regular

and systematic collection, assemblage, and analysis

of the health status of our nation’s communities

[16,18]. A community health profile is comprised

of socio-demographic characteristics, health status

and quality of life indicators, health risk factors,

and health resource measures. The intent of such a

comprehensive health profile is to assist a community

in developing, refining, and monitoring a long-term

strategic view of its overall health status. Although

there are many sources of health data, there are no

standard data definitions, formats, or reports across

the health care industry. Thus, health care data are

widely used (and misused) in an ad-hoc manner to

justify managerial objectives of health institutions

and agencies, a maze of mandated categorical fund-

ing, and a variety of political agendas. Sound infor-

mation and accepted analytic techniques are even

more important as funding is consolidated in block

grants and local community decision-making is

emphasized.

As part of the ongoing clarification of the public

health role at the community level and the transition

from a disease to a health focus and from a treatment

to a prevention strategy, there has been recognition

that partnerships and collaboration are necessary to

support effective action [17,21]. Health organizations,

public sector agencies, medical care providers, busi-

nesses, the religious community, educational institu-

tions, and other community organizations are

interdependent components of a multi-sectoral com-

munity health environment. The overall community

must be empowered to make the necessary, and

sometimes difficult, resource allocation choices to

improve health through information, education,

behavior change, and social support [7]. Such collab-

orative action at the community level must be

informed by unbiased data describing the communi-

ty’s health status, needs, and resources. The ability is

also needed to track progress over time to meet the

community’s health care goals [24].

The gap between current practice in community

health care spending and the above goals of collabo-

rative community health care decision-making is vast.

The availability and quality of health indicators are

problematic. There is little empirical evidence on the

use, sharing, or integration of health data into deci-

sion-making to provide guidance to community health

organizations. While most of the literature on collab-

orative leadership and community engagement

emphasizes the process [4,5], little attention has been

focused on the effect of the availability of a common

set of data, such as the community health profile, on

the quality and inclusiveness of decision-making.

There is also scant information about the use of data

and information technology to support and monitor

the process.

The purpose of this paper is to present an overview

of the Comprehensive Assessment for Tracking Com-

munity Health (CATCH) methods [25] and then focus

on the construction of a comprehensive health care

data warehouse that provides automated support for

CATCH. The combination of extensive field experi-

ence with CATCH and the application of current data

warehousing technology make this an innovative

interdisciplinary research effort. Section 2 briefly

presents the CATCH methods and our motivation

for building a data warehouse. In Section 3, we

present a detailed discussion of the technical chal-

lenges in designing and implementing the data ware-

house. Twin star data staging, an effective approach

for ensuring quality as data are entered into the

warehouse, is highlighted in Section 4. Section 5

discusses the use of the data warehouse for advanced

health care applications. The paper concludes with

future research directions on data warehousing tech-

nical challenges and the use of health profiles to

support improved community health care decision-

making.

2. The CATCH methods of community assessment

The University of South Florida’s Center for

Health Outcomes Research (CHOR) developed

CATCH to provide comprehensive and objective

health status data for community health planning

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384368

purposes. CATCH collects, organizes, analyzes, pri-

oritizes, and reports data on over 250 health and social

indicators on a local community basis. The CATCH

methods have been tested, refined, and validated in

the field over the past 10 years. Reports have been

prepared for more than 20 U.S. counties both within

and outside of Florida.

The CATCH process can be briefly described as

shown in Fig. 1. Community health indicator data are

gathered from a variety of sources. Secondary data

sources include health care data reported by hospitals,

local, state, and federal health agencies, and national

health care groups. Primary data sources would involve

data gathered from door-to-door or mail-in surveys. All

health care data are translated into common formats and

integrated with other data warehouse components to

support the production of health care report cards.

Over 250 indicators are used within CATCH and

are organized into 10 indicator categories. These

indicators and categories represent a wide spectrum

of health care issues and have evolved through both

research and field practice. Table 1 lists the 10

indicator categories and presents a few representative

indicators to lend a sense of perspective to the level of

detail provided in CATCH reports. These indicators

are collected from a variety of sources.

Each indicator value is compared against the state

average, an average from a peer group of counties,

and other interesting values (e.g., a national goal for

that indicator) [26]. The results of these comparisons

are organized into a multi-dimensional matrix based

on favorable or unfavorable comparisons against each

comparison dimension. Fig. 1 shows a 2-by-2 com-

parison matrix based on state averages and peer

Fig. 1. The CATCH process.

Table 1

Ten indicator groups with representative indicators

Demographic

Characteristics

Health Status: Morbidity

and Mortality

Total Population Breast Cancer

Racial Composition Cardiovascular Disease

Net Migration Stroke

Socioeconomic

Characteristics

Sentinel Events

Rubella

Employment Measles

High School Dropouts Late Stage Cancer

Per Capita Income Avoidable Hospitalizations

Maternal and Child Health Health Resource Availability

Infant Mortality Licensed Hospital Beds

Low Birthweight Licensed Medical Doctors

Birth Defects Mortality Licensed Registered Nurses

Social and Mental Health Infectious Disease

Domestic Violence Syphilis

Homicide Rate AIDS

Psychiatric Admissions Hepatitis

Physical Environmental Health Behavioral Risk Factors

Foodborne Outbreaks Smoking

Contaminated Wells Obesity

Lead Poisoning Mammograms

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 369

averages. Community indicators that demonstrate

unfavorable comparisons on all dimensions are high-

lighted as community health challenges. After this

simple comparison, the health care challenges are

prioritized using a set of five filters.

Number Affected—number of persons in the

community affected by the indicator.

Economic Impact—an estimate of the direct cost

per case for individuals affected by the indicator.

Availability of Efficacious Intervention—an esti-

mate of the relative degree to which treatment or

prevention is likely to be effective.

Magnitude of Difference—the degree to which the

community indicator is worse than the dimensional

comparisons.

Trend Analysis—for a 5-year period is the trend

favorable or unfavorable and what is the magnitude

of change in the trend direction?

The community stakeholders are given an oppor-

tunity to weight the importance of each of the above

factors. The final product of the CATCH methodology

is a comprehensive, prioritized listing of community

health care challenges. A more detailed description of

the CATCH methodology with a complete listing of

health care indicators can be found in Ref. [25].

2.1. Limitations

While the value of CATCH is incontrovertible, the

ultimate deployment of CATCH throughout Florida

and the nation has been constrained by several serious

limitations: . The handcrafted process is labor-intensive and

slow. Hundreds of individual sources of data must be

identified and contacted. Data are often provided in

hard copy formats and must be manually checked,

validated, and entered into spreadsheets. With manual

methods, it takes 3–4 months to complete a CATCH

report for a single county. . Longitudinal trend analyses over many years are

cost prohibitive for most communities. Since each

application is expensive and time-consuming, the

capability to fund and produce annual assessments

in a single community is limited. . Most public health funding comes from state and

federal governments. A statewide CATCH assessment

would help to prioritize funding and serve to enable

effective program evaluation based on quantifiable

outcome assessment. Since nearly all data indicators

available in Florida are available in most other states,

there is reason to be confident that CATCH will be

expanded nationally and even internationally. . With the massive amount of health care data

involved, many interesting relationships and correla-

tions between health status indicators can be found

and investigated. In the manual system, such discov-

ery was not feasible. A comprehensive and integrated

data warehouse provides the infrastructure for such

data mining efforts.

2.2. CATCH data warehouse challenges

The application of data warehousing technologies

for the automated support of CATCH holds tremen-

dous promise. The remainder of this paper describes

our work to construct an effective and efficient data

warehouse solution, enabling both cost-effective

report generation and ad-hoc analyses of critical

health care issues. The construction of a data ware-

house for public health care data poses major chal-

lenges beyond those required for the construction of a

commercial data warehouse (e.g., retail sales). Such

challenges include the following. . Data come from a very diverse set of sources.

Health care data are published in a wide variety of

formats with differing semantics. There are currently

few standards in the health care field for such data.

The data integration task to build the data warehouse

requires significant effort. . CATCH reports are disseminated to a diverse

and geographically distributed set of stakeholders. . The data warehouse is required to support the

activities of public policy formulation. The socio-

political issues of health care planning impact design

features such as security, availability, data quality, and

performance.

3. The CATCH data warehouse

The goals of the CATCH data warehouse include

the support and enhancement of the CATCH methods,

the provision of cost-effective and thorough reports to

communities, and the creation of a rich environment

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384370

for more detailed research into critical health care

issues. In addition, a focus on data quality makes the

data warehouse an especially valuable asset over time

as a rich and trustworthy historical repository is built.

Lastly, the data warehouse lends itself to a variety of

dissemination strategies based on hardcopy reports,

interactive access, and Web-enabled information

delivery. The different access technologies allow a

diverse group of community planners and stakehold-

ers to investigate important health care issues using

comparable data. All of these characteristics make the

CATCH data warehouse a unique application of

technology in the field of public health. In fact, the

implementation of this type of data warehouse and its

use in monitoring, as well as improving health status,

will become a primary role of public health agencies

in the future.

The CATCH data warehouse includes a variety of

components arranged in three broad categories:

reporting tables for direct support of the CATCH

methods, aggregated dimensional structures, and

fine-grained or transaction-oriented dimensional struc-

tures. In the sections that follow, examples of these

data warehouse components are presented. All of the

components draw on the dimensional model or star

schema, some components with more than a dozen

dimensions and some with a few simple dimensions.

3.1. The dimensional model

Important missions of a data warehouse include the

support of decision-making activities and the creation

of an infrastructure for ad-hoc exploration of very

large collections of data. Decision-makers should be

able to pursue many of their investigations using

browsing tools, without relying on database program-

mers to construct queries. The emphasis on end-user

data access places a premium on an understandable

database design that provides an intuitive basis for

navigating through the data. The star schema or

dimensional model has been recognized as an effec-

tive structure for organizing many data warehouse

components [12,15,19]. The star schema is character-

ized by a center fact table, which usually contains

numeric information that can be used in summary

reports. Radiating from the fact table are dimension

tables that provide a rich query environment. This

structure provides a logical data cube, with dimen-

sions such as time and location identifying a set of

numeric measurements within the cube. Fig. 2 con-

tains a fragment from the hospital discharge trans-

action-oriented star schema discussed in this paper.

3.1.1. Fact tables

The most appropriate facts are additive numeric data

items that can be summed, averaged, or combined in

other ways across the dimensions to form summary

statistics. The only way to compress the millions of data

points and produce a reasonably sized answer set is to

present some mathematical summarization. No human

will want thousands, let alone millions, of items in

answer to their queries. As Kimball [19] pointsout, ‘‘the

best and most useful facts are numeric, continuously

valued, and additive.’’ The CATCH data warehouse

includes facts such as counts of hundreds of different

health events, population-based rates, age-adjusted

rates, and even fine-grained financial data in the case

of the hospital discharge data depicted in Fig. 2. For

example, using the hospital discharge star it is possible

to focus on a single hospital (using the hospital dimen-

sion), select a single disease (using the ICD DIAGNOSIS

dimension), and investigate how the length of stay has

varied over a specified time period. Using the hierarch-

ical nature of the dimensions, it is also possible to ‘roll-

up’ to compare types of hospitals, disease categories,

or even patient age bands. While the dimensional

structure is simple and readily understandable, it sup-

ports a large and very useful universe of queries.

3.1.2. Dimension tables

The dimensions define the query environment, the

richer the set of dimensions the more ways the data

can be accessed via queries. Two of the important

characteristics of dimensions are the richness of the

attributes that describe the dimension and the hier-

archical nature of the dimension. For example, the

COUNTY dimension in the CATCH data warehouse

includes attributes that describe whether a county is

coastal, wealthy, urban, dense, large in area, or

includes a military base. Therefore, the counties can

be organized by any value in this attribute set. Some

of the attributes lend themselves to hierarchical

organization. In the case of COUNTY, there is natural

geographic hierarchy that includes groups of counties

that form regions within the state and the state itself.

The county is also composed of finer geographic units

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 371

such as communities, ZIP codes, and census tracts. The

dimension hierarchies enable roll-up and drill-down

operations that control the level of detail in queries.

These formally defined hierarchies also provide the

framework for navigation or data browsing.

In order to describe the dimension hierarchies suc-

cinctly to both end-users and developers, dimension

hierarchy diagrams have been utilized in the CATCH

data warehouse design process. These diagrams show

the hierarchical nature so that end-users have an

uncluttered view of how they can navigate and design-

ers can easily understand the dimensional structures.

Fig. 3 illustrates an important health care dimension

based on the International Classification of Disease

(ICD) codes. Currently, we are using versions 9 and 10

of the ICD codes. These codes are divided into chapters

and sections, which provides a natural hierarchy for the

codes. Fig. 3 shows the hierarchical structure using

separate tables, but these tables can be easily denor-

malized to enhance query performance. In addition,

there are several other tables that provide alternative

hierarchies for this important dimension. This ICD

PROCEDURE dimension is combined with many other

dimensions such as patient age, gender, mortality risk,

and severity of illness to form star schemas (see Fig. 2)

with rich query environments.

3.2. Data warehouse design: the data access pyramid

The mission of the CATCH data warehouse is to

support the automated and cost-effective application

of CATCH, as well as to enable more detailed

Fig. 2. Hospital discharge star schema (not all dimensions are shown).

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384372

analyses that were not possible using the coarse-

grained data that typified past CATCH reports. In

order to meet these goals, the data warehouse design

includes several levels of data granularity, from the

coarse-grained data used in generic report production

to actual event-level data, such as hospital discharges.

The data warehouse design includes major compo-

nents at all three levels of granularity as illustrated in

the data access pyramid found in Fig. 4.

Report indicators—Reporting tables with derived

or highly aggregated data are used to support the core

CATCH reports, including comparisons between a

target county and peer counties. These tables also

provide fast response for interactive access via data

browsing tools and can provide the foundation for

simple community-wide Internet access. In addition,

the metadata play an important role at the reporting

level, providing indicator definitions, state or federal

goals, and expert domain knowledge for priority

filters (e.g., economic impact and treatment availabil-

ity). This report level of the data warehouse may not

be needed in all data warehouse applications but

provides important support for rapid generation of

community CATCH reports.

Aggregate data—There are families of star sche-

mas that provide true dimensional data warehouse

capabilities, such as interactive roll-up and drill-down

operations. These components have carefully designed

dimensions that can be utilized by more sophisticated

data browsing tools. The star schemas are populated

using thorough data staging and quality procedures that

usually involve processing detailed data sets extracted

by various health care agencies and organizations.

Typically, the data are aggregated and transformed for

loading into a family of related star schemas—a con-

stellation—that share important dimensions and sup-

port interactive online analytic processing (OLAP)

techniques.

Transaction data—For certain types of informa-

tion, the design calls for retaining very fine-grained or

even event level data. An example is the hospital

discharge data that includes each hospital discharge

event for the more than 200 hospitals that are man-

dated to report such information in Florida. These data

are retained at the transaction level because of the rich

set of facts and dimensions available for analysis and

the density of potential aggregations that result in

negligible space savings.

These three levels of aggregation within the data

warehouse combine to meet a wide range of reporting

requirements and performance goals, thus providing a

flexible basis for disseminating health care informa-

tion to community decision-makers. The following

two sections (Sections 3.3 and 3.4) provide some

examples of the major data warehouse components.

At the aggregate data level, a coarse-grained compo-

nent based on the Public Health Information Data

System (PHIDS) is used to support CATCH report

production and high-level browsing. A second exam-

Fig. 3. ICD PROCEDURE dimension hierarchy.

Fig. 4. Data access pyramid.

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 373

ple aggregate is procedure volume information

formed from the underlying hospital discharge data.

The original hospital discharge data provide an exam-

ple of transaction-oriented data that supports detailed

analyses, along with other data such as vital statistics

(e.g., births and deaths) and specific disease registries.

3.3. Aggregated Florida Department of Health Data

An example of a highly aggregated data warehouse

component is the Public Health Information Data

System (PHIDS) star schema. The Florida Department

of Health collects, analyzes, and reports a large

number of public health indicators. These items have

always provided critical assessment measures within

CATCH. The importance of the PHIDS indicators

made them obvious candidates for inclusion in the

data warehouse and a natural resource for automation

of the traditional CATCH report.

The PHIDS indicators are clearly not the fine-

grained data that support a detailed OLAP environ-

ment. The data are highly aggregated and provided

annually at the county level. Therefore, the data set is

suitable for generating the traditional CATCH report,

but unsuitable for more specific analyses. Essentially,

the construction of the data warehouse has been a

search for both fine and coarse data that can provide

synergies through integration. The simple star schema

used to implement the PHIDS-based data warehouse

component has only the year reported and the county

as explicit dimensions. Currently, many of the PHIDS

indicators are maintained using spreadsheets at the

Florida Department of Health. For use in the data

warehouse, the data are first extracted from the

spreadsheets, reformatted using custom staging pro-

grams, and then loaded via a bulk loader utility. The

twin star staging process, as described in Section 4, is

used to ensure data quality. Data correctness is veri-

fied by sampling the data and comparing the data

warehouse values with published PHIDS reports.

3.4. Transaction-oriented hospital discharge data

Florida hospital discharge transactions are col-

lected by the Agency for Health Care Administration

(AHCA) from the more than 200 short-term acute care

hospitals in the state. These hospitals report every

discharge transaction, regardless of payer, throughout

the state. Hospital discharge data are used to derive

several CATCH indicators such as avoidable hospital-

izations due to diabetes and other chronic diseases.

Typically, the large volume of hospital discharge

transactions is scanned to form derived or aggregated

data for CATCH indicators. However, the broader

mission of the CATCH data warehouse is both to

support the CATCH methods and enable more

detailed investigations of critical local health care

issues. It is the ability to fully explore issues at

appropriate levels of detail that make the fine-grained

components so important. While first staging and

preprocessing the hospital discharge data for use in

forming CATCH indicators, the value of the discharge

transactions themselves became very apparent. The

hospital discharge transactions provide an interesting

set of numeric data items, such as length of stay and a

breakdown of revenues, which are very well suited for

a data warehousing approach. In addition, the trans-

actions include a rich set of attributes that provide

many natural dimensions for use in formulating

queries.

Transaction-based star schemas can provide very

useful functionality within a data warehouse frame-

work, making the hospital discharge star an important

component of the CATCH data warehouse. The hos-

pital discharge data includes over 20 interesting

dimensions such as the discharging hospital character-

istics, admission criteria, diagnostic codes, procedure

codes, reimbursement categories, time, geographic

location, and many others. Furthermore, many of

these dimensions are hierarchical in nature, easily

supporting important roll-up/drill-down operations.

Fig. 2 is a partial representation of the discharge star

schema. The discharge star is equally rich in additive

numeric facts. For instance, length of patient stay is a

particularly important measurement for analysis.

There is also a measurement indicating elapsed days

until the medical procedure. Finally, there is a total

revenue item that provides important cost information.

In fact, there is also a large text field with embedded

revenue items that provides a breakdown of the

various costs from room charges to laboratory fees.

Procedures to parse this text field have been devel-

oped as part of the data staging activities and are used

to extract revenue items, providing nearly 30 interest-

ing numeric facts for each transaction. It is not

uncommon to have useful information buried in text

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384374

fields that must be preprocessed using data staging

tools or customized procedures. This can be a chal-

lenging task since the source database has no under-

standing of the structure embedded in such text fields

and therefore, simple query access is impossible. In

this case, the rich set of facts and highly dimensional

structure of the hospital discharge data make it a

powerful warehouse component for detailed investi-

gations and customized analyses.

The hospital discharge star has repeating groups for

diagnoses (ICD DX 1–10) and procedures (ICD

PROCEDURE 1–10). This design mirrors the underlying

data and simplifies the data staging process for the

millions of discharge records used in the project. An

alternative design without repeating groups might

simplify some queries, but this fine-grained data is

at the bottom of the data access pyramid and is

typically aggregated for most query processing. The

original positional representation also conveys infor-

mation relevant to health care coding practitioners and

is used in several ancillary algorithms. For many

purposes, the primary diagnosis or procedure is used

in calculating higher-level health care indicators, so

this structure is maintained in the transaction-oriented

data [28].

It is sometimes preferable to store the actual trans-

actions rather than lightly aggregated data that has

been derived from the underlying transactions. Kim-

ball [19] uses the term sparsity failure to describe the

size explosion that can occur when creating aggregate

data from a sparsely populated fact table. Detailed fact

data such as hospital discharge transactions will

probably not have all combinations of the dimensions

present in the actual data. In other words, not all

diseases occur in all hospitals during a particular year

and therefore the effect with regard to size is not

multiplicative. If we consider only the cardinality of

the actual dimensions then the possible combinations

of dimension key values is very large for the hospital

discharge data. For example, consider the following

four dimensions with approximate cardinalities, hos-

pitals (250), ICD codes (15,000), severity ratings (5),

and payers (10). This could result in 187.5 million

dimension key combinations. Further, we can define

density as the actual number of records (roughly 2

million/year for discharges) divided by the potential

combinations of dimension keys, yielding a density of

2/187.5 or roughly 1.07%. This remarkably low

density makes intuitive sense since the very fine

ICD distinctions lead to sparse usage. Imagine that

we decide to construct an aggregate table by creating

150 disease categories that summarize the 15,000 ICD

codes, reducing the dimension size by a factor of 100.

In this case, all 150 categories may appear for each

hospital (a reasonable assumption) giving a density of

100% and roughly 1.9 million rows. This rather

insignificant space savings comes at the expense of

losing the richness of the original ICD codes and the

flexibility of having individual cost data for each

transaction. Therefore, in the CATCH data warehouse

and many other applications, transaction-oriented

components make good sense. In fact, to really under-

stand the implications for tasks such as data ware-

house capacity planning it is often necessary to

sample the data to discover the actual distribution of

dimension values. The design challenge is to carefully

consider the number of fine-grained items that are

summarized to form the aggregate data and look for a

factor of 10 or more as a reasonable compression ratio

[19].

3.5. Performance issues

The large volumes of data contained in the CATCH

data warehouse coupled with demanding queries can

conspire to produce some truly awful performance. As

in any database project, good design is the most

effective tool for enhancing performance. The

CATCH data warehouse design continues to evolve

in response to new challenges. In addition to design

changes, three other techniques offer avenues for

improving performance: aggregate tables, star schema

indexing strategies, and physical table partitions.

3.5.1. Aggregates

Many data warehouse designers identify aggre-

gates as one of the most effective strategies for

improving performance. Kimball [19] notes that

‘‘aggregates can have a very significant effect on

performance, in some cases speeding queries by a

factor of 100 or even 1000.’’ If the aggregate data are

useful, having the data physically ready and waiting

will certainly improve query speeds. In addition, if

sparsity failure is avoided, then the amount of data

required may also be substantially reduced. That is,

benefits from both reduced space and previously

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 375

handled computations can accrue through the use of

aggregates. In addition, many data warehousing nav-

igation tools are aggregate-aware, making the aggre-

gate structures transparent to the end user. However,

there are a potentially large number of aggregates that

are possible given a rich set of dimensions. The choice

of which aggregate tables to build is based on the type

of queries being executed and will naturally change

over time [14].

Aggregates play an important role in the CATCH

data warehouse. Some data are extracted and loaded

in aggregate form, such as the PHIDS indicators

discussed above, and other aggregates are derived

from more detailed data warehouse components. For

instance, vital statistics such as death and birth certif-

icates are used to derive a collection of aggregated

mortality and birth-related indicators. There are two

somewhat different purposes for aggregates. Highly

aggregated data are used to directly support traditional

CATCH report production, while lightly aggregated

data are used to improve query performance. The

continual re-evaluation of aggregates is an important

task in data warehouse administration.

3.5.2. Indexing

Many database management systems intended for

data warehousing support bitmap index structures.

Bitmap indexes are especially suited to low cardin-

ality dimensions such as admission quarter, day of the

week, gender, and others. These indexes are space

efficient and speed the star queries that characterize

access to fine-grained structures such as the hospital

discharge data. Another technique is to cache the

smaller dimension tables in memory for improved

query performance. All of these techniques have been

employed and performance tuning continues to be an

ongoing activity as the user community grows and

explores new uses for the data warehouse.

3.5.3. Partitioning

The third important performance tuning technique

is the use of physical table partitioning [6]. The use of

table partitions is important both for query perform-

ance, as well as data warehouse management. Since

the data are loaded or staged at different times, these

activities can be isolated through partitioning. This

also allows preprocessing and data quality procedures

to be run on separate partitions. In addition, parti-

tioned indexes can also be used. One of the most

important benefits of partitioned tables is the oppor-

tunity for the optimizer to exclude large portions of

the data when queries include restrictions on parti-

tioning attributes. An excellent example of partitioned

tables in the CATCH data warehouse is the hospital

discharge data. In recent years, there have been

roughly 2 million discharge transactions/year. The

goal is to keep at least 10 years of discharge data or

20 million transactions available for analysis, but

often only a few years are necessary for any given

query, thereby creating an ideal parameter for parti-

tioning. The hospital discharge data is partitioned by

year, with roughly 1.5–2 million rows per partition. If

a query specifies a single year or a small range of

years, the optimizer can create an execution plan that

only searches the required partitions, leaving the vast

majority of data untouched. Since most of the detailed

interactive analyses fit this mold, the performance

tends to be quite good. However, the entire collection

of data is still available for queries that cover a wide

range of years, it just takes more time.

4. Data staging and quality assurance

The extraction, transformation, and loading (ETL)

functions in a data warehouse are considered the most

time-consuming and expensive portion of the devel-

opment lifecycle [22]. These processes are concerned

with the extraction of data from legacy systems,

transformation and preprocessing requirements to

produce useful, integrated data, and the transportation

of the data into the actual data warehouse structures.

The CATCH data warehouse involves somewhat

unusual challenges with regard to data staging activ-

ities. The data are drawn from multiple organizations,

which usually apply in-house transformations to data

collected by yet another layer of organizations. For

instance, the hospital discharge data are originally

collected by hospitals and reported to the Florida

Department of Health. These data are then integrated,

preprocessed, and provided to other interested organ-

izations, including the CATCH data warehouse proj-

ect. In the case of demographic data, population levels

are extracted from the Florida Governor’s Office and

the Census Bureau. Overall, the data warehouse has

continued to grow without the need for a data purging

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384376

strategy. However, as the size continues to increase

and finer geographic levels are used, a purging strat-

egy will become necessary in the near term. The

design is already multi-level, as described in Fig. 4,

and it is the base of the data pyramid that accounts for

most of the space. As space becomes an issue, the

earlier years of fine-grained data will be purged and

retained offline. These structures are maintained as

physical partitions, so the purging operations can be

conducted without disrupting data access and data can

easily be re-introduced.

Two innovative techniques, twin star data staging

and data quality filters, have been developed to

manage the ETL processing required in the CATCH

data warehouse.

4.1. Twin star staging

Fig. 5 outlines the twin star staging process and its

three component stages. The approach is designed to

utilize the power of commercial database systems,

especially referential integrity constraints and excep-

tion processing. The various stages use a combination

of scripting languages, bulk-loading tools, and data-

base procedures.

4.1.1. Stage 1

The process begins with file-based preprocessing

and cleansing activities. These procedures can be

written in any programming language, but AWK

and Perl have been especially useful in the CATCH

project with their built-in parsing and pattern match-

ing capabilities. Data transformation, quality checks,

and simple reports can all be performed on the initial

data file. Even though many checks will be repeated

throughout the data staging process, the presence of

redundant checks is an asset with regard to data

quality. Stage 1 of the twin star strategy involves

using a bulk loader to move the data into a staging

table within the database system. The staging table is

designed for maximum flexibility in storing data,

minimizing data type conflicts, and providing a work-

bench for database-resident transformation proce-

dures. Typically this includes additional attributes

that are created as part of the preprocessing and

cleansing tasks. Bulk loading utilities are used to

quickly populate the staging table and capture prob-

lematic data in a series of log files. Data type, unique-

ness, and ‘‘not null’’ checks for critical staging table

attributes can be used to control the thoroughness of

this data staging step. With care, many simple data

quality issues can be resolved at this early stage.

4.1.2. Stage 2

The temporary star shares the critical data dimen-

sions with the permanent star, and is essentially a

‘twin’ of the permanent star (though there may be

different supporting dimensions for particular tasks).

The fact table attributes and important dimensions

should be exact duplicates so that any operations or

referential integrity checks will be consistent between

the stars. Stage 2 entails moving the data from the

staging table to the temporary star. Attribute data

types should be compatible and referential integrity

constraints can be used to check for valid dimension

keys. The referential integrity constraints are disabled

and later re-enabled sequentially after the load to

perform the checks in one sweep, thereby improving

processing time. Most database systems provide a

method of capturing invalid rows and it is important

to make use of such capabilities during both the StageFig. 5. Twin star staging.

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 377

2 and 3 transfers. Since the temporary star is the

functional equivalent of the permanent star, just much

smaller, the interface and data browsing tools devel-

oped for the actual data warehouse can be used to

exercise the temporary star. Test reports, browsing by

power users, and sanity checks based on comparisons

with previously loaded data in the permanent star are

all useful methods of ensuring high quality data in the

temporary star.

4.1.3. Stage 3

The permanent star is the long-term storage area

for the data warehouse. This star must be carefully

indexed, distributed across storage devices to avoid I/

O bottlenecks, and possibly partitioned. As noted

earlier, partitioned tables can provide performance

improvements by distributing information across

physical devices and by allowing the query optimizer

to select only the relevant partitions. In addition,

partitioned tables ease data warehouse management

tasks through creation, loading, and archiving of

independent partitions. The Stage 3 transfer from the

temporary star to the permanent star should be fast

and free of data type and referential integrity viola-

tions. The simple transfer will allow large volumes to

be processed within most load windows. Redundant

referential integrity constraints can be used as a final

check (again disabling and re-enabling for efficiency).

The resulting exception tables should be empty, but

any offending rows are a clear sign that somehow

problems survived Stages 1 and 2. This provides a last

opportunity to postpone releasing or publishing the

data.

4.2. Data quality filters

The data quality issues that surface while initially

constructing a data warehouse are among the most

challenging obstacles, contributing significantly to the

time spent in data staging activities. As noted above,

the ETL processes and quality assurance procedures

can account for the majority of time and resource

commitments in a data warehouse project. This has

been the case in the CATCH data warehouse project,

where there are a large number of data sources and

many intermediate stages for errors to be introduced.

In addition, the challenge of producing a truly inte-

grated design requires translations to common defi-

nitions and shared dimensions. Rather than any

‘‘magic bullet,’’ a long-term effort to develop a

comprehensive set of preprocessing procedures will

produce the best data quality. The procedures under

development on the CATCH project include a meas-

ure of redundancy to provide added insurance against

quality problems surviving various phases of the ETL

process. As more procedures have been added to the

quality assurance arsenal, an interesting structure has

emerged, mirroring the natural structure of the data

warehouse, with procedures falling into the following

categories of quality filters. . Fact filters are the quality procedures used to

check the fine-grained data, such as hospital discharge

transactions. For example, any discrepancies between

itemized fees and total charges should be flagged.

Quality procedures at this level compare attributes

within a fact table row, or may compare between two

rows, but the focus is on fine-grained data. . Aggregate filters include quality checks that

become possible only when the focus is on summaries

of fact-level answer sets. As we have seen, aggregates

are important for boosting performance, but they also

present data quality assurance opportunities. At this

level, ‘roll-up’ operations over important dimensions

allow aggregate averages, maximums, or other sum-

maries to be compared. With regard to hospital dis-

charge transactions, comparisons of average lengths

of stay, maximum costs, or diagnostic volumes can all

be usefully compared by hospital and by year. That is,

large hospitals can be verified against each other and

new data can be compared against previous years.

Aggregate filters can be the basis for some very

powerful data quality procedures, effectively using

the capabilities existing in the data warehouse. . Dimension filters are the procedures used to

investigate ‘dirty’ dimensions. For instance, many

business-oriented data warehouses include a customer

dimension that can be very large and may have severe

data quality problems [19]. Duplicate customer

entries, household matching, and data obsolescence

issues are among the problems inherent in such

dimensions [3,10]. In the CATCH data warehouse,

dimensions that must be carefully monitored include

hospitals, practitioners, and geographic entities such

as counties and communities. Dimension filters can be

used to monitor many problems with regard to chang-

ing dimensions.

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384378

These three broad categories of quality filters can

be further refined based on the type of comparison

being used. For instance, the intratuple filters involve

comparisons between attributes within a single record.

Of course, the record itself may be at an aggregate

level and represent a summary of a fact-level answer

set. For example, average pharmacy costs may have a

fairly predictable relationship with total charges for a

given disease. This type of comparison could be used

as a quality check within a given aggregate hospital

discharge record, an example of an intratuple aggre-

gate filter.

Comparisons across records, or intertuple filters,

provide a rich set of quality assurance opportunities

that examine relationships between fact table rows or

aggregates of these rows. An example of this type of

filter would be comparisons between disease volumes

by year. Unlikely disease distributions, after account-

ing for population growth, might indicate a data

quality problem with new data. The distinction

between intertuple and intratuple comparisons, com-

bined with the major filter categories, leads to six

interesting filter categories that seem to naturally

describe the many types of quality procedures being

built into our data warehouse.

An additional quality assurance strategy involves

comparing data warehouse aggregates with known

summaries published by outside sources. This process

can best be described as a quality benchmark, where

externally derived data is used to check internal data

warehouse procedures. This type of quality procedure

usually includes permanent data quality tables popu-

lated with externally produced data summaries based

on published reports or spreadsheet calculations. For

instance, state-level reports on the number of specific

disease occurrences provide a benchmark for data

warehouse aggregates based on the underlying hospi-

tal data. Automated comparison procedures report

only reasonably large departures based on user-

defined thresholds. Quality benchmarks are particu-

larly important as ongoing development activities

yield both larger volumes of data, as well as new

aggregation procedures. Before new versions of pro-

cedures or interface tools are deployed, historical

quality benchmarks can be used to evaluate their

performance. Both quality benchmarks and filters

are part of the substantial infrastructure necessary to

meet data quality goals. These tools account for a

significant portion of the CATCH data warehouse

development effort [2].

5. CATCH data warehouse applications

The data warehouse is used to support a variety of

activities, from automating the original CATCH

reports to supporting current health care research

initiatives. The CATCH methods provided a solid

foundation for the initial implementation efforts, but

as components have been added the synergies have

opened new application opportunities. Clearly, the

human–computer interface is of paramount impor-

tance in the data warehouse environment and the

primary determinant of success from the end-user

perspective [1]. In order to support analysis and

reporting tasks, the data warehouse must have high

quality data and make that data accessible through

effective interface technologies. The act of releasing

data in a warehouse is in a very real sense the same as

publishing that data in printed form—retractions in

both media can be very painful.

5.1. Producing CATCH reports

CATCH reports have been refined over the past

decade in the field. The field expertise available in the

interdisciplinary research team infused the require-

ments process and provided a clearly identifiable goal

as the first step in data warehouse construction.

Hundreds of stored procedures, as well as the design

itself, implement many aspects of this domain exper-

tise. The stored procedures generate the health status

indicators and move them upward in the data access

pyramid for final report production. The reports allow

quick and easy access to comprehensive summaries

and more detailed collections of information from the

data warehouse using standard report writing technol-

ogies. This type of pre-defined and thorough reporting

is critical for implementing a more automated CATCH

report and will probably be the preferred format for

many users. For example, the comparison between

target counties and peer counties, as well as state

averages, are fundamental components of the original

CATCH reports and important tools for community

health care planners. In addition, current and historical

trend information is provided on fact sheets for each

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 379

health indicator. The final reports are really reference

books (numbering over 300 pages) with several major

parts, such as comparisons, fact sheets, and prioritized

lists.

New features that move beyond the original

CATCH reports include components that enable

user-defined communities to supplement the tradi-

tional county-level perspective. Users can define

smaller communities based on geographic or demo-

graphic criteria, with community fact sheets providing

an exploded view of selected health status indicators.

While CATCH has traditionally focused on large

hardcopy reports, the reports are now produced

directly in Web-friendly formats for electronic distri-

bution. The advantage of this approach is that a strong

methodological structure can be retained as the reports

are much more widely distributed. The interested

reader can refer to the Center for Health Outcomes

Research (CHOR) Web site for examples of current

reports [chor.hsc.usf.edu].

In addition to static reports, the high-level compo-

nents of the data warehouse can be accessed dynam-

ically using data browsing tools. It is usually possible

to constrain the navigation, while still providing

enough freedom to explore many more perspectives

than can be accommodated in a traditional report. Fig.

6 shows an online analytic processing (OLAP) tool

being used to browse through trend information for

specific indicators at the county level. Most of these

tools can support both desktop and Web browser

access, making this an important new avenue for data

dissemination.

5.2. Investigating health care issues

Data warehouse browsing tools provide star query-

like access through a flexible menu-based interface,

with pull-down menus representing important dimen-

sions. These types of tools are easy to use and support

some ad-hoc exploration, but are usually controlled

Fig. 6. Browsing screen for community indicators.

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384380

chor.hsc.usf.edu

through an administrative layer that determines the

data available to end-users. In developing a flexible

interface, there is a tradeoff between the ability to

express ad-hoc queries and the ease-of-use that results

from pre-defined constructs implemented by data

warehouse designers and administrators. Of course,

SQL can provide an ad-hoc query facility, but requires

some care in the data warehouse environment with

very large tables and ill-formed queries conspiring to

sharply degrade performance. In addition, use of SQL

by casual users often produces incorrect queries

resulting in erroneous results from the data ware-

house. As noted above, OLAP tools can be used to

empower standard report users and allow simple

navigation through many more views than can be

produced using traditional reporting tools, yet still

curtail unwanted operations.

A second, and in some ways more important role

for the browsing tools is to provide a flexible interface

for more customized analysis. Health care issues high-

lighted by preliminary reports can be investigated

more fully using the finer levels of detail maintained

in the data warehouse. These tasks might entail query-

ing the true dimensional star schemas that include age,

gender, race, and other dimensions, or even the event-

oriented data, such as hospital discharges. These data

warehouse resources support much more detailed

analyses, allowing the user to focus on issues such as

differences in age or race with regard to specific health

status indicators. Once decision-makers review the

CATCH report, they may have community-specific

issues that relate to the diverse population groupings

that inevitably fall within somewhat arbitrary political

boundaries. Dealing effectively with such important

Fig. 7. Browsing screen for hospital disease indicators.

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 381

issues requires a more careful and focused analysis that

is precluded at the higher levels of aggregation that

make up the generic CATCH reports.

A current research initiative involves the explora-

tion of volume and cost-related issues in health out-

comes. Data browsing tools are used for exploratory

analysis. Used in this manner, OLAP tools provide a

first step in the data mining process [13]. Fig. 7

illustrates a browsing screen in which detailed vol-

ume, length of stay, and cost data are presented for

specific hospitals, or groups of hospitals. In addition

to tabular representations, these tools provide graphic

capabilities that support simple data visualization.

5.3. Security issues

Currently, dynamic access to the CATCH data

warehouse is restricted to the development team and

associated health care researchers. Obviously, some of

the data may be sensitive in nature. Data security is an

important issue in the health care environment and the

CATCH data warehouse attempts to balance the

information requirements for local health care plan-

ning with critical security issues. It is important to

note that no patient identifiers of any kind are incor-

porated within the data warehouse. Security issues are

mostly concerned with reporting rare events in geo-

graphic areas that might allow a person to be identi-

fied through other data sources. Detailed security

policies provide guidance for the manipulation and

reporting of health care data in the data warehouse.

6. Conclusions

The CATCH data warehouse can have an important

impact on our health status by making rigorous, quan-

titative information available to health care decision

makers in local, state, national, and international com-

munities. In this paper, we have described some of the

technical challenges faced in designing and implement-

ing a data warehouse for health care information. We

have presented innovative research contributions in the

areas of data warehouse design, data staging for ETL

processing, data quality assurance, and health care data

warehouse applications.

The CATCH data warehouse is now fully func-

tional. For example, it has been recently used to

produce a comprehensive CATCH report for Miami–

Dade COUNTY, Florida’s largest county. As part of

this report we were asked to provide more detailed

assessments of the eight commission districts within

the county. The flexibility of the data warehouse to

provide customized reporting allowed us to provide

these analyses rapidly and effectively. Because this

report was the first to be fully automated, we verified

the accuracy of the report with a complete hand check

of every data table. The discrepancies between the

automated data tables and the manually derived tables

were minimal and easily reconciled.

The CATCH data warehouse remains a work in

progress. We are pursuing an active research agenda

to enhance the technical data warehousing capabilities

and community health care applications. We invite the

reader to follow the progress of the data warehouse at

our CHOR web site [chor.hsc.usf.edu]. In the next

sections (Sections 6.1 and 6.2), we briefly review our

current research directions.

6.1. Data warehouse research directions

The CATCH data warehouse provides a rich

research environment for focused investigation in

the following areas. . Data Warehouse Design—The variety and vola-

tility of health care data sources make the mainte-

nance of the data warehouse design a true challenge.

Changes to source data formats frequently require the

updating of dimension table schemas. Often historical

data cannot be placed into the new format without

information loss. Finding solutions for maintaining

historical accuracy while providing efficient use of all

data in current applications is difficult. We are

researching design techniques to minimize the impact

of dimension table changes on the maintenance and

operations of the data warehouse [2]. . Data Staging—As presented in Section 3, we

have implemented an innovative twin star data staging

procedure. Ongoing research will study the perform-

ance of twin star data staging on various data loads.

Enhancements to the procedure will be proposed and

implemented. . Data Quality—Issues of data quality dominate

our research agenda. The health care field places

particular emphasis on data accuracy, timeliness, pri-

vacy, and ease of use [27]. We are in close contact

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384382

chor.hsc.usf.edu

with the communities who receive the CATCH reports

and have interviewed a number of users to elicit data

quality requirements. This information will be used to

drive our research to improve data quality in the

CATCH data warehouse. . Data Dissemination—The technologies for dis-

seminating CATCH reports to communities are rap-

idly evolving. The requirements of the receiving

communities and the capabilities of the data ware-

house system will drive future research directions. . Data Mining—We are aggressively investigating

several areas of knowledge discovery in the CATCH

data warehouse [11]. We have a unique capability to

perform detailed studies in such areas as physician and

hospital volume, racial disparities in health care, and

environmental impacts on community health status.

6.2. Community decision-making with CATCH data

The CATCH data warehouse will result in wide-

spread distribution of data previously unavailable to

most communities, as well as online access for spe-

cialized inquiry. Many issues arise as to how the

communities will make the most effective use of the

CATCH data for health care decision-making. This is

an area with considerable research potential.

There is a rich literature on the decision-making

process both with and without information technol-

ogy. The study of group decision support systems and

environments has a strong tradition in the manage-

ment information systems field [8]. In many ways,

this important body of work is appropriate to health

care decision-making, which is usually group-ori-

ented. For example, Dennis et al. [9] study the effects

of minority influence on decision-making and find

that the presence or absence of technology has very

different effects. Another important contributing area

would be the political process and its ramifications to

decision-making [20]. Certainly, policy making in

health care is very much a political process.

The use of the CATCH methodology and state-of-

the-art data warehousing technology across many

Florida communities will provide a rich research

opportunity for studying interesting issues on group

decision-making in community health care organiza-

tions. Such issues would include the composition of

the decision-making group, the community stakehold-

ers and their political influence, the decision-making

process, and dissemination patterns of health care

information in the community. The complexities and

the interrelationships among these issues make the

design of research studies both a challenge and an

opportunity. As the automated CATCH reports are

produced for various communities in Florida, we will

study how effectively the CATCH information is used

for health care planning.

Acknowledgements

The authors gratefully recognize the U.S. Depart-

ment of Commerce, which has provided funding

through a Technology Opportunities Program (TOP)

grant. The Florida Department of Health has been a

research partner. Research collaborators in the College

of Public Health include R. Campbell, E. Gilbert, S.

Luther, B. Myers, and B. Steverson. Contributing

graduate students in the College of Business Admin-

istration include S. Hedge-Desai, R. Marsh, D.

McCorkel, M. Nevrekar, M. Pearl, R. Rajendrababu,

and J. Slayton. The authors also thank Oracle

Corporation for making their state-of-the-art develop-

ment tools available through the Oracle Academic

Initiative.

References

[1] D. Berndt, A. Hevner, J. Studnicki, Data warehouse dissemina-

tion strategies for community health assessments, informatik/

informatique, Journal of the Swiss Informatics Society (1) (Feb-

ruary 2001) 27–33.

[2] D. Berndt, J. Fisher, A. Hevner, J. Studnicki, Healthcare data

warehousing and quality assurance, IEEE Computer 34 (12)

(December 2001) 33–42.

[3] D. Berndt, R. Satterfield, Customer and household matching:

resolving entity identity in data warehouses, Proceedings of

AeroSense 2000, Conference on Data Mining and Knowledge

Discovery, Orlando (April 2000).

[4] Center for Disease Control and Prevention, Principles of Com-

munity Engagement, 1997, Atlanta.

[5] D. Chrislip, C. Larson, Collaborative Leadership: How Citi-

zens and Civic Leaders Can Make a Difference, Jossey-Bass,

San Francisco, 1994.

[6] M. Corey, M. Abbey, Oracle Data Warehousing, Oracle Press

and Osborne McGraw-Hill, New York, 1997.

[7] S. Cropper, Collaborative working and the issue of sustainabil-

ity, in: C. Huxham (Ed.), Creating Collaborative Advantage,

SAGE Publishers, London, 1996.

[8] A. Dennis, Information exchange and use in group decision

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384 383

making: you can lead a group to information but you can’t

make it think, MIS Quarterly 20 (4) (1996) 433–458.

[9] A. Dennis, K. Hilmer, N. Taylor, Information exchange and

use in GSS and verbal group decision making, Journal of MIS

14 (3) (1998) 61–88.

[10] D. Dey, S. Sarkar, P. De, A probabilistic decision model for

entity matching in heterogeneous databases, Management Sci-

ence 44 (10) (October 1998) 1379–1396.

[11] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy

(Eds.), Advances in Knowledge Discovery and Data Mining,

The AAAI Press, Menlo Park, CA, 1996.

[12] P. Gray, H. Watson, Decision Support in the Data Warehouse,

Prentice-Hall, Englewood Cliffs, NJ, 1998.

[13] J. Han, M. Kamber, Data Mining: Concepts and Techniques,

Morgan Kaufmann Publishers, San Francisco, 2001.

[14] V. Harinarayan, A. Rajaraman, J. Ullman, Implementing data

cubes efficiently, Proceedings of the 1996 ACM SIGMOD,

Montreal (June 1996).

[15] W. Inmon, Building the Data Warehouse, Wiley, New York,

1992.

[16] Institute of Medicine, Summary of recommendations, in: W.

Waterfall (Ed.), The Future of Public Health, National Acad-

emy Press, Washington, DC, 1988.

[17] Institute of Medicine, Healthy Communities: New Partner-

ships for the Future of Public Health, National Academy Press,

Washington, DC, 1996.

[18] Institute of Medicine, Measurement tools for a community

health improvement process, in: J. Durch, L. Bailey, M. Stoto

(Eds.), Improving Health in the Community, a Role for Per-

formance Monitoring, National Academy Press, Washington,

DC, 1997.

[19] R. Kimball, The Data Warehouse Toolkit, Wiley, New York,

1996.

[20] H. Mintzberg, The Nature of Managerial Work, Harper and

Row, New York, 1973.

[21] H. Nakajima, Editorial: new players for a new era, World

Health 50 (3) (1997) 3.

[22] J. Srivastava, P. Chen, Warehouse creation—a potential road-

block to data warehousing, IEEE Transactions on Knowledge

and Data Engineering 11 (1) (1999) 118–126.

[23] B. Starfield, Primary care and health: a cross-national compar-

ison, Journal of the American Medical Association 266 (16)

(1991) 2268–2271.

[24] J. Studnicki, Evaluating the performance of public health

agencies: information needs, American Journal of Preventive

Medicine, Research and Measurement in Public Health Prac-

tice 11 (6) (1995) 74–80.

[25] J. Studnicki, B. Steverson, B. Myers, A. Hevner, D. Berndt,

Comprehensive assessment for tracking community health

(CATCH), Best Practices and Benchmarking in Healthcare 2

(5) (September/October 1997) 196–207.

[26] J. Studnicki, A. Hevner, D. Berndt, S. Luther, Comparing

alternative methods for composing community peer groups:

a data warehouse application, Journal of Public Health Man-

agement and Practice 7 (6) (November 2001) 87–94.

[27] R. Wang, M. Ziad, Y. Lee, Data Quality, Kluwer Academic

Publishing, New York, 2001.

[28] J. Weissman, C. Gatsonis, A. Epstein, Rates of avoidable hos-

pitalization by insurance status in Massachusetts and Mary-

land, Journal of the American Medical Association 268 (17)

(November 1992) 2388–2394.

[29] World Health Organization, The World Health Report 1995:

Bridging the Gaps, Report of the Director-General, Geneva,

1995.

Donald J. Berndt is an Assistant Professor

in the Information Systems and Decision

Sciences Department in the College of Busi-

ness Administration at the University of

South Florida. His research interests include

data warehousing, knowledge discovery,

and data mining. Dr. Berndt received a

PhD in Information Systems from the Stern

School of Business at New York University.

He is a member of Beta Gamma Sigma,

ACM, AIS, and INFORMS.

Alan R. Hevner is an Eminent Scholar and

Professor in the Information Systems and

Decision Sciences Department in the Col-

lege of Business Administration at the

University of South Florida. He holds the

Salomon Brothers/HRCP Chair of Distrib-

uted Technology. His research interests

include software engineering, software test-

ing, distributed database systems, and health

care information systems. He received a

PhD in Computer Science from Purdue

University. Dr. Hevner is a member of ACM, IEEE, AIS, and

INFORMS.

James Studnicki is a Professor of Health

Policy and Management at the University

of South Florida College of Public Health.

His research interests include measuring the

health status of communities, evaluating

alternative treatment outcomes, and study-

ing the influence of managed care penetra-

tion on the utilization and quality of health

services. Dr. Studnicki received a ScD from

Johns Hopkins University.

D.J. Berndt et al. / Decision Support Systems 35 (2003) 367–384384

Introduction
The CATCH methods of community assessment

Limitations
CATCH data warehouse challenges

The CATCH data warehouse

The dimensional model

Fact tables
Dimension tables

Data warehouse design: the data access pyramid
Aggregated Florida Department of Health Data
Transaction-oriented hospital discharge data
Performance issues

Aggregates
Indexing
Partitioning

Data staging and quality assurance

Twin star staging

Stage 1
Stage 2
Stage 3

Data quality filters

CATCH data warehouse applications

Producing CATCH reports
Investigating health care issues
Security issues

Conclusions

Data warehouse research directions
Community decision-making with CATCH data

Acknowledgements
References