Chapter 2: Healthcare Data, Information and Knowledge
Elmer Bernstam MD
Todd Johnson PhD
Trevor Cohen MD PhD
After reviewing these slides the viewer should be able to:
Define data, information, and knowledge
Understand how vocabularies convert data to information
Describe methods that convert information to knowledge
Distinguish informatics from other computational disciplines, particularly computer science
Describe the differences between data-centric and information-centric technology
Learning Objectives
Data are symbols or observations reflecting differences in the world. Example = 250.00 (Note: data is the plural of datum)
Information is data with meaning. Example = ICD-9 code of 250.00 means type 2 diabetes
Knowledge is information that is justifiably believed to be true. Example = obese patients are more likely to develop type 2 diabetes
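The step from data to information can be sketched in a few lines of Python. This is only an illustration: the one-entry `ICD9_VOCABULARY` dictionary and the `interpret` function are hypothetical names, not part of any real terminology service.

```python
# Hypothetical one-entry vocabulary; a real system would load a full code set.
ICD9_VOCABULARY = {"250.00": "type 2 diabetes"}


def interpret(datum: str, vocabulary: dict[str, str]) -> str:
    """Return the meaning a vocabulary assigns to a raw datum, if any."""
    return vocabulary.get(datum, "unknown (no meaning assigned)")


# The same symbols are data on their own and information once a vocabulary applies.
print(interpret("250.00", ICD9_VOCABULARY))
print(interpret("127", ICD9_VOCABULARY))
```

Without the vocabulary, "250.00" is just a string of characters; the lookup is what attaches meaning to it.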
Introduction
Computers generate and analyze binary data: zero (off) and one (on). Each zero or one is a bit; a series of 8 bits is a byte. Note that these bits and bytes have no meaning per se
Bits can be grouped and interpreted as various data types
Integers such as 345 or 669988
Floating point numbers such as 14.1 or -1.23
Characters such as a or z
Character strings such as “hello” or “goodbye”
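To illustrate that bits have no meaning per se, the short sketch below reads the same four bytes as an integer, a floating point number and a character string. The byte values are arbitrary and chosen only for illustration.

```python
import struct

# Four arbitrary bytes (32 bits); nothing about them says how to read them.
raw = bytes([0x41, 0x42, 0x43, 0x44])

as_integer = int.from_bytes(raw, byteorder="big")  # read as an unsigned integer
as_float = struct.unpack(">f", raw)[0]             # read as a 32-bit float
as_text = raw.decode("ascii")                      # read as a character string

print(as_integer)
print(as_float)
print(as_text)
```

The interpretation comes entirely from the data type the program chooses, not from the bits themselves.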
Introduction
Data can be aggregated into a variety of formats such as image files (JPG, GIF, PNG), text files, sound files (WAV, MP3) or video files (WMV, MP4)
Recognize that these formats define only how the data are packaged, not what information they contain
Data are the domain of computer scientists, but information is the domain of informatics and informaticians
Introduction
Information retrieval involves both computer science (data) and informatics (information). See image below
Introduction
Computer data not only lack meaning, but must include dates and other qualifiers to gain significance. For example, blood glucose = 127. Was that in mg/dL? Was the sample drawn fasting? And so on
Everything must be standardized, otherwise computer B will not understand data transmitted from computer A (i.e. data won’t be interoperable)
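A minimal sketch of the glucose example follows, assuming a made-up `GlucoseResult` record. The field names, the date and the time are illustrative and do not follow any particular interoperability standard.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GlucoseResult:
    """A lab value together with the qualifiers that give it significance."""
    value: float
    unit: str            # e.g. "mg/dL"
    fasting: bool        # was the sample drawn fasting?
    drawn_at: datetime   # when the sample was drawn

    def describe(self) -> str:
        state = "fasting" if self.fasting else "non-fasting"
        return (f"{state} blood glucose {self.value:g} {self.unit} "
                f"drawn {self.drawn_at:%Y-%m-%d %H:%M}")


bare = 127  # the number alone: unit, fasting state and date are unknown
result = GlucoseResult(127, "mg/dL", True, datetime(2020, 1, 15, 7, 30))
print(result.describe())
```

Two computers can exchange such a record meaningfully only if both agree on the fields and on the allowed units, which is the role of standardization.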
Data and Information
A modern way to convert medical information to knowledge is to use a clinical data warehouse (CDW)
EHRs are now a huge source of healthcare data and information. They contain both structured data (coded, e.g. ICD-9 codes) and unstructured data (free text or natural language)
Interpreting free text requires natural language processing (NLP)
Information to Knowledge
Data from EHRs, radiology, pathology, etc. are copied into a staging database, where they are cleaned, then loaded into a common database and associated with metadata (data that describe data). ICD-type data are an example of metadata
Tools can be applied to the data in the CDW, such as simple descriptive analytics that report the number of patients with breast cancer, their ages, menopausal status, etc. More about this in Chapter 3
CDWs do a better job of analyzing and reporting aggregate healthcare data than the average EHR, which tends to focus on the individual
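The staging, cleaning, loading and reporting steps can be sketched with Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the rows are invented, the `diagnosis` table is hypothetical, the cleaning rules are arbitrary, and ICD-9 code 174.9 stands in for a breast cancer diagnosis.

```python
import sqlite3

# Invented rows standing in for extracts from source systems (the staging area).
staging_rows = [
    ("p1", " 174.9 ", 62),   # stray whitespace around the code
    ("p2", "174.9", 48),
    ("p2", "174.9", 48),     # duplicate record
    ("p3", "250.00", 55),
    ("p4", "174.9", None),   # missing age
]


def clean(rows):
    """Trim codes, skip rows with a missing age, and drop exact duplicates."""
    seen = set()
    for patient_id, code, age in rows:
        row = (patient_id, code.strip(), age)
        if age is None or row in seen:
            continue
        seen.add(row)
        yield row


# Load the cleaned rows into the common database (in memory for this sketch).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE diagnosis (patient_id TEXT, icd9_code TEXT, age INTEGER)"
)
warehouse.executemany("INSERT INTO diagnosis VALUES (?, ?, ?)", clean(staging_rows))

# Simple descriptive analytics: how many patients have the code, and their mean age.
count, mean_age = warehouse.execute(
    "SELECT COUNT(DISTINCT patient_id), AVG(age) FROM diagnosis WHERE icd9_code = ?",
    ("174.9",),
).fetchone()
print(count, mean_age)
```

The query answers an aggregate question across patients, which is the kind of reporting a CDW is built for.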
Clinical Data Warehouse
CDWs can be used to evaluate critical clinical processes, produce cost estimates and analyze potential solutions
CDWs are highly valuable for informatics and evidence-based medical research
CDWs can help track infections and report trends to public health
The next slide shows a typical CDW schema
Clinical Data Warehouse
Clinical Data Warehouse
ETL = extract, transform and load
Informatics for Integrating Biology and the Bedside (i2b2) is a Harvard project used by many other academic institutions in the US
The program is open source and modular and incorporates genomic and clinical information for research purposes
The database consists of facts (diagnoses, lab results, etc.) queried by users and dimensions that describe the facts
With this model, data can be aggregated from multiple hospitals
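The facts-and-dimensions idea can be sketched as a tiny star schema in sqlite3. The table names (`fact`, `patient_dim`, `concept_dim`), the columns and the rows are simplified assumptions for illustration and are not the actual i2b2 schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE patient_dim (patient_id INTEGER PRIMARY KEY, sex TEXT, hospital TEXT);
CREATE TABLE concept_dim (concept_id INTEGER PRIMARY KEY, name TEXT);
-- Each fact row points at the dimension rows that describe it.
CREATE TABLE fact (
    patient_id INTEGER REFERENCES patient_dim(patient_id),
    concept_id INTEGER REFERENCES concept_dim(concept_id),
    value REAL
);
""")
db.executemany("INSERT INTO patient_dim VALUES (?, ?, ?)", [
    (1, "F", "Hospital A"), (2, "M", "Hospital A"), (3, "F", "Hospital B"),
])
db.executemany("INSERT INTO concept_dim VALUES (?, ?)", [
    (10, "type 2 diabetes diagnosis"), (20, "blood glucose (mg/dL)"),
])
db.executemany("INSERT INTO fact VALUES (?, ?, ?)", [
    (1, 10, None), (1, 20, 127.0), (2, 20, 98.0), (3, 10, None), (3, 20, 141.0),
])

# Aggregate one kind of fact across hospitals by joining through the dimensions.
rows = db.execute("""
    SELECT p.hospital, COUNT(DISTINCT f.patient_id)
    FROM fact AS f
    JOIN patient_dim AS p ON p.patient_id = f.patient_id
    JOIN concept_dim AS c ON c.concept_id = f.concept_id
    WHERE c.name = 'type 2 diabetes diagnosis'
    GROUP BY p.hospital
    ORDER BY p.hospital
""").fetchall()
print(rows)
```

Because every fact has the same shape, rows from additional hospitals can be appended to the same fact table without changing the queries.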
i2b2 platform https://www.i2b2.org
i2b2 star schema
Several systems have been developed to extract concepts from free text in EHRs or CDWs. See below
Concept Extraction
| Concept Extractor | Gold Standard | Precision | Recall | F-score (F1) |
| cTAKES [17] | Mayo Clinic | 0.80 | 0.65 | 0.72 |
| MetaMap [20] | NLM 500 articles | 0.32 | 0.53 | 0.40 |
| MedLEE [21] | Proprietary | 0.86 | 0.77 | 0.81 |
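The F-score column in the table is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the precision and recall values in the table; the function name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the table above.
reported = {
    "cTAKES": (0.80, 0.65),
    "MetaMap": (0.32, 0.53),
    "MedLEE": (0.86, 0.77),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

Rounded to two decimals, the results match the table: 0.72, 0.40 and 0.81.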
In other industries such as banking, data and information are much closer (smaller semantic gap).
For example, banking data such as $100.50 is close to an account balance of $100.50. There is little leeway for a different interpretation
In healthcare, there are subjective factors (“I feel sick”) that are difficult to measure and vary from patient to patient and physician to physician. Lab results are more objective and easier to interpret
What Makes Informatics Difficult?
What Makes Informatics Difficult?
It is difficult to model all of healthcare. See the HL7 RIM model on the next slide
Biomedical information is difficult to work with because it is often incomplete, imprecise, vague, inconsistent and uncertain
Humans can adapt to this dynamic and vague information but computers cannot. Clinical decision support in EHRs is precise, when in reality it might need to be flexible over time
HL7 version 3 RIM model
Health IT is an attractive solution to our troubled healthcare system, but is it realistic?
Other IT fields, such as artificial intelligence, have experienced serious “ups and downs”
There is a large gap between the healthcare data generated and information (the semantic gap)
Is it too early to expect EHRs and computerization to change healthcare?
Why Health IT Fails Sometimes
Computer scientists focus on data, while informaticists focus on information
There is a gap between healthcare data and information (semantic gap)
The transformation of information into knowledge is a primary goal of informaticists
Clinical data warehouses are increasingly used to research clinical questions and generate knowledge from information
Conclusions