Discussion
Chapter 15: Information Retrieval from Medical Knowledge Resources
WILLIAM R. HERSH
Learning Objectives
After viewing this presentation, viewers should be able to:
Enumerate the basic biomedical and health knowledge resources in books, journals, electronic databases, and other sources
Describe the major approaches used for indexing knowledge-based content
Apply advanced searching techniques to the major biomedical and health knowledge resources
Discuss the major results of information retrieval evaluation studies
Describe future directions for research in information retrieval
Introduction
Information Retrieval (IR), sometimes called search, concerns the acquisition, organization, and searching of knowledge-based information, which is usually defined as information derived and organized from observational or experimental research
Although IR in biomedicine traditionally concentrated on the retrieval of text from the biomedical literature, the study has expanded to include newer types of media that include images, video, chemical structures, gene and protein sequences, and a wide range of other digital media of relevance to biomedical education, research, and patient care
Introduction
The overall goal of the IR process is to find content that meets a person’s information needs
Components of information retrieval systems
IR tends to focus on knowledge-based information
Knowledge-based information categories:
Primary knowledge–based information (also called primary literature) is original research that appears in journals, books, reports, and other sources
Secondary knowledge–based information consists of the writing that reviews, condenses, and/or synthesizes the primary literature. The most common examples of this type of literature are books, monographs, and review articles in journals and other publications
Knowledge Based Information
Virtually all scientific journals are published electronically
Not only is there the increased convenience of redistributing articles, but research has found that articles freely available on the Web have a higher likelihood of being cited by other papers than those that are not (Bork 2012)
Printing and mailing, tasks no longer needed in electronic publishing, comprised a significant part of the “added value” from publishers of journals. Publishers still add value, however, such as by hiring and managing editorial staff to produce the journals and by managing the peer review process
Publication of Knowledge-Based Information
The basic principle of open access publishing is that authors and/or their institutions pay the cost of production of manuscripts up front after they are accepted through a peer review process. After the paper is published, it becomes freely available on the Web. Since most research is funded by grants, the cost of open access publishing should be included in grant budgets. Uptake of the open access model by publishers has been modest, with the most prominent being BioMed Central (BMC, www.biomedcentral.com ) and the Public Library of Science ( PLoS, www.plos.org )
Publishing Costs and Open Access
Information content is classified in four categories:
Bibliographic: the best-known and most widely used biomedical bibliographic database is MEDLINE, which contains bibliographic references to all the biomedical articles, editorials, and letters to the editors in approximately 5,000 scientific journals
Full-text content: a large component of this content consists of the online versions of books and periodicals. As already noted, most traditionally paper-based medical literature, from textbooks to journals, is now available electronically
Content
Annotated content: these resources are usually not stored as freestanding Web pages but instead are often housed in database management systems
Aggregated content: Aggregated content has been developed for all types of users from consumers to clinicians to scientists. Probably the largest aggregated consumer information resource is MedlinePlus ( http://medlineplus.gov ) from the NLM. MedlinePlus includes all the types of content previously described, aggregated for easy access to a given topic
Content
Indexing is the process of assigning metadata to content to facilitate its retrieval. Most modern commercial content is indexed in two ways:
Manual indexing—where human indexers, usually using a controlled terminology, assign indexing terms and attributes to documents, often following a specific protocol
Automated indexing—where computers make the indexing assignments, usually limited to breaking out each word in the document (or part of the document) as an indexing term
Indexing
A controlled terminology contains a set of terms that can be applied to a task, such as indexing
When the terminology defines the terms, it is usually called a vocabulary
When the terminology contains variants or synonyms of terms, it is also called a thesaurus
Controlled Terminologies
A controlled terminology usually contains a list of terms that are the canonical representations of the concepts. If it is a thesaurus, it contains relationships between terms, which typically fall into three categories:
Hierarchical—terms that are broader or narrower. The hierarchical organization not only provides an overview of the structure of a thesaurus but also can be used to enhance searching (e.g., MeSH tree explosions that add terms from an entire portion of the hierarchy to augment a search)
Synonym—terms that are synonyms, allowing the indexer or searcher to express a concept in different words
Related—terms that are not synonymous or hierarchical but are somehow otherwise related. These usually remind the searcher of different but related terms that may enhance a search
Controlled Terminologies
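The hierarchical "explosion" mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `NARROWER` table is a toy hierarchy invented for the example (not a copy of any MeSH tree), and `explode` is a hypothetical helper name.

```python
# Toy hierarchy: each term maps to its direct narrower terms.
# The structure is illustrative only and is not taken from MeSH files.
NARROWER = {
    "Heart Diseases": ["Heart Failure", "Arrhythmias, Cardiac"],
    "Arrhythmias, Cardiac": ["Atrial Fibrillation"],
    "Heart Failure": [],
    "Atrial Fibrillation": [],
}


def explode(term, narrower=NARROWER):
    """Return the term plus all of its descendants, sorted alphabetically."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against terms reachable by more than one path
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return sorted(seen)


# A search on the broad term is augmented with every narrower term.
print(explode("Heart Diseases"))
```

A searcher who explodes a broad heading in this way retrieves documents indexed under any of the narrower headings, which is one way a hierarchy can enhance searching.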
The MeSH terminology is used to manually index most of the databases produced by the NLM
The latest version contains over 26,000 subject headings
MeSH contains the three types of relationships described in the previous slide:
Hierarchical—MeSH is organized hierarchically into 16 trees, such as Diseases, Organisms, and Chemicals and Drugs
Synonym—MeSH contains a vast number of entry terms, which are synonyms of the headings
Related—terms that may be useful for searchers to add to their searches when appropriate are suggested for many headings
Medical Subject Headings (MeSH)
The MeSH terminology files, their associated data, and their supporting documentation are available on the NLM’s MeSH Web site http://www.nlm.nih.gov/mesh
Medical Subject Headings (MeSH)
“Slice” through MeSH hierarchy
Manual indexing is most commonly done for bibliographic and annotated content, although it is sometimes done for other types of content as well
While most Web content is indexed automatically (see next slide), one approach to manual indexing has been to apply metadata to Web pages and sites, exemplified by the Dublin Core Metadata Initiative (DCMI, www.dublincore.org )
Manual Indexing
The goal of the DCMI has been to develop a set of standard data elements that creators of Web resources can use to apply metadata to their content
The DCMI standard has been approved by the National Information Standards Organization (NISO) and the International Organization for Standardization (ISO). The specification has 15 defined elements; sample elements include:
DC.title - name given to the resource
DC.creator - person or organization primarily responsible for creating the intellectual content of the resource
DC.subject - topic of the resource
DC.description - a textual description of the content of the resource
DC.publisher - entity responsible for making the resource available in its present form
DC.date - date associated with the creation or availability of the resource
Manual Indexing
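As a rough illustration of how a page creator might apply these elements, the sketch below renders a small record as HTML `<meta>` tags, one common way of embedding Dublin Core metadata in a Web page. The record values and the helper name `dublin_core_meta` are hypothetical and exist only for this example.

```python
from html import escape


def dublin_core_meta(elements):
    """Render a dict of Dublin Core elements as HTML <meta> tags."""
    lines = []
    for name, value in elements.items():
        # Escape quotes and angle brackets so the attribute values stay valid HTML.
        safe_name = escape(name, quote=True)
        safe_value = escape(value, quote=True)
        lines.append(f'<meta name="DC.{safe_name}" content="{safe_value}">')
    return "\n".join(lines)


# Hypothetical record for an imaginary consumer health page.
record = {
    "title": "Hypertension Overview",
    "creator": "Example Health Library",
    "subject": "Hypertension",
    "date": "2014-01-01",
}
print(dublin_core_meta(record))
```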
In automated indexing, the indexing is done by a computer
We will focus on the automated indexing used in operational IR systems, namely the indexing of documents by the words they contain
Word indexing is typically done by defining all consecutive alphanumeric sequences between white space (which consists of spaces, punctuation, carriage returns, and other non-alphanumeric characters) as words. Systems must take particular care to apply the same process to documents and the user’s query, especially with characters such as hyphens and apostrophes
Automated Indexing
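Word indexing as described above can be sketched as follows. This is a minimal example, assuming lowercasing and a simple "runs of letters and digits" rule; operational systems make their own choices for hyphens, apostrophes, and case, and the function names are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


# Hyphens and apostrophes split words under this rule.
print(tokenize("High blood-pressure isn't rare."))

docs = {1: "Lead poisoning", 2: "ECG lead placement"}
index = build_inverted_index(docs)
print(sorted(index["lead"]))
```

Because the same `tokenize` function is applied to both documents and queries, a query for "blood-pressure" and a document containing "blood pressure" break into the same words and can match.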
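A commonly used approach for term weighting is TF*IDF weighting, which combines the inverse document frequency (IDF) and term frequency (TF).
The usual formula is: WEIGHT(term, document) = TF(term, document) × IDF(term), where TF is the frequency of the term in the document and IDF is commonly computed as log(number of documents in the database / number of documents containing the term) + 1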
Automated Indexing
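A small worked sketch of TF*IDF weighting follows. It assumes raw term counts for TF and a natural-log, plus-one form of IDF; other variants exist, and the tiny document collection is invented for the example.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return {doc_id: {term: weight}} with weight = TF * IDF.

    Assumptions: TF is the raw count of the term in the document and
    IDF = ln(number of documents / documents containing the term) + 1.
    """
    n_docs = len(documents)
    term_counts = {doc_id: Counter(words) for doc_id, words in documents.items()}

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    weights = {}
    for doc_id, counts in term_counts.items():
        weights[doc_id] = {
            term: tf * (math.log(n_docs / doc_freq[term]) + 1)
            for term, tf in counts.items()
        }
    return weights


docs = {
    "d1": ["hypertension", "treatment", "hypertension"],
    "d2": ["diabetes", "treatment"],
}
w = tf_idf_weights(docs)
print(w["d1"])
```

In this toy collection "treatment" appears in every document, so its IDF is the minimum value of 1, while "hypertension" is both rarer and more frequent within d1 and receives a higher weight.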
Synonymy—different words may have the same meaning, such as high and elevated. This problem may extend to the level of phrases with no words in common, such as the synonyms hypertension and high blood pressure
Polysemy—the same word may have different meanings or senses. For example, the word lead can refer to an element or to a part of an electrocardiogram machine
Content—words in a document may not reflect its focus. For example, an article describing hypertension may mention other concepts in passing, such as congestive heart failure (CHF), that are not the focus of the article
Context—words take on meaning based on other words around them
Morphology—words can have suffixes that do not change the underlying meaning, such as indicators of plurals, various participles, adjectival forms of nouns, and nominalized forms of adjectives
Granularity—queries and documents may describe concepts at different levels of a hierarchy. For example, a query about antibiotics for treatment of a specific infection may miss documents that mention only specific antibiotics by name
Automated Indexing Limitations
Exact-Match Retrieval: In exact-match searching, the IR system gives the user all documents that exactly match the criteria specified in the search statement(s). This type of searching is often called Boolean searching
Retrieval
Boolean operators
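Exact-match searching with Boolean operators maps naturally onto set operations. The sketch below assumes a tiny hand-built inverted index; the terms, document ids, and the `postings` helper are illustrative only.

```python
# Hypothetical inverted index: each term maps to the ids of documents containing it.
INDEX = {
    "hypertension": {1, 2, 4},
    "diabetes": {2, 3},
    "pregnancy": {4},
}


def postings(index, term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term, set())


# AND: documents containing both terms (set intersection).
and_result = postings(INDEX, "hypertension") & postings(INDEX, "diabetes")

# OR: documents containing either term (set union).
or_result = postings(INDEX, "hypertension") | postings(INDEX, "diabetes")

# NOT: documents containing the first term but not the second (set difference).
not_result = postings(INDEX, "hypertension") - postings(INDEX, "pregnancy")

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Every document either is or is not in the result set, which is why exact-match output has no inherent ranking.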
Partial-Match Retrieval: Although partial-match searching was conceptualized very early, it did not see widespread use in IR systems until the advent of Web search engines in the 1990s
The most common approach to document ranking in partial-match searching is to give each document a score based on the sum of the weights of terms common to the document and query
Retrieval
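The scoring approach described above can be sketched as follows. This is a minimal illustration that assumes document term weights have already been computed (for instance by TF*IDF) and that every query term counts equally; the weights shown are made up.

```python
def rank(query_terms, doc_weights):
    """Score each document by summing the weights of the query terms it contains.

    Returns (doc_id, score) pairs, highest score first; documents sharing
    no term with the query are omitted.
    """
    scores = {}
    for doc_id, weights in doc_weights.items():
        score = sum(weights.get(term, 0.0) for term in set(query_terms))
        if score > 0:
            scores[doc_id] = score
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical precomputed term weights for three documents.
DOC_WEIGHTS = {
    "d1": {"hypertension": 3.5, "treatment": 1.0},
    "d2": {"diabetes": 1.5, "treatment": 1.0},
    "d3": {"asthma": 2.0},
}

print(rank(["hypertension", "treatment"], DOC_WEIGHTS))
```

Unlike a Boolean AND, d2 is still returned even though it matches only one of the two query terms; it simply ranks lower than d1.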
There are many different retrieval interfaces, with some of the features reflecting the content or structure of the underlying database
PubMed is the system at NLM that searches MEDLINE and other bibliographic databases
Retrieval Systems
There has been a great deal of research over the years devoted to evaluation of IR systems.
One framework organizes evaluation around six questions that someone advocating the use of IR systems might ask (Hersh 1998):
Was the system used?
For what was the system used?
Were the users satisfied?
How well did they use the system?
What factors were associated with successful or unsuccessful use of the system?
Did the system have an impact?
Evaluation
There are many ways to evaluate the performance of IR systems, the most widely used of which are the relevance-based measures of recall and precision
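Recall is the proportion of relevant documents retrieved from the database: recall = (number of retrieved and relevant documents) / (number of relevant documents in the database)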
In other words, recall answers the question, for a given search, what fraction of all the relevant documents have been obtained from the database?
System-Oriented Evaluation
Precision is the proportion of relevant documents retrieved in the search: precision = (number of retrieved and relevant documents) / (number of documents retrieved)
This measure answers the question, for a search, what fraction of the retrieved documents are relevant?
One problem that arises when one is comparing systems that use ranking versus those that do not is that non-ranking systems, typically using Boolean searching, tend to retrieve a fixed set of documents and as a result have fixed points of recall and precision
System-Oriented Evaluation
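Recall and precision for a single search can be computed directly from the set of retrieved documents and the set of documents judged relevant. The sketch below is illustrative only: the document ids are invented, and returning 0.0 when a denominator is empty is an assumption made to avoid division by zero.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    recall    = retrieved-and-relevant / all relevant documents
    precision = retrieved-and-relevant / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# A search returns 4 documents; 5 documents in the database are relevant,
# and 2 of those were among the 4 retrieved.
recall, precision = recall_precision(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6, 7})
print(recall, precision)
```

For a ranking system the same function can be applied to the top 5, top 10, or top 20 results to obtain several recall and precision points, whereas a Boolean search yields the single point computed from its fixed result set.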
A number of user-oriented evaluations have been performed over the years looking at users of biomedical information. Most of these studies have focused on clinicians
For example, Hersh et al. (1995) used a task-oriented approach to compare Boolean versus natural language searching in the textbook Scientific American Medicine
More studies are listed in Chapter 15 of the textbook
User-Oriented Evaluation
Areas of ongoing research related to IR include:
Information extraction and text mining—usually through the use of natural language processing (NLP) to extract facts and knowledge from text
Summarization—Providing automated extracts or abstracts summarizing the content of longer documents
Question-answering—Going beyond retrieval of documents to providing actual answers to questions, as exemplified by the IBM Corp. Watson system, which is being applied to medicine (Ferrucci 2010)
Future Directions
Many biomedical and health knowledge resources are available online in bibliographic databases, journals and other full-text resources, Web sites, and other sources
Bibliographic content is likely to be indexed using controlled vocabularies assigned by humans
Full-text and other resources are likely to be indexed via extraction of words
The major approaches to searching biomedical and health knowledge resources include exact-match searching using sets and Boolean operators and partial-match searching on words using relevance ranking
System-oriented evaluation studies tend to focus on the performance of search systems and usually involve the relevance-based measures of recall and precision
User-oriented evaluation studies tend to compare users and their abilities to complete tasks using retrieval systems
Conclusions