Discussion

ALHEChapter15_Information_Retrieval.pptx

Chapter 15: Information Retrieval from Medical Knowledge Resources

WILLIAM R. HERSH

Learning Objectives

After viewing this presentation, viewers should be able to:

Enumerate the basic biomedical and health knowledge resources in books, journals, electronic databases, and other sources

Describe the major approaches used to index knowledge-based content

Apply advanced searching techniques to the major biomedical and health knowledge resources

Discuss the major results of information retrieval evaluation studies

Describe future directions for research in information retrieval

Introduction

Information Retrieval (IR), sometimes called search, concerns the acquisition, organization, and searching of knowledge-based information, which is usually defined as information derived and organized from observational or experimental research

Although IR in biomedicine traditionally concentrated on the retrieval of text from the biomedical literature, the study has expanded to include newer types of media that include images, video, chemical structures, gene and protein sequences, and a wide range of other digital media of relevance to biomedical education, research, and patient care

Introduction

The overall goal of the IR process is to find content that meets a person’s information needs

Components of information retrieval systems

IR tends to focus on knowledge-based information

Knowledge-based information categories:

Primary knowledge–based information (also called primary literature) is original research that appears in journals, books, reports, and other sources

Secondary knowledge–based information consists of the writing that reviews, condenses, and/or synthesizes the primary literature. The most common examples of this type of literature are books, monographs, and review articles in journals and other publications

Knowledge Based Information

Virtually all scientific journals are published electronically

Not only is there the increased convenience of redistributing articles, but research has found that papers freely available on the Web have a higher likelihood of being cited by other papers than those that are not (Bork 2012)

Printing and mailing, tasks no longer needed in electronic publishing, comprised a significant part of the “added value” from publishers of journals. Publishers still add value, however, such as hiring and managing editorial staff to produce the journals and managing the peer review process

Publication of Knowledge-Based Information

The basic principle of open access publishing is that authors and/or their institutions pay the cost of production of manuscripts up front after they are accepted through a peer review process. After the paper is published, it becomes freely available on the Web. Since most research is usually funded by grants, the cost of open access publishing should be included in grant budgets. Uptake of the open access model by publishers has been modest, with the most prominent being BioMed Central (BMC, www.biomedcentral.com) and the Public Library of Science (PLoS, www.plos.org)

Publishing Costs and Open Access

Information content is classified into four categories:

Bibliographic: the best-known and most widely used biomedical bibliographic database is MEDLINE, which contains bibliographic references to all the biomedical articles, editorials, and letters to the editors in approximately 5,000 scientific journals

Full-text content: a large component of this content consists of the online versions of books and periodicals. As already noted, most traditionally paper-based medical literature, from textbooks to journals, is now available electronically

Content

Annotated content: these resources are usually not stored as freestanding Web pages but instead are often housed in database management systems

Aggregated content: Aggregated content has been developed for all types of users from consumers to clinicians to scientists. Probably the largest aggregated consumer information resource is MedlinePlus ( http://medlineplus.gov ) from the NLM. MedlinePlus includes all the types of content previously described, aggregated for easy access to a given topic

Content

Indexing is the process of assigning metadata to content to facilitate its retrieval. Most modern commercial content is indexed in two ways:

Manual indexing—where human indexers, usually using a controlled terminology, assign indexing terms and attributes to documents, often following a specific protocol

Automated indexing—where computers make the indexing assignments, usually limited to breaking out each word in the document (or part of the document) as an indexing term

Indexing

A controlled terminology contains a set of terms that can be applied to a task, such as indexing

When the terminology defines the terms, it is usually called a vocabulary

When the terminology contains variants or synonyms of terms, it is also called a thesaurus

Controlled Terminologies

A controlled terminology usually contains a list of terms that are the canonical representations of the concepts. If it is a thesaurus, it contains relationships between terms, which typically fall into three categories:

Hierarchical—terms that are broader or narrower. The hierarchical organization not only provides an overview of the structure of a thesaurus but also can be used to enhance searching (e.g., MeSH tree explosions that add terms from an entire portion of the hierarchy to augment a search)

Synonym—terms that are synonyms, allowing the indexer or searcher to express a concept in different words

Related—terms that are not synonymous or hierarchical but are somehow otherwise related. These usually remind the searcher of different but related terms that may enhance a search
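
The hierarchical relationship, and the "explosion" it enables, can be sketched in a few lines. This is a minimal illustration only: the NARROWER table and the explode function are hypothetical names, and the terms are a toy hierarchy rather than the actual MeSH trees.

```python
# Hypothetical toy hierarchy: each broader term maps to its narrower terms.
# The terms are illustrative and are not copied from the MeSH files.
NARROWER = {
    "Cardiovascular Diseases": ["Heart Diseases", "Vascular Diseases"],
    "Heart Diseases": ["Heart Failure", "Arrhythmias"],
    "Vascular Diseases": ["Hypertension"],
}


def explode(term, narrower=NARROWER):
    """Return the term followed by every term beneath it in the hierarchy."""
    expanded = [term]
    for child in narrower.get(term, []):
        expanded.extend(explode(child, narrower))
    return expanded


# A search on the broader heading is augmented with all of its narrower terms.
print(explode("Heart Diseases"))
```

A searcher who explodes a broad heading retrieves documents indexed under any of the narrower headings, without having to list them by hand.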

Controlled Terminologies

The MeSH terminology is used to manually index most of the databases produced by the NLM

The latest version contains over 26,000 subject headings

MeSH contains the three types of relationships described in the previous slide:

Hierarchical—MeSH is organized hierarchically into 16 trees, such as Diseases, Organisms, and Chemicals and Drugs

Synonym—MeSH contains a vast number of entry terms, which are synonyms of the headings

Related—terms that may be useful for searchers to add to their searches when appropriate are suggested for many headings

Medical Subject Headings (MeSH)

The MeSH terminology files, their associated data, and their supporting documentation are available on the NLM’s MeSH Web site http://www.nlm.nih.gov/mesh

Medical Subject Headings (MeSH)

“Slice” through MeSH hierarchy

Manual indexing is most commonly done for bibliographic and annotated content, although it is sometimes done for other types of content as well

While most Web content is indexed automatically (see next slide), one approach to manual indexing has been to apply metadata to Web pages and sites, exemplified by the Dublin Core Metadata Initiative (DCMI, www.dublincore.org )

Manual Indexing

The goal of the DCMI has been to develop a set of standard data elements that creators of Web resources can use to apply metadata to their content

The DCMI standard has been approved by the National Information Standards Organization (NISO) and the International Organization for Standardization (ISO). The specification has 15 defined elements; sample elements include:

DC.title - name given to the resource

DC.creator - person or organization primarily responsible for creating the intellectual content of the resource

DC.subject - topic of the resource

DC.description - a textual description of the content of the resource

DC.publisher - entity responsible for making the resource available in its present form

DC.date - date associated with the creation or availability of the resource
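
One common way to attach these elements to a Web page is as HTML meta tags in the page header. The sketch below assumes that convention; the dublin_core_meta function and the record values are hypothetical and are given only for illustration.

```python
from html import escape


def dublin_core_meta(elements):
    """Render a dict of Dublin Core elements as HTML <meta> tags."""
    lines = []
    for name, value in elements.items():
        # Escape the value so quotes or ampersands cannot break the attribute.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{name}" content="{content}">')
    return "\n".join(lines)


# Hypothetical metadata for an imaginary patient-education page.
record = {
    "title": "Hypertension Overview",
    "creator": "Example Health Library",
    "subject": "Hypertension",
    "date": "2014-01-15",
}

print(dublin_core_meta(record))
```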

Manual Indexing

In automated indexing, the indexing is done by a computer

We will focus on the automated indexing used in operational IR systems, namely the indexing of documents by the words they contain

Word indexing is typically done by defining all consecutive alphanumeric sequences between white space (which consists of spaces, punctuation, carriage returns, and other non-alphanumeric characters) as words. Systems must take particular care to apply the same process to documents and the user’s query, especially with characters such as hyphens and apostrophes
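
The word-indexing step can be sketched as follows. This is a minimal illustration, not a description of any particular system: the function names are hypothetical, and lowercasing the text is an added assumption. The same tokenize function is applied to both documents and the query, which is what keeps hyphens and apostrophes from causing mismatches.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric sequences.

    Lowercasing is an assumption of this sketch; anything that is not a
    letter or digit (spaces, punctuation, line breaks) acts as a separator.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    1: "Beta-blockers in the treatment of hypertension",
    2: "The patient's blood pressure was elevated",
}
index = build_index(docs)
print(sorted(index["the"]))          # documents containing "the"
print(tokenize("beta-blockers"))     # the query is split the same way
```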

Automated Indexing

A commonly used approach for term weighting is TF*IDF weighting, which combines the inverse document frequency (IDF) and term frequency (TF).

The usual formula is:
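
One widely used form is given here as an assumption, since variants differ in logarithm base and smoothing:

IDF(term) = log(number of documents in database / number of documents containing term) + 1
TF(term, document) = frequency of term in document
WEIGHT(term, document) = TF(term, document) * IDF(term)

A minimal sketch of that computation follows. The tf_idf function and the sample documents are hypothetical, and the natural logarithm is an arbitrary choice.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF*IDF weights.

    docs maps a document ID to its list of words; the result maps each
    document ID to a dict of word -> weight.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each word.
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))

    weights = {}
    for doc_id, words in docs.items():
        term_freq = Counter(words)
        weights[doc_id] = {
            word: count * (math.log(n_docs / doc_freq[word]) + 1)
            for word, count in term_freq.items()
        }
    return weights


# Hypothetical collection, already split into words.
docs = {
    "d1": ["hypertension", "treatment", "hypertension"],
    "d2": ["treatment", "diabetes"],
}
for doc_id, doc_weights in tf_idf(docs).items():
    print(doc_id, doc_weights)
```

A word that is frequent in one document but rare across the collection receives the highest weight, while a word that appears in every document is weighted only by its frequency.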

Automated Indexing

Synonymy—different words may have the same meaning, such as high and elevated. This problem may extend to the level of phrases with no words in common, such as the synonyms hypertension and high blood pressure

Polysemy—the same word may have different meanings or senses. For example, the word lead can refer to an element or to a part of an electrocardiogram machine

Content—words in a document may not reflect its focus. For example, an article describing hypertension may mention in passing other concepts, such as congestive heart failure (CHF), that are not the focus of the article

Context—words take on meaning based on other words around them

Morphology—words can have suffixes that do not change the underlying meaning, such as indicators of plurals, various participles, adjectival forms of nouns, and nominalized forms of adjectives

Granularity—queries and documents may describe concepts at different levels of a hierarchy. For example, a query on antibiotics for treatment of a specific infection may fail to match documents that mention only specific antibiotics by name

Automated Indexing Limitations

Exact-Match Retrieval: In exact-match searching, the IR system gives the user all documents that exactly match the criteria specified in the search statement(s). This type of searching is often called Boolean searching
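
Exact-match searching maps naturally onto set operations. The sketch below assumes a small hypothetical inverted index (word to set of document IDs); the names INDEX and docs_with are illustrative only.

```python
# Hypothetical inverted index: word -> set of IDs of documents containing it.
INDEX = {
    "hypertension": {1, 2, 4},
    "diabetes": {2, 3},
    "pregnancy": {4},
}


def docs_with(term, index=INDEX):
    """Return the set of documents indexed under the term (empty if none)."""
    return index.get(term, set())


# AND: documents that contain both terms.
both = docs_with("hypertension") & docs_with("diabetes")
# OR: documents that contain either term.
either = docs_with("hypertension") | docs_with("diabetes")
# NOT: documents that contain the first term but not the second.
without = docs_with("hypertension") - docs_with("pregnancy")

print(both, either, without)
```

Each result is a fixed set: a document either satisfies the search statement or it does not, and no ranking is implied.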

Retrieval

Boolean operators

Partial-Match Retrieval: Although partial-match searching was conceptualized very early, it did not see widespread use in IR systems until the advent of Web search engines in the 1990s

The most common approach to document ranking in partial-match searching is to give each document a score based on the sum of the weights of terms common to the document and query
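
That scoring rule can be sketched as follows. The rank function and the term weights are hypothetical; in practice the weights would come from a scheme such as TF*IDF.

```python
def rank(query_terms, doc_weights):
    """Rank documents by the summed weights of terms shared with the query.

    doc_weights maps a document ID to a dict of term -> weight. Documents
    that share no term with the query are left out of the result.
    """
    scores = {}
    for doc_id, weights in doc_weights.items():
        score = sum(weights[term] for term in set(query_terms) if term in weights)
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical term weights for three documents.
doc_weights = {
    "d1": {"hypertension": 3.0, "treatment": 1.0},
    "d2": {"treatment": 1.0, "diabetes": 1.5},
    "d3": {"asthma": 2.0},
}
print(rank(["hypertension", "treatment"], doc_weights))
```

Unlike exact-match searching, a document need not contain every query term to be retrieved; it only ranks lower than documents that match more, or more heavily weighted, terms.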

Retrieval

There are many different retrieval interfaces, with some of the features reflecting the content or structure of the underlying database

PubMed is the system at NLM that searches MEDLINE and other bibliographic databases

Retrieval Systems

There has been a great deal of research over the years devoted to evaluation of IR systems.

One framework organizes evaluation around six questions that someone advocating the use of IR systems might ask (Hersh 1998):

Was the system used?

For what was the system used?

Were the users satisfied?

How well did they use the system?

What factors were associated with successful or unsuccessful use of the system?

Did the system have an impact?

Evaluation

There are many ways to evaluate the performance of IR systems, the most widely used of which are the relevance-based measures of recall and precision

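Recall is the proportion of relevant documents in the database that are retrieved:

Recall = (number of retrieved and relevant documents) / (number of relevant documents in the database)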

In other words, recall answers the question, for a given search, what fraction of all the relevant documents have been obtained from the database?

System-Oriented Evaluation

Precision is the proportion of retrieved documents that are relevant:

Precision = (number of retrieved and relevant documents) / (number of retrieved documents)

This measure answers the question, for a search, what fraction of the retrieved documents are relevant?
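
Both measures can be computed from two sets: the documents a search retrieved and the documents judged relevant. The function name and the document IDs below are hypothetical, and returning 0.0 for an empty set is a convention of this sketch.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved: IDs of the documents the search returned.
    relevant:  IDs of all documents in the database judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents both retrieved and relevant
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 4 documents retrieved, 5 relevant in the database,
# 2 of the retrieved documents are relevant.
recall, precision = recall_precision({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(recall, precision)
```

The two measures share a numerator and differ only in the denominator, which is why retrieving more documents tends to raise recall while lowering precision.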

One problem that arises when one is comparing systems that use ranking versus those that do not is that non-ranking systems, typically using Boolean searching, tend to retrieve a fixed set of documents and as a result have fixed points of recall and precision

System-Oriented Evaluation

A number of user-oriented evaluations have been performed over the years looking at users of biomedical information. Most of these studies have focused on clinicians

For example, a 1995 study by Hersh et al. used a task-oriented approach to compare Boolean and natural language searching in the textbook Scientific American Medicine

More studies are listed in Chapter 15 of the textbook

User-Oriented Evaluation

Research is taking place in several areas related to IR, including:

Information extraction and text mining—usually through the use of natural language processing (NLP) to extract facts and knowledge from text

Summarization—Providing automated extracts or abstracts summarizing the content of longer documents

Question-answering—Going beyond retrieval of documents to providing actual answers to questions, as exemplified by the IBM Corp. Watson system, which is being applied to medicine (Ferrucci 2010)

Future Directions

There are many biomedical and health knowledge resources available online in bibliographic databases, journals and other full-text resources, Web sites, and other sources

Bibliographic content is likely to be indexed using controlled vocabularies assigned by humans

Full-text and other resources are likely to be indexed via extraction of words

The major approaches to searching biomedical and health knowledge resources include exact-match searching using sets and Boolean operators and partial-match searching on words using relevance ranking

System-oriented evaluation studies tend to focus on the performance of search systems and usually involve measuring the relevance-based measures of recall and precision

User-oriented evaluation studies tend to compare users and their abilities to complete tasks using retrieval systems

Conclusions