Dealing with complex queries in decision-support systems

https://doi.org/10.1016/j.datak.2010.10.006

Abstract

In decision-making problems under uncertainty, a decision table consists of a set of attributes indicating what is the optimal decision (response) within the different scenarios defined by the attributes. We recently introduced a method to give explanations of these responses. In this paper, the method is extended. To do this, it is combined with a query system to answer expert questions about the preferred action for a given instantiation of decision table attributes. The main difficulty is to accurately answer queries associated with incomplete instantiations. Incomplete instantiations are the result of the evaluation of a partial model outputting decision tables that only include a subset of the whole problem, leading to uncertain responses. Our proposal establishes an automatic and interactive dialogue between the decision-support system and the expert to elicit information from the expert to reduce uncertainty. Typically, the process involves learning a Bayesian network structure from a relevant part of the decision table and computing some interesting conditional probabilities that are revised accordingly.

Introduction

Under uncertainty, a modern and useful decision-theoretic model is the influence diagram [17]. It consists of an acyclic directed graph with associated probabilities and utilities, respectively modeling the uncertainties and preferences tied in with the stated problem. Nowadays this probabilistic graphical model is frequently adopted as a basis for constructing decision-support systems (DSSs). The results of evaluating an influence diagram are decision tables containing the optimal decision alternatives, policies or responses. Thus, for every decision, there is an associated decision table with the best alternative, i.e. the alternative with the maximum expected utility for every combination of relevant variables (usually called attributes within this context) that are observable before the decision is made. The evaluation algorithm determines which of the observable variables are relevant. These variables are outcomes of random variables and/or other past decisions.
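As a minimal sketch of how such a table arises, the snippet below keeps, for each attribute configuration, the alternative with the maximum expected utility. The expected utilities are invented for illustration, the configurations use only two attributes, and `build_decision_table` is a hypothetical helper rather than the evaluation algorithm itself.

```python
# Hypothetical expected utilities per configuration (hp, cs) and alternative.
# None of these numbers come from the paper.
expected_utility = {
    ("Present", "I"): {"No": 0.61, "Yes": 0.78},
    ("Absent", "I"): {"No": 0.70, "Yes": 0.66},
    ("Present", "II2"): {"No": 0.42, "Yes": 0.55},
}


def build_decision_table(eu):
    """Map each configuration to its maximum-expected-utility alternative."""
    return {config: max(utils, key=utils.get) for config, utils in eu.items()}


decision_table = build_decision_table(expected_utility)
```

Each row of `decision_table` then pairs a configuration of observable attributes with one recommended alternative, which is the form of table analysed in the rest of the paper.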

A decision table may have millions of rows and typically more than twenty columns, leading to enormous data sets for storage and analysis. Expert DSS users demand such an analysis mainly on two grounds. First, DSS decision tables provide the best decision-making recommendations. However, experts may find such recommendations hard to accept if they come without any explanation whatsoever of why the proposed decisions are optimal. Unexplained responses are not good enough for expert users since DSSs operate on a model that is an approximation of the real world. The importance of explanations has been reported in the literature, see e.g. [9], [12], [13]. Thus, for example, in health-care problems, usually involving difficult trade-offs between the treatment benefits and risks, practitioners may use decision tables to determine the best patient treatment recommendations. For this purpose, they need to understand the underlying reasons or implicit rules.

In medical DSSs, clinical practice guidelines assemble the relevant knowledge gathered through literature review, meta-analysis, expert consensus, etc., and operationalize this information as informal text documents. This makes the gathered information difficult to interpret automatically and the decision-making process hard to guide. Shiffman and Greenes [19] propose translating guideline knowledge into decision table-based rule sets. Shiffman [18] proposes augmenting decision tables by layers, storing collateral information in slots at various levels beneath the logic layer of the conventional decision table. This information relates to table cells, rows and columns. It may include how tests are performed, the benefits/risks of the recommended strategies, costs, literature citations, etc., to help understand the domain. All these decision tables are different from ours. Our knowledge base is the model (influence diagram) and its evaluation, stored in the decision tables. The model (graph with probabilistic dependencies and probability and utility information) is built from clinical practice guidelines, data and expert input. Also, there is no uncertainty in clinical guidelines. Influence diagrams are based on subjective probabilities and utilities, and support learning and reasoning with uncertainty and preferences.

In [6] we introduced KBM2L lists to find explanations. The main idea stems from how computers manage multidimensional matrices: computer memory stores and manages these matrices as linear arrays, and each position is a function of the order chosen for the matrix dimensions. KBM2L lists are new list-based structures that optimize this order by putting equal responses in consecutive positions, yielding the target explanations and simultaneously achieving compact storage. These lists implicitly include the probability and utility models; they are simple and have no added complex layers.
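The effect of the dimension order can be sketched on a toy table. The code below is only an illustration under stated assumptions: the binary domains, the response rule in `toy_response` and the exhaustive search in `best_base` are all hypothetical, and the search is plain brute force over permutations, not the optimization procedure of [6].

```python
from itertools import permutations, product

# Hypothetical binary domains, chosen only to keep the toy table small.
DOMAINS = {
    "hp": ("Absent", "Present"),
    "cs": ("I", "II2"),
    "bd": ("No", "Yes"),
}


def toy_response(cfg):
    """Invented rule (not clinical advice): the response depends on hp only."""
    return "Yes" if cfg["hp"] == "Present" else "No"


def offset(cfg, base):
    """Position of configuration `cfg` in the linear array induced by `base`."""
    pos = 0
    for attr in base:
        pos = pos * len(DOMAINS[attr]) + DOMAINS[attr].index(cfg[attr])
    return pos


def linearize(base):
    """Responses listed in the row-major order defined by `base`."""
    cells = product(*(DOMAINS[attr] for attr in base))
    return [toy_response(dict(zip(base, values))) for values in cells]


def count_runs(responses):
    """Number of maximal blocks of equal consecutive responses."""
    return sum(1 for i, r in enumerate(responses) if i == 0 or r != responses[i - 1])


def best_base():
    """Exhaustive search for an attribute order with the fewest blocks."""
    return min(permutations(DOMAINS), key=lambda base: count_runs(linearize(base)))
```

In this toy table, the order (hp, cs, bd) groups the eight responses into two blocks, whereas (cs, bd, hp) makes the responses alternate and yields eight blocks. The same configuration also lands at a different position under each order, which is the dependence on dimension order described above.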

Not only do expert users employ decision tables as a knowledge base (KB) for explanations; they also query the DSS, in different ways, about the best recommendation for a given set of attributes. This is the second reason for decision table analysis. In a typical session, experts interact with DSSs to:

  • (A) formulate a query in the KB domain;

  • (B) translate the query into the KB formalism;

  • (C) implement the response retrieval;

  • (D) build the response efficiently;

  • (E) communicate the response(s) and/or suggest improvements, and wait for user feedback.

For (A) and (B), we distinguish between two groups of queries (closed/open) depending on whether or not the whole set of attributes is instantiated. A closed query is a specific and well-defined query entered by users who know all the attribute information. An open query is less specific, as it includes attribute values that are undefined, either because they are hard or expensive to obtain or because they are unreliable. Martinez et al. [15] give a similar classification for GIS (geographical information systems), although they focus on efficient data updating and access from a physical point of view (merely as a database), rather than from a logical point of view (as a KB).

(C) to (E) may be troublesome, especially for open queries, due to imprecise response retrieval failing to satisfy users. Additionally, the DSS may not include the whole decision table, because an exhaustive evaluation of the decision-making problem can be too costly. In this case there will be no response at all. Worse still, both situations could apply at the same time, demanding a methodology to undertake tasks (C)–(E) dealing with ambiguity and ignorance about the response.

Let us illustrate these ideas with the following clinical problem. It is a real health-care decision-making problem regarding the optimal treatment of non-Hodgkin lymphoma of the stomach.

Primary gastric non-Hodgkin lymphoma, gastric NHL for short, is a relatively rare disorder, accounting for about 5% of gastric tumors. This disorder is caused by a chronic infection by the Helicobacter pylori bacterium [5]. Treatment consists of a combination of antibiotics, chemotherapy, radiotherapy and surgery.

A number of influence diagrams have been constructed and validated [14]. These models are only meant to be used for patients with histologically confirmed gastric NHL. We have taken the most complex version with three decision nodes. This influence diagram is shown in Fig. 1, and is briefly discussed in the following. The first of the decision nodes, helicobacter-treatment (ht), corresponds to the decision to prescribe antibiotics against H. pylori. The second decision concerns carrying out surgery (s). The possibilities are either curative surgery, involving the complete removal of the stomach and locoregional tumor mass; palliative surgery, i.e. partial removal of the stomach and tumor; or no surgery. The last decision, ct-rt-schedule (ctrts), is concerned with the selection of chemotherapy (Chemo), radiotherapy (Radio), chemotherapy followed by radiotherapy (Ch.Next.Rad), or none.

The influence diagram model consists of 17 chance nodes (ellipses), one value node (diamond), three decision nodes (rectangles) and 42 arcs. Nodes to the left of the decision nodes (see Fig. 1) concern pretreatment information. Nodes to the right of the decision nodes are posttreatment nodes. Variables with their associated domains are listed in Table 1. See [14] for further details on the model. Bielza et al. [1] detail the use of KBM2L lists to gain a better understanding of the treatment basis of the gastric NHL model.

The gastric NHL influence diagram evaluation outputs three decision tables, one for each decision variable, each containing the optimal treatment for each combination of attributes in the tables.

Let us take the first decision table concerning the ht decision. It contains four attributes (cs, bd, hc, and hp), and the expected utility of each treatment alternative ht = No/Yes. To illustrate likely user queries, suppose a user queries the DSS about patients with the following configurations:

  • Q0: HC=Low.Grade, HP=Present, CS=I and BD=Yes

  • OQ1: HP=Absent, CS=I and BD=Yes

  • OQ2: CS=II2

We will look at all of these queries in this paper. In the first case, Q0, the query is closed since the four attributes are instantiated. The question is about a patient who has a good histological classification (hc = Low.Grade), a favorable prognosis (cs = I), the H. pylori bacterium (hp = Present), and a big tumor (bd = Yes). Unless this query corresponds precisely to an unsolved part of the problem, the response should be easy to retrieve.

In the second case, OQ1, the query is open because the doctor has not yet performed a biopsy to ascertain the histological-classification (hc). This could perhaps be due to the high cost of the biopsy.

In the third case, OQ2, the query is even more open, specifying only a medium clinical stage (cs = II2) for the patient. However, the user may be interested in finding out which treatment patients like these should receive. Responses are no longer expected to be easy to retrieve. Many configurations match the query, and users will find it unsatisfactory if the responses retrieved for them differ or are unknown. Therefore, strategies should be developed to assure user satisfaction. One possibility is table reordering to provide more precise answers. Another is to use sophisticated prediction procedures to infer the unknown responses from (somehow) close known responses, or to have the user intervene at some steps to reduce response uncertainty.
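The difference between the three queries can be sketched as follows. This is a toy illustration under stated assumptions, not the query system proposed in this paper: the two-valued domains (including the value "High.Grade"), the response rule and the unevaluated region in `toy_table` are all invented, and `is_closed` and `answer` are hypothetical helpers. The actual attribute domains are those listed in Table 1.

```python
from collections import Counter
from itertools import product

ATTRS = ("hc", "hp", "cs", "bd")

# Assumed two-valued domains, for illustration only.
DOMAINS = {
    "hc": ("Low.Grade", "High.Grade"),
    "hp": ("Absent", "Present"),
    "cs": ("I", "II2"),
    "bd": ("No", "Yes"),
}


def toy_table():
    """Invented decision table for ht; None marks an unevaluated configuration."""
    table = {}
    for cfg in product(*(DOMAINS[a] for a in ATTRS)):
        hc, hp, cs, bd = cfg
        if cs == "II2" and hc == "High.Grade":
            table[cfg] = None  # assumed to lie in the unsolved part of the problem
        else:
            table[cfg] = "Yes" if hp == "Present" else "No"  # not clinical advice
    return table


def is_closed(query):
    """A query is closed when every attribute is instantiated."""
    return all(attr in query for attr in ATTRS)


def answer(query, table):
    """Tally the responses of all configurations compatible with the query."""
    choices = [(query[a],) if a in query else DOMAINS[a] for a in ATTRS]
    return Counter(table[cfg] for cfg in product(*choices))


Q0 = {"hc": "Low.Grade", "hp": "Present", "cs": "I", "bd": "Yes"}
OQ1 = {"hp": "Absent", "cs": "I", "bd": "Yes"}
OQ2 = {"cs": "II2"}
```

In this toy setting, `answer(Q0, toy_table())` tallies a single response, `answer(OQ1, toy_table())` tallies two configurations that happen to agree, and `answer(OQ2, toy_table())` mixes both alternatives with unknown (`None`) entries. The last case is the kind of ambiguous and partly unknown response that calls for the strategies mentioned above.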

In this paper, we propose a query system based on the KBM2L framework to deal with these complex situations. Unlike database management systems that operate with facts, DSSs must provide explanations besides efficiently retrieving the query response information [10]. Thus, our KBM2L framework provides not only an efficient and satisfactory query response retrieval but also an informed response explanation. It is not our aim to develop clinical practice guidelines, but to provide a DSS with a user interface capable of performing complex queries involving more than just accessing a clinical protocol database or document.

The paper is organised as follows. Section 2 outlines the technique of KBM2L lists. Section 3 describes the query complexity and shows how to deal with a closed query. Section 4 tackles less specific and more complex open queries. The proposal combines decision tables that have been compacted using KBM2L lists with learning, information access and information retrieval processes. We give several examples applied to the non-Hodgkin lymphoma problem. Section 5 contains the conclusions and suggests further research.


Basics

A decision table output by evaluating an influence diagram is a set of attributes that determines the optimal policy. Besides all the attribute configurations, a decision table includes the response or optimal alternative associated with each configuration. A base is defined as a vector with elements equal to the attributes in a specific order. Given a base, an index is a vector whose elements are the attribute values, interpreted as the coordinates with respect to that base. With a fixed order

Complexity of queries

Queries are stated as attribute instantiations. Therefore, they are related to the KBM2L index and employ multidimensional point access methods [21]. The DSS is expected to return a response stating the optimal policy using a small subset of the KB. However, an added difficulty is that the optimal policy may be unknown.

Let us explain this point in further detail. As mentioned earlier, the exhaustive evaluation of the decision-making problem may be too costly in terms of time and memory

Open queries

We have seen that the expert is an agent that queries the DSS about the optimal policy for the decision-making problem. Expert and DSS enter into a dialogue consisting of queries, responses and explanations. For closed queries, the expert receives definite and accurate responses. Responses to open queries are not so straightforward due to expert imprecision. Not all attributes are instantiated. Possible reasons are the unreliability of some attribute values, missing knowledge, high retrieval

Conclusions and further research

A decision model builds on guidelines, probabilities, utilities, probabilistic relationships, among other sources of information. Decision tables are the result of evaluating a decision model, taking into account that information. Their extraordinarily large size motivated us to analyse them. The aim was to save memory space and, more interestingly, retrieve knowledge (to understand DSS suggestions). In our previous paper we managed to achieve both aims. Moreover, by analysing the items—groups

Acknowledgments

Research partially supported by grants from the Spanish Ministry of Science and Innovation (TIN2007-62626 and Consolider Ingenio 2010-CSD2007-00018). Thanks to Peter Lucas for valuable support with the medical problem. We are also grateful to the referees for their valuable remarks that have definitely helped to improve the manuscript.

Concha Bielza received the M.S. degree in mathematics from Complutense University of Madrid, Madrid, Spain, in 1989 and the Ph.D. degree in computer science from the Universidad Politécnica de Madrid, Madrid, in 1996. She is currently a Full Professor of statistics and operations research with the Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid. Her research interests are primarily in the areas of probabilistic graphical models, decision analysis, metaheuristics for optimization, data mining, classification models, and real applications, like biomedicine, bioinformatics and neuroscience. Her research has appeared in journals like Management Science, Computers and Operations Research, Statistics and Computing, the European Journal of Operational Research, Decision-Support Systems, Naval Research Logistics, the Journal of the Operational Research Society, Medical Decision Making, Methods of Information in Medicine, IEEE Transactions on SMC, International Journal of Systems Science, Bioinformatics, Briefings in Bioinformatics, Journal of Statistical Software, Journal of Heuristics, Intelligent Data Analysis, Developmental Neurobiology, Neuroinformatics, IEEE Transactions on Signal Processing, and Expert Systems with Applications as well as chapters of many books.

Juan A. Fernandez del Pozo received his MS degree in Computer Science in 1999 and his PhD in Computer Science in 2006 from Universidad Politécnica de Madrid (UPM), Madrid (Spain). He is currently Associate Professor of Statistics and Operations Research at the School of Computer Science and a member of the Computational Intelligence Group at the UPM. His research interests include decision analysis and intelligent decision-support systems based on influence diagrams and Bayesian networks that perform knowledge acquisition in huge decision tables, knowledge discovery and data mining on models' outputs for explanation synthesis and sensitivity analysis. He is also interested in optimization based on evolutionary algorithms and in classification models. He is collaborating with several Spanish Foundations in modeling service quality and quality of life in social service environments. His articles have appeared in various academic journals including: Springer Lecture Notes in Computer Science, Journal of Operational Research, Computers & Operations Research, Decision-Support Systems, Expert Systems with Applications, Medical Decision Making. His teaching interests include Statistics, Decision-Support Systems and Operations Research.
