Model composition in a distributed environment

https://doi.org/10.1016/S0167-9236(02)00116-1

Abstract

Organizations that operate from multiple locations have data and model resources that are distributed at various sites of the organization. Decision making is facilitated when these resources are leveraged to support model composition and execution, i.e., composing and executing a sequence of models in response to a particular decision making situation. In this paper, we address the problem of model composition when data sources (models and data) are distributed across multiple sites and have different scopes.

Introduction

Two important resources that aid users in managerial decision making within an organization are models and data. (We view a “model” as a computer-executable procedure that may require data inputs, a view widely held in the decision support systems (DSS) literature [7], [27], [30].) Examples of models include a demand-forecasting program that uses price data to forecast product demand, and a production planning program based on linear programming that determines the optimal quantities of products to produce. Models and data are typically distributed across various organizational sites on a variety of platforms interconnected by a communications network. Such an environment is characterized as distributed and heterogeneous.

One of the challenging issues facing organizations today is leveraging existing data and model resources for decision making. This entails composing and executing a sequence of models (i.e., a composite model) on different platforms while acquiring model input data from databases at multiple locations. In practice, data in databases are often stored at different levels of a generalization/specialization (i.e., data scope) hierarchy. Model composition becomes more complex when data have differing scopes, and when data are distributed with overlapping key values among database fragments. The scope of a model's input data may not be the same as the scope of the data stored in databases. For example, a cost model may require as input the inventory count of a particular product across all plants in an organization, whereas the inventory count may be stored in database tables on a plant-by-plant basis, thereby requiring aggregation of the data before it is presented to the model as input.

In this paper, we address the following problem: given a query and a set of data sources (i.e., database tables and model executables) distributed at various sites, with overlapping key values and differing scopes, how can the data sources be integrated to form composite models that are then executed in response to the query with minimal human intervention?

We present a hypothetical organization to illustrate model composition and execution when data scopes differ and data is distributed with overlapping key values. Consider ABC Tool, an organization that operates three plants in the US, located in California (CA), Florida (FL), and Virginia (VA), and two plants in Canada, located in British Columbia (BC) and Ontario (ON). Each plant site has a computer that contains relational database tables and models (executable programs). Fig. 1 illustrates the integration of the production model and the sales forecasting model to answer a query that requires the total quantity of all products to be produced by the US operations. The production model, which determines the total units of various products to be produced in a production cycle, uses inventory level (given by quantity) and projected sales as inputs. Projected sales is generated as an output by the sales-forecasting model, whereas inventory level is generated by summing the quantity attribute, grouped by product, across the three database tables VAProd, CAProd, and FLProd (Table 1, Table 2, Table 3). These tables, which are located in New York, Virginia, and Texas, respectively, contain data on items produced at their respective locations (similar tables exist for the Canadian plants: BCProd and ONProd). As seen from Table 1 and Table 2, products with the same product identifier (pid), i.e., pid=101 and 102, are produced at two locations; thus, overlapping key values exist in these tables. There can also be replication of data: the primary copy of VAProd is stored in Virginia, whereas a secondary copy may be maintained in California. The database tables of ABC can be viewed in the context of a data scope hierarchy, as shown in Fig. 2.
A query requiring total inventory level of all products for the entire US operations would involve retrieving quantity data from VAProd, CAProd and FLProd and then aggregating them on a product-by-product basis.

Consider a query that requires the total quantity of the product with pid=102 to be produced by the entire operations of ABC (US and Canadian plants). Here the scope of the query is the root scope. The same composite model as in Fig. 1 is used; however, processing the model's input data is far more complex. First, price data for the product with pid=102 are retrieved from VAProd, CAProd, FLProd, BCProd and ONProd, then averaged before being sent to the sales forecasting model. The inventory level input is generated by aggregating the quantity field values for pid=102 from all five tables, since the scope of the query is the root scope. Data is retrieved from the database copies nearest to the user.
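As a sketch of this scope-dependent input preparation, the fragment below sums the quantity field and averages the price field for one product across the plant tables covered by a query's scope. The table contents and the in-memory representation are invented for illustration; the paper's Tables 1–3 hold the actual data.

```python
# Which plant tables each query scope covers ("ABC" is the root scope).
SCOPE_TABLES = {
    "US":  ["VAProd", "CAProd", "FLProd"],
    "ABC": ["VAProd", "CAProd", "FLProd", "BCProd", "ONProd"],
}

# Hypothetical plant tables: pid -> (quantity, price).
TABLES = {
    "VAProd": {101: (50, 9.0), 102: (30, 9.5)},
    "CAProd": {101: (20, 8.5), 102: (40, 10.0)},
    "FLProd": {103: (60, 7.0)},
    "BCProd": {102: (25, 9.0)},
    "ONProd": {104: (10, 6.0)},
}

def model_inputs(scope, pid):
    """Aggregate inventory (sum) and price (average) for one product
    across all plant tables covered by the query's scope."""
    quantities, prices = [], []
    for name in SCOPE_TABLES[scope]:
        row = TABLES[name].get(pid)
        if row is not None:
            quantities.append(row[0])
            prices.append(row[1])
    return sum(quantities), sum(prices) / len(prices)

# Root-scope query for pid=102: quantities from the VA, CA and BC tables
# are summed, and their prices averaged, before reaching the models.
qty, avg_price = model_inputs("ABC", 102)
```

Widening the scope from "US" to "ABC" changes only the set of tables consulted; the aggregation logic itself is unchanged, which is what makes automated roll-up feasible.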

Many proposals on model composition and execution are available in the literature [1], [2], [3], [5], [6], [7], [12], [13], [14], [15], [16], [17], [18], [21], [22], [23], [25]. However, only a few deal with model composition and execution in a distributed environment [1], [5], [13], [14], [21]. To our knowledge, the only proposal in the model composition literature that deals with data scopes is by Dutta and Basu [12], who propose a sort hierarchy in their first-order logic approach for unifying a variable of a given sort with terms at the same level or lower in the sort hierarchy. They do not address arithmetic computations based on the hierarchy.

A few papers from the heterogeneous database literature have addressed data integration in a heterogeneous environment [4], [19], [20], [26], [29], [31], [32]. These papers primarily deal with differences in data representation and storage, as well as differences in the underlying platforms, during data integration. Various data models, such as those based on object-oriented concepts [4], [19], [31], [32] and knowledge-based approaches [20], have been proposed to represent the global schemata of federated database systems. To our knowledge, none of the proposals in the heterogeneous database systems literature addresses the issue of data scopes. In the data warehousing literature, differences in data scopes are handled using the aggregation operator during roll-up [11], i.e., when the scope of a query is at a higher level than the scope of the stored data. However, to our knowledge, none of the proposals in the data warehousing literature addresses the automation of scope matching and roll-up processing based on a user-specified ad hoc query.

The contributions of this paper are as follows. First, we present constructs to support model composition and execution when data have differing scopes and when data are distributed over multiple platforms with overlapping key values. These constructs, developed by extending prior research on filter spaces [9], are used to represent the metadata of models and data. A loose collection of this metadata constitutes the global schema in our approach. A control procedure uses the metadata and the associated filter spaces to create and execute composite models. In contrast to federated database systems, where significant effort goes into integrating various local schemata to create a global schema, our approach expends very little effort in creating a global schema, since it is a loose collection of metadata. Metadata of models and data can therefore be easily added to or removed from the global schema. Furthermore, the constructs presented make model composition independent of the underlying implementation and distribution of models and data. Second, we present a systems architecture that (1) makes model and data implementation and distribution transparent to the user, (2) supports a high degree of automation during model composition and execution, and (3) enables the use of a variety of platforms, access methods and protocols. In contrast, OLAP tools from the data warehousing literature cannot automate scope matching and related processing for ad hoc queries, nor can they handle data distribution. Together, the constructs and the architecture presented in this paper provide a solution to the research problem addressed here.

This paper is divided into six sections and an appendix. Section 2 provides an informal description of the filter space approach for model composition. Section 3 contains formal specifications of the data scope hierarchy. Section 4 contains details of model composition. Description of the systems architecture is presented in Section 5. The contributions of this paper are summarized in Section 6. Appendix A contains definitions related to the filter space approach.

Section snippets

The filter space approach for model composition

A filter space is an n-dimensional space that represents the space defined by constraints known as filter clauses. A group of filter clauses constitutes a filter list. A filter list can be associated with a data source (i.e., a relational table or a model output expressed as a relation) to specify the logical conditions (i.e., extent) of the data in the data source. The filter space of a data source encloses all the tuples of the data source. Model composition entails matching filter spaces of
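Under simplifying assumptions (each filter clause restricts one attribute to a set of admissible values, and a filter list is the conjunction of its clauses), the enclosure test behind filter-space matching can be sketched as below; the formal definitions appear in Appendix A and in the filter space paper [9].

```python
# Simplified sketch: a filter list as a dict mapping an attribute to its
# set of admissible values; enclosure means the source's filter space
# contains every tuple the query's filter space admits.
def encloses(source_filters, query_filters):
    """True if the source's filter space encloses the query's, i.e. the
    source can supply every tuple the query asks for."""
    for attr, wanted in query_filters.items():
        available = source_filters.get(attr)
        # An attribute the source does not constrain is unrestricted there.
        if available is not None and not wanted <= available:
            return False
    return True

va_prod = {"plant": {"VA"}, "pid": {101, 102}}
encloses(va_prod, {"plant": {"VA"}, "pid": {102}})  # the VA table covers this
encloses(va_prod, {"plant": {"VA", "CA"}})          # CA data is missing here
```

When no single source's filter space encloses the query's, data from several sources must be combined, which is where the scope hierarchy of the next section comes in.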

Data scope hierarchies and data sources

A data scope hierarchy such as in Fig. 2 represents an organizational view of the data. In this section, we first present definitions and results related to a data scope hierarchy. We then describe how data sources in a data scope hierarchy can provide data for a query.

Definition 1

Let Ij denote an index set at level j≥1. The ith data scope hierarchy associated with Ij at level j, i ∈ Ij, and given by Tij, is a finite set of data scopes such that: (i) given a set of m scopes where m>n≥0, there is a specially
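As an illustrative rendering (not the paper's formal Tij construction), the ABC Tool hierarchy of Fig. 2 can be sketched as a parent-to-children map, with a recursive walk that yields the plant-level scopes a query scope covers:

```python
# Sketch of the ABC Tool scope hierarchy from Fig. 2 (child lists only).
HIERARCHY = {
    "ABC":    ["US", "Canada"],
    "US":     ["CA", "FL", "VA"],
    "Canada": ["BC", "ON"],
}

def leaf_scopes(scope):
    """Leaf (plant-level) scopes covered by a scope; these identify the
    base tables whose data must be rolled up to answer a query."""
    children = HIERARCHY.get(scope)
    if not children:
        return [scope]
    leaves = []
    for child in children:
        leaves.extend(leaf_scopes(child))
    return leaves

leaf_scopes("US")   # ['CA', 'FL', 'VA']
leaf_scopes("ABC")  # ['CA', 'FL', 'VA', 'BC', 'ON']
```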

Model composition

Model composition is facilitated by a control procedure that creates a composite model using the metadata of existing models and databases. A key-value set definition, which represents the metadata of relations, can be embedded in the metadata structures of database tables and models.

The metadata of a database table, which is represented by a database template, is an ordered triple DB=〈κν, Source, Copy〉 where κν represents the key-value set definition that describes the data stored in the
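A minimal sketch of the database template triple DB=〈κν, Source, Copy〉 follows; the field types and the example values are assumptions for illustration, with the formal definitions given in the paper.

```python
from dataclasses import dataclass

# Hypothetical rendering of the database template DB = <kv, Source, Copy>.
@dataclass
class DatabaseTemplate:
    key_value_set: dict  # kv: key-value set definition describing the data
    source: str          # site holding the primary copy
    copies: list         # sites holding secondary copies

# Example template for VAProd: primary copy in Virginia, a secondary
# copy maintained in California (as in the ABC Tool scenario).
va_prod = DatabaseTemplate(
    key_value_set={"scope": "VA", "keys": ["pid"]},
    source="Virginia",
    copies=["California"],
)
```

A control procedure could scan such templates to locate the copy of each required table nearest to the user before composing the models.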

Systems architecture

A systems architecture is an important part of our solution to support model composition and execution in a distributed environment. The architecture presented below in Fig. 4 has the following features: (1) enables model/data distribution and implementation to be transparent to the user, (2) supports a high degree of automation during model composition and execution, and (3) permits a variety of databases, model implementations, protocols and access methods to be used in order to leverage

Conclusion

In response to the problem of leveraging existing data and model resources that are distributed at various sites within an organization for decision making, we have addressed the problem of model composition in a distributed environment. We have extended the filter space approach [9] to include data scope hierarchies in order to facilitate model composition when there are overlapping key-values in a replicated environment and when data scopes vary. Our approach can be applied to a variety of

Acknowledgements

The author would like to thank the anonymous referees for their valued comments.

Kaushal Chari is an Associate Professor of Information Systems and Decision Sciences at the University of South Florida. He obtained a B.Tech. in Mechanical Engineering from the Indian Institute of Technology Kanpur, followed by an MBA and PhD from the University of Iowa. His current research interests include DSS, e-commerce, workflow systems and multi-agent systems. His past research appeared in journals such as Decision Support Systems, INFORMS Journal on Computing, Information Systems Research, Telecommunication Systems, European Journal of Operational Research, Computers and Operations Research and Omega. He was the Co-Chair of INFORMS CIST 2001 and is currently serving as the Functional Editor-MIS for Interfaces Journal.

References (32)

  • A. Basu et al., The analysis of assumptions in model bases using metagraphs, Management Science (1998)
  • E. Bertino et al., Applications of object-oriented technology to the integration of heterogeneous database systems, Distributed and Parallel Databases (1994)
  • R.H. Bonczek et al., Foundations of Decision Support Systems (1981)
  • K. Chari, “An Enterprise-Wide Decision Support System,” Technical Report, Department of Information Systems and...
  • K. Chari, Model composition using filter spaces, Information Systems Research (2002)
  • D. Comer, Internetworking with TCP/IP: Principles, Protocols, and Architecture (2000)