An efficient workload-based data layout scheme for multidimensional data

https://doi.org/10.1016/S0169-023X(01)00043-X

Abstract

Physical data layout is a crucial factor in the performance of queries and updates in large data warehouses. Data layout enhances and complements other performance features such as materialized views and dynamic caching of aggregated results. Prior work has identified that the multidimensional nature of large data warehouses imposes natural restrictions on the query workload. A method based on a “uniform” query class approach has been proposed for data clustering and shown to be optimal. However, we believe that realistic query workloads will exhibit data access skew. For instance, if time is a dimension in the data model, then more queries are likely to focus on the most recent time interval. The query class approach does not adequately model the possibility of multidimensional data access skew. We propose the affinity graph model for capturing workload characteristics in the presence of access skew and describe an efficient algorithm for physical data layout. Our proposed algorithm considers declustering and load balancing issues which are inherent to the multidisk data layout problem. We demonstrate the validity of this approach experimentally.

Introduction

Data warehouses are naturally organized using the multidimensional schema model for supporting complex analytical decision support, OLAP, and data mining applications. The multidimensional schema model consists of a number of logical dimensions, e.g., date, product, location. It also models hierarchies on each dimension, e.g., month, quarter, year on the date dimension. The multidimensional schema is also called a star schema since the main or fact table is at the center and the dimension tables are spokes connecting to the main table using primary key–foreign key relationships. OLAP and warehousing applications typically pose queries that select subregions of the overall multidimensional space and perform aggregations along the hierarchy definitions. In a large data warehouse, the physical layout (placement) and clustering of multidimensional data greatly influence performance since the large I/O costs incurred are dominant. In addition, clustering can improve the performance of batched updates and database utilities such as reorg. Choosing an appropriate clustering is crucial for a high performance data warehouse.
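
To make the query pattern concrete, the sketch below (our illustration; the dimension names, key layout, and function name are assumptions rather than details from the paper) shows a star-schema fact tuple and a query that aggregates a measure over a rectangular subregion of the dimension space.

/*
 * Illustrative sketch only: a fact-table tuple holds one foreign key per
 * dimension plus a measure, and a typical OLAP query aggregates the measure
 * over a rectangular subregion of the multidimensional space.
 */
typedef struct {
    int    date_key;      /* foreign key into the date dimension     */
    int    product_key;   /* foreign key into the product dimension  */
    int    location_key;  /* foreign key into the location dimension */
    double sales;         /* measure */
} FactTuple;

typedef struct {
    int date_lo, date_hi;            /* inclusive range per dimension */
    int product_lo, product_hi;
    int location_lo, location_hi;
} Subregion;

/* Sum the sales measure over all fact tuples falling inside the subregion. */
double aggregate_sales(const FactTuple *facts, int n, const Subregion *q)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        const FactTuple *f = &facts[i];
        if (f->date_key     >= q->date_lo     && f->date_key     <= q->date_hi &&
            f->product_key  >= q->product_lo  && f->product_key  <= q->product_hi &&
            f->location_key >= q->location_lo && f->location_key <= q->location_hi)
            total += f->sales;
    }
    return total;
}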

In general, data layout can be considered to be an optimization problem using a cost function based on response times and/or throughput of a workload. It encompasses the issues of clustering for minimizing seek times, declustering for obtaining I/O parallelism, and load balancing in a multiquery and multidisk environment. While there is a substantial body of literature on these issues [3], [7], [9], [17], [23], most past work has not considered the data layout issues arising from a multidimensional data model. One exception is the recently proposed query class approach to modeling the distribution of the query workload in a multidimensional data warehouse [10]. This approach recognizes that the workload is constrained by the model and hence the data layout can be optimized more effectively. We agree with the assumption that the workload follows the model. However, the query class approach uses a simplifying assumption that a query is uniformly likely to access any subspace of the multidimensional space. We expect real-world applications to exhibit skew in the access of the multidimensional data. For example, although a warehouse may contain the last two years' worth of data, many queries are likely to be biased towards analyzing the most recent data, e.g., the latest month compared to the preceding month, or the latest quarter compared to the preceding quarter. The query class approach does not capture this type of access skew and hence can result in suboptimal clustering of the data.
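
As a concrete illustration of such recency skew (our own sketch; the 24-month horizon and the 0.8 decay factor are assumed values, not figures from the paper), the expected access frequency of each monthly data block can be modeled with a weight that decays geometrically with the block's age.

#include <stdio.h>

#define NUM_MONTHS 24            /* two years of monthly data blocks (assumed) */

int main(void)
{
    double weight[NUM_MONTHS];
    double total = 0.0;
    double w = 1.0;

    /* Geometric decay with age: the most recent months dominate the workload. */
    for (int age = 0; age < NUM_MONTHS; age++) {
        weight[age] = w;
        total += w;
        w *= 0.8;                /* illustrative decay factor */
    }

    /* Normalize to access probabilities and print the skewed distribution. */
    for (int age = 0; age < NUM_MONTHS; age++)
        printf("month age %2d: access probability %.3f\n", age, weight[age] / total);
    return 0;
}

A layout driven only by the uniform query class assumption would treat all 24 blocks identically, whereas these probabilities suggest keeping the recent blocks clustered and giving them the most favorable placement.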

In this paper, we investigate a new approach to the data layout problem in order to address a more realistic and general model of a multidimensional query workload. We explicitly model the effects of data access skew in the queries. We construct a data structure called the affinity graph and use it as input to an efficient algorithm for data layout that encompasses clustering, declustering, and load balancing issues for a multidisk environment. Techniques similar in spirit have been used in object-oriented databases, where the relationships between objects are captured in an object graph; this graph is then used as input to algorithms that output optimal clusterings for the layout of objects on disk [22].
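
This preview does not give a concrete representation, but one natural realization of the affinity graph (a minimal sketch under our own assumptions, with illustrative type names and bounds) treats data blocks as nodes and lets each edge weight count how often two blocks are accessed together by the same query.

#include <string.h>

#define MAX_BLOCKS 256           /* illustrative bound on the number of data blocks */

/*
 * Affinity graph sketch: nodes are data blocks, and weight[i][j] accumulates
 * the (frequency-weighted) number of workload queries that touch both block i
 * and block j.  A dense symmetric matrix is used only for clarity; a sparse
 * adjacency structure would be preferable for large block counts.
 */
typedef struct {
    int    num_blocks;
    double weight[MAX_BLOCKS][MAX_BLOCKS];
} AffinityGraph;

void ag_init(AffinityGraph *g, int num_blocks)
{
    g->num_blocks = num_blocks;
    memset(g->weight, 0, sizeof g->weight);
}

/* Record one query: every pair of blocks it accesses gains affinity freq. */
void ag_add_query(AffinityGraph *g, const int *blocks, int n, double freq)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            g->weight[blocks[i]][blocks[j]] += freq;
            g->weight[blocks[j]][blocks[i]] += freq;
        }
}

Instances of this size should be allocated statically or on the heap rather than on the stack.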

The affinity graph structure is space-efficient, and our data layout algorithm is time-efficient and provides a close-to-optimal solution. We are able to adapt our technique to include the query class model for those data areas which do not exhibit any special data access skew. We have validated the algorithm by performing simulation experiments, varying the type of data access skew, the number of dimensions, and the size of the data blocks accessed. Our results show that the affinity graph algorithm is better suited to real-world multidimensional workloads that exhibit data access skew and change gradually over time. Our work is complementary to work done on choosing a set of views to materialize given a space or maintenance cost constraint. Even for a fixed set of materialized views, performance will vary depending upon the physical organization of these views. Our focus is on the layout of the fact table of a star schema, which impacts the performance of most queries in a warehousing environment.

Section snippets

Data layout issues

In this section, we describe the data layout issues for multidimensional data. We show the effect of data access skew on data layout using examples and provide a generic description of the problem parameters.

Affinity graphs

In this section, we describe our overall approach to the problem. We propose a data structure called the affinity graph, which is used to characterize the workload over the data warehouse. We then provide an algorithm which, given the affinity graph as input, outputs a linearization that minimizes the average query cost. We use the block-level access patterns of the workload to construct the affinity graph and then use the resulting linearization to place data on disk.
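
The linearization algorithm itself is only summarized in this preview; as a hedged illustration of the general idea, the greedy heuristic below (our own simplification, not the authors' algorithm) orders blocks so that each newly placed block has maximal affinity to the block placed just before it, which tends to keep frequently co-accessed blocks adjacent.

#include <stdbool.h>

#define MAX_BLOCKS 256           /* must match the affinity matrix dimensions */

/*
 * Greedy linearization sketch: starting from block 0, repeatedly append the
 * not-yet-placed block with the highest affinity to the most recently placed
 * block.  `weight` is the affinity matrix built from the workload and `order`
 * receives the resulting block order.
 */
void linearize(double weight[MAX_BLOCKS][MAX_BLOCKS], int num_blocks, int order[])
{
    bool placed[MAX_BLOCKS] = { false };
    int current = 0;             /* arbitrary start block */

    order[0] = current;
    placed[current] = true;

    for (int pos = 1; pos < num_blocks; pos++) {
        int best = -1;
        for (int b = 0; b < num_blocks; b++) {
            if (placed[b])
                continue;
            if (best < 0 || weight[current][b] > weight[current][best])
                best = b;
        }
        order[pos] = best;
        placed[best] = true;
        current = best;
    }
}

The resulting order could be written sequentially on a single disk or striped round-robin across disks for declustering; a complete solution would also balance load across disks, which this sketch ignores.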

Experimental results

We implemented the affinity graph algorithms in the C programming language and carried out experiments on a number of different synthetic datasets. The datasets model multidimensional schemas with several dimensions and hierarchies along different dimensions [12]. The parameters of interest for each dataset are the number of dimensions, the cardinality of each dimension, the number of tuples, and the distribution followed by the tuples. Some of our comparisons are against the query class

Related work

We have already described past work on multidimensional data placement and clustering in the preceding sections [9], [10]. These papers assume a single disk and hence focus only on the clustering problem. In this section, we describe related work on various aspects of data placement.

The benefits of declustering by placing related data on different disks were studied in [15]. This paper contrasted the declustering approach with clustering (placing related data on the same disk) and

Future work and contributions

The work described so far assumes a static scenario where the workload is known and assumed to change very slowly. In practice, the workload does change over time as data values change and the schema occasionally evolves. For data placement, one strategy is to rerun the affinity graph algorithm periodically and make the necessary layout changes. An alternative approach would be to maintain the affinity graph dynamically, with the edge weights prioritizing more frequent queries. One of the

References (24)

  • TPC-D benchmark specification.
  • A. Aho et al., The Design and Analysis of Computer Algorithms (1974).
  • G. Copeland et al., Data placement in Bubba, in: Proceedings of the 1988 ACM SIGMOD Conference on Management of Data, Chicago, Illinois, Association for Computing Machinery (1988).
  • H. Du et al., Disk allocation for cartesian product files on multiple-disk systems, ACM Trans. Database Syst. (1982).
  • C. Faloutsos et al., Modelling skewed distributions using multifractals and the 80-20 law.
  • M. Garey et al., Some simplified NP-complete graph problems, Theor. Comput. Sci. (1976).
  • S. Ghandeharizadeh et al., A performance analysis of alternative multi-attribute declustering strategies.
  • V. Harinarayan et al., Implementing data cubes efficiently.
  • H. Jagadish, Linear clustering of objects with multiple attributes.
  • H. Jagadish et al., Snakes and sandwiches: optimal clustering strategies for a data warehouse.
  • M. Kim et al., Optimal file distribution for partial match retrieval.
  • R. Kimball, The Data Warehouse Toolkit (1996).

Dr. Kazi A. Zaman works at nQuire Software on query processing and optimization issues. He received his B.Tech (Hons.) degree from the Indian Institute of Technology, Kharagpur, India in 1995 and his M.S. and Ph.D. degrees from Columbia University, New York, USA in 1997 and 2001, respectively. His research interests include query processing, query optimization, data warehousing and data mining. He has refereed several journal and conference submissions and is a member of the ACM SIGMOD and SIGKDD groups.

Dr. Sriram Padmanabhan is a Research Staff Member and Manager of the Data Management Group at the IBM T.J. Watson Research Center, Hawthorne, New York. He received his B.Tech degree from the Indian Institute of Technology, Madras, India in 1986 and his M.S. and Ph.D. degrees from The University of Michigan, Ann Arbor, USA in 1989 and 1992, respectively. His research interests include database algorithms, query processing, query optimization, parallel databases, data warehousing, and distributed and semistructured databases. He has several publications in journals and conferences and is the author of a book chapter on DB2. He has refereed several journal and conference submissions and is a member of the ACM SIGMOD group.

1. This work was done while the author was at IBM T.J. Watson Research Center.
