A characterization of hierarchical computable distance functions for data warehouse systems

doi:10.1016/j.dss.2014.03.011

Decision Support Systems

Volume 62, June 2014, Pages 144-157

https://doi.org/10.1016/j.dss.2014.03.011

Highlights

• A characterization of hierarchical and categorical attributes
• A characterization of hierarchical computable distances
• A probabilistic model that defines the discriminant capabilities of the distance functions
• A set of experimental results that provide an empirical evaluation of the proposed model

Abstract

A data warehouse is a huge multidimensional repository used for statistical analysis of historical data. In a data warehouse, events are modeled as multidimensional cubes whose cells store numerical indicators, while dimensions describe the events from different points of view. Dimensions are typically described at different levels of detail through hierarchies of concepts. Computing the distance/similarity between two cells has several applications in this domain. In this context, distance is typically based on the least common ancestor between attribute values, but the effectiveness of such distance functions varies according to the structure and number of the hierarchies involved. In this paper we propose a characterization of hierarchy types based on their structure and expressiveness, provide a characterization of the different types of distance functions, and verify their effectiveness on different types of hierarchies in terms of their intrinsic discriminant capacity.

Introduction

Effectively measuring the similarity, or symmetrically the distance, between objects is a generic research issue whose solution changes depending on the characteristics of the data involved. In this paper we focus on distance functions for categorical and hierarchical attributes. On the one hand, categorical data are intrinsically unordered, and this limits the possibility of defining an effective distance measure [1]. On the other hand, the presence of hierarchies of concepts enriches the description of the objects and provides a tool for partially restoring an ordering between them.

Categorical and hierarchical attributes are particularly relevant since they are one of the building blocks of multidimensional data spaces, where a single data object is described by several of these attributes. We will refer to such data spaces as Hierarchical Non-Ordered Discrete Data Spaces (HNODDSs), extending the acronym NODDS coined in [2]. Similarity search for HNODDSs is becoming increasingly important in several application domains such as multimedia information retrieval, statistical data analysis, scientific databases and data mining [3]. In particular, HNODDSs are at the core of data warehouses, which are huge multidimensional repositories used for statistical analysis of historical data [4]. In a data warehouse, events are modeled as multidimensional cubes whose cells store numerical indicators, while dimensions describe the events from different points of view. For example, a SALE cube would store the quantity sold and the corresponding sale amount; each sale would be defined by a CITY, a PRODUCT and a DATE. All the attributes are categorical, except DATE. Fig. 1 shows an example for the CITY hierarchy. Identifying the distance between a pair of cube cells/events has several applications. For example, a user would benefit from automatically retrieving events that are similar to those she is currently browsing. On the other hand, the identification of events that are very dissimilar from all the others (i.e. outliers) would be very useful to both end-users (e.g. detection of anomalous behaviors) and system administrators (e.g. detection of erroneous data during the data warehouse ETL).

Distance functions for hierarchical data have been proposed in many different contexts. [5] defines a set of measures for assessing the distance between words by exploiting a taxonomy of concepts, and [6] shows that these methods work well on a large set of taxonomies of medical terms. [7] analyzes different similarity criteria and tests them in the area of data warehousing on user-labeled data to understand which one best matches the human perception of similarity. All these papers state that, when categorical and hierarchical attributes are involved, the least common ancestor (LCA) between values at the lowest level of the hierarchy plays a crucial role in defining a user-meaningful distance function. [7] uses the term hierarchical computable for this type of distance. Note that distances in this class do not take into account information coming from a corpus (e.g. the frequency of a particular city in the data set). All these papers measure the effectiveness of hierarchical computable distances a posteriori, typically through manual tagging of the results. The authors also discuss the weaknesses of their measures, but limit the discussion to empirical considerations, failing to provide a well-founded answer due to the lack of an analytic model.

The effectiveness of hierarchical computable distances depends both on the features of the LCA used to define them (e.g. the level of the LCA, the distance from it, or the number of data objects subsumed by the LCA) and on the hierarchy structure. In particular, the quantity of information coded in a hierarchy, and consequently the level of precision of a hierarchical computable distance, changes with its depth and size.
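As a minimal sketch of the LCA features listed above, the Python snippet below computes, for two values at the lowest level of a hierarchy, their LCA, the number of roll-up steps needed to reach it, and the number of lowest-level values it subsumes. The city/region/country hierarchy and all of its values are hypothetical (they are not taken from Fig. 1), the function names are illustrative, and lowest-level values stand in for data objects.

```python
# Hypothetical hierarchy: city -> region -> country -> ALL.
# PARENT maps each value to its parent at the next, coarser level.
PARENT = {
    "Bologna": "Emilia-Romagna",
    "Modena": "Emilia-Romagna",
    "Milan": "Lombardy",
    "Lyon": "Auvergne-Rhone-Alpes",
    "Emilia-Romagna": "Italy",
    "Lombardy": "Italy",
    "Auvergne-Rhone-Alpes": "France",
    "Italy": "ALL",
    "France": "ALL",
}

# Lowest-level values are those that are nobody's parent.
LEAVES = sorted(set(PARENT) - set(PARENT.values()))


def ancestors(value):
    """Return [value, parent, grandparent, ..., root]."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def lca_features(u, v):
    """Return (lca, steps, subsumed) for two lowest-level values.

    Assumes all branches have the same depth, so the two ancestor
    chains can be compared position by position.
    """
    for steps, (a, b) in enumerate(zip(ancestors(u), ancestors(v))):
        if a == b:
            subsumed = sum(1 for leaf in LEAVES if a in ancestors(leaf))
            return a, steps, subsumed
    raise ValueError("values do not share a common ancestor")


if __name__ == "__main__":
    for pair in [("Bologna", "Modena"), ("Bologna", "Milan"), ("Bologna", "Lyon")]:
        print(pair, lca_features(*pair))
```

Any of the three returned features grows as the LCA becomes more general, so each could serve as the basis of a distance; which one discriminates best is exactly the kind of question the characterization in this paper addresses.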

When multidimensional objects are involved, a clear understanding of the characteristics that influence the effectiveness of distance functions is even more crucial, since the presence of several hierarchies, possibly with different characteristics, could make correctly capturing similarity more complex or even impossible. As shown in the paper, HNODDSs are subject to the so-called curse of dimensionality; thus it is important to know to what extent similarity queries (e.g. range and nearest neighbor queries) make sense.

To the best of our knowledge, no paper in the literature proposes a precise characterization of the effectiveness and limitations of hierarchical computable distances in terms of the structure and size of the hierarchy. In this paper we take a first relevant step in this direction by providing:

• a characterization of hierarchical and categorical attributes according to the structure of their hierarchies (see Section 4);
• a characterization of hierarchical computable distances based on the type of information they consider (see Section 4);
• a probabilistic model that analytically defines the discriminant capabilities of the distance functions. The model works in the mono-dimensional case (see Section 5.1) as well as in the multi-dimensional one (see Section 5.2). Together with the model, some indicators are provided to evaluate the discriminant capacity of hierarchies, distance functions and HNODDSs: some of them are original, while others are enabled in HNODDSs by the model (e.g. Intrinsic dimensionality);
• a set of experimental results, carried out on both real and synthetic data sets, that provide an empirical evaluation of the effectiveness and efficiency of hierarchical computable distances, both for a single categorical and hierarchical attribute and for HNODDSs (see Section 6).

Our contributions are valuable tools for both practitioners and researchers involved in defining similarity functions or in designing hierarchy structures. Indeed, no techniques are currently available to estimate a priori the capabilities and limits of a given similarity measure when applied to a given categorical hierarchy or a given HNODDS. All evaluations are carried out a posteriori through subjective user feedback. Conversely, the characterization we propose defines a formal framework for such evaluation, gives some rules of thumb for coupling measures and hierarchies, and provides indicators (original and non-original) for evaluating, at design time and on an objective basis, the discriminant capacity of distance functions and hierarchies.

Section snippets

Related literature

The definition of similarity and distance functions is a wide research area that covers several application domains ranging from information retrieval to multimedia applications. Each domain requires a specific definition of distance that should exploit the characteristics of the involved data and should be meaningful for the application users as well. When data are numeric, distances can be derived starting from the classic Euclidean distance function or the Minkowski one that generalizes it.

Basic definitions and background

In this section we introduce a basic formal setting to manipulate objects in HNODDSs.

Hierarchy. Given a set $A = \{a_0, \ldots, a_l\}$ of hierarchical computable attributes (HCAs), each defined on a categorical domain $Dom(a_k)$, a hierarchy $h$ is defined by (1) a roll-up total order $\succeq_h$ of $A$; and (2) a family of roll-up functions including a function $Roll_{a_k}^{a_i} : Dom(a_k) \to Dom(a_i)$ for each pair of attributes $a_k$ and $a_i$ such that $a_k \succeq_h a_i$.

The top attribute of the order is denoted by $dim = a_k \in A \mid a_k \succeq_h a_i \; \forall a_i \in A$ and determines the finest
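The definition of a hierarchy given in this section can be illustrated with a minimal sketch under assumed names: the attributes CITY, REGION and COUNTRY and their values are hypothetical, and `roll` is an illustrative helper rather than notation from the paper. Attributes are listed from finest to coarsest to represent the roll-up total order, and the roll-up function between any ordered pair of attributes is obtained by composing the stored one-step mappings.

```python
# Attributes in roll-up order, finest first (hypothetical example).
ORDER = ["CITY", "REGION", "COUNTRY"]

# One-step roll-up mappings between consecutive attributes.
STEP = {
    ("CITY", "REGION"): {
        "Bologna": "Emilia-Romagna",
        "Modena": "Emilia-Romagna",
        "Milan": "Lombardy",
    },
    ("REGION", "COUNTRY"): {
        "Emilia-Romagna": "Italy",
        "Lombardy": "Italy",
    },
}


def roll(value, source, target):
    """Map `value` in Dom(source) to its ancestor in Dom(target).

    Defined only when `source` precedes (or equals) `target` in ORDER;
    rolling an attribute up to itself is the identity.
    """
    i, j = ORDER.index(source), ORDER.index(target)
    if i > j:
        raise ValueError(f"{source} does not roll up to {target}")
    for k in range(i, j):
        value = STEP[(ORDER[k], ORDER[k + 1])][value]
    return value


if __name__ == "__main__":
    print(roll("Bologna", "CITY", "REGION"))
    print(roll("Bologna", "CITY", "COUNTRY"))
```

Storing only the mappings between consecutive attributes is a design choice of this sketch; it keeps the family of roll-up functions consistent by construction, since every multi-step function is a composition of the same one-step mappings.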

Distance characterization

The type of hierarchy defined in Section 3 encodes an IS-A taxonomy with single inheritance and the same depth for all branches. A hierarchy encodes semantic relationships between its concepts: the higher the level, the more general the concepts. The basic idea of hierarchical computable distances is that, given two concepts, the more general the concept that subsumes them, the more distant they will be. According to this intuition, and given that the most specific subsumer of two concepts is their

An analytical model for hierarchically computable distances

The discriminant capacity of a distance function is bounded by the number of distinct distances it can determine and by the distribution of such values. For example, using a Minkowski distance in a multidimensional Euclidean space with $n$ objects, we can have up to $\frac{n \times (n - 1)}{2} + 1$ distinct distances. This is not the case for hierarchical computable distances (HCDs), since the information stored in the hierarchy does not always provide detailed information about the distances between objects. Distance distribution also affects similarity
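To make the contrast in the number of distinct distances concrete, the following sketch counts distinct distance values in two toy settings. It assumes, purely for illustration, a hierarchical distance equal to the number of roll-up steps needed to reach the LCA in a balanced hierarchy; this is one possible HCD, not necessarily one analyzed in the paper, and the fan-out, depth and point coordinates are invented. Under this assumption the number of distinct values is bounded by the hierarchy depth plus one, however many lowest-level values there are.

```python
from itertools import combinations, product
import math


def lca_steps(u, v):
    """Roll-up steps from two leaves to their LCA.

    A leaf is a tuple of branch choices from the root down, so the LCA
    corresponds to the longest common prefix of the two tuples.
    """
    common = 0
    for a, b in zip(u, v):
        if a != b:
            break
        common += 1
    return len(u) - common


def distinct_hcd_values(fanout, depth):
    """Return (number of leaves, number of distinct distance values)."""
    leaves = list(product(range(fanout), repeat=depth))
    values = {lca_steps(u, v) for u, v in combinations(leaves, 2)}
    values.add(0)  # distance of an object from itself
    return len(leaves), len(values)


def distinct_euclidean_values(points):
    """Number of distinct Euclidean distances, including zero."""
    values = {math.dist(p, q) for p, q in combinations(points, 2)}
    values.add(0.0)
    return len(values)


if __name__ == "__main__":
    # 27 leaves, yet only depth + 1 = 4 distinct distances.
    print(distinct_hcd_values(fanout=3, depth=3))
    # 4 points reach the upper bound 4 * 3 / 2 + 1 = 7.
    print(distinct_euclidean_values([(0, 0), (1, 0), (0, 3), (7, 7)]))
```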

Empirical results and discussion

In this section we complete our analysis through a set of experiments aimed at:

• appraising, independently of our statistical model, the ability of HCDs to capture object similarity (6.1 HCD effectiveness, 6.2 Discriminant capacity in HNODDS with non-uniformly distributed data);
• evaluating the capability of our model and of the related indicators (i.e. IDC, Intrinsic dimensionality, Curse of dimensionality) to measure such ability (6.1 HCD effectiveness, 6.3 Range query effectiveness);
• analyzing

Conclusions

In this paper we analyzed the discriminant capabilities of the family of hierarchical computable distances when applied to a single hierarchical and categorical attribute or to a multi-dimensional HNODDS. The evidence obtained from the proposed analytical model and the empirical experiments shows that hierarchical computable distances properly model the distance between data objects, but their discriminant capacity is quickly reduced as dimensionality increases. This problem arises much

Acknowledgment

Thanks to Paolo Ciaccia for his comments and discussion of these ideas.

References (35)

T. Pedersen et al. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics (2007)
S. Lin et al. An outlier-based data association method for linking criminal incidents. Decision Support Systems (2006)
V. Pestov. An axiomatic approach to intrinsic dimension of a dataset. Neural Networks (2008)
C.C. Hsu et al. Incremental clustering of mixed data based on distance hierarchy. Expert Systems with Applications (2008)
P. Buneman. A note on the metric properties of trees. Journal of Combinatorial Theory, Series B (August 1974)
S. Boriah et al. Similarity measures for categorical data: a comparative evaluation
G. Qian et al. Dynamic indexing for multidimensional non-ordered discrete data spaces using a data-partitioning approach. ACM Transactions on Database Systems (2006)
P. Zezula et al. Similarity Search: The Metric Space Approach (2002)
M. Golfarelli et al. Data Warehouse Design: Modern Principles and Methodologies (2009)
Y. Li et al. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering (2003)
E. Baikousi et al. Similarity measures for multidimensional data
P. Tan et al. Introduction to Data Mining (2005)
T. Skopal et al. On nonmetric similarity search problems in complex domains. ACM Computing Surveys (October 2011)
E. Eskin et al. A Geometric Framework for Unsupervised Anomaly Detection (2002)
K.S. Jones. A statistical interpretation of the term specificity and its application in retrieval. Journal of Documentation (1972)
D. Goodall. A new similarity index based on probability. Biometrics (1966)
P. Ganesan et al. Exploiting hierarchical domain structure to compute similarity. ACM Transactions on Information Systems (2003)

Matteo Golfarelli received his Ph.D. for his work on autonomous agents in 1998. In 2000 he joined the University of Bologna as a researcher. Since 2005 he has been an Associate Professor, teaching Information Systems, Database Systems and Data Mining. He has published over 80 papers in refereed journals and international conferences in the fields of data warehousing, pattern recognition, mobile robotics and multi-agent systems. He is co-author of the book Data Warehouse Design: Modern Principles and Methodologies. He has served on the program committees of several international conferences and as a reviewer for journals. He was co-chair of DOLAP 2012, is permanent co-chair of the Business Information System conference, and is a member of the editorial board of the International Journal of Data Mining, Modelling and Management (IJDMMM). His current research interests include all aspects of business intelligence and data warehousing, in particular multidimensional modeling, Business Intelligence on Social Data and Data Mining.

Elisa Turricchia received her degree cum laude in Computer Science from the University of Bologna, Italy, in March 2009, presenting a thesis about interoperability issues among heterogeneous data warehouse systems. In 2012 she received her Ph.D. for her work on Pervasive Business Intelligence. Her current research interests include the study of methods for expressing and executing OLAP preference queries and for managing distributed data warehouses. She has published 8 papers in refereed journals and international conferences on these topics.
