Statistical treatment of the information content of a database

doi:10.1016/0306-4379(86)90029-3

Information Systems

Volume 11, Issue 3, 1986, Pages 211-223

https://doi.org/10.1016/0306-4379(86)90029-3

Abstract

Statistical analysis of database contents is usually performed with software packages that require database attributes to be coded numerically. Unfortunately, statistics computed from coded attribute values may be meaningless; this is the case when the attribute values have no intrinsic order.
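The problem with numerical coding can be illustrated with a small sketch. The attribute values, the two codings, and the helper names `coded_mean` and `modal_value` are hypothetical and chosen only for illustration; they do not come from the paper. The mean of the coded values changes when the arbitrary coding changes, whereas a frequency-based statistic such as the mode does not.

```python
from collections import Counter
from statistics import mean

# Hypothetical nominal attribute: the values have no intrinsic order.
colours = ["red", "green", "blue", "blue"]

# Two equally arbitrary numerical codings of the same values.
coding_1 = {"red": 1, "green": 2, "blue": 3}
coding_2 = {"blue": 1, "red": 2, "green": 3}

def coded_mean(values, coding):
    """Mean of the numeric codes; depends entirely on the coding chosen."""
    return mean(coding[v] for v in values)

def modal_value(values):
    """Most frequent value; uses only frequencies, so no coding is needed."""
    return Counter(values).most_common(1)[0][0]

print(coded_mean(colours, coding_1))  # mean under the first coding
print(coded_mean(colours, coding_2))  # a different mean for the same data
print(modal_value(colours))           # unaffected by any coding
```

The two means differ although the data are identical, which is the sense in which such statistics may be meaningless for unordered attributes.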

We present an analytical data model in which the information content of a database relation is represented by a contingency table and analysed using the methods of multivariate information theory. These quantitative tools of analysis benefit, first, the database user who is interested in a statistical view of the database contents and inclined to pose queries such as “to what extent are attributes related (in a given database state)?” or “how does one attribute depend on the others (in a given database state)?”. A second application, only sketched here, is the measurement of record selectivities for queries, with a view to evaluating the performance of the physical database organization.
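As a rough illustration of the two kinds of query, the sketch below builds contingency tables from a toy relation and computes standard Shannon quantities on them. It is not the paper's own formulation: the relation, the attribute names, and the functions `contingency`, `mutual_information` and `dependence` are assumptions made for this example. Mutual information stands in for "to what extent are attributes related", and the fraction of an attribute's entropy removed by knowing the others stands in for "how does one attribute depend on the others".

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a table of cell counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def contingency(rows, attrs):
    """Contingency table of a relation over the given attributes."""
    return Counter(tuple(row[a] for a in attrs) for row in rows)

def mutual_information(rows, a, b):
    """I(a; b) = H(a) + H(b) - H(a, b), in the current database state."""
    h_a = entropy(contingency(rows, [a]).values())
    h_b = entropy(contingency(rows, [b]).values())
    h_ab = entropy(contingency(rows, [a, b]).values())
    return h_a + h_b - h_ab

def dependence(rows, target, others):
    """Fraction of H(target) removed by knowing `others`:
    1 - H(target | others) / H(target), a value between 0 and 1."""
    h_t = entropy(contingency(rows, [target]).values())
    if h_t == 0:
        return 0.0  # constant target: convention adopted for this sketch
    h_o = entropy(contingency(rows, others).values())
    h_all = entropy(contingency(rows, others + [target]).values())
    return 1.0 - (h_all - h_o) / h_t

# Hypothetical relation: city is determined by dept, grade is unrelated to it.
rows = [
    {"dept": "sales", "city": "Rome", "grade": "A"},
    {"dept": "sales", "city": "Rome", "grade": "B"},
    {"dept": "R&D", "city": "Milan", "grade": "A"},
    {"dept": "R&D", "city": "Milan", "grade": "B"},
]

print(mutual_information(rows, "dept", "city"))   # fully related attributes
print(mutual_information(rows, "dept", "grade"))  # unrelated attributes
print(dependence(rows, "city", ["dept"]))         # city fully depends on dept
```

All quantities are computed from value frequencies only, so no numerical coding or ordering of the attribute values is required.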


Cited by (34)

On the similarity metric and the distance metric
2009, Theoretical Computer Science
Similarity and dissimilarity measures are widely used in many research areas and applications. When a dissimilarity measure is used, it is normally required to be a distance metric. However, when a similarity measure is used, there is no formal requirement. In this article, we have three contributions. First, we give a formal definition of similarity metric. Second, we show the relationship between similarity metric and distance metric. Third, we present general solutions to normalize a given similarity metric or distance metric.
The implication problem for measure-based constraints
2008, Information Systems
We study the implication problem of measure-based constraints. These constraints are formulated in a framework for measures generalizing that for mathematical measures. Measures arise naturally in a wide variety of domains. We show that measure constraints, for particular measures, correspond to constraints that occur in relational databases, data mining applications, cooperative game theory, and in the Dempster–Shafer and possibility theories of reasoning about uncertainty. We prove that the implication problem for measure constraints is in general decidable. We introduce inference systems for particular classes of measure constraints and show that some of these are complete, yielding tractability for the corresponding implication problem.
On approximation measures for functional dependencies
2004, Information Systems
We examine the issue of how to measure the degree to which a functional dependency (FD) is approximate. The primary motivation lies in the fact that approximate FDs represent potentially interesting patterns existent in a table. Their discovery is a valuable data mining problem. However, before algorithms can be developed, a measure must be defined quantifying their approximation degree.
First we develop an approximation measure by axiomatizing the following intuition: the degree to which X→Y is approximate in a table T is the degree to which T determines a function from Π_X(T) to Π_Y(T). We prove that a unique unnormalized measure satisfies these axioms up to a multiplicative constant. Next we compare the measure developed with two other measures from the literature. In all but one case, we show that the measures can be made to differ as much as possible within normalization. We examine these measures on several real datasets and observe that many of the theoretically possible extreme differences do not bear themselves out. We offer some conclusions as to particular situations where certain measures are more appropriate than others.
Why is the snowflake schema a good data warehouse design?
2003, Information Systems
Citation excerpt:
Although it is tempting to extend SSNF to cyclic database schemas, it is an open problem to what extent our results will generalise in such cases. We now utilise the information-theoretic treatment of relational databases developed in [17–20]. This approach is important since it allows us to accommodate for probabilistic information in the data warehouse, which is fundamental in decision making [21].
Database design for data warehouses is based on the notion of the snowflake schema and its important special case, the star schema. The snowflake schema represents a dimensional model which is composed of a central fact table and a set of constituent dimension tables which can be further broken up into subdimension tables. We formalise the concept of a snowflake schema in terms of an acyclic database schema whose join tree satisfies certain structural properties. We then define a normal form for snowflake schemas which captures its intuitive meaning with respect to a set of functional and inclusion dependencies. We show that snowflake schemas in this normal form are independent as well as separable when the relation schemas are pairwise incomparable. This implies that relations in the data warehouse can be updated independently of each other as long as referential integrity is maintained. In addition, we show that a data warehouse in snowflake normal form can be queried by joining the relation over the fact table with the relations over its dimension and subdimension tables. We also examine an information-theoretic interpretation of the snowflake schema and show that the redundancy of the primary key of the fact table is zero.
A note on approximation measures for multi-valued dependencies in relational databases
2003, Information Processing Letters
We consider the problem of defining a normalized approximation measure for multi-valued dependencies in relational database theory. An approximation measure is a function mapping relation instances to real numbers. The number to which an instance is mapped, intuitively, describes the strength of the dependency in that instance. A normalized approximation measure for functional dependencies has been proposed previously: the minimum number of tuples that need be removed for the functional dependency to hold divided by the total number of tuples. This leads naturally to a normalized measure for multi-valued dependencies: the minimum number of tuples that need be removed for the multi-valued dependency to hold divided by the total number of tuples.
The measure for functional dependencies can be computed efficiently, O(|r|log(|r|)) where |r| is the relation instance. However, we show that an efficient algorithm for computing the analogous measure for multi-valued dependencies is not likely to exist. A polynomial time algorithm for computing the measure would lead to a polynomial time algorithm for an NP-complete problem (proven by a reduction from the maximum edge biclique problem in graph theory). Hence, we argue that it is not a good measure. We propose an alternate measure based on the lossless join characterization of multi-valued dependencies. This measure is efficiently computable, O(|r|²).
A unique formal system for binary decompositions of database relations, probability distributions, and graphs
1992, Information Sciences
It is shown that the formal system for binary decompositions of database relations applies successfully to binary decompositions of probability distributions and graphs too. As an application, the problem of determining the binary decompositions of a given information system (no matter if a relation, a probability distribution, or a graph) is discussed, and a number of search procedures using the derivation rules of the proposed formal system are presented and compared on a sample relation.
