Information Sciences

Volume 181, Issue 13, 1 July 2011, Pages 2608-2625

Minimal perfect hashing: A competitive method for indexing internal memory

https://doi.org/10.1016/j.ins.2009.12.003

Abstract

A perfect hash function (PHF) is an injective function that maps keys from a set S to unique values. Since no collisions occur, each key can be retrieved from a hash table with a single probe. A minimal perfect hash function (MPHF) is a PHF with the smallest possible range, that is, the hash table size is exactly the number of keys in S. MPHFs are widely used for memory-efficient storage and fast retrieval of items from static sets. Unlike other hashing schemes, MPHFs completely avoid wasted space and the wasted time spent dealing with collisions. Until recently, the space needed to store an MPHF description in practical implementations found in the literature was O(log n) bits per key, and therefore similar to the space overhead of other hashing schemes. Recent results on MPHFs presented in the literature changed this scenario: an MPHF can now be described in approximately 2.6 bits per key.

The objective of this paper is to show that MPHFs are, after these recent results, a good option to index internal memory when static key sets are involved and both successful and unsuccessful searches are allowed. We show that MPHFs provide the best tradeoff between space usage and lookup time when compared with other open addressing and chaining hash schemes, such as linear hashing, quadratic hashing, double hashing, dense hashing, cuckoo hashing, sparse hashing, hopscotch hashing, chaining with move to front heuristic and exact fit. We considered lookup time for successful and unsuccessful searches in two scenarios: (i) the MPHF description fits in the CPU cache and (ii) the MPHF description does not fit entirely in the CPU cache. Considering lookup time, minimal perfect hashing outperforms the other hashing schemes in both scenarios and, in the first scenario, its performance is better even when the compared methods leave more than 80% of the hash table entries free. Considering space overhead (the amount of space used other than the key-value pairs), the overhead of minimal perfect hashing is a factor of O(log n) lower than that of the other hashing schemes in both scenarios.

Introduction

In this paper we study data structures that are suitable for indexing internal memory efficiently in terms of both space and lookup time, especially when memory-intensive applications are involved. An important point to account for while designing data structures to handle large volumes of data is the memory latency bottleneck [11], [27]. It has arisen because processor speeds have improved at a pace that memory chips cannot match, since memory chips are optimized for capacity rather than speed to keep their cost affordable. This bottleneck can heavily hurt the performance of basic database system operations when the data structures involved in their implementation ignore the actual memory hierarchy and are not designed to be cache-conscious [29].

In this paper we are interested in applications where a key set is fixed for a given period of time. This is the case, for example, in the cache-conscious implementation of basic database operations like equi-joins, as described in [29]. This also happens in search engine applications, which use extensive preprocessing of data to allow very fast evaluation of queries. More formally, given a static key set S ⊆ U of size n, taken from a key universe U of size u, where each key may be associated with satellite data, the question we are interested in is: what are the data structures that provide the best tradeoff between space usage and lookup time?

An efficient way to represent a vocabulary in terms of lookup time is to use a table indexed by a hash function. A hash function h: U → M is a function that maps the keys from U to a given interval of integers M = [0, m-1] = {0, 1, …, m-1}. Considering S ⊆ U and given a key x ∈ S, the hash function h computes an integer in [0, m-1] for the storage or retrieval of x in a hash table. Hashing methods for non-static key sets can be used to construct data structures storing S and supporting membership queries of the type “x ∈ S?” in expected time O(1). However, they involve a certain amount of wasted space owing to unused locations in the table and wasted time to resolve collisions when two keys are hashed to the same table location. The efficiency of a search for a given key k ∈ S in the traditional hashing techniques depends mainly on the hash table load factor α = n/m (i.e., the ratio between the number of items and the number of entries in the hash table).
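To make the dependence on the load factor explicit, the textbook expected-cost bounds for chaining under the simple uniform hashing assumption (a standard result, not one derived in this paper) are:

% Expected number of probes as a function of the load factor alpha = n/m
\[
  \alpha = \frac{n}{m}, \qquad
  \mathbb{E}[\text{probes, unsuccessful search}] = \Theta(1+\alpha), \qquad
  \mathbb{E}[\text{probes, successful search}] = \Theta(1+\alpha).
\]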

For scenarios where the key set is fixed for a given period of time, we might consider it as a static key set. Hence, it is possible to compute a function as part of the preprocessing phase to find any key in a table in one probe; such hash functions are called perfect. Given a key set S, we shall say that a hash function h: U → M is a perfect hash function (PHF) for S if h is an injection on S, which means that there are no collisions among the keys in S: if x and y are in S and x ≠ y, then h(x) ≠ h(y). Since no collisions occur, each key can be retrieved from the table with a single probe. If m = n (the table has the same size as S), then h is a minimal perfect hash function (MPHF). MPHFs completely avoid the problem of wasted space and time. As observed in [29], MPHFs also avoid cache misses that arise due to collision resolution schemes like open addressing and chaining [25].
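As a minimal sketch of these definitions (illustrative only; the checker and the candidate function h are not part of the paper), the following C++ fragment tests whether a given function is perfect for a static set S, and minimal when m = |S|:

// Checks the defining property of a PHF: h is injective on S and maps into
// [0, m-1].  When m == |S|, a perfect h is also minimal.
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

bool is_perfect_for(const std::vector<std::string>& S, std::size_t m,
                    const std::function<std::size_t(const std::string&)>& h) {
    std::set<std::size_t> used;
    for (const auto& x : S) {
        std::size_t v = h(x);
        if (v >= m) return false;                 // value outside [0, m-1]
        if (!used.insert(v).second) return false; // collision: h(x) == h(y), x != y
    }
    return true;                                  // h is injective on S
}

bool is_minimal_perfect_for(const std::vector<std::string>& S,
                            const std::function<std::size_t(const std::string&)>& h) {
    return is_perfect_for(S, S.size(), h);        // minimal: range size m == n
}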

Some types of databases are updated only rarely, typically by periodic batch updates. This is true, for example, for most data warehousing applications (see [39] for more examples and discussion). Another important phenomenon was the popularization of the web, which created new challenges related to the huge growth in data volume and the need to process it in order to extract useful information. Search engines are responsible for collecting, representing, processing and disseminating information according to users' information needs. Besides the quality of the information provided, it is essential to satisfy these information needs efficiently, despite the huge amount of data to be processed and the large number of users issuing queries at all times. In such scenarios, the construction of efficient data structures that are able to represent large volumes of data in a compact way is mandatory.

The Monet project [29] is a good example to illustrate the importance of this study for the database community. This project has built a database system based on vertical partitioning to carry out basic operations seven times faster than systems based on record or row fragmentation. The performance gains come from the fact that vertical fragmentation allows cache-conscious data structures to be built to implement the basic database operations. Perfect hash functions play an important role in the case of equi-joins, as they allow the most efficient implementation of the vertically partitioned hash-join algorithm, which was shown to be the fastest one for equi-joins. According to [29, Section 4.3.3], before using perfect hash functions they were storing 4 tuples in each bucket of the hash tables. By using a perfect hash function, the targeted hash-bucket size was reduced from 4 tuples to just 1 tuple. As a consequence, hash lookups no longer follow a bucket chain, which would otherwise cause too many cache and translation lookaside buffer (TLB) misses. We refer the reader to [29] for more details on the hash-join algorithm implementation with PHFs, as it goes beyond the scope of this paper.
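The following sketch illustrates the idea only (it is not MonetDB's code; the build relation, probe relation and key-to-slot map are invented for illustration, and an explicit map stands in for a real PHF): with a perfect hash over the static build keys, each bucket holds exactly one tuple and every probe is a single array access with no chain to follow.

// Illustrative equi-join sketch.  The "phf" below is a stand-in (an explicit
// key -> slot map over the static build keys); a real system would use a
// compact perfect hash function instead.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { int key; std::string payload; };

int main() {
    std::vector<Tuple> build = {{7, "a"}, {3, "b"}, {42, "c"}};   // build side
    std::vector<Tuple> probe = {{3, "x"}, {5, "y"}, {42, "z"}};   // probe side

    // Build phase: one slot per key, bucket size exactly 1.
    std::unordered_map<int, std::size_t> phf;      // stand-in for a PHF
    std::vector<Tuple> table(build.size());
    for (std::size_t i = 0; i < build.size(); ++i) {
        phf[build[i].key] = i;
        table[i] = build[i];
    }

    // Probe phase: a single lookup per probe tuple, no bucket chain to follow.
    std::vector<std::pair<Tuple, Tuple>> result;
    for (const Tuple& t : probe) {
        auto it = phf.find(t.key);
        if (it != phf.end() && table[it->second].key == t.key)
            result.emplace_back(table[it->second], t);
    }
    // result now holds the matches for keys 3 and 42.
    return 0;
}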

The objective of this paper is to show that MPHFs provide the best tradeoff between space usage and lookup time when compared to other hashing schemes. This was not the case in the past, because the space overhead to store MPHFs was O(log n) bits per key for practical algorithms [12], [28]. However, the new MPHFs by Botelho, Pagh and Ziviani [7] require approximately 2.6 bits per key of space to describe the function and can be evaluated in O(1) time. We compare these new MPHFs, whose description takes O(1) bits per key, with open addressing and chaining methods, whose space overhead is O(log n) bits per key.
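As an illustrative back-of-the-envelope comparison (the key set size n = 10^7 is chosen only for this example and is not a figure from the paper), the per-key overheads differ by roughly an order of magnitude:

% Per-key space: constant-size MPHF description vs. an O(log n)-bit pointer/index
\[
  \underbrace{2.6 \ \text{bits/key}}_{\text{MPHF of [7]}}
  \quad\text{versus}\quad
  \log_2 n = \log_2 10^{7} \approx 23.3 \ \text{bits/key}.
\]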

In our prior work presented in [9] we have shown that the minimal perfect hash functions presented in [7] outperform open addressing techniques in terms of lookup time and space overhead when indexing static key sets in internal memory. In this paper we extend those results in two aspects. First, we have designed an optimization of the MPHFs that considerably improves their lookup time performance, which is presented in Section 4. Second, we have surveyed the main hashing schemes available in the literature and added four other methods to our comparative study. Thus, in this paper we consider six open addressing and three chaining hashing structures to compare with our optimized minimal perfect hash functions.

To compare MPHFs with other open addressing and chaining hash schemes, we considered lookup time for successful and unsuccessful searches in two scenarios: (i) the MPHF description fits in the CPU cache and (ii) the MPHF description cannot be entirely placed in the CPU cache. We show that the other hashing schemes cannot outperform minimal perfect hashing in terms of lookup time in either scenario, even when their hash table occupancy is lower than 20%. An MPHF requiring just 2.6 bits of storage per key allows sets on the order of 10 million keys to be indexed with a function description that fits in a 4 MB CPU cache, which is enough for a large range of applications. In both scenarios, the space overhead of minimal perfect hashing is a factor of O(log n) lower than that of the other hashing schemes.
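A quick check of the cache-fit claim (a worked example, not a measurement from the paper): for n = 10^7 keys at about 2.6 bits per key, the MPHF description occupies roughly

% 2.6 bits/key over 10^7 keys, converted to mebibytes
\[
  \frac{2.6 \times 10^{7}\ \text{bits}}{8 \times 2^{20}\ \text{bits per MB}} \approx 3.1\ \text{MB},
\]

which indeed fits in a 4 MB L2 cache.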

This paper is organized as follows. Section 2 presents a survey of the main open addressing, chaining and minimal perfect hashing methods available in the literature. Section 3 gives an overview of the hashing data structures used throughout this paper. Section 4 presents the minimal perfect hashing scheme considered in this paper. Sections 5 and 6 present the open addressing and chaining techniques, respectively, that we compare with minimal perfect hashing. Section 7 presents the experimental results. Finally, Section 8 concludes the paper.

Section snippets

Related work

Since the origin of hashing, proposed by Luhn in 1953 [25], collision avoidance has been one of its main challenges. This is well illustrated by the birthday paradox, which says that if 23 or more people are grouped together at random, the probability that at least two people have a common birthday exceeds 50%, as can be seen in Feller [15, p. 33]. There are two ways of facing the collision problem. The first one is to assume that collisions will happen, since the probability of finding a hash
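For concreteness, the standard calculation behind the birthday figure quoted above (a textbook computation, not taken from this paper) is:

% Probability that 23 uniformly random birthdays are all distinct
\[
  \Pr[\text{all 23 birthdays distinct}] \;=\; \prod_{i=1}^{22}\left(1-\frac{i}{365}\right) \approx 0.493,
\]

so the probability of at least one shared birthday is about 0.507, i.e., just above 50%.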

Overview of hashing data structures

In this section we describe the general data structures used by the minimal perfect hashing, open addressing, and chaining methods presented in Sections 4, 5 and 6, respectively. More optimized open addressing and chaining methods might use specific data structures derived from the ones presented in this section. This is the case for the open addressing methods described in Sections 5.5 and 5.6 (sparse

Minimal perfect hashing

The minimal perfect hash function h: S → [0, n-1] used to index the hash table presented in Fig. 1(a) is taken from the family of MPHFs proposed in [7]. The MPHFs are generated based on random r-partite hypergraphs where each edge connects r ≥ 2 vertices. In our experiments we used a version that employs hypergraphs with r = 3, since it generates the fastest and most compact MPHFs. However, in order
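To convey the flavor of lookups in such a function, here is a schematic C++ sketch under stated assumptions: it assumes three stand-in hash functions into three disjoint vertex ranges, a per-vertex 2-bit array g produced at construction time, and a rank step over the assigned vertices. It follows the general shape of hypergraph-based MPHF evaluation; it is not the authors' implementation and it omits the construction algorithm entirely.

// Schematic evaluation of a hypergraph-based MPHF (assumptions, not the code
// from [7]).  The array g and the partition size are assumed to come from a
// construction phase that is not shown here.
#include <cstdint>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct SketchMphf {
    std::vector<uint8_t> g;   // one value in {0,1,2,3} per vertex (2 bits, stored bytewise here)
    uint32_t m_per_part;      // number of vertices in each of the 3 partitions

    // Stand-in hash functions h0, h1, h2 into three disjoint vertex ranges;
    // not the hash family used in [7].
    uint32_t h(int i, const std::string& key) const {
        std::size_t x = std::hash<std::string>{}(key + static_cast<char>('0' + i));
        return static_cast<uint32_t>(i) * m_per_part
             + static_cast<uint32_t>(x % m_per_part);
    }

    // PHF: the 2-bit values of the three vertices of the key's edge select
    // which of the three vertices represents the key.
    uint32_t phf(const std::string& key) const {
        uint32_t v[3] = { h(0, key), h(1, key), h(2, key) };
        int i = (g[v[0]] + g[v[1]] + g[v[2]]) % 3;
        return v[i];
    }

    // MPHF: the rank of the chosen vertex among "assigned" vertices
    // (those with g[u] != 3) compacts the range [0, 3*m_per_part) down to [0, n).
    uint32_t mphf(const std::string& key) const {
        uint32_t p = phf(key), r = 0;
        for (uint32_t u = 0; u < p; ++u)  // a practical implementation uses a
            if (g[u] != 3) ++r;           // precomputed rank/select structure
        return r;
    }
};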

Open addressing methods

In this section we present the most efficient open addressing techniques we know of. Sections 5.1, 5.2 and 5.3 present the linear hashing, quadratic hashing and double hashing methods, respectively. Section 5.4 presents the cuckoo hashing method. Section 5.5 presents the hopscotch hashing method, which exploits the advantages of the linear hashing, chaining and cuckoo hashing approaches. Finally, Section 5.6 presents the sparse hashing method, which uses
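As a minimal illustration of the probe loop these methods share (a generic sketch, not code from the paper), the following C++ fragment implements insertion and lookup with linear probing; quadratic and double hashing differ only in how the next probe position is computed:

// Linear probing over a table of optional slots.  Suitable for static sets
// (no deletions), which is the setting considered in this paper.
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

class LinearProbingTable {
public:
    explicit LinearProbingTable(std::size_t m) : slots_(m) {}

    bool insert(const std::string& key) {
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i] || *slots_[i] == key) { slots_[i] = key; return true; }
            i = (i + 1) % slots_.size();          // linear probe sequence
        }
        return false;                             // table full
    }

    bool contains(const std::string& key) const {
        std::size_t i = std::hash<std::string>{}(key) % slots_.size();
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i]) return false;         // empty slot: unsuccessful search
            if (*slots_[i] == key) return true;   // successful search
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    std::vector<std::optional<std::string>> slots_;
};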

Chaining methods

In this section we discuss two hashing techniques that are variations of the traditional chaining technique depicted in Fig. 1(c): chaining with move to front heuristic and exact fit.
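As a minimal sketch of the first variation (illustrative only, not the paper's implementation), a chain search with the move-to-front heuristic relinks the key it finds to the head of its bucket chain, so frequently queried keys are reached with fewer pointer chases:

// Chaining with the move-to-front heuristic on each bucket chain.
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

class MtfChainedTable {
public:
    explicit MtfChainedTable(std::size_t m) : buckets_(m) {}

    void insert(const std::string& key) {
        buckets_[index(key)].push_front(key);
    }

    bool contains(const std::string& key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (*it == key) {
                chain.splice(chain.begin(), chain, it);  // move the found key to the front
                return true;
            }
        }
        return false;                                    // unsuccessful search
    }

private:
    std::size_t index(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }
    std::vector<std::list<std::string>> buckets_;
};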

Experimental results

In this section we present the key sets used in the experiments and the results of the comparative study. All experiments were carried out on a computer running Linux version 2.6, with a 1.86 GHz 64-bit Intel Core 2 processor, 4 GB of main memory and a 4 MB L2 cache. All results presented are averages over 50 trials and were statistically validated with a confidence level of 95%. Table 1 summarizes the symbols and acronyms used throughout this section.

The linear hashing,

Conclusions

In this paper we have presented a thorough study of data structures that are suitable for indexing internal memory in an efficient way in terms of both space and lookup time when we have a key set that is fixed for a given period of time (i.e., a static key set) and each key is associated with satellite data. This setting is common in data warehousing and search engine applications (see [39] for other examples).

It is well known that an efficient way to represent a key set in terms of lookup time

Acknowledgments

This work was supported by the Brazilian National Institute of Science and Technology for the Web (Grant MCT/CNPq 573871/2008-6), Project InfoWeb (Grant MCT/CNPq/CT-INFO 550874/2007-0), CNPq Grant 305237/02-0 (Nivio Ziviani), CNPq Scholarship 132352/2009-5 (Guilherme Menezes), and by UOL (www.uol.com.br), through its UOL Bolsa Pesquisa program, process number 20090213161400. We also thank the reviewers for their comments, which helped us to considerably improve the paper.

References (45)

  • A. Bosselaers, R. Govaerts, J. Vandewalle, Fast hashing on the Pentium, in: Proceedings of the 16th Annual...
  • F. Botelho, R. Pagh, N. Ziviani, Simple and space-efficient minimal perfect hash functions, in: Proceedings of the 10th...
  • F. Botelho, N. Ziviani, External perfect hashing for very large key sets, in: Proceedings of the 16th ACM Conference on...
  • F.C. Botelho, H.R. Langbehn, G.V. Menezes, N. Ziviani, Indexing internal memory with minimal perfect hash functions,...
  • D. Burger, J.R. Goodman, A. Kägi, Memory bandwidth limitations of future microprocessors, in: Proceedings of the 23rd...
  • U. Erlingsson, M. Manasse, F. McSherry, A cool and practical alternative to traditional hash tables, in: Proceedings of...
  • W. Feller, An Introduction to Probability Theory and Its Applications (1968)
  • M.L. Fredman et al., On the size of separating systems and families of perfect hashing functions, SIAM Journal on Algebraic and Discrete Methods (1984)
  • H. Gao, J.F. Groote, W.H. Hesselink, Almost wait-free resizable hashtables, in: Proceedings of the 18th International...
  • H. Gao et al., Lock-free dynamic hash tables with open addressing, Distributed Computing (2005)
  • T. Hagerup, T. Tholey, Efficient minimal perfect hashing in nearly minimal space, in: Proceedings of the 18th Symposium...
  • C. Halatsis et al., Pseudochaining in hash tables, Communications of the ACM (1978)
1. This work was performed while the first author was an associate professor at the Department of Computer Engineering of the Federal Center for Technological Education of Minas Gerais, Belo Horizonte, Brazil, and an associate researcher at the Department of Computer Science of the Federal University of Minas Gerais, Belo Horizonte, Brazil.
