A nearly-optimal Fano-based coding algorithm

https://doi.org/10.1016/S0306-4573(03)00007-4

Abstract

Statistical coding techniques have been used for a long time in lossless data compression, using methods such as Huffman’s algorithm, arithmetic coding, Shannon’s method, Fano’s method, etc. Most of these methods can be implemented either statically or adaptively. In this paper, we show that although Fano coding is sub-optimal, it is possible to generate static Fano-based encoding schemes which are arbitrarily close to the optimal ones, i.e., those generated by Huffman’s algorithm. By taking advantage of the properties of the encoding schemes generated by this method, and the concept of “code word arrangement”, we present an enhanced version of the static Fano’s method, namely Fano+. We formally analyze Fano+ by presenting some properties of the Fano tree, and the theory of list rearrangements. Our enhanced algorithm achieves compression ratios arbitrarily close to those of Huffman’s algorithm on files of the Calgary corpus and the Canterbury corpus.

Introduction

Huffman’s algorithm is a well-known encoding method that generates an optimal prefix encoding scheme, in the sense that the average code word length is minimum. As opposed to this, Fano’s method has not been used as widely, because it generates prefix encoding schemes that can be sub-optimal.

In this paper, we present an enhancement of the traditional Fano’s method by which encoding schemes that are arbitrarily close to the optimum can be easily constructed.

We assume that we are given a source alphabet, S={s1,…,sm}, whose probabilities of occurrence are P=[p1,…,pm], and a code alphabet, A={a1,…,ar}. We intend to generate an encoding scheme, {si→wi}, in such a way that the average code word length, ℓ̄ = p1ℓ1 + p2ℓ2 + ⋯ + pmℓm, is minimized, where ℓi is the length of wi. Although we specifically consider the binary code alphabet, the generalization to the r-ary case is not too intricate.
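For concreteness, the following fragment (a minimal illustration with hypothetical symbols, probabilities and code words, not taken from the paper) computes the average code word length ℓ̄ for a candidate encoding scheme:

```python
# Hypothetical 4-symbol source and a candidate prefix encoding scheme.
P = [0.4, 0.3, 0.2, 0.1]                # probabilities p1, ..., pm (non-increasing)
code_words = ["0", "10", "110", "111"]  # w1, ..., wm

# Average code word length: l_bar = sum_i p_i * len(w_i)
l_bar = sum(p * len(w) for p, w in zip(P, code_words))
print(l_bar)  # 0.4*1 + 0.3*2 + 0.2*3 + 0.1*3 = 1.9
```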

Lossless encoding methods used to solve this problem include Huffman’s algorithm (Huffman, 1952), Shannon’s method (Shannon & Weaver, 1949), arithmetic coding (Sayood, 2000), Fano’s method (Hankerson, Harris, & Johnson, 1998), etc. Adaptive versions of these methods have been proposed, and can be found in (Faller, 1973; Gallager, 1978; Hankerson et al., 1998; Knuth, 1985; Rueda, 2002; Sayood, 2000). Our survey is necessarily brief, as this is a well-studied field.

We assume that the source is memoryless or zeroth-order, which means that the occurrence of the next symbol is independent of any other symbol that has occurred previously. Higher-order models include Markov models (Hankerson et al., 1998), dictionary techniques (Ziv & Lempel, 1977; Ziv & Lempel, 1978), prediction with partial matching (Witten, Moffat, & Bell, 1999), grammar based compression (Kieffer & Yang, 2000), etc., and the techniques introduced here are also readily applicable to such “structure” models.

Huffman’s algorithm proceeds by generating the so-called Huffman tree, by recursively merging symbols (nodes) into a new conceptual symbol which constitutes an internal node of the tree. In this way, Huffman’s algorithm generates the tree in a bottom-up fashion. As opposed to this, Fano’s method also proceeds by generating a coding tree, but it does so in a top-down fashion. At each step, the list of symbols is partitioned into two (or more, if the output alphabet is non-binary) new sub-lists, generating two or more new nodes in the corresponding coding tree. Although Fano’s method typically generates a sub-optimal encoding scheme, the loss in compression ratio with respect to Huffman’s algorithm can be relatively small, but it can also be quite significant if the partitioning is far from optimal.
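The bottom-up merging performed by Huffman’s algorithm can be sketched as follows. This is a standard textbook rendering using Python’s heapq, not the paper’s implementation, and it only derives the code word lengths rather than the full tree:

```python
import heapq
from itertools import count

def huffman_code_lengths(probs):
    """Bottom-up Huffman construction; returns the code word length of each symbol."""
    tiebreak = count()                      # keeps heap entries comparable on ties
    lengths = [0] * len(probs)
    # Each heap entry: (probability, tie-breaker, list of symbol indices in this subtree).
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        for i in left + right:              # merging adds one level above them
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), left + right))
    return lengths

print(huffman_code_lengths([0.4, 0.3, 0.2, 0.1]))  # e.g. [1, 2, 3, 3]
```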

The binary alphabet version of Fano’s method proceeds by partitioning the list of symbols into two new sub-lists in such a way that the sums of probabilities of these two new sub-lists are as close to being equal as possible. This procedure is recursively applied to the new sub-lists until atomic sub-lists containing a single symbol are obtained, and at each partitioning step a bit is appended to the code words of the symbols in the corresponding sub-lists. As a result, even if we start with the probabilities of occurrence in decreasing order, it can be seen that, occasionally, symbols with higher probabilities are assigned longer code words than those with lower probabilities. This condition is undesirable; we rectify it in Fano+, a superior Fano-based scheme which we develop in this paper.
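The binary partitioning just described can be sketched as follows. This is only an illustrative rendering, under the assumption that the split point is the one that best balances the two probability sums; the paper studies its own partitioning procedures in detail:

```python
def fano_codes(probs):
    """Binary Fano coding for probabilities sorted in non-increasing order.

    Sketch of the classical procedure: recursively split the (contiguous) list
    so that the two probability sums are as balanced as possible, appending one
    code bit per symbol at each split.
    """
    codes = [""] * len(probs)

    def split(lo, hi):                      # work on the sub-list probs[lo:hi]
        if hi - lo <= 1:
            return
        total = sum(probs[lo:hi])
        best_k, best_diff, running = lo + 1, float("inf"), 0.0
        for k in range(lo + 1, hi):         # candidate split points
            running += probs[k - 1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_diff, best_k = diff, k
        for i in range(lo, hi):             # append one code bit per symbol
            codes[i] += "0" if i < best_k else "1"
        split(lo, best_k)
        split(best_k, hi)

    split(0, len(probs))
    return codes

print(fano_codes([0.4, 0.3, 0.2, 0.1]))  # e.g. ['0', '10', '110', '111']
```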

On the other hand, after constructing a coding tree (such as a Huffman tree or a Fano tree), an encoding scheme is generated from that tree by labeling the branches with the code alphabet (typically, binary) symbols. Given a tree constructed from a source alphabet of m symbols, 2^(m−1) different encoding schemes can be generated. Of these, only one belongs to the family of codes known as canonical codes, in which the code words are arranged in lexicographical order (Witten et al., 1999). Canonical codes are desirable because they allow extremely fast decoding, and require approximately half of the space used by a decoding tree. These codes are generated by Fano coding in a natural way.
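For illustration, the usual construction of a canonical code from a given set of code word lengths (a standard technique, not a procedure taken from this paper) assigns consecutive binary integers to code words of equal length:

```python
def canonical_codes(lengths):
    """Assign lexicographically ordered (canonical) code words to the given lengths.

    Standard construction, assuming the lengths satisfy the Kraft inequality:
    symbols are processed by non-decreasing length; each code word is the previous
    one plus one, left-shifted whenever the length grows.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes, code, prev_len = [""] * len(lengths), 0, 0
    for i in order:
        code <<= (lengths[i] - prev_len)            # lengthen the running code word
        codes[i] = format(code, "0{}b".format(lengths[i]))
        code += 1
        prev_len = lengths[i]
    return codes

print(canonical_codes([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```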

In this paper, we introduce Fano+, an enhanced version of the static Fano coding approach that utilizes the concept which we call “code word arrangement”, and which is based on a fundamental property of two lists arranged in increasing and decreasing order, respectively (Hardy, Littlewood, & Pólya, 1959). In our context, these lists are the code words (in terms of their lengths) and their probabilities, respectively. This paper formally details the encoding scheme generation algorithm, the partitioning procedures suitable for Fano+, and a rigorous analysis of their respective properties.
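The list property referred to here is, in essence, the classical rearrangement inequality. Stated in the present notation (our paraphrase, under the assumption that the same multiset of code word lengths is merely reassigned to the symbols):

```latex
% Rearrangement inequality (Hardy, Littlewood & Polya), paraphrased in the
% paper's notation: pairing non-increasing probabilities with non-decreasing
% code word lengths minimizes the average code word length.
\[
  p_1 \geq p_2 \geq \cdots \geq p_m
  \quad\text{and}\quad
  \ell_1 \leq \ell_2 \leq \cdots \leq \ell_m
  \;\Longrightarrow\;
  \sum_{i=1}^{m} p_i\,\ell_i \;\leq\; \sum_{i=1}^{m} p_i\,\ell_{\sigma(i)}
  \quad\text{for every permutation } \sigma \text{ of } \{1,\ldots,m\}.
\]
```

In other words, among all assignments of a fixed multiset of code word lengths to the symbols, giving the shortest code words to the most probable symbols yields the smallest ℓ̄.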

We finally discuss some empirical results obtained from running the static Huffman’s algorithm, the static version of the traditional Fano coding, and Fano+, on real-life data. Our empirical results show that the compression ratios achieved by Fano+ are comparable to those of optimal encoding methods such as Huffman’s algorithm. Although we use the zeroth-order statistical model, other structure/statistical models such as higher-order models, dictionary models, etc., can also be used in conjunction with Fano+ to achieve compression ratios that are close to those attained by most well-known compression schemes.

Section snippets

Properties of the traditional Fano coding

Consider the source alphabet S={s1,…,sm} with probabilities of occurrence P=[p1,…,pm], where p1⩾p2⩾⋯⩾pm. Unless otherwise stated, in this paper, we assume that the code alphabet is A={0,1}.

We define an encoding scheme as a mapping, φ: s1→w1,…,sm→wm, where wi∈A+, for i=1,…,m. One of the properties of the encoding schemes generated by Huffman’s algorithm is that ℓ1⩽ℓ2⩽⋯⩽ℓm, where ℓi is the length of wi. In general, this property is not satisfied by the encoding schemes generated by Fano’s method.

The enhanced coding algorithm

Considering the facts discussed above, we now propose Fano+ using the following modification to the traditional static Fano’s method. It is well known that Fano’s method requires that the source symbols and their probabilities be sorted in non-increasing order of the probabilities. What we incorporate is as follows: after all the code words are generated, we sort them in increasing order of their lengths, while maintaining S and P in the order of the probabilities. This enhancement leads

Properties of the enhanced Fano coding

To facilitate the analysis, we first introduce two important properties of Fano+, the enhanced static Fano coding. The first relates to the efficiency in compression achieved by Fano+, and the second is that it achieves lossless compression.
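A minimal sketch of the rearrangement step (building on the hypothetical fano_codes helper shown earlier; the function names are ours, not the paper’s) illustrates why both properties hold: the multiset of code words is unchanged, so the code remains prefix-free, while shorter code words are moved to more probable symbols:

```python
def fano_plus_codes(probs):
    """Sketch of the Fano+ rearrangement (probs sorted in non-increasing order).

    The multiset of code words produced by the traditional Fano procedure is
    kept, so the code remains prefix-free (lossless); only the assignment of
    code words to symbols changes, pairing shorter words with more probable
    symbols, which can only lower the average code word length.
    """
    words = fano_codes(probs)   # traditional static Fano code words (earlier sketch)
    words.sort(key=len)         # non-decreasing lengths, paired with p1 >= ... >= pm
    return words

print(fano_plus_codes([0.35, 0.25, 0.2, 0.1, 0.1]))  # ['00', '01', '10', '110', '111']
```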

The efficiency of compression of Fano+ is a direct consequence of the rearrangement of the code words such that they are sorted in an increasing order of length (Rule 1). This is stated in the following theorem, for which a proof can be found in

Empirical results

In order to analyze the efficiency of Fano+, we have conducted some experiments on files of the Calgary corpus and the Canterbury corpus (Witten et al., 1999). The empirical results obtained are displayed in Tables 1 and 2, respectively. The columns labeled “Huffman” and “Fano” correspond to the static Huffman’s algorithm and the traditional static Fano’s method, respectively. The columns labeled “Fano+

Conclusions

In this paper, we present an encoding scheme generation algorithm which is Fano-based and almost optimal. We first showed that for encoding schemes whose code words are not arranged in increasing order of lengths, the corresponding coding tree does not satisfy the so-called sibling property. To rectify this, we introduced an enhanced version of the static Fano’s method, Fano+, whose properties have been formally proven.

The encoding algorithm associated with Fano+ has been formally

References (16)

  • A. Andersson et al. Sorting in linear time? Journal of Computer and System Sciences (1998).
  • D. Knuth. Dynamic Huffman coding. Journal of Algorithms (1985).
  • Faller, N. (1973). An adaptive system for data compression. In 7th Asilomar conference on circuits, systems, and...
  • R. Gallager. Variations on a theme by Huffman. IEEE Transactions on Information Theory (1978).
  • D. Hankerson et al. Introduction to information theory and data compression (1998).
  • G. Hardy et al. Inequalities (1959).
  • D. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE (1952).
  • J.C. Kieffer et al. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory (2000).

Cited by (9)

  • An adaptive character wordlength algorithm for data compression

    2008, Computers and Mathematics with Applications
    Citation Excerpt:

    Substitution data compression techniques involve the swapping of repeating characters by a shorter representation, such as null suppression, RLE, bit mapping and half byte packing [4,8,9]. Statistical data compression techniques involve the generation of the shortest average code length based on an estimated probability of the characters, such as Shannon–Fano coding [5,10,11], static/dynamic/adaptive Huffman coding [12–15], and arithmetic coding [16,17]. Finally, dictionary data compression techniques involve the substitution of sub-strings of text by indices or pointer code, relative to a dictionary of the sub-strings, such as the LZW data compression technique [18–20].

  • Multistage test data compression technique for VLSI circuits

    2017, Proceedings of 2016 International Conference on Advanced Communication Control and Computing Technologies, ICACCCT 2016
  • Stochastic learning-based weak estimation and its applications

    2010, Knowledge-Based Intelligent System Advancements: Systemic and Cybernetic Approaches
  • A new algorithm for calculating adaptive character wordlength via estimating compressed file size

    2010, 2nd International Conference on Computer Research and Development, ICCRD 2010
1

Partially supported by Departamento de Informática, Universidad Nacional de San Juan, Argentina, and NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this paper was presented at the 2001 IEEE Conference on Systems, Man and Cybernetics, Tucson, Arizona, USA.

2

Partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
