A nearly-optimal Fano-based coding algorithm

https://doi.org/10.1016/S0306-4573(03)00007-4

Abstract

Statistical coding techniques have been used for a long time in lossless data compression, using methods such as Huffman’s algorithm, arithmetic coding, Shannon’s method, Fano’s method, etc. Most of these methods can be implemented either statically or adaptively. In this paper, we show that although Fano coding is sub-optimal, it is possible to generate static Fano-based encoding schemes which are arbitrarily close to the optimal ones, i.e., those generated by Huffman’s algorithm. By taking advantage of the properties of the encoding schemes generated by this method, and the concept of “code word arrangement”, we present an enhanced version of the static Fano’s method, namely Fano+. We formally analyze Fano+ by presenting some properties of the Fano tree, and the theory of list rearrangements. Our enhanced algorithm achieves compression ratios arbitrarily close to those of Huffman’s algorithm on files of the Calgary corpus and the Canterbury corpus.

Introduction

Huffman’s algorithm is a well-known encoding method that generates an optimal prefix encoding scheme, in the sense that the average code word length is minimum. As opposed to this, Fano’s method has not been used as widely, because it generates prefix encoding schemes that can be sub-optimal.

In this paper, we present an enhancement of the traditional Fano’s method by which encoding schemes that are arbitrarily close to the optimum can be easily constructed.

We assume that we are given a source alphabet, S={s1,…,sm}, whose probabilities of occurrence are P=[p1,…,pm], and a code alphabet, A={a1,…,ar}. We intend to generate an encoding scheme, {si→wi}, in such a way that the average code word length, ℓ̄ = p1ℓ1 + p2ℓ2 + ⋯ + pmℓm, is minimized, where ℓi is the length of wi. Although we specifically consider the binary code alphabet, the generalization to the r-ary case is not too intricate.
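For concreteness, the following fragment (a minimal illustration with hypothetical symbols, probabilities and code words, not taken from the paper) computes the average code word length ℓ̄ for a candidate encoding scheme:

```python
# Hypothetical 4-symbol source and a candidate prefix encoding scheme.
P = [0.4, 0.3, 0.2, 0.1]                # probabilities p1, ..., pm (non-increasing)
code_words = ["0", "10", "110", "111"]  # w1, ..., wm

# Average code word length: l_bar = sum_i p_i * len(w_i)
l_bar = sum(p * len(w) for p, w in zip(P, code_words))
print(l_bar)  # 0.4*1 + 0.3*2 + 0.2*3 + 0.1*3 = 1.9
```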

Lossless encoding methods used to solve this problem include Huffman’s algorithm (Huffman, 1952), Shannon’s method (Shannon & Weaver, 1949), arithmetic coding (Sayood, 2000), Fano’s method (Hankerson, Harris, & Johnson, 1998), etc. Adaptive versions of these methods have been proposed, and can be found in (Faller, 1973; Gallager, 1978; Hankerson et al., 1998; Knuth, 1985; Rueda, 2002; Sayood, 2000). Our survey is necessarily brief, as this is a well-studied field.

We assume that the source is memoryless or zeroth-order, which means that the occurrence of the next symbol is independent of any other symbol that has occurred previously. Higher-order models include Markov models (Hankerson et al., 1998), dictionary techniques (Ziv & Lempel, 1977; Ziv & Lempel, 1978), prediction with partial matching (Witten, Moffat, & Bell, 1999), grammar based compression (Kieffer & Yang, 2000), etc., and the techniques introduced here are also readily applicable to such “structure” models.

Huffman’s algorithm proceeds by generating the so-called Huffman tree, by recursively merging symbols (nodes) into a new conceptual symbol which constitutes an internal node of the tree. In this way, Huffman’s algorithm generates the tree in a bottom-up fashion. As opposed to this, Fano’s method also proceeds by generating a coding tree, but it does so in a top-down fashion. At each step, the list of symbols is partitioned into two (or more, if the output alphabet is non-binary) new sub-lists, generating two or more new nodes in the corresponding coding tree. Although Fano’s method typically generates a sub-optimal encoding scheme, the loss in compression ratio with respect to Huffman’s algorithm can be relatively small, but it can also be quite significant if the partitioning is far from optimal.
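The bottom-up merging performed by Huffman’s algorithm can be sketched as follows. This is a standard textbook rendering using Python’s heapq, not the paper’s implementation, and it only derives the code word lengths rather than the full tree:

```python
import heapq
from itertools import count

def huffman_code_lengths(probs):
    """Bottom-up Huffman construction; returns the code word length of each symbol."""
    tiebreak = count()                      # keeps heap entries comparable on ties
    lengths = [0] * len(probs)
    # Each heap entry: (probability, tie-breaker, list of symbol indices in this subtree).
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        for i in left + right:              # merging adds one level above them
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), left + right))
    return lengths

print(huffman_code_lengths([0.4, 0.3, 0.2, 0.1]))  # e.g. [1, 2, 3, 3]
```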

The binary alphabet version of Fano’s method proceeds by partitioning the list of symbols into two new sub-lists in such a way that the sums of probabilities of these two new sub-lists are as close to being equal as possible. This procedure is recursively applied to the new sub-lists until atomic sub-lists containing a single symbol are obtained, and at each partitioning step a bit is appended to the code words of the symbols in the corresponding sub-lists. As a result, even if we start with the probabilities of occurrence in decreasing order, it can be seen that, occasionally, symbols with higher probabilities are assigned longer code words than those with lower probabilities. This condition is undesirable; we rectify it in Fano+, a superior Fano-based scheme which we develop in this paper.
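The binary partitioning just described can be sketched as follows. This is only an illustrative rendering, under the assumption that the split point is the one that best balances the two probability sums; the paper studies its own partitioning procedures in detail:

```python
def fano_codes(probs):
    """Binary Fano coding for probabilities sorted in non-increasing order.

    Sketch of the classical procedure: recursively split the (contiguous) list
    so that the two probability sums are as balanced as possible, appending one
    code bit per symbol at each split.
    """
    codes = [""] * len(probs)

    def split(lo, hi):                      # work on the sub-list probs[lo:hi]
        if hi - lo <= 1:
            return
        total = sum(probs[lo:hi])
        best_k, best_diff, running = lo + 1, float("inf"), 0.0
        for k in range(lo + 1, hi):         # candidate split points
            running += probs[k - 1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_diff, best_k = diff, k
        for i in range(lo, hi):             # append one code bit per symbol
            codes[i] += "0" if i < best_k else "1"
        split(lo, best_k)
        split(best_k, hi)

    split(0, len(probs))
    return codes

print(fano_codes([0.4, 0.3, 0.2, 0.1]))  # e.g. ['0', '10', '110', '111']
```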

On the other hand, after constructing a coding tree (such as a Huffman tree or a Fano tree), an encoding scheme is generated from that tree by labeling the branches with the code alphabet (typically, binary) symbols. Given a tree constructed from a source alphabet of m symbols, 2^(m−1) different encoding schemes can be generated. Of these, only one belongs to the family of codes known as canonical codes, in which the code words are arranged in lexicographical order (Witten et al., 1999). Canonical codes are desirable because they allow extremely fast decoding, and require approximately half of the space used by a decoding tree. These codes are generated by Fano coding in a natural way.
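For illustration, the usual construction of a canonical code from a given set of code word lengths (a standard technique, not a procedure taken from this paper) assigns consecutive binary integers to code words of equal length:

```python
def canonical_codes(lengths):
    """Assign lexicographically ordered (canonical) code words to the given lengths.

    Standard construction, assuming the lengths satisfy the Kraft inequality:
    symbols are processed by non-decreasing length; each code word is the previous
    one plus one, left-shifted whenever the length grows.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes, code, prev_len = [""] * len(lengths), 0, 0
    for i in order:
        code <<= (lengths[i] - prev_len)            # lengthen the running code word
        codes[i] = format(code, "0{}b".format(lengths[i]))
        code += 1
        prev_len = lengths[i]
    return codes

print(canonical_codes([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```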

In this paper, we introduce Fano+, an enhanced version of the static Fano coding approach that utilizes the concept which we call “code word arrangement”, and which is based on a fundamental property of two lists arranged in increasing and decreasing order, respectively (Hardy, Littlewood, & Pólya, 1959). In our context, these lists are the code words (in terms of their lengths) and their probabilities, respectively. This paper formally details the encoding scheme generation algorithm, the partitioning procedures suitable for Fano+, and a rigorous analysis of their respective properties.
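The list property referred to here is, in essence, the classical rearrangement inequality. Stated in the present notation (our paraphrase, under the assumption that the same multiset of code word lengths is merely reassigned to the symbols):

```latex
% Rearrangement inequality (Hardy, Littlewood & Polya), paraphrased in the
% paper's notation: pairing non-increasing probabilities with non-decreasing
% code word lengths minimizes the average code word length.
\[
  p_1 \geq p_2 \geq \cdots \geq p_m
  \quad\text{and}\quad
  \ell_1 \leq \ell_2 \leq \cdots \leq \ell_m
  \;\Longrightarrow\;
  \sum_{i=1}^{m} p_i\,\ell_i \;\leq\; \sum_{i=1}^{m} p_i\,\ell_{\sigma(i)}
  \quad\text{for every permutation } \sigma \text{ of } \{1,\ldots,m\}.
\]
```

In other words, among all assignments of a fixed multiset of code word lengths to the symbols, giving the shortest code words to the most probable symbols yields the smallest ℓ̄.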

We finally discuss some empirical results obtained from running the static Huffman’s algorithm, the static version of the traditional Fano coding, and Fano+, on real-life data. Our empirical results show that the compression ratios achieved by Fano+ are comparable to those of optimal encoding methods such as Huffman’s algorithm. Although we use the zeroth-order statistical model, other structure/statistical models such as higher-order models, dictionary models, etc., can also be used in conjunction with Fano+ to achieve compression ratios that are close to those attained by most well-known compression schemes.

Section snippets

Properties of the traditional Fano coding

Consider the source alphabet S={s1,…,sm} with probabilities of occurrence P=[p1,…,pm], where p1⩾p2⩾⋯⩾pm. Unless otherwise stated, in this paper, we assume that the code alphabet is A={0,1}.

We define an encoding scheme as a mapping, φ: s1→w1,…,sm→wm, where wi∈A+, for i=1,…,m. One of the properties of the encoding schemes generated by Huffman’s algorithm is that ℓ1⩽ℓ2⩽⋯⩽ℓm, where ℓi is the length of wi. In general, this property is not satisfied by the encoding schemes generated by Fano’s method.

The enhanced coding algorithm

Considering the facts discussed above, we now propose Fano+ using the following modification to the traditional static Fano’s method. It is well known that Fano’s method requires that the source symbols and their probabilities be sorted in non-increasing order of the probabilities. What we incorporate is as follows: after all the code words are generated, we sort them in increasing order of their lengths, while maintaining S and P in the order of the probabilities. This enhancement leads

Properties of the enhanced Fano coding

To facilitate the analysis, we first introduce two important properties of Fano+, the enhanced static Fano coding. The first relates to the efficiency in compression achieved by Fano+, and the second is that it achieves lossless compression.
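A minimal sketch of the rearrangement step (building on the hypothetical fano_codes helper shown earlier; the function names are ours, not the paper’s) illustrates why both properties hold: the multiset of code words is unchanged, so the code remains prefix-free, while shorter code words are moved to more probable symbols:

```python
def fano_plus_codes(probs):
    """Sketch of the Fano+ rearrangement (probs sorted in non-increasing order).

    The multiset of code words produced by the traditional Fano procedure is
    kept, so the code remains prefix-free (lossless); only the assignment of
    code words to symbols changes, pairing shorter words with more probable
    symbols, which can only lower the average code word length.
    """
    words = fano_codes(probs)   # traditional static Fano code words (earlier sketch)
    words.sort(key=len)         # non-decreasing lengths, paired with p1 >= ... >= pm
    return words

print(fano_plus_codes([0.35, 0.25, 0.2, 0.1, 0.1]))  # ['00', '01', '10', '110', '111']
```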

The efficiency of compression of Fano+ is a direct consequence of the rearrangement of the code words such that they are sorted in an increasing order of length (Rule 1). This is stated in the following theorem, for which a proof can be found in

Empirical results

In order to analyze the efficiency of Fano+, we have conducted some experiments on files of the Calgary corpus and the Canterbury corpus (Witten et al., 1999). The empirical results obtained are displayed in Tables 1 and 2, respectively. The columns labeled “Huffman” and “Fano” correspond to the static Huffman’s algorithm and the traditional static Fano’s method, respectively. The columns labeled “Fano+

Conclusions

In this paper, we present an encoding scheme generation algorithm which is Fano-based and almost optimal. We first showed that for encoding schemes whose code words are not arranged in increasing order of lengths, the corresponding coding tree does not satisfy the so-called sibling property. To rectify this, we introduced an enhanced version of the static Fano’s method, Fano+, whose properties have been formally proven.

The encoding algorithm associated with Fano+ has been formally

References (16)

  • A. Andersson et al. Sorting in linear time? Journal of Computer and System Sciences (1998).
  • D. Knuth. Dynamic Huffman coding. Journal of Algorithms (1985).
  • Faller, N. (1973). An adaptive system for data compression. In 7th Asilomar conference on circuits, systems, and...
  • R. Gallager. Variations on a theme by Huffman. IEEE Transactions on Information Theory (1978).
  • D. Hankerson et al. Introduction to information theory and data compression (1998).
  • G. Hardy et al. Inequalities (1959).
  • D. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE (1952).
  • J.C. Kieffer et al. Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory (2000).

Cited by (9)

  • An adaptive character wordlength algorithm for data compression

    2008, Computers and Mathematics with Applications
    Citation Excerpt:

    Substitution data compression techniques involve the swapping of repeating characters by a shorter representation, such as null suppression, RLE, bit mapping and half byte packing [4,8,9]. Statistical data compression techniques involve the generation of the shortest average code length based on an estimated probability of the characters, such as Shannon–Fano coding [5,10,11], static/dynamic/adaptive Huffman coding [12–15], and arithmetic coding [16,17]. Finally, dictionary data compression techniques involve the substitution of sub-strings of text by indices or pointer code, relative to a dictionary of the sub-strings, such as the LZW data compression technique [18–20].

  • Multistage test data compression technique for VLSI circuits

    2017, Proceedings of 2016 International Conference on Advanced Communication Control and Computing Technologies, ICACCCT 2016
  • Stochastic learning-based weak estimation and its applications

    2010, Knowledge-Based Intelligent System Advancements: Systemic and Cybernetic Approaches
  • A new algorithm for calculating adaptive character wordlength via estimating compressed file size

    2010, 2nd International Conference on Computer Research and Development, ICCRD 2010
1

Partially supported by Departamento de Informática, Universidad Nacional de San Juan, Argentina, and NSERC, the Natural Sciences and Engineering Research Council of Canada. A preliminary version of this paper was presented at the 2001 IEEE Conference on Systems, Man and Cybernetics, Tucson, Arizona, USA.

2

Partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
