A nearly-optimal Fano-based coding algorithm
Introduction
Huffman’s algorithm is a well-known encoding method that generates an optimal prefix encoding scheme, in the sense that the average code word length is minimized. In contrast, Fano’s method has seen much less use because it generates prefix encoding schemes that can be sub-optimal.
In this paper, we present an enhancement of the traditional Fano’s method by which encoding schemes, which are arbitrarily close to the optimum, can be easily constructed.
We assume that we are given a source alphabet, S = {s1, …, sm}, whose probabilities of occurrence are P = {p1, …, pm}, and a code alphabet, A = {0, 1}. We intend to generate an encoding scheme, {si→wi}, in such a way that the average code word length, ℓ̄ = ∑i pi ℓi, is minimized, where ℓi is the length of wi. Although we specifically consider the binary code alphabet, the generalization to the r-ary case is not too intricate.
Lossless encoding methods used to solve this problem include Huffman’s algorithm (Huffman, 1952), Shannon’s method (Shannon & Weaver, 1949), arithmetic coding (Sayood, 2000), Fano’s method (Hankerson, Harris, & Johnson, 1998), etc. Adaptive versions of these methods have been proposed, and can be found in (Faller, 1973; Gallager, 1978; Hankerson et al., 1998; Knuth, 1985; Rueda, 2002; Sayood, 2000). Our survey is necessarily brief as this is a well-reputed field.
We assume that the source is memoryless or zeroth-order, which means that the occurrence of the next symbol is independent of any other symbol that has occurred previously. Higher-order models include Markov models (Hankerson et al., 1998), dictionary techniques (Ziv & Lempel, 1977; Ziv & Lempel, 1978), prediction with partial matching (Witten, Moffat, & Bell, 1999), grammar based compression (Kieffer & Yang, 2000), etc., and the techniques introduced here are also readily applicable for such “structure” models.
Huffman’s algorithm proceeds by generating the so-called Huffman tree, recursively merging the two least probable symbols (nodes) into a new conceptual symbol, which constitutes an internal node of the tree. In this way, Huffman’s algorithm generates the tree in a bottom-up fashion. In contrast, Fano’s method also generates a coding tree, but proceeds in a top-down fashion. At each step, the list of symbols is partitioned into two (or more, if the output alphabet is non-binary) new sub-lists, generating two or more new nodes in the corresponding coding tree. Although Fano’s method typically generates a sub-optimal encoding scheme, the loss in compression ratio with respect to Huffman’s algorithm can be relatively small; it can, however, be quite significant when the partitions are far from balanced.
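The bottom-up merging can be sketched as follows; this is an illustrative implementation (the function name and the heap-based bookkeeping are ours, not the paper's), which tracks only the resulting code word lengths rather than building the tree explicitly:

```python
import heapq

def huffman_code_lengths(probs):
    """Build a Huffman tree bottom-up by repeatedly merging the two
    least probable nodes; return the code word length of each symbol."""
    # Each heap entry: (probability, unique tie-breaker, symbol indices below this node)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, t, syms2 = heapq.heappop(heap)
        # Merging two nodes adds one bit to every code word beneath them.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, t, syms1 + syms2))
    return lengths

print(huffman_code_lengths([0.4, 0.3, 0.2, 0.1]))  # [1, 2, 3, 3]
```

For the probabilities (0.4, 0.3, 0.2, 0.1) this yields the optimal lengths (1, 2, 3, 3), with average length 1.9 bits per symbol.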
The binary-alphabet version of Fano’s method proceeds by partitioning the list of symbols into two new sub-lists in such a way that the sums of the probabilities of the two sub-lists are as close to equal as possible; as each partition is made, a bit is appended to the code words of every symbol in the corresponding sub-list. This procedure is applied recursively to the new sub-lists until only atomic sub-lists containing a single symbol remain. As a result, if we start with the probabilities of occurrence in decreasing order, it can be seen that, occasionally, symbols with higher probabilities are assigned longer code words than symbols with lower probabilities. This condition is not desirable; we rectify it in Fano+, the superior Fano-based scheme developed in this paper.
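The recursive balanced-split procedure can be sketched as below; this is our own illustrative rendering (function name and split rule: we choose the cut point minimizing the absolute difference of the two sub-list sums), assuming the probabilities arrive sorted in non-increasing order:

```python
def fano_codes(probs, prefix=""):
    """Traditional binary Fano coding: split the (already sorted,
    non-increasing) probability list where the two halves' sums are
    as close to equal as possible, then recurse, appending one bit."""
    if len(probs) == 1:
        return [prefix]
    total, left, best_k, best_diff = sum(probs), 0.0, 1, float("inf")
    for k in range(1, len(probs)):
        left += probs[k - 1]
        diff = abs(left - (total - left))
        if diff < best_diff:
            best_k, best_diff = k, diff
    return (fano_codes(probs[:best_k], prefix + "0")
            + fano_codes(probs[best_k:], prefix + "1"))

print(fano_codes([0.35, 0.17, 0.17, 0.16, 0.15]))
# ['00', '01', '10', '110', '111']
```

Note that nothing in the recursion forces the resulting lengths to be non-decreasing in general, which is precisely the deficiency addressed later.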
On the other hand, after constructing a coding tree (such as a Huffman tree or a Fano tree), an encoding scheme is generated from that tree by labeling the branches with the code alphabet (typically, binary) symbols. Given a tree constructed from a source alphabet of m symbols, 2^(m−1) different encoding schemes can be generated, since the two branch labels of each of the m−1 internal nodes can be swapped independently. Of these, only one belongs to the family of codes known as canonical codes, in which the code words are arranged in lexicographical order (Witten et al., 1999). Canonical codes are desirable because they allow extremely fast decoding, and require approximately half of the space used by a decoding tree. These codes are generated by Fano coding in a natural way.
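A canonical code is fully determined by the sorted code word lengths: consecutive code words are obtained by incrementing and, when the length grows, left-shifting. A minimal sketch (the function name is ours):

```python
def canonical_codes(lengths):
    """Given code word lengths sorted in non-decreasing order, assign
    lexicographically consecutive binary code words (the canonical code)."""
    codes, code, prev_len = [], 0, lengths[0]
    for length in lengths:
        code <<= (length - prev_len)  # pad with zeros when the length grows
        codes.append(format(code, "0{}b".format(length)))
        code += 1
        prev_len = length
    return codes

print(canonical_codes([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

Because only the length sequence must be stored, the decoder can dispense with an explicit tree, which is the source of the speed and space advantages mentioned above.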
In this paper, we introduce Fano+, an enhanced version of the static Fano coding approach that utilizes a concept we call “code word arrangement”, which is based on a fundamental property of two lists arranged in an increasing and a decreasing order, respectively (Hardy, Littlewood, & Pólya, 1959). In our context, these two lists are the code words (in terms of their lengths) and their probabilities, respectively. This paper formally details the encoding scheme generation algorithm, the partitioning procedures suitable for Fano+, and a rigorous analysis of their respective properties.
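The property alluded to is (a special case of) the rearrangement inequality of Hardy, Littlewood and Pólya: pairing the largest weights with the smallest values minimizes the weighted sum. In our notation, a sketch of the statement reads:

```latex
% If p_1 \ge p_2 \ge \cdots \ge p_m and \ell_1 \le \ell_2 \le \cdots \le \ell_m,
% then for every permutation \sigma of \{1,\dots,m\}:
\sum_{i=1}^{m} p_i \,\ell_i \;\le\; \sum_{i=1}^{m} p_i \,\ell_{\sigma(i)}
```

That is, assigning the shortest code words to the most probable symbols can never increase the average code word length.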
We finally discuss some empirical results obtained from running the static Huffman’s algorithm, the static version of the traditional Fano coding, and Fano+, on real life data. Our empirical results show that the compression ratios achieved by Fano+ are comparable to those of other optimal encoding methods such as Huffman’s algorithm. Although we use the zeroth-order statistical model, other structure/statistical models such as higher-order models, dictionary models, etc., can also be used in conjunction with Fano+ to achieve compression ratios that are close to those attained by most well known compression schemes.
Properties of the traditional Fano coding
Consider the source alphabet S = {s1, …, sm} with probabilities of occurrence P = {p1, …, pm}, where p1⩾p2⩾⋯⩾pm. Unless otherwise stated, in this paper, we assume that the code alphabet is A = {0, 1}.
We define an encoding scheme as a mapping, φ:s1→w1,…,sm→wm, where each wi is a string over the code alphabet, for i=1,…,m. One of the properties of the encoding schemes generated by Huffman’s algorithm is that ℓ1⩽ℓ2⩽⋯⩽ℓm, where ℓi is the length of wi. In general, this property is not satisfied by the encoding schemes generated by Fano’s method.
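The impact of violating ℓ1⩽ℓ2⩽⋯⩽ℓm is easy to quantify numerically. A small illustration (the helper is ours): with the probabilities in non-increasing order, a monotone length assignment beats the same multiset of lengths in a scrambled order:

```python
def average_length(probs, lengths):
    """Average code word length: sum of p_i * l_i."""
    return sum(p, ) if False else sum(p * l for p, l in zip(probs, lengths))

probs = [0.4, 0.3, 0.2, 0.1]                 # non-increasing probabilities
monotone = average_length(probs, [1, 2, 3, 3])   # satisfies l1 <= ... <= lm
scrambled = average_length(probs, [3, 2, 1, 3])  # same lengths, bad pairing
print(monotone, scrambled)  # approx. 1.9 vs 2.3 bits per symbol
```

The same multiset of lengths costs 0.4 extra bits per symbol here simply because the longest code word was paired with the most probable symbol.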
The enhanced coding algorithm
Considering the facts discussed above, we now propose Fano+ by introducing the following modification to the traditional static Fano’s method. It is well known that Fano’s method requires that the source symbols and their probabilities be sorted in a non-increasing order of the probabilities. What we incorporate is the following: after all the code words have been generated, we sort them in increasing order of their lengths, while maintaining the symbols in non-increasing order of their probabilities. This enhancement leads …
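A minimal sketch of this rearrangement step, under our reading of the text (the function name is ours; we assume the symbols are already listed in non-increasing probability order, so pairing amounts to a stable sort of the code words by length):

```python
def rearrange_code_words(code_words):
    """Fano+ code word arrangement (illustrative): the symbols stay in
    non-increasing order of probability, while the code words produced by
    the plain Fano pass are re-sorted by increasing length before being
    paired with the symbols.  Since the *set* of code words is unchanged,
    the code remains prefix-free."""
    return sorted(code_words, key=len)

# Suppose a plain Fano pass produced these code words, in symbol order:
words = ["110", "10", "0", "111"]
print(rearrange_code_words(words))  # ['0', '10', '110', '111']
```

By the rearrangement property cited earlier, this pairing can only decrease (never increase) the average code word length.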
Properties of the enhanced Fano coding
To facilitate the analysis, we first introduce two important properties of Fano+, the enhanced static Fano coding. The first relates to the efficiency in compression achieved by Fano+, and the second is the property that it achieves lossless compression.
The efficiency of compression of Fano+ is a direct consequence of the rearrangement of the code words such that they are sorted in an increasing order of length (Rule 1). This is stated in the following theorem, for which a proof can be found in …
Empirical results
In order to analyze the efficiency of Fano+, we have conducted experiments on files of the Calgary corpus and the Canterbury corpus (Witten et al., 1999). The empirical results obtained are displayed in Tables 1 and 2, respectively. The columns labeled “Huffman” and “Fano” correspond to the static Huffman’s algorithm and the traditional static Fano’s method, respectively. The columns labeled “Fano+” …
Conclusions
In this paper, we have presented an encoding scheme generation algorithm which is Fano-based and almost optimal. We first showed that for encoding schemes whose code words are not arranged in increasing order of lengths, the corresponding coding tree does not satisfy the so-called sibling property. To rectify this, we introduced an enhanced version of the static Fano’s method, Fano+, whose properties have been formally proven.
The encoding algorithm associated with Fano+ has been formally …
References (16)
- Andersson, A., Hagerup, T., Nilsson, S., & Raman, R. (1998). Sorting in linear time? Journal of Computer and System Sciences.
- Knuth, D. E. (1985). Dynamic Huffman coding. Journal of Algorithms.
- Faller, N. (1973). An adaptive system for data compression. In 7th Asilomar conference on circuits, systems, and...
- Gallager, R. G. (1978). Variations on a theme by Huffman. IEEE Transactions on Information Theory.
- Hankerson, D., Harris, G. A., & Johnson, P. D. (1998). Introduction to information theory and data compression.
- Hardy, G. H., Littlewood, J. E., & Pólya, G. (1959). Inequalities.
- Huffman, D. A. (1952). A method for the construction of minimum redundancy codes. Proceedings of the IRE.
- Kieffer, J. C., & Yang, E.-H. (2000). Grammar-based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory.
- 1 Partially supported by Departamento de Informática, Universidad Nacional de San Juan, Argentina, and NSERC, the Natural Science and Engineering Research Council of Canada. A preliminary version of this paper was presented at the 2001 IEEE Conference on Systems, Man and Cybernetics, Tucson, Arizona, USA.
- 2 Partially supported by NSERC, the Natural Science and Engineering Research Council of Canada.