Abstract
Data volumes are exploding as organizations and users collect and store increasing amounts of information for their own use and for sharing it with others. To cope with these large datasets, software developers typically take advantage of faster and faster I/O-subsystems and multi-core processors, and/or they exploit the virtual memory to make the caching and delivering of data requested by their algorithms simple and effective whenever their working set is small. Sometimes, they gain an additional speed-up by reducing the storage usage of their algorithms because this impacts on the number of machines/disks required for a given computation, and on the amount of data that is transferred/cached to/in the faster memory levels closer to the CPUs. However it is well known, both in the algorithm and software engineering communities, that the principled exploitation of all these issues via a proper arrangement of data and a properly structured algorithmic computation can abundantly surpass the best expected technology advancements and the help coming from (sophisticated) operating systems or heuristics. As a result, data compression and indexing nowadays play a key role in the design of modern algorithms for applications that manage massive datasets.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Future and Emerging Technologies – Proactive: Disruptive Solutions for Energy Efficient ICT. In: EU Expert Consultation Workshop (February 2010)
Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching. Springer, Heidelberg (2008)
Agarwal, P.K., Erickson, J.: Geometric Range Searching and Its Relatives. Advances in Discrete and Computational Geometry 23, 156 (1999)
Ajwani, D., Beckmann, A., Jacob, R., Meyer, U., Moruz, G.: On computational models for flash memory devices. In: Procs. SEA. LNCS, vol. 5526, pp. 16–27. Springer, Heidelberg (2009)
Albers, S.: Energy-efficient algorithms. Comm. ACM 53(5), 86–96 (2010)
Arge, L., Brodal, G.S., Fagerberg, R.: Cache-Oblivious Data Structures. In: Handbook of Data Structures. CRC Press, Boca Raton (2005)
Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: Procs. ALENEX, pp. 84–97. SIAM, Philadelphia (2010)
Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 83–94. Springer, Heidelberg (2007)
Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 170–183. Springer, Heidelberg (2010)
Barbay, J., He, M., Munro, J.I., Srinivasa Rao, S.: Succinct indexes for string, bynary relations and multi-labeled trees. In: Procs. SODA, pp. 680–689 (2007)
Barbay, J., Navarro, G.: Compressed representations of permutations, and applications. In: Procs. STACS, pp. 111–122 (2009)
Barroso, L.A., Hölzle, U.: The case for energy-proportional computing. IEEE Computer 40(12), 33–37 (2007)
Beckmann, A., Meyer, U., Sanders, P., Singler, J.: Energy-Efficient Sorting using Solid State Disks. In: Procs. IEEE Green Computing Conference (2010)
Bender, M., Farach-Colton, M., Kuszmaul, B.: Cache-oblivious String B-trees. In: Procs. ACM PODS, pp. 233–242 (2006)
Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., Rao, S.: Representing trees of higher degree. Algorithmica 43, 275–292 (2005)
Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table compression with combinatorial optimization. J. ACM 50(6), 825–851 (2003)
Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: Procs. ACM WSDM, pp. 95–106 (2008)
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Cameron, K.W., Pruhs, K., Irani, S., Ranganathan, P., Brooks, D.: Report of the science of power management workshop. NSF Report (August 2009)
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: A distributed storage system for structured data. ACM Trans. on Computer Systems 26(2) (2008)
Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric BWT: Linking range searching and text indexing. In: Procs. IEEE DCC, pp. 252–261 (2008)
Chiu, S.Y., Hon, W.K., Shah, R., Vitter, J.: I/O-efficient compressed text indexes: From theory to practice. In: Procs. IEEE DCC (2010)
Claude, F., Navarro, G.: A fast and compact web graph representation. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 118–129. Springer, Heidelberg (2007)
Claude, F., Navarro, G.: Practical Rank/Select queries over arbitrary sequences. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 176–187. Springer, Heidelberg (2008)
Claude, F., Navarro, G.: Self-Indexed Text Compression using Straight-Line Programs. In: Královič, R., Niwiński, D. (eds.) MFCS 2009. LNCS, vol. 5734, pp. 235–246. Springer, Heidelberg (2009)
Claude, F., Navarro, G.: Extended compact web graph representations. In: Elomaa, T. (ed.) Ukkonen Festschrift 2010. LNCS, vol. 6060, pp. 77–91. Springer, Heidelberg (2010)
Cutting, D.: Apache Lucene (2008), http://lucene.apache.org/
Delpratt, O., Rahman, N., Raman, R.: Compressed prefix sums. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds.) SOFSEM 2007. LNCS, vol. 4362, pp. 235–247. Springer, Heidelberg (2007)
Ding, S., Attenberg, J., Suel, T.: Scalable techniques for document identifier assignment in inverted indexes. In: Procs. WWW, pp. 311–320 (2010)
Farzan, A., Munro, I.: Succinct Representations of Arbitrary Graphs. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 393–404. Springer, Heidelberg (2008)
Farzan, A., Raman, R., Rao, S.S.: Universal succinct representations of trees? In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009. LNCS, vol. 5555, pp. 451–462. Springer, Heidelberg (2009)
Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 697–710. Springer, Heidelberg (2010)
Ferragina, P., Giancarlo, R., Manzini, G.: The engineering of a compression boosting library: Theory vs practice in BWT compression. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 756–767. Springer, Heidelberg (2006)
Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 561–572. Springer, Heidelberg (2006)
Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52, 688–713 (2005)
Ferragina, P., Gonzalez, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics (2009)
Ferragina, P., Grossi, R.: The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46(2), 236–280 (1999)
Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: Procs. ACM PODS, pp. 181–190 (2008)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and searching XML data via two zips. In: Procs. WWW, pp. 751–760 (2006)
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1) (2009)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Ferragina, P., Manzini, G.: On compressing the textual web. In: Procs. ACM WSDM, pp. 391–400 (2010)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. on Algorithms 3(2) (2007)
Ferragina, P., Nitto, I.: A delta-compressed storage scheme supporting I/O-efficient random access. Draft (2010)
Ferragina, P., Nitto, I., Venturini, R.: On optimally partitioning a text to improve its compression. In: Fiat, A., Sanders, P. (eds.) ESA 2009. LNCS, vol. 5757, pp. 420–431. Springer, Heidelberg (2009)
Ferragina, P., Nitto, I., Venturini, R.: On the bit-complexity of lempel-ziv compression. In: Procs. ACM-SIAM SODA, pp. 768–777 (2009)
Ferragina, P., Rao, S.S.: Tree compression and indexing. In: Kao, M.-Y. (ed.) Encyclopedia of Algorithms. Springer, Heidelberg (2008)
Ferragina, P., Venturini, R.: Compressed permuterm index. In: Procs. ACM SIGIR, pp. 535–542 (2007)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. In: Procs. ACM-SIAM SODA, pp. 690–696 (2007)
Ferragina, P., Venturini, R.: Weighted compressed self-indexes. Draft (2010)
Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 158–169. Springer, Heidelberg (2010)
Giancarlo, R., Restivo, A., Sciortino, M.: From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization. Theoretical Computer Science 387(3), 236–248 (2007)
Golynski, A.: Optimal lower bounds for rank and select indexes. Theoretical Computer Science 387, 348–359 (2007)
Golynski, A., Grossi, R., Gupta, A., Raman, R., Rao, S.S.: On the size of succinct indices. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 371–382. Springer, Heidelberg (2007)
Grossi, R., Orlandi, A., Raman, R., Rao, S.S.: More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In: Procs STACS, pp. 517–528 (2009)
Hon, W.K., Lam, T., Shah, R., Tam, S., Vitter, J.S.: Compressed index for dictionary matching. In: Procs. IEEE DCC, pp. 23–32 (2008)
Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.: On entropy-compressed text indexing in external memory. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 75–89. Springer, Heidelberg (2009)
Hon, W.K., Shah, R., Vitter, J.S.: Compression, indexing, and retrieval for massive string data. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 260–274. Springer, Heidelberg (2010)
Jansson, J., Sadakane, K., Sung, W.K.: Ultra-succinct representation of ordered trees. In: Procs ACM-SIAM SODA, pp. 575–584 (2007)
Kant, K.: Data center evolution: A tutorial on state of the art, issues, and challenges. Computer Networks 53(17), 2939–2965 (2009)
Kant, K.: Toward a science of power management. IEEE Computer 42(9), 99–101 (2009)
Kaplan, H., Landau, S., Verbin, E.: A simpler analysis of Burrows-Wheeler-based compression. Theoretical Computer Science 387(3), 220–235 (2007)
Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of DNA. BioInformatics 24(6), 791–797 (2008)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology 10(R25) (2009)
Li, H., Durbin, R.: Fast and accurate long-read alignment with burrows-wheeler transform. BioInformatics 26(5), 589–595 (2010)
Li, Y., He, B., Luo, Q., Yi, K.: Tree indexing on flash disks. In: Procs. ICDE, pp. 1303–1306 (2009)
Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 229–241. Springer, Heidelberg (2007)
Mäkinen, V., Navarro, G., Siren, J., Valimaki, N.: Storage and Retrieval of Highly Repetitive Sequence Collections. Journal of Computational Biology 17(3), 281–308 (2010)
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Mehlhorn, K., Ziegelmann, M.: Resource constrained shortest paths. In: Paterson, M. (ed.) ESA 2000. LNCS, vol. 1879, pp. 326–337. Springer, Heidelberg (2000)
Nath, S., Gibbons, P.B.: Online maintenance of very large random samples on flash storage. VLDB Journal 19(1), 67–90 (2010)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
Navarro, G., Russo, L.M.: Re-pair achieves high-order entropy. In: Procs. IEEE DCC, p. 537 (2008)
Okanohara, D., Sadakane, K.: Practical entropy-compressed Rank/Select dictionary. In: Procs. ALENEX (2007)
Patrascu, M.: Succincter. In: Procs. IEEE FOCS, pp. 305–313 (2008)
Rajpoot, N., Şahinalp, C.: Dictionary-based data compression. In: Sayood, K. (ed.) Handbook of Lossless Data Compression, Academic Press, London (2002)
Ranganathan, P.: Recipe for efficiency: Principles of power-aware computing. Comm. ACM 53(4), 60–67 (2010)
Rivoire, S., Shah, M.A., Ranganathan, P., Kozyrakis, C.: Joulesort: a balanced energy-efficiency benchmark. In: Procs. ACM SIGMOD, pp. 365–376 (2007)
Sadakane, K., Navarro, G.: Fully-functional succinct trees. In: Procs. ACM-SIAM SODA, pp. 134–149 (2010)
Vigna, S.: Broadword implementation of rank/select queries. In: McGeoch, C.C. (ed.) WEA 2008. LNCS, vol. 5038, pp. 154–168. Springer, Heidelberg (2008)
Vitter, J.: External memory algorithms and data structures. ACM Computing Surveys 33(2), 209–271 (2001)
Vo, B.D., Vo, K.-P.: Compressing table data with column dependency. Theoretical Computer Science 387(3), 273–283 (2007)
Willets, K.: Full-text searching & the Burrows-Wheeler transform. Dr Dobbs Journal (December 2003)
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers, San Francisco (1999)
Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Procs. WWW, pp. 401–410 (2009)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Information Theory 23, 337–343 (1977)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferragina, P. (2010). Data Structures: Time, I/Os, Entropy, Joules!. In: de Berg, M., Meyer, U. (eds) Algorithms – ESA 2010. ESA 2010. Lecture Notes in Computer Science, vol 6347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15781-3_1
Download citation
DOI: https://doi.org/10.1007/978-3-642-15781-3_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15780-6
Online ISBN: 978-3-642-15781-3
eBook Packages: Computer ScienceComputer Science (R0)