Data Structures: Time, I/Os, Entropy, Joules!

Ferragina, Paolo

doi:10.1007/978-3-642-15781-3_1

Data Structures: Time, I/Os, Entropy, Joules!

Paolo Ferragina¹⁸

Conference paper

729 Accesses
6 Citations

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6347))

Abstract

Data volumes are exploding as organizations and users collect and store increasing amounts of information for their own use and for sharing it with others. To cope with these large datasets, software developers typically take advantage of faster and faster I/O-subsystems and multi-core processors, and/or they exploit the virtual memory to make the caching and delivering of data requested by their algorithms simple and effective whenever their working set is small. Sometimes, they gain an additional speed-up by reducing the storage usage of their algorithms because this impacts on the number of machines/disks required for a given computation, and on the amount of data that is transferred/cached to/in the faster memory levels closer to the CPUs. However it is well known, both in the algorithm and software engineering communities, that the principled exploitation of all these issues via a proper arrangement of data and a properly structured algorithmic computation can abundantly surpass the best expected technology advancements and the help coming from (sophisticated) operating systems or heuristics. As a result, data compression and indexing nowadays play a key role in the design of modern algorithms for applications that manage massive datasets.

This is a preview of subscription content, log in via an institution.

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Future and Emerging Technologies – Proactive: Disruptive Solutions for Energy Efficient ICT. In: EU Expert Consultation Workshop (February 2010)
Google Scholar
Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching. Springer, Heidelberg (2008)
Book Google Scholar
Agarwal, P.K., Erickson, J.: Geometric Range Searching and Its Relatives. Advances in Discrete and Computational Geometry 23, 156 (1999)
Google Scholar
Ajwani, D., Beckmann, A., Jacob, R., Meyer, U., Moruz, G.: On computational models for flash memory devices. In: Procs. SEA. LNCS, vol. 5526, pp. 16–27. Springer, Heidelberg (2009)
Google Scholar
Albers, S.: Energy-efficient algorithms. Comm. ACM 53(5), 86–96 (2010)
Article Google Scholar
Arge, L., Brodal, G.S., Fagerberg, R.: Cache-Oblivious Data Structures. In: Handbook of Data Structures. CRC Press, Boca Raton (2005)
Google Scholar
Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: Procs. ALENEX, pp. 84–97. SIAM, Philadelphia (2010)
Google Scholar
Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 83–94. Springer, Heidelberg (2007)
Chapter Google Scholar
Barbay, J., Claude, F., Navarro, G.: Compact rich-functional binary relation representations. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 170–183. Springer, Heidelberg (2010)
Google Scholar
Barbay, J., He, M., Munro, J.I., Srinivasa Rao, S.: Succinct indexes for string, bynary relations and multi-labeled trees. In: Procs. SODA, pp. 680–689 (2007)
Google Scholar
Barbay, J., Navarro, G.: Compressed representations of permutations, and applications. In: Procs. STACS, pp. 111–122 (2009)
Google Scholar
Barroso, L.A., Hölzle, U.: The case for energy-proportional computing. IEEE Computer 40(12), 33–37 (2007)
Google Scholar
Beckmann, A., Meyer, U., Sanders, P., Singler, J.: Energy-Efficient Sorting using Solid State Disks. In: Procs. IEEE Green Computing Conference (2010)
Google Scholar
Bender, M., Farach-Colton, M., Kuszmaul, B.: Cache-oblivious String B-trees. In: Procs. ACM PODS, pp. 233–242 (2006)
Google Scholar
Benoit, D., Demaine, E., Munro, I., Raman, R., Raman, V., Rao, S.: Representing trees of higher degree. Algorithmica 43, 275–292 (2005)
Article MATH MathSciNet Google Scholar
Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table compression with combinatorial optimization. J. ACM 50(6), 825–851 (2003)
Article MathSciNet Google Scholar
Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: Procs. ACM WSDM, pp. 95–106 (2008)
Google Scholar
Burrows, M., Wheeler, D.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
Google Scholar
Cameron, K.W., Pruhs, K., Irani, S., Ranganathan, P., Brooks, D.: Report of the science of power management workshop. NSF Report (August 2009)
Google Scholar
Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E.: Bigtable: A distributed storage system for structured data. ACM Trans. on Computer Systems 26(2) (2008)
Google Scholar
Chien, Y.-F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric BWT: Linking range searching and text indexing. In: Procs. IEEE DCC, pp. 252–261 (2008)
Google Scholar
Chiu, S.Y., Hon, W.K., Shah, R., Vitter, J.: I/O-efficient compressed text indexes: From theory to practice. In: Procs. IEEE DCC (2010)
Google Scholar
Claude, F., Navarro, G.: A fast and compact web graph representation. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 118–129. Springer, Heidelberg (2007)
Chapter Google Scholar
Claude, F., Navarro, G.: Practical Rank/Select queries over arbitrary sequences. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 176–187. Springer, Heidelberg (2008)
Chapter Google Scholar
Claude, F., Navarro, G.: Self-Indexed Text Compression using Straight-Line Programs. In: Královič, R., Niwiński, D. (eds.) MFCS 2009. LNCS, vol. 5734, pp. 235–246. Springer, Heidelberg (2009)
Chapter Google Scholar
Claude, F., Navarro, G.: Extended compact web graph representations. In: Elomaa, T. (ed.) Ukkonen Festschrift 2010. LNCS, vol. 6060, pp. 77–91. Springer, Heidelberg (2010)
Chapter Google Scholar
Cutting, D.: Apache Lucene (2008), http://lucene.apache.org/
Delpratt, O., Rahman, N., Raman, R.: Compressed prefix sums. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds.) SOFSEM 2007. LNCS, vol. 4362, pp. 235–247. Springer, Heidelberg (2007)
Chapter Google Scholar
Ding, S., Attenberg, J., Suel, T.: Scalable techniques for document identifier assignment in inverted indexes. In: Procs. WWW, pp. 311–320 (2010)
Google Scholar
Farzan, A., Munro, I.: Succinct Representations of Arbitrary Graphs. In: Halperin, D., Mehlhorn, K. (eds.) ESA 2008. LNCS, vol. 5193, pp. 393–404. Springer, Heidelberg (2008)
Chapter Google Scholar
Farzan, A., Raman, R., Rao, S.S.: Universal succinct representations of trees? In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009. LNCS, vol. 5555, pp. 451–462. Springer, Heidelberg (2009)
Google Scholar
Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 697–710. Springer, Heidelberg (2010)
Google Scholar
Ferragina, P., Giancarlo, R., Manzini, G.: The engineering of a compression boosting library: Theory vs practice in BWT compression. In: Azar, Y., Erlebach, T. (eds.) ESA 2006. LNCS, vol. 4168, pp. 756–767. Springer, Heidelberg (2006)
Chapter Google Scholar
Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 561–572. Springer, Heidelberg (2006)
Chapter Google Scholar
Ferragina, P., Giancarlo, R., Manzini, G., Sciortino, M.: Boosting textual compression in optimal linear time. J. ACM 52, 688–713 (2005)
Article MathSciNet Google Scholar
Ferragina, P., Gonzalez, R., Navarro, G., Venturini, R.: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics (2009)
Google Scholar
Ferragina, P., Grossi, R.: The string B-tree: A new data structure for string search in external memory and its applications. J. ACM 46(2), 236–280 (1999)
Article MATH MathSciNet Google Scholar
Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: Procs. ACM PODS, pp. 181–190 (2008)
Google Scholar
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and searching XML data via two zips. In: Procs. WWW, pp. 751–760 (2006)
Google Scholar
Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Compressing and indexing labeled trees, with applications. J. ACM 57(1) (2009)
Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Article MathSciNet Google Scholar
Ferragina, P., Manzini, G.: On compressing the textual web. In: Procs. ACM WSDM, pp. 391–400 (2010)
Google Scholar
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. on Algorithms 3(2) (2007)
Google Scholar
Ferragina, P., Nitto, I.: A delta-compressed storage scheme supporting I/O-efficient random access. Draft (2010)
Google Scholar
Ferragina, P., Nitto, I., Venturini, R.: On optimally partitioning a text to improve its compression. In: Fiat, A., Sanders, P. (eds.) ESA 2009. LNCS, vol. 5757, pp. 420–431. Springer, Heidelberg (2009)
Chapter Google Scholar
Ferragina, P., Nitto, I., Venturini, R.: On the bit-complexity of lempel-ziv compression. In: Procs. ACM-SIAM SODA, pp. 768–777 (2009)
Google Scholar
Ferragina, P., Rao, S.S.: Tree compression and indexing. In: Kao, M.-Y. (ed.) Encyclopedia of Algorithms. Springer, Heidelberg (2008)
Google Scholar
Ferragina, P., Venturini, R.: Compressed permuterm index. In: Procs. ACM SIGIR, pp. 535–542 (2007)
Google Scholar
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. In: Procs. ACM-SIAM SODA, pp. 690–696 (2007)
Google Scholar
Ferragina, P., Venturini, R.: Weighted compressed self-indexes. Draft (2010)
Google Scholar
Fischer, J.: Optimal succinctness for range minimum queries. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 158–169. Springer, Heidelberg (2010)
Google Scholar
Giancarlo, R., Restivo, A., Sciortino, M.: From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization. Theoretical Computer Science 387(3), 236–248 (2007)
MATH MathSciNet Google Scholar
Golynski, A.: Optimal lower bounds for rank and select indexes. Theoretical Computer Science 387, 348–359 (2007)
MATH MathSciNet Google Scholar
Golynski, A., Grossi, R., Gupta, A., Raman, R., Rao, S.S.: On the size of succinct indices. In: Arge, L., Hoffmann, M., Welzl, E. (eds.) ESA 2007. LNCS, vol. 4698, pp. 371–382. Springer, Heidelberg (2007)
Chapter Google Scholar
Grossi, R., Orlandi, A., Raman, R., Rao, S.S.: More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In: Procs STACS, pp. 517–528 (2009)
Google Scholar
Hon, W.K., Lam, T., Shah, R., Tam, S., Vitter, J.S.: Compressed index for dictionary matching. In: Procs. IEEE DCC, pp. 23–32 (2008)
Google Scholar
Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.: On entropy-compressed text indexing in external memory. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 75–89. Springer, Heidelberg (2009)
Chapter Google Scholar
Hon, W.K., Shah, R., Vitter, J.S.: Compression, indexing, and retrieval for massive string data. In: Amir, A., Parida, L. (eds.) CPM 2010. LNCS, vol. 6129, pp. 260–274. Springer, Heidelberg (2010)
Chapter Google Scholar
Jansson, J., Sadakane, K., Sung, W.K.: Ultra-succinct representation of ordered trees. In: Procs ACM-SIAM SODA, pp. 575–584 (2007)
Google Scholar
Kant, K.: Data center evolution: A tutorial on state of the art, issues, and challenges. Computer Networks 53(17), 2939–2965 (2009)
Article Google Scholar
Kant, K.: Toward a science of power management. IEEE Computer 42(9), 99–101 (2009)
Google Scholar
Kaplan, H., Landau, S., Verbin, E.: A simpler analysis of Burrows-Wheeler-based compression. Theoretical Computer Science 387(3), 220–235 (2007)
MATH MathSciNet Google Scholar
Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of DNA. BioInformatics 24(6), 791–797 (2008)
Article Google Scholar
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biology 10(R25) (2009)
Google Scholar
Li, H., Durbin, R.: Fast and accurate long-read alignment with burrows-wheeler transform. BioInformatics 26(5), 589–595 (2010)
Article Google Scholar
Li, Y., He, B., Luo, Q., Yi, K.: Tree indexing on flash disks. In: Procs. ICDE, pp. 1303–1306 (2009)
Google Scholar
Mäkinen, V., Navarro, G.: Implicit compression boosting with applications to self-indexing. In: Ziviani, N., Baeza-Yates, R. (eds.) SPIRE 2007. LNCS, vol. 4726, pp. 229–241. Springer, Heidelberg (2007)
Chapter Google Scholar
Mäkinen, V., Navarro, G., Siren, J., Valimaki, N.: Storage and Retrieval of Highly Repetitive Sequence Collections. Journal of Computational Biology 17(3), 281–308 (2010)
Article MathSciNet Google Scholar
Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3), 407–430 (2001)
Article MathSciNet Google Scholar
Mehlhorn, K., Ziegelmann, M.: Resource constrained shortest paths. In: Paterson, M. (ed.) ESA 2000. LNCS, vol. 1879, pp. 326–337. Springer, Heidelberg (2000)
Chapter Google Scholar
Nath, S., Gibbons, P.B.: Online maintenance of very large random samples on flash storage. VLDB Journal 19(1), 67–90 (2010)
Article Google Scholar
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
Google Scholar
Navarro, G., Russo, L.M.: Re-pair achieves high-order entropy. In: Procs. IEEE DCC, p. 537 (2008)
Google Scholar
Okanohara, D., Sadakane, K.: Practical entropy-compressed Rank/Select dictionary. In: Procs. ALENEX (2007)
Google Scholar
Patrascu, M.: Succincter. In: Procs. IEEE FOCS, pp. 305–313 (2008)
Google Scholar
Rajpoot, N., Şahinalp, C.: Dictionary-based data compression. In: Sayood, K. (ed.) Handbook of Lossless Data Compression, Academic Press, London (2002)
Google Scholar
Ranganathan, P.: Recipe for efficiency: Principles of power-aware computing. Comm. ACM 53(4), 60–67 (2010)
Article Google Scholar
Rivoire, S., Shah, M.A., Ranganathan, P., Kozyrakis, C.: Joulesort: a balanced energy-efficiency benchmark. In: Procs. ACM SIGMOD, pp. 365–376 (2007)
Google Scholar
Sadakane, K., Navarro, G.: Fully-functional succinct trees. In: Procs. ACM-SIAM SODA, pp. 134–149 (2010)
Google Scholar
Vigna, S.: Broadword implementation of rank/select queries. In: McGeoch, C.C. (ed.) WEA 2008. LNCS, vol. 5038, pp. 154–168. Springer, Heidelberg (2008)
Chapter Google Scholar
Vitter, J.: External memory algorithms and data structures. ACM Computing Surveys 33(2), 209–271 (2001)
Article Google Scholar
Vo, B.D., Vo, K.-P.: Compressing table data with column dependency. Theoretical Computer Science 387(3), 273–283 (2007)
MATH MathSciNet Google Scholar
Willets, K.: Full-text searching & the Burrows-Wheeler transform. Dr Dobbs Journal (December 2003)
Google Scholar
Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann Publishers, San Francisco (1999)
Google Scholar
Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Procs. WWW, pp. 401–410 (2009)
Google Scholar
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Information Theory 23, 337–343 (1977)
Article MATH MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Dipartimento di Informatica, Università di Pisa, Italy
Paolo Ferragina

Authors

Paolo Ferragina
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Mathematics and Computing Science, TU Eindhoven, Eindhoven, The Netherlands
Mark de Berg
Institute for Computer Science, J.W. Goethe University, 60325, Frankfurt/Main, Germany
Ulrich Meyer

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ferragina, P. (2010). Data Structures: Time, I/Os, Entropy, Joules!. In: de Berg, M., Meyer, U. (eds) Algorithms – ESA 2010. ESA 2010. Lecture Notes in Computer Science, vol 6347. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15781-3_1

Download citation

DOI: https://doi.org/10.1007/978-3-642-15781-3_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15780-6
Online ISBN: 978-3-642-15781-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics