Abstract
Since Microsoft Office 2007, the Office Open XML file formats have been the default file formats of Microsoft Office. Because large numbers of office documents are stored and transferred every day, reducing document size benefits both storage and transfer. We present a compressed format for XML-based office documents that omits the data in an office document that is already defined by the Office Open XML format. Our evaluation shows that this format reduces office documents, which are already compressed, to as little as 41% of their original size. Furthermore, for the search operations tested in our evaluation, searching the compressed office documents is faster than searching the original documents.
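To make the idea of omitting schema-defined data more concrete, the following Python sketch may help. It is not the encoding used in the paper; it assumes a hypothetical toy schema (the names `SCHEMA`, `encode`, `decode` and the elements `doc`, `para`, `props`, `run`, `t` are invented for illustration and are not the Office Open XML schema). The sketch stores only the decisions the schema leaves open (whether an optional child is present, how often a child repeats, which alternative of a choice was taken) plus the text values; element names and required children are recovered from the schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema (an assumption, NOT the Office Open XML schema).
# Each element maps to a list of (kind, names) particles:
#   "one"    required child      -> nothing is stored
#   "opt"    optional child      -> one presence flag is stored
#   "many"   repeated child      -> the repetition count is stored
#   "choice" one of several tags -> the index of the chosen tag is stored
#   "text"   character data      -> the text goes to a separate list
SCHEMA = {
    "doc": [("many", ["para"])],
    "para": [("opt", ["props"]), ("many", ["run"])],
    "props": [("choice", ["left", "center", "right"])],
    "run": [("one", ["t"])],
    "t": [("text", [])],
    "left": [], "center": [], "right": [],
}

def encode(elem, decisions, texts):
    """Record only what the schema does not already determine."""
    children = list(elem)
    pos = 0
    for kind, names in SCHEMA[elem.tag]:
        if kind == "text":
            texts.append(elem.text or "")
        elif kind == "one":
            encode(children[pos], decisions, texts)
            pos += 1
        elif kind == "opt":
            present = pos < len(children) and children[pos].tag == names[0]
            decisions.append(int(present))
            if present:
                encode(children[pos], decisions, texts)
                pos += 1
        elif kind == "many":
            start = pos
            while pos < len(children) and children[pos].tag == names[0]:
                pos += 1
            decisions.append(pos - start)
            for child in children[start:pos]:
                encode(child, decisions, texts)
        elif kind == "choice":
            decisions.append(names.index(children[pos].tag))
            encode(children[pos], decisions, texts)
            pos += 1

def decode(tag, decisions, texts):
    """Rebuild an element from the schema plus iterators over the stored data."""
    elem = ET.Element(tag)
    for kind, names in SCHEMA[tag]:
        if kind == "text":
            elem.text = next(texts)
        elif kind == "one":
            elem.append(decode(names[0], decisions, texts))
        elif kind == "opt":
            if next(decisions):
                elem.append(decode(names[0], decisions, texts))
        elif kind == "many":
            for _ in range(next(decisions)):
                elem.append(decode(names[0], decisions, texts))
        elif kind == "choice":
            elem.append(decode(names[next(decisions)], decisions, texts))
    return elem

xml = ("<doc><para><props><center/></props>"
       "<run><t>Hello</t></run><run><t>World</t></run></para>"
       "<para><run><t>Bye</t></run></para></doc>")
root = ET.fromstring(xml)
decisions, texts = [], []
encode(root, decisions, texts)
restored = decode("doc", iter(decisions), iter(texts))
print(decisions, texts)
```

In this toy setting the structure of the whole document reduces to six small integers, and the text values are kept in a separate list that could be compressed or searched on its own. A realistic encoder would also have to handle attributes, namespaces, mixed content, and the full content models of a real schema.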
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Böttcher, S., Hartel, R., Messinger, C. (2010). Searchable Compression of Office Documents by XML Schema Subtraction. In: Lee, M.L., Yu, J.X., Bellahsène, Z., Unland, R. (eds) Database and XML Technologies. XSym 2010. Lecture Notes in Computer Science, vol 6309. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15684-7_9
DOI: https://doi.org/10.1007/978-3-642-15684-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15683-0
Online ISBN: 978-3-642-15684-7
eBook Packages: Computer Science (R0)