TextGen: a realistic text data content generation method for modern storage system benchmarks

Wang, Long-xiang; Dong, Xiao-she; Zhang, Xing-jun; Wang, Yin-feng; Ju, Tao; Feng, Guo-fu

doi:10.1631/FITEE.1500332

Published: 14 October 2016

Volume 17, pages 982–993, (2016)

Frontiers of Information Technology & Electronic Engineering

Long-xiang Wang¹, Xiao-she Dong¹, Xing-jun Zhang¹, Yin-feng Wang², Tao Ju¹ & Guo-fu Feng³

Abstract

Modern storage systems incorporate data compressors to improve their performance and capacity. As a result, data content can significantly influence the result of a storage system benchmark. Because real-world proprietary datasets are too large to be copied onto a test storage system, and most data cannot be shared due to privacy issues, a benchmark needs to generate data synthetically. To ensure that the result is accurate, it is necessary to generate data content based on the characterization of real-world data properties that influence the storage system performance during the execution of a benchmark. The existing approach, called SDGen, cannot guarantee that the benchmark result is accurate in storage systems that have built-in word-based compressors. The reason is that SDGen characterizes the properties that influence compression performance only at the byte level, and no properties are characterized at the word level. To address this problem, we present TextGen, a realistic text data content generation method for modern storage system benchmarks. TextGen builds the word corpus by segmenting real-world text datasets, and creates a word-frequency distribution by counting each word in the corpus. To improve data generation performance, the word-frequency distribution is fitted to a lognormal distribution by maximum likelihood estimation. The Monte Carlo approach is used to generate synthetic data. The running time of TextGen generation depends only on the expected data size, which means that the time complexity of TextGen is O(n). To evaluate TextGen, four real-world datasets were used to perform an experiment. The experimental results show that, compared with SDGen, the compression performance and compression ratio of the datasets generated by TextGen deviate less from real-world datasets when end-tagged dense code, a representative of word-based compressors, is evaluated.
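
The steps named in the abstract (segment text into words, count word frequencies, fit a lognormal distribution by maximum likelihood, then generate data by Monte Carlo sampling) can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the regex-based word segmentation, and the choice to draw one lognormal weight per vocabulary word before sampling are all hypothetical. The abstract does not specify how the fitted distribution is mapped back onto words.

```python
import bisect
import itertools
import math
import random
import re
from collections import Counter


def word_frequencies(text):
    """Segment text into words (assumed: regex tokens) and count each one."""
    return Counter(re.findall(r"\w+", text.lower()))


def fit_lognormal(counts):
    """Closed-form MLE for a lognormal: mean and std of the log-counts."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def generate(vocab, mu, sigma, target_chars, rng):
    """Monte Carlo generation (assumed scheme): draw one weight per word from
    the fitted lognormal, then sample words in proportion to those weights
    until the requested number of characters is reached."""
    weights = [rng.lognormvariate(mu, sigma) for _ in vocab]
    cumulative = list(itertools.accumulate(weights))
    out, size = [], 0
    while size < target_chars:
        # Inverse-CDF lookup: pick the word whose cumulative weight covers r.
        r = rng.random() * cumulative[-1]
        word = vocab[bisect.bisect_left(cumulative, r)]
        out.append(word)
        size += len(word) + 1  # +1 accounts for the separating space
    return " ".join(out)
```

In this sketch each draw is a binary search over the cumulative weights, so for a fixed vocabulary the work grows linearly with the requested output size, which matches the O(n) behaviour the abstract describes. A typical call would compute `freq = word_frequencies(corpus_text)`, fit `mu, sigma = fit_lognormal(list(freq.values()))`, and then call `generate(sorted(freq), mu, sigma, target_chars, random.Random(seed))`.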

Author information

Authors and Affiliations

School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, 710049, China
Long-xiang Wang, Xiao-she Dong, Xing-jun Zhang & Tao Ju
Shenzhen Institute of Information Technology, Shenzhen, 518172, China
Yin-feng Wang
College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
Guo-fu Feng

Corresponding author

Correspondence to Xing-jun Zhang.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61572394 and 61272098), the Shenzhen Fundamental Research Plan (Nos. JCYJ20120615101127404 and JSGG20140519141854753), and the National Key Technologies R&D Program of China (No. 2011BAH04B03).

ORCID: Xing-jun ZHANG, http://orcid.org/0000-0003-1434-7016

About this article

Cite this article

Wang, Lx., Dong, Xs., Zhang, Xj. et al. TextGen: a realistic text data content generation method for modern storage system benchmarks. Frontiers Inf Technol Electronic Eng 17, 982–993 (2016). https://doi.org/10.1631/FITEE.1500332

Received: 13 October 2015
Accepted: 21 March 2016
Published: 14 October 2016
Issue Date: October 2016
DOI: https://doi.org/10.1631/FITEE.1500332

CLC number

TP311.1
