Workload-Aware Self-Tuning Histograms of String Data

  • Conference paper
Database and Expert Systems Applications (Globe 2015, DEXA 2015)

Abstract

In this paper we extend STHoles, a very successful algorithm that uses query results to build and maintain multi-dimensional histograms of numerical data. Our contribution is the formal definition of extensions of all relevant concepts, such that they are independent of the domain of the data but subsume the STHoles concepts as their numerical specialization. We also derive specializations for the string domain and implement them in a prototype that we use to empirically validate our approach. Our current implementation uses string prefixes as the machinery for describing string ranges. Although weaker than regular expressions, prefixes can be applied very efficiently and can capture interesting ranges in hierarchically structured string domains, such as those of filesystem pathnames and URIs. In fact, we base the empirical validation of the approach on existing, publicly available Semantic Web data, where we demonstrate convergence to accurate and efficient histograms.
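
To make the idea of prefixes as string ranges concrete, the following is a minimal Python sketch of nested prefix buckets. It is not the STRHist implementation: the class `PrefixBucket`, its methods, and the example URIs are hypothetical names chosen for illustration, and the sketch assumes that sibling buckets have disjoint prefixes that extend their parent's prefix. It only shows how a prefix can delimit a range, how narrower ranges can be nested inside wider ones, and how an upper bound on the number of matching strings can be read off the structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrefixBucket:
    """Hypothetical bucket covering every string that starts with `prefix`.

    `count` holds the strings that fall in this bucket but in none of its
    nested children. Children are assumed to have disjoint prefixes that
    extend this bucket's prefix.
    """
    prefix: str
    count: int = 0
    children: List["PrefixBucket"] = field(default_factory=list)

    def total(self) -> int:
        """Number of strings in this bucket, nested children included."""
        return self.count + sum(c.total() for c in self.children)

    def insert(self, s: str) -> None:
        """Count `s` in the deepest bucket whose prefix matches it."""
        for child in self.children:
            if s.startswith(child.prefix):
                child.insert(s)
                return
        self.count += 1

    def estimate(self, query_prefix: str) -> int:
        """Upper bound on the number of strings starting with `query_prefix`."""
        if self.prefix.startswith(query_prefix):
            # The bucket's range lies entirely inside the query range.
            return self.total()
        if not query_prefix.startswith(self.prefix):
            # Neither prefix extends the other: the ranges are disjoint.
            return 0
        # The query is narrower than this bucket. The bucket's own strings
        # cannot be told apart, so all of them are counted (upper bound).
        return self.count + sum(c.estimate(query_prefix) for c in self.children)


# Illustrative data only; the URIs below are made up.
root = PrefixBucket("")
root.children.append(PrefixBucket("http://example.org/a/"))
root.children.append(PrefixBucket("http://example.org/b/"))
for uri in ["http://example.org/a/1", "http://example.org/a/2",
            "http://example.org/b/1", "ftp://other.example/x"]:
    root.insert(uri)

print(root.estimate("http://example.org/a/"))
```

The estimate is deliberately pessimistic: strings held directly by a wider bucket are all counted whenever the query is narrower than that bucket, so the result never falls below the true count for the inserted data.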

Notes

  1. STRHist is available at https://bitbucket.org/acharal/strhist

    For more details on Semagrow, please see http://www.semagrow.eu.

  2. Please see http://agris.fao.org for more details on AGRIS. The AGRIS site mentions 7 million distinct publications, but this includes recent additions that are not in the end-2013 data dump used for these experiments.

  3. We use the canonical string representation of URIs as defined in Sect. 2, IETF RFC 7320 (http://tools.ietf.org/html/rfc7320).

Acknowledgements

The work described here was partially carried out at the 2014 edition of the International Research-Centred Summer School, held at NCSR ‘Demokritos’, Athens, Greece, 3–30 July 2014. For more details, please see http://irss.iit.demokritos.gr.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007–2013) under grant agreement No. 318497. More details at http://www.semagrow.eu.

Author information

Corresponding author

Correspondence to Stasinos Konstantopoulos.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zoulis, N., Mavroudi, E., Lykoura, A., Charalambidis, A., Konstantopoulos, S. (2015). Workload-Aware Self-Tuning Histograms of String Data. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe 2015, DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_20

  • DOI: https://doi.org/10.1007/978-3-319-22849-5_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22848-8

  • Online ISBN: 978-3-319-22849-5

  • eBook Packages: Computer Science, Computer Science (R0)
