Indexing dataspaces with partitions

World Wide Web

Abstract

Dataspaces have recently been proposed to manage heterogeneous data that is partially unstructured, high-dimensional, and extremely sparse. The inverted index has previously been extended to support retrieval over dataspaces. To achieve more efficient access, this paper first presents a survey of data features in real dataspaces. Based on the observed features, we propose several partitioning-based index approaches to accelerate query processing in dataspaces. Specifically, the vertical partitioning index uses partitions on tokens to merge and compress data, which both reduces the number of I/O reads and avoids aggregating data inside a compressed list. The horizontal partitioning index supports pruning partitions of tuples in top-k queries, which reduces the computation overhead on candidate tuples irrelevant to the query. Finally, we propose a hybrid index that combines vertical and horizontal partitioning. Extensive experiments on real data sets demonstrate that our approaches outperform previous techniques and scale well with large data sizes.
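The idea of pruning partitions of tuples in a top-k query can be illustrated with a small sketch. This is not the paper's algorithm: it assumes that tuples are sets of attribute:value tokens, that a tuple's score is simply the number of tokens it shares with the query, and that each partition's upper bound is the overlap between the query and the union of the partition's tokens. The names `partition_bound` and `topk_with_pruning`, and the sample data, are hypothetical.

```python
import heapq

def partition_bound(query, partition):
    """Upper bound on any tuple's score in this partition (assumed scoring:
    number of tokens shared with the query)."""
    all_tokens = set().union(*partition.values())
    return len(query & all_tokens)

def topk_with_pruning(query, partitions, k):
    """Return the top-k (score, tuple_id) pairs and the number of partitions scanned."""
    query = set(query)
    # Visit partitions in descending order of their upper bound.
    order = sorted(
        ((partition_bound(query, p), i) for i, p in enumerate(partitions)),
        reverse=True,
    )
    heap = []      # min-heap holding the current best k (score, tuple_id) pairs
    scanned = 0
    for bound, i in order:
        # No tuple in this or any later partition can beat the current k-th score.
        if len(heap) == k and heap[0][0] >= bound:
            break
        scanned += 1
        for tid, tokens in partitions[i].items():
            score = len(query & tokens)
            if len(heap) < k:
                heapq.heappush(heap, (score, tid))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, tid))
    return sorted(heap, reverse=True), scanned

# Hypothetical data: two horizontal partitions of heterogeneous tuples.
partitions = [
    {1: {"name:ann", "city:paris"},
     2: {"name:bob", "city:paris", "job:chef"}},
    {3: {"title:dune", "author:herbert"},
     4: {"title:emma", "author:austen"}},
]
result, scanned = topk_with_pruning({"city:paris", "job:chef"}, partitions, k=1)
print(result, scanned)
```

In this toy run the second partition shares no tokens with the query, so its bound is 0 and it is skipped without scoring any of its tuples. A real index would store the bounds alongside the partitions rather than recomputing them per query.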

Author information

Corresponding author

Correspondence to Shaoxu Song.

About this article

Cite this article

Song, S., Chen, L. Indexing dataspaces with partitions. World Wide Web 16, 141–170 (2013). https://doi.org/10.1007/s11280-012-0163-7

  • DOI: https://doi.org/10.1007/s11280-012-0163-7
