Document-Centric OLAP in the Schema-Chaos World

  • Conference paper
Business Intelligence for the Real-Time Enterprises (BIRTE 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4365)

Abstract

Gaining business insights, such as measuring the effectiveness of a product campaign, requires the integration of a multitude of different data sources. Such data sources include in-house applications (like CRM, ERP), partner databases (like loyalty card data from retailers), and syndicated data sources (like credit reports from Experian). However, different data sources represent the same semantic attributes in different ways. For example, two XML schemas for purchase orders may represent price as /SAP46Order/Product/Price and /PeopleSoft/Item/Sold/Cost, respectively. Because the path to the same semantic information depends on the schema, it is difficult to index the data, and difficult for query languages such as XQuery to process aggregation queries. Shredding the XML documents is not feasible due to the vast number of different schemas and the complexity of the XML documents. The only known approach today is to ETL every single document into a common schema, and then use XQuery on the transformed data to perform aggregation. Such a solution does not scale well with the number of schemas or their natural evolution. This paper presents a robust solution to document-centric OLAP over highly heterogeneous data. The solution is based on the exploitation of text indexing, which provides the necessary flexibility, and of well-established techniques for aggregation (like star-joins and bitmap processing). We present the overall architecture and the experimental performance results from our implementation.
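
The path problem described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the mapping table `PATH_TO_ATTRIBUTE`, the sample documents, and the helper names are all hypothetical, and a plain in-memory dictionary stands in for a real text index. The sketch only shows the idea of indexing values under a shared semantic attribute so that an aggregate can be computed without first transforming every document into a common schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mapping from schema-specific paths to one semantic attribute.
# The two paths are the examples given in the abstract.
PATH_TO_ATTRIBUTE = {
    "/SAP46Order/Product/Price": "price",
    "/PeopleSoft/Item/Sold/Cost": "price",
}

# Made-up sample documents, one per schema.
DOCS = {
    "doc1": "<SAP46Order><Product><Price>10.50</Price></Product></SAP46Order>",
    "doc2": "<PeopleSoft><Item><Sold><Cost>4.25</Cost></Sold></Item></PeopleSoft>",
}


def leaf_paths(elem, prefix=""):
    """Yield (absolute path, text) for every leaf element under elem."""
    path = f"{prefix}/{elem.tag}"
    children = list(elem)
    if not children:
        yield path, (elem.text or "").strip()
    for child in children:
        yield from leaf_paths(child, path)


def build_index(docs):
    """Index numeric values by semantic attribute instead of by raw path."""
    index = defaultdict(list)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for path, text in leaf_paths(root):
            attr = PATH_TO_ATTRIBUTE.get(path)
            if attr is not None:
                index[attr].append((doc_id, float(text)))
    return index


def total(index, attr):
    """Sum all indexed values for one semantic attribute."""
    return sum(value for _, value in index[attr])


index = build_index(DOCS)
print(total(index, "price"))  # 14.75
```

Adding a third schema in this sketch means adding one entry to the mapping table rather than writing a new transformation for every document, which is the scaling concern the abstract raises about the ETL approach.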

Editor information

Christoph Bussler, Malu Castellanos, Umesh Dayal, Sham Navathe

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sismanis, Y., Reinwald, B., Pirahesh, H. (2007). Document-Centric OLAP in the Schema-Chaos World. In: Bussler, C., Castellanos, M., Dayal, U., Navathe, S. (eds) Business Intelligence for the Real-Time Enterprises. BIRTE 2006. Lecture Notes in Computer Science, vol 4365. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73950-0_7

  • DOI: https://doi.org/10.1007/978-3-540-73950-0_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73949-4

  • Online ISBN: 978-3-540-73950-0

  • eBook Packages: Computer Science, Computer Science (R0)
