Document-Centric OLAP in the Schema-Chaos World

Sismanis, Yannis; Reinwald, Berthold; Pirahesh, Hamid

doi:10.1007/978-3-540-73950-0_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4365)

Included in the following conference series:

International Workshop on Business Intelligence for the Real-Time Enterprise

Abstract

Gaining business insights, such as measuring the effectiveness of a product campaign, requires integrating a multitude of different data sources. Such data sources include in-house applications (like CRM, ERP), partner databases (like loyalty card data from retailers), and syndicated data sources (like credit reports from Experian). However, different data sources represent the same semantic attributes in different ways. For example, two XML schemas for purchase orders may represent price as /SAP46Order/Product/Price and /PeopleSoft/Item/Sold/Cost, respectively. Because the path to the same semantic information depends on the schema, it is difficult to index the data and difficult for query languages such as XQuery to process aggregation queries. Shredding the XML documents is not feasible due to the vast number of different schemas and the complexity of the XML documents. The only known approach today is to ETL every single document into a common schema and then use XQuery on the transformed data to perform aggregation. Such a solution does not scale well with the number of schemas or with their natural evolution. This paper presents a robust solution to document-centric OLAP over highly heterogeneous data. The solution exploits text indexing, which provides the necessary flexibility, together with well-established techniques for aggregation (like star-joins and bitmap processing). We present the overall architecture and the experimental performance results from our implementation.
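To make the problem concrete, the following is a minimal sketch of the general idea only, not the architecture described in the paper, whose details are not available in this abstract. It assumes three hypothetical purchase-order documents whose element names follow the two paths quoted above, and a hand-written synonym table (SYNONYMS) that maps leaf tags to a semantic attribute. Documents are indexed by leaf term rather than by full path, group membership is kept as an integer bitmap of document ids, and the price measure is summed over the documents selected by a bitmap.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents (not from the paper). The element names
# follow the two paths quoted in the abstract.
DOCS = [
    "<SAP46Order><Product><Name>widget</Name><Price>10.0</Price></Product></SAP46Order>",
    "<PeopleSoft><Item><Name>widget</Name><Sold><Cost>12.5</Cost></Sold></Item></PeopleSoft>",
    "<SAP46Order><Product><Name>gadget</Name><Price>7.0</Price></Product></SAP46Order>",
]

# Assumed synonym table: leaf tag -> semantic attribute. In practice this
# mapping would have to be discovered or curated; here it is hard-coded.
SYNONYMS = {"Price": "price", "Cost": "price", "Name": "product"}


def leaves(elem):
    """Yield (tag, text) for every leaf element at or below elem."""
    for node in elem.iter():
        if len(node) == 0 and node.text and node.text.strip():
            yield node.tag, node.text.strip()


def build_index(docs):
    """Index documents by leaf term, ignoring the path that leads to it.

    Returns (bitmaps, measures):
      bitmaps  maps (attribute, value) -> int whose bit i is set if doc i matches
      measures maps attribute -> {doc_id: numeric value}
    """
    bitmaps = defaultdict(int)
    measures = defaultdict(dict)
    for doc_id, xml_text in enumerate(docs):
        root = ET.fromstring(xml_text)
        for tag, text in leaves(root):
            attr = SYNONYMS.get(tag)
            if attr is None:
                continue  # unknown tag: skipped in this sketch
            if attr == "price":
                measures[attr][doc_id] = float(text)
            else:
                bitmaps[(attr, text)] |= 1 << doc_id
    return bitmaps, measures


def sum_measure(bitmap, values):
    """Sum the measure over the documents whose bit is set in bitmap."""
    return sum(v for doc_id, v in values.items() if (bitmap >> doc_id) & 1)


bitmaps, measures = build_index(DOCS)
widget_total = sum_measure(bitmaps[("product", "widget")], measures["price"])
gadget_total = sum_measure(bitmaps[("product", "gadget")], measures["price"])
print(widget_total, gadget_total)
```

Both "widget" orders contribute to the same total even though their prices sit under different paths, which is the behaviour a path-based index or a single XQuery path expression would not give without a prior transformation into a common schema.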

Author information

Authors and Affiliations

IBM Almaden Research Center,
Yannis Sismanis, Berthold Reinwald & Hamid Pirahesh

Editor information

Christoph Bussler, Malu Castellanos, Umesh Dayal, Sham Navathe

About this paper

Cite this paper

Sismanis, Y., Reinwald, B., Pirahesh, H. (2007). Document-Centric OLAP in the Schema-Chaos World. In: Bussler, C., Castellanos, M., Dayal, U., Navathe, S. (eds) Business Intelligence for the Real-Time Enterprises. BIRTE 2006. Lecture Notes in Computer Science, vol 4365. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73950-0_7

DOI: https://doi.org/10.1007/978-3-540-73950-0_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73949-4
Online ISBN: 978-3-540-73950-0
eBook Packages: Computer Science, Computer Science (R0)
