Integrating semantically heterogeneous aggregate views of distributed databases

Distributed and Parallel Databases

Abstract

In statistical databases and data warehousing applications it is commonly the case that aggregate views are maintained as an underlying mechanism for summarising information. Where the databases or applications are distributed, or arise from independent data collections or system developments, there may be incompatibility, heterogeneity, and data inconsistency. These challenges need to be overcome if federations of aggregated databases are to be successfully incorporated into systems for database management, querying, retrieval, and knowledge discovery.

In this paper we address the issue of integrating aggregate views that have semantically heterogeneous classification schemes. In previous work we have developed a methodology that is efficient but that cannot easily handle data inconsistencies. Our previous approach is therefore not particularly well-suited to very large databases or federations of large numbers of databases. We now address these scalability issues by introducing a methodology for heterogeneous aggregate view integration that constructs a dynamic shared ontology to which each of the aggregate views can be explicitly related. A maximum likelihood technique, implemented using the EM (Expectation-Maximisation) algorithm, is used to inherently handle data inconsistencies in the computation of integrated aggregates that are described in terms of the dynamic shared ontology.
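As a rough illustration of the kind of computation described above, the following is a minimal sketch of the textbook EM update for categorical counts that have been grouped under different classification schemes. It is not the authors' implementation. The shared ontology is assumed here to be a plain set of fine-grained categories, each aggregate view is assumed to be a list of (set of fine categories, count) cells, and the names `em_integrate`, `views`, and `categories` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a view is a list of cells; each cell pairs the set of
# fine-grained (shared-ontology) categories it covers with its count.
View = List[Tuple[FrozenSet[str], float]]


def em_integrate(
    views: List[View],
    categories: List[str],
    max_iters: int = 500,
    tol: float = 1e-10,
) -> Dict[str, float]:
    """Estimate proportions over the fine categories by maximum likelihood."""
    # Start from a uniform distribution over the fine categories.
    pi = {c: 1.0 / len(categories) for c in categories}
    total = sum(count for view in views for _, count in view)

    for _ in range(max_iters):
        # E-step: split each cell's count over its fine categories in
        # proportion to the current estimate.
        expected = {c: 0.0 for c in categories}
        for view in views:
            for cell, count in view:
                mass = sum(pi[c] for c in cell)
                if mass == 0.0:
                    continue  # only reachable for cells with zero count
                for c in cell:
                    expected[c] += count * pi[c] / mass

        # M-step: renormalise the expected counts into proportions.
        new_pi = {c: expected[c] / total for c in categories}
        delta = max(abs(new_pi[c] - pi[c]) for c in categories)
        pi = new_pi
        if delta < tol:
            break
    return pi


# Two hypothetical views that classify the same attribute differently.
categories = ["A", "B", "C"]
views = [
    [(frozenset({"A"}), 30.0), (frozenset({"B", "C"}), 70.0)],
    [(frozenset({"A", "B"}), 50.0), (frozenset({"C"}), 50.0)],
]
estimate = em_integrate(views, categories)
```

In this toy input the two views are mutually consistent, so the estimate approaches proportions of 0.3, 0.2, and 0.5 for A, B, and C. When views disagree, the same update still returns a single distribution that balances the conflicting counts, which is the sense in which a likelihood-based approach can absorb inconsistencies.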

Author information

Corresponding author

Correspondence to Philip Morrow.

Additional information

Recommended by: Ahmed K. Elmagarmid.

About this article

Cite this article

McClean, S., Scotney, B., Morrow, P. et al. Integrating semantically heterogeneous aggregate views of distributed databases. Distrib Parallel Databases 24, 73–94 (2008). https://doi.org/10.1007/s10619-008-7031-6

  • DOI: https://doi.org/10.1007/s10619-008-7031-6
