
Statistical Analysis as Methodological Framework for Data(base) Integration

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2813)

Abstract

We propose a methodological framework for building a statistical integration model for heterogeneous data sources.

We apply latent class analysis, a well-established statistical method, to model the relationships between entities in the data sources as relationships among dependent variables, with the aim of discovering the latent factors that affect them. These latent factors correspond to real-world entities, which are unobservable in the sense that only the stored data are known, not the real-world class memberships.
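For orientation, the standard textbook form of a latent class model can be written as follows. The abstract does not show the paper's own notation, so every symbol here is an assumption: $x = (x_1, \dots, x_m)$ are the observed (stored) variables, $c$ is the unobserved real-world class, $\pi_c$ is its prior probability, and the observed variables are assumed conditionally independent given the class.

```latex
P(x) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{m} P(x_j \mid c),
\qquad
P(c \mid x) = \frac{\pi_c \prod_{j=1}^{m} P(x_j \mid c)}
                   {\sum_{c'=1}^{C} \pi_{c'} \prod_{j=1}^{m} P(x_j \mid c')}
```

The posterior $P(c \mid x)$ is the general quantity that a phrase such as "probability of class membership" usually refers to in this kind of model; whether the paper uses exactly this parameterization cannot be confirmed from the abstract.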

The approach makes it possible to evaluate the uncertainties that accumulate during the integration process. The key parameter evaluated by the method is the probability of real-world class membership. Its value depends on the selection criteria applied in the pre-integration stages and in the subsequent integration steps. By adjusting the selection criteria and the integration strategies, the proposed framework makes it possible to improve data quality by optimizing the integration process.
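As an illustration only, the following is a minimal sketch of how class membership probabilities might be estimated with a generic latent class model fitted by expectation-maximization. It is not the authors' implementation. The function name `fit_latent_class`, the binary indicator encoding (for example, whether two candidate records agree on an attribute), and all parameter values are assumptions made for this example.

```python
import numpy as np


def _posterior(X, prior, theta):
    """Posterior class membership P(class | record) for binary indicators."""
    # Log-likelihood of each record under each class, assuming the
    # indicators are conditionally independent given the class.
    log_lik = X @ np.log(theta).T + (1.0 - X) @ np.log(1.0 - theta).T
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


def fit_latent_class(X, n_classes=2, n_iter=200, seed=0):
    """Fit a latent class model to 0/1 data with a plain EM loop.

    X has shape (n_records, n_indicators). Returns the class priors,
    the per-class indicator probabilities, and the posterior memberships.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    prior = np.full(n_classes, 1.0 / n_classes)
    theta = rng.uniform(0.25, 0.75, size=(n_classes, m))  # random start
    for _ in range(n_iter):
        # E-step: soft assignment of every record to every class.
        post = _posterior(X, prior, theta)
        # M-step: re-estimate the parameters from the soft assignments.
        weight = np.maximum(post.sum(axis=0), 1e-12)
        prior = np.clip(weight / n, 1e-12, None)
        theta = np.clip((post.T @ X) / weight[:, None], 1e-6, 1.0 - 1e-6)
    return prior, theta, _posterior(X, prior, theta)


# Hypothetical demo data: 300 records, 4 binary indicators, 2 hidden classes.
demo_rng = np.random.default_rng(42)
hidden = demo_rng.integers(0, 2, size=300)
demo_theta = np.array([[0.1] * 4, [0.9] * 4])
demo_X = (demo_rng.random((300, 4)) < demo_theta[hidden]).astype(int)

demo_prior, demo_est, demo_post = fit_latent_class(demo_X)
print("estimated class priors:", demo_prior)
print("membership probabilities of first record:", demo_post[0])
```

In a setting like the one the abstract describes, changing the selection criteria would change which records and indicators enter `X`, and the fitted membership probabilities would shift accordingly. Class labels from EM are arbitrary, so classes have to be interpreted after fitting.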

Part of this work has been supported by the German Science Foundation DFG (grant no. CO 207/13-1).



Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Altareva, E., Conrad, S. (2003). Statistical Analysis as Methodological Framework for Data(base) Integration. In: Song, IY., Liddle, S.W., Ling, TW., Scheuermann, P. (eds) Conceptual Modeling - ER 2003. ER 2003. Lecture Notes in Computer Science, vol 2813. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39648-2_5

  • DOI: https://doi.org/10.1007/978-3-540-39648-2_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20299-8

  • Online ISBN: 978-3-540-39648-2

  • eBook Packages: Springer Book Archive
