Abstract
We propose a methodological framework for building a statistical integration model for heterogeneous data sources.
We apply latent class analysis, a well-established statistical method, to model the relationships between entities in the data sources as relationships among dependent variables, with the aim of discovering the latent factors that affect them. The latent factors are associated with real-world entities, which are unobservable in the sense that we know only the stored data, not the real-world class memberships.
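For orientation, the standard textbook form of a latent class model can be written as follows. The notation is an assumption introduced here and is not taken from the paper: X is the latent class variable with C classes, and y_1, ..., y_J are the observed indicators derived from the stored data. The paper's own parameterization may differ.

\[
P(y_1, \dots, y_J) \;=\; \sum_{c=1}^{C} P(X = c) \prod_{j=1}^{J} P(y_j \mid X = c)
\]

The product term expresses the local independence assumption: given the latent class, the observed indicators are treated as independent.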
The approach provides an evaluation of the uncertainties that accumulate during the integration process. The key parameter evaluated by the method is the probability of real-world class membership. Its value varies depending on the selection criteria applied in the pre-integration stages and in the subsequent integration steps. By adjusting the selection criteria and the integration strategies, the proposed framework makes it possible to improve data quality by optimizing the integration process.
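As a rough illustration of what a class membership probability looks like in a latent class setting, the following minimal sketch applies Bayes' rule under local independence to binary indicators. The function name `posterior_membership`, the two-class setup, and all numeric parameters are hypothetical and chosen only for illustration; they are not the authors' model or data.

```python
from math import prod


def posterior_membership(priors, cond_probs, pattern):
    """Return P(class | pattern) for each latent class via Bayes' rule.

    priors[c]        : assumed prior probability of latent class c
    cond_probs[c][j] : assumed P(indicator j = 1 | class c)
    pattern[j]       : observed 0/1 value of indicator j
    """
    joint = []
    for prior, probs in zip(priors, cond_probs):
        # Local independence: indicators are independent given the class.
        likelihood = prod(p if y == 1 else 1.0 - p for p, y in zip(probs, pattern))
        joint.append(prior * likelihood)
    total = sum(joint)
    return [value / total for value in joint]


# Hypothetical parameters: two latent classes, two binary indicators
# (for example, "name matches" and "address matches").
priors = [0.5, 0.5]
cond_probs = [[0.9, 0.8],   # class 0: indicators are usually 1
              [0.2, 0.1]]   # class 1: indicators are usually 0

clear_case = posterior_membership(priors, cond_probs, [1, 1])
ambiguous_case = posterior_membership(priors, cond_probs, [1, 0])
print(clear_case)
print(ambiguous_case)
```

With these assumed values, the pattern in which both indicators agree is assigned to class 0 with high probability, while the mixed pattern leaves the membership undecided. Changing the conditional probabilities, which here stand in for stricter or looser selection criteria, shifts these membership probabilities.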
Part of this work has been supported by the German Science Foundation DFG (grant no. CO 207/13-1).
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Altareva, E., Conrad, S. (2003). Statistical Analysis as Methodological Framework for Data(base) Integration. In: Song, IY., Liddle, S.W., Ling, TW., Scheuermann, P. (eds) Conceptual Modeling - ER 2003. ER 2003. Lecture Notes in Computer Science, vol 2813. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39648-2_5
DOI: https://doi.org/10.1007/978-3-540-39648-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20299-8
Online ISBN: 978-3-540-39648-2
eBook Packages: Springer Book Archive