Abstract
The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation.
This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
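As a rough illustration of the goals named above, the following Python sketch fuses duplicate records into one record per real-world object. It is not one of the article's operators. It assumes duplicates have already been identified and share a key (`id` here), and the helper names `fuse` and `most_frequent_then_first` and the sample data are hypothetical. Missing values (`None`) stand in for uncertain data, and differing non-null values stand in for conflicting data.

```python
from collections import defaultdict


def most_frequent_then_first(attr, values):
    """Hypothetical conflict-resolution function: pick the most frequent
    value; ties go to the value seen first."""
    return max(values, key=values.count)


def fuse(records, key, resolve):
    """Fuse all records sharing the same key into a single record.

    Assumption: duplicate detection is already done, so records that
    describe the same real-world object carry the same key value.
    """
    # Union of all attributes, so every fused record has the same schema.
    attrs = sorted({a for rec in records for a in rec if a != key})

    # Group records by object identity.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    fused = []
    for k, group in groups.items():
        out = {key: k}
        for attr in attrs:
            # Ignore missing values; resolve whatever remains to one value.
            values = [rec[attr] for rec in group if rec.get(attr) is not None]
            out[attr] = resolve(attr, values) if values else None
        fused.append(out)
    return fused


records = [
    {"id": 1, "name": "Alice Smith", "city": None, "age": 30},
    {"id": 1, "name": "Alice Smith", "city": "Berlin", "age": 31},
    {"id": 2, "name": "Bob", "city": "Paris"},
]

fused = fuse(records, "id", most_frequent_then_first)
for row in fused:
    print(row)
```

In this sketch the result is concise (one record per `id`), complete (every object and every attribute from any source is retained), and consistent (each attribute holds a single value). The choice of resolution function is where a real system would encode its conflict-handling strategy; the one used here is only a placeholder.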
Index Terms
- Data fusion