research-article

Data Integration: After the Teenage Years

Authors:

Behzad Golshan,

George Mihaila,

Wang-Chiew Tan

PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pages 101 - 106

https://doi.org/10.1145/3034786.3056124

Published: 09 May 2017

Abstract

The field of data integration has expanded significantly over the years, from providing a uniform query and update interface to structured databases within an enterprise to the ability to search, exchange, and even update structured or unstructured data that are within or external to the enterprise. This paper describes the evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990s. In addition, we describe two important challenges for the field going forward. The first challenge is to develop good open-source tools for different components of data integration pipelines. The second challenge is to provide practitioners with viable solutions for the long-standing problem of systematically combining structured and unstructured data.

Index Terms

Data Integration: After the Teenage Years
1. Information systems
  1. Data management systems
    1. Information integration
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Data structures and algorithms for data management

Published In

PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

May 2017

458 pages

ISBN:9781450341981

DOI:10.1145/3034786

Conference Chair: Emanuel Sallinger, University of Oxford, United Kingdom

General Chair: Jan Van den Bussche, Hasselt University, Belgium

Program Chair: Floris Geerts, University of Antwerp, Belgium

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS'17: International Conference on Management of Data

May 14 - 19, 2017

Chicago, Illinois, USA

Acceptance Rates

PODS '17 Paper Acceptance Rate 29 of 101 submissions, 29%;

Overall Acceptance Rate 642 of 2,707 submissions, 24%
