DOI: 10.1145/3034786.3056124
research-article

Data Integration: After the Teenage Years

Published: 09 May 2017

Abstract

The field of data integration has expanded significantly over the years, from providing a uniform query and update interface to structured databases within an enterprise to the ability to search, exchange, and even update structured or unstructured data that are within or external to the enterprise. This paper describes the evolution in the landscape of data integration since the work on rewriting queries using views in the mid-1990s. In addition, we describe two important challenges for the field going forward. The first challenge is to develop good open-source tools for different components of data integration pipelines. The second challenge is to provide practitioners with viable solutions for the long-standing problem of systematically combining structured and unstructured data.

Published In

PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
May 2017
458 pages
ISBN:9781450341981
DOI:10.1145/3034786
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data integration
  2. open-source
  3. structured and unstructured data
  4. views

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'17

Acceptance Rates

PODS '17 Paper Acceptance Rate 29 of 101 submissions, 29%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 98
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 27 Feb 2025

Cited By

  • (2024) Empowering the SDM-RDFizer tool for scaling up to complex knowledge graph creation pipelines. Semantic Web, pages 1-28. DOI: 10.3233/SW-243580. Online publication date: 28-Mar-2024.
  • (2024) Does ChatGPT change artificial intelligence-enabled marketing capability? Social media investigation of public sentiment and usage. Global Media and China. DOI: 10.1177/20594364241228880. Online publication date: 31-Jan-2024.
  • (2024) Amalur: The Convergence of Data Integration and Machine Learning. IEEE Transactions on Knowledge and Data Engineering, 36(12), pages 7353-7367. DOI: 10.1109/TKDE.2024.3357389. Online publication date: Dec-2024.
  • (2024) Cross-Source ML Model Training. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 5665-5666. DOI: 10.1109/ICDE60146.2024.00464. Online publication date: 13-May-2024.
  • (2024) The Five Generations of Entity Resolution on Web Data. Web Engineering, pages 469-473. DOI: 10.1007/978-3-031-62362-2_46. Online publication date: 16-Jun-2024.
  • (2024) Breaking Down Barriers with Knowledge Graphs: Data Integration for Cross-Organizational Process Mining. Process Mining Workshops, pages 499-512. DOI: 10.1007/978-3-031-56107-8_38. Online publication date: 13-Apr-2024.
  • (2023) Mask–Mediator–Wrapper: A Revised Mediator–Wrapper Architecture for Heterogeneous Data Source Integration. Applied Sciences, 13(4), article 2471. DOI: 10.3390/app13042471. Online publication date: 14-Feb-2023.
  • (2023) Incremental schema integration for data wrangling via knowledge graphs. Semantic Web, pages 1-38. DOI: 10.3233/SW-233347. Online publication date: 8-Jun-2023.
  • (2023) Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets. Proceedings of the VLDB Endowment, 16(11), pages 3349-3362. DOI: 10.14778/3611479.3611531. Online publication date: 24-Aug-2023.
  • (2023) Amalur: Data Integration Meets Machine Learning. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 3729-3739. DOI: 10.1109/ICDE55515.2023.00301. Online publication date: Apr-2023.