DOI: 10.1145/3292500.3332296
Tutorial

Data Integration and Machine Learning: A Natural Synergy

Published: 25 July 2019

Abstract

As data volume and variety have increased, the ties between machine learning and data integration have grown stronger. For machine learning to be effective, one must draw on data from the greatest possible variety of sources, which is why data integration plays a key role. At the same time, machine learning is driving automation in data integration, reducing overall integration costs and improving accuracy. This tutorial focuses on three aspects of the synergistic relationship between data integration and machine learning: (1) we survey how state-of-the-art data integration solutions rely on machine learning-based approaches for accurate results and effective human-in-the-loop pipelines, (2) we review how end-to-end machine learning applications rely on data integration to identify accurate, clean, and relevant data for their analytics exercises, and (3) we discuss open research challenges and opportunities that span data integration and machine learning.

Supplementary Material

Part 1 of 2 (p3193-dong-part1.mp4)
Part 2 of 2 (p3193-dong-part2.mp4)

Cited By

  • (2023) Data collection and quality challenges in deep learning: a data-centric AI perspective. The VLDB Journal 32:4 (791-813). DOI: 10.1007/s00778-022-00775-9. Online publication date: 3-Jan-2023
  • (2021) Effective Scheduler for Distributed DNN Training Based on MapReduce and GPU Cluster. Journal of Grid Computing 19:1. DOI: 10.1007/s10723-021-09550-6. Online publication date: 22-Feb-2021

Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN: 9781450362016
DOI: 10.1145/3292500
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data cleaning
  2. data fusion
  3. data integration
  4. entity linkage
  5. schema mapping

Qualifiers

  • Tutorial

Conference

KDD '19

Acceptance Rates

KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
