Magellan: toward building ecosystems of entity matching solutions

Published: 22 July 2020

Abstract

Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: ecosystems of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue that EM can be viewed as a special class of DS problems and thus can benefit from system-building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, which are, respectively, sophisticated on-premise tools for power users and self-service cloud tools for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration.

    Published In

    cover image Communications of the ACM
    Communications of the ACM  Volume 63, Issue 8
    August 2020
    93 pages
    ISSN:0001-0782
    EISSN:1557-7317
    DOI:10.1145/3411844

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed
