ABSTRACT
Entity Matching (EM) is an important problem in data integration and cleaning. Recently, deep learning techniques, especially pre-trained language models, have been applied to EM and achieved promising results. Unfortunately, this significant performance gain comes at the cost of explainability and transparency, keeping EM solutions from meeting the requirements of responsible data management. To address this issue, recent studies have extended explainable AI techniques to explain black-box EM models. However, these solutions have two major drawbacks: (i) their explanations do not capture the unique semantic characteristics of the EM problem; and (ii) they fail to provide an objective method for quantitatively evaluating the explanations they produce. In this paper, we propose Minun, a model-agnostic method for generating explanations for EM solutions. We use counterfactual examples generated from an EM-customized search space as explanations and develop two search algorithms to find such results efficiently. We also propose a novel evaluation framework based on a student-teacher paradigm. The framework enables the evaluation of explanations in diverse formats by measuring the performance gain of a "student" model at simulating the target "teacher" model when explanations are given as side input. We conduct an extensive set of experiments explaining state-of-the-art deep EM models on popular EM benchmark datasets. The results demonstrate that Minun significantly outperforms popular explainable AI methods such as LIME and SHAP in both explanation quality and scalability.
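To make the idea of counterfactual explanations for EM concrete, the following is a minimal, hypothetical sketch (not Minun's actual algorithm or API): given an entity pair and a black-box matcher, it searches for the smallest set of tokens whose removal flips the model's prediction, and returns those tokens as the explanation. The `predict_match` function is a stand-in toy model assumed only for this illustration.

```python
from itertools import combinations

def predict_match(left_tokens, right_tokens):
    """Toy stand-in for a black-box EM model:
    predict 'match' if Jaccard similarity of token sets > 0.5."""
    a, b = set(left_tokens), set(right_tokens)
    return len(a & b) / len(a | b) > 0.5

def counterfactual_tokens(left_tokens, right_tokens):
    """Exhaustively search (smallest edits first) for the minimal set of
    tokens whose removal from the left entity flips the prediction;
    those tokens act as a counterfactual explanation."""
    original = predict_match(left_tokens, right_tokens)
    for k in range(1, len(left_tokens) + 1):
        for subset in combinations(range(len(left_tokens)), k):
            edited = [t for i, t in enumerate(left_tokens) if i not in subset]
            if predict_match(edited, right_tokens) != original:
                return [left_tokens[i] for i in subset]
    return []  # no single-side edit flips the prediction

left = ["apple", "iphone", "12", "64gb"]
right = ["apple", "iphone", "12", "128gb"]
print(counterfactual_tokens(left, right))  # -> ['apple']
```

The exhaustive loop is exponential in the number of tokens; the paper's contribution includes search algorithms over an EM-customized edit space that avoid this brute-force enumeration.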