Connected Components for Scaling Partial-order Blocking to Billion Entities

Authors:
Tobias Backes, GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany (ORCID 0000-0003-2492-5297)
Stefan Dietze, GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany (ORCID 0009-0001-4364-9243)

Journal of Data and Information Quality, Volume 16, Issue 1, Article No. 9, pp. 1–29
https://doi.org/10.1145/3646553

Published: 19 March 2024

Abstract

In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but, as was shown for author disambiguation, the respective equivalences or total orders are not necessarily well suited to model the logical matching relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general and generalization of overly specific representations. To build the bipartite graph, we contribute a new parallelized algorithm with a configurable time/space tradeoff for minimal element search in the subset partial order. As a job-based approach, it combines dynamic scalability and easier integration, which makes it more convenient than the previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and allows domain-specific problems to be addressed better. For duplicate detection and author disambiguation, our method offers the expected performance as defined by the vector-similarity baseline used in another work on the same dataset and by the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.
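
The core idea of the abstract can be illustrated with a minimal Python sketch. It is not the authors' parallelized algorithm: it uses a naive quadratic minimal element search, and it assumes (the abstract does not spell this out) that the bipartite graph links each representation to every minimal element that is a subset of it. The function names and the modeling of representations as frozensets of features are hypothetical choices made for illustration.

```python
from collections import defaultdict


def minimal_elements(reps):
    """Naive O(n^2) search: keep representations with no proper subset among reps."""
    unique = set(reps)
    return {r for r in unique if not any(other < r for other in unique)}


def connected_component_blocks(reps):
    """Group representations into blocks via connected components.

    Assumption: an edge joins a representation r and a minimal element m
    whenever m is a subset of r. Components are found with union-find.
    """
    unique = set(reps)
    minima = minimal_elements(unique)
    parent = {r: r for r in unique}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for r in unique:
        for m in minima:
            if m <= r:  # subset test in the partial order
                parent[find(r)] = find(m)

    blocks = defaultdict(set)
    for r in unique:
        blocks[find(r)].add(r)
    return list(blocks.values())


# Hypothetical author-name representations as feature sets.
mentions = [
    frozenset({"backes"}),
    frozenset({"backes", "t"}),
    frozenset({"backes", "tobias"}),
    frozenset({"dietze", "s"}),
    frozenset({"dietze", "s", "stefan"}),
]
blocks = connected_component_blocks(mentions)
```

In this toy input the two minimal elements are {"backes"} and {"dietze", "s"}, so the five representations fall into two blocks. A representation that lies above two different minimal elements would merge their components, which is the kind of overconnectedness the abstract aims to prevent.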

Index Terms

Connected Components for Scaling Partial-order Blocking to Billion Entities
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Reasoning about belief and knowledge
2. Information systems
  1. Data management systems
    1. Information integration

Published in
Journal of Data and Information Quality Volume 16, Issue 1
March 2024
187 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3613486
Editor:
Felix Naumann
Hasso Plattner Institute, Germany
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 March 2024
- Online AM: 20 February 2024
- Accepted: 27 December 2023
- Revised: 3 December 2023
- Received: 9 August 2023
Author Tags
Entity resolution
blocking
partial orders
lattices