Abstract
In entity resolution, blocking pre-partitions data for further processing by more expensive methods. Two entity mentions are in the same block if they share identical or related blocking keys. Previous work has sometimes related blocking keys by grouping or alphabetically sorting them, but, as was shown for author disambiguation, the resulting equivalences or total orders are not necessarily well suited to modeling the logical matching relation between blocking keys. To address this, we present a novel blocking approach that exploits the subset partial order over entity representations to build a matching-based bipartite graph, using connected components as blocks. To prevent over- and underconnectedness, we allow specification of overly general and generalization of overly specific representations. To build the bipartite graph, we contribute a new parallelized algorithm with a configurable time/space tradeoff for minimal element search in the subset partial order. As a job-based approach, it combines dynamic scalability and easier integration, making it more convenient than previously described approaches. Experiments on large gold standards for publication records, author mentions, and affiliation strings suggest that our approach is competitive in performance and allows domain-specific problems to be addressed better. For duplicate detection and author disambiguation, our method offers the expected performance as defined by the vector-similarity baseline used in another work on the same dataset and the common surname, first-initial baseline. For top-level institution resolution, we have reproduced the challenges described in prior work, strengthening the conclusion that for affiliation data, overlapping blocks under minimal elements are more suitable than connected components.
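The blocking idea summarized above can be illustrated with a small sketch. It is not the parallelized algorithm contributed in the paper: it uses a naive quadratic minimal element search, and it assumes that each entity representation is a set of features and that the bipartite graph links every representation to the minimal elements that are subsets of it. The names `minimal_elements`, `blocks`, and the sample `mentions` are hypothetical.

```python
from collections import defaultdict


def minimal_elements(reps):
    """Return the sets in `reps` that have no proper subset in `reps` (naive O(n^2))."""
    unique = set(reps)
    return {r for r in unique if not any(other < r for other in unique)}


def blocks(mentions):
    """Group mention ids into blocks.

    `mentions` maps a mention id to a frozenset of features (assumed representation).
    Each mention is linked to every minimal element that is a subset of its
    representation; the connected components of that bipartite graph are the blocks.
    """
    minima = minimal_elements(mentions.values())
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for mention_id, rep in mentions.items():
        for m in minima:
            if m <= rep:  # subset test in the partial order
                union(("mention", mention_id), ("min", m))

    groups = defaultdict(set)
    for mention_id in mentions:
        groups[find(("mention", mention_id))].add(mention_id)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical author-mention representations.
mentions = {
    "a": frozenset({"smith", "j"}),
    "b": frozenset({"smith", "j", "john"}),
    "c": frozenset({"smith", "j", "jane"}),
    "d": frozenset({"miller", "a"}),
}
print(blocks(mentions))
```

In this sketch, a single representation that contains two otherwise unrelated minimal elements merges their blocks into one component, which is one way the overconnectedness mentioned in the abstract can arise.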
Index Terms
- Connected Components for Scaling Partial-order Blocking to Billion Entities