Abstract
The tremendous growth of the World Wide Web (WWW) has left public Web databases with an abundance of unresolved descriptions of real-world entities. Entity resolution (ER), which identifies the entity profiles that describe the same real-world object, is a vital prerequisite for leveraging these Web entities. Data blocking is a popular method for handling Web entities by grouping similar entity profiles without duplication. Existing ER techniques apply hierarchical blocking to ease dimensionality reduction, and canopy clustering is used as a pre-clustering method to increase processing speed. However, canopy clustering performs pairwise comparisons of the entities, which makes it computationally intensive. Moreover, conventional data-blocking techniques offer limited control over both block size and block overlap, despite the importance of blocking quality in many potential applications. This paper proposes Real-Delegate (Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies), an approach that exploits attribute-based unsupervised hierarchical blocking as well as meta-blocking without relying on pre-clustering. The proposed approach significantly improves the efficiency of the blocking function in three phases. In the first phase, Real-Delegate links multiple sets of equivalent entity descriptions using Linked Open Data (LOD) to integrate multiple Web sources. The second phase employs attribute-based unsupervised hierarchical blocking with rough set theory (RST), which considerably reduces superfluous comparisons. In the final phase, Real-Delegate eliminates redundant entities by employing a graph-based meta-blocking model that represents redundancy-positive blocks and removes overlapping profiles effectively. The experimental results demonstrate that the proposed approach significantly improves the effectiveness of entity resolution compared with the token blocking method on a large-scale Web dataset.
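To make the baseline and the final phase more concrete, the following is a minimal Python sketch of token blocking followed by a graph-based meta-blocking step. It is not the Real-Delegate implementation: the function names, the sample profiles, the common-blocks edge weighting, and the mean-weight pruning rule are all assumptions chosen for illustration, following the general meta-blocking idea of weighting each candidate pair by the blocks it shares.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Map each token to the ids of the profiles whose values contain it."""
    blocks = defaultdict(set)
    for pid, attrs in profiles.items():
        for value in attrs.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    # A block holding a single profile yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def blocking_graph(blocks):
    """Weight each profile pair by the number of blocks the pair shares."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only the edges whose weight is at least the mean edge weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical profiles with heterogeneous attribute names (illustration only).
profiles = {
    "p1": {"name": "Alan Turing", "city": "London"},
    "p2": {"fullname": "Turing Alan", "birthplace": "London"},
    "p3": {"name": "Alan Kay", "city": "Springfield"},
    "p4": {"label": "Grace Hopper", "city": "Arlington"},
}

blocks = token_blocking(profiles)
candidates = prune_edges(blocking_graph(blocks))
print(candidates)
```

In this toy input, profiles p1 and p2 co-occur in three token blocks, so plain token blocking would schedule the same comparison three times. The blocking graph collapses those repeats into a single weighted edge, and pruning then discards the weaker pairs that share only one token.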












Cite this article
Vidhya, K.A., Geetha, T.V. Resolving Entity on A Large scale: DEtermining Linked Entities and Grouping similar Attributes represented in assorted TErminologies. Distrib Parallel Databases 35, 303–332 (2017). https://doi.org/10.1007/s10619-017-7205-1
DOI: https://doi.org/10.1007/s10619-017-7205-1