Abstract
A growing volume of media metadata is published on the Web by different organizations, which leads to a fragmented dataset landscape. Identifying media metadata across disparate datasets and integrating heterogeneous datasets have many applications but also pose significant challenges. To tackle this problem, entity resolution methods are commonly used as an essential prerequisite for integrating media information from different sources and for effectively fostering the reuse of existing data sources. As the amount of media metadata published on the Web grows steadily, scaling entity resolution to large media knowledge bases while maintaining high matching quality becomes a critical challenge. This article investigates the relationships between media entities. To that end, the media database is formulated as a knowledge graph with entities as nodes and the associations between related entities as edges, so that media entities can be grouped into communities according to the neighbors they share. A structural clustering-based model is then proposed to detect communities and to discover anchor vertices as well as isolated vertices. Specifically, an initial seed set of matched anchor vertex pairs is obtained. Furthermore, an iterative propagation approach is developed for identifying the matched entities in the whole graph, in which community similarity is introduced into the measure function to control the total measurement of candidate pairs. Starting from the elements of the initial seed set, the entity resolution algorithm iteratively updates the matching information over the whole network along the neighbor relationships. Extensive experiments are conducted on real datasets to evaluate how the seed set affects the matching process and performance. The experimental results show that this model achieves an excellent balance between accuracy and efficiency and is a clear improvement over state-of-the-art methods.
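The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the measure function, so the scoring rule below (number of already matched neighbor pairs plus a fixed bonus when the two communities are already linked by a matched pair) is an assumption, as are the use of a SCAN-style structural similarity, the treatment of SCAN cores as a stand-in for anchor vertices, and all function names, parameters, and toy data.

```python
from math import sqrt


def build_graph(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def structural_similarity(adj, u, v):
    """SCAN-style similarity: overlap of closed neighborhoods, normalized."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))


def core_vertices(adj, eps, mu):
    """Vertices whose closed neighborhood has >= mu members with similarity >= eps."""
    cores = set()
    for v in adj:
        close = [w for w in adj[v] | {v}
                 if structural_similarity(adj, v, w) >= eps]
        if len(close) >= mu:
            cores.add(v)
    return cores


def propagate_matches(adj_a, adj_b, comm_a, comm_b, seeds,
                      threshold=1.0, community_bonus=0.5):
    """Greedily extend a seed matching along neighbor relationships.

    The score of a candidate pair is an assumed stand-in for the paper's
    measure function: matched-neighbor support plus a community bonus.
    """
    matched = dict(seeds)          # vertex in graph A -> vertex in graph B
    used_b = set(matched.values())
    while True:
        # Community pairs already linked by at least one matched pair.
        linked = {(comm_a[a], comm_b[b]) for a, b in matched.items()}
        support = {}
        for a, b in matched.items():
            for na in adj_a[a]:
                if na in matched:
                    continue
                for nb in adj_b[b]:
                    if nb in used_b:
                        continue
                    support[(na, nb)] = support.get((na, nb), 0) + 1
        if not support:
            break
        scores = {
            pair: count + (community_bonus
                           if (comm_a[pair[0]], comm_b[pair[1]]) in linked
                           else 0.0)
            for pair, count in support.items()
        }
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(scores), key=scores.get)
        if scores[best] < threshold:
            break
        matched[best[0]] = best[1]
        used_b.add(best[1])
    return matched


# Toy data (hypothetical): two copies of the same five-vertex graph.
adj_a = build_graph([("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
                     ("a3", "a4"), ("a4", "a5")])
adj_b = build_graph([("b1", "b2"), ("b1", "b3"), ("b2", "b3"),
                     ("b3", "b4"), ("b4", "b5")])
comm_a = {"a1": 0, "a2": 0, "a3": 0, "a4": 1, "a5": 1}
comm_b = {"b1": 0, "b2": 0, "b3": 0, "b4": 1, "b5": 1}
seeds = {"a1": "b1", "a3": "b3"}

result = propagate_matches(adj_a, adj_b, comm_a, comm_b, seeds)
print(result)
```

In this sketch the threshold plays the role of a quality control: raising it makes the propagation stop earlier and accept only pairs supported by several matched neighbors, which trades coverage for precision. A real system would also need candidate blocking to avoid scoring all neighbor pairs on large graphs.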
Acknowledgements
This work is partially supported by the National Key Research and Development Plan (No. 2018YFB1003800).
Cite this article
Gu, Q., Cao, J. & Liu, Y. Entity resolution for media metadata based on structural clustering. Multimed Tools Appl 79, 219–242 (2020). https://doi.org/10.1007/s11042-019-08062-6
DOI: https://doi.org/10.1007/s11042-019-08062-6