Entity resolution for media metadata based on structural clustering

Multimedia Tools and Applications

Abstract

A growing volume of media metadata is published on the Web by different organizations, which leads to a fragmented dataset landscape. Identifying matching media metadata across disparate datasets and integrating heterogeneous datasets have many applications but also pose significant challenges. Entity resolution methods are commonly used to tackle this problem: they are an essential prerequisite for integrating media information from different sources, and they foster the reuse of existing data sources. As the amount of media metadata on the Web grows steadily, scaling entity resolution to large media knowledge bases while maintaining high matching quality becomes a critical challenge. This article investigates the relationships between media entities. To that end, the media database is formulated as a knowledge graph with entities as nodes and the associations between related entities as edges. Media entities can then be grouped into communities according to the neighbors they share. A structural clustering-based model is proposed to detect communities and to discover anchor vertices as well as isolated vertices. In particular, an initial seed set of matched anchor vertex pairs is obtained. Furthermore, an iterative propagation approach is developed for identifying the matched entities in the whole graph, where community similarity is introduced into the measure function to control the overall score of candidate pairs. Starting from the elements of the initial seed set, the entity resolution algorithm iteratively propagates matching information across the whole network along neighbor relationships. Extensive experiments are conducted on real datasets to evaluate how the seed set affects the matching process and performance. The experimental results show that the model achieves an excellent balance between accuracy and efficiency and clearly improves on state-of-the-art methods.
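The seed-and-propagate idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not give the measure function, so the score below is a simplified stand-in that counts neighbors already matched to each other, and the community-similarity term is left out. The function names, the threshold value, and the toy graphs are all hypothetical. The first function shows the neighborhood-overlap similarity commonly used in structural clustering, assumed here to be the kind of measure that groups entities "by how they share neighbors".

```python
from math import sqrt


def structural_similarity(adj, u, v):
    """Overlap of the closed neighborhoods of u and v, normalized by their sizes."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))


def propagate_matches(adj_a, adj_b, seeds, threshold=0.5):
    """Greedily extend a seed matching between two graphs along neighbor links.

    adj_a, adj_b: dict mapping each node to the set of its neighbors.
    seeds: dict mapping nodes of graph A to their known matches in graph B.
    The score is an assumed stand-in for the paper's measure function.
    """
    matched = dict(seeds)
    while True:
        used_b = set(matched.values())
        best = None
        for a, nbrs_a in adj_a.items():
            if a in matched:
                continue
            # Images in graph B of the neighbors of a that are already matched.
            mapped = {matched[n] for n in nbrs_a if n in matched}
            if not mapped:
                continue
            for b, nbrs_b in adj_b.items():
                if b in used_b or not nbrs_b:
                    continue
                shared = len(mapped & nbrs_b)
                score = shared / sqrt(len(nbrs_a) * len(nbrs_b))
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, a, b)
        if best is None:
            return matched  # no candidate pair clears the threshold
        matched[best[1]] = best[2]  # accept the single best pair, then rescore


# Toy data (hypothetical): two tiny media graphs describing the same entities.
graph_a = {
    "film1": {"actorX", "directorY"},
    "film2": {"actorX"},
    "actorX": {"film1", "film2"},
    "directorY": {"film1"},
}
graph_b = {
    "m1": {"pX", "pY"},
    "m2": {"pX"},
    "pX": {"m1", "m2"},
    "pY": {"m1"},
}
seed_pairs = {"actorX": "pX", "directorY": "pY"}

matches = propagate_matches(graph_a, graph_b, seed_pairs)
print(matches)
```

On this toy input the two seed pairs are enough to match both films, because each film shares its already-matched neighbors with exactly one candidate in the other graph. Accepting one pair per round and rescoring keeps the sketch simple but is quadratic in the number of nodes; a scalable variant would restrict candidates to the neighborhoods of newly matched pairs.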

Acknowledgements

This work is partially supported by the National Key Research and Development Plan (No. 2018YFB1003800).

Author information

Corresponding author

Correspondence to Jian Cao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Gu, Q., Cao, J. & Liu, Y. Entity resolution for media metadata based on structural clustering. Multimed Tools Appl 79, 219–242 (2020). https://doi.org/10.1007/s11042-019-08062-6

  • DOI: https://doi.org/10.1007/s11042-019-08062-6
