A self-verifying clustering approach to unsupervised matching of product titles

Akritidis, Leonidas; Fevgas, Athanasios; Bozanis, Panayiotis; Makris, Christos

doi:10.1007/s10462-020-09807-8

Published: 13 February 2020

Volume 53, pages 4777–4820 (2020)

Artificial Intelligence Review

Leonidas Akritidis¹,² (ORCID: orcid.org/0000-0001-6602-0723),
Athanasios Fevgas²,
Panayiotis Bozanis² &
Christos Makris³


Abstract

The continuous growth of the e-commerce industry has made product retrieval a particularly important problem. As more enterprises move their activities to the Web, the volume and diversity of product-related information increase quickly. These factors make it difficult for users to identify and compare the features of their desired products. Recent studies have shown that standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employ external data sources to enrich the titles; these solutions are rather impractical, since fetching external data is inefficient. In this paper we introduce UPM, an unsupervised algorithm for matching products by their titles that is independent of any external sources. UPM consists of three stages. During the first stage, the algorithm analyzes the titles and extracts combinations of words from them. In the second stage, these combinations are evaluated according to several criteria, and the most appropriate ones are selected to form the initial clusters. The third stage is a post-processing verification step that refines the initial clusters by correcting erroneous matches. This stage is designed to operate in combination with any clustering approach, especially when the data possess properties that prevent two data points from co-existing within the same cluster. The experimental evaluation of UPM on multiple datasets demonstrates its superiority over state-of-the-art clustering approaches and string similarity metrics, in terms of both efficiency and effectiveness.
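
The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the scoring criteria or the verification rule, so the sketch assumes (a) a combination's score is the number of titles containing it, with ties broken in favour of longer combinations, and (b) the property that prevents co-existence is that one vendor cannot list the same product twice. The names tokenize, extract_combinations, cluster_titles, vendors and max_k are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(title):
    """Lowercase a title and return its unique tokens in sorted order."""
    return sorted(set(title.lower().split()))


def extract_combinations(tokens, max_k=3):
    """Stage 1: enumerate the k-combinations of the tokens for k = 2..max_k."""
    combos = []
    for k in range(2, min(max_k, len(tokens)) + 1):
        combos.extend(combinations(tokens, k))
    return combos


def cluster_titles(titles, vendors, max_k=3):
    """Return {cluster label: [title indices]} after the three stages."""
    per_title = [extract_combinations(tokenize(t), max_k) for t in titles]

    # Stage 2 (assumed scoring): each title joins the cluster labelled by its
    # most frequent combination; ties go to the longer, then the
    # lexicographically larger combination so the choice is deterministic.
    freq = Counter(c for combos in per_title for c in combos)
    clusters = defaultdict(list)
    for idx, combos in enumerate(per_title):
        if not combos:  # single-word title: it forms its own cluster
            clusters[(titles[idx].lower(),)].append(idx)
            continue
        best = max(combos, key=lambda c: (freq[c], len(c), c))
        clusters[best].append(idx)

    # Stage 3 (assumed constraint): a vendor may appear at most once per
    # cluster; later titles from the same vendor are moved to singletons.
    verified = {}
    for label, members in clusters.items():
        seen, kept = set(), []
        for idx in members:
            if vendors[idx] in seen:
                verified[("singleton", idx)] = [idx]
            else:
                seen.add(vendors[idx])
                kept.append(idx)
        verified[label] = kept
    return verified
```

Calling cluster_titles with a list of titles and a parallel list of vendor identifiers returns the refined clusters. Under these assumptions, two phone listings from the same vendor that differ only in colour are first grouped together by stage 2 and then separated by stage 3.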

Author information

Panayiotis Bozanis
Present address: School of Science and Technology, International Hellenic University, Thessaloniki, Greece

Authors and Affiliations

School of Science and Technology, International Hellenic University, Thessaloniki, Greece
Leonidas Akritidis
Data Structuring and Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
Leonidas Akritidis, Athanasios Fevgas & Panayiotis Bozanis
Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
Christos Makris

Corresponding author

Correspondence to Leonidas Akritidis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Akritidis, L., Fevgas, A., Bozanis, P. et al. A self-verifying clustering approach to unsupervised matching of product titles. Artif Intell Rev 53, 4777–4820 (2020). https://doi.org/10.1007/s10462-020-09807-8

Issue Date: October 2020
