A Novel Approach to Clustering Merchandise Records

Cheng, Tao-Yuan; Wang, Shan

doi:10.1007/s11390-007-9029-3

Regular Paper
Published: 17 April 2007

Volume 22, pages 228–231 (2007)
Journal of Computer Science and Technology

Abstract

Object identification is one of the major challenges in integrating data from multiple information sources. Because integrated databases lack global identifiers, it is hard to find all records that refer to the same object. Traditional object identification techniques typically judge similarity with character-based or vector space model-based measures, but these do not work well on merchandise databases. This paper proposes a new approach to object identification. First, we use merchandise images to judge whether two records refer to the same object; then, we use a Naïve Bayesian model to judge whether two merchandise names have similar meanings. Experiments on data downloaded from shopping websites show good performance.
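
The name-matching step can be illustrated with a small sketch. The abstract does not say which features or training data the authors' Naïve Bayesian model uses, so everything below is an assumption made for illustration: the three pair features (token overlap, agreement of digit-bearing tokens, agreement of the first token), the `NaiveBayes` class, and the toy training pairs are all hypothetical. It shows only the general idea of classifying a pair of names as "same meaning" or not with Laplace-smoothed naive Bayes.

```python
import math
from collections import Counter, defaultdict


def pair_features(name_a, name_b):
    """Hypothetical features for a pair of merchandise names (not from the paper)."""
    tokens_a, tokens_b = name_a.lower().split(), name_b.lower().split()
    a, b = set(tokens_a), set(tokens_b)
    jaccard = len(a & b) / len(a | b) if a | b else 0.0
    # Tokens containing a digit often act as model numbers.
    digits_a = {t for t in a if any(c.isdigit() for c in t)}
    digits_b = {t for t in b if any(c.isdigit() for c in t)}
    return (
        "high" if jaccard >= 0.5 else "low",  # coarse token overlap
        digits_a == digits_b,                 # digit-bearing tokens agree
        tokens_a[:1] == tokens_b[:1],         # first token (often a brand) agrees
    )


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, index) -> value counts
        self.values = defaultdict(set)              # index -> values seen in training

    def fit(self, pairs, labels):
        for (a, b), y in zip(pairs, labels):
            self.class_counts[y] += 1
            for i, v in enumerate(pair_features(a, b)):
                self.feature_counts[(y, i)][v] += 1
                self.values[i].add(v)
        return self

    def log_posterior(self, a, b):
        """Unnormalized log P(label | features) for each label."""
        total = sum(self.class_counts.values())
        feats = pair_features(a, b)
        scores = {}
        for y, n in self.class_counts.items():
            score = math.log(n / total)
            for i, v in enumerate(feats):
                k = len(self.values[i] | {v})  # number of possible values
                score += math.log((self.feature_counts[(y, i)][v] + 1) / (n + k))
            scores[y] = score
        return scores

    def predict(self, a, b):
        scores = self.log_posterior(a, b)
        return max(scores, key=scores.get)


# Toy labeled pairs, invented for this sketch (True = same merchandise).
PAIRS = [
    ("Canon PowerShot A540 digital camera", "Canon A540 PowerShot camera"),
    ("Nokia 6230 mobile phone", "Nokia 6230 phone black"),
    ("Canon PowerShot A540 digital camera", "Canon PowerShot A620 digital camera"),
    ("Nokia 6230 mobile phone", "Sony Walkman MP3 player"),
]
LABELS = [True, True, False, False]

model = NaiveBayes().fit(PAIRS, LABELS)
same = model.predict("Sony Walkman NW-E005 MP3 player", "Sony NW-E005 Walkman player MP3")
different = model.predict("Nokia 6230 mobile phone", "Nokia 6600 mobile phone")
```

In this toy setup, reordered names that share a model token are scored as the same item, while names that differ only in the model token are scored as different. A real system would need features and training data suited to the actual merchandise names, which the abstract does not describe.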

Author information

Authors and Affiliations

School of Information, Renmin University of China, Beijing, 100872, China
Tao-Yuan Cheng & Shan Wang

Corresponding author

Correspondence to Tao-Yuan Cheng.

Additional information

This work was done when the first author was visiting Microsoft Research Asia.

About this article

Cite this article

Cheng, TY., Wang, S. A Novel Approach to Clustering Merchandise Records. J Comput Sci Technol 22, 228–231 (2007). https://doi.org/10.1007/s11390-007-9029-3

Received: 01 May 2006
Revised: 11 January 2007
Published: 17 April 2007
Issue Date: March 2007
DOI: https://doi.org/10.1007/s11390-007-9029-3
