A Novel Approach to Clustering Merchandise Records

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

Object identification is one of the major challenges in integrating data from multiple information sources. Because integrated databases lack global identifiers, it is hard to find all records that refer to the same object. Traditional object identification techniques typically judge record pairs by character-based or vector space model-based similarity, but these measures do not work well on merchandise databases. This paper proposes a new approach to object identification. First, we use merchandise images to judge whether two records refer to the same object; then, we use a naïve Bayesian model to judge whether two merchandise names have similar meanings. Experiments on data downloaded from shopping websites show good performance.
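The two-stage idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's features, image-matching method, or the way the two judgments are combined, so everything here is an assumption made for illustration: the `image_sig` field stands in for a precomputed image signature, `pair_features` uses two hand-picked features (token overlap and matching model numbers), the two stages are combined with a simple "or", and the training pairs are invented toy data.

```python
import math
from collections import Counter, defaultdict

def pair_features(name_a, name_b):
    """Map two merchandise names to categorical features (illustrative choice)."""
    ta, tb = set(name_a.lower().split()), set(name_b.lower().split())
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    nums_a = {t for t in ta if any(c.isdigit() for c in t)}
    nums_b = {t for t in tb if any(c.isdigit() for c in t)}
    return {
        "overlap": "high" if jaccard >= 0.5 else "low",
        "model_no": "same" if nums_a and nums_a == nums_b else "diff",
    }

class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""
    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = Counter()   # (label, feature, value) -> count
        self.values = defaultdict(set)    # feature -> values seen in training

    def fit(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            for f, v in features.items():
                self.feature_counts[(label, f, v)] += 1
                self.values[f].add(v)

    def log_posterior(self, features, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for f, v in features.items():
            num = self.feature_counts[(label, f, v)] + 1
            den = self.class_counts[label] + len(self.values[f] | {v})
            score += math.log(num / den)
        return score

    def predict(self, features):
        return max(self.class_counts,
                   key=lambda label: self.log_posterior(features, label))

def same_object(rec_a, rec_b, model):
    # Stage 1 (assumed): equal image signatures count as the same object.
    if rec_a["image_sig"] is not None and rec_a["image_sig"] == rec_b["image_sig"]:
        return True
    # Stage 2 (assumed): otherwise ask the name classifier.
    return model.predict(pair_features(rec_a["name"], rec_b["name"])) == "match"

def cluster(records, model):
    """Group record indices by merging every pair judged to be the same object."""
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if same_object(records[i], records[j], model):
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Invented toy training pairs, labeled by hand for this sketch.
training = [
    (pair_features("Canon PowerShot A540 camera", "Canon A540 digital camera"), "match"),
    (pair_features("Nokia 6230 phone", "Nokia 6230 mobile phone"), "match"),
    (pair_features("Canon PowerShot A540 camera", "Canon PowerShot A700 camera"), "nonmatch"),
    (pair_features("Nokia 6230 phone", "Sony T9 camera"), "nonmatch"),
]
model = NaiveBayes()
model.fit(training)

records = [
    {"name": "Canon PowerShot A540 camera", "image_sig": "img1"},
    {"name": "A540 by Canon", "image_sig": "img1"},
    {"name": "Canon A540 digital camera", "image_sig": "img2"},
    {"name": "Sony T9 camera", "image_sig": "img3"},
]
clusters = cluster(records, model)
```

On this toy input the first two records merge because their image signatures match, the third joins them through the name classifier, and the Sony record stays alone. The all-pairs comparison is quadratic in the number of records, so a real merchandise database would probably need some form of blocking before this step.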

Author information

Corresponding author

Correspondence to Tao-Yuan Cheng.

Additional information

This work was done when the first author was visiting Microsoft Research Asia.

About this article

Cite this article

Cheng, TY., Wang, S. A Novel Approach to Clustering Merchandise Records. J Comput Sci Technol 22, 228–231 (2007). https://doi.org/10.1007/s11390-007-9029-3

  • DOI: https://doi.org/10.1007/s11390-007-9029-3
