Abstract
Object identification is one of the major challenges in integrating data from multiple information sources. Because global identifiers are lacking, it is hard to find all records that refer to the same object in an integrated database. Traditional object identification techniques typically judge matches by character-based or vector space model-based similarity, but these do not work well on merchandise databases. This paper proposes a new approach to object identification. First, we use merchandise images to judge whether two records refer to the same object; then, we use a Naïve Bayesian model to judge whether two merchandise names have similar meanings. Experiments on data downloaded from shopping websites show good performance.
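The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify the image comparison, the name features, or how the two steps are combined. The sketch therefore assumes the following: an exact-byte hash stands in for image matching, a name pair is reduced to counts of shared and unshared tokens, the name model acts as a fallback when the images differ, and the training pairs are invented toy data. All function, class, and field names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def image_signature(image_bytes):
    # Assumption: an exact-byte hash stands in for real duplicate-image detection.
    return hashlib.sha256(image_bytes).hexdigest()


def pair_features(name_a, name_b):
    # Assumption: a name pair is described by one "same" symbol per shared token
    # and one "diff" symbol per token that occurs in only one of the names.
    a, b = set(name_a.lower().split()), set(name_b.lower().split())
    return ["same"] * len(a & b) + ["diff"] * len(a ^ b)


class NamePairNaiveBayes:
    """Multinomial Naive Bayes over the two symbols 'same' and 'diff'."""

    VOCAB = ("same", "diff")

    def fit(self, labeled_pairs):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = Counter()
        for name_a, name_b, is_match in labeled_pairs:
            self.counts[is_match].update(pair_features(name_a, name_b))
            self.docs[is_match] += 1
        return self

    def log_score(self, name_a, name_b, label):
        total = sum(self.counts[label].values())
        # Log prior of the class, then one smoothed log likelihood per feature.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for feat in pair_features(name_a, name_b):
            # Laplace (add-one) smoothing avoids zero probabilities.
            smoothed = (self.counts[label][feat] + 1) / (total + len(self.VOCAB))
            score += math.log(smoothed)
        return score

    def is_match(self, name_a, name_b):
        return self.log_score(name_a, name_b, True) > self.log_score(name_a, name_b, False)


def same_object(rec_a, rec_b, model):
    # Step 1: identical image signatures are taken as the same object.
    if image_signature(rec_a["image"]) == image_signature(rec_b["image"]):
        return True
    # Step 2 (assumed fallback): let the name model decide.
    return model.is_match(rec_a["name"], rec_b["name"])


# Invented toy training pairs: (name_a, name_b, refers_to_same_object).
TRAINING_PAIRS = [
    ("canon eos 400d camera", "canon 400d digital camera", True),
    ("nokia n73 phone", "nokia n73 mobile phone", True),
    ("canon eos 400d camera", "nokia n73 phone", False),
    ("sony mp3 player", "canon 400d digital camera", False),
]
model = NamePairNaiveBayes().fit(TRAINING_PAIRS)
```

With a fitted model, `same_object(rec_a, rec_b, model)` takes two records, each a dictionary with a `"name"` string and `"image"` bytes. A realistic system would replace the byte hash with a perceptual image comparison and use richer name features.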
Additional information
This work was done when the first author was visiting Microsoft Research Asia.
About this article
Cite this article
Cheng, TY., Wang, S. A Novel Approach to Clustering Merchandise Records. J Comput Sci Technol 22, 228–231 (2007). https://doi.org/10.1007/s11390-007-9029-3
DOI: https://doi.org/10.1007/s11390-007-9029-3