Abstract
Auditing large volumes of information in real time is a persistent challenge for large E-commerce platforms, and identifying multi-registrations efficiently is especially difficult. In this paper, we design a novel method for detecting multi-registrations on Chinese E-commerce platforms. In the proposed method, Chinese word segmentation is first used to divide each Chinese company name into a regional attribute, a template attribute, and a key attribute, following the naming rules that most companies use. The extracted key attribute greatly narrows the search range. Then, the similarity between company names is computed by a dynamic threshold-based string matching algorithm. Finally, company names with high similarity are detected. The method is evaluated on a dataset from a real E-commerce company, and the results show that it achieves better accuracy, efficiency, and scalability than other methods. The proposed method is more precise and less time-consuming than manual review, so it can substantially reduce labor costs for the B2B industry.
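The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny `REGIONS` and `TEMPLATES` dictionaries stand in for a real Chinese word segmenter, the blocking rule (grouping by the first character of the key attribute) is a hypothetical way to narrow the search range, and the length-dependent threshold `max(len) // 4` is an assumed stand-in for the paper's dynamic threshold, whose exact definition the abstract does not give.

```python
from collections import defaultdict

# Hypothetical dictionaries; a real system would rely on a word segmenter.
REGIONS = ("北京", "上海", "杭州", "深圳")
TEMPLATES = ("有限责任公司", "股份有限公司", "有限公司", "科技", "贸易")


def split_name(name):
    """Split a company name into (regional, key, template) attributes."""
    region = next((r for r in REGIONS if name.startswith(r)), "")
    rest = name[len(region):]
    template = ""
    stripped = True
    while stripped:  # peel template words off the end, longest first
        stripped = False
        for t in sorted(TEMPLATES, key=len, reverse=True):
            if rest.endswith(t) and len(rest) > len(t):
                rest = rest[:-len(t)]
                template = t + template
                stripped = True
                break
    return region, rest, template


def edit_distance(a, b):
    """Classic Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def find_duplicates(names):
    """Return pairs of names whose key attributes are nearly identical."""
    buckets = defaultdict(list)
    for name in names:
        _, key, _ = split_name(name)
        if key:
            # Assumed blocking rule: compare only keys sharing a first character.
            buckets[key[0]].append((name, key))
    pairs = []
    for items in buckets.values():
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (n1, k1), (n2, k2) = items[i], items[j]
                # Assumed dynamic threshold: longer keys tolerate more edits.
                allowed = max(len(k1), len(k2)) // 4
                if edit_distance(k1, k2) <= allowed:
                    pairs.append((n1, n2))
    return pairs


names = ["杭州阿里巴巴科技有限公司", "阿里巴吧有限公司", "上海华为贸易有限公司"]
print(split_name(names[0]))
print(find_duplicates(names))
```

In this sketch, comparing only key attributes means that differences in region or template wording do not prevent two registrations from being flagged, while very short keys (fewer than four characters) must match exactly because their allowed edit count is zero.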
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, S., Wei, J., Wang, S. (2012). Efficiently Identifying Duplicated Chinese Company Names in Large-Scale Registration Database. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science, vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_52
DOI: https://doi.org/10.1007/978-3-642-35527-1_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1
eBook Packages: Computer Science, Computer Science (R0)