Efficiently Identifying Duplicated Chinese Company Names in Large-Scale Registration Database

Liu, Shaowu; Wei, Jiyong; Wang, Shouwei

doi:10.1007/978-3-642-35527-1_52

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7713)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Auditing large volumes of information in real time is a persistent challenge for large e-commerce platforms, and identifying multiple registrations efficiently is especially hard. In this paper, we design a novel method for detecting multiple registrations on Chinese e-commerce platforms. Using Chinese word segmentation, the method first divides each Chinese company name into a regional attribute, a template attribute, and a key attribute, following the naming rules that most companies use. The extracted key attribute greatly narrows the search range. The similarity between company names is then computed with a dynamic threshold-based string matching algorithm, and names with high similarity are flagged as duplicates. We evaluate the method on a dataset from a real e-commerce company; the results show better accuracy, efficiency, and scalability than other methods. The proposed method is also more precise and faster than manual review, so it can substantially reduce labor costs in the B2B industry.
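The pipeline summarized in the abstract can be illustrated with a minimal sketch. The full method is not reproduced on this page, so everything below is an assumption made for illustration: the tiny `REGIONS` and `TEMPLATES` vocabularies stand in for real Chinese word segmentation, plain Levenshtein distance stands in for the string matching algorithm, and the length-dependent rule in `dynamic_threshold` is a hypothetical choice, not the authors' threshold. The company names are invented.

```python
from collections import defaultdict

# Hypothetical vocabularies; a real system would rely on a word segmenter
# and complete gazetteers of place names and company-type words.
REGIONS = ("南京市", "南京", "上海市", "上海", "北京市", "北京")
TEMPLATES = ("股份有限公司", "有限责任公司", "有限公司", "科技", "贸易", "实业")


def split_name(name):
    """Split a name into (region, key, template) by prefix/suffix stripping."""
    region = next((r for r in REGIONS if name.startswith(r)), "")
    rest = name[len(region):]
    template_parts = []
    stripped = True
    while stripped:
        stripped = False
        for t in TEMPLATES:
            # Never strip the whole remainder, so the key stays non-empty.
            if rest.endswith(t) and len(rest) > len(t):
                template_parts.insert(0, t)
                rest = rest[:-len(t)]
                stripped = True
                break
    return region, rest, "".join(template_parts)


def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def dynamic_threshold(a, b):
    """Assumed rule: tolerate about one edit per four characters, at least one."""
    n = max(len(a), len(b), 1)
    return 1.0 - max(1, n // 4) / n


def find_duplicates(names):
    """Block names by key attribute, then compare full names within each block."""
    blocks = defaultdict(list)
    for name in names:
        _, key, _ = split_name(name)
        blocks[key].append(name)
    pairs = []
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a, b = group[i], group[j]
                if similarity(a, b) >= dynamic_threshold(a, b):
                    pairs.append((a, b))
    return pairs


names = ["南京星辰科技有限公司", "南京市星辰科技有限公司", "上海蓝天贸易有限公司"]
duplicates = find_duplicates(names)
```

In this sketch only names that share a key attribute are ever compared, which is where the reduction in search range comes from. Exact-match blocking would miss a typo inside the key itself, so a production system would likely need a fuzzier blocking step.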

Author information

Authors and Affiliations

Focus Technology Co., LTD, Nanjing, China
Shaowu Liu & Shouwei Wang
Department of E-Business School, Nanjing University, China
Jiyong Wei

Editor information

Editors and Affiliations

School of Computer Science, Fudan University, Handan Road 220, 200433, Shanghai, China
Shuigeng Zhou
Chinese Academy of Sciences, Academy of Mathematics and Systems Science, Dongguancun East Road 55, 100190, Beijing, China
Songmao Zhang
Department of Computer Science and Engineering, University of Minnesota, Union Street SE 200, 55455, Minneapolis, MN, USA
George Karypis

About this paper

Cite this paper

Liu, S., Wei, J., Wang, S. (2012). Efficiently Identifying Duplicated Chinese Company Names in Large-Scale Registration Database. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science, vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_52

DOI: https://doi.org/10.1007/978-3-642-35527-1_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1
eBook Packages: Computer Science; Computer Science (R0)
