
Efficiently Identifying Duplicated Chinese Company Names in Large-Scale Registration Database

  • Conference paper
Advanced Data Mining and Applications (ADMA 2012)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 7713))


Abstract

Auditing massive volumes of information in real time, and in particular identifying multiple registrations efficiently, is a persistent challenge for large e-commerce platforms. In this paper, we design a novel method for detecting multiple registrations on Chinese e-commerce platforms. Using Chinese word segmentation, the method first divides each Chinese company name into a regional attribute, a template attribute, and a key attribute, following the naming rules that most companies use. The extracted key attribute greatly narrows the search range. The similarity between company names is then computed by a dynamic threshold-based string matching algorithm, and names with high similarity are flagged as likely duplicates. We evaluate the method on a dataset from a real e-commerce company; the results show better accuracy, efficiency, and scalability than other methods. The proposed method is more precise and faster than manual review, so it can substantially reduce labor costs in the B2B industry.
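The pipeline the abstract describes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the segmenter, the threshold rule, or the matching algorithm, so everything below is an assumption made for illustration. The lexicons `REGIONS` and `TEMPLATES` stand in for a real Chinese word segmenter, `allowed_edits` is a hypothetical length-dependent threshold, and plain Levenshtein distance stands in for the paper's string matching step.

```python
from collections import defaultdict

# Hypothetical toy lexicons (assumption); a real system would rely on a
# Chinese word segmenter. Longer entries come first so they match first.
REGIONS = ("北京市", "北京", "上海市", "上海", "杭州市", "杭州")
TEMPLATES = ("有限责任公司", "股份有限公司", "有限公司", "公司")


def split_name(name):
    """Split a name into (regional, key, template) attributes."""
    region = next((r for r in REGIONS if name.startswith(r)), "")
    rest = name[len(region):]
    template = next((t for t in TEMPLATES if rest.endswith(t)), "")
    key = rest[:len(rest) - len(template)] if template else rest
    return region, key, template


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def allowed_edits(length):
    """Assumed dynamic threshold: longer keys tolerate more edits."""
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


def find_duplicates(names):
    """Return sorted pairs of names whose key attributes nearly match."""
    parts = {n: split_name(n) for n in names}
    # Inverted index from key characters to names: only names sharing at
    # least one key character are compared, which narrows the search.
    index = defaultdict(set)
    for n, (_, key, _) in parts.items():
        for ch in key:
            index[ch].add(n)
    pairs = set()
    for n, (_, key, _) in parts.items():
        candidates = set().union(*(index[ch] for ch in key)) - {n}
        for m in candidates:
            other = parts[m][1]
            limit = allowed_edits(min(len(key), len(other)))
            if edit_distance(key, other) <= limit:
                pairs.add(tuple(sorted((n, m))))
    return sorted(pairs)


names = [
    "杭州阿里巴巴网络技术有限公司",
    "阿里巴巴网络技术有限公司",
    "杭州阿里巴巴网路技术有限公司",
    "上海腾飞贸易有限公司",
]
for pair in find_duplicates(names):
    print(pair)
```

In this sketch the regional and template attributes are stripped before comparison, so two registrations that differ only in a city prefix or a one-character variant of the key attribute are paired, while the unrelated fourth name is never compared because it shares no key character with the others.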




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, S., Wei, J., Wang, S. (2012). Efficiently Identifying Duplicated Chinese Company Names in Large-Scale Registration Database. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science(), vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_52


  • DOI: https://doi.org/10.1007/978-3-642-35527-1_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35526-4

  • Online ISBN: 978-3-642-35527-1

  • eBook Packages: Computer Science, Computer Science (R0)
