DOI: 10.1145/3318299.3318369
research-article

Text Deduplication with Minimum Loss Ratio

Published: 22 February 2019

ABSTRACT

Text deduplication is an important operation in text document analysis applications. Given a set of text documents, we often need to remove documents whose similarity to other documents is at or above a specified threshold. However, if the set of similar documents to be removed is too large, the remaining set may not be enough for text analysis. In this paper, we consider the problem of how to balance the removed set against the remaining set. We aim to eliminate as much duplicated information as possible while removing the minimum number of documents. We propose a greedy algorithm for this problem based on the concept of a similarity graph, which represents the similarity relationships among a set of text documents. We also consider an incremental algorithm for dynamic settings. Experimental results on real news document datasets show the efficiency of the proposed algorithms.
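
The abstract does not say which similarity measure or greedy rule the paper uses, so the following is only a minimal sketch of the general idea under stated assumptions: similarity is Jaccard overlap of whitespace tokens, two documents are linked in the similarity graph when their similarity is at or above the threshold, and the greedy step removes the document with the most remaining similar neighbours. The function names `jaccard`, `similarity_graph`, and `greedy_dedup` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (an assumed measure, not the paper's)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_graph(docs, threshold):
    """Build adjacency sets: documents i and j are linked when similarity >= threshold."""
    tokens = [set(d.lower().split()) for d in docs]
    graph = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_dedup(docs, threshold):
    """Assumed greedy rule: repeatedly remove the highest-degree document
    until no pair of remaining documents is linked in the similarity graph."""
    graph = similarity_graph(docs, threshold)
    removed = []
    while graph:
        # Most remaining similar neighbours first; ties go to the lowest index.
        victim = max(graph, key=lambda v: (len(graph[v]), -v))
        if not graph[victim]:
            break  # no edges left, so no similar pair remains
        for neighbour in graph.pop(victim):
            graph[neighbour].discard(victim)
        removed.append(victim)
    return sorted(graph), removed


if __name__ == "__main__":
    docs = ["a b c d", "a b", "c d"]
    kept, removed = greedy_dedup(docs, threshold=0.5)
    print("kept:", kept, "removed:", removed)
```

In this toy input the first document is similar to both of the others, while those two share no tokens, so removing the single hub document clears every similar pair and keeps two documents. Removing by degree is one common heuristic for this kind of graph problem; it does not guarantee a minimum removal set in general.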


    • Published in

      ICMLC '19: Proceedings of the 2019 11th International Conference on Machine Learning and Computing
      February 2019
      563 pages
      ISBN: 9781450366007
      DOI: 10.1145/3318299

      Copyright © 2019 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited
