ABSTRACT
Text deduplication is an important operation in text document analysis applications. Given a set of text documents, we often need to remove those documents whose similarity to another document is not less than a specified threshold. However, if the set of similar documents to be removed is too large, the remaining set may not be large enough for text analysis. In this paper, we consider the problem of how to balance the removed set against the remaining set of text documents: we try to eliminate as much duplicated information as possible while removing the minimum number of documents. We propose a greedy algorithm for this problem based on the concept of a similarity graph, which represents the similarity relationships among a set of text documents. We also present an incremental algorithm for dynamic settings. Experimental results on real news document datasets demonstrate the efficiency of the proposed algorithms.
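The abstract does not spell out the greedy algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: similarity is measured by Jaccard similarity over lowercase word sets, two documents are joined by an edge when that similarity is not less than the threshold, and the greedy rule repeatedly removes the document with the most remaining similar neighbours until no similar pair is left. The function names `jaccard`, `similarity_graph`, and `greedy_dedup` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_graph(docs, threshold):
    """Build an adjacency map: i and j are linked when similarity >= threshold."""
    tokens = [set(d.lower().split()) for d in docs]
    graph = {i: set() for i in range(len(docs))}
    for i, j in combinations(range(len(docs)), 2):
        if jaccard(tokens[i], tokens[j]) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def greedy_dedup(graph):
    """Assumed greedy rule: drop the highest-degree vertex until no edge remains.

    Ties are broken in favour of the lowest document index.
    Returns (kept, removed) as lists of document indices.
    """
    adj = {v: set(ns) for v, ns in graph.items()}  # work on a copy
    removed = []
    while any(adj.values()):
        v = max(adj, key=lambda u: (len(adj[u]), -u))
        for n in adj.pop(v):
            adj[n].discard(v)
        removed.append(v)
    return sorted(adj), removed


docs = ["a b c d", "a b c d", "a b c e", "x y z"]
kept, removed = greedy_dedup(similarity_graph(docs, threshold=0.5))
```

In this sketch, one heavily duplicated document can be removed in place of several of its neighbours, which is one way to keep the removed set small. The pairwise graph construction is quadratic in the number of documents and is meant only for illustration.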
Index Terms
- Text Deduplication with Minimum Loss Ratio