DOI: 10.1145/2600428.2609597

Continuous word embeddings for detecting local text reuses at the semantic level

Published: 03 July 2014

Abstract

Text reuse is common in many kinds of user-generated content. With the rapid expansion of social media, local text reuse occurs far more frequently than before. Detecting these local reuses is an essential step for many applications and has attracted extensive attention in recent years. However, most previous work has not considered semantic-level similarity. In this paper, we introduce a novel method for efficiently detecting local reuses at the semantic level in large-scale problems. We propose using continuous vector representations of words to capture semantic-level similarities between short text segments. To handle tens of billions of documents, we introduce methods based on information geometry and hashing that aggregate text segments represented by word embeddings and map them to binary hash codes. Experimental results demonstrate that the proposed methods achieve significantly better performance than state-of-the-art approaches on all six document collections, which belong to four different categories. At some recall levels, the precision of the proposed method is as much as 10 times higher than that of previous methods. Moreover, the efficiency of the proposed method is comparable to or better than that of some other hashing methods.
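The pipeline outlined in the abstract (aggregate the word embeddings of a segment, then map the result to a binary code) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the mean-gradient part of a Fisher vector under a diagonal-covariance Gaussian mixture as the aggregation step, and random-hyperplane sign hashing as a stand-in for the hashing step, which the abstract does not specify. The 2-d embeddings, the mixture parameters, and every function name are hypothetical, chosen only for illustration.

```python
import math
import random

def posteriors(x, weights, means, sigmas):
    """Soft assignment of vector x to each diagonal-covariance Gaussian."""
    log_p = []
    for w, mu, sg in zip(weights, means, sigmas):
        ll = math.log(w)
        for xi, mi, si in zip(x, mu, sg):
            ll += -0.5 * math.log(2 * math.pi * si * si) - 0.5 * ((xi - mi) / si) ** 2
        log_p.append(ll)
    top = max(log_p)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(v - top) for v in log_p]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def fisher_vector_means(segment, weights, means, sigmas):
    """Mean-gradient Fisher vector of a segment (a list of word vectors)."""
    dim = len(means[0])
    fv = [0.0] * (len(means) * dim)
    for x in segment:
        gamma = posteriors(x, weights, means, sigmas)
        for k, (mu, sg) in enumerate(zip(means, sigmas)):
            for d in range(dim):
                fv[k * dim + d] += gamma[k] * (x[d] - mu[d]) / sg[d]
    t = len(segment)
    for k, w in enumerate(weights):
        for d in range(dim):
            fv[k * dim + d] /= t * math.sqrt(w)
    return fv

def make_hyperplanes(n_bits, dim, seed=0):
    """Random Gaussian hyperplanes (assumed hashing scheme, not the paper's)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def binary_code(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return [1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
            for plane in hyperplanes]

def hamming(a, b):
    """Number of differing bits between two equal-length codes."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy embeddings; real ones would come from a trained model.
EMB = {
    "cheap": [0.9, 0.1], "inexpensive": [0.85, 0.15],
    "car": [0.1, 0.9], "automobile": [0.15, 0.85],
    "green": [-0.6, -0.9], "banana": [-0.8, -0.7],
}
# Hypothetical two-component mixture, set by hand rather than fitted.
WEIGHTS = [0.5, 0.5]
MEANS = [[0.9, 0.1], [0.1, 0.9]]
SIGMAS = [[0.3, 0.3], [0.3, 0.3]]

def encode(words, planes):
    """Segment -> Fisher vector -> binary hash code."""
    segment = [EMB[w] for w in words]
    return binary_code(fisher_vector_means(segment, WEIGHTS, MEANS, SIGMAS), planes)

planes = make_hyperplanes(n_bits=16, dim=len(MEANS) * len(MEANS[0]), seed=0)
a = encode(["cheap", "car"], planes)
b = encode(["inexpensive", "automobile"], planes)
c = encode(["green", "banana"], planes)
print("Hamming(a, b) =", hamming(a, b), " Hamming(a, c) =", hamming(a, c))
```

Comparing segments then reduces to Hamming distance between fixed-length bit codes, which is what makes this kind of representation attractive at large scale. In a realistic setting the mixture would be fitted to embeddings from a large corpus, and the hash functions might be learned rather than random.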


Published In

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
July 2014
1330 pages
ISBN:9781450322577
DOI:10.1145/2600428
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Fisher vector
  2. local text reuse
  3. word embedding

Qualifiers

  • Research-article

Conference

SIGIR '14
Acceptance Rates

SIGIR '14 Paper Acceptance Rate: 82 of 387 submissions, 21%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Cited By

  • (2020) A Hybrid Classification Method via Character Embedding in Chinese Short Text With Few Words. IEEE Access 8, 92120-92128. DOI: 10.1109/ACCESS.2020.2994450. Online publication date: 2020
  • (2019) Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Information Processing and Management: an International Journal 56:3, 1026-1045. DOI: 10.1016/j.ipm.2018.10.009. Online publication date: 1-May-2019
  • (2018) Improving Pseudo-Relevance Feedback With Neural Network-Based Word Representations. IEEE Access 6, 62152-62165. DOI: 10.1109/ACCESS.2018.2876425. Online publication date: 2018
  • (2018) Neural information retrieval. Information Retrieval 21:2-3, 111-182. DOI: 10.1007/s10791-017-9321-y. Online publication date: 1-Jun-2018
  • (2017) Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering 29:6, 1226-1239. DOI: 10.1109/TKDE.2017.2665625. Online publication date: 1-Jun-2017
  • (2017) Text plagiarism classification using syntax based linguistic features. Expert Systems with Applications: An International Journal 88:C, 448-464. DOI: 10.1016/j.eswa.2017.07.006. Online publication date: 1-Dec-2017
  • (2016) A Mixed Generative-Discriminative Based Hashing Method. IEEE Transactions on Knowledge and Data Engineering 28:4, 845-857. DOI: 10.1109/TKDE.2015.2507127. Online publication date: 1-Apr-2016
  • (2015) LSIF. Proceedings of the 2015 IEEE / WIC / ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) - Volume 01, 417-424. DOI: 10.1109/WI-IAT.2015.2. Online publication date: 6-Dec-2015
