research-article

Continuous word embeddings for detecting local text reuses at the semantic level

Authors:

Xuanjing Huang

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

Pages 797 - 806

https://doi.org/10.1145/2600428.2609597

Published: 03 July 2014

Abstract

Text reuse is a common phenomenon in a variety of user-generated content. With the rapid expansion of social media, local text reuse occurs far more frequently than ever before. Detecting these local reuses is an essential step for many applications and has attracted extensive attention in recent years. However, most previous work has not considered semantic-level similarity. In this paper, we introduce a novel method to efficiently detect local reuses at the semantic level for large-scale problems. We propose using continuous vector representations of words to capture semantic-level similarities between short text segments. To handle tens of billions of documents, we introduce methods based on information geometry and hashing that aggregate text segments represented by word embeddings and map them to binary hash codes. Experimental results demonstrate that the proposed methods achieve significantly better performance than state-of-the-art approaches on all six document collections, which belong to four different categories. At some recall levels, the precision of the proposed method is as much as 10 times higher than that of previous methods. Moreover, the efficiency of the proposed method is comparable to or better than that of some other hashing methods.
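The general pipeline described in the abstract (aggregate the word embeddings of a short segment, then map the result to a binary code) can be illustrated with a minimal sketch. This is not the paper's method: it swaps in plain mean pooling for the information-geometry aggregation and random-hyperplane (sign-of-projection) hashing for the hashing step, purely as simple stand-ins. The function names, the toy 3-dimensional embeddings, and the 16-bit code length are all assumptions made for illustration.

```python
import random


def mean_pool(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vector, hyperplanes):
    """Set bit k when the vector lies on the non-negative side of hyperplane k."""
    code = 0
    for k, plane in enumerate(hyperplanes):
        if sum(p * x for p, x in zip(plane, vector)) >= 0.0:
            code |= 1 << k
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


# Hypothetical, hand-made embeddings; real ones would be learned from a corpus.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "fast": [0.1, 0.9, 0.0],
    "quick": [0.15, 0.85, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

planes = make_hyperplanes(num_bits=16, dim=3, seed=0)
code_a = hash_code(mean_pool(["fast", "car"], EMBEDDINGS), planes)
code_b = hash_code(mean_pool(["quick", "automobile"], EMBEDDINGS), planes)
code_c = hash_code(mean_pool(["banana"], EMBEDDINGS), planes)

# Compare segments by the Hamming distance between their codes.
print("paraphrase pair:", hamming(code_a, code_b))
print("unrelated pair: ", hamming(code_a, code_c))
```

With sign-of-projection hashing, segments whose pooled vectors point in similar directions tend to receive codes with a small Hamming distance, so candidate reuses can be found by comparing compact integers instead of full vectors. The exact distances printed depend on the random seed.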

Index Terms

Continuous word embeddings for detecting local text reuses at the semantic level
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Digital libraries and archives

Published In

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval

July 2014

1330 pages

ISBN:9781450322577

DOI:10.1145/2600428

General Chairs: Shlomo Geva (Queensland University of Technology), Andrew Trotman (University of Dunedin)

Program Chairs: Peter Bruza (Queensland University of Technology), Charles L.A. Clarke (University of Waterloo), Kal Järvelin (University of Tampere)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '14

Sponsor: SIGIR

SIGIR '14: The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 6 - 11, 2014

Gold Coast, Queensland, Australia

Acceptance Rates

SIGIR '14 Paper Acceptance Rate: 82 of 387 submissions, 21%

Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 8
Total Downloads: 580
Downloads (Last 12 months): 10
Downloads (Last 6 weeks): 1

Reflects downloads up to 18 Feb 2025

Citations

Cited By

Zhu Y, Li Y, Yue Y, Qiang J, Yuan Y (2020). A Hybrid Classification Method via Character Embedding in Chinese Short Text With Few Words. IEEE Access 8, 92120-92128. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.2994450
Roy D, Ganguly D, Mitra M, Jones G (2019). Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Information Processing and Management: an International Journal 56:3, 1026-1045. Online publication date: 1-May-2019.
https://dl.acm.org/doi/10.1016/j.ipm.2018.10.009
Xu B, Lin H, Lin Y, Yang L, Xu K (2018). Improving Pseudo-Relevance Feedback With Neural Network-Based Word Representations. IEEE Access 6, 62152-62165. Online publication date: 2018.
https://doi.org/10.1109/ACCESS.2018.2876425
Onal K, Zhang Y, Altingovde I, Rahman M, Karagoz P, Braylan A, Dang B, Chang H, Kim H, Mcnamara Q, Angert A, Banner E, Khetan V, Mcdonnell T, Nguyen A, Xu D, Wallace B, Rijke M, Lease M (2018). Neural information retrieval. Information Retrieval 21:2-3, 111-182. Online publication date: 1-Jun-2018.
https://dl.acm.org/doi/10.1007/s10791-017-9321-y
Zhou G, Huang J (2017). Modeling and Learning Distributed Word Representation with Metadata for Question Retrieval. IEEE Transactions on Knowledge and Data Engineering 29:6, 1226-1239. Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.1109/TKDE.2017.2665625
K V, Gupta D (2017). Text plagiarism classification using syntax based linguistic features. Expert Systems with Applications: An International Journal 88:C, 448-464. Online publication date: 1-Dec-2017.
https://dl.acm.org/doi/10.1016/j.eswa.2017.07.006
Zhang Q, Wang Y, Qian J, Huang X (2016). A Mixed Generative-Discriminative Based Hashing Method. IEEE Transactions on Knowledge and Data Engineering 28:4, 845-857. Online publication date: 1-Apr-2016.
https://dl.acm.org/doi/10.1109/TKDE.2015.2507127
Zhao M, Wang H, Cao L, Zhang C, Yin H, Xu F (2015). LSIF. Proceedings of the 2015 IEEE / WIC / ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) - Volume 01, 417-424. Online publication date: 6-Dec-2015.
https://dl.acm.org/doi/10.1109/WI-IAT.2015.2
