SimDoc: Topic Sequence Alignment based Document Similarity Framework

Authors: Gaurav Maheshwari, Priyansh Trivedi, Harshita Sahijwani, Sourish Dasgupta, Jens Lehmann

K-CAP '17: Proceedings of the 9th Knowledge Capture Conference

Article No.: 16, Pages 1 - 8

https://doi.org/10.1145/3148011.3148016

Published: 04 December 2017

Abstract

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that a document's thematic flow, which is often disregarded by bag-of-words techniques, is pivotal in estimating document similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. We then use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism for computing topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques both in accurately computing document similarity and in practical applications such as document clustering.
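
The idea of aligning topic sequences can be illustrated with a minimal sketch. The abstract does not say which alignment algorithm, scoring scheme, or topic-topic similarity SimDoc uses, so everything below is an assumption made for illustration: a Smith-Waterman-style local alignment, documents given as lists of integer topic IDs, a hypothetical exact-match similarity function `exact_match_sim`, and an arbitrary `gap_penalty` of 0.5.

```python
from typing import Callable, Sequence


def local_alignment_score(
    seq_a: Sequence[int],
    seq_b: Sequence[int],
    topic_sim: Callable[[int, int], float],
    gap_penalty: float = 0.5,
) -> float:
    """Best Smith-Waterman-style local alignment score of two topic sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # table[i][j] = best score of an alignment ending at seq_a[i-1], seq_b[j-1]
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = table[i - 1][j - 1] + topic_sim(seq_a[i - 1], seq_b[j - 1])
            skip_a = table[i - 1][j] - gap_penalty  # gap in seq_b
            skip_b = table[i][j - 1] - gap_penalty  # gap in seq_a
            # Flooring at 0 lets a new local alignment start anywhere.
            table[i][j] = max(0.0, match, skip_a, skip_b)
            best = max(best, table[i][j])
    return best


def exact_match_sim(t1: int, t2: int) -> float:
    """Hypothetical topic-topic similarity: +1 for the same topic, -1 otherwise."""
    return 1.0 if t1 == t2 else -1.0


def normalized_similarity(
    seq_a: Sequence[int],
    seq_b: Sequence[int],
    topic_sim: Callable[[int, int], float] = exact_match_sim,
    gap_penalty: float = 0.5,
) -> float:
    """Scale the score to [0, 1], assuming topic_sim never exceeds 1."""
    shorter = min(len(seq_a), len(seq_b))
    if shorter == 0:
        return 0.0
    return local_alignment_score(seq_a, seq_b, topic_sim, gap_penalty) / shorter


# Topic-ID sequences for three hypothetical documents.
doc_a = [3, 3, 7, 1, 4]
doc_b = [3, 7, 1, 4, 9]
doc_c = [4, 1, 7, 3, 3]  # same topics as doc_a, in reverse order

score_ab = local_alignment_score(doc_a, doc_b, exact_match_sim)
score_cb = local_alignment_score(doc_c, doc_b, exact_match_sim)
```

Here `doc_a` and `doc_c` contain the same topics, so an order-insensitive comparison would treat them identically, yet `doc_a` aligns with `doc_b` far better than `doc_c` does because it shares the run 3, 7, 1, 4 in the same order. A graded topic-topic similarity could be passed as `topic_sim` in place of the exact-match function.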

Index Terms

SimDoc: Topic Sequence Alignment based Document Similarity Framework
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.

Published In

K-CAP '17: Proceedings of the 9th Knowledge Capture Conference

December 2017

271 pages

ISBN:9781450355537

DOI:10.1145/3148011

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2017

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

WDAqua

Conference

K-CAP 2017

Sponsor:

SIGAI

K-CAP 2017: Knowledge Capture Conference

December 4 - 6, 2017

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 55 of 198 submissions, 28%

Cited By

Ko T, David Jeong H, Lee J (2023). Natural Language Processing–Driven Similar Project Determination Using Project Scope Statements. Journal of Management in Engineering 39:3. Online publication date: May-2023. https://doi.org/10.1061/JMENEA.MEENG-5229
Grijalva-Arriaga P, Cornejo-Gómez G, Gómez-Chabla R, Antonelli L, Thomas P (2022). Alignment Techniques in Domain-Specific Models. Technologies and Innovation (45-61). Online publication date: 23-Oct-2022. https://doi.org/10.1007/978-3-031-19961-5_4
Miah M, Sulaiman J, Sarwar T, Zamli K, Jose R (2021). Study of Keyword Extraction Techniques for Electric Double-Layer Capacitor Domain Using Text Similarity Indexes: An Experimental Analysis. Complexity 2021:1. Online publication date: 2-Dec-2021. https://doi.org/10.1155/2021/8192320
Miah M, Sulaiman J, Azad S, Zamli K, Jose R (2021). Comparison of document similarity algorithms in extracting document keywords from an academic paper. 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM) (631-636). Online publication date: Aug-2021. https://doi.org/10.1109/ICSECS52883.2021.00121
Khan I, Siddiqui M, Jambi K (2019). Towards Building an Arabic Plagiarism Detection System: Plagiarism Detection in Arabic. International Journal of Information Retrieval Research 9:3 (12-22). Online publication date: Jul-2019. https://doi.org/10.4018/IJIRR.2019070102
