
SimDoc: Topic Sequence Alignment based Document Similarity Framework

Published: 04 December 2017

Abstract

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that a document's thematic flow, which is often disregarded by bag-of-words techniques, is pivotal in estimating document similarity. To this end, we propose a novel semantic document similarity framework called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. We then use a sequence alignment algorithm to estimate the semantic similarity of two topic sequences. We further introduce a novel mechanism for computing topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques in accurately computing document similarity and in practical applications such as document clustering.
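The pipeline outlined in the abstract (topic sequences, a topic-topic similarity, and sequence alignment) can be illustrated with a minimal sketch. The abstract does not specify the alignment algorithm, the scoring scheme, or the topic-topic measure, so everything concrete here is an assumption made for illustration: a Smith-Waterman-style local alignment, cosine similarity between topic-word distributions, and the hypothetical names `make_topic_similarity`, `local_alignment_score`, `document_similarity`, `threshold`, and `gap`. It is not the authors' implementation.

```python
from math import sqrt
from typing import Callable, Dict, Sequence

TopicSim = Callable[[int, int], float]


def make_topic_similarity(topic_word: Dict[int, Sequence[float]]) -> TopicSim:
    """Build a topic-topic similarity from topic-word distributions.

    Assumption: cosine similarity stands in for the paper's own measure.
    """
    def sim(a: int, b: int) -> float:
        if a == b:
            return 1.0
        u, v = topic_word[a], topic_word[b]
        dot = sum(x * y for x, y in zip(u, v))
        norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
        return dot / norm if norm > 0 else 0.0
    return sim


def local_alignment_score(seq_a: Sequence[int], seq_b: Sequence[int],
                          topic_sim: TopicSim,
                          threshold: float = 0.5, gap: float = 0.5) -> float:
    """Smith-Waterman-style local alignment over two topic sequences.

    A topic pair scores (similarity - threshold), so similar topics add to
    the alignment and dissimilar ones subtract from it (assumed scheme).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = topic_sim(seq_a[i - 1], seq_b[j - 1]) - threshold
            table[i][j] = max(
                0.0,                        # start a fresh local alignment
                table[i - 1][j - 1] + pair,  # align the two topics
                table[i - 1][j] - gap,       # skip a topic in seq_a
                table[i][j - 1] - gap,       # skip a topic in seq_b
            )
            best = max(best, table[i][j])
    return best


def document_similarity(seq_a: Sequence[int], seq_b: Sequence[int],
                        topic_sim: TopicSim,
                        threshold: float = 0.5, gap: float = 0.5) -> float:
    """Normalise the alignment score to [0, 1] for similarities in [0, 1]."""
    # Upper bound: every topic of the shorter sequence matched perfectly.
    max_score = min(len(seq_a), len(seq_b)) * (1.0 - threshold)
    if max_score <= 0:
        return 0.0
    raw = local_alignment_score(seq_a, seq_b, topic_sim, threshold, gap)
    return raw / max_score


# Toy data: three topics over a four-word vocabulary (made up for the sketch).
toy_topics = {0: [0.7, 0.3, 0.0, 0.0], 1: [0.6, 0.4, 0.0, 0.0], 2: [0.0, 0.0, 0.5, 0.5]}
toy_sim = make_topic_similarity(toy_topics)
doc_a, doc_b, doc_c = [0, 2, 0], [1, 2, 1], [2, 0, 2]
print(document_similarity(doc_a, doc_b, toy_sim))  # same thematic flow, near-identical topics
print(document_similarity(doc_a, doc_c, toy_sim))  # same topics, different order
```

In this toy setup, `doc_a` and `doc_c` use the same topics in a different order, so an order-insensitive comparison would treat them as close, while the alignment score ranks `doc_b` (which follows the same thematic flow as `doc_a`) higher. In practice the topic sequences would come from a trained topic model rather than hand-written lists.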

        Published In

        K-CAP '17: Proceedings of the 9th Knowledge Capture Conference
        December 2017
        271 pages
        ISBN: 9781450355537
        DOI: 10.1145/3148011
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Document Topic Models
        2. Lexical Semantics
        3. Similarity Measures

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        • WDAqua

        Conference

        K-CAP 2017: Knowledge Capture Conference
        December 4 - 6, 2017
        Austin, TX, USA

        Acceptance Rates

        Overall Acceptance Rate 55 of 198 submissions, 28%

        Cited By

        • (2023) Natural Language Processing–Driven Similar Project Determination Using Project Scope Statements. Journal of Management in Engineering 39:3. DOI: 10.1061/JMENEA.MEENG-5229. Online publication date: May-2023
        • (2022) Alignment Techniques in Domain-Specific Models. Technologies and Innovation (45-61). DOI: 10.1007/978-3-031-19961-5_4. Online publication date: 23-Oct-2022
        • (2021) Study of Keyword Extraction Techniques for Electric Double‐Layer Capacitor Domain Using Text Similarity Indexes: An Experimental Analysis. Complexity 2021:1. DOI: 10.1155/2021/8192320. Online publication date: 2-Dec-2021
        • (2021) Comparison of document similarity algorithms in extracting document keywords from an academic paper. 2021 International Conference on Software Engineering & Computer Systems and 4th International Conference on Computational Science and Information Management (ICSECS-ICOCSIM) (631-636). DOI: 10.1109/ICSECS52883.2021.00121. Online publication date: Aug-2021
        • (2019) Towards Building an Arabic Plagiarism Detection System: Plagiarism Detection in Arabic. International Journal of Information Retrieval Research 9:3 (12-22). DOI: 10.4018/IJIRR.2019070102. Online publication date: Jul-2019
