research-article

Query-specific subtopic clustering

Authors: Sumanta Kashyapi, Laura Dietz

JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries

Article No.: 11, Pages 1 - 9

https://doi.org/10.1145/3529372.3530923

Published: 20 June 2022

Abstract

We propose a Query-Specific Siamese Similarity Metric (QS3M) for query-specific clustering of text documents. Our approach uses fine-tuned BERT embeddings to train a non-linear projection into a query-specific similarity space. We build on the idea of Siamese networks but include a third component, a representation of the query. QS3M models the fine-grained similarity between text passages about the same broad topic and also generalizes to new, unseen queries during evaluation. The empirical evaluation for clustering employs two TREC datasets and a set of academic abstracts from arXiv. When used to obtain query-relevant clusters, QS3M achieves a 12% performance improvement on the TREC datasets over a strong BERT-based reference method and over many baselines such as TF-IDF and topic models. A similar improvement is observed for the arXiv dataset, suggesting that QS3M is generally applicable to different domains. A qualitative evaluation is carried out to gain insight into the strengths and limitations of the model.
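The idea of a Siamese similarity with a third, query-side input can be illustrated with a toy sketch. This is not the QS3M architecture from the paper: it assumes that the query and each passage are already available as fixed-length vectors (stand-ins for BERT embeddings), that the query is simply concatenated with each passage, and that a single untrained tanh layer plays the role of the non-linear projection. All names (`project`, `query_specific_similarity`, `random_layer`) and all values are hypothetical.

```python
import math
import random


def project(query_vec, passage_vec, weights, bias):
    """Map the concatenation [query; passage] through one tanh layer.

    Assumption: a single dense layer stands in for the learned
    non-linear projection; the real model may differ.
    """
    joint = list(query_vec) + list(passage_vec)
    return [
        math.tanh(sum(w * x for w, x in zip(row, joint)) + b)
        for row, b in zip(weights, bias)
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_specific_similarity(query_vec, passage_a, passage_b, weights, bias):
    """Siamese scoring: both passages share the same weights and the same query."""
    za = project(query_vec, passage_a, weights, bias)
    zb = project(query_vec, passage_b, weights, bias)
    return cosine(za, zb)


def random_layer(in_dim, out_dim, seed=0):
    """Untrained placeholder weights; a real model would learn these."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return weights, bias


# Toy vectors standing in for embeddings (made-up values).
DIM = 4
weights, bias = random_layer(in_dim=2 * DIM, out_dim=3)
query = [0.2, -0.1, 0.5, 0.3]
p1 = [0.9, 0.1, -0.3, 0.4]
p2 = [-0.2, 0.8, 0.6, -0.5]
sim = query_specific_similarity(query, p1, p2, weights, bias)
```

Because the query enters the projection, the same pair of passages can receive different scores under different queries. A pairwise score of this kind could then be fed to a standard clustering routine.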

Cited By

Dietz L, Chatterjee S, Lennox C, Kashyapi S, Oza P, Gamari B, Amigo E, Castells P, Gonzalo J, Carterette B, Culpepper J, Kazai G. (2022). Wikimarks. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3003-3012. DOI: 10.1145/3477495.3531731. Online publication date: 6-Jul-2022
https://dl.acm.org/doi/10.1145/3477495.3531731

Index Terms

Query-specific subtopic clustering
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
  2. Information systems applications
    1. Digital libraries and archives

Information & Contributors

Information

Published In

JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries

June 2022

392 pages

ISBN:9781450393454

DOI:10.1145/3529372

General Chairs:
Akiko Aizawa, National Institute of Informatics, Japan
Thomas Mandl, University of Hildesheim, Germany
Zeljko Carevic, GESIS - Leibniz Institute for the Social Sciences, Germany

Program Chairs:
Annika Hinze, University of Waikato, New Zealand
Philipp Mayr, GESIS - Leibniz Institute for the Social Sciences, Germany
Philipp Schaer, TH Köln (University of Applied Sciences), Germany

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

IEEE Technical Committee on Digital Libraries (TC DL)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

JCDL '22

JCDL '22: The ACM/IEEE Joint Conference on Digital Libraries in 2022

June 20 - 24, 2022

Cologne, Germany

Acceptance Rates

JCDL '22 Paper Acceptance Rate: 35 of 132 submissions, 27%

Overall Acceptance Rate: 415 of 1,482 submissions, 28%

Bibliometrics & Citations

Article Metrics

Total Citations: 1
Total Downloads: 115
Downloads (Last 12 months): 25
Downloads (Last 6 weeks): 0

Reflects downloads up to 27 Feb 2025
