research-article

Offline versus Online Representation Learning of Documents Using External Knowledge

Authors:

Gia-Hung Nguyen,

Nathalie Souf

ACM Transactions on Information Systems (TOIS), Volume 37, Issue 4

Article No.: 42, Pages 1 - 34

https://doi.org/10.1145/3349527

Published: 19 September 2019

Abstract

Intensive recent research has investigated the combined use of hand-curated knowledge resources and corpus-driven resources to learn effective text representations. The overall learning process can be run online, by revising the learning objective, or offline, by refining an already learned representation. The differing impact of each learning approach on the quality of the learned representations has not yet been studied in the literature. This article focuses on the design of comparable offline and online knowledge-enhanced document representation learning models and on the comparison of their effectiveness using a set of standard IR and NLP downstream tasks. The results of quantitative and qualitative analyses show that (1) offline and online learning approaches exhibit dissimilar result trends depending on the task and on the dataset distribution with regard to the application domain; (2) while considering external knowledge resources is undoubtedly beneficial, the way relational constraints are expressed can affect the effectiveness of semantic inference. These findings present opportunities for the design of future representation learning models and offer insights into the evaluation of such models.
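To make the offline/online distinction in the abstract concrete, the following is a minimal toy sketch, not the models proposed in the article. It assumes that document vectors are plain Python lists, that the knowledge resource is reduced to a dictionary of linked documents, and that the corpus objective can be stood in for by a simple quadratic pull toward fixed "corpus" vectors. The names `refine_offline`, `train_online`, `corpus_vectors`, and `links`, as well as all hyperparameters, are hypothetical.

```python
# Toy sketch only: function names, data, and hyperparameters are illustrative
# assumptions, not the models evaluated in the article.

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def refine_offline(pretrained, neighbors, alpha=1.0, beta=1.0, iterations=20):
    """Offline regime: post-process vectors that were already learned.

    Each vector is repeatedly replaced by a weighted average of its original
    (pretrained) value and the current vectors of its linked neighbors.
    """
    refined = {key: list(vec) for key, vec in pretrained.items()}
    for _ in range(iterations):
        for key, linked in neighbors.items():
            if not linked:
                continue  # no knowledge about this item: keep it unchanged
            dim = len(pretrained[key])
            total = [alpha * pretrained[key][d] for d in range(dim)]
            for other in linked:
                for d in range(dim):
                    total[d] += beta * refined[other][d]
            denom = alpha + beta * len(linked)
            refined[key] = [t / denom for t in total]
    return refined


def train_online(corpus_targets, neighbors, lam=1.0, lr=0.1, steps=200):
    """Online regime: the knowledge term is part of the training objective.

    Gradient descent from a zero initialization on
        sum_i ||v_i - target_i||^2  +  lam * sum_(i,j linked) ||v_i - v_j||^2
    where the first term is a stand-in for a real corpus-driven loss.
    """
    dim = len(next(iter(corpus_targets.values())))
    vecs = {key: [0.0] * dim for key in corpus_targets}
    for _ in range(steps):
        # Gradient of the stand-in corpus loss.
        grads = {
            key: [2.0 * (vecs[key][d] - corpus_targets[key][d]) for d in range(dim)]
            for key in vecs
        }
        # Gradient of the knowledge penalty over every listed link.
        for key, linked in neighbors.items():
            for other in linked:
                for d in range(dim):
                    diff = vecs[key][d] - vecs[other][d]
                    grads[key][d] += 2.0 * lam * diff
                    grads[other][d] -= 2.0 * lam * diff
        for key in vecs:
            vecs[key] = [vecs[key][d] - lr * grads[key][d] for d in range(dim)]
    return vecs


# Hypothetical data: d1 and d2 are linked in the knowledge resource, d3 is not.
corpus_vectors = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [-1.0, -1.0]}
links = {"d1": ["d2"], "d2": ["d1"], "d3": []}

offline = refine_offline(corpus_vectors, links)
online = train_online(corpus_vectors, links, lam=1.0)
no_knowledge = train_online(corpus_vectors, links, lam=0.0)
```

In this toy setting both regimes pull the linked documents `d1` and `d2` closer together and leave `d3` governed by the corpus signal alone. Because the stand-in corpus loss is quadratic, the two procedures optimise very similar objectives here; with a real corpus-driven objective the knowledge term can interact with training in ways that post-processing cannot, which is the kind of difference the article sets out to compare.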


Index Terms

Offline versus Online Representation Learning of Documents Using External Knowledge
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Machine learning approaches
      1. Learning latent representations
2. Information systems
  1. Information retrieval


Published In


ACM Transactions on Information Systems Volume 37, Issue 4

October 2019

299 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/3357218

Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands


Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 September 2019

Accepted: 01 July 2019

Revised: 01 July 2019

Received: 01 January 2019

Published in TOIS Volume 37, Issue 4


Qualifiers

Research-article
Research
Refereed

Citations

Cited By

He L, Huang Z, Chen E, Liu Q, Tong S, Wang H, Lian D, Wang S (2023). An Efficient and Robust Semantic Hashing Framework for Similar Text Search. ACM Transactions on Information Systems 41:4 (1-31). Online publication date: 30-Jan-2023.
https://dl.acm.org/doi/10.1145/3570725
Hambarde K, Proença H (2023). Information Retrieval: Recent Advances and Beyond. IEEE Access 11 (76581-76604). Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3295776
Guo J, Cai Y, Fan Y, Sun F, Zhang R, Cheng X (2022). Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Transactions on Information Systems 40:4 (1-42). Online publication date: 24-Mar-2022.
https://dl.acm.org/doi/10.1145/3486250
Wu J, Zhang R, Mao Y, Huai J (2022). Dealing With Hierarchical Types and Label Noise in Fine-Grained Entity Typing. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (1305-1318). Online publication date: 2022.
https://doi.org/10.1109/TASLP.2022.3155281
Tamine L, Goeuriot L (2021). Semantic Information Retrieval on Medical Texts. ACM Computing Surveys 54:7 (1-38). Online publication date: 17-Sep-2021.
https://dl.acm.org/doi/10.1145/3462476
Agosti M, Marchesin S, Silvello G (2020). Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval. ACM Transactions on Information Systems 38:4 (1-48). Online publication date: 11-Sep-2020.
https://dl.acm.org/doi/10.1145/3417996
