research-article

Offline versus Online Representation Learning of Documents Using External Knowledge

Published: 19 September 2019

Abstract

A substantial body of recent research has investigated the combined use of hand-curated knowledge resources and corpus-driven resources to learn effective text representations. The learning process can be run either online, by revising the learning objective, or offline, by refining a previously learned representation. The differing impact of each learning approach on the quality of the learned representations has not yet been studied in the literature. This article focuses on the design of comparable offline and online knowledge-enhanced document representation learning models, and on the comparison of their effectiveness on a set of standard IR and NLP downstream tasks. The results of quantitative and qualitative analyses show that (1) offline and online learning approaches exhibit dissimilar result trends depending on the task, and the dataset distribution also matters with regard to the application domain; (2) while considering external knowledge resources is undoubtedly beneficial, the way relational constraints are expressed could affect the effectiveness of semantic inference. These findings present opportunities for the design of future representation learning models and offer insights into the evaluation of such models.
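To make the offline route concrete, the sketch below refines already-trained vectors by pulling each one toward its neighbours in a knowledge resource, in the style of retrofitting. It is an illustration only and not the model proposed in the article: the function name `retrofit`, the weights `alpha` and `beta`, and the toy inputs are all assumptions. An online approach would instead add a comparable knowledge term to the training objective itself, which is not shown here.

```python
import numpy as np


def retrofit(vectors, neighbors, alpha=1.0, beta=1.0, iterations=10):
    """Offline refinement sketch (hypothetical helper, not the article's model).

    vectors:   dict mapping an item id to its pre-trained numpy vector
    neighbors: dict mapping an item id to ids it is linked to in a knowledge resource
    alpha:     weight keeping a vector close to its original position
    beta:      weight pulling a vector toward each linked neighbour
    """
    refined = {key: vec.copy() for key, vec in vectors.items()}
    for _ in range(iterations):
        for key, linked in neighbors.items():
            # Ignore links to items that have no vector.
            linked = [n for n in linked if n in refined]
            if key not in refined or not linked:
                continue
            neighbor_sum = sum(refined[n] for n in linked)
            # Weighted average of the original vector and the current neighbour vectors.
            refined[key] = (alpha * vectors[key] + beta * neighbor_sum) / (
                alpha + beta * len(linked)
            )
    return refined


# Toy usage: "a" and "b" are linked in the knowledge resource, "c" is isolated.
toy_vectors = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.0, 1.0]),
    "c": np.array([1.0, 1.0]),
}
toy_links = {"a": ["b"], "b": ["a"]}
toy_refined = retrofit(toy_vectors, toy_links)
```

In this toy setting the linked items move closer together while the isolated item keeps its original vector, which is the behaviour an offline refinement step is meant to produce.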

Information & Contributors

Information

Published In

ACM Transactions on Information Systems  Volume 37, Issue 4
October 2019
299 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3357218
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 September 2019
Accepted: 01 July 2019
Revised: 01 July 2019
Received: 01 January 2019
Published in TOIS Volume 37, Issue 4

Author Tags

  1. Representation learning
  2. information retrieval
  3. knowledge resources
  4. natural language processing

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2023) An Efficient and Robust Semantic Hashing Framework for Similar Text Search. ACM Transactions on Information Systems 41:4 (1-31). DOI: 10.1145/3570725. Online publication date: 30-Jan-2023
  • (2023) Information Retrieval: Recent Advances and Beyond. IEEE Access 11 (76581-76604). DOI: 10.1109/ACCESS.2023.3295776. Online publication date: 2023
  • (2022) Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Transactions on Information Systems 40:4 (1-42). DOI: 10.1145/3486250. Online publication date: 24-Mar-2022
  • (2022) Dealing With Hierarchical Types and Label Noise in Fine-Grained Entity Typing. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (1305-1318). DOI: 10.1109/TASLP.2022.3155281. Online publication date: 2022
  • (2021) Semantic Information Retrieval on Medical Texts. ACM Computing Surveys 54:7 (1-38). DOI: 10.1145/3462476. Online publication date: 17-Sep-2021
  • (2020) Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval. ACM Transactions on Information Systems 38:4 (1-48). DOI: 10.1145/3417996. Online publication date: 11-Sep-2020