Lexifield: a system for the automatic building of lexicons by semantic expansion of short word lists

  • Regular Paper
Knowledge and Information Systems

Abstract

We present Lexifield, a fully automatic, language-independent system for building domain-specific lexicons from a short list of terms defining the domain. Lexifield relies on a pre-trained word embedding model, a definition dictionary and a dictionary of synonyms. To evaluate this system, four lexicons have been generated: one lexicon in French for the topic “son” (“sound”) and three lexicons in English for the topics “sound”, “taste” and “odour”. Compared with other word embedding-based systems and with Sensicon, a state-of-the-art sensorial lexicon, our system achieves better precision and recall on reference lists extracted from manually created resources such as Roget’s Thesaurus.
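The evaluation mentioned above scores a generated lexicon against a reference list. The following is a minimal sketch of such a comparison, assuming plain set-based precision and recall over lowercased terms. The function name `precision_recall` and the toy word lists are hypothetical and are not taken from the paper, which may normalise or match terms differently.

```python
def precision_recall(generated, reference):
    """Set-based precision and recall of a generated lexicon against a reference list.

    Assumption: terms are compared after trimming whitespace and lowercasing;
    the paper's actual matching procedure is not described in the abstract.
    """
    gen = {w.strip().lower() for w in generated if w.strip()}
    ref = {w.strip().lower() for w in reference if w.strip()}
    hits = gen & ref  # terms present in both lists
    precision = len(hits) / len(gen) if gen else 0.0
    recall = len(hits) / len(ref) if ref else 0.0
    return precision, recall


# Toy data, invented for illustration only.
generated = ["noise", "echo", "whisper", "colour"]
reference = ["noise", "echo", "whisper", "rumble", "hum"]

p, r = precision_recall(generated, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four generated terms appear in the reference list, and three of the five reference terms are recovered, so precision is 0.75 and recall is 0.60.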

Notes

  1. https://dictionary.cambridge.org/fr/dictionnaire/anglais/lexicon.

  2. http://wattpad.com.

  3. http://gcide.gnu.org.ua/.

  4. https://www.littre.org/.

  5. https://www.thesaurus.com/.

  6. http://crisco.unicaen.fr/des/.

  7. https://grammalecte.net/home.php?prj=fr.

  8. http://moby-thesaurus.org/.

  9. Because the embedding of a word depends on the context in which it is used, and thus on its POS tag, the experiments distinguish, for instance, taste\(_{noun}\) from taste\(_{verb}\).

  10. https://www.wiktionary.org/.

  11. https://dumps.wikimedia.org/backup-index.html.

  12. https://www.synonym.com/.

  13. http://crisco.unicaen.fr/des/.

  14. https://embeddings.sketchengine.co.uk/static/index.html.

  15. http://www.gutenberg.org/ebooks/22.

  16. https://github.com/Ejhfast/empath-client.

Acknowledgements

This work was partially supported by the SoundCITYve project from Labex IMU.

Author information

Corresponding author

Correspondence to Christine Largeron.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Mpouli, S., Beigbeder, M. & Largeron, C. Lexifield: a system for the automatic building of lexicons by semantic expansion of short word lists. Knowl Inf Syst 62, 3181–3201 (2020). https://doi.org/10.1007/s10115-020-01451-6

  • DOI: https://doi.org/10.1007/s10115-020-01451-6
