
Food Data Normalization Using Lexical and Semantic Similarities Heuristics

  • Conference paper
Biomedical Engineering Systems and Technologies (BIOSTEC 2020)

Abstract

Food is one of the main health and environmental factors in today’s society. With modernization, the food supply is expanding and the volume of food-related data is increasing. This data comes in many different forms, and making it interoperable is one of the main requirements for using it in any kind of analysis. One step towards this goal is normalizing data that comes from different sources. Food-related data is collected on various aspects: food composition, food consumption, recipe data, etc. The most commonly encountered form is data about food products, which, because it is meant to serve sales and profits, is often distorted and manipulated to fit the marketing plans of producers and retailers. As a result, the data is often misinterpreted. Some existing studies address the problem of heterogeneous data through data normalization based on the lexical similarity of the food products’ English names. We took this task a step further by considering data in a non-English, low-resource language: Slovenian. Working with such languages is challenging, as they have very limited resources and tools for Natural Language Processing (NLP). In our previously published work we considered different heuristics for matching food products: one based on lexical similarity [23], and two semantic similarity heuristics based on word vector representations (embeddings). These data normalization approaches were evaluated on a data set with 439 ground-truth pairs of food products, obtained by matching their EAN barcodes. In this work, we extend this approach by introducing a new semantic similarity heuristic based on sentence vector embeddings. Additionally, we extend the evaluation by taking real-world examples and asking a subject-matter expert to rate the relevance of the top three matches for each example.
The results show that semantic similarity with the sentence embedding method yields the best results, achieving 88% accuracy on the ground-truth data set and 91% accuracy in the human expert evaluation, while the lexical similarity heuristic provides comparable results, with 75% and 85% accuracy respectively.
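To make the two families of heuristics concrete, the following minimal sketch contrasts a lexical and a vector-based similarity for ranking candidate matches. It is an illustration under assumptions, not the authors' implementation: the abstract does not state the exact formulas, so token-level Jaccard similarity and cosine similarity are stand-ins, and the function names and the product names in the usage note are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def jaccard_similarity(name_a: str, name_b: str) -> float:
    """Lexical stand-in: token-level Jaccard similarity of two product names."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # both names empty: treat as no similarity
    return len(tokens_a & tokens_b) / len(union)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Semantic stand-in: cosine similarity of two embedding vectors.

    The vectors are assumed to come from some word or sentence embedding
    model; which model is used is outside the scope of this sketch.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # zero vector: similarity is undefined, return 0
    return dot / (norm_u * norm_v)


def top_matches(
    query: str,
    candidates: List[str],
    similarity: Callable[[str, str], float],
    k: int = 3,
) -> List[str]:
    """Return the k candidates most similar to the query, best first."""
    ranked = sorted(candidates, key=lambda c: similarity(query, c), reverse=True)
    return ranked[:k]
```

For instance, `top_matches("jogurt jagoda 150 g", candidates, jaccard_similarity)` would return up to three candidate names ranked by token overlap, mirroring the top-three evaluation described above. A semantic variant would first map each name to a vector and rank with `cosine_similarity` instead.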


References

  1. Alaux, J., Grave, E., Cuturi, M., Joulin, A.: Unsupervised hyperalignment for multilingual word embeddings. arXiv preprint arXiv:1811.01124 (2018)

  2. Aronson, A.R.: MetaMap: mapping text to the UMLS metathesaurus. Bethesda, MD: NLM, NIH, DHHS, pp. 1–26 (2006)

  3. Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32(suppl_1), D267–D270 (2004)

  4. Cestnik, B., et al.: Estimating probabilities: a crucial task in machine learning. In: ECAI, vol. 90, pp. 147–149 (1990)

  5. Chen, X., Cardie, C.: Unsupervised multilingual word embeddings. arXiv preprint arXiv:1808.08933 (2018)

  6. Donnelly, K.: SNOMED-CT: the advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 121, 279 (2006)

  7. European Food Safety Authority (EFSA): The food classification and description system FoodEx2 (revision 2). EFSA Supporting Publications 12(5), 804E (2015)

  8. Eftimov, T., Ispirova, G., Finglas, P., Korosec, P., Korousic-Seljak, B.: Quisper ontology learning from personalized dietary web services. In: KEOD, pp. 277–284 (2018)

  9. Eftimov, T., Korošec, P., Koroušić Seljak, B.: StandFood: standardization of foods using a semi-automatic system for classifying and describing foods according to FoodEx2. Nutrients 9(6), 542 (2017)

  10. Eftimov, T., Seljak, B.K.: POS tagging-probability weighted method for matching the internet recipe ingredients with food composition data. In: 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), vol. 1, pp. 330–336. IEEE (2015)

  11. Grcar, M., Krek, S., Dobrovoljc, K.: Obeliks: statisticni oblikoskladenjski oznacevalnik in lematizator za slovenski jezik. In: Zbornik Osme konference Jezikovne tehnologije, Ljubljana, Slovenia (2012)

  12. Griffiths, E.J., Dooley, D.M., Buttigieg, P.L., Hoehndorf, R., Brinkman, F.S., Hsiao, W.W.: FoodON: a global farm-to-fork food ontology. In: ICBO/BioCreative (2016)

  13. Ispirova, G., Eftimov, T., Korousic-Seljak, B., Korosec, P.: Mapping food composition data from various data sources to a domain-specific ontology. In: KEOD, pp. 203–210 (2017)

  14. Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625–633. ACM (2004)

  15. Kosub, S.: A note on the triangle inequality for the Jaccard distance. Pattern Recogn. Lett. 120, 36–38 (2019)

  16. Lu, Z., et al.: The gene normalization task in BioCreative III. BMC Bioinform. 12(8), S2 (2011)

  17. Màrquez, L., Rodríguez, H.: Part-of-speech tagging using decision trees. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 25–36. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0026668

  18. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

  20. Morgan, A.A., et al.: Overview of BioCreative II gene normalization. Genome Biol. 9, S3 (2008). https://doi.org/10.1186/gb-2008-9-s2-s3

  21. Pennington, J.A., Smith, E.C., Chatfield, M.R., Hendricks, T.C.: LANGUAL: a food-description language. Terminol. Int. J. Theoret. Appl. Issues Spec. Commun. 1(2), 277–289 (1994)

  22. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  23. Popovski, G., Ispirova, G., Hadzi-Kotarova, N., Valenčič, E., Eftimov, T., Seljak, B.K.: Food data integration by using heuristics based on lexical and semantic similarities. In: Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 5 HEALTHINF: HEALTHINF, pp. 208–216. INSTICC, SciTePress (2020). https://doi.org/10.5220/0008990602080216

  24. Popovski, G., Ispirova, G., Hadzi-Kotarova, N., Valenčič, E., Eftimov, T., Koroušić Seljak, B.: Food data integration by using heuristics based on lexical and semantic similarities. In: Proceedings of the 13th International Conference on Health Informatics (2020, in press)

  25. Popovski, G., Kochev, S., Koroušić Seljak, B., Eftimov, T.: FoodIE: a rule-based named-entity recognition method for food information extraction. In: Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019), pp. 915–922 (2019)

  26. Popovski, G., Koroušić Seljak, B., Eftimov, T.: FoodOntoMap: linking food concepts across different food ontologies. In: Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 2: KEOD, pp. 195–202. INSTICC, SciTePress (2019). https://doi.org/10.5220/0008353201950202

  27. Popovski, G., Seljak, B.K., Eftimov, T.: A survey of named-entity recognition methods for food information extraction. IEEE Access 8, 31586–31594 (2020)

  28. Pramanik, S., Hussain, A.: Text normalization using memory augmented neural networks. Speech Commun. 109, 15–23 (2019)

  29. Schuyler, P.L., Hole, W.T., Tuttle, M.S., Sherertz, D.D.: The UMLS Metathesaurus: representing different views of biomedical concepts. Bull. Med. Libr. Assoc. 81(2), 217 (1993)


Acknowledgements

This work was supported by the Slovenian Research Agency (research core funding No. P2-0098) and the European Union’s Horizon 2020 research and innovation programme (grant agreements No. 863059 and No. 769661).

Information and the views set out in this publication are those of the authors and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use that may be made of the information contained here.

Author information


Corresponding author

Correspondence to Gordana Ispirova.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Ispirova, G., Popovski, G., Valenčič, E., Hadzi-Kotarova, N., Eftimov, T., Seljak, B.K. (2021). Food Data Normalization Using Lexical and Semantic Similarities Heuristics. In: Ye, X., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2020. Communications in Computer and Information Science, vol 1400. Springer, Cham. https://doi.org/10.1007/978-3-030-72379-8_23


  • DOI: https://doi.org/10.1007/978-3-030-72379-8_23


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72378-1

  • Online ISBN: 978-3-030-72379-8

  • eBook Packages: Computer Science, Computer Science (R0)
