Abstract
Food is one of the main health and environmental factors in today’s society. With modernization, the food supply is expanding and the volume of food-related data is increasing. This data comes in many different forms, and making it interoperable is one of the main requirements for using it in any kind of analysis. One step towards this goal is the normalization of data coming from different sources. Food-related data is collected regarding various aspects: food composition, food consumption, recipe data, etc. The most commonly encountered form is data about food products, which, in order to serve its purpose of driving sales and profits, is often distorted and manipulated to fit the marketing plans of producers and retailers. As a result, the data is often misinterpreted. Some studies address the problem of heterogeneous data through data normalization based on the lexical similarity of the food products’ English names. We took this task a step further by considering data in a non-English, low-resource language: Slovenian. Working with such languages is challenging, as they have very limited resources and tools for Natural Language Processing (NLP). In our previously published work, we considered different heuristics for matching food products: one based on lexical similarity [23], and two semantic similarity heuristics based on word vector representations (embeddings). These data normalization approaches were evaluated on a data set with 439 ground truth pairs of food products, obtained by matching their EAN barcodes. In this work, we extend this approach by introducing a new semantic similarity heuristic based on sentence vector embeddings. Additionally, we extend the evaluation by taking real-world examples and tasking a subject-matter expert to rate the relevance of the top three matches for each example.
The results show that using semantic similarity with the sentence embedding method yields the best results, achieving 88% accuracy on the ground truth data set and 91% accuracy in the human expert evaluation, while the lexical similarity heuristic provides comparable results with 75% and 85% accuracy, respectively.
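As an illustration of the kind of evaluation described above, the following minimal sketch matches product names by lexical similarity and scores the matches against known pairs. It is not the authors' implementation: the Jaccard index over word tokens is an assumed stand-in for the lexical heuristic, the function names are hypothetical, and the product names are invented for the example.

```python
from typing import Dict, List, Set


def tokens(name: str) -> Set[str]:
    """Lowercase a product name and split it into a set of word tokens."""
    return set(name.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query: str, candidates: List[str]) -> str:
    """Return the candidate whose token set is most similar to the query."""
    q = tokens(query)
    return max(candidates, key=lambda c: jaccard(q, tokens(c)))


def accuracy(ground_truth: Dict[str, str], candidates: List[str]) -> float:
    """Fraction of source products whose best match equals the known pair."""
    hits = sum(
        best_match(src, candidates) == tgt for src, tgt in ground_truth.items()
    )
    return hits / len(ground_truth)


# Hypothetical product names, invented for illustration only.
catalogue = ["jogurt jagoda 150 g", "mleko polnomastno 1 l", "sir gavda rezine"]
pairs = {
    "jagodni jogurt 150 g": "jogurt jagoda 150 g",
    "polnomastno mleko 1 l": "mleko polnomastno 1 l",
}
print(accuracy(pairs, catalogue))
```

A semantic variant of this sketch would keep `best_match` and `accuracy` unchanged and replace `jaccard` with a cosine similarity between embedding vectors of the two names.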
References
Alaux, J., Grave, E., Cuturi, M., Joulin, A.: Unsupervised hyperalignment for multilingual word embeddings. arXiv preprint arXiv:1811.01124 (2018)
Aronson, A.R.: MetaMap: mapping text to the UMLS metathesaurus. Bethesda, MD: NLM, NIH, DHHS, pp. 1–26 (2006)
Bodenreider, O.: The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32(suppl_1), D267–D270 (2004)
Cestnik, B., et al.: Estimating probabilities: a crucial task in machine learning. In: ECAI, vol. 90, pp. 147–149 (1990)
Chen, X., Cardie, C.: Unsupervised multilingual word embeddings. arXiv preprint arXiv:1808.08933 (2018)
Donnelly, K.: SNOMED-CT: the advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 121, 279 (2006)
European Food Safety Authority (EFSA): The food classification and description system FoodEx 2 (revision 2). EFSA Supporting Publications 12(5), 804E (2015)
Eftimov, T., Ispirova, G., Finglas, P., Korosec, P., Korousic-Seljak, B.: Quisper ontology learning from personalized dietary web services. In: KEOD, pp. 277–284 (2018)
Eftimov, T., Korošec, P., Koroušić Seljak, B.: StandFood: standardization of foods using a semi-automatic system for classifying and describing foods according to FoodEx2. Nutrients 9(6), 542 (2017)
Eftimov, T., Seljak, B.K.: Pos tagging-probability weighted method for matching the internet recipe ingredients with food composition data. In: 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), vol. 1, pp. 330–336. IEEE (2015)
Grcar, M., Krek, S., Dobrovoljc, K.: Obeliks: statisticni oblikoskladenjski oznacevalnik in lematizator za slovenski jezik. In: Zbornik Osme konference Jezikovne tehnologije, Ljubljana, Slovenia (2012)
Griffiths, E.J., Dooley, D.M., Buttigieg, P.L., Hoehndorf, R., Brinkman, F.S., Hsiao, W.W.: FoodON: a global farm-to-fork food ontology. In: ICBO/BioCreative (2016)
Ispirova, G., Eftimov, T., Korousic-Seljak, B., Korosec, P.: Mapping food composition data from various data sources to a domain-specific ontology. In: KEOD, pp. 203–210 (2017)
Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 625–633. ACM (2004)
Kosub, S.: A note on the triangle inequality for the Jaccard distance. Pattern Recogn. Lett. 120, 36–38 (2019)
Lu, Z., et al.: The gene normalization task in BioCreative III. BMC Bioinform. 12(8), S2 (2011)
Màrquez, L., Rodríguez, H.: Part-of-speech tagging using decision trees. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 25–36. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0026668
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Morgan, A.A., et al.: Overview of BioCreative II gene normalization. Genome Biol. 9, S3 (2008). https://doi.org/10.1186/gb-2008-9-s2-s3
Pennington, J.A., Smith, E.C., Chatfield, M.R., Hendricks, T.C.: LANGUAL: a food-description language. Terminol. Int. J. Theoret. Appl. Issues Spec. Commun. 1(2), 277–289 (1994)
Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Popovski, G., Ispirova, G., Hadzi-Kotarova, N., Valenčič, E., Eftimov, T., Seljak, B.K.: Food data integration by using heuristics based on lexical and semantic similarities. In: Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 5 HEALTHINF: HEALTHINF, pp. 208–216. INSTICC, SciTePress (2020). https://doi.org/10.5220/0008990602080216
Popovski, G., Ispirova, G., Hadzi-Kotarova, N., Valenčič, E., Eftimov, T., Koroušić Seljak, B.: Food data integration by using heuristics based on lexical and semantic similarities. In: Proceedings of the 13th International Conference on Health Informatics (2020, in press)
Popovski, G., Kochev, S., Koroušić Seljak, B., Eftimov, T.: FoodIE: a rule-based named-entity recognition method for food information extraction. In: Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods, (ICPRAM 2019), pp. 915–922 (2019)
Popovski, G., Koroušić Seljak, B., Eftimov, T.: FoodOntoMap: linking food concepts across different food ontologies. In: Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 2: KEOD, pp. 195–202. INSTICC, SciTePress (2019). https://doi.org/10.5220/0008353201950202
Popovski, G., Seljak, B.K., Eftimov, T.: A survey of named-entity recognition methods for food information extraction. IEEE Access 8, 31586–31594 (2020)
Pramanik, S., Hussain, A.: Text normalization using memory augmented neural networks. Speech Commun. 109, 15–23 (2019)
Schuyler, P.L., Hole, W.T., Tuttle, M.S., Sherertz, D.D.: The UMLS Metathesaurus: representing different views of biomedical concepts. Bull. Med. Libr. Assoc. 81(2), 217 (1993)
Acknowledgements
This work was supported by the project from the Slovenian Research Agency (research core funding No. P2-0098), and the European Union’s Horizon 2020 research and innovation programme (grant agreements No. 863059 and No. 769661).
Information and the views set out in this publication are those of the authors and do not necessarily reflect the official opinion of the European Union. Neither the European Union institutions and bodies nor any person acting on their behalf may be held responsible for the use that may be made of the information contained here.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ispirova, G., Popovski, G., Valenčič, E., Hadzi-Kotarova, N., Eftimov, T., Seljak, B.K. (2021). Food Data Normalization Using Lexical and Semantic Similarities Heuristics. In: Ye, X., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2020. Communications in Computer and Information Science, vol 1400. Springer, Cham. https://doi.org/10.1007/978-3-030-72379-8_23
DOI: https://doi.org/10.1007/978-3-030-72379-8_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72378-1
Online ISBN: 978-3-030-72379-8
eBook Packages: Computer Science, Computer Science (R0)