Abstract
The volume of textual information is growing constantly, and as documents accumulate, satisfying user needs becomes increasingly difficult. Many information retrieval systems have been designed to respond to user requests. Document indexing is considered a crucial phase of information retrieval. The main contribution of this work is a novel hybrid approach to biomedical document indexing. We improve the estimate of the correspondence between a document and a given concept by combining two methods: the vector space model (VSM) and description logics (DL). VSM performs partial matching between documents and the terms of an external resource. DL represents knowledge in a form that supports better matching. The proposed approach reduces the limitations of exact matching: it indexes documents using the Medical Subject Headings (MeSH) thesaurus with approximate matching. Approximate matching partially matches document terms against the biomedical vocabulary, which retrieves morphological variants present in that resource but also generates irrelevant concepts. A filtering step addresses this problem by selecting the most important concepts using the knowledge provided by MeSH. Experiments on several corpora show encouraging results, with a 25% improvement in average accuracy over other approaches in the literature.
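The partial matching idea can be illustrated with a minimal sketch. It assumes TF-IDF weighting and cosine similarity, which is a common VSM instantiation but is not specified in the abstract. The concept entries, the function names, and the score threshold are all hypothetical, and the threshold is only a stand-in for the MeSH-based filtering step, not the DL reasoning the approach relies on.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vector(tokens, idf):
    """Weight each token by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_concepts(document, concepts, threshold=0.1):
    """Score each concept entry against the document; keep scores >= threshold."""
    concept_tokens = {name: tokenize(entry) for name, entry in concepts.items()}
    # Smoothed IDF computed over the concept entries (assumption: they act as the collection).
    n = len(concept_tokens)
    df = Counter(t for toks in concept_tokens.values() for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    doc_vec = tfidf_vector(tokenize(document), idf)
    scores = {
        name: cosine(doc_vec, tfidf_vector(toks, idf))
        for name, toks in concept_tokens.items()
    }
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical concept entries: a heading plus a few entry terms (not real MeSH data).
concepts = {
    "Myocardial Infarction": "myocardial infarction heart attack",
    "Diabetes Mellitus": "diabetes mellitus insulin glucose",
    "Neoplasms": "neoplasms tumor cancer",
}
document = "Patients with a prior heart attack showed elevated glucose."
ranked = rank_concepts(document, concepts)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The document shares only some terms with each concept entry, so an exact match on the full heading would find nothing, whereas the cosine score still ranks the partially matching concepts and drops the one with no overlap.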
Notes
CNLP.Platform http://www.nlp.org.cn.
This research work is supported in part by the NSF Career Grant (NSF IIS 0448023), NSF CCF 0514679, and a research grant from the PA Dept of Health.
Cite this article
Boukhari, K., Omri, M.N. DL-VSM based document indexing approach for information retrieval. J Ambient Intell Human Comput 14, 5383–5394 (2023). https://doi.org/10.1007/s12652-020-01684-x