DL-VSM based document indexing approach for information retrieval

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Textual information is constantly increasing. As documents accumulate, satisfying user needs becomes increasingly complex, and several information retrieval systems have been designed to respond to user requests. Document indexing is considered a crucial phase in information retrieval. The main contribution of this work is a novel hybrid approach for biomedical document indexing. We improve the estimation of the correspondence between a document and a given concept using two methods: the vector space model (VSM) and description logics (DL). VSM performs partial matching between documents and the terms of an external resource. DL represents knowledge in a form suited to better matching. The proposed approach reduces the limitations of exact matching: it indexes documents by exploiting the services of the Medical Subject Headings (MeSH) thesaurus with approximate matching. Approximate matching partially matches document terms against the biomedical vocabulary in order to capture morphological variants present in that resource, but it also generates irrelevant concepts. A filtering step addresses this problem by selecting the most important concepts using the knowledge provided by MeSH. Experiments carried out on different corpora show encouraging results (a 25% improvement in average accuracy over other approaches in the literature).
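To make the idea of VSM partial matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes raw term-frequency vectors, cosine similarity, and a simple score threshold standing in for the filtering step; the function names (`tokenize`, `cosine`, `rank_concepts`), the threshold value, and the concept labels are all hypothetical choices made for illustration. The DL component and any real MeSH lookup are out of scope here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_concepts(document, concepts, threshold=0.1):
    """Score each concept label against the document and keep those whose
    similarity reaches the threshold (an assumed stand-in for filtering)."""
    doc_vec = Counter(tokenize(document))
    scored = []
    for label in concepts:
        score = cosine(doc_vec, Counter(tokenize(label)))
        if score >= threshold:
            scored.append((label, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative labels written in the style of thesaurus headings.
    doc = "Chronic kidney disease increases the risk of heart failure"
    labels = ["Kidney Diseases", "Heart Failure", "Lung Neoplasms"]
    for label, score in rank_concepts(doc, labels):
        print(f"{label}: {score:.3f}")
```

In this sketch, "Kidney Diseases" still receives a non-zero score even though the exact phrase never occurs in the document, which is the behaviour that distinguishes partial from exact matching. Because no stemming is applied, "disease" and "diseases" do not match; handling such morphological variants would need an additional normalisation step.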

Notes

  1. CNLP.Platform http://www.nlp.org.cn.

  2. https://www.nlm.nih.gov/research/umls/.

  3. This research work is supported in part by the NSF Career Grant (NSF IIS 0448023), NSF CCF 0514679, and a research grant from the PA Dept of Health.

Author information

Corresponding author

Correspondence to Kabil Boukhari.

About this article

Cite this article

Boukhari, K., Omri, M.N. DL-VSM based document indexing approach for information retrieval. J Ambient Intell Human Comput 14, 5383–5394 (2023). https://doi.org/10.1007/s12652-020-01684-x

  • DOI: https://doi.org/10.1007/s12652-020-01684-x
