research-article

Context-aware Urdu Information Retrieval System

Published: 02 April 2023

Abstract

The World Wide Web (WWW) plays a vital role in sharing dynamic knowledge in every field of life. Information on the web comprises a huge amount of data in different forms: structured, semi-structured, and in some cases entirely unstructured. Because of this volume, searching large bodies of text for a specific topic, or retrieving precise information, is a challenging task. These conditions give rise to the problem of word sense ambiguity (WSA). An Urdu-language information retrieval system that uses techniques from the Web Semantic Search Engine architecture is proposed to retrieve relevant information efficiently and to address WSA. For single-word queries, the proposed system achieves an average precision of 96%, compared with 74% and 75% for the baseline search engines (the 75% figure is Google's). For long text queries, the proposed system reaches 92% accuracy, whereas Bing and Google reach 16.50% and 16%, respectively. Recall for single-word queries is 32.25%, compared with 25% for both Bing and Google. Recall for long text queries also improves, at 6.38% compared with 6.20% and 4.8% for Bing and Google, respectively. The results show that the proposed system gives better and more efficient results for the Urdu language than the existing systems.
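The abstract reports precision and recall without defining them. A minimal sketch of the standard set-based definitions follows, assuming each query has a list of retrieved document ids and a set of ids judged relevant. The function names, the document ids, and the per-query averaging are illustrative assumptions, not the authors' evaluation code.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: iterable of document ids returned by the system (hypothetical ids)
    relevant:  iterable of document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant documents actually retrieved
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def average_precision_recall(queries):
    """Average precision and recall over (retrieved, relevant) pairs.

    Averaging per query is an assumption made for this sketch.
    """
    scores = [precision_recall(r, rel) for r, rel in queries]
    if not scores:
        return 0.0, 0.0
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


if __name__ == "__main__":
    # Two made-up queries, purely for illustration.
    queries = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3", "d7", "d8", "d9"}),
        (["d5", "d6"], {"d5"}),
    ]
    print(average_precision_recall(queries))
```

Under these definitions, a system can score high precision and low recall at the same time when it returns few documents, most of them relevant, out of a large relevant set. That is consistent with the pattern of figures the abstract reports.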


Cited By

  • (2024) A Systematic Approach to Probabilistic Modeling for Retrieving Sindhi Text Documents. VFAST Transactions on Software Engineering 12:4 (199-208). DOI: 10.21015/vtse.v12i4.2010. Online publication date: 31-Dec-2024
  • (2024) Biomedical semantic text summarizer. BMC Bioinformatics 25:1. DOI: 10.1186/s12859-024-05712-x. Online publication date: 16-Apr-2024

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 3
    March 2023
    570 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3579816

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 April 2023
    Online AM: 14 October 2022
    Accepted: 26 November 2021
    Revised: 26 October 2021
    Received: 29 April 2021
    Published in TALLIP Volume 22, Issue 3

    Author Tags

    1. Urdu language
    2. information retrieval
    3. semantic web
    4. ontology
    5. triplets
    6. quad extraction
    7. context-based
    8. Web Semantic Search Engine
    9. WSA
    10. searching and indexing
    11. keywords
    12. corpus
    13. Uniform Resource Identifier

    Qualifiers

    • Research-article

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 69
    • Downloads (Last 6 weeks): 8
    Reflects downloads up to 20 Feb 2025
