Abstract
With the rapid expansion of the Internet and the ever-growing volume of data, search engines face increasing difficulty in providing users with the most pertinent content for their queries. The typical search process, in which a user enters a query and the system returns a list of pages, often falls short of delivering highly relevant results, and existing ranking methods do not always align with user expectations. This paper addresses these challenges by developing a novel web page retrieval method that helps users obtain more relevant content. The proposed method uses a Google custom search engine to respond to the user query, collects the search results and their metadata from the Google dataset via the Google API, and stores them in a JSON file for the reranking process. The paper also proposes a novel query expansion model that leverages the generative abilities of large language models, specifically ChatGPT, for interactive and automated query expansion to enhance the accuracy of the search results. The proposed model uses two metrics, cosine similarity and word mover’s distance, to assess the similarity between user queries and retrieved results, drawing on document metadata to account for both the syntactic and semantic aspects of the text. The results show a marked improvement in the search results compared with the results retrieved using the Bing, DuckDuckGo, and Google PageRank algorithms.
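The reranking step described in the abstract can be illustrated with a minimal sketch. It covers only the cosine-similarity metric, computed over plain term-frequency vectors built from each result's metadata. The field names `title`, `snippet`, and `link`, the helper names, and the sample records are assumptions made for illustration; they are not taken from the paper, and the authors' actual pipeline may differ. Word mover’s distance is not shown because it requires pretrained word embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction to compare
    return dot / (norm_a * norm_b)


def rerank(query, results):
    """Sort results by similarity between the query and each result's metadata."""
    q_vec = Counter(tokenize(query))
    scored = []
    for r in results:
        # Assumed metadata fields: title and snippet (hypothetical schema).
        meta = r.get("title", "") + " " + r.get("snippet", "")
        scored.append((cosine_similarity(q_vec, Counter(tokenize(meta))), r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]


# Hypothetical stored results, invented for illustration (not data from the paper).
results = [
    {"title": "Cooking pasta at home", "snippet": "Simple recipes for dinner",
     "link": "https://example.com/a"},
    {"title": "Query expansion for web search",
     "snippet": "Expanding a query improves retrieval",
     "link": "https://example.com/b"},
    {"title": "Web search basics", "snippet": "How a search engine ranks pages",
     "link": "https://example.com/c"},
]
ranked = rerank("query expansion web search", results)
for r in ranked:
    print(r["link"])
```

In a fuller version, the query passed to `rerank` would be the expanded query, and the cosine score would be combined with a word mover’s distance score. How the two are weighted is not stated in the abstract.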















Data availability
No datasets were generated or analyzed during the current study.
Author information
Contributions
Ali A. Alani was responsible for conceptualizing the research, developing the methodology, and writing the manuscript. Adil Al-Azzawi contributed by reviewing, editing, and providing critical revisions to the manuscript. Both authors have approved the final version of the paper for submission.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Alani, A.A., Al-Azzawi, A. Optimizing web page retrieval performance with advanced query expansion: leveraging ChatGPT and metadata-driven analysis. J Supercomput 81, 569 (2025). https://doi.org/10.1007/s11227-025-07008-0
DOI: https://doi.org/10.1007/s11227-025-07008-0