Optimizing web page retrieval performance with advanced query expansion: leveraging ChatGPT and metadata-driven analysis

Alani, Ali A.; Al-Azzawi, Adil

doi:10.1007/s11227-025-07008-0

Published: 03 March 2025

Volume 81, article number 569, (2025)

The Journal of Supercomputing

Abstract

With the rapid expansion of the Internet and the ever-growing volume of data, search engines face increasing difficulty in providing users with the most pertinent content for their queries. The typical search process, in which a user inputs a query and the system returns a list of pages, often falls short of delivering highly relevant results, and existing ranking methods do not always align with user expectations. This paper addresses these challenges by developing a novel web page retrieval method to help users obtain more relevant content. The proposed method uses a Google custom search engine to respond to the user query, collects the search results along with their metadata from the Google dataset via the Google API, and stores them in a JSON file for the reranking process. The paper also proposes a novel query expansion model that leverages the generative abilities of large language models, specifically ChatGPT, for interactive and automated query expansion to enhance the accuracy of the search results. The proposed model uses two metrics, cosine similarity and word mover’s distance, to assess the similarity between user queries and retrieved results; both are computed over document metadata and take into account the syntactic and semantic aspects of the text. The proposed method is effective: the results show a marked improvement in the search results compared with those retrieved using the Bing, DuckDuckGo, and Google PageRank algorithms.
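
The reranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each stored search result is a dictionary with "title", "snippet", and "link" fields, that the expansion terms have already been obtained (for example from a language model), and that plain term-frequency vectors stand in for the text representation. The function names tokenize, cosine_similarity, and rerank are hypothetical. Word mover's distance is omitted because it requires pretrained word embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query, expansion_terms, results):
    """Order results by similarity between the expanded query and each
    result's metadata text (title plus snippet), highest score first."""
    # Assumption: expansion terms are simply appended to the original query.
    query_vec = Counter(tokenize(query + " " + " ".join(expansion_terms)))
    scored = []
    for item in results:
        meta_text = item.get("title", "") + " " + item.get("snippet", "")
        score = cosine_similarity(query_vec, Counter(tokenize(meta_text)))
        scored.append((score, item))
    # Sort on the score only; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]


# Illustrative data only; these results and URLs are made up.
results = [
    {"title": "Jaguar car dealership", "snippet": "Buy luxury cars",
     "link": "https://example.com/cars"},
    {"title": "Jaguar habitat and diet",
     "snippet": "The jaguar is a big cat living in rainforest",
     "link": "https://example.com/cat"},
]
ranked = rerank("jaguar", ["big", "cat", "rainforest"], results)
for item in ranked:
    print(item["link"])
```

In this toy example the expansion terms disambiguate the one-word query, so the result whose metadata shares more of the expanded vocabulary moves to the top. A semantic measure such as word mover's distance could be combined with this score, but how the paper weights the two metrics is not stated in the abstract.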

Data availability

No datasets were generated or analyzed during the current study.

Author information

Authors and Affiliations

Computer Science Department, College of Science, University of Diyala, Diyala, Iraq
Ali A. Alani & Adil Al-Azzawi

Contributions

Ali A. Alani was responsible for conceptualizing the research, developing the methodology, and writing the manuscript. Adil Al-Azzawi contributed by reviewing, editing, and providing critical revisions to the manuscript. Both authors have approved the final version of the paper for submission.

Corresponding author

Correspondence to Adil Al-Azzawi.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alani, A.A., Al-Azzawi, A. Optimizing web page retrieval performance with advanced query expansion: leveraging ChatGPT and metadata-driven analysis. J Supercomput 81, 569 (2025). https://doi.org/10.1007/s11227-025-07008-0

Accepted: 30 January 2025
Published: 03 March 2025
DOI: https://doi.org/10.1007/s11227-025-07008-0
