Optimizing web page retrieval performance with advanced query expansion: leveraging ChatGPT and metadata-driven analysis

The Journal of Supercomputing

Abstract

With the rapid expansion of the Internet and the ever-growing volume of data, search engines find it increasingly difficult to provide users with the content most pertinent to their queries. The typical search process, in which a user enters a query and the system returns a list of pages, often falls short of delivering highly relevant results, and existing ranking methods do not always align with user expectations. This paper addresses these challenges by developing a novel web page retrieval method that helps users obtain more relevant content. The proposed method uses a Google custom search engine to respond to the user query, collects the search results and their metadata from the Google dataset via the Google API, and stores them in a JSON file for the reranking process. The paper also proposes a novel query expansion model that leverages the generative abilities of large language models, specifically ChatGPT, for interactive and automated query expansion to enhance the accuracy of the search results. The proposed model uses two metrics, cosine similarity and word mover’s distance, to assess the similarity between the user query and the retrieved results on the basis of document metadata, considering both the syntactic and the semantic aspects of the text. The results show a marked improvement in the search results compared with those retrieved using the Bing, DuckDuckGo, and Google PageRank algorithms.
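
The reranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the metadata field names (`title`, `snippet`, `link`), the `rerank` helper, and the hard-coded expansion terms (standing in for terms a language model such as ChatGPT might suggest) are all assumptions made for illustration. Only the cosine similarity metric is shown, computed over simple term-frequency vectors; word mover’s distance is omitted because it needs pretrained word embeddings.

```python
import json
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def rerank(query, expansion_terms, results):
    """Sort results by similarity of their metadata to the expanded query.

    `results` is a list of dicts with hypothetical keys "title" and "snippet".
    Returns (score, result) pairs, highest score first.
    """
    query_vec = Counter(tokenize(query + " " + " ".join(expansion_terms)))
    scored = []
    for item in results:
        meta_vec = Counter(tokenize(item["title"] + " " + item["snippet"]))
        scored.append((cosine_similarity(query_vec, meta_vec), item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical stored search results, shaped like a small JSON file.
stored = json.loads("""[
  {"title": "Python snake care guide",
   "snippet": "Feeding and housing a pet python",
   "link": "https://example.com/a"},
  {"title": "Python tutorial",
   "snippet": "Learn the Python programming language step by step",
   "link": "https://example.com/b"}
]""")

# Expansion terms are hard-coded here; in practice they would come from an LLM.
ranked = rerank("python tutorial", ["programming", "language"], stored)
for score, item in ranked:
    print(f"{score:.3f}  {item['title']}")
```

In this toy example the expansion terms pull the programming tutorial above the unrelated page about snakes, even though both results contain the word "python".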

Data availability

No datasets were generated or analyzed during the current study.

Author information

Contributions

Ali A. Alani was responsible for conceptualizing the research, developing the methodology, and writing the manuscript. Adil Al-Azzawi contributed by reviewing, editing, and providing critical revisions to the manuscript. Both authors have approved the final version of the paper for submission.

Corresponding author

Correspondence to Adil Al-Azzawi.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alani, A.A., Al-Azzawi, A. Optimizing web page retrieval performance with advanced query expansion: leveraging ChatGPT and metadata-driven analysis. J Supercomput 81, 569 (2025). https://doi.org/10.1007/s11227-025-07008-0

  • DOI: https://doi.org/10.1007/s11227-025-07008-0
