
A Query-Based Weighted Document Partitioning Method for Load Balancing in Search Engines

Wireless Personal Communications

Abstract

To return the most relevant answers to a user’s query in the shortest time, search engines need a fast data retrieval mechanism. One factor that affects retrieval speed is how load is distributed among the servers, and that distribution, and consequently the performance of the search engine, depends on how data is partitioned across them. Document-based and word-based distribution are the two main partitioning methods, and neither guarantees a lasting load balance. Existing approaches to improving load balance under either method use users’ query history to learn their search patterns: they examine queries to identify popular words and assign each word a weight that indicates the load it generates. The problem is that, most of the time, the user’s intent is expressed by a word together with the words that follow it, not by the word alone. Weighting words individually can therefore assign a high weight to words that have no value to the user on their own, which can lead to an unfair distribution of load when data is partitioned among servers. The proposed method aims to improve data distribution among the servers, and thus the load balance, by weighting the word sequences that make up the queries in addition to the individual words. Experimental results show that the proposed method improves load balance by 38.21% on average compared with the document-based distribution method, and by 35.6% compared with existing methods for balancing load under document-based distribution.
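
The idea of weighting word sequences as well as single words can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details are in the full article. The function names, the use of raw query-log counts as weights, the limit of two-word sequences, and the greedy least-loaded assignment are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def ngram_weights(queries, n_max=2):
    """Count how often each word and each word sequence (up to n_max words)
    appears in the query log. Raw counts as weights are an assumption."""
    weights = Counter()
    for query in queries:
        words = query.lower().split()
        for n in range(1, n_max + 1):
            for i in range(len(words) - n + 1):
                weights[tuple(words[i:i + n])] += 1
    return weights


def document_load(text, weights, n_max=2):
    """Estimate a document's load as the summed weight of the distinct
    query n-grams it contains (a hypothetical load model)."""
    words = text.lower().split()
    seen = set()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            seen.add(tuple(words[i:i + n]))
    return sum(weights[g] for g in seen if g in weights)


def partition(docs, weights, n_servers):
    """Greedy assignment: heaviest document first, always placed on the
    server with the smallest accumulated load."""
    loads = {doc_id: document_load(text, weights) for doc_id, text in docs.items()}
    heap = [(0, server) for server in range(n_servers)]  # (current load, server index)
    heapq.heapify(heap)
    assignment = {}
    for doc_id in sorted(loads, key=lambda d: (-loads[d], d)):
        load, server = heapq.heappop(heap)
        assignment[doc_id] = server
        heapq.heappush(heap, (load + loads[doc_id], server))
    return assignment, loads
```

In this sketch, a document that contains the full sequence "new york" from a frequent query receives more weight than one that merely contains "york", so the two are no longer treated as equally costly when documents are spread across servers.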


Data Availability

Enquiries about data availability should be directed to the authors.

Notes

  1. https://www.google.com/.

  2. Term Frequency/Inverse Document Frequency.

  3. https://parsijoo.ir/.

  4. https://trends.google.com/.

  5. Third of July 2020.

  6. Fourth of July 2020.

  7. Direct N-gram IDF.

  8. Indirect N-gram IDF.


Funding

The authors have not disclosed any funding.

Author information

Corresponding author

Correspondence to Seyedakbar Mostafavi.

Ethics declarations

Competing interests

The authors declare that they have no financial or non-financial interests regarding this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Manshadi, F.D., Mostafavi, S. & Zarifzadeh, S. A Query-Based Weighted Document Partitioning Method for Load Balancing in Search Engines. Wireless Pers Commun 129, 1489–1511 (2023). https://doi.org/10.1007/s11277-023-10176-y

  • DOI: https://doi.org/10.1007/s11277-023-10176-y
