A Query-Based Weighted Document Partitioning Method for Load Balancing in Search Engines

Manshadi, Faridesadat Dehghan; Mostafavi, Seyedakbar; Zarifzadeh, Sajjad

doi:10.1007/s11277-023-10176-y

Published: 20 March 2023

Volume 129, pages 1489–1511, (2023)

Wireless Personal Communications

Faridesadat Dehghan Manshadi¹,
Seyedakbar Mostafavi ORCID: orcid.org/0000-0003-3530-2642² &
Sajjad Zarifzadeh¹

Abstract

To return the most relevant answers to a user’s query in the shortest time, search engines require a fast data retrieval mechanism. One factor affecting retrieval speed is how load is distributed among the servers. Load distribution, and consequently search engine performance, depends on how data is partitioned across the servers. Document-based distribution and word-based distribution are the two main partitioning methods, and neither guarantees a lasting load balance. Existing solutions for improving load balance under both methods use users’ query history to learn their search patterns. These methods examine queries to identify words that are popular among users and assign each word a weight that indicates its load. The problem is that, most of the time, a user’s intent is expressed by a word together with the words that follow it, not by the word alone. Considering words individually can assign a high weight to words that have no value to the user on their own, which can lead to an unfair distribution of load when data is distributed among the servers. The proposed method aims to improve data distribution among the servers, and thus load balance, by weighting the word sequences that make up the queries in addition to the individual words. Experimental results show that the proposed method improves load balance by 38.21% on average compared with the document-based distribution method and by 35.6% compared with existing methods for balancing load under document-based distribution.
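The idea of weighting word sequences alongside single words can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state the weighting formula or the assignment procedure, so this sketch assumes raw n-gram counts from a query log as weights and a greedy least-loaded-server assignment. All function names and the `max_n` parameter are hypothetical.

```python
from collections import Counter
import heapq


def ngrams(tokens, max_n):
    """Yield every word n-gram (as a tuple) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def term_weights(queries, max_n=2):
    """Assumed weighting: how often each word or word sequence occurs in the query log."""
    weights = Counter()
    for query in queries:
        weights.update(ngrams(query.lower().split(), max_n))
    return weights


def document_load(text, weights, max_n=2):
    """Estimated load of a document: sum of the weights of its distinct n-grams."""
    distinct = set(ngrams(text.lower().split(), max_n))
    return sum(weights.get(gram, 0) for gram in distinct)


def partition(docs, weights, num_servers, max_n=2):
    """Greedy assignment (an assumption): heaviest document first, to the least-loaded server."""
    loads = {doc_id: document_load(text, weights, max_n) for doc_id, text in docs.items()}
    heap = [(0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment = {}
    server_loads = [0] * num_servers
    # Ties between documents are broken by document id to keep the result deterministic.
    for doc_id in sorted(loads, key=lambda d: (-loads[d], d)):
        current, server = heapq.heappop(heap)
        assignment[doc_id] = server
        server_loads[server] = current + loads[doc_id]
        heapq.heappush(heap, (server_loads[server], server))
    return assignment, server_loads


# Hypothetical query log and documents, for illustration only.
query_log = ["new york hotels", "new york weather", "new car"]
documents = {
    "a": "new york hotels guide",
    "b": "new car prices",
    "c": "york weather",
    "d": "gardening tips",
}
assignment, server_loads = partition(documents, term_weights(query_log), num_servers=2)
print(assignment, server_loads)
```

In this sketch a document that contains the sequence "new york" receives extra load beyond what "new" and "york" contribute separately, which is the distinction the abstract draws between sequence-aware and word-only weighting.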

Data Availability

Enquiries about data availability should be directed to the authors.

Notes

https://www.google.com/.
Term Frequency/Inverse Document Frequency.
https://parsijoo.ir/.
https://trends.google.com/.
Third of July 2020.
Fourth of July 2020.
Direct N-gram IDF.
Indirect N-gram IDF.

Funding

The authors have not disclosed any funding.

Author information

Authors and Affiliations

Yazd University, Yazd, Iran
Faridesadat Dehghan Manshadi & Sajjad Zarifzadeh
Department of Computer Engineering, Yazd University, Yazd, Iran
Seyedakbar Mostafavi

Corresponding author

Correspondence to Seyedakbar Mostafavi.

Ethics declarations

Competing interests

The authors hereby declare that there are no financial or non-financial interests regarding this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Manshadi, F.D., Mostafavi, S. & Zarifzadeh, S. A Query-Based Weighted Document Partitioning Method for Load Balancing in Search Engines. Wireless Pers Commun 129, 1489–1511 (2023). https://doi.org/10.1007/s11277-023-10176-y

Accepted: 05 February 2023
Published: 20 March 2023
Issue Date: April 2023
DOI: https://doi.org/10.1007/s11277-023-10176-y
