Abstract
To return the most relevant answers to a user's query in the shortest time, search engines require a fast data retrieval mechanism. One factor affecting retrieval speed is how the load is distributed among the servers. Load distribution, and consequently search engine performance, depends on how data is partitioned across servers. Document-based distribution and word-based distribution are the two main partitioning methods, and neither guarantees a lasting load balance. Existing solutions for improving load balance under both methods use users' query history to learn their search patterns. These methods examine queries to identify words that are popular among users and assign each word a weight that indicates the load it generates. The problem is that, most of the time, a user's intent is expressed by a word together with the words that follow it, not by the word alone. Considering words individually can assign a high weight to words that have no value to the user on their own, which can lead to an unfair load distribution when data is partitioned across servers. The proposed method aims to improve the data distribution process, and thus the load balance, by weighting the sequences of words that make up the queries in addition to the individual words. Experimental results show that the proposed method improves load balance by 38.21% on average compared with the document-based distribution method and by 35.6% compared with existing methods for achieving load balance in document-based distribution.
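The idea of weighting word sequences alongside single words can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the raw frequency counts, the greedy "heaviest document to the least-loaded server" assignment, and all function names (`term_and_sequence_weights`, `document_load`, `partition`) are assumptions made purely for illustration, and the paper's actual weighting scheme may differ.

```python
from collections import Counter
import heapq


def term_and_sequence_weights(queries, max_n=2):
    """Count how often each word and each word sequence (up to max_n
    words long) occurs in a query log. Raw counts stand in for the
    weights here; this is an illustrative assumption."""
    weights = Counter()
    for query in queries:
        words = query.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                weights[tuple(words[i:i + n])] += 1
    return weights


def document_load(text, weights, max_n=2):
    """Estimate a document's load as the summed weight of the distinct
    query words and word sequences that the document contains."""
    words = text.lower().split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.add(tuple(words[i:i + n]))
    return sum(weights.get(g, 0) for g in grams)


def partition(docs, weights, num_servers):
    """Greedy document partitioning (hypothetical): place documents in
    order of decreasing estimated load, each on the server that
    currently carries the smallest total load."""
    loads = {doc_id: document_load(text, weights) for doc_id, text in docs.items()}
    heap = [(0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment = {}
    for doc_id in sorted(loads, key=loads.get, reverse=True):
        server_load, server = heapq.heappop(heap)
        assignment[doc_id] = server
        heapq.heappush(heap, (server_load + loads[doc_id], server))
    server_loads = [load for load, _ in sorted(heap, key=lambda item: item[1])]
    return assignment, server_loads


if __name__ == "__main__":
    queries = ["new york hotels", "new york weather", "new car"]
    docs = {
        "d1": "cheap new york hotels",
        "d2": "new car review",
        "d3": "york weather today",
        "d4": "gardening tips",
    }
    weights = term_and_sequence_weights(queries)
    assignment, server_loads = partition(docs, weights, num_servers=2)
    print(assignment)
    print(server_loads)
```

In this toy log, the sequence "new york" receives its own weight, so a document containing the full phrase is treated as heavier than one that merely contains the frequent word "new". That is the intuition behind weighting sequences and not just individual words.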
Data Availability
Enquiries about data availability should be directed to the authors.
Notes
Term Frequency/Inverse Document Frequency.
Third of July 2020.
Fourth of July 2020.
Direct N-gram IDF.
Indirect N-gram IDF.
Funding
The authors have not disclosed any funding.
Ethics declarations
Competing interests
The authors declare that they have no financial or non-financial competing interests regarding this article.
About this article
Cite this article
Manshadi, F.D., Mostafavi, S. & Zarifzadeh, S. A Query-Based Weighted Document Partitioning Method for Load Balancing in Search Engines. Wireless Pers Commun 129, 1489–1511 (2023). https://doi.org/10.1007/s11277-023-10176-y
DOI: https://doi.org/10.1007/s11277-023-10176-y