Abstract
Term weighting has a significant effect on the retrieval of relevant documents, and many weighting methods have been proposed. The main question that arises is which weighting method is best. In this paper, it is shown that properly aggregating the weights generated by carefully selected basic weighting methods improves the retrieval of documents relevant to the user's needs. Toward this aim, it is shown that even simple central tendency measures, such as the average, median, or mid-range, applied over an appropriate subset of basic weighting methods yield term weights that outperform each basic weighting method on its own and are also more effective than recently proposed, more complicated weighting methods. By applying the proposed method to various datasets, we have studied the effects of normalizing the basic weights, normalizing the vector lengths, using different components in the term frequency factor, and related choices. The results reveal criteria for selecting an appropriate subset of basic weighting methods to feed to the aggregator in order to achieve higher retrieval precision.
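The aggregation idea described in the abstract can be illustrated with a minimal sketch. It assumes each basic weighting method has already produced one weight per term, and it combines those weights per term with the average, median, or mid-range. The function names, the sample weights, and the use of min-max scaling as the normalization step are assumptions made for illustration only; they are not taken from the paper.

```python
from statistics import mean, median


def mid_range(values):
    """Mid-range: midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


def min_max_normalize(weights):
    """Rescale one method's weights to [0, 1] (an assumed normalization).

    A constant weight vector is mapped to all zeros.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]


def aggregate(weights_by_method, aggregator, normalize=False):
    """Combine per-term weights from several methods into one weight per term.

    weights_by_method: list of equal-length lists, one list per basic method.
    aggregator: a function such as mean, median, or mid_range.
    """
    if normalize:
        weights_by_method = [min_max_normalize(w) for w in weights_by_method]
    # zip(*...) regroups the data so each tuple holds one term's weights.
    return [aggregator(term_weights) for term_weights in zip(*weights_by_method)]


# Hypothetical weights for three terms from three basic weighting methods.
weights = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 8.0],
    [6.0, 5.0, 4.0],
]

average_weights = aggregate(weights, mean)
median_weights = aggregate(weights, median)
mid_range_weights = aggregate(weights, mid_range)
normalized_median_weights = aggregate(weights, median, normalize=True)
```

The `normalize` flag loosely mirrors the distinction drawn in the Notes between aggregating non-normalized and normalized weights. Which basic methods to include in `weights_by_method` is exactly the selection question the paper studies, so the sketch leaves that choice to the caller.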
Notes
Generalised Smoothed Pólya Urn Document with v burstiness parameters set to bst.
Generalised Smoothed Pólya Urn Document with v burstiness parameters estimated via MCMC.
Average of Non-Normalized Weights.
Average of normaliZed Weights.
Median of Non-Normalized Weights.
Median of normaliZed Weights.
Mid-Range of Non-Normalized Weights.
Mid-Range of normaliZed Weights.
Acknowledgements
The authors received no funding for this research.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Ghahramani, F., Tahayori, H. & Visconti, A. Effects of central tendency measures on term weighting in textual information retrieval. Soft Comput 25, 7341–7378 (2021). https://doi.org/10.1007/s00500-021-05694-5
DOI: https://doi.org/10.1007/s00500-021-05694-5