Effects of central tendency measures on term weighting in textual information retrieval

  • Methodologies and Application
Soft Computing

Abstract

Term weighting has a significant effect on the retrieval of relevant documents, and many weighting methods have been proposed. The main question that arises is which weighting method is best. This paper shows that properly aggregating the weights generated by carefully selected basic weighting methods improves the retrieval of documents relevant to the user’s needs. In particular, even simple central tendency measures such as the average, median, or mid-range, applied over an appropriate subset of basic weighting methods, yield term weights that not only outperform each basic weighting method used alone but are also more effective than recently proposed, more complicated weighting methods. By applying the proposed method to various datasets, we study the effects of normalizing the basic weights, normalizing the vector lengths, using different components in the term frequency factor, and related choices. The results reveal criteria for selecting an appropriate subset of basic weighting methods to feed to the aggregator in order to achieve higher retrieval precision.
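The aggregation idea described in the abstract can be illustrated with a minimal sketch. It assumes each basic weighting method has already produced a weight for every term; the method names, the `aggregate_weights` helper, and the sample numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean, median


def mid_range(values):
    """Mid-range: the midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


# Central tendency measures named in the abstract.
AGGREGATORS = {"average": mean, "median": median, "mid_range": mid_range}


def aggregate_weights(basic_weights, measure="average"):
    """Combine per-method term weights into one weight per term.

    basic_weights maps a (hypothetical) method name to a dict of
    term -> weight. A term missing from a method is treated as weight 0.0,
    which is an assumption of this sketch.
    """
    agg = AGGREGATORS[measure]
    terms = set().union(*(w.keys() for w in basic_weights.values()))
    return {
        t: agg([w.get(t, 0.0) for w in basic_weights.values()])
        for t in terms
    }


# Illustrative weights from three made-up basic methods.
basic_weights = {
    "method_a": {"retrieval": 1.0, "weight": 0.0},
    "method_b": {"retrieval": 2.0, "weight": 4.0},
    "method_c": {"retrieval": 6.0, "weight": 2.0},
}

for name in AGGREGATORS:
    print(name, aggregate_weights(basic_weights, name))
```

Which basic methods belong in the subset is the question the paper studies; the sketch simply aggregates whatever it is given.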


Notes

  1. Generalised Smoothed Pólya Urn Document with v burstiness parameters set to bst.

  2. Generalised Smoothed Pólya Urn Document with v burstiness parameters estimated via MCMC.

  3. Average of Non-Normalized Weights.

  4. Average of normaliZed Weights.

  5. Median of Non-Normalized Weights.

  6. Median of normaliZed Weights.

  7. Mid-Range of Non-Normalized Weights.

  8. Mid-Range of normaliZed Weights.
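Notes 3 to 8 distinguish aggregates over non-normalized and normalized weights. The sketch below shows one way the two variants could differ. It assumes min-max scaling of each basic method's weights to [0, 1]; the preview does not state which normalization the authors use, so this choice, the `aggregate_terms` helper, and the sample values are all hypothetical.

```python
from statistics import mean, median


def mid_range(values):
    """Midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


def min_max_normalize(weights):
    """Scale one method's weights to [0, 1] (assumed normalization)."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # All weights equal: map everything to 0.0 to avoid dividing by zero.
        return {t: 0.0 for t in weights}
    return {t: (w - lo) / (hi - lo) for t, w in weights.items()}


def aggregate_terms(per_method, agg, normalize=False):
    """Aggregate term weights across methods, optionally normalizing first."""
    methods = list(per_method.values())
    if normalize:
        methods = [min_max_normalize(w) for w in methods]
    terms = set().union(*(w.keys() for w in methods))
    return {t: agg([w.get(t, 0.0) for w in methods]) for t in terms}


# Two made-up methods whose weights live on different scales.
per_method = {
    "method_a": {"x": 0.0, "y": 10.0, "z": 5.0},
    "method_b": {"x": 1.0, "y": 3.0, "z": 2.0},
}

print(aggregate_terms(per_method, mean))                  # non-normalized average
print(aggregate_terms(per_method, mean, normalize=True))  # normalized average
print(aggregate_terms(per_method, median, normalize=True))
print(aggregate_terms(per_method, mid_range))
```

Without normalization, the method with the larger scale dominates the average; after scaling, each method contributes on a comparable range.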


Acknowledgements

The authors received no funding for this research.

Author information

Corresponding author

Correspondence to Hooman Tahayori.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ghahramani, F., Tahayori, H. & Visconti, A. Effects of central tendency measures on term weighting in textual information retrieval. Soft Comput 25, 7341–7378 (2021). https://doi.org/10.1007/s00500-021-05694-5

  • DOI: https://doi.org/10.1007/s00500-021-05694-5
