Effects of central tendency measures on term weighting in textual information retrieval

Ghahramani, Farzad; Tahayori, Hooman; Visconti, Andrea

doi:10.1007/s00500-021-05694-5

Methodologies and Application
Published: 24 March 2021

Volume 25, pages 7341–7378, (2021)
Soft Computing

Abstract

Term weighting has a significant effect on the retrieval of relevant documents, and many weighting methods have been proposed. The main question that arises is which weighting method is best. In this paper, it is shown that properly aggregating the weights generated by carefully selected basic weighting methods improves the retrieval of documents relevant to the user’s needs. Specifically, even simple central tendency measures such as the average, median, or mid-range, applied over an appropriate subset of basic weighting methods, yield term weights that not only outperform each basic weighting method used alone but are also more effective than recently proposed, more complicated weighting methods. By applying the proposed method to various datasets, we study the effects of normalizing the basic weights, normalizing the vector lengths, and using different components in the term frequency factor, among other factors. The results reveal criteria for selecting an appropriate subset of basic weighting methods to feed to the aggregator in order to achieve higher retrieval precision.
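
The aggregation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the method labels (`method_a` and so on), the example weights, and the choice to treat a missing term as weight 0.0 are all assumptions made for illustration. The sketch only shows how an average, median, or mid-range could combine the weights that several basic methods assign to the same term.

```python
from statistics import mean, median


def mid_range(values):
    """Mid-range: midpoint between the smallest and largest value."""
    return (min(values) + max(values)) / 2


# The three central tendency measures named in the abstract.
AGGREGATORS = {"average": mean, "median": median, "mid_range": mid_range}


def aggregate_term_weights(weights_by_method, how="average"):
    """Combine per-method term weights into one weight per term.

    weights_by_method maps a (hypothetical) method name to a
    {term: weight} dict for a single document. A term missing from a
    method's dict is treated as weight 0.0, an assumption of this sketch.
    """
    aggregate = AGGREGATORS[how]
    methods = list(weights_by_method.values())
    terms = set().union(*(w.keys() for w in methods))
    return {
        term: aggregate([w.get(term, 0.0) for w in methods])
        for term in terms
    }


# Hypothetical weights from three basic methods for one document.
example = {
    "method_a": {"retrieval": 0.2, "term": 0.6},
    "method_b": {"retrieval": 0.4, "term": 0.5},
    "method_c": {"retrieval": 0.9, "term": 0.1},
}

for how in AGGREGATORS:
    print(how, aggregate_term_weights(example, how))
```

In this toy input the third method rates "retrieval" far higher than the other two, so the average and mid-range are pulled upward while the median stays at the middle value. Which measure and which subset of methods work best is the empirical question the paper studies.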

Notes

Generalised Smoothed Pólya Urn Document with v burstiness parameters set to bs_t.
Generalised Smoothed Pólya Urn Document with v burstiness parameters estimated via MCMC.
Average of Non-Normalized Weights.
Average of normaliZed Weights.
Median of Non-Normalized Weights.
Median of normaliZed Weights.
Mid-Range of Non-Normalized Weights.
Mid-Range of normaliZed Weights.
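
The notes above distinguish aggregates computed over non-normalized weights from those computed over normalized weights. The following Python sketch shows why that distinction can matter when the basic methods produce weights on very different scales. The notes do not say which normalization the paper uses, so the min-max rescaling here is purely an assumption, as are the function names and the example values.

```python
from statistics import mean


def min_max_normalize(weights):
    """Rescale one method's {term: weight} dict to the range [0, 1].

    Min-max rescaling is an assumption of this sketch, not necessarily
    the normalization used in the paper.
    """
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # All weights equal: there is no spread to rescale.
        return {term: 0.0 for term in weights}
    return {term: (w - lo) / (hi - lo) for term, w in weights.items()}


def average_weights(weights_by_method, normalize=False):
    """Average per-term weights across methods, optionally normalizing first."""
    methods = list(weights_by_method.values())
    if normalize:
        methods = [min_max_normalize(w) for w in methods]
    terms = set().union(*(w.keys() for w in methods))
    return {t: mean([w.get(t, 0.0) for w in methods]) for t in terms}


# Hypothetical methods whose weights live on very different scales.
scaled_example = {
    "small_scale_method": {"x": 1.0, "y": 3.0},
    "large_scale_method": {"x": 100.0, "y": 0.0},
}

print(average_weights(scaled_example, normalize=False))
print(average_weights(scaled_example, normalize=True))
```

Without normalization the large-scale method dominates the average. After rescaling, each method contributes on the same footing, and in this toy input the two terms end up with equal weight.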

Acknowledgements

The authors received no funding for this research.

Author information

Authors and Affiliations

Department of Computer Science & Engineering and IT, School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
Farzad Ghahramani & Hooman Tahayori
Department of Computer Science, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
Andrea Visconti

Corresponding author

Correspondence to Hooman Tahayori.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ghahramani, F., Tahayori, H. & Visconti, A. Effects of central tendency measures on term weighting in textual information retrieval. Soft Comput 25, 7341–7378 (2021). https://doi.org/10.1007/s00500-021-05694-5

Accepted: 13 February 2021
Published: 24 March 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s00500-021-05694-5
