Term frequency with average term occurrences for textual information retrieval

Ibrahim, O. Ali Sadek; Landa-Silva, D.

doi:10.1007/s00500-015-1935-7

Focus
Published: 28 November 2015

Volume 20, pages 3045–3061, (2016)

Soft Computing

O. Ali Sadek Ibrahim^1,2 & D. Landa-Silva^1

Abstract

In the context of information retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model. In this paper, we propose a new TWS that is based on computing the average term occurrences of terms in documents; it also uses a discriminative approach based on the document centroid vector to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, as achieving that is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring prior knowledge of relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-word removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-word removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that using the proposed discriminative approach is beneficial for improving IR effectiveness and performance with no information on the relevance judgements for the collection.
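
As a rough illustration of the two ideas named in the abstract, the following Python sketch may help. The abstract does not give the formulas, so the sketch rests on labelled assumptions: a term's weight is taken to be its frequency in a document divided by that document's average term occurrence (total term count over the number of distinct terms), and the discriminative step is taken to keep a weight only when it exceeds the term's value in the centroid (mean) vector of the collection. The function names are hypothetical, and the full article should be consulted for the exact definitions.

```python
from collections import Counter, defaultdict


def tf_ato_weights(docs):
    """Weight each term by tf / ATO, per document.

    ASSUMPTION: ATO (average term occurrence) of a document is its total
    number of term occurrences divided by its number of distinct terms.
    `docs` is a list of token lists; returns a list of {term: weight} dicts.
    """
    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        if not tf:  # empty document: no weights
            weighted.append({})
            continue
        ato = sum(tf.values()) / len(tf)
        weighted.append({term: count / ato for term, count in tf.items()})
    return weighted


def centroid(weighted):
    """Mean weight of every term over all documents (absent terms count as 0)."""
    totals = defaultdict(float)
    for doc in weighted:
        for term, weight in doc.items():
            totals[term] += weight
    n_docs = len(weighted)
    return {term: total / n_docs for term, total in totals.items()}


def discriminate(weighted, cent):
    """ASSUMPTION: drop any weight that does not exceed its centroid value."""
    return [
        {term: w for term, w in doc.items() if w > cent[term]}
        for doc in weighted
    ]


if __name__ == "__main__":
    docs = [["a", "a", "b"], ["a", "c"]]
    weights = tf_ato_weights(docs)
    filtered = discriminate(weights, centroid(weights))
    print(weights)
    print(filtered)
```

Note that this filtering step uses only the document vectors themselves, which is consistent with the abstract's point that the discriminative approach needs no relevance judgements.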

Author information

Authors and Affiliations

ASAP Research Group, School of Computer Science, The University of Nottingham, Nottingham, UK
O. Ali Sadek Ibrahim & D. Landa-Silva
Department of Computer Science, Minia University, Al-Minya, Egypt
O. Ali Sadek Ibrahim

Corresponding author

Correspondence to O. Ali Sadek Ibrahim.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by D. Neagu.

Appendix: Detailed experimental results of Section 4.2

See Tables 10, 11, 12, 13 and 14 in the appendix.

Table 12 Average recall-precision results obtained on the FBIS collection from each case in the experiments

Table 13 Average recall-precision results obtained on the Cranfield collection from each case in the experiments

Table 14 Average recall-precision results obtained on the CISI collection from each case in the experiments

The cases studied in these results are as follows:

Case 1: applying the term-weighting scheme with neither stop-word removal nor the discriminative approach.
Case 2: applying the term-weighting scheme with stop-word removal but without the discriminative approach.
Case 3: applying the term-weighting scheme without stop-word removal but with the discriminative approach.
Case 4: applying the term-weighting scheme with both stop-word removal and the discriminative approach.
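
The four cases form a two-by-two grid over two independent switches. A minimal Python sketch of that grid is given below; the names `Case`, `CASES` and `describe` are hypothetical and serve only to make the combinations explicit, not to reproduce the authors' experimental code.

```python
from typing import NamedTuple


class Case(NamedTuple):
    stop_word_removal: bool  # True if stop words are removed before weighting
    discriminative: bool     # True if the discriminative approach is applied


# Case numbers follow the list above.
CASES = {
    1: Case(stop_word_removal=False, discriminative=False),
    2: Case(stop_word_removal=True, discriminative=False),
    3: Case(stop_word_removal=False, discriminative=True),
    4: Case(stop_word_removal=True, discriminative=True),
}


def describe(case_id: int) -> str:
    """Return a short human-readable label for one experimental case."""
    case = CASES[case_id]
    stop = "with" if case.stop_word_removal else "without"
    disc = "with" if case.discriminative else "without"
    return f"Case {case_id}: {stop} stop-word removal, {disc} discriminative approach"


if __name__ == "__main__":
    for case_id in CASES:
        print(describe(case_id))
```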

About this article

Cite this article

Ibrahim, O.A.S., Landa-Silva, D. Term frequency with average term occurrences for textual information retrieval. Soft Comput 20, 3045–3061 (2016). https://doi.org/10.1007/s00500-015-1935-7

Published: 28 November 2015
Issue Date: August 2016
DOI: https://doi.org/10.1007/s00500-015-1935-7
