Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure

Aryal, Sunil; Ting, Kai Ming; Haffari, Gholamreza; Washio, Takashi

doi:10.1007/978-3-319-28940-3_33

Conference paper
First Online: 22 January 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9460)

Abstract

In the vector space model, different types of term weighting schemes are used to adjust bag-of-words document vectors in order to improve the performance of the most widely used measure, the cosine distance. Even though the cosine distance with some term weighting schemes results in a more reliable (dis)similarity measure in some data sets, it may not perform well in others because of the underlying assumptions of the term weighting schemes. In this paper, we argue that the explicit adjustment of bag-of-words document vectors using term weighting is not required if a data-dependent dissimilarity measure called \(m_p\)-dissimilarity is used. Our empirical results in a document retrieval task reveal that \(m_p\) with the simplest binary bag-of-words representation is either better than or competitive with the cosine distance with the best-performing state-of-the-art term weighting scheme in four widely used benchmark document collections.
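As a point of reference for the baseline that the abstract describes, the following is a minimal sketch of cosine distance over tf-idf-weighted bag-of-words vectors. It assumes one common tf-idf variant (raw term frequency multiplied by log(N/df)); the paper evaluates several weighting schemes that are not listed on this page, and the function names and toy documents here are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed weighting: tf * log(N / df), one common tf-idf variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)


# Toy corpus (hypothetical): two related documents and one unrelated one.
docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock markets fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
d_near = cosine_distance(vecs[0], vecs[1])  # documents sharing several terms
d_far = cosine_distance(vecs[0], vecs[2])   # documents sharing no terms
```

In this toy corpus the two documents that share terms end up closer than the pair with no terms in common, which is the behaviour the weighted cosine baseline is meant to provide.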

Notes

1. The parameter \(p\) in \(m_p\) has the same role as in the traditional \(\ell_p\)-norm. The performance of \(m_p\) may change slightly with different values of \(p\) in some data sets. Empirically, we observed that \(p=0.1\) is a reasonably good setting.
2. http://web.ist.utl.pt/acardoso/datasets/
3. http://www.cs.waikato.ac.nz/ml/weka/datasets.html
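The formula for \(m_p\) is not given on this page, so the role of \(p\) in Note 1 can only be illustrated under assumptions. The sketch below assumes a mass-based form: for each dimension, the dissimilarity of two documents is the fraction of documents in the collection whose value lies in the range the two documents span, and these fractions are combined with a power mean of order \(p\), analogous to an \(\ell_p\)-norm. The function name `mp_dissimilarity`, the averaging over dimensions, and the toy binary matrix are all hypothetical choices made for illustration, not the authors' implementation.

```python
def mp_dissimilarity(x, y, data, p=0.1):
    """Sketch of a mass-based dissimilarity (assumed form, see lead-in).

    For each dimension i, count the rows of `data` whose value lies in
    [min(x_i, y_i), max(x_i, y_i)], divide by the number of rows, raise
    to the power p, average over dimensions, and take the 1/p-th root.
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        mass = sum(1 for row in data if lo <= row[i] <= hi) / n
        total += mass ** p
    return (total / d) ** (1.0 / p)


# Toy binary bag-of-words matrix (hypothetical): 4 documents x 3 terms.
# Term 0 is common, term 1 is rare, term 2 appears in half the documents.
data = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]

d_self = mp_dissimilarity(data[0], data[0], data)           # same document
d_diff = mp_dissimilarity(data[0], data[1], data)           # differ on term 1
d_diff_p2 = mp_dissimilarity(data[0], data[1], data, p=2)   # another p value
```

Under this assumed form, a dimension on which two documents disagree always contributes the full mass of 1, while a shared rare term contributes a small mass, so no explicit term weighting is applied to the binary vectors. Note that the self-dissimilarity computed this way is not necessarily zero.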

Author information

Authors and Affiliations

Monash University, Victoria, Australia
Sunil Aryal & Gholamreza Haffari
Federation University, Victoria, Australia
Kai Ming Ting
Osaka University, Osaka, Japan
Takashi Washio

Corresponding author

Correspondence to Sunil Aryal.

Editor information

Editors and Affiliations

Science and Engineering Faculty, Queensland University of Technology, Brisbane, Australia
Guido Zuccon
Brisbane, Queensland, Australia
Shlomo Geva
University of Tsukuba, Ibaraki, Japan
Hideo Joho
RMIT University, Melbourne, Australia
Falk Scholer
School of Computer Engineering, Nanyang Technological University, Singapore, Singapore
Aixin Sun
Tianjin University, China
Peng Zhang

About this paper

Cite this paper

Aryal, S., Ting, K.M., Haffari, G., Washio, T. (2015). Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_33

DOI: https://doi.org/10.1007/978-3-319-28940-3_33
Published: 22 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28939-7
Online ISBN: 978-3-319-28940-3
eBook Packages: Computer Science, Computer Science (R0)
