Abstract
In the vector space model, various term weighting schemes are used to adjust bag-of-words document vectors in order to improve the performance of the most widely used measure, the cosine distance. Even though the cosine distance with some term weighting schemes results in a more reliable (dis)similarity measure on some data sets, it may not perform well on others because of the underlying assumptions of the term weighting schemes. In this paper, we argue that explicit adjustment of bag-of-words document vectors using term weighting is not required if a data-dependent dissimilarity measure called \(m_p\)-dissimilarity is used. Our empirical results in a document retrieval task reveal that \(m_p\) with the simplest binary bag-of-words representation is either better than or competitive with the cosine distance with the best-performing state-of-the-art term weighting scheme on four widely used benchmark document collections.
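As an illustration of the baseline the abstract refers to, the following is a minimal sketch of the cosine distance applied to tf-idf-weighted and to binary bag-of-words vectors. The toy term counts, the function names, and the particular tf-idf variant (raw count times log(N/df)) are assumptions made for this example; the abstract does not specify which weighting schemes were evaluated.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)


def tfidf(counts):
    """Weight raw term counts by log(N / df), one common tf-idf variant."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for doc in counts if doc[t] > 0) for t in range(n_terms)]
    idf = [math.log(n_docs / df[t]) if df[t] > 0 else 0.0
           for t in range(n_terms)]
    return [[doc[t] * idf[t] for t in range(n_terms)] for doc in counts]


def binarize(counts):
    """Binary bag-of-words: 1 if the term occurs in the document, else 0."""
    return [[1 if c > 0 else 0 for c in doc] for doc in counts]


# Hypothetical toy collection: rows are documents, columns are terms.
counts = [
    [2, 1, 0, 1],
    [1, 0, 3, 1],
    [0, 2, 1, 1],
]

weighted = tfidf(counts)
binary = binarize(counts)

d_tfidf = cosine_distance(weighted[0], weighted[1])
d_binary = cosine_distance(binary[0], binary[1])
print(d_tfidf, d_binary)
```

In this toy collection the last term occurs in every document, so its idf weight is zero and it no longer contributes to the tf-idf cosine distance, whereas it still counts as a match in the binary representation. This is the kind of explicit adjustment that term weighting performs.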
Notes
- 1. The parameter p in \(m_p\) plays the same role as in the traditional \(\ell_p\)-norm. The performance of \(m_p\) may change slightly with different values of p on some data sets. Empirically, we observed that \(p=0.1\) is a reasonably good setting.
- 2.
- 3.
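To make the role of p in note 1 concrete, here is a minimal sketch of a mass-based dissimilarity on binary bag-of-words vectors. This page does not state the formula for \(m_p\), so the definition used here is an assumption: for each dimension, take the fraction of documents in the collection whose value lies in the range spanned by the two documents, then combine the fractions with a power mean of order p. The function name `mp_dissimilarity` and the toy data are hypothetical.

```python
def mp_dissimilarity(x, y, data, p=0.1):
    """Assumed form: ((1/d) * sum_i (mass_i / N) ** p) ** (1 / p).

    mass_i is the number of rows in `data` whose i-th value lies in
    [min(x_i, y_i), max(x_i, y_i)].
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        # Data mass in the region covering x_i and y_i in dimension i.
        mass = sum(1 for row in data if lo <= row[i] <= hi)
        total += (mass / n) ** p
    return (total / d) ** (1.0 / p)


# Hypothetical binary bag-of-words collection (4 documents, 3 terms).
data = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

d_p1 = mp_dissimilarity(data[0], data[1], data, p=1.0)
d_p01 = mp_dissimilarity(data[0], data[1], data, p=0.1)
print(d_p1, d_p01)
```

Under this assumed definition, two documents that share a term get a per-dimension contribution equal to that term's document frequency divided by N, so a shared rare term makes them less dissimilar than a shared common one. The data distribution thus supplies an idf-like effect without explicit term weighting. A smaller p lets dimensions with small mass weigh more heavily in the power mean.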
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Aryal, S., Ting, K.M., Haffari, G., Washio, T. (2015). Beyond tf-idf and Cosine Distance in Documents Dissimilarity Measure. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_33
DOI: https://doi.org/10.1007/978-3-319-28940-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28939-7
Online ISBN: 978-3-319-28940-3
eBook Packages: Computer Science, Computer Science (R0)