Abstract
Distance and similarity measures are essential for solving many pattern recognition problems, such as classification, information retrieval, and clustering, where the choice of a specific distance can lead to better performance than others. This paper proposes a weighted cosine distance that varies the weights of the exclusive attributes of the input vectors. Agglomerative hierarchical clustering of documents was used to compare the traditional cosine similarity with the proposed one. The modified measure resulted in an improvement in the formation of clusters.
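The abstract does not give the formula for the proposed measure, so the following is only a minimal sketch of one possible reading, not the authors' definition. It assumes that an "exclusive attribute" is a feature that is non-zero in exactly one of the two vectors, and that a hypothetical parameter `w` scales the contribution of such features to the vector norms. The function names and the weighting scheme are assumptions made for illustration. With `w = 1` the sketch reduces to the traditional cosine similarity.

```python
import math


def weighted_cosine_similarity(x, y, w=1.0):
    """Cosine similarity with re-weighted exclusive attributes.

    ASSUMPTION (not from the paper): an attribute is "exclusive" when it is
    non-zero in exactly one of the two vectors, and its squared value is
    multiplied by w when computing the norms.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = 0.0
    norm_x = 0.0
    norm_y = 0.0
    for a, b in zip(x, y):
        exclusive = (a != 0) != (b != 0)  # non-zero in exactly one vector
        scale = w if exclusive else 1.0
        dot += a * b                      # exclusive attributes add 0 here
        norm_x += scale * a * a
        norm_y += scale * b * b
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0                        # convention chosen for zero vectors
    return dot / math.sqrt(norm_x * norm_y)


def weighted_cosine_distance(x, y, w=1.0):
    """Distance derived from the similarity above."""
    return 1.0 - weighted_cosine_similarity(x, y, w)
```

Under these assumptions, choosing `w < 1` discounts exclusive attributes and so raises the similarity of vectors that share some features, while `w > 1` penalizes them. Pairwise distances computed this way could then be passed to any agglomerative hierarchical clustering routine that accepts a precomputed distance matrix.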
Acknowledgments
This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813, BEIFI 20181315).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Martín-del-Campo-Rodríguez, C., Sidorov, G., Batyrshin, I. (2018). Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science, vol 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_4
DOI: https://doi.org/10.1007/978-3-030-04497-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04496-1
Online ISBN: 978-3-030-04497-8
eBook Packages: Computer Science (R0)