Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity

Martín-del-Campo-Rodríguez, Carolina; Sidorov, Grigori; Batyrshin, Ildar

doi:10.1007/978-3-030-04497-8_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11289)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

Distance and similarity measures are essential for solving many pattern recognition problems, such as classification, information retrieval and clustering, where the choice of a specific distance can lead to better performance than others. A weighted cosine distance is proposed that varies the weights of the exclusive attributes of the input vectors. Agglomerative hierarchical clustering of documents was used to compare the traditional cosine similarity with the one proposed in this paper. The modified measure resulted in an improvement in the formation of clusters.
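
The abstract does not state the formula, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that an "exclusive" attribute is one that is non-zero in exactly one of the two vectors, and that its squared value is scaled by a hypothetical weight `w` inside the norms. The function names `weighted_cosine` and `agglomerative`, the parameter `w`, and the use of average linkage are all illustrative assumptions.

```python
import math


def weighted_cosine(x, y, w=1.0):
    """Cosine similarity with re-weighted exclusive attributes (assumed scheme).

    An attribute that is non-zero in only one of the two vectors has its
    squared value multiplied by w in the norms. With w = 1.0 the result is
    the ordinary cosine similarity; w is assumed to be non-negative.
    """
    dot = norm_x = norm_y = 0.0
    for a, b in zip(x, y):
        shared = a != 0 and b != 0
        scale = 1.0 if shared else w
        dot += a * b
        norm_x += scale * a * a
        norm_y += scale * b * b
    if norm_x == 0 or norm_y == 0:
        return 0.0  # no usable attributes on at least one side
    return dot / math.sqrt(norm_x * norm_y)


def agglomerative(docs, n_clusters, w=1.0):
    """Naive average-linkage agglomerative clustering on the similarity above.

    Starts with one cluster per document and repeatedly merges the pair of
    clusters with the highest average pairwise similarity until n_clusters
    remain. Returns a list of lists of document indices.
    """
    clusters = [[i] for i in range(len(docs))]

    def avg_sim(c1, c2):
        total = sum(weighted_cosine(docs[i], docs[j], w) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        _, p, q = max(
            (avg_sim(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy term-count vectors (made up for illustration only).
docs = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
groups = agglomerative(docs, n_clusters=2, w=0.5)
```

In this sketch, a weight below 1 reduces the penalty that exclusive attributes impose through the norms, so two documents that share some vocabulary score as more similar than under the ordinary cosine. Whether that matches the weighting studied in the paper cannot be determined from the abstract alone.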

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813, BEIFI 20181315).

Author information

Authors and Affiliations

Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico
Carolina Martín-del-Campo-Rodríguez, Grigori Sidorov & Ildar Batyrshin

Corresponding author

Correspondence to Carolina Martín-del-Campo-Rodríguez.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Ildar Batyrshin
Universidad Panamericana, Mexico City, Mexico
María de Lourdes Martínez-Villaseñor
Faculty of Engineering, Universidad Panamericana, Mexico City, Mexico
Hiram Eredín Ponce Espinosa

Cite this paper

Martín-del-Campo-Rodríguez, C., Sidorov, G., Batyrshin, I. (2018). Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science, vol 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_4

DOI: https://doi.org/10.1007/978-3-030-04497-8_4
Published: 03 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04496-1
Online ISBN: 978-3-030-04497-8
eBook Packages: Computer Science, Computer Science (R0)
