Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity

  • Conference paper
Advances in Computational Intelligence (MICAI 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11289)

Abstract

Distance and similarity measures are essential to solving many pattern recognition problems, such as classification, information retrieval, and clustering, where the choice of a specific distance can lead to better performance than others. This paper proposes a weighted cosine distance that varies the weights of the exclusive attributes of the input vectors. Agglomerative hierarchical clustering of documents was used to compare the traditional cosine similarity with the proposed measure. The modified measure improved the formation of clusters.
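The abstract does not give the formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that an attribute is "exclusive" for a pair of vectors when it is non-zero in exactly one of them, and that such attributes contribute to the squared norms scaled by a hypothetical weight `w` (with `w = 1` recovering the traditional cosine similarity). The function names, the average-linkage rule, and the stopping threshold are likewise assumptions made for illustration.

```python
import math


def weighted_cosine(x, y, w=1.0):
    """Cosine similarity with exclusive attributes scaled by w (assumed form).

    Assumption: an attribute is exclusive when it is non-zero in exactly
    one of the two vectors. Its squared value enters the norms multiplied
    by w; shared attributes are left unchanged. w = 1 is plain cosine.
    """
    dot = 0.0
    norm_x = 0.0
    norm_y = 0.0
    for a, b in zip(x, y):
        exclusive = (a != 0) != (b != 0)
        scale = w if exclusive else 1.0
        dot += a * b
        norm_x += scale * a * a
        norm_y += scale * b * b
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for degenerate vectors
    return dot / math.sqrt(norm_x * norm_y)


def agglomerative(docs, w=1.0, threshold=0.5):
    """Naive average-linkage agglomerative clustering over document vectors.

    Repeatedly merges the two most similar clusters until the best
    average similarity falls below `threshold` (a hypothetical stopping
    rule). Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(docs))]

    def average_similarity(c1, c2):
        sims = [weighted_cosine(docs[i], docs[j], w) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = average_similarity(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best[0] < threshold:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy term-count vectors; the values are made up for illustration.
    docs = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
    print(weighted_cosine(docs[0], docs[1], w=1.0))
    print(weighted_cosine(docs[0], docs[1], w=0.5))
    print(agglomerative(docs, w=1.0, threshold=0.5))
```

Under this assumed form, choosing `w` below 1 reduces the penalty for attributes that appear in only one document, which raises the similarity of pairs that differ mainly in such attributes, while `w` above 1 has the opposite effect. Comparing the clusters obtained with `w = 1` against other values mirrors the comparison described in the abstract.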

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813, BEIFI 20181315).

Author information

Corresponding author

Correspondence to Carolina Martín-del-Campo-Rodríguez.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Martín-del-Campo-Rodríguez, C., Sidorov, G., Batyrshin, I. (2018). Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity. In: Batyrshin, I., Martínez-Villaseñor, M., Ponce Espinosa, H. (eds) Advances in Computational Intelligence. MICAI 2018. Lecture Notes in Computer Science, vol 11289. Springer, Cham. https://doi.org/10.1007/978-3-030-04497-8_4

  • DOI: https://doi.org/10.1007/978-3-030-04497-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04496-1

  • Online ISBN: 978-3-030-04497-8

  • eBook Packages: Computer Science, Computer Science (R0)
