
Sentence Clustering Using Continuous Vector Space Representation

  • Conference paper
Pattern Recognition and Image Analysis (IbPRIA 2015)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9117)


Abstract

In this paper, we present a clustering approach that combines a continuous vector space representation of sentences with the \(k\)-means algorithm. The main motivation of this proposal is to split a large heterogeneous corpus into clusters of similar sentences. We use the word2vec toolkit to obtain a continuous vector representation of each word. We provide empirical evidence that our technique can lead to better clusters in terms of intra-cluster perplexity and \(F_1\) score.
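The pipeline described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: sentence vectors are formed here by averaging word vectors (the paper may compose them differently), the toy two-dimensional embeddings in `TOY_VECTORS` stand in for real word2vec output, and the helper names `sentence_vector`, `kmeans`, and `cluster_sentences` are hypothetical. The \(k\)-means step is plain Lloyd iteration with Euclidean distance.

```python
import numpy as np

# Hypothetical toy embeddings standing in for word2vec output (assumption).
TOY_VECTORS = {
    "patient": np.array([1.0, 0.1]),
    "doctor": np.array([0.9, 0.2]),
    "drug": np.array([1.0, 0.0]),
    "court": np.array([0.1, 1.0]),
    "law": np.array([0.0, 0.9]),
    "judge": np.array([0.2, 1.0]),
}


def sentence_vector(sentence, vectors, dim):
    """Average the vectors of known words (assumed composition scheme)."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        # No known word: fall back to the zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's k-means; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Assignments stopped changing: converged.
        labels = new_labels
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # Keep the old centroid if a cluster is empty.
                centroids[j] = members.mean(axis=0)
    return labels


def cluster_sentences(sentences, vectors, k):
    """Embed each sentence and cluster the resulting vectors."""
    dim = len(next(iter(vectors.values())))
    points = np.array([sentence_vector(s, vectors, dim) for s in sentences])
    return kmeans(points, k)


if __name__ == "__main__":
    corpus = [
        "the patient saw the doctor",
        "drug patient",
        "the judge and the court",
        "law court",
    ]
    print(cluster_sentences(corpus, TOY_VECTORS, k=2))
```

In a realistic setting the toy dictionary would be replaced by vectors trained with word2vec on the corpus, and \(k\) would be tuned against the quality measures the abstract mentions.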


Notes

  1. Available at http://www.statmt.org/wmt13.

  2. Available at http://www.statmt.org/wmt14/medical-task/.

  3. Available at http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/.

  4. Available at http://opus.lingfil.uu.se/.


Author information

Corresponding author

Correspondence to Mara Chinea-Rios.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Chinea-Rios, M., Sanchis-Trilles, G., Casacuberta, F. (2015). Sentence Clustering Using Continuous Vector Space Representation. In: Paredes, R., Cardoso, J., Pardo, X. (eds) Pattern Recognition and Image Analysis. IbPRIA 2015. Lecture Notes in Computer Science, vol 9117. Springer, Cham. https://doi.org/10.1007/978-3-319-19390-8_49


  • DOI: https://doi.org/10.1007/978-3-319-19390-8_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19389-2

  • Online ISBN: 978-3-319-19390-8

  • eBook Packages: Computer Science, Computer Science (R0)
