Abstract
In this paper, we present a clustering approach that combines a continuous vector space representation of sentences with the \(k\)-means algorithm. The principal motivation of this proposal is to split a large heterogeneous corpus into clusters of similar sentences. We use the word2vec toolkit to obtain a representation of each word in a continuous vector space. We provide empirical evidence that our technique can lead to better clusters in terms of intra-cluster perplexity and \(F_1\) score.
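The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how word vectors are composed into a sentence vector, so the averaging step below is an assumption, not necessarily the authors' method. The toy two-dimensional `WORD_VECTORS` table is a hypothetical stand-in for vectors trained with word2vec, and `sentence_vector` and `kmeans` are illustrative helper names.

```python
import random

# Toy 2-dimensional word vectors: hypothetical stand-ins for word2vec output.
WORD_VECTORS = {
    "patient": (0.9, 0.1),
    "dose": (0.8, 0.2),
    "treatment": (1.0, 0.0),
    "parliament": (0.1, 0.9),
    "vote": (0.0, 1.0),
    "council": (0.2, 0.8),
}


def sentence_vector(sentence):
    """Average the vectors of the known words (an assumed composition)."""
    vectors = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("sentence contains no known words")
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


SENTENCES = [
    "patient treatment dose",
    "dose patient",
    "parliament vote",
    "council vote parliament",
]
LABELS = kmeans([sentence_vector(s) for s in SENTENCES], k=2)
```

In this toy setting the two medical sentences share one label and the two parliamentary sentences share the other. With real data, the vectors would come from a trained word2vec model, and the resulting clusters would be assessed with the measures named in the abstract.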
Notes
1. Available at http://www.statmt.org/wmt13.
2. Available at http://www.statmt.org/wmt14/medical-task/.
3. Available at http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/.
4. Available at http://opus.lingfil.uu.se/.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Chinea-Rios, M., Sanchis-Trilles, G., Casacuberta, F. (2015). Sentence Clustering Using Continuous Vector Space Representation. In: Paredes, R., Cardoso, J., Pardo, X. (eds) Pattern Recognition and Image Analysis. IbPRIA 2015. Lecture Notes in Computer Science(), vol 9117. Springer, Cham. https://doi.org/10.1007/978-3-319-19390-8_49
DOI: https://doi.org/10.1007/978-3-319-19390-8_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19389-2
Online ISBN: 978-3-319-19390-8
eBook Packages: Computer Science (R0)