Sentence Clustering Using Continuous Vector Space Representation

Chinea-Rios, Mara; Sanchis-Trilles, Germán; Casacuberta, Francisco

doi:10.1007/978-3-319-19390-8_49

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9117)

Included in the following conference series: Iberian Conference on Pattern Recognition and Image Analysis

Abstract

In this paper, we present a clustering approach that combines a continuous vector space representation of sentences with the $k$-means algorithm. The principal motivation of this proposal is to split a large heterogeneous corpus into clusters of similar sentences. We use the word2vec toolkit to obtain the representation of each word as a vector in a continuous space. We provide empirical evidence that our technique can lead to better clusters, in terms of intra-cluster perplexity and $F_1$ score.
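
The pipeline summarised in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state how word vectors are combined into a sentence vector, so averaging is assumed here, and the two-dimensional vectors in WORD_VECTORS are hypothetical values standing in for the output of a trained word2vec model. The helper names sentence_vector and k_means are likewise invented for this example.

```python
import random

# Hypothetical 2-dimensional word vectors. In practice these would come
# from a trained word2vec model; the values here are made up.
WORD_VECTORS = {
    "patient": [0.9, 0.1],
    "doctor": [0.8, 0.2],
    "treatment": [0.85, 0.15],
    "parliament": [0.1, 0.9],
    "vote": [0.2, 0.8],
    "law": [0.15, 0.85],
}


def sentence_vector(sentence, word_vectors):
    """Average the vectors of the known words (an assumed composition rule)."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError("sentence contains no known words")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: returns one cluster index per point and the centroids."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return assignment, centroids


SENTENCES = [
    "the doctor examined the patient",
    "the patient started treatment",
    "parliament will vote today",
    "the law passed the vote",
]
VECTORS = [sentence_vector(s, WORD_VECTORS) for s in SENTENCES]
LABELS, CENTROIDS = k_means(VECTORS, k=2)
```

With these toy vectors the two medical sentences share one cluster and the two political sentences share the other. Evaluating real clusters by intra-cluster perplexity or $F_1$ score, as the abstract describes, would require language models and reference labels that are outside the scope of this sketch.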

Notes

1. Available at http://www.statmt.org/wmt13.
2. Available at http://www.statmt.org/wmt14/medical-task/.
3. Available at http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/.
4. Available at http://opus.lingfil.uu.se/.

Author information

Authors and Affiliations

Pattern Recognition and Human Language Technologies Center, Universitat Politècnica de València, Valencia, Spain
Mara Chinea-Rios, Germán Sanchis-Trilles & Francisco Casacuberta

Corresponding author

Correspondence to Mara Chinea-Rios.

Editor information

Editors and Affiliations

Universitat Politècnica de València, València, Spain
Roberto Paredes
Universidade do Porto, Porto, Portugal
Jaime S. Cardoso
Universidade de Santiago de Compostela, Santiago de Compostela, Spain
Xosé M. Pardo

About this paper

Cite this paper

Chinea-Rios, M., Sanchis-Trilles, G., Casacuberta, F. (2015). Sentence Clustering Using Continuous Vector Space Representation. In: Paredes, R., Cardoso, J., Pardo, X. (eds) Pattern Recognition and Image Analysis. IbPRIA 2015. Lecture Notes in Computer Science, vol 9117. Springer, Cham. https://doi.org/10.1007/978-3-319-19390-8_49

DOI: https://doi.org/10.1007/978-3-319-19390-8_49
Published: 09 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19389-2
Online ISBN: 978-3-319-19390-8
eBook Packages: Computer Science, Computer Science (R0)
