Abstract
With the rapid proliferation of social media and online communities, a vast amount of text data has been generated. Discovering the value hidden in this text data has grown in importance, and a variety of text mining and processing algorithms, such as classification, clustering, and similarity comparison, have been created in recent years. Most previous research uses a vector-space model for text representation and analysis. However, the vector-space model does not utilise information about term-to-term relationships. Moreover, classic classification methods also ignore the relationships between documents. In other words, traditional text mining techniques assume that terms, and likewise documents, are independent and identically distributed (IID). In this paper, we introduce a novel term representation that incorporates coupled term-to-term relations. This coupled representation carries much richer information, enabling us to define a coupled similarity metric for measuring document similarity; a K-nearest-centroid classifier based on this coupled document similarity is then applied to the classification task. Experiments verify that the proposed approach outperforms the classic vector-space-based classifier, and show its potential advantages and richness for other text mining tasks.
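The classification pipeline the abstract describes can be sketched at a high level: build a term vector per document, average each class's vectors into a centroid, and assign a new document to the class with the most similar centroid. The sketch below is a minimal illustration, not the paper's method — it uses plain bag-of-words vectors and cosine similarity where the paper defines a coupled term representation and coupled similarity metric; the `sim` parameter marks the hook where such a metric would plug in. All names here are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def term_vector(doc):
    """Bag-of-words term-frequency vector for a whitespace-tokenised document."""
    return Counter(doc.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def class_centroids(docs, labels):
    """Average the term vectors of each class into one centroid per class."""
    sums, counts = defaultdict(Counter), Counter()
    for doc, label in zip(docs, labels):
        sums[label].update(term_vector(doc))
        counts[label] += 1
    return {label: Counter({t: c / counts[label] for t, c in vec.items()})
            for label, vec in sums.items()}

def nearest_centroid(doc, centroids, sim=cosine):
    """Assign the class whose centroid is most similar under `sim`."""
    v = term_vector(doc)
    return max(centroids, key=lambda label: sim(v, centroids[label]))
```

Replacing `cosine` with a similarity that draws on term-to-term couplings is exactly the substitution the paper argues for: two documents sharing no literal terms can still score as similar when their terms are coupled.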
© 2014 Springer-Verlag Berlin Heidelberg
Cite this paper
Li, M. et al. (2014). Learning Heterogeneous Coupling Relationships Between Non-IID Terms. In: Cao, L., Zeng, Y., Symeonidis, A., Gorodetsky, V., Müller, J., Yu, P. (eds) Agents and Data Mining Interaction. ADMI 2013. Lecture Notes in Computer Science(), vol 8316. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55192-5_7
DOI: https://doi.org/10.1007/978-3-642-55192-5_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-55191-8
Online ISBN: 978-3-642-55192-5
eBook Packages: Computer Science; Computer Science (R0)