Abstract
Question clustering plays an important role in QA systems. Due to data sparseness and lexical gap in questions, there is no sufficient information to guarantee good clustering results. Besides, previous works pay little attention to the complexity of algorithms, resulting in infeasibility on large-scale datasets. In this paper, we propose a novel similarity measure, which employs word relatedness as additional information to help calculating similarity between questions. Based on the similarity measure and k-means algorithm, semantic k-means algorithm and its extended version are proposed. Experimental results show that the proposed methods have comparable performance with state-of-theart methods and cost less time.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web, pp. 91–100. ACM (April 2008)
Kumaran, G., Allan, J.: Text classification and named entities for new event detection. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 297–304. ACM (July 2004)
Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Transactions on Knowledge and Data Engineering 20(9), 1217–1229 (2008)
Ni, X., Quan, X., Lu, Z., Wenyin, L., Hua, B.: Short text clustering by finding core terms. Knowledge and Information Systems 27(3), 345–365 (2011)
Hotho, A., Staab, S., Stumme, G.: Ontologies improve text document clustering. In: Third IEEE International Conference on Data Mining, ICDM 2003, pp. 541–544. IEEE (November 2003)
Banerjee, S., Ramanathan, K., Gupta, A.: Clustering short texts using wikipedia. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 787–788. ACM (July 2007)
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. JASIS 41(6), 391–407 (1990)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM (August 1999)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. The Journal of Machine Learning Research 3, 993–1022 (2003)
Hong, L., Davison, B.D.: Empirical study of topic modeling in twitter. In: Proceedings of the First Workshop on Social Media Analytics, pp. 80–88. ACM (July 2010)
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, pp. 1445–1456 (May 2013)
Ji, Z., Xu, F., Wang, B., He, B.: Question-answer topic model for question retrieval in community question answering. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 2471–2474. ACM (October 2012)
Guo, W., Diab, M.: Modeling sentences in the latent space. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 864–872. Association for Computational Linguistics (July 2012)
Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: AAAI, vol. 6, pp. 1419–1424 (July 2006)
Gracia, J.L., Mena, E.: Web-based measure of semantic relatedness. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp. 136–150. Springer, Heidelberg (2008)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, X., Wang, H. (2014). A Fast and Effective Method for Clustering Large-Scale Chinese Question Dataset. In: Zong, C., Nie, JY., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2014. Communications in Computer and Information Science, vol 496. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-45924-9_31
Download citation
DOI: https://doi.org/10.1007/978-3-662-45924-9_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-45923-2
Online ISBN: 978-3-662-45924-9
eBook Packages: Computer ScienceComputer Science (R0)