Abstract
We propose an algorithm for clustering a specific type of text document: textual responses gathered through a survey system. Such responses are short, rich in outliers, and diverse in categories, so traditional unsupervised and semi-supervised clustering techniques struggle to reach the performance a survey task demands. We address this issue with a semi-supervised, topic-driven approach. It first employs an unsupervised algorithm to generate a preliminary clustering schema for all the answers to a question. A human expert then uses this schema to identify the major topics in these answers. Finally, a topic-driven clustering algorithm is applied to obtain an improved clustering schema. We evaluated this approach on five questions from a survey we recently conducted in the U.S. The results demonstrate that this approach can lead to significant improvement in clustering quality.
This work is partially supported by the SFSU CCLS mini grant program.
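The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the unsupervised algorithm or the topic-driven objective, so a greedy single-pass clusterer and a nearest-topic assignment by cosine similarity stand in for them here. All function names, thresholds, sample answers, and topic keywords are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a short answer (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def preliminary_clusters(answers, threshold=0.3):
    """Step 1 (assumed stand-in): greedy single-pass clustering.

    Each answer joins the most similar existing cluster if the similarity
    reaches `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for i, text in enumerate(answers):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(i)
    return clusters


def topic_driven_clusters(answers, topics, min_sim=0.1):
    """Step 3 (assumed stand-in): assign each answer to its closest topic.

    `topics` maps a topic name to the keywords an expert chose in step 2.
    Answers that match no topic well enough are collected under "outlier".
    """
    topic_vecs = {name: vectorize(" ".join(words)) for name, words in topics.items()}
    result = {name: [] for name in topics}
    result["outlier"] = []
    for i, text in enumerate(answers):
        vec = vectorize(text)
        name, sim = max(
            ((n, cosine(vec, tv)) for n, tv in topic_vecs.items()),
            key=lambda pair: pair[1],
        )
        result[name if sim >= min_sim else "outlier"].append(i)
    return result


# Hypothetical survey answers, for illustration only.
answers = [
    "need training in gene expression analysis",
    "gene expression data analysis training",
    "more funding for research",
    "research funding is limited",
    "no comment",
]

# Step 1: preliminary schema shown to the expert.
schema = [c["members"] for c in preliminary_clusters(answers)]

# Step 2: the expert reads the schema and names the major topics (hypothetical).
topics = {
    "training": ["training", "analysis", "gene"],
    "funding": ["funding", "research"],
}

# Step 3: improved schema driven by the expert's topics.
final = topic_driven_clusters(answers, topics)
print(schema)
print(final)
```

The explicit "outlier" bucket reflects the abstract's remark that survey answers are rich in outliers: an answer that resembles no expert topic is set aside rather than forced into the nearest cluster.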
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yang, H., Mysore, A., Wallace, S. (2009). A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_36
DOI: https://doi.org/10.1007/978-3-642-03348-3_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science, Computer Science (R0)