A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions

Yang, Hui; Mysore, Ajay; Wallace, Sharonda

doi:10.1007/978-3-642-03348-3_36


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5678)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications


Abstract

We propose an algorithm to effectively cluster a specific type of text document: textual responses gathered through a survey system. Because such responses have peculiar features (e.g., they are short, rich in outliers, and diverse in categories), traditional unsupervised and semi-supervised clustering techniques struggle to achieve the performance that a survey task demands. We address this issue by proposing a semi-supervised, topic-driven approach. It first employs an unsupervised algorithm to generate a preliminary clustering schema for all the answers to a question. A human expert then uses this schema to identify the major topics in these answers. Finally, a topic-driven clustering algorithm is adopted to obtain an improved clustering schema. We evaluated this approach using five questions in a survey we recently conducted in the U.S. The results demonstrate that this approach can lead to significant improvement in clustering quality.
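The three-step workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the unsupervised algorithm or the topic-driven objective, so this sketch assumes a plain spherical k-means for the preliminary step and a simple nearest-topic assignment with an outlier threshold for the final step. The function names, the `min_sim` threshold, the sample answers, and the expert topic keywords are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a short answer into a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def centroid(vectors):
    """Unit-length mean direction of a group of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {w: x / norm for w, x in total.items()} if norm else {}


def preliminary_clusters(vectors, k, iterations=10):
    """Step 1 (assumed algorithm): spherical k-means seeded with the first k answers."""
    centers = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels


def topic_driven_clusters(vectors, topics, min_sim=0.1):
    """Step 3 (simplified stand-in): assign each answer to its most similar
    expert topic; answers below min_sim are set aside as outliers."""
    topic_vecs = {name: vectorize(" ".join(words)) for name, words in topics.items()}
    labels = []
    for v in vectors:
        best = max(topic_vecs, key=lambda name: cosine(v, topic_vecs[name]))
        labels.append(best if cosine(v, topic_vecs[best]) >= min_sim else "outlier")
    return labels


# Hypothetical survey answers to one open-ended question.
answers = [
    "I need more training on genetic testing",
    "training in genetic testing methods",
    "diet and nutrition counseling for patients",
    "nutrition counseling skills",
    "no comment",
]
vectors = [vectorize(a) for a in answers]

# Step 1: preliminary schema that the expert would inspect.
prelim = preliminary_clusters(vectors, k=2)

# Step 2: topics a human expert might write down after inspecting the schema
# (hypothetical keywords, supplied by hand here).
topics = {
    "genetic testing": ["genetic", "testing"],
    "nutrition counseling": ["nutrition", "counseling", "diet"],
}

# Step 3: improved schema driven by the expert topics.
final = topic_driven_clusters(vectors, topics)
print(prelim)
print(final)
```

In this sketch the expert's input enters only through `topics`, which mirrors the limited human effort the abstract describes; the threshold gives outlier-heavy survey data somewhere to go instead of forcing every answer into a topic.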

This work is partially supported by the SFSU CCLS mini grant program.



Author information

Authors and Affiliations

Department of Computer Science, San Francisco State University, 94132, USA
Hui Yang & Ajay Mysore
Human Nutrition & Food Science, California State Polytechnic University, 91768, USA
Sharonda Wallace


Editor information

Editors and Affiliations

Knowledge Science & Engineering Institute, School of Education Technology, Beijing Normal University, Xinjiekouwai Ave. 19, 100875, Beijing, China
Ronghuai Huang
The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, Hong Kong
Qiang Yang
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Faculty of Economics, University of Porto, Rua Dr. Roberto Frias, 4200-465, Porto, Portugal
João Gama
School of Information, Zhongguancun, Renmin University, 100872, Beijing, China
Xiaofeng Meng
School of Information Technology and Electrical Engineering, The University of Queensland, 4072, St. Lucia, Queensland, Australia
Xue Li


About this paper

Cite this paper

Yang, H., Mysore, A., Wallace, S. (2009). A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_36


DOI: https://doi.org/10.1007/978-3-642-03348-3_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science, Computer Science (R0)
