A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions

  • Conference paper
Advanced Data Mining and Applications (ADMA 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5678)

Abstract

We propose an algorithm for clustering a specific type of text document: textual responses gathered through a survey system. Because such responses are typically short, rich in outliers, and diverse in categories, traditional unsupervised and semi-supervised clustering techniques struggle to reach the quality a survey task demands. We address this issue with a semi-supervised, topic-driven approach. First, an unsupervised algorithm generates a preliminary clustering schema for all the answers to a question. A human expert then uses this schema to identify the major topics in those answers. Finally, a topic-driven clustering algorithm produces an improved clustering schema. We evaluated this approach on five questions from a survey we recently conducted in the U.S. The results show that the approach can significantly improve clustering quality.
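
The three stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not name the unsupervised method, the similarity measure, or how outliers are handled, so the k-means-style step, the cosine similarity over raw term counts, the `min_similarity` threshold, the `"outlier"` label, and the sample answers and topic descriptions below are all assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Lowercased bag-of-words term counts (no stemming or stop-word removal)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def preliminary_clusters(vectors, k, iterations=10):
    """Stage 1 (assumed method): k-means-style clustering under cosine similarity.
    Centroids are seeded with the first k vectors so the run is deterministic."""
    centroids = [Counter(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                # The summed counts serve as the centroid; cosine ignores scale.
                centroids[c] = sum(members, Counter())
    return labels

def topic_driven_clusters(vectors, topics, min_similarity=0.1):
    """Stage 3 (assumed method): assign each answer to the most similar
    expert-defined topic, or to "outlier" if nothing is similar enough."""
    topic_vectors = {name: vectorize(desc) for name, desc in topics.items()}
    labels = []
    for v in vectors:
        best = max(topic_vectors, key=lambda name: cosine(v, topic_vectors[name]))
        similar_enough = cosine(v, topic_vectors[best]) >= min_similarity
        labels.append(best if similar_enough else "outlier")
    return labels

# Hypothetical survey answers to a single question.
answers = [
    "more online courses",
    "funding for lab equipment",
    "online courses on genomics",
    "lab equipment funding",
    "no comment",
]
vectors = [vectorize(a) for a in answers]

# Stage 1: preliminary schema shown to the expert.
preliminary = preliminary_clusters(vectors, k=2)

# Stage 2: stand-in for the human expert, who inspects the preliminary
# schema and writes a short keyword description for each major topic.
topics = {
    "training": "online courses training",
    "resources": "funding lab equipment",
}

# Stage 3: improved schema driven by the expert's topics.
improved = topic_driven_clusters(vectors, topics)
```

In this sketch the expert's input is limited to a handful of topic descriptions rather than labels for individual answers, which mirrors the light supervision the abstract describes; the threshold gives short, off-topic answers such as "no comment" a place to go instead of forcing them into a topic.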

This work is partially supported by the SFSU CCLS mini grant program.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yang, H., Mysore, A., Wallace, S. (2009). A Semi-supervised Topic-Driven Approach for Clustering Textual Answers to Survey Questions. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_36

  • DOI: https://doi.org/10.1007/978-3-642-03348-3_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03347-6

  • Online ISBN: 978-3-642-03348-3

  • eBook Packages: Computer Science, Computer Science (R0)
