Using Topic Modeling for Code Discovery in Large Scale Text Data

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1312)

Abstract

When text datasets are very large, manually coding them line by line becomes impractical. As a result, researchers sometimes use machine learning algorithms to code text data automatically. One of the most popular approaches is topic modeling. For a given text dataset, a topic model provides probability distributions over words for a set of “topics” in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can be used as a non-binary code for how much of that topic is present in the document. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed across the topics generated by topic modeling. The results show that (1) the top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from human-generated codes appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists.
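The kind of comparison described in the abstract can be illustrated with a small sketch. It is not the authors' procedure: the topic names, word probabilities, code names, and keyword lists below are all invented for illustration, and the helper functions (`top_keywords`, `codes_per_topic`, `topics_per_code`) are hypothetical. The sketch only shows how one might check which human-generated codes have keywords among a topic's top words, and in how many topics each code shows up.

```python
# Toy illustration only: the topics, probabilities, and code keyword lists
# below are invented, not taken from the study.

# Hypothetical topic-word distributions, as a topic model might output them.
TOPICS = {
    "topic_0": {"design": 0.30, "cost": 0.25, "client": 0.20, "test": 0.15, "team": 0.10},
    "topic_1": {"test": 0.35, "data": 0.25, "design": 0.20, "result": 0.15, "cost": 0.05},
}

# Hypothetical human-generated codes, each defined by a short keyword list.
CODES = {
    "DESIGN_REASONING": {"design", "test"},
    "CLIENT_REQUESTS": {"client", "cost"},
}


def top_keywords(word_probs, n):
    """Return the n highest-probability words of one topic."""
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def codes_per_topic(topics, codes, n):
    """Map each topic to the codes whose keywords occur in its top-n words."""
    result = {}
    for name, word_probs in topics.items():
        top = set(top_keywords(word_probs, n))
        result[name] = sorted(c for c, kws in codes.items() if kws & top)
    return result


def topics_per_code(topics, codes, n):
    """Map each code to the topics whose top-n words contain its keywords."""
    by_topic = codes_per_topic(topics, codes, n)
    return {c: sorted(t for t, cs in by_topic.items() if c in cs) for c in codes}


if __name__ == "__main__":
    print(codes_per_topic(TOPICS, CODES, n=3))
    print(topics_per_code(TOPICS, CODES, n=3))
```

In this toy data, the top three words of `topic_0` touch both codes, and `DESIGN_REASONING` appears in both topics, which mirrors in miniature the two patterns the abstract reports.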


Notes

  1. Some of these systems also include rudimentary keyword-based searches to support coding.


Acknowledgements

The research was supported by the National Science Foundation (DRL-1661036, 1713110; LDI-1934745), the Wisconsin Alumni Research Foundation, and the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison. The opinions, findings, and conclusions do not reflect the views of the funding agencies, cooperating institutions, or other individuals.

Author information

Correspondence to Zhiqiang Cai.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Cai, Z., Siebert-Evenstone, A., Eagan, B., Shaffer, D.W. (2021). Using Topic Modeling for Code Discovery in Large Scale Text Data. In: Ruis, A.R., Lee, S.B. (eds) Advances in Quantitative Ethnography. ICQE 2021. Communications in Computer and Information Science, vol 1312. Springer, Cham. https://doi.org/10.1007/978-3-030-67788-6_2

  • DOI: https://doi.org/10.1007/978-3-030-67788-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67787-9

  • Online ISBN: 978-3-030-67788-6

  • eBook Packages: Computer Science, Computer Science (R0)
