Abstract
When text datasets are very large, manually coding them line by line becomes impractical. As a result, researchers sometimes try to use machine learning algorithms to code text data automatically. One of the most popular approaches is topic modeling. For a given text dataset, a topic model provides a probability distribution over words for each of a set of “topics” in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can be used as a non-binary code indicating the proportion of the document associated with that topic. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document-topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed across the topics generated by topic modeling. The results show that (1) the top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from human-generated codes appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists.
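The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' procedure: the topic-word probabilities, the code names, the keyword lists, and the helper functions `top_keywords` and `cross_tabulate` are all invented for the example, and a real analysis would take the word distributions from a fitted topic model.

```python
from collections import defaultdict

# Toy, made-up topic-word probabilities (assumption: NOT data from the study).
# Each topic maps words to probabilities that sum to 1.
topic_word_probs = {
    "topic_0": {"design": 0.30, "client": 0.25, "cost": 0.20, "data": 0.15, "team": 0.10},
    "topic_1": {"data": 0.35, "cost": 0.25, "graph": 0.20, "client": 0.12, "team": 0.08},
}

# Hypothetical human-generated codes, each defined by a short keyword list.
code_keywords = {
    "CLIENT_REQUESTS": {"client", "cost"},
    "DATA_ANALYSIS": {"data", "graph"},
}


def top_keywords(word_probs, k):
    """Return the k highest-probability words of one topic (ties broken alphabetically)."""
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def cross_tabulate(topic_word_probs, code_keywords, k=4):
    """Relate codes and topics through the top-k keywords of each topic.

    Returns two mappings:
      codes_in_topic:  topic -> set of codes with a keyword among the topic's top-k words
      topics_for_code: code  -> set of topics whose top-k words include one of its keywords
    """
    codes_in_topic = {}
    topics_for_code = defaultdict(set)
    for topic, probs in topic_word_probs.items():
        top = set(top_keywords(probs, k))
        hits = {code for code, words in code_keywords.items() if words & top}
        codes_in_topic[topic] = hits
        for code in hits:
            topics_for_code[code].add(topic)
    return codes_in_topic, dict(topics_for_code)


codes_in_topic, topics_for_code = cross_tabulate(topic_word_probs, code_keywords, k=4)

for topic, codes in codes_in_topic.items():
    print(topic, "contains keywords from:", sorted(codes))
for code, topics in topics_for_code.items():
    print(code, "appears in:", sorted(topics))
```

In this toy setup, a topic whose top keywords hit more than one code corresponds to finding (1), and a code whose keywords surface in more than one topic corresponds to finding (2). The cutoff `k` is an arbitrary choice here, and the outcome depends on it.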
Notes
1. Some of these systems also include rudimentary keyword-based searches to support coding.
Acknowledgements
The research was supported by the National Science Foundation (DRL-1661036, 1713110; LDI-1934745), the Wisconsin Alumni Research Foundation, and the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison. The opinions, findings, and conclusions do not reflect the views of the funding agencies, cooperating institutions, or other individuals.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Cai, Z., Siebert-Evenstone, A., Eagan, B., Shaffer, D.W. (2021). Using Topic Modeling for Code Discovery in Large Scale Text Data. In: Ruis, A.R., Lee, S.B. (eds) Advances in Quantitative Ethnography. ICQE 2021. Communications in Computer and Information Science, vol 1312. Springer, Cham. https://doi.org/10.1007/978-3-030-67788-6_2
DOI: https://doi.org/10.1007/978-3-030-67788-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67787-9
Online ISBN: 978-3-030-67788-6
eBook Packages: Computer Science, Computer Science (R0)