Using Topic Modeling for Code Discovery in Large Scale Text Data

Cai, Zhiqiang; Siebert-Evenstone, Amanda; Eagan, Brendan; Shaffer, David Williamson

doi:10.1007/978-3-030-67788-6_2

Conference paper
First Online: 22 January 2021

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1312)

Abstract

When text datasets are very large, manually coding them line by line becomes impractical. As a result, researchers sometimes use machine learning algorithms to code text data automatically. One of the most popular approaches is topic modeling. For a given text dataset, a topic model provides a probability distribution over words for each of a set of “topics” in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can serve as a non-binary code indicating how much of that topic is present in the document. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed across topics generated by topic modeling. The results show that (1) the top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from a single human-generated code appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists.
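
The kind of comparison the abstract describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: the topic-word probabilities, the code names, the keyword lists, and the helper functions top_keywords and code_topic_overlap are all hypothetical, and the cutoff of four top keywords is arbitrary.

```python
# Toy topic-word probabilities (hypothetical numbers, not from the study).
# In practice these would come from a fitted topic model.
topic_word_probs = {
    "topic_0": {"design": 0.30, "cost": 0.25, "client": 0.20, "test": 0.15, "team": 0.10},
    "topic_1": {"test": 0.35, "data": 0.25, "design": 0.20, "result": 0.15, "cost": 0.05},
}

# Hypothetical human-generated codes, each defined by a short keyword list.
human_codes = {
    "DESIGN_REASONING": {"design", "cost"},
    "DATA_ANALYSIS": {"test", "data", "result"},
}


def top_keywords(word_probs, n):
    """Return the n highest-probability words of one topic (ties broken alphabetically)."""
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def code_topic_overlap(topic_word_probs, human_codes, n=4):
    """For every (topic, code) pair, list the code keywords among the topic's top-n words."""
    overlap = {}
    for topic, probs in topic_word_probs.items():
        top = set(top_keywords(probs, n))
        for code, keywords in human_codes.items():
            overlap[(topic, code)] = sorted(top & keywords)
    return overlap


if __name__ == "__main__":
    for (topic, code), shared in code_topic_overlap(topic_word_probs, human_codes).items():
        print(topic, code, shared)
```

With these toy numbers, the top keywords of each topic include words from both codes, and each code has keywords among the top words of both topics, which mirrors the two patterns reported in the abstract.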

Notes

1. Some of these systems also include rudimentary keyword-based searches to support coding.

Acknowledgements

The research was supported by the National Science Foundation (DRL-1661036, 1713110; LDI-1934745), the Wisconsin Alumni Research Foundation, and the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison. The opinions, findings, and conclusions do not reflect the views of the funding agencies, cooperating institutions, or other individuals.

Author information

Authors and Affiliations

University of Wisconsin-Madison, Madison, WI, 53706, USA
Zhiqiang Cai, Amanda Siebert-Evenstone, Brendan Eagan & David Williamson Shaffer

Corresponding author

Correspondence to Zhiqiang Cai.

Editor information

Editors and Affiliations

University of Wisconsin–Madison, Madison, WI, USA
Andrew R. Ruis
Pepperdine University, Malibu, CA, USA
Seung B. Lee

About this paper

Cite this paper

Cai, Z., Siebert-Evenstone, A., Eagan, B., Shaffer, D.W. (2021). Using Topic Modeling for Code Discovery in Large Scale Text Data. In: Ruis, A.R., Lee, S.B. (eds) Advances in Quantitative Ethnography. ICQE 2021. Communications in Computer and Information Science, vol 1312. Springer, Cham. https://doi.org/10.1007/978-3-030-67788-6_2

DOI: https://doi.org/10.1007/978-3-030-67788-6_2
Published: 22 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67787-9
Online ISBN: 978-3-030-67788-6
eBook Packages: Computer Science, Computer Science (R0)
