research-article

Using Machine Learning to Support Qualitative Coding in Social Science: Shifting the Focus to Ambiguity

Published: 21 June 2018

Abstract

Machine learning (ML) has become increasingly influential in human society, yet the primary advancements and applications of ML are driven by research in only a few computational disciplines. Even applications that affect or analyze human behaviors and social structures are often developed with limited input from experts outside of computational fields. Social scientists, experts trained to examine and explain the complexity of human behavior and interactions in the world, have considerable expertise to contribute to the development of ML applications for human-generated data, and their analytic practices could benefit from more human-centered ML methods. Although a few researchers have highlighted some gaps between ML and the social sciences [51, 57, 70], most discussions focus only on quantitative methods. Yet many social science disciplines rely heavily on qualitative methods to distill patterns that are challenging to discover through quantitative data. One common analysis method for qualitative data is qualitative coding. In this article, we highlight three challenges of applying ML to qualitative coding. Additionally, we draw on our experience of designing a visual analytics tool for collaborative qualitative coding to demonstrate the potential of using ML to support qualitative coding by shifting the focus to identifying ambiguity. We illustrate dimensions of ambiguity and discuss the relationship between disagreement and ambiguity. Finally, we propose three research directions to ground ML applications for social science as part of the progression toward human-centered machine learning.

Supplementary Material

a9-chen-apndx.pdf (chen.zip)
Supplemental movie, appendix, image, and software files for Using Machine Learning to Support Qualitative Coding in Social Science: Shifting the Focus to Ambiguity

References

[1]
Saleema Amershi, Maya Cakmak, W. Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105--120.
[2]
Susan Athey and Guido W. Imbens. 2015. Machine learning methods for estimating heterogeneous causal effects. Stat 1050, 5 (2015).
[3]
Solon Barocas. 2014. Data mining and the discourse on discrimination. In Data Ethics Workshop, Conference on Knowledge Discovery and Data Mining.
[4]
Gabriela Beirão and J. A. Sarsfield Cabral. 2007. Understanding attitudes towards public transport and private car: A qualitative study. Transp. Policy 14, 6 (2007), 478--489.
[5]
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, Jan (2003), 993--1022.
[6]
Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Adv. Neural Inform. Process. Syst. 4349--4357.
[7]
Natasha K. Bowen and Shenyang Guo. 2011. Structural Equation Modeling. Oxford University Press.
[8]
Danah Boyd and Kate Crawford. 2012. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Inform. Commun. Soc. 15, 5 (2012), 662--679.
[9]
Michael Brooks. 2015. Human Centered Tools for Analyzing Online Social Data. Ph.D. Dissertation. University of Washington.
[10]
Michael Brooks, Katie Kuksenok, Megan K. Torkildson, Daniel Perry, John J. Robinson, Taylor J. Scott, Ona Anicello, Ariana Zukowski, Paul Harris, and Cecilia R. Aragon. 2013. Statistical affect detection in collaborative chat. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work. ACM, 317--328. http://dl.acm.org/citation.cfm?id=2441813
[11]
Claire Cain Miller. 2015. Algorithms and Bias: Q. and A. With Cynthia Dwork. New York Times. Retrieved December 01, 2016 from http://www.nytimes.com/2015/08/11/upshot/algorithms-and-bias-q-and-a-with-cynthia-dwork.html.
[12]
Kathy Charmaz. 2014. Constructing Grounded Theory. Sage.
[13]
Peter Cihon and Taha Yasseri. 2016. A biased review of biases in Twitter studies on political collective action. Frontiers in Physics 4, 34 (2016).
[14]
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 1 (1960), 37--46.
[15]
Kevin Crowston, Eileen E. Allen, and Robert Heckman. 2012. Using natural language processing technology for qualitative data analysis. Int. J. Soc. Res. Methodol. 15, 6 (2012), 523--543.
[16]
Kevin Crowston, Xiaozhong Liu, and Eileen E. Allen. 2010. Machine learning and rule-based automated coding of qualitative data. Proceedings of the American Society for Information Science and Technology 47, 1 (2010), 1--2.
[17]
Marjorie Darrah. 2006. Neural network visualization techniques. In Methods and Procedures for the Verification and Validation of Artificial Neural Networks. Springer, 163--197.
[18]
N. K. Denzin and Y. S. Lincoln. 2011. The SAGE Handbook of Qualitative Research. SAGE Publications.
[19]
Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv:1702.08608 (2017).
[20]
Margaret Drouhard, Nan-Chen Chen, Jina Suh, Rafal Kocielnik, Vanessa Pena-Araya, Keting Cen, Xiangyi Zheng, and Cecilia R. Aragon. 2017. Aeonium: Visual analytics to support collaborative qualitative coding. In Proceedings of the 2017 IEEE Pacific Visualization Symposium (PacificVis’17). IEEE, 220--229.
[21]
Jeanine C. Evers, Christina Silver, Katja Mruck, and Bart Peeters. 2011. Introduction to the KWALON experiment: Discussions on qualitative data analysis software by developers and users. In Forum: Qual. Soc. Res. 12, 1 (2011).
[22]
Tiffany Derville Gallicano. 2013. An example of how to perform open coding, axial coding and selective coding. Retrieved from https://prpost.wordpress.com/2013/07/22/an-example-of-how-to-perform-open-coding-axial-coding-and-selective-coding/.
[23]
Barney G. Glaser and Judith Holton. 2004. Remodeling grounded theory. Grounded Theory Review 4, 1 (November 2004).
[24]
Justin Grimmer. 2015. We are all social scientists now: How big data, machine learning, and causal inference work together. PS: Political Science & Politics (2015). American Political Science Association.
[25]
Justin Grimmer and Brandon M. Stewart. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Anal. 21, 3 (2013), 267--297.
[26]
David J. Hand, Heikki Mannila, and Padhraic Smyth. 2001. Principles of Data Mining. MIT Press.
[27]
Moritz Hardt. 2014. How big data is unfair. Retrieved Sept. 2014 from https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de.
[28]
Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In Proceedings of the European Conference on Computer Vision. Springer, 3--19.
[29]
Cheri Ann Hernandez. 2009. Theoretical coding in grounded theory methodology. Gr. Theor. Rev. 8, 3 (2009), 51--60.
[30]
Judith A. Holton. 2007. The coding process and its challenges. In The Sage Handbook of Grounded Theory, Part III. Sage, 265--289.
[31]
Giles Hooker. 2004. Discovering additive structure in black box functions. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 575--580.
[32]
H. V. Jagadish. 2015. Moving Past the Wild West Era for Big Data. IEEE Conference on Big Data Keynote Speech. Retrieved from http://static1.squarespace.com/static/55da03c0e4b06261f858e037/t/56383353e4b0c0c519842550/1446523731270/ethics-BD.pdf.
[33]
Anil K. Jain. 2010. Data clustering: 50 years beyond K-means. Pattern Recog. Lett. 31, 8 (2010), 651--666.
[34]
Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. Enterprise data analysis and visualization: An interview study. IEEE Trans. Vis. Comput. Graph. 18, 12 (2012), 2917--2926.
[35]
Matthew Kay, Cynthia Matuszek, and Sean A. Munson. 2015. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 3819--3828.
[36]
Been Kim, Julie A. Shah, and Finale Doshi-Velez. 2015. Mind the gap: A generative approach to interpretable feature selection and extraction. In Advances in Neural Information Processing Systems 28 (NIPS’15).
[37]
Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proceedings of Conference on Empirical Methods In Natural Language Processing (EMNLP'14). 1001--1012.
[38]
David Lazer, Alex (Sandy) Pentland, Lada Adamic, Sinan Aral, Albert Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. Life in the network: The coming age of computational social science. Science 323, 5915 (Feb. 2009), 721--723.
[39]
Margaret D. LeCompte. 2000. Analyzing qualitative data. Theor. Into Pract. 39, 3 (2000), 146--154.
[40]
Seth C. Lewis, Rodrigo Zamith, and Alfred Hermida. 2013. Content analysis in an era of big data: A hybrid approach to computational and manual methods. J. Broadcast. Electron. Media 57, 1 (2013), 34--52.
[41]
Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. arXiv:1705.07874 (2017).
[42]
Matthew B. Miles and A. Michael Huberman. 1985. Qualitative Data Analysis. Sage, Newbury Park, CA.
[43]
Michael Muller, Shion Guha, Eric P. S. Baumer, David Mimno, and N. Sadat Shami. 2016. Machine learning and grounded theory method: Convergence, divergence, and combination. In Proceedings of the 19th International Conference on Supporting Group Work (GROUP’16). ACM, New York, NY, 3--8.
[44]
William Lawrence Neuman. 2005. Social Research Methods: Quantitative and Qualitative Approaches, Vol. 13. Allyn and Bacon, Boston.
[45]
Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Books.
[46]
Theodore M. Porter. 1994. From Quetelet to Maxwell: Social Statistics and the Origins of Statistical Physics. Springer Netherlands, Dordrecht, 345--362.
[47]
Nicholas Ralph, Melanie Birks, and Ysanne Chapman. 2015. The methodological dynamism of grounded theory. Int. J. Qual. Methods 14, 4 (2015), 1609406915611576.
[48]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. arXiv:1606.05386 (2016).
[49]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135--1144.
[50]
A. P. Rovai, J. D. Baker, and M. K. Ponton. 2013. Social Science Research Design and Statistics: A Practitioner’s Guide to Research Methods and IBM SPSS. Watertree Press. https://books.google.com/books?id=QId2AgAAQBAJ.
[51]
Cynthia Rudin. 2015. Can Machine Learning Be Useful for Social Science? Retrieved from http://citiespapers.ssrc.org/can-machine-learning-be-useful-for-social-science.
[52]
D. Sacha, M. Sedlmair, L. Zhang, J. A. Lee, D. Weiskopf, S. C. North, and D. A. Keim. 2016. Human-centered machine learning through interactive visualization: Review and open challenges. In Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning.
[53]
Johnny Saldana. 2015. An introduction to codes and coding. In The Coding Manual for Qualitative Researchers. 1--31.
[54]
Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv:1708.08296 (2017).
[55]
Cyrus Samii, Laura Paler, and Sarah Zukerman Daly. 2016. Retrospective causal inference with machine learning ensembles: An application to anti-recidivism policies in Colombia. Political Anal. 24, 4 (2016), 434--456.
[56]
Burr Settles. 2010. Active learning literature survey. University of Wisconsin, Madison 52, 55--66 (2010), 11.
[57]
Burr Settles. 2013. Machine Learning and Social Science: Taking The Best of Both Worlds. Retrieved December 1, 2016 from https://slackprop.wordpress.com/2013/02/05/machine-learning-and-social-science.
[58]
Daniel Smilkov, Nikhil Thorat, Charles Nicholson, Emily Reif, Fernanda B. Viégas, and Martin Wattenberg. 2016. Embedding projector: Interactive visualization and interpretation of embeddings. arXiv:1611.05469 (2016).
[59]
Kate Starbird, Dharma Dailey, Ann Hayward Walker, Thomas M. Leschine, Robert Pavia, and Ann Bostrom. 2015. Social media, public participation, and the 2010 BP Deepwater Horizon oil spill. Human Ecol. Risk Assess.: Int. J. 21, 3 (April 2015), 605--630.
[60]
Anselm L. Strauss. 1987. Qualitative Analysis for Social Scientists. Cambridge University Press.
[61]
Latanya Sweeney. 2013. Discrimination in online ad delivery. Queue 11, 3 (2013), 10.
[62]
Renata Tesch. 2013. Qualitative Research: Analysis Types and Software. Routledge.
[63]
2016. Economists are prone to fads, and the latest is machine learning. The Economist (US) (Nov. 2016). Retrieved December 1, 2016 from https://www.economist.com/finance-and-economics/2016/11/24/economists-are-prone-to-fads-and-the-latest-is-machine-learning.
[64]
Patrick Tierney. 2012. A qualitative analysis framework using natural language processing and graph theory. Int. Rev. Res. Open Distrib. Learn. 13, 5 (2012), 173--189.
[65]
Vanya Van Belle and Paulo Lisboa. 2013. Research directions in interpretable machine learning models. In Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning.
[66]
Hal R. Varian. 2014. Big data: New tricks for econometrics. J. Econ. Perspect. 28, 2 (2014), 3--27.
[67]
Alfredo Vellido, José David Martín-Guerrero, and Paulo J. G. Lisboa. 2012. Making machine learning models interpretable. In ESANN, Vol. 12. Citeseer, 163--172.
[68]
Sarah Vieweg, Amanda L. Hughes, Kate Starbird, and Leysia Palen. 2010. Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’10). ACM, New York, NY.
[69]
Hanna Wallach. 2014. Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency. Retrieved from https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d.
[70]
H. Wallach. 2016. Computational Social Science: Toward a Collaborative Future. Cambridge University Press. Retrieved from https://www.microsoft.com/en-us/research/publication/computational-social-science-toward-a-collaborative-future.
[71]
Xiaohong Wang, Sitao Wu, Xiaoru Wang, and Qunzhan Li. 2006. SVMV--A novel algorithm for the visualization of SVM classification results. In International Symposium on Neural Networks. Springer, 968--973.
[72]
Duncan J. Watts. 2004. The “new” science of networks. Annu. Rev. Sociol. 30, 1 (2004).
[73]
Gregor Wiedemann. 2013. Opening up to big data: Computer-assisted analysis of textual data in social sciences. Historical Social Research. GESIS - Leibniz Institute for the Social Sciences, 332--357.
[74]
Gregor Wiedemann. 2016. Text Mining for Qualitative Data Analysis in the Social Sciences. Springer.
[75]
Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2016. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
[76]
Kanit Wongsuphasawat, Daniel Smilkov, James Wexler, Jimbo Wilson, Dandelion Mane, Doug Fritz, Dilip Krishnan, Fernanda B. Viégas, and Martin Wattenberg. 2017. Visualizing dataflow graphs of deep learning models in TensorFlow. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2017), 1--12.
[77]
Jasy Liew Suet Yan, Nancy McCracken, Shichun Zhou, and Kevin Crowston. 2014. Optimizing features in active machine learning for complex qualitative content analysis. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. 44--48.

Cited By

  • (2025) Perceptions on the Implementation of a School Nursing Pilot Programme in the Canary Islands. Nursing Reports 15:2 (48). DOI: 10.3390/nursrep15020048. Online publication date: 31-Jan-2025
  • (2024) Teaching Tip: Teaching About Ambiguity in Analytics: A Student-Centered Semester-Long Project to Raise Awareness of Ambiguity by Predicting Student Exam Performance. Journal of Information Systems Education 35:3 (249-260). DOI: 10.62273/WQJY6047. Online publication date: 2024
  • (2024) Measurement and Policy Optimization of Regional Preschool Education Development Level Based on Generalized Orthogonal Fuzzy Sets and Prospect Theory. International Journal of Web-Based Learning and Teaching Technologies 19:1 (1-17). DOI: 10.4018/IJWLTT.341803. Online publication date: 9-Apr-2024


      Published In

      ACM Transactions on Interactive Intelligent Systems  Volume 8, Issue 2
      Special Issue on Human-Centered Machine Learning
      June 2018
      259 pages
      ISSN:2160-6455
      EISSN:2160-6463
      DOI:10.1145/3232718
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 June 2018
      Accepted: 01 January 2018
      Revised: 01 December 2017
      Received: 01 January 2017
      Published in TIIS Volume 8, Issue 2


      Author Tags

      1. Social scientists
      2. ambiguity
      3. computational social science
      4. human-centered machine learning
      5. machine learning
      6. qualitative coding

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Article Metrics

      • Downloads (Last 12 months): 367
      • Downloads (Last 6 weeks): 41
      Reflects downloads up to 05 Mar 2025

      Cited By

      • (2024) The Use of eXplainable Artificial Intelligence and Machine Learning Operation Principles to Support the Continuous Development of Machine Learning-Based Solutions in Fault Detection and Identification. Computers 13:10 (252). DOI: 10.3390/computers13100252. Online publication date: 2-Oct-2024
      • (2024) Operational disruption in healthcare associated with software functionality issue due to software security patching: a case report. Frontiers in Digital Health 6. DOI: 10.3389/fdgth.2024.1367431. Online publication date: 14-Mar-2024
      • (2024) Understanding older people's voice interactions with smart voice assistants: a new modified rule-based natural language processing model with human input. Frontiers in Digital Health 6. DOI: 10.3389/fdgth.2024.1329910. Online publication date: 14-May-2024
      • (2024) Challenges in moderating disruptive player behavior in online competitive action games. Frontiers in Computer Science 6. DOI: 10.3389/fcomp.2024.1283735. Online publication date: 23-Feb-2024
      • (2024) Multi-Resolution Design: Using Qualitative and Quantitative Analyses to Recursively Zoom in and out of the Same Dataset. Journal of Mixed Methods Research. DOI: 10.1177/15586898241284696. Online publication date: 16-Sep-2024
      • (2024) Should ChatGPT help with my research? A caution against artificial intelligence in qualitative analysis. Qualitative Research. DOI: 10.1177/14687941241297375. Online publication date: 5-Dec-2024
      • (2024) Bridging Qualitative Data Silos. Social Science Computer Review 42:3 (760-776). DOI: 10.1177/08944393231215459. Online publication date: 1-Jun-2024
