A feature location approach for mapping application features extracted from crowd-based screencasts to source code

Empirical Software Engineering

Abstract

Crowd-based multimedia documents such as screencasts have emerged as a source for documenting requirements, workflows, and implementation issues of open source and agile software projects. For example, users can show and narrate how they manipulate an application’s GUI to perform a certain function, or a bug reporter can visually explain how to trigger a bug or a security vulnerability. Unfortunately, the streaming nature of programming screencasts and their binary format limit how developers can interact with a screencast’s content. In this research, we present an automated approach for mining the multimedia content found in screencasts and linking it to relevant software artifacts, more specifically to source code. We apply LDA-based mining approaches that take as input a set of screencast artifacts, such as GUI text and spoken words, to make the screencast content accessible and searchable to users and to link it to the relevant source code artifacts. To evaluate the applicability of our approach, we report results from case studies that we conducted on existing WordPress and Mozilla Firefox screencasts. We found that our automated approach can significantly speed up the feature location process. For WordPress, we find that our approach using screencast speech and GUI text can successfully link relevant source code files within the top 10 hits of the result set with median Reciprocal Rank (RR) of 50% (rank 2) and 100% (rank 1). In the case of Firefox, our approach can identify relevant source code directories within the top 100 hits using screencast speech and GUI text with a median RR of 20%, meaning that the first true positive is ranked fifth or better in at least 50% of the cases. Also, source code related to the frontend implementation, which handles high-level or GUI-related aspects of an application, is located with higher accuracy.
We also found that term frequency rebalancing can further improve the linking results when using less noisy scenarios or when locating less technical implementations of scenarios. A comparison of the original and weighted screencast data sources (speech, GUI text, and speech combined with GUI text) that yield the highest median RR values in both case studies shows that speech data is an important information source that can yield an RR of 100%.
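
The relationship between RR values and ranks quoted in the abstract (50% for rank 2, 100% for rank 1, 20% for rank 5) can be illustrated with a minimal sketch. It assumes the standard definition of Reciprocal Rank as one divided by the rank of the first relevant hit. The file names and query data below are hypothetical and are not taken from the case studies.

```python
from statistics import median


def reciprocal_rank(ranked_hits, relevant):
    """Return 1/rank of the first relevant hit, or 0.0 if none is relevant."""
    for rank, hit in enumerate(ranked_hits, start=1):
        if hit in relevant:
            return 1.0 / rank
    return 0.0


def median_rr(queries):
    """Median RR over (ranked_hits, relevant_set) pairs, one pair per query."""
    return median(reciprocal_rank(hits, rel) for hits, rel in queries)


# Hypothetical result lists: each query pairs a ranked list of source files
# with the set of files considered relevant for that screencast.
queries = [
    (["a.php", "b.php", "c.php"], {"b.php"}),                    # first hit at rank 2
    (["a.php", "b.php"], {"a.php"}),                             # first hit at rank 1
    (["a.php", "b.php", "c.php", "d.php", "e.php"], {"e.php"}),  # first hit at rank 5
]

for hits, rel in queries:
    print(reciprocal_rank(hits, rel))
print("median RR:", median_rr(queries))
```

Under this definition, a median RR of 20% means that at least half of the queries place their first relevant hit at rank 5 or better.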

Author information

Corresponding author

Correspondence to Parisa Moslehi.

Additional information

Communicated by: Gabriele Bavota

About this article

Cite this article

Moslehi, P., Adams, B. & Rilling, J. A feature location approach for mapping application features extracted from crowd-based screencasts to source code. Empir Software Eng 25, 4873–4926 (2020). https://doi.org/10.1007/s10664-020-09874-z

  • DOI: https://doi.org/10.1007/s10664-020-09874-z
