Mining Source Code Topics Through Topic Model and Words Embedding

  • Conference paper
Advanced Data Mining and Applications (ADMA 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10086)

Abstract

Developers can now build their own applications on top of existing software systems. However, a lack of documentation hinders the reuse of these systems. We examine the problem of mining topics (i.e., topic extraction) from source code, which can facilitate comprehension of software systems. We propose a topic extraction method, Embedded Topic Extraction (EmbTE), which leverages word embedding techniques to take word semantics into account; word semantics have not previously been considered in mining topics from source code. We also adopt Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to extract topics from source code. Moreover, we propose an automated term selection algorithm that identifies the most contributory terms in source code for the topic extraction task. Empirical studies on GitHub (https://github.com/) Java projects show that EmbTE outperforms the other methods by producing more coherent topics. The results also indicate that method names, method comments, class names, and class comments are the most contributory types of terms for source code topic extraction.
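
As a rough illustration of what NMF-based topic extraction from source code terms can look like, the following sketch factors a small term-by-document count matrix using the standard multiplicative update rules of Lee and Seung. It is not the authors' implementation: the vocabulary, the counts, and the function names (`nmf`, `top_terms`, `frobenius_error`) are all invented for illustration, and a real pipeline would more likely rely on an established library than on this pure-Python loop.

```python
import random

# Hypothetical toy data: rows are terms (as might be drawn from method or
# class names), columns are four imaginary Java classes. Counts are invented.
VOCAB = ["parse", "token", "syntax", "socket", "connect", "port"]
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 2, 2],
    [0, 0, 1, 3],
]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    cols = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (terms x docs) as W (terms x k) times H (k x docs)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))


def top_terms(W, vocab, n=3):
    """For each topic (column of W), list the n highest-weighted terms."""
    topics = []
    for a in range(len(W[0])):
        ranked = sorted(range(len(vocab)), key=lambda i: W[i][a], reverse=True)
        topics.append([vocab[i] for i in ranked[:n]])
    return topics


if __name__ == "__main__":
    W, H = nmf(V, k=2)
    for idx, terms in enumerate(top_terms(W, VOCAB)):
        print(f"topic {idx}: {terms}")
    print("reconstruction error:", frobenius_error(V, W, H))
```

Each column of `W` is read as a topic, with the highest-weighted terms serving as its label; each column of `H` gives the topic mixture of one document. The number of topics `k` and the iteration count are arbitrary choices here.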

Notes

  1. https://github.com/javaparser/javaparser

  2. http://snowball.tartarus.org/algorithms/english/stop.txt

  3. http://nlp.stanford.edu/

  4. https://radimrehurek.com/gensim/

  5. https://code.google.com/archive/p/word2vec/

  6. https://github.com/AKSW/Palmetto

  7. http://scikit-learn.org/

Corresponding author

Correspondence to Wei Emma Zhang.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Zhang, W.E., Sheng, Q.Z., Abebe, E., Babar, M.A., Zhou, A. (2016). Mining Source Code Topics Through Topic Model and Words Embedding. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q. (eds) Advanced Data Mining and Applications. ADMA 2016. Lecture Notes in Computer Science, vol 10086. Springer, Cham. https://doi.org/10.1007/978-3-319-49586-6_47

  • DOI: https://doi.org/10.1007/978-3-319-49586-6_47

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49585-9

  • Online ISBN: 978-3-319-49586-6

  • eBook Packages: Computer Science (R0)