Using LSTMs to Model the Java Programming Language

Boldt, Brendon

doi:10.1007/978-3-319-68612-7_31

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10614)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Recurrent neural networks (RNNs), specifically long short-term memory networks (LSTMs), can model natural language effectively. This research investigates the ability of these same LSTMs to perform next “word” prediction on the Java programming language. Java source code from four different repositories undergoes a transformation that preserves the logical structure of the source code and removes the code’s various specificities, such as variable names and literal values. These datasets and an additional English language corpus are used to train and test the ability of standard LSTMs to predict the next element in a sequence. Results suggest that LSTMs can effectively model Java code, achieving perplexities under 22 and accuracies above 0.47, an improvement over the LSTMs’ performance on the English language, which showed a perplexity of 85 and an accuracy of 0.27. This research may be applicable in other areas such as syntactic template suggestion and automated bug patching.
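The two metrics quoted in the abstract, perplexity and accuracy, can be illustrated with a minimal Python sketch. It assumes the standard definitions (perplexity as the exponential of the mean negative log-probability assigned to the true next token, accuracy as the fraction of positions where the most probable token is the true one); the paper's own evaluation code is not shown here, and the function name, the dictionary-based distributions, and the token names are hypothetical.

```python
import math

def perplexity_and_accuracy(predictions, targets):
    """Score next-token predictions.

    predictions: list of dicts mapping token -> predicted probability
                 (one distribution per position; a hypothetical format).
    targets:     list of the true next tokens, same length.
    """
    if not targets or len(predictions) != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")

    total_log_prob = 0.0
    correct = 0
    for dist, target in zip(predictions, targets):
        # Floor the probability so an unseen token does not produce log(0).
        p = max(dist.get(target, 0.0), 1e-12)
        total_log_prob += math.log(p)
        # Top-1 accuracy: the most probable token must match the target.
        if max(dist, key=dist.get) == target:
            correct += 1

    n = len(targets)
    perplexity = math.exp(-total_log_prob / n)
    accuracy = correct / n
    return perplexity, accuracy

# Illustrative tokens only; these are not the paper's vocabulary.
uniform = {"Block": 0.25, "IfStatement": 0.25, "ReturnStatement": 0.25, "SimpleName": 0.25}
ppl, acc = perplexity_and_accuracy([uniform] * 3, ["Block", "IfStatement", "SimpleName"])
```

Under these definitions, a model that spreads probability uniformly over k tokens has perplexity k, so a perplexity under 22 can be read as the model being, on average, about as uncertain as a uniform choice among fewer than 22 tokens.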

Notes

1. Functional insofar as method bodies describe the active (non-declarative) behavior of the program.
2. VariableDeclarationStatement is not included in the tokenized version of the AST since the syntax is adequately represented by starting with the root node’s children.
3. ElasticSearch had a proportion of 16%.

Author information

Authors and Affiliations

Marist College, 3399 North Rd., Poughkeepsie, NY, USA
Brendon Boldt

Corresponding author

Correspondence to Brendon Boldt.

Editor information

Editors and Affiliations

University of Lausanne, Lausanne, Switzerland
Alessandra Lintas
University of Genoa, Genoa, Italy
Stefano Rovetta
Universitat Pompeu Fabra, Barcelona, Spain
Paul F.M.J. Verschure
University of Lausanne, Lausanne, Switzerland
Alessandro E.P. Villa

About this paper

Cite this paper

Boldt, B. (2017). Using LSTMs to Model the Java Programming Language. In: Lintas, A., Rovetta, S., Verschure, P., Villa, A. (eds) Artificial Neural Networks and Machine Learning – ICANN 2017. ICANN 2017. Lecture Notes in Computer Science, vol 10614. Springer, Cham. https://doi.org/10.1007/978-3-319-68612-7_31

DOI: https://doi.org/10.1007/978-3-319-68612-7_31
Published: 25 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68611-0
Online ISBN: 978-3-319-68612-7
eBook Packages: Computer Science, Computer Science (R0)
