Understanding the API usage in Java

https://doi.org/10.1016/j.infsof.2016.01.011

Abstract

Context

Application Programming Interfaces (APIs) facilitate the use of programming languages. They define sets of rules and specifications for software programs to interact with. The design of a language's API is usually artistic, driven by aesthetic concerns and the intuitions of language architects. Despite recent studies on limited aspects of API usage, there is a lack of comprehensive, quantitative analyses that explore and seek to understand how real-world source code uses language APIs.

Objective

This study aims to understand how APIs are employed in practical development and explore their potential applications based on the results of API usage analysis.

Method

We conduct a large-scale, comprehensive, empirical analysis of the actual usage of APIs in Java, a modern, mature, and widely-used programming language. Our corpus contains over 5000 open-source Java projects, totaling 150 million source lines of code (SLoC). We study the usage of both the core (official) API library and third-party (unofficial) API libraries. We resolve project dependencies automatically, generate accurate resolved abstract syntax trees (ASTs), capture used API entities from over 1.5 million ASTs, and measure the usage based on our defined metrics: frequency, popularity and coverage.

Results

Our study provides detailed quantitative information and yields several insights. In particular, it (1) confirms the conventional wisdom that the usage of APIs obeys a Zipf distribution; (2) demonstrates that the core API is not fully used (many classes, methods and fields have never been used); (3) discovers that deprecated API entities (some of which were deprecated long ago) are still widely used; (4) finds that the current compact profiles are under-utilized; (5) identifies API library coldspots and hotspots.

Conclusions

Our findings are suggestive of potential applications across language API design, optimization and restriction, API education, library recommendation and compact profile construction.

Introduction

Syntax and semantics define a programming language. Application Programming Interfaces (APIs) facilitate its use. Most of today’s software projects depend heavily on API libraries [1], which improve code reuse, reduce development cost and raise programmers’ productivity. However, API design has been artistic and biased, driven by aesthetic concerns and the intuitions of API designers. Designers usually have limited knowledge of how programmers actually use the API, so many unnatural and rarely used API features are introduced while some expected ones are missing [2], [3]. Meanwhile, the ever-growing APIs (features are continually being added) remain a significant barrier to novice programmers [4]. In addition, API libraries have become one of the most influential factors in the choice of programming languages [5]. Poor API design steepens the learning curve for developers and greatly affects their productivity. Therefore, it is important to understand the actual usage of the current API libraries and to optimize their designs to improve API usability for programmers.

Studying how a large number of real-world programs use APIs can help validate or disprove the many popular “theories” that abound in popular literature and on the Internet, such as which APIs are most adopted, most useful or easiest to use, and whether APIs are fully used by programmers. For language education, the gap between APIs and their actual usage may guide pedagogy, giving teachers insight into what is common (and perhaps should be) and what is rare (and perhaps should not be). It can also guide novice programmers toward a proportionally smaller fraction, i.e. the essential part, of the entire API to reduce the cost of learning. Language API designers may leverage data on actual API usage to optimize the design of API libraries, e.g. by simplifying unpopular APIs and identifying unused APIs that could be eliminated. In addition, API usage analysis is crucial in mining API usage patterns [6], [7], [8], [9], and supports API migration [10], [11]. It also benefits software maintenance [12].

To this end, we perform a large-scale empirical study on a diverse corpus of over 5000 real-world Java projects to gain insight into how APIs are used in practice. We retrieve project dependencies with the aid of Maven [13], generate accurate resolved abstract syntax trees (ASTs) for approximately 150 million SLoC, capture used API entities (i.e. packages, classes, methods and fields) from over 1.5 million ASTs, and measure the usage based on our defined metrics: frequency (whether an API has been frequently used), popularity (whether an API has been widely used) and coverage (whether an API has been fully used). We analyze almost all the API libraries that are adopted by practical projects, including both the core API and third-party APIs. We also investigate some additional issues, e.g. the construction of API subsets and the selection of versions of third-party APIs. In summary, this paper makes the following contributions:

  • It presents a large-scale, comprehensive, empirical analysis of the use of APIs in a modern programming language, namely Java;

  • This is the first work to deeply study both the core API and third-party APIs, including the use of deprecated API entities. It is also the first to study how API usage can guide the design of compact profiles (i.e. subsets of APIs);

  • Some interesting results are demonstrated: (1) the 1% most-used packages account for 80% of all API usage, while the 70% least-used packages account for < 0.5% of all API usage and the 50% least-used for only < 0.1%; (2) 15.3% of the classes, 41.2% of the methods and 41.6% of the fields from the core API are never used; (3) in 9.5% of the packages and 29.2% of the classes, none of the contained methods is ever used; (4) 51.1% of the deprecated classes, 43.5% of the deprecated methods and 18.1% of the deprecated fields from the core API are still in use.
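The three metrics named above (frequency, popularity and coverage) can be illustrated with a minimal sketch. The precise definitions used in the study are not given in this excerpt, so the functions below are one plausible reading: frequency counts use sites, popularity counts distinct projects, and coverage is the fraction of declared entities used at least once. The toy data and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: for each project, the API entities it references,
# one list entry per use site. A real study would extract these from resolved ASTs.
usages = {
    "project_a": ["java.util.List", "java.util.List", "java.lang.String"],
    "project_b": ["java.lang.String", "java.io.File"],
    "project_c": ["java.lang.String"],
}

# Hypothetical set of entities declared by the API library under study.
declared = {"java.util.List", "java.lang.String", "java.io.File", "java.util.Vector"}


def frequency(usages):
    """Total number of use sites per API entity across the whole corpus."""
    counts = Counter()
    for entities in usages.values():
        counts.update(entities)
    return counts


def popularity(usages):
    """Number of distinct projects that use each API entity at least once."""
    counts = Counter()
    for entities in usages.values():
        counts.update(set(entities))
    return counts


def coverage(usages, declared):
    """Fraction of declared API entities that are used at least once."""
    if not declared:
        return 0.0
    used = set().union(*usages.values()) & declared
    return len(used) / len(declared)


freq = frequency(usages)
pop = popularity(usages)
cov = coverage(usages, declared)
```

In this toy corpus `java.util.List` is frequent within one project but not popular across projects, which is why the two metrics are kept separate; `java.util.Vector` is declared but never used, so it lowers coverage.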

Taken together, our results permit API designers to empirically consider whether the design of the API facilitates programmers’ development based on their actual usage. Our study also identifies both hotspots (i.e. frequently and widely used APIs) and coldspots (i.e. rarely and narrowly used APIs) to help programmers learn and adopt APIs selectively. For example, if an API is never used, programmers should be alerted to use it cautiously in practical development. In addition, the results help construct appropriate subsets of the APIs that can be deployed on resource-constrained devices or in high-security environments. We believe that our work enables data-driven language API design, optimization and simplification, analogous to how Cocke’s study at IBM in the 1970s on the actual usage of CISC instructions eventually led to the RISC architectures [14].
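As a rough illustration of the hotspot and coldspot notions, the sketch below labels an entity a hotspot when it is both frequently and widely used, and a coldspot when it is both rarely and narrowly used. The thresholds, the sample counts and the function name are assumptions made for this example; the excerpt does not state the cut-offs the study applies.

```python
def classify_spots(frequency, popularity, hot_freq, hot_pop, cold_freq, cold_pop):
    """Label each entity as 'hotspot', 'coldspot' or 'neutral'.

    frequency:  entity -> number of use sites in the corpus
    popularity: entity -> number of projects using the entity
    The four thresholds are hypothetical and would need tuning per corpus.
    """
    labels = {}
    for entity, f in frequency.items():
        p = popularity.get(entity, 0)
        if f >= hot_freq and p >= hot_pop:
            labels[entity] = "hotspot"      # frequently AND widely used
        elif f <= cold_freq and p <= cold_pop:
            labels[entity] = "coldspot"     # rarely AND narrowly used
        else:
            labels[entity] = "neutral"      # e.g. frequent but in few projects
    return labels


# Hypothetical counts for four API entities.
frequency = {"String": 9000, "List": 4000, "Vector": 3, "BigCache": 5000}
popularity = {"String": 500, "List": 420, "Vector": 2, "BigCache": 1}

labels = classify_spots(
    frequency, popularity,
    hot_freq=1000, hot_pop=100,   # assumed hotspot cut-offs
    cold_freq=10, cold_pop=5,     # assumed coldspot cut-offs
)
```

Requiring both conditions keeps an entity such as the hypothetical `BigCache`, which is used heavily by a single project, out of the hotspot set.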

Section snippets

Methodology

This section first discusses the research questions studied, then presents basic information about the corpus used in this study, and finally describes how we set up and perform the experiments.

API usage provenance

Research in software engineering has shown that reuse can promote the productivity of the development team, reduce the time-to-market and improve the overall quality of software products [22]. Adopting API libraries is an effective and efficient approach to reuse [23]. We are interested in API provenance, that is, in how much of the code is reused from existing API libraries and how much is newly written. To this end, we collect the use of all APIs, including project-specific API

Coverage analysis

API coverage analysis is considered a principal way to assist API migration [16] and increase API usability [27]. It can also be applied to inspect whether the API library has been sufficiently utilized. New features are introduced continually, while few rarely used features are ever removed from the core APIs. This is expected to cause rapid growth of the core library and higher resource consumption on devices. We aim to identify those coldspots of the core

Library popularity analysis

Apart from using the core API library, programmers usually select appropriate third-party libraries to maximize code reuse and improve the efficiency of the development process. In total, projects in our corpus employ 103,256 external third-party dependencies. If we ignore the possibility that a library has multiple versions (discussed in Section 5.3), 16,329 distinct third-party libraries are adopted. However, most usage is concentrated in a small number of libraries. Only 15 libraries are adopted by

Applications

Our work and results offer a number of insights, inspiring some potential applications:

Construct validity

The construct validity of our study rests on the measurements performed, in particular those related to corpus construction, dependency resolution and API entity resolution.

Regarding the corpus construction, we select and download over 5000 projects with diverse characteristics, such as project sizes and domains. All projects are obtained from GitHub, one of the most popular and widely-used project hosting services, which hosts a massive number of Git-based projects. The reason we select Git is

API usage analysis

To some extent, our work is analogous to [16], [31]. Homan et al. studied the usage of the API entities from the Java standard API across 39 projects [31]. Our study conducts a similar experiment on a larger corpus (containing over 5000 projects). Our results demonstrate that 15.3% of the classes and 41.2% of the methods are not used at all, which differs from their results showing that 50% of the classes and 80% of the methods are never used. Coverage depends heavily on the corpus scale.
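The dependence of coverage on corpus scale can be seen in a small sketch: as more projects are added, the set of API entities used at least once can only grow, so the never-used share can only shrink. The data and names below are hypothetical and merely illustrate the mechanism; they do not reproduce either study's measurements.

```python
def coverage_by_corpus_size(projects, declared):
    """Return cumulative coverage after adding each project in order.

    projects: list of sets of API entities used by each project
    declared: set of all entities declared by the API library
    """
    used = set()
    curve = []
    for entities in projects:
        used |= set(entities) & declared   # entities seen so far
        curve.append(len(used) / len(declared))
    return curve


# Hypothetical API with four declared entities and three projects.
declared = {"A", "B", "C", "D"}
projects = [{"A"}, {"A", "B"}, {"C"}]

curve = coverage_by_corpus_size(projects, declared)
never_used = [1 - c for c in curve]
```

With one project, three of the four entities appear unused; with all three projects, only one does. A small corpus therefore tends to overstate the never-used share relative to a large one.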

Conclusion and future work

This paper has presented a large-scale study of how Java’s APIs are used in practice by analyzing more than 5000 open-source Java projects. Our study has exposed interesting quantitative information to help understand how APIs from the core library and third-party libraries have been used. There are several interesting directions for future work. Specifically, we plan to (1) conduct a more comprehensive study on a variety of other programming languages to increase the external validity of our

Acknowledgments

The work is supported by the National Natural Science Foundation of China under grant no. 61572126, the Huawei Innovation Research Program (HIRP) under grant no. YB2013120195 and the Scientific Research Foundation of Graduate School of Southeast University under grant no. YBJJ1313.

References (47)

  • Y.M. Mileva et al., Mining trends of library usage, Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution and Software Evolution Workshops (IWPSE-Evol) (2009)
  • U. Sandberg, Tired of Date and Calendar? http://www.jayway.com/2006/09/16/tired-of-date-and-calendar/ (accessed...
  • Your Language Sucks. https://wiki.theory.org/YourLanguageSucks (accessed...
  • M. Robillard, What makes APIs hard to learn? Answers from developers, IEEE Softw. (2009)
  • L.A. Meyerovich et al., Empirical analysis of programming language adoption, Proceedings of the ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA) (2013)
  • J.Y. Gil et al., Micro patterns in Java code, Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA) (2005)
  • H. Zhong et al., MAPO: mining and recommending API usage patterns, Proceedings of the 23rd European Conference on Object-oriented Programming (ECOOP) (2009)
  • G. Uddin et al., Temporal analysis of API usage concepts, Proceedings of the 34th International Conference on Software Engineering (ICSE) (2012)
  • J. Wang et al., Mining succinct and high-coverage API usage patterns from source code, Proceedings of the 10th IEEE Working Conference on Mining Software Repositories (MSR) (2013)
  • H.A. Nguyen et al., A graph-based approach to API usage adaptation, Proceedings of the 25th Annual ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA) (2010)
  • M. Nita et al., Using twinning to adapt programs to alternative APIs, 2010 ACM/IEEE 32nd International Conference on Software Engineering (ICSE) (2010)
  • V. Bauer et al., Understanding API usage to support informed decision making in software maintenance, Proceedings of the 16th European Conference on Software Maintenance and Reengineering (CSMR) (2012)
  • Maven. http://maven.apache.org/ (accessed...
  • J. Cocke et al., The evolution of RISC technology at IBM, IBM J. Res. Dev. (1990)
  • Java SE. http://www.oracle.com/technetwork/java/javase/ (accessed...
  • R. Lämmel et al., Large-scale, AST-based API-usage analysis of open-source Java projects, Proceedings of the 2011 ACM Symposium on Applied Computing (SAC) (2011)
  • Eclipse EGit. http://www.eclipse.org/egit/ (accessed...
  • Eclipse Aether. http://eclipse.org/aether/ (accessed...
  • Maven API. http://maven.apache.org/ref/3.3.1/index.html (accessed...
  • Eclipse JDT. http://www.eclipse.org/jdt/ (accessed...
  • Apache Commons BCEL. https://commons.apache.org/proper/commons-bcel/ (accessed...
  • W. Lim, Effects of reuse on quality, productivity, and economics, IEEE Softw. (1994)
  • L. Heinemann, Effective and Efficient Reuse with Software Libraries (2012)