Abstract
Many code search techniques have been proposed to return relevant code for a user query expressed as textual descriptions. However, source code is not mere text. It contains dependency relations among various program elements. To leverage these dependencies for more accurate code search results, techniques have been proposed to allow user queries to be expressed as control and data dependency relationships among program elements. Although such techniques have been shown to be effective for finding relevant code, it remains a question whether appropriate queries can be generated by average users. In this work, we address this concern by proposing a technique, AutoQuery, that can automatically construct dependency queries from a set of code snippets. We realize AutoQuery by the following major steps: firstly, code snippets (that are not necessarily compilable) are converted into program dependence graphs (PDGs); secondly, a new graph mining solution is built to return common structures in the PDGs; thirdly, the common structures are converted to dependency queries, which are used to retrieve results by using a dependence-based code search technique. We have evaluated AutoQuery on real systems with 47 different code search tasks. The results show that the automatically constructed dependency queries retrieve relevant code with a precision, recall, and F-measure of 68.4, 72.1, and 70.2 %, respectively. We have also performed a user study to compare the effectiveness of AutoQuery with that of human generated queries. The results show that queries constructed by AutoQuery on average help to retrieve code fragments with comparable F-measures to those retrieved by human constructed queries.












Similar content being viewed by others
Notes
We use the node types defined by CodeSurfer. There are 33 different node types for C/C++, e.g., function call, expression, etc.
First argument of function \(tag \rightarrow check()\).
We simply name a type by using its corresponding variable name in uppercase letters.
References
Andersen, J., Lawall, J.L.: Generic patch inference. Autom. Softw. Eng. 17(2), 119–148 (2010)
Andersen, J., Nguyen, A. C., Lo, D., Lawall, J. L., Khoo, S.-C.: Semantic patch inference. In: ASE, pp. 382–385 (2012)
Antoniol, G., Canfora, G., Casazza, G., De Lucia, A., Merlo, E.: Recovering traceability links between code and documentation. IEEE Trans. Softw. Eng. 28(10), 970–983 (2002)
Baah, G.K., Podgurski, A., Harrold, M.J.: The probabilistic program dependence graph and its application to fault diagnosis. IEEE Trans. Softw. Eng. 36(4), 528–545 (2010)
Baker, H.G.: Unify and conquer (garbage, updating, aliasing, ...) in functional languages. In: ACM Conference on LISP and Functional Programming, pp. 218–226 (1990)
Chan, W.-K., Cheng, H., Lo, D.: Searching connected api subgraph via text phrases. In: SIGSOFT FSE, p. 10 (2012)
Chang, R.-Y., Podgurski, A., Yang, J.: Discovering neglected conditions in software by mining dependence graphs. IEEE Trans. Softw. Eng. 34(5), 579–596 (2008)
Cheng, H., Lo, D., Zhou, Y., Wang, X., Yan, X.: Identifying bug signatures using discriminative graph mining. In: ISSTA, pp. 141–152 (2009)
Dagenais, B., Hendren, L.: Enabling static analysis for partial java programs. In OOPSLA, pp. 313–328 (2008)
Dit, B., Revelle, M., Gethers, M., Poshyvanyk, D.: Feature location in source code: a taxonomy and survey. J. Softw. 25(1), 53–95 (2013)
Dit, B., Revelle, M., Poshyvanyk, D.: Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir. Softw. Eng. 18(2), 277–309 (2013)
Gabel, M., Jiang, L., Su, Z.: Scalable detection of semantic clones. In: ICSE, pp. 321–330 (2008)
Ganesh, V., Kiezun, A., Artzi, S., Guo, P.J., Hooimeijer, P., Ernst, M.D.: Hampi: a string solver for testing, analysis and vulnerability detection. In: CAV, pp. 1–19 (2011)
Codesurfer, Grammatech (2013)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York (1997)
Hardekopf, B., Lin, C.: The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code. In: PLDI, pp. 290–299 (2007)
Horwitz, S., Reps, T.W.: The use of program dependence graphs in software engineering. In: ICSE, pp. 392–411 (1992)
Jang, J., Agrawal, A., Brumley, D.: Redebug: Finding unpatched code clones in entire os distributions. In: IEEE Symposium on Security and Privacy (S&P), pp. 48–62 (2012)
Jiang, L., Misherghi, G., Su, Z., Glondu, S.: Deckard: Scalable and accurate tree-based detection of code clones. In: ICSE, pp. 96–105 (2007)
Jiang, L., Su, Z.: Automatic mining of functionally equivalent code fragments via random testing. In: ISSTA, pp. 81–92 (2009)
Kim, J., Lee, S., Hwang, S.-W., Kim, S.: Towards an intelligent code search engine. In: AAAI (2010)
Komondoor, R., Horwitz, S.: Tool demonstration: finding duplicated code using program dependences. In: ESOP, pp. 383–386 (2001)
Lattner, C., Lenharth, A., Adve, V.S.: Making context-sensitive points-to analysis with heap cloning practical for the real world. In: PLDI, pp. 278–289 (2007)
Lee, M.-W., Roh, J.-W., won Hwang, S., Kim, S.: Instant code clone search. In: SIGSOFT FSE, pp. 167–176 (2010)
Li, J., Ernst, M.D.: Cbcd: Cloned buggy code detector. In: ICSE, pp. 310–320 (2012)
Liu, C., Chen, C., Han, J., Yu, P.S.: GPLAG: Detection of software plagiarism by program dependence graph analysis. In: KDD, pp. 872–881 (2006)
Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)
McMillan, C., Grechanik, M., Poshyvanyk, D., Xie, Q., Fu, C.: Portfolio: finding relevant functions and their usage. In: ICSE, pp. 111–120 (2011)
Meng, N., Kim, M., McKinley, K.S.: Sydit: creating and applying a program transformation from an example. In: SIGSOFT FSE, pp. 440–443 (2011)
Meng, N., Kim, M., McKinley, K.S.: Systematic editing: generating program transformations from an example. In: PLDI, pp. 329–342 (2011)
Meng, N., Kim, M., McKinley, K.S.: Locating and applying systematic edits by learning from examples. In: ICSE (2013)
Milner, R.: A theory of type polymorphism in programming. J. Comput. Syst. Sci. 17, 348–375 (1978)
Nguyen, T.T., Nguyen, H.A., Pham, N.H., Al-Kofahi, J.M., Nguyen, T.N.: Graph-based mining of multiple object usage patterns. In: ESEC/SIGSOFT FSE, pp. 383–392 (2009)
O’Callahan, R., Jackson, D.: Lackwit: a program understanding tool based on type inference. In: ICSE, pp. 338–348 (1997)
Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Program. 74(7), 470–495 (2009)
Sun, B., Podgurski, A., Ray, S.: Improving the precision of dependence-based defect mining by supervised learning of rule and violation graphs. In: ISSRE, pp. 1–10 (2010)
Thummalapenta, S., Xie, T.: Parseweb: a programmer assistant for reusing open source code on the web. In: ASE, pp. 204–213 (2007)
Tian, Y., Lo, D., Lawall, J.L.: Automated construction of a software-specific word similarity database. In: CSMR-WCRE, pp. 44–53 (2014)
Wang, S., Lo, D., Jiang, L.: Code search via topic-enriched dependence graph matching. In: WCRE, pp. 119–123 (2011)
Wang, S., Lo, D., Jiang, L.: Inferring semantically related software terms and their taxonomy by leveraging collaborative tagging. In Proceedings of the 2012 IEEE International Conference on Software Maintenance (ICSM), ICSM ’12, pp. 604–607 (2012)
Wang, S., Lo, D., Jiang, L.: Understanding widespread changes: a taxonomic study. In: CSMR (2013)
Wang, S., Lo, D., Xing, Z., Jiang, L.: Concern localization using information retrieval: an empirical study on linux kernel. In: WCRE, pp. 92–96 (2011)
Wang, X., Lo, D., Cheng, J., Zhang, L., Mei, H., Yu, J.X.: Matching dependence-related queries in the system dependence graph. In: ASE, pp. 457–466 (2010)
Wilcoxon, F.: Individual comparisons by ranking methods. Biom. Bull. 1(6), 80–83 (1945)
Yan, X., Han, J.: gspan: Graph-based substructure pattern mining. In ICDM, pp. 721–724 (2002)
Yan, X., Han, J.: Closegraph: mining closed frequent graph patterns. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. KDD ’03, pp. 286–295. NY, USA, ACM, New York (2003)
Zhu, F., Qu, Q., Lo, D., Yan, X., Han, J., Yu, P.S.: Mining top-k large structural patterns in a massive network. PVLDB 4(11), 807–818 (2011)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Wang, S., Lo, D. & Jiang, L. AutoQuery: automatic construction of dependency queries for code search. Autom Softw Eng 23, 393–425 (2016). https://doi.org/10.1007/s10515-014-0170-2
Received:
Accepted:
Published:
Issue date:
DOI: https://doi.org/10.1007/s10515-014-0170-2