Semantic-Enriched Code Knowledge Graph to Reveal Unknowns in Smart Contract Code Reuse

Published: 30 September 2023

Abstract

Programmers developing smart contracts often struggle to reuse code from repositories because of two unknowns that can lead to functional and non-functional failures: implicit collaborations between functions and subtle differences among similar functions. Current code mining methods can extract syntactic and semantic knowledge (the known knowledge), but they cannot uncover these unknowns because of the significant gap between the known and the unknown. To address this issue, we formulate knowledge acquisition as a knowledge deduction task and propose an analytic flow that uses function clones as a bridge to gradually deduce, from the known knowledge, the problem-solving knowledge that can reveal the unknowns. The flow comprises five methods: clone detection, co-occurrence probability calculation, function usage frequency accumulation, description propagation, and control flow graph annotation. Together they provide a systematic and coherent approach to knowledge deduction. We then structure all of this knowledge into a semantic-enriched code Knowledge Graph (KG) and integrate the KG into two software engineering tasks: code recommendation and crowd-scaled coding practice checking. As a proof of concept, we apply our approach to 5,140 smart contract files available on Etherscan.io and confirm the high accuracy of our KG construction steps. In our experiments, the code KG improved code recommendation accuracy by 6% to 45%, increased diversity by 61% to 102%, and enhanced NDCG by 1% to 21%. Furthermore, for novice developers identifying and fixing vulnerable smart contract functions, the KG saved 30 to 380 seconds and improved vulnerability determination accuracy by 20% to 33% and vulnerability fixing accuracy by 24% to 40%, compared with traditional analysis tools and the debugging-with-the-crowd method.
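
The abstract names co-occurrence probability calculation as one step of the analytic flow but does not define it. As a rough illustration only, the sketch below assumes one plausible reading: for two function clone groups f and g, estimate P(g | f) as the share of contracts containing f that also contain g. The function name, the input layout, and the toy contracts are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import permutations


def co_occurrence_probabilities(contracts):
    """Estimate P(g | f) for every ordered pair of clone groups (f, g).

    `contracts` maps a contract name to the set of clone-group ids of the
    functions it defines. This input format is an assumption; clone
    detection is presumed to have run upstream.
    """
    single = Counter()  # number of contracts containing each clone group
    pair = Counter()    # number of contracts containing both f and g
    for groups in contracts.values():
        groups = set(groups)
        single.update(groups)
        pair.update(permutations(sorted(groups), 2))
    # Conditional probability: contracts with both / contracts with f.
    return {(f, g): n / single[f] for (f, g), n in pair.items()}


# Toy data (hypothetical): three contracts and their function clone groups.
contracts = {
    "TokenA": {"transfer", "approve", "transferFrom"},
    "TokenB": {"transfer", "approve"},
    "Wallet": {"transfer", "withdraw"},
}
probs = co_occurrence_probabilities(contracts)
```

Under this reading, a high P(g | f) would hint at an implicit collaboration: a developer who reuses f without g may be missing a function that usually accompanies it. Pairs that never appear in the same contract are simply absent from the result.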

Published in

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 6
November 2023
949 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3625557
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 30 September 2023
• Online AM: 22 May 2023
• Accepted: 27 April 2023
• Revised: 19 April 2023
• Received: 30 July 2022