Discovering semantically related technical terms and web resources in Q&A discussions

  • Research Articles
Frontiers of Information Technology & Electronic Engineering

Abstract

A vast and growing number of techniques and web resources are available for software engineering practice. Discovering semantically similar or related technical terms and web resources makes it possible to design services that facilitate information retrieval and information discovery. In this study, we extract technical terms and web resources from community question and answer (Q&A) discussions and propose an approach, based on a neural language model, that learns semantic representations of technical terms and web resources in a joint low-dimensional vector space. Our approach maps each technical term or web resource to the semantic vector space using only the technical terms and web resources that surround it in a discussion thread, without mining the text content of the discussion. We apply our approach to the Stack Overflow data dump of March 2018. Through quantitative and qualitative analyses of clustering, search, and semantic reasoning tasks, we show that the learnt vector representations capture the semantic relatedness of technical terms and web resources, and that they can support various search and semantic reasoning tasks through simple K-nearest neighbor search and simple algebraic operations in the embedding space.
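To make the ideas in the abstract concrete, the following Python sketch illustrates three of them: collecting the surrounding terms and web resources of each item in a thread, K-nearest neighbor search by cosine similarity, and reasoning by vector arithmetic. It is only an illustration under stated assumptions, not the system described in the article. The function names, the window size, the skip-gram-style pairing, and the tiny hand-picked three-dimensional vectors are all hypothetical; real representations would be learnt from the Stack Overflow data.

```python
import math

def context_pairs(thread, window=2):
    """Pair each item with its neighbors in the thread (assumed skip-gram-style).

    `thread` is the ordered list of technical terms and URLs extracted from one
    discussion; the surrounding free text is not used at all.
    """
    pairs = []
    for i, target in enumerate(thread):
        lo = max(0, i - window)
        hi = min(len(thread), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, thread[j]))
    return pairs

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k=3):
    """Return the k items most similar to `query`, best first."""
    scored = [
        (name, cosine(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

def analogy(a, b, c, embeddings):
    """Answer "a is to b as c is to ?" with the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [n for n in embeddings if n not in (a, b, c)]
    return max(candidates, key=lambda n: cosine(target, embeddings[n]))

# Hypothetical toy vectors in a joint space of terms and web resources.
EMBEDDINGS = {
    "java": [1.0, 0.0, 0.2],
    "python": [0.0, 1.0, 0.2],
    "junit": [1.0, 0.0, 1.0],
    "pytest": [0.0, 1.0, 1.0],
    "docs.oracle.com/javase": [0.9, 0.1, 0.1],
    "docs.python.org": [0.1, 0.9, 0.1],
}

if __name__ == "__main__":
    print(context_pairs(["java", "junit", "docs.oracle.com/javase"], window=1))
    print(nearest_neighbors("java", EMBEDDINGS, k=2))
    print(analogy("java", "junit", "python", EMBEDDINGS))  # pytest
```

Because terms and web resources share one vector space in this sketch, a query for a term can return a URL as its nearest neighbor and vice versa, which is the property the joint embedding is meant to provide.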


Author information

Contributions

Guoqiang LI designed the research. Junfang JIA and Valeriia TUMANIAN processed the data and programmed the system. Guoqiang LI drafted the manuscript. Junfang JIA and Valeriia TUMANIAN helped organize the manuscript. Guoqiang LI and Valeriia TUMANIAN revised and finalized the paper.

Corresponding author

Correspondence to Guoqiang Li (李国强).

Additional information

Compliance with ethics guidelines

Junfang JIA, Valeriia TUMANIAN, and Guoqiang LI declare that they have no conflict of interest.

Project supported by the National Natural Science Foundation of China (No. 61872232).

About this article

Cite this article

Jia, J., Tumanian, V. & Li, G. Discovering semantically related technical terms and web resources in Q&A discussions. Front Inform Technol Electron Eng 22, 969–985 (2021). https://doi.org/10.1631/FITEE.2000186

  • DOI: https://doi.org/10.1631/FITEE.2000186
