Discovering semantically related technical terms and web resources in Q&A discussions

Jia, Junfang; Tumanian, Valeriia; Li, Guoqiang

doi:10.1631/FITEE.2000186

Research Articles
Published: 28 July 2021

Volume 22, pages 969–985 (2021)

Frontiers of Information Technology & Electronic Engineering

Abstract

A vast and growing number of techniques and web resources are available for software engineering practice. Discovering semantically similar or related technical terms and web resources makes it possible to design services that facilitate information retrieval and information discovery. In this study, we extract technical terms and web resources from community question and answer (Q&A) discussions and propose an approach based on a neural language model to learn the semantic representations of technical terms and web resources in a joint low-dimensional vector space. Our approach maps each technical term or web resource to a semantic vector space based only on the technical terms and web resources that surround it in a discussion thread, without the need to mine the text content of the discussion. We apply our approach to the Stack Overflow data dump of March 2018. Through quantitative and qualitative analyses of clustering, search, and semantic reasoning tasks, we show that the learnt vector representations capture the semantic relatedness of technical terms and web resources, and that they can support various search and semantic reasoning tasks by means of simple K-nearest neighbor search and simple algebraic operations in the embedding space.
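The two operations named in the abstract, K-nearest neighbor search and algebraic operations on vectors, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-made toy values rather than embeddings learnt from Stack Overflow, the function names (`context_pairs`, `nearest`, `analogy`) are hypothetical, and the windowed context pairs follow the common skip-gram convention, which the abstract does not confirm as the authors' exact training setup.

```python
import math

# Hypothetical, hand-made 3-d vectors for illustration only.
# They are NOT the embeddings learnt in the paper.
EMBEDDINGS = {
    "java": [0.9, 0.1, 0.0],
    "junit": [0.8, 0.1, 0.6],
    "python": [0.1, 0.9, 0.0],
    "pytest": [0.0, 0.9, 0.6],
    "https://docs.python.org": [0.2, 0.8, 0.1],
}


def context_pairs(thread, window=2):
    """Yield (target, context) pairs from one thread's item sequence.

    `thread` is the ordered list of technical terms and URLs found in a
    discussion thread; the surrounding text is ignored. The window size
    is an assumed hyperparameter.
    """
    for i, target in enumerate(thread):
        lo = max(0, i - window)
        hi = min(len(thread), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, thread[j]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, k=1, exclude=()):
    """Return the k item names whose vectors are most similar to query_vec."""
    scored = [
        (cosine(query_vec, vec), name)
        for name, vec in EMBEDDINGS.items()
        if name not in exclude
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


def analogy(a, b, c, k=1):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    return nearest(target, k, exclude={a, b, c})


if __name__ == "__main__":
    # Training-style pairs built only from items that co-occur in a thread.
    print(list(context_pairs(["java", "junit", "maven"], window=1)))
    # Search: items closest to "python" in the joint term/resource space.
    print(nearest(EMBEDDINGS["python"], k=2, exclude={"python"}))
    # Semantic reasoning: java is to junit as python is to ?
    print(analogy("java", "junit", "python"))
```

Because terms and URLs share one vector space in this sketch, a single `nearest` call can return a web resource for a term query, which is the kind of cross-type lookup a joint embedding is meant to allow.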

Author information

Authors and Affiliations

School of Computer and Network Engineering, Shanxi Datong University, Datong, 037009, China
Junfang Jia (贾俊芳)
School of Software, Shanghai Jiao Tong University, Shanghai, 200240, China
Valeriia Tumanian & Guoqiang Li (李国强)

Contributions

Guoqiang LI designed the research. Junfang JIA and Valeriia TUMANIAN processed the data and programmed the system. Guoqiang LI drafted the manuscript. Junfang JIA and Valeriia TUMANIAN helped organize the manuscript. Guoqiang LI and Valeriia TUMANIAN revised and finalized the paper.

Corresponding author

Correspondence to Guoqiang Li (李国强).

Additional information

Compliance with ethics guidelines

Junfang JIA, Valeriia TUMANIAN, and Guoqiang LI declare that they have no conflict of interest.

Project supported by the National Natural Science Foundation of China (No. 61872232).

About this article

Cite this article

Jia, J., Tumanian, V. & Li, G. Discovering semantically related technical terms and web resources in Q&A discussions. Front Inform Technol Electron Eng 22, 969–985 (2021). https://doi.org/10.1631/FITEE.2000186

Received: 21 April 2020
Accepted: 23 December 2020
Published: 28 July 2021
Issue Date: July 2021
DOI: https://doi.org/10.1631/FITEE.2000186

CLC number

TP311
