Identifying comparable entities with indirectly associative relations and word embeddings from web search logs
Introduction
The comparison of products or services plays a significant role in consumers' purchasing decisions, for which they often resort to webpages, online reviews, and/or social media to obtain information regarding comparable entities. However, due to bounded cognitive ability [22] and information overload [29], consumers cannot effectively access the entire set of comparable entities. More importantly, such a comparison typically requires a high level of domain knowledge on the part of consumers. Thus, comparable entity identification, which aims at identifying a comprehensive and accurate set of entities that are comparable to a specified entity, is deemed desirable for helping consumers identify alternative products for consideration in their decision-making process [13].
Comparable entity identification is also vital for businesses in strategic and marketing management. Typically, managers can identify comparable entities by matching similarities and differences of entity categories and characteristics in their minds [22]. Due to limited cognitive abilities, they may only be aware of a small number of comparable entities, and entities that are out of sight will not be considered [3]. Although firms sometimes utilize paid profile services such as Hoovers (www.hoovers.com) and Mergent (www.mergentonline.com) to collect information regarding comparable entities, those services are provided by professionals for specified domains; thus, they may be costly and often suffer from scalability problems [32]. Moreover, such professional-based services cannot reach consumers' minds and fail to examine comparable entities from the perspective of users.
To overcome these limitations, some recent efforts have been made to automatically identify comparable entities or mine comparative relations from online user-generated content (UGC) [1,29]. For comparable entity identification from the user perspective, extant methods mainly rest on the premise that comparable entities cooccur much more frequently in the same statements. However, this premise does not hold well for various types of UGC, such as web search logs, online product reviews, and tweets, where comparable entities appear less frequently in cooccurrence patterns [30], thereby leading to degraded performance.
To extend the premise, this study proposes a new perspective on comparable entity identification in terms of indirectly associative relation analysis. In various types of UGC, comparable entities not only appear directly in the same statements but also appear in an indirect form. The proposed indirectly associative relation analysis is a useful extension of previous efforts. It can improve consumers' and managers' exploration of various types of UGC and help them capture comparable entities that are ignored by extant methods. To study indirectly associative relations, web search logs, a typical type of UGC, are selected as the research data of this study. In web search logs, comparable entities seldom appear in cooccurrence patterns [30]. This type of UGC demands novel methods for identifying indirectly associative relations, which deviate from the traditional cooccurrence premise.
In this study, a method named ICE (identifying comparable entities) is proposed for identifying entities that are comparable to a specified entity from web search logs. Entities in ICE refer to the objects (e.g., companies, products, and persons) that users care about and then query through search engines [1,23]. Comparable entities, such as BMW and Mercedes-Benz, are entities that share a common utility and meet similar needs of consumers [29]. In ICE, the specified entity for which comparable entities must be identified is called a focal entity [1,23]. For example, Ford would be selected as the focal entity for managers in Ford Motor Company, and Ausnutria would be selected as the focal entity for consumers who want to buy milk powder. Two key issues must be addressed. First, as previously discussed, most comparable entities do not frequently appear together in the same web search logs [30]. Second, due to data noise and short queries in web search logs [6], the accuracy of the identification results depends on more than the positions at which entities cooccur, so it is necessary to incorporate an effective semantic analysis between entities into the identification process. Therefore, ICE consists of two stages: the derivation of a broad set of candidate entities that are indirectly associative with the focal entity, linked through their related aspects, and the measurement of the similarities between the candidate entities and the focal entity to target comparable entities within the obtained candidate set. Data experiments are conducted to evaluate the performance of ICE in comparison to several baseline methods.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed method, whose algorithmic details are presented in Section 4. The experimental results are provided in Section 5. Finally, the work is concluded in Section 6.
Section snippets
Related work
Since the focus of this study is on identifying comparable entities, the mainstream studies of comparable entity identification are reviewed in Section 2.1. The second stage of ICE detects comparable entities from the set of candidate entities identified in the first stage, which is relevant to comparative relation mining. Thus, similar methods of comparative relation mining are reviewed in Section 2.2.
The ICE method
In this study, a two-stage method, ICE, is proposed for identifying comparable entities based on indirectly associative relations from the perspective of consumers. The first stage is the discovery of a broad set of candidate entities based on indirectly associative relations of keywords in web search logs, and the second stage is a semantic analysis that is implemented based on keyword and document representations for the detection of comparable entities from the
The algorithm
In this section, the algorithmic details and time complexity of ICE are analyzed to show the main factors affecting its computation time.
Algorithm 1 provides the pseudo-code of ICE. In the first stage, a HashMap data structure is adopted to map the directly associative relations by traversing the web search log set Q once. Supposing there are NQ queries in total in the web search log set Q, the time complexity for generating candidate entities is O(NQ) (lines 3–5), which means the time
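The single-pass hash-map idea described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the whitespace tokenization, the `known_entities` lookup set, and the function names `build_association_maps` and `candidate_entities` are assumptions made purely for illustration. The sketch records direct entity-aspect associations in one traversal of the query log and then derives candidates that share an aspect with the focal entity, even if they never cooccur with it in a query.

```python
from collections import defaultdict


def build_association_maps(queries, known_entities):
    """Traverse the query log once and record direct entity-aspect links.

    Assumption: each query is a whitespace-separated string, and any token
    that is not a known entity is treated as an aspect keyword.
    """
    entity_to_aspects = defaultdict(set)
    aspect_to_entities = defaultdict(set)
    for query in queries:  # single pass over the log
        tokens = query.lower().split()
        entities = [t for t in tokens if t in known_entities]
        aspects = [t for t in tokens if t not in known_entities]
        for entity in entities:
            for aspect in aspects:
                entity_to_aspects[entity].add(aspect)
                aspect_to_entities[aspect].add(entity)
    return entity_to_aspects, aspect_to_entities


def candidate_entities(focal, entity_to_aspects, aspect_to_entities):
    """Collect entities linked to the focal entity through a shared aspect."""
    candidates = set()
    for aspect in entity_to_aspects.get(focal, ()):
        candidates |= aspect_to_entities[aspect]
    candidates.discard(focal)  # the focal entity is not its own candidate
    return candidates


# Hypothetical toy log: "ford" never appears in the same query as
# "toyota" or "honda", yet both are reachable through shared aspects.
toy_queries = ["ford price", "toyota price", "ford suv", "honda suv", "nike shoes"]
toy_entities = {"ford", "toyota", "honda", "nike"}
e2a, a2e = build_association_maps(toy_queries, toy_entities)
print(sorted(candidate_entities("ford", e2a, a2e)))
```

Because queries are short, the work per query is bounded by a small constant, so building both maps stays linear in the number of queries, in line with the O(NQ) bound stated above. The resulting candidate set is deliberately broad; the second stage of the method is what narrows it down.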
Data experiments
This section aims to test whether ICE can identify accurate comparable entities and effectively cover the entities that users might compare. In addition, the effectiveness of ICE in comparable entity ranking is also validated.
Conclusions
Comparable entity identification is desirable for both consumers and managers in their decision-making processes. In this paper, a novel two-stage ICE method for comparable entity identification that is based on web search logs has been proposed from the perspective of consumers. In the first stage, a candidate entity generation process has been designed based on the indirectly associative relation of comparable entities that are linked by their shared related aspect information. Furthermore,
Acknowledgements
The work was partly supported by the National Natural Science Foundation of China (71772177, 72072177), the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities (17JJD630006), and the joint PhD scholarship of Renmin Business School.
Liye Wang is currently pursuing her PhD degree in the Department of Management Science and Engineering, School of Business, Renmin University of China. Her research interests include competitive intelligence, e-commerce, and text mining. Her work has been published in Frontiers of Business Research in China.
References (33)
- et al. Decision support from financial disclosures with deep neural networks and transfer learning. Decis. Support. Syst. (2017)
- et al. Assessing product competitive advantages from the perspective of customers by mining user-generated content on social media. Decis. Support. Syst. (2019)
- et al. Ranking of high-value social audiences on Twitter. Decis. Support. Syst. (2016)
- et al. Mining competitor relationships from online news: a network-based approach. Electron. Commer. Res. Appl. (2011)
- et al. Finding competitive keywords from query logs to enhance search engine advertising. Inf. Manag. (2017)
- et al. Mining comparative opinions from customer reviews for competitive intelligence. Decis. Support. Syst. (2011)
- et al. Competitor mining with the web. IEEE Trans. Knowl. Data Eng. (2008)
- et al. A novel method for identifying competitors using a financial transaction network. IEEE Trans. Eng. Manag. (2019)
- et al. Managerial identification of competitors. J. Mark. (1999)
- et al. Positioning and presenting design science research for maximum impact. MIS Q. (2013)
- How do they compare? Automatic identification of comparable entities on the web
- Learning open-domain comparable entity graphs from user search queries
- Mining comparative sentences and relations
- Temporal diversity in recommender systems
- Distributed representations of sentences and documents
- Automated marketing research using online customer reviews. J. Mark. Res.
Jin Zhang is an associate professor in the Department of Management Science and Engineering, School of Business, Renmin University of China. He received his PhD degree in the Department of Management Science and Engineering from the School of Economics and Management at Tsinghua University. His current research interests include data mining, business intelligence, and web search. His work has been published in journals such as MIS Quarterly, INFORMS Journal on Computing, Decision Support Systems, Information & Management, and IEEE Transactions on Neural Networks and Learning Systems.
Guoqing Chen received his PhD from the Catholic University of Leuven (K.U. Leuven, Belgium) and is now Professor of Information Systems at the School of Economics and Management, Tsinghua University, Beijing, China. His research interests include information systems management, business analytics, and decision support systems. His work has been published in journals such as MIS Quarterly, Journal of Management Information Systems, Journal of the Association for Information Systems, Decision Sciences, INFORMS Journal on Computing, Decision Support Systems, and ACM Transactions on Knowledge Discovery from Data.
Dandan Qiao is an assistant professor in the Department of Information Systems and Analytics at the National University of Singapore (NUS). Prior to joining NUS, she earned her PhD in the Department of Information Systems from Tsinghua University. Her current research interests lie at the intersection of information systems, behavioural science, and data mining. Her work has been published in journals such as MIS Quarterly, Information Systems Research, Information & Management, and ACM Transactions on Knowledge Discovery from Data.