Entity disambiguation with context awareness in user-generated short texts
Introduction
Conceptualization is the mapping of entities under different contexts to their most appropriate concepts. It plays an important role in understanding human knowledge, as humans understand the world by classifying objects at appropriate conceptual levels. Conceptualization benefits many natural language processing tasks, such as query understanding (Wang et al., 2017, Wang et al., 2015), text mining (Hua et al., 2015, Song et al., 2015, Kim et al., 2013, Yu et al., 2016, Aloulou et al., 2015, Cheng et al., 2015, Bekkali and Lachkar, 2019) and sentiment analysis (Cambria, Song, Wang, & Howard, 2014). At the same time, conceptualization faces extensive challenges, one of which is the ubiquity of polysemy. Taking “Harry Potter” as an example, it usually refers to “a book”, but it can also refer to “a movie” or “a character”. Better conceptualization therefore requires resolving the ambiguity of such entities. Sometimes even humans cannot disambiguate them, owing to a lack of knowledge or a misunderstanding, which illustrates the enormous challenge the task poses to machines. In this paper, we focus on entity disambiguation in user-generated short texts (UGSTs), which could create tremendous value. Our previous works studied the role of these short texts in user identification (Li et al., 2018, Li et al., 2019, Li et al., 2018, Li et al., 2017). However, unlike traditional documents, which contain a large number of words, UGSTs are short, unstructured, and colloquial. Disambiguation is therefore far more challenging, owing to the shortness and sparsity of UGSTs.
The essence of entity disambiguation is that an entity is likely to have multiple meanings; that is, the meanings expressed in different contexts may differ. In many cases, entity disambiguation is performed and evaluated by linking entity mentions in a given text to the corresponding entities in a knowledge base. However, such measures have several limitations. First, the query information contained in the given text is often limited, so public resources are often used to mine additional information and improve the quality of disambiguation. Such methods can accomplish entity disambiguation excellently, but they are better suited to server-side applications, because they spend considerable time on online information collection. Second, for conceptualization, the disambiguation result should include the most appropriate concepts of entities in the current context. Thus, many works focus on entity disambiguation at the concept level, mapping an entity to its most appropriate concept based on the context of the target entity. Existing methods mainly rely on similar or related entities in the context for disambiguation: they select entities that are related to each other and then use these entities as a guide. Such methods are reasonable and work well when enough context is available. However, an inevitable challenge they face is that the number of entities contained in a UGST is limited. Consider the example “this apple is very delicious”; there is only one entity (marked with an underline). We can conceptualize the entity “apple” to concepts such as “fruit” and “company”, but without sufficient context we cannot determine which meaning the target entity carries. Another line of research uses the entire given text for disambiguation.
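To make the concept-level setting concrete, the following sketch shows how related context entities can guide the choice among a target entity's candidate concepts. The concept probability table is entirely illustrative (not drawn from a real knowledge base), and `disambiguate` is a hypothetical helper, not the method proposed in this paper:

```python
# Toy sketch of concept-level disambiguation from related context entities.
# The per-entity concept probabilities below are made up for illustration.
CONCEPTS = {
    "apple":     {"fruit": 0.6, "company": 0.4},
    "banana":    {"fruit": 0.9, "company": 0.1},
    "microsoft": {"company": 1.0},
}

def disambiguate(target, context_entities):
    """Score each candidate concept of `target` by how strongly the
    context entities also evoke that concept, then return the best one."""
    scores = {}
    for concept, prior in CONCEPTS[target].items():
        support = sum(CONCEPTS.get(e, {}).get(concept, 0.0)
                      for e in context_entities)
        scores[concept] = prior * (1.0 + support)
    return max(scores, key=scores.get)

print(disambiguate("apple", ["banana"]))     # -> fruit
print(disambiguate("apple", ["microsoft"]))  # -> company
```

Note that when the context contains no other entity, as in “this apple is very delicious”, the score reduces to the prior alone, which is exactly the failure mode discussed above.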
Some works use statistical models to obtain the topic of the given short text, which then serves as a guide for determining the most appropriate concept of the target entity (Cheng et al., 2015, Kim et al., 2013). However, UGSTs weaken these methods considerably, as the shortness and sparsity of the data make it difficult to build an effective statistical model.
To solve this problem, scholars have found that verbs and adjectives are also helpful for disambiguation (Wang et al., 2015, Hua et al., 2017). They construct a massive semantic network that maps non-entity terms (verbs and adjectives) to concepts, and then select the most appropriate concept for the target entity according to the correlation between concepts. Take the same short text “this apple is very delicious” as an example: the adjective “delicious” is mapped to the concept “food”, so we know that “apple” refers to a kind of fruit. However, despite the improved performance, the major limitation lies in the excessive computational cost induced by the construction of the network. Moreover, such methods generally include three steps, namely, text segmentation, term type checking and disambiguation, and they are easily affected by errors propagated from text segmentation and type checking. For example, in the UGST “watch Harry Potter”, “watch” can be either an entity or a verb, but if it is marked as an entity, we cannot know exactly what “Harry Potter” means. In addition, the characteristics of UGSTs introduce new problems. For example, in an extreme case, if we observe only two contextual terms, “Microsoft” and “delicious”, for the target entity “apple”, the clues provided by the most related terms are not necessarily correct for disambiguation.
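A minimal sketch of this non-entity-term strategy is given below, in the spirit of Wang et al. (2015). The term-to-concept mapping and the concept relatedness scores are invented for the example, and `concept_for` is a hypothetical helper rather than the cited system:

```python
# Illustrative sketch: map verbs/adjectives to concepts, then pick the
# entity concept most related to those evoked concepts. All mappings
# and relatedness scores below are assumptions for the example.
ENTITY_CONCEPTS = {"apple": ["fruit", "company"]}
TERM_CONCEPTS = {"delicious": "food", "acquire": "business"}
RELATEDNESS = {("food", "fruit"): 0.9, ("food", "company"): 0.1,
               ("business", "company"): 0.8, ("business", "fruit"): 0.1}

def concept_for(entity, non_entity_terms):
    """Pick the candidate concept most related to the concepts evoked
    by the surrounding verbs and adjectives."""
    best, best_score = None, -1.0
    for cand in ENTITY_CONCEPTS[entity]:
        score = sum(RELATEDNESS.get((TERM_CONCEPTS[t], cand), 0.0)
                    for t in non_entity_terms if t in TERM_CONCEPTS)
        if score > best_score:
            best, best_score = cand, score
    return best

print(concept_for("apple", ["delicious"]))  # -> fruit
```

The sketch also exposes the pipeline's fragility: if segmentation or type checking mislabels a term, the wrong concepts (or none) enter the scoring step.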
In this paper, we propose an entity disambiguation method with context awareness (IDwCA), which utilizes various types of contextual terms to eliminate ambiguity rather than relying solely on entities. First, unlike traditional methods that preserve only the terms with the longest cover and classify terms simply into entity and concept, IDwCA detects all terms by using inter-term semantic information and distinguishes term types according to term frequency and part-of-speech (POS). Second, we investigate the measurement of semantic correlation between various types of terms, and set a dynamic threshold to filter out uninformative contextual terms and avoid noise interference. Then, a priority is assigned to each informative term to highlight its distinctiveness. Finally, we apply the informative terms to entity disambiguation and aggregate the probabilities of each concept of the target entity to find the most appropriate concept. The contributions of our work are summarized as follows:
- We propose a framework for entity disambiguation. Specifically, it first detects terms using semantic information; second, it distinguishes term types using both POS information and term frequency; third, it chooses informative contextual terms according to the correlation scores between contextual terms and the concepts of the target entity, and assigns priorities to these terms according to contextual distance to avoid noise interference; finally, it eliminates ambiguity through these various types of contextual terms.
- We investigate the semantic correlation between various types of terms using both word embeddings and knowledge information to find informative terms for the target entity within a specific context. The experimental results on standard datasets demonstrate that the proposed correlation calculation method is feasible.
- The priorities of contextual terms in different locations are identified for disambiguation. We prioritize these terms to highlight their discriminative power: the closer a contextual term is to the target entity, the higher its priority.
- The experimental results on ground-truth datasets demonstrate that the proposed framework for disambiguation is effective.
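The filtering, priority, and aggregation steps above can be sketched as follows. The exponential distance decay, the threshold value, and the `aggregate` helper are illustrative assumptions; the paper's actual scoring functions are defined in Section 4:

```python
import math

# Minimal sketch of the scoring stage: filter contextual terms by a
# correlation threshold, weight the survivors by distance to the target
# entity (closer terms get higher priority), and aggregate per-concept
# evidence. Correlations, distances, and votes are illustrative.

def aggregate(candidates, context, threshold=0.3, decay=0.5):
    """candidates: {concept: prior probability}; context: list of
    (correlation, distance, {concept: vote}) tuples, one per term."""
    scores = dict(candidates)
    for corr, dist, votes in context:
        if corr < threshold:                # drop uninformative terms
            continue
        priority = math.exp(-decay * dist)  # closer => higher priority
        for concept, vote in votes.items():
            if concept in scores:
                scores[concept] += corr * priority * vote
    return max(scores, key=scores.get)

context = [
    (0.8, 1, {"fruit": 1.0}),    # e.g. "delicious", adjacent to the entity
    (0.2, 4, {"company": 1.0}),  # weakly correlated term, filtered out
]
print(aggregate({"fruit": 0.5, "company": 0.5}, context))  # -> fruit
```

The dynamic threshold handles the “Microsoft” vs. “delicious” conflict noted earlier: a weakly correlated or distant term contributes little or nothing to the final concept scores.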
The rest of the paper is organized as follows. Section 2 surveys the related works. Section 3 introduces background knowledge. Section 4 details our entity disambiguation method. Section 5 presents an evaluation of our method. Finally, we conclude our work in Section 6.
Related works
Many approaches have been proposed for entity disambiguation, which can be classified into two categories: (1) entity disambiguation based on linked data; and (2) entity disambiguation at a concept-level. More details are as follows.
Preliminary knowledge
In this section, we first introduce some notations and definitions used in this paper, and the notations are shown in Table 1. Then, we review several key techniques to make it easier to understand the proposed approach.
Entity disambiguation with context awareness
In this section, we first present the formulation of the entity disambiguation problem. Then we describe the proposed method.
Experimental setup
In our method, we disambiguate the entities by using various types of contextual terms to obtain the most appropriate concept. As mentioned above, existing works for entity disambiguation can be roughly divided into two categories. One focuses on entity linking in which the target entity is linked to an entity in a knowledge base. The other one focuses on disambiguation at a concept-level. Thus we compare our method with the linking-based methods and concept-based methods.
Compared methods: In
Conclusions and future work
Entity conceptualization has received considerable attention from academia and industry, which can benefit many natural language processing applications, such as question–answering systems, short text classification, and sentiment analysis. In existing works, disambiguation is mainly based on similar or related entities in context and topic of the given text. However, the unique characteristics of UGSTs make these methods quite weak. In this paper, we focused on using more types of contextual
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Jiaqi Yang: Conceptualization, Data curation, Writing - original draft, Writing - review & editing, Investigation. Yongjun Li: Methodology, Writing - original draft, Writing - review & editing, Supervision. Congjie Gao: Writing - original draft, Visualization, Investigation. Wei Dong: Writing - original draft, Supervision.
Acknowledgment
This work was partly supported by the Natural Science Basic Research Plan in Shaanxi Province of China (2018JM6063).
References (38)
- et al. (2015). Uncertainty handling in semantic reasoning for accurate context understanding. Knowledge-Based Systems.
- et al. (2018). A deep dive into user display names across social networks. Information Sciences.
- et al. (2018). Matching user accounts based on user generated content across social networks. Future Generation Computer Systems.
- Akkaya, C., Wiebe, J., & Mihalcea, R. (2012). Utilizing semantic composition in distributional semantic models for word...
- et al. Weps3 evaluation campaign: Overview of the on-line reputation management task.
- et al. A simple but tough-to-beat baseline for sentence embeddings.
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In...
- et al. (2019). An effective short text conceptualization based on new short text similarity. Social Network Analysis and Mining.
- et al. (2003). A neural probabilistic language model. Journal of Machine Learning Research.
- et al. (2014). Semantic multidimensional scaling for open-domain sentiment analysis. IEEE Intelligent Systems.
- Fast and accurate annotation of short texts with wikipedia pages. IEEE Software.
- Short text understanding through lexical-semantic analysis.
- Understand short texts by harvesting and analyzing semantic knowledge. IEEE Transactions on Knowledge and Data Engineering.
- Leveraging conceptualization for short-text embedding. IEEE Transactions on Knowledge and Data Engineering.
- Context-dependent conceptualization.