Entity disambiguation with context awareness in user-generated short texts

https://doi.org/10.1016/j.eswa.2020.113652

Highlights

  • A framework for instance disambiguation with context awareness is proposed.

  • A method of correlation calculation based on corpus and knowledge is presented.

  • The priorities of contextual terms in different locations are identified.

  • The experiments for disambiguation based on ground-truth datasets are conducted.

Abstract

Conceptualization obtains the most appropriate concepts for noun terms (entities) under different contexts, and it plays an important role in human knowledge understanding. However, in natural language, entities are often ambiguous, which makes conceptualization difficult. To conceptualize accurately, we must eliminate the ambiguity of entities. Existing methods mainly rely on similar or related entities in context for disambiguation. However, due to the sparsity of user-generated short texts, the number of entities that can be extracted from them is limited. In this paper, we propose an entity disambiguation method that consists of three steps. (1) Measuring the correlation between terms, which uses both corpus and knowledge information to capture the specific semantic relationship. (2) Selecting informative terms, which considers various types of contextual terms, not just entities, thereby mitigating the effects of text sparsity. (3) Prioritizing informative terms to highlight their discriminative power, which reduces noise interference. Finally, the target entity is disambiguated based on the informative terms. Experimental results on ground-truth datasets demonstrate that the proposed method outperforms baseline methods.

Introduction

Conceptualization maps entities in different contexts to their most appropriate concepts. It plays an important role in understanding human knowledge, as humans understand the world by classifying objects into appropriate conceptual levels. Conceptualization benefits many natural language processing tasks, such as query understanding (Wang et al., 2017, Wang et al., 2015), text mining (Hua et al., 2015, Song et al., 2015, Kim et al., 2013, Yu et al., 2016, Aloulou et al., 2015, Cheng et al., 2015, Bekkali and Lachkar, 2019) and sentiment analysis (Cambria, Song, Wang, & Howard, 2014). At the same time, conceptualization faces many challenges, one of which is ubiquitous polysemy. Take “Harry Potter” as an example: it usually refers to “a book”, but it can also refer to “a movie” or “a character”. To conceptualize well, it is essential to resolve the ambiguity of such entities. Sometimes even humans cannot resolve the ambiguity because of a lack of knowledge or a misunderstanding, which shows how challenging the task is for machines. In this paper, we focus on entity disambiguation in user-generated short texts (UGSTs), which could create tremendous value. Our previous works studied the role of these short texts in user identification (Li et al., 2018, Li et al., 2019, Li et al., 2018, Li et al., 2017). However, unlike traditional documents, which contain a large number of words, UGSTs are short, unstructured and colloquial. Disambiguation is therefore far more challenging because of the shortness and sparsity of UGSTs.

The essence of entity disambiguation is that an entity is likely to have multiple meanings, that is, the meanings expressed in different contexts may differ. In many cases, entity disambiguation is performed and evaluated by linking entity mentions in a given text to the corresponding entities in a knowledge base. However, such approaches have several limitations. First, the query information contained in the given text is often limited, so public resources are often used to mine more information and improve the quality of disambiguation. Such methods perform entity disambiguation well, but they are better suited to server-side applications because they spend a lot of time collecting information online. Second, for conceptualization, the disambiguation result should include the most appropriate concepts of the entities in the current context. Thus, many works focus on entity disambiguation at the concept level, mapping an entity to its most appropriate concept based on the context of the target entity. Existing methods mainly rely on similar or related entities in context for disambiguation. Such approaches select entities that are related to each other and then use these entities as a guide. These methods are reasonable and work well when enough context is available. However, an inevitable challenge for existing methods is that the number of entities contained in a UGST is limited. Consider the example “this apple is very delicious”; it contains only one entity, “apple”. We can conceptualize the entity “apple” to concepts such as “fruit” and “company”. However, because there are no other entities in the context, entity-based methods cannot distinguish the meaning of the target entity. Another line of research uses the entire given text for disambiguation. Some works use statistical models to obtain the topic of the given short text, which serves as a guide for determining the most appropriate concept of the target entity (Cheng et al., 2015, Kim et al., 2013). However, these methods are weak on UGSTs, as it may not be easy to build an effective statistical model because of the shortness and sparsity of these data.

To solve this problem, scholars have found that verbs and adjectives are also helpful for disambiguation (Wang et al., 2015, Hua et al., 2017). They constructed a massive semantic network that maps non-entity terms (verbs and adjectives) to concepts and then, according to the correlation between concepts, selects the most appropriate concept for the target entity. Take the same short text “this apple is very delicious” as an example; the adjective “delicious” is mapped to the concept “food”. Therefore, we know that “apple” refers to “a kind of fruit”. However, despite the improved performance, the major limitation lies in the excessive computational cost of constructing the network. Moreover, such methods generally include three steps, namely, text segmentation, term type checking and disambiguation, and they are easily affected by errors propagated from text segmentation and type checking. For example, in the UGST “watch Harry Potter”, “watch” can be either an entity or a verb, but if it is marked as an entity, we cannot know exactly what “Harry Potter” means. In addition, the characteristics of UGSTs raise some new problems. For example, in an extreme case, if we observe only two contextual terms, “Microsoft” and “delicious”, for the target entity “apple”, the clues provided by the most related terms are not necessarily correct for disambiguation.
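The two “apple” examples above can be made concrete with a small sketch. It is only an illustration: the candidate concepts, the correlation scores and the helper name `best_concept_by_top_term` are invented here and are not taken from the paper or from any knowledge base. The sketch picks the concept supported by the single most related contextual term, which works for “delicious” alone but follows “Microsoft” blindly when both terms are present.

```python
# Toy illustration only: the concept lists and scores below are invented
# for this sketch and do not come from the paper or any knowledge base.
CANDIDATE_CONCEPTS = {"apple": ["fruit", "company"]}

# Hypothetical term-to-concept correlation scores in [0, 1].
TERM_CONCEPT_SCORE = {
    ("delicious", "fruit"): 0.8,
    ("delicious", "company"): 0.1,
    ("microsoft", "fruit"): 0.05,
    ("microsoft", "company"): 0.9,
}


def best_concept_by_top_term(entity, context_terms):
    """Return the concept backed by the single most related contextual term."""
    best = None  # (score, concept) of the strongest term-concept pair so far
    for term in context_terms:
        for concept in CANDIDATE_CONCEPTS[entity]:
            score = TERM_CONCEPT_SCORE.get((term, concept), 0.0)
            if best is None or score > best[0]:
                best = (score, concept)
    return best[1] if best else None


# One non-entity term is enough to resolve the first example.
print(best_concept_by_top_term("apple", ["delicious"]))  # fruit

# With conflicting clues, the strongest single term decides the outcome,
# whether or not that is the intended meaning.
print(best_concept_by_top_term("apple", ["microsoft", "delicious"]))  # company
```

The second call shows why relying on the most related term alone can mislead: nothing in this rule accounts for where each term sits relative to the target entity.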

In this paper, we propose an entity disambiguation method with context awareness (IDwCA), which uses various types of contextual terms to eliminate ambiguity, rather than relying solely on entities. First, unlike traditional methods that preserve only the terms with the longest cover and classify terms simply into entity and concept, IDwCA detects all terms by using inter-term semantic information and distinguishes the type of each term according to term frequency and part-of-speech (POS). Second, we investigate how to measure the semantic correlation between various types of terms, and set a dynamic threshold to filter out uninformative contextual terms and avoid noise interference. Then, a priority is assigned to each informative term to highlight its distinctiveness. Finally, we apply the informative terms to entity disambiguation and aggregate the probabilities of each concept of the target entity to find the most appropriate concept. The contributions of our work are summarized as follows:

  • We propose a framework for entity disambiguation. Specifically, it first detects terms using semantic information and second, it distinguishes the type of terms using both POS information and term frequency. Third, it chooses informative contextual terms according to the correlation scores between contextual terms and concepts of the target entity, and assigns priorities to these terms according to contextual distance to avoid noise interference. Finally, it eliminates the ambiguity of terms through these various types of contextual terms.

  • We investigate the semantic correlation between various types of terms using both word embedding and knowledge information to find informative terms for the target entity within a specific context. The experimental results on standard datasets demonstrate that the proposed correlation calculation method is feasible.

  • The priorities of contextual terms in different locations for disambiguation are identified. We prioritize these terms to highlight their discriminative power. The closer the contextual term is to the target entity, the higher its priority is.

  • The experimental results on ground-truth datasets demonstrate that the proposed framework for disambiguation is effective.
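The steps listed above can be sketched end to end on the running example. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper's actual formulas are not given in this section, so the linear blend in `correlation`, the mean-based threshold in `select_informative`, the `1 / distance` rule in `distance_priority`, and all function names and scores are hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def correlation(term_vec, concept_vec, knowledge_score, alpha=0.5):
    """ASSUMPTION: linear blend of embedding similarity and a knowledge-base score."""
    return alpha * cosine(term_vec, concept_vec) + (1.0 - alpha) * knowledge_score


def distance_priority(tokens, target_index):
    """ASSUMPTION: priority = 1 / token distance to the target entity."""
    priorities = {}
    for i, tok in enumerate(tokens):
        if i == target_index:
            continue
        p = 1.0 / abs(i - target_index)
        # If a token repeats, keep the priority of its closest occurrence.
        priorities[tok] = max(p, priorities.get(tok, 0.0))
    return priorities


def select_informative(scores):
    """ASSUMPTION: keep terms whose best score reaches the mean of the best scores."""
    best = {term: max(cs.values()) for term, cs in scores.items()}
    if not best:
        return {}
    threshold = sum(best.values()) / len(best)
    return {term: scores[term] for term, b in best.items() if b >= threshold}


def disambiguate(scores, priorities):
    """Sum priority-weighted scores per concept and normalise to probabilities."""
    totals = {}
    for term, concept_scores in scores.items():
        weight = priorities.get(term, 1.0)
        for concept, s in concept_scores.items():
            totals[concept] = totals.get(concept, 0.0) + weight * s
    norm = sum(totals.values())
    return {c: v / norm for c, v in totals.items()} if norm > 0 else {}


# Invented term-to-concept scores for "this apple is very delicious";
# in practice they would come from a measure such as correlation().
tokens = ["this", "apple", "is", "very", "delicious"]
scores = {
    "this": {"fruit": 0.1, "company": 0.1},
    "is": {"fruit": 0.1, "company": 0.1},
    "very": {"fruit": 0.1, "company": 0.1},
    "delicious": {"fruit": 0.8, "company": 0.1},
}

informative = select_informative(scores)  # only "delicious" passes the threshold
probs = disambiguate(informative, distance_priority(tokens, target_index=1))
print(max(probs, key=probs.get))  # fruit
```

Filtering before weighting matters in this toy setting: function words such as “is” sit next to the target and would receive the highest distance priority, so they need to be removed as uninformative before priorities are applied.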

The rest of the paper is organized as follows. Section 2 surveys the related works. Section 3 introduces background knowledge. Section 4 details our entity disambiguation method. Section 5 presents an evaluation of our method. Finally, we conclude our work in Section 6.

Section snippets

Related works

Many approaches have been proposed for entity disambiguation, which can be classified into two categories: (1) entity disambiguation based on linked data; and (2) entity disambiguation at a concept-level. More details are as follows.

Preliminary knowledge

In this section, we first introduce some notations and definitions used in this paper, and the notations are shown in Table 1. Then, we review several key techniques to make it easier to understand the proposed approach.

Entity disambiguation with context awareness

In this section, we first present the formulation of the entity disambiguation problem. Then we describe the proposed method.

Experimental setup

In our method, we disambiguate the entities by using various types of contextual terms to obtain the most appropriate concept. As mentioned above, existing works for entity disambiguation can be roughly divided into two categories. One focuses on entity linking in which the target entity is linked to an entity in a knowledge base. The other one focuses on disambiguation at a concept-level. Thus we compare our method with the linking-based methods and concept-based methods.

Compared methods: In

Conclusions and future work

Entity conceptualization has received considerable attention from academia and industry, which can benefit many natural language processing applications, such as question–answering systems, short text classification, and sentiment analysis. In existing works, disambiguation is mainly based on similar or related entities in context and topic of the given text. However, the unique characteristics of UGSTs make these methods quite weak. In this paper, we focused on using more types of contextual

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Jiaqi Yang: Conceptualization, Data curation, Writing - original draft, Writing - review & editing, Investigation. Yongjun Li: Methodology, Writing - original draft, Writing - review & editing, Supervision. Congjie Gao: Writing - original draft, Visualization, Investigation. Wei Dong: Writing - original draft, Supervision.

Acknowledgment

This work was partly supported by Natural Science Basic Research Plan in Shaanxi Province of China (2018JM6063).

References (38)

  • H. Aloulou et al. Uncertainty handling in semantic reasoning for accurate context understanding. Knowledge-Based Systems (2015)
  • Y. Li et al. A deep dive into user display names across social networks. Information Sciences (2018)
  • Y. Li et al. Matching user accounts based on user generated content across social networks. Future Generation Computer Systems (2018)
  • Akkaya, C., Wiebe, J., & Mihalcea, R. (2012). Utilizing semantic composition in distributional semantic models for word...
  • E. Amigó et al. Weps3 evaluation campaign: Overview of the on-line reputation management task
  • S. Arora et al. A simple but tough-to-beat baseline for sentence embeddings
  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In...
  • M. Bekkali et al. An effective short text conceptualization based on new short text similarity. Social Network Analysis and Mining (2019)
  • Y. Bengio et al. A neural probabilistic language model. Journal of Machine Learning Research (2003)
  • E. Cambria et al. Semantic multidimensional scaling for open-domain sentiment analysis. IEEE Intelligent Systems (2014)
  • Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., & Trani, S. (2013). Dexter: an open source framework for entity...
  • Chen, P., & Al-Mubaid, H. (2006). Context-based term disambiguation in biomedical literature. In Proceedings of the...
  • Cheng, J., Wang, Z., Wen, J., Yan, J., & Chen, Z. (2015). Contextual text understanding in distributional semantic...
  • P. Ferragina et al. Fast and accurate annotation of short texts with Wikipedia pages. IEEE Software (2012)
  • Hoffart, J., Yosef, M. A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., et al. (2011). Robust disambiguation of...
  • W. Hua et al. Short text understanding through lexical-semantic analysis
  • W. Hua et al. Understand short texts by harvesting and analyzing semantic knowledge. IEEE Transactions on Knowledge and Data Engineering (2017)
  • H. Huang et al. Leveraging conceptualization for short-text embedding. IEEE Transactions on Knowledge and Data Engineering (2018)
  • D. Kim et al. Context-dependent conceptualization

    1

    ORCID: 0000-0002-3184-6805
