Abstract:
Microblogging sites such as Twitter continuously generate a large volume of streaming data. This streaming environment creates new challenges for two concomitant Information Extraction tasks: Entity Mention Detection (EMD) and Entity Detection (ED). These challenges include (1) continuously evolving topics, which may quickly render model-based approaches obsolete; (2) the non-literary nature of posts, which makes traditional NLP techniques less effective; and (3) the huge volume of streaming data, which makes computationally expensive approaches less suitable. In this paper, we propose an approach to EMD/ED designed from the ground up around the constraints specific to streaming environments. Our system, TwiCS, implements this approach. TwiCS employs a computationally light two-phase process. In the first phase, it exploits simple (low-computation) syntactic cues to suggest Entity Mention (EM) candidates. In the second phase, it uses occurrence mining to classify candidates according to their likelihood of being true EMs. Our experiments show that TwiCS achieves an average effectiveness improvement of 14.6 percent while maintaining at least 2.64 times higher throughput, compared with several state-of-the-art systems.
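The two-phase idea can be illustrated with a minimal sketch. The abstract does not specify which syntactic cues or which occurrence statistics TwiCS uses, so everything below is an assumption made for illustration: capitalization stands in for the syntactic cue, and the share of a phrase's occurrences that are capitalized stands in for occurrence mining. The function names and the `min_ratio` and `min_support` thresholds are hypothetical and are not taken from TwiCS.

```python
import re
from collections import Counter

# Simple word tokenizer (assumption: ASCII words only, for illustration).
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def tokenize(post):
    return TOKEN_RE.findall(post)


def suggest_candidates(tokens):
    """Phase 1 (assumed cue): emit each maximal run of capitalized tokens."""
    candidates, run = [], []
    for tok in tokens:
        if tok[0].isupper():
            run.append(tok.lower())
        else:
            if run:
                candidates.append(tuple(run))
            run = []
    if run:
        candidates.append(tuple(run))
    return candidates


def count_occurrences(tokens, phrase):
    """Count case-insensitive occurrences of a phrase in one post."""
    lowered = [t.lower() for t in tokens]
    n = len(phrase)
    return sum(
        1
        for i in range(len(lowered) - n + 1)
        if tuple(lowered[i:i + n]) == phrase
    )


def classify_stream(posts, min_ratio=0.8, min_support=2):
    """Phase 2 (assumed statistic): accept a candidate when it occurs often
    enough and most of its occurrences are capitalized."""
    tokenized = [tokenize(p) for p in posts]
    capitalized = Counter()
    for toks in tokenized:
        capitalized.update(suggest_candidates(toks))

    labels = {}
    for phrase, cap_count in capitalized.items():
        total = sum(count_occurrences(toks, phrase) for toks in tokenized)
        ratio = cap_count / total  # total >= cap_count >= 1 by construction
        labels[" ".join(phrase)] = total >= min_support and ratio >= min_ratio
    return labels


example_posts = [
    "Flooding reported in New York tonight",
    "Stay safe New York",
    "The subway is closed",
    "took the bus instead",
    "missed the bus again",
]
labels = classify_stream(example_posts)
```

In this toy stream, "new york" is accepted because both of its occurrences are capitalized, while "the" is rejected because it is capitalized only at the start of one sentence and appears in lowercase elsewhere. Both phases need only token scans and counters, which is the kind of low-cost computation the abstract argues for in a streaming setting.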
Published in: IEEE Transactions on Knowledge and Data Engineering (Volume: 35, Issue: 1, 01 January 2023)