Abstract:
News portals, such as Yahoo News or Google News, collect large numbers of news articles from a variety of sources on a daily basis. Only a small portion of these documents can be selected and displayed on the homepage. Thus, there is a strong preference for major, recent events. In this work, we propose a scalable First Story Detection (FSD) pipeline that identifies fresh news. This pipeline is used to instantiate a variety of FSD approaches. In addition, we suggest a novel FSD technique that, in contrast to existing systems, relies on relation extraction algorithms and exploits named entities and their relations to decide on the freshness of an article. We evaluate our technique by instantiating existing state-of-the-art FSD techniques within our generic pipeline. As ground truth, we use multiple datasets that cover different categories. Experimental results demonstrate that our FSD method in many cases provides an improvement over state-of-the-art techniques. In addition, we show, using a large synthetic dataset, that our general FSD pipeline has constant space and time requirements and is suitable for very high volume streams.
Published in: IEEE Transactions on Knowledge and Data Engineering (Volume: 33, Issue: 11, 01 November 2021)