ABSTRACT
In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, and similar or redundant tweets, while retaining information content.
We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare the runtime and the number of entities extracted on the full and the downscaled versions of a micropost set. We demonstrate that for datasets of more than 10 million tweets we can reduce the dataset size by more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.
We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
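The downscaling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the retweet test, the regular expression standing in for "pre-identifiable entities", the Jaccard similarity, the `threshold` value, and all function names are assumptions chosen for illustration. The pairwise comparison is quadratic and would need an index or hashing scheme at the scale reported here. The two helper functions show one plausible reading of the size-reduction and entity-coverage figures used in the evaluation.

```python
import re

# Hypothetical stand-in for "pre-identifiable entities": a hashtag, an
# @-mention, or a capitalized word. The paper's real criterion may differ.
CANDIDATE = re.compile(r"[#@]\w+|\b[A-Z][a-z]+\b")
TOKEN = re.compile(r"\w+")


def is_retweet(text):
    # Assumption: retweets are marked with the conventional "RT @" prefix.
    return text.startswith("RT @")


def token_set(text):
    # Lowercased bag of word tokens, used only for similarity comparison.
    return frozenset(t.lower() for t in TOKEN.findall(text))


def jaccard(a, b):
    # Jaccard similarity of two token sets; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def downscale(tweets, threshold=0.8):
    """Keep tweets that pass all three filters named in the abstract."""
    kept, kept_tokens = [], []
    for text in tweets:
        if is_retweet(text):
            continue  # discard retweets
        if not CANDIDATE.search(text):
            continue  # discard tweets without an entity candidate
        tokens = token_set(text)
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # discard near-duplicates of an already kept tweet
        kept.append(text)
        kept_tokens.append(tokens)
    return kept


def size_reduction(n_full, n_sample):
    # Fraction of the full dataset removed by downscaling.
    return 1.0 - n_sample / n_full


def entity_coverage(full_entities, sample_entities):
    # Share of the unique entities found on the full set that the sample
    # also yields.
    return len(full_entities & sample_entities) / len(full_entities)
```

Calling `downscale` on a list of tweet strings returns the retained sample in input order. Passing the two set sizes to `size_reduction`, and the two entity sets to `entity_coverage`, gives fractions that can be compared with the percentages quoted in the abstract.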
Index Terms
- Quick-and-clean extraction of linked data entities from microblogs