DOI: 10.1145/2660517.2660527
research-article

Quick-and-clean extraction of linked data entities from microblogs

Published: 04 September 2014

ABSTRACT

In this paper, we address the problem of finding Named Entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content.
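The three discarding steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: the retweet marker (`RT @`), the entity cue heuristic (hashtag, mention, or non-initial capitalized word), the Jaccard similarity measure, and the `threshold` value are all assumptions chosen for illustration, as are the function names.

```python
import re

# Assumption: a token is a word, optionally prefixed by '#' or '@'.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokens(text):
    """Return the set of lowercased tokens of a tweet."""
    return frozenset(t.lower() for t in TOKEN_RE.findall(text))


def is_retweet(text):
    # Assumption: retweets are marked by a leading "RT @user".
    return text.lstrip().startswith("RT @")


def has_entity_cue(text):
    # Assumption: a hashtag, a mention, or a capitalized word that is not
    # the first word hints at a pre-identifiable entity.
    if re.search(r"[#@]\w+", text):
        return True
    words = text.split()
    return any(w[:1].isupper() for w in words[1:])


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def downsample(tweets, threshold=0.7):
    """Keep tweets that pass all three filters, in input order."""
    kept, kept_tokens = [], []
    for text in tweets:
        if is_retweet(text) or not has_entity_cue(text):
            continue
        toks = tokens(text)
        # Drop tweets that are too similar to one already kept.
        if any(jaccard(toks, seen) >= threshold for seen in kept_tokens):
            continue
        kept.append(text)
        kept_tokens.append(toks)
    return kept
```

The pairwise similarity check is quadratic in the number of kept tweets, so this sketch only conveys the idea on small inputs; it makes no claim about how the paper handles similarity at the scale of billions of microposts.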

We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare the runtime and the number of entities extracted on the full and the downscaled versions of a micropost set. We demonstrate that for datasets of more than 10 million tweets we can achieve a reduction in size of more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.
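The two evaluation figures, size reduction and cumulative entity coverage, can be made concrete with a short sketch. The exact definitions are an assumption read off the abstract: reduction is taken as the share of microposts removed, and coverage as the share of unique entities from the full set (union over all tools) that are also found in the downscaled set. The function names and the toy entity identifiers are hypothetical.

```python
def size_reduction(full_count, sample_count):
    """Fraction of microposts removed by downscaling."""
    return 1.0 - sample_count / full_count


def entity_coverage(full_by_tool, sample_by_tool):
    """Share of unique entities (union over all tools) that the
    downscaled set still yields.

    Both arguments map a tool name to the set of entity identifiers
    that the tool extracted.
    """
    full = set().union(*full_by_tool.values())
    sample = set().union(*sample_by_tool.values())
    if not full:
        return 0.0
    return len(sample & full) / len(full)


# Toy data, invented for illustration only.
full_entities = {
    "AlchemyAPI": {"e1", "e2"},
    "Calais": {"e2", "e3"},
    "Zemanta": {"e4", "e5"},
}
sample_entities = {
    "AlchemyAPI": {"e1"},
    "Calais": {"e2", "e3"},
    "Zemanta": set(),
}

reduction = size_reduction(full_count=10, sample_count=2)
coverage = entity_coverage(full_entities, sample_entities)
```

Taking the union before counting means an entity found by several tools is counted once, which matches the "cumulatively discovered" wording.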

We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.

Published in

SEM '14: Proceedings of the 10th International Conference on Semantic Systems
September 2014
161 pages
ISBN: 9781450329279
DOI: 10.1145/2660517

Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SEM '14 Paper Acceptance Rate: 22 of 59 submissions, 37%
Overall Acceptance Rate: 22 of 59 submissions, 37%
