research-article

Quick-and-clean extraction of linked data entities from microblogs

Authors:
Oluwaseyi Feyisetan, University of Southampton, Southampton
Elena Simperl, University of Southampton, Southampton
Ramine Tinati, University of Southampton, Southampton
Markus Luczak-Roesch, University of Southampton, Southampton
Nigel Shadbolt, University of Southampton, Southampton

SEM '14: Proceedings of the 10th International Conference on Semantic Systems, September 2014, Pages 5–12
https://doi.org/10.1145/2660517.2660527

Published: 04 September 2014

ABSTRACT

In this paper, we address the problem of finding named entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content.

We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare runtime and the number of entities extracted on the full and the downscaled versions of a micropost set. We demonstrate that, for datasets of more than 10 million tweets, we can reduce the size by more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.

We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
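
The filtering steps named in the abstract (dropping retweets, tweets without pre-identifiable entities, and similar or redundant tweets) and the two evaluation measures (size reduction and entity coverage) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular expressions, the Jaccard similarity test, the 0.8 threshold, and all function names are assumptions chosen for illustration, since the abstract does not specify the actual rules.

```python
import re
from typing import AbstractSet, FrozenSet, Iterable, List

# All heuristics below are illustrative assumptions, not the paper's rules.
RETWEET = re.compile(r"^\s*RT\b", re.IGNORECASE)       # leading "RT" marker
CANDIDATE = re.compile(r"[#@]\w+|\b[A-Z][a-z]+\b")     # hashtag, mention, or capitalised word
NOISE = re.compile(r"https?://\S+|[^\w\s#@]")          # URLs and punctuation


def normalise(text: str) -> FrozenSet[str]:
    """Return the lower-cased token set of a tweet, without URLs or punctuation."""
    return frozenset(NOISE.sub(" ", text).lower().split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def downscale(tweets: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Keep tweets that are not retweets, contain a candidate entity,
    and are not near-duplicates of an already kept tweet."""
    kept: List[str] = []
    seen: List[FrozenSet[str]] = []
    for text in tweets:
        if RETWEET.match(text):
            continue  # retweet
        if not CANDIDATE.search(text):
            continue  # no pre-identifiable entity candidate
        tokens = normalise(text)
        # Pairwise comparison is quadratic; a corpus of the size described
        # in the abstract would need a scalable scheme instead.
        if any(jaccard(tokens, prev) >= threshold for prev in seen):
            continue  # similar or redundant tweet
        seen.append(tokens)
        kept.append(text)
    return kept


def size_reduction(full: List[str], sample: List[str]) -> float:
    """Fraction of the full set that was discarded."""
    if not full:
        raise ValueError("full set must not be empty")
    return 1 - len(sample) / len(full)


def coverage(full_entities: AbstractSet[str], sample_entities: AbstractSet[str]) -> float:
    """Share of the entities found in the full set that the sample also yields."""
    if not full_entities:
        raise ValueError("full entity set must not be empty")
    return len(full_entities & sample_entities) / len(full_entities)
```

In this sketch, `size_reduction` and `coverage` correspond to the two figures reported in the abstract: the proportion of tweets removed, and the proportion of unique entities from the full set that are still discovered in the downscaled set.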

Published in
SEM '14: Proceedings of the 10th International Conference on Semantic Systems
September 2014
161 pages
ISBN: 9781450329279
DOI: 10.1145/2660517
Editors:
Harald Sack, Hasso-Plattner-Institute for IT Systems Engineering, Germany
Agata Filipowska, Poznan University of Economics, Poland
Jens Lehmann, University of Leipzig, Germany
Sebastian Hellmann, Institute for Applied Informatics (InfAI), Leipzig, Germany
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SEM '14 Paper Acceptance Rate: 22 of 59 submissions, 37%
Overall Acceptance Rate: 22 of 59 submissions, 37%