research-article

Named entities as privileged information for hierarchical text clustering

Authors:

Roberta A. Sinoara,

Camila V. Sundermann,

Ricardo M. Marcacini,

Marcos A. Domingues,

Solange O. Rezende

IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium

Pages 57 - 66

https://doi.org/10.1145/2628194.2628225

Published: 07 July 2014

Abstract

Text clustering is a text mining task often used to aid the organization, knowledge extraction, and exploratory search of text collections. Automatic text clustering has become essential as the volume and variety of digital text documents increase, whether on social networks and the Web or inside organizations. This paper explores the use of named entities as privileged information in a hierarchical clustering process, with the aim of improving cluster quality and interpretability. We carried out an experimental evaluation on three text collections (one written in Portuguese and two written in English). The results show that named entities can be applied as privileged information to improve clustering solutions in dynamic text collection scenarios.
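As a rough illustration of the idea in the abstract, the sketch below clusters documents hierarchically using two views of each document: its ordinary terms and its named entities. The abstract does not say how the paper combines the two views, so the weighted sum of Jaccard distances, the `alpha` weight, the average-linkage criterion, and the toy documents are all assumptions made for this example, not the authors' method.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def combined_distance(doc_a, doc_b, alpha=0.5):
    """Blend the term view and the named-entity view (assumed weighting)."""
    d_terms = jaccard_distance(doc_a["terms"], doc_b["terms"])
    d_entities = jaccard_distance(doc_a["entities"], doc_b["entities"])
    return (1.0 - alpha) * d_terms + alpha * d_entities


def average_linkage(docs, alpha=0.5):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    n = len(docs)
    # Pairwise document distances, keyed by (i, j) with i < j.
    dist = {
        (i, j): combined_distance(docs[i], docs[j], alpha)
        for i, j in combinations(range(n), 2)
    }

    def linkage(c1, c2):
        # Average distance over all cross-cluster document pairs.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters at each step.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


# Hypothetical toy collection: the entity sets would normally come from a
# named entity recognizer, which is outside the scope of this sketch.
docs = [
    {"terms": {"election", "vote", "president"}, "entities": {"Brazil", "Senate"}},
    {"terms": {"election", "campaign", "poll"}, "entities": {"Brazil", "Senate"}},
    {"terms": {"match", "goal", "cup"}, "entities": {"Brazil", "FIFA"}},
    {"terms": {"goal", "cup", "stadium"}, "entities": {"FIFA", "Porto"}},
]

merges = average_linkage(docs, alpha=0.5)
for left, right, distance in merges:
    print(left, right, round(distance, 4))
```

In this toy data, documents 0 and 1 share only one term but have identical entity sets, so the entity view pulls them together early in the hierarchy. Setting `alpha=0` falls back to clustering on terms alone, which makes it easy to compare the two settings.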






Published In


IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium

July 2014

411 pages

ISBN:9781450326278

DOI:10.1145/2628194

Editors: Ana Maria Almeida (ISEP), Jorge Bernardino (CISUC-Polytechnic Institute of Coimbra), Elsa Ferreira Gomes (ISEP)

General Chairs: Bipin C. Desai (Concordia University), Jorge Bernardino (CISUC-Polytechnic Institute of Coimbra)

Program Chairs: Ana Maria Almeida (ISEP), Bipin C. Desai (Concordia University)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ISEP: Instituto Superior de Engenharia do Porto
BytePress
Concordia University: Concordia University

Publisher

Association for Computing Machinery

New York, NY, United States



Funding Sources

Fundação de Amparo à Pesquisa do Estado de São Paulo

Conference

IDEAS '14

Sponsor:

ISEP
Concordia University

IDEAS '14: 18th International Database Engineering & Applications Symposium

July 7 - 9, 2014

Porto, Portugal

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%


