DOI: 10.1145/2628194.2628225
research-article

Named entities as privileged information for hierarchical text clustering

Published: 07 July 2014

Abstract

Text clustering is a text mining task often used to support the organization, knowledge extraction, and exploratory search of text collections. Automatic text clustering has become essential as the volume and variety of digital text documents increase, whether in social networks and on the Web or inside organizations. This paper explores the use of named entities as privileged information in a hierarchical clustering process in order to improve cluster quality and interpretability. We carried out an experimental evaluation on three text collections (one written in Portuguese and two written in English), and the results show that named entities can be applied as privileged information to strengthen clustering solutions in dynamic text collection scenarios.




Published In

IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium
July 2014
411 pages
ISBN: 9781450326278
DOI: 10.1145/2628194

Sponsors

  • ISEP: Instituto Superior de Engenharia do Porto
  • BytePress
  • Concordia University

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. named entities
  2. privileged information
  3. text clustering

Qualifiers

  • Research-article

Conference

IDEAS '14
Sponsor:
  • ISEP
  • Concordia University

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%



Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 14 Feb 2025

Cited By

  • (2023) Multi-view support vector machines with sub-view learning. Soft Computing 27:10 (6241-6259). DOI: 10.1007/s00500-023-07884-9. Online publication date: 29-Mar-2023
  • (2022) A Complete Process of Text Classification System Using State-of-the-Art NLP Models. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/1883698. Online publication date: 1-Jan-2022
  • (2019) Research on Topic Detection Technology for Information Security Texts. 2019 IEEE 5th International Conference on Computer and Communications (ICCC) (1621-1627). DOI: 10.1109/ICCC47050.2019.9064175. Online publication date: Dec-2019
  • (2019) Data-Information-Concept Continuum From a Text Mining Perspective. Encyclopedia of Bioinformatics and Computational Biology (586-601). DOI: 10.1016/B978-0-12-809633-8.20408-1. Online publication date: 2019
  • (2017) Constrained Hierarchical Clustering for News Events. Proceedings of the 21st International Database Engineering & Applications Symposium (49-56). DOI: 10.1145/3105831.3105859. Online publication date: 12-Jul-2017
  • (2017) Evaluation of latent dirichlet allocation for document organization in different levels of semantic complexity. 2017 IEEE Symposium Series on Computational Intelligence (SSCI) (1-8). DOI: 10.1109/SSCI.2017.8280939. Online publication date: Nov-2017
  • (2017) Mining based on Extraction and Importance Evaluation Using Multi-Measures Methods for Electronic Documents. ITM Web of Conferences 12 (05018). DOI: 10.1051/itmconf/20171205018. Online publication date: 5-Sep-2017
  • (2016) Semantic role-based representations in text classification. 2016 23rd International Conference on Pattern Recognition (ICPR) (2313-2318). DOI: 10.1109/ICPR.2016.7899981. Online publication date: Dec-2016
