A Method for Topic Detection in Great Volumes of Data

Amato, Flora; Gargiulo, Francesco; Maisto, Alessandro; Mazzeo, Antonino; Pelosi, Serena; Sansone, Carlo

doi:10.1007/978-3-319-25936-9_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 178)

Included in the following conference series:

International Conference on Data Management Technologies and Applications

Abstract

Topic extraction has become increasingly important because of its effectiveness in many tasks, including information filtering, information retrieval and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore the adoption of a feature reduction methodology to highlight the most significant topics within a document corpus. We use an approach based on a clustering algorithm (X-means) applied to the tf-idf matrix computed from the corpus, which describes the frequency of the terms (the columns) that occur in the documents (the rows). To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering step, and we apply supervised feature selection to each of them, taking the top features as the topic descriptors. We show the results obtained on two different corpora, both in Italian: the first consists of documents of the University of Naples Federico II, the second of a collection of medical records.
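The pipeline described in the abstract (tf-idf matrix, clustering, one binary problem per cluster, supervised feature selection) can be illustrated with a minimal, standard-library-only sketch. Several parts of it are assumptions rather than the authors' implementation: plain k-means with a fixed k stands in for X-means (which estimates the number of clusters itself), the feature score is a simple gap between the mean tf-idf weight inside and outside a cluster (the abstract does not name the selection criterion), and the four toy documents and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows): one L2-normalised tf-idf row per document."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    n_docs = len(tokens)
    df = Counter(t for doc in tokens for t in set(doc))  # document frequency
    rows = []
    for doc in tokens:
        counts = Counter(doc)
        row = [counts[t] / len(doc) * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, n_iter=20):
    """Plain k-means with farthest-point initialisation.

    Assumption: a stand-in for X-means, which would also choose k.
    """
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        )
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def topic_descriptors(vocab, rows, labels, top_n=2):
    """Build one binary (cluster vs. rest) problem per cluster and keep the
    top_n terms by mean-weight gap (an assumed, illustrative criterion)."""
    topics = {}
    for cluster in sorted(set(labels)):
        inside = [r for r, lab in zip(rows, labels) if lab == cluster]
        outside = [r for r, lab in zip(rows, labels) if lab != cluster]
        scores = []
        for i, term in enumerate(vocab):
            mean_in = sum(r[i] for r in inside) / len(inside)
            mean_out = sum(r[i] for r in outside) / len(outside) if outside else 0.0
            scores.append((mean_in - mean_out, term))
        scores.sort(key=lambda s: -s[0])
        topics[cluster] = [term for _, term in scores[:top_n]]
    return topics

# Hypothetical toy corpus: two university-style and two medical-style documents.
docs = [
    "exam course student exam",
    "student course lecture",
    "patient therapy diagnosis patient",
    "diagnosis patient hospital",
]
vocab, rows = tfidf_matrix(docs)
labels = kmeans(rows, k=2)
topics = topic_descriptors(vocab, rows, labels)
print(labels, topics)
```

On a real corpus the fixed k would be replaced by a model-selection step such as the one X-means performs, and the mean-weight gap by whichever supervised feature selection measure is preferred.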

Author information

Authors and Affiliations

Dipartimento di Ingegneria Elettrica e delle Tecnologie dell’Informazione (DIETI), University of Naples Federico II, Via Claudio 21, Naples, Italy
Flora Amato, Francesco Gargiulo, Antonino Mazzeo & Carlo Sansone
Dipartimento di Scienze Politiche, Sociali e della Comunicazione (DSPSC), University of Salerno, Snc, Via Giovanni Paolo II, Fisciano (SA), Italy
Alessandro Maisto & Serena Pelosi

Corresponding author

Correspondence to Francesco Gargiulo.

Editor information

Editors and Affiliations

School of Computing, Dublin City University, Dublin 9, Ireland
Markus Helfert
Medical University Graz, Graz, Austria
Andreas Holzinger
Informatics, University of Minho, Braga, Portugal
Orlando Belo
Dipt di Elettronica e Informazione, Politecnico di Milano, Milan, Italy
Chiara Francalanci

Cite this paper

Amato, F., Gargiulo, F., Maisto, A., Mazzeo, A., Pelosi, S., Sansone, C. (2015). A Method for Topic Detection in Great Volumes of Data. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2014. Communications in Computer and Information Science, vol 178. Springer, Cham. https://doi.org/10.1007/978-3-319-25936-9_11

DOI: https://doi.org/10.1007/978-3-319-25936-9_11
Published: 31 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25935-2
Online ISBN: 978-3-319-25936-9
eBook Packages: Computer Science, Computer Science (R0)
