Abstract
Topic extraction has become increasingly important because of its effectiveness in many tasks, including information filtering, information retrieval, and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore a feature reduction methodology for highlighting those topics. Our approach applies a clustering algorithm (X-means) to the tf-idf matrix computed from the corpus, in which rows represent documents and columns represent the terms that occur in them. To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering step, and apply supervised feature selection to each of them, taking the top-ranked features as the topic descriptors. We report results on two corpora, both in Italian: the first consists of documents from the University of Naples Federico II, and the second is a collection of medical records.
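The pipeline described in the abstract can be illustrated with a minimal sketch. Several parts of it are assumptions and not the authors' implementation: tokenization is plain whitespace splitting, the cluster labels are assigned by hand in place of an X-means run, and the supervised feature selection step (which the abstract does not specify) is replaced by a simple stand-in score, the difference between a term's mean tf-idf inside and outside the cluster. The function names `tfidf_matrix` and `topic_descriptors` and the toy English documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix); rows are documents, columns are terms."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        row = [
            (counts[term] / len(tokens)) * math.log(n_docs / df[term])
            for term in vocab
        ]
        matrix.append(row)
    return vocab, matrix


def topic_descriptors(vocab, matrix, labels, top_k=2):
    """Build one binary (cluster vs. rest) problem per cluster and
    keep the top_k highest-scoring terms as that cluster's descriptors."""
    topics = {}
    for cluster in sorted(set(labels)):
        inside = [row for row, lab in zip(matrix, labels) if lab == cluster]
        outside = [row for row, lab in zip(matrix, labels) if lab != cluster]
        scores = []
        for j, term in enumerate(vocab):
            mean_in = sum(row[j] for row in inside) / len(inside)
            mean_out = (
                sum(row[j] for row in outside) / len(outside) if outside else 0.0
            )
            # Stand-in score (assumption): mean tf-idf gap between the classes.
            scores.append((mean_in - mean_out, term))
        # Highest score first; ties are broken alphabetically.
        scores.sort(key=lambda item: (-item[0], item[1]))
        topics[cluster] = [term for _, term in scores[:top_k]]
    return topics


# Toy corpus; the labels are hand-assigned and stand in for X-means output.
docs = [
    "exam schedule exam university",
    "university exam enrollment",
    "patient therapy patient diagnosis",
    "diagnosis therapy patient",
]
labels = [0, 0, 1, 1]

vocab, matrix = tfidf_matrix(docs)
topics = topic_descriptors(vocab, matrix, labels)
for cluster, terms in topics.items():
    print(cluster, terms)
```

In a real setting the labels would come from a clustering run over the matrix, and the score could be swapped for any standard supervised feature ranking criterion; the one-problem-per-cluster structure stays the same.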
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Amato, F., Gargiulo, F., Maisto, A., Mazzeo, A., Pelosi, S., Sansone, C. (2015). A Method for Topic Detection in Great Volumes of Data. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2014. Communications in Computer and Information Science, vol 178. Springer, Cham. https://doi.org/10.1007/978-3-319-25936-9_11
DOI: https://doi.org/10.1007/978-3-319-25936-9_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25935-2
Online ISBN: 978-3-319-25936-9
eBook Packages: Computer Science (R0)