A Method for Topic Detection in Great Volumes of Data

  • Conference paper
Data Management Technologies and Applications (DATA 2014)

Abstract

Topic extraction has become increasingly important due to its effectiveness in many tasks, including information filtering, information retrieval and the organization of document collections in digital libraries. Topic detection consists of finding the most significant topics within a document corpus. In this paper we explore the adoption of a feature reduction methodology to highlight the most significant topics within a document corpus. We use an approach based on a clustering algorithm (X-means) applied to the \(tf\text{-}idf\) matrix computed from the corpus, which describes the frequency of the terms (columns) that occur in the documents (rows). To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering step, and we perform supervised feature selection on each of them, taking the top-ranked features as the topic descriptors. We present the results obtained on two different corpora, both in Italian: the first consists of documents of the University of Naples Federico II, and the second is a collection of medical records.
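The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the cluster labels are taken as given (in the paper they would come from X-means, which is not reimplemented here), the tf-idf weighting uses a plain term-frequency times log inverse-document-frequency formula, and the supervised feature selection is replaced by a simple one-vs-rest score (mean tf-idf inside the cluster minus mean tf-idf outside it). The function names `tfidf_matrix` and `topic_descriptors`, the toy documents and the labels are all hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Assumed weighting: relative term frequency times log(N / df).
        row = [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        matrix.append(row)
    return vocab, matrix


def topic_descriptors(vocab, matrix, labels, top_k=3):
    """Build one binary (cluster vs. rest) problem per cluster and rank terms.

    The ranking score is an illustrative stand-in for a supervised
    feature selection method, not the one used in the paper.
    """
    topics = {}
    for cluster in sorted(set(labels)):
        inside = [row for row, lab in zip(matrix, labels) if lab == cluster]
        outside = [row for row, lab in zip(matrix, labels) if lab != cluster]
        scores = []
        for j, term in enumerate(vocab):
            mean_in = sum(r[j] for r in inside) / len(inside)
            mean_out = sum(r[j] for r in outside) / len(outside) if outside else 0.0
            scores.append((mean_in - mean_out, term))
        # Highest score first; ties broken alphabetically for determinism.
        scores.sort(key=lambda s: (-s[0], s[1]))
        topics[cluster] = [term for _, term in scores[:top_k]]
    return topics


# Toy corpus (hypothetical); labels stand in for the clustering output.
docs = [
    "exam schedule university course",
    "university course enrollment exam",
    "patient diagnosis therapy",
    "patient therapy discharge diagnosis",
]
labels = [0, 0, 1, 1]

vocab, matrix = tfidf_matrix(docs)
topics = topic_descriptors(vocab, matrix, labels, top_k=3)
for cluster, terms in topics.items():
    print(cluster, terms)
```

With n clusters the loop solves n one-vs-rest problems, so each cluster gets its own ranked list of terms. In practice the scoring step could be swapped for any supervised feature selection method without changing the surrounding structure.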

Corresponding author

Correspondence to Francesco Gargiulo.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Amato, F., Gargiulo, F., Maisto, A., Mazzeo, A., Pelosi, S., Sansone, C. (2015). A Method for Topic Detection in Great Volumes of Data. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2014. Communications in Computer and Information Science, vol 178. Springer, Cham. https://doi.org/10.1007/978-3-319-25936-9_11

  • DOI: https://doi.org/10.1007/978-3-319-25936-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25935-2

  • Online ISBN: 978-3-319-25936-9

  • eBook Packages: Computer Science, Computer Science (R0)
