
Mining Unstructured Data via Computational Intelligence

  • Conference paper
Advances in Artificial Intelligence and Soft Computing (MICAI 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9413)

Abstract

Very large volumes of information are now produced regularly worldwide. Most of this information is unstructured: it lacks the properties usually expected of, for instance, relational databases. One of the more interesting open issues in computer science is whether, and how, data mining can be achieved on such unstructured data. Intuitively, its analysis has been attempted by devising schemes that identify patterns and trends through means such as statistical pattern learning. The basic problem with this approach is that the user has to decide, a priori, the model of the patterns and, furthermore, the way in which they are to be found in the data. This is true regardless of the kind of data, be it textual, musical, financial or otherwise. In this paper we explore an alternative paradigm. First, a large corpus of raw data is analyzed to determine a set of categories and the different instances in each category, which yields a structured database. Then each instance is mapped into a numerical value that preserves the underlying patterns; this is done using a genetic algorithm and a set of multi-layer perceptron networks. Every categorical instance is then replaced by its corresponding numerical code, and the resulting numerical database may be tackled with the usual clustering algorithms. We hypothesize that any unstructured data set may be approached in this fashion. In this work we illustrate the method with a textual database: we apply it to characterize texts by different authors, and we present experimental evidence that the resulting databases yield clustering results which permit authorship identification from raw textual data.
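The last two stages described in the abstract (replacing categorical instances with numerical codes, then clustering the numerical database) can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: the two-column database and the code table are hypothetical and hand-picked, whereas the paper derives the codes with a genetic algorithm and multi-layer perceptrons, which this sketch does not attempt. Plain k-means with a deterministic farthest-first initialization stands in for "the usual clustering algorithms".

```python
def encode(rows, codes):
    """Replace every categorical instance with its numerical code."""
    return [[codes[col][value] for col, value in enumerate(row)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; returns one cluster label per point.

    Assumption: centroids start from points[0] and then the point farthest
    from the centroids chosen so far, so the result is deterministic.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Hypothetical structured database: one row per text, two categorical columns.
rows = [
    ["short", "formal"],
    ["short", "formal"],
    ["long", "casual"],
    ["long", "formal"],
]

# Hypothetical code table, one dict per column. These values are hand-picked
# for illustration; they are NOT produced by the paper's encoding method.
codes = [
    {"short": 0.1, "long": 0.9},
    {"formal": 0.2, "casual": 0.8},
]

numeric = encode(rows, codes)
labels = kmeans(numeric, k=2)
print(numeric)
print(labels)
```

In this toy case the first two rows and the last two rows end up in different clusters. Whether a clustering of this kind separates authors in practice depends on how well the codes preserve the patterns in the data, which is the part the paper's method addresses.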

Author information

Correspondence to Angel Kuri-Morales.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Kuri-Morales, A. (2015). Mining Unstructured Data via Computational Intelligence. In: Sidorov, G., Galicia-Haro, S. (eds) Advances in Artificial Intelligence and Soft Computing. MICAI 2015. Lecture Notes in Computer Science, vol 9413. Springer, Cham. https://doi.org/10.1007/978-3-319-27060-9_43

  • DOI: https://doi.org/10.1007/978-3-319-27060-9_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27059-3

  • Online ISBN: 978-3-319-27060-9

  • eBook Packages: Computer Science, Computer Science (R0)
