Exploring Kolmogorov Complexity Approximations for Data Analysis: Insights and Applications

Silva, Jorge Miguel; Pratas, Diogo; Matos, Sérgio

doi:10.1007/978-3-031-36007-7_12

Part of the book series: IFIP Advances in Information and Communication Technology ((IFIPAICT,volume 678))

Included in the following conference series:

Doctoral Conference on Computing, Electrical and Industrial Systems

244 Accesses

Abstract

This paper presents a PhD research project focused on investigating the use of Kolmogorov complexity approximations as descriptors for various data types, with the aim of addressing inversion problems. The research explores the application of these approximations across different domains while considering the relationship between algorithmic and probabilistic complexities. The study starts with genomic data analysis, where specialized data compressors are employed to improve taxonomic identification, classification, and organization. The research then extends to analysing artistic paintings, utilizing information-based measures to attribute authorship, categorize styles, and describe the content. Additionally, the research examines Turing Machine-generated data, providing insights into the relationship between algorithmic and probabilistic complexities. A method for increasing probabilistic complexity without affecting algorithmic complexity is also proposed. Lastly, a methodology for identifying programs capable of generating outputs approximating given input strings is introduced, offering potential solutions to inversion problems. The paper highlights this research's diverse applications and findings, contributing to understanding the relationship between algorithmic and probabilistic complexities in data analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Hutter, M.: Algorithmic information theory. Scholarpedia 2, 2519 (2007)
Article Google Scholar
Nalbantoglu, Ö., Russell, D., Sayood, K.: Data compression concepts and algorithms and their applications to bioinformatics. Entropy 12(1), 34–52 (2009). https://doi.org/10.3390/e12010034
Article Google Scholar
Silva, M., Pratas, D., Pinho, A.J.: Efficient DNA sequence compression with neural networks. GigaScience 9(11), giaa119 (2020). https://doi.org/10.1093/gigascience/giaa119
Article Google Scholar
MacKay, D.J.C., Mac Kay, D.J.C.: Information theory, inference and learning algorithms. Cambridge University Press, Cambridge (2003)
Google Scholar
Cohen, A.R., Bjornsson, C.S., Temple, S., Banker, G., Roysam, B.: Automatic summarization of changes in biological image sequences using algorithmic information theory. IEEE Trans. Pattern Anal. Mach. Intell. 31(8), 1386–1403 (2009). https://doi.org/10.1109/TPAMI.2008.162
Article Google Scholar
Maurer, U.: Information-theoretic cryptography. In: Wiener, M. (ed.) Advances in Cryptology — CRYPTO’ 99. LNCS, vol. 1666, pp. 47–65. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48405-1_4
Chapter Google Scholar
Yeboah-Ofori, A., Agbodza, C.K., Opoku-Boateng, F.A., Darvishi, I., Sbai, F.: Applied cryptography in network systems security for cyberattack prevention. In: 2021 International Conference on Cyber Security and Internet of Things (ICSIoT), pp. 43–48 (2021)
Google Scholar
Tenreiro Machado, J., Lopes, A.M.: Artistic painting: a fractional calculus perspective. Appl. Math. Model. 65, 614–626 (2019)
Google Scholar
Li, M., Vitanyi, P.: An Introduction to Kolmogorov Complexity and its Applications. TCS, Springer, New York (2008). https://doi.org/10.1007/978-0-387-49820-1
Book MATH Google Scholar
Voss, R.F.: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68(25), 3805–3808 (1992). https://doi.org/10.1103/PhysRevLett.68.3805
Article Google Scholar
Ciliberti, S., Martin, O.C., Wagner, A.: Innovation and robustness in complex regulatory gene networks. Proc. Natl. Acad. Sci. 104(34), 13591–13596 (2007). https://doi.org/10.1073/pnas.0705396104
Article Google Scholar
Adami, C., Ofria, C., Collier, T.C.: Evolution of biological complexity. Proc. Natl. Acad. Sci. 97(9), 4463–4468 (2000). https://doi.org/10.1073/pnas.97.9.4463
Article Google Scholar
Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
Article MATH Google Scholar
RossQuinlan, J., Rivest, R.L.: Inferring decision trees using the minimum description length principle. Inf. Comput. 80(3), 227–248 (1989). https://doi.org/10.1016/0890-5401(89)90010-2
Article MathSciNet MATH Google Scholar
Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1(2), 224–227 (1979). https://doi.org/10.1109/TPAMI.1979.4766909
Article Google Scholar
Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: how to generate strong keys from biometrics and other noisy data. In: Cachin, C., Camenisch, J.L. (eds.) Advances in Cryptology - EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24676-3_31
Chapter Google Scholar
Chaitin, G.J.: Algorithmic information theory. IBM J. Res. Dev. 21(4), 350–359 (1977)
Google Scholar
Bruce, S.: Applied cryptography: Protocols, Algorthms, and Source Code in c.-2nd (1996)
Google Scholar
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Probl. Inf. Trans. 1(1), 1–7 (1965)
Google Scholar
Chaitin, G.J.: On the length of programs for computing finite binary sequences: statistical considerations. J. ACM (JACM). 16(1), 145–159 (1969)
Google Scholar
Calude, C.S.: Information and Randomness: An Algorithmic Perspective. Springer Science & Business Media, Heidelberg (2002). https://doi.org/10.1007/978-3-662-03049-3
Sayood, K.: Introduction. In: Introduction to data compression, pp. 1–10. Elsevier (2018). https://doi.org/10.1016/B978-0-12-809474-7.00001-X
Chapter MATH Google Scholar
Moffat, A.: Word-based text compression. Softw. Pract. Exp. 19(2), 185–198 (1989). https://doi.org/10.1002/spe.4380190207
Article Google Scholar
Knoll, B., de Freitas, N.: A machine learning perspective on predictive coding with PAQ8. In: 2012 Data Compression Conference, pp. 377–386. IEEE (2012)
Google Scholar
Carrasco, R.C., Oncina, J.: Learning stochastic regular grammars by means of a state merging method. In: Carrasco, R.C., Oncina, J. (eds.) Grammatical Inference and Applications. LNCS, vol. 862, pp. 139–152. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-58473-0_144
Chapter MATH Google Scholar
Silva, J.M., Pratas, D., Caetano, T., Matos, S.: The complexity landscape of viral genomes. GigaScience 11, 1–16 (2022). https://doi.org/10.1093/gigascience/giac079
Article Google Scholar
Silva, J.M., Pratas, D., Antunes, R., Matos, S., Pinho, A.J.: Automatic analysis of artistic paintings using information-based measures. Pattern Recogn. 114, 107864 (2021). https://doi.org/10.1016/j.patcog.2021.107864
Article Google Scholar
Wallace, C.S.: Minimum message length and kolmogorov complexity. Comput. J. 42(4), 270–283 (1999). https://doi.org/10.1093/comjnl/42.4.270
Article MATH Google Scholar
Hutter, M.: Universal algorithmic intelligence: a mathematical top→down approach. In: Goertzel, B., Pennachin, C. (eds.) Artificial general intelligence, pp. 227–290. Springer Berlin Heidelberg, Heidelberg (2007). https://doi.org/10.1007/978-3-540-68677-4_8
Chapter Google Scholar
Silva, J.M., Almeida, J.R.: The value of compression for taxonomic identification. In: 2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS), pp. 276–281. IEEE (2022)
Google Scholar
Zenil, H., Delahaye, J.-P.: An algorithmic information theoretic approach to the behaviour of financial markets. J. Econ. Surv. 25(3), 431–463 (2011)
Article Google Scholar
Silva, J.M., Pratas, D., Caetano, T., Matos, S.: Feature-based classification of archaeal sequences using compression-based methods. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds.) Pattern Recognition and Image Analysis, pp. 309–320. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-04881-4_25
Chapter Google Scholar
jorgeMFS. Complexity ANalysis VirAl Sequences (C.A.N.V.A.S.) Repository (2021). https://github.com/jorgeMFS/canvas
jorgeMFS. Classification and identification of Archaea (ARCHAEA2) Repository (2021). https://github.com/jorgeMFS/Archaea2
bioinformatics ua. COMPressor tAxonomic ClassificaTion (C.O.M.P.A.C.T.) Repository (2021). https://github.com/bioinformatics-ua/COMPACT
CANVAS Website. CANVAS Website (2021). https://asilab.github.io/canvas/
asilab. Measuring probabilistic-algorithmic information of artistic paintings (PANTHER) Repository (2021). https://github.com/asilab/panther
PANTHER Website. PANTHER Website (2021). http://panther.web.ua.pt/
Silva, J.M., Pinho, E., Matos, S., Pratas, D.: Statistical complexity analysis of turing machine tapes with fixed algorithmic complexity using the best-order Markov model. Entropy. 22(1), 105 (2020)
Google Scholar
asilab. TMCompression Repository (2021). https://github.com/asilab/TMCompression
jorgeMFS.Turing Machine Recreator (TMRecreator) (2021). https://github.com/jorgeMFS/TMRecreator
jorgeMFS. SPTTM (2021). https://github.com/jorgeMFS/spttm
bioinformatics ua. TM Neural Finder (2021). https://github.com/bioinformatics-ua/TM-Neural-Finder

Download references

Acknowledgments

This work was partially funded by National Funds through the FCT - Foundation for Science and Technology, in the context of the project UIDB/00127/2020. Furthermore, this work has received funding from the EC under grant agreement 101081813, Genomic Data Infrastructure. J.M.S. acknowledges the FCT grant SFRH/BD/141851/2018. National funds funded D.P. through FCT – Fundação para a Ciência e a Tecnologia, I.P., under the Scientific Employment Stimulus - Institutional Call - reference CEECINST/00026/2018.

Author information

Authors and Affiliations

IEETA, Institute of Electronics and Informatics Engineering of Aveiro and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Aveiro, Portugal
Jorge Miguel Silva, Diogo Pratas & Sérgio Matos
DETI, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
Diogo Pratas & Sérgio Matos
DoV, Department of Virology, University of Helsinki, Helsinki, Finland
Diogo Pratas

Authors

Jorge Miguel Silva
View author publications
You can also search for this author in PubMed Google Scholar
Diogo Pratas
View author publications
You can also search for this author in PubMed Google Scholar
Sérgio Matos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jorge Miguel Silva .

Editor information

Editors and Affiliations

School of Science and Technology, NOVA University of Lisbon, Monte Caparica, Portugal
Luis M. Camarinha-Matos
School of Science and Technology, NOVA University of Lisbon, Monte Caparica, Portugal
Filipa Ferrada

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Silva, J.M., Pratas, D., Matos, S. (2023). Exploring Kolmogorov Complexity Approximations for Data Analysis: Insights and Applications. In: Camarinha-Matos, L.M., Ferrada, F. (eds) Technological Innovation for Connected Cyber Physical Spaces. DoCEIS 2023. IFIP Advances in Information and Communication Technology, vol 678. Springer, Cham. https://doi.org/10.1007/978-3-031-36007-7_12

Download citation

DOI: https://doi.org/10.1007/978-3-031-36007-7_12
Published: 25 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36006-0
Online ISBN: 978-3-031-36007-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Federation for Information Processing (opens in a new tab)