Abstract
This paper proposes an information-theoretic method for interpreting the inference mechanism of neural networks. The method aims at minimal interpretation by disentangling complex information into simpler, more easily interpretable components. This disentanglement is realized by maximizing the mutual information between input patterns and the corresponding neurons. Because mutual information is difficult to compute directly, we use the well-known autoencoder and re-interpret its sparsity constraint as a device for increasing mutual information. The computational procedure is decomposed into the serial operations of equal use of neurons and specific responses to input patterns, where the specific responses are obtained by enhancing the results of the equal use of neurons. The method was applied to three data sets: the glass, office-equipment, and pulsar data sets. For all three data sets, mutual information could be increased when the number of neurons was forced to grow. The collective weights, namely weights averaged and treated collectively, then showed that the method could extract simple, linear relations between inputs and targets, making it possible to interpret the inference mechanism minimally.
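The paper's implementation is not reproduced here; the following is a minimal sketch of how the central quantity, the mutual information between input patterns and hidden neurons, can be estimated from the activations of a trained (sparse) autoencoder. Assuming equiprobable patterns, the decomposition mirrors the abstract: the entropy of the average firing rates rewards the equal use of neurons, and a low average per-pattern entropy rewards specific responses. The normalization scheme, variable names, and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mutual_information(hidden, eps=1e-12):
    """Estimate mutual information between input patterns and hidden neurons.

    hidden : (n_patterns, n_neurons) array of non-negative activations
             from a trained (sparse) autoencoder.

    Activations are normalized into p(neuron | pattern); assuming
    equiprobable patterns, mutual information is the entropy of the
    average firing rates ("equal use of neurons") minus the average
    per-pattern entropy ("specific responses").
    """
    # p(j | s): each pattern's activations normalized to sum to one
    p_js = hidden / (hidden.sum(axis=1, keepdims=True) + eps)

    # p(j): average firing rate of each neuron over all patterns
    p_j = p_js.mean(axis=0)

    # Entropy of average neuron use: high when neurons are used equally
    h_equal = -np.sum(p_j * np.log(p_j + eps))

    # Average conditional entropy: low when each pattern fires few neurons
    h_specific = -np.mean(np.sum(p_js * np.log(p_js + eps), axis=1))

    return h_equal - h_specific


# Toy usage: random non-negative "activations" for 100 patterns and 10 neurons
rng = np.random.default_rng(0)
activations = rng.random((100, 10))
print("estimated mutual information:", mutual_information(activations))
```

A forced increase in the number of hidden neurons, as described in the abstract, would be expected to raise this estimate when the equal-use and specific-response operations are applied in series; the collective weights discussed in the paper are a separate averaging step over the learned weights and are not sketched here.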
Notes
The glass data set was obtained with the MATLAB Neural Network Toolbox command `[x,y] = glass_dataset`.