
Entropy based streaming big-data reduction with adjustable compression ratio

Published in Multimedia Tools and Applications

Abstract

The Internet of Things links numerous physical devices to the internet to collect, generate, and distribute data for processing. As the number of devices grows, data storage and processing become more challenging. One solution is to reduce the amount of stored data in such a way that processing accuracy does not suffer significantly. Depending on the type of data, the reduction can be lossy or lossless. This article presents a novel lossy algorithm that reduces the volume of stored data while maintaining classification accuracy and allowing the reduction ratio to be adjusted. A nonlinear cluster distance measure is used to create subgroups, so samples can be assigned to the correct clusters even when the cluster shapes are nonlinear. Samples are assumed to arrive one at a time, which makes the algorithm suitable for streaming data. The user can adjust the degree of reduction, and the algorithm strives to minimize classification error; it does not depend on any particular classification technique. Subclusters are formed and readjusted as each sample arrives, and representative points are calculated to summarize the data in each subcluster. The resulting data summary can be saved and used for future processing. The effectiveness of the proposed method is measured by the difference in accuracy between the regular and reduced datasets, evaluated with several classifiers. The results show that the nonlinear information-theoretic cluster distance measure achieves higher reduction rates with higher accuracy than existing studies, while the reduction rate can be adjusted as desired, a feature lacking in current methods. The characteristics of the algorithm are discussed, and the results are compared to previously published algorithms.
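To make the streaming behaviour concrete, the Python sketch below mirrors the control flow the abstract describes: samples arrive one at a time, each is assigned to the most similar subcluster, and each subcluster is then summarized by a representative point. It is a minimal illustration, not the paper's EbSDR algorithm: the Gaussian cross information potential used as the similarity, the fixed budget max_clusters standing in for the adjustable reduction ratio, the bandwidth sigma, and the subcluster mean as the representative point are all simplifying assumptions introduced here.

import numpy as np

def cross_information_potential(cluster, x, sigma=1.0):
    # Mean Gaussian kernel between every point in `cluster` and the
    # new sample `x`; higher values mean `x` overlaps the cluster more.
    # (Illustrative stand-in for the paper's entropy-based CEF measure.)
    d2 = ((cluster - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2)).mean()

def reduce_stream(stream, max_clusters=50, sigma=1.0):
    # Process one sample at a time: seed subclusters until the budget
    # is reached, then assign each arrival to the most similar one.
    clusters = []
    for x in stream:
        x = np.asarray(x, dtype=float)
        if len(clusters) < max_clusters:
            clusters.append([x])
            continue
        scores = [cross_information_potential(np.vstack(c), x, sigma)
                  for c in clusters]
        clusters[int(np.argmax(scores))].append(x)
    # Summarize each subcluster by one representative point (its mean).
    return np.array([np.mean(c, axis=0) for c in clusters])

# Example: reduce 1,000 streaming 2-D samples to 50 representatives.
rng = np.random.default_rng(0)
summary = reduce_stream(rng.normal(size=(1000, 2)), max_clusters=50)
print(summary.shape)  # (50, 2)

In the actual method, the CEF distance also drives the per-sample readjustment of the subclusters; the sketch keeps only the one-sample-at-a-time structure and the representative-point summary.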


Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

KNN: K-Nearest Neighbour
SVM: Support Vector Machine
CEF: Clustering Evaluation Function
EbSDR: Entropy based Streaming Data Reduction
DROP: Decremental Reduction Optimization Procedure
SNN: Selective Nearest Neighbor Rule
CNN: Condensed Nearest Neighbor Rule
RNN: Reduced Nearest Neighbor Rule
ENN: Edited Nearest Neighbor
DEL: Decremental Encoding Length
IBx: Instance Based Learning
HMNEI: Hit Miss Network Edition Iterative
ATISA: Adaptive Threshold-based Instance Selection Algorithm
CNNIR: Constraint Nearest Neighbor-Based Instance Reduction
BNNT: Binary Nearest Neighbor Tree
LDIS: Local Density-based Instance Selection
CDIS: Central Density-based Instance Selection
LSSm: Local Set-Based Smoother
LSBo: Local Set Border Selector
ICF: Iterative Case Filtering
SIR: Spectral Instance Reduction


Acknowledgments

I would like to thank my loving wife, Ozlemgul Pekdemir Gokcay, for her unlimited support and patience throughout all aspects of the study.

Author information


Corresponding author

Correspondence to Erhan Gokcay.

Ethics declarations

The authors did not receive support from any organization for the submitted work.

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Fig. 15: The CEF non-linear distance measure between clusters
Fig. 16: The CEF non-linear distance measure between clusters
Fig. 17: Variance of classification error
Fig. 18: Flowchart of the algorithm
Fig. 19: Timing analysis of single-point CEF calculation
Fig. 20: Timing analysis of single-point cluster optimization
Fig. 21: Timing analysis of single-point cluster optimization
Fig. 22: Timing analysis of CVI optimization
Fig. 23: Timing analysis of average distance optimization
Fig. 24: Timing analysis of reclustering optimization

Appendix B

Table 9 UCI Dataset

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gokcay, E. Entropy based streaming big-data reduction with adjustable compression ratio. Multimed Tools Appl 83, 2647–2681 (2024). https://doi.org/10.1007/s11042-023-15897-7

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-023-15897-7
