
ARIS: A Noise Insensitive Data Pre-Processing Scheme for Data Reduction Using Influence Space

Published: 30 July 2022

Abstract

The rapid growth of data volume poses many challenges for data analysis and retrieval. Noise and redundancy are typical examples: they can reduce the reliability of analysis and retrieval results and increase storage and computing overhead. To address these problems, this article proposes ARIS, a two-stage data pre-processing framework for noise identification and data reduction. The first stage identifies and removes noise in three steps. First, the influence space (IS) is introduced to characterize the data distribution. Second, a ranking factor (RF) is defined to describe how likely a point is to be noise, and noise is then defined in terms of RF. Third, a clean dataset (CD) is obtained by removing the noise from the original dataset. The second stage learns representative data and performs the reduction: CD is divided into multiple small regions by IS, and the reduced dataset is formed by collecting the representation of each region. The performance of ARIS is verified by experiments on artificial and real datasets. The results show that ARIS effectively weakens the impact of noise, reduces the amount of data, and significantly improves the accuracy of data analysis at a reasonable time cost.
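
The two stages described above can be illustrated with a rough sketch. The abstract does not give the formal definitions of IS or RF, so everything concrete here is an assumption: the influence space of a point is taken to be the union of its k nearest neighbours and its reverse k nearest neighbours, the ranking factor is a stand-in density ratio (mean density over the influence space divided by the point's own density), and each region is represented by its mean. The names `knn_indices`, `influence_space`, `ranking_factor`, `aris_like_reduce`, and the parameters `k` and `rf_threshold` are hypothetical and are not taken from the article.

```python
import numpy as np

def knn_indices(X, k):
    """Return the k nearest neighbours of each row (self excluded) and the k-th neighbour distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                      # a point is not its own neighbour
    order = np.argsort(d, axis=1)[:, :k]
    kdist = np.take_along_axis(d, order[:, -1:], axis=1).ravel()
    return order, kdist

def influence_space(nn):
    """Assumed IS: boolean matrix where row i marks kNN(i) united with reverse-kNN(i)."""
    n, k = nn.shape
    is_knn = np.zeros((n, n), dtype=bool)
    is_knn[np.repeat(np.arange(n), k), nn.ravel()] = True
    return is_knn | is_knn.T

def ranking_factor(X, k):
    """Stand-in RF: mean density over the influence space divided by the point's own density."""
    nn, kdist = knn_indices(X, k)
    space = influence_space(nn)
    density = 1.0 / (kdist + 1e-12)                  # small constant guards against duplicates
    mean_is_density = (space.astype(float) @ density) / space.sum(axis=1)
    return mean_is_density / density

def aris_like_reduce(X, k=5, rf_threshold=2.0):
    """Stage 1: drop points with a high RF. Stage 2: one representative per IS region."""
    rf = ranking_factor(X, k)
    clean = X[rf <= rf_threshold]                    # hypothetical noise rule: RF above a threshold

    nn, kdist = knn_indices(clean, k)
    space = influence_space(nn)
    assigned = np.zeros(len(clean), dtype=bool)
    representatives = []
    for i in np.argsort(kdist):                      # visit the densest points first
        if assigned[i]:
            continue
        region = space[i] & ~assigned                # unassigned members of this point's IS
        region[i] = True
        assigned |= region
        representatives.append(clean[region].mean(axis=0))
    return clean, np.array(representatives)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = np.vstack([
        rng.normal(0.0, 0.1, size=(30, 2)),          # first dense cluster
        rng.normal(5.0, 0.1, size=(30, 2)),          # second dense cluster
        [[50.0, 50.0]],                              # one isolated point acting as noise
    ])
    clean, reduced = aris_like_reduce(data)
    print(len(data), len(clean), len(reduced))
```

The greedy, densest-first region assignment is only one simple way to turn influence spaces into non-overlapping regions; the partitioning rule and the choice of representative in ARIS itself may differ.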

Published In

ACM Transactions on Knowledge Discovery from Data, Volume 16, Issue 6
December 2022
631 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/3543989

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 July 2022
Online AM: 15 March 2022
Accepted: 01 February 2022
Revised: 01 December 2021
Received: 01 August 2021
Published in TKDD Volume 16, Issue 6

Author Tags

  1. Data pre-processing scheme
  2. influence space
  3. noise identification
  4. data representation
  5. ranking factor

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Key Research and Development Projects of Shanxi Province
  • Central Government Guides Local Science and Technology Development Funds
  • Fundamental Research Program of Shanxi Province
