ARIS: A Noise Insensitive Data Pre-Processing Scheme for Data Reduction Using Influence Space

Authors:

Jing Hao

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 16, Issue 6

Article No.: 110, Pages 1 - 39

https://doi.org/10.1145/3522592

Published: 30 July 2022

Abstract

The rapid growth of data volume poses many challenges for data analysis and retrieval. Noise and redundancy are typical examples: they can reduce the reliability of analysis and retrieval results and increase storage and computing overhead. To address these problems, this article proposes ARIS, a two-stage data pre-processing framework for noise identification and data reduction. The first stage identifies and removes noise in three steps. First, the influence space (IS) is introduced to characterize the data distribution. Second, a ranking factor (RF) is defined to describe how likely a point is to be noise, and noise is then defined in terms of RF. Third, a clean dataset (CD) is obtained by removing the noise from the original dataset. The second stage learns representative data to achieve data reduction: CD is divided into multiple small regions by IS, and the reduced dataset is formed by collecting the representatives of each region. The performance of ARIS is verified by experiments on artificial and real datasets. Experimental results show that ARIS effectively weakens the impact of noise, reduces the amount of data, and significantly improves the accuracy of data analysis at a reasonable time cost.
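The two stages can be illustrated with a rough sketch. The abstract does not give the formulas, so everything concrete here is an assumption made for illustration: the influence space is taken to be the k nearest neighbours of a point together with its reverse k nearest neighbours, the ranking factor is a density ratio between a point's influence space and the point itself, the threshold `rf_threshold` is arbitrary, and the region-building step is a simple greedy cover. The function names (`knn_indices`, `influence_space`, `ranking_factor`, `aris_sketch`) are hypothetical and are not taken from the article.

```python
import numpy as np

def knn_indices(X, k):
    """Return the k nearest neighbours of every point and each point's k-distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]
    kdist = d[np.arange(len(X)), idx[:, -1]]  # distance to the k-th neighbour
    return idx, kdist

def influence_space(idx):
    """ASSUMED definition: IS(p) = kNN(p) union reverse-kNN(p)."""
    spaces = [set(row.tolist()) for row in idx]
    for p in range(len(idx)):
        for q in idx[p]:
            spaces[int(q)].add(p)            # q is a neighbour of p, so p is in RkNN(q)
    return spaces

def ranking_factor(X, k):
    """ASSUMED ranking factor: mean density over IS(p) divided by the density of p."""
    idx, kdist = knn_indices(X, k)
    density = 1.0 / (kdist + 1e-12)          # inverse k-distance as a density proxy
    spaces = influence_space(idx)
    rf = np.array([np.mean(density[list(s)]) / density[p]
                   for p, s in enumerate(spaces)])
    return rf, density, spaces

def aris_sketch(X, k=5, rf_threshold=2.0):
    """Stage 1 drops high-RF points; stage 2 keeps one representative per region."""
    # Stage 1: points much sparser than their influence space are treated as noise.
    rf, _, _ = ranking_factor(X, k)
    clean = X[rf <= rf_threshold]

    # Stage 2 (hypothetical greedy cover): visit points from dense to sparse;
    # each uncovered point becomes a representative and covers its influence space.
    _, density, spaces = ranking_factor(clean, k)
    covered = np.zeros(len(clean), dtype=bool)
    reps = []
    for p in np.argsort(-density):
        if not covered[p]:
            reps.append(p)
            covered[p] = True
            covered[list(spaces[p])] = True
    return clean, clean[reps]
```

Calling `aris_sketch(X, k=5)` on an `(n, d)` array returns the cleaned points and the reduced set of representatives. The pairwise distance matrix makes this sketch quadratic in memory, so it is only suitable for small datasets, and it says nothing about the cost or accuracy of the method described in the article.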

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining
      1. Data cleaning
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Data compression

Published In

ACM Transactions on Knowledge Discovery from Data Volume 16, Issue 6

December 2022

631 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/3543989

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 July 2022

Online AM: 15 March 2022

Accepted: 01 February 2022

Revised: 01 December 2021

Received: 01 August 2021

Published in TKDD Volume 16, Issue 6

Qualifiers

Research-article
Refereed

Funding Sources

National Natural Science Foundation of China
Key Research and Development Projects of Shanxi Province
Central Government Guides Local Science and Technology Development Funds
Fundamental Research Program of Shanxi Province
