
A framework of irregularity enlightenment for data pre-processing in data mining

Annals of Operations Research

Abstract

Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, many parameter estimation procedures produce considerable bias when significant irregularities are not properly handled. Most data cleaning tools assume a single known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for situations in which multiple irregularities are hidden in large volumes of data in general and cross-sectional time series in particular. It develops an automatic data mining platform that captures key irregularities and classifies them by their importance in a database. By decomposing time series data into basic components, we optimize a penalized least-squares loss function to select key irregularities in consecutive steps and cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.
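The abstract describes the core selection step only at a high level: candidate irregularities are encoded as basic time-series components and a penalized least-squares loss picks out the important ones. The sketch below illustrates that idea on a single series using an L1 (lasso) penalty over impulse and level-shift basis columns. It is a minimal illustration, not the authors' IE implementation; the function names, the use of scikit-learn's Lasso, and the penalty weight are assumptions made for the example.

```python
# Minimal sketch of penalized least-squares irregularity selection
# (illustrative only; not the IE framework's actual implementation).
import numpy as np
from sklearn.linear_model import Lasso  # assumes scikit-learn is installed

def irregularity_basis(n):
    """Design matrix whose columns are candidate impulse (spike) and
    step (level-shift) components at each time point."""
    impulses = np.eye(n)              # column t: one-off spike at time t
    steps = np.tril(np.ones((n, n)))  # column t: level shift from time t onward
    return np.hstack([impulses, steps[:, 1:]])  # drop the all-ones step (intercept)

def detect_irregularities(y, alpha=0.02):
    """Fit a lasso of the series on the irregularity basis and return the
    time indices of the selected impulse and step components.
    `alpha` is the penalty weight; in practice it would be tuned, e.g. by
    cross-validation or a variation-reduction criterion."""
    n = len(y)
    X = irregularity_basis(n)
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10_000)
    model.fit(X, y)
    coef = model.coef_
    impulse_idx = np.flatnonzero(np.abs(coef[:n]) > 1e-8)
    step_idx = np.flatnonzero(np.abs(coef[n:]) > 1e-8) + 1  # step columns start at t = 1
    return impulse_idx, step_idx

# Toy series: noise plus a spike at t = 30 and a level shift at t = 60.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 0.3, 100)
y[30] += 4.0       # impulse irregularity
y[60:] += 3.0      # step (level-shift) irregularity
print(detect_irregularities(y))
```

In the framework described above, this kind of selection would be applied iteratively over components and combined with clustering of the series into groups; the single-series example is only meant to make the penalty mechanics concrete.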



Author information

Correspondence to Wei Jiang.

Additional information

This work is supported by National Science Foundation Grant #IIS-0542881.


About this article

Cite this article

Au, S.T., Duan, R., Hesar, S.G., et al. A framework of irregularity enlightenment for data pre-processing in data mining. Ann Oper Res 174, 47–66 (2010). https://doi.org/10.1007/s10479-008-0494-z


