
A framework of irregularity enlightenment for data pre-processing in data mining

Annals of Operations Research

Abstract

Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, many parameter estimation procedures produce considerable bias when significant irregularities are not properly handled. Most data cleaning tools assume a single known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for situations in which multiple irregularities are hidden in large volumes of data in general and cross-sectional time series in particular. It develops an automatic data mining platform that captures key irregularities and classifies them by their importance in a database. By decomposing time series data into basic components, we optimize a penalized least-squares loss function to select key irregularities in consecutive steps and cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.
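The abstract describes the core selection step only at a high level: candidate irregularities are encoded as basic time-series components and a penalized least-squares loss picks out the important ones. The sketch below illustrates that idea on a single series using an L1 (lasso) penalty over impulse and level-shift basis columns. It is a minimal illustration, not the authors' IE implementation; the function names, the use of scikit-learn's Lasso, and the penalty weight are assumptions made for the example.

```python
# Minimal sketch of penalized least-squares irregularity selection
# (illustrative only; not the IE framework's actual implementation).
import numpy as np
from sklearn.linear_model import Lasso  # assumes scikit-learn is installed

def irregularity_basis(n):
    """Design matrix whose columns are candidate impulse (spike) and
    step (level-shift) components at each time point."""
    impulses = np.eye(n)              # column t: one-off spike at time t
    steps = np.tril(np.ones((n, n)))  # column t: level shift from time t onward
    return np.hstack([impulses, steps[:, 1:]])  # drop the all-ones step (intercept)

def detect_irregularities(y, alpha=0.02):
    """Fit a lasso of the series on the irregularity basis and return the
    time indices of the selected impulse and step components.
    `alpha` is the penalty weight; in practice it would be tuned, e.g. by
    cross-validation or a variation-reduction criterion."""
    n = len(y)
    X = irregularity_basis(n)
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10_000)
    model.fit(X, y)
    coef = model.coef_
    impulse_idx = np.flatnonzero(np.abs(coef[:n]) > 1e-8)
    step_idx = np.flatnonzero(np.abs(coef[n:]) > 1e-8) + 1  # step columns start at t = 1
    return impulse_idx, step_idx

# Toy series: noise plus a spike at t = 30 and a level shift at t = 60.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 0.3, 100)
y[30] += 4.0       # impulse irregularity
y[60:] += 3.0      # step (level-shift) irregularity
print(detect_irregularities(y))
```

In the framework described above, this kind of selection would be applied iteratively over components and combined with clustering of the series into groups; the single-series example is only meant to make the penalty mechanics concrete.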



Author information

Correspondence to Wei Jiang.

Additional information

This work is supported by National Science Foundation Grant #IIS-0542881.


About this article

Cite this article

Au, S.T., Duan, R., Hesar, S.G., et al. A framework of irregularity enlightenment for data pre-processing in data mining. Ann Oper Res 174, 47–66 (2010). https://doi.org/10.1007/s10479-008-0494-z


