Generalized regression model for sequence matching and clustering

  • Regular Paper
Knowledge and Information Systems

Abstract

Linear relations have been found to be valuable for rule discovery in stock data, for example: if stock X goes up by a, stock Y will go down by b. Traditional linear regression models the linear relation between two sequences faithfully. However, if a user needs to cluster stocks into groups whose sequences have high linearity or similarity with each other, comparing the sequences one by one is prohibitively expensive. In this paper, we present the generalized regression model (GRM), which matches the linearity of multiple sequences at a time. GRM also gives strong heuristic support for graceful and efficient clustering. Experiments on stocks in the NASDAQ market efficiently mined interesting clusters of stock trends.
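To make the baseline concrete, the following is a minimal sketch of the traditional pairwise approach that the abstract contrasts with GRM. It is not the GRM itself, whose details are not given on this page. The function names `fit_line` and `pairwise_linearity`, the use of R² as the linearity score, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r2), where r2 is the coefficient of determination,
    used here as an assumed measure of how linear the relation is.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("constant sequences have no defined linear fit")
    b1 = sxy / sxx               # slope: change in y per unit change in x
    b0 = mean_y - b1 * mean_x    # intercept
    r2 = (sxy * sxy) / (sxx * syy)
    return b0, b1, r2


def pairwise_linearity(sequences):
    """Score every pair of named sequences with one regression each.

    For N sequences this runs N * (N - 1) / 2 regressions, which is the
    one-by-one cost the abstract describes as prohibitive.
    """
    return {
        (i, j): fit_line(sequences[i], sequences[j])[2]
        for i, j in combinations(sorted(sequences), 2)
    }


# Hypothetical toy data: Y falls by 2 whenever X rises by 1.
stock_x = [1.0, 2.0, 3.0, 4.0, 5.0]
stock_y = [10.0, 8.0, 6.0, 4.0, 2.0]
intercept, slope, r2 = fit_line(stock_x, stock_y)
```

With this toy data the slope is negative, matching the "X goes up, Y goes down" rule in the abstract. The pairwise loop grows quadratically with the number of sequences, which is the cost that motivates matching several sequences at once.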

Author information

Corresponding author

Correspondence to Hansheng Lei.

Additional information

Hansheng Lei received his BE from Ocean University of China in 1998, his MS from the University of Science and Technology of China in 2001, and his Ph.D. from the University at Buffalo, the State University of New York, in February 2006, all in computer science. He is currently an assistant professor in the CS/CIS Department, University of Texas at Brownsville. His research interests include biometrics, pattern recognition, machine learning, and data mining.

Venu Govindaraju is a professor of Computer Science and Engineering at the University at Buffalo (UB), State University of New York. He received his B.Tech. (Honors) from the Indian Institute of Technology (IIT), Kharagpur, India, in 1986, and his Ph.D. in Computer Science from UB in 1992. His research focuses on pattern recognition applications in the areas of biometrics and digital libraries.

About this article

Cite this article

Lei, H., Govindaraju, V. Generalized regression model for sequence matching and clustering. Knowl Inf Syst 12, 77–94 (2007). https://doi.org/10.1007/s10115-006-0008-8

  • DOI: https://doi.org/10.1007/s10115-006-0008-8
