Abstract
Linear relations have proven valuable in rule discovery for stocks, for example: if stock X goes up by a, stock Y will go down by b. Traditional linear regression models the linear relation between two sequences faithfully. However, if a user needs to cluster stocks into groups whose sequences have high linearity or similarity with each other, comparing sequences one by one is prohibitively expensive. In this paper, we present the generalized regression model (GRM), which matches the linearity of multiple sequences at a time. GRM also gives strong heuristic support for graceful and efficient clustering. Experiments on stocks in the NASDAQ market mined interesting clusters of stock trends efficiently.
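To make the cost argument concrete, the following is a minimal sketch of the traditional baseline only, not of GRM, whose formulation the abstract does not give. It fits an ordinary least-squares line between two equal-length sequences and then repeats the fit for every pair, which requires N(N-1)/2 regressions for N sequences. The names `linear_fit` and `pairwise_linearity`, and the use of R² as the linearity score, are assumptions made for illustration.

```python
from itertools import combinations


def linear_fit(x, y):
    """Least-squares fit y ~ b0 + b1 * x for two equal-length sequences.

    Returns (b0, b1, r2), where r2 is the coefficient of determination.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined linear fit")
    b1 = sxy / sxx               # slope: change in y per unit change in x
    b0 = mean_y - b1 * mean_x    # intercept
    r2 = (sxy * sxy) / (sxx * syy)
    return b0, b1, r2


def pairwise_linearity(sequences):
    """Score every pair of named sequences by R^2 (assumed linearity measure).

    `sequences` maps a name to a list of values; the result maps each
    unordered pair of names to its R^2, so N sequences cost N(N-1)/2 fits.
    """
    return {
        (i, j): linear_fit(sequences[i], sequences[j])[2]
        for i, j in combinations(sorted(sequences), 2)
    }
```

With a few thousand stocks this pairwise loop already needs millions of regressions, which is the expense the abstract refers to when motivating a model that handles multiple sequences at once.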
Author information
Hansheng Lei received his BE from Ocean University of China in 1998, his MS from the University of Science and Technology of China in 2001, and his Ph.D. from the University at Buffalo, the State University of New York, in February 2006, all in computer science. He is currently an assistant professor in the CS/CIS Department at the University of Texas at Brownsville. His research interests include biometrics, pattern recognition, machine learning, and data mining.
Venu Govindaraju is a professor of Computer Science and Engineering at the University at Buffalo (UB), State University of New York. He received his B.Tech. (Honors) from the Indian Institute of Technology (IIT), Kharagpur, India, in 1986, and his Ph.D. in Computer Science from UB in 1992. His research focuses on pattern recognition applications in biometrics and digital libraries.
About this article
Cite this article
Lei, H., Govindaraju, V. Generalized regression model for sequence matching and clustering. Knowl Inf Syst 12, 77–94 (2007). https://doi.org/10.1007/s10115-006-0008-8
DOI: https://doi.org/10.1007/s10115-006-0008-8