Abstract
We introduce a spectrum of algorithms for measuring the similarity of high-dimensional vectors in Euclidean space. Each algorithm is a convex combination of two measures: one summarizes the shape of a vector, and the other summarizes the relative magnitudes of its coordinates. The shape measure is based on a concept called bin-score permutations, together with a metric that quantifies the similarity of permutations. The magnitude measure is based on another novel approximation for inner-product computations, built on power symmetric functions, which generalizes the Cauchy-Schwarz inequality. We present experiments on time-series data of unemployment figures from labor statistics that show the effectiveness of the algorithm as a function of the parameter that combines the two parts.
Supported in part by NSF Grant No. CCR–9821038.
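The convex combination described in the abstract can be illustrated with a minimal sketch. The abstract does not define bin-score permutations or the power-symmetric inner-product approximation, so both components below are simple stand-ins chosen for illustration only: the shape part compares the sorting permutations of the two vectors with a normalized Spearman footrule, and the magnitude part compares Euclidean norms. The names `shape_distance`, `magnitude_distance`, `combined_distance`, and the parameter `lam` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def rank_permutation(v: Sequence[float]) -> List[int]:
    """Indices of v in increasing order of value (stand-in shape summary)."""
    return sorted(range(len(v)), key=lambda i: v[i])


def shape_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Normalized Spearman footrule between the sorting permutations.

    Assumption: this replaces the paper's bin-score permutation metric,
    which the abstract does not specify.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    n = len(x)
    p, q = rank_permutation(x), rank_permutation(y)
    pos_q = [0] * n
    for rank, idx in enumerate(q):
        pos_q[idx] = rank
    total = sum(abs(rank - pos_q[idx]) for rank, idx in enumerate(p))
    max_total = (n * n) // 2  # largest possible footrule value for n items
    return total / max_total if max_total else 0.0


def magnitude_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Relative difference of Euclidean norms, in [0, 1].

    Assumption: a placeholder for the power-symmetric-function
    approximation mentioned in the abstract.
    """
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx + ny == 0.0:
        return 0.0
    return abs(nx - ny) / (nx + ny)


def combined_distance(x: Sequence[float], y: Sequence[float], lam: float) -> float:
    """Convex combination: lam weights shape, (1 - lam) weights magnitude."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * shape_distance(x, y) + (1.0 - lam) * magnitude_distance(x, y)
```

With `lam = 1` only the shape term contributes, and with `lam = 0` only the magnitude term does; intermediate values trade one off against the other, which is the role the abstract assigns to the combining parameter.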
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Eğecioğlu, Ö. (2001). Parametric Approximation Algorithms for High-Dimensional Euclidean Similarity. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_7
DOI: https://doi.org/10.1007/3-540-44794-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive