Optimale Dimensionswahl bei der Bearbeitung des Similarity Join (Optimal Dimension Selection in Similarity Join Processing)

Böhm, Christian; Kriegel, Hans-Peter

doi:10.1007/s00450-002-0104-2

Original contributions (Originalbeiträge)
Published: 01 July 2002

Volume 17, pages 68–76, (2002)

Informatik - Forschung und Entwicklung

Christian Böhm¹ &
Hans-Peter Kriegel²

Zusammenfassung (translated from the German).

The similarity join plays an increasing role in various data mining applications. Although several algorithms have already been proposed for evaluating this basic operation of modern database applications, and although these algorithms are clearly CPU-dominated, hardly any approaches so far address the CPU aspect. In this contribution we propose a general principle for reducing distance computations that can be used with many basic similarity join algorithms, e.g. the R-tree similarity join and its variants, the \(\varepsilon\)-kdB-tree, or a spatial hashing method. Our solution consists of a plane-sweep-like procedure in which the optimal sort dimension is determined according to a probability model. In an extensive experimental study we demonstrate the superiority of our method both over various basic join methods without plane-sweep evaluation and over the simple plane-sweep-like method without dimension selection.

Abstract.

The similarity join plays an increasingly important role in various data mining applications. Several algorithms have been proposed for computing this important database primitive of modern applications. Although these algorithms are clearly CPU bound, no solution so far concentrates on the CPU aspect. In this paper we propose a general technique for reducing the number of distance calculations. Our technique can be applied on top of many basic algorithms for the similarity join, such as the R-tree similarity join and its variants, the \(\varepsilon\)-kdB-tree, or a spatial hashing method. Our solution is a method similar to plane sweeping in which the sweep dimension is selected according to a probability model. In an extensive experimental evaluation, we show the superiority of our approach over different basic similarity join algorithms as well as over simple sweeping without selection of the optimal dimension.
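The general idea of a plane-sweep-like similarity join can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the probability model, so the sketch substitutes a simple, hypothetical heuristic (pick the dimension with the largest variance) for the dimension selection. All function names are assumptions made for illustration. The sketch sorts the points along the chosen dimension and computes a full distance only for pairs whose coordinates in that dimension differ by at most \(\varepsilon\).

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def choose_sweep_dimension(points: List[Point]) -> int:
    """Hypothetical stand-in for the paper's probability model:
    return the index of the dimension with the largest variance."""
    n = len(points)
    best_dim, best_var = 0, -1.0
    for d in range(len(points[0])):
        mean = sum(p[d] for p in points) / n
        var = sum((p[d] - mean) ** 2 for p in points) / n
        if var > best_var:
            best_dim, best_var = d, var
    return best_dim


def sweep_self_join(points: List[Point], eps: float,
                    dim: int) -> Tuple[List[Tuple[int, int]], int]:
    """Return all index pairs (i, j), i < j, with Euclidean distance <= eps,
    plus the number of full distance computations that were needed."""
    order = sorted(range(len(points)), key=lambda i: points[i][dim])
    pairs, computations = [], 0
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            # Points are sorted along dim, so once the gap exceeds eps
            # no later point can be within distance eps of point i.
            if points[j][dim] - points[i][dim] > eps:
                break
            computations += 1
            if math.dist(points[i], points[j]) <= eps:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs), computations


def nested_loop_self_join(points: List[Point],
                          eps: float) -> List[Tuple[int, int]]:
    """Baseline without sweeping: compare every pair of points."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data: wide spread in dimension 0, narrow in dimension 1.
    data = [(rng.uniform(0, 100), rng.uniform(0, 1)) for _ in range(200)]
    for d in range(2):
        _, cost = sweep_self_join(data, 0.5, d)
        print(f"sweep along dimension {d}: {cost} distance computations")
    print("heuristic picks dimension", choose_sweep_dimension(data))
```

Running the script on the synthetic data shows how strongly the number of distance computations can depend on the sweep dimension, which is the effect that a dimension selection step aims to exploit.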

Author information

Authors and Affiliations

Christian Böhm: Abteilung für Datenbanksysteme, Private Universität für Medizinische Informatik und Technik, Innrain 98, 6020 Innsbruck, Austria (e-mail: christian.boehm@umit.at)
Hans-Peter Kriegel: Institut für Informatik, Ludwig-Maximilians-Universität München, Oettingenstr. 67, 80538 München, Germany (e-mail: kriegel@dbs.informatik.uni-muenchen.de)

Additional information

Received 13 February 2001 / Accepted 25 February 2002

About this article

Cite this article

Böhm, C., Kriegel, HP. Optimale Dimensionswahl bei der Bearbeitung des Similarity Join. Informatik Forsch Entw 17, 68–76 (2002). https://doi.org/10.1007/s00450-002-0104-2

Published: 01 July 2002
Issue Date: July 2002
DOI: https://doi.org/10.1007/s00450-002-0104-2

Schlüsselwörter: Ähnlichkeitsverbund, Ähnlichkeitssuche, Multimedia-Datenbank, Data Mining, Indexstruktur

Keywords: Similarity join, Similarity search, Multimedia database, Data mining, Index structure

CR Subject Classification: H.3.2, H.3.3, H.5.1, I.2.6
