On the use of Human-Computer Interaction for Projected Nearest Neighbor Search

Data Mining and Knowledge Discovery

Abstract

Nearest neighbor search is a widely used technique in a number of important application domains. In many of these domains, the dimensionality of the data representation is very high. Recent theoretical results have shown that the concept of proximity, and hence of a nearest neighbor, may not be very meaningful in the high dimensional case. It is therefore often difficult to find good quality nearest neighbors in such data sets, and equally difficult to judge the value and relevance of the returned results. In fact, it is hard for any fully automated system to satisfy a user about the quality of the nearest neighbors found unless the user is directly involved in the process. This is especially the case for high dimensional data, in which the meaningfulness of the nearest neighbors found is questionable. In this paper, we address the problem of high dimensional nearest neighbor search from the user's perspective by designing a system that relies on effective cooperation between the human and the computer. The system provides the user with visual representations of carefully chosen subspaces of the data in order to repeatedly elicit the user's preferences about the data patterns which are most closely related to the query point. These preferences are used to determine and quantify the meaningfulness of the nearest neighbors. Our system is able not only to find the nearest neighbors and quantify their meaningfulness, but also to diagnose situations in which the nearest neighbors found are truly not meaningful.

Notes

  1. The Windows version of this tool is known as GGobi, and can be downloaded from www.ggobi.org.

  2. Note that a density scatter plot is significantly easier to comprehend than a scatter plot of the original data points, which shows considerable overlap among the individual points. Also, the number of points in a lateral scatter plot can be chosen in order to provide the best visual profile.

  3. For our particular implementation, we used \(t=80\%\).

  4. To generate the Case 1 and Case 2 data sets, we used \(d=20\), \(\mu_s=0.1\), \(\gamma=2\), \(p=1\), \(q=5\), \(k=50\), \(l=4\), \(N=5000\), in accordance with the notation of Aggarwal and Yu (2000). The Case 3 data set was generated with the same parameters, except that \(d=40\).

  5. http://www.cs.uci.edu/~mlearn.

  6. Since class labels are attached to each data point in this problem, we can test the quality of the nearest neighbors by determining the percentage of returned data points which belong to the same class as the target. We note that such a determination is somewhat evidential in nature, since the relationship between the class variable and the features is not completely known.
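
The class-label check described in note 6 can be illustrated with a minimal sketch. The function name `same_class_fraction` and the example labels are hypothetical and are not taken from the paper; the sketch only shows the arithmetic of the percentage, under the assumption that the class labels of the returned neighbors are available as a list.

```python
def same_class_fraction(neighbor_labels, target_label):
    """Return the fraction of neighbors whose class label equals the target's.

    neighbor_labels: class labels of the returned nearest neighbors (hypothetical input).
    target_label: class label of the query (target) point.
    """
    if not neighbor_labels:
        raise ValueError("neighbor_labels must be non-empty")
    matches = sum(1 for label in neighbor_labels if label == target_label)
    return matches / len(neighbor_labels)


# Illustrative labels only: 3 of the 5 returned neighbors share the target class "a".
example_labels = ["a", "a", "b", "a", "c"]
fraction = same_class_fraction(example_labels, "a")
print(f"{fraction:.0%} of the neighbors share the target's class")
```

As the note cautions, a high fraction is only indirect evidence of neighbor quality, since the relationship between the class variable and the features is not completely known.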

References

  • Aggarwal, C.C. 2002. Towards meaningful high dimensional nearest neighbor search by human-computer interaction. In Proceedings of the International Conference on Data Engineering, pp. 593–604.

  • Aggarwal, C.C. 2001. Re-designing distance functions and distance based applications for high dimensional data. ACM SIGMOD Record, 30(1):13–18.

  • Aggarwal, C.C., Hinneburg, A., and Keim, D.A. 2001. On the surprising behavior of distance metrics in high dimensional space. International Conference on Database Theory, pp. 420–434.

  • Aggarwal, C.C. and Yu, P.S. 2000. Finding generalized projected clusters in high dimensional spaces. ACM SIGMOD Conference Proceedings, pp. 70–81.

  • Aggarwal, C.C. 2001. A human-computer cooperative system for effective high dimensional clustering. ACM KDD Conference Proceedings, pp. 221–226.

  • Aggarwal, C.C. and Yu, P.S. 2000. The IGrid Index: Reversing the dimensionality Curse for similarity indexing in high dimensional space. ACM KDD Conference Proceedings, pp. 119–129.

  • Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. 1998. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Conference Proceedings, pp. 94–105.

  • Ankerst, M., Ester, M., and Kriegel, H.-P. 2000. Towards an effective cooperation of the user and the computer for classification. ACM KDD Conference Proceedings, pp. 179–188.

  • Applied Visions Inc. URL: http://www.avi.com

  • Bennett, K.P., Fayyad, U., and Geiger, D. 1999. Density-based indexing for approximate nearest neighbor queries. ACM KDD Conference Proceedings, pp. 233–243.

  • Berchtold, S., Keim, D.A., and Kriegel, H.-P. 1996. The X-Tree: An index structure for high-dimensional data. Very Large Database Conference Proceedings, pp. 28–39.

  • Beyer, K., Ramakrishnan, R., Shaft, U., and Goldstein, J. 1999. When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theory, pp. 217–235.

  • Buja, A., Cook, D., Asimov, D., and Hurley, C. 1997. Dynamic projections in high-dimensional visualization: Theory and Computational Methods, Technical Report, AT&T Labs, Florham Park, NJ.

  • Chakrabarti, K. and Mehrotra, S. 2000. Local dimensionality reduction: A new approach to indexing high dimensional spaces. Very Large Database Conference Proceedings, pp. 89–100.

  • Ester, M., Kriegel, H.-P., Sander, J., Wimmer, M., and Xu, X. 1997. Density-connected sets and their application for trend detection in spatial databases. ACM KDD Conference Proceedings, pp. 10–15.

  • Faloutsos, C., Equitz, W., Niblack, W., Petkovic, D., and Barber, R. 1994. Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(4):231–262.

  • Friedman, J.H. 1985. Exploratory projection pursuit. Journal of the American Statistical Association, 82:249–286.

  • Friedman, J.H. and Tukey, J.W. 1974. A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C23:881–890.

  • Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. Very Large Database Conference Proceedings, pp. 518–529.

  • Han, J., Lakshmanan, L., and Ng, R. 1999. Constraint based multidimensional data mining. IEEE Computer, 32(8):46–50.

  • Hinneburg, A., Aggarwal, C., and Keim, D.A. 2000. What is the nearest neighbor in high dimensional spaces? Proceedings of the VLDB Conference, pp. 506–515.

  • Hinneburg, A., Keim, D.A., and Wawryniuk, M. 1999. HD-Eye: Visual mining of high dimensional data. IEEE Computer Graphics and Applications, 19(5):22–31.

  • Huber, P.J. 1985. Projection pursuit. The Annals of Statistics, 13(2):435–475.

  • Jolliffe, I.T. 1986. Principal Component Analysis, Springer-Verlag, New York.

  • Katayama, N. and Satoh, S. 1997. The SR-Tree: An index structure for high dimensional nearest neighbor queries. ACM SIGMOD Conference, pp. 369–380.

  • Katayama, N. and Satoh, S. 2001. Distinctiveness sensitive nearest neighbor search for efficient similarity retrieval of multimedia information. Proceedings of the ICDE Conference, pp. 493–502.

  • Keim, D.A. 1995. Visual Support for Query Specification and Data Mining. Ph.D. Thesis, Shaker Publishing Company, Aachen, Germany.

  • Keim, D.A., Kriegel, H.-P., and Seidl, T. 1994. Supporting data mining of large databases by visual feedback queries. ICDE Conference, pp. 302–313.

  • Kruskal, J.B. 1969. Towards a practical method which helps uncover the structure of a set of observations by finding the linear transformation which optimizes a new index of condensation. In R.C. Milton and J.A. Nelder (Eds), Statistical Computation, Academic Press, New York, pp. 427–440.

  • Lin, K.-I., Jagadish, H.V., and Faloutsos, C. 1992. The TV-tree: An index structure for high dimensional data. VLDB Journal, 3(4):517–542.

  • Rui, Y., Huang, T.S., and Mehrotra, S. 1997. Content-based image retrieval with relevance feedback in MARS. Proceedings of the IEEE Conference on Image Processing, pp. 815–818.

  • Salton, G. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, NJ.

  • Sarawagi, S. 2000. User-adaptive Exploration of Multidimensional Data. VLDB Conference Proceedings, pp. 307–316.

  • Seidl, T. and Kriegel, H.-P. 1997. Efficient User-Adaptable Similarity Search in Large Multimedia Databases. Very Large Database Conference Proceedings, pp. 506–515.

  • Silverman, B.W. 1986. Density Estimation for Statistics and Data Analysis, Chapman and Hall.

  • Swayne, D.F., Cook, D., and Buja, A. 1998. XGobi: Interactive dynamic data visualization in the X window system. Journal of Computational and Graphical Statistics, 7(1):113–130.

  • Tung, A.K.H., Ng, R., Lakshmanan, L.V.S., and Han, J. 2001. Constraint-based clustering in large databases. International Conference on Database Theory, pp. 405–419.

  • Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Very Large Database Conference Proceedings, pp. 194–205.

  • Wu, L., Faloutsos, C., Sycara, K., and Payne, T. 2000. FALCON: Feedback adaptive loop for content-based retrieval. Very Large Database Conference Proceedings, pp. 297–306.

  • Yang, L. 2000. Interactive exploration of very large relational datasets through 3D dynamic projections. ACM KDD Conference Proceedings, pp. 236–243.

Correspondence to Charu C. Aggarwal.

Aggarwal, C.C. On the use of Human-Computer Interaction for Projected Nearest Neighbor Search. Data Min Knowl Disc 13, 89–117 (2006). https://doi.org/10.1007/s10618-005-0030-6

  • DOI: https://doi.org/10.1007/s10618-005-0030-6
