Hollow-tree: a metric access method for data with missing values

Journal of Intelligent Information Systems

Abstract

Similarity search is fundamental to storing and retrieving the large volumes of complex data required by many real-world applications, and the query-by-similarity is a useful mechanism for it. Metric similarity functions, thanks to their topological properties, can be used to index data sets so that the so-called metric access methods can query them effectively and efficiently. However, data produced by various application domains, and the varying data types handled, often contain missing values and therefore do not meet the requirements of a metric similarity function. As a consequence, missing data distort the index structure and bias the query answers. In this paper, we propose the Hollow-tree, a novel access method aimed at retrieving data with missing attribute values. It employs new indexing and searching strategies that handle missing data when the cause of missingness is ignorable. The indexing strategy is based on a family of distance functions that measure the distance between elements with missing values, along with a set of policies that organize the elements in the index without distorting its internal structure. The searching strategy uses the fractal dimension of the data to achieve accurate query answers while including elements with missing values in the response. Experiments on a variety of real and synthetic data sets show that, while other metric access methods deteriorate even with small amounts of missing values, the Hollow-tree maintains almost 100% precision and recall for range queries and more than 90% for k-nearest neighbor queries, with up to 40% of missing values.
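As an illustration of why missing values conflict with the metric requirements, the sketch below implements a generic partial Euclidean distance in Python. It is not the family of distance functions proposed for the Hollow-tree, which this abstract does not specify. The function name `partial_euclidean`, the use of `None` to mark a missing attribute, and the rescaling by the fraction of observed attributes are all assumptions made for this example.

```python
import math
from typing import Optional, Sequence

# A missing attribute value is represented by None (an assumption of this sketch).
Vector = Sequence[Optional[float]]


def partial_euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance over the attributes observed in both vectors.

    The squared sum is rescaled by n / m, where n is the total number of
    attributes and m the number observed in both vectors, so that distances
    computed from few attributes stay comparable to complete ones.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of attributes")
    squared = 0.0
    observed = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip attributes missing in either element
        squared += (x - y) ** 2
        observed += 1
    if observed == 0:
        return math.inf  # no shared attribute: the distance is undefined
    return math.sqrt(squared * len(a) / observed)


# An element with a missing value can break the triangle inequality.
a = [0.0, None]
b = [0.0, 0.0]
c = [0.0, 100.0]
violates = partial_euclidean(b, c) > partial_euclidean(b, a) + partial_euclidean(a, c)
```

Here `a` is at distance 0 from both `b` and `c`, yet `b` and `c` are 100 apart, so `violates` is `True`. A tree that prunes subtrees using the triangle inequality could therefore discard valid answers, which is one way the index distortion mentioned above can arise.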

Notes

  1. http://www.inmet.gov.br/portal/

  2. www.agritempo.gov.br

  3. http://www.gbdi.icmc.usp.br/old/arboretum

Acknowledgements

This research was financed, in part, by grant number 2016/17078-0 from the São Paulo Research Foundation (FAPESP), by grant number 1406799 from the Coordination for the Improvement of Higher Education Personnel (CAPES), and by grant numbers 150626/2017-7, 433328/2018-5, 309061/2017-2, 307615/2017-0, and 437420/2018-3 from the National Council for Scientific and Technological Development (CNPq).

Author information

Corresponding author

Correspondence to Safia Brinis.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper was supported by CNPq, CAPES and FAPESP

About this article

Cite this article

Brinis, S., Traina, C. & Traina, A.J.M. Hollow-tree: a metric access method for data with missing values. J Intell Inf Syst 53, 481–508 (2019). https://doi.org/10.1007/s10844-019-00567-8

  • DOI: https://doi.org/10.1007/s10844-019-00567-8
