Taking Advantage of Highly-Correlated Attributes in Similarity Queries with Missing Values

Rodrigues, Lucas Santiago; Cazzolato, Mirela Teixeira; Traina, Agma Juci Machado; Traina, Caetano

doi:10.1007/978-3-030-60936-8_13

Conference paper
First Online: 14 October 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12440)

Abstract

Incompleteness harms the quality of content-based retrieval and analysis in similarity queries. Missing data are usually handled with exclusion or imputation methods, which either discard incomplete tuples or infer plausible values to fill the gaps. However, such approaches can introduce bias into the data and lose useful information. Similarity queries cannot be performed over incomplete complex tuples, since distance functions are undefined over missing values. We propose the SOLID approach to allow similarity queries in complex databases without the need for either data imputation or deletion. First, SOLID finds highly-correlated metric spaces. Then, SOLID uses a weighted distance function to search by similarity over tuples of complex objects, using compatibility factors among metric spaces. Experimental results show that SOLID outperforms imputation methods under different missing rates. SOLID was up to 7.3% better than the competitors in quality when querying over incomplete tuples, reduced the error of similarity searches over incomplete data by 16.42%, and was up to 30.8 times faster than the closest competitor.
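
The abstract does not give SOLID's distance formula, so the following is only a minimal sketch of the general idea of a weighted distance over tuples with missing attributes. It assumes each tuple is a list of feature vectors (one per metric space), that a missing attribute is represented as None, that Euclidean distance is used in every space, and that the weights are hypothetical stand-ins for the compatibility factors. Renormalizing by the weights of the attributes that are present is an illustrative choice, not the method described in the paper.

```python
import math
from typing import Optional, Sequence

# One feature vector per metric space; None marks a missing attribute (assumption).
Attribute = Optional[Sequence[float]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_distance(
    query: Sequence[Attribute],
    candidate: Sequence[Attribute],
    weights: Sequence[float],
) -> float:
    """Weighted mean of per-space distances over attributes present in both tuples.

    The weights are hypothetical placeholders for compatibility factors.
    Spaces where either tuple has a missing value are skipped, and the
    result is renormalized by the total weight of the spaces actually used.
    """
    total = 0.0
    weight_sum = 0.0
    for q, c, w in zip(query, candidate, weights):
        if q is None or c is None:
            continue  # distance is undefined here, so skip this space
        total += w * euclidean(q, c)
        weight_sum += w
    if weight_sum == 0.0:
        return math.inf  # no comparable attribute at all
    return total / weight_sum


# Example tuples with two metric spaces each (illustrative values only).
query = [(0.0, 0.0), (1.0,)]
complete = [(3.0, 4.0), (3.0,)]
incomplete = [(3.0, 4.0), None]

d_complete = weighted_distance(query, complete, [0.5, 0.5])
d_incomplete = weighted_distance(query, incomplete, [0.5, 0.5])
```

With this sketch, an incomplete tuple can still be ranked in a similarity query without imputing or discarding it, because only the spaces available in both tuples contribute to the distance.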

Acknowledgments

This research was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, by the São Paulo Research Foundation (FAPESP, grants No. 2016/17078-0, 2018/24414-2, 2020/10902-5, 2020/07200-9), and the National Council for Scientific and Technological Development (CNPq).

Author information

Authors and Affiliations

Institute of Mathematics and Computer Sciences, University of São Paulo (USP), São Carlos, Brazil
Lucas Santiago Rodrigues, Mirela Teixeira Cazzolato, Agma Juci Machado Traina & Caetano Traina Jr.

Corresponding author

Correspondence to Lucas Santiago Rodrigues.

Editor information

Editors and Affiliations

National Institute of Informatics, Tokyo, Japan
Shin'ichi Satoh
ISTI-CNR, Pisa, Italy
Lucia Vadicamo
University of Southern Denmark, Odense M, Denmark
Arthur Zimek
ISTI-CNR, Pisa, Italy
Fabio Carrara
University of Bologna, Bologna, Italy
Ilaria Bartolini
IT University of Copenhagen, Copenhagen, Denmark
Martin Aumüller
IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
IT University of Copenhagen, Copenhagen, Denmark
Rasmus Pagh

About this paper

Cite this paper

Rodrigues, L.S., Cazzolato, M.T., Traina, A.J.M., Traina, C. (2020). Taking Advantage of Highly-Correlated Attributes in Similarity Queries with Missing Values. In: Satoh, S., et al. Similarity Search and Applications. SISAP 2020. Lecture Notes in Computer Science, vol 12440. Springer, Cham. https://doi.org/10.1007/978-3-030-60936-8_13

DOI: https://doi.org/10.1007/978-3-030-60936-8_13
Published: 14 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60935-1
Online ISBN: 978-3-030-60936-8
eBook Packages: Computer Science, Computer Science (R0)
