research-article

A General and Efficient Querying Method for Learning to Hash

Authors:
Jinfeng Li, The Chinese University of Hong Kong, Hong Kong
Xiao Yan, The Chinese University of Hong Kong, Hong Kong
Jian Zhang, The Chinese University of Hong Kong, Hong Kong
An Xu, The Chinese University of Hong Kong, Hong Kong
James Cheng, The Chinese University of Hong Kong, Hong Kong
Jie Liu, The Chinese University of Hong Kong, Hong Kong
Kelvin K. W. Ng, The Chinese University of Hong Kong, Hong Kong
Ti-chung Cheng, The Chinese University of Hong Kong, Hong Kong

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data, May 2018, Pages 1333–1347
https://doi.org/10.1145/3183713.3183750

Published: 27 May 2018

ABSTRACT

As an effective solution to the approximate nearest neighbors (ANN) search problem, learning to hash (L2H) is able to learn similarity-preserving hash functions tailored for a given dataset. However, existing L2H research mainly focuses on improving query performance by learning good hash functions, while Hamming ranking (HR) is used as the default querying method. We show by analysis and experiments that Hamming distance, the similarity indicator used in HR, is too coarse-grained and thus limits the performance of query processing. We propose a new fine-grained similarity indicator, quantization distance (QD), which provides more information about the similarity between a query and the items in a bucket. We then develop two efficient querying methods based on QD, which achieve significantly better query performance than HR. Our methods are general and can work with various L2H algorithms. Our experiments demonstrate that a simple and elegant querying method can produce performance gain equivalent to advanced and complicated learning algorithms.
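
To illustrate why Hamming distance can be too coarse, the following minimal sketch compares it with one plausible form of a quantization distance. The abstract does not give the formula for QD, so the definition used here (summing the absolute projected values of the query over the bits where the query's code and the bucket's code differ) is an assumption, as are the function names, the sign-threshold binarization, and the toy values.

```python
import numpy as np

def binarize(projection):
    """Hypothetical hash step: threshold each projected value at zero."""
    return (projection >= 0).astype(np.uint8)

def hamming_distance(query_code, bucket_code):
    """Number of bit positions where the two binary codes differ."""
    return int(np.count_nonzero(query_code != bucket_code))

def quantization_distance(query_projection, bucket_code):
    """Assumed QD: sum of |projected value| over the differing bits.

    A bit whose projected value lies close to the threshold is cheap to
    flip, so a bucket that differs only on such bits gets a small distance.
    """
    differs = binarize(query_projection) != bucket_code
    return float(np.abs(query_projection)[differs].sum())

# Toy query: the second projected value sits very close to the threshold.
query_projection = np.array([0.9, -0.05, 0.4])
query_code = binarize(query_projection)           # code is [1, 0, 1]

bucket_a = np.array([1, 1, 1], dtype=np.uint8)    # differs on the near-threshold bit
bucket_b = np.array([0, 0, 1], dtype=np.uint8)    # differs on a far-from-threshold bit

hd_a = hamming_distance(query_code, bucket_a)
hd_b = hamming_distance(query_code, bucket_b)
qd_a = quantization_distance(query_projection, bucket_a)
qd_b = quantization_distance(query_projection, bucket_b)

print("Hamming distances:", hd_a, hd_b)
print("Quantization distances:", qd_a, qd_b)
```

In this toy case both buckets are at the same Hamming distance from the query, so Hamming ranking cannot order them, whereas the assumed quantization distance ranks `bucket_a` ahead of `bucket_b`.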

Index Terms

A General and Efficient Querying Method for Learning to Hash
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval models and ranking
      1. Similarity measures
      2. Top-k retrieval in databases

Published in
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713
General Chairs:
Gautam Das, University of Texas at Arlington, USA
Christopher Jermaine, Rice University, USA
Philip Bernstein, Microsoft Research, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Results Reproduced
- Results Reproduced / v1.1
Author Tags
distributed computing
large-scale similarity search
Qualifiers
- research-article
Acceptance Rates
SIGMOD '18 Paper Acceptance Rate: 90 of 461 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 9
- Total Downloads: 495
- Downloads (Last 12 months): 23
- Downloads (Last 6 weeks): 5