Hidden semantic hashing for fast retrieval over large scale document collection

Multimedia Tools and Applications

Abstract

The semantics of a document are latent rather than directly observable. Most existing hashing methods nevertheless ignore this fact and therefore fail to discover the hidden semantic structure. To address this issue, this paper focuses on uncovering the latent semantic structure of a document corpus during hashing, using two measures. First, a graph Laplacian constructed in the semantic space, rather than in the term-document space, captures the semantic structure of the corpus during hashing. Second, because non-negative matrix factorization (NMF) is an effective algorithm for discovering the latent semantic structure of documents, we employ NMF to extract a parts-based representation of each document. In addition, to reduce the semantic loss incurred when mapping the parts-based representation into Hamming space, we impose sparsity constraints that push the elements of the parts-based representation closer to binary values. Experimental results demonstrate that the proposed hashing method is competitive with state-of-the-art document hashing methods.
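
The general idea of hashing through NMF can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses plain Lee-Seung multiplicative updates, omits the graph Laplacian term and the sparsity constraints described in the abstract, and substitutes a simple per-dimension mean threshold for the binarization step. The toy matrix `X` (documents by terms), the helper names, and all parameter values are assumptions made for illustration.

```python
import random

def matmul(A, B):
    """Dense product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    R = matmul(W, H)
    return sum((x - r) ** 2 for xr, rr in zip(X, R) for x, r in zip(xr, rr))

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Approximate X (docs x terms) as W H with W, H >= 0.

    Uses multiplicative updates; W holds one k-dimensional
    parts-based representation per document.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update H: H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num = matmul(Wt, X)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update W: W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num = matmul(X, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def binarize(W):
    """Threshold each dimension at its mean (an assumed, simplistic rule)."""
    k = len(W[0])
    means = [sum(row[j] for row in W) / len(W) for j in range(k)]
    return [[1 if row[j] > means[j] else 0 for j in range(k)] for row in W]

def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy corpus: documents 0-1 use terms 0-2, documents 2-3 use terms 3-5.
X = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 3],
    [0, 0, 0, 2, 3, 1],
]

W, H = nmf(X, k=2)
codes = binarize(W)
# Rank all documents by Hamming distance to document 0.
ranking = sorted(range(len(codes)), key=lambda d: hamming(codes[0], codes[d]))
print(codes, ranking)
```

Retrieval then reduces to comparing short binary codes, which is why a representation whose entries are already close to 0 or 1 loses less information when thresholded.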

Notes

  1. http://archive.ics.uci.edu/ml/datasets/CNAE-9.

  2. http://people.csail.mit.edu/jrennie/20Newsgroups/.

  3. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

  4. http://www.nist.gov/speech/tests/tdt/tdt98/index.htm.

Acknowledgment

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61672254 and 61300222; the Key Project of the National Natural Science Foundation of China under Grant No. U1536203; the Natural Science Foundation of Hubei Province under Grant No. 2015CFB687; the Natural Science Foundation of Fujian Province under Grant No. 2015J01288; and the Fundamental Research Funds for the Central Universities, HUST: 2016YXMS088. The authors appreciate the valuable suggestions from the anonymous reviewers and the editors.

Author information

Corresponding author

Correspondence to Fuhao Zou.

About this article

Cite this article

Zou, F., Tang, X., Li, K. et al. Hidden semantic hashing for fast retrieval over large scale document collection. Multimed Tools Appl 77, 3677–3697 (2018). https://doi.org/10.1007/s11042-017-5219-3

  • DOI: https://doi.org/10.1007/s11042-017-5219-3
