A framework for image dark data assessment

World Wide Web

Abstract

Image dark data, whose content and value are unclear, persistently occupy storage space yet rarely produce much value. Blindly applying data mining techniques to such data is likely to yield disappointing results and waste substantial resources. It is therefore important to assess dark data before mining, so that users can understand what the data contain. Dark data assessment poses several challenges. First, the similarity between images must be measured objectively under a unified standard so that users can interpret the evaluation values. Second, the semantic features that are captured must generalize. Third, the assessment scheme must be efficient enough to support large-scale datasets. To overcome these challenges, we propose an assessment framework that consists of offline calculation and online assessment. In offline calculation, we first transform unlabeled images into hash codes with our Deep Self-taught Hashing (DSTH) algorithm, which extracts semantic features with generalization ability; we then construct a semantic graph using a restricted Hamming distance; finally, we use our Semantic Hash Ranking (SHR) algorithm to calculate an overall importance score (rank) for each node (image), taking into account both the number of connected links and the weights on the edges. During online assessment, we first translate the user’s query (semantic images) into hash codes using the DSTH model, then match data in the dark data within a predefined Hamming distance query range, and finally return the weighted average value of the matched data to help the user understand the dark data. Results on a real-world dataset show that our framework scales to large datasets, helps users evaluate dark data against different requirements, and assists them in conducting subsequent data mining.
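The offline and online stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The hash codes are assumed to be given already as integers (standing in for DSTH output), the ranking step is a generic weighted PageRank-style iteration used as a stand-in for SHR, and the edge-weight formula, the function names (`build_graph`, `rank_nodes`, `assess`), and all parameter values are hypothetical choices made only for illustration.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Hamming distance between two hash codes stored as integers."""
    return bin(a ^ b).count("1")


def build_graph(codes, max_dist):
    """Link images whose codes differ in at most max_dist bits.

    Assumed edge weight: max_dist + 1 - distance, so closer codes weigh more.
    """
    graph = {i: {} for i in range(len(codes))}
    for i, j in combinations(range(len(codes)), 2):
        d = hamming(codes[i], codes[j])
        if d <= max_dist:
            weight = max_dist + 1 - d
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """Weighted PageRank-style power iteration (a stand-in for SHR).

    A node's score grows with both the number of its links and their weights.
    Isolated nodes simply keep the baseline (1 - damping) / n.
    """
    n = len(graph)
    scores = {v: 1.0 / n for v in graph}
    strength = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    for _ in range(iterations):
        updated = {}
        for v, nbrs in graph.items():
            inflow = sum(scores[u] * w / strength[u] for u, w in nbrs.items())
            updated[v] = (1 - damping) / n + damping * inflow
        scores = updated
    return scores


def assess(query_code, codes, scores, radius):
    """Weighted average score of stored images within radius bits of the query."""
    total = 0.0
    weight_sum = 0.0
    for i, code in enumerate(codes):
        d = hamming(query_code, code)
        if d <= radius:
            weight = radius + 1 - d  # assumed weighting: nearer matches count more
            total += weight * scores[i]
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


if __name__ == "__main__":
    codes = [0b0000, 0b0001, 0b0011, 0b1111]  # toy 4-bit hash codes
    graph = build_graph(codes, max_dist=2)
    scores = rank_nodes(graph)
    print(assess(0b0001, codes, scores, radius=1))
```

In this sketch a query that matches nothing within the radius returns 0.0, which is one possible convention; the pairwise graph construction is quadratic in the number of images and is only suitable for toy inputs.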

Notes

  1. https://www.gartner.com/it-glossary/dark-data/

Acknowledgments

This work is supported by the Innovation Group Project of the National Natural Science Foundation of China (No. 61821003), the National Key Research and Development Program of China (grant No. 2016YFB0800402), and the National Natural Science Foundation of China (Nos. 61672254 and 61902135).

Author information

Corresponding author

Correspondence to Yu Liu.

Additional information

This article belongs to the Topical Collection: Special Issue on Web and Big Data 2019

Guest Editors: Jie Shao, Man Lung Yiu, and Toyoda Masashi

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhou, K., Wang, Y., Liu, Y. et al. A framework for image dark data assessment. World Wide Web 23, 2079–2105 (2020). https://doi.org/10.1007/s11280-020-00779-x

  • DOI: https://doi.org/10.1007/s11280-020-00779-x
