Abstract
A novel detection framework, the Markov Clustering Network (MCN), is proposed for fast and robust scene text detection. Unlike traditional top-down scene text detection approaches, which are inherited from classic object detection, MCN detects scene text objects in a bottom-up manner. MCN predicts instance-level bounding boxes by first converting an image into a stochastic flow graph and then performing Markov Clustering on the predicted stochastic flows. The stochastic flows encode the local correlation and semantic information of scene text objects. An object is modeled as a set of nodes strongly connected by flows, which allows flexible, bottom-up detection of scale-varying and rotated text objects without prior knowledge of object size. Flow prediction is supported by advanced Convolutional Neural Network architectures and a position-aware spatial attention mechanism, which enhances flow prediction by adaptively fusing spatial representations. Experimental evaluation on public benchmarks shows that MCN achieves state-of-the-art performance, especially in retrieving long and oriented texts.
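To make the clustering step more concrete, the following is a minimal sketch of the classic Markov Clustering procedure (alternating expansion and inflation on a column-stochastic flow matrix). It is an illustration only, not the MCN pipeline itself: in MCN the flows are predicted by a network, whereas here they are derived from a small hand-written adjacency matrix. The function name `markov_cluster`, its parameters, and the rule of labelling each node by its lowest-index attractor are assumptions made for this example.

```python
def _normalise_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, iterations=50, eps=1e-6):
    """Return one cluster label per node (hypothetical helper, classic MCL)."""
    n = len(adjacency)
    # Add self-loops so every column has mass, then build the flow matrix.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = _normalise_columns(m)
    for _ in range(iterations):
        # Expansion: one more random-walk step (matrix squared).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power strengthens strong flows, then renormalise.
        m = _normalise_columns([[v ** inflation for v in row] for row in m])
    # Label node j by the lowest-index row (attractor) that still sends it flow.
    return [next(i for i in range(n) if m[i][j] > eps) for j in range(n)]


# Toy graph: two disconnected triangles, nodes 0-2 and nodes 3-5.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = markov_cluster(adj)
```

On this toy graph, nodes 0 to 2 receive one label and nodes 3 to 5 another, because no flow ever crosses between the two components. No cluster count or object size is supplied in advance, which mirrors the bottom-up property described above.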
Notes
We use 1D and 2D notation interchangeably to index a node. The 1D index m and the 2D index \((i_m,j_m)\) are related by \(m = i_m + \frac{H}{U}\cdot j_m\).
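The index mapping can be sketched in a few lines of Python. This assumes zero-based indices, that \(\frac{H}{U}\) is an integer giving the number of rows in the node grid, and that \(0 \le i_m < \frac{H}{U}\); the note itself states none of these conventions, and the values of H and U below are hypothetical.

```python
def to_1d(i, j, rows):
    """Map a 2D node index (i, j) to its 1D index m = i + rows * j."""
    return i + rows * j


def to_2d(m, rows):
    """Invert the mapping: recover (i, j) from the 1D index m."""
    return m % rows, m // rows


# Hypothetical values, chosen only for illustration.
H, U = 512, 16
rows = H // U           # H / U = 32 rows in the node grid
m = to_1d(5, 3, rows)   # 5 + 32 * 3
i, j = to_2d(m, rows)
```

Here `m` evaluates to 101 and `to_2d` recovers `(5, 3)`, so the two notations can be used interchangeably as long as `rows` is fixed.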
Acknowledgements
This research is supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-RP-2018-003) and by the MOE Tier-1 research grants RG126/17 (S) and RG28/18 (S).
Additional information
Communicated by Florent Perronnin.
Cite this article
Liu, Z., Lin, G. & Goh, W.L. Bottom-Up Scene Text Detection with Markov Clustering Networks. Int J Comput Vis 128, 1786–1809 (2020). https://doi.org/10.1007/s11263-020-01298-y
DOI: https://doi.org/10.1007/s11263-020-01298-y