Abstract
In the current era, communication through mobile devices is becoming more personalized with the evolution of touch-based input methods. While writing on touch-responsive devices, searching for emojis that capture the true intent is cumbersome. Existing solutions consider either only text or only stroke-based drawings to predict appropriate emojis; by relying on a single input, they fail to leverage the full context. While the user is writing digitally, it is challenging for a model to identify whether the intention is to write text or to draw an emoji. Moreover, the model's memory footprint and latency play an essential role in providing a seamless writing experience. In this paper, we investigate the effectiveness of combining text and drawing as input to the model. We present SAMNet, a multimodal deep neural network that jointly learns text and image features, where image features are extracted from the stroke-based drawing and text features from the previously written context. We also demonstrate the optimal way to fuse features from both modalities. The paper focuses on improving the user experience and providing low latency on edge devices. We trained our model on a carefully crafted dataset of 63 emoji classes and evaluated its performance, achieving a worst-case on-device inference time of 60 ms and 76.74% top-3 prediction accuracy with a model size of 3.5 MB. Compared with the closest matching application, DigitalInk, SAMNet improves top-3 prediction accuracy by 13.95%.
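The abstract describes fusing text features with image features extracted from a stroke-based drawing, then classifying over 63 emoji classes. The paper's exact architecture is not reproduced here; the sketch below illustrates only the general idea of concatenation-based late fusion followed by a softmax classifier. All dimensions, the `fuse_and_classify` helper, and the random stand-in embeddings are hypothetical, not taken from SAMNet.

```python
import numpy as np

NUM_CLASSES = 63               # emoji classes, as stated in the abstract
TEXT_DIM, IMG_DIM = 128, 256   # hypothetical embedding sizes

def fuse_and_classify(text_emb, img_emb, w, b):
    """Late fusion: concatenate per-modality embeddings, then apply a
    linear classifier with a numerically stable softmax."""
    fused = np.concatenate([text_emb, img_emb])   # shape (TEXT_DIM + IMG_DIM,)
    logits = w @ fused + b                        # shape (NUM_CLASSES,)
    exp = np.exp(logits - logits.max())           # subtract max for stability
    return exp / exp.sum()                        # probabilities summing to 1

rng = np.random.default_rng(0)
# Stand-in embeddings; in a SAMNet-style model these would come from the
# text encoder and the stroke-drawing (image) encoder respectively.
text_emb = rng.standard_normal(TEXT_DIM)
img_emb = rng.standard_normal(IMG_DIM)
w = rng.standard_normal((NUM_CLASSES, TEXT_DIM + IMG_DIM)) * 0.01
b = np.zeros(NUM_CLASSES)

probs = fuse_and_classify(text_emb, img_emb, w, b)
top3 = np.argsort(probs)[-3:][::-1]   # indices of the top-3 predicted emojis
```

In a trained system the top-3 indices would be mapped to emoji candidates shown to the user, which is why the abstract reports top-3 accuracy.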
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Gothe, S.V., Khurana, R., Vachhani, J.R., Rakshith, S., Kashyap, P. (2023). SAMNet: Semantic Aware Multimodal Network for Emoji Drawing Classification. In: Gupta, D., Bhurchandi, K., Murala, S., Raman, B., Kumar, S. (eds) Computer Vision and Image Processing. CVIP 2022. Communications in Computer and Information Science, vol 1777. Springer, Cham. https://doi.org/10.1007/978-3-031-31417-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-31416-2
Online ISBN: 978-3-031-31417-9
eBook Packages: Computer Science (R0)