Abstract
This article evaluates the use of CLIP (Contrastive Language-Image Pre-training) for analyzing aerial images of power line infrastructure. Companies record video from drones and helicopters to assess the health of this infrastructure, which yields hours of unlabeled footage. The study proposes a semi-supervised approach that combines natural language processing and image understanding to learn a common representation of images and text. A small set of images, labeled according to criteria such as transmission tower type, camera angle, and background, was used to fine-tune CLIP to generate domain-specific embeddings. The approach achieved an F1 score above 96% for detecting transmission towers, so it could be used to classify unlabeled aerial images automatically as the first step in maintenance data pipelines, for example for predictive detection of component anomalies or the presence of nests or plants.
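To make the idea of a common image-text representation concrete, the following minimal sketch classifies an image by picking the text prompt whose embedding has the highest cosine similarity to the image embedding, and then scores the predictions with F1. It is an illustration only, not the authors' implementation: the three-dimensional vectors, the labels "tower" and "no tower", and the function names are all hypothetical stand-ins. In a real pipeline the vectors would come from CLIP's image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image."""
    return max(prompt_embs, key=lambda label: cosine(image_emb, prompt_embs[label]))

def f1_score(y_true, y_pred, positive):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy embeddings standing in for encoder outputs.
PROMPTS = {
    "tower": [0.9, 0.1, 0.0],
    "no tower": [0.1, 0.9, 0.0],
}
IMAGES = [
    ([0.8, 0.2, 0.1], "tower"),
    ([0.7, 0.3, 0.0], "tower"),
    ([0.2, 0.9, 0.1], "no tower"),
    ([0.6, 0.5, 0.0], "no tower"),  # ambiguous image, ends up misclassified
]

labels = [label for _, label in IMAGES]
predictions = [classify(emb, PROMPTS) for emb, _ in IMAGES]
f1 = f1_score(labels, predictions, positive="tower")
print(predictions, f"F1={f1:.2f}")
```

The fourth toy image lies closer to the "tower" prompt than to its true label, so the example produces one false positive and an F1 below 1. This is the kind of error that fine-tuning on domain-specific labeled images aims to reduce.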
Acknowledgements
The authors acknowledge funding under grants AI4TES and PDC2021–121567-C21, funded respectively by the Spanish Ministry of Economic Affairs and Digital Transformation and by MCIN/AEI/10.13039/501100011033/, and by EU NextGenerationEU.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Losada, A., Bernardos, A.M., Besada, J. (2023). Image Classification Using Contrastive Language-Image Pre-training: Application to Aerial Views of Power Line Infrastructures. In: García Bringas, P., et al. 18th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2023). SOCO 2023. Lecture Notes in Networks and Systems, vol 750. Springer, Cham. https://doi.org/10.1007/978-3-031-42536-3_2
DOI: https://doi.org/10.1007/978-3-031-42536-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42535-6
Online ISBN: 978-3-031-42536-3
eBook Packages: Intelligent Technologies and Robotics (R0)