Impact Statement:
Autoencoders are essential for many applications, such as segmentation, classification, and depth estimation. For video autoencoders, the accuracy of the features extracted from consecutive image frames can be improved by stacking the temporal feature maps. However, this approach greatly increases the computational complexity. By contrast, the video-based autoencoder proposed in this study uses a simple series structure to progressively accumulate the enhanced temporal feature maps extracted from a fixed number of consecutive image frames. Compared with an existing state-of-the-art visual geometry group (VGG)-based image autoencoder, the proposed autoencoder reduces the absolute error and square error of the monocular depth estimates by 36.9% and 28.6%, respectively, with only a 0.35% increase in the number of network parameters. Given its low complexity and ability to extract the rich temporal and spatial features in video sequences, the proposed autoencoder provides a c...
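The impact statement contrasts stacking all frame features at once against accumulating them progressively in series. A minimal sketch of that difference is given below; the names (stacked_fusion, accumulate_series), shapes, and fusion layers are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical contrast between channel-stacking and series accumulation of
# per-frame feature maps; all names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

C, N = 64, 4  # feature channels per frame, number of consecutive frames

# Stacking: concatenate N frame features, so the fusion layer grows with N.
stacked_fusion = nn.Conv2d(N * C, C, kernel_size=3, padding=1)

# Series: one small fusion layer reused at every step, so the parameter
# count is independent of how many frames are accumulated.
series_fusion = nn.Conv2d(2 * C, C, kernel_size=3, padding=1)

def accumulate_series(frame_feats: list[torch.Tensor]) -> torch.Tensor:
    """Progressively folds each new frame's features into a running state."""
    state = frame_feats[0]
    for feat in frame_feats[1:]:
        state = series_fusion(torch.cat([state, feat], dim=1))
    return state

feats = [torch.randn(1, C, 56, 56) for _ in range(N)]
out_stacked = stacked_fusion(torch.cat(feats, dim=1))  # params scale with N
out_series = accumulate_series(feats)                  # params fixed in N
```

The series form keeps the per-step cost and parameter count constant regardless of how many frames are folded in, which is consistent with the small (0.35%) parameter overhead reported above.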
Abstract:
Convolutional neural networks with encoder and decoder structures, generally referred to as autoencoders, are used in many pixelwise transformation, detection, segmentation, and estimation applications, such as face swapping, lane detection, semantic segmentation, and depth estimation, respectively. However, traditional autoencoders, which operate on single-frame inputs, ignore the temporal consistency between consecutive frames and may therefore produce unsatisfactory results. Accordingly, in this article, a video-based depth estimation (VDE) autoencoder is proposed to improve the quality of depth estimation through the inclusion of two weighted temporal feature (WTF) modules in the encoder and a single spatial edge guided (SEG) module in the decoder. The WTF modules, designed with a channel-weighted block submodule, effectively extract the temporal similarities in consecutive frames, whereas the SEG module provides spatial edge guidance of the object contou...
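The abstract does not spell out the channel-weighted block, but a plausible minimal sketch is squeeze-and-excitation-style channel gating applied to fused features from two consecutive frames. Everything here (ChannelWeightedBlock, WTFModule, the reduction ratio) is an assumption for illustration, not the authors' code:

```python
# Hypothetical sketch of a channel-weighted temporal feature (WTF-style) block.
# Assumes SE-style channel gating over fused two-frame features; names and
# hyperparameters are illustrative only.
import torch
import torch.nn as nn

class ChannelWeightedBlock(nn.Module):
    """Reweights the channels of a fused temporal feature map (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # squeeze: global spatial average
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),             # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)       # excite: scale each channel

class WTFModule(nn.Module):
    """Fuses current- and previous-frame features, then gates channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.cwb = ChannelWeightedBlock(channels)

    def forward(self, feat_t, feat_prev):
        fused = self.fuse(torch.cat([feat_t, feat_prev], dim=1))
        return self.cwb(fused)

# Usage: features of two consecutive frames from a shared encoder stage.
wtf = WTFModule(channels=64)
f_t, f_prev = torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56)
out = wtf(f_t, f_prev)  # -> torch.Size([1, 64, 56, 56])
```

The gating step lets temporally consistent channels dominate the fused map, which is one way the stated goal of "extracting temporal similarities in consecutive frames" could be realized.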
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 2, February 2024)