Traditional Hand Gesture Recognition (HGR) approaches often rely on multiple sensors, such as RGB, depth, and infrared cameras, to capture comprehensive multimodal data. However, this reliance increases hardware complexity and cost, limiting the widespread adoption of HGR systems. In this paper, we propose a multimodal HGR approach that leverages automatic depth estimation from RGB videos to enhance HGR performance while using only a single RGB camera. Our method integrates synthetic depth features, optical flow (OF), and RGB data through an early fusion strategy. We conduct extensive experiments with three ConvNet-based models for HGR: the 3D-CNN variants of ResNet and ResNeXt, as well as the efficient 2D-CNN-based Temporal Shift Module (TSM). Our findings indicate that the multimodal combination of synthetic depth, OF, and RGB outperforms models using only RGB or RGB+OF inputs, with ResNeXt-101 achieving the highest accuracy. To validate our approach, we employ the IPN Hand dataset, which we have refined by correcting temporal annotation inconsistencies and by increasing the number of gesture classes from 13 to 14, separating four specific gestures with similar dynamics. Furthermore, we compute high-quality OF and depth maps for all 800,000 frames in the dataset. The enhanced, multimodal IPN Hand dataset will soon be available at github.com/GibranBenitez/IPN-hand.
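For concreteness, the snippet below is a minimal PyTorch sketch of what early fusion of the three modalities could look like: per-frame channel-wise concatenation of RGB, OF, and estimated depth before a 3D-CNN backbone. The `EarlyFusion3DCNN` wrapper, the channel counts, and the stem configuration are illustrative assumptions, not the exact implementation described in the paper.

```python
import torch
import torch.nn as nn

class EarlyFusion3DCNN(nn.Module):
    """Illustrative early-fusion wrapper (assumed structure, not the paper's code):
    RGB (3 ch) + optical flow (2 ch) + estimated depth (1 ch) are concatenated
    per frame into a 6-channel clip that feeds a 3D-CNN backbone
    (e.g., a ResNeXt-101 variant)."""

    def __init__(self, backbone: nn.Module, in_channels: int = 6):
        super().__init__()
        # The first convolution must accept the fused channel count.
        self.stem = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                              stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
        self.backbone = backbone  # consumes the 64-channel stem output

    def forward(self, rgb, flow, depth):
        # rgb: (B, 3, T, H, W), flow: (B, 2, T, H, W), depth: (B, 1, T, H, W)
        fused = torch.cat([rgb, flow, depth], dim=1)  # (B, 6, T, H, W)
        return self.backbone(self.stem(fused))
```

The key property of early fusion, as opposed to late fusion, is that a single network sees all modalities jointly from the first layer onward, so cross-modal interactions can be learned throughout the backbone rather than only at a final score-combination stage.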