Automatic voice gender recognition (VGR) has several real-world applications, including recommender systems, human-robot interaction, and forensics. VGR becomes challenging when systems operate in unconstrained environments. In this study, we evaluate the performance of VGR systems built from several fine-tuned pretrained Convolutional Neural Networks (CNNs), using speech signals recorded in unconstrained environments as input data. First, the original speech signal is preprocessed: noise is attenuated with a low-pass filter and silent segments are removed based on sound amplitude. Then, time-frequency features such as the Spectrogram, Mel-Spectrogram, and Mel-Frequency Cepstral Coefficients (MFCC) are extracted, converted into RGB images, and processed by the CNN models. Our experiments use the VoxCeleb dataset, the largest audio-visual dataset recorded in unconstrained environments. Several fine-tuned CNN models achieve higher accuracy than state-of-the-art techniques on this task; the best result, 98.58%, is obtained with a fine-tuned MobileNet and exceeds the best accuracy reported in previous works.
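The pipeline described above (low-pass filtering, amplitude-based silence removal, and conversion of a time-frequency representation into an RGB image for a pretrained CNN) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses SciPy rather than whatever toolchain the paper used, and the filter order, cutoff frequency, and silence threshold are assumed values chosen for demonstration only.

```python
import numpy as np
from scipy import signal

def preprocess(x, sr, cutoff_hz=4000, silence_thresh=0.02):
    """Attenuate noise with a low-pass filter, then drop low-amplitude
    (silence) samples. cutoff_hz and silence_thresh are illustrative
    assumptions, not values from the paper."""
    b, a = signal.butter(4, cutoff_hz / (sr / 2), btype="low")
    x = signal.filtfilt(b, a, x)
    return x[np.abs(x) > silence_thresh * np.max(np.abs(x))]

def spectrogram_image(x, sr):
    """Log-power spectrogram scaled to 0-255 and stacked into three
    identical channels so it can be fed to an RGB-pretrained CNN."""
    f, t, S = signal.spectrogram(x, fs=sr, nperseg=512)
    S = 10 * np.log10(S + 1e-10)                        # power -> dB
    S = (S - S.min()) / (S.max() - S.min() + 1e-10) * 255
    return np.stack([S.astype(np.uint8)] * 3, axis=-1)

# Demo on a synthetic 1-second tone containing a silent gap.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)
x[4000:8000] = 0.0                     # silent segment to be removed
img = spectrogram_image(preprocess(x, sr), sr)
print(img.shape)                       # (freq_bins, time_frames, 3)
```

The same image-construction step applies to the Mel-Spectrogram and MFCC features mentioned in the abstract; those are typically computed with a dedicated audio library (e.g. librosa's `feature.melspectrogram` and `feature.mfcc`) before the identical scale-and-stack conversion.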