Feature reduction based transfer structural subspace learning for small-footprint cross-domain keyword spotting via linear discriminant analysis

https://doi.org/10.1016/j.dsp.2022.103594

Abstract

Small-footprint keyword spotting has received considerable attention in recent years; it is often conducted under the assumption that the predefined keywords in the training and testing data are obtained under the same condition. However, in practical situations this assumption does not hold, owing to the wide variety of speech scenarios and voice-embedded devices. To tackle this problem, we propose a new transfer subspace learning method called feature reduction based transfer structural subspace learning (FRTSSL) for small-footprint cross-domain keyword spotting. FRTSSL aims to learn a domain-invariant and discriminative subspace in which (1) feature reduction is applied to high-dimensional features to avoid unnecessary computation; (2) transfer structural subspace learning jointly exploits the statistical properties and geometric structure to reduce the distribution discrepancy between the source and target domains within a joint linear discriminant analysis (LDA) framework; and (3) a feedback term is constructed from source labels and pseudo-target labels to improve the discrimination of the subspace. To preserve the intrinsic geometric structure of samples in the projection subspace, we first preserve the global subspace structure by imposing reconstruction constraints on the reconstruction coefficient matrix, and then preserve the spatial relationship of samples with a graph regularization method. Furthermore, we formulate a minimization problem that integrates marginal and conditional distribution alignment, reconstruction constraints, and graph regularization into the joint LDA framework, together with an effective optimization algorithm. Experimental results on four cross-domain keyword datasets show that our method outperforms several state-of-the-art transfer learning methods as well as methods without transfer learning.

Introduction

Small-footprint keyword spotting (KWS) refers to detecting a relatively small set of predefined keywords for voice interaction and is widely used in various IoT applications [1], [2]. Many technologies have been proposed for small-footprint KWS in the literature, e.g., nearest neighbor (NN), support vector machine (SVM), hidden Markov model (HMM), Gaussian mixture model (GMM), and deep neural network (DNN) [3], [4], [5], [6]. These methods achieve promising results under the common assumption that the training and testing data are drawn from the same distribution. Unfortunately, this assumption often does not hold in real applications because of the wide variety of speech scenarios and voice-embedded devices. Since it is too expensive and time-consuming to obtain sufficient labeled data in new scenarios, we have to perform cross-domain keyword spotting (CDKWS) by exploiting the labeled data from one auxiliary domain to recognize the unlabeled data in another domain.

An intuitive example of CDKWS is shown in Fig. 1, which presents voice readings of the same predefined keyword on two different types of voice-embedded devices, denoted as D1 and D2. Suppose that we only have the voice data and labels collected on D1 at sampling position SP1. If labels on D2 at sampling position SP2 are missing or wrongly labeled, a natural choice is to use the data from SP1 to recognize the unlabeled data from SP2. Similarly, we can use the data from SP1 to recognize the unlabeled data from SP3, collected on D1 at a different sampling distance. The challenge is that these voice readings follow quite different distributions, which makes it difficult to design algorithms for the CDKWS problem.

To overcome this challenge, transfer learning is introduced, which utilizes the rich knowledge of an auxiliary source domain to facilitate the learning of target tasks. According to the survey [7], conventional transfer learning methods can be divided into classifier-based methods and representation-based methods. In the first category, classifier parameters are adjusted so that they adapt to the target data, e.g., adaptive support vector machine (A-SVM) [8], domain selection machine (DSM) [9], and domain transfer multiple kernel learning (DTMKL) [10]. However, these methods depend strongly on a particular type of classifier and do not consider the intrinsic structure of the data. In the second category, many methods attempt to learn transformations that map the different datasets into a common subspace by minimizing the distribution discrepancy and the empirical risk. For example, transfer component analysis (TCA) [11], joint distribution adaptation (JDA) [12], and scatter component analysis (SCA) [13] reduce distribution shifts by exploiting statistical properties such as sample means, class means, and data scatter. However, these methods ignore the geometric structure between the source and target samples. Therefore, transfer sparse coding (TSC) [14], transfer subspace learning (TSL) [15], and low-rank transfer subspace learning (LTSL) [16] were proposed to learn a projection subspace by exploiting the sample relationships or the subspace structure. Although these methods have achieved notable results in some applications, they overlook the discriminative information contained in the labels and do not fully exploit the statistical properties. Furthermore, it is generally difficult to obtain an optimal projection subspace with these methods when the distribution discrepancy between the source and target domains is too large.
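To make the statistical alignment used by TCA- and JDA-style methods concrete, the following minimal sketch evaluates the (linear-kernel) maximum mean discrepancy between projected source and target features, i.e., the distance between the projected domain means that such methods minimize. The names Xs, Xt, P, and marginal_mmd are illustrative assumptions, not identifiers from the cited works.

```python
import numpy as np

def marginal_mmd(Xs, Xt, P):
    """Squared MMD between projected source Xs (d x ns) and target Xt (d x nt).

    With a linear feature map this reduces to the squared distance between
    the mean of the projected source samples and that of the target samples.
    """
    mean_s = (P.T @ Xs).mean(axis=1)  # projected source mean
    mean_t = (P.T @ Xt).mean(axis=1)  # projected target mean
    return float(np.sum((mean_s - mean_t) ** 2))
```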

Considering the aforementioned problems of existing methods, in this paper we propose a new feature reduction based transfer structural subspace learning (FRTSSL) method for small-footprint CDKWS. The core goal of FRTSSL is to learn a domain-invariant projection subspace, in which 1) feature reduction is applied to high-dimensional features as a preprocessing step to avoid unnecessary computation in learning the projection subspace; 2) transfer structural subspace learning learns the domain-invariant projection subspace by fully exploiting the discriminative information, geometric structure, and statistical properties within the joint linear discriminant analysis (LDA) [17] framework; and 3) the pseudo-target labels are updated and fed back to the second step to learn a better projection subspace. In summary, the key contributions of this paper are as follows:

  • 1)

    A new transfer subspace learning algorithm named FRTSSL is proposed to learn a domain-invariant and discriminative subspace for small-footprint CDKWS through feature reduction, transfer structural subspace learning, and pseudo-label feedback. FRTSSL is a general framework and can be tailored to specific applications.

  • 2)

    To the best of our knowledge, FRTSSL is the first work that integrates geometric structure preservation, distribution alignment, and discriminant maximization into the joint LDA framework for small-footprint CDKWS. In terms of geometric structure preservation, both the global subspace structure and the spatial relationship of samples are preserved by imposing reconstruction constraints and graph regularization (a minimal sketch of the graph regularizer is given after this list).

  • 3)

    Voice datasets of four predefined keywords are constructed with variations of voice-embedded devices, sampling distances, barriers, or environments. Specifically, we construct three groups of CDKWS tasks involving one, two, or three of these variations, and the corresponding results demonstrate the effectiveness and efficiency of the proposed FRTSSL.
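As a concrete illustration of the graph regularization mentioned in contribution 2), the sketch below builds the unnormalized Laplacian of a k-nearest-neighbor graph and evaluates a smoothness term of the form tr(P^T V L V^T P), which encourages samples that are close in the original space to remain close after projection. This is a minimal sketch: the binary 0/1 edge weights, the numpy-only implementation, and the function names are our illustrative assumptions, not the exact construction used in the paper.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph.

    X is d x n (columns are samples); edges carry binary 0/1 weights.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]  # skip the sample itself
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)  # symmetrize the adjacency matrix
    return np.diag(W.sum(axis=1)) - W

def graph_regularizer(P, V, L):
    """tr(P^T V L V^T P): small when neighboring samples stay close under P."""
    return float(np.trace(P.T @ V @ L @ V.T @ P))
```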

The rest of this paper is organized as follows: Section 2 presents a brief review of related work on transfer learning and keyword spotting. Section 3 describes the framework of the proposed feature reduction based transfer structural subspace learning (FRTSSL) in detail. Section 4 describes the model optimization and the iterative algorithm of the proposed FRTSSL. Section 5 presents experimental results and analysis. Finally, Section 6 presents the conclusion and future work.

Section snippets

Transfer learning

Transfer learning has shown promising performance in many applications, e.g., activity recognition, natural language processing, and visual object recognition. According to the literature survey [7], transfer learning can be categorized into three sub-settings: 1) inductive transfer learning, in which labels of both domains are available and the tasks of the two domains are different [18], [19], [20]; 2) transductive transfer learning, where the source and target tasks are the same while labels of the target domain are unavailable; and 3) unsupervised transfer learning, where labeled data are available in neither domain.

Proposed method

In this section, we present the proposed feature reduction based transfer structural subspace learning (FRTSSL). We first introduce the mathematical notation of FRTSSL, and then give a detailed general framework of the proposed method for CDKWS, including feature reduction, transfer structural subspace learning, and the joint FRTSSL model.
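To make the first two components concrete before the optimization details, the sketch below shows a PCA-style feature reduction (our assumption for the reduction step, used here only for illustration) and the LDA scatter statistics that the subsequent objective relies on; the names S_w and S_t follow the notation of the optimization section.

```python
import numpy as np

def pca_reduce(X, d2):
    """Reduce d x n features X to d2 dimensions via PCA.

    Returns the reduced data (d2 x n) and the projection matrix (d x d2).
    """
    Xc = X - X.mean(axis=1, keepdims=True)            # center the samples
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    W = U[:, :d2]                                      # top-d2 principal directions
    return W.T @ Xc, W

def lda_scatters(V, y):
    """Within-class scatter S_w and total scatter S_t of labeled data V (d2 x n)."""
    mu = V.mean(axis=1, keepdims=True)
    St = (V - mu) @ (V - mu).T
    Sw = np.zeros_like(St)
    for c in np.unique(y):
        Vc = V[:, y == c]
        mc = Vc.mean(axis=1, keepdims=True)
        Sw += (Vc - mc) @ (Vc - mc).T
    return Sw, St
```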

FRTSSL model optimization

In the minimization problem of Eq. (14), the projection matrix P admits many possible solutions (i.e., the solution is non-unique). To guarantee a unique solution for P, we convert the denominator term of Eq. (14) into an equality constraint, and the objective function can then be written as

P = \arg\min_{P,Z}\; \operatorname{tr}\!\Big(P^{\top}\Big(S_w + \gamma I + \lambda_1 (V_t - V_s Z)(V_t - V_s Z)^{\top} + V\big(\lambda_2 L + \lambda_3 \textstyle\sum_{c=0}^{C} M_c\big)V^{\top}\Big)P\Big) + \alpha_2 \lVert Z \rVert_{2,1}

\text{s.t.}\quad P^{\top}(\sigma S_t + V V^{\top})P = I.

In this constrained optimization, we denote \Phi = \operatorname{diag}(\phi_1, \ldots, \phi_{d_2}) \in \mathbb{R}^{d_2 \times d_2} as the Lagrange multiplier matrix.
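For a fixed reconstruction matrix Z, the bracketed matrix in the objective (call it A) and the constraint matrix σS_t + VV^T (call it B) are both symmetric, and setting the Lagrangian's gradient to zero yields the generalized eigenproblem AP = BPΦ. The sketch below solves this P-subproblem; the names A, B, and solve_projection are our illustrative assumptions, and B is assumed positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(A, B, k):
    """Minimize tr(P^T A P) subject to P^T B P = I.

    The minimizer stacks the generalized eigenvectors of A p = phi * B p
    with the k smallest eigenvalues; eigh returns them in ascending order
    and normalizes the eigenvectors so that P^T B P = I.
    """
    eigvals, eigvecs = eigh(A, B)   # symmetric A, positive-definite B
    return eigvecs[:, :k]           # d2 x k projection matrix P
```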

Experiments

In this section, we evaluate the performance of the proposed FRTSSL method for small-footprint CDKWS. First, we design four experimental scenarios to collect voice datasets with variations of voice-embedded devices, sampling distances, barriers, or environments. Second, we describe the experimental settings and compare the proposed FRTSSL with several state-of-the-art baseline methods. Then, we verify the effectiveness of the proposed FRTSSL. Finally, we evaluate and discuss the parameter sensitivity of the proposed method.

Conclusion and future work

In this paper, we proposed a new transfer subspace learning method called feature reduction based transfer structural subspace learning (FRTSSL) for small-footprint cross-domain keyword spotting (CDKWS). FRTSSL aims to learn a domain-invariant, low-dimensional, and discriminative projection subspace based on feature reduction, geometric structure preservation, statistical distribution alignment, and discriminant maximization. By integrating subspace structure preservation, graph regularization, distribution alignment, and discriminant maximization into the joint LDA framework, FRTSSL learns a projection subspace that transfers well across domains, as demonstrated on four cross-domain keyword datasets.

CRediT authorship contribution statement

Fei Ma: Conceptualization, Methodology, Software, Writing – original draft. Chengliang Wang: Conceptualization, Data curation, Writing – review & editing. Yujie Hao: Software, Validation. Xing Wu: Investigation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under grant No. 61672115. We would like to thank Yanai Wang for her help in improving the English writing.

Fei Ma is currently pursuing the Ph.D. degree in Computer Science at Chongqing University, Chongqing, China. Before that, he received his B.S. degree from the College of Computer Science, Chongqing University. His current research interests include transfer learning and signal processing, especially speech recognition.

References (50)

  • S.J. Pan et al., A survey on transfer learning, IEEE Trans. Knowl. Data Eng. (2009)
  • J. Yang et al., Cross-domain video concept detection using adaptive SVMs
  • L. Duan et al., Exploiting web images for event recognition in consumer videos: a multiple source domain adaptation approach
  • L. Duan et al., Domain transfer multiple kernel learning, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • S.J. Pan et al., Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. (2010)
  • M. Long et al., Transfer feature learning with joint distribution adaptation
  • M. Ghifary et al., Scatter component analysis: a unified framework for domain adaptation and domain generalization, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • M. Long et al., Transfer sparse coding for robust image representation
  • S. Si et al., Bregman divergence-based regularization for transfer subspace learning, IEEE Trans. Knowl. Data Eng. (2009)
  • M. Shao et al., Generalized transfer subspace learning through low-rank constraint, Int. J. Comput. Vis. (2014)
  • Y. Liu et al., Common subspace learning via cross-domain extreme learning machine, Cogn. Comput. (2017)
  • W. Dai et al., Boosting for transfer learning
  • A. Evgeniou et al., Multi-task feature learning, Adv. Neural Inf. Process. Syst. (2007)
  • R. Raina et al., Self-taught learning: transfer learning from unlabeled data
  • B. Gong et al., Geodesic flow kernel for unsupervised domain adaptation


Chengliang Wang is currently a professor with the College of Computer Science, Chongqing University, China. He received the Ph.D. degree in control theory and control engineering from Chongqing University, Chongqing, China, in 2004, and obtained his postdoctoral certificate from Chongqing University in 2011. His current research interests include machine learning, pattern recognition, and computer vision, especially medical image analysis.

Yujie Hao is currently a master's student in the College of Computer Science, Chongqing University. Before that, she received her B.S. degree from the School of Computer Science and Engineering, Anhui University of Science and Technology. Her main research interests are deep learning and speech recognition.

Xing Wu is currently an associate professor with the College of Computer Science, Chongqing University, China. He received the M.E. and Ph.D. degrees in computer science and technology from Chongqing University, Chongqing, China, in 2004 and 2011, respectively. His current research interests include machine learning, reinforcement learning, computer vision, and natural language processing.
