Feature reduction based transfer structural subspace learning for small-footprint cross-domain keyword spotting via linear discriminant analysis

https://doi.org/10.1016/j.dsp.2022.103594

Abstract

Small-footprint keyword spotting has received considerable attention in recent years; it is often conducted under the assumption that the predefined keywords in the training and testing data are obtained under the same condition. However, in practical situations this assumption does not hold, owing to the wide variety of speech scenarios and voice-embedded devices. To tackle this problem, we propose a new transfer subspace learning method called feature reduction based transfer structural subspace learning (FRTSSL) for small-footprint cross-domain keyword spotting. FRTSSL aims to learn a domain-invariant and discriminative subspace in which (1) feature reduction is applied to high-dimensional features to avoid unnecessary computation; (2) transfer structural subspace learning jointly exploits the statistical properties and geometric structure to reduce the distribution discrepancy between the source and target domains within a joint linear discriminant analysis (LDA) framework; and (3) a feedback term is constructed from source labels and pseudo-target labels to improve the discrimination of the subspace. To preserve the intrinsic geometric structure of samples in the projection subspace, we first preserve the global subspace structure by imposing reconstruction constraints on the reconstruction coefficient matrix, and then preserve the spatial relationship of samples with a graph regularization method. Furthermore, we formulate a minimization problem that integrates marginal and conditional distribution alignment, reconstruction constraints, and graph regularization into the joint LDA framework, together with an effective optimization algorithm. Experimental results on four cross-domain keyword datasets show that our method outperforms several state-of-the-art transfer learning methods as well as methods without transfer learning.

Introduction

Small-footprint keyword spotting (KWS) refers to detecting a relatively small set of predefined keywords for voice interaction and is widely used in various IoT applications [1], [2]. Many technologies have been proposed for small-footprint KWS in the literature, e.g., nearest neighbor (NN), support vector machine (SVM), hidden Markov model (HMM), Gaussian mixture model (GMM), and deep neural network (DNN) [3], [4], [5], [6]. These methods achieve promising results under the common assumption that the training and testing data are drawn from the same distribution. Unfortunately, this assumption often does not hold in real applications because of the wide variety of speech scenarios and voice-embedded devices. Since it is too expensive and time-consuming to obtain sufficient labeled data in new scenarios, we have to perform cross-domain keyword spotting (CDKWS) by exploiting the labeled data from one auxiliary domain to recognize the unlabeled data in another domain.

An intuitive example of CDKWS is shown in Fig. 1, which presents voice readings of the same predefined keyword on two different types of voice-embedded devices, denoted as D1 and D2. Suppose that we only have the voice data and labels collected on D1 at sampling position SP1. If labels on D2 at sampling position SP2 are missing or wrongly labeled, a natural choice is to use the data from SP1 to recognize the unlabeled data from SP2. Similarly, we can use the data from SP1 to recognize the unlabeled data from SP3, collected on D1 at a different sampling distance. The challenge is that these voice readings follow quite different distributions, which makes it difficult to design algorithms for the CDKWS problem.

To overcome this challenge, transfer learning is introduced, which utilizes the rich knowledge of an auxiliary source domain to facilitate the learning of target tasks. According to the survey [7], conventional transfer learning methods can be divided into classifier-based methods and representation-based methods. In the first category, classifier parameters are adjusted so that they adapt to the target data, e.g., adaptive support vector machine (A-SVM) [8], domain selection machine (DSM) [9], and domain transfer multiple kernel learning (DTMKL) [10]. However, these methods depend strongly on a particular type of classifier and do not consider the intrinsic structure of the data. In the second category, many methods attempt to learn transformations that map the different datasets into a common subspace by minimizing the distribution discrepancy and the empirical risk. For example, transfer component analysis (TCA) [11], joint distribution adaptation (JDA) [12], and scatter component analysis (SCA) [13] reduce distribution shifts by exploiting statistical properties such as sample means, class means, and data scatter. However, these methods ignore the geometric structure between the source and target samples. Therefore, transfer sparse coding (TSC) [14], transfer subspace learning (TSL) [15], and low-rank transfer subspace learning (LTSL) [16] were proposed to learn a projection subspace by exploiting the sample relationships or the subspace structure. Although these methods have achieved notable results in some applications, they overlook the discriminative information contained in the labels and do not fully exploit the statistical properties. Furthermore, it is generally difficult to obtain an optimal projection subspace with these methods when the distribution discrepancy between the source and target domains is too large.
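To make the statistical alignment used by TCA- and JDA-style methods concrete, the following minimal sketch evaluates the (linear-kernel) maximum mean discrepancy between projected source and target features, i.e., the distance between the projected domain means that such methods minimize. The names Xs, Xt, P, and marginal_mmd are illustrative assumptions, not identifiers from the cited works.

```python
import numpy as np

def marginal_mmd(Xs, Xt, P):
    """Squared MMD between projected source Xs (d x ns) and target Xt (d x nt).

    With a linear feature map this reduces to the squared distance between
    the mean of the projected source samples and that of the target samples.
    """
    mean_s = (P.T @ Xs).mean(axis=1)  # projected source mean
    mean_t = (P.T @ Xt).mean(axis=1)  # projected target mean
    return float(np.sum((mean_s - mean_t) ** 2))
```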

Considering the aforementioned problems of existing methods, in this paper we propose a new feature reduction based transfer structural subspace learning (FRTSSL) method for small-footprint CDKWS. The core goal of FRTSSL is to learn a domain-invariant projection subspace, in which 1) feature reduction is applied to high-dimensional features as a preprocessing step to avoid unnecessary computation in learning the projection subspace; 2) transfer structural subspace learning learns the domain-invariant projection subspace by fully exploiting the discriminative information, geometric structure, and statistical properties within the joint linear discriminant analysis (LDA) [17] framework; and 3) the pseudo-target labels are updated and fed back to the second step to learn a better projection subspace. In summary, the key contributions of this paper are as follows:

  • 1)

    A new transfer subspace learning algorithm named FRTSSL is proposed to learn a domain-invariant and discriminative subspace for small-footprint CDKWS through feature reduction, transfer structural subspace learning, and pseudo-label feedback. FRTSSL is a general framework and can be tailored to specific applications.

  • 2)

    To the best of our knowledge, FRTSSL is the first work that integrates geometric structure preservation, distribution alignment, and discriminant maximization into the joint LDA framework for small-footprint CDKWS. In terms of geometric structure preservation, both the global subspace structure and the spatial relationship of samples are preserved by imposing reconstruction constraints and graph regularization (a minimal sketch of the graph regularizer is given after this list).

  • 3)

    Voice datasets of four predefined keywords are constructed with variations of voice-embedded devices, sampling distances, barriers, or environments. Specifically, we construct three groups of CDKWS tasks involving one, two, or three of these variations, and the corresponding results demonstrate the effectiveness and efficiency of the proposed FRTSSL.
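As a concrete illustration of the graph regularization mentioned in contribution 2), the sketch below builds the unnormalized Laplacian of a k-nearest-neighbor graph and evaluates a smoothness term of the form tr(P^T V L V^T P), which encourages samples that are close in the original space to remain close after projection. This is a minimal sketch: the binary 0/1 edge weights, the numpy-only implementation, and the function names are our illustrative assumptions, not the exact construction used in the paper.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph.

    X is d x n (columns are samples); edges carry binary 0/1 weights.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)  # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]  # skip the sample itself
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)  # symmetrize the adjacency matrix
    return np.diag(W.sum(axis=1)) - W

def graph_regularizer(P, V, L):
    """tr(P^T V L V^T P): small when neighboring samples stay close under P."""
    return float(np.trace(P.T @ V @ L @ V.T @ P))
```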

The rest of this paper is organized as follows: Section 2 presents a brief review of related work on transfer learning and keyword spotting. Section 3 describes the framework of the proposed feature reduction based transfer structural subspace learning (FRTSSL) in detail. Section 4 describes the model optimization and the iterative algorithm of the proposed FRTSSL. Section 5 presents experimental results and analysis. Finally, Section 6 presents the conclusion and future work.

Section snippets

Transfer learning

Transfer learning has shown promising performance in many applications, e.g., activity recognition, natural language processing, and visual object recognition. According to the literature survey [7], transfer learning can be categorized into three sub-settings: 1) inductive transfer learning, in which labels of both domains are available and the tasks of the two domains are different [18], [19], [20]; 2) transductive transfer learning, where the source and target tasks are the same while labels of the target domain are unavailable; and 3) unsupervised transfer learning, where labeled data are available in neither domain.

Proposed method

In this section, we present the proposed feature reduction based transfer structural subspace learning (FRTSSL). We first introduce the mathematical notation of FRTSSL, and then give a detailed general framework of the proposed method for CDKWS, including feature reduction, transfer structural subspace learning, and the joint FRTSSL model.
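To make the first two components concrete before the optimization details, the sketch below shows a PCA-style feature reduction (our assumption for the reduction step, used here only for illustration) and the LDA scatter statistics that the subsequent objective relies on; the names S_w and S_t follow the notation of the optimization section.

```python
import numpy as np

def pca_reduce(X, d2):
    """Reduce d x n features X to d2 dimensions via PCA.

    Returns the reduced data (d2 x n) and the projection matrix (d x d2).
    """
    Xc = X - X.mean(axis=1, keepdims=True)            # center the samples
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    W = U[:, :d2]                                      # top-d2 principal directions
    return W.T @ Xc, W

def lda_scatters(V, y):
    """Within-class scatter S_w and total scatter S_t of labeled data V (d2 x n)."""
    mu = V.mean(axis=1, keepdims=True)
    St = (V - mu) @ (V - mu).T
    Sw = np.zeros_like(St)
    for c in np.unique(y):
        Vc = V[:, y == c]
        mc = Vc.mean(axis=1, keepdims=True)
        Sw += (Vc - mc) @ (Vc - mc).T
    return Sw, St
```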

FRTSSL model optimization

In the minimization problem of Eq. (14), the projection matrix P admits many possible solutions (i.e., the solution is non-unique). To guarantee a unique solution for P, we convert the denominator term of Eq. (14) into an equality constraint, and the objective function can then be written as

P = \arg\min_{P,Z}\; \operatorname{tr}\!\Big(P^{\top}\Big(S_w + \gamma I + \lambda_1 (V_t - V_s Z)(V_t - V_s Z)^{\top} + V\big(\lambda_2 L + \lambda_3 \textstyle\sum_{c=0}^{C} M_c\big)V^{\top}\Big)P\Big) + \alpha_2 \lVert Z \rVert_{2,1}

\text{s.t.}\quad P^{\top}(\sigma S_t + V V^{\top})P = I.

In this constrained optimization, we denote \Phi = \operatorname{diag}(\phi_1, \ldots, \phi_{d_2}) \in \mathbb{R}^{d_2 \times d_2} as the Lagrange multiplier matrix.
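For a fixed reconstruction matrix Z, the bracketed matrix in the objective (call it A) and the constraint matrix σS_t + VV^T (call it B) are both symmetric, and setting the Lagrangian's gradient to zero yields the generalized eigenproblem AP = BPΦ. The sketch below solves this P-subproblem; the names A, B, and solve_projection are our illustrative assumptions, and B is assumed positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(A, B, k):
    """Minimize tr(P^T A P) subject to P^T B P = I.

    The minimizer stacks the generalized eigenvectors of A p = phi * B p
    with the k smallest eigenvalues; eigh returns them in ascending order
    and normalizes the eigenvectors so that P^T B P = I.
    """
    eigvals, eigvecs = eigh(A, B)   # symmetric A, positive-definite B
    return eigvecs[:, :k]           # d2 x k projection matrix P
```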

Experiments

In this section, we evaluate the performance of the proposed FRTSSL method for small-footprint CDKWS. First, we design four experimental scenarios to collect voice datasets with variations of voice-embedded devices, sampling distances, barriers, or environments. Second, we describe the experimental settings and compare the proposed FRTSSL with several state-of-the-art baseline methods. Then, we verify the effectiveness of the proposed FRTSSL. Finally, we evaluate and discuss the parameter sensitivity of the proposed method.

Conclusion and future work

In this paper, we proposed a new transfer subspace learning method called feature reduction based transfer structural subspace learning (FRTSSL) for small-footprint cross-domain keyword spotting (CDKWS). FRTSSL aims to learn a domain-invariant, low-dimensional, and discriminative projection subspace based on feature reduction, geometric structure preservation, statistical distribution alignment, and discriminant maximization. By integrating subspace structure preservation, graph regularization, distribution alignment, and discriminant maximization into the joint LDA framework, FRTSSL learns a projection subspace that transfers well across domains, as demonstrated on four cross-domain keyword datasets.

CRediT authorship contribution statement

Fei Ma: Conceptualization, Methodology, Software, Writing – original draft. Chengliang Wang: Conceptualization, Data curation, Writing – review & editing. Yujie Hao: Software, Validation. Xing Wu: Investigation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under grant No. 61672115. We would like to thank Yanai Wang for her help in improving the English writing.

Fei Ma is currently pursuing the Ph.D. degree in Computer Science at Chongqing University, Chongqing, China. Before that, he received his B.S. degree from the College of Computer Science, Chongqing University. His current research interests include transfer learning and signal processing, especially speech recognition.

References (50)

  • S.J. Pan et al., A survey on transfer learning, IEEE Trans. Knowl. Data Eng. (2009)
  • J. Yang et al., Cross-domain video concept detection using adaptive SVMs
  • L. Duan et al., Exploiting web images for event recognition in consumer videos: a multiple source domain adaptation approach
  • L. Duan et al., Domain transfer multiple kernel learning, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • S.J. Pan et al., Domain adaptation via transfer component analysis, IEEE Trans. Neural Netw. (2010)
  • M. Long et al., Transfer feature learning with joint distribution adaptation
  • M. Ghifary et al., Scatter component analysis: a unified framework for domain adaptation and domain generalization, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
  • M. Long et al., Transfer sparse coding for robust image representation
  • S. Si et al., Bregman divergence-based regularization for transfer subspace learning, IEEE Trans. Knowl. Data Eng. (2009)
  • M. Shao et al., Generalized transfer subspace learning through low-rank constraint, Int. J. Comput. Vis. (2014)
  • Y. Liu et al., Common subspace learning via cross-domain extreme learning machine, Cogn. Comput. (2017)
  • W. Dai et al., Boosting for transfer learning
  • A. Evgeniou et al., Multi-task feature learning, Adv. Neural Inf. Process. Syst. (2007)
  • R. Raina et al., Self-taught learning: transfer learning from unlabeled data
  • B. Gong et al., Geodesic flow kernel for unsupervised domain adaptation


Chengliang Wang is currently a professor with the College of Computer Science, Chongqing University, China. He received the Ph.D. degree in control theory and control engineering from Chongqing University, Chongqing, China, in 2004, and obtained his postdoctoral certificate from Chongqing University in 2011. His current research interests include machine learning, pattern recognition, and computer vision, especially medical image analysis.

Yujie Hao is currently a master's student in the College of Computer Science, Chongqing University. Before that, she received her B.S. degree from the School of Computer Science and Engineering, Anhui University of Science and Technology. Her main research interests are deep learning and speech recognition.

Xing Wu is currently an associate professor with the College of Computer Science, Chongqing University, China. He received the M.E. and Ph.D. degrees in computer science and technology from Chongqing University, Chongqing, China, in 2004 and 2011, respectively. His current research interests include machine learning, reinforcement learning, computer vision, and natural language processing.
