
1 Introduction

Recently, with the spread of networked devices such as personal computers and smartphones, various kinds of information can be obtained easily, and opportunities to obtain information by talking with other people have decreased. The number of people who live alone or in social withdrawal is increasing, and this lack of human-to-human communication has become a serious problem.

The creation of communities has been promoted to support communication among the elderly, especially elderly people in disaster-stricken areas. However, it is not easy to provide communication support to people who have once missed the opportunity to interact with others. When they lack confidence in communicating with others or have a negative impression of the community itself, it is difficult to get them to participate immediately in communication support activities. A different approach to this problem is therefore necessary.

Many robots that communicate with humans have been developed for elderly persons [1,2,3]. They are effective at stimulating interest in conversation; however, users tend to tire of talking with them because of their limited conversational flexibility and adaptability. It is difficult for such robots to change their topics, expressions, utterance timing, or speaking speed according to the user's situation and the conversation atmosphere. We have already proposed a conversation support system for public communities [4]. The system estimates whether human-to-human communication is proceeding smoothly. When it senses little progress in the conversation, it attempts to provide a topic that leads to a smoother discussion and a better atmosphere. Using a free-conversation database recorded in a recording studio, we confirmed that the fundamental frequency (F0) and sound power (SP) values of each utterance are effective for estimating the conversation atmosphere. Figure 1 shows the structure of the conversation smoothness estimation process. The system extracts F0 values from input utterances and calculates the standard deviation of the F0 values (SD-F0). When the SD-F0 values of several utterances in the conversation fall below a threshold value, the system decides that the conversation is not progressing smoothly and provides a new topic to liven it up. However, the system was evaluated using conversations recorded in a studio, and the problems of conversation support in real-world use remained unsolved.

Fig. 1. Conventional process of the conversation support system
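As a rough illustration of this decision rule, the following sketch computes the SD-F0 of each utterance and flags a conversation that is not progressing smoothly. It is a minimal sketch, not the authors' implementation: the F0 extractor (librosa.pyin), the F0 search range, and both threshold values are assumptions.

```python
import numpy as np
import librosa

def sd_f0(utterance, sr):
    """Standard deviation of F0 (SD-F0) over the voiced frames of one utterance."""
    f0, voiced_flag, _ = librosa.pyin(utterance, fmin=70.0, fmax=400.0, sr=sr)
    f0 = f0[voiced_flag]                       # keep voiced frames only
    return float(np.std(f0)) if f0.size else float("nan")

def conversation_is_smooth(utterances, sr, sd_f0_threshold=20.0, max_low=2):
    """Judge smoothness: if more than `max_low` utterances have SD-F0 below the
    threshold, the conversation is treated as not progressing smoothly and the
    system should offer a new topic. Both threshold values are assumed."""
    low = sum(1 for u in utterances if sd_f0(u, sr) < sd_f0_threshold)
    return low <= max_low
```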

We think that if we used our system in a real restaurant or lounge, the following serious problems would occur and degrade the estimation performance:

  1. Ambient noise degrades the estimation performance. In particular, it is difficult to exclude nonstationary noise, such as speech from a person behind the target speaker, even with current noise-cancelling methods.

  2. Non-language utterances (e.g., laughter, coughing, and tongue clicking) occur more often in daily conversation than in conversations recorded in studios. The error rate of conversation atmosphere estimation increases for real daily conversation because the F0 characteristics of such utterances differ from those of normal speech.

In this paper, we describe a method for identifying target normal speech utterances among the ambient speech and laughter utterances that frequently occur in daily conversation.

2 Our Conversation Support System

Figure 2 shows a usage scene of our conversation support system, which listens to a conversation between two speakers and estimates whether the conversation is progressing smoothly. The plush doll on the table is our conversation support system. Figure 3 shows the microphone used for recording speech: a small directional condenser microphone. Several microphones are placed just in front of the chairs on both sides of the table. Furthermore, it is desirable that the microphone positions be adjustable according to the speakers' sitting positions. In such locations, even if a microphone is installed near the user, background conversations of other people are frequently picked up.

Fig. 2. Usage scene of the conversation support system

Fig. 3. Location of the microphone for recording conversation utterances

3 Identification Method for Differentiating Between Ambient and Target Speech

We used our conversation support system in a restaurant during lunchtime to evaluate it for practical use. The system regarded an input signal as a target utterance when its power level exceeded a threshold. Over a three-minute period, more than seventy utterances were extracted as target voices; however, the target speech extraction rate was only 65%. The extracted utterances included several inputs from people sitting at the neighboring table and from restaurant employees. It is therefore necessary to distinguish the target speaker's speech from that of ambient speakers.

To exclude background speech, several sound source separation methods using microphone arrays have been proposed [5, 6]. There are also several single-microphone methods for separating sound sources, such as binary masking using a Bayesian network and non-negative matrix factorization [7, 8]. These methods are effective for separating target sounds from background noise. However, such systems tend to be complex, and their separation of target sound sources from nonstationary noise, such as ambient speech, is typically insufficient.

We use the standard deviation of the SP values (SD-SP) of each utterance to identify the target speaker located near a microphone. In our previous paper [9], we found that the SD-SP of an utterance tends to be larger when the speaker talks near the microphone than when the speaker talks far from it. However, we confirmed this using only male speakers. In this paper, we report experimental results using both male and female speakers.

3.1 Free Conversation Recording in a Restaurant Environment

To analyze the difference between the acoustic characteristics of speech originating near a microphone and that originating far from a microphone, we recorded several conversations in an experimental room.

To reproduce the acoustic environment of a restaurant, two loudspeakers were placed behind the speaker, and noise recorded in actual restaurants was played. We asked a speaker and a partner to talk to each other freely amid this noise. The partner stood behind the microphone at a fixed position. The speaker stood at two different positions: one (Speaker 1) 30 cm from the microphone and the other (Speaker 2) 120 cm from it. We recorded the speaker's utterances at each position. When the speaker is 30 cm from the microphone, the system should regard the speaker as the target, whereas at 120 cm the speaker should be treated as ambient noise. The former is conversation that should be accepted by the system, and the latter is conversation that should be excluded.

Table 1 lists the recording conditions. Figure 4 shows the experimental layout used to record the conversations.

Fig. 4. Schematic diagram of the experimental layout

Table 1. Conversation recording conditions

3.2 Acoustic Analysis and Extraction of Each Utterance

We extracted the speaker's utterances from the recorded conversations. The threshold level was decided using Eq. (1), and only the parts above the threshold level were extracted as utterance parts.

$$ \text{Threshold level} = (\text{Average ambient sound power}) + 5\ \text{dB} $$
(1)

We calculated the F0 values and SP values of each utterance as acoustic parameters. The analysis conditions and specifications are listed in Table 2.
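The thresholding of Eq. (1) and the frame-wise SP computation can be sketched as follows. This is an illustrative sketch only: the frame length, frame shift, and the use of librosa are assumptions, and the actual analysis settings are those in Table 2.

```python
import numpy as np
import librosa

FRAME, HOP = 1024, 256   # analysis frame length and shift: assumed values (see Table 2)

def sound_power_db(y):
    """Frame-wise sound power (SP) in dB, from the RMS of each frame."""
    rms = librosa.feature.rms(y=y, frame_length=FRAME, hop_length=HOP)[0]
    return 20.0 * np.log10(rms + 1e-10)

def utterance_mask(y, ambient):
    """Eq. (1): mark the frames whose SP exceeds the average ambient SP + 5 dB."""
    threshold = float(np.mean(sound_power_db(ambient))) + 5.0
    return sound_power_db(y) > threshold
```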

Table 2. Parameter extraction details

3.3 Standard Deviation of F0 or SP Value of Each Utterance

We calculated the SD-SP as well as the SD-F0 of each utterance. Their relationship is shown in Fig. 5, which compares the values obtained for utterances originating near the microphone (30 cm) with those for utterances originating far from it (120 cm). The line in Fig. 5 represents the boundary calculated using linear discriminant analysis (LDA); a minimal sketch of this classification follows the list below. Figure 5 indicates the following:

  • The SD-SP values at 30 cm are higher than those at 120 cm. The difference between the two sets of values is significant according to an F-test (95% confidence level). This tendency holds for both male and female speakers.

  • The difference in the SD-F0 values between the 30 cm and 120 cm cases is small and is not significant according to the F-test.
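A minimal sketch of this two-feature classification, assuming scikit-learn's LDA; the feature values below are placeholders for illustration, not data from the recordings.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# One row per utterance: [SD-F0, SD-SP]; label 1 = near (30 cm), 0 = far (120 cm).
X = np.array([[18.0, 6.2], [22.5, 5.8], [19.4, 2.1], [21.0, 2.6]])
y = np.array([1, 1, 0, 0])

lda = LinearDiscriminantAnalysis().fit(X, y)   # its hyperplane is the line drawn in Fig. 5
print(lda.predict([[20.0, 5.0]]))              # 1 -> treated as the target (near) speaker
```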

Fig. 5. Relationship between the standard deviations of F0 and SP

We confirmed the following two tendencies, which explain why the SD-SP values at 30 cm are higher than those at 120 cm:

  1. The SD-SP values of each utterance decrease as the distance between the microphone and the loudspeaker increases.

  2. The SD-SP values of loud utterances tend to be smaller than those of normal utterances.

To clarify why the SD-SP values of loud utterances are smaller than those of normal utterances, we compared the temporal changes in the SP values of loud and normal utterances. We recorded two kinds of utterances: normal utterances recorded in a room without noise, and loud utterances spoken while the speaker listened to restaurant noise through headphones.

Figure 6 shows examples of the two kinds of utterances: the left panel shows the SP of a loud utterance, and the right panel shows the SP of a normal utterance. The dotted lines represent the threshold levels when each utterance is spoken in a noisy restaurant. Comparing the temporal shapes of the SP values, the flat part indicated by the dotted circle is found only above the threshold in the loud utterance, owing to the Lombard effect. In a normal utterance, the SP changes constantly. This is why the SD-SP of a loud utterance is smaller than that of a normal utterance.
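This effect can be reproduced numerically: clipping the top of an SP contour, as the loud utterance does near its ceiling, lowers its standard deviation. A small synthetic illustration follows; the contour shapes and levels are invented for demonstration.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200)
normal_sp = 60.0 + 10.0 * np.sin(2 * np.pi * 3 * t)   # SP contour that changes constantly
loud_sp = np.minimum(normal_sp + 8.0, 72.0)           # raised level clipped into a flat top

print(np.std(normal_sp))   # larger SD-SP for the normal utterance
print(np.std(loud_sp))     # smaller SD-SP: the flat part suppresses the variation
```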

Fig. 6. Examples of waveforms and sound power for a loud utterance and a normal utterance

3.4 Evaluation of Identification of Target Speech Using SD-SP and SD-F0

For each speaker, we estimated the distance between the speaker and the microphone using a support vector machine (SVM) trained on the utterances of the remaining speakers, and we evaluated the estimation performance by calculating the recall, precision, and F-measure (the harmonic mean of recall and precision). Table 3 lists these rates. Talkers A–D are male, and talkers E and F are female. Although the recall and precision rates varied depending on the speaker, the average rate across all six speakers was 80.5%. This result suggests that a conversation support system that extracts only the utterances of a target speaker near a microphone is feasible, even in an environment with ambient noise.
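The evaluation procedure can be sketched as a leave-one-speaker-out loop. This is an illustrative reconstruction, not the authors' code: the RBF kernel and the per-utterance feature layout are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score

def leave_one_speaker_out(X, y, speakers):
    """For each talker (A-F), train an SVM on the other talkers' utterances and
    test on the held-out one. X rows are per-utterance features such as
    [SD-F0, SD-SP]; y is 1 for the near (target) position, 0 for the far one."""
    scores = {}
    for s in np.unique(speakers):
        train, test = speakers != s, speakers == s
        pred = SVC(kernel="rbf").fit(X[train], y[train]).predict(X[test])
        scores[s] = {"recall": recall_score(y[test], pred),
                     "precision": precision_score(y[test], pred),
                     "f_measure": f1_score(y[test], pred)}  # harmonic mean
    return scores
```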

Table 3. Estimation performance of each speaker

4 Differentiation Between Laughter Utterances and Normal Speech Utterances

Daily conversation includes several types of non-language utterances, such as laughter, coughing, and tongue clicking. The rate of non-language utterances in our free conversation database is 19%. The utterances of the elderly include many types of non-language utterances, but 83% of these are laughter. We have already proposed a method for discriminating laughter utterances from normal speech utterances using the standard deviation of the F0 values of each utterance [8]. However, the identification performance was not high. The F0 and SP characteristics of laughter utterances depend on the speaker, and it was not easy to find a speaker-independent boundary between normal speech and laughter utterances. In this paper, we report identification experiment results for laughter speech using the mel-frequency cepstral coefficients (MFCC) in addition to the F0 and SP parameters.

4.1 Free Conversation Recording

We recorded six sets of daily conversations among six male speakers, who met each other for the first time. Figure 7 shows the layout of the conversation recording. We used two microphones and a video camera. Figure 8 shows an example photo extracted from the video data. The recording conditions are listed in Table 4.

Fig. 7. Layout of the conversation recording

Fig. 8. Example photo extracted from the video data

Table 4. Conversation recording conditions

4.2 Laughter Utterance Extraction

We extracted both normal speech utterances and laughter utterances from the recorded data. The laughter utterances were classified into three types according to the social functions defined in Tanaka's paper [10]. Table 5 lists the number of extracted utterances of each type.

Table 5. Number of extracted utterances of laughter and normal speech

4.3 Acoustic Analysis Conditions

We calculated the F0, SP, and MFCC values of both the normal speech and laughter utterances listed in Table 5 as acoustic parameters, and then calculated the standard deviations of these parameters. In a practical noisy environment, the threshold should be set above the SP of the noise, and only the parts of the input signal above the threshold are regarded as utterances. In this experiment, we used the average SP value of each utterance as the threshold and calculated the acoustic parameters only for the parts of the utterance above it. The analysis conditions and specifications are listed in Table 6.
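A sketch of this per-utterance parameter extraction, computing the standard deviations of F0, SP, and each MFCC dimension over the frames above the utterance-average SP threshold. The frame sizes, F0 range, and number of MFCC dimensions are assumptions for illustration rather than the Table 6 settings.

```python
import numpy as np
import librosa

def sd_feature_vector(y, sr, n_mfcc=12, frame=1024, hop=256):
    """Per-utterance feature vector: SDs of F0, SP, and each MFCC dimension,
    computed only over frames whose SP exceeds the utterance-average SP."""
    sp = 20.0 * np.log10(
        librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0] + 1e-10)
    f0, voiced, _ = librosa.pyin(y, fmin=70.0, fmax=400.0, sr=sr,
                                 frame_length=frame, hop_length=hop)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=frame, hop_length=hop)
    n = min(sp.shape[0], f0.shape[0], mfcc.shape[1])   # align frame counts
    keep = sp[:n] > sp[:n].mean()                      # frames above the threshold
    sd_f0 = np.nanstd(np.where(voiced[:n] & keep, f0[:n], np.nan))
    return np.hstack([sd_f0, np.std(sp[:n][keep]), np.std(mfcc[:, :n][:, keep], axis=1)])
```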

Table 6. Parameter extraction details

4.4 Normal Speech Identification from Laughter Utterances by SVM

We evaluated the identification performance between normal speech and laughter utterances using F0, SP, and MFCC, with an SVM as the classifier. In particular, to confirm the effectiveness of MFCC, we compared the performance obtained using only F0 and SP with that obtained using F0, SP, and MFCC. Tables 7 and 8 show the identification performance in terms of the recall, precision, F-measure, and accuracy of normal speech extraction from the database described in Table 4.

To improve the conversation-atmosphere estimation performance, it is important that all extracted utterances be normal speech utterances. When the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are counted with normal speech as the positive class, the accuracy rate is defined by Eq. (2).

$$ Accuracy = (TP + TN) \, / \, (TP + FP + FN + TN) $$
(2)
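The four rates reported in Tables 7 and 8 follow from the confusion counts as sketched below; the counts are inputs to the function, not values taken from the tables.

```python
def identification_rates(tp, fp, fn, tn):
    """Recall, precision, F-measure, and accuracy (Eq. 2), with normal speech
    utterances as the positive class."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2.0 * precision * recall / (precision + recall)   # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)                    # Eq. (2)
    return recall, precision, f_measure, accuracy
```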
Table 7. Identification rates of each speaker using F0, SP, and MFCC
Table 8. Identification rates of each speaker using F0 and SP

Tables 7 and 8 show the following:

  • The accuracy rate was approximately 69.23%. Using the MFCC parameters in addition to F0 and SP increased the accuracy rate compared with using F0 and SP alone (from 61.55% to 69.23%).

  • For our conversation atmosphere estimation, a high precision rate is desirable. The precision rate was improved by approximately 10 percentage points by using MFCC (from 62.89% to 73.16%).

  • The performance depends on the speaker. In particular, the accuracy rates decreased for speakers B and C. However, the differences between speakers tend to decrease when MFCC is used.

4.5 Discussion

For the conversation atmosphere estimation, all extracted utterances should be normal speech utterances without laughter utterances, so the precision rate is the most important measure. The obtained precision rate was 73.16%, which is still insufficient identification performance. However, if we simply extracted all utterances as normal speech, the precision rate would be only 53.29% (the number of normal speech utterances divided by the total number of utterances in Table 5). Our method therefore improves the precision rate considerably, and the MFCC parameters are particularly effective in this improvement.
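For reference, in the extract-everything baseline every utterance is accepted as normal speech, so its precision reduces to the proportion of normal speech utterances in the data (with the counts taken from Table 5):

$$ Precision_{baseline} = \frac{N_{normal}}{N_{normal} + N_{laughter}} = 0.5329 $$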

To clarify the reason for the effectiveness of MFCC, we compared the standard deviation of MFCC (SD-MFCC) values between the normal speech utterances and laughter utterances.

Figure 9 compares the normal and laughter utterances using the SD-MFCC values of the 2nd and 4th dimensions. The SVM method was used for identification, and the curved line in Fig. 9 indicates the boundary between normal speech and laughter determined by the SVM.

This figure indicates that the area covered by the SD-MFCC values of the laughter utterances is small: almost all of the 2nd- and 4th-dimension values are plotted within the SVM boundary. In contrast, the area covered by normal speech is wide. These results indicate that although it is difficult to separate the two areas completely, it is possible to determine a region in which only normal speech utterances are plotted. This suggests that the MFCC values are useful specifically for extracting normal speech utterances.

The identification performance shown in Fig. 9 depends on the speaker. In the future, it is necessary to develop a speaker adaptation method to improve its performance.

Fig. 9. Comparison between normal and laughter utterances using SD-MFCC values (2nd and 4th dimensions)

5 Conclusion

We studied a method of identifying target speech utterances so that a conversation support system can be used in a lounge environment. We found that the standard deviation of the SP values (SD-SP) is effective for identifying target speech utterances among ambient speech. In an identification experiment using an SVM, although the recall and precision rates varied depending on the speaker, the average rate across all six speakers was 80.5%. We also evaluated the identification performance between normal speech and laughter utterances using F0, SP, and MFCC; the precision rate of normal speech identification was 73.16%.

These results indicate that these acoustic characteristics would be effective in conversation atmosphere estimation and suggest that our conversation support system could be useful in practical scenarios.