IEEE SIGNAL PROCESSING MAGAZINE | November 2019 | 1053-5888/19©2019IEEE
Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today. This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound-capturing device. The challenges encountered differ markedly from those of many other use cases of automatic speech recognition (ASR). The purpose of this article is to describe, in a way that is amenable to the nonspecialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multichannel acoustic echo cancellation (MAEC), microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, and
Reinhold Haeb-Umbach, Shinji Watanabe, Tomohiro Nakatani, Michiel Bacchiani, Björn Hoffmeister, Michael L. Seltzer, Heiga Zen, and Mehrez Souden
Digital Object Identifier 10.1109/MSP.2019.2918706 Date of current version: 29 October 2019
Speech Processing for Digital Home Assistants
Combining signal processing with deep-learning techniques
©ISTOCKPHOTO.COM/MF3D
Authorized licensed use limited to: California State University - Chico. Downloaded on March 19,2021 at 18:24:57 UTC from IEEE Xplore. Restrictions apply.
high-quality speech synthesis as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all of these fields, deep learning (DL) has played a critical role.
Evolution of digital home assistants
In the last several years, the smart speaker has emerged as a rapidly growing new category of consumer electronic devices. Smart speakers are Internet-connected loudspeakers containing a digital assistant that can perform a variety of tasks through a hands-free spoken language interface. In many cases, these devices lack a screen, and voice is the only input and output modality. These digital home assistants initially performed a small number of tasks, such as playing music, retrieving the time or weather, setting alarms, and basic home automation. Over time, the capabilities of these systems have grown dramatically, as developers have created third-party “skills” in much the same way that smartphones created an ecosystem of apps.
The success of smart speakers in the marketplace can be largely attributed to advances in all of the constituent technologies that make up a digital assistant, including the digital signal processing involved in capturing the user’s voice, the speech recognition that turns this voice into text, the natural language understanding that converts the text into a user’s intent, the dialog system that decides how to respond, the natural language generation (NLG) that puts the system’s action into natural language, and finally, the speech synthesis that speaks this response to the user.
In this article, we describe in detail the signal processing and speech technologies that are involved in capturing the user’s voice and converting it to text in the context of digital assistants for smart speakers. We focus on these aspects of the system because they are the ones most different from previous digital assistants, which reside on mobile phones. Unlike smartphones, smart speakers are located at a fixed location in a home environment, and thus need to be capable of performing accurate speech recognition from anywhere in the room. In these environments, the user may be several meters from the device; as a result, the captured speech signal can be significantly corrupted by ambient noise and reverberation. In addition, smart speakers are typically screenless devices, so they need to support completely hands-free interaction, including accurate voice activation to wake up the device.
We present breakthroughs in the field of far-field ASR, where reliable recognition is achieved despite significant signal degradations. We show how the DL paradigm has penetrated virtually all components of the system and has played a pivotal role in the success of digital home assistants.
Note that several of the technological advancements described in this article have been inspired or accompanied by efforts in the academic community, which have provided researchers the opportunity to carry out comprehensive evaluations of technologies for far-field robust speech recognition using shared data sets and a common evaluation framework. Notably, the Computational Hearing in Multisource Environments (CHiME) series of challenges [1], [2], the Reverberant Voice Enhancement and Recognition Benchmark (REVERB) Challenge [3], and the Automatic Speech Recognition in Reverberant Environments (ASpIRE) Challenge [4] were met with considerable enthusiasm by the research community.
While these challenges led to significant improvements in the state of the art, they were focused primarily on speech recognition accuracy in far-field conditions as the criterion for success. Factors such as algorithmic latency or computational efficiency were not considered. However, the success of digital assistants in smart speakers can be attributed to not just the system’s accuracy but also its ability to operate with low latency, which creates a positive user experience by responding to the user’s query with an answer shortly after the user stops speaking.
The acoustic environment in the home
In a typical home environment, the distance between the user and the microphones on the smart loudspeaker is on the order of a few meters. There are multiple ways in which this distance negatively impacts the quality of the recorded signal, particularly when compared to a voice signal captured on a mobile phone or headset.
First, signal attenuation occurs as the sound propagates from the source to the sensor. In free space, the power of the signal per unit surface decreases with the square of the distance. This means that if the distance between the speaker and microphone is increased from 2 cm to 1 m, the signal will be attenuated by 34 dB. In reality, the user’s mouth is not an omnidirectional source and, therefore, the attenuation will not be this severe; however, it still results in a significant loss of signal power.
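The 34-dB figure follows directly from the inverse-square law. The short sketch below reproduces it under the idealized free-field, omnidirectional-source assumption stated in the text; the function name is chosen here for illustration only.

```python
import math


def free_field_attenuation_db(d_near_m, d_far_m):
    """Level drop in dB when moving from d_near_m to d_far_m in free space.

    Power per unit surface falls with the square of the distance, so the
    power ratio is (d_far / d_near) ** 2, i.e. 20 * log10(d_far / d_near) dB.
    """
    return 20.0 * math.log10(d_far_m / d_near_m)


# Close-talk distance (2 cm) versus a talker 1 m away: roughly 34 dB.
attenuation = free_field_attenuation_db(0.02, 1.0)
print(f"{attenuation:.1f} dB")
```

Doubling the distance costs about 6 dB under the same assumption, which is why a talker several meters away is so much weaker than a close-talk recording.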
Second, the distance between the source and a sensor in a contained space such as a living room or kitchen causes reverberation as a consequence of multipath propagation.
FIGURE 1. An AIR consists of the direct sound, early reflections, and late reverberation. (Plot: amplitude versus time in ms, 0–350 ms, with the three regions labeled.)
The wavefront of the speech signal repeatedly reflects off the walls and objects in the room. Thus, the signal recorded at the microphone consists of multiple copies of the source signal, each with a different attenuation and time delay. This effect is described by the acoustic impulse response (AIR) or its equivalent representation in the frequency domain, the acoustic transfer function (ATF). Reverberant speech is thus modeled as the original source signal filtered by the AIR.
An AIR can be broadly divided into direct sound, early reflections (up to roughly the first 50 ms), and late reverberation, as shown in Figure 1. While early reflections are actually known to improve the perceptual quality by increasing the signal level compared to the “dry” direct path signal, the late reverberation causes difficulty in perception (for humans and machines alike) because it smears the signal over time [5].
The degree of reverberation is often measured by the time it takes for the signal power to decrease to 60 dB below its original value; this is referred to as the reverberation time and is denoted by $T_{60}$. Its value depends on the size of the room, the materials comprising the walls, floor, and ceiling, as well as the furniture. A typical value for a living room is between 300 and 700 ms. Because the reverberation time is usually much longer than the typical short-time signal analysis window of 20–64 ms, its effect cannot be adequately described by considering a single speech frame in isolation. Thus, the convolution of the source signal with the AIR cannot be represented by multiplying their corresponding transforms in the short-time Fourier transform (STFT) domain; rather, it is approximated by a convolution over frames:
$$x_{t,f} = \sum_{m=0}^{M-1} a_{m,f}\, s_{t-m,f}. \tag{1}$$
Here, $x_{t,f}$, $s_{t,f}$, and $a_{t,f}$ are the STFT coefficients of the reverberated signal, source signal, and AIR, respectively, at (discrete) time frame $t$ and frequency bin index $f$. The length $M$ of the STFT of the AIR is approximately given by $T_{60}/B$, where $B$ is the frame advance (e.g., 10 ms). Clearly, the effect of reverberation spans multiple consecutive time frames, leading to a temporal dispersion of a speech event over adjacent speech feature vectors.
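To make the frame-level convolution in (1) concrete, the following sketch evaluates it for a single frequency bin and computes how many frames a given reverberation time spans. The function names and the toy coefficient values are illustrative assumptions, not part of the original article.

```python
import math


def frames_spanned(t60_ms, frame_advance_ms):
    """Approximate AIR length M in frames, M ~ T60 / B."""
    return math.ceil(t60_ms / frame_advance_ms)


def convolve_over_frames(s, a):
    """Evaluate x[t] = sum_{m=0}^{M-1} a[m] * s[t - m] for one frequency bin.

    s: STFT coefficients of the source over time frames.
    a: STFT coefficients of the AIR (length M) for the same bin.
    Frames before the start of the signal are treated as zero.
    """
    M = len(a)
    x = []
    for t in range(len(s)):
        acc = 0j
        for m in range(min(M, t + 1)):
            acc += a[m] * s[t - m]
        x.append(acc)
    return x


# A living room with T60 between 300 and 700 ms and a 10-ms frame advance
# smears each speech event over roughly 30 to 70 frames.
print(frames_spanned(300, 10), frames_spanned(700, 10))

# Toy example: a single active source frame is spread over three frames.
print(convolve_over_frames([1, 0, 0, 0], [1, 0.5, 0.25]))
```

The toy output illustrates the temporal dispersion described above: energy from one frame leaks into the frames that follow it.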
Third, in a distant-talking speech recognition scenario, it is likely that the microphone will capture other interfering sounds in addition to the desired speech signal. These sources of acoustic interference can be diverse, hard to predict, and often nonstationary in nature, and thus difficult to compensate for. In a home environment, common sources of interference include TV or radio, home appliances, and other people in the room.
These signal degradations can be observed in Figure 2, which shows the speech utterance “Alexa stop” recorded 1) as close talk, 2) as distant speech, and 3) as distant speech with additional background speech. Keyword detection and speech recognition are much more challenging in the last case.
The final major source of signal degradation is the capture of signals that originate from the loudspeaker itself during playback. Because the loudspeaker and the microphones are colocated on the device, the playback signal can be as much as 30–40 dB louder than the user’s voice, rendering the user’s command inaudible if no countermeasures are taken.
System overview
Figure 3 shows the high-level overview of a digital home assistant’s speech processing components. For sound rendering, the loudspeaker system plays music or system responses. For sound capture, digital home assistants employ an array of microphones (typically between two and eight). Due to the form factor of the device, the array is compact, with distances between the microphones on the order of a few centimeters. In the following section, techniques from multichannel signal processing are described that can compensate for many of the sources of signal degradation discussed previously.
FIGURE 2. A speech utterance starting with the wake word “Alexa” followed by “Stop” in close-talk, reverberated, and noisy, reverberated conditions. The red bars indicate the detected start and end times of the keyword “Alexa” and the end of the utterance. (Panels: close talk; distant speech; distant speech with background noise.)
The signal processing front end performs acoustic echo cancellation, dereverberation, noise reduction (NR), and source separation, all of which aim to clean up the captured signal for input to the downstream speech recognizer. For a true hands-free interface, the system must detect whether speech has been directed to the device. This can be done using
• wake-word detectors (also called hotword, keyword, or voice-trigger detectors), which decide whether a user has said the keyword (e.g., “OK Google”) that addresses the device
• end-of-query detectors, which are equally important for signaling that the user’s input is complete
• second-turn, device-directed speech classifiers, which eliminate the need to use the wake word when resuming an ongoing dialogue
• speaker identification modules, which make the system capable of interpreting a query in a user-dependent way.
Once device-directed speech is detected, it is forwarded to the ASR component.
The recognized word sequence is then forwarded to the natural language processing (NLP) and dialog management subsystem, which analyzes the user input and decides on a response. The NLG component prepares the desired system response, which is spoken out on the device through the text-to-speech (TTS) component. Note that NLP is beyond the scope of this article. The remainder of this article focuses on the various speech processing tasks.
Some of the aforementioned processing tasks are carried out on the device, typically those close to the input–output, while others are done on the server. Although the division between client and server may vary, it is common practice to run signal enhancement and wake-word detection on the device, while the primary ASR and NLP are done on the server.
Multichannel speech enhancement
The vector of the $D$ microphone signals $\mathbf{y} = (y_1, \ldots, y_D)^{\mathsf{T}}$ at time–frequency (tf) bin $(t, f)$ can be written in the STFT domain [6] as

$$\mathbf{y}_{t,f} = \underbrace{\sum_{i=1}^{N_s} \sum_{m=0}^{M-1} \mathbf{a}^{(i)}_{m,f}\, s^{(i)}_{t-m,f}}_{\text{speech}} + \underbrace{\sum_{j=1}^{N_o} \sum_{m=0}^{M-1} \mathbf{w}^{(j)}_{m,f}\, o^{(j)}_{t-m,f}}_{\text{playback}} + \underbrace{\mathbf{n}_{t,f}}_{\text{noise}}. \tag{2}$$
The first sum is over the $N_s$ speech sources $s^{(i)}_{t,f}$, $i = 1, \ldots, N_s$, where $\mathbf{a}^{(i)}_{t,f}$ is the vector of ATFs from the $i$th source to the microphones. The second sum describes the playback of the $N_o$ loudspeaker signals $o^{(j)}_{t,f}$, $j = 1, \ldots, N_o$, which are inadvertently captured by the microphones via the ATF vector $\mathbf{w}^{(j)}_{t,f}$ at frequency bin $f$. Additionally, $\mathbf{n}_{t,f}$ denotes additive noise; here, we assume for simplicity that the transfer functions are time invariant and of the same length.
Only one of these many signals contains the user’s command; all other components of the received signal are distortions. In the following section we describe how to extract this desired signal.
MAEC
MAEC is a signal processing approach that prevents signals generated by a device’s loudspeaker from being captured by the device’s own microphones and confusing the system. MAEC is a well-established technology that relies on the use of adaptive filters [7]; these filters estimate the acoustic paths between loudspeakers and microphones to identify the part of the microphone signal that is caused by the system output and then subtract it from the captured microphone signal.
Linear adaptive filters can suppress the echoes by typically 10–20 dB, but they cannot remove them completely. One reason is the presence of nonlinear components in the echo signal, which are caused by loudspeaker nonlinearities and mechanical vibrations. Another reason is that the filters must be kept short enough to adapt quickly to changing echo paths, so their lengths are usually shorter than the true loudspeaker-to-microphone impulse responses. Furthermore, there is a well-known ambiguity issue with system identification in MAEC [7].
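As an illustration of the adaptive-filter principle, the sketch below implements a single-channel, time-domain normalized least-mean-squares (NLMS) echo canceller. NLMS is one common choice of adaptive algorithm and is an assumption here; the article does not state which algorithm is used in any product. The echo path, step size, and signal lengths are hypothetical, and the scenario is idealized: there is no near-end speech, no noise, and the filter is as long as the simulated echo path, so the cancellation achieved is far better than the 10–20 dB seen in practice.

```python
import math
import random


def nlms_echo_canceller(reference, mic, num_taps, step=0.5, eps=1e-8):
    """Return the residual e[n] = mic[n] - estimated echo[n].

    reference: loudspeaker (playback) samples.
    mic: microphone samples containing the echo of the reference.
    """
    w = [0.0] * num_taps      # adaptive estimate of the echo path
    buf = [0.0] * num_taps    # most recent reference samples, newest first
    residual = []
    for x_n, d_n in zip(reference, mic):
        buf = [x_n] + buf[:-1]
        echo_hat = sum(w_k * x_k for w_k, x_k in zip(w, buf))
        e = d_n - echo_hat
        norm = sum(x_k * x_k for x_k in buf) + eps
        # NLMS update: step in the direction of the reference, normalized
        # by its short-term energy.
        w = [w_k + step * e * x_k / norm for w_k, x_k in zip(w, buf)]
        residual.append(e)
    return residual


def power_db(x):
    return 10.0 * math.log10(sum(v * v for v in x) / len(x) + 1e-30)


random.seed(0)
echo_path = [0.5, -0.3, 0.1]  # hypothetical loudspeaker-to-microphone response
playback = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [sum(h * playback[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(playback))]

residual = nlms_echo_canceller(playback, mic, num_taps=3)
# Echo return loss enhancement over the last 500 samples, after convergence.
erle_db = power_db(mic[-500:]) - power_db(residual[-500:])
print(f"ERLE after convergence: {erle_db:.0f} dB")
```

Shortening `num_taps` below the length of `echo_path`, or adding a nonlinear term to `mic`, leaves a residual echo that the linear filter cannot remove, which is the situation the residual echo suppressor described next is meant to handle.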
Therefore, it is common practice in acoustic echo cancellation to employ a residual echo suppressor following echo cancellation. In a modern digital home assistant, its filter coefficients are determined with the help of a neural network (NN) [6]. The deep NN (DNN) is trained to estimate, for each tf bin, a speech presence probability (SPP). Details of this procedure are described in “Unsupervised and Supervised Speech Presence Probability Estimation.” From this SPP a mask can be computed, which separates desired speech-dominated tf bins from those dominated by residual echoes, and from this information, the coefficients of a multichannel filter for residual echo suppression are computed.
FIGURE 3. An overview of the example architecture of signal processing tasks in a smart loudspeaker. (Blocks: MAEC; dereverberation, beamforming, source separation, and channel selection; device-directed speech detection; ASR; NLP, dialog management, and NLG; knowledge base; TTS.)
With MAEC in place, the device can listen for a command while the loudspeaker is in use, e.g., playing music. The user can barge in and still be understood, an important feature for user convenience. Once the wake-up keyword has been detected, the loudspeaker signal and MAEC are ducked or switched off, while the speech recognizer is activated.
Dereverberation
We now turn our attention to the first sum in (2). Assuming for simplicity that a single speech source is present, this term simplifies to (1).
As mentioned previously, it is the late reverberation that is harmful to speech recognition performance. Decomposing the reverberated signal into the direct sound and early reflections $x^{(\text{early})}_{t,f}$ and the late reverberation $x^{(\text{late})}_{t,f}$ according to

$$x_{t,f} = x^{(\text{early})}_{t,f} + x^{(\text{late})}_{t,f}, \tag{3}$$

it is the late reverberation that a dereverberation algorithm aims to remove, while preserving the direct signal and early reflections.
There is a wealth of literature on signal dereverberation [5]. Approaches can be broadly categorized into linear filtering and magnitude or power spectrum-estimation techniques. For ASR tasks, the linear filtering approach is recommended because it does not introduce nonlinear distortions to the signal, which can be detrimental to speech recognition performance.
Using the signal model in (1) where the AIR is a finite impulse response, a Kalman filter can be derived as the statistically optimum linear estimator under a Gaussian source assumption. Because the AIR is unknown and even time varying, the Kalman filter is embedded in an expectation maximization (EM) framework, where Kalman filtering and signal parameter estimation alternate [8].
If the reverberated signal is modeled as an autoregressive stochastic process instead, linear prediction-based dereverberation filters can be derived. A particularly effective method that has found widespread use in far-field speech recognition is the weighted prediction error (WPE) approach [9]. WPE can be formulated as a multiple-input, multiple-output filter,
Unsupervised and Supervised Speech Presence Probability Estimation
In the unsupervised learning approach, a spatial mixture model is used to describe the statistics of $\mathbf{y}_{t,f}$ or a quantity derived from it:

$$p(\mathbf{y}_{t,f}) = \sum_{k=0}^{1} \pi_k\, p(\mathbf{y}_{t,f} \mid \theta_k), \tag{S1}$$
where we assumed a single speech source and where $\pi_k$ is the a priori probability that an observation belongs to mixture component $k$ and $p(\mathbf{y}_{t,f} \mid \theta_k)$ is an appropriate component distribution with parameters $\theta_k$ [17]–[19]. This model rests upon the well-known sparsity of speech in the short-time Fourier transform (STFT) domain [20]

$$\mathbf{y}_{t,f} = \begin{cases} \mathbf{a}_f\, s_{t,f} + \mathbf{n}_{t,f}, & z_{t,f} = 1 \\ \mathbf{n}_{t,f}, & z_{t,f} = 0, \end{cases} \tag{S2}$$

where $z_{t,f}$ is the hidden class affiliation variable, which indicates speech presence. The model parameters are estimated via the expectation maximization (EM) algorithm, which delivers the speech presence probability (SPP) $\gamma_{t,f} = \Pr(z_{t,f} = 1 \mid \mathbf{y}_{t,f})$ in the E-step [21].
The supervised learning approach to SPP estimation employs a neural network (NN). Given a set of features extracted from the microphone signals at its input and the true class affiliations $z_{t,f}$ at the output, the network is trained to output the SPP $\gamma_{t,f}$ [22], [23]. Because all of the STFT bins $f = 0, \ldots, F-1$ are used as inputs, the network is able to exploit interfrequency dependencies, while the mixture model-based SPP estimation operates on each frequency independently. If additional cross-channel features, such as interchannel phase differences, are used as inputs, spatial information can also be exploited for SPP estimation.
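In a batch implementation, given the SPP, the spatial covariance matrices of speech plus noise and of noise are estimated by

$$\boldsymbol{\Sigma}^{(y)}_f = \frac{\sum_t \gamma_{t,f}\, \mathbf{y}_{t,f}\, \mathbf{y}_{t,f}^{\mathsf{H}}}{\sum_t \gamma_{t,f}}; \qquad \boldsymbol{\Sigma}^{(n)}_f = \frac{\sum_t (1 - \gamma_{t,f})\, \mathbf{y}_{t,f}\, \mathbf{y}_{t,f}^{\mathsf{H}}}{\sum_t (1 - \gamma_{t,f})}. \tag{S3}$$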
From these covariance matrices, the beamformer coefficients of most common beamformers can be readily computed [21]. By an appropriate definition of the noise mask, this concept can also be extended to noisy and reverberant speech, leading to a significant dereverberation effect of the beamformer [24], as shown in Figure 4.
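The batch estimates in (S3) amount to SPP-weighted averages of outer products. A minimal sketch for one frequency bin is given below; the function name and the toy data are illustrative assumptions, and the code assumes that neither weight sum is zero.

```python
def masked_covariances(Y, gamma):
    """SPP-weighted spatial covariance estimates for one frequency bin.

    Y: list of T observation vectors, each a list of D complex STFT values.
    gamma: list of T speech presence probabilities in [0, 1].
    Returns (sigma_y, sigma_n) as D x D nested lists, following (S3).
    Assumes sum(gamma) > 0 and sum(1 - gamma) > 0.
    """
    D = len(Y[0])
    sigma_y = [[0j] * D for _ in range(D)]
    sigma_n = [[0j] * D for _ in range(D)]
    sum_speech = sum(gamma)
    sum_noise = sum(1.0 - g for g in gamma)
    for y, g in zip(Y, gamma):
        for r in range(D):
            for c in range(D):
                outer = y[r] * y[c].conjugate()   # element of y y^H
                sigma_y[r][c] += g * outer / sum_speech
                sigma_n[r][c] += (1.0 - g) * outer / sum_noise
    return sigma_y, sigma_n


# Two frames, two microphones: the first frame is labeled as speech,
# the second as noise.
Y = [[1 + 0j, 1j], [2 + 0j, 0j]]
gamma = [1.0, 0.0]
sigma_y, sigma_n = masked_covariances(Y, gamma)
print(sigma_y)
print(sigma_n)
```

With soft SPP values between 0 and 1, every frame contributes to both estimates in proportion to how likely it is to contain speech.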
Low latency in a smart loudspeaker is important and impacts both the design of the EM (or statistical methods in general) and the NN-based approaches, see, e.g., [6], [15], and [25] for further discussion.
allowing further multichannel processing, such as beamforming, to follow it [10], [11]. The underlying idea of WPE is to estimate the late reverberation $\mathbf{x}^{(\text{late})}_{t,f}$ and subtract it from the observation to obtain a maximum likelihood estimate of the early arriving speech

$$\hat{\mathbf{x}}^{(\text{early})}_{t,f} = \mathbf{x}_{t,f} - \mathbf{G}_{t,f}\, \tilde{\mathbf{x}}_{t-\Delta,f}. \tag{4}$$
Here, $\mathbf{G}_{t,f}$ is a matrix containing the linear prediction coefficients for the different channels, and $\tilde{\mathbf{x}}_{t-\Delta,f}$ is a stacked representation of the observations, $\tilde{\mathbf{x}}_{t-\Delta,f} = (\mathbf{x}^{\mathsf{T}}_{t-\Delta,f}, \ldots, \mathbf{x}^{\mathsf{T}}_{t-\Delta-L+1,f})^{\mathsf{T}}$, where $L$ is the length of the dereverberation filter. It is important to note that $\hat{\mathbf{x}}^{(\text{early})}_{t,f}$ at time frame $t$ is estimated from observations at least $\Delta$ frames in the past. This ensures that the dereverberation filter does not destroy the inherent temporal correlation of a speech signal, which is not caused by the reverberation. The filter coefficient matrix cannot be estimated in closed form; the reason is that the driving process of the autoregressive model, $\mathbf{x}^{(\text{early})}_{t,f}$, has an unknown and time-varying variance $\lambda_{t,f}$. However, an iterative procedure can be derived, which alternates between estimating the variance $\lambda_{t,f}$ and the matrix of filter coefficients $\mathbf{G}_{t,f}$ on signal segments.
Because WPE is an iterative algorithm, it is not suitable for use in a digital home assistant, where low latency is important; however, the estimation of the filter coefficients can be cast as a recursive least squares problem [12]. Furthermore, using the average over a window of observed speech power spectra as an estimate of the signal variance $\lambda_{t,f}$, a very efficient low-latency version of the algorithm can be used [13].
Many authors reported that WPE leads to word error rate (WER) reductions of a subsequent speech recognizer [13], [14]. How much of a WER reduction is achieved by dereverberation depends on many factors such as degree of reverberation, signal-to-noise ratio (SNR), difficulty of the ASR task, robustness of the models in the ASR decoder, and so on. In [13], relative WER improvements of 5–10% were reported on simulated digital home assistant data with a pair of microphones and a strong back-end ASR engine.
Multichannel NR and beamforming
Multichannel NR aims to remove additive distortions, denoted by $\mathbf{n}_{t,f}$ in (2). If the AIR from the desired source to the sensors is known, a spatial filter (i.e., a beamformer) can be designed that emphasizes the source signal over signals with different transfer characteristics. In its simplest form, this filter compensates for the different propagation delays that the signals at the individual sensors of the microphone array exhibit and that are caused by their slightly different distances to the source.
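The simplest form mentioned above is usually called a delay-and-sum beamformer. The sketch below applies it to a single frequency bin under a far-field plane-wave assumption; the propagation delays, the frequency, and the helper `plane_wave` are hypothetical values and names chosen only for illustration.

```python
import cmath


def plane_wave(s, delays_s, freq_hz):
    """STFT values seen at each microphone when source value s arrives
    with the given per-microphone propagation delays (in seconds)."""
    return [s * cmath.exp(-2j * cmath.pi * freq_hz * tau) for tau in delays_s]


def delay_and_sum(y, look_delays_s, freq_hz):
    """Undo the delays expected for the look direction, then average."""
    aligned = [y_d * cmath.exp(2j * cmath.pi * freq_hz * tau)
               for y_d, tau in zip(y, look_delays_s)]
    return sum(aligned) / len(aligned)


freq_hz = 1000.0
look_delays = [0.0, 1e-4, 2e-4]        # hypothetical delays for the desired talker
other_delays = [0.0, -1e-4, -2e-4]     # hypothetical delays for an interferer

target_out = delay_and_sum(plane_wave(1 + 0j, look_delays, freq_hz),
                           look_delays, freq_hz)
interferer_out = delay_and_sum(plane_wave(1 + 0j, other_delays, freq_hz),
                               look_delays, freq_hz)
print(abs(target_out), abs(interferer_out))
```

The target passes with unit gain because its phases line up after compensation, while the interferer adds incoherently and is attenuated. The attenuation depends strongly on frequency and array geometry, which is one reason this simple design is insufficient in reverberant rooms.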
For the noisy and reverberant home environment, however, this approach is too simplistic. The microphone signals differ not only in their relative delay; the whole reflection pattern they are exposed to is different. Assuming again a single speech source and good echo suppression and dereverberation, (2) reduces to
$$\mathbf{y}_{t,f} = \mathbf{x}_{t,f} + \mathbf{n}_{t,f} \approx \mathbf{a}_f\, s_{t,f} + \mathbf{n}_{t,f}, \tag{5}$$
where $\mathbf{a}_f$ is the vector form of the AIRs to multiple microphones, and where we assume it to be time invariant under the condition that the source and microphone positions do not change during a speech segment (e.g., an utterance). Note that unlike (1) and (2), the multiplicative transfer function approximation is used here, which is justified by the preceding dereverberation component. Any signal component that deviates from this assumption can be viewed as captured by the noise term $\mathbf{n}_{t,f}$. Similarly, residual echoes can be viewed as contributing to $\mathbf{n}_{t,f}$, which results in a spatial filter for denoising, dereverberation, and residual echo suppression.
Looking at (5), it is obvious that $s_{t,f}$ and $\mathbf{a}_f$ can only be identified up to a (complex-valued) scalar because $s_{t,f} \cdot \mathbf{a}_f = (s_{t,f} \cdot C) \cdot (\mathbf{a}_f / C)$. To fix this ambiguity, a scale factor is chosen such that for a given reference channel, e.g., channel 1, the value of the transfer function is 1. This yields the so-called relative transfer function (RTF) vector $\tilde{\mathbf{a}}_f = \mathbf{a}_f / a_{1,f}$.
Spatial filtering for signal enhancement is a classic and well-studied topic for which statistically optimal solutions are known; however, these textbook solutions usually assume that the RTF $\tilde{\mathbf{a}}_f$, or its equivalent in anechoic environments (i.e., the vector of time differences of arrival), is known, which is an unrealistic assumption. The key to spatial filtering is, again, SPP estimation (see “Unsupervised and Supervised Speech Presence Probability Estimation”). The SPP tells us which tf bins are dominated by the desired speech signal and which are dominated by noise. Given this information, spatial covariance matrices for speech and noise can be estimated, from which, in turn, the beamformer coefficients are computed. An alternative approach is to use the SPP to derive a tf mask, which multiplies tf bins dominated by noise with zero, thus leading to an effective mask-based NR.
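As one example of a statistically optimal textbook solution, the sketch below computes minimum variance distortionless response (MVDR) weights for a two-microphone array at a single frequency bin. MVDR is chosen here for illustration; the article does not single out a particular beamformer. The noise covariance matrix and the RTF vector are hypothetical numbers standing in for estimates that would in practice be derived from the SPP-weighted covariance matrices.

```python
def mvdr_weights_2ch(sigma_n, rtf):
    """MVDR weights w = Sigma_n^{-1} a / (a^H Sigma_n^{-1} a) for D = 2.

    sigma_n: 2 x 2 Hermitian noise covariance matrix (nested lists).
    rtf: relative transfer function vector a with a[0] == 1.
    """
    (n00, n01), (n10, n11) = sigma_n
    det = n00 * n11 - n01 * n10
    inv = [[n11 / det, -n01 / det],
           [-n10 / det, n00 / det]]           # closed-form 2 x 2 inverse
    v = [inv[0][0] * rtf[0] + inv[0][1] * rtf[1],
         inv[1][0] * rtf[0] + inv[1][1] * rtf[1]]
    denom = rtf[0].conjugate() * v[0] + rtf[1].conjugate() * v[1]
    return [v[0] / denom, v[1] / denom]


def apply_beamformer(w, y):
    """Beamformer output w^H y for one tf bin."""
    return sum(w_d.conjugate() * y_d for w_d, y_d in zip(w, y))


sigma_n = [[2 + 0j, 0.5 + 0.5j],
           [0.5 - 0.5j, 1 + 0j]]              # hypothetical noise covariance
rtf = [1 + 0j, 0.8j]                          # hypothetical RTF estimate

w = mvdr_weights_2ch(sigma_n, rtf)
# The distortionless constraint: the desired source passes with unit gain.
print(apply_beamformer(w, rtf))
```

The weights minimize the noise power at the output subject to the constraint that a signal arriving via the RTF is left undistorted in the reference channel, which is why the printed gain is (numerically) one.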
Figure 4 shows the effectiveness of beamforming for an example utterance. The spectrogram, i.e., the tf representation of a clean speech signal, is displayed in Figure 4(a), followed in Figure 4(b) by the same utterance after convolution with an AIR, and in Figure 4(c) after the addition of noise. Figure 4(d) shows the output of the beamformer, which effectively removed noise and reverberation.
The usefulness of acoustic beamforming for speech recognition is well documented. On the CHiME 3 and 4 challenge data, acoustic beamforming reduced the WER by nearly half. On typical digital home assistant data, WER reductions on the order of 10–30% relative were reported [6], [15], [16].
Source separation and stream selection
Now we assume that, in addition to the desired speech source, there are other competing talkers, resulting in a total of $N_s$ speech signals, see (2). Blind source separation (BSS) is a technique that can separate multiple audio sources into individual audio streams autonomously. Traditionally, researchers tackle speech source separation using either unsupervised methods, e.g., independent component analysis and clustering [26], or DL [27], [28]. With clustering in particular, BSS using spatial mixture models is a powerful tool that decomposes the microphone array signal into the individual talkers’ signals [17]–[19]. The parameters and variables of those mixture models are learned via the EM algorithm, as explained in “Unsupervised and Supervised Speech Presence Probability Estimation.” The only difference is that the mixture model now has as many components as there are concurrent speakers. During the EM, a source activity probability (SAP), which is the equivalent of the SPP in the multispeaker case, is estimated for each speaker.
Extraction of the individual source signals may be achieved using the estimated SAP to derive for each speaker a mask, by which all tf bins not dominated by this speaker are zeroed out. An alternative to this is to use the SAP to compute beamformers, one for each of the speakers, similar to what is explained in “Unsupervised and Supervised Speech Presence Probability Estimation.”
Once the sources are separated, it remains to be decided which of the streams contains the user’s command for the digital home assistant. In [6], it is proposed to base this decision on the detection of the wake-up keyword (e.g., “Hey Siri”): If the wake-word detector indicates the presence of the keyword, all streams, i.e., the output streams of source separation and the output of the acoustic beamformer, are scored for the presence of the keyword, and the stream with the highest score is considered to contain the user’s command.
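The selection rule described above reduces to picking the stream with the highest keyword score. A minimal sketch follows; the stream names and score values are hypothetical, and how the scores are produced by the wake-word detector is outside the scope of this example.

```python
def select_stream(keyword_scores):
    """Return the name of the stream with the highest wake-word score.

    keyword_scores: mapping from stream name to the keyword score that a
    wake-word detector assigned to that stream.
    """
    return max(keyword_scores, key=keyword_scores.get)


# Hypothetical scores for two separated streams and the beamformer output.
scores = {
    "separated_stream_1": 0.12,
    "separated_stream_2": 0.87,
    "beamformer_output": 0.64,
}
chosen = select_stream(scores)
print(chosen)
```

Only the chosen stream would then be passed on to the speech recognizer.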
ASR
The key knowledge sources for ASR are the acoustic model (AM), the pronunciation model (PM), and the language model (LM). The LM assigns probabilities to hypothesized strings. The PM maps strings to subword units, where the phoneme is a common choice. Probabilities of the acoustic realization of the subword units (generally using some context) are expressed by the AM.
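The way these three knowledge sources interact is commonly summarized by the following decision rule. It is given here as the standard textbook formulation rather than as a formula from this article; $X$ denotes the sequence of acoustic observations, $W$ a hypothesized word string, and $U$ a sequence of subword units, and the acoustics are assumed to depend on $W$ only through $U$.

```latex
\[
\hat{W} = \arg\max_{W}\;
  \underbrace{P(W)}_{\text{LM}}
  \sum_{U}
  \underbrace{P(U \mid W)}_{\text{PM}}\,
  \underbrace{p(X \mid U)}_{\text{AM}}
\]
```

In practice, the sum over pronunciations is often approximated by its largest term during decoding.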
The AM of a speech recognition system is realized by a DNN. Such models estimate the posterior probabilities of subword units in context given the input signal. State-of-the-art network architectures borrow concepts from image recognition networks, e.g., the ResNet [29], and also include sequential modeling through the use of recurrent structures. Many sites use long short-term memory (LSTM) network layers or time-delay NN structures to incorporate that temporal component into the model.
The AM is trained from examples and generally requires large corpora to allow robust parameter estimation of these models (on the order of thousands of hours). It is essential that these corpora reflect the type of utterances that the device will recognize. For very novel applications, as was the case with the early deployment of digital home assistants, example data was not available, and collecting such large amounts of training data before product launch was considered uneconomical. Complicating matters further, the expected variability is very large for speech coming into such a device; therefore, the bootstrapping problem of a model for a digital home assistant is considerably more complex than for other new application domains.
Bootstrapping the AM
Most sites that developed early digital assistants have large data sets of in-domain, close-talking material available. To make use of that data but render it suitable for the digital home assistant application, simulation techniques are employed. Using the well-known image method [30], sufficiently realistic AIRs can be generated for given room and microphone parameters. Alternatively, measured AIRs can be used, such as the collection in [31]. It is, of course, much easier to simulate thousands of AIRs representing large varieties of room and microphone array configurations in this way than to measure them. The nonreverberant close-talk recordings are then convolved with these AIRs to generate reverberant speech. It should be mentioned, however, that this simulates a static scenario. In reality, an AIR is time-varying, i.e., even the smallest movements of the speaker or changes in the environment will lead to a different reverberation pattern. Nevertheless, experience shows that systems trained with artificially generated reverberant speech perform robustly on real reverberant data.
FIGURE 4. The spectrogram of (a) a clean, (b) a reverberated, (c) a noisy and reverberated, and (d) an enhanced speech signal. Enhancement has been achieved with a beamformer that was trained to treat both noise and late reverberation as distortion. (Axes: frequency bin index versus time frame index; color scale: power in dB.)
Additional data augmentation techniques [32] are also used in the development of an AM for a new application. They perturb existing recordings along different perceptually relevant parameters, such as speaking rate, vocal tract length, pitch, SNR, and noise type.
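The simulation and augmentation steps described above can be illustrated with a minimal sketch. It is not the pipeline of any particular site: the AIR here is a toy exponentially decaying noise sequence rather than an image-method or measured response, white noise stands in for the close-talk recording, and the names `synthetic_air`, `reverberate`, and `add_noise` are hypothetical.

```python
import numpy as np

def synthetic_air(rt60_s, fs, rng):
    """Toy AIR: exponentially decaying white noise (NOT the image method)."""
    n = int(rt60_s * fs)
    t = np.arange(n) / fs
    # The amplitude envelope drops by 60 dB over rt60_s seconds.
    decay = 10.0 ** (-3.0 * t / rt60_s)
    air = rng.standard_normal(n) * decay
    air[0] = 1.0  # direct path
    return air / np.max(np.abs(air))

def reverberate(clean, air):
    """Convolve close-talk speech with an AIR and keep the original length."""
    return np.convolve(clean, air)[: len(clean)]

def add_noise(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
fs = 16000
clean = rng.standard_normal(fs)  # stand-in for one second of close-talk speech
air = synthetic_air(rt60_s=0.5, fs=fs, rng=rng)
reverberant = reverberate(clean, air)
noisy = add_noise(reverberant, rng.standard_normal(fs), snr_db=10.0)
```

Drawing a new AIR, noise signal, and SNR for every training utterance yields many acoustic conditions from a single close-talk corpus. As noted above, such a simulation remains static, because one fixed AIR is applied to the whole utterance.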
Integrating enhancement and acoustic modeling
Although experimentation showed that the simulation of reverberant distant-talking speech from close-talking corpora was effective in mitigating many of the problems posed in this setting, there is a large body of work that uses enhancement from multichannel processing to (further) mitigate the problems that arise in distant-talking speech recognition, as discussed previously. However, independent optimization of the enhancement component and the acoustic modeling component may not lead to performance improvements per se because a mismatch in the training objectives can adversely affect the overall system performance. It appears advantageous to optimize the AM and the enhancement component jointly with a criterion close to the ASR objective. This bypasses the signal-related objective functions used in classic beamforming, such as maximizing the output SNR, and ensures that the enhancement result benefits the ASR system that consumes its output. This direction was first advocated by [33] in Gaussian mixture-based acoustic modeling.
More recently, it was proposed to perform multichannel enhancement jointly with acoustic modeling in a DNN framework [34], [35]. To leverage the differences in the fine time structure of the signals at the different microphones, it is necessary to input the raw time-domain signal or its equivalent complex-valued STFT representation to the network. This is different from standard acoustic modeling, where the time-domain signal is first compressed to feature vector representations, such as logarithmic mel spectra or cepstra, which no longer carry subtle time information. A close look at the filter coefficients learned in the initial layers of the network showed that, indeed, beamformer-like spatial filters could be identified and that frequency resolutions resembling the mel filter bank were found [34].
An alternative to this single large enhancement and acoustic modeling network is to keep enhancement and AM separate and still optimize both jointly toward an ASR-related criterion. It has been shown in [36] how the NN for SPP estimation (see “Unsupervised and Supervised Speech Presence Probability Estimation”) can be trained from the objective function of the AM by back-propagating the gradient through the AM DNN and the beamforming operation all the way to the DNN for SPP estimation. Clearly, this can be viewed as one large DNN with fixed, nontrainable signal processing layers in between.
A direct comparison of the fully integrated approach with separate, albeit jointly trained, speech enhancement and acoustic modeling stages on a common ASR task has not been reported. Both techniques have been shown to provide significant WER reductions when compared to ASR on single-channel inputs, even when many distant-talking examples were used in training. However, what can be stated is that the integrated approach requires more training data because it has to learn the multichannel processing from the data. The approach with a separate beamformer in front of the AM acts as a kind of regularizer, helping the overall system to settle on appropriate local minima of the networks, thus requiring less training data and being computationally less demanding.
Note that the integration can be extended even from subword unit DNN acoustic modeling to end-to-end speech recognition, which allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective [37], [38].
In [39], the effect of integrating enhancement and acoustic modeling was reported using a Google Home production system, where relative WER improvements between 8% and 28% were obtained by integrating WPE dereverberation and DNN-based multichannel processing with the AM of the production system.
Language modeling
Language modeling for the assistant is complex due to the wide range of applications for which it is used. Taking sample utterances from these interactions for the entire population of users allows us to estimate an LM that covers the domain as a whole. LMs used in the first pass are n-gram models that predict the next word and the sentence end based on a limited history of typically the three or four preceding words. Speech recognition systems often produce an n-best list in the first pass and apply a rescoring second pass using a log-linear or neural LM working on the complete sentence.
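As a rough illustration of the two passes, the following sketch trains a toy trigram LM with add-one smoothing (a history of two words, shorter than the three or four mentioned above) and uses it to re-rank a made-up n-best list. All sentences, scores, and function names are hypothetical, and a production second pass would use a stronger sentence-level model than the trigram reused here.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_trigram(sentences):
    """Count trigrams and their two-word histories, including the sentence end."""
    tri, hist, vocab = Counter(), Counter(), {EOS}
    for s in sentences:
        words = [BOS, BOS] + s.split() + [EOS]
        vocab.update(words[2:])
        for i in range(2, len(words)):
            tri[tuple(words[i - 2 : i + 1])] += 1
            hist[tuple(words[i - 2 : i])] += 1
    return tri, hist, vocab

def log_prob(sentence, model):
    """Add-one smoothed trigram log probability of a whole sentence."""
    tri, hist, vocab = model
    words = [BOS, BOS] + sentence.split() + [EOS]
    lp = 0.0
    for i in range(2, len(words)):
        h, w = tuple(words[i - 2 : i]), words[i]
        lp += math.log((tri[h + (w,)] + 1) / (hist[h] + len(vocab)))
    return lp

def rescore(nbest, model, lm_weight=1.0):
    """Second pass: pick the hypothesis with the best combined log score."""
    return max(nbest, key=lambda x: x[1] + lm_weight * log_prob(x[0], model))[0]

model = train_trigram(["play some music", "play some jazz", "set an alarm"])
# (hypothesis, acoustic log score) pairs; the values are invented.
nbest = [("play sum music", -10.0), ("play some music", -10.5)]
best = rescore(nbest, model)
```

In this toy case the acoustically preferred hypothesis loses to the one that the LM considers more plausible, which is the purpose of the rescoring pass.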
For an individual user, however, the actual entropy of the utterances is lower. For instance, if users want to name a contact, they will likely pick a name that is in their own contact list and are less likely to pick a name that is on the contact list of some other user. In other words, a statically trained LM is a good fit for the domain but has poor priors when it comes to an individual user. More generally, the context in which an utterance is produced will have an impact on the content of the utterance. Digital assistant systems generally implement this adjustment by biasing, i.e., adjusting the LM probabilities “on the fly” using the current context. The approach proposed in [40] achieves the biasing by boosting selected n-grams in the LM. An alternate method used in [41] does an on-the-fly adjustment of the weights in an LM interpolation.
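The interpolation variant of biasing can be sketched as follows. This is only a schematic reading of the idea, not the method of [41]: the component LMs, the context rule in `context_weights`, and all probabilities are invented for illustration.

```python
def interpolate(component_probs, weights):
    """P(w | h) = sum_k lambda_k * P_k(w | h), with the weights summing to one."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * component_probs[k] for k in weights)

def context_weights(context):
    """Hypothetical rule: trust the contact-list LM more while the user is dialing."""
    if context == "dialing":
        return {"general": 0.3, "contacts": 0.7}
    return {"general": 0.95, "contacts": 0.05}

# Probability of the same next word (a contact name) under each component LM.
p = {"general": 1e-6, "contacts": 0.05}
p_default = interpolate(p, context_weights("none"))
p_dialing = interpolate(p, context_weights("dialing"))
```

Because only the interpolation weights change with the context, the component LMs themselves can stay fixed, which keeps the on-the-fly adjustment cheap.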
A second aspect resulting from the multitude of use cases is multilinguality and code switching. Multilingual support for utterance-by-utterance language switching is implemented following the approach proposed in [42] by running several speech recognition systems, one per language, in parallel. The best system, and hence the language, is chosen after recognition is completed, either solely based on the scores of the language-dependent systems or supported by a language identification component. Code switching within an utterance is usually implemented by adding a small degree of multilinguality directly to the LM, e.g., an Indian–English speech recognition system usually also covers a limited set of common phrases in languages such as Hindi, Telugu, and so on. Catalogs, such as those supporting a music or shopping domain, are a special case for a virtual assistant because multilingual content is common there. For instance, users of an Indian–English system often ask for music or video titles in their native Indic language.
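The selection among recognizers running in parallel can be sketched in a few lines. This is a hedged illustration only: the function `pick_language`, the additive combination with language identification (LID) log posteriors, and all scores are assumptions, and real systems must also make the scores of different recognizers comparable.

```python
import math

def pick_language(results, lid_log_posteriors=None, lid_weight=1.0):
    """Choose a language from {language: (hypothesis, log_score)} results."""
    def total(lang):
        score = results[lang][1]
        if lid_log_posteriors is not None:
            score += lid_weight * lid_log_posteriors[lang]
        return score
    best = max(results, key=total)
    return best, results[best][0]

# Invented outputs of two recognizers for the same utterance.
results = {
    "en-IN": ("play the song", -42.0),
    "hi-IN": ("<hypothesis in Hindi>", -41.5),
}
lid = {"en-IN": math.log(0.9), "hi-IN": math.log(0.1)}

by_score_only = pick_language(results)
with_lid = pick_language(results, lid)
```

Here the recognizer scores alone slightly favor one language, while the additional LID evidence tips the decision toward the other.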
TTS synthesis
Most smart loudspeakers have no screen on which to display information. On these devices, audio is the most natural way of providing responses to users, and TTS synthesis is used to generate spoken responses.
In the back end of these digital home assistants, an NLG module translates raw data into an understandable text in a markup language. A TTS system takes the markup text as its input and renders speech output. It consists of text analysis (front-end) and speech synthesis (back-end) parts. The text analysis component includes a series of NLP modules, such as sentence segmentation, word segmentation, part-of-speech tagging, dictionary lookup, and grapheme-to-phoneme pronunciation conversion. The speech synthesis part is typically a cascade of prosody prediction and waveform generation modules.
In the digital home assistant domain, the text analysis component can access more contextual information than in other domains (e.g., synthesizing speech for a website), because the NLG module can provide it via markup language. For instance, sometimes it is difficult to disambiguate pronunciation of a place name only from a written text; however, the NLG system can access its knowledge base to resolve the ambiguity and provide it via markup. Furthermore, the front end can also incorporate explicit annotations, providing hints about prosody and discourse domain [43]. Such coupling between NLG and TTS modules allows better synthesis in this domain.
In the back end, either an example-based (concatenative) or model-based (generative) approach is used in the waveform generation module. The former finds the best sequence of small waveform units (e.g., half-phone, phone, and diphone level) from a unit database given target linguistic or acoustic features derived from an input text. The latter first learns a mapping function from text to speech by a model, then predicts a speech waveform given a text and the trained model. The concatenative approach is known to 1) require a large amount of speech data from a single speaker, 2) have a large footprint, 3) be computationally less expensive, and 4) have natural segmental quality but sound discontinuous. On the other hand, the generative approach is known to 1) require less training data from multiple speakers, 2) have a small footprint, 3) be computationally expensive, and 4) have smooth transitions but achieve relatively poor vocoder quality. As mentioned previously, achieving low latency is critical for use in digital home assistants. Vendors choose different approaches to synthesize naturally sounding speech with low latency. We discuss two very different solutions in the following to illustrate the range of options.
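The search performed by a concatenative system can be sketched as a dynamic-programming (Viterbi) search over candidate units. This is a generic textbook formulation under stated assumptions, not the algorithm of any specific product: the cost values are invented, and `target_costs` and `concat_cost` are hypothetical stand-ins for the target and join costs.

```python
import numpy as np

def unit_selection(target_costs, concat_cost):
    """Viterbi search for the unit sequence with the lowest total cost.

    target_costs: list of 1-D arrays; target_costs[t][i] is the mismatch between
        candidate unit i and the target features at position t.
    concat_cost: function (t, i, j) giving the cost of joining candidate i at
        position t - 1 to candidate j at position t.
    """
    best = np.asarray(target_costs[0], dtype=float)
    backptr = []
    for t in range(1, len(target_costs)):
        n_prev, n_cur = len(best), len(target_costs[t])
        join = np.array([[concat_cost(t, i, j) for j in range(n_cur)]
                         for i in range(n_prev)])
        total = best[:, None] + join  # shape (n_prev, n_cur)
        backptr.append(np.argmin(total, axis=0))
        best = np.min(total, axis=0) + np.asarray(target_costs[t], dtype=float)
    # Trace the best path back from the cheapest final candidate.
    path = [int(np.argmin(best))]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(np.min(best))

# Toy example: two candidates per position; joining candidates with the same
# index is free (as if they were adjacent in the same recording).
costs = [np.array([1.0, 0.2]), np.array([0.5, 0.1]), np.array([0.3, 0.9])]
path, cost = unit_selection(costs, lambda t, i, j: 0.0 if i == j else 1.0)
```

The join cost is what penalizes audible discontinuities, so the search may accept a worse target match in order to keep consecutive units from the same recording.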
The Siri Team at Apple developed on-device, DL-guided, hybrid unit selection concatenative TTS systems [43] to achieve these goals. Conventionally, hidden Markov models (HMMs) were used in hybrid unit selection TTS systems. Later, HMMs were replaced by deep- and recurrent-mixture density networks to compute probabilistic acoustic targets and concatenation costs. Multiple levels of optimizations (e.g., long units, preselection, unit pruning, local caching, and parallel computation) enable the system to produce high-quality speech with an acceptable footprint and computational cost. Additionally, as an on-device system, it can synthesize speech without an Internet connection, allowing on-device, low-latency streaming synthesis. Combined with a higher sampling rate (i.e., 22–48 kHz) and better audio compression, the system achieved significant improvements over the team's conventional system. This Siri DL-based voice has been used since iOS 10.
On the other hand, Google Home uses a server-side generative TTS system to achieve these goals. Because Google’s TTS system runs on servers, an Internet connection is essential; however, even on Wi-Fi-connected smart loudspeakers, an Internet connection can be unstable. Audio streaming using an unstable connection causes stuttering within a response. To prevent stuttering, no streaming is used in Google’s TTS service; rather, after synthesizing an entire utterance, the server sends the audio to the device. The device then starts playing the audio after receiving the entire response. Although this approach improves user experience, achieving low latency becomes challenging. To achieve high-quality TTS with low latency, Google developed the Parallel WaveNet-based TTS system [44]. Conventional generative TTS systems often synthesized “robotic”-sounding vocoded speech. The introduction of sample-level autoregressive audio generative models, such as WaveNet [45], has drastically improved the naturalness of such systems; however, WaveNet is computationally expensive and difficult to parallelize due to its autoregressive nature.
Parallel WaveNet introduced probability density distillation, which allows for the training of a parallel feed-forward network from an autoregressive network with no significant difference in segmental naturalness. Because of the parallelization-friendly architecture of Parallel WaveNet, by running it on tensor processing units, it achieved a 1,000× speedup (i.e., 20× faster than real time) relative to the original autoregressive WaveNet, while retaining its ability to synthesize high-fidelity speech samples. This Parallel WaveNet-based voice has been used in Google Assistant since October 2017.
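The two reported factors can be related with a short back-of-the-envelope calculation. The sketch below uses only the figures quoted above; the 3-s response length is an arbitrary example, and the result ignores network transfer and all other processing.

```python
speedup_vs_autoregressive = 1000.0  # reported speedup of Parallel WaveNet
parallel_speed = 20.0               # seconds of audio generated per second of compute

# Implied generation speed of the original autoregressive model.
autoregressive_speed = parallel_speed / speedup_vs_autoregressive
compute_seconds_per_audio_second = 1.0 / autoregressive_speed

# Synthesis time for a hypothetical 3-s response, which the device cannot
# start playing until the whole utterance has been synthesized and received.
response_s = 3.0
parallel_synthesis_s = response_s / parallel_speed
autoregressive_synthesis_s = response_s / autoregressive_speed
```

Under these figures, the autoregressive model would need about 50 s of computation per second of audio, whereas the parallel model synthesizes the example response in a fraction of a second, which is what makes non-streaming delivery practical.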
Fully hands-free interaction
Digital home assistants have a completely hands-free voice-controlled interface. This has important and challenging implications for speech processing systems. The first, most obvious one is that the device must always be listening to recognize when it is addressed by a user. There are other challenges as well, which are described in the following sections.
Wake-up word detection
To detect whether a user is addressing the device, a wake-up keyword, e.g., “Alexa,” “OK Google,” or “Hey Siri,” is defined. If this word is detected, the device concludes that the ensuing speech is meant for it. It is extremely important for user satisfaction that this keyword detection works very reliably, with both a very low false alarm rate and a high recall rate. This, however, is not simple, particularly in the face of poor signal quality (see Figure 2). Certainly, long wake-up words are easier to detect than short ones; however, because the keyword acts as the “name” of the device, its choice is influenced by marketing aspects, leaving minimal room for engineering considerations. Another requirement is low latency; i.e., the system must answer as quickly as a human would. Furthermore, one must bear in mind that the keyword-spotting algorithm runs on the device. This is different from the ASR component, which runs on the server. Therefore, memory footprint and computational load considerations also play an important role [6].
In one approach proposed in [46], voice activity detection (VAD) is used in an initial step to reduce computation so that the search for a keyword is conducted only if speech has been detected. If speech is detected, a sliding window, whose size depends on the length of the keyword, is swept over the data, and a DNN classifier operates on the frames inside the window. In its simplest form, classification is based upon a fully connected DNN, without any time alignment, resulting in significantly lower computational costs and latency compared to that of ASR. Then, max-pooling along the time axis is carried out on the DNN posteriors to arrive at a confidence score for the presence of a keyword.
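The posterior handling step can be sketched as follows, assuming that frame-level DNN posteriors are already available. The smoothing and the geometric-mean combination are one common formulation and should be read as assumptions rather than as the exact procedure of [46]; the posterior values, window sizes, and the 0.5 threshold are invented.

```python
import numpy as np

def smooth(posteriors, width):
    """Moving average of each label's posterior over the last `width` frames."""
    posteriors = np.asarray(posteriors, dtype=float)
    out = np.empty_like(posteriors)
    for t in range(len(posteriors)):
        out[t] = posteriors[max(0, t - width + 1) : t + 1].mean(axis=0)
    return out

def keyword_confidence(posteriors, window, smooth_width=3):
    """Per-frame keyword confidence from frame-level label posteriors.

    posteriors: array of shape (frames, labels); column k holds the DNN
        posterior of the k-th keyword label (background label excluded).
    """
    sm = smooth(posteriors, smooth_width)
    conf = np.empty(len(sm))
    for t in range(len(sm)):
        win = sm[max(0, t - window + 1) : t + 1]
        # Max-pool each label over time, then take the geometric mean.
        conf[t] = np.prod(win.max(axis=0)) ** (1.0 / sm.shape[1])
    return conf

# Toy posteriors for a two-part keyword: the first label fires, then the second.
post = np.array([[0.0, 0.0], [0.9, 0.0], [0.9, 0.0],
                 [0.0, 0.9], [0.0, 0.9], [0.0, 0.0]])
conf = keyword_confidence(post, window=4)
detected = conf.max() > 0.5
```

No time alignment or decoding is involved: the confidence is high only if every keyword label has been active somewhere inside the sliding window.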
To improve detection accuracy, convolutional [47], time-delay [48], or recurrent network layers [49] have been proposed, as well as subword modeling of the keyword and background speech using a DNN-HMM architecture [50], all of which aim to exploit the temporal properties of the input signal for classification. To further reduce false alarm rates, multistage keyword-detection algorithms have been developed, where initial hypotheses are rechecked using cues like keyword duration, individual likelihoods of the phones comprising the keyword, and so forth [50]. This second-stage classifier is again realized by a DNN. The experiments in [50] show that using subword-based background models can reduce false accept rates (FARs) by roughly 37% relative at a fixed false rejection rate (FRR) of 4%. This work also demonstrates the effectiveness of the two-stage approach, which can reduce FARs by up to 67% relative at a 4% FRR. A different voice trigger detection system was proposed in [51], where robustness and computational efficiency were achieved using a two-pass architecture.
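To make the metrics concrete, the sketch below computes FAR and FRR from detector scores at a given threshold, together with a relative reduction. It is an illustration only: the scores are invented, and the FAR is defined here as a fraction of non-keyword examples, whereas deployed systems often report false accepts per hour of audio.

```python
import numpy as np

def far_frr(scores, is_keyword, threshold):
    """False accept and false reject rates of a detector at a given threshold."""
    scores = np.asarray(scores, dtype=float)
    is_keyword = np.asarray(is_keyword, dtype=bool)
    accepted = scores >= threshold
    far = float(np.mean(accepted[~is_keyword]))  # non-keywords that triggered
    frr = float(np.mean(~accepted[is_keyword]))  # keywords that were missed
    return far, frr

def relative_reduction(baseline, improved):
    """Relative reduction, e.g., 0.37 for a 37% relative improvement."""
    return (baseline - improved) / baseline

scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
far, frr = far_frr(scores, labels, threshold=0.5)

# Hypothetical FARs of two systems whose thresholds were tuned to the same FRR.
reduction = relative_reduction(0.030, 0.0189)
```

Comparing FARs at a fixed FRR, as done in the experiments cited above, requires tuning each system's threshold separately until both reach the same FRR.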
End-of-query detection
Not only must the beginning of device-directed speech be detected; the system must also determine quickly and accurately when the user has finished speaking. However, speech pauses must not be taken falsely as the end of the query, nor must ambient noise, e.g., a TV heard in the background, be taken as the continuation of an utterance.
From these considerations, it is clear that a VAD can be no more than one source of information about the end of the user query [52]. Another source of information is the ASR decoder itself. Indeed, because the end-of-query detection is carried out on the server, the ASR engine is available for this task, and its AM and LM can be leveraged to identify the end of device-directed speech. An indication of this is whether the active decoder hypotheses indicate the end of sentence, followed by silence frames. Because low latency is important, the decision cannot be postponed until all competing search hypotheses inside the ASR decoder have died out. To achieve a high degree of reliability, it was proposed to average over all active search hypotheses with this property [53]. Yet another cue for the end of the user query is the recognized word sequence. Those sources of information (VAD, ASR decoder search properties, and the one-best word/character hypothesis) can be expressed as fixed-length features, which are input to a dedicated end-of-query DNN classifier [54].
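One way to picture the averaging over active hypotheses is the following sketch. It is an assumption-laden reading, not the formulation of [53]: hypotheses are weighted by their normalized decoder scores, and the resulting value is the share of that weight carried by hypotheses that have reached a sentence end followed by silence.

```python
import numpy as np

def end_of_query_probability(log_scores, ends_in_silence):
    """Score-weighted share of active hypotheses at a sentence end plus silence.

    log_scores: log scores of all active decoder hypotheses at the current frame.
    ends_in_silence: per hypothesis, True if it has reached the end of sentence
        and its most recent frames are aligned to silence.
    """
    log_scores = np.asarray(log_scores, dtype=float)
    weights = np.exp(log_scores - log_scores.max())  # numerically stable
    weights /= weights.sum()
    return float(weights[np.asarray(ends_in_silence, dtype=bool)].sum())

# Three active hypotheses; the two best ones have finished the sentence.
p_end = end_of_query_probability([-1.0, -1.5, -4.0], [True, True, False])
p_not = end_of_query_probability([-1.0, -1.5, -4.0], [False, False, False])
```

Such a value can be computed at every frame without waiting for competing hypotheses to die out, and it could serve as one of the fixed-length input features of the end-of-query classifier.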
Second-turn, device-directed speech classification
For a natural interaction, it is desirable that the system detects whether another query is meant for it, without the user having to repeat the wake-up keyword (for example, “Hey Cortana, what is the weather today?”, followed by the system answer, followed by “And what about tomorrow?”). One approach toward this functionality is to use a specific DNN classifier, which rests its decisions on features similar to those of the end-of-query detector [55], i.e., a fixed-length acoustic embedding of the utterance computed by an LSTM; ASR decoder-related features, e.g., the entropy of the forward probability distribution (large entropy indicating nondevice-directed speech); and features related to the 1-best word/character sequence hypothesis. Those features are combined and input to a dedicated second-turn, device-directed speech DNN detector.
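The entropy feature mentioned above is simple to compute once a probability distribution over competing decoder states or hypotheses is available. The sketch below uses two invented distributions; how the forward probabilities are obtained from the decoder is outside its scope.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

# A confident decoder (peaked distribution) versus a confused one (flat).
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])
h_flat = entropy([0.25, 0.25, 0.25, 0.25])
```

A flat distribution yields the maximum entropy, which matches the intuition that the recognizer is uncertain when it hears speech that was not directed at the device.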
An additional source of information for detecting device-directed speech is derived from the speaker characteristics, because the second turn is spoken by the same speaker as the first. Actually, all speech following the wake-up keyword and spoken by the same speaker is considered to be device directed. Thus, a speaker-embedding vector can be computed from the detected keyword speech. This embedding can be used to make an acoustic beamformer speaker dependent [56], and to improve the end-of-query and second-turn, device-directed speech detection [57]. The encoder for mapping the keyword speech to an embedding is learned jointly with the classifier detecting device-directedness. Thus, the classifier learns in a data-driven way what speaker and speech characteristics are relevant for detecting device-directedness.
Speaker identification
When digital home assistants are used by multiple members of a household, it is necessary to understand both what the user is asking for and who the user is. The latter is important to correctly answer queries such as “When is my next appointment?” To do so, the system must perform utterance-by-utterance speaker identification. Speaker identification algorithms can be text dependent, typically based on the wake-up keyword [58], [59], or text independent [60], [61], and run locally on the device or on the server. An enrollment process is usually necessary so that the assistant can associate speech with a user profile. Enrollment can be implemented by explicitly asking a user to provide an identity and a few example phrases. An alternative approach is for the assistant to identify speakers based on analyzing past utterances and, the next time it hears a known speaker, to ask that speaker to provide an identity.
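Enrollment and identification can be sketched with speaker embeddings and cosine scoring. This is a generic illustration, not the method of [58]–[61]: the three-dimensional embeddings, the user names, and the 0.7 threshold are invented, and a real system would obtain the embeddings from a trained speaker encoder.

```python
import numpy as np

def enroll(embeddings):
    """Average a few enrollment embeddings into a unit-length speaker profile."""
    profile = np.mean(embeddings, axis=0)
    return profile / np.linalg.norm(profile)

def identify(embedding, profiles, threshold=0.7):
    """Return the best-matching enrolled user, or None if no score reaches the threshold."""
    e = embedding / np.linalg.norm(embedding)
    scores = {name: float(e @ p) for name, p in profiles.items()}
    name = max(scores, key=scores.get)
    return name if scores[name] >= threshold else None

profiles = {
    "alice": enroll(np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])),
    "bob": enroll(np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])),
}
known = identify(np.array([0.8, 0.1, 0.0]), profiles)
unknown = identify(np.array([0.0, 0.0, 1.0]), profiles)
```

Returning `None` for a poor match corresponds to treating the speaker as a guest, for whom personal queries such as the calendar example above should not be answered.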
Case study
To illustrate the impact of front-end multichannel signal processing on ASR, the engineering team at Apple evaluated the performance of the far-field Siri speech processing system on a large speech test set recorded on HomePod in several acoustic conditions [6], such as
• music and podcast playback at different levels
• continuous background noise, including babble and rain noise
• directional noises generated by household appliances such as a vacuum cleaner, hairdryer, and microwave
• interference from external competing sources of speech.
In these recordings, the locations of HomePod and the test subjects were varied to cover different use cases, e.g., in living room or kitchen environments where HomePod was placed against the wall or in the middle of the room.
The performance of Siri online multichannel signal processing was investigated in a real-world setup, where the trigger detection and subsequent voice command recognition jointly affect the user experience. Therefore, two objective Siri performance metrics, i.e., the FRRs and WERs, are reported.
Figure 5 shows the FRRs. The triggering threshold is the same in all conditions to keep the false alarm rates to a minimum. It can be observed that mask-based NR is suitable in most acoustic conditions except for the multitalker scenario, which is well handled by the stream selection system. For instance, in the competing talker case, the absolute FRR improvement of the multistream system is 29.0% when compared to that of mask-based NR, which has no source separation capability, and 30.3% when compared to the output of the baseline digital signal processing (DSP) system (which includes echo cancellation and dereverberation). The gap between mask-based NR and the multistream system becomes smaller in other acoustic conditions. Overall, there is a clear trend of healthy voice trigger detection improvement when mask-based NR and source separation techniques (stream selection) are used.
FIGURE 5. The FRRs of a “Hey Siri” detector in the following acoustic conditions: reverberation, echo, noise, and a competing talker. +Baseline DSP refers to the baseline DSP. +Mask-Based NR refers to the baseline DSP and mask-based NR. +Stream Selection refers to the baseline DSP, mask-based NR, and stream selection [6]. (Vertical axis: FRR in %, 0 to 100; bars per condition: Unprocessed, +Baseline DSP, +Mask-Based NR, +Stream Selection.)
Figure 6 shows the WERs achieved by combining multichannel signal processing based on DL with the speech recognizer trained offline using internally collected live data from HomePod to augment an existing training set, which was found to improve ASR performance [6]. More details on data combination strategies to train AMs can be found in [2] and [3]. The blue portion of the bar represents the error rate of the triggered utterances, and the green portion represents the error rate due to falsely rejected utterances (missed utterances). Because triggered utterances can be different using one processing algorithm or another in different acoustic conditions, the WER numbers are directly influenced by the trigger performance. Different numbers of words are used for evaluation in the blue portion of the bars because the corresponding numbers of false rejections are significantly different for each case. It is obvious that the optimal and incremental integration of different speech processing technologies substantially improves the overall WER across conditions [6]. More specifically, the relative WER improvements are roughly 40%, 90%, 74%, and 61%, respectively, in the four investigated acoustic conditions of reverberant speech only, playback, loud background noise, and competing talker [6].
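The way a stacked WER bar and a relative improvement can be computed is sketched below. The normalization is an assumption (every word of a missed utterance is counted as a deletion and both parts are divided by the total word count), and all counts are invented; they are not the data behind Figure 6.

```python
def stacked_wer(errors_triggered, words_triggered, words_missed):
    """Split the overall WER into a triggered part and a missed-utterance part."""
    total_words = words_triggered + words_missed
    wer_triggered_part = errors_triggered / total_words
    wer_missed_part = words_missed / total_words  # all words count as deletions
    return wer_triggered_part, wer_missed_part

def relative_improvement(baseline_wer, improved_wer):
    """Relative WER improvement, e.g., 0.9 for a 90% relative improvement."""
    return (baseline_wer - improved_wer) / baseline_wer

# Hypothetical test set: 1,000 words, of which 100 belong to missed utterances.
triggered_part, missed_part = stacked_wer(80, 900, 100)
overall = triggered_part + missed_part

# Hypothetical overall WERs before and after front-end processing.
improvement = relative_improvement(0.50, 0.05)
```

This accounting makes explicit why the trigger performance directly influences the WER: every falsely rejected utterance adds its full word count to the error total.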
Summary and outlook
This article provided an overview of the speech processing challenges and solutions of digital home assistants. While DL is the method of choice for overcoming many of these challenges, it is apparent that there is more to it than simply training a deep neural black-box classifier on sufficiently large data sets. A clever interplay of signal processing and DL needed to be developed to realize reliable, far-field, fully hands-free spoken interaction. The great success of this new class of products comes with new challenges, such as how to extend the range of applications and supported languages in an economically sensible way. Because of their conceptual simplicity, end-to-end ASR architectures appear to be one way to cope with those new challenges. However, more research is needed until those new concepts have proven effective for handling the unique and demanding challenges of smart loudspeakers. To see what is already possible today, please view an IEEE Signal Processing Society promotional video on YouTube [62], which illustrates that smart loudspeakers showcase signal processing at its best.
Authors
Reinhold Haeb-Umbach (haeb@nt.uni-paderborn.de) received his Dipl.-Ing. and Dr.-Ing. degrees from Rheinisch-Westfälische Technische Hochschule Aachen, Germany, in 1983 and 1988, respectively. He has a background in speech research in both industrial and academic research environments. Since 2001, he has been a professor of communications engineering at Paderborn University, Germany. His research interests are in the fields of statistical signal processing and pattern recognition, with applications to speech enhancement, acoustic beamforming and source separation as well as automatic speech recognition and unsupervised learning from speech and audio. He has coauthored more than 200 scientific publications including Robust Automatic Speech Recognition: A Bridge to Practical Applications (Academic Press, 2015). He is a fellow of the International Speech Communication Association.
Shinji Watanabe (shinjiw@ieee.org) received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan, in 1999, 2001, and 2006, respectively. He is an associate research professor at Johns Hopkins University, Baltimore, Maryland. He was a research scientist at the Nippon Telegraph and Telephone Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011; a visiting scholar at the Georgia Institute of Technology, Atlanta, in 2009; and a senior principal research scientist at Mitsubishi Electric Research Laboratories, Cambridge, Massachusetts, from 2012 to 2017. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 150 papers and has received several awards, including the Best Paper Award from the Institute of Electronics, Information and Communication Engineers in 2003.
Tomohiro Nakatani (tnak@ieee.org) received his B.E., M.E., and Ph.D. degrees from Kyoto University, Japan, in 1989, 1991, and 2002, respectively. He is a senior distinguished researcher at the Nippon Telegraph and Telephone Communication Science Laboratories, Kyoto, Japan. He was a visiting scholar at the Georgia Institute of Technology, Atlanta, in 2005 and a visiting assistant professor at Nagoya University, Japan, from 2008 to 2017. He is a member of the IEEE Signal Processing Society Speech and Language Processing Technical Committee. His research interests are in audio signal processing technologies for intelligent human–machine interfaces, including dereverberation, denoising, source separation, and robust automatic speech recognition.
FIGURE 6. WERs in the following acoustic conditions: reverberation, echo, noise, and a competing talker. +Baseline DSP refers to the baseline DSP, which includes echo cancellation and dereverberation. +Mask-Based NR refers to the baseline DSP and mask-based NR. +Stream Selection refers to the baseline DSP, mask-based NR, and stream selection [6]. (Vertical axis: WER in %, 0 to 100; stacked bars: WER on triggered utterances and WER (deletion) due to missed utterances.)
Michiel Bacchiani (michiel@google.com) received his Ingenieur (ir.) degree from Technical University of Eindhoven, The Netherlands, and his Ph.D. degree from Boston University, Massachusetts. He has been a speech researcher with Google since 2005. Currently, he manages a research group at Google Tokyo, which is focused on the joint modeling of speech and natural language understanding. Previously, he managed the acoustic modeling team responsible for developing novel algorithms and training infrastructure for all speech recognition applications that back Google services. Prior to joining Google, he worked as a member of the technical staff at IBM Research, as a technical staff member at AT&T Labs Research, and as a research associate at Advanced Telecommunications Research in Kyoto, Japan.
Björn Hoffmeister (bhoffmeister@apple.com) received his M.S. degree from Lübeck University, Germany, and his Ph.D. degree from RWTH Aachen University, Germany. He was a senior science manager at Amazon, leading Alexa Speech, the automatic speech recognition R&D group. He joined Amazon in 2011 as a founding member of Alexa research and developed the wake-word detection solution. Following the launch of Echo in 2014, he managed and grew the speech R&D group, which supports speech for all Amazon Alexa devices across all languages. In 2015, he led R&D efforts for the Alexa Skills Kit project. In 2016, he helped to launch Amazon Web Services’ Lex service, which is based on Alexa speech and skills technology. In July 2019, he joined the Siri Team at Apple.
Michael L. Seltzer (mikeseltzer@fb.com) received his Sc.B. degree with honors from Brown University, Providence, Rhode Island, in 1996 and his M.S. and Ph.D. degrees from Carnegie Mellon University, Pittsburgh, Pennsylvania, in 2000 and 2003, respectively. Currently, he is a research scientist in the Applied Machine Learning Division of Facebook. From 1998 to 2003, he was a member of the Robust Speech Recognition group at Carnegie Mellon University. From 2003 to 2017, he was a member of the Speech and Dialog Research group at Microsoft Research. In 2006, he received the IEEE Signal Processing Society Best Young Author Award for his work optimizing microphone array processing for speech recognition. His research interests include speech recognition in adverse environments, acoustic modeling and adaptation, neural networks, microphone arrays, and machine learning for speech and audio applications.
Heiga Zen (heigazen@google.com) received his A.E. degree from Suzuka National College of Technology, Japan, in 1999 and his Ph.D. degree from the Nagoya Institute of Technology, Japan, in 2006. Currently, he is a senior staff research scientist at Google. He was an intern/co-op researcher at the IBM T.J. Watson Research Center, Yorktown Heights, New York, from 2004 to 2005, and a research engineer in the Cambridge Research Laboratory at Toshiba Research Europe Ltd., United Kingdom, from 2008 to 2011. While at Google, he was with the Speech Team from July 2011 to July 2018 and joined the Brain Team in August 2018.
Mehrez Souden (msouden@apple.com) received his Ph.D. and M.Sc. degrees from the Institut National de la Recherche Scientifique, University of Québec, Montréal, Canada, in 2010 and 2006, respectively. Currently, he is a senior audio and speech processing engineer with Interactive Media Group, Apple Inc. He was with the Nippon Telegraph and Telephone Communication Science Laboratories, Kyoto, Japan, from 2010 to 2012; the School of Electrical and Computer Engineering, the Georgia Institute of Technology, Atlanta, from 2013 to 2014; and the Intel Corporation from 2014 to 2015. In 2016, he joined Apple Inc. to work on signal processing and machine learning with an emphasis on acoustics and speech. He has published more than 50 papers. He received the Alexander Graham Bell Canada Graduate Scholarship and a postdoctoral fellowship, both from the Natural Sciences and Engineering Research Council, in 2008 and 2013, respectively.
References
[1] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” J. Comput. Speech Language, vol. 46, pp. 535–557, Nov. 2017. doi: 10.1016/j.csl.2016.11.005.
[2] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, “The fifth CHiME speech separation and recognition challenge: Dataset, task and baselines,” in Proc. INTERSPEECH, 2018, pp. 1561–1565.
[3] K. Kinoshita, M. Delcroix, S. Gannot, E. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas et al., “A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research,” in Proc. EURASIP Journal on Advances in Signal Processing, 2016. doi: 10.1186/s13634-016-0306-6.
[4] M. Harper, “The automatic speech recognition in reverberant environments (ASpIRE) challenge,” in Proc. IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, 2015, pp. 547–554.
[5] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, “Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 114–126, Nov. 2012.
[6] Audio Software Engineering and Siri Speech Team, “Optimizing Siri on HomePod in far-field settings,” Mach. Learn. J., vol. 1, no. 12, 2018. [Online]. Available: https://machinelearning.apple.com/2018/12/03/optimizing-siri-on-homepod-in-far-field-settings.html
[7] J. Benesty, T. Gänsler, D. Morgan, M. Sondhi, and S. Gay, Advances in Network and Acoustic Echo Cancellation. New York: Springer-Verlag, 2001.
[8] B. Schwartz, S. Gannot, and E. A. P. Habets, “Online speech dereverberation using Kalman filter and EM algorithm,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 2, pp. 394–406, 2015.
[9] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2008, pp. 85–88.
[10] T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening,” IEEE Trans. Audio, Speech, and Language Process., vol. 20, no. 10, pp. 2707–2720, 2012.
[11] L. Drude, C. Boeddeker, J. Heymann, K. Kinoshita, M. Delcroix, T. Nakatani, and R. Haeb-Umbach, “Integrating neural network based beamforming and weighted prediction error dereverberation,” in Proc. INTERSPEECH, 2018, pp. 3043–3047.
[12] T. Yoshioka, H. Tachibana, T. Nakatani, and M. Miyoshi, “Adaptive dereverberation of speech signals with speaker-position change detection,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 3733–3736.
[13] J. Caroselli, I. Shafran, A. Narayanan, and R. Rose, “Adaptive multichannel dereverberation for automatic speech recognition,” in Proc. INTERSPEECH, 2017, pp. 2017–1791.
[14] CHiME Challenge, “The 5th CHiME speech separation and recognition challenge: Results.” Accessed on: Dec. 7, 2018. [Online]. Available: http://spandh.dcs.shef.ac.uk/chime_challenge/results.html
[15] C. Boeddeker, H. Erdogan, T. Yoshioka, and R. Haeb-Umbach, “Exploring practical aspects of neural mask-based beamforming for far-field speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6697–6701.
[16] J. Heymann, M. Bacchiani, and T. Sainath, “Performance of mask based statistical beamforming in a smart home scenario,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6722–6726.
[17] N. Ito, S. Araki, and T. Nakatani, “Complex angular central Gaussian mixture model for directional statistics in mask-based microphone array signal processing,” in Proc. European Signal Processing Conf. (EUSIPCO), 2016, pp. 1153–1157.
[18] D. H. Tran Vu and R. Haeb-Umbach, “Blind speech separation employing directional statistics in an expectation maximization framework,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2010, pp. 241–244.
[19] N. Q. Duong, E. Vincent, and R. Gribonval, “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1830–1840, 2010.
[20] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Trans. Signal Process., vol. 52, no. 7, pp. 1830–1847, 2004.
[21] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 4, pp. 692–730, 2017.
[22] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 196–200.
[23] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Proc. INTERSPEECH, 2016, pp. 1981–1985.
[24] J. Heymann, L. Drude, and R. Haeb-Umbach, “A generic neural acoustic beamforming architecture for robust multi-channel speech processing,” J. Comput. Speech Language, vol. 46, no. C, pp. 374–385, 2017.
[25] Y. Liu, A. Ganguly, K. Kamath, and T. Kristjansson, “Neural network based time-frequency masking and steering vector estimation for two-channel MVDR beamforming,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6717–6721.
[26] M. Pedersen, J. Larsen, U. Kjems, and L. Parra, “A survey of convolutive blind source separation methods,” in Springer Handbook Speech Processing and Speech Communication, Jacob Benesty, Yiteng Huang, Mohan Sondhi, Eds. New York: Springer, Nov. 2007, pp. 114–126. [Online]. Available: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/4924/pdf/imm4924.pdf
[27] J. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
[28] D. Yu, M. Kolbaek, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241–245.
[29] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition. 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[30] J. B. Allen and D. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Amer., vol. 65, no. 4, pp. 943–950, 1979.
[31] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in Proc. Int. Workshop Acoustic Signal Enhancement (IWAENC), 2014, pp. 313–317.
[32] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home,” in Proc. INTERSPEECH, 2017, pp. 379–383.
[33] M. L. Seltzer, B. Raj, and R. M. Stern, “Likelihood-maximizing beamforming for robust hands-free speech recognition,” IEEE Trans. Speech Audio Process., vol. 12, no. 5, pp. 489–498, 2004.
[34] T. N. Sainath, R. J. Weiss, K. W. Wilson, B. Li, A. Narayanan, E. Variani, M. Bacchiani, I. Shafran et al., “Multichannel signal processing with deep neural networks for automatic speech recognition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 5, pp. 965–979, 2017.
[35] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang et al., “Deep beamforming networks for multi-channel speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5745–5749.
[36] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb-Umbach, “BEAMNET: End-to-end training of a beamformer-supported multi-channel ASR system,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5325–5329.
[37] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss et al., State-of-the-art speech recognition with sequence-to-sequence models. 2017. [Online]. Available: http://arxiv.org/abs/1712.01769
[38] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel end-to-end speech recognition,” in Proc. Int. Conf. Machine Learning (ICML), 2017.
[39] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak et al., “Acoustic modeling for Google Home,” in Proc. INTERSPEECH, 2017, pp. 399–403.
[40] P. S. Aleksic, M. Ghodsi, A. H. Michaely, C. Allauzen, K. B. Hall, B. Roark, D. Rybach, and P. J. Moreno, “Bringing contextual information to Google speech recognition,” in Proc. INTERSPEECH, 2015, pp. 468–472.
[41] A. Raju, B. Hedayatnia, L. Liu, A. Gandhe, C. Khatri, A. Metallinou, A. Venkatesh, and A. Rastrow, “Contextual language model adaptation for conversational agents,” in Proc. INTERSPEECH, 2018, pp. 3333–3337.
[42] H. Lin, J. Huang, F. Beaufays, B. Strope, and Y. Sung, “Recognition of multilingual speech in mobile applications,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4881–4884.
[43] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt et al., “Siri on-device deep learning-guided unit selection text-to-speech system,” in Proc. INTERSPEECH, 2017, pp. 4011–4015.
[44] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proc. Int. Conf. Machine Learning (ICML), 2018, pp. 3918–3926.
[45] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior et al., Wavenet: A generative model for raw audio. 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
[46] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.
[47] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proc. INTERSPEECH, 2015, pp. 1478–1482.
[48] K. Kumatani, S. Panchapagesan, M. Wu, M. Kim, N. Strom, G. Tiwari, and A. Mandal, “Direct modeling of raw audio with DNNs for wake word detection,” in Proc. IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, 2017, pp. 252–257.
[49] S. Fernández, A. Graves, and J. Schmidhuber, “An application of recurrent neural networks to discriminative keyword spotting,” in Proc. 17th Int. Conf. Artificial Neural Networks, 2007, pp. 220–229.
[50] M. Wu, S. Panchapagesan, M. Sun, J. Gu, R. Thomas, S. N. P. Vitaladevuni, B. Hoffmeister, and A. Mandal, “Monophone-based background modeling for two-stage on-device wake word detection,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5494–5498.
[51] Siri Team, “Hey Siri: An on-device DNN-powered voice trigger for Apple’s personal assistant,” Mach. Learn. J., vol. 1, no. 6, 2017. [Online]. Available: https://machinelearning.apple.com/2017/10/01/hey-siri.html
[52] M. Shannon, G. Simko, S.-y. Chang, and C. Parada, “Improved end-of-query detection for streaming speech recognition,” in Proc. INTERSPEECH, 2017, pp. 1909–1913.
[53] B. Liu, B. Hoffmeister, and A. Rastrow, “Accurate endpointing with expected pause duration,” in Proc. INTERSPEECH, 2015, pp. 2912–2916.
[54] R. Maas, A. Rastrow, C. Ma, G. Lan, K. Goehner, G. Tiwari, S. Joseph, and B. Hoffmeister, “Combining acoustic embeddings and decoding features for end-of-utterance detection in real-time far-field speech recognition systems,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5544–5548.
[55] S. Mallidi, R. Maas, K. Goehner, R. A. S. Matsoukas, and B. Hoffmeister, “Device-directed utterance detection,” in Proc. INTERSPEECH, 2018, pp. 1225–1228.
[56] J. Wang, J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, “Deep extractor network for target speaker recovery from single channel speech mixtures,” in Proc. INTERSPEECH, 2018, pp. 307–311.
[57] R. Maas, S. H. K. Parthasarathi, B. King, R. Huang, and B. Hoffmeister, “Anchored speech detection,” in Proc. INTERSPEECH, 2016, pp. 2963–2967.
[58] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4052–4056.
[59] S.-X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in Proc. IEEE Spoken Language Technology (SLT) Workshop, 2016, pp. 171–178.
[60] C. Zhang and K. Koishida, “End-to-end text-independent speaker verification with triplet loss on short utterances,” in Proc. INTERSPEECH, 2017, pp. 1487–1491.
[61] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
[62] YouTube, “Signal processing in home assistants.” Accessed on: Dec. 7, 2018. [Online]. Available: https://www.youtube.com/watch?v=LJ54btWttdo