Double-talk robust acoustic echo canceller based on CNN filter (Haengwoo Lee)

Received Feb 25, 2019 Revised Nov 11, 2019 Accepted Jan 31, 2020

Conventional acoustic echo cancellation works by using an adaptive algorithm to identify the impulse response of the echo path. In this paper, we use a CNN (convolutional neural network) filter to remove the echo from the microphone input signal, so that only the near-end speech is transmitted to the far end. With the neural network filter, the weights converge well on general speech signals. In particular, the filter operates stably, without divergence, even in the double-talk state, in which both parties speak simultaneously. Simulation results show that this system achieves better performance and more stable operation than an echo canceller with an adaptive filter structure. In double-talk, the ERLE of the CNN is about 3 dB better than that of a general neural network.


INTRODUCTION
Acoustic echo arises when the loudspeaker signal and the near-end signal are combined at the microphone and sent to the far end. The signal received from the far end is emitted through the near-end loudspeaker, picked up by the microphone together with the near-end speech, and returned, so the acoustic echo disturbs reception of the near-end speech at the far end. If this is not properly handled, the far-end speaker hears his or her own voice delayed by the round-trip time of the communication system, which is very unpleasant in hands-free calling. Generally, the echo is cancelled by adaptively identifying the acoustic impulse response between the loudspeaker and the microphone with an FIR (finite impulse response) filter [1]. However, this operates normally only in one-way conversation, in which only the far-end signal is present; in a double-talk interval, in which near-end speech is also present, the echo-cancelling ability deteriorates sharply. The presence of the near-end speech seriously degrades the convergence of the adaptive algorithm and can even cause the filter coefficients to diverge. One way to address the double-talk problem is to detect the double-talk state and stop updating the coefficients of the echo canceller, preventing divergence [2, 3]. However, this method has a relatively long detection time, so the coefficients may diverge before double-talk is detected. In addition, the signal received at the microphone contains not only echo and near-end speech but also nonlinear distortion of the echo caused by components such as power amplifiers and loudspeakers. To overcome these problems, a neural network structure that can model complex nonlinear relationships is a powerful alternative.
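As a point of reference, the conventional adaptive approach described above can be sketched as an NLMS-adapted FIR filter. The function name and parameter values below are illustrative, not taken from the original system:

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=128, mu=0.5, eps=1e-8):
    """Cancel echo from the microphone signal with an adaptive FIR filter (NLMS).

    far_end : loudspeaker (reference) signal
    mic     : microphone signal = echo (+ near-end speech, noise)
    Returns the residual error signal, the estimate of the near-end speech.
    """
    w = np.zeros(filter_len)              # adaptive FIR coefficients
    e = np.zeros(len(mic))                # residual (error) signal
    for n in range(filter_len, len(mic)):
        # tap vector: far[n], far[n-1], ..., far[n-filter_len+1]
        x = far_end[n - filter_len + 1:n + 1][::-1]
        y = w @ x                         # echo estimate
        e[n] = mic[n] - y                 # subtract estimated echo
        w += mu * e[n] * x / (x @ x + eps)  # normalized LMS update
    return e
```

During near-end silence this update converges to the echo path; it is exactly during double-talk, when `e[n]` also contains near-end speech, that the update is misled and the coefficients can diverge, which motivates the detector-free approach of this paper.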
The final goal of the acoustic echo canceller is to remove the echo signal and background noise completely and transmit only the near-end speech to the far end. From the point of view of speech separation, extracting the near-end speech from the microphone signal and transmitting it to the far end can naturally be treated as a speech separation problem. Therefore, instead of estimating the acoustic echo path, we apply speech separation learning to separate the near-end speech from the microphone signal, using the far-end signal, the raw source of the echo, as additional information. In this approach, the double-talk problem is solved without any dedicated detection circuit. Deep learning shows great potential for speech separation. Experimental results show that the proposed method cancels the acoustic echo well in noisy double-talk situations.
Computers, which once only executed human commands and corresponding simple behaviors, have transformed this paradigm by learning and reinforcing new connections. We experienced this change in 2016 through an artificial intelligence program called AlphaGo. Practical machine learning research began with the development of a multi-layer perceptron learning algorithm [4] in 1988. Through the 1990s, decision trees, Bayesian networks, and support vector machines began to be applied in the internet business to information search, data mining, e-commerce, and recommendation services. In 2006, Geoffrey Hinton announced a way to improve existing neural network technology by adding more layers to the network [5]. As a result, machine learning has evolved into deep learning, which has recently attracted attention by exceeding human performance on some tasks. In the late 2000s, machine learning contributed greatly to the advancement of the artificial intelligence industry, including Apple's Siri, IBM's Watson, and Google's automatic speech recognizer. Recently, deep learning models have achieved great results because techniques for training multi-layer neural networks composed of many layers have been developed. The error back-propagation algorithm can train a deep network with many layers by learning the lower-layer synapses before the upper layers. The most commonly used deep learning model is the CNN [6-8].
When the near-end and far-end speakers talk simultaneously in an acoustic echo cancellation system, the condition is called double-talk. Without a double-talk detector that works correctly, the echo canceller cannot reliably cancel echoes and distorts the speech sent to the far end. To detect the double-talk state, the cross-correlation energy of the far-end and microphone signals [9], the variance of the maximum tap value in the adaptive filter [10], or the zero-crossing rate of the error [11] has been used. However, because detection takes a long time, the filter coefficients may diverge before it triggers. Therefore, in this study, by using neural network technology, which is inherently robust to double-talk [12-15], no double-talk detector is needed. A neural-network-based acoustic echo canceller can operate reliably, without divergence of its weights, even under double-talk conditions.
In this paper, we discuss the learning algorithm of the multi-layer neural network in Section 2. In Section 3, we present the structure of a CNN filter suitable for acoustic echo cancellation. Section 4 discusses the experimental results and analyses. Finally, conclusions are drawn in Section 5.

Learning Algorithm of Multi-Layer Neural Network
A multi-layer perceptron has the structure of a multi-layer feed-forward neural network with one or more hidden layers. The weights and biases between the input layer and the hidden layer are denoted W1, b1, and those between the hidden layer and the output layer W2, b2. The weighted sum entering the j-th hidden neuron is denoted z_j^h, the weighted sum entering the k-th output neuron z_k^o, the activation function of the hidden neurons φ_h, and that of the output neurons φ_o.

Figure 2. Multi-layer neural network
The output values of the hidden neurons and output neurons can then be expressed by the following equations:

h_j = φ_h(z_j^h) = φ_h(Σ_i w1_ji x_i + b1_j),  o_k = φ_o(z_k^o) = φ_o(Σ_j w2_kj h_j + b2_k)
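A minimal sketch of this forward pass, assuming ReLU for both activation functions (the function names are ours, for illustration):

```python
import numpy as np

def relu(z):
    """ReLU activation: max{0, z}, applied element-wise."""
    return np.maximum(0.0, z)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer perceptron.

    h_j = phi_h(sum_i W1[j,i] x_i + b1_j)   (hidden neurons)
    o_k = phi_o(sum_j W2[k,j] h_j + b2_k)   (output neurons)
    """
    z_h = W1 @ x + b1   # weighted sums entering the hidden layer
    h = relu(z_h)
    z_o = W2 @ h + b2   # weighted sums entering the output layer
    o = relu(z_o)
    return h, o
```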
If we denote all weights and biases together by a parameter θ, the value of the k-th output neuron given the input x can be expressed as a function f_k(x, θ).
The back-propagation learning algorithm was developed by Geoffrey Hinton and colleagues in the mid-1980s. Supervised learning of the multi-layer perceptron is based on a cost function built from the difference between the target output values and the values produced by the perceptron. When the training data and target outputs are given as input-target pairs (x_n, t_n) (n = 1, …, N), the error over the whole training set X can be defined as the mean square error:

E(X, θ) = (1/2N) Σ_n Σ_k (t_nk − f_k(x_n, θ))²
In the above equation, the error function E(X, θ) takes a single value given the data set X and the parameters θ. The data set X is given from outside, and the quantity to be optimized is θ, so the error can be written E(θ). The back-propagation learning algorithm uses the gradient descent method to find the parameters that minimize the error function E(θ). Gradient descent is an algorithm that iteratively finds a parameter value minimizing the cost function.
θ ← θ − η ∂E(θ)/∂θ

where η is the learning rate that controls the speed of learning. In the multi-layer perceptron, back-propagation learning applies stochastic gradient descent, updating the weights using the per-sample error function E(x, θ) for each data item. The parameters θ to be corrected through learning are the weights W2 and biases b2 between the hidden layer and the output layer, and the weights W1 and biases b1 between the input layer and the hidden layer. Partially differentiating the error function with respect to the output-side parameters gives the following.
∂E/∂w2_kj = −(t_k − o_k) φ_o′(z_k^o) h_j = −δ_k h_j,  ∂E/∂b2_k = −δ_k

Here, φ_o′(z) is the derivative of the activation function of the output neuron; since the ReLU function max{0, z} is widely used, this derivative is generally the unit step function, φ_o′(z) = u(z). The quantity δ_k = (t_k − o_k) φ_o′(z_k^o) is the effect of the k-th output neuron on the error. Next, the error function is partially differentiated with respect to the input-side parameters:

∂E/∂w1_ji = −(Σ_k δ_k w2_kj) φ_h′(z_j^h) x_i = −δ_j^h x_i,  ∂E/∂b1_j = −δ_j^h,  where δ_j^h = (Σ_k δ_k w2_kj) φ_h′(z_j^h)
Taken together, the parameters between the input layer and the hidden layer are affected by the sum of the products of the hidden-to-output weights and the effect of each output neuron on the error. Since the error of each output neuron propagates backward to the hidden neurons and influences the adjustment of their parameters, this gradient descent learning method for the multi-layer perceptron is called the error back-propagation learning algorithm. Finally, each parameter is updated by the following equations:

w2_kj ← w2_kj + η δ_k h_j,  b2_k ← b2_k + η δ_k,  w1_ji ← w1_ji + η δ_j^h x_i,  b1_j ← b1_j + η δ_j^h
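These update rules can be sketched as one stochastic-gradient step for the two-layer perceptron (ReLU activations assumed; the function and variable names are ours, not from the original text):

```python
import numpy as np

def backprop_step(x, t, W1, b1, W2, b2, eta=0.01):
    """One stochastic-gradient step of error back-propagation.

    The output-layer error term delta_o is propagated back through W2
    to obtain the hidden-layer term delta_h, then every weight and bias
    is adjusted along the negative gradient.
    """
    # forward pass
    z_h = W1 @ x + b1
    h = np.maximum(0.0, z_h)
    z_o = W2 @ h + b2
    o = np.maximum(0.0, z_o)

    # backward pass: ReLU derivative is 1 for z > 0, else 0
    delta_o = (o - t) * (z_o > 0)            # output-layer error term
    delta_h = (W2.T @ delta_o) * (z_h > 0)   # back-propagated to hidden layer

    # gradient-descent parameter updates
    W2 -= eta * np.outer(delta_o, h)
    b2 -= eta * delta_o
    W1 -= eta * np.outer(delta_h, x)
    b1 -= eta * delta_h
    return W1, b1, W2, b2
```

Here `delta_o` is the negative of the δ_k defined above, so subtracting the gradient is the same update the equations describe.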

Acoustic echo cancellation based on CNN filter
The CNN-based neural network filter expresses the features of the signal not only in the time domain but also in the frequency domain. The CNN structure is composed of convolution kernels of various widths corresponding to frequency bands, and it can capture complex nonlinear characteristics depending on the number of hidden layers. In this paper, we use a network with two hidden layers and a five-band kernel. The kernel widths are 4, 8, 16, 32, and 64 samples, considering that the speech signal has high energy in the low-frequency band. Figure 3 shows an example with a kernel width of 4; the input data of adjacent neurons overlap, as in the networks with the other kernel widths. For convenience, assuming that the bias is zero and using ReLU (rectified linear unit), which is often used as an activation function, the output of each layer is obtained by convolving its input with the kernels and applying ReLU.
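A minimal sketch of such a multi-width convolutional stage under the stated assumptions (zero bias, ReLU, stride 1); the function name and kernel values are illustrative:

```python
import numpy as np

def multi_width_conv(x, kernels):
    """Forward pass of a multi-band convolutional layer.

    Each 1-D kernel (widths such as 4, 8, 16, 32, 64, giving different
    frequency resolutions) slides over the input frame with stride 1;
    the bias is taken as zero and ReLU is applied to each output.
    """
    outputs = []
    for w in kernels:
        k = len(w)
        # valid-mode convolution: one output per window position
        conv = np.array([w @ x[i:i + k] for i in range(len(x) - k + 1)])
        outputs.append(np.maximum(0.0, conv))   # ReLU activation
    return outputs
```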
The error of the output neuron with respect to the target value is e = t − y2, and the derivative of the activation function is φ′(z) = 1 for z > 0 (and 0 otherwise). Using the NLMS (normalized least mean square) algorithm, in which the correction step is normalized by the input energy, the weights of each layer are updated for z > 0 as follows:

w ← w + μ e x / (xᵀx)

where μ is the step size and x is the input vector of the corresponding layer.
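A sketch of one such NLMS-style update for a single neuron's weight vector, assuming the standard input-energy normalization and the ReLU gate described above; the exact per-layer form used in the paper may differ:

```python
import numpy as np

def nlms_layer_update(w, x, e, mu=0.1, eps=1e-8):
    """NLMS-style update of one weight vector w for input x and error e.

    The step is normalized by the input energy x.x, and the ReLU
    derivative (1 for z > 0, else 0) gates whether the update is applied.
    """
    z = w @ x                              # pre-activation of the neuron
    gate = 1.0 if z > 0 else 0.0           # phi'(z) for ReLU
    return w + mu * e * gate * x / (x @ x + eps)
```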

Simulation results
To evaluate the performance of the proposed acoustic echo canceller, simulations were performed with a Python program. The speech signal is sampled at 8 kHz and quantized to 8 bits. The simulated room size is (3×3×2) m, and the room impulse response has a reverberation time of 64 ms. Echo cancellation performance is measured by the ERLE (echo return loss enhancement), defined as follows.
ERLE [dB] = 10 log10( E{d²(n)} / E{e²(n)} )

Here, E{•} denotes the statistical expectation, d(n) is the microphone input signal, in which the echo is mixed with the near-end speech and noise, and e(n) is the residual error after cancellation. Figure 4 shows the ERLE characteristics and residual error curves for the two types of echo cancellers when the echo is generated by white noise while the near-end talker is speaking. The ERLE curve of the FIR filter structure, indicated by the dotted line (F) in the upper figure, is strongly influenced by the near-end speech. In contrast, the ERLE curve of the CNN filter structure, indicated by the solid line (N), increases steadily up to 25 dB regardless of the near-end speech. The lower figure shows the residual error of the CNN filter structure; it decreases continuously in a short time and converges to the minimum value. The upper curve in Figure 5 shows that, in the FIR filter structure, the ERLE curve oscillates widely because of the near-end talker's speech when the echo is generated by the far-end talker's speech. Here, the light curve (S) represents the near-end talker's speech and the thick curve (E) the ERLE value. The lower curve is the mean error value after echo cancellation.
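The ERLE definition above can be estimated per frame as in the following sketch (the frame length and function name are our choices):

```python
import numpy as np

def erle_db(mic, residual, frame=256):
    """Frame-wise ERLE in dB: 10*log10(E{d^2} / E{e^2}).

    mic      : microphone signal d(n), echo mixed with speech and noise
    residual : error signal e(n) after echo cancellation
    The expectations are estimated as mean squares over each frame;
    higher values mean stronger echo suppression.
    """
    n = min(len(mic), len(residual)) // frame * frame
    d2 = np.mean(mic[:n].reshape(-1, frame) ** 2, axis=1)
    e2 = np.mean(residual[:n].reshape(-1, frame) ** 2, axis=1)
    return 10.0 * np.log10(d2 / (e2 + 1e-12))   # guard against /0
```

For example, a residual at one-tenth the microphone amplitude corresponds to an ERLE of about 20 dB.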

Conclusion
In this paper, we have shown that the CNN filter removes echo signals well in an acoustic echo canceller despite double-talk. With the neural network filter, the weights converge well on general speech signals. In particular, the filter operates stably, without divergence, even in the double-talk state. Therefore, the acoustic echo canceller can update its weights at all times, regardless of double-talk.
To compare performance, we simulated the ERLE in double-talk for a general neural network and for the CNN. The results show that the ERLE of the CNN is about 3 dB better than that of the general neural network.