Analysis of the performance of optimization algorithms in a CNN speech noise attenuator

In this paper, we study how the optimization algorithm used to update the weight coefficients affects the performance of a CNN (Convolutional Neural Network) noise attenuator. To improve noise attenuation, the system replaces the conventional adaptive filter with a neural network adaptive predictive filter trained by a deep learning algorithm. Speech is estimated from a single noisy input speech signal using a 64-neuron, 16-filter CNN filter and the error backpropagation algorithm, exploiting the quasi-periodic nature of the voiced sections of the speech signal. To verify the performance of the noise attenuator for each optimization algorithm, a test program was written using the Keras library and training was performed. In the simulations, among the Adam, RMSprop, and Adagrad optimization algorithms, the system showed the smallest MSE with Adam and the largest MSE with Adagrad. This is because Adam, although it requires a lot of computation, combines the advantages of RMSprop and Momentum SGD and therefore estimates the optimal value well.


INTRODUCTION
Noise attenuation reduces the noise contained in a speech signal, and many noise attenuation techniques have been studied. Methods based on short-term spectrum estimation include the spectral subtraction method [1,2] and the Wiener filter method [3,4]. These methods either subtract an estimate of the noise spectrum from the spectrum of the input speech signal or estimate the clean speech spectrum, and they are advantageous when the statistical characteristics of the noise and the speech signal are known in advance. Another approach uses a comb filter [5] or an adaptive filter [6,7] that exploits the quasi-periodic characteristics of the speech signal. The comb filter method is used for noise occupying a specific frequency band, while the adaptive filter method automatically adjusts the filter coefficients without prior knowledge of the statistical characteristics of the noise. A single-input adaptive noise attenuator receives a voice signal from one microphone and estimates the voice signal using the quasi-periodic characteristics of its voiced sections.
Recently, deep learning models have achieved great success as techniques for training neural networks with many hidden layers have been developed. By using the error backpropagation algorithm, even deep neural networks composed of many layers can be trained [8]. The CNN [9] has proven to be a reliable tool for generalizing to real-world noise attenuation problems [10]. It is currently one of the most widely used deep learning models and can estimate the characteristics of speech well. In 2016, an SNR (Signal-to-Noise Ratio)-aware CNN model for speech enhancement was published [11]. This CNN model can efficiently process the local temporal and spectral structure of speech, and thus effectively separates speech and noise in the input signal. Two SNR-aware algorithms using CNNs were proposed to improve the generalization ability and accuracy of these models. The first incorporates a multi-task learning framework: given noisy speech as input, the model primarily reconstructs the noise-free speech and additionally estimates the SNR level. The second performs SNR-adaptive noise attenuation: it first calculates the SNR level and then, based on that level, selects an SNR-dependent CNN model to reduce the noise. The two proposed SNR-aware CNN models outperform a simple deep neural network. In 2017, a CNN model for complex spectrogram enhancement was proposed to address the difficulty of phase estimation [12]. The model restores clean real and imaginary spectrograms from noisy ones, and these spectrograms are used to generate speech with accurate phase information. The basic idea is that any signal can be represented as a function of its real and imaginary spectrograms.
Optimization algorithms [13,14,15,16] for updating the weight coefficients of CNN filters include Stochastic Gradient Descent (SGD), Momentum SGD, Nesterov momentum SGD, Adagrad, RMSprop, and Adam. In this study, the adaptive filter of the adaptive noise attenuator is replaced with a CNN neural network filter trained by deep learning, and we examine how the optimization algorithm affects the noise attenuation performance in order to identify the best-performing algorithm. The rest of this paper is organized as follows. Section II describes the adaptive noise attenuator, Section III the linear prediction of speech signals, Section IV the structure of the CNN neural network filter, and Section V the update algorithms for the weight coefficients. Section VI describes the simulation of the optimization algorithms and its results, and Section VII draws a conclusion.

ADAPTIVE NOISE ATTENUATOR
Figure 1 shows a single-input noise attenuator that estimates the current voice sample from signals delayed by more than one sample, using an adaptive prediction method that exploits the quasi-periodic characteristics of the voice signal. A speech signal delayed by one or two pitch periods is highly correlated with the current speech signal, but has little correlation with the white noise component. The filter therefore converges to the least-squares estimate of the voice signal, independently of the noise. The output of the CNN filter estimates the voice signal contained in the input signal, and subtracting this output from the input signal gives the error. This error signal is used to update the weights of the CNN filter, and its average power is given by Equation (1).
Here, $E\{\cdot\}$ denotes the average value, $s$ the voice signal, $n$ the noise, $\hat{s}$ the filter output, and $e$ the error signal. Assuming that the voice signal and the noise are independent of each other, the cross term between them vanishes. Since the noise energy in an arbitrary section is a fixed value, minimizing $E\{e^2\}$ is equivalent to minimizing the estimation error of the voice signal $E\{(s-\hat{s})^2\}$, and the filter output $\hat{s}$ is then the best estimate of the voice signal. Minimizing $E\{(s-\hat{s})^2\}$ also minimizes $E\{(e-n)^2\}$, so the error signal $e$ is an estimate of the noise.
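The decomposition behind this argument can be sketched as follows. The notation is an assumption chosen for illustration: $x = s + n$ is the noisy input, $s$ the voice signal, $n$ the noise, $\hat{s}$ the filter output, and $e = x - \hat{s}$ the error. The noise is further assumed to be zero-mean and uncorrelated with $s - \hat{s}$, which is plausible when the filter only sees delayed samples and the noise is white.

```latex
\begin{align*}
E\{e^2\} &= E\{(s + n - \hat{s})^2\} \\
         &= E\{(s - \hat{s})^2\} + 2\,E\{n\,(s - \hat{s})\} + E\{n^2\} \\
         &= E\{(s - \hat{s})^2\} + E\{n^2\}, \\
\min E\{e^2\} &= \min E\{(s - \hat{s})^2\} + E\{n^2\},
\qquad e - n = s - \hat{s}.
\end{align*}
```

Because $E\{n^2\}$ does not depend on the filter weights, the weights that minimize the error power are also the weights that make $\hat{s}$ closest to $s$.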

LINEAR PREDICTIVE CODING ANALYSIS OF SPEECH SIGNAL
Linear predictive coding (LPC) analysis is a method used in various fields such as speech analysis and synthesis, and it can accurately represent the characteristics of the speech spectrum with a relatively small number of parameters. Let $s(n)$ be the speech sample at discrete time $n$ and $\hat{s}(n)$ the predicted value of the speech sample at time $n$; the prediction can then be expressed as Equation (4) [17].
From Equation (4), the present value of the voice signal can be predicted from its previous values. Denoting by $e(n)$ the prediction error, that is, the difference between the actual input value and the predicted value, the error can be expressed by Equation (5).
Here, $a_k$ is the linear prediction coefficient. The LPC coefficients are calculated so that the mean squared value of $e(n)$ is minimized. In this paper, the speech signal is estimated using the characteristic low-frequency spectral structure of voiced sounds.
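As a minimal sketch of this idea, the following NumPy code fits prediction coefficients by least squares, assuming the standard linear predictor $\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$ with error $e(n) = s(n) - \hat{s}(n)$. The function names, the order $p = 2$, and the 100 Hz test tone standing in for a voiced segment are illustrative assumptions, not part of the study.

```python
import numpy as np

def past_samples(s, order):
    """Row for time n holds s(n-1), ..., s(n-order), for n = order .. len(s)-1."""
    return np.column_stack([s[order - k:len(s) - k] for k in range(1, order + 1)])

def lpc_least_squares(s, order):
    """Coefficients a_1..a_p that minimise the mean squared prediction error."""
    a, *_ = np.linalg.lstsq(past_samples(s, order), s[order:], rcond=None)
    return a

# Stand-in for a voiced segment: a 100 Hz tone sampled at 8 kHz (illustrative only).
omega = 2 * np.pi * 100 / 8000
s = np.sin(omega * np.arange(400))

a = lpc_least_squares(s, order=2)
s_hat = past_samples(s, 2) @ a      # predicted samples for n >= 2
e = s[2:] - s_hat                   # prediction error e(n)
```

A pure tone satisfies $s(n) = 2\cos(\omega)\, s(n-1) - s(n-2)$ exactly, so the fitted coefficients approach $[2\cos\omega,\ -1]$ and the prediction error is close to zero. Real voiced speech is only quasi-periodic, so a residual error remains.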

STRUCTURE OF CNN NEURAL NETWORK FILTER
Fig. 2 shows the three-layer structure of the neural network filter used in this paper when 16 CNN filters are used. The first layer is a CNN layer consisting of 64 neurons and 16 feature filters; the kernel size is 16 samples, and a kernel is applied at every sample interval. The input signal consists of 64×16 data for every sample, and ReLU is applied as the activation function at the output. The output of the CNN layer, 49 positions (= 64 − 16 + 1) for each of the 16 filters, is flattened to one dimension by the following Flatten layer, giving 49×16 = 784 nodes. These 784 values are input to a fully connected neural network (FNN) hidden layer with 64 outputs, and the ReLU function is applied again at its output. The 64 hidden outputs then pass through the last layer, an FNN output layer, which produces a single output signal. To reduce the amount of computation, the batch size was set to 30 and the bias parameter of each layer was omitted. The weight parameters to be calculated in this model are 256 (= 16×16) in the CNN layer, 50,176 (= 784×64) in the hidden layer, and 64 in the output layer, for a total of 50,496. The weights are updated using Adam and the error backpropagation algorithm. The system uses supervised learning, and both the training data and the learning target values are prepared from the single input signal.
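The layer sizes and weight counts above can be checked with a small forward pass. This is a plain NumPy sketch, not the Keras program used in the study. It assumes a stride of 1, no padding, a single input channel, and random placeholder weights; all names are hypothetical.

```python
import numpy as np

N_INPUT, KERNEL, N_FILTERS, N_HIDDEN = 64, 16, 16, 64
CONV_LEN = N_INPUT - KERNEL + 1      # 49 kernel positions (stride 1, no padding)
N_FLAT = CONV_LEN * N_FILTERS        # 784 flattened values

def init_weights(rng):
    """Random placeholder weights; biases are omitted as in the text."""
    return {
        "conv": rng.normal(0.0, 0.1, size=(KERNEL, N_FILTERS)),    # 16 x 16 = 256
        "hidden": rng.normal(0.0, 0.1, size=(N_FLAT, N_HIDDEN)),   # 784 x 64 = 50,176
        "out": rng.normal(0.0, 0.1, size=N_HIDDEN),                # 64
    }

def forward(x, w):
    """Map 64 input samples to one predicted sample."""
    windows = np.lib.stride_tricks.sliding_window_view(x, KERNEL)  # (49, 16)
    conv = np.maximum(windows @ w["conv"], 0.0)    # ReLU, shape (49, 16 filters)
    flat = conv.reshape(-1)                        # Flatten, shape (784,)
    hidden = np.maximum(flat @ w["hidden"], 0.0)   # ReLU, shape (64,)
    return float(hidden @ w["out"])                # linear output, one value

rng = np.random.default_rng(0)
weights = init_weights(rng)
n_params = sum(v.size for v in weights.values())
y = forward(rng.normal(size=N_INPUT), weights)
```

Under these assumptions the weight count comes to 256 + 50,176 + 64 = 50,496, matching the total stated in the text.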

THE ALGORITHM FOR WEIGHT COEFFICIENTS UPDATING
A multi-layer perceptron is a multi-layer neural network with one or more hidden layers. The bias terms are omitted. The weighted sum input to the j-th hidden neuron is denoted $h_j$, and the weighted sum input to the k-th output neuron is denoted $o_k$. The hidden neurons use the ReLU activation function, denoted $\phi$, and the output neurons use no activation function. The output values of the hidden neurons and the output neurons can therefore be expressed by Equations (6) and (7).
If we denote all weights collectively as a parameter $\theta$, the value of the k-th output neuron can be expressed as a function $f_k(x, \theta)$ of the input $x$.
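Written out under assumed notation, the forward computation described above takes the following form. The symbols are chosen for illustration: $x_i$ is the i-th input, $w^{(1)}_{ji}$ and $w^{(2)}_{kj}$ are the first- and second-layer weights, $z_j$ is the output of the j-th hidden neuron, and $y_k$ is the value of the k-th output neuron.

```latex
\begin{align*}
h_j &= \sum_i w^{(1)}_{ji}\, x_i, & z_j &= \phi(h_j) = \max(0,\, h_j), \\
o_k &= \sum_j w^{(2)}_{kj}\, z_j, & y_k &= o_k = f_k(x, \theta).
\end{align*}
```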
The error backpropagation learning algorithm [18] is used to train the multi-layer perceptron. In supervised learning of the multi-layer perceptron, an error function is defined from the difference between the values output by the multi-layer perceptron and the given learning target values. When the learning data and the target output values are given as ordered input-output pairs $(x_i, t_i)\ (i = 1, \dots, N)$, the error over the whole learning data set $X$ can be defined as a mean square error, as shown in the following equation.
In the above equation, the error function $E(X, \theta)$ takes a single value for a given data set $X$ and parameter $\theta$. Since the data set $X$ is given from outside and the quantity to be optimized is $\theta$, the error can be written $E(\theta)$. The backpropagation learning algorithm uses the gradient descent method to find the parameters that minimize the error function $E(\theta)$. Gradient descent is an algorithm that iteratively finds the parameter that minimizes the value of a cost function.
Here, $\eta$ is the learning rate, which controls the speed of learning. In the multi-layer perceptron, backpropagation learning applies the stochastic gradient descent method, which updates the weights once for each data sample using the error function $E(x_i, \theta)$ of that sample.
In the above equation, the weights $w^{(2)}$ between the hidden layer and the output layer and the weights $w^{(1)}$ between the input layer and the hidden layer are the parameters to be corrected through learning.
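The following NumPy sketch illustrates backpropagation with per-sample stochastic gradient descent for a bias-free perceptron with one ReLU hidden layer and a linear output. It assumes the per-sample error $E(x_i, \theta) = \tfrac{1}{2}(y_i - t_i)^2$ and the update $\theta \leftarrow \theta - \eta\, \nabla E(x_i, \theta)$. The factor $\tfrac{1}{2}$, the toy data, the layer sizes, and all names are illustrative assumptions.

```python
import numpy as np

def forward(x, w1, w2):
    h = w1 @ x                   # weighted sums into the hidden neurons
    z = np.maximum(h, 0.0)       # ReLU hidden outputs
    return h, z, float(w2 @ z)   # linear output neuron

def sample_error(x, t, w1, w2):
    """Assumed per-sample error: half the squared output error."""
    return 0.5 * (forward(x, w1, w2)[2] - t) ** 2

def gradients(x, t, w1, w2):
    """Backpropagate the output error to both weight layers."""
    h, z, y = forward(x, w1, w2)
    delta_out = y - t                          # dE/dy
    grad_w2 = delta_out * z                    # hidden-to-output weights
    delta_hidden = delta_out * w2 * (h > 0)    # error passed back through ReLU
    grad_w1 = np.outer(delta_hidden, x)        # input-to-hidden weights
    return grad_w1, grad_w2

def sgd_epoch(X, T, w1, w2, eta):
    """One pass over the data, updating w1 and w2 in place after each sample."""
    for x, t in zip(X, T):
        g1, g2 = gradients(x, t, w1, w2)
        w1 -= eta * g1
        w2 -= eta * g2

def mse(X, T, w1, w2):
    return float(np.mean([(forward(x, w1, w2)[2] - t) ** 2 for x, t in zip(X, T)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # toy inputs
T = X @ np.array([0.5, -0.3, 0.2, 0.1])        # toy targets
w1 = rng.normal(0.0, 0.5, size=(8, 4))
w2 = rng.normal(0.0, 0.5, size=8)

mse_before = mse(X, T, w1, w2)
for _ in range(20):
    sgd_epoch(X, T, w1, w2, eta=0.01)
mse_after = mse(X, T, w1, w2)
```

The gradient with respect to the output-layer weights depends only on the output error and the hidden outputs, while the gradient for the first layer is obtained by passing that error back through the output weights and the ReLU derivative.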
Another method, the momentum method, updates the parameters using both the current gradient and a fraction of the previous update, so that earlier gradients are reflected to some extent.
Here, $\gamma$ is the reflection (momentum) coefficient, with $0 < \gamma < 1$. The Nesterov momentum method first predicts the direction of movement, moves in that direction in advance, and then calculates the gradient $\nabla E(\theta)$ at the moved position.
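A minimal sketch of the two update rules, assuming the common forms $v_t = \gamma v_{t-1} + \eta \nabla E(\theta)$ for momentum and $v_t = \gamma v_{t-1} + \eta \nabla E(\theta - \gamma v_{t-1})$ for Nesterov momentum, each followed by $\theta \leftarrow \theta - v_t$. The quadratic toy error, the step count, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

CURVATURE = np.array([1.0, 10.0])   # toy error E(theta) = 0.5 * sum(c * theta**2)

def grad(theta):
    """Gradient of the toy quadratic error."""
    return CURVATURE * theta

def momentum_step(theta, v, eta, gamma):
    v = gamma * v + eta * grad(theta)       # blend previous update with new gradient
    return theta - v, v

def nesterov_step(theta, v, eta, gamma):
    lookahead = theta - gamma * v           # move along the previous update first
    v = gamma * v + eta * grad(lookahead)   # then measure the gradient there
    return theta - v, v

def run(step, n_steps=300, eta=0.01, gamma=0.9):
    theta, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(n_steps):
        theta, v = step(theta, v, eta, gamma)
    return theta

theta_momentum = run(momentum_step)
theta_nesterov = run(nesterov_step)
```

Both runs move toward the minimum at the origin. The only difference between the two rules is the point at which the gradient is evaluated.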
In neural network training, if the learning rate is too small, learning takes a long time, and if it is too large, learning does not proceed properly. A simple way to address this problem is the learning rate decay method, which lowers the learning rate of all parameters together. The Adagrad method instead adjusts the learning rate of each parameter according to how much that parameter has been updated, giving larger changes to parameters that have changed little so far.
Here, $\epsilon$ is a very small value that prevents division by zero. $G(\theta)$ accumulates the squares of the gradient values obtained so far. Elements of the parameter that have been updated significantly receive a lower learning rate, so the learning rate is applied differently to each element of the parameter. However, as learning is repeated, the accumulated sum of squared gradients keeps growing, so the updates become weaker and eventually approach 0, with the disadvantage that learning may stop progressing.
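The per-element behaviour can be sketched as follows, assuming one common form of the Adagrad update: $G \leftarrow G + g^2$ followed by $\theta \leftarrow \theta - \eta\, g / \sqrt{G + \epsilon}$, where $g$ is the gradient. The toy quadratic error, the hyperparameter values, and the names are illustrative assumptions.

```python
import numpy as np

ETA, EPS = 0.1, 1e-8
SCALE = np.array([1.0, 100.0])   # toy error E(theta) = 0.5 * sum(SCALE * theta**2)

def adagrad_step(theta, g_sum, grad, eta=ETA, eps=EPS):
    g_sum = g_sum + grad ** 2                          # running sum of squared gradients
    theta = theta - eta / np.sqrt(g_sum + eps) * grad  # per-element step size
    return theta, g_sum

theta, g_sum = np.array([1.0, 1.0]), np.zeros(2)
effective_rates = []
for _ in range(50):
    theta, g_sum = adagrad_step(theta, g_sum, SCALE * theta)
    effective_rates.append(ETA / np.sqrt(g_sum + EPS))
effective_rates = np.array(effective_rates)            # shape (50, 2)
```

The element with the larger gradients receives the smaller effective learning rate. Because the sum only grows, every effective rate can only decrease over time, which is the drawback described above.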
RMSprop replaces the sum of squared gradients $G(\theta)$ used in Adagrad with an exponential moving average. Unlike in Adagrad, $G(\theta)$ does not grow indefinitely, while the relative magnitude difference of recent changes between variables is maintained.
Here, $\rho$ is called the decay factor and takes a value of 0.9 to 0.999. In Adagrad, $G(\theta)$ is defined as the sum of changes up to the current time, so it increases as time passes and the learning rate decreases. In RMSprop, however, it is defined as the exponential average of the previous changes and the current change, which prevents a sudden decrease in the learning rate.
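The difference between the two accumulators can be illustrated with a constant gradient. This sketch assumes the update $G \leftarrow \rho G + (1 - \rho) g^2$ followed by $\theta \leftarrow \theta - \eta\, g / \sqrt{G + \epsilon}$. The symbol $\rho$, the hyperparameter values, and the constant gradient of 2.0 are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(theta, g_avg, grad, eta=0.001, rho=0.9, eps=1e-8):
    g_avg = rho * g_avg + (1.0 - rho) * grad ** 2      # exponential moving average
    theta = theta - eta / np.sqrt(g_avg + eps) * grad  # per-element step size
    return theta, g_avg

grad = np.array([2.0])         # constant gradient, for illustration only
theta = np.zeros(1)
g_avg = np.zeros(1)            # RMSprop accumulator
adagrad_sum = np.zeros(1)      # Adagrad accumulator, for comparison
for _ in range(1000):
    adagrad_sum = adagrad_sum + grad ** 2
    theta, g_avg = rmsprop_step(theta, g_avg, grad)
```

With a constant gradient, the Adagrad sum grows in proportion to the number of steps, while the RMSprop average settles at the squared gradient, so the RMSprop step size stays roughly constant.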
Finally, Adam is an optimizer that combines RMSprop, which adapts the learning rate, with Momentum, which changes the update path. Like Momentum, it stores an exponential average of the gradients calculated so far, and like RMSprop, it stores an exponential average of the squared gradients.
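A minimal sketch of one Adam step, assuming the commonly published formulation with bias-corrected averages. The bias correction and the default values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are taken from that formulation rather than from the text above, and the names are illustrative.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad         # Momentum-like average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2    # RMSprop-like average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction; t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -3.0])
theta1, m1, v1 = adam_step(theta0, np.zeros(2), np.zeros(2), grad0, t=1)
```

On the first step the corrected averages equal the gradient and its square, so each parameter moves by approximately $\eta$ against the sign of its gradient, regardless of the gradient's magnitude.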

RESULTS OF SIMULATIONS
In this study, a simulation program was written using the Keras library to verify the performance of the noise attenuator for each optimization algorithm. The input signal, a mixture of voice and noise, was sampled at 8 kHz and consisted of 300,000 samples (37.5 sec). Since the system uses supervised learning, the input data is internally arranged as an input array of 64×499,901 samples and a target array of 499,901 samples. To evaluate the performance of the system, the mean square error (MSE) between the target value, which is the input signal, and the predicted speech value was used, and the MSE curves were compared. Figure 3 shows the MSE curves for the Adam, RMSprop, and Adagrad algorithms when the SNR is 20 dB. Here, the Adam curve is black, the RMSprop curve is blue, and the Adagrad curve is red. The figure shows that, as the updates progress, the MSE decreases rapidly at first and then decreases only gradually after about 3,000 batches. The Adam algorithm gives the smallest MSE and the Adagrad algorithm the largest. Next, Figure 4 shows the MSE curves for the three algorithms when the SNR is 10 dB; the behavior is similar to that at 20 dB. That is, the Adam algorithm shows the best performance and the Adagrad algorithm the lowest. Figure 5 shows the MSE curves when more noise is mixed in and the SNR is 5 dB. The difference in performance between the algorithms is smaller than in the previous figure. However, the performance of the Adam algorithm is still the best and that of the Adagrad algorithm is the lowest. Finally, Figure 6 shows the MSE curves when the noise is heavily mixed in and the SNR is 1 dB. In this case, the MSE increased significantly regardless of the algorithm used.
When the noise is large, there is almost no difference between the algorithms, the performance is poor, and the noise is not removed well. Table 1 summarizes the MSE values for each optimization algorithm. The table shows that the MSE increases as the SNR decreases, and that almost no noise is removed when the SNR falls to 1 dB. Overall, the Adam algorithm performs best, the RMSprop algorithm is slightly inferior to Adam, and the Adagrad algorithm performs worst.
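The data preparation and the evaluation metric can be sketched as follows. This is an illustration under stated assumptions, not the simulation program itself. The test tone standing in for speech, the white Gaussian noise, the one-sample prediction delay, and all function names are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

def make_training_pairs(x, n_taps=64):
    """Each input row holds n_taps past samples; the target is the next sample."""
    windows = np.lib.stride_tricks.sliding_window_view(x, n_taps)
    return windows[:-1], x[n_taps:]

def mse(target, prediction):
    return float(np.mean((target - prediction) ** 2))

rng = np.random.default_rng(0)
n = np.arange(8000)                              # one second at 8 kHz
speech = np.sin(2 * np.pi * 120 * n / 8000)      # stand-in for voiced speech
noise = rng.normal(size=n.size)                  # white noise
noisy = mix_at_snr(speech, noise, snr_db=20)

inputs, targets = make_training_pairs(noisy)
measured_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean((noisy - speech) ** 2))
```

Both the input rows and the targets are taken from the same noisy signal, which matches the single-input supervised setup described above. Lowering `snr_db` raises the noise power relative to the speech power.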

CONCLUSIONS
In this paper, the effect of the optimization algorithm on the performance of a noise attenuator based on CNN deep learning technology was investigated. The noise attenuator was implemented using a 64-neuron, 16-filter CNN filter and the error backpropagation algorithm. The model was coded using the Keras library, and the change in the MSE value according to the optimization algorithm was observed. In the simulations, among the Adam, RMSprop, and Adagrad optimization algorithms, the system showed the smallest MSE with Adam and the largest MSE with Adagrad. This is because Adam, although it requires a lot of computation, combines the advantages of RMSprop and Momentum SGD and therefore estimates the optimal value well.