Noise reduction system by using CNN deep learning
http://section.iaesonline.com/index.php/IJEEI/index

2021. In this paper, we propose a new algorithm to reduce the acoustic noise of hearing aids. The algorithm improves noise reduction performance through a deep learning approach that uses a neural-network adaptive prediction filter in place of the existing adaptive filter. The speech is estimated from a single input speech signal containing noise using an 80-neuron, 16-filter convolutional neural network (CNN) filter and the error backpropagation algorithm. The method exploits the quasi-periodic property of the voiced sound sections of the speech signal, so applying the repeated pitch makes it possible to predict the speech more effectively. To verify the performance of the proposed noise reduction system, a simulation program was written using the TensorFlow and Keras libraries. The experimental results show that the proposed deep learning algorithm improves the mean square error (MSE) by 28.5% compared with the existing adaptive filter and by 17.2% compared with an FNN (fully-connected neural network) filter.


INTRODUCTION
When using a hearing aid, noise is a factor that makes users uncomfortable and makes speech recognition difficult. A speech enhancement technology that reduces the noise contained in the speech signal is therefore required, and many studies have been conducted so far. Techniques for noise reduction fall into two groups. First, there are spectral subtraction methods [1,2] and Wiener filter methods [3,4] based on short-term spectrum estimation. These methods, in which the estimated noise spectrum is subtracted from the input speech signal or a clean speech spectrum is estimated, are suitable when the statistical characteristics of the noise and of the observed speech signal are known. Second, there are comb filters [5] and adaptive filter methods [6,7] that use the quasi-periodic characteristics of speech signals. The comb filter method is used when the noise occupies a specific frequency band, while the adaptive filter method automatically adjusts the filter coefficients, so the statistical characteristics of the noise need not be known in advance.
The adaptive noise reduction system is divided into single-input and multiple-input systems according to the number of acoustic sensors. The single-input system [8] receives a speech signal through a single microphone. Since the voiced section of the speech signal has a quasi-periodic characteristic, a signal having a high correlation with the input speech signal can be obtained by delaying the noisy microphone input signal by one or two pitch periods. This delayed signal is used as the reference signal for the adaptive filter. To obtain the pitch delay value, the input signal is divided into intervals of approximately 30 ms, over which the statistical characteristics of the speech do not change, and the autocorrelation function is calculated for each section.
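The pitch-lag search described above can be sketched in a few lines of NumPy. The 30 ms frame length comes from the text and the 2-20 ms lag range from the paper's pitch detector; the function name and the simple peak-picking rule are our own illustrative choices:

```python
import numpy as np

def pitch_lag(frame, fs=8000, min_ms=2.0, max_ms=20.0):
    """Estimate the pitch delay (in samples) of one analysis frame by
    picking the lag that maximizes the autocorrelation function within
    the 2-20 ms search range used by the paper's pitch detector."""
    lo = int(min_ms * fs / 1000)            # shortest lag considered
    hi = int(max_ms * fs / 1000)            # longest lag considered
    frame = frame - frame.mean()
    # Full autocorrelation; keep non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return lo + int(np.argmax(ac[lo:hi + 1]))

# Synthetic quasi-periodic "voiced" frame: a 100 Hz tone sampled at
# 8 kHz has an 80-sample (10 ms) period.
fs = 8000
t = np.arange(int(0.03 * fs)) / fs          # one 30 ms analysis frame
voiced = np.sin(2 * np.pi * 100 * t)
print(pitch_lag(voiced, fs))
```

On a real voiced frame the autocorrelation peak is less sharp than for this pure tone, but the same argmax-over-lags rule applies.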
Deep learning [9] is a complex machine learning approach that uses neural networks with a large number of hidden layers. Recent deep learning models have made great progress in many fields because techniques have been developed for training multi-layer neural networks composed of many layers. The error backpropagation algorithm [10], which trains a multi-layer neural network, can learn deep networks composed of many layers by pre-training the synapses of the lower layers before training the upper layers [11]. The most widely used deep learning model is the convolutional neural network (CNN); since the CNN model shows excellent performance in feature extraction, it has recently been used in video and audio signal processing [12,13]. In this study, we propose a method that reduces noise using a deep learning neural network filter in place of the adaptive filter of the adaptive noise reduction system. The remainder of the paper is organized as follows: Section II reviews the adaptive noise reduction system, Section III describes the learning algorithm of the multi-layer neural network, and Section IV proposes a new deep learning model structure. Section V describes the simulation of this system and its results, and Section VI draws conclusions.

Figure 1 shows a single-input adaptive noise reduction system that estimates the current speech from signals delayed by one or more samples, using adaptive prediction based on the quasi-periodic characteristics of the speech signal. Because the microphone input signal, a mixture of speech and noise, is quasi-periodic in the voiced sections, input signals delayed by one or two pitch periods have a high correlation with the speech signal components but little correlation with the noise components. Therefore, the filter output, which is independent of the noise, converges to the least-square-error estimate of the target speech.
At this time, by minimizing the energy of the error signal, the filter output becomes the speech estimation signal: the minimum-square-error estimate of the speech component of the input. To obtain the pitch delay information, the pitch detector computes the autocorrelation function for every analysis section, and the pitch delay at which this value is maximal is selected in the range of 2-20 ms.
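The paper does not spell out the coefficient update it uses for the existing adaptive filter, so the sketch below uses standard LMS adaptation of an FIR predictor fed by the pitch-delayed reference; the tap count and step size are illustrative choices, not the paper's:

```python
import numpy as np

def lms_predictor(x, delay, taps=16, mu=0.01):
    """Single-input adaptive prediction filter: the input delayed by one
    pitch period serves as the reference, and an FIR filter adapts by
    LMS so its output tracks the quasi-periodic speech while the
    uncorrelated noise is suppressed."""
    w = np.zeros(taps)
    y = np.zeros_like(x)
    for n in range(delay + taps, len(x)):
        ref = x[n - delay:n - delay - taps:-1]  # pitch-delayed reference window
        y[n] = w @ ref                          # speech estimate
        e = x[n] - y[n]                         # error signal drives adaptation
        w += 2 * mu * e * ref                   # LMS weight update
    return y

# Demo: a "voiced" tone with an 80-sample pitch period plus white noise.
rng = np.random.default_rng(0)
samples = np.arange(4000)
clean = np.sin(2 * np.pi * samples / 80)
noisy = clean + 0.3 * rng.standard_normal(4000)
estimate = lms_predictor(noisy, delay=80)
```

After the initial convergence transient, the filter output follows the periodic speech component far more closely than the noisy input does.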

LEARNING ALGORITHM OF MULTI-LAYER NEURAL NETWORK
A multi-layer perceptron has the structure of a multi-layer feedforward neural network with one or more hidden layers, as shown in Figure 3. The weights and biases between the input layer and the hidden layer are denoted W1 and b1, and those between the hidden layer and the output layer W2 and b2. The weighted sum input to the j-th hidden neuron is denoted h_j and the weighted sum input to the k-th output neuron v_k; the activation function of the hidden neurons is written φ(h_j) and that of the output neurons φ(v_k).
If we denote all the weights and biases together by a parameter vector θ, the value of the k-th output neuron for a given input x can be expressed as a function f_k(x, θ).
The error backpropagation learning algorithm was developed by Rumelhart, Hinton, and Williams in the mid-1980s. Supervised learning of the multi-layer perceptron is based on a cost function formed from the difference between the target output values and the values output by the network. When the training data and target outputs are given as input-output pairs (x_i, t_i), i = 1, ..., N, the error over the whole training set X can be defined as the mean square error:

E(X, θ) = (1/N) Σ_i Σ_k (t_ik − f_k(x_i, θ))²
In the above equation, the error function E(X, θ) takes a single value given the data set X and the parameters θ. Since the data set X is given from outside, the quantity to be optimized is θ, so the error can be written simply as E(θ). The backpropagation learning algorithm uses the gradient descent method to find the parameters that minimize the error function E(θ). Gradient descent is an algorithm that iteratively finds a parameter minimizing the cost function:

θ ← θ − η ∂E(θ)/∂θ

where η is the learning rate that controls the speed of learning. In the multi-layer perceptron, backpropagation learning applies stochastic gradient descent, updating the weights one data item at a time using the per-sample error function E(x_i, θ).
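The update rule θ ← θ − η ∂E/∂θ can be made concrete on a toy cost whose minimum is known; the one-parameter quadratic below is purely illustrative:

```python
# Gradient descent on the toy cost E(theta) = (theta - 3)^2, whose
# minimum is theta = 3. Each step moves theta downhill by the
# learning-rate-scaled gradient, theta <- theta - eta * dE/dtheta.
def gradient_descent(theta0, eta=0.1, steps=100):
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)  # dE/dtheta for E = (theta - 3)^2
        theta -= eta * grad
    return theta

print(gradient_descent(0.0))  # converges toward 3.0
```

With η = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, illustrating how the learning rate controls the speed of convergence.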
The parameters θ to be corrected through learning are the weights W2 and biases b2 between the hidden layer and the output layer, and the weights W1 and biases b1 between the input layer and the hidden layer. Partially differentiating the error function with respect to the output-side parameters gives the following.
Here, φ'(v_k) is the derivative of the activation function of the output neuron; since the ReLU (Rectified Linear Unit) function φ(v) = max{0, v} is widely used, this derivative is generally the unit step function, φ'(v) = u(v). The resulting quantity δ_k expresses the effect of the k-th output neuron on the error. Next, the error function is partially differentiated with respect to the input-side parameters as follows.
Taken together, it can be seen that the parameters between the input layer and the hidden layer are affected by the sum of the products of the weights between the hidden and output layers and the effect of each output neuron on the error. Since the error of the output neurons propagates backward to the hidden neurons and influences their parameter updates, the gradient descent learning method of the multi-layer perceptron is called the error backpropagation learning algorithm; finally, each parameter is updated by equations (14)-(17).
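A minimal NumPy sketch of one backpropagation pass for a single-hidden-layer perceptron with ReLU activations makes the output-side/input-side structure explicit. The function name, layer sizes, and the 0.5 factor in the squared error are our own conventions, not the paper's:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def mlp_grads(x, t, W1, b1, W2, b2):
    """One backpropagation pass for a single-hidden-layer perceptron.
    Returns the gradients of E = 0.5 * sum((t - y)^2) for one sample."""
    h = W1 @ x + b1                            # hidden weighted sums
    a = relu(h)
    v = W2 @ a + b2                            # output weighted sums
    y = relu(v)
    delta_out = (y - t) * (v > 0)              # output error x ReLU' (unit step)
    delta_hid = (W2.T @ delta_out) * (h > 0)   # error propagated back to hidden layer
    return (np.outer(delta_hid, x), delta_hid,  # dE/dW1, dE/db1
            np.outer(delta_out, a), delta_out)  # dE/dW2, dE/db2

# Demo with a 4-3-2 network on random values.
rng = np.random.default_rng(1)
x, t = rng.standard_normal(4), rng.standard_normal(2)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
gW1, gb1, gW2, gb2 = mlp_grads(x, t, W1, b1, W2, b2)
print(gW1.shape, gW2.shape)  # (3, 4) (2, 3)
```

Note how `delta_hid` is exactly the sum of products of the output-side weights and the output errors, masked by the hidden ReLU derivative — the backward propagation that gives the algorithm its name.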

STRUCTURE OF DEEP LEARNING MODEL
The deep learning model used in this paper, shown in Figure 4, has a three-stage structure based on a CNN layer. The CNN layer in the first stage consists of 80 neurons and 16 feature filters; the kernel size is 16 samples, and the kernel slides at every sample interval (stride 1). The input is composed of 80 samples of data for each example, and ReLU is applied as the activation function at the output. The output of the CNN layer is flattened to one dimension by the following Flatten layer, expanding to (80 − 16 + 1) × 16 = 1,040 nodes. These signals are fed into a fully-connected neural network (FNN) layer with 1,040 neurons, and the ReLU function is applied again at the output. The result is then output as a single signal through the final FNN layer of 128 neurons. To reduce the amount of computation, the batch size was set to 30; the total number of parameters in this model is 133,248.
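The size arithmetic of the first stage can be checked with a plain NumPy "valid" 1-D convolution, without needing TensorFlow; the helper function and the all-ones kernels are purely illustrative:

```python
import numpy as np

def conv1d_valid(x, kernels):
    """Valid 1-D convolution with stride 1, followed by ReLU.
    x: (80,) input window; kernels: (filters, kernel_size) = (16, 16)."""
    n_filters, k = kernels.shape
    out_len = len(x) - k + 1                 # 80 - 16 + 1 = 65 positions
    out = np.empty((out_len, n_filters))
    for i in range(out_len):
        out[i] = kernels @ x[i:i + k]        # all 16 filters at one position
    return np.maximum(0.0, out)              # ReLU activation

x = np.random.default_rng(0).standard_normal(80)
feat = conv1d_valid(x, np.ones((16, 16)))
flat = feat.reshape(-1)
print(feat.shape, flat.size)  # (65, 16) 1040
```

This reproduces the (80 − 16 + 1) × 16 = 1,040 flattened nodes stated in the text; the equivalent Keras layer would be a `Conv1D` with 16 filters, kernel size 16, stride 1, and `padding="valid"`.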
Adam and the error backpropagation algorithm are used to update the weights. Since this system is trained with supervised learning, the training data and target values are prepared from the single-channel input data.

SIMULATIONS AND ANALYSES
In order to verify the performance of the proposed speech noise reduction system, a simulation program was created using the TensorFlow and Keras libraries. The input signal, a mixture of speech and white noise sampled at 8 kHz, comprised 900,000 samples (112.5 s). Since this system uses supervised learning, the input data are internally arranged as an input array of 80 × (900,000 − 79) samples and a target array of (900,000 − 79) samples. Figure 5 shows the waveforms of the speech signal, the mixed signal, and the output signal. The mean square error (MSE) was used to evaluate the performance of the systems; the MSE measures the error of the predicted speech value with respect to the target value.
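The windowing that produces the 80 × (N − 79) input array can be sketched as below. Aligning each target with the clean sample at the window's last position is our assumption; the paper only states the array sizes:

```python
import numpy as np

def make_dataset(noisy, clean, window=80):
    """Arrange the samples as overlapping 80-sample windows of the noisy
    input, with one clean target value per window (N - 79 pairs for an
    N-sample signal)."""
    n_windows = len(noisy) - window + 1
    # Index matrix: row i selects samples i .. i+79 of the noisy signal.
    idx = np.arange(window)[None, :] + np.arange(n_windows)[:, None]
    X = noisy[idx]                   # shape (N - 79, 80)
    y = clean[window - 1:]           # one clean target per window
    return X, y

noisy = np.random.default_rng(0).standard_normal(1000)
clean = 0.5 * noisy                  # placeholder "clean" track for the demo
X, y = make_dataset(noisy, clean)
print(X.shape, y.shape)  # (921, 80) (921,)
```

For a 1,000-sample demo signal this yields 1,000 − 79 = 921 training pairs, matching the 900,000 − 79 figure quoted for the full simulation.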
MSE = (1/N) Σ_n (d(n) − ŷ(n))²   (18)

where d(n) is the target value and ŷ(n) the predicted speech. Figure 6 shows the MSE performance of the three filters. Comparing the existing adaptive filter with the FNN and CNN filters, the deep learning model using the CNN filter performed best: the CNN filter improved the MSE by 28.5% compared with the adaptive filter and by 17.2% compared with the FNN filter. The reason is that the adaptive filter reduces only the linear noise components between samples, while the FNN filter can also reduce nonlinear ones. That is, since the adaptive filter has an FIR (Finite Impulse Response) structure, it reduces the linear component of the noise, whereas the FNN filter can reduce nonlinear components because each neuron is connected to all input signals and the network is composed of two or more layers. Furthermore, the CNN filter achieves even better performance because it also finds and exploits several features between samples. Figure 7 shows the signal-to-noise-ratio enhancement (SNRE) of the noise attenuator using the CNN filter and the FNN filter. From this figure, it can be seen that the SNRE is about 2 dB higher when using the CNN filter than when using the FNN filter.
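The two evaluation metrics can be computed as below. The MSE follows equation (18); for the SNRE we use the standard definition, output SNR minus input SNR in dB, since the paper does not spell out its formula:

```python
import numpy as np

def mse(estimate, target):
    """Mean square error between predicted and target speech, eq. (18)."""
    return np.mean((estimate - target) ** 2)

def snre_db(noisy, enhanced, clean):
    """SNR enhancement in dB: output SNR minus input SNR (standard
    definition; assumed here, as the paper does not state its formula)."""
    snr_in = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    snr_out = 10 * np.log10(np.mean(clean ** 2) / np.mean((enhanced - clean) ** 2))
    return snr_out - snr_in

# Demo: halving the noise amplitude should raise the SNR by 10*log10(4) dB.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(8000) / 80)
noise = 0.3 * rng.standard_normal(8000)
print(snre_db(clean + noise, clean + 0.5 * noise, clean))  # about 6.02 dB
```

Computing both metrics against the same clean reference is what allows the MSE percentages and the roughly 2 dB SNRE gap reported above to be compared across filters.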

CONCLUSIONS
In order to improve the speech recognition performance of hearing aids, an effective noise attenuator is required. In this paper, we proposed a new noise reduction system using deep learning technology. With the CNN filter, noise reduction performance is improved by deep learning with a neural network instead of the existing adaptive filter. The noise reduction system achieved a significant performance improvement using an 80-neuron, 16-filter CNN filter and the deep learning error backpropagation learning algorithm. As a result of the study, this system reduces the MSE by 28.5% compared with the adaptive filter and by 17.2% compared with the FNN filter.