A Review on Emotion Recognition Algorithms Using Speech
http://section.iaesonline.com/index.php/IJEEI/index

Info 2017 In recent years, there has been growing interest in speech emotion recognition (SER), which infers a speaker's emotion by analyzing input speech. SER can be considered a pattern recognition task comprising feature extraction, a classifier, and a speech emotion database. The objective of this paper is to provide a comprehensive review of the literature available on SER. Several audio features are available, including linear predictive coding coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and Teager energy based features. For the classifier, many algorithms are available, including hidden Markov model (HMM), Gaussian mixture model (GMM), vector quantization (VQ), artificial neural networks (ANN), and deep neural networks (DNN). In this paper, we also review various speech emotion databases. Finally, recent related works on SER using DNN will be


INTRODUCTION
Speech emotion recognition (SER) is one of the topics in speech processing that has been continuously researched. Its origins lie in simple speech recognition, which dates back to the late fifties [1]. Today, SER is a research hotspot, as indicated by the yearly growth in published papers. Figure 1 shows a rough estimate of the number of IEEE published papers related to SER; the data were gathered from IEEE Xplore. The aim of an SER system is to extract the emotion from unknown input speech [2]. While each individual may have their own abstract emotional state, emotions can generally be grouped into the universal categories of happiness, anger, surprise, fear, sadness, and neutral. Some researchers use their own categories; for example, the database utilized in [3] categorizes emotions into ten types, namely joy, acceptance, fear, surprise, sadness, disgust, anger, anticipation, neutral, and others. Although the classification of emotion may differ, the objective of SER remains the same, which is to extract the emotional state. In [4], it is stated that SER is more or less a pattern recognition system. Figure 2 shows a typical SER system.

Figure 2. Typical Speech Emotion Recognition System
SER can be applied in several sectors. In banking, an automated caller equipped with SER may assist in detecting the emotion of the customer and generate custom responses based on the result [5][6][7]. In education, an e-learning portal with SER can detect user emotions such as frustration and stress, determine whether the studying conditions are conducive, and apply appropriate countermeasures [8]. Yet another application is in transportation: in the near future, when vehicles are capable of driving autonomously, the system could take over the steering wheel when an unhealthy level of emotion is detected in the driver [9].

REVIEW ON AUDIO FEATURES EXTRACTION
In this section, various audio features used in SER are reviewed, including linear predictive coding coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and the Teager energy operator (TEO). The extraction process goes through three steps. First, pre-emphasis is a filter that emphasizes the high-frequency band by increasing its amplitude and decreasing the amplitude of the lower frequencies. In speech, the higher frequencies typically hold more of the important information to extract, while the lower frequencies may be mingled with noise. It should be noted that in modern speech recognition systems pre-emphasis has lost its importance and has been replaced by channel normalization in later steps, but for a simple yet effective method a high-pass filter is sufficient. Second, frame blocking and windowing decompose the speech signal into short sequences called frames, on which speech analysis is conducted. Several windows can be used, such as the rectangular and triangular windows, but the Hamming window is often chosen because it softens the edges created by framing, again favoring simplicity. Third is the feature extraction itself. According to [1], speech features can be categorized into four groups, namely continuous, qualitative, spectral, and TEO-based features, as shown in Figure 3.
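The first two steps can be illustrated with a minimal sketch. The pre-emphasis coefficient of 0.97 and the function names are assumptions chosen for illustration; the paper does not specify them.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed, commonly used value, not one taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, frame_length, hop_length):
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) >= frame_length; trailing samples that do not
    fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Row i holds the sample indices of frame i.
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return signal[idx] * np.hamming(frame_length)
```

For example, one second of audio sampled at 16 kHz with 400-sample frames and a 160-sample hop (assumed values corresponding to 25 ms and 10 ms) yields an array of 98 windowed frames.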

Linear Predictive Coding Coefficients (LPCC)
Linear predictive coding (LPC) is a digital method for encoding an analog signal [10]. LPC predicts the next value of a signal as a linear combination of the values it has received in the past. The main objective of LPC is to obtain a set of predictor coefficients that minimize the mean squared prediction error, $E$. In the standard formulation, the prediction and the error to be minimized are

$$\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k], \qquad E = \frac{1}{N} \sum_{n} \left( s[n] - \hat{s}[n] \right)^2,$$

where $s[n]$ is a frame of the speech signal containing $N$ samples, $a_k$ are the predictor coefficients, and $p$ is the order of the LPC analysis. LPC encoding generally gives speech of satisfactory quality at a low bit rate and supplies accurate estimates of speech parameters. Although LPCC can be considered one of the more traditional speech features, it has contributed to the overall recognition of emotion. In [11], LPCC was used as one of the features and a recognition rate of 86.41% was achieved.
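As a rough illustration, the predictor coefficients can be estimated with the autocorrelation method, which solves the normal equations that follow from minimizing the squared prediction error of the model $s[n] \approx \sum_{k=1}^{p} a_k s[n-k]$. The sketch below is one common way to do this, not necessarily the procedure used in the cited works; the function name is hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate predictor coefficients a_1..a_p with the autocorrelation method."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r[0], ..., r[p] of the frame.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz system R a = (r[1], ..., r[p]) with R[i, j] = r[|i - j|].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Usage: recover the coefficients of a known second-order recursion
# s[n] = 1.3 s[n-1] - 0.4 s[n-2], excited by a single unit impulse.
s = np.zeros(200)
s[0] = 1.0
s[1] = 1.3 * s[0]
for i in range(2, len(s)):
    s[i] = 1.3 * s[i - 1] - 0.4 * s[i - 2]
a = lpc_coefficients(s, order=2)
```

In practice a Levinson-Durbin recursion is often preferred over a general linear solve because it exploits the Toeplitz structure; the direct solve is used here only for brevity.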

Mel-Frequency Cepstral Coefficients (MFCC)
Mel-frequency cepstral coefficients (MFCC) are among the most popular audio features [12,13]. They are a representation of the speech signal in which a feature called the cepstrum of a windowed short-time signal is derived from the FFT of that signal. The spectrum is then mapped onto the mel-frequency scale, a log-based transform is applied, and the result is decorrelated using a modified discrete cosine transform (DCT) [14].
The steps to extract MFCC features are pre-emphasis, frame blocking and windowing, FFT magnitude, mel filterbank, log energy, and DCT, as explained in [13]. MFCC uses the mel scale, which is tuned to the frequency response of the human ear. Because of this, MFCC has proven invaluable in the speech recognition field, and attempts have been made to integrate it with emotion recognition [15]. According to [1], spectral audio features such as MFCC are best suited for N-way classifiers.
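The later steps (FFT magnitude, mel filterbank, log energy, DCT) can be sketched as follows for frames that have already been pre-emphasized and windowed. The mel-scale formula, the triangular filter shape, the filter count, the FFT size, and the number of retained coefficients are all common conventions assumed here, not values taken from [13]; a plain unnormalized DCT-II stands in for the "modified" DCT mentioned above.

```python
import numpy as np

def hz_to_mel(f):
    # A widely used mel-scale approximation (assumption).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):      # rising slope
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling slope
            fbank[m - 1, k] = (right - k) / (right - centre)
    return fbank

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def mfcc(frames, sample_rate, n_fft=512, n_filters=26, n_coeffs=13):
    """MFCCs for an array of windowed frames with shape (n_frames, frame_length)."""
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # FFT magnitude
    power = magnitude ** 2 / n_fft
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)   # small offset avoids log(0)
    return dct_ii(log_energies, n_coeffs)
```

Each row of the returned array is the coefficient vector of one frame; frames shorter than `n_fft` are zero-padded by the FFT.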

Teager Energy Operator (TEO)
The Teager Energy Operator (TEO) was proposed by Herbert M. Teager and Shushan M. Teager in 1983. In their article, they argued that the speech model of the time was inaccurate because of its linear finite characteristics, and proposed a model that involves a nonlinear process. In a later article, they generated a plot of the energy creating the sound, but the algorithm was not specified [16]. The work was further extended in [17], and the TEO has since been defined for both real and complex continuous signals. For a real continuous signal $x(t)$, the TEO can be defined as

$$\Psi[x(t)] = \left( \frac{dx(t)}{dt} \right)^2 - x(t)\, \frac{d^2 x(t)}{dt^2}.$$

TEO has been used in various speech signal applications. In [16], the formants of vowels are tracked using TEOs. In SER, TEO features are used in [18] to make the system more robust in noisy environments. Moreover, TEO-based features are suitable for detecting the stress level of emotion [1].
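For sampled speech, the widely used discrete-time counterpart of the operator is $\Psi[x[n]] = x[n]^2 - x[n-1]\,x[n+1]$. The sketch below implements this discrete form, which is an assumption here since the text only mentions the continuous definition; the function name is hypothetical.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input because the
    operator needs one neighbour on each side.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# Usage: for a pure tone A*cos(w*n + phi), the operator is constant
# and equals A^2 * sin(w)^2, so it tracks amplitude and frequency jointly.
n = np.arange(100)
tone = 2.0 * np.cos(0.3 * n + 0.5)
psi = teager_energy(tone)
```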

Summary of Various Audio Features
Many features can be extracted, but they can be grouped into four distinct groups, namely continuous, qualitative, spectral, and TEO-based features. These features can be used as a sole determinant, but they are often used in combination to generate a more distinguishable pattern for the system. Table 1 shows the strengths and weaknesses of various audio features. We selected MFCC because of its suitability for N-way classifiers and DNN. Moreover, many studies have used MFCC as the audio feature, so our proposed system can be benchmarked against other research. LPC on its own is not as reliable, as seen from the fact that it is often combined with other feature extraction methods.

MFCC
Strengths: Tuned to a scale that suits the human ear. Alongside LPCC, it is considered one of the standard features extracted, even more so in SER. Best suited for N-way classifiers.
Weaknesses: Being in spectral form, MFCC is sensitive to noise.

TEO
Strengths: Nonlinear approach, which is suitable for speech. Superior detection of stress levels in emotion.
Weaknesses: More complicated computations compared to LPC.

REVIEW ON CLASSIFIERS
After the SER system extracts the desired features from the audio speech data, the next step is to pass the data on to the classifier. The primary job of the classifier is to determine the unrevealed emotion of the user by using a set of defined algorithms and functions. These classifier evaluations are usually performed on a single database or dataset, in one language. So far there is no agreed standard for which classifier is best, but many have been evaluated in the search for better recognition. The most commonly used classifiers are GMMs, HMMs, SVMs, ANNs, and k-NN [1]. In this section, three popular classifiers, HMM, GMM, and VQ, are discussed briefly and compared with the classifier used in this project, the deep neural network (DNN), which is an extended version of ANN.

Hidden Markov Model (HMM)
The hidden Markov model (HMM) consists of a first-order Markov chain whose states are hidden from the observer. This means that the observer cannot directly examine the internal behavior of the model, yet the temporal structure of the data is captured by these hidden states. HMMs can be considered statistical models that describe sequences of events [2]. In mathematical terms, to model a sequence of observable data vectors $x_1, \cdots, x_T$ by an HMM, we assume the existence of a hidden Markov chain responsible for generating this observable data sequence. Let $K$ be the number of states, $\pi_i$, $i = 1, \cdots, K$, be the initial state probabilities of the hidden Markov chain, and $a_{ij}$, $i = 1, \cdots, K$, $j = 1, \cdots, K$, be the transition probability from state $i$ to state $j$. Assuming the true state sequence is $s_1, \cdots, s_T$, the likelihood of the observable data is given by

$$p(x_1, s_1, \cdots, x_T, s_T) = \pi_{s_1} b_{s_1}(x_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(x_t),$$

where $b_i(x)$ denotes the observation density of state $i$. HMM is also a sequential generative probabilistic model, which means that the classifier acts on the assumption that neighboring frames are closely related. While this is valid for speech signal frames, this assumption and the complexity of the algorithm mean that better alternatives exist [19].
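The likelihood computation can be sketched for a small HMM. For simplicity the sketch assumes discrete observation symbols with an emission matrix `B` (speech systems typically use continuous densities instead), and the function names are hypothetical. Because the state sequence is hidden in practice, the second function sums the joint likelihood over all state sequences with the standard forward recursion.

```python
import itertools
import numpy as np

def joint_likelihood(pi, A, B, states, obs):
    """p(obs, states) for a known state sequence.

    pi[i]: initial probability of state i
    A[i, j]: transition probability from state i to state j
    B[i, o]: probability that state i emits symbol o (discrete emissions assumed)
    """
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

def forward_likelihood(pi, A, B, obs):
    """p(obs), marginalized over all state sequences via the forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def brute_force_likelihood(pi, A, B, obs):
    """Same quantity by explicit enumeration; only feasible for tiny models."""
    n_states = len(pi)
    return sum(joint_likelihood(pi, A, B, s, obs)
               for s in itertools.product(range(n_states), repeat=len(obs)))
```

The forward recursion costs on the order of $K^2 T$ operations, whereas explicit enumeration grows as $K^T$, which is why the recursion is used in practice.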

Gaussian Mixture Models (GMM)
The Gaussian mixture model (GMM) uses an alternative generative probabilistic model, in which, for a particular word, multivariate Gaussian density models are formed that represent all of the frames [19]. Like the HMM, the GMM can be expressed in mathematical terms. Let $x_w(t)$ be the $t$-th frame of the isolated word $w$. The probability of generating the frame $x_w(t)$ using a GMM can be computed as follows:

$$p(x_w(t)) = \sum_{m=1}^{M} c_m\, \mathcal{N}\!\left( x_w(t);\, \mu_m, \Sigma_m \right),$$

where $M$ is the number of mixtures, $c_m$ is the probability of the $m$-th mixture, and $\mathcal{N}(\cdot\,; \mu_m, \Sigma_m)$ is the multivariate Gaussian density function with mean vector $\mu_m$ and covariance matrix $\Sigma_m$. Compared to HMMs, GMMs are more efficient in training and testing because they model multi-modal distributions as a whole. GMMs are used in SER when global features are the main focus. For the same reason, GMMs are not suited to modeling the temporal structure of the signal.
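The frame probability under a GMM, a weighted sum of multivariate Gaussian densities, can be evaluated directly. The sketch below assumes the mixture weights, means, and covariances are already known (for instance from a prior training step, which is not shown); the function names are hypothetical.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate Gaussian density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    # solve(cov, diff) computes cov^{-1} diff without forming the inverse.
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_frame_probability(x, weights, means, covs):
    """p(x) = sum over mixtures of weight * N(x; mean, cov)."""
    return sum(c * gaussian_density(x, mu, sigma)
               for c, mu, sigma in zip(weights, means, covs))
```

For classification, one such model would typically be fitted per emotion and an utterance assigned to the model with the highest total log-probability over its frames.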

Vector Quantization (VQ)
Vector quantization (VQ) is a process of mapping the feature vectors of a test utterance to the best-matching feature vectors of the reference models [20]. Compared to other techniques such as HMM, VQ boasts a low computational burden owing to its straightforward approach. This efficiency comes from its use of compact codebooks for the reference models and a codebook searcher [21]. While basic VQ is convenient, it treats the feature vectors as an unordered set and therefore does not take into account the temporal evolution of the signal.
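A minimal sketch of VQ-based classification follows, assuming one codebook per emotion has already been built (for example by clustering training vectors, which is not shown). The squared Euclidean distortion and the function names are assumptions for illustration.

```python
import numpy as np

def average_distortion(features, codebook):
    """Mean squared distance from each feature vector to its nearest codeword."""
    # distances[i, j] = squared Euclidean distance between vector i and codeword j
    distances = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return distances.min(axis=1).mean()

def classify(features, codebooks):
    """Return the label whose codebook gives the lowest average distortion."""
    return min(codebooks,
               key=lambda label: average_distortion(features, codebooks[label]))

# Usage with two toy 2-D codebooks (labels and values are illustrative only).
codebooks = {
    "anger": np.array([[5.0, 5.0], [6.0, 5.0]]),
    "neutral": np.array([[0.0, 0.0], [1.0, 0.0]]),
}
utterance = np.array([[0.2, 0.1], [0.9, -0.1], [0.4, 0.0]])
label = classify(utterance, codebooks)
```

Shuffling the rows of `utterance` leaves the result unchanged, which makes concrete the point that basic VQ ignores temporal order.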

Artificial Neural Network (ANN) and Deep Neural Network (DNN)
The term artificial neural network (ANN) is commonly used for a system that imitates the flow of information between neurons. Information is received at the input and flows from one node to another until it reaches the output. Through this process, the system learns about the input given. Three branches of ANN are discussed here: the feedforward neural network, the deep neural network, and the convolutional neural network.
The feedforward neural network is the first type of neural network developed. Its process is the most basic of all: the data are forwarded through an input layer to a single hidden layer and then to the output layer, with no loops or cycles. A deep neural network expands the possibilities by adding more layers in the hidden layer segment [22] [23]. An interesting characteristic of DNNs is that they can learn high-level invariant features from raw data. The convolutional neural network (CNN), as shown in Fig. 4, is inspired by the visual cortex, where cells are activated according to their sub-regions. Applied to ANN, the neurons in a CNN are connected to their sub-regions first, before passing information to the next layer. Some sub-regions may overlap. This contrasts with other neural network architectures where each neuron is independent [25]. While CNNs are highly sophisticated and can be used for SER, they are especially suitable for image processing and recognition because of the convolutional layer.
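The difference between a single-hidden-layer feedforward network and a DNN can be seen in a forward-pass sketch, where depth is simply the number of hidden layers in the list. The ReLU activation, softmax output, random initialization, and layer sizes are assumptions for illustration; training is not shown.

```python
import numpy as np

def init_layers(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes."""
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Propagate inputs through the hidden layers, then a softmax output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)     # ReLU hidden activation (assumed)
    W, b = layers[-1]
    logits = h @ W + b
    # Subtracting the row maximum keeps the exponentials numerically stable.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
shallow = init_layers([13, 32, 7], rng)          # one hidden layer
deep = init_layers([13, 32, 32, 32, 7], rng)     # three hidden layers
batch = rng.standard_normal((5, 13))             # e.g. five 13-dimensional feature vectors
probs = forward(batch, deep)
```

Each row of `probs` is a probability distribution over seven output classes; the input size of 13 and the output size of 7 are placeholders chosen to resemble a feature vector and a set of emotion categories.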
The classifier is the algorithm that determines how the features are manipulated and translated into emotion recognition. Common classifiers are HMM, SVM, GMM, and ANN. In this work DNN, a more sophisticated version of the feedforward ANN, is used. Table 2 shows that ANN boasts deep potential for pattern recognition, provided that more layers are supplied. Its weakness, the inconvenience of adding an emotion, can be solved simply by consolidating all initial parameters at the start. This claim is further supported in Table 3, where using DNN may generate more accurate recognition than other classifiers.

REVIEW ON SPEECH EMOTION DATABASE
To complete the process of SER, the system requires a database for training and testing. An emotion database generally consists of audio recordings that are labeled with their appropriate emotion. This section discusses the number of databases used, the method of obtaining the dataset, the variety of emotions categorized, and the challenges that most researchers face in obtaining these databases.
A single SER system usually relies on a single database, to reduce data variance due to external factors such as different accents. While most systems are supported by one database, some studies use more, such as [26], which used the Berlin emotional speech database (EMO-DB) in combination with the German FAU Aibo emotion corpus (FAUAEC). That said, these databases still use only one language, German. As previously mentioned, external factors can affect the speech features that are extracted.
The closest attempt at integrating multiple databases was made in [27], which used six standard databases (AVIC, DES, EMO-DB, eNTERFACE, SmartKom, SUSAS) in a cross-corpora and multilingual evaluation experiment. An alternative is to use a database that already integrates multiple languages, such as the INTERFACE corpus, which supports English, Slovenian, Spanish, and French.
Another aspect to consider is how the speech emotion data are obtained. One may argue that truly authentic emotion can only be captured in the moment, but spontaneous speech is difficult to record. Proper speech processing requires good audio quality, which is not feasible to attain without a proper sound recording setup and environment. Therefore, the most common method is for professional or experienced actors to express the emotion through acting, after which each speech segment is labeled with its appropriate category. EMO-DB and the LDC Emotional Prosody Speech and Transcripts are two examples of actor-based databases. Generally, recording is conducted under ideal conditions (i.e., in a studio with minimal noise interference).
Another interesting method is to collect speech from existing media, such as movies and television recordings. While the source can still be considered a "professional actor", the method of collecting the data differs from the first while maintaining the general audio quality. However, this approach runs into the problem of copyright and fair use. An example of research that uses this method is [28]. Finally, some studies collect their data from non-professional actors. These databases are generally self-made in the local environment. While creating a home-made database may be more convenient for the researcher, it makes the results difficult to benchmark against other papers.
The emotions that are categorized also vary. The German database, for example, groups emotions into anger, boredom, disgust, fear, happy, neutral, and sad. The more emotion categories the database has, the more challenging it is for the SER system to achieve high accuracy. To address this, some studies such as [3] merge or omit emotions with similar attributes, e.g., "disgust" and "anger", and focus on emotions with distinct variations.
There are various other factors to consider when choosing an appropriate database, such as the number of actors, language, ethnicity, and whether single words or whole sentences are uttered. One factor that has deterred some young researchers is that some databases are behind a paywall. This leads them either to create their own database or to use the open-source databases available. Table 3 shows various databases along with the audio features and classifiers used by other researchers.

RELATED WORKS AND PROPOSED SER SYSTEM
Fortunately, SER is a topic on which papers have been abundant in recent years. In 2017 alone, more than 150 papers related to SER were published, covering different angles of approach, new combinations of features, implementations of a variety of algorithms, and optimization of results. A brief sample of the research methodologies used in 10 papers can be observed in Table 3. Table 4 shows additional closely related papers, i.e., SER using DNN. Although much research has been conducted on SER using various audio features, classifiers, and databases, there is still a need to further improve the accuracy and processing time of an SER system. With the large amount of emotional utterances available, more variation in emotion classification should be possible. The best recognition rate is only 57.9%, obtained using ELM-DNN, which leaves room for future research to improve. Based on Table 4, we propose the SER system shown in Figure 5. The raw audio received from EMO-DB is labeled with its respective emotion. The audio files are then placed in temporary storage for feature extraction. The next step is feature extraction using MFCC. Finally, the extracted features are classified using DNN. The performance evaluation of the proposed system will be discussed in our next paper.
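The labeling step of the proposed pipeline could be sketched as below. It assumes the file-naming scheme commonly documented for EMO-DB, in which the sixth character of the file name is a letter code for the emotion (derived from the German emotion word); this scheme, the letter table, and the function name are assumptions that should be checked against the database documentation before use.

```python
import os

# Assumed EMO-DB letter codes, mapped to the emotion names used in this paper.
EMOTION_CODES = {
    "W": "anger",
    "L": "boredom",
    "E": "disgust",
    "A": "fear",
    "F": "happy",
    "T": "sad",
    "N": "neutral",
}

def emotion_from_filename(path):
    """Return the emotion label encoded in an EMO-DB style file name.

    Assumes names such as '03a01Fa.wav': speaker (2 characters), text code
    (3 characters), emotion letter (1 character), version letter.
    """
    name = os.path.basename(path)
    code = name[5].upper()
    if code not in EMOTION_CODES:
        raise ValueError(f"unrecognized emotion code {code!r} in {name!r}")
    return EMOTION_CODES[code]
```

The resulting labels would then be paired with the MFCC features of each file to form the training and test sets for the DNN classifier.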

CONCLUSION
This paper has presented a comprehensive review of emotion recognition using speech analysis and of the design of an SER system. A typical SER system consists of at least feature extraction, a classifier, and a speech emotion database. Based on the critical literature review, we selected MFCC from among the various audio features because of its popularity and suitability, while the deep neural network was selected as the classifier because of its higher accuracy when more data is available. A comprehensive and popular emotion database, EMO-DB, was selected. Further research includes implementation of the proposed SER system using Matlab, followed by performance evaluation and benchmarking.