Single Input Single Head CNN-GRU-LSTM Architecture for Recognition of Human Activities

Human activity recognition (HAR) has attracted growing research interest in recent years because of its applications for the betterment of human life. Anticipating the intention behind a motion and recognizing behaviour are two intensively studied applications within HAR. Gyroscope, accelerometer, and magnetometer sensors are heavily used to obtain time-series data at every timestep. Successful recognition of human motion primitives requires the selection of temporal features. Most past approaches relied on various data pre-processing and feature extraction techniques, which demand sufficient domain knowledge. These approaches depend heavily on the quality of handcrafted features, are time-consuming, and do not generalize well. In this paper, a single-head deep neural network approach that combines a convolutional neural network (CNN), a gated recurrent unit (GRU), and long short-term memory (LSTM) is proposed. The raw data from wearable sensors are used with minimal pre-processing and without any feature extraction method. Accuracies of 93.48% and 98.51% are obtained on the UCI-HAR and WISDM datasets, respectively. The single-head model shows higher classification performance than the other deep neural network architectures it is compared with.

classifiers. A sensor and classifier fusion layer was used in the hierarchical fusion model based on entropy, and the weight entropy method was used to estimate the weights [7]. Machine learning-based methods require time-consuming and high-quality feature extraction, feature selection, and, at the same time, domain knowledge. Even after extensive preparation, pre-processing, and feature extraction, only decent performance can be achieved. HAR systems based on machine learning are applicable only to specific applications and cannot classify similar tasks from other sources [8].
Deep learning-based models have gained tremendous momentum due to their successful performance in natural language processing, object detection, image segmentation, classification, etc. In HAR, deep learning models have gained momentum over classical machine learning because they extract features automatically from raw sensor data. These deep features can represent the original data more closely, so deep learning models can classify different labels with minimal pre-processing and minimal human intervention. Deep learning models such as the Deep Belief Network (DBN) [9], the Convolutional Neural Network (CNN) [10], deep feedforward neural networks, and Recurrent Neural Networks (RNN) [11] have been heavily used in past research. HAR has been addressed with deep learning models such as LSTM, CNN, and CNN-LSTM. A smartphone-based activity recognition model based on CNN was proposed in [12]. Multivariate time-series data were classified with a deep CNN framework in [13]. Another CNN-based deep learning model was proposed for end-to-end classification of univariate time-series data in [14]. Accelerometer data were used to recognize human activities with a CNN-based architecture, in which the CNN extracted local features and statistical features described the global characteristics of the signal [15]. In [16], a CNN was used as a multilayer classifier for HAR, in which convolutional and pooling layers alternated and were followed by a fully connected layer. This model was evaluated on the publicly available UCI-HAR dataset [4].
Another widely exploited deep neural network is the Recurrent Neural Network (RNN). HAR is a classification problem on time-series data, so capturing temporal dependencies is essential for efficient classification. RNNs model temporal dependencies well in time-series classification problems, so various RNN models have been widely used in previous research. For example, in [17], a framework based on an LSTM feature extractor was proposed and evaluated on the WISDM dataset to recognize human activities [18]. In another LSTM model, the accelerometer and gyroscope data were first normalized, then passed through a stacked LSTM, and finally classified with a soft-max activation function [19]. A bi-directional LSTM for recognizing human activities from the accelerometer and gyroscope sensors of a mobile phone mounted on the subject's waist was proposed in [20]. A bi-directional LSTM was also proposed for recognizing human activities on the publicly available smartphone-based UCI-HAR dataset [21]. More recently, various combinations of CNN layers and RNN layers have been used. For example, an activity recognition system based on a CNN followed by an RNN and a dense layer was proposed for HAR in [22]. One model combined CNN and LSTM to take advantage of both networks. That model could process data from multimodal sensors without heavy data pre-processing, and it used gyroscope and accelerometer data both individually and in combination [23]. In [24], a model based on two LSTM layers followed by convolutional layers was proposed, in which the extracted features were passed through Global Average Pooling (GAP), batch normalization, and a soft-max activation.
In recent studies, multi-head deep neural networks that combine CNN and RNN have been proposed. For instance, in [25], a multi-head CNN-RNN architecture was proposed for anomaly detection with multiple sensors in an industrial environment. In that model, one CNN head was used per sensor, and the feature maps of all CNN heads were concatenated and fed to an RNN to find the temporal information in the feature map. In [26], one CNN head and one LSTM head were connected in parallel as a multi-head system, and the feature maps were concatenated and passed to the soft-max function for classification. Multi-head LSTM, multi-head CNN-LSTM, and multi-head Conv-LSTM models were designed and ensembled in [27] for recognition of the expenditure of patients on medications.
Some other methods, such as the Extreme Learning Machine (ELM), a feedforward network trained without backpropagation, were proposed for recognition of human motions in [28], [29]. A self-adapted architecture with a new sensor location based on ELM was proposed in [30]. A U-net-based model was proposed in [31] for activity recognition using time-series sensor signals. Most approaches in past studies for HAR utilized various feature extraction and selection methods. The accuracy of those approaches depended on the quality of handcrafted features and required expert knowledge of the domain [32]. In this paper, a single-input single-head CNN-GRU-LSTM model is designed for HAR using raw wearable sensor data, such as gyroscope and accelerometer data, without handcrafted feature extraction. Feature extraction inside deep learning models faces significant challenges due to the imbalanced and noisy data obtained from smartphones or wireless inertial measurement units. In this model, the filter size in the convolutional layers is chosen as 7. By taking advantage of all three networks, this combination achieves low computational cost, good accuracy, and a good F1-score. The CNN extracts the local features, and the GRU and LSTM maintain the long-term temporal dependencies of the mapped features. The model is evaluated on two publicly available datasets, UCI-HAR [4] and WISDM [33].

Contribution of this paper
a. A hybrid deep neural network architecture that takes advantage of CNN, GRU, and LSTM is proposed to recognize human activities from raw wearable sensor data with negligible pre-processing and without handcrafted feature extraction.
b. The local features are extracted by the CNN layers, and the GRU and LSTM layers maintain the long-term temporal dependencies of these mapped features so that the model can recognize diverse data.
c. The model is evaluated on two publicly available datasets, UCI-HAR and WISDM, and achieves accuracies of 93.48% and 98.51%, respectively.

Organization of paper
The rest of the paper is organized as follows: Section II describes the methodology of the proposed single-input single-head CNN-GRU-LSTM activity recognition model. Section III describes the experiments and results of the proposed approach, and Section IV presents the conclusion.

METHODOLOGY
Human activity recognition is a time-series classification problem. Successfully detecting activities from raw sensor data obtained from smartphones and wearable units requires an efficient feature extraction process. The proposed single-input single-head CNN-GRU-LSTM model extracts a temporal feature map with the CNN layers, and the GRU and LSTM layers then maintain the long-term temporal dependencies. The LSTM layer is also used to mitigate the vanishing and exploding gradient problem, which is common in traditional RNNs. This deep neural network approach enables end-to-end classification, from raw sensor data through feature extraction to classification. The architecture is evaluated on two publicly available datasets, UCI-HAR and WISDM.

Data Segmentation
The first step towards activity recognition is the segmentation of the raw data acquired from wearable sensors. This data segmentation is carried out with the help of a sliding window. The WISDM dataset is segmented into windows of 128 timesteps with 3 features per timestep. The UCI-HAR data are already available in segmented form, with 128 timesteps per window and 9 features per timestep. The input size is therefore 128x3 for the WISDM dataset and 128x9 for the UCI-HAR dataset. Each window of 128 timesteps is considered one sample of one activity. The window of length 128 spans n channels, where n equals the number of features in the dataset: n is 3 for the WISDM dataset and 9 for the UCI-HAR dataset.
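The segmentation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the window length of 128 follows the text, while the step of 64 (50% overlap), the function name `sliding_windows`, and the toy recording are assumptions made only for the example.

```python
from typing import List, Sequence


def sliding_windows(samples: Sequence[Sequence[float]],
                    window: int = 128,
                    step: int = 64) -> List[List[List[float]]]:
    """Cut a (timesteps x channels) recording into fixed-length windows.

    window=128 follows the text; step=64 (50% overlap) is an assumption.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    segments = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(samples) - window + 1, step):
        segments.append([list(row) for row in samples[start:start + window]])
    return segments


# Toy tri-axial recording (hypothetical data): 320 timesteps, 3 channels.
recording = [[float(t), float(t) + 0.1, float(t) + 0.2] for t in range(320)]
segments = sliding_windows(recording)

# Each segment has shape 128 x 3, matching the WISDM input size in the text.
print(len(segments), len(segments[0]), len(segments[0][0]))  # 4 128 3
```

Each returned segment corresponds to one input sample; for UCI-HAR the same function would yield 128x9 segments if given nine-channel rows.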

Feature Extraction
The proposed model combines a CNN and RNNs to utilize the advantages of both. These neural networks are capable of automatic feature extraction. CNNs [34] are generally used for processing data that come in the form of multiple arrays.
A CNN architecture generally consists of convolutional layers, pooling layers, and, at the end, a fully connected layer. The convolution operation on time-series data of length K and width M is depicted in Fig. 1, where M is the number of features available in the dataset. A feature map is generated by convolving filters of length n and depth h with the time-series data of length K and M features. Each convolution unit generates its own feature map. A local segment of the input time series and a kernel of size n x h are multiplied with exact overlap, so the size of the local input segment, also called the receptive field, must equal the size of the filter. Each value in the receptive field is multiplied by the corresponding weight of the filter bank, and the products are summed to give a single value of the feature map. This value is then passed through a non-linear activation function, the Rectified Linear Unit (ReLU). The number of feature maps depends on the number of kernels used. The feature maps are then passed through a pooling layer, which reduces their dimension by taking the maximum value inside each local patch. A dropout layer is regularly used to reduce the chance of overfitting.
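The multiply, sum, ReLU, and max-pool sequence described above can be illustrated on a tiny example. The sketch below is plain Python under stated assumptions: the kernel spans all M channels of the receptive field, no padding is used, the pool size is 2, and the series and kernel values are made up for illustration. The function names are hypothetical and not taken from the paper.

```python
from typing import List


def conv1d_relu(series: List[List[float]],
                kernel: List[List[float]],
                bias: float = 0.0) -> List[float]:
    """Slide one kernel over time (no padding) and apply ReLU.

    series is K x M (timesteps x features); kernel is n x M.
    Returns a feature map with K - n + 1 values.
    """
    n = len(kernel)
    feature_map = []
    for start in range(len(series) - n + 1):
        total = bias
        # Multiply each value in the receptive field by its weight and sum.
        for i in range(n):
            for m in range(len(kernel[i])):
                total += series[start + i][m] * kernel[i][m]
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map


def max_pool1d(feature_map: List[float], pool: int = 2) -> List[float]:
    """Keep the maximum of each non-overlapping patch of length `pool`."""
    return [max(feature_map[i:i + pool])
            for i in range(0, len(feature_map) - pool + 1, pool)]


# Toy input (hypothetical): K = 6 timesteps, M = 2 features.
series = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0],
          [-1.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # n = 3, spans both features

feature_map = conv1d_relu(series, kernel)
pooled = max_pool1d(feature_map)
print(feature_map)  # [2.0, 2.0, 0.0, 0.0]
print(pooled)       # [2.0, 0.0]
```

Using several kernels would simply repeat `conv1d_relu` once per kernel, giving one feature map per kernel as stated in the text.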
The strength of a CNN lies in extracting local features from time-series data taken in frames. Local values inside a data frame are highly correlated with the corresponding activity. A CNN processes each frame of the dataset independently and represents this information as a feature map, so it can be said that a CNN captures short-term dependencies. Because of the high correlation between feature values, extracting features locally is crucial for activity recognition, and the CNN performs this task. However, long-term dependencies are also required for precise recognition of activities. An RNN is introduced to capture long-term dependencies, but due to the large size of the activity dataset, the vanishing gradient problem [35] arises in a traditional RNN. A traditional RNN is therefore not well suited to activity recognition. The GRU was introduced [36] to overcome the vanishing and exploding gradient problem of the traditional RNN. The GRU, an extension of the traditional RNN, is used to capture the long-term dependencies of the time-series sequences in the activity dataset [37].
In the proposed model, GRU and LSTM layers are added after the CNN layers to connect past information with the present input. A GRU unit consists of a reset gate and an update gate, and an LSTM unit consists of an input gate, an output gate, and a forget gate. The LSTM can capture longer-term dependencies than the GRU. Human activity datasets generally contain very long sequences, so strong long-term temporal dependencies are required. In combination, the LSTM and GRU units can capture very long-term dependencies, and a model that combines them can handle data with large variation and diversity.
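For reference, the gates named above are commonly written as follows. These are the standard textbook formulations, not equations taken from this paper; the symbols (input x_t, hidden state h_t, cell state c_t, weight matrices W and U, biases b, logistic sigmoid sigma, element-wise product odot) are assumptions of this sketch, and some references swap the roles of z_t and 1 - z_t in the GRU state update.

```latex
% GRU: update gate z_t and reset gate r_t
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}

% LSTM: input gate i_t, forget gate f_t, output gate o_t
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The separate cell state c_t, which the forget gate can carry forward almost unchanged, is what lets the LSTM retain information over longer spans than the GRU's single hidden state.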

Proposed Architecture
Fig. 2 shows the proposed model in three parts: part A, part B, and part C. Part A is the CNN unit, which is followed by the GRU unit (part B) and the LSTM unit (part C). In part A, three 1D convolutional layers with the ReLU activation function are used, and each convolutional layer is followed by a max-pooling layer and a dropout layer. In part B, three GRU units are used to capture the long-term dependencies; each GRU unit passes its returned sequences to the next layer and is followed by a 10% dropout. In part C, two LSTM layers with 32 units are used, where the first layer is followed by a 10% dropout and the second by a flatten layer. The LSTM layers are used to better capture the long-term temporal dependencies of the time-series sequences.
In the last section, the flatten layer is followed by a dense layer with 128 units, which in turn is followed by a 10% dropout. A batch-normalization layer normalizes the feature map for better classification efficiency. Finally, a fully connected layer with a soft-max activation function is used.
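To make the layer order concrete, the sketch below traces the tensor shape through the stack described above using plain Python; it does not build or train a network. The kernel size 7, the 32 LSTM units, the 128 dense units, and the 10% dropout come from the text. Everything else is an assumption made only for illustration: 64 convolution filters, 64 GRU units, padding that preserves the sequence length, a pool size of 2, and a second LSTM layer that returns the full sequence so that flattening is meaningful.

```python
from typing import List, Tuple


def trace_shapes(timesteps: int, channels: int, n_classes: int,
                 filters: int = 64, gru_units: int = 64
                 ) -> List[Tuple[str, tuple]]:
    """Return (layer name, output shape) pairs for the described layer order.

    filters, gru_units, length-preserving padding and pool size 2 are
    assumptions; they are not specified in the text.
    """
    shapes = []
    t, c = timesteps, channels
    # Part A: three Conv1D(kernel 7, ReLU) -> max-pooling -> dropout blocks.
    for k in range(1, 4):
        c = filters  # length-preserving padding keeps t unchanged (assumption)
        shapes.append((f"conv1d_{k} (kernel=7, relu)", (t, c)))
        t = t // 2   # pool size 2 (assumption)
        shapes.append((f"max_pool_{k}", (t, c)))
        shapes.append((f"dropout_a{k}", (t, c)))
    # Part B: three GRU units returning sequences, each with 10% dropout.
    for k in range(1, 4):
        c = gru_units
        shapes.append((f"gru_{k} (return sequences)", (t, c)))
        shapes.append((f"dropout_b{k} (rate=0.1)", (t, c)))
    # Part C: two LSTM layers with 32 units, dropout after the first,
    # flatten after the second.
    c = 32
    shapes.append(("lstm_1 (32 units)", (t, c)))
    shapes.append(("dropout_c1 (rate=0.1)", (t, c)))
    shapes.append(("lstm_2 (32 units)", (t, c)))
    shapes.append(("flatten", (t * c,)))
    # Classification head: dense -> dropout -> batch norm -> soft-max.
    shapes.append(("dense (128 units)", (128,)))
    shapes.append(("dropout (rate=0.1)", (128,)))
    shapes.append(("batch_norm", (128,)))
    shapes.append(("softmax output", (n_classes,)))
    return shapes


# UCI-HAR style input: 128 timesteps, 9 features, 6 activity classes.
for name, shape in trace_shapes(128, 9, 6):
    print(f"{name:32s} {shape}")
```

Under these assumptions the sequence length shrinks from 128 to 16 across part A, and the flatten layer hands 16 x 32 = 512 values to the dense layer. Different filter counts or pooling settings would change these numbers but not the layer order.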

EXPERIMENTS AND RESULTS
The proposed model is trained and tested on UCI-HAR and WISDM datasets.

Datasets
UCI-HAR [4]: The University of California Irvine (UCI) released the dataset and made it publicly available on the UCI repository. Thirty volunteers participated, and the inbuilt accelerometer and gyroscope of a waist-mounted smartphone were used. A total of six human activities (sitting, standing, laying, walking, walking downstairs, and walking upstairs) were recorded at a sampling frequency of 50 Hz. Tri-axial body acceleration, tri-axial angular velocity, and tri-axial total acceleration were measured and stored, giving a total of 9 features. A Butterworth low-pass filter with a cut-off frequency of 0.3 Hz was used to separate body acceleration from gravity. A 2.56 s window was used for data segmentation, and the data of 70% of the subjects were taken as training data and the rest as testing data. A total of 10299 samples were recorded, of which 7352 were taken as training samples and 2947 as testing samples.
WISDM [33]: The WISDM dataset was acquired by the Wireless Sensor Data Mining (WISDM) lab of Fordham University. A total of 36 volunteers participated in that project and performed six activities (jogging, descending stairs, ascending stairs, walking, sitting, and standing) with a smartphone placed in their pocket. The inbuilt accelerometer of the smartphone, with a sampling frequency of 20 Hz, was used to acquire the data. In this paper, twenty-nine subjects are selected to train the proposed model and the rest are used for validation. The data are normalized by scaling all values to the range 0-1.
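The 0-1 normalization and the subject-wise split described for WISDM can be sketched as below. This is an illustrative sketch only: the text does not say which 29 subjects were used for training or whether the scaling statistics were computed per channel, so the choice of subjects 1 to 29, the per-list scaling, and the record layout with a `subject` key are assumptions.

```python
from typing import Dict, List, Set, Tuple


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant signal carries no range; map it to 0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_by_subject(records: List[Dict], train_subjects: Set[int]
                     ) -> Tuple[List[Dict], List[Dict]]:
    """Split records so that no subject appears in both partitions."""
    train = [r for r in records if r["subject"] in train_subjects]
    validation = [r for r in records if r["subject"] not in train_subjects]
    return train, validation


# Toy accelerometer readings for one axis (hypothetical values).
print(min_max_normalize([-2.0, 0.0, 2.0, 6.0]))  # [0.0, 0.25, 0.5, 1.0]

# One placeholder record per subject; IDs 1..36 as in the dataset description.
records = [{"subject": s, "window": None} for s in range(1, 37)]
train, validation = split_by_subject(records, set(range(1, 30)))
print(len(train), len(validation))  # 29 7
```

Splitting by subject rather than by window keeps all windows of one person in the same partition, so the validation score reflects performance on unseen users.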

Performance Representation
The performance of the proposed model is represented by the accuracy score, F1-score, precision, recall, and confusion matrix. Accuracy is defined as the ratio of correctly classified labels or targets to the total number of samples. The correctly classified labels are the sum of the true positives (TP) and the true negatives (TN). The wrongly classified labels are the false positives (FP) and the false negatives (FN).
The ratio of samples that are correctly predicted as positive to all samples predicted as positive is known as precision.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
The ratio of correctly predicted positives to the samples that are actually positive is known as recall.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
The F1-score is generally calculated to evaluate a model that is tested on an unbalanced dataset. The harmonic mean of precision and recall is known as the F1-score.
Fig. 4 shows that five activities of the UCI-HAR test set obtained accuracy and F1-score greater than 90%, and all activities of the WISDM test set obtained accuracy and F1-score greater than 95%. Fig. 5(a) shows the training and testing loss of the proposed model on the UCI-HAR dataset, and Fig. 5(b) shows the corresponding training and testing accuracy. In the same way, Fig. 6(a) and Fig. 6(b) show the training and testing loss and the training and testing accuracy of the proposed model on the WISDM dataset, respectively. Table 2 and Table 3 describe the effectiveness of the proposed model on the two datasets. Table 2 compares the proposed model with some previous models on the UCI-HAR dataset. The proposed model outperformed six models in terms of accuracy and F1-score on the UCI-HAR dataset.
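The metrics defined in this subsection (accuracy, precision, recall, and F1-score) can all be derived from a confusion matrix. The following is a minimal sketch in plain Python; the function name `per_class_metrics` and the 3-class confusion matrix are hypothetical and do not come from the paper's results.

```python
from typing import Dict, List, Tuple


def per_class_metrics(confusion: List[List[int]]
                      ) -> Tuple[float, List[Dict[str, float]]]:
    """Compute overall accuracy and per-class precision, recall and F1.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, wrong
        fn = sum(confusion[k]) - tp                       # true k, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return correct / total, metrics


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 0, 10]]
accuracy, metrics = per_class_metrics(confusion)
print(f"accuracy = {accuracy:.3f}")
for k, m in enumerate(metrics):
    print(k, {name: round(value, 3) for name, value in m.items()})
```

In the multi-class setting each class is treated in turn as the positive class, which is why precision, recall, and F1-score are reported per activity.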

Results and Discussion
Table 3 compares the proposed model with some previous models on the WISDM dataset. The proposed model outperformed nine models in terms of accuracy and F1-score. The novelty of the proposed model is that it captures long-term dependencies in depth by using both GRU and LSTM units. A comparative analysis with similar research shows that this less complex model performs better than some existing approaches. Outstanding performance is observed on the WISDM dataset, with an accuracy of 98.51% and an F1-score of 98.52%. Because it captures both the short-term and the long-term dependencies of the temporal sequences, the architecture can handle the diverse nature of the data.
Table 4 compares the computational efficiency of the proposed architecture with some previous research on the basis of the number of trainable parameters. A standard CNN was implemented by Gao et al. in [47] with 1.55 million parameters and achieved an accuracy 1.68% lower than the proposed model, which has 1.13 million parameters, on the WISDM dataset. A residual network was also implemented in [47] with 2.30 million trainable parameters and achieved an accuracy 0.19% lower than the proposed model on the same dataset. A multi-head attention-based CNN model was used in [48] for HAR and, with 2.77 million trainable parameters, achieved an accuracy and F1-score 2.11% and 3.12% lower, respectively, than the proposed architecture on the WISDM dataset. Various deep learning models were designed by Tufek et al. in [49] on the UCI-HAR dataset under two conditions. In the first condition, only the accelerometer and gyroscope data were considered; in the second, the whole dataset was used, including total acceleration. The proposed model obtains much higher accuracy with fewer parameters. Hence, as far as computational efficiency is concerned, the model proposed in this paper performs better than other similar previous research.

CONCLUSION
Convolutional Neural Networks and Recurrent Neural Networks for human activity recognition are implemented and tested on publicly available datasets. The CNN-GRU-LSTM framework outperformed some similar research in this field. The architecture gains the advantages of three neural networks: CNN, GRU, and LSTM. The CNN generates local features of the input sequence with efficient local dependencies, and the GRU captures the long-term dependencies. The LSTM unit provides long-term dependencies in depth, and hence the proposed model can handle more diverse data. The performance of the proposed model is tested on the publicly available WISDM and UCI-HAR datasets, and the architecture is found to outperform some multi-head architectures. This single-input single-head model with diverse data handling capacity is computationally efficient and more accurate than other similar architectures.