Recognition of Badminton Action Using Convolutional Neural Network

ABSTRACT


INTRODUCTION
Computer vision has been widely used in various applications such as video surveillance, human-computer interaction, robotics, object and action recognition, and sport analysis [1,2]. Action recognition is a very challenging problem in the computer vision field. There are two modalities in action recognition: 1) the sensor-based modality and 2) the video-based modality. In this new era of technology, where video transmissions are widely available online, the video-based modality is increasingly used for recognizing actions.
There are three components in the action recognition framework: 1) feature extraction, 2) action representation and 3) classification, as illustrated in Figure 1.

Figure 1. Action recognition framework [3]

Currently, the deep learning approach has become a research interest in action recognition because the handcrafted approach cannot extract high-level features owing to limitations such as image or video noise and complexity [3][4][5][6]. In contrast, deep learning works very well at extracting high-level features directly from raw data, as its architecture can consist of hundreds of hidden layers.
A CNN is a supervised classification technique. It has featured in many recent works owing to its simple but precise architecture. CNN falls into the deep learning classifier category, in which manual feature extraction is eliminated from the machine learning pipeline. A CNN model automatically extracts the features of an image before classifying it into its respective class [4]. The pipeline is similar to that of an Artificial Neural Network (ANN): input layer, hidden layer and output layer, but the hidden part of a CNN can consist of up to hundreds of layers to improve its accuracy. Several pre-trained CNN models with different network architectures are available, such as LeNet, AlexNet, VggNet and ResNet. Training a CNN architecture from scratch requires a large amount of data and consumes a lot of time. Transfer learning, in which an existing pre-trained CNN model is reused, offers a way to train a CNN in a short time and with far less data.
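The idea behind transfer learning can be illustrated with a deliberately small sketch. It is not the procedure used in this study: the "pre-trained" layers are replaced by a hypothetical fixed mapping (`FROZEN_WEIGHTS`), the data are synthetic two-dimensional points, and only a new logistic output layer is trained. It shows the feature-extraction variant of transfer learning, in which the transferred layers stay frozen and only the new classification layer is fitted to the new task.

```python
import math
import random

# Hypothetical stand-in for the frozen layers of a pre-trained network:
# a fixed mapping whose weights are never updated during training.
FROZEN_WEIGHTS = [[1.0, 1.0], [1.0, -1.0]]


def frozen_features(x):
    """Apply the fixed (transferred) layers to one input sample."""
    return [sum(w * v for w, v in zip(row, x)) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.1):
    """Train only the new output layer (logistic regression) on frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    feats = [frozen_features(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the pre-activation
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    f = frozen_features(x)
    score = sum(w * v for w, v in zip(weights, f)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Synthetic two-class data (an assumption for illustration only).
rng = random.Random(0)
samples = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(80)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in samples]

weights, bias = train_head(samples, labels)
train_accuracy = sum(
    predict(x, weights, bias) == y for x, y in zip(samples, labels)
) / len(samples)
```

Because the transferred layers are never updated, only three numbers (two weights and one bias) are learned here, which is the reason transfer learning needs less data and time than training a whole network from scratch.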
Several works have implemented and analysed CNNs [5][6][7][8][9][10][11][12][13][14][15][16]. The work in [5] evaluates the performance of two classifiers and two feature extractors in the classification of Caltech 265 images. The two classifiers compared are a Linear Support Vector Machine (SVM) and a Quadratic SVM, while the two feature extractors are Bag of Words (BoW) and a pre-trained CNN. The study showed that classification accuracy is highest when the features are extracted by the CNN.
In [6], the authors introduced an improved AlexNet model for scene classification, motivated by the limitations of the AlexNet model in image classification. The improvement decomposes each large convolutional kernel into two small convolutional kernels with reduced stride: a 5×5 convolution is decomposed into two 3×3 convolutions, and a 3×3 convolution is decomposed into a 3×1 convolution followed by a 1×3 convolution. The experiment was conducted on the SUN397 and Places 2 datasets. In comparison with the AlexNet and ZFNet models, the proposed improved AlexNet model has the highest accuracy.
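The benefit of this kind of kernel decomposition can be checked with simple arithmetic. The sketch below is an illustration rather than the implementation in [6]: it assumes a single input and output channel, stride 1 and no bias terms, and the helper names are hypothetical.

```python
def conv_weights(kernel_h, kernel_w, in_channels=1, out_channels=1):
    """Number of weights in one convolution layer, ignoring biases."""
    return kernel_h * kernel_w * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field along one axis of stride-1 convolutions applied in sequence."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


# One 5x5 convolution versus two stacked 3x3 convolutions.
weights_5x5 = conv_weights(5, 5)
weights_two_3x3 = 2 * conv_weights(3, 3)
field_two_3x3 = stacked_receptive_field([3, 3])

# One 3x3 convolution versus a 3x1 convolution followed by a 1x3 convolution.
weights_3x3 = conv_weights(3, 3)
weights_3x1_1x3 = conv_weights(3, 1) + conv_weights(1, 3)
field_vertical = stacked_receptive_field([3, 1])
field_horizontal = stacked_receptive_field([1, 3])
```

Under these assumptions, two 3×3 convolutions cover the same 5×5 region with 18 weights instead of 25, and the 3×1 plus 1×3 pair covers a 3×3 region with 6 weights instead of 9.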
The study in [7] compared two CNN models (GoogleNet and AlexNet) in classifying different flowers using the Visual Geometry Group's 102-category flower dataset. The method was divided into image segmentation and classification. Image segmentation was used to remove the background from the images. Their finding is that GoogleNet performs better than AlexNet in flower categorization.
The purpose of this study is to implement and investigate the performance and capability of the transfer learning method with different pre-trained CNN models in recognizing badminton action. At the end of this study, a suitable pre-trained CNN model will be proposed to automatically recognize badminton actions from broadcast video. An automated action recognition system in the sport field would be very beneficial to coaches for efficient sport performance analysis. In Section 2, we explain our methodology and the design of the experiment. In Section 3, we present and briefly discuss the results. Lastly, the conclusion and further work are stated in Section 4.

For dataset construction, the full-duration broadcast video of the Yonex All England 2017 men's singles match, with 720p resolution and a frame rate of 25 frames per second, was obtained from YouTube and extracted into still image frames. Still image frames were used in this study to avoid the problem of variable video length; for instance, one video clip might be 20 seconds long while another is 50 seconds. The extraction, which was done using the VirtualDub software, produced 138130 image frames. We then annotated each image frame as a hit or non-hit action. Hit action refers to a player hitting the shuttlecock, while non-hit action refers to everything else. Lastly, 80 image frames were selected randomly from the full set of frames: 40 images of hit action and 40 images of non-hit action. Figure 3 shows examples of the image frames used in this experimental work for hit and non-hit action.

Table 1.

AlexNet, GoogleNet and VggNet are among the most popular and widely used CNN models. CNNs are applied to vision-based datasets for image classification, object detection, image recognition and image segmentation. Table 2 summarises the details of each model. These models were trained to classify 1000 object categories.
However, in this study, we fine-tuned these models with our dataset to classify only two action categories: hit and non-hit. Lastly, the classification performance of each model was analysed in terms of accuracy and visualised using a confusion matrix. In the confusion matrix, the columns represent the predicted class and the rows represent the actual class. Entries on the leading diagonal (green) are the correctly classified actions, while the other entries (red) are the misclassified actions. In the confusion matrix legend, 1 represents hit action and 2 represents non-hit action.
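The confusion matrix layout described above (rows for the actual class, columns for the predicted class, 1 for hit and 2 for non-hit) can be sketched as follows. The six labels are hypothetical values chosen only to illustrate the layout; they are not results from this study.

```python
def confusion_matrix(actual, predicted, classes=(1, 2)):
    """Build a matrix with rows = actual class and columns = predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical labels for six test frames (1 = hit, 2 = non-hit).
actual = [1, 1, 1, 2, 2, 2]
predicted = [1, 1, 2, 2, 2, 1]

matrix = confusion_matrix(actual, predicted)

# Entries on the leading diagonal are the correctly classified frames.
correct = sum(matrix[i][i] for i in range(len(matrix)))
```

Here `matrix[0][1]` counts hit frames that were predicted as non-hit, and `matrix[1][0]` counts non-hit frames that were predicted as hit.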

RESULTS AND ANALYSIS
As mentioned earlier, the aim of this experimental work is to evaluate and compare the performance of four different pre-trained CNN models in recognizing badminton actions.

Table 3.

Equation (1) below is the formula used to obtain the percentage accuracy.

$$\text{Accuracy (\%)} = \frac{\text{total number of correctly classified actions}}{\text{total number of test samples}} \times 100 \qquad (1)$$

The total number of correctly classified actions is the sum of the correctly predicted classes on the diagonal of the confusion matrix, while the total number of test samples is 16.

This result, in which GoogleNet outperforms AlexNet, supports the previous study in [7], where GoogleNet performed better in flower categorization, though not to a great extent. In contrast, both VggNet models lag behind with an accuracy of only 50.0%, with all non-hit actions misclassified as hit actions, as shown in Figure 9.

These four models have different architectures. As described in [17], the AlexNet model consists of 8 learned layers (5 convolutional layers and 3 fully-connected layers) with 60 million parameters. The GoogleNet model, however, has 22 learned layers, with the number of parameters reduced to 4 million by the inception module [18]. According to [19], the VggNet-16 and VggNet-19 models have 16 and 19 learned layers respectively, with 140 million parameters. The results therefore support the claim of previous studies that the deepest network has the highest accuracy: the GoogleNet model has the highest accuracy compared with the AlexNet and VggNet models because it has the most layers. However, the results also show that the AlexNet model performs better than the VggNet models even though the VggNet models are deeper. This is because the VggNet models have 140 million parameters, whereas the AlexNet model has only 60 million parameters. As stated in [20], a small variation in the number of parameters can produce a significant gain in performance.
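The accuracy calculation can be checked against the VggNet result reported above. The sketch below assumes that the 16 test samples are split evenly into 8 hit and 8 non-hit frames, which is not stated explicitly in the text; under that assumption, misclassifying every non-hit frame as hit while classifying every hit frame correctly gives the reported 50.0%.

```python
def accuracy_percent(matrix):
    """Percentage accuracy: correct predictions on the diagonal over all test samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Rows = actual class, columns = predicted class (1 = hit, 2 = non-hit).
# Assumed test split: 8 hit frames and 8 non-hit frames.
vggnet_matrix = [
    [8, 0],  # all hit frames predicted as hit
    [8, 0],  # all non-hit frames predicted as hit
]

vggnet_accuracy = accuracy_percent(vggnet_matrix)  # 8 correct out of 16
```

A classifier that assigns every frame to one class scores exactly the share of that class in the test set, so an accuracy of 50% on a balanced two-class test set is no better than a constant prediction.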
Overall, it can be inferred that the GoogleNet model performs better in recognizing badminton actions. We are also aware that our study has two limitations: the first is GPU memory and the second is training time. Because the GPU ran out of memory when training both VggNet models, we trained these models on the CPU, which took longer to complete the training process. In addition, we found that the machine used to train a model affects its accuracy: the accuracy of the VggNet models dropped significantly to 50%, even though these models should perform better than AlexNet. The same machine should therefore be used for all models, as the limited capability of the CPU may greatly affect the results. It is plausible that these limitations influenced the results obtained.

CONCLUSION
Sport performance analysis is an important branch of sport practice. To analyse the performance of athletes using the notational analysis approach, the sport analyst must manually recognize each action before doing the analysis. This study provides an analysis of the performance of deep learning models in recognizing badminton action. It can contribute to automatic action recognition using the simple and time-efficient transfer learning method, which has not been done before. In the future, the experiment can be improved by classifying more badminton actions instead of only hit and non-hit. Moreover, we believe that this study is a starting point for developing more advanced deep learning architectures for automated badminton action recognition.