Important Features of the CICIDS-2017 Dataset for Anomaly Detection in High-Dimensional and Imbalanced-Class Datasets

The growth in internet traffic volume presents new issues in anomaly detection, one of which is high data dimensionality. Feature selection techniques have been proven able to solve the problem of high data dimensionality by producing relevant features. On the other hand, high class imbalance is a problem for feature selection. In this study, two feature selection approaches are proposed that are able to produce the most ideal features on a highly class-imbalanced dataset. CICIDS-2017 is a reliable dataset that suffers from high class imbalance, and it is therefore used in this study. Furthermore, this study experiments with the Information Gain feature selection technique on the imbalanced-class dataset. For validation, the Random Forest classification algorithm is used because of its ability to handle multi-class data. The experimental results show that the proposed approaches perform remarkably well and surpass state-of-the-art methods.


INTRODUCTION
Many researchers have stated that feature selection is able to reduce data dimensionality by removing redundant features and selecting the most optimal features [1], [2], [3]. Pervez and Farid [4] applied a feature selection algorithm to reduce the input features of the classification engine. Tama and Rhee [5] used particle swarm optimization (PSO)-based feature selection to select attributes. Aghdam and Kabiri [6] implemented an Ant-Colony-based feature selection technique to produce optimal features. Meanwhile, Kushwaha et al. [7] applied a filter-based feature selection technique to remove unnecessary features. In intrusion detection system (IDS) research, an effective feature selection technique can be used to produce relevant features that help improve the system's capability in terms of attack detection, with a minimal false alarm rate and low computation time [6]. Various techniques have been proposed to produce an ideal feature selection technique that can improve IDS performance. Chen et al. [8] proposed a combination of the K-nearest neighbor (KNN) and tree seed algorithm (TSA), and the proposed method is able to increase the accuracy and efficiency of network intrusion detection. Gottwalt et al. [9] introduced CorrCorr as a feature selection technique, which results in good detection capabilities with low false alarm rates. Meanwhile, Zhou et al. [10] developed an effective IDS with feature selection techniques and an ensemble classifier, and the experimental results show superior performance.

The Dataset
The CICIDS-2017 dataset was developed to address the scarcity of real-time network traffic datasets [23]. The CICIDS-2017 dataset has the most recent and relevant data for testing security systems [24]. Nevertheless, the main reason for using this dataset is that it contains highly class-imbalanced data, as stated in the studies by Panigrahi and Borah [25] and Injadat et al. [26]. Other IDS datasets such as NSL-KDD or UNSW-NB15 have a limited number of features, i.e. NSL-KDD has 42 features and UNSW-NB15 has 49 features [27], while the CICIDS-2017 dataset has a total of more than 80 features [24]. Thus, we consider the CICIDS-2017 dataset superior in terms of data dimensionality.
In the experiment, only 30% of the MachineLearningCSV version of the CICIDS-2017 dataset was used. The data profile used is presented in Table 1. The MachineLearningCSV version of the CICIDS-2017 dataset contains 15 traffic classes consisting of normal and attack traffic. The data in the table also show an unbalanced data distribution among the 15 classes. The imbalance of the data can also be seen in the percentage distribution against the main class and the distribution for each class. The dataset also includes classes with a small number of attack traffics, such as Web Attack-SQL Injection, Infiltration, and Heartbleed. The class imbalance in the CICIDS-2017 dataset is also noted in the studies by Abdulhammed et al. [20], Pelletier and Abualkibash [24], and Panigrahi and Borah [25]. For experimental purposes, the 30% subset of the dataset is separated into 70% training data and 30% testing data. The training data profile is presented in Table 2, while the testing data profile is presented in Table 3. Referring to these data profiles, both data portions have 15 traffic classes (normal and attack) and both contain high class imbalance. This means that the characteristics and completeness of both the training and testing data meet the needs of the experiment. The training data used in the experiment consist of 594,456 records and the testing data consist of 254,767 records.
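The 70/30 split described above can be sketched as a stratified split, so that every class appears in both portions despite the imbalance. The class names and counts below are made up for illustration and do not reflect the real CICIDS-2017 distribution.

```python
# Sketch: a 70/30 stratified split that preserves the class imbalance.
# The synthetic labels stand in for the 15 CICIDS-2017 traffic classes;
# the real experiment would load the MachineLearningCSV files instead.
from collections import Counter
import random

random.seed(42)

# Hypothetical, heavily imbalanced label distribution (not real counts).
labels = ["BENIGN"] * 1000 + ["DoS Hulk"] * 300 + ["Heartbleed"] * 5

# Stratify: split each class separately so both portions keep all classes.
train, test = [], []
for cls in set(labels):
    members = [l for l in labels if l == cls]
    random.shuffle(members)
    cut = int(len(members) * 0.7)  # 70% of this class goes to training
    train.extend(members[:cut])
    test.extend(members[cut:])

print(Counter(train))
print(Counter(test))
```

Splitting per class rather than over the whole list guarantees that even tiny classes such as Heartbleed contribute records to both portions.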

Figure 1. Experimental Framework
In this study, the selected features generated from a previous study [13] (presented in Table 4) will be validated using large-dimensional data containing highly class-imbalanced data, i.e. the CICIDS-2017 dataset with 15 traffic class labels. In addition, this study examines the Information Gain feature selection technique on the highly class-imbalanced dataset. The research experiment framework is illustrated in detail in Figure 1. Two feature selection approaches are proposed, named Approach-1 and Approach-2:
• Approach-1: the researchers use the approach introduced by Panigrahi and Borah [25], grouping similar attack traffics and assigning them a new label. For the experiment, the re-labeled dataset with 7 class labels is divided into 70% training data and 30% testing data. Feature selection is then carried out using Information Gain. Based on previous research, this approach produces 22 features that result in ideal detection performance. These selected features are then used to identify attacks on the imbalanced dataset.
• Approach-2: the researchers apply the Information Gain feature selection technique to the input dataset with all 15 traffic class labels. The relevant features are then selected by applying the same minimum feature weight as Approach-1, i.e. 0.4.
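The Information Gain step used in both approaches can be sketched as follows: compute IG(C; F) = H(C) − H(C|F) for each feature and keep those whose weight is at least 0.4. The two features and four records below are illustrative stand-ins, not actual CICIDS-2017 flows.

```python
# Sketch: Information Gain ranking with a minimum-weight cutoff of 0.4.
import math
from collections import Counter

def entropy(values):
    # Shannon entropy H of a list of class labels.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    # IG(C; F) = H(C) - H(C | F)
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ["attack", "attack", "normal", "normal"]
features = {
    "Flow Duration": [1, 1, 0, 0],  # perfectly separates the classes
    "Idle Mean":     [0, 1, 0, 1],  # carries no class information
}

ranked = {name: information_gain(col, labels) for name, col in features.items()}
selected = [name for name, ig in ranked.items() if ig >= 0.4]
print(ranked)    # Flow Duration scores 1.0, Idle Mean scores 0.0
print(selected)  # only 'Flow Duration' survives the 0.4 cutoff
```

A feature that cleanly separates the classes gets a high weight, while an uninformative one scores near zero and is eliminated by the cutoff.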

Experiment Configuration
In this study, the authors use a Core i7 notebook with 8 GB RAM and a 500 GB HDD, running the Windows 10 operating system. For analysis purposes, the authors use the Waikato Environment for Knowledge Analysis (WEKA) version 3.8. It is a machine learning software [28] and is widely used in data mining and machine learning research, including IDS research [28][29][30]. In the normal and attack traffic classification experiments, several test options available in the WEKA tool are used:
• Use training set: classification performance test using all input data.
• Cross-validation: classification performance test using k-fold cross-validation. In the experiment, 10-fold and 5-fold cross-validation were used.
• Percentage split: classification performance test using split data. The experiments use splits from 10% to 90%.
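The two main test options listed above can be sketched in plain Python: k-fold cross-validation partitions the data into k folds and holds each fold out once, while a percentage split cuts off a fixed training fraction. The data here is a placeholder list; in the experiment each item would be one labeled traffic record.

```python
# Sketch of WEKA's "Cross Validation" and "Percentage Split" test options.
data = list(range(100))

def kfold_indices(n, k):
    # Partition n items into k folds; each fold is held out once for
    # testing while the remaining folds form the training set.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def percentage_split(items, train_pct):
    # Cut off the first train_pct percent of items as training data.
    cut = int(len(items) * train_pct / 100)
    return items[:cut], items[cut:]

splits = list(kfold_indices(len(data), 10))
train70, test30 = percentage_split(data, 70)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 folds, 90/10
print(len(train70), len(test30))                          # 70/30 split
```

In WEKA itself these partitions are produced internally; the sketch only shows how the data is divided under each mode.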

Random Forest (RF)
Random Forest is a decision-tree-based method that combines individual trees into a single model. There are three important aspects in the Random Forest process, namely: (1) conducting bootstrap sampling with the aim of building a prediction tree; (2) every tree predicting decisions using random predictors; (3) performing the Random Forest prediction by combining the results from each decision tree through a majority vote for classification [30]. That is why Random Forest is known as an ensemble classifier method. If the classifiers in an ensemble are decision tree classifiers, the classifier set is a "forest". Each individual decision tree is created through a random selection of attributes at each splitting node [31]. The Random Forest algorithm was proposed by Breiman in 2001 [32]. Anomaly detection studies using Random Forest include the research conducted by Belavagi and Muniyal [33], Jiang et al. [34], and Abd and Hadi [35].
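The three steps above can be sketched with trivial one-rule "stumps" standing in for full decision trees. This is purely an illustration of the ensemble mechanics (bootstrap sampling, random predictor choice, majority vote), not the Random Forest implementation used in the experiments.

```python
# Sketch of the three Random Forest steps on toy two-feature records.
import random
from collections import Counter

random.seed(0)

# Toy records: (feature vector, label).
data = [([1, 0], "attack"), ([1, 1], "attack"),
        ([0, 0], "normal"), ([0, 1], "normal")]

def train_stump(sample):
    # (2) Pick one random feature ("random predictor") and predict the
    # majority label for each of its observed values.
    f = random.randrange(len(sample[0][0]))
    rule = {}
    for v in {x[f] for x, _ in sample}:
        subset = [y for x, y in sample if x[f] == v]
        rule[v] = Counter(subset).most_common(1)[0][0]
    return f, rule

def train_forest(data, n_trees=25):
    forest = []
    for _ in range(n_trees):
        # (1) Bootstrap sample: draw records with replacement.
        sample = [random.choice(data) for _ in data]
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    # (3) Majority vote over all trees (default "normal" for unseen values).
    votes = [rule.get(x[f], "normal") for f, rule in forest]
    return Counter(votes).most_common(1)[0][0]

forest = train_forest(data)
print([predict(forest, x) for x, _ in data])
```

Replacing the stumps with full decision trees grown on random attribute subsets yields the algorithm as described by Breiman.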

Measurement
In this experiment, the detection performance of the Random Forest classification algorithm was measured using the Accuracy, True Positive Rate (TPR), False Positive Rate (FPR), Precision, F-Measure, and Receiver Operating Characteristic (ROC) metrics.
• Accuracy: defined as the level of closeness between the categorization value and the actual value; often used to measure the effectiveness of classification algorithms. Also known as Classification Rate (CR).
Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)
• TPR: defined as the proportion of actual positives correctly categorized as the positive class. Also known as Recall, Detection Rate (DR), or Sensitivity.
TPR = TP / (TP + FN) (2)
• FPR: defined as the proportion of actual negatives categorized as the positive class, i.e. normal traffic considered an attack. Also known as the False Acceptance Rate (FAR) or fall-out.
FPR = FP / (FP + TN) (3)
• Precision: defined as a measure of the estimated probability of a correct positive prediction. Also known as Positive Predictive Value (PPV).
Precision = TP / (TP + FP) (4)
• F-Measure or F1-Score: the weighted harmonic mean of recall and precision, used to compare weighted recall and precision rates.
F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
• ROC: this curve is used to evaluate the performance of the classification algorithm [36]. The x-axis represents the FAR value and the y-axis represents the Sensitivity value.
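As a worked example, these metrics can be computed directly from raw confusion-matrix counts; the counts below are invented for illustration.

```python
# Worked example: the five metrics computed from confusion-matrix counts.
# TP/TN/FP/FN values are made up for illustration only.
TP, TN, FP, FN = 90, 95, 5, 10

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # Eq. (1)
tpr       = TP / (TP + FN)                    # Eq. (2), Recall / DR
fpr       = FP / (FP + TN)                    # Eq. (3), FAR / fall-out
precision = TP / (TP + FP)                    # Eq. (4), PPV
f_measure = 2 * precision * tpr / (precision + tpr)

print(accuracy, tpr, fpr, precision, f_measure)
# accuracy=0.925, tpr=0.9, fpr=0.05, precision≈0.947, f_measure≈0.923
```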

RESULTS
This section describes the results of the experiments that have been carried out in this study. The explanation includes the results of selecting features from each approach (Approach-1 and Approach-2), and testing results of the attack detection performances using the classification algorithms.

Selected Features
As described in the methodology section, the feature selection technique tested in this study is Information Gain. For Approach-1, the 22 selected features are presented in Table 4. These are the most relevant features according to Approach-1, and they are subsequently used to detect normal and attack traffics. In Approach-2, the Information Gain selection technique is applied to the dataset with 15 highly imbalanced classes. A list of features ranked by the weights generated through Approach-2 is presented in Table 5. Features below the minimum weight were subsequently eliminated. By applying the same minimum weight as Approach-1, i.e. 0.4, 28 selected features are produced, as displayed in Table 6. Through this process, Approach-2 reduces the number of features by 63.64%. The features produced by Approach-2 will also be validated using the Random Forest classification algorithm. The validation results from Approach-1 and Approach-2 will then be compared to see which approach is the most ideal.
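As a quick arithmetic check of the reduction figure: if 77 features were ranked (a total inferred from the reported numbers, not stated directly in the text), then selecting 28 removes 49 of them, which matches the reported 63.64% reduction.

```python
# Arithmetic check of the feature reduction percentage. The total of 77
# ranked features is inferred from 28 selected and a 63.64% reduction;
# it is an assumption, not a figure stated in the text.
total_features = 77
selected = 28
reduction = (total_features - selected) / total_features * 100
print(round(reduction, 2))  # → 63.64
```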

Detection Performances
To test whether the features generated by the proposed method can be used to detect normal and attack traffics on high-dimensional and imbalanced data, validation is carried out through detection performance testing using the features selected through Approach-1 and Approach-2. This detection test uses the Random Forest classification algorithm. The experimental results show very high accuracy because the selected features are relevant and important in characterizing attack patterns, so the classification algorithm is able to identify attacks very well. In addition, to maintain the reliability of the test results, several testing modes were used, i.e. full train, 5-fold cross-validation, 10-fold cross-validation, and 10-90% data splitting, applied to both training and testing data.

Measuring TPR, FPR, Precision, F-Measure, and ROC for Approach-1
In this experiment, the features generated by Approach-1 were used to detect attacks using the Random Forest classification algorithm. Table 7 presents the results of detection testing using the features selected in Approach-1. The TPR, FPR, Precision, F-Measure, and ROC results show that, using the features selected by Approach-1 on the training dataset, the Random Forest algorithm has an excellent performance in identifying normal and attack traffics. The Random Forest algorithm's performance also looks excellent when tested using the testing dataset, as presented in Table 8. The measurement results show that the TPR, Precision, F-Measure, and ROC values for all types of traffic reach 1.000, with a very low FPR value of 0.000.

Measuring TPR, FPR, Precision, F-Measure, and ROC for Approach-2
Through Approach-2, 28 relevant features have been generated. These 28 features are used as input to detect attacks using the Random Forest algorithm. The Random Forest's performance in detecting attacks on the training dataset is shown in Table 9, and the results of the experiments with the testing dataset are presented in Table 10. The experimental results on both datasets show that, with the features generated through Approach-2, the Random Forest algorithm can detect both normal and attack traffics in the imbalanced dataset. The experimental results also show that the performance difference between Approach-2 and Approach-1 is not significant. This is because the features generated through Approach-1 also belong to Approach-2: of the 28 features produced by Approach-2, 22 are in Approach-1. The features produced by Approach-2 but not by Approach-1 are Flow IAT Mean, Avg Fwd Segment Size, Fwd Packet Length Mean, Bwd Packet Length Std, and Flow Bytes/s.

Accuracy Testing
The detection engine's performance can also be measured by accuracy. Accuracy shows the machine's ability to predict traffic according to its actual condition; in other words, its capability to classify a class exactly. Table 11 shows Random Forest's accuracy performance. As explained in the previous section, several test modes were used in this experiment, i.e. full train, 10-fold, 5-fold, and split 10 to split 90. The experimental results show that, using the features generated through Approach-1, the accuracy of the Random Forest algorithm in predicting normal and attack traffics is excellent, with an average accuracy of 99.842% on the training dataset and 99.830% on the testing dataset. The accuracy of the Random Forest algorithm using the features generated by Approach-2 is presented in Table 12. These results are also excellent, with an average accuracy of 99.820% for training data and 99.790% for testing data.

Comparison
Having performed the feature selection experiments using Approach-1 and Approach-2, validation is carried out with several classification algorithms, i.e. RF, Naïve Bayes (NB), J48, RepTree, Bayes Network (Bnet), and OneR. This validation aims to assess the ability of each algorithm to detect the type of traffic using the selected features. The classification algorithm validation was carried out using both training data and testing data. Details of the Approach-1 and Approach-2 validation processes are presented in Algorithm-1 and Algorithm-2.
Figure 3 presents a comparison of the performance of Approach-1 and Approach-2, based on the mean values of TPR, FPR, Precision, F-Measure, and ROC. The graph shows that the TPR values of Approach-1 and Approach-2 are the same, i.e. 0.998. For the mean FPR, Approach-1 is better than Approach-2, while the mean Precision, F-Measure, and ROC values are the same for both approaches. A performance comparison between the proposed approaches and previous studies is shown in Table 13. The figures in the table show that both Approach-1 and Approach-2 outperform previous studies in terms of Accuracy, TPR, Precision, F-Measure, and ROC.

CONCLUSION
This study has proposed two approaches to produce relevant features for detecting attacks on a high-dimensional, multi-class, and highly class-imbalanced dataset. The Random Forest algorithm was chosen as the classification method because of its ability to handle multi-class data. Based on the experiments on the CICIDS-2017 dataset with 15 traffic class labels, Approach-1 and Approach-2 produced 22 and 28 important features, respectively. Furthermore, the validation experiments showed that the combination of Approach-1's 22 important features and the Random Forest classification algorithm worked well in detecting attacks, with an average accuracy of 99.842% on the training dataset and 99.830% on the testing dataset. In addition, the experimental results prove that the proposed approach is able to provide recommendations of important and relevant features. With the Random Forest algorithm, the resulting features are able to detect attacks with better performance on high-dimensional and highly class-imbalanced datasets. The experimental results also show that the proposed method exceeds the performance of state-of-the-art methods in terms of Accuracy, TPR, FPR, Precision, and ROC.
Although this research has shown remarkable results, the Information Gain technique still requires repeated experiments and validation to obtain the minimum weight for selecting important features. Therefore, future research will focus on finding the most optimal way to produce the ideal features by involving intelligent approaches.