Network anomaly detection research: a survey

Received Oct 19, 2018; Revised Jan 09, 2019; Accepted Jan 21, 2019

Data analysis to identify attacks and anomalies is a crucial task in anomaly detection, and network anomaly detection is itself an important issue in network security. Researchers have developed methods and algorithms to improve anomaly detection systems, and survey papers on anomaly detection research are already available. Nevertheless, this paper attempts to analyze the field further and to provide an alternative taxonomy of anomaly detection research, focusing on methods, types of anomalies, data repositories, outlier identity and the most used data type. In addition, this paper summarizes the application network categories of the existing studies.


INTRODUCTION
Anomaly (also known as outlier) detection is an important issue in information security, as defined in [1] and [2]. Anomaly detection and misuse detection are alternative approaches used to recognize intrusions [3] and form part of the Intrusion Detection System (IDS). IDS consists of three major groups: Signature-based Detection (SD), Anomaly-based Detection (AD) and Stateful Protocol Analysis (SPA) [4]. Recognizing anomalies on the network is important for administrators because it helps them manage and troubleshoot security issues [5].
Research on network anomaly detection has been carried out for a long time and is still progressing widely. As noted in [6], it is not only an important research area but also one with dynamic issues. The research topics are highly diverse, ranging from models [7], [8] and frameworks [9], [10] to methods [11], [12]. Moreover, evaluation techniques and approaches are becoming more important because they affect the accuracy of identification. The evaluation and validation approaches used by researchers vary: researchers use experiments [13], [14], simulations [15], or both [16].
This paper attempts to analyze the field and to provide an alternative taxonomy of anomaly detection research, focusing on methods, types of anomalies, data repositories, outlier identity and the most used data type. In addition, this paper summarizes the application network categories of the existing studies.
This article is structured as follows. Section II provides information on preliminary studies and relevant research, Section III describes the research methodology, Section IV discusses the observation results of the survey study, and Section V concludes the survey results and outlines future research plans.

PRELIMINARY STUDIES AND RELEVANT RESEARCH
Anomaly detection can be understood as the detection of unexpected events, patterns and behaviors that deviate from the notion of normal [17]. An anomaly detection method first defines a profile of normal system behavior; any deviation from that profile is then marked as an anomaly [18]. Multiple devices connected to the network introduce new challenges in anomaly detection. Researchers have developed various methods, frameworks, techniques and algorithms in order to produce automatic and reliable anomaly detection.
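The profile-then-deviation idea described above can be illustrated with a minimal sketch. It is not taken from any of the surveyed works: the profile is assumed to be a simple mean and standard deviation, the three-standard-deviation threshold is an arbitrary choice, and the traffic counts and the function names build_profile and is_anomaly are hypothetical.

```python
from statistics import mean, stdev

def build_profile(normal_values):
    """Summarize normal behavior as a (mean, standard deviation) pair."""
    return mean(normal_values), stdev(normal_values)

def is_anomaly(value, profile, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the profile mean."""
    mu, sigma = profile
    if sigma == 0:
        # A constant profile: anything different from it is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical packets-per-second counts observed during normal operation.
normal_traffic = [100, 104, 98, 101, 97, 103, 99, 102]
profile = build_profile(normal_traffic)

# Two ordinary observations followed by a sudden burst.
flags = [is_anomaly(v, profile) for v in [101, 96, 250]]
```

Only the burst is flagged here. Real systems would use richer profiles and would have to update them as normal behavior drifts.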
Research on anomaly detection is widespread and has been carried out from different angles. In networking in particular, the concept appears alongside IDS research. As mentioned by Abduvaliyev et al. [19], anomaly detection is one of three main techniques that can be used in an IDS. This technique identifies whether network traffic is normal or abnormal. By implementing this concept, the IDS is expected to be able to detect new or unknown anomalies and attacks. Oreilly et al. [20] study the detection of anomalies in the nonstationary environment of wireless sensor networks. The authors in [6] discuss a variety of anomaly detection approaches based on methods, systems and tools, while Al-Musawi et al. [21] present a grouping of important anomaly detection techniques for identifying traffic anomalies. The first two articles focus mainly on wireless sensor networks, the third focuses only on methods, systems and tools, and the last focuses on anomaly detection in the Border Gateway Protocol (BGP).
Extensive survey studies have also been carried out; however, these studies are very diverse, and each researcher uses a different approach and focus. For example, Zhang et al. [1] evaluate and compare the existing outlier detection techniques specifically developed for wireless sensor networks (WSNs). Gogoi et al. [22] provide a comprehensive, up-to-date survey of outlier detection methods, while Marnerides & Mauthe [23] discuss the dimensions of theoretical methodologies and traffic features. Bhuyan et al. [6] present a structured and comprehensive survey of anomaly-based network intrusion detection, and Weller-Fahi et al. [24] present a taxonomy of network anomaly detection. Patcha et al. [25] and Garcia-Teodore et al. [26] present existing solutions and the latest technological trends in network anomaly detection. Table 1 summarizes the discussion topics covered by this paper and by other existing survey studies.

Network Categories and Application Domain
In this paper, the category of the application network is based on information about the environment and application domain in which anomaly detection is applied. The category also considers the type of traffic or data used, as follows: 1) smart network, which includes smart control systems in smart cities, homes and industries; 2) large-scale network, which includes Internet Service Provider (ISP), Multi-Protocol Label Switching (MPLS), backbone and cloud computing networks; 3) wireless sensor network; 4) mobile network; and 5) conventional network, which includes computer networks and TCP networks.

Anomaly detection methods
The methods used by researchers are also evolving and diverse. This sub-section presents the methods used in anomaly detection research. Existing survey papers arrive at several similar methods. For example, the authors in [1] identify statistical, nearest neighbor, clustering, classification and spectral-decomposition methods, whereas the authors in [23] identify statistics, digital signal processing and information theory. The survey in [8] identifies statistical, classification-based, clustering and outlier-based, soft computing, knowledge-based and combination-learner methods. The research in [27] identifies statistical, information theory, clustering and classification methods. Thus, it can be concluded that the anomaly detection methods used by researchers are clustering, classification, statistical, information theory, nearest neighbor, spectral-decomposition, soft computing, knowledge-based, digital signal processing and combination-learner. Figure 1 shows the summary of detection methods concluded from the surveys. The following is a brief description of each method:
a) Clustering: a method for grouping similar objects into groups called clusters, so that objects in the same cluster are more similar to each other than to objects in other clusters [6]. Researchers in [2] use a clustering algorithm in the preprocessing step to separate sensor data into a normal cluster and an outlier cluster. K-means is the most common clustering algorithm and is usually combined with another technique for outlier detection on data streams [48]. Researchers in [49] combine K-means with the Iterative Dichotomiser 3 (ID3) method for anomaly detection, resulting in high accuracy. Detecting anomalies with K-means requires, first, defining normal clusters, anomalous clusters and suitable similarity measures and, second, performing an offline preprocessing phase [50].
b) Classification: this method starts by learning from a set of data instances (training) and then classifies an unseen instance into one of the learned classes, normal or outlier (testing) [1]. A classification method identifies which of a set of categories an observation belongs to, based on training data containing observations whose category membership is known [6]. Researchers in [51] use a Support Vector Machine (SVM) classifier to detect anomalous network traffic. The one-class SVM is the most widely used for anomaly detection; it effectively separates normal and anomalous data in the learned feature space [52].
c) Statistical: the earliest method used for outlier detection problems. Based on how the probability model is built, statistical techniques are categorized as parametric or nonparametric [1]. The statistical approach is based on the development of probabilistic data models and on mathematical methods from applied statistics and probability theory [22]. It is a mathematical scheme that uses temporal characteristics, events and trends to create process profiles and capture specific dynamics (e.g., network anomalies), relying heavily on statistical methods [23].
ISSN: 2089-3272 Network anomaly detection research: a survey (Kurniabudi)
Statistically, anomalies are observations suspected of being partially or entirely irrelevant because they are not generated by the assumed stochastic model [6].
d) Information theory: this method analyzes information content using information-theoretic measures such as Kolmogorov complexity, entropy and relative entropy to describe dataset characteristics [17], and involves the quantification of information [23]. It uses one of the following measures: entropy, conditional entropy, relative entropy, relative conditional entropy or information gain [27]. Researchers in [53] propose the Method of Entropy Spaces (MES), which is useful for detecting anomalous traffic.
Evaluated in a real scenario, the proposed method achieved good performance in detecting anomalies. Meanwhile, researchers in [45] use information entropy in an anomaly detection mechanism for a mobile payment application. The proposed mechanism can improve system stability and reduce false alarms.
e) Nearest neighbor: this method measures the similarity or distance between data instances in order to differentiate them [17]. Analyzing a data sample with respect to its nearest neighbors is one of the most common approaches in the data mining and machine learning community [1]. A suitable nearest neighbor algorithm for anomaly detection is k-nearest neighbor (k-NN), which finds the nearest neighbors of a record using a suitable distance metric such as Euclidean or Mahalanobis distance [22]. Fawzy et al. [2] propose an outlier detection approach based on nearest neighbors. The experimental results show that the method achieves a high accuracy rate in identifying outliers.
Chorppath et al. [54] compare three machine learning techniques: SVM, Naive Bayes and k-NN. Performance measurements show that k-NN has the lowest true positive rate (TPR) and the highest false positive rate (FPR) of the three.
f) Spectral decomposition: this method uses a combination of attributes that captures most of the variability in the data in order to approximate the data [17]. Spectral methods aim to find the normal modes of behavior in the data by using principal components [1]. In the early step of outlier detection, the spectral decomposition-based approach uses Principal Component Analysis (PCA) to reduce dimensionality [55]. Similarly to [55], Zolotukhin et al. [56] use PCA to reduce the dimensionality of feature vectors corresponding to web resources. PCA is the most common method for analyzing high-dimensional data [40]. Oreilly et al. [44] introduce the Minimum Volume Elliptical PCA (MVE-PCA), which shows superior performance to classic PCA. Experimental results show that the computational complexity of distributed MVE-PCA is lower than that of centralized MVE-PCA.
g) Soft computing: usually thought of as encompassing methods such as genetic algorithms, artificial neural networks, fuzzy sets, rough sets, ant colony algorithms and artificial immune systems [6]. The authors in [57] employ a Multi-Objective Genetic Algorithm (MOGA) to detect anomalies in large data sets by analyzing subspaces; in the context of high-dimensional spaces, these are regarded as subspace anomalies. The authors in [58] combine a genetic algorithm with fuzzy logic. First, the genetic algorithm generates a digital signature of the network segment using flow analysis. Then, fuzzy logic is applied to detect anomalies in the data instances. The proposed method achieves 96.53% accuracy and a 0.56% false positive rate. In [59], a modification of the ant colony optimization metaheuristic called Ant Colony Optimization for Digital Signature (ACODS) is compared with PCA for Digital Signature (PCAD). The results for the Normalised Mean Square Error (NMSE) correlation coefficient of the two methods are similar. Soft computing not only works well for detecting anomalies but is also used for feature selection; for example, [60] uses rough set theory for feature selection.
h) Knowledge-based: in this method, network or host events are checked against predefined rules or attack patterns. The goal is to identify known attacks in a common form so that handling actual events becomes easier [6]. For example, Alipour et al. [14] build an online model to detect anomalies. This model identifies abnormal activities by monitoring n-grams of state transitions in real traffic sessions. Any state transition violation is considered an abnormal activity.
i) Digital signal processing: this method represents network traffic as signal components that can be processed independently [23]. Typically, a signal is converted from the time (or space) domain into the frequency domain, e.g., by means of a Fourier transform. There are two signal-processing-based approaches: the wavelet-based approach and the cognitive packet network (CPN)-based approach [22].
j) Combination learner: this method uses several techniques simultaneously or in combination to improve the accuracy of the anomaly detection system. Combination learners include ensemble-based, fusion-based and hybrid methods [6]. In the ensemble-based technique, multiple models are combined to classify data instances.
The same algorithm can be applied to different datasets, and/or the same dataset can be used to train different algorithms [61]. Researchers in [62] propose an ensemble of five binary classifiers to detect anomalies in wireless sensor networks. Each classifier uses a different algorithm, ranging from simple average computation to complex algorithms such as neural or Artificial Neural and Fuzzy Inference System (ANFIS) networks. The experimental results show the efficiency of the ensemble method. In [50], a heterogeneous set of local online learning classifiers is developed to automatically recognize anomalies in data without any prior knowledge. The multiple, diverse individual classifiers are then combined using an ensemble-based method. Fisher's method or the median is used to aggregate the individual classifiers, which are applied in parallel to the same data. Experimental results confirm that this ensemble method improves anomaly detection accuracy. The authors in [63] propose a complex combination of anomaly detectors with unsupervised (mean, max, rank BFS, mean rank) and supervised (SVM-perf, TopPush, RankBoost and Acc@Top) methods. All of these methods are compared with two existing anomaly detection systems, NetFlow and HTTP network anomaly detection. The experimental results show that the proposed method outperforms the prior methods in accuracy by a significant margin. Gogoi et al. [22] and Comput et al. [64] categorize anomaly detection methods as supervised and unsupervised. The following are brief descriptions of both.
a. Supervised method: requires pre-labeled data, tagged as normal or abnormal. The model is usually trained on data with normal patterns and then tries to detect attacks by checking conformity to the normal pattern. This method can detect known attacks [22], uses prior knowledge to build a normal profile [64] and generally needs labeled data [65].
b. Unsupervised method: does not need a pre-labeled data set and can detect unknown attacks [22] without prior knowledge of the data [64]; however, it uses some measurement criteria to identify outliers [1]. In the unsupervised (or clustering) method, data points that are separated from the normal data are considered anomalies [66].
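As an illustration of the nearest neighbor method in item e) above, used in an unsupervised way, the following minimal sketch scores each record by its mean Euclidean distance to its k nearest neighbors. This is one common variant and not necessarily the scheme used in [2] or [22]; the function name knn_outlier_scores, the choice k = 2 and the sample records are assumptions made for the example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical scaled two-feature records; the last one lies far from the rest.
records = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (5.0, 5.0)]
scores = knn_outlier_scores(records, k=2)

# Index of the record with the largest score, i.e. the most isolated one.
most_anomalous = max(range(len(records)), key=scores.__getitem__)
```

No labels are needed: the isolated record simply receives a much larger score than the records that sit close together.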
Besides supervised and unsupervised methods, an anomaly detection method that relies on pre-defined data can be classed as a semi-supervised method [1], [6]. In the semi-supervised method, a large amount of unlabeled data is used together with pre-labeled data to build better classifiers [9]. The semi-supervised method assumes that the training data has labeled instances only for the normal class; labels for the anomaly class are not required. This makes it much easier to use than the supervised approach [6], [67]. An example is the semi-supervised statistical method proposed in [68]. Compared with the Naive Bayes method, the proposed method achieves better detection capabilities.
Networks with a variety of applications and equipment generate huge amounts of data, both in number and type. This is related to data dimensions. It is generally known that dimensionality is one of the problems in anomaly detection [17], [57]. Table 3 compares the methods used by researchers to solve problems in anomaly detection. The many methods used in anomaly detection studies have different pros and cons. To the best knowledge of the authors of this survey, the most popular issues in anomaly detection include, but are not limited to, detection capabilities such as detection rate and false alarms. Other issues related to detection capability are dimensionality reduction and computational complexity. Some researchers are concerned with computational time and scalability. This survey concludes that some methods have achieved high detection performance but at a cost, such as high false alarms, computational complexity, computational time or limited scalability. To overcome these problems, some researchers have proposed methods for dimensionality reduction, such as Juvonen et al. [72], Erfani et al. [52] and Wei et al. [73], whereas Zhang et al. [74] propose an algorithm with efficient computational time, and Sommer & Paxson use machine learning techniques to improve the accuracy of network intrusion detection [75].
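The entropy measure mentioned under the information theory method in item d) of this sub-section can be sketched as follows. This is a generic Shannon entropy computation, not the Method of Entropy Spaces of [53]; the function name shannon_entropy and the port values in both windows are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports seen in two traffic windows of equal size.
normal_window = [80, 443, 80, 53, 443, 80, 22, 443]
scan_window = [21, 22, 23, 25, 53, 80, 110, 443]  # every packet hits a different port

h_normal = shannon_entropy(normal_window)
h_scan = shannon_entropy(scan_window)
```

In this toy case the scan-like window has the maximum entropy possible for eight packets (3 bits), while the normal window is lower because a few ports dominate. A detector built on this idea would compare each window's entropy against a baseline learned from normal traffic.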

Types of Anomaly
Researchers define anomaly types in relation to the characteristics and area of the network. The definition depends on the complexity of the traffic source. For example, researchers in [22] define the types of anomaly, based on [17], as point anomaly, contextual anomaly and collective anomaly. Researchers in [27] map the types of anomaly to types of attack. In [9], the researchers perform behavioral anomaly detection in a smart assisted living environment. The authors separate point anomalies into three types: spatial anomaly, timing anomaly and duration anomaly. They also detect contextual anomalies, which are defined as sequence anomalies. Researchers in [1] categorize the sources of outliers into two types: errors and events.
The authors of this article compare the types of anomaly detected by the existing works with the network category in which anomaly detection is performed, as presented in Table 4. The authors find that in smart networks, large-scale networks and WSNs, anomaly detection is performed to recognize collective anomalies, usually in the form of a DoS attack.
Table 4. Comparison of types of anomalies and network category
[33], others recognize the control and data planes [12] or wireless sensor networks [46]. Some researchers even recognize both attacks and behavior [42], [32]. Overall, this survey confirms that collective anomaly is the most popular type in anomaly detection research and the one most researchers have addressed. Thus, the authors of this paper map out the types and sources of anomalies used by researchers in the existing works, as shown in Figure 2.
A score is a value that combines (i) distance or deviation with reference to a set of profiles or signatures, (ii) the influence of the majority in its neighborhood, and (iii) the distinct dominance of the relevant subspace [6]. Labeling techniques usually depend on (i) the size of the groups generated by an unsupervised technique, (ii) the compactness of the group(s), (iii) majority voting based on the outputs given by multiple indices, or (iv) the distinct dominance of the subset of features [6].
In wireless networks, by contrast, anomalies are identified by a scalar or by an outlier score [1], as follows.
a) The scalar technique uses a zero-one classification measure, which assigns each data measurement to either the normal class or the outlier class.
b) The outlier score technique gives each measurement a score. The score is based on the measurement level.
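The difference between the two outputs above can be sketched in a few lines. The scores, the 0.5 threshold and the helper name to_label are hypothetical and serve only to show how a zero-one label and a score-based ranking are derived from the same measurements.

```python
def to_label(score, threshold):
    """Zero-one (scalar) output: 1 = outlier class, 0 = normal class."""
    return 1 if score > threshold else 0

# Hypothetical outlier scores for five sensor readings (higher = more unusual).
scores = [0.05, 0.10, 0.92, 0.08, 0.40]

# Scalar output: each reading is forced into one of two classes.
labels = [to_label(s, threshold=0.5) for s in scores]

# Score output: readings can be ranked from most to least suspicious.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The label view hides that the fifth reading (score 0.40) is considerably more unusual than the other normal readings, whereas the ranking keeps that information.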

Data Repositories
Anomaly detection is an important part of data analysis and is useful for recognizing network intrusions. For the analysis to work properly, it must be supported by reliable data. In studies of anomaly detection on networks, the type of traffic data used varies. The more complex the dataset, the greater the challenges faced by the techniques used [17]. Researchers in [11] use a dataset from the Los Angeles Network Data Exchange and Repository (LANDER), researchers in [76] use a dataset from the USA Army Research Laboratory (ARL), researchers in [35] use datasets from the Intel Berkeley Research Lab, researchers in [59] use the Abilene network dataset and researchers in [33] use the KDD-99 dataset. Other researchers use topologies designed to meet their research needs, for example a topology of 2 domains, 28 nodes and 55 bidirectional links, each providing 2.5 Gbps of bandwidth [30]. The authors in [70] build a topology that represents a minimalist smart home.
Other researchers use data captured from traffic in a specific network. For example, the researchers in [14] capture wireless traffic from the ECE department at the University of Arizona, and the researchers in [12] use traffic data captured from the King Saud University network infrastructure. The authors of this article observe that three types of data are usually used in network anomaly detection research: first, data captured directly from a real network; second, publicly available datasets; and third, data captured from topologies specifically designed for testing, often called testbed topologies. Figure 3 plots the data presented in Table 5, showing that 52% of researchers use publicly available datasets as traffic data for analysis, 35% use a testbed topology and 13% use data captured directly from the network.
Table 5. Comparison of data source versus evaluation method used by researchers

Data Type
The main aspect of anomaly detection research is the type of input data. Input data can be a set of attributes (also known as variables, characteristics, features, fields or dimensions). The attributes can be of different types, such as binary, categorical or continuous [17], [22]. The type of input data determines which detection methods can be used to analyze it. Each data instance may consist of only one attribute (univariate) or multiple attributes (multivariate) [6]. Techniques for detecting outliers in sensor data usually consider the following two aspects [1]: 1) Attributes: an outlier in univariate data can be easily detected if its single attribute is anomalous with respect to that attribute in other data. A sensor node may be equipped with multiple sensors, and certain correlations may exist among the attributes of its sensor data; in this case, an outlier detection method for WSNs should be able to analyze multivariate data. 2) Correlations: these define dependencies, namely (i) dependencies among the attributes of a sensor node and (ii) the dependency of sensor node readings on historical and neighboring node readings.
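To make the point about correlated attributes concrete, the sketch below uses the Mahalanobis distance, which is named as a distance metric in [22], on two-attribute data. The readings, the function name mahalanobis_2d and the interpretation of the attributes are hypothetical. The second test point has both attributes inside the ranges seen in the sample, so a per-attribute check would pass it, yet it breaks the correlation between the attributes.

```python
from statistics import mean

def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the two-attribute sample `data`."""
    xs, ys = zip(*data)
    mx, my = mean(xs), mean(ys)
    n = len(data)
    # Sample variances and covariance of the two attributes.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T S^-1 d, with the 2x2 inverse written out explicitly.
    return ((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det) ** 0.5

# Hypothetical readings from one sensor node: the two attributes rise together.
readings = [(20, 41), (21, 42), (22, 45), (23, 45), (24, 49), (25, 50)]

d_consistent = mahalanobis_2d((22, 44), readings)  # follows the correlation
d_breaking = mahalanobis_2d((25, 41), readings)    # each value in range, pair is not
```

The second point ends up far from the sample in Mahalanobis terms even though neither of its attributes is unusual on its own, which is why multivariate analysis is needed for such data.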

Outlier/Anomaly Identity
Outlier detection methods generally do not distinguish between errors and events and tend to regard every outlier as an error. As a result, important information hidden in an event is lost. Thus, identifying the outlier source and distinguishing between errors, events and malicious attacks is one of the challenges in detecting outliers in WSNs. This survey concludes that errors and events are types of anomaly but are also considered sources of anomaly. As shown in Table 4, researchers identify errors [45], [70], [76] and events [43], [47], [34]. The research in [76] identifies sensor errors in WSNs, while the researchers in [5] identify network errors or failures in large-scale networks by evaluating traffic flows. On the other hand, researchers in [40] identify anomalies in a data stream by simulating abnormal events such as box removal and replacement, rotation and flipping, whereas researchers in [34] identify suspicious activities in real time by evaluating event sessions in a data stream. Lastly, researchers in [43] detect anomalous events and leverage Q-statistic event correlation analysis in large-scale networks.

Evaluation Method
Evaluation and validation are among the important stages of every study, and researchers use different approaches to carry them out. Some researchers use experiments to evaluate the proposed works: for example, experiments to evaluate an anomaly detection system on a smart city infrastructure network [77], experiments to verify a framework [9], experiments to analyze perfSONAR performance in detecting occurrences, and experiments to calculate the accuracy on normal and abnormal data points [38]. Others use a testbed, for example to test Joint Sparse PCA algorithms [40] or to monitor traffic and test detection system performance [14]. Still others use simulation, for example to evaluate the ability of PAD to detect malicious traffic [30], to evaluate a Hidden Markov Model (HMM) for detecting SSH brute-force attacks [37], to validate the integrated ADS method [31], to simulate Traffic Matrix estimation and anomaly detection [32], and to test and validate the sliding mode method on real data traffic [7]. Figure 4 illustrates the statistics on the evaluation and validation methods used in network anomaly detection. It shows that 58% of the researchers use the experimental approach, 26% utilize simulation, 10% use experiments and 6% use other approaches.

Open Issues and Research Challenges
It is generally known that the main issues in network anomaly detection include detection capability [71], [78], [79], high dimensionality of data [53], [57], [80], and computational complexity and computational time [72], [81], [82]. Detection capability relates to the detection rate [14], [52], [66] and false alarms [83]. As mentioned in [1], the challenge in traditional outlier detection is to achieve a high detection rate and low false alarms at the same time. Much research has been carried out to build anomaly detection systems with a high detection rate [28]. However, more issues arise with rapid network development. More complex networks, such as the Internet of Things (IoT), produce heterogeneous and huge volumes of data, and this complexity becomes a challenge for anomaly detection. In the IoT, many sensors and devices with different protocols are interconnected and produce data streams, resulting in high-dimensional data. Dimensionality is related to the size of the data traffic. The heterogeneity of traffic becomes a challenge in data analysis. Data captured from an IoT network must be extracted with specific techniques to become readable information. Since many protocols contribute to the data stream, specific methods are needed to read these different kinds of data, which adds to the challenge of extracting data. On the other hand, analyzing huge volumes of data requires highly capable and intelligent algorithms, which in turn results in computational complexity. Researchers must consider how to select the significant and important features from the extracted features, which is called dimensionality reduction. This is challenging because it is not known in advance which features are relevant to detecting anomalous traffic and attacks that are not yet known. The outcome of this survey shows that most detection is successfully done off-line. Thus, building real-time network anomaly detection remains a challenge.
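The trade-off between detection rate and false alarms discussed above can be made concrete with a short calculation. The sketch assumes the usual definitions, detection rate as TP / (TP + FN) and false alarm rate as FP / (FP + TN); individual surveyed papers may define their metrics differently. The label vectors and the function name detection_metrics are hypothetical.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection rate, false alarm rate) for labels 1 = anomaly, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # anomalies missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal flagged as anomaly
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal passed as normal
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_alarm_rate

# Hypothetical ground truth and detector output for ten traffic records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
dr, far = detection_metrics(y_true, y_pred)
```

Here three of the four anomalies are detected (detection rate 0.75) and one of the six normal records is wrongly flagged (false alarm rate about 0.17). Lowering a detector's threshold typically raises both numbers, which is why achieving a high detection rate and low false alarms together is hard.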

CONCLUSION
In this survey study, the authors have reviewed articles on network anomaly detection collected from IEEE Xplore and ScienceDirect. It is generally known that the anomaly detection research field is very wide and dynamic. The survey summarized current anomaly detection research trends, which focus on models, methods, schemes and algorithms for creating a reliable anomaly detection system. The study found that current network anomaly detection has been carried out in the following network categories: smart networks, large-scale networks, wireless sensor networks, mobile networks and conventional networks, which include computer networks and Transmission Control Protocol (TCP) networks. The study concluded that the most popular issues in anomaly detection include the high dimensionality of data, detection capability, computational complexity and computational time. Although researchers use different terminology to measure performance, the goal is the same: to build a reliable anomaly detection system. Network anomaly detection must achieve high performance, with a high detection rate and low false alarms at the same time. Further, modern network anomaly detection should be capable of real-time detection and automatic profile updates. The survey showed that 52% of researchers use publicly available benchmark datasets as traffic data and that 58% of researchers use an experimental approach to evaluate or validate the proposed works. Taking into account current research trends and network developments, future research is still highly likely to address anomalies in large-scale networks, which generate a variety of traffic types, and real-time observation.

Figure 2. Mapping types of anomaly

Figure 3. Data used in anomaly detection research

Figure 4. Evaluation and validation methods used in network anomaly detection research

Table 1. Comparison of our survey with existing surveys

Table 2 presents the statistics on anomaly detection works that have been carried out in each network category.
Table 2. Number of articles by network category

Table 3. Comparison of methods, types of anomaly and attacks
Efficient, accurate and scalable anomaly detection. Can be implemented in large-scale and high-dimensional domains. Tested on sensor network datasets only, so there is no guarantee for other domains.