Malware Detection Approaches Based on Operation Codes (OpCodes) of Executable Programs: A Review

ABSTRACT

suspicious program, which results in a failure to classify it as malware; therefore, it fails to detect this malware [27].
Conversely, dynamic malware analysis extracts feature datasets from the malware after executing it, even if only for a very short period of time [1], [6], [11], [17]-[20], [28]. The settings of the executed suspicious programs vary among the approaches, and all of them aim to extract and collect suitable and optimal feature datasets, which are then used for malware classification and detection. These settings involve running time, the type of intervention with the system, testing environments, etc. The advantage of this kind of malware analysis is that it is less prone to being deceived by the advanced tactics of attackers because it is updated continuously to discover such deception [29]. Its disadvantage is that it causes a huge performance overhead [27]. In addition, it risks a partial or complete malware infection of the testing environment, whether it is a production environment or an experimental (virtualized) environment [27], [28]. Furthermore, it is challenging to mimic the proper conditions, such as a vulnerable application that is exploited by the malware. It is also unclear how long the infection needs to be active before its destructive effects can be observed [30].
This study intensively reviewed the recent approaches introduced for detecting malware based only on the operation codes (OpCodes) of executable programs, since there is a considerable need for a comparative and comprehensible analysis of their achieved results [37]. Table 1 lists the acronyms.
Table 1. The list of acronyms

Among the widely used malware feature datasets, such as API system call features, registry activity features, file activity features, process activity features, network activity features, operation code (OpCodes) features, and text features, this study selected OpCodes features. The study chose them because a review of the approaches for detecting malware based only on sample OpCodes has not been conducted before, and because OpCodes features are resistant to deception, unlike API system call and text features [38], [39], [40]. The study makes the following significant contributions:
1. To the best of our knowledge, this study is the first attempt to provide a comparison of the approaches for detecting malware based only on sample OpCodes.
2. The study examined the improvement in the malware detection ratio over the years by calculating the Pearson Correlation between the "Study Year" variable and the "Detection Ratio" variable.
3. The study investigated the significance of the variables of the approaches for detecting malware based only on sample operation codes (OpCodes) by calculating the Binary Logistic Regression, which assesses the impact of the independent variables, or predictors, on the dichotomous (binary) dependent variables of the model.

The paper is structured as follows. First, it broadly defines the malware analyses and states the main contributions of this study. Second, it identifies the criteria for collecting the relevant study materials on malware detection approaches, and reviews the collected studies according to whether the OpCodes-based detection approach uses machine learning (ML) algorithms, deep learning (DL) algorithms, or statistical techniques and information theories (STIT). Third, it discusses and evaluates the malware detection approaches based only on sample operation codes (OpCodes) by calculating descriptive statistics and the relationship between the variables of these approaches. Fourth, it summarizes the analysis of the obtained results and motivates recommendations for future research directions accordingly. Finally, it concludes the study.

LITERATURE REVIEW
The collection of the relevant study materials is critical for the literature review. In this study, the strategy for collecting the relevant literature is briefly explained in the following steps.
1. The most significant information to be collected according to the review theme was identified; the theme focuses on approaches for detecting malware based only on sample OpCodes.
2. The study defined a suitable search keyword, namely "an approach for detecting malware based on sample operation codes (OpCodes)".
As a result, it obtained 37 studies in the domain of approaches for detecting malware based on sample operation codes (OpCodes).
4. Lastly, the study conducted a preliminary review of the obtained 37 studies, as shown in Table 2, and categorized them into the following three categories to simplify the review.
1. The approach for detecting malware based only on sample operation codes (OpCodes) using machine learning (ML) algorithms.
2. The approach for detecting malware based only on sample operation codes (OpCodes) using deep learning (DL) algorithms.
3. The approach for detecting malware based only on sample operation codes (OpCodes) using statistical techniques and information theories (STIT).
Threats to validity: This study covered the studies that meet the following criteria: (1) propose approaches, methods, and techniques for malware detection, (2) utilize machine learning (ML) algorithms, deep learning (DL) algorithms, or statistical techniques and information theories (STIT), and (3) analyze malware samples based only on the operation codes (OpCodes). It did not restrict the ML, DL, and STIT models considered, unlike [46]. It identified the most significant information to be collected according to the review theme, which focuses on the approaches for detecting malware based only on sample OpCodes. It defined the suitable search keyword, namely "an approach for detecting malware based on sample operation codes (OpCodes)", and searched various research databases, such as Science Direct [41], Web of Science [42], IEEE Xplore Digital Library [43], SpringerLink [44], and Google Scholar [45], to collect peer-reviewed journal articles, book chapters, conference proceedings, and reports using the mentioned keyword. Initially, the study collected 348 documents and screened the titles and abstracts to identify suitable articles. According to the previously stated criteria, the study excluded 144 documents that violated criteria (1) and (2). Moreover, it excluded 167 documents that violated criterion (3). Finally, the full texts of the remaining 37 studies were selected for the review. The next subsections present the literature review of the acquired 37 studies, organized according to the three previously stated categories.

The approach for detecting malware based on sample operation codes (OpCodes) using machine learning (ML) algorithms
Machine learning (ML) is a subfield of artificial intelligence (AI) that allows systems to learn from experience and improve over time without being explicitly programmed to do so [21], [31], [7], [47], [48]. Machine learning (ML) comprises four learning types, namely supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [49], [14], [22], [50], [51], [52], [53]. Several of the collected studies utilized machine learning (ML) algorithms to detect malware. In this category, the approach first extracted and selected the appropriate malware features and then passed them to the ML algorithm in order to detect the malware. Broadly, the collected studies used sample OpCodes frequencies, sample N-grams OpCodes, or sample OpCodes features vectors for malware features extraction and selection, as reviewed and discussed in the next subsections.

Malware detection approach utilizes OpCodes frequencies for features extraction and selection
In this subsection, the malware detection approaches that employ OpCodes frequencies for features extraction and selection are presented. Authors in [54] presented a new approach for detecting advanced unknown malware with high accuracy. First, it analyzed OpCodes occurrences for features extraction by grouping the executables, which follow the rule: the difference between any malware sizes is within 5
Research done by [55] used an iterative approach to determine the suitable behavioral attributes in order to gain better accuracy for classifying and identifying ransomware. It collected 150 sample reports for 10 families of ransomware. Initially, the study selected 27 attributes and then selected 24 of them according to their frequencies in the dataset. After that, the iterative approach selected 20 out of 24 and then 15 out of 20 attributes based on J48 results. Lastly, it used grouping to select 12 out of 15 attributes. The study verified each attribute reduction in terms of classification accuracy to ensure that the optimum attributes were identified. It further reduced the behavioral attributes to nine, but this gave worse results, so it reverted to 12 attributes, which gave the best classification results. Finally, it applied the J48, NB, and k-NN machine learning (ML) algorithms and achieved a classification accuracy of 78% by using J48.
Research done by [50] extracted the OpCodes and then computed the frequency of occurrence of each OpCode sequence using Term Frequency (TF). After that, it defined the Weighted Term Frequency (WTF) as the result of weighting the relevance of each OpCode when calculating the term frequency. Finally, it used the LLGC algorithm for classification. It achieved an accuracy above 80% with only 10% of the instances labelled.
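The TF and WTF steps described above can be illustrated with a minimal Python sketch. It works on single OpCodes rather than OpCode sequences for brevity, and the sample disassembly, the relevance weights, and the function names are hypothetical; the cited study derives its own weights.

```python
from collections import Counter

def term_frequency(opcodes):
    """Return the relative frequency of each opcode in one sample."""
    if not opcodes:
        return {}
    counts = Counter(opcodes)
    total = len(opcodes)
    return {op: n / total for op, n in counts.items()}

def weighted_term_frequency(tf, relevance):
    """Scale each term frequency by a per-opcode relevance weight.

    Opcodes without a weight default to 1.0 (an assumption of this sketch).
    """
    return {op: freq * relevance.get(op, 1.0) for op, freq in tf.items()}

# Hypothetical disassembly of one sample.
sample = ["mov", "push", "mov", "call", "mov", "pop", "ret", "push"]
tf = term_frequency(sample)

# Hypothetical relevance weights, chosen only for illustration.
relevance = {"mov": 0.1, "call": 0.9, "push": 0.3}
wtf = weighted_term_frequency(tf, relevance)
```

Down-weighting a ubiquitous OpCode such as `mov` keeps it from dominating the feature vector, which is the intuition behind weighting the term frequency.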
A study in [8] proposed a method to detect unknown malware, which consists of four steps. First, it used PE header information to divide the samples into categories. Then, it computed TF-IDF for each OpCodes sequence in order to choose the top-K OpCodes to construct an adjacency matrix, and after that, it applied the Power Iteration algorithm for feature selection. Finally, it trained learning models such as kNN and BP to detect unknown malware. The highest detection accuracy obtained by the proposed method is 98.57%, which was achieved by the Adaboost algorithm.
Research conducted by [14] proposed a model that used the OpCode Extract and Count (OPEC) algorithm for feature selection, and then applied supervised learning algorithms to detect malware. The model acquired a detection accuracy of 98.7%.
Research introduced by [18] investigated the optimal set of OpCodes that strongly indicates malware. It extracted the OpCodes as OpCode density histograms and then used the algorithm for both features selection and malware classification, and achieved a detection accuracy of 83.41%.

Malware detection approach utilizes N-grams OpCodes for features extraction and selection
This subsection elaborates on the malware detection approaches that harness N-grams OpCodes for features extraction and selection. Authors in [30] proposed a classification framework to detect unknown malware. First, it extracted the 1000 OpCodes patterns with the largest DF values as features. Then, it applied different methods, namely DF, GR, and FS, for feature selection. After that, it selected the top 50, 100, 200, and 300 features according to each feature selection method, which measures the correlation between an OpCode n-gram feature and the malware class. Finally, it applied and evaluated eight machine learning (ML) classifiers, namely SVM, LR, RF, ANN, DT, NB, BTD, and BNB. It attained an accuracy of more than 96%, which is better than previous studies that utilize Byte n-gram patterns.
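As a rough illustration of n-gram extraction followed by DF-based selection, the following Python sketch extracts OpCode n-grams and keeps those that occur in the most samples. The function names and the toy samples are hypothetical and are not taken from [30].

```python
from collections import Counter

def opcode_ngrams(opcodes, n):
    """Return the list of contiguous opcode n-grams as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def top_by_document_frequency(samples, n, k):
    """Keep the k n-grams that occur in the largest number of samples."""
    df = Counter()
    for opcodes in samples:
        # A set ensures each sample contributes at most once per n-gram.
        df.update(set(opcode_ngrams(opcodes, n)))
    return [gram for gram, _ in df.most_common(k)]

# Hypothetical opcode sequences of three samples.
samples = [
    ["push", "mov", "call", "ret"],
    ["push", "mov", "xor", "ret"],
    ["mov", "call", "ret"],
]
features = top_by_document_frequency(samples, n=2, k=3)
```

In this toy data, three 2-grams appear in two samples each and are therefore selected; the remaining 2-grams appear only once.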
A study in [56] proposed a multiple-feature method for detecting malware based on pooled sequences of OpCodes N-grams with multiple values of N and on the multiscale grey image texture of malware. First, the method extracted combined OpCode N-gram sequences for multiple values of N and selected features from them based on Information Gain (IG). Meanwhile, it transformed sample files into grey images, generated multiscale images by using a Gaussian pyramid, and extracted features by using GLCM. Finally, it applied k-NN and RF classifiers in order to detect malware. It achieved a detection accuracy of 98.85%.
Research in [27] proposed a model for malware detection using an ensemble approach. It generated multiple feature datasets from various sizes of n-grams of OpCodes sequences to train one classifier, namely SVM, RF, or k-NN. First, it extracted n-grams of sizes 1 to 4, and then it vectorized them by using TF-IDF. After that, it leveraged the Information Gain (IG) to select the 1000 most informative features. Finally, it applied a particular classifier, namely SVM, RF, or k-NN, to train on the multiple feature sets of n_i-gram and n_j-gram OpCode sequences, and subsequently weighted and averaged the predictions using weight values and the argmax() function in order to predict the final class, benign or malware. It obtained a best classification accuracy of 98.1%.
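The weighting, averaging, and argmax step can be sketched in Python as follows. This is a minimal sketch under the assumption that each per-n-gram classifier outputs class probabilities; the probabilities, weights, and function name are hypothetical and do not reproduce the implementation in [27].

```python
def weighted_ensemble(prob_sets, weights):
    """Average per-class probabilities from several models, then take argmax.

    prob_sets: one [p_benign, p_malware] list per model.
    weights:   one non-negative weight per model.
    """
    total = sum(weights)
    n_classes = len(prob_sets[0])
    averaged = [
        sum(w * probs[c] for probs, w in zip(prob_sets, weights)) / total
        for c in range(n_classes)
    ]
    # argmax: index of the class with the highest averaged probability.
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged

# Hypothetical outputs of classifiers trained on 1-gram and 2-gram features.
p_1gram = [0.60, 0.40]
p_2gram = [0.20, 0.80]
label, averaged = weighted_ensemble([p_1gram, p_2gram], weights=[0.5, 0.5])
```

Here the 2-gram model is more confident than the 1-gram model, so the averaged probabilities favor class 1 (malware) even though the two models disagree.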
Research introduced by [57] proposed an early malware detection framework that consists of three stages. The first stage is an evasive behavioral data collection stage, which collected a representative dataset according to a pre-identified list of evasive techniques for malware. The second stage extracted features based on n-gram and TF-IDF techniques and calculated correlation values between the API user mode and the kernel system calls mode in order to select the most representative features. Finally, the third stage applied an ensemble model based on the Random Forest (RF) machine learning (ML) algorithm to the extracted and selected features to detect malware.
Research in [58] presented a method for malware detection based on subgraph isomorphism using blocks of OpCodes. The method first analyzed the frequencies of n-grams OpCodes to detect singular code blocks through TF-IDF, and then it used machine learning (ML) algorithms such as RF, XGBoost, DT, SVM, and KNN for learning. Finally, OpCodes sequences were transformed into a Control Flow Graph (CFG) in order to feed a database of CFGs characteristic of malware, which is used for comparing the semantics and construction of known and unknown malware in order to detect and classify it. The RF algorithm achieved the best F1 score: 0.923 for 1-grams and 0.796 for 9-grams.
A study in [59] introduced a detection mechanism based on OpCodes sequence features. First, it collected all possible k-grams for feature extraction and then applied the Information Gain (IG) selection algorithm in order to select the top representative features. Finally, it created a model and classified the unknown malware using the SVM algorithm. It achieved a malware detection accuracy of 96.83%.
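Information Gain, which several of the reviewed studies use for feature selection, measures how much knowing a feature reduces the entropy of the class label. The following Python sketch computes it for a binary "n-gram present" feature; the toy labels and function names are hypothetical.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result

def information_gain(present, labels):
    """Entropy reduction obtained by splitting samples on feature presence."""
    with_f = [y for x, y in zip(present, labels) if x]
    without_f = [y for x, y in zip(present, labels) if not x]
    n = len(labels)
    conditional = (
        (len(with_f) / n) * entropy(with_f)
        + (len(without_f) / n) * entropy(without_f)
    )
    return entropy(labels) - conditional

# Hypothetical labels: 1 = malware, 0 = benign.
labels = [1, 1, 0, 0]
perfect = information_gain([True, True, False, False], labels)
useless = information_gain([True, False, True, False], labels)
```

A feature that separates the classes perfectly gains the full one bit of label entropy, while a feature distributed identically in both classes gains nothing; ranking features by this value and keeping the top ones is the selection step.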
Research by [60] proposed a method for detecting malware that is based on the Control Flow Graph (CFG) in order to extract OpCodes behaviors. It converted a CFG into a tree to form an execution tree, and the trees were concatenated to represent a long execution path. Then, it used n-grams with IG and DF to select OpCode-based features. Finally, it employed KNN, DT, and SVM to classify executables. The best accuracy achieved is 93.2% for CFG-DT.
Research in [17] proposed a new scheme for dynamic OpCode acquisition through the QEMU binary translation mechanism. The OpCodes information is obtained at software runtime and is used for offline analysis. The scheme used a variety of feature selection algorithms, namely CFS, Chi-square, IG, Symmetrical, and N-gram algorithms, to extract features from the OpCode information collected while the software is running. Then, the extracted feature subset was combined with a variety of machine learning (ML) algorithms, such as DT, SVM, Bayesian network, ensemble, and NN algorithms, to conduct cross-comparison experiments. The offline malware detection accuracy reached 99.85%. In addition, based on these results, the research proposed an online detection scheme called the CPU built-in malware monitoring model (CBMM), which accurately identified the execution trajectory of malware under the current process and monitored malware in real time.
Research accomplished by [61] designed a method that applied SVM and RF classifiers to the most frequent OpCodes n-grams in order to detect malware and its multiple families as well. The method obtained a detection accuracy of 97%.
A study in [32] proposed a new feature that performed OpCodes n-gram shingling with control statements as stopwords, while requiring a smaller feature vector and a shorter training time. The Random Forest (RF) algorithm was used for classification and achieved an accuracy of 99.11% in malware detection.
Research established by [22] proposed a new method that used only a single-class learner to detect unknown malware. The method is based on examining the frequencies of appearance of OpCodes sequences. It used TF-IDF to weigh each OpCodes n-gram sequence, suggested labelling only malware samples, and employed the Roc-SVM algorithm for malware detection. It obtained a malware detection accuracy of 85%.
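Since TF-IDF weighting of OpCode n-grams recurs throughout this subsection, a minimal Python sketch may help. It assumes raw relative term frequency and idf = log(N / df), which is only one of several common TF-IDF variants; the reviewed studies may use a different one. The documents and the function name are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Uses relative term frequency and idf = log(N / df); other variants
    (smoothing, normalization) are equally common.
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # each document counts once per term
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical opcode 2-grams, written as strings, for three samples.
docs = [
    ["push mov", "mov call", "mov call"],
    ["push mov", "xor xor"],
    ["push mov", "mov call"],
]
weights = tf_idf(docs)
```

An n-gram that appears in every sample, such as "push mov" here, receives a weight of zero, whereas an n-gram unique to one sample receives the largest weight.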
Research presented by [33] used n-gram OpCodes and then applied a data segmentation technique for feature selection. Finally, it applied ML algorithms, namely Naïve Bayes (NB), support vector machine (SVM), partial decision tree (PART), and random forest (RF). It achieved an F-measure of 98% for malware detection.
A study introduced by [31] used OpCode n-grams up to 10-grams and then selected the most important features based on IG. Finally, it applied ML algorithms, namely Naïve Bayes (NB), support vector machine (SVM), partial decision tree (PART), and random forest (RF), to classify and categorize malware. It obtained an F-measure of 98% for malware detection.
Research accomplished by [13] extracted OpCodes and converted them into a vocabulary dataset, and then applied n-grams to each word to represent a feature. After that, it used TF-IDF to measure the significance of every word in order to extract significant features. Finally, the obtained dataset was processed with CPD to gain a feature-reduced dataset, which was then evaluated using 6 DM algorithms in Weka, including Ripper (JRip), C4.5 Decision Tree (J48), Support Vector Machines (SMO), and Naive Bayes (NB). The best malware detection result is an AUC score of 0.949, which was achieved by the k-NN algorithm.
A study in [62] proposed a technique to extract the behavior of OpCodes based on the Control Flow Graph (CFG), jointly with 4-grams of the OpCodes sequence. After that, the technique used the k-NN algorithm to detect Trojan Ransomware, and it achieved a detection accuracy of 98.86% when k=1 (1-NN) and n=1 (1-gram OpCodes).
Research established by [34] obtained OpCodes and then used n-grams and TF-IDF to represent terms and sequences of disassembled instructions as vectors. Finally, it applied six classifiers, namely RF, NB, LR, kNN, Linear SVM, and XGBoost. The best F1 score, 86%, was achieved by the RF algorithm.

Malware detection approach utilizes OpCodes features vectors for features extraction and selection
The malware detection approaches that use OpCodes features vectors for features extraction and selection are presented in this subsection. Research conducted by [63] attempted to detect IoT-based malware. First, it extracted OpCodes from IoT-based devices and services and then preprocessed them through filtering, which involves normalizing, centering, and scaling. Finally, it applied three ML algorithms: RF, SVM, and k-NN. RF achieved the best accuracy at 98%, followed by SVM and k-NN, both with 91%.
Research presented by [26] proposed a malware detection method that extracted OpCodes and API calls to form a feature vector, and then applied NB and kNN classifiers in order to detect the malware. The proposed method acquired a malware detection accuracy of 95.21%.
Research done by [16] created a learning-based procedure to discriminate and classify malware in the Internet of Battlefield Things (IoBT) using OpCodes sequences. The procedure transformed the OpCodes into a vector space and then applied a technique called Deep Eigenspace learning to distinguish between malware and benign software. In addition, the procedure utilized the SVM algorithm and the n-gram algorithm for robust classification.

The approach for detecting malware based on sample operation codes (OpCodes) using deep learning (DL) algorithms
Deep learning (DL) is a subfield of machine learning (ML) that imitates the neural network (NN) structure of the human brain so that the computer can act autonomously in response to unseen events. DL aids a computer model in predicting and classifying information by filtering it through layers of data [9], [21], [64], [48]. A number of the collected studies utilized deep learning (DL) algorithms to detect malware, as discussed in the following subsections according to whether the malware features are extracted and selected based on OpCodes frequencies, N-grams OpCodes, embedding, or images.

Malware detection approach utilizes OpCodes frequencies for features extraction and selection
In this subsection, the malware detection approaches that employ OpCodes frequencies for features extraction and selection are presented. Research in [65] proposed a system for detecting malware based on a 1D-CNN. The system took a binary file as an input and then classified it as either malware or benign. Meanwhile, the researchers classified the binary file as malware or benign using the TF-IDF algorithm [66] and used this as a benchmark for comparison with the 1D-CNN classifier. The overall accuracy of the system for detecting malware is 99.2%.
Research established by [20] proposed a hybrid solution for detecting malware. It adopted OpCode sequences as static features and network traffic as dynamic features in order to detect malware. The proposed hybrid solution achieved a malware detection accuracy of 97%.

Malware detection approach utilizes N-grams OpCodes for features extraction and selection
This subsection presents the malware detection approaches that exploit N-grams OpCodes for features extraction and selection. Authors in [2] introduced a method based on a dual-branch convolutional neural network (CNN) to identify and classify malware using multiple-feature fusion, which consists of the local fine-grained features and the global structure features of the visualized malware. The proposed method converted the global structural information of the malware into a bytecode image and then extracted the OpCode semantic information of the code segment by using the n-gram feature model to produce an OpCode image. The method attained a family classification accuracy of 99.05%.
Research in [7] proposed an end-to-end model based on a 1D CNN to determine binary file maliciousness. First, the model extracted n-grams of OpCodes automatically. Then, the model was trained on multiple feature sets, e.g., 1-grams and 2-grams, and the two predictions were subsequently combined using a weighted average ensemble. The proposed model utilized a grid search over values in the range (0-1) for the optimal prediction weights. The model attained a positive prediction of 98% using equal weights of 0.5 for the ensemble of unigram and bigram OpCodes sequences.
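A grid search over the ensemble weight can be sketched in Python as follows. This is an illustrative sketch only: it assumes each model outputs a malware probability, uses a 0.5 decision threshold and a step of 0.1, and runs on hypothetical validation data; none of these details are taken from [7].

```python
def ensemble_accuracy(w, p_uni, p_bi, labels):
    """Accuracy when P(malware) = w * unigram + (1 - w) * bigram."""
    correct = 0
    for pu, pb, y in zip(p_uni, p_bi, labels):
        predicted = 1 if w * pu + (1 - w) * pb >= 0.5 else 0
        correct += predicted == y
    return correct / len(labels)

def grid_search_weight(p_uni, p_bi, labels, steps=10):
    """Try w = 0.0, 0.1, ..., 1.0 and return the best (weight, accuracy)."""
    candidates = [i / steps for i in range(steps + 1)]
    scored = [(ensemble_accuracy(w, p_uni, p_bi, labels), w) for w in candidates]
    best_acc, best_w = max(scored, key=lambda pair: pair[0])
    return best_w, best_acc

# Hypothetical validation-set malware probabilities from the two models.
p_uni = [0.9, 0.2, 0.6, 0.3]
p_bi = [0.4, 0.1, 0.7, 0.8]
labels = [1, 0, 1, 1]
best_w, best_acc = grid_search_weight(p_uni, p_bi, labels)
```

In this toy data neither model alone classifies every sample correctly, but intermediate weights do, which is the situation in which a weighted ensemble pays off.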
Research conducted by [35] introduced a new classifier called SNNMAC, which is a model for classifying malware based on shallow neural networks and static analysis. First, the model extracted n-gram OpCodes sequences from a binary file using a decompiler. Then, the n-gram dataset was reduced according to the designed enhanced n-gram algorithm. Finally, the SNNMAC classifier learned from the dataset to classify the malware. The classifier attained a malware classification accuracy of 99.21%.

Malware detection approach utilizes OpCodes embedding for features extraction and selection
In this subsection, the malware detection approaches that exploit OpCodes embedding for features extraction and selection are discussed. Research presented by [10] proposed a system for detecting Android malware. The proposed system extracted a raw OpCodes sequence and then performed training using a pipeline technique; thus, it eliminated the need to enumerate a large number of n-gram sequences and to engineer malware features manually. Therefore, it yielded better performance than n-gram-based systems, but a lower malware detection accuracy of 69%.
Research accomplished by [36] presented a malware detection system based on an optimized deep CNN. It passed the input through the embedding layer and then used the k-max pooling method to detect the malware. It achieved a malware detection accuracy of 99%.
Research in [12] proposed a novel approach that modeled malware as a language in order to detect it. It collected OpCodes by using the IDA Pro software, then used a word embedding technique to build the feature vector, and finally applied a two-stage LSTM. It reached an average AUC of 98.7% for malware classification.
Research in [67] introduced a system for detecting malware based on an optimized deep neural network. The pipeline of the proposed detection system comprised three consecutive layers, namely the embedding layer, the convolutional layer, and the k-max pooling layer. The proposed system extracted OpCodes sequences from a binary file and fed them to the optimized deep neural network. It demonstrated a malware detection accuracy of 99%.
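For readers unfamiliar with k-max pooling, the following Python sketch shows the operation on a single feature map: it keeps the k largest activations while preserving their original order, so that a variable-length OpCode sequence yields a fixed-length output. The activations and the function name are hypothetical, and the sketch is not taken from the cited systems.

```python
def k_max_pooling(values, k):
    """Keep the k largest values, preserving their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest entries (ties resolved by earliest position).
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in sorted(top)]

# Hypothetical activations of one convolutional filter over an opcode sequence.
feature_map = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
pooled = k_max_pooling(feature_map, k=3)
```

Unlike ordinary max pooling, which keeps a single value, the result here retains the three strongest activations (0.9, 0.7, 0.8) in sequence order.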
Research introduced by [68] presented a method based on stacked LSTM to circumvent the time-consuming drawback of random weight initialization for neural networks (NN). The proposed method used six distinct malware datasets to extract various malware feature datasets, such as OpCodes, Bytecodes, and API System Calls. The method incorporated a model with four hidden layers; the first three are pre-trained layers, while the fourth is a dense layer that acts as a classifier. The suggested method entailed two phases: unsupervised pre-training on the training data to determine the initial weights, and supervised fine-tuning of the network to distinguish between malware and benign samples. The extracted feature datasets were converted into embedding vectors for OpCodes and System Calls and into one-hot vectors for Bytecodes, and were then passed to the model for classification in order to detect malware. The method achieved an IoT malware detection accuracy of 99.1%.

Malware detection approach utilizes images for features extraction and selection
This subsection discusses the malware detection approaches that exploit images for features extraction and selection. Research in [21] proposed a method called MalNet, which learned features automatically from raw data. It generated grayscale images and OpCodes sequences to be used by CNN and LSTM networks, respectively, and used a stacking ensemble for malware classification. The proposed method achieved a malware detection accuracy of 99.36%.
Research conducted by [69] utilized an image similarity technique based on the CNN approach to detect malware. It converted the executable (EXE) files into images and then applied a CNN for classification. Subsequently, it converted the executable (EXE) files to OpCodes, then to images, and again applied a CNN for classification. Finally, it compared the two classifications. It achieved a malware detection accuracy of 97.6%.
Research established by [9] presented a new approach based on deep learning and the function call graph (FCG) in order to detect and classify malware. First, it produced OpCodes based on the FCG and then transformed them into vectors. Finally, it applied the Long Short-Term Memory (LSTM) algorithm for malware classification. It attained a malware detection accuracy of 97%.

The approach for detecting malware based on sample operation codes (OpCodes) using statistical techniques and information theories (STIT).
This subsection elaborates on the approaches for detecting malware based on sample operation codes (OpCodes) using statistical techniques and information theories (STIT). Mutual information (MI) is a metric used in probability and information theory to quantify the degree to which one variable can be inferred from another. Research in [24] proposed a new method based on the frequency of appearance of OpCodes sequences to detect variants of malware through the Mutual Information measure I(X; Y). The method achieved variant family similarity detection. It attained a malware detection accuracy of 90%.
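The mutual information of two discrete variables is the sum over all value pairs (x, y) of p(x, y) log[p(x, y) / (p(x) p(y))]. The Python sketch below estimates it from paired observations; the data and the function name are hypothetical and only illustrate the measure, not the method of [24].

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits, estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical data: presence of an opcode sequence vs. malware family label.
present = [1, 1, 0, 0]
family = ["A", "A", "B", "B"]
dependent = mutual_information(present, family)         # X determines Y
independent = mutual_information([1, 0, 1, 0], family)  # X says nothing about Y
```

When the presence of the sequence fully determines the family, the MI equals the entropy of the family label (one bit here); when the two are independent, it is zero.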

EVALUATION AND DISCUSSION OF THE MALWARE DETECTION APPROACHES BASED ON SAMPLE OPERATION CODES (OPCODES)
This section evaluates, analyzes, and discusses the results that the authors of the malware detection approaches reported to evaluate their performance. First, it presents the descriptive statistics on the approaches for detecting malware based only on sample operation codes (OpCodes). Then, it explains the relationship between the variables of these approaches.

Descriptive statistics on the approaches for detecting malware based on sample operation codes (OpCodes)
As shown in Table 2, 25 of the 37 collected studies on approaches for detecting malware based only on OpCodes used machine learning (ML) algorithms, which represented 67.57% of all studies. Therefore, this category formed the majority. Besides, 11 of the 37 collected studies used deep learning (DL) algorithms, which represented 29.73% of all studies, and this category came second. Lastly, 1 of the 37 collected studies utilized statistical techniques and information theories (STIT), which represented 2.70% of all studies.
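These shares follow directly from the study counts, as the short Python check below shows (the variable names are illustrative).

```python
# Study counts per category, as reported in the text.
counts = {"ML": 25, "DL": 11, "STIT": 1}
total = sum(counts.values())

# Share of each category as a percentage, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
```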
After an extensive literature review, this study found that the approaches for detecting malware based on OpCodes that used machine learning (ML) algorithms ranked first due to their simple construction, easy implementation, fast computation speed, and low calculation overheads. On the other hand, they did not support an end-to-end malware detection process, which forced the malware analyst to conduct some steps of the detection process manually. In addition, the approaches that utilized deep learning (DL) algorithms ranked second due to their implementation complexity, low computation speed, and huge calculation overheads, although they support an end-to-end malware detection process. Therefore, the latter approaches outperformed the former in terms of full end-to-end automation of the malware detection process. The approaches that used statistical techniques and information theories (STIT) ranked last since they did not provide any sort of intelligence [53].
As presented in Table 3, the approaches that utilized OpCodes frequencies for features extraction and selection represented 24% of the collected studies that used machine learning (ML) for malware detection. Besides, the approaches that employed N-grams OpCodes for features extraction and selection represented 64% of these studies. Lastly, the approaches that used OpCodes features vectors for features extraction and selection accounted for 12% of these studies. Figure 1 shows the percentage of each one. Likewise, as displayed in Table 5, the approaches that took advantage of mutual information (MI) for features extraction and selection represented 100% of the collected studies that used statistical techniques and information theories (STIT), as presented in Figure 3.

Table 5. OpCodes features extraction and selection in malware detection approaches based on STIT

                  Mutual information (MI)   Total
No. of studies    1                         1
The percentage    100                       100

Figure 3. OpCodes features extraction and selection in malware detection approaches based on STIT

Finally, Figure 4 illustrates the average detection ratio of the approaches for detecting malware, which is calculated by dividing the sum of the detection ratios of all approaches by the number of approaches in Table 2. It equaled 86.12% for the collected studies that use machine learning (ML), 95.74% for the collected studies that employ deep learning (DL), and 90% for the collected studies that exploit statistical techniques and information theories (STIT).

The relationship between the variables of the approaches for detecting malware based on sample operation codes (OpCodes)
First, the Pearson Correlation is calculated to measure the strength of the linear relationship between the Study Year variable and the Detection Ratio variable. The Study Year is the independent variable, while the Detection Ratio is the dependent variable, and their values are presented in Table 2. The Pearson Correlation between the two variables is calculated according to equation (1), and it equaled 0.370, which indicates a low positive correlation. This result indicates that the detection ratio tends to rise as the study year advances, which means that the detection ratio of the approaches for detecting malware only based on sample operation codes (OpCodes) has improved over the years. Besides, the p-value equaled 0.029, which is less than 0.05 and indicates that the Pearson Correlation was statistically significant.

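$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (1) $$

where $x_i$ and $y_i$ are the Study Year and the Detection Ratio of the $i$-th approach, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of approaches.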
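As a minimal sketch of how the Pearson Correlation between the Study Year and the Detection Ratio can be computed, the following Python code implements the standard sample formula. The function name `pearson_r` and the six data pairs are hypothetical and chosen for illustration only; they are not the Table 2 values, so the printed coefficient will differ from the reported 0.370.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of the deviations from the means (numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square root in the denominator).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (study year, detection ratio %) pairs; NOT the Table 2 values.
study_years = [2008, 2011, 2013, 2016, 2018, 2020]
detection_ratios = [85.0, 92.5, 88.0, 90.0, 97.0, 95.5]

r = pearson_r(study_years, detection_ratios)
print(f"r = {r:.3f}")
```

A coefficient close to +1 or -1 indicates a strong linear relationship, whereas a value such as the reported 0.370 indicates a low positive one.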
Second, the Binary Logistic Regression model is used to assess the impact of the independent variables, or predictors, on the binary dependent variables, that is, outcomes that take only two values, 0 or 1. As shown in Table 2, the Study Year and Detection Ratio are the independent variables, or predictors, while ML, DL, STIT, Dynamic, Static, Hybrid, Automatic, and Manual are the dichotomous (binary) dependent variables.
As presented in Table 6, the overall Binary Logistic Regression model was statistically significant for the DL, STIT, and Dynamic dichotomous (binary) dependent variables, since their p-values in the "Model Sig." column are less than 0.05. The other five dichotomous (binary) dependent variables, namely ML, Static, Hybrid, Automatic, and Manual, had p-values greater than 0.05 and were not significant. In addition, the Binary Logistic Regression model correctly classified 64.9%, 100%, and 86.5% of the cases of the DL, STIT, and Dynamic dichotomous dependent variables, respectively, as shown in the Accuracy column. Besides, the statistical significance of each predictor, namely the Study Year and Detection Ratio, is given in the "Indept. Var. Sig." column, which shows that only the Study Year added statistical significance to the model, since its p-value is less than 0.05, while all the others, with p-values greater than 0.05, did not. Finally, the odds of using deep learning (DL) algorithms in the approaches for detecting malware based on sample operation codes (OpCodes) were 1.427 times greater as the study year advanced, as shown in the Exp(B) column. This result suggests that adopting improved deep learning (DL) algorithms over the years in the approaches for detecting malware based on sample operation codes (OpCodes) yielded a more accurate detection ratio for malware.
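To make the meaning of the Exp(B) value concrete, the following Python sketch shows how an odds ratio of 1.427 relates to the regression coefficient B and to a change in probability. It assumes that the Study Year predictor is coded in whole years, so that Exp(B) is the multiplicative change in the odds per one-year increase; the helper names and the starting probability of 0.20 are hypothetical and used only for illustration.

```python
import math

# Exp(B) reported in the text for the Study Year predictor with DL as the outcome.
EXP_B_STUDY_YEAR = 1.427

# The logistic regression coefficient B is the natural logarithm of Exp(B).
b_study_year = math.log(EXP_B_STUDY_YEAR)


def odds_multiplier(years, exp_b=EXP_B_STUDY_YEAR):
    """Factor by which the odds change after `years` one-unit predictor steps."""
    return exp_b ** years


def shift_probability(p, years, exp_b=EXP_B_STUDY_YEAR):
    """Apply the odds multiplier to a starting probability p (0 < p < 1)."""
    odds = p / (1.0 - p) * odds_multiplier(years, exp_b)
    return odds / (1.0 + odds)


print(f"B = {b_study_year:.4f}")
for years in (1, 3, 5):
    print(f"after {years} year(s): odds x {odds_multiplier(years):.3f}")

# Hypothetical starting probability of 0.20, assumed only for illustration.
print(f"p after one year: {shift_probability(0.20, 1):.3f}")
```

Note that the odds ratio multiplies the odds, not the probability itself, so a value of 1.427 does not mean that the probability of using DL rises by 42.7% per year.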

RECOMMENDATIONS AND FUTURE DIRECTIONS
This study conducted a comprehensive review of the approaches for detecting malware only based on sample operation codes (OpCodes) and drew useful insights about them. As mentioned earlier, this study focused on the malware OpCodes features and excluded other malware features, such as the API system calls features used in [5][38][39][40][70] and the text features used in [38][39][40][71][72], because of their limitations. The former can be decoyed when the evader uses his own OpCodes instructions, written from the ground up, instead of the formal API system calls. The latter were excluded because garbage text can be injected into the malware, which also evades detection. The analysis of the obtained results and the recommended future directions are summarized as follows:

1. There was a low positive correlation of 0.370 between the "Study Year" variable and the "Detection Ratio" variable, which indicates that the detection ratio rises as the study year advances; that is, the detection ratio of the approaches for detecting malware only based on sample operation codes (OpCodes) has improved over the years.
2. The odds of using improved deep learning (DL) algorithms in the approaches for detecting malware only based on sample operation codes (OpCodes) were 1.427 times greater as the study year advanced, and this adoption was associated with a more accurate detection ratio for malware. Therefore, this study recommends utilizing improved deep learning (DL) algorithms and incorporating them into the approaches for detecting malware in future works.
3. The average detection ratio of the approaches for detecting malware equaled 86.12% for the collected studies that used machine learning (ML), 95.74% for the collected studies that employed deep learning (DL), and 90% for the collected studies that exploited statistical techniques and information theories (STIT).
4. The collected studies of the approaches for detecting malware only based on OpCodes that used machine learning (ML) algorithms accounted for 67.57% of the overall studies; therefore, this category formed the majority.
12. The most widespread approaches for detecting malware used machine learning (ML) algorithms, owing to their simple construction, easy implementation, cost-effective performance, and rapid computation. In contrast, most of them extracted malware feature datasets manually, which had a negative impact on the overall malware classification and detection. Accordingly, this study recommends improving the approaches that use machine learning (ML) algorithms so that they extract malware feature datasets automatically, which helps to avoid human intervention and boosts malware detection.
13. Moreover, the most widespread approaches for detecting malware used machine learning (ML) algorithms that extracted and selected malware feature datasets statically, not dynamically, and therefore lacked this significant source of malware feature datasets. Therefore, this study recommends carrying out further studies to improve dynamic feature datasets extraction and selection.
14. Few of the proposed approaches for malware detection integrated machine learning (ML) algorithms and deep learning (DL) algorithms within one approach, although each has its own advantages. Hence, this study recommends bridging this gap by proposing innovative and improved approaches that utilize both machine learning (ML) algorithms and deep learning (DL) algorithms.
15. As presented in Table 2, the reported detection ratios of the reviewed studies still need to be enhanced. Therefore, this study recommends improving the malware detection ratio.
16. The open issues of the malware detection approaches based on OpCodes in the collected studies include improving detection accuracy, reducing the features vector dimension, integrating static and dynamic analysis, adopting automatic malware detection, and promoting end-to-end malware detection solutions.

CONCLUSION
Malicious software, or malware for short, poses a threat to computer systems and needs to be analyzed, detected, and eliminated. Malware analysis typically takes one of two forms: dynamic malware analysis and static malware analysis. The former collects features such as malware API calls, registry activities, file activities, process activities, and network activities in a dataset while the malware is being executed. The latter entails gathering a dataset of properties, including operation codes (OpCodes) and text, without running the malware itself. Several prior studies addressed and reviewed malware detection approaches based on numerous features, but none of them has addressed and analyzed approaches based only on malware OpCodes. As a result, the goal of this article is to review malware detection approaches only based on malware OpCodes. The review explored, demonstrated, and compared the existing approaches for detecting malware based solely on their OpCodes and provided a comprehensive comparative perspective on them.
This study bridged the gap between the approaches for malware detection and OpCodes feature datasets. In addition, this study found that there was a positive relationship between the "Study Year" variable and the "Detection Ratio" variable, which meant that the detection ratio of the approaches for detecting malware only based on sample operation codes (OpCodes) has improved over the years. The average detection ratio of the approaches for detecting malware equaled 86.12% for the collected studies that used machine learning (ML), 95.74% for the collected studies that employed deep learning (DL), and 90% for the collected studies that exploited statistical techniques and information theories (STIT). The odds of using improved deep learning (DL) algorithms in the approaches for detecting malware only based on sample operation codes (OpCodes) were 1.427 times greater as the study year advanced, and this adoption was associated with a more accurate detection ratio for malware. Besides, this study found that 67.57% of all the collected studies were approaches for detecting malware only based on OpCodes that used machine learning (ML) algorithms, 29.73% were approaches that used deep learning (DL) algorithms, and 2.70% were approaches that used statistical techniques and information theories (STIT). Finally, the study ended with insightful recommendations for future research directions.

Figure 1. OpCodes features extraction and selection in malware detection approaches based on ML

Similarly, as shown in Table 4, the approaches that employed OpCodes frequencies for features extraction and selection accounted for 18.18% of the collected studies that used deep learning (DL) for malware detection. In addition, the approaches that utilized N-grams OpCodes for features extraction and selection accounted for 27.27% of these studies. Furthermore, the approaches that exploited OpCodes embedding for features extraction and selection accounted for 27.27%. Finally, the approaches that utilized images for features extraction and selection accounted for 27.27% of the collected studies that used deep learning (DL) for malware detection. Figure 2 presents the percentage of each one.


Figure 4. The average detection ratio of the approaches for malware detection

Table 2. A comparison of the approaches for detecting malware based on sample operation codes (OpCodes)

Table 3. OpCodes features extraction and selection in malware detection approaches based on ML algorithms

Table 4. OpCodes features extraction and selection in malware detection approaches based on DL algorithms

Malware Detection Approaches Based on Operation Codes… (Mohammed A. Saleh) 579

Figure 2. OpCodes features extraction and selection in malware detection approaches based on DL

Table 6. The Binary Logistic Regression between the independent variables and dependent variables

5. The collected studies of the approaches for detecting malware only based on OpCodes that used deep learning (DL) algorithms represented 29.73% of the overall studies; hence, this category came second.
6. The collected studies of the approaches for detecting malware only based on OpCodes that used statistical techniques and information theories (STIT) accounted for 2.70% of the overall studies.
7. The approaches for detecting malware that utilized OpCodes frequencies for features extraction and selection represented 24% and 18.18% of the collected studies that used machine learning (ML) and deep learning (DL) for malware detection, respectively.
8. The approaches for detecting malware that employed N-grams OpCodes for features extraction and selection represented 64% and 27.27% of the collected studies that used machine learning (ML) and deep learning (DL) for malware detection, respectively.
9. The approaches for detecting malware that used vectors of features for features extraction and selection appeared in 12% of the collected studies that used machine learning (ML).
10. The approaches for detecting malware that exploited OpCodes embedding and images for features extraction and selection represented 27.27% and 27.27% of the collected studies that used deep learning (DL) for malware detection, respectively.
11. The approaches for detecting malware that took advantage of mutual information (MI) for features extraction and selection represented 100% of the collected studies that used statistical techniques and information theories (STIT).