Selecting Root Exploit Features Using Flying Animal-Inspired Decision

ABSTRACT


INTRODUCTION
People use mobile devices in their daily activities to connect, go online, and communicate. This situation gives attackers an opportunity to develop root exploits that compromise a victim's Android device for money or private purposes. A root exploit is an application that takes over the kernel of the Android operating system to gain root privileges. Once attackers gain this privilege, they are able to provide false antivirus results, evade the Android security mechanisms, execute stealth activities without the victim's knowledge, and install many types of malware on the device [1]-[3]. In addition, the number of root exploits increases over time because of the homebrew communities. These communities consist of people who look for ways to break the Android kernel in order to obtain a customized version of Android. Root exploit writers can therefore wait for the homebrew community to discover new ways to gain control of the Android kernel [4]. Consequently, in order to detect root exploits, security practitioners conduct two types of malware analysis: 1) dynamic analysis, and 2) static analysis.
Dynamic analysis investigates a root exploit's behavior by executing the application and inspecting its activity [5] [6]. Studies that practice this type of analysis include the works in [7] [8]. However, dynamic analysis has multiple drawbacks, one of which is limited monitoring coverage, because it monitors the application's characteristics only within a certain time window. Root exploit behaviors that occur outside the monitoring window are therefore excluded, and many behaviors go unobserved. In contrast, static analysis examines the root exploit by reverse engineering the application to retrieve its entire code. Security analysts practice this type of analysis without executing the malware. In addition, static analysis needs few resources, that is, low hardware specifications for CPU, RAM, and storage. Moreover, the static analysis process is fast and consumes less time than dynamic analysis [9]. During static analysis, the root exploit is unable to hide or modify its malicious process because it is not executed [10]. Nonetheless, to detect root exploits efficiently with machine learning, static analysis needs a minimal number of distinctive features.
In a machine learning prediction model, selecting a small set of the best features improves the performance results. This improvement arises because fewer features eliminate unnecessary data and decrease the dataset's dimensionality. It also reduces the complexity of the predictive model, hence reducing the machine learning processing time [11] [12]. To obtain few but relevant features, this paper adopts a flying animal-inspired search approach to intelligently investigate all features and select the best ones for detecting previously undiscovered root exploits. This study uses three categories of features: system command, directory path, and code-based. The first category, system command, is a UNIX-based command in the operating system (OS). This study chooses this type of feature because it remains stable even though the UNIX OS updates its version regularly. The system command category comprises Android debug bridge (ADB) commands, process execution commands, and terminal commands. The second category is the directory path, which consists of Linux kernel directories and system paths. The third category is code-based features, which are used when executing commands, for instance, standard output (stdout), standard error (stderr), and standard input (stdin).
This study proposes three animal-inspired algorithms to search for the relevant features in a minimal amount. Then, in the experiments, this study uses Adaboost to boost the multilayer perceptron (MLP) and convert it into a strong learner for detecting root exploits on Android mobile devices. In summary, this study has the following unique characteristics. a) The use of 600 benign (normal) samples and 550 root exploit samples from the Malgenome dataset [13]. b) The use of three types of flying animal-inspired algorithms (bat, firefly, and bee) to automatically select the best root exploit features that suit the multilayer perceptron classifier. c) The use of multiple categories of features, namely system command, directory path, and code-based features. d) The use of Adaboost, a boosting method that converts the multilayer perceptron into a strong learner for efficient machine learning results.
The structure of this paper is as follows. Section 2 surveys the related works. Section 3 provides the methodology of the experiment. Section 4 presents the results derived from the experiment. Finally, Section 5 delivers the conclusion and future work.

RELATED WORK
This section starts by introducing the root exploit and the types of analysis used to counter this type of malware, followed by a summary of previous research related to static analysis and machine learning. The end of this section explains the flying animal-inspired method for selecting the optimal features and Adaboost, which boosts the MLP algorithm to improve its prediction performance.

Root exploit
Unscrupulous authors, also known as hackers, construct malware to take over the Android operating system in order to obtain the victim's private information, steal data, and eavesdrop on communication. There are many types of malware, for instance, root exploit, spyware, botnet, Trojan, and worm. However, the most hazardous is the rootkit, also known as root exploit [14]-[17]. This is because once the attackers take over the kernel with the help of a root exploit, they control all the OS layers. The infected OS then allows the attackers to install multiple types of malware. In order to detect root exploits, researchers conduct malware analysis [18]-[21].

Malware analysis
The two types of malware analysis are dynamic and static analysis. Dynamic analysis executes the malware and monitors its behaviors, for example user input and network traffic [22]-[30]. However, dynamic analysis has limitations: it needs high hardware specifications and consumes a lot of time because it monitors the applications one by one. Furthermore, it is unable to detect activities that remain hidden during the monitoring phase. Static analysis, in contrast, examines the application without monitoring its behaviors [31] [32]. Static analysis [6], [33], [34] reverse engineers the application and inspects its code. Its coverage is not limited by a monitoring time because it does not execute the application [35]-[37]. The advantages of static analysis are that it 1) covers the overall code, 2) inspects the overall structure of the application, 3) is fast, and 4) is able to detect unknown malware when combined with machine learning.

Static analysis and machine learning
Machine learning is a research area within artificial intelligence that gives computers knowledge from data, such as observations and interactions with the environment. From the data, computers are able to predict decisions and make future judgements [38]. For example, [31] detected Android malware with Bayesian machine learning classification, using permissions derived from AndroidManifest.xml as well as code-based features. In addition, a study by Shabtai et al. [39] used static analysis with features such as opcodes, strings, and methods, and predicted with machine learning. Drebin [40] applied static analysis with a support vector machine (SVM). The authors used features such as permissions, application programming interface (API) calls, XML files, and network addresses to detect malware. However, the research excluded strings or keywords as features. Wei [41] adopted a graph constructor that uses opcode components as features, while Droidanalyzer used API calls as well as keywords as features. To the best of the authors' knowledge at the time of writing, there is still a lack of previous studies that use flying animal-inspired algorithms (bat, firefly, and bee) to select the best features for detecting malware, specifically root exploits. Therefore, this study utilizes the flying animal-inspired algorithms to select the best root exploit features.

Flying animal-inspired search
In machine learning prediction, it is important to select features in order to reduce model overfitting, improve prediction performance, and shorten the model training time [43]-[45]. The following sub-sections describe the flying animal-inspired algorithms.

2.4.1 Bat search
The bat search algorithm explores the feature space for the best features based on the echolocation behavior of bats [46]. Bats use a type of sonar known as echolocation. With its help, bats are able to find their prey and distinguish between different types of insects even in complete darkness. Besides searching for prey, they also use echolocation to avoid obstacles and locate their roosting crevices in the dark. Bats emit a very loud sound pulse and listen for the echo that bounces back from the surrounding objects.
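The text above describes the biological inspiration only. As a minimal sketch of how a single bat could be updated in a binary feature-selection search, the following assumes the commonly cited frequency and velocity update of the bat algorithm together with a sigmoid transfer function. The function name `bat_step`, the parameter defaults, and the sigmoid choice are illustrative assumptions, not the configuration used in this study.

```python
import math
import random


def bat_step(position, velocity, best, f_min=0.0, f_max=2.0, rng=random):
    """Update one bat in a binary feature-selection search (illustrative).

    position: list of 0/1 flags, one per candidate feature
    velocity: list of floats, one per candidate feature
    best:     best feature mask found so far by the swarm
    """
    # Assumed update rule: draw a random pulse frequency in [f_min, f_max].
    beta = rng.random()
    freq = f_min + (f_max - f_min) * beta

    # The velocity changes in proportion to the distance from the best mask.
    new_velocity = [v + (x - b) * freq
                    for v, x, b in zip(velocity, position, best)]

    # Assumed binarisation: a sigmoid maps each velocity to the
    # probability that the corresponding feature is selected.
    new_position = [1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
                    for v in new_velocity]
    return new_position, new_velocity
```

In a full search, each candidate mask would be scored by a fitness function such as classifier accuracy on the selected features, and loudness and pulse-rate terms would control local search; those parts are omitted here.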

2.4.2 Firefly search
The firefly algorithm searches for the best features based on the flashing pattern of tropical fireflies [47], using the brightness of the flash as a guide. For any two flashing fireflies, the less bright firefly flies towards the brighter one. The perceived brightness decreases as the distance between the two fireflies increases. If no firefly is brighter than a particular firefly, that firefly moves randomly.
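The attraction rule above can be illustrated with a short sketch. It assumes the usual formulation in which attractiveness decays exponentially with the squared distance, and it works in a continuous space for simplicity. The names `attractiveness` and `move_towards` and the parameters `beta0`, `gamma`, and `alpha` are illustrative assumptions rather than settings reported in this study.

```python
import math
import random


def attractiveness(distance, beta0=1.0, gamma=1.0):
    """Attractiveness seen at a given distance (assumed exponential decay)."""
    return beta0 * math.exp(-gamma * distance ** 2)


def move_towards(dim, brighter, alpha=0.1, beta0=1.0, gamma=1.0, rng=random):
    """Move the dimmer firefly `dim` towards the `brighter` one.

    alpha scales a small random perturbation added to every coordinate.
    """
    r = math.dist(dim, brighter)           # Euclidean distance
    beta = attractiveness(r, beta0, gamma)
    return [x + beta * (y - x) + alpha * (rng.random() - 0.5)
            for x, y in zip(dim, brighter)]
```

For feature selection, each coordinate would additionally be thresholded to 0 or 1 to decide whether a feature is kept; that binarisation step is a design choice left out of this sketch.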

2.4.3 Bee search
The bee algorithm searches for the best features by following the strategy bees use to find honey [48]. Bees find a food source (honey) by forming two groups, called scouts and foragers. Usually, the number of scouts is smaller than the number of foragers. A bee in the scout group performs a signal known as the waggle dance whenever it discovers a food source. The foragers then fly towards the food according to the signal from the scout. Some of the recruited foragers may also perform the waggle dance upon their return to the hive, mobilizing further foragers to exploit the food source.
Once the flying animal-inspired algorithms have selected the optimized features, this study uses them to classify and predict root exploit malware with a multilayer perceptron enhanced by a boosting method called Adaboost.

Adaboost
Boosting is a method that transforms a weak machine learning model into a strong classifier. This study utilizes Adaboost to boost the multilayer perceptron and enhance its performance in machine learning prediction. Adaboost stands for Adaptive Boosting. It was introduced by the researchers in [50], and it is built on the training of a weak learner. It helps to deliver more accurate and precise outcomes. Adaboost assigns each observation an initial weight value, $w_i = 1/n$, where n is the number of observations.

METHODOLOGY
Figure 1 depicts the proposed methodology, which consists of four stages: data collection, reverse engineering, feature extraction, and machine learning classification. The initial step collects the dataset, followed by reverse engineering. The next step investigates and extracts the significant features, and the last step is the machine learning classification that detects root exploits based on the flying animal-inspired decision.

[Figure 1. Proposed methodology. Recoverable stage labels: Data collection; Application reverse engineering.]

Data Collection
The data collection phase needs two classes of applications: benign (normal) and malware. For the malware class, this study draws on the Malgenome dataset, which contains 1,260 samples from 49 malware families [49]; many studies have used this dataset in their experiments [12,14,16]. The Malgenome dataset contains multiple malware types (botnet, Trojan, and root exploit). As this study focuses on root exploit malware only, the experiment extracted only the root exploits in this dataset, which comprise 550 samples. Table 1 [13] lists the root exploit malware families and the benign dataset. Besides malware, this study also needs benign applications so that machine learning can distinguish between malware and benign samples. This study uses 600 benign applications collected from the Google Play store [51]. Hence, the total of malware and benign samples is 1,150.

Reverse Engineering
The general process in static analysis is reverse engineering, which reverses the Android application to retrieve its code. Android applications use .apk as their file extension. Figure 2 depicts the reverse engineering step, which reverses the .apk to obtain the Java code by using Jadx [52]. Once all the code is retrieved, this step finds and collects the keywords by using the "grep" command in the Ubuntu terminal and saves the result in a .csv file. Once this process finishes, the following phase extracts the root exploit features.

Feature Extraction
The feature extraction process includes investigating the suspicious strings in both malware and benign samples. As time was limited, this research identified only 31 features. Table 2 tabulates the feature information, which consists of the three categories, the features in each category, and how many times each feature occurs in the benign and root exploit samples.
The code-based category contains features that comprise general code. For instance, setPtyWindowSize is code to execute a process, stderr is code to detect standard error, and stdout is code to output a standard process in the operating system. Table 2 shows that one of the features, Forked, never occurs in the benign applications but occurs 76 times in root exploit malware.
Meanwhile, the directory path category refers to Android-specific directory paths, which follow Linux conventions because Android's kernel is based on the Linux OS. For example, /system/xbin/su is the path that provides authorization to enter and access the Linux kernel directories. The table shows that one of the directory paths, /system/bin/chmod, occurs 12 times in benign applications but 377 times in root exploits.
The next category is system command, which comprises process, terminal, and Android debug bridge (ADB) commands. ADB is a tool that enables the user to communicate with an Android emulator or a connected Android device [53]. One of the features is adb_enabled, which appears three times in benign applications but 360 times in root exploits.
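The keyword search and the .csv output described in the reverse engineering step could be reproduced roughly as follows. This is a minimal sketch that assumes each sample's decompiled code is available as a single string and that a feature is recorded as present (1) or absent (0) per sample; the study itself used `grep`, and its exact encoding is not stated. The keyword list contains only three of the 31 features, taken from the examples in the text, and the helper names are hypothetical.

```python
import csv
import io

# Three example keywords named in the text; the full list has 31 features.
KEYWORDS = ["/system/bin/chmod", "adb_enabled", "stderr"]


def keyword_row(sample_name, source_text, keywords=KEYWORDS):
    """Return one row: the sample name followed by a 1/0 flag per keyword."""
    return [sample_name] + [1 if kw in source_text else 0 for kw in keywords]


def write_feature_csv(samples, keywords=KEYWORDS):
    """Build CSV text from a mapping {sample name: decompiled source text}."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sample"] + list(keywords))      # header row
    for name, text in samples.items():
        writer.writerow(keyword_row(name, text, keywords))
    return buffer.getvalue()
```

Summing each column separately over the benign and root exploit samples would give per-feature occurrence counts of the kind reported in Table 2.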

Feature Selection
To enhance the effectiveness of machine learning in detecting root exploits, this study needs to find as few relevant features as possible. These relevant features help to remove irrelevant and noisy data and hence improve the multilayer perceptron's results [34], [54]-[56]. Table 3 shows the features that the three flying animal-inspired algorithms (bat, firefly, and bee) select from the 31 features; the three algorithms choose different features. Among the 31 features, the bat algorithm chooses 17, the firefly algorithm chooses 11, and the bee algorithm chooses the fewest, 7 features. After feature selection, the next step is the classification phase, which uses the MLP machine learning classifier and converts it into a strong learner with Adaboost.

The MLP Model
This step constructs the machine learning predictive model by assembling it in Weka (Waikato Environment for Knowledge Analysis) [57]. The experiment ran on a machine equipped with an Intel Core i7 processor, 16 GB of RAM, and Microsoft Windows 7 Professional.
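The study builds its boosted MLP in Weka. As a tool-independent illustration of what a single boosting round does to the observation weights, the sketch below follows the textbook discrete AdaBoost update, starting from the equal initial weights described in the Adaboost subsection. The function names are hypothetical, and the boosting implementation in Weka may differ in detail from this form.

```python
import math


def initial_weights(n):
    """Every observation starts with the same weight, 1/n."""
    return [1.0 / n] * n


def adaboost_round(weights, correct):
    """Apply one discrete AdaBoost weight update.

    weights: current observation weights (summing to 1)
    correct: booleans, True where the weak learner classified correctly
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    # Weighted error of the weak learner on the training set.
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")

    alpha = 0.5 * math.log((1.0 - err) / err)

    # Misclassified observations gain weight, correct ones lose weight.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

After the update, the next weak learner (here, an MLP) is trained with more emphasis on the observations that the previous one misclassified.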

RESULTS
This study uses cross-validation as the evaluation benchmark, specifically 10-fold cross-validation. The cross-validation method randomly divides the data into ten subsets and repeats the training and testing steps ten times. Each time, nine subsets are used as the training set and one subset as the testing set, so the testing set is always excluded from the training set. To evaluate the flying animal-inspired features in detecting root exploits, Table 4 tabulates four evaluation measures: 1) accuracy, 2) true positive rate (TPR), 3) false positive rate (FPR), and 4) receiver operating characteristic (ROC).
Table 5 tabulates the cross-validation results, while Figure 3 depicts these results graphically for a clearer view. It covers the number of features, TPR, accuracy, and ROC. The figure shows that the results are broadly similar even as the number of features increases. This indicates that the flying animal-inspired features (bat, firefly, and bee) are able to reach good prediction values, which exceed 91 percent in accuracy and 82 percent in TPR. From another point of view, this study also observes the false positive rate (FPR) value.
Figure 3. Accuracy, true positive and ROC results.
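The 10-fold procedure described at the start of this section can be sketched as an index split. The sketch is a plain, non-stratified version with a hypothetical helper name and a fixed seed chosen for reproducibility; the tool used in the study may shuffle or stratify the folds differently.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random assignment to folds

    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test = folds[i]
        # The other k - 1 folds together form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With the 1,150 samples used in this study, each of the ten test folds holds 115 samples and each training set holds the remaining 1,035.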

False positive rate
The false positive rate (FPR) indicates the proportion of benign applications that are incorrectly classified as malware. Therefore, the smaller the value, the better, because it indicates that the boosted MLP made few incorrect predictions. Figure 4 shows the overall results of the FPR experiment. Among the bee, firefly, and bat algorithms, the bee algorithm made the fewest mistakes and recorded the best value (0.1 percent) with only seven features, while the bat algorithm recorded the worst value, 0.8 percent, with 17 features. The figure suggests that the fewer features are used, the better the machine learning prediction performs. The bat algorithm chose 17 features, more than bee and firefly, and hence made more mistakes and recorded 0.8 percent. Meanwhile, the bee and firefly algorithms use only seven and eleven features, respectively, and made fewer mistakes than the bat algorithm (0.1 percent for the bee algorithm).
Figure 5 plots the number of features against the false positive rate (FPR). The figure shows that more features lead to more incorrect predictions: the trend line rises from 0.1 to 0.8 percent. This finding indicates that additional features lead to a higher FPR value. Accordingly, to achieve good machine learning prediction accuracy, it is important to reduce the number of features while keeping them relevant.
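For reference, the accuracy, TPR, and FPR measures used in this section follow directly from the confusion counts. The sketch below assumes string class labels with "malware" as the positive class; the helper names are hypothetical, and any labels passed in are for illustration only and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive="malware"):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1          # malware correctly flagged
            else:
                fp += 1          # benign wrongly flagged as malware
        else:
            if truth == positive:
                fn += 1          # malware missed
            else:
                tn += 1          # benign correctly passed
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return accuracy, true positive rate, and false positive rate."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tp / (tp + fn),   # share of malware that is detected
        "fpr": fp / (fp + tn),   # share of benign apps flagged as malware
    }
```

Because the FPR divides by the number of benign samples only, a low FPR means that few benign applications are flagged, independent of how many malware samples are in the test fold.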

CONCLUSION AND FUTURE WORKS
A root exploit is a malicious application that compromises the OS kernel to gain the root privileges of the OS. After a successful attack, it is capable of executing malicious activities without being noticed, bypassing authentication, and installing other types of malware. Accordingly, there is a need to predict previously undiscovered root exploits. This study presented three flying animal-inspired algorithms to select the best root exploit features from three categories: system command, directory path, and code-based. This study adopted an MLP enhanced with a boosting method called Adaboost to transform the MLP into a strong machine learning classifier. In the evaluation, all the flying animal-inspired algorithms (bee, bat, and firefly) selected features that achieved more than 91% accuracy in predicting unknown root exploits. However, in the false positive rate results, the bee algorithm recorded the lowest error among all algorithms, only 0.1%, with only seven features. For future work, it is possible to add more types of features to increase the machine learning performance, such as extending the ADB features, and to pursue other studies by referring to the works in [58]-[60].