Ransomware detection using stacked autoencoder for feature selection

The aim of this study is to propose and evaluate an advanced ransomware detection and classification method that combines a Stacked Autoencoder (SAE) for precise feature selection with a Long Short-Term Memory (LSTM) classifier to enhance ransomware classification accuracy. The proposed approach involves thorough pre-processing of the UGRansome dataset, training an unsupervised SAE for optimal feature selection, and fine-tuning via supervised learning to elevate the LSTM model's classification capabilities. The study analyzes the autoencoder's learned weights and activations to identify the features essential for distinguishing ransomware families from other malware, and creates a streamlined feature set for precise classification. Extensive experiments, including up to 400 epochs and varying learning rates, are conducted to optimize the model's performance. The results demonstrate the strong performance of the SAE-LSTM model across all ransomware families, with high precision, recall, and F1 score values that underscore its robust classification capabilities. Balanced average scores further confirm the proposed model's ability to generalize effectively across various malware types. The proposed model achieves 99% accuracy in ransomware classification, surpassing the Extreme Gradient Boosting (XGBoost) algorithm, primarily due to its effective SAE feature selection mechanism. The model also performs well in identifying signature attacks, achieving a 98% accuracy rate.


Introduction
In today's digital age, ransomware has emerged as a significant threat to individuals and businesses alike [1]. Defined as a type of malicious software that encrypts valuable data and demands a ransom in exchange for its release, ransomware attacks have become increasingly prevalent and financially damaging [1,2]. Recent incidents have resulted in staggering losses, reaching tens of millions of dollars for consumers [3]. In June 2022, the Serbian Republic Geodetic Authority, responsible for registering property rights, experienced a ransomware attack. This attack disrupted regular services, making it difficult for citizens to make changes to real estate ownership in the registry [3].
Similar attacks have also been reported in neighboring countries. These include the Ministry of Agriculture and the Ministry of Science and Education of the Republic of North Macedonia, the Parliament and the Council of Ministers of Bosnia and Herzegovina, various public institutions in Albania, and the majority of the governmental IT infrastructure in Montenegro [3].
South Africa, on the African continent, stands out as the country most impacted by ransomware and phishing emails [4]. The cybersecurity landscape in South Africa has exposed vulnerabilities in multiple sectors, resulting in a significant number of cyberattacks. Pieterse [5] highlights that public and private enterprises, as well as municipalities, are commonly targeted by ransomware attacks in South Africa. An example is the Department of Justice, which experienced its third ransomware attack in 2023, following a previous incident in 2020 [6].
These attacks have resulted in significant financial losses for various South African companies. The urgency of tackling the global problem of classifying and detecting ransomware is evident, especially when considering the security of critical infrastructure [7]. There are several different types and variants of ransomware, each with its own characteristics and behaviors (see Table 1). Nonetheless, the absence of readily accessible ransomware datasets within the current realm of intrusion detection poses a significant challenge to their accurate categorization and detection [8].
To address this limitation, we made use of the UGRansome dataset, a publicly accessible dataset created in [9], specifically designed for classifying and understanding ransomware [10][11][12][13]. In the age of big data, one crucial aspect of modern data analysis and machine learning is the extraction of meaningful and representative features from complex, high-dimensional datasets [14]. Among the various techniques available, stacked autoencoders (SAEs) have emerged as a potent tool for automating feature discovery [14,15]. They enable the uncovering of intricate data structures and patterns. Grounded in the field of deep learning (DL), SAEs provide an effective solution to the challenge of representing high-dimensional data. They pave the way for improved predictive modeling, efficient dimensionality reduction, and insightful data interpretation [15].

Stacked AutoEncoder Background
Feature selection and extraction using SAEs have been extensively studied in various domains. Wang et al. [18] proposed Broad Autoencoder Features (BAF), which connects four SAEs in parallel, each with a different activation function, and evaluated the performance of the BAF in terms of learned features using a Deep Neural Network (DNN). Another study by Kong et al. [19] explored feature extraction from load curves using an autoencoder network. Wang et al. [20] used a Stacked Supervised Auto-Encoder (SSAE) to train a deep network that learns fault-relevant features; by stacking multiple supervised autoencoders, high-level fault-relevant features are learned to improve classification accuracy. In [21], the integration of SAE characteristics with wavelet-based and morphological fractal texture attributes was proposed for the classification of skin disorders. This approach achieved high accuracy in the classification task.

The ransomware families listed in Table 1 include the following:

Cerber Ransomware
Cerber is a notorious ransomware family known for its ability to evade detection and rapidly evolve. It has been responsible for a significant number of attacks.

WannaCry Ransomware
WannaCry gained worldwide attention in 2017 when it infected hundreds of thousands of computers [16]. It exploited a Windows vulnerability to spread.

Ryuk Ransomware
Ryuk is a targeted ransomware strain that primarily targets businesses and organizations. It often demands large ransoms.

NotPetya Ransomware
This ransomware variant, which emerged in 2017 [17], was initially disguised as a ransomware attack but was later revealed to be a destructive wiper malware.

LSTM Background
Our research introduces a unique approach that combines feature selection using an SAE with classification using Long Short-Term Memory (LSTM), resulting in improved ransomware classification accuracy. The process includes preprocessing the UGRansome dataset, training an unsupervised SAE for feature extraction, and then fine-tuning the LSTM model with supervised learning to enhance its classification capabilities.
LSTM is a type of Recurrent Neural Network (RNN) architecture used for processing sequential data [26]. Unlike traditional RNNs, LSTM is designed to capture long-term dependencies effectively. It achieves this by using a memory cell with three components: an input gate, a forget gate, and an output gate. The input gate decides how much new information should be stored in the memory cell, while the forget gate determines what information should be forgotten. The output gate controls the amount of information passed from the memory cell to the next step.
Using these gates, LSTM can process sequential data more accurately and capture long-term dependencies [26]. LSTM networks have demonstrated considerable promise in the field of malware detection. Researchers have invested substantial effort into optimizing LSTM hyperparameters specifically for the design of Intrusion Detection Systems (IDS) [26,27]. These endeavors have led to the exploration of various LSTM configurations and revealed that the importance of hyperparameters for LSTM in IDS differs significantly from their roles in language models. The intricate interplay between these hyperparameters has a pronounced impact on their relative significance. Taking this interplay into account, batch size emerges as the most critical factor for LSTMs in IDS, followed by dropout ratio and padding [26]. Additionally, innovative sensitivity-based LSTM models have been proposed for creating System-call Behavioral Language (SBL) models for malware detection [27]. These models have demonstrated impressive performance metrics, including high accuracy and specificity, when tested on unfamiliar IDS datasets. Another approach leverages LSTM in conjunction with word embedding and attention mechanisms to effectively represent and classify malware files [26]. This strategy has yielded remarkable results, achieving high accuracy and F1 scores [27].

Fang et al. [28] conducted a study introducing a novel method for zero-day detection using LSTM. Their model is designed specifically for identifying malicious JavaScript code injected into web pages [29], extracting features from the semantic level of bytecode and optimizing word vectorization techniques. Their findings revealed that the LSTM-based detection model outperforms existing models that rely on tree-based algorithms. In addition, Roberts and Nair [30] propose a neural architecture that addresses the problem of anomaly detection in discrete sequence datasets. Their approach modifies the LSTM autoencoder and incorporates an array of one-class support vector machines (SVMs) to detect anomalies within sequences. This method demonstrates improved stability and performs better than traditional LSTM-based and sliding-window anomaly detection systems. One limitation of this approach is that it requires a labeled dataset for training the one-class SVM, which can be challenging to obtain in certain domains.

Research Contribution
Our research endeavors to harness the combined power of SAEs and LSTM networks to enhance the classification and detection of ransomware using the UGRansome dataset. Specifically, the focus is on incorporating feature selection techniques within the SAE architecture to facilitate the extraction of the most relevant and discriminative features from ransomware data. By selecting the input data, the subsequent LSTM network can efficiently capture the temporal relationships within the feature space. The ultimate goal of this study is to contribute to the advancement of proactive and robust ransomware recognition and classification strategies. The approach employed in this research holds several key advantages for enhancing cybersecurity, particularly in the realm of ransomware detection. The subsequent sections of this work delve into the methodology, experimental setup, results, and discussions, culminating in a comprehensive analysis of the proposed SAE-based ransomware classification using the LSTM model.
To understand the dataset in more detail, we refer to Table 2, which highlights its attribute characteristics.

Training SAEs involves two critical steps: unsupervised pre-training and supervised fine-tuning [33]. In the unsupervised pre-training phase, individual layers within the network are trained as autoencoders, which specialize in learning internal data representations. These representations serve to initialize the network weights and enhance its generalization capabilities. Subsequently, in the supervised fine-tuning stage, the pre-trained layers are assembled and jointly trained using labeled data. This approach consistently achieves exceptional accuracy rates [33].

Data Weighting Techniques
SAEs can be further enhanced by integrating data weighting techniques, which bolster the network's robustness and discriminative capacity [33]. Stacked sparse autoencoders have emerged as a powerful tool for dimensionality reduction and classification in intrusion detection systems [34].

SAE Architecture
Figure 1 illustrates a typical SAE architecture. The objective function for unsupervised pre-training of layer l is given in Equation 1:

    L_unsupervised = min_{W^(l), b^(l)} (1/m) Σ_{i=1}^{m} ||x^(l)(i) − x̂^(l)(i)||²    (1)

where W^(l) and b^(l) are the weights and biases of layer l, x^(l)(i) is the input for the i-th training example in layer l, x̂^(l)(i) is its reconstructed output, and m is the number of training examples.

After unsupervised pre-training, the layers are stacked together to form the full SAE. The network is then trained using a supervised loss function, typically a classification loss, with labeled data [35]. The mathematical formulation of the supervised cross-entropy loss function is given in Equation 2:

    L_supervised = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij log(p_ij)    (2)

where N is the number of training examples, C is the number of classes, y_ij is a binary indicator (0 or 1) of whether class j is the correct classification for example i, and p_ij is the predicted probability that example i belongs to class j.

A Recurrent Neural Network (RNN) is a variation of the feedforward neural network (NN) that introduces a recurrent structure within the network [34,36] (Figure 2). While the feedforward NN comprises multiple layers with unidirectional connections, an RNN establishes connections from each neuron to itself. This self-connection mechanism allows the RNN to retain previous inputs that can influence the network's output [36]. In an RNN, inference is similar to that of the feedforward NN, completed through forward propagation. Training is accomplished using backpropagation through time, where the weights are updated based on the gradient [35]. However, RNNs face challenges such as the vanishing gradient and exploding gradient problems. The gradient for each output depends not only on the current layer but also on previous layers, so continuous backpropagation updates can cause gradients to weaken until they vanish; conversely, when gradients become too large, the exploding gradient problem occurs [36].
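To make the layer-wise objective of Equation 1 concrete, the following NumPy sketch trains a single linear autoencoder with plain gradient descent on the reconstruction error. This is an illustrative simplification, not the study's implementation: the paper's SAE stacks several ReLU autoencoders trained with Adam in Keras, and the data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, epochs=300, lr=0.05):
    """Minimal single-layer linear autoencoder minimizing the Equation 1
    objective: the mean squared error between x and its reconstruction x_hat."""
    m, d = X.shape
    W_enc = rng.normal(0, 0.1, (d, n_hidden))   # W^(l)
    b_enc = np.zeros(n_hidden)                  # b^(l)
    W_dec = rng.normal(0, 0.1, (n_hidden, d))
    b_dec = np.zeros(d)
    for _ in range(epochs):
        H = X @ W_enc + b_enc                   # encoded features
        X_hat = H @ W_dec + b_dec               # reconstruction x_hat
        err = X_hat - X                         # d(0.5*||x_hat - x||^2)/d(x_hat)
        # Backpropagate the reconstruction error through both layers
        gW_dec = H.T @ err / m
        gb_dec = err.mean(axis=0)
        gH = err @ W_dec.T
        gW_enc = X.T @ gH / m
        gb_enc = gH.mean(axis=0)
        W_dec -= lr * gW_dec; b_dec -= lr * gb_dec
        W_enc -= lr * gW_enc; b_enc -= lr * gb_enc
    X_hat = (X @ W_enc + b_enc) @ W_dec + b_dec
    loss = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return W_enc, b_enc, loss

# Toy data living near a 2-D subspace of a 5-D space
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
W_enc, b_enc, loss = train_autoencoder(X, n_hidden=2)
print(np.isfinite(loss))
```

The encoded features `X @ W_enc + b_enc` play the role of the learned representation that a stacked architecture would pass to its next layer.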

RNN and LSTM
To address RNN issues, the LSTM deep learning algorithm was developed by Hochreiter and Schmidhuber in 1997 as a variant of the RNN model [35,36]. LSTM introduces the concept of memory cells for its nodes to enable the linkage of prior data information to the present nodes. Each LSTM node incorporates three gating mechanisms: an input gate, a forget gate, and an output gate (Figure 3).
Fig. 3: LSTM Node with Gating Mechanisms [34,36]

The key components of the LSTM gating mechanisms can be defined as follows:
- i_t (Input Gate): controls the flow of new information into the memory cell.
- f_t (Forget Gate): controls which information from the previous memory cell state is forgotten.
- o_t (Output Gate): controls the output from the memory cell.
- c_t: represents the cell state.
- h_t: represents the hidden state.
The LSTM equations for these gating mechanisms are as follows:

    Input Gate:         i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
    Forget Gate:        f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
    Output Gate:        o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
    Cell State Update:  c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c · [h_{t−1}, x_t] + b_c)
    Hidden State Update: h_t = o_t ⊙ tanh(c_t)

where:
- W_i, W_f, W_o, W_c are weight matrices for the gates.
- b_i, b_f, b_o, b_c are bias vectors for the gates.
- σ represents the sigmoid activation function.
- tanh represents the hyperbolic tangent activation function.
- [h_{t−1}, x_t] represents the concatenation of the previous hidden state h_{t−1} and the current input x_t.

These equations govern the behavior of the LSTM memory cell and its gating mechanisms, allowing it to capture long-term dependencies in sequential data [35].
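A minimal NumPy implementation of a single step of these equations may help make the gate interactions concrete. The weight shapes and toy inputs below are illustrative assumptions, not the study's configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above. W holds the
    weight matrices W_i, W_f, W_o, W_c, each applied to the concatenation
    [h_{t-1}, x_t]; b holds the corresponding bias vectors."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde          # cell state update
    h_t = o_t * np.tanh(c_t)                    # hidden state update
    return h_t, c_t

# Tiny demo: 4 hidden units, 3 input features, a 5-step random sequence
rng = np.random.default_rng(1)
n_h, n_x = 4, 3
W = {k: rng.normal(0, 0.5, (n_h, n_h + n_x)) for k in "ifoc"}
b = {k: np.zeros(n_h) for k in "ifoc"}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)  # (4,)
```

Because h_t = o_t ⊙ tanh(c_t) and both factors are bounded, every hidden-state component stays in (−1, 1), which is part of what keeps gradients well behaved.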

Extreme Gradient Boosting
In this research, we compare the performance of the proposed SAE-LSTM model with that of the XGBoost algorithm. We used the UGRansome dataset for evaluation. XGBoost (Extreme Gradient Boosting) is a powerful and efficient machine learning algorithm used for both regression and classification tasks [37].
It belongs to the ensemble learning category and is based on the gradient boosting framework. XGBoost is known for its high predictive accuracy and is widely used in various data science and machine learning competitions. XGBoost aims to find an optimal model by minimizing a loss function that measures the difference between predicted values and actual target values [37]. The algorithm builds a strong predictive model by combining multiple weak models (decision trees) iteratively.

This algorithm uses the following concepts:
Objective function: This is the overall function that XGBoost aims to optimize during training [37]. It is a combination of two main parts: the loss function (L(θ)) and the regularization term (Ω(θ)). The goal of XGBoost is to find the values of the model parameters that minimize this objective function [37].

Loss function L(θ):
    L(θ) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t))

This term measures the discrepancy between the actual target values (y_i) and the predicted values (ŷ_i^(t)) generated by the current iteration of the model [37]. The loss function quantifies how well the model is performing on the training data. The objective is to minimize this loss by adjusting the model's parameters [37].

Regularization term Ω(θ):
Regularization is a technique used to prevent overfitting, which occurs when a model fits the training data too closely and does not generalize well to new data [37]. In XGBoost, the regularization term has two components, a penalty on the number of leaves and a penalty on the leaf weights:

    Ω(θ) = γT + (1/2)λ ||w||²

where T is the number of leaves in a tree, w is the vector of leaf weights, and γ and λ are regularization parameters; the term γT discourages the model from creating too many complex rules.

The prediction function computes the predicted value for a specific data point (x_i) at a given iteration (t) of the boosting process [37]. It is essentially the sum of the predictions from the individual trees (f_k(x_i)) in the model:

    ŷ_i^(t) = Σ_{k=1}^{t} f_k(x_i)

As boosting iterations progress, more trees are added and the prediction is updated. In summary, XGBoost seeks the best model parameters (θ) by minimizing a combination of two factors: how well the model fits the training data (the loss function) and how complex the model is (the regularization term). The goal is to iteratively improve the model by adjusting its parameters and thereby reducing the overall objective function.
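The pieces of the objective can be sketched in a few lines of NumPy. The squared-error loss, the γT + ½λ||w||² regularizer, and the hypothetical tree outputs below are illustrative choices to show how the parts combine, not values or internals from the XGBoost library:

```python
import numpy as np

def xgb_objective(y, y_pred, leaf_weights, n_leaves, gamma=1.0, lam=1.0):
    """Illustrative XGBoost-style objective: a squared-error loss L(theta)
    plus the regularization Omega(theta) = gamma*T + 0.5*lambda*||w||^2,
    where T is the total leaf count and w the leaf weights of the trees."""
    loss = np.sum((y - y_pred) ** 2)                              # L(theta)
    omega = gamma * n_leaves + 0.5 * lam * np.sum(leaf_weights ** 2)
    return loss + omega

# The prediction at iteration t is the sum of the individual trees:
# y_hat_i^(t) = sum_{k=1}^{t} f_k(x_i)
tree_outputs = np.array([[0.5, 1.0],    # f_1 evaluated on two data points
                         [0.2, -0.1]])  # f_2 evaluated on the same points
y_hat = tree_outputs.sum(axis=0)        # [0.7, 0.9]
y = np.array([1.0, 1.0])

obj = xgb_objective(y, y_hat,
                    leaf_weights=np.array([0.5, 1.0, 0.2, -0.1]),
                    n_leaves=4)
print(round(obj, 3))  # 4.75
```

Here the loss contributes 0.10 and the regularizer 4.65, showing how γ and λ trade data fit against model complexity.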

Performance Evaluation
The evaluation of the training and testing performance of the established models for ransomware classification is crucial. Several metrics are commonly used to assess the effectiveness of these models, including accuracy, precision, recall (sensitivity), and the F1 Score [12,38,39]. These metrics provide valuable insights into the model's ability to make accurate predictions.
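These metrics can all be computed directly from a confusion matrix. The 3-class matrix below is hypothetical, chosen only to illustrate the calculation for labels such as Signature (S), Synthetic Signature (SS), and Anomaly (A); it is not taken from the study's results:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Per-class precision, recall, and F1 from a confusion matrix whose
    rows are true classes and columns are predicted classes, plus the
    overall accuracy."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)     # TP / (TP + FP), per column
    recall = tp / cm.sum(axis=1)        # TP / (TP + FN), per row
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Hypothetical 3-class confusion matrix
cm = np.array([[90,  5,  5],
               [ 4, 92,  4],
               [ 6,  3, 91]])
p, r, f1, acc = metrics_from_confusion(cm)
print(round(acc, 3))  # 0.91
```

Averaging the per-class scores (macro averaging) gives the kind of balanced average used to judge generalization across malware types.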

Confusion Matrix
A confusion matrix is often used to provide a more detailed evaluation of model performance [12]. The methodological approach employed in this research is visually depicted in Algorithm 1.

Algorithm 1 SAE-LSTM Training (excerpt)
    Backward pass:
11: Calculate gradients using backpropagation
12: Update the weights and biases Wi, Wo, bi, bo using the optimizer
13: Compute the average loss for the epoch
14: if the average loss is below a predefined threshold or after a fixed number of epochs then
15:     break ▷ Training convergence criteria met
16: Use the trained autoencoder for feature selection:
17: Extract the encoded features ht from the encoder
18: These encoded features are used as the selected features for LSTM classification

Experimental Setups

In this investigation, both the training and testing phases of the proposed data preprocessing, feature extraction, and classification models were executed using the Python programming language, version 3.10.12. The training and testing phases of the proposed data encoding, normalization, SAE, and LSTM models were carried out on the Google Colaboratory cloud system. This platform offers convenient access to a wide array of Python libraries and services at no cost. To enhance algorithm execution speed, Nvidia CUDA technology within the Colab environment was utilized.
Various essential tasks, including file uploading, data preprocessing, data frame setup, and more, were accomplished using Python libraries such as numpy, pandas, statistics, sklearn, matplotlib.pyplot, and seaborn.
For implementing the recommended SAE and LSTM architecture, the Python TensorFlow Keras library was employed. The specified SAE architecture comprised three encoder layers with 75, 50, and 13 neurons, respectively, and three corresponding decoder layers with 50, 75, and 13 neurons (Table 3). The activation function was configured as relu, the optimizer as Adam, the loss as mean squared error (mse), and the number of epochs as 50 (Table 3). The constructed LSTM network consisted of 3 layers, each containing 168 neurons (Table 4). The loss parameter was set to sparse categorical cross-entropy, the optimizer to Adam, and the number of epochs to 400.

In this section, we delve into the outcomes obtained through our proposed computational framework. We provide a comprehensive discussion of various facets, including the data preprocessing and encoding procedures, the results of feature extraction utilizing the SAE, the cross-validation process involving data splitting, the performance of the LSTM classification, and the predictive modeling of ransomware, categorizing it into Signature (S), Anomaly (A), and Synthetic Signature (SS) classes.

Data Encoding and Pre-processing
Figure 6 provides an overview of the UGRansome statistics. The original UGRansome dataset consists of 207,533 records, with 58,491 redundant patterns that account for 28.18% of the dataset (Figure 6). Within the scope of this study, the sklearn preprocessing library played a pivotal role in converting categorical attributes into numeric representations across multiple columns of the UGRansome dataset (Figure 7). To eliminate redundancy, the SAE ignored duplicate rows during the feature selection process (Figure 6). We employed label encoding to transform the UGRansome data. The primary objective of this encoding strategy was to render the dataset compatible with machine learning algorithms that require numeric inputs. Through this process, categorical variables were effectively transformed into numerical equivalents, making them amenable to various modeling and analytical techniques. The initial phase of the analysis examined the distribution of ransomware instances selected by the SAE. Locky, SamSam, and WannaCry exhibited the highest frequency of occurrences, whereas EDA2 and DMALocker occupied a middle ground, with NoobCrypt registering a relatively lower count (Figure 9).
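The label-encoding step described above can be sketched with a minimal stand-in that mirrors the sorted-value integer mapping of sklearn.preprocessing.LabelEncoder. The example values are illustrative, not the actual UGRansome column contents:

```python
def label_encode(values):
    """Map each distinct categorical value to an integer code, in the
    sorted order of the distinct values (as LabelEncoder does)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Illustrative ransomware-family column
families = ["Locky", "WannaCry", "SamSam", "Locky", "WannaCry"]
codes, mapping = label_encode(families)
print(codes)  # [0, 2, 1, 0, 2]
```

The resulting integer codes are what downstream models such as the SAE and LSTM consume in place of the original strings.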
Concurrently, an assessment of the cumulative costs associated with these ransomware types revealed that Locky, SamSam, and WannaCry still retained substantial monetary impact (Figure 10 (a)). Furthermore, an exploration of the distribution of various malware categories across ransomware types was conducted.
The results indicated a relatively balanced distribution, with SSH accounting for 33.0% of instances, Spam representing 31%, and UDP scan comprising 27.6%. In contrast, NerisBonet was found to be in the minority, constituting only 8.3% of the dataset (Figure 11). Subsequently, to gain a more standardized perspective and discern the true extent of the threat posed by each ransomware variant, an analysis of the average dollar amount per ransomware was undertaken. Surprisingly, this analysis yielded results divergent from the initial observation shown in Figure 10 (a). It provides valuable insights into the UGRansome dataset, illustrating that while Locky, SamSam, and WannaCry may have incurred substantial cumulative damages due to their higher volume of attacks (Figure 10 (a)), they may not inflict as much financial harm per individual attack when compared to NoobCrypt, DMALocker, and EDA2 (Figure 10 (b)). Therefore, the latter ransomware variants should be closely monitored as potential major threats, particularly if the volume of their attacks were to increase. A correlation matrix of the SAE-selected features assesses relationships between features (Figure 12), with +1 indicating a strong positive linear correlation and -1 a strong negative one. Visualizing it via a heatmap enhances pattern recognition by color-coding high positive, high negative, and low correlations. The ransomware attacks in the dataset inflicted severe financial devastation. On average, victims paid a staggering 30.69 BTC, equivalent to $798,602 (USD) as of September 2023, with an average dollar payout of $14,873.43 (USD). Figure 10 underscores the substantial financial toll imposed by ransomware threats. In fact, this aligns with previous findings: 11% of organizations that opted to pay ransoms in a 2021 report disclosed payments of $1 million or higher.
The average network traffic observed was 2021.16 bytes, with a considerable standard deviation of 2272.54 (Figure 13 (b)). This suggests a notable variation in values, potentially indicating spikes in network traffic triggered by zero-day threats. To address this, additional feature engineering might be necessary to better balance the dataset. Figure 13 (a) shows that CryptoLocker exhibited the most anomalous behaviors among the ransomware families. This suggests that zero-day threats like CryptoLocker, which restrict users' access to their computers, exhibit highly deviant behaviors.

The LSTM model's classification outcomes, as shown in Figure 15 and Table 5, are detailed using a confusion matrix. The matrix highlights that 17,891 instances of ransomware were correctly classified as Signature (S) types, with over 11,000 instances correctly classified as Synthetic Signature (SS) and Anomaly (A) types. This classification has an average accuracy of 98% (Figure 14 and Table 5). We also undertook a comprehensive comparison between our SAE-LSTM model and the XGBoost algorithm described in [7]. The results obtained from the XGBoost algorithm on the UGRansome dataset [7] are summarized in Table 6. This analysis revealed the superior performance of the SAE-LSTM model over the XGBoost algorithm, which can be attributed to the effectiveness of the feature selection inherent to the SAE-LSTM approach (Figure 16).

Discussion
In summary, Table 5 and Figure 14 show that the proposed SAE-LSTM model identifies signature attacks with a 98% accuracy rate. This outperforms the XGBoost model, which achieved a 95% accuracy rate in the same task (Figure 17). However, the model's slightly lower performance in identifying synthetic signature attacks highlights the challenge of detecting zero-day attack signatures. Anomaly attacks, representing novel threats, present a greater challenge due to their lack of discernible patterns. Future work in the IDS field could leverage the UGRansome dataset and refine model parameters to enhance anomaly detection. Table 7 provides a comparative analysis of various studies in the field of IDS.
While many studies have achieved high accuracy, there are several limitations, including the use of shallow learning architectures, scalability issues, a domain-specific focus, and the need for labeled datasets. Our proposed research achieved a remarkable accuracy of 99% in ransomware classification using the SAE and LSTM (Figure 17), but it is limited to supervised ransomware classification, and further research is required to assess its applicability to a broader range of intrusion detection scenarios. Figure 18 shows that the TP and TN counts of the proposed SAE-LSTM are higher while its FP and FN counts are lower than those of XGBoost; hence the SAE-LSTM model is better in terms of overall classification accuracy and error rates (Figure 19). Figure 18 also indicates that the SAE-LSTM model is more reliable in making correct predictions and has higher precision and recall. Therefore, in this scenario, the SAE-LSTM model is considered better for ransomware detection.

Conclusion
In today's digital landscape, ransomware presents a formidable threat to individuals and businesses, prompting our innovative approach to detection and classification. Our method combines an SAE for feature selection with an LSTM classifier, yielding enhanced precision in categorizing ransomware. The process involves preprocessing of the UGRansome dataset, unsupervised SAE feature selection, and supervised fine-tuning, resulting in a robust model that excels across diverse ransomware families. Architectural optimizations culminate in an exceptional 99% accuracy, surpassing conventional classifiers.

Stacked Autoencoder and Feature Selection

Stacked Autoencoders (SAEs) are a versatile type of neural network architecture utilized for feature extraction and dimensionality reduction in various domains. They have found applications in biometrics recognition, image recognition, natural language processing, and automatic speech recognition [33]. The stacked nature of SAEs arises from their composition, which includes multiple layers of autoencoders, each tasked with reconstructing the output of the preceding layer.

Fig. 1: SAE Architecture

Fig. 2: RNN Architecture

Fig. 8: Removal of abnormal timestamps in the Pre-processed Dataset

Fig. 9: Distribution of Attacks in the Pre-processed Dataset

Fig. 10: Financial Damages of Ransomware in the Pre-processed Dataset

Table 2: Attributes of the UGRansome Dataset

Table 3: SAE Layers and Parameters

Table 4: LSTM Layers and Parameters