SQL Injection Vulnerability Detection Using Deep Learning: A Feature-based Approach

ABSTRACT


INTRODUCTION
The versatility of the internet raises the expectations of its users by offering limitless information and connectivity. Most businesses, therefore, reshape their existing organizational and commercial processes into internet-based solutions, i.e. web applications, to reach their target audience. Web applications provide different modes of user engagement so that businesses can observe stakeholders' needs for their products and offer them customized deals. All critical business information is stored in the application's database, from which owners or the system can decide on the appropriate action. Therefore, ensuring the security of the stored data is a vital responsibility of the system's designer or developer. However, a large number of web applications are developed without following secure software development conventions [1]. As a result, web systems are vulnerable to different kinds of cyberattacks [2]. According to the IBM X-Force Threat Intelligence Index, 79% of malicious incidents result from injection attacks, and the number of injection attacks increased by 37% in 2017 compared to 2016 [3]. Also, injection attacks have been at the top of the list of web application attacks since 2010, according to the Open Web Application Security Project (OWASP) [4,5]. These attacks can cause huge financial and reputational losses to business organizations [6][7][8][9][10][11][12][13].
An SQL injection (SQLi) attack exploits the SQLi vulnerabilities of a web application, which allow an intruder to carry out the attack [14,15]. Such attacks mostly occur due to a lack of proper validation of the web application's inputs [16][17][18][19][20][21][22][23]. OWASP's records from 2010 to 2017 reveal how careless web developers have been about good practices for developing web applications. Fig. 1 shows an approach to checking SQLi vulnerability. When an escape character (a single quotation mark, a backslash, etc.) is appended to the end of the URL, it also ends up at the end of the SQL query generated for that URL, which turns the query into a form of payload. Records in the database correspond to the SQL query generated from the web application's URL, and a query ending in a single quote does not match any record the database contains. This forces the database to throw an exception, from which the SQLi vulnerability of the web application can be inferred. After analyzing the exceptions, attackers try to inject malicious SQL queries to gain access to secured confidential information.
The following example explains the process shown in Fig. 1, in which an SQLi vulnerability is disclosed by converting a simple SQL query, produced from a URL, into a payload. Let the following URL belong to the target web application to be checked for SQLi vulnerability:
    http://www.example.com/product.php?id=10
After adding a single quotation mark at the end, the URL looks like this:
    http://www.example.com/product.php?id=10'
This generates an HTTP request to the web application's server, and the predefined SQL query for this URL is executed with the single quotation mark included:
    SELECT * FROM products WHERE id_product=$id_product //This query is to be executed
Because of the single quotation mark, the SQL query contains a syntax error and the database throws an error/exception:
    Warning: mysqli_fetch_array() expects parameter 1 to be mysqli_result, bool given in /nfs/c05/h02/mnt/83231/domains/example.com/html/product.php on line 67
From this error/exception, an attacker obtains fingerprinting information and can conclude that the web application's developer might not have followed all the coding conventions.
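The probing step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function names and the list of error fragments are assumptions introduced here, and no HTTP request is made, so the response body has to be supplied by the caller.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical error fragments chosen for illustration; real database
# drivers and frameworks emit many other messages.
ERROR_SIGNATURES = (
    "you have an error in your sql syntax",
    "mysqli_fetch_array() expects parameter 1",
    "unclosed quotation mark",
)


def add_quote_probe(url: str) -> str:
    """Return the URL with a single quote appended to its query string."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=parts.query + "'"))


def looks_like_sql_error(body: str) -> bool:
    """Report whether a response body contains a known error fragment."""
    lowered = body.lower()
    return any(signature in lowered for signature in ERROR_SIGNATURES)
```

Applied to the example above, `add_quote_probe` turns the product URL into the quoted variant, and `looks_like_sql_error` flags the `mysqli_fetch_array()` warning as a disclosure.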
There are numerous SQLi detection approaches that aim to reduce attacks exploiting SQLi vulnerabilities. These detection approaches can be of various types: detection using pattern matching [24][25][26][27][28][29][30][31][32], learning-based detection [33][34][35][36][37][38][39][40][41][42][43], and other approaches [44][45][46][47][48][49].
matching and feature-filtering method. They generated a tree of selected SQL syntaxes to make a pattern and used that pattern to identify SQLi attacks. D. Kar et al. [27] developed a tool called SQLiDDS, which transforms particular portions of the SQL query into plain text and compares it with the runtime SQL query. Inyong Lee et al. [28] proposed a method for SQLi detection that removes the attribute values of an SQL query of a web page when a user submits parameters and then compares the query with a predetermined query. A. Ghafarian [29] proposed a hybrid model that dynamically monitors the execution of all incoming SQL queries and performs string matching between the received query and the expected query. The result of the string matching process is then compared with the valid SQL query to identify SQLi attacks. R.P. Karuparthi and B. Zhou [30] proposed a dynamic query matching technique in which the runtime SQL query is compared against a sanitizer, then an SQL master file, and then a predefined threshold value to detect SQLi attacks. The result is then sent to an approximate matching algorithm for analysis. A. Kumar and S. Binu [31] proposed a method that builds a set of tokens from the SQL query and matches them with user input at runtime to detect SQLi attacks. N. Lambert and K. S. Lin [32] broke down various portions of the legal SQL query and compared them with the user-defined SQL query to identify SQLi attacks.

Learning-based approaches
Learning-based approaches detect SQLi using solutions based on machine learning. Nowadays, learning-based approaches perform better than traditional rule-based approaches. Rawat and Shrivastav [33] proposed a detection approach that applies query tokenization [32] to queries capable of performing SQL injection and classifies them with a support vector machine (SVM). Kamtuo and Soomlek [34] inspected queries that could be used to perform SQLi and defined specific conditions to produce a dataset for detecting SQLi vulnerability in a web application. They compared various machine learning algorithms and obtained the best performance with the decision jungle algorithm. Zhuang Chen et al. [35] constructed a dictionary from selected keywords of SQL injectable queries and HTTP requests, and used the word2vec algorithm to extract a dataset from the dictionary. They also selected SVM as the learning algorithm to predict SQLi vulnerability. Shar and Tan [17] achieved above 85% prediction accuracy on SQLi and cross-site scripting (XSS) vulnerabilities in different web applications using WEKA [36]. They implemented a tool named "PhpMinerI" to extract data by static analysis. Hua et al. [37] used statistical features and existing security knowledge to extract the features of SQLi attack requests. They proposed a web attack detection technology using SVM. Joshi and Geetha [18] used blank separation and query tokenization to prepare a dataset of SQLi detection features. They used a Naive Bayes algorithm to perform the detection operation. Moises et al. [38] analyzed the criteria of SQLMap to detect the frequencies of keywords and non-alphabetic characters used in SQLi. They used Naive Bayes and decision tree classifiers to classify SQLi attacks. D. Das et al. [39] indexed the strings of dynamic SQL queries and employed SVM to classify a runtime SQL query as normal or an attack attempt. Y. Wang and Z. Li [40] generated a parse tree of SQL queries, analyzed HTTP request parameters, and compared the result with another parse tree. They used an SVM classifier to detect SQLi attempts.
Other learning-based studies apply feature extraction methods to the detection of phishing web applications. Sahingoz et al. [110] proposed a language-independent real-time anti-phishing system that uses seven different classification algorithms and NLP-based features to detect phishing websites. Kasim [111] introduced an approach that evaluates a phishing event as soon as the web address is entered in the address bar, classifying deep-hybrid features with the Light Gradient Boosted Machine model. Yang et al. [112] proposed a fast deep learning-based phishing detection approach focusing on multidimensional features. Lakshmi et al. [113] suggested an approach to discovering phishing websites based on the hyperlinks in the HTML source code of the corresponding website. Basit et al. [114] reviewed various phishing attack detection techniques based on Artificial Intelligence (AI) to evaluate the qualities and shortcomings of the given methodologies.

Other Approaches
C. Ping [41] used the second-order SQLi to bypass the protection provided to web applications. They proposed adding random numbers to the selected keywords of executed queries to detect and prevent SQLi attacks. B.D. Priyaa and M. I. Devi [42] collected a query tree from database logs and extracted SQLi vulnerability features. They used an efficient data-adapted decision tree (EDADT) and binary SVM to effectively distinguish between SQLi attacks and normal SQL requests. P. Li et al. [43] proposed an SQLi attack detection technique that analyzes the user's interaction with the web application. They analyzed the user's log of web applications to find out the criteria of SQLi attacks and applied them to their model. A. Ciampa et al. [16] developed a tool to inspect SQLi vulnerability of web applications. K. Kemalis and T. Tzouramanis [44] proposed a technique where they extracted some syntactical structures of SQL queries of a web application and filtered runtime SQL queries based on those structures to detect SQLi attacks. V. Shanmughaneethi and S. Swamynathan [45] proposed syntactic verification with XML and error message customization for SQLi attack detection. Thus, researchers sought to detect and prevent SQLi attacks with either traditional rule-based approaches or machine learning-based approaches, while some worked with other approaches. However, no studies have so far used deep learning-based approaches in an SQLi attack scenario. Moreover, all the researchers focused on the SQL query, but no one considered the features of a web application to deal with SQLi attacks.
SQLi poses a formidable threat to the security and privacy of both clients and web applications. These threats arise when developers do not follow proper development conventions [1,30], which leaves the web application in a weakened state. Neglecting those conventions produces errors that reveal information [29,30] about the technologies behind the web application. The operators in SQL queries that attackers use to mount attacks on a web application have been used to drive learning-based SQLi detection [34,35,38]. Payloads delivered through user inputs [31] and clicking behavior [43] on different parts of a web application can also lead to SQLi attacks.
In view of the above, SQLi attacks can be performed through different strategies and through various parts of a web application. Those strategies and parts together can be called the web application's features, and they allow the implications of SQLi attacks to be delineated properly.
Among the types of SQLi detection solutions discussed earlier, learning-based approaches have become very popular due to their precision in detection problems. Shallow architectures such as Support Vector Machine (SVM) and Naive Bayes are now widely used in various detection and classification problems. However, these architectures have limitations that cause problems of their own. The relevant drawbacks of SVM are that it is time and memory consuming [46][47][48][49][50][51][52], that choosing the right kernel function is complex [51][52][53][54], that it has high algorithmic complexity [50][51][52], and that it cannot work with massive data [55]. The relevant drawbacks of Naive Bayes are its independence assumption [47,51,56,57] and its inconsistent behavior on large datasets, where it sometimes gives poor results [47,57] and sometimes good results [48,50,58]. For these reasons, the work done by other researchers on this issue might not provide very accurate results. Deep learning is another learning-based solution, and it has a very good reputation compared with other learning-based solutions [59,60]. Moreover, deep learning-based solutions have been shown to provide better results than other learning-based approaches [59][60][61][62][63][64][65], including SVM and Naive Bayes.
To address the above problem, this paper proposes a technique to detect a web application's SQLi vulnerability based on various web features using deep learning. The technique focuses on web features that could be involved in disclosing sensitive data or that can lead to unauthorized access to the database of a web application. Those web features were selected after analyzing the responses that result from various SQLi actions on web applications. Moreover, the appearance of the names of these web features in scholarly articles on SQLi strengthens the case for those features.
The contributions of the present study are as follows:
• A dataset with 19 SQLi vulnerability features has been produced using manual penetration testing following a double-blind testing strategy.
• A deep learning-based approach has been used to detect SQLi vulnerabilities.
The rest of the document is organized as follows: Section II describes the overall architecture of our SQLi vulnerability detection framework and explains in detail how the framework works. Section III shows the experimental results and presents the final feature set used in our framework. Section IV summarizes this research and describes the future outlook.
Fig. 2 shows the proposed model of SQLi vulnerability detection using a feed-forward neural network. The model is designed to find SQLi vulnerabilities based on predefined rules and to decide whether a web application is vulnerable or not. It covers the extraction, processing, and optimization of the dataset, and it uses feature selection methods along with the feed-forward neural network to bring the dataset into the state that produces the best results. Finally, the neural network model is used to predict the SQLi vulnerability of the target web application from the dataset. Features are engineered on the basis of the logical conditions of SQL procedures and the usability facilities of web applications. Those features are used to perform data collection by checking every aspect of each web application. In this way, real web data related to SQLi were acquired and combined with data generated from the real data to obtain an operational dataset. The features are then preprocessed using predefined processing techniques, and feature selection methods are applied to obtain the final dataset. That dataset is then passed to the classification step, from which a decision about a particular web application can be made.

PROPOSED MODEL
From a user's perspective, the SQLi vulnerability detection process using our model has the following steps:
(1) The user takes the web address of the desired web application and passes it to a script written in Python.
(2) The user is directed to the web application that could be vulnerable to SQLi. This web application is the test data.
(3) The script extracts the features of the test data (the current web application) and saves them in a data structure.
(4) To predict the type of the web application, the implemented model is activated. The classifier applies the rules it learned from previous data collected from other web applications.
(5) After the prediction operation, a warning is raised if the web application is vulnerable to SQLi; otherwise, the web application is marked as benign.
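Steps (3) to (5) can be outlined as a small driver function. This is only a sketch under stated assumptions: `assess`, the 0.5 threshold, the `error_disclosed` feature name, and the two stand-in callables are hypothetical and introduced here for illustration; the real feature extractor and trained classifier would be plugged in instead.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def assess(url: str,
           extract: Callable[[str], Features],
           predict: Callable[[Features], float],
           threshold: float = 0.5) -> str:
    """Extract features, score them, and report a verdict for one URL."""
    features = extract(url)    # step (3): features saved in a dictionary
    score = predict(features)  # step (4): probability from the classifier
    # step (5): warn when the score reaches the (assumed) threshold
    return "vulnerable" if score >= threshold else "benign"


# Toy stand-ins so the sketch runs without a network or a trained model.
def demo_extract(url: str) -> Features:
    return {"error_disclosed": 1.0 if url.endswith("'") else 0.0}


def demo_predict(features: Features) -> float:
    return 0.9 if features["error_disclosed"] else 0.1
```

Passing the extractor and the classifier in as arguments keeps the driver independent of how either one is implemented.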
The following sections explain the methodologies used to build the proposed framework. The framework supports the prediction model so that it can predict a web application's SQL injection vulnerability. The sections describe the technique used for SQL injection vulnerability detection and prediction, the neural network model, and the research methods and plan.

Formal definitions of dataset
As stated earlier, this paper focuses on the various features of a web application that can put it at risk. Since no state-of-the-art work has examined this issue, a new dataset had to be constructed from these web features. The dataset consists of data obtained by practical testing on various web applications, in order to maintain efficiency and precision. Based on those real-life data, some synthetic data were created to prevent data inadequacy when training deep learning models. The two sets were combined and passed through several processing techniques to increase the integrity of the dataset.

Feature Extraction for SQLi Vulnerability
Based on the previous literature, a feature extraction process was carried out in this study to identify the essential features for detecting SQLi vulnerability in a web application. These features help the proposed model to classify the weakness of any given web application effectively and efficiently. In this study, 19 different features have been used to determine the class of vulnerability. These features are discussed in the following sections.

Selected features
The following subsection presents the features used in our detection operation, as well as their corresponding rules.

Preprocessing
We used inconsistency reduction, null-value handling, and data standardization to preprocess the data. These techniques are commonly used to ensure the integrity of data in a dataset.

Inconsistency Reduction
Several types of inconsistencies have been discussed in the previous literature [80][81][82]. They cause serious problems in the dataset and get in the way of correct decision-making, thereby decreasing usability [83]. Inconsistencies weaken the relationships between sets of data and leave the dataset in an undesired state, which calls its integrity into question. To counter this problem, other researchers [82][84][85][86] have used various techniques and mechanisms. Inconsistent data may lead to unreliable training accuracy in the model.

Missing Value Reduction
In data preprocessing, missing values remain a concern to researchers because they reduce the usefulness of the data in the dataset and force classification algorithms to produce unreliable and insubstantial results [87][88][89]. They can bias features and distort the desired output from the dataset [90]. Steps to handle this problem are discussed in [89][90][91][92].
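One common way to handle missing values is to fill each gap with the mean of the observed values in the same column. The sketch below illustrates that idea only; the text does not state which strategy was applied to this dataset, so mean imputation and the name `impute_column_means` are assumptions made here.

```python
from statistics import mean
from typing import List, Optional


def impute_column_means(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Replace None entries with the mean of the observed values in that column."""
    filled = [list(row) for row in rows]  # copy so the input is left untouched
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if row[col] is not None]
        # Fall back to 0.0 when a column has no observed value at all.
        fallback = mean(observed) if observed else 0.0
        for row in filled:
            if row[col] is None:
                row[col] = fallback
    return filled
```

For categorical or binary features, filling with the most frequent value is a frequent alternative to the mean.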

Data Standardization
Data standardization is the data processing technique that converts the structure of an unbalanced dataset into a common data format [93,94]. It improves the quality of the attributes in the dataset and prevents any single attribute from dominating [96]. Standardization transforms the data after they are pulled from the source and before they are loaded into the model for training. Without it, some features in the dataset can unduly influence the training process and bias the accuracy [93][94][95].
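A widely used form of standardization rescales every column to zero mean and unit variance (the z-score). The following sketch assumes that form; the text does not name the exact transformation, and `standardize` is a name chosen here for illustration.

```python
from statistics import mean, pstdev
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; divide by 1.0 to avoid an error.
    deviations = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, deviations)]
        for row in rows
    ]
```

After this transformation no attribute dominates merely because it is measured on a larger scale.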

Feature Selection
The correlation coefficient and chi-square feature selection methods have been used to gain a better understanding of the features. These methods were chosen because they are independent of the model. Correlation helps to find the relationships between features, because highly correlated features can influence the training of the model and lead to biased predictions [97][98][99]. The chi-square method helps to understand the significance of the features by ranking them [100][101][102][103]. The later sections discuss these techniques.
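The correlation step can be illustrated with the Pearson coefficient computed between every pair of feature columns. This is a sketch under assumptions: the 0.9 threshold and the function names are hypothetical, and the text does not say which pairs were finally discarded.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns: Dict[str, List[float]],
                     threshold: float = 0.9) -> List[Tuple[str, str]]:
    """List feature pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]
```

One feature from each reported pair is a candidate for removal, since the two carry nearly the same information.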

Deep Learning for SQLi Vulnerability Classification
A deep learning model was chosen for the prediction operation because deep learning has shown strong performance on classification problems. In recent years researchers have had a good success rate in solving detection problems using deep learning [59][60][61][62][63][64][65]. The deep learning model used in this paper contains 19 input neurons and one hidden layer. The activation functions used in the model are a rectified linear unit (ReLU) in the first layer and a sigmoid in the second layer. A formal description of the neural network is provided below.
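The architecture described above (19 inputs, one ReLU hidden layer, one sigmoid output) can be sketched as a plain-Python forward pass. The hidden-layer width, the weight values, and the 0.5 decision threshold are not given in the text and are left to the caller or assumed here; the function names are illustrative.

```python
from math import exp
from typing import List


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def sigmoid(z: float) -> float:
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def dense(weights: List[List[float]], biases: List[float],
          inputs: List[float]) -> List[float]:
    """Weighted sum plus bias for every unit of a fully connected layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x: List[float], w_hidden: List[List[float]], b_hidden: List[float],
            w_out: List[float], b_out: float) -> float:
    """Probability produced by the hidden ReLU layer and the sigmoid output."""
    hidden = relu(dense(w_hidden, b_hidden, x))
    return sigmoid(dense([w_out], [b_out], hidden)[0])


def predict_label(probability: float, threshold: float = 0.5) -> int:
    """Map the output probability to 1 (vulnerable) or 0 (benign)."""
    return 1 if probability >= threshold else 0
```

In practice the weights come from training, for example with the Keras library mentioned in the experimental setup; the sketch only shows how one sample flows through the layers.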
Let $n$ be the neural network, $l_n \in \{1,2,3\}$ a layer in the neural network $n$, $x_n$ the input to the neural network $n$, $i^{(l_n)}$ the incoming inputs to the layer $l_n$, $r_n$ the output of the layer $l_n$, $w^{(l_n)}$ the weights of the layer $l_n$, $b^{(l_n)}$ the biases of the layer $l_n$, $f$ the activation function (ReLU) in the layer $l_n$, and $o$ the function in the output layer (sigmoid). The feed-forward operation of the neural network is given by Equation (4). The predefined labels for the inputs are determined as shown in Equations (5) and (6). $L$ denotes the predefined labels of the inputs, and $L0$ denotes the predicted label result.
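Equation (4) is not reproduced in this text. A standard feed-forward formulation consistent with the symbols defined above would read as follows. This is a reconstruction under that assumption and not necessarily the authors' exact equation; in particular, the output of layer $l_n$ is written here as $r^{(l_n)}$ so that consecutive layers can be chained.

```latex
\begin{aligned}
r^{(0)} &= x_n, \\
i^{(l_n)} &= w^{(l_n)}\, r^{(l_n - 1)} + b^{(l_n)}, \\
r^{(l_n)} &= f\!\left(i^{(l_n)}\right) \quad \text{(hidden layer, ReLU)}, \\
L0 &= o\!\left(i^{(l_n)}\right) \quad \text{(output layer, sigmoid)}.
\end{aligned}
```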
During the training process, the used neural network model is tuned to minimize the value of the loss function, i.e. cross-entropy function. The loss function is described as follows:
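For a single sigmoid output and binary labels, the cross-entropy loss usually takes the binary form below. The text does not reproduce the authors' equation, so this is the standard expression under that assumption: $N$ (the number of training samples) and the index $j$ are symbols introduced here, $L_j$ is the predefined label of sample $j$, and $\hat{L}_j$ stands for its predicted value (the quantity the text calls $L0$).

```latex
J = -\frac{1}{N} \sum_{j=1}^{N}
    \left[ L_j \log \hat{L}_j + \left(1 - L_j\right) \log\!\left(1 - \hat{L}_j\right) \right]
```

The loss approaches zero when every prediction $\hat{L}_j$ is close to its label $L_j$ and grows without bound as a prediction approaches the opposite label.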

Regularization
Overfitting is one of the major problems in training a neural network model. A neural network that is overfitted to a particular training set cannot be used for classification in general. In the neural model of this research, dropout regularization [104] is used to avoid the overfitting problem. Dropout is a technique that skips some units of the neural network during training: the incoming and outgoing connections of randomly chosen neurons are removed with a fixed probability. This prevents the neural network model from becoming too dependent on a specific set of units and their associated weights and biases. In this research the neural network is modeled with a dropout rate of 0.2, a rate commonly used in neural network-based models.
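The effect of a dropout rate of 0.2 on one layer's activations can be sketched as follows. The sketch uses the "inverted" variant, in which surviving activations are scaled by 1/(1 - rate) during training so that nothing has to change at prediction time; that variant and the function signature are assumptions made here for illustration.

```python
import random
from typing import List, Optional


def dropout(activations: List[float], rate: float = 0.2,
            rng: Optional[random.Random] = None,
            training: bool = True) -> List[float]:
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is disabled outside training
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

With a rate of 0.2, roughly one unit in five is silenced on each training step, and a different random subset is chosen every time.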

Validation
The overfitting problem discussed in the section on regularization can also be addressed by cross validation. Here, K-fold cross validation [105] has been used to validate the efficiency of the data in the dataset, as well as to make the training process effective. K-fold cross validation splits the whole dataset into equal folds by random selection of samples, so that every part of the dataset serves as a validation set. During the validation process, one fold is selected for testing while the rest are used for training. This process is repeated until every fold has been used as the test set. This ensures that all the data in the dataset are traversed and helps to build a good understanding of the dataset used in the model. The dataset in this research was split into several folds with 30 samples in each fold.
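The fold construction described above can be sketched with a shuffled index list cut into folds of 30 samples. The function name and the seed parameter are illustrative assumptions; when the number of samples is not a multiple of the fold size, the last fold in this sketch is simply smaller.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, fold_size: int = 30,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, using each fold once as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment of samples
    folds = [indices[start:start + fold_size]
             for start in range(0, n_samples, fold_size)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Averaging the accuracy over all yielded splits gives an estimate that does not depend on one particular train/test partition.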
The novelty of the work lies in creating a new dataset by extracting 19 SQLi features after reviewing the state-of-the-art literature and conducting black-box manual penetration testing to verify the legitimacy of the records. A feed-forward neural network classification algorithm has been selected to detect SQL injection vulnerability, and both the correlation coefficient and the chi-square method are used for feature selection to increase its accuracy.

EVALUATION

Experimental Environment
We used a Windows 10 computer with an Intel Core i3-5005U CPU, an integrated GPU, and 4GB RAM to deploy our proposed model. We implemented the modules of the model in Python, and most of them were executed on the CPython interpreter [106]. The feed-forward neural network was implemented using the Keras library [107]. Scikit-learn [108] and TensorFlow [109] were also used to implement the clustering and other machine learning algorithms.
Linear correlation was used to identify correlations among the 19 input attributes across the 1,850 SQL injection vulnerability samples in the dataset supplied to the neural network. After the correlation among the 19 features was found, an iteration was performed over the correlation range of -1 to 1 using the scores obtained from the correlation operation. In each phase of the iteration, the neural network was trained on the features retained for that correlation range to see which features are effective for detecting the vulnerability. The iteration continued until the best accuracy was obtained. Fig. 3 shows the reduced features and the correlation heatmap between them.

Chi-square of correlated input attributes
The chi-square feature selection method is applied to the features reduced by correlation to find out the significance level of the features shown in Fig. 4. No samples were reduced during this process. After applying chi-square, the same type of iteration has been done as in the correlation operation to identify important features. This time the iteration is performed based on the chi-square value of significance.
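The ranking step can be illustrated with the chi-square test of independence between one binary feature and the binary class label. This is a sketch under the assumption that features are coded as 0/1; the function names are hypothetical, and the text does not state which chi-square variant or library routine was actually used.

```python
from typing import Dict, List


def chi_square(feature: List[int], label: List[int]) -> float:
    """Chi-square statistic of the 2x2 contingency table feature vs. label."""
    n = len(feature)
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            observed = sum(1 for a, b in zip(feature, label)
                           if a == f and b == c)
            expected = feature.count(f) * label.count(c) / n
            if expected > 0:  # skip cells that cannot occur
                score += (observed - expected) ** 2 / expected
    return score


def rank_features(columns: Dict[str, List[int]],
                  label: List[int]) -> List[str]:
    """Order feature names from most to least associated with the label."""
    return sorted(columns,
                  key=lambda name: chi_square(columns[name], label),
                  reverse=True)
```

A larger score means the feature and the label are less likely to be independent, so features at the top of the ranking are kept first during the iteration.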

Model validation
Model validation refers to the integrity and reliability of the neural network model. A validated model ensures a better acceptance of classification results. The used model was validated based on a comparison of the results of regularization and validation. Both suggest that the model is reliable to use in this scenario.

Deep learning model accuracy
The labeled dataset was split into 90% for training and 10% for testing. The neural network was trained for 500 epochs with a batch size of 30, and its weights were randomly initialized. With 17 input attributes retained after correlation and chi-square selection, and more than 1,500 samples, the neural network model obtained 98.22% accuracy (Fig. 4). The loss of the model (Fig. 5) indicates how poorly the model fits the data. The model sustained very little loss; hence, it can be said that the model is close to accurate.
Table 3 compares the different classifiers used in this research to check whether any other machine learning algorithm can outperform the neural network. SVM, random forest, and Naive Bayes were used because they were used by the other researchers discussed under the learning-based approaches in the section on related work; they achieved accuracies of 94.66%, 97.33%, and 84.49%, respectively. In this research, however, the neural network provided the best performance, with an accuracy of 98.04%.

Table 4. SQLi Vulnerability Detection Accuracy Comparison
Literature                      Accuracy
Shar and Tan [17]               85.00%
Joshi and Geetha [18]           93.30%
Rawat and Shrivastav [33]       96.47%
Lei and Shen [37]               93.30%
Lodeiro-Santiago et al. [38]    97-98%
Proposed Method (NN)            98.04%

Table 4 compares the SQLi vulnerability detection accuracy of the proposed model with that of five other models. The models of Shar and Tan [17], Joshi and Geetha [18], Rawat and Shrivastav [33], Lei and Shen [37], and Lodeiro-Santiago et al. [38] obtained accuracies of 85.00%, 93.30%, 96.47%, 93.30%, and 97-98%, respectively, whereas our proposed model achieved 98.04% accuracy.

Figure 6. Adeptness of model at sustaining less loss.

Figs. 5 and 6 depict model training accuracy and loss, respectively. The progress of the training process can be visualized from these plots. The good accuracy shown in Fig. 4 is due to the iteration performed. The model training loss shown in Fig. 5 is very low, and a lower loss generally corresponds to better accuracy.

CONCLUSION AND FUTURE WORK
The objective of this paper has been to propose a framework for detecting the SQL injection vulnerability of a web application using deep learning, by extracting various points of the web application at which vulnerabilities can be found. The paper primarily describes the deep learning part, since it is the core of SQL injection vulnerability prediction. More than 1,850 samples of injection vulnerability with selected features are passed through a neural network model for prediction. The system delivers an accuracy of 98.04%, which is better than that provided by existing systems. Future experiments should focus on building an automated tool and an effective way to prevent SQL injection.

APPENDIX A Data Collection Algorithm
An algorithm was devised to perform the data collection operation for every feature. A Python-based tool was built that takes a particular web application and checks whether the predefined conditions of the feature set are met. If the check succeeds according to the conditions, the data collection operation takes place. An overview of the algorithm is given below for better understanding.