
A new approach to software vulnerability detection based on CPG analysis

Article: 2221962 | Received 25 Oct 2022, Accepted 31 May 2023, Published online: 29 Jun 2023

Abstract

Detecting source code vulnerabilities is an essential task today. In this paper, to improve the efficiency of detecting vulnerabilities in software written in C/C++, we propose to combine a Deep Graph Convolutional Neural Network (DGCNN) with the code property graph (CPG). Specifically, the proposed method consists of three main phases. Phase 1 builds feature profiles of the source code; at this step, we use analysis techniques such as Word2vec and one-hot encoding to standardize and analyze the source code. Phase 2 extracts features of the source code based on the feature profiles; here, we propose to use the DGCNN model to analyze and extract features of the source code. Phase 3 classifies the source code based on the features extracted in phase 2 in order to distinguish normal source code from source code containing security vulnerabilities. Several scenarios comparing and evaluating the proposed method against other approaches show the superior effectiveness of our approach. This result proves that our method is not only correct and reasonable, but also opens up a new approach to the task of detecting source code vulnerabilities.

1. Introduction

1.1. Problem

A vulnerability is a weakness in a system that attackers can exploit to damage the system's safety and security attributes, including confidentiality, integrity, and availability, according to Shen and Chen (Citation2020). In that study, source code vulnerabilities are defined and classified in two ways: by software defect and by the software development process. Common Vulnerabilities and Exposures (CVE) (http://cve.mitre.org) reports the danger level of current source code security vulnerabilities. Therefore, detecting source code vulnerabilities early is now an urgent issue.

To detect source code security vulnerabilities, Z. Li et al. (Citation2018) pointed out two main approaches: static analysis and dynamic analysis. Static analysis methods, which combine techniques such as pattern matching, lexical analysis, data flow analysis, and analysis based on the abstract syntax tree (AST) (Dam et al., Citation2018; Grieco et al., Citation2016; Wei & Li, Citation2017; L. Li et al., Citation2017; Li et al., Citation2016, Citation2021), have been widely researched and applied because of their high efficiency and their ability to detect security vulnerabilities early. Dynamic analysis studies usually focus on two main approaches: semantic analysis and syntactic analysis. Other papers (Cheng et al., Citation2019, Citation2021, Citation2022; Sui et al., Citation2020; Xu et al., Citation2019) follow similar ideas and techniques. However, we noticed that applying these two approaches faces the following problems:

  • The syntactic analysis-based detection method has the advantage of showing the syntactic structure and content of source code in detail. Its disadvantage is that it cannot show the data flow or control flow needed to understand the overall logic of the source code. This makes it difficult to find vulnerabilities: with hundreds of thousands of lines of code, if the flow of the source code is not understood, it is hard to grasp its meaning even when each statement can be read in detail. For example, one may see that a global variable is declared in one function but not know whether it is assigned a value in another function (Sahu & Srivastava, Citation2018, Citation2020; Almulihi et al., Citation2022; Sahu et al., Citation2021).

  • On the contrary, the semantic analysis-based detection method can show the data flow and control flow of the program, helping to understand its specific flow, but it does not support parsing the program's syntax in detail. This makes it difficult to analyze vulnerabilities in each statement in depth.

To overcome the above situation, recent studies have parsed the source code into code representation graphs such as the AST, Control Flow Graph (CFG), and Program Dependence Graph (PDG), and then applied machine learning and deep learning classifiers such as Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Bidirectional Long Short-Term Memory (BiLSTM). However, these approaches have the following problems:

  1. The method of representing and synthesizing source code features: with certain datasets or vulnerability types, using one of these representations (AST, CFG, or PDG) alone can bring good results, but it is less effective on other, unknown vulnerabilities. The reason is that each representation focuses on one type of source code feature and therefore loses other important features. This makes it difficult for these methods to build an information profile that fully represents the features of the source code.

  2. The method of extracting source code features: in traditional approaches, after the source code is synthesized and normalized, it is fed into popular machine learning and deep learning models to search for, extract, and train on features. However, this way of extraction and training loses important features of the source code, because some features are lost or hidden during normalization and parameter synchronization.

To solve the above problems, we propose a new approach that combines deep graph networks with a method of building feature profiles of source code. Our proposed method addresses the two disadvantages above as follows: for problem (i), instead of using single representations such as the AST, CFG, or PDG, we use the CPG to represent source code and then apply some natural language processing methods to encode the source code information; for problem (ii), instead of using traditional deep learning networks, we use DGCNN for source code feature extraction:

(1) Task 1: Building feature profiles of the source code. In this study, we propose a method of synthesizing source code features based on the CPG processing method and some natural language processing (NLP) models, namely Word2vec and one-hot encoding. The main steps in task 1 include:

  • Step 1: Representing the source code by the CPG graph. This step represents the source code as a graph whose features lie on vertices and edges. The process of source code processing and analysis using CPG graphs is presented in detail in Section “The method of building feature profiles of source code” of the paper.

  • Step 2: Normalizing vertex features using the Word2vec model. At this step, after the source code is analyzed and normalized by the CPG graph, it is further processed through the Word2vec model.

  • Step 3: One-hot encoding. This step embeds the attributes and features of the edges.

  • Step 4: Synthesizing source code features. At this step, the results of the computation and processing on the vertices and edges are synthesized into an information profile showing features of the source code.

(2) Task 2: Extracting source code features using DGCNN. The purpose of this step is to use DGCNN to extract source code features based on the graph feature profiles built in task 1. Details of this process are presented in Section 3.3 of the paper.

(3) Task 3: Classifying the source code. This step is responsible for detecting vulnerabilities in source code based on the features extracted in task 2.

1.2. Contribution

Based on the analysis and evaluation of the characteristics and processes of the proposed approach, the practical significance and scientific contributions of this research include:

  • Proposing a method of building feature profiles of source code. To build a source code feature profile, we combine several data mining techniques, including the CPG source code representation technique and data processing techniques on edges and vertices. With this approach, the code representation graphs AST, CFG, and PDG are combined into a common data structure, so the source code is represented as completely and clearly as possible in terms of both syntax and semantics. In addition, NLP methods such as Word2vec and one-hot encoding are used to normalize and process the data on vertices and edges to ensure that the data is collected as fully as possible. This proposal is important for the task of detecting source code vulnerabilities because it helps to find and represent the relationships between components in the source code, thereby improving the efficiency of the vulnerability detection process.

  • Proposing to use the DGCNN deep graph network for the task of extracting source code features. This is a new proposal for analyzing and extracting graph-based features of source code. The DGCNN network allows the computation and feature synthesis to proceed smoothly, avoiding loss and redundancy of graph data. The experimental results in the paper show that the DGCNN model is more effective than other deep learning models at selecting and extracting graph features.

  • Proposing a method of detecting source code security vulnerabilities based on the technique of building source code feature profiles and the DGCNN deep graph network. To our knowledge, this approach has not been published in any previous study. With this proposal, the basic features of the source code are exploited and trained in the most explicit way, improving the efficiency of analyzing and evaluating abnormal features of source code. The experimental results in Section “Result evaluations” demonstrate the correctness and scientific soundness of the proposal.

The rest of the paper is organized as follows: In Section “Related Work”, we study and examine some previous studies for the task of source code vulnerability detection. The contents related to the proposed method are analyzed and presented in Section “The proposed model”. The experimental results and evaluations of the effectiveness of the proposed method are presented in Section “Experiments and evaluations”. Finally, conclusions and future development directions are presented in Section “Conclusion and development directions” of the paper.

2. Related work

2.1. Rule-based source code vulnerability detection

Chen et al. (Citation2017) pointed out that current vulnerability detection methods are often based on algorithms that traverse the AST and match AST nodes against vulnerability rules. This affects detection speed because matching a large number of unnecessary rules takes time. That study proposed an optimized rule-checking algorithm to speed up rule matching on the AST, resulting in a 28.7% improvement over the original PMD. Tian et al. (Citation2009) applied an improved Apriori algorithm to web application vulnerability detection to improve the ability to detect unknown vulnerabilities. This method used association rules to analyze the logical relationships between the components of the application, and it could detect most of the vulnerabilities present on all pages of the target site. Hu et al. (Citation2020) described vulnerabilities such as memory leak, double-free, and use-after-free using the CFG and Pointer-related Control Flow Graph (PCFG) frameworks combined with two algorithms, Vulnerability Feature (VJVF) and Feature Judging, for the detection task. The results showed that this MRVDAVF method was more effective than tools such as cppcheck, flawfinder, and splint on memory leak, double-free, and use-after-free vulnerabilities, but its detection ability was not diverse. Lee et al. (Citation2006) pointed out a disadvantage of rule-based vulnerability detection tools: they do not adapt to software in general and are often limited in techniques and technology. The authors proposed a model with three components: Analysis, Rule Processing, and Detection. The Analysis and Rule Processing components mitigated the problem, but the model still had limitations in complexity and time. Other approaches have also been proposed (Sahu et al., Citation2020, Citation2021).

2.2. Vulnerability detection using machine learning, deep learning

Russell et al. (Citation2018) argued that popular source code vulnerability detection tools only detect a limited set of vulnerabilities based on a predefined ruleset, whereas applying machine learning and deep learning to vulnerability detection makes it possible to learn features directly from the source code. This is the basis for detecting more types of vulnerabilities. Harer et al. (Citation2018) presented a data-driven, machine learning approach to vulnerability detection applied to C and C++ programs; their deep learning model outperformed traditional machine learning models with a ROC AUC of 0.87. Tang et al. (Citation2021) proposed combining the Extreme Learning Machine (ELM) model with Doc2Vec to represent the program, and demonstrated that Doc2Vec provided faster training and better generalization performance than Word2Vec. With the same idea, Do Xuan et al. (Citation2022) presented a method to detect security vulnerabilities using machine learning algorithms and NLP methods.

2.3. Vulnerability detection based on the flow graph

Wu et al. (Citation2020) emphasized that source code feature extraction is an extremely important step that greatly affects detection results. Because previous methods could not fully exploit this aspect, the authors extracted features from the CPG graph for the vulnerability detection task. M. Li et al. (Citation2021) showed that previous deep learning-based methods ignored the connections between semantic graphs and did not process graph-structured information efficiently; applying Graph Neural Network (GNN) models gave insight into the security vulnerability detection problem. Their experiments also showed that the graph-based model was better than the sequence-based model, by up to 5.01% in accuracy on the semantic graph. Wang et al. (Citation2020) pointed out that approaches using recurrent networks such as the Recurrent Neural Network (RNN), LSTM, and Bi-LSTM performed poorly on vulnerability detection: conventional recurrent models designed for sequential data do not suit code representations based on complex graphs such as the AST, CFG, and CPG. GNNs, in contrast, operate directly on graph data, which makes it possible to exploit the complex relationships in this data type and improve detection efficiency. Ben-Nun et al. (Citation2018) presented a natural language processing method combined with the inst2vec algorithm for detecting C and C++ source code vulnerabilities; in their experiments, they compared their method with approaches based on basic machine learning and deep learning models such as RNNs and tree-based CNNs, and obtained higher results than most approaches on the same dataset. Besides, Y. Li et al. (Citation2021) built the IVDetect model, which combines two main methods: considering the vulnerable statements and their surrounding contexts via data and control dependencies, and artificial intelligence.

3. The proposed model

3.1. The model architecture

The functions and tasks of the sub-blocks in the proposed model, shown in Figure 1, are:

Figure 1. The architecture of the proposed model.

  1. Source code: In this paper, the datasets used are FFmpeg and Qemu (https://ffmpeg.org/download.html). These source codes are written in C and C++.

  2. Standardizing source code: In this block, the source code is pre-processed, normalized to remove empty data, spaces, etc.

  3. Building feature profiles of source code: To do this, we propose to use some processing sub-blocks in this block as follows:

    • CPG block. This block represents the source code as a CPG graph using the Joern tool (https://joern.io/). The result is that the Joern tool parses the syntax and semantics of the source code to represent it as a CPG graph.

    • Vertex feature normalization block. Accordingly, when using the Joern tool, the source code is processed into vertices and edges. Next, the vertex information is processed and normalized into feature vectors by using the Word2vec model.

    • Edge feature normalization block. This block has the function of processing edge data to normalize this data into feature vectors. In this paper, we use the one-hot encoding method to normalize the edge feature.

    • Source code feature synthesis block. This block combines edge and vertex data into an information profile containing information about structural and semantic features of the source code.

  4. Extracting source code features. In this paper, we propose to use the DGCNN model to extract features of source code represented in CPG form.

  5. Source code classification. This block classifies source code based on the features extracted by the previous block. Its output is the label of the source code (normal or containing security vulnerabilities).

3.2. The method of building feature profiles of source code

3.2.1. Representing source code using the CPG graph

CPG was introduced by Yamaguchi et al. (Citation2014). CPG combines three code representation graphs, namely the AST (Yamaguchi et al., Citation2012), CFG (Gascon et al., Citation2013), and PDG (Ferrante et al., Citation1987), into a common data structure. A CPG graph consists of the following main components:

  • Nodes and node types. Nodes represent the program structure. They include low-level language constructs such as methods, variables, and control structures, as well as higher-level constructs such as HTTP endpoints or findings. Each node has a type that specifies the kind of program structure it represents; for example, a node of type METHOD represents a method, while a node of type LOCAL represents a local variable declaration.

  • Labeled edges. Relationships between program structures are represented through edges between the respective nodes. For example, to express that a method contains a local variable, we can create an edge labeled CONTAINS from the METHOD node to the LOCAL node. Labeled edges make it possible to represent many relationship types in the same graph, and multiple edges can exist between the same two nodes.

  • Key-value pairs. Nodes contain key-value feature pairs, where the valid keys depend on the node type. For example, a method has at least a name and a signature, while a local declaration has at least the name and type of the declared variable.

In this paper, we use the Joern tool to represent the source code as a CPG graph. Joern is a platform that supports analyzing source code, bytecode, and binary code; it was developed to support vulnerability detection and research based on static analysis of source code. Given C/C++ program source code as input, the Joern tool analyzes it into a CPG graph consisting of edges and vertices.
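
As a minimal sketch of this step, the CPG can be generated by driving Joern's command-line tools from Python. The exact flags vary between Joern versions, so the invocation below is an assumption to be checked against the installed release:

```python
import subprocess

# Sketch: parse C/C++ sources into a CPG with Joern, then export it as graph
# files for downstream processing. Flag names may differ by Joern version.
def source_to_cpg(src_dir: str, cpg_path: str = "cpg.bin", out_dir: str = "cpg_export") -> None:
    # joern-parse builds the binary CPG from the source tree.
    subprocess.run(["joern-parse", src_dir, "--output", cpg_path], check=True)
    # joern-export writes the graphs (vertices and labeled edges) to out_dir.
    subprocess.run(["joern-export", cpg_path, "--repr", "all", "--out", out_dir], check=True)

source_to_cpg("ffmpeg_functions/")  # hypothetical directory of C functions
```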

3.2.2. The method of normalizing vertex features

As described above, the Joern tool analyzes the source code into the two main components of the CPG graph: edges and vertices. Each vertex carries three main pieces of information: the vertex index, the vertex type, and the code at that vertex. The paper normalizes this information through the following three tasks:

(1) Task 1: Normalizing vertex types.

Table 1 below describes the vertex types of the CPG graph.

Table 1. Vertex types of source code when analyzing into CPG graph

From Table 1, it can be seen that after the source code is analyzed by the Joern tool, there are 39 different vertex types, each representing a different property and characteristic of the source code. Next, we normalize these vertex types to obtain their feature vectors using one-hot encoding. One-hot encoding transforms each value into a binary vector containing only 0s and a single 1. Tomas et al. (Citation2013) and Le and Mikolov (Citation2014) presented the operating principle and implementation steps of one-hot encoding in detail. After encoding, the vertex types become vectors of equal length, and each vector is unique.
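
A minimal sketch of this encoding, using a shortened type list for illustration (the full 39-entry list comes from Table 1):

```python
import numpy as np

# Illustrative subset of the 39 CPG vertex types from Table 1.
VERTEX_TYPES = ["METHOD", "LOCAL", "CALL", "IDENTIFIER", "LITERAL"]

def one_hot_vertex_type(vtype: str) -> np.ndarray:
    """Binary vector with a single 1 at the position of the vertex type."""
    vec = np.zeros(len(VERTEX_TYPES), dtype=np.float32)
    vec[VERTEX_TYPES.index(vtype)] = 1.0
    return vec

print(one_hot_vertex_type("LOCAL"))  # [0. 1. 0. 0. 0.]
```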

(2) Task 2: Normalizing the code on the vertex.

We propose to use the Word2vec model to standardize this code. Word2vec represents a word through its distributional relationships with the remaining words (Le & Mikolov, Citation2014; Tomas et al., Citation2013). Word2vec uses a two-layer neural network with a single hidden layer; its input is a large corpus and its output is a vector space in which each unique word in the corpus is associated with a corresponding vector. In Word2vec, a distributed representation is used to create a multidimensional vector for each word: instead of a one-to-one connection between a vector element and a word, the representation of a word is spread over all elements of the vector, and each element contributes to the definition of many different words. Word2vec has two models, Skip-gram and CBOW (Le & Mikolov, Citation2014; Tomas et al., Citation2013); this paper only uses the Skip-gram model to analyze and normalize the code. With Skip-gram, the input is a word in the sentence, and the algorithm looks at the words around it; the number of surrounding words considered is called the “window size”. To make the text trainable, a dictionary is built from the text dataset and each word is described by a one-hot vector before being fed into the network. The Skip-gram model consists of three main components: an input layer, a hidden layer, and an output layer.
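
A minimal Skip-gram sketch using the gensim library (our choice; the paper does not name an implementation), trained on illustrative token streams drawn from CPG vertices:

```python
from gensim.models import Word2Vec

# Toy token streams; in practice these come from the code stored on CPG vertices.
token_streams = [
    ["int", "len", "=", "strlen", "(", "buf", ")", ";"],
    ["memcpy", "(", "dst", ",", "src", ",", "len", ")", ";"],
]

# sg=1 selects the Skip-gram model; window is the "window size" described above.
model = Word2Vec(token_streams, vector_size=64, window=5, sg=1, min_count=1, epochs=50)

code_vec = model.wv["strlen"]  # 64-dimensional embedding of one code token
```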

(3) Task 3: Synthesizing vertex information.

In this task, we combine the two vectors obtained in tasks 1 and 2 into a single vector representing the node of the graph. Thus, starting from the discrete information collected and analyzed by the Joern tool, the study uses NLP methods, namely Word2vec and one-hot encoding, to normalize this information. The normalized information contains important values and features representing the vertices of the graph, and these features distinguish source code containing vulnerabilities from normal source code.
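
A sketch of this fusion, reusing one_hot_vertex_type and model from the two sketches above; averaging the token embeddings of a vertex is our assumption, since the paper does not state how per-vertex token vectors are aggregated:

```python
import numpy as np

def vertex_feature(vtype: str, code_tokens: list) -> np.ndarray:
    type_vec = one_hot_vertex_type(vtype)                           # task 1
    code_vec = np.mean([model.wv[t] for t in code_tokens], axis=0)  # task 2
    return np.concatenate([type_vec, code_vec])                     # task 3

feat = vertex_feature("CALL", ["memcpy", "(", "dst", ",", "src", ",", "len", ")"])
```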

3.2.3. Edge feature normalization

Table 2 below describes the edge types in the graph when the source code is analyzed by the Joern tool.

Table 2. Edge types of source code when analyzing into CPG graph

From Table 2, it can be seen that 14 edge types are analyzed and extracted by the Joern tool. To normalize edge information, the paper uses one-hot encoding, conducted as presented in task 1 of Section “The method of normalizing vertex features”.

3.2.4. Synthesizing source code features

Thus, based on the calculation and normalization processes presented in Sections “The method of normalizing vertex features” and “Edge feature normalization”, we obtain information about the edges and vertices of the CPG graph. Next, this information is synthesized into an information profile that describes the characteristics and features of the source code in detail. The profile includes many nodes and many edges. We believe that building a source code information profile from the vertex and edge analysis of the CPG graph is of great value in capturing the characteristics of the source code.
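
A sketch of one such profile, assuming PyTorch Geometric's Data object as the container (the paper does not name a specific graph library):

```python
import torch
from torch_geometric.data import Data

num_nodes, node_dim, num_edge_types = 4, 103, 14    # e.g. 39 type dims + 64 code dims

x = torch.randn(num_nodes, node_dim)                # vertex feature vectors (Sec. 3.2.2)
edge_index = torch.tensor([[0, 1, 2],               # edge sources
                           [1, 2, 3]])              # edge targets
edge_attr = torch.eye(num_edge_types)[[0, 5, 5]]    # one-hot edge types (Sec. 3.2.3)
y = torch.tensor([1])                               # label: 1 = vulnerable, 0 = normal

profile = Data(x=x, edge_index=edge_index, edge_attr=edge_attr, y=y)
```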

3.3. Source code feature extraction

As described above, the data representation is complex (the source code is represented as a directed CPG graph with a multi-edge, multi-node structure), so feeding it into an ordinary classification model does not yield high performance. Therefore, we propose to use a deep graph network to further extract and normalize source code features in CPG form; specifically, we use the DGCNN model (Zhang et al., Citation2018). Deep graph networks have been studied and applied in many fields, but their application to source code vulnerability classification is still limited (Goyal & Ferrara, Citation2018; Makarov et al., Citation2021; Z. Li et al., Citation2019; Zhou et al., Citation2020).

DGCNN was proposed by Zhang et al. (Citation2018) to solve two main problems: (i) how to extract useful features characterizing the diverse information encoded in a graph for classification purposes; and (ii) how to read a graph in a meaningful and consistent order. To achieve this, the DGCNN architecture includes three main layers:

  • The first layer is the GCN layer. Its purpose is to extract the local structural features of the vertices and define a vertex order. Specifically, the GCN layer uses the propagation rule expressed in formula (1) below:

    (1) $Z^{(i+1)} = \sigma\left(\tilde{D}^{-1}(A + I)\,Z^{(i)} W^{(i)}\right)$

Where:

  • A: adjacency matrix
  • X: feature matrix
  • I: identity matrix of the same size as A
  • $\sigma$: activation function
  • $\tilde{D}$: degree matrix of A + I
  • $W^{(i)}$: weight matrix of layer i
  • $Z^{(i)}$: output of layer i, with $Z^{(0)} = X$

Convolution aggregates information from neighboring nodes to extract local substructure information. To extract substructure features at multiple scales, the convolutional layers are stacked as follows:

(2) $Z^{t+1} = f\left(\tilde{D}^{-1}\tilde{A}\,Z^{t} W^{t}\right)$

Where $Z^{0} = X$; $Z^{t} \in \mathbb{R}^{n \times c_t}$ is the output of the t-th convolutional layer; $c_t$ is the number of output channels of layer t; $\tilde{A} = A + I$; and f is the activation function.
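
A toy NumPy rendering of formula (2), with tanh as the activation (the choice in the original DGCNN; the paper does not specify one):

```python
import numpy as np

def gcn_layer(A: np.ndarray, Z: np.ndarray, W: np.ndarray) -> np.ndarray:
    A_tilde = A + np.eye(A.shape[0])               # A~ = A + I (add self-loops)
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))     # D~^{-1}
    return np.tanh(D_inv @ A_tilde @ Z @ W)        # Z^{t+1} = f(D~^{-1} A~ Z^t W^t)

A = np.array([[0.0, 1.0], [1.0, 0.0]])             # toy 2-vertex graph
Z0 = np.random.randn(2, 8)                         # Z^0 = X with 8 channels
W0 = np.random.randn(8, 4)                         # weights, 4 output channels
Z1 = gcn_layer(A, Z0, W0)                          # shape (2, 4)
```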

  • The second layer. DGCNN introduces a new SortPooling layer that generates a representation (embedding) of each graph from the vertex representations learned by the stack of GCN layers. The SortPooling layer sorts the vertex features, keeps the high-ranked ones, and discards the rest. After the inputs pass through h GCN layers, we obtain a matrix $Z^{1:h}$ in which each row is a descriptor of one vertex and each column is a feature channel; it is formed by concatenating the outputs of the h GCN layers and has n rows (n is the number of vertices) and K columns (K is the total number of output channels of the h GCN layers). Given $Z^{1:h}$, the processing is as follows: first, the rows of $Z^{1:h}$ are sorted by the value in the last column of $Z^{h}$; if two rows have the same value there, the tie is broken by the value in the last column of $Z^{h-1}$, and so on. Next, rows are cut off, or all-zero rows are padded, so that the matrix goes from n rows to exactly k rows. The output is a matrix of size k × K, where k is a predefined integer.
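
A minimal sketch of this SortPooling step (sorting on the single last channel; the layer-by-layer tie-breaking described above is omitted for brevity):

```python
import numpy as np

def sort_pooling(Z_concat: np.ndarray, k: int) -> np.ndarray:
    """Z_concat is the n x K matrix Z^{1:h}; returns a k x K matrix."""
    Z_sorted = Z_concat[np.argsort(-Z_concat[:, -1])]   # descending by last column of Z^h
    n, K = Z_sorted.shape
    if n >= k:
        return Z_sorted[:k]                             # truncate to k rows
    return np.vstack([Z_sorted, np.zeros((k - n, K))])  # or pad all-zero rows
```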

  • The third layer. The remaining layers are CNN layers and traditional Dense layers (Albawi et al., Citation2017). Their purpose is to read the sorted graph representation and make predictions. The output of the SortPooling layer is used as the input of 1-D convolution, max pooling, and Dense layers that learn graph features for predicting the graph labels (Duan et al., Citation2003).

Thus, the DGCNN model works according to the following principle: an input graph is first fed into the GCN layers, where vertex information is propagated between nodes; the vertex features are then sorted and synthesized in the SortPooling layer and passed to traditional CNN structures to learn a predictive model.

3.4. Source code classification

To classify vulnerable and normal source code, we use two layers, a Fully Connected layer and a Softmax layer, as shown in Figure 1.
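
A sketch of the post-SortPooling stack in PyTorch (1-D convolution, max pooling, a fully connected layer, and softmax, per Sections 3.3 and 3.4); the layer sizes are illustrative assumptions:

```python
import torch
from torch import nn

k, K = 30, 97                                        # SortPooling output size k x K
head = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=K, stride=K),       # one convolution step per sorted vertex
    nn.ReLU(),
    nn.MaxPool1d(2),
    nn.Flatten(),
    nn.Linear(16 * (k // 2), 2),                     # fully connected layer
    nn.Softmax(dim=1),                               # normal vs. vulnerable
)

z = torch.randn(8, 1, k * K)                         # batch of 8 flattened graph embeddings
probs = head(z)                                      # shape (8, 2)
```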

4. Experiments and evaluations

4.1. Experimental data

The main datasets used for experimentation and evaluation are FFmpeg and Qemu (https://ffmpeg.org/download.html). Qemu is a collection of software programs that enable the creation, management, and administration of virtual machines and the operation of virtualized environments on physical servers; the main vulnerability types in this dataset are DoS, buffer overflow, etc. FFmpeg includes software programs and libraries for processing video, audio, and other multimedia streams. Both datasets are program source codes written in C/C++. Table 3 below lists the main components of these two datasets.

Table 3. Experimental data

4.2. Experimental scenarios

4.2.1. Scenarios for the experimental dataset

With the experimental dataset presented in Table 3, this study divides the dataset into parts and then conducts experiments and evaluates the accuracy of the proposed models on them. The split is random: 80% of the dataset is used for training and the remaining 20% for testing, as sketched below.
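
A one-line sketch of this split, assuming scikit-learn and hypothetical `profiles`/`labels` arrays holding the feature profiles and their labels:

```python
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(
    profiles, labels, test_size=0.20, shuffle=True, random_state=42)
```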

4.2.2. Evaluation scenarios

In this paper, we propose two main scenarios as follows:

  • Scenario 1. Evaluating the effectiveness of the proposed model against deep graph networks used in other studies. In particular, in this scenario we compare the DGCNN model with the GCN model (Haridas et al., Citation2020) and the GCN+IndRNN model (Cai et al., Citation2021).

  • Scenario 2. Evaluating the effectiveness of the source code vulnerability detection model without deep graph networks. In this scenario, we want to clarify the role and importance of the deep graph network in synthesizing and extracting source code features. We use classification algorithms such as CNN, Multi-Layer Perceptron (MLP), LSTM, and BiLSTM for this task.

4.3. Evaluation metrics

We use the following four metrics to evaluate the effectiveness of the proposed model. Their general formulas are given in (3)–(6).

(3) $\text{accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN} \times 100\%$
(4) $\text{precision} = \dfrac{TP}{TP + FP} \times 100\%$
(5) $\text{recall} = \dfrac{TP}{TP + FN} \times 100\%$
(6) $F1 = \dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
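
As a worked check of formulas (3)–(6), the metrics can be computed directly from confusion-matrix counts; the example call uses the DGCNN counts reported with Figure 2 (1,908 of 2,013 vulnerable and 2,385 of 2,458 normal samples classified correctly):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100    # formula (3)
    precision = tp / (tp + fp) * 100                    # formula (4)
    recall = tp / (tp + fn) * 100                       # formula (5)
    f1 = 2 * precision * recall / (precision + recall)  # formula (6)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "F1": f1}

print(metrics(tp=1908, tn=2385, fp=73, fn=105))
```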

5. Result evaluations

5.1. Evaluations of experimental scenario 1

Experimental results of detecting source code vulnerabilities using the DGCNN model: as mentioned above, the purpose of this scenario is to compare the ability of several deep graph networks to synthesize features and classify source code vulnerabilities. Specifically, we compare three deep graph networks: DGCNN (our proposal), GCN (Haridas et al., Citation2020), and GCN+IndRNN (Cai et al., Citation2021). Tables 4–6 below show the experimental results of this scenario; Table 4 shows the results of source code vulnerability classification based on the DGCNN model.

Table 4. Experimental results using the DGCNN model

Table 5. Experimental results using the GCN model (Haridas et al., Citation2020)

Table 6. Experimental results using the GCN+IndRNN model (Cai et al., Citation2021)

Table 5 shows the results of evaluating the GCN network (Haridas et al., Citation2020).

The experimental results in Table 5 show that the source code classification results of the GCN model are relatively stable. This model gives an average overall accuracy of 79.55%, which is not high. Regarding stability, when using the 2-layer GCN model and randomly changing the number of units, the accuracy of the model changes slowly and insignificantly. With parameters [64-32-16] and [256-128-32], the GCN model gives the best results on most metrics. Comparing the results in Tables 4 and 5, it is clear that the DGCNN model performs much better than the GCN model. The reason is that, with the support of GCN and CNN layers, the DGCNN model synthesizes and extracts source code features better. In addition, DGCNN synthesizes results by combining the outputs of all layers, rather than simply linearizing the features of each node and taking the final result as output, as GCN does. Moreover, because the GCN model only uses GCN and MLP layers to represent information, it loses important features of the graph, and thereby the distinctions between normal and vulnerable source code, which leads to low classification results. Next, Table 6 below presents the experimental results of the GCN+IndRNN model (Cai et al., Citation2021).

The experimental results in Table 6 show that, with the support of GCN, MLP, and IndRNN layers, the GCN+IndRNN model achieves higher efficiency than the original GCN model; the difference ranges from 4% to 7% on all metrics. The reason is that the GCN+IndRNN model integrates an additional IndRNN layer to synthesize and represent the source code information, so it can extract more important features. Comparing Tables 4 and 6, it can be seen that the DGCNN model completely outperforms the GCN+IndRNN model.

5.1.1. Comments and evaluations for scenario 1

The experimental results presented in Tables 4–6 show the clear superiority of the proposed method over the other approaches. Next, the study evaluates the detection and prediction ability of these models through their confusion matrices.

From the experimental results in Figure 2, it can be seen that the DGCNN model is the most effective at accurately detecting source code vulnerabilities. The DGCNN model correctly predicts 1,908 source code vulnerabilities out of a total of 2,013, which is higher than the GCN (Haridas et al., Citation2020) and GCN+IndRNN (Cai et al., Citation2021) models by 98 and 51 vulnerabilities, respectively. As for correctly predicting normal source code, the DGCNN model correctly predicts 2,385 normal samples out of a total of 2,458, which is also much better than GCN (Haridas et al., Citation2020) and GCN+IndRNN (Cai et al., Citation2021). Comparing the confusion matrices of GCN and GCN+IndRNN, the GCN+IndRNN model is more effective because it uses the IndRNN model for the classification task instead of pure GCN layers.

Figure 2. Confusion matrix results of the models at the best parameters. In which (a), (b), (c) respectively show the results of DGCNN, GCN (Haridas et al., Citation2020), GCN+IndRNN (Cai et al., Citation2021).


5.1.2. Experimental results of detecting source code vulnerabilities based on some other approaches

In this scenario, we evaluate some other approaches that use neither the CPG graph nor deep graph networks. Table 7 below shows the experimental results of models combining NLP with deep learning methods, including LSTM (Lin et al., Citation2021), CNN (Russell et al., Citation2018), and BiLSTM (Zheng et al., Citation2020).

Table 7. Experimental results of detecting vulnerabilities based on some other methods

Comparing the experimental results in Table 7, it can be seen that the results differ widely between models. The model combining Word2vec and CNN gives the best precision for normal source code classification (94%), higher than the LSTM and BiLSTM models by 4% and 8%, respectively. For correctly classifying source code vulnerabilities, the BiLSTM model is the most effective at 65%, which is 6% and 7% higher than the LSTM and CNN models, respectively. This demonstrates that, with its ability to remember over long sequences, the BiLSTM model synthesizes many important features of source code vulnerabilities, thereby improving detection accuracy. Comparing Table 4 with Table 7, it is clear that our proposed method brings much better results than the other approaches. Similarly, comparing Tables 5 and 6 with Table 7 shows that the approach using the CPG graph and a deep graph network yields better classification results than the other approaches. These results show that our proposal in this paper is correct and reasonable.

6. Conclusion and development directions

Detecting security vulnerabilities in source code is now an urgent issue because techniques for exploiting vulnerabilities are developing strongly. In this paper, based on several data mining techniques, we have built a new approach that enhances the ability to accurately classify source code vulnerabilities. The experimental results show that the proposed method is remarkably effective at correctly classifying both vulnerable and normal source code. Three reasons lead to this effectiveness: i) the CPG graph succeeds in representing the syntactic and semantic relationships of the source code, and only when these relationships are expressed in detail are the characteristics of the source code fully represented; ii) the proposed method of building a source code feature profile from edges and vertices using NLP techniques is a new technique that normalizes and fully extracts the source code features from the CPG graph, which is essential because only fully built and synthesized features yield the best classification results; iii) the proposal to use DGCNN to extract source code features from the CPG graph is a breakthrough because the DGCNN model is well suited to non-linear, structured graphs such as source code graphs. In the future, to improve the performance of the source code vulnerability detection model, two main tasks need improvement: i) improving the method of building source code information profiles, where, instead of only NLP methods, other advanced deep learning methods can be applied to discover relationships between the edges and vertices of the graph; and ii) improving the method of extracting source code features, where other non-linear deep graph networks can be used to synthesize and extract more in-depth information about edges and vertices.

Disclosure statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability statement

The datasets generated and (or) analysed during the current study are available from the corresponding author on reasonable request (https://ffmpeg.org/download.html).

References

  • Albawi, S., Abed Mohammed, T., & Al-Zawi, S. (2017). Understanding of a convolutional neural network. Proceedings of the International Conference on Engineering and Technology (ICET), Antalya, Turkey (pp. 1–6). https://ieeexplore.ieee.org/document/8308186
  • Almulihi, A., Alassery, F., Khan, A., Shukla, S., Gupta, B., & Kumar, R. (2022). Analyzing the Implications of Healthcare Data Breaches through Computational Technique. Intelligent Automation & Soft Computing, 32(3), 1763–16. https://doi.org/10.32604/iasc.2022.023460
  • Attaallah, A., Alsuhabi, H., Shukla, S., Kumar, R., Gupta, B., & Khan, R. (2022). Analyzing the Big Data Security Through a Unified Decision-Making Approach. Intelligent Automation & Soft Computing, 32(2), 1071–1088. https://doi.org/10.32604/iasc.2022.022569
  • Ben-Nun, T., Jakobovits, A. S., & Hoefler, T. (2018). Neural code comprehension: A learnable representation of code semantics. Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada.
  • Cai, M., Jiang, Y., Gao, C., Li, H., & Yuan, W. (2021). Learning features from enhanced function call graphs for Android malware detection. Neurocomputing, 423, 301–307. https://doi.org/10.1016/j.neucom.2020.10.054
  • Cheng, X., Wang, H., Hua, J., Xu, G., & Sui, Y. (2021). DeepWukong: Statically Detecting Software Vulnerabilities Using Deep Graph Neural Network. ACM Transactions on Software Engineering and Methodology, 30(3), 1–33. https://doi.org/10.1145/3436877
  • Cheng, X., Wang, H., Hua, J., Zhang, M., Xu, G., Yi, L., & Sui, Y. (2019). Static detection of control-flow-related vulnerabilities using graph embedding. Proceedings of the International Conference on Engineering of Complex Computer Systems (ICECCS) (pp. 41–50). https://doi.org/10.1109/ICECCS.2019.00012
  • Cheng, X., Zhang, G., Wang, H., & Sui, Y. (2022). Path-sensitive code embedding via contrastive learning for software vulnerability detection. Proceedings of the International Symposium on Software Testing and Analysis (ISSTA), 519. https://doi.org/10.1145/3533767.3534371
  • Chen, D., Zhang, Y.-D., Wei, W., Wang, S.-X., Huang, R.-B., Li, X.-L., Qu, B.-B., & Jiang, S. (2017). Efficient vulnerability detection based on an optimized rule-checking static analysis technique. Frontiers of Information Technology & Electronic Engineering, 18(3), 332–345. https://doi.org/10.1631/FITEE.1500379
  • Dam, H. K., Pham, T., Ng, S. W., Tran, T., Grundy, J., Ghose, A., Kim, T., & Kim, C.-J. (2018). A deep tree-based model for software defect prediction. https://arxiv.org/abs/1802.00921.
  • Do Xuan, C., Ngoc Son, V., & Duc, D. (2022). Automatically detect software security vulnerabilities based on natural language processing techniques and machine learning algorithms. Journal of ICT Research and Applications, 16(1), 70–87. https://doi.org/10.5614/itbj.ict.res.appl.2022.16.1.5
  • Duan, K., Sathiya Keerthi, S., Chu, W., Shevade, S. K., & Poo, A. N. (2003). Multi-category classification by soft-max combination of binary classifiers. Proceedings of the 4th International Workshop, MCS, Guilford, UK (pp. 125–134).
  • Ferrante, J., Ottenstein, K. J., & Warren, J. D. (1987). The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3), 319–349. https://doi.org/10.1145/24039.24041
  • Gascon, H., Yamaguchi, F., Arp, D., & Rieck, K. (2013). Structural detection of android malware using embedded call graphs. Proceedings of the ACM workshop on Artificial intelligence and security, Berlin, Germany (pp. 45–54).
  • Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78–94. https://doi.org/10.1016/j.knosys.2018.03.022
  • Grieco, G., Grinblat, G. L., Uzal, L. C., Rawat, S., Feist, J., & Mounier, L. (2016). Toward large-scale vulnerability discovery using machine learning. Proceedings of the 6th ACM on Conference on Data and Application Security and Privacy, New Orleans, Louisiana USA (pp. 85–96).
  • Harer, J. A., Kim, L. Y., Russell, R. L., Ozdemir, O., Kosta, L. R., Rangamani, A., Hamilton, L. H., Centeno, G. I., Key, J. R., Ellingwood, P. M., Antelman, E., Mackay, A., McConley, M. W., Opper, J. M., & Peter Chin, T. (2018). Automated software vulnerability detection with machine learning.
  • Haridas, P., Chennupati, G., Santhi, N., Romero, P., & Eidenbenz, S. (2020). Code characterization with graph convolutions and capsule networks. IEEE Access, 8, 136307–136315. https://doi.org/10.1109/ACCESS.2020.3011909
  • Hu, J., Chen, J., Zhang, L., Liu, Y., Bao, Q., & Arthur, H. A. (2020). A memory-related vulnerability detection approach based on vulnerability features. Tsinghua Science and Technology, 25(5), 604–613. https://doi.org/10.26599/TST.2019.9010068
  • Lee, M., Cho, S., Jang, C., Park, H., & Choi, E. (2006). A rule based security auditing tool for software vulnerability detection. Proceedings of the 2006 International Conference on Hybrid Information Technology, Cheju, Korea (South), 2, 505–512. https://doi.org/10.1109/ICHIT.2006.253653
  • Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 32, 1188–1196.
  • Li, L., Feng, H., Zhuang, W., Meng, N., & Ryder, B. (2017). Cclearner: A deep learning-based clone detection approach. Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China (pp. 249–260).
  • Li, M., Li, C., Li, S., Wu, Y., Zhang, B., & Wen, Y. (2021). ACGVD: Vulnerability detection based on comprehensive graph via graph neural network with attention. Proceedings of the ICICS 2021: Information and Communications Security, Chongqing, China (pp. 243–259).
  • Lin, G., Zhang, J., Luo, W., Pan, L., De Vel, O., Montague, P., & Xiang, Y. (2021). Software vulnerability discovery via learning multi-domain knowledge bases. IEEE Transactions on Dependable and Secure Computing, 18(5), 2469–2485. https://doi.org/10.1109/TDSC.2019.2954088
  • Li, Y., Wang, S., & Nguyen, T. N. (2021). Vulnerability detection with fine-grained interpretations. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Association for Computing Machinery, New York, NY, USA (pp. 292–303). https://doi.org/10.1145/3468264.3468597
  • Li, Z., Zou, D., Tang, J., Zhang, Z., Sun, M., & Jin, H. (2019). A comparative study of deep learning-based vulnerability detection system. IEEE Access, 7, 103184–103197. https://doi.org/10.1109/ACCESS.2019.2930578
  • Li, Z., Zou, D., Xu, S., Jin, H., Qi, H., & Hu, J. (2016). VulPecker: An automated vulnerability detection system based on code similarity analysis. Proceedings of the 32nd Annual Conference on Computer Security Applications, Los Angeles California USA (pp. 201–213).
  • Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., & Chen, Z. (2018). SySeVR: A framework for using deep learning to detect software vulnerabilities. IEEE Transactions on Dependable and Secure Computing. https://doi.org/10.1109/TDSC.2021.3051525
  • Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S., Deng, Z., & Zhong, Y. (2018). VulDeePecker: A deep learning-based system for vulnerability detection. https://arxiv.org/abs/1801.01681
  • Makarov, I., Kiselev, D., Nikitinsky, N., & Subelj, L. (2021). Survey on graph embeddings and their applications to machine learning problems on graphs. PeerJ Computer Science, 7, e357. https://doi.org/10.7717/peerj-cs.357
  • Russell, R., Kim, L., Hamilton, L., Lazovich, T., Harer, J. A., Ozdemir, O., Ellingwood, P. M., & McConley, M. W. (2018). Automated vulnerability detection in source code using deep representation learning.
  • Sahu, K., Al-Zahrani, F. A., Srivastava, R. K., & Kumar, R. (2020). Hesitant Fuzzy Sets Based Symmetrical Model of Decision-Making for Estimating the Durability of Web Application. Symmetry, 12(11), 1–20. https://doi.org/10.3390/sym12111770
  • Sahu, K., Al-Zahrani, F. A., Srivastava, R. K., & Kumar, R. (2021). Evaluating the Impact of Prediction Techniques: Software Reliability Perspective. Computers Materials and Continua, 67(2), 1471–1488. https://doi.org/10.32604/cmc.2021.014868
  • Sahu, K., & Srivastava, R. K. (2018). Soft computing approach for prediction of software reliability. ICIC Express Letters, 12, 1213–1222. https://doi.org/10.24507/icicel.12.12.1213
  • Sahu, K., & Srivastava, R. K. (2020). Needs and importance of reliability prediction: An industrial perspective. Information Sciences Letters, 9, 33–37. https://doi.org/10.18576/isl/090105
  • Shen, Z., & Chen, S. (2020). A survey of automatic software vulnerability detection, program repair, and defect prediction techniques. Security and Communication Networks, 2020, 1–16. https://doi.org/10.1155/2020/8858010
  • Sui, Y., Cheng, X., Zhang, G., & Wang, H. (2020). Flow2vec: Value-flow-based precise code embedding. Proceedings of the ACM on Programming Languages, 4(OOPSLA), 1–27. https://doi.org/10.1145/3428301
  • Tang, G., Yang, L., Ren, S., Meng, L., Yang, F., & Wang, H. (2021). An automatic source code vulnerability detection approach based on KELM. Security and Communication Networks, 2021, 1–12. https://doi.org/10.1155/2021/5566423
  • Tian, H., Xu, J., Lian, K., & Zhang, Y. (2009). Research on strong-association rule based web application vulnerability detection. Proceedings of the International Conference on Computer Science and Information Technology (CSIT), Beijing, China (Vol. 2).
  • Tomas, M., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://arxiv.org/abs/1301.3781
  • Wang, H., Ye, G., Tang, Z., & Tan, S. H. (2020). Combining graph-based learning with automated data collection for code vulnerability detection. IEEE Transactions on Information Forensics and Security, 16, 1943–1958. https://doi.org/10.1109/TIFS.2020.3044773
  • Wei, H., & Li, M. (2017). Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia (pp. 3034–3040).
  • Wu, P., Yin, L., Du, X., Jia, L., & Dong, W. (2020). Graph-based vulnerability detection via extracting features from sliced code. Proceedings of the IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C), Macau, China.
  • Xu, K., Hu, K., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks? Proceedings of the International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA (pp. 1–17).
  • Yamaguchi, F., Golde, N., Arp, D., & Rieck, K. (2014). Modeling and discovering vulnerabilities with code property graphs. Proceedings of the IEEE Symposium on Security and Privacy, Berkeley, CA, USA.
  • Yamaguchi, F., Lottmann, M., & Rieck, K. (2012). Generalized vulnerability extrapolation using abstract syntax trees. Proceedings of the Annual Computer Security Applications Conference, Orlando Florida, USA, 2, 358–368. https://doi.org/10.1145/2420950.2421003
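  • Zhang, M., Cui, Z., Neumann, M., & Chen, Y. (2018). An end-to-end deep learning architecture for graph classification. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA (pp. 4438–4445).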
  • Zheng, W., Gao, J., Wu, X., Liu, F., Xun, Y., Liu, G., & Chen, X. (2020). The impact factors on the performance of machine learning-based vulnerability detection: A comparative study. Journal of Systems & Software, 168, 110659. https://doi.org/10.1016/j.jss.2020.110659
  • Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., & Sun, M. (2020). Graph neural networks: A review of methods and applications. AI Open, 1, 57–81. https://doi.org/10.1016/j.aiopen.2021.01.001