
 Research and Progress on Learning-Based Source Code Vulnerability Detection

Su Xiaohong, Zheng Weining, Jiang Yuan, Wei Hongwei, Wan Jiayuan, Wei Ziyue (Faculty of Computing, Harbin Institute of Technology, Harbin 150001)

Abstract


Abstract Automatic source code vulnerability detection is the premise and foundation of source code vulnerability repair and is of great significance for ensuring software security. Traditional methods usually detect vulnerabilities based on rules manually formulated by security experts, but formulating such rules is difficult, and the types of vulnerabilities that can be detected depend on the rules predefined by the experts. In recent years, the rapid development of Artificial Intelligence (AI) technology has provided an opportunity to realize learning-based automatic detection of source code vulnerabilities. Learning-based vulnerability detection methods use machine learning or deep learning techniques to detect vulnerabilities. Among them, deep learning-based methods can automatically extract the syntactic and semantic features related to vulnerabilities in code and thereby avoid feature engineering, so they have shown great potential in the field of vulnerability detection and have become a research hotspot in recent years. This paper reviews and summarizes existing learning-based source code vulnerability detection techniques and systematically analyzes their research progress, focusing on five aspects: vulnerability data mining and dataset construction, program representation for vulnerability detection tasks, machine learning and deep learning based source code vulnerability detection methods, interpretable methods for source code vulnerability detection, and fine-grained source code vulnerability detection methods. On this basis, a reference framework for vulnerability detection that combines hierarchical semantic perception, multi-granularity vulnerability classification, and assisted vulnerability understanding is given. Finally, future research directions for learning-based source code vulnerability detection technology are discussed.


Keywords Software security; Source code vulnerability detection; Vulnerability data mining; Vulnerability feature extraction; Code representation learning; Deep learning; Model interpretability; Vulnerability detection

CLC Classification Number: TP311    DOI: SP.J.1016.2024.00337

Research and Progress on Learning-Based Source Code Vulnerability Detection

SU Xiao-Hong  ZHENG Wei-Ning  JIANG Yuan  WEI Hong-Wei  WAN Jia-Yuan  WEI Zi-Yue (Faculty of Computing, Harbin Institute of Technology, Harbin 150001)

Abstract

Automatic detection of source code vulnerabilities is the precondition and foundation of source code vulnerability repair, which is of great significance for ensuring software security. Traditional approaches usually detect vulnerabilities based on the rules predefined by security experts. However, it is difficult to define detection rules manually, and the types of vulnerabilities that can be detected depend on the rules predefined by security experts. In recent years, the rapid development of artificial intelligence technology has provided opportunities to realize learning-based automatic source code vulnerability detection. Learning-based vulnerability detection methods are data-driven methods that use machine learning or deep learning techniques to detect vulnerabilities, among which deep learning-based vulnerability detection methods have shown great potential in the field of vulnerability detection and have become a research hotspot in recent years due to their ability to automatically extract syntax and semantic features related to


Received: 2023-03-14; Published: 2023-11-28. This work was supported by the National Natural Science Foundation of China under Grant No. 62272132. Xiaohong Su (Corresponding Author), Ph.D., Professor, Senior Member of China Computer Federation (CCF), focuses on intelligent software engineering, software vulnerability detection, program analysis and software testing, etc. E-mail: sxh@hit.edu.cn. Weining Zheng, Ph.D. candidate, focuses on software vulnerability detection. Yuan Jiang, Ph.D., Assistant Professor, Member of China Computer Federation, focuses on program analysis and code representation learning. Hongwei Wei, Ph.D. student, focuses on software data mining, software knowledge engineering, search-based software engineering, code pattern generation and search. Jiayuan Wan, Ph.D. student, focuses on software vulnerability detection and software testing. Ziyue Wei, M.S., focuses on smart contract software vulnerability detection.

vulnerabilities in source code to avoid feature engineering. This paper mainly reviews and summarizes existing learning-based source code vulnerability detection techniques, and provides a systematic analysis and overview of their research and progress, focusing on five aspects of the research work: vulnerability data mining and dataset construction, program representation methods for vulnerability detection tasks, traditional machine learning and deep learning-based source code vulnerability detection approaches, interpretable methods for source code vulnerability detection, and fine-grained methods for source code vulnerability detection. Specifically, in the first part, we summarize existing publicly available vulnerability datasets, including their sources and sizes, and describe the challenges faced in building vulnerability datasets, as well as how to address these challenges. In the second part, we briefly introduce intermediate code representations and divide existing code representations applied in the field of vulnerability detection into four categories: metric-based, sequence-based, syntax tree-based, and graph-based code representations. For each type of code representation method, we list some representative methods and analyze their advantages and disadvantages. In the third part, we introduce commonly used vulnerability detection tools and review coarse-grained vulnerability detection methods, including rule-based, machine learning based, and deep learning based vulnerability detection methods, and then analyze and discuss the characteristics, strengths and weaknesses of each type of vulnerability detection method. In the fourth part, we introduce interpretable methods that can further explain vulnerability detection results, briefly describe model self-interpretation methods, model approximation methods and sample feedback methods one by one, summarize their characteristics and discuss their strengths and weaknesses. In the fifth part, we first elucidate the problems and challenges posed by fine-grained vulnerability detection, and then provide a detailed description of existing representative methods for fine-grained vulnerability detection and their approaches to alleviate these challenges. Finally, we propose a source code vulnerability detection framework that combines hierarchical semantic awareness, multi-granularity vulnerability classification and assisted vulnerability understanding, and analyze its feasibility. We also prospect the future research directions for learning-based source code vulnerability detection techniques, such as the construction of large-scale, high-quality vulnerability datasets, techniques for detecting vulnerabilities in small or imbalanced samples, accurate and efficient vulnerability detection models, early detection techniques for vulnerabilities, etc.
Keywords software security; source code vulnerability detection; vulnerability data mining; vulnerability feature extraction; code representation learning; deep learning; model interpretability; vulnerability detection

 1 Introduction


The Internet is an indispensable infrastructure in the information age. While Internet technology brings convenience to human beings, it also provides opportunities for malicious actors. In recent years, hacker attacks, digital asset theft, and leakage of private user information have occurred frequently, posing a serious threat to the security of information systems. Vulnerabilities in software systems, which are the core components of cyberspace, are the root cause of such security incidents.

A software vulnerability refers to a design error, coding flaw, or operational failure introduced through intentional or unintentional negligence at various levels and stages of the software lifecycle of a software system or product. Malicious actors can exploit software vulnerabilities to gain higher levels of system privileges, steal private user data, and so on, thereby jeopardizing the security of the software system and affecting the normal operation of services built on top of it. For example, in 2017, a remote overflow vulnerability in the Windows Server Message Block (SMB) protocol enabled the WannaCry ransomware attack, resulting in a global Internet disaster.

In 2020, Zoom, a U.S. cloud videoconferencing company, experienced massive growth; in the same year, a vulnerability in the Zoom videoconferencing software on Mac systems led to video leaks affecting more than 4 million Zoom users. In December 2021, the Apache open source project Log4j was disclosed to have a "nuclear-level" remote code execution vulnerability, which allows an attacker to construct a malicious request and execute arbitrary code on a target server to steal data, mine cryptocurrency, or deploy ransomware. Statistics disclosed by the international authoritative vulnerability database CVE (Common Vulnerabilities & Exposures) and the U.S. National Vulnerability Database (NVD), shown in Figure 1, indicate that the number of disclosed software vulnerabilities has been increasing year by year in recent years; in particular, after 2017 the number of disclosed vulnerabilities is more than twice that of previous years. Software vulnerabilities have become one of the most important risks to the security of software and information systems.
Figure 1 Numbers of vulnerabilities disclosed in CVE and NVD over the years: (a) CVE; (b) NVD

Software static vulnerability detection methods can effectively improve software quality, reduce software security vulnerabilities, and minimize security risks, and have therefore attracted extensive attention from academia and industry. According to the objects analyzed, static vulnerability detection can be divided into binary vulnerability detection and source code vulnerability detection; this paper mainly reviews existing source code vulnerability detection methods. According to the techniques used in the detection process, source code vulnerability detection methods can be categorized into rule-based vulnerability detection methods, traditional machine learning-based vulnerability detection methods, and deep learning-based vulnerability detection methods; the latter two can be collectively referred to as learning-based vulnerability detection methods.

Rule-based source code vulnerability detection methods (e.g., some open-source or commercial vulnerability detection tools) rely on security rules defined by security experts, but the limitations and imperfections of the rules often lead to false positives or false negatives, and formulating complete and practical vulnerability detection rules requires high labor costs. With the accumulation of disclosed security vulnerabilities, learning-based automatic detection of source code vulnerabilities has gradually become possible: such methods can automatically learn vulnerability patterns from massive historical data, avoiding the need to formulate rules manually, and have thus become one of the hottest research directions in the field of software and cyberspace security.

This paper presents a systematic analysis of learning-based source code vulnerability detection techniques, focusing on the mining and construction of vulnerability datasets, program representation and program representation learning methods for vulnerability detection tasks, source code vulnerability detection methods based on traditional machine learning and deep learning, interpretable methods for source code vulnerability detection, and fine-grained source code vulnerability detection methods. By analyzing the existing methods, this paper summarizes the current challenges in the field of vulnerability detection, and gives a reference framework for source code vulnerability detection that combines hierarchical semantic awareness, multi-granularity vulnerability detection and assisted vulnerability understanding. Finally, this paper provides an outlook on the future research direction and development trend.

 2 Literature statistics

 The process of literature search and screening carried out in this paper is as follows:

Search databases: for foreign-language literature, Google Scholar was the main database, supplemented by the EI (Engineering Index) and SCI (Science Citation Index). For Chinese literature, CNKI (China National Knowledge Infrastructure) was used as the main database, supplemented by Wanfang Data and the VIP Chinese Science and Technology Journal Database.

Search keywords: English and Chinese keywords related to "source code vulnerability detection" and "vulnerability detection interpretability"; the search time range was from January 1, 2000 to May 20, 2023.

Using the above keywords, the listed databases were searched year by year, and the search results were manually verified by checking titles, keywords, and abstracts and browsing the content, in order to select the literature matching the topic of this paper (i.e., literature related to source code vulnerability detection) and to categorize it (including the source of the literature, whether it describes a learning-based vulnerability detection method, etc.). If no matching literature appeared on five consecutive pages of the search results list, the search for that year was considered complete. The search was performed by three security researchers (three PhD students), each spending an average of 75 hours.

Based on the above steps, this paper finally collects 1109 papers related to source code vulnerability detection, among which there are 768 research papers, 186 empirical analysis papers, and 155 review papers. As shown in Fig. 2, the number of papers in the field of vulnerability detection shows a fluctuating growth over time and reaches a peak in the last three years (the number of papers in 2023 is relatively small because only literature published up to May 2023 was surveyed), which indicates that this direction has become a research hotspot in recent years.
For the research papers, this paper further counts the number of papers on traditional machine learning-based and deep learning-based source code vulnerability detection; the results are shown in Figure 3.

Figure 2 Literature Classification Statistics of Source Code Vulnerability Detection Methods (as of May 20, 2023)

It can be seen that deep learning techniques have gradually been applied to the field of vulnerability detection since 2017, and by 2018 the number of deep learning-based papers had already exceeded the number of papers on traditional machine learning-based vulnerability detection.
Fig. 3 Summary of research papers on source code vulnerability detection

In addition, this paper also analyzes the literature published in the Class A international academic conferences and journals recommended by the China Computer Federation (CCF) and in three domestic computer journals; the results are shown in Fig. 4. It can be seen that vulnerability detection remains a research hotspot in the fields of network and information security and software engineering, and research results related to vulnerability detection also appear in journals and conferences of other fields.
Legend of Fig. 4: computer architecture/parallel and distributed computing/storage systems; computer networks; network and information security; software engineering/system software/programming languages; artificial intelligence; cross-cutting/comprehensive/emerging areas; domestic journals.
Fig. 4 Summary of top literature on source code vulnerability detection methods in terms of their domains

The statistics of the programming languages targeted by the surveyed vulnerability detection methods are shown in Figure 5. It can be seen that current source code vulnerability detection methods mainly target C/C++ code, Java, PHP, JavaScript, and so on, while other programming languages and multi-language code are less frequently involved.
Legend of Fig. 5: C/C++; Java; JavaScript; PHP; Python; other languages/multilingual.
 Fig. 5 Statistics of programming languages involved in source code vulnerability detection methods

3 Relevant definitions, problems and challenges, and research content

 3.1 Relevant definitions


Definition 1. Abstract Syntax Tree (AST). An AST is a tree representation of the abstract syntactic structure of source code, an intermediate representation of a program as an ordered tree structure, where the inner nodes correspond to operators in the program and the leaf nodes correspond to operands (e.g., constants or identifiers).
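As a concrete (purely illustrative) example, Python's built-in ast module can parse a small function into such a tree; walking the tree shows the operator/construct nodes and their operand children. The snippet below is only a sketch for intuition and is not taken from any of the surveyed tools:

```python
import ast

source = """
def copy_buf(src, n):
    buf = [0] * 10
    for i in range(n):      # note: no bounds check against len(buf)
        buf[i] = src[i]
    return buf
"""

tree = ast.parse(source)

# Print every AST node type together with the types of its children,
# i.e., the abstract syntactic structure of the function.
for node in ast.walk(tree):
    children = [type(child).__name__ for child in ast.iter_child_nodes(node)]
    print(type(node).__name__, "->", children)
```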

Definition 2. Control Flow Graph (CFG). A CFG is a directed graph with a unique entry node START and a unique exit node STOP. Apart from the entry/exit nodes, the remaining intermediate nodes represent statements or predicate expressions in the program, where a predicate is an operation that returns True or False and a predicate expression is an expression containing a predicate. Edges represent control flow relationships between statements and are also called control flow edges. In addition, for any intermediate node in the graph, there exists at least one path from the entry node to that node and at least one path from that node to the exit node.

Definition 3. Control Dependency. A node B is control dependent on a node A if there exists a directed path P from node A to node B in the CFG such that every node on P (excluding A and B) is post-dominated by B, and node A is not post-dominated by node B. Here, post-domination means that if every directed path from node A to the exit node contains node B, then node B post-dominates node A. It should be noted that post-domination does not include the exit node, and a node does not post-dominate itself.

Definition 4. Data Dependency. If there exists a path from node A to another node B in the CFG, and a value defined at node A is used at node B, then node B is data dependent on node A.

Definition 5. Program Dependency Graph (PDG). A PDG is a directed graph whose nodes are the same as those of the control flow graph, except that it has no entry/exit nodes. The edges connecting the nodes represent the control dependencies and data dependencies that exist between them.

Definition 6. Code Property Graph (CPG). A CPG is a graph representation of a program obtained by merging its abstract syntax tree, control flow graph, and program dependency graph. Formally, it consists of a non-empty finite set of nodes (the nodes of the control flow graph and the program dependency graph, together with the syntactic structure nodes of the syntax tree) and a set of directed edges representing the control dependency, data dependency, control flow, and syntactic relationships between nodes.
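To make the merge in Definition 6 concrete, the sketch below builds a tiny code property graph as a single multigraph whose edges are tagged with their origin (control flow, control dependency, data dependency). The statement nodes and edges are hypothetical and hand-written here; a real CPG would be produced by a code parser/analyzer:

```python
import networkx as nx

# Hypothetical statements of a tiny function:
#   1: n = read_int()   2: if (n > 0)   3: buf = alloc(n)   4: use(buf)
cpg = nx.MultiDiGraph()
cpg.add_nodes_from([1, 2, 3, 4])

# Control-flow edges (from the CFG).
for u, v in [(1, 2), (2, 3), (3, 4)]:
    cpg.add_edge(u, v, kind="control_flow")

# Control-dependency edges: statements 3 and 4 are guarded by the predicate at node 2.
for u, v in [(2, 3), (2, 4)]:
    cpg.add_edge(u, v, kind="control_dependency")

# Data-dependency edges: n defined at 1 is used at 2 and 3; buf defined at 3 is used at 4.
for u, v in [(1, 2), (1, 3), (3, 4)]:
    cpg.add_edge(u, v, kind="data_dependency")

# Edges of different kinds coexist between the same pair of nodes,
# which is exactly what a code property graph encodes.
print(list(cpg.edges(data="kind")))
```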


3.2 Problems and Challenges of Learning-based Source Code Vulnerability Detection Methods


(1) Insufficient vulnerability datasets for training learning-based source code vulnerability detection methods

Unlike vulnerability detection methods based on rules predefined by experts, learning-based source code vulnerability detection methods require sufficient sample data to train the model and improve detection performance. However, because vulnerable code is difficult to obtain from real projects, high-quality and large-scale real vulnerability datasets are still lacking. Training the model on small-scale datasets easily leads to overfitting, which harms the generalization ability of the model. Most current research uses publicly available synthetic or semi-synthetic vulnerability datasets; although these datasets are easily accessible and large in size, there is still a significant gap in code complexity and diversity of vulnerability patterns compared with real projects. As mentioned above, as the size of software increases, the types and number of exploitable vulnerabilities also increase. The continuous emergence of new vulnerability exploitation and attack patterns poses increasingly serious challenges to the generalization ability of learning-based vulnerability detection models, and expanding the dataset to supplement model training is the most direct and effective means of improving generalization. Therefore, how to construct a high-quality, large-scale, and sufficiently rich vulnerability dataset is a challenging problem for learning-based vulnerability detection methods.

(2) Limitations of Learning-based Source Code Vulnerability Detection Methods in Deep Vulnerability Semantic Understanding and Complex Vulnerability Feature Extraction

Learning-based source code vulnerability detection methods need to understand and learn program semantics and automatically capture the vulnerability features in code. However, during training, hardware limitations require the code vector representation fed to the model to be truncated to a fixed length; for larger functions, or functions whose vulnerable statements are located near the end, the code information beyond this limit is truncated, so the model cannot learn the complete semantic information of the code. Second, there are often a large number of contextual dependencies among code elements (e.g., tokens, statements) in a program, and the model needs to selectively retain and learn the more important vulnerability-related contextual dependencies in order to effectively identify vulnerability patterns. However, vulnerability patterns in real projects are usually complex, making it difficult for learning-based methods to accurately and efficiently learn the deep vulnerability semantics of code. Therefore, how to construct a detection model that can extract complex vulnerability features is another challenging problem for learning-based vulnerability detection methods.

(3) Poor interpretability of learning-based source code vulnerability detection methods

For a long time, the black-box problem of deep learning has been a major concern in academia, and this is especially true for learning-based source code vulnerability detection methods. Coarse-grained source code vulnerability detection methods do not provide further information about a vulnerability after identifying the vulnerable function or code segment, and the "black-box" nature of the model itself makes it difficult to fully explain the detection mechanism and results. Therefore, it is important to "white-box" learning-based vulnerability detection methods so that the detection process and results can be interpreted. At present, there has been some progress in interpretability research for deep learning, but research on applying it to the field of vulnerability detection is still relatively limited. Therefore, improving the interpretability of learning-based vulnerability detection methods is a challenging problem.

 3.3 Research on Learning-Based Source Code Vulnerability Detection


Source code vulnerability detection is the process by which a developer or a security expert finds an existing but unexposed vulnerability in the source code in some way. Based on the granularity of detection, source code vulnerability detection methods can be categorized into coarse-grained vulnerability detection and fine-grained vulnerability detection. Coarse-grained vulnerability detection refers to predicting the likelihood of a vulnerability being contained in a source code file, function (or method), or code fragment, while fine-grained vulnerability detection refers to predicting the specific statements in the source code that may trigger a vulnerability.

Coarse-grained detection is less difficult, faster and more accurate, but the results are not interpretable, which can easily delay vulnerability remediation and increase the remediation cost. Therefore, on the basis of coarse-grained source code vulnerability detection, researchers have further proposed some interpretable methods to help understand the coarse-grained detection results.

Fine-grained vulnerability detection is more difficult than coarse-grained vulnerability detection, but can better assist developers in understanding and remediating vulnerabilities because it can be localized directly to the statement where the vulnerability occurs.

Learning-based source code vulnerability detection techniques analyze, abstract, and reason about source code using traditional machine learning or deep learning techniques, enabling them to automatically or semi-automatically learn complex vulnerability-related semantic features from large amounts of historical data and to generate corresponding vulnerability patterns, which can then be applied to coarse-grained or fine-grained vulnerability detection tasks. Figure 6 shows the relationship between the various research components of existing learning-based source code vulnerability detection techniques.
Fig. 6 Relationship between the various research components of existing learning-based source code vulnerability detection methods

(1) Mining and construction of vulnerability datasets

Suspected vulnerable programs are collected from open source software repositories or public vulnerability databases, and are then labeled manually or automatically to build a vulnerability dataset.
(2) Program representation methods

The source code in the dataset is parsed to generate suitable intermediate program representations, such as metric-based, sequence-based, syntax tree-based, and graph-based representations.
 (3) Coarse-grained source code vulnerability detection

After obtaining an intermediate representation of the program, software metric features are extracted using manually formulated rules, or appropriate deep neural networks are used to extract vulnerability-related syntactic and semantic features from the program representation; these features are then fed into a classifier that predicts, in a binary classification fashion, whether the source code under detection contains a vulnerability.
 (4) Interpretable Methods for Coarse-Grained Source Code Vulnerability Detection

For the source code detected as containing vulnerabilities, further probabilistic or fine-grained explanation information is provided by interpretable methods. The common interpretable methods can be categorized into model self-interpretation methods, model approximation methods and sample feedback methods.
 (5) Fine-grained source code vulnerability detection methods

After obtaining the intermediate representation of the program, representation learning is performed directly on it to produce fine-grained, statement-level detection results, i.e., the locations of the vulnerable statements.

To summarize, learning-based vulnerability detection research mainly focuses on the following five difficult problems: how to construct large-scale and high-quality vulnerability datasets? How to parse the code into appropriate program representations? How to extract vulnerability features from program representations to achieve coarse-grained vulnerability detection? How to obtain interpretable results based on the coarse-grained vulnerability detection results? How to model fine-grained vulnerability features based on program representations to achieve fine-grained vulnerability detection? Therefore, this paper will summarize these five perspectives.

 4 Mining and Construction Methods for Vulnerability Datasets

 4.1 Publicly available vulnerability datasets


The construction of vulnerability dataset is the prerequisite and foundation for learning-based source code vulnerability detection and localization. Learning-based vulnerability detection requires high-quality vulnerability data as a prerequisite, and the size and quality of the dataset directly affects the generalization ability of the detection model. It has been shown that improving the diversity of vulnerability types and syntactic structures in the training dataset can help enhance the detection of unknown vulnerabilities .

Some of the more critical and publicly available vulnerability datasets are shown in Table 1. The public availability of vulnerability datasets has contributed to the development of learning-based vulnerability detection techniques, but the construction of datasets still faces the following challenges.
 (1) Sample imbalance in the data set

In a recent study, Yang et al. discussed the effectiveness of data sampling methods for the data imbalance problem in vulnerability detection. Specifically, the study evaluates the impact of four data sampling methods, including random undersampling, random oversampling, SMOTE, and OSS (One Side Selection), on the effectiveness of deep learning vulnerability detection models and their ability to learn code vulnerability patterns, and reaches the following conclusions: first, data sampling methods can indeed alleviate the data imbalance problem in vulnerability detection; second, oversampling is better than undersampling; finally, sampling the original samples is better than sampling the feature space generated by the model after learning the samples. Therefore, future research can focus on the oversampling of raw samples.
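For illustration, the oversampling strategies discussed above can be applied to a vulnerable/non-vulnerable training set with the imbalanced-learn library; the feature matrix below is synthetic and the setup is only a sketch, not the configuration used in the cited study:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

rng = np.random.default_rng(0)
# Hypothetical code-feature vectors: 950 non-vulnerable vs. 50 vulnerable samples.
X = rng.normal(size=(1000, 32))
y = np.array([0] * 950 + [1] * 50)

# Random oversampling duplicates minority (vulnerable) samples.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE synthesizes new minority samples by interpolating between nearest neighbors.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_ros), np.bincount(y_sm))
```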

In real projects, the number of vulnerable statements within vulnerability samples is also relatively small, which makes the sample imbalance problem even more serious in the fine-grained vulnerability localization task. Although researchers have recognized the challenges that this problem poses to the vulnerability localization task, few studies have provided solutions. Software fault localization is the process of locating the statements in a buggy program that cause the program to run incorrectly.

Table 1 Publicly available vulnerability datasets

Label granularity | Ratio of vulnerable to non-vulnerable instances | Data sources
function | — | NVD & CVE
function | — | NVD & CVE
function | — | SARD
function | — | NVD & CVE
function | — | SARD
function | 1 471 : 59 297 | NVD & CVE
function | 1 471 : 59 297 | CVE
function | — | SARD
function | — | GitHub
function | 1 658 : 16 511 | Debian & Chromium
code segment | — | NVD & SARD
code segment | 56 395 : 364 232 | NVD & SARD
code segment | 43 119 : 138 522 | NVD & SARD
statement | — | GitHub
statement | — | SARD
statement | — | GitHub
statement | — | Debian
statement | — | GitHub

Software fault localization focuses on the runtime defects that cause test cases to fail, and it faces the same sample imbalance problem, i.e., only a small proportion of test cases fail. Although software fault localization differs from vulnerability localization, given the similarity between the two tasks, the representative methods used in software fault localization to alleviate the sample imbalance problem can still be borrowed by the vulnerability localization task.

Some studies have addressed the sample imbalance problem by expanding the dataset through cloning of the failed test cases. Xie et al. proposed a data augmentation method, Aeneas, which uses Principal Component Analysis (PCA) to generate a reduced feature space and then synthesizes failed test cases in that reduced space with a Conditional Variational Autoencoder (CVAE) to solve the sample imbalance problem. Its advantage is that the PCA technique reduces the dimensionality of the feature space and simplifies the expression of data features, thus improving the efficiency of data synthesis. The Lamont method adopts a similar idea, using Linear Discriminant Analysis (LDA) to reduce the dimensionality of the feature space and then utilizing SMOTE to synthesize failed test cases and obtain balanced sample data. Lei et al. proposed an inter-class learning based data augmentation method, BCL-FL, which mixes successful and failed test cases through a specially designed data synthesis formula to generate failed test cases that are closer to real ones. Therefore, future work can consider borrowing the above ideas for augmenting or synthesizing failed test cases to augment vulnerability samples and thus address the problem of imbalanced vulnerability samples.
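A rough sketch of the "reduce the feature space, then synthesize minority samples in it" idea behind Aeneas and Lamont is shown below, using PCA followed by SMOTE as stand-ins (the original methods synthesize failing test cases with a CVAE and with LDA plus SMOTE, respectively, on test coverage data):

```python
import numpy as np
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(1)
# Hypothetical coverage-like feature vectors; label 1 = failing test case (rare class).
X = rng.integers(0, 2, size=(500, 200)).astype(float)
y = np.array([0] * 480 + [1] * 20)

# Step 1: reduce the high-dimensional feature space.
X_low = PCA(n_components=20, random_state=1).fit_transform(X)

# Step 2: synthesize failing samples in the reduced space to balance the classes.
X_bal, y_bal = SMOTE(random_state=1).fit_resample(X_low, y)

print(X_low.shape, np.bincount(y), "->", np.bincount(y_bal))
```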
 (2) Data quality issues of the dataset

Currently publicly available vulnerability datasets cover a limited number of programming languages and vulnerability types, which makes the datasets less generalizable, and the lack of complete vulnerability context makes them only able to represent a limited range of vulnerability patterns. For example, vulnerability datasets at the function granularity do not provide a complete structure of vulnerability code across functions.

Vulnerabilities mined from the same project may also introduce duplicate or residual data due to cross-version evolution or code reuse. Differences in the code styles of developers in different project teams and differences in the application context of code in different projects also lead to differences in data distribution, making it difficult to learn vulnerability patterns across projects.

In addition, the labeling of datasets is often noisy. This is due to the fact that manual labeling depends on the expertise of the security experts, while labeling with static analysis tools results in a large number of mislabels, and the timeliness of the data may lead to mislabels due to potentially undiscovered or silently remediated vulnerabilities in the dataset.

The above dataset-related factors have significantly increased the difficulty of training detection models with better generalization ability, which has become a bottleneck in improving the performance of vulnerability detection models.
 (3) Problems with the source of the dataset
Most software code is still not open source, and even most public vulnerability reports do not publish the vulnerable code, which makes it more difficult for researchers to obtain data from these sources. Therefore, most current vulnerability data comes from synthetic data in SARD (Software Assurance Reference Dataset), a smaller portion comes from real projects via NVD, CVE, and GitHub, and some datasets are a mixture of the two.

Synthetic datasets from SARD contain synthetic or semi-synthetic vulnerability data; for example, the SATE IV Juliet dataset is a public vulnerability dataset constructed from synthetic vulnerability code that mimics known patterns of real vulnerable code. Synthetic data is widely used by researchers because of its large number of samples, many vulnerability types, low noise, and low cost. Compared with real code, synthetic code is simpler and more self-contained, with fewer variations in code patterns and purer vulnerability contexts, so its vulnerability features are easier to learn. However, there is a big gap between synthetic data and real project code in terms of complexity and coverage of program syntactic structures; synthetic data cannot reveal the vulnerability distribution of real scenarios, so models trained on it have difficulty accurately detecting vulnerabilities in real code. Semi-synthetic datasets are datasets in which real code is simplified and modified to serve the purpose of academic research. For example, testID: 151455 in the SARD dataset is a typical semi-synthetic example. Since in semi-synthetic datasets researchers tend to highlight the vulnerable parts of the original code, such datasets are not fully representative of real-world vulnerable code.
 (4) The problem of labeling the dataset

Models trained on synthetic or semi-synthetic datasets are difficult to adapt to vulnerability detection scenarios of complex code in real-world projects, so it is imperative to mine and annotate vulnerable code instances in real software projects to build real vulnerability datasets. Dowd et al. found that one hour of security checking can only cover 500 lines of code on average, while most modern software systems contain millions of lines of code. Therefore, manually labeling real vulnerability instances is costly and it is difficult to obtain a large number of real vulnerability instances.

To address this problem, some researches focus on using automated approaches to label vulnerability datasets, such as using the detection results of static analysis tools to label vulnerabilities in open source project code , but the high false alarm rate of static analysis detection tools makes the reliability of the obtained vulnerability labels low.

The more commonly used method of automatically annotating vulnerability data is to collect data with real vulnerability labels from the international authoritative vulnerability database CVE, follow the code commit links provided in each publicly disclosed CVE entry to the security-related commit logs in the open source code repository, and then analyze the differences between the code before and after the fix to extract the vulnerable code and the patch code, respectively, thus creating a vulnerability dataset. Some of these datasets are constructed entirely from vulnerability instances from real projects, but the number of vulnerability instances in them is much smaller than in the publicly available hybrid vulnerability datasets, which contain both synthetic data from SARD and NVD and real data from CVE. This is because, in addition to vulnerability examples from real programs, another portion of these datasets consists of synthetic vulnerability code extracted from the SARD and NVD databases. Although using program slicing to split vulnerable code into multiple sliced code segments increases the number of vulnerability instances in the dataset, different slices extracted from the same original vulnerable code have the same vulnerability type, so constructing the vulnerability dataset in terms of sliced code segments does not add new vulnerability types to the dataset.
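The commit-diff labeling pipeline described above can be approximated with ordinary git commands. In the sketch below, the repository path and commit hash are placeholders; the version of each changed file before the fixing commit is treated as vulnerable and the version after it as patched:

```python
import subprocess

def changed_files(repo, fix_commit):
    """List the files touched by a vulnerability-fixing commit."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{fix_commit}^", fix_commit],
        capture_output=True, text=True, check=True)
    return out.stdout.split()

def file_at(repo, commit, path):
    """Return the content of `path` as it existed at `commit`."""
    out = subprocess.run(
        ["git", "-C", repo, "show", f"{commit}:{path}"],
        capture_output=True, text=True, check=True)
    return out.stdout

repo, fix = "/path/to/project", "abc1234"           # placeholders
for path in changed_files(repo, fix):
    vulnerable = file_at(repo, f"{fix}^", path)     # version before the fix
    patched = file_at(repo, fix, path)              # version after the fix
    # ... align the functions in both versions and label the changed ones ...
```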

Some studies have also attempted semi-automatic annotation of vulnerability datasets. For example, Zhou et al. constructed four vulnerability datasets from real open source projects by first selecting security-related code commits through keyword filtering and then having security experts manually verify them. These datasets contain more vulnerability examples from real projects than the previous datasets, but only two of them have been made public so far. Chakraborty et al. crawled security-tagged bug reports from Bugzilla and the Debian security tracker, respectively, and collected vulnerable code and patch code from real projects.

In summary, there is a lack of recognized public benchmark vulnerability datasets that can be used to evaluate detection performance, and fine-grained vulnerability datasets with statement-level annotations are particularly scarce. Second, there is a lack of recognized high-quality real vulnerability datasets and of methods to accurately assess dataset quality; issues such as data generalizability, metadata completeness, and labeling accuracy and reliability leave the quality of vulnerability datasets in need of improvement. Moreover, most of the publicly available vulnerability datasets are hybrid datasets, in which the proportion of real vulnerability instances is low and the diversity of vulnerability types and the coverage of vulnerability context structures are lacking. This is not conducive to the model learning richer vulnerability patterns and makes it difficult to satisfy the demand for highly generalizable models that can detect unknown types of vulnerabilities in real-world applications; as a result, vulnerability detection models trained on existing datasets are difficult to apply in industry. Therefore, the lack of large-scale, high-quality public benchmark datasets of source code vulnerabilities from real projects is one of the major challenges facing vulnerability detection.

 4.2 Vulnerability Data Mining Methods


Rule-based methods, i.e., using keyword filtering and diff file analysis to collect vulnerability data, can often only extract a small number of vulnerability instances from code repositories. In order to mine more vulnerability data from open source code repositories such as GitHub and thus address the problem of creating large-scale, high-quality vulnerability datasets, some researchers have begun to mine vulnerability data using code change intent identification or security-related code commit identification methods. Table 2 lists the literature related to identifying code commits that fix faults or vulnerabilities.

Table 2 Vulnerability data mining methods based on identifying bug- or vulnerability-fixing code commits

Task | Research organization
Identification of fault-fixing code commits | North Dakota State University, USA
Identification of the type of fixed fault | Yangzhou University, China
Mining vulnerability data by identifying security-related code commits | University of Bonn, Germany
 | SourceClear Corporation, USA
 | SAP Security Research, Germany
 | Northwestern University, China
 | North Carolina State University, USA

Among them, Zafar et al. designed a pre-trained BERT (Bidirectional Encoder Representations from Transformers) based method for identifying fault-fixing code commits, which predicts whether a user's code commit contains the intention to fix a fault and thereby determines whether the target version of the code contains a fault. In order to further identify the specific types of faults involved in code commits, another study constructed a repair tree based on the code diff files and used a tree-based convolutional neural network to classify the specific fault types of code commits.
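As a minimal sketch of BERT-based commit-intent classification with the HuggingFace transformers library (the checkpoint name and label meaning below are placeholders, and the classification head would first have to be fine-tuned on labeled fix/non-fix commits; this is not Zafar et al.'s actual implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; in practice this would be a BERT model fine-tuned on commits.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

msg = "Fix off-by-one error in buffer length check"
inputs = tokenizer(msg, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
# Index 1 is assumed here to be the "fault-fixing commit" class.
prob_fix = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(fix-related) = {prob_fix:.3f}")
```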

Since vulnerabilities (i.e., security-related defects) differ from general defects, Perl et al. built on the above classification methods for fault-fixing code commits and proposed a security-related code commit identification method based on text mining and machine learning: it first extracts fix-intent features from code commits using text mining techniques, then classifies the extracted features with an SVM, and finally determines whether the commit is related to a vulnerability based on the prediction results. In order to combine the advantages of different text classifiers in the task of identifying security-related code commits, Zhou et al. used logistic regression to combine six different machine learning classifiers and achieved better classification results than any single text classifier. Wang et al. used conformal prediction to evaluate the confidence of multiple classification models and adopted a voting strategy to combine the predictions of the classifiers with higher confidence, improving the overall recognition accuracy. In contrast to the above work, which only uses the textual information of code commits, Sabetta et al. also extract fix-related code features from the code itself and use a text classifier together with a diff-code classifier to comprehensively evaluate whether the current commit is related to a vulnerability fix. Considering that issue reports in code repositories often contain a large number of vulnerability instances, Oyetoyan et al. developed a method that identifies security-related issue reports and code commits at the same time, combining keyword filtering and the TF-IDF


(Term Frequency-Inverse Document Frequency) method. In this method, TF is the frequency of a word in a document, i.e., the term frequency; IDF is the logarithm of the reciprocal of the fraction of documents containing the word, i.e., the inverse document frequency; and TF-IDF, the product of TF and IDF, represents the importance of the word to the document. Therefore, by calculating TF-IDF we can obtain the keywords that are more important for document categorization. Oyetoyan et al. used keyword filtering and TF-IDF to extract security-related keywords from issue reports and code commits, and then used them as features with machine learning algorithms to identify whether the contents of issue reports and code commits are security-related.
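For illustration, this TF-IDF weighting can be computed with scikit-learn and fed to any standard classifier; the commit messages and labels below are made up for the sketch and do not come from the cited study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

commits = [
    "fix buffer overflow in packet parser",          # security-related
    "sanitize user input to prevent sql injection",  # security-related
    "update readme and contributor list",            # not security-related
    "refactor build scripts for new compiler",       # not security-related
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()            # computes TF * IDF weights per word and document
X = vectorizer.fit_transform(commits)

clf = LogisticRegression().fit(X, labels)
test = vectorizer.transform(["patch heap overflow reported by fuzzer"])
print(clf.predict(test))                  # expected: [1], i.e., security-related
```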

Although there is a large amount of available data in open source code repositories, not all commit logs and code changes are related to vulnerability fixes, and some commit logs even mix vulnerability-fixing commits with other types of commits, which further increases the difficulty of accurately identifying vulnerability-fix-related commits. In addition, since some complex vulnerabilities are usually not fixed in a single commit, the complexity of version tracking further increases the difficulty of accurately identifying the versions in which a vulnerability was introduced and fixed. Therefore, further research is needed on how to mine vulnerable code through version-tracking analysis of vulnerabilities. Moreover, due to commercial sensitivity and security and privacy considerations, many vulnerabilities are patched covertly to prevent attackers from exploiting publicly disclosed vulnerabilities, which increases the difficulty of mining high-quality real vulnerability data from software repositories. Currently, there is relatively little literature on security-related code commit identification aimed at expanding high-quality vulnerability datasets. How to mine vulnerability data from multi-source heterogeneous software repositories to build large-scale, high-quality real vulnerability datasets remains to be studied.

 5 Program Representation for Vulnerability Detection Tasks


Although there are some similarities between natural languages and programming languages, there are also significant differences. In terms of linguistic form, programs contain rich and explicit structural information; although such structure also exists in natural language, it is not as strict as in programs. In addition, the semantic adjacency of programs differs from that of natural languages: for example, statements inside and outside a loop are not semantically adjacent even though they are adjacent in terms of distance, so it is necessary to convert the program into a suitable intermediate representation to model this semantic information. Common intermediate representations of code are token sequences, the Abstract Syntax Tree (AST), the Control Flow Graph (CFG), the Data Flow Graph (DFG), the Program Dependency Graph (PDG), and so on. These different intermediate code representations provide different levels of abstraction; the higher the abstraction level of the intermediate representation, the better its ability to represent the semantic information of the code. For example, token sequences can represent the natural ordering and lexical statistics of the code; ASTs are good at characterizing the similarity of specific programming patterns and syntactic structures; CFGs and DFGs can represent more control and data flow information than token sequences and ASTs; and PDGs are better at representing the dependencies between variables in a program. Second, different intermediate code representations have different levels of importance for specific software engineering tasks, and a given intermediate representation may only be suitable for one or a few types of software engineering tasks. Therefore, there is a need to investigate program representations suitable for vulnerability detection tasks.

Early machine learning-based vulnerability detection methods (e.g., those using metric-based program representations) often relied on feature engineering to manually extract vulnerability-related features. In recent years, researchers have begun to use deep learning methods to automatically extract vulnerability features. Since programs have different representations at different stages of compilation, as shown in Fig. 7, deep learning-based program representation methods can be classified, according to the organization of the code representation, into three categories: sequence-based, syntax tree-based, and graph-based program representation methods. Different program representation methods have different objectives and are suitable for extracting different vulnerability features. Table 3 lists representative literature on the above four types of program representations (including metric-based representations) applied to the vulnerability detection task.

Fig. 7 Correspondence between intermediate representations of code at different levels of abstraction and program representation models with different coding semantic capabilities

 5.1 Metrics-Based Representation of Programs


Software metrics measure code to capture information about code quality, complexity, maintainability, and code structure, in order to identify areas of the software that are difficult to understand or maintain and thus guide subsequent, more complex code analysis. The main idea of metric-based vulnerability detection methods is to construct a set of code feature values based on software metrics, and then use this feature set as the input to a machine learning model that learns the complex correlations between structural metrics and defects and uses this knowledge to predict the likelihood of defects in new code.
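A minimal sketch of this pipeline is shown below; the metric values are invented, and real studies compute metrics such as cyclomatic complexity and code churn with dedicated measurement tools before training the classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Columns: [cyclomatic complexity, lines changed (churn), #dependencies, #developers]
X = np.array([
    [25, 400, 12, 6],
    [3,  10,  1, 1],
    [18, 250,  9, 4],
    [2,   5,  2, 1],
    [30, 520, 15, 7],
    [4,  20,  3, 2],
] * 20, dtype=float)
y = np.array([1, 0, 1, 0, 1, 0] * 20)   # 1 = vulnerable module, 0 = clean module

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```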

In earlier work, researchers used more classical software metrics, for example code churn, code complexity, coverage, dependency, organizational, and developer activity metrics.

Table 3 Representative literature on program representation methods

Category | Subcategory | Code representation
Metric-based | Classical software metrics | Developer activity
Metric-based | Classical software metrics | Code churn, complexity, coverage, dependency, organization
Metric-based | Classical software metrics | Complexity, code churn, fault history
Metric-based | Classical software metrics | Code complexity, code churn, developer activity
Metric-based | Classical software metrics | Code churn, developer activity
Metric-based | Classical software metrics | Complexity
Metric-based | Improved software metrics | Cyclomatic complexity, function dependency, function pointer usage, control structure dependency
Metric-based | Slice-based software metrics | Software metrics extracted from sliced code segments
Sequence-based | Based on function call sequences | Function call sequence
Sequence-based | Based on source code sequences | Source code text token sequence
Sequence-based | Based on code segment sequences | Code segment token sequence
Sequence-based | Based on intermediate code sequences | Intermediate code token sequence
Sequence-based | Based on assembly code sequences | Assembly code token sequence
Syntax tree-based | Syntax tree structure transformed into a sequence | AST path; AST token sequence
Syntax tree-based | Direct modeling of the syntax tree structure | CFAST
Graph-based | Graph transformed into a sequence | —
Graph-based | Direct modeling of the graph structure | Composite graph; program graph; slice graph structure

However, empirical studies have shown that classical software metrics are not applicable to source code vulnerability detection.

In order to further improve the accuracy of metric-based vulnerability detection methods, so that software metrics can accurately characterize the vulnerability code, researchers have improved the statistics and extraction methods of software metrics. For example, in addition to classic software metrics such as code complexity, Du et al. also extracted software metrics that can reflect the characteristics of the vulnerable code, such as function dependency, function pointer usage, and control structure dependency, in order to assist in identifying vulnerable functions. On the other hand, a large number of non-vulnerability related statements within a function can also interfere with vulnerability characterization. To solve this problem, Zagane et al. proposed a vulnerability detection method that slices the program and extracts software metrics from the sliced code segments. The experimental results of Salimi et al. show that software metrics extracted from slices can more accurately characterize vulnerabilities than those extracted from functions.

 5.2 Sequence-based program representation


Sequence-based program representation transforms code into token sequences and then models them with sequence-based deep neural networks to generate vector representations of the code. The main types of methods are as follows.

Approaches based on function call sequences: these methods extract function call sequences by parsing the source code and transform them into vector representations. For example, Grieco et al. transformed function call sequences into low-dimensional vector representations using N-gram and word2vec models, and other work generated numeric vectors of the code by creating a unique integer index for each word. The goal of these methods is to provide a lightweight approach that can detect vulnerabilities quickly and efficiently, but the types of vulnerabilities they can detect are limited. Approaches based on source code token sequences: these methods convert the code into token sequences using lexical analysis techniques, and then convert each token in the sequence into a vector representation using bag-of-words, N-gram, or word2vec models. However, it is difficult for them to learn the high-level syntactic and semantic information of the code.
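A minimal sketch of the "lex into tokens, then embed each token with word2vec" step, using pre-tokenized snippets and gensim (real systems use language-specific lexers and much larger corpora):

```python
from gensim.models import Word2Vec

functions = [
    "char buf [ 10 ] ; strcpy ( buf , src ) ;",
    "int n = read ( ) ; if ( n > 0 ) process ( n ) ;",
]

# Trivial tokenizer: the snippets above are already whitespace-separated.
token_seqs = [f.split() for f in functions]

# Train skip-gram word2vec embeddings over the token sequences.
model = Word2Vec(sentences=token_seqs, vector_size=64, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)

print(model.wv["strcpy"].shape)              # a 64-dimensional token vector
print(model.wv.most_similar("buf", topn=3))  # tokens that appear in similar contexts
```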

Approaches based on token sequences of code gadgets: code gadgets are semantically related but not necessarily contiguous lines of code extracted from a program using syntactic and semantic analysis. The most representative work in this category is the series of code gadget-based vulnerability detection approaches proposed by Li et al., namely VulDeePecker, μVulDeePecker, and SySeVR. This class of methods first extracts code segments in different ways, then represents the code segments as token sequences, and finally encodes them into vector representations. Specifically, VulDeePecker uses static analysis tools (e.g., Checkmarx) to parse the program source code, extracts vulnerability candidate key points related to insecure library/API function calls from the code according to rules, and then uses program slicing to extract, from the AST and PDG, the lines of code that have data dependencies on these key points, forming code gadgets. The code gadgets are then transformed into token sequences, which are converted into low-dimensional vector representations using word2vec. In order to identify specific types of vulnerabilities, μVulDeePecker extends the gadget representation in VulDeePecker by adding control-dependent statements, thereby including more "global" semantic information. In order to detect more types of vulnerabilities, the SySeVR method improves on VulDeePecker by considering, in addition to library/API function calls, syntactic features such as array usage, pointer usage, and arithmetic expressions.


The vulnerability candidate key points related to library/API function calls, array usage, pointer usage, and arithmetic expressions are then used as the slicing criteria, and program slicing is applied to extract from the AST and PDG the syntactic and semantic features related to these four types of vulnerability candidate key points.
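
The sketch below illustrates, in a simplified and purely syntactic way, how candidate key points of these four kinds can be located in C source lines before slicing. The regular expressions and the list of sensitive APIs are illustrative assumptions, not the rules used by VulDeePecker or SySeVR.

```python
import re

# Illustrative (and deliberately incomplete) list of sensitive library/API calls.
SENSITIVE_APIS = {"strcpy", "strcat", "memcpy", "sprintf", "gets"}

def find_candidate_keypoints(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line_no, kind, line) for lines matching one of the four kinds of
    vulnerability candidate key points: API call, array use, pointer use,
    arithmetic expression."""
    hits = []
    for i, line in enumerate(lines, start=1):
        for api in SENSITIVE_APIS:
            if re.search(rf'\b{api}\s*\(', line):
                hits.append((i, "api_call", line.strip()))
        if re.search(r'\w+\s*\[[^\]]*\]', line):
            hits.append((i, "array_use", line.strip()))
        if re.search(r'(?<!\w)\*\s*\w+|\w+\s*->', line):
            hits.append((i, "pointer_use", line.strip()))
        if re.search(r'\w+\s*[-+*/]\s*\w+', line):
            hits.append((i, "arithmetic", line.strip()))
    return hits

code = [
    "char buf[16];",
    "int n = len + off;",
    "strcpy(buf, user_input);",
    "*p = buf[n];",
]
for hit in find_candidate_keypoints(code):
    print(hit)   # each hit would then serve as a slicing criterion
```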

Approaches based on intermediate code token sequences: Some researchers have focused on code representations based on token sequences of intermediate code. The most commonly used intermediate code in the field of vulnerability detection is LLVM IR (Low-Level Virtual Machine Intermediate Representation), which adopts the Static Single Assignment (SSA) form to ensure that each variable is assigned exactly once. As a result, intermediate code based program representations can more accurately encode the semantic features of the code related to control flow and variable definition-use relationships, and capture more precise semantic information than source code based program representations. A representative work is VulDeeLocator, which replaces the source code slices in SySeVR with LLVM intermediate code slices in order to represent more of the code's semantics.
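
As a toy illustration of the SSA property (each variable is assigned exactly once, which makes definition-use relations explicit), the sketch below renames variables in a straight-line sequence of assignments. Real SSA construction, as in LLVM, additionally inserts phi nodes at control-flow join points; this sketch handles straight-line code only.

```python
def to_ssa(stmts):
    """Rename variables in straight-line three-address statements so that every
    variable is assigned exactly once (no phi nodes; straight-line code only)."""
    version = {}          # current version number of each variable
    out = []
    for lhs, rhs in stmts:                # rhs is a list of operand names/constants
        new_rhs = [f"{v}{version[v]}" if v in version else v for v in rhs]
        version[lhs] = version.get(lhs, 0) + 1
        out.append((f"{lhs}{version[lhs]}", new_rhs))
    return out

# x = a + b; x = x * 2; y = x + 1
stmts = [("x", ["a", "b"]), ("x", ["x", "2"]), ("y", ["x", "1"])]
for lhs, rhs in to_ssa(stmts):
    print(lhs, "=", " op ".join(rhs))
# x1 = a op b
# x2 = x1 op 2
# y1 = x2 op 1
```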

Assembly token sequence based approaches: Some researchers represent code based on assembly language token sequences. For example, Li et al. first extract source code slices in the same way as SySeVR and, after compiling the source code into assembly code, generate assembly code slices by using an alignment algorithm to find the assembly code corresponding to the statements in the source code slices. Tian et al. use symbolic execution and static analysis techniques to generate assembly code slices by directly analyzing the control flow and data flow of the binary program.

Sequence-based program representations can reflect the natural word-order information and lexical statistics of a program, which are useful for tasks such as code generation, code completion, code summarization, code search, vulnerability detection, and defect repair. However, sequence-based program representations "flatten" the code, so the structural information of the program is not effectively utilized. It is therefore necessary to study syntax tree based and graph-based program representation methods that make full use of the structural information of the program to model the source code.

 5.3 Syntax-tree based program representations


Syntax tree-based program representation generally refers to the representation of code as an AST structure. Some studies have flattened ASTs into sequences, while others have extracted vulnerability semantics based directly on the original tree structure of ASTs.

The advantage of converting the syntax tree structure into sequences is that vulnerability features can be extracted directly with sequence-based deep neural networks; the key question is how to convert the AST extracted from the source code into a sequence representation. Alon et al. form paths by connecting root nodes to leaf nodes and assign higher weights to frequently occurring paths to generate vector representations. This method has been successfully applied to tasks such as code attribute prediction and clone detection because it can effectively measure the similarity of syntactic structures between programs. However, since frequently occurring paths are not necessarily the paths containing vulnerabilities, and vulnerable paths are not necessarily the most frequent, this method is not well suited to the vulnerability detection task.

In the vulnerability detection task, there are two main ways to transform an AST into sequences. One is to transform the AST into token sequences of AST paths. For example, Li et al. use a greedy algorithm to extract as few long paths as possible that together cover all of the AST nodes, changing the starting point of the algorithm to obtain multiple long paths (the nodes of these long paths may overlap); after obtaining the set of long-path sequences, the tokens in the path contexts are converted into numeric vectors with word2vec. Tanwar et al. form paths by connecting pairs of AST leaf nodes, where a path context consists of the first and last leaf nodes and the path connecting them through intermediate nodes up to the root; all such paths are extracted from the AST to produce a sequence of paths, and the tokens in the path sequences are converted into numeric vectors using a pre-built vocabulary. The other way is to use depth-first traversal to transform the AST into a token sequence of AST nodes, and then map the tokens in the sequence to numeric vectors using techniques such as word2vec.
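
The following sketch shows the second strategy, using Python's own ast module as a stand-in for the C/C++ parsers used in the surveyed work: a depth-first traversal flattens the AST into a node-type token sequence, which could then be embedded with word2vec as described above.

```python
import ast

def ast_to_token_sequence(source: str) -> list[str]:
    """Flatten an AST into a token sequence of node types via depth-first traversal."""
    tree = ast.parse(source)
    tokens = []

    def dfs(node):
        tokens.append(type(node).__name__)          # node type as the token
        for child in ast.iter_child_nodes(node):
            dfs(child)

    dfs(tree)
    return tokens

src = "def f(n):\n    return n * 2\n"
print(ast_to_token_sequence(src))
# e.g. ['Module', 'FunctionDef', 'arguments', 'arg', 'Return', 'BinOp', 'Name', 'Load', 'Mult', 'Constant']
# (exact node names may vary slightly across CPython versions)
```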

Compared with transforming the AST into sequences, directly modeling the syntax tree structure can more fully represent the syntactic structural information in the code, and has been widely used in code-semantics related tasks such as clone detection, code classification, code completion, code generation, code summarization, and code attribute prediction. However, some empirical studies have shown that, because of the heterogeneous structure of the AST and the fact that it contains far more vulnerability-irrelevant nodes than other intermediate representations, using the AST alone as the intermediate code representation for vulnerability detection does not achieve the best detection performance. Some recent studies have therefore incorporated other intermediate code representations on top of the AST to integrate information at different levels of abstraction, for example by adding the AST syntactic structure information of statement blocks to the CFG, or by adding control flow information to the original AST to generate a new code representation called CFAST (Control Flow Abstract Syntax Tree).

 5.4 Graph-based program representation


Graph-based program representation consists of two main approaches: transforming graph structures into sequences, or modeling graph structures directly.

A program representation method that transforms a graph into a sequence focuses on how to traverse the graph-based intermediate representation to form a sequence. Duan et al. proposed a CPG-based vulnerability detection method, VulSniper, in which, after the CPG of a program is generated, the nodes in the CPG are encoded according to a coding rule designed by the authors to generate a three-dimensional feature tensor: the first two dimensions are similar to an adjacency matrix that records the positional relationships between nodes, and the third dimension describes the type of relationship between the nodes. The order of the nodes in the sequence is determined by their positions and nesting relationships in the AST, yielding the sequence of node feature vectors. Building on VulSniper, the authors further propose using program slicing techniques to remove statements that are unrelated to sensitive operations, in order to reduce vulnerability-irrelevant information in the program representation.

Compared with transforming the graph structure into a sequence, using graph neural networks to model the graph structure directly can better learn the dependencies between nodes and the structural and semantic information of the program contained in the graph-based code representation, providing more accurate feature vector representations for the source code vulnerability detection model. Such approaches mainly follow two ideas: one is to integrate multiple intermediate code representations to comprehensively characterize the vulnerability-related syntactic and semantic information at different levels of abstraction; the other is to use program slicing techniques to remove vulnerability-irrelevant information from the combined code representations.

Representative approaches of the first idea include the following. Yamaguchi et al. proposed the Code Property Graph (CPG), which combines the structural and semantic information of programs for program representation. Zhou et al. generated a composite graph that synthesizes structural and semantic information by adding natural sequence information to the graph built from the AST, CFG, and PDG. Wang et al. generate an intermediate representation called the Program Graph by combining information extracted from the AST and PDG. Cao et al. proposed the Code Composite Graph (CCG), which combines the AST, CFG, and DFG to characterize a program with both syntactic and semantic information; compared with the CPG used in earlier work, the CCG abandons the control dependence information in the code and retains the data dependence information. Other researchers proposed a comprehensive graph with 12 kinds of connectivity relations to contain more program semantic information, covering most of the edges and nodes extracted by the Joern static analysis tool.
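
A minimal sketch of this first idea, using networkx, is shown below: different kinds of edges (AST/CFG/PDG style relations such as control flow, data dependence, control dependence) are stored as typed edges of one multigraph so that a graph neural network can later distinguish them. The node and edge labels are illustrative, not the exact schema of any of the cited graphs.

```python
import networkx as nx

# One statement-level node per line of a tiny C-like function:
# 1: buf = alloc(10)   2: n = read_int()   3: if (n < 10)   4: buf[n] = 0
g = nx.MultiDiGraph()
for node_id, (code, node_type) in {
    1: ("buf = alloc(10)", "declaration"),
    2: ("n = read_int()", "declaration"),
    3: ("if (n < 10)", "condition"),
    4: ("buf[n] = 0", "assignment"),
}.items():
    g.add_node(node_id, code=code, type=node_type)   # node attributes: content and type

# Typed edges contributed by different intermediate representations.
g.add_edge(1, 2, etype="control_flow")
g.add_edge(2, 3, etype="control_flow")
g.add_edge(3, 4, etype="control_flow")
g.add_edge(2, 3, etype="data_dependence")   # n defined at 2, used at 3
g.add_edge(2, 4, etype="data_dependence")   # n used as an index at 4
g.add_edge(1, 4, etype="data_dependence")   # buf defined at 1, written at 4
g.add_edge(3, 4, etype="control_dependence")

print(g.number_of_nodes(), g.number_of_edges())   # 4 nodes, 7 typed edges
```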

It has been found that more information in the program representation is not always better: a large amount of vulnerability-irrelevant information may negatively affect the model's ability to learn vulnerability patterns. To address the problem that graph structures synthesizing the code's semantic information are complex and contain a large amount of vulnerability-irrelevant information, another idea is to use program slicing techniques to remove vulnerability-irrelevant information from the graph-structured intermediate representation of the code. For example, Cheng et al. proposed DeepWukong, a vulnerability detection method combining program slicing and graph neural networks, which extracts subgraphs from PDGs based on program slicing; however, the method only uses function calls and operator statements as slicing criteria, and therefore only detects a limited range of vulnerability types. Similarly, other researchers used only function calls and pointer operations as slicing criteria and extracted subgraphs, based on program slicing, from the PDG extended with the Call Graph (CG), which only detects memory-related vulnerabilities. To improve the ability to detect different types of vulnerabilities, Zheng et al. proposed a program representation called the Slice Property Graph (SPG), which, in addition to four commonly used slicing criteria, introduces two slicing criteria related to inter-procedural data transfer, namely function parameters and function return values, to cover more vulnerability candidate key points. Based on these six slicing criteria, inter-procedural analysis and program slicing are used to generate SPGs carrying node attribute information (such as statement content and node type), so as to more accurately extract the graph structure, node attributes, and code context that are data dependent, control dependent, and call dependent on the vulnerability candidate key points, thereby avoiding the negative impact of the large number of vulnerability-irrelevant statement nodes in CPGs on the training of the detection model.
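
The second idea can be sketched as follows with networkx, using a toy PDG and a single slicing criterion (real systems use inter-procedural dependence graphs and several criteria): starting from a candidate key point, the backward and forward slice is the set of nodes reachable along dependence edges, and the induced subgraph discards everything else.

```python
import networkx as nx

# Toy PDG: nodes are statements, edges are data/control dependences.
pdg = nx.DiGraph()
pdg.add_edges_from([
    (1, 3), (2, 3),      # statements 1 and 2 feed the key point 3 (e.g., a memcpy call)
    (3, 5),              # statement 5 uses a value produced at 3
    (4, 6),              # statements 4 and 6 are unrelated to the key point
])

def slice_subgraph(pdg: nx.DiGraph, criterion: int) -> nx.DiGraph:
    """Backward + forward slice around the slicing criterion, as an induced subgraph."""
    keep = {criterion} | nx.ancestors(pdg, criterion) | nx.descendants(pdg, criterion)
    return pdg.subgraph(keep).copy()

sg = slice_subgraph(pdg, criterion=3)
print(sorted(sg.nodes()))   # [1, 2, 3, 5]  (nodes 4 and 6 are sliced away)
```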

Finally, in terms of programming languages, there are relatively few learning-based source code vulnerability detection methods for multilingual source code, and these methods mainly use metric-based and sequence-based program representations, because feature extraction based on these two representations can easily be scaled to multiple programming languages without language-specific code parsing tools for building complex intermediate representations of the code. For example, Zaharia et al. extracted vulnerability features from C/C++ and Java code based on an intermediate representation of token sequences.

Since syntax tree based and graph-based program representations can provide the structural information in the code, in recent years they have been widely used for vulnerability detection in C/C++ programs as well as in programs written in other languages. For example, Lin et al. used the PDG-based Sub-Dependence Graph (SDG) for PHP to represent dependencies in vulnerable code. However, these representations require different code parsing tools for different programming languages to obtain intermediate representations such as syntax trees and graphs. For example, C/C++ code can be parsed using Joern or CppDepend, Java code using Javalang or JavaParser, JavaScript code using Esprima or Esgraph, and PHP code using PHP-Parser or PHP-Joern. These tools reduce the cost of lexical and syntactic analysis of source code for researchers, and further facilitate the use of syntax tree based and graph-based program representations. With the emergence of multilingual code parsing tools (e.g., Tree-sitter), researchers have begun to use syntax tree based and graph-based program representations for multilingual source code vulnerability detection.
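
For example, with the Tree-sitter Python bindings one parser front end can produce syntax trees for many languages. The sketch below assumes the legacy tree_sitter API (in which grammars are compiled with Language.build_library) and assumes that the grammar repositories have been cloned locally; the paths are illustrative.

```python
from tree_sitter import Language, Parser  # assumes the legacy (pre-0.22) binding API

# Build one shared library from locally cloned grammar repositories (paths are illustrative).
Language.build_library("build/langs.so", ["vendor/tree-sitter-c", "vendor/tree-sitter-java"])
C_LANG = Language("build/langs.so", "c")
JAVA_LANG = Language("build/langs.so", "java")

def parse(src: bytes, lang: Language):
    parser = Parser()
    parser.set_language(lang)
    return parser.parse(src).root_node

c_root = parse(b"int add(int a, int b) { return a + b; }", C_LANG)
java_root = parse(b"class A { int add(int a, int b) { return a + b; } }", JAVA_LANG)

# The same downstream pipeline (e.g., flattening the tree into a node-type sequence
# or building graphs over it) can now be reused across languages.
print(c_root.sexp()[:60])
print(java_root.sexp()[:60])
```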

In summary, the intermediate representation of programs lays the foundation for learning program representations for vulnerability detection tasks, but the ability of program representations to represent semantics at different levels of granularity and abstraction still needs to be improved. A comparison of the advantages and disadvantages of several typical program representation methods is shown in Table 4.

 Table 4 Advantages and disadvantages of program representation methods

| Category | Advantages | Disadvantages |
| --- | --- | --- |
| Metric-based program representation | Most software metrics are simple and easy to extract, and are suitable for characterizing statistical metric information of the code | Difficult to characterize the deeper semantics of vulnerable code |
| Sequence-based program representation | Can represent the natural word order and lexical statistics of the program | Represents the program as a linear sequence, so the structural information of the program is not fully utilized |
| Syntax tree based program representation | Can reflect the syntactic structure information of the code | For large-scale programs, the number of AST nodes is much larger than the number of statements, and the hierarchical tree structure requires recursive processing, leading to high time complexity for subsequent feature extraction models |
| Graph-based program representation | Can better preserve the logical structure, dependency relationships, and other semantic information of the code | For large programs, the generated graphs contain many kinds of edges and become complex, which increases the complexity of subsequent models |

In summary, the metric-based program representation is suitable for characterizing the statistical metric information of the code but has difficulty depicting the deeper semantic information of vulnerable code. The sequence-based program representation is suitable for representing the natural order and lexical information of the program, but representing the program as a linear sequence loses its structural information. The AST-based program representation is suitable for representing the syntactic structure information of the program, but when dealing with large-scale programs the number of AST nodes is much larger than the number of tokens in the code, which makes it difficult for the model to learn long-distance dependency information in the tree structure. The graph-based representation can better preserve complex semantic information such as the logical structure and dependencies of the code, but when the program is large, the generated graph structure becomes complex, which reduces the efficiency of model training. An effective intermediate code representation should not only retain rich and critical vulnerability semantic information to improve its ability to abstract and characterize vulnerability semantics, but also minimize the negative impact of vulnerability-irrelevant information on the representation of vulnerability features and reduce the complexity of the intermediate representation. Therefore, how a program representation method can comprehensively, accurately, and effectively portray the syntax and semantics of vulnerable code at different granularities and abstraction levels still needs to be studied in depth.

 6 Coarse-Grained Vulnerability Detection Methods for Source Code


Program analysis is the fundamental technology of vulnerability detection and includes static analysis, dynamic analysis, and hybrid analysis methods. Static analysis techniques detect vulnerabilities by analyzing the code without building a runtime environment to execute the program, but their false positive rate is high. Dynamic analysis detects vulnerabilities by running the program; although its accuracy is higher, it depends largely on the quality of the test cases, its code coverage is low, and false negatives are possible. Moreover, researchers can sometimes obtain only part of the changed code from a code repository rather than the complete source code of the project, so the code cannot be compiled and run properly and detection can only rely on static analysis. Although combining static and dynamic analysis can mitigate false positives and false negatives to some extent, it still suffers from higher time complexity and lower detection efficiency. In short, traditional vulnerability detection techniques still need further improvement in efficiency and effectiveness. In recent years, machine learning and deep learning have provided more effective and automated solutions for the vulnerability detection task. Compared with traditional techniques, learning-based approaches can learn latent, abstract patterns of vulnerable code and can significantly improve the generalization ability of the model. Therefore, this paper focuses on the discussion and analysis of learning-based vulnerability detection methods.

 6.1 Common vulnerability detection tools and evaluation metrics


The basic principle of rule-based detection methods is to provide a rule syntax with which security professionals can describe defective data flows or control flows, and then use rule matching to report code that satisfies the conditions described by the rules as vulnerable. Existing commercial and open-source detection tools adopt this rule-based detection approach, although their detection principles vary. Table 5 lists some classic vulnerability detection tools.
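
Before turning to the tools in Table 5, the sketch below gives a toy illustration of this rule-matching principle: C source lines are scanned against a small table of rules describing insecure calls, and matches are reported as findings. The rule set is hypothetical and far simpler than what the listed tools implement.

```python
import re

# A hypothetical rule table: each rule pairs a pattern over code text with a message.
RULES = [
    (re.compile(r'\bgets\s*\('),               "use of gets(): unbounded read into buffer"),
    (re.compile(r'\bstrcpy\s*\('),             "use of strcpy(): no length check on copy"),
    (re.compile(r'\bsystem\s*\(\s*\w+\s*\)'),  "system() called with a variable: possible command injection"),
]

def scan(path: str, lines: list[str]) -> list[str]:
    """Report every line that matches a rule, in a 'file:line: message' format."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append(f"{path}:{lineno}: {message}")
    return findings

code = [
    'char buf[32];',
    'gets(buf);',
    'system(cmd);',
]
print("\n".join(scan("demo.c", code)))
```
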
 Table 5 Common vulnerability detection tools

| Type | Representative tools | Languages detected |
| --- | --- | --- |
| Commercial tools | Checkmarx | 20 languages |
| | Coverity | 19 languages |
| | Fortify | 25 languages |
| | CodeSonar | Java |
| | Klocwork | Java |
| | CodeSecure | 6 languages |
| | Helix | |
| Open-source tools | Flawfinder | |