
 Research and Progress on Learning-Based Source Code Vulnerability Detection

Su Xiaohong, Zheng Weining, Jiang Yuan, Wei Hongwei, Wan Jiayuan, Wei Ziyue (Faculty of Computing, Harbin Institute of Technology, Harbin 150001)

Abstract


Abstract Automatic source code vulnerability detection is the premise and foundation of source code vulnerability repair and is of great significance for ensuring software security. Traditional methods usually detect vulnerabilities based on rules manually formulated by security experts, but formulating such rules is difficult, and the types of vulnerabilities that can be detected depend on the rules predefined by the experts. In recent years, the rapid development of Artificial Intelligence (AI) technology has provided an opportunity to realize learning-based automatic detection of source code vulnerabilities. Learning-based vulnerability detection methods use machine learning or deep learning techniques to detect vulnerabilities. Among them, deep learning-based methods can automatically extract the syntactic and semantic features related to vulnerabilities in code and thereby avoid feature engineering, so they have shown great potential in the field of vulnerability detection and have become a research hotspot in recent years. This paper reviews and summarizes existing learning-based source code vulnerability detection techniques and systematically analyzes their research progress, focusing on five aspects: vulnerability data mining and dataset construction, program representation for vulnerability detection tasks, machine learning and deep learning based source code vulnerability detection methods, interpretable methods for source code vulnerability detection, and fine-grained source code vulnerability detection methods. On this basis, a reference framework for vulnerability detection that combines hierarchical semantic perception, multi-granularity vulnerability classification, and assisted vulnerability understanding is given. Finally, future research directions for learning-based source code vulnerability detection technology are discussed.


Keywords Software security; Source code vulnerability detection; Vulnerability data mining; Vulnerability feature extraction; Code representation learning; Deep learning; Model interpretability; Vulnerability detection

CLC Classification Number: TP311    DOI: SP.J.1016.2024.00337

Research and Progress on Learning-Based Source Code Vulnerability Detection

SU Xiao-Hong  ZHENG Wei-Ning  JIANG Yuan  WEI Hong-Wei  WAN Jia-Yuan  WEI Zi-Yue (Faculty of Computing, Harbin Institute of Technology, Harbin 150001)

Abstract

Automatic detection of source code vulnerabilities is the precondition and foundation of source code vulnerability repair, which is of great significance for ensuring software security. Traditional approaches usually detect vulnerabilities based on the rules predefined by security experts. However, it is difficult to define detection rules manually, and the types of vulnerabilities that can be detected depend on the rules predefined by security experts. In recent years, the rapid development of artificial intelligence technology has provided opportunities to realize learning-based automatic source code vulnerability detection. Learning-based vulnerability detection methods are data-driven methods that use machine learning or deep learning techniques to detect vulnerabilities, among which deep learning-based vulnerability detection methods have shown great potential in the field of vulnerability detection and have become a research hotspot in recent years due to their ability to automatically extract syntax and semantic features related to


Received: 2023-03-14; Published: 2023-11-28. This work was supported by the National Natural Science Foundation of China under Grant No. 62272132. Xiaohong Su (Corresponding Author), Ph.D., Professor, Senior Member of China Computer Federation (CCF), focuses on intelligent software engineering, software vulnerability detection, program analysis and software testing, etc. E-mail: sxh@hit.edu.cn. Weining Zheng, Ph.D. candidate, focuses on software vulnerability detection. Yuan Jiang, Ph.D., Assistant Professor, Member of China Computer Federation, focuses on program analysis and code representation learning. Hongwei Wei, Ph.D. student, focuses on software data mining, software knowledge engineering, search-based software engineering, code pattern generation and search. Jiayuan Wan, Ph.D. student, focuses on software vulnerability detection and software testing. Ziyue Wei, M.S., focuses on smart contract software vulnerability detection.

vulnerabilities in source code to avoid feature engineering. This paper mainly reviews and summarizes existing learning-based source code vulnerability detection techniques, and provides a systematic analysis and overview of their research and progress, focusing on five aspects of the research work: vulnerability data mining and dataset construction, program representation methods for vulnerability detection tasks, traditional machine learning and deep learning-based source code vulnerability detection approaches, interpretable methods for source code vulnerability detection, and fine-grained methods for source code vulnerability detection. Specifically, in the first part, we summarize existing publicly available vulnerability datasets, including their sources and sizes, and describe the challenges faced in building vulnerability datasets, as well as how to address these challenges. In the second part, we briefly introduce intermediate code representations and divide existing code representations applied in the field of vulnerability detection into four categories: metric-based, sequence-based, syntax tree-based, and graph-based code representations. For each type of code representation method, we list some representative methods and analyze their advantages and disadvantages. In the third part, we introduce commonly used vulnerability detection tools and review coarse-grained vulnerability detection methods, including rule-based, machine learning based, and deep learning based vulnerability detection methods, and then analyze and discuss the characteristics, strengths and weaknesses of each type of vulnerability detection method. In the fourth part, we introduce interpretable methods that can further explain vulnerability detection results, briefly describe model self-interpretation methods, model approximation methods and sample feedback methods one by one, summarize their characteristics and discuss their strengths and weaknesses. In the fifth part, we first elucidate the problems and challenges posed by fine-grained vulnerability detection, and then provide a detailed description of existing representative methods for fine-grained vulnerability detection and their approaches to alleviate these challenges. Finally, we propose a source code vulnerability detection framework that combines hierarchical semantic awareness, multi-granularity vulnerability classification and assisted vulnerability understanding, and analyze its feasibility. We also prospect the future research directions for learning-based source code vulnerability detection techniques, such as the construction of large-scale, high-quality vulnerability datasets, techniques for detecting vulnerabilities in small or imbalanced samples, accurate and efficient vulnerability detection models, early detection techniques for vulnerabilities, etc.
Keywords software security; source code vulnerability detection; vulnerability data mining; vulnerability feature extraction; code representation learning; deep learning; model interpretability; vulnerability detection

 1 Introduction


The Internet is an indispensable infrastructure in the information age. While Internet technology brings convenience to human beings, it also provides opportunities for malicious actors. In recent years, hacker attacks, digital asset theft, and leakage of private user information have occurred frequently, posing a serious threat to the security of information systems. Vulnerabilities in software systems, which are the core components of cyberspace, are the root cause of such security incidents.

A software vulnerability refers to a design error, coding flaw, or operational failure introduced through intentional or unintentional negligence at various levels and stages of the software lifecycle of a software system or product. Malicious actors can exploit software vulnerabilities to gain higher levels of system privileges, steal private user data, and so on, thereby jeopardizing the security of the software system and affecting the normal operation of services built on top of it. For example, in 2017, a remote overflow vulnerability in the Windows Server Message Block (SMB) protocol enabled the WannaCry ransomware attack, resulting in a global Internet disaster.

In 2020, Zoom, a U.S. cloud videoconferencing company, experienced massive growth; in the same year, a vulnerability in the Zoom videoconferencing software on Mac systems led to video leaks affecting more than 4 million Zoom users. In December 2021, the Apache open source project Log4j was disclosed to have a "nuclear-level" remote code execution vulnerability, which allows an attacker to construct a malicious request and execute arbitrary code on a target server to steal data, mine cryptocurrency, or deploy ransomware. Statistics disclosed by the international authoritative vulnerability database CVE (Common Vulnerabilities & Exposures) and the U.S. National Vulnerability Database (NVD), shown in Figure 1, indicate that the number of disclosed software vulnerabilities has been increasing year by year in recent years; in particular, after 2017 the number of disclosed vulnerabilities is more than twice that of previous years. Software vulnerabilities have become one of the most important risks to the security of software and information systems.
Figure 1 Numbers of vulnerabilities disclosed in CVE and NVD over the years: (a) CVE; (b) NVD

Software static vulnerability detection methods can effectively improve software quality, reduce software security vulnerabilities, and minimize security risks, and have therefore attracted extensive attention from academia and industry. According to the objects analyzed, static vulnerability detection can be divided into binary vulnerability detection and source code vulnerability detection; this paper mainly reviews existing source code vulnerability detection methods. According to the techniques used in the detection process, source code vulnerability detection methods can be categorized into rule-based vulnerability detection methods, traditional machine learning-based vulnerability detection methods, and deep learning-based vulnerability detection methods; the latter two can be collectively referred to as learning-based vulnerability detection methods.

Rule-based source code vulnerability detection methods (e.g., some open-source or commercial vulnerability detection tools) rely on security rules defined by security experts, but the limitations and imperfections of the rules often lead to false positives or false negatives, and formulating complete and practical vulnerability detection rules requires high labor costs. With the accumulation of disclosed security vulnerabilities, learning-based automatic detection of source code vulnerabilities has gradually become possible: such methods can automatically learn vulnerability patterns from massive historical data, avoiding the need to formulate rules manually, and have thus become one of the hottest research directions in the field of software and cyberspace security.

This paper presents a systematic analysis of learning-based source code vulnerability detection techniques, focusing on the mining and construction of vulnerability datasets, program representation and program representation learning methods for vulnerability detection tasks, source code vulnerability detection methods based on traditional machine learning and deep learning, interpretable methods for source code vulnerability detection, and fine-grained source code vulnerability detection methods. By analyzing the existing methods, this paper summarizes the current challenges in the field of vulnerability detection, and gives a reference framework for source code vulnerability detection that combines hierarchical semantic awareness, multi-granularity vulnerability detection and assisted vulnerability understanding. Finally, this paper provides an outlook on the future research direction and development trend.

 2 Literature statistics

 The process of literature search and screening carried out in this paper is as follows:

Search databases: for foreign-language literature, Google Scholar was the main database, supplemented by the EI (Engineering Index) and SCI (Science Citation Index). For Chinese literature, CNKI (China National Knowledge Infrastructure) was used as the main database, supplemented by Wanfang Data and the VIP Chinese Science and Technology Journal Database.

Search keywords: English and Chinese keywords related to "source code vulnerability detection" and "vulnerability detection interpretability"; the search time range was from January 1, 2000 to May 20, 2023.

Using the above keywords, the listed databases were searched year by year, and the search results were manually verified by checking titles, keywords, and abstracts and browsing the content, in order to select the literature matching the topic of this paper (i.e., literature related to source code vulnerability detection) and to categorize it (including the source of the literature, whether it describes a learning-based vulnerability detection method, etc.). If no matching literature appeared on five consecutive pages of the search results list, the search for that year was considered complete. The search was performed by three security researchers (three PhD students), each spending an average of 75 hours.

Based on the above steps, this paper finally collects 1109 papers related to source code vulnerability detection, among which there are 768 research papers, 186 empirical analysis papers, and 155 review papers. As shown in Fig. 2, the number of papers in the field of vulnerability detection shows a fluctuating growth over time and reaches a peak in the last three years (the number of papers in 2023 is relatively small because only literature published up to May 2023 was surveyed), which indicates that this direction has become a research hotspot in recent years.
For the research papers, this paper further counts the number of papers on traditional machine learning-based and deep learning-based source code vulnerability detection; the results are shown in Figure 3.

Figure 2 Literature Classification Statistics of Source Code Vulnerability Detection Methods (as of May 20, 2023)

It can be seen that deep learning techniques have gradually been applied to the field of vulnerability detection since 2017, and by 2018 the number of deep learning-based papers had already exceeded the number of papers on traditional machine learning-based vulnerability detection.
Fig. 3 Summary of research papers on source code vulnerability detection

In addition, this paper also analyzes the literature published in the Class A international academic conferences and journals recommended by the China Computer Federation (CCF) and in three domestic computer journals; the results are shown in Fig. 4. It can be seen that vulnerability detection remains a research hotspot in the fields of network and information security and software engineering, and research results related to vulnerability detection also appear in journals and conferences of other fields.
Legend of Fig. 4: computer architecture/parallel and distributed computing/storage systems; computer networks; network and information security; software engineering/system software/programming languages; artificial intelligence; cross-cutting/comprehensive/emerging areas; domestic journals.
Fig. 4 Summary of top literature on source code vulnerability detection methods in terms of their domains

The statistics of the programming languages targeted by the surveyed vulnerability detection methods are shown in Figure 5. It can be seen that current source code vulnerability detection methods mainly target C/C++ code, Java, PHP, JavaScript, and so on, while other programming languages and multi-language code are less frequently involved.
Legend of Fig. 5: C/C++; Java; JavaScript; PHP; Python; other languages/multilingual.
 Fig. 5 Statistics of programming languages involved in source code vulnerability detection methods

3 Relevant definitions, problems and challenges, and research content

 3.1 Relevant definitions


Definition 1. Abstract Syntax Tree (AST). An AST is a tree representation of the abstract syntactic structure of source code, an intermediate representation of a program as an ordered tree structure, where the inner nodes correspond to operators in the program and the leaf nodes correspond to operands (e.g., constants or identifiers).
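As a concrete (purely illustrative) example, Python's built-in ast module can parse a small function into such a tree; walking the tree shows the operator/construct nodes and their operand children. The snippet below is only a sketch for intuition and is not taken from any of the surveyed tools:

```python
import ast

source = """
def copy_buf(src, n):
    buf = [0] * 10
    for i in range(n):      # note: no bounds check against len(buf)
        buf[i] = src[i]
    return buf
"""

tree = ast.parse(source)

# Print every AST node type together with the types of its children,
# i.e., the abstract syntactic structure of the function.
for node in ast.walk(tree):
    children = [type(child).__name__ for child in ast.iter_child_nodes(node)]
    print(type(node).__name__, "->", children)
```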

Definition 2. Control Flow Graph (CFG). A CFG is a directed graph with a unique entry node START and a unique exit node STOP. Apart from the entry/exit nodes, the remaining intermediate nodes represent statements or predicate expressions in the program, where a predicate is an operation that returns True or False and a predicate expression is an expression containing a predicate. Edges represent control flow relationships between statements and are also called control flow edges. In addition, for any intermediate node in the graph, there exists at least one path from the entry node to that node and at least one path from that node to the exit node.

Definition 3. Control Dependency. A node B is control dependent on a node A if there exists a directed path P from node A to node B in the CFG such that every node on P (excluding A and B) is post-dominated by B, and node A is not post-dominated by node B. Here, post-domination means that if every directed path from node A to the exit node contains node B, then node B post-dominates node A. It should be noted that post-domination does not include the exit node, and a node does not post-dominate itself.

Definition 4. Data Dependency. If there exists a path from node A to another node B in the CFG, and a value defined at node A is used at node B, then node B is data dependent on node A.

Definition 5. Program Dependency Graph (PDG). A PDG is a directed graph whose nodes are the same as those of the control flow graph, except that it has no entry/exit nodes. The edges connecting the nodes represent the control dependencies and data dependencies that exist between them.

Definition 6. Code Property Graph (CPG). A CPG is a graph representation of a program obtained by merging its abstract syntax tree, control flow graph, and program dependency graph. Formally, it consists of a non-empty finite set of nodes (the nodes of the control flow graph and the program dependency graph, together with the syntactic structure nodes of the syntax tree) and a set of directed edges representing the control dependency, data dependency, control flow, and syntactic relationships between nodes.
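To make the merge in Definition 6 concrete, the sketch below builds a tiny code property graph as a single multigraph whose edges are tagged with their origin (control flow, control dependency, data dependency). The statement nodes and edges are hypothetical and hand-written here; a real CPG would be produced by a code parser/analyzer:

```python
import networkx as nx

# Hypothetical statements of a tiny function:
#   1: n = read_int()   2: if (n > 0)   3: buf = alloc(n)   4: use(buf)
cpg = nx.MultiDiGraph()
cpg.add_nodes_from([1, 2, 3, 4])

# Control-flow edges (from the CFG).
for u, v in [(1, 2), (2, 3), (3, 4)]:
    cpg.add_edge(u, v, kind="control_flow")

# Control-dependency edges: statements 3 and 4 are guarded by the predicate at node 2.
for u, v in [(2, 3), (2, 4)]:
    cpg.add_edge(u, v, kind="control_dependency")

# Data-dependency edges: n defined at 1 is used at 2 and 3; buf defined at 3 is used at 4.
for u, v in [(1, 2), (1, 3), (3, 4)]:
    cpg.add_edge(u, v, kind="data_dependency")

# Edges of different kinds coexist between the same pair of nodes,
# which is exactly what a code property graph encodes.
print(list(cpg.edges(data="kind")))
```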


3.2 Problems and Challenges of Learning-based Source Code Vulnerability Detection Methods


(1) Insufficient vulnerability datasets for training learning-based source code vulnerability detection methods

Unlike vulnerability detection methods based on rules predefined by experts, learning-based source code vulnerability detection methods require sufficient sample data to train the model and improve detection performance. However, because vulnerable code is difficult to obtain from real projects, high-quality and large-scale real vulnerability datasets are still lacking. Training the model on small-scale datasets easily leads to overfitting, which harms the generalization ability of the model. Most current research uses publicly available synthetic or semi-synthetic vulnerability datasets; although these datasets are easily accessible and large in size, there is still a significant gap in code complexity and diversity of vulnerability patterns compared with real projects. As mentioned above, as the size of software increases, the types and number of exploitable vulnerabilities also increase. The continuous emergence of new vulnerability exploitation and attack patterns poses increasingly serious challenges to the generalization ability of learning-based vulnerability detection models, and expanding the dataset to supplement model training is the most direct and effective means of improving generalization. Therefore, how to construct a high-quality, large-scale, and sufficiently rich vulnerability dataset is a challenging problem for learning-based vulnerability detection methods.

(2) Limitations of Learning-based Source Code Vulnerability Detection Methods in Deep Vulnerability Semantic Understanding and Complex Vulnerability Feature Extraction

Learning-based source code vulnerability detection methods need to understand and learn program semantics and automatically capture the vulnerability features in code. However, during training, hardware limitations require the code vector representation fed to the model to be truncated to a fixed length; for larger functions, or functions whose vulnerable statements are located near the end, the code information beyond this limit is truncated, so the model cannot learn the complete semantic information of the code. Second, there are often a large number of contextual dependencies among code elements (e.g., tokens, statements) in a program, and the model needs to selectively retain and learn the more important vulnerability-related contextual dependencies in order to effectively identify vulnerability patterns. However, vulnerability patterns in real projects are usually complex, making it difficult for learning-based methods to accurately and efficiently learn the deep vulnerability semantics of code. Therefore, how to construct a detection model that can extract complex vulnerability features is another challenging problem for learning-based vulnerability detection methods.

(3) Poor interpretability of learning-based source code vulnerability detection methods

For a long time, the black-box problem of deep learning has been a major concern in academia, and this is especially true for learning-based source code vulnerability detection methods. Coarse-grained source code vulnerability detection methods do not provide further information about a vulnerability after identifying the vulnerable function or code segment, and the "black-box" nature of the model itself makes it difficult to fully explain the detection mechanism and results. Therefore, it is important to "white-box" learning-based vulnerability detection methods so that the detection process and results can be interpreted. At present, there has been some progress in interpretability research for deep learning, but research on applying it to the field of vulnerability detection is still relatively limited. Therefore, improving the interpretability of learning-based vulnerability detection methods is a challenging problem.

 3.3 Research on Learning-Based Source Code Vulnerability Detection


Source code vulnerability detection is the process by which a developer or a security expert finds an existing but unexposed vulnerability in the source code in some way. Based on the granularity of detection, source code vulnerability detection methods can be categorized into coarse-grained vulnerability detection and fine-grained vulnerability detection. Coarse-grained vulnerability detection refers to predicting the likelihood of a vulnerability being contained in a source code file, function (or method), or code fragment, while fine-grained vulnerability detection refers to predicting the specific statements in the source code that may trigger a vulnerability.

Coarse-grained detection is less difficult, faster and more accurate, but the results are not interpretable, which can easily delay vulnerability remediation and increase the remediation cost. Therefore, on the basis of coarse-grained source code vulnerability detection, researchers have further proposed some interpretable methods to help understand the coarse-grained detection results.

Fine-grained vulnerability detection is more difficult than coarse-grained vulnerability detection, but can better assist developers in understanding and remediating vulnerabilities because it can be localized directly to the statement where the vulnerability occurs.

Learning-based source code vulnerability detection techniques analyze, abstract, and reason about source code using traditional machine learning or deep learning techniques, enabling them to automatically or semi-automatically learn complex vulnerability-related semantic features from large amounts of historical data and to generate corresponding vulnerability patterns, which can then be applied to coarse-grained or fine-grained vulnerability detection tasks. Figure 6 shows the relationship between the various research components of existing learning-based source code vulnerability detection techniques.
Fig. 6 Relationship between the various research components of existing learning-based source code vulnerability detection methods

(1) Mining and construction of vulnerability datasets

Suspected vulnerable programs are collected from open source software repositories or public vulnerability databases, and are then labeled manually or automatically to build a vulnerability dataset.
(2) Program representation methods

The source code in the dataset is parsed to generate suitable intermediate program representations, such as metric-based, sequence-based, syntax tree-based, and graph-based representations.
 (3) Coarse-grained source code vulnerability detection

After obtaining an intermediate representation of the program, software metric features are extracted using manually formulated rules, or appropriate deep neural networks are used to extract vulnerability-related syntactic and semantic features from the program representation; these features are then fed into a classifier that predicts, in a binary classification fashion, whether the source code under detection contains a vulnerability.
 (4) Interpretable Methods for Coarse-Grained Source Code Vulnerability Detection

For the source code detected as containing vulnerabilities, further probabilistic or fine-grained explanation information is provided by interpretable methods. The common interpretable methods can be categorized into model self-interpretation methods, model approximation methods and sample feedback methods.
 (5) Fine-grained source code vulnerability detection methods

After obtaining the intermediate representation of the program, representation learning is performed directly on it to produce fine-grained, statement-level detection results, i.e., the locations of the vulnerable statements.

To summarize, learning-based vulnerability detection research mainly focuses on the following five difficult problems: how to construct large-scale and high-quality vulnerability datasets? How to parse the code into appropriate program representations? How to extract vulnerability features from program representations to achieve coarse-grained vulnerability detection? How to obtain interpretable results based on the coarse-grained vulnerability detection results? How to model fine-grained vulnerability features based on program representations to achieve fine-grained vulnerability detection? Therefore, this paper will summarize these five perspectives.

 4 Mining and Construction Methods for Vulnerability Datasets

 4.1 Publicly available vulnerability datasets


The construction of vulnerability dataset is the prerequisite and foundation for learning-based source code vulnerability detection and localization. Learning-based vulnerability detection requires high-quality vulnerability data as a prerequisite, and the size and quality of the dataset directly affects the generalization ability of the detection model. It has been shown that improving the diversity of vulnerability types and syntactic structures in the training dataset can help enhance the detection of unknown vulnerabilities .

Some of the more critical and publicly available vulnerability datasets are shown in Table 1. The public availability of vulnerability datasets has contributed to the development of learning-based vulnerability detection techniques, but the construction of datasets still faces the following challenges.
 (1) Sample imbalance in the data set

In a recent study, Yang et al. discussed the effectiveness of data sampling methods for the data imbalance problem in vulnerability detection. Specifically, the study evaluates the impact of four data sampling methods, including random undersampling, random oversampling, SMOTE, and OSS (One Side Selection), on the effectiveness of deep learning vulnerability detection models and their ability to learn code vulnerability patterns, and reaches the following conclusions: first, data sampling methods can indeed alleviate the data imbalance problem in vulnerability detection; second, oversampling is better than undersampling; finally, sampling the original samples is better than sampling the feature space generated by the model after learning the samples. Therefore, future research can focus on the oversampling of raw samples.
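For illustration, the oversampling strategies discussed above can be applied to a vulnerable/non-vulnerable training set with the imbalanced-learn library; the feature matrix below is synthetic and the setup is only a sketch, not the configuration used in the cited study:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

rng = np.random.default_rng(0)
# Hypothetical code-feature vectors: 950 non-vulnerable vs. 50 vulnerable samples.
X = rng.normal(size=(1000, 32))
y = np.array([0] * 950 + [1] * 50)

# Random oversampling duplicates minority (vulnerable) samples.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE synthesizes new minority samples by interpolating between nearest neighbors.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_ros), np.bincount(y_sm))
```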

In real projects, the number of vulnerable statements within vulnerability samples is also relatively small, which makes the sample imbalance problem even more serious in the fine-grained vulnerability localization task. Although researchers have recognized the challenges that this problem poses to the vulnerability localization task, few studies have provided solutions. Software fault localization is the process of locating the statements in a buggy program that cause the program to run incorrectly.

Table 1 Publicly available vulnerability datasets

Label granularity | Ratio of vulnerable to non-vulnerable instances | Data sources
function | — | NVD & CVE
function | — | NVD & CVE
function | — | SARD
function | — | NVD & CVE
function | — | SARD
function | 1 471 : 59 297 | NVD & CVE
function | 1 471 : 59 297 | CVE
function | — | SARD
function | — | GitHub
function | 1 658 : 16 511 | Debian & Chromium
code segment | — | NVD & SARD
code segment | 56 395 : 364 232 | NVD & SARD
code segment | 43 119 : 138 522 | NVD & SARD
statement | — | GitHub
statement | — | SARD
statement | — | GitHub
statement | — | Debian
statement | — | GitHub

Software fault localization focuses on the runtime defects that cause test cases to fail, and it faces the same sample imbalance problem, i.e., only a small proportion of test cases fail. Although software fault localization differs from vulnerability localization, given the similarity between the two tasks, the representative methods used in software fault localization to alleviate the sample imbalance problem can still be borrowed by the vulnerability localization task.

Some studies have addressed the sample imbalance problem by expanding the dataset through cloning of the failed test cases. Xie et al. proposed a data augmentation method, Aeneas, which uses Principal Component Analysis (PCA) to generate a reduced feature space and then synthesizes failed test cases in that reduced space with a Conditional Variational Autoencoder (CVAE) to solve the sample imbalance problem. Its advantage is that the PCA technique reduces the dimensionality of the feature space and simplifies the expression of data features, thus improving the efficiency of data synthesis. The Lamont method adopts a similar idea, using Linear Discriminant Analysis (LDA) to reduce the dimensionality of the feature space and then utilizing SMOTE to synthesize failed test cases and obtain balanced sample data. Lei et al. proposed an inter-class learning based data augmentation method, BCL-FL, which mixes successful and failed test cases through a specially designed data synthesis formula to generate failed test cases that are closer to real ones. Therefore, future work can consider borrowing the above ideas for augmenting or synthesizing failed test cases to augment vulnerability samples and thus address the problem of imbalanced vulnerability samples.
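A rough sketch of the "reduce the feature space, then synthesize minority samples in it" idea behind Aeneas and Lamont is shown below, using PCA followed by SMOTE as stand-ins (the original methods synthesize failing test cases with a CVAE and with LDA plus SMOTE, respectively, on test coverage data):

```python
import numpy as np
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(1)
# Hypothetical coverage-like feature vectors; label 1 = failing test case (rare class).
X = rng.integers(0, 2, size=(500, 200)).astype(float)
y = np.array([0] * 480 + [1] * 20)

# Step 1: reduce the high-dimensional feature space.
X_low = PCA(n_components=20, random_state=1).fit_transform(X)

# Step 2: synthesize failing samples in the reduced space to balance the classes.
X_bal, y_bal = SMOTE(random_state=1).fit_resample(X_low, y)

print(X_low.shape, np.bincount(y), "->", np.bincount(y_bal))
```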
 (2) Data quality issues of the dataset

Currently publicly available vulnerability datasets cover a limited number of programming languages and vulnerability types, which makes the datasets less generalizable, and the lack of complete vulnerability context makes them only able to represent a limited range of vulnerability patterns. For example, vulnerability datasets at the function granularity do not provide a complete structure of vulnerability code across functions.

Vulnerabilities mined from the same project may also introduce duplicate or residual data due to cross-version evolution or code reuse. Differences in the code styles of developers in different project teams and differences in the application context of code in different projects also lead to differences in data distribution, making it difficult to learn vulnerability patterns across projects.

In addition, the labeling of datasets is often noisy. This is due to the fact that manual labeling depends on the expertise of the security experts, while labeling with static analysis tools results in a large number of mislabels, and the timeliness of the data may lead to mislabels due to potentially undiscovered or silently remediated vulnerabilities in the dataset.

The above dataset-related factors have significantly increased the difficulty of training detection models with better generalization ability, which has become a bottleneck in improving the performance of vulnerability detection models.
 (3) Problems with the source of the dataset
Most software code is still not open source, and even most public vulnerability reports do not publish the vulnerable code, which makes it more difficult for researchers to obtain data from these sources. Therefore, most current vulnerability data comes from synthetic data in SARD (Software Assurance Reference Dataset), a smaller portion comes from real projects via NVD, CVE, and GitHub, and some datasets are a mixture of the two.

Synthetic datasets from SARD contain synthetic or semi-synthetic vulnerability data; for example, the SATE IV Juliet dataset is a public vulnerability dataset constructed from synthetic vulnerability code that mimics known patterns of real vulnerable code. Synthetic data is widely used by researchers because of its large number of samples, many vulnerability types, low noise, and low cost. Compared with real code, synthetic code is simpler and more self-contained, with fewer variations in code patterns and purer vulnerability contexts, so its vulnerability features are easier to learn. However, there is a big gap between synthetic data and real project code in terms of complexity and coverage of program syntactic structures; synthetic data cannot reveal the vulnerability distribution of real scenarios, so models trained on it have difficulty accurately detecting vulnerabilities in real code. Semi-synthetic datasets are datasets in which real code is simplified and modified to serve the purpose of academic research. For example, testID: 151455 in the SARD dataset is a typical semi-synthetic example. Since in semi-synthetic datasets researchers tend to highlight the vulnerable parts of the original code, such datasets are not fully representative of real-world vulnerable code.
 (4) The problem of labeling the dataset

Models trained on synthetic or semi-synthetic datasets are difficult to adapt to vulnerability detection scenarios of complex code in real-world projects, so it is imperative to mine and annotate vulnerable code instances in real software projects to build real vulnerability datasets. Dowd et al. found that one hour of security checking can only cover 500 lines of code on average, while most modern software systems contain millions of lines of code. Therefore, manually labeling real vulnerability instances is costly and it is difficult to obtain a large number of real vulnerability instances.

To address this problem, some researches focus on using automated approaches to label vulnerability datasets, such as using the detection results of static analysis tools to label vulnerabilities in open source project code , but the high false alarm rate of static analysis detection tools makes the reliability of the obtained vulnerability labels low.

The more commonly used method of automatically annotating vulnerability data is to collect data with real vulnerability labels from the international authoritative vulnerability database CVE, follow the code commit links provided in each publicly disclosed CVE entry to the security-related commit logs in the open source code repository, and then analyze the differences between the code before and after the fix to extract the vulnerable code and the patch code, respectively, thus creating a vulnerability dataset. Some of these datasets are constructed entirely from vulnerability instances from real projects, but the number of vulnerability instances in them is much smaller than in the publicly available hybrid vulnerability datasets, which contain both synthetic data from SARD and NVD and real data from CVE. This is because, in addition to vulnerability examples from real programs, another portion of these datasets consists of synthetic vulnerability code extracted from the SARD and NVD databases. Although using program slicing to split vulnerable code into multiple sliced code segments increases the number of vulnerability instances in the dataset, different slices extracted from the same original vulnerable code have the same vulnerability type, so constructing the vulnerability dataset in terms of sliced code segments does not add new vulnerability types to the dataset.
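The commit-diff labeling pipeline described above can be approximated with ordinary git commands. In the sketch below, the repository path and commit hash are placeholders; the version of each changed file before the fixing commit is treated as vulnerable and the version after it as patched:

```python
import subprocess

def changed_files(repo, fix_commit):
    """List the files touched by a vulnerability-fixing commit."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--name-only", f"{fix_commit}^", fix_commit],
        capture_output=True, text=True, check=True)
    return out.stdout.split()

def file_at(repo, commit, path):
    """Return the content of `path` as it existed at `commit`."""
    out = subprocess.run(
        ["git", "-C", repo, "show", f"{commit}:{path}"],
        capture_output=True, text=True, check=True)
    return out.stdout

repo, fix = "/path/to/project", "abc1234"           # placeholders
for path in changed_files(repo, fix):
    vulnerable = file_at(repo, f"{fix}^", path)     # version before the fix
    patched = file_at(repo, fix, path)              # version after the fix
    # ... align the functions in both versions and label the changed ones ...
```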

Some studies have also attempted semi-automatic annotation of vulnerability datasets. For example, Zhou et al. constructed four vulnerability datasets from real open source projects by first selecting security-related code commits through keyword filtering and then having security experts manually verify them. These datasets contain more vulnerability examples from real projects than the previous datasets, but only two of them have been made public so far. Chakraborty et al. crawled security-tagged bug reports from Bugzilla and the Debian security tracker, respectively, and collected vulnerable code and patch code from real projects.

In summary, there is a lack of recognized public benchmark vulnerability datasets that can be used to evaluate detection performance, and fine-grained vulnerability datasets with statement-level annotations are particularly scarce. Second, there is a lack of recognized high-quality real vulnerability datasets and of methods to accurately assess dataset quality; issues such as data generalizability, metadata completeness, and labeling accuracy and reliability leave the quality of vulnerability datasets in need of improvement. Moreover, most of the publicly available vulnerability datasets are hybrid datasets, in which the proportion of real vulnerability instances is low and the diversity of vulnerability types and the coverage of vulnerability context structures are lacking. This is not conducive to the model learning richer vulnerability patterns and makes it difficult to satisfy the demand for highly generalizable models that can detect unknown types of vulnerabilities in real-world applications; as a result, vulnerability detection models trained on existing datasets are difficult to apply in industry. Therefore, the lack of large-scale, high-quality public benchmark datasets of source code vulnerabilities from real projects is one of the major challenges facing vulnerability detection.

 4.2 Vulnerability Data Mining Methods


Rule-based methods, i.e., using keyword filtering and diff file analysis to collect vulnerability data, can often only extract a small number of vulnerability instances from code repositories. In order to mine more vulnerability data from open source code repositories such as GitHub and thus address the problem of creating large-scale, high-quality vulnerability datasets, some researchers have begun to mine vulnerability data using code change intent identification or security-related code commit identification methods. Table 2 lists the literature related to identifying code commits that fix faults or vulnerabilities.

Table 2 Vulnerability data mining methods based on identifying bug- or vulnerability-fixing code commits

Task | Research organization
Identification of fault-fixing code commits | North Dakota State University, USA
Identification of the type of fixed fault | Yangzhou University, China
Mining vulnerability data by identifying security-related code commits | University of Bonn, Germany
 | SourceClear Corporation, USA
 | SAP Security Research, Germany
 | Northwestern University, China
 | North Carolina State University, USA

Among them, Zafar et al. designed a pre-trained BERT (Bidirectional Encoder Representations from Transformers) based method for identifying fault-fixing code commits, which predicts whether a user's code commit contains the intention to fix a fault and thereby determines whether the target version of the code contains a fault. In order to further identify the specific types of faults involved in code commits, another study constructed a repair tree based on the code diff files and used a tree-based convolutional neural network to classify the specific fault types of code commits.
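As a minimal sketch of BERT-based commit-intent classification with the HuggingFace transformers library (the checkpoint name and label meaning below are placeholders, and the classification head would first have to be fine-tuned on labeled fix/non-fix commits; this is not Zafar et al.'s actual implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; in practice this would be a BERT model fine-tuned on commits.
name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

msg = "Fix off-by-one error in buffer length check"
inputs = tokenizer(msg, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
# Index 1 is assumed here to be the "fault-fixing commit" class.
prob_fix = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(fix-related) = {prob_fix:.3f}")
```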

Since vulnerabilities (i.e., security-related defects) differ from general defects, Perl et al. built on the above classification methods for fault-fixing code commits and proposed a security-related code commit identification method based on text mining and machine learning: it first extracts fix-intent features from code commits using text mining techniques, then classifies the extracted features with an SVM, and finally determines whether the commit is related to a vulnerability based on the prediction results. In order to combine the advantages of different text classifiers in the task of identifying security-related code commits, Zhou et al. used logistic regression to combine six different machine learning classifiers and achieved better classification results than any single text classifier. Wang et al. used conformal prediction to evaluate the confidence of multiple classification models and adopted a voting strategy to combine the predictions of the classifiers with higher confidence, improving the overall recognition accuracy. In contrast to the above work, which only uses the textual information of code commits, Sabetta et al. also extract fix-related code features from the code itself and use a text classifier together with a diff-code classifier to comprehensively evaluate whether the current commit is related to a vulnerability fix. Considering that issue reports in code repositories often contain a large number of vulnerability instances, Oyetoyan et al. developed a method that identifies security-related issue reports and code commits at the same time, combining keyword filtering and the TF-IDF


(Term Frequency-Inverse Document Frequency) method. In this method, TF is the frequency of a word in a document, i.e., the term frequency; IDF is the logarithm of the reciprocal of the fraction of documents containing the word, i.e., the inverse document frequency; and TF-IDF, the product of TF and IDF, represents the importance of the word to the document. Therefore, by calculating TF-IDF we can obtain the keywords that are more important for document categorization. Oyetoyan et al. used keyword filtering and TF-IDF to extract security-related keywords from issue reports and code commits, and then used them as features with machine learning algorithms to identify whether the contents of issue reports and code commits are security-related.
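For illustration, this TF-IDF weighting can be computed with scikit-learn and fed to any standard classifier; the commit messages and labels below are made up for the sketch and do not come from the cited study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

commits = [
    "fix buffer overflow in packet parser",          # security-related
    "sanitize user input to prevent sql injection",  # security-related
    "update readme and contributor list",            # not security-related
    "refactor build scripts for new compiler",       # not security-related
]
labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()            # computes TF * IDF weights per word and document
X = vectorizer.fit_transform(commits)

clf = LogisticRegression().fit(X, labels)
test = vectorizer.transform(["patch heap overflow reported by fuzzer"])
print(clf.predict(test))                  # expected: [1], i.e., security-related
```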

Although there is a large amount of available data in open source code repositories, not all commit logs and code changes are related to vulnerability fixes, and some commit logs even mix vulnerability-fixing commits with other types of commits, which further increases the difficulty of accurately identifying vulnerability-fix-related commits. In addition, since some complex vulnerabilities are usually not fixed in a single commit, the complexity of version tracking further increases the difficulty of accurately identifying the versions in which a vulnerability was introduced and fixed. Therefore, further research is needed on how to mine vulnerable code through version-tracking analysis of vulnerabilities. Moreover, due to commercial sensitivity and security and privacy considerations, many vulnerabilities are patched covertly to prevent attackers from exploiting publicly disclosed vulnerabilities, which increases the difficulty of mining high-quality real vulnerability data from software repositories. Currently, there is relatively little literature on security-related code commit identification aimed at expanding high-quality vulnerability datasets. How to mine vulnerability data from multi-source heterogeneous software repositories to build large-scale, high-quality real vulnerability datasets remains to be studied.

 5 Program Representation for Vulnerability Detection Tasks


Although there are some similarities between natural languages and programming languages, there are also significant differences. In terms of linguistic form, programs contain rich and explicit structural information; although such structure also exists in natural language, it is not as strict as in programs. In addition, the semantic adjacency of programs differs from that of natural languages: for example, statements inside and outside a loop are not semantically adjacent even though they are adjacent in terms of distance, so it is necessary to convert the program into a suitable intermediate representation to model this semantic information. Common intermediate representations of code are token sequences, the Abstract Syntax Tree (AST), the Control Flow Graph (CFG), the Data Flow Graph (DFG), the Program Dependency Graph (PDG), and so on. These different intermediate code representations provide different levels of abstraction; the higher the abstraction level of the intermediate representation, the better its ability to represent the semantic information of the code. For example, token sequences can represent the natural ordering and lexical statistics of the code; ASTs are good at characterizing the similarity of specific programming patterns and syntactic structures; CFGs and DFGs can represent more control and data flow information than token sequences and ASTs; and PDGs are better at representing the dependencies between variables in a program. Second, different intermediate code representations have different levels of importance for specific software engineering tasks, and a given intermediate representation may only be suitable for one or a few types of software engineering tasks. Therefore, there is a need to investigate program representations suitable for vulnerability detection tasks.

Early machine learning-based vulnerability detection methods (e.g., those using metric-based program representations) often relied on feature engineering to manually extract vulnerability-related features. In recent years, researchers have begun to use deep learning methods to automatically extract vulnerability features. Since programs have different representations at different stages of compilation, as shown in Fig. 7, deep learning-based program representation methods can be classified, according to the organization of the code representation, into three categories: sequence-based, syntax tree-based, and graph-based program representation methods. Different program representation methods have different objectives and are suitable for extracting different vulnerability features. Table 3 lists representative literature on the above four types of program representations (including metric-based representations) applied to the vulnerability detection task.

Fig. 7 Correspondence between intermediate representations of code at different levels of abstraction and program representation models with different coding semantic capabilities

 5.1 Metrics-Based Representation of Programs


Software metrics measure code to capture information about code quality, complexity, maintainability, and code structure, in order to identify areas of the software that are difficult to understand or maintain and thus guide subsequent, more complex code analysis. The main idea of metric-based vulnerability detection methods is to construct a set of code feature values based on software metrics, and then use this feature set as the input to a machine learning model that learns the complex correlations between structural metrics and defects and uses this knowledge to predict the likelihood of defects in new code.
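A minimal sketch of this pipeline is shown below; the metric values are invented, and real studies compute metrics such as cyclomatic complexity and code churn with dedicated measurement tools before training the classifier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Columns: [cyclomatic complexity, lines changed (churn), #dependencies, #developers]
X = np.array([
    [25, 400, 12, 6],
    [3,  10,  1, 1],
    [18, 250,  9, 4],
    [2,   5,  2, 1],
    [30, 520, 15, 7],
    [4,  20,  3, 2],
] * 20, dtype=float)
y = np.array([1, 0, 1, 0, 1, 0] * 20)   # 1 = vulnerable module, 0 = clean module

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```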

In earlier work, researchers used more classical software metrics, for example code churn, code complexity, coverage, dependency, organizational, and developer activity metrics.

Table 3 Representative literature on program representation methods

Category | Subcategory | Code representation
Metric-based | Classical software metrics | Developer activity
Metric-based | Classical software metrics | Code churn, complexity, coverage, dependency, organization
Metric-based | Classical software metrics | Complexity, code churn, fault history
Metric-based | Classical software metrics | Code complexity, code churn, developer activity
Metric-based | Classical software metrics | Code churn, developer activity
Metric-based | Classical software metrics | Complexity
Metric-based | Improved software metrics | Cyclomatic complexity, function dependency, function pointer usage, control structure dependency
Metric-based | Slice-based software metrics | Software metrics extracted from sliced code segments
Sequence-based | Based on function call sequences | Function call sequence
Sequence-based | Based on source code sequences | Source code text token sequence
Sequence-based | Based on code segment sequences | Code segment token sequence
Sequence-based | Based on intermediate code sequences | Intermediate code token sequence
Sequence-based | Based on assembly code sequences | Assembly code token sequence
Syntax tree-based | Syntax tree structure transformed into a sequence | AST path; AST token sequence
Syntax tree-based | Direct modeling of the syntax tree structure | CFAST
Graph-based | Graph transformed into a sequence | —
Graph-based | Direct modeling of the graph structure | Composite graph; program graph; slice graph structure

However, empirical studies have shown that classical software metrics are not applicable to source code vulnerability detection.

In order to further improve the accuracy of metric-based vulnerability detection methods, so that software metrics can accurately characterize the vulnerability code, researchers have improved the statistics and extraction methods of software metrics. For example, in addition to classic software metrics such as code complexity, Du et al. also extracted software metrics that can reflect the characteristics of the vulnerable code, such as function dependency, function pointer usage, and control structure dependency, in order to assist in identifying vulnerable functions. On the other hand, a large number of non-vulnerability related statements within a function can also interfere with vulnerability characterization. To solve this problem, Zagane et al. proposed a vulnerability detection method that slices the program and extracts software metrics from the sliced code segments. The experimental results of Salimi et al. show that software metrics extracted from slices can more accurately characterize vulnerabilities than those extracted from functions.

 5.2 Sequence-based program representation


Sequence-based program representation transforms code into token sequences and then models them with sequence-based deep neural networks to generate vector representations of the code. The main types of methods are as follows.

Approaches based on function call sequences: these methods extract function call sequences by parsing the source code and transform them into vector representations. For example, Grieco et al. transformed function call sequences into low-dimensional vector representations using N-gram and word2vec models, and other work generated numeric vectors of the code by creating a unique integer index for each word. The goal of these methods is to provide a lightweight approach that can detect vulnerabilities quickly and efficiently, but the types of vulnerabilities they can detect are limited. Approaches based on source code token sequences: these methods convert the code into token sequences using lexical analysis techniques, and then convert each token in the sequence into a vector representation using bag-of-words, N-gram, or word2vec models. However, it is difficult for them to learn the high-level syntactic and semantic information of the code.
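A minimal sketch of the "lex into tokens, then embed each token with word2vec" step, using pre-tokenized snippets and gensim (real systems use language-specific lexers and much larger corpora):

```python
from gensim.models import Word2Vec

functions = [
    "char buf [ 10 ] ; strcpy ( buf , src ) ;",
    "int n = read ( ) ; if ( n > 0 ) process ( n ) ;",
]

# Trivial tokenizer: the snippets above are already whitespace-separated.
token_seqs = [f.split() for f in functions]

# Train skip-gram word2vec embeddings over the token sequences.
model = Word2Vec(sentences=token_seqs, vector_size=64, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)

print(model.wv["strcpy"].shape)              # a 64-dimensional token vector
print(model.wv.most_similar("buf", topn=3))  # tokens that appear in similar contexts
```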

Approaches based on token sequences of code gadgets: code gadgets are semantically related but not necessarily contiguous lines of code extracted from a program using syntactic and semantic analysis. The most representative work in this category is the series of code gadget-based vulnerability detection approaches proposed by Li et al., namely VulDeePecker, μVulDeePecker, and SySeVR. This class of methods first extracts code segments in different ways, then represents the code segments as token sequences, and finally encodes them into vector representations. Specifically, VulDeePecker uses static analysis tools (e.g., Checkmarx) to parse the program source code, extracts vulnerability candidate key points related to insecure library/API function calls from the code according to rules, and then uses program slicing to extract, from the AST and PDG, the lines of code that have data dependencies on these key points, forming code gadgets. The code gadgets are then transformed into token sequences, which are converted into low-dimensional vector representations using word2vec. In order to identify specific types of vulnerabilities, μVulDeePecker extends the gadget representation in VulDeePecker by adding control-dependent statements, thereby including more "global" semantic information. In order to detect more types of vulnerabilities, the SySeVR method improves on VulDeePecker by considering, in addition to library/API function calls, syntactic features such as array usage, pointer usage, and arithmetic expressions.


The vulnerability candidate key points related to library/API function calls, array usage, pointer usage, and arithmetic expressions are then used as the slicing criteria, and program slicing is applied to extract from the AST and PDG the syntactic and semantic features related to these four types of vulnerability candidate key points.
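
The sketch below illustrates, in a simplified and purely syntactic way, how candidate key points of these four kinds can be located in C source lines before slicing. The regular expressions and the list of sensitive APIs are illustrative assumptions, not the rules used by VulDeePecker or SySeVR.

```python
import re

# Illustrative (and deliberately incomplete) list of sensitive library/API calls.
SENSITIVE_APIS = {"strcpy", "strcat", "memcpy", "sprintf", "gets"}

def find_candidate_keypoints(lines: list[str]) -> list[tuple[int, str, str]]:
    """Return (line_no, kind, line) for lines matching one of the four kinds of
    vulnerability candidate key points: API call, array use, pointer use,
    arithmetic expression."""
    hits = []
    for i, line in enumerate(lines, start=1):
        for api in SENSITIVE_APIS:
            if re.search(rf'\b{api}\s*\(', line):
                hits.append((i, "api_call", line.strip()))
        if re.search(r'\w+\s*\[[^\]]*\]', line):
            hits.append((i, "array_use", line.strip()))
        if re.search(r'(?<!\w)\*\s*\w+|\w+\s*->', line):
            hits.append((i, "pointer_use", line.strip()))
        if re.search(r'\w+\s*[-+*/]\s*\w+', line):
            hits.append((i, "arithmetic", line.strip()))
    return hits

code = [
    "char buf[16];",
    "int n = len + off;",
    "strcpy(buf, user_input);",
    "*p = buf[n];",
]
for hit in find_candidate_keypoints(code):
    print(hit)   # each hit would then serve as a slicing criterion
```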

Approaches based on intermediate code token sequences: Some researchers have focused on code representations based on token sequences of intermediate code. The most commonly used intermediate code in the field of vulnerability detection is LLVM IR (Low-Level Virtual Machine Intermediate Representation), which adopts the Static Single Assignment (SSA) form to ensure that each variable is assigned exactly once. As a result, intermediate code based program representations can more accurately encode the semantic features of the code related to control flow and variable definition-use relationships, and capture more precise semantic information than source code based program representations. A representative work is VulDeeLocator, which replaces the source code slices in SySeVR with LLVM intermediate code slices in order to represent more of the code's semantics.
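
As a toy illustration of the SSA property (each variable is assigned exactly once, which makes definition-use relations explicit), the sketch below renames variables in a straight-line sequence of assignments. Real SSA construction, as in LLVM, additionally inserts phi nodes at control-flow join points; this sketch handles straight-line code only.

```python
def to_ssa(stmts):
    """Rename variables in straight-line three-address statements so that every
    variable is assigned exactly once (no phi nodes; straight-line code only)."""
    version = {}          # current version number of each variable
    out = []
    for lhs, rhs in stmts:                # rhs is a list of operand names/constants
        new_rhs = [f"{v}{version[v]}" if v in version else v for v in rhs]
        version[lhs] = version.get(lhs, 0) + 1
        out.append((f"{lhs}{version[lhs]}", new_rhs))
    return out

# x = a + b; x = x * 2; y = x + 1
stmts = [("x", ["a", "b"]), ("x", ["x", "2"]), ("y", ["x", "1"])]
for lhs, rhs in to_ssa(stmts):
    print(lhs, "=", " op ".join(rhs))
# x1 = a op b
# x2 = x1 op 2
# y1 = x2 op 1
```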

Assembly token sequence based approaches: Some researchers represent code based on assembly language token sequences. For example, Li et al. first extract source code slices in the same way as SySeVR and, after compiling the source code into assembly code, generate assembly code slices by using an alignment algorithm to find the assembly code corresponding to the statements in the source code slices. Tian et al. use symbolic execution and static analysis techniques to generate assembly code slices by directly analyzing the control flow and data flow of the binary program.

Sequence-based program representations can reflect the natural word-order information and lexical statistics of a program, which are useful for tasks such as code generation, code completion, code summarization, code search, vulnerability detection, and defect repair. However, sequence-based program representations "flatten" the code, so the structural information of the program is not effectively utilized. It is therefore necessary to study syntax tree based and graph-based program representation methods that make full use of the structural information of the program to model the source code.

 5.3 Syntax-tree based program representations


Syntax tree-based program representation generally refers to the representation of code as an AST structure. Some studies have flattened ASTs into sequences, while others have extracted vulnerability semantics based directly on the original tree structure of ASTs.

The advantage of converting the syntax tree structure into sequences is that vulnerability features can be extracted directly with sequence-based deep neural networks; the key question is how to convert the AST extracted from the source code into a sequence representation. Alon et al. form paths by connecting root nodes to leaf nodes and assign higher weights to frequently occurring paths to generate vector representations. This method has been successfully applied to tasks such as code attribute prediction and clone detection because it can effectively measure the similarity of syntactic structures between programs. However, since frequently occurring paths are not necessarily the paths containing vulnerabilities, and vulnerable paths are not necessarily the most frequent, this method is not well suited to the vulnerability detection task.

In the vulnerability detection task, there are two main ways to transform an AST into sequences. One is to transform the AST into token sequences of AST paths. For example, Li et al. use a greedy algorithm to extract as few long paths as possible that together cover all of the AST nodes, changing the starting point of the algorithm to obtain multiple long paths (the nodes of these long paths may overlap); after obtaining the set of long-path sequences, the tokens in the path contexts are converted into numeric vectors with word2vec. Tanwar et al. form paths by connecting pairs of AST leaf nodes, where a path context consists of the first and last leaf nodes and the path connecting them through intermediate nodes up to the root; all such paths are extracted from the AST to produce a sequence of paths, and the tokens in the path sequences are converted into numeric vectors using a pre-built vocabulary. The other way is to use depth-first traversal to transform the AST into a token sequence of AST nodes, and then map the tokens in the sequence to numeric vectors using techniques such as word2vec.
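
The following sketch shows the second strategy, using Python's own ast module as a stand-in for the C/C++ parsers used in the surveyed work: a depth-first traversal flattens the AST into a node-type token sequence, which could then be embedded with word2vec as described above.

```python
import ast

def ast_to_token_sequence(source: str) -> list[str]:
    """Flatten an AST into a token sequence of node types via depth-first traversal."""
    tree = ast.parse(source)
    tokens = []

    def dfs(node):
        tokens.append(type(node).__name__)          # node type as the token
        for child in ast.iter_child_nodes(node):
            dfs(child)

    dfs(tree)
    return tokens

src = "def f(n):\n    return n * 2\n"
print(ast_to_token_sequence(src))
# e.g. ['Module', 'FunctionDef', 'arguments', 'arg', 'Return', 'BinOp', 'Name', 'Load', 'Mult', 'Constant']
# (exact node names may vary slightly across CPython versions)
```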

Compared with transforming the AST into sequences, directly modeling the syntax tree structure can more fully represent the syntactic structural information in the code, and has been widely used in code-semantics related tasks such as clone detection, code classification, code completion, code generation, code summarization, and code attribute prediction. However, some empirical studies have shown that, because of the heterogeneous structure of the AST and the fact that it contains far more vulnerability-irrelevant nodes than other intermediate representations, using the AST alone as the intermediate code representation for vulnerability detection does not achieve the best detection performance. Some recent studies have therefore incorporated other intermediate code representations on top of the AST to integrate information at different levels of abstraction, for example by adding the AST syntactic structure information of statement blocks to the CFG, or by adding control flow information to the original AST to generate a new code representation called CFAST (Control Flow Abstract Syntax Tree).

 5.4 Graph-based program representation


Graph-based program representation consists of two main approaches: transforming graph structures into sequences, or modeling graph structures directly.

A program representation method that transforms a graph into a sequence focuses on how to traverse the graph-based intermediate representation to form a sequence. Duan et al. proposed a CPG-based vulnerability detection method, VulSniper, in which, after the CPG of a program is generated, the nodes in the CPG are encoded according to a coding rule designed by the authors to generate a three-dimensional feature tensor: the first two dimensions are similar to an adjacency matrix that records the positional relationships between nodes, and the third dimension describes the type of relationship between the nodes. The order of the nodes in the sequence is determined by their positions and nesting relationships in the AST, yielding the sequence of node feature vectors. Building on VulSniper, the authors further propose using program slicing techniques to remove statements that are unrelated to sensitive operations, in order to reduce vulnerability-irrelevant information in the program representation.

Compared with transforming the graph structure into a sequence, using graph neural networks to model the graph structure directly can better learn the dependencies between nodes and the structural and semantic information of the program contained in the graph-based code representation, providing more accurate feature vector representations for the source code vulnerability detection model. Such approaches mainly follow two ideas: one is to integrate multiple intermediate code representations to comprehensively characterize the vulnerability-related syntactic and semantic information at different levels of abstraction; the other is to use program slicing techniques to remove vulnerability-irrelevant information from the combined code representations.

Representative approaches of the first idea include the following. Yamaguchi et al. proposed the Code Property Graph (CPG), which combines the structural and semantic information of programs for program representation. Zhou et al. generated a composite graph that synthesizes structural and semantic information by adding natural sequence information to the graph built from the AST, CFG, and PDG. Wang et al. generate an intermediate representation called the Program Graph by combining information extracted from the AST and PDG. Cao et al. proposed the Code Composite Graph (CCG), which combines the AST, CFG, and DFG to characterize a program with both syntactic and semantic information; compared with the CPG used in earlier work, the CCG abandons the control dependence information in the code and retains the data dependence information. Other researchers proposed a comprehensive graph with 12 kinds of connectivity relations to contain more program semantic information, covering most of the edges and nodes extracted by the Joern static analysis tool.
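
A minimal sketch of this first idea, using networkx, is shown below: different kinds of edges (AST/CFG/PDG style relations such as control flow, data dependence, control dependence) are stored as typed edges of one multigraph so that a graph neural network can later distinguish them. The node and edge labels are illustrative, not the exact schema of any of the cited graphs.

```python
import networkx as nx

# One statement-level node per line of a tiny C-like function:
# 1: buf = alloc(10)   2: n = read_int()   3: if (n < 10)   4: buf[n] = 0
g = nx.MultiDiGraph()
for node_id, (code, node_type) in {
    1: ("buf = alloc(10)", "declaration"),
    2: ("n = read_int()", "declaration"),
    3: ("if (n < 10)", "condition"),
    4: ("buf[n] = 0", "assignment"),
}.items():
    g.add_node(node_id, code=code, type=node_type)   # node attributes: content and type

# Typed edges contributed by different intermediate representations.
g.add_edge(1, 2, etype="control_flow")
g.add_edge(2, 3, etype="control_flow")
g.add_edge(3, 4, etype="control_flow")
g.add_edge(2, 3, etype="data_dependence")   # n defined at 2, used at 3
g.add_edge(2, 4, etype="data_dependence")   # n used as an index at 4
g.add_edge(1, 4, etype="data_dependence")   # buf defined at 1, written at 4
g.add_edge(3, 4, etype="control_dependence")

print(g.number_of_nodes(), g.number_of_edges())   # 4 nodes, 7 typed edges
```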

It has been found that more information in the program representation is not always better: a large amount of vulnerability-irrelevant information may negatively affect the model's ability to learn vulnerability patterns. To address the problem that graph structures synthesizing the code's semantic information are complex and contain a large amount of vulnerability-irrelevant information, another idea is to use program slicing techniques to remove vulnerability-irrelevant information from the graph-structured intermediate representation of the code. For example, Cheng et al. proposed DeepWukong, a vulnerability detection method combining program slicing and graph neural networks, which extracts subgraphs from PDGs based on program slicing; however, the method only uses function calls and operator statements as slicing criteria, and therefore only detects a limited range of vulnerability types. Similarly, other researchers used only function calls and pointer operations as slicing criteria and extracted subgraphs, based on program slicing, from the PDG extended with the Call Graph (CG), which only detects memory-related vulnerabilities. To improve the ability to detect different types of vulnerabilities, Zheng et al. proposed a program representation called the Slice Property Graph (SPG), which, in addition to four commonly used slicing criteria, introduces two slicing criteria related to inter-procedural data transfer, namely function parameters and function return values, to cover more vulnerability candidate key points. Based on these six slicing criteria, inter-procedural analysis and program slicing are used to generate SPGs carrying node attribute information (such as statement content and node type), so as to more accurately extract the graph structure, node attributes, and code context that are data dependent, control dependent, and call dependent on the vulnerability candidate key points, thereby avoiding the negative impact of the large number of vulnerability-irrelevant statement nodes in CPGs on the training of the detection model.
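
The second idea can be sketched as follows with networkx, using a toy PDG and a single slicing criterion (real systems use inter-procedural dependence graphs and several criteria): starting from a candidate key point, the backward and forward slice is the set of nodes reachable along dependence edges, and the induced subgraph discards everything else.

```python
import networkx as nx

# Toy PDG: nodes are statements, edges are data/control dependences.
pdg = nx.DiGraph()
pdg.add_edges_from([
    (1, 3), (2, 3),      # statements 1 and 2 feed the key point 3 (e.g., a memcpy call)
    (3, 5),              # statement 5 uses a value produced at 3
    (4, 6),              # statements 4 and 6 are unrelated to the key point
])

def slice_subgraph(pdg: nx.DiGraph, criterion: int) -> nx.DiGraph:
    """Backward + forward slice around the slicing criterion, as an induced subgraph."""
    keep = {criterion} | nx.ancestors(pdg, criterion) | nx.descendants(pdg, criterion)
    return pdg.subgraph(keep).copy()

sg = slice_subgraph(pdg, criterion=3)
print(sorted(sg.nodes()))   # [1, 2, 3, 5]  (nodes 4 and 6 are sliced away)
```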

Finally, in terms of programming languages, there are relatively few learning-based source code vulnerability detection methods for multilingual source code, and these methods mainly use metric-based and sequence-based program representations, because feature extraction based on these two representations can easily be scaled to multiple programming languages without language-specific code parsing tools for building complex intermediate representations of the code. For example, Zaharia et al. extracted vulnerability features from C/C++ and Java code based on an intermediate representation of token sequences.

Since syntax tree based and graph-based program representations can provide the structural information in the code, in recent years they have been widely used for vulnerability detection in C/C++ programs as well as in programs written in other languages. For example, Lin et al. used the PDG-based Sub-Dependence Graph (SDG) for PHP to represent dependencies in vulnerable code. However, these representations require different code parsing tools for different programming languages to obtain intermediate representations such as syntax trees and graphs. For example, C/C++ code can be parsed using Joern or CppDepend, Java code using Javalang or JavaParser, JavaScript code using Esprima or Esgraph, and PHP code using PHP-Parser or PHP-Joern. These tools reduce the cost of lexical and syntactic analysis of source code for researchers, and further facilitate the use of syntax tree based and graph-based program representations. With the emergence of multilingual code parsing tools (e.g., Tree-sitter), researchers have begun to use syntax tree based and graph-based program representations for multilingual source code vulnerability detection.
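
For example, with the Tree-sitter Python bindings one parser front end can produce syntax trees for many languages. The sketch below assumes the legacy tree_sitter API (in which grammars are compiled with Language.build_library) and assumes that the grammar repositories have been cloned locally; the paths are illustrative.

```python
from tree_sitter import Language, Parser  # assumes the legacy (pre-0.22) binding API

# Build one shared library from locally cloned grammar repositories (paths are illustrative).
Language.build_library("build/langs.so", ["vendor/tree-sitter-c", "vendor/tree-sitter-java"])
C_LANG = Language("build/langs.so", "c")
JAVA_LANG = Language("build/langs.so", "java")

def parse(src: bytes, lang: Language):
    parser = Parser()
    parser.set_language(lang)
    return parser.parse(src).root_node

c_root = parse(b"int add(int a, int b) { return a + b; }", C_LANG)
java_root = parse(b"class A { int add(int a, int b) { return a + b; } }", JAVA_LANG)

# The same downstream pipeline (e.g., flattening the tree into a node-type sequence
# or building graphs over it) can now be reused across languages.
print(c_root.sexp()[:60])
print(java_root.sexp()[:60])
```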

In summary, the intermediate representation of programs lays the foundation for learning program representations for vulnerability detection tasks, but the ability of program representations to represent semantics at different levels of granularity and abstraction still needs to be improved. A comparison of the advantages and disadvantages of several typical program representation methods is shown in Table 4.

 Table 4 Advantages and disadvantages of program representation methods

| Category | Advantages | Disadvantages |
| --- | --- | --- |
| Metric-based program representation | Most software metrics are simple and easy to extract, and are suitable for characterizing statistical metric information of the code | Difficult to characterize the deeper semantics of vulnerable code |
| Sequence-based program representation | Can represent the natural word order and lexical statistics of the program | Represents the program as a linear sequence, so the structural information of the program is not fully utilized |
| Syntax tree based program representation | Can reflect the syntactic structure information of the code | For large-scale programs, the number of AST nodes is much larger than the number of statements, and the hierarchical tree structure requires recursive processing, leading to high time complexity for subsequent feature extraction models |
| Graph-based program representation | Can better preserve the logical structure, dependency relationships, and other semantic information of the code | For large programs, the generated graphs contain many kinds of edges and become complex, which increases the complexity of subsequent models |

In summary, the metric-based program representation is suitable for characterizing the statistical metric information of the code but has difficulty depicting the deeper semantic information of vulnerable code. The sequence-based program representation is suitable for representing the natural order and lexical information of the program, but representing the program as a linear sequence loses its structural information. The AST-based program representation is suitable for representing the syntactic structure information of the program, but when dealing with large-scale programs the number of AST nodes is much larger than the number of tokens in the code, which makes it difficult for the model to learn long-distance dependency information in the tree structure. The graph-based representation can better preserve complex semantic information such as the logical structure and dependencies of the code, but when the program is large, the generated graph structure becomes complex, which reduces the efficiency of model training. An effective intermediate code representation should not only retain rich and critical vulnerability semantic information to improve its ability to abstract and characterize vulnerability semantics, but also minimize the negative impact of vulnerability-irrelevant information on the representation of vulnerability features and reduce the complexity of the intermediate representation. Therefore, how a program representation method can comprehensively, accurately, and effectively portray the syntax and semantics of vulnerable code at different granularities and abstraction levels still needs to be studied in depth.

 6 Coarse-Grained Vulnerability Detection Methods for Source Code


Program analysis is the fundamental technology of vulnerability detection and includes static analysis, dynamic analysis, and hybrid analysis methods. Static analysis techniques detect vulnerabilities by analyzing the code without building a runtime environment to execute the program, but their false positive rate is high. Dynamic analysis detects vulnerabilities by running the program; although its accuracy is higher, it depends largely on the quality of the test cases, its code coverage is low, and false negatives are possible. Moreover, researchers can sometimes obtain only part of the changed code from a code repository rather than the complete source code of the project, so the code cannot be compiled and run properly and detection can only rely on static analysis. Although combining static and dynamic analysis can mitigate false positives and false negatives to some extent, it still suffers from higher time complexity and lower detection efficiency. In short, traditional vulnerability detection techniques still need further improvement in efficiency and effectiveness. In recent years, machine learning and deep learning have provided more effective and automated solutions for the vulnerability detection task. Compared with traditional techniques, learning-based approaches can learn latent, abstract patterns of vulnerable code and can significantly improve the generalization ability of the model. Therefore, this paper focuses on the discussion and analysis of learning-based vulnerability detection methods.

 6.1 Common vulnerability detection tools and evaluation metrics


The basic principle of rule-based detection methods is to provide a rule syntax with which security professionals can describe defective data flows or control flows, and then use rule matching to report code that satisfies the conditions described by the rules as vulnerable. Existing commercial and open-source detection tools adopt this rule-based detection approach, although their detection principles vary. Table 5 lists some classic vulnerability detection tools.
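
Before turning to the tools in Table 5, the sketch below gives a toy illustration of this rule-matching principle: C source lines are scanned against a small table of rules describing insecure calls, and matches are reported as findings. The rule set is hypothetical and far simpler than what the listed tools implement.

```python
import re

# A hypothetical rule table: each rule pairs a pattern over code text with a message.
RULES = [
    (re.compile(r'\bgets\s*\('),               "use of gets(): unbounded read into buffer"),
    (re.compile(r'\bstrcpy\s*\('),             "use of strcpy(): no length check on copy"),
    (re.compile(r'\bsystem\s*\(\s*\w+\s*\)'),  "system() called with a variable: possible command injection"),
]

def scan(path: str, lines: list[str]) -> list[str]:
    """Report every line that matches a rule, in a 'file:line: message' format."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append(f"{path}:{lineno}: {message}")
    return findings

code = [
    'char buf[32];',
    'gets(buf);',
    'system(cmd);',
]
print("\n".join(scan("demo.c", code)))
```
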
 Table 5 Common vulnerability detection tools

| Type | Representative tools | Languages detected |
| --- | --- | --- |
| Commercial tools | Checkmarx | 20 languages |
| | Coverity | 19 languages |
| | Fortify | 25 languages |
| | CodeSonar | Java |
| | Klocwork | Java |
| | CodeSecure | 6 languages |
| | Helix | |
| Open-source tools | Flawfinder | |