Method for improving robustness of source code static analysis tools

A robust compilation front-end and a syntax transpiler improve the accuracy and extensibility of static analysis tools when the compilation configuration is incomplete, addressing their insufficient robustness in that setting and improving the user experience.

CN115658508B · Active · Publication Date: 2026-06-19 · BEIJING XUANYU INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING XUANYU INFORMATION TECH CO LTD
Filing Date: 2022-10-28
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing static analysis tools lack robustness under incomplete compilation configurations, which leads to inaccurate or missing analysis results. In particular, they handle missing header files and macro definitions poorly, have weak error recovery capabilities, and are poorly compatible with different compilation platforms and dialects.

Method used

A robust compilation front-end is designed, comprising a lexical analysis module, a syntax parsing module, and a syntax transpiler. An AST is generated through preprocessing functions, a token matching state machine, and recursive descent parsing. The AST is then converted into the original analysis tool's AST structure using the visitor design pattern, supporting zero-configuration analysis.

Benefits of technology

It improves the accuracy and extensibility of static analysis tools under incomplete configurations, lowers the barrier to use, and enhances the robustness and user experience of the tools.

Generated by Eureka AI based on patent content.

  • Figure CN115658508B_ABST

Abstract

This invention discloses a method for improving the robustness of static source code analysis tools, comprising the following steps: Step 1, designing a robust compilation front-end; Step 2, designing a syntax transpiler; Step 3, improving the robustness of the original analysis tool; Step 4, adding zero configuration. In step 1.2, the lexical analysis module reserves the nextToken interface for subclasses to implement. The invention generates an AST with a robust compilation front-end and transpiles it into the AST of the original analysis tool, reusing the code analysis capabilities of the existing tool. The compilation front-end has a built-in extensible preprocessing module, strong compilation error recovery, and strong extensibility, so the robustness of the analysis tool is improved while existing code assets are reused. Even when the configuration of the software under test is incomplete, the tool can still give accurate analysis results, which lowers the barrier to using static analysis tools and improves the user experience.

Description

Technical Field

[0001] This invention relates to the field of program analysis and testing technology, specifically to a method for improving the robustness of static source code analysis tools.

Background Technology

[0002] Static source code analysis tools are important software testing tools. These tools statically analyze the source code of the software under test to discover violations and potential defects in the code, and are widely used in the development and testing of various software. The core components of a static analysis tool include a syntax analysis front-end and a detector engine. The syntax analysis front-end is responsible for parsing the source code of the software under test into intermediate data structures, and the detector engine performs specific rule checks based on these intermediate data structures. Similar to compiler front-ends, the syntax analysis front-end in a static analysis tool generally requires a complete compilation of the software code under test according to the compilation configuration. Well-known static analysis tools include Coverity, Klocwork, and Clang Static Analyzer.

[0003] In practical scenarios where static analysis tools are used, users often struggle to obtain a complete compilation environment and configuration, including compilation and linking options, predefined macros, header file search paths, and dependent source code. Users typically address this in two ways: first, by using an incomplete compilation environment and configuration, leading to errors in the static analysis tool's syntax analysis frontend and resulting in significantly incomplete or inaccurate analysis results; second, by spending considerable time obtaining a complete compilation environment and configuration, and then configuring and analyzing within the static analysis tool. Both approaches significantly impact the effectiveness of using the tool. The root cause of this situation lies in the insufficient robustness of static analysis tools, meaning they cannot perform accurate analysis with incomplete compilation configurations. This manifests in several ways:

[0004] First, the robustness of code preprocessing is insufficient when header files and macro definitions are missing. Existing methods for preprocessing source code typically involve calling the compiler's preprocessing commands and using the preprocessed file as input to static analysis tools. This results in inadequate error handling; when the compiler's preprocessing commands are called, if a specific header file is not found, preprocessing ends, preventing subsequent static analysis.

[0005] Secondly, the error recovery capability of code parsing is not strong enough. Existing methods for parsing code in the compilation front-end usually require a complete type check of the source program or the use of a Parser Generator to parse user-defined syntax files. These methods typically have relatively weak error recovery capabilities, meaning that when a parsing error occurs, it may directly cause the parsing module to crash, making subsequent static analysis impossible.

[0006] Third, the dialect compatibility of the compiler front-end is not strong enough. Existing compiler front-ends for parsing C programs typically support only a single C language standard or compilation platform, such as the C99 standard or the GNU C compilation platform, which leads to parsing errors when a new compilation platform or dialect is encountered. Static analysis tools therefore need to invest heavily in supporting the C language dialects of various compilation platforms (such as C51, DSP, ARM, and MS), and they become incompatible when they encounter a previously unknown dialect.

Summary of the Invention

[0007] The purpose of this invention is to provide a method for improving the robustness of static source code analysis tools, so as to solve the problems mentioned in the background art.

[0008] To achieve the above objectives, the present invention provides the following technical solution: a method for improving the robustness of source code static analysis tools, comprising the following steps: Step 1, designing a robust compilation front-end; Step 2, designing a syntax transpiler; Step 3, improving the robustness of the original analysis tool; Step 4, adding zero configuration;

[0009] Step one above includes the following steps:

[0010] 1.1 Design the preprocessing function of the lexical analysis module in the robust compilation front-end, i.e., the module executes different processing according to different preprocessing directives;

[0011] 1.2 Design a robust compiler front-end lexical analysis module. The lexical analysis module divides the preprocessed character stream into a Token stream based on the pre-designed Token matching state machine. Each Token is the smallest syntactic unit.

[0012] 1.3 Design a robust compiler front-end with a syntax parsing module that can take the token stream from lexical analysis as input and use a recursive descent method to perform syntax parsing and generate an abstract syntax tree (AST).

[0013] 1.4 After implementing the functions of each module in steps 1.1, 1.2 and 1.3 above, a robust compilation front-end is obtained;
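Steps 1.2 and 1.3 can be illustrated with a minimal sketch, assuming a toy arithmetic grammar rather than C. The names `tokenize` and `Parser` and the tuple-shaped AST are hypothetical and are not taken from the patent; the sketch only shows how a token stream feeds a recursive descent parser in which each grammar rule is one small method.

```python
import re

# One token per match: a number, an identifier, or a single other character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")


def tokenize(text):
    """Split a character stream into (kind, value) tokens."""
    tokens = []
    for num, ident, punct in TOKEN_RE.findall(text):
        if num:
            tokens.append(("num", num))
        elif ident:
            tokens.append(("id", ident))
        elif punct.strip():          # ignore stray whitespace matches
            tokens.append(("punct", punct))
    tokens.append(("eof", ""))
    return tokens


class Parser:
    """Recursive descent parser: one small method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        if tok[0] != "eof":          # never move past the end marker
            self.pos += 1
        return tok

    def parse_expr(self):            # expr := term (('+' | '-') term)*
        node = self.parse_term()
        while self.peek() in (("punct", "+"), ("punct", "-")):
            op = self.advance()[1]
            node = ("binop", op, node, self.parse_term())
        return node

    def parse_term(self):            # term := factor (('*' | '/') factor)*
        node = self.parse_factor()
        while self.peek() in (("punct", "*"), ("punct", "/")):
            op = self.advance()[1]
            node = ("binop", op, node, self.parse_factor())
        return node

    def parse_factor(self):          # factor := num | id | '(' expr ')'
        kind, value = self.advance()
        if kind in ("num", "id"):
            return (kind, value)
        if (kind, value) == ("punct", "("):
            node = self.parse_expr()
            if self.advance() != ("punct", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")


ast = Parser(tokenize("a + 2 * (b - 1)")).parse_expr()
```

Keeping each rule in its own method is what later allows a dialect-specific subclass to override a single rule without touching the rest of the parser.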

[0014] In step two above, a visitor design pattern interface is implemented for the AST generated in step 1.3. The syntax transpiler takes the AST as input and uses the visitor design pattern to transpile each syntax structure of the AST, converting it into the AST structure of the existing analysis tool's front end. This reuses the existing static code analysis module and ensures the stability of the function.
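The transpilation step can be sketched as follows, assuming two invented AST shapes. `Num`, `Name`, `BinOp`, `LegacyNode`, and the node kind strings are hypothetical placeholders; the patent does not disclose the node types of either AST. The point is only that each source node dispatches to a visitor method, and the visitor rebuilds the node in the target structure.

```python
from dataclasses import dataclass


# --- AST of the robust front-end (hypothetical node types) ---
@dataclass
class Num:
    value: int

    def accept(self, visitor):
        return visitor.visit_num(self)


@dataclass
class Name:
    ident: str

    def accept(self, visitor):
        return visitor.visit_name(self)


@dataclass
class BinOp:
    op: str
    left: object
    right: object

    def accept(self, visitor):
        return visitor.visit_binop(self)


# --- AST of the original analysis tool (hypothetical shape) ---
@dataclass
class LegacyNode:
    kind: str
    text: str
    children: list


class Transpiler:
    """Visitor that rebuilds each node in the original tool's AST structure."""

    def visit_num(self, node):
        return LegacyNode("IntegerLiteral", str(node.value), [])

    def visit_name(self, node):
        return LegacyNode("DeclRefExpr", node.ident, [])

    def visit_binop(self, node):
        # Children are transpiled first, then wrapped in the parent node.
        children = [node.left.accept(self), node.right.accept(self)]
        return LegacyNode("BinaryOperator", node.op, children)


legacy = BinOp("+", Name("x"), Num(1)).accept(Transpiler())
```

Because the existing checkers only ever see `LegacyNode` values in this sketch, they would not need to change when the front-end is swapped.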

[0015] In step three above, the robust compilation front-end designed in step one and the transpiler module designed in step two are added to the analysis module of the original analysis tool.

[0016] In step four above, after completing step three, a condition entry for enabling zero-configuration analysis is added to the original analysis tool to extend the robustness analysis support of the original analysis tool, resulting in a robust static analyzer.

[0017] Preferably, in step 1.1, the preprocessing function of the lexical analysis module includes the following three points: First, for macro definition instructions, the macro definition parser is used to parse them and save them to the context index of lexical analysis. When a replaceable macro identifier is encountered in subsequent parsing, the already recorded index is queried and macro replacement is performed; if no macro definition is found, it is safely treated as an identifier; Second, for header file inclusion instructions, the path of the directory where the file under test is located and the string of the included header file are concatenated. If the header file is located, the file is recursively preprocessed and parsed; otherwise, compilation error information is recorded and subsequent code parsing continues safely; Third, for conditional compilation instructions, the context index of the macro definition is referenced to interpret them.
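The three preprocessing behaviors described above can be sketched as a small tolerant preprocessor. This is an illustration under simplifying assumptions, not the patented implementation: the class name `Preprocessor` is hypothetical, files are held in a dictionary instead of being read from disk, only object-like macros and `#ifdef` / `#ifndef` / `#else` / `#endif` are handled, and macro substitution is a single pass that does not skip string literals.

```python
import posixpath
import re


class Preprocessor:
    def __init__(self, files):
        self.files = files    # path -> source text (stand-in for the file system)
        self.macros = {}      # context index of macro definitions
        self.errors = []      # recorded errors; processing never aborts
        self.stack = []       # files currently being expanded

    def process(self, path):
        self.stack.append(path)
        out, active = [], [True]          # stack of conditional states
        for line in self.files[path].splitlines():
            stripped = line.strip()
            if stripped.startswith("#"):
                self._directive(stripped[1:].strip(), path, active, out)
            elif all(active):
                out.append(self._expand(line))
        self.stack.pop()
        return out

    def _directive(self, text, path, active, out):
        match = re.match(r"(\w*)\s*(.*)", text)
        name, rest = match.group(1), match.group(2).strip()
        if name in ("ifdef", "ifndef"):
            defined = rest in self.macros
            active.append(defined if name == "ifdef" else not defined)
        elif name == "else" and len(active) > 1:
            active[-1] = not active[-1]
        elif name == "endif" and len(active) > 1:
            active.pop()
        elif not all(active):
            return                        # ignore directives in inactive regions
        elif name == "define":
            macro, _, body = rest.partition(" ")
            self.macros[macro] = body.strip()
        elif name == "include":
            header = rest.strip('"<>')
            # Concatenate the directory of the including file and the header name.
            target = posixpath.join(posixpath.dirname(path), header)
            if target not in self.files:
                self.errors.append(f"Unresolved inclusion: {rest}")
            elif target not in self.stack:     # ignore cyclic includes
                out.extend(self.process(target))

    def _expand(self, line):
        # Known macros are replaced; unknown identifiers are left as identifiers.
        return re.sub(r"[A-Za-z_]\w*",
                      lambda m: self.macros.get(m.group(0), m.group(0)), line)


files = {
    "main.c": '#include "global.h"\n#include<internal.h>\n'
              "#ifdef DEBUG\nint dbg;\n#else\nint x = SIZE + UNKNOWN;\n#endif\n",
    "global.h": "#define SIZE 8\n",
}
pp = Preprocessor(files)
output = pp.process("main.c")
```

In this run the missing header is recorded in `pp.errors` and processing continues, `SIZE` is replaced from the macro index, and `UNKNOWN` passes through unchanged as an ordinary identifier.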

[0018] Preferably, in step 1.2, the lexical parsing module reserves the nextToken interface for subclasses to implement.
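A minimal sketch of this extension point, assuming a hand-written character state machine: the base `Lexer` produces tokens through `next_token`, and a dialect subclass overrides only that method. The class names and the token representation are hypothetical; `sbit`, `sfr`, `xdata`, and `interrupt` are used as examples of Keil C51 keyword extensions.

```python
class Lexer:
    """Character-driven state machine; next_token is the extension point."""

    KEYWORDS = {"int", "void", "return"}

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self):
        text, n = self.text, len(self.text)
        while self.pos < n and text[self.pos].isspace():    # state: skip blanks
            self.pos += 1
        if self.pos >= n:
            return ("eof", "")
        start = self.pos
        ch = text[start]
        if ch.isalpha() or ch == "_":                        # state: identifier
            while self.pos < n and (text[self.pos].isalnum() or text[self.pos] == "_"):
                self.pos += 1
            word = text[start:self.pos]
            return ("keyword" if word in self.KEYWORDS else "id", word)
        if ch.isdigit():                                     # state: number
            while self.pos < n and text[self.pos].isalnum():
                self.pos += 1
            return ("num", text[start:self.pos])
        self.pos += 1                                        # state: punctuation
        return ("punct", ch)

    def tokens(self):
        out = []
        while True:
            tok = self.next_token()
            out.append(tok)
            if tok[0] == "eof":
                return out


class C51Lexer(Lexer):
    """Hypothetical dialect lexer: reclassifies a few C51 keywords."""

    DIALECT_KEYWORDS = {"sbit", "sfr", "xdata", "interrupt"}

    def next_token(self):
        kind, value = super().next_token()
        if kind == "id" and value in self.DIALECT_KEYWORDS:
            return ("keyword", value)
        return (kind, value)


base_tokens = Lexer("sbit LED = 1;").tokens()
c51_tokens = C51Lexer("sbit LED = 1;").tokens()
```

The base lexer treats `sbit` as a plain identifier, which is the safe default, while the subclass reports it as a keyword without duplicating any of the scanning logic.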

[0019] Preferably, in step 1.3, the syntax parsing module is compatible with C99 / GNU C and C++11 syntax.

[0020] Preferably, in step 1.3, the syntax parsing module has the following characteristics: First, ambiguous syntax parsing, that is, when the syntax parsing module encounters ambiguous syntax during parsing, it records it first, and after the entire syntax parsing is completed, it infers based on the parsed context symbol information and rewrites the ambiguous syntax structure; Second, error recovery, that is, when encountering a syntax parsing error, it backtracks to the reasonable syntax structure and records the parsing error information; Third, extensibility, that is, the common syntax parsing subroutine of the C / C++ language is implemented as the top-level base class, the parsing subroutines related to C / C++ language features are delegated to subclasses for implementation, and the recursive descent subroutine is refined into relatively small method units. When the parser is extended for a new dialect, only the recursive descent subroutine that needs special handling needs to be rewritten, thus achieving compatibility with the new dialect at a low cost.
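The error recovery behavior can be illustrated with a small sketch, assuming a toy grammar of `name = number ;` statements. On a syntax error the parser records a message, returns to the start of the failed statement, and resynchronizes at the next `;`. The function names and the choice of `;` as the synchronization point are assumptions made for this example; the patent does not specify the recovery points.

```python
def expect(tokens, pos, predicate, what):
    """Return (token, next position) or raise SyntaxError describing the mismatch."""
    if pos < len(tokens) and predicate(tokens[pos]):
        return tokens[pos], pos + 1
    found = tokens[pos] if pos < len(tokens) else "end of input"
    raise SyntaxError(f"expected {what}, found {found!r}")


def parse_assignment(tokens, pos):
    name, pos = expect(tokens, pos, str.isidentifier, "identifier")
    _, pos = expect(tokens, pos, lambda t: t == "=", "'='")
    value, pos = expect(tokens, pos, str.isdigit, "number")
    _, pos = expect(tokens, pos, lambda t: t == ";", "';'")
    return ("assign", name, int(value)), pos


def parse_statements(tokens):
    """Parse a list of string tokens, tolerating malformed statements."""
    statements, errors = [], []
    pos = 0
    while pos < len(tokens):
        start = pos
        try:
            stmt, pos = parse_assignment(tokens, pos)
            statements.append(stmt)
        except SyntaxError as exc:
            errors.append(f"token {start}: {exc}")
            pos = start                      # back to the last known good position
            while pos < len(tokens) and tokens[pos] != ";":
                pos += 1                     # skip the damaged statement
            pos += 1                         # resume after the ';'
    return statements, errors


statements, errors = parse_statements("a = 1 ; b = = 2 ; c = 3 ;".split())
```

The malformed middle statement costs one error record, and the statements on either side of it are still recovered.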

[0021] Preferably, step three specifically includes:

[0022] 3.1 Extend the original analysis tool's code parsing interface. This interface takes the source file as input, connects to a robust compilation front-end and syntax transpiler based on conditions, and outputs the original analysis tool's AST.

[0023] 3.2 Instantiate the robust compilation front-end.
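Steps 3.1 and 3.2, together with the zero-configuration condition of step four, can be sketched as a single dispatching function. Everything here is a stand-in: `AnalysisConfig`, the `zero_config` flag, and the three front-end functions are hypothetical names, and the dictionaries returned are placeholders for real AST objects.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisConfig:
    include_dirs: list = field(default_factory=list)
    macros: dict = field(default_factory=dict)
    zero_config: bool = False      # hypothetical switch for zero-configuration analysis


def original_frontend(source, config):
    """Stand-in for the original front-end, which needs a full configuration."""
    if not config.include_dirs:
        raise RuntimeError("missing compilation configuration")
    return {"kind": "TranslationUnit", "producer": "original", "source": source}


def robust_frontend(source):
    """Stand-in for the robust front-end; returns its own AST shape."""
    return ("unit", source)


def transpile(robust_ast):
    """Stand-in for the syntax transpiler: robust AST -> original tool's AST."""
    _, source = robust_ast
    return {"kind": "TranslationUnit", "producer": "robust", "source": source}


def parse_source(source, config):
    """Extended parsing interface: always returns the original tool's AST."""
    if config.zero_config:
        return transpile(robust_frontend(source))
    return original_frontend(source, config)


zero = parse_source("int x;", AnalysisConfig(zero_config=True))
full = parse_source("int x;", AnalysisConfig(include_dirs=["inc"]))
```

Both paths return the same AST shape, so the downstream checkers are unaffected by which front-end produced it.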

[0024] Compared with the prior art, the beneficial effects of the present invention are as follows: The present invention generates an AST with a robust compilation front-end and transpiles it into the AST of the original analysis tool, thus reusing the code analysis capabilities of the existing analysis tool. The compilation front-end has a built-in extensible preprocessing module, strong compilation error recovery capability, and strong extensibility, so the robustness of the analysis tool is improved while existing code assets are reused. The tool can still give accurate analysis results when the configuration of the software under test is incomplete, which lowers the barrier to using static analysis tools and improves the user experience.

Attached Figure Description

[0025] Figure 1 is a flowchart of the method of the present invention.

Detailed Implementation

[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0027] Referring to Figure 1, the present invention provides a technical solution:

[0028] Example:

[0029] The method to improve the robustness of static source code analysis tools includes the following steps: Step 1, design a robust compilation front-end; Step 2, design a syntax transpiler; Step 3, improve the robustness of the original analysis tool; Step 4, add zero configuration.

[0030] Step one above includes the following steps:

[0031] 1.1 Design the preprocessing function of the lexical analysis module in the robust compilation front-end, that is, the module executes different processing according to different preprocessing directives. The preprocessing function of the lexical analysis module includes the following three points: First, for macro definition directives, the macro definition parser parses them and saves them to the context index of lexical analysis. When a replaceable macro identifier is encountered in subsequent parsing, the recorded index is queried and macro substitution is performed. If no macro definition is found, it is safely treated as an identifier. Second, for header file inclusion directives, the path of the directory where the file under test is located and the string of the included header file are concatenated. If the header file is located, the file is recursively preprocessed and parsed. Otherwise, the compilation error information is recorded and subsequent code parsing continues safely. Third, for conditional compilation directives, the context index of macro definitions is referenced to interpret them.

[0032] 1.2 Design a robust compiler front-end lexical analysis module. The lexical analysis module divides the preprocessed character stream into a token stream based on the pre-designed token matching state machine. Each token is the smallest syntactic unit. The lexical parsing module reserves the nextToken interface for subclasses to implement.

[0033] 1.3 Design the syntax parsing module of the robust compilation front-end, which takes the token stream from lexical analysis as input and performs syntax parsing using recursive descent to generate an abstract syntax tree (AST). The syntax parsing module is compatible with C99 / GNU C and C++11 syntax. The syntax parsing module has the following characteristics: First, ambiguous syntax parsing: when the parsing module encounters ambiguous syntax, it records it first and, after the entire parse is complete, infers the correct reading from the parsed context symbol information and rewrites the ambiguous syntax structure. Second, error recovery: when a syntax parsing error is encountered, the parser backtracks to a reasonable syntax structure and records the parsing error information. Third, extensibility: the common syntax parsing subroutines of the C / C++ languages are implemented in a top-level base class, the parsing subroutines tied to specific C / C++ language features are implemented by subclasses, and the recursive descent subroutines are refined into relatively small method units. When the parser is extended for a new dialect, only the recursive descent subroutines that need special handling have to be rewritten, thus achieving compatibility with the new dialect at a low cost;

[0034] 1.4 After implementing the functions of each module in steps 1.1, 1.2 and 1.3 above, a robust compilation front-end is obtained;
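The extensibility property of step 1.3 can be sketched with a base parser whose rules are split into small methods and a dialect subclass that overrides only one of them. The class names are hypothetical, the grammar is reduced to `specifiers name ;`, and the C51 memory-space qualifiers (`data`, `idata`, `xdata`, `code`) are used purely as an example of dialect syntax.

```python
class BaseDeclParser:
    """Common parsing shared by all dialects, split into small methods."""

    TYPE_NAMES = {"int", "char", "unsigned", "void"}

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def is_specifier(self, tok):
        return tok in self.TYPE_NAMES

    def parse_specifiers(self):
        specs = []
        while self.is_specifier(self.peek()):
            specs.append(self.take())
        return specs

    def parse_declaration(self):
        specs = self.parse_specifiers()
        name = self.take()
        if not specs or name is None or not name.isidentifier() or self.take() != ";":
            raise SyntaxError("malformed declaration")
        return {"specifiers": specs, "name": name}


class C51DeclParser(BaseDeclParser):
    """Hypothetical C51 dialect: only the specifier check is overridden."""

    MEMORY_SPACES = {"data", "idata", "xdata", "code"}

    def is_specifier(self, tok):
        return tok in self.MEMORY_SPACES or super().is_specifier(tok)


tokens = "unsigned char xdata buffer ;".split()
decl = C51DeclParser(tokens).parse_declaration()
```

The base parser rejects the same input because it reads `xdata` as the declared name, so supporting the dialect costs one overridden method rather than a new parser.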

[0035] In step two above, a visitor design pattern interface is implemented for the AST generated in step 1.3. The syntax transpiler takes the AST as input and uses the visitor design pattern to transpile each syntax structure of the AST, converting it into the AST structure of the existing analysis tool's front end. This reuses the existing static code analysis module and ensures the stability of the function.

[0036] In step three above, the robust compilation front-end designed in step one and the transpiler module designed in step two are added to the analysis module of the original analysis tool, specifically as follows:

[0037] 3.1 Extend the original analysis tool's code parsing interface. This interface takes the source file as input, connects to a robust compilation front-end and syntax transpiler based on conditions, and outputs the original analysis tool's AST.

[0038] 3.2 Instantiate the robust compilation front-end;

[0039] In step four above, after completing step three, a condition entry for enabling zero-configuration analysis is added to the original analysis tool to extend the robustness analysis support of the original analysis tool, resulting in a robust static analyzer.

[0040] Experimental Example 1:

[0041] The robust static analyzer obtained in the above embodiments was used to perform static analysis on the C51 platform project and the DSP2000 platform project. The C51 project was compiled with the following settings: no C51 platform configuration environment and no platform header file search directory. The DSP2000 project was compiled with all built-in macro definitions of the DSP platform removed. The specific analysis process is as follows:

[0042] 1) Use the C51 platform project and the DSP2000 platform project as inputs to the robust static analyzer;

[0043] 2) The preprocessing module begins processing. When the robust compilation frontend cannot find the header file for the C51 project platform, it does not recursively expand the header file, but instead records the error message and continues code parsing. For header files that are identified, it recursively expands them. For example, consider the following code snippet in the C51 project:

[0044] #include "global.h"

[0045] #include<internal.h>

[0046] The system can find global.h in the current directory, but not internal.h. Therefore, when the preprocessing module expands the includes, only the code in global.h is expanded; for internal.h, the error message "Unresolved inclusion: <internal.h>" is logged, and analysis of the following code continues;

[0047] 3) The syntax parsing module takes the code expanded by the preprocessing module as input. Based on the syntax structure of the C / C++ languages, it can parse both C and C++. It provides subclass overriding interfaces for specific syntax structures, making the parsing module extensible. When it encounters a syntax error during parsing, it does not crash and stop, but instead backtracks to a specific point in the valid syntax and continues parsing subsequent code to recover as many correct statements as possible. For example, regarding the following two interrupt functions in DSP2000:

[0048]

[0049]

[0050] Even without the built-in interrupt macro definition, it can still be correctly parsed into two function definitions.

[0051] 4) The transpiler module converts the AST generated by the robust front-end into the original AST structure of the static analysis tool;

[0052] 5) Reuse the existing static analysis module and begin static analysis checks;

[0053] 6) Obtain the static analysis results.
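The interrupt-function case in step 3) can be illustrated with a deliberately small sketch. The two listings referred to above are not reproduced in this text, so the input below is a hypothetical stand-in, and the regex-based scanner is a simplification chosen for brevity, not the recursive descent parser the patent describes. It shows one plausible tolerant policy: an unrecognized leading word such as `interrupt` is kept as an attribute of the function instead of being reported as a syntax error.

```python
import re

KNOWN_TYPES = {"void", "int", "char", "long", "float", "double", "unsigned"}
CONTROL_WORDS = {"if", "else", "while", "for", "switch", "return"}

# One or more leading words, a function name, a parameter list, an opening brace.
HEADER_RE = re.compile(r"((?:[A-Za-z_]\w*\s+)+)([A-Za-z_]\w*)\s*\(([^)]*)\)\s*\{")


def parse_function_definitions(source):
    """Find function headers, tolerating unknown leading specifiers."""
    functions, notes = [], []
    for match in HEADER_RE.finditer(source):
        specifiers = match.group(1).split()
        name = match.group(2)
        if name in CONTROL_WORDS or CONTROL_WORDS.intersection(specifiers):
            continue                       # a control statement, not a definition
        unknown = [s for s in specifiers if s not in KNOWN_TYPES]
        for word in unknown:
            notes.append(f"unknown specifier {word!r} kept as attribute of {name}")
        functions.append({
            "name": name,
            "return_type": [s for s in specifiers if s in KNOWN_TYPES],
            "attributes": unknown,
        })
    return functions, notes


# Hypothetical input: `interrupt` is not defined as a macro or keyword here.
SOURCE = """
interrupt void timer0_isr(void) { ticks++; }
interrupt void adc_isr(void) { samples++; }
"""

functions, notes = parse_function_definitions(SOURCE)
```

Both definitions are recovered with `void` as the return type, and the unknown word is preserved in `attributes` so that later stages could still inspect it.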

[0054] Experimental Example 2:

[0055] With the complete configuration, a comparative experiment was conducted between the robust static analyzer obtained in the above embodiments and the original analysis tool. The analysis tool code library already has 1081 unit tests. The pass rate of the robust static analyzer's analysis front-end is 98.12%. Through comparative testing of 10 real projects, the robust static analyzer generated 889,629 rule alarms. Compared with the original analysis tool's 883,420 rule alarms, the false alarm rate increased by 0.87%, and the false negative rate increased by 0.30%. Overall, the difference is not significant and has basically no impact.
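As a quick check on the scale of the difference, the relative change in the total number of rule alarms can be computed from the two counts given above. This figure is a different quantity from the false alarm and false negative rates reported in the paragraph, which cannot be recomputed from the totals alone.

```python
robust_alarms = 889_629      # rule alarms from the robust static analyzer
original_alarms = 883_420    # rule alarms from the original analysis tool


def relative_difference(new, old):
    """Relative change of `new` with respect to `old`."""
    return (new - old) / old


extra = robust_alarms - original_alarms
pct = relative_difference(robust_alarms, original_alarms) * 100
print(f"{extra} additional alarms, {pct:.2f}% more than the original tool")
# prints "6209 additional alarms, 0.70% more than the original tool"
```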

[0056] Experimental Example 3:

[0057] In the case of incomplete configuration, a comparative experiment was conducted between the robust static analyzer obtained in the above embodiments and the original analysis tool. By constructing a zero-configuration experimental scenario, the original analysis tool failed to correctly analyze the results due to incorrect compilation configuration, and quickly ended the entire analysis process. The robust static analyzer was able to analyze more code and rules. Compared with the original analysis tool, the robust static analyzer reduced the false negative rate by 71.%, increased the false positive rate by 18.99%, and improved the analysis efficiency by 14.29%.

[0058] Based on the above, the advantages of this invention are that, compared with existing analysis tools, the robust compilation front-end and AST transpiler proposed in this invention are more robust. Through non-intrusive integration with the original analysis tool, the existing rule checkers can be fully reused. With an incomplete configuration, the tool detects more results and detects them more accurately; with a complete configuration, the analysis results are basically the same. The robust compilation front-end proposed in this invention has strong compilation error recovery capabilities and strong extensibility, so the robustness of the analysis tool is improved while existing code assets are reused.

[0059] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.

Claims

1. A method for improving the robustness of a source code static analysis tool, comprising the following steps: Step 1, designing a robust compilation front-end; Step 2, designing a syntax transpiler; Step 3, improving the robustness of the original analysis tool; Step 4, adding zero configuration; characterized in that: Step 1 includes the following steps: 1.1 Design the preprocessing function of the lexical analysis module in the robust compilation front-end, that is, execute different processing according to different preprocessing directives. The preprocessing function of the lexical analysis module includes the following three points: First, for macro definition directives, the macro definition parser parses them and saves them to the context index of lexical analysis; when subsequent parsing encounters a replaceable macro identifier, the recorded index is queried and macro substitution is performed; if no macro definition is found, it is safely treated as an identifier. Second, for header file inclusion directives, the path of the directory containing the file under test and the string of the included header file are concatenated. If the header file is located, it is recursively preprocessed and parsed; otherwise, compilation error information is recorded, and subsequent code parsing continues safely. Third, for conditional compilation directives, the context index of macro definitions is referenced for interpretation. 1.2 Design the lexical analysis module of the robust compilation front-end. The lexical analysis module divides the preprocessed character stream into a token stream according to the pre-designed token matching state machine. Each token is the smallest syntactic unit. The lexical analysis module reserves the nextToken interface for subclasses to implement. 1.3 Design the syntax parsing function of the syntax parsing module in the robust compilation front-end, that is, the syntax parsing module takes the token stream from lexical analysis as input and uses the recursive descent method to perform syntax parsing and generate an abstract syntax tree (AST). The syntax parsing module is compatible with C99 / GNU C and C++11 syntax. 1.4 After implementing the functions of each module in steps 1.1, 1.2 and 1.3, a robust compilation front-end is obtained. In step two, a visitor design pattern interface is implemented for the AST generated in step 1.3. The syntax transpiler takes the AST as input and uses the visitor design pattern to transpile each syntax structure of the AST, converting it into the AST structure of the existing analysis tool's front end. This reuses the existing static code analysis module and ensures the stability of the function. In step three, the robust compilation front-end designed in step one and the transpiler module designed in step two are added to the analysis module of the original analysis tool. In step four, after completing step three, a conditional entry for enabling zero-configuration analysis is added to the original analysis tool to extend the robustness analysis support of the original analysis tool, resulting in a robust static analyzer.

2. The method for improving the robustness of a source code static analysis tool according to claim 1, wherein: In step 1.3, the syntax parsing module has the following characteristics: First, ambiguous syntax parsing, that is, when the syntax parsing module encounters ambiguous syntax, it records it first, and when the entire syntax parsing is finished, it makes inferences based on the parsed context symbol information and rewrites the ambiguous syntax structure. Second, error recovery, that is, when a syntax parsing error is encountered, the parser backtracks to a reasonable syntax structure and records the parsing error information. Third, extensibility, that is, the common parsing subroutines of the C / C++ languages are implemented in a top-level base class, while the parsing subroutines related to specific C / C++ language features are implemented by subclasses, and the recursive descent subroutines are refined into relatively small method units. When the parser is extended for a new dialect, only the recursive descent subroutines that need special handling have to be rewritten, which achieves compatibility with new dialects at a low cost.

3. The method for improving the robustness of a source code static analysis tool according to claim 1, wherein: Step three specifically involves: 3.1 Extend the original analysis tool's code parsing interface. This interface takes the source file as input, connects to the robust compilation front-end and syntax transpiler based on conditions, and outputs the original analysis tool's AST. 3.2 Instantiate the robust compilation front-end.
