A software formalization model generation method based on a control flow graph and a large language model

By combining control flow graphs and large language models, a state machine-style intermediate representation code is constructed and closed-loop verification is performed. This solves the problems of high reliance on manual verification and low automation in formal verification of software, and realizes high-precision and automated formal model generation, thereby improving the completeness of verification and the efficiency of defect location.

CN122240469A · Pending · Publication Date: 2026-06-19 · NANJING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING UNIV
Filing Date
2026-03-03
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing formal verification techniques for software, the construction of formal models relies heavily on human experience and has a low degree of automation, making it difficult to accurately characterize complex control structures. This results in formal models that are insufficient in terms of accuracy and verifiability.

Method used

The method constructs a control flow graph, generates state machine-style intermediate representation code from it, and uses constraint prompt templates to guide a large language model in producing the formal model, which is then iteratively corrected through test case-driven closed-loop verification.

Benefits of technology

It enables the automated generation of formal software models, ensuring semantic equivalence between the generated model and the original program and covering all execution paths, thereby improving the completeness of verification and the efficiency of defect localization, and reducing the cost of manual modeling.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for generating formal software models based on control flow graphs and large language models. The method includes the following steps: parsing the source code to be analyzed using an abstract syntax tree and constructing a control flow graph; assigning a unique node identifier to each node in the control flow graph, where each node corresponds to a basic block of the program; declaring all variables in the source program, introducing a program counter variable, constructing a state machine structure model, and generating state machine-style intermediate representation code based on the control flow graph; using the intermediate representation code and the control flow graph to guide the large language model, through constraint prompt templates, to generate a formal model; and performing syntax constraint verification and test case-driven closed-loop verification on the formal model, iteratively refining it. The invention addresses the problems that formal model construction in existing software formal verification depends heavily on manual labor, has a low degree of automation, and struggles to characterize complex control structures accurately.

Description

Technical Field

[0001] This invention relates to the field of information processing technology, and specifically to a method for generating formal software models based on control flow graphs and large language models.

Background Technology

[0002] Formal verification techniques, by constructing mathematical models of software systems and verifying system properties based on rigorous logical reasoning methods, can theoretically guarantee the correctness of verification results and are considered an effective means to improve software quality and reliability. However, traditional formal verification methods still face many challenges in practical applications. On the one hand, the construction process of formal models is highly dependent on human experience, resulting in high modeling costs and low automation. On the other hand, as the scale and complexity of software systems continue to increase, traditional verification methods have revealed significant shortcomings in terms of reasoning efficiency, state space explosion, and cross-level semantic expression, making it difficult to meet the practical needs of rapid iteration and large-scale verification of complex software systems.

[0003] Meanwhile, large language models demonstrate powerful semantic modeling and reasoning capabilities in natural language understanding, program analysis, and formal expression generation. They can establish effective semantic mapping relationships between natural language, program code, and formal languages, providing a new technical approach to bridging the semantic gap between different representation forms. Introducing large language models into the field of software formal verification helps improve the automation level of model generation and reasoning processes, reduces the dependence of formal methods on the experience of professionals, and thus significantly lowers the technical threshold for use.

[0004] However, existing research mostly focuses on using large language models for code understanding or simple formal expression generation, lacking a systematic characterization of program control structures and execution paths. This results in inadequate accuracy and verifiability of the generated formal models. Control flow graphs, as an important intermediate representation of program execution structures, can effectively reflect the branching, looping, and path relationships of a program. However, a mature and effective technical solution is still lacking for combining control flow graphs with large language models to achieve efficient and accurate generation of formal models of software systems.

Summary of the Invention

[0005] The purpose of this invention is to provide a method for generating formal software models based on control flow graphs and large language models, in order to solve the problems that the construction of formal models in the existing formal verification process is highly dependent on manual labor, has a low degree of automation, and is difficult to accurately characterize complex control structures.

[0006] To achieve the above objectives, the present invention adopts the following technical solution: A method for generating formal software models based on control flow graphs and large language models includes the following steps:
S1: The source code to be analyzed is parsed using an abstract syntax tree, and a control flow graph is constructed; the nodes of the control flow graph correspond to the basic blocks of the program, and a unique node identifier is assigned to each node in the control flow graph;
S2: All variables in the source program are declared, a program counter variable is introduced to represent the position of the current node in the control flow graph, a state machine structure model is constructed, and state machine-style intermediate representation code is generated based on the control flow graph;
S3: Based on the intermediate representation code and the control flow graph, the large language model is guided through constraint prompt templates to generate a formal model; the formal model is subjected to syntax constraint verification and test case-driven closed-loop verification. When verification fails, the counterexample trajectory is fed back to the large language model to iteratively correct the formal model.

[0007] To optimize the above technical solution, the specific limitations also include: Preferably, in step S1, constructing the control flow graph specifically includes: dividing the source code into basic blocks based on the control flow statements in the abstract syntax tree; creating nodes and directed edges according to the logical jump relationships between the basic blocks to form a control flow graph.

[0008] Furthermore, in step S1, when parsing the source code, the string data in it is simultaneously subjected to dimensionality reduction processing, converting the string into a numerical array representation based on the character ASCII code.

[0009] Furthermore, the state machine structure model adopts a while loop as the main execution structure, and uses conditional judgment statements to correspond to the node identifiers of the nodes in the control flow graph. When the jump condition of the corresponding node is met, the basic block operation corresponding to that node is executed and the program counter variable is updated to the next target node.

[0010] Preferably, the intermediate representation code generation process is based on deterministic transformation rules and does not introduce heuristic inference or semantic approximation operations.

[0011] Preferably, the value range of the program counter variable is the set of node identifiers of the nodes in the control flow graph.

[0012] Furthermore, in step S3, each formal action in the formal model uniquely corresponds to a node in the control flow graph, and can be traced back to its specific location in the source code through the program counter variable and node identifier.

[0013] Furthermore, the constraint prompt template includes a role description that guides the role positioning of the large language model, the syntax rules and keyword set of the formal language, and the specifications of logical operators, set operators and temporal operators supported by the formal language.

[0014] Preferably, in step S3, the test case-driven closed-loop verification specifically includes: compiling the test cases of the source program into formal properties and using them as the specification for the formal verifier; using the formal verifier to verify the generated formal model and obtain counterexample trajectories; constructing feedback prompts based on the counterexample trajectories to guide the large language model to correct the formal model; and repeating the verification until all formal properties pass.

[0015] Furthermore, in step S3, the formal model characterizes program branches, loops, and exception paths through the control flow graph, and uses the intermediate representation code to transform the path states in the control flow graph, so as to cover all possible execution paths of the source program.

[0016] Compared with the prior art, the beneficial effects of the present invention are as follows: By introducing control flow graphs, the method reduces the source program to an intermediate representation code form that is highly consistent with the execution paradigm of the formal state machine structure model. It then uses a large language model to map the intermediate representation to the formal language automatically, avoiding the instability caused by complex reasoning. Combining control flow graphs and large language models in this way automates the generation of formal software models and reduces the cost of manual modeling. Furthermore, through a test case-driven closed-loop verification and feedback mechanism, test semantics are automatically compiled into formal properties, and verification counterexamples are used to iteratively correct the formal model generated by the large language model. This ensures the syntactic correctness of the generated formal model and brings its behavior progressively closer to that of the real program, so that the quality of the formal model improves and converges automatically.

[0017] This invention explicitly and completely depicts all branches, loops, and execution paths of a program through a control flow graph, and based on the generated state machine-style intermediate representation code, ensures that the generated formal model is semantically equivalent to the original program and can cover all possible execution paths, thus improving the completeness of verification.

[0018] This invention reduces complex data types such as strings in a program to ASCII numerical arrays, ensuring the accurate expression of high-level language semantics in the formal model. It overcomes the limitations of formal languages in terms of data type support and enhances the direct verifiability of the generated model.

[0019] This invention assigns a unique node identifier to each basic block in the control flow graph and maps it to a program counter variable, thus establishing a precise correspondence between each action in the generated formal model and a specific basic block in the source code. This ensures that the formal model has good decomposability and traceability during the verification process. When formal verification detects a property violation, the defect location in the source code can be quickly and accurately pinpointed, improving the efficiency of debugging and defect analysis.

Attached Figure Description

[0020] Figure 1: A flowchart of the software formal model generation method based on control flow graphs and large language models of the present invention.

[0021] Figure 2: A schematic diagram of the structure of the software formal model generation method based on control flow graphs and large language models of the present invention.

[0022] Figure 3: A flowchart of the algorithm of this invention that compiles Python code into a control flow graph.

Detailed Implementation

[0023] The present invention will be further described in detail below through specific embodiments, but it should not be construed as limiting the scope of the subject matter of the present invention to the following embodiments. All technologies implemented based on the above content of the present invention fall within the scope of the present invention.

[0024] In one embodiment, this invention proposes a method for generating formal software models based on control flow graphs and large language models. As shown in Figure 1, the entire method includes the following steps:
S1: Parse the source code to be analyzed using an abstract syntax tree and construct a control flow graph; the nodes of the control flow graph correspond to the basic blocks of the program, and each node in the control flow graph is assigned a unique node identifier;
S2: Declare all variables in the source program, introduce a program counter variable to represent the position of the current node in the control flow graph, construct a state machine structure model, and generate state machine-style intermediate representation code based on the control flow graph;
S3: Based on the intermediate representation code and the control flow graph, guide the large language model through constraint prompt templates to generate a formal model; subject the formal model to syntax constraint verification and test case-driven closed-loop verification. When verification fails, the counterexample trajectory is fed back to the large language model to iteratively correct the formal model.

[0025] In step S1, the source code to be analyzed is preferably written in Python. However, any high-level programming language with a clear control flow structure can be processed in a similar way. The system performs lexical and syntactic analysis on the input Python source code, constructs an abstract syntax tree, and identifies sequential statements and control flow structures such as if, else, while, for, break, and continue by traversing the abstract syntax tree. Based on this, the system performs structured decomposition of the program. Therefore, the system can accurately identify the locations where the control flow branches, merges, and loops within the program.
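As an illustration of the parsing step in [0025], the following minimal sketch uses Python's standard `ast` module to list the control flow statements in a small program. The sample function `count_positive` and the helper name `find_control_flow` are assumptions introduced here for illustration; the patent does not specify how its traversal is implemented.

```python
import ast

# Hypothetical sample program; not taken from the patent.
SOURCE = '''
def count_positive(xs):
    n = 0
    for x in xs:
        if x <= 0:
            continue
        n += 1
    return n
'''

# AST node types for the control flow structures named in [0025].
# An `else` branch has no node type of its own: it is the `orelse`
# field of an ast.If node.
CONTROL_FLOW_NODES = (ast.If, ast.While, ast.For, ast.Break, ast.Continue)


def find_control_flow(source):
    """Return sorted (line number, node type name) pairs for control flow statements."""
    tree = ast.parse(source)
    found = [
        (node.lineno, type(node).__name__)
        for node in ast.walk(tree)
        if isinstance(node, CONTROL_FLOW_NODES)
    ]
    return sorted(found)


control_flow = find_control_flow(SOURCE)
print(control_flow)
```

The line numbers returned here are the kind of source location that a later stage could attach to basic blocks.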

[0026] As shown in Figure 3, during control flow graph construction, the system uses basic blocks as the smallest unit of analysis, dividing a series of consecutive instructions that do not contain internal jumps and are necessarily executed sequentially into the same basic block. Each basic block corresponds to a node in the graph, and the directed edges between nodes are used to represent the control transfer relationships that may occur during program execution. For conditional statements, the system explicitly marks the two branch paths in the control flow graph: condition true and condition false. For loop structures, back edges are introduced in the graph to characterize the repeated execution semantics of the loop body. For jump statements such as break and continue, jump edges are directly established to the corresponding target basic block. To ensure the determinism and traceability of subsequent state machine modeling, the system assigns a unique numerical identifier to each basic block and records the mapping relationship between this identifier and the original source code location.
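The construction described in [0026] can be sketched for a restricted subset of Python. The sketch below handles only straight-line statements, `if`/`else`, and `while`; it omits `for`, `break`, and `continue`, and it is not the algorithm of Figure 3. The class `CFG`, the functions `build_cfg` and `_build`, and the edge labels are hypothetical names chosen for this example. It relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast


class CFG:
    """Basic blocks keyed by a numeric id, plus labeled directed edges."""

    def __init__(self):
        self.blocks = {}   # block id -> list of (source line number, statement text)
        self.edges = []    # (source block id, target block id, label)
        self.exit = None   # id of the block where control finally ends up

    def new_block(self):
        block_id = len(self.blocks)   # unique numeric identifier
        self.blocks[block_id] = []
        return block_id


def _build(stmts, current, cfg):
    """Add stmts starting in block `current`; return the block where control ends."""
    for stmt in stmts:
        if isinstance(stmt, ast.If):
            cfg.blocks[current].append((stmt.lineno, "if " + ast.unparse(stmt.test)))
            then_id, else_id, join_id = cfg.new_block(), cfg.new_block(), cfg.new_block()
            cfg.edges.append((current, then_id, "true"))
            cfg.edges.append((current, else_id, "false"))
            cfg.edges.append((_build(stmt.body, then_id, cfg), join_id, "next"))
            cfg.edges.append((_build(stmt.orelse, else_id, cfg), join_id, "next"))
            current = join_id
        elif isinstance(stmt, ast.While):
            header, body, after = cfg.new_block(), cfg.new_block(), cfg.new_block()
            cfg.edges.append((current, header, "next"))
            cfg.blocks[header].append((stmt.lineno, "while " + ast.unparse(stmt.test)))
            cfg.edges.append((header, body, "true"))
            cfg.edges.append((header, after, "false"))
            # Back edge: the end of the loop body returns to the loop header.
            cfg.edges.append((_build(stmt.body, body, cfg), header, "back"))
            current = after
        else:
            # Straight-line statement: stays in the current basic block.
            cfg.blocks[current].append((stmt.lineno, ast.unparse(stmt)))
    return current


def build_cfg(source):
    cfg = CFG()
    cfg.exit = _build(ast.parse(source).body, cfg.new_block(), cfg)
    return cfg


SOURCE = """
i = 0
total = 0
while i < 3:
    total = total + i
    i = i + 1
"""

cfg = build_cfg(SOURCE)
for block_id, statements in cfg.blocks.items():
    print(block_id, statements)
print(cfg.edges)
```

Storing the line number next to each statement is one simple way to keep the identifier-to-source mapping that the paragraph calls for.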

[0027] In step S2, the system enters the intermediate representation code generation stage. This intermediate representation adopts an explicit state machine-style program structure to construct the state machine structure model. A program counter variable `pc` is introduced, representing the basic block number where the current program execution is located. The intermediate representation code first declares all variables used in the program in a unified location, including original program variables and the program counter variable `pc`, to eliminate the complexity caused by variable scope and implicit state changes in the original program. Subsequently, the intermediate representation uses a unified `while` loop as the main execution framework. Inside the loop, a series of `if` statements with the value of the program counter variable `pc` as a condition are used to model each basic block in the control flow graph. When `pc` equals a certain basic block number, the system executes the instruction sequence corresponding to that basic block and, based on the jump conditions and target nodes recorded in the control flow graph, explicitly updates the value of the program counter variable `pc` after execution, making it point to the next possible basic block to be executed.
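To make the structure in [0027] concrete, the sketch below shows a small loop next to a hand-written state machine-style version of it. The function names, the node numbering (0 entry, 1 loop header, 2 loop body, 3 exit), and the `trace` list are assumptions added for illustration. The patent describes a series of `if` statements; this sketch uses `elif` so that exactly one basic block runs per loop iteration.

```python
def sum_below_original(n):
    """Original program: nested syntax, implicit control flow."""
    i = 0
    total = 0
    while i < n:
        total = total + i
        i = i + 1
    return total


def sum_below_state_machine(n):
    """State machine-style version driven by an explicit program counter."""
    # All variables, including the program counter, are declared in one place.
    i = 0
    total = 0
    pc = 0          # current basic block id
    EXIT = 3        # hypothetical id of the exit block
    trace = []      # visited block ids, kept only to show the transitions

    while pc != EXIT:
        trace.append(pc)
        if pc == 0:                  # entry block: initialization
            i = 0
            total = 0
            pc = 1
        elif pc == 1:                # loop header: branch on the condition
            pc = 2 if i < n else EXIT
        elif pc == 2:                # loop body, followed by the back edge
            total = total + i
            i = i + 1
            pc = 1
    return total, trace


result, trace = sum_below_state_machine(3)
print(result, trace)
```

Each value appended to `trace` corresponds to one state transition, which is the property the intermediate representation is meant to expose to the formal model.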

[0028] The program control flow, which relies on nested syntax and implicit jumps, is transformed into an explicit state transition process driven by the program counter variable pc. Each change in the value of the program counter variable pc corresponds to a state transition, ensuring that the intermediate representation conforms to the semantic assumptions of state machines in formal languages in terms of execution model. This transformation process does not rely on any heuristic inference or semantic approximation; it is driven by the control flow graph structure and deterministic rules, ensuring strict equivalence between the intermediate representation and the original program at the behavioral level.

[0029] Furthermore, formal languages are limited in their data type support, and their handling of complex data types such as strings often differs significantly from that of high-level programming languages. To address this, a unified dimensionality reduction process is applied to the string type in the original program. Specifically, the system converts strings in Python code into numeric arrays of ASCII character codes, and rewrites string-related operations as indexing, comparison, and update operations on these numeric arrays. This allows string semantics to be accurately expressed within a formal model that only supports numerical and set operations. This approach eliminates semantic differences between the type systems of the two languages, further enhancing the verifiability of the generated model.
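The string reduction in [0029] can be illustrated with a short sketch. The helper names `encode_ascii`, `starts_with`, and `to_upper_in_place` are hypothetical, and the choice to reject non-ASCII input is an assumption made here; the patent does not say how such input is handled.

```python
def encode_ascii(text):
    """Convert a string into a list of ASCII codes (assumption: reject non-ASCII)."""
    codes = [ord(ch) for ch in text]
    if any(code > 127 for code in codes):
        raise ValueError("non-ASCII character in input")
    return codes


def starts_with(codes, prefix_codes):
    """str.startswith rewritten as indexing and comparison on numeric arrays."""
    if len(prefix_codes) > len(codes):
        return False
    i = 0
    while i < len(prefix_codes):
        if codes[i] != prefix_codes[i]:
            return False
        i += 1
    return True


def to_upper_in_place(codes):
    """str.upper rewritten as an update on a numeric array (a-z are codes 97-122)."""
    for i in range(len(codes)):
        if 97 <= codes[i] <= 122:
            codes[i] = codes[i] - 32


word = encode_ascii("hello")
print(word)
print(starts_with(word, encode_ascii("he")))
to_upper_in_place(word)
print(word)
```

After this rewriting, the program manipulates only integers and arrays of integers, which map directly onto functions over integer ranges in a language such as TLA+.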

[0030] In step S3, after the state machine-style intermediate representation code and the corresponding control flow graph structure have been obtained, a large language model is introduced to process this information jointly. The system passes the intermediate representation code, basic block numbering information, variable declarations, and control flow transition relationships to the large language model through a predefined instruction template, and the prompt explicitly constrains the generation target to be a model description that conforms to the syntax and semantic specifications of a specific formal language. The specific prompts are given in the constraint prompt template, which contains the instructions for the large language model: a role description that guides the model's role positioning, the syntax rules and keyword set of the formal language, and the specifications for the logical, set, and temporal operators supported by the formal language. Because the intermediate representation code already expands the program's control logic explicitly and flattens complex nested structures into state transition relationships, the large language model only needs to perform a structured mapping at this stage, which improves the stability and consistency of the generated results.

[0031] The following are the large language model instructions provided by this invention:

Prompt for Generating TLA+ Models

# Role description
As an expert in TLA+, you are good at understanding and writing TLA+. TLA+ is a formal specification language used for modeling and verifying concurrent and distributed systems.

# Domain knowledge
1. The logical operators supported by TLA+ include: /\ (and), \/ (or), ~ (not), => (Implication), <=> (Bidirectional implication), TRUE, FALSE, \A (Universal Quantification), \E (Existential Quantification).
2. The set operators supported by TLA+ include: = (Equality), # (not equal), \union (Union), \intersect (Intersection), \in (Membership), \notin (Not in), \subseteq (Subset Equal), \ (Difference).
3. The temporal operators supported by TLA+ include:
   [] x > 0
   The above code is an example of [] (Always). It means that at all times, the value of variable x is greater than 0.
   <> x = 0
   The above code is an example of <> (Eventually). It means that at some point in time, the value of variable x becomes 0.
4. Built-in keywords and operators in TLA+ include: MODULE, EXTENDS, CONSTANTS, INSTANCE, VARIABLE, ASSUME, PROVE, INIT, NEXT, ACTION, SPECIFICATION, IF, ELSE, WITH, CASE, THEN, LET, IN, CHOOSE, ENABLED, UNCHANGED, DOMAIN.

Based on the information and python code with assertions, give a complete TLA+ model code in only one single code block without explanations. The model should initialize a set of all possible states constrained by max or min CONSTANTS instead of fixed inputs.
1. Use LET keyword if there's any temporary variable.
2. Each step should define all variables, even though keep them unchanged.
3. Since the start index in TLA+ is 1 instead of 0, you may change the corresponding initialization, checks, and assignment.
4. Don't declare parameters with same names as variables or constants.
5. Define arrays like arr \in [1..MaxLen -> 0..MaxValue].
If there are assertions in the code, you should also generate corresponding Assertion == action.

For example: example1 - example2 - Module Name: - module_name - code -

In the generated formal model, each basic block corresponds to a clearly defined action or state transition. This action describes only the update rules of the system variables when the program counter variable `pc` takes a specific value, together with the next-state value of `pc`. Through this one-to-one correspondence, the formal model exhibits good modularity during the verification process. When the formal verification tool detects a violation of a safety, invariance, or liveness property, it can directly locate the corresponding basic block number from the action name or the value of `pc`, and trace it back to the specific location in the original source code, thereby reducing the cost of manual analysis and defect localization.

[0032] As shown in Figure 2, on top of the automatic formal model generation process, a closed-loop verification and self-feedback optimization stage driven by Python test cases is also constructed, automating the path from test semantics to formal properties and then to model correction. Specifically, the system first parses the existing Python test cases, automatically extracts their preconditions, input constraints, execution processes, and expected results as formal constraint expressions, and compiles them into the corresponding TLA+ property descriptions according to established mapping rules. This lifts test cases that were originally written against a specific implementation into formal specifications of system behavior. The TLC model checker is then used to automatically check the TLA+ model generated by the large language model against these properties. When problems such as property violations, unreachable states, or deadlocks are found, the counterexample trajectory and error information output by TLC are automatically parsed, structured, and fed back into the large language model. Guided by the constraint prompt template, the large language model uses the counterexample trajectory to correct and complete the corresponding state transitions, variable updates, or action definitions, producing a new version of the formal model. This process repeats until all formal properties compiled from the test cases pass TLC verification, forming a closed self-feedback chain from test cases to formal properties, from model checking to counterexample feedback, and then to model correction.
This chain ensures the structural and semantic correctness of the generated model and allows the formal model to gradually approximate real program behavior, achieving automatic convergence and continuous improvement of model quality.
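The feedback loop in [0032] can be outlined as follows. This is only a sketch of the control structure: `generate` stands in for a call to a large language model and `check` for a run of the TLC model checker, and both are replaced here by hypothetical stubs (`fake_generate`, `fake_check`) so that the example runs on its own. The `max_rounds` cap is an assumption added to guarantee termination; the patent describes iterating until all properties pass.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    passed: bool
    counterexample: str = ""   # structured counterexample trace, if any


def refine_until_verified(
    base_prompt: str,
    generate: Callable[[str], str],
    check: Callable[[str], CheckResult],
    max_rounds: int = 5,       # assumption: bound the number of iterations
) -> Tuple[Optional[str], int]:
    """Generate a model, verify it, and feed counterexamples back until it passes."""
    prompt = base_prompt
    for round_number in range(1, max_rounds + 1):
        model = generate(prompt)
        result = check(model)
        if result.passed:
            return model, round_number
        # Build the feedback prompt from the failed model and its counterexample.
        prompt = (
            f"{base_prompt}\n\n"
            f"The previous model failed verification.\n"
            f"Previous model:\n{model}\n"
            f"Counterexample trace:\n{result.counterexample}"
        )
    return None, max_rounds


# Hypothetical stand-ins for the large language model and the model checker.
def fake_generate(prompt: str) -> str:
    if "Counterexample trace" in prompt:
        return "Next == x' = x + 1"
    return "Next == x' = x - 1"


def fake_check(model: str) -> CheckResult:
    if "x + 1" in model:
        return CheckResult(passed=True)
    return CheckResult(passed=False, counterexample="x = 0 -> x = -1")


final_model, rounds_used = refine_until_verified(
    "Generate a TLA+ model.", fake_generate, fake_check
)
print(final_model, rounds_used)
```

In a real pipeline the stubs would be replaced by an actual model call and by parsing the checker's output, neither of which is specified in enough detail in the text to reproduce here.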

[0033] Furthermore, the generated formal models undergo automated syntax constraint verification and any necessary corrections to ensure that the final output model strictly conforms to the syntax rules and execution semantics of the target formal language, avoiding the unparseable or unverifiable output that free generation can produce. Through this mechanism, the generated formal models can be passed directly to model checkers or theorem proving tools, without manual intervention, for formal verification tasks such as state space exploration, reachability analysis, invariance verification, and liveness property verification.

[0034] As shown in Table 1, comparing the software formal model generation method based on control flow graphs and large language models provided by this invention with the direct model generation method that does not introduce control flow graphs and intermediate representation code, the formal model generated by this invention has a significant advantage in the similarity of the program runtime state set. Specifically, the similarity of the program runtime state set is improved by an average of 20.34% while maintaining almost the same model compilability. Especially when dealing with complex control structures and multi-branch programs, this invention can effectively avoid model omissions and semantic deviations, demonstrating good engineering practical value.

[0035] Table 1. Comparison of the formal model generation method provided in the embodiments of the present invention with the model generation method that does not use a control flow graph.

[0036] This invention accurately characterizes the program execution structure through control flow graphs and reduces the dimensionality of high-level programming languages to an intermediate representation highly consistent with formal state machine models in terms of execution paradigm through deterministic rules. It then combines this with a large language model to generate formal expressions, achieving automated and high-precision construction of formal software models. While ensuring semantic equivalence, this invention reduces the difficulty of manual modeling, improves model verifiability and defect location efficiency, and is suitable for formal verification scenarios of software systems with complex control flows.

[0037] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Any simple modifications, equivalent substitutions, and improvements made by those skilled in the art to the above embodiments without departing from the scope of the technical solution of the present invention, based on the technical essence of the present invention, shall still fall within the protection scope of the technical solution of the present invention.

Claims

1. A method for generating formal software models based on control flow graphs and large language models, characterized in that it includes the following steps: S1: The source code to be analyzed is parsed using an abstract syntax tree, and a control flow graph is constructed; the nodes of the control flow graph correspond to the basic blocks of the program, and a unique node identifier is assigned to each node in the control flow graph; S2: All variables in the source program are declared, a program counter variable is introduced to represent the position of the current node in the control flow graph, a state machine structure model is constructed, and state machine-style intermediate representation code is generated based on the control flow graph; S3: Based on the intermediate representation code and the control flow graph, the large language model is guided through constraint prompt templates to generate a formal model; the formal model is subjected to syntax constraint verification and test case-driven closed-loop verification. When verification fails, the counterexample trajectory is fed back to the large language model to iteratively correct the formal model.

2. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: In step S1, the construction of the control flow graph specifically includes: dividing the source code into basic blocks based on the control flow statements in the abstract syntax tree; creating nodes and directed edges according to the logical jump relationships between the basic blocks to form a control flow graph.

3. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: In step S1, when parsing the source code, the string data in it is simultaneously subjected to dimensionality reduction processing, converting the string into a numerical array representation based on the character ASCII code.

4. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: The state machine structure model uses a while loop as the main execution structure. It uses conditional statements to correspond to the node identifiers of the nodes in the control flow graph. When the jump condition of the corresponding node is met, the basic block operation corresponding to that node is executed and the program counter variable is updated to the next target node.

5. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: The intermediate representation code generation process is based on deterministic transformation rules and does not introduce heuristic inference or semantic approximation operations.

6. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: The value range of the program counter variable is the set of node identifiers of the nodes in the control flow graph.

7. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: In step S3, each formal action in the formal model uniquely corresponds to a node in the control flow graph, and can be traced back to its specific location in the source code through the program counter variable and node identifier.

8. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: The constraint prompt template includes a role description to guide the role positioning of the large language model, the syntax rules and keyword set of the formal language, and the specifications of logical operators, set operators and temporal operators supported by the formal language.

9. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: In step S3, the test case-driven closed-loop verification specifically includes: compiling the test cases of the source program into formal properties and using them as the specification for the formal verifier; using the formal verifier to verify the generated formal model and obtain counterexample trajectories; constructing feedback prompts based on the counterexample trajectories to guide the large language model to correct the formal model; and repeating the verification until all formal properties pass.

10. The method for generating formal software models based on control flow graphs and large language models according to claim 1, characterized in that: In step S3, the formal model describes program branches, loops, and exception paths through a control flow graph, and combines intermediate representation code to transform the path states in the control flow graph, so as to cover all possible execution paths of the source program.