No-reference code evaluation method and system based on knowledge distillation and dual-track reasoning

By employing knowledge distillation and dual-track reasoning methods, combined with static analysis tools and multiple independent reasoning steps, the problems of low accuracy and high resource consumption in existing code generation and evaluation methods are solved. This achieves efficient and reliable code evaluation, providing dual guarantees for code quality and security.

CN122241706A · Pending · Publication Date: 2026-06-19 · NANTONG NORMAL COLLEGE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANTONG NORMAL COLLEGE
Filing Date
2026-01-29
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing code generation and evaluation methods rely on high-quality reference code or test cases, resulting in low evaluation accuracy and high costs. Furthermore, large models consume enormous computational resources, making them difficult to popularize.

Method used

By employing knowledge distillation and dual-track reasoning methods, a miniaturized model is used to simulate expert-level logical reasoning. Combined with static analysis tools and multiple independent reasoning operations, the functional correctness and security assessment results of the code are generated.

Benefits of technology

It achieves efficient and reliable code evaluation, significantly improves the comprehensiveness and credibility of the evaluation, reduces resource consumption, and provides dual protection for code quality and security.



Abstract

This invention provides a no-reference code evaluation method based on knowledge distillation and dual-track reasoning, belonging to the field of code generation evaluation technology. The method includes: constructing a code evaluation training set containing reasoning paths based on code generation data; transferring the reasoning capabilities of a pre-trained reasoning model to the target evaluation model through knowledge distillation and principal singular vector adaptation; performing a no-reference code security evaluation on the code to be evaluated to obtain security risk judgment results characterizing the potential security risks of the code; generating a functional correctness evaluation result for the code to be evaluated based on the results of multiple independent reasoning steps; and finally obtaining a comprehensive evaluation result for the code. This invention effectively improves the security and reliability of the final evaluation result by constructing a high-quality training set containing reasoning paths, and by using knowledge distillation and principal singular vector adaptation, multiple independent reasoning steps, and majority voting strategies for decision-making, thus integrating functional correctness and code security evaluation.

Description

Technical Field

[0001] This invention relates to the field of code generation and evaluation technology, specifically to a no-reference code evaluation method and system based on knowledge distillation and dual-track reasoning.

Background Technology

[0002] With the widespread application of Large Language Models (LLMs) in code generation, accurately and efficiently evaluating the functional correctness of generated code has become a key challenge. Traditional evaluation methods mainly rely on two paradigms: reference code-based evaluation and test case-based evaluation, but both paradigms have inherent limitations.

[0003] Reference code-based evaluation methods (such as BLEU and CodeBLEU) judge the similarity between generated code and pre-defined reference code. This method relies heavily on obtaining high-quality reference code and struggles to handle semantically equivalent but differently implemented correct code. It often incorrectly penalizes implementations that are functionally correct but whose coding style or logical path is inconsistent with the reference, leading to reduced evaluation accuracy.

[0004] Test case-based evaluation methods (such as Pass@k) determine correctness by running code in an execution environment and verifying whether it passes predefined test cases. While this approach directly verifies functionality, its effectiveness is highly dependent on the coverage and quality of test cases. Designing comprehensive and unambiguous test cases relies on expert experience, is costly, and is difficult to scale. Furthermore, executing the generated code introduces potential security risks.

[0005] In recent years, the LLM-as-a-Judge (Large Language Model as a Judge) approach has offered a promising alternative, directly evaluating the functional consistency between the problem description and the generated code without requiring reference code or test cases. However, existing LLM-as-a-Judge methods face two major challenges: first, methods relying on general-purpose large language models (such as the GPT series) require complex prompt engineering, and their decision-making process is a "black box" that lacks interpretability; second, while large models specializing in reasoning (such as DeepSeek-R1) provide better interpretability, their massive parameter scale leads to huge computational resource consumption and extremely high deployment costs, making them difficult to adopt widely in practical applications.

Summary of the Invention

[0006] The purpose of this invention is to provide a no-reference code evaluation method and system based on knowledge distillation and dual-track reasoning that evaluates code correctness fully automatically. It does not rely on manually provided reference code or test cases, but uses a miniaturized model to simulate expert-level logical reasoning to judge the functional correctness of the generated code.

[0007] To achieve the above objectives, this invention provides a no-reference code evaluation method based on knowledge distillation and dual-track reasoning, comprising: collecting code generation data; constructing a code evaluation training set containing reasoning paths based on the code generation data; transferring the reasoning capabilities of a pre-trained reasoning model to a target evaluation model based on the code evaluation training set through knowledge distillation and principal singular vector adaptation; performing a no-reference code security evaluation on the code to be evaluated, without the need for test cases or reference code, to obtain a security risk judgment result characterizing the potential security risks of the code to be evaluated; performing multiple independent reasoning operations on the code to be evaluated using the target evaluation model, and generating a functional correctness evaluation result for the code to be evaluated from the results of the multiple independent reasoning operations through a majority voting strategy; and integrating the security risk judgment result and the functional correctness evaluation result to obtain a comprehensive evaluation result for the code to be evaluated.

[0008] Optionally, constructing a code evaluation training set containing inference paths based on the code generation data includes: collecting problem descriptions and corresponding code solutions from multiple code generation benchmark datasets; and using a pre-trained inference model to generate a functional correctness judgment result for each code solution and an inference path corresponding to the functional correctness judgment result.

[0009] Optionally, the step of constructing a code evaluation training set containing inference paths based on the code generation data further includes: using label-based accuracy verification and discriminator-based logical consistency filtering to filter the functional correctness judgment results and inference paths generated by the pre-trained inference model; including: removing samples whose functional correctness judgment results are inconsistent with functional correctness labels determined based on test case execution; and using a discriminator model to analyze and remove inference paths that contain logical contradictions or factual errors.

[0010] Optionally, the step of employing label-based accuracy verification and discriminator model-based logical consistency filtering includes: downsampling the samples retained after the accuracy verification and logical consistency filtering to ensure that the number of functionally correct samples in the code evaluation training set is equal to the number of functionally incorrect samples.

[0011] Optionally, the knowledge distillation includes: using a pre-trained inference model as a teacher model and a target evaluation model as a student model to construct a knowledge distillation framework; wherein, the knowledge distillation uses the inference paths in the code evaluation training set as intermediate supervision information, and transfers the logical reasoning ability of the teacher model to the student model by minimizing the difference between the inference paths output by the student model and the teacher model.

[0012] Optionally, the principal singular vector adaptation includes: performing singular value decomposition on the pre-trained weight matrix in the target evaluation model to obtain the left singular vector matrix, singular value matrix, and right singular vector matrix of the pre-trained weight matrix; arranging the singular values in the singular value matrix in descending order of numerical value, and retaining the top preset number of singular values as principal singular values; extracting the left and right singular vectors corresponding to the principal singular values; and initializing the low-rank adaptation matrix using the extracted principal singular values, left singular vectors, and right singular vectors, so that the weight update direction of the target evaluation model in the early stage of training approaches the principal component direction of the pre-trained weight matrix.

[0013] Optionally, the initialization process of the low-rank adaptation matrix using the extracted principal singular values, the left singular vector, and the right singular vector includes: multiplying the left singular vector matrix by the square root of the singular value matrix to obtain a first initialization matrix; multiplying the square root of the singular value matrix by the transpose of the right singular vector matrix to obtain a second initialization matrix; and initializing the low-rank adaptation matrix as the product of the first initialization matrix and the second initialization matrix.

[0014] Optionally, the step of performing a no-reference code security evaluation on the code to be evaluated includes: scanning the code to be evaluated using a static application security testing tool to identify and output potential code defects and security vulnerability patterns; extracting code segments identified as having potential security risks based on the results of the static analysis scan; inputting the extracted suspicious code segments and corresponding vulnerability type descriptions as prompts into a pre-trained inference model for chain-like reasoning, whereby the pre-trained inference model determines whether the suspicious code segments constitute real security vulnerabilities and generates a reason for the determination; and generating the security risk judgment result based on the result of the inference verification by the pre-trained inference model; wherein, if the pre-trained inference model confirms the existence of a real security vulnerability, a risk report containing the vulnerability type and risk level is generated; and if all suspicious code segments are verified as false alarms, a judgment result of low security risk is generated.

[0015] Optionally, the step of using the target evaluation model to perform multiple independent inferences on the code to be evaluated includes: when the target evaluation model performs inference generation operations on the code to be evaluated, configuring the temperature parameter of the target evaluation model to a preset value greater than zero; based on the temperature parameter configuration, performing multiple independent forward inference operations on the same code to be evaluated; wherein each forward inference operation generates a preliminary functional correctness judgment result and an inference path that serves as the basis for that preliminary judgment result.

[0016] In another aspect, the present invention provides a no-reference code evaluation system based on knowledge distillation and dual-track reasoning, for implementing the no-reference code evaluation method described above. The system includes a control module, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the no-reference code evaluation method based on knowledge distillation and dual-track reasoning.

[0017] The above technical solution integrates functional correctness and code security assessment into a unified framework. Through the collaborative verification of static analysis tools and inference models, it can accurately identify potential security vulnerabilities that are difficult to find in traditional functional testing, significantly improving the comprehensiveness and reliability of the assessment. Distillation training based on high-quality inference paths, together with principal singular vector adaptation, gives small target models inference capabilities comparable to those of large models while greatly reducing resource consumption. By combining multiple independent inferences with a majority voting strategy, the final comprehensive evaluation result provides efficient and reliable dual protection for code quality and security.

[0018] Other features and advantages of the present invention will be described in detail in the following detailed description section.

Description of the Drawings

[0019] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used together with the following detailed description to explain the invention, but do not constitute a limitation thereof. In the drawings: Figure 1 is a flowchart of a no-reference code evaluation method based on knowledge distillation and dual-track reasoning.

[0020] Figure 2 is an ablation study plot of the CODE-DITING 1.5B model on the CodeJudge-17K and CodeJudge-ALL datasets.

[0021] Figure 3 is an ablation study plot of the CODE-DITING 7B model on the CodeJudge-17K and CodeJudge-ALL datasets.

[0022] Figure 4 is a performance comparison chart of the PiSSA and LoRA initialization methods on the CODE-DITING 1.5B model.

[0023] Figure 5 is a performance comparison chart of the PiSSA and LoRA initialization methods on the CODE-DITING 7B model.

[0024] Figure 6 is a graph showing the F1 score of the CODE-DITING 1.5B model as a function of the k value under the majority voting strategy.

[0025] Figure 7 is a graph showing the F1 score of the CODE-DITING 7B model as a function of the k value under the majority voting strategy.

Detailed Implementation

[0026] The specific implementations of the embodiments of the present invention are described in detail below with reference to Figures 1 to 7. It should be understood that the specific implementations described herein are only for illustrating and explaining the embodiments of the present invention, and are not intended to limit them.

[0027] It should be noted that the acquisition, transmission, storage, use, and processing of data in the technical solution of this application all comply with the relevant provisions of national laws and regulations. In the embodiments of this application, certain existing industry solutions such as software, components, and models may be mentioned. These should be considered exemplary, intended only to illustrate the feasibility of implementing the technical solution of this application, and do not imply that the applicant has already used or necessarily used such solutions.

[0028] In the process of realizing this invention, the inventors of this application found that the prior art struggles to overcome its dependence on high-quality reference code or test cases, and that using large models brings problems such as huge consumption of computing resources and poor interpretability of the evaluation process.

[0029] Example 1

Referring to Figure 1, the first embodiment of the present invention provides a no-reference code evaluation method based on knowledge distillation and dual-track reasoning, including:

S100: Collect code generation data and construct a code evaluation training set containing inference paths based on the code generation data.

[0030] In the embodiments of this application, CODE-DITING (named after the mythical creature "Di Ting" from the classic Chinese novel "Journey to the West", implying that the method can accurately judge the correctness of the function implemented by the code, just like Di Ting can distinguish between truth and falsehood) collects problem descriptions and corresponding code solutions from multiple code generation benchmark datasets; and uses a pre-trained inference model to generate the function correctness judgment result of each code solution and the inference path corresponding to the function correctness judgment result.

[0031] In a preferred embodiment of this application, the selection of the code generation benchmark dataset follows three core principles: 1. Diversity: Covers various programming scenarios such as algorithm problems, system programming, and library usage; 2. Difficulty: In addition to basic grammar tasks, it should also include complex logic challenges and multi-step reasoning problems; 3. Quality: Includes high-coverage test cases to ensure reliable functional correctness assessment.

[0032] Based on the above principles, three large-scale code generation benchmark datasets were selected as seed data: KodCode, OpenCoder, and CodeHarmony. These benchmarks are widely used for training large-scale code language models (CodeLLMs). A data generation and annotation process was designed to construct a balanced and representative training set. Furthermore, to control the distribution of correct and incorrect examples in the code generation benchmark datasets, multiple candidate solutions were generated for each programming task using Qwen2.5-Coder (1.5B / 7B) (a dedicated code generation model in the Tongyi Qianwen 2.5 series, with parameter scales of 1.5B and 7B respectively). To ensure quality, a multi-step verification process was adopted, including: evaluating the functional correctness of each solution through test cases and calculating Pass@1 (first-round generation pass rate) as a label; applying static analysis tools to identify and filter solutions containing syntax errors; and removing code comments to focus the evaluation on the core implementation logic.
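The syntax filtering and comment removal steps described above could be approximated as follows. This is a minimal sketch under stated assumptions, not the applicant's tooling: it assumes the candidate solutions are Python source, it uses the standard-library `ast` and `tokenize` modules as a stand-in for the unspecified static analysis tool, and the helper names `has_valid_syntax` and `strip_comments` are hypothetical.

```python
import ast
import io
import tokenize


def has_valid_syntax(source: str) -> bool:
    """Return True if the source parses as Python (stand-in for a static analyzer)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def strip_comments(source: str) -> str:
    """Drop '#' comment tokens; code and string literals are left untouched."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    # untokenize pads the gaps left by removed comments with spaces.
    return tokenize.untokenize(kept)
```

Working on tokens rather than on raw text avoids deleting a `#` that appears inside a string literal. Docstrings are string literals, so this sketch keeps them.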

[0033] Furthermore, to transfer the logical reasoning capabilities of large-scale reasoning models to the target dataset and enhance sample interpretability, a distillation process was designed. For each triple <Natural Language Description (nl), Code, Label>, the correctness of the code functionality was independently judged using DeepSeek-R1-671B in Vanilla settings, and the output included the predicted label and inference path. This process generates raw distilled data in the format <nl, code, label, reasoning>.

[0034] In the embodiments of this application, accuracy verification based on label comparison and logical consistency filtering based on discriminator model are used to screen the functional correctness judgment results and inference paths generated by the pre-trained inference model; including: removing samples whose functional correctness judgment results are inconsistent with the functional correctness labels determined based on test case execution; and using discriminator model to analyze and remove inference paths that have logical contradictions or factual errors.

[0035] In the embodiments of this application, the samples retained after accuracy verification and logical consistency filtering are downsampled so that the number of functionally correct samples in the code evaluation training set is equal to the number of functionally incorrect samples.

[0036] In a preferred embodiment of this application, the logical consistency filtering employs a multi-stage filtering mechanism, including: Accuracy filtering: samples whose prediction results from DeepSeek-R1-671B (the 671-billion-parameter DeepSeek R1 model) are inconsistent with the test case labels are removed to ensure consistency; Logical coherence filtering: DeepSeek-V3 is used as the discriminator to detect and eliminate reasoning paths containing hallucinations or logical inconsistencies; Class balancing: the filtered data is downsampled so that the ratio of positive to negative samples reaches 1:1, which resolves the imbalance caused by the excess of correct samples in the original dataset.
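The three filtering stages can be sketched as a single function. This is an illustrative sketch only: the `Sample` fields, the function name `filter_and_balance`, and the `is_coherent` callback (a hypothetical wrapper around whatever discriminator model is used) are assumptions introduced here and do not come from the source.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    nl: str          # natural-language problem description
    code: str        # candidate solution
    label: bool      # label from test-case execution (True = functionally correct)
    predicted: bool  # verdict produced by the teacher model
    reasoning: str   # inference path produced by the teacher model


def filter_and_balance(
    samples: List[Sample],
    is_coherent: Callable[[str], bool],  # hypothetical discriminator wrapper
    seed: int = 0,
) -> List[Sample]:
    # Stage 1: accuracy filtering, keep verdicts that match the test-case label.
    kept = [s for s in samples if s.predicted == s.label]
    # Stage 2: logical coherence filtering via the discriminator.
    kept = [s for s in kept if is_coherent(s.reasoning)]
    # Stage 3: class balancing, downsample the larger class to a 1:1 ratio.
    pos = [s for s in kept if s.label]
    neg = [s for s in kept if not s.label]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced
```

A fixed seed keeps the downsampling reproducible, so the same raw distilled data always yields the same training set.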

[0037] The final result is a high-quality code evaluation training set, CodeJudge-17K (code judging dataset - 17,000 entries), containing 17,000 samples. This dataset covers a variety of programming tasks, from basic algorithm challenges to complex system implementations, with a balanced distribution of correct and incorrect code samples. Each sample is accompanied by a detailed reasoning path, explaining the judgment process, providing valuable data for training interpretable code judging models.

[0038] S200: Based on the code evaluation training set, transfer the reasoning ability of the pre-trained inference model to the target evaluation model through knowledge distillation and principal singular vector adaptation.

[0039] In a preferred embodiment of this application, the DS-R1-distil (1.5B / 7B) model is used as the base model and fine-tuned on CodeJudge-17K. This allows a small model with roughly 1% or less of the parameters of the 671B expert model to learn from that model.

[0040] In the embodiments of this application, knowledge distillation includes: using a pre-trained inference model as a teacher model and a target evaluation model as a student model to construct a knowledge distillation framework; wherein, knowledge distillation uses the inference paths in the code evaluation training set as intermediate supervision information, and transfers the logical reasoning ability of the teacher model to the student model by minimizing the difference between the inference paths output by the student model and the teacher model.
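One common way to "minimize the difference" between the student's output and the teacher's inference path is token-level cross-entropy on the teacher's tokens. The text does not name the loss, so the sketch below is an assumption rather than the claimed method; the function name `sequence_cross_entropy` and the plain-list representation of probabilities are also hypothetical.

```python
import math
from typing import Sequence


def sequence_cross_entropy(
    student_probs: Sequence[Sequence[float]],  # one vocabulary distribution per position
    teacher_tokens: Sequence[int],             # token ids of the teacher's inference path
) -> float:
    """Average negative log-likelihood of the teacher's tokens under the student."""
    if len(student_probs) != len(teacher_tokens):
        raise ValueError("one distribution is required per teacher token")
    nll = 0.0
    for dist, tok in zip(student_probs, teacher_tokens):
        # Penalize the student for assigning low probability to the teacher's token.
        nll -= math.log(dist[tok])
    return nll / len(teacher_tokens)
```

The loss is zero when the student reproduces the teacher's path with certainty and grows as the student's probability on the teacher's tokens falls.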

[0041] In embodiments of this application, principal singular vector adaptation includes: performing singular value decomposition on the pre-trained weight matrix in the target evaluation model to obtain the left singular vector matrix, singular value matrix, and right singular vector matrix of the pre-trained weight matrix; arranging the singular values in the singular value matrix in descending order of numerical value, and retaining the top preset number of singular values as principal singular values; extracting the left and right singular vectors corresponding to the principal singular values; and initializing the low-rank adaptation matrix using the extracted principal singular values, left singular vectors, and right singular vectors to make the weight update direction of the target evaluation model in the early stage of training approach the principal component direction of the pre-trained weight matrix.

[0042] In the embodiments of this application, the initialization process of the low-rank adaptation matrix includes: multiplying the left singular vector matrix by the square root of the singular value matrix to obtain a first initialization matrix; multiplying the square root of the singular value matrix by the transpose of the right singular vector matrix to obtain a second initialization matrix; and initializing the low-rank adaptation matrix as the product of the first initialization matrix and the second initialization matrix.
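The initialization described in the two paragraphs above can be sketched with NumPy. This is a minimal sketch assuming NumPy is available and that the weight matrix is a dense 2-D array; the function name `pissa_init` and the random test matrix are hypothetical and are not taken from the source.

```python
import numpy as np


def pissa_init(w0: np.ndarray, r: int):
    """Return the (first, second) initialization matrices from the top-r SVD of w0."""
    # np.linalg.svd returns the singular values already sorted in descending order.
    u, s, vt = np.linalg.svd(w0, full_matrices=False)
    u_r, s_r, vt_r = u[:, :r], s[:r], vt[:r, :]
    sqrt_s = np.sqrt(s_r)
    first = u_r * sqrt_s              # U_r @ diag(sqrt(S_r)), shape (d, r)
    second = sqrt_s[:, None] * vt_r   # diag(sqrt(S_r)) @ V_r^T, shape (r, k)
    return first, second


rng = np.random.default_rng(0)
w0 = rng.standard_normal((6, 4))
first, second = pissa_init(w0, r=2)
# The product of the two factors is the rank-2 truncation of w0.
low_rank = first @ second
```

Splitting the singular values as square roots between the two factors keeps their scales balanced while their product still equals the rank-r truncated SVD of the weight matrix.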

[0043] In a preferred embodiment of this application, to optimize model training and maintain performance, a low-rank adaptation (LoRA) parameter fine-tuning technique is employed. This technique freezes the pre-trained weights $W_0 \in \mathbb{R}^{d \times k}$ and introduces trainable low-rank matrices $B \in \mathbb{R}^{d \times r}$ (the up-projection matrix of LoRA, i.e., the first initialization matrix) and $A \in \mathbb{R}^{r \times k}$ (the down-projection matrix of LoRA, i.e., the second initialization matrix), with $r \ll \min(d, k)$, where d represents the input dimension, k represents the output dimension, and r represents the low-rank dimension. The complete weight matrix after fine-tuning is as follows:

$$W = W_0 + \Delta W = W_0 + BA$$

[0044] where $W$ represents the fine-tuned complete weight matrix, $W_0$ represents the original pre-trained weight matrix, and $B$ and $A$ both represent newly introduced trainable low-rank matrices.

[0045] The above method reduces the trainable parameters from dk to r(d+k) while maintaining performance with minimal overhead.
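A small worked example makes the reduction from dk to r(d+k) concrete. The dimensions used below (d = k = 4096, r = 16) are illustrative assumptions chosen for this sketch and are not values stated in the source.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for the two low-rank factors (d x r and r x k)."""
    return r * (d + k)


# Hypothetical layer size and rank, for illustration only.
d, k, r = 4096, 4096, 16
full = full_params(d, k)     # 16,777,216
lora = lora_params(d, k, r)  # 131,072
ratio = lora / full          # 0.0078125, i.e. under 1% of the full matrix
```

Because the count grows linearly in r, doubling the rank doubles the trainable parameters while the frozen matrix stays unchanged.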

[0046] Furthermore, to improve training efficiency and model performance, the LoRA matrices are initialized using Principal Singular Vector Adaptation (PiSSA). Unlike the Kaiming-uniform initialization used in LoRA, PiSSA uses truncated singular value decomposition (SVD) to exploit the inherent low-rank structure of the pre-trained weight matrix $W_0$, namely $W_0 \approx U_r S_r V_r^{\top}$. The LoRA matrices are initialized as follows:

[0047] $$B = U_r S_r^{1/2}, \qquad A = S_r^{1/2} V_r^{\top}$$

[0048] where $U_r$ denotes the truncated left singular vector matrix, $V_r^{\top}$ denotes the transpose of the truncated right singular vector matrix, and $S_r$ denotes the truncated diagonal matrix of singular values.

[0049] This ensures that the low-rank increment matrix $\Delta W = BA$ is initially aligned with the principal subspace of $W_0$, allowing updates to focus on key directions that preserve functionality. Compared with Kaiming-uniform initialization, PiSSA provides a structured starting point, improving convergence speed and final performance, especially in low-rank scenarios.

[0050] The above scheme employs a triple optimization approach: knowledge distillation, PiSSA initialization, and LoRA fine-tuning. Distillation transfers the inference capabilities of the 671B-parameter model to smaller 1.5B / 7B models, cutting the parameter count by roughly 99% or more. PiSSA initialization aligns the LoRA matrices with the pre-trained principal components, accelerating convergence and improving performance in low-rank scenarios. LoRA compresses the trainable parameters from dk to r(d+k), significantly reducing computational overhead while maintaining accuracy, enabling the smaller model to achieve evaluation capabilities surpassing larger models like GPT-4o.

[0051] S300: Perform a no-reference code security assessment on the code to be evaluated, without the need for test cases or reference code, to obtain a security risk judgment result that characterizes the potential security risks of the code to be evaluated.

[0052] In the embodiments of this application, a static application security testing tool is used to scan the code to be evaluated, identify and output potential code defects and security vulnerability patterns; based on the results of the static analysis scan, code segments identified as having potential security risks are extracted; the extracted suspicious code segments and their corresponding vulnerability type descriptions are used as prompts and input into a pre-trained inference model for chain-like reasoning, and the pre-trained inference model determines whether the suspicious code segments constitute real security risks and generates a reason for the judgment; based on the results of the inference verification by the pre-trained inference model, a security risk judgment result is generated; wherein, if the pre-trained inference model confirms the existence of real security risks, a risk report containing vulnerability type and risk level is generated; if all suspicious code segments are verified as false alarms, a judgment result of low security risk is generated.

[0053] In a preferred embodiment of this application, the code to be evaluated is input into a preset static code analysis tool, and the parameters of the static code analysis tool are configured to detect predefined potential vulnerability patterns, so as to identify the code locations in the code to be evaluated that may have potential security risks, and associate the identified code locations with the corresponding potential vulnerability pattern types, thereby generating initial potential vulnerability information.

[0054] Furthermore, based on the initial potential vulnerability information, the marked suspicious code segments are extracted. The extraction covers the function, method, or class-level structure to which each suspicious code segment belongs, to preserve the integrity of the code logic and its context. At the same time, the corresponding potential vulnerability pattern description is attached to each suspicious code segment, thereby constructing a structured evaluation unit.
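For Python code, pulling out the enclosing structure of a flagged line and pairing it with its pattern description could look like the sketch below. It is a sketch under assumptions: the code under evaluation is taken to be Python, the scanner is assumed to report a 1-based line number, selecting the innermost enclosing definition is a design choice made here, and the names `enclosing_unit` and `build_evaluation_unit` are hypothetical.

```python
import ast
from typing import Optional


def enclosing_unit(source: str, line: int) -> Optional[str]:
    """Return the source of the innermost function or class that contains `line`."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.lineno <= line <= node.end_lineno:
                span = node.end_lineno - node.lineno
                # Prefer the shortest enclosing definition, i.e. the innermost one.
                if best is None or span < best.end_lineno - best.lineno:
                    best = node
    return ast.get_source_segment(source, best) if best is not None else None


def build_evaluation_unit(source: str, line: int, pattern: str) -> dict:
    """Pair the extracted context with the vulnerability pattern description."""
    return {"line": line, "pattern": pattern, "code": enclosing_unit(source, line)}


src = "import os\n\ndef run(cmd):\n    os.system(cmd)\n\ndef safe():\n    return 1\n"
unit = build_evaluation_unit(src, 4, "OS command injection")
```

Returning the whole definition rather than the single flagged line gives the reasoning model the parameters and surrounding statements it needs to trace where the data comes from.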

[0055] Subsequently, the structured evaluation unit is used as input and imported into the preset pre-trained inference model or target evaluation model. The model is guided by preset inference instructions to perform multi-step logical analysis on the suspicious code segment. The multi-step logical analysis includes at least the analysis of the data flow source, whether the data can be controlled by external input, and the security impact that the corresponding potential vulnerability pattern may cause, in order to determine whether the suspicious code segment constitutes a real security vulnerability and generate inference process information corresponding to the judgment result.

[0056] Based on this, the judgment results corresponding to each suspicious code segment are summarized. When at least one suspicious code segment in the summarized results is determined to be a real security vulnerability, a security risk report is generated. The security risk report records the code location, potential vulnerability pattern type, and risk level information corresponding to the real vulnerability in a structured form, and associates the reasoning process information as the basis for interpretation. When all suspicious code segments are determined not to constitute a real security vulnerability, a security risk judgment result indicating that the overall security risk of the code to be evaluated is low is generated. This security risk judgment result is a structured and interpretable assessment conclusion.
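The aggregation rule in the paragraph above can be sketched as a small function over per-segment verdicts. The `Verdict` fields, the status strings, and the dictionary layout are hypothetical choices made for illustration; the source specifies only what information the report records, not its format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    location: str    # code location, e.g. "app.py:42" (hypothetical format)
    pattern: str     # potential vulnerability pattern type
    is_real: bool    # the model's decision after multi-step reasoning
    risk_level: str  # e.g. "high", "medium", "low"
    reasoning: str   # inference process information backing the decision


def summarize(verdicts: List[Verdict]) -> dict:
    """Produce a risk report if any finding is confirmed, else a low-risk result."""
    confirmed = [v for v in verdicts if v.is_real]
    if not confirmed:
        # Every flagged segment was judged a false alarm (or nothing was flagged).
        return {"status": "low_risk", "findings": []}
    return {
        "status": "risk_report",
        "findings": [
            {
                "location": v.location,
                "pattern": v.pattern,
                "risk_level": v.risk_level,
                "reasoning": v.reasoning,
            }
            for v in confirmed
        ],
    }
```

Only confirmed findings reach the report, so false alarms from the static scan do not inflate the final risk assessment.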

[0057] The aforementioned solution, by combining static application security testing tools with chain-like reasoning using pre-trained inference models, automatically scans for potential vulnerabilities in the code to be evaluated without the need for test cases or reference code. It accurately verifies the vulnerabilities and generates a structured security report containing risk level, vulnerability type, and inference path. This process significantly improves the comprehensiveness and reliability of the assessment, not only compensating for the lack of security dimensions in traditional functional testing but also reducing the cost of manual intervention through a fully automated and interpretable verification mechanism. Furthermore, it provides transparent and actionable decision-making support for code risk management, effectively enhancing the security resilience of the code generation process.

[0058] S400: Use the target evaluation model to perform multiple independent inferences on the code to be evaluated, and generate a functional correctness evaluation result for the code to be evaluated through a majority voting strategy based on the results of these inferences.

[0059] In the embodiments of this application, when the target evaluation model performs inference generation operations on the code to be evaluated, the temperature parameter of the target evaluation model is configured to a preset value greater than zero; based on the temperature parameter configuration, multiple independent forward inference operations are performed on the same code to be evaluated; each forward inference operation generates a preliminary functional correctness judgment result, as well as a reasoning path that serves as the basis for generating the preliminary functional correctness judgment result.

[0060] In the embodiments of this application, the preliminary functional correctness judgment results generated by all forward inference operations are statistically analyzed; using a majority voting strategy, the judgment result that appears most frequently in the statistics is taken as the functional correctness evaluation conclusion of the code to be evaluated.
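The sampling and voting steps described above can be sketched as follows. This is a hedged illustration: `judge_once` is a hypothetical stand-in for one forward inference of the target evaluation model at a temperature greater than zero, and the verdict strings are assumptions rather than values taken from the application.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(verdicts: List[str]) -> str:
    """Return the verdict that occurs most often among the preliminary results."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return Counter(verdicts).most_common(1)[0][0]


def evaluate(code: str, judge_once: Callable[[str], str], t: int = 7) -> Tuple[str, List[str]]:
    """Run t independent inferences on the same code and vote on the outcome.

    judge_once is a hypothetical callable wrapping a single sampled inference;
    it must return a preliminary verdict such as "correct" or "incorrect".
    """
    verdicts = [judge_once(code) for _ in range(t)]
    return majority_vote(verdicts), verdicts
```

With a binary verdict and an odd number of inferences, a tie cannot occur, which is one practical reason to choose an odd count.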

[0061] In a preferred embodiment of this application, considering the possibility of inconsistent inference paths in the inference model when the temperature is set to 0.6, a majority voting strategy is adopted to determine the final inference result, further improving the model's inference performance. This strategy belongs to the parallel inference method, where the model performs multiple independent inferences on the same input and selects the result with the highest frequency as the final judgment.

[0062] In a preferred embodiment of this application, from a probabilistic perspective, let the probability of a correct judgment in a single inference be $p$. When $T$ independent inferences are performed ($T$ odd), the number of correct judgments can be modeled using a binomial distribution. Specifically, the majority vote is correct if at least $(T+1)/2$ inferences are correct, so the probability of a correct final judgment is:

$$P(\text{correct}) = P\left(X \ge \frac{T+1}{2}\right) = \sum_{k=(T+1)/2}^{T} \binom{T}{k}\, p^{k} (1-p)^{T-k}$$

[0063] where $p$ represents the probability of a single correct inference, $T$ represents the total number of independent inferences, $X$ represents the number of correct judgments, and $k$ is the summation index. To determine the threshold for a majority vote: if $T=7$, then $(7+1)/2=4$; that is, at least 4 correct inferences are required to form a majority (4 > 3).

[0064] When $p > 0.5$, according to the law of large numbers, the success probability of the majority voting strategy continues to improve as $T$ increases. This explains why majority voting can effectively improve model performance: as long as the accuracy of a single inference exceeds that of random guessing (i.e., $p > 0.5$), multiple votes can significantly reduce the probability of misjudgment.

[0065] For each test sample, perform T=7 independent inferences, and determine the final judgment by majority vote. Setting T to 7 is the optimal trade-off between model performance and inference latency.
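The binomial model of majority voting can be checked numerically with a short sketch. Here `p` denotes the probability that a single inference is correct and `t` the odd number of independent inferences; the function name and the example value p = 0.7 are illustrative assumptions, not figures from the application.

```python
from math import comb


def majority_vote_accuracy(p: float, t: int) -> float:
    """Probability that a majority of t independent inferences is correct.

    Assumes each inference is correct independently with probability p
    and that t is odd, so the majority threshold is (t + 1) // 2.
    """
    if t < 1 or t % 2 == 0:
        raise ValueError("t must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    threshold = (t + 1) // 2
    # Sum the binomial probabilities of observing at least `threshold` correct votes.
    return sum(comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(threshold, t + 1))
```

For the illustrative value p = 0.7, the function gives about 0.784 for t = 3 and about 0.874 for t = 7, while p = 0.5 stays at exactly 0.5 for any odd t. This matches the qualitative argument: voting helps only when a single inference is better than random guessing.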

[0066] The above scheme achieves intelligent decision optimization through a majority voting strategy. Based on the binomial distribution principle, when the accuracy of a single inference is >0.5, majority voting can significantly reduce the probability of misjudgment. By introducing moderate randomness through the temperature parameter (0.6), the F1 score of the 1.5B / 7B model is improved by about 10% with an increase in delay of 1-2 seconds. This strategy effectively solves the inconsistency problem of the inference model and has anti-preference leakage characteristics, making the evaluation results more stable and reliable.

[0067] S500: Integrate the security risk assessment result and the functional correctness assessment result to obtain a comprehensive assessment result for the code to be assessed.

[0068] In the embodiments of this application, a security risk assessment result and a functional correctness assessment result are received. The functional correctness assessment result includes at least the functional correctness assessment conclusion and the corresponding code identification information. Based on the code identification information, the functional correctness assessment result and the security risk assessment result are matched at the result level to establish the association relationship between the functional assessment result and the security assessment result corresponding to the same code to be assessed.

[0069] In a preferred embodiment of this application, while keeping the functional correctness assessment conclusion unchanged, a security association identifier is added to the functional correctness assessment result based on the security risk judgment result. The security association identifier is used to characterize the credibility of the functional correctness assessment result under the corresponding security risk conditions or usage precautions.

[0070] Furthermore, based on preset result organization rules, the associated evaluation results are structured and organized. The result organization rules include: generating a security association identifier corresponding to the functional correctness evaluation result based on the risk existence and risk level information indicated in the security risk judgment result. The security association identifier is used to characterize the credibility of the functional correctness evaluation result under the corresponding security risk conditions or usage precautions. The functional correctness evaluation result, security risk judgment result, and security association identifier are written into predefined result fields to form a comprehensive evaluation record for the same code to be evaluated.

[0071] It should be noted that the comprehensive assessment record includes at least a functional assessment field, a security assessment field, and a security association description field. Each field is stored independently and is used to integrate functional judgment information and security risk information in subsequent output or display. This integrated presentation includes, but is not limited to, parallel, tagged, or weighted summary methods.
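A comprehensive assessment record with independently stored fields might look like the sketch below. The class `ComprehensiveRecord`, the function `build_record`, the risk-level ranking, and the wording of the association note are all hypothetical choices made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical ordering of risk levels, used only to pick the most severe one.
_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class ComprehensiveRecord:
    code_id: str                # code identification information used for matching
    functional_assessment: str  # functional correctness conclusion, kept unchanged
    security_assessment: str    # security risk judgment result
    security_association: str   # note qualifying how the functional verdict should be used


def build_record(code_id: str, functional_verdict: str, finding_levels: List[str]) -> ComprehensiveRecord:
    """Combine both assessments for one code sample without altering either."""
    if finding_levels:
        worst = max(finding_levels, key=_RANK.__getitem__)
        security = f"risk present (highest level: {worst})"
        note = "functional verdict holds; review confirmed vulnerabilities before use"
    else:
        security = "low risk"
        note = "no confirmed vulnerability"
    return ComprehensiveRecord(code_id, functional_verdict, security, note)
```

Calling `asdict(record)` yields a plain dictionary for output or display. The functional verdict is passed through untouched, so the security information annotates the conclusion rather than overriding it.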

[0072] The above solution, through a pre-defined integration strategy, integrates independent security risk assessment results with functional correctness assessment results to generate a unified, structured comprehensive assessment report. This provides dual assurance of functional correctness and security for code quality, significantly improving the credibility and practicality of the assessment results.

[0073] Example 2. Referring to Figures 1-7, this is the second embodiment of the present invention. To verify the beneficial effects of the present invention, scientific demonstration is carried out through experiments.

[0074] In the embodiments of this application, to evaluate the effectiveness and advantages of CODE-DITING, the following three issues are mainly studied: Question 1: How does the performance of CODE-DITING compare to state-of-the-art methods?

[0075] In a preferred embodiment of this application, the performance of CODE-DITING is compared with that of various other models to evaluate its effectiveness. To ensure fairness, each model uses its best-performing prompt and the same evaluation metrics, and the results are shown in Table 1.

[0076] Table 1. Performance Comparison of Models and Prompting Methods Across Datasets

[0077] Where Base Model is the basic model, Prompt is the prompting method, and Avg represents the average value; HumanEval-Judge, MBPP-Judge, and BigCodeBench-Judge represent code evaluation datasets based on the HumanEval-plus enhanced benchmark, code evaluation datasets based on the MBPP-plus enhanced benchmark, and code evaluation datasets for real-world software development scenarios, respectively.

[0078] The data in the table shows that: 1. The CODE-DITING 1.5B and 7B models significantly outperform other models in their respective parameter ranges, showing substantial improvements in accuracy, F1 score, and MCC. In particular, CODE-DITING 1.5B significantly surpasses Llama3 1B, Qwen2.5 1.5B, and even the basic DS-r1-distill 1.5B model; similarly, CODE-DITING 7B demonstrates a clear advantage over Llama3 8B, Qwen2.5 7B, and the basic DS-r1-distill 7B models.

[0079] 2. The parameter efficiency of this method is particularly outstanding. CODE-DITING 1.5B uses only about 20% of the parameters of DS-r1-distill 7B, yet achieves comparable performance, demonstrating the effectiveness of the knowledge distillation method in transferring reasoning capabilities to small models.

[0080] CODE-DITING 7B outperforms the closed-source models GPT-4o and DeepSeek-V3 (671B) on all three datasets, lagging only behind DeepSeek-R1 671B. This result is particularly significant considering that CODE-DITING 7B uses only about 1% of the parameters of these large models.

[0081] From Question 1, we can conclude that CODE-DITING outperforms state-of-the-art methods in code evaluation. The 1.5B version surpasses all compared models of similar parameter count and performs comparably to models with about five times as many parameters; the 7B version uses only about 1% of the number of parameters, yet it surpasses GPT-4o and DeepSeek-V3 (671B).

[0082] Question 2: What are the effects of each component in CODE-DITING?

[0083] In a preferred embodiment of this application, a series of ablation experiments are conducted to evaluate the effectiveness of each component of CODE-DITING, focusing on three key aspects: data filtering, parameter initialization, and inference strategy.

[0084] Figures 2 and 3 show the impact of data filtering on model performance. By comparing F1 scores for different datasets at k=1 (single inference), it was found that the data filtering strategy consistently and significantly improves model performance. This empirical evidence strongly supports the hypothesis that "high-quality inference paths are crucial for models to develop accurate code evaluation capabilities."

[0085] Specifically, the smaller 1.5B model showed a larger relative improvement from filtering than the 7B model. This difference suggests that smaller models, which have limited representational capacity and too little parameter space to learn effectively from noisy or ambiguous examples, benefit more from high-quality training data.

[0086] Figures 4 and 5 show the impact of PiSSA initialization on model performance. F1 scores for different initialization methods are again compared at k=1 to isolate the effect of this component. In standard LoRA implementations, matrix A is typically initialized using Kaiming-uniform initialization, while matrix B is initialized to zero; PiSSA, however, derives matrices A and B through singular value decomposition (SVD), fundamentally aligning the initialization with the model's intrinsic parameter structure.
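The SVD-based initialization can be written out as a short worked form, following the construction recited in claims 6 and 7. The symbols are assumptions introduced here for illustration: $W$ is a pre-trained weight matrix, $r$ is the number of retained principal singular values, $M_1$ and $M_2$ are the first and second initialization matrices, and the subscript $r$ denotes truncation to the $r$ largest singular values and their singular vectors.

$$W = U \Sigma V^{\top}, \qquad M_1 = U_r\, \Sigma_r^{1/2}, \qquad M_2 = \Sigma_r^{1/2}\, V_r^{\top}, \qquad M_1 M_2 = U_r \Sigma_r V_r^{\top}$$

The product $M_1 M_2$ is the best rank-$r$ approximation of $W$, so the trainable low-rank factors start from the principal components of the pre-trained weights instead of from noise and zeros. In the published PiSSA formulation, the residual $W - M_1 M_2$ is kept frozen during fine-tuning; whether this application does the same is not stated in this section.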

[0087] Experimental results show that PiSSA delivers significant performance improvements on the HumanEval-Judge and MBPP-Judge datasets compared to the standard LoRA initialization technique. However, the performance improvement is less pronounced on the more challenging BigCodeBench-Judge dataset, suggesting that the advantage of initialization may vary with task complexity and dataset characteristics.

[0088] These findings suggest that PiSSA initialization helps the model converge to a better solution space, especially in low-rank adaptation scenarios with a limited number of trainable parameters.

[0089] Figures 6 and 7 show a detailed analysis of the impact of the inference strategy on model performance. The optimal configuration was determined by comparing F1 scores under different k values (number of inferences). The results show a clear pattern: model performance continuously improves as k increases, but the improvement gradually decreases at higher k values.

[0090] In a preferred embodiment of this application, the performance-efficiency trade-off was analyzed to determine the most suitable configuration for practical applications. Experiments were conducted using vLLM (a high-throughput inference engine for large language models) as the inference server on a single NVIDIA RTX 4090 GPU. The baseline latency (k=1) for the CODE-DITING 1.5B and 7B models was 0.15 seconds and 0.30 seconds, respectively. As expected, the time cost increased linearly with k, reaching approximately 1 second (1.5B) and 2 seconds (7B) at k=7.

[0091] By analyzing the performance improvement and computational cost under different k values, k=7 was determined to be the optimal value. This configuration maintains a reasonable inference latency while ensuring a significant improvement in accuracy, making it very suitable for practical applications with strict requirements for both prediction quality and response time.

[0092] The conclusion drawn from Question 2 is that the ablation experiments demonstrate that each component of CODE-DITING significantly contributes to the overall performance. By combining data filtering, PiSSA initialization, and the optimal inference strategy, CODE-DITING achieves state-of-the-art performance while maintaining computational efficiency.

[0093] Question 3: Does CODE-DITING involve preference leakage?

[0094] In a preferred embodiment of this application, preference leakage is a contamination problem in the LLM-as-Judge framework, referring to biased evaluation caused by the correlation between the synthetic data generator and the LLM-based evaluator.

[0095] During training, code generated by models from the same family as the base model (DeepSeek and QwenCoder) was used. This raises a legitimate concern: does CODE-DITING exhibit a bias towards code generated by models similar to those used in the training data?

[0096] To explore this potential problem, the Agreement Rate and Cohen's Kappa coefficient were used as evaluation metrics. Specifically, the Agreement Rate measures the consistency of judgments across different evaluation scenarios:

$$\text{Agreement Rate} = \frac{N_{\text{agree}}}{N}$$

where $N_{\text{agree}}$ is the number of samples that receive the same judgment in both scenarios and $N$ is the total number of samples.

[0097] Cohen's Kappa coefficient quantifies the agreement between evaluators while accounting for the agreement expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

[0098] where $p_o$ represents the observed agreement rate and $p_e$ represents the agreement rate expected by chance.

[0099] The chance agreement is calculated from the marginal distribution of each evaluator's judgments:

$$p_e = \sum_{i} p_{1,i}\, p_{2,i}$$

[0100] where $p_{1,i}$ and $p_{2,i}$ represent the proportions of samples that the first and second evaluators classify into class $i$, respectively. This adjustment for chance agreement makes Cohen's Kappa coefficient more robust than the simple agreement rate, especially in cases of imbalanced class distributions.
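Both consistency metrics can be computed with a few lines of code. The sketch below uses the standard definitions: the agreement rate is the fraction of samples on which two judgment lists agree, and Cohen's Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement obtained by summing, over classes, the product of the two evaluators' class proportions. The function names and the convention for the degenerate case are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of samples on which the two judgment lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    p_o = agreement_rate(a, b)
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the marginal class proportions of each evaluator.
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in set(a) | set(b))
    if p_e == 1.0:
        # Both evaluators used a single identical class; treated as full
        # agreement here (a convention assumed for this sketch).
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For example, the judgment lists `[1, 1, 0, 0]` and `[1, 1, 0, 1]` agree on 3 of 4 samples (agreement rate 0.75), while the chance agreement is 0.5, which gives a Kappa of 0.5. The gap between the two numbers shows how much of the raw agreement could have arisen by chance.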

[0101] In a preferred embodiment of this application, experiments are conducted from different perspectives to evaluate the consistency of CODE-DITING.

[0102] Consistency across different code generators: This experiment evaluates whether CODE-DITING maintains consistency in its judgments when evaluating code generated by different models for the same programming task. Fifty problems were selected from each dataset, and code solutions were generated using two models (GPT-4o and Claude-3.5) that were not involved in generating the training data. CODE-DITING's judgments were then examined to determine whether they were independent of the code source.

[0103] Table 2 Consistency Analysis of Different Code Generation Models

[0104] As shown in Table 2, CODE-DITING exhibits high consistency across different code generators, with an agreement rate of at least 93% across all datasets. The extremely high Cohen's Kappa coefficients (0.86–0.96) indicate that the evaluation results are almost perfectly consistent even after chance agreement is accounted for. This consistency is particularly evident on the HumanEval-Judge dataset, with a 98% agreement rate for code generated by GPT-4o and a 97% agreement rate for code generated by Claude-3.5. Even on the more challenging BigCodeBench-Judge dataset, which involves complex library interactions, CODE-DITING achieves agreement rates of 94% and 93%, respectively. These results strongly suggest that CODE-DITING's evaluation mechanism focuses on the intrinsic quality and correctness of the code, rather than superficial patterns specific to a particular code generator.

[0105] Referring to Table 3, this experiment examines whether CODE-DITING maintains consistency in its judgments when the same code faces semantically equivalent but differently worded problem descriptions. Similarly, 50 code samples were selected from each dataset, and GPT-4o and Claude-3.5 were used to generate paraphrased versions of the original problem descriptions (keeping the semantics unchanged). Then, the consistency of CODE-DITING's judgments across these different problem descriptions was evaluated.

[0106] Table 3. Consistency Analysis of Different Problem Descriptions

[0107] From Question 3, we can conclude that CODE-DITING does not have a significant preference leakage problem. High consistency is maintained when evaluating code generated by different generators and when evaluating code for semantically equivalent problem descriptions.

[0108] In the above embodiments, the code evaluation process of CODE-DITING mainly focuses on verifying the correctness of code functionality and the consistency of the evaluation. It should be understood that in other embodiments of the present invention, the code evaluation process can be further combined with a no-reference code security assessment of the code to be evaluated to obtain the corresponding security risk judgment result. The specific process can be performed with reference to the relevant description of no-reference code security assessment in Embodiment 1.

[0109] Furthermore, the security risk assessment results and functional correctness assessment results can be integrated and output according to a preset merging strategy to generate a comprehensive assessment result for the code to be assessed. The preset merging strategy can be a parallel output strategy, a tagging display strategy, or a configurable weighted aggregation strategy. The parallel output strategy means outputting functional correctness conclusions and security risk conclusions separately for manual or subsequent system decision-making reference. The tagging display strategy means labeling the assessment conclusions according to risk level and correctness category and displaying them simultaneously in the report. The configurable weighted aggregation strategy means visually aggregating the two types of indicators according to user- or system-preset weights for rapid screening, without using them as a prerequisite for functional assessment. It should be understood that the above merging and presentation are only used to provide a more comprehensive assessment perspective; security assessment and functional assessment remain independent judgment dimensions, and the security risk assessment results should not be interpreted as a necessary derivation or decisive basis for the functional correctness assessment results.
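The three merging strategies can be sketched as alternative views over the same two independent results. Everything specific here is a hypothetical choice for illustration: the function `present`, the strategy names, the verdict strings, the mapping from risk level to a numeric score, and the default weights.

```python
from typing import Dict

# Hypothetical mapping from risk level to a screening score (higher is safer).
RISK_SCORE = {"low": 1.0, "medium": 0.5, "high": 0.0}


def present(functional: str, risk_level: str, strategy: str = "parallel",
            w_func: float = 0.5, w_sec: float = 0.5) -> Dict:
    """Render both assessment results under one of three presentation strategies."""
    if risk_level not in RISK_SCORE:
        raise ValueError(f"unknown risk level: {risk_level}")
    if strategy == "parallel":
        # Output both conclusions side by side for downstream decisions.
        return {"functional": functional, "security": risk_level}
    if strategy == "tagged":
        # Label the conclusions by correctness category and risk level.
        return {"tags": [f"functional:{functional}", f"risk:{risk_level}"]}
    if strategy == "weighted":
        # The score is for rapid screening only; both verdicts are returned unchanged.
        func_score = 1.0 if functional == "correct" else 0.0
        score = w_func * func_score + w_sec * RISK_SCORE[risk_level]
        return {"functional": functional, "security": risk_level,
                "screening_score": score}
    raise ValueError(f"unknown strategy: {strategy}")
```

In every branch the functional verdict and the risk level are emitted as received, which reflects the requirement that the two dimensions remain independent judgments and that the security result is never a deciding input to the functional one.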

[0110] In summary, the CODE-DITING solution achieves a breakthrough in code evaluation through data distillation, knowledge transfer, and majority voting strategies. Its 1.5B / 7B small models, with PiSSA initialization and LoRA efficient fine-tuning, outperform large models such as GPT-4o, requiring only 1% of the parameters. The solution demonstrates superior accuracy (significantly improved F1 score), strong robustness (resistant to preference leakage, consistency rate >93%), and high interpretability in three major benchmark tests. It replaces traditional test dependencies with interpretable inference paths, providing a scalable and low-cost practical technical path for code functional verification.

[0111] This invention also provides a no-reference code evaluation system based on knowledge distillation and dual-track reasoning, for implementing the no-reference code evaluation method based on knowledge distillation and dual-track reasoning. The system includes a control module, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the no-reference code evaluation method based on knowledge distillation and dual-track reasoning.

[0112] This invention provides a storage medium storing a program that, when executed by a processor, implements a no-reference code evaluation method based on knowledge distillation and dual-track reasoning.

[0113] This invention provides a processor for running a program, wherein the program executes a no-reference code evaluation method based on knowledge distillation and dual-track reasoning during runtime.

[0114] This invention provides a device including a processor, a memory, and a program stored in the memory and executable on the processor. When the processor executes the program, it implements the no-reference code evaluation method based on knowledge distillation and dual-track reasoning. The device described herein can be a server, a PC, a tablet (PAD), a mobile phone, etc.

[0115] This application also provides a computer program product that, when executed on a data processing device, is suitable for performing the no-reference code evaluation method based on knowledge distillation and dual-track reasoning.

[0116] Those skilled in the art will understand that embodiments of this application can provide methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0117] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0118] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0119] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0120] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0121] Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and / or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

[0122] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0123] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0124] The above are merely embodiments of this application and are not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A no-reference code evaluation method based on knowledge distillation and dual-track reasoning, characterized in that the method includes: collecting code generation data, and constructing a code evaluation training set containing inference paths based on the code generation data; based on the code evaluation training set, transferring the reasoning ability of a pre-trained reasoning model to a target evaluation model through knowledge distillation and principal singular vector adaptation; without requiring test cases or reference code, performing a no-reference code security assessment on the code to be evaluated to obtain a security risk assessment result that characterizes the potential security risks of the code to be evaluated; using the target evaluation model to perform multiple independent inferences on the code to be evaluated, and generating a functional correctness evaluation result for the code to be evaluated through a majority voting strategy based on the results of the multiple independent inferences; and integrating the security risk assessment result and the functional correctness evaluation result to obtain a comprehensive evaluation result for the code to be evaluated.

2. The no-reference code evaluation method based on knowledge distillation and dual-track reasoning according to claim 1, characterized in that the step of constructing a code evaluation training set containing inference paths based on the code generation data includes: collecting problem descriptions and corresponding code solutions from multiple code generation benchmark datasets; and using the pre-trained inference model to generate, for each code solution, a functional correctness judgment result and a corresponding inference path.

3. The no-reference code evaluation method based on knowledge distillation and dual-track reasoning according to claim 2, characterized in that the step of constructing a code evaluation training set containing inference paths based on the code generation data further includes: filtering the functional correctness judgment results and the inference paths generated by the pre-trained inference model by means of accuracy verification based on label comparison and logical consistency filtering based on a discriminator model, including: removing samples whose functional correctness judgment results are inconsistent with the functional correctness labels determined based on test case execution; and using the discriminator model to analyze and remove the reasoning paths that contain logical contradictions or factual errors.

4. The no-reference code evaluation method based on knowledge distillation and dual-track reasoning according to claim 3, characterized in that the accuracy verification based on label comparison and the logical consistency filtering based on the discriminator model further include: downsampling the samples retained after the accuracy verification and the logical consistency filtering, so that the number of functionally correct samples in the code evaluation training set is equal to the number of functionally incorrect samples.

5. The no-reference code evaluation method based on knowledge distillation and dual-track reasoning according to claim 1, characterized in that the knowledge distillation includes: constructing a knowledge distillation framework in which the pre-trained inference model serves as the teacher model and the target evaluation model serves as the student model; wherein the knowledge distillation uses the reasoning paths in the code evaluation training set as intermediate supervision information, and transfers the logical reasoning ability of the teacher model to the student model by minimizing the difference between the reasoning paths output by the student model and those of the teacher model.

6. The no-reference code evaluation method based on knowledge distillation and dual-track reasoning according to claim 1, characterized in that the principal singular vector adaptation includes: performing singular value decomposition on the pre-trained weight matrix in the target evaluation model to obtain the left singular vector matrix, the singular value matrix, and the right singular vector matrix of the pre-trained weight matrix; arranging the singular values in the singular value matrix in descending order and retaining a number of the largest singular values as the principal singular values; extracting the left and right singular vectors corresponding to the principal singular values; and initializing the low-rank adaptation matrix using the extracted principal singular values, left singular vectors, and right singular vectors, so that the weight update direction of the target evaluation model in the early stage of training approaches the principal component direction of the pre-trained weight matrix.

7. The no-reference code evaluation method based on knowledge distillation and dual-track reasoning according to claim 6, characterized in that initializing the low-rank adaptation matrix using the extracted principal singular values, left singular vectors, and right singular vectors includes: multiplying the left singular vector matrix by the square root of the singular value matrix to obtain the first initialization matrix; multiplying the square root of the singular value matrix by the transpose of the right singular vector matrix to obtain the second initialization matrix; and initializing the low-rank adaptation matrix as the product of the first initialization matrix and the second initialization matrix.

8. The referenceless code evaluation method based on knowledge distillation and dual-track reasoning according to claim 1, characterized in that performing the no-reference code security assessment on the code to be evaluated includes: scanning the code to be evaluated with a static application security testing tool to identify and output potential code defects and security vulnerability patterns; extracting, based on the static analysis scan results, the code segments identified as having potential security risks; inputting the extracted suspicious code segments and their corresponding vulnerability type descriptions as prompts into the pre-trained inference model for chain-like reasoning, whereby the pre-trained inference model determines whether each suspicious code segment constitutes a real security risk and generates a reason for the determination; and generating the security risk judgment result based on the verification results of the pre-trained inference model; wherein, if the pre-trained inference model confirms that a real security vulnerability exists, a risk report containing the vulnerability type and risk level is generated; and if all suspicious code segments are verified as false positives, a judgment result of low security risk is generated.
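
The control flow of claim 8 can be sketched as below. Everything named here is hypothetical: `Finding` stands in for one entry of a static analysis report, `verify` stands in for a call to the pre-trained inference model, and the prompt wording is invented for illustration. The sketch also assumes that the risk level comes from the scanner, which the claim does not specify, and that an empty scan result yields a low-risk verdict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Finding:
    """One suspicious segment reported by a static analysis scan (hypothetical shape)."""
    snippet: str
    vulnerability_type: str
    risk_level: str  # assumption: supplied by the scanner


def build_prompt(finding: Finding) -> str:
    """Combine the suspicious segment and its vulnerability description into a prompt."""
    return (
        f"Vulnerability type: {finding.vulnerability_type}\n"
        f"Code:\n{finding.snippet}\n"
        "Does this segment constitute a real security risk? Reason step by step."
    )


def assess_security(
    findings: List[Finding],
    verify: Callable[[str], Tuple[bool, str]],
) -> Dict[str, object]:
    """Keep only findings the model confirms; otherwise report low risk.

    `verify` takes a prompt and returns (is_real_risk, reason).
    """
    confirmed = []
    for finding in findings:
        is_real, reason = verify(build_prompt(finding))
        if is_real:
            confirmed.append(
                {
                    "vulnerability_type": finding.vulnerability_type,
                    "risk_level": finding.risk_level,
                    "reason": reason,
                }
            )
    if confirmed:
        return {"verdict": "risk_confirmed", "report": confirmed}
    # All suspicious segments were judged false positives (or none were reported).
    return {"verdict": "low_risk", "report": []}
```

Passing the model call in as a plain callable keeps the sketch independent of any particular scanner or model API and makes the false-positive filtering step easy to exercise with a stub.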

9. The referenceless code evaluation method based on knowledge distillation and dual-track reasoning according to claim 1, characterized in that performing multiple independent inferences on the code to be evaluated using the target evaluation model includes: when the target evaluation model performs an inference generation operation on the code to be evaluated, configuring the temperature parameter of the target evaluation model to a preset value greater than zero; and, under this temperature configuration, performing multiple independent forward inference operations on the same code to be evaluated, each forward inference operation generating a preliminary functional correctness judgment result together with the reasoning path on which that result is based.
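
A minimal sketch of the repeated sampling in claim 9, combined with the majority voting mentioned in the abstract, might look like this. The `evaluate` callable is a hypothetical stand-in for one forward inference of the target evaluation model, and the default values for `num_runs` and `temperature` are illustrative, not values taken from the patent.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (preliminary functional correctness verdict, reasoning path)
Judgment = Tuple[str, str]


def run_independent_inferences(
    evaluate: Callable[[str, float], Judgment],
    code: str,
    num_runs: int = 5,
    temperature: float = 0.7,
) -> List[Judgment]:
    """Run `num_runs` independent inferences on the same code.

    The temperature must be greater than zero so that sampling can
    produce different reasoning paths across runs.
    """
    if temperature <= 0:
        raise ValueError("temperature must be greater than zero")
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")
    return [evaluate(code, temperature) for _ in range(num_runs)]


def majority_vote(judgments: List[Judgment]) -> str:
    """Return the most frequent preliminary verdict.

    Tie handling is not specified in the source; here the verdict that
    appeared first among the tied ones wins.
    """
    counts = Counter(verdict for verdict, _ in judgments)
    return counts.most_common(1)[0][0]
```

Choosing an odd number of runs avoids ties when the verdict is binary. The reasoning paths are kept alongside each verdict so that the final decision can still be traced back to its supporting reasoning.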

10. A referenceless code evaluation system based on knowledge distillation and dual-track reasoning, characterized in that the system includes a control module comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the referenceless code evaluation method based on knowledge distillation and dual-track reasoning according to any one of claims 1-9.