A code summarization method and device supporting a low-resource programming language

By combining cross-language translation and retrieval enhancement with the pure inference capabilities of large language models, the invention addresses code summary generation for low-resource programming languages, producing high-quality summaries at low cost for software engineering, scientific computing, and cross-language development collaboration scenarios.

CN121807372B · Active · Publication Date: 2026-06-16 · HANGZHOU DIANZI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HANGZHOU DIANZI UNIV
Filing Date: 2026-03-09
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing technologies struggle to generate high-quality code summaries for low-resource programming languages, especially when no parallel corpora are available. Existing models cannot make effective use of the knowledge stored in large language models, so the generated summaries have low accuracy and are costly to compute.

Method used

Low-resource code is translated into high-resource code through cross-language translation; retrieval enhancement and core statement filtering are then combined, and the code summary is generated by pure inference with a large language model. The process includes multi-temperature sampling, abstract syntax tree parsing, compiler-feedback repair, back-translation optimization, and core statement extraction, and achieves high-quality summary generation with zero fine-tuning.

Benefits of technology

Without relying on parallel corpora for low-resource languages, the method significantly improves the accuracy of code summaries, reduces computational resource consumption, and achieves low-cost, efficient automated code summary generation.

Generated by Eureka AI based on patent content.

Abstract

The application discloses a code summary generation method and device supporting low-resource programming languages. A large language model translates source code written in a low-resource programming language into candidate code in a target high-resource programming language, and, from the candidates that pass verification, the one semantically closest to the input source code is selected as the best translation code. Multiple codes similar to the best translation code, together with their summaries, are retrieved from a preset knowledge base as reference examples. The best translation code and the reference example code are split into code statements, and the core statements are selected. The code, core statements, and summaries of the reference examples, together with the best translation code and its core statements, are input to the large language model as the prompt context, and a code summary of the low-resource source code is output. Through cross-language translation and multi-source information fusion, the lack of low-resource language corpora is addressed without low-resource language training data and with zero fine-tuning, and high-quality code summary generation is achieved.

Description

Technical Field

[0001] This invention belongs to the field of computer technology and relates to the intersection of natural language processing and software engineering. Specifically, it relates to a method and apparatus for generating code summaries that supports low-resource programming languages.

Background Technology

[0002] With the rapid development of computer software engineering technology, code summarization, as a technology that can automatically convert complex source code into natural language descriptions, plays a crucial role in software maintenance, code understanding, and automated documentation generation. In actual development and maintenance processes, developers often face a large amount of legacy code lacking comments or heterogeneous projects written in different programming languages. High-quality code summarization can significantly reduce the time cost of manually reading code and improve software iteration efficiency.

[0003] Currently, research on and applications of code summarization mainly focus on mainstream high-resource programming languages such as Python and Java. These languages have large open-source communities and massive parallel corpora; for example, the CodeSearchNet dataset provides millions of code-summary pairs, enabling deep learning-based models to be adequately trained. However, in scientific computing, statistical analysis, and specific domain applications, there are many low-resource programming languages, such as R, Julia, Lua, OCaml, and Racket, that lack large-scale corpus support. Because samples of these languages are extremely scarce or even completely absent in public datasets, existing mainstream models often fail to capture sufficient semantic features when processing them, resulting in extremely low summarization accuracy that does not meet the needs of practical industrial applications.

[0004] Some studies have attempted to address this issue using context-based transfer learning techniques, but their experiments are often limited to "medium-resource" languages such as Ruby, which actually have a considerable amount of open-source corpora. For languages such as R and Julia, which truly lack parallel corpora, traditional fine-tuning-based transfer learning methods fail because there is no training set.

[0005] Furthermore, most existing methods that are not based on large language models rely on cumbersome fine-tuning. Even transfer learning typically requires a certain amount of labeled data in the target language to adjust model parameters. This not only increases computational resource consumption but also makes it difficult to deploy the model directly in "zero-shot" scenarios without any training data. While large language models acquire strong general knowledge of high-resource languages from massive amounts of data during pre-training, they struggle to learn the grammatical structures and semantic representations of low-resource languages because such languages make up only a small proportion of the samples they encounter. Although the model has learned general code summarization capabilities from large-scale high-resource language data, it is highly susceptible to semantic misunderstanding when faced with unfamiliar low-resource language inputs. When performing zero-shot inference directly, it cannot accurately identify the core functional intent of the code, so the generated summaries deviate from the true meaning of the code or contain hallucinated facts.

[0006] In summary, existing technologies lack a solution that can fully explore and utilize the inherent knowledge of large language models, stimulate their powerful in-context learning and inference capabilities, and achieve high-quality code summary generation for low-resource programming languages.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, this invention proposes a method and apparatus for generating code summaries for low-resource programming languages. Translating low-resource code into high-resource code achieves semantic alignment between the two. Combined with retrieval enhancement and core statement filtering, high-quality code summaries are generated by pure inference with a large language model under zero-fine-tuning conditions. This solves the technical problems of scarce low-resource language corpora, high fine-tuning costs of existing models, and the lack of cross-language semantic mapping.

[0008] A method for generating code summaries that supports low-resource programming languages includes the following steps:

[0009] Step 1: Cross-language translation

[0010] The source code of a low-resource programming language is used as input to a large language model, and a high-resource programming language is selected as the target translation language. Multiple candidate codes are generated through multi-temperature sampling. The candidate codes are then validated using abstract syntax tree parsing. From the validated candidate codes, the one that is semantically closest to the input source code is selected as the best translation code.

[0011] As a preferred approach, when all candidate code fails the abstract syntax tree parsing verification, multiple rounds of code iteration and repair are performed using compiler error messages.

[0012] As a preferred approach, each verified candidate code is translated back into the same low-resource programming language as the source code. The BLEU (Bilingual Evaluation Understudy) score and the Sentence-BERT (Sentence Bidirectional Encoder Representations from Transformers) score between the back-translated code and the source code are calculated and combined by weighting into a total score. The candidate code whose back-translation has the highest total score is selected as the best translation code.

[0013] Preferably, the low-resource programming language is Julia, Lua, OCaml, R, or Racket. The high-resource programming language is Python.

[0014] Step 2: Retrieval enhancement

[0015] Multiple codes similar to the best translation code, together with their summaries, are retrieved from a pre-built knowledge base as reference examples.

[0016] Step 3: Extracting core statements

[0017] The best translation code output in step 1 and the code of the reference examples selected in step 2 are decomposed into abstract syntax tree nodes, and each tree node is mapped to a statement type. The classified statements are input into the core statement discrimination model, which runs inference on each statement to determine whether it is a core statement or a redundant statement. The core statement discrimination model is a pre-trained binary classification neural network with an Encoder-Classifier architecture. During the training phase, this model uses a greedy selection strategy based on the ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) metric to generate ground-truth labels.

[0018] Step 4: Fusion Generation

[0019] The code, core statements, and summaries of the reference examples, together with the best translation code and its core statements, are used as the prompt context and input to the large language model, which outputs a code summary of the input source code.

[0020] A code summary generation apparatus supporting low-resource programming languages, the apparatus comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the functions of the following modules:

[0021] The cross-language translation module is configured to take source code in a low-resource programming language as input, generate candidate code through multi-temperature sampling, and output the best high-resource programming language code after filtering by combining abstract syntax tree verification, error correction and back-translation optimization mechanism.

[0022] The retrieval enhancement module is configured to retrieve similar code and its summary from a pre-built knowledge base based on the high-resource programming language code, as a reference example.

[0023] The core statement extraction module is configured to perform abstract syntax tree parsing and statement segmentation on the high-resource programming language code to filter out key code statements.

[0024] The fusion generation module is configured to receive the best translation code, the key code statements, and the reference examples, construct a prompt context, input it to a large language model, and output the final code summary.

[0025] The present invention has the following beneficial effects:

[0026] 1. By utilizing cross-language translation together with multi-temperature sampling and a compiler-feedback repair mechanism, low-resource code can be translated into high-resource language code, which large language models handle well, without relying on parallel corpora of low-resource languages. This overcomes the difficulty of training high-quality models for low-resource languages caused by data scarcity.

[0027] 2. It combines core statement extraction and retrieval enhancement technologies. By extracting and filtering the most informative key statements, and combining them with reference examples retrieved by BM25, it effectively enhances the semantic richness of the input context, thereby significantly improving the accuracy of code summary generation under zero fine-tuning conditions.

[0028] 3. It fully utilizes the inherent inference capabilities of large language models and generates code summaries through pure inference with preset decoding parameters. This eliminates the need for expensive parameter fine-tuning of large models, reduces computational resource consumption, and achieves low-cost, efficient automated code summarization.

Attached Figure Description

[0029] Figure 1 is a flowchart of a code summary generation method that supports low-resource programming languages.

[0030] Figure 2 is a flowchart of the cross-language translation process in the embodiment.

[0031] Figure 3 is a flowchart of the retrieval enhancement process in the embodiment.

[0032] Figure 4 is a flowchart of the core statement extraction process in the embodiment.

[0033] Figure 5 is a flowchart of the training process for the core statement discrimination model in the embodiment.

[0034] Figure 6 is a flowchart of the fusion generation process in the embodiment.

[0035] Figure 7 is a system block diagram of a code summary generation device that supports low-resource programming languages.

Detailed Implementation

[0036] The present invention is further explained below with reference to the accompanying drawings.

[0037] As shown in Figure 1, a code summary generation method that supports low-resource programming languages comprises the following steps:

[0038] Step 1: Cross-language translation

[0039] Candidate translations are generated through multi-temperature sampling, and the best translation code is selected by combining syntax verification, iterative repair, and back-translation optimization, as shown in Figure 2:

[0040] S11: First, the source code of a low-resource programming language is received as input to the large language model. In order to leverage the rich pre-trained knowledge of the large language model on the Python language, the target language for translation is set to Python.

[0041] S12: To avoid the semantic bias that a single translation may introduce, a preset set of sampling temperatures T={0,0.7,0.9,1.1} is used. At temperature T=0 (greedy mode), one candidate code is generated to obtain the most robust translation; at each of the other three temperatures (0.7, 0.9, and 1.1), three candidate codes are generated to explore different translation possibilities. Multi-temperature sampling thus yields 1 + 3×3 = 10 initial candidate codes, which form the candidate pool.
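The sampling schedule of S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `translate` is a hypothetical callable wrapping whatever large language model client is used, and the prompt wording is invented for the example.

```python
from typing import Callable, List

# Schedule from S12: one greedy sample at T=0, three samples at each other temperature.
SAMPLING_PLAN = {0.0: 1, 0.7: 3, 0.9: 3, 1.1: 3}


def build_candidate_pool(
    source_code: str,
    source_lang: str,
    translate: Callable[[str, float], str],
) -> List[str]:
    """Return the initial candidate pool of Python translations.

    `translate(prompt, temperature)` is a hypothetical LLM wrapper that
    returns one translated program per call.
    """
    # Prompt wording is an assumption; the source does not give the template.
    prompt = f"Translate the following {source_lang} code to Python.\n\n{source_code}"
    pool = []
    for temperature, count in SAMPLING_PLAN.items():
        for _ in range(count):
            pool.append(translate(prompt, temperature))
    return pool
```

Keeping the schedule in a single mapping makes the pool size (here 10) a direct consequence of the configuration rather than a separate constant.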

[0042] S13: Invoke the built-in Python compiler to perform abstract syntax tree parsing and verification on the 10 initial candidate codes in the candidate pool. Count the number of initial candidate codes that pass parsing. If at least one initial candidate code can be successfully parsed, skip the repair phase and proceed directly to S15; if all 10 initial candidate codes fail to be parsed, the current translation is deemed unusable, and the iterative repair mechanism in S14 is initiated.
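The parsing check in S13 maps naturally onto Python's standard `ast` module. The sketch below is one possible reading, assuming that "passes parsing" means `ast.parse` raises no `SyntaxError`; the function names are illustrative.

```python
import ast
from typing import List, Tuple


def check_candidate(code: str) -> Tuple[bool, str]:
    """Return (True, "") if `code` parses as Python, else (False, error message)."""
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as exc:
        return False, f"{exc.msg} (line {exc.lineno})"


def partition_pool(pool: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split the pool into parsable candidates and (candidate, error) failures."""
    passed, failed = [], []
    for code in pool:
        ok, error = check_candidate(code)
        if ok:
            passed.append(code)
        else:
            failed.append((code, error))
    return passed, failed
```

If `passed` is non-empty the flow continues to S15; if it is empty, the `(candidate, error)` pairs in `failed` are what the repair stage in S14 needs.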

[0043] S14: Each initial candidate code that failed to parse is combined with the error message generated by the compiler to construct a repair prompt, which is sent to the large language model for self-correction. To prevent infinite loops, the maximum number of repair rounds is set to 3. After each round of repair, abstract syntax tree parsing and verification are performed again as in S13. When the maximum number of repair rounds is reached, the repair stops and the result is output.
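A bounded repair loop of the kind described in S14 might look like the sketch below. The `fix` callable is a hypothetical LLM wrapper and the prompt text is an assumption. The source does not say what happens when every round fails, so this sketch simply returns an empty list in that case.

```python
import ast
from typing import Callable, List, Optional

MAX_REPAIR_ROUNDS = 3  # cap stated in S14


def syntax_error(code: str) -> Optional[str]:
    """Return None if `code` parses as Python, otherwise the parser's message."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"


def repair_pool(pool: List[str], fix: Callable[[str], str]) -> List[str]:
    """Repair failing candidates for up to MAX_REPAIR_ROUNDS; return those that parse.

    `fix(prompt)` is a hypothetical LLM call that returns revised code.
    """
    current = list(pool)
    for _ in range(MAX_REPAIR_ROUNDS):
        repaired = []
        for code in current:
            error = syntax_error(code)
            if error is None:
                repaired.append(code)
                continue
            # Repair prompt = failing code + compiler error (wording is illustrative).
            prompt = (
                "The following Python code fails to parse.\n"
                f"Error: {error}\n\nCode:\n{code}\n\n"
                "Return a corrected version only."
            )
            repaired.append(fix(prompt))
        current = repaired
        if any(syntax_error(code) is None for code in current):
            break  # at least one candidate now passes the S13 check
    return [code for code in current if syntax_error(code) is None]
```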

[0044] S15: To select the code that is semantically closest to the original input, each initial candidate code that passed parsing verification is translated back into the original low-resource programming language; the result is denoted LRPL'. The BLEU score and the Sentence-BERT score between LRPL' and the source code are then calculated to measure their lexical overlap and deep semantic similarity, respectively. The total score of LRPL' (TotalScore) is calculated by weighted fusion:

[0045] TotalScore=0.5×BLEU+0.5×Sentence_BERT

[0046] Finally, the initial candidate code whose back-translation LRPL' has the highest total score is output as the best translation code.
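The selection rule in S15 reduces to an argmax over the weighted score. The sketch below leaves the two metrics as caller-supplied functions, because the source does not name the BLEU implementation or the Sentence-BERT checkpoint; both are assumed to return values on a comparable scale such as [0, 1].

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]


def select_best(
    source_code: str,
    back_translations: Dict[str, str],
    bleu: Scorer,
    sbert: Scorer,
    w_bleu: float = 0.5,
    w_sbert: float = 0.5,
) -> str:
    """Return the Python candidate whose back-translation scores highest.

    `back_translations` maps each verified Python candidate to its
    back-translation LRPL'. `bleu` and `sbert` are hypothetical scorers
    taking (back_translation, source_code).
    """
    def total_score(lrpl_prime: str) -> float:
        # TotalScore = 0.5 * BLEU + 0.5 * Sentence_BERT with the default weights
        return (w_bleu * bleu(lrpl_prime, source_code)
                + w_sbert * sbert(lrpl_prime, source_code))

    return max(back_translations, key=lambda cand: total_score(back_translations[cand]))
```

Passing the scorers in as arguments keeps the selection logic testable without loading an embedding model.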

[0047] Step 2: Retrieval enhancement

[0048] As shown in Figure 3, a retrieval engine based on the BM25 algorithm is used together with a pre-built knowledge base containing a large number of high-quality code-summary pairs. The best translation code output from step 1 is used as the query for the retrieval engine, and its term-frequency relevance score against each code in the code-summary knowledge base is calculated.

[0049] The codes are sorted in descending order of their term-frequency relevance scores, and the top three codes and their corresponding summaries are selected as reference examples.
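For illustration, the retrieval step can be sketched with a small self-contained Okapi BM25 scorer. The tokenizer and the parameters `K1 = 1.5` and `B = 0.75` are common defaults chosen for this example; the source does not state which values or which BM25 library it uses.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

K1, B = 1.5, 0.75  # assumed defaults, not taken from the source


def tokenize(code: str) -> List[str]:
    """Split code into lowercase identifier and number tokens (an assumption)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code.lower())


def bm25_top_k(query: str, corpus: List[Tuple[str, str]], k: int = 3) -> List[Tuple[str, str]]:
    """Rank (code, summary) pairs against `query` and return the top k pairs."""
    docs = [tokenize(code) for code, _ in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / max(n, 1)
    df = Counter(term for d in docs for term in set(d))  # document frequency
    query_terms = set(tokenize(query))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(tokens) / avgdl)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:k]]
```

Calling `bm25_top_k(best_translation, knowledge_base)` with the default `k=3` yields the three reference examples used in the later steps.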

[0050] Step 3: Extracting core statements

[0051] Deep syntactic analysis is performed on the code to filter out the most informative key statements, as shown in Figure 4:

[0052] S31: Using a parsing tool, decompose the best translated code output in step 1 and the code of the reference example selected in step 2 into abstract syntax tree nodes, and map each tree node to function definition statements, variable definition statements, loop body statements, conditional control statements, or other statements.
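Because the translated code is Python, the statement classification in S31 can be illustrated with the standard `ast` module. The category names and the choice to represent a compound statement by its header line are assumptions made for this sketch; the source only lists the five statement types.

```python
import ast
from typing import List, Tuple

# Mapping from AST node type to the statement types listed in S31.
CATEGORIES = {
    ast.FunctionDef: "function_definition",
    ast.AsyncFunctionDef: "function_definition",
    ast.Assign: "variable_definition",
    ast.AnnAssign: "variable_definition",
    ast.For: "loop",
    ast.While: "loop",
    ast.If: "conditional",
}


def classify_statements(code: str) -> List[Tuple[str, str]]:
    """Return (category, first source line) for every statement, in source order."""
    tree = ast.parse(code)
    lines = code.splitlines()
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.stmt):
            continue
        category = CATEGORIES.get(type(node), "other")
        # Keep only the first line so a compound statement is represented
        # by its header rather than by its whole body.
        text = lines[node.lineno - 1].strip()
        found.append((node.lineno, category, text))
    found.sort()  # ast.walk is breadth-first; restore source order
    return [(category, text) for _, category, text in found]
```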

[0053] S32: Input the classified statements into the core statement discrimination model, run inference on each statement, and output its importance label. An importance label of 1 indicates a core statement, and an importance label of 0 indicates a redundant statement.

[0054] The core statement discrimination model is a pre-trained binary classification neural network using an Encoder-Classifier architecture. During the training phase, a greedy selection strategy based on the ROUGE-L metric (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) is employed to generate ground-truth labels, as shown in Figure 5. First, the ROUGE-L score between each statement and its corresponding reference summary is calculated. Statements are then sorted from highest to lowest score and tentatively added to the core set one at a time. Only when a newly added statement increases the overall ROUGE-L score of the set relative to the reference summary is the statement considered to contain valid new information and labeled as core (importance label 1). Otherwise, it is labeled as non-core (importance label 0). This strategy ensures that the extracted statement set covers the most summary information in the shortest possible length.
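The greedy labelling strategy can be sketched as follows. Several details are assumptions because the source does not fix them: ROUGE-L is computed as an F1 score over lowercase whitespace tokens, and the trial set is joined in original statement order before scoring.

```python
from typing import List


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(statements: List[str], summary: str) -> List[int]:
    """Label each statement 1 (core) or 0 (redundant) by greedy ROUGE-L gain."""
    order = sorted(range(len(statements)),
                   key=lambda i: rouge_l(statements[i], summary), reverse=True)
    chosen: List[int] = []
    best = 0.0
    labels = [0] * len(statements)
    for i in order:
        # Score the set with statement i tentatively added, in source order.
        trial = " ".join(statements[j] for j in sorted(chosen + [i]))
        score = rouge_l(trial, summary)
        if score > best:  # keep only statements that add new summary information
            chosen.append(i)
            best = score
            labels[i] = 1
    return labels
```

The resulting 0/1 labels serve as training targets for the Encoder-Classifier model; at inference time the trained model, not this procedure, assigns the labels, since no reference summary is available.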

[0055] Step 4: Fusion Generation

[0056] The outputs of the preceding steps are integrated and specific decoding parameters are configured to generate the final summary, as shown in Figure 6:

[0057] S41: Construct the prompt by assembling the contextual information with a structured template. Three reference example blocks are concatenated in sequence, each containing the reference example code, the core statements of that code, and its reference summary. The target code block is then appended, containing the best translation code and its core statements. This hierarchical structure guides the large model to understand the mapping between code logic and summary.
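One way to assemble such a prompt is sketched below. The section markers ("### Example", "Code:", and so on) and the `Example` dataclass are invented for illustration; the source describes the block structure but not the literal template.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    code: str
    core_statements: List[str]
    summary: str


def _block(title: str, code: str, core: List[str], summary_line: str) -> str:
    """Render one block: code, then its core statements, then the summary line."""
    return "\n".join([
        f"### {title}",
        "Code:",
        code,
        "Core statements:",
        *core,
        summary_line,
    ])


def build_prompt(examples: List[Example], target_code: str, target_core: List[str]) -> str:
    """Concatenate the reference example blocks, then the target block."""
    blocks = [
        _block(f"Example {n}", ex.code, ex.core_statements, f"Summary: {ex.summary}")
        for n, ex in enumerate(examples, 1)
    ]
    # The target block ends with an open "Summary:" for the model to complete.
    blocks.append(_block("Target", target_code, target_core, "Summary:"))
    return "\n\n".join(blocks)
```

Ending the target block with an open summary field mirrors the example blocks, so the model's continuation is the summary itself.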

[0058] S42: To obtain the most accurate output from the large language model in pure inference mode, a fixed set of decoding parameters is supplied to the model. In this embodiment, the sampling temperature is set to 0.0, forcing the model to select the most probable token at each step and eliminating randomness. The nucleus sampling (top-p) threshold is set to 0.95 and the number of candidate tokens (top-k) to 50 to keep the generated text coherent. The maximum generation length is set to 128 to prevent excessively long, redundant output. Stop markers, including the <|eot_id|> flag, are set so that generation stops immediately after the summary is complete.
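The decoding settings of S42 can be collected in a single configuration object, as in the sketch below. The key names are illustrative and are not tied to any particular inference library; the unit of the maximum length is assumed to be tokens.

```python
# Decoding settings from S42. Key names are illustrative assumptions.
DECODING_CONFIG = {
    "temperature": 0.0,      # always pick the most probable token
    "top_p": 0.95,           # nucleus sampling threshold
    "top_k": 50,             # number of candidate tokens
    "max_new_tokens": 128,   # maximum generation length (unit assumed to be tokens)
    "stop": ["<|eot_id|>"],  # stop marker named in the text; other entries, if any, are not shown
}


def is_greedy(config: dict) -> bool:
    """True when the configuration selects the most probable token at each step."""
    return config["temperature"] == 0.0
```

In most inference engines a temperature of 0 makes decoding deterministic, in which case the top-p and top-k values would not change which token is chosen.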

[0059] S43: Load an instruction-tuned version of the large language model with 6B to 14B parameters, run inference with the above prompt and parameters, and directly output the final code summary.

[0060] As shown in Figure 7, a code summary generation device that supports low-resource programming languages includes an input device, an output device, a memory, and a processor. When the processor executes a program in the memory, it instantiates four core functional modules. Each module contains specific execution units to support the implementation of the above method.

[0061] Cross-language translation module: It integrates three sub-units: multi-temperature sampling, verification and repair, and back-translation optimization, which are used to process low-resource code input by users and output the best translation code;

[0062] Retrieval enhancement module: Through the built-in BM25 retrieval engine and a Top-K selector, it accurately retrieves reference examples from the knowledge base;

[0063] Core statement extraction module: It contains two sub-units: AST parsing and classification, and core statement discrimination. It is used to receive data from the translation module and the retrieval module, and output key statements.

[0064] The fusion generation module integrates three sub-units: prompt construction, decoding configuration, and model inference, and finally generates a summary and outputs it through an output device.

[0065] As indicated by the arrows in Figure 7, the modules are closely connected through data transmission channels and together form the complete system of this invention.

[0066] This invention proposes a code summary generation method and apparatus for low-resource programming languages, applicable to practical scenarios such as software engineering, scientific computing, and cross-language development collaboration, to address real-world problems in low-resource language domains, such as missing documentation and code that is difficult to understand. For example:

[0067] In the fields of scientific computing and statistical analysis, this invention can automatically generate clear natural language summaries for scripts written in languages such as R and Julia, which are commonly used for algorithm verification but lack engineering documentation support. This helps researchers without a computer science background to quickly understand complex mathematical logic and algorithm implementations, thereby improving the reusability of research code.

[0068] For emerging or niche programming languages, such as functional languages like OCaml or newly released domain-specific languages, this invention can provide plug-and-play code understanding support in the early stages when large-scale parallel corpora are lacking, lowering the learning threshold for developers.

[0069] In multilingual project management, for large data science projects involving multiple programming languages (such as mixed Python and R development), this invention can automatically generate standardized summaries for low-resource language modules, eliminating communication barriers between developers of different language stacks and improving team collaboration efficiency.

[0070] In the process of assisting cross-language code migration, such as when migrating R language code to a Python environment, the summary generated by this invention can serve as a semantic comparison reference to help developers verify the consistency of functional logic before and after the migration.

[0071] The present invention has been described in detail above, but its specific implementation is not limited thereto. Any technical solutions formed by equivalent substitution or equivalent transformation without departing from the spirit and scope of the claims of this application are within the protection scope claimed by the present invention.

Claims

1. A method for generating code summaries that supports low-resource programming languages, characterized in that: the source code of a low-resource programming language is translated into candidate code in a target high-resource programming language using a large language model, and, from the candidate code that passes parsing verification, the candidate semantically closest to the input source code is selected as the best translation code; multiple codes similar to the best translation code, together with their summaries, are retrieved from a pre-built knowledge base as reference examples; the best translation code and the code of the reference examples are split into code statements, and the core statements are selected; the code, core statements, and summaries of the reference examples, together with the best translation code and its core statements, are input to the large language model as the prompt context, and a code summary of the source code of the low-resource programming language is output; the low-resource programming languages and high-resource programming languages are distinguished according to the scale of their open-source corpus resources.

2. The code summary generation method supporting low-resource programming languages as described in claim 1, characterized in that: multiple candidate codes are generated by multi-temperature sampling.

3. The code summary generation method supporting low-resource programming languages as described in claim 2, characterized in that: the sampling temperature set is T={0,0.7,0.9,1.1}; for temperature T=0, 1 candidate code is generated; for each of the temperatures 0.7, 0.9, and 1.1, 3 candidate codes are generated.

4. The code summary generation method supporting low-resource programming languages as described in claim 1, characterized in that: the low-resource programming language is Julia, Lua, OCaml, R, or Racket; the high-resource programming language is Python.

5. The code summary generation method supporting low-resource programming languages as described in claim 1, characterized in that: the candidate code is parsed and verified using an abstract syntax tree; when all candidate codes fail the verification, the failed candidate codes and their error messages are sent as repair prompts to the large language model for self-repair.

6. The code summary generation method supporting low-resource programming languages as described in claim 5, characterized in that: the maximum number of repair rounds is set to 3; after each round of repair, the generated candidate code is parsed and verified again; when the maximum number of repair rounds is reached, the repair stops immediately and the results are output.

7. The code summary generation method supporting low-resource programming languages as described in claim 1, characterized in that: the candidate code is translated back into the same low-resource programming language as the source code; the BLEU score and the Sentence-BERT score between the back-translated code and the source code are calculated and combined by weighting into a total score; the candidate code whose back-translated code has the highest total score is selected as the best translation code.

8. The code summary generation method supporting low-resource programming languages as described in claim 1, characterized in that: using a retrieval engine based on the BM25 algorithm, the best translation code is used as the query; the term-frequency relevance score between the query and each code in the pre-built knowledge base is calculated; the 3 codes with the highest term-frequency relevance scores and their corresponding summaries are selected as reference examples.

9. The code summary generation method supporting low-resource programming languages as described in claim 1, characterized in that: the code statements obtained by splitting the best translation code and the code of the reference examples are input into the core statement discrimination model to select the core statements; the core statement discrimination model is a binary classification neural network using an Encoder-Classifier architecture, which runs inference on each statement and outputs its importance label, where an importance label of 1 indicates a core statement and an importance label of 0 indicates a redundant statement; during the training phase, a greedy selection strategy based on the ROUGE-L metric is used to generate ground-truth labels: specifically, the ROUGE-L score between each statement and its corresponding reference summary is calculated, the statements are sorted from high to low, and the statements are added to the core set one by one; only when a newly added statement increases the overall ROUGE-L score of the set relative to the reference summary is the statement considered to contain valid new information and labeled as core, i.e., importance label 1; otherwise, it is labeled as non-core, i.e., importance label 0.

10. A code summary generation apparatus supporting low-resource programming languages, the apparatus comprising a memory and a processor, wherein the memory stores a computer program, characterized in that: when the processor executes the computer program, it performs the functions of the method as described in any one of claims 1 to 9.