A cross-programming language migration method and system for code similarity detection

By mining and generating contrastive samples on low-resource languages with a pre-trained code encoder and an adaptive contrastive learning framework, the invention addresses the cross-language transfer problem of code similarity detection in multilingual code libraries, so that a similarity detection model trained on a high-resource language can be applied to a low-resource language.

CN117608651B · Active · Publication Date: 2026-06-30 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2023-10-30
Publication Date: 2026-06-30

Abstract

This invention proposes a cross-programming language transfer method and system for code similarity detection, belonging to the fields of software engineering and deep learning. Supervised contrastive learning is performed on a multilingual pre-trained code encoder using a high-resource labeled source language code library. A low-resource unlabeled target language code library is divided into two parts, and adaptive contrastive learning is performed on the fine-tuned multilingual pre-trained code encoder based on these two parts alternately. The contrast samples in adaptive contrastive learning are obtained through both mining and generation modes, with sampling from the mined and generated contrast samples based on preference parameters during training. The multilingual pre-trained code encoder obtained after adaptive contrastive learning is used as the result of cross-programming language transfer for code similarity detection on the low-resource target language. This invention can transfer a similarity detection model trained on a high-resource language to implement code similarity detection on a low-resource language.

Description

Technical Field

[0001] This invention relates to the fields of software engineering and deep learning, and specifically to a cross-programming language transfer method and system for code similarity detection.

Background Technology

[0002] Code clone detection is an important task in software engineering, aiming to identify functionally similar source code from large codebases; these similar code pairs are called code clones. The identified code snippets can help programmers review or refactor the code. In recent years, industry and academia have proposed many novel neural network models for code clone detection, achieving excellent performance on code clone benchmarks such as BigCloneBench and POJ-104.

[0003] Based on the definition of similar code, code clones can be categorized into four types. This invention focuses on Type-4 code clones, i.e., functionally similar code pairs, which are the most challenging type and cannot be accurately detected by simple text matching. Several neural network models have been proposed by academia and industry for detecting Type-4 code clones. These models use abstract syntax trees or data flow information obtained through the compiler to help understand code functionality. During the training phase, code snippets are encoded into low-dimensional dense vectors, and training narrows the vector distance between clone pairs in the latent space while widening the vector distance between non-clone pairs. However, the structure and node properties of the abstract syntax tree are specific to each programming language, making methods based on abstract syntax trees and data flow difficult to transfer to new programming languages.

[0004] In summary, existing code clone detection models suffer from a significant drawback: they only support a single programming language. In the real world, large software engineering projects typically consist of files in multiple programming languages, ranging from scripting languages like Python to system programming languages like C/C++ and Rust. Therefore, models that only detect code clones in a single language cannot meet the needs of multi-language code clone detection. Furthermore, these code clone detection models usually rely on large amounts of labeled data for training. For some uncommon programming languages, labeled data is difficult to obtain, which severely limits the model's ability to detect code clones on these low-resource programming languages where labeled data is scarce.

[0005] One solution is to use a unified compiler-generated intermediate representation (IR) to represent code across different languages. The model can be trained on the IR rather than on the source code, enabling it to learn common representations across different programming languages. However, obtaining IRs for different programming languages requires significant domain expertise and engineering effort to fix compilation errors, which makes it hard to extend to new languages. Another solution is to leverage a pre-trained multilingual code encoder. Multilingual code encoders are pre-trained in a self-supervised manner on a large corpus of code containing multiple programming languages, during which language-independent code context representations can be learned. However, using only self-supervised code representations for clone detection in the target language still performs worse than supervised fine-tuning.

[0006] Contrastive learning has proven to be a highly beneficial self-supervised pre-training task in visual and linguistic similarity tasks. The principle of contrastive learning is to learn consistent vector representations for different forms of data. Different representations of the same data instance are called positive semantic contrast samples, while other instances are called negative semantic contrast samples. During self-supervised training of contrastive learning, by narrowing the vector distance between positive sample representations and widening the distance between negative sample representations, the model can learn consistent vector representations for samples of the same data instance but with different representations. Several works have already used contrastive learning for code similarity learning. These works use source-to-source compilers to create different forms of the same code as positive contrast samples, use other source code as negative contrast samples, or generate positive contrast samples by renaming variables, or obtain negative contrast samples by injecting code vulnerabilities.

[0007] Currently, there is no effective method in academia or industry to achieve cross-language transfer of code similarity detection, that is, to transfer a similarity detection model trained on a high-resource language with labeled data to a low-resource language without labeled data.

Summary of the Invention

[0008] To address the problems existing in the prior art, this invention proposes a cross-programming language transfer method and system for code similarity detection. This invention mainly consists of three parts: target programming language contrast sample mining based on pre-trained code encoder, contrast sample generation based on code translation and variable name replacement, and an adaptive contrast learning framework based on contrast sample mining and contrast sample generation.

[0009] Part 1: Target Programming Language Contrast Sample Mining Based on a Pre-trained Code Encoder

[0010] This part takes a pre-trained code encoder fine-tuned on a high-resource programming language and uses clustering to mine positive and negative contrast samples on a low-resource language.

[0011] (1.1) Obtaining a multilingual pre-trained code encoder fine-tuned on high-resource programming languages: This method uses a Transformer model pre-trained on large-scale multilingual code data, such as CodeBERT or GraphCodeBERT, to obtain code vector encodings. On high-resource programming languages with labeled data, such as C++ and Python, this invention fine-tunes the multilingual pre-trained code encoder with a supervised contrastive learning objective, i.e., the InfoNCE loss function. Specifically, a two-layer feedforward network with the ReLU activation function is added after the last layer of the original Transformer encoder, and the last hidden state of the [CLS] token is used to generate the code vector encoding. Code clone detection is then performed through the cosine similarity between code vector encodings. The input is the original code string, which is split into tokens by a BPE tokenizer pre-trained on code data; the resulting tokens are fed into the multilingual pre-trained code encoder for vector encoding. After fine-tuning on high-resource languages, the encoder shortens the distance between the code vectors of clone pairs in those languages and widens the vector distance between non-clone pairs.
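
The encoding and detection path described in (1.1) can be illustrated with a minimal sketch. It assumes the [CLS] hidden state has already been produced by the Transformer encoder, that the ReLU sits between the two feedforward layers, and that weights are plain nested lists; the function names are hypothetical and are not taken from the patent.

```python
import math


def linear(weights, bias, x):
    """Apply one dense layer: weights is a list of rows, x is a vector."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]


def projection_head(cls_hidden, w1, b1, w2, b2):
    """Two-layer feedforward head with a ReLU in between (assumed placement)."""
    hidden = [max(0.0, h) for h in linear(w1, b1, cls_hidden)]
    return linear(w2, b2, hidden)


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda j: scores[j], reverse=True)
```

In a real system the vectors would come from the fine-tuned encoder, and the top-ranked candidates would be reported as likely clones.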

[0012] (1.2) Obtaining Low-Resource Language Code Representation Clustering: Through the above steps, a multilingual pre-trained code encoder fine-tuned on a high-resource programming language (hereinafter referred to as M_s) is obtained. M_s can perform vector encoding on arbitrary code snippets, including those in low-resource programming languages. Moreover, the code clone detection capability of M_s has a certain degree of cross-programming language transferability: it can be transferred directly to low-resource languages for code encoding, with clone detection performed through vector similarity. This invention finds that the code clone detection performance of M_s transferred directly to low-resource languages can exceed that of simple text similarity-based matching methods, such as BM25. Building on this cross-language transfer capability, this invention directly uses the fine-tuned M_s to encode unlabeled low-resource language code snippets and obtain code representations. The KMeans algorithm is then used to cluster the code representations of the entire low-resource language code library, dividing all the code in the low-resource language into C clusters.

[0013] (1.3) Low-Resource Language Contrast Sample Mining Based on Representation Clustering: The purpose of this step is to mine contrast learning samples from the low-resource language codebase using the representation clustering results from step (1.2), thus addressing the problem of missing labeled data in low-resource languages. For any code segment p in the low-resource language codebase, find all code segments belonging to the same cluster as p, and calculate the vector similarity between p and these code segments. Select the k code segments within the same cluster that are closest to p as candidate positive contrast learning samples; select all code segments not in the same cluster as p as candidate negative contrast learning samples.

[0014] Part 2: Contrast Sample Generation Based on Code Translation and Variable Name Replacement

[0015] In the first part, the accuracy of contrast sample mining depends directly on the cross-programming language transfer capability of M_s, but direct transfer models have limited performance. Furthermore, the mined contrast samples often contain false positive and false negative samples, and performing transfer learning directly on samples with low accuracy can even degrade performance. Therefore, this invention proposes a contrast sample generation method based on code back-translation and variable name replacement, which uses code data augmentation to obtain a certain number of correct positive contrast samples.

[0016] Part 3: An Adaptive Contrastive Learning Framework Based on Contrast Sample Mining and Contrast Sample Generation

[0017] The contrast samples mined in the first part come from the entire codebase and are more diverse in form, but due to limitations in model transferability, they may contain incorrect positive and negative contrast samples. The contrast samples generated in the second part are guaranteed to be functionally consistent with the original code, but their diversity is limited due to the fixed generation method. To combine the contrast learning samples obtained from these two aspects, leveraging their respective advantages and forming a complementary relationship, an adaptive contrastive learning framework is proposed.

[0018] (3.1) Adaptive Contrastive Learning Sample Selection Strategy: This method proposes an adaptive contrastive learning sample selection strategy that dynamically adjusts the preference for selecting mined versus generated contrastive samples throughout the training process. The preference parameter α_t is adjusted by linear decay over the course of training. Intuitively, the model struggles to mine accurate contrastive learning samples in the early stages of training, so the preference for generated samples should be higher to reduce noise. As the model gradually adapts to the patterns of the new programming language, increasing the preference for mined contrastive samples becomes advantageous. This adjustment introduces greater complexity and diversity, thereby enhancing the model's ability to handle broader semantic variations.

[0019] (3.2) Iterative Contrastive Learning Training: Re-mining contrastive learning samples at every training step is very time-consuming, and even infeasible with a large codebase, because all programs must be re-encoded and K-means must be run again to update the contrastive learning samples. To reduce training time costs, the contrastive learning samples mined for each program are updated only at the beginning of each training cycle. Furthermore, the codebase is randomly divided into two parts to reduce the size of the pool used for K-means clustering and neighbor search. This invention employs a training strategy that alternates between the two codebase parts. At the start of training, contrastive learning samples are mined on the first part, and the model is trained on the first part. The process then switches to the second part, using the enhanced model to mine more accurate contrastive learning samples there, and so on, iteratively enhancing the model's code clone detection capability on the target programming language.

[0020] Based on the above three parts, this invention proposes a cross-programming language transfer method for code similarity detection, including:

[0021] Supervised contrastive learning is performed on a multilingual pre-trained code encoder using a high-resource labeled source language code library to obtain a fine-tuned multilingual pre-trained code encoder.

[0022] The low-resource unlabeled target language code library is divided into two parts. Based on the two parts of the code library, the fine-tuned multilingual pre-trained code encoder is subjected to adaptive contrastive learning. The contrastive sample acquisition method in the adaptive contrastive learning includes two modes: mining and generation. The training process uses a linear decay method to adjust the preference parameters and samples are drawn from the mined and generated contrastive samples based on the preference parameters.

[0023] The multilingual pre-trained code encoder obtained after adaptive contrastive learning is used as the result after cross-programming language transfer to achieve code similarity detection on low-resource target languages.

[0024] Furthermore, this invention proposes a cross-programming language transfer system for code similarity detection to implement the above method.

[0025] The beneficial effects of this invention are: it can transfer a similarity detection model trained on a high-resource language, i.e., a fine-tuned pre-trained code encoder, to a low-resource language, and achieve code similarity detection on the low-resource language.

Description of the Drawings

[0026] Figure 1 is a block diagram illustrating the implementation of the cross-programming language transfer method for code similarity detection proposed in this invention.

Detailed Implementation

[0027] The present invention will be further described below with reference to the accompanying drawings and embodiments. The accompanying drawings are merely illustrative diagrams of the present invention. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities can be implemented in software, or in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0028] Referring to Figure 1, the main steps of the cross-programming language transfer method for code similarity detection proposed in this invention include:

[0029] (1) Given a code snippet p in a codebase containing multiple code snippets, the purpose of code clone detection is to recall from the codebase the set of all code snippets with functionality similar to program p. In the cross-language transfer scenario, there is a high-resource source language l_s with a code library of labeled data, in which each code snippet is labeled with the set of its clones in that library. The code in the library of the low-resource target language l_t, however, is unlabeled. The goal of cross-language transfer is to leverage the labeled source library and the unlabeled target library to achieve excellent code clone detection performance on the low-resource target language l_t.

[0030] (2) Obtain a multilingual pre-trained code encoder fine-tuned on a high-resource programming language.

[0031] Supervised contrastive learning on the high-resource programming language code library is used to fine-tune the multilingual pre-trained code encoder. In this embodiment, the GraphCodeBERT pre-trained model is used as the encoder. First, a batch of programs p_1, p_2, ..., p_B and their corresponding labels y_1, y_2, ..., y_B are randomly selected from the source library, where y_i is the index of the problem solved by program p_i. For each program p_i, a positive sample for contrastive learning, whose label is the same as y_i, is randomly selected from the source library. These positive samples and {p_1, p_2, ..., p_B} are then concatenated into a single batch, and the 2B samples are input into the model to obtain the code representations of all programs in the batch, each a vector of dimension d, where d is the dimension of the Transformer model. The pairwise cosine similarity s_ij between x_i and x_j is calculated within the batch, and the model is trained using the InfoNCE loss function, which is shown in the following formula.

[0032]

[0033] Where τ is the temperature hyperparameter. By minimizing this loss function, the cosine similarity of code representations for programs with similar functions will increase, while the similarity of code representations for programs with dissimilar functions will decrease. The code encoder fine-tuned using the above method is denoted as M_s.
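
The loss formula itself is not reproduced in this text. As a rough illustration only, the sketch below assumes the widely used InfoNCE form, in which each sample's similarity to its designated positive is contrasted against its similarities to every other sample in the batch. The function names and the `positive_index` argument are hypothetical, and the patent's exact formula may differ (for example, in how other same-label samples are treated).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def info_nce(embeddings, positive_index, tau=0.1):
    """Mean InfoNCE loss over a batch (assumed standard form).

    embeddings[i] is the representation x_i; positive_index[i] is the batch
    position of the positive sample paired with sample i.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Scaled similarities s_ij / tau to every other sample in the batch.
        logits = [cosine(embeddings[i], embeddings[j]) / tau
                  for j in range(n) if j != i]
        pos = cosine(embeddings[i], embeddings[positive_index[i]]) / tau
        # Numerically stable log-sum-exp for the denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - pos
    return total / n
```

With a batch of B programs followed by their B positives, `positive_index` would map position i to i + B and back, so that each of the 2B samples has exactly one designated positive.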

[0034] (3) Obtain low-resource language code representation clusters.

[0035] After fine-tuning on the high-resource language, M_s already possesses a certain degree of cross-programming language transfer capability: it can encode code in any programming language and exhibits a certain level of clone detection performance. Motivated by this transferability, M_s is applied directly to the target low-resource dataset, and semantically similar and dissimilar contrastive learning samples are mined automatically from its programs by clustering, without any supervision signal. Following the code encoding process described above, M_s is first used to obtain the code representation x_i of each program p_i in the target dataset. K-means clustering is then run on these representations, assigning each program to one of C semantic categories, so that after clustering each program p_i obtains a cluster label c_i. In this embodiment, cosine similarity is used as the distance metric for K-means, and C, a hyperparameter that determines the number of cluster centers, is set to the number of programming problems in the dataset. In actual use, C does not need to be set strictly to the number of programming problems; this invention is robust to the choice of C.
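
A minimal sketch of K-means with cosine similarity as the metric is given below. It assumes the spherical K-means variant (unit-normalized points and centers, assignment by largest dot product) with random initialization; the patent does not state which variant or initialization is used, and `cosine_kmeans` is a hypothetical name.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_kmeans(vectors, num_clusters, iterations=20, seed=0):
    """Cluster vectors by cosine similarity; returns one cluster label per vector."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centers = rng.sample(points, num_clusters)  # random initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the center with the highest cosine similarity.
        labels = [max(range(num_clusters), key=lambda c: dot(p, centers[c]))
                  for p in points]
        # Move each center to the normalized mean of its members.
        for c in range(num_clusters):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[c] = normalize(mean)
    return labels
```

For a large code library, an optimized library implementation would normally replace this pure-Python loop; the sketch only shows the assignment and update steps.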

[0036] (4) Low-resource language contrast sample mining based on representation clustering.

[0037] After obtaining the cluster labels of all programs in the target dataset, the positive (similar) and negative (dissimilar) contrast samples of each program p_i are mined as follows. In zero-shot transfer experiments, programs with similar functions are more likely to be assigned to the same cluster than dissimilar programs. To discover positive contrast samples with similar functionality, this invention takes the programs assigned to the same cluster as p_i as the candidate set T_i of positive contrast samples.

[0038] T_i = {p_j | c_j = c_i}

[0039] However, some functionally dissimilar programs may be incorrectly assigned to the same cluster. To reduce spurious positive samples, only the k nearest neighbors of p_i in T_i are retained, which excludes boundary samples from the candidate set. The final set P_i of positive samples mined for p_i is defined by the following formula.

[0040] P_i = {p_j | p_j ∈ T_i ∧ rank(s_ij) ≤ k}

[0041] Here, s_ij is the cosine similarity score between x_i and x_j, and rank is a function that returns the index of a score in descending sorted order. k is a hyperparameter chosen from {16, 32, 64}. A key factor in contrastive learning is having a variety of negative contrast samples, which prevents the model from collapsing into a trivial constant solution. To meet this requirement, this invention treats all programs not in the same cluster as p_i as its negative contrast sample set, ensuring that the model is exposed to a sufficient number of negative examples. This negative contrast sample set is denoted as N_i.

[0042] N_i = {p_j | c_j ≠ c_i}
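
The mining rule in step (4) can be sketched as follows: positives are the k most similar programs in the same cluster, and negatives are all programs in other clusters. The sketch assumes a precomputed similarity matrix and excludes p_i from its own candidate set, which the text does not state explicitly; `mine_contrast_samples` is a hypothetical name.

```python
def mine_contrast_samples(i, labels, sims, k):
    """Return (positive indices, negative indices) for program i.

    labels[j] is the cluster label c_j, and sims[i][j] is the cosine
    similarity s_ij between the representations of programs i and j.
    """
    n = len(labels)
    # Candidate set T_i: same cluster as program i (program i itself excluded).
    candidates = [j for j in range(n) if j != i and labels[j] == labels[i]]
    # Keep only the k nearest neighbours by descending similarity.
    candidates.sort(key=lambda j: sims[i][j], reverse=True)
    positives = candidates[:k]
    # Negative set N_i: every program assigned to a different cluster.
    negatives = [j for j in range(n) if labels[j] != labels[i]]
    return positives, negatives
```

For example, with cluster labels `[0, 0, 0, 1, 1]` and k = 1, program 0 keeps only its single most similar cluster-mate as a positive and treats programs 3 and 4 as negatives.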

[0043] (5) The accuracy of the above contrast sample mining depends directly on the cross-programming language transfer capability of M_s, but zero-shot transfer models have limited performance. The mined contrast samples often contain false positive and false negative samples, and performing transfer learning directly on samples with low accuracy can even degrade performance. Therefore, this invention proposes a contrast sample generation method based on code back-translation and variable name replacement, using code data augmentation to obtain a certain number of correct positive contrast samples.

[0044] In one specific embodiment of the present invention, for a code snippet p_i in the target dataset, the following two methods can be used to generate a certain number of correct positive contrast samples.

[0045] (5.1) Contrast Sample Generation Method Based on Code Back-Translation: Back-translation is an effective data augmentation method in natural language processing tasks. Given original English text, a neural machine translation model first translates it into another language, usually German, and then translates the German text back into English. With a powerful translation model, the back-translated text usually differs in surface form but is semantically equivalent. Inspired by this use of back-translation in natural language processing, this invention also uses back-translation to generate functionally consistent positive contrast samples. For example, CodeGeeX, a multilingual code generation model trained on large-scale code data, is used to translate the program into Python first and then translate it back into the original language. The back-translated samples can preserve the functionality of the program while differing in specific code form. CodeGeeX can perform zero-shot code translation without fine-tuning on additional code translation data.

[0046] (5.2) Contrast Sample Generation Method Based on Variable Name Replacement: Variable renaming is used to create different views of code snippets in all languages. Renaming only changes the names of variables and does not change the functionality of the code. For a code snippet, one of the following two renaming strategies is applied at random: identifier normalization, which normalizes variable names to "var1", "var2", ..., "varm" and function names to "func1", "func2", ..., "funcm"; and identifier randomization, which first collects variable and function names from all programs into a name pool and then randomly selects names from the pool.
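
The identifier normalization strategy can be sketched for Python source code using the standard `ast` module. This is only an illustration under stated assumptions: the patent targets arbitrary languages and does not specify a tool, `normalize_identifiers` is a hypothetical name, and the sketch covers simple functions only (keyword arguments in calls, `global` statements, and `except ... as` names are not handled). It requires Python 3.9 or later for `ast.unparse`.

```python
import ast
import builtins


def normalize_identifiers(source: str) -> str:
    """Rename functions to func1, func2, ... and variables to var1, var2, ..."""
    tree = ast.parse(source)
    func_map, var_map = {}, {}
    protected = set(dir(builtins))  # never rename built-ins such as len or print

    # First pass: collect function names, and protect imported and class names.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            func_map.setdefault(node.name, f"func{len(func_map) + 1}")
        elif isinstance(node, ast.ClassDef):
            protected.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                protected.add((alias.asname or alias.name).split(".")[0])

    def rename(name: str) -> str:
        if name in func_map:
            return func_map[name]
        if name in protected:
            return name
        return var_map.setdefault(name, f"var{len(var_map) + 1}")

    class Renamer(ast.NodeTransformer):
        def visit_FunctionDef(self, node):
            node.name = func_map[node.name]
            self.generic_visit(node)  # rename arguments and body
            return node

        def visit_arg(self, node):
            node.arg = rename(node.arg)
            return node

        def visit_Name(self, node):
            node.id = rename(node.id)
            return node

    return ast.unparse(Renamer().visit(tree))
```

Because only names change, the normalized program behaves the same as the original on inputs within the sketch's supported subset, which is what makes it usable as a positive contrast sample.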

[0047] One of the two methods, code back-translation or variable name replacement, is randomly selected to generate the contrast samples. However, generation is not limited to these two methods.

[0048] (5) The contrast samples mined in step (4) come from the entire codebase and are therefore more diverse in form, but owing to the limits of model transferability they may contain incorrect positive and negative contrast samples. The contrast samples generated in step (5) are guaranteed to be functionally consistent with the original code, but their diversity is relatively limited because the generation methods are fixed. To combine the contrastive learning samples from these two sources so that their respective advantages complement each other, this invention proposes an adaptive contrastive learning sample selection strategy, which dynamically adjusts the preference for selecting mined versus generated contrast samples throughout the training process.

[0049] Specifically, at each training step t, for a program fragment p_i, one positive contrast sample is obtained by random sampling from the positive contrast sample set P_i mined in step (4), and a randomly selected generation method is applied to p_i to obtain a positive contrast sample that is guaranteed to be correct. The positive contrast sample finally used for contrastive learning training is determined by the following formula.

[0050]

[0051] Here γ ∈ {0,1} is sampled from a Bernoulli distribution with parameter α_t. The higher the value of α_t, the stronger the preference for the generated positive contrast sample during sampling. This embodiment uses a linear decay method to adjust the preference parameter α_t throughout the training process. Intuitively, in the early stages of training, the model struggles to mine accurate positive contrast samples, so the preference for generated positive samples should be higher to reduce noise. As the model gradually adapts to the patterns of the target programming language, increasing the preference for mined positive samples becomes beneficial. Using more mined positive samples lets the model incorporate greater complexity and diversity, enhancing its ability to handle broader variations in program semantics. Formally, α_t is adjusted using the following formula.

[0052]

[0053] Here α_0 ∈ [0,1] is a hyperparameter giving the initial value of α_t at the start of training; T_total is the total number of training steps across all training cycles, where each training cycle includes a number of training steps; and σ ∈ [0,1] is a hyperparameter that determines when to start reducing the preference for generated positive contrast samples. In this embodiment, σ is set to 0.1.
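
The selection and decay formulas are not reproduced in this text, so the following is a sketch under explicit assumptions rather than the patented formula: α_t is held at α_0 for the first σ·T_total steps and then decays linearly to 0 at step T_total, and γ = 1 selects the generated sample. The function names are hypothetical.

```python
import random


def preference(step, total_steps, alpha0, sigma=0.1):
    """Assumed schedule: constant alpha0 until sigma * total_steps, then linear decay to 0."""
    start = sigma * total_steps
    if step <= start or total_steps <= start:
        return alpha0
    remaining = max(0.0, (total_steps - step) / (total_steps - start))
    return alpha0 * remaining


def choose_positive(mined_positives, generated_positive, alpha_t, rng=random):
    """Draw gamma ~ Bernoulli(alpha_t); gamma = 1 picks the generated sample."""
    gamma = 1 if rng.random() < alpha_t else 0
    if gamma == 1:
        return generated_positive
    return rng.choice(mined_positives)  # random sample from the mined set P_i
```

Under this assumed schedule, early steps almost always use the generated (guaranteed-correct) positive, and late steps increasingly draw from the mined set.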

[0054] (6) Iterative contrastive learning training.

[0055] To reduce training time costs, the sets of positive and negative contrast samples mined for each program are updated only at the beginning of each training cycle. Furthermore, the target code library is randomly divided into two parts to reduce the size of the pool used for clustering and nearest-neighbor ranking, and an iterative training strategy that alternates between the two parts is adopted. At the start of training, positive and negative contrast sample sets are first mined on the first part, and the model is trained on that part. In the next training cycle, the first part is replaced by the second part: the enhanced model trained in the previous cycle is used to mine more accurate positive and negative contrast sample sets on it, and training is repeated. This alternation continues cycle after cycle.
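
The alternating schedule can be sketched as a small driver loop. The `mine` and `train` callables are hypothetical placeholders for the mining of step (4) and one cycle of contrastive training; only the control flow (random split, re-mining once per cycle, alternating halves) is taken from the text.

```python
import random


def split_codebase(programs, seed=0):
    """Randomly divide the target code library into two parts."""
    rng = random.Random(seed)
    shuffled = list(programs)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def alternating_training(programs, num_cycles, mine, train, model):
    """Alternate between the two parts, re-mining samples once per cycle.

    mine(model, part) returns contrast samples for the part using the
    current model; train(model, part, samples) returns the updated model.
    """
    parts = split_codebase(programs)
    for cycle in range(num_cycles):
        part = parts[cycle % 2]               # first part, second part, first part, ...
        samples = mine(model, part)           # mined only at the start of the cycle
        model = train(model, part, samples)   # the improved model is used in the next cycle
    return model
```

Each half is half the size of the full library, so every clustering and neighbor-ranking pass works on a smaller pool than mining over the whole library would.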

[0056] In each training step during the training phase, a batch of programs {p_1, p_2, ..., p_B} of batch size B is randomly selected from the part of the code library currently in use. Each program p_i has a positive contrast sample obtained by the sampling method described in step (5), together with the positive contrast set P_i and the negative contrast set N_i obtained in step (4). The sampled positive contrast samples and {p_1, p_2, ..., p_B} are then concatenated and input into the model. Finally, adaptive training is performed using the InfoNCE contrastive loss function.

[0057]

[0058] Note that, to keep the above equation concise, the definitions of P_i and N_i are extended here, with further samples added to P_i and to N_i. τ is a temperature hyperparameter and is set to 0.1.

[0059] The multilingual pre-trained code encoder obtained after the above adaptive contrastive learning is taken as the result of cross-programming language transfer. This encoder can perform code similarity detection on low-resource languages.

[0060] This embodiment also provides a cross-programming language migration system for code similarity detection, which is used to implement the above embodiments; details already described will not be repeated. The terms "module," "unit," etc., used below can refer to a combination of software and / or hardware that performs a predetermined function. Although the system described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible.

[0061] A cross-programming language transfer system for code similarity detection includes:

[0062] The supervised contrastive learning fine-tuning module is used to perform supervised contrastive learning on a multilingual pre-trained code encoder using a high-resource labeled source language code library to obtain a fine-tuned multilingual pre-trained code encoder.

[0063] The adaptive contrastive learning module is used to divide the low-resource unlabeled target language code library into two parts. Based on the two parts of the code library, adaptive contrastive learning is performed on the fine-tuned multilingual pre-trained code encoder in turn. The contrastive sample acquisition method in the adaptive contrastive learning includes two modes: mining and generation. The training process uses a linear decay method to adjust the preference parameters and samples are drawn from the mined and generated contrastive samples based on the preference parameters.

[0064] The code similarity detection module is used to perform code similarity detection on low-resource target languages ​​by using the multilingual pre-trained code encoder obtained after adaptive contrastive learning as the result after cross-programming language transfer.

[0065] The specific implementation process of the functions and roles of each module in the above system is detailed in the corresponding steps of the above method, and will not be repeated here. For the system embodiment, since it basically corresponds to the method embodiment, relevant parts can be referred to in the description of the method embodiment. The system embodiment described above is merely illustrative; the modules described as separate components may or may not be physically separated, i.e., they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0066] The system embodiments of the present invention can be applied to any device with data processing capabilities, such as a computer or other similar device. The system embodiments can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by the processor of any data processing device loading the corresponding computer program instructions from non-volatile memory into memory for execution.

[0067] The technical effects of the present invention are verified through experiments below.

[0068] In the experiments, model training was performed on a computer equipped with an NVIDIA RTX-3090 GPU. In both the high-resource language fine-tuning phase and the low-resource language transfer training phase, the batch size B was set to 16 and the temperature hyperparameter τ in the objective function was set to 0.1. The model was optimized with the AdamW optimizer, using a learning rate of 3e-5, a weight decay of 1e-5, and a linear learning-rate schedule with a warm-up ratio of 0.1; the gradient norm was clipped to 1.0. On the validation sets of POJ-104 and GCJ, the model reached its best detection performance after two training epochs and was saved for low-resource language transfer training. In the transfer training phase, σ was set to 0.1, and the best values of α0 ∈ {1.0, 0.8, 0.4, 0.2} and k ∈ {16, 32, 64} were selected by grid search on the validation set of the target language dataset.
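To make the roles of α0 and σ concrete, the following sketch shows one plausible linear-decay schedule for the preference parameter together with the hyperparameter grid listed above. The exact decay formula is an assumption: here the preference stays at α0 until a fraction σ of the total training steps has elapsed and then decreases linearly to zero at the last step. The function name `preference` and the step counts are hypothetical.

```python
def preference(step, total_steps, alpha0, sigma):
    """Assumed linear-decay schedule for the preference parameter.

    Constant at alpha0 for the first sigma * total_steps steps,
    then linearly decreasing to 0 at total_steps.
    """
    start = sigma * total_steps
    if step <= start:
        return alpha0
    remaining = (total_steps - step) / (total_steps - start)
    return alpha0 * max(0.0, remaining)

# Grid of (alpha0, k) pairs searched on the validation set.
grid = [(alpha0, k) for alpha0 in (1.0, 0.8, 0.4, 0.2) for k in (16, 32, 64)]

# Preference values at a few steps of a hypothetical 100-step run.
schedule = [preference(t, 100, alpha0=0.8, sigma=0.1) for t in (0, 10, 55, 100)]
```

Under this assumed form, a larger σ delays the point at which the mined samples start to be favored over the generated ones, and a smaller α0 favors the mined samples from the beginning.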

[0069] The "Model" column of Table 1 lists several comparison schemes. Text-embedding-ada-002 is OpenAI's state-of-the-art text and code embedding API, which can produce vector representations of arbitrary text and code, including code snippets in different programming languages, and achieves good text and code similarity modeling without fine-tuning. ContraCode-FT, CodeBERT-FT, and GraphCodeBERT-FT are the results of supervised fine-tuning of the state-of-the-art pre-trained code encoders ContraCode, CodeBERT, and GraphCodeBERT, respectively, on the five low-resource target languages Ruby, C#, Rust, JavaScript, and Go. These three methods assume that a certain amount of labeled data is available for the low-resource languages and represent the upper bound of performance achievable through supervised fine-tuning. The method of this invention requires no labeled data for the low-resource languages. MAP@R is used as the performance metric for clone detection on the target low-resource languages; a higher value indicates that the method more accurately identifies all clones of each code snippet in the codebase.
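For reference, the sketch below computes MAP@R under its usual definition: for a query with R true clones in the codebase, only the top R retrieved items are inspected, and the precision at each rank that holds a true clone is summed and divided by R. The data layout (a dictionary of rankings and a list of labels) and the name `map_at_r` are hypothetical, and the function returns a fraction, whereas the tables report percentages.

```python
def map_at_r(rankings, labels):
    """Mean Average Precision at R.

    rankings[q]: candidate indices (query excluded), most similar first.
    labels[i]:   class (e.g. problem id) of program i.
    """
    scores = []
    for q, ranked in rankings.items():
        # R = number of other programs sharing the query's label.
        r = sum(1 for i, lab in enumerate(labels) if lab == labels[q] and i != q)
        if r == 0:
            continue  # a query without any clone is skipped
        hits, total = 0, 0.0
        for rank, idx in enumerate(ranked[:r], start=1):
            if labels[idx] == labels[q]:
                hits += 1
                total += hits / rank  # precision at this rank
        scores.append(total / r)
    return sum(scores) / len(scores)

# Toy example: programs 0-2 solve problem "A", programs 3-4 solve problem "B".
labels = ["A", "A", "A", "B", "B"]
rankings = {0: [1, 3, 2, 4], 3: [4, 0, 1, 2]}
score = map_at_r(rankings, labels)
```

In the toy example, query 0 has R = 2 and finds one clone in its top two results (score 0.5), while query 3 has R = 1 and finds its clone first (score 1.0), giving a mean of 0.75.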

[0070] Table 1 Comparison of Unsupervised and Supervised Code Representation Methods

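[0071]

| Model | Ruby | C# | Rust | JavaScript | Go | Avg |
|---|---|---|---|---|---|---|
| Text-embedding-ada-002 | 71.37 | 47.83 | 59.81 | 50.47 | 59.13 | 54.31 |
| ContraCode-FT | 65.42 | 59.06 | 63.85 | 53.97 | 67.59 | 61.12 |
| CodeBERT-FT | 74.21 | 72.00 | 84.06 | 74.62 | 78.89 | 77.39 |
| GraphCodeBERT-FT | 80.03 | 74.74 | 87.63 | 76.84 | 82.54 | 80.44 |

The Avg column equals the mean of the C#, Rust, JavaScript, and Go scores; the Ruby column is not included.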

[0072] Table 2 compares the proposed method with two cross-language transfer methods for code vector representations, GraphCodeBERT-ZeroTrans and GraphCodeBERT-Whiten. GraphCodeBERT-ZeroTrans denotes a GraphCodeBERT model fine-tuned on labeled data in a high-resource language and applied directly to the low-resource target language in a zero-shot manner. GraphCodeBERT-Whiten denotes fine-tuning GraphCodeBERT on labeled data in a high-resource language and then using whitening to transfer the model to the low-resource target language. GraphCodeBERT-Our-IR and GraphCodeBERT-Our-BT denote the proposed method with identifier renaming and back-translation, respectively, as the contrastive sample generation method.
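Identifier renaming can be illustrated with a small regex-based sketch. This is a simplification and not the procedure used in the experiments: it replaces whole-word occurrences of the listed names and does not distinguish identifiers from keywords, string contents, or comments, which a parser-based implementation would. The function name `rename_identifiers`, the Ruby snippet, and the replacement names are hypothetical.

```python
import re

def rename_identifiers(code, mapping):
    """Replace whole-word occurrences of each key in `mapping` with its value.

    Assumption: every key is a plain identifier and never appears inside
    a string literal or comment of `code`.
    """
    if not mapping:
        return code
    names = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + names + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], code)

# A small Ruby method used as the target program (toy example).
original = "def total(items)\n  sum = 0\n  items.each { |item| sum += item }\n  sum\nend"
renamed = rename_identifiers(original, {"items": "v0", "sum": "v1", "item": "v2"})
```

The renamed program has the same behavior as the original, so the pair (original, renamed) can serve as a generated positive contrastive pair.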

[0073] Table 2 Comparison of Cross-Language Code Representation Transfer Methods

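[0074]

| Model | Ruby | C# | Rust | JavaScript | Go | Avg |
|---|---|---|---|---|---|---|
| GraphCodeBERT-ZeroTrans | 61.98 | 54.86 | 57.46 | 62.95 | 66.65 | 60.48 |
| GraphCodeBERT-Whiten | 59.30 | 58.73 | 59.99 | 65.67 | 65.59 | 62.50 |
| GraphCodeBERT-Our-IR | 72.35 | 63.74 | 74.65 | 73.58 | 76.08 | 72.01 |
| GraphCodeBERT-Our-BT | - | 66.95 | 79.22 | 72.38 | 77.56 | 74.03 |

The Avg column equals the mean of the C#, Rust, JavaScript, and Go scores; the Ruby column is not included.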

[0075] Tables 1 and 2 show that the transfer method proposed in this invention significantly outperforms the ZeroTrans and Whiten methods, as well as the unsupervised code representation method Text-embedding-ada-002, on the target low-resource programming languages: in the Avg column, GraphCodeBERT-Our-IR and GraphCodeBERT-Our-BT reach 72.01 and 74.03, compared with 60.48 for ZeroTrans, 62.50 for Whiten, and 54.31 for Text-embedding-ada-002. Furthermore, without using any labeled data in the target low-resource languages, the proposed method comes close to the methods that do use labeled data (77.39 for CodeBERT-FT and 80.44 for GraphCodeBERT-FT), which further demonstrates its ability to achieve cross-programming language transfer for clone detection when only unlabeled data is available.

[0076] The above examples are merely specific embodiments of the present invention. Obviously, the present invention is not limited to the above embodiments and many variations are possible. All variations that can be directly derived or conceived by those skilled in the art from the disclosure of the present invention should be considered within the scope of protection of the present invention.

Claims

1. A cross-programming language transfer method for code similarity detection, characterized in that it comprises:

performing supervised contrastive learning on a multilingual pre-trained code encoder using a high-resource labeled source language code library to obtain a fine-tuned multilingual pre-trained code encoder;

dividing a low-resource unlabeled target language code library into two parts, and performing adaptive contrastive learning on the fine-tuned multilingual pre-trained code encoder based on the two parts of the code library, wherein the contrastive samples in the adaptive contrastive learning are acquired in two modes, mining and generation;

the contrastive samples obtained in the mining mode include positive and negative contrastive samples; specifically, the code representation of each target program in the low-resource unlabeled target language code library is obtained using the fine-tuned multilingual pre-trained code encoder, and all code representations are clustered; for each target program, its nearest neighbors in the same cluster are selected as positive contrastive samples, and programs in different clusters are used as negative contrastive samples;

the contrastive samples obtained in the generation mode are positive contrastive samples; specifically, positive contrastive samples are generated by code back-translation or variable name replacement, where code back-translation means first translating the target program into a high-resource programming language and then translating it back into the original language of the target program, and variable name replacement means replacing the variable names in the target program without changing the functionality of the code;

during training, the preference parameter is adjusted by linear decay, and the positive contrastive sample finally used for contrastive learning training is sampled from the mined and generated contrastive samples according to the preference parameter: a Bernoulli variable is sampled with the preference parameter as its parameter, and according to its value the final positive contrastive sample is either the generated positive contrastive sample or a positive contrastive sample randomly selected from the set of positive contrastive samples obtained by mining;

in each training step of the adaptive contrastive learning, a batch of target programs is randomly selected from one part of the low-resource unlabeled target language code library; each target program corresponds to a set of positive contrastive samples obtained by mining, a set of negative contrastive samples obtained by mining, and the positive contrastive sample finally sampled for contrastive learning training; the positive contrastive sample corresponding to each target program is added to its positive contrastive sample set, and the positive contrastive sample of each target program in its negative contrastive sample set is added to that negative contrastive sample set;

the target program sequences and the corresponding positive contrastive sample sequences of the same batch are concatenated and input into the fine-tuned multilingual pre-trained code encoder, which is adaptively trained with a contrastive learning loss function defined in terms of the target programs, the cosine similarity between two target programs, and a temperature hyperparameter;

the multilingual pre-trained code encoder obtained after adaptive contrastive learning is used as the result of cross-programming language transfer to perform code similarity detection on the low-resource target language.

2. The cross-programming language transfer method for code similarity detection according to claim 1, characterized in that the supervised contrastive learning comprises: obtaining source programs and their positive samples from the high-resource labeled source language code library, a source program and its positive sample having the same label; concatenating the source program sequences and the corresponding positive sample sequences of the same batch and inputting them into the multilingual pre-trained code encoder; and training the multilingual pre-trained code encoder with the InfoNCE loss function to obtain the fine-tuned multilingual pre-trained code encoder.

3. The cross-programming language transfer method for code similarity detection according to claim 2, characterized in that the labels are the indexes of the problems solved by the programs.

4. The cross-programming language transfer method for code similarity detection according to claim 1, characterized in that the preference parameter is computed by linear decay from an initial hyperparameter, the total number of training steps across all training epochs, where each training epoch includes a number of training steps, and a hyperparameter that determines when the preference for the generated positive contrastive samples starts to be reduced.

5. The cross-programming language transfer method for code similarity detection according to claim 1, characterized in that the multilingual pre-trained code encoder uses a pre-trained language model.

6. A cross-programming language transfer system for code similarity detection, used to implement the cross-programming language transfer method for code similarity detection according to claim 1, characterized in that the system comprises:

a supervised contrastive learning fine-tuning module, used to perform supervised contrastive learning on a multilingual pre-trained code encoder using a high-resource labeled source language code library to obtain a fine-tuned multilingual pre-trained code encoder;

an adaptive contrastive learning module, used to divide the low-resource unlabeled target language code library into two parts and to perform adaptive contrastive learning on the fine-tuned multilingual pre-trained code encoder based on the two parts of the code library in turn, wherein the contrastive samples in the adaptive contrastive learning are acquired in two modes, mining and generation, the preference parameter is adjusted by linear decay during training, and samples are drawn from the mined and generated contrastive samples according to the preference parameter;

a code similarity detection module, used to take the multilingual pre-trained code encoder obtained after adaptive contrastive learning as the result of cross-programming language transfer and to perform code similarity detection on low-resource target languages.