A parallelized source code clone detection method
By parallelizing code block processing and similarity comparison, the method improves detection efficiency and addresses the difficulty existing techniques have in identifying cloned code formed by interleaved expressions.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2024-05-30
- Publication Date
- 2026-06-16
Figure CN118444976B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of software engineering, specifically relating to a parallelized and efficient code clone detection method for source code similarity.
Background Technology
[0002] During software development, developers often refer to existing solutions to design and implement new business functions. This includes directly copying and pasting existing source code and writing source code according to standard API call flows. These development habits leave a certain amount of identical or similar source code in the code repository; this type of code is called cloned code. Cloned code has a negative impact on software development and maintenance, so it is necessary to monitor and manage it.
[0003] Cloned code is mainly divided into four categories: the first category is completely identical code (with comments and blank lines removed); the second category is code with different identifiers; the third category is source code pairs with added or deleted expressions; and the fourth category is source code with dissimilar text but identical functionality. Currently, detection tools typically group the first three categories together for detection, and treat the fourth category as a separate category.
[0004] Researchers both domestically and internationally have proposed many valuable solutions for detecting the first three types of cloned code. However, current methods suffer from low detection efficiency, resulting in poor performance when dealing with massive codebase clone detection. Furthermore, the third type of cloned code contains many large-granularity clones formed by interweaving similar expressions, which current methods cannot effectively identify. These factors limit the detection of cloned code.
Summary of the Invention
[0005] This invention addresses the problem of ineffective identification of third-type code clones caused by expression interleaving, as well as the challenge of detecting clones in massive code repositories, by providing a highly efficient parallelized source code clone detection method. It achieves efficient clone detection through parallel reading and splitting of code blocks, parallel sorting of code blocks, and parallel similarity comparison of multiple code pairs. By using multiple methods for counting similar lines of code, it overcomes the problem that continuous code fragment comparison methods cannot effectively detect code clones of expression interleaving type. This method provides an effective solution for detecting clones in massive code repositories.
[0006] This invention provides a parallelized source code clone detection method, the specific steps of which are as follows:
[0007] (1) Program initialization;
[0008] The system checks whether the input parameters conform to the specifications. If they do, it continues loading configuration data and proceeds to the next step; otherwise, it provides parameter prompts. Specifically, the parameters include: min-line (minimum number of method lines; methods with fewer lines than this value are excluded from detection), max-line (maximum number of method lines; methods with more lines than this value are excluded from detection), compare-type (code similarity comparator type: 1 for string comparison, 2 for token comparison, 3 for Simhash comparison), line-gap-dis (comparison range coefficient; methods with a line gap percentage greater than this value are excluded from detection), similarity (similarity threshold; methods whose similarity exceeds this value are detected as cloned code), language (detection language), and extensions (source file extensions, separated by commas; source files whose extensions match the configuration are parsed and included in clone detection).
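The parameter check described above can be illustrated with a minimal Python sketch. The class name `DetectorConfig`, the method names, and the exact value ranges are assumptions made for illustration; the source lists the parameters but does not specify the validation rules. The defaults are the values given in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class DetectorConfig:
    """Hypothetical container for the parameters listed in the text."""
    min_line: int = 5
    max_line: int = 500
    compare_type: int = 1        # 1 = string, 2 = token, 3 = Simhash
    line_gap_dis: float = 0.3
    similarity: float = 0.7
    language: str = "java"
    extensions: str = "java"     # comma-separated list

    def extension_list(self) -> list[str]:
        # Split the comma-separated extensions and normalise them.
        return [e.strip().lstrip(".").lower()
                for e in self.extensions.split(",") if e.strip()]

    def validate(self) -> list[str]:
        """Return parameter prompts; an empty list means the config is usable.

        ASSUMPTION: the ranges below are illustrative, not taken from the source.
        """
        problems = []
        if self.min_line < 1:
            problems.append("min-line must be at least 1")
        if self.max_line < self.min_line:
            problems.append("max-line must be >= min-line")
        if self.compare_type not in (1, 2, 3):
            problems.append("compare-type must be 1 (string), 2 (token) or 3 (Simhash)")
        if not 0.0 <= self.line_gap_dis <= 1.0:
            problems.append("line-gap-dis must be between 0 and 1")
        if not 0.0 < self.similarity <= 1.0:
            problems.append("similarity must be in (0, 1]")
        if not self.extension_list():
            problems.append("extensions must name at least one file extension")
        return problems
```

A caller would print the returned prompts and stop when the list is non-empty, and otherwise proceed to source code analysis.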
[0009] (2) Source code analysis;
[0010] Based on the input parameters, the program loads a list of files. If the input parameter is a directory, it recursively lists all files within that directory; if the input parameter is a file, the program reads the path list from the file. The program filters and extracts the source code files to be analyzed, and breaks them down into code blocks to be detected based on a preset clone code detection granularity. Specifically, the breakdown process involves reading the source code text character by character, identifying identifiers and other elements according to the encoding rules of the target language, and then extracting code blocks (method blocks or statement blocks) according to the syntax rules of the target language.
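The file-list loading rule (a directory is listed recursively, a file is read as a list of paths) could look roughly like the following Python sketch. The function name `load_file_list` and the one-path-per-line format of the list file are assumptions; the splitting of files into code blocks is not shown.

```python
import os


def load_file_list(path: str, extensions: list[str]) -> list[str]:
    """Collect candidate source files from a directory or from a path-list file."""
    wanted = tuple("." + e.lower() for e in extensions)
    if os.path.isdir(path):
        # Directory: recursively list every file below it.
        candidates = [os.path.join(root, name)
                      for root, _dirs, files in os.walk(path)
                      for name in files]
    else:
        # File: ASSUMPTION - it holds one source path per line.
        with open(path, encoding="utf-8") as fh:
            candidates = [line.strip() for line in fh if line.strip()]
    # Keep only files whose extension matches the configuration.
    return sorted(p for p in candidates if p.lower().endswith(wanted))
```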
[0011] (3) Sort the code blocks to be detected;
[0012] The code blocks are sorted based on their line count. A parallelized merge sort algorithm is used to achieve fast sorting in this step. The sorted code blocks are stored in a hash table (map), and each code block contains information such as its index number, code path, and start and end line numbers.
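One simple way to parallelize a merge sort is to sort chunks concurrently and then merge the sorted chunks; the Python sketch below takes that approach. The `CodeBlock` fields follow the information named in the text, while the names `CodeBlock`, `parallel_sort_blocks`, and `workers` are assumptions. In CPython, threads mainly illustrate the structure here; a real tool might use processes or another runtime to gain speed.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeBlock:
    """Hypothetical code block record: path plus start and end line numbers."""
    path: str
    start_line: int
    end_line: int

    @property
    def line_count(self) -> int:
        return self.end_line - self.start_line + 1


def _by_line_count(block: CodeBlock) -> int:
    return block.line_count


def parallel_sort_blocks(blocks: list[CodeBlock], workers: int = 4) -> dict[int, CodeBlock]:
    """Sort chunks concurrently, merge them, and return a map index -> block."""
    if not blocks:
        return {}
    chunk = -(-len(blocks) // workers)  # ceiling division
    chunks = [blocks[i:i + chunk] for i in range(0, len(blocks), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(lambda c: sorted(c, key=_by_line_count), chunks))
    # k-way merge of the already sorted chunks (the "merge" half of merge sort).
    merged = heapq.merge(*sorted_chunks, key=_by_line_count)
    # The index number is the block's position in ascending line-count order.
    return dict(enumerate(merged))
```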
[0013] (4) Initialize the comparator;
[0014] Based on the comparison type specified in the configuration file, select the corresponding comparator for initialization. There are three types of comparators: raw string comparison, locality-sensitive hash comparison, or source code tokenization comparison. The raw string comparator treats each line of code as a string and uses the programming language's built-in string comparison functionality to compare the similarity of code lines. The locality-sensitive hash comparator converts each line of code into a 64-bit 0 / 1 sequence and then compares the Hamming distance between two 0 / 1 sequences; if the Hamming distance is less than 4, the two strings are considered similar. The source code tokenization comparator tokenizes the identifier of each line of source code and then compares the similarity of the tokenized strings.
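To make the comparator descriptions concrete, here is a hedged Python sketch of the raw string comparator and a Simhash-style locality-sensitive comparator with the stated threshold (Hamming distance less than 4 on a 64-bit fingerprint). The tokenization regex, the use of MD5 as the per-token hash, and the whitespace trimming in the string comparator are assumptions; the source does not specify them. The tokenization comparator is not shown.

```python
import hashlib
import re


def string_similar(line1: str, line2: str) -> bool:
    """Raw string comparator. ASSUMPTION: surrounding whitespace is ignored."""
    return line1.strip() == line2.strip()


def simhash64(line: str) -> int:
    """Build a 64-bit Simhash fingerprint of one code line."""
    weights = [0] * 64
    for token in re.findall(r"\w+|[^\w\s]", line):
        # ASSUMPTION: the first 8 bytes of MD5 serve as the token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is 1 when the majority of token hashes set that bit.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def simhash_similar(line1: str, line2: str, max_distance: int = 4) -> bool:
    """Lines are similar when the Hamming distance is less than max_distance."""
    return hamming(simhash64(line1), simhash64(line2)) < max_distance
```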
[0015] (5) Locate the target clone pair;
[0016] Using the comparator selected in the previous step, perform pairwise comparisons on all code blocks to be compared in the sorted code block list of step (3), calculate the similarity, and those with a similarity greater than the configured threshold are considered clone code pairs. The specific process is as follows:
[0017] 1) Check the size of the input code block object list. If it is less than 2, return an empty list directly; if the list size is greater than or equal to 2, continue with the following steps.
[0018] 2) Search for code pairs in the list of code block objects and return a list containing the code pairs. Let the total number of code blocks be N. Declare an N*N two-dimensional array to represent the matrix of code blocks to be compared for similarity. Based on given values, extract a predetermined number of candidate code pairs from the upper triangular matrix for comparison. Each code pair can be processed independently in subsequent steps, thus achieving parallelized clone code detection.
[0019] 3) Loop Comparison: Within a loop, the generated candidate code pairs are processed in parallel. For each code pair, the following operations are performed:
[0020] a. Split the code pair into two code blocks, and retrieve the corresponding code block objects from the code block list based on the indices of these two code blocks.
[0021] b. Calculate the line spacing difference between the two code block objects and filter them according to the line-gap-dis configuration.
[0022] c. Calculate the similarity between two code block objects, i.e., the proportion of similar code between the two code snippets. If the similarity is greater than or equal to the configured similarity threshold, the code block pair is identified as a clone pair; if the similarity is less than the configured similarity threshold, the next pair is selected from the candidate code pairs, and the similarity of the new code pair is calculated. The specific calculation steps are as follows:
[0023] c1. Initialize two variables: sequence (string sequence) and sameCounter (number of similar lines of code);
[0024] c2. Divide the code block to be compared into code line sequences codes1 and codes2;
[0025] c3. Iterate through codes2, the code sequence with the larger number of lines, and compare its current expression (statement) with all expressions in the code sequence codes1. If the two expressions are the same, append the character 1 to sequence, increment sameCounter by 1, and remove the two compared expressions from their code sequences. If statement is different from all lines of code in codes1, append the character 0 to sequence.
[0026] c4. Repeat c3 until the last expression of the code sequence codes2 is reached. If the ratio of sameCounter to the number of lines in codes1 is greater than the set threshold, then these two method blocks are identified as cloned code.
[0027] d. Determine the type of clone pair based on the length of the same sequence in the code snippets. Use regular expressions to match consecutive 1s in the sequence. If the number of consecutive 1s is less than 3, the type is ordinary third-type clone code; otherwise, the type is expression-interleaved third-type clone code.
[0028] e. Add the found clone pairs to the pairs list. Due to concurrent access, a lock is used to ensure the thread safety of the pairs list.
[0029] f. Output processing progress: After each loop iteration, output the processing progress information.
[0030] g. Return a list of clone pairs: After all candidate code pairs have been processed, return a list of found clone pairs, pairs.
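The loop in steps a to g can be sketched in Python as follows: candidate pairs are processed concurrently, filtered by line gap, scored, and collected under a lock. The names `find_clone_pairs` and `shared_line_ratio` are hypothetical, the line-gap formula (gap relative to the longer block) is an assumption because the source does not define it, and progress output is left out. `shared_line_ratio` is only a toy stand-in for the comparator-based similarity.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def shared_line_ratio(a: list[str], b: list[str]) -> float:
    """Toy similarity for the demo: shared distinct lines / lines in the shorter block."""
    shorter = min(len(a), len(b))
    return len(set(a) & set(b)) / shorter if shorter else 0.0


def find_clone_pairs(blocks, candidate_pairs, similarity_fn,
                     threshold=0.7, line_gap_dis=0.3, workers=4):
    """Return the index pairs whose similarity reaches the threshold."""
    if len(blocks) < 2:
        return []                      # step 1): nothing to compare
    pairs = []
    lock = threading.Lock()            # step e: guards the shared pairs list

    def process(pair):
        i, j = pair                    # step a: look up both blocks by index
        a, b = blocks[i], blocks[j]
        shorter, longer = sorted((len(a), len(b)))
        # Step b. ASSUMPTION: the gap is measured relative to the longer block.
        if longer == 0 or (longer - shorter) / longer > line_gap_dis:
            return
        if similarity_fn(a, b) >= threshold:   # step c
            with lock:
                pairs.append((i, j))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process, candidate_pairs))  # consume to surface errors
    return sorted(pairs)               # step g: deterministic output order
```

Because each pair is handled independently, the lock is only held for the short append, which keeps contention low.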
[0031] Compared with the prior art, the technical solution of the present invention has the following advantages and positive effects: each step of the method of the present invention can be parallelized, which can greatly improve the detection efficiency of cloned code; in addition, the present invention realizes the detection of expression-interleaved cloned code by comparing code lines.
Attached Figure Description
[0032] Figure 1 is a flowchart of the method of this invention;
[0033] Figure 2 is a framework diagram of the clone detection provided by this invention.
Detailed Implementation
[0034] To make the technical solutions of the embodiments of the present invention clearer, the present invention will be described in detail below with reference to the accompanying drawings. The embodiment is shown in Figure 1 and Figure 2. This embodiment of a parallelized source code clone detection method includes the following steps:
[0035] (1) Configure the detection parameters. The following are the currently verified high-performance parameters: max-line=500, compare-type=1, line-gap-dis=0.3, similarity=0.7, min-line=5, language=java, extensions=java.
[0036] (2) Read the source file to be tested according to the passed source code path. Then, iterate through and parse all the files to obtain a list of code blocks. (This example uses a method block.)
[0037] (3) Based on (2), sort all code blocks in ascending order of line count to obtain the sorted code block sequence.
[0038] (4) After (3) is completed, initialize the code similarity comparator according to the configured parameters. In this example, the string comparator StringComparator is used.
[0039] (5) Based on (4), start comparing method blocks, identify and extract clone code.
[0040] 1) First, extract several subsequences of a preset size from the code sequence.
[0041] Then, these subsequences are processed in parallel.
[0042] 2) For each subsequence, take each pair of method blocks out of the sequence and compare their similarity. Taking the pair of code blocks ci and ch as an example, the comparison process is as follows:
[0043] c1. Initialize two variables: sequence (the string sequence) and sameCounter (the number of similar lines of code), where sequence is an empty string and sameCounter = 0.
[0044] c2. Divide the code blocks to be compared, ci and ch, line by line into code line sequences.
[0045]-[0046] (The listings of the two code line sequences are not recoverable from this text.)
[0047] c3. Iterate through the code sequence of ch, the block with the larger number of lines, comparing its current expression (statement) with all expressions in the code sequence of ci. If the two expressions are the same, append the character 1 to sequence, increment sameCounter by 1, and remove the two compared expressions from their code sequences. If statement is different from all the code lines of ci, append the character 0 to sequence.
[0048] c4. Repeat c3 until the last expression of the code sequence of ch is reached. Here sameCounter = 3, and its ratio to the number of lines of ci is 0.75 (that is, ci has 4 lines), which is greater than the set threshold of 0.7. Therefore, these two method blocks are identified as cloned code.
[0049] (6) Determine the type of clone pair based on the length of the runs of identical lines in the code fragment. Use the regular expression [1]+ to match consecutive 1s in sequence. If the number of consecutive 1s is less than 3, the type is ordinary third-type clone code; otherwise, the type is expression-interleaved third-type clone code. In this example, sequence = 1010100, so the clone pair is identified as ordinary third-type clone code.
[0050] (7) Add the found clone pair (ci, ch) to the pairs list. Due to concurrent access, a lock is used to ensure the thread safety of the pairs list.
[0051] (8) Output processing progress: After each loop iteration, output the processing progress information.
[0052] (9) Return the list of clone pairs: After all candidate code pairs have been processed, return the list of clone pairs found.
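The worked example in steps c1 to c4 and (6) can be reproduced with a short Python sketch. The code lines in `ci` and `ch` are hypothetical, chosen only so that the result matches the values stated in the text (sequence = 1010100, sameCounter = 3, ratio 0.75). The function names are assumptions, exact string equality stands in for the string comparator, and "number of consecutive 1s" is interpreted here as the longest run of 1s.

```python
import re


def match_lines(codes1: list[str], codes2: list[str]) -> tuple[str, int]:
    """Steps c1-c4: walk the longer sequence codes2 and mark matches against codes1."""
    remaining = list(codes1)       # copy, so matched lines can be removed
    sequence = ""                  # c1: match pattern, one character per line
    same_counter = 0               # c1: number of similar lines
    for statement in codes2:       # c3, repeated until the last line (c4)
        if statement in remaining:
            remaining.remove(statement)   # a matched line is not reused
            sequence += "1"
            same_counter += 1
        else:
            sequence += "0"
    return sequence, same_counter


def classify(sequence: str) -> str:
    """Step (6). ASSUMPTION: the longest run of 1s decides the clone type."""
    longest = max((len(run) for run in re.findall(r"[1]+", sequence)), default=0)
    if longest < 3:
        return "ordinary third-type clone"
    return "expression-interleaved third-type clone"


# Hypothetical method bodies: ci has 4 lines, ch has 7 lines.
ci = ["int a = 1;", "int b = 2;", "int c = a + b;", "return c;"]
ch = ["int a = 1;", "log(a);", "int b = 2;", "log(b);",
      "int c = a + b;", "check(c);", "return c * 2;"]

sequence, same_counter = match_lines(ci, ch)
ratio = same_counter / len(ci)     # 3 / 4 = 0.75, above the 0.7 threshold
clone_type = classify(sequence)    # longest run of 1s in "1010100" is 1
```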
[0053] Table 1 shows the experimental results of the clone code detection tool designed in this patent. The configuration for this experiment is as follows: max-line=500, compare-type=1, line-gap-dis=0.3, similarity=0.7, min-line=5, language=java, extensions=java.
[0054] The file list is read using a single thread, and the code blocks are sorted using Java's built-in sorting API. One set of code pairs (containing 10,000 code pairs) is retrieved in each round of comparison. The experimental machine is a personal laptop with the following configuration: Intel Core i7 CPU; 8GB RAM.
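The retrieval of one set of candidate code pairs per round can be sketched in Python as below. Instead of declaring the N*N array described in the method, this sketch enumerates the upper-triangular index pairs (i < j) lazily and yields them in batches; that substitution, the function name `candidate_pair_batches`, and the parameter names are assumptions. The default batch size follows the 10,000 pairs per round used in the experiment.

```python
from itertools import combinations, islice


def candidate_pair_batches(n: int, batch_size: int = 10_000):
    """Yield lists of (i, j) index pairs with i < j, at most batch_size per list."""
    pairs = combinations(range(n), 2)   # upper triangle of the N*N matrix
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return                      # all pairs have been handed out
        yield batch
```

Each yielded batch can then be handed to the parallel comparison loop, so memory use stays bounded even when N is large.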
[0055] Table 1. Statistics of Code Cloning Experiment Results
[0056]
[0057] The above embodiments are preferred embodiments of the present invention and are not intended to limit the invention. Those skilled in the art can make modifications to the above embodiments without departing from the spirit and scope of the invention. Therefore, all improvements made to the present invention should be within the protection scope of the present invention.
Claims
1. A parallelized source code clone detection method, characterized in that it includes the following steps:
S1 Program initialization: check whether the input parameters conform to the specifications. If they do, continue to load the configuration data and proceed to the next step; otherwise, provide parameter prompts.
S2 Source code parsing: based on the input parameters, the program loads the file list, filters and extracts the source code files to be analyzed, and breaks down the source code into code blocks to be detected according to the preset clone code detection granularity.
S3 Sorting of code blocks to be detected: the code blocks are sorted based on the number of lines in the code block.
S4 Comparator initialization: select a comparator for initialization based on the comparison type in the configuration file.
S5 Finding target clone pairs: use the comparator to compare all code blocks in the sorted code block list pairwise and calculate the similarity. Code blocks with a similarity greater than a configured threshold are considered clone pairs. The specific process is as follows:
S5.1 Check the size of the input code block object list. If it is less than 2, return an empty list directly. If the list size is greater than or equal to 2, continue with the following steps.
S5.2 Search for code pairs in the list of code block objects and return a list containing code pairs; assume the total number of code blocks is N, and declare an N*N two-dimensional array, which represents the code block matrix to be compared for similarity; according to the given values, take out a preset number of candidate code pairs from the upper triangular matrix for comparison; each code pair is independently subjected to subsequent steps to achieve parallelized clone code detection.
S5.3 Loop comparison: the generated candidate code pairs are processed in parallel within a single loop, as follows:
a. Split the code pair into two code blocks, and retrieve the corresponding code block objects from the code block list based on the indices of these two code blocks;
b. Calculate the line spacing difference between the two code block objects and filter them according to the line-gap-dis configuration;
c. Calculate the similarity between two code block objects, that is, the proportion of similar code between the two code snippets; if the similarity is greater than or equal to the similarity threshold in the configuration, the code block pair is identified as a clone pair; if the similarity is less than the similarity threshold in the configuration, the next pair is taken from the candidate code pairs, and the similarity of the new code pair is calculated. The specific calculation process for step c is as follows:
c1. Initialize two variables: sequence (string sequence) and sameCounter (number of similar lines of code);
c2. Divide the code block to be compared into code line sequences codes1 and codes2;
c3. Iterate through codes2, the code sequence with the larger number of lines, and compare its current expression (statement) with all expressions in the code sequence codes1. If the two expressions are the same, append the character 1 to sequence, increment sameCounter by 1, and remove the two compared expressions from their code sequences. If statement is different from all lines of code in codes1, append the character 0 to sequence.
c4. Repeat c3 until the last expression of the code sequence codes2 is reached. If the ratio of sameCounter to the number of lines in codes1 is greater than the set threshold, then these two method blocks are identified as cloned code.
d. Determine the type of clone pair based on the length of the same sequence in the code snippets; use regular expressions to match consecutive 1s in the sequence. If the number of consecutive 1s is less than 3, the type is ordinary third-type clone code; otherwise, the type is expression-interleaved third-type clone code.
e. Add the found clone pairs to the pairs list; due to concurrent access, use a lock to ensure the thread safety of the pairs list;
f. Output processing progress: after each loop iteration, output the processing progress information;
g. Return a list of clone pairs: after all candidate code pairs have been processed, return the list of found clone pairs, pairs.
2. The parallelized source code clone detection method according to claim 1, characterized in that, The input parameters include: minimum number of method lines (min-line), methods with fewer than this value are not included in the detection; maximum number of method lines (max-line), methods with more than this value are not included in the detection; code similarity comparator type (compare-type), where 1 represents comparison using strings, 2 represents comparison using token strings, and 3 represents comparison using Simhash; comparison range coefficient (line-gap-dis), methods with a line gap percentage greater than this value are not included in the detection; similarity threshold (similarity), methods with similarity values greater than this value are detected as cloned code; detection language; and source file extensions (separated by commas), source files with extensions matching the configuration will be parsed and included in the clone detection.
3. The parallelized source code clone detection method according to claim 1, characterized in that, The program loads the file list as follows: if the passed parameter is a directory, it recursively lists all files in the directory; if the passed parameter is a file, the program reads the path list in the file.
4. The parallelized source code clone detection method according to claim 1, characterized in that, The specific process of splitting the source code into code blocks to be detected is as follows: read the source code text character by character, identify the identifiers according to the encoding rules of the language to be detected, and then extract the code blocks according to the syntax rules of the language to be detected.
5. The parallelized source code clone detection method according to claim 4, characterized in that, In S3, a parallelized merge sort algorithm is used to sort the code blocks. The sorted code blocks are stored in a hash table map. Each code block contains an index number, a code path, and the start and end line numbers of the code.
6. The parallelized source code clone detection method according to claim 1, characterized in that, The comparator includes: raw string comparison, locality-sensitive hash comparison, and source code tokenization comparison; The raw string comparator treats each line of code as a string and uses the built-in string comparison function of the programming language to compare the similarity of the lines of code. The locality-sensitive hash comparator converts each line of code into a 64-bit 0/1 sequence, and then compares the Hamming distance between two 0/1 sequences. If the Hamming distance is less than 4, the two strings are considered similar. The source code tokenization comparator tokenizes the identifiers of each line of source code and then compares the similarity of the tokenized strings.