Method and apparatus for performing context retrieval on basis of code repository

By rewriting the question text and combining keyword retrieval with pre-trained-model inference over code repository filenames and content analysis, the method addresses the insufficient understanding of code structure in existing technologies and achieves efficient, accurate context retrieval at the code repository level.

WO2026138158A1 · PCT designated stage · Publication Date: 2026-07-02 · ALIPAY (HANGZHOU) DIGITAL SERVICE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: ALIPAY (HANGZHOU) DIGITAL SERVICE TECHNOLOGY CO LTD
Filing Date: 2025-10-30
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing retrieval-augmented generation (RAG) techniques struggle to accurately understand the code structure in modern software projects, leading to large errors in code context retrieval and impacting the accuracy of downstream code processing tasks.

Method used

The question text is rewritten and used for keyword retrieval, and a pre-trained model performs inference over the code repository filenames to find further repository files related to the rewritten text. The resulting files are filtered by relevance and divided into context fragments, which are in turn filtered by content relevance to improve retrieval accuracy.

Benefits of technology

It improves the ability of large language models to understand and retrieve code repositories, ensures the accuracy and coverage of retrieval results, and enhances the effectiveness of code repository-level retrieval augmentation tasks.

✦ Generated by Eureka AI based on patent content.


Abstract

Provided in the embodiments of the present description is a method for performing context retrieval on the basis of a code repository. The method comprises: receiving first question text; rewriting the first question text, performing, on the basis of obtained rewritten text, keyword retrieval to obtain several first repository files that match the rewritten text, and incorporating the first repository files into a first set; using a pre-trained target model to perform target inference to obtain several supplementary repository files related to the rewritten text, and incorporating the supplementary repository files into the first set; screening the first set to retain repository files that meet a first standard, wherein the first standard is defined on the basis of the summary correlation between the repository files and the rewritten text; and dividing any repository file in the first set into several context snippets, and selecting a target context snippet that meets a second standard, wherein the second standard is defined on the basis of the content correlation between the context snippets and the first question text, and the target context snippet is used for processing a downstream task related to the first question text.

Description

A method and apparatus for contextual retrieval based on code repositories

[0001] This application claims priority to Chinese Patent Application No. 202411952485.4, filed on December 26, 2024, entitled "A Method and Apparatus for Contextual Retrieval Based on Code Repository", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] One or more embodiments of this specification relate to the field of large language models and retrieval-augmented generation, and in particular to a method and apparatus for contextual retrieval based on a code repository.

Background Technology

[0003] With the continuous development of artificial intelligence technology, large language models are widely used in various software engineering projects, covering many practical application scenarios such as code problem solving, automatic program repair, automatic code completion, and software engineering intelligent agent development. The development of these application scenarios usually relies on retrieval-augmented generation (RAG) technology, which expands the knowledge scope of large language models by leveraging external knowledge bases.

[0004] However, as the complexity of software engineering projects continues to rise, the references and calls between functional modules contained in the source code files become intricate. Traditional retrieval-augmented generation techniques struggle to understand the code structure of modern software projects, inevitably leading to errors in identifying and locating code contexts. This results in retrieving code contexts that are not highly relevant to the question query, thus affecting the accuracy of downstream code processing tasks.

[0005] Therefore, a solution is desired that improves, through technical means, the ability of large language models to understand code repositories containing source code from numerous software engineering projects, so as to accurately locate the code context related to the question query and improve the accuracy of retrieval.

Summary of the Invention

[0006] One or more embodiments of this specification describe a method and apparatus for contextual retrieval based on a code repository, which enhances the ability of a large language model to understand repository files in the code repository, improves the ability of the large language model to locate context fragments, and thus better completes the repository-level retrieval enhancement task.

[0007] According to the first aspect, a method for contextual retrieval based on a code repository is provided, the code repository containing several repository files, the method comprising:

[0008] Receive the first question text.

[0009] The first question text is rewritten, and based on the rewritten text, a keyword search is performed to obtain several first repository files that match the rewritten text, which are then included in the first set.

[0010] Based on the filenames of the repository files, target inference is performed using a pre-trained target model to obtain several supplementary repository files related to the rewritten text, which are then included in the first set.

[0011] The first set is filtered to retain repository files that meet a first criterion, which is defined based on the summary relevance of the repository file to the rewritten text.

[0012] Any repository file in the first set is divided into several context fragments, and target context fragments that meet the second criterion are selected. The second criterion is defined based on the relevance of the context fragments to the content of the first question text. The target context fragments are used for downstream task processing related to the first question text.

[0013] According to one implementation, the rewritten text includes several first identifiers and first descriptions representing code entities; the target reasoning includes:

[0014] The first identifiers and/or first descriptions are input into the first model, which infers, based on the filenames of the repository files, several supplementary repository files that contain the definitions of the code entities.

[0015] In one scenario of the above implementation, the first model is pre-trained based on a target task, which is used to predict the naming conventions of repository files with code entity definitions.

[0016] In one scenario of the above implementation, the code entity includes at least one of the following: module, package, class, function, variable.

[0017] According to one implementation, the target reasoning includes:

[0018] Subtree retrieval is performed in a pre-constructed repository tree to determine the first subtree containing existing repository files in the first set. The nodes of the repository tree contain the filenames of the repository files in the code repository, and the edges of the repository tree represent the directory relationships between the repository files.

[0019] The first subtree is input into the second model, which infers to obtain supplementary repository files related to the rewritten text in the first subtree.

[0020] In one scenario of the above implementation, the subtree retrieval includes:

[0021] The first node is determined to be the lowest common ancestor of the existing repository files in the first set in the repository tree.

[0022] Using the first node as the root, the first subtree is extracted from the repository tree.
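The subtree retrieval described above can be illustrated with a minimal Python sketch. It assumes the repository tree is represented implicitly by POSIX-style file paths, and it treats the lowest common ancestor as the deepest directory containing every file already in the first set. The function names `lowest_common_ancestor` and `extract_subtree` are hypothetical and do not come from the specification.

```python
from pathlib import PurePosixPath


def lowest_common_ancestor(paths):
    """Return the deepest directory that contains every path in `paths`."""
    # Assumption: directories are implied by the path components of each file.
    parts_list = [PurePosixPath(p).parent.parts for p in paths]
    common = []
    for level in zip(*parts_list):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return "/".join(common)


def extract_subtree(all_files, root):
    """Return the repository files located under `root` (the first subtree)."""
    prefix = root + "/" if root else ""
    return sorted(f for f in all_files if f.startswith(prefix))
```

For a first set containing src/app/main.py and src/app/utils/io.py, the lowest common ancestor is src/app, so the extracted subtree also exposes sibling files such as src/app/utils/text.py as candidates for the second model to consider.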

[0023] According to one implementation, the filtering of the first set includes:

[0024] For any second repository file in the first set, determine summary information based on its content.

[0025] Calculate the first vector similarity between the summary information and the rewritten text.

[0026] If the first vector similarity is less than the first similarity threshold preset in the first criterion, the second repository file is removed from the first set.
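As a rough illustration of this filtering, the following Python sketch keeps only files whose summary is sufficiently similar to the rewritten text. The bag-of-words `embed` function is a toy stand-in for whatever embedding model an implementation would actually use, and the names `embed`, `cosine`, `filter_first_set`, and the default threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_first_set(summaries, rewritten_text, threshold=0.2):
    """Drop files whose summary similarity falls below the threshold."""
    query_vec = embed(rewritten_text)
    return {
        path: summary
        for path, summary in summaries.items()
        if cosine(embed(summary), query_vec) >= threshold
    }
```

Here `summaries` maps each file path in the first set to its summary information; files scoring below the threshold are removed, matching the rule stated above.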

[0027] In one scenario of the above implementation, the second repository file is a code file, and the determination of the summary information includes:

[0028] Obtain the abstract syntax tree corresponding to the second repository file. The abstract syntax tree contains definition nodes representing code entities in the second repository file. The definition nodes include the definition header and definition body of the corresponding code entity.

[0029] For any definition node in the abstract syntax tree, determine its code entity summary, which includes the definition header of the node and the definition body that satisfies the first constraint; the first constraint requires that the number of tokens contained in the text be less than a preset first quantity threshold.

[0030] By concatenating the code entity summaries corresponding to several definition nodes in the abstract syntax tree, the summary information of the second repository file is obtained.

[0031] In one scenario of the above implementation, the second repository file is a non-code text file, and the determination of summary information includes:

[0032] The content of the second repository file is divided into several text blocks based on the first rule. The first rule specifies a first number of text segments as the unit length of a text block.

[0033] For any text block in the second repository file, extract its text block summary.

[0034] By concatenating the text block summaries of the individual text blocks, the summary information of the second repository file is obtained.

[0035] According to one implementation, if the repository file is a code file, then the context fragment is the definition of the code entity in the repository file.

[0036] If the repository file is a non-code text file, then the context fragment is a text block obtained by dividing the text using a first number of text lines, preset in the second rule, as the unit length.

[0037] In one implementation scenario described above, the second rule further includes a second number of overlapping text lines between adjacent text blocks.
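The line-based division with overlap can be sketched in Python as below. The parameter names `lines_per_block` and `overlap` correspond to the first and second numbers of text lines in the second rule; the default values and the function name `split_into_blocks` are illustrative assumptions.

```python
def split_into_blocks(text, lines_per_block=4, overlap=1):
    """Split text into blocks of `lines_per_block` lines sharing `overlap` lines."""
    if not 0 <= overlap < lines_per_block:
        raise ValueError("overlap must be smaller than lines_per_block")
    lines = text.splitlines()
    step = lines_per_block - overlap
    blocks = []
    for start in range(0, len(lines), step):
        blocks.append("\n".join(lines[start:start + lines_per_block]))
        # Stop once the current block already reaches the end of the text.
        if start + lines_per_block >= len(lines):
            break
    return blocks
```

The overlap means that a sentence or list item straddling a block boundary still appears whole in at least one block, at the cost of some duplicated lines.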

[0038] According to one implementation, selecting the target context fragments that satisfy the second criterion includes:

[0039] For any context fragment, calculate the second vector similarity between its content and the first question text.

[0040] If the second vector similarity is higher than the second similarity threshold preset in the second criterion, the context fragment is determined to be a target context fragment.

[0041] In one scenario of the above implementation, the method further includes: sorting the obtained target context fragments based on the corresponding second vector similarity.
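The selection and sorting step can be illustrated with a short Python sketch. It assumes the second vector similarity has already been computed for each context fragment, and it sorts in descending order, which is one plausible reading since the text does not state a direction. The function name `select_target_fragments` is hypothetical.

```python
def select_target_fragments(scored_fragments, threshold):
    """Keep fragments scoring above the threshold, most similar first."""
    kept = [(frag, score) for frag, score in scored_fragments if score > threshold]
    # Assumption: higher similarity should rank earlier.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```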

[0042] According to a second aspect, an apparatus for contextual retrieval based on a code repository is provided, the code repository comprising a plurality of repository files, the apparatus comprising:

[0043] The receiving module is configured to receive the first question text.

[0044] The retrieval module is configured to rewrite the first question text, perform keyword retrieval based on the obtained rewritten text, obtain several first repository files that match the rewritten text, and include them in a first set.

[0045] The inference module is configured to perform target inference using a pre-trained target model based on the filename of the repository file, and obtain several supplementary repository files related to the rewritten text, which are then included in the first set.

[0046] The first filtering module is configured to filter the first set to retain repository files that meet a first criterion, which is defined based on the summary relevance of the repository file to the rewritten text.

[0047] The second filtering module is configured to divide any repository file in the first set into several context fragments, select target context fragments that meet the second criterion, the second criterion being defined based on the relevance of the context fragments to the content of the first question text, and the target context fragments being used for downstream task processing related to the first question text.

[0048] According to a third aspect, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the method described in the first aspect.

[0049] According to a fourth aspect, a computing device is provided, including a memory and a processor, characterized in that the memory stores executable code, and when the processor executes the executable code, it implements the method described in the first aspect.

[0050] In summary, the methods and apparatus disclosed in the embodiments of this specification can analyze code repositories based on question text. The analysis process follows a sequence: first, a fast search based on filenames is performed, followed by a content-based similarity analysis. This improves both the analysis speed and the accuracy of the large language model's understanding of code structure, allowing for a progressively deeper retrieval of the code repository and the accurate retrieval of relevant context fragments for the question text. The embodiments of this specification enhance the understanding and retrieval capabilities of large language models for code repositories, helping them better perform code repository-level contextual retrieval tasks.

Attached Figure Description

[0051] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.

[0052] Figure 1 is a schematic diagram of a method framework for context retrieval based on a code repository disclosed in this specification;

[0053] Figure 2 is a flowchart of a method for context retrieval based on a code repository according to an embodiment of this specification;

[0054] Figure 3 is a schematic diagram of a repository tree provided in an embodiment of this specification;

[0055] Figure 4A is a schematic diagram of an exemplary subtree construction provided according to an embodiment of this specification;

[0056] Figure 4B is a schematic diagram of an exemplary subtree retrieval provided according to an embodiment of this specification;

[0057] Figure 5 is a schematic diagram of an exemplary abstract syntax tree provided according to an embodiment of this specification;

[0058] Figure 6 is a schematic diagram of code entity summary extraction according to the embodiments of this specification;

[0059] Figure 7 is a schematic diagram of an apparatus for contextual retrieval based on a code repository according to an embodiment of this specification.

Detailed Implementation

[0060] The solutions provided in the embodiments of this specification will now be described with reference to the accompanying drawings.

[0061] In one or more embodiments of this specification, the process and limitations of large language models performing relevant context retrieval on question text will be described using the Python programming language as an example. Technical solutions to address these limitations will be illustrated using Python programming examples. It should be noted that although Python is used as an example in some embodiments of this specification, this does not imply limitation on application scenarios or technical tools. The technical concepts embodied in the various embodiments of this specification can be applied to various large language models that support code context retrieval and the code projects / repositories built using their supported programming languages.

[0062] As mentioned earlier, to improve developer efficiency during the R&D process, Large Language Models (LLMs) have been widely used in software engineering projects to solve various coding tasks. Thanks to training with massive amounts of code text, LLMs have shown considerable potential in code understanding, searching, analysis, and generation.

[0063] Many coding tasks involve numerous question queries described in Natural Language (NL). Currently, most LLMs (or pre-trained large code models) employ a NL-to-Code processing paradigm, lacking the ability to integrate additional code context. However, in many complex software engineering projects, directly processing question queries without additional code context information presents numerous challenges. For example, when faced with unfamiliar codebases or when the question query lacks relevant code references, the inference performance of LLMs will be significantly reduced. Furthermore, solving specific software engineering question queries based on parameters learned during pre-training becomes extremely difficult. For instance, LLMs cannot keep up with real-time updates to public codebases or grasp the code characteristics of private codebases, and so on.

[0064] Retrieval-augmented generation (RAG) techniques can help LLMs retrieve and integrate relevant documents (context) needed during the reasoning process, thereby effectively improving their ability to handle question queries. Specifically, when an LLM needs to answer a question query, RAG can help it retrieve relevant context fragments from a large external knowledge base and pass them to the LLM. Then, the LLM can use these context fragments to guide downstream reasoning tasks related to the question query, effectively answering the query and improving the quality and accuracy of predictions.

[0065] In the field of software engineering, external knowledge bases related to code typically refer to code repositories that store source files. Context fragments can be code snippets, documents, or metadata related to the question query, such as "path/to/test.py:10-20". Code repositories contain a large number of code files (e.g., .py, .cpp, .java, .html, etc.) related to specific software engineering projects, as well as non-code text files (e.g., .txt, .md, resource files, etc.). There is inevitably a semantic gap between these source files written in different languages and the question queries described in natural language. Therefore, accurately understanding the true intent behind the question query text and translating it into a retrieval operation for the corresponding context fragments in the code repository during RAG contextual retrieval is a highly challenging task.

[0066] In some related technologies, contextual retrieval is considered a standard keyword search problem, implemented through search engines. These search engines are typically built on pre-stored vector databases, using distributed vectors to capture and represent the features of contextual fragments, and matching and retrieving based on the vector similarity between keywords in the query text and contextual fragments. However, this method relies excessively on the expressive power of vectors for single contextual fragment features. When faced with source files in code repositories with complex relationships, it cannot understand the structural relationships between repository files or code entities, often resulting in inaccurate search results and impacting the efficiency and accuracy of contextual retrieval.

[0067] Other related technologies employ the Reason and Act (ReAct) paradigm, leveraging the powerful semantic understanding capabilities of large language models for contextual retrieval. During the retrieval process, the large language model is guided to generate iterative retrieval plans, interweaving reasoning trajectories and actions related to the question query. This allows the large language model to perform dynamic reasoning, create and adjust action execution plans (reasoning → action), and interact with external knowledge bases, incorporating relevant information into the reasoning (action → reasoning), progressively executing to approximate the optimal retrieval result. When necessary, these methods also integrate knowledge graph technology to model the code repository and assist retrieval through graph search. However, this technology heavily relies on the large language model's ability to plan retrieval. In certain scenarios, such as handling highly abstract or fuzzy question queries, retrieval biases can easily occur, causing the large language model to fail to correctly understand the deep semantics of the question query or code repository. This makes the retrieval process difficult to control, compromising the accuracy of contextual fragment retrieval and negatively impacting subsequent code task processing. Furthermore, the application of knowledge graphs further reduces the generalization ability of this method.

[0068] This demonstrates that the existing technologies each have shortcomings in handling code repository-level context retrieval tasks. Keyword search engines can quickly locate relevant context, but their accuracy is low. Retrieval based on dynamic reasoning yields more accurate results, but suffers from a complex, hard-to-control process and poor generalization. The inventors observed that, despite its less-than-ideal accuracy, keyword retrieval often returns context results that are close to the target context needed to answer the question. In other words, keyword retrieval captures some of the true results along with context that lies close to them. Based on this finding, the semantic understanding capabilities of large language models can be used to supplement the context results of keyword retrieval and recover context that was missed. Careful filtering and sorting of the combined results then allows the context fragments most relevant to the question to be identified more accurately, improving overall retrieval performance.

[0069] In summary, to address the aforementioned technical problems, the inventors propose a method for context retrieval based on a code repository. This method enhances a large language model's understanding of the repository files in the code repository and improves the model's ability to locate code context, thereby better fulfilling repository-level retrieval-augmented tasks. Specifically, Figure 1 illustrates a schematic framework of a code repository-based context retrieval method in one embodiment. Referring to the figure, the method can be broadly divided into two stages: a file retrieval stage and a context fragment retrieval stage.

[0070] During the file retrieval stage, the question text is rewritten to remove redundant natural language expressions and extract core keywords and key descriptions. Subsequent keyword retrieval operations can be based on the rewritten text, searching the code repository for repository files matching the rewritten text and adding these files to the search result set (i.e., the first set). As mentioned earlier, relying solely on keyword search engines yields inaccurate context results. Therefore, an LLM can be used to supplement the search by inferring, from the filenames of the repository files, which files define the code entities mentioned in the rewritten text, adding the selected files to the first set and enriching the search results. Next, based on the summary information of each repository file in the first set, a summary relevance calculation can be performed against the rewritten text to filter out repository files with insufficient relevance, effectively reducing potential errors introduced during the initial retrieval process.

[0071] In the context fragment retrieval stage, the content of each repository file in the first set is divided into several context fragments. The relevance between each context fragment and the question text is analyzed, and those context fragments that meet certain relevance criteria are selected. These selected context fragments can be identified as target context fragments, completing the context retrieval task for the code repository.

[0072] Following the above technical concept, Figure 2 shows a flowchart of a method for context retrieval based on a code repository according to an embodiment of this specification. It is understood that this method can be executed by any device, equipment, platform, or cluster of devices with computing and processing capabilities. Referring to Figure 2, in one embodiment, the code repository contains several repository files, and the method includes at least the following steps:

S201: Receive a first question text.

S203: Rewrite the first question text, perform keyword retrieval based on the obtained rewritten text, obtain several first repository files matching the rewritten text, and include them in a first set.

S205: Based on the filenames of the repository files, perform target inference using a pre-trained target model to obtain several supplementary repository files related to the rewritten text, and include them in the first set.

S207: Filter the first set to retain repository files that meet a first criterion, the first criterion being defined based on the summary relevance of the repository files to the rewritten text.

S209: Divide any repository file in the first set into several context fragments, and select the target context fragments that meet the second criterion. The second criterion is defined based on the relevance of the context fragments to the content of the first question text. The target context fragments are used for downstream task processing related to the first question text.
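To show how steps S203 to S209 fit together, the following Python sketch wires them into one function. Every component is passed in as a callable, so the sketch makes no claim about how rewriting, keyword retrieval, model inference, or similarity scoring are actually implemented. The function name `retrieve_context`, its parameter names, and the representation of the repository as a dictionary from file path to file content are all assumptions made for illustration.

```python
def retrieve_context(question, repo, *, rewrite, keyword_search, infer_files,
                     summary_score, split, content_score,
                     file_threshold, fragment_threshold):
    """Run the two-stage retrieval on `repo`, a dict of path -> file content."""
    # S203: rewrite the question, then keyword retrieval builds the first set.
    rewritten = rewrite(question)
    first_set = set(keyword_search(rewritten, repo))

    # S205: supplement the first set by inference over the file names.
    first_set |= set(infer_files(rewritten, sorted(repo), first_set))

    # S207: keep files whose summary relevance to the REWRITTEN text is enough.
    first_set = {
        path for path in first_set
        if summary_score(repo[path], rewritten) >= file_threshold
    }

    # S209: split files into fragments, scored against the ORIGINAL question.
    targets = []
    for path in sorted(first_set):
        for fragment in split(path, repo[path]):
            score = content_score(fragment, question)
            if score > fragment_threshold:
                targets.append((score, path, fragment))
    return sorted(targets, reverse=True)
```

Note the asymmetry the steps call for: file-level filtering compares against the rewritten text, while fragment-level selection compares against the first question text itself.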

[0073] The specific execution method of each of the above steps will be described in detail below with reference to the accompanying drawings.

[0074] The code repository can be a general-purpose code repository consisting of source files from multiple engineering projects, a code repository containing a specific software project, or a code repository used by a private engineering project. This specification does not make any specific limitations on this.

[0075] In a code repository, all the files contained therein are collectively referred to as repository files. These files come in various types, including but not limited to: code files that define code entities and record code logic, and commentary text files written in natural language. The code entities include at least one of the following: module, package, class, function, and variable.

[0076] Code files can be program code written in any programming language, such as C++, Java, Python, etc. In some software engineering projects that employ hybrid programming methods, code files may also contain program code written in multiple programming languages. Non-code text files are typically recorded in natural language, such as .txt files for explanation, .md files written in markup languages, resource files that record application configuration information, and so on.

[0077] In the embodiments described in this specification, Python code will be used as an example. However, it should be understood that one or more embodiments in this specification are intended to provide a method for context retrieval based on a code repository, and are not limited to a specific programming language. The method provided by the embodiments of this invention can be applied to any scenario involving the technical concept provided in the embodiments of this invention.

[0078] In step S201, the first question text is received.

[0079] The first question text can be a natural language description of various issues related to the source code in the code repository, such as code location, bug fixing, adding new features, updating dependencies, etc. Depending on the issue, the first question text can contain a corresponding problem description. For example, when code location is required, the first question text can contain a functional description related to the code entity to be located; when bug fixing is required, the first question text can provide error messages and error outputs related to program runtime errors; when code must be generated to add new features, the first question text can include a functional description of the new features.

[0080] The first question text may also include various forms of text content. For example, the description of the question may be given in natural language, and examples of program code snippets may be given in the target programming language. In subsequent steps, the semantic understanding capabilities of the large language model can be used to process the first question text containing various formats of text content. This specification does not specifically limit the text format of the first question text in the embodiments.

[0081] Next, in step S203, the first question text is rewritten, and based on the rewritten text, a keyword search is performed to obtain several first repository files that match the rewritten text, which are then included in the first set.

[0082] The first question text provided by the user lacks a fixed writing format or paradigm, and can therefore be presented in various descriptive forms. Typically, the first question text suffers from overly brief descriptions, vague expressions, or a lack of necessary background information. These issues may make it difficult for the LLM to accurately capture the deeper semantics of the first question text during contextual retrieval. To address the problem of semantic sparseness in the first question text, this step involves appropriately rewriting the text to enrich or refine its semantic content, making it easier for the LLM to understand and process.

[0083] In some embodiments, specific technical methods can be employed to perform this rewriting task. For example, techniques such as Query2Doc and HyDE can be used to semantically expand and refine the first question text. These techniques can effectively transform the original question text into a more detailed and clearer version, thereby improving the accuracy of subsequent LLM understanding.

[0084] In other embodiments, a simpler and more direct method than Query2Doc, HyDE, and other technologies can be used to perform the rewriting task. Specifically, a Large Language Model (LLM) can be used to reason and summarize the first question text. During the reasoning process, the LLM is guided to focus on the core of the question text and ignore non-critical information, ensuring that the rewritten text accurately reflects the core semantics of the question. In this embodiment, complex text processing steps can be avoided by directly utilizing the semantic understanding capabilities of the LLM to rewrite the first question text, simplifying the process.

[0085] After the rewriting is complete, the rewritten text can be used to perform keyword searches in the code repository. The search will find the first repository files that match the rewritten text, and these files will be included in the first set of search results.

[0086] In some embodiments, the keyword retrieval can be a text frequency retrieval method such as BM25, or other retrieval methods that do not involve vector similarity comparison. Vector similarity is not used for retrieval matching because, in the context of code repository retrieval, question queries typically involve keywords representing specific code entities. These keywords are often names or codes assigned by developers in repository files, possessing strong coding characteristics but containing very little natural semantic information. In this context, using retrieval methods based on vector similarity comparison may not only introduce unnecessary semantic ambiguity, thus affecting retrieval accuracy, but may also increase computational burden and reduce retrieval efficiency. Therefore, to avoid these potential problems, a more direct and efficient term frequency retrieval method can be used in keyword retrieval.
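As an illustration of the term frequency retrieval described above, the following is a minimal sketch of BM25 scoring over repository files. The function names, the tokenizer, and the parameter values k1 = 1.5 and b = 0.75 are assumptions chosen for the example and are not taken from this specification.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep underscores so identifiers such as separability_matrix stay whole.
    return [t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)]


def bm25_rank(query, files, k1=1.5, b=0.75):
    """Return (path, score) pairs sorted by descending BM25 score.

    `files` maps a repository file path to its text content.
    """
    docs = {path: tokenize(body) for path, body in files.items()}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / max(n_docs, 1)

    # Document frequency: in how many files each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for path, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[path] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Files with a positive score would be candidates for the first set; how many of them are kept is left open here.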

[0087] After text rewriting and keyword retrieval, a first set is obtained, which contains several repository files (the first repository files) related to the rewritten text. In the next step S205: based on the filenames of the repository files, target inference is performed using a pre-trained target model to obtain several supplementary repository files related to the rewritten text, which are then added to the first set.

[0088] As mentioned earlier, while keyword search can quickly obtain context relevant to the query, its accuracy is relatively low. This means that the first set may contain some first repository files that are not highly relevant to the query, and it may miss some repository files that are highly relevant to the query. In other words, the search results in the first set may be neither comprehensive nor accurate enough, and further retrieval is needed to complete and optimize them.

[0089] In this step, for each code entity mentioned in the rewritten text, inference can be performed using a pre-trained target model. The inference process involves analyzing the filenames of repository files in the code repository to identify the repository files containing the definitions of these code entities. The repository files retrieved through this inference method (the supplementary repository files) will be integrated into the first set containing the previous search results, thereby improving the overall search coverage.

[0090] According to one implementation, repository files that may contain the definition of a code entity can be inferred based on the identifier of the code entity. In this implementation, the rewritten text contains several first identifiers and first descriptions representing code entities. During the target inference process, the several first identifiers and/or first descriptions are input into a first model, which infers, based on the filenames of the repository files, several supplementary repository files containing the code entity definitions. Specifically, an identifier is a string that uniquely identifies a specific code entity. Corresponding to the code entity examples above, the first identifier can include, but is not limited to, a module name, package name, class name, function name, or variable name. In addition, the rewritten text may also contain first descriptions presented in natural language, which correspond to specific code entities. For example, the first description "date of instance A" may correspond to the code entity identifier "A.Date", and the first description "separability of a matrix" may correspond to the code entity identifier "separate". In this implementation, the semantic understanding capability of the first model can be used to parse the first descriptions in the rewritten text and derive the first identifiers of the corresponding code entities. From these derived identifiers, together with those appearing directly in the rewritten text itself, the first model infers the repository files that may contain the corresponding code entity definitions. These files are identified as the supplementary repository files and integrated into the previously established first set.

[0091] In a specific example, the first model could be a Large Language Model (LLM). The first model is pre-trained based on a target task: predicting the naming conventions of repository files containing code entity definitions. Specifically, the first model is pre-trained on a large amount of general-purpose and code data, enabling it to deeply understand and master various common code entities, repository file naming conventions, and the naming habits followed by developers. Through pre-training, the first model can not only recognize standard programming terms and syntax keywords but also understand the unique file naming patterns in different programming languages, or variations of these patterns. When handling actual retrieval tasks, the first model can capture the association between filenames and code entities, inferring the filenames of repository files containing code entity definitions.
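The filename-based inference can be sketched as a prompt sent to the first model plus a check on its answer. The prompt wording, the function names, and the answer format (one file path per line) are hypothetical and introduced only for illustration; the call to the model itself is deliberately left out.

```python
def build_filename_prompt(identifiers, filenames):
    """Build a prompt asking which repository files likely define the entities."""
    listing = "\n".join(f"- {name}" for name in filenames)
    wanted = ", ".join(identifiers)
    return (
        "Repository files:\n"
        f"{listing}\n\n"
        f"Which files most likely define: {wanted}?\n"
        "Answer with one file path per line."
    )


def parse_model_answer(answer, filenames):
    """Keep only answer lines that name a file actually present in the repository."""
    known = set(filenames)
    picked = []
    for line in answer.splitlines():
        # Tolerate a leading bullet marker in the model's answer.
        candidate = line.strip().lstrip("- ").strip()
        if candidate in known and candidate not in picked:
            picked.append(candidate)
    return picked
```

Discarding paths that do not exist in the repository is a design choice of this sketch: it guards against the model naming files that are not there before anything is added to the first set.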

[0092] According to another implementation, the file structure relationships between repository files can be considered. Reasoning is performed on adjacent files of each existing repository file in the previously retrieved result set (the first set) to supplement the first set. In this implementation, subtree retrieval can be performed in a pre-constructed repository tree to determine a first subtree containing existing repository files from the first set. The nodes of the repository tree contain the filenames of the repository files in the code repository, and the edges of the repository tree represent the directory relationships between repository files. Then, the first subtree is input into the second model to reason about the supplementary repository files related to the rewritten text in the first subtree. Specifically, as explained above, the content covered by real-world software engineering / code repositories is usually not limited to a single function body or a single source code file. The source code of an application is usually constructed as a code repository containing multiple interdependent files, modules, and libraries. Therefore, when performing target reasoning, the second model also needs to fully understand the relationships between the various repository files in the code repository. Repository files with related relationships may also have a certain degree of correlation in their content. As mentioned earlier, although the result set (the first set) obtained through the keyword search in the aforementioned steps is not ideal in terms of accuracy, the returned results are often quite close to the real target required to solve the problem query. Therefore, based on the previous search results, target reasoning can be performed on each repository file associated with it in order to retrieve the relevant repository files that were missed.

[0093] The repository tree involved in this implementation is introduced briefly below, followed by a detailed description of the generation of the first subtree and the target reasoning process.

[0094] In this implementation, a complex code repository can be abstracted into a repository tree based on file directory relationships. The repository tree helps the second model understand the structural and semantic relationships between repository files. File directory relationships represent the organizational form of repository files, such as the directory hierarchy between files, or the program architecture described by code elements of a programming language such as modules, classes, and functions (e.g., __init__.py in a Python program). Figure 3 shows an exemplary repository tree. In a code repository, several repository files are organized according to a certain directory structure. As shown in the figure, the astropy folder contains both repository files, such as setup.py, and subfolders, such as modeling, with repository files distributed across different directory levels. In some programming languages (e.g., Python), source code files organized into a folder constitute a code module, while in other programming languages (e.g., C++), source code files organized into a folder are closely related through logical calls. A code repository with such a directory structure can be constructed into a file tree (shown on the right side of the figure) according to the directory hierarchy of each repository file; this file tree is the aforementioned repository tree. In the repository tree, the edges between nodes represent the structural information of the code repository, that is, the directory hierarchy between repository files.
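A repository tree of this kind can be sketched as a nested dictionary built from file paths. The representation (directories as dictionaries, files as None) and the file names under modeling are assumptions for illustration only.

```python
def build_repo_tree(paths):
    """Nest file paths into a dict of dicts; files map to None."""
    tree = {}
    for path in paths:
        parts = path.strip("/").split("/")
        node = tree
        # Walk (and create as needed) one directory level per path component.
        for directory in parts[:-1]:
            node = node.setdefault(directory, {})
        node[parts[-1]] = None
    return tree


# Hypothetical file list loosely following the astropy example.
example_tree = build_repo_tree([
    "astropy/setup.py",
    "astropy/modeling/core.py",
])
```

Each key corresponds to a node holding a file or folder name, and the nesting plays the role of the edges that encode the directory hierarchy.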

[0095] It is understood that the above content uses a code repository containing source files for a single project as an example to briefly introduce the repository tree. As mentioned earlier, in practical application scenarios, a code repository can contain source files for multiple projects. In this scenario, there can be multiple repository trees, and this specification does not limit this.

[0096] Currently, the first set contains previously retrieved repository files related to rewritten text. In order to find the nearest neighbor repository files of these files, a subtree can be constructed around the first set. That is, a subtree can be extracted from the repository tree, which can cover the existing repository files in the first set.

[0097] In a specific scenario, a first node can be identified first. This node is the lowest common ancestor (LCA) of the existing repository files in the first set within the repository tree. Then, using this first node as the root, the first subtree is extracted from the repository tree. Figure 4A illustrates the construction of an exemplary subtree. The first set contains pre-retrieved repository files related to rewritten text (two files shown in the figure). In the repository tree (shown on the left side of the figure), the LCA node of these files is searched; in this example, it is the "modeling" node, which is the first node. Using this node as the root, a subtree is extracted, as shown on the right side of the figure; this subtree is the first subtree. It can be seen that the first subtree contains the nearest neighbor nodes of each repository file node in the first set. These nearest neighbor nodes have a relatively close code structure relationship with the previously retrieved files.
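The subtree extraction can be sketched directly on file paths: the lowest common ancestor of the files in the first set is their deepest shared directory, and the first subtree consists of every repository file under it. The function names and the example paths are assumptions, not part of this specification.

```python
import posixpath


def lowest_common_directory(paths):
    """Deepest directory that is an ancestor of every given file path."""
    dir_parts = [posixpath.dirname(p).split("/") for p in paths]
    common = []
    for level in zip(*dir_parts):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return "/".join(common)


def first_subtree(all_paths, first_set):
    """Return the LCA directory and all repository files located beneath it."""
    root = lowest_common_directory(first_set)
    prefix = root + "/" if root else ""
    return root, sorted(p for p in all_paths if p.startswith(prefix))
```

For a first set holding two files in a modeling folder, the root would be that folder, and its subtree would also expose neighboring files such as tests that the keyword search missed.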

[0098] Based on the obtained first subtree, the second model is used for target inference. As mentioned earlier, target inference is performed based on the filenames of the repository files, with the aim of retrieving repository files related to the rewritten text but missed in previous searches from the first subtree, and adding them to the first set as supplementary repository files. For example, in the subtree retrieval example shown in Figure 4B, the second model searches in the first subtree and finds that the neighboring repository file "test_compound.py" is also a repository file related to the rewritten text. Therefore, this file can be added to the first set.

[0099] The two implementations described above employ different approaches. One uses the first model to infer repository files containing code entity definitions in the rewritten text, while the other uses the second model to infer neighboring repository files related to the rewritten text within the repository tree. Both can supplement the retrieval result set (i.e., the first set). In some embodiments, the methods outlined in the two implementations can be used simultaneously to supplement the first set, performing both filename-based and directory-based inference.

[0100] It should be noted that the various supplementary retrieval methods mentioned above can also be implemented using static program analysis strategies. However, given that static program analysis relies on strict matching rules and is strongly bound to specific programming languages, it may lead to a decrease in the hit rate and reduced generalization of supplementary retrieval. Therefore, the above implementation methods employ a strategy of reasoning based on filenames using a large language model. However, this does not imply any limitation on the retrieval strategy; any strategy that can effectively retrieve repository files containing code entity definitions can be equally applied to the above implementation methods. This specification will not list these strategies in detail in the embodiments.

[0101] After the above steps, the supplementary retrieval has added to the first set numerous repository files related to the rewritten text, overcoming the coverage shortcomings of the keyword search. However, the accuracy problem of the keyword search mentioned earlier still needs to be resolved. In the aforementioned steps, the reasoning and retrieval were based on repository filenames or directory structures, without in-depth analysis of the specific content of the repository files. Next, based on the file content, the previous search results can be further analyzed to identify and filter out repository files with low relevance to the rewritten text, further improving the accuracy of the search results.

[0102] In step S207: The first set is filtered to retain repository files that meet a first criterion, which is defined based on the summary relevance of the repository file to the rewritten text.

[0103] In this step, a Large Language Model (LLM) can be used to perform relevance analysis on the repository files. In practice, especially in large software engineering projects, repository files often contain a large amount of content (some code files can even reach tens of thousands of lines). When calculating the relevance of repository file content, LLM typically decomposes the file content into tokens for semantic analysis, and the size of the file content directly affects the number of tokens. A token is the basic unit of text processing in LLM, usually representing a word, punctuation mark, or other text element. The larger the file content, the more tokens it contains, which significantly increases the amount of information that LLM needs to process during semantic analysis, reducing analysis efficiency. Therefore, considering the efficiency of LLM analysis, in this step, a summary relevance score between the repository file and the rewritten text can be calculated based on the summary information of the repository file to ensure fast and accurate semantic analysis.

[0104] According to one implementation, for any second repository file in the first set, summary information is determined based on its content; a first vector similarity between the summary information and the rewritten text is calculated; and if the first vector similarity is less than a first similarity threshold preset in the first criterion, the second repository file is removed from the first set.

[0105] In this implementation, summary information can be extracted based on the specific content of the second repository file. This involves a deep understanding of the file content, which may include identifying key functional descriptions, extracting the main code entities, and extracting comments and explanations. The extraction of summary information aims to capture the core features of the repository file, laying the foundation for the subsequent relevance calculation. Based on the summary information, a first vector similarity between it and the rewritten text can be calculated. The vector similarity can be computed using various measures, such as cosine similarity or Euclidean distance. The relevance between the summary information and the rewritten text (the summary relevance) is determined from their distance in the vector space. The calculated first vector similarity is then compared with a preset threshold (the first similarity threshold), and repository files whose first vector similarity is less than the first similarity threshold are removed from the first set. The first similarity threshold is set in advance when defining the first criterion and represents the minimum similarity at which a repository file is considered sufficiently related to the rewritten text. A first vector similarity below this threshold indicates that the repository file has low relevance to the rewritten text; such a file is unlikely to yield correct context fragments in subsequent steps and may even have a negative impact on the retrieval of context fragments.
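The threshold screening can be sketched as follows, using cosine similarity as one of the possible measures. The embedding step is not shown: the vectors are assumed to come from some embedding model, and the function names and threshold value are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_by_summary(first_set, summary_vectors, query_vector, threshold):
    """Drop files whose summary similarity to the rewritten text is below the threshold.

    `summary_vectors` maps each file path to the (assumed) embedding of its summary;
    `query_vector` is the (assumed) embedding of the rewritten text.
    """
    return [
        path
        for path in first_set
        if cosine_similarity(summary_vectors[path], query_vector) >= threshold
    ]
```

With Euclidean distance the comparison would be reversed, since a smaller distance indicates a closer match.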

[0106] As can be seen from the above introduction, after the summary relevance screening, the repository files retained in the first set are all closely related to the rewritten text. These files will serve as the basis for extracting context fragments in subsequent steps. It is easy to understand that the more accurate the summary relevance calculation, the higher the correlation between the repository files in the retained first set and the rewritten text, and consequently, the more accurate the extraction of context fragments. In other words, in this step, how to effectively extract summary information that accurately reflects the core content of the repository files has a decisive impact on the subsequent summary relevance calculation. Therefore, the method for extracting summary information is particularly crucial.

[0107] In a specific scenario, the second repository file is a code file, and its summary information can be extracted using the following steps:

[0108] Obtain the abstract syntax tree corresponding to the second repository file. The abstract syntax tree contains definition nodes representing code entities in the second repository file. Each definition node includes a definition header and a definition body for the corresponding code entity. For any definition node in the abstract syntax tree, determine its code entity digest, which includes the node's definition header and, where the definition body satisfies a first constraint, the definition body as well. The first constraint requires that the number of terms in the text not exceed a preset first threshold. Concatenate the code entity digests corresponding to several definition nodes in the abstract syntax tree to obtain the summary information of the second repository file.

[0109] Before elaborating on the above steps, the Abstract Syntax Tree (AST) is briefly introduced. An AST is a data structure that presents the code entities in a code file and their relationships as a tree. Each node represents a code entity in the source code, and the connections between nodes reflect the syntactic hierarchy and code relationships between these entities, such as function calls and variable references. Figure 5 shows an exemplary AST. Referring to the figure, an AST contains the definition nodes of code entities (in the figure, functions and classes are used as examples; other types of code entities can be deduced by analogy). Unlike other types of nodes (such as expression nodes representing arithmetic and logical operations), definition nodes represent the declarations of code entities. Definition nodes contain the definition header and definition body of the corresponding code entity.

[0110] Traversing the abstract syntax tree allows individual code entities within the repository file to be located quickly via their definition nodes. Based on the code entity definitions, a code entity summary is extracted. Figure 6 illustrates an exemplary code entity summary extraction. Referring to the figure, taking a function code entity as an example, two cases can be handled. In one case, shown on the left side of the figure, the definition body contains too many terms (exceeding a preset first threshold); the definition body is discarded and only the definition header is retained as the code entity's summary. In the other case, shown on the right side of the figure, the definition body is relatively short and its number of terms does not exceed the preset first threshold; the definition body is retained, and the complete definition (including the definition header and body) serves as the code entity's summary. Thus, to extract summaries from each code entity, a first constraint with a preset first threshold is defined in this scenario. This constraint keeps the extracted code entity summaries concise by limiting the number of terms in the summary, capturing the core information of each code entity while avoiding the redundancy caused by excessively long text. After the summaries of the code entities in the repository file are obtained, they are concatenated to form the summary information of the repository file (i.e., the second repository file). This summary information presents the core content of the repository file from a macro perspective: it gives an overview of each important code entity and, because of the term-count screening, remains concise and readable, thus providing a foundation for efficient analysis and comparison in subsequent steps.
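For a Python source file, the two cases above can be sketched with the standard ast module (Python 3.8 or later). Counting "terms" as whitespace-separated tokens, handling only top-level definitions, and the default threshold of 20 are assumptions made for this example.

```python
import ast

DEFINITION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def entity_digest(source, node, max_terms):
    """Header plus body when the body is short; header only otherwise."""
    lines = ast.get_source_segment(source, node).splitlines()
    # The header spans every line before the first statement of the body.
    header_len = max(1, node.body[0].lineno - node.lineno)
    header, body = lines[:header_len], lines[header_len:]
    body_terms = sum(len(line.split()) for line in body)
    kept = header + body if body_terms <= max_terms else header
    return "\n".join(kept)


def summarize_code_file(source, max_terms=20):
    """Concatenate the digests of all top-level definitions in a Python file."""
    tree = ast.parse(source)
    digests = [
        entity_digest(source, node, max_terms)
        for node in tree.body
        if isinstance(node, DEFINITION_NODES)
    ]
    return "\n\n".join(digests)
```

Other languages would need their own parser; the header/body split and the threshold test stay the same.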

[0111] In another specific scenario, the second repository file is a non-code text file, and its summary information can be extracted through the following steps:

[0112] The content of the second repository file is segmented into several text blocks based on a first rule, where the first rule specifies a first number of text paragraphs as the unit segment length. For any text block in the second repository file, its text block summary is extracted. The text block summaries corresponding to the various text blocks are concatenated to obtain the summary information of the second repository file.

[0113] In this scenario, the content of the second repository file is segmented by natural paragraphs, which respects the way text files naturally organize information. Specifically, the number of natural paragraphs contained in each text block is fixed and equals the preset first number of text paragraphs. For example, if the first number of text paragraphs is set to 3, then each text block will contain 3 consecutive natural paragraphs. In a text file, a natural paragraph is typically a complete, independent sentence or group of sentences revolving around a central idea or theme, and is the basic unit of text structure. Segmenting text by natural paragraphs ensures that each text block is semantically coherent and complete.

[0114] Next, for each text block obtained from text segmentation, a summary extraction operation is performed. The first sentence of each text block can be extracted as its summary. The first sentence is usually the one that best summarizes the content of the text block, thus effectively representing the theme and key points of the entire text block. Finally, the text block summaries corresponding to each text block in the second repository file are concatenated to form the summary information of the repository file. This extracted summary information not only preserves the main structure and key information of the file content but also significantly reduces the size of the summary information by using the summary format, facilitating rapid analysis in subsequent steps.
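A minimal sketch of this paragraph-based summarization is given below. It assumes that natural paragraphs are separated by blank lines and that sentences end in ".", "!" or "?"; both assumptions, and the function name, are illustrative only.

```python
import re


def summarize_text_file(text, paragraphs_per_block=3):
    """Group paragraphs into fixed-size blocks and keep each block's first sentence."""
    # Split on blank lines and collapse internal whitespace within each paragraph.
    paragraphs = [
        " ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()
    ]
    summaries = []
    for start in range(0, len(paragraphs), paragraphs_per_block):
        block = " ".join(paragraphs[start:start + paragraphs_per_block])
        # Cut after the first sentence-ending punctuation mark followed by whitespace.
        first_sentence = re.split(r"(?<=[.!?])\s+", block, maxsplit=1)[0]
        summaries.append(first_sentence)
    return "\n".join(summaries)
```

The last block may hold fewer paragraphs than the preset number when the file length is not a multiple of it.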

[0115] In the steps described above, the search results in the first set were first supplemented, and then further filtered through summary analysis. After this series of operations, the repository files retained in the first set are all closely related to the rewritten text. In this process, the core benchmark of the analysis is the rewritten text, which may only contain some key information fragments from the first question text. To ensure accurate comparison with the complete semantics, in the next stage, context fragment extraction and sorting can be performed based on the semantic association between these files and the first question text, thereby obtaining the final search results.

[0116] In step S209, any repository file in the first set is divided into several context fragments, and target context fragments that meet the second criterion are selected. The second criterion is defined based on the relevance of the context fragments to the content of the first question text. The target context fragments are used for downstream task processing related to the first question text.

[0117] In this step, the content of each repository file in the first set is divided into multiple context fragments. These context fragments can be divided according to the content structure of the file or according to semantic coherence; there is no specific limitation, provided that each context fragment is logically self-contained. For example, if the repository file is a code file, the context fragments can be the definition contents of the code entities in the repository file; if the repository file is a non-code text file, the context fragments can be the text blocks obtained by segmenting the text using the first number of text lines preset in a second rule as the block length.

[0118] When the repository file for context segmentation is a code file, the definition nodes can be traversed according to the method of parsing the abstract syntax tree mentioned in the previous steps to obtain the definition content of each code entity in the repository file; or the definition content of the code entity in the repository file that has been parsed in the previous steps can be directly used; the embodiments in this specification will not be elaborated on in detail.

[0119] When the repository file being segmented into context fragments is a non-code text file, its content can be processed using a rule-based text segmentation method. Specifically, the file content can be divided into multiple independent text blocks according to a preset number of lines, with each text block being a context fragment. For example, if the second rule defines the first number of text lines as 5, then the first 5 lines of text in the repository file will constitute the first text block, the next 5 lines will constitute the second text block, and so on, until the end of the file.

[0120] To enhance the continuity and relevance of information between text blocks, in one example, the second rule may further include: an overlapping second number of text lines between adjacent text blocks. That is, when the content of the repository file is divided into text blocks, each text block, in addition to containing text lines defined by the first number of text lines, will also share text lines defined by the second number of text lines with the preceding text block. For example, if the second rule defines the first number of text lines as 10 and the second number of text lines as 2, then the first text block will contain text content from line 1 to line 10, and the second text block will start from line 9 and contain text content from line 9 to line 18. Thus, lines 9 and 10 will become overlapping lines between the two text blocks.
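The overlapping segmentation can be sketched as a sliding window over lines. The function and parameter names are assumptions; the defaults reproduce the example above, with 10 lines per block and 2 overlapping lines.

```python
def split_lines_with_overlap(text, block_lines=10, overlap=2):
    """Split text into blocks of `block_lines` lines sharing `overlap` lines."""
    if not 0 <= overlap < block_lines:
        raise ValueError("overlap must be non-negative and smaller than block_lines")
    lines = text.splitlines()
    step = block_lines - overlap  # how far each new block advances
    blocks = []
    for start in range(0, len(lines), step):
        blocks.append(lines[start:start + block_lines])
        # Stop once a block has reached the end of the file.
        if start + block_lines >= len(lines):
            break
    return blocks
```

With the defaults, a 20-line file produces blocks covering lines 1 to 10, 9 to 18, and 17 to 20, so lines 9 and 10 appear in both of the first two blocks.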

[0121] Next, a second criterion is applied to filter these context fragments. This second criterion is based on the content relevance of the context fragment to the first question text. Specifically, the semantic relevance of each context fragment to the first question text is calculated as a second vector similarity. If the second vector similarity of a context fragment is higher than a second similarity threshold preset in the second criterion, the context fragment is identified as a target context fragment. The second similarity threshold defines the minimum level of similarity required between a context fragment and the first question text for the fragment to be considered relevant. In other words, a context fragment whose second vector similarity exceeds this threshold can be considered highly relevant in content to the first question text and can be identified as a target context fragment.

[0122] In a specific scenario, the obtained target context fragments can be sorted based on their respective second vector similarity. The purpose of sorting is to place the target context fragment most relevant to the first question text at the top of the list, so that it can be given priority in subsequent analysis or processing.
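The selection and sorting of target context fragments can be sketched as follows. The second vector similarities are assumed to have been computed already, and the function name and data layout (pairs of fragment and score) are hypothetical.

```python
def select_target_fragments(scored_fragments, threshold):
    """Keep fragments scoring above the threshold, most similar first.

    `scored_fragments` is a list of (fragment_text, similarity) pairs, where the
    similarity is the fragment's second vector similarity to the first question text.
    """
    kept = [(frag, score) for frag, score in scored_fragments if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The head of the returned list is the fragment most relevant to the first question text, which a downstream task could consume first.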

[0123] Ultimately, the selected target context fragments will be used for downstream task processing related to the first question text. These downstream tasks include, but are not limited to: code location, bug fixing, adding new features, updating dependencies, and so on.

[0124] The foregoing description, based on one or more embodiments, details a method for contextual retrieval based on code repositories. By employing the method provided in the embodiments of this specification, the analysis speed of a large language model on a code repository can be improved by first performing a fast search based on filenames, followed by a similarity analysis based on content. Simultaneously, the accuracy of the large language model's understanding of code structure can be improved, allowing for a progressively deeper retrieval of the code repository. This helps the model better complete code repository-level contextual retrieval tasks, improves retrieval accuracy, and correctly selects the target context fragments for the problem text.

[0125] In this specification, the term "first" in the first question text, the first repository file, etc., as well as the corresponding "second" and "third" (if present) in the text, are used merely for convenience of distinction and description, and do not have any limiting meaning.

[0126] The foregoing description describes specific embodiments of this specification; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than those shown in the embodiments, and the desired result may still be achieved. Furthermore, the processes depicted in the drawings do not necessarily need to follow the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0127] Figure 7 is a schematic diagram of an apparatus for context retrieval based on a code repository according to an embodiment of this specification. The apparatus 700 is deployed in a computing device, which can be implemented using any device, equipment, platform, device cluster, etc., with computing and processing capabilities. This apparatus embodiment corresponds to the method embodiment shown in Figure 2. The code repository contains several repository files, and the apparatus 700 includes:

[0128] The receiving module 701 is configured to receive the first question text.

[0129] The retrieval module 702 is configured to rewrite the first question text, perform keyword retrieval based on the obtained rewritten text, obtain several first repository files that match the rewritten text, and include them in a first set.

[0130] The reasoning module 703 is configured to perform target reasoning using a pre-trained target model based on the filename of the repository file, and obtain several supplementary repository files related to the rewritten text, which are then included in the first set.

[0131] The first filtering module 704 is configured to filter the first set to retain repository files that meet a first criterion, which is defined based on the summary relevance of the repository file to the rewritten text.

[0132] The second filtering module 705 is configured to divide any repository file in the first set into several context fragments, select target context fragments that meet the second criterion, the second criterion being defined based on the content relevance of the context fragment to the first question text, and the target context fragment being used for downstream task processing related to the first question text.

[0133] According to another embodiment, this specification also provides a computer program product including a computer program/instructions that, when executed by a processor, implement the steps of the method described above with reference to Figure 2.

[0134] According to yet another embodiment, this specification also provides a computing device including a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the steps of the method described above with reference to Figure 2.

[0135] Those skilled in the art will recognize that, in one or more of the above examples, the functions described in the embodiments of the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored on a computer-readable medium, or transmitted as one or more instructions or code on a computer-readable medium.

[0136] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, or improvements made based on the technical solutions of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for context retrieval based on a code repository, the code repository containing several repository files, the method comprising: receiving a first question text; rewriting the first question text, performing keyword retrieval based on the obtained rewritten text to obtain several first repository files that match the rewritten text, and including the first repository files in a first set; performing, based on the filenames of the repository files, target inference using a pre-trained target model to obtain several supplementary repository files related to the rewritten text, and including the supplementary repository files in the first set; filtering the first set to retain repository files that meet a first criterion, the first criterion being defined based on the summary relevance of a repository file to the rewritten text; and dividing any repository file in the first set into several context fragments and selecting target context fragments that meet a second criterion, the second criterion being defined based on the content relevance of a context fragment to the first question text, and the target context fragments being used for downstream task processing related to the first question text.

2. The method according to claim 1, wherein the rewritten text includes several first identifiers and first descriptions representing code entities, and the target inference comprises: inputting the first identifiers and/or the first descriptions into a first model, the first model inferring, based on the filenames of the repository files, several supplementary repository files that contain definitions of the code entities.

3. The method according to claim 2, wherein the first model is pre-trained on a target task, the target task being used to predict the naming conventions of repository files that contain code entity definitions.

4. The method according to claim 2, wherein the code entities include at least one of the following: a module, a package, a class, a function, and a variable.

5. The method according to claim 1, wherein the target inference comprises: performing subtree retrieval in a pre-constructed repository tree to determine a first subtree containing the repository files already in the first set, wherein the nodes of the repository tree contain the filenames of the repository files in the code repository and the edges of the repository tree represent the directory relationships between the repository files; and inputting the first subtree into a second model, the second model inferring supplementary repository files in the first subtree that are related to the rewritten text.

6. The method according to claim 5, wherein the subtree retrieval comprises: determining a first node, the first node being the lowest common ancestor, in the repository tree, of the repository files already in the first set; and extracting the first subtree from the repository tree with the first node as its root.
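
For illustration only, and without limiting the claim, the subtree retrieval can be sketched as follows under the assumption that the repository tree is represented implicitly by POSIX-style relative file paths, so that the lowest common ancestor of a set of files is their deepest shared directory. The function names `lowest_common_ancestor` and `extract_subtree` are hypothetical, and taking the parent directory as the ancestor when the set holds a single file is a simplifying assumption of this sketch.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def lowest_common_ancestor(paths: Iterable[str]) -> PurePosixPath:
    """Return the deepest directory that contains every given file."""
    dir_parts = [PurePosixPath(p).parent.parts for p in paths]
    if not dir_parts:
        raise ValueError("at least one repository file is required")

    common: List[str] = []
    # Walk the directory levels in lockstep until the files diverge.
    for level in zip(*dir_parts):
        if all(part == level[0] for part in level):
            common.append(level[0])
        else:
            break
    # An empty prefix yields PurePosixPath("."), i.e. the repository root.
    return PurePosixPath(*common)


def extract_subtree(all_files: Iterable[str], root: PurePosixPath) -> List[str]:
    """List every repository file located under the given root directory."""
    return [f for f in all_files if root in PurePosixPath(f).parents]
```

In this representation the extracted subtree is simply the list of filenames under the first node, which could then be serialized and passed to the second model.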

7. The method according to claim 1, wherein filtering the first set comprises: for any second repository file in the first set, determining summary information based on its content; calculating a first vector similarity between the summary information and the rewritten text; and if the first vector similarity is less than a first similarity threshold preset in the first criterion, removing the second repository file from the first set.

8. The method according to claim 7, wherein the second repository file is a code file, and determining the summary information comprises: obtaining an abstract syntax tree corresponding to the second repository file, the abstract syntax tree containing definition nodes that represent code entities in the second repository file, each definition node including the definition header and the definition body of the corresponding code entity; for any definition node in the abstract syntax tree, determining its code entity summary, the code entity summary including the definition header of the node and a definition body that satisfies a first constraint, the first constraint limiting the number of tokens contained in the text to less than a preset first quantity threshold; and concatenating the code entity summaries corresponding to several definition nodes in the abstract syntax tree to obtain the summary information of the second repository file.
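
As a non-limiting sketch, the summary construction for a code file can be illustrated with Python's standard `ast` module, assuming the repository file is Python source. Several choices here are assumptions of the sketch rather than features of the claim: only top-level function and class definitions are treated as definition nodes, the definition header is taken to be the first line of the definition, tokens are approximated by whitespace-separated words, and a body that exceeds the threshold is omitted entirely rather than truncated. The names `entity_summary` and `summarize_code_file` are hypothetical.

```python
import ast

# Node types treated as "definition nodes" in this sketch.
DEFINITION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def entity_summary(source: str, node: ast.AST, max_body_tokens: int) -> str:
    """Return header plus body if the body is short enough, else header only."""
    segment = ast.get_source_segment(source, node) or ""
    lines = segment.splitlines()
    if not lines:
        return ""
    header = lines[0]              # assumption: single-line definition header
    body = "\n".join(lines[1:])
    # Whitespace splitting stands in for a real tokenizer.
    if len(body.split()) < max_body_tokens:
        return segment
    return header


def summarize_code_file(source: str, max_body_tokens: int) -> str:
    """Concatenate the entity summaries of all top-level definitions."""
    tree = ast.parse(source)
    summaries = [
        entity_summary(source, node, max_body_tokens)
        for node in tree.body
        if isinstance(node, DEFINITION_NODES)
    ]
    return "\n".join(summaries)
```

With a threshold of 6, for example, a one-line function body is kept in full while a longer function is reduced to its `def` line, which keeps the summary compact for the subsequent similarity calculation.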

9. The method according to claim 7, wherein the second repository file is a non-code text file, and determining the summary information comprises: dividing the content of the second repository file into several text blocks based on a first rule, the first rule specifying a first number of text segments as the unit block length; for any text block in the second repository file, extracting its text block summary; and concatenating the text block summaries corresponding to the text blocks to obtain the summary information of the second repository file.

10. The method according to claim 1, wherein, if the repository file is a code file, the context fragment is the definition of a code entity in the repository file; and if the repository file is a non-code text file, the context fragment is a text block obtained by dividing the text using a first number of text lines, preset in a second rule, as the unit length.

11. The method according to claim 10, wherein the second rule further includes a second number of text lines that overlap between adjacent text blocks.
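
Purely as an illustration, the division of a non-code text file into line-based text blocks with overlap can be sketched as follows. The function name `split_into_blocks` and its parameter names are hypothetical; `lines_per_block` plays the role of the first number of text lines and `overlap_lines` the role of the second number of overlapping text lines.

```python
from typing import List


def split_into_blocks(text: str, lines_per_block: int, overlap_lines: int = 0) -> List[str]:
    """Split text into blocks of `lines_per_block` lines; adjacent blocks
    share `overlap_lines` lines."""
    if lines_per_block <= 0:
        raise ValueError("lines_per_block must be positive")
    if not 0 <= overlap_lines < lines_per_block:
        raise ValueError("overlap_lines must be in [0, lines_per_block)")

    lines = text.splitlines()
    step = lines_per_block - overlap_lines  # distance between block starts
    blocks: List[str] = []
    for start in range(0, len(lines), step):
        blocks.append("\n".join(lines[start:start + lines_per_block]))
        # Stop once the current block already reaches the end of the file,
        # so no trailing block made only of overlap lines is emitted.
        if start + lines_per_block >= len(lines):
            break
    return blocks
```

The overlap means that a sentence or list straddling a block boundary appears intact in at least one block, at the cost of some duplicated lines.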

12. The method according to claim 1, wherein selecting the target context fragments that meet the second criterion comprises: for any context fragment, calculating a second vector similarity between its content and the first question text; and if the second vector similarity is higher than a second similarity threshold preset in the second criterion, determining the context fragment to be a target context fragment.

13. The method according to claim 12, further comprising: sorting the obtained target context fragments based on their corresponding second vector similarities.
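
As a minimal, non-limiting sketch, the threshold selection and the subsequent sorting can be written as below. It assumes that the context fragments and the first question text have already been embedded as numeric vectors by some embedding model, which is not specified here; it further assumes cosine similarity as the vector similarity and descending order for the sort, neither of which is mandated by the claims. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target_fragments(
    question_vec: Sequence[float],
    fragments: Sequence[Tuple[str, Sequence[float]]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep fragments whose similarity exceeds the threshold, best first."""
    scored = [
        (text, cosine_similarity(vec, question_vec)) for text, vec in fragments
    ]
    kept = [(text, score) for text, score in scored if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

Returning the score alongside each fragment lets a downstream task truncate the ranked list to fit a context budget.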

14. An apparatus for context retrieval based on a code repository, the code repository containing several repository files, the apparatus comprising: a receiving module configured to receive a first question text; a retrieval module configured to rewrite the first question text, perform keyword retrieval based on the obtained rewritten text to obtain several first repository files that match the rewritten text, and include the first repository files in a first set; a reasoning module configured to perform target inference using a pre-trained target model, based on the filenames of the repository files, to obtain several supplementary repository files related to the rewritten text, and to include the supplementary repository files in the first set; a first filtering module configured to filter the first set to retain repository files that meet a first criterion, the first criterion being defined based on the summary relevance of a repository file to the rewritten text; and a second filtering module configured to divide any repository file in the first set into several context fragments and to select target context fragments that meet a second criterion, the second criterion being defined based on the content relevance of a context fragment to the first question text, and the target context fragments being used for downstream task processing related to the first question text.

15. A computer program product comprising a computer program/instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1-13.

16. A computing device, comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method according to any one of claims 1-13.