A code generation implementation method based on large language model and retrieval enhancement

By using large-scale language models and retrieval enhancement techniques, combined with multimodal data processing and knowledge graph construction, the automatic generation and verification of PLC control programs were realized, solving the problems of low efficiency and insufficient accuracy in control code generation, and improving the efficiency and reliability of industrial production.

CN122240090A · Pending · Publication Date: 2026-06-19 · SHENYANG INST OF AUTOMATION - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENYANG INST OF AUTOMATION - CHINESE ACAD OF SCI
Filing Date
2024-12-18
Publication Date
2026-06-19

Abstract

This invention belongs to the field of automatic code generation, specifically a method for code generation based on large-scale language models and retrieval enhancement. It includes data processing, vector database construction, knowledge graph construction, prompt engineering implementation, and generated code verification. This invention achieves automatic generation of control code, together with verification and debugging of the generated code, through large-scale language models and retrieval-enhanced generation technology, combined with the relational expression capabilities of knowledge graphs. This invention is mainly applied to PLC control program generation in the industrial field and is suitable for general industrial scenarios where PLC-controlled equipment needs to operate automatically. Its purpose is to assist developers in quickly generating control code based on the process flow.
Description

Technical Field

[0001] This invention belongs to the field of automatic code generation, specifically a method for the automatic generation and verification of control code based on large-scale language models and retrieval-enhanced generation technology.

Background Technology

[0002] With the rapid development of large-scale language models, significant progress has been made in question-answering mechanisms and natural language processing. In recent years, engineers have turned their attention to the field of code generation using large models. However, in some specialized fields, the conventional training knowledge of large models is insufficient, such as in the field of control code generation. Currently, control codes in the production process are mainly written manually, based on technological processes and accumulated production experience. The current stage of development in manufacturing has shifted from mass production to mass customization. When new products are added or products are modified, control codes must be rewritten based on new technological processes and production experience. Engineers need to re-adjust production lines and equipment using the newly written code, with debugging cycles lasting several months, and the standardization and reliability of the written code cannot be guaranteed.

[0003] The development of large model code generation technology has greatly increased the efficiency of software development and code generation. However, the output of a large model depends largely on the data used to train it, so the accuracy and timeliness of the generated results cannot be guaranteed. These are the common problems of large model hallucination and poor timeliness. To achieve fully automated generation, there is an urgent need for a method that automatically generates control code while meeting process requirements, relevant standards, and reliability requirements.

Summary of the Invention

[0004] To address the aforementioned problems, the present invention aims to design an automatic control code generation method based on large model and retrieval enhancement technology, and to verify the generated code, thereby solving the problems of long control code generation and modification time, large workload, and lax standard implementation.

[0005] This invention utilizes large-scale language models and retrieval-enhanced generation technology, combined with the relational expression capabilities of knowledge graphs, to achieve automatic generation of control code, as well as verification and debugging of the generated code. This invention is mainly applied to the generation of PLC control programs in the industrial field, and is suitable for general industrial scenarios where PLC-controlled equipment needs to operate automatically. Its purpose is to assist developers in quickly generating control code based on the process flow.

[0006] The technical solution adopted by this invention to achieve the above objectives is: a code generation method based on a large language model and retrieval enhancement, comprising the following steps:

[0007] Step 1: Collect multimodal data in a specific professional field;

[0008] Step 2: Process the multimodal data and construct a vector database;

[0009] Step 3: Construct a professional knowledge graph based on the vector database;

[0010] Step 4: Deploy the large model through the framework and link the large model with the professional knowledge graph;

[0011] Step 5: Based on the received user requirements, extract key information to query entities in the professional knowledge graph, output the vector database knowledge associated with the queried entities, and fuse it according to the set prompt template as input information for the large model to realize the prompt engineering.

[0012] Step 6: After receiving the input information, the large model generates code based on user needs and the retrieved knowledge; the generated code is then validated to evaluate the large model's ability to generate code.

[0013] The multimodal data includes, but is not limited to, machine tool data information, machining parameter requirements, assembly process requirements, industrial control design requirements, control algorithms and programs.

[0014] Step 2, processing the data and constructing a vector database, includes the following steps:

[0015] The data is parsed using a parsing tool and then loaded using a document loader.

[0016] The loaded data is divided into document chunks using a splitter to facilitate embedding.

[0017] The segmented data is vectorized by embedding and the resulting vectorized information is stored in a vector database.

[0018] The vectorized information includes at least one type of information from machine tool data, machining parameter requirements, assembly process requirements, industrial control design requirements, control algorithms and programs, and each type of information contains at least one keyword.
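
The embed-store-retrieve part of Step 2 can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the method's actual implementation: the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and `VectorStore`, `cosine`, and the sample chunks are hypothetical names and data introduced only for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a bag-of-words term-count vector.
    A real system would call a sentence-embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Minimal in-memory store (hypothetical; stands in for a vector database)."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        # Store each chunk together with its vector.
        self.items.append((chunk, embed(chunk)))

    def top_k(self, query: str, k: int = 2) -> list[str]:
        # Rank stored chunks by similarity to the query and keep the best k.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


store = VectorStore()
for chunk in [
    "PID function block with anti-windup and output limits",
    "Spindle speed parameters for the milling machine",
    "Assembly process requirements for the gearbox line",
]:
    store.add(chunk)

print(store.top_k("PID controller function block", k=1))
```

Swapping `embed` for a learned embedding model changes the quality of the ranking but not the shape of the store or the top-K query.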

[0019] Step 3, constructing a professional knowledge graph based on the vector database, includes the following steps:

[0020] Based on the vectorized information in the vector database, keywords are extracted from each piece of information to construct a graph according to the pattern layer. The keywords in each piece of information in the vector database are used as entities in the professional knowledge graph, and the relationships between each data keyword are used as relationships between entities to construct the professional knowledge graph.
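
One way to picture the graph construction in Step 3 is the sketch below. It assumes that keywords have already been extracted for each stored chunk and, because the text does not say how relationships are derived, it treats co-occurrence in the same chunk as the relationship between two keywords. The function name `build_graph` and the sample keywords are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(chunk_keywords: dict[str, list[str]]):
    """Build a keyword graph from per-chunk keywords.

    entity_chunks maps each entity (keyword) to the chunk ids that mention it.
    edges maps each entity to the entities it co-occurs with (an assumed
    relation; the source does not specify how relations are derived).
    """
    entity_chunks: dict[str, set[str]] = defaultdict(set)
    edges: dict[str, set[str]] = defaultdict(set)
    for chunk_id, keywords in chunk_keywords.items():
        unique = sorted(set(keywords))
        for kw in unique:
            entity_chunks[kw].add(chunk_id)
        # Link every pair of keywords that appear in the same chunk.
        for a, b in combinations(unique, 2):
            edges[a].add(b)
            edges[b].add(a)
    return dict(entity_chunks), dict(edges)


# Hypothetical keywords, as if extracted from three stored chunks.
entity_chunks, edges = build_graph({
    "chunk-1": ["PID", "function block", "anti-windup"],
    "chunk-2": ["PID", "temperature control"],
    "chunk-3": ["spindle", "speed limit"],
})
print(sorted(edges["PID"]))
```

Keeping the entity-to-chunk mapping next to the edges is what lets a later graph query return the vector database content tied to each entity.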

[0021] In step 5, based on the received user requirements, key information is extracted to query entities in the professional knowledge graph, and the vector database knowledge associated with the queried entities is output, as follows:

[0022] Based on the accepted user requirements, extract key information to query entities in the professional knowledge graph, and find entities that match the key information by traversing the entities in the professional knowledge graph.

[0023] It also outputs the vector database knowledge associated with the queried entities.
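
The entity lookup described above can be sketched as follows. This is a minimal illustration under assumptions: key information is matched by simple case-insensitive substring search (which can over-match, for example "pid" inside "rapid"), the lookup expands one hop to directly related entities, and `query_graph` and all sample data are hypothetical.

```python
def query_graph(requirement, entity_chunks, edges, chunks):
    """Return stored chunks linked to entities mentioned in the requirement,
    followed by chunks linked to directly related entities."""
    text = requirement.lower()
    # Traverse every entity and keep those whose name appears in the requirement.
    matched = [e for e in entity_chunks if e.lower() in text]
    # Expand one hop along the graph's relations.
    related = {n for e in matched for n in edges.get(e, ())}
    chunk_ids = []
    for entity in matched + sorted(related - set(matched)):
        for cid in sorted(entity_chunks.get(entity, ())):
            if cid not in chunk_ids:
                chunk_ids.append(cid)
    return [chunks[cid] for cid in chunk_ids]


# Hypothetical graph and chunk contents.
entity_chunks = {
    "PID": {"c1", "c2"},
    "anti-windup": {"c1", "c4"},
    "temperature control": {"c2"},
    "spindle": {"c3"},
}
edges = {
    "PID": {"anti-windup", "temperature control"},
    "anti-windup": {"PID"},
    "temperature control": {"PID"},
}
chunks = {
    "c1": "PID function block with anti-windup",
    "c2": "PID tuning for temperature control",
    "c3": "Spindle speed limits",
    "c4": "Anti-windup clamping note",
}

results = query_graph("Design a PID controller for the furnace", entity_chunks, edges, chunks)
print(results)
```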

[0024] In step 5, the prompt template is used for fusion as input information for the large model to realize the prompting process, which includes the following steps:

[0025] Using a predefined prompt template, the user's requirements are fused with the top K most relevant results retrieved from the vector database, and the retrieved results are supplied to the large model as knowledge.

[0026] In step 6, the generated control code is verified to evaluate the ability of the large model to generate control code, as detailed below:

[0027] The initial code generated by the large model is translated to meet the code format requirements of the validation tools.

[0028] The code is compiled to check for syntax errors.

[0029] If errors are found, generate repair suggestions based on the location of the errors in the code, debug and fix the code, and then re-validate the syntax; if the code passes the syntax validation, then perform functional validation.

[0030] During the functional verification phase, the functional attributes of the code are first generated. The code is then verified to determine whether it passes verification. If the requirements are not met, functional repair suggestions are generated to debug and repair the code.

[0031] The knowledge graph in question is a control program knowledge graph.

[0032] A code generation implementation system based on a large language model and retrieval enhancement includes:

[0033] The vector database construction module is used to process multimodal data and build a vector database.

[0034] The professional knowledge graph construction module is used to construct professional knowledge graphs based on a vector database.

[0035] The large model query module is used to deploy large models through the framework and associate them with professional knowledge graphs. Based on the accepted user requirements, it extracts key information to query entities in the professional knowledge graph, outputs the vector database knowledge associated with the queried entities, and integrates it according to the set prompt template as input information for the large model to realize prompt engineering.

[0036] The code generation module is used to generate code based on user needs and retrieved knowledge after the large model receives input information; the generated code is then validated to evaluate the large model's ability to generate code.

[0037] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the code generation method based on a large language model and retrieval enhancement.

[0038] The present invention has the following beneficial effects and advantages:

[0039] This invention proposes an automatic control code generation method based on a large language model and retrieval-enhanced generation technology. This method enables the automatic generation of equipment control codes required for redesigned production after product changes. By combining retrieval-enhanced generation technology with a large language model, it effectively avoids the drawback of large models lacking training data in specialized domains, thus preventing misleading results. It introduces a database containing domain-specific knowledge, allowing for the simultaneous input of problem requirements and retrieval of relevant knowledge and data from the database. The problem and the retrieved knowledge are then integrated and input into the large model, effectively assisting it in generating accurate control codes that meet the requirements. After code generation, relevant verification is performed to ensure the accuracy and security of the code. This invention accelerates the control code writing process when new production processes are established, new products are introduced, or changes in production equipment and requirements occur due to product changes. It provides assistance to developers, significantly speeding up the process from control code writing to application implementation, improving production efficiency, and helping enterprises adapt to the current industrial development trend of increasingly customized production.

Attached Figure Description

[0040] Figure 1 is a flowchart of the overall technical roadmap of this invention.

[0041] Figure 2 is a flowchart of the generated code verification phase.

Detailed Implementation

[0042] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments.

[0043] To make the design scheme and technical advantages of the present invention clearer, the present invention will be described in detail below with reference to the accompanying drawings and specific examples, in which preferred embodiments of the present invention are shown. However, the present invention can be implemented in many different forms and is not limited to the embodiments described herein. Rather, the purpose of providing embodiments is to enable a more thorough and complete understanding of the disclosure of the present invention.

[0044] An automatic control code generation method based on large model and retrieval enhancement generation technology includes data processing, vector database construction, knowledge graph construction, prompt engineering implementation, and generated code verification.

[0045] The solution first acquires multimodal knowledge, relevant standards, and process requirements related to code generation from various fields as data sources. These data sources are then filtered and cleaned to ensure accuracy and timeliness. After data cleaning and filtering, the collected data undergoes document parsing, document segmentation, document embedding, and vectorization. The vectorized data is then stored in a vector database, thus completing the vector database construction. Next, a knowledge graph is built based on the entities in the database and the relationships between them, establishing links between different entities and enhancing the comprehensiveness of the retrieval. When a user inputs requirements or adds process requirements to the large model, the large model generates an answer based on the user's input and relevant content retrieved from the knowledge graph, following a pre-defined template. After the code is generated, it is loaded into a relevant testing platform to test whether it meets the requirements in terms of syntax and functionality, verifying the reliability and accuracy of the generated code.

[0046] An automatic control code generation method based on large-scale language models and retrieval enhancement generation technology includes data processing, vector database construction, prompt engineering implementation, and code verification and model evaluation.

[0047] The solution first collects control codes and their corresponding text information, images, and other multimodal information within the field. After data processing, a vector database is constructed. Then, a large-scale model is deployed within the framework, and its interface with the database is established. A prompt template is set to integrate user requirements with relevant retrieval knowledge, thus completing the framework construction. When changes in the production process lead to new requirements being input into the large-scale model, these requirements are simultaneously sent to a knowledge graph for retrieval. The retrieved key information is then output as knowledge fragments from the database. The top N data points are returned to the prompt template, which integrates the user requirements with the N relevant data points as prompts for the large-scale model. Based on these prompts, the large-scale model uses its powerful generation capabilities to generate code. After code generation, the code is validated. If validation passes, it can be directly applied; otherwise, it needs debugging. Only after debugging and meeting relevant standards and requirements can the code be applied in actual production, ensuring it meets production needs.

[0048] As shown in Figure 1, this invention discloses an automatic control code generation method based on a large language model and retrieval enhancement, comprising the following steps:

[0049] Step 1: Collect multimodal data related to the control field, as well as standard documentation from relevant organizations and open-source databases;

[0050] Collect relevant data, paying attention to collecting multimodal data, including images, documents, code, etc., to enhance the multimodal query capabilities of large models. Multimodal data related to the control domain includes machine tool data and machining parameter requirements in the machining field, or routine assembly process requirements in the assembly field. The collection scope covers various fields of industrial scenarios, including but not limited to machining and assembly.

[0051] Step 2: Process the data and build a control program vector database;

[0052] Step 2, which involves processing the data and constructing the control program vector database, specifically includes:

[0053] Data processing consists of four main parts: document parsing, document chunking, document embedding, and vector database construction. After collecting data and related documents, the documents are loaded and parsed. Document parsing is performed by appropriate parsing tools, using document loaders to load data and files. Document chunking uses a data splitter to break large documents into smaller chunks for easier embedding. Finally, the documents are vectorized and stored in a vector database using the Sentence BERT embedding model.

[0054] To ensure query accuracy, attention must be paid to the accuracy and standardization of the collected data during the data collection phase. At the same time, multimodal data should be collected to enhance the versatility of large model queries. During document loading, different parsing tools are used depending on the format of the collected data, such as PDF or Word documents. Document chunking is performed because large models have token length limits, and excessively long inputs can slow down the model or make its understanding and reasoning inaccurate. The chunking process first splits the document into small, semantically coherent pieces, then combines these pieces into larger chunks until a set size is reached. Adjacent chunks are given a certain degree of overlap to preserve contextual relevance. The document embedding part vectorizes the chunks and stores the vectors in a vector database; any of various vector databases can be selected. The document parsing, document chunking, and document embedding tools mentioned above are all integrated within the LangChain framework.
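
The split, merge, and overlap procedure described above can be sketched as follows. This is a minimal illustration and not the LangChain splitter the text refers to: the function name `chunk_text`, the use of sentences as the small semantic unit, and the character limits are all assumptions made for this example.

```python
import re


def chunk_text(text: str, max_chars: int = 200, overlap_sentences: int = 1) -> list[str]:
    """Split text into sentences, merge them into chunks of roughly
    max_chars characters, and repeat trailing sentences as overlap."""
    # Step 1: split into small semantic units (here: sentences; an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?;])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        # Step 2: close the current chunk once adding a sentence would exceed the limit.
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Step 3: carry the tail of the closed chunk forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = (
    "Axis X homes first. Axis Y homes second. "
    "The spindle starts at 3000 rpm. Coolant turns on."
)
for chunk in chunk_text(sample, max_chars=45, overlap_sentences=1):
    print(chunk)
```

Because the overlap is carried into the next chunk before the size check, a chunk can exceed `max_chars` slightly; a stricter splitter would trim the overlap in that case.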

[0055] Step 3: Construct a knowledge graph of the control program based on the vector database.

[0056] Step 3 involves constructing a knowledge graph for the control program based on the database. This is because the relationships between data stored in the database cannot be identified during querying. By constructing a knowledge graph, the relationships between different entities can be identified during retrieval, thereby expanding the completeness of the retrieved data information. This enhances the understanding of user needs by the large model and the completeness and comprehensiveness of relevant reference data, thus improving the quality of code generation.

[0057] Step 4: Deploy the multimodal large model and other intelligent agents;

[0058] Step 4 requires the deployment of the large model and other intelligent agents within the LangChain framework, such as knowledge graph query tools, vector databases, and code verification tools. Because knowledge graphs are superior to vector databases in terms of relational representation, building a knowledge graph on top of a vector database can effectively represent the relationships between entities within a professional domain. This expands the scope of related data retrieved during searches and improves the completeness of prompts: data directly related to user needs is retrieved and suggestions are provided to the large model from multiple perspectives, thereby enhancing the quality of code generation.

[0059] Step 5: Verify the generated code and evaluate the model's ability to generate code;

[0060] Step 5: Set the prompt template and implement the prompt engineering.

[0061] Step 5 specifically describes the prompt engineering process, which involves creating prompts, queries, or instructions to guide the output of a language model. It allows users to control the model's output and generate customized text based on their specific needs. The prompt template depends on the type and format of the question to be input into the large model, and the format of the question in turn follows the template that has been set. In this embodiment, the template includes the user's question and the Top K most relevant data items retrieved from the vector database. These two parts are fused according to the set template and input into the large model as the question and its reference material. This step plays a crucial role in the quality of the generated result.
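
The fusion of the user's question with the Top K retrieved items can be sketched as a simple template fill. The template wording, the name `build_prompt`, and the numbering of the knowledge items are assumptions for illustration; the text does not disclose the actual template.

```python
# Hypothetical template text; the wording is an assumption.
PROMPT_TEMPLATE = """You are an assistant that writes PLC control code.

Reference knowledge:
{knowledge}

User requirement:
{requirement}

Write code that satisfies the requirement and follows the reference knowledge."""


def build_prompt(requirement: str, retrieved: list[str], k: int = 3) -> str:
    """Fuse the user requirement with the top-k retrieved chunks."""
    # Keep only the first k items; the caller is assumed to pass them ranked.
    knowledge = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved[:k], start=1)
    )
    return PROMPT_TEMPLATE.format(knowledge=knowledge, requirement=requirement)


prompt = build_prompt(
    "Design a PID controller for furnace temperature.",
    ["PID function block with anti-windup", "Temperature loop tuning notes"],
    k=2,
)
print(prompt)
```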

[0062] Step 6: Verify the generated code and evaluate the model's ability to generate code;

[0063] Step 6, verifying the generated code, is a crucial step before the code is applied to actual production. It is essential to ensure the accuracy and standardization of the generated code, guaranteeing that it meets safety production standards and process requirements. As shown in Figure 2, the initial code generated by the large model is first translated to meet the code format requirements of the verification tool. The code is then compiled to check for syntax errors. If errors are found, repair suggestions are generated based on the location of the errors, and the code is debugged and repaired following a chain-of-thought strategy. After repair, syntax verification is performed again. This process iterates until the code passes syntax verification, after which functional verification begins. In the functional verification stage, functional attributes such as logical relationships and functions are generated, and verification determines whether these attributes meet production logic and user requirements. If they do not, functional repair suggestions are generated, and the code is debugged and repaired. This stage is similar to the syntax repair stage and is also an iterative process. Only after all code has passed both syntax and functional verification can it be applied to actual production.
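
The two-stage verification loop described above can be sketched as follows. This is an illustration under assumptions, not the verification toolchain of the invention: Python's built-in `compile()` stands in for the PLC compiler, `fake_repair` stands in for the large model's repair step, `check_clamp` stands in for functional verification, and the round limit `max_rounds` is introduced only to guarantee termination.

```python
from typing import Callable, Optional


def syntax_error(code: str) -> Optional[str]:
    """Stand-in syntax check using Python's built-in compile();
    a real pipeline would call the PLC toolchain's compiler instead."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as exc:
        # Report the error location so a repair step can target it.
        return f"line {exc.lineno}: {exc.msg}"


def verify_and_repair(
    code: str,
    repair: Callable[[str, str], str],
    functional_error: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> Optional[str]:
    """Iterate syntax check then functional check, asking `repair` for a fix
    after each failure. Returns verified code, or None if rounds run out."""
    for _ in range(max_rounds):
        error = syntax_error(code)
        if error is None:
            # Functional verification only runs on syntactically valid code.
            error = functional_error(code)
            if error is None:
                return code
        code = repair(code, error)
    return None


GOOD = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"


def fake_repair(code: str, error: str) -> str:
    # Hypothetical stand-in for the model: the first fix repairs the syntax
    # but leaves a functional bug, the second fix repairs the function.
    if error.startswith("line "):
        return "def clamp(x, lo, hi):\n    return max(lo, min(x, hi + 1))\n"
    return GOOD


def check_clamp(code: str) -> Optional[str]:
    # Demo-only functional check: run the code and test one property.
    scope: dict = {}
    exec(code, scope)
    return None if scope["clamp"](15, 0, 10) == 10 else "clamp(15, 0, 10) should be 10"


verified = verify_and_repair("def clamp(x, lo, hi)\n    return x\n", fake_repair, check_clamp)
print(verified)
```

Running functional checks only after the syntax stage passes mirrors the ordering in the text and keeps each repair request focused on a single kind of error.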

[0064] While validating the code, evaluating the model's ability to generate code is an important reference for subsequent optimization and improvement of the model and data processing. The evaluation of the model's ability to generate code mainly focuses on two aspects: accuracy and answer relevance.

[0065] Example:

[0066] The system queries relevant information in the knowledge graph based on user input. This embodiment requires the design of a PID controller. If only the knowledge learned during the training of the large model is used, a PID value can be simply calculated. However, such a simple PID controller program cannot meet industrial needs.
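
For orientation, a "simple PID calculation" of the kind mentioned above corresponds to the textbook discrete form sketched below. It is written in Python for illustration rather than in a PLC language, and the class name `SimplePID`, the gains, and the sample time are assumptions. It omits features such as output limiting and anti-windup that industrial function blocks typically provide.

```python
class SimplePID:
    """Textbook discrete PID: u = Kp*e + Ki*sum(e*dt) + Kd*(e - e_prev)/dt."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float) -> None:
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        # Accumulate the integral term (no anti-windup in this sketch).
        self.integral += error * self.dt
        # Backward-difference derivative of the error.
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = SimplePID(kp=2.0, ki=0.5, kd=0.1, dt=1.0)
first = pid.update(setpoint=10.0, measurement=8.0)
second = pid.update(setpoint=10.0, measurement=9.0)
print(first, second)
```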

[0067] After introducing the retrieval enhancement process, the large model receives information from the user input containing the keyword "PID". It then traverses the nodes of the knowledge graph to query entities containing the keyword "PID". Once the entity is found, it outputs the specific information of the entity and several other related entities in the vector database. During the search, the PID controller functional block found is complete and meets the requirements of industrial applications.

[0068] At this point, the information returned to the large model becomes "user input + queried knowledge" instead of simply a user input question; the large model can then generate code based on this information; the generated code is of higher quality and runs better.

Claims

1. A code generation implementation method based on a large language model and retrieval enhancement, characterized in that it includes the following steps: Step 1: Collect multimodal data in a specific professional field; Step 2: Process the multimodal data and construct a vector database; Step 3: Construct a professional knowledge graph based on the vector database; Step 4: Deploy the large model through the framework and link the large model with the professional knowledge graph; Step 5: Based on the received user requirements, extract key information to query entities in the professional knowledge graph, output the vector database knowledge associated with the queried entities, and fuse it according to the set prompt template as input information for the large model to realize the prompt engineering; Step 6: After receiving the input information, the large model generates code based on user needs and the retrieved knowledge; the generated code is then validated to evaluate the large model's ability to generate code.

2. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, The multimodal data includes, but is not limited to, machine tool data information, machining parameter requirements, assembly process requirements, industrial control design requirements, control algorithms and programs.

3. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, Step 2, processing the data and constructing a vector database, includes the following steps: The data is parsed using a parsing tool and then loaded using a document loader. The loaded data is divided into document chunks using a splitter to facilitate embedding. The segmented data is vectorized by embedding and the resulting vectorized information is stored in a vector database. The vectorized information includes at least one type of information from machine tool data, machining parameter requirements, assembly process requirements, industrial control design requirements, control algorithms and programs, and each type of information contains at least one keyword.

4. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, Step 3, constructing a professional knowledge graph based on the vector database, includes the following steps: Based on the vectorized information in the vector database, keywords are extracted from each piece of information to construct a graph according to the pattern layer. The keywords in each piece of information in the vector database are used as entities in the professional knowledge graph, and the relationships between each data keyword are used as relationships between entities to construct the professional knowledge graph.

5. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, In step 5, based on the received user requirements, key information is extracted to query entities in the professional knowledge graph, and the vector database knowledge associated with the queried entities is output, as follows: Based on the accepted user requirements, extract key information to query entities in the professional knowledge graph, and find entities that match the key information by traversing the entities in the professional knowledge graph. It also outputs the vector database knowledge associated with the queried entities.

6. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, In step 5, the prompt template is used for fusion as input information for the large model to realize the prompting process, which includes the following steps: By using a predefined prompt template, the user's needs, along with the top K most relevant answers obtained from a vector database, are displayed as knowledge.

7. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, in step 6, the generated control code is verified to evaluate the ability of the large model to generate control code, as detailed below: The initial code generated by the large model is translated to meet the code format requirements of the validation tools. The code is compiled to check for syntax errors. If errors are found, repair suggestions are generated based on the location of the errors in the code, the code is debugged and fixed, and the syntax is then re-validated. If the code passes syntax validation, functional validation is performed. During the functional verification phase, the functional attributes of the code are first generated. The code is then verified to determine whether it passes verification. If the requirements are not met, functional repair suggestions are generated to debug and repair the code.

8. The code generation implementation method based on a large language model and retrieval enhancement according to claim 1, characterized in that, The knowledge graph in question is a control program knowledge graph.

9. A code generation system based on a large language model and retrieval enhancement, characterized in that it includes: The vector database construction module is used to process multimodal data and build a vector database. The professional knowledge graph construction module is used to construct professional knowledge graphs based on a vector database. The large model query module is used to deploy large models through the framework and associate them with professional knowledge graphs. Based on the accepted user requirements, it extracts key information to query entities in the professional knowledge graph, outputs the vector database knowledge associated with the queried entities, and integrates it according to the set prompt template as input information for the large model to realize prompt engineering. The code generation module is used to generate code based on user needs and retrieved knowledge after the large model receives input information. The generated code is then validated to evaluate the large model's ability to generate code.

10. A computer-readable storage medium, characterized in that, The storage medium stores a computer program, which, when executed by a processor, implements a code generation method based on a large language model and retrieval enhancement as described in any one of claims 1-8.