A substation inspection multi-modal large model post-training method, system, device and medium

By constructing a multimodal large model's thought chain data samples and employing a multi-stage training strategy, the problem of low model training efficiency in substation inspection was solved. This enabled progressive learning from simple to complex and logically standardized defect reasoning, thereby improving the automated analysis capabilities of substation inspection.

CN122242749A · Pending · Publication Date: 2026-06-19 · STATE GRID SICHUAN ELECTRIC POWER CORP ELECTRIC POWER RES INST +3

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: STATE GRID SICHUAN ELECTRIC POWER CORP ELECTRIC POWER RES INST
Filing Date: 2026-03-19
Publication Date: 2026-06-19


Abstract

This invention discloses a post-training method, system, device, and medium for a multimodal large model for substation inspection, relating to the field of model training technology. The key technical points are as follows: question text templates for different question types and their corresponding answer text templates are combined with multiple substation defect sample images to construct multiple thought chain generation data samples; these thought chain generation data samples are input into a multi-stage CoT generation model and, under pre-constructed substation inspection defect standard constraints, substation defect reasoning datasets corresponding to the different question types are obtained; the substation defect reasoning datasets are input into a base multimodal large model in order of question difficulty to perform supervised fine-tuning; and the fine-tuned base multimodal large model is trained with GRPO reinforcement learning until the loss function converges, yielding a multimodal reasoning large model for substation defects.

Description

Technical Field

[0001] This invention relates to the field of model training technology, specifically to a method, system, equipment, and medium for post-training of a multimodal large model for substation inspection.

Background Technology

[0002] With the rapid development of artificial intelligence technology, technologies such as Multimodal Large Language Models (MLLM) have made breakthroughs in tasks such as general visual question answering and image description. These models can understand and process multimodal information such as images and text, demonstrating strong generalization capabilities. Applying MLLM to the field of intelligent inspection of power system substations, through automated analysis of inspection images, identification of equipment defects, reasoning about the causes of defects, and assessment of defect levels, is of great significance for improving operation and maintenance efficiency and ensuring the safe and stable operation of the power grid.

[0003] In the prior art, applying MLLMs to substation defect visual reasoning typically relies on multimodal large language models (such as the Qwen-VL series and GPT-o1 models) whose reasoning capabilities are trained with reinforcement learning. These models undergo large-scale reinforcement learning (RL) training to elicit their inherent chain-of-thought (CoT) reasoning abilities. However, the substation field still lacks high-quality multimodal reasoning data that contains complete intermediate thought processes and follows professional inspection procedures. If the base MLLM is trained directly with reinforcement learning, without first using such data for a "cold start", it is difficult to guide the model to generate stable, coherent, and procedurally compliant reasoning. Furthermore, the existing two-stage post-training strategy of supervised fine-tuning (SFT) followed directly by large-scale reinforcement learning often ignores the varying difficulty of substation defect reasoning and judgment tasks, which range from simple yes / no judgments to multiple-choice classification and on to complex analysis and localization. Ignoring this progression leads to low training efficiency and difficulty converging on complex tasks.

[0004] For example, an invention patent with publication number CN118114770A discloses a model training method and device based on thought chains. This method constructs a thought chain reminder, a first training question, and a first training answer for generating thought chains in a large language model. The thought chain reminder, the first training question, and the first training answer are input into the large language model for training thought chain reasoning, generating a target thought chain. A second training question and its corresponding second training answer are obtained for fine-tuning a small language model to be trained. The second training question is used as input, and the target thought chain and the second training answer are used as training targets, input into the small language model to be trained for fine-tuning its reasoning ability, resulting in a fine-tuned small language model. This achieves fine-tuning of the small language model based on the target thought chain generated by the large language model, realizing thought chain knowledge distillation and avoiding the current situation where thought chains can only improve the performance of the large language model. Although this prior art constructs a thought chain reminder, a first training question, and a first training answer to train and fine-tune the thought chain reasoning of the large language model, this training method lacks consideration of the task difficulty level and logical progression, resulting in low model training efficiency.

[0005] Therefore, the present invention aims to provide a method, system, equipment and medium for training a multimodal large model of substation inspection, in order to solve the related problems mentioned above.

Summary of the Invention

[0006] The technical problem this invention aims to solve is that existing technologies lack consideration for the levels of task difficulty and logical progression, resulting in low model training efficiency. The goal is to provide a method, system, equipment, and medium for training a multimodal large-scale model of substation inspection. This involves combining question text templates for different question types and their corresponding answer text templates with substation defect sample images to create multiple thought chain generation data samples based on different question types. Using these different thought chain generation data samples, a multi-stage CoT generation model is used to generate thought chains corresponding to different question types under constraints. These thought chains are then categorized to construct different substation defect inference datasets. Finally, multiple substation defect inference datasets are used to fine-tune the multimodal large-scale model at multiple levels, thereby solving the problem of the model lacking progressive learning data samples from simple judgments to complex analysis and localization.

[0007] This invention is achieved through the following technical solution:

[0008] A method for post-training a multimodal large model of substation inspection, the method comprising:

[0009] Multiple thought chain generation data samples are constructed by combining question text templates for different question types and corresponding answer text templates with multiple substation defect sample images; each thought chain generation data sample includes a question text template, an answer text template, and a substation defect sample image.

[0010] Multiple thought chain generated data samples are input into the multi-stage CoT generation model. Under the constraints of the pre-built substation inspection defect standard, substation defect reasoning datasets corresponding to different query question types are obtained.

[0011] Multiple substation defect inference datasets are sequentially input into the base multimodal large model, and the base multimodal large model is fine-tuned under supervision. The fine-tuned base multimodal large model is trained using GRPO reinforcement learning until the loss function converges, thus obtaining the substation defect multimodal inference large model.

[0012] Furthermore, the method also includes:

[0013] Different question template libraries and corresponding answer template libraries are constructed based on different question types. Each question template library stores a question text template corresponding to the question type, and each answer template library stores an answer text template corresponding to the question type.

[0014] An image database is constructed, comprising multiple substation defect sample images; wherein each substation defect sample image is labeled with a defect label and a defect bounding box.

[0015] Furthermore, the question types include yes / no questions, choice questions, and analysis questions.

[0016] Furthermore, constructing multiple thought chain generation data samples by combining question text templates for different question types and their corresponding answer text templates with multiple substation defect sample images specifically comprises:

[0017] Multiple question text templates are randomly selected from the question template library in sequence, and answer text templates are randomly selected from the corresponding answer template library;

[0018] Multiple substation defect sample images are randomly selected from the image database. The selected question text templates, answer text templates, and substation defect sample images are combined one by one to construct multiple thought chain generation data samples.

[0019] Furthermore, inputting the multiple thought chain generation data samples into the multi-stage CoT generation model and obtaining, under the pre-constructed substation inspection defect standard constraints, the substation defect reasoning datasets corresponding to different question types, where the multi-stage CoT generation model includes a multimodal large model and a text reasoning large language model, specifically comprises:

[0020] Multiple thought chain generation data samples are input into the multimodal large model. Under the pre-constructed substation inspection defect standard constraints, multiple initial thought chain texts are generated. The initial thought chain texts are combined with the substation defect sample images and question text templates in the corresponding thought chain generation data samples to construct multiple thought chain enhancement data samples.

[0021] Multiple thought chain enhancement data samples are input into the multimodal large model to generate multiple image description texts; the image description texts are combined with the substation defect sample images in the corresponding thought chain generation data samples to construct multiple thought chain optimization data samples.

[0022] Multiple thought chain optimization data samples are input into the text reasoning large language model to generate multiple thought chain texts;

[0023] Multiple thought chain data samples are constructed by combining multiple thought chain texts with substation defect sample images, question text templates, and answer text templates from the corresponding thought chain generated data samples.

[0024] Based on the question types of the question text templates in the thought chain data samples, the multiple thought chain data samples are classified and multiple substation defect reasoning datasets are constructed.

[0025] Furthermore, the multiple substation defect reasoning datasets include yes / no question reasoning datasets, choice question reasoning datasets, and analysis question reasoning datasets.

[0026] This invention also provides a post-training system for a multimodal large model of substation inspection, which is used in the post-training method for a multimodal large model of substation inspection described in any of the above claims. The system includes:

[0027] The data sample construction module is used to construct multiple thought chain generated data samples by combining question text templates of different question types and corresponding answer text templates with multiple substation defect sample images; wherein each thought chain generated data sample includes a question text template, an answer text template, and a substation defect sample image;

[0028] The dataset construction module is used to input multiple thought chain generated data samples into the multi-stage CoT generation model. Under the constraints of the pre-built substation inspection defect standard constraints, the substation defect reasoning datasets corresponding to different query question types are obtained.

[0029] The model training module is used to sequentially input multiple substation defect inference datasets into the base multimodal large model and perform supervised fine-tuning of the base multimodal large model; GRPO reinforcement learning is used to train the fine-tuned base multimodal large model until the loss function converges, thus obtaining the substation defect multimodal inference large model.

[0030] The present invention also provides a computer device, including a system memory and a processor, wherein the system memory stores a computer program, and the processor executes the computer program to implement the steps of any of the methods described above.

[0031] The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the methods described above.

[0032] The present invention also provides a computer program product containing instructions that, when executed by a cluster of computer devices, cause the cluster of computer devices to perform the method described in any of the preceding claims.

[0033] Compared with the prior art, the present invention has the following advantages and beneficial effects:

[0034] In this invention, multiple thought chain generation data samples based on different question types are constructed by combining question text templates and corresponding answer text templates with substation defect sample images. These thought chain generation data samples are then used to generate thought chains corresponding to different question types under constraints through a multi-stage CoT generation model. Different thought chains are then classified to construct different substation defect inference datasets. Finally, multiple substation defect inference datasets are used to fine-tune a multimodal large model at multiple levels, thereby solving the problem of the model lacking progressive learning data samples from simple judgments to complex analysis and localization.

Attached Figure Description

[0035] To more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort. In the drawings:

[0036] Figure 1 is a schematic flowchart of the post-training method for a multimodal large model for substation inspection in this embodiment;

[0037] Figure 2 is a schematic diagram of the module connections of the post-training system for a multimodal large model for substation inspection in this embodiment;

[0038] Figure 3 is a schematic structural diagram of a computer device in this embodiment.

Detailed Implementation

[0039] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0040] In this disclosure, unless otherwise stated, the use of terms such as "first," "second," etc., to describe various elements is not intended to limit the positional, temporal, or importance relationships of these elements; such terms are merely used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in other cases, based on the context, they may refer to different instances.

[0041] The terminology used in the description of the various examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context explicitly indicates otherwise, an element may be one or more unless the number of elements is specifically limited. Furthermore, the term "and / or" as used in this disclosure covers any one of the listed items and all possible combinations thereof.

[0042] Example 1

[0043] Referring to Figure 1, which shows a flowchart of a post-training method for a multimodal large model for substation inspection, the method includes:

[0044] S1: Construct multiple thought chain generation data samples by combining question text templates for different question types and corresponding answer text templates with multiple substation defect sample images; wherein each thought chain generation data sample includes a question text template, an answer text template, and a substation defect sample image;

[0045] It should be noted that in this embodiment, multiple question text templates are randomly selected from the question template library in sequence, and answer text templates are randomly selected from the corresponding answer template library; multiple substation defect sample images are randomly selected from the image database, and the selected question text templates, answer text templates and substation defect sample images are combined one by one to construct multiple thought chain generation data samples.
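The random selection and one-by-one combination described in S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this embodiment: the `DefectImage` and `CoTGenSample` classes, the `build_samples` function, the `seed` parameter, the question type keys, and the image file names and labels are all hypothetical, while the template strings are the example templates given in this description.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectImage:
    path: str            # hypothetical file name
    defect_label: str    # annotated defect label
    bbox: tuple          # annotated bounding box (x1, y1, x2, y2)


@dataclass(frozen=True)
class CoTGenSample:
    question_type: str
    question_template: str
    answer_template: str
    image: DefectImage


# One question library and one answer library per question type (hypothetical keys).
QUESTION_LIBS = {
    "yes_no": ["Is there a defect in this image?"],
    "choice": ["What type of substation defect is present in this image? "
               "A[Defect Type 1], B[Defect Type 2], C[Defect Type 3], D[Defect Type 4]"],
    "analysis": ["Analyze the defects present in the image and give the location of the defects"],
}
ANSWER_LIBS = {
    "yes_no": ["Yes", "No"],
    "choice": ["A[Defect Type 1]", "B[Defect Type 2]"],
    "analysis": ["The [defect type] present in the [equipment component] in the image is located at..."],
}


def build_samples(images, n_per_type, seed=0):
    """Randomly pair a question template, an answer template, and an image."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    samples = []
    for qtype in QUESTION_LIBS:
        for _ in range(n_per_type):
            samples.append(CoTGenSample(
                question_type=qtype,
                question_template=rng.choice(QUESTION_LIBS[qtype]),
                answer_template=rng.choice(ANSWER_LIBS[qtype]),
                image=rng.choice(images),
            ))
    return samples


images = [
    DefectImage("img_001.jpg", "insulator contamination", (10, 20, 110, 220)),
    DefectImage("img_002.jpg", "oil leakage", (5, 5, 50, 60)),
]
samples = build_samples(images, n_per_type=2)
```

In practice the answer template would be filled in from the image's defect label and bounding box; the sketch only shows the pairing step.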

[0046] S2: Input the multiple thought chain generation data samples into the multi-stage CoT generation model respectively. Under the pre-constructed substation inspection defect standard constraints, the substation defect reasoning datasets corresponding to different question types are obtained.

[0047] S3: Input multiple substation defect inference datasets into the base multimodal large model in sequence, and perform supervised fine-tuning of the base multimodal large model; use GRPO reinforcement learning to train the fine-tuned base multimodal large model until the loss function converges, and obtain the substation defect multimodal inference large model.

[0048] It should be noted that in this embodiment, the question types are classified according to the difficulty of the reasoning in the answer, specifically including yes / no type questions, choice type questions, and analysis type questions; while in other embodiments, the question types may also adopt other classification forms, or be divided into open type questions and closed type questions, and no further restrictions are imposed here;

[0049] Meanwhile, in this embodiment, the base multimodal large model is a pre-trained model with multimodal understanding and instruction following capabilities, specifically the Qwen2.5-VL-7B-Instruct large model. Other large models can also be used in other embodiments, and no further restrictions are imposed here.

[0050] Meanwhile, in S2, the substation defect reasoning datasets corresponding to different question types are obtained, specifically including three types of reasoning datasets: yes / no question reasoning dataset, choice question reasoning dataset, and analysis question reasoning dataset. Each reasoning dataset includes multiple thought chain data samples. Each thought chain data sample consists of thought chain text generated by the multi-stage CoT generation model based on the thought chain generated data sample, as well as substation defect sample images, question text templates, and answer text templates in the corresponding thought chain generated data sample.
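As an illustration of how thought chain data samples might be grouped into the three reasoning datasets, consider the sketch below. The `CoTSample` class, its field names, the question type keys, and the sample contents are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CoTSample:
    image_path: str      # substation defect sample image
    question_type: str   # "yes_no", "choice", or "analysis" (hypothetical keys)
    question: str        # question text template
    answer: str          # answer text template
    cot_text: str        # thought chain text from the multi-stage CoT generation model


def build_reasoning_datasets(samples):
    """Classify thought chain data samples by the question type of their question template."""
    datasets = defaultdict(list)
    for sample in samples:
        datasets[sample.question_type].append(sample)
    return dict(datasets)


cot_samples = [
    CoTSample("img_001.jpg", "yes_no", "Is there a defect in this image?", "Yes", "..."),
    CoTSample("img_002.jpg", "yes_no", "Is there a defect in this image?", "No", "..."),
    CoTSample("img_001.jpg", "choice", "What type of substation defect is present in this image?",
              "B[Defect Type 2]", "..."),
    CoTSample("img_001.jpg", "analysis",
              "Analyze the defects present in the image and give the location of the defects",
              "The [defect type] present in the [equipment component] in the image is located at...",
              "..."),
]
datasets = build_reasoning_datasets(cot_samples)
```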

[0051] Therefore, in S3, a curriculum learning approach is adopted first: the base multimodal large model is cold-started by supervised fine-tuning in sequence, in order of increasing difficulty of the defect reasoning questions (yes / no questions, choice questions, analysis questions). The fine-tuning parameters for each stage are set as follows: training epochs = 3, batch size = 16, AdamW optimizer, learning rate = 1e-5, and a cross-entropy loss function that measures the discrepancy between the model's predicted token probabilities and the ground-truth tokens. The specific fine-tuning steps are as follows. First, the base multimodal large model undergoes first-stage supervised fine-tuning on the yes / no question reasoning dataset: the input to the model is the substation defect sample image and question text template from each thought chain data sample in the yes / no question reasoning dataset, and the optimization goal is to generate the thought chain text and output the correct answer text template. Second-stage supervised fine-tuning is then performed on the choice question reasoning dataset, with the substation defect sample image and question text template from each of its thought chain data samples as input and the same optimization goal. Finally, third-stage supervised fine-tuning is performed on the analysis question reasoning dataset, again with the substation defect sample image and question text template as input and the thought chain text and correct answer text template as the optimization goal. After these three stages of supervised fine-tuning are complete, the fine-tuned multimodal large model is obtained.
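The staged fine-tuning schedule can be sketched as plain configuration code. The hyperparameters (3 epochs, batch size 16, AdamW, learning rate 1e-5, cross-entropy loss) are those stated in this embodiment; the `SFTStageConfig` class, the `build_schedule` and `token_cross_entropy` helpers, the stage keys, and the dataset sizes are hypothetical and serve only to show the ordering and the loss being minimized.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTStageConfig:
    dataset_name: str
    epochs: int = 3
    batch_size: int = 16
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5
    loss: str = "cross_entropy"


# Easy-to-hard ordering of the three reasoning datasets (hypothetical keys).
CURRICULUM = ("yes_no", "choice", "analysis")


def build_schedule(dataset_sizes):
    """Return (config, optimizer_steps) for each stage, in curriculum order."""
    schedule = []
    for name in CURRICULUM:
        cfg = SFTStageConfig(dataset_name=name)
        steps = cfg.epochs * math.ceil(dataset_sizes[name] / cfg.batch_size)
        schedule.append((cfg, steps))
    return schedule


def token_cross_entropy(probs_of_true_tokens):
    """Mean negative log-likelihood the model assigns to the ground-truth tokens."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)


# Dataset sizes below are made up for illustration.
schedule = build_schedule({"yes_no": 1000, "choice": 800, "analysis": 500})
for cfg, steps in schedule:
    print(cfg.dataset_name, steps)
```

Each stage would start from the weights produced by the previous stage, so that the model sees analysis questions only after it has been fitted to the simpler question types.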

[0052] However, supervised fine-tuning mainly fits the distribution of the training data. The defect reasoning process generated by the model may still contain logical errors, use non-standard expressions, or fail to align fully with substation inspection procedures. To further improve the reasoning quality, accuracy, and robustness of the fine-tuned multimodal large model and its alignment with human expert judgment, reinforcement learning (RL) is employed, and the Group Relative Policy Optimization (GRPO) algorithm is introduced to further optimize the fine-tuned multimodal large model. Specifically, the environment, policy network, and reward function are defined. The environment consists of all thought chain data samples in the question reasoning datasets; each thought chain data sample corresponds to a state. The agent (the fine-tuned multimodal large model) generates an action based on the state, i.e., a response containing the thought chain text and the final answer text template. The policy network is the fine-tuned multimodal large model; its input is a state, and its output is a probability distribution over actions. The reward function evaluates the quality of the response the model generates in a given state. Based on the substation defect reasoning requirements, three kinds of reward are considered: a reasoning accuracy reward, a reasoning thought chain text quality reward, and a format conformity reward. The reasoning accuracy reward reflects whether the model's final answer matches the true label; the reasoning thought chain text quality reward evaluates the coherence, logic, and compliance with substation inspection procedures of the generated thought chain; and the format conformity reward reflects whether the model outputs results in the required format. The reward function is as follows:

$$R(s, a) = \lambda_1 R_{\mathrm{acc}}(s, a) + \lambda_2 R_{\mathrm{cot}}(s, a) + \lambda_3 R_{\mathrm{fmt}}(s, a)$$

In the formula, $R_{\mathrm{acc}}$ is the reasoning accuracy reward; $R_{\mathrm{cot}}$ is the reasoning thought chain text quality reward; $R_{\mathrm{fmt}}$ is the format conformity reward; $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients used to balance the influence of the different reward dimensions; $s$ denotes the state, and $a$ denotes the action.
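A minimal sketch of such a reward follows, assuming the three rewards are combined as a weighted sum. Several details are assumptions made for illustration: the `<think>…</think><answer>…</answer>` output format, the weight values `(0.6, 0.3, 0.1)`, exact-match scoring for accuracy, and a thought chain quality score in [0, 1] supplied by an external scorer (for example a rubric or judge model). None of these is specified in this embodiment.

```python
import re

# Hypothetical required output format: reasoning in <think>, final answer in <answer>.
FORMAT_RE = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def accuracy_reward(pred_answer, true_answer):
    """1.0 if the final answer matches the ground-truth label, else 0.0."""
    return 1.0 if pred_answer.strip().lower() == true_answer.strip().lower() else 0.0


def format_reward(output):
    """1.0 if the whole output follows the required format, else 0.0."""
    return 1.0 if FORMAT_RE.match(output.strip()) else 0.0


def total_reward(output, true_answer, quality_score, weights=(0.6, 0.3, 0.1)):
    """Weighted sum of accuracy, thought chain quality, and format rewards."""
    match = ANSWER_RE.search(output)
    pred = match.group(1) if match else ""
    w_acc, w_cot, w_fmt = weights
    return (w_acc * accuracy_reward(pred, true_answer)
            + w_cot * quality_score
            + w_fmt * format_reward(output))


good = "<think>Dirt layers are visible on the insulator surface.</think><answer>Yes</answer>"
print(total_reward(good, "Yes", quality_score=0.5))
```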

[0053] Then, GRPO iterative training is performed. A batch of question text templates is randomly selected from the question reasoning dataset. For each question text template, the current policy network is sampled to generate a group of answer outputs, with group size $G$, so that each thought chain data sample corresponds to one group. For each output in a group, the reward function is used to calculate its reward value $r_i$, giving the reward set $\{r_1, \dots, r_G\}$ of that group. The GRPO algorithm does not rely on an independent value network (critic); instead, it optimizes the policy using the relative advantage of rewards within the group. For the $i$-th output in the group, its advantage $A_i$ is calculated as:

$$A_i = \frac{r_i - \mu_r}{\sigma_r + \epsilon}$$

where $\mu_r$ is the mean reward within the group, $\sigma_r$ is the standard deviation of the rewards within the group, and $\epsilon$ is a small constant added to prevent the denominator from being zero. With this advantage estimate, outputs above the group average receive a positive advantage (and are encouraged), while those below receive a negative advantage (and are suppressed). Based on the calculated advantage $A_i$, an objective of the Proximal Policy Optimization (PPO) form is used to update the model parameters $\theta$. The GRPO loss function $L(\theta)$ is defined as follows:

$$L(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i(\theta) A_i,\ \mathrm{clip}\!\left(\rho_i(\theta),\, 1 - \epsilon_{\mathrm{clip}},\, 1 + \epsilon_{\mathrm{clip}}\right) A_i \right) + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$$

[0054] where $\rho_i(\theta) = \pi_\theta(a_i \mid s) / \pi_{\theta_{\mathrm{old}}}(a_i \mid s)$ is the probability ratio of the new policy to the old policy; $\mathrm{clip}(\cdot)$ is a clipping function that restricts the probability ratio to the range $[1 - \epsilon_{\mathrm{clip}},\, 1 + \epsilon_{\mathrm{clip}}]$ to prevent the policy update step from being too large; $D_{\mathrm{KL}}$ is the KL divergence term, used to keep the current policy $\pi_\theta$ from deviating too far from the fine-tuned multimodal large model $\pi_{\mathrm{ref}}$ so that training remains stable; and $\beta$ is the weight coefficient of the KL divergence term. The model parameters $\theta$ are updated by gradient descent to minimize the loss function $L(\theta)$ until convergence. After training with GRPO reinforcement learning, a multimodal reasoning large model for substation defects is obtained, which is capable of high-quality visual reasoning on complex substation defects, conforms to substation inspection procedures, and has a standardized output format.
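The group-relative advantage (each reward minus the group mean, divided by the group standard deviation plus a small constant) and a PPO-style clipped loss with a KL penalty can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: log-probabilities are treated at the sequence level (practical implementations typically work per token), the KL term is passed in as a precomputed number, the population standard deviation is used, and the values `eps=1e-8`, `clip_eps=0.2`, and `beta=0.04` are hypothetical defaults not taken from this embodiment.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_loss(logp_new, logp_old, advantages, kl, clip_eps=0.2, beta=0.04):
    """Negative clipped surrogate objective averaged over the group, plus a KL penalty."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # new-policy / old-policy probability
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms) + beta * kl


# One group of G = 4 sampled outputs with made-up rewards and log-probabilities.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
loss = grpo_loss(
    logp_new=[-1.0, -2.0, -1.5, -2.5],
    logp_old=[-1.0, -2.0, -1.5, -2.5],
    advantages=advantages,
    kl=0.0,
)
```

Because the advantage is computed from the group's own statistics, no separate value network is needed; if all outputs in a group receive the same reward, every advantage is zero and that group contributes no policy gradient.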

[0055] Specifically, in this embodiment, by combining question text templates and corresponding answer text templates for different question types with substation defect sample images, multiple thought chain generation data samples based on different question types are constructed. Using these different thought chain generation data samples, a multi-stage CoT generation model is used to generate thought chains corresponding to different question types under constraints. Different thought chains are then classified to construct different substation defect inference datasets. Finally, multiple substation defect inference datasets are used to fine-tune the multimodal large model at multiple levels, thereby solving the problem of the model lacking progressive learning data samples from simple judgments to complex analysis and localization.

[0056] As one possible implementation, the method further includes:

[0057] S10: Construct different question template libraries and corresponding answer template libraries based on different question types, wherein each question template library stores a question text template corresponding to the question type, and each answer template library stores an answer text template corresponding to the question type;

[0058] It should be noted that, based on the aforementioned yes / no type questions, a yes / no question template library is constructed, which includes multiple yes / no question text templates, and a corresponding yes / no question answer template library is constructed, which includes multiple yes / no question answer text templates.

[0059] For example, the yes / no question text template is: "Is there a defect in this image?", and the corresponding yes / no question answer text template is: "Yes" / "No";

[0060] Based on the aforementioned choice-type questions, a choice question template library is constructed, which includes multiple choice question text templates, and a corresponding choice question answer template library, which includes multiple choice question answer text templates.

[0061] For example, the choice question text template is: "What type of substation defect is present in this image? A[Defect Type 1], B[Defect Type 2], C[Defect Type 3], D[Defect Type 4]", and the corresponding choice question answer text template is: "B[Defect Type 2]" / "A[Defect Type 1]".

[0062] Based on the aforementioned analysis type of question, an analysis question template library is constructed, which includes multiple analysis question text templates, and a corresponding analysis question answer template library, which includes multiple analysis question answer text templates.

[0063] For example, the analysis question text template is: "Analyze the defects present in the image and give the location of the defects," and the corresponding analysis question answer text template is: "The [defect type] present in the [equipment component] in the image is located at...". For instance: "The contamination present on the surface of the insulator of main transformer No. 1 is located at...".

[0064] S20: Construct an image database comprising multiple substation defect sample images; wherein each substation defect sample image is labeled with a defect label and a defect bounding box.

[0065] As one possible implementation, inputting the multiple thought chain generation data samples into the multi-stage CoT generation model and obtaining, under the pre-constructed substation inspection defect standard constraints, the substation defect reasoning datasets corresponding to different question types, where the multi-stage CoT generation model includes a multimodal large model and a text reasoning large language model, specifically comprises:

[0066] S100: Input the multiple thought chain generation data samples into the multimodal large model respectively. Under the pre-constructed substation inspection defect standard constraints, generate multiple initial thought chain texts. Combine the initial thought chain texts with the substation defect sample images and question text templates in the corresponding thought chain generation data samples to construct multiple thought chain enhancement data samples.

[0067] Specifically, in this embodiment, each thought chain generation data sample is input into the multimodal large model in turn. Under the pre-constructed substation inspection defect standard constraints, and using a first prompt, an initial thought chain text is generated for each thought chain generation data sample. The initial thought chain text includes four parts: summary, description, reasoning, and conclusion. The generated initial thought chain text is then concatenated with the corresponding substation defect sample image and question text template to form a thought chain enhancement data sample for use in the next step of thought chain enhancement.
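One way to check that an initial thought chain text contains the four required parts before it is packed into a thought chain enhancement data sample is sketched below. The XML-style tags, the `parse_initial_cot` and `build_enhancement_sample` helpers, the dictionary layout, and the example text are hypothetical; this embodiment only states that the text has summary, description, reasoning, and conclusion parts.

```python
import re

SECTIONS = ("summary", "description", "reasoning", "conclusion")


def parse_initial_cot(text):
    """Split an initial thought chain text into its four parts (hypothetical tag format)."""
    parts = {}
    for name in SECTIONS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        if match is None:
            raise ValueError(f"missing section: {name}")
        parts[name] = match.group(1).strip()
    return parts


def build_enhancement_sample(image_path, question_template, initial_cot):
    """Combine a validated initial thought chain with its image and question template."""
    parse_initial_cot(initial_cot)  # raises ValueError if a part is missing
    return {
        "image": image_path,
        "question": question_template,
        "initial_cot": initial_cot,
    }


initial_cot = (
    "<summary>Check the insulator for defects.</summary>"
    "<description>Green and black dirt layers cover the insulator surface.</description>"
    "<reasoning>The inspection rules require the surface to be free of dirt accumulation.</reasoning>"
    "<conclusion>Yes</conclusion>"
)
sample = build_enhancement_sample("img_001.jpg", "Is there a defect in this image?", initial_cot)
```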

[0068] It should be noted that in this embodiment, a general multimodal large model is used, such as Qwen2.5-VL-72B. In other embodiments, other multimodal large models can also be used for CoT generation.

[0069] Meanwhile, the pre-constructed substation inspection defect standard constraints include the definitions of typical substation defects, the inspection details, and the judgment standards. For example, in the "Inspection Details of Substation Equipment", items 1.1a) and 1.1c) of "1.1 Busbar and Insulator Inspection Details" stipulate that "there are no foreign objects attached to the surface of the insulator and no dirt accumulation"; if there is obvious dirt accumulation (green and black dirt layers) on the surface of the insulator in the image, it directly violates this clause and is a typical defect. Item 1.2e) of "1.2 Circuit Breaker Inspection Details" emphasizes that "there is no discharge in the external insulation and the anti-pollution coating is intact"; although there are no discharge traces in the image, the dirt accumulation itself reduces the performance of the external insulation, which meets the "dirt accumulation" defect definition. "1.3 Inspection Rules for Through-Wall Bushings" stipulates that "the surface and creepage skirts should be free of serious dirt accumulation"; the dirt accumulation on the umbrella skirt structure (creepage skirt) in the image is obvious and violates this clause. Item 1.4b) of the "Inspection Rules for Oil-Immersed Transformers (Reactors)" requires "no serious oil contamination on the outside of the bushing", but the dirt in the image is mainly environmental pollutants rather than oil contamination, so this clause is for reference only and does not directly apply. In summary, the regulations treat contamination on the surface of the external insulation as a hidden danger that needs to be prioritized, because it reduces the surface resistance and is prone to cause flashover, pollution flashover, or short-circuit accidents under humid conditions. The above are only examples; the specific settings should be made according to actual needs, and no further restrictions are imposed here.
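One way to hold such clauses so that they can be placed into the procedure-document slot of a prompt is sketched below. The `Clause` structure and the `render_procedure_text` helper are hypothetical; only the clause wording comes from the examples above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Clause:
    """One inspection requirement (structure is an assumption of this sketch)."""
    item: str
    section: str
    requirement: str


CLAUSES: List[Clause] = [
    Clause("1.1a), 1.1c)", "Busbar and Insulator Inspection Details",
           "There are no foreign objects attached to the surface of the "
           "insulator and no dirt accumulation."),
    Clause("1.2e)", "Circuit Breaker Inspection Details",
           "There is no discharge in the external insulation and the "
           "anti-pollution coating is intact."),
    Clause("1.3", "Inspection Rules for Through-Wall Bushings",
           "The surface and creepage skirts should be free of serious "
           "dirt accumulation."),
    Clause("1.4b)", "Inspection Rules for Oil-Immersed Transformers (Reactors)",
           "No serious oil contamination on the outside of the bushing."),
]


def render_procedure_text(clauses: List[Clause]) -> str:
    """Flatten the clauses into one text block, one clause per line."""
    return "\n".join(
        f"{c.item} [{c.section}] {c.requirement}" for c in clauses
    )
```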

[0070] The specific template for the first prompt is as follows:

[0071] "I have a picture and a question I need you to answer. I require you to strictly follow a format that includes four specific parts: Summary, Caption, Reasoning, and Conclusion. You must organize your answer exactly according to this structure, and the final answer given in the **Conclusion** section must be completely consistent with the standard correct answer."

[0072] The following is the content of the substation inspection procedure document. Please refer to it when making inferences: { };

[0073] Specifically:

[0074] In the summary, briefly explain how you will approach this problem and what steps you will take to arrive at the answer.

[0075] In the image caption, describe the content of the image, paying particular attention to details relevant to the question.

[0076] In the reasoning section, based on the image content and the aforementioned substation inspection guidelines, outline a step-by-step thought process to solve the problem. Please strictly follow the inspection requirements for the corresponding equipment and facilities in the inspection guidelines during the analysis.

[0077] In the conclusion, give the final answer in a direct and clear format, and the answer must be exactly the same as the correct answer. The conclusion should directly answer yes or no.

[0078] The format should be as follows:

[0079] <summary> [Summarize how you will handle the problem and explain the steps you will take.] </summary>;

[0080] <caption> [A detailed description of the image is provided here, with particular emphasis on aspects relevant to the problem.] </caption>;

[0081] <reasoning> [Here is a logically clear, chain-like explanation of the problem. This should outline the step-by-step reasoning process and be analyzed strictly in accordance with the requirements of the substation inspection guidelines.] </reasoning>;

[0082] <conclusion> [The final answer is stated here in a clear and straightforward format. It must exactly match the correct answer.] </conclusion>.

[0083] Please apply this format strictly and carefully to analyze the given image and answer the related questions, ensuring that your answers perfectly match the standard answers.

[0084] Now, please answer the following questions based on the image below:

[0085] Question;

[0086] Standard answer: Answer;

[0087] Please only provide a complete answer that conforms to the above format; do not add any additional descriptions or explanations.
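The following minimal sketch shows how a prompt of this kind might be assembled and how a reply might be checked against the four-part format. It assumes the image description part is wrapped in `<caption>` tags like the other three parts; the prompt wording is abridged, and the function names are hypothetical.

```python
import re
from typing import Dict

SECTIONS = ("summary", "caption", "reasoning", "conclusion")


def build_first_prompt(procedure_text: str, question: str,
                       standard_answer: str) -> str:
    """Assemble an abridged version of the first prompt."""
    return "\n".join([
        "Answer in four parts: Summary, Caption, Reasoning, and Conclusion.",
        "The following is the content of the substation inspection procedure "
        f"document. Please refer to it when making inferences: {procedure_text}",
        "<summary>...</summary> <caption>...</caption> "
        "<reasoning>...</reasoning> <conclusion>...</conclusion>",
        f"Question: {question}",
        f"Standard answer: {standard_answer}",
    ])


def parse_cot(text: str) -> Dict[str, str]:
    """Extract the four tagged parts; raise ValueError if any is missing."""
    parsed = {}
    for name in SECTIONS:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"missing <{name}> section")
        parsed[name] = match.group(1).strip()
    return parsed


def conclusion_matches(parsed: Dict[str, str], standard_answer: str) -> bool:
    """Case-insensitive check that the conclusion equals the standard answer."""
    return parsed["conclusion"].strip().lower() == standard_answer.strip().lower()
```

A reply that fails `parse_cot` or `conclusion_matches` could be regenerated or discarded; the source does not state which, so that choice is left open here.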

[0088] S200: Input the multiple thought chain enhancement data samples into the multimodal large model to generate multiple image description texts; combine each image description text with the substation defect sample image of the corresponding thought chain generation data sample to construct multiple thought chain optimization data samples.

[0089] Specifically, in this embodiment, each thought chain enhancement data sample is input into the multimodal large model in turn, and an image description text containing all necessary visual details is generated using the second prompt. The generated image description text is then combined with the corresponding substation defect sample image to form a thought chain optimization data sample for use in the next step, thought chain optimization.

[0090] It should be noted that in this embodiment, the multimodal large model adopts a general multimodal large model, such as Qwen2.5-VL-72B. In other embodiments, other multimodal large models can also be used for thought chain enhancement.

[0091] The template for the second prompt is as follows:

[0092] Given an image, a question, and a thought process, please create a detailed visual description including all details necessary to answer the question correctly.

[0093] S300: Input the multiple thought chain optimization data samples into the text reasoning large language model to generate multiple thought chain texts; combine each thought chain text with the substation defect sample image, question text template, and answer text template of the corresponding thought chain generation data sample to construct multiple thought chain data samples.

[0094] Specifically, in this embodiment, each thought chain optimization data sample is input into the text reasoning large language model in turn. Using the third prompt, a high-quality thought chain text that closely resembles the thinking of a human expert is generated. The generated thought chain text is then combined with the corresponding substation defect sample image, question text template, and answer text template to construct a thought chain data sample for subsequent training of the base multimodal large model.

[0095] It should be noted that in this embodiment, the text reasoning large language model adopts DeepSeek-R1. Other pure text reasoning large language models can also be used in other embodiments, and no further restrictions are imposed here.

[0096] The specific template for the third prompt is as follows:

[0097] "Based on the following image descriptions, think step by step to answer the questions correctly."

[0098] [Image Description] {detailed_caption}

[0099] Question

[0100] Answer

[0101] S400: Based on the question type of the question text template in each thought chain data sample, classify the multiple thought chain data samples and construct multiple substation defect reasoning datasets.

[0102] Specifically, in this embodiment, without relying on manual annotation, preliminary structured information and detailed image descriptions are generated using a general multimodal large model. Then, the powerful logical capabilities of the pure text reasoning model are utilized to generate a high-quality thought chain based on the detailed description.
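Steps S100 to S400 can be sketched as one loop, under the assumption that the multimodal large model and the text reasoning model are supplied as plain callables. The `vlm` and `reasoner` signatures, the dictionary keys, and the abridged prompt strings are all hypothetical; real model clients would replace them.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Sample = Dict[str, str]


def run_cot_pipeline(
    samples: List[Sample],
    vlm: Callable[[str, str], str],   # assumed signature: (image_path, prompt) -> text
    reasoner: Callable[[str], str],   # assumed signature: (prompt) -> text
    procedure_text: str,
) -> Dict[str, List[Sample]]:
    """Generate thought chain texts and group the results by question type."""
    datasets: Dict[str, List[Sample]] = defaultdict(list)
    for s in samples:
        # S100: initial thought chain under the inspection standard constraints.
        initial_cot = vlm(
            s["image"],
            f"Procedure: {procedure_text}\n"
            f"Question: {s['question']}\nStandard answer: {s['answer']}",
        )
        # S200: detailed image description guided by the initial thought chain.
        caption = vlm(
            s["image"],
            f"Question: {s['question']}\nThought process: {initial_cot}\n"
            "Create a detailed visual description.",
        )
        # S300: the text-only reasoning model writes the final thought chain.
        cot = reasoner(
            "Based on the following image descriptions, think step by step.\n"
            f"[Image Description] {caption}\n"
            f"Question: {s['question']}\nAnswer: {s['answer']}",
        )
        # S400: classify the finished sample by its question type.
        datasets[s["question_type"]].append({**s, "cot": cot})
    return dict(datasets)
```

Passing the models in as callables keeps the sketch independent of any particular model API and makes the data flow between the three stages easy to test with stubs.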

[0103] The technical solution has the following beneficial effects: the data samples constructed by the present invention contain a complete thinking process covering image understanding, procedure matching, and logical judgment. This data structure helps overcome the lack of substation domain knowledge in general multimodal large models, so that the trained model can not only identify defects but also judge them according to the substation inspection procedure details as a human expert would, which significantly reduces the model's factual errors and hallucinations in complex scenarios.

[0104] Meanwhile, a multi-stage SFT training strategy based on curriculum learning is adopted, with a training curriculum of progressively increasing difficulty: true / false questions, then multiple-choice questions, then analytical questions. This step-by-step learning strategy guides the model to first master basic defect perception and classification capabilities and then gradually transition to complex comprehensive analysis tasks, thereby improving the model's learning efficiency on long logical chain tasks and its final defect reasoning accuracy.
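The staged ordering can be illustrated with a short sketch. The stage keys and the `train_one_stage` callable are hypothetical placeholders for whatever supervised fine-tuning routine is actually used; only the easy-to-hard order comes from the description above.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")

# Assumed dataset keys, ordered from easiest to hardest.
CURRICULUM: Tuple[str, ...] = ("yes_no", "choice", "analysis")


def curriculum_sft(
    model: Model,
    datasets: Dict[str, Sequence],
    train_one_stage: Callable[[Model, Sequence], Model],
    order: Sequence[str] = CURRICULUM,
) -> Tuple[Model, List[str]]:
    """Fine-tune stage by stage, feeding each stage's output into the next."""
    completed: List[str] = []
    for stage in order:
        if stage not in datasets:
            raise KeyError(f"no dataset for curriculum stage '{stage}'")
        model = train_one_stage(model, datasets[stage])
        completed.append(stage)
    return model, completed
```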

[0105] Meanwhile, after supervised fine-tuning, this invention introduces the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm and designs a comprehensive reward function that includes three dimensions: defect reasoning accuracy, CoT quality, and format standardization. This ensures that the defect analysis generated by the final model not only has accurate conclusions, but also that the reasoning process strictly follows the logical steps of the substation inspection procedure and the output format is standardized, which facilitates the automated parsing and integration of subsequent intelligent inspection-related business systems.
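A minimal sketch of a three-part reward and of the group-relative advantage used in GRPO is given below. The weights, the key-term proxy for CoT quality, and all function names are assumptions made for illustration; the source does not disclose the concrete reward formulas. The advantage is computed in one common form, each reward minus the group mean, divided by the group standard deviation.

```python
import re
import statistics
from typing import List, Optional, Sequence

TAGS = ("summary", "caption", "reasoning", "conclusion")


def extract(output: str, tag: str) -> Optional[str]:
    match = re.search(rf"<{tag}>(.*?)</{tag}>", output, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def format_reward(output: str) -> float:
    """1.0 if all four tagged parts are present and in order, else 0.0."""
    starts = []
    for tag in TAGS:
        match = re.search(rf"<{tag}>.*?</{tag}>", output, flags=re.DOTALL)
        if match is None:
            return 0.0
        starts.append(match.start())
    return 1.0 if starts == sorted(starts) else 0.0


def accuracy_reward(output: str, answer: str) -> float:
    conclusion = extract(output, "conclusion")
    return 1.0 if conclusion is not None and conclusion.lower() == answer.lower() else 0.0


def cot_quality_reward(output: str, key_terms: Sequence[str]) -> float:
    """Hypothetical proxy: share of expected procedure terms in the reasoning."""
    reasoning = (extract(output, "reasoning") or "").lower()
    if not key_terms:
        return 0.0
    return sum(term.lower() in reasoning for term in key_terms) / len(key_terms)


def total_reward(output: str, answer: str, key_terms: Sequence[str],
                 weights=(0.6, 0.2, 0.2)) -> float:  # weights are assumed values
    w_acc, w_cot, w_fmt = weights
    return (w_acc * accuracy_reward(output, answer)
            + w_cot * cot_quality_reward(output, key_terms)
            + w_fmt * format_reward(output))


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples in its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantage is relative to the group, a set of sampled answers that all receive the same reward contributes no learning signal, which is why the three reward dimensions are combined into a single scalar before normalization in this sketch.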

[0106] Example 2

[0107] This invention also provides a post-training system for a substation inspection multimodal large model, which is used to implement the post-training method for a substation inspection multimodal large model described in Embodiment 1. The system includes:

[0108] The data sample construction module 100 is used to construct multiple thought chain generated data samples by combining question text templates of different question types and corresponding answer text templates with multiple substation defect sample images; wherein each thought chain generated data sample includes a question text template, an answer text template, and a substation defect sample image.

[0109] The dataset construction module 200 is used to input the multiple thought chain generation data samples into the multi-stage CoT generation model and, under the pre-constructed substation inspection defect standard constraints, obtain the substation defect reasoning datasets corresponding to the different question types.

[0110] The model training module 300 is used to input multiple substation defect inference datasets into the base multimodal large model in sequence, and to fine-tune the base multimodal large model; the fine-tuned base multimodal large model is trained using GRPO reinforcement learning until the loss function converges, thus obtaining the substation defect multimodal inference large model.

[0111] It should be noted that the modules in the system of Embodiment 2 correspond to the steps in the method of Embodiment 1. The steps in the method of Embodiment 1 have been described in detail in Embodiment 1, and the module content in the system will not be described in detail in this Embodiment 2.

[0112] Example 3

[0113] This embodiment also provides a computer device, including a system memory 1005 and a processor 1001. The system memory 1005 stores a computer program, and the processor 1001 executes the computer program to implement the steps of any of the methods described above.

[0114] It should be noted that the processor 1001 is used to execute the steps in the above method embodiments according to the instructions in the program code. Alternatively, when the processor 1001 executes the computer program, it implements the functions of each module / unit in the above system / device embodiments.

[0115] Specifically, in this embodiment, the computer program can be divided into one or more modules / units, which are stored in the system memory 1005 and executed by the processor 1001 to complete this application. The one or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program in the terminal device.

[0116] The terminal device can be a desktop computer, laptop, handheld computer, or cloud server, etc. The terminal device may include, but is not limited to, a processor 1001 and a system memory 1005. Those skilled in the art will understand that this does not constitute a limitation on the terminal device; it may include more or fewer components than shown in the figures, or a combination of certain components, or different components. For example, the terminal device may also include an input / output device 1003, a network access device 1002, a bus 1006, etc.

[0117] The processor 1001 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0118] System memory 1005 can be an internal storage unit of the terminal device, such as a hard drive or RAM. System memory 1005 can also be a storage device 1004 of the terminal device, such as an external hard drive, SmartMedia Card (SMC), Secure Digital (SD) card, or FlashCard. Furthermore, system memory 1005 can include both internal storage units and storage device 1004. System memory 1005 is used to store computer programs and other programs and data required by the terminal device. System memory 1005 can also be used to temporarily store data that has been output or will be output.

[0119] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0120] Example 4

[0121] This embodiment provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the methods described above.

[0122] The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), registers, optical fibers, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof, or any other form of computer-readable storage medium in the art.

[0123] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and storage medium can reside within an application-specific integrated circuit (ASIC). In embodiments of the invention, the computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0124] Example 5

[0125] This embodiment also provides a computer program product containing instructions that, when executed by a cluster of computer devices, cause the cluster of computer devices to perform the method described in Embodiment 1.

[0126] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above description is only a specific embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A post-training method for a substation inspection multimodal large model, characterized in that the method comprises: constructing multiple thought chain generation data samples by combining question text templates of different question types and the corresponding answer text templates with multiple substation defect sample images, wherein each thought chain generation data sample includes a question text template, an answer text template, and a substation defect sample image; inputting the multiple thought chain generation data samples into a multi-stage CoT generation model and, under pre-constructed substation inspection defect standard constraints, obtaining substation defect reasoning datasets corresponding to the different question types; sequentially inputting the multiple substation defect reasoning datasets into a base multimodal large model and performing supervised fine-tuning on the base multimodal large model; and training the fine-tuned base multimodal large model using GRPO reinforcement learning until the loss function converges, thereby obtaining a substation defect multimodal reasoning large model.

2. The post-training method for a substation inspection multimodal large model according to claim 1, characterized in that the method further comprises: constructing different question template libraries and corresponding answer template libraries based on the different question types, wherein each question template library stores the question text templates of the corresponding question type and each answer template library stores the answer text templates of the corresponding question type; and constructing an image database comprising multiple substation defect sample images, wherein each substation defect sample image is labeled with a defect label and a defect bounding box.

3. The method for training a multimodal large model for substation inspection according to claim 1, wherein the question types include yes / no type questions, choice type questions, and analysis type questions.

4. The post-training method for a substation inspection multimodal large model according to claim 2, characterized in that constructing multiple thought chain generation data samples by combining question text templates of different question types and the corresponding answer text templates with multiple substation defect sample images specifically comprises: randomly selecting multiple question text templates from the question template libraries in turn, and randomly selecting answer text templates from the corresponding answer template libraries; randomly selecting multiple substation defect sample images from the image database; and combining the selected question text templates, answer text templates, and substation defect sample images one by one to construct the multiple thought chain generation data samples.

5. The post-training method for a substation inspection multimodal large model according to claim 1, characterized in that the multi-stage CoT generation model includes a multimodal large model and a text reasoning large language model, and inputting the multiple thought chain generation data samples into the multi-stage CoT generation model and, under the pre-constructed substation inspection defect standard constraints, obtaining the substation defect reasoning datasets corresponding to the different question types specifically comprises: inputting the multiple thought chain generation data samples into the multimodal large model and, under the pre-constructed substation inspection defect standard constraints, generating multiple initial thought chain texts; combining the multiple initial thought chain texts with the substation defect sample images and question text templates of the corresponding thought chain generation data samples to construct multiple thought chain enhancement data samples; inputting the multiple thought chain enhancement data samples into the multimodal large model to generate multiple image description texts; combining the multiple image description texts with the substation defect sample images of the corresponding thought chain generation data samples to construct multiple thought chain optimization data samples; inputting the multiple thought chain optimization data samples into the text reasoning large language model to generate multiple thought chain texts; combining the multiple thought chain texts with the substation defect sample images, question text templates, and answer text templates of the corresponding thought chain generation data samples to construct multiple thought chain data samples; and classifying the multiple thought chain data samples based on the question types of their question text templates to construct the multiple substation defect reasoning datasets.

6. The post-training method for a substation inspection multimodal large model according to claim 1, characterized in that the multiple substation defect reasoning datasets include a yes / no question reasoning dataset, a choice question reasoning dataset, and an analysis question reasoning dataset.

7. A post-training system for a substation inspection multimodal large model, characterized in that the system is used to implement the post-training method for a substation inspection multimodal large model according to any one of claims 1-6, and the system comprises: a data sample construction module, used to construct multiple thought chain generation data samples by combining question text templates of different question types and the corresponding answer text templates with multiple substation defect sample images, wherein each thought chain generation data sample includes a question text template, an answer text template, and a substation defect sample image; a dataset construction module, used to input the multiple thought chain generation data samples into the multi-stage CoT generation model and, under the pre-constructed substation inspection defect standard constraints, obtain the substation defect reasoning datasets corresponding to the different question types; and a model training module, used to sequentially input the multiple substation defect reasoning datasets into the base multimodal large model and perform supervised fine-tuning on the base multimodal large model, and to train the fine-tuned base multimodal large model using GRPO reinforcement learning until the loss function converges, thereby obtaining the substation defect multimodal reasoning large model.

8. A computer device comprising a system memory and a processor, wherein the system memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product containing instructions, characterized in that when the instructions are executed by a cluster of computer devices, they cause the cluster of computer devices to perform the method according to any one of claims 1 to 6.