Artificial intelligence-based power customer service model optimization method and system

By utilizing meta-training of support sets and query sets, TF-IDF feature extraction, and PPO reinforcement learning in the power customer service Q&A service, the large model is optimized, solving the problem of low inference accuracy in power customer service Q&A services under small sample scenarios, and improving the model's inference ability and interpretability.

CN120764689B · Active · Publication Date: 2026-06-26 · WUXI PENGPAI SHUZHI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patent (China)
Current Assignee / Owner: WUXI PENGPAI SHUZHI TECH CO LTD
Filing Date: 2025-07-01
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, it is difficult to optimize large models efficiently for power customer service Q&A in small-sample scenarios, which results in low inference accuracy; traditional optimization methods are also inefficient.

Method used

A support set and a query set are acquired, and the large model is meta-trained on them using the Episodic Training method. The TF-IDF algorithm is then used to extract inference chain feature vectors, from which a comprehensive policy reward value is calculated, and the PPO algorithm is used for reinforcement learning to optimize the large model and improve its inference capability.

Benefits of technology

This improves the reasoning accuracy and interpretability of large models in power customer service Q&A, enabling more efficient decision support.

Abstract

The application belongs to the technical field of artificial intelligence and specifically discloses an artificial intelligence-based power customer service model optimization method and system. The method obtains a support set and a query set for small-sample learning of a large model, performs meta-training on the large model using the support set and the query set, and retrieves an auxiliary set to run inference tests on the pre-trained large model with power business scenario problem instances. Based on the inference steps and inference results obtained from the tests, it then performs reinforcement learning on the pre-trained large model using a comprehensive policy reward value, obtaining a policy-optimized large model that infers solutions to power business scenario problems in actual application. The application uses small-sample learning to fine-tune the inference ability of the large model and uses comprehensive policy feedback for reinforcement learning, thereby improving the explainability of the large model's inference process and the accuracy of its inference results, and providing more efficient decision support for power customer service question answering.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence technology, specifically relating to an artificial intelligence-based method and system for optimizing power customer service models.

Background Technology

[0002] With the rapid development of artificial intelligence technology, customer service in the power industry is gradually shifting from manual service to services supported by large models. The large models used in power customer service applications require corresponding inference optimization to adapt to more complex business question-and-answer scenarios and to output interpretable reasoning processes and accurate results. Traditional optimization methods require executing complex instructions and using large amounts of training data, and often involve lengthy adjustment processes, resulting in low optimization efficiency. Few-shot learning is an active research topic: when data is scarce, fine-tuning large models effectively is crucial. In few-shot scenarios, prompt learning relies heavily on the knowledge that pre-trained language models acquire during pre-training, while little knowledge can be learned from downstream tasks, leading to relatively low accuracy in few-shot training. Therefore, how to efficiently optimize and train power customer service inference models through few-shot learning is an urgent problem to be solved.

Summary of the Invention

[0003] The purpose of this invention is to provide an artificial intelligence-based method and system for optimizing power customer service models, in order to solve the aforementioned problems existing in the prior art.

[0004] To achieve the above objectives, the present invention adopts the following technical solution:

[0005] In a first aspect, an artificial intelligence-based power customer service model optimization method is provided, comprising:

[0006] Obtain a support set and a query set for small sample learning of a large model. Both the support set and the query set contain several power business scenario problem samples, as well as inference step samples and inference result samples corresponding to each power business scenario problem sample.

[0007] Meta-training of a large model is performed using the support set and query set to obtain a pre-trained large model after small-sample learning.

[0008] Retrieve a preset auxiliary set, which includes several inference result instances and a group of power business scenario problem instances labeled for each inference result instance. The group of power business scenario problem instances includes several power business scenario problem instances for the same inference result instance.

[0009] Test the pre-trained large model with each power business scenario problem instance that corresponds to the same inference result instance in the power business scenario problem instance group, to obtain the test inference steps and test inference result for each power business scenario problem instance.

[0010] Take the test inference steps whose test inference result is the same as the corresponding inference result instance as target inference chain data, and extract the inference chain feature vector of each target inference chain data.

[0011] Calculate the comprehensive policy reward value of each target inference chain data based on the inference chain feature vector of each target inference chain data, and label each target inference chain data with the corresponding comprehensive policy reward value.

[0012] Perform reinforcement learning on the pre-trained large model based on the comprehensive policy reward value labeled on each target inference chain data to obtain a policy-optimized large model. The policy-optimized large model is then used to output the inference steps and inference results for the corresponding power business scenario problems in actual application.

[0013] In one possible design, the meta-training of the large model using the support set and query set to obtain a pre-trained large model after few-shot learning includes:

[0014] The large model is meta-trained using the support set and query set and based on the Episodic Training method until the set training conditions are met, resulting in a pre-trained large model after few-shot learning.

[0015] In one possible design, the training condition is that the cross-entropy loss function used for training converges, and the cross-entropy loss function is:

[0016] $$L = -\sum_{i} y_i \log y_i'$$

[0017] Where L represents the cross-entropy loss value, i is the index of the power business scenario problem sample, y_i is the inference result sample of power business scenario problem sample i, and y_i' is the large model's training inference result for power business scenario problem sample i.

[0018] In one possible design, the extraction of inference chain feature vectors from each target inference chain data includes:

[0019] The TF-IDF algorithm is used to extract features from the inference chain data of each target to obtain the corresponding inference chain feature vector.

[0020] In one possible design, the comprehensive policy reward value for each target inference chain data is calculated based on the inference chain feature vectors of each target inference chain data, including:

[0021] Calculate the vector similarity between the inference chain feature vector of each target inference chain data and the inference chain feature vectors of the other target inference chain data, and determine the comprehensive policy reward value of the corresponding target inference chain data based on the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data.

[0022] In one possible design, calculating the vector similarity between the inference chain feature vector of each target inference chain data and the inference chain feature vectors of the remaining target inference chain data includes:

[0023] The inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data are substituted into a preset vector similarity formula for calculation to obtain the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data. The vector similarity formula is as follows:

[0024]

[0025] Where T represents vector similarity, x represents the inference chain feature vector of the corresponding target inference chain data, y represents the inference chain feature vector of the other target inference chain data, ‖‖ represents norm operation, and γ is the set similarity calculation coefficient.

[0026] In one possible design, determining the comprehensive policy reward value for the corresponding target inference chain data based on the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data includes:

[0027] The average similarity parameter of the corresponding target inference chain data is obtained by averaging the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data. The average similarity parameter of the corresponding target inference chain data is then normalized to obtain the comprehensive policy reward value of the corresponding target inference chain data.

[0028] In one possible design, performing reinforcement learning on the pre-trained large model based on the comprehensive policy reward value labeled on each target inference chain data to obtain a policy-optimized large model includes: using the PPO algorithm to perform reinforcement learning on the pre-trained large model based on the comprehensive policy reward value labeled on each target inference chain data, thereby obtaining the policy-optimized large model.

[0029] In a second aspect, an artificial intelligence-based power customer service model optimization system is provided, comprising a data acquisition unit, a model training unit, a data retrieval unit, a model testing unit, a feature extraction unit, a reward calculation unit, and an inference optimization unit, wherein:

[0030] The data acquisition unit is used to acquire the support set and query set for small sample learning of the large model. The support set and query set each contain several power business scenario problem samples and the inference step samples and inference result samples corresponding to each power business scenario problem sample.

[0031] The model training unit is used to perform meta-training on a large model using the support set and query set to obtain a pre-trained large model after small-sample learning.

[0032] The data retrieval unit is used to retrieve a preset auxiliary set, which includes several inference result instances and a group of power business scenario problem instances labeled for each inference result instance. The power business scenario problem instance group includes several power business scenario problem instances for the same inference result instance.

[0033] The model testing unit is used to test the pre-trained large model with each power business scenario problem instance that corresponds to the same inference result instance in the power business scenario problem instance group, and to obtain the test inference steps and test inference result for each power business scenario problem instance.

[0034] The feature extraction unit is used to take the test inference steps whose test inference result is the same as the corresponding inference result instance as the target inference chain data, and to extract the inference chain feature vector of each target inference chain data.

[0035] The reward calculation unit is used to calculate the comprehensive policy reward value of each target inference chain data based on the inference chain feature vector of each target inference chain data, and to label each target inference chain data with the corresponding comprehensive policy reward value.

[0036] The inference optimization unit is used to perform reinforcement learning on the pre-trained large model based on the comprehensive policy reward value labeled on each target inference chain data, to obtain a policy-optimized large model, and to use the policy-optimized large model to output the inference steps and inference results for the corresponding power business scenario problems in actual application.

[0037] In a third aspect, an artificial intelligence-based power customer service model optimization system is provided, comprising:

[0038] Memory, used to store instructions;

[0039] A processor is configured to read instructions stored in the memory and execute the method described in any one of the first aspects above, according to the instructions.

[0040] In a fourth aspect, a computer-readable storage medium is provided, on which instructions are stored that, when executed on a computer, cause the computer to perform any of the methods described in the first aspect. A computer program product is also provided that, when executed on a computer, performs any of the methods described in the first aspect.

[0041] Beneficial Effects: This invention obtains a support set and a query set for few-shot learning of a large model, performs meta-training on the large model using these sets, and then uses an auxiliary set to test the pre-trained large model's reasoning on power business scenario problem instances. Based on the reasoning steps and results obtained from the test, the pre-trained large model undergoes reinforcement learning based on a comprehensive policy reward value to obtain a policy-optimized large model for optimizing reasoning solutions to power business scenario problems in practical applications, thereby improving the large model's performance in power customer service question answering. This invention utilizes few-shot learning technology to fine-tune the reasoning ability of the large model and uses comprehensive policy feedback for reinforcement learning, improving the interpretability of the large model's reasoning process and the accuracy of the reasoning results, providing more efficient decision support for the large model's power customer service question answering.

Description of the Drawings

[0042] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0043] Figure 1 is a schematic diagram of the steps of the method in Embodiment 1 of the present invention;

[0044] Figure 2 is a schematic diagram of the system configuration in Embodiment 2 of the present invention;

[0045] Figure 3 is a schematic diagram of the system configuration in Embodiment 3 of the present invention.

Detailed Description

[0046] It should be noted that the descriptions of these embodiments are intended to aid in understanding the invention and do not constitute a limitation thereof. The specific structural and functional details disclosed herein are merely for describing exemplary embodiments of the invention. However, the invention may be embodied in many alternative forms and should not be construed as being limited to the embodiments described herein.

[0047] It should be understood that, unless otherwise explicitly specified and limited, the corresponding terms should be interpreted broadly. For example, "connection" can be a fixed connection, a detachable connection, or an integral connection; it can be an electrical connection, a direct connection, an indirect connection through an intermediate medium, or an internal connection between two components. Those skilled in the art can understand the specific meaning of the above terms in the embodiments according to the specific circumstances.

[0048] Specific details are provided in the following description to give a complete understanding of the exemplary embodiments. However, those skilled in the art will understand that the exemplary embodiments can be implemented without these specific details. For example, apparatus may be shown in block diagrams to avoid obscuring the examples with unnecessary details. In other embodiments, well-known processes, structures, and techniques may be shown without non-essential details to avoid obscuring the embodiments.

[0049] Embodiment 1:

[0050] This embodiment provides an artificial intelligence-based power customer service model optimization method, which can be applied to a corresponding power business question-and-answer server. As shown in Figure 1, the method includes the following steps:

[0051] S1. Obtain the support set and query set for small sample learning of the large model. The support set and query set each contain several power business scenario problem samples, as well as the inference step samples and inference result samples corresponding to each power business scenario problem sample.

[0052] In practice, the server first obtains a support set and a query set for small-sample learning of a large model. Both the support set and the query set contain several power business scenario problem samples, as well as inference step samples and inference result samples corresponding to each power business scenario problem sample. The power business scenario problem samples, inference step samples, and inference result samples corresponding to each power business scenario problem sample contained in the query set are different from the power business scenario problem samples, inference step samples, and inference result samples corresponding to each power business scenario problem sample contained in the support set.
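As an illustration only, one sample of either set could be modeled as a record holding the problem, its inference steps, and its inference result, together with a check that the query set shares no sample with the support set. The class name `ScenarioSample`, its field names, and the helper `check_disjoint` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioSample:
    """One power business scenario problem sample (hypothetical layout)."""
    question: str            # power business scenario problem sample
    inference_steps: tuple   # inference step sample, one string per step
    inference_result: str    # inference result sample


def check_disjoint(support_set, query_set):
    """Raise ValueError if any sample appears in both sets; otherwise return True."""
    overlap = set(support_set) & set(query_set)
    if overlap:
        raise ValueError(f"{len(overlap)} sample(s) appear in both sets")
    return True


support_set = [ScenarioSample("Why did my bill double?", ("read meter history", "compare tariff"), "explain tariff change")]
query_set = [ScenarioSample("How do I report an outage?", ("confirm address", "check outage notices"), "file outage report")]
check_disjoint(support_set, query_set)
```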

[0053] S2. Use the support set and query set to perform meta-training on the large model to obtain a pre-trained large model after small-sample learning.

[0054] In practice, the server utilizes the support set and query set and performs meta-training on the large model based on the Episodic Training paradigm until the set training conditions are met, resulting in a pre-trained large model after small-sample learning. The set training conditions refer to the convergence of the cross-entropy loss function used in the training, which is:

[0055] $$L = -\sum_{i} y_i \log y_i'$$

[0056] Where L represents the cross-entropy loss value, i is the index of the power business scenario problem sample, y_i is the inference result sample of power business scenario problem sample i, and y_i' is the large model's training inference result for power business scenario problem sample i.
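The training condition can be sketched in a few lines of Python, assuming the standard cross-entropy form L = -Σ y_i log y_i'. The convergence test `has_converged`, its `window` and `tol` parameters, and the `eps` guard are illustrative assumptions; the patent only states that training stops when the loss converges.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Return L = -sum_i y_i * log(y_i'), clamping predictions away from zero."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def has_converged(losses, window=3, tol=1e-3):
    """Assumed criterion: the last `window` loss values differ by less than `tol`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tol


loss = cross_entropy([1.0, 0.0], [0.5, 0.5])  # equals log(2) for this input
```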

[0057] Episodic Training is a commonly used training method in few-shot learning. Its core idea is to train the model by simulating multiple task scenarios (episodes), thereby improving the model's adaptability to new tasks. The basic steps of Episodic Training include:
1. Dataset splitting: During training, the dataset is split into a support set and a query set. The support set provides samples for each class, while the query set is used to evaluate the model's generalization ability.
2. Simulating task scenarios: In each training episode, several classes are randomly selected from all classes, and a support set and query set are assigned to each selected class. Thus, each episode simulates a small task scenario.
3. Training process: The model is trained using data from the support set, and then its performance is evaluated using data from the query set. Through iterative training with multiple such episodes, the model gradually learns to make accurate predictions with a limited number of samples.
4. Parameter update: After each episode, the model's parameters are updated based on feedback from the query set to improve its performance on unseen classes.
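The episode construction described above can be sketched as follows. This is a minimal illustration of N-way K-shot sampling only, not the patent's training code: the function name `sample_episode`, its parameters, and the toy class labels are assumptions, and the model update that would consume each episode is omitted.

```python
import random


def sample_episode(samples_by_class, n_way, k_shot, q_queries, rng):
    """Draw one episode: pick n_way classes, then split each class's draw
    into k_shot support samples and q_queries query samples."""
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(samples_by_class[label], k_shot + q_queries)
        support += [(sample, label) for sample in picked[:k_shot]]
        query += [(sample, label) for sample in picked[k_shot:]]
    return support, query


# Toy data: four classes with five question ids each (hypothetical labels).
data = {f"class_{c}": [f"q{c}_{j}" for j in range(5)] for c in range(4)}
rng = random.Random(0)
for _ in range(3):  # three simulated task scenarios
    support, query = sample_episode(data, n_way=2, k_shot=2, q_queries=1, rng=rng)
    # A real loop would adapt the model on `support` and update it from the loss on `query`.
```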

[0058] S3. Retrieve a preset auxiliary set, which includes several inference result instances and a group of power business scenario problem instances labeled for each inference result instance. The group of power business scenario problem instances includes several power business scenario problem instances for the same inference result instance.

[0059] In practice, the server retrieves a pre-configured auxiliary set from the database. The auxiliary set contains several inference result instances and a group of power business scenario problem instances labeled for each inference result instance. The power business scenario problem instance group contains several power business scenario problem instances for the same inference result instance.

[0060] S4. Test the pre-trained large model with each power business scenario problem instance that corresponds to the same inference result instance in the power business scenario problem instance group, and obtain the test inference steps and test inference result for each power business scenario problem instance.

[0061] In practice, the server uses an auxiliary set to test the pre-trained large model. That is, the pre-trained large model is tested by each power business scenario problem instance corresponding to the same reasoning result instance in the power business scenario problem instance group, so as to obtain the test reasoning steps and test reasoning results for each power business scenario problem instance.
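A minimal sketch of this test pass is given below. It assumes the auxiliary set is a mapping from each inference result instance to its group of problem instances and that the model is a callable returning a pair of steps and result; `InferenceRecord`, `run_inference_test`, and `toy_model` are hypothetical names, and `toy_model` is a stand-in rule, not a large model.

```python
from dataclasses import dataclass


@dataclass
class InferenceRecord:
    expected_result: str  # labeled inference result instance
    question: str         # power business scenario problem instance
    steps: str            # test inference steps produced by the model
    result: str           # test inference result produced by the model


def run_inference_test(auxiliary_set, model):
    """Query the model once per problem instance and keep its steps and result."""
    records = []
    for expected_result, questions in auxiliary_set.items():
        for question in questions:
            steps, result = model(question)
            records.append(InferenceRecord(expected_result, question, steps, result))
    return records


def toy_model(question):
    """Stand-in for the pre-trained large model (illustration only)."""
    if "outage" in question:
        return "confirm address; check outage notices", "file outage report"
    return "read meter history; compare tariff", "explain bill"


auxiliary_set = {
    "file outage report": ["power outage at home", "outage on my street"],
    "explain bill": ["why is my bill so high"],
}
records = run_inference_test(auxiliary_set, toy_model)
```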

[0062] S5. Take the test inference steps whose test inference result is the same as the corresponding inference result instance as the target inference chain data, and extract the inference chain feature vector of each target inference chain data.

[0063] In practice, the server first filters the inference chains based on the test results, and takes the test inference steps that are the same as the corresponding inference result instances as the target inference chain data. Then, the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to extract features from each target inference chain data to obtain the corresponding inference chain feature vector.
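The filtering and feature extraction can be illustrated with a small pure-Python TF-IDF. This is a sketch under stated assumptions: whitespace tokenization (Chinese text would need a word segmenter), a smoothed IDF of log((1+n)/(1+df)) + 1, and the hypothetical helper names `select_target_chains` and `tfidf_vectors`. The patent names TF-IDF but does not specify the variant.

```python
import math
from collections import Counter


def select_target_chains(records):
    """Keep the steps of every (steps, result, expected_result) record
    whose test result equals the labeled inference result instance."""
    return [steps for steps, result, expected in records if result == expected]


def tfidf_vectors(chains):
    """Return (vocabulary, vectors) with one TF-IDF vector per inference chain."""
    docs = [chain.lower().split() for chain in chains]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty chain
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


records = [
    ("check meter then read bill", "explain bill", "explain bill"),
    ("check meter then reset", "explain bill", "explain bill"),
    ("guess", "file outage report", "explain bill"),  # wrong result, filtered out
]
chains = select_target_chains(records)
vocab, vectors = tfidf_vectors(chains)
```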

[0064] S6. Calculate the comprehensive policy reward value of each target inference chain data based on the inference chain feature vector of each target inference chain data, and label each target inference chain data with the corresponding comprehensive policy reward value.

[0065] In practice, the server calculates the vector similarity between the inference chain feature vector of each target inference chain data and the inference chain feature vectors of the other target inference chain data. This involves substituting the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data into a preset vector similarity formula for calculation. The vector similarity formula is as follows:

[0066]

[0067] Where T represents vector similarity, x represents the inference chain feature vector of the corresponding target inference chain data, y represents the inference chain feature vector of the other target inference chain data, ‖‖ represents norm operation, and γ is the set similarity calculation coefficient.

[0068] Then, the comprehensive policy reward value of the corresponding target inference chain data is determined based on the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data. That is, the average similarity parameter of the corresponding target inference chain data is obtained by averaging the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data. The average similarity parameter of the corresponding target inference chain data is then normalized to obtain the comprehensive policy reward value of the corresponding target inference chain data.
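The reward computation can be sketched as below, with two loudly stated assumptions. First, the similarity formula itself is not reproduced in the text, so a Gaussian kernel T = exp(-γ‖x − y‖²) is used purely as a stand-in that matches the listed symbols (T, x, y, the norm, γ). Second, "normalized" is taken to mean min-max scaling to [0, 1]. The function names are hypothetical.

```python
import math


def similarity(x, y, gamma=1.0):
    """Assumed stand-in: T = exp(-gamma * ||x - y||^2). The patent's exact formula may differ."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def comprehensive_rewards(vectors, gamma=1.0):
    """Average each chain's similarity to all other chains, then min-max normalize."""
    n = len(vectors)
    if n < 2:
        return [1.0] * n  # no other chain to compare with (assumed fallback)
    means = []
    for i, x in enumerate(vectors):
        sims = [similarity(x, y, gamma) for j, y in enumerate(vectors) if j != i]
        means.append(sum(sims) / len(sims))
    low, high = min(means), max(means)
    if high == low:
        return [1.0] * n  # all chains equally typical (assumed fallback)
    return [(m - low) / (high - low) for m in means]


rewards = comprehensive_rewards([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
# The third vector lies far from the other two, so it receives the lowest reward.
```

Under this reading, a chain whose wording resembles the other correct chains earns a higher reward, while an atypical chain is rewarded less.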

[0069] S7. Based on the comprehensive policy reward value labeled on each target inference chain data, perform reinforcement learning on the pre-trained large model to obtain the policy-optimized large model, and use the policy-optimized large model to output the inference steps and inference results for the corresponding power business scenario problems in actual application.

[0070] In practical implementation, the server employs the PPO algorithm (Proximal Policy Optimization, a reinforcement learning algorithm that balances training stability and performance improvement by limiting the magnitude of policy updates and efficiently utilizing sampled data) and performs reinforcement learning on the pre-trained large model based on the comprehensive policy reward value labeled on each target inference chain data. Specifically, it uses the comprehensive policy reward value to calculate the advantage function (typically using Generalized Advantage Estimation (GAE) to optimize action value assessment), and uses the Clip function to constrain gradient updates, limiting the magnitude of policy changes and achieving model policy updates, thus obtaining the policy-optimized large model. This policy-optimized large model can then be used to output the inference steps and inference results for the corresponding power business scenario problems in practical applications, realizing the optimization of customer service question-and-answer inference in complex power business scenarios.
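The two PPO ingredients named above can be written out in plain Python as a sketch. It shows the standard GAE recursion and the standard clipped surrogate objective on scalar lists; the hyperparameter defaults (`gamma`, `lam`, `clip_eps`) are common choices assumed here, not values given in the patent, and the language model update itself is omitted.

```python
import math


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.
    `values` holds len(rewards) + 1 entries; the last one is the bootstrap value."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), to be maximized."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# One-step example: the comprehensive policy reward acts as the terminal reward.
advantages = gae_advantages(rewards=[0.8], values=[0.5, 0.0])
objective = ppo_clip_objective([math.log(2.0)], [0.0], advantages)
```

With a positive advantage, a probability ratio of 2 is clipped to 1.2, which is how the Clip function limits the size of a single policy change.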

[0071] This method uses few-shot learning to fine-tune the reasoning capability of the large model and employs comprehensive policy feedback for reinforcement learning, thereby improving the interpretability of the large model's reasoning process and the accuracy of its reasoning results. This provides more efficient decision support for the large model's power customer service Q&A.

[0072] Embodiment 2:

[0073] This embodiment provides an artificial intelligence-based power customer service model optimization system. As shown in Figure 2, it includes a data acquisition unit, a model training unit, a data retrieval unit, a model testing unit, a feature extraction unit, a reward calculation unit, and an inference optimization unit, wherein:

[0074] The data acquisition unit is used to acquire the support set and query set for small sample learning of the large model. The support set and query set each contain several power business scenario problem samples and the inference step samples and inference result samples corresponding to each power business scenario problem sample.

[0075] The model training unit is used to perform meta-training on a large model using the support set and query set to obtain a pre-trained large model after small-sample learning.

[0076] The data retrieval unit is used to retrieve a preset auxiliary set, which includes several inference result instances and a group of power business scenario problem instances labeled for each inference result instance. The power business scenario problem instance group includes several power business scenario problem instances for the same inference result instance.

[0077] The model testing unit is used to test the pre-trained large model with each power business scenario problem instance that corresponds to the same inference result instance in the power business scenario problem instance group, and to obtain the test inference steps and test inference result for each power business scenario problem instance.

[0078] The feature extraction unit is used to take the test inference steps whose test inference result is the same as the corresponding inference result instance as the target inference chain data, and to extract the inference chain feature vector of each target inference chain data.

[0079] The reward calculation unit is used to calculate the comprehensive policy reward value of each target inference chain data based on the inference chain feature vector of each target inference chain data, and to label each target inference chain data with the corresponding comprehensive policy reward value.

[0080] The inference optimization unit is used to perform reinforcement learning on the pre-trained large model based on the comprehensive policy reward value labeled on each target inference chain data, to obtain a policy-optimized large model, and to use the policy-optimized large model to output the inference steps and inference results for the corresponding power business scenario problems in actual application.

[0081] Embodiment 3:

[0082] This embodiment provides an artificial intelligence-based power customer service model optimization system. As shown in Figure 3, at the hardware level, it includes:

[0083] The data interface is used to establish data communication between the processor and external data terminals;

[0084] Memory, used to store instructions;

[0085] The processor is used to read instructions stored in the memory and execute the AI-based power customer service model optimization method in Embodiment 1 according to the instructions.

[0086] Optionally, the system also includes an internal bus, through which the processor, memory, and data interface can be interconnected. This internal bus can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc.

[0087] The memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), flash memory, first-in-first-out (FIFO) memory, and / or first-in-last-out (FILO) memory. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0088] Embodiment 4:

[0089] This embodiment provides a computer-readable storage medium storing instructions. When these instructions are executed on a computer, the computer performs the artificial intelligence-based power customer service model optimization method described in Embodiment 1. The computer-readable storage medium refers to a data storage medium, which may include, but is not limited to, floppy disks, optical disks, hard disks, flash memory, USB flash drives, and / or Memory Sticks. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.

[0090] This embodiment also provides a computer program product that, when run on a computer, executes the artificial intelligence-based power customer service model optimization method described in Embodiment 1. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.

[0091] Finally, it should be noted that the above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. An artificial intelligence-based power customer service model optimization method, characterized in that it comprises:
obtaining a support set and a query set for few-shot learning of a large model, wherein the support set and the query set each contain several power business scenario problem samples, together with the inference step samples and inference result samples corresponding to each power business scenario problem sample;
performing meta-training on the large model using the support set and the query set to obtain a pre-trained large model after few-shot learning;
retrieving a preset auxiliary set, which includes several inference result instances and a power business scenario problem instance group labeled for each inference result instance, wherein each power business scenario problem instance group includes several power business scenario problem instances for the same inference result instance;
testing the pre-trained large model with the power business scenario problem instances in the group corresponding to the same inference result instance, to obtain the test inference steps and the test inference result for each power business scenario problem instance;
taking the test inference steps whose test inference result is the same as the corresponding inference result instance as target inference chain data, and extracting an inference chain feature vector from each target inference chain data;
calculating the comprehensive strategy reward value of each target inference chain data based on the inference chain feature vectors, and labeling each target inference chain data with its corresponding comprehensive strategy reward value;
wherein calculating the comprehensive strategy reward value of each target inference chain data includes: calculating the vector similarity between the inference chain feature vector of each target inference chain data and the inference chain feature vectors of the other target inference chain data, and determining the comprehensive strategy reward value of the corresponding target inference chain data based on the vector similarity between its inference chain feature vector and the inference chain feature vectors of the other target inference chain data;
wherein calculating the vector similarity includes: substituting the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data into a preset vector similarity formula, to obtain the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data;
the vector similarity formula is: [formula not reproduced], where T represents the vector similarity, x represents the inference chain feature vector of the corresponding target inference chain data, y represents the inference chain feature vector of another target inference chain data, ‖·‖ represents the norm operation, and γ is a preset similarity calculation coefficient;
wherein determining the comprehensive strategy reward value includes: averaging the vector similarities between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data to obtain the average similarity parameter of the corresponding target inference chain data, and normalizing that average similarity parameter to obtain the comprehensive strategy reward value of the corresponding target inference chain data;
performing reinforcement learning on the pre-trained large model based on the comprehensive strategy reward value labeled on each target inference chain data to obtain a policy-optimized large model, and using the policy-optimized large model to output the inference steps and inference results for the corresponding power business scenario problems in actual application.
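
The reward calculation recited in claim 1 (pairwise similarity, averaging over the other chains, then normalization) can be illustrated with a minimal sketch. The claim's similarity formula is not reproduced in this text, so the Gaussian-kernel form `exp(-gamma * ||x - y||^2)` used below is only an assumption chosen because it involves a norm and a coefficient γ. The min-max normalization and the function names `similarity` and `comprehensive_rewards` are likewise hypothetical.

```python
import math

def similarity(x, y, gamma=1.0):
    # ASSUMPTION: Gaussian kernel exp(-gamma * ||x - y||^2).
    # The formula in the claim is not reproduced in the source text.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def comprehensive_rewards(vectors, gamma=1.0):
    """Average each vector's similarity to all other vectors, then normalize."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two target inference chains")
    means = []
    for i, x in enumerate(vectors):
        sims = [similarity(x, y, gamma) for j, y in enumerate(vectors) if j != i]
        means.append(sum(sims) / len(sims))  # average similarity parameter
    lo, hi = min(means), max(means)
    if hi == lo:
        # ASSUMPTION: identical averages all receive the same reward of 1.0.
        return [1.0] * n
    # ASSUMPTION: min-max normalization; the claim only says "normalizing".
    return [(m - lo) / (hi - lo) for m in means]

rewards = comprehensive_rewards([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0]])
```

Under these assumptions, an inference chain whose feature vector lies close to the other chains for the same inference result receives a reward near 1, while an outlying chain receives a reward near 0.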

2. The artificial intelligence-based power customer service model optimization method according to claim 1, characterized in that performing meta-training on the large model using the support set and the query set to obtain a pre-trained large model after few-shot learning includes: performing meta-training on the large model using the support set and the query set, based on the Episodic Training method, until a preset training condition is met, to obtain the pre-trained large model after few-shot learning.
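
As a rough illustration of Episodic Training, the sketch below draws a fresh support/query split per episode and stops once the loss stops changing. It assumes that samples are grouped by a scenario label in a `pool` dictionary and that a caller-supplied `train_step` callable adapts the model on the support set and returns the query-set loss. These names and the tolerance-based stopping rule are hypothetical and are not taken from the patent.

```python
import random

def sample_episode(pool, k_support, k_query, rng):
    """Draw disjoint support and query samples for every scenario label."""
    support, query = [], []
    for label, samples in pool.items():
        picked = rng.sample(samples, k_support + k_query)  # without replacement
        support.extend((label, s) for s in picked[:k_support])
        query.extend((label, s) for s in picked[k_support:])
    return support, query

def episodic_training(pool, train_step, k_support=2, k_query=1,
                      max_episodes=100, tol=1e-4, seed=0):
    """Run episodes until the loss changes by less than tol (assumed criterion)."""
    rng = random.Random(seed)
    previous = None
    for episode in range(1, max_episodes + 1):
        support, query = sample_episode(pool, k_support, k_query, rng)
        loss = train_step(support, query)  # hypothetical model update + query loss
        if previous is not None and abs(previous - loss) < tol:
            return episode, loss
        previous = loss
    return max_episodes, previous
```

The function returns the number of episodes run and the last loss, so a caller can tell whether training stopped on convergence or on the episode limit.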

3. The artificial intelligence-based power customer service model optimization method according to claim 2, characterized in that the training condition is that the cross-entropy loss function used in the training converges, and the cross-entropy loss function is: [formula not reproduced], where L represents the cross-entropy loss value, i represents the index of a power business scenario problem sample, $y_i$ represents the inference result sample of power business scenario problem sample i, and $y_i'$ represents the inference result produced by the large model in training for power business scenario problem sample i.

4. The artificial intelligence-based power customer service model optimization method according to claim 1, characterized in that extracting the inference chain feature vector from each target inference chain data includes: using the TF-IDF algorithm to extract features from each target inference chain data to obtain the corresponding inference chain feature vector.
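
A minimal sketch of TF-IDF feature extraction over inference chain texts follows. It assumes whitespace tokenization (real Chinese-language chains would need a word segmenter), term frequency as count divided by chain length, and a smoothed IDF of `log((1 + N) / (1 + df)) + 1`. The patent does not state which TF-IDF variant is used, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(chains):
    """Return (vocabulary, one TF-IDF vector per inference chain text)."""
    docs = [chain.lower().split() for chain in chains]  # assumed tokenization
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of chains containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # ASSUMPTION: smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against an empty chain
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors([
    "check meter then reset breaker",
    "check meter then replace fuse",
])
```

Terms shared by every chain (such as "check" here) get the lowest weight, while terms specific to one chain are weighted higher, so the vectors emphasize what distinguishes one inference chain from another.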

5. The artificial intelligence-based power customer service model optimization method according to claim 1, characterized in that performing reinforcement learning on the pre-trained large model based on the comprehensive strategy reward value labeled on each target inference chain data to obtain a policy-optimized large model includes: using the PPO algorithm to perform reinforcement learning on the pre-trained large model based on the comprehensive strategy reward value labeled on each target inference chain data, to obtain the policy-optimized large model.
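
The core of PPO is its clipped surrogate objective, which the sketch below computes for a batch of inference chains. How the comprehensive strategy reward values are turned into advantages is not stated in the patent, so subtracting the batch mean as a baseline is an assumption, and the function names are hypothetical. A real fine-tuning run would also need the model's log-probabilities and a gradient-based optimizer, which are omitted here.

```python
import math

def advantages_from_rewards(rewards):
    # ASSUMPTION: advantage = reward minus the batch-mean baseline.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the batch."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio new / old policy
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)
```

The clipping keeps each update close to the pre-trained policy: once the probability ratio leaves the interval [1 - ε, 1 + ε] in the direction favored by the advantage, the objective stops rewarding further movement.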

6. An artificial intelligence-based power customer service model optimization system, characterized in that it comprises a data acquisition unit, a model training unit, a data retrieval unit, a model testing unit, a feature extraction unit, a reward calculation unit, and an inference optimization unit, wherein:
the data acquisition unit is used to acquire a support set and a query set for few-shot learning of a large model, wherein the support set and the query set each contain several power business scenario problem samples, together with the inference step samples and inference result samples corresponding to each power business scenario problem sample;
the model training unit is used to perform meta-training on the large model using the support set and the query set to obtain a pre-trained large model after few-shot learning;
the data retrieval unit is used to retrieve a preset auxiliary set, which includes several inference result instances and a power business scenario problem instance group labeled for each inference result instance, wherein each power business scenario problem instance group includes several power business scenario problem instances for the same inference result instance;
the model testing unit is used to test the pre-trained large model with the power business scenario problem instances in the group corresponding to the same inference result instance, to obtain the test inference steps and the test inference result for each power business scenario problem instance;
the feature extraction unit is used to take the test inference steps whose test inference result is the same as the corresponding inference result instance as target inference chain data, and to extract the inference chain feature vector of each target inference chain data;
the reward calculation unit is used to calculate the comprehensive strategy reward value of each target inference chain data based on the inference chain feature vectors, and to label each target inference chain data with its corresponding comprehensive strategy reward value;
wherein calculating the comprehensive strategy reward value of each target inference chain data includes: calculating the vector similarity between the inference chain feature vector of each target inference chain data and the inference chain feature vectors of the other target inference chain data, and determining the comprehensive strategy reward value of the corresponding target inference chain data based on the vector similarity between its inference chain feature vector and the inference chain feature vectors of the other target inference chain data;
wherein calculating the vector similarity includes: substituting the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data into a preset vector similarity formula, to obtain the vector similarity between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data;
the vector similarity formula is: [formula not reproduced], where T represents the vector similarity, x represents the inference chain feature vector of the corresponding target inference chain data, y represents the inference chain feature vector of another target inference chain data, ‖·‖ represents the norm operation, and γ is a preset similarity calculation coefficient;
wherein determining the comprehensive strategy reward value includes: averaging the vector similarities between the inference chain feature vector of the corresponding target inference chain data and the inference chain feature vectors of the other target inference chain data to obtain the average similarity parameter of the corresponding target inference chain data, and normalizing that average similarity parameter to obtain the comprehensive strategy reward value of the corresponding target inference chain data;
the inference optimization unit is used to perform reinforcement learning on the pre-trained large model based on the comprehensive strategy reward value labeled on each target inference chain data to obtain a policy-optimized large model, and to use the policy-optimized large model to output the inference steps and inference results for the corresponding power business scenario problems in actual application.
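
One way to picture how the seven units of claim 6 hand data to each other is a small pipeline object. The sketch below is only an illustration: the class name `OptimizationSystem`, the field names, and the argument shapes passed between units are hypothetical, and each unit is represented by a caller-supplied callable rather than a real implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class OptimizationSystem:
    """Hypothetical wiring of the seven units; each field is a plain callable."""
    acquire_data: Callable[..., Any]        # data acquisition unit
    train_model: Callable[..., Any]         # model training unit
    retrieve_auxiliary: Callable[..., Any]  # data retrieval unit
    test_model: Callable[..., Any]          # model testing unit
    extract_features: Callable[..., Any]    # feature extraction unit
    calculate_rewards: Callable[..., Any]   # reward calculation unit
    optimize_inference: Callable[..., Any]  # inference optimization unit

    def run(self):
        support, query = self.acquire_data()
        pretrained = self.train_model(support, query)
        auxiliary = self.retrieve_auxiliary()
        test_outputs = self.test_model(pretrained, auxiliary)
        # Keep only chains whose result matches the labeled instance, then vectorize.
        chains, vectors = self.extract_features(test_outputs, auxiliary)
        rewards = self.calculate_rewards(vectors)
        return self.optimize_inference(pretrained, chains, rewards)
```

Keeping each unit behind a callable makes it possible to swap one stage, for example the reward calculation, without touching the others.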

7. An artificial intelligence-based power customer service model optimization system, characterized in that it comprises: a memory, used to store instructions; and a processor, configured to read the instructions stored in the memory and execute the artificial intelligence-based power customer service model optimization method according to any one of claims 1-5.