Data processing method, apparatus, device, and medium

By acquiring multimodal information from the customer service system, retrieving business knowledge bases, and correcting response information from large language models, the problem of existing customer service systems being unable to provide accurate responses has been solved. This has enabled highly accurate and compliant responses, adapting to the business needs of education sales scenarios and improving user experience.

CN122309665A · Pending · Publication Date: 2026-06-30 · SHENZHEN DIANMAO TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHENZHEN DIANMAO TECH CO LTD
Filing Date: 2026-03-27
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing customer service systems cannot achieve precise response control, cannot meet users' personalized needs, and suffer from problems such as insufficient integration of multimodal processing, inadequate knowledge base retrieval and response accuracy, lack of automated process optimization, and rigid interaction logic.

Method used

By acquiring multimodal information, retrieving business knowledge bases, generating initial response information using large language models, and comparing and correcting it with a pre-set script template library, the accuracy, compliance, and consistency of responses are ensured. This is further optimized by combining a full-link monitoring and evaluation mechanism.

Benefits of technology

It achieves highly accurate responses to user queries, improves the accuracy of knowledge matching and the compliance of responses, adapts to the business needs of education and sales scenarios, and enhances the system's adoption rate and user experience.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the field of artificial intelligence technology and discloses a data processing method, apparatus, device, and medium. The method includes: acquiring multimodal information based on client input; retrieving a business knowledge base based on the multimodal information and obtaining relevant target business information from the business knowledge base; determining initial response information of a large language model based on the multimodal information and the target business information; determining deviation segments between the initial response information and a standard script in a preset script template library; and correcting the deviation segments in the initial response information to obtain corrected target response information, and feeding back the target response information to the client. This application can improve the accuracy of knowledge matching and ensure the output effect of the large language model; by controlling the response through standard scripts, the accuracy, compliance, and consistency of the response scripts can be guaranteed.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, specifically to data processing methods, apparatus, devices, and media.

Background Technology

[0002] With the rapid development of artificial intelligence technology, intelligent customer service systems based on Large Language Models (LLMs) have been widely used in education, e-commerce, finance, and other fields. Traditional customer service systems mainly rely on rule engines or simple keyword matching and struggle to handle complex and ever-changing user inquiries. Existing customer service systems based on multimodal input can process multiple types of information such as text, voice, and images, but their workflows only complete basic modality recognition, conversion, and single-task distribution, failing to achieve precise response control or meet users' personalized needs.

Summary of the Invention

[0003] In view of this, this application provides a data processing method, apparatus, device, and medium to solve the problem that existing customer service systems cannot achieve accurate response control.

[0004] In a first aspect, this application provides a data processing method, the method comprising: acquiring multimodal information based on client input; retrieving a business knowledge base based on the multimodal information, and obtaining relevant target business information from the business knowledge base; determining initial response information of a large language model based on the multimodal information and the target business information; determining a deviation segment between the initial response information and a standard script in a preset script template library; and correcting the deviation segment in the initial response information to obtain corrected target response information, and feeding back the target response information to the client.

[0005] In a second aspect, this application provides a data processing apparatus, the apparatus comprising: an acquisition module, configured to acquire multimodal information based on client input; a retrieval and generation module, configured to retrieve a business knowledge base based on the multimodal information, obtain relevant target business information from the business knowledge base, and determine initial response information of a large language model based on the multimodal information and the target business information; a deviation determination module, configured to determine a deviation segment between the initial response information and a standard script in a preset script template library; and a processing module, configured to correct the deviation segment in the initial response information to obtain corrected target response information, and to feed back the target response information to the client.

[0006] In a third aspect, this application provides an electronic device, including a memory and a processor that are communicatively connected to each other. The memory stores computer instructions, and the processor executes the computer instructions to perform the data processing method described in the first aspect or any corresponding embodiment.

[0007] In a fourth aspect, this application provides a computer-readable storage medium storing computer instructions for causing a computer to perform the data processing method described in the first aspect or any corresponding embodiment.

[0008] In a fifth aspect, this application provides a computer program product, including computer instructions for causing a computer to perform the data processing method described in the first aspect or any corresponding embodiment.

[0009] In response to a user query, the data processing method provided in this application first searches the business knowledge base to find relevant knowledge fragments, then synthesizes these knowledge fragments to determine the initial response information output by the large language model, and finally uses standard scripts from a script template library to correct the initial response information, thereby providing users with more accurate target response information. Generating responses from the relevant business knowledge in the business knowledge base improves the accuracy of knowledge matching and ensures the output effect of the large language model. Script control through standard scripts ensures the accuracy, compliance, and consistency of the response scripts, resulting in a higher adoption rate. The method also adapts to the business needs of education sales scenarios, improving the relevance of responses to business requirements.

Description of the Drawings

[0010] To more clearly illustrate the technical solutions in the specific embodiments of this application or the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0011] Figure 1 is a schematic diagram of an application scenario according to an embodiment of this application; Figure 2 is a schematic flowchart of a first data processing method according to an embodiment of this application; Figure 3 is a schematic diagram of a second data processing method according to an embodiment of this application; Figure 4 is a schematic diagram of a third data processing method according to an embodiment of this application; Figure 5 is a structural block diagram of a data processing apparatus according to an embodiment of this application; Figure 6 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application.

Detailed Description

[0012] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0013] It is understood that before using the technical solutions disclosed in the various embodiments of this application, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in this application in an appropriate manner in accordance with relevant laws and regulations, and user authorization should be obtained.

[0014] The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0015] Traditional educational institutions rely on human customer service for inquiries, which suffers from limited service hours, high labor costs, and limited sales radius. In recent years, generative dialogue systems based on large models have demonstrated powerful semantic understanding and generation capabilities. While existing multimodal customer service systems can handle various types of user inquiries, their workflow design limitations prevent them from meeting the requirements of accuracy, consistency of responses, and conversion efficiency in educational sales scenarios.

[0016] Specifically, existing customer service systems still face a number of technical challenges: (1) Multimodal processing remains at a basic level without deep business integration: the core of such systems only implements modality type recognition, basic conversion (speech to text), and task distribution (text to a dialogue module, images to a detection card extraction module). No dedicated script control or knowledge matching mechanism is designed for actual business scenarios such as sales, so the processing results are not closely aligned with business needs and cannot meet the core requirements of sales conversion.

[0017] (2) Insufficient knowledge base retrieval and response accuracy, with no precise matching mechanism: only rule-based knowledge base calls are implemented, and no dynamic retrieval or precise matching logic is designed for the text obtained after multimodal conversion. Such systems struggle with complex semantics and multi-turn dialogue, which may lead to answers that are inaccurate or deviate from business requirements. In addition, there is no verification and correction mechanism for the output content, so the consistency of core business information cannot be guaranteed.

[0018] (3) Prompt optimization relies on manual intervention and lacks an automated optimization system: the effectiveness of large models is highly dependent on prompt design. The dialogue task modules of such systems still use fixed prompts and calling parameters and lack automated, systematic optimization methods. This results in low efficiency and poor scalability, and makes it impossible to iteratively improve response quality based on actual interaction data. Some prompt optimization schemes target only single-text dialogue scenarios, do not consider the characteristics of multimodal interaction, and do not cover automated optimization of knowledge base retrieval strategies.

[0019] (4) Lack of a full-link monitoring and evaluation mechanism and no data-driven self-optimization capability: only a single round of multimodal input-output processing is completed. The full-link data of the interaction process (input, output, response time, matching accuracy, etc.) is neither tracked nor evaluated, and there is no automated optimization loop based on evaluation data. System performance cannot be continuously improved, and iteration relies entirely on manual feedback.

[0020] (5) Rigid interaction logic and no script control specific to sales scenarios: a fixed task module calling logic is adopted, and there is no precise script control mechanism for sales scenarios. The accuracy and compliance of core sales wording, such as price statements, preferential terms, and product introductions, cannot be guaranteed. It is also difficult to adapt flexibly to users' personalized needs, which affects user experience and conversion.

[0021] In one optional application scenario of the embodiments of this application, as shown in Figure 1, application 101 is installed in terminal device 110, and user 130 can interact with application 101 through terminal device 110 and/or an access device of terminal device 110.

[0022] For example, application 101 can be any application that provides question-and-answer related services. For instance, application 101 could be a question-and-answer interactive application, such as a text-to-text application or an image-to-text application. In the application scenario shown in Figure 1, if application 101 is active, the terminal device 110 can display the interface 102 of application 101. The interface 102 may include various pages that application 101 can provide, such as interactive pages, settings pages, and query pages.

[0023] In some embodiments, terminal device 110 is communicatively connected to server 120 to provide services to application 101. Terminal device 110 may be a mobile terminal, fixed terminal, or portable terminal, including but not limited to mobile phones, desktop computers, laptop computers, multimedia tablets, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices or any combination thereof. In some embodiments, terminal device 110 may also support any type of interface, and server 120 may be various types of computing systems or servers capable of providing computing power, including but not limited to mainframes, edge computing nodes, and computing devices in cloud environments.

[0024] It should be noted that Figure 1 is merely an example of an application scenario and does not limit the scope of protection of this application.

[0025] To address some or all of the aforementioned issues, this application provides a data processing method. For user queries, the method first retrieves relevant knowledge fragments from a business knowledge base. Then, it synthesizes these knowledge fragments to determine the initial response information output by a large language model. Finally, it refines the initial response information using standard scripts from a script template library, thereby providing the user with a more accurate target response. This method utilizes relevant business knowledge from the business knowledge base to generate responses, improving the accuracy of knowledge matching and ensuring the output effect of the large language model. By controlling the scripts using standard scripts, it ensures the accuracy, compliance, and consistency of the response scripts, resulting in a higher adoption rate. Furthermore, it adapts to the business needs of educational sales scenarios, enhancing the relevance of responses to business requirements.

[0026] According to an embodiment of this application, a data processing method embodiment is provided. It should be noted that the steps shown in the flowcharts in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order than that shown here.

[0027] This embodiment provides a data processing method that can be used in the aforementioned server, such as a customer service system. Figure 2 is a flowchart of a data processing method according to an embodiment of this application. As shown in Figure 2, the process includes the following steps.

[0028] Step S201: Obtain multimodal information based on client input.

[0029] When users need to ask a question, they can enter the relevant question information through the client they are using (such as a terminal device or a related application installed on the terminal). The client supports input in multiple modalities; that is, users can enter multimodal information, which can take the form of text, voice, images, or other forms, depending on the user's input content.

[0030] This embodiment primarily targets a customer service scenario, where the multimodal information can specifically be a multimodal question. For example, the multimodal information could be: "When do elementary school math olympiad classes start?"

Step S202: Retrieve the business knowledge base based on the multimodal information and obtain relevant target business information from the business knowledge base.

[0031] In this embodiment, a pre-established structured knowledge base related to business operations is provided, namely, a business knowledge base. For example, in an education scenario, this business knowledge base may contain multiple pieces of knowledge related to education. After obtaining the multimodal information input by the user, the business knowledge base is first searched based on the multimodal information to obtain relevant business information, namely, target business information, which is a fragment of business knowledge in the business knowledge base.

[0032] For education sales scenarios, keyword retrieval and semantic vector retrieval can be used to comprehensively search the business knowledge base, so as to retrieve the appropriate target business information in the business knowledge base and avoid problems such as missing target business information due to a single retrieval method.
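The combination of keyword retrieval and semantic vector retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of this application: the sample knowledge fragments, the function names, and the equal weighting are hypothetical, and a bag-of-words count vector stands in for a real semantic embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical sample fragments; a real business knowledge base would be far larger.
KNOWLEDGE_BASE = [
    "Elementary school math olympiad classes start on September 1.",
    "The refund policy allows a full refund within 7 days of purchase.",
    "The spring English course costs 1999 yuan.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query, fragment):
    """Fraction of distinct query tokens that also appear in the fragment."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(tokenize(fragment))) / len(query_tokens)


def embed(text):
    """Stand-in embedding (assumption): token counts instead of a semantic model."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_search(query, fragments, top_k=1, keyword_weight=0.5):
    """Rank fragments by a weighted sum of keyword and vector scores."""
    scored = []
    for fragment in fragments:
        score = (keyword_weight * keyword_score(query, fragment)
                 + (1.0 - keyword_weight) * cosine(embed(query), embed(fragment)))
        scored.append((score, fragment))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fragment for _, fragment in scored[:top_k]]
```

Combining the two scores means that a fragment missed by one retrieval method can still be surfaced by the other, which is the motivation given in [0032].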

[0033] It can be understood that multimodal information can take various forms. If the original multimodal information is in text format, knowledge base retrieval can be performed directly on that text. If the original multimodal information is in another format, such as speech or an image, it needs to be converted into text content first. This can be achieved through technologies such as speech-to-text conversion or OCR recognition, or by utilizing the multimodal processing capabilities of a large language model (speech-to-text, basic image analysis, etc.). Afterward, the business knowledge base can be retrieved based on the converted text content.
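The conversion of non-text input into text before retrieval can be sketched as a simple dispatch. This is an illustrative sketch only: the function name to_text, the modality labels, and the converters mapping are hypothetical, and the actual speech-to-text or OCR engines are assumed to be supplied by the caller.

```python
from typing import Callable, Dict, Union

Payload = Union[str, bytes]


def to_text(modality: str, payload: Payload,
            converters: Dict[str, Callable[[Payload], str]]) -> str:
    """Return text for retrieval; text passes through, other modalities are converted."""
    if modality == "text":
        return payload if isinstance(payload, str) else payload.decode("utf-8")
    if modality not in converters:
        raise ValueError(f"unsupported modality: {modality}")
    # The converter (for example a speech-to-text or OCR call) is an assumption
    # injected by the caller; no specific engine is implied.
    return converters[modality](payload)
```

Keeping the converters injectable allows either dedicated recognition services or the multimodal capabilities of a large language model to be used without changing the retrieval step.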

[0034] Step S203: Determine the initial response information of the large language model based on the multimodal information and the target business information.

[0035] After obtaining the target business information, a prompt is generated based on the multimodal information and the target business information. The prompt containing the multimodal information and the target business information is input into the Large Language Model (LLM), whose reasoning ability is used to output content that responds to the multimodal information, i.e., the initial response information.

[0036] The input to the large language model includes target business information retrieved from the business knowledge base, such as information fragments related to sales scenarios obtained from the structured business knowledge base. This can improve the accuracy of knowledge matching and ensure the output effect of the large language model.
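A prompt that combines the user query with the retrieved target business information might be assembled as sketched below. The function name build_prompt and the prompt wording are hypothetical; the application does not specify a prompt format, and the user input is assumed to have already been converted to text.

```python
from typing import List


def build_prompt(user_text: str, knowledge_fragments: List[str]) -> str:
    """Assemble an LLM prompt from the user query and retrieved knowledge fragments."""
    knowledge = "\n".join(f"- {fragment}" for fragment in knowledge_fragments)
    return (
        "You are a customer service assistant for an education business.\n"
        "Answer using only the business knowledge listed below.\n\n"
        f"Business knowledge:\n{knowledge}\n\n"
        f"User question: {user_text}\n"
        "Answer:"
    )
```

The returned string would then be passed to whichever large language model the system uses; that call is omitted here because it depends on the chosen model.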

[0037] Step S204: Determine the deviation segment between the initial response information and the standard script in the preset script template library.

[0038] Initial response information directly output by a large language model may be inaccurate or non-compliant. To ensure the accuracy, compliance, and consistency of the output content in sales scenarios and to prevent the large language model from generating invalid response information, this embodiment also performs script control on the response information, addressing the lack of business-specific script control in existing multimodal customer service systems.

[0039] Specifically, a script template library is pre-set, which contains multiple standard scripts. For the initial response information generated by the large language model, it can be compared with the standard scripts in the script template library to determine the deviation between the two. This allows us to determine which segments in the initial response information deviate from the standard scripts, thus identifying the deviating segments in the initial response information.

[0040] The script template library contains multiple standard scripts. One or more standard scripts that are closest to the initial response information can first be identified in the script template library, and the initial response information can then be compared with them to identify the deviation segments that do not meet the script requirements.
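Selecting the standard scripts closest to the initial response information can be sketched as follows. This sketch assumes character-level similarity from Python's standard difflib module as the closeness measure; the application does not prescribe a particular measure at this step, and the function name closest_scripts is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def closest_scripts(response: str, script_library: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k standard scripts most similar to the response."""
    ranked = sorted(
        script_library,
        key=lambda script: SequenceMatcher(None, response, script).ratio(),
        reverse=True,
    )
    return ranked[:top_k]
```

For example, given a library holding one pricing script and one refund script, a response about course pricing would be paired with the pricing script before the detailed comparison is performed.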

[0041] Step S205: Correct the deviation segments in the initial response information to obtain the corrected target response information, and then send the target response information back to the client.

[0042] In this embodiment, each deviation segment in the initial response information can be corrected, for example, by correcting it based on standard wording. The specific correction method can be determined based on the actual situation. By correcting the deviation segments in the initial response information, a corrected response information that better meets the requirements of the scenario can be obtained. For ease of description, the corrected initial response information is referred to as the target response information. This target response information can then be used as the final generated response and fed back to the client. The client can then display this target response information to the user, thereby enabling automatic responses to user inquiries and providing customer service support.

[0043] The data processing method provided in this embodiment, based on the user's query needs, first searches the business knowledge base to find relevant knowledge fragments, then synthesizes these knowledge fragments to determine the initial response information output by the large language model, and finally uses standard scripts from the script template library to correct the initial response information, thereby providing the user with more accurate target response information. This method utilizes relevant business knowledge from the business knowledge base to generate responses, which can improve the accuracy of knowledge matching and ensure the output effect of the large language model; script control through standard scripts can ensure the accuracy, compliance, and consistency of the response scripts, resulting in a higher adoption rate and adaptability to the business needs of education sales scenarios, thus improving the relevance of responses to business requirements.
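The overall flow of steps S202 to S205 can be summarized in a short orchestration sketch. All names here are hypothetical, the retrieval, generation, and deviation detection components are passed in as callables, and the input is assumed to have already been converted to text in step S201. Correction is reduced to direct replacement for brevity.

```python
from typing import Callable, List, Tuple


def respond(
    user_text: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    find_deviations: Callable[[str], List[Tuple[str, str]]],
) -> str:
    """Run retrieval, generation, deviation detection, and correction in order."""
    fragments = retrieve(user_text)            # S202: target business information
    initial = generate(user_text, fragments)   # S203: initial response information
    deviations = find_deviations(initial)      # S204: (deviating text, standard text) pairs
    target = initial
    for deviating_text, standard_text in deviations:  # S205: correct each deviation
        target = target.replace(deviating_text, standard_text)
    return target
```

When no deviation is found, the initial response information is returned unchanged as the target response information.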

[0044] This embodiment provides a data processing method that can be used in the aforementioned server, such as a customer service system. Figure 3 is a flowchart of a data processing method according to an embodiment of this application. As shown in Figure 3, the process includes the following steps.

[0045] Step S301: Obtain multimodal information based on client input.

[0046] For details, see step S201 of the embodiment shown in Figure 2, which will not be described again here.

[0047] Step S302: Retrieve the business knowledge base based on multimodal information and obtain relevant target business information from the business knowledge base.

[0048] For details, see step S202 of the embodiment shown in Figure 2, which will not be described again here.

[0049] Step S303: Determine the initial response information of the large language model based on the multimodal information and the target business information.

[0050] For details, see step S203 of the embodiment shown in Figure 2, which will not be described again here.

[0051] Step S304: Determine the deviation segment between the initial response information and the standard script in the preset script template library.

[0052] For details, see step S204 of the embodiment shown in Figure 2, which will not be described again here.

[0053] In some optional implementations, step S304, "determining the deviation segment between the initial response information and the standard script in the preset script template library," may specifically include steps a1 to a3.

[0054] Step a1: Determine the edit distance and semantic similarity between the initial response information and the standard scripts in the script template library.

[0055] Step a2: Determine the matching score of the initial response information based on edit distance and semantic similarity.

[0056] Step a3: If the matching score is less than a preset threshold, locate the deviation segment in the initial response information that deviates from the standard script.

[0057] In this embodiment, when comparing the initial response information with the standard scripts in the script template library, the edit distance and semantic similarity between the two are determined. Edit distance refers to the minimum number of editing operations required to transform one string into the other; the smaller the edit distance, the more similar the two strings are at the character level. Semantic similarity is the similarity between their semantic vectors, such as cosine similarity.

[0058] The edit distance and semantic similarity between the initial response information and the standard script are fused to obtain a score that comprehensively evaluates how well the two match, i.e., a matching score. For example, a weighted summation method can be used to fuse the edit distance and semantic similarity to obtain the corresponding matching score.
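One way to realize the weighted summation is sketched below. Several choices here are assumptions rather than details from this application: the edit distance is first normalized into a similarity in [0, 1] so that both terms point in the same direction, the weights are equal, and a bag-of-words cosine similarity stands in for similarity between real semantic vectors.

```python
import math
import re
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def semantic_similarity(a: str, b: str) -> float:
    """Stand-in (assumption): cosine similarity of word-count vectors."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def matching_score(response: str, script: str, edit_weight: float = 0.5) -> float:
    """Weighted sum of normalized edit similarity and semantic similarity."""
    longest = max(len(response), len(script))
    edit_similarity = 1.0 - edit_distance(response, script) / longest if longest else 1.0
    return edit_weight * edit_similarity + (1.0 - edit_weight) * semantic_similarity(response, script)
```

With this normalization, a higher matching score always indicates a closer match, so the score can be compared directly against the preset threshold of step a3.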

[0059] If the matching score is high, it means that the initial response information matches the standard script, that is, the initial response information generated by the large language model meets the requirements, so the initial response information can be directly used as the target response information fed back to the client.

[0060] Conversely, if the matching score is low, i.e., less than the preset threshold, it indicates a significant deviation in the initial response information. In that case, the deviation segments in the initial response information that differ from the standard script can be located. For example, a diff algorithm can be used to locate the specific segments in the initial response information that deviate from the standard script, such as incorrect price figures, missing discount periods, or discrepancies in course names.
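Locating deviation segments with a diff algorithm can be sketched using Python's standard difflib module. This is an illustrative choice of diff algorithm, not one named by this application; the function name locate_deviations and the whitespace tokenization are assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def locate_deviations(response: str, script: str) -> List[Dict[str, str]]:
    """Return the token spans where the response differs from the standard script."""
    response_tokens = response.split()
    script_tokens = script.split()
    matcher = SequenceMatcher(None, response_tokens, script_tokens, autojunk=False)
    deviations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # "replace", "delete", or "insert"
            deviations.append({
                "kind": tag,
                "response_segment": " ".join(response_tokens[i1:i2]),
                "script_segment": " ".join(script_tokens[j1:j2]),
            })
    return deviations
```

In this sketch an incorrect price figure appears as a "replace" entry, while content missing from the response, such as a discount period, appears as an "insert" entry whose response_segment is empty.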

[0061] In this embodiment, not only is the character-level edit distance calculated, but semantic similarity is also calculated using methods such as BERT semantic embeddings, forming a two-dimensional matching score. This avoids misjudgments caused by simple character matching and allows mismatched content to be located and identified more accurately.

[0062] Step S305: Correct the deviation segments in the initial response information to obtain the corrected target response information, and then send the target response information back to the client.

[0063] For details, see step S205 of the embodiment shown in Figure 2, which will not be described again here.

[0064] In some optional implementations, step S305, "correcting the deviation segments in the initial response information to obtain the corrected target response information," may include steps b1 to b3.

[0065] Step b1: If the deviation fragment belongs to a key business field, replace the deviation fragment in the initial response information with the corresponding content in the standard script.

[0066] Step b2: If the deviation fragment is descriptive content, the corrected fragment corresponding to the deviation fragment is regenerated by the large language model, using the standard script as a constraint, and the deviation fragment in the initial response information is replaced with the corrected fragment.

[0067] Step b3: If the deviation fragment is sensitive content related to compliance, perform template matching on the deviation fragment to determine the matching standard template content, and replace the deviation fragment in the initial response information with the standard template content.

[0068] In this embodiment, a differentiated correction strategy can be applied to the specific deviations in the initial response information.

[0069] Specifically, if the deviation fragment belongs to a key business field, such as price, discount expiration date, or course validity period, the deviation fragment can be directly and forcibly replaced with the precise expression in the standard script template library, and a replacement log can be recorded for easy traceability.

[0070] If the deviation fragment is descriptive content, such as content related to course features or service introductions, a constraint regeneration strategy can be adopted. This involves using standard wording as a constraint to trigger a partial rewriting of the large model, thereby regenerating the corrected fragment corresponding to the deviation fragment using the large language model and replacing the fragment. This approach preserves the personalized expression framework of the original response and corrects only factual deviations.

[0071] If the deviation fragment pertains to compliance-sensitive content, such as refund policies or qualification statements, then template-forced matching can be enabled to completely replace the deviation fragment with the standard wording approved by legal counsel, ensuring the compliance of the corrected response information.
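The three correction branches of steps b1 to b3 can be sketched as a single dispatch function. The category labels, the function and parameter names, and the use of difflib for template matching are hypothetical; how a deviation fragment is classified into a category is not shown, and the rewrite callable stands in for a constrained call to the large language model.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List


def correct_deviation(
    kind: str,
    deviation: str,
    script_segment: str,
    rewrite: Callable[[str, str], str],
    compliance_templates: List[str],
    replacement_log: List[Dict[str, str]],
) -> str:
    """Return the corrected text for one deviation fragment."""
    if kind == "key_field":
        # b1: forced replacement with the standard script content, logged for traceability.
        replacement_log.append({"replaced": deviation, "with": script_segment})
        return script_segment
    if kind == "descriptive":
        # b2: constrained regeneration, with the standard script passed as the constraint.
        return rewrite(deviation, script_segment)
    if kind == "compliance":
        # b3: replace with the best-matching approved template.
        return max(compliance_templates,
                   key=lambda template: SequenceMatcher(None, deviation, template).ratio())
    raise ValueError(f"unknown deviation kind: {kind}")
```

Separating the branches this way keeps exact fields such as prices under strict replacement while leaving descriptive wording free to retain the personalized framing of the original response.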

[0072] It can be understood that the corrected response information can be reassessed to determine whether further correction is needed. For example, based on steps a1 to a3, it can be determined whether any deviation segments remain (e.g., whether the matching score is still less than the preset threshold). Correction can be repeated until the corrected response information meets the requirements, at which point it is used as the target response information.
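The reassessment described above can be sketched as a bounded loop. The function name, the threshold value, and the max_rounds limit are assumptions added for illustration; the application does not state a threshold or a limit on the number of correction rounds.

```python
from typing import Callable, Tuple


def correct_until_valid(
    response: str,
    score: Callable[[str], float],
    correct: Callable[[str], str],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Re-score after each correction; stop once the score reaches the threshold."""
    for _ in range(max_rounds):
        if score(response) >= threshold:
            return response, True
        response = correct(response)
    # Final check after the last correction; False signals that the response
    # still falls short and may need another handling path.
    return response, score(response) >= threshold
```

Returning a flag alongside the text lets the caller distinguish a response that passed verification from one that exhausted the allowed rounds.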

[0073] In this embodiment, semantic similarity and edit distance are integrated into the script control mechanism, together with a differentiated correction strategy based on content type and verification of the correction result, which ensures the accuracy, compliance, and consistency of the output scripts in the sales scenario.

[0074] Step S306: Obtain multimodal interaction evaluation data; the multimodal interaction evaluation data includes multimodal interaction data and the corresponding evaluation indicators.

[0075] In this embodiment, in addition to integrating knowledge base management and precise dialogue control, the multimodal intelligent customer service system also adds optimization and evaluation functions for prompt words and other content. This enables an intelligent customer service workflow with full-link optimization, and realizes a closed-loop design for full-link monitoring and evaluation that spans knowledge management, intelligent dialogue, process optimization, and effect monitoring.

[0076] Specifically, during the interaction between the user and the customer service system, corresponding multimodal interaction data can be generated. For example, taking the interaction process from steps S301 to S305 as an example, the multimodal interaction data can include the multimodal information input by the user and the final target response information. Furthermore, for each piece of multimodal interaction data, corresponding evaluation indicators are set to represent the quality of that interaction.

[0077] For ease of description, multimodal interaction data with evaluation metrics will be referred to as multimodal interaction evaluation data, so that subsequent optimization can be carried out based on this multimodal interaction evaluation data.

[0078] Optionally, step S306, "acquiring multimodal interaction evaluation data", may specifically include steps c1 to c3.

[0079] Step c1: Obtain multimodal interaction data; multimodal interaction data includes multimodal input data, modality type, and output data of the interaction process.

[0080] Step c2: Obtain evaluation metrics for multimodal interaction data; the evaluation metrics include a first metric and / or a second metric. The first metric includes at least one of the following: conversion accuracy of multimodal input data, edit distance between output data and standard dialogue in the dialogue template library, and semantic similarity. The second metric includes whether the output data is adopted and / or manually labeled information.

[0081] Step c3: Generate multimodal interaction evaluation data based on the multimodal interaction data and the corresponding evaluation indicators.
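Steps c1 to c3 can be pictured as merging two small records. The sketch below is illustrative only; the class and field names are assumptions chosen to mirror the terms used in the steps, not a schema defined by the source.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InteractionData:
    input_data: str   # multimodal input, after conversion to text
    modality: str     # "text", "voice", or "image"
    output_data: str  # target response returned to the client


@dataclass
class Metrics:
    # First metric (automated); every field is optional.
    conversion_accuracy: Optional[float] = None
    edit_distance: Optional[int] = None
    semantic_similarity: Optional[float] = None
    # Second metric (manual).
    adopted: Optional[bool] = None
    manual_label: Optional[str] = None


def build_evaluation_record(data: InteractionData, metrics: Metrics) -> dict:
    """Step c3: merge interaction data with its metrics, dropping unset fields."""
    record = asdict(data)
    record["metrics"] = {k: v for k, v in asdict(metrics).items() if v is not None}
    return record
```

Making every metric optional reflects the "and / or" wording in step c2: a record may carry only automated metrics, only manual ones, or both.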

[0082] In this embodiment, tracking code can be embedded in the workflow (or gateway) of the large language model to record relevant data during each multimodal interaction. This data includes multimodal input data (e.g., user-input multimodal information), modality types (e.g., text, voice, image, etc.), and output data (e.g., target response information). Furthermore, it can record intermediate processing results (e.g., retrieval results from the business knowledge base, initial responses from the large language model, etc.), response latency during the interaction process, token consumption, and other end-to-end data. Multimodal interaction data is generated based on this recorded data. In essence, each interaction process generates one piece of multimodal interaction data.

[0083] Each piece of multimodal interaction data can be evaluated, and this embodiment supports two evaluation modes.

[0084] Evaluation Mode 1: Automated Evaluation. For each piece of multimodal interaction data, the conversion accuracy of the multimodal input data (e.g., the accuracy of speech-to-text conversion), the edit distance between the output data and the standard dialogue in the dialogue template library, the semantic similarity, and so on can be calculated automatically to form the first indicator.

[0085] Evaluation Mode 2: Manual Evaluation. Specifically, a relevant annotation interface can be provided for manual evaluation of the adoption results (e.g., adoption, rejection) or manual annotation of the quality of the responses (high, medium, low, etc.). Based on whether the output data was adopted and / or the manual annotation information, a corresponding second indicator is generated.

[0086] In addition, automated backtesting scripts can be integrated to periodically perform batch tests on the system using labeled, high-quality test sets and generate performance reports.

[0087] Optionally, a gateway, such as a Flask gateway, can be added between the client and server in the intelligent customer service system. This gateway serves as the unified entry point for the system and adds end-to-end data reporting functionality, integrating request routing, authentication, logging, and monitoring data synchronization. Specifically, the gateway receives multimodal requests from clients, forwards them to a large model workflow engine (such as the Dify workflow engine), and then sends the returned results back to the client. At the same time, the gateway can send detailed multimodal interaction data (input type, conversion result, processing flow, output content, etc.) to the evaluation and optimization module (such as Langfuse) to provide data support for monitoring and evaluation.
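The gateway's role can be sketched without committing to any framework. The example below deliberately does not use the Flask, Dify, or Langfuse APIs: `handle_request`, `run_workflow`, and `report` are hypothetical names, and the two callables stand in for the workflow engine and the monitoring platform respectively.

```python
import time
from typing import Callable, Dict, Set


def handle_request(
    request: Dict[str, str],
    valid_tokens: Set[str],
    run_workflow: Callable[[Dict[str, str]], str],
    report: Callable[[Dict[str, object]], None],
) -> Dict[str, object]:
    """Authenticate, forward to the workflow, report the trace, return the reply."""
    if request.get("token") not in valid_tokens:
        return {"status": 401, "error": "unauthorized"}
    started = time.perf_counter()
    output = run_workflow(request)
    # Report interaction data so it can be monitored and evaluated later.
    report({
        "input_type": request.get("modality", "text"),
        "input": request.get("content", ""),
        "output": output,
        "latency_s": time.perf_counter() - started,
    })
    return {"status": 200, "response": output}
```

In a real deployment this function body would sit inside a web framework route handler; unauthenticated requests are rejected before any trace is reported.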

[0088] Step S307: Optimize the prompt words and / or system parameters of the large language model based on the multimodal interaction evaluation data, and optimize the retrieval strategy of the business knowledge base based on the multimodal interaction evaluation data.

[0089] In this embodiment, the multimodal interaction evaluation data can represent the response effect of previous multimodal interaction data, and therefore the entire interaction process can be optimized based on this. Specifically, the prompt words and system parameters of the large language model can be optimized; wherein, the prompt words can be pre-set, for example, they can include system prompt words, as well as other prompt words used to indicate role settings, output format requirements, etc.; the system parameters can include the calling parameters involved in the large language model's inference, such as temperature parameters, Top-K and other basic decoding parameters.

[0090] Furthermore, the retrieval strategy for the business knowledge base will be optimized simultaneously to ensure that the retrieval results meet the requirements. This enables automated optimization of prompts and large model calling parameters throughout the system, especially for knowledge base retrieval strategy optimization in multimodal interaction scenarios. It can replace inefficient manual tuning and achieve automated optimization of the multimodal customer service system process.

[0091] In some alternative implementations, step S307, "optimizing the prompt words and / or system parameters of the large language model based on multimodal interaction evaluation data," may include step d1.

[0092] Step d1: Based on the multimodal interaction evaluation data, optimize the prompt words and system parameters corresponding to various modal types; the system parameters include temperature parameters. Specifically, at least the system prompt words and the first temperature parameter for the text modal type are optimized; at least the error correction prompt words and the second temperature parameter for the voice modal type or the image modal type are optimized; the second temperature parameter is less than the first temperature parameter.

[0093] In this embodiment, prompt words and system parameters in the large model workflow are used as optimization targets. The optimizer can automatically select the optimal combination of few-shot examples from multimodal interaction data (such as text questions, speech-to-text features, and image extraction information) and optimize accordingly. In particular, to address the uncertainties in multimodal inputs (speech-to-text may have ASR errors, and image extraction may have OCR biases), optimization is performed separately for different modality types.

[0094] Specifically, for multimodal interactive data in text modality, in addition to optimizing the preset system prompts, the temperature parameter is also optimized, specifically the first temperature parameter. Generally, text modality does not have modality conversion errors, so this first temperature parameter can be set relatively high to enhance response flexibility.

[0095] For multimodal interactive data of either speech or image modality, error correction prompts are provided and can be optimized based on the multimodal interactive data to minimize ASR errors or OCR biases. Furthermore, a second temperature parameter is optimized, which is generally smaller than the first temperature parameter to enhance the stability of the output response. Typically, the focus can be on optimizing the error correction prompts, and the query expansion strategy during the retrieval phase can also be optimized to further compensate for noise introduced by ASR / OCR.

[0096] Subsequently, upon receiving new multimodal input information, appropriate temperature parameters can be selected based on the corresponding modality type, thereby realizing a dynamic parameter adjustment mechanism for input modality sensing.
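Input-modality-aware parameter selection can be sketched as a lookup keyed by modality type. The prompt strings and temperature values below are illustrative assumptions; the source requires only that the second temperature parameter (voice or image) be less than the first (text).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModalityConfig:
    prompt: str
    temperature: float


# Illustrative values only (assumption): voice/image temperature < text temperature.
CONFIGS: Dict[str, ModalityConfig] = {
    "text": ModalityConfig("You are a course sales assistant.", 0.7),
    "voice": ModalityConfig("The input is an ASR transcript; correct likely mis-hearings first.", 0.3),
    "image": ModalityConfig("The input is OCR text; correct likely misreads first.", 0.3),
}


def select_config(modality: str) -> ModalityConfig:
    """Pick the prompt and temperature for the incoming modality type."""
    try:
        return CONFIGS[modality]
    except KeyError:
        raise ValueError(f"unsupported modality: {modality!r}") from None
```

An optimizer would overwrite the entries of `CONFIGS` after each optimization round, while request handling only ever reads from it.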

[0097] Optionally, step S307, "optimizing the retrieval strategy of the business knowledge base based on multimodal interaction evaluation data," may include steps d2 to d3.

[0098] Step d2: Determine the search parameters when searching the business knowledge base; the search parameters include at least one of the following: search quantity value, similarity threshold, and re-ranking strategy parameters.

[0099] Step d3: In the case of keyword retrieval and semantic vector retrieval of the business knowledge base, determine the fusion weight between keyword retrieval and semantic vector retrieval.

[0100] In this embodiment, the retrieval strategy of the business knowledge base is also considered as an optimizable objective. Specifically, based on multimodal interaction evaluation data, the retrieval quantity value (e.g., Top-K value), similarity threshold, and reranking strategy of vector retrieval can be dynamically adjusted to balance recall and precision. Furthermore, for scenarios such as education sales, the fusion weight of keyword retrieval and semantic vector retrieval can be optimized. By adjusting the weight ratio of different retrieval methods, the optimal combination of precise matching such as course codes and price figures with semantic matching such as course descriptions and target audiences can be achieved.
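The tunable retrieval parameters can be made concrete with a small fusion routine. This sketch assumes both retrievers return per-document scores already normalized to [0, 1] and combines them linearly; the source does not specify the fusion formula, so the function `hybrid_retrieve` and its defaults are hypothetical.

```python
from typing import Dict, List, Tuple


def hybrid_retrieve(
    keyword_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    weight: float = 0.5,
    top_k: int = 3,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Fuse per-document scores, filter by threshold, and return the top_k."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    doc_ids = set(keyword_scores) | set(vector_scores)
    fused = {
        d: weight * keyword_scores.get(d, 0.0)
        + (1.0 - weight) * vector_scores.get(d, 0.0)
        for d in doc_ids
    }
    kept = [(d, s) for d, s in fused.items() if s >= threshold]
    # Sort by score descending, then by id so that ties are deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:top_k]
```

A higher `weight` favors exact matches such as course codes and price figures, while a lower one favors semantic matches such as course descriptions; `top_k` and `threshold` trade recall against precision.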

[0101] Furthermore, based on the actual matching results of multimodal interaction data, the system can automatically evaluate and switch the most suitable embedding model for the education and sales field. For example, by comparing the relevance scores and user adoption rates of knowledge fragments under different retrieval configurations, the system can automatically generate the optimal embedding model configuration, allowing selection of the best model from multiple options. Moreover, it can train query expansion models based on historical data, automatically expanding users' brief voice / image queries into more complete semantic representations that facilitate retrieval, thus achieving multimodal query expansion optimization.
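Comparing configurations by relevance score and adoption rate can be sketched as follows. The blending of the two signals with `adoption_weight` is an assumption made for illustration; the source says only that both are compared across retrieval configurations.

```python
from statistics import mean
from typing import Dict, List, Tuple


def pick_best_config(
    results: Dict[str, List[Tuple[float, bool]]],
    adoption_weight: float = 0.5,
) -> str:
    """Return the configuration name with the highest blended score.

    Each sample is (relevance_score, adopted) for one evaluated interaction.
    """
    def blended(samples: List[Tuple[float, bool]]) -> float:
        relevance = mean(score for score, _ in samples)
        adoption = mean(1.0 if adopted else 0.0 for _, adopted in samples)
        return (1.0 - adoption_weight) * relevance + adoption_weight * adoption

    scored = {name: blended(s) for name, s in results.items() if s}
    if not scored:
        raise ValueError("no evaluation samples available")
    # Iterate names in sorted order so ties resolve deterministically.
    return max(sorted(scored), key=scored.__getitem__)
```

The keys of `results` could name embedding models, fusion weights, or threshold settings; the selection logic is the same in each case.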

[0102] The following example illustrates the process of this method in detail. Specifically, it utilizes the visual workflow orchestration capabilities provided by the open-source large language model application development platform Dify to offer a large model workflow; it employs the large model programming framework DSPy to achieve procedural optimization of prompt words and processes; and it leverages the open-source observability platform Langfuse to achieve end-to-end monitoring and evaluation capabilities.

[0103] While retaining the basic multimodal processing capabilities, key new components such as script control, end-to-end monitoring, and automated optimization have been added, forming an end-to-end closed-loop workflow of "multimodal input - intelligent processing - accurate output - monitoring and evaluation - automated optimization - iterative upgrade". Figure 4 illustrates a complete workflow for handling a user's multimodal consultation, which includes the following steps.

[0104] Step S401: The user inputs a consultation request, including multimodal information, through the client.

[0105] For example, users can ask questions in text, voice, or image formats through website chat windows, such as asking about course prices by voice, sending course materials via image, or asking about course details by text.

[0106] Step S402: The consultation request arrives at the Flask gateway. The gateway completes request authentication and routing, and begins recording the entire interaction data.

[0107] Step S403: The gateway forwards the multimodal request to the Dify workflow engine. The Dify workflow engine then initiates the multimodal processing function.

[0108] Step S404: The Dify workflow engine completes the basic multimodal processing.

[0109] Specifically, if the information is text, it will directly enter the knowledge base retrieval node.

[0110] If the information is voice, it is converted into text information using speech recognition technology and then entered into the knowledge base retrieval node.

[0111] If the information is an image, the image recognition module is called to extract the relevant information, which is then converted into text and entered into the knowledge base retrieval node.
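The modality routing in step S404 can be sketched as a single conversion function. The `converters` mapping is a hypothetical hook for speech recognition and image recognition back ends, which the source does not name.

```python
from typing import Callable, Dict


def to_text(
    modality: str,
    payload: bytes,
    converters: Dict[str, Callable[[bytes], str]],
) -> str:
    """Convert any supported input to text for the knowledge base retrieval node."""
    if modality == "text":
        # Text needs no conversion and goes straight to retrieval.
        return payload.decode("utf-8")
    if modality not in converters:
        raise ValueError(f"no converter for modality: {modality!r}")
    return converters[modality](payload)
```

Whatever the input modality, the retrieval node downstream receives plain text, which is what lets a single knowledge base serve all three input types.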

[0112] Step S405: The Dify workflow engine initiates the dialogue control processing to generate the final response.

[0113] Specifically, based on the converted text information, the knowledge base retrieval node performs dynamic vectorized retrieval on the structured business knowledge base for scenarios such as education and sales, and obtains the relevant business knowledge fragments as the target business information.

[0114] The user's multimodal question and the knowledge fragments (i.e., the target business information) are sent to the large model to generate an initial response. The initial response is then verified and corrected through the script control mechanism (such as the template matching, edit distance calculation, and differentiated correction strategy mentioned above) to ensure the accuracy and compliance of core sales information such as course prices and preferential policies.

[0115] Step S406: Return the final response to the user via the Flask gateway. The gateway will simultaneously report all data from this interaction to the Langfuse monitoring platform.

[0116] Completing steps S401 to S406 accomplishes one multimodal interaction with the user. In parallel, continuous end-to-end monitoring and evaluation (steps S407 to S408) and automated optimization and iterative upgrades (steps S409 to S411) can be performed.

[0117] Step S407: The Langfuse monitoring platform continuously tracks and records the entire chain of data for all multimodal interactions.

[0118] Specifically, this data may include: input type, transformation result, knowledge base retrieval content, initial model response, corrected output, response latency, token consumption, etc.

[0119] Step S408: Determine the evaluation metrics for each multimodal interaction data.

[0120] Specifically, the system can periodically run automated backtesting scripts to automatically evaluate historical multimodal interaction data using metrics such as relevance and edit distance; it also supports manual annotation of dialogue quality (acceptance, rejection), and all evaluation results are imported into Langfuse to form a standardized dataset.

[0121] Step S409: Perform automatic optimization based on the dataset from the Langfuse monitoring platform.

[0122] Specifically, once the Langfuse monitoring platform accumulates sufficient multimodal interaction evaluation data (including text dialogues, speech-to-text accuracy labels, and image extraction correctness annotations), it automatically triggers the DSPy optimization engine, which performs hierarchical optimization tasks. For example, for text-based interaction data, the focus is on optimizing prompt word examples and decoding parameters; for speech / image-based interaction data, the focus is on optimizing error-correction prompt words in multimodal preprocessing nodes and query expansion strategies in the retrieval stage to compensate for noise introduced by ASR / OCR. Furthermore, by comparing the relevance scores and user adoption rates of knowledge fragments under different retrieval configurations, it can automatically generate the optimal embedding model configuration, hybrid retrieval weight scheme, and dynamic threshold strategy.

[0123] Here, the dialogue adoption rate and its correlation with sales conversion can be used as end-to-end optimization targets, iterating toward the optimal parameter combination for prompt words, knowledge base retrieval, and response generation to achieve joint optimization across the entire chain.

[0124] Step S410: The DSPy engine automatically pushes and updates the optimized new configuration to the Dify workflow engine, completing the iteration of system parameters and processes.

[0125] Step S411: The system uses the optimized configuration to process subsequent multimodal inquiries from users. This enables continuous automatic performance improvement, forming a self-optimizing closed loop that keeps running.

[0126] The data processing method provided in this embodiment introduces knowledge base integration and dialogue control mechanisms into the workflow design of the large language model. It can also automatically optimize prompt words and retrieval strategies in a collaborative manner, realizing full-link monitoring and evaluation of the customer service system. This end-to-end self-optimizing closed-loop system can achieve efficient and accurate customer service and sales conversion, significantly improving service efficiency and user experience in the consultation and sales scenarios of the education industry. By combining semantic similarity, template matching, and edit distance calculation, it can realize the location of mismatched content and differentiated correction strategies, balancing the accuracy of dialogue and the personalization of responses. Based on end-to-end effect feedback of multimodal interaction data, it realizes the automated optimization of prompt words, input modality-aware decoding parameters, and knowledge base retrieval strategies. This not only improves the system iteration efficiency but also adapts to the complex scenarios of multimodal interaction, improves the generation effect of large models and the accuracy of knowledge matching, improves system operation and maintenance efficiency, and significantly shortens the iteration cycle. This AI intelligent customer service system can provide 24-hour online service, accurately respond to various user needs, improve the conversion rate and customer experience, and is suitable for multiple business stages such as pre-class consultation, in-class Q&A, and post-class follow-up.

[0127] This embodiment also provides a data processing apparatus for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and / or hardware that performs a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.

[0128] This embodiment provides a data processing device. As shown in Figure 5, the device includes: an acquisition module 501, used to acquire multimodal information based on client input; a retrieval and generation module 502, used to retrieve a business knowledge base based on the multimodal information, obtain relevant target business information from the business knowledge base, and determine the initial response information of the large language model based on the multimodal information and the target business information; a deviation determination module 503, used to determine the deviation segment between the initial response information and the standard script in the preset script template library; and a processing module 504, used to correct the deviation segment in the initial response information to obtain the corrected target response information, and to feed back the target response information to the client.

[0129] In some optional implementations, determining the deviation segment between the initial response information and the standard script in the preset script template library includes: determining the edit distance and semantic similarity between the initial response information and the standard scripts in the script template library; determining the matching score of the initial response information based on the edit distance and the semantic similarity; and, if the matching score is less than a preset threshold, locating the deviation segment in the initial response information that deviates from the standard script.
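One way to combine the two signals into a matching score is sketched below. The weighted linear mix, the `alpha` parameter, and the bag-of-words cosine used as a stand-in for an embedding-based semantic similarity are all assumptions; the source does not state how the score is computed.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine; a stand-in for an embedding-based similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def matching_score(response: str, standard: str, alpha: float = 0.5) -> float:
    """Weighted mix of normalized edit similarity and semantic similarity."""
    longest = max(len(response), len(standard)) or 1
    edit_sim = 1.0 - edit_distance(response, standard) / longest
    return alpha * edit_sim + (1.0 - alpha) * cosine_similarity(response, standard)
```

A response whose score falls below the preset threshold would then be scanned for the spans that differ from the standard script, for example by aligning the two strings.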

[0130] In some optional implementations, correcting the deviation segments in the initial response information to obtain the corrected target response information includes: if the deviation segment belongs to a key business field, replacing the deviation segment in the initial response information with the corresponding content in the standard script; if the deviation segment is descriptive content, regenerating, by the large language model and with the standard script as a constraint, a corrected segment corresponding to the deviation segment, and replacing the deviation segment in the initial response information with the corrected segment; and, if the deviation segment is compliance-sensitive content, performing template matching on the deviation segment to determine the matching standard template content, and replacing the deviation segment in the initial response information with the standard template content.

[0131] In some optional implementations, the processing module is further configured to: acquire multimodal interaction evaluation data, where the multimodal interaction evaluation data includes multimodal interaction data and the corresponding evaluation indicators; and optimize the prompt words and / or system parameters of the large language model based on the multimodal interaction evaluation data, and optimize the retrieval strategy of the business knowledge base based on the multimodal interaction evaluation data.

[0132] In some optional implementations, acquiring multimodal interaction evaluation data includes: acquiring multimodal interaction data, where the multimodal interaction data includes multimodal input data, modality type, and output data of the interaction process; obtaining the evaluation metrics for the multimodal interaction data, where the evaluation metrics include a first metric and / or a second metric, the first metric includes at least one of the following: the conversion accuracy of the multimodal input data, the edit distance between the output data and the standard dialogue in the dialogue template library, and semantic similarity, and the second metric includes whether the output data is adopted and / or manually labeled information; and generating multimodal interaction evaluation data based on the multimodal interaction data and the corresponding evaluation indicators.

[0133] In some optional implementations, optimizing the prompt words and / or system parameters of the large language model based on the multimodal interaction evaluation data includes: optimizing, based on the multimodal interaction evaluation data, the prompt words and system parameters corresponding to the various modality types, where the system parameters include temperature parameters. Specifically, at least the system prompt words and the first temperature parameter are optimized for the text modality; at least the error correction prompt words and the second temperature parameter are optimized for the voice modality or image modality; and the second temperature parameter is less than the first temperature parameter.

[0134] In some optional implementations, optimizing the retrieval strategy of the business knowledge base based on the multimodal interaction evaluation data includes: determining the retrieval parameters used when searching the business knowledge base, where the retrieval parameters include at least one of the following: retrieval quantity value, similarity threshold, and re-ranking strategy parameters; and, when both keyword retrieval and semantic vector retrieval are performed on the business knowledge base, determining the fusion weight between keyword retrieval and semantic vector retrieval.

[0135] The data processing apparatus provided in this disclosure can execute the data processing method provided in any embodiment of this disclosure, and has the corresponding functional modules and beneficial effects for executing the method. Further functional descriptions of the various modules and units described above are the same as in the corresponding embodiments described above, and will not be repeated here.

[0136] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

[0137] Referring now to Figure 6, a schematic structural diagram of an electronic device suitable for implementing the embodiments of this application is shown. The electronic device may include a processor (e.g., a central processing unit, graphics processor, etc.) 601, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 602 or a program loaded from memory 608 into random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device. The processor 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input / output (I / O) interface 605 is also connected to the bus 604.

[0138] Typically, the following devices can be connected to I / O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; memory devices 608 including, for example, magnetic tapes, hard disks, etc.; and communication devices 609. Communication device 609 allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 6 shows an electronic device with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.

[0139] Specifically, according to embodiments of this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 609, or installed from a memory 608, or installed from a ROM 602. When the computer program is executed by the processor 601, it performs the functions defined in the data processing method of the embodiments of this application.

[0140] The electronic device shown in Figure 6 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0141] This application also provides a computer-readable storage medium. The methods described in this application can be implemented in hardware or firmware, or implemented as recordable on a storage medium, or implemented as computer code downloaded over a network and originally stored on a remote storage medium or a non-transitory machine-readable storage medium and then stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that computers, processors, microprocessor controllers, or programmable hardware include storage components capable of storing or receiving software or computer code. When the software or computer code is accessed and executed by the computer, processor, or hardware, the data processing methods shown in the above embodiments are implemented.

[0142] A portion of this application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and / or technical solutions according to this application through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, etc. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.

[0143] Although embodiments of this application have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of this application, and all such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A data processing method, characterized in that the method includes: obtaining multimodal information based on client input; retrieving a business knowledge base based on the multimodal information, and obtaining relevant target business information from the business knowledge base; determining initial response information of a large language model based on the multimodal information and the target business information; determining a deviation segment between the initial response information and a standard script in a preset script template library; and correcting the deviation segment in the initial response information to obtain corrected target response information, and feeding back the target response information to the client.

2. The method of claim 1, wherein the step of determining the deviation segment between the initial response information and the standard script in the preset script template library includes: determining the edit distance and semantic similarity between the initial response information and the standard scripts in the script template library; determining a matching score of the initial response information based on the edit distance and the semantic similarity; and, if the matching score is less than a preset threshold, locating the deviation segment in the initial response information that deviates from the standard script.

3. The method of claim 1, wherein the step of correcting the deviation segment in the initial response information to obtain the corrected target response information comprises: if the deviation segment belongs to a key business field, replacing the deviation segment in the initial response information with the corresponding content in the standard script; if the deviation segment is descriptive content, regenerating, by the large language model and with the standard script as a constraint, a corrected segment corresponding to the deviation segment, and replacing the deviation segment in the initial response information with the corrected segment; and if the deviation segment is compliance-sensitive content, performing template matching on the deviation segment to determine matching standard template content, and replacing the deviation segment in the initial response information with the standard template content.
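
The three correction branches of claim 3 can be sketched as a dispatch on the type of the deviation segment. Everything named here (`SegmentKind`, `Deviation`, `correct`, the `regenerate` callback standing in for a constrained model call) is hypothetical, and using `difflib.get_close_matches` for template matching is an assumption; the claim does not specify a matching technique.

```python
import difflib
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class SegmentKind(Enum):
    KEY_FIELD = "key_business_field"
    DESCRIPTIVE = "descriptive"
    COMPLIANCE = "compliance_sensitive"


@dataclass
class Deviation:
    text: str         # deviating text as it appears in the initial response
    script_text: str  # aligned content in the standard script
    kind: SegmentKind


def correct(
    response: str,
    deviations: Sequence[Deviation],
    regenerate: Callable[[str, str], str],
    compliance_templates: Sequence[str],
) -> str:
    """Replace each deviation according to its kind."""
    for dev in deviations:
        if dev.kind is SegmentKind.KEY_FIELD:
            # Key business fields are copied from the standard script.
            fixed = dev.script_text
        elif dev.kind is SegmentKind.DESCRIPTIVE:
            # Descriptive content is regenerated under the script constraint.
            fixed = regenerate(dev.text, dev.script_text)
        else:
            # Compliance-sensitive content is swapped for the closest template.
            best = difflib.get_close_matches(
                dev.text, compliance_templates, n=1, cutoff=0.0
            )
            fixed = best[0] if best else dev.script_text
        response = response.replace(dev.text, fixed, 1)
    return response
```

Replacing only the first occurrence of each deviation keeps unrelated repetitions of the same text untouched.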

4. The method according to any one of claims 1 to 3, characterized in that the method further comprises: acquiring multimodal interaction evaluation data, the multimodal interaction evaluation data including multimodal interaction data and corresponding evaluation indicators; optimizing prompt words and/or system parameters of the large language model based on the multimodal interaction evaluation data; and optimizing a retrieval strategy of the business knowledge base based on the multimodal interaction evaluation data.

5. The method of claim 4, wherein acquiring the multimodal interaction evaluation data comprises: acquiring multimodal interaction data, the multimodal interaction data including multimodal input data, a modality type, and output data of the interaction process; obtaining evaluation indicators for the multimodal interaction data, the evaluation indicators including a first indicator and/or a second indicator, wherein the first indicator includes at least one of the following: the conversion accuracy of the multimodal input data, the edit distance between the output data and the standard script in the script template library, and the semantic similarity between the output data and the standard script; and the second indicator includes whether the output data is adopted and/or manually labeled information; and generating the multimodal interaction evaluation data based on the multimodal interaction data and the corresponding evaluation indicators.

6. The method of claim 4, wherein optimizing the prompt words and/or system parameters of the large language model based on the multimodal interaction evaluation data comprises: optimizing, based on the multimodal interaction evaluation data, the prompt words and system parameters corresponding to each modality type, the system parameters including a temperature parameter; wherein, for the text modality, at least the system prompt words and a first temperature parameter are optimized; for the voice modality or the image modality, at least the error correction prompt words and a second temperature parameter are optimized; and the second temperature parameter is less than the first temperature parameter.
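
One way to picture the per-modality settings of claim 6 is a small configuration table that enforces the stated ordering of temperatures. The sketch below is an assumption throughout: `ModalityConfig`, `check_temperature_order`, and `with_temperature` are hypothetical names, and the claim does not say how the optimized values are chosen, only that the second temperature parameter stays below the first.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class ModalityConfig:
    prompt: str         # system prompt (text) or error correction prompt (voice/image)
    temperature: float  # sampling temperature passed to the language model


def check_temperature_order(configs: Dict[str, ModalityConfig]) -> None:
    """Raise if a voice/image temperature is not below the text temperature."""
    first = configs["text"].temperature
    for modality in ("voice", "image"):
        cfg = configs.get(modality)
        if cfg is not None and cfg.temperature >= first:
            raise ValueError(
                f"{modality} temperature {cfg.temperature} must be below "
                f"text temperature {first}"
            )


def with_temperature(
    configs: Dict[str, ModalityConfig], modality: str, temperature: float
) -> Dict[str, ModalityConfig]:
    """Return a copy with one temperature changed, rejecting invalid orderings."""
    updated = dict(configs)
    updated[modality] = replace(configs[modality], temperature=temperature)
    check_temperature_order(updated)
    return updated
```

Validating on every update means an optimization loop cannot silently push the voice or image temperature above the text temperature.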

7. The method of claim 4, wherein optimizing the retrieval strategy of the business knowledge base based on the multimodal interaction evaluation data comprises: determining retrieval parameters used when searching the business knowledge base, the retrieval parameters including at least one of the following: a retrieval quantity value, a similarity threshold, and re-ranking strategy parameters; and, when keyword retrieval and semantic vector retrieval are both performed on the business knowledge base, determining a fusion weight between the keyword retrieval and the semantic vector retrieval.
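
The retrieval parameters of claim 7 can be illustrated with a simple score fusion. The sketch assumes that keyword retrieval and semantic vector retrieval each return a score in [0, 1] per document, and that fusion is a linear interpolation controlled by `weight`; the function name `fuse` and this particular formula are illustrative assumptions, not taken from this application. `top_k` plays the role of the retrieval quantity value and `threshold` the role of the similarity threshold.

```python
from typing import Dict, List, Tuple


def fuse(
    keyword_scores: Dict[str, float],
    vector_scores: Dict[str, float],
    weight: float = 0.5,
    top_k: int = 3,
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Blend two score tables and return the best documents."""
    doc_ids = set(keyword_scores) | set(vector_scores)
    fused = {
        doc: weight * keyword_scores.get(doc, 0.0)
        + (1.0 - weight) * vector_scores.get(doc, 0.0)
        for doc in doc_ids
    }
    # Sort by descending score, breaking ties by document id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(doc, score) for doc, score in ranked if score >= threshold][:top_k]
```

A document missing from one retriever contributes a score of zero on that side, so raising `weight` toward 1.0 favors exact keyword hits and lowering it favors semantic matches.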

8. A data processing apparatus, characterized in that the apparatus comprises: an acquisition module, used to acquire multimodal information based on client input; a retrieval and generation module, used to retrieve a business knowledge base based on the multimodal information, obtain relevant target business information from the business knowledge base, and determine initial response information of a large language model based on the multimodal information and the target business information; a deviation determination module, used to determine a deviation segment between the initial response information and a standard script in a preset script template library; and a processing module, used to correct the deviation segment in the initial response information to obtain corrected target response information, and to feed back the target response information to the client.

9. An electronic device, characterized in that it comprises: a memory and a processor that are communicatively connected, wherein the memory stores computer instructions, and the processor executes the computer instructions to perform the data processing method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a computer to perform the data processing method according to any one of claims 1 to 7.