Model training method and device, vehicle, and storage medium

By acquiring and analyzing the sentiment index and problem-solving level of the question information, and combining multimodal audio and behavioral feature information, the answer generation model is fine-tuned, which solves the problem of low accuracy in emotion recognition in existing technologies and improves the accuracy of emotion recognition and context awareness.

CN122242588A · Pending · Publication Date: 2026-06-19 · CHINA FAW CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA FAW CO LTD
Filing Date
2026-03-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing response generation models have low accuracy in emotion recognition, ignore speech signals and user interaction behavior, lack end-to-end joint modeling, cannot capture changes in the intensity and mixed states of emotions, and lack the ability to model the process of emotion evolution.

Method used

By acquiring dialogue information from the initial response generation model, the emotional index and problem-solving level of the question information are obtained. Based on the emotional index and problem-solving level, the emotional value score of the response information is determined. In response to scores lower than the preset score, the model parameters are fine-tuned. The model is then corrected by combining multimodal audio and behavioral feature information to optimize the model and improve the accuracy of emotion recognition.

Benefits of technology

It improves the accuracy of the response generation model in emotion recognition, enhances the quantitative representation of users' emotional states and context awareness, and achieves more efficient emotion response and problem solving.

Abstract

This invention discloses a model training method, apparatus, vehicle, and storage medium. The model training method includes: acquiring dialogue information of an initial response generation model, wherein the dialogue information includes: question information received by the initial response generation model and response information output by the initial response generation model; acquiring the sentiment index and problem-solving degree value of the question information; determining the sentiment value score of the response information based on the sentiment index and problem-solving degree value; and fine-tuning the model parameters of the initial response generation model based on a preset loss function in response to a sentiment value score lower than a preset score, thereby obtaining a target response generation model. This invention solves the technical problem of low accuracy in emotion recognition of existing response generation models.

Description

Technical Field

[0001] This invention relates to the field of automation control technology, and more specifically, to a model training method, apparatus, vehicle, and storage medium.

Background Technology

[0002] Against the backdrop of rapid development in artificial intelligence and natural language processing technologies, intelligent dialogue systems based on large language models have been widely applied in scenarios such as customer service and virtual assistants. One of the core capabilities of such systems is understanding and responding to users' emotional states to achieve a more natural, empathetic, and efficient interactive experience. In practical applications, users expect the system to not only accurately identify their explicit emotions (such as anger, joy, and sadness) but also perceive implicit or complex emotions (such as expectation amidst anxiety or dissatisfaction beneath politeness) and dynamically adjust response strategies in different business contexts. Especially in highly sensitive scenarios (such as complaint handling and psychological counseling), higher demands are placed on the fine-grainedness, robustness, and contextual relevance of emotion recognition.

[0003] However, current mainstream response generation models still have significant shortcomings in emotion recognition. First, most models rely solely on text semantics for emotion judgment, ignoring key acoustic cues in the speech signal (such as fundamental frequency variations, speech rate fluctuations, and pause patterns) as well as user interaction behaviors (such as input rhythm, page dwell time, and repetitive operations). This leads to a high misjudgment rate for irony, suppressed emotions, or culturally specific expressions.

[0004] Secondly, existing methods typically treat emotion recognition and response generation as two independent modules, lacking end-to-end joint modeling, making it difficult for generated responses to align with true emotional states. Furthermore, emotion classification often employs predefined discrete labeling systems, failing to capture changes in emotion intensity and mixed states. Additionally, model training data is mostly derived from general corpora, lacking domain-specific emotion annotations, thus limiting generalization ability. Moreover, existing systems generally lack the ability to model the evolution of emotions, making it difficult to combine multi-turn dialogue history for dynamic emotion tracking, further restricting the accuracy and practicality of emotion recognition.

[0005] There is currently no solution to the aforementioned technical problems.

Summary of the Invention

[0006] This invention provides a model training method, apparatus, vehicle, and storage medium to at least address the technical problem of low accuracy in emotion recognition by existing response generation models.

[0007] According to one embodiment of the present invention, a model training method is provided, comprising: acquiring dialogue information of an initial answer generation model, wherein the dialogue information includes: question information received by the initial answer generation model and answer information output by the initial answer generation model; acquiring the sentiment index and problem-solving degree value of the question information; determining the sentiment value score of the answer information based on the sentiment index and problem-solving degree value; and fine-tuning the model parameters of the initial answer generation model based on a preset loss function in response to the sentiment value score being less than a preset score, thereby obtaining a target answer generation model.

[0008] Optionally, the model training method further includes: acquiring multiple fragmented texts and multiple text labels, wherein each fragmented text corresponds to a text label; performing a merging operation on local fragmented texts in the multiple fragmented texts to obtain a first merging result, wherein the local fragmented texts correspond to the same text label; performing risk detection on the first merging result to obtain at least one risky word; and correcting the at least one risky word based on the text processing model to obtain text information.

[0009] Optionally, the model training method further includes: acquiring multiple audio segments and multiple audio timestamps, wherein each audio segment corresponds to an audio timestamp; performing a merging operation on local audio segments among the multiple audio segments to obtain a second merging result, wherein the local audio segments correspond to the same audio timestamp; and extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.

[0010] Optionally, the model training method further includes: obtaining user behavior logs; extracting at least one emotional behavior feature from the user behavior logs based on a multimodal behavior processing model to obtain behavior feature information.

[0011] Optionally, the model training method further includes: determining a first weighted weight value corresponding to the text information; determining a second weighted weight value corresponding to the acoustic feature information; determining a third weighted weight value corresponding to the behavioral feature information; and calculating the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0012] Optionally, the model training method further includes: in response to an emotion value score being less than a preset score, extracting multiple initial prompt words from the answer information; correcting the multiple initial prompt words based on a preset loss function to obtain multiple target prompt words; and fine-tuning the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

[0013] Optionally, the model training method further includes: obtaining target dialogue information of the target answer generation model, wherein the target dialogue information includes: target answer information output by the target answer generation model; obtaining the target sentiment value score of the target answer information; and, in response to the target sentiment value score being less than a preset score, repeatedly performing fine-tuning operations on the model parameters of the target answer generation model until the target sentiment value score is greater than or equal to the preset score.
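
The repeated fine-tuning described in [0013] amounts to a loop that stops once the score reaches the preset score. The sketch below is illustrative only: `score_fn` and `fine_tune_fn` are hypothetical callables standing in for the scoring and fine-tuning procedures, and the `max_rounds` cap is an added safeguard that the source does not mention.

```python
from typing import Callable, Tuple, TypeVar

M = TypeVar("M")


def fine_tune_until_threshold(
    model: M,
    score_fn: Callable[[M], float],   # hypothetical: returns the sentiment value score
    fine_tune_fn: Callable[[M], M],   # hypothetical: returns an updated model
    preset_score: float = 0.4,
    max_rounds: int = 10,             # assumption: guards against a non-converging loop
) -> Tuple[M, float, int]:
    """Fine-tune repeatedly while the score stays below the preset score."""
    rounds = 0
    score = score_fn(model)
    while score < preset_score and rounds < max_rounds:
        model = fine_tune_fn(model)
        score = score_fn(model)
        rounds += 1
    return model, score, rounds


# Toy usage: the "model" is just a number that each round increases by 0.25.
final_model, final_score, used_rounds = fine_tune_until_threshold(
    0.0, score_fn=lambda m: m, fine_tune_fn=lambda m: m + 0.25
)
```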

[0014] According to one embodiment of the present invention, a model training apparatus is also provided, comprising: a first acquisition module, configured to acquire dialogue information of an initial answer generation model, wherein the dialogue information includes: question information received by the initial answer generation model and answer information output by the initial answer generation model; a second acquisition module, configured to acquire the sentiment index and problem-solving degree value of the question information; a determination module, configured to determine the sentiment value score of the answer information based on the sentiment index and problem-solving degree value; and a parameter fine-tuning module, configured to fine-tune the model parameters of the initial answer generation model based on a preset loss function in response to the sentiment value score being less than a preset score, thereby obtaining a target answer generation model.

[0015] Optionally, the model training device further includes: a third acquisition module for acquiring multiple text segments and multiple text labels, wherein each of the multiple text segments corresponds to a text label; a first merging module for performing a merging operation on some text segments in the multiple text segments to obtain a first merging result, wherein the some text segments correspond to the same text label; a risk detection module for performing risk detection on the first merging result to obtain at least one risky word; and a correction module for correcting the at least one risky word based on the text processing model to obtain text information.

[0016] Optionally, the model training device further includes: a fourth acquisition module for acquiring multiple audio segments and multiple audio timestamps, wherein each of the multiple audio segments corresponds to an audio timestamp; a second merging module for performing a merging operation on some audio segments among the multiple audio segments to obtain a second merging result, wherein the some audio segments correspond to the same audio timestamp; and a first extraction module for extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.

[0017] Optionally, the model training device further includes: a fifth acquisition module for acquiring user behavior logs; and a second extraction module for extracting at least one emotional behavior feature from the user behavior logs based on the multimodal behavior processing model to obtain behavior feature information.

[0018] Optionally, the second acquisition module includes: a first determining unit for determining a first weighted weight value corresponding to the text information; a second determining unit for determining a second weighted weight value corresponding to the acoustic feature information; a third determining unit for determining a third weighted weight value corresponding to the behavioral feature information; and a calculation unit for calculating the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0019] Optionally, the parameter fine-tuning module includes: an extraction unit, used to extract multiple initial prompt words from the answer information in response to an emotion value score being less than a preset score; a correction unit, used to correct the multiple initial prompt words based on a preset loss function to obtain multiple target prompt words; and a parameter fine-tuning unit, used to fine-tune the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

[0020] Optionally, the model training device further includes: a fifth acquisition module for acquiring target dialogue information of the target response generation model, wherein the target dialogue information includes: target response information output by the target response generation model; a sixth acquisition module for acquiring the target emotional value score of the target response information; and an execution module for repeatedly performing fine-tuning operations on the model parameters of the target response generation model in response to the target emotional value score being less than a preset score, until the target emotional value score is greater than or equal to the preset score.

[0021] According to one embodiment of the present invention, a vehicle is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the model training method described in any of the above embodiments.

[0022] According to one embodiment of the present invention, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the model training method described in any of the above embodiments.

[0023] According to one embodiment of the present invention, a non-volatile storage medium is also provided, wherein a computer program is stored in the non-volatile storage medium, and the computer program is configured to execute the model training method described in any of the above-mentioned embodiments at runtime.

[0024] According to one embodiment of the present invention, a computer program product is also provided, which stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the model training method described above.

[0025] In this embodiment of the invention, the dialogue information of the initial answer generation model is acquired, wherein the dialogue information includes the question information received by the initial answer generation model and the answer information output by the initial answer generation model; the sentiment index and problem-solving degree value of the question information are acquired; the sentiment value score of the answer information is determined based on the sentiment index and the problem-solving degree value; and, in response to the sentiment value score being less than the preset score, the model parameters of the initial answer generation model are fine-tuned based on a preset loss function to obtain the target answer generation model. This can solve the technical problem of low accuracy in sentiment recognition of the answer generation model in the prior art.

Attached Figure Description

[0026] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0027] Figure 1 is a flowchart of a model training method according to one embodiment of the present invention;

[0028] Figure 2 is a flowchart of a method for obtaining text information according to one embodiment of the present invention;

[0029] Figure 3 is a flowchart of a method for acquiring acoustic feature information according to one embodiment of the present invention;

[0030] Figure 4 is a structural block diagram of an answer generation model according to one embodiment of the present invention;

[0031] Figure 5 is a structural block diagram of a model training device according to one embodiment of the present invention.

Detailed Implementation

[0032] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0033] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate so that embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0034] According to an embodiment of the present invention, an embodiment of a model training method is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system containing at least one set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0035] This method embodiment can also be executed in an electronic device, similar control device, or vehicle-mounted terminal that includes a memory and a processor. Taking a vehicle-mounted terminal as an example, the vehicle-mounted terminal may include one or more processors and a memory for storing data. Optionally, the vehicle-mounted terminal may also include a communication device for communication functions and a display device. Those skilled in the art will understand that the above structural description is merely illustrative and does not limit the structure of the vehicle-mounted terminal. For example, the vehicle-mounted terminal may include more or fewer components than those described above, or have a different configuration than those described above.

[0036] A processor may include one or more processing units. For example, a processor may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processing (DSP) chip, a microprocessor, a field-programmable gate array (FPGA), a neural network processing unit (NPU), a tensor processing unit (TPU), or an artificial intelligence (AI) type processor. Different processing units may be independent components or integrated into one or more processors. In some instances, electronic devices may also include one or more processors.

[0037] The memory can be used to store computer programs, such as the computer program corresponding to the model training method in this embodiment of the invention. The processor implements the model training method described above by running the computer program stored in the memory. The memory may include high-speed random access memory and non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0038] The communication device is used to receive or transmit data via a network. Specific examples of the aforementioned network may include a wireless network provided by the mobile terminal's communication provider. In one example, the communication device includes a network interface controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the communication device may be a radio frequency (RF) module used for wireless communication with the Internet. In some embodiments of this solution, the communication device is used to connect to mobile devices such as mobile phones and tablets, enabling the mobile device to send commands to the vehicle-mounted terminal.

[0039] The display device can be a touchscreen liquid crystal display (LCD) or a touch display (also referred to as a "touchscreen" or "touch display screen"). This LCD allows the user to interact with the user interface of the in-vehicle terminal. In some embodiments, the in-vehicle terminal has a graphical user interface (GUI), allowing the user to interact with the GUI through finger contact and / or gestures on a touch-sensitive surface. The human-machine interaction function may include a vehicle gear shifting function, and executable instructions for performing these functions are configured / stored in one or more processor-executable computer program products or readable storage media.

[0040] Figure 1 is a flowchart of a model training method according to one embodiment of the present invention. As shown in Figure 1, the method includes the following steps:

[0041] Step S101: Obtain the dialogue information of the initial answer generation model, wherein the dialogue information includes: the question information received by the initial answer generation model and the answer information output by the initial answer generation model.

[0042] Optionally, the execution subject in this embodiment is a model training system. It should be noted that other electronic devices and processors can also be used as the execution subject, and no further limitations are made here.

[0043] In the technical solution provided by step S101 of the present invention, in the intelligent customer service system, the user initiates a question through a webpage or application software, the system transmits the question information to the initial answer generation model, and after the model returns the answer information, the backend service synchronously records the complete log of the interaction.

[0044] The specific technical steps can be represented as follows: First, obtain the question information, that is, extract the user's original input text from the input payload of the request interface; at the same time, obtain the answer information, that is, capture the generated text content from the response body of the model inference service, and store it in association with the corresponding question information according to the session ID.

[0045] The aforementioned initial response generation model refers to the original large language model that has not yet undergone emotion recognition optimization or fine-tuning, used to receive user input and generate preliminary responses.

[0046] The above-mentioned question information refers to the natural language input submitted by the user to the model, which can be in the form of text, speech-to-text (ASR results), etc.

[0047] The above-mentioned answer information refers to the text reply automatically generated by the initial answer generation model based on the question information.

[0048] As another alternative implementation, a proxy middleware is introduced into the model deployment architecture to automatically intercept data before / after the request flows through the model: 1) Obtaining question information: Before the request reaches the model, the middleware copies the user input in the request body; 2) Obtaining answer information: Before the model generates the result and returns it to the client, the middleware intercepts the response content, extracts the answer text, and matches it with the cached question information.

[0049] It is worth noting that step S101 can completely and accurately capture the input-output pairs of the initial answer generation model in real-world interaction scenarios, providing a raw corpus foundation for subsequent sentiment analysis. Furthermore, by structurally recording the correspondence between questions and answers, it ensures that the sentiment recognition task has clear contextual basis, avoiding misjudgments of sentiment due to missing information, thereby improving the data integrity and reliability of sentiment labeling and evaluation.

[0050] Step S102: Obtain the sentiment index and problem-solving level of the question information.

[0051] In the technical solution provided by step S102 of the present invention, in the intelligent customer service scenario, after a user submits a question, the system inputs the question information into a pre-trained emotion classification / regression model (such as a sentiment analysis model), which outputs the corresponding emotion score. The same question information is then input into a problem-solving status recognition model (such as a rule-based template or fine-tuned text classifier) to identify keywords (such as "solved," "still doesn't work," "thank you") or semantic intent, and output a resolution score.
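
The rule-based template option for the resolution score can be illustrated with a small keyword matcher. This is a sketch under stated assumptions: the cue lists extend the three keywords quoted above with hypothetical extras, and the three output values (0.0, 0.5, 1.0) are an arbitrary choice rather than values given by the source.

```python
# Cue lists are illustrative; only "solved", "still doesn't work", and
# "thank you" come from the text, the rest are assumptions.
UNRESOLVED_CUES = ("still doesn't work", "not working", "unresolved")
RESOLVED_CUES = ("solved", "thank you", "works now")


def resolution_score(question: str) -> float:
    """Return a coarse problem-resolution score in [0, 1] from keyword cues."""
    text = question.lower()
    # Check unresolved cues first so that "unresolved" is not matched by "solved".
    if any(cue in text for cue in UNRESOLVED_CUES):
        return 0.0
    if any(cue in text for cue in RESOLVED_CUES):
        return 1.0
    # No signal either way: treat as partially resolved / unknown.
    return 0.5
```

A fine-tuned text classifier, the other option named in the paragraph, would replace the keyword lookup with a learned model while keeping the same input and output.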

[0052] The aforementioned sentiment index is a numerical value used to quantify the intensity or tendency of emotions expressed by users in their questions (e.g., 0~1 represents negative to positive, or -1~1 represents negative to positive), and is usually based on the output of a text sentiment analysis model.

[0053] The above-mentioned problem resolution level value refers to an indicator that reflects whether the user's current question indicates that the problem has been resolved. It can be a continuous value (such as 0~1) or a category label (such as "unresolved", "partially resolved", "resolved"). It is usually judged based on whether the question contains signals such as confirmation, thanks, or follow-up questions.

[0054] As an alternative implementation, in voice customer service or app interaction scenarios, the system can collect user behavior data in addition to text. Specifically, the system can integrate ASR-transcribed question text with acoustic features (such as speech rate, fundamental frequency, and energy) into a multimodal emotion recognition model to generate a more robust emotion index. Furthermore, by combining the question text with subsequent user behavior (such as whether to immediately end the conversation, whether to repeat the question, or whether to click the "problem solved" button), the degree of problem resolution can be inferred through a time-series model or rule engine.
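
One simple way to combine the text, acoustic, and behavioral signals is the weighted combination outlined in [0011]. The sketch below assumes that each modality has already been reduced to a score in [-1, 1] by its own model; the function name, the default weights, and the normalization by the weight sum are assumptions for illustration, not values from the source.

```python
from typing import Tuple


def sentiment_index(
    text_score: float,
    acoustic_score: float,
    behavior_score: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
) -> float:
    """Fuse three per-modality scores in [-1, 1] into one sentiment index."""
    w_text, w_acoustic, w_behavior = weights
    total = w_text + w_acoustic + w_behavior
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = (
        w_text * text_score
        + w_acoustic * acoustic_score
        + w_behavior * behavior_score
    ) / total
    # Clamp to the [-1, 1] range used for the sentiment index.
    return max(-1.0, min(1.0, fused))


index = sentiment_index(text_score=-0.8, acoustic_score=-0.4, behavior_score=-0.2)
```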

[0055] It is worth noting that step S102 can simultaneously generate a structured emotion index and a problem-solving progress value from the user's question information, thereby achieving a quantitative representation of the user's emotional state and problem-solving progress. The two indicators obtained in step S102 also provide calculable and comparable objective evidence for subsequent analysis, which helps to accurately depict the emotion-solving relationship in user interaction and improve the data granularity and context awareness of the emotion recognition task.

[0056] Step S103: Determine the emotional value score of the response information based on the emotional index and the degree of problem-solving.

[0057] In the technical solution provided in step S103 of the present invention, in the intelligent customer service system, the system sets weights based on business experience (such as an emotion weight of 0.6 and a problem-solving degree weight of 0.4), and linearly combines the two. For example, if the emotion index is -0.7 (negative) and the problem-solving degree value is 0.8 (highly resolved), then the initial fusion value = -(-0.7) × 0.6 + 0.8 × 0.4 = 0.74 (a negative emotion that is being resolved counts as positive value). The fusion value is then mapped to the [0,1] interval as the emotion value score of the answer, which is used for sorting or filtering high-quality responses.
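
The worked example in [0057] can be reproduced in a few lines. The linear combination follows the text, including the sign flip on the emotion index; the source shows this only for a negative emotion index and does not say how positive values are treated. The min-max mapping to [0,1] is an assumption, since the source does not say which mapping is used, and the function name is hypothetical.

```python
from typing import Tuple


def emotion_value_score(
    emotion_index: float,        # in [-1, 1], negative to positive
    resolution: float,           # in [0, 1]
    w_emotion: float = 0.6,
    w_resolution: float = 0.4,
) -> Tuple[float, float]:
    """Return (fusion value, score mapped to [0, 1])."""
    # Linear combination with the sign flip shown in the source example.
    fused = -emotion_index * w_emotion + resolution * w_resolution
    # Assumed min-max mapping over the attainable range of the fusion value.
    lowest = -1.0 * w_emotion                  # emotion_index = 1, resolution = 0
    highest = 1.0 * w_emotion + w_resolution   # emotion_index = -1, resolution = 1
    return fused, (fused - lowest) / (highest - lowest)


fused, score = emotion_value_score(emotion_index=-0.7, resolution=0.8)
# fused is approximately 0.74; under the assumed mapping, score is approximately 0.8375
```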

[0058] The aforementioned emotional value score is a comprehensive score used to measure whether the model-generated response effectively alleviates negative emotions, maintains positive emotions, or promotes problem-solving at the emotional level. It is usually a normalized continuous value (e.g., 0~1).

[0059] As an alternative implementation, in the model optimization platform, the system uses both as feature inputs to pre-train a rating prediction model. This model learns the non-linear relationship between emotion improvement and problem-solving. Furthermore, the model output is the emotion value score, which can capture complex patterns such as "high negative emotion + high degree of problem-solving" leading to high emotion value.

[0060] It is worth noting that step S103 can transform the user's emotional state and problem-solving progress into a unified quantitative indicator of the emotional utility of the response information, thereby achieving an objective assessment of the actual value of the response in terms of emotional dimension. The emotional value score generated in step S103 provides a comparable and ranking basis for subsequent screening, optimization, or feedback, enhancing the accuracy and interpretability of emotion-related performance metrics.

[0061] Step S104: In response to the emotional value score being less than the preset score, fine-tune the model parameters of the initial answer generation model based on the preset loss function to obtain the target answer generation model.

[0062] In the technical solution provided in step S104 of the present invention, in the intelligent customer service system, when the emotional value score of a certain answer is lower than the preset score (e.g., 0.4), the system automatically selects the low-scoring question-answer pairs as samples to be optimized and applies a reward-weighted policy gradient loss, in which the emotional value score serves as the reward signal during backpropagation. Starting from the initial model, this loss function is then used to train a subset of the parameters to produce the target model.
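
A reward-weighted policy gradient loss of the kind named in [0062] can be written, for a single sampled answer, as the negative advantage times the summed token log-probabilities. The sketch below is a REINFORCE-style formulation chosen for illustration; the source does not specify the exact form, and using the preset score as the baseline is an assumption. In practice the log-probabilities would come from the model under an automatic differentiation framework; plain floats are used here to show the arithmetic.

```python
import math
from typing import Sequence


def reward_weighted_loss(
    token_log_probs: Sequence[float],  # log p(token) for each token of the answer
    reward: float,                     # emotional value score of the answer
    baseline: float = 0.4,             # assumption: the preset score as baseline
) -> float:
    """REINFORCE-style loss: -(reward - baseline) * sum of token log-probs."""
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)


log_probs = [math.log(0.5), math.log(0.5)]
low_loss = reward_weighted_loss(log_probs, reward=0.2)   # answer scored below baseline
high_loss = reward_weighted_loss(log_probs, reward=0.9)  # answer scored above baseline
```

With a negative advantage, minimizing the loss lowers the log-probability of the low-scoring answer; with a positive advantage, it raises the log-probability of the high-scoring one.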

[0063] The aforementioned preset score is a manually set threshold (e.g., 0.4) used to determine whether the current answer is insufficient in the emotional dimension. If the emotional value score is lower than this value, the optimization mechanism is triggered.
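The threshold check that triggers optimization can be illustrated with a minimal Python sketch. It assumes each question-answer pair already carries a normalized emotional value score; the `QAPair` class and the function name are hypothetical and are not part of the described method.

```python
from dataclasses import dataclass

PRESET_SCORE = 0.4  # example threshold taken from the text


@dataclass
class QAPair:
    question: str
    answer: str
    emotion_value_score: float  # assumed to be normalized to [0, 1]


def select_samples_to_optimize(pairs, preset_score=PRESET_SCORE):
    """Return the pairs whose score is strictly below the preset score."""
    return [p for p in pairs if p.emotion_value_score < preset_score]


pairs = [
    QAPair("Why was my refund rejected?", "It cannot be processed.", 0.25),
    QAPair("How do I pair my phone?", "Happy to help. Open Settings first.", 0.82),
    QAPair("The navigation froze again.", "Please restart the system.", 0.40),
]
low_scoring = select_samples_to_optimize(pairs)
```

Because the condition is "less than the preset score", a pair scoring exactly 0.4 is not selected in this sketch.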

[0064] The aforementioned preset loss function refers to the optimization objective function used to guide the model to learn high emotional value answers, which may integrate standard language modeling loss and emotion-related supervision signals (such as negative correlation terms of emotional value scores).

[0065] The initial response generation model described above is a basic large language model that has not been optimized for emotion guidance.

[0066] The aforementioned target response generation model refers to an optimized model that has been fine-tuned and improved in terms of emotional response capabilities.

[0067] As an alternative implementation, in the model iteration platform, samples with low sentiment value scores are collected periodically, and historical dialogues are evaluated in batches, with samples whose sentiment value scores are below a threshold being marked. Furthermore, based on the standard cross-entropy loss, lower-scoring samples are assigned higher weights to form a weighted loss function. These samples are then used to form a fine-tuning dataset to fine-tune the initial response generation model with all or some parameters, outputting the target response generation model.
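One possible reading of the weighted loss described above is sketched below in Python. The text only states that lower-scoring samples receive higher weights; the specific rule `2.0 - score` and both function names are assumptions made for illustration.

```python
import math


def sample_weight(score):
    """Hypothetical rule: a lower emotional value score gives a higher weight.

    For scores in [0, 1] the weight lies in [1, 2].
    """
    return 2.0 - score


def weighted_cross_entropy(token_probs_per_sample, scores):
    """Weighted mean of per-sample cross-entropy.

    token_probs_per_sample[i] holds the probability the model assigns to each
    reference token of sample i; scores[i] is that sample's emotional value score.
    """
    total, weight_sum = 0.0, 0.0
    for probs, score in zip(token_probs_per_sample, scores):
        ce = -sum(math.log(p) for p in probs) / len(probs)  # mean token-level CE
        w = sample_weight(score)
        total += w * ce
        weight_sum += w
    return total / weight_sum


loss = weighted_cross_entropy([[0.5, 0.5], [0.25, 0.25]], scores=[0.1, 0.3])
```

In a real fine-tuning run the same weighting would be applied to the per-sample loss produced by the training framework; the pure-Python version only makes the arithmetic explicit.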

[0068] It is worth noting that step S104 can automatically identify poorly performing answers and drive model parameter updates based on the feedback signal of emotional value scores. This allows the model to retain its original language capabilities while specifically improving the response quality in low emotional value scenarios. By introducing an optimization objective directly related to emotional utility, emotion-oriented regulation of answer generation behavior is achieved, enhancing the model's ability to generate high emotional value responses in real interactions.

[0069] Steps S101 to S104 above show that, in this invention, by acquiring the dialogue information of the initial answer generation model, which includes the question information received by the initial answer generation model and the answer information output by the initial answer generation model, and by acquiring the emotion index and problem-solving degree value of the question information, the technical effect of determining the emotion value score of the answer information based on the emotion index and problem-solving degree value is achieved. This realizes the technical objective of fine-tuning the model parameters of the initial answer generation model based on the preset loss function in response to the emotion value score being less than the preset score, thereby obtaining the target answer generation model. This can solve the technical problem of low accuracy in emotion recognition of the answer generation model in the prior art.

[0070] The method described in this embodiment will now be described in further detail.

[0071] Step S201: Obtain multiple text segments and multiple text tags, wherein each text segment corresponds to a text tag;

[0072] Step S202: Perform a merging operation on partial text fragments in multiple text fragments to obtain a first merging result, wherein the partial text fragments correspond to the same text tag;

[0073] Step S203: Perform risk detection on the first merged result to obtain at least one risky word;

[0074] Step S204: Based on the text processing model, at least one risk word is corrected to obtain text information.

[0075] In this embodiment, as shown in Figure 2, in the intelligent customer service training data preparation stage, the system first extracts user messages from historical work orders, treating each sentence as a text segment, and then labels each segment with an intent (such as "complaint"), either manually or through a classification model. Finally, all sentences labeled "complaint" in the same work order are concatenated in chronological order to form the first merged result.

[0076] Furthermore, keyword matching and a deep-learning sensitive-word recognition model are used to scan the merged text and output a list of risky words (such as "cannot be reimbursed", "lawsuit", "fraud", etc.). The text rewriting model is then called to replace "fraud" with a neutral expression such as "service did not meet expectations", generating compliant question text information.
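Steps S201 to S204 can be illustrated with the following minimal Python sketch. A fixed lookup table stands in for both the sensitive-word recognition model and the text rewriting model, so `RISK_REWRITES` and all function names are hypothetical; only the example words are taken from the text.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: risky word -> neutral rewrite (examples from the text).
RISK_REWRITES = {
    "fraud": "service did not meet expectations",
    "junk product": "product with poor user experience",
}


def merge_by_label(fragments):
    """Concatenate fragments that share a label, preserving chronological order."""
    merged = defaultdict(list)
    for text, label in fragments:
        merged[label].append(text)
    return {label: " ".join(texts) for label, texts in merged.items()}


def detect_risk_words(text):
    """Return the lexicon entries that occur in the text (case-insensitive)."""
    return [w for w in RISK_REWRITES
            if re.search(re.escape(w), text, flags=re.IGNORECASE)]


def correct_risk_words(text, risk_words):
    """Replace each detected risky word with its neutral rewrite."""
    for word in risk_words:
        text = re.sub(re.escape(word), RISK_REWRITES[word], text,
                      flags=re.IGNORECASE)
    return text


fragments = [
    ("The charger stopped working after a week.", "complaint"),
    ("What are your opening hours?", "inquiry"),
    ("This is a junk product.", "complaint"),
]
first_merged_result = merge_by_label(fragments)["complaint"]
risk_words = detect_risk_words(first_merged_result)
text_information = correct_risk_words(first_merged_result, risk_words)
```

Plain substring matching is used here for brevity; a model-based detector as described in the text would be context-aware and would not be limited to a fixed word list.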

[0077] The aforementioned fragmented text refers to sub-fragments of the original long text that have been divided according to semantics or length, such as a sentence or a dialogue in a user feedback log.

[0078] The above text tags are category identifiers assigned to each text segment, such as "complaint", "inquiry", "praise", or emotion category ("anger", "satisfaction").

[0079] The aforementioned local fragment text refers to a group of adjacent or logically related fragments with the same text label within the overall fragment set.

[0080] The first merging result mentioned above is a coherent text formed by sequentially splicing or semantically fusing multiple fragmented texts under the same tag.

[0081] The aforementioned risk words refer to sensitive words or inappropriate expressions in the text that may cause compliance issues, emotional escalation, or business risks, such as abusive language, privacy information, and suggestive language.

[0082] The aforementioned text processing models refer to natural language processing models used to identify and replace / rewrite risky words, such as rule-based filters, sequence-to-sequence rewriting models, or masked language models.

[0083] As an alternative implementation, when constructing training samples in the dialogue system, multi-turn dialogues are segmented by utterance. Each user message is labeled by the intent recognition module (e.g., "problem unresolved"). Then, multiple consecutive user messages labeled "problem unresolved" are merged into a coherent statement. Furthermore, risky words are identified through a deployed sensitive content detection API (e.g., integrated with Alibaba Cloud Content Security). A lightweight BERT-based mask-filling model is then used to perform context-aware replacement of the risky words (e.g., changing "junk product" to "product with poor user experience"), outputting the purified text information.

[0084] It is worth noting that steps S201-S204 can automatically construct high-quality question text information that is structurally clear, semantically coherent, and free of risky expressions from labeled, fragmented raw text. By merging fragments with the same label to improve contextual integrity, and then eliminating sensitive or inappropriate content through risk detection and correction mechanisms, the compliance, security, and semantic consistency of subsequent model training data are effectively guaranteed.

[0085] Step S301: Obtain multiple audio segments and multiple audio timestamps, wherein each audio segment corresponds to an audio timestamp;

[0086] Step S302: Perform a merging operation on partial audio segments from multiple audio segments to obtain a second merging result, wherein the partial audio segments correspond to the same audio timestamp;

[0087] Step S303: Extract at least one emotion keyword from the second merging result based on the multimodal audio processing model to obtain acoustic feature information.

[0088] In this embodiment, as shown in Figure 3, the user audio in the car can be divided into multiple audio segments in real time. Each segment is accompanied by a timestamp accurate to milliseconds and may contain dual-channel data of the calling and called parties. The dual-channel audio under the same timestamp is aligned and mixed (or a single channel with a higher signal-to-noise ratio is selected) to generate a clear second merged result.

[0089] Furthermore, the second merged result is input into a multimodal audio processing model, which first transcribes the text and then identifies emotional keywords (such as "waited so long" and "very dissatisfied") as acoustic feature information.
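The timestamp-based merging in step S302 can be sketched in Python as follows. The sketch implements only the variant in which the channel with the higher signal-to-noise ratio is kept for each timestamp. The `AudioSegment` class, its fields, and the function names are hypothetical, the audio payload is omitted, and timestamps are assumed to use a three-digit millisecond field.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AudioSegment:
    timestamp_ms: int  # start time within the session, in milliseconds
    channel: str       # e.g. "caller" or "callee"
    snr_db: float      # estimated signal-to-noise ratio


def parse_timestamp(ts):
    """Convert an "HH:MM:SS.mmm" string to milliseconds."""
    hms, millis = ts.split(".")
    h, m, s = (int(x) for x in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + int(millis)


def merge_segments(segments):
    """For every timestamp keep the highest-SNR segment, ordered by time."""
    groups = defaultdict(list)
    for seg in segments:
        groups[seg.timestamp_ms].append(seg)
    return [max(groups[t], key=lambda s: s.snr_db) for t in sorted(groups)]


segments = [
    AudioSegment(parse_timestamp("00:01:23.450"), "caller", 18.0),
    AudioSegment(parse_timestamp("00:01:23.450"), "callee", 9.5),
    AudioSegment(parse_timestamp("00:01:21.000"), "caller", 15.0),
]
second_merged_result = merge_segments(segments)
```

Mixing the aligned channels instead of selecting one would replace the `max(...)` call with a sample-wise combination of the grouped segments.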

[0090] The aforementioned segmented audio refers to audio segments (such as 2-5 second WAV files) after the original speech stream has been divided according to time windows or semantic boundaries.

[0091] The audio timestamps mentioned above are time stamps used to identify the start / end times of segmented audio in the original session (e.g., "00:01:23.450"), and are used to align multiple audio streams or associated text.

[0092] The aforementioned partial audio segments refer to a group of audio fragments that share the same audio timestamp (usually from multi-channel recordings or retransmitted segments) across multiple audio segments.

[0093] The second merging result mentioned above is a complete audio segment formed by aligning and merging multiple audio segments under the same timestamp (such as multi-channel mixing or deduplication splicing).

[0094] The aforementioned multimodal audio processing model refers to a deep learning model that can simultaneously process audio signals and their potential text transcription (ASR output), used to jointly model acoustic and semantic information to identify emotions.

[0095] The aforementioned emotional keywords can be keywords or phrases that reflect the speaker's emotional state (such as "I'm so angry" or "I'm really anxious"). They can be directly identified by the model from the audio or indirectly obtained through ASR+text analysis.

[0096] In this application, the aforementioned acoustic feature information specifically refers to semantic-level features (i.e., emotion keywords) extracted from speech that are related to emotion, rather than traditional low-order acoustic parameters.

[0097] As an alternative implementation, the user's voice is segmented into multiple audio segments by network packets on the client side, with each segment carrying a locally generated timestamp. The server can identify duplicate or complementary segments based on the same timestamp, reconstruct continuous speech (the second merging result) by splicing or discarding redundant segments through overlapping regions, and then use an end-to-end multimodal model to directly predict emotional keywords from the audio (without explicit ASR), outputting acoustic feature information in the form of keywords such as "anxious" and "disappointed".

[0098] It is worth noting that steps S301-S303 enable the reconstruction of high-quality continuous audio segments from raw speech data that may contain fragments, repetitions, or multiple channels, and automatically extract acoustic feature information in the form of keywords directly related to user emotions. Furthermore, through timestamp alignment and multimodal modeling, the integrity and recognition accuracy of emotion-related speech content are effectively improved, providing structured and semantically clear acoustic input for subsequent emotion analysis.

[0099] Step S401: Obtain user behavior logs;

[0100] Step S402: Extract at least one emotional behavior feature from the user behavior log based on the multimodal behavior processing model to obtain behavior feature information.

[0101] In this embodiment, the system first needs to obtain user behavior logs, that is, the sequence of user operations before submitting a question, such as "delete and rewrite 3 times in the input box", "stay on the 'FAQ' page for 45 seconds and then close it", and "click the 'transfer to human agent' button twice in a row".

[0102] Furthermore, the above logs are input into a pre-trained multimodal behavior processing model (such as a Transformer-based temporal behavior encoder), and the model outputs emotional and behavioral features, such as "high anxiety index" and "low satisfaction tendency", forming behavioral feature information.

[0103] The aforementioned user behavior logs are structured data that record user actions in digital systems (such as apps, web pages, and customer service platforms), including but not limited to clicks, swipes, input, page dwell time, repeated questions, and exit timing.

[0104] The aforementioned multimodal behavior processing model is a machine learning model that integrates multiple behavioral signals (such as sequential operations, interaction rhythm, interface path, etc.) and combines them with context (such as current dialogue content or business stage) for joint modeling, and is used to identify implicit emotional states.

[0105] The aforementioned emotional and behavioral characteristics refer to quantitative or categorical indicators that are highly correlated with user emotions and inferred from behavioral logs, such as "rapid and continuous clicking (representing anxiety)," "suddenly sending a message after a long period of no input (representing hesitation or repression)," and "returning to the previous page multiple times (representing confusion or dissatisfaction)."

[0106] The aforementioned behavioral feature information is a structured representation composed of one or more emotional and behavioral features, which serves as a component of the question information and is used to assist in emotion recognition.

[0107] As an alternative implementation, the system can also export a large stream of behavioral events during user sessions from the background log system, including screen touch coordinates, input delays, message recalls, and the number of voice re-recordings. Furthermore, the system uses a graph neural network to model the behavioral sequences, identify typical emotional patterns (e.g., "recall + re-record > 2 times → hesitation / anxiety"), and output a structured set of emotional and behavioral features as behavioral characteristic information.
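The example pattern "recall + re-record > 2 times → hesitation / anxiety" can be expressed as a simple rule, sketched below in Python. This rule-based counter is a stand-in for the graph neural network named in the text, not an implementation of it, and the event-type strings are hypothetical.

```python
from collections import Counter


def extract_emotional_behavior_features(events):
    """Map a session's event stream to a list of emotional behavior features.

    events: list of event-type strings, e.g. "message_recall", "voice_rerecord".
    """
    counts = Counter(events)
    features = []
    # Pattern from the text: recall + re-record more than twice.
    if counts["message_recall"] + counts["voice_rerecord"] > 2:
        features.append("hesitation/anxiety")
    return features


session = ["message_recall", "voice_rerecord", "input", "message_recall"]
behavior_feature_information = extract_emotional_behavior_features(session)
```

A learned model would replace the hand-written condition, but its output could take the same structured form.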

[0108] It is worth noting that steps S401-S402 automatically extract behavioral patterns related to emotional states from the original user behavior logs and transform them into structured behavioral feature information. Furthermore, the emotional behavioral features extracted in the above steps provide non-verbal, context-aware supplementary signals for emotion recognition, enhancing the ability to perceive emotional states that are difficult to capture directly through text or speech, such as silence, repression, or irony, thereby improving the dimensionality and robustness of emotion representation.

[0109] Step S501: Determine the first weighted weight value corresponding to the text information;

[0110] Step S502: Determine the second weighted weight value corresponding to the acoustic feature information;

[0111] Step S503: Determine the third weighted weight value corresponding to the behavioral feature information;

[0112] Step S504: Calculate the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0113] In this embodiment, the system can allocate weights based on context-adaptive dynamic weights, that is, automatically adjust the weights of each modality according to the current interaction stage.

[0114] First, determine the first weighting value. If the user is in a text input interface (no acoustic signal), the text weight is set to 0.8 and the acoustic weight is 0.

[0115] Next, the second weighting value is determined. If it is a real-time voice call with a high signal-to-noise ratio, the acoustic weight is increased to 0.6.

[0116] Next, determine the third weighting value. If an abnormal operation is detected (such as 3 retractions within 5 seconds), the behavior weight is set to 0.3.

[0117] Finally, sub-emotion scores are obtained separately for each modality (e.g., text score -0.6, acoustic score -0.8, behavioral score -0.7), and then a weighted average is calculated:

[0118] Emotion Index = 0.8×(-0.6) + 0.6×(-0.8) + 0.3×(-0.7) (The final value is output after normalization).
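The weighted fusion above can be worked through in a short Python sketch. The text does not specify the normalization, so dividing by the sum of the weights is an assumption, and the function name is hypothetical.

```python
def emotion_index(scores, weights):
    """Weighted average of per-modality scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one modality must have a non-zero weight")
    return sum(weights[m] * scores[m] for m in weights) / total_weight


scores = {"text": -0.6, "acoustic": -0.8, "behavior": -0.7}
weights = {"text": 0.8, "acoustic": 0.6, "behavior": 0.3}
index = emotion_index(scores, weights)  # (-0.48 - 0.48 - 0.21) / 1.7
```

With the example values the unnormalized sum is -1.17, and dividing by the total weight 1.7 gives approximately -0.69, which stays within the [-1, +1] range used for the emotion index.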

[0119] The above text information is the natural language content of the user's question, which is used for sentiment analysis after preprocessing.

[0120] The aforementioned acoustic feature information refers to emotional keywords or intonation-related semantic features extracted from speech (such as "rapid tone" or "faster speech speed").

[0121] The aforementioned behavioral characteristics refer to emotion-related behavioral patterns extracted from user interaction logs (such as "repeatedly deleting input" and "rapid continuous clicking").

[0122] The aforementioned weighted values are used to measure the relative importance of a certain modality (text, acoustic, behavior) to the emotion judgment in the current context. They are usually normalized values (such as between 0 and 1), and the sum of the three can be 1 or dynamically adjusted according to the strategy.

[0123] The aforementioned sentiment index refers to a single numerical value output after integrating multimodal signals, used to quantify the overall emotional tendency of users when asking questions (e.g., -1 indicates strong negativity, +1 indicates positivity).

[0124] As an alternative implementation, during the model training data construction phase, the optimal static weights are learned using historical labeled data. Feature importance analysis reveals that text contributes the most to emotion judgment, which is set to 0.5, followed by acoustic features at 0.3, and behavioral features as an auxiliary factor at 0.2. For each sample, the emotion scores output by the three modalities (generated by independent sub-models) are linearly fused according to the above fixed weights to obtain a unified emotion index, which is used for subsequent labeling or evaluation.

[0125] It is worth noting that steps S501-S504, based on three heterogeneous modalities of information—text, acoustics, and behavior—achieve a fusion and quantification of the emotional state of user queries by introducing a differentiated weighting mechanism. The sentiment index generated by these steps comprehensively reflects the credibility and contextual relevance of multi-source signals, effectively improving the comprehensiveness and robustness of sentiment judgment, and maintaining stable output even in scenarios with missing single modalities or noise interference.

[0126] Step S601: In response to the emotional value score being less than the preset score, extract multiple initial prompt words from the answer information;

[0127] Step S602: Correct multiple initial prompt words based on a preset loss function to obtain multiple target prompt words;

[0128] Step S603: Fine-tune the model parameters of the initial answer generation model based on multiple target prompt words to obtain the target answer generation model.

[0129] In this embodiment, for responses with low emotional value scores, the integrated gradients method is used to identify the top N words that contribute the most to the emotional output (such as "cannot" and "no way"). These words are then input into a small text rewriting model, which generates alternative words (such as "cannot for the time being" and "we can try") under the guidance of a preset loss function (which includes an emotional value reward and semantic preservation constraints).

[0130] Furthermore, an instruction fine-tuning sample consisting of the original question and the target prompt words is constructed to efficiently fine-tune some parameters of the initial model, thereby obtaining the target answer generation model.
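How a fine-tuning sample might be assembled from corrected prompt words is sketched below in Python. A fixed replacement table stands in for the loss-guided rewriting model, and the `instruction` / `output` record layout is an assumed format rather than one specified in the text.

```python
def build_finetune_sample(question, answer, replacements):
    """Apply initial -> target prompt word replacements and build one sample.

    replacements: dict mapping an initial prompt word to its target prompt word.
    """
    revised = answer
    for initial, target in replacements.items():
        revised = revised.replace(initial, target)
    return {"instruction": question, "output": revised}


# Hypothetical replacement, modeled on the "cannot handle" example in the text.
replacements = {"cannot handle": "are doing our best to coordinate"}
sample = build_finetune_sample(
    "Can you process my refund today?",
    "We cannot handle your refund request.",
    replacements,
)
```

The resulting record pairs the original question with the revised answer, which is the form in which it would be added to the fine-tuning dataset.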

[0131] The aforementioned emotional value score is a quantitative indicator that measures the emotional utility of an answer in alleviating users' negative emotions or promoting problem-solving.

[0132] The preset score mentioned above is a threshold (e.g., 0.4) used to determine whether the response lacks emotional expression. If it is lower than this value, the optimization process will be triggered.

[0133] The initial prompt words mentioned above refer to words or phrases in the model-generated responses that play a key role in emotional expression or interaction style (such as "I'm sorry," "You're right," "Don't worry"), which are usually identified through attention weights, gradient saliency, or rule templates.

[0134] The target cue words mentioned above are optimized alternative words or phrases that are more in line with high emotional value expressions (such as replacing "cannot handle" with "we are doing our best to coordinate for you").

[0135] The aforementioned preset loss function is an objective function designed specifically to guide prompts towards higher emotional value. It may include negative correlation terms for emotional value scores, semantic similarity constraints, and language fluency penalties.

[0136] The initial response generation model mentioned above refers to the basic large language model that has not undergone emotion-oriented optimization.

[0137] The aforementioned target response generation model refers to an optimized model that, after fine-tuning, can generate responses with higher emotional value.

[0138] It is worth noting that steps S601-S603 can locate key emotional expression units (initial prompt words) from answers with low emotional value, and generate more empathetic and reassuring target prompt words through a loss function-driven correction mechanism. This guides the fine-tuning of model parameters, thereby achieving fine-grained intervention in the model's emotional expression style. This allows the target answer generation model to significantly improve its response quality and user-friendliness in the emotional dimension while maintaining semantic accuracy.

[0139] Step S701: Obtain the target dialogue information of the target answer generation model, wherein the target dialogue information includes: the target answer information output by the target answer generation model;

[0140] Step S702: Obtain the target emotional value score of the target response information;

[0141] Step S703: In response to the target emotional value score being less than the preset score, the fine-tuning operation of the model parameters of the target response generation model is repeated until the target emotional value score is greater than or equal to the preset score.

[0142] In this embodiment, the system can input a batch of historically low-scoring questions into the current target answer generation model, collect the generated answers, and then use the constructed sentiment index and problem-solving degree value to recalculate the sentiment value score of each answer according to a unified formula.

[0143] Furthermore, if the overall average score is still lower than the preset score (e.g., 0.3 < 0.4), then this batch of new question-answer pairs will be added to the fine-tuning dataset, and prompt word correction and parameter updates will be performed again to generate a new generation of target models. This process will be repeated until the target is met.
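The closed loop of steps S701 to S703 can be sketched in Python as follows. The model, the scoring function, and the fine-tuning function are passed in as callables because the text does not fix their interfaces; all names are hypothetical, and the `max_rounds` safeguard is an addition that the text does not mention.

```python
def iterate_until_target(model, questions, score_fn, fine_tune_fn,
                         preset_score=0.4, max_rounds=10):
    """Repeat fine-tuning until the mean emotional value score meets the threshold.

    model:        callable mapping a question to an answer
    score_fn:     callable (question, answer) -> emotional value score
    fine_tune_fn: callable (model, question-answer pairs) -> updated model
    Returns the final model, its mean score, and the number of fine-tuning rounds.
    """
    rounds = 0
    while True:
        answers = [model(q) for q in questions]
        scores = [score_fn(q, a) for q, a in zip(questions, answers)]
        mean_score = sum(scores) / len(scores)
        # Stop once the target is met, or when the round budget is exhausted.
        if mean_score >= preset_score or rounds >= max_rounds:
            return model, mean_score, rounds
        model = fine_tune_fn(model, list(zip(questions, answers)))
        rounds += 1
```

The sketch compares the batch average with the preset score, as in the example above; a per-answer check would instead collect only the answers that remain below the threshold.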

[0144] The aforementioned target answer generation model refers to an optimized language model that has undergone at least one round of emotion-oriented fine-tuning, used to generate improved answers.

[0145] The aforementioned target dialogue information refers to the complete interactive output generated by the model under a specific question input, specifically its target answer information (i.e., the model's response text).

[0146] The aforementioned target emotional value score is a quantitative score of emotional utility recalculated based on the same set of evaluation logic (such as combining the emotional index of the question with the degree of problem-solving).

[0147] The preset score mentioned above is a pre-set threshold for emotional value (e.g., 0.4), used to determine whether the current model output meets the requirements for emotional response quality.

[0148] The aforementioned fine-tuning operation refers to updating the model parameters using the methods described above (such as prompt word correction, loss function optimization, etc.).

[0149] It is worth noting that steps S701-S703 achieve closed-loop verification and continuous optimization of the emotional response capability of the target answer generation model. By repeatedly evaluating the emotional value score of its output and automatically triggering fine-tuning when it fails to meet the standard, the model is ensured to gradually approach the preset emotional quality standard during the iteration process. This ensures the stability and measurability of the model output in the emotional dimension and avoids performance bottlenecks caused by insufficient fine-tuning in a single instance.

[0150] Figure 4 is a structural block diagram of an answer generation model according to one embodiment of the present invention. As shown in Figure 4, the answer generation model includes an input module (responsible for receiving user questions or requests), a dialogue information acquisition module (acquiring target dialogue information for a specific question from the target answer generation model, including the generated answer information), an emotional value scoring module (analyzing and calculating the emotional value score of the target answer information, and using predefined methods or algorithms to evaluate the emotional tendency and problem-solving effectiveness of the answer), a score comparison and decision module (comparing the calculated emotional value score with a preset score; if the score is lower than the preset value, a model fine-tuning operation is triggered, and otherwise the final answer is output), a model parameter fine-tuning module (responsible for adjusting the parameters of the target answer generation model when the emotional value score does not meet the standard), and an output module (outputting the verified and compliant answer information to the user).

[0151] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform. Of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes over the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods of the various embodiments of the present invention.

[0152] This embodiment also provides a model training apparatus for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and / or hardware that performs a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.

[0153] Figure 5 is a structural block diagram of a model training device 500 according to one embodiment of the present invention. As shown in Figure 5, the device includes: a first acquisition module 501, a second acquisition module 502, a determination module 503, and a parameter fine-tuning module 504.

[0154] The first acquisition module 501 is used to acquire the dialogue information of the initial answer generation model, wherein the dialogue information includes: the question information received by the initial answer generation model and the answer information output by the initial answer generation model.

[0155] The second acquisition module 502 is used to acquire the sentiment index and problem-solving degree value of the question information.

[0156] The determination module 503 is used to determine the emotional value score of the response information based on the emotional index and the problem-solving degree value.

[0157] The parameter fine-tuning module 504 is used to fine-tune the model parameters of the initial answer generation model based on the preset loss function in response to the sentiment value score being less than the preset score, so as to obtain the target answer generation model.

[0158] Optionally, the model training device 500 further includes: a third acquisition module for acquiring multiple text segments and multiple text labels, wherein each of the multiple text segments corresponds to a text label; a first merging module for performing a merging operation on some of the text segments in the multiple text segments to obtain a first merging result, wherein the some text segments correspond to the same text label; a risk detection module for performing risk detection on the first merging result to obtain at least one risky word; and a correction module for correcting the at least one risky word based on a text processing model to obtain text information.

[0159] Optionally, the model training device 500 further includes: a fourth acquisition module for acquiring multiple audio segments and multiple audio timestamps, wherein each of the multiple audio segments corresponds to an audio timestamp; a second merging module for performing a merging operation on some audio segments among the multiple audio segments to obtain a second merging result, wherein the some audio segments correspond to the same audio timestamp; and a first extraction module for extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.

[0160] Optionally, the model training device 500 further includes: a fifth acquisition module for acquiring user behavior logs; and a second extraction module for extracting at least one emotional behavior feature from the user behavior logs based on the multimodal behavior processing model to obtain behavior feature information.

[0161] Optionally, the second acquisition module 502 includes: a first determining unit, used to determine a first weighted weight value corresponding to the text information; a second determining unit, used to determine a second weighted weight value corresponding to the acoustic feature information; a third determining unit, used to determine a third weighted weight value corresponding to the behavioral feature information; and a calculation unit, used to calculate the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0162] Optionally, the parameter fine-tuning module 504 includes: an extraction unit, used to extract multiple initial prompt words from the answer information in response to an emotion value score being less than a preset score; a correction unit, used to correct the multiple initial prompt words based on a preset loss function to obtain multiple target prompt words; and a parameter fine-tuning unit, used to fine-tune the model parameters of the initial answer generation model based on the multiple target prompt words to obtain a target answer generation model.

[0163] Optionally, the model training device 500 further includes: a fifth acquisition module for acquiring target dialogue information of the target response generation model, wherein the target dialogue information includes: target response information output by the target response generation model; a sixth acquisition module for acquiring the target emotional value score of the target response information; and an execution module for repeatedly performing fine-tuning operations on the model parameters of the target response generation model in response to the target emotional value score being less than a preset score, until the target emotional value score is greater than or equal to the preset score.

[0164] Embodiments of the present invention also provide a vehicle including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the model training method described above.

[0165] Optionally, in this embodiment, the vehicle may be configured to store a computer program for performing the following steps:

[0166] Step S101: Obtain the dialogue information of the initial answer generation model, wherein the dialogue information includes: the question information received by the initial answer generation model and the answer information output by the initial answer generation model;

[0167] Step S102: Obtain the sentiment index and problem-solving level value of the question information;

[0168] Step S103: Determine the emotional value score of the answer information based on the sentiment index and the problem-solving level value;

[0169] Step S104: In response to the emotional value score being less than the preset score, fine-tune the model parameters of the initial answer generation model based on the preset loss function to obtain the target answer generation model.
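The scoring and threshold check in steps S103 and S104 can be illustrated with the sketch below. The scoring formula is not given in this passage, so the convex combination controlled by `alpha` is purely a hypothetical stand-in, as are the function names.

```python
def emotional_value_score(sentiment_index: float,
                          problem_solving_level: float,
                          alpha: float = 0.5) -> float:
    """Hypothetical score: a convex combination of the two inputs.

    The actual formula is not stated here; alpha is an assumed
    mixing coefficient in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sentiment_index + (1.0 - alpha) * problem_solving_level


def needs_fine_tuning(score: float, preset_score: float) -> bool:
    """Step S104 trigger: fine-tune only when the score is below the preset score."""
    return score < preset_score
```

A score exactly equal to the preset score does not trigger fine-tuning, consistent with the "less than" condition in step S104.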

[0170] Optionally, when the processor executes the program, it also performs the following steps: obtaining multiple fragmented texts and multiple text tags, wherein each of the multiple fragmented texts corresponds to a text tag; performing a merging operation on some of the multiple fragmented texts to obtain a first merging result, wherein the merged fragmented texts correspond to the same text tag; performing risk detection on the first merging result to obtain at least one risky word; and correcting the at least one risky word based on a text processing model to obtain text information.
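The text preprocessing steps above can be sketched as follows. This is an illustrative assumption rather than the described implementation: a fixed lexicon stands in for risk detection, and a lookup table stands in for the text processing model.

```python
from typing import Dict, List, Set


def merge_by_tag(fragments: List[str], tags: List[str]) -> Dict[str, str]:
    """Join fragments that share a text tag, keeping first-seen order."""
    if len(fragments) != len(tags):
        raise ValueError("each fragment needs exactly one tag")
    groups: Dict[str, List[str]] = {}
    for fragment, tag in zip(fragments, tags):
        groups.setdefault(tag, []).append(fragment)
    return {tag: " ".join(parts) for tag, parts in groups.items()}


def detect_risky_words(text: str, lexicon: Set[str]) -> List[str]:
    """Hypothetical risk detection: flag words found in a lexicon."""
    return [word for word in text.split() if word.lower() in lexicon]


def correct_risky_words(text: str, corrections: Dict[str, str]) -> str:
    """Stand-in for the text processing model: table-based replacement."""
    return " ".join(corrections.get(word.lower(), word) for word in text.split())
```

In practice the lexicon and the correction table would be replaced by the risk detector and text processing model referred to in the text.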

[0171] Optionally, when the processor executes the program, it also performs the following steps: acquiring multiple audio segments and multiple audio timestamps, wherein each of the multiple audio segments corresponds to an audio timestamp; performing a merging operation on some of the multiple audio segments to obtain a second merging result, wherein the merged audio segments correspond to the same audio timestamp; and extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.
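The audio merging step can be sketched in the same way. Segments are represented here as plain lists of samples, and `keyword_model` is a hypothetical callable standing in for the multimodal audio processing model; neither representation is specified in the text.

```python
from typing import Callable, Dict, List


def merge_segments_by_timestamp(segments: List[List[float]],
                                timestamps: List[float]) -> Dict[float, List[float]]:
    """Concatenate audio segments that share the same timestamp."""
    if len(segments) != len(timestamps):
        raise ValueError("each segment needs exactly one timestamp")
    merged: Dict[float, List[float]] = {}
    for samples, ts in zip(segments, timestamps):
        merged.setdefault(ts, []).extend(samples)
    return merged


def extract_emotion_keywords(merged: Dict[float, List[float]],
                             keyword_model: Callable[[List[float]], List[str]]) -> List[str]:
    """Apply the (assumed) keyword model to each merged segment in time order."""
    keywords: List[str] = []
    for ts in sorted(merged):
        keywords.extend(keyword_model(merged[ts]))
    return keywords
```

Any callable that maps a list of samples to a list of keywords can be passed as `keyword_model`, which keeps the merging logic independent of the model choice.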

[0172] Optionally, the processor may also perform the following steps when executing the program: acquiring user behavior logs; extracting at least one emotional behavior feature from the user behavior logs based on a multimodal behavior processing model to obtain behavior feature information.
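As a rough illustration of behavior feature extraction, the sketch below counts emotionally relevant events in a log. The log schema (dictionaries with an `"event"` key) and the event names `repeat_query` and `cancel` are hypothetical, and simple frequency counting stands in for the multimodal behavior processing model.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def behavior_features(log: List[dict],
                      emotional_events: FrozenSet[str] = frozenset({"repeat_query", "cancel"})
                      ) -> Dict[str, float]:
    """Return the share of log entries for each emotionally relevant event.

    Assumption: frequent repeats or cancellations hint at frustration.
    """
    if not log:
        return {}
    counts = Counter(entry["event"] for entry in log
                     if entry.get("event") in emotional_events)
    return {name: counts[name] / len(log) for name in sorted(counts)}
```

Events that are not in `emotional_events` still count toward the total, so the returned values are shares of all logged interactions.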

[0173] Optionally, when the processor executes the program, it also performs the following steps: determining a first weighted weight value corresponding to the text information; determining a second weighted weight value corresponding to the acoustic feature information; determining a third weighted weight value corresponding to the behavioral feature information; and calculating the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0174] Optionally, the processor may further perform the following steps when executing the program: in response to an emotional value score being less than a preset score, extracting multiple initial prompt words from the answer information; correcting the multiple initial prompt words based on a preset loss function to obtain multiple target prompt words; and fine-tuning the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

[0175] Optionally, when the processor executes the program, it also performs the following steps: obtaining target dialogue information of the target answer generation model, wherein the target dialogue information includes target answer information output by the target answer generation model; obtaining the target emotional value score of the target answer information; and, in response to the target emotional value score being less than the preset score, repeatedly performing fine-tuning operations on the model parameters of the target answer generation model until the target emotional value score is greater than or equal to the preset score.

[0176] Optionally, specific examples in this embodiment can refer to the examples described in the above embodiments and optional implementations, and will not be repeated here.

[0177] Embodiments of the present invention also provide an electronic device, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the model training method described above.

[0178] Optionally, in this embodiment, the electronic device may be configured to store a computer program for performing the following steps:

[0179] Step S101: Obtain the dialogue information of the initial answer generation model, wherein the dialogue information includes: the question information received by the initial answer generation model and the answer information output by the initial answer generation model;

[0180] Step S102: Obtain the sentiment index and problem-solving level value of the question information;

[0181] Step S103: Determine the emotional value score of the answer information based on the sentiment index and the problem-solving level value;

[0182] Step S104: In response to the emotional value score being less than the preset score, fine-tune the model parameters of the initial answer generation model based on the preset loss function to obtain the target answer generation model.

[0183] Optionally, when the processor executes the program, it also performs the following steps: obtaining multiple fragmented texts and multiple text tags, wherein each of the multiple fragmented texts corresponds to a text tag; performing a merging operation on some of the multiple fragmented texts to obtain a first merging result, wherein the merged fragmented texts correspond to the same text tag; performing risk detection on the first merging result to obtain at least one risky word; and correcting the at least one risky word based on a text processing model to obtain text information.

[0184] Optionally, when the processor executes the program, it also performs the following steps: acquiring multiple audio segments and multiple audio timestamps, wherein each of the multiple audio segments corresponds to an audio timestamp; performing a merging operation on some of the multiple audio segments to obtain a second merging result, wherein the merged audio segments correspond to the same audio timestamp; and extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.

[0185] Optionally, the processor may also perform the following steps when executing the program: acquiring user behavior logs; extracting at least one emotional behavior feature from the user behavior logs based on a multimodal behavior processing model to obtain behavior feature information.

[0186] Optionally, when the processor executes the program, it also performs the following steps: determining a first weighted weight value corresponding to the text information; determining a second weighted weight value corresponding to the acoustic feature information; determining a third weighted weight value corresponding to the behavioral feature information; and calculating the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0187] Optionally, the processor may further perform the following steps when executing the program: in response to an emotional value score being less than a preset score, extracting multiple initial prompt words from the answer information; correcting the multiple initial prompt words based on a preset loss function to obtain multiple target prompt words; and fine-tuning the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

[0188] Optionally, when the processor executes the program, it also performs the following steps: obtaining target dialogue information of the target answer generation model, wherein the target dialogue information includes target answer information output by the target answer generation model; obtaining the target emotional value score of the target answer information; and, in response to the target emotional value score being less than the preset score, repeatedly performing fine-tuning operations on the model parameters of the target answer generation model until the target emotional value score is greater than or equal to the preset score.

[0189] Optionally, specific examples in this embodiment can refer to the examples described in the above embodiments and optional implementations, and will not be repeated here.

[0190] Embodiments of the present invention also provide a computer-readable storage medium storing a computer program configured to execute the model training method described above when run on a computer or processor.

[0191] Optionally, in this embodiment, the computer-readable storage medium may be configured to store a computer program for performing the following steps:

[0192] Step S101: Obtain the dialogue information of the initial answer generation model, wherein the dialogue information includes: the question information received by the initial answer generation model and the answer information output by the initial answer generation model;

[0193] Step S102: Obtain the sentiment index and problem-solving level value of the question information;

[0194] Step S103: Determine the emotional value score of the answer information based on the sentiment index and the problem-solving level value;

[0195] Step S104: In response to the emotional value score being less than the preset score, fine-tune the model parameters of the initial answer generation model based on the preset loss function to obtain the target answer generation model.

[0196] Optionally, the storage medium is configured to store program code for performing the following steps: obtaining multiple fragmented texts and multiple text tags, wherein each fragmented text corresponds to a text tag; performing a merging operation on some of the multiple fragmented texts to obtain a first merging result, wherein the merged fragmented texts correspond to the same text tag; performing risk detection on the first merging result to obtain at least one risky word; and correcting the at least one risky word based on a text processing model to obtain text information.

[0197] Optionally, the storage medium is configured to store program code for performing the following steps: acquiring multiple audio segments and multiple audio timestamps, wherein each audio segment corresponds to an audio timestamp; performing a merging operation on some of the multiple audio segments to obtain a second merging result, wherein the merged audio segments correspond to the same audio timestamp; and extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.

[0198] Optionally, the storage medium is configured to store program code for performing the following steps: obtaining user behavior logs; extracting at least one emotional behavior feature from the user behavior logs based on a multimodal behavior processing model to obtain behavior feature information.

[0199] Optionally, the storage medium is configured to store program code for performing the following steps: determining a first weighted weight value corresponding to text information; determining a second weighted weight value corresponding to acoustic feature information; determining a third weighted weight value corresponding to behavioral feature information; and calculating a sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0200] Optionally, the storage medium is configured to store program code for performing the following steps: in response to the emotional value score being less than the preset score, extracting multiple initial prompt words from the answer information; correcting the multiple initial prompt words based on the preset loss function to obtain multiple target prompt words; and fine-tuning the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

[0201] Optionally, the storage medium is configured to store program code for performing the following steps: obtaining target dialogue information of the target answer generation model, wherein the target dialogue information includes target answer information output by the target answer generation model; obtaining the target emotional value score of the target answer information; and, in response to the target emotional value score being less than the preset score, repeatedly performing fine-tuning operations on the model parameters of the target answer generation model until the target emotional value score is greater than or equal to the preset score.

[0202] Optionally, specific examples in this embodiment can refer to the examples described in the above embodiments and optional implementations, and will not be repeated here.

[0203] Embodiments of the present invention also provide a computer program product, including a computer program, wherein the computer program, when executed by a processor, implements the steps of the model training method described above.

[0204] Optionally, in this embodiment, the computer program product described above may be configured to store a computer program for performing the following steps:

[0205] Step S101: Obtain the dialogue information of the initial answer generation model, wherein the dialogue information includes: the question information received by the initial answer generation model and the answer information output by the initial answer generation model;

[0206] Step S102: Obtain the sentiment index and problem-solving level value of the question information;

[0207] Step S103: Determine the emotional value score of the answer information based on the sentiment index and the problem-solving level value;

[0208] Step S104: In response to the emotional value score being less than the preset score, fine-tune the model parameters of the initial answer generation model based on the preset loss function to obtain the target answer generation model.

[0209] Optionally, when executed by the processor, the computer program also performs the following steps: obtaining multiple fragmented texts and multiple text tags, wherein each of the multiple fragmented texts corresponds to a text tag; performing a merging operation on some of the multiple fragmented texts to obtain a first merging result, wherein the merged fragmented texts correspond to the same text tag; performing risk detection on the first merging result to obtain at least one risky word; and correcting the at least one risky word based on a text processing model to obtain text information.

[0210] Optionally, when executed by the processor, the computer program also performs the following steps: obtaining multiple audio segments and multiple audio timestamps, wherein each of the multiple audio segments corresponds to an audio timestamp; performing a merging operation on some of the multiple audio segments to obtain a second merging result, wherein the merged audio segments correspond to the same audio timestamp; and extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain acoustic feature information.

[0211] Optionally, when executed by the processor, the computer program also performs the following steps: obtaining user behavior logs; and extracting at least one emotional behavior feature from the user behavior logs based on a multimodal behavior processing model to obtain behavior feature information.

[0212] Optionally, when executed by the processor, the computer program also performs the following steps: determining a first weighted weight value corresponding to the text information; determining a second weighted weight value corresponding to the acoustic feature information; determining a third weighted weight value corresponding to the behavioral feature information; and calculating the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

[0213] Optionally, when executed by the processor, the computer program also performs the following steps: in response to the emotional value score being less than the preset score, extracting multiple initial prompt words from the answer information; correcting the multiple initial prompt words based on the preset loss function to obtain multiple target prompt words; and fine-tuning the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

[0214] Optionally, when executed by the processor, the computer program also performs the following steps: obtaining target dialogue information of the target answer generation model, wherein the target dialogue information includes target answer information output by the target answer generation model; obtaining the target emotional value score of the target answer information; and, in response to the target emotional value score being less than the preset score, repeatedly performing fine-tuning operations on the model parameters of the target answer generation model until the target emotional value score is greater than or equal to the preset score.

[0215] Optionally, specific examples in this embodiment can refer to the examples described in the above embodiments and optional implementations, and will not be repeated here.

[0216] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0217] In the embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some interfaces; indirect couplings or communication connections between units or modules may be electrical or other forms.

[0218] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0219] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0220] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0221] The above are merely preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A model training method, characterized by comprising: obtaining dialogue information of an initial answer generation model, wherein the dialogue information includes question information received by the initial answer generation model and answer information output by the initial answer generation model; obtaining a sentiment index and a problem-solving level value of the question information; determining an emotional value score of the answer information based on the sentiment index and the problem-solving level value; and, in response to the emotional value score being less than a preset score, fine-tuning model parameters of the initial answer generation model based on a preset loss function to obtain a target answer generation model.

2. The model training method according to claim 1, characterized in that the question information includes text information, and the model training method further comprises: obtaining multiple fragmented texts and multiple text tags, wherein each of the multiple fragmented texts corresponds to a text tag; performing a merging operation on some of the multiple fragmented texts to obtain a first merging result, wherein the merged fragmented texts correspond to the same text tag; performing risk detection on the first merging result to obtain at least one risky word; and correcting the at least one risky word based on a text processing model to obtain the text information.

3. The model training method according to claim 2, characterized in that the question information further includes acoustic feature information, and the model training method further comprises: obtaining multiple audio segments and multiple audio timestamps, wherein each of the multiple audio segments corresponds to an audio timestamp; performing a merging operation on some of the multiple audio segments to obtain a second merging result, wherein the merged audio segments correspond to the same audio timestamp; and extracting at least one emotion keyword from the second merging result based on a multimodal audio processing model to obtain the acoustic feature information.

4. The model training method according to claim 3, characterized in that the question information further includes behavioral feature information, and the model training method further comprises: obtaining user behavior logs; and extracting at least one emotional behavior feature from the user behavior logs based on a multimodal behavior processing model to obtain the behavioral feature information.

5. The model training method according to claim 4, characterized in that obtaining the sentiment index of the question information comprises: determining a first weighted weight value corresponding to the text information; determining a second weighted weight value corresponding to the acoustic feature information; determining a third weighted weight value corresponding to the behavioral feature information; and calculating the sentiment index of the question information based on the first weighted weight value, the second weighted weight value, and the third weighted weight value.

6. The model training method according to claim 1, characterized in that fine-tuning the model parameters of the initial answer generation model based on the preset loss function to obtain the target answer generation model comprises: in response to the emotional value score being less than the preset score, extracting multiple initial prompt words from the answer information; correcting the multiple initial prompt words based on the preset loss function to obtain multiple target prompt words; and fine-tuning the model parameters of the initial answer generation model based on the multiple target prompt words to obtain the target answer generation model.

7. The model training method according to claim 1, characterized in that the method further comprises: obtaining target dialogue information of the target answer generation model, wherein the target dialogue information includes target answer information output by the target answer generation model; obtaining a target emotional value score of the target answer information; and, in response to the target emotional value score being less than the preset score, repeatedly performing fine-tuning operations on the model parameters of the target answer generation model until the target emotional value score is greater than or equal to the preset score.

8. A model training device, characterized by comprising: a first acquisition module, used to acquire dialogue information of an initial answer generation model, wherein the dialogue information includes question information received by the initial answer generation model and answer information output by the initial answer generation model; a second acquisition module, used to acquire a sentiment index and a problem-solving level value of the question information; a determination module, used to determine an emotional value score of the answer information based on the sentiment index and the problem-solving level value; and a parameter fine-tuning module, used to fine-tune model parameters of the initial answer generation model based on a preset loss function, in response to the emotional value score being less than a preset score, to obtain a target answer generation model.

9. A vehicle comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to run the computer program to perform the model training method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program is configured to execute the model training method according to any one of claims 1 to 7 when run on a computer or processor.