An artificial intelligence-based multi-modal big data analysis processing method and device
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HENAN POLYTECHNIC
- Filing Date
- 2025-11-25
- Publication Date
- 2026-06-23
AI Technical Summary
Existing multimodal data processing technologies suffer from insufficient scene adaptability, weak emotion-semantic association, fixed modal fusion methods, and lack of cross-modal error correction in the recognition and conversion of voice input, resulting in insufficient recognition accuracy and reliability.
We employ an AI-based multimodal big data analysis method. Through scene perception, emotion-semantic joint analysis, dynamic weight fusion, and cross-modal error correction mechanisms, we collect multimodal data for preprocessing to obtain deep semantic information, facial emotion sequences, and voice emotion sequences. We then perform consistency verification and dynamic weight fusion, and use cross-modal verification to correct the fusion results.
It improves the accuracy and reliability of multimodal data processing, adapts to different application scenarios, and enhances the precision and reliability of information transformation.
Figure CN121580302B_ABST
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and multimodal data processing technology, and relates to a method and apparatus for multimodal big data analysis and processing based on artificial intelligence.
Background Technology
[0002] In today's rapidly developing information technology landscape, multimodal data processing technology has been widely applied in the field of voice input recognition and conversion. By integrating multiple modal data such as voice and facial images, it aims to improve the accuracy and comprehensiveness of recognition, providing strong support for many scenarios such as intelligent interaction, voice assistants, and real-time translation.
[0003] However, existing technologies for recognizing and converting voice input still have many limitations, which restrict their efficient application in a wider range of scenarios:
[0004] First, there is insufficient adaptability to different scenarios. Most existing technologies only perform voice and facial image fusion processing for general scenarios, without fully considering the different modal data requirements of different application scenarios. For example, in noisy industrial production scenarios, the requirements for anti-interference processing of voice signals are higher, while in quiet office scenarios, the focus may be more on the collaborative analysis of facial micro-expressions and voice. This generalized processing approach leads to a significant decrease in the recognition accuracy of the technology in specific scenarios.
[0005] Secondly, the emotional-semantic connection is weak. Although existing technologies can identify a user's emotions through facial features, they fail to effectively combine these emotions with the semantic content of the speech to verify the consistency between emotion and semantics. When there are contradictory situations, such as a user expressing "understanding" in their speech but their facial features showing "confusion," the technology cannot correct this, leading to ambiguity in the converted information and affecting subsequent judgments of the user's true intentions.
[0006] Third, the modal fusion method is fixed. Existing technologies often employ static weight fusion strategies when fusing multimodal data, meaning that the fusion weights for each modality, such as speech and facial images, are pre-set without dynamically adjusting based on real-time data quality. In practical applications, the quality of modal data is often unstable, such as noisy speech or blurred images. Fixed weight allocations cannot adapt to real-time changes in data quality, thus affecting the reliability of the fusion results.
[0007] Fourth, cross-modal error correction is lacking. When errors occur in single-modal recognition, such as misspellings in speech recognition, current technologies cannot utilize data from other modalities (such as facial lip features) for auxiliary error correction. Due to the lack of cross-modal mutual verification and correction mechanisms, the error rate of converted information is high, making it difficult to meet the application requirements of high-precision recognition.
[0008] In summary, existing multimodal data processing technologies suffer from insufficient scene adaptability, weak emotion-semantic association, fixed modal fusion methods, and a lack of cross-modal error correction in speech input recognition and conversion. An improved technical solution is needed to address these issues and to enhance the accuracy, reliability, and scene adaptability of speech input recognition and conversion.
Summary of the Invention
[0009] To address the problems described in the background, this invention proposes a multimodal big data analysis and processing method and device based on artificial intelligence. It aims to improve the accuracy, scene adaptability, and information conversion reliability of multimodal data processing by introducing scene perception, emotion-semantic joint analysis, dynamic weight fusion, and a cross-modal error correction mechanism.
[0010] The first aspect of this application provides a multimodal big data analysis and processing method based on artificial intelligence, including:
[0011] Based on modal configuration parameters, multimodal data is collected and preprocessed.
[0012] Acquire deep semantic information, facial emotion sequences, and voice emotion sequences, and perform consistency verification on the correlation between the deep semantic information, facial emotion sequences, and voice emotion sequences at the same timestamp;
[0013] Dynamic weights are determined based on the confidence scores of the speech modality, image modality, and auxiliary data, and then the dynamic weights are used to fuse deep semantic information, facial emotion sequences, and speech emotion sequences.
[0014] Based on cross-modal verification, semantically ambiguous or erroneous parts in the fusion result are corrected, and transformation recognition information is generated based on the corrected fusion result.
[0015] Optionally, the process of generating the deep semantic information includes: performing preliminary recognition on the denoised speech data based on a pre-trained speech recognition model to generate initial text, and combining the initial text with a domain knowledge base corresponding to the scene type to correct and complete the text, thereby obtaining deep semantic information.
[0016] Optionally, the process of generating the facial emotion sequence includes: extracting the temporal motion trajectory of facial feature points in image data based on optical flow, and matching the temporal motion trajectory with a preset emotion template through a dynamic time warping algorithm to obtain the facial emotion sequence.
[0017] Optionally, the process of generating the speech emotion sequence includes: extracting the fundamental frequency and syllable interval time of the speech data, and inputting them into an emotion classification model based on a gated recurrent unit network to obtain the speech emotion sequence.
[0018] Optionally, the consistency verification includes: if there is a contradiction between facial emotion, voice emotion and deep semantic information at the same timestamp, then the scene rule base is invoked for correction.
[0019] Optionally, the speech modality confidence and the image modality confidence are each calculated by a formula [formulas not reproduced in the source text], and the auxiliary data confidence is the similarity between the text draft and the speech semantics, calculated from the cosine distance between their Bidirectional Encoder Representations from Transformers (BERT) vectors;
[0020] where the speech modality confidence is expressed in terms of a scene coefficient, a word matching rate, and a speech rate stability: the word matching rate is the ratio of the number of matched domain terms to the total number of words, and the speech rate stability is 1 minus the ratio of the speech rate standard deviation to the average speech rate; the image modality confidence is expressed in terms of a clear frame percentage and an emotional fluctuation anomaly value: the clear frame percentage is the ratio of the number of unblurred frames to the total number of frames, and the emotional fluctuation anomaly value is the ratio of the absolute value of the difference between the current emotion and the historical emotion to a preset threshold.
[0021] Optionally, the dynamic weights include a speech weight, an image weight, and an auxiliary data weight, each given by a formula [formulas not reproduced in the source text] in terms of the speech modality confidence, the image modality confidence, and the auxiliary data confidence;
[0022] The cross-modal verification includes: if the words recognized by speech recognition do not match the facial lip features, then the domain knowledge base is invoked to perform word replacement; if there is semantic ambiguity, then a unique interpretation is determined by combining the scene type.
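The auxiliary data confidence and the dynamic weights can be illustrated with a minimal sketch. The weight formulas do not survive in this text, so the normalization used below (each weight equals its confidence divided by the sum of the three confidences) is an assumption, not the patented formula. The function names and the equal-weight fallback are likewise hypothetical, and the input vectors stand in for BERT sentence embeddings that would be produced elsewhere.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector carries no usable similarity
    return dot / (norm_u * norm_v)


def dynamic_weights(
    c_speech: float, c_image: float, c_aux: float
) -> Tuple[float, float, float]:
    """ASSUMED rule: normalize the three confidences so the weights sum to 1."""
    total = c_speech + c_image + c_aux
    if total <= 0:
        # assumption: fall back to equal weights when nothing is trusted
        return (1 / 3, 1 / 3, 1 / 3)
    return (c_speech / total, c_image / total, c_aux / total)


# Usage: a noisy speech channel (low confidence) receives a smaller weight.
w_speech, w_image, w_aux = dynamic_weights(0.2, 0.6, 0.2)
```

Under this assumed rule, a drop in one modality's confidence shifts weight to the other two automatically, which matches the stated goal of replacing fixed weights.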
[0023] A second aspect of this application provides a multimodal big data analysis and processing device based on artificial intelligence, comprising:
[0024] The processing module is used to collect multimodal data and preprocess the multimodal data based on modal configuration parameters;
[0025] The analysis module is used to acquire deep semantic information, facial emotion sequences, and voice emotion sequences, and to perform consistency verification on the correlation between the deep semantic information, the facial emotion sequences, and the voice emotion sequences at the same timestamp.
[0026] The fusion module is used to determine dynamic weights based on the speech modality confidence, image modality confidence, and auxiliary data confidence, and to fuse deep semantic information, facial emotion sequences, and speech emotion sequences using the dynamic weights.
[0027] The generation module is used to correct semantically ambiguous or erroneous parts in the fusion result based on cross-modal verification, and to generate transformation recognition information based on the corrected fusion result.
[0028] Optionally, in the fusion module, the speech modality confidence and the image modality confidence are each calculated by a formula [formulas not reproduced in the source text], and the auxiliary data confidence is the similarity between the text draft and the speech semantics, calculated from the cosine distance between their Bidirectional Encoder Representations from Transformers (BERT) vectors;
[0029] where the speech modality confidence is expressed in terms of a scene coefficient, a word matching rate, and a speech rate stability: the word matching rate is the ratio of the number of matched domain terms to the total number of words, and the speech rate stability is 1 minus the ratio of the speech rate standard deviation to the average speech rate; the image modality confidence is expressed in terms of a clear frame percentage and an emotional fluctuation anomaly value: the clear frame percentage is the ratio of the number of unblurred frames to the total number of frames, and the emotional fluctuation anomaly value is the ratio of the absolute value of the difference between the current emotion and the historical emotion to a preset threshold.
[0030] Optionally, in the fusion module, the dynamic weights include a speech weight, an image weight, and an auxiliary data weight, each given by a formula [formulas not reproduced in the source text] in terms of the speech modality confidence, the image modality confidence, and the auxiliary data confidence;
[0031] The cross-modal verification includes: if the words recognized by speech recognition do not match the facial lip features, then the domain knowledge base is invoked to perform word replacement; if there is semantic ambiguity, then a unique interpretation is determined by combining the scene type.
[0032] Compared with the prior art, the present invention has the following beneficial effects:
[0033] This invention provides a multimodal big data analysis and processing method and apparatus based on artificial intelligence. The dynamic weight fusion algorithm based on modality confidence replaces fixed weights and improves the reliability of fusion results. By quantifying modality confidence and weights through mathematical models, multimodal fusion is upgraded from experience-driven to data-driven, which greatly improves the processing accuracy.
Attached Figure Description
[0034] Figure 1 is a flowchart of a multimodal big data analysis and processing method based on artificial intelligence according to an embodiment of the present invention;
[0035] Figure 2 is a schematic diagram of a multimodal big data analysis and processing device based on artificial intelligence in one embodiment of the present invention.
Detailed Implementation
[0036] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0037] In one embodiment, as shown in Figure 1, a multimodal big data analysis and processing method based on artificial intelligence is provided, which includes the following steps:
[0038] S10: Based on modal configuration parameters, collect multimodal data and preprocess the multimodal data.
[0039] Specifically, the modal configuration parameters in this invention are configured by collecting scene feature data of the current environment, inputting it into a scene classification model to determine the scene type, and then calling preset modal configuration parameters according to the scene type. The process is as follows:
[0040] Scene feature data includes ambient audio, background images, and user identification tags. Ambient audio is collected via the microphone of the terminal device, capturing sound signals from the current scene. The background image is captured via the camera of the terminal device, capturing visual information from the current scene. The user identification tag is the identity information provided when the user logs into the system. The collected ambient audio, background images, and user identification tags are input into a scene classification model. This model is a hybrid architecture of convolutional neural networks and long short-term memory networks. By jointly analyzing the spectral characteristics of the ambient audio, the visual characteristics of the background image, and the attribute characteristics of the user identification tag, the model outputs a scene type. These scene types include, but are not limited to, teaching scenarios, medical consultation scenarios, and everyday conversation scenarios. After determining the scenario type, the system calls the preset modal configuration parameters corresponding to the scenario type. The modal configuration parameters include speech recognition weight and facial feature capture frame rate. For example, in the teaching scenario, the speech recognition weight is set to a higher value for subject-specific terminology, and the facial feature capture frame rate is set to a higher frame rate for students' micro-expressions. In the medical consultation scenario, the speech recognition weight is set to a higher value for symptom description terms, and the facial feature capture frame rate is set to a higher frame rate for patients' painful expression features. In the daily conversation scenario, the speech recognition weight and facial feature capture frame rate are kept in a balanced setting.
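The lookup from scene type to preset modal configuration parameters can be sketched as a small table. The numeric values, the key names, and the fallback to the daily conversation configuration are all hypothetical; the text only states that teaching and medical consultation scenes use higher settings and that daily conversation keeps a balanced setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalConfig:
    speech_recognition_weight: float  # hypothetical scale, higher = more emphasis
    face_capture_fps: int             # facial feature capture frame rate
    emphasized_terms: str             # vocabulary the speech weight favors


# Hypothetical presets; only the relative ordering follows the description.
MODAL_CONFIGS = {
    "teaching": ModalConfig(0.7, 60, "subject-specific terminology"),
    "medical_consultation": ModalConfig(0.7, 60, "symptom description terms"),
    "daily_conversation": ModalConfig(0.5, 30, "none"),
}


def get_modal_config(scene_type: str) -> ModalConfig:
    """Return the preset for a scene type.

    Assumption: unknown scene types fall back to the balanced
    daily conversation preset.
    """
    return MODAL_CONFIGS.get(scene_type, MODAL_CONFIGS["daily_conversation"])


config = get_modal_config("teaching")
```

Keeping the presets in a table separate from the classifier means a new scene type only needs a new entry, not a change to the collection logic.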
[0041] Based on the modal configuration parameters, voice data, image data, and auxiliary data are collected. The voice data undergoes noise reduction processing, and the image data undergoes feature point normalization processing to obtain multimodal data. The specific process is as follows:
[0042] Based on the invoked modal configuration parameters, the system collects the user's voice data via a directional microphone and image data via a high-definition camera, while simultaneously acquiring auxiliary data. The auxiliary data includes the user's input text draft and the user interaction history stored by the system. The text draft takes two forms. The first is a pre-input draft: text related to the current voice topic that the user records before starting to speak. For example, in a teaching scenario, before explaining "the solution to a math problem," the user enters a draft that establishes the scope of the voice input topic. The second is a synchronous supplementary draft: text fragments that the user records during voice input on finding that the spoken expression is inaccurate or incomplete. For example, in a medical consultation scenario, while describing physical discomfort, the user enters a draft such as "fever at night, no runny nose" to supply details not clearly stated in the speech.
[0043] The acquired speech data is denoised using an adaptive filtering algorithm to filter out environmental noise and retain clear user speech signals. The acquired image data is normalized by feature point extraction to extract facial feature points from the images, and the coordinates of these facial feature points are uniformly mapped to a preset facial coordinate system to ensure consistency in the positions of the same facial feature points in images acquired at different times and angles. After the above processing, multimodal data is obtained, which includes denoised speech data, normalized image data, and auxiliary data. This multimodal data can be directly used for subsequent analysis and processing steps.
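One way to map facial feature points to a common coordinate system is sketched below. The text does not define the preset facial coordinate system, so this sketch assumes one: the midpoint between the eyes becomes the origin, the line between the eyes becomes the horizontal axis, and the distance between the eyes becomes the unit length. The function name and the eye-index parameters are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_landmarks(
    points: List[Point], left_eye: int, right_eye: int
) -> List[Point]:
    """Translate, rotate, and scale landmarks into an eye-aligned frame."""
    lx, ly = points[left_eye]
    rx, ry = points[right_eye]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2      # origin: midpoint of the eyes
    dx, dy = rx - lx, ry - ly
    scale = math.hypot(dx, dy)                 # unit: inter-eye distance
    if scale == 0:
        raise ValueError("eye landmarks coincide; cannot normalize")
    angle = math.atan2(dy, dx)                 # tilt of the eye line
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    normalized = []
    for x, y in points:
        tx, ty = x - cx, y - cy
        normalized.append(
            ((tx * cos_a - ty * sin_a) / scale,
             (tx * sin_a + ty * cos_a) / scale)
        )
    return normalized


# The same face captured at a different position and size
# maps to the same normalized coordinates.
near = normalize_landmarks([(0, 0), (2, 0), (1, 1)], left_eye=0, right_eye=1)
far = normalize_landmarks([(10, 5), (14, 5), (12, 7)], left_eye=0, right_eye=1)
```

After this step the left and right eye always sit at (-0.5, 0) and (0.5, 0), so trajectories extracted later are comparable across frames.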
[0044] In this embodiment, the scene classification model is a hybrid architecture of convolutional neural network and long short-term memory network. It performs joint analysis on the spectral features of ambient audio, the visual features of background image, and the attribute features of user identity tags, specifically including the following:
[0045] Specifically, the hybrid architecture of convolutional neural networks and long short-term memory networks consists of a feature extraction layer, a feature fusion layer, and a classification layer. The feature extraction layer contains two parallel convolutional neural network branches, used to process the spectral features of the ambient audio and the visual features of the background image, respectively. The feature fusion layer connects to the long short-term memory network to process the fused temporal features. The classification layer is a fully connected layer used to output the scene type.
[0046] The process of extracting spectral features from ambient audio is as follows: the collected ambient audio signal is preprocessed, such as by framing and windowing, and the time-domain audio signal is converted into a two-dimensional Mel spectrum graph through Mel spectrum transformation, for example, with time on the horizontal axis and Mel frequency on the vertical axis, to obtain the spectral features of the ambient audio; the spectral features of the ambient audio are input into the first branch of the convolutional neural network, and through multiple layers of convolutional and pooling layers, key spatial features such as high-frequency energy distribution and spectral envelope in the spectral features of the ambient audio are extracted, and the audio feature vector is output.
[0047] The process of extracting visual features from the background image is as follows: the acquired background image is standardized, for example, the size is normalized and the pixel value is normalized, to obtain the visual features of the background image; the visual features are input into the second convolutional neural network branch, and spatial features such as texture, color distribution and object contours in the image are captured through convolution operations. After dimensionality reduction by pooling layers, the image feature vector is output.
[0048] The process of processing the attribute features of user identity tags is as follows: User identity tags are text-based identity identifiers, such as "teacher", "patient", and "ordinary user". They are converted into fixed-dimensional attribute feature vectors through an embedding layer. The attribute feature vectors contain category information of user identity, such as occupation and role.
[0049] The joint analysis process involves concatenating audio feature vectors, image feature vectors, and attribute feature vectors of user identity tags to obtain a fused feature vector. This fused feature vector is then input into a Long Short-Term Memory (LSTM) network. Leveraging its ability to capture temporal dependencies, the network analyzes the correlation patterns among audio, image, and identity features at different times. For example, in a teaching scenario, environmental audio often includes blackboard writing and question-and-answer sounds, background images often include blackboards, desks, and chairs, and user identity tags are often "teacher" or "student." These three elements exhibit a synergistic temporal correlation. The temporal feature vector output from the LSTM network is then processed by a classification layer to obtain the scene type classification result, such as teaching scene, medical consultation scene, or daily conversation scene.
[0050] By using the hybrid architecture of convolutional neural networks and long short-term memory networks to jointly analyze the three features, the scene classification model can comprehensively utilize multi-dimensional information such as environmental audio, background images, and user identity to improve the accuracy of scene type recognition.
[0051] S20: Obtain deep semantic information, facial emotion sequence, and voice emotion sequence, and perform consistency verification on the correlation between the deep semantic information, the facial emotion sequence, and the voice emotion sequence under the same timestamp.
[0052] Specifically, the process of recognizing speech data in the multimodal data and generating deep semantic information by combining it with a domain knowledge base is as follows: Based on a pre-trained speech recognition model, the denoised speech data is initially recognized, converting the speech signal into the corresponding text form to generate initial text; according to the determined scenario type, the corresponding domain knowledge base is invoked, for example, a subject terminology database for a teaching scenario, or a symptom database for a medical consultation scenario; the initial text is corrected and completed using the domain knowledge base. For example, in a teaching scenario, if the initial text contains "single original," it is corrected to "single substance" using the subject terminology database; if the initial text omits a subject, such as only "explaining the solution steps," it is completed to "the teacher explains the solution steps" based on the context and user identity tags, ultimately obtaining complete and accurate deep semantic information.
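The term-correction step can be sketched as a lookup keyed by scene type. This is a deliberately simple stand-in: the table contents, the scene keys, and the function name are hypothetical, and only the "single original" to "single substance" pair comes from the description. Completing an omitted subject from context is not attempted here.

```python
from typing import Dict

# Hypothetical domain knowledge base: misrecognized phrase -> domain term.
DOMAIN_CORRECTIONS: Dict[str, Dict[str, str]] = {
    "teaching": {"single original": "single substance"},
    "medical_consultation": {},
}


def correct_with_knowledge_base(initial_text: str, scene_type: str) -> str:
    """Replace known misrecognitions using the scene's correction table."""
    corrected = initial_text
    for wrong, right in DOMAIN_CORRECTIONS.get(scene_type, {}).items():
        corrected = corrected.replace(wrong, right)
    return corrected


text = correct_with_knowledge_base("oxygen is a single original", "teaching")
```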
[0053] The process of analyzing image data in the multimodal data to generate facial emotion sequences is as follows: Image data normalized by feature points is processed using optical flow to track and extract the temporal trajectories of facial feature points such as eyebrows, corners of the mouth, and eyeballs, obtaining the temporal motion trajectories of these feature points. These temporal motion trajectories are then input into a dynamic time warping algorithm and matched with a preset emotion template to determine the emotion type for each timestamp. For example, "confusion" corresponds to the trajectory features of raised eyebrows and downturned corners of the mouth, while "understanding" corresponds to the trajectory features of relaxed eyebrows and slightly raised corners of the mouth. The emotion types of each timestamp are integrated in chronological order to generate a facial emotion sequence in the form of "Timestamp 1: Emotion Type 1; Timestamp 2: Emotion Type 2…".
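The template matching step can be sketched with a textbook dynamic time warping distance. The sketch assumes each trajectory has been reduced to a one-dimensional sequence (for example, the vertical displacement of a mouth corner over time); the template values and the function names are hypothetical and chosen only to mirror the "confusion" and "understanding" examples.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def match_emotion(
    trajectory: Sequence[float], templates: Dict[str, Sequence[float]]
) -> str:
    """Return the template name with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(trajectory, templates[name]))


# Hypothetical templates: mouth-corner displacement (negative = downturned).
TEMPLATES = {
    "confusion": [0.0, -0.2, -0.4, -0.4],
    "understanding": [0.0, 0.1, 0.2, 0.2],
}

emotion = match_emotion([0.0, 0.05, 0.1, 0.2, 0.2], TEMPLATES)
```

Because the warping path may stretch or compress time, an expression performed slowly and the same expression performed quickly can both match the same template.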
[0054] The process of analyzing the intonation and speed features of the speech data to generate a speech emotion sequence is as follows: Intonation and speed features are extracted from the denoised speech data. Intonation features are obtained by calculating the fundamental frequency of the speech signal, and speed features are obtained by calculating the syllable interval time. The extracted fundamental frequency and syllable interval time are input into an emotion classification model based on a gated recurrent unit network. This model learns the correlation between intonation, speed, and emotion, and outputs the emotion type for each timestamp. The emotion types of each timestamp are integrated in chronological order to generate a speech emotion sequence, which has the same form as a facial emotion sequence, i.e., "Timestamp 1: Emotion Type 1; Timestamp 2: Emotion Type 2…".
[0055] The process of verifying the consistency of the association between facial emotions, voice emotions, and deep semantic information at the same timestamp is as follows: For each timestamp, the facial emotions from the facial emotion sequence, the voice emotions from the voice emotion sequence, and the semantic content in the deep semantic information corresponding to the timestamp are analyzed for association to determine whether the three match. For example, if the deep semantic information is "This question is very simple", then the corresponding facial emotions and voice emotions should tend to be "calm" or "confident". If there is a contradiction in the association among the three, such as the deep semantic information is "I am happy", but the facial emotions are "angry" and the voice emotions are "irritable", then the scene rule base is called for correction. In the teaching scenario, facial emotions are given priority, such as students may hide their true emotions due to shyness. Finally, the verified and corrected emotion and semantic association results are obtained.
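The per-timestamp check can be sketched as follows. The sketch assumes that the semantic content has already been mapped to the set of emotions compatible with it (that mapping is not specified in the text). The rule table, the function name, and the default priority for scenes without a rule are hypothetical; only the teaching rule (facial emotion takes priority) comes from the description.

```python
from typing import Dict, Set

# Scene rule base: which modality wins when the three sources contradict.
SCENE_RULES: Dict[str, str] = {"teaching": "face"}


def verify_consistency(
    compatible_emotions: Set[str],
    face_emotion: str,
    voice_emotion: str,
    scene_type: str,
) -> Dict[str, object]:
    """Check one timestamp and resolve contradictions with the scene rule."""
    consistent = (
        face_emotion in compatible_emotions
        and voice_emotion in compatible_emotions
    )
    if consistent:
        # Both emotions agree with the semantics; report the facial one.
        return {"consistent": True, "emotion": face_emotion}
    # Assumption: scenes without an explicit rule trust the voice emotion.
    priority = SCENE_RULES.get(scene_type, "voice")
    resolved = face_emotion if priority == "face" else voice_emotion
    return {"consistent": False, "emotion": resolved}


# "I am happy" spoken with an angry face and an irritable voice.
result = verify_consistency({"happy"}, "angry", "irritable", "teaching")
```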
[0056] S30: Determine dynamic weights based on the confidence scores of the speech modality, image modality, and auxiliary data, and use the dynamic weights to fuse deep semantic information, facial emotion sequences, and speech emotion sequences.
[0057] Specifically, the speech modality confidence is calculated by a formula [formula not reproduced in the source text] expressed in terms of the scene coefficient, the word matching rate, and the speech rate stability. The word matching rate is the ratio of the number of matched domain terms in the deep semantic information to the total number of words; for example, in the teaching scene, it is the proportion of recognized terms such as "simple" and "function" that match the subject terminology database, relative to the total number of words. The speech rate stability is 1 minus the ratio of the speech rate standard deviation to the average speech rate, where the speech rate standard deviation measures the dispersion of the syllable interval times in the speech data, and the average speech rate is the mean of those syllable interval times.
[0058] Syllable interval time is the basic data for calculating the average speech rate and the speech rate standard deviation. It is extracted from speech data that has been denoised with the adaptive filtering algorithm. First, speech endpoint detection is performed on the denoised speech data: a joint decision on short-time energy and zero-crossing rate separates effective speech segments from silent segments, so that silence does not distort the syllable interval calculation. Then, a pre-trained syllable segmentation model identifies syllable boundaries within the effective speech segments; having learned the acoustic features of syllables from a large amount of labeled speech data, the model locates the start and end time of each syllable. Finally, the difference between the start times of two adjacent syllables is calculated to obtain the syllable interval time, denoted $T_i$, where $i = 1, 2, \ldots, n$ is the syllable interval index and $n$ is the total number of syllable interval times within the current effective speech segment. For example, if the first syllable starts at 0.2 seconds and the second at 0.5 seconds, then $T_1 = 0.5 - 0.2 = 0.3$ seconds; if the third syllable starts at 0.8 seconds, then $T_2 = 0.8 - 0.5 = 0.3$ seconds; and so on for all $T_i$.
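The two computational pieces of this paragraph can be sketched briefly: a per-frame speech decision from short-time energy and zero-crossing rate, and the interval calculation from syllable start times. The thresholds and function names are hypothetical, and the start times are taken as given because the syllable segmentation model is outside the scope of this sketch.

```python
from typing import List, Sequence, Tuple


def frame_features(frame: Sequence[float]) -> Tuple[float, float]:
    """Return (short-time energy, zero-crossing rate) for one audio frame."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return energy, crossings / (len(frame) - 1)


def is_speech(
    frame: Sequence[float], energy_min: float = 0.01, zcr_max: float = 0.5
) -> bool:
    """Joint decision with hypothetical thresholds: enough energy and a
    zero-crossing rate low enough to rule out noise-like frames."""
    energy, zcr = frame_features(frame)
    return energy >= energy_min and zcr <= zcr_max


def syllable_intervals(start_times: Sequence[float]) -> List[float]:
    """Differences between the start times of adjacent syllables."""
    return [b - a for a, b in zip(start_times, start_times[1:])]


# Start times from the example in the text: 0.2 s, 0.5 s, 0.8 s.
intervals = syllable_intervals([0.2, 0.5, 0.8])
```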
[0059] Average speech rate is the average syllable interval time in the speech data, used to quantify the overall speed of the current speech. The calculation consists of filtering valid data, summing, and averaging. First, valid syllable interval times $T_i$ are selected: outliers caused by user pauses or speech recognition errors are excluded, and only intervals within the normal speech rate range are retained. Then, the sum of all valid syllable interval times is calculated: $\sum_{i=1}^{n} T_i = T_1 + T_2 + \ldots + T_n$, where $n$ is the number of valid syllable intervals. Finally, the average speech rate is calculated as the arithmetic mean: $\bar{T} = \frac{1}{n}\sum_{i=1}^{n} T_i$, where $\bar{T}$ is the average speech rate in seconds per syllable, that is, the average interval between syllables. The smaller the value, the faster the speech rate. For example, if 5 syllable intervals are selected from the current valid speech segment, with $T_1 = 0.2$ seconds, $T_2 = 0.3$ seconds, $T_3 = 0.25$ seconds, $T_4 = 0.3$ seconds, and $T_5 = 0.25$ seconds, then $\sum T_i = 0.2 + 0.3 + 0.25 + 0.3 + 0.25 = 1.3$ seconds and $n = 5$, so $\bar{T} = 1.3 / 5 = 0.26$ seconds per syllable. The individual intervals lie close to this value, which indicates that the overall speech rate is relatively stable.
[0060] The standard deviation of speech rate is a quantitative indicator of the dispersion of the syllable interval times in the speech data. Higher dispersion indicates that the speech rate fluctuates strongly; lower dispersion indicates a more stable speech rate. Its calculation uses the average speech rate and follows the steps of taking differences, squaring, averaging, and taking the square root. First, the difference between each valid syllable interval time $T_i$ and the average speech rate $\bar{T}$ is calculated, denoted $d_i = T_i - \bar{T}$. Each difference is then squared to obtain $d_i^2$, so that positive and negative differences do not cancel each other out. Next, the average of all $d_i^2$ is calculated, which is the variance: $S^2 = \frac{1}{n}\sum_{i=1}^{n} (T_i - \bar{T})^2$. Finally, the square root of $S^2$ gives the speech rate standard deviation: $S = \sqrt{S^2}$, whose unit is the same as that of the syllable interval time.
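The average speech rate, the speech rate standard deviation, and the speech rate stability (1 minus the ratio of the standard deviation to the average) can be computed together. The sketch below uses the five example intervals from the text; the function names are hypothetical, and outlier filtering is assumed to have been done already.

```python
import math
from typing import Sequence


def average_speech_rate(intervals: Sequence[float]) -> float:
    """Arithmetic mean of the syllable interval times (seconds per syllable)."""
    return sum(intervals) / len(intervals)


def speech_rate_std(intervals: Sequence[float]) -> float:
    """Population standard deviation: differences, squares, mean, square root."""
    mean = average_speech_rate(intervals)
    variance = sum((t - mean) ** 2 for t in intervals) / len(intervals)
    return math.sqrt(variance)


def speech_rate_stability(intervals: Sequence[float]) -> float:
    """1 minus the ratio of the standard deviation to the average."""
    return 1 - speech_rate_std(intervals) / average_speech_rate(intervals)


intervals = [0.2, 0.3, 0.25, 0.3, 0.25]        # example values from the text
mean = average_speech_rate(intervals)           # about 0.26 s per syllable
std = speech_rate_std(intervals)                # about 0.037 s
stability = speech_rate_stability(intervals)    # about 0.856
```

The variance here divides by n rather than n - 1, following the description of the variance as the average of the squared differences.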
[0061] The image modality confidence is calculated by a formula [formula not reproduced in the source text] expressed in terms of the clear frame percentage and the emotional fluctuation anomaly value. The clear frame percentage is the ratio of the number of unblurred frames in the image data to the total number of frames, where an unblurred frame is one in which the facial feature points are clearly identifiable. The emotional fluctuation anomaly value is the ratio of the absolute value of the difference between the current emotion and the historical emotion to a preset threshold. The current emotion is the emotion type corresponding to a given timestamp in the image data, the historical emotion is the average of the emotion types of the N consecutive frames before that timestamp, and the preset threshold is the upper limit of normal fluctuation obtained from statistics of historical emotion data.
[0062] First, the current emotion is determined. The current emotion is the emotion type corresponding to a specific timestamp in the image data, derived from a complete analysis of the image data in the standardized multimodal data. The temporal motion trajectories of facial feature points such as eyebrows, corners of the mouth, and eyeballs in the image data are extracted using optical flow. These temporal motion trajectories are then input into a dynamic time warping algorithm and matched with preset emotion templates. For example, "confusion" corresponds to the trajectory features of raised eyebrows and downturned corners of the mouth, while "understanding" corresponds to the trajectory features of relaxed eyebrows and slightly raised corners of the mouth. Finally, a unique emotion type is determined for that timestamp, such as "confusion," "calmness," "irritability," or "understanding." To enable numerical calculation, a fixed emotion quantification value is preset for each emotion type. The quantification values need to reflect the gradient of the emotions, with lower values for negative emotions and higher values for positive emotions. For example, the preset quantification value is 0 for "irritability", 1 for "confusion", 3 for "calmness", and 5 for "understanding". This quantification rule is stored in the system's emotion encoding library in advance, so that the quantification value of the same emotion type is identical across calculation scenarios, providing a basis for the subsequent difference calculation.
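The template-matching step can be illustrated with a small dynamic time warping (DTW) sketch. It makes strong simplifying assumptions: each trajectory is reduced to a one-dimensional sequence (for instance, vertical eyebrow displacement), and the template values in `EMOTION_TEMPLATES` are invented for illustration. Only the quantification values in `EMOTION_CODES` come from the text.

```python
from typing import Dict, List


def dtw_distance(a: List[float], b: List[float]) -> float:
    """Classic DTW cost between two 1-D sequences (absolute difference)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def match_emotion(trajectory: List[float],
                  templates: Dict[str, List[float]]) -> str:
    """Return the template name with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(trajectory, templates[name]))


# Hypothetical 1-D templates; real templates would be multi-dimensional.
EMOTION_TEMPLATES = {
    "confusion": [0.0, 0.5, 1.0, 1.0],      # eyebrows rising
    "calmness": [0.0, 0.0, 0.0, 0.0],       # no movement
    "understanding": [0.0, -0.3, -0.5, -0.5],  # eyebrows relaxing
}

# Quantification values as given in the text.
EMOTION_CODES = {"irritability": 0, "confusion": 1, "calmness": 3, "understanding": 5}

label = match_emotion([0.0, 0.4, 0.9, 1.0, 1.0], EMOTION_TEMPLATES)
print(label, EMOTION_CODES[label])
```

DTW tolerates the observed trajectory being longer or shorter than the template, which is why it suits expressions that unfold at different speeds.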
[0063] Secondly, historical emotion is calculated. Historical emotion is the average of the emotion types of the N consecutive frames preceding the current emotion's timestamp. The value of "N consecutive frames" needs to be preset in conjunction with the scene type, and the value logic should be adapted to the characteristics of emotion changes in the scene: In teaching scenarios, users' emotional changes are usually relatively smooth, so N is set to 10 frames to fully reflect the stable trend of emotions; in medical consultation scenarios, users may experience rapid emotional fluctuations due to descriptions of illnesses, communication of treatment plans, etc., so N is set to 5 frames to accurately capture recent emotional changes; in daily conversation scenarios, users' emotional fluctuations are between the two, so N is set to 8 frames. In specific calculations, the emotion types of the N consecutive frames traced back from the current timestamp are first extracted from the facial emotion sequence. Then, the emotion type of each frame is converted into the corresponding emotion quantification value through an emotion encoding library. Finally, the historical emotion average is calculated using the arithmetic mean formula: Historical Emotion Average = (Emotion Quantification Value of Frame 1 + Emotion Quantification Value of Frame 2 + ... + Emotion Quantification Value of Frame N) / N. For example, if the current emotion corresponds to the timestamp t10 (N=10 in the teaching scenario), then the emotion types from frames t0 to t9 are extracted. If their corresponding quantization values are 1, 1, 2, 2, 3, 2, 2, 1, 1, 2 respectively, substituting them into the formula, we can get the historical average emotion = (1+1+2+2+3+2+2+1+1+2) / 10 = 1.7.
[0064] Finally, the preset threshold is determined and the emotional fluctuation anomaly value is calculated. The preset threshold is the upper limit of normal emotional fluctuation in a given scenario, obtained from statistics on a large amount of historical emotional data. It is set separately for each scenario to match that scenario's emotional fluctuation pattern: in the teaching scenario, 100,000 emotional fluctuation records of teachers and students are collected, the normal emotional fluctuation amplitude (the absolute value of the difference between the emotion of a single frame and the historical average) is obtained through data analysis, and the preset threshold is accordingly set to 2.0; in the medical consultation scenario, 80,000 emotional fluctuation records of patients are collected, 95% of the normal emotional fluctuation amplitudes do not exceed 3.5, and the preset threshold is therefore set to 3.5; in the daily conversation scenario, 150,000 emotional fluctuation records of ordinary users are collected, 95% of the normal emotional fluctuation amplitudes do not exceed 2.5, and the preset threshold is therefore set to 2.5. These preset thresholds are stored in the scenario rule base in advance and are called synchronously when the scenario type is determined. After obtaining the current emotional quantification value, the historical emotional average value, and the preset threshold, the anomaly value is calculated using the formula: emotional fluctuation anomaly value = |current emotional quantification value - historical emotional average value| / preset threshold. For example, in a teaching scenario, if the current emotional quantification value is 0, corresponding to "irritability," the historical emotional average value is 1.7, and the preset threshold is 2.0, then the emotional fluctuation anomaly value = |0 - 1.7| / 2.0 = 0.85; if the current emotional quantification value is 5, corresponding to "understanding," and the historical emotional average value is 1.7, then the emotional fluctuation anomaly value = |5 - 1.7| / 2.0 = 1.65. This anomaly value is used to determine whether the emotional fluctuation exceeds the normal range: an anomaly value ≤ 1 is considered normal emotional fluctuation, and an anomaly value > 1 is considered abnormal emotional fluctuation.
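The anomaly calculation and the normal/abnormal decision can be sketched as below. The thresholds are those given for the three scenarios; the dictionary keys and function names are illustrative.

```python
# Preset thresholds per scene type, as stated in the text.
SCENE_THRESHOLD = {
    "teaching": 2.0,
    "medical_consultation": 3.5,
    "daily_conversation": 2.5,
}


def emotion_anomaly(current: float, historical_avg: float, scene: str) -> float:
    """|current - historical average| divided by the scene threshold."""
    return abs(current - historical_avg) / SCENE_THRESHOLD[scene]


def is_abnormal(anomaly: float) -> bool:
    """Values above 1 exceed the normal fluctuation range."""
    return anomaly > 1.0


for current in (0, 5):  # "irritability" and "understanding"
    value = emotion_anomaly(current, 1.7, "teaching")
    print(current, round(value, 2), is_abnormal(value))
```

With a historical average of 1.7, a jump to "understanding" (5) is flagged as abnormal in the teaching scenario, while a drop to "irritability" (0) is not.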
[0065] When calculating the auxiliary data confidence, if the auxiliary data includes a text draft input by the user, the auxiliary data confidence is the similarity between the text draft and the speech semantics, calculated based on the cosine distance between Bidirectional Encoder Representations from Transformers (BERT) vectors. This involves converting both the text draft and the deep semantic information into BERT vectors and then calculating the cosine distance between the two vectors. If the auxiliary data does not contain a text draft, the auxiliary data confidence is 0.
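A minimal sketch of this step is shown below, assuming the two BERT vectors have already been produced by some encoder (not shown here) and are passed in as plain lists of floats. The sketch uses cosine similarity as the similarity score; the text speaks of "cosine distance" without spelling out the exact mapping, so this choice is an assumption.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate vector: treat as no similarity
    return dot / (norm_u * norm_v)


def auxiliary_confidence(draft_vec: Optional[Sequence[float]],
                         semantic_vec: Sequence[float]) -> float:
    """Similarity of draft and speech semantics; 0 when no draft exists."""
    if draft_vec is None:
        return 0.0
    return cosine_similarity(draft_vec, semantic_vec)


print(auxiliary_confidence([0.2, 0.8, 0.1], [0.25, 0.7, 0.0]))
print(auxiliary_confidence(None, [0.25, 0.7, 0.0]))  # 0.0
```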
[0066] The process of determining dynamic weights based on the speech modality confidence, image modality confidence, and auxiliary data confidence is as follows: The dynamic weights include a speech weight, an image weight, and an auxiliary data weight, each calculated by its own formula from the three confidences. The weights calculated in this way sum to 1 and adjust dynamically as the speech modality confidence, image modality confidence, and auxiliary data confidence change in real time. For example, when the voice data is clear but the image data is blurry, the speech modality confidence is high and the image modality confidence is low, so the speech weight is increased and the image weight is reduced to highlight the contribution of the speech modality data.
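The weight formulas themselves are not legible in this text. One simple scheme that satisfies both stated properties (the weights sum to 1 and each follows its confidence) is proportional normalization, sketched below. This is an assumption made for illustration, not necessarily the formula used in the patent, and the fallback to equal weights is also an assumption.

```python
from typing import Tuple


def dynamic_weights(c_speech: float,
                    c_image: float,
                    c_aux: float) -> Tuple[float, float, float]:
    """Assumed scheme: each weight is its confidence over the total."""
    total = c_speech + c_image + c_aux
    if total <= 0.0:
        # Assumed fallback when no modality is trusted: equal weights.
        return (1 / 3, 1 / 3, 1 / 3)
    return (c_speech / total, c_image / total, c_aux / total)


# Clear speech, blurry image, no text draft.
w_speech, w_image, w_aux = dynamic_weights(0.9, 0.3, 0.0)
print(w_speech, w_image, w_aux)
```

Under this scheme a clear voice signal with a blurry image yields a larger speech weight and a smaller image weight, matching the example in the text.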
[0067] The process of fusing deep semantic information, facial emotion sequences, and speech emotion sequences according to the dynamic weights is as follows: Deep semantic information, facial emotion sequences, and speech emotion sequences are converted into vector forms, whereby deep semantic information is converted into semantic vectors through a word embedding model, and facial and speech emotion sequences are converted into emotion vectors through an emotion coding model. The three vectors are then weighted and summed according to the dynamic weights, resulting in a fusion vector = W1 × semantic vector + W2 × facial emotion vector + W3 × speech emotion vector. This fusion vector integrates the effective information from each modality, and the weights are dynamically adjusted according to the quality of each modality, making the fusion result more consistent with the reliability characteristics of the current data.
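The weighted sum can be sketched as follows, assuming the three vectors have already been produced by the embedding and emotion coding models and share the same dimension. The function name is illustrative.

```python
from typing import List, Sequence, Tuple


def fuse(weights: Tuple[float, float, float],
         semantic_vec: Sequence[float],
         facial_vec: Sequence[float],
         speech_emotion_vec: Sequence[float]) -> List[float]:
    """W1 * semantic + W2 * facial emotion + W3 * speech emotion."""
    if not (len(semantic_vec) == len(facial_vec) == len(speech_emotion_vec)):
        raise ValueError("all three vectors must have the same dimension")
    w1, w2, w3 = weights
    return [w1 * s + w2 * f + w3 * v
            for s, f, v in zip(semantic_vec, facial_vec, speech_emotion_vec)]


fused = fuse((0.5, 0.25, 0.25), [1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
print(fused)  # [0.75, 0.5]
```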
[0068] S40: Correct the semantically ambiguous or erroneous parts in the fusion result based on cross-modal verification, and generate transformation recognition information based on the corrected fusion result.
[0069] Specifically, the process of correcting semantically ambiguous or erroneous parts in the fusion result through cross-modal verification is as follows: For semantic ambiguity in the fusion result, such as the same word corresponding to multiple interpretations, or for errors, such as speech recognition output that does not match the actual intent, a cross-modal verification mechanism is invoked for multi-dimensional verification. If a word recognized by speech does not match the facial lip features in the image data, for example, the speech recognition result is "four" but the facial lip features show the lip shape of "ten", then a word matching the facial lip features is retrieved from the domain knowledge base corresponding to the scene type and substituted, to ensure consistency between the word and the lip movement. If there is semantic ambiguity, such as "yam" in the text potentially referring to either an agricultural product or a medicinal material, then a unique interpretation is determined based on the identified scene type, for example, prioritizing "medicinal material" in a medical scenario and "agricultural product" in a commercial scenario. After the above verification and correction, an accurate and unambiguous fusion result is obtained.
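The two correction rules can be sketched with simple lookups. The data structures here are hypothetical stand-ins: `lip_lexicon` represents the part of the domain knowledge base that maps an observed lip shape to candidate words, and `SCENE_SENSES` represents scene-specific word senses. How lip shapes are actually detected and encoded is outside this sketch.

```python
from typing import Dict, List, Optional

# Hypothetical scene-specific senses for ambiguous words.
SCENE_SENSES: Dict[str, Dict[str, str]] = {
    "yam": {"medical": "medicinal material", "commercial": "agricultural product"},
}


def correct_word(recognized: str,
                 lip_shape: str,
                 lip_lexicon: Dict[str, List[str]]) -> str:
    """Replace the recognized word if it contradicts the observed lip shape."""
    candidates = lip_lexicon.get(lip_shape, [])
    if not candidates or recognized in candidates:
        return recognized  # consistent, or nothing to verify against
    return candidates[0]   # first knowledge-base word matching the lips


def disambiguate(word: str, scene: str) -> Optional[str]:
    """Pick the sense of an ambiguous word for the given scene type."""
    return SCENE_SENSES.get(word, {}).get(scene)


print(correct_word("four", "ten-shape", {"ten-shape": ["ten"]}))  # ten
print(disambiguate("yam", "medical"))  # medicinal material
```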
[0070] Based on the corrected fusion results, the process of generating conversion recognition information containing basic text, emotion tags, and scene notes is as follows: The basic text is the semantic content corrected through cross-modal verification, that is, the complete text formed by integrating deep semantic information and cross-modal correction results, covering the core semantics expressed by the user, such as "the key to solving this problem is the oxidative properties of the element". The emotion tag consists of a timestamp and the corresponding emotion type. The timestamp is the specific time when the emotion occurred, such as "t1" and "t3". The emotion type is the type determined after consistency verification and fusion, such as "confusion" and "understanding". The tag format is "[timestamp: emotion type]", for example, "[t1: Confusion][t3: Understanding]". Scene notes are explanations of domain terms related to the scene type, that is, for professional terms appearing in the basic text, the corresponding definitions or explanations are retrieved from the domain knowledge base. For example, in the teaching scene, the note for "element" is "Definition: a pure substance composed of the same element", and in the medical consultation scene, the note for "fever" is "Definition: a physiological state in which the human body temperature exceeds the normal range of 36.3℃-37.2℃". The basic text, emotion tags, and scene notes are integrated to form the final conversion recognition information. The information contains accurate semantic content, reflects the user's emotional changes, and is supplemented with professional explanations to adapt to the needs of the scene.
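Assembling the three parts can be sketched as below. The tag format follows the "[timestamp: emotion type]" pattern from the text; the dictionary layout of the result and the naive substring lookup of glossary terms are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def format_emotion_tags(tags: List[Tuple[str, str]]) -> str:
    """Render (timestamp, emotion) pairs as "[timestamp: emotion type]"."""
    return "".join(f"[{ts}: {emotion}]" for ts, emotion in tags)


def build_recognition_info(basic_text: str,
                           tags: List[Tuple[str, str]],
                           glossary: Dict[str, str]) -> Dict[str, object]:
    """Combine basic text, emotion tags, and scene notes into one record."""
    # Naive lookup: keep glossary terms that literally occur in the text.
    notes = {term: note for term, note in glossary.items() if term in basic_text}
    return {
        "basic_text": basic_text,
        "emotion_tags": format_emotion_tags(tags),
        "scene_notes": notes,
    }


info = build_recognition_info(
    "the key to solving this problem is the oxidative properties of the element",
    [("t1", "Confusion"), ("t3", "Understanding")],
    {"element": "Definition: a pure substance composed of the same element"},
)
print(info["emotion_tags"])  # [t1: Confusion][t3: Understanding]
```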
[0071] In one embodiment, as shown in Figure 2, an AI-based multimodal big data analysis and processing device is provided. This device corresponds one-to-one with the AI-based multimodal big data analysis and processing method in the above embodiments. The device includes a processing module, an analysis module, a fusion module, and a generation module. Detailed descriptions of each functional module are as follows:
[0072] The processing module is used to collect multimodal data and preprocess the multimodal data based on modal configuration parameters;
[0073] The analysis module is used to acquire deep semantic information, facial emotion sequences, and voice emotion sequences, and to perform consistency verification on the correlation between the deep semantic information, the facial emotion sequences, and the voice emotion sequences at the same timestamp.
[0074] The fusion module is used to determine dynamic weights based on the speech modality confidence, image modality confidence, and auxiliary data confidence, and to fuse deep semantic information, facial emotion sequences, and speech emotion sequences using the dynamic weights.
[0075] The generation module is used to correct semantically ambiguous or erroneous parts in the fusion result based on cross-modal verification, and to generate transformation recognition information based on the corrected fusion result.
[0076] Optionally, in the fusion module, the speech modality confidence and the image modality confidence are each calculated by a formula, and the auxiliary data confidence is the similarity between the text draft and the speech semantics, calculated based on the cosine distance between Bidirectional Encoder Representations from Transformers (BERT) vectors;
[0077] where, for the speech modality confidence, the scene coefficient is the ratio of the number of matched domain terms to the total number of words, and the speech rate stability is 1 minus the ratio of the standard deviation of speech rate to the average speech rate; for the image modality confidence, the clear frame percentage is the ratio of the number of unblurred frames to the total number of frames, and the emotional fluctuation anomaly value is the ratio of the absolute value of the difference between the current emotion and the historical emotion to a preset threshold.
[0078] Optionally, in the fusion module, the dynamic weights include a speech weight, an image weight, and an auxiliary data weight, each calculated by a formula from the speech modality confidence, the image modality confidence, and the auxiliary data confidence;
[0079] The cross-modal verification includes: if the words recognized by speech recognition do not match the facial lip features, then the domain knowledge base is invoked to perform word replacement; if there is semantic ambiguity, then a unique interpretation is determined by combining the scene type.
[0080] Specific limitations regarding the AI-based multimodal big data analysis and processing device can be found in the limitations of the AI-based multimodal big data analysis and processing method described above, and will not be repeated here. Each module in the aforementioned AI-based multimodal big data analysis and processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.
[0081] Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A multimodal big data analysis and processing method based on artificial intelligence, characterized in that it comprises: collecting multimodal data and preprocessing the multimodal data based on modal configuration parameters; acquiring deep semantic information, facial emotion sequences, and voice emotion sequences, and performing consistency verification on the correlation between the deep semantic information, facial emotion sequences, and voice emotion sequences at the same timestamp; determining dynamic weights based on speech modality confidence, image modality confidence, and auxiliary data confidence, and using these dynamic weights to fuse the deep semantic information, facial emotion sequences, and speech emotion sequences; wherein the speech modality confidence is calculated by a formula in which the scene coefficient is the ratio of the number of matched domain terms to the total number of words, and the speech rate stability is 1 minus the ratio of the speech rate standard deviation to the average speech rate; the image modality confidence is calculated by a formula in which the clear frame percentage is the ratio of the number of unblurred frames to the total number of frames, and the emotional fluctuation anomaly value is the ratio of the absolute value of the difference between the current emotion and the historical emotion to a preset threshold; the auxiliary data confidence is the similarity between the text draft and the speech semantics, calculated based on the cosine distance of BERT vectors; if the auxiliary data does not include the text draft, then... corresponding processing; the auxiliary data includes the user-inputted text draft and the user's historical interaction records stored in the system; and, based on cross-modal verification, correcting semantically ambiguous or erroneous parts in the fusion result, and generating transformation recognition information based on the corrected fusion result.
2. The multimodal big data analysis and processing method based on artificial intelligence according to claim 1, characterized in that the process of generating the deep semantic information includes: performing preliminary recognition on the denoised speech data based on a pre-trained speech recognition model to generate initial text, and combining the initial text with a domain knowledge base corresponding to the scene type to correct and complete the text, thereby obtaining the deep semantic information.
3. The multimodal big data analysis and processing method based on artificial intelligence according to claim 1, characterized in that the process of generating the facial emotion sequence includes: extracting the temporal motion trajectory of facial feature points in image data based on optical flow, and matching the temporal motion trajectory with a preset emotion template through a dynamic time warping algorithm to obtain the facial emotion sequence.
4. The multimodal big data analysis and processing method based on artificial intelligence according to claim 1, characterized in that the process of generating the speech emotion sequence includes: extracting the fundamental frequency and syllable interval time of the speech data, and inputting them into an emotion classification model based on a gated recurrent unit network to obtain the speech emotion sequence.
5. The multimodal big data analysis and processing method based on artificial intelligence according to claim 1, characterized in that the consistency verification includes: if there is a contradiction between facial emotion, voice emotion, and deep semantic information under the same timestamp, the scene rule base is invoked to make corrections.
6. The multimodal big data analysis and processing method based on artificial intelligence according to claim 1, characterized in that the dynamic weights include a speech weight, an image weight, and an auxiliary data weight, each calculated by a formula from the speech modality confidence, the image modality confidence, and the auxiliary data confidence; the cross-modal verification includes: if the words recognized by speech recognition do not match the facial lip features, then the domain knowledge base is invoked to perform word replacement; if there is semantic ambiguity, then a unique interpretation is determined by combining the scene type.
7. A multimodal big data analysis and processing device based on artificial intelligence, characterized in that it comprises: a processing module, used to collect multimodal data and preprocess the multimodal data based on modal configuration parameters; an analysis module, used to acquire deep semantic information, facial emotion sequences, and voice emotion sequences, and to perform consistency verification on the correlation between the deep semantic information, the facial emotion sequences, and the voice emotion sequences at the same timestamp; a fusion module, used to determine dynamic weights based on speech modality confidence, image modality confidence, and auxiliary data confidence, and to fuse the deep semantic information, facial emotion sequences, and speech emotion sequences using these dynamic weights; wherein the speech modality confidence is calculated by a formula in which the scene coefficient is the ratio of the number of matched domain terms to the total number of words, and the speech rate stability is 1 minus the ratio of the speech rate standard deviation to the average speech rate; the image modality confidence is calculated by a formula in which the clear frame percentage is the ratio of the number of unblurred frames to the total number of frames, and the emotional fluctuation anomaly value is the ratio of the absolute value of the difference between the current emotion and the historical emotion to a preset threshold; the auxiliary data confidence is the similarity between the text draft and the speech semantics, calculated based on the cosine distance of BERT vectors; if the auxiliary data does not include the text draft, then... corresponding processing; the auxiliary data includes the user-inputted text draft and the user's historical interaction records stored in the system; and a generation module, used to correct semantic ambiguities or errors in the fusion result based on cross-modal verification, and to generate transformation recognition information based on the corrected fusion result.
8. The multimodal big data analysis and processing device based on artificial intelligence according to claim 7, characterized in that, in the fusion module, the dynamic weights include a speech weight, an image weight, and an auxiliary data weight, each calculated by a formula from the speech modality confidence, the image modality confidence, and the auxiliary data confidence; the cross-modal verification includes: if the words recognized by speech recognition do not match the facial lip features, then the domain knowledge base is invoked to perform word replacement; if there is semantic ambiguity, then a unique interpretation is determined by combining the scene type.