Abstract generation method and device based on multi-modal information, equipment and storage medium

By using multimodal information processing combined with speech and image recognition technologies, more accurate and comprehensive AI conference summaries are generated, solving the problems of inaccurate and incomplete summaries in existing technologies.

CN119128133B · Active · Publication Date: 2026-06-23 · PING AN TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PING AN TECH (SHENZHEN) CO LTD
Filing Date
2024-08-29
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing intelligent meeting summarization technologies suffer from inaccurate and incomplete summaries due to the noisy and chaotic environment of multi-person meetings and the limitations of speech recognition systems.

Method used

A multimodal information processing method is adopted: a speech recognition model and an image text recognition model convert the audio and image data into text, the speech recognition text undergoes confidence filtering and is matched against the image recognition text, and a text summary is generated.

Benefits of technology

It improves the accuracy and comprehensiveness of summaries, reduces the impact of speech recognition errors, and generates more coherent, readable, and richer summaries.

✦ Generated by Eureka AI based on patent content.

Abstract

The application belongs to the field of artificial intelligence, and relates to a summary generation method based on multi-modal information, which comprises the following steps: inputting target audio data into a speech recognition model for processing to obtain speech recognition text, and performing text recognition on target image data to obtain image recognition text; performing confidence filtering on the speech recognition text to obtain filtered recognition text; extracting target text corresponding to the target image data from the filtered recognition text; and inputting the target text, the target image data and the image recognition text into a summary generation model to generate a text summary. The application also provides a summary generation device and equipment based on multi-modal information and a storage medium. In addition, the application also relates to blockchain technology, and the target audio data and the target image data can be stored in the blockchain. The application can increase the diversity of summaries, improve the efficiency and accuracy of summary generation.

Description

Technical Field

[0001] This application relates to the fields of artificial intelligence, finance and digital healthcare, and in particular to a method, apparatus, device and storage medium for summarizing based on multimodal information.

Background Technology

[0002] With the rapid development of communication technology, online conferencing tools have become an emerging model for cross-enterprise and cross-regional communication and collaboration. Online conferencing has the unique advantage of being unrestricted by time and space, allowing people to conduct discussions anytime, anywhere, greatly improving the efficiency and convenience of online communication. However, online conferencing also generates a large amount of meeting data, often overwhelming users with lengthy or fragmented information. To help users quickly locate the core content from this complex information, summarization technology has emerged.

[0003] Current intelligent meeting summarization technologies in the industry typically involve first converting meeting audio into text using an Automatic Speech Recognition (ASR) system, then processing the text and using Natural Language Processing (NLP) techniques to generate a summary based on the text. However, due to the noisy and chaotic environment of multi-person meetings and the limitations of ASR systems, problems such as inaccurate audio recognition results and limited text information can arise, further leading to inaccurate generated summaries.

[0004] In addition, intelligent meetings often produce meeting-related image materials, such as presentation slides. General meeting summary generation methods do not use this additional information, resulting in incomplete summaries.

Summary of the Invention

[0005] The purpose of this application is to propose a summary generation method, apparatus, device, and storage medium based on multimodal information, so as to solve the technical problems of low accuracy and insufficient comprehensiveness in summaries generated by existing meeting summarization methods.

[0006] To address the aforementioned technical problems, this application provides a summary generation method based on multimodal information, employing the following technical solution:

[0007] Acquire the target audio data and the corresponding target image data;

[0008] The target audio data is input into a trained speech recognition model for processing to obtain speech-recognized text, and the target image data is subjected to text recognition to obtain image-recognized text.

[0009] The speech recognition text is filtered by confidence to obtain the filtered recognition text;

[0010] Based on the image recognition text, extract the target text corresponding to the target image data from the filtered recognition text;

[0011] The target text, the target image data, and the image recognition text are input into a trained summary generation model to generate a text summary.
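The five steps in [0007] to [0011] can be read as a simple pipeline. The following is a minimal sketch only: the function name `generate_summary` and every callable passed to it are hypothetical stand-ins for the trained models described later, not components named in this application.

```python
from typing import Callable, List


def generate_summary(
    audio: bytes,
    images: List[bytes],
    asr: Callable[[bytes], List[str]],
    ocr: Callable[[bytes], str],
    confidence_filter: Callable[[List[str]], List[str]],
    match: Callable[[List[str], List[str]], List[str]],
    summarize: Callable[[List[str], List[bytes], List[str]], str],
) -> str:
    """Chain the five claimed steps; every callable is a hypothetical stand-in."""
    speech_text = asr(audio)                        # [0008] speech recognition
    image_text = [ocr(img) for img in images]       # [0008] image text recognition
    filtered = confidence_filter(speech_text)       # [0009] confidence filtering
    target_text = match(filtered, image_text)       # [0010] extract matching text
    return summarize(target_text, images, image_text)  # [0011] summary generation
```

Passing the models in as callables keeps the sketch independent of any particular speech recognition, OCR, or summarization library.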

[0012] To address the aforementioned technical problems, this application also provides a summary generation device based on multimodal information, employing the following technical solution:

[0013] The acquisition module is used to acquire target audio data and target image data;

[0014] The text recognition module is used to input the target audio data into a trained speech recognition model for processing to obtain speech-recognized text, and to perform text recognition on the target image data to obtain image-recognized text.

[0015] The filtering module is used to perform confidence filtering on the speech recognition text to obtain filtered recognition text;

[0016] The matching module is used to extract target text corresponding to the target image data from the filtered recognition text based on the image recognition text;

[0017] The generation module is used to input the target text, the target image data, and the image recognition text into a trained summary generation model to generate a text summary.

[0018] To address the aforementioned technical problems, this application also provides a computer device that employs the following technical solution:

[0019] The computer device includes a memory and a processor. The memory stores computer-readable instructions, and the processor executes the computer-readable instructions to implement the steps of the summary generation method based on multimodal information as described above.

[0020] To address the aforementioned technical problems, this application also provides a computer-readable storage medium, employing the technical solution described below:

[0021] The computer-readable storage medium stores computer-readable instructions, which, when executed by a processor, implement the steps of the summary generation method based on multimodal information as described above.

[0022] Compared with the prior art, this application has the following main advantages:

[0023] This application provides a summary generation method based on multimodal information. It uses a speech recognition model to recognize the target audio data, obtaining speech recognition text, and performs text recognition on the target image data, obtaining image recognition text. It then performs confidence filtering on the speech recognition text to remove inaccurate or disjointed text segments, improving text accuracy. Finally, based on the image recognition text, it extracts the target text corresponding to the target image data from the filtered recognition text. The target text, target image data, and image recognition text are then input into a trained summary generation model to generate a text summary. This method enriches the content sources of the summary, increases its diversity, and makes the generated summary more comprehensive, further improving the efficiency and accuracy of summary generation.

Description of the Drawings

[0024] To more clearly illustrate the solutions in this application, the accompanying drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0025] Figure 1 is an exemplary system architecture diagram to which this application can be applied;

[0026] Figure 2 is a flowchart of an embodiment of the summary generation method based on multimodal information according to this application;

[0027] Figure 3 is a flowchart of a specific implementation of step S205 in Figure 2;

[0028] Figure 4 is a schematic structural diagram of an embodiment of the summary generation device based on multimodal information according to this application;

[0029] Figure 5 is a schematic structural diagram of an embodiment of the computer device according to this application.

Detailed Implementation

[0030] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein in the specification of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having," and any variations thereof, in the specification, claims, and foregoing drawings of this application, are intended to cover non-exclusive inclusion. The terms "first," "second," etc., in the specification, claims, or foregoing drawings of this application are used to distinguish different objects, not to describe a particular order.

[0031] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0032] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

[0033] The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0034] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.

[0035] This application provides a summary generation method based on multimodal information, which can be applied to, for example, the system architecture 100 shown in Figure 1. The system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as the medium providing communication links between the terminal devices 101, 102, and 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0036] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social media platform software, etc.

[0037] Terminal devices 101, 102, and 103 can be various electronic devices that have a display and support web browsing, including but not limited to smartphones, tablets, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptops, and desktop computers.

[0038] Server 105 can be a server that provides various services, such as a backend server that supports the pages displayed on terminal devices 101, 102, and 103.

[0039] It should be noted that the summarization method based on multimodal information provided in this application is generally executed by a server / terminal device, and correspondingly, the summarization device based on multimodal information is generally located in the server / terminal device.

[0040] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0041] Continuing to refer to Figure 2, a flowchart of an embodiment of the summary generation method based on multimodal information according to this application is shown. The method includes the following steps:

[0042] Step S201: Obtain target audio data and target image data.

[0043] The target audio data is the voice signals of the participants in the intelligent meeting for which a meeting summary is to be generated, and the target image data is the visual content, such as documents, slides, and charts, that the participants may share.

[0044] In this embodiment, the electronic device on which the summary generation method based on multimodal information runs (e.g., the server / terminal device shown in Figure 1) can acquire the target audio data and target image data via a wired or wireless connection. It should be noted that the aforementioned wireless connection methods may include, but are not limited to, 3G / 4G, WiFi, Bluetooth, WiMAX, Zigbee, UWB (ultra-wideband), and other currently known or future wireless connection methods.

[0045] It should be emphasized that, to further ensure the privacy and security of the target audio and image data, the aforementioned target audio and image data can also be stored in a blockchain node.

[0046] The blockchain referred to in this application is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked together using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
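The idea of "a chain of data blocks linked together using cryptographic methods" can be illustrated with a toy hash chain. This is a sketch under strong simplifying assumptions: it has no network, consensus mechanism, or platform layers, and the names `make_block` and `verify_chain` and the choice to store only SHA-256 digests of the audio and image data are hypothetical.

```python
import hashlib
import json


def _digest(payload: dict, prev_hash: str) -> str:
    """Hash a block body deterministically (sorted keys, UTF-8 JSON)."""
    body = {"payload": payload, "prev_hash": prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def make_block(payload: dict, prev_hash: str) -> dict:
    """Create a block whose hash covers its payload and the previous block's hash."""
    return {"payload": payload, "prev_hash": prev_hash, "hash": _digest(payload, prev_hash)}


def verify_chain(chain: list) -> bool:
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        if block["hash"] != _digest(block["payload"], block["prev_hash"]):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True


# Example: record digests of the meeting audio and a shared slide.
genesis = make_block({"audio_sha256": hashlib.sha256(b"audio").hexdigest()}, "0" * 64)
second = make_block({"image_sha256": hashlib.sha256(b"slide").hexdigest()}, genesis["hash"])
```

Because each block's hash covers the previous block's hash, altering any stored digest invalidates the chain from that block onward.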

[0047] Step S202: Input the target audio data into the trained speech recognition model for processing to obtain speech recognition text, and perform text recognition on the target image data to obtain image recognition text.

[0048] In this embodiment, speech recognition is performed on the target audio data and image recognition is performed on the target image data, enriching the sources of text content and resulting in more comprehensive text.

[0049] The trained speech recognition model is an end-to-end speech recognition model based on deep learning, which can improve the efficiency and accuracy of speech recognition. Image text recognition can employ existing OCR technology or a pre-trained, user-built image text recognition model.

[0050] In some optional embodiments, the speech recognition model includes a recognition and segmentation layer, an acoustic feature extraction layer, an acoustic unit recognition layer, and a speech recognition layer.

[0051] The step of inputting the target audio data into a trained speech recognition model for processing to obtain speech-recognized text includes:

[0052] The target audio data is input into the recognition and segmentation layer, the speech segmentation endpoints are determined based on the current segmentation parameters, and the target audio data is segmented according to the speech segmentation endpoints to generate a speech frame sequence;

[0053] The speech frame sequence is input into the acoustic feature extraction layer for feature extraction to obtain acoustic features;

[0054] Acoustic features are identified through an acoustic unit identification layer to obtain an acoustic unit sequence;

[0055] The acoustic unit sequence is processed by the speech recognition layer to obtain the speech recognition text.

[0056] In this embodiment, the target audio data may include multiple frames of sound signals and can therefore be divided into frames. Specifically, the target audio data is input into the recognition and segmentation layer, which divides it into speech frames and applies windowing to each frame. The windowed speech frames are then classified. Frame types include, but are not limited to, unvoiced sound, voiced sound, noise, and silence; unvoiced and voiced frames are valid speech frames, that is, the parts on which speech recognition needs to be performed. For each speech frame, the recognition and segmentation layer determines the probabilities that the frame is unvoiced sound, voiced sound, noise, or silence, derives from these the probability that the frame is a valid speech frame, and then determines the start and end points of each speech segment in the audio data, that is, the speech segmentation endpoints. The audio data to be processed at the current moment is segmented according to the speech segmentation endpoints and the current segmentation parameters to generate the sequence of speech frames to be recognized.

[0057] Among them, the current segmentation parameter is determined according to the speech rate of the current audio data, and the speech rate is the ratio of the number of words contained in the current audio data to the duration of the current audio data.
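The endpoint logic of [0056] and the speech-rate definition of [0057] can be sketched as follows. The speech rate formula (words divided by duration) is taken from the text; the mapping from speech rate to a minimum silence length (`min_silence_frames`), its default values, and the 0.5 validity threshold are assumptions made purely for illustration.

```python
from typing import List, Tuple


def speech_rate(word_count: int, duration_s: float) -> float:
    """Words per second, as defined in [0057]."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return word_count / duration_s


def min_silence_frames(rate: float, base_frames: int = 30, ref_rate: float = 3.0) -> int:
    """Hypothetical segmentation parameter: faster speech needs a shorter pause to end a segment."""
    return max(5, round(base_frames * ref_rate / max(rate, 1e-6)))


def find_endpoints(
    valid_prob: List[float], silence_frames: int, threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech segments, with end exclusive.

    A segment closes once `silence_frames` consecutive frames fall below
    `threshold`, where valid_prob[i] is the probability that frame i is valid speech.
    """
    segments = []
    start = None
    silent_run = 0
    for i, p in enumerate(valid_prob):
        if p >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= silence_frames:
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, len(valid_prob) - silent_run))
    return segments
```

With this assumed mapping, a speaker at 6 words per second closes a segment after half as many silent frames as a speaker at 3 words per second, which is one way the endpoints could adapt to speech rate.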

[0058] The acoustic feature extraction layer extracts the acoustic features of each speech frame. Common forms of acoustic feature representation include Fbank (filter bank) feature vectors and MFCC (Mel-frequency cepstral coefficient) feature vectors.
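Both Fbank and MFCC features rest on a bank of filters spaced evenly on the mel scale. The sketch below computes only the filter centre frequencies; it assumes the widely used conversion mel = 2595 · log10(1 + f / 700), which this application does not specify, and the function names are illustrative.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (assumed 2595 * log10(1 + f / 700) convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float) -> List[float]:
    """Centre frequencies (Hz) of n_filters filters equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]
```

Fbank features are the log energies of the frame's spectrum under such filters, and MFCCs are obtained by applying a discrete cosine transform to those log energies.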

[0059] Speech is a typical time-series signal. In this embodiment, the acoustic feature extraction layer can adopt an LSTM network, specifically a bidirectional LSTM (Bi-LSTM) layer. The input audio data is processed by one recurrent network in forward order and another in reverse order to obtain two independent hidden-layer representations, which are then combined (by concatenation or addition) into a final hidden-layer representation, that is, the acoustic feature, which is output for subsequent calculations. This acoustic feature therefore carries speech information from both the preceding and the following time steps.
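In symbols, the bidirectional combination described in [0059] can be written as follows. The notation ($x_t$ for the input at time step $t$, arrows for the two directions) is introduced here for illustration and does not appear in the application.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\bigl(x_t, \overrightarrow{h}_{t-1}\bigr), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\bigl(x_t, \overleftarrow{h}_{t+1}\bigr), \qquad
h_t = \bigl[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\bigr]
\quad \text{or} \quad
h_t = \overrightarrow{h}_t + \overleftarrow{h}_t
```

Concatenation doubles the feature dimension, whereas addition keeps it unchanged; the text allows either.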

[0060] In this embodiment, the acoustic unit recognition layer is used to convert the acoustic features output by the acoustic feature extraction layer into acoustic units (such as phonemes or letters), that is, to obtain an acoustic unit sequence composed of at least one acoustic unit. A phoneme is the smallest speech unit. For example, "ni hao", which consists of two syllables with their two tones, can be decomposed into 7 phonemes: "n, i, 3, h, a, o, 3".

[0061] In this embodiment, the speech recognition layer is used to convert the acoustic units in the acoustic unit sequence into natural language text. By processing the acoustic unit sequence through the speech recognition layer, candidate probabilities of at least one candidate text are obtained. The candidate text corresponding to the highest candidate probability is determined as the speech recognition text determined by the acoustic unit sequence. The speech recognition layer can be implemented using a Transformer decoding network.

[0062] Through the recognition and segmentation layer, the speech segmentation endpoints can be dynamically adjusted according to the user's speech rate, improving the accuracy of speech segmentation endpoint detection. Through subsequent processing by the acoustic feature extraction layer, acoustic unit recognition layer, and speech recognition layer, the efficiency and accuracy of speech recognition can be improved.

[0063] In some optional implementations, the image text recognition model includes a text detection layer, an image feature extraction layer, and a feature recognition layer. The text detection layer mainly uses text detection algorithms to locate the text region and obtain a text region layout feature map containing text location information. The image feature extraction layer segments the text region layout feature map and extracts character features to obtain single character features. The feature recognition layer performs text recognition on the single character features to obtain the image-recognized text.

[0064] Step S203: Confidence filtering is performed on the speech recognition text to obtain the filtered recognition text.

[0065] In this embodiment, a trained language model is used to score the speech recognition text, that is, to estimate the probability of each sentence. After scoring, an appropriate confidence threshold is set. This confidence threshold can be adjusted according to the performance of the speech recognition system, the requirements for summary quality, and the actual application scenario. Texts with scores below the threshold are filtered out, and only high-quality texts with scores greater than or equal to the threshold are retained.

[0066] It should be understood that the trained language model is obtained by fine-tuning a pre-trained language model on text data relevant to the specific conference domain. For example, for a conference in the financial field, a pre-trained language model can be fine-tuned on financial text data with a small learning rate so that it learns the textual knowledge of that domain and performs better in the financial field; the same principle applies to the medical field.

[0067] In some alternative implementations, the steps described above for confidence filtering of the speech recognition text to obtain the filtered recognition text include:

[0068] The speech recognition text is input into a trained language model, which includes an embedding layer, multiple encoding layers, and an output layer.

[0069] The text encoding vector is obtained by vector encoding the speech recognition text through the embedding layer;

[0070] By performing attention fusion on the text encoding vectors through multiple encoding layers, text fusion features are obtained;

[0071] The output layer classifies and predicts the text fusion features, outputting the probability distribution of each word in the speech recognition text.

[0072] The sentence probability of each sentence in the speech recognition text is calculated based on the probability distribution;

[0073] Output all sentences whose sentence probability is greater than or equal to the preset confidence threshold to obtain the filtered recognition text.

[0074] In this embodiment, the embedding layer is used to perform position embedding, word embedding, and character embedding on the speech recognition text. Position embedding provides information about the position of words in the sentence, which helps to understand the grammatical structure and context of the sentence. Word embedding captures the semantic information of words, enabling the model to understand the meaning of words in context. Character embedding captures character-level information, which helps the model to handle unknown words or spelling variations. By incorporating character-level information into the model, the model's ability to process complex data is improved.

[0075] Multiple encoding layers are stacked, for example, 12 layers. Each encoding layer includes a multi-head attention sublayer and a feedforward connection layer. The output of each encoding layer serves as the input to the next layer, passing textual information between encoding layers to achieve a more comprehensive understanding of the input text. Specifically, the multi-head attention sublayer performs multi-scale attention calculations as follows:

[0076] $$H_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

[0077] where $Q_i = H_{i-1} W_q$, $K_i = H_{i-1} W_k$, and $V_i = H_{i-1} W_v$; $Q_i$ represents the query vector of the i-th head, $K_i$ represents the key vector of the i-th head, $V_i$ represents the value vector of the i-th head, $H_i$ represents the masked self-attention output of the i-th head, and $d_k$ represents the dimension of the key vector.

[0078] By merging self-attention, the calculation formula for multi-head attention is as follows:

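[0079] $$\mathrm{MultiHead} = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_A)\, W_0$$

[0080] where Concat represents the matrix concatenation function, $\mathrm{head}_i$ is the self-attention output of the i-th head, and $W_0$ represents the parameter matrix used to compress the concatenated self-attention heads.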
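The multi-head computation in [0075] to [0080] can be sketched in plain Python. This assumes the standard scaled dot-product form for each head and omits any attention mask, since the application does not define one; the function names and the list-of-lists matrix representation are illustrative choices.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(h_prev: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """One head: project the input to Q, K, V, then apply scaled dot-product attention."""
    q, k, v = matmul(h_prev, w_q), matmul(h_prev, w_k), matmul(h_prev, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def multi_head(h_prev: Matrix, heads: List[Tuple[Matrix, Matrix, Matrix]], w_0: Matrix) -> Matrix:
    """Concatenate the per-head outputs along the feature axis and project with w_0."""
    outputs = [attention_head(h_prev, *w) for w in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(h_prev))]
    return matmul(concat, w_0)
```

Each row of the softmax weights sums to 1, so every output position is a weighted average of the value vectors before the final projection.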

[0081] The feedforward connection layer is used to enhance the connection of text fusion features and prevent underfitting. Residual connections and layer normalization are performed after each multi-head attention sublayer and feedforward connection layer to help gradient flow and reduce degradation problems during training.

[0082] In this embodiment, the output layer includes a fully connected layer and a softmax layer. After the text fusion features are weighted, summed, and nonlinearly transformed by the fully connected layer, they are converted into a probability distribution by the softmax layer, and the probability distribution of each word in the speech recognition text is output.

[0083] The probabilities of the words in a sentence are summed, or weighted and summed according to the weight of each word, to obtain the sentence probability. The sentence probability is compared with the preset confidence threshold. If the sentence probability is greater than or equal to the preset confidence threshold, the sentence is retained; if it is less than the preset confidence threshold, the sentence is removed. The remaining sentences form the filtered recognition text.
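A minimal sketch of the thresholding in [0083], assuming the per-word probabilities have already been produced by the language model. One deliberate deviation is labeled here: the weighted sum is divided by the total weight so that scores of sentences with different lengths remain comparable, which the text does not state. The names `sentence_score` and `confidence_filter` and the default threshold of 0.6 are illustrative.

```python
from typing import List, Optional, Sequence


def sentence_score(word_probs: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of per-word probabilities (uniform weights by default).

    Normalizing by the total weight is an assumption, not taken from the source.
    """
    if not word_probs:
        return 0.0
    if weights is None:
        weights = [1.0] * len(word_probs)
    total = sum(weights)
    return sum(p * w for p, w in zip(word_probs, weights)) / total


def confidence_filter(
    sentences: List[str], word_probs: List[List[float]], threshold: float = 0.6
) -> List[str]:
    """Keep sentences whose score is greater than or equal to the confidence threshold."""
    return [s for s, probs in zip(sentences, word_probs) if sentence_score(probs) >= threshold]
```

In practice the threshold would be tuned as described in [0065], trading off how much noisy text is removed against how much valid content is lost.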

[0084] By introducing a confidence filtering method, inaccurate and non-fluent text fragments are removed, improving the accuracy and effectiveness of the text. In addition, filtering out low-confidence text reduces the impact of speech recognition errors on summary generation, thus improving the accuracy of the summary.

[0085] Step S204: Extract the target text corresponding to the target image data from the filtered recognition text based on the image recognition text.

[0086] Specifically, a text matching algorithm is used to determine the text segments in the filtered recognition text that correspond to the image recognition text; these text segments are then output as the target text corresponding to the target image data.

[0087] Text matching algorithms include string matching, edit distance, word vector similarity, substring search, and regular expression matching. By using a text matching algorithm to find the text segments in the filtered recognition text that correspond to the image recognition text, these segments are associated with the target image data, making the text content more comprehensive and richer.
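As one possible instance of the matching in [0086] and [0087], the sketch below combines substring search with a sequence-similarity ratio from Python's standard `difflib` module, used here in place of a true edit distance. The function names and the 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def extract_target_text(
    filtered_sentences: List[str], image_lines: List[str], threshold: float = 0.8
) -> List[str]:
    """Keep spoken sentences that contain, or closely resemble, a line of image text."""
    target = []
    for sentence in filtered_sentences:
        for line in image_lines:
            if not line.strip():
                continue  # an empty OCR line would match every sentence
            if line.lower() in sentence.lower() or similarity(sentence, line) >= threshold:
                target.append(sentence)
                break
    return target
```

For example, with the slide text "third quarter", the sentence "Revenue grew in the third quarter" is kept by the substring test, while an unrelated sentence such as "Let us take a short break" is dropped.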

[0088] Step S205: Input the target text, target image data, and image recognition text into the trained summary generation model to generate a text summary.

[0089] The target text, target image data, and image recognition text are input into the trained summary generation model. The summary generation model performs text prediction based on the features fused from the multi-modal information of the target text, target image data, and image recognition text to obtain a text summary.

[0090] In some optional implementations, the summary generation model includes an image feature extraction layer, a text feature extraction layer, a feature fusion layer, an encoder layer, and a decoder layer. The target text, target image data, and image recognition text are processed by the image feature extraction layer, text feature extraction layer, feature fusion layer, encoder layer, and decoder layer to generate the corresponding text summary.

[0091] In this embodiment, as shown in Figure 3, the step of inputting the target text, target image data, and image recognition text into the trained summary generation model to generate a text summary includes:

[0092] Step S301: Input the target text and image recognition text into the text feature extraction layer to obtain text features.

[0093] In this process, the target text and the image recognition text can be extracted using the same text feature extraction layer, or different sub-layers can be set in the text feature extraction layer to extract features from the target text and the image recognition text respectively.

[0094] For example, the text feature extraction layer includes a first feature extraction sub-layer and a second feature extraction sub-layer, each of which uses a Transformer encoder to extract features, yielding first text features and second text features, respectively. Both sets of features are computed through an attention mechanism and are then concatenated to obtain the final text features.

[0095] When the text feature extraction layer is a single layer, the extracted text features are directly output.

[0096] Step S302: Input the target image data into the image feature extraction layer to obtain image features.

[0097] The image feature extraction layer can be implemented using a common VGG neural network, or it can be implemented using a pre-configured neural network.

[0098] In some optional implementations, the image feature extraction layer includes an image segmentation network, a residual network (ResNet), and an attention network (Attention). The image segmentation network segments the target image data to obtain segmented images; ResNet performs hierarchical feature extraction on the segmented images to obtain image features at different scales; and Attention performs attention fusion on the image features at different scales to obtain the final image features.

[0099] Step S303: The text features and image features are fused through the feature fusion layer to obtain multimodal fused features.

[0100] Specifically, a shared representation method is used to fuse the text features and the image features. For example, a fully connected layer can be used as a mapping function to map the text features and the image features into a shared space, yielding the shared text representations t1' (the feature representation of the target text) and t2' (the feature representation of the image recognition text) and the shared image representation g'. These are concatenated to obtain a higher-dimensional fusion representation S = [t1', t2', g'].
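The shared-space mapping and concatenation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the feature dimensions, the random initialisation, the helper names (`linear`, `init_layer`, `fuse`), and the choice to reuse one text mapping for both text inputs are all assumptions made for the example.

```python
import random

def linear(x, weight, bias):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_layer(in_dim, out_dim, rng):
    """Hypothetical random initialisation; a trained model would load learned weights."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def fuse(t1, t2, g, text_layer, image_layer):
    """Map each feature vector into the shared space, then concatenate the results."""
    t1_shared = linear(t1, *text_layer)   # target text features
    t2_shared = linear(t2, *text_layer)   # image recognition text features
    g_shared = linear(g, *image_layer)    # image features
    return t1_shared + t2_shared + g_shared  # list concatenation

rng = random.Random(0)
shared_dim = 4
text_layer = init_layer(6, shared_dim, rng)   # text features assumed 6-dimensional
image_layer = init_layer(8, shared_dim, rng)  # image features assumed 8-dimensional

t1 = [rng.random() for _ in range(6)]
t2 = [rng.random() for _ in range(6)]
g = [rng.random() for _ in range(8)]
S = fuse(t1, t2, g, text_layer, image_layer)
print(len(S))  # three shared vectors of size 4 concatenated
```

Concatenation keeps the three sources distinguishable by position, so the subsequent encoder can learn how much weight each one deserves.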

[0101] Step S304: Attention calculation is performed on the multimodal fusion features through the encoder layer to obtain semantically enhanced features.

[0102] The encoder layer adopts the Transformer encoder architecture, which includes an encoding embedding layer and multiple stacked encoders. Each encoder includes a multi-head attention sublayer and a feedforward network sublayer.

[0103] In this embodiment, the step of performing attention calculation on the multimodal fusion features through the encoder layer to obtain semantically enhanced features includes:

[0104] The multimodal fusion features are vector-embedded through the encoding embedding sublayer to obtain the fusion feature vector;

[0105] The weighted fusion features are obtained by performing self-attention calculation on the fused feature vector through a multi-head attention sublayer.

[0106] The weighted fusion features are subjected to residual connection and layer normalization to obtain normalized fusion features;

[0107] The normalized fusion features are input into the feedforward network sublayer to obtain the enhanced fusion features;

[0108] The enhanced fusion features are subjected to residual connection and layer normalization to obtain semantically enhanced features.

[0109] The encoding embedding layer transforms multimodal fusion features into vectors. The stacked encoders consist of six layers. Multi-head attention allows focus on different positions within the fusion feature vector, enhancing the ability to capture features from different locations and resulting in weighted fusion features incorporating contextual information. Residual connections and layer normalization of the weighted fusion features help address the vanishing gradient problem in deep networks. After linear transformation using the weight matrix in the feedforward sublayers, a non-linear activation function is applied, increasing the model's non-linearity and allowing it to learn more complex feature representations, resulting in enhanced fusion features. These enhanced fusion features are then processed through residual connections and layer normalization to obtain semantically enhanced features.
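As a rough illustration of the encoder steps above (multi-head self-attention, residual connection with layer normalization, feedforward sublayer, and a second residual connection with layer normalization), the following sketch runs a toy sequence through six stacked blocks. It is a simplification under stated assumptions: the inputs are taken to be already embedded, the attention heads use no learned projections, the feedforward weights are arbitrary fixed numbers shared by all six blocks, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq (no learned projections)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out

def multi_head_attention(seq, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate the heads."""
    d = len(seq[0])
    assert d % num_heads == 0
    hd = d // num_heads
    heads = [self_attention([x[h * hd:(h + 1) * hd] for x in seq]) for h in range(num_heads)]
    return [[v for head in heads for v in head[i]] for i in range(len(seq))]

def feed_forward(x, w1, w2):
    """Linear transformation, ReLU activation, then a second linear transformation."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def encoder_block(seq, num_heads, w1, w2):
    attended = multi_head_attention(seq, num_heads)                   # weighted fusion features
    normed = [layer_norm(add(x, a)) for x, a in zip(seq, attended)]   # residual + layer norm
    enhanced = [feed_forward(x, w1, w2) for x in normed]              # enhanced fusion features
    return [layer_norm(add(x, e)) for x, e in zip(normed, enhanced)]  # residual + layer norm

d_model, d_hidden = 4, 8
# Arbitrary illustrative weights; a real stack would learn separate weights per block.
w1 = [[0.1 * ((i + j) % 3 - 1) for j in range(d_model)] for i in range(d_hidden)]
w2 = [[0.1 * ((i * j) % 3 - 1) for j in range(d_hidden)] for i in range(d_model)]
features = [[0.2, -0.1, 0.4, 0.3], [0.0, 0.5, -0.2, 0.1], [0.3, 0.3, 0.1, -0.4]]
for _ in range(6):  # six stacked encoders, as in the description
    features = encoder_block(features, num_heads=2, w1=w1, w2=w2)
```

Because every block ends in layer normalization, each output vector has approximately zero mean regardless of depth, which is one reason the stack remains numerically stable.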

[0110] The encoder layer can effectively handle long sequences and cross-modal information. Through the self-attention mechanism, it can capture long-distance dependencies in the sequence. The multi-head attention mechanism can learn the features of the data from different perspectives, providing richer feature representations. Residual connections and layer normalization help learn deeper feature representations and improve the generalization ability of the model.

[0111] Step S305: The semantic enhancement features are processed by the decoder layer to generate text and output a text summary.

[0112] The decoder layer adopts the Transformer decoder architecture, which includes a position embedding layer, multiple stacked decoders, and an output layer. Each decoder includes a masked multi-head attention sublayer, a multi-head attention sublayer, and a feedforward network sublayer. The semantically enhanced features are decoded through the decoder layer, and, combined with the attention mechanism, the text summary is generated step by step.

[0113] In some optional implementations, the decoder layer includes a position embedding layer, a masked multi-head attention sublayer, a multi-head attention sublayer, a feedforward network sublayer, and an output layer. The semantically enhanced features are passed sequentially through the masked multi-head attention sublayer, the multi-head attention sublayer, the feedforward network sublayer, and the output layer to generate the text summary step by step.

[0114] In this embodiment, the step of generating text from the semantically enhanced features through the decoder layer and outputting a text summary includes:

[0115] The semantically enhanced features are input into the position embedding layer to obtain the position encoding vector;

[0116] The position encoding vector and the semantically enhanced features are input into the masked multi-head attention sublayer to obtain local attention features;

[0117] Global attention features are obtained by performing multi-head attention calculation on local attention features through a multi-head attention sublayer.

[0118] The global attention features are input into the feedforward network sublayer for processing to obtain the feedforward enhanced features;

[0119] The feedforward enhanced features are input into the output layer for prediction, and the output is the probability distribution of the predicted text.

[0120] The beam search algorithm is used to search based on the predicted text probability distribution and output the final text summary.

[0121] The position embedding layer provides positional information for the words or image patches in the semantically enhanced feature sequence. It needs to be combined with a mask (i.e., the masked multi-head attention sublayer) to ensure that, at each step of sequence generation, the decoder can see only its previous states and not its future states. The masked multi-head attention sublayer does not capture dependencies on words in later positions, because the decoder cannot know future information during decoding; that is, the words in later positions have not yet been generated. Typically, a mask matrix is used to mark which positions cannot be used to calculate the attention vector. When calculating the attention vector, the values of the positions marked in the mask matrix are set to 0. The multi-head attention sublayer can capture the dependencies between any two words and capture contextual information when calculating the attention vector. The feedforward network sublayer is used to connect and enhance the features output by the multi-head attention sublayer to prevent underfitting.
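The masking behaviour can be illustrated with a small sketch in which each position may attend only to itself and earlier positions, and every masked position receives an attention weight of exactly 0. The function names and the toy score matrix are assumptions for the example; many implementations reach the same result by adding a large negative value to masked scores before the softmax.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; masked positions get weight 0."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores for a sequence of three positions (illustrative values).
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [1.0, 1.0, 1.0]]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
for row in weights:
    print([round(w, 3) for w in row])
```

The first row attends only to position 0, the second to positions 0 and 1, and the last to all three, so no row ever uses information from a position that has not yet been generated.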

[0122] It should be noted that both the multi-head attention sublayer and the feedforward network sublayer contain a residual connection structure, which adds the sublayer output to the sublayer input and then normalizes the sum to obtain the final output of the sublayer.

[0123] The output layer includes a linear layer and a softmax layer. Since the decoder layer outputs a text sequence, the linear layer maps the decoder output into a longer vector, in which the value at each position represents the score of the corresponding text. The softmax layer converts the scores into probabilities and outputs a preset number of texts with the highest probabilities to obtain the probability distribution of each text.
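A minimal sketch of such an output layer is given below: a linear map produces one score per candidate text, a softmax turns the scores into probabilities, and the highest-probability candidates are returned. The four-word vocabulary, the weights, and the function name `output_layer` are hypothetical values chosen only to make the example concrete.

```python
import math

def output_layer(hidden, weight, bias, vocab, top_k):
    """Linear layer followed by softmax; returns the top_k (text, probability) pairs."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

vocab = ["budget", "schedule", "risk", "<eos>"]                 # hypothetical vocabulary
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, -1.0]]     # one row per vocabulary entry
bias = [0.0, 0.0, 0.0, 0.0]
top = output_layer([2.0, 1.0], weight, bias, vocab, top_k=2)
print([word for word, _ in top])
```

Here the two-dimensional decoder output is expanded to four scores, one per vocabulary entry, which mirrors the mapping to a longer vector described above.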

[0124] The Beam Search algorithm is used to expand the search based on the probability distribution to obtain the best text summary.
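The following sketch shows one common form of beam search over next-text probability distributions. It is an illustration under assumptions, not the claimed implementation: the `next_probs` callback, the toy probability table, the `<eos>` end marker, and the absence of length normalization are all choices made for the example.

```python
import math

def beam_search(next_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width best prefixes per step; return (best sequence, log-probability)."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in next_probs(prefix).items():
                if p <= 0.0:
                    continue
                cand = (prefix + (token,), score + math.log(p))
                if token == eos:
                    finished.append(cand)
                else:
                    candidates.append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    pool = finished or beams  # fall back to unfinished beams if nothing ended
    return max(pool, key=lambda c: c[1])

# Hypothetical toy model: the greedy first choice ("the") leads to a worse full sequence.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"plan": 0.5, "team": 0.5},
    ("a",): {"plan": 0.9, "team": 0.1},
}

def toy_next_probs(prefix):
    return TABLE.get(prefix, {"<eos>": 1.0})

best, log_prob = beam_search(toy_next_probs, beam_width=2, max_len=5)
greedy, _ = beam_search(toy_next_probs, beam_width=1, max_len=5)
print(best, greedy)
```

With a beam width of 1 the search degenerates to greedy decoding and commits to "the"; with a width of 2 it keeps "a" alive long enough to find the more probable sequence overall.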

[0125] The decoder layer allows for parallel processing of the entire sequence, significantly accelerating the summarization process. Through a self-attention mechanism, the decoder captures long-range dependencies within the sequence, generating coherent and context-sensitive text. By using masking operations, the decoder cannot see future words when generating each word, helping to avoid error propagation during sequence generation and further improving the accuracy of the generated summaries.

[0126] It should be noted that the speech recognition model, language model, and summary generation model in this application can be trained separately or jointly.

[0127] Taking joint training as an example, a historical conference dataset is obtained, which carries real conference summary labels and includes several historical audio datasets and corresponding historical image datasets. The historical audio dataset is divided into a first training set and a first test set. The historical image data corresponding to the first training set is used as a second training set, and the historical image data corresponding to the first test set is used as a second test set. The first training set is input into a pre-built speech recognition model to obtain predicted speech recognition text. Text recognition is performed on the second training set to obtain historical image recognition text. The predicted speech recognition text is then filtered by confidence using a pre-trained language model to obtain the filtered predicted text. Based on the historical image recognition text, the predicted text corresponding to the historical image dataset is extracted from the filtered predicted text. The predicted text, the second training set, and the predicted speech recognition text are input into the pre-built summary generation model to generate a predicted summary. The loss value between the predicted summary and the real conference summary label is calculated based on the preset loss function, the model parameters are adjusted based on the loss value, and iterative training continues until the model converges, yielding the speech recognition model, the language model, and the summary generation model to be verified. Finally, the first test set and the second test set are used to verify these models, and the final model that meets the conditions is output.

[0128] The speech recognition model can use the CTC loss function L1, the language model can use the perceptual loss function L2, and the summary generation model can use the cross-entropy loss L3. The final loss value is obtained by weighted summation.
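The weighted summation of the three losses can be written as a one-line function. The weight values below are placeholders chosen for the example; the description does not state how the weights are set.

```python
def joint_loss(l1_ctc, l2_language, l3_summary, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three component losses: w1*L1 + w2*L2 + w3*L3."""
    w1, w2, w3 = weights
    return w1 * l1_ctc + w2 * l2_language + w3 * l3_summary

# Hypothetical loss values and weights for one training step.
total = joint_loss(2.0, 1.0, 0.5, weights=(0.5, 0.25, 0.25))
print(total)
```

In joint training, the gradient of this single scalar is what propagates back through all three models, so the weights control how strongly each model's error influences the shared update.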

[0129] Historical audio data in historical audio datasets can be augmented, including time stretching, pitch shifting, volume adjustment, and noise injection.
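Some of these augmentations can be sketched on a raw waveform held as a list of samples in the range [-1, 1]. The functions below are simplified stand-ins with hypothetical names and parameters: in particular, `resample_linear` changes duration and pitch together, whereas independent time stretching or pitch shifting would need a dedicated signal-processing method that is not shown here.

```python
import math
import random

def adjust_volume(samples, gain):
    """Scale the amplitude and clip to the valid range [-1, 1]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]

def inject_noise(samples, noise_std, rng):
    """Add zero-mean Gaussian noise to every sample, then clip."""
    return [max(-1.0, min(1.0, s + rng.gauss(0.0, noise_std))) for s in samples]

def resample_linear(samples, rate):
    """Naive speed change by linear interpolation: rate > 1 shortens, rate < 1 lengthens."""
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

# A short synthetic tone stands in for a historical audio recording.
samples = [0.5 * math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
rng = random.Random(42)

louder = adjust_volume(samples, 1.5)
noisy = inject_noise(samples, 0.01, rng)
slower = resample_linear(samples, 0.5)
faster = resample_linear(samples, 2.0)
print(len(louder), len(noisy), len(slower), len(faster))
```

Applying such transformations with randomly drawn parameters yields several variants of each recording, which exposes the speech recognition model to a wider range of speaking rates, loudness levels, and background noise.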

[0130] By integrating multimodal information through a summary generation model, text summaries are generated, improving the coherence and readability of the summaries and providing more comprehensive, accurate, and richer conference summaries.

[0131] This application generates more accurate summaries by performing confidence filtering on speech recognition text, reducing information loss due to speech recognition errors. By generating text summaries based on multimodal information from target text, target image data, and image recognition text, it improves the coherence and readability of the summaries, making them easier to understand and digest. At the same time, the multimodal information fusion method enables the conference summarization algorithm to acquire, analyze, and integrate information from different perspectives, providing more comprehensive, accurate, and richer conference summaries to meet users' needs for in-depth understanding and summarization of conference content.

[0132] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0133] The summary generation method based on multimodal information provided in this application can be applied to meeting scenarios in the medical and financial fields. For example, in a remote consultation meeting conducted through an intelligent meeting system, audio and image data of the remote consultation meeting can be acquired, and steps S202 to S205 described above can be executed to obtain a meeting summary of the remote consultation meeting.

[0134] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing related hardware with computer-readable instructions. These computer-readable instructions can be stored in a computer-readable storage medium. When executed, the program can include the processes of the embodiments of the above methods. The aforementioned storage medium can be a non-volatile storage medium such as a magnetic disk, optical disk, or read-only memory (ROM), or random access memory (RAM).

[0135] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0136] With further reference to Figure 4, as an implementation of the method shown in Figure 2 above, this application provides an embodiment of a summary generation device based on multimodal information. This device embodiment corresponds to the method embodiment shown in Figure 2, and the device can be specifically applied to various electronic devices.

[0137] As shown in Figure 4, the multimodal information-based summary generation device 400 described in this embodiment includes: an acquisition module 401, a text recognition module 402, a filtering module 403, a matching module 404, and a generation module 405. Wherein:

[0138] The acquisition module 401 is used to acquire target audio data and target image data;

[0139] The text recognition module 402 is used to input the target audio data into a trained speech recognition model for processing to obtain speech recognition text, and to perform text recognition on the target image data to obtain image recognition text;

[0140] The filtering module 403 is used to perform confidence filtering on the speech recognition text to obtain filtered recognition text;

[0141] The matching module 404 is used to extract the target text corresponding to the target image data from the filtered recognition text based on the image recognition text;

[0142] The generation module 405 is used to input the target text, the target image data, and the image recognition text into a trained summary generation model to generate a text summary.
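The way these modules hand data to one another can be sketched as a small pipeline object. Everything here is illustrative: the class name `SummaryDevice`, the callable signatures, and the stub functions are assumptions, and the stubs merely stand in for the trained models.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SummaryDevice:
    recognize_speech: Callable[[bytes], List[str]]            # text recognition module 402 (audio)
    recognize_image_text: Callable[[bytes], str]              # text recognition module 402 (image)
    filter_by_confidence: Callable[[List[str]], List[str]]    # filtering module 403
    match_target_text: Callable[[List[str], str], List[str]]  # matching module 404
    generate_summary: Callable[[List[str], bytes, str], str]  # generation module 405

    def run(self, audio: bytes, image: bytes) -> str:
        """The acquired audio and image (acquisition module 401) flow through modules 402 to 405."""
        speech_text = self.recognize_speech(audio)
        image_text = self.recognize_image_text(image)
        filtered = self.filter_by_confidence(speech_text)
        target = self.match_target_text(filtered, image_text)
        return self.generate_summary(target, image, image_text)

# Stub components standing in for trained models (hypothetical behaviour).
device = SummaryDevice(
    recognize_speech=lambda audio: ["revenue grew", "uh um"],
    recognize_image_text=lambda image: "revenue",
    filter_by_confidence=lambda sents: [s for s in sents if s != "uh um"],
    match_target_text=lambda sents, img_text: [s for s in sents if img_text in s],
    generate_summary=lambda target, image, img_text: img_text + ": " + "; ".join(target),
)
summary = device.run(b"audio-bytes", b"image-bytes")
print(summary)
```

Passing each stage in as a callable keeps the modules independently replaceable, which matches the description's note that the underlying models can be trained separately or jointly.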

[0143] It is important to emphasize that, to further ensure the privacy and security of the target audio and image data, the aforementioned target audio and image data can also be stored in a blockchain node.

[0144] Based on the aforementioned multimodal information-based summary generation device, more accurate summaries can be generated by performing confidence filtering on speech recognition text, reducing information loss caused by speech recognition errors. By generating text summaries based on multimodal information from target text, target image data, and image recognition text, the coherence and readability of the summaries are improved, making them easier to understand and digest. At the same time, the multimodal information fusion method enables the meeting summary algorithm to acquire, analyze, and integrate information from different perspectives, providing more comprehensive, accurate, and richer meeting summaries to meet users' needs for in-depth understanding and summarization of meeting content.

[0145] In some optional implementations, the speech recognition model includes a recognition segmentation layer, an acoustic feature extraction layer, an acoustic unit recognition layer, and a speech recognition layer. The text recognition module 402 also includes:

[0146] A segmentation submodule is used to input the target audio data into the recognition segmentation layer, determine the speech segmentation endpoints based on the current segmentation parameters, segment the target audio data according to the speech segmentation endpoints, and generate a speech frame sequence;

[0147] An acoustic feature extraction submodule is used to input the speech frame sequence into the acoustic feature extraction layer for feature extraction to obtain acoustic features;

[0148] An acoustic recognition submodule is used to identify the acoustic features through the acoustic unit recognition layer to obtain an acoustic unit sequence;

[0149] The speech recognition submodule is used to process the acoustic unit sequence through the speech recognition layer to obtain speech recognition text.

[0150] Through the recognition segmentation layer, the speech segmentation endpoints can be dynamically modified according to the user's speech rate, improving the accuracy of speech segmentation endpoint detection. Through subsequent processing by the acoustic feature extraction layer, acoustic unit recognition layer, and speech recognition layer, the efficiency and accuracy of speech recognition can be improved.

[0151] In some alternative implementations, the filtering module 403 includes:

[0152] The input submodule is used to input the speech recognition text into the trained language model, which includes an embedding layer, multiple encoding layers, and an output layer.

[0153] The first embedding submodule is used to perform vector encoding on the speech recognition text through the embedding layer to obtain a text encoding vector;

[0154] The first encoding submodule is used to perform attention fusion on the text encoding vector through multiple encoding layers to obtain text fusion features;

[0155] The classification and prediction submodule is used to classify and predict the text fusion features through the output layer, and output the probability distribution of each word in the speech recognition text;

[0156] The calculation submodule is used to calculate the sentence probability of each sentence in the speech recognition text based on the probability distribution;

[0157] The filtering submodule is used to output all sentences whose sentence probability is greater than or equal to a preset confidence threshold, so as to obtain the final filtered recognition text.

[0158] By introducing a confidence filtering method, inaccurate and non-fluent text fragments are removed, improving the accuracy and effectiveness of the text. In addition, filtering out low-confidence text reduces the impact of speech recognition errors on summary generation, thus improving the accuracy of the summary.
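The final two steps of the filtering module (computing a sentence probability and applying the threshold) can be sketched as follows. The description does not specify how word probabilities are combined into a sentence probability, so the geometric mean used here is an assumption, as are the function names, the example sentences, and the threshold value.

```python
import math

def sentence_probability(word_probs):
    """Length-normalised sentence probability: geometric mean of the word probabilities."""
    if not word_probs:
        return 0.0
    log_sum = sum(math.log(max(p, 1e-12)) for p in word_probs)
    return math.exp(log_sum / len(word_probs))

def confidence_filter(sentences, threshold):
    """Keep sentences whose probability is greater than or equal to the threshold."""
    return [text for text, probs in sentences if sentence_probability(probs) >= threshold]

# Each entry pairs a recognised sentence with hypothetical per-word probabilities
# that a language model might assign.
recognised = [
    ("The budget was approved.", [0.9, 0.8, 0.95, 0.9]),
    ("Purple idea sleeps loud.", [0.2, 0.1, 0.3, 0.05]),
]
kept = confidence_filter(recognised, threshold=0.5)
print(kept)
```

Normalising by sentence length avoids penalising long sentences simply for containing more words, which a plain product of probabilities would do.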

[0159] In some alternative implementations, the matching module 404 includes:

[0160] The matching submodule is used to determine the text segment in the filtered recognition text that corresponds to the image recognition text using a text matching algorithm;

[0161] The output submodule is used to output the text fragment as the target text corresponding to the target image data.

[0162] By using a text matching algorithm to find the text fragments in the filtered recognition text that correspond to the image recognition text, the text fragments are associated with the target image data, making the text content more comprehensive and richer.
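The description leaves the text matching algorithm open. As one possible instance, the sketch below scores each filtered sentence by the fraction of image-text words it contains and keeps those above a threshold; the word-overlap measure, the function names, and the threshold are assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_target_text(filtered_sentences, image_text, min_overlap=0.5):
    """Return sentences that contain at least min_overlap of the image-text words."""
    image_words = tokenize(image_text)
    if not image_words:
        return []
    matched = []
    for sentence in filtered_sentences:
        overlap = len(tokenize(sentence) & image_words) / len(image_words)
        if overlap >= min_overlap:
            matched.append(sentence)
    return matched

# Hypothetical filtered transcript and text recognised on a presentation slide.
filtered = [
    "Our Q3 revenue grew faster than the forecast.",
    "Let's break for lunch.",
    "Revenue was discussed last week.",
]
target_text = match_target_text(filtered, "Q3 revenue forecast")
print(target_text)
```

Only the first sentence shares at least half of the slide's words, so it alone is output as the target text associated with that image.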

[0163] In some optional implementations, the summary generation model includes an image feature extraction layer, a text feature extraction layer, a feature fusion layer, an encoder layer, and a decoder layer; the generation module 405 includes:

[0164] The text feature extraction submodule is used to input the target text and the image-recognized text into the text feature extraction layer to obtain text features;

[0165] An image feature extraction submodule is used to input the target image data into the image feature extraction layer to obtain image features;

[0166] The fusion submodule is used to fuse the text features and the image features through the feature fusion layer to obtain multimodal fusion features;

[0167] The encoder submodule is used to perform attention calculation on the multimodal fusion features through the encoder layer to obtain semantically enhanced features;

[0168] The generation submodule is used to generate text from the semantic enhancement features through the decoder layer and output a text summary.

[0169] By integrating multimodal information through a summary generation model, text summaries are generated, improving the coherence and readability of the summaries and providing more comprehensive, accurate, and richer conference summaries.

[0170] In some optional implementations of this embodiment, the encoder layer includes an encoding embedding sublayer, a multi-head attention sublayer, and a feedforward network sublayer; the encoder submodule includes:

[0171] The second embedding submodule is used to embed the multimodal fusion features into a vector through the encoding embedding sublayer to obtain a fusion feature vector;

[0172] The multi-head attention submodule is used to perform self-attention calculation on the fused feature vector through the multi-head attention sublayer to obtain weighted fused features;

[0173] The first residual normalization submodule is used to perform residual connection and layer normalization processing on the weighted fusion features to obtain normalized fusion features;

[0174] The enhancement submodule is used to input the normalized fusion features into the feedforward network sublayer to obtain the enhanced fusion features;

[0175] The second residual normalization submodule is used to perform residual connection and layer normalization processing on the enhanced fusion features to obtain semantically enhanced features.

[0176] The encoder layer can effectively process long sequences and cross-modal information. The self-attention mechanism can capture long-distance dependencies in the sequence. The multi-head attention mechanism can learn the features of the data from different perspectives, providing richer feature representations. Residual connections and layer normalization help learn deeper feature representations and improve the generalization ability of the model.

[0177] In some optional implementations of this embodiment, the decoder layer includes a position embedding layer, a masked multi-head attention sublayer, a multi-head attention sublayer, a feedforward network sublayer, and an output layer; the generation submodule includes:

[0178] The position embedding submodule is used to input the semantically enhanced features into the position embedding layer to obtain a position encoding vector;

[0179] The masking submodule is used to input the position encoding vector and the semantically enhanced features into the masked multi-head attention sublayer to obtain local attention features;

[0180] The decoding multi-head attention submodule is used to perform multi-head attention calculation on the local attention features through the multi-head attention sublayer to obtain global attention features;

[0181] The feedforward enhancement submodule is used to input the global attention features into the feedforward network sublayer for processing to obtain feedforward enhancement features;

[0182] The output submodule is used to input the feedforward enhancement features into the output layer for prediction and output the probability distribution of the predicted text.

[0183] The search generation submodule is used to perform a search based on the predicted text probability distribution using a beam search algorithm, and output the final text summary.

[0184] The decoder layer allows for parallel processing of the entire sequence, significantly accelerating the summarization process. Through a self-attention mechanism, the decoder captures long-range dependencies within the sequence, generating coherent and context-sensitive text. By using masking operations, the decoder cannot see future words when generating each word, helping to avoid error propagation during sequence generation and further improving the accuracy of the generated summaries.

[0185] To address the aforementioned technical problems, embodiments of this application also provide a computer device. Please refer to Figure 5, which is a basic structural block diagram of the computer device in this embodiment.

[0186] The computer device 5 includes a memory 51, a processor 52, and a network interface 53 that are interconnected via a system bus. It should be noted that only the computer device 5 with components 51-53 is shown in the figure; however, it should be understood that it is not required to implement all the shown components, and more or fewer components can be implemented alternatively. Those skilled in the art will understand that the computer device described here is a device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

[0187] The computer device can be a desktop computer, laptop, handheld computer, or cloud server, etc. The computer device can interact with the user via a keyboard, mouse, remote control, touchpad, or voice control.

[0188] The memory 51 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 51 may be an internal storage unit of the computer device 5, such as the hard disk or memory of the computer device 5. In other embodiments, the memory 51 may also be an external storage device of the computer device 5, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device 5. Of course, the memory 51 may include both the internal storage unit of the computer device 5 and its external storage device. In this embodiment, the memory 51 is typically used to store the operating system and various application software installed on the computer device 5, such as the computer-readable instructions of the multimodal information-based summary generation method. In addition, the memory 51 can also be used to temporarily store various types of data that have been output or will be output.

[0189] In some embodiments, the processor 52 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chip. The processor 52 is typically used to control the overall operation of the computer device 5. In this embodiment, the processor 52 is used to execute computer-readable instructions stored in the memory 51 or to process data, for example, to execute computer-readable instructions of the multimodal information-based summary generation method.

[0190] The network interface 53 may include a wireless network interface or a wired network interface, which is typically used to establish communication connections between the computer device 5 and other electronic devices.

[0191] This embodiment implements the steps of the multimodal information-based summary generation method described above by executing computer-readable instructions stored in memory through a processor. By performing confidence filtering on the speech recognition text, a more accurate summary can be generated, reducing information loss due to speech recognition errors. By generating a text summary based on multimodal information from the target text, target image data, and image recognition text, the coherence and readability of the summary are improved, making it easier to understand and digest. At the same time, the multimodal information fusion method enables the meeting summary algorithm to acquire, analyze, and integrate information from different perspectives, providing a more comprehensive, accurate, and richer meeting summary, meeting users' needs for in-depth understanding and summarization of meeting content.

[0192] This application also provides another implementation, namely, a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to perform the steps of the multimodal information-based summary generation method described above. By performing confidence filtering on the speech recognition text, a more accurate summary can be generated, reducing information loss due to speech recognition errors. By generating text summaries based on multimodal information of the target text, target image data, and image recognition text, the coherence and readability of the summary are improved, making it easier to understand and digest. At the same time, the multimodal information fusion method enables the meeting summary algorithm to acquire, analyze, and integrate information from different perspectives, providing a more comprehensive, accurate, and richer meeting summary to meet users' needs for in-depth understanding and summarization of meeting content.

[0193] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0194] Obviously, the embodiments described above are only some embodiments of this application, not all embodiments. The accompanying drawings show preferred embodiments of this application, but do not limit the patent scope of this application. This application can be implemented in many different forms and is not limited to the embodiments described herein; rather, the purpose of providing these embodiments is to provide a more thorough and comprehensive understanding of the disclosure of this application. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent substitutions for some of the technical features. Any equivalent structures made using the content of this application's specification and drawings, directly or indirectly applied to other related technical fields, are similarly within the scope of patent protection of this application.

Claims

1. A summary generation method based on multimodal information, characterized in that it comprises the following steps: acquiring target audio data and corresponding target image data; inputting the target audio data into a trained speech recognition model for processing to obtain speech recognition text, and performing text recognition on the target image data to obtain image recognition text; performing confidence filtering on the speech recognition text to obtain filtered recognition text; extracting, based on the image recognition text, target text corresponding to the target image data from the filtered recognition text; and inputting the target text, the target image data, and the image recognition text into a trained summary generation model to generate a text summary; wherein the step of performing confidence filtering on the speech recognition text to obtain filtered recognition text includes: inputting the speech recognition text into a trained language model, the language model including an embedding layer, multiple encoding layers, and an output layer; performing vector encoding on the speech recognition text through the embedding layer to obtain a text encoding vector; performing attention fusion on the text encoding vector through the multiple encoding layers to obtain text fusion features; performing classification prediction on the text fusion features through the output layer to output the probability distribution of each word in the speech recognition text; calculating the sentence probability of each sentence in the speech recognition text based on the probability distribution; and outputting all sentences whose sentence probabilities are greater than or equal to a preset confidence threshold to obtain the final filtered recognition text; and wherein the step of extracting, based on the image recognition text, the target text corresponding to the target image data from the filtered recognition text includes: using a text matching algorithm to determine the text segment in the filtered recognition text that corresponds to the image recognition text; and outputting the text segment as the target text corresponding to the target image data.
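
As an illustrative, non-limiting sketch of the confidence filtering and text matching steps of claim 1, the following Python code scores each sentence from per-word probabilities and keeps only sentences at or above a threshold, then selects the kept sentences that resemble the image recognition text. Several details are assumptions not stated in the claim: the geometric mean is used as the sentence probability, `difflib.SequenceMatcher` stands in for the unspecified text matching algorithm, and `min_ratio` is a hypothetical similarity cutoff. The per-word probabilities are supplied directly here rather than produced by a trained language model.

```python
import math
from difflib import SequenceMatcher


def sentence_probability(token_probs):
    """Length-normalized (geometric-mean) probability of one sentence.

    Assumption: the claim only says the sentence probability is computed
    "based on the probability distribution"; the geometric mean is one
    common choice that does not penalize long sentences.
    """
    if not token_probs:
        return 0.0
    log_sum = sum(math.log(max(p, 1e-12)) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def confidence_filter(sentences, threshold):
    """Keep sentences whose probability is >= the preset threshold.

    `sentences` is a list of (text, per-word probabilities) pairs.
    """
    return [text for text, probs in sentences
            if sentence_probability(probs) >= threshold]


def match_target_text(filtered_sentences, image_text, min_ratio=0.3):
    """Return the filtered sentences that resemble the image recognition text.

    Assumption: SequenceMatcher and the min_ratio cutoff are stand-ins for
    the unspecified text matching algorithm.
    """
    def similarity(sentence):
        return SequenceMatcher(None, sentence.lower(), image_text.lower()).ratio()

    return [s for s in filtered_sentences if similarity(s) >= min_ratio]


if __name__ == "__main__":
    recognized = [
        ("Q3 revenue growth was strong", [0.9, 0.9, 0.9, 0.9, 0.9]),
        ("rev new crew tent", [0.2, 0.3, 0.1, 0.4]),
        ("lunch is at noon", [0.8, 0.9, 0.9, 0.8]),
    ]
    kept = confidence_filter(recognized, threshold=0.5)
    print(match_target_text(kept, "Q3 revenue growth"))
```

In this sketch the threshold comparison uses "greater than or equal to", matching the wording of the claim; a real system would tune both the threshold and the matching rule on held-out meeting data.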

2. The summary generation method based on multimodal information according to claim 1, characterized in that the speech recognition model includes a recognition segmentation layer, an acoustic feature extraction layer, an acoustic unit recognition layer, and a speech recognition layer; and the step of inputting the target audio data into the trained speech recognition model for processing to obtain the speech recognition text includes: inputting the target audio data into the recognition segmentation layer, determining speech segmentation endpoints based on the current segmentation parameters, and segmenting the target audio data according to the speech segmentation endpoints to generate a speech frame sequence; inputting the speech frame sequence into the acoustic feature extraction layer for feature extraction to obtain acoustic features; recognizing the acoustic features through the acoustic unit recognition layer to obtain an acoustic unit sequence; and processing the acoustic unit sequence through the speech recognition layer to obtain the speech recognition text.
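
As an illustrative, non-limiting sketch of the segmentation step of claim 2, the following Python code finds speech endpoints with a simple short-time energy threshold and then cuts each speech segment into overlapping frames. The energy rule, and the parameter names `window`, `energy_threshold`, `frame_len`, and `hop`, are assumptions standing in for the "current segmentation parameters", which the claim does not define.

```python
def find_endpoints(samples, window, energy_threshold):
    """Return (start, end) sample indices of speech segments.

    Assumption: a segment starts when the mean energy of a window reaches
    the threshold and ends when it falls below it again.
    """
    segments, start = [], None
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        energy = sum(x * x for x in chunk) / len(chunk)
        if energy >= energy_threshold and start is None:
            start = i
        elif energy < energy_threshold and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


def to_frames(samples, segments, frame_len, hop):
    """Cut each speech segment into overlapping fixed-length frames."""
    frames = []
    for start, end in segments:
        for i in range(start, end - frame_len + 1, hop):
            frames.append(samples[i:i + frame_len])
    return frames


if __name__ == "__main__":
    audio = [0.0] * 4 + [1.0] * 8 + [0.0] * 4  # silence, speech, silence
    endpoints = find_endpoints(audio, window=4, energy_threshold=0.5)
    frames = to_frames(audio, endpoints, frame_len=4, hop=2)
    print(endpoints, len(frames))
```

The resulting frame sequence is what the acoustic feature extraction layer would consume; feature extraction itself is outside the scope of this sketch.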

3. The summary generation method based on multimodal information according to claim 1, characterized in that the summary generation model includes an image feature extraction layer, a text feature extraction layer, a feature fusion layer, an encoder layer, and a decoder layer; and the step of inputting the target text, the target image data, and the image recognition text into the trained summary generation model to generate a text summary includes: inputting the target text and the image recognition text into the text feature extraction layer to obtain text features; inputting the target image data into the image feature extraction layer to obtain image features; fusing the text features and the image features through the feature fusion layer to obtain multimodal fusion features; performing attention calculation on the multimodal fusion features through the encoder layer to obtain semantically enhanced features; and processing the semantically enhanced features through the decoder layer for text generation to output the text summary.
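
As an illustrative, non-limiting sketch of the feature fusion layer of claim 3, the following Python code projects text features and image features to a shared width and joins them into one sequence. Fusion by projection and sequence concatenation is an assumption; the claim does not specify the fusion operation, and the weight matrices `w_text` and `w_image` are hypothetical stand-ins for trained parameters.

```python
def project(vectors, weights):
    """Multiply each feature vector by a (d_in x d_model) weight matrix."""
    return [[sum(v[i] * weights[i][j] for i in range(len(v)))
             for j in range(len(weights[0]))]
            for v in vectors]


def fuse(text_feats, image_feats, w_text, w_image):
    """Project both modalities to a shared width, then join the sequences.

    Assumption: concatenation along the sequence axis lets a later
    attention layer relate text positions to image positions.
    """
    return project(text_feats, w_text) + project(image_feats, w_image)


if __name__ == "__main__":
    text_feats = [[1.0, 2.0]]          # one text token, width 2
    image_feats = [[3.0]]              # one image patch, width 1
    w_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    w_image = [[1.0, 1.0, 1.0]]
    print(fuse(text_feats, image_feats, w_text, w_image))
```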

4. The summary generation method based on multimodal information according to claim 3, characterized in that the encoder layer includes an encoding embedding sublayer, a multi-head attention sublayer, and a feedforward network sublayer; and the step of performing attention calculation on the multimodal fusion features through the encoder layer to obtain semantically enhanced features includes: performing vector embedding on the multimodal fusion features through the encoding embedding sublayer to obtain a fusion feature vector; performing self-attention calculation on the fusion feature vector through the multi-head attention sublayer to obtain weighted fusion features; performing residual connection and layer normalization on the weighted fusion features to obtain normalized fusion features; inputting the normalized fusion features into the feedforward network sublayer to obtain enhanced fusion features; and performing residual connection and layer normalization on the enhanced fusion features to obtain the semantically enhanced features.
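
As an illustrative, non-limiting sketch of the encoder layer of claim 4, the following pure-Python code applies self-attention, a residual connection with layer normalization, a feedforward network, and a second residual connection with layer normalization, in that order. It is simplified under stated assumptions: a single attention head stands in for the multi-head attention sublayer, the embedding sublayer is omitted (the input is taken to be already embedded), biases and learned normalization gains are left out, and all weight matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and (near) unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU in between."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def encoder_layer(x, wq, wk, wv, w1, w2):
    attended = self_attention(x, wq, wk, wv)   # weighted fusion features
    normed = layer_norm(add(x, attended))      # residual + layer normalization
    enhanced = feed_forward(normed, w1, w2)    # enhanced fusion features
    return layer_norm(add(normed, enhanced))   # semantically enhanced features


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    print(encoder_layer(eye, eye, eye, eye, eye, eye))
```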

5. The summary generation method based on multimodal information according to claim 3, characterized in that the decoder layer includes a position embedding layer, a masked multi-head attention sublayer, a multi-head attention sublayer, a feedforward network sublayer, and an output layer; and the step of processing the semantically enhanced features through the decoder layer for text generation to output the text summary includes: inputting the semantically enhanced features into the position embedding layer to obtain a position encoding vector; inputting the position encoding vector and the semantically enhanced features into the masked multi-head attention sublayer to obtain local attention features; performing multi-head attention calculation on the local attention features through the multi-head attention sublayer to obtain global attention features; inputting the global attention features into the feedforward network sublayer for processing to obtain feedforward enhancement features; inputting the feedforward enhancement features into the output layer for prediction to output a predicted text probability distribution; and searching with a beam search algorithm based on the predicted text probability distribution to output the final text summary.
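
As an illustrative, non-limiting sketch of the final search step of claim 5, the following Python code runs a beam search over a predicted next-token probability distribution. The callback `next_token_probs`, the end token `<eos>`, and the toy probability table are hypothetical; in the claimed method the distribution would come from the output layer of the decoder.

```python
import math


def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Return the highest-scoring token sequence found by beam search.

    `next_token_probs(prefix)` maps a tuple of tokens to a dict of
    {token: probability}; it is a stand-in for the decoder output layer.
    """
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished; carry forward
                continue
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width best partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return list(beams[0][0])


if __name__ == "__main__":
    # Toy distribution in which the locally best first token is not on
    # the globally best path.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "<eos>": 0.1},
    }

    def model(prefix):
        return table.get(prefix, {"<eos>": 1.0})

    print(beam_search(model, beam_width=2, max_len=3))
```

With a beam width of 1 the search reduces to greedy decoding and commits to the token "a"; with a width of 2 it retains the "b" branch long enough to find the higher-probability sequence, which is the usual motivation for beam search over greedy selection.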

6. A summary generation device based on multimodal information, characterized in that it comprises: an acquisition module, used to acquire target audio data and corresponding target image data; a text recognition module, used to input the target audio data into a trained speech recognition model for processing to obtain speech recognition text, and to perform text recognition on the target image data to obtain image recognition text; a filtering module, used to perform confidence filtering on the speech recognition text to obtain filtered recognition text; a matching module, used to extract, based on the image recognition text, target text corresponding to the target image data from the filtered recognition text; and a generation module, used to input the target text, the target image data, and the image recognition text into a trained summary generation model to generate a text summary; wherein the filtering module includes: an input submodule, used to input the speech recognition text into a trained language model, the language model including an embedding layer, multiple encoding layers, and an output layer; a first embedding submodule, used to perform vector encoding on the speech recognition text through the embedding layer to obtain a text encoding vector; a first encoding submodule, used to perform attention fusion on the text encoding vector through the multiple encoding layers to obtain text fusion features; a classification prediction submodule, used to perform classification prediction on the text fusion features through the output layer and output the probability distribution of each word in the speech recognition text; a calculation submodule, used to calculate the sentence probability of each sentence in the speech recognition text based on the probability distribution; and a filtering submodule, used to output all sentences whose sentence probability is greater than or equal to a preset confidence threshold to obtain the final filtered recognition text; and wherein the matching module includes: a matching submodule, used to determine, using a text matching algorithm, the text segment in the filtered recognition text that corresponds to the image recognition text; and an output submodule, used to output the text segment as the target text corresponding to the target image data.

7. A computer device, characterized in that it includes a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the summary generation method based on multimodal information according to any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the summary generation method based on multimodal information according to any one of claims 1 to 5.