Artificial intelligence-based copy data generation method, device, equipment and medium

By processing video content with an AI-based image and audio text encoder, efficient and accurate copy is generated, solving the problems of low efficiency and insufficient accuracy in existing technologies.

CN119202232B · Active · Publication Date: 2026-06-19 · PING AN TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: PING AN TECH (SHENZHEN) CO LTD
Filing Date: 2024-08-23
Publication Date: 2026-06-19

IPC: G06F16/34; G06F16/334; G06F16/783; G06F16/74; G06V10/762

Similar Technology Patents

Video value attribute determination method and device and electronic equipment
CN116310968A
Emotion recognition method, device, equipment and storage medium
CN115050077B
Video search method and apparatus, electronic device and storage medium
WO2023159765A1
Video searching method and device, electronic equipment and storage medium
CN114595357A
Emotion recognition method and device, equipment and storage medium
CN115050077A

Abstract

This application belongs to the fields of artificial intelligence and fintech, and relates to a method, apparatus, computer device, and storage medium for generating text data based on artificial intelligence. The method includes: acquiring a video to be processed; extracting an image sequence from the video and generating a comprehensive image vector corresponding to the image sequence based on an image encoder; extracting target audio from the video and generating a comprehensive audio vector corresponding to the target audio based on an audio encoder; extracting target text from the video and generating a comprehensive text vector corresponding to the target text based on a text encoder; combining the comprehensive image vector, comprehensive audio vector, and comprehensive text vector to obtain a target vector; decoding the target vector based on a decoder to obtain word data; and generating target text corresponding to the video based on the word data. Furthermore, the target text can be stored in a blockchain. This application improves the efficiency and accuracy of text generation based on a three-modal encoder and decoder.

Description

Technical Field

[0001] This application relates to the fields of artificial intelligence development technology and fintech, and in particular to methods, apparatus, computer equipment and storage media for generating text data based on artificial intelligence.

Background Technology

[0002] Against the backdrop of the rapid development of the insurance industry, video content, as an intuitive and vivid means of information dissemination, has been widely used in all aspects of insurance business, including but not limited to company background presentations, business process introductions, detailed explanations of insurance products, and sales staff training. These video resources not only enrich the dimensions of information delivery but also enhance customers' and internal employees' understanding and awareness of insurance business.

[0003] Currently, when utilizing video content for knowledge absorption and transformation, it's necessary to generate copy based on the video content. Existing methods for generating video-corresponding copy require business personnel to watch the entire video one by one, then manually summarize the key information to write suitable insurance copy, such as marketing scripts or product promotional copy. However, video content is often lengthy and information-heavy, making watching from beginning to end and manually summarizing time-consuming and laborious, resulting in low processing efficiency. Furthermore, during manual summarization, differences in individual understanding or distraction can easily lead to the omission or misunderstanding of key information, resulting in lower accuracy of the generated copy.

Summary of the Invention

[0004] The purpose of this application is to propose a text data generation method, apparatus, computer device, and storage medium based on artificial intelligence, so as to solve the technical problems of low processing efficiency and low accuracy of the generated text in the existing method of generating text corresponding to videos manually.

[0005] To address the aforementioned technical problems, this application provides an artificial intelligence-based text data generation method, employing the following technical solution:

[0006] Get the video to be processed;

[0007] Based on a preset image clustering algorithm, image sequences are extracted from the video, and a comprehensive image vector corresponding to the image sequences is generated based on a trained image encoder;

[0008] Extract the target audio from the video and generate a comprehensive audio vector corresponding to the target audio based on the trained audio encoder;

[0009] Extract the target text from the video and generate a comprehensive text vector corresponding to the target text based on the trained text encoder;

[0010] The comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector are combined to obtain the corresponding target vector;

[0011] The target vector is decoded based on the trained decoder to obtain the corresponding word data, wherein the word data comprises multiple items;

[0012] Based on the word data, generate target text corresponding to the video.

[0013] Furthermore, the step of extracting image sequences from the video based on a preset image clustering algorithm specifically includes:

[0014] The video file of the video is read using a preset video processing library;

[0015] Frames are extracted from the video file at a preset time interval to obtain the corresponding first images;

[0016] The first images are clustered based on a preset image fingerprint algorithm, and the fingerprint similarity between the first images is obtained;

[0017] Images whose fingerprint similarity is greater than a preset similarity threshold are filtered out of the first images to obtain the corresponding second images;

[0018] Construct a corresponding image sequence based on the second images.

[0019] Furthermore, the step of extracting the target audio from the video specifically includes:

[0020] Invoke the preset audio extraction tool;

[0021] The video is processed by the audio extraction tool to obtain the corresponding first audio.

[0022] The first audio is subjected to speech separation processing to obtain the corresponding second audio.

[0023] The second audio is used as the target audio.

[0024] Furthermore, the step of extracting the target text from the video specifically includes:

[0025] Obtain multiple preset text extraction strategies;

[0026] Filter out the target text extraction strategy that matches the video from all the described text extraction strategies;

[0027] The video is processed by extracting text based on the text extraction strategy to obtain the corresponding text data.

[0028] The text data is used as the target text.
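
The strategy selection in the steps above can be illustrated with a minimal sketch. The source does not name the individual text extraction strategies or the matching criteria, so the strategy names, the metadata keys, and the placeholder extraction routines below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical video metadata; the keys are illustrative, not from the source.
VideoInfo = Dict[str, bool]


@dataclass
class TextExtractionStrategy:
    name: str
    matches: Callable[[VideoInfo], bool]  # does this strategy suit the video?
    extract: Callable[[VideoInfo], str]   # placeholder extraction routine


def select_strategy(video: VideoInfo,
                    strategies: List[TextExtractionStrategy]) -> TextExtractionStrategy:
    """Return the first preset strategy whose predicate matches the video."""
    for strategy in strategies:
        if strategy.matches(video):
            return strategy
    raise ValueError("no text extraction strategy matches the video")


# Assumed strategy set: subtitle track, on-screen text (OCR), speech recognition.
STRATEGIES = [
    TextExtractionStrategy("subtitle_track",
                           lambda v: v.get("has_subtitle_track", False),
                           lambda v: "<text read from the subtitle track>"),
    TextExtractionStrategy("ocr",
                           lambda v: v.get("has_burned_in_subtitles", False),
                           lambda v: "<text recognized from frames>"),
    TextExtractionStrategy("speech_recognition",
                           lambda v: v.get("has_speech", False),
                           lambda v: "<text transcribed from speech>"),
]

video = {"has_burned_in_subtitles": True, "has_speech": True}
chosen = select_strategy(video, STRATEGIES)
target_text = chosen.extract(video)
```

Ordering the list by preference means the cheapest matching strategy is tried first; this ordering is a design assumption of the sketch.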

[0029] Furthermore, the step of combining the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector to obtain the corresponding target vector specifically includes:

[0030] Obtain the preset splicing strategy;

[0031] Based on the splicing strategy, the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector are spliced together to obtain the corresponding spliced vector;

[0032] The spliced vector is used as the target vector.
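
As a minimal sketch of the splicing step, the three vectors can be concatenated in a fixed order. The source does not define the splicing strategy, so treating it as an ordering of modalities is an assumption, and the function name and vector values are illustrative.

```python
from typing import Dict, List, Sequence


def splice_vectors(vectors: Dict[str, Sequence[float]],
                   order: Sequence[str]) -> List[float]:
    """Concatenate per-modality vectors in the order given by the strategy."""
    target: List[float] = []
    for name in order:
        target.extend(vectors[name])
    return target


# Toy vectors; real encoder outputs would be much longer.
vectors = {
    "image": [0.1, 0.2],
    "audio": [0.3],
    "text": [0.4, 0.5, 0.6],
}
# Assumed splicing strategy: image first, then audio, then text.
target_vector = splice_vectors(vectors, order=("image", "audio", "text"))
```

The target vector's length is the sum of the three input lengths, so the decoder input size depends on all three encoders.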

[0033] Furthermore, the step of generating target text corresponding to the video based on the word data specifically includes:

[0034] All the word data are spliced together to obtain the corresponding spliced data;

[0035] Obtain the preset cleaning rules;

[0036] Based on the cleaning rules, the spliced data is cleaned to obtain the corresponding target data;

[0037] The target data is used as the target text.
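
The splice-then-clean steps above can be sketched as follows. The source does not list the cleaning rules, so the three regular-expression rules below (dropping special tokens, fixing spacing before punctuation, collapsing whitespace) are hypothetical examples.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical cleaning rules as (pattern, replacement) pairs.
CLEANING_RULES: List[Tuple[str, str]] = [
    (r"<[^>]+>", ""),           # drop special tokens such as <eos> or <pad>
    (r"\s+([,.!?;:])", r"\1"),  # remove whitespace before punctuation
    (r"\s{2,}", " "),           # collapse repeated whitespace
]


def build_target_text(words: Iterable[str],
                      rules: List[Tuple[str, str]] = CLEANING_RULES) -> str:
    """Join decoded words, then apply each cleaning rule in order."""
    text = " ".join(words)
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()


copy_text = build_target_text(["This", "plan", "covers", "accidents", ".", "<eos>"])
```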

[0038] Furthermore, before the step of extracting image sequences from the video based on a preset image clustering algorithm and generating a comprehensive image vector corresponding to the image sequences based on a trained image encoder, the method further includes:

[0039] Obtain pre-built video sample data;

[0040] Obtain a preset initial encoder and a feature cross-alignment training strategy corresponding to the initial encoder; wherein the initial encoder is an initial image encoder, an initial audio encoder, or an initial text encoder.

[0041] Determine the target optimization algorithm corresponding to the initial encoder;

[0042] Based on the feature cross-alignment training strategy, the target optimization algorithm, and the preset InfoNCE loss function, the initial encoder is trained using the video sample data to obtain the trained specified encoder.

[0043] The specified encoder is stored.
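
For reference, a minimal sketch of the InfoNCE loss named in the training steps is given below. The source does not spell out how the feature cross-alignment strategy forms its pairs, so the sketch assumes that an embedding of one modality (the anchor) is contrasted against candidate embeddings of another modality, with the candidate from the same video as the positive. The cosine similarity and the temperature value are likewise assumptions.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def info_nce(anchor: Sequence[float],
             candidates: Sequence[Sequence[float]],
             positive_index: int,
             temperature: float = 0.07) -> float:
    """InfoNCE: -log(exp(s_pos / t) / sum_i exp(s_i / t))."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


# Toy example: an image embedding against two text embeddings.
image_vec = [1.0, 0.0]
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(image_vec, text_vecs, positive_index=0)
misaligned_loss = info_nce(image_vec, text_vecs, positive_index=1)
```

The loss is small when the anchor is closest to its positive and large otherwise, which is what pushes embeddings of the same video together across modalities.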

[0044] To address the aforementioned technical problems, this application also provides an artificial intelligence-based text data generation device, which employs the following technical solution:

[0045] The first acquisition module is used to acquire the video to be processed.

[0046] The first processing module is used to extract image sequences from the video based on a preset image clustering algorithm, and generate a comprehensive image vector corresponding to the image sequences based on a trained image encoder;

[0047] The second processing module is used to extract the target audio from the video and generate a comprehensive audio vector corresponding to the target audio based on the trained audio encoder.

[0048] The third processing module is used to extract target text from the video and generate a comprehensive text vector corresponding to the target text based on the trained text encoder.

[0049] The combination module is used to combine the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector to obtain the corresponding target vector;

[0050] The decoding module is used to decode the target vector based on the trained decoder to obtain the corresponding word data, wherein the word data comprises multiple items.

[0051] The generation module is used to generate target text corresponding to the video based on the word data.

[0052] To address the aforementioned technical problems, this application also provides a computer device that employs the following technical solution:

[0053] Get the video to be processed;

[0054] Based on a preset image clustering algorithm, image sequences are extracted from the video, and a comprehensive image vector corresponding to the image sequences is generated based on a trained image encoder;

[0055] Extract the target audio from the video and generate a comprehensive audio vector corresponding to the target audio based on the trained audio encoder;

[0056] Extract the target text from the video and generate a comprehensive text vector corresponding to the target text based on the trained text encoder;

[0057] The comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector are combined to obtain the corresponding target vector;

[0058] The target vector is decoded based on the trained decoder to obtain the corresponding word data, wherein the word data comprises multiple items;

[0059] Based on the word data, generate target text corresponding to the video.

[0060] To address the aforementioned technical problems, this application also provides a computer-readable storage medium, employing the technical solution described below:

[0061] Get the video to be processed;

[0062] Based on a preset image clustering algorithm, image sequences are extracted from the video, and a comprehensive image vector corresponding to the image sequences is generated based on a trained image encoder;

[0063] Extract the target audio from the video and generate a comprehensive audio vector corresponding to the target audio based on the trained audio encoder;

[0064] Extract the target text from the video and generate a comprehensive text vector corresponding to the target text based on the trained text encoder;

[0065] The comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector are combined to obtain the corresponding target vector;

[0066] The target vector is decoded based on the trained decoder to obtain the corresponding word data, wherein the word data comprises multiple items;

[0067] Based on the word data, generate target text corresponding to the video.

[0068] Compared with the prior art, the embodiments of this application have the following main advantages:

[0069] This application first acquires the video to be processed; then extracts image sequences from the video based on a preset image clustering algorithm and generates a comprehensive image vector corresponding to the image sequences based on a trained image encoder; extracts target audio from the video and generates a comprehensive audio vector corresponding to the target audio based on a trained audio encoder; extracts target text from the video and generates a comprehensive text vector corresponding to the target text based on a trained text encoder; subsequently combines the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector to obtain the corresponding target vector; then decodes the target vector based on a trained decoder to obtain corresponding word data, wherein the word data comprises multiple items; and finally generates target text corresponding to the video based on the word data. By using an image encoder, an audio encoder, and a text encoder to perform feature learning and information mapping on the video to be processed, combining the resulting comprehensive vectors into a target vector, and decoding that vector with a trained decoder, this application generates the target text corresponding to the video automatically. Unlike existing methods in which text corresponding to videos is written manually, this application, based on three modality encoders and a decoder, can automatically and accurately generate corresponding text from the video content, improving the efficiency of text generation and ensuring the accuracy of the generated text.

Attached Figure Description

[0070] To more clearly illustrate the solutions in this application, the accompanying drawings used in the description of the embodiments of this application will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0071] Figure 1 is an exemplary system architecture diagram to which this application can be applied;

[0072] Figure 2 is a flowchart of an embodiment of the AI-based text data generation method according to this application;

[0073] Figure 3 is a schematic diagram of the structure of an embodiment of the AI-based text data generation apparatus according to this application;

[0074] Figure 4 is a schematic diagram of the structure of one embodiment of the computer device according to this application.

Detailed Implementation

[0075] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein in the specification of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having," and any variations thereof, in the specification, claims, and foregoing drawings of this application, are intended to cover non-exclusive inclusion. The terms "first," "second," etc., in the specification, claims, or foregoing drawings of this application are used to distinguish different objects, not to describe a particular order.

[0076] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0077] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

[0078] As shown in Figure 1, system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium for providing communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, etc.

[0079] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, social media platform software, etc.

[0080] Terminal devices 101, 102, and 103 can be various electronic devices with displays and support web browsing, including but not limited to smartphones, tablets, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptops, and desktop computers, etc.

[0081] Server 105 can be a server that provides various services, such as a backend server that supports the pages displayed on terminal devices 101, 102, and 103.

[0082] It should be noted that the AI-based text data generation method provided in this application is generally executed by a server / terminal device, and correspondingly, the AI-based text data generation device is generally located in the server / terminal device.

[0083] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0084] With continued reference to Figure 2, a flowchart of an embodiment of the AI-based text data generation method according to this application is shown. The order of steps in the flowchart can be changed, and some steps can be omitted, depending on different needs. The AI-based text data generation method provided in this application can be applied to any scenario requiring text data, and thus can be applied to products in these scenarios, such as insurance text data in the financial insurance field. The AI-based text data generation method includes the following steps:

[0085] Step S201: Obtain the video to be processed.

[0086] In this embodiment, the electronic device on which the AI-based text data generation method runs (e.g., the server / terminal device shown in Figure 1) can acquire the video via a wired or wireless connection. It should be noted that the aforementioned wireless connection methods include, but are not limited to, 3G / 4G / 5G connections, WiFi connections, Bluetooth connections, WiMAX connections, Zigbee connections, UWB (ultra-wideband) connections, and other currently known or future wireless connection methods. In the business scenario of insurance product marketing in the financial insurance industry, the aforementioned video can be a video containing background information on the insurance industry, business introductions, product introductions, and sales script training.

[0087] Step S202: Extract image sequences from the video based on a preset image clustering algorithm, and generate a comprehensive image vector corresponding to the image sequences based on a trained image encoder.

[0088] In this embodiment, the image sequence is input into a trained image encoder for feature encoding. The image encoder then outputs a feature vector (ff) for each keyframe in the image sequence. These vectors contain a deep representation of the image. Subsequently, the feature vectors of all keyframes are aggregated to obtain a corresponding comprehensive image vector (fv). The aggregation process can employ any of the following methods: averaging, weighted averaging, or pooling. Furthermore, the specific implementation process of extracting the image sequence from the video based on the preset image clustering algorithm will be described in further detail in subsequent embodiments and will not be elaborated upon here. Additionally, the training process of the image encoder can refer to the training process of a specified encoder and will not be elaborated upon further here.
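
The aggregation options listed above (averaging, weighted averaging, pooling) can be sketched in a few lines. The function names and toy vectors are illustrative, and max pooling is used as one possible reading of "pooling"; the same helpers would apply to the audio-segment and token vectors in the later steps.

```python
from typing import List, Optional, Sequence


def mean_pool(vectors: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> List[float]:
    """Average per-keyframe vectors; with weights, a weighted average."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)]


def max_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum across all per-keyframe vectors."""
    return [max(column) for column in zip(*vectors)]


# Two toy keyframe feature vectors of dimension 2.
keyframe_vectors = [[1.0, 2.0], [3.0, 4.0]]
averaged = mean_pool(keyframe_vectors)
weighted = mean_pool(keyframe_vectors, weights=[1.0, 3.0])
pooled = max_pool(keyframe_vectors)
```

All three options return a single vector of the same dimension as the per-keyframe vectors, regardless of how many keyframes the sequence contains.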

[0089] Step S203: Extract the target audio from the video and generate a comprehensive audio vector corresponding to the target audio based on the trained audio encoder.

[0090] In this embodiment, the target audio is input into a trained audio encoder for feature encoding. The audio encoder then outputs feature vectors (fas) of each audio segment contained in the target audio. These vectors contain a deep representation of the audio. Subsequently, the feature vectors of all audio segments are aggregated to obtain a corresponding comprehensive audio vector (fa). The aggregation process can employ any of the following methods: averaging, weighted averaging, or pooling. Furthermore, the specific implementation process for extracting the target audio from the video described above will be further described in detail in subsequent embodiments of this application, and will not be elaborated upon here. Additionally, the training process of the audio encoder can refer to the training process of a designated encoder, and will not be elaborated upon further here.

[0091] Step S204: Extract the target text from the video and generate a comprehensive text vector corresponding to the target text based on the trained text encoder.

[0092] In this embodiment, the target text is input into a trained text encoder for feature encoding, and the text encoder outputs feature vectors (fw) of each token contained in the target text. These vectors contain deep semantic information of the text. Subsequently, the feature vectors of all tokens are aggregated to obtain the corresponding comprehensive text vector (fs). The aggregation process can employ any of the following methods: averaging, weighted averaging, or pooling. Furthermore, the specific implementation process for extracting the target text from the video described above will be further described in detail in subsequent embodiments of this application, and will not be elaborated upon here. In addition, the training process of the text encoder can refer to the training process of a designated encoder, and will not be elaborated upon further here.

[0093] Step S205: Combine the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector to obtain the corresponding target vector.

[0094] In this embodiment, the specific implementation process of combining the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector to obtain the corresponding target vector will be further described in detail in subsequent specific embodiments of this application, and will not be elaborated on here.

[0095] Step S206: Decode the target vector based on the trained decoder to obtain the corresponding word data, wherein the word data comprises multiple items.

[0096] In this embodiment, the decoder specifically employs an autoregressive model with a transformer structure. After the target vector is input into the trained decoder, the decoder generates each word in the text step by step based on the target vector. At the first time step, the first word is generated based on the encoder's target vector. At the second time step, the vector of the first word is used as input to generate the second word, and so on, until a stop character is generated, thus obtaining the corresponding word data. The words generated at all time steps can be concatenated to form the complete text content.
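
The step-by-step generation described above follows the usual autoregressive loop, sketched below. The real decoder is a trained transformer; here a toy stand-in function replays a fixed word list so that the loop itself is runnable. The stop token string, the function names, and the maximum step count are all assumptions.

```python
from typing import Callable, List, Sequence

STOP = "<eos>"  # assumed stop token


def greedy_decode(target_vector: Sequence[float],
                  next_word: Callable[[Sequence[float], List[str]], str],
                  max_steps: int = 50) -> List[str]:
    """Generate words one at a time until the stop token or the step limit."""
    words: List[str] = []
    for _ in range(max_steps):
        # Each step conditions on the target vector and all previous words.
        word = next_word(target_vector, words)
        if word == STOP:
            break
        words.append(word)
    return words


def toy_next_word(target_vector: Sequence[float], prefix: List[str]) -> str:
    """Stand-in for a trained decoder: replays a fixed script."""
    script = ["this", "policy", "covers", "travel", STOP]
    return script[min(len(prefix), len(script) - 1)]


word_data = greedy_decode([0.1, 0.2, 0.3], toy_next_word)
```

The step limit guards against a decoder that never emits the stop token.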

[0097] Step S207: Generate target text corresponding to the video based on the word data.

[0098] In this embodiment, the specific implementation process of generating target text corresponding to the video based on the word data will be further described in detail in subsequent specific embodiments of this application, and will not be elaborated on here.

[0099] This application first acquires the video to be processed; then extracts image sequences from the video based on a preset image clustering algorithm and generates a comprehensive image vector corresponding to the image sequences based on a trained image encoder; extracts target audio from the video and generates a comprehensive audio vector corresponding to the target audio based on a trained audio encoder; extracts target text from the video and generates a comprehensive text vector corresponding to the target text based on a trained text encoder; subsequently combines the comprehensive image vector, the comprehensive audio vector, and the comprehensive text vector to obtain the corresponding target vector; then decodes the target vector based on a trained decoder to obtain corresponding word data, wherein the word data comprises multiple items; and finally generates target text corresponding to the video based on the word data. By using an image encoder, an audio encoder, and a text encoder to perform feature learning and information mapping on the video to be processed, combining the resulting comprehensive vectors into a target vector, and decoding that vector with a trained decoder, this application generates the target text corresponding to the video automatically. Unlike existing methods in which text corresponding to videos is written manually, this application, based on three modality encoders and a decoder, can automatically and accurately generate corresponding text from the video content, improving the efficiency of text generation and ensuring the accuracy of the generated text.

[0100] In some optional implementations, the step S202, which involves extracting image sequences from the video based on a preset image clustering algorithm, includes the following steps:

[0101] The video file in the video is read based on a preset video processing library.

[0102] In this embodiment, the video processing library mentioned above can specifically be OpenCV, FFmpeg, etc., and the video file in the video can be read by using the video processing library.

[0103] The video file is processed by frame extraction based on a preset time interval to obtain the corresponding first image.

[0104] In this embodiment, the value of the aforementioned time interval is not specifically limited and can be set according to actual usage requirements. For example, it can be set to extract one frame per second. Frames can be extracted from the video file according to this time interval, and the extracted frames can be saved as image files (such as JPEG or PNG formats) to obtain the aforementioned first image.
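
As a small illustration of interval-based frame extraction, the sketch below computes which frame indices to keep for a given frame rate and time interval. It covers only the index arithmetic and does not read a video file; the function name and the example frame rate are assumptions.

```python
from typing import List


def sample_frame_indices(frame_count: int, fps: float,
                         interval_s: float = 1.0) -> List[int]:
    """Indices of the frames nearest to 0 s, interval_s, 2 * interval_s, ..."""
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    indices: List[int] = []
    k = 0
    while True:
        index = round(k * interval_s * fps)
        if index >= frame_count:
            break
        # Skip repeats, which occur when the interval is shorter than a frame.
        if not indices or index != indices[-1]:
            indices.append(index)
        k += 1
    return indices


# A 4-second clip at 25 frames per second, sampled once per second.
first_image_indices = sample_frame_indices(frame_count=100, fps=25.0, interval_s=1.0)
```

With a video processing library, each returned index would be decoded and saved as an image file to form the first images.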

[0105] The first image is subjected to image clustering processing based on a preset image fingerprinting algorithm, and the fingerprint similarity of the first image is obtained.

[0106] In this embodiment, since the video contains many repetitive and similar images, image clustering is used to extract only the key salient images that may represent the storyline. This application uses image fingerprints to cluster the first images and to obtain the fingerprint similarity between them. An image fingerprint is a compact representation, generated from an image's visual features, that is used to identify the image.

[0107] Images with fingerprint similarity greater than a preset similarity threshold are filtered from the first image to obtain the corresponding second image.

[0108] In this embodiment, the value of the similarity threshold is not specifically limited and can be set according to actual business needs; 0.5 is preferred. By filtering out images with fingerprint similarity greater than the similarity threshold from all first images, and then deleting those images from the first images, a corresponding second image is obtained.

[0109] Construct a corresponding image sequence based on the second image.

[0110] In this embodiment, the images at the cluster centers are selected as keyframes, and these keyframes are arranged in order of video playback time to form a corresponding image sequence. By constructing the image sequence using an image clustering algorithm, fewer static frames and more dynamic scene frames in the video can be retained. Even if the resulting image sequence does not include all frames, the extracted image sequence can still represent the entire video by capturing salient scenes with significant changes. By utilizing this method, a basic image sequence for the video can be created by focusing on scenes with meaningful changes. These processed image sequences provide a new kind of training sample as input to the image encoder, one that highlights the storyline with fewer frames and differs from the traditional approach of inputting all frames of a video.

[0111] This application reads video files from a video using a preset video processing library; then, it performs frame extraction on the video files at preset time intervals to obtain corresponding first images; subsequently, it performs image clustering processing on the first images using a preset image fingerprint algorithm and obtains the fingerprint similarity of the first images; next, it filters images from the first images whose fingerprint similarity is greater than a preset similarity threshold to obtain corresponding second images; finally, it constructs a corresponding image sequence based on the second images. This application, by using a video processing library and an image fingerprint algorithm to perform video extraction and image clustering processing, can quickly and accurately capture significant scenes with marked changes to represent the entire video image sequence, improving the intelligence and accuracy of image sequence generation.
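
The fingerprint-based filtering summarized above can be sketched with a simple average-hash fingerprint. The source does not name the fingerprint algorithm or the clustering procedure, so the average hash, the bit-agreement similarity, and the greedy comparison against already kept images are stand-ins chosen for brevity. The 0.5 threshold follows the preferred value in the text, although a suitable value depends on the similarity measure.

```python
from typing import List, Sequence

# A small grayscale image as rows of pixel intensities (0-255),
# assumed to be already downscaled to a fixed size.
Gray = Sequence[Sequence[int]]


def average_hash(image: Gray) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def similarity(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Fraction of fingerprint bits on which two images agree."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


def select_keyframes(images: Sequence[Gray], threshold: float = 0.5) -> List[int]:
    """Keep an image only if it is not too similar to any image kept so far."""
    kept: List[int] = []
    kept_hashes: List[List[int]] = []
    for index, image in enumerate(images):
        fingerprint = average_hash(image)
        if all(similarity(fingerprint, k) <= threshold for k in kept_hashes):
            kept.append(index)
            kept_hashes.append(fingerprint)
    return kept


frame_a = [[0, 0, 255, 255], [0, 0, 255, 255]]      # dark left, bright right
frame_b = [[10, 5, 250, 240], [0, 8, 255, 245]]     # near-duplicate of frame_a
frame_c = [[255, 255, 0, 0], [255, 255, 0, 0]]      # bright left, dark right
keyframe_indices = select_keyframes([frame_a, frame_b, frame_c])
```

Because the kept indices stay in their original order, the result is already arranged by playback time.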

[0112] In some optional implementations of this embodiment, step S203, extracting the target audio from the video, includes the following steps:

[0113] Invoke the preset audio extraction tool.

[0114] In this embodiment, the selection of the audio extraction tool is not specifically limited; for example, ffmpeg can be used.

[0115] The video is processed by the audio extraction tool to obtain the corresponding first audio.

[0116] In this embodiment, the video is processed by using the aforementioned audio extraction tool to ensure that the length of the extracted audio matches the length of the original video, thereby obtaining the corresponding first audio.

[0117] The first audio is subjected to speech separation processing to obtain the corresponding second audio.

[0118] In this embodiment, because the video subtitles already carry the speech information, the separated audio does not need to contain speech; only pitch intensity information is retained. The first audio can be speech-separated using the bytesep framework, which decouples the speech in the first audio from the mixed audio containing various sounds, removing the speech content and retaining the pitch intensity information to obtain the corresponding second audio. The decoupled result can be regarded as audio data that shares the content narrative of the original video clip but contains no speech information. Speech is decoupled because speech itself carries the content narrative; if it were left in, the model could learn to focus on the narrative content of the speech (text content) rather than on the desired audio.

[0119] The second audio is used as the target audio.

[0120] This application invokes a preset audio extraction tool to perform audio extraction processing on the video, obtaining the first audio; the first audio is then subjected to speech separation processing to obtain the second audio, which is subsequently used as the target audio. By performing audio extraction and speech separation processing on the video, this application can quickly and accurately obtain target audio that contains only pitch intensity information and no speech information, thus improving the intelligence of target audio extraction and ensuring the accuracy of the obtained target audio.
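As an illustration of the audio extraction step, the sketch below builds an ffmpeg command line that drops the video stream and writes the audio as a WAV file. The function name, the 16 kHz sample rate, and the mono channel layout are assumptions made for the example and are not specified in the embodiment. The speech separation step (bytesep) is deliberately not shown.

```python
from typing import List


def build_audio_extract_command(
    video_path: str, audio_path: str, sample_rate: int = 16000
) -> List[str]:
    """Build an ffmpeg argument list that extracts the audio track.

    Sample rate and mono output are assumed values for illustration only.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-vn",                   # discard the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, suitable for a .wav container
        "-ar", str(sample_rate), # audio sample rate in Hz
        "-ac", "1",              # single (mono) channel
        audio_path,
    ]
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where ffmpeg is installed. Because the whole audio stream is decoded without trimming, the extracted audio normally has the same duration as the source video's audio track.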

[0121] In some alternative implementations, the step S204 of extracting the target text from the video includes the following steps:

[0122] Obtain multiple preset text extraction strategies.

[0123] In this embodiment, the aforementioned text extraction strategies may specifically include an ASR (Automatic Speech Recognition)-based text extraction strategy, an OCR (Optical Character Recognition)-based text extraction strategy, and a text extraction strategy based on the fusion of CMD and LSMDC. The ASR-based text extraction strategy is as follows: if the video contains human voices, an ASR service or library (such as Google Speech-to-Text, IBM Watson Speech to Text, or the open-source DeepSpeech or Kaldi) is used to recognize the human voices in the video. The OCR-based text extraction strategy is as follows: if the video contains subtitles or text information, an OCR library (such as Tesseract or EasyOCR) is used to recognize the text in the video frames; the recognized text is post-processed, for example by noise removal and OCR error correction, and the recognized subtitle text is saved as a string or file. The text extraction strategy based on the fusion of CMD and LSMDC is as follows: if the video has neither subtitles nor audio, text information is generated using a CMD and LSMDC fusion method. Both CMD and LSMDC provide a brief description for each video segment, but their clip lengths differ slightly. Because LSMDC video segments need to be concatenated to match the length of CMD video segments, the same number of descriptions must be concatenated along with the video segments, so the concatenated descriptions vary in sentence length. Furthermore, the two datasets inherently differ in how their text descriptions are produced: CMD uses user-generated descriptions from YouTube, while LSMDC uses descriptions produced by descriptive video services. Because of these differences in the text, the reading comprehension capability of ChatGPT is further used for text fusion. The descriptions of CMD and LSMDC are labeled [CMD] and [LSMDC], respectively, and a ChatGPT prompt is constructed that asks for the content of [CMD] and [LSMDC] to be integrated into a single description; the text description of the video without subtitles or audio is then obtained from the ChatGPT output.

[0124] Select the target text extraction strategy that matches the video from all the stated text extraction strategies.

[0125] In this embodiment, by analyzing whether the video contains human voices or subtitles, a matching target text extraction strategy can be selected from all the stated text extraction strategies. Specifically, if the video contains human voices, an ASR-based text extraction strategy is used as the target text extraction strategy. If the video contains subtitles, an OCR-based text extraction strategy is used. If the video contains neither subtitles nor human voices, a text extraction strategy based on the fusion of CMD and LSMDC is used as the target text extraction strategy.

[0126] Text extraction processing is performed on the video based on the target text extraction strategy to obtain the corresponding text data.

[0127] In this embodiment, the video can be processed for text extraction according to the strategy steps in the above text extraction strategy, thereby extracting the corresponding text data.

[0128] The text data is used as the target text.

[0129] This application obtains multiple preset text extraction strategies; then selects, from all the text extraction strategies, the target text extraction strategy that matches the video; subsequently performs text extraction processing on the video based on the target text extraction strategy to obtain the corresponding text data; and finally uses the text data as the target text. By selecting the target text extraction strategy that matches the video from multiple preset text extraction strategies and then using it to perform text extraction processing on the video, this application extracts the corresponding target text from the video quickly and accurately, improving the intelligence of target text extraction and ensuring the accuracy of the obtained target text.
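The strategy selection described in [0125] amounts to a small dispatch rule, sketched below. The enum and function names are hypothetical. The embodiment does not say which strategy wins when a video contains both human voices and subtitles; this sketch assumes the ASR branch is checked first, following the order in which the conditions are listed.

```python
from enum import Enum


class TextStrategy(Enum):
    ASR = "asr"                            # speech recognition on human voices
    OCR = "ocr"                            # character recognition on subtitles
    CMD_LSMDC_FUSION = "cmd_lsmdc_fusion"  # fused dataset descriptions


def select_text_strategy(has_voice: bool, has_subtitles: bool) -> TextStrategy:
    """Pick the text extraction strategy that matches the video.

    Checking voices before subtitles is an assumption of this sketch.
    """
    if has_voice:
        return TextStrategy.ASR
    if has_subtitles:
        return TextStrategy.OCR
    return TextStrategy.CMD_LSMDC_FUSION
```

How `has_voice` and `has_subtitles` are detected is left open here, as it is in the embodiment.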

[0130] In some alternative implementations, step S205 includes the following steps:

[0131] Obtain the preset splicing strategy.

[0132] In this embodiment, the above-mentioned splicing strategy may include concatenation, or it may include more complex fusion methods, such as weighted sum, attention mechanism, and other strategies.

[0133] Based on the splicing strategy, the integrated image vector, the integrated audio vector, and the integrated text vector are spliced together to obtain the corresponding spliced vector.

[0134] In this embodiment, the integrated image vector, the integrated audio vector, and the integrated text vector can be spliced according to the strategy content of the above splicing strategy to obtain the corresponding spliced vector.

[0135] The spliced vector is used as the target vector.

[0136] This application obtains a preset splicing strategy; then, based on the splicing strategy, it splices the integrated image vector, the integrated audio vector, and the integrated text vector to obtain a corresponding spliced vector; subsequently, the spliced vector is used as the target vector. This application, by using a splicing strategy, can quickly and intelligently complete the combined processing of the integrated image vector, the integrated audio vector, and the integrated text vector to obtain the corresponding target vector, improving the efficiency and intelligence of target vector generation.
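Two of the splicing strategies mentioned in [0132] can be sketched as follows: plain concatenation and a weighted sum. The function names and the example weights are assumptions. The weighted sum additionally assumes that the three vectors share the same dimension, which the embodiment does not state.

```python
from typing import List, Sequence


def splice_concat(
    image_vec: Sequence[float],
    audio_vec: Sequence[float],
    text_vec: Sequence[float],
) -> List[float]:
    """Concatenate the three modality vectors end to end."""
    return [*image_vec, *audio_vec, *text_vec]


def splice_weighted_sum(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Element-wise weighted sum of equally sized modality vectors."""
    if len(vectors) != len(weights):
        raise ValueError("one weight is required per vector")
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all vectors must have the same dimension")
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]
```

Concatenation preserves every component but grows the target vector, while the weighted sum keeps the dimension fixed at the cost of mixing the modalities; which trade-off suits the decoder is a design decision the embodiment leaves open.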

[0137] In some optional implementations of this embodiment, step S207 includes the following steps:

[0138] All the aforementioned word data are concatenated to obtain the corresponding concatenated data.

[0139] In this embodiment, the word data refers to the words generated at all time steps. The concatenation process refers to concatenating all word data into a complete sentence or text to obtain the corresponding concatenated data.

[0140] Obtain the preset cleaning rules.

[0141] In this embodiment, the above-mentioned cleaning rules refer to rules for processing text such as removing duplicate words, correcting grammatical errors, removing special tags, and formatting output.

[0142] Based on the cleaning rules, text cleaning processing is performed on the concatenated data to obtain the corresponding target data.

[0143] In this embodiment, the concatenated data can be cleaned according to the above-mentioned cleaning rules; the cleaned concatenated data is the target data mentioned above.

[0144] The target data is used as the target text.

[0145] This application obtains concatenated data by concatenating all the word data; then, it acquires preset cleaning rules; subsequently, it performs text cleaning processing on the concatenated data based on the cleaning rules to obtain corresponding target data; and finally, it uses the target data as the target text. After decoding the target vector using a trained decoder to obtain the corresponding word data, this application concatenates all the word data to obtain the corresponding concatenated data and intelligently uses cleaning rules to perform text cleaning processing on the concatenated data to generate the final target text, effectively improving the data accuracy of the target text.
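The concatenation and cleaning steps can be sketched with a few regular-expression rules. This is only an illustration: the tag format (`<bos>`, `<eos>`, and similar), the space-separated joining of words, and the function name are assumptions, and grammar correction is not attempted.

```python
import re
from typing import Iterable

# Assumed format for special tags such as <bos>, <eos>, or <pad>.
SPECIAL_TAG = re.compile(r"<[^<>\s]+>")


def clean_generated_words(words: Iterable[str]) -> str:
    """Join decoder output words and apply simple cleaning rules."""
    text = " ".join(words)                        # concatenate word data
    text = SPECIAL_TAG.sub(" ", text)             # remove special tags
    # Collapse immediately repeated words, e.g. "plan plan" -> "plan".
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)  # no space before punctuation
    text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
    return text[:1].upper() + text[1:]            # capitalize the first letter
```

For languages that are not space-delimited, the joining and duplicate-word rules would need to be adapted.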

[0146] In some optional implementations of this embodiment, before step S202, the electronic device may further perform the following steps:

[0147] Obtain pre-built video sample data.

[0148] In this embodiment, the aforementioned video sample data may be a certain number of pre-collected insurance videos. These insurance videos may include background information on the insurance industry, business introductions, product introductions, and sales script training. The number of videos is not specifically limited and can be determined based on the actual encoder training requirements.

[0149] Obtain a preset initial encoder and a feature cross-alignment training strategy corresponding to the initial encoder;

[0150] In this embodiment, the initial encoder is either an initial image encoder, an initial audio encoder, or an initial text encoder. The aforementioned feature cross-alignment training strategy refers to cross-aligning the features of the three modalities (image, audio, and text) and training using the InfoNCE loss function.

[0151] Determine the target optimization algorithm corresponding to the initial encoder.

[0152] In this embodiment, the above-mentioned target optimization algorithm can specifically adopt a gradient descent algorithm, such as SGD or Adam algorithm.

[0153] Based on the feature cross-alignment training strategy, the target optimization algorithm, and the preset InfoNCE loss function, the initial encoder is trained using the video sample data to obtain the trained specified encoder.

[0154] In this embodiment, the training process for the initial encoder includes the following steps. (1) Input the video sample data into the initial encoder to obtain the features of the three modalities (image, audio, and text), and cross-align the features of the three modalities. The features include the feature vector of each keyframe in the video sample data (ff), the composite image vector (fv), the feature vector of each audio segment in the video sample data (fas), the composite audio vector (fa), the feature vector of each token in the video sample data (fw), and the composite text vector (fs). (2) Calculate the similarity matrices. Initialize the similarity matrices Sa_v (similarity between fa and fv), Sa_f (similarity between fa and ff), Sas_v (similarity between fas and fv), and Sas_f (similarity between fas and ff). These matrices can be calculated using cosine similarity or another similarity metric; for example, Sa_v = cosine_similarity(fa, fvT), where fvT is the transpose of fv. (3) Integrate the instance-level similarities. For each audio segment (each element of fas), calculate its similarity to the overall video (fv) to obtain the audio-level similarity Saud′; this can be achieved by averaging the rows of Sas_v. Similarly, for each keyframe of the video (each element of ff), calculate its similarity to the composite audio vector (fa) to obtain the video-level similarity Svid′; this can be achieved by averaging the columns of Sa_f (transposing where required to match dimensions). (4) Calculate the similarity between audio and video. Saud′ and Svid′ are summed and averaged to obtain Sasf′, which serves as the overall similarity simAV between audio and video. (5) Calculate the InfoNCE loss and backpropagate. The InfoNCE loss is calculated using simAV and the positive / negative sample pairs, and the loss gradient is backpropagated to the encoder of the audio modality and the encoder of the video modality to update their parameters, aligning the features between audio and image and thereby completing the training of the specified encoder. This application obtains the corresponding trained encoder by using audio as the anchor information, aligning features pairwise, and training iteratively until the model converges. Furthermore, because the network has a symmetrical structure, the same training method, again with the InfoNCE loss function, can be used to train the encoder of the audio modality and the encoder of the text modality, aligning the features between audio and text.
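The similarity aggregation and the InfoNCE loss in the training steps above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the patented training procedure: the function names, the temperature value of 0.1, and the single audio-to-video direction of the loss are assumptions, each sample is assumed to contribute one row of the batch similarity matrix with its own video as the positive pair, and gradient computation and parameter updates are omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def audio_video_similarity(
    fa: Vector, fas: Sequence[Vector], fv: Vector, ff: Sequence[Vector]
) -> float:
    """Overall audio-video similarity (simAV) for one audio/video pairing.

    s_aud averages the similarity of each audio segment to the composite
    image vector; s_vid averages the similarity of each keyframe to the
    composite audio vector; the two are then averaged.
    """
    s_aud = sum(cosine(segment, fv) for segment in fas) / len(fas)
    s_vid = sum(cosine(fa, frame) for frame in ff) / len(ff)
    return (s_aud + s_vid) / 2


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.1) -> float:
    """InfoNCE loss over a batch similarity matrix.

    sim[i][j] is the similarity between audio i and video j; the diagonal
    holds the positive pairs, all other entries are negatives.
    """
    losses: List[float] = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

When the matched audio and video are the most similar pair in each row, the loss is close to zero; when the similarities carry no information (all entries equal), the loss equals the logarithm of the batch size.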

[0155] The specified encoder is stored.

[0156] In this embodiment, the storage method of the specified encoder is not specifically limited, and can be set according to actual storage needs, such as blockchain storage, database storage, cloud server storage, etc.

[0157] This application acquires pre-constructed video sample data; then acquires a preset initial encoder and a feature cross-alignment training strategy corresponding to the initial encoder; wherein the initial encoder is an initial image encoder, an initial audio encoder, or an initial text encoder; then determines a target optimization algorithm corresponding to the initial encoder; further, based on the feature cross-alignment training strategy, the target optimization algorithm, and a preset InfoNCE loss function, the initial encoder is trained using the video sample data to obtain a trained specified encoder; subsequently, the specified encoder is stored. After acquiring the pre-constructed video sample data and the initial encoder, this application trains the initial encoder using the video sample data according to the feature cross-alignment training strategy, the target optimization algorithm, and the InfoNCE loss function corresponding to the initial encoder. This enables a rapid and intelligent completion of the specified encoder construction process, improving the intelligence of the specified encoder construction and ensuring the effectiveness of the obtained specified encoder. Furthermore, the specified encoder is intelligently stored, thereby ensuring the security of the generated specified encoder.

[0158] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0159] It should be emphasized that, to further ensure the privacy and security of the aforementioned target text, the target text can also be stored in a node of a blockchain.

[0160] The blockchain referred to in this application is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked together using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.

[0161] The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0162] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.

[0163] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware. These computer-readable instructions can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The aforementioned storage medium can be a non-volatile storage medium such as a magnetic disk, optical disk, or read-only memory (ROM), or it can be random access memory (RAM).

[0164] It should be understood that although the steps in the flowcharts of the accompanying figures are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the accompanying figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the sub-steps or stages of other steps.

[0165] With further reference to Figure 3, as an implementation of the method shown in Figure 2 above, this application provides an embodiment of an artificial intelligence-based text data generation device. This device embodiment corresponds to the method embodiment shown in Figure 2, and the device can be specifically applied to various electronic devices.

[0166] As shown in Figure 3, the AI-based text data generation device 300 described in this embodiment includes: a first acquisition module 301, a first processing module 302, a second processing module 303, a third processing module 304, a combination module 305, a decoding module 306, and a generation module 307. Wherein:

[0167] The first acquisition module 301 is used to acquire the video to be processed;

[0168] The first processing module 302 is used to extract image sequences from the video based on a preset image clustering algorithm, and generate an integrated image vector corresponding to the image sequences based on a trained image encoder;

[0169] The second processing module 303 is used to extract target audio from the video and generate an integrated audio vector corresponding to the target audio based on the trained audio encoder.

[0170] The third processing module 304 is used to extract target text from the video and generate an integrated text vector corresponding to the target text based on the trained text encoder.

[0171] The combination module 305 is used to combine the integrated image vector, the integrated audio vector, and the integrated text vector to obtain the corresponding target vector.

[0172] The decoding module 306 is used to decode the target vector based on the trained decoder to obtain the corresponding word data, wherein there are multiple items of word data.

[0173] The generation module 307 is used to generate target text corresponding to the video based on the word data.

[0174] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.
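The way modules 302 to 307 feed one another can be sketched as a simple composition of callables. The class name, field names, and the use of a file path to stand for the acquired video are hypothetical; real encoders and a real decoder would replace the placeholders supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class CopyGenerationDevice:
    """Wires the processing modules together (names are illustrative)."""

    encode_images: Callable[[str], Vector]               # module 302
    encode_audio: Callable[[str], Vector]                # module 303
    encode_text: Callable[[str], Vector]                 # module 304
    combine: Callable[[Vector, Vector, Vector], Vector]  # module 305
    decode: Callable[[Vector], List[str]]                # module 306
    generate: Callable[[List[str]], str]                 # module 307

    def run(self, video_path: str) -> str:
        """Process one acquired video (module 301 supplies video_path)."""
        target_vector = self.combine(
            self.encode_images(video_path),
            self.encode_audio(video_path),
            self.encode_text(video_path),
        )
        return self.generate(self.decode(target_vector))
```

Keeping each module behind a plain callable makes it straightforward to swap a splicing strategy or cleaning rule without touching the rest of the pipeline.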

[0175] In some optional implementations of this embodiment, the first processing module 302 includes:

[0176] The reading submodule is used to read video files from the video based on a preset video processing library;

[0177] The processing submodule is used to perform frame extraction processing on the video file based on a preset time interval to obtain the corresponding first image;

[0178] The clustering submodule is used to perform image clustering processing on the first image based on a preset image fingerprint algorithm and obtain the fingerprint similarity of the first image;

[0179] The filtering submodule is used to filter images from the first image whose fingerprint similarity is greater than a preset similarity threshold to obtain the corresponding second image;

[0180] A construction submodule is used to construct a corresponding image sequence based on the second image.

[0181] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.

[0182] In some optional implementations of this embodiment, the second processing module 303 includes:

[0183] The calling submodule is used to invoke a preset audio extraction tool;

[0184] The first extraction submodule is used to perform audio extraction processing on the video based on the audio extraction tool to obtain the corresponding first audio.

[0185] The separation submodule is used to perform speech separation processing on the first audio to obtain the corresponding second audio.

[0186] The second determining submodule is used to use the second audio as the target audio.

[0187] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.

[0188] In some optional implementations of this embodiment, the third processing module 304 includes:

[0189] The first acquisition submodule is used to acquire a variety of preset text extraction strategies;

[0190] The filtering submodule is used to select the target text extraction strategy that matches the video from all the text extraction strategies;

[0191] The second extraction submodule is used to perform text extraction processing on the video based on the text extraction strategy to obtain the corresponding text data.

[0192] The third determining submodule is used to use the text data as the target text.

[0193] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.

[0194] In some optional implementations of this embodiment, the combination module 305 includes:

[0195] The second acquisition submodule is used to acquire the preset splicing strategy;

[0196] The first splicing submodule is used to splice the integrated image vector, the integrated audio vector, and the integrated text vector based on the splicing strategy to obtain the corresponding spliced vector;

[0197] The fourth determining submodule is used to use the spliced vector as the target vector.

[0198] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.

[0199] In some optional implementations of this embodiment, the generation module 307 includes:

[0200] The second splicing submodule is used to splice all the word data to obtain the corresponding spliced data.

[0201] The third acquisition submodule is used to obtain the preset cleaning rules;

[0202] The cleaning submodule is used to perform text cleaning processing on the spliced data based on the cleaning rules to obtain the corresponding target data;

[0203] The fifth determination submodule is used to use the target data as the target text.

[0204] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.

[0205] In some optional implementations of this embodiment, the AI-based text data generation device further includes:

[0206] The first acquisition module is used to acquire pre-built video sample data;

[0207] The second acquisition module is used to acquire a preset initial encoder and to acquire a feature cross-alignment training strategy corresponding to the initial encoder; wherein the initial encoder is an initial image encoder, an initial audio encoder, or an initial text encoder.

[0208] The determination module is used to determine the target optimization algorithm corresponding to the initial encoder;

[0209] The training module is used to train the initial encoder using the video sample data based on the feature cross-alignment training strategy, the target optimization algorithm, and the preset InfoNCE loss function, so as to obtain a trained specified encoder.

[0210] The storage module is used to store the specified encoder.

[0211] In this embodiment, the operations performed by the above modules or units correspond one-to-one with the steps of the AI-based text data generation method in the aforementioned implementation method, and will not be repeated here.

[0212] To address the aforementioned technical problems, embodiments of this application also provide a computer device. Please refer to Figure 4, which is a basic structural block diagram of the computer device in this embodiment.

[0213] The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that are interconnected via a system bus. It should be noted that only the computer device 4 with components 41-43 is shown in the figure; however, it should be understood that it is not required to implement all the shown components, and more or fewer components can be implemented alternatively. Those skilled in the art will understand that the computer device described here is a device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

[0214] The computer device can be a desktop computer, laptop, handheld computer, or cloud server, etc. The computer device can interact with the user via a keyboard, mouse, remote control, touchpad, or voice control.

[0215] The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, disk, optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as the hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the computer device 4. Of course, the memory 41 may include both the internal storage unit and its external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store the operating system and various application software installed on the computer device 4, such as computer-readable instructions for a text data generation method based on artificial intelligence. In addition, the memory 41 can also be used to temporarily store various types of data that have been output or will be output.

[0216] In some embodiments, the processor 42 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is used to execute computer-readable instructions stored in the memory 41 or to process data, for example, to execute computer-readable instructions for the AI-based text data generation method.

[0217] The network interface 43 may include a wireless network interface or a wired network interface, which is typically used to establish communication connections between the computer device 4 and other electronic devices.

[0218] Compared with the prior art, the embodiments of this application have the following main advantages:

[0219] In this embodiment, an image encoder, an audio encoder, and a text encoder are used to perform feature learning and information mapping on the video to be processed in three modalities to obtain corresponding integrated image vectors, integrated audio vectors, and integrated text vectors. These integrated image vectors, integrated audio vectors, and integrated text vectors are then combined to obtain corresponding target vectors. A trained decoder is then used to decode these target vectors to obtain corresponding word data. Based on this word data, target text corresponding to the video is automatically and accurately generated. Unlike existing methods that manually generate text corresponding to videos, this application, based on the use of three modal encoders and decoders, can automatically and accurately generate corresponding text based on the video content, improving the efficiency of text generation and ensuring the accuracy of the generated text.

[0220] This application also provides another embodiment, namely, providing a computer-readable storage medium storing computer-readable instructions that can be executed by at least one processor to cause the at least one processor to perform the steps of the above-described artificial intelligence-based text data generation method.

[0221] Compared with the prior art, the embodiments of this application have the following main advantages:

[0222] In this embodiment, an image encoder, an audio encoder, and a text encoder are used to perform feature learning and information mapping on the video to be processed in three modalities to obtain corresponding integrated image vectors, integrated audio vectors, and integrated text vectors. These integrated image vectors, integrated audio vectors, and integrated text vectors are then combined to obtain corresponding target vectors. A trained decoder is then used to decode these target vectors to obtain corresponding word data. Based on this word data, target text corresponding to the video is automatically and accurately generated. Unlike existing methods that manually generate text corresponding to videos, this application, based on the use of three modal encoders and decoders, can automatically and accurately generate corresponding text based on the video content, improving the efficiency of text generation and ensuring the accuracy of the generated text.

[0223] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0224] Obviously, the embodiments described above are only some, not all, of the embodiments of this application. The accompanying drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms and is not limited to the embodiments described here; rather, these embodiments are provided so that the disclosure of this application can be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent substitutions for some of the technical features. Any equivalent structure made using the content of this application's specification and drawings, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims

1. An artificial intelligence-based copy data generation method, characterized in that it comprises the following steps:
acquiring a video to be processed;
extracting an image sequence from the video based on a preset image clustering algorithm, and generating an integrated image vector corresponding to the image sequence based on a trained image encoder;
extracting target audio from the video, and generating an integrated audio vector corresponding to the target audio based on a trained audio encoder;
extracting target text from the video, and generating an integrated text vector corresponding to the target text based on a trained text encoder;
combining the integrated image vector, the integrated audio vector, and the integrated text vector to obtain a corresponding target vector;
decoding the target vector based on a trained decoder to obtain corresponding word data, wherein there are multiple items of word data;
generating target copy corresponding to the video based on the word data;
wherein the step of extracting an image sequence from the video based on a preset image clustering algorithm specifically comprises:
reading the video file of the video based on a preset video processing library;
performing frame extraction on the video file at a preset time interval to obtain corresponding first images;
performing image clustering on the first images based on a preset image fingerprint algorithm to obtain the fingerprint similarity of the first images;
filtering, from the first images, the images whose fingerprint similarity is greater than a preset similarity threshold to obtain corresponding second images;
constructing a corresponding image sequence based on the second images;
and wherein the step of extracting target text from the video specifically comprises:
acquiring multiple preset text extraction strategies;
selecting, from all the text extraction strategies, a target text extraction strategy that matches the video;
performing text extraction on the video based on the target text extraction strategy to obtain corresponding text data;
using the text data as the target text.

2. The artificial intelligence-based copy data generation method of claim 1, wherein the step of extracting target audio from the video specifically comprises:
invoking a preset audio extraction tool;
processing the video with the audio extraction tool to obtain corresponding first audio;
performing speech separation on the first audio to obtain corresponding second audio;
using the second audio as the target audio.

3. The artificial intelligence-based copy data generation method of claim 1, wherein the step of combining the integrated image vector, the integrated audio vector, and the integrated text vector to obtain a corresponding target vector specifically comprises:
acquiring a preset splicing strategy;
splicing the integrated image vector, the integrated audio vector, and the integrated text vector together based on the splicing strategy to obtain a corresponding spliced vector;
using the spliced vector as the target vector.
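The fingerprint-based frame filtering in claim 1 can be illustrated with the sketch below. Several points are assumptions, since the claim does not fix them: an average hash stands in for the unnamed "preset image fingerprint algorithm", similarity is the fraction of matching hash bits, and frames whose similarity to an already kept frame exceeds the threshold are treated as near-duplicates and dropped (the claim wording leaves open which side of the threshold is retained). Frames are represented as small grayscale grids; reading and sampling the video with a video processing library is not shown, and all function names are hypothetical.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # small grayscale thumbnail, pixel values 0-255


def average_hash(img: Gray) -> List[int]:
    """Fingerprint: one bit per pixel, set when the pixel is above the mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def fingerprint_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of matching bits (1.0 means identical fingerprints)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_key_frames(frames: Sequence[Gray], threshold: float = 0.9) -> List[int]:
    """Return indices of frames kept after dropping near-duplicates."""
    kept: List[int] = []
    kept_hashes: List[List[int]] = []
    for i, frame in enumerate(frames):
        h = average_hash(frame)
        # Assumption: similarity above the threshold marks a near-duplicate.
        if all(fingerprint_similarity(h, k) <= threshold for k in kept_hashes):
            kept.append(i)
            kept_hashes.append(h)
    return kept


a = [[0, 0], [255, 255]]
b = [[10, 5], [250, 240]]   # same layout as a, slightly different values
c = [[255, 0], [255, 0]]    # different layout
print(select_key_frames([a, b, c]))
```

In practice the thumbnails would be larger (8x8 is a common choice for average hashing) so that the fingerprint discriminates between visually distinct frames.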
4. The artificial intelligence-based copy data generation method of claim 1, wherein the step of generating target copy corresponding to the video based on the word data specifically comprises:
splicing all the word data together to obtain corresponding spliced data;
acquiring preset cleaning rules;
cleaning the spliced data based on the cleaning rules to obtain corresponding target data;
using the target data as the target copy.

5. The artificial intelligence-based copy data generation method of claim 1, wherein, before the step of extracting an image sequence from the video based on a preset image clustering algorithm and generating an integrated image vector corresponding to the image sequence based on a trained image encoder, the method further comprises:
acquiring pre-built video sample data;
acquiring a preset initial encoder and a feature cross-alignment training strategy corresponding to the initial encoder, wherein the initial encoder is an initial image encoder, an initial audio encoder, or an initial text encoder;
determining a target optimization algorithm corresponding to the initial encoder;
training the initial encoder with the video sample data based on the feature cross-alignment training strategy, the target optimization algorithm, and a preset InfoNCE loss function to obtain a trained specified encoder;
storing the specified encoder.

6. An artificial intelligence-based copy data generation apparatus, characterized in that it comprises:
a first acquisition module, used to acquire a video to be processed;
a first processing module, used to extract an image sequence from the video based on a preset image clustering algorithm, and to generate an integrated image vector corresponding to the image sequence based on a trained image encoder;
a second processing module, used to extract target audio from the video and to generate an integrated audio vector corresponding to the target audio based on a trained audio encoder;
a third processing module, used to extract target text from the video and to generate an integrated text vector corresponding to the target text based on a trained text encoder;
a combination module, used to combine the integrated image vector, the integrated audio vector, and the integrated text vector to obtain a corresponding target vector;
a decoding module, used to decode the target vector based on a trained decoder to obtain corresponding word data, wherein there are multiple items of word data;
a generation module, used to generate target copy corresponding to the video based on the word data;
wherein the first processing module comprises:
a reading submodule, used to read the video file of the video based on a preset video processing library;
a processing submodule, used to perform frame extraction on the video file at a preset time interval to obtain corresponding first images;
a clustering submodule, used to perform image clustering on the first images based on a preset image fingerprint algorithm and to obtain the fingerprint similarity of the first images;
a filtering submodule, used to filter, from the first images, the images whose fingerprint similarity is greater than a preset similarity threshold to obtain corresponding second images;
a construction submodule, used to construct a corresponding image sequence based on the second images;
and wherein the third processing module comprises:
a first acquisition submodule, used to acquire multiple preset text extraction strategies;
a filtering submodule, used to select, from all the text extraction strategies, a target text extraction strategy that matches the video;
a second extraction submodule, used to perform text extraction on the video based on the target text extraction strategy to obtain corresponding text data;
a third determining submodule, used to use the text data as the target text.
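The InfoNCE loss named in claim 5 can be illustrated with the standard batch formulation below. This is a sketch under assumptions rather than the claimed training procedure: "feature cross-alignment" is read here as pairing the embedding of one modality with the embedding of another modality from the same video (the positive), with the other videos in the batch acting as negatives. The temperature value of 0.07 is an assumed default, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce(anchors: Sequence[Sequence[float]],
             positives: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Mean InfoNCE loss; positives[i] is the match for anchors[i]."""
    if len(anchors) != len(positives) or not anchors:
        raise ValueError("need equally sized, non-empty batches")
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when each anchor is closest to its own positive and grows when another item in the batch is a better match, which is the pressure that pulls the modality embeddings of the same video together.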

7. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the steps of the artificial intelligence-based copy data generation method as described in any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the artificial intelligence-based copy data generation method as described in any one of claims 1 to 5.

Citation Information

Patent Citations

Video generation method and device, equipment and storage medium
CN117336567A
Method and apparatus for recognizing multimedia content
US20230032728A1
