Video generation method and apparatus
By segmenting text into subtexts and matching them with video features, the generated video exhibits better coherence, thus solving the problem of poor video coherence in existing technologies.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NETEASE (HANGZHOU) NETWORK CO LTD
- Filing Date
- 2023-03-08
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, when automatically generating videos from text, the correlation between images is poor, resulting in poor video continuity.
The target text is divided into several subtexts, the embedding features of the subtexts are extracted, and they are matched with the pre-built video embedding features. The videos corresponding to the subtexts are then concatenated to generate the target video.
It improves video coherence: the target video is assembled from retrieved video clips that match the given text, so the result is more consistent than a sequence of stitched images.
Figure CN116320659B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a video generation method and apparatus.
Background Technology
[0002] This section is intended to provide background or context for the embodiments of this disclosure as set forth in the claims. The description herein is not admitted to be prior art merely because it is included in this section.
[0003] With the development of computer technology, more and more content is presented in video format, and correspondingly, the demand for video editing work is gradually increasing.
[0004] In existing technologies, a common approach is to search for several images based on the text input by the user, and then stitch these images together to achieve the effect of automatically generating videos based on text.
[0005] However, the images in the videos generated by the above method have poor correlation, resulting in poor video coherence.
Summary of the Invention
[0006] In view of this, the purpose of this disclosure is to provide a video generation method and apparatus.
[0007] To achieve the above objectives, this exemplary embodiment provides a video generation method, including:
[0008] Obtain the target text and divide the target text into several subtexts;
[0009] The embedding features of the subtext are extracted, and the embedding features of the subtext are matched with the embedding features of several pre-constructed videos to obtain the video corresponding to the subtext;
[0010] By concatenating the video corresponding to each of the sub-texts, the target video is obtained.
[0011] Based on the same inventive concept, an exemplary embodiment of this disclosure also provides a video generation apparatus, including:
[0012] The text acquisition module is configured to acquire target text and divide the target text into several subtexts;
[0013] The video retrieval module is configured to extract the embedding features of the subtext and match the embedding features of the subtext with the embedding features of a number of pre-constructed videos to obtain the video corresponding to the subtext.
[0014] The video stitching module is configured to stitch together the video corresponding to each of the sub-texts to obtain the target video.
[0015] Based on the same inventive concept, an exemplary embodiment of this disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described in any of the above embodiments.
[0016] Based on the same inventive concept, exemplary embodiments of this disclosure also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any of the methods described above.
[0017] Based on the same inventive concept, the exemplary embodiments of this disclosure also provide a computer program product, including computer program instructions that, when run on a computer, cause the computer to perform the method described in any of the above.
[0018] As can be seen from the above, the video generation method and apparatus provided in this disclosure include: acquiring target text and dividing the target text into several sub-texts; extracting the embedding features of the sub-texts and matching the embedding features of the sub-texts with the embedding features of several pre-constructed videos to obtain videos corresponding to the sub-texts; and splicing the videos corresponding to each sub-text to obtain a target video. This disclosure retrieves several video segments based on known text and splices them to generate a video that matches the known text. Compared with related technologies that generate videos by splicing images, the generated video has better continuity.
Attached Figure Description
[0019] To illustrate the technical solutions in this disclosure or related technologies more clearly, the accompanying drawings used in the description of the embodiments or related technologies are briefly introduced below. The accompanying drawings described below show only some embodiments of this disclosure. Those skilled in the art can obtain other drawings from these drawings without creative effort.
[0020] Figure 1 is a schematic diagram of an application scenario of the video generation method provided in an embodiment of this disclosure;
[0021] Figure 2 is a schematic flowchart of a video generation method provided in an embodiment of this disclosure;
[0022] Figure 3 is another schematic flowchart of the video generation method provided in an embodiment of this disclosure;
[0023] Figure 4 is a schematic structural diagram of a video generation apparatus provided in an embodiment of this disclosure;
[0024] Figure 5 is a schematic structural diagram of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0025] To make the objectives, technical solutions, and advantages of this disclosure clearer, the principles and spirit of this disclosure will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are provided merely to enable those skilled in the art to better understand and implement this disclosure, and are not intended to limit the scope of this disclosure in any way. Rather, these embodiments are provided to make this disclosure more thorough and complete, and to fully convey the scope of this disclosure to those skilled in the art.
[0026] In this document, it should be understood that the number of any element in the accompanying figures is illustrative rather than limiting, and that any naming is used only for distinction and has no limiting meaning.
[0027] The principles and spirit of this disclosure will be explained in detail below with reference to several representative embodiments.
[0028] Refer to Figure 1, which is a schematic diagram of an application scenario of the video generation method provided in an embodiment of this disclosure.
[0029] This application scenario includes terminal device 101, server 102, and data storage system 103. Terminal device 101, server 102, and data storage system 103 can all be connected via wired or wireless communication networks. Terminal device 101 includes, but is not limited to, desktop computers, mobile phones, mobile computers, tablets, media players, smart wearable devices, personal digital assistants (PDAs), or other electronic devices capable of performing the aforementioned functions. Server 102 and data storage system 103 can be independent physical servers, server clusters or distributed systems composed of multiple physical servers, or cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
[0030] Server 102 provides video generation services to users of terminal device 101. Terminal device 101 has a client installed that communicates with server 102. Users can input target text through the client, which sends the target text to server 102. Server 102 divides the target text into several sub-texts; extracts the embedding features of the sub-texts, and matches the embedding features of the sub-texts with the embedding features of several pre-constructed videos to obtain the video corresponding to each sub-text; concatenates the videos corresponding to each sub-text to obtain the target video, and sends the target video to the client, which then displays the target video to the user.
[0031] The data storage system 103 stores a large number of video embedding features and the videos themselves.
[0032] The video generation method according to exemplary embodiments of this disclosure is described below with reference to the application scenario of Figure 1. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principles of this disclosure, and the embodiments of this disclosure are not limited in this respect. Rather, the embodiments of this disclosure can be applied to any applicable scenario.
[0033] Refer to Figure 2, which is a schematic flowchart of a video generation method provided in an embodiment of this disclosure.
[0034] The video generation method includes the following steps:
[0035] Step S210: Obtain the target text and divide the target text into several sub-texts.
[0036] In some exemplary embodiments, a method for dividing target text into subtexts includes:
[0037] The target text is divided into several sub-texts based on the punctuation marks in the target text.
[0038] Optionally, the punctuation marks in the target text can be sentence-end marks, which are marks used at the end of a sentence to indicate a pause and the tone of the sentence, such as a period, a question mark, and an exclamation mark.
[0039] In some exemplary embodiments, after dividing the target text into several sub-texts, the method further includes:
[0040] Calculate the number of characters in the subtext;
[0041] In response to determining that the number is less than a preset first number threshold, the subtext is merged with adjacent subtexts until the number of characters in the merged subtext is greater than or equal to the first number threshold;
[0042] In response to determining that the number is greater than a preset second number threshold, the subtext is split into several subtexts according to the punctuation marks in the subtext, until the number of characters in each split subtext is less than or equal to the second number threshold.
[0043] Optionally, the punctuation marks in the subtext can be intra-sentence marks. Intra-sentence marks are used within a sentence to indicate different types of pauses, such as commas, enumeration commas, semicolons, and colons.
[0044] Specifically, when the number of characters in the subtext is too small, the accuracy of feature extraction is low; when the number of characters in the subtext is too large, the efficiency of feature extraction is low. Through the above embodiments, the number of characters in the subtext can be controlled within a reasonable range, improving efficiency while ensuring accuracy.
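The segmentation described in step S210 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function names, the two punctuation sets, the thresholds `min_chars` and `max_chars`, and the choice to merge a short sub-text with its following neighbour are all assumptions made for the example.

```python
import re

# Assumed punctuation sets; the disclosure only gives examples of each kind.
SENTENCE_END = "。？！.?!"      # sentence-end marks
INTRA_SENTENCE = "，、；：,;:"  # intra-sentence marks


def split_keep_marks(text: str, marks: str) -> list[str]:
    """Split text after each mark in `marks`, keeping the mark with its piece."""
    pieces = re.split(f"(?<=[{re.escape(marks)}])", text)
    return [p.strip() for p in pieces if p.strip()]


def segment(text: str, min_chars: int, max_chars: int) -> list[str]:
    """Divide text into sub-texts, merging short ones and splitting long ones."""
    # Step 1: divide on sentence-end marks.
    subtexts = split_keep_marks(text, SENTENCE_END)

    # Step 2: merge a sub-text shorter than min_chars with the next one.
    merged: list[str] = []
    for s in subtexts:
        if merged and len(merged[-1]) < min_chars:
            merged[-1] = merged[-1] + " " + s
        else:
            merged.append(s)
    # A short trailing piece has no following neighbour; merge it backwards.
    if len(merged) > 1 and len(merged[-1]) < min_chars:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail

    # Step 3: split a sub-text longer than max_chars on intra-sentence marks.
    # This sketch makes a single pass and does not re-check the pieces.
    result: list[str] = []
    for s in merged:
        if len(s) > max_chars:
            result.extend(split_keep_marks(s, INTRA_SENTENCE))
        else:
            result.append(s)
    return result
```

For example, `segment("Hi. The cat sat on the mat, then it slept for hours, and later it ate. Done!", 5, 40)` first merges the short "Hi." into the next sentence and then splits the resulting long sub-text at its commas.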
[0045] Step S220: Extract the embedding features of the sub-text, and match the embedding features of the sub-text with the embedding features of several pre-constructed videos to obtain the video corresponding to the sub-text.
[0046] In some exemplary embodiments, a method for extracting the embedding features of subtext includes:
[0047] Extract the keywords from the subtext and extract the embedding features of the keywords as the embedding features of the subtext.
[0048] Optionally, keywords of the subtext can be extracted using at least one of the following algorithms: TF-IDF, TextRank, and LDA (Latent Dirichlet Allocation, a three-layer Bayesian probabilistic model). This disclosure does not limit the method for extracting keywords from the subtext.
[0049] Optionally, the embedding features of keywords can be extracted using machine learning algorithms. For example, the embedding features of keywords can be extracted using the GPT-2 model.
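As an illustration of the keyword extraction option in [0048], the following sketch ranks the tokens of each sub-text by TF-IDF, treating each sub-text as one document. The function name, the `top_k` parameter, and the assumption that sub-texts are already tokenized are hypothetical; the subsequent embedding step (for example with GPT-2) is not shown.

```python
import math
from collections import Counter


def tfidf_keywords(subtexts: list[list[str]], top_k: int = 2) -> list[list[str]]:
    """Return the top_k highest TF-IDF tokens for each tokenized sub-text."""
    n_docs = len(subtexts)
    # Document frequency: the number of sub-texts in which each token occurs.
    df = Counter(tok for doc in subtexts for tok in set(doc))
    keywords = []
    for doc in subtexts:
        tf = Counter(doc)
        scores = {
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords
```

A token that appears in every sub-text receives a score of zero, so common words naturally fall to the bottom of the ranking.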
[0050] In some exemplary embodiments, a method for extracting embedding features from a video includes:
[0051] The video is converted into several image frames;
[0052] The image frame is converted into several sub-image frames, and the sub-image frames are mapped to an embedding sequence to obtain the embedding features of the video.
[0053] Optionally, methods for extracting embedding features from videos include:
[0054] The video is sampled into several image frames;
[0055] Convert the image frame into several flattened 2D patches;
[0056] The patches are mapped to a 1D embedding sequence through a linear patch embedding layer and then input into the ViT model to obtain the embedding features of the video output by the ViT model.
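The patch and linear-embedding steps of [0055] and [0056] can be sketched as below. This is a simplified illustration under stated assumptions: frames are single-channel nested lists, the frame size is divisible by the patch size, and `weight` and `bias` stand in for the learned parameters of the linear patch embedding layer. The ViT encoder that consumes the resulting sequence is omitted.

```python
def patchify(frame: list[list[float]], patch: int) -> list[list[float]]:
    """Split an H x W frame into flattened patch x patch patches (row-major)."""
    h, w = len(frame), len(frame[0])
    if h % patch or w % patch:
        raise ValueError("frame size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([
                frame[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return patches


def linear_embed(patches, weight, bias):
    """Project each flattened patch to the embedding dimension: x @ W + b.

    weight has shape (patch * patch) x D and bias has length D.
    """
    return [
        [sum(x * weight[k][d] for k, x in enumerate(p)) + bias[d]
         for d in range(len(bias))]
        for p in patches
    ]
```

A 4 x 4 frame with a patch size of 2 yields four patches of four values each, which `linear_embed` maps to a sequence of four D-dimensional vectors.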
[0057] In some exemplary embodiments, a method for matching the embedding features of subtext with the embedding features of video includes:
[0058] The embedding features of the sub-text and the embedding features of the video are normalized to obtain normalized text embedding features and normalized video embedding features.
[0059] Calculate the similarity between the normalized text embedding features and the normalized video embedding features, and take the video corresponding to the largest similarity value among several similarity values as the video corresponding to the subtext.
[0060] In some embodiments, the formula for calculating the similarity between text and video is as follows:
[0061] [Formula image not reproduced in this text.]
[0062] Where $s(v_i, t_j)$ denotes the similarity between video $v_i$ and text $t_j$, $w_j$ denotes the features of text $t_j$, and $z_i$ denotes the features of video $v_i$.
[0063] Optionally, the similarity score ranges from 0 to 1. The more closely matched the text and video are, the closer the similarity score is to 1, and the less closely matched the text and video are, the closer the similarity score is to 0.
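The normalize-then-match procedure of [0058] and [0059] can be sketched as follows. Because the similarity formula itself is not reproduced in this text, the sketch assumes the dot product of L2-normalized features (cosine similarity); the function names and the dictionary of video embeddings are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def best_video(text_emb: list[float],
               video_embs: dict[str, list[float]]) -> tuple[str, float]:
    """Return the id and score of the video most similar to the sub-text."""
    w = l2_normalize(text_emb)
    # Assumed similarity: dot product of the normalized features.
    scores = {
        vid: sum(a * b for a, b in zip(w, l2_normalize(z)))
        for vid, z in video_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Note that a raw cosine similarity lies in [-1, 1] in general and is confined to [0, 1] only when the features are non-negative, so the 0 to 1 range stated above may rely on a property of the features or on a rescaling that this sketch does not model.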
[0064] Step S230: Concatenate the video corresponding to each of the sub-texts to obtain the target video.
[0065] In some exemplary embodiments, the target video is obtained by splicing the videos corresponding to each of the sub-texts according to their order in the target text.
[0066] As can be seen from the above, the video generation method provided in this disclosure includes: obtaining target text and dividing the target text into several sub-texts; extracting the embedding features of the sub-texts and matching the embedding features of the sub-texts with the embedding features of several pre-constructed videos to obtain the video corresponding to the sub-text; and splicing the videos corresponding to each sub-text to obtain the target video. This disclosure retrieves several video segments based on known text and splices them to generate a video that matches the known text. Compared with related technologies that generate videos by splicing images, the generated video has better continuity.
[0067] Refer to Figure 3, which is another schematic flowchart of the video generation method provided in an embodiment of this disclosure.
[0068] The video generation method includes the following steps:
[0069] Step S210: Obtain the target text and divide the target text into several sub-texts.
[0070] In some exemplary embodiments, a method for dividing target text into subtexts includes:
[0071] The target text is divided into several sub-texts based on the punctuation marks in the target text.
[0072] Step S220: Extract the embedding features of the sub-text, and match the embedding features of the sub-text with the embedding features of several pre-constructed videos to obtain the video corresponding to the sub-text.
[0073] In some exemplary embodiments, a method for extracting the embedding features of subtext includes:
[0074] Extract the keywords from the subtext and extract the embedding features of the keywords as the embedding features of the subtext.
[0075] In some exemplary embodiments, a method for extracting embedding features from a video includes:
[0076] The video is converted into several image frames;
[0077] The image frame is converted into several sub-image frames, and the sub-image frames are mapped to an embedding sequence to obtain the embedding features of the video.
[0078] In some exemplary embodiments, a method for matching the embedding features of subtext with the embedding features of video includes:
[0079] The embedding features of the sub-text and the embedding features of the video are normalized to obtain normalized text embedding features and normalized video embedding features.
[0080] Calculate the similarity between the normalized text embedding features and the normalized video embedding features, and take the video corresponding to the largest similarity value among several similarity values as the video corresponding to the subtext.
[0081] Step S230: Divide the video corresponding to the sub-text into several sub-videos, and match the sub-text with the sub-videos to obtain the sub-videos corresponding to the sub-text.
[0082] In some exemplary embodiments, a method for dividing a video corresponding to a subtext into subvideos includes:
[0083] The sub-text is converted into audio, the duration of the audio is determined, and the video is divided into several sub-videos according to the duration.
[0084] In this method, the duration of the video corresponding to the subtext is controlled according to the duration of the audio corresponding to the subtext, resulting in a more reasonable video with better audiovisual effects.
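A minimal sketch of the duration-based division in [0083] is given below. It assumes the audio duration has already been obtained from a text-to-speech step (not shown), that the windows do not overlap, and that a trailing remainder shorter than the audio is discarded; these choices and the function name are illustrative assumptions.

```python
def split_by_duration(video_len: float, audio_len: float) -> list[tuple[float, float]]:
    """Return (start, end) windows, each audio_len seconds long, within the video."""
    if audio_len <= 0:
        raise ValueError("audio_len must be positive")
    # Number of whole windows that fit; the remainder is dropped.
    n = int(video_len // audio_len)
    return [(i * audio_len, (i + 1) * audio_len) for i in range(n)]
```

For a 10-second video and a 3-second narration, this yields three candidate sub-videos covering 0 to 3, 3 to 6, and 6 to 9 seconds.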
[0085] In some exemplary embodiments, a method for dividing a video corresponding to a subtext into subvideos includes:
[0086] The video is converted into several image frames;
[0087] Calculate the similarity between the subtext and each of the image frames, and select the image frames whose similarity is greater than a preset similarity threshold as candidate image frames;
[0088] By stitching together the candidate image frames, several sub-videos are obtained.
[0089] Optionally, several sub-videos can be spliced together according to the original time sequence of the candidate image frames.
[0090] In this process, the image frames in the video are preliminarily screened, and image frames that are weakly related to the subtext are filtered out.
[0091] It should be noted that sub-videos obtained by the method described above, which divides the video solely by duration, may still contain image frames that are weakly associated with the sub-text. These image frames can affect the final similarity between the sub-video and the text. For example, in a sub-video, some image frames may have extremely high similarity to the text while others have extremely low similarity. Such a sub-video might not be selected because of its low average similarity, which is clearly undesirable. In this embodiment, because image frames weakly associated with the sub-text are filtered out in advance, this situation does not arise.
[0092] Optionally, the process of stitching together the candidate image frames to obtain several sub-videos includes:
[0093] Convert the subtext into audio and determine the duration of the audio.
[0094] The candidate image frames are stitched together according to the duration to obtain several sub-videos.
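The frame filtering and duration-based stitching of [0086] to [0094] can be sketched as below. The per-frame similarity scores are taken as given, and the frame rate `fps`, the chunking into fixed-length groups, and the dropping of leftover frames are assumptions made for illustration.

```python
def candidate_subvideos(frame_scores: list[float], threshold: float,
                        fps: float, audio_len: float) -> list[list[int]]:
    """Group above-threshold frame indices into sub-videos of the audio's length.

    frame_scores holds each frame's similarity to the sub-text, in time order.
    """
    # Keep only frames whose similarity exceeds the preset threshold.
    candidates = [i for i, s in enumerate(frame_scores) if s > threshold]
    # Number of frames needed to cover the narration.
    frames_needed = max(1, round(audio_len * fps))
    # Stitch candidates, in their original order, into fixed-length chunks;
    # leftover frames that cannot fill a chunk are dropped.
    return [
        candidates[k:k + frames_needed]
        for k in range(0, len(candidates) - frames_needed + 1, frames_needed)
    ]
```

Each returned list of frame indices represents one candidate sub-video, which can then be scored against the sub-text as a whole.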
[0095] In some exemplary embodiments, matching the sub-text with the sub-video to obtain the sub-video corresponding to the sub-text includes:
[0096] Calculate the similarity between the sub-text and each sub-video, and take the sub-video with the highest similarity as the sub-video corresponding to the sub-text.
[0097] In some embodiments, the formula for calculating the similarity between text and video is as follows:
[0098] [Formula image not reproduced in this text.]
[0099] Where $s(v_i, t_j)$ denotes the similarity between video $v_i$ and text $t_j$, $w_j$ denotes the features of text $t_j$, and $z_i$ denotes the features of video $v_i$.
[0100] In some exemplary embodiments, matching the embedding features of the subtext with the embedding features of a pre-constructed set of videos includes:
[0101] Based on a pre-trained matching model, the embedding features of the sub-text are matched with the embedding features of several pre-constructed videos;
[0102] The method further includes:
[0103] Obtain the training text and training video, as well as the similarity labels corresponding to the training text and the training video;
[0104] Extract the embedding features of the training text and the training video;
[0105] Based on the embedding features of the training text and the training video, and using the pre-built matching model, the similarity prediction results corresponding to the training text and the training video are obtained.
[0106] Based on the similarity labels and the similarity prediction results, the matching model is trained using a preset loss function.
[0107] In some exemplary embodiments, the loss function is:
[0108]-[0110] [Three formula images are not reproduced in this text.]
[0111] Where $L_{v2t}$ denotes the first loss, $L_{t2v}$ denotes the second loss, and $L$ denotes the total loss; $v_i$ denotes a video, $t_j$ denotes a text, $B$ denotes the batch size, and $s(v_i, t_j)$ denotes the similarity between video $v_i$ and text $t_j$.
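The loss formulas themselves are not reproduced in this text. The symbols listed above (a video-to-text loss, a text-to-video loss, their total, and a batch of size B) are consistent with a symmetric contrastive cross-entropy loss, so the following sketch assumes that form; it should be read as one plausible instance, not as the formula of this disclosure. It further assumes that matched video and text pairs lie on the diagonal of the similarity matrix.

```python
import math


def symmetric_contrastive_loss(sim: list[list[float]]) -> tuple[float, float, float]:
    """Compute (L_v2t, L_t2v, L) from sim[i][j] = s(v_i, t_j) over a batch.

    Assumed form: L_v2t = -(1/B) * sum_i log softmax_j(sim[i][j]) at j = i,
    L_t2v is the same over columns, and L = L_v2t + L_t2v.
    """
    b = len(sim)

    def mean_neg_log_softmax_diag(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += row[i] - log_z
        return -total / b

    l_v2t = mean_neg_log_softmax_diag(sim)                           # over texts
    l_t2v = mean_neg_log_softmax_diag([list(c) for c in zip(*sim)])  # over videos
    return l_v2t, l_t2v, l_v2t + l_t2v
```

Under this assumed form, a batch whose matched pairs score much higher than the mismatched ones gives a loss close to zero, while a batch with uniform similarities gives $2 \log B$.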
[0112] In some exemplary embodiments, the matching model includes a text feature extraction network and a video feature extraction network;
[0113] The text feature extraction network extracts the embedding features of the training text.
[0114] The video feature extraction network extracts the embedding features of the training video.
[0115] In some exemplary embodiments, the text feature extraction network and the video feature extraction network are trained based on backpropagation according to the loss function described above.
[0116] In some exemplary embodiments, after obtaining the video corresponding to the subtext, the method further includes:
[0117] Calculate the coherence score between videos corresponding to adjacent sub-texts; wherein the coherence score is the product of the similarity scores of the two videos;
[0118] In response to the coherence score being less than the coherence score threshold, the video corresponding to the subtext is re-determined.
[0119] The higher the coherence score, the better the coherence between the two videos.
[0120] Optionally, the coherence score threshold can be pre-configured or determined based on the average or median coherence score of the current video.
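The coherence check of [0117] to [0120] can be sketched as follows. The sketch assumes that "the similarity scores of the two videos" refers to each video's similarity to its own sub-text, which the text does not state explicitly; the function name and the use of the median as the default threshold (one of the options in [0120]) are illustrative choices. How the flagged videos are re-determined is not shown.

```python
from statistics import median
from typing import Optional


def low_coherence_pairs(scores: list[float],
                        threshold: Optional[float] = None) -> list[int]:
    """Return indices k where videos k and k+1 fall below the coherence threshold.

    scores[k] is the (assumed) similarity score of the video chosen for sub-text k.
    """
    # Coherence of each adjacent pair: product of the two similarity scores.
    coherence = [a * b for a, b in zip(scores, scores[1:])]
    if not coherence:
        return []
    if threshold is None:
        # Fall back to the median coherence of the current video.
        threshold = median(coherence)
    return [k for k, c in enumerate(coherence) if c < threshold]
```

With a preset threshold of 0.5, the scores `[0.9, 0.8, 0.2, 0.9]` flag the pairs starting at indices 1 and 2, both of which involve the weakly matched third video.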
[0121] Step S240: Concatenate the sub-videos corresponding to each of the sub-texts to obtain the target video.
[0122] In some exemplary embodiments, the target video is obtained by concatenating the sub-videos corresponding to each sub-text according to the order of each sub-text in the target text.
[0123] In some exemplary embodiments, after obtaining the target video, the method further includes:
[0124] The target text is embedded in the target video.
[0125] Optionally, each of the subtexts can be embedded into the video or subvideo corresponding to the subtext.
[0126] In some exemplary embodiments, after obtaining the target video, the method further includes:
[0127] Detect whether there are subtitles or watermarks in the target video;
[0128] In response to determining that a subtitle or watermark exists in the target video, the subtitle or watermark is removed.
[0129] As can be seen from the above, the video generation method provided in this disclosure includes: obtaining target text and dividing the target text into several sub-texts; extracting the embedding features of the sub-texts and matching the embedding features of the sub-texts with the embedding features of several pre-constructed videos to obtain the video corresponding to the sub-text; and splicing the videos corresponding to each sub-text to obtain the target video. This disclosure retrieves several video segments based on known text and splices them to generate a video that matches the known text. Compared with related technologies that generate videos by splicing images, the generated video has better continuity.
[0130] Furthermore, the videos corresponding to the sub-texts obtained in the initial retrieval are further processed and matched, which further improves the relevance and fit between the known texts and the generated videos.
[0131] It should be noted that the method of this disclosure embodiment can be executed by a single device, such as a computer or server. The method of this embodiment can also be applied to a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method of this disclosure embodiment, and the multiple devices will interact with each other to complete the method described.
[0132] It should be noted that the above description describes some embodiments of this disclosure. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recorded in the claims can be performed in a different order than that shown in the above embodiments and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0133] Based on the same inventive concept, corresponding to any of the above embodiments, this disclosure also provides a video generation apparatus.
[0134] Referring to Figure 4, the video generation apparatus includes:
[0135] The text acquisition module 410 is configured to acquire target text and divide the target text into several sub-texts;
[0136] The video retrieval module 420 is configured to extract the embedding features of the subtext and match the embedding features of the subtext with the embedding features of a number of pre-constructed videos to obtain the video corresponding to the subtext.
[0137] The video stitching module 430 is configured to stitch together the video corresponding to each of the sub-texts to obtain the target video.
[0138] In some exemplary embodiments, after obtaining the video corresponding to the subtext, the video retrieval module 420 is further configured to:
[0139] Divide the video corresponding to the sub-text into several sub-videos;
[0140] The sub-text is matched with the sub-video to obtain the sub-video corresponding to the sub-text;
[0141] The video stitching module 430 is also configured as follows:
[0142] The target video is obtained by concatenating the sub-videos corresponding to each of the sub-texts.
[0143] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0144] The sub-text is converted into audio, the duration of the audio is determined, and the video is divided into several sub-videos according to the duration.
[0145] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0146] The video is converted into several image frames;
[0147] Calculate the similarity between the subtext and each of the image frames, and select the image frames whose similarity is greater than a preset similarity threshold as candidate image frames;
[0148] By stitching together the candidate image frames, several sub-videos are obtained.
[0149] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0150] Convert the subtext into audio and determine the duration of the audio.
[0151] The candidate image frames are stitched together according to the duration to obtain several sub-videos.
[0152] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0153] Calculate the similarity between the sub-text and each sub-video, and take the sub-video with the highest similarity as the sub-video corresponding to the sub-text.
[0154] In some exemplary embodiments, the video retrieval module 420 is specifically configured as follows:
[0155] The embedding features of the sub-text and the embedding features of the video are normalized to obtain normalized text embedding features and normalized video embedding features.
[0156] Calculate the similarity between the normalized text embedding features and the normalized video embedding features, and take the video corresponding to the largest similarity value among several similarity values as the video corresponding to the subtext.
[0157] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0158] Calculate the coherence score between videos corresponding to adjacent sub-texts; wherein the coherence score is the product of the similarity scores of the two videos;
[0159] In response to the coherence score being less than the coherence score threshold, the video corresponding to the subtext is re-determined.
[0160] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0161] Based on a pre-trained matching model, the embedding features of the sub-text are matched with the embedding features of several pre-constructed videos;
[0162] Specifically configured as follows:
[0163] Obtain the training text and training video, as well as the similarity labels corresponding to the training text and the training video;
[0164] Extract the embedding features of the training text and the training video;
[0165] Based on the embedding features of the training text and the training video, and using the pre-built matching model, the similarity prediction results corresponding to the training text and the training video are obtained.
[0166] Based on the similarity labels and the similarity prediction results, the matching model is trained using a preset loss function.
[0167] In some exemplary embodiments, the video retrieval module 420 is further configured to:
[0168] Acquire the video and convert it into several image frames;
[0169] The image frame is converted into several sub-image frames, and the sub-image frames are mapped to an embedding sequence to obtain the embedding features of the video.
[0170] In some exemplary embodiments, the video retrieval module 420 is specifically configured as follows:
[0171] Extract the keywords from the subtext and extract the embedding features of the keywords as the embedding features of the subtext.
[0172] In some exemplary embodiments, the text acquisition module 410 is specifically configured as follows:
[0173] The target text is divided into several sub-texts based on the punctuation marks in the target text.
[0174] For ease of description, the above apparatus is described in terms of its functions, divided into various modules. Of course, in implementing this disclosure, the functions of the modules can be implemented in one or more pieces of software and/or hardware.
[0175] The apparatus of the above embodiments is used to implement the corresponding video generation method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0176] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this disclosure also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the video generation method described in any of the above embodiments.
[0177] Figure 5 illustrates a more specific hardware structure of an electronic device provided in this embodiment, which may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another within the device via the bus 1050.
[0178] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.
[0179] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
[0180] The input/output interface 1030 is used to connect input/output modules to realize information input and output. Input/output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.
[0181] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB, Ethernet cable, etc.) or wireless means (such as a mobile network, Wi-Fi, Bluetooth, etc.).
[0182] The bus 1050 includes a pathway for transmitting information between the various components of the device, such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
[0183] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.
[0184] The electronic device of the above embodiment is used to implement the corresponding video generation method in any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0185] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the video generation method as described in any of the above embodiments.
[0186] The aforementioned non-transitory computer-readable storage medium can be any available medium or data storage device that a computer can access, including but not limited to magnetic storage (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MOs), etc.), optical storage (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor storage (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND flash), solid-state drives (SSDs), etc.).
[0187] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to perform the video generation method as described in any of the embodiments in the exemplary method section above, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0188] Those skilled in the art will recognize that embodiments of this disclosure can be implemented as a system, method, or computer program product. Therefore, this disclosure can be implemented as entirely hardware, entirely software (including firmware, resident software, microcode, etc.), or a combination of hardware and software, generally referred to herein as a "circuit," "module," or "system." Furthermore, in some embodiments, this disclosure can also be implemented as a computer program product contained in one or more computer-readable media, which includes computer-readable program code.
[0189] Any combination of one or more computer-readable media may be used. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (not exhaustive) of a computer-readable storage medium may include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device.
[0190] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, capable of sending, propagating, or transmitting programs for use by or in connection with an instruction execution system, apparatus, or device.
[0191] Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber cable, RF, etc., or any suitable combination thereof.
[0192] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0193] It should be understood that each block of a flowchart and/or block diagram, as well as combinations of blocks in a flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device to produce a machine, such that the instructions, when executed by the computer or other programmable data processing device, create means for implementing the functions/operations specified in the blocks of the flowchart and/or block diagram.
[0194] These computer program instructions may also be stored in a computer-readable medium that enables a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce a product comprising instruction means that implement the functions/operations specified in the blocks of a flowchart and/or block diagram.
[0195] Computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus can provide a process for implementing the functions/operations specified in the blocks of a flowchart and/or block diagram.
[0196] Furthermore, although the operations of the methods of this disclosure are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the operations shown must be performed to achieve the desired result. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
[0197] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this disclosure should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms "first," "second," and similar words used in the embodiments of this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are used only to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly. The article "a" or "an" preceding an element does not exclude the existence of multiple such elements.
[0198] While the spirit and principles of this disclosure have been described with reference to several specific embodiments, it should be understood that this disclosure is not limited to the disclosed specific embodiments, and the division of aspects does not imply that features in these aspects cannot be combined to achieve benefit; such division is merely for convenience of expression. This disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the appended claims is to be interpreted in the broadest sense, thereby encompassing all such modifications and equivalent structures and functions.
Claims
1. A video generation method, comprising:
obtaining a target text and dividing the target text into several subtexts;
extracting the embedding features of the subtext, and matching the embedding features of the subtext with the embedding features of several pre-constructed videos to obtain the video corresponding to the subtext;
converting the video corresponding to the subtext into several image frames, calculating the similarity between the subtext and each image frame, and selecting the image frames whose similarity is greater than a preset similarity threshold as candidate image frames;
converting the subtext into audio, determining the duration of the audio, and splicing the candidate image frames according to the duration to obtain several sub-videos;
calculating the similarity between the subtext and each sub-video, and taking the sub-video with the highest similarity as the sub-video corresponding to the subtext;
calculating a coherence score between the sub-videos corresponding to adjacent subtexts, wherein the coherence score is the product of the similarity scores of the two sub-videos;
in response to the coherence score being less than a coherence score threshold, redetermining the sub-videos corresponding to the subtexts; and
concatenating the video corresponding to each of the subtexts to obtain the target video.
2. The method according to claim 1, wherein matching the embedding features of the subtext with the embedding features of several pre-constructed videos to obtain the video corresponding to the subtext comprises:
normalizing the embedding features of the subtext and the embedding features of the videos to obtain normalized text embedding features and normalized video embedding features; and
calculating the similarity between the normalized text embedding features and the normalized video embedding features, and taking the video corresponding to the largest of the several similarity values as the video corresponding to the subtext.
3. The method according to claim 1, wherein matching the embedding features of the subtext with the embedding features of several pre-constructed videos comprises:
matching, based on a pre-trained matching model, the embedding features of the subtext with the embedding features of the several pre-constructed videos;
and wherein the method further comprises:
obtaining a training text and a training video, as well as the similarity labels corresponding to the training text and the training video;
extracting the embedding features of the training text and the training video;
obtaining, based on the embedding features of the training text and the training video and using a pre-built matching model, the similarity prediction results corresponding to the training text and the training video; and
training the matching model using a preset loss function, based on the similarity labels and the similarity prediction results.
4. The method according to claim 1, wherein the method further comprises:
acquiring the video and converting it into several image frames; and
converting each image frame into several sub-image frames, and mapping the sub-image frames to an embedding sequence to obtain the embedding features of the video.
5. The method according to claim 1, wherein extracting the embedding features of the subtext comprises:
extracting the keywords from the subtext, and extracting the embedding features of the keywords as the embedding features of the subtext.
6. A video generation apparatus, comprising:
a text acquisition module configured to acquire a target text and divide the target text into several subtexts;
a sub-video generation module configured to convert the video corresponding to the subtext into several image frames, calculate the similarity between the subtext and each image frame, and select the image frames whose similarity is greater than a preset similarity threshold as candidate image frames; and to convert the subtext into audio, determine the duration of the audio, and splice the candidate image frames according to the duration to obtain several sub-videos;
a sub-video matching module configured to calculate the similarity between the subtext and each sub-video, and to take the sub-video with the highest similarity as the sub-video corresponding to the subtext;
a coherence verification module configured to calculate a coherence score between the sub-videos corresponding to adjacent subtexts, wherein the coherence score is the product of the similarity scores of the two sub-videos, and, in response to the coherence score being less than a coherence score threshold, to trigger the sub-video matching module to redetermine the sub-video corresponding to the subtext; and
a video stitching module configured to stitch together the video corresponding to each of the subtexts to obtain the target video.
7. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 5.
8. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform the method according to any one of claims 1 to 5.
9. A computer program product, comprising computer program instructions that, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 5.