Text information generation method and device, electronic equipment and storage medium

By extracting video features and generating target prompts, and using a large language model to generate text information, the problem of low efficiency and poor quality of text information generation on video platforms is solved, achieving efficient and matching text information generation.

CN119854566B · Active · Publication Date: 2026-06-12 · BEIJING ZITIAO NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING ZITIAO NETWORK TECH CO LTD
Filing Date: 2023-10-16
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

In existing technologies, text information on video platforms is generated inefficiently and is of poor quality: manually edited text is scarce, and pre-generated text is repetitive and poorly matched to the video content.

Method used

By extracting video features from the target video, target prompts are generated, and large language models are used to generate text information to ensure that the content matches the video content.

Benefits of technology

It enables the rapid and efficient generation of text information, improves generation efficiency and content quality, and ensures the matching of text information with video content.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present disclosure provide a text information generation method and device, electronic equipment and storage medium. The method comprises: obtaining a target video, and extracting a video feature of the target video, the video feature being used to represent content in at least one content dimension of the target video; generating a target prompt according to the video feature, the target prompt being used to represent a generation rule of text information for the target video; and generating the text information for the target video based on the target prompt. By extracting the video feature of the target video, generating the target prompt, and using the target prompt in combination with the capability of a language model to generate the text information for the target video, the method realizes fast and efficient generation of the text information, while ensuring that the content of the generated text information matches the video content of the target video, and improves the generation efficiency and content quality of the text information corresponding to the target video.

Description

Technical Field

[0001] This disclosure relates to the field of Internet technology, and in particular to a method, apparatus, electronic device and storage medium for generating text information.

Background Technology

[0002] Currently, video resources published on various video platforms are usually accompanied by corresponding descriptive text information. Through this text information, users can better understand the content of the video and enable interaction between different users.

[0003] In existing technologies, the aforementioned text information is usually manually edited and uploaded to the video platform by the video publisher or video commenter. This is inefficient and cannot guarantee the quality of the content, resulting in problems such as a small quantity and poor content quality of text information related to videos on the video platform.

Summary of the Invention

[0004] This disclosure provides a text information generation method, apparatus, electronic device, and storage medium to overcome the problems of insufficient quantity and poor content quality of text information for videos.

[0005] In a first aspect, embodiments of this disclosure provide a method for generating text information, including:

[0006] A target video is acquired, and video features of the target video are extracted, wherein the video features are used to characterize the content of the target video in at least one content dimension; a target prompt is generated based on the video features, wherein the target prompt is used to characterize the generation rules of text information for the target video; and text information for the target video is generated based on the target prompt.

[0007] In a second aspect, embodiments of this disclosure provide a text information generation apparatus, comprising:

[0008] An acquisition module is used to acquire a target video and extract video features from the target video, wherein the video features are used to characterize the content of the target video in at least one content dimension;

[0009] The first generation module is used to generate a target prompt based on the video features, wherein the target prompt is used to characterize the generation rules of text information for the target video;

[0010] The second generation module is used to generate text information for the target video based on the target prompt.

[0011] In a third aspect, embodiments of this disclosure provide an electronic device, including: a processor and a memory;

[0012] The memory stores computer-executable instructions;

[0013] The processor executes the computer-executable instructions stored in the memory, causing the processor to perform the text information generation method described in the first aspect and the various possible designs of the first aspect.

[0014] In a fourth aspect, embodiments of this disclosure provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the text information generation method described in the first aspect and the various possible designs of the first aspect.

[0015] In a fifth aspect, embodiments of this disclosure provide a computer program product, including a computer program that, when executed by a processor, implements the text information generation method described in the first aspect and the various possible designs of the first aspect.

[0016] The text information generation method, apparatus, electronic device, and storage medium provided in this embodiment acquire a target video and extract its video features, which characterize the content of the target video in at least one content dimension. Based on the video features, a target prompt is generated, which characterizes the rules for generating text information for the target video. Text information for the target video is then generated based on the target prompt. By extracting the video features of the target video, generating the target prompt, and utilizing the capability of a language model to generate text information for the target video, the method achieves rapid and efficient text information generation while ensuring that the content of the generated text information matches the video content of the target video, thereby improving the generation efficiency and content quality of the text information corresponding to the target video.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is an application scenario diagram of the text information generation method provided in the embodiments of this disclosure;

[0019] Figure 2 is a first flowchart of the text information generation method provided in the embodiments of this disclosure;

[0020] Figure 3 is a flowchart of one possible implementation of step S102 in the embodiment shown in Figure 2;

[0021] Figure 4 is a flowchart of another possible implementation of step S102 in the embodiment shown in Figure 2;

[0022] Figure 5 is a flowchart of yet another possible implementation of step S102 in the embodiment shown in Figure 2;

[0023] Figure 6 is a flowchart of one possible implementation of step S1028 in the embodiment shown in Figure 5;

[0024] Figure 7 is a schematic diagram illustrating a process for generating video features based on multiple content dimensions, as provided in an embodiment of this disclosure;

[0025] Figure 8 is a schematic diagram of a video playback page provided in an embodiment of this disclosure;

[0026] Figure 9 is a second flowchart of the text information generation method provided in the embodiments of this disclosure;

[0027] Figure 10 is a schematic diagram of the structure of a prompt template provided in an embodiment of this disclosure;

[0028] Figure 11 is a structural block diagram of the text information generation apparatus provided in the embodiments of this disclosure;

[0029] Figure 12 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure;

[0030] Figure 13 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this disclosure.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0032] It should be noted that the user information and data involved in this disclosure (including but not limited to data used for analysis, stored data, and displayed data) are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.

[0033] The application scenarios of the embodiments of this disclosure are explained below:

[0034] Figure 1 is a diagram illustrating an application scenario of the text information generation method provided in this disclosure. The method can be applied to applications (APPs) with online video playback capabilities, and more specifically, to scenarios where descriptive text information is automatically generated for live videos, short videos, and other video media. The executing entity in this embodiment can be a terminal device running the aforementioned application, a server on which the server-side component of the application is deployed, or another electronic device performing similar functions. As shown in Figure 1, taking a terminal device as an example, the terminal device plays a short video by running a short video application. The short video application also provides a description function for the short video. Specifically, the short video playback page displays a video playback area and a description area for the short video. For example, the description area currently includes text #1 posted by the video publisher User_1 and text #2 posted by the commenter User_2. Then, in response to a trigger operation by the current user, such as clicking the "Generate Description" control shown in the figure, the terminal device automatically generates text information for the short video by executing the text information generation method provided in this embodiment, and displays it in the input box corresponding to the description area. The content of this text information is shown as "XXXXXX" in the figure.
Afterwards, the current user can directly publish the text information in the input box to the description area by clicking the "Publish" control, or further edit and improve the above text information to obtain descriptive text that meets the user's personalized needs, and then publish the improved descriptive text to the description area, thereby realizing the rapid generation and publication of text information for the video. The text information in the description area can be a description of the current video, such as the video theme and author, or it can be user comments, bullet screen text, graphic symbols, etc.

[0035] In another possible application scenario, the execution entity of this embodiment can also be the server corresponding to the video platform. The server, for the target video, uses the text information generation method provided in this embodiment to directly generate corresponding text information and load it into the description area corresponding to the target video, thereby achieving automatic generation of text information. Specific application scenarios can be determined based on needs, and will not be elaborated upon here.

[0036] In existing technologies, text information for target videos is typically edited and published by the user. Some related technologies allow video platforms to provide users with pre-generated text information and, based on user selections or the video content itself, determine a corresponding target text from the pre-generated information and publish it in the target video's description area. The aforementioned manual editing-based approach results in a low quantity and efficiency of text information generated by the video platform, while the selection based on pre-generated text information leads to repetitive and similar content with low relevance to the actual content of the target video, resulting in poor text content quality.

[0037] This disclosure provides a text information generation method to solve the above problems.

[0038] Referring to Figure 2, Figure 2 is a first flowchart of the text information generation method provided in the embodiments of this disclosure. The method of this embodiment can be applied in a terminal device. This text information generation method includes:

[0039] Step S101: Obtain the target video.

[0040] Step S102: Extract video features from the target video. The video features are used to characterize the content of the target video in at least one content dimension.

[0041] For example, referring to the application scenario shown in Figure 1, take a terminal device as the execution subject of the method in this embodiment. The terminal device obtains the target video by running a target application that has the function of providing text descriptions for the video. More specifically, the target application is, for example, a short video application. In this case, the terminal device obtains the target video from the video server in response to user operations through the target application, and plays it in subsequent steps to realize the basic video playback function of the target application. In another possible implementation, the target video can also be stored locally on the terminal device. The terminal device loads the target video from local storage and plays it in response to user operations. The specific implementation method is not limited.

[0042] Further, for example, after obtaining the target video, the terminal device extracts features from the target video while playing the target video (or before or after playing the target video, depending on specific needs), to obtain the video features of the target video. Among them, the video features are used to represent the content of at least one content dimension of the target video. For example, the title, author, and tags of the target video are all content dimensions of the video features. For another example, the image content and content text in the target video can also be content dimensions of the video features. The video features obtained after extracting the target video are the information representing the content under the above content dimensions. More specifically, the video features can be a feature identifier or a set of multiple feature identifiers. The feature identifier is used to represent the content under different content dimensions. For example, the video features include Tag[1] = {author: user_1}, which represents the content that the author of the target video is "user_1". For another example, the video features include Tag[2] = {title: introduce city_A}, which represents the content that the content theme or title of the target video is "introducing city_A". In another possible implementation, video features can be represented in the form of a feature matrix, which can be used to characterize the more complex content of the target video in one or more content dimensions. Examples will not be elaborated here.
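As an illustrative sketch only, the feature identifiers described above could be modeled as dimension/content pairs grouped by content dimension. The names `FeatureTag` and `to_video_features` are hypothetical and do not appear in the disclosure, which leaves the concrete representation open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureTag:
    """One feature identifier: a content dimension and its content."""
    dimension: str
    content: str


def to_video_features(tags):
    """Group feature identifiers by content dimension (hypothetical helper)."""
    features = {}
    for tag in tags:
        features.setdefault(tag.dimension, []).append(tag.content)
    return features


# Mirrors Tag[1] and Tag[2] from the example in the paragraph above.
tags = [
    FeatureTag("author", "user_1"),
    FeatureTag("title", "introduce city_A"),
]
video_features = to_video_features(tags)
```

A feature matrix, as mentioned above, would replace the string contents with numeric vectors; that variant is not shown here.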

[0043] Furthermore, in one possible implementation, as shown in Figure 3, the specific implementation of step S102 includes:

[0044] Step S1021: Obtain reference information for the target video.

[0045] Step S1022: Obtain the video features of the target video based on the reference information.

[0046] For example, in one possible implementation, the reference information includes at least one of the following: title information, author information, tag information, etc. The aforementioned reference information is descriptive video information specific to the target video. The terminal device can obtain the reference information corresponding to the target video from the video server during the process of acquiring the target video; alternatively, the reference information can be stored within the target video, and the terminal device obtains the corresponding reference information by parsing the target video. Then, the terminal device obtains the corresponding video features based on the aforementioned reference information. For example, based on the title information of the target video, it determines the theme content of the target video and generates a feature identifier representing the theme content of the target video. In subsequent steps, a corresponding target prompt is generated based on this theme identifier, ultimately generating text information matching the theme content.

[0047] In another possible implementation, the reference information includes reference text information, such as user comments posted by users regarding the target video. More specifically, the reference text information could be, for example, trending comments on the target video, that is, comments that have received a high number of "likes" and "favorites." In implementations that rank comments by the number of "likes," the trending comments could also be the top-ranked user comments. Then, the terminal device performs feature extraction on this reference text information to obtain the corresponding features, which serve as the video features of the target video. For example, if the reference text information includes a text segment, the semantic meaning of that text segment obtained after feature extraction can be used as one of the video features of the target video. The video features obtained from the reference text information can be represented as semantic identifiers, feature arrays, or matrices, which will not be elaborated further.
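The selection of trending comments by like count could be sketched as follows. This is a minimal illustration, assuming each comment is a dictionary with hypothetical `text` and `likes` fields; the disclosure does not specify a data layout or a cutoff, so `top_k` is also an assumption.

```python
def select_trending_comments(comments, top_k=2):
    """Return the text of the top_k comments ranked by like count, highest first."""
    ranked = sorted(comments, key=lambda c: c["likes"], reverse=True)
    return [c["text"] for c in ranked[:top_k]]


# Hypothetical comment records for one target video.
comments = [
    {"text": "Great tour of city_A", "likes": 120},
    {"text": "First!", "likes": 3},
    {"text": "The night market looks amazing", "likes": 45},
]
reference_text = select_trending_comments(comments)
```

The returned texts would then be passed to whatever feature extraction the implementation uses.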

[0048] In another possible implementation, as shown in Figure 4, the specific implementation of step S102 includes:

[0049] Step S1023: Extract the audio data of the target video.

[0050] Step S1024: Process the audio data to obtain the content text of the target video. The content text is used to represent the dialogue or narration of the characters in the target video.

[0051] Step S1025: Obtain video features based on the content text of the target video.

[0052] For example, after obtaining the target video, the terminal device can first extract data from the audio track in the target video to obtain audio data. The specific implementation method is determined by the specific data format of the target video, and will not be elaborated here. Then, content recognition is performed on the extracted audio data to obtain the content text of the target video. The content text represents the dialogue or narration of the characters in the target video. Specifically, the content text may include one or more paragraphs of text, the text content of which is the dialogue or narration of the characters in the target video. The terminal device can use a speech recognition service to convert the audio into text. The specific implementation of the speech recognition service is existing technology and will not be elaborated here.

[0053] After obtaining the content text, the terminal device uses the dialogue and narration in the target video represented by the content text to perform video content summarization processing, obtaining video features corresponding to the target video. One possible implementation is that the video features obtained from the content text can be feature identifiers representing the content category corresponding to the main content of the dialogue and / or narration, obtained after summarizing the content text. For example, feature identifier #1 represents the main content of the dialogue and / or narration as "travel," and feature identifier #2 represents the main content of the dialogue and / or narration as "technology product introduction." Then, based on the above feature identifiers, video features are generated. The target prompts subsequently generated based on the video features are matched with the content categories represented by the above feature identifiers. This ensures that the finally generated text information matches the dialogue and / or narration in the target video, improving the matching accuracy between the text information and the video content.
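The mapping from content text to a content-category feature identifier could look roughly like the sketch below. Keyword counting is a stand-in chosen for brevity and is not the summarization method of the disclosure, which is left unspecified; the keyword sets and the function name `categorize_content_text` are hypothetical.

```python
# Hypothetical keyword sets for the two categories named in the text.
CATEGORY_KEYWORDS = {
    "travel": {"trip", "hotel", "beach", "city"},
    "technology product introduction": {"phone", "battery", "camera", "chip"},
}


def categorize_content_text(content_text):
    """Return the category with the most keyword hits, or None if nothing matches."""
    words = [word.strip(".,!?") for word in content_text.lower().split()]
    scores = {
        category: sum(word in keywords for word in words)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


feature_id = categorize_content_text(
    "This phone has a great camera and a long battery life."
)
```

In practice a classifier or a language model would likely replace the keyword table; the interface (text in, category identifier out) is the point of the sketch.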

[0054] In another possible implementation, the video features obtained from the content text can be a summary text representing the main content of dialogue and / or narration, derived from the content text. This involves semantically summarizing the dialogue and / or narration to summarize the main theme. Then, video features are generated based on this summary text. The target prompts generated subsequently based on these video features are matched with the theme represented by the summary text. This ensures that the final generated text information matches the dialogue and / or narration in the target video, improving the matching degree between the text information and the video content. The process of summarizing and inductively combining the content text to obtain the summary text can be implemented using a pre-defined language model, which will not be elaborated upon here.

[0055] Furthermore, after extracting the audio data of the target video, the process also includes: processing the audio data to obtain the musical features of the background music in the target video. That is, if the audio data extracted from the target video includes background music in addition to dialogue and narration, the background music portion can be further processed to extract the corresponding musical features. Based on the emotional content expressed by these musical features, the dimensions of the video features can be further enriched. For example, musical features may include at least one of the following: music title identifier, music genre, and music melody. Depending on the implementation, musical features can likewise take the form of feature identifiers, feature arrays, or feature matrices, which will not be elaborated further.

[0056] Accordingly, the music features are then merged with the content text features to obtain video features that include the music feature dimension. Specifically, step S1025 is implemented by obtaining video features based on the content text of the target video and the music features of the background music.

[0057] In this embodiment, by extracting the background music features from the audio data and further combining them with the content text, video features are obtained. This enriches the information content and dimensions of the video features, further improving the accuracy of the video features in describing the content of the target video, making the subsequently generated target prompts more accurate, and thus obtaining higher quality text information.

[0058] In another possible implementation, as shown in Figure 5, the specific implementation of step S102 includes:

[0059] Step S1026: Extract frames from the target video to obtain image data of the target video, wherein the image data contains at least one video frame of the target video.

[0060] Step S1027: Perform image recognition on the image data to obtain frame content information that represents the image content of the video frame.

[0061] Step S1028: Generate video features of the target video based on the frame content information.

[0062] For example, after obtaining the target video, the terminal device can extract frames from the target video to obtain image data containing at least one video frame of the target video. Frames can be extracted at a fixed time interval, or taken from the intra-coded frames (I-frames) of the target video. If the target video is an animation, frames can also be taken from its keyframes (K-frames). The specific implementation can be set as needed. Specific video frame extraction methods are existing technologies and will not be elaborated upon here.
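For the fixed-interval option, the sampling positions can be computed independently of any decoder. The sketch below only produces the timestamps at which frames would be taken; the function name and parameters are hypothetical, and the actual decoding of frames with a video library is deliberately left out.

```python
def sample_timestamps(duration_s, interval_s):
    """Return timestamps in seconds at a fixed interval, from 0 up to (not including) duration_s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    timestamps = []
    current = 0
    while current < duration_s:
        timestamps.append(current)
        current += interval_s
    return timestamps


# A 10-second video sampled every 4 seconds.
timestamps = sample_timestamps(duration_s=10, interval_s=4)
```

I-frame or keyframe extraction would instead depend on the container and codec metadata, so it is not sketched here.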

[0063] Next, content recognition is performed on the extracted image data to obtain frame content information that represents the image content of the video frame. Specifically, the frame content information may include one or more descriptive text segments, the content of which is the image content. Alternatively, the frame content information may include a combination of multiple image element identifiers, each representing an image element in the image. For example, image element identifier P1 represents "a car," image element identifier P2 represents "a mobile phone," and so on. The frame content information formed by the combination of image element identifiers represents the image content in the video frame.

[0064] Furthermore, after obtaining the frame content information, the terminal device uses the frame content information to generate video features. These features can be feature identifiers that represent the content category of the image content, obtained by summarizing the frame content information, or summary text that represents the image content. The specific implementation method is similar to the process of obtaining video features based on content text, and will not be elaborated here.

[0065] Further, in one possible implementation, frames are extracted from the target video to obtain image data containing at least two video frames. Specifically, step S1027 involves performing image recognition on at least two target video frames in the image data to obtain corresponding frame content information. In this implementation, as shown in Figure 6, the specific implementation of step S1028 includes:

[0066] Step S1028A: Based on the playback timing of at least two target video frames, generate a frame content sequence, which includes the corresponding frame content information arranged according to the playback timing.

[0067] Step S1028B: Generate video features of the target video based on the frame content sequence.

[0068] For example, when the image data includes at least two video frames, the terminal device determines at least two target video frames and extracts features from each target video frame sequentially based on their playback timing to generate frame content information corresponding to each target video frame. Then, the frame content information corresponding to each target video frame is combined in an ordered manner to obtain a frame content sequence. This frame content sequence includes the same number of frame content information as the target video frames, meaning that the frame content information in the content sequence corresponds one-to-one with the target video frames, and the position of the frame content information in the content sequence corresponds to the position (playback timestamp) of the target video frame in the target video. Therefore, the frame content sequence can not only represent the frame content information corresponding to each target video frame but also the variation characteristics between the image content represented by each frame content information. Subsequently, based on the frame content sequence, video features of the target video are generated. These video features can represent the variation characteristics between multiple target video frames in the target video, that is, the correlation characteristics between images. This makes the generation rules of the text information represented by the target prompts generated based on video features more compatible with the video content of the target video, thereby improving the quality of the text content of the final text information generated based on the target prompts.
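Steps S1028A and S1028B could be sketched as follows. The frame records with `timestamp` and `elements` fields are hypothetical, and expressing the variation between frames as appeared and disappeared image elements is one assumed interpretation, not the method fixed by the disclosure.

```python
def build_frame_content_sequence(frames):
    """Order per-frame content information by playback timestamp (step S1028A)."""
    ordered = sorted(frames, key=lambda frame: frame["timestamp"])
    return [frame["elements"] for frame in ordered]


def content_changes(sequence):
    """For each consecutive pair of frames, list elements that appear and disappear."""
    changes = []
    for previous, current in zip(sequence, sequence[1:]):
        changes.append({
            "appeared": sorted(set(current) - set(previous)),
            "disappeared": sorted(set(previous) - set(current)),
        })
    return changes


# Frames arrive out of order; the sequence restores playback order.
frames = [
    {"timestamp": 4.0, "elements": ["car", "mobile phone"]},
    {"timestamp": 0.0, "elements": ["car"]},
]
sequence = build_frame_content_sequence(frames)
changes = content_changes(sequence)
```

The sequence keeps a one-to-one correspondence with the target video frames, and the change list is one simple way to expose the correlation between consecutive images.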

[0069] For example, the various implementations of step S102 described above can be implemented individually to enable the terminal device to obtain the video features of the target video, or at least two of the implementations can be combined to generate the video features of the target video from content in multiple content dimensions. Figure 7 is a schematic diagram illustrating a process for generating video features based on multiple content dimensions, as provided in an embodiment of this disclosure. As shown in Figure 7, after obtaining the target video, the terminal device extracts popular comments on the target video and generates a first video feature P1 based on the popular comments; it also extracts audio data from the target video and generates a second video feature P2 based on the audio data; and it extracts image data containing multiple video frames from the target video and generates a third video feature P3 based on the image data. Then, the first video feature P1, the second video feature P2, and the third video feature P3 are combined to generate the video feature P of the target video.
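One simple way to combine P1, P2, and P3, assuming each is a list of string feature identifiers, is order-preserving concatenation with duplicates removed. The disclosure does not state how the features are combined, so the function `combine_features` and the identifier format are assumptions for illustration.

```python
def combine_features(*feature_groups):
    """Concatenate feature identifier lists, dropping duplicates while keeping order."""
    combined = []
    seen = set()
    for group in feature_groups:
        for identifier in group:
            if identifier not in seen:
                seen.add(identifier)
                combined.append(identifier)
    return combined


p1 = ["topic:travel"]               # hypothetical feature from popular comments
p2 = ["topic:travel", "music:pop"]  # hypothetical feature from audio data
p3 = ["object:car"]                 # hypothetical feature from image data
p = combine_features(p1, p2, p3)
```

If the features were matrices rather than identifiers, the combination would more likely be a concatenation or weighted fusion of vectors.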

[0070] Step S103: Generate target prompts based on video features. Target prompts are used to characterize the rules for generating text information for the target video.

[0071] For example, after obtaining video features, the terminal device performs prompt conversion based on the implementation form of the video features to generate a target prompt adapted to the large language model. The prompt, also known as a prompt word, is a type of instruction information for the model, which can include descriptive statements, question statements, or parameters, among other textual information. By inputting prompts into the model, it can enable the model to execute corresponding tasks more accurately and reliably, generating results that meet user needs. Prompts can be carried in a specific file format, such as a JSON file.
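A prompt carried as JSON might be serialized as in the sketch below, which uses only the standard `json` module. The field names `instruction` and `parameters` are hypothetical; the disclosure says only that a prompt may contain statements or parameters and may be carried in a JSON file.

```python
import json


def serialize_prompt(instruction, parameters):
    """Serialize a prompt (instruction text plus parameters) as a JSON string."""
    payload = {"instruction": instruction, "parameters": parameters}
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)


prompt_json = serialize_prompt(
    "Write a short description for the video.",
    {"topic": "travel", "max_words": 30},
)
# Round-trip to confirm the prompt survives serialization unchanged.
restored = json.loads(prompt_json)
```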

[0072] Furthermore, in one possible implementation, video features and fixed prompt text can be directly combined to generate the corresponding target prompt. For example, the fixed prompt text could include: "Please generate text that matches the following conditions:". Then, by simply combining the prompt text with the video features obtained in the previous steps, the target prompt can be generated. In another possible implementation, a corresponding prompt template can first be obtained based on the video features; then, based on the prompt template, the target prompt can be generated by combining some or all of the features from the video features.
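Both prompt construction variants described above can be sketched in a few lines. The fixed prompt text is quoted from the paragraph; the bullet layout, the template wording, and the function names are assumptions made for illustration.

```python
FIXED_PROMPT_TEXT = "Please generate text that matches the following conditions:"


def build_prompt(video_features):
    """Append each feature as a 'dimension: content' line after the fixed prompt text."""
    lines = [FIXED_PROMPT_TEXT]
    lines += [f"- {dimension}: {content}" for dimension, content in video_features.items()]
    return "\n".join(lines)


def build_prompt_from_template(template, video_features):
    """Fill a prompt template; only the features named in the template are used."""
    return template.format_map(video_features)


features = {"author": "user_1", "title": "introduce city_A"}
target_prompt = build_prompt(features)
templated_prompt = build_prompt_from_template(
    "Write a comment about a video titled '{title}'.", features
)
```

The template variant uses only some of the features (here, the title), which matches the "some or all of the features" wording above.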

[0073] For example, the target prompt can be generated locally by the terminal device based on preset processing rules, or the terminal can send video features to the corresponding server, which will then generate the corresponding target prompt and return it to the terminal device. The specific settings can be configured as needed, and will not be elaborated here.

[0074] Step S104: Generate text information for the target video based on the target prompt.

[0075] For example, after receiving the target prompt, the terminal device can transmit the target prompt to the model server that deploys the large language model by calling the interface or service corresponding to the large language model. Then, the model server processes the target prompt using the large language model and generates corresponding text information. After that, the model server sends the model's output, i.e., the text information of the target video, back to the terminal device, so that the terminal device obtains the text information for the target video, thus completing the automatic generation process of the text information.
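The round trip in step S104 can be sketched without committing to any particular model service. In the sketch below, `call_model` is a hypothetical stand-in for whatever interface or service exposes the large language model, and `fake_model_server` only returns canned text so that the example runs offline.

```python
from typing import Callable, List


def generate_text_information(target_prompt: str,
                              call_model: Callable[[str], List[str]]) -> List[str]:
    """Step S104: forward the target prompt to the model and return its output."""
    candidates = call_model(target_prompt)
    # Drop blank outputs so the caller only receives usable text information.
    return [text.strip() for text in candidates if text.strip()]


def fake_model_server(prompt: str) -> List[str]:
    # Stand-in for the remote model server; not a real model call.
    return [f"Candidate 1 for: {prompt}", "  ", f"Candidate 2 for: {prompt}"]


print(generate_text_information("Describe a travel video", fake_model_server))
```

Passing the transport in as a callable keeps the terminal-side logic independent of how the model server is reached.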

[0076] Optionally, after step S104, the method further includes:

[0077] Step S105: Display at least two text information items in the first area of the video playback page corresponding to the target video.

[0078] Step S106: In response to a selection operation on target text information among the at least two text information items, display the target text information in the second area of the video playback page, wherein the second area is used to display user comments on the target video.

[0079] Figure 8 is a schematic diagram of a video playback page provided in an embodiment of the present disclosure. As shown in Figure 8, the video playback page of the target video includes a first area and a second area. The second area is used to display user comments on the target video, i.e., the comment area in the prior art. In another implementation, the second area can also be a bullet screen area in the video playback page, for which no specific example is given here. The specific implementation of the second area is prior art and will not be described in detail here. The first area, on the other hand, is used to display the text information generated in the above steps. After text information is generated through the steps of the above embodiment, at least two pieces of text information are displayed in the first area (e.g., text information #1, text information #2, and text information #3 shown in the figure) for the user to select. The user then applies a selection operation to the terminal device as needed, selecting the target text information, e.g., text information #2. In response to the user's selection operation, the terminal device either publishes text information #2 directly to the second area, or sends text information #2 to the editing area (input box) for the user to edit before publishing it to the second area, thereby completing the process of publishing descriptive text information. This text information can be a description of the target video, or a reply to or comment on user comments in the comment area.

[0080] In this embodiment, after generating text information, multiple text messages are displayed to provide users with alternative options. By responding to the user's selection, the goal of quickly publishing text information can be achieved, thus improving the efficiency of text information publishing.
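Steps S105 and S106 can be modeled as a small piece of page state. The sketch below is illustrative only: the `PlaybackPage` class, its field names, and the `edit_first` flag are assumptions and not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlaybackPage:
    first_area: List[str] = field(default_factory=list)   # generated candidates
    second_area: List[str] = field(default_factory=list)  # user comments
    input_box: str = ""                                   # editing area

    def show_candidates(self, texts: List[str]) -> None:
        # Step S105: at least two text information items are displayed.
        if len(texts) < 2:
            raise ValueError("at least two text information items are required")
        self.first_area = list(texts)

    def select(self, index: int, edit_first: bool = False) -> str:
        # Step S106: publish the selected item directly, or place it in the
        # input box so the user can edit it before publishing.
        target = self.first_area[index]
        if edit_first:
            self.input_box = target
        else:
            self.second_area.append(target)
        return target


page = PlaybackPage()
page.show_candidates(["text #1", "text #2", "text #3"])
page.select(1)
print(page.second_area)
```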

[0081] Referring to Figure 9, Figure 9 is another flowchart of the text information generation method provided in an embodiment of this disclosure. Based on the embodiment shown in Figure 2, this embodiment further refines step S102 and adds a step of optimizing the target video. This method can be applied to video editing scenarios. Specifically, the text information generation method includes:

[0082] Step S201: Obtain the target video.

[0083] Step S202: Extract video features from the target video. The video features are used to characterize the content of the target video in at least one content dimension.

[0084] Step S203: Based on the video features, obtain the corresponding prompt template, wherein the prompt template is used to represent the preset rules for dynamically generating prompts by calling input features.

[0085] Step S204: Generate input features based on video features, and call the prompt template for processing to generate the target prompt.

[0086] For example, firstly, the prompt templates correspond to video features, and the terminal device obtains the corresponding prompt templates based on different video features. For instance, the video features include a feature P1 that represents the video theme of the target video. When P1 represents "travel video", the terminal device obtains prompt template M1; when P1 represents "game guide", the terminal device obtains prompt template M2.
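The lookup described above amounts to a mapping from a theme feature to a prompt template. In the sketch below the template identifiers M1 and M2 come from the paragraph above, while the `"P1"` dictionary key and the fallback template are assumptions.

```python
# Mapping from the video theme carried by feature P1 to a prompt template.
PROMPT_TEMPLATES = {
    "travel video": "M1",
    "game guide": "M2",
}


def select_template(video_features: dict, default: str = "M_default") -> str:
    # The fallback for unknown themes is hypothetical; the disclosure does
    # not say what happens when no template matches.
    theme = video_features.get("P1")
    return PROMPT_TEMPLATES.get(theme, default)


print(select_template({"P1": "travel video"}))
print(select_template({"P1": "game guide"}))
```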

[0087] Furthermore, a prompt template is a model or framework that defines the rules for generating prompts. In generating corresponding prompts, the prompt template needs to invoke input features. Therefore, different input features will result in different prompts generated by the prompt template, thus achieving dynamic changes in the prompts. Specifically, the prompt template consists of input features and at least one rule description, which characterizes the sub-rules that constitute the preset rules. The rule description, by invoking the corresponding input features, limits the content and / or form of the prompt, thereby accurately generating the prompt.

[0088] Figure 10 is a schematic diagram of the structure of a prompt template provided in an embodiment of the present disclosure. As shown in Figure 10, the prompt template M1 obtained based on the video features has input features p1, p2, and p3. Input feature p1 represents the video's trending comments, input feature p2 represents the video's content text, and input feature p3 represents the video's image content. The prompt template M1 also includes rule description information Info_1{p1}, rule description information Info_2{p2}, and rule description information Info_3{p3}. The rule description information Info_1 invokes input feature p1 to define one sub-rule of the preset rule, thereby limiting the content and / or form of the prompt. For example, the content of the rule description information Info_1{p1} is "Create an ancient poem using {trending comments}". The rule description information Info_2 invokes input feature p2 to define another sub-rule of the preset rule, further restricting the content and / or form of the prompt. For example, the content of the rule description information Info_2{p2} is "supplement some relevant content for {content text} so that people can learn something new". The rule description information Info_3{p3} works similarly and will not be elaborated further. Then, the input features are generated from some or all of the video features and inserted into the prompt template, yielding complete rule description information with the video features inserted, such as "create an ancient poem using {#comment_001}". Here, "#comment_001" can be the specific content of a trending comment or the location where trending comments are stored; the specific implementation is not limited. Finally, the target prompt for the video features is obtained from the set of rule description information with the video features inserted.
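A minimal sketch of this template filling follows, using Python's built-in `str.format` placeholders. The two rule descriptions are taken from the examples above; the content of Info_3 is not given in the text, so it is omitted. The placeholder names and the `"; "` separator are assumptions.

```python
# Rule description information of template M1, with {p1} and {p2} marking
# where the corresponding input features are inserted.
TEMPLATE_M1 = [
    "Create an ancient poem using {p1}",
    "Supplement some relevant content for {p2} so that people can learn something new",
]


def render_prompt(rule_descriptions, input_features: dict) -> str:
    # Insert the input features into every rule description, then join the
    # completed descriptions into one target prompt.
    rendered = [rule.format(**input_features) for rule in rule_descriptions]
    return "; ".join(rendered)


print(render_prompt(TEMPLATE_M1, {"p1": "#comment_001", "p2": "#text_001"}))
```

Because each rule description only names the input features it needs, changing the input features changes the resulting prompt without touching the template.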

[0089] Furthermore, the text information generated based on the aforementioned prompt template may include descriptive text and / or graphic identifiers. Descriptive text refers to text or strings that describe the video content or video commentary. Graphic identifiers may include emoticons, stickers, etc., without limitation. The specific implementation method for generating descriptive text and / or graphic identifiers is provided by the capabilities of the pre-trained language model and is not specifically limited here.

[0090] Furthermore, the rule description information in the prompt template includes at least one of the following: first description information, used to characterize the text format of the text information generated based on the target input features; second description information, used to characterize the text content of the text information generated based on the target input features; third description information, used to characterize the text length of the text information generated based on the target input features; fourth description information, used to characterize the grammatical style of the text information generated based on the target input features; and fifth description information, used to characterize the selection rules for the target input features. Here, text format refers to the format of the descriptive text, such as "ancient poems" or "poems", or to the segmentation format of the descriptive text. Text content has been described in the above embodiments and will not be repeated here. Text length is the string length of the descriptive text. Grammatical style includes, for example, "humorous" or "pleasant". The selection rules for the target input features are information used to set the input features of the prompt template, that is, which features are used as input features; for example, in the embodiment shown in Figure 10 above, the specific contents of input features p1, p2, and p3 are determined by the fifth description information.

[0091] In this embodiment, by using the rule description information set in the prompt template, fine-grained matching of the selection rules for text format, text content, text length, grammatical style, and target input features of descriptive text can be achieved. This enables the target prompt generated based on the prompt template to accurately match video features, thereby improving the accuracy, rationality, and diversity of the generated text information and enhancing the content quality of the text information.
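The five kinds of rule description information can be pictured as optional fields of one structure. The sketch below is an assumption-laden illustration: the field names, the sentence wording, and the reading of text length as an upper bound are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RuleDescriptions:
    text_format: Optional[str] = None        # first description information
    text_content: Optional[str] = None       # second description information
    max_length: Optional[int] = None         # third (assumed to be an upper bound)
    style: Optional[str] = None              # fourth description information
    selected_features: Tuple[str, ...] = ()  # fifth: which features are inputs

    def to_prompt(self, features: dict) -> str:
        parts = []
        if self.text_format:
            parts.append(f"Format: {self.text_format}.")
        if self.text_content:
            parts.append(f"Content: {self.text_content}.")
        if self.max_length is not None:
            parts.append(f"Length: at most {self.max_length} characters.")
        if self.style:
            parts.append(f"Style: {self.style}.")
        # Only the features chosen by the selection rule reach the prompt.
        for name in self.selected_features:
            parts.append(f"{name}: {features[name]}")
        return " ".join(parts)


rules = RuleDescriptions(text_format="ancient poem", max_length=40,
                         style="humorous", selected_features=("p1",))
print(rules.to_prompt({"p1": "nice view", "p2": "not selected"}))
```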

[0092] Step S205: Based on the target prompt, generate text information for the target video using a large language model.

[0093] Step S206: Based on the text information, generate supplementary video data, which includes video content that matches the text information.

[0094] Step S207: Add supplementary video data to the target video to generate an optimized video.

[0095] For example, in video editing applications, after the text information is generated, the video editor (i.e., the video creator) can further optimize the target video based on the text information generated in the above steps, for instance by adding content that matches the automatically generated text information, thereby improving video quality. Specifically, the terminal device generates supplementary video data based on the text information, such as background music, stickers, and special effects templates. In a more specific implementation, the terminal device can generate corresponding artistic text from the obtained descriptive text, or obtain music related to the descriptive text from the server as background music, and add the artistic text and background music to the target video to generate an optimized video. Leveraging the content of the text information in this way further enriches the video content of the target video and improves the quality of the video works generated by the terminal device in video editing scenarios.
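Steps S206 and S207 can be sketched as a pure function that returns an optimized copy of the video. Everything here is hypothetical scaffolding: the `Video` record, the `"[styled] "` marker standing in for artistic text, and the `find_music` callable standing in for the server lookup.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Video:
    name: str
    overlays: Tuple[str, ...] = ()
    background_music: Optional[str] = None


def optimize_video(video: Video, descriptive_text: str,
                   find_music: Callable[[str], Optional[str]]) -> Video:
    # Step S206: derive supplementary video data from the text information.
    artistic_text = f"[styled] {descriptive_text}"
    music = find_music(descriptive_text)
    # Step S207: add the supplementary data to obtain the optimized video,
    # keeping any existing music when no related track is found.
    return replace(video,
                   overlays=video.overlays + (artistic_text,),
                   background_music=music or video.background_music)


optimized = optimize_video(Video("clip"), "Sunset by the sea",
                           lambda text: "calm_track")
print(optimized)
```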

[0096] In this embodiment, steps S201-S202 and S205 are implemented in the same way as steps S101-S102 and S104 in the embodiment shown in Figure 2 of this disclosure, and will not be described in detail here.

[0097] Corresponding to the text information generation method in the above embodiments, Figure 11 is a structural block diagram of a text information generation apparatus provided in an embodiment of this disclosure. For ease of explanation, only the parts relevant to the embodiments of this disclosure are shown. Referring to Figure 11, the text information generation device 3 includes:

[0098] The acquisition module 31 is used to acquire the target video and extract the video features of the target video. The video features are used to characterize the content of the target video in at least one content dimension.

[0099] The first generation module 32 is used to generate target prompts based on video features. The target prompts are used to characterize the generation rules of text information for the target video.

[0100] The second generation module 33 is used to generate text information for the target video based on the target prompt.

[0101] In one embodiment of this disclosure, the second generation module 33 is further configured to: display at least two text information items in a first area of the video playback page corresponding to the target video; and, in response to a selection operation on the target text information among the at least two text information items, display the target text information in a second area of the video playback page; wherein the second area is used to display user comments on the target video.

[0102] In one embodiment of this disclosure, when extracting video features of a target video, the acquisition module 31 is specifically used to: acquire reference text information for the target video, the reference text information including text information published by the user for the target video; and obtain video features of the target video based on the reference text information.

[0103] In one embodiment of this disclosure, when the acquisition module 31 extracts video features from the target video, it is specifically used to: extract audio data from the target video; process the audio data to obtain the content text of the target video, the content text being used to characterize the dialogue between characters in the target video; and obtain video features based on the content text of the target video.

[0104] In one embodiment of this disclosure, the acquisition module 31 is further configured to: process the audio data to obtain the music features of the background music in the target video; when the acquisition module 31 obtains the video features based on the content text of the target video, it is specifically configured to: obtain the video features based on the content text of the target video and the music features of the background music; wherein, the music features include at least one of the following: music name identifier, music type, music melody.

[0105] In one embodiment of this disclosure, when the acquisition module 31 extracts video features from the target video, it is specifically used to: extract frames from the target video to obtain image data of the target video, wherein the image data contains at least one video frame of the target video; perform image recognition on the image data to obtain frame content information characterizing the image content of the video frame; and generate video features of the target video based on the frame content information.

[0106] In one embodiment of this disclosure, when the acquisition module 31 performs image recognition on image data to obtain frame content information representing the image content of a video frame, it is specifically used to: perform image recognition on at least two target video frames in the image data respectively to obtain corresponding frame content information; when the acquisition module 31 generates video features of the target video based on the frame content information, it is specifically used to: generate a frame content sequence based on the playback timing of at least two target video frames, wherein the frame content sequence includes corresponding frame content information arranged according to the playback timing; and generate video features of the target video based on the frame content sequence.
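Arranging frame content information by playback timing can be sketched as a sort over timestamped recognition results. Representing each target video frame as a `(timestamp_seconds, frame_content_info)` pair is an assumption; the disclosure does not prescribe a representation.

```python
from typing import Iterable, List, Tuple


def build_frame_content_sequence(frames: Iterable[Tuple[float, str]]) -> List[str]:
    """Order frame content information by the playback time of its frame."""
    ordered = sorted(frames, key=lambda frame: frame[0])
    if len(ordered) < 2:
        # The embodiment recognizes at least two target video frames.
        raise ValueError("at least two target video frames are required")
    return [info for _, info in ordered]


recognized = [(12.0, "people on a boat"), (3.5, "harbor at dawn"),
              (20.0, "island beach")]
print(build_frame_content_sequence(recognized))
```

Keeping the sequence in playback order preserves how the scene changes over time, which a prompt can then describe.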

[0107] In one embodiment of this disclosure, the first generation module 32 is specifically used to: obtain a corresponding prompt template based on video features, wherein the prompt template is used to characterize a preset rule for dynamically generating prompts by calling input features; generate input features based on video features, and call the prompt template for processing to generate a target prompt.

[0108] In one embodiment of this disclosure, at least one of the following is included: the text information includes descriptive text and / or graphic identifiers; the prompt template includes at least one rule description, each rule description being used to characterize a sub-rule constituting a preset rule.

[0109] In one embodiment of this disclosure, the rule description information includes at least one of the following: first description information for characterizing the text format of the text information generated based on the target input features; second description information for characterizing the text content of the text information generated based on the target input features; third description information for characterizing the text length of the text information generated based on the target input features; fourth description information for characterizing the grammatical style of the text information generated based on the target input features; and fifth description information for characterizing the selection rules of the target input features.

[0110] In one embodiment of this disclosure, the second generation module 33 is further configured to: generate supplementary video data based on text information, wherein the supplementary video data includes video content that matches the text information; and add the supplementary video data to the target video to generate an optimized video.

[0111] The acquisition module 31, the first generation module 32, and the second generation module 33 are connected sequentially. The text information generation device 3 provided in this embodiment can execute the technical solution of the above method embodiment, and its implementation principle and technical effect are similar, so it will not be described again here.
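The sequential connection of the three modules can be sketched as a simple pipeline in which each module's output feeds the next. The class name and the lambda stand-ins below are hypothetical; real modules would perform feature extraction, prompt generation, and model invocation.

```python
from typing import Any, Callable, List


class TextInformationGenerationDevice:
    """Chains acquisition module 31, first generation module 32 and
    second generation module 33 in sequence."""

    def __init__(self,
                 acquisition_module: Callable[[Any], dict],
                 first_generation_module: Callable[[dict], str],
                 second_generation_module: Callable[[str], List[str]]):
        self.acquisition_module = acquisition_module
        self.first_generation_module = first_generation_module
        self.second_generation_module = second_generation_module

    def run(self, target_video: Any) -> List[str]:
        features = self.acquisition_module(target_video)   # video features
        prompt = self.first_generation_module(features)    # target prompt
        return self.second_generation_module(prompt)       # text information


# Toy stand-ins for the three modules, for illustration only.
device = TextInformationGenerationDevice(
    acquisition_module=lambda video: {"theme": video["theme"]},
    first_generation_module=lambda features: f"Describe a {features['theme']}",
    second_generation_module=lambda prompt: [prompt.upper()],
)
print(device.run({"theme": "travel video"}))
```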

[0112] Figure 12 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present disclosure. As shown in Figure 12, the electronic device 4 includes:

[0113] Processor 41, and memory 42 communicatively connected to processor 41;

[0114] Memory 42 stores computer-executable instructions;

[0115] The processor 41 executes the computer-executable instructions stored in the memory 42 to implement the text information generation method in the embodiments shown in Figures 2-10.

[0116] Optionally, the processor 41 and the memory 42 are connected via a bus 43.

[0117] For related details, refer to the descriptions and effects of the corresponding steps in the embodiments of Figures 2-10, which will not be elaborated on here.

[0118] This disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the text information generation method provided in any of the embodiments corresponding to Figures 2-10 of this disclosure.

[0119] To implement the above embodiments, this disclosure also provides an electronic device.

[0120] Referring to Figure 13, a structural schematic of an electronic device 900 suitable for implementing embodiments of the present disclosure is shown. The electronic device 900 can be a terminal device or a server. The terminal device can include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable media players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Figure 13 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments disclosed herein.

[0121] As shown in Figure 13, the electronic device 900 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 901, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage device 908 into a random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the electronic device 900. The processing device 901, ROM 902, and RAM 903 are interconnected via a bus 904. An input / output (I / O) interface 905 is also connected to the bus 904.

[0122] Typically, the following devices can be connected to the I / O interface 905: input devices 906 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 907 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 908 including, for example, magnetic tapes, hard disks, etc.; and communication devices 909. The communication device 909 allows the electronic device 900 to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 13 shows an electronic device 900 with various devices, it should be understood that it is not required to implement or possess all of the devices shown. More or fewer devices may alternatively be implemented or possessed.

[0123] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 909, or installed from a storage device 908, or installed from a ROM 902. When the computer program is executed by a processing device 901, it performs the functions defined in the methods of embodiments of this disclosure.

[0124] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0125] The aforementioned computer-readable medium may be included in the aforementioned electronic device; or it may exist independently and not assembled into the electronic device.

[0126] The aforementioned computer-readable medium carries one or more programs, which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.

[0127] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0128] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0129] The units described in the embodiments of this disclosure can be implemented in software or in hardware. The name of a unit does not necessarily limit the unit itself; for example, the first acquisition unit can also be described as "a unit that acquires at least two Internet Protocol addresses".

[0130] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0131] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0132] In a first aspect, according to one or more embodiments of this disclosure, a method for generating text information is provided, comprising:

[0133] A target video is acquired, and video features of the target video are extracted, wherein the video features are used to characterize the content of the target video in at least one content dimension; a target prompt is generated based on the video features, wherein the target prompt is used to characterize the generation rules of text information for the target video; and text information for the target video is generated based on the target prompt.

[0134] According to one or more embodiments of this disclosure, the method further includes: displaying at least two pieces of the text information in a first area of a video playback page corresponding to the target video; and, in response to a selection operation on target text information among the at least two pieces of the text information, displaying the target text information in a second area of the video playback page; wherein the second area is used to display user comments on the target video.

[0135] According to one or more embodiments of this disclosure, extracting video features of the target video includes: obtaining reference text information for the target video, the reference text information including text information posted by a user for the target video; and obtaining video features of the target video based on the reference text information.

[0136] According to one or more embodiments of this disclosure, extracting video features from the target video includes: extracting audio data from the target video; processing the audio data to obtain content text of the target video, the content text being used to characterize dialogue in the target video; and obtaining the video features based on the content text of the target video.

[0137] According to one or more embodiments of this disclosure, the method further includes: processing the audio data to obtain musical features of the background music in the target video; obtaining the video features based on the content text of the target video includes: obtaining the video features based on the content text of the target video and the musical features of the background music; wherein the musical features include at least one of the following: music name identifier, music type, and music melody.

[0138] According to one or more embodiments of this disclosure, extracting video features from the target video includes: extracting frames from the target video to obtain image data of the target video, wherein the image data contains at least one video frame of the target video; performing image recognition on the image data to obtain frame content information characterizing the image content of the video frame; and generating video features of the target video based on the frame content information.

[0139] According to one or more embodiments of this disclosure, the step of performing image recognition on the image data to obtain frame content information characterizing the image content of the video frame includes: performing image recognition on at least two target video frames in the image data respectively to obtain corresponding frame content information; the step of generating video features of the target video based on the frame content information includes: generating a frame content sequence based on the playback timing of the at least two target video frames, wherein the frame content sequence includes corresponding frame content information arranged according to the playback timing; and generating video features of the target video based on the frame content sequence.

[0140] According to one or more embodiments of this disclosure, generating a target prompt based on the video features includes: obtaining a corresponding prompt template based on the video features, wherein the prompt template is used to characterize a preset rule for dynamically generating prompts by calling input features; generating the input features based on the video features, and calling the prompt template for processing to generate the target prompt.
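A minimal sketch of this template-based flow is given below, using Python's standard `string.Template`. The template texts, the dictionary keys, and the function names are assumptions made for illustration; the disclosure does not specify how a template is stored or selected.

```python
from string import Template
from typing import Dict

# Hypothetical prompt templates, keyed by the kind of video feature available.
PROMPT_TEMPLATES: Dict[str, Template] = {
    "dialogue": Template(
        "Write a comment on a video in which the characters say: $content_text"
    ),
    "default": Template("Write a comment on a video showing: $frame_summary"),
}


def select_template(video_features: Dict[str, str]) -> Template:
    """Obtain the prompt template corresponding to the video features."""
    key = "dialogue" if video_features.get("content_text") else "default"
    return PROMPT_TEMPLATES[key]


def generate_input_features(video_features: Dict[str, str]) -> Dict[str, str]:
    """Generate the input features that the template will call."""
    return {
        "content_text": video_features.get("content_text", ""),
        "frame_summary": video_features.get("frame_summary", ""),
    }


def generate_target_prompt(video_features: Dict[str, str]) -> str:
    """Call the selected template with the input features."""
    template = select_template(video_features)
    return template.substitute(generate_input_features(video_features))
```

For example, `generate_target_prompt({"content_text": "Nice goal!"})` selects the dialogue template and fills its placeholder with the recognized dialogue.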

[0141] According to one or more embodiments of this disclosure, at least one of the following applies: the text information includes descriptive text and/or graphic identifiers; the prompt template includes at least one item of rule description information, each item of rule description information being used to characterize a sub-rule constituting the preset rule.

[0142] According to one or more embodiments of this disclosure, the rule description information includes at least one of the following: first description information for characterizing the text format of the text information generated based on target input features; second description information for characterizing the text content of the text information generated based on target input features; third description information for characterizing the text length of the text information generated based on target input features; fourth description information for characterizing the grammatical style of the text information generated based on target input features; and fifth description information for characterizing the selection rules of the target input features.
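To make the five kinds of description information concrete, the sketch below models them as fields of a hypothetical `RuleDescriptions` record and assembles a prompt from them. The field names, the prompt layout, and the representation of the selection rule as a tuple of feature names are all assumptions for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RuleDescriptions:
    text_format: str                    # first description information
    text_content: str                   # second description information
    text_length: str                    # third description information
    grammatical_style: str              # fourth description information
    selected_features: Tuple[str, ...]  # fifth: selection rule for input features


def select_input_features(video_features: Dict[str, str],
                          rules: RuleDescriptions) -> Dict[str, str]:
    """Apply the selection rule to pick the target input features."""
    return {name: video_features[name]
            for name in rules.selected_features
            if name in video_features}


def render_target_prompt(video_features: Dict[str, str],
                         rules: RuleDescriptions) -> str:
    """Assemble a prompt from the sub-rules and the selected input features."""
    inputs = select_input_features(video_features, rules)
    lines = [
        f"Format: {rules.text_format}",
        f"Content: {rules.text_content}",
        f"Length: {rules.text_length}",
        f"Style: {rules.grammatical_style}",
        "Input features:",
    ]
    lines.extend(f"- {name}: {value}" for name, value in inputs.items())
    return "\n".join(lines)
```

In this sketch the selection rule filters the video features before they reach the prompt, so a feature that is not selected never influences the generated text.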

[0143] According to one or more embodiments of this disclosure, the method further includes: generating supplementary video data based on the text information, the supplementary video data including video content matching the text information; adding the supplementary video data to the target video to generate an optimized video.

[0144] In a second aspect, according to one or more embodiments of this disclosure, a text information generation apparatus is provided, comprising:

[0145] an acquisition module, configured to acquire a target video and extract video features from the target video, wherein the video features are used to characterize the content of the target video in at least one content dimension;

[0146] a first generation module, configured to generate a target prompt based on the video features, wherein the target prompt is used to characterize the generation rules of text information for the target video; and

[0147] a second generation module, configured to generate text information for the target video based on the target prompt.
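The cooperation of the three modules can be sketched as a simple pipeline. This is an illustrative arrangement only: `TextInformationGenerator` is a hypothetical class, and each module is passed in as a plain callable so that any feature extractor, prompt builder, or language model client could be substituted.

```python
from typing import Any, Callable, Dict


class TextInformationGenerator:
    """Chains the acquisition, first generation, and second generation modules."""

    def __init__(self,
                 acquisition: Callable[[Any], Dict[str, Any]],
                 first_generation: Callable[[Dict[str, Any]], str],
                 second_generation: Callable[[str], str]) -> None:
        self.acquisition = acquisition              # video -> video features
        self.first_generation = first_generation    # features -> target prompt
        self.second_generation = second_generation  # prompt -> text information

    def run(self, target_video: Any) -> str:
        video_features = self.acquisition(target_video)
        target_prompt = self.first_generation(video_features)
        return self.second_generation(target_prompt)
```

In a real deployment the third callable would presumably wrap a call to a language model; here it is left abstract because the disclosure does not name a specific model or interface.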

[0148] According to one or more embodiments of this disclosure, the second generation module is further configured to: display at least two items of the text information in a first area of the video playback page corresponding to the target video; and, in response to a selection operation on target text information among the at least two items of the text information, display the target text information in a second area of the video playback page; wherein the second area is used to display user comments on the target video.
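The two-area interaction described above can be modeled as a small state object. The sketch below is an assumption-laden illustration, not a user interface implementation: `VideoPlaybackPage` and its fields are hypothetical names, and rendering is reduced to list membership.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VideoPlaybackPage:
    first_area: List[str]  # candidate items of generated text information
    second_area: List[str] = field(default_factory=list)  # user comment area

    def select(self, index: int) -> str:
        """Handle a selection operation on one candidate item."""
        target_text = self.first_area[index]
        # Display the selected text in the area that shows user comments.
        self.second_area.append(target_text)
        return target_text
```

Selecting a candidate leaves the candidate list unchanged and adds the chosen text to the comment area, which mirrors the behavior in which a generated suggestion becomes a user comment.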

[0149] According to one or more embodiments of this disclosure, when the acquisition module extracts video features of the target video, it is specifically configured to: acquire reference text information for the target video, the reference text information including text information published by the user for the target video; and obtain video features of the target video based on the reference text information.

[0150] According to one or more embodiments of this disclosure, when the acquisition module extracts video features from the target video, it is specifically configured to: extract audio data from the target video; process the audio data to obtain the content text of the target video, the content text being used to characterize dialogue in the target video; and obtain the video features based on the content text of the target video.

[0151] According to one or more embodiments of this disclosure, the acquisition module is further configured to: process the audio data to obtain the musical features of the background music in the target video; when the acquisition module obtains the video features based on the content text of the target video, it is specifically configured to: obtain the video features based on the content text of the target video and the musical features of the background music; wherein, the musical features include at least one of the following: music name identifier, music type, music melody.

[0152] According to one or more embodiments of this disclosure, when the acquisition module extracts video features from the target video, it is specifically configured to: extract frames from the target video to obtain image data of the target video, wherein the image data contains at least one video frame of the target video; perform image recognition on the image data to obtain frame content information characterizing the image content of the video frame; and generate video features of the target video based on the frame content information.

[0153] According to one or more embodiments of this disclosure, when the acquisition module performs image recognition on the image data to obtain frame content information characterizing the image content of the video frame, it is specifically configured to: perform image recognition on at least two target video frames in the image data respectively to obtain corresponding frame content information; when the acquisition module generates video features of the target video based on the frame content information, it is specifically configured to: generate a frame content sequence based on the playback timing of the at least two target video frames, wherein the frame content sequence includes corresponding frame content information arranged according to the playback timing; and generate video features of the target video based on the frame content sequence.

[0154] According to one or more embodiments of this disclosure, the first generation module is specifically configured to: obtain a corresponding prompt template based on the video features, wherein the prompt template is used to characterize a preset rule for dynamically generating prompts by calling input features; generate the input features based on the video features, and call the prompt template for processing to generate the target prompt.

[0155] According to one or more embodiments of this disclosure, at least one of the following applies: the text information includes descriptive text and/or graphic identifiers; the prompt template includes at least one item of rule description information, each item of rule description information being used to characterize a sub-rule constituting the preset rule.

[0156] According to one or more embodiments of this disclosure, the rule description information includes at least one of the following: first description information for characterizing the text format of the text information generated based on target input features; second description information for characterizing the text content of the text information generated based on target input features; third description information for characterizing the text length of the text information generated based on target input features; fourth description information for characterizing the grammatical style of the text information generated based on target input features; and fifth description information for characterizing the selection rules of the target input features.

[0157] According to one or more embodiments of this disclosure, the second generation module is further configured to: generate supplementary video data based on the text information, wherein the supplementary video data includes video content matching the text information; and add the supplementary video data to the target video to generate an optimized video.

[0158] In a third aspect, according to one or more embodiments of the present disclosure, an electronic device is provided, comprising: at least one processor and a memory;

[0159] The memory stores computer-executable instructions;

[0160] The at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the text information generation method as described in the first aspect and various possible designs of the first aspect.

[0161] In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium is provided, wherein computer-executable instructions are stored therein, and when a processor executes the computer-executable instructions, the text information generation method described in the first aspect and various possible designs of the first aspect is implemented.

[0162] In a fifth aspect, according to one or more embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the text information generation method as described in the first aspect and various possible designs of the first aspect.

[0163] The above description merely presents preferred embodiments of this disclosure and explains the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but also covers other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. One example is a technical solution formed by replacing the above features with technical features that have similar functions and are disclosed in (but not limited to) this disclosure.

[0164] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0165] Although the subject matter has been described using language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.

Claims

1. A text information generation method, characterized by comprising: acquiring a target video and extracting video features from the target video, the video features being used to characterize the content of the target video in at least one content dimension; determining a corresponding prompt template based on the video features, wherein the prompt template includes at least one item of rule description information, and the rule description information includes multiple items of description information, among them description information characterizing a selection rule for input features; generating the input features based on the video features and the description information characterizing the selection rule for the input features; calling the prompt template to process the input features to generate a target prompt, wherein the target prompt is used to characterize a generation rule of text information for the target video; and generating the text information for the target video based on the target prompt.

2. The method according to claim 1, characterized in that the method further comprises: displaying at least two items of the text information in a first area of a video playback page corresponding to the target video; and in response to a selection operation on target text information among the at least two items of the text information, displaying the target text information in a second area of the video playback page, wherein the second area is used to display user comments on the target video.

3. The method according to claim 1, characterized in that extracting video features from the target video comprises: obtaining reference text information for the target video, the reference text information including user comments posted by users on the target video; and obtaining the video features of the target video based on the reference text information.

4. The method according to claim 1, characterized in that extracting video features from the target video comprises: extracting audio data from the target video; processing the audio data to obtain content text of the target video, the content text being used to characterize dialogue between characters in the target video; and obtaining the video features based on the content text of the target video.

5. The method according to claim 4, characterized in that the method further comprises: processing the audio data to obtain musical features of background music in the target video; wherein obtaining the video features based on the content text of the target video comprises: obtaining the video features based on the content text of the target video and the musical features of the background music; and wherein the musical features include at least one of the following: a music name identifier, a music type, and a music melody.

6. The method according to claim 1, characterized in that extracting video features from the target video comprises: extracting frames from the target video to obtain image data of the target video, wherein the image data contains at least one video frame of the target video; performing image recognition on the image data to obtain frame content information characterizing the image content of the video frame; and generating the video features of the target video based on the frame content information.

7. The method according to claim 6, characterized in that performing image recognition on the image data to obtain frame content information characterizing the image content of the video frame comprises: performing image recognition on at least two target video frames in the image data respectively to obtain corresponding frame content information; and generating the video features of the target video based on the frame content information comprises: generating a frame content sequence based on the playback timing of the at least two target video frames, wherein the frame content sequence includes the corresponding frame content information arranged according to the playback timing; and generating the video features of the target video based on the frame content sequence.

8. The method according to claim 1, characterized in that the text information includes descriptive text and/or graphic identifiers; and the rule description information is used to characterize sub-rules constituting the preset rule.

9. The method according to claim 8, characterized in that the rule description information further includes at least one of the following: first description information characterizing the text format of the text information generated based on target input features; second description information characterizing the text content of the text information generated based on the target input features; third description information characterizing the text length of the text information generated based on the target input features; and fourth description information characterizing the grammatical style of the text information generated based on the target input features.

10. The method according to claim 1, characterized in that the method further comprises: generating supplementary video data based on the text information, the supplementary video data including video content matching the text information; and adding the supplementary video data to the target video to generate an optimized video.

11. A text information generation device, characterized by comprising: an acquisition module, configured to acquire a target video and extract video features from the target video, wherein the video features are used to characterize the content of the target video in at least one content dimension; a first generation module, configured to: determine a corresponding prompt template based on the video features, wherein the prompt template includes at least one item of rule description information, and the rule description information includes multiple items of description information, among them description information characterizing a selection rule for input features; generate the input features based on the video features and the description information characterizing the selection rule for the input features; and call the prompt template to process the input features to generate a target prompt, wherein the target prompt is used to characterize a generation rule of text information for the target video; and a second generation module, configured to generate the text information for the target video based on the target prompt.

12. An electronic device, characterized by comprising: a processor and a memory; wherein the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the text information generation method according to any one of claims 1 to 10.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the text information generation method according to any one of claims 1 to 10.