Video comment generation method and device, electronic equipment, storage medium and computer program product
By generating personality activation pattern matrices and empathy keyframe sequences using a large language model, the limitations of large visual language models in generating personality-consistent video comments are overcome, achieving high consistency and reliability in the generation of video comments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INST OF AUTOMATION CHINESE ACAD OF SCI
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
Figure CN122240877A_ABST
Description
Technical Field
[0001] This disclosure relates to the field of video technology, and more specifically, to methods, apparatus, electronic devices, storage media, and computer program products for generating video comments.
Background Technology
[0002] Large Visual Language Models (LVLMs) have demonstrated strong natural language generation capabilities across a variety of multimodal tasks, but they have significant limitations in generating text consistent with a specific personality.
[0003] The personality-controlled natural language generation schemes in related technologies struggle to capture deep-seated personality representations, so the personality traits in the generated text are superficial and inconsistent, and personality hallucinations are likely to occur. Furthermore, these methods generally neglect the intrinsic connection between personality traits and multimodal visual content, failing to establish clear constraints between the generated text and personalized visual concepts. As a result, personality-controllable video comments are generated poorly, making it difficult to meet the needs of practical applications.
Summary of the Invention
[0004] This disclosure provides a method, apparatus, electronic device, storage medium, and computer program product for generating video comments, in order to at least solve the problem in the aforementioned related technologies that the generation of personality-controllable video comments is poor and difficult to meet the needs of practical application scenarios.
[0005] According to a first aspect of the present disclosure, a method for generating video comments is provided, comprising: acquiring text data and multiple videos, wherein the text data includes the title and video content description of each of the multiple videos, interpretation information for each of a preset plurality of personality types, personality-related comment prompt text, personality-related comment text, personality-irrelevant comment prompt text, and personality-irrelevant comment text; inputting the text data and the multiple videos into a large language model to obtain a personality activation pattern matrix, wherein the personality activation pattern matrix is used to capture the personality trait representations required by the large language model to generate specific personality-related comments; extracting a sequence of personality empathy keyframes from the multiple videos using the large language model and the personality activation pattern matrix; and generating, by the large language model, a personality-controllable video comment based on the personality empathy keyframe sequence, the personality-irrelevant comment prompt text, and the personality activation pattern matrix.
[0006] Optionally, the step of inputting the text data and the multiple videos into a large language model to obtain a personality activation pattern matrix includes: generating a personality-perceived reasoning sample set using the large language model, the multiple videos, the personality-related comment prompt text, and the personality-related comment text, wherein the personality-perceived reasoning sample set is used to reflect the reasoning process of the large language model under the guidance of a personality; generating a personality-independent reasoning sample set using the large language model, the multiple videos, the personality-independent comment prompt text, and the personality-independent comment text, wherein the personality-independent reasoning sample set is used to reflect the reasoning process of the large language model in the absence of a personality; performing forward reasoning on the personality-perceived reasoning sample set and the personality-independent reasoning sample set using the large language model to obtain a personality-perceived hidden state and a personality-independent hidden state; and calculating the personality activation pattern matrix based on the personality-perceived hidden state and the personality-independent hidden state.
[0007] Optionally, the step of extracting a sequence of personality empathy keyframes from the multiple videos using the large language model and the personality activation pattern matrix includes: extracting the hidden states of multiple video frames contained in the multiple videos using the large language model; calculating the feature similarity between the hidden state of each video frame and the personality activation pattern matrix using the large language model; and selecting video frames whose feature similarity meets a preset condition from the multiple video frames as the sequence of personality empathy keyframes.
[0008] Optionally, calculating the feature similarity between the hidden state of each video frame in the plurality of video frames and the personality activation pattern matrix using the large language model includes: calculating the feature similarity between the hidden state of each video frame in the plurality of video frames and the personality activation pattern matrix using the large language model with the following formula:
[0009]
$$S(p_i, f_j) = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{N_j} \sum_{k=1}^{N_j} \cos\left(a_{p_i}^{l},\, h_{j,k}^{l}\right)$$
where $S(p_i, f_j)$ denotes the feature similarity, $\cos(\cdot,\cdot)$ denotes cosine similarity, $p_i$ is the $i$-th personality type among the preset multiple personality types, $f_j$ is the $j$-th video frame among the plurality of video frames, $L$ is the number of layers in the large language model, $N_j$ is the number of visual features in the $j$-th video frame, $a_{p_i}^{l}$ is the personality activation vector, contained in the personality activation pattern matrix, of the $i$-th personality type at the $l$-th layer of the large language model, and $h_{j,k}^{l}$ is the hidden state of the $k$-th visual token of the $j$-th video frame at the $l$-th layer of the large language model.
[0010] Optionally, the step of generating a personality-controllable video comment based on the personality empathy keyframe sequence, the personality-irrelevant comment prompt text, and the personality activation pattern matrix using the large language model includes: encoding the personality empathy keyframe sequence and the personality-irrelevant comment prompt text using the large language model to obtain an initial hidden state sequence; modifying the initial hidden state sequence using the personality activation pattern matrix to obtain a modified hidden state sequence; and generating the personality-controllable video comment based on the modified hidden state sequence using the large language model.
[0011] Optionally, the preset multiple personality types include the following: conscientiousness, openness, extraversion, agreeableness, and neuroticism.
[0012] According to a second aspect of the present disclosure, a video comment generation apparatus is provided, comprising: a data acquisition module configured to acquire text data and multiple videos, wherein the text data includes the title and video content description of each of the multiple videos, interpretation information for each of a preset plurality of personality types, personality-related comment prompt text, personality-related comment text, personality-irrelevant comment prompt text, and personality-irrelevant comment text; a matrix acquisition module configured to input the text data and the multiple videos into a large language model to obtain a personality activation pattern matrix, wherein the personality activation pattern matrix is used to capture the personality trait representations required by the large language model to generate specific personality-related comments; a keyframe extraction module configured to extract a sequence of personality empathy keyframes from the multiple videos using the personality activation pattern matrix through the large language model; and a comment generation module configured to generate a personality-controllable video comment based on the personality empathy keyframe sequence, the personality-irrelevant comment prompt text, and the personality activation pattern matrix through the large language model.
[0013] Optionally, the matrix acquisition module is configured to: generate a personality-perceived reasoning sample set using the large language model, the multiple videos, the personality-related comment prompt text, and the personality-related comment text, wherein the personality-perceived reasoning sample set reflects the reasoning process of the large language model under the guidance of a personality; generate a personality-independent reasoning sample set using the large language model, the multiple videos, the personality-independent comment prompt text, and the personality-independent comment text, wherein the personality-independent reasoning sample set reflects the reasoning process of the large language model without the guidance of a personality; perform forward reasoning on the personality-perceived reasoning sample set and the personality-independent reasoning sample set using the large language model to obtain a personality-perceived hidden state and a personality-independent hidden state; and calculate the personality activation pattern matrix based on the personality-perceived hidden state and the personality-independent hidden state.
[0014] Optionally, the keyframe extraction module is configured to: extract the hidden states of multiple video frames contained in the multiple videos using the large language model; calculate the feature similarity between the hidden state of each video frame in the multiple video frames and the personality activation pattern matrix using the large language model; and select video frames whose feature similarity meets preset conditions from the multiple video frames as the personality empathy keyframe sequence.
[0015] Optionally, the keyframe extraction module is configured to: calculate the feature similarity between the hidden state of each of the multiple video frames and the personality activation pattern matrix using the large language model and the following formula:
[0016]
$$S(p_i, f_j) = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{N_j} \sum_{k=1}^{N_j} \cos\left(a_{p_i}^{l},\, h_{j,k}^{l}\right)$$
where $S(p_i, f_j)$ denotes the feature similarity, $\cos(\cdot,\cdot)$ denotes cosine similarity, $p_i$ is the $i$-th personality type among the preset multiple personality types, $f_j$ is the $j$-th video frame among the plurality of video frames, $L$ is the number of layers in the large language model, $N_j$ is the number of visual features in the $j$-th video frame, $a_{p_i}^{l}$ is the personality activation vector, contained in the personality activation pattern matrix, of the $i$-th personality type at the $l$-th layer of the large language model, and $h_{j,k}^{l}$ is the hidden state of the $k$-th visual token of the $j$-th video frame at the $l$-th layer of the large language model.
[0017] Optionally, the comment generation module is configured to: encode the personality empathy keyframe sequence and the personality-irrelevant comment prompt text using the large language model to obtain an initial hidden state sequence; modify the initial hidden state sequence using the personality activation pattern matrix to obtain a modified hidden state sequence; and generate the personality-controllable video comment based on the modified hidden state sequence using the large language model.
[0018] Optionally, the preset multiple personality types include the following: conscientiousness, openness, extraversion, agreeableness, and neuroticism.
[0019] According to a third aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement a method for generating video comments according to the present disclosure.
[0020] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided; when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a method for generating video comments according to the present disclosure.
[0021] According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements a method for generating video comments according to the present disclosure.
[0022] The technical solutions provided by the embodiments of this disclosure have at least the following beneficial effects: In this disclosure, a personality activation pattern matrix is extracted from a large language model through a personality activation recording process, enabling the establishment of deep representations of personality traits. Furthermore, by constructing a keyframe sequence aligned with personality, visual content highly relevant to the target personality can be selected. Additionally, by injecting the personality activation pattern matrix into the video comment generation process through a personality activation replay mechanism, consistency between the output text and the specified personality trait can be ensured. In other words, this disclosure establishes effective constraints in three key stages: personality representation learning, visual content selection, and text generation control. This fundamentally avoids the problem of personality hallucination, significantly improves the personality consistency and content reliability of the generated video comment text, and effectively meets the needs of practical application scenarios.
[0023] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Description of the Drawings
[0024] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.
[0025] Figure 1 is a flowchart illustrating a method for generating video comments according to exemplary embodiments of the present disclosure; Figure 2 is a schematic diagram illustrating the process of generating video comments according to exemplary embodiments of the present disclosure; Figure 3 is a block diagram illustrating an apparatus for generating video comments according to exemplary embodiments of the present disclosure; Figure 4 is a block diagram illustrating an electronic device according to exemplary embodiments of the present disclosure.
Detailed Description
[0026] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.
[0027] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following examples do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0028] It should be noted that the phrase "at least one of several items" in this disclosure refers to three parallel cases: "any one of the several items", "a combination of any number of the several items", and "all of the several items". For example, "including at least one of A and B" includes the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Another example is "performing at least one of step one and step two", which means the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing both step one and step two.
[0029] Figure 1 is a flowchart illustrating a method for generating video comments according to an exemplary embodiment of the present disclosure.
[0030] Referring to Figure 1, in step 101, text data and multiple videos can be obtained. The text data may include the title and video content description of each video, interpretation information of each personality type in multiple preset personality types, personality-related comment prompt text, personality-related comment text, personality-irrelevant comment prompt text, and personality-irrelevant comment text.
[0031] It should be noted that there can be $m$ videos, which can be represented as $V = \{v_1, v_2, \dots, v_m\}$, where $v_j$ is the $j$-th of the $m$ videos. These videos serve as the visual basis for text generation. The preset multiple personality types may be the five major personality types, and the aforementioned interpretation information provides a personality interpretation for each of these five personality traits.
[0032] According to exemplary embodiments of this disclosure, the aforementioned preset multiple personality types may include the following: conscientiousness, openness, extraversion, agreeableness, and neuroticism.
[0033] In step 102, the aforementioned text data and multiple videos can be input into the large language model to obtain a personality activation pattern matrix. This personality activation pattern matrix can be used to capture the personality trait representations required by the large language model to generate specific personality-related comments.
[0034] According to an exemplary embodiment of this disclosure, a personality-perceived reasoning sample set can be generated using a large language model, utilizing multiple videos, personality-related comment prompts, and personality-related comment texts. This personality-perceived reasoning sample set can be used to reflect the reasoning process of the large language model under the guidance of a personality.
[0035] It should be noted that the personality-perceived reasoning sample set can be constructed from the input data as follows: for each sample, the input may include a video, a prompt text for a specific personality (for example, "Generate a comment that matches an extroverted personality"), and a labeled comment that matches that personality. This personality-perceived reasoning sample set can reflect the reasoning process of the large model under the guidance of a clearly defined personality. Additionally, in some embodiments, the construction of the personality-perceived reasoning sample set also uses video metadata (such as the video title and content description), which will not be listed here.
[0036] Furthermore, a personality-independent reasoning sample set can be generated using a large language model, utilizing multiple videos, personality-independent comment prompts, and personality-independent comment texts. This personality-independent reasoning sample set can be used to reflect the reasoning process of the large language model in the absence of personality guidance.
[0037] It should be noted that, when constructing the personality-independent reasoning sample set, the input of the large language model may include, but is not limited to: a video; a basic comment generation prompt, that is, a comment generation prompt that provides no personality guidance (e.g., "Please generate a comment for this video"); a randomly generated comment; and video metadata (e.g., the video title and content description).
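The two sample sets described above differ only in the prompt and the reference comment paired with each video. A minimal Python sketch of that pairing follows; the `ReasoningSample` record, the `build_sample_sets` helper, and the dictionary keys (`id`, `title`, `description`) are hypothetical names introduced for illustration and are not part of this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningSample:
    """One reasoning sample: a video reference, a prompt, and a comment (hypothetical layout)."""
    video_id: str
    prompt: str
    comment: str
    title: str = ""
    description: str = ""


def build_sample_sets(videos, persona_prompt, persona_comments, base_prompt, base_comments):
    """Pair every video with a personality prompt and with a basic prompt.

    videos: list of dicts with 'id' and optional 'title' / 'description' (assumed format).
    persona_comments / base_comments: dicts mapping video id to comment text.
    Returns (personality-perceived samples, personality-independent samples).
    """
    perceived, independent = [], []
    for video in videos:
        vid = video["id"]
        meta = {
            "title": video.get("title", ""),
            "description": video.get("description", ""),
        }
        perceived.append(ReasoningSample(vid, persona_prompt, persona_comments[vid], **meta))
        independent.append(ReasoningSample(vid, base_prompt, base_comments[vid], **meta))
    return perceived, independent
```

Keeping both sets aligned on the same videos means that, under this sketch, any difference between them comes from the prompt and comment text rather than from the visual input.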
[0038] Then, the large language model can be used to perform forward reasoning on the personality-perceived reasoning sample set and the personality-independent reasoning sample set separately, to obtain the personality-perceived hidden states and the personality-independent hidden states. Next, the personality activation pattern matrix can be calculated based on these hidden states.
[0039] It should be noted that after constructing the two types of sample sets, the personality activation pattern matrix can be extracted. First, forward inference can be performed on the personality-perceived reasoning sample set and the personality-independent reasoning sample set separately, recording the hidden states of the model at each network layer as it generates comments. Then, the average hidden states of each sample set at each layer can be calculated to form a personality-perceived average hidden state matrix and a personality-independent average hidden state matrix. Next, the layer-by-layer differences between the two average hidden state matrices can be calculated to obtain a personality activation pattern matrix. This matrix captures the deep representational patterns required by the model to generate comments for a specific personality, reflecting the differential characteristics of personality trait activation within the model.
[0040] Furthermore, the average hidden state matrix of each of the two sample sets can be calculated by averaging, layer by layer, the hidden state vectors recorded for that sample set. For the personality-perceived reasoning sample set, $v_j$ denotes the $j$-th video, $t_{p_i}$ denotes the prompt text for the $i$-th personality type, $c_j$ denotes the comment text of the $j$-th video sample, and $h_{i,j,k}^{l} \in \mathbb{R}^{d}$ denotes the hidden state vector corresponding to the $k$-th token of the $j$-th video sample at the $l$-th layer of the large language model for the $i$-th personality type, where $l$ is the index of the network layer and ranges from 1 to $L$, $\mathbb{R}^{d}$ denotes the $d$-dimensional real vector space, $\mathbb{R}$ denotes the set of real numbers, and $d$ denotes the dimension of the hidden state. $\bar{H}_{p_i} \in \mathbb{R}^{L \times d}$ denotes the average hidden state matrix corresponding to the personality-perceived reasoning sample set, where $\mathbb{R}^{L \times d}$ denotes the $(L \times d)$-dimensional matrix space.
[0041] The average hidden state matrix of the personality-independent reasoning sample set is calculated in the same way.
[0042] Here, $v_j$ denotes the $j$-th video sample, $t_0$ denotes the basic comment generation prompt, i.e., a prompt without personality guidance, $c_j$ denotes the comment text of the $j$-th video sample, and $h_{0,j,k}^{l} \in \mathbb{R}^{d}$ denotes the hidden state vector corresponding to the $k$-th token of the $j$-th video sample at the $l$-th layer of the large language model, where $\mathbb{R}^{d}$ denotes the $d$-dimensional real vector space, $\mathbb{R}$ denotes the set of real numbers, $d$ denotes the dimension of the hidden state, and $l$ denotes the index of the network layer, which ranges from 1 to $L$. $\bar{H}_{0} \in \mathbb{R}^{L \times d}$ denotes the average hidden state matrix corresponding to the personality-independent reasoning sample set, where $\mathbb{R}^{L \times d}$ denotes the $(L \times d)$-dimensional matrix space and $L$ denotes the number of layers contained in the large language model.
[0043] The personality activation pattern matrix can be calculated as the layer-by-layer difference between the two average hidden state matrices:
$$A_{p_i} = \bar{H}_{p_i} - \bar{H}_{0} \in \mathbb{R}^{L \times d}$$
[0044] where $A_{p_i}$ denotes the personality activation pattern matrix, $\bar{H}_{p_i}$ and $\bar{H}_{0}$ denote the personality-perceived and the personality-independent average hidden state matrices respectively, and the $l$-th row $a_{p_i}^{l}$ of $A_{p_i}$ is the personality activation vector, contained in the personality activation pattern matrix, of the $i$-th personality type at the $l$-th layer of the large language model. $\mathbb{R}^{L \times d}$ denotes the $(L \times d)$-dimensional matrix space, $\mathbb{R}$ denotes the set of real numbers, $d$ denotes the dimension of the hidden state, and $L$ denotes the number of layers contained in the large language model.
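The recording step described above (per-layer averaging followed by a layer-by-layer difference) can be illustrated with a small sketch in plain Python. The nested-list layout `hidden_states[sample][layer][token]`, the function names, and the choice to weight every token of every sample equally are assumptions made for illustration; the disclosure does not fix these details here.

```python
def layer_means(hidden_states):
    """Average hidden-state vectors per layer.

    hidden_states[s][l][k] is the d-dimensional vector (list of floats) for
    token k at layer l of sample s. Every token of every sample is weighted
    equally, which is an assumption of this sketch.
    Returns an L x d matrix as a list of lists.
    """
    num_layers = len(hidden_states[0])
    dim = len(hidden_states[0][0][0])
    means = []
    for layer in range(num_layers):
        total = [0.0] * dim
        count = 0
        for sample in hidden_states:
            for token_vec in sample[layer]:
                for i, value in enumerate(token_vec):
                    total[i] += value
                count += 1
        means.append([t / count for t in total])
    return means


def activation_pattern_matrix(perceived_states, independent_states):
    """Layer-by-layer difference: personality-perceived mean minus personality-independent mean."""
    perceived_mean = layer_means(perceived_states)
    independent_mean = layer_means(independent_states)
    return [
        [p - q for p, q in zip(row_p, row_q)]
        for row_p, row_q in zip(perceived_mean, independent_mean)
    ]
```

In practice the hidden states would come from forward passes of the large visual language model; the sketch only covers the arithmetic applied to them afterwards.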
[0045] In step 103, a personality empathy keyframe sequence can be extracted from multiple videos using a large language model and a personality activation pattern matrix. For example, a large visual language model can be used to calculate the feature similarity between each personality activation pattern and the hidden state of each frame, and then the Top-K frames with the highest feature similarity can be selected as the keyframe sequence.
[0046] Thus, this disclosure achieves dual optimization through personality-oriented keyframe extraction: in terms of computational efficiency, the visual information of the original video can be compressed to K/M of its original length, where M is the number of candidate frames and K is the number of selected keyframes, significantly reducing the model's computational load and memory consumption; in terms of semantic alignment, the personality-aware frame selection mechanism ensures that the visual content input to the model is highly correlated with the target personality traits, thereby providing a high-quality visual foundation for subsequent generation of personality-consistent comments. Furthermore, this method is well-suited for long video scenarios because it effectively overcomes the inherent limitations of visual language models in processing long sequences while maintaining the accuracy of personality expression.
[0047] According to an exemplary embodiment of this disclosure, the hidden states of multiple video frames contained in multiple videos can be extracted using a large language model. Then, the feature similarity between the hidden state of each video frame and the personality activation pattern matrix can be calculated using the large language model. Next, video frames whose feature similarity meets preset conditions can be selected from the multiple video frames as a sequence of personality empathy keyframes. For example, the feature similarities corresponding to the multiple video frames can be sorted in descending order, and a preset number of video frames at the top of the sorting results can be selected as the sequence of personality empathy keyframes.
[0048] It should be noted that the input video $v$ can be temporally sampled, via the visual encoder of the large visual language model, to obtain a set of $M$ candidate video frames $F = \{f_1, f_2, \dots, f_M\}$. Considering the limitations of computational resources, the sampling strategy can employ uniform sampling or adaptive sampling. Then, each candidate frame $f_j$ can be input alone, without any accompanying text input, into the visual encoder of the large visual language model to obtain its visual feature representation. For example, for the $j$-th video frame, its set of hidden states across the $L$ network layers can be obtained through forward propagation:
$$H_j = \left\{ \left[ h_{j,1}^{l}, \dots, h_{j,N_j}^{l} \right] \right\}_{l=1}^{L}$$
[0049] where $N_j$ denotes the number of visual features output for the $j$-th video frame, $h_{j,1}^{l}$ denotes the hidden state vector of the first visual token of the $j$-th video frame at the $l$-th layer of the large language model, $h_{j,N_j}^{l}$ denotes the hidden state vector of the $N_j$-th visual token of the $j$-th video frame at the $l$-th layer of the large language model, and $l$ is the index of the network layer, i.e., the $l$-th layer of the large language model, ranging from 1 to $L$.
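As an illustration of the uniform sampling option mentioned above, the following Python sketch picks evenly spaced frame indices from a video. The function name `uniform_frame_indices` and the choice to take the midpoint of each equal-length segment are assumptions of this sketch, not details given in the disclosure; adaptive sampling is not covered.

```python
def uniform_frame_indices(total_frames, num_candidates):
    """Return indices of evenly spaced candidate frames.

    The video is split into num_candidates equal segments and the frame at
    the middle of each segment is taken (a design choice of this sketch).
    """
    if total_frames <= 0 or num_candidates <= 0:
        return []
    if num_candidates >= total_frames:
        # Fewer frames than requested candidates: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_candidates
    return [int(step * i + step / 2) for i in range(num_candidates)]
```

For example, `uniform_frame_indices(100, 4)` gives `[12, 37, 62, 87]`.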
[0050] Next, based on the aforementioned personality activation pattern matrix, the feature similarity, i.e., the degree of semantic alignment, between each video frame and the target personality can be calculated. For the $j$-th video frame, its feature similarity with the $i$-th personality type can be measured by averaging the feature similarity across all layers.
[0051] According to an exemplary embodiment of this disclosure, the feature similarity between the hidden state of each video frame and the personality activation pattern matrix in multiple video frames can be calculated using a large language model using the following formula:
[0052]
$$S(p_i, f_j) = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{N_j} \sum_{k=1}^{N_j} \cos\left(a_{p_i}^{l},\, h_{j,k}^{l}\right)$$
where $S(p_i, f_j)$ denotes the feature similarity, $\cos(\cdot,\cdot)$ denotes the calculation of cosine similarity, $p_i$ is the $i$-th personality type among the preset multiple personality types, $f_j$ is the $j$-th video frame among the multiple video frames, $L$ is the number of layers in the large language model, $N_j$ is the number of visual features in the $j$-th video frame, $a_{p_i}^{l}$ is the personality activation vector, contained in the personality activation pattern matrix, of the $i$-th personality type at the $l$-th layer of the large language model, and $h_{j,k}^{l}$ is the hidden state of the $k$-th visual token of the $j$-th video frame at the $l$-th layer of the large language model. Here, a token can be understood as analogous to a word.
[0053] It should be noted that this feature similarity measurement method ensures comprehensiveness of the evaluation in the following ways: at the layer dimension, it comprehensively considers both shallow visual features and deep semantic features; at the spatial dimension, it aggregates feature responses from all visual locations within the frame. This multi-layered, multi-location aggregation strategy can effectively capture the association patterns between personality traits and visual content at different levels of abstraction.
[0054] Next, after calculating the feature similarity of all candidate frames, the feature similarities can be sorted in descending order, and the top K (K << M) video frames can be selected from them as the key frame sequence of personality empathy. This screening process can ensure that the visual information finally input into the model is both representative of personality and meets the length limit of model processing. In addition, the threshold K can be dynamically adjusted according to the actual application scenario, and the value of the threshold K needs to balance computational efficiency and content integrity.
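The scoring and Top-K screening described above can be sketched in plain Python as follows. The sketch assumes the similarity is the cosine similarity between each layer's personality activation vector and each visual token's hidden state, averaged over tokens and then over layers. The function names, the nested-list data layout, and the decision to return the selected frames in their original temporal order are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frame_similarity(activation_matrix, frame_states):
    """Average cosine similarity over visual tokens, then over layers.

    activation_matrix[l] is the activation vector of layer l;
    frame_states[l][k] is the hidden state of visual token k at layer l.
    """
    per_layer = [
        sum(cosine(act_vec, token) for token in tokens) / len(tokens)
        for act_vec, tokens in zip(activation_matrix, frame_states)
    ]
    return sum(per_layer) / len(per_layer)


def select_keyframes(activation_matrix, all_frame_states, k):
    """Return (indices of the k highest-scoring frames in temporal order, all scores)."""
    scores = [frame_similarity(activation_matrix, fs) for fs in all_frame_states]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores
```

With M candidate frames and K selected frames, the visual input shrinks to K/M of its original length; for instance, keeping 8 of 64 candidates leaves one eighth of the frames.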
[0055] In step 104, a large language model can be used to generate a personality-controlled video comment based on the key frame sequence of personality empathy, the personality-irrelevant comment prompt text, and the personality activation pattern matrix.
[0056] Thus, in the present disclosure, the personality activation pattern matrix captures the deep representations related to personality. By replaying this matrix, that is, by continuously injecting personality features during video comment generation through the personality activation replay mechanism, the generated video comment text can be clearly aligned with the specified personality features. This ensures that the generated video comment is highly consistent with the target personality in semantic content, emotional tendency, and expression style, and effectively avoids personality-inconsistent hallucinations, thereby improving the accuracy and reliability of generating personality-controlled video comments.
[0057] According to an exemplary embodiment of the present disclosure, a large language model can be used to encode the key frame sequence of personality empathy and the personality-irrelevant comment prompt text to obtain an initial hidden state sequence. Then, the personality activation pattern matrix can be used to modify the initial hidden state sequence to obtain a modified hidden state sequence. Next, a large language model can be used to generate a personality-controlled video comment based on the modified hidden state sequence.
[0058] It should be noted that, based on the personality empathy keyframe sequence and the personality activation pattern matrix, personality-controlled text generation can be achieved without modifying the model parameters. Specifically, the personality empathy keyframe sequence and the text prompt instruction that guides the large visual language model to generate a basic comment (for example, "Please generate a comment for this video") can first jointly form a multimodal input, which is fed into the large visual language model. The model can then encode this multimodal input to obtain an initial hidden state sequence. Personality activation replay can then be performed on the last token of the initial hidden state sequence; that is, the hidden state of the last token can be merged with the vector of the corresponding layer in the personality activation pattern matrix.
[0059] Here, the vector of the corresponding layer is the personality activation vector, contained in the personality activation pattern matrix, of the given personality type at the l-th layer of the large language model, where l is the layer index and ranges from 1 to L.
[0060] Next, the next token can be predicted based on the modified hidden state, so that the personality-controlled video comment text is generated iteratively.
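The replay step described above can be sketched as follows. This is a minimal illustration, not the disclosed formula: the merge is assumed here to be an element-wise addition scaled by a hypothetical strength coefficient `alpha`, and the function name `replay_activation` is likewise an assumption introduced only for this example. Hidden states are plain Python lists, one vector per layer for the last token.

```python
from typing import List

Vector = List[float]


def replay_activation(last_token_states: List[Vector],
                      activation_matrix: List[Vector],
                      alpha: float = 1.0) -> List[Vector]:
    """Merge each layer's last-token hidden state with that layer's
    personality activation vector.

    ASSUMPTION: the merge is additive with strength `alpha`; the source
    text does not preserve the actual merge formula.
    """
    if len(last_token_states) != len(activation_matrix):
        raise ValueError("one activation vector is required per layer")
    modified = []
    for hidden, activation in zip(last_token_states, activation_matrix):
        if len(hidden) != len(activation):
            raise ValueError("hidden state and activation vector sizes differ")
        modified.append([h + alpha * v for h, v in zip(hidden, activation)])
    return modified


# Two layers (L = 2), hidden size 2; values are illustrative only.
hidden = [[0.5, -1.0], [2.0, 0.0]]
pattern = [[1.0, 1.0], [-2.0, 4.0]]
steered = replay_activation(hidden, pattern, alpha=0.5)
print(steered)  # [[1.0, -0.5], [1.0, 2.0]]
```

In an actual model the modified states would be written back into the forward pass before the next token is predicted, and the same merge would be repeated at every decoding step.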
[0061] Figure 2 is a schematic diagram illustrating the process of generating video comments according to an exemplary embodiment of the present disclosure.
[0062] Referring to Figure 2, text data and multiple videos can first be acquired.
[0063] Then, the text data and the multiple videos can be input into a large visual language model to obtain the personality perception average hidden state matrix and the personality-independent average hidden state matrix, respectively. For example, the large visual language model used here can be, but is not limited to, MiniCPM-V-2.6.
[0064] Next, the difference between the personality perception average hidden state matrix and the personality-independent average hidden state matrix can be computed to obtain the personality activation pattern matrix.
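The difference step above can be illustrated with a small sketch. It assumes that each reasoning sample has already been reduced to one hidden vector per layer (how that reduction is done is not specified here), and the names `mean_hidden_states` and `activation_pattern` are hypothetical helpers introduced only for this example.

```python
from typing import List

Matrix = List[List[float]]  # [layer][hidden dimension]


def mean_hidden_states(samples: List[Matrix]) -> Matrix:
    """Average per-layer hidden states over all samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    layers, dim = len(samples[0]), len(samples[0][0])
    return [[sum(s[l][d] for s in samples) / n for d in range(dim)]
            for l in range(layers)]


def activation_pattern(aware: List[Matrix], neutral: List[Matrix]) -> Matrix:
    """Personality perception mean minus personality-independent mean."""
    aware_mean = mean_hidden_states(aware)
    neutral_mean = mean_hidden_states(neutral)
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(aware_mean, neutral_mean)]


# Two samples per set, two layers, hidden size 2 (illustrative values).
aware_samples = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 2.0], [5.0, 0.0]]]
neutral_samples = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 3.0], [3.0, 1.0]]]
pattern_matrix = activation_pattern(aware_samples, neutral_samples)
print(pattern_matrix)  # [[1.0, 0.0], [2.0, 1.0]]
```

Each row of the result is one layer's activation vector for a single personality type; repeating the computation per personality type would yield one such matrix per type.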
[0065] Then, MiniCPM-V-2.6 can be used to sample frames from the multiple videos and encode them, so as to obtain the video frames of the multiple videos.
[0066] Next, the feature similarity between each video frame and the personality activation pattern matrix can be calculated separately, and the Top-K frames with the highest similarity to the relevant personality can be retained as the personality keyframe sequence.
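The Top-K retention step above can be sketched as follows, assuming that one similarity score per frame is already available. The function name `select_top_k` is hypothetical, and returning the retained frames in their original temporal order is an assumption made for this example rather than something stated in the text.

```python
from typing import List


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames.

    ASSUMPTION: the retained indices are re-sorted into temporal order
    so that the keyframe sequence preserves the order of the video.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# One similarity score per sampled frame (illustrative values).
frame_scores = [0.12, 0.80, 0.33, 0.91, 0.05]
keyframe_indices = select_top_k(frame_scores, k=2)
print(keyframe_indices)  # [1, 3]
```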
[0067] Then, based on the personality keyframe sequence, the personality activation replay mechanism can be executed using the personality activation pattern matrix to obtain personality-controllable video comment text. Specifically, MiniCPM-V-2.6 can be used to obtain personality-independent hidden states from the personality keyframe sequence and the personality-independent comment prompt text. Next, the personality activation pattern matrix and the personality-independent hidden states can be fused through the personality activation replay mechanism to finally obtain the personality-controllable video comment.
[0068] Figure 3 is a block diagram illustrating a video comment generation apparatus 300 according to an exemplary embodiment of the present disclosure.
[0069] Referring to Figure 3, the generation apparatus 300 may include a data acquisition module 301, a matrix acquisition module 302, a keyframe extraction module 303, and a comment generation module 304.
[0070] The data acquisition module 301 can acquire text data and multiple videos. The text data can include the title and video content description of each video, interpretation information for each of the preset multiple personality types, personality-related comment prompt text, personality-related comment text, personality-irrelevant comment prompt text, and personality-irrelevant comment text.
[0071] According to exemplary embodiments of this disclosure, the aforementioned preset multiple personality types may include conscientiousness, openness, extraversion, agreeableness, and neuroticism.
[0072] The matrix acquisition module 302 can input the aforementioned text data and multiple videos into the large language model to obtain a personality activation pattern matrix. This personality activation pattern matrix can be used to capture the personality trait representations required by the large language model to generate specific personality-related comments.
[0073] According to an exemplary embodiment of this disclosure, the matrix acquisition module 302 can use the large language model to generate a personality perception reasoning sample set from the multiple videos, the personality-related comment prompt text, and the personality-related comment text. The personality perception reasoning sample set can be used to reflect the reasoning process of the large language model under personality guidance.
[0074] Furthermore, the matrix acquisition module 302 can also use the large language model to generate a personality-independent reasoning sample set from the multiple videos, the personality-independent comment prompt text, and the personality-independent comment text. This personality-independent reasoning sample set can be used to reflect the reasoning process of the large language model in the absence of personality guidance.
[0075] Then, the matrix acquisition module 302 can use the large language model to perform forward reasoning on the personality perception reasoning sample set and the personality-independent reasoning sample set separately, obtaining the personality perception hidden states and the personality-independent hidden states. Next, the matrix acquisition module 302 can calculate the personality activation pattern matrix based on the personality perception hidden states and the personality-independent hidden states.
[0076] The keyframe extraction module 303 can extract a sequence of personality empathy keyframes from multiple videos using a large language model and a personality activation pattern matrix. For example, a large visual language model can be used to calculate the feature similarity between each personality activation pattern and the hidden state of each frame, and then the Top-K frames with the highest feature similarity can be selected as the keyframe sequence.
[0077] According to an exemplary embodiment of this disclosure, the keyframe extraction module 303 can extract the hidden states of multiple video frames contained in multiple videos using a large language model. Then, the keyframe extraction module 303 can calculate the feature similarity between the hidden state of each video frame and the personality activation pattern matrix using the large language model. Next, the keyframe extraction module 303 can select video frames whose feature similarity meets preset conditions from the multiple video frames as a sequence of personality empathy keyframes.
[0078] According to an exemplary embodiment of this disclosure, the keyframe extraction module 303 can calculate, through the large language model, the feature similarity between the hidden state of each video frame among the multiple video frames and the personality activation pattern matrix, using a formula defined over the following quantities.
[0079] The quantities in the formula are: the feature similarity; the cosine similarity operation; a given personality type among the preset multiple personality types; the j-th video frame among the multiple video frames; the number of layers of the large language model; the number of visual tokens (visual features) in the j-th video frame; the personality activation vector, contained in the personality activation pattern matrix, of the given personality type at a given layer of the large language model; and the hidden state, at that layer of the large language model, of a given visual token of the j-th video frame.
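One plausible reading of the quantities defined above is sketched below. It is an assumption, not the disclosed formula: the cosine similarity between each layer's personality activation vector and each visual token's hidden state is averaged first over the visual tokens of the frame and then over the layers. The function names `cosine` and `frame_similarity` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def frame_similarity(frame_states: List[List[Vector]],
                     activation_matrix: List[Vector]) -> float:
    """Score one frame against one personality type.

    frame_states[l][k] is the hidden state of the k-th visual token at
    layer l; activation_matrix[l] is the activation vector at layer l.
    ASSUMPTION: plain averaging over tokens and then over layers.
    """
    if len(frame_states) != len(activation_matrix):
        raise ValueError("frame states and activation matrix need equal layers")
    total = 0.0
    for layer_states, activation in zip(frame_states, activation_matrix):
        total += sum(cosine(activation, h) for h in layer_states) / len(layer_states)
    return total / len(activation_matrix)


# Two layers, two visual tokens per layer, hidden size 2 (illustrative).
activation = [[1.0, 0.0], [0.0, 1.0]]
frame = [[[2.0, 0.0], [0.0, 3.0]],   # layer 1: cosines 1.0 and 0.0
         [[0.0, 5.0], [0.0, 1.0]]]   # layer 2: cosines 1.0 and 1.0
score = frame_similarity(frame, activation)
print(score)  # 0.75
```

Scores computed this way for every frame could then be ranked to keep the frames that meet the preset condition, such as the Top-K frames.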
[0080] The comment generation module 304 can generate personality-controlled video comments based on a large language model, a sequence of personality empathy keyframes, personality-independent comment prompts, and a personality activation pattern matrix.
[0081] According to an exemplary embodiment of this disclosure, the comment generation module 304 can encode the personality empathy keyframe sequence and personality-independent comment prompt text using a large language model to obtain an initial hidden state sequence. Then, the comment generation module 304 can modify the initial hidden state sequence using a personality activation pattern matrix to obtain a modified hidden state sequence. Next, the comment generation module 304 can use the large language model, based on the modified hidden state sequence, to generate a personality-controlled video comment.
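The cooperation of modules 301 to 304 described above can be sketched as a simple pipeline. This is only a structural illustration: the class name `CommentGenerationApparatus`, its field names, and the stand-in lambdas are all hypothetical, and no model is invoked.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class CommentGenerationApparatus:
    """Wires the four modules together; each module is an injected callable."""
    acquire_data: Callable[[], Tuple[Any, List[Any]]]         # module 301
    acquire_matrix: Callable[[Any, List[Any]], Any]           # module 302
    extract_keyframes: Callable[[List[Any], Any], List[Any]]  # module 303
    generate_comment: Callable[[List[Any], str, Any], str]    # module 304

    def run(self, neutral_prompt: str) -> str:
        text_data, videos = self.acquire_data()
        pattern = self.acquire_matrix(text_data, videos)
        keyframes = self.extract_keyframes(videos, pattern)
        return self.generate_comment(keyframes, neutral_prompt, pattern)


# Stand-in modules, purely illustrative.
apparatus = CommentGenerationApparatus(
    acquire_data=lambda: ({"title": "demo"}, ["video_a", "video_b"]),
    acquire_matrix=lambda text, videos: "pattern",
    extract_keyframes=lambda videos, pattern: [v + ":frame0" for v in videos],
    generate_comment=lambda frames, prompt, pattern:
        f"{prompt} -> {len(frames)} keyframes, {pattern}",
)
result = apparatus.run("Please generate a comment for this video")
print(result)
```

Passing the modules in as callables keeps the data flow between them explicit and lets each stage be replaced or tested in isolation.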
[0082] Figure 4 is a block diagram illustrating an electronic device 400 according to an exemplary embodiment of the present disclosure.
[0083] Referring to Figure 4, the electronic device 400 includes at least one memory 401 and at least one processor 402, wherein the at least one memory 401 stores instructions that, when executed by the at least one processor 402, perform a method for generating video comments according to an exemplary embodiment of the present disclosure.
[0084] As an example, electronic device 400 may be a PC, tablet, personal digital assistant, smartphone, or other device capable of executing the aforementioned instructions. Here, electronic device 400 is not necessarily a single electronic device, but may be a collection of any devices or circuits capable of executing the aforementioned instructions (or instruction sets) individually or in combination. Electronic device 400 may also be part of an integrated control system or system manager, or may be configured to interconnect with a portable electronic device locally or remotely (e.g., via wireless transmission) through an interface.
[0085] In the electronic device 400, the processor 402 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor 402 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, etc.
[0086] The processor 402 can execute instructions or code stored in the memory 401, which can also store data. Instructions and data can also be sent and received over a network via a network interface device, which can employ any known transmission protocol.
[0087] The memory 401 may be integrated with the processor 402, for example, by placing RAM or flash memory within an integrated circuit microprocessor. Alternatively, the memory 401 may include a separate device, such as an external disk drive, a storage array, or other storage device that can be used by any database system. The memory 401 and the processor 402 may be operatively coupled, or may communicate with each other, for example, via I/O ports, network connections, etc., enabling the processor 402 to read files stored in the memory.
[0088] In addition, the electronic device 400 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 400 can be interconnected via a bus and/or network.
[0089] According to exemplary embodiments of this disclosure, a computer-readable storage medium may also be provided, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the aforementioned video comment generation method. Examples of computer-readable storage media include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card storage (such as multimedia cards, Secure Digital (SD) cards, or Extreme Digital (XD) cards), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the computer program. The computer program in the aforementioned computer-readable storage medium can run in an environment deployed in computer devices such as clients, hosts, agent devices, servers, etc. Furthermore, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system, such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner through one or more processors or computers.
[0090] According to exemplary embodiments of the present disclosure, a computer program product may also be provided, including a computer program that, when executed by a processor, implements a method for generating video comments according to the present disclosure.
[0091] According to the video comment generation method, apparatus, electronic device, storage medium, and computer program product disclosed herein, a personality activation pattern matrix can be extracted from a large language model through a personality activation recording process, enabling the establishment of deep representations of personality traits. Furthermore, by constructing a keyframe sequence aligned with personality, visual content highly relevant to the target personality can be selected. Additionally, by injecting the personality activation pattern matrix into the video comment generation process through a personality activation replay mechanism, consistency between the output text and the specified personality trait can be ensured. In other words, this disclosure establishes effective constraints in three key stages: personality representation learning, visual content filtering, and text generation control. This avoids personality hallucination problems at their source and significantly improves the personality consistency and content reliability of the generated video comment text, effectively meeting the needs of practical application scenarios.
[0092] According to exemplary embodiments of this disclosure, personality-oriented keyframe extraction achieves a dual optimization. In terms of computational efficiency, the visual information of the original video can be compressed to K/M of its original length, significantly reducing the computational load and memory consumption of the model. In terms of semantic alignment, the personality-aware frame selection mechanism ensures that the visual content input to the model is highly correlated with the target personality traits, thereby providing a high-quality visual foundation for generating personality-consistent comments. Furthermore, this method is well suited to long-video scenarios because it effectively overcomes the inherent limitations of visual language models in processing long sequences while maintaining the accuracy of personality expression.
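The K/M compression above can be made concrete with a short calculation. The sketch reads M as the number of sampled frames and K as the number of retained keyframes, and assumes every frame contributes the same number of visual tokens; the frame and token counts used are hypothetical, as is the function name `compressed_visual_tokens`.

```python
from typing import Tuple


def compressed_visual_tokens(total_frames: int,
                             kept_frames: int,
                             tokens_per_frame: int) -> Tuple[int, float]:
    """Return (visual tokens after selection, compression ratio K/M).

    ASSUMPTION: every frame yields the same number of visual tokens.
    """
    if not 0 < kept_frames <= total_frames:
        raise ValueError("kept_frames must be in the range 1..total_frames")
    ratio = kept_frames / total_frames
    return kept_frames * tokens_per_frame, ratio


# Hypothetical numbers: M = 240 sampled frames, K = 16 keyframes,
# 64 visual tokens per frame.
kept_tokens, ratio = compressed_visual_tokens(240, 16, 64)
print(kept_tokens, round(ratio, 4))  # 1024 0.0667
```

With these numbers the visual input shrinks from 15,360 tokens to 1,024, which is the K/M reduction the paragraph describes.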
[0093] According to an exemplary embodiment of this disclosure, because the personality activation pattern matrix captures deep personality-related representations, replaying it through the personality activation replay mechanism continuously injects personality features during video comment generation, so that the generated video comment text can be explicitly aligned with the specified personality features. This ensures that the generated video comment remains highly consistent with the target personality in semantic content, emotional tendency, and expression style, effectively avoiding the hallucination problem of personality inconsistency and thereby improving the accuracy and reliability of personality-controllable video comment generation.
[0094] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.
[0095] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A method for generating video comments, characterized by comprising: acquiring text data and multiple videos, wherein the text data includes the title and video content description of each video, interpretation information for each of the preset multiple personality types, personality-related comment prompt text, personality-related comment text, personality-irrelevant comment prompt text, and personality-irrelevant comment text; inputting the text data and the multiple videos into a large language model to obtain a personality activation pattern matrix, wherein the personality activation pattern matrix is used to capture the personality trait representations required by the large language model to generate specific personality-related comments; extracting, using the large language model and the personality activation pattern matrix, a personality empathy keyframe sequence from the multiple videos; and generating, using the large language model, personality-controllable video comments based on the personality empathy keyframe sequence, the personality-irrelevant comment prompt text, and the personality activation pattern matrix.
2. The generation method as described in claim 1, characterized in that the step of inputting the text data and the multiple videos into a large language model to obtain a personality activation pattern matrix includes: generating, using the large language model, a personality perception reasoning sample set from the multiple videos, the personality-related comment prompt text, and the personality-related comment text, wherein the personality perception reasoning sample set is used to reflect the reasoning process of the large language model under personality guidance; generating, using the large language model, a personality-independent reasoning sample set from the multiple videos, the personality-irrelevant comment prompt text, and the personality-irrelevant comment text, wherein the personality-independent reasoning sample set is used to reflect the reasoning process of the large language model in the absence of personality guidance; performing, using the large language model, forward reasoning on the personality perception reasoning sample set and the personality-independent reasoning sample set respectively to obtain a personality perception hidden state and a personality-independent hidden state; and calculating the personality activation pattern matrix based on the personality perception hidden state and the personality-independent hidden state.
3. The generation method as described in claim 1, characterized in that the step of extracting a personality empathy keyframe sequence from the multiple videos using the large language model and the personality activation pattern matrix includes: extracting, using the large language model, the hidden states of multiple video frames contained in the multiple videos; calculating, using the large language model, the feature similarity between the hidden state of each video frame among the multiple video frames and the personality activation pattern matrix; and selecting, from the multiple video frames, video frames whose feature similarity meets preset conditions as the personality empathy keyframe sequence.
4. The generation method as described in claim 3, characterized in that the step of calculating, using the large language model, the feature similarity between the hidden state of each video frame among the multiple video frames and the personality activation pattern matrix includes: calculating, using the large language model, the feature similarity between the hidden state of each video frame and the personality activation pattern matrix according to a formula defined over the following quantities: the feature similarity; a given personality type among the preset multiple personality types; the j-th video frame among the multiple video frames; the number of layers of the large language model; the number of visual tokens (visual features) in the j-th video frame; the personality activation vector, contained in the personality activation pattern matrix, of the given personality type at a given layer of the large language model; and the hidden state, at that layer of the large language model, of a given visual token of the j-th video frame.
5. The generation method as described in claim 1, characterized in that the step of generating personality-controllable video comments using the large language model, based on the personality empathy keyframe sequence, the personality-irrelevant comment prompt text, and the personality activation pattern matrix, includes: encoding, using the large language model, the personality empathy keyframe sequence and the personality-irrelevant comment prompt text to obtain an initial hidden state sequence; modifying the initial hidden state sequence using the personality activation pattern matrix to obtain a modified hidden state sequence; and generating, using the large language model, the personality-controllable video comments based on the modified hidden state sequence.
6. The generation method as described in claim 1, characterized in that the preset multiple personality types include conscientiousness, openness, extraversion, agreeableness, and neuroticism.
7. A device for generating video comments, characterized by comprising: a data acquisition module configured to acquire text data and multiple videos, wherein the text data includes the title and video content description of each video, interpretation information for each of the preset multiple personality types, personality-related comment prompt text, personality-related comment text, personality-irrelevant comment prompt text, and personality-irrelevant comment text; a matrix acquisition module configured to input the text data and the multiple videos into a large language model to obtain a personality activation pattern matrix, wherein the personality activation pattern matrix is used to capture the personality trait representations required by the large language model to generate specific personality-related comments; a keyframe extraction module configured to extract a personality empathy keyframe sequence from the multiple videos using the large language model and the personality activation pattern matrix; and a comment generation module configured to generate, using the large language model, personality-controllable video comments based on the personality empathy keyframe sequence, the personality-irrelevant comment prompt text, and the personality activation pattern matrix.
8. An electronic device, characterized by comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method for generating video comments as described in any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that, when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is able to perform the method for generating video comments as described in any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for generating video comments as described in any one of claims 1 to 6.