A video question generation method and device

By acquiring video information and historical dialogue records, and using a multi-attention mechanism fusion model to extract feature information, video questions are generated and filtered. This solves the problem of low reliability in video question generation in existing technologies, and achieves video question generation with higher credibility and consistency.

CN114764456B · Active · Publication Date: 2026-06-12 · HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUAWEI CLOUD COMPUTING TECHNOLOGIES CO LTD
Filing Date
2020-12-30
Publication Date
2026-06-12

Abstract

The application provides a video question generation method and device, and relates to the field of artificial intelligence (AI). When the method is executed, video information and N pieces of historical dialogue record information related to the video information are first acquired, wherein N is greater than or equal to 1. Visual feature information is extracted from the video information by using a first neural network model, and dialogue feature information is extracted from each piece of historical dialogue record information by using a second neural network model. For each piece of historical dialogue record information, a multi-attention mechanism fusion model is used to perform feature fusion on the visual feature information and the dialogue feature information, to determine a first representation of the dialogue feature information to the visual feature information and a second representation of the visual feature information to the dialogue feature information. For each piece of historical dialogue record information, a question generation model is used to perform feature extraction on the first representation and the second representation, to generate a video candidate question corresponding to the video information. The video question obtained by the method is more reliable.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence (AI) technology, and in particular to a method and apparatus for generating video questions.

Background Technology

[0002] With the development of AI technology, video question generation using neural networks has become an important network service. This service allows for the automatic generation of questions related to a given video. However, existing video question generation methods suffer from insufficient encoding of the video information, resulting in low reliability of the generated video questions.

Summary of the Invention

[0003] Based on this, this application provides a video question generation method and apparatus to improve the reliability of generated video questions.

[0004] Firstly, embodiments of this application provide a video question generation method. This method can be executed by a server or by a device with data processing capabilities, without specific limitations. When executed, the method first acquires video information and N historical dialogue records related to the video information; where N is greater than or equal to 1. It then uses a first neural network model to extract visual feature information from the video information and a second neural network model to extract dialogue feature information from each historical dialogue record. Next, for each historical dialogue record, a fusion model based on a multi-attention mechanism is used to fuse the visual and dialogue feature information to determine a first representation of the dialogue feature information relative to the visual feature information, and a second representation of the visual feature information relative to the dialogue feature information. Finally, for each historical dialogue record, a question generation model is used to extract features from the first and second representations to generate video candidate questions corresponding to the video information.

[0005] In this embodiment, after extracting the feature information of video information and historical dialogue records, the visual feature information of video information and the dialogue feature information of historical dialogue records can be fully integrated through a fusion model with a multi-attention mechanism. This can better achieve the fusion of multi-modal information, enabling the decoder to obtain the content contained in the video information and historical dialogue records more completely, thus making the credibility of the obtained set of video candidate questions higher.

[0006] In one possible implementation, a question-and-answer model can be used to answer video candidate questions and determine the answer corresponding to the video candidate questions. For N historical dialogue records, the text similarity between the video candidate questions and the answers corresponding to the video candidate questions can be determined, and the video candidate questions with text similarity higher than a preset threshold can be used as the target video questions.

[0007] This application obtains the answers to the video candidate questions based on the video candidate question set, and determines the target video question based on the video candidate question set and its corresponding answers. This application also incorporates the answers to the candidate questions into the evaluation of the video candidate questions, fully considering the impact of the generated question answers on the video question, thus enabling the video question to have better coherence and completeness.

[0008] In one possible implementation, the first neural network model includes a video frame feature extraction model and a video segment feature extraction model. The video frame feature extraction model is used to extract, from each frame of the video information, video frame feature information related to that frame; the video segment feature extraction model is used to extract, from a preset number of frames of the video information, video segment feature information related to those frames. The preset number of frames is greater than 1.

[0009] In this embodiment, a neural network model is used to extract feature information from video information, which is faster and more convenient.

[0010] In one possible implementation, the second neural network model is a dialogue feature extraction model; by referring to the positional encoding of words in historical dialogue records, the dialogue feature extraction model is used to extract features from multiple historical dialogue records to determine dialogue feature information.

[0011] In this embodiment, a neural network model is used to extract feature information from historical dialogue records, which is faster and more convenient.

[0012] In one possible implementation, visual feature information and dialogue feature information are used as input parameters to the fusion model of the multi-attention mechanism, so that the fusion model of the multi-attention mechanism outputs a first representation of the dialogue feature information on the visual feature information and a second representation of the visual feature information on the dialogue feature information.

[0013] In this embodiment, the feature information of video information and the feature information of historical dialogue record information can be fully integrated through the fusion model of multi-attention mechanism, which can better realize the fusion of multi-modal information, so that the decoder can obtain the video information and the content contained in the dialogue question and answer history more completely, and the credibility of obtaining the video candidate question set is higher.

[0014] In one possible implementation, video candidate questions can be input into a question-answering model, which then outputs the answers corresponding to the video candidate questions.

[0015] In this embodiment, a neural network model is used to extract feature information of video candidate questions, which is faster and more convenient.

[0016] Secondly, embodiments of this application provide a video question generation apparatus, comprising: an acquisition unit, configured to acquire video information and N historical dialogue records related to the video information; wherein N is greater than or equal to 1; a first information extraction unit, configured to extract visual feature information from the video information using a first neural network model, and extract dialogue feature information from each historical dialogue record using a second neural network model; a second information extraction unit, configured to perform feature fusion on the visual feature information and dialogue feature information based on a multi-attention mechanism fusion model for each historical dialogue record, to determine a first representation of the dialogue feature information on the visual feature information, and a second representation of the visual feature information on the dialogue feature information; and a candidate question generation unit, configured to extract features from the first representation and the second representation through a question generation model for each historical dialogue record, to generate video candidate questions corresponding to the video information.

[0017] In one possible implementation, the device further includes: a target video question determination module, used to answer video candidate questions through a question-answering model to determine the answer corresponding to the video candidate questions; and, for N historical dialogue records, to determine the text similarity between the video candidate questions and the answers corresponding to the video candidate questions, and to use video candidate questions with text similarity higher than a preset threshold as target video questions.

[0018] In one possible implementation, the first neural network model includes a video frame feature extraction model and a video segment feature extraction model. The first information extraction unit is used to: extract, from each frame image of the video information, video frame feature information related to that frame image using the video frame feature extraction model; and extract, from a preset number of frame images of the video information, video segment feature information related to those frame images using the video segment feature extraction model. The preset number of frames is greater than 1.

[0019] In one possible implementation, the second neural network model is a dialogue feature extraction model; the first information extraction unit is used to: extract features from multiple segments of historical dialogue records by using the dialogue feature extraction model, based on the positional encoding of words in the reference historical dialogue record information, to determine the dialogue feature information.

[0020] In one possible implementation, the second information extraction unit is used to: input visual feature information and dialogue feature information as input parameters to the fusion model of the multi-attention mechanism, so that the fusion model of the multi-attention mechanism outputs a first representation of the dialogue feature information on the visual feature information and a second representation of the visual feature information on the dialogue feature information.

[0021] In one possible implementation, the target video question determination module is used to: input video candidate questions into the question answering model and output the answers corresponding to the video candidate questions.

[0022] Thirdly, this application provides a video question generation apparatus, including a processor and a memory, wherein the memory stores a computer program; the processor is configured to execute the computer program stored in the memory so that the solution described in any implementation of the first aspect is executed.

[0023] Fourthly, this application provides a computer-readable storage medium storing computer-readable instructions, which, when read and executed by a computer, cause the computer to perform the solution described in any implementation of the first aspect above.

[0024] Fifthly, this application provides a computer program product that, when read and executed by a computer, causes the computer to perform the scheme described in any of the implementations of the first aspect above.

[0025] For the technical effects that can be achieved by the second to fifth aspects mentioned above, please refer to the description of the technical effects that can be achieved by the corresponding possible design schemes in the first aspect mentioned above. This application will not repeat them here.

Attached Figure Description

[0026] Figure 1 shows an application scenario diagram of the video question generation method provided in an embodiment of this application;

[0027] Figure 2 shows a flowchart of the video question generation method provided in an embodiment of this application;

[0028] Figure 3 shows a schematic diagram of the network structure of the video question generation method provided in an embodiment of this application;

[0029] Figure 4 shows a schematic diagram of the structure of the video question generation apparatus provided in an embodiment of this application;

[0030] Figure 5 shows a schematic diagram of the structure of the video question generation apparatus provided in an embodiment of this application.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions in the embodiments of this application will be described in detail below with reference to the accompanying drawings.

[0032] It should be noted that in this application, "and / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone, where A and B can be singular or plural. Furthermore, unless otherwise stated, the ordinal numbers such as "first" and "second" mentioned in the embodiments of this application are used to distinguish multiple objects and are not used to limit the order, sequence, priority, or importance of multiple objects.

[0033] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

[0034] To better illustrate the solution of this application, the scenarios to which it may apply are first described with reference to Figure 1. Figure 1 includes a robot and a user. After a video is imported into the robot (the video can be downloaded by the robot itself from the Internet or imported by relevant staff), and assuming that no historical dialogue records exist, the robot applies the video question generation method of this application to randomly generate a video question A (1+3=?) and an answer B (4), which is not displayed on the robot's screen. After the user gives answer A (3), the robot records answer A and compares it with answer B, then outputs the result of the user's answer. Since 3 and 4 are different, the robot can output a prompt message indicating that the answer is incorrect.

[0035] Another scenario is where historical dialogue records exist, and there are multiple records, such as: Dialogue record 1: What color is the bear in the video? Red; Dialogue record 2: How many bears are in the video? 2. The robot will fuse the dialogue records with the video information to generate more reliable video candidate questions.

[0036] Referring to Figure 2, the illustrated video question generation method can be executed by a server or by a device with data processing capabilities (such as a robot). This application does not specifically limit the executing entity. The following description uses a robot as an example for illustration only.

[0037] Step 201: The robot acquires video information and N historical dialogue records related to the video information; where N is greater than or equal to 1.

[0038] It should be noted that the video information can be pre-imported into the robot's memory, or it can be downloaded from the network by the robot based on the chat content with the user. For example, if the user mentions "SpongeBob SquarePants" to the robot, the robot will download video information related to "SpongeBob SquarePants" from the network. Historical dialogue records may be related to the video information, but these records may or may not exist. In practical application, this application does not specifically limit the existence of historical dialogue records. If the historical dialogue records do not exist, they will be set to empty; if they exist, they will be input into the robot's processor for data processing.

[0039] Step 202: The robot uses a first neural network model to extract visual feature information from video information and a second neural network model to extract dialogue feature information from each historical dialogue record.

[0040] Specifically, the feature information of video information and historical dialogue record information is extracted through a pre-trained neural network model. The neural network model allows for more comprehensive and accurate feature extraction.

[0041] In this embodiment, the first neural network model may include a video frame feature extraction model and a video segment feature extraction model. The robot uses the video frame feature extraction model to extract, from each frame of the video information, video frame feature information related to that frame, and uses the video segment feature extraction model to extract, from a preset number of frames of the video information, video segment feature information related to those frames. The preset number of frames is greater than 1.

[0042] The following explanation uses the VGGNet (Visual Geometry Group Net) model for video frame feature extraction and the 3D convolutional network (3D-ConvNet) model for video segment feature extraction as examples. However, in practical applications, the model types of video frame extraction and video segment extraction are not limited. Any neural network model that can extract video frame features and video segment features is applicable to this application.

[0043] For example, for a video with $N_f$ frames, the robot can resize each frame to 224 pixels long and 224 pixels wide. Then, for each frame, it uses a pre-trained VGGNet model (VGGNet serves here as the video frame feature extraction model; this is only an example and does not limit the specific structure of the video frame feature extraction model) to extract a 4096-dimensional feature vector $feat_f$. Each feature vector can be converted to 512 dimensions using the following Formula 1:

[0044] $feat_f' = \max(0,\ feat_f W_1 + b_1) W_2 + b_2$    Formula 1

[0045] Where $W_1$ and $W_2$ are trainable weight matrices, and $b_1$ and $b_2$ are trainable bias vectors. Stacking the resulting 512-dimensional feature vectors $feat_f'$ yields the frame-level feature matrix $v_f$ of the video (that is, the video frame feature information).
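As a minimal sketch of Formula 1, the projection can be written in plain Python as below. The helper names `vec_mat` and `project_feature`, the row-vector convention, and the toy dimensions (4 to 3 to 2 rather than 4096 to 512) are assumptions made for illustration only; the source does not state the hidden dimension or the parameter values.

```python
def vec_mat(vec, mat):
    """Multiply a row vector (length d_in) by a matrix (d_in x d_out)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def project_feature(feat, w1, b1, w2, b2):
    """Compute max(0, feat W1 + b1) W2 + b2, i.e. the shape of Formula 1."""
    hidden = [max(0.0, h + b) for h, b in zip(vec_mat(feat, w1), b1)]
    return [o + b for o, b in zip(vec_mat(hidden, w2), b2)]


# Toy parameters (hypothetical values, not trained weights).
feat = [1.0, -2.0, 0.5, 3.0]
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 0.0],
      [2.0, 0.0, 0.0],
      [0.0, 0.0, 1.0]]
b1 = [0.0, 1.0, 0.0]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.5, -0.5]

projected = project_feature(feat, w1, b1, w2, b2)

# Stacking one projected vector per frame gives a frame-level feature matrix.
v_f = [project_feature(f, w1, b1, w2, b2) for f in [feat, feat]]
```

The same function would apply unchanged to Formula 2 with the segment-level parameters, since the two formulas have the same form.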

[0046] In addition, for the above video with $N_f$ frames, the robot can also group every 16 consecutive frames into a segment (this is just an example using a preset number of 16; in actual applications, the preset number is not limited to 16). Each frame within a segment is then resized to 112 pixels long and 112 pixels wide. For each segment, a pre-trained 3D-ConvNet model (which serves here as the video segment feature extraction model; this is just an example and the specific structure of the video segment feature extraction model is not limited) is used to extract a 4096-dimensional feature vector $feat_s$. Each feature vector is converted to 512 dimensions using the following Formula 2:

[0047] $feat_s' = \max(0,\ feat_s W_3 + b_3) W_4 + b_4$    Formula 2

[0048] Where $W_3$ and $W_4$ are trainable weight matrices, and $b_3$ and $b_4$ are trainable bias vectors. Stacking the resulting 512-dimensional feature vectors $feat_s'$ yields the segment-level feature matrix $v_s$ of the video (that is, the video segment feature information).
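The grouping of consecutive frames into segments described above can be sketched as follows. The function name `split_into_segments` is hypothetical, and dropping trailing frames that do not fill a whole segment is an assumption; the source does not say how a remainder is handled.

```python
def split_into_segments(frames, segment_length=16):
    """Group consecutive frames into fixed-length segments.

    Assumption: trailing frames that do not fill a complete segment
    are dropped.
    """
    if segment_length <= 1:
        raise ValueError("segment_length must be greater than 1")
    usable = len(frames) - len(frames) % segment_length
    return [frames[i:i + segment_length] for i in range(0, usable, segment_length)]


# Frame indices stand in for decoded frames in this toy example.
frame_ids = list(range(40))
segments = split_into_segments(frame_ids)
```

With 40 frames and the example segment length of 16, this yields two full segments and leaves the last 8 frames unused under the stated assumption.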

[0049] Since historical dialogue records are primarily text-based, unlike video (image) information, feature information can be extracted from them using models other than video frame extraction and video segment extraction models, such as dialogue feature extraction models. The robot, referencing the positional encoding of words in the historical dialogue records, utilizes the dialogue feature extraction model to extract features from multiple segments of historical dialogue records, thus determining the dialogue feature information.

[0050] The following explanation takes the Global Vectors for Word Representation (GloVe) model as an example of the dialogue feature extraction model. However, in practical applications, the model type of the dialogue feature extraction model is not limited. Any neural network model that can extract dialogue features is applicable to this application.

[0051] For example, consider historical dialogue records containing n rounds, $c = (c_1, c_2, ..., c_n)$. A dialogue record $c_i$ is formed by concatenating the i-th question $q_i$ and the corresponding i-th answer $a_i$. The dialogue record $c_i$ is represented as a string of words made up of the words of question $q_i$ followed by the words of the corresponding answer $a_i$, where j indexes the j-th word of each, and l and l' denote the number of words in question $q_i$ and in answer $a_i$, respectively. For each word in dialogue record $c_i$, a 512-dimensional word vector is calculated using a pre-trained GloVe model (this is merely an example; the specific structure of the neural network model is not limited in actual applications), and a 512-dimensional positional encoding is also calculated. The word vectors calculated by the GloVe model and the positional encodings are added together to obtain the encoding matrix $c_i'$ of each dialogue record $c_i$. Stacking the encoding matrices of all dialogue records yields the encoding matrix $c'$ of all historical dialogues (that is, the dialogue feature information of the historical dialogue records).
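The encoding step above (word vector plus positional encoding, per word of a concatenated question and answer) might be sketched as follows. Several things here are assumptions rather than details from the source: the sinusoidal form of the positional encoding, the zero vector for unknown words, the whitespace tokenization, and the tiny 4-dimensional lookup table `toy_embeddings`, which stands in for a pre-trained GloVe model and does not contain real GloVe values.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding (assumed form): sin on even indices, cos on odd."""
    return [
        math.sin(position / 10000 ** (k / dim)) if k % 2 == 0
        else math.cos(position / 10000 ** ((k - 1) / dim))
        for k in range(dim)
    ]


def encode_record(question, answer, embeddings, dim):
    """Concatenate question and answer words, then add word vector and position."""
    words = question.split() + answer.split()
    unknown = [0.0] * dim  # assumption: out-of-vocabulary words map to zeros
    return [
        [w + p for w, p in zip(embeddings.get(word, unknown),
                               positional_encoding(pos, dim))]
        for pos, word in enumerate(words)
    ]


# Hypothetical 4-dimensional word vectors for illustration only.
toy_embeddings = {
    "how": [0.1, 0.2, 0.3, 0.4],
    "many": [0.0, 0.1, 0.0, 0.1],
    "bears": [0.5, 0.5, 0.5, 0.5],
    "2": [0.9, 0.0, 0.0, 0.0],
}

# One row per word; stacking such matrices over all records gives c'.
c_i = encode_record("how many bears", "2", toy_embeddings, dim=4)
```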

[0052] Step 203: For each historical dialogue record, the robot performs feature fusion on the visual feature information and the dialogue feature information based on a multi-attention mechanism fusion model, to determine a first representation of the dialogue feature information relative to the visual feature information and a second representation of the visual feature information relative to the dialogue feature information.

[0053] Step 204: For each historical dialogue record, the robot extracts features from the first and second representations using a question generation model to generate video candidate questions corresponding to the video information.

[0054] In one optional implementation, visual feature information and dialogue feature information can be used as input parameters to the multi-attention mechanism fusion model, so that the multi-attention mechanism fusion model outputs the first representation of the dialogue feature information on the visual feature information, and the second representation of the visual feature information on the dialogue feature information.

[0055] As an example, the fusion model of the multi-attention mechanism can determine the first and second representations through the function Attention(I,O,F) associated with the fusion of multimodal information, or through other functions, which are not specifically limited here. This application only uses Attention(I,O,F) as an example for illustrative purposes. For the input matrices I and O and the integer F∈{0,1,2}, the attention matrix Attention(I,O,F) is calculated using the following Formula 3:

[0056]

[0057] Where Concat() represents the matrix concatenation function, softmax() represents the Softmax (normalized exponential) function, LayerNorm() represents the layer normalization function, $W^O$, $W_i^Q$, $W_i^K$, $W_i^V$, $W_5$ and $W_6$ are trainable weight matrices, and $b_5$ and $b_6$ are trainable bias matrices.

[0058] Specifically, Formula 3 includes three different MultiHead computation expressions: when flag = 0, MultiHead calculates the natural language representation based on the attention encoding of visual information; when flag = 1, MultiHead calculates the visual representation based on the attention encoding of natural language information; and when flag = 2, MultiHead calculates the fusion expression of the above two modalities.

[0059] Specifically, Attention(I,O,F) includes three types, where F∈{0,1,2}. When F=0, the information of I is updated and the information of O is incorporated into it; when F=1, the information of O is updated and the information of I is incorporated into it; when F=2, the information of both I and O is updated and the two types of information are merged.
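The role of the flag F can be illustrated with a deliberately simplified sketch. This is not Formula 3, which is not reproduced in this text: it uses generic single-head scaled dot-product attention and omits the multi-head projections, layer normalization, and trainable weights named in paragraph [0057]. The function names, and the choice to concatenate the rows of both updates when F=2, are assumptions for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Replace each query row by a softmax-weighted average of the context rows."""
    scale = math.sqrt(len(queries[0]))
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        updated.append([sum(w * c[d] for w, c in zip(weights, context))
                        for d in range(len(context[0]))])
    return updated


def attention(i_mat, o_mat, flag):
    """flag 0: update I using O; flag 1: update O using I; flag 2: both, merged."""
    if flag == 0:
        return cross_attend(i_mat, o_mat)
    if flag == 1:
        return cross_attend(o_mat, i_mat)
    if flag == 2:
        # Assumed merge: stack the rows of both updated matrices.
        return cross_attend(i_mat, o_mat) + cross_attend(o_mat, i_mat)
    raise ValueError("flag must be 0, 1 or 2")


i_mat = [[1.0, 0.0], [0.0, 1.0]]   # toy stand-in for one modality
o_mat = [[1.0, 1.0]]               # toy stand-in for the other modality
updated_i = attention(i_mat, o_mat, 0)
updated_o = attention(i_mat, o_mat, 1)
merged = attention(i_mat, o_mat, 2)
```

Note that the output for F=0 has one row per row of I, and the output for F=1 has one row per row of O, which matches the "updated" matrix in each case.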

[0060] In this embodiment, after the robot extracts the feature information of video information and historical dialogue records, it can fully integrate the visual feature information of video information with the dialogue feature information of historical dialogue records through a multi-attention mechanism fusion model. This can better achieve the fusion of multi-modal information, enabling the decoder to obtain the content contained in the video information and historical dialogue records more completely, thus making the credibility of the obtained set of video candidate questions higher.

[0061] Furthermore, it should be noted that when there are multiple historical dialogue records, there may be multiple video candidate questions generated. In order to improve the reliability and accuracy of the video candidate questions, a question-answering model can be used to answer the video candidate questions and determine the answers corresponding to the video candidate questions. For N historical dialogue records, the text similarity between the video candidate questions and the answers corresponding to the video candidate questions is determined, and the video candidate questions with text similarity higher than a preset threshold are used as the target video questions.
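The threshold-based selection described above can be sketched as follows. The source does not specify how text similarity is computed, so Jaccard overlap of lowercase whitespace tokens is used here purely as an assumed stand-in; the function names and the example threshold of 0.3 are likewise hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Token-set overlap between two strings (an assumed similarity measure)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_target_questions(candidates, answers, threshold):
    """Keep candidate questions whose similarity to their answer exceeds the threshold."""
    return [q for q, a in zip(candidates, answers)
            if jaccard_similarity(q, a) > threshold]


candidates = ["what color is the bear", "how many bears are there"]
answers = ["the bear is red", "blue sky"]
targets = select_target_questions(candidates, answers, threshold=0.3)
```

In this toy input the first pair shares three of six distinct tokens (similarity 0.5) and is kept, while the second pair shares none and is discarded.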

[0062] In one alternative approach, the robot can input video candidate questions into a question-answering model and output the answers corresponding to the video candidate questions.

[0063] It should be noted that the robot can input a set of candidate video questions, video information, and historical dialogue records into a video question-answering model for feature extraction to determine the answers corresponding to the candidate video questions. Then, the robot determines the credibility of the candidate video questions and their corresponding answers. Finally, video questions with credibility greater than a preset threshold and their corresponding answers are selected as target video questions. Using the candidate video questions and their highly credible answers as target video questions improves the accuracy of video question answering, and subsequent video question judgments based on the target video questions yield more reliable results.

[0064] In this embodiment, a multi-attention mechanism fusion model can fully integrate the feature information of video information with the feature information of historical dialogue records, making the determined video candidate questions more relevant to the user's needs and providing a better application experience. This multi-attention mechanism fusion model can better achieve multi-modal information fusion, allowing the decoder to obtain a more complete picture of the video information and the content contained in the dialogue question-and-answer history, resulting in a more reliable set of video candidate questions. Furthermore, the answers corresponding to the video candidate question set are obtained based on the set of video candidate questions, and the target video question is determined based on the set of video candidate questions and their corresponding answers. This application also incorporates the answer to the question into the evaluation of the candidate questions, fully considering the impact of the generated question's answer on the video question, thus making the video question more coherent and complete.

[0065] Next, the solution of this application is illustrated with the schematic diagram of the network structure of the video question generation method shown in Figure 3. Figure 3 includes: a video feature information extraction module, a historical dialogue record feature extraction module, a cross-coding initialization module, a cross-coding update module, an attention calculation module, a candidate question generation module, a candidate question answer generation module, and a candidate question selection module.

[0066] It should be noted that, first, the video feature information extraction module extracts features from the video information, as described above, to obtain the matrix $v_f$ used to represent the feature information of the video frames and the matrix $v_s$ used to represent the feature information of the video segments. The historical dialogue record feature extraction module extracts features from the n rounds of dialogue records (question Q1-answer A1, question Q2-answer A2, question Q3-answer A3, ..., question Qn-answer An) to obtain the feature information c' of the historical dialogue records (which can be determined with reference to the above description, and will not be elaborated here).

[0067] When the robot determines a video question, it can first initialize a counter cnt according to the current historical dialogue records. If there are n historical dialogue records, cnt is initialized to n.

[0068] Subsequently, the cross-coding initialization module can refer to Formula 3 above to perform feature fusion on the feature information c' of the historical dialogue records, to obtain the dialogue record cross-coding matrix $O_f$ associated with the video frames and the dialogue record cross-coding matrix $O_s$ associated with the video segments. For the specific implementation, refer to the following formula: $O_f = O_s = \mathrm{Attention}(c', v, F),\ F \in \{0, 1, 2\}$

[0069] Furthermore, based on the matrix $v_f$ used to represent the feature information of the video frames, the matrix $v_s$ used to represent the feature information of the video segments, the dialogue record cross-coding matrix $O_f$ associated with the video frames, and the dialogue record cross-coding matrix $O_s$ associated with the video segments, the cross-coding update module can use Formula 3 above to update $O_f$ and $O_s$ separately, updating $O_f$ to $O_{f(update)}$ and $O_s$ to $O_{s(update)}$. For the specific execution, refer to Formula 4 below:

[0070]

[0071] Here, Concat() represents the matrix concatenation function, and the function MultiHeadAttention() was defined in Formula 3. $W_7$ and $W_8$ are trainable weight matrices, and $b_7$ and $b_8$ are trainable bias matrices. It should be noted that the update process of $O_f$ and $O_s$ can be viewed as a retrospective review of the historical dialogue records. By repeatedly fusing the historical dialogue information of each round with the video visual feature information, a latent state g is generated. $F_{f1}$ and $F_{s1}$ are updates of the visual feature information, and $F_{f2}$ and $F_{s2}$ are updates of the text feature information. $g_1$ and $g_2$ are the hidden states obtained through an attention mechanism after a given round of updates.

[0072] Next, cnt is decremented by 1. Assuming the encoding matrix of the (n-1)th dialogue record is $c_{cnt}'$, then based on the current dialogue record cross-coding matrix $O_f$ related to the video frames and the current dialogue record cross-coding matrix $O_s$ related to the video segments, the attention calculation module uses Formula 5 below to calculate, for the (n-1)th dialogue record, the dialogue record cross-coding matrix $A_f$ related to the video frames and the dialogue record cross-coding matrix $A_s$ related to the video segments:

[0073]

[0074] Based on the encoding matrix $c_{cnt}'$ of the (n-1)th dialogue record, together with the cross-coding matrices $A_f$ (related to the video frames) and $A_s$ (related to the video segments) computed for the (n-1)th dialogue record, the cross-coding update module can update $O_f$ and $O_s$ using Formula 6 below:

[0075]

[0076] Where σ() represents the sigmoid function, Concat() represents the matrix concatenation function, ⊙ represents element-wise multiplication of matrices, $W_9$, $W_{10}$, $W_{11}$, and $W_{12}$ are all trainable weight matrices, and $b_9$, $b_{10}$, $b_{11}$, and $b_{12}$ are all trainable bias matrices.

[0077] The candidate question generation module can determine the candidate question $cq_{cnt}$ for the (n-1)th dialogue record using Formula 7 below:

[0078] $cq_{cnt} = \mathrm{softmax}\big(\mathrm{Attention}(c_{cnt}', O_f, 1) \odot \mathrm{Attention}(c_{cnt}', O_s, 1)\big)$   Formula 7

[0079] Here, ⊙ represents the pairwise multiplication of corresponding elements in the matrix, and softmax() represents the Softmax function.
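As a non-limiting illustration, the combination step of Formula 7 can be sketched as follows. The two attention outputs are treated here as already-computed flat lists of scores, which is a simplifying assumption (Formula 3, which defines `Attention()`, is not reproduced here); the function names `softmax` and `candidate_scores` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_scores(att_frame, att_segment):
    """softmax(att_frame ⊙ att_segment), mirroring the shape of Formula 7.

    att_frame / att_segment stand in for Attention(c_cnt', O_f, 1) and
    Attention(c_cnt', O_s, 1); flattening them to lists is an assumption.
    """
    if len(att_frame) != len(att_segment):
        raise ValueError("attention outputs must have the same length")
    product = [a * b for a, b in zip(att_frame, att_segment)]  # element-wise ⊙
    return softmax(product)


# Hypothetical attention outputs, for illustration only.
dist = candidate_scores([0.2, 1.0, 0.5, 0.1], [0.3, 2.0, 0.5, 0.4])
print(dist)
```

Because the product is taken before the softmax, a position receives a high score only when both the frame-related and the segment-related attention outputs are large there.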

[0080] The robot then continues to decrement the counter cnt by 1 and repeats the above steps until cnt equals 1, at which point the loop exits, yielding a set CQ = ($cq_1$, $cq_2$, ..., $cq_n$) of n candidate questions, where n equals the total number of historical dialogue records.
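The counter-driven traversal can be sketched as below. This is a minimal sketch under stated assumptions: the per-record work of Formulas 5 to 7 is hidden behind a hypothetical callable `generate_one`, and the loop is assumed to visit every record once, from the newest to the oldest, so that n records yield n candidates.

```python
def generate_candidate_questions(dialogue_encodings, generate_one):
    """Produce one candidate per historical dialogue record.

    dialogue_encodings: list of per-record encodings c_1', ..., c_n'.
    generate_one: hypothetical stand-in for Formulas 5-7 applied to one record.
    """
    n = len(dialogue_encodings)
    candidates = [None] * n
    cnt = n  # counter initialised to the number of records
    while cnt >= 1:
        # Records are visited from the most recent (index n-1) back to the first.
        candidates[cnt - 1] = generate_one(dialogue_encodings[cnt - 1])
        cnt -= 1
    return candidates


# Usage with a toy generator that just upper-cases its input.
print(generate_candidate_questions(["a", "b", "c"], str.upper))
```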

[0081] The candidate question answer generation module can generate an answer set CA = ($ca_1$, $ca_2$, ..., $ca_n$) corresponding to the candidate question set using a pre-trained HACRN video question answering model, where the i-th answer $ca_i$ corresponds to the i-th candidate question $cq_i$. The HACRN video question answering model is trained on a set of video information and a set of historical dialogue records corresponding to each video.

[0082] Based on the encoding matrix of the entire dialogue history c' = ($c_1'$, $c_2'$, ..., $c_n'$) and the answer $ca_i$ generated for the i-th candidate question $cq_i$, the candidate question selection module can calculate the conditional probability $p(cq_i \mid c', ca_i)$ of the i-th candidate question $cq_i$ using Formula 8 below:

[0083]

[0084] $e_i = \mathrm{softmax}\big(\tanh\big(W_{13}\tanh(W_{14}\,cq_i) + W_{15}\tanh(W_{16}\,\mathrm{BiLSTM}(c'))\big)\big)$   Formula 8

[0085] Where softmax() represents the Softmax function, BiLSTM() represents a bidirectional long short-term memory module, and $W_{13}$, $W_{14}$, $W_{15}$, and $W_{16}$ are all trainable weight matrices.

[0086] The candidate question selection module can select the candidate question $cq_i$ with the largest conditional probability $p(cq_i \mid c', ca_i)$ as the generated target video question.
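The selection step amounts to an argmax over the candidates. A minimal sketch follows, assuming the conditional probabilities have already been computed elsewhere; the function name and the sample values are hypothetical.

```python
def select_target_question(candidates, probabilities):
    """Return the candidate whose conditional probability is largest.

    probabilities[i] stands in for p(cq_i | c', ca_i); computing it is outside
    this sketch. Ties resolve to the earliest candidate.
    """
    if not candidates or len(candidates) != len(probabilities):
        raise ValueError("need one probability per candidate, and at least one candidate")
    best = max(range(len(candidates)), key=lambda i: probabilities[i])
    return candidates[best]


# Hypothetical candidates and probabilities, for illustration only.
target = select_target_question(
    ["What is the man holding?", "Where is the dog?", "Who enters the room?"],
    [0.21, 0.64, 0.15],
)
print(target)
```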

[0087] It should be noted that with the development of intelligent chatbots, in addition to daily command-based dialogues and casual conversations, more dialogue scenarios are being introduced into chatbot systems. Using the video question generation method provided in this application, a chatbot can converse with a person based on a video, such as accompanying a user while watching a film or television show and engaging in discussion during the viewing; or, in early childhood education robots, asking questions and engaging in dialogue with the user based on educational videos.

[0088] Tables 1 and 2 below compare different methods of determining video questions on different datasets. Table 1 shows the test results of different model combinations on the YouTube-Clips dataset, and Table 2 shows the test results of different model combinations on the TACoS-MultiLevel dataset. It should be noted that "no reinforcement learning" means that the answer set corresponding to the video question set was not considered when generating the video question, and "no cross-coding" means that the information fusion of video information and historical dialogue record information was not performed using Formula 3 above. For every evaluation metric and on both datasets, the method of this application achieves the best test result. For example, in Table 1, for the BLEU-1 metric, the method without reinforcement learning scores 0.521, the method without cross-coding scores 0.301, and the method of this application scores 0.548.

[0089] Table 1

[0090]

| Evaluation metric | BLEU-1 | BLEU-2 | ROUGE | METEOR |
| --- | --- | --- | --- | --- |
| Method without reinforcement learning | 0.521 | 0.356 | 0.565 | 0.201 |
| Method without cross-coding | 0.301 | 0.213 | 0.287 | 0.178 |
| Method of this application | 0.548 | 0.386 | 0.595 | 0.242 |

[0091] Table 2

[0092]

| Evaluation metric | BLEU-1 | BLEU-2 | ROUGE | METEOR |
| --- | --- | --- | --- | --- |
| Method without reinforcement learning | 0.697 | 0.587 | 0.703 | 0.334 |
| Method without cross-coding | 0.377 | 0.243 | 0.324 | 0.201 |
| Method of this application | 0.743 | 0.633 | 0.746 | 0.375 |
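To make the comparison concrete, the absolute gains of the method of this application over each ablation can be computed directly from the Table 1 values. The sketch below only restates the reported numbers; the dictionary keys are hypothetical labels, and the second row is read as the method without cross-coding, as the description of the tables states.

```python
# Values copied from Table 1 (YouTube-Clips); keys are illustrative labels.
TABLE_1 = {
    "no_reinforcement_learning": {"BLEU-1": 0.521, "BLEU-2": 0.356, "ROUGE": 0.565, "METEOR": 0.201},
    "no_cross_coding": {"BLEU-1": 0.301, "BLEU-2": 0.213, "ROUGE": 0.287, "METEOR": 0.178},
    "this_application": {"BLEU-1": 0.548, "BLEU-2": 0.386, "ROUGE": 0.595, "METEOR": 0.242},
}


def absolute_gains(table, baseline):
    """Per-metric difference between the full method and one ablation."""
    full = table["this_application"]
    return {metric: round(full[metric] - table[baseline][metric], 3) for metric in full}


gain_vs_no_rl = absolute_gains(TABLE_1, "no_reinforcement_learning")
gain_vs_no_cross = absolute_gains(TABLE_1, "no_cross_coding")
print(gain_vs_no_rl)
print(gain_vs_no_cross)
```

On these figures, removing cross-coding costs far more than removing the answer-aware selection step, which suggests that the fusion of video and dialogue information accounts for most of the reported improvement.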

[0093] Therefore, it can be seen that this application can fully consider the role that the answers to the generated questions can play in the continuation of the video dialogue question-and-answer session, making the video dialogue question-and-answer session smoother and more coherent. At the same time, the generated questions can combine video information and the content of the dialogue question-and-answer history to better achieve multi-modal information fusion, thereby effectively improving the effect of video question generation.

[0094] Based on the same concept, as shown in Figure 4, an embodiment of this application provides a video question generation device, including: an acquisition unit 41, a first information extraction unit 42, a second information extraction unit 43, and a candidate question generation unit 44.

[0095] Specifically, the acquisition unit 41 is used to acquire video information and N historical dialogue records related to the video information, wherein N is greater than or equal to 1; the first information extraction unit 42 is used to extract visual feature information from the video information using a first neural network model, and to extract dialogue feature information from each historical dialogue record using a second neural network model; the second information extraction unit 43 is used to, for each historical dialogue record, perform feature fusion on the visual feature information and the dialogue feature information based on a multi-attention mechanism fusion model, to determine the first representation of the dialogue feature information on the visual feature information and the second representation of the visual feature information on the dialogue feature information; and the candidate question generation unit 44 is used to, for each historical dialogue record, perform feature extraction on the first representation and the second representation through a question generation model, to generate video candidate questions corresponding to the video information.

[0096] In one possible implementation, the device further includes: a target video question determination module, used to answer video candidate questions through a question-answering model to determine the answer corresponding to the video candidate questions; and, for N historical dialogue records, to determine the text similarity between the video candidate questions and the answers corresponding to the video candidate questions, and to use video candidate questions with text similarity higher than a preset threshold as target video questions.
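The threshold filter can be sketched as follows. The text does not specify how text similarity is measured, so this sketch substitutes a simple token-level Jaccard overlap purely as an assumption; the function names and sample strings are hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Token-set overlap in [0, 1]; an assumed stand-in for the text similarity."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_target_questions(candidates, answers, threshold):
    """Keep each candidate whose similarity to its own answer exceeds the threshold."""
    return [
        question
        for question, answer in zip(candidates, answers)
        if jaccard_similarity(question, answer) > threshold
    ]


# Hypothetical candidates and answers, for illustration only.
kept = select_target_questions(
    ["what color is the car", "who is singing"],
    ["the car is red", "a dog barks"],
    threshold=0.3,
)
print(kept)
```

Any other similarity measure with a comparable range could be substituted without changing the filtering logic.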

[0097] In one possible implementation, the first neural network model includes: a video frame feature extraction model and a video segment feature extraction model; the first information extraction unit 42 is used to: extract video frame feature information related to each frame image from each frame image of the video information using the video frame feature extraction model; and extract video segment feature information related to the preset frame image from preset frame images of the video information using the video segment feature extraction model; the number of preset frames is greater than 1.
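The two levels of visual feature extraction can be sketched as below. The sketch assumes that segments are consecutive, non-overlapping groups of a preset number of frames and that the last group may be shorter; neither detail is stated in the text. `frame_model` and `segment_model` are hypothetical callables standing in for the two extraction models.

```python
def split_into_segments(frames, frames_per_segment):
    """Group frames into consecutive windows of a preset size greater than 1."""
    if frames_per_segment <= 1:
        raise ValueError("the preset number of frames must be greater than 1")
    return [
        frames[start:start + frames_per_segment]
        for start in range(0, len(frames), frames_per_segment)
    ]


def extract_visual_features(frames, frames_per_segment, frame_model, segment_model):
    """Return (per-frame features, per-segment features)."""
    frame_features = [frame_model(frame) for frame in frames]
    segment_features = [
        segment_model(segment)
        for segment in split_into_segments(frames, frames_per_segment)
    ]
    return frame_features, segment_features


# Toy "frames" are integers; the toy models double a frame and sum a segment.
frame_feats, segment_feats = extract_visual_features(
    list(range(5)), 2, frame_model=lambda f: f * 2, segment_model=sum
)
print(frame_feats, segment_feats)
```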

[0098] In one possible implementation, the second neural network model is a dialogue feature extraction model; the first information extraction unit 42 is used to: by referring to the positional encoding of words in the historical dialogue record information, extract features from each historical dialogue record using the dialogue feature extraction model, to determine the dialogue feature information.
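The text does not say which positional encoding is used. As one common possibility, and only as an assumption, the sketch below uses the sinusoidal encoding familiar from Transformer models and adds it to hypothetical word embeddings before feature extraction.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim):
    """Rows of interleaved sin/cos values; dim must be even."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


def add_positions(word_embeddings):
    """Add the position row to each word embedding (one embedding per word)."""
    dim = len(word_embeddings[0])
    encoding = sinusoidal_positional_encoding(len(word_embeddings), dim)
    return [
        [w + p for w, p in zip(word, position)]
        for word, position in zip(word_embeddings, encoding)
    ]


# Three hypothetical 4-dimensional word embeddings, all zeros for clarity.
encoded = add_positions([[0.0] * 4 for _ in range(3)])
print(encoded[0])
```

With zero embeddings the output is the encoding itself, which makes it easy to see that each word position receives a distinct vector.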

[0099] In one possible implementation, the second information extraction unit 43 is used to: input visual feature information and dialogue feature information as input parameters to the fusion model of the multi-attention mechanism, so that the fusion model of the multi-attention mechanism outputs a first representation of the dialogue feature information on the visual feature information and a second representation of the visual feature information on the dialogue feature information.
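The two outputs of the fusion model can be illustrated with attention applied in both directions. This is a simplified sketch, not the fusion model itself: it uses a single attention head without learned projections, and the assignment of the "first" and "second" representations to the two directions is an assumption. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head scaled dot-product attention; keys double as values."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


def fuse(visual, dialogue):
    """Return (dialogue summarised per visual position, visual summarised per word)."""
    first = attend(visual, dialogue)   # assumed: dialogue features on visual features
    second = attend(dialogue, visual)  # assumed: visual features on dialogue features
    return first, second


# Hypothetical 2-dimensional features: three visual positions, two words.
first, second = fuse([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(first)
print(second)
```

Each representation has one row per query position, so the first output follows the length of the visual sequence and the second follows the length of the dialogue.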

[0100] In one possible implementation, the target video question determination module is used to: input video candidate questions into the question answering model and output the answers corresponding to the video candidate questions.

[0101] Based on the same concept, as shown in Figure 5, this application provides a video question generation apparatus 500. Exemplarily, the video question generation apparatus 500 may be a chip or a chip system. Optionally, in the embodiments of this application, the chip system may be composed of chips, or may include chips and other discrete devices.

[0102] The video question generation apparatus 500 may include at least one processor 510, and may also include at least one memory 520 for storing computer programs, program instructions, and/or data. The memory 520 and the processor 510 are coupled. The coupling in this embodiment is an indirect coupling or communication connection between devices, units, or modules, and may be electrical, mechanical, or of another form, used for information exchange between devices, units, or modules. The processor 510 may operate in conjunction with the memory 520. The processor 510 may execute the computer program stored in the memory 520. Optionally, at least one of the at least one memory 520 may be included in the processor 510.

[0103] The video question generation apparatus 500 may also include a transceiver 530, through which the video question generation apparatus 500 can interact with other devices. The transceiver 530 can be a circuit, a bus, a transceiver, or any other device that can be used for information interaction.

[0104] In one possible implementation, the video question generation apparatus 500 can be applied to the aforementioned network device. Specifically, the video question generation apparatus 500 can be the aforementioned network device itself, or it can be any device capable of supporting the aforementioned network device in implementing any of the above embodiments. The memory 520 stores the computer programs, program instructions, and/or data necessary for implementing the functions of the network device in any of the above embodiments. The processor 510 can execute the computer program stored in the memory 520 to complete the method in any of the above embodiments.

[0105] This application embodiment does not limit the specific connection medium between the transceiver 530, processor 510, and memory 520. In Figure 5 of this application embodiment, the memory 520, processor 510, and transceiver 530 are connected via a bus, which is shown as a bold line in Figure 5; the connections between other components are for illustrative purposes only and are not limiting. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by a single thick line in Figure 5, but this does not mean that there is only one bus or one type of bus.

[0106] In the embodiments of this application, the processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.

[0107] In the embodiments of this application, the memory can be non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or it can be volatile memory, such as random-access memory (RAM). The memory can also be any other medium capable of carrying or storing desired program code in the form of instructions or data structures, and accessible by a computer, but is not limited thereto. The memory in the embodiments of this application can also be a circuit or any other device capable of implementing storage functions, used to store computer programs, program instructions, and / or data.

[0108] Based on the above embodiments, this application also provides a readable storage medium storing instructions that, when executed, cause the method performed by the video question generation apparatus in any of the above embodiments to be implemented. The readable storage medium may include various media capable of storing program code, such as a USB flash drive, portable hard drive, read-only memory, random access memory, magnetic disk, or optical disk.

[0109] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0110] This application is described with reference to flowchart illustrations and/or block diagrams of the methods, apparatus (systems), and computer program products according to this application. It should be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0111] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0112] These computer program instructions can also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Claims

1. A video question generation method, characterized in that it comprises: acquiring video information and N historical dialogue records related to the video information, wherein N is greater than or equal to 1; extracting visual feature information from the video information using a first neural network model, and extracting dialogue feature information from each of the historical dialogue records using a second neural network model; for each of the historical dialogue records, performing feature fusion on the visual feature information and the dialogue feature information based on a multi-attention mechanism fusion model, to determine a first representation of the dialogue feature information on the visual feature information and a second representation of the visual feature information on the dialogue feature information; and, for each of the historical dialogue records, performing feature extraction on the first representation and the second representation using a question generation model, to generate video candidate questions corresponding to the video information.

2. The method according to claim 1, characterized in that it further comprises: answering the video candidate questions using a question answering model to determine the answers corresponding to the video candidate questions; and, for the N historical dialogue records, determining the text similarity between each video candidate question and the answer corresponding to that video candidate question, and taking the video candidate questions whose text similarity is higher than a preset threshold as target video questions.

3. The method according to claim 1, characterized in that the first neural network model comprises a video frame feature extraction model and a video segment feature extraction model; and extracting visual feature information from the video information using the first neural network model comprises: extracting, using the video frame feature extraction model, video frame feature information related to each frame image from each frame image of the video information; and extracting, using the video segment feature extraction model, video segment feature information related to a preset number of frame images from the preset number of frame images of the video information, wherein the preset number of frames is greater than 1.

4. The method according to any one of claims 1-3, characterized in that the second neural network model is a dialogue feature extraction model; and extracting dialogue feature information from each of the historical dialogue records using the second neural network model comprises: by referring to the positional encoding of words in the historical dialogue record information, extracting features from each of the historical dialogue records using the dialogue feature extraction model, to determine the dialogue feature information.

5. The method according to any one of claims 1-3, characterized in that performing feature fusion on the visual feature information and the dialogue feature information based on the multi-attention mechanism fusion model, to determine the first representation of the dialogue feature information on the visual feature information and the second representation of the visual feature information on the dialogue feature information, comprises: inputting the visual feature information and the dialogue feature information as input parameters to the multi-attention mechanism fusion model, so that the multi-attention mechanism fusion model outputs the first representation of the dialogue feature information on the visual feature information and the second representation of the visual feature information on the dialogue feature information.

6. The method according to claim 2, characterized in that answering the video candidate questions using the question answering model to determine the answers corresponding to the video candidate questions comprises: inputting the video candidate questions into the question answering model, and outputting the answers corresponding to the video candidate questions.

7. A video question generation device, characterized in that it comprises: an acquisition unit, used to acquire video information and N historical dialogue records related to the video information, wherein N is greater than or equal to 1; a first information extraction unit, used to extract visual feature information from the video information using a first neural network model, and to extract dialogue feature information from each of the historical dialogue records using a second neural network model; a second information extraction unit, used to, for each of the historical dialogue records, perform feature fusion on the visual feature information and the dialogue feature information based on a multi-attention mechanism fusion model, to determine a first representation of the dialogue feature information on the visual feature information and a second representation of the visual feature information on the dialogue feature information; and a candidate question generation unit, used to, for each of the historical dialogue records, perform feature extraction on the first representation and the second representation using a question generation model, to generate video candidate questions corresponding to the video information.

8. The device according to claim 7, characterized in that it further comprises: a target video question determination module, used to answer the video candidate questions using a question answering model to determine the answers corresponding to the video candidate questions; and, for the N historical dialogue records, to determine the text similarity between each video candidate question and the answer corresponding to that video candidate question, and to take the video candidate questions whose text similarity is higher than a preset threshold as target video questions.

9. The device according to claim 7, characterized in that the first neural network model comprises a video frame feature extraction model and a video segment feature extraction model; and the first information extraction unit is used to: extract, using the video frame feature extraction model, video frame feature information related to each frame image from each frame image of the video information; and extract, using the video segment feature extraction model, video segment feature information related to a preset number of frame images from the preset number of frame images of the video information, wherein the preset number of frames is greater than 1.

10. The device according to any one of claims 7-9, characterized in that the second neural network model is a dialogue feature extraction model; and the first information extraction unit is used to: by referring to the positional encoding of words in the historical dialogue record information, extract features from each of the historical dialogue records using the dialogue feature extraction model, to determine the dialogue feature information.

11. The device according to any one of claims 7-9, characterized in that the second information extraction unit is used to: input the visual feature information and the dialogue feature information as input parameters to the multi-attention mechanism fusion model, so that the multi-attention mechanism fusion model outputs the first representation of the dialogue feature information on the visual feature information and the second representation of the visual feature information on the dialogue feature information.

12. The device according to claim 8, characterized in that the target video question determination module is used to: input the video candidate questions into the question answering model, and output the answers corresponding to the video candidate questions.

13. A video question generation device, characterized in that it comprises a processor and a memory; the memory stores a computer program; and the processor is configured to execute the computer program stored in the memory, such that the method according to any one of claims 1-6 is performed.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed on a computer, cause the computer to perform the method according to any one of claims 1-6.