A traffic video question-answering method and device based on cross-modal human-like step-by-step reasoning
By employing a cross-modal human-like stepwise reasoning method and utilizing temporal consistency correction and dual-correlation denoising modules, the problems of false associations and information overload in traffic video question answering are solved, achieving high-precision traffic video question answering and supporting the reliable operation of intelligent transportation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHANGAN UNIV
- Filing Date
- 2026-02-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing traffic video question answering models are prone to generating false cross-modal associations under the influence of mixed visual objects and interference from weather, occlusion, etc., and ineffective knowledge enhancement methods lead to information overload and misleading reasoning, making it difficult to meet the high-precision requirements of intelligent transportation.
A cross-modal human-like stepwise reasoning method is adopted. Through a temporal consistency correction module and a dual correlation denoising module, the temporal correlation of video frames is dynamically analyzed, interference is filtered, and relevant traffic regulations are accurately selected. Cross-modal fusion and reasoning are performed by combining traffic scenarios and problem semantics.
It achieves high-precision question answering of traffic videos, automatically filters out interference, accurately selects key knowledge, ensures the accuracy of decision-making, solves the problems of cross-modal false associations and domain knowledge false relevance, and provides reliable intelligent transportation support.
Figure CN122244748A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of semantic understanding and reasoning technology for traffic scenes in intelligent transportation, and in particular to a traffic video question answering method and device with cross-modal human-like stepwise reasoning.
Background Technology
[0002] Traffic video question answering technology is a core supporting capability of intelligent transportation. It is widely used in key scenarios such as road monitoring and early warning, autonomous driving assisted decision-making, and traffic incident tracing and analysis. It can help managers quickly locate the cause of accidents, assist vehicles in making safe driving decisions, and support the determination of responsibility for traffic incidents.
[0003] However, existing models face significant bottlenecks in their application in the transportation sector: Traffic videos contain a mix of visual objects and are susceptible to interference from weather, occlusion, and other factors, which can cause models to capture false cross-modal associations and misjudge inference logic. Meanwhile, existing knowledge enhancement methods fail to effectively filter noise, and irrelevant domain knowledge can easily lead to information overload and mislead the reasoning process.
[0004] These problems directly lead to risks such as missed judgments of violations, misjudgments of accidents, and misunderstandings of scenarios, making it difficult to meet the high-precision requirements of intelligent transportation.
[0005] Therefore, to solve the above problems in the existing technology, those skilled in the art urgently need a traffic video question answering method that simultaneously addresses cross-modal false associations and false associations with domain knowledge, and thereby provides a technical guarantee for the reliable operation of intelligent transportation devices.
Summary of the Invention
[0006] In view of this, the present invention provides a traffic video question answering method and apparatus for cross-modal human-like stepwise reasoning.
[0007] To solve the above-mentioned technical problems, the present invention adopts the following technical solution: a cross-modal human-like stepwise reasoning method for traffic video question answering, including the following steps:
S100: Obtain the traffic video to be processed, the corresponding question text, the candidate answer set, and the traffic knowledge graph;
S200: Extract the visual appearance features and motion features of the traffic video, the text features of the question text, and the triplet features of the traffic knowledge graph, and transform each feature into a unified dimensional space to obtain a multimodal initial feature set;
S300: Input the visual appearance features and motion features in the multimodal initial feature set to the temporal consistency correction module to obtain the temporally corrected visual appearance features and motion features;
S400: Fuse the temporally corrected visual appearance features and motion features to obtain visual scene features; input the visual scene features, together with the triplet features and text features in the multimodal initial feature set, into the dual correlation denoising module to obtain the denoised key knowledge features;
S500: Perform cross-modal fusion and reasoning based on the temporally corrected visual features, the text features, and the key knowledge features, and output the question answering result.
[0008] Preferably, step S200 specifically includes the following steps:
a1: Divide the traffic video into N equal-length segments, each segment containing T frames of images, wherein... L represents the total number of frames in the traffic video;
b1: Use the pre-trained Swin-L model to extract the visual appearance features of each image frame, and use the pre-trained Video Swin-B model to extract the motion features of each segment;
c1: Use a pre-trained BERT model to map the question text into text features;
d1: Encode the traffic knowledge graph to obtain triplet features;
e1: Use a single-layer fully connected network to transform the visual appearance features, motion features, text features, and triplet features into the same dimensional space, forming a multimodal initial feature set.
[0009] Preferably, step S300 specifically includes the following steps:
a2: Calculate the initial correlation score between the text features and the visual appearance features using cosine similarity, and normalize it to the [0,1] interval;
b2: Generate a temporal weight matrix based on the inter-frame time distance and an exponential decay mechanism;
c2: Combine the initial correlation score and the temporal weight matrix to calculate the temporal consistency correction score, select the frames and segments that best match the semantics of the question, and obtain the temporally corrected visual appearance features and motion features.
[0010] Preferably, step S400 specifically includes the following steps:
a3: Fuse the temporally corrected visual appearance features and motion features to obtain the visual scene features;
b3: Calculate the correlation between the visual scene features and the triplet features through matrix multiplication, and select the Top-K scene-related triplet features;
c3: Use a three-layer fully connected network to verify the semantic consistency between the Top-K scene-related triplet features and the question text features, filter out low-consistency features, and obtain the denoised key knowledge features.
[0011] Preferably, step S500 specifically includes the following steps:
a5: Perform cross-attention calculation between the text features and the temporally corrected visual appearance features and motion features to obtain the question-visual fusion features;
b5: Perform max pooling on the key knowledge features, concatenate them with the question-visual fusion features, and then obtain knowledge-enhanced multimodal features through linear mapping;
c5: Calculate the cosine similarity between the multimodal features and the features of each candidate answer, and select the candidate answer with the highest similarity as the question answering result.
[0012] Preferably, the method further includes a model training process, wherein the loss function used for training is the sum of the visual-question coarse-to-fine-grained contrastive loss and the InfoNCE loss for answer classification, and the visual-question contrastive loss adjusts the importance of appearance features and motion features through adaptive weight factors.
[0013] Preferably, before step S500, a model performance evaluation and screening step is also included:
00: Acquire test traffic videos, test question texts, and traffic knowledge graphs, and use the trained model to obtain test question answering results;
01: Evaluate model performance by comparing the test results with the actual answers;
02: Use the model whose performance meets the preset threshold as the final deployment model.
[0014] A cross-modal human-like stepwise reasoning traffic video question answering device includes an acquisition module, a feature extraction module, a temporal correction module, a knowledge denoising module, and a reasoning output module, all mounted on an electronic device, wherein:
The acquisition module is used to acquire the traffic video to be processed, the corresponding question text, the candidate answer set, and the traffic knowledge graph;
The feature extraction module is used to extract the visual appearance features and motion features of the traffic video, the text features of the question text, and the triplet features of the traffic knowledge graph, and to transform each feature into a unified dimensional space;
The temporal correction module is used to select visual features with real semantic associations through temporal consistency calculation and to output temporally corrected visual appearance features and motion features;
The knowledge denoising module is used to fuse the temporally corrected visual features to obtain visual scene features, and to select key knowledge features through scene relevance filtering and semantic consistency verification;
The reasoning output module is used to perform cross-modal fusion and reasoning based on the temporally corrected visual features, the text features, and the key knowledge features, and to output the question answering result.
[0015] Preferably, the temporal correction module includes an initial relevance score calculation unit, a temporal weight generation unit, and a correction score fusion unit, which are respectively used to calculate the initial relevance between text and visual features, generate a temporal weight matrix, and fuse them to obtain a temporal consistency correction score.
[0016] Preferably, the knowledge denoising module includes a scene feature fusion unit, a scene relevance filtering unit, and a semantic consistency verification unit, which are used to generate visual scene features, filter scene-related knowledge, and filter semantically inconsistent knowledge noise, respectively.
[0017] Compared with the prior art, the present invention achieves the following technical effects:
(1) The present invention designs a temporal consistency correction module that, by dynamically analyzing the temporal correlation of video frames, automatically filters out invalid frames caused by weather, occlusion, and other interference; at the same time, it accurately selects the most relevant traffic regulation knowledge by combining the traffic scene and the question, avoiding interference from redundant information and ensuring the accuracy of subsequent analysis and decision-making.
(2) The present invention proposes a cross-modal stepwise reasoning model that solves two key problems, cross-modal false association and domain knowledge false association, and simulates the human stepwise cognitive process of "screening effective information → retrieving key knowledge → comprehensive reasoning".
(3) The present invention proposes a dual correlation denoising module that adopts a dual mechanism of "scene correlation filtering + semantic consistency verification" to accurately select key knowledge related to the current visual scene and question semantics, avoiding interference from irrelevant knowledge noise.
Attached Figure Description
[0018] Figure 1 is a schematic diagram of the traffic video question answering process with cross-modal human-like stepwise reasoning according to the present invention; Figure 2 is an example diagram of traffic video and question text annotations according to the present invention; Figure 3 is a schematic diagram of the traffic video question answering device with cross-modal human-like stepwise reasoning according to the present invention.
Detailed Implementation
[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
Example 1:
[0020] As shown in Figure 1, this invention discloses a cross-modal human-like stepwise reasoning method for traffic video question answering, including:
S100: Obtain the traffic video to be processed, the corresponding question text, the candidate answer set, and the traffic knowledge graph;
S200: Extract visual appearance and motion features from the traffic video, text features from the question text, and triplet features from the traffic knowledge graph, and transform each feature into a unified dimensional space to obtain a multimodal initial feature set;
S300: Input the visual appearance features and text features from the multimodal initial feature set to the temporal consistency correction module to obtain the temporally corrected visual appearance features and motion features;
S400: Fuse the temporally corrected visual appearance features and motion features to obtain visual scene features; input the visual scene features, together with the triplet features and text features from the multimodal initial feature set, into the dual correlation denoising module to obtain the denoised key knowledge features;
S500: Perform cross-modal fusion and reasoning based on the temporally corrected visual features, the text features, and the key knowledge features, and output the question answering result.
[0021] As shown in Figure 2, traffic videos are first selected, and corresponding question texts are labeled for each traffic video to create a sample set of traffic videos. The sample set contains 10,080 traffic videos, and each traffic video corresponds to multiple question texts; it also includes a traffic knowledge graph.
[0022] This step inputs the traffic video and the question to be answered into the trained video reasoning question answering model, which outputs the probability distribution over the candidate answers for the traffic video.
[0023] The specific process is the same as that used to output the probability distribution of candidate answers during training. Finally, this step selects the candidate answer with the highest probability as the reasoning answer.
[0024] The multimodal initial feature extraction process includes:
Step a1: Divide the input traffic video into N equal-length segments according to the total number of frames L. The number of frames T in each segment is determined by the formula $T = \lfloor L / N \rfloor$, where $\lfloor \cdot \rfloor$ indicates a round-down operation.
[0025] When the total number of frames in the traffic video is L=128 and the number of segments is N=16, each segment contains T=8 frames of images.
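The segmentation in step a1 can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_segments` is hypothetical, and dropping any leftover frames at the end of the video is an assumption, since the text does not say how a remainder is handled.

```python
def split_into_segments(total_frames: int, num_segments: int) -> list[range]:
    """Split frame indices 0..L-1 into N equal-length segments of T = L // N frames."""
    if num_segments <= 0 or total_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    frames_per_segment = total_frames // num_segments  # round-down: T = floor(L / N)
    # Assumption: the L mod N leftover frames at the end of the video are dropped.
    return [
        range(s * frames_per_segment, (s + 1) * frames_per_segment)
        for s in range(num_segments)
    ]


# The example from the text: L = 128 frames and N = 16 segments give T = 8.
segments = split_into_segments(total_frames=128, num_segments=16)
print(len(segments), len(segments[0]))
```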
[0026] Step b1: Use a pre-trained Swin-L model to extract features from each image frame, obtaining frame-level visual appearance features $F_a$; use a pre-trained Video Swin-B model to extract motion features from each segment, obtaining segment-level motion features $F_m$.
Step c1: Encode the question text with a pre-trained BERT model, mapping the natural-language question to text features $F_q$.
Step d1: Input the triples into a knowledge graph embedding model to obtain the embedding representation of each triple, then feed the embedding representations into the encoding layer of the BERT model to generate triplet features $F_k$.
Step e1: Construct a single-layer fully connected network (FCN) that applies a channel-dimension transformation separately to the visual appearance features $F_a$, motion features $F_m$, text features $F_q$, and triplet features $F_k$, mapping all features to a unified feature space and forming the multimodal initial feature set $\{F_a, F_m, F_q, F_k\}$.
The specific implementation process of temporal consistency correction includes:
Step a2: First flatten the visual appearance features $F_a$, then calculate the initial correlation score between the text features $F_q$ and the flattened appearance features $F_a$ using cosine similarity, with the specific formula as follows:
$$S_{\mathrm{init}} = \mathrm{Norm}\big(\cos(F_q, F_a)\big)$$
[0027] where $\cos(\cdot,\cdot)$ represents the function for calculating cosine similarity, and $\mathrm{Norm}(\cdot)$ represents a normalization operation that maps the calculation result to the interval [0,1].
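As a rough illustration of step a2, the sketch below scores each frame against the question by cosine similarity and maps the result to [0,1]. The normalization used here, `(cos + 1) / 2`, is an assumption; the text only states that some normalization to [0,1] is applied. The function names and toy two-dimensional vectors are likewise hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def initial_relevance_scores(text_feature, frame_features):
    """One score in [0, 1] per frame.

    Assumption: cosine values in [-1, 1] are rescaled linearly with (c + 1) / 2.
    """
    return [(cosine_similarity(text_feature, f) + 1.0) / 2.0 for f in frame_features]


# Toy features: an aligned frame, an orthogonal frame, and an opposed frame.
scores = initial_relevance_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(scores)
```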
[0028] Step b2: Taking each frame in turn as the anchor frame i, set the temporal neighborhood window size u (default u=5) and calculate the temporal distance between the k-th frame within the neighborhood and the anchor frame i. The temporal weights are generated based on the exponential decay mechanism, and the specific formula is as follows:
[0029] where λ is the decay coefficient (default setting 0.5); the temporal weight matrix is generated using this formula.
Step c2: Combine the initial correlation score with the temporal weight matrix to calculate the temporal consistency correction score, using the following formula:
[0030] where α and β are balancing coefficients (default α=0.6, β=0.4) used to adjust the contribution ratio of the initial correlation score and the temporal weights. The maximum value is taken from each column of the corrected score matrix, and the frames and segments that best match the semantics of the question are selected to obtain the temporally corrected visual appearance features and motion features.
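The following sketch shows one plausible reading of steps b2 and c2; it should not be taken as the patented formulas, which are not reproduced in the text. It assumes the temporal weight is `exp(-λ·|k - i|)` inside the window u and zero outside, and that the corrected score is `α·s_i` plus `β` times the weight-averaged score of the neighborhood. The function names and the `keep` parameter are hypothetical.

```python
import math


def temporal_weights(num_frames: int, window: int = 5, decay: float = 0.5):
    """Assumed form: W[i][k] = exp(-decay * |k - i|) within the window, else 0."""
    return [
        [math.exp(-decay * abs(k - i)) if abs(k - i) <= window else 0.0
         for k in range(num_frames)]
        for i in range(num_frames)
    ]


def corrected_scores(initial, window=5, decay=0.5, alpha=0.6, beta=0.4):
    """Assumed combination: alpha * own score + beta * weighted neighborhood average."""
    weights = temporal_weights(len(initial), window, decay)
    corrected = []
    for i, row in enumerate(weights):
        # sum(row) >= 1 because each frame has weight 1 with itself.
        neighborhood = sum(w * s for w, s in zip(row, initial)) / sum(row)
        corrected.append(alpha * initial[i] + beta * neighborhood)
    return corrected


def select_frames(scores, keep):
    """Indices of the `keep` highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Frame 1 is an isolated spike; frames 3 to 5 are consistently relevant.
initial = [0.1, 0.9, 0.1, 0.7, 0.8, 0.75]
corrected = corrected_scores(initial)
kept = select_frames(corrected, keep=3)
print(kept)
```

Under these assumptions, a frame whose high score is not supported by its temporal neighbors is pulled down, while runs of consistently relevant frames keep their scores, which matches the stated goal of suppressing spurious matches caused by weather or occlusion.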
[0031] The specific implementation process of dual correlation denoising includes:
Step a3: Fuse the temporally corrected visual appearance features and motion features by element-wise addition to obtain the visual scene features.
Step b3: Take the visual scene features and the triplet features as input to the video reasoning question answering model, and obtain the Top-K triplet features whose relevance to the visual scene features is higher than a preset threshold; these are denoted as the scene-related triplet features.
Step c3: Construct a three-layer fully connected network as a semantic consistency verification network. The visual scene features, the text features, and the scene-related triplet features are concatenated, and the network calculates the semantic consistency score. The network calculation process is as follows:
[0032] where the features are joined by a feature concatenation operation, the dropout probability of the dropout layer is set to 0.3, and the final activation function maps the output to the [0,1] interval as a consistency score. A threshold σ is set (default σ=0.5): consistency scores below the threshold are set to 0, and all others are set to 1. The resulting consistency scores are multiplied with the scene-related triplet features to obtain the denoised key knowledge features.
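The two filtering stages of steps b3 and c3 can be sketched as follows. This is a simplified illustration under stated assumptions: relevance is taken as a plain dot product between one scene vector and each triplet vector, and the consistency scores are passed in as numbers rather than produced by the three-layer network described above. The names `top_k_triplets` and `gate_by_consistency` are hypothetical.

```python
def top_k_triplets(scene_feature, triplet_features, k):
    """Stage 1 (scene relevance): indices of the k triplets most relevant to the scene.

    Assumption: relevance is the dot product of the scene vector with each triplet.
    """
    relevance = [
        sum(a * b for a, b in zip(scene_feature, t)) for t in triplet_features
    ]
    ranked = sorted(range(len(triplet_features)), key=lambda j: relevance[j], reverse=True)
    return ranked[:k]


def gate_by_consistency(triplet_features, consistency_scores, threshold=0.5):
    """Stage 2 (semantic consistency): binarize scores at the threshold and mask.

    Scores below the threshold become 0, all others become 1, and each triplet
    feature is multiplied by its binary gate.
    """
    gated = []
    for feature, score in zip(triplet_features, consistency_scores):
        gate = 1.0 if score >= threshold else 0.0
        gated.append([gate * x for x in feature])
    return gated


scene = [1.0, 0.0]
triplets = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
selected = top_k_triplets(scene, triplets, k=2)
# Hypothetical consistency scores standing in for the verification network's output.
key_knowledge = gate_by_consistency([triplets[j] for j in selected], [0.8, 0.3])
print(selected, key_knowledge)
```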
[0033] The specific implementation process of cross-modal fusion and reasoning includes:
Step a5: Construct a Transformer encoder that takes the text features as the query vector and the temporally corrected visual appearance features and motion features, in turn, as the key and value vectors, and calculates cross-attention.
[0034] In the cross-attention result, one term represents the text features enhanced by the temporally corrected visual appearance features, the other represents the temporally corrected visual appearance features enhanced by the text features, and the two are joined by a feature concatenation operation. The same process is used to calculate the cross-attention fusion result of the text features and the motion features. The two fusion results are concatenated, and the question-visual fusion features are obtained through linear mapping. The formula is:
[0035] Step b5: Perform max pooling (MaxPool) on the key knowledge features to compress the spatial dimensions, concatenate the result with the question-visual fusion features, and apply a single-layer linear mapping to obtain the knowledge-enhanced multimodal features. The specific formula is as follows:
[0036] Step c5: Calculate the cosine similarity between the knowledge-enhanced multimodal features and the features of each candidate answer; the similarity value is the score of the corresponding candidate answer. The specific formula is as follows:
[0037] The candidate answer with the highest score is selected as the final answer.
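Step c5 amounts to a nearest-neighbor lookup under cosine similarity. The sketch below illustrates it with toy three-dimensional vectors; the function names and the example features are hypothetical, and the fused multimodal feature is assumed to already have the same dimension as the candidate answer features.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_answer(multimodal_feature, answer_features):
    """Score every candidate by cosine similarity and return (best index, scores)."""
    scores = [cosine(multimodal_feature, a) for a in answer_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: the fused feature points mostly along the second candidate.
fused = [0.2, 0.9, 0.1]
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best, scores = select_answer(fused, candidates)
print(best)
```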
[0038] The method also includes a model training process, using a total loss function that is the sum of the visual-question coarse-to-fine-grained contrastive loss and the answer classification loss, with the specific formula as follows:
[0039] where the visual-question coarse-to-fine-grained contrastive loss is used to optimize cross-modal feature alignment. It includes two sub-losses, frame-level (fine-grained) and segment-level (coarse-grained), with the specific formula as follows:
[0040] The frame-level loss and the segment-level loss are calculated as follows:
[0041]
[0042] where the two adaptive weighting factors are used to dynamically adjust the importance of appearance and motion features, and the temperature hyperparameter defaults to 0.07.
[0043] The answer classification loss uses the InfoNCE loss function to optimize the classification accuracy of candidate answers.
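For reference, the standard InfoNCE loss over a set of candidate scores can be written as below. This is a generic sketch of InfoNCE, not the exact training code of the method: it assumes the inputs are the per-candidate similarity scores and that exactly one candidate is the correct answer. The temperature default of 0.07 is the value given above; the function name is hypothetical.

```python
import math


def info_nce(scores: list[float], positive_index: int, temperature: float = 0.07) -> float:
    """InfoNCE: -log( exp(s_pos / t) / sum_j exp(s_j / t) ).

    Computed with the log-sum-exp trick for numerical stability.
    """
    logits = [s / temperature for s in scores]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[positive_index]


# The loss is small when the correct answer already has the highest similarity
# and large when a wrong answer does.
candidate_scores = [0.9, 0.1, 0.2]
loss_correct = info_nce(candidate_scores, positive_index=0)
loss_wrong = info_nce(candidate_scores, positive_index=1)
print(loss_correct < loss_wrong)
```

A low temperature such as 0.07 sharpens the softmax, so small differences in cosine similarity translate into large differences in loss.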
[0044] Prior to step S500, a model performance evaluation and screening step is also included:
00: Acquire test traffic videos, test question texts, and traffic knowledge graphs, and use the trained model to obtain test question answering results;
01: Evaluate model performance by comparing the test results with the actual answers;
02: Use the model whose performance meets the preset threshold as the final deployment model.
Example 2:
[0045] The present invention also discloses a traffic video question answering device with cross-modal human-like stepwise reasoning, comprising an acquisition module, a feature extraction module, a temporal correction module, a knowledge denoising module, and a reasoning output module disposed on an electronic device, wherein:
The acquisition module is used to acquire the traffic video to be processed, the corresponding question text, the candidate answer set, and the traffic knowledge graph;
The feature extraction module is used to extract visual appearance and motion features from the traffic video, text features from the question text, and triplet features from the traffic knowledge graph, and to transform each feature into a unified dimensional space;
The temporal correction module is used to select visual features with true semantic associations through temporal consistency calculation and to output temporally corrected visual appearance features and motion features;
The knowledge denoising module is used to fuse the temporally corrected visual features to obtain visual scene features, and then to select key knowledge features through scene relevance filtering and semantic consistency verification;
The reasoning output module is used to perform cross-modal fusion and reasoning based on the temporally corrected visual features, the text features, and the key knowledge features, and to output the question answering result.
[0046] This invention utilizes continuous frame information from video to capture temporal changes in traffic scenes. Combined with traffic rules in a knowledge graph, it achieves deep understanding and real-time reasoning of complex traffic interactions. It can handle dynamic scenarios such as vehicle lane changes, pedestrian crossings, and traffic light switching. Through temporal reasoning, understanding of complex interactions, and handling of unexpected scenarios, it provides accurate driving decision support. It can be widely applied in fields such as autonomous driving simulation training, intelligent traffic monitoring, and driver behavior assessment. It provides dynamic and real-time decision support for intelligent transportation devices and automatic determination of liability in traffic accidents, demonstrating significant technical advantages and application value.
[0047] The above description is merely a preferred embodiment of the present invention and does not constitute any limitation on the technical scope of the present invention. Therefore, any minor modifications, equivalent changes, and alterations made to the above embodiments based on the technical essence of the present invention shall still fall within the scope of the technical solution of the present invention.
Claims
1. A cross-modal human-like stepwise reasoning method for traffic video question answering, characterized in that it includes the following steps: S100: obtain the traffic video to be processed, the corresponding question text, the candidate answer set, and the traffic knowledge graph; S200: extract the visual appearance features and motion features of the traffic video, the text features of the question text, and the triplet features of the traffic knowledge graph, and transform each feature into a unified dimensional space to obtain a multimodal initial feature set; S300: input the visual appearance features and motion features in the multimodal initial feature set to the temporal consistency correction module to obtain the temporally corrected visual appearance features and motion features; S400: fuse the temporally corrected visual appearance features and motion features to obtain visual scene features, and input the visual scene features, together with the triplet features and text features in the multimodal initial feature set, into the dual correlation denoising module to obtain the denoised key knowledge features; S500: perform cross-modal fusion and reasoning based on the temporally corrected visual features, the text features, and the key knowledge features, and output the question answering result.
2. The traffic video question answering method based on cross-modal human-like stepwise reasoning according to claim 1, characterized in that step S200 specifically includes the following steps: a1: divide the traffic video into N equal-length segments, each segment containing T frames of images, wherein... L represents the total number of frames in the traffic video; b1: use the pre-trained Swin-L model to extract the visual appearance features of each image frame, and use the pre-trained Video Swin-B model to extract the motion features of each segment; c1: use a pre-trained BERT model to map the question text into text features; d1: encode the traffic knowledge graph to obtain triplet features; e1: use a single-layer fully connected network to transform the visual appearance features, motion features, text features, and triplet features into the same dimensional space, forming a multimodal initial feature set.
3. The traffic video question answering method based on cross-modal human-like stepwise reasoning according to claim 1, characterized in that step S300 specifically includes the following steps: a2: calculate the initial correlation score between the text features and the visual appearance features using cosine similarity, and normalize it to the [0,1] interval; b2: generate a temporal weight matrix based on the inter-frame time distance and an exponential decay mechanism; c2: combine the initial correlation score and the temporal weight matrix to calculate the temporal consistency correction score, select the frames and segments that best match the semantics of the question, and obtain the temporally corrected visual appearance features and motion features.
4. The traffic video question answering method based on cross-modal human-like stepwise reasoning according to claim 1, characterized in that step S400 specifically includes the following steps: a3: fuse the temporally corrected visual appearance features and motion features to obtain the visual scene features; b3: calculate the correlation between the visual scene features and the triplet features through matrix multiplication, and select the Top-K scene-related triplet features; c3: use a three-layer fully connected network to verify the semantic consistency between the Top-K scene-related triplet features and the question text features, filter out low-consistency features, and obtain the denoised key knowledge features.
5. The traffic video question answering method based on cross-modal human-like stepwise reasoning according to claim 1, characterized in that step S500 specifically includes the following steps: a5: perform cross-attention calculation between the text features and the temporally corrected visual appearance features and motion features to obtain the question-visual fusion features; b5: perform max pooling on the key knowledge features, concatenate them with the question-visual fusion features, and then obtain knowledge-enhanced multimodal features through linear mapping; c5: calculate the cosine similarity between the multimodal features and the features of each candidate answer, and select the candidate answer with the highest similarity as the question answering result.
6. The traffic video question answering method based on cross-modal human-like stepwise reasoning according to claim 1, characterized in that the method further includes a model training process, wherein the loss function used for training is the sum of the visual-question coarse-to-fine-grained contrastive loss and the InfoNCE loss for answer classification, and the visual-question contrastive loss adjusts the importance of appearance features and motion features through adaptive weight factors.
7. The traffic video question answering method based on cross-modal human-like stepwise reasoning according to claim 1, characterized in that, prior to step S500, a model performance evaluation and screening step is also included: 00: acquire test traffic videos, test question texts, and traffic knowledge graphs, and use the trained model to obtain test question answering results; 01: evaluate model performance by comparing the test results with the actual answers; 02: use the model whose performance meets the preset threshold as the final deployment model.
8. A traffic video question answering device with cross-modal human-like stepwise reasoning, characterized in that it includes an acquisition module, a feature extraction module, a temporal correction module, a knowledge denoising module, and a reasoning output module disposed on an electronic device, wherein: the acquisition module is used to acquire the traffic video to be processed, the corresponding question text, the candidate answer set, and the traffic knowledge graph; the feature extraction module is used to extract the visual appearance features and motion features of the traffic video, the text features of the question text, and the triplet features of the traffic knowledge graph, and to transform each feature into a unified dimensional space; the temporal correction module is used to select visual features with real semantic associations through temporal consistency calculation and to output temporally corrected visual appearance features and motion features; the knowledge denoising module is used to fuse the temporally corrected visual features to obtain visual scene features, and to select key knowledge features through scene relevance filtering and semantic consistency verification; the reasoning output module is used to perform cross-modal fusion and reasoning based on the temporally corrected visual features, the text features, and the key knowledge features, and to output the question answering result.
9. The traffic video question answering device with cross-modal human-like stepwise reasoning according to claim 8, characterized in that the temporal correction module includes an initial relevance score calculation unit, a temporal weight generation unit, and a correction score fusion unit, which are used, respectively, to calculate the initial relevance between text and visual features, generate a temporal weight matrix, and fuse them to obtain a temporal consistency correction score.
10. The traffic video question answering device with cross-modal human-like stepwise reasoning according to claim 8, characterized in that the knowledge denoising module includes a scene feature fusion unit, a scene relevance filtering unit, and a semantic consistency verification unit, which are used, respectively, to generate visual scene features, filter scene-related knowledge, and filter out semantically inconsistent knowledge noise.