Traffic situation-oriented multi-modal cue retrieval and early warning method and system

By aligning multimodal traffic data, calculating suspiciousness, constructing evidence loop diagrams, and dynamically adjusting early warning rules, the problem of false alarms and missed alarms caused by misinterpretation in large multimodal models is solved, improving the reliability and controllability of traffic situation awareness.

CN122220366A · Pending · Publication Date: 2026-06-16 · GUANGZHOU YEDUN INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GUANGZHOU YEDUN INFORMATION TECHNOLOGY CO LTD
Filing Date: 2026-04-25
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

In multimodal large models, readable text and symbolic carriers in traffic scenarios may be misinterpreted as instructional fragments, leading to false or missed warnings, affecting the reliability and controllability of traffic management, and lacking isolation, constraints and access control over intermediate inference results and outputs.

Method used

By aligning multimodal traffic data, feature extraction and semantic representation learning are performed, injection suspicion scores are calculated, dynamic gating parameters are generated, evidence loop diagrams are constructed and loop integrity is calculated, early warning escalation rules are dynamically adjusted, and access control and output semantic hierarchical processing are implemented.

Benefits of technology

It achieves the goal of suppressing false alarms and missed alarms while ensuring response efficiency, improving the stability and controllability of traffic situation awareness, and providing structured early warning outputs to facilitate verification and engineering implementation.


Figure CN122220366A_ABST

Abstract

The present application relates to the field of intelligent transportation and traffic situation awareness, and in particular to a multimodal clue retrieval and early warning method and system for traffic situations. The application acquires and aligns multimodal traffic data; extracts features to generate a unified semantic representation; retrieves candidate traffic clues based on a query and extracts candidate evidence; calculates an injection suspicion score and generates dynamic gating parameters; constructs an evidence closed loop and adjusts the early warning rules; and calls a fusion perception reasoning model to perform anomaly recognition and early warning reasoning, generating a structured early warning output. Multimodal data alignment and the unified semantic representation improve retrieval relevance, so that fusion reasoning outputs structured, explainable early warnings backed by candidate evidence. The injection suspicion score and dynamic gating suppress false positives and misses caused by visual prompts, while the evidence closed loop and the dynamically adjusted escalation rules improve scene adaptability and robustness.

Description

Technical Field

[0001] This invention relates to the field of intelligent transportation and traffic situation awareness, and more specifically, to a multimodal clue retrieval and early warning method and system for traffic situations.

Background Technology

[0002] With the advancement of roadside perception and digital traffic management, multi-source heterogeneous data, including road images / videos, roadside detection statistics, floating car speeds, signal phase states, and guidance information, are being accessed on a large scale. The industry is beginning to introduce multimodal large models to perform unified semantic representation learning, traffic clue retrieval, and anomaly reasoning on multi-source data, and to integrate the reasoning results into the early warning escalation and response chain to improve cross-modal understanding and retrieval recall.

[0003] However, in real-world traffic scenarios, there are a large number of readable text and symbol carriers (such as information boards, billboards, vehicle stickers, construction signs, directional signs, and high-contrast textures). When a multimodal large model has strong text recognition and cross-scale semantic aggregation capabilities, these carriers may be presented as "instructional fragments" that are readable by the model under certain processing links, thereby triggering indirect prompt injection risks and causing false or missed warnings. This type of trigger is also covert and may be parsed by the model under specific resolution / sampling strategies through contrast edges, character splicing, local textures, and inter-frame overlay.

[0004] Furthermore, when applications directly connect the output of multimodal large models to alarm or decision-making links, if there is a lack of isolation, constraints, and access control over the intermediate results of inference and the output processing, the above risks may be amplified into policy-level impacts (such as suppressing real alarms or creating false alarms), thereby affecting the reliability and controllability of traffic management.

[0005] Therefore, beyond the "alignment-representation-retrieval-reasoning" chain, existing technologies urgently need a constraint mechanism aimed at engineering risks: one that quantifies the credibility or suspicion of the candidate evidence set and generates dynamic gating parameters to restrict reasoning paths and output permissions, and that, through evidence closed-loop consistency evaluation, dynamic adjustment of early warning escalation rules, and hierarchical control of output semantics, lets early warning keep its response efficiency while remaining prudent and traceable in the face of potential prompt injection.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, the present invention aims to provide a multimodal clue retrieval and early warning method and system for traffic situation.

[0007] To achieve the above objectives, the present invention provides the following technical solution: a multimodal clue retrieval and early warning method for traffic situational awareness, including the following steps:
Step 1: Acquire and align multimodal traffic data for the traffic scenario to obtain aligned multimodal traffic data;
Step 2: Perform feature extraction and semantic representation learning on the aligned multimodal traffic data to obtain multimodal traffic clue representations in a unified semantic space;
Step 3: Receive a traffic clue query request; search the traffic clue index based on the traffic clue query request and the multimodal traffic clue representations to obtain a candidate traffic clue set; extract a candidate evidence set related to the traffic clue query request from the candidate traffic clue set;
Step 4: Calculate the injection suspicion score for the candidate evidence set, and generate dynamic gating parameters based on the injection suspicion score to control the early warning reasoning path and early warning output permissions;
Step 5: Construct an evidence loop diagram based on the candidate evidence set and calculate the evidence loop completeness; dynamically adjust the early warning escalation rules based on the injection suspicion score and the evidence loop completeness;
Step 6: Under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules, invoke the fusion perception reasoning model to perform anomaly identification and early warning reasoning on the candidate evidence set, and apply access control to the intermediate reasoning results and the structured early warning output according to the dynamic gating parameters, so as to generate the structured early warning output.

[0008] Furthermore, multimodal traffic data includes at least road image data or road video data, as well as traffic status data that corresponds to the road image data or road video data in time and space.

[0009] Furthermore, feature extraction and semantic representation learning include encoding road image data or road video data into visual semantic representations, encoding traffic state data into spatiotemporal semantic representations, and performing fusion of visual semantic representations and spatiotemporal semantic representations to obtain multimodal traffic cue representations in a unified semantic space.

[0010] Furthermore, the candidate evidence set includes at least candidate road image fragments or candidate road video fragments and candidate traffic state fragments.

[0011] Furthermore, the injection suspicion score is determined based on at least one of the following, or a combination thereof:
Enhanced consistency divergence: generate multiple viewpoint representations for the same candidate road image segment or candidate road video segment, input each viewpoint representation into the fusion perception reasoning model to obtain the corresponding inference result, and calculate the enhanced consistency divergence based on the consistency assessment between the inference results and the logical link consistency assessment;
Linguistic causal conflict score: construct a set of semantic constraints from the structured semantics output by the fusion perception reasoning model, construct a traffic state evidence map from the candidate traffic state segments, and perform a causal consistency test between the set of semantic constraints and the traffic state evidence map to obtain the linguistic causal conflict score;
High-risk carrier aggregation score: detect high-risk carriers in the candidate road image segments or candidate road video segments and analyze their structure, construct a high-risk carrier map based on the detection and structure analysis results, and perform risk aggregation inference on the high-risk carrier map to obtain the high-risk carrier aggregation score.

[0012] Furthermore, the evidence loop diagram includes at least the following nodes: visual anomaly events extracted from candidate road image segments or candidate road video segments, state anomaly events extracted from candidate traffic state segments, and structured reasoning conclusions output by the fusion perception reasoning model; and edges consisting of temporal adjacency, spatial adjacency, causal support, and homologous association, with each edge having a corresponding edge weight.

[0013] Furthermore, the completeness of the evidence loop is determined at least through the following methods: calculating the node confidence for each node, and performing probabilistic consistency fusion on the node confidence based on edge weights to obtain the loop consistency confidence; performing confidence propagation and decay along temporal and spatial adjacency relationships on the evidence loop graph; extracting causal constraints from the structured reasoning conclusions output by the fusion perception reasoning model, and performing constraint violation penalties on the evidence loop graph; and calculating the completeness of the evidence loop based on the loop consistency confidence, the propagated and updated confidence, and the loop penalty term.

[0014] Furthermore, dynamic control includes at least closed-loop threshold adjustment, duration threshold adjustment, and early warning level mapping strategy adjustment.

[0015] Furthermore, the structured early warning output includes at least the early warning level, early warning probability, early warning location label, and early warning description; and the structured early warning output is subject to semantic hierarchical control based on the injected suspiciousness score.

[0016] Furthermore, the multimodal clue retrieval and early warning system for traffic situations includes:
Data acquisition module: acquires multimodal traffic data;
Feature representation module: generates traffic clue representations;
Clue retrieval module: extracts a candidate evidence set;
Suspicion assessment module: calculates the injection suspicion score and generates dynamic gating parameters;
Closed-loop reasoning module: constructs the evidence loop diagram, calculates the evidence loop completeness, and dynamically adjusts the early warning escalation rules;
Early warning output module: under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules, performs early warning reasoning on the candidate evidence set and generates the structured early warning output.

[0017] Compared with the prior art, the present invention has the following beneficial effects:
This invention aligns multimodal traffic data in time and space and learns a unified semantic representation. It searches the traffic clue index according to the query request and extracts a candidate evidence set, so that anomaly identification and early warning reasoning rest on a more focused, consistent, and reusable evidence base. It also outputs structured results (early warning level, probability, location label, and description) that are convenient for review, playback, and engineering implementation.
Dynamic gating enables controllable reasoning and output boundaries: an injection suspicion score is calculated for the candidate evidence, and dynamic gating parameters are generated from it to control the early warning reasoning path and early warning output permissions. During the reasoning stage, access to intermediate reasoning results and structured early warning outputs is controlled, and this can be combined with semantic hierarchical control so that the output becomes more conservative in high-suspicion cases, clearly marking the source of uncertainty and suggesting supplementary evidence. This suppresses the false alarms, missed alarms, over-assertion, and false suppression caused by prompt injection.
Based on the candidate evidence, an evidence loop diagram is constructed and the loop completeness is calculated. Combined with the injection suspicion score, the early warning escalation rules (such as the closed-loop threshold, the duration threshold, and the early warning level mapping strategy) are dynamically adjusted. This allows faster escalation when the evidence is sufficient and the risk is low, and more prudent suppression of hasty escalation when the evidence is insufficient or the risk is high, thereby improving the stability and controllability of multi-scenario, round-the-clock operation.
The intermediate results of reasoning are packaged hierarchically (such as evidence citation lists, key causal links, confidence propagation paths, and conflict item summaries), and their visibility is set according to the dynamic gating, so that different authorized entities obtain structured early warning outputs that match their responsibilities, thereby improving the efficiency of review, collaboration, and post-event analysis.

Attached Figure Description

[0018] Figure 1 is a schematic diagram of the overall process of a multimodal clue retrieval and early warning method for traffic situations;
Figure 2 is a schematic diagram of multimodal traffic data acquisition, spatiotemporal alignment, and unified semantic representation generation;
Figure 3 is a schematic diagram of query-based traffic clue retrieval, candidate evidence extraction, injection suspicion score calculation and dynamic gating parameter generation, as well as evidence loop construction and loop completeness calculation;
Figure 4 is a schematic diagram of the dynamic adjustment of early warning rules, anomaly identification and early warning reasoning under dynamic gating and evidence closed-loop evaluation constraints, and the generation of structured early warning outputs.

Detailed Implementation

[0019] Example 1: Refer to Figures 1 to 4. To facilitate implementation and verification, the thresholds, coefficients, and scoring functions involved in this invention are defined uniformly below.

[0020] For any candidate evidence entry $e$, let its corresponding candidate road image segment or candidate road video segment be $V_e$, its candidate traffic state segment be $S_e$, its time window be $T_e$, and its spatial positioning be $p_e$. For any scalar index $x$, its normalization result is denoted as $\hat{x}$.

[0021] Normalization can be achieved using quantile clipping normalization based on historical sample statistics: $\hat{x} = \operatorname{clip}\left(\frac{x - Q_a(x)}{Q_b(x) - Q_a(x)},\ 0,\ 1\right)$, where $Q_a(x)$ and $Q_b(x)$ represent the $a$ quantile and the $b$ quantile of index $x$ in the historical sample set, and $\operatorname{clip}(\cdot)$ is the clipping function. To ensure the reproducibility and numerical stability of quantile clipping normalization, the preferred quantile levels are […]; in scenarios that are more sensitive to extreme values, it is advisable to use […]. The historical sample set can be maintained using a sliding time window, with the window length preferably being the most recent […] days (for example, […] days), and a minimum sample size is set (for example, no fewer than […] samples, or no fewer than […] samples within an event window). When the sample size is insufficient or the sample distribution drifts significantly, the historical sample set can be replaced by a field calibration dataset or a simulation dataset, which keeps the normalization reproducible and feasible in engineering. Quantile updates can be performed periodically or incrementally by event window: the quantiles are recalculated and cached at each update cycle. Alternatively, an exponential sliding update can be used to balance stability and adaptability. To avoid […] Too small a value leads to normalization […].
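The quantile clipping normalization described in [0021] can be sketched as follows. This is a minimal illustration and not the patent's implementation: the quantile levels 0.05 and 0.95, the small constant `eps` guarding the denominator, and the function names are all assumptions, since the source text does not preserve the preferred values.

```python
from typing import Sequence


def quantile(samples: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty sample, with q in [0, 1]."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def quantile_clip_normalize(
    x: float,
    history: Sequence[float],
    q_low: float = 0.05,   # assumed lower quantile level
    q_high: float = 0.95,  # assumed upper quantile level
    eps: float = 1e-9,     # assumed guard against a near-zero quantile range
) -> float:
    """Scale x by the historical quantile range, then clip the result to [0, 1]."""
    low = quantile(history, q_low)
    high = quantile(history, q_high)
    scaled = (x - low) / max(high - low, eps)
    return min(1.0, max(0.0, scaled))
```

With a history of the integers 0 to 100, a value of 50 lands near the middle of the range, and values far outside the historical quantiles are clipped to 0 or 1 rather than distorting downstream scores.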

[0022] A multimodal clue retrieval and early warning method for traffic situational awareness includes the following steps: Step one involves acquiring and aligning multimodal traffic data for the traffic scene, resulting in aligned multimodal traffic data. Acquiring multimodal traffic data allows access to various data sources, including road images, road videos, and traffic status information, which contain rich information about the traffic scene. Alignment ensures that different modalities remain consistent in time and space, enabling subsequent feature extraction and analysis to be performed within a unified reference frame, thus improving the accuracy of clue recognition and reasoning. The aligned data ensures that multimodal information can work synergistically to form a complete, continuous, and consistent description of the traffic scene.

[0023] In one specific implementation, step one is as follows: Multimodal traffic data acquisition and calibration involves acquiring road image or video data covering the same road segment, and simultaneously collecting corresponding traffic status data. Traffic status data includes at least one or more of the following: traffic flow, average vehicle speed, lane occupancy, queue length, and signal phase status. During acquisition, the road image or video data is appended with the acquisition time and location information, and the traffic status data is appended with the same time reference and road segment location information, achieving unified calibration across data sources. For example, at the same intersection on an urban arterial road, continuous road video data is collected, while vehicle speed and phase status are simultaneously obtained from roadside detection devices and signal control records, forming multimodal traffic data for the same time period. Multimodal traffic data alignment and consistency verification involves performing time alignment between road image data or road video data and traffic status data based on the acquisition time, and spatial alignment based on road segment positioning information to map traffic status data to the corresponding road segments, lanes, or directions in the road image data or road video data. Consistency verification is also performed, triggering realignment or removing abnormal segments when the time deviation or spatial mapping error exceeds a preset threshold, to obtain aligned multimodal traffic data, ensuring that subsequent processing can be carried out under the same spatiotemporal reference.

[0024] The time deviation is defined as $\Delta t = |t_v - t_s|$, where $t_v$ is the moment of visual data acquisition and $t_s$ is the time corresponding to the traffic status data. The spatial mapping error is defined as $\Delta d = \operatorname{dist}(p_v, p_s)$, where $p_v$ is the visual data location, $p_s$ is the traffic status data location, and $\operatorname{dist}(\cdot,\cdot)$ is a spatial distance metric function, including but not limited to road network topology distance or geographic distance.

[0025] The thresholds can be adaptively determined from the statistical distribution of "normally aligned samples": $\tau_t = Q_q(\Delta t)$ and $\tau_d = Q_q(\Delta d)$, where $\tau_t$ is the time deviation threshold, $\tau_d$ is the spatial mapping error threshold, $Q_q(\cdot)$ is the quantile function over the normally aligned samples, and the quantile level $q$ is determined by the business alignment tolerance. When $\Delta t > \tau_t$ or $\Delta d > \tau_d$, realignment is triggered; if the thresholds are still exceeded after realignment, the corresponding abnormal segment is removed. To avoid triggering on occasional noise, a continuity criterion can be introduced: the realignment or removal process is triggered only when the proportion of samples within a sliding window that satisfy $\Delta t > \tau_t$ or $\Delta d > \tau_d$ exceeds a threshold $\rho$.
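The consistency check in [0024] and [0025] can be illustrated with a short sketch. It assumes planar coordinates in metres and Euclidean distance as the spatial metric; the class `AlignedPair`, the function names, the quantile level 0.95, and the window ratio 0.3 are hypothetical choices made for illustration, not values from the patent.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class AlignedPair:
    t_visual: float                  # visual acquisition time, seconds
    t_state: float                   # traffic status timestamp, seconds
    pos_visual: Tuple[float, float]  # visual data location, metres (planar)
    pos_state: Tuple[float, float]   # traffic status location, metres (planar)


def time_deviation(p: AlignedPair) -> float:
    return abs(p.t_visual - p.t_state)


def spatial_error(p: AlignedPair) -> float:
    # Euclidean distance here; a road-network distance could be swapped in.
    return math.dist(p.pos_visual, p.pos_state)


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty sample."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_thresholds(normal_pairs: Sequence[AlignedPair], q: float = 0.95):
    """Derive both thresholds from samples known to be normally aligned."""
    tau_t = quantile([time_deviation(p) for p in normal_pairs], q)
    tau_d = quantile([spatial_error(p) for p in normal_pairs], q)
    return tau_t, tau_d


def violates(p: AlignedPair, tau_t: float, tau_d: float) -> bool:
    return time_deviation(p) > tau_t or spatial_error(p) > tau_d


def window_triggers(window: Sequence[AlignedPair], tau_t: float, tau_d: float,
                    max_ratio: float = 0.3) -> bool:
    """Continuity criterion: act only if enough of the window violates a threshold."""
    if not window:
        return False
    bad = sum(violates(p, tau_t, tau_d) for p in window)
    return bad / len(window) > max_ratio
```

Evaluating the ratio over a sliding window, rather than reacting to each sample, keeps a single noisy timestamp from discarding an otherwise well-aligned segment.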

[0026] Step two involves performing feature extraction and semantic representation learning on the aligned multimodal traffic data to obtain multimodal traffic cue representations in a unified semantic space. Feature extraction extracts important information from the original images, videos, and traffic state data, including visual, spatiotemporal, and semantic features. Semantic representation learning, by mapping multimodal features to a unified semantic space, establishes connections between different modalities, enabling direct comparison, retrieval, and fusion. The traffic cue representations in the unified semantic space provide an operable and unified representation for subsequent retrieval, inference, and early warning, facilitating effective association analysis and judgment by the model.

[0027] In one specific implementation, step two is as follows: Visual semantic representation generation and quality constraints are implemented by extracting features from aligned multimodal traffic data, specifically road image or video data. Visual cues such as lane line morphology, vehicle trajectory changes, pedestrian aggregation, occlusion levels, abnormal stops, and wrong-way driving are encoded into visual semantic representations. Simultaneously, the clarity, exposure, and jitter intensity of the road image or video data are evaluated, and the evaluation results are used as quality constraints in the generation of visual semantic representations. This suppresses low-quality segments in the visual semantic representation, thereby reducing misjudgments caused by rain, fog, nighttime glare, or camera shake. For example, in road video data at elevated ramp entrances, when a vehicle suddenly decelerates and deviates from its lane, the motion features of this segment are jointly encoded with lane geometry features into a visual semantic representation, and quality suppression is applied to local distortions caused by strong nighttime reflections. Spatiotemporal semantic representation generation and unified semantic space fusion are performed. Feature extraction is conducted on traffic state data in the aligned multimodal traffic data, encoding temporal changes such as traffic flow, average speed, lane occupancy, queue length, and signal phase state into spatiotemporal semantic representations. Spatiotemporal location encoding consistent with road segments, lane directions, and time windows is introduced to enhance the characterization of the propagation of sudden congestion and the impact of signal phase switching. Subsequently, visual semantic representations and spatiotemporal semantic representations are fused, and cross-modal alignment constraints are used to map them to a unified semantic space, outputting a multimodal traffic cue representation. 
This makes visual anomalies and state anomalies comparable, correlated, and searchable at the same semantic scale. For example, when road video data shows a rapid increase in vehicle density within lanes, while traffic state data shows a decrease in average speed and an increase in queue length, the multimodal traffic cue representation obtained through fusion can stably characterize the key clues of congestion formation and retain their spatiotemporal orientation.

[0028] Step 3: Receive a traffic clue query request; search the traffic clue index based on the query request and the multimodal traffic clue representations to obtain a candidate traffic clue set; and extract from that set a candidate evidence set related to the traffic clue query request. The query request conveys the traffic events, anomalies, or specific conditions of interest to the user, so that the system can search in a targeted way. Searching the multimodal traffic clue representations in the index quickly finds candidate clues related to the query request, which supports both accuracy and retrieval efficiency. Extracting the candidate evidence set then filters the most valuable evidence from the candidate clues, providing the base data for subsequent suspicion assessment and closed-loop reasoning and keeping the system focused on the most relevant traffic information.

[0029] In one specific implementation, step three is as follows: The system parses and generates constraints for traffic clue query requests. It receives traffic clue query requests and performs semantic parsing on them, breaking down the query requests into event type constraints, time window constraints, spatial range constraints, and evidence form constraints. The event type constraints include at least one or more of the following: congestion, accident, wrong-way driving, road construction, abnormal parking, and pedestrian intrusion. The evidence form constraints are used to indicate the priority and number of candidate road image segments or candidate road video segments and candidate traffic state segments. At the same time, the query requests are encoded into query vectors consistent with the unified semantic space, and search filtering conditions are generated to limit the search scope of the traffic clue index. The process involves retrieving traffic clues from an index and generating a candidate traffic clue set. The index performs similarity retrieval based on query vectors and multimodal traffic clue representations, followed by secondary filtering using time window and spatial range constraints to obtain the candidate traffic clue set. The similarity retrieval uses vector similarity ranking combined with spatiotemporal proximity weighting, prioritizing clues from the same time period, road segment, or adjacent road segments to improve the aggregation capability for the same abnormal event. For example, when a traffic clue query requests information about abnormal congestion on a main road during the morning rush hour, the retrieval can prioritize clue entries from the upstream road segment of the same intersection with similar multimodal traffic clue representations and significant speed reductions, and include congestion propagation clues from adjacent road segments in the candidate traffic clue set. 
Candidate evidence set extraction and fine-grained slicing: candidate evidence is extracted from the candidate traffic clue set based on the event type and evidence form constraints of the query request. Candidate road image or video clips are sliced according to preset durations before and after the anomaly, and candidate traffic state clips within the same time window are extracted simultaneously. The candidate evidence set is then re-ranked by relevance, using a joint evaluation of visual cue strength, traffic state change magnitude, and spatiotemporal consistency scores, to ensure that the output evidence directly supports the subsequent suspicion score calculation and early warning inference. For example, when the query request is for a suspected accident, the candidate road video clip slices cover the deceleration before the collision, the trajectory change during the collision, and the stationary state after the collision. The candidate traffic state clips cover the continuous changes in traffic flow, average speed, and queue length over the same period, thus forming a candidate evidence set highly relevant to the traffic clue query request.
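The retrieval step in [0029] (vector similarity ranking combined with spatiotemporal proximity weighting, plus time window and spatial range filtering) might look like the sketch below. The `Clue` fields, the exponential time decay, the hop-based spatial weight, and all default parameters are assumptions chosen for illustration; the patent does not specify the weighting function.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clue:
    clue_id: str
    vector: List[float]  # representation in the unified semantic space
    time_s: float        # clue timestamp, seconds
    segment_hops: int    # road-network hops from the queried segment (hypothetical field)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float], query_time: float, clues: Sequence[Clue],
             time_window_s: float, max_hops: int,
             top_k: int = 3, time_scale_s: float = 600.0) -> List[str]:
    """Rank clues by similarity weighted by spatiotemporal proximity."""
    scored = []
    for clue in clues:
        dt = abs(clue.time_s - query_time)
        # Time window and spatial range constraints from the parsed query.
        if dt > time_window_s or clue.segment_hops > max_hops:
            continue
        # Same period and same or adjacent segments receive a higher weight.
        proximity = math.exp(-dt / time_scale_s) / (1 + clue.segment_hops)
        scored.append((cosine(query_vec, clue.vector) * proximity, clue.clue_id))
    scored.sort(reverse=True)
    return [clue_id for _, clue_id in scored[:top_k]]
```

In this sketch the constraint filter runs inside the scoring loop rather than as a separate second pass; the resulting candidate set is the same, and a production index would typically push such filters into the index query itself.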

[0030] Candidate evidence slice window determination: Let the traffic state variable vector be $\mathbf{s}(t)$ and the visual anomaly intensity be $a(t)$. The traffic state abrupt change point can be determined from the rate of change or the difference norm, and the visual anomaly point from the peak of the anomaly intensity: $t_s = \arg\max_{t \in \mathcal{T}} \lVert \mathbf{s}(t) - \mathbf{s}(t - \Delta) \rVert / \Delta$ and $t_v = \arg\max_{t \in \mathcal{T}} a(t)$, where $\Delta$ is the sampling interval and $\mathcal{T}$ is the retrieved candidate time range. The event center time is determined by combining the two points: $t_c = \lambda t_v + (1 - \lambda) t_s$, where $\lambda \in [0, 1]$ balances the reliability of visual and state evidence and can be determined by alignment quality, data missing rate, etc. The slice window is defined as $W = [t_c - T_{\mathrm{pre}},\ t_c + T_{\mathrm{post}}]$, where $T_{\mathrm{pre}}$ and $T_{\mathrm{post}}$ can be adaptively determined by event type constraints: maintain the historical distribution of window lengths for each event type and take a quantile value as the default window; when the suspicion level is high, $T_{\mathrm{pre}}$ and $T_{\mathrm{post}}$ can be increased to cover more contextual evidence. $T_{\mathrm{pre}}$ and $T_{\mathrm{post}}$ should be adjusted according to the type of traffic event. For example, for congestion events, a relatively long time window (e.g., 10 to 30 minutes) can be used to capture the persistent factors affecting traffic and the gradual change in traffic flow. For accident events, the period after the event is usually more critical, so $T_{\mathrm{post}}$ can be increased to approximately 20 to 60 minutes to ensure that all relevant evidence is captured. For example, when the query request points to an accident on a certain expressway, $T_{\mathrm{pre}}$ could be set to 10 minutes and $T_{\mathrm{post}}$ to 30 minutes. This captures information starting 10 minutes before the incident and continuing for 30 minutes after it, covering the development of the resulting congestion. The window can be moderately enlarged in cases of high suspicion, so that more complex dynamic processes are not omitted.
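A small sketch of the slice window determination in [0030] follows. It assumes the state change point is taken at the largest rate of change of the state vector, the visual point at the peak anomaly intensity, and the event center as a weighted combination of the two; the function names and the weight parameter are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def state_change_point(times: Sequence[float], states: Sequence[Sequence[float]]) -> float:
    """Timestamp at which the state vector changes fastest between adjacent samples."""
    best_time, best_rate = times[0], -1.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(states[i], states[i - 1])))
        rate = diff / dt
        if rate > best_rate:
            best_time, best_rate = times[i], rate
    return best_time


def visual_peak(times: Sequence[float], intensity: Sequence[float]) -> float:
    """Timestamp of the maximum visual anomaly intensity."""
    peak_index = max(range(len(times)), key=lambda i: intensity[i])
    return times[peak_index]


def slice_window(t_visual: float, t_state: float, weight_visual: float,
                 pre_s: float, post_s: float) -> Tuple[float, float]:
    """Window around the weighted event center; weight_visual is assumed in [0, 1]."""
    center = weight_visual * t_visual + (1.0 - weight_visual) * t_state
    return center - pre_s, center + post_s


# Usage with the accident example from the text: 10 minutes before, 30 minutes after.
times: List[float] = [0.0, 60.0, 120.0, 180.0]
states = [[50.0, 5.0], [48.0, 6.0], [20.0, 30.0], [18.0, 32.0]]  # [speed, queue length]
anomaly = [0.1, 0.2, 0.9, 0.4]
window = slice_window(visual_peak(times, anomaly), state_change_point(times, states),
                      weight_visual=0.5, pre_s=600.0, post_s=1800.0)
```

Lowering `weight_visual` when video quality is poor, or raising it when detector data has gaps, is one way to reflect the reliability balance the paragraph describes.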

[0031] Step four involves calculating the injection suspicion score for the candidate evidence set and generating dynamic gating parameters based on this score. These parameters control the early warning inference path and output permissions. Calculating the injection suspicion score quantitatively assesses the reliability and potential risk of each candidate piece of evidence, helping the system distinguish between high-risk and low-risk information. Generating dynamic gating parameters based on suspicion scores dynamically controls the early warning inference path, determining which evidence participates in the inference and how data with different confidence levels is processed, thereby optimizing inference efficiency and reducing false positives and false negatives. Simultaneously, the dynamic gating parameters also control the output permissions for early warnings, ensuring that only results meeting the criteria are output, thus improving system security and reliability.
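As a hedged illustration of step four, the sketch below aggregates normalized component scores into an injection suspicion score and maps it to gating parameters. The weighted mean, the thresholds 0.3 and 0.7, the three tiers, and the `Gate` fields are assumptions made for this example; the patent only states that gating parameters are generated from the score to control the reasoning path and output permissions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Gate:
    reasoning_path: str  # "fast", "verify" or "conservative" (hypothetical labels)
    may_escalate: bool   # whether the warning level may be raised automatically
    detail_level: str    # how much intermediate reasoning is exposed downstream


def injection_suspicion(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of normalized component scores, clipped to [0, 1]."""
    total = sum(weights[name] for name in components)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(weights[name] * value for name, value in components.items()) / total
    return min(1.0, max(0.0, score))


def gate_for(score: float, low: float = 0.3, high: float = 0.7) -> Gate:
    """Map the suspicion score to a gating tier (thresholds are assumed)."""
    if score < low:
        return Gate("fast", True, "full")
    if score < high:
        return Gate("verify", True, "summary")
    return Gate("conservative", False, "minimal")


WEIGHTS = {"consistency_divergence": 1.0, "causal_conflict": 1.0, "carrier_aggregation": 1.0}
calm = injection_suspicion(
    {"consistency_divergence": 0.2, "causal_conflict": 0.1, "carrier_aggregation": 0.0}, WEIGHTS)
risky = injection_suspicion(
    {"consistency_divergence": 0.9, "causal_conflict": 0.8, "carrier_aggregation": 0.7}, WEIGHTS)
```

Keeping the gate an immutable value makes it easy to log alongside each warning, which supports the traceability goal stated in the background section.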

[0032] In one specific implementation, step five is as follows: Multi-source aggregation of the injection suspicion score. For each candidate evidence set, at least one of the enhanced consistency divergence, the linguistic causal conflict score, and the high-risk carrier aggregation score, or a combination thereof, is calculated and normalized to form the injection suspicion score.
The enhanced consistency divergence is calculated by generating multiple viewpoint representations for the same candidate road image or video segment. These representations include at least segment representations with different temporal sampling densities, local representations with different spatial cropping scales, and dynamic representations with different motion emphasis intensities. Each viewpoint representation is input into the fusion perception reasoning model to obtain the corresponding inference result. The enhanced consistency divergence is calculated from the consistency assessment among the inference results and the logical link consistency assessment, and the suspicion level increases when the inference conclusions fluctuate significantly or the logical links break down.
The linguistic causal conflict score is calculated by constructing a semantic constraint set from the structured semantics output by the fusion perception reasoning model. This set includes at least event sequence, causal support, duration, and spatiotemporal reachability constraints. A traffic state evidence map is constructed from the candidate traffic state segments; it uses state variables such as traffic flow, average vehicle speed, lane occupancy, queue length, and signal phase status as nodes and state evolution relationships as edges. A causal consistency test is performed between the semantic constraint set and the traffic state evidence map to obtain the linguistic causal conflict score, and the suspicion level increases when the causal chain claimed by the structured semantics does not match the evolution of the state variables.
The high-risk carrier aggregation score is calculated by detecting and structurally analyzing high-risk carriers in the candidate road image segments or candidate road video segments. High-risk carriers include at least one or more of the following: hazardous chemical transport vehicles, engineering operation vehicles, oversized vehicles, and dense groups of non-motorized vehicles. A high-risk carrier map is constructed based on the detection and structural analysis results, and risk aggregation inference is performed on this map to obtain the high-risk carrier aggregation score, with the suspicion level increasing when high-risk carriers show spatiotemporal clustering, are close to each other, or are correlated with abnormal events.
Evidence loop diagram construction and relation weighting: an evidence loop diagram is constructed based on the candidate evidence set. Visual anomaly events extracted from the candidate road image or video segments, state anomaly events extracted from the candidate traffic state segments, and the structured reasoning conclusions output by the fusion perception reasoning model are used as nodes. Temporal adjacency, spatial adjacency, causal support, and homologous association are used as edges, and each edge is assigned a weight.
The edge weight of temporal adjacency is used to characterize the continuity of anomaly events within adjacent time windows; the edge weight of spatial adjacency is used to characterize the propagation of anomaly events between the same or adjacent road segments; the edge weight of causal support is used to characterize the support strength of visual and state anomaly events for the structured reasoning conclusions; and the edge weight of homologous association is used to characterize the matching degree of candidate road image or video segments and candidate traffic state segments from the same acquisition time and spatial location. Denoising processing is performed on the loop diagram. When the edge weight of homologous association is below a threshold, its contribution to subsequent loop calculation is reduced to avoid false loops introduced by alignment errors. The system calculates the completeness of the evidence loop and penalizes conflicts. It calculates the confidence level of each node in the evidence loop graph and performs probabilistic consistency fusion based on edge weights to obtain the loop consistency confidence level. Confidence propagation and attenuation are performed along temporal and spatial adjacency relationships in the evidence loop graph to strengthen the confidence level of nodes that are spatiotemporally continuous and closely related to the core abnormal event. Causal constraints are extracted from the structured reasoning conclusions output by the fusion perception reasoning model, and constraint violation penalties are applied to the evidence loop graph. Penalty terms are introduced when the causal constraints are inconsistent with the state evolution relationship of the traffic state evidence graph or the temporal sequence of visual abnormal events. Finally, the completeness of the evidence loop is calculated based on the loop consistency confidence level, the propagated and updated confidence level, and the loop penalty terms. 
A higher completeness of the evidence loop is achieved when the evidence chain is more closed, more consistent, and has fewer conflicts. The dynamic adjustment and examples of early warning escalation rules are explained. Based on the injection suspicion score and the completeness of the evidence loop, the early warning escalation rules are dynamically adjusted. This dynamic adjustment includes at least adjustments to the loop closure threshold, duration threshold, and early warning level mapping strategy: When the injection suspicion score is high and the evidence loop completeness is low, the loop closure threshold is increased and the duration threshold is extended, making early warning escalation more cautious and requiring more sufficient evidence loops; when the injection suspicion score is low and the evidence loop completeness is high, the loop closure threshold is decreased and the duration threshold is shortened, allowing for more agile escalation of early warnings for highly consistent anomalies; when the injection suspicion score and the evidence loop completeness are opposite, the early warning level mapping strategy is adjusted so that the early warning level is more constrained by the evidence loop completeness or more constrained by the injection suspicion score, achieving a balance between risk and evidence sufficiency. For example, if a suspected rear-end collision is detected in a candidate road video clip on a certain expressway, the inference results after multiple perspective representations are input into the fusion perception inference model show significant discrepancies and unstable logical links, leading to an increase in enhanced consistency divergence. Simultaneously, the traffic state evidence map only shows short-term average speed fluctuations but does not show a continuous increase in queue length, resulting in an increased causal conflict score and a high injection suspicion score. 
If the causal support edge weights between visual anomalies and state anomalies in the evidence loop diagram are low and the homologous association is unstable, the evidence loop completeness is low. In this case, increasing the loop closure threshold and extending the duration threshold can suppress rapid escalation, thus avoiding false alarms caused by a single visual clip. Conversely, if visual anomalies such as vehicles occupying lanes and remaining stationary, continuous decreases in average speed, and continuous increases in queue length occur simultaneously within a continuous time window on the same road segment, and the structured inference conclusions are consistent with the causal constraints, the evidence loop completeness is improved and the injection suspicion score is low. In this case, decreasing the loop closure threshold and shortening the duration threshold can promote rapid escalation of warnings, thereby improving the response efficiency to real accidents and persistent congestion.
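
The threshold adjustment described in this paragraph can be illustrated with a minimal Python sketch. The source does not give numeric values, so the cut-offs `high` and `low`, the increment `step`, the multiplier `factor`, and the names `EscalationRule` and `adjust_rule` are all hypothetical and only show the direction of each adjustment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationRule:
    closure_threshold: float  # loop closure threshold required to escalate
    duration_s: float         # how long the anomaly must persist, in seconds


def adjust_rule(base, suspicion, completeness,
                high=0.7, low=0.3, step=0.1, factor=1.5):
    """Return an adjusted rule; all numeric defaults are illustrative assumptions."""
    if suspicion >= high and completeness <= low:
        # High suspicion, weak evidence loop: escalate more cautiously.
        return EscalationRule(min(1.0, base.closure_threshold + step),
                              base.duration_s * factor)
    if suspicion <= low and completeness >= high:
        # Low suspicion, strong evidence loop: escalate more quickly.
        return EscalationRule(max(0.0, base.closure_threshold - step),
                              base.duration_s / factor)
    # Conflicting or intermediate indications: keep the base thresholds here;
    # the level mapping strategy would be adjusted separately.
    return base
```

With a base rule of `EscalationRule(0.6, 60.0)`, a suspicion of 0.9 and a completeness of 0.2 raise the closure threshold and lengthen the duration, while a suspicion of 0.1 and a completeness of 0.9 do the opposite.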

[0033] Example of injection suspicion score calculation: For each candidate evidence set, obtain the enhanced consistency divergence D, the linguistic causal conflict score, and the high-risk carrier aggregation score. The injection suspicion score can then be calculated in a reproducible manner as follows. First, the three quantities are normalized separately. Then, a weighted fusion of the normalized quantities gives the injection suspicion score. The three fusion weights can be determined from historical labeled samples by minimizing the cost of false positives and false negatives; the objective combines a cost-sensitive loss function, evaluated against sample risk labels or false positive / false negative cost labels, with a regularization term scaled by a regularization coefficient. If there is no labeled data, the weights can be initialized using business rules and updated iteratively from online feedback. In the example weighting, the enhanced consistency divergence receives the highest weight, the linguistic causal conflict score the second highest, and the high-risk carrier aggregation score the lowest, for the following reasons. The enhanced consistency divergence D directly reflects the overall "degree of deviation" among the modalities (visual / state / text) within the same event window and is more general and stable across most injection / forgery anomalies; it is therefore given the highest weight to make the main signal robust. The linguistic causal conflict score is structural evidence that targets logical contradictions and narrative inconsistencies. It is effective at distinguishing "real anomalies with differing expressions" from "fabricated narratives," but it is more sensitive to text quality and extraction errors, so its weight is secondary. The high-risk carrier aggregation score often relies on risk vocabularies, carrier profiles, or blacklist strategies. It can quickly amplify known high-risk patterns but is easily affected by strategy updates and scenario shifts, so it is given a lower weight as a supplementary measure. In practice, the weights can be fine-tuned online according to the cost of false positives / false negatives.
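
As a minimal sketch of the normalize-then-fuse procedure, the following Python code assumes min-max normalization and a plain weighted sum. The numeric weights in the source were lost in extraction, so the default `(0.5, 0.3, 0.2)` is a hypothetical choice that only respects the stated ordering (divergence highest, causal conflict second, carrier aggregation lowest); the function names are likewise illustrative.

```python
def min_max_normalize(value, lower, upper):
    """Clamp and rescale a raw score into [0, 1]; bounds are calibration assumptions."""
    if upper <= lower:
        raise ValueError("upper must be greater than lower")
    return min(1.0, max(0.0, (value - lower) / (upper - lower)))


def injection_suspicion(divergence, causal_conflict, carrier_aggregation,
                        weights=(0.5, 0.3, 0.2)):
    """Weighted fusion of three scores that are already normalized to [0, 1]."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scores = (divergence, causal_conflict, carrier_aggregation)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must be normalized to [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))
```

Because the weights sum to 1 and every input lies in [0, 1], the fused score also lies in [0, 1], which keeps it directly usable by a downstream gating step.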

[0034] To obtain the enhanced consistency divergence D, the linguistic causal conflict score, and the high-risk carrier aggregation score from the candidate evidence set, this implementation provides a reproducible calculation method. Consider a candidate event slice window and the set of participating modalities, which comprises the visual modality (images / videos), the traffic state modality (vector sequences such as speed / flow / occupancy), and the text / voice modality (event descriptions, announcements, alarm statements, etc.). For each modality, its feature vector within the window is extracted (for example, pooled features of a visual encoder, features of a state sequence encoder, or sentence vectors of a text encoder) and then normalized.

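[0035] Enhanced consistency divergence D: This measure assesses the overall inconsistency of the multimodal evidence within the same event window. First, the modality center vector is calculated from the normalized modality feature vectors; the consistency divergence is then defined in terms of the included angle between each normalized modality feature vector and the center vector.

[0036] When the multimodal semantics are consistent, the included angle between each modality feature vector and the center vector is relatively small and D is lower; when cross-modal inconsistency arises from injection / splitting / mismatch, D increases.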
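
One way to realize such an angle-based divergence is sketched below in Python. The exact formula is not recoverable from the source, so this sketch assumes that D is one minus the mean cosine similarity between each unit-normalized modality vector and the unit-normalized center vector, and that all modality features have already been projected into a shared embedding space of equal dimension. The function names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def consistency_divergence(modality_features):
    """Assumed form: 1 - mean cosine between each modality vector and the center."""
    units = [l2_normalize(v) for v in modality_features]
    dim = len(units[0])
    center = [sum(u[i] for u in units) / len(units) for i in range(dim)]
    center_norm = math.sqrt(sum(c * c for c in center))
    if center_norm == 0.0:
        # Modalities cancel out completely: treat as maximal divergence.
        return 1.0
    center = [c / center_norm for c in center]
    cosines = [sum(a * b for a, b in zip(u, center)) for u in units]
    return 1.0 - sum(cosines) / len(cosines)
```

Under this assumed form, identical modality vectors give a divergence of zero, and the value grows as the vectors spread apart in angle.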

[0037] Linguistic causal conflict score: This measure assesses the degree of conflict between the causal chain expressed in the text / voice and the causal chain inferred from the state evidence. Let the set of causal assertions extracted from the text modality be given, where each assertion records the direction of cause and effect (from a cause entity to a result entity), a polarity (promotion / inhibition, or positive / negative causality), and an assertion confidence. In parallel, the corresponding causal polarity is estimated from the traffic state modality within the window; this estimate may also take an "unable to determine" value.
In a reproducible implementation, the causal assertions in the text / voice modality can be extracted using a "template matching + syntactic constraint" approach. A set of causal trigger words / structural templates is predefined (e.g., "cause / lead to / trigger / result in / because...so / make / inhibit / alleviate / exacerbate"). After the text is segmented into sentences, sentences containing a causal trigger template are identified, and the cause entity, the result entity, and the direction are determined from dependency relations or constituent structure. The polarity is determined from the semantics of the trigger word, and the template matching strength, trigger word confidence, or extraction model output probability is mapped to the assertion confidence. Alternatively, an information extraction model can output the assertion quadruple directly.
The causal polarity in the traffic state modality can be estimated by a directionality test on the state variable sequences within the candidate event window: difference sequences are constructed for the cause variable and the result variable, and their sign correlation or correlation coefficient is calculated. The polarity is judged positive when this statistic is sufficiently positive and negative when it is sufficiently negative. The "unable to determine" value is assigned when any of the following conditions is met: (1) the sample size is insufficient (e.g., fewer than 10 valid sampling points within the window); (2) the correlation strength is insufficient; (3) the missing rate of the state sequence exceeds its threshold; (4) the fluctuation range of the variable is below the minimum change threshold, which makes the directionality test meaningless. The thresholds involved can be obtained from historical samples for different event types. When the estimate is "unable to determine," the assertion is not counted by the conflict indicator function; the undetermined result is only used to flag "insufficient evidence" downstream or to reduce the stability weight.

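[0038] A conflict indicator function is defined for each single assertion; it indicates whether the polarity asserted in the text conflicts with the polarity estimated from the state evidence. The linguistic causal conflict score is then defined as the weighted conflict ratio, that is, the weighted share of conflicting assertions among the counted assertions.

[0039] Here, a very small constant is added to the denominator to prevent division by zero. The larger the score, the less consistent the causal relationships in the textual narrative are with the causal relationships supported by the state evidence.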
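
The polarity test and the weighted conflict ratio can be sketched in Python as follows. This is an illustration under stated assumptions: the sign agreement of first differences stands in for the "sign correlation," assertion confidence is used as the weight in the ratio, `min_strength=0.3` is a hypothetical threshold, and only the sample size and correlation strength conditions for "unable to determine" are checked. The minimum of 10 valid points follows the example in the text.

```python
def _sign(x):
    return (x > 0) - (x < 0)


def estimate_polarity(cause_series, effect_series, min_points=10, min_strength=0.3):
    """Return +1, -1, or 0 (unable to determine) from sign agreement of differences."""
    d_cause = [b - a for a, b in zip(cause_series, cause_series[1:])]
    d_effect = [b - a for a, b in zip(effect_series, effect_series[1:])]
    pairs = [(c, e) for c, e in zip(d_cause, d_effect) if c != 0 and e != 0]
    if len(pairs) < min_points:
        return 0  # insufficient sample size
    agreement = sum(_sign(c) * _sign(e) for c, e in pairs) / len(pairs)
    if abs(agreement) < min_strength:
        return 0  # insufficient correlation strength
    return 1 if agreement > 0 else -1


def causal_conflict_score(assertions, eps=1e-9):
    """assertions: iterable of (asserted_polarity, confidence, state_polarity)."""
    conflicting = 0.0
    counted = 0.0
    for asserted, confidence, observed in assertions:
        if observed == 0:
            continue  # undetermined estimates are not counted
        counted += confidence
        if asserted != observed:
            conflicting += confidence
    return conflicting / (counted + eps)
```

An assertion whose state-side polarity cannot be determined leaves the score unchanged, matching the rule that such assertions are excluded from the conflict indicator.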

[0040] High-risk carrier aggregation score: This is used to characterize the cumulative risk, in the risk profile, of the source carriers of the candidate evidence (accounts / devices / cameras / upload channels / nodes / regions, etc.). Given the set of carriers, each carrier is assigned a basic risk level based on its historical alert hit rate, violation records, credibility score, and abnormal upload behavior. A time decay coefficient can also be introduced, computed from the time interval since the carrier's most recent high-risk event and a decay time constant. The carrier aggregation risk can then be calculated using the "parallel risk" method. This form ensures that a high risk on any single carrier significantly raises the aggregation score and that the superposition of multiple medium-risk carriers also raises it; the output naturally falls within [0, 1], which facilitates subsequent normalization and fusion.
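
A minimal Python sketch of the "parallel risk" aggregation follows. The formula itself is missing from the source; this sketch assumes the common form of one minus the product of the complements of the individual risks, with an exponential time decay applied to each basic risk. The default decay constant of 24 hours and the function name are hypothetical.

```python
import math


def carrier_aggregation_risk(carriers, decay_hours=24.0):
    """carriers: iterable of (basic_risk in [0, 1], hours since last high-risk event).

    Assumed form: 1 - prod(1 - risk_k * exp(-dt_k / decay_hours)).
    """
    survival = 1.0
    for basic_risk, hours_since_event in carriers:
        if not 0.0 <= basic_risk <= 1.0:
            raise ValueError("basic_risk must lie in [0, 1]")
        decayed = basic_risk * math.exp(-hours_since_event / decay_hours)
        survival *= 1.0 - decayed
    return 1.0 - survival
```

Under this form a single carrier with risk 0.9 yields 0.9, three recent carriers with risk 0.3 each yield about 0.657, and an empty carrier set yields 0, so the described properties hold.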

[0041] The gating parameters are defined as a four-dimensional vector in which each component takes a value in [0, 1]. The four components are: the inference link depth gating coefficient; the gating coefficient for the range of available evidence; the visible boundary gating coefficient for intermediate inference results; and the visible boundary gating coefficient for structured early warning output fields (semantic hierarchical control). Given an injection suspicion score, and to guarantee the monotonic behavior "the higher the suspicion level, the more conservative the approach," a monotonic compression mapping based on the sigmoid function generates each gating coefficient. The steepness coefficient and the gating threshold of this mapping can be determined through historical sample fitting or on-site calibration. Under this mapping, as the injection suspicion score increases, each gating coefficient decreases monotonically, which reduces the depth of the reasoning path, narrows the scope of available evidence, and tightens the visible boundaries of intermediate reasoning results and structured early warning output.
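
The monotone sigmoid gating can be sketched in Python as below. The sketch assumes each coefficient has the form 1 - sigmoid(k * (s - t)), which decreases as the suspicion score s rises; the default steepness of 8.0 and threshold of 0.5 are hypothetical calibration values, and the component names in `GATE_NAMES` are illustrative labels for the four coefficients.

```python
import math

GATE_NAMES = (
    "inference_depth",
    "evidence_range",
    "intermediate_visibility",
    "output_field_visibility",
)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gating_vector(suspicion, steepness=(8.0, 8.0, 8.0, 8.0),
                  thresholds=(0.5, 0.5, 0.5, 0.5)):
    """Map a suspicion score in [0, 1] to four gating coefficients in (0, 1)."""
    return {
        name: 1.0 - sigmoid(k * (suspicion - t))
        for name, k, t in zip(GATE_NAMES, steepness, thresholds)
    }
```

A larger steepness makes the gate behave more like a hard switch around its threshold, while a smaller one tightens permissions gradually.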

[0042] Step five involves constructing an evidence loop diagram based on the candidate evidence set and calculating the completeness of the evidence loop. The early warning escalation rules are then dynamically adjusted based on the injection suspicion score and the completeness of the evidence loop. The evidence loop diagram represents the spatiotemporal, causal, and logical relationships among the candidate evidence, forming a complete evidence network. Calculating the completeness of the evidence loop makes it possible to assess the integrity and consistency of the whole evidence chain and to determine the reliability and verifiability of clues. Adjusting the escalation rules according to suspicion and loop completeness adapts the warning level and response strategy, so that the system can decide whether to escalate a warning or take further measures based on the completeness of the evidence and the risk situation.

[0043] In one specific implementation, the construction of the evidence loop diagram and the calculation of the evidence loop completeness are as follows.
Node extraction and unified identifier generation: Based on the candidate evidence set, visual anomaly events are extracted from candidate road image or video segments, state anomaly events are extracted from candidate traffic state segments, and the structured reasoning conclusions output by the fusion perception reasoning model are received as reasoning nodes. Visual anomaly events include at least one of the following: abnormal parking, wrong-way driving, lane obstruction, sudden deceleration, sudden changes in vehicle trajectory, and pedestrian intrusion. State anomaly events include at least one of the following: sudden changes in traffic flow, a sudden decrease in average vehicle speed, a sudden increase in lane occupancy, a continuous increase in queue length, and abnormal switching of signal phase states. Each node is assigned a unified identifier that records its time window, spatial range, directional attributes, and evidence source, so that visual anomaly event nodes, state anomaly event nodes, and structured reasoning conclusion nodes can be traced and verified within the same graph structure. For example, in a candidate road video clip at a main road intersection, a stationary vehicle occupying a lane and the sudden deceleration of the vehicles behind it are extracted as visual anomaly event nodes. In the candidate traffic state clips within the same time window, the decrease in average vehicle speed and the increase in queue length are extracted as state anomaly event nodes. The conclusion "suspected accident leading to congestion" output by the fusion perception reasoning model is used as the structured reasoning conclusion node, and the time, space, and direction of the three types of nodes are recorded under the unified identifier scheme.
Edge construction and edge weight assignment: In the evidence loop diagram, nodes are connected by edges based on temporal adjacency, spatial adjacency, causal support, and homologous association, and each edge is assigned a corresponding weight. Temporal adjacency connects nodes in adjacent time windows whose anomaly patterns are continuous; the edge weight is determined by the time overlap ratio, the time interval, and the anomaly persistence. Spatial adjacency connects nodes on the same road segment, on adjacent road segments, or at different entrances of the same intersection; the edge weight is determined by spatial distance, topological connectivity, and consistency of driving direction. Causal support connects visual anomaly nodes and state anomaly nodes with structured reasoning conclusion nodes; the edge weight is jointly determined by the strength, the necessity, and the consistency of the anomalous evidence's support for the conclusion. Homologous association connects the nodes of candidate road image or video segments with the nodes of candidate traffic state segments from the same acquisition time and spatial location; the edge weight is jointly determined by the temporal alignment error, the spatial mapping error, and the evidence synchronicity, and it is thresholded and pruned to suppress false associations caused by alignment errors. For example, when a visual anomaly event node and a state anomaly event node fall in the same time window and cover the same spatial range, the homologous association weight can be set to a high value. If, in addition, the structured reasoning conclusion node claims that an accident occurred in the upstream lane while the visual anomaly event node shows a stationary vehicle in that lane with a queue behind it, the causal support weight can be strengthened so that a closed evidence link is formed.
Confidence calculation, probabilistic consistency fusion, and propagation attenuation: The confidence of each node is calculated first. The confidence of a visual anomaly event node is determined at least by anomaly detection intensity, cross-frame stability, and occlusion robustness. The confidence of a state anomaly event node is determined at least by the magnitude and duration of the abrupt state change and by the signal-to-noise ratio after noise suppression. The confidence of a structured reasoning conclusion node is determined at least by the confidence of the conclusion output by the fusion perception reasoning model, the evidence citation coverage, and the self-consistency of the logical links. Probabilistic consistency fusion is then performed on the node confidences based on edge weights to obtain the closed-loop consistency confidence, so that nodes that are supported by multiple high-weight edges and are mutually consistent obtain a higher consistency confidence. Next, confidence is propagated and attenuated along the temporal and spatial adjacency relationships of the evidence loop diagram. Propagation spreads the credible information of the core anomalous node to related nodes that are adjacent in time and space, and attenuation reduces the contribution of nodes that are far from the core anomalous node or only weakly associated with it, which avoids amplifying isolated noise. For example, when congestion propagates from upstream to downstream, the high confidence of the upstream visual anomaly event node can propagate along the spatial adjacency relationship to the downstream state anomaly event node. If, however, only short-term speed fluctuations occur downstream and the temporal adjacency weight is low, the propagated confidence stays conservative because of attenuation, which preserves the accuracy of the evidence loop.
Causal constraint violation penalty and evidence loop completeness: Causal constraints are extracted from the structured reasoning conclusions output by the fusion perception reasoning model. These constraints include at least sequence constraints, causal support constraints, duration constraints, and spatial accessibility constraints. Constraint violation penalties are then applied to the evidence loop diagram. When the structured reasoning conclusion claims that an accident caused the queue length to increase, but the evidence loop diagram shows that the queue length increased before the visual anomaly occurred, or that the spatial range does not satisfy the accessibility constraint, a penalty term is introduced for the corresponding node and causal support edge. Similarly, when the structured reasoning conclusion claims that abnormal switching of signal phase states caused congestion, but the signal phase state nodes in the evidence loop diagram show no abnormal switching and the homologous association is stable, a penalty term is also introduced to suppress the inconsistent conclusion. Finally, the completeness of the evidence loop is calculated from the loop consistency confidence, the propagated and updated confidence, and the loop penalty term, so that the completeness is higher when the evidence chain is more closed, more consistent, and has fewer constraint conflicts across the temporal adjacency, spatial adjacency, causal support, and homologous association relationships. For example, suppose candidate road video clips on an expressway segment continuously show a stationary vehicle occupying a lane, accompanied by sudden deceleration of the following vehicles, while candidate traffic state clips show a continuous decrease in average vehicle speed and a continuous increase in queue length, and the structured reasoning conclusion is that a suspected accident caused the congestion, satisfying the sequence and spatial accessibility constraints. The loop consistency confidence and the propagated and updated confidence then both increase and the penalty term is small, so the evidence loop completeness is rated high. Conversely, if only short-term visual anomalies occur while the state anomalies are discontinuous, or if the structured reasoning conclusion contradicts the state evolution, the penalty term increases and the evidence loop completeness decreases. This gives subsequent early warning escalation a more robust characterization of how sufficient the evidence is.

[0044] In the evidence loop diagram, the weights of the various edges between any two nodes can be defined as follows. Temporal adjacency weight: it is computed from the time-center difference between the two nodes and the overlap ratio of their time windows, with a time decay coefficient that can be obtained by fitting historical normal propagation samples.

[0045] Spatial adjacency weight: it is computed from the spatial distance or topological distance between the two nodes, the road network topology connectivity score, and the driving direction consistency score, with a spatial attenuation coefficient.

[0046] Homologous association weight: it includes an evidence synchronicity term, which indicates, for example, whether the two pieces of evidence originate from the same acquisition period or the same reporting batch. To suppress spurious correlations caused by alignment errors, a noise reduction threshold can be set; the threshold is determined from the distribution observed on normally aligned samples. When the weight falls below the threshold, its contribution is reduced, which limits the impact of low-synchronicity edges on the evidence loop diagram.

[0047] Causal support weight: it combines the strength, the necessity, and the consistency of causal support, each of which must be normalized.
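
The four edge weights can be illustrated with the Python sketch below. The closed forms were lost from the source, so every formula here is an assumption chosen to match the described inputs: exponential decay in time and distance, a multiplicative attenuation for homologous weights below the threshold, and an unweighted mean for causal support. All default parameter values and function names are hypothetical.

```python
import math


def temporal_weight(center_gap_s, overlap_ratio, time_decay=0.01):
    """Assumed form: window overlap ratio damped by the time-center gap."""
    return overlap_ratio * math.exp(-time_decay * abs(center_gap_s))


def spatial_weight(distance_m, connectivity, direction_consistency,
                   space_decay=0.002):
    """Assumed form: connectivity and direction scores damped by distance."""
    return connectivity * direction_consistency * math.exp(-space_decay * distance_m)


def denoise_homologous(weight, threshold=0.4, attenuation=0.2):
    """Shrink homologous weights below the noise threshold instead of dropping them."""
    return weight if weight >= threshold else weight * attenuation


def causal_support_weight(strength, necessity, consistency):
    """Assumed form: mean of three factors already normalized to [0, 1]."""
    return (strength + necessity + consistency) / 3.0
```

Attenuating rather than deleting weak homologous edges keeps the graph connected while limiting how much a possibly misaligned pair can contribute to a loop.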

[0048] Closed-form calculation of node confidence, probabilistic consistency fusion, propagation and decay, constraint violation penalty, and final completeness. Node confidence calculation: The confidences of visual anomaly nodes, state anomaly nodes, and structured reasoning conclusion nodes are each defined as a weighted combination of three factors. For a visual anomaly node, the factors are anomaly detection intensity, cross-frame stability, and occlusion robustness. For a state anomaly node, they are the magnitude of the abrupt state change, its duration, and the signal-to-noise ratio after noise suppression. For a structured reasoning conclusion node, they are the confidence of the model's conclusion, the evidence citation coverage, and the logical link self-consistency. The weighting coefficients of each node type can be determined through historical sample fitting or calibration. For example, visual anomaly node weights of (0.6, 0.3, 0.1) make anomaly detection intensity dominant, followed by cross-frame stability, with occlusion robustness contributing least; the rationale is that a high anomaly score directly reflects an anomaly event, stability provides auxiliary verification, and occlusion has a relatively small impact. State anomaly node weights of (0.5, 0.3, 0.2) make the magnitude of the state change the primary signal, the duration the second, and the signal-to-noise ratio the third; the rationale is that the magnitude of the change directly reflects the anomaly, the duration increases confidence, and the noise level has a limited impact. Structured reasoning conclusion node weights of (0.5, 0.3, 0.2) make the confidence of the model's conclusion the most important factor; coverage provides a reference for the sufficiency of evidence, and self-consistency ensures logical consistency but carries a slightly lower weight, because a single high-confidence conclusion can significantly raise the confidence of a node.
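
The example weights in this paragraph can be applied with a short Python sketch. It assumes the weighted combination is a plain weighted sum of factors already normalized to [0, 1]; the dictionary keys and the function name are illustrative.

```python
# Example weights taken from the paragraph above; factor order matters.
NODE_WEIGHTS = {
    "visual": (0.6, 0.3, 0.1),      # detection intensity, cross-frame stability, occlusion robustness
    "state": (0.5, 0.3, 0.2),       # change magnitude, duration, signal-to-noise ratio
    "conclusion": (0.5, 0.3, 0.2),  # model confidence, citation coverage, self-consistency
}


def node_confidence(kind, factors):
    """Weighted sum of three normalized factors for the given node kind."""
    weights = NODE_WEIGHTS[kind]
    if len(factors) != len(weights):
        raise ValueError("expected exactly three factors")
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("factors must be normalized to [0, 1]")
    return sum(w * f for w, f in zip(weights, factors))
```

For instance, a visual node with factors (0.9, 0.8, 0.5) scores 0.6 × 0.9 + 0.3 × 0.8 + 0.1 × 0.5 = 0.83.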

[0049] Probabilistic consistency fusion: The weighted consistency fusion based on adjacency weights combines the confidence of each node with the confidences of its neighbors, using normalized edge weights. The combined edge weight between two nodes is a weighted sum of the four kinds of edge weight (temporal adjacency, spatial adjacency, causal support, and homologous association), and the weighting coefficients must be normalized. They can be assigned as (0.4, 0.3, 0.2, 0.1) for the following reasons. The temporal adjacency coefficient is set to 0.4 because similarity within the time window is crucial. The spatial adjacency coefficient is set to 0.3 because spatial relationships are important in many scenarios. The causal support coefficient is set to 0.2 because, although causal support is important, its influence is less direct than that of time and space. The homologous association coefficient is set to 0.1 because information from the same source usually carries less weight and other factors take priority.
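
A minimal Python sketch of this fusion step is given below. The coefficients (0.4, 0.3, 0.2, 0.1) come from the paragraph; the fusion rule itself is an assumption, since the original formula is not recoverable: each node's fused value is taken as the average of its own confidence and the normalized-weight average of its neighbors' confidences. The data layout and function names are illustrative.

```python
EDGE_COEFFS = {"temporal": 0.4, "spatial": 0.3, "causal": 0.2, "homologous": 0.1}


def combined_weight(edge):
    """Weighted sum of the four per-relation weights; missing relations count as 0."""
    return sum(coeff * edge.get(kind, 0.0) for kind, coeff in EDGE_COEFFS.items())


def fuse_confidences(confidence, edges):
    """confidence: {node: value}; edges: {(node_a, node_b): {relation: weight}}."""
    fused = {}
    for node, own in confidence.items():
        neighbors = {}
        for (a, b), edge in edges.items():
            if a == node:
                neighbors[b] = combined_weight(edge)
            elif b == node:
                neighbors[a] = combined_weight(edge)
        total = sum(neighbors.values())
        if total == 0.0:
            fused[node] = own  # isolated node keeps its own confidence
            continue
        support = sum(w / total * confidence[n] for n, w in neighbors.items())
        fused[node] = 0.5 * (own + support)
    return fused
```

With this rule, a node surrounded by confident, strongly connected neighbors is pulled upward, while a node whose neighbors disagree with it is pulled toward their weighted average.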

[0050] Confidence propagation and decay can be calculated iteratively.

[0051] The parameters of the iteration are the iteration round index, the propagation coefficient, the graph distance or spatiotemporal distance from each node to the core abnormal node set, and the attenuation coefficient.
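
The iteration can be sketched in Python as follows. The update rule is an assumption because the original formula is missing: in each round a node keeps a share (1 - beta) of its current confidence and receives a share beta of its neighbors' weighted confidence, damped by exp(-mu * d), where d is the node's distance to the core abnormal node set. The symbols `beta` and `mu`, their defaults, and the data layout are hypothetical.

```python
import math


def propagate(confidence, neighbor_weights, core_distance,
              rounds=3, beta=0.5, mu=0.5):
    """Iteratively propagate and attenuate confidence.

    confidence: {node: initial confidence in [0, 1]}
    neighbor_weights: {node: {neighbor: normalized weight}}, weights sum to 1 per node
    core_distance: {node: distance to the core abnormal node set}
    """
    current = dict(confidence)
    for _ in range(rounds):
        updated = {}
        for node, value in current.items():
            incoming = sum(w * current[n]
                           for n, w in neighbor_weights.get(node, {}).items())
            damping = math.exp(-mu * core_distance[node])
            updated[node] = (1.0 - beta) * value + beta * damping * incoming
        current = updated
    return current
```

Because the damping factor is at most 1 and the neighbor weights are normalized, every value stays in [0, 1], and nodes far from the core receive less of the propagated confidence.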

[0052] The penalty for violating causal constraints is defined over the set of causal constraints (such as sequence order, causal support, duration, and spatial reachability). Each constraint has a penalty weight and a violation decision function, which is true when the constraint is violated and false otherwise; the penalty aggregates the weights of the violated constraints.

[0053] The completeness of the evidence loop is defined from three terms: the aggregate of the closed-loop consistency confidence (e.g., a mean or weighted mean), the aggregate confidence after the final round of propagation iterations, and the causal constraint violation penalty. The three corresponding coefficients sum to 1 and can be assigned as (0.5, 0.3, 0.2), for the following reasons. The closed-loop consistency confidence coefficient is set to 0.5 because closed-loop consistency reflects the overall reliability of the evidence chain and is crucial. The propagation iteration confidence coefficient is set to 0.3 because iteration strengthens confidence, but this effect is relatively minor. The causal constraint violation penalty coefficient is set to 0.2 because, although violations affect the result, the penalty carries less weight than the other factors.
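
A minimal Python sketch of the penalty and the completeness follows. The coefficients (0.5, 0.3, 0.2) come from the paragraph; two points are assumptions because the formulas were lost: the penalty is taken as the sum of the weights of violated constraints capped at 1, and the penalty term is subtracted in the completeness, with the result clamped to [0, 1]. Function names are illustrative.

```python
def violation_penalty(constraints):
    """constraints: iterable of (penalty_weight, violated: bool). Capped at 1."""
    total = sum(weight for weight, violated in constraints if violated)
    return min(1.0, total)


def loop_completeness(consistency, propagated, penalty,
                      coeffs=(0.5, 0.3, 0.2)):
    """Assumed form: a*consistency + b*propagated - c*penalty, clamped to [0, 1]."""
    a, b, c = coeffs
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    raw = a * consistency + b * propagated - c * penalty
    return min(1.0, max(0.0, raw))
```

For example, a consistency of 0.8 and a propagated confidence of 0.7 give 0.61 with no violations and 0.41 when the penalty reaches 1.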

[0054] To make the road network topology connectivity score, the driving direction consistency score, and the evidence synchronicity, as well as the strength, necessity, and consistency of causal support, computable in practice, at least one reproducible calculation method is given below for each.
(1) Road network topology connectivity score: The road network is represented as a graph whose nodes / edges represent road segments or intersections and their connectivity. The score is derived from the shortest topological path length between the road segments of the two nodes (e.g., measured by the number of road segments or by topological distance), with a dedicated value when the two are unreachable. As an alternative implementation, "reachability" binarization or the Jaccard similarity of the sets of adjacent road segments can be used.
(2) Driving direction consistency score: A directional attribute (such as a lane direction vector or a main driving direction angle) is recorded for each node, and the score is derived from the absolute value of the difference between the direction angles of the two nodes. When a node has no direction attribute, the score can be set to a neutral value so that it acts as a non-penalty item.
(3) Evidence synchronicity: Let the collection period number or reporting batch number of the two evidence sources be given. The synchronicity takes one value when the two numbers are the same and another value otherwise. When the timestamp difference between the two pieces of evidence is available, the synchronicity can instead be derived from that difference. When a collection period identifier exists but the timestamp is missing, the batch consistency determination can be used alone.
(4) Causal support strength: This characterizes the extent to which an evidence node supports the conclusion node. It can be taken as the anomaly magnitude indicator of the visual anomaly or the state anomaly, or as the attention / contribution that the fusion perception reasoning model assigns to the evidence (e.g., normalized attention weights or feature importance).
(5) Causal support necessity: This characterizes how much the confidence of the conclusion declines after the evidence is removed. Counterfactual ablation can be used: the confidence of the conclusion is recalculated after the candidate evidence is removed or masked and is compared with the original confidence.
(6) Causal support consistency: This characterizes how consistent a piece of evidence is with the other evidence for the conclusion. A definition compatible with the closed-loop consistency confidence can be used: for example, the local consistency score of the evidence (the proportion of neighborhood evidence that agrees with it on the direction of the conclusion, or the proportion that is consistent with the semantic constraints / state evolution) is calculated and normalized.
The various normalization parameters can be obtained by fitting historical samples or by on-site calibration. When any one of these quantities is unavailable, a default value can be substituted and its influence in the fusion weights reduced, so that the system can still operate when some information is missing.
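
Two of the items above lend themselves to a short Python sketch: the Jaccard alternative for topology connectivity in item (1) and the counterfactual ablation in item (5). The Jaccard similarity is the standard set definition. The necessity formula is an assumption (relative drop in conclusion confidence, clamped to [0, 1]), and `score_fn` is a hypothetical callable standing in for the reasoning model's confidence for a given evidence list.

```python
def jaccard_connectivity(adjacent_a, adjacent_b):
    """Jaccard similarity of the two nodes' sets of adjacent road segments."""
    a, b = set(adjacent_a), set(adjacent_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def support_necessity(evidence, target, score_fn):
    """Relative confidence drop when `target` is removed from the evidence.

    score_fn is a hypothetical stand-in for recomputing the conclusion
    confidence from a list of evidence items.
    """
    original = score_fn(list(evidence))
    if original <= 0.0:
        return 0.0
    ablated = score_fn([item for item in evidence if item != target])
    return min(1.0, max(0.0, (original - ablated) / original))
```

If removing a piece of evidence leaves the conclusion confidence unchanged, its necessity is 0; if the confidence collapses to 0, its necessity is 1.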

[0055] In one specific implementation, the dynamic adjustment proceeds as follows. The system adjusts the closed-loop threshold and the duration threshold according to the injection suspicion score and the evidence loop integrity: when the injection suspicion score is high and the evidence loop integrity is low, the system raises the closed-loop threshold and extends the duration threshold. Specifically, an injection suspicion score above the preset high-risk threshold indicates that the event is highly abnormal and may affect traffic safety. In this case the system raises the closed-loop threshold so that a more closed and complete evidence chain is required, which prevents an incorrect warning judgment based on a single piece of evidence. For example, suppose a candidate road video clip shows a stationary vehicle blocking the middle of the road, accompanied by sudden deceleration of the vehicles behind it, while the related traffic status clip shows only a short-term decrease in vehicle speed. Because the evidence loop integrity is low and there are multiple suspicious points, the closed-loop threshold is raised to avoid triggering a warning prematurely. At the same time, the duration threshold is lengthened, so that the abnormal event must persist longer before a warning is confirmed; this improves robustness under uncertain conditions and reduces the likelihood of false alarms.
The system also adjusts its warning level mapping strategy and optimizes its response according to the injection suspicion score and the evidence loop integrity. Specifically, when the injection suspicion score is low and the evidence loop integrity is high, the system prioritizes a lower-level warning, such as simply alerting the user to potential risks, which gives a quick response and reduces resource consumption through a simplified process. Conversely, when the evidence loop integrity is low and the injection suspicion score is high, the warning level is raised, and a detailed warning report and emergency measures are pushed so that the user is informed in time and can take preventive action. For example, at a highway intersection, if the queue length in a traffic status segment keeps increasing and, supported by visual anomalies, the evidence loop integrity is high, the system quickly escalates to a high-risk warning. Conversely, if the queue length is low, only a normal alarm is issued and monitoring continues, so that the response strategy can be adjusted flexibly and resources and reaction time are allocated reasonably.

[0056] Step Six: Under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules, the fusion perception inference model is invoked to perform anomaly identification and early warning inference on the candidate evidence set, and access control is applied to the intermediate inference results and the structured early warning output according to the dynamic gating parameters, generating the structured early warning output. Invoking the fusion perception inference model allows the multimodal data and the candidate evidence to be processed jointly, so that potential abnormal events are identified and the early warning inference draws on all available types of traffic information. The dynamic gating parameters and the dynamically adjusted rules keep the inference process within the risk control strategy, prioritizing high-risk evidence and controlling access to sensitive information. The structured early warning output provides standardized results, including the early warning level, probability, location label, and description, which facilitates user understanding and subsequent processing while maintaining data security and system controllability.

[0057] In one specific implementation, step six is as follows. Anomaly identification and early warning inference are performed under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules. The fusion perception inference model is called to perform joint inference on the candidate evidence set: first, the consistency between the visual anomalies in the candidate road image segments or candidate road video segments and the state anomalies in the candidate traffic state segments is checked; then, under the constraints of the dynamically adjusted early warning escalation rules, the early warning level and early warning probability are inferred. The dynamic gating parameters limit the range of available evidence, the evidence granularity and the inference link depth during inference, so that the inference process meets business security requirements while remaining interpretable. The inference output is structured into four elements: early warning level, early warning probability, early warning location label and early warning description. The early warning location label includes at least the road segment, the lane direction, the distance range from key nodes and the time window identifier. For example, suppose the candidate road video segments show vehicles occupying the lane and standing still at a ramp entrance while the vehicles behind brake suddenly and frequently, the candidate traffic state segments show a continuous decrease in average vehicle speed and a continuous increase in queue length, and the dynamically adjusted early warning escalation rules have a reduced closed-loop threshold and a shortened duration threshold. The fusion perception inference model can then give a higher early warning level and a higher early warning probability more quickly, and locate the early warning location label to the outer lane of the ramp entrance and a range of several meters upstream.
Access control is applied to the intermediate inference results and the structured early warning output using the dynamic gating parameters, and semantic layering control of the output is implemented based on the injection suspicion score. The intermediate inference results are packaged hierarchically and include at least a list of evidence citations, key causal links, confidence propagation paths, and summaries of conflict items. Visibility is set according to the dynamic gating parameters, so that holders of different access permissions receive structured early warning outputs matching their responsibilities. When the injection suspicion score is high, the early warning description uses a more conservative tone and explicitly marks the sources of uncertainty and the items requiring supplementary evidence, to avoid over-assertion. When the injection suspicion score is low and the evidence loop is complete, the early warning description uses a more explicit tone and provides suggested handling directions and a description of the scope of impact. The semantic strength of the output is thereby matched to both risk sensitivity and information sufficiency. For example, in a scenario where strong nighttime glare makes the visual evidence unstable and the injection suspicion score is high, even if the inferred warning level is high, the structured early warning output can still phrase the early warning description as a possible lane obstruction or accident causing congestion, and suggest reviewing upstream segments and supplementing roadside detection data. In a scenario where the multi-source evidence forms a closed loop and the injection suspicion score is low, the structured early warning output can directly describe an accident suspected of causing lane obstruction and congestion, and at the same time output more refined early warning location labels and a higher early warning probability to support rapid response.
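
The four-element output, the layered intermediate results and the tone selection described in this implementation can be sketched in Python as follows. This is only an illustrative sketch: the field names, the role names, the representation of the gating parameters as a role-to-layer mapping, and the 0.7 / 0.3 thresholds are all assumptions that do not come from the source.

```python
from dataclasses import dataclass, field, replace


@dataclass
class LocationLabel:
    road_segment: str
    lane_direction: str
    distance_range_m: tuple  # distance range from a key node, in meters
    time_window: str


@dataclass
class StructuredWarning:
    level: str
    probability: float
    location: LocationLabel
    description: str
    # Layered intermediate results, keyed by layer name.
    intermediate: dict = field(default_factory=dict)


def describe(event, suspicion, integrity, high=0.7, low=0.3):
    """Choose the tone of the description (thresholds are ASSUMED values)."""
    if suspicion >= high:
        # Conservative tone: mark uncertainty and ask for more evidence.
        return (f"Possible {event}. Uncertainty: evidence may be unreliable. "
                "Supplementary evidence required before confirmation.")
    if suspicion <= low and integrity >= high:
        # Explicit tone: evidence loop is closed and suspicion is low.
        return f"Suspected {event}. Suggested handling and impact scope attached."
    return f"{event.capitalize()} under observation."


def apply_access_control(warning, gating, role):
    """Return a copy exposing only the layers that `gating` grants to `role`.

    `gating` is a hypothetical dict mapping role names to sets of layer names;
    unknown roles see no intermediate results.
    """
    allowed = gating.get(role, set())
    visible = {k: v for k, v in warning.intermediate.items() if k in allowed}
    return replace(warning, intermediate=visible)
```

Denying all layers to unknown roles is a design choice made for the sketch, so that a missing gating entry cannot expose intermediate inference results.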

[0058] Dynamic closed-loop threshold and duration threshold update: Starting from a basic closed-loop threshold and a basic duration threshold, and under the influence of the injection suspicion score and the closed-loop integrity, the dynamic thresholds can be defined as: … ; … ; where the adjustment intensity coefficients can be obtained through calibration that minimizes the historical false alarm / missed alarm cost. If the injection suspicion score is significantly higher than the closed-loop integrity, the closed-loop threshold rises and the duration threshold lengthens, making escalation more cautious; if the closed-loop integrity is significantly higher than the injection suspicion score, the thresholds are lowered, making escalation more agile.
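
One way to realize the qualitative behavior described in this paragraph is sketched below in Python. The threshold formulas themselves are not reproduced in this text, so the linear dependence on the gap between suspicion and integrity, the coefficient names `alpha` and `beta`, their default values, and the clipping bounds are all assumptions, chosen only so that the thresholds rise when suspicion dominates and fall when integrity dominates.

```python
def dynamic_thresholds(tau0, t0, suspicion, integrity,
                       alpha=0.5, beta=0.5,
                       tau_bounds=(0.0, 1.0), t_min=0.0):
    """Return (closed_loop_threshold, duration_threshold).

    ASSUMED form: both base thresholds are scaled by
    1 + coefficient * (suspicion - integrity).
    """
    gap = suspicion - integrity  # > 0: suspicious and poorly closed evidence
    tau = tau0 * (1.0 + alpha * gap)
    tau = min(max(tau, tau_bounds[0]), tau_bounds[1])  # keep within bounds
    duration = max(t_min, t0 * (1.0 + beta * gap))
    return tau, duration
```

With a base threshold of 0.6 and a base duration of 30 seconds, a suspicion score of 0.9 against an integrity of 0.2 raises both values, while the reverse situation lowers them. In practice `alpha` and `beta` would be calibrated against historical false alarm and missed alarm costs, as the paragraph states.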

[0059] The high-risk threshold can be implemented in a quantile-adaptive manner: … ; when the injection suspicion score exceeds the high-risk threshold, the "conservative output mode" is triggered: the gating parameters are tightened, and the structured early warning output strengthens the identification of uncertainty sources and provides recommendations for supplementary evidence; when the injection suspicion score is low and the closed-loop integrity is high, the "fast response mode" is triggered to ensure agile escalation. (This mechanism is consistent with the output semantic layering control and the access control.)
Early warning level mapping strategy: The early warning probability or escalation confidence can be obtained from a fusion mapping of the closed-loop integrity and the injection suspicion score: … ; where the weighting coefficient can be adjusted dynamically according to business preferences: it is increased when the closed-loop integrity should constrain the result more strongly, and reduced when the suspicion should constrain it more strongly. This mapping strategy automatically adjusts the response mode in the cases of "high closed-loop integrity and low suspicion" and "low closed-loop integrity and high suspicion".
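
The mode switching and fusion mapping of this paragraph can be illustrated with the Python sketch below. Because the quantile expression and the mapping formula are not reproduced in this text, the following are assumptions: the high-risk threshold is taken as the `q`-quantile of historical suspicion scores, the "low" and "high" cut-offs default to 0.3 and 0.7, and the fused confidence is a convex combination `lam * integrity + (1 - lam) * (1 - suspicion)`. All names are hypothetical.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty sequence, 0 <= q <= 1."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def select_mode(suspicion, integrity, history, q=0.9, s_low=0.3, c_high=0.7):
    """Pick an output mode; the threshold adapts to historical scores."""
    high_risk = quantile(history, q)  # ASSUMED quantile-adaptive threshold
    if suspicion > high_risk:
        return "conservative"  # tighten gating, flag uncertainty sources
    if suspicion <= s_low and integrity >= c_high:
        return "fast"  # agile escalation
    return "normal"


def fused_confidence(integrity, suspicion, lam=0.5):
    """ASSUMED fusion mapping: larger lam weights closed-loop integrity more."""
    return lam * integrity + (1.0 - lam) * (1.0 - suspicion)
```

Raising `lam` makes the result follow the closed-loop integrity more closely, and lowering it makes the suspicion score dominate, which mirrors the adjustment direction described for the weighting coefficient.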

[0060] Example 2: A multimodal clue retrieval and early warning system for traffic conditions, including: a data acquisition module, which acquires multimodal traffic data; a feature representation module, which generates traffic clue representations; a clue retrieval module, which extracts a candidate evidence set; a suspicion assessment module, which calculates the injection suspicion score and generates dynamic gating parameters; a closed-loop reasoning module, which constructs the evidence loop diagram, calculates the completeness of the evidence loop, and dynamically adjusts the early warning escalation rules; and an early warning output module, which, under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules, performs early warning reasoning on the candidate evidence set and generates the structured early warning output.

[0061] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A multimodal clue retrieval and early warning method for traffic situation, characterized in that it includes the following steps: Step 1: acquire and align multimodal traffic data for the traffic scenario to obtain aligned multimodal traffic data; Step 2: perform feature extraction and semantic representation learning on the aligned multimodal traffic data to obtain multimodal traffic clue representations in a unified semantic space; Step 3: receive a traffic clue query request, perform a search in the traffic clue index based on the traffic clue query request and the multimodal traffic clue representations to obtain a candidate traffic clue set, and extract from the candidate traffic clue set a candidate evidence set related to the traffic clue query request; Step 4: calculate the injection suspicion score for the candidate evidence set, and generate dynamic gating parameters based on the injection suspicion score to control the early warning reasoning path and the early warning output permissions; Step 5: construct an evidence loop diagram based on the candidate evidence set, calculate the completeness of the evidence loop, and dynamically adjust the execution of the early warning escalation rules based on the injection suspicion score and the completeness of the evidence loop; Step 6: under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules, invoke the fusion perception reasoning model to perform anomaly identification and early warning reasoning on the candidate evidence set, and apply access control to the intermediate reasoning results and the structured early warning output according to the dynamic gating parameters to generate the structured early warning output.

2. The multimodal clue retrieval and early warning method for traffic situation as described in claim 1, characterized in that the multimodal traffic data includes at least road image data or road video data, as well as traffic status data that corresponds to the road image data or road video data in time and space.

3. The multimodal clue retrieval and early warning method for traffic situation as described in claim 1, characterized in that the feature extraction and semantic representation learning involves encoding the road image data or road video data into visual semantic representations, encoding the traffic state data into spatiotemporal semantic representations, and fusing the visual semantic representations and the spatiotemporal semantic representations to obtain the multimodal traffic clue representations in a unified semantic space.

4. The multimodal clue retrieval and early warning method for traffic situation as described in claim 1, characterized in that the candidate evidence set includes at least candidate road image segments or candidate road video segments, and candidate traffic state segments.

5. The multimodal clue retrieval and early warning method for traffic situation as described in claim 4, characterized in that the injection suspicion score is determined based on at least one of the following or a combination thereof: enhanced consistency divergence: multiple viewpoint representations are generated for the same candidate road image segment or candidate road video segment, each viewpoint representation is input into the fusion perception reasoning model to obtain the corresponding reasoning result, and the enhanced consistency divergence is calculated based on the consistency evaluation and the logical link consistency evaluation between the reasoning results; semantic causal conflict score: a set of semantic constraints is constructed from the structured semantics output by the fusion perception reasoning model, a traffic state evidence map is constructed from the candidate traffic state segments, and a causal consistency test is performed between the set of semantic constraints and the traffic state evidence map to obtain the semantic causal conflict score; high-risk carrier aggregation score: high-risk carriers in the candidate road image segments or candidate road video segments are detected and their structures are analyzed, a high-risk carrier map is constructed based on the detection and structure analysis results, and risk aggregation reasoning is performed based on the high-risk carrier map to obtain the high-risk carrier aggregation score.

6. The multimodal clue retrieval and early warning method for traffic situation as described in claim 5, characterized in that the evidence loop diagram includes at least the following nodes: visual anomalies extracted from the candidate road image segments or candidate road video segments, state anomalies extracted from the candidate traffic state segments, and structured reasoning conclusions output by the fusion perception reasoning model; and edges consisting of temporal adjacency, spatial adjacency, causal support, and homologous association, with each edge having a corresponding edge weight.

7. The multimodal clue retrieval and early warning method for traffic situation as described in claim 6, characterized in that the completeness of the evidence loop is determined at least through the following: calculating the node confidence for each node and performing probabilistic consistency fusion on the node confidences based on the edge weights to obtain the loop consistency confidence; performing confidence propagation and decay along the temporal and spatial adjacency relationships on the evidence loop diagram; extracting causal constraints from the structured reasoning conclusions output by the fusion perception reasoning model and applying constraint violation penalties on the evidence loop diagram; and calculating the completeness of the evidence loop based on the loop consistency confidence, the propagated and updated confidence, and the loop penalty term.

8. The multimodal clue retrieval and early warning method for traffic situation as described in claim 7, characterized in that the dynamic control includes at least closed-loop threshold adjustment, duration threshold adjustment, and early warning level mapping strategy adjustment.

9. The multimodal clue retrieval and early warning method for traffic situation as described in claim 8, characterized in that the structured early warning output includes at least the early warning level, early warning probability, early warning location label, and early warning description; and, based on the injection suspicion score, the structured early warning output is subject to semantic layering control.

10. A multimodal clue retrieval and early warning system for traffic situation, applied to the multimodal clue retrieval and early warning method for traffic situation as described in any one of claims 1-9, characterized in that it includes: a data acquisition module, which acquires multimodal traffic data; a feature representation module, which generates traffic clue representations; a clue retrieval module, which extracts a candidate evidence set; a suspicion assessment module, which calculates the injection suspicion score and generates dynamic gating parameters; a closed-loop reasoning module, which constructs the evidence loop diagram, calculates the completeness of the evidence loop, and dynamically adjusts the early warning escalation rules; and an early warning output module, which, under the constraints of the dynamic gating parameters and the dynamically adjusted early warning escalation rules, performs early warning reasoning on the candidate evidence set and generates the structured early warning output.