A traffic robot event identification method based on multi-sensor fusion
By utilizing spatial spectrum estimation and multi-sensor fusion of acoustic signals in traffic scenarios, combined with directional perception from cameras and radar, a closed-loop processing structure is constructed. This solves the problems of insufficient utilization of acoustic information and uncertainty in recognition results in existing traffic event recognition, and achieves stable event recognition and rapid response in complex urban environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG XIULIAN TECH CO LTD
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-16
AI Technical Summary
Existing traffic incident recognition schemes lack sufficient utilization of acoustic information and lack closed-loop feedback of recognition results, making it difficult to reliably utilize acoustic information to perform spatial guidance functions and eliminate recognition uncertainties in complex urban environments.
By estimating the spatial spectrum of acoustic signals based on the spatial structure information of traffic scenes, combining multi-sensor fusion of cameras and radar, driving directional perception with the direction of arrival of sound sources, and pre-screening events through acoustic fingerprint templates adapted to noise, a closed-loop processing structure is constructed that includes physical time sequence verification, active multi-view verification, and causal graph inference, so as to realize event authenticity verification and type inference.
It significantly improves the reliability of acoustic sensors in spatial guidance decision-making in urban environments, shortens the response time of sensor directional focusing, and eliminates uncertainty through a closed-loop feedback mechanism, achieving complete perception-to-verification, reasoning, and self-evolution capabilities.
Figure CN122223975A_ABST
Description
Technical Field
[0001] This invention relates to the field of traffic control system technology, and more specifically to a traffic robot event recognition method based on multi-sensor fusion.
Background Technology
[0002] Modern transportation systems place increasingly higher demands on the rapid and accurate identification of traffic incidents. Existing traffic incident detection solutions mainly include roadside fixed multi-sensor systems and vehicle-mounted sensing systems. Both types of solutions mostly employ a fusion recognition framework combining visual and radar sensors. While these solutions have achieved some success in engineering applications, they still have the following shortcomings.
[0003] 1. Existing solutions do not make sufficient use of acoustic information. Traffic incidents are often accompanied by acoustic signals such as braking sounds, collision sounds, and explosion sounds. Acoustic signals have the characteristics of omnidirectional perception, insensitivity to obstruction, and fast response to transient events. However, in urban traffic environments, multipath reflections caused by buildings, overpasses, and large vehicles, as well as complex background noise, can significantly reduce the stability of acoustic positioning. As a result, existing systems usually only use acoustics as an auxiliary feature and are unable to undertake the function of sensor orientation guidance.
[0004] 2. Most existing multi-sensor fusion systems employ a linear feedforward processing structure, lacking a closed-loop mechanism that uses the recognition results to adjust the sensing strategy. When there is uncertainty in the event recognition results, the system can usually only passively wait for subsequent data, and it is difficult to actively change the observation position or direction to resolve the uncertainty. For roadside fixed sensing systems, this problem is even more pronounced when the event is in an obstructed or blind spot.
[0005] Therefore, there is an urgent need for a traffic robot event recognition method that can stably utilize acoustic information to perform spatial guidance functions in complex urban environments and has a closed-loop feedback mechanism to actively resolve recognition uncertainties.
Summary of the Invention
[0006] The main objective of this invention is to provide a traffic robot event recognition method based on multi-sensor fusion, which aims to solve the technical problems of insufficient utilization of acoustic information and lack of closed-loop feedback in existing traffic event recognition schemes.
[0007] To achieve the above objective, this invention provides a traffic robot event recognition method based on multi-sensor fusion, comprising:
S1: Based on the spatial structure information of the traffic scene, perform spatial spectrum estimation on the acoustic signal collected by the microphone array, suppress multipath reflection peaks, and output the direction of arrival of the sound source.
S2: Drive the camera and radar to focus based on the direction of arrival of the sound source, pre-screen events using noise-adapted acoustic fingerprint templates, and output event candidates and initial confidence levels after confirmation by multiple sensors.
S3: Based on the multi-sensor time-series pattern arising from the speed difference between sound waves and electromagnetic waves, verify the authenticity of event candidates using an adaptive time-series tolerance window, update the initial confidence, and infer the event type.
S4: For event candidates whose confidence level is still uncertain after verification in S3, drive the robot to perform active multi-view verification.
S5: Input the confirmed events into the event causal graph to predict related events. The event causal graph uses event types as nodes and causal relationships as directed edges. The edge parameters are updated online using verification data. The prediction results are fed back to S1 and S2 to adjust the perception parameters.
S6: Continuously detect multi-sensor inconsistencies; after ruling out sensor degradation, incrementally cluster the inconsistencies that cannot be explained by known event types, and generate new event type prototypes to expand the event causal graph.
[0008] Optionally, it also includes: using high-precision maps and 3D city models to construct a multipath propagation model to obtain information on the distribution of reflecting surfaces, calculating the direction of arrival of reflected waves using the mirror sound source method, and excluding spatial spectral peaks corresponding to the direction of arrival of reflected waves.
[0009] Optionally, in S1, the noise covariance matrix used for spatial spectrum estimation adaptively adjusts the integration time according to the traffic flow parameters estimated by the radar in real time, so as to improve the estimation reliability in high noise environment and improve the time resolution in low noise environment.
[0010] Optionally, in S1, after suppressing the multipath reflection peak, the remaining peak is scored in multiple dimensions based on road geometry prior constraints, reflector surface distribution priors, historical sound source active area statistics, and spectral peak intensity, and the sound source arrival direction is determined based on the scoring results.
[0011] Optionally, in S2, the noise adaptation of the acoustic fingerprint template includes: performing an environmental adaptation transformation on the pre-stored acoustic fingerprint template according to the current background noise spectrum, generating an adaptation template under the current noise conditions, and performing event pre-screening based on the adaptation template.
[0012] Optionally, after S2, the method further includes: performing a spatial consistency check between the direction of arrival of the sound source and the radar pre-tracking: when there is an abnormal moving target being tracked by the radar in the spatial region corresponding to the direction of arrival of the sound source, increasing the initial confidence level of the corresponding direction; when no abnormal moving target is detected, decreasing the initial confidence level of the corresponding direction.
[0013] Optionally, in S3, the posterior probability of time-series templates for multiple event types is calculated in parallel. Whenever a new sensor channel responds, the posterior probability of each event type is updated, thereby realizing event authenticity verification and event type inference. An adaptive time-series tolerance window calibrates the processing delay of each sensor channel and dynamically adjusts the delay parameters according to the distance of the event from the robot.
[0014] Optionally, in S4, active multi-view verification is performed on the planned observation trajectory under traffic scenario safety constraints. Traffic scenario safety constraints include at least one of dynamic safety distance constraints with traffic flow, right-of-way priority constraints, and safety constraints of observation stop points. Emergency events are directly reported without active multi-view verification. When it is impossible to safely reach the observation point, a downgraded verification of long-distance zoom observation or directional acoustic enhancement acquisition is initiated.
[0015] Optionally, in S5, the online update of edge parameters includes the recursive update of causal transition probabilities and the adjustment of time delay intervals; a prediction confidence decay mechanism is introduced in forward inference to terminate propagation when the cumulative path probability is lower than a preset threshold; the spatial range of associated events is filtered or weighted according to the direction of arrival of the sound source; the perception parameters include at least one of acoustic pre-screening threshold, visual detection sensitivity, and radar perception parameters; when the predicted event does not occur within the timeout period, the transition probability of the corresponding causal edge is reduced.
[0016] Optionally, in S6, multi-sensor inconsistency is detected by residual analysis of the sensors; sensor degradation is excluded by monitoring the signal-to-noise ratio of each sensor channel and the detection consistency with known static references; the inconsistency incremental clustering is a clustering method based on feature vector distance, and when the number of clustered samples exceeds the threshold, it is registered as a new event type prototype; when multi-sensor inconsistency is detected, S4 is triggered to perform directional active verification of the source direction of multi-sensor inconsistency, and the verification result of S4 is fed back to the inconsistency incremental clustering to update the event prototype.
[0017] Compared with the prior art, the present invention has the following beneficial effects: First, this invention overcomes the acoustic multipath effect caused by buildings and vehicles in urban traffic environments by using a multipath propagation model driven by spatial structure information of traffic scenarios and a spatial spectrum peak screening strategy. This enables acoustic sensors to reach a level of reliability in urban environments that can undertake spatial guidance decision-making. Combined with acoustically guided multi-sensor directional focusing and acoustic fingerprint template matching adapted to current noise conditions, acoustics is elevated from an auxiliary mode to a cross-sensor spatial leader channel, significantly shortening the response time of sensor directional focusing.
[0018] Second, this invention constructs a closed-loop processing structure that includes physical time-series verification, active multi-view verification, causal graph inference feedback, and unknown event discovery. It uses the temporal pattern of the difference in propagation speed between sound waves and electromagnetic waves to verify the authenticity of events and suppress false alarms. For events with uncertain confidence, the robot actively changes its observation position to obtain incremental information to resolve the uncertainty. Chain inference is performed through event causal graphs, and the prediction results are fed back to adjust the perception parameters to form a positive feedback closed loop. At the same time, the inconsistency of multiple sensors is transformed into unknown event detection signals, and the event type coverage is automatically expanded through incremental clustering. This enables the system to have a complete closed-loop capability from perception to verification, inference, feedback, and self-evolution.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the structures shown in these drawings without creative effort.
[0020] Figure 1 is a flowchart of the traffic robot event recognition method based on multi-sensor fusion according to the present invention.
The objectives, features, and advantages of this invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.
[0022] As shown in Figure 1, the method in this embodiment is executed according to S1 to S6, where S1 to S5 constitute the main processing link, and S6 serves as a parallel processing link to continuously detect unknown events. S1 outputs the direction of arrival of the sound source, which is used to drive the camera and radar directional perception in S2. The radar pre-tracking result in S2 participates in the spatial consistency verification of the output direction of S1. The verification result of S3 determines whether the event candidate enters S4. The verification result of S4 is used to update the causal graph edge parameters in S5. The prediction result of S5 is fed back to S1 and S2 to adjust the perception parameters. The new event type prototype generated in S6 is used to expand the event causal graph in S5.
[0023] The traffic robot in this embodiment is a multi-sensor mobile platform with autonomous navigation capabilities, equipped with a uniform circular microphone array, a PTZ camera, and a 77 GHz millimeter-wave radar. The microphone array consists of eight omnidirectional microphones, covering a target frequency band from 200 Hz to 8000 Hz. The PTZ camera supports 360° horizontal rotation, ±45° vertical pitch, and optical zoom. The millimeter-wave radar has a detection range of at least 150 m and an angular resolution of at least 1°. The robot is also equipped with an inertial navigation unit, a satellite positioning module, and high-precision maps and 3D city model data.
[0024] Traffic robots perform traffic incident inspection tasks on urban roads, intersections, underpasses, or construction sites.
[0025] S1: Based on the spatial structure information of the traffic scene, spatial spectrum estimation is performed on the acoustic signal collected by the microphone array, multipath reflection peaks are suppressed, and the direction of arrival of the sound source is output.
[0026] In step S1, a short-time Fourier transform is performed on the multi-channel acoustic signals acquired by the microphone array, and a spatial covariance matrix $\mathbf{R}(f)$ is constructed at each frequency $f$. Eigenvalue decomposition of $\mathbf{R}(f)$ yields the noise subspace matrix $\mathbf{U}_n$. The spatial spectrum is constructed using the MUSIC method:
$$P(\theta,\varphi) = \frac{1}{\mathbf{a}^{H}(\theta,\varphi)\,\mathbf{U}_n \mathbf{U}_n^{H}\,\mathbf{a}(\theta,\varphi) + \varepsilon} \tag{1}$$
where $P(\theta,\varphi)$ is the value of the spatial spectrum function at azimuth angle $\theta$ and elevation angle $\varphi$, $\mathbf{a}(\theta,\varphi)$ is the array steering vector, $\mathbf{U}_n$ is the noise subspace matrix, $(\cdot)^{H}$ is the conjugate transpose operator, and $\varepsilon$ is an extremely small positive number that prevents the denominator from being zero; an example value can be taken as... .
[0027] This step maps the multi-channel acoustic signal into an energy distribution at azimuth and elevation angles through spatial spectrum estimation, which is used to extract potential sound source direction information from complex traffic background noise. The spatial spectrum construction in Equation (1) utilizes the orthogonality between the array steering vector and the noise subspace, making the spectral values at the actual sound source direction relatively prominent, while the spectral values at the direction consistent with the noise subspace are suppressed, thus providing a basis for subsequent peak screening. This processing does not directly determine the sound source direction based on the single-channel sound pressure amplitude, but rather uses the array spatial structure information to improve the direction estimation capability, so as to obtain candidate directions that can be used for subsequent directional perception in traffic scenarios with multiple noise sources, reflections, and obstructions.
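The spatial spectrum construction described above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the patented implementation: the array radius (0.05 m), the far-field steering model, the speed of sound (343 m/s), the value of `eps`, and the function names `uca_steering` and `music_spectrum` are all hypothetical choices that do not come from the source. The search is restricted to a single elevation for brevity.

```python
import numpy as np

def uca_steering(az, el, freq, radius=0.05, n_mics=8, c=343.0):
    """Far-field steering vector of a uniform circular array.

    radius and c are assumed values; n_mics=8 follows the embodiment.
    """
    mic_angles = 2.0 * np.pi * np.arange(n_mics) / n_mics
    # Time advance of each microphone relative to the array centre.
    delay = radius * np.cos(el) * np.cos(az - mic_angles) / c
    return np.exp(1j * 2.0 * np.pi * freq * delay)

def music_spectrum(R, n_sources, az_grid, el, freq, eps=1e-9):
    """Evaluate 1 / (a^H Un Un^H a + eps) over an azimuth grid.

    eps is a hypothetical small constant guarding against division by zero.
    """
    n_mics = R.shape[0]
    _, eigvecs = np.linalg.eigh(R)          # eigenvalues in ascending order
    Un = eigvecs[:, : n_mics - n_sources]   # smallest eigenvalues -> noise subspace
    spectrum = np.empty(len(az_grid))
    for i, az in enumerate(az_grid):
        a = uca_steering(az, el, freq, n_mics=n_mics)
        proj = Un.conj().T @ a              # projection onto the noise subspace
        spectrum[i] = 1.0 / (np.real(np.vdot(proj, proj)) + eps)
    return spectrum

# Usage: one simulated source at 60 degrees azimuth, 1 kHz, weak white noise.
az_grid = np.deg2rad(np.arange(0.0, 360.0, 1.0))
a0 = uca_steering(np.deg2rad(60.0), 0.0, 1000.0)
R = np.outer(a0, a0.conj()) + 0.01 * np.eye(8)
P = music_spectrum(R, n_sources=1, az_grid=az_grid, el=0.0, freq=1000.0)
peak_index = int(np.argmax(P))
```

In this sketch the number of sources is passed in directly; the eigenvalue threshold or information-theoretic selection mentioned in the text is not modelled.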
[0028] An angle search is performed over the spatial spectrum in azimuth and elevation, and the locations of local maxima are used as the initial candidate results for the direction of arrival of the sound source. High-precision maps and 3D city models are used to provide road geometry information and reflector distribution information, and the direction of arrival of reflected waves is calculated using the mirror sound source method. The purpose of introducing road geometry information and reflector distribution information is to explicitly incorporate spatial priors in the traffic scene into the acoustic localization process. Relying solely on the spatial spectrum peaks themselves, it is difficult to distinguish between direct sound and reflected sound formed by building facades, the bottom of overpasses, or the surface of large vehicles. However, by combining the mirror sound source method to predict the direction of arrival of reflected waves, reflection peaks consistent with the scene structure can be removed from the candidate set, thereby reducing the interference of multipath effects on subsequent camera and radar pointing control. This processing transforms the acoustic channel from providing only coarse anomaly cues into an input channel that can participate in multi-sensor spatial guidance. The aforementioned mirror sound source method is mainly used to suppress static multipath reflections caused by fixed building structures. For dynamic reflections caused by moving vehicles, the position of the moving reflector can be obtained through radar tracking results and the reflector model can be updated in real time, or the score of such peaks can be reduced in the multi-dimensional scoring by statistically analyzing historical active areas of sound sources.
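A minimal sketch of the mirror sound source idea follows, assuming a 2-D scene, a single infinite planar reflector, and a hypothesised source position (for example from a radar track). These simplifications and the function names `image_source`, `azimuth_deg`, and `reflected_doa_deg` are illustrative assumptions; the source text does not specify how reflectors or candidate source positions are represented.

```python
import math

def image_source(source, wall_point, wall_normal):
    """Mirror a 2-D source position across a planar reflector.

    The reflector is given by any point on it and a normal vector.
    """
    nx, ny = wall_normal
    norm = math.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    # Signed distance from the source to the reflector plane.
    dist = (source[0] - wall_point[0]) * nx + (source[1] - wall_point[1]) * ny
    return (source[0] - 2.0 * dist * nx, source[1] - 2.0 * dist * ny)

def azimuth_deg(receiver, target):
    """Azimuth of target as seen from receiver, in degrees within [0, 360)."""
    angle = math.atan2(target[1] - receiver[1], target[0] - receiver[0])
    return math.degrees(angle) % 360.0

def reflected_doa_deg(source, receiver, wall_point, wall_normal):
    """A first-order reflection arrives as if emitted by the image source."""
    return azimuth_deg(receiver, image_source(source, wall_point, wall_normal))

# Usage: source 10 m ahead of the robot, facade along the line y = 5 m.
direct = azimuth_deg((0.0, 0.0), (10.0, 0.0))
reflected = reflected_doa_deg((10.0, 0.0), (0.0, 0.0), (0.0, 5.0), (0.0, 1.0))
```

The sketch does not check that the reflection point lies on a finite facade or that source and receiver are on the same side of it; a practical model built from a 3D city model would need both checks.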
[0029] For each peak in the spatial spectrum, the angular deviation between it and the predicted reflection direction is calculated. When the deviation is less than a preset tolerance, the peak is identified as a reflection peak and excluded. In this embodiment, the preset tolerance is 5°. For the remaining peaks after excluding reflection peaks, a multi-dimensional score is performed based on road geometry prior constraints, reflector surface distribution priors, historical sound source active area statistics, and spectral peak intensity. The final sound source arrival direction is determined based on the score results. An exemplary weight can be set as follows: road geometry prior 0.35, reflector surface distribution prior 0.30, historical sound source active area statistics 0.15, and spectral peak intensity 0.20.
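The peak exclusion and weighted scoring described above can be sketched as follows. The 5° tolerance and the four weights are the example values from this embodiment; the dictionary layout of a peak, the assumption that each score is already normalised to [0, 1], and the names `angular_deviation` and `select_direction` are hypothetical.

```python
# Example weights from the embodiment; the key names are illustrative.
WEIGHTS = {
    "road_geometry": 0.35,
    "reflector_prior": 0.30,
    "history": 0.15,
    "peak_intensity": 0.20,
}

def angular_deviation(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

def select_direction(peaks, reflection_dirs, tolerance_deg=5.0):
    """Drop peaks near predicted reflections, then pick the best-scoring one.

    peaks: list of dicts with an 'azimuth' key and the four score keys,
    each score assumed to lie in [0, 1].
    """
    survivors = [
        p for p in peaks
        if all(angular_deviation(p["azimuth"], r) >= tolerance_deg
               for r in reflection_dirs)
    ]
    if not survivors:
        return None

    def total(p):
        return sum(w * p[key] for key, w in WEIGHTS.items())

    return max(survivors, key=total)

# Usage: the strongest peak (122 degrees) sits next to a predicted reflection.
peaks = [
    {"azimuth": 30.0, "road_geometry": 0.9, "reflector_prior": 0.9,
     "history": 0.9, "peak_intensity": 0.9},
    {"azimuth": 122.0, "road_geometry": 1.0, "reflector_prior": 1.0,
     "history": 1.0, "peak_intensity": 1.0},
    {"azimuth": 200.0, "road_geometry": 0.2, "reflector_prior": 0.2,
     "history": 0.2, "peak_intensity": 0.2},
]
best = select_direction(peaks, reflection_dirs=[120.0])
```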
[0030] In practical processing, the aforementioned spatial covariance matrix can be expressed as a weighted average of the spatial covariance matrices at each frequency within the target frequency band, and updated online using the following recursive method. The noise subspace is obtained by performing eigenvalue decomposition on this matrix, using the eigenvectors corresponding to the smaller eigenvalues as the basis vectors of the noise subspace. The partitioning of the signal subspace and the noise subspace can be automatically determined using the eigenvalue threshold method or information theory criteria.
[0031] The noise covariance matrix used for spatial spectrum estimation in S1 is updated recursively:
$$\mathbf{R}_n(k) = \lambda\,\mathbf{R}_n(k-1) + (1-\lambda)\,\mathbf{x}(k)\,\mathbf{x}^{H}(k) \tag{2}$$
where $\mathbf{R}_n(k)$ is the noise covariance matrix after the $k$-th update, $\mathbf{x}(k)$ is the array observation vector at the $k$-th update, and $\lambda$ is the forgetting factor, satisfying $0 < \lambda < 1$. $\lambda$ can be set according to the rate of change of traffic flow: it can be appropriately reduced when the traffic flow changes rapidly and appropriately increased when the traffic flow is stable. This invention does not impose any limitation on this.
The integration time $T$ can be adjusted adaptively according to the noise level by Equation (3), which takes the minimum, $\min(\cdot)$, of the upper limit $T_{\max}$ and a noise-dependent term computed from $T_0$, $\sigma_n$, $\sigma_{\mathrm{ref}}$, and $\kappa$. Here $T_{\max}$ is the upper limit of the integration time, with an example value of 500 ms; $T_0$ is the baseline integration time, with an example value of 50 ms; $\sigma_n$ is the current noise power standard deviation; $\sigma_{\mathrm{ref}}$ is the reference noise power standard deviation; and $\kappa$ is an adjustment factor, with example values ranging from 0.5 to 2.0. $T_{\max}$, $T_0$, and $\kappa$ can be determined through system calibration experiments, and this invention does not limit them. $\sigma_{\mathrm{ref}}$ can be determined from calibration results in a quiet environment during the initial stage of system startup. $\sigma_n$ can be obtained from the background noise power statistics within the current time window.
[0032] Equation (3) is used to dynamically adjust the time range of noise statistics based on the current noise level. When the integration time is too short, the noise covariance matrix estimation is easily affected by instantaneous disturbances, leading to increased spatial spectrum fluctuations. When the integration time is too long, the time resolution for sudden traffic events is reduced. Therefore, by adaptively adjusting the integration time according to the standard deviation of noise power, a balance can be achieved between the stability of noise estimation and the time response capability. This approach is particularly beneficial for scenarios where traffic flow states change rapidly, and can reduce the performance inconsistency of a fixed integration time across high-noise and low-noise environments. In high-noise environments, the integration time is increased to improve the stability of noise estimation. In low-noise environments, the integration time is reduced to improve temporal resolution.
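The recursive covariance update and the adaptive integration time can be sketched as below. Two points are assumptions rather than facts from the source: the update is written as an exponentially weighted average with a `(1 - lam)` factor, and the integration time uses the form `min(T_max, T_0 * (1 + kappa * sigma_n / sigma_ref))`, which is only one expression consistent with the stated definitions (capped at 500 ms, 50 ms baseline, growing with noise). The function names and default values other than 50 ms and 500 ms are hypothetical.

```python
import numpy as np

def update_noise_covariance(R_prev, x, lam=0.95):
    """One recursive update with forgetting factor lam, 0 < lam < 1.

    Assumed form: R(k) = lam * R(k-1) + (1 - lam) * x x^H.
    """
    x = np.asarray(x).reshape(-1, 1)
    return lam * R_prev + (1.0 - lam) * (x @ x.conj().T)

def integration_time_ms(sigma_n, sigma_ref, kappa=1.0,
                        t0_ms=50.0, t_max_ms=500.0):
    """Noise-adaptive integration time, capped at t_max_ms.

    ASSUMED functional form; the source equation is not recoverable.
    Higher noise spread gives a longer integration time.
    """
    return min(t_max_ms, t0_ms * (1.0 + kappa * sigma_n / sigma_ref))

# Usage: a quiet scene and a very noisy scene.
quiet_ms = integration_time_ms(sigma_n=1.0, sigma_ref=1.0)
noisy_ms = integration_time_ms(sigma_n=100.0, sigma_ref=1.0)
R_next = update_noise_covariance(np.eye(2), [1.0, 0.0], lam=0.5)
```

Under this assumed form the integration time never falls below the baseline, so the low-noise case approaches 50 ms as the noise spread shrinks.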
[0033] Furthermore, the complexity of the acoustic multipath propagation model varies depending on the scene. In urban road scenes, the mirror source method requires calculating the directions of first- and second-order reflected waves from multiple types of reflecting surfaces. In open scenes such as highways, the main reflecting surfaces are simplified to two types: the road surface and the guardrails on both sides, and the second-order reflected waves can be ignored. The weights of each item in the four-dimensional scoring vector can also be configured according to the scene.
[0034] Furthermore, the adaptive estimation of the noise covariance matrix based on traffic flow state is adaptable to both steady-state broadband noise on highways and abrupt noise on urban roads. Under steady-state noise, the integration time is extended to improve estimation accuracy, while under abrupt noise, the forgetting factor is reduced to accelerate model updates.
[0035] S2: Drive the camera and radar to focus based on the direction of sound source arrival, pre-screen events using acoustic fingerprint templates with noise adaptation, and output event candidates and initial confidence levels after confirmation by multiple sensors.
[0036] In step S2, the direction of arrival of the sound source output by S1 is used to control the PTZ camera and radar beam to focus in the corresponding direction, and at the same time, the radar pre-tracking result is retrieved to perform spatial consistency verification on the spatial area corresponding to that direction.
[0037] The purpose of this step is not simply to superimpose the results from the three types of sensors, but to utilize the omnidirectional leading characteristic of the acoustic channel to constrain the subsequent perception range of vision and radar. Compared to continuously performing high-density scanning or high-resolution detection across the entire field of view, first identifying candidate spatial regions based on the direction of sound source arrival, and then using the camera and radar for directional perception, reduces the computational burden from irrelevant areas and increases the observation density of local anomalies. Thus, the acoustic channel plays a "leader screening" role, while vision and radar play a "directional confirmation" role, with a clear division of labor among the three.
[0038] When radar pre-tracking results indicate the presence of an abnormal moving target in the area, the initial confidence level of the event candidates in the corresponding direction is increased. When no abnormal moving target is detected in the area, the initial confidence level of the event candidates in the corresponding direction is decreased. Abnormal moving targets can be identified by comparing them with statistical thresholds for normal traffic flow, which can be obtained by calibrating historical traffic flow data.
[0039] The system pre-constructs an acoustic fingerprint database for traffic events, with each type of event corresponding to a set of pre-stored acoustic fingerprint templates. Templates may include Mel-spectrum embedding features, Mel-frequency cepstral coefficients and their differential features, short-time energy envelopes, and zero-crossing rates. Based on the current background noise spectrum, the pre-stored acoustic fingerprint templates undergo environmental adaptation transformation to generate adapted templates under the current noise conditions. These adapted templates are then matched with the real-time acquired signals to obtain a pre-screening matching score for the event.
[0040] The specific method of environmental adaptation transformation is as follows: For each frequency band component of the pre-stored acoustic fingerprint template, the signal-to-noise ratio (SNR) weight is calculated based on the power spectral density of the current background noise in the corresponding frequency band. The template weight is reduced for frequency bands with lower SNR, and maintained or increased for frequency bands with higher SNR, thereby generating an adapted template. This method automatically reduces the influence of unreliable frequency bands under the current noise conditions during the matching process, while retaining the ability to distinguish higher SNR bands. The weight calculation can use the ratio of the current frequency band SNR to the reference SNR, truncated to the interval [0,1], which is not limited in this invention.
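A minimal sketch of the template-side adaptation follows. The weight rule (band SNR divided by a reference SNR, truncated to [0, 1]) is the one described above; computing the band SNR as template band power over noise power spectral density, the linear (non-dB) scale, the weighted cosine similarity used for matching, and the names `adapt_template` and `match_score` are assumptions made for illustration.

```python
import numpy as np

def adapt_template(template_power, noise_psd, ref_snr=4.0):
    """Weight each band of a stored template by its current SNR.

    Band SNR is taken as template power / noise PSD (an assumption).
    Weight = SNR / ref_snr, truncated to [0, 1].
    """
    template_power = np.asarray(template_power, dtype=float)
    noise = np.maximum(np.asarray(noise_psd, dtype=float), 1e-12)
    weights = np.clip((template_power / noise) / ref_snr, 0.0, 1.0)
    return template_power * weights, weights

def match_score(adapted, weights, observed):
    """Cosine similarity between the adapted template and the weighted input."""
    obs = weights * np.asarray(observed, dtype=float)
    denom = np.linalg.norm(adapted) * np.linalg.norm(obs)
    return float(adapted @ obs / denom) if denom > 0 else 0.0

# Usage: three bands, with noise rising from band to band.
template = [4.0, 4.0, 4.0]
adapted, weights = adapt_template(template, noise_psd=[1.0, 2.0, 8.0])
score = match_score(adapted, weights, observed=template)
```

Applying the same weights to the observed spectrum keeps noisy bands from dominating the score on either side of the comparison; this is a design choice of the sketch, not something the source states.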
[0041] This processing does not simply perform noise suppression on the acquired signal, but actively transforms the template to reflect the expected performance under the current environmental conditions. Because traffic noise depends strongly on road segment and time period, the spectral profile of the same event may differ under different background noise conditions. Direct matching with static templates can easily lead to template mismatch. Template-side adaptation allows the matching process to better reflect the consistency between the event itself and the observed signal under the current environmental conditions, thereby improving the environmental adaptability of the pre-screening.
[0042] This method involves template-side adaptation, rather than simply denoising the input signal. When the matching score exceeds the pre-screening threshold, an acoustic trigger is generated. An example value of the pre-screening threshold is 0.6, and it can also be adjusted within the range of 0.4 to 0.8 according to the target false alarm rate and missed alarm rate. After acoustic triggering, the camera extracts visual evidence in the corresponding direction, and the radar extracts motion state evidence in the corresponding direction. The comprehensive decision rule after step-by-step confirmation can be set as follows: when the detection results of at least two sensors are consistent, an event candidate is formed, and the initial confidence of the event candidate is calculated. The initial confidence can be determined according to the number of triggered sensors and the weighted average of the detection scores of each sensor. For example, when all three sensors (acoustic, visual, and radar) are triggered, the initial confidence can be the weighted average of the detection scores of the three sensors. When two sensors are triggered, the weighted average of the detection scores of the two sensors can be multiplied by a reduction factor. The example value of the reduction factor can be 0.8, and this invention does not limit it here. If only one sensor is triggered, the result is retained as a low-confidence candidate and awaits further verification in S3. For low-confidence candidates triggered by only one sensor, S3 waits for other sensor channels to respond within the adaptive timing tolerance window before performing timing verification; if no other channel responds after the tolerance window expires, the candidate is directly marked as negative or handed over to S4 for processing.
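The decision rule above can be sketched as a small function. The 0.8 reduction factor is the example value from the text; equal sensor weights, the status labels, treating "triggered" as "consistent", and returning no confidence value for a single-sensor candidate (the text does not specify one) are assumptions of this sketch.

```python
def initial_confidence(scores, weights=None, reduction=0.8):
    """Combine detection scores of the triggered sensors.

    scores: dict mapping each TRIGGERED sensor name to a score in [0, 1].
    Returns (status, confidence); confidence is None when unspecified.
    """
    if not scores:
        return "no_trigger", None
    if len(scores) == 1:
        # Kept as a low-confidence candidate pending verification in S3.
        return "low_confidence", None
    if weights is None:
        weights = {name: 1.0 for name in scores}  # equal weights (assumption)
    total_w = sum(weights[name] for name in scores)
    avg = sum(weights[name] * s for name, s in scores.items()) / total_w
    if len(scores) == 2:
        avg *= reduction  # example reduction factor for two-sensor agreement
    return "candidate", avg

# Usage: three-sensor, two-sensor, and single-sensor cases.
three = initial_confidence({"acoustic": 0.9, "visual": 0.6, "radar": 0.6})
two = initial_confidence({"acoustic": 0.9, "radar": 0.7})
one = initial_confidence({"acoustic": 0.9})
```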
[0043] The technical significance of the step-by-step confirmation mechanism lies in organizing evidence from different modalities in spatial and temporal order. Acoustic pre-screening first identifies suspected events and their locations, visual evidence supplements information on appearance or scene changes, and radar evidence supplements information on motion states. By requiring consistency between at least two sensor results, the probability of false triggers caused by single-modal anomalies can be reduced, and event candidates entering the S3 physical-temporal consistency verification have higher initial credibility, thereby reducing the consumption of subsequent verification resources on low-quality candidates.
[0044] Furthermore, the PTZ camera performs focusing based on the event distance estimated by acoustic ranging. When the event distance is far, such as events within hundreds of meters in a highway patrol scenario, a correspondingly longer focal length setting is used to ensure the resolution of target details.
[0045] S3: Based on the multi-sensor time-series patterns of the speed difference between sound waves and electromagnetic waves, the authenticity of event candidates is verified by an adaptive time-series tolerance window and the initial confidence level is updated, while the event type is inferred.
[0046] In step S3, the system utilizes the timing differences in the responses of different sensor channels to the same event to verify the authenticity of event candidates and infer the event type. Let the calibrated response time of the $i$-th sensor channel be:
$$t_i = \tau_i + \delta_i \tag{4}$$
where $t_i$ is the calibrated response time of the $i$-th sensor channel, $\tau_i$ is the physical propagation delay corresponding to the $i$-th sensor channel, and $\delta_i$ is the processing latency of the $i$-th sensor channel.
[0047] Equation (4) decomposes the observed channel response time into two parts: physical propagation delay and channel processing delay. Its purpose is to distinguish the systematic delay caused by sensor hardware, sampling methods, and algorithm execution from the physical timing sequence formed by the event itself during spatial propagation. Without this decomposition, the output time difference between different channels will be mixed with non-negligible differences in system processing, thereby weakening the ability to verify the authenticity of the event using differences in propagation speed. By first correcting the processing delay and then comparing the cross-channel timing relationship, the verification basis can be made closer to the physical propagation law of the event itself.
[0048] The processing latency of each channel can be obtained through calibration experiments during system deployment. Taking the acoustic channel as an example, its expected response time can be expressed as:
$$\hat{T}_{\mathrm{a}} = \frac{d}{c_s} + \tau_{\mathrm{a}}^{\mathrm{proc}} \tag{5}$$
where $\hat{T}_{\mathrm{a}}$ is the expected response time of the acoustic channel, $d$ is the distance of the event from the robot, $c_s$ is the speed of sound, and $\tau_{\mathrm{a}}^{\mathrm{proc}}$ is the processing delay of the acoustic channel. The distance $d$ can be obtained from radar pre-tracking results or from joint acoustic and radar estimation.
[0049] Equation (5) gives the correspondence between the expected response time of the acoustic channel and the event distance. Since the speed of sound is much smaller than the propagation speed of electromagnetic waves, the response time of the acoustic channel is more sensitive to the event distance. Therefore, its time delay term can provide an important constraint for distinguishing between real events and multimodal accidental co-occurrence. Especially when the distance of traffic events changes, the acoustic time delay will adjust with the distance, so that the subsequent adaptive timing tolerance window can be updated with the changes in the geometric relationship of the scene, rather than using a fixed threshold for rough judgment.
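As a rough numerical illustration of equation (5), the sketch below computes the expected acoustic response time from the event distance. The function name, the speed of sound of 343 m/s (dry air at about 20 °C), and the processing delay value are assumptions chosen for the example.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for dry air at about 20 degrees C


def expected_acoustic_response(
    distance_m: float,
    processing_delay_s: float,
    speed_of_sound: float = SPEED_OF_SOUND_M_S,
) -> float:
    """Expected acoustic response time: propagation delay plus processing delay."""
    if distance_m < 0 or speed_of_sound <= 0:
        raise ValueError("distance must be non-negative and speed positive")
    return distance_m / speed_of_sound + processing_delay_s
```

At 343 m the acoustic propagation delay alone is one second, whereas the electromagnetic propagation delay over the same distance is on the order of a microsecond, which is why the acoustic term dominates the cross-channel timing pattern.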
[0050] A time series template is established for each candidate event type. The mean of the time series template can be calculated based on the difference in propagation speed between sound waves and electromagnetic waves at known distances, yielding the theoretical response delay for each channel. The standard deviation can be determined by combining the measured deviations from calibration experiments. Let $P_n(c)$ denote the posterior probability of the c-th event class after the n-th observation update; then:
$$P_n(c) = \frac{L_n(c)\,P_{n-1}(c)}{\sum_{j=1}^{C} L_n(j)\,P_{n-1}(j)} \tag{6}$$
where $C$ is the total number of candidate event types, $L_n(c)$ is the likelihood value of the c-th event class under the n-th observation, $P_{n-1}(j)$ is the posterior probability of the j-th event class after the (n-1)-th observation update, $L_n(j)$ is the likelihood value of the j-th event class under the n-th observation, and $P_{n-1}(c)$ is the posterior probability of the c-th event class after the (n-1)-th observation update. The initial posterior probabilities at the start of the recursion can be set to a uniform distribution, that is, $P_0(c) = 1/C$ for every event type, or they can be given by historical event frequency statistics.
[0051] Equation (6) updates the posterior probability of each candidate event type using a recursive method. Its function is not only to compare the relative probabilities of different event types, but also to continuously incorporate newly arriving sensor evidence into the existing judgment process. As new channel responses are observed, the posterior probabilities of different event types will be redistributed according to their consistency with the current observation. Therefore, this equation simultaneously performs the functions of event authenticity verification and event type inference. Compared with one-time static judgment, recursive updating is more suitable for handling traffic scenarios where the response times of multiple sensor channels are inconsistent and the arrival order is not fixed.
[0052] The likelihood value $L_n(c)$ can be expressed in Gaussian form as follows:
$$L_n(c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,n}} \exp\!\left(-\frac{(\Delta t_n - \mu_{c,n})^2}{2\sigma_{c,n}^2}\right) \tag{7}$$
where $\pi$ is the circular constant, $\exp(\cdot)$ is the exponential function, $\Delta t_n$ is the actual response delay of the n-th observation, and $\mu_{c,n}$ and $\sigma_{c,n}$ are the template mean and template standard deviation of the c-th event class for the channel corresponding to the n-th observation, satisfying $\sigma_{c,n} > 0$.
[0053] The exponential function term in equation (7) is used to characterize the deviation between the actual response delay and the expected delay of the template. The closer the actual response delay is to the mean of the template for the corresponding event type, the higher its likelihood value. The greater the deviation, the lower its likelihood value. Thus, the time series template is no longer just an abstract label, but describes the typical propagation law of a certain type of event on multiple sensor channels through the mean and dispersion. Treating the likelihood of channels with unreliable responses as constant is to avoid undue penalty to the overall posterior probability due to missing observations caused by occlusion, weak signals, or excessive distance, thereby improving the robustness of multi-channel fusion judgment.
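The recursive posterior update of equation (6) with the Gaussian likelihood of equation (7) can be sketched as below. The function names, the use of `None` to mark a channel without a reliable response, and the constant likelihood of 1.0 for that case are illustrative assumptions.

```python
import math
from typing import List, Optional, Sequence


def gaussian_likelihood(delay: float, mean: float, std: float) -> float:
    """Gaussian likelihood of an observed delay under a template (mean, std)."""
    if std <= 0:
        raise ValueError("template standard deviation must be positive")
    exponent = -((delay - mean) ** 2) / (2.0 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2.0 * math.pi) * std)


def update_posterior(
    prior: Sequence[float],
    delay: Optional[float],
    means: Sequence[float],
    stds: Sequence[float],
    missing_likelihood: float = 1.0,
) -> List[float]:
    """One recursion step over all candidate event types.

    `delay` is None when the channel did not respond reliably; every class
    then receives the same constant likelihood, leaving the posterior unchanged.
    """
    if delay is None:
        likelihoods = [missing_likelihood] * len(prior)
    else:
        likelihoods = [gaussian_likelihood(delay, m, s) for m, s in zip(means, stds)]
    unnormalized = [l * p for l, p in zip(likelihoods, prior)]
    total = sum(unnormalized)
    if total == 0.0:
        # All likelihoods underflowed; keep the previous belief.
        return list(prior)
    return [u / total for u in unnormalized]
```

Starting from a uniform prior over two event types with template means of 0.3 s and 0.6 s, an observed delay of 0.3 s moves nearly all probability mass to the first type, while a missing observation leaves the prior as it was.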
[0054] When a channel fails to respond reliably due to occlusion or distance, the likelihood of that channel can be treated as a constant, so that it does not bias the posterior probability. The adaptive timing tolerance window is dynamically adjusted according to the processing delay of each channel and the distance of the event from the robot. The width of the tolerance window can be set as a constant multiple of the standard deviation of the timing template for the corresponding event type. An example value of this constant can be between 2.0 and 3.0, and this invention does not limit it.
[0055] The technical function of this tolerance window is to provide a variable decision boundary for event candidates under different distances and processing delays. If the tolerance window is fixed, near-distance events and far-distance events will be constrained by the same delay threshold, which can easily lead to the former being falsely rejected and the latter being falsely accepted. By dynamically adjusting the tolerance window based on channel delay and event distance, real events can maintain a high degree of consistency within a reasonable fluctuation range, while suppressing false candidates that are clearly inconsistent with the laws of physical propagation.
[0056] When the distance is short, the physical propagation delay difference across channels decreases, allowing for a corresponding tightening or reweighting of the time criteria for each channel. When the distance is long, the acoustic propagation delay increases, allowing for a moderate relaxation of the acoustic criteria. In this embodiment, an example value of the confirmation threshold is 0.85 and an example value of the rejection threshold is 0.10; both can be adjusted through calibration experiments based on the target false alarm rate and false negative rate.
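A minimal sketch of the tolerance check and the threshold decision follows. It assumes that the window is centered on the template mean and that the constant multiple of the standard deviation is the half-width; the description only states that the width is a constant multiple (for example 2.0 to 3.0), so this reading and the names used are assumptions.

```python
def within_tolerance(
    observed_delay: float,
    template_mean: float,
    template_std: float,
    k: float = 2.5,
) -> bool:
    """True if the observed delay lies within k standard deviations of the template mean."""
    return abs(observed_delay - template_mean) <= k * template_std


def classify_candidate(
    confidence: float,
    confirm_threshold: float = 0.85,
    reject_threshold: float = 0.10,
) -> str:
    """Map an updated confidence to confirmed, rejected, or gray zone (handled in S4)."""
    if confidence >= confirm_threshold:
        return "confirmed"
    if confidence <= reject_threshold:
        return "rejected"
    return "gray_zone"
```

Because the template mean of the acoustic channel grows with event distance, the same check automatically shifts with the scene geometry instead of applying a fixed delay threshold.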
[0057] It should be noted that the distance-estimation-driven dynamic adjustment of the time window enables this step to adapt to observation distances at different scales. The physical propagation delay scales automatically with distance, both for events on the order of tens of meters in urban scenes and for events on the order of hundreds of meters in highway scenes. The Bayesian selection mechanism over the templates of multiple event types holds at both scales.
[0058] S4: For event candidates whose confidence level is still uncertain after verification by S3, drive the robot to perform active multi-view verification.
[0059] In step S4, the system performs active multi-view verification on event candidates that remain in the gray zone after verification in S3. Based on their urgency and timeliness, events can be divided into emergency events, delayable verification events, and low-speed evolution events. Emergency events are reported directly without initiating active multi-view verification. Delayable verification events undergo active multi-view verification. Low-speed evolution events can be verified along the patrol route. Traffic scene safety constraints include at least one of the following: dynamic safe distance constraints between the robot and traffic flow, right-of-way priority constraints, and observation stop point safety constraints. For example, when the speed of vehicles in adjacent lanes is high, the safe distance between the robot and the traffic flow can be appropriately increased. For high-risk areas, the robot is only allowed to stop and observe at locations that meet the field-of-view and backoff conditions. The observation stop point is determined jointly based on the candidate event location, viewpoint complementarity, and traffic scene safety constraints. Specifically, a set of candidate stopping points can be generated at preset angular intervals within the reachable safe area surrounding the candidate event. For each candidate stopping point, the angular difference between it and the existing stopping points is calculated as a viewpoint complementarity score. Simultaneously, a safety score is calculated based on the point's distance from traffic flow, right-of-way conditions, and backoff path availability. The candidate stopping points are then sorted in descending order of the weighted average of the viewpoint complementarity score and the safety score, and the stopping points with the highest scores are selected sequentially. The weights of the viewpoint complementarity score and the safety score can be adjusted according to task priority; this invention does not impose limitations on this.
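The stopping-point selection described above can be sketched as follows. The representation of stopping points by their bearing around the event, the normalization of the angular difference by 180 degrees, the equal default weights, and the `safety_score` callback are all assumptions made for illustration; the description leaves these details open.

```python
from typing import Callable, List, Sequence


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def rank_stopping_points(
    candidate_angles: Sequence[float],
    visited_angles: Sequence[float],
    safety_score: Callable[[float], float],
    w_view: float = 0.5,
    w_safety: float = 0.5,
) -> List[float]:
    """Return candidate bearings sorted by descending weighted score.

    Viewpoint complementarity is the smallest angular difference to any stop
    already used, scaled to [0, 1]; `safety_score` is a hypothetical callback
    returning a value in [0, 1] for a bearing.
    """
    scored = []
    for angle in candidate_angles:
        if visited_angles:
            view = min(angular_difference(angle, v) for v in visited_angles) / 180.0
        else:
            view = 1.0  # no earlier stop: every viewpoint is fully new
        scored.append((w_view * view + w_safety * safety_score(angle), angle))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [angle for _, angle in scored]
```

With candidates at 0, 90, 180, and 270 degrees, one earlier stop at 0 degrees, and a uniform safety score, the opposite viewpoint at 180 degrees ranks first and the already-used bearing ranks last.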
[0060] The purpose of introducing active multi-view verification in this step is to leverage the robot's spatial maneuverability to supplement gray-zone events with incremental information that is difficult to obtain under a fixed viewpoint. When an event candidate remains between the confirmation and rejection thresholds after S3, it indicates that the existing perceptual evidence is insufficient to support a stable conclusion. By changing the observation position and viewpoint, the geometric relationships between the target and occlusions, the background environment, and various sensors can be reorganized, thereby obtaining more discriminative observation data. This process differs from passively waiting for subsequent frames; its core is to actively create new observation conditions to resolve uncertainties.
[0061] The robot performs observations sequentially according to a set of planned observation stops, acquires multi-view, multi-sensor observation data, and updates the confidence of event candidates using incremental information.
[0062] The technical significance of multi-view observation lies in the fact that the evidence provided by different stopping points is often complementary. An obscured target edge, vehicle attitude changes, abnormal ground areas, or local motion features may be apparent from one viewpoint while remaining hidden from another. By incrementally fusing data obtained from different stopping points, the separability of event candidate confidence can be progressively improved. This process is not simply repeated sampling, but rather leverages changes in perspective to enhance the discriminative power of observational information.
[0063] Verification ends when the confidence level reaches the confirmation or rejection threshold. Downgraded verification is initiated when it is impossible to safely reach a candidate observation point. Downgraded verification can employ long-distance zoom observation or directional acoustic enhancement acquisition.
[0064] The purpose of downgraded verification is to maintain the ability to verify candidate events as much as possible when traffic safety constraints prevent the robot from approaching the candidate event area. Long-distance zoom observation focuses on acquiring information about changes in the target's appearance and scene structure, while directional acoustic enhancement acquisition focuses on improving the acoustic signal-to-noise ratio in a specific direction. Both supplement verification evidence from visual and acoustic paths, respectively, to reduce information loss caused by the inability to maneuver and approach the target.
[0065] The verification results obtained in S4 are used not only for the current event determination, but also for updating the causal graph edge parameters in S5.
[0066] It should be noted that the event timeliness classification strategy and the downgraded verification strategy enable this step to be implemented even in restricted scenarios such as highways, where traffic robots cannot actively cross lanes. Emergency events such as collisions and fires are reported directly according to the rules described above without initiating active movement verification. Delayable verification events such as road obstacles are handled using the downgraded verification strategy described above, performing PTZ long-range observation and beamforming-based directional acoustic listening. The dynamic safety distance given by the formula increases linearly with the speed of vehicles in adjacent lanes, automatically shrinking the feasible observation area of the traffic robot in highway scenarios to meet highway safety requirements.
[0067] S5: Input the confirmed events into the event causal graph to predict related events. The event causal graph uses event type as nodes and causal relationship as directed edges. The edge parameters are updated online through verification data. The prediction results are fed back to S1 and S2 to adjust the perception parameters.
[0068] In step S5, the system inputs the events confirmed in S3 or verified in S4 into the event causal graph, performs related event prediction, and feeds the prediction results back to the front-end perception. The event causal graph is a directed weighted graph, where nodes represent event types and directed edges represent causal relationships. Each edge contains at least the causal transition probability, time delay interval, and spatial propagation range. Let $p_e^{(i)}$ denote the causal transition probability of edge $e$ after the i-th update and $p_e^{(i-1)}$ the causal transition probability of edge $e$ after the (i-1)-th update. The update then follows the recursion of equation (8), in which $I_e^{(i)}$ is the indicator of whether the causal transition occurs in the i-th observation, set to 1 if it occurs and 0 if it does not, and $p_e^{(0)}$ is the initial causal transition probability of edge $e$ obtained from offline statistics.
[0069] To avoid a single observation having too large an impact on the recursive result when the number of online observations is small, a minimum observation threshold can be set. Before the cumulative number of observations reaches this threshold, the initial causal transition probability obtained offline is used as the current estimate; after reaching this threshold, equation (8) is used for recursive update. An example value for the minimum observation threshold can be 5 to 10, which is not limited in this invention.
[0070] The initialization of the causal graph can be determined in the following ways: the set of event type nodes can be predefined based on the historical event records of the target deployment scenario or the event classification standards of the traffic management department; the initial causal transition probability can be obtained by statistical analysis of the historical event sequence data of the deployment scenario; when historical data is insufficient, domain experience values can also be used for initialization and recursively corrected by equation (8) during operation.
[0071] Equation (8) is used to recursively update the causal transition probability based on historical observations. Its technical significance lies in enabling the edge parameters in the causal graph to no longer rely solely on offline initialization, but to be continuously corrected based on observations in actual traffic scenarios. If a certain antecedent event is frequently accompanied by a subsequent event, the transition probability of the corresponding edge gradually increases. Conversely, it gradually decreases.
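The behavior described for the transition probability, including the minimum observation threshold, can be illustrated with the sketch below. The specific recursion used here, a running average in which the offline prior counts as one pseudo-observation, is an assumption chosen for the example and is not taken from equation (8); the class and attribute names are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CausalEdge:
    """One directed edge of the event causal graph (illustrative only)."""

    p_init: float                 # initial probability from offline statistics
    min_observations: int = 5     # example range in the description: 5 to 10
    n_obs: int = field(default=0, init=False)
    p_running: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        self.p_running = self.p_init

    def observe(self, occurred: bool) -> None:
        """Fold one online observation (transition occurred or not) into the estimate."""
        self.n_obs += 1
        indicator = 1.0 if occurred else 0.0
        # Assumed running-average step; the prior acts as one pseudo-observation.
        self.p_running += (indicator - self.p_running) / (self.n_obs + 1)

    @property
    def probability(self) -> float:
        """Current estimate: the offline prior until enough observations exist."""
        if self.n_obs < self.min_observations:
            return self.p_init
        return self.p_running
```

With a prior of 0.5 and a threshold of three observations, two positive observations leave the reported estimate at 0.5, and the third raises it to 0.875 under the assumed recursion.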
[0072] Therefore, causal graphs can reflect the evolution of events in the current deployment scenario, rather than remaining fixed on the initial experience settings.
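[0073] The average time delay of edge $e$ can be updated using the following formula:
$$\bar{\tau}_e^{(i)} = (1-\beta)\,\bar{\tau}_e^{(i-1)} + \beta\,\tau_e^{(i)} \tag{9}$$
where $\bar{\tau}_e^{(i)}$ is the average time delay of edge $e$ after the i-th update, $\bar{\tau}_e^{(i-1)}$ is the average time delay of edge $e$ after the (i-1)-th update, $\tau_e^{(i)}$ is the actual delay of the i-th observation, and $\beta$ is the update coefficient, satisfying $0 < \beta < 1$. Those skilled in the art can set the size of the update coefficient according to actual needs, and the present invention does not limit it.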
[0074] The initial value for recursion in equation (9) can be determined by statistical analysis of the time intervals between adjacent causal events in the historical event sequence of the deployment scenario.
[0075] Equation (9) is used to update the average time delay corresponding to the edge and to determine the time delay interval by combining the historical observation dispersion. Its function is to further refine the causal relationship between events from "whether they are related" to "within what time range they are related". Since the propagation and evolution of traffic events are affected by road type, traffic density and target movement state, the time interval of the same event chain may differ in different scenarios. By dynamically updating the average time delay and its interval, the adaptability of the causal graph to the temporal characteristics of the actual scenario can be improved.
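A short sketch of the delay update and the derived interval follows. It assumes an exponentially weighted form for equation (9) with an update coefficient `beta` between 0 and 1, and it assumes that the interval is the mean plus or minus a multiple `k` of a dispersion value supplied by the caller; the multiple and the function names are illustrative.

```python
from typing import Tuple


def update_mean_delay(prev_mean: float, observed_delay: float, beta: float = 0.2) -> float:
    """Exponentially weighted update of the average time delay of an edge."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return (1.0 - beta) * prev_mean + beta * observed_delay


def delay_interval(mean_delay: float, dispersion: float, k: float = 2.0) -> Tuple[float, float]:
    """Time delay interval around the mean; the lower bound is clipped at zero."""
    half_width = k * dispersion
    return max(0.0, mean_delay - half_width), mean_delay + half_width
```

For instance, a previous mean of 10 s, an observed delay of 20 s, and a coefficient of 0.2 give an updated mean of 12 s; with a dispersion of 3 s the assumed interval is 6 s to 18 s.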
[0076] The time delay interval of edge $e$ is determined based on the updated average time delay and the corresponding historical observation dispersion. For a candidate causal propagation path $\pi$ containing $K$ nodes, its path cumulative probability $P(\pi)$ can be represented as:
$$P(\pi) = \prod_{m=1}^{K-1} p_m \tag{10}$$
where $p_m$ is the causal transition probability of the m-th edge on the candidate causal propagation path $\pi$, and the m-th edge connects node $v_m$ with node $v_{m+1}$. The cumulative probability of a path is calculated using the edge-independent approximation.
[0077] Equation (10) accumulates the causal transition probabilities of each edge on the path to obtain the overall credibility of the candidate causal propagation path. As the propagation depth increases, the path accumulation probability gradually decreases to form a prediction confidence decay mechanism. The role of this mechanism is to suppress the propagation of long chains with low credibility, avoid generating a large number of weakly correlated predictions over a wide range due to multi-step reasoning, and thus prioritize the concentration of perception resources on more likely associated event paths.
[0078] During forward inference, as the propagation depth increases, the cumulative probability of the path gradually decreases, thus forming a prediction confidence decay mechanism. When the path cumulative probability falls below the minimum prediction threshold, propagation stops. An example value of the minimum prediction threshold is 0.02, and an upper limit on the propagation depth can also be set, such as 3 layers. After the graph inference obtains the candidate spatial range of the associated events, the candidate spatial range is filtered using the sound source direction of arrival output by S1 and the short-term historical statistics of that direction accumulated within a preset time window, or different weights are assigned to each candidate direction within the candidate spatial range.
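Forward inference with confidence decay can be sketched as a depth-limited traversal that multiplies edge probabilities along each path and prunes paths below the minimum prediction threshold. The adjacency-list representation, the cycle check, the node names, and the probabilities in the usage note are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Adjacency list: node -> list of (successor node, causal transition probability).
Graph = Dict[str, List[Tuple[str, float]]]


def predict_related_events(
    graph: Graph,
    start: str,
    min_probability: float = 0.02,
    max_depth: int = 3,
) -> List[Tuple[List[str], float]]:
    """Return (path, cumulative probability) pairs, most probable first."""
    results: List[Tuple[List[str], float]] = []
    stack: List[Tuple[str, float, List[str]]] = [(start, 1.0, [start])]
    while stack:
        node, prob, path = stack.pop()
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached (number of edges on the path)
        for successor, p_edge in graph.get(node, []):
            if successor in path:
                continue  # skip cycles so that traversal terminates
            cumulative = prob * p_edge  # edge-independent approximation
            if cumulative < min_probability:
                continue  # confidence decay: stop propagating weak chains
            new_path = path + [successor]
            results.append((new_path, cumulative))
            stack.append((successor, cumulative, new_path))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

For a hypothetical chain with edge probabilities 0.3, 0.6, and 0.05, the two-edge path keeps a cumulative probability of 0.18, while the three-edge path falls to 0.009 and is pruned by the 0.02 threshold.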
[0079] Introducing the direction of sound source arrival and its short-term historical statistical results into the candidate spatial range filtering aims to feed back the real-time spatial information obtained by the current perception layer to the inference layer. If acoustic activity consistent with the predicted event persists in a candidate direction within a preset time window, that direction is more worthy of retention or given higher weight. Conversely, its priority can be reduced. This process allows causal inference results to no longer rely solely on historical statistical relationships, but can be coupled with real-time perceived evidence in the current scene, improving the spatial specificity of the prediction results.
[0080] The preset time window can be set according to the event evolution speed and the system sampling period, and can be set from 1 second to 10 seconds for example. When the prediction results are fed back to S1 and S2, the perception parameters can be adjusted within the expected spatial range and expected time window of the predicted event. The perception parameters include at least one of acoustic pre-screening threshold, visual detection sensitivity and radar perception parameters.
[0081] The role of adjusting perception parameters based on prediction results is to enable the system to have forward-looking perception capabilities. When the cumulative probability of a certain path is high, the sensitivity of the relevant perception channels can be appropriately increased within its corresponding expected spatial range and expected time window, thereby capturing subsequent related events more promptly. Conversely, when the predicted event does not occur within the expected time window, reducing the corresponding causal edge transition probability can reduce the resource consumption caused by the continued existence of ineffective sensitivity enhancement.
[0082] When a predicted event does not occur within the expected time window, the transition probability of the corresponding causal edge can be reduced to decrease subsequent ineffective sensitization.
[0083] Furthermore, the learning rate α is set differently based on the current road type. In highway scenarios, the causal relationship is relatively stable, so the value of α is relatively small; in urban intersection scenarios, traffic conditions change more significantly, so the value of α is relatively large. After identifying the current road type from a high-precision map, the traffic robot automatically switches to the corresponding parameters. In this way, the edge parameters of the typical causal chain "sudden braking of the vehicle in front → rear-end collision of the vehicle behind → regional congestion" are adapted to each road type through the online adaptive update mechanism.
[0084] S6: Continuously detect multi-sensor inconsistencies, exclude inconsistencies that cannot be explained by known event types after sensor degradation, and generate new event type prototypes to expand the event causal graph.
[0085] In step S6, the system uses multi-sensor inconsistency as a clue for detecting unknown events and forms new event type prototypes through incremental clustering.
[0086] The underlying technical idea of this step is to treat multi-sensor inconsistencies not simply as noise or error, but as potential clues for discovering unknown event types. When known event types and their fusion relationships cannot explain the current differences in cross-sensor responses, these differences may correspond to scenarios not covered by the training samples or novel events. Thus, inconsistencies are no longer merely negative factors to be eliminated, but become input information for expanding the system's event coverage.
[0087] Residuals are constructed for the sensor pairs among the vision, radar, and acoustic sensors. Taking time $t$ as an example, the standardized residual of the sensor pair $(a,b)$ is defined as:
$$z_{ab}(t) = \frac{r_{ab}(t) - \mu_{ab}}{\sigma_{ab}} \tag{11}$$
where $z_{ab}(t)$ is the standardized residual of the sensor pair formed by the a-th and b-th sensors at time $t$, used to measure the degree of inconsistency of the sensor pair at the current time, and $r_{ab}(t)$ is the raw residual of that sensor pair at time $t$, representing the difference in response between the two sensors to the same observed object. For different modal sensor pairs, the raw residuals are calculated as follows: for a vision-radar sensor pair, the raw residual is the difference in their position estimates for the same target; for a vision-acoustic sensor pair, the raw residual is the angular deviation between the visually detected target direction and the sound source arrival direction; for a radar-acoustic sensor pair, the raw residual is the angular deviation between the radar target direction and the sound source arrival direction. All of these residuals are projected into the angular domain for uniform measurement. $\mu_{ab}$ is the mean of the residual baseline of the sensor pair, obtained from the historical residual statistics under normal operating conditions, and represents the typical residual level of the sensor pair in the absence of abnormal events. $\sigma_{ab}$ is the standard deviation of the residual baseline of the sensor pair, satisfying $\sigma_{ab} > 0$, and reflects the fluctuation range of the normal residual of the sensor pair.
[0088] Equation (11) normalizes the residuals with their baseline statistics, allowing the degree of inconsistency between different sensor pairs to be compared on a uniform scale. This process helps reduce the impact of differences in the dimensions, magnitudes, and noise levels of each sensor on the detection of unknown events, enabling subsequent clustering processes to focus more on the anomalous patterns themselves, rather than the differences in the numerical ranges of the original signals from different modalities.
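The normalization of equation (11) can be sketched as below for an angular raw residual. The function names and the use of degrees are assumptions, and the baseline mean and standard deviation are taken as given inputs rather than estimated here.

```python
def angular_deviation_deg(direction_a: float, direction_b: float) -> float:
    """Smallest absolute angle between two bearings, in degrees (0 to 180)."""
    d = abs(direction_a - direction_b) % 360.0
    return min(d, 360.0 - d)


def standardized_residual(raw: float, baseline_mean: float, baseline_std: float) -> float:
    """Raw residual expressed in units of its baseline standard deviation."""
    if baseline_std <= 0:
        raise ValueError("baseline standard deviation must be positive")
    return (raw - baseline_mean) / baseline_std
```

For example, a visually detected direction of 350 degrees and a sound source direction of 10 degrees give a raw residual of 20 degrees; with a baseline mean of 5 degrees and a baseline standard deviation of 5 degrees, the standardized residual is 3.0.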
[0089] Simultaneously, a health assessment is performed on each sensor channel. Health indicators may include the signal-to-noise ratio of each channel, image quality indicators, radar echo quality indicators, and detection consistency with known static references. When the health of a sensor falls below a threshold, the residual analysis results containing that sensor are temporarily exempted and not included in the unknown event determination.
[0090] The purpose of ruling out sensor degradation before judging unknown events is to avoid mistaking equipment anomalies for new types of traffic incidents. Without performing a health assessment first, abnormal responses caused by lens contamination, radar echo attenuation, microphone malfunction, etc., may also manifest as cross-sensor inconsistencies, thus interfering with the formation of unknown event prototypes. By first ruling out sensor degradation factors, subsequent inconsistency analysis can reflect anomalies more accurately in the scene itself, rather than equipment status malfunctions.
[0091] Let $H_s$ be the current set of healthy sensor pairs. The total inconsistency can be expressed as:
$$D(t) = \sum_{(a,b)\in H_s} w_{ab}\,\max\bigl(0,\ |z_{ab}(t)| - z_{\mathrm{th}}\bigr) \tag{12}$$
where $D(t)$ is the total inconsistency at time $t$, a weighted sum of the inconsistencies of all healthy sensor pairs, used to determine whether there are significant inconsistencies at the current time that cannot be explained by known event types; $(a,b)$ is the index of a sensor pair in the set of healthy sensor pairs $H_s$, such as (vision, radar), (vision, acoustic), or (radar, acoustic); $w_{ab}$ is the weighting coefficient of the sensor pair, satisfying $w_{ab} \ge 0$, and the weights can be pre-normalized so that they sum to 1; $\max(\cdot)$ is the maximum function; and $z_{\mathrm{th}}$ is the standardized residual threshold, for which an example value of 3.0 can be used. For example, the vision-radar weight can be set to 0.4, the vision-acoustic weight to 0.3, and the radar-acoustic weight to 0.3, and these can be adjusted based on historical reliability calibration results.
[0092] Equation (12) weights and aggregates the inconsistencies of multiple healthy sensor pairs to form an overall inconsistency criterion. Its function is to integrate local anomalies scattered across different modal pairs into a unified decision index, thereby avoiding the triggering of unknown event determination based on the anomaly of a single sensor pair. Only when multiple healthy sensor pairs collectively exhibit strong inconsistencies does the system have a stronger basis to believe that the current observation may correspond to an unknown event type.
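The aggregation over healthy sensor pairs can be sketched as follows. The sketch assumes that each pair contributes only the part of its absolute standardized residual that exceeds the threshold (a hinge term built from the maximum function), and that a pair is skipped when either of its sensors is flagged unhealthy; both points, together with the names used, are assumptions rather than the patented formula.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]

# Example weights from the description.
EXAMPLE_WEIGHTS: Dict[Pair, float] = {
    ("vision", "radar"): 0.4,
    ("vision", "acoustic"): 0.3,
    ("radar", "acoustic"): 0.3,
}


def total_inconsistency(
    residuals: Dict[Pair, float],
    healthy_sensors: Set[str],
    weights: Dict[Pair, float] = EXAMPLE_WEIGHTS,
    z_threshold: float = 3.0,
) -> float:
    """Weighted sum of above-threshold standardized residuals over healthy pairs."""
    total = 0.0
    for (a, b), z in residuals.items():
        if a not in healthy_sensors or b not in healthy_sensors:
            continue  # temporarily exempt pairs that contain a degraded sensor
        total += weights[(a, b)] * max(0.0, abs(z) - z_threshold)
    return total
```

With standardized residuals of 5.0, 4.0, and 1.0 for the three pairs and all sensors healthy, the assumed index is 0.4 × 2.0 + 0.3 × 1.0 = 1.1; if the acoustic channel is flagged unhealthy, only the vision-radar pair contributes and the index drops to 0.8.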
[0093] When the total inconsistency continuously exceeds a threshold and cannot be explained by known event types, a significant inconsistency is determined. A feature vector is extracted for each inconsistency event. This feature vector may include the standardized residuals of each sensor pair, the temporal information of the inconsistency, and the spatial distribution characteristics of the inconsistency. An incremental clusterer is used for clustering. When the distance between a new feature vector and an existing cluster center is less than the cluster radius, the feature vector is merged into the corresponding cluster; otherwise, a new cluster is created. An example value of the cluster radius is 2.0. When the number of samples in a cluster exceeds the minimum sample threshold, the cluster is registered as a prototype for a new event type. An example value of the minimum sample threshold is 5.
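The incremental clustering rule can be sketched as below. The Euclidean distance, the running-mean update of the cluster center, and the class and method names are assumptions; the description only fixes the merge-or-create rule, the cluster radius, and the minimum sample threshold.

```python
import math
from typing import List, Sequence


class IncrementalClusterer:
    """Radius-based online clustering of inconsistency feature vectors."""

    def __init__(self, radius: float = 2.0, min_samples: int = 5) -> None:
        self.radius = radius
        self.min_samples = min_samples
        self.centers: List[List[float]] = []
        self.counts: List[int] = []

    def add(self, x: Sequence[float]) -> int:
        """Assign x to the nearest cluster within the radius, else open a new one."""
        best, best_dist = -1, math.inf
        for i, center in enumerate(self.centers):
            dist = math.dist(x, center)
            if dist < best_dist:
                best, best_dist = i, dist
        if best >= 0 and best_dist < self.radius:
            self.counts[best] += 1
            n = self.counts[best]
            # Running mean keeps the center at the average of its members.
            self.centers[best] = [c + (xi - c) / n for c, xi in zip(self.centers[best], x)]
            return best
        self.centers.append(list(x))
        self.counts.append(1)
        return len(self.centers) - 1

    def prototypes(self) -> List[int]:
        """Indices of clusters whose sample count exceeds the minimum threshold."""
        return [i for i, n in enumerate(self.counts) if n > self.min_samples]
```

Feeding six nearby feature vectors and one distant vector into the clusterer yields two clusters, of which only the first has more than five samples and would be registered as a new event type prototype.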
[0094] The technical advantage of incremental clustering lies in enabling the system to gradually form prototypes of new event types during operation without requiring system downtime and retraining of the entire model. As similar inconsistencies accumulate, cluster centers gradually converge to more stable representations, thus transforming scattered anomaly observations into reusable new prototypes. This process allows the system to gradually expand from a closed recognition pattern capable of handling only predefined event types to an open recognition pattern capable of absorbing new types.
[0095] When a significant inconsistency is detected, S4 is triggered to perform directional active verification in the corresponding direction to obtain additional observational information. The results of the active verification are fed back to the incremental clusterer to correct cluster centers or confirm prototype stability.
[0096] The purpose of feeding back the active validation results to the incremental clusterer is to correct unknown event prototypes using additional multi-perspective observation information. If active validation confirms that a certain type of inconsistency corresponds to a stable new event pattern, the credibility of the relevant cluster centers can be enhanced. If validation shows that it originates from occasional perturbations or local environmental factors, the continued expansion of the corresponding prototype can be inhibited. This feedback mechanism helps improve the stability and interpretability of new event type prototypes.
[0097] For a new event type prototype that is stably formed, candidate causal edges can be established with existing event nodes in S5 based on its occurrence order, time interval and spatial proximity, thereby expanding the event causal graph.
[0098] Furthermore, through an exponentially weighted recursive update mechanism for the residual statistical baseline, the system tracks changes in background characteristics under different scenarios and can adapt both to steady-state backgrounds on highways and to abruptly changing backgrounds on urban roads; the prototypes of new event types generated by online incremental clustering vary with the operating scenario, and all of them can expand the event node set of the S5 causal graph.
[0099] In this embodiment, the main link from S1 to S3 is used to form the known event identification result, S4 is used to actively verify gray zone events, S5 is used to implement correlation prediction based on confirmed events and provide feedback to adjust the front-end perception, and S6 is used to transform cross-sensor inconsistencies into clues for discovering unknown events. These steps form a closed-loop processing structure of perception, verification, reasoning, feedback, and expansion, enabling the system to achieve stable identification for known event types and gradually expand its adaptability to unknown event types. The thresholds, weights, time windows, clustering radii, path depths, and update coefficients given in this description are all example values and can be adjusted by those skilled in the art through calibration experiments, historical data statistics, or simulation analysis based on specific sensor performance, deployment scenarios, and target indicators. For parameters for which no unique value is explicitly given, the method of determination is given in the corresponding paragraphs.
[0100] The above are merely preferred embodiments of the present invention and do not limit the scope of the patent. Any equivalent modifications made based on the inventive concept of the present invention and the description and drawings, or any technical solutions directly or indirectly applied to other related technical fields, should be included within the protection scope of the present invention.
Claims
1. A traffic robot event recognition method based on multi-sensor fusion, characterized by comprising:
S1: based on spatial structure information of the traffic scene, performing spatial spectrum estimation on the acoustic signal collected by a microphone array, suppressing multipath reflection peaks, and outputting the direction of arrival of the sound source;
S2: driving the camera and radar to focus on the direction of arrival of the sound source, pre-screening events using a noise-adapted acoustic fingerprint template, and outputting event candidates and an initial confidence level after step-by-step confirmation by multiple sensors;
S3: based on the multi-sensor time-series pattern arising from the speed difference between sound waves and electromagnetic waves, verifying the authenticity of the event candidates using an adaptive time-series tolerance window, updating the initial confidence level, and inferring the event type;
S4: for event candidates whose confidence level remains uncertain after verification in S3, driving the robot to perform active multi-view verification;
S5: inputting confirmed events into an event causal graph to predict related events, wherein the event causal graph uses event types as nodes and causal relationships as directed edges, the edge parameters are updated online using verification data, and the prediction results are fed back to S1 and S2 to adjust the perception parameters;
S6: continuously detecting multi-sensor inconsistencies, excluding sensor degradation, incrementally clustering the inconsistencies that cannot be explained by known event types, and generating new event type prototypes to expand the event causal graph.
2. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that the method further comprises: constructing a multipath propagation model using high-precision maps and a 3D city model to obtain information on the distribution of reflecting surfaces; calculating the direction of arrival of reflected waves using the mirror sound source method; and excluding the spatial spectral peaks corresponding to the direction of arrival of the reflected waves.
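As a non-limiting illustration, the mirror sound source method (commonly known as the image source method) could be sketched in two dimensions as follows. The flat-wall representation by a point and a normal, the function names, and the 5-degree exclusion tolerance are assumptions made for this sketch.

```python
import math


def mirror_source(source, wall_point, wall_normal):
    """Reflect a 2D source position across a wall (a line given by one
    point on it and a normal vector) to obtain the mirror source."""
    nx, ny = wall_normal
    norm = math.hypot(nx, ny)
    nx, ny = nx / norm, ny / norm
    # Signed distance from the source to the wall along the normal.
    d = (source[0] - wall_point[0]) * nx + (source[1] - wall_point[1]) * ny
    return (source[0] - 2.0 * d * nx, source[1] - 2.0 * d * ny)


def bearing_deg(array_pos, target):
    """Bearing from the array to the target, in degrees within [0, 360)."""
    angle = math.atan2(target[1] - array_pos[1], target[0] - array_pos[0])
    return math.degrees(angle) % 360.0


def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def drop_reflection_peaks(peaks_deg, reflected_deg, tol_deg=5.0):
    """Keep only the spectral peaks that lie farther than tol_deg from
    every predicted reflected-wave direction of arrival."""
    return [p for p in peaks_deg
            if all(angular_gap(p, r) > tol_deg for r in reflected_deg)]
```

For a hypothesized source at (10, 0), a wall along y = 5, and an array at the origin, the mirror source lies at (10, 10), so the reflected wave arrives from about 45 degrees and a peak near that bearing would be excluded.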
3. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in S1, the integration time of the noise covariance matrix used for spatial spectrum estimation is adaptively adjusted according to traffic flow parameters estimated by the radar in real time, so as to improve estimation reliability in high-noise environments and time resolution in low-noise environments.
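As a non-limiting illustration, the adaptive integration time could be sketched as a clipped linear mapping from a radar-estimated traffic flow to a time window. The linear form, the flow unit (vehicles per minute), and all numeric bounds are assumptions made for this sketch.

```python
def integration_time_s(flow_veh_per_min, t_min=0.1, t_max=1.0,
                       flow_lo=5.0, flow_hi=60.0):
    """Map traffic flow to a covariance integration time in seconds.

    Heavy traffic (assumed to mean high noise) gets a long window for a
    more reliable estimate; light traffic gets a short window for better
    time resolution. All bounds are illustrative.
    """
    frac = (flow_veh_per_min - flow_lo) / (flow_hi - flow_lo)
    frac = min(max(frac, 0.0), 1.0)
    return t_min + frac * (t_max - t_min)


def snapshot_count(t_int_s, sample_rate_hz, available):
    """Number of most recent snapshots to average into the covariance,
    limited to what is available and never fewer than one."""
    wanted = int(round(t_int_s * sample_rate_hz))
    return max(1, min(available, wanted))
```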
4. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in step S1, after the multipath reflection peaks are suppressed, the remaining peaks are scored in multiple dimensions based on road geometry prior constraints, reflecting-surface distribution priors, historical sound source active area statistics, and spectral peak intensity, and the direction of arrival of the sound source is determined based on the scoring results.
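As a non-limiting illustration, the multi-dimensional scoring of the remaining peaks could be sketched as a weighted sum of four normalized scores. The weighted-sum form, the weights, and the field names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """One remaining spectral peak; every score is assumed to lie in [0, 1]."""
    doa_deg: float
    road_prior: float       # agreement with road geometry
    reflector_prior: float  # agreement with reflecting-surface distribution
    history: float          # historical sound source activity in this area
    intensity: float        # normalized spectral peak intensity


# Illustrative weights, to be set by calibration.
WEIGHTS = {"road_prior": 0.3, "reflector_prior": 0.2,
           "history": 0.2, "intensity": 0.3}


def score(peak, weights=WEIGHTS):
    """Weighted sum of the four per-dimension scores."""
    return sum(w * getattr(peak, name) for name, w in weights.items())


def select_doa(peaks, weights=WEIGHTS):
    """Return the bearing of the best-scoring peak, or None if none remain."""
    if not peaks:
        return None
    return max(peaks, key=lambda p: score(p, weights)).doa_deg
```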
5. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in step S2, the noise adaptation of the acoustic fingerprint template includes: performing an environmental adaptation transformation on the pre-stored acoustic fingerprint template according to the current background noise spectrum, generating an adapted template for the current noise conditions, and performing event pre-screening based on the adapted template.
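As a non-limiting illustration, one possible environmental adaptation transformation is to add the current per-band noise power to the clean template before matching. The additive power-spectrum model, the cosine-similarity matcher, and the 0.9 threshold are assumptions made for this sketch; the claim does not prescribe a specific transformation.

```python
import math


def adapt_template(template_power, noise_power):
    """Assumed adaptation: add the current noise power in each band to
    the pre-stored clean template power in that band."""
    if len(template_power) != len(noise_power):
        raise ValueError("template and noise must have the same band count")
    return [t + n for t, n in zip(template_power, noise_power)]


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra; 0.0 if either is all zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prescreen(observed_power, template_power, noise_power, threshold=0.9):
    """Return True if the observation matches the noise-adapted template."""
    adapted = adapt_template(template_power, noise_power)
    return cosine_similarity(observed_power, adapted) >= threshold
```

In this sketch, an observation that contains both the event band and a noisy band matches the adapted template but would fail against the unadapted clean template.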
6. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, following S2, the method further includes performing a spatial consistency check between the direction of arrival of the sound source and the radar pre-tracking: when an abnormal moving target is being tracked by the radar in the spatial region corresponding to the direction of arrival of the sound source, increasing the initial confidence level in the corresponding direction; and when no abnormal moving target is detected, decreasing the initial confidence level in the corresponding direction.
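As a non-limiting illustration, the spatial consistency check could be sketched as follows. The track dictionary fields `bearing_deg` and `abnormal`, the sector half-width, and the boost and penalty amounts are hypothetical and chosen only for this sketch.

```python
def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two bearings, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def adjust_confidence(initial, doa_deg, radar_tracks,
                      sector_deg=10.0, boost=0.15, penalty=0.10):
    """Raise the initial confidence when the radar is tracking an abnormal
    moving target inside the sector around the sound source direction of
    arrival; lower it otherwise. The result is clamped to [0, 1]."""
    hit = any(track["abnormal"]
              and angular_gap(track["bearing_deg"], doa_deg) <= sector_deg
              for track in radar_tracks)
    updated = initial + boost if hit else initial - penalty
    return min(1.0, max(0.0, updated))
```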
7. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in S3, the posterior probabilities of the time-series templates for multiple event types are calculated in parallel, and whenever a new sensor channel responds, the posterior probability of each event type is updated, thereby realizing event authenticity verification and event type inference; the adaptive time-series tolerance window calibrates the processing delay of each sensor channel and dynamically adjusts the delay parameters according to the distance of the event from the robot.
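As a non-limiting illustration, the parallel posterior update and the distance-dependent tolerance window could be sketched as follows. The Gaussian likelihood, the per-type expected lags, and the tolerance coefficients are assumptions made for this sketch; only the speed of sound in air (about 343 m/s near 20 degrees C) is a physical constant.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # dry air near 20 degrees C


def expected_acoustic_lag_s(distance_m, processing_delay_s=0.0):
    """Lag of the acoustic channel behind the (effectively instantaneous)
    electromagnetic channels, plus a calibrated processing delay."""
    return distance_m / SPEED_OF_SOUND_M_S + processing_delay_s


def tolerance_s(distance_m, base_s=0.02, per_metre_s=0.0005):
    """Assumed tolerance window that widens with event distance."""
    return base_s + per_metre_s * distance_m


def update_posterior(posterior, expected_lag_s, observed_lag_s, sigma_s):
    """Bayes update over all event types for one new channel response.

    posterior and expected_lag_s are dicts keyed by event type. The
    likelihood is an assumed Gaussian in the lag error with width sigma_s.
    """
    unnorm = {
        k: p * math.exp(-0.5 * ((observed_lag_s - expected_lag_s[k]) / sigma_s) ** 2)
        for k, p in posterior.items()
    }
    total = sum(unnorm.values())
    if total == 0.0:
        return dict(posterior)  # no type explains the response; keep prior
    return {k: v / total for k, v in unnorm.items()}
```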
8. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in step S4, the active multi-view verification is performed by planning the observation trajectory under traffic scenario safety constraints; the traffic scenario safety constraints include at least one of the following: dynamic safety distance constraints with respect to traffic flow, right-of-way priority constraints, and safety constraints on observation stopping points; emergency events are reported directly without performing the active multi-view verification; and when the observation point cannot be reached safely, downgraded verification by long-distance zoom observation or directional acoustic enhancement acquisition is initiated.
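As a non-limiting illustration, the branching among direct reporting, active multi-view verification, and downgraded verification could be sketched as follows. The enum, the function name, and the use of line of sight to choose between the two downgraded modes are assumptions made for this sketch.

```python
from enum import Enum


class Action(Enum):
    REPORT_IMMEDIATELY = "report_immediately"
    ACTIVE_MULTI_VIEW = "active_multi_view"
    ZOOM_OBSERVATION = "long_distance_zoom"
    DIRECTIONAL_ACOUSTIC = "directional_acoustic_enhancement"


def choose_action(is_emergency, observation_point_safe, has_line_of_sight):
    """Select the verification path for one gray-zone event candidate."""
    if is_emergency:
        # Emergency events skip active verification entirely.
        return Action.REPORT_IMMEDIATELY
    if observation_point_safe:
        return Action.ACTIVE_MULTI_VIEW
    # Downgraded verification; choosing by line of sight is an assumption.
    if has_line_of_sight:
        return Action.ZOOM_OBSERVATION
    return Action.DIRECTIONAL_ACOUSTIC
```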
9. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in step S5, the online update of the edge parameters includes recursive updating of causal transition probabilities and adjustment of time delay intervals; a prediction confidence decay mechanism is introduced in forward inference, terminating propagation when the cumulative path probability is lower than a preset threshold; the spatial range of the associated events is filtered or weighted according to the direction of arrival of the sound source; the perception parameters include at least one of acoustic pre-screening threshold, visual detection sensitivity, and radar perception parameters; when the predicted event does not occur within the timeout period, the transition probability of the corresponding causal edge is reduced; the learning rate of the online update of the edge parameters is set differently according to the road type, and the learning rate value in the highway scenario is lower than the learning rate value in the urban intersection scenario.
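As a non-limiting illustration, the recursive edge update and the forward inference with confidence decay could be sketched as follows. The graph representation, the multiplicative per-hop decay, the threshold, the learning rate values, and the event names in the usage note are assumptions made for this sketch.

```python
from collections import deque

# Illustrative learning rates: highway lower than urban intersection.
LEARNING_RATE = {"highway": 0.02, "urban_intersection": 0.10}


def update_edge(prob, occurred, learning_rate):
    """Recursive update of a causal transition probability: move toward 1
    when the predicted event occurred, toward 0 when it timed out."""
    target = 1.0 if occurred else 0.0
    return prob + learning_rate * (target - prob)


def predict(graph, start, min_path_prob=0.2, decay=0.9):
    """Forward inference from a confirmed event.

    graph maps an event type to a list of (next_type, transition_prob).
    Each hop multiplies the cumulative path probability by the edge
    probability and by a decay factor; a branch stops once the cumulative
    value falls below min_path_prob. Returns the best cumulative
    probability found for each predicted event type.
    """
    best = {}
    queue = deque([(start, 1.0)])
    while queue:
        node, path_prob = queue.popleft()
        for nxt, p in graph.get(node, []):
            q = path_prob * p * decay
            if nxt == start or q < min_path_prob or q <= best.get(nxt, 0.0):
                continue
            best[nxt] = q
            queue.append((nxt, q))
    return best
```

With hypothetical edges collision to congestion (0.8), collision to fire (0.1), and congestion to secondary collision (0.5), this sketch predicts congestion and secondary collision and drops fire, whose cumulative probability falls below the threshold.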
10. The traffic robot event recognition method based on multi-sensor fusion according to claim 1, characterized in that, in step S6, the multi-sensor inconsistency is detected by residual analysis across sensors; the exclusion of sensor degradation includes monitoring the signal-to-noise ratio of each sensor channel and the detection consistency with a known static reference; the incremental clustering of inconsistencies is a clustering method based on feature vector distance, and when the number of samples in a cluster exceeds a threshold, the cluster is registered as a new event type prototype; when the multi-sensor inconsistency is detected, step S4 is triggered to perform directional active verification toward the source direction of the multi-sensor inconsistency, and the verification result of step S4 is fed back to the incremental clustering of inconsistencies to update the event prototype.
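As a non-limiting illustration, the distance-based incremental clustering with prototype registration could be sketched as follows. The Euclidean distance, the running-mean center update, the class name, and the parameter names are assumptions made for this sketch; the feedback from step S4 is not modeled here.

```python
import math


class IncrementalClusterer:
    """Online clustering of inconsistency feature vectors."""

    def __init__(self, radius, min_samples):
        self.radius = radius            # assumed clustering radius
        self.min_samples = min_samples  # sample count threshold
        self.centers = []
        self.counts = []

    def add(self, x):
        """Assign x to the nearest center within the radius, or open a
        new cluster. Returns (cluster_index, is_prototype), where
        is_prototype is True once the cluster's sample count exceeds
        min_samples."""
        best, best_d = None, self.radius
        for i, center in enumerate(self.centers):
            d = math.dist(center, x)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            self.centers.append(list(x))
            self.counts.append(1)
            best = len(self.centers) - 1
        else:
            self.counts[best] += 1
            n = self.counts[best]
            # Running mean keeps the center at the average of its samples.
            self.centers[best] = [c + (v - c) / n
                                  for c, v in zip(self.centers[best], x)]
        return best, self.counts[best] > self.min_samples
```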