Cross-scene behavior risk grading system and method based on multi-modal fusion

The multimodal fusion cross-scenario behavior risk classification system solves the problem of insufficient cross-scenario recognition accuracy, and achieves efficient and real-time target recognition and behavior risk classification, which is suitable for scenarios such as security monitoring and intelligent management.

CN122243168A · Pending · Publication Date: 2026-06-19 · SICHUAN JIUZHOU SOFTWARE CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SICHUAN JIUZHOU SOFTWARE CO LTD
Filing Date
2026-01-22
Publication Date
2026-06-19


Abstract

This invention belongs to the field of artificial intelligence technology and specifically relates to a cross-scenario behavioral risk classification system and method based on multimodal fusion. The system includes: a data acquisition module for real-time acquisition of multimodal data; a target recognition module for identifying features of the multimodal data to obtain preliminary recognition results; a multi-relationship graph module for constructing a knowledge graph, extracting entities and entity relationships from the preliminary recognition results to update it, and combining the updated knowledge graph with contextual information reasoning to obtain optimized recognition results; and a behavioral event module configured to extract temporal features from the multimodal data for target behavioral chain identification, construct a risk scoring index, and combine the optimized recognition results to determine the risk classification of the target behavioral chain. This invention can accurately identify complete behavioral chains, achieve differentiated classification of behavioral risks, and adapt to application needs in scenarios such as security monitoring and intelligent management.

Description

Technical Field

[0001] This invention relates to the fields of computer vision, artificial intelligence and pattern recognition, and in particular to a cross-scenario behavioral risk classification system and method based on multimodal fusion.

Background Technology

[0002] Currently, target recognition technology has been widely applied to many key fields such as public safety, intelligent monitoring, and autonomous driving. The need for recognition of small targets and targets in cross-scenario environments is particularly urgent. However, current mainstream target recognition technologies still face many bottlenecks in practical applications that urgently need to be overcome. These technological limitations restrict the performance of recognition systems and make it difficult to meet the high-precision recognition requirements in complex real-world scenarios. Specifically, these limitations manifest in the following aspects:

1. Insufficient data and modal support.

[0003] Existing recognition systems rely on single-modal data to build models. Visual or speech modalities are easily affected by factors such as lighting, angle, and noise in complex environments, resulting in insufficient recognition stability. At the same time, the models have limited ability to mine multi-dimensional information such as time series and 3D spatial structure, making it difficult to fully extract the features of distant and small-sized targets, further restricting recognition accuracy. In addition, traditional technologies rely heavily on manually labeled data, which is costly, time-consuming, and highly subjective, and cannot adapt to the diverse labeling needs in open domains.

[0004] 2. Weak ability to adapt to scenarios and targets.

[0005] Most models adopt a targeted training mode for specific scenarios and specific targets, which is insufficient for generalization. When faced with changes in target shape (occlusion, deformation, brand / type switching) or dynamic scene switching (indoor and outdoor environment conversion), the recognition accuracy and efficiency will drop significantly, making it difficult to adapt to diverse application scenarios in the open domain.

[0006] 3. Semantic understanding and behavioral analysis are incomplete.

[0007] Existing solutions can only achieve simple detection of isolated targets, failing to build semantic associations between targets and scenes, and between targets and behaviors, and lacking context awareness capabilities. At the same time, they cannot model the complete behavioral chain of "action start-process-end", and have not formed a correlation analysis between behavior and scene risk, making it difficult to achieve differentiated judgment of risk level based on scene differences, and failing to meet the intelligent recognition needs in complex scenarios.

[0008] In summary, existing target recognition technologies have shortcomings in modal fusion, scene adaptation, data annotation efficiency, semantic understanding, information utilization, and behavior analysis, which severely restrict their application in the field of cross-scene small target recognition. There is an urgent need for a technical method that can integrate multimodal information and improve cross-scene adaptability and small target recognition accuracy to solve these problems.

Summary of the Invention

[0009] To address the shortcomings of the existing technologies, this invention provides a cross-scenario behavioral risk classification system and method based on multimodal fusion. The system includes:
Data acquisition module: used to acquire multimodal data in real time, including video stream data, 3D information data, audio data and sensor data;
Target recognition module: connected to the data acquisition module, used to recognize the features of the multimodal data and obtain preliminary recognition results including target category, target location, and result confidence level;
Multi-relationship graph module: connected to the target recognition module, used to construct a knowledge graph, extract entities and entity relationships from the preliminary recognition results to update the knowledge graph, and perform contextual information reasoning over the updated knowledge graph to obtain optimized recognition results;
Behavioral event module: connected to the data acquisition module and the multi-relationship graph module respectively, configured to extract the temporal features of the multimodal data for target behavior chain identification; construct a risk scoring index, compare the risk scoring index with a preset threshold, and determine the risk level of the target behavior chain.

[0010] Preferably, the data acquisition module further includes a data preprocessing module connected to the data acquisition module, comprising:
Image preprocessing unit: used to extract frames from the video stream data according to frequency to obtain multiple image frame data; to segment the image frame data, retain candidate regions containing the target, and obtain preprocessed image frame data;
Audio preprocessing unit: used to segment, denoise, and extract features from the audio data to obtain preprocessed audio data;
Spatiotemporal alignment unit: connected to the image preprocessing unit and the audio preprocessing unit respectively, used to combine the preprocessed image frame data, preprocessed audio data and sensor data to generate original data with timestamps, and construct a spatiotemporal data matrix as preprocessed data.

[0011] Preferably, the target recognition module has a built-in target recognition model, and the training method of the target recognition model includes: acquiring preprocessed data, labeling the preprocessed data using a large model, and manually correcting low-confidence labels to generate the training set; extracting features from the training set and inputting them into the target recognition unit to complete the training of the target recognition model.

[0012] Preferably, the behavior event module includes:
Behavioral temporal modeling unit: used to construct a three-stage behavioral chain including the start, process and end of the behavior based on the temporal characteristics, and obtain a preliminary behavioral chain;
Feature enhancement unit: connected to the behavior temporal modeling unit, used to perform weight allocation on the multimodal features of the three-stage behavior chain through an enhancement algorithm, optimize the recognition result of the preliminary behavior chain, and obtain the target behavior chain;
Event structuring unit: connected to the feature enhancement unit, used to combine the target behavior chain with the optimized identification result to generate structured behavior event information; and to construct a risk scoring index based on the behavior event information for risk classification determination of the target behavior chain.

[0013] Preferably, the behavioral event information includes: subject, behavior, object, scene, time, and spatial coordinates.

[0014] Preferably, the method for constructing the risk scoring index includes: constructing the risk scoring index by weighted fusion of the behavioral hazard item, the scenario risk item, the behavioral impact scope item, the behavioral persistence item, and the subject's historical behavioral risk record item. The risk scoring index S is expressed as: $S = w_1 B + w_2 C + w_3 R + w_4 D + w_5 H$, where $B$, $C$, $R$, $D$, and $H$ respectively represent the behavioral hazard item, the scenario risk item, the behavioral impact scope item, the behavioral persistence item, and the subject's historical behavioral risk record item; $w_1$, $w_2$, $w_3$, $w_4$, and $w_5$ are the weight coefficients and satisfy $w_1 + w_2 + w_3 + w_4 + w_5 = 1$.

[0015] Preferably, the behavior event module further includes an adaptive learning module: connected to the behavior event module, used to dynamically adjust the feature extraction weights of the target recognition module according to the application scenario, update the training set of the target recognition module based on recognition error cases, and optimize recognition performance.

[0016] The present invention also provides a cross-scenario behavior risk classification method based on multimodal fusion, the method being used to implement the cross-scenario behavior risk classification system based on multimodal fusion as described in any one of claims 1-7.

[0017] Compared with the prior art, the present invention has the following beneficial effects: This invention leverages the complementarity of multimodal data to provide effective support for accurate identification of small targets, significantly improving target recognition performance compared to single-modal methods. By automatically labeling data with large models, it reduces manual costs while rapidly adapting to new targets, improving data processing efficiency. It exhibits excellent scene and target adaptability, achieving stable identification across different environments, target shapes, and behavioral patterns, demonstrating strong robustness. Real-time performance meets practical application requirements, achieving 30 FPS real-time processing on ordinary GPUs. It not only supports unlimited expansion of target categories but also accurately identifies complete behavioral chains and generates structured event information. Combined with scene differences, it performs differentiated risk assessment of behaviors, adapting to application needs in security monitoring, intelligent management, and other scenarios.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a schematic diagram of a cross-scene small target recognition method based on multimodal fusion provided in a preferred embodiment of the present invention.

Detailed Implementation

[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0021] Example 1: As shown in Figure 1, this invention discloses a cross-scenario behavior risk classification system based on multimodal fusion.

[0022] Existing technologies rely on single-modal data, adapt poorly to diverse application scenarios, lack a complete "action start-process-end" behavior chain, and cannot differentiate risk levels based on scenario differences. To address these problems, in some preferred embodiments the system of the present invention includes the following. Data acquisition module: used to acquire multimodal data in real time, including video stream data, 3D information data, audio data, and sensor data; the acquired multimodal data is input into the target recognition module for processing. Owing to its temporal continuity and scene correlation, audio data in particular can provide "non-visual verification" of target behavior when a small target is only partially visible behind obstacles or when low light at night blurs image features. Furthermore, for noisy environments and weak target audio such as ignition sounds, lock-picking sounds, and small mechanical operating sounds, preprocessing is performed: for example, an adaptive noise suppression algorithm based on spectral subtraction combined with wavelet threshold denoising separates the target audio from environmental noise, and Mel-spectrum feature extraction improves the identification of weak audio. All four types of data are acquired through synchronized acquisition devices and therefore share an inherent spatiotemporal correlation, which ensures the effectiveness of subsequent data preprocessing.

[0023] Target recognition module: Connected to the data acquisition module, it is used to identify the features of the multimodal data and obtain preliminary recognition results including target category, target location, and result confidence level. The result confidence level refers to a quantified value of the reliability of the preliminary recognition result; the preliminary recognition result refers to the unoptimized raw result directly output by the target recognition module; the specific recognition method can be reasonably designed by those skilled in the art based on actual needs or the actual situation on site, for example, using an LSTM network to analyze video stream data and identify dynamic behaviors, such as the continuous action of raising a hand to smoke.

[0024] Multi-relationship graph module: connected to the target recognition module, used to construct a knowledge graph, extract entities and entity relationships from the preliminary recognition results to update the knowledge graph, and perform contextual information reasoning over the updated knowledge graph to obtain optimized recognition results.

[0025] The multi-dimensional relational graph is a knowledge graph that represents the relationship between a target, a scene, and a behavior using a graph structure. Nodes represent three types of entities, and edges represent the relationships between entities. Existing technologies only identify single entities in isolation, such as target categories or single behaviors, without constructing relationships between targets and scenes or between targets and behaviors. This lack of contextual association leads to an inability to perceive the rationality of a scene, such as failing to distinguish between "pedestrians smoking in a no-smoking area" and "pedestrians smoking in a smoking area," and further hindering multi-target interactive reasoning, resulting in fragmented detection. To address these issues, in some preferred embodiments, this invention constructs a multi-dimensional relational graph that connects the relationships between targets, scenes, and behaviors, such as "pedestrian - located in - no-smoking area - performing - smoking," allowing the system to upgrade from identifying single information points to understanding the complete scene; the rationality of behavior is judged based on graph relationships, avoiding misjudging illegal behaviors as normal behaviors.
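The "pedestrian - located in - no-smoking area - performing - smoking" relationship described above can be illustrated with a small triple store. This is only a minimal sketch: the `RelationGraph` class, the relation names `located_in` and `performing`, and the `PROHIBITED` rule table are hypothetical names chosen for illustration and are not specified in the application.

```python
class RelationGraph:
    """Tiny triple store holding (head, relation, tail) edges."""

    def __init__(self):
        self.triples = set()

    def add(self, head, relation, tail):
        self.triples.add((head, relation, tail))

    def tails(self, head, relation):
        # All entities reachable from `head` through `relation`.
        return {t for h, r, t in self.triples if h == head and r == relation}


# Hypothetical scene rules: behaviors that each scene type prohibits.
PROHIBITED = {
    "no_smoking_area": {"smoking"},
    "smoking_area": set(),
}


def is_violation(graph, target):
    """True if the target performs a behavior that its current scene prohibits."""
    scenes = graph.tails(target, "located_in")
    behaviors = graph.tails(target, "performing")
    return any(b in PROHIBITED.get(s, set()) for s in scenes for b in behaviors)


g = RelationGraph()
g.add("pedestrian_1", "located_in", "no_smoking_area")
g.add("pedestrian_1", "performing", "smoking")
g.add("pedestrian_2", "located_in", "smoking_area")
g.add("pedestrian_2", "performing", "smoking")
print(is_violation(g, "pedestrian_1"), is_violation(g, "pedestrian_2"))
```

The same behavior node ("smoking") yields different judgments depending on the scene node it is linked to, which is the context awareness the paragraph attributes to the graph.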

[0026] In some other preferred embodiments, when the confidence level of the result is higher than a preset first confidence threshold (e.g., 0.8), the present invention updates the graph in real time based on the preliminary identification result, such as adding targets or adjusting scene relationships; if the confidence level of the result is lower than the first confidence threshold, the target is marked as a "sample to be verified", and the corresponding preliminary identification result does not participate in the graph update. This provides a high-quality data foundation for subsequent behavior chain identification and risk classification.
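The confidence gate described above can be sketched as follows. The 0.8 threshold is the example value from the text; the detection dictionary layout, the `located_in` relation, and the choice to treat a confidence exactly equal to the threshold as "to be verified" are assumptions made for this sketch, since the text does not specify them.

```python
FIRST_CONFIDENCE_THRESHOLD = 0.8  # example value given in the text


def update_graph(graph, detections, threshold=FIRST_CONFIDENCE_THRESHOLD):
    """Add high-confidence detections to the graph; hold back the rest.

    graph: set of (head, relation, tail) triples, mutated in place.
    Returns the detections marked as samples to be verified.
    """
    to_verify = []
    for det in detections:
        if det["confidence"] > threshold:
            # Only reliable results participate in the graph update.
            graph.add((det["target"], "located_in", det["scene"]))
        else:
            # Assumption: equality with the threshold is also held back.
            to_verify.append({**det, "status": "sample to be verified"})
    return to_verify


graph = set()
detections = [
    {"target": "pedestrian_1", "scene": "no_smoking_area", "confidence": 0.93},
    {"target": "pedestrian_2", "scene": "gas_station", "confidence": 0.55},
]
pending = update_graph(graph, detections)
print(graph, pending)
```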

[0027] Behavioral event module: connected to the data acquisition module and the multivariate relationship graph module respectively, configured to extract the temporal features of the multimodal data for target behavior chain identification; construct a risk scoring index, and combine the optimized identification results to determine the risk level of the target behavior chain.

[0028] The main shortcomings of existing behavior detection technologies include: First, fragmented behavior recognition. Most technologies can only identify isolated, discrete behavioral states such as "smoking," "raising a hand," and "crossing the road," failing to capture the complete temporal logic of "action start-process-end." For example, they can only detect independent actions such as "pedestrian raising a hand," "pedestrian holding a cigarette," and "pedestrian exhaling smoke," but cannot connect them into a complete behavioral chain of "taking out a cigarette, lighting it, smoking, and extinguishing it." This results in the understanding of the target behavior remaining at the fragmented action level, making it difficult to reconstruct the full picture of the behavior. Second, risk assessment lacks contextual relevance. Existing risk classifications mostly rely on single behavioral features, such as classifying smoking as risky, without establishing a correlation with contextual attributes. For example, they cannot distinguish between "smoking in a smoking area" (low risk) and "smoking at a gas station" (high risk), nor can they dynamically assess risk based on the interaction between the target and the context. This leads to risk assessments being detached from real-world scenarios, resulting in insufficient accuracy and practicality. To address these issues, this invention uses a behavior event module to identify target behavior chains and constructs a risk scoring index. It then combines this with a real-time updated multivariate relationship graph to determine the risk classification of the target behavior chain.

[0029] This embodiment is based on multimodal data with strong spatiotemporal correlation. Through collaborative preprocessing and advanced feature fusion technology, combined with an efficient target recognition model, it avoids interference from single modalities and significantly improves the robustness and accuracy of small target recognition and localization in complex environments. The result confidence level provides quantitative support for the reliability of the results. The multivariate relationship graph breaks through the limitations of single entity recognition, transforming fragmented information into complete scene logic and avoiding misjudgment of behavior. The construction of continuous behavior chains solves the problem of fragmentation in traditional technology recognition. Combined with the real-time updated relationship graph, the system dynamically assesses risks based on scene attributes and target-scene associations, getting rid of the one-sidedness of single behavior judgment and improving the practicality of risk classification.

[0030] Example 2: Based on Example 1, this example preprocesses the multimodal data.

[0031] Existing multimodal data preprocessing methods often rely on independent single-modal processing supplemented by simple cross-modal collaboration. The processing logic for different modalities differs significantly from conventional operations, and the cross-modal collaboration process is relatively simplified. This preprocessing approach suffers from drawbacks such as video frame extraction and audio segmentation often being performed independently, failing to integrate sensor data for temporal alignment, leading to cross-modal information misalignment. Visual preprocessing often operates on entire frames of images, failing to extract target regions, resulting in significant background interference and low feature extraction efficiency. Furthermore, the processed data is fragmented, failing to form structured data with spatiotemporal correlation. To address these issues, in some preferred embodiments, the data acquisition module of this invention further includes a data preprocessing module, comprising an image preprocessing unit, an audio preprocessing unit, and a spatiotemporal alignment unit.

[0032] In some preferred embodiments, the image preprocessing unit of the present invention is used to extract frames from the video stream data by frequency to obtain multiple image frame data, and to segment the image frame data, retain candidate regions containing the target, and obtain preprocessed image frame data; segmentation tools such as Mask R-CNN can be used to extract the target candidate regions of the image frames, and morphological operations can be used to optimize the edges so that effective target information is retained. The audio preprocessing unit is used to segment, denoise, and extract features from the acquired audio data to obtain preprocessed audio data; the specific methods for segmenting, denoising, and extracting features can be reasonably designed by those skilled in the art according to actual conditions or field needs, provided that high-quality audio data is obtained. The spatiotemporal alignment unit is used to combine the preprocessed image frame data, the preprocessed audio data, and the sensor data to generate original data with timestamps, and to construct a spatiotemporal data matrix as preprocessed data. With timestamps as the core index, the processed multimodal features (images, audio, 3D space, and sensor time series) are fused into a high-dimensional spatiotemporal data matrix that retains the independent features of each modality while establishing the relationships between modalities; this achieves the structured integration of multimodal data and generates the preprocessed data.
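One way to read "timestamps as the core index" is nearest-timestamp matching of each modality to the extracted video frames. The sketch below assumes each stream is a time-sorted list of `(timestamp_seconds, feature)` pairs and uses a hypothetical `tolerance` parameter; neither the data layout nor the matching rule is specified in the application.

```python
from bisect import bisect_left


def nearest(timestamps, t):
    """Index of the entry in a sorted, non-empty list that is closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(frames, audio, sensors, tolerance=0.05):
    """Build one row per frame: [timestamp, frame_feat, audio_feat, sensor_feat].

    A modality with no sample within `tolerance` seconds is filled with None,
    so each row keeps the independent features of every modality side by side.
    """
    rows = []
    for t, frame_feat in frames:
        row = [t, frame_feat]
        for stream in (audio, sensors):
            ts = [s[0] for s in stream]
            if not ts:
                row.append(None)
                continue
            j = nearest(ts, t)
            row.append(stream[j][1] if abs(ts[j] - t) <= tolerance else None)
        rows.append(row)
    return rows


frames = [(0.0, "f0"), (0.1, "f1")]
audio = [(0.01, "a0"), (0.5, "a1")]
sensors = [(0.1, "s1")]
matrix = align(frames, audio, sensors)
print(matrix)
```

In a real pipeline the string placeholders would be feature vectors, and the rows would typically be stacked into an array; plain lists are used here to keep the sketch dependency-free.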

[0033] The preprocessing flow in this embodiment focuses on spatiotemporal alignment, target focusing, and structural fusion to achieve efficient processing of multimodal data and high-quality input data.

[0034] Example 3: Based on Example 1, this example trains an improved target recognition model.

[0035] In some preferred embodiments, the target recognition module has a built-in target recognition model, and the training method of the target recognition model includes: acquiring preprocessed data, labeling the preprocessed data using a large model, and manually correcting low-confidence labels to generate the training set; extracting features from the training set and inputting it into the target recognition unit to complete the training of the target recognition model.

[0036] Traditional methods generally rely on manual annotation, where annotators manually label pre-processed images and video frames one by one using tools such as LabelImg and VGG ImageAnnotator to assign information such as target categories and bounding box positions. This approach has significant drawbacks, including extremely low annotation efficiency. Annotating a single complex scene image can take several minutes. When dealing with training sets containing massive amounts of multimodal data, a large number of annotators need to work continuously, resulting in a lengthy data preparation cycle and severely slowing down model development. The labor costs of professional annotators, combined with the management costs incurred from long-term annotation, make the annotation cost of large-scale training sets a significant expense in the project. Manual annotation is susceptible to factors such as personnel experience, fatigue, and subjective judgment differences, which can easily lead to errors in the annotation of small or blurred targets, and it is difficult to ensure the consistency of annotation results for large batches of data. In open domain scenarios, target categories are constantly expanding, and the annotation of new target categories requires retraining of personnel, further increasing costs and time, and failing to quickly respond to the needs of model iteration.

[0037] To address the above issues, in some preferred embodiments, this invention employs a collaborative annotation scheme combining large-scale automatic annotation with precise manual correction. The implementation process includes: selecting a large VL model with cross-modal understanding capabilities, such as CLIP or BLIP-2, and inputting preprocessed multimodal data into the model; the model, based on its pre-trained visual-language association capabilities, automatically identifies target entities in the data and outputs preliminary annotation results including target category, location coordinates, and annotation confidence level, where the confidence level quantifies the reliability of the annotation results. If the target annotation confidence level is less than a preset annotation confidence threshold (e.g., 0.7), it is pushed to a manual annotation platform; annotators only review and correct this portion of the data, without needing to process the high-confidence, reliable annotation results. Finally, the corrected annotation data and the high-confidence automatically annotated data are integrated to generate a training set.
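The collaborative annotation flow above can be sketched as a simple routing step. The 0.7 threshold is the example value from the text; the label dictionary layout and the `correct_fn` callback, which stands in for the manual annotation platform, are hypothetical.

```python
ANNOTATION_CONFIDENCE_THRESHOLD = 0.7  # example value given in the text


def build_training_set(auto_labels, correct_fn,
                       threshold=ANNOTATION_CONFIDENCE_THRESHOLD):
    """Merge high-confidence automatic labels with manually corrected ones.

    auto_labels: dicts with at least a "confidence" key (model output).
    correct_fn:  stand-in for human review; receives a low-confidence label
                 and returns the corrected label.
    Returns (training_set, number_of_labels_sent_for_review).
    """
    training_set, reviewed = [], 0
    for label in auto_labels:
        if label["confidence"] < threshold:
            training_set.append(correct_fn(label))  # human reviews only these
            reviewed += 1
        else:
            training_set.append(label)  # trusted as-is
    return training_set, reviewed


def fake_reviewer(label):
    # Placeholder for a human annotator correcting the category.
    return {**label, "category": "lighter", "confidence": 1.0, "source": "manual"}


labels = [
    {"category": "cigarette", "box": (10, 20, 30, 40), "confidence": 0.92},
    {"category": "pen", "box": (50, 60, 70, 80), "confidence": 0.41},
]
train, n_reviewed = build_training_set(labels, fake_reviewer)
print(n_reviewed, [x["category"] for x in train])
```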

[0038] In some preferred embodiments, the target recognition unit includes a feature fusion unit, which is used to transform features from different modalities into unified-dimensional fused features; the target recognition model is responsible for target classification and localization based on the fused features. Commonly used feature fusion methods include data-level concatenation, feature-level concatenation, weighting (such as Cross-Attention and Transformer), decision-level voting, and weighted averaging.
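Of the fusion methods listed above, decision-level weighted averaging is the simplest to illustrate. The sketch below assumes each modality has already produced per-class scores; the modality names, class names, and weights are hypothetical values, not taken from the application.

```python
def weighted_average_fusion(scores_by_modality, weights):
    """Fuse per-class scores from several modalities by weighted averaging.

    scores_by_modality: {modality: {class_name: score}}
    weights:            {modality: weight}; normalized over present modalities.
    """
    total = sum(weights[m] for m in scores_by_modality)
    classes = set().union(*(s.keys() for s in scores_by_modality.values()))
    return {
        c: sum(weights[m] * s.get(c, 0.0)
               for m, s in scores_by_modality.items()) / total
        for c in classes
    }


scores = {
    "video": {"cigarette": 0.6, "pen": 0.4},   # visually ambiguous
    "audio": {"cigarette": 0.9, "pen": 0.1},   # lighter click heard
}
fused = weighted_average_fusion(scores, {"video": 0.5, "audio": 0.5})
best = max(fused, key=fused.get)
print(best, fused)
```

Feature-level fusion with Cross-Attention or a Transformer would instead combine the modality features before classification and would learn the weights rather than fix them by hand.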

[0039] This embodiment addresses the issues of single-modality susceptibility to environmental interference and lack of synergy in simple multimodal superposition by employing strongly spatiotemporally correlated multimodal data. It fully leverages the complementary advantages of cross-modal approaches, effectively improving the accuracy and robustness of small target recognition in complex environments. Simultaneously, the synergy between automatic large-model annotation and manual correction significantly improves annotation efficiency, shortens data preparation cycles, and substantially reduces labor costs. It also avoids subjective biases in manual annotation, ensuring the consistency and accuracy of training set annotation results and providing a high-quality data foundation for model training. Relying on diverse feature fusion methods and a mature target recognition architecture, the constructed model can efficiently process multimodal fusion features and adapt to target recognition needs in different scenarios.

[0040] Example 4: Based on any one of Examples 1 to 3, this example identifies the target's behavioral chain and determines the behavioral risk level.

[0041] To address the issues of fragmented behavior recognition and scenario-independent risk assessment in existing technologies, in some preferred embodiments, this invention relies on the behavior event module to identify the target's behavior chain and to classify and assess behavior risk. The behavior event module includes:
Behavioral temporal modeling unit: used to construct a three-stage behavioral chain including the start, process and end of the behavior based on the temporal characteristics, and obtain a preliminary behavioral chain.

[0042] The behavior temporal modeling unit takes multimodal temporal features, including the action temporal sequence of video streams, the pose temporal sequence of sensors, and the voiceprint temporal sequence of audio, as well as the preliminary recognition results output by the target recognition module, including target category, location, and confidence level, as input. It models the multimodal temporal features using commonly used temporal correlation algorithms (such as the Transformer temporal encoder) to capture the temporal dependencies of behaviors. First, it extracts all discrete behavioral actions (such as "taking out a cigarette," "holding a cigarette," and "lighting a cigarette") of the same target from the preliminary recognition results, and labels the timestamps and sequence of each action based on the multimodal temporal features. In some preferred embodiments, relying on a pre-trained behavior temporal logic library (e.g., the "lighting a cigarette" action should occur after "holding a cigarette"), it identifies missing stages in the behavior sequence (e.g., if "holding a cigarette" and "smoking" are detected, but "lighting a cigarette" is not detected, then a behavior is considered missing). Finally, through complementary reasoning of multimodal features, the missing stages are filled in, ultimately forming a complete behavior chain including the start-process-end of the behavior, resulting in a preliminary complete behavior chain.
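The missing-stage check described above can be sketched with a lookup table standing in for the pre-trained behavior temporal logic library. The `TEMPORAL_LOGIC` table and the action names are hypothetical, and the sketch only detects gaps; filling them through complementary multimodal reasoning is outside its scope.

```python
# Hypothetical temporal logic library: expected stage order for each behavior.
TEMPORAL_LOGIC = {
    "smoking": [
        "take_out_cigarette",
        "hold_cigarette",
        "light_cigarette",
        "smoke",
        "extinguish_cigarette",
    ],
}


def order_actions(detected):
    """Sort (timestamp, action) pairs by time and return the action sequence."""
    return [action for _, action in sorted(detected)]


def missing_stages(behavior, detected):
    """Stages expected between the first and last observed stage but not seen."""
    expected = TEMPORAL_LOGIC[behavior]
    seen = set(order_actions(detected))
    idx = [i for i, a in enumerate(expected) if a in seen]
    if not idx:
        return []
    first, last = idx[0], idx[-1]
    return [a for a in expected[first:last + 1] if a not in seen]


# "holding" and "smoking" detected, "lighting" not detected.
detected = [(12.0, "smoke"), (3.0, "hold_cigarette")]
print(order_actions(detected), missing_stages("smoking", detected))
```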

[0043] Feature enhancement unit: connected to the behavior temporal modeling unit, used to perform weight allocation on the multimodal features of the three-stage behavior chain through enhancement algorithm, optimize the recognition result of the preliminary behavior chain, and obtain the target behavior chain.

[0044] The feature enhancement unit in this embodiment adopts a "cross-modal attention enhancement algorithm" to assign weights to the multimodal features of each stage of the behavior chain. For example, the initial stage of the behavior focuses on 3D spatial features and audio features, while the process stage focuses on video visual features. It corrects the time interval deviation of the behavior chain and eliminates redundant behavior segments, such as excluding interference actions that are irrelevant to the target, thereby improving the uniqueness of behavior type recognition and outputting a high-precision behavior chain.
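The stage-dependent weighting can be illustrated with a softmax over per-stage modality scores. This is a deliberately simplified stand-in: a real cross-modal attention layer would learn its scores from the features themselves, whereas the `STAGE_LOGITS` table below is a hand-set, hypothetical table that only mirrors the emphasis described in the text (3D spatial and audio features at the start, video features during the process stage).

```python
import math

# Hypothetical relevance scores per behavior-chain stage and modality.
STAGE_LOGITS = {
    "start":   {"spatial_3d": 2.0, "audio": 2.0, "video": 0.5},
    "process": {"spatial_3d": 0.5, "audio": 0.5, "video": 2.0},
}


def stage_weights(stage):
    """Softmax over the modality scores of one stage; weights sum to 1."""
    logits = STAGE_LOGITS[stage]
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {m: math.exp(v - peak) for m, v in logits.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


def fuse(stage, features):
    """Weighted sum of equal-length feature vectors, one per present modality."""
    weights = stage_weights(stage)
    present = {m: weights[m] for m in features}
    norm = sum(present.values())  # renormalize over modalities actually present
    dim = len(next(iter(features.values())))
    return [
        sum(present[m] / norm * features[m][i] for m in features)
        for i in range(dim)
    ]


w_start = stage_weights("start")
fused = fuse("process", {"video": [1.0, 0.0], "audio": [0.0, 1.0]})
print(w_start, fused)
```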

[0045] Event structuring unit: connected to the feature enhancement unit, used to combine the target behavior chain with the optimized identification result to generate structured behavior event information; and to construct a risk scoring index based on the behavior event information for risk classification determination of the target behavior chain.

[0046] Structured behavioral event information is generated through event structuring units combined with a real-time updated multivariate relationship graph. This structured behavioral event information includes: subject, behavior, object, scene, time, and spatial coordinates. The event structuring unit extracts entity information associated with the behavior chain from the multivariate relationship graph, including: subject (target category and ID); object (category and location of the affected object); scene (current environment type). This information is then combined with the timestamp of the behavior chain and the spatial coordinates (real-time location trajectory of the target) to generate structured behavioral event information.
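The six fields of the structured behavioral event information map naturally onto a record type. In the sketch below the field names follow the text, while the input layouts (`chain`, `entities`, `trajectory`) and the splitting of "time" into start and end timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class BehaviorEvent:
    subject: str                           # target category and ID
    behavior: str                          # recognized behavior chain type
    object: str                            # category of the affected object
    scene: str                             # current environment type
    start_time: float                      # first timestamp of the chain
    end_time: float                        # last timestamp of the chain
    trajectory: List[Tuple[float, float]]  # spatial coordinates over time


def structure_event(chain, entities, trajectory):
    """Combine a behavior chain with graph entities into one event record."""
    times = [t for t, _ in chain["stages"]]
    return BehaviorEvent(
        subject=entities["subject"],
        behavior=chain["behavior"],
        object=entities["object"],
        scene=entities["scene"],
        start_time=min(times),
        end_time=max(times),
        trajectory=trajectory,
    )


chain = {"behavior": "smoking",
         "stages": [(3.0, "hold_cigarette"), (5.0, "light_cigarette"), (12.0, "smoke")]}
entities = {"subject": "pedestrian_1", "object": "fuel_pump", "scene": "gas_station"}
event = structure_event(chain, entities, [(1.0, 2.0), (1.5, 2.2)])
print(asdict(event))
```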

[0047] A risk scoring index is constructed and compared with preset thresholds to determine the risk level of the target behavior chain. In some preferred embodiments, the risk scoring index is constructed by multi-dimensional fusion. The dimensions are: a behavioral hazard item, a scenario risk item, a behavioral impact scope item, a behavioral duration item, and the subject's historical behavioral risk record item. In some preferred embodiments, the weight coefficients are obtained from scores given by experts in the field and verified by consistency checks. For example, the weights of the five scoring items above are 40%, 25%, 20%, 10%, and 5%, respectively. The behavioral hazard item is graded as high risk, medium risk, low risk, or no risk; the scenario risk item is graded as high risk, medium risk, or low risk; the impact scope item is graded as ≥3 people or key objects, fewer than 3 people or ordinary objects, or only the subject or no objects. In some preferred embodiments, key objects are objects whose damage or disturbance would cause serious consequences, such as flammable and explosive materials (which may lead to explosions) or important equipment (such as the core units of a substation, whose failure would affect the power supply of a large area); ordinary objects are objects whose damage or disturbance would cause only minor losses, such as personal belongings or ordinary office desks and chairs. The duration item is graded as ≥5 minutes, 1-5 minutes, or <1 minute. In other preferred embodiments, depending on the actual application scenario, the subject's historical behavioral risk record is also included as a weighted scoring dimension.
For example, if a logistics worker smokes illegally (high-risk behavior) in a chemical warehouse (high-risk scenario), the behavior involves flammable and explosive chemicals in the warehouse (key objects) and lasts for 7 minutes, and the worker had a high-risk record of illegally using open flames in a chemical area 3 months earlier, the corresponding risk level is raised. The grading rules for the five dimensions, including behavioral hazard level, scenario risk level, and scope of behavioral impact, are determined by combining statistical data on risk incidents in the field, industry consensus standards, and the severity of the harmful consequences caused by the behavior or scenario.
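The weighted fusion of the five scoring items can be sketched as follows, using the example weights of 40%, 25%, 20%, 10%, and 5%. The 0-100 scale for each item, the dictionary keys, and the function name are assumptions made for illustration; the embodiment does not list the points assigned to each grade.

```python
# Example weights from the embodiment, kept as integer percentages so the
# arithmetic stays exact.
WEIGHTS_PERCENT = {
    "behavior_hazard": 40,
    "scenario_risk": 25,
    "impact_scope": 20,
    "duration": 10,
    "history": 5,
}


def risk_score(item_scores):
    """Weighted fusion of five item scores, each assumed to lie in 0-100."""
    missing = set(WEIGHTS_PERCENT) - set(item_scores)
    if missing:
        raise ValueError(f"missing scoring items: {sorted(missing)}")
    for name in WEIGHTS_PERCENT:
        if not 0 <= item_scores[name] <= 100:
            raise ValueError(f"{name} must be within 0-100")
    weighted = sum(WEIGHTS_PERCENT[n] * item_scores[n] for n in WEIGHTS_PERCENT)
    return weighted / 100


# Warehouse smoking example: every dimension is at its top grade. Mapping the
# top grade to 100 points is an assumption.
warehouse_score = risk_score({
    "behavior_hazard": 100,  # high-risk behavior
    "scenario_risk": 100,    # high-risk scenario
    "impact_scope": 100,     # key objects involved
    "duration": 100,         # lasts 7 minutes, i.e. >= 5 minutes
    "history": 100,          # prior high-risk record
})
```

Under these assumptions the warehouse case reaches the maximum of 100 points. Without the historical record (a history item of 0), the same behavior would score 95, which shows how the history dimension raises the result.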

[0048] The risk level of the behavior chain is determined by integrating the multi-dimensional scores. If the final score is within the first preset score range (e.g., greater than 80 points), the behavior chain is determined to be at an extremely high risk level; if the final score is within the second preset score range (e.g., 60-80 points), it is determined to be at a high risk level; if the final score is within the third preset score range (e.g., 30-60 points), it is determined to be at a medium risk level; otherwise, it is determined to be at a low risk level. In some preferred embodiments, taking the behavior chain of a person igniting a fire in a gas station as an example: the behavioral hazard item (high-risk behavior) scores 100×40%=40 points, the scenario risk item (high-risk scenario) scores 100×25%=25 points, and the impact scope item (involving key objects) scores 100×20%=20 points. After weighted integration, the final score is 85 points, which falls within the first preset score range, so the behavior chain is determined to be at an extremely high risk level.
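The threshold comparison and the gas station example can be reproduced with a few lines. The example ranges (greater than 80, 60-80, 30-60) do not say which level a score of exactly 60 or 30 belongs to; the sketch assumes the lower bound of each range is inclusive. The function name is hypothetical.

```python
def classify_risk(score):
    """Map a final score to a risk level using the example score ranges.

    Assumption: 80 belongs to "high", and each lower bound is inclusive.
    """
    if score > 80:
        return "extremely high"
    if score >= 60:
        return "high"
    if score >= 30:
        return "medium"
    return "low"


# Gas station example: three items at 100 points with weights 40%, 25%, 20%.
gas_station_score = (100 * 40 + 100 * 25 + 100 * 20) / 100
gas_station_level = classify_risk(gas_station_score)
print(gas_station_score, gas_station_level)  # 85.0 extremely high
```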

[0049] This embodiment constructs a complete "start-process-end" behavior chain by connecting discrete behaviors and filling in missing stages through a behavior temporal modeling unit. The feature enhancement unit dynamically allocates multimodal feature weights, corrects time deviations, and eliminates redundancy, thereby improving the accuracy of behavior chain recognition. Combined with a multi-dimensional risk scoring method, it realizes risk level determination that is deeply bound to the actual scenario, realizing the transformation of behavior recognition from "fragmented" to "full-link" while ensuring the accuracy of risk classification and scenario adaptability.

[0050] Example 5: Based on any one of Embodiments 1 to 3, this embodiment provides a preferred adaptive learning method.

[0051] Existing adaptive optimization methods for behavior detection often train models on offline static sample sets or use fixed feature weights. Once the model is deployed, the training set is no longer updated, and the feature weights do not adjust dynamically with the scene. Model iteration requires retraining, which results in insufficient generalization, an inability to cover the many error-prone scenarios and behaviors, and difficulty in correcting recognition biases. These methods also suffer from weak scene adaptability and high model iteration costs, which hinders rapid response to optimization needs. To address these issues, in some preferred embodiments, the system of this invention further includes an adaptive learning module connected to the behavior event module. This module dynamically adjusts the feature extraction weights of the target recognition module based on the application scenario, updates the training set of the target recognition module based on recognition error cases, and optimizes recognition performance. In some preferred embodiments, the adaptive learning module includes a feedback learning unit, a parameter adjustment unit, and a model adjustment unit. First, after the behavior event module outputs the recognition results, the feedback learning unit automatically filters out recognition error cases, including samples with missed detections, false detections, and behavior chain recognition deviations. These error samples are then labeled with their correct behavior categories, scene attributes, and feature information. In some preferred embodiments, the error cases also include target samples whose preliminary recognition results from the target recognition module have a confidence level lower than a second preset confidence threshold (e.g., 0.6). Subsequently, the labeled error samples are incorporated into the original training set, dynamically expanding the training set and providing targeted data support for model optimization.
The parameter adjustment unit collects environmental parameters of the current scene in real time, determines scene complexity (e.g., backlighting, occlusion, complex backgrounds), and dynamically adjusts the feature extraction weights of the target recognition model according to preset scene-feature weight mapping rules. For example, in backlit scenes, the weights of 3D depth features are enhanced, while the weights of visual action features, which are significantly affected by lighting, are weakened; in complex background scenes, the weights of audio-related features are increased, thereby improving the model's adaptability to different complex scenes. The model adjustment unit performs lightweight incremental training on the target recognition model based on the updated training set and adjusted feature extraction weights, eliminating the need to retrain the entire model. By iteratively learning the feature patterns of error samples through mini-batch processing, the model's feature fusion and behavior recognition logic are corrected. While maintaining the model's original cross-scene recognition capabilities, this improves the model's recognition accuracy for error-prone scenes and behaviors.
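The error-case filtering performed by the feedback learning unit might look like the following simplified sketch. It assumes a corrected label is already available for each sample (the embodiment does not say how the correct labels are produced), treats any mismatch between predicted and corrected label as an error, and uses the example threshold of 0.6. All type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SECOND_CONFIDENCE_THRESHOLD = 0.6  # example value from the embodiment


@dataclass
class RecognitionResult:
    sample_id: str
    predicted_label: Optional[str]  # None means nothing was detected
    confidence: float
    true_label: Optional[str]       # corrected label; None means no target


def is_error_case(result: RecognitionResult) -> bool:
    """Missed detection, false detection, wrong label, or low confidence."""
    if result.predicted_label != result.true_label:
        return True
    return (result.predicted_label is not None
            and result.confidence < SECOND_CONFIDENCE_THRESHOLD)


def expand_training_set(
    training_set: List[Tuple[str, Optional[str]]],
    results: List[RecognitionResult],
) -> List[Tuple[str, Optional[str]]]:
    """Return a new training set with the labeled error samples appended."""
    existing = {sample_id for sample_id, _ in training_set}
    additions = [
        (r.sample_id, r.true_label)
        for r in results
        if is_error_case(r) and r.sample_id not in existing
    ]
    return training_set + additions


results = [
    RecognitionResult("s1", "smoking", 0.92, "smoking"),    # correct, confident
    RecognitionResult("s2", None, 0.0, "smoking"),          # missed detection
    RecognitionResult("s3", "smoking", 0.85, None),         # false detection
    RecognitionResult("s4", "climbing", 0.55, "climbing"),  # low confidence
]
updated = expand_training_set([("s0", "smoking")], results)
```

Only the confident, correct sample `s1` is left out; the other three are added for the incremental training step.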

[0052] This embodiment achieves continuous optimization of model recognition accuracy and cross-scene robustness by dynamically expanding the training set, adjusting feature weights as needed, and using lightweight incremental training, while reducing model iteration costs.
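The scene-dependent weight adjustment of the parameter adjustment unit can be illustrated with a small rule table. The base weights, the multipliers, and the renormalization step below are assumptions; the embodiment only states the direction of each adjustment (more weight on 3D depth and less on visual action features under backlighting, more weight on audio in complex backgrounds).

```python
# Hypothetical base feature-extraction weights (sum to 1).
BASE_WEIGHTS = {"visual_action": 0.4, "depth_3d": 0.3, "audio": 0.3}

# Hypothetical scene-feature weight mapping rules: condition -> multipliers.
SCENE_RULES = {
    "backlight": {"depth_3d": 1.5, "visual_action": 0.5},
    "complex_background": {"audio": 1.5},
}


def adjust_weights(base, conditions, rules=SCENE_RULES):
    """Apply the multipliers of every detected condition, then renormalize."""
    adjusted = dict(base)
    for condition in conditions:
        # Conditions without a rule leave the weights unchanged.
        for feature, factor in rules.get(condition, {}).items():
            adjusted[feature] *= factor
    total = sum(adjusted.values())
    return {feature: value / total for feature, value in adjusted.items()}


backlit = adjust_weights(BASE_WEIGHTS, ["backlight"])
cluttered = adjust_weights(BASE_WEIGHTS, ["complex_background"])
```

Renormalizing keeps the weights comparable across scenes, so a boost to one modality implicitly reduces the share of the others.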

[0053] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.

Claims

1. A cross-scenario behavioral risk classification system based on multimodal fusion, characterized in that it includes:
Data acquisition module: used to acquire multimodal data in real time, including video stream data, 3D information data, audio data and sensor data;
Target recognition module: connected to the data acquisition module, used to recognize the features of the multimodal data and obtain preliminary recognition results including target category, target location, and result confidence level;
Multi-relationship graph module: connected to the target recognition module, used to construct a knowledge graph, extract entities and entity relationships from the preliminary recognition results to update the knowledge graph, and combine the updated knowledge graph with contextual information reasoning to obtain optimized recognition results;
Behavioral event module: connected to the data acquisition module and the multi-relationship graph module respectively, configured to extract the temporal features of the multimodal data for target behavior chain identification; construct a risk scoring index, compare the risk scoring index with a preset threshold, and determine the risk level of the target behavior chain.

2. The cross-scenario behavioral risk classification system based on multimodal fusion according to claim 1, characterized in that the data acquisition module is followed by a data preprocessing module connected to the data acquisition module, the data preprocessing module including:
Image preprocessing unit: used to extract frames from the video stream data at a set frequency to obtain multiple image frame data; to segment the image frame data, retain candidate regions containing the target, and obtain preprocessed image frame data;
Audio preprocessing unit: used to segment, denoise, and extract features from the audio data to obtain preprocessed audio data;
Spatiotemporal alignment unit: connected to the image preprocessing unit and the audio preprocessing unit respectively, used to combine the preprocessed image frame data, preprocessed audio data and sensor data to generate original data with timestamps, and construct a spatiotemporal data matrix as preprocessed data.

3. The cross-scenario behavioral risk classification system based on multimodal fusion according to claim 2, characterized in that the target recognition module has a built-in target recognition model, and the training method of the target recognition model includes: acquiring preprocessed data, labeling the preprocessed data using a large model, and manually correcting low-confidence labels to generate the training set; extracting features from the training set and inputting them into the target recognition unit to complete the training of the target recognition model.

4. The cross-scenario behavioral risk classification system based on multimodal fusion according to claim 1, characterized in that the behavior event module includes:
Behavioral temporal modeling unit: used to construct a three-stage behavioral chain including the start, process and end of the behavior based on the temporal features, and obtain a preliminary behavioral chain;
Feature enhancement unit: connected to the behavior temporal modeling unit, used to allocate weights to the multimodal features of the three-stage behavior chain through an enhancement algorithm, optimize the recognition result of the preliminary behavior chain, and obtain the target behavior chain;
Event structuring unit: connected to the feature enhancement unit, used to combine the target behavior chain with the optimized recognition result to generate structured behavior event information; and to construct a risk scoring index based on the behavior event information for risk classification determination of the target behavior chain.

5. The cross-scenario behavioral risk classification system based on multimodal fusion according to claim 4, characterized in that the behavioral event information includes: subject, behavior, object, scene, time, and spatial coordinates.

6. The cross-scenario behavioral risk classification system based on multimodal fusion according to claim 1, characterized in that the method for constructing the risk scoring index includes: constructing the risk scoring index by weighted fusion of behavioral hazard items, scenario risk items, scope of behavioral impact items, behavioral persistence items, and the subject's historical behavioral risk records; the formula for the risk scoring index S is: $S = w_1 R_1 + w_2 R_2 + w_3 R_3 + w_4 R_4 + w_5 R_5$; where $R_1$, $R_2$, $R_3$, $R_4$, $R_5$ respectively represent the behavioral hazard item, the scenario risk item, the scope of behavioral impact item, the behavioral persistence item, and the subject's historical behavioral risk record; $w_1$, $w_2$, $w_3$, $w_4$, $w_5$ are the weight coefficients and satisfy $w_1 + w_2 + w_3 + w_4 + w_5 = 1$.

7. The cross-scenario behavioral risk classification system based on multimodal fusion according to claim 1, characterized in that the behavioral event module is followed by an adaptive learning module connected to the behavioral event module, used to dynamically adjust the feature extraction weights of the target recognition module according to the application scenario, update the training set of the target recognition module based on recognition error cases, and optimize recognition performance.

8. A cross-scenario behavioral risk classification method based on multimodal fusion, characterized in that the method is implemented using the cross-scenario behavioral risk classification system based on multimodal fusion as described in any one of claims 1-7.