A smart security booth multi-modal collaborative learning method and system based on federated learning

By combining federated learning with multimodal collaborative learning, the invention addresses data privacy and security, heterogeneous device collaboration, and environmental adaptability in smart security booths, improving the privacy protection and model stability of the security system while reducing resource consumption.

CN122242816A · Pending · Publication Date: 2026-06-19 · HUAIYIN TEACHERS COLLEGE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAIYIN TEACHERS COLLEGE
Filing Date
2026-04-28
Publication Date
2026-06-19

Abstract

This invention relates to the fields of smart security and federated learning technologies, and discloses a multimodal collaborative learning method and system for smart security booths based on federated learning. The method involves collecting multimodal data from a security scene using multiple heterogeneous security sensing nodes, generating a multimodal security sensing feature set. Each heterogeneous security sensing node performs model training or feature learning locally, generating local model parameters or distillation information, achieving collaborative learning without uploading the original security sensing data. The smart security booth jointly optimizes the multimodal learning results based on multi-objective loss constraints and uses a federated double distillation mechanism to aggregate the distillation information from multiple heterogeneous security sensing nodes, generating a global security model and distributing it to each node for local model updates. Simultaneously, environmental disturbance feature information is acquired, and the local training parameters or federated aggregation strategy on the node side are adaptively adjusted to form a dynamic feedback loop. Through the above methods, this invention improves the sensing accuracy, robustness, and adaptability of smart security systems in complex environments while ensuring data privacy and communication efficiency.

Description

Technical Field

[0001] The invention relates to the fields of smart security, embedded artificial intelligence and distributed machine learning, and in particular to a smart security booth based on federated learning and its multimodal collaborative learning method. Specifically, it involves a technical solution that utilizes heterogeneous sensing intelligent equipment such as drones, robot dogs, and unmanned vehicles in a smart security booth scenario to locally learn multimodal sensing information such as video data, point cloud data, and sound data, and achieves collaborative training and updating of models through federated learning.

[0002] This invention further relates to related technologies such as model knowledge distillation under multi-objective loss constraints, heterogeneous node collaborative learning, and adaptive adjustment of environmental disturbances, which are applicable to application scenarios such as intelligent perception, behavior recognition, and abnormal event detection in complex security environments.

Background Technology

[0003] With the continuous improvement of smart city construction and public safety demands, smart security systems are gradually evolving from manual inspections and fixed monitoring to intelligent, automated, and collaborative approaches. As a front-end node in the security system, smart security booths typically undertake functions such as area patrols, anomaly detection, and emergency response. Their operating environment is characterized by high openness, complex scenarios, and numerous interference factors.

[0004] Existing smart security booths mostly acquire environmental information through fixed cameras or single mobile devices. In recent years, they have gradually introduced intelligent equipment such as drones, robot dogs, and unmanned vehicles to improve patrol range and flexibility. However, in practical applications, existing technical solutions still have the following shortcomings:

[0005] 1. Privacy and security risks arising from centralized processing of security data. Most existing smart security systems employ a centralized data processing model, where front-end security devices collect sensor data such as video, audio, or point cloud data, and then upload the raw data to a central server for unified analysis and model training. Since security data typically involves personnel behavior, activity trajectories, and detailed environmental information, centralized storage and transmission are prone to data leaks, unauthorized access, and other security risks, making it difficult to meet the needs of security applications with high data privacy and security requirements.

[0006] 2. Heterogeneous security equipment struggles to achieve effective collaborative learning. Security equipment such as drones, robot dogs, and unmanned vehicles differ significantly in sensor type, computing power, movement patterns, and operating environments. Existing security models are mostly designed for homogeneous devices or uniform data formats, making it difficult to adapt to the characteristics of different types of security equipment. This results in insufficient model generalization ability and limited collaborative learning effects, and lowers the overall intelligence level of the security system.

[0007] 3. Complex environmental factors lead to unstable model performance. Smart security booths are typically deployed outdoors or in semi-open environments, making them susceptible to environmental disturbances such as changes in lighting, weather conditions, noise interference, and equipment movement. Existing models are mostly trained in ideal or relatively stable environments. When the actual environment changes, the model's recognition accuracy and stability tend to decline, making it difficult to maintain reliable security performance in the long term.

[0008] 4. High consumption of communication and computing resources. Multimodal security sensing data is characterized by large data volume and high update frequency. Centralized uploading of raw data not only consumes a large amount of communication bandwidth, but also places high demands on the computing resources of the central server. With the increasing number of edge devices, existing technical solutions struggle to balance system performance and resource consumption.

[0009] Existing smart security booth technologies still have shortcomings in terms of data security, heterogeneous equipment collaboration, environmental adaptability, and resource utilization efficiency, and a new technical solution is urgently needed to improve them.

Summary of the Invention

[0010] The purpose of this invention is to provide a smart security booth based on federated learning and its multimodal collaborative learning method, so as to solve the problems of high privacy and security risks caused by centralized processing of security perception data in existing smart security booths, difficulty in collaborative learning of heterogeneous security equipment such as drones, robot dogs, and unmanned vehicles, unstable model performance under complex environmental conditions, and large consumption of communication and computing resources.

[0011] To achieve the above objectives, the present invention provides the following technical solution: a multimodal collaborative learning method for smart security booths based on federated learning, characterized in that the method includes:

[0012] Step S1, Multimodal security sensing data acquisition and representation step. Environmental sensing data from multiple heterogeneous security sensing nodes deployed around the smart security booth are collected to generate a multimodal security sensing feature set; wherein, the heterogeneous security sensing nodes include at least one of drones, robot dogs, and unmanned vehicles, and the environmental sensing data includes at least one of video data, point cloud data, and sound data.

[0013] Preferably, step S1 includes the following steps:

[0014] Step S11, Sensing Modality Definition Step. Define multiple security sensing modality types to form a sensing modality set, which includes at least visual modality, spatial modality, and acoustic modality.

[0015] Step S12, Multimodal Feature Generation Step. According to the sensing modality set, each heterogeneous security sensing node collects the corresponding sensing data, preprocesses it, and extracts features from it to generate a multimodal security sensing feature set.

[0016] Step S2, Node-side local model learning step. Each heterogeneous security sensing node performs model training or feature learning locally based on the multimodal security sensing feature set, generating local model parameters, model output results, or model distillation information; wherein, the original security sensing data does not leave the corresponding heterogeneous security sensing node during the local model learning process.

[0017] Preferably, step S2 includes the following steps:

[0018] Step S21, Local Model Initialization. Based on the computing power, sensing modality type, or task requirements of the heterogeneous security sensing nodes, initialize the corresponding local security model structure.

[0019] Step S22, Local Feature Learning and Distillation Information Generation Step. The local security model is trained using the multimodal security sensing feature set, and distillation information for federated learning is generated based on the model output.

[0020] Step S3, Multi-objective loss constraint and joint optimization step. Set objective functions corresponding to various security tasks and construct a multi-objective loss function; based on the multi-objective loss function, jointly constrain the learning results of different modal features and different security tasks to generate a multi-modal knowledge representation for federated learning.

[0021] Preferably, step S3 includes the following steps:

[0022] Step S31, Multi-objective loss term setting step. Set multiple loss terms corresponding to tasks such as security identification, anomaly detection, and behavior analysis to form a multi-objective loss function. With a loss term for security identification ($L_{id}$), anomaly detection ($L_{ad}$), and behavior analysis ($L_{ba}$), the multi-objective loss function is constructed as the weighted sum of these terms: $L_{total} = \lambda_{1} L_{id} + \lambda_{2} L_{ad} + \lambda_{3} L_{ba}$, where $\lambda_{1}$ is the security identification weight, $\lambda_{2}$ is the anomaly detection weight, $\lambda_{3}$ is the behavior analysis weight, and $L_{id}$, $L_{ad}$, $L_{ba}$ are the cross-entropy loss functions of the respective tasks.

[0023] Step S32, Joint Optimization Step. Based on the multi-objective loss function, the local model output or distillation information is jointly optimized to generate a multimodal knowledge representation.

[0024] Step S4, Federated Double Distillation Aggregation Step. Based on a federated learning mechanism, multimodal knowledge representations from multiple heterogeneous security sensing nodes are aggregated to construct a federated double distillation architecture and generate a global security model.

[0025] Preferably, step S4 includes the following steps:

[0026] Step S41, Node-side distillation step. Each heterogeneous security sensing node performs a first-level distillation on the local model learning results to generate node-side distillation information.

[0027] Step S42, Federation-side distillation step. The node-side distillation information is subjected to a second layer of distillation and aggregation to form a global security model.

[0028] Step S5, Global Model Distribution and Local Update Step. The global security model or its model parameters are distributed to each heterogeneous security sensing node to update the corresponding local security model.

[0029] Step S6, Environmental Disturbance Adaptive Adjustment Step. Obtain disturbance characteristic information in the operating environment of the smart security booth; based on the disturbance characteristic information, adaptively adjust the training parameters, model update frequency, or aggregation strategy in the federated learning process.

[0030] Preferably, step S6 includes the following steps:

[0031] Step S61, Environmental disturbance feature extraction step. Extract at least one environmental disturbance feature from the following: illumination change, noise interference, and equipment motion disturbance.

[0032] Step S62, Adaptive Adjustment Step. Based on the environmental disturbance features, the federated learning parameters or model update strategy are dynamically adjusted.

Attached Figure Description

[0033] To more clearly illustrate the technical solution of the present invention, a brief description of the present invention will be provided below in conjunction with the accompanying drawings. It should be understood that the following drawings are only for illustrative purposes and do not constitute a limitation on the scope of protection of the present invention.

[0034] Figure 1 is a schematic diagram of the overall structure of the smart security booth system based on federated learning of the present invention;

[0035] Figure 2 is a schematic diagram of the heterogeneous security sensing nodes in the smart security booth of the present invention;

[0036] Figure 3 is a flowchart of the multimodal collaborative learning method for smart security booths based on federated learning of the present invention;

[0037] Figure 4 is a schematic diagram of the functional modules of the federated double distillation architecture of the present invention;

[0038] Figure 5 is a schematic diagram of the working principle of the environmental disturbance adaptive adjustment module of the present invention.

Detailed Implementation

[0039] To make the objectives, technical solutions, and beneficial effects of this invention clearer, the invention will be further described below with reference to specific embodiments. It should be understood that the following embodiments are only for explaining this invention and are not intended to limit the scope of protection of this invention.

[0040] System deployment structure description:

[0041] In this embodiment, the smart security booth acts as a federated learning coordination node, forming a distributed collaborative learning network with various heterogeneous security sensing devices.

[0042] The heterogeneous security sensing devices include, but are not limited to, mobile sensing equipment such as drones, robot dogs, and unmanned vehicles. Each sensing device is equipped with a computing unit, a communication module, and a sensing module, and has local data processing and model training capabilities.

[0043] The smart security booth is used to perform federated learning scheduling, model aggregation, and adaptive control, without directly acquiring the raw perception data of each node.
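The division of roles in this deployment can be sketched as a pair of message types. This is a minimal illustration under stated assumptions, not the implementation disclosed by the invention: the class names `DistillationUpload`, `GlobalUpdate`, and `SensingNode`, and all of their fields, are hypothetical. The sketch only shows that the payloads exchanged between a node and the booth carry distillation information and global updates, while raw sensing data stays in a node-local buffer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DistillationUpload:
    """Node-to-booth message (hypothetical). Carries no raw sensing data."""
    node_id: str
    node_type: str                  # e.g. "drone", "robot_dog", "unmanned_vehicle"
    soft_labels: List[List[float]]  # class probabilities per reference sample
    num_local_samples: int


@dataclass
class GlobalUpdate:
    """Booth-to-node message (hypothetical)."""
    round_id: int
    global_soft_labels: List[List[float]]


@dataclass
class SensingNode:
    node_id: str
    node_type: str
    # Raw data is kept in this local buffer and is never placed in a message.
    _raw_buffer: List[bytes] = field(default_factory=list, repr=False)

    def record(self, raw: bytes) -> None:
        self._raw_buffer.append(raw)

    def build_upload(self, soft_labels: List[List[float]]) -> DistillationUpload:
        return DistillationUpload(
            node_id=self.node_id,
            node_type=self.node_type,
            soft_labels=soft_labels,
            num_local_samples=len(self._raw_buffer),
        )


node = SensingNode("uav-01", "drone")
node.record(b"\x00\x01frame-bytes")
upload = node.build_upload([[0.7, 0.2, 0.1]])
```

Only the sample count and the soft labels leave the node in this sketch; the booth never receives the contents of `_raw_buffer`.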

[0044] The implementation steps of the multimodal collaborative learning method for smart security booths based on federated learning are as follows:

[0045] Step S1: Multimodal security sensing data collection.

[0046] Multiple heterogeneous security sensing nodes deployed around the smart security booths perform real-time sensing of the target area, and each sensing node collects various types of security sensing data, such as video data, point cloud data, and sound data.

[0047] Each sensing node preprocesses and extracts features from the collected raw data to generate corresponding multimodal security sensing features, and organizes these features into a multimodal security sensing feature set for subsequent local model learning.

[0048] The above approach achieves multi-angle and multi-dimensional perception of complex security scenarios, thereby improving the comprehensiveness and robustness of environmental perception.
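Step S1 can be illustrated with a small sketch of how a node might turn raw readings into a multimodal security sensing feature set. The text does not specify the feature extractors, so the ones below (intensity statistics for video, a centroid for point clouds, RMS energy and zero-crossing rate for sound) are simple stand-ins chosen for illustration, and every function and class name is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def video_features(frame: Sequence[Sequence[float]]) -> List[float]:
    """Mean and standard deviation of grayscale pixel intensity."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return [mean, math.sqrt(var)]


def point_cloud_features(points: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Centroid (x, y, z) of the point cloud."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


def sound_features(samples: Sequence[float]) -> List[float]:
    """RMS energy and zero-crossing rate of an audio window."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(samples) - 1)]


@dataclass
class MultimodalFeatureSet:
    visual: List[float]    # visual modality
    spatial: List[float]   # spatial modality
    acoustic: List[float]  # acoustic modality

    def as_vector(self) -> List[float]:
        return self.visual + self.spatial + self.acoustic


features = MultimodalFeatureSet(
    visual=video_features([[0.0, 1.0], [1.0, 0.0]]),
    spatial=point_cloud_features([(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]),
    acoustic=sound_features([1.0, -1.0, 1.0, -1.0]),
)
```

A real node would likely replace these hand-written statistics with learned encoders; the sketch only shows the organization of per-modality features into one set.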

[0049] Step S2, local model learning on the node side.

[0050] Each heterogeneous security sensing node performs a model training or feature learning process locally based on the multimodal security sensing feature set to obtain local model parameters or intermediate knowledge representations.

[0051] In this step, each sensing node only uses the data it collects locally for calculations, and the raw security sensing data is not uploaded to the smart security booth or other nodes, thereby reducing the risk of privacy leakage and reducing communication load.

[0052] The training process of the local model can adopt a lightweight model or a tailored model structure according to the node's computing power conditions to adapt to embedded or edge computing environments.
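As a rough sketch of node-side local learning, the example below trains a logistic-regression classifier entirely on the node and returns only its parameters. The logistic-regression model stands in for the "lightweight model" mentioned above and is an assumption, as are the function names `local_train` and `predict` and the toy data; the text does not prescribe a specific model.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def local_train(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> List[float]:
    """Train on-device with SGD; return the weights followed by the bias."""
    dim = len(features[0])
    w = [0.0] * (dim + 1)  # last entry is the bias
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            err = sigmoid(z) - y  # gradient of the cross-entropy loss w.r.t. z
            for i in range(dim):
                w[i] -= lr * err * x[i]
            w[-1] -= lr * err
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Probability of the positive class under the trained model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


# Toy local data: one feature, label 1 when the feature is large.
local_x = [[0.0], [0.2], [0.8], [1.0]]
local_y = [0, 0, 1, 1]
local_params = local_train(local_x, local_y)
```

Only `local_params` (or information derived from the model output) would be shared; `local_x` and `local_y` never leave the node in this sketch.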

[0053] Step S3: Multi-objective loss constraints and joint optimization.

[0054] After completing local learning at the node level, each sensing node performs joint constraints and optimization on the local learning results based on a multi-objective loss function.

[0055] The multi-objective loss function includes at least multi-task learning constraints and multi-modal feature consistency constraints, which are used to coordinate the expression relationship of different modal data in the feature space and generate a unified multi-modal knowledge representation.

[0056] By using multi-objective loss constraints, different perception modalities can achieve cross-modal information complementarity while maintaining their respective discrimination capabilities, thereby improving the overall accuracy of security identification and judgment.
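The combination of multi-task constraints and multimodal consistency constraints can be sketched as a single scalar loss. The sketch below assumes a weighted sum of per-task cross-entropy terms plus a pairwise mean-squared consistency term between modality embeddings; the consistency form, the `consistency_weight` parameter, and all function names are illustrative assumptions, since the text does not fix them.

```python
import math
from typing import Dict, Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Cross-entropy of a predicted distribution against a class index."""
    return -math.log(max(probs[target], 1e-12))


def consistency_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared distance between two modality embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_objective_loss(
    task_probs: Dict[str, Sequence[float]],
    task_targets: Dict[str, int],
    task_weights: Dict[str, float],
    embeddings: Dict[str, Sequence[float]],
    consistency_weight: float,
) -> float:
    # Weighted sum of the per-task cross-entropy terms.
    task_term = sum(
        task_weights[t] * cross_entropy(task_probs[t], task_targets[t])
        for t in task_probs
    )
    # Pairwise consistency between modality embeddings in the shared space.
    names = sorted(embeddings)
    pairs = [(m, n) for i, m in enumerate(names) for n in names[i + 1:]]
    cons_term = sum(consistency_loss(embeddings[m], embeddings[n]) for m, n in pairs)
    return task_term + consistency_weight * cons_term


loss = multi_objective_loss(
    task_probs={"identification": [0.5, 0.5], "anomaly": [0.5, 0.5], "behavior": [0.5, 0.5]},
    task_targets={"identification": 0, "anomaly": 0, "behavior": 0},
    task_weights={"identification": 1.0, "anomaly": 0.5, "behavior": 0.5},
    embeddings={"visual": [1.0, 0.0], "spatial": [1.0, 0.0], "acoustic": [0.0, 0.0]},
    consistency_weight=0.1,
)
```

In the example, the visual and spatial embeddings agree and contribute nothing to the consistency term, while the acoustic embedding is penalized for diverging from both.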

[0057] Step S4, Federated double distillation aggregation.

[0058] In this embodiment, the federated aggregation process employs a double-layer distillation mechanism.

[0059] First, on the node side, each heterogeneous security sensing node generates node distillation information based on a local model or multimodal knowledge representation, which is used to characterize the key knowledge of the node-side learning results.

[0060] Subsequently, each node uploads the node distillation information to the smart security booth, which then performs fusion processing on the federated side to generate a global security model.

[0061] Federated aggregation is performed through distillation, avoiding the direct transmission of complete model parameters, reducing communication overhead, and alleviating the aggregation difficulties caused by inconsistent model structures of heterogeneous nodes.
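The two distillation layers can be sketched as follows. The example assumes that every node scores the same shared reference samples and uploads temperature-softened class probabilities, and that the booth fuses them by a weighted average; both the shared reference set and the averaging rule are assumptions made for illustration, and the function names `soften` and `aggregate` are hypothetical. Fitting a global security model to the fused labels is omitted.

```python
import math
from typing import List, Sequence


def soften(logits: Sequence[float], temperature: float = 2.0) -> List[float]:
    """Node-side distillation: temperature-scaled softmax over logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(
    node_soft_labels: Sequence[Sequence[Sequence[float]]],
    node_weights: Sequence[float],
) -> List[List[float]]:
    """Federation-side distillation: weighted average of node soft labels."""
    total_w = sum(node_weights)
    num_samples = len(node_soft_labels[0])
    num_classes = len(node_soft_labels[0][0])
    fused = []
    for s in range(num_samples):
        row = []
        for c in range(num_classes):
            acc = sum(w * labels[s][c] for w, labels in zip(node_weights, node_soft_labels))
            row.append(acc / total_w)
        fused.append(row)
    return fused


# Two nodes with different model structures score the same reference sample.
drone_labels = [soften([2.0, 0.0])]
vehicle_labels = [soften([0.0, 0.0])]
global_labels = aggregate([drone_labels, vehicle_labels], node_weights=[3.0, 1.0])
```

Because only per-sample probability vectors are exchanged, the nodes do not need matching parameter shapes, which is the property the paragraph above attributes to distillation-based aggregation.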

[0062] Step S5: Global model distribution and local update.

[0063] The smart security booth distributes the generated global security model or its corresponding model parameters to each heterogeneous security sensing node.

[0064] Each sensing node updates or corrects its local model based on the global security model, so that the node-side model gradually converges towards the global optimum, thereby improving the recognition and decision-making capabilities of each node in complex security scenarios.

[0065] This process can be executed periodically or by event triggering to adapt to different security task requirements.

[0066] Step S6: Adaptive adjustment to environmental disturbances.

[0067] The smart security booth or various heterogeneous security sensing nodes acquire environmental disturbance feature information in the current security scene. The environmental disturbance feature information includes, but is not limited to, changes in illumination, noise interference, and motion disturbances.
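A minimal sketch of how such disturbance features might be computed is given below. The specific measures (relative brightness change, RMS audio amplitude, and standard deviation of accelerometer readings) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def illumination_change(prev_brightness: float, curr_brightness: float) -> float:
    """Relative change in mean frame brightness between two time steps."""
    return abs(curr_brightness - prev_brightness) / max(prev_brightness, 1e-6)


def noise_level(audio: Sequence[float]) -> float:
    """RMS amplitude of an audio window."""
    return math.sqrt(sum(s * s for s in audio) / len(audio))


def motion_disturbance(accel: Sequence[float]) -> float:
    """Standard deviation of accelerometer magnitude readings."""
    mean = sum(accel) / len(accel)
    return math.sqrt(sum((a - mean) ** 2 for a in accel) / len(accel))


def disturbance_features(
    prev_brightness: float,
    curr_brightness: float,
    audio: Sequence[float],
    accel: Sequence[float],
) -> Dict[str, float]:
    """Collect the three disturbance measures into one feature record."""
    return {
        "illumination": illumination_change(prev_brightness, curr_brightness),
        "noise": noise_level(audio),
        "motion": motion_disturbance(accel),
    }


feats = disturbance_features(100.0, 50.0, [0.5, -0.5], [1.0, 3.0])
```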

[0068] Based on the environmental disturbance characteristics, the training parameters of the local model on the node side or the federated aggregation strategy are adaptively adjusted, including but not limited to adjusting the learning rate, distillation weights or aggregation frequency.

[0069] After completing the adaptive adjustment, the method returns to step S2 or step S4 to form a dynamic feedback loop, thereby improving the stability and adaptability of the system in complex and ever-changing security environments.
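The adaptive adjustment can be sketched as a function that maps disturbance features to new training and aggregation settings. The direction of each adjustment (a lower learning rate, a higher distillation weight, and more frequent aggregation under stronger disturbance) and the scaling factors are assumptions made for this example; the text names the adjustable quantities but not the rule. `FederatedConfig` and `adapt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FederatedConfig:
    learning_rate: float
    distillation_weight: float
    aggregation_interval_s: float


def adapt(config: FederatedConfig, disturbance: Dict[str, float]) -> FederatedConfig:
    """Return an adjusted configuration based on the strongest disturbance."""
    # Collapse the features into one severity score clipped to [0, 1].
    severity = min(1.0, max(0.0, max(disturbance.values(), default=0.0)))
    return FederatedConfig(
        learning_rate=config.learning_rate * (1.0 - 0.5 * severity),
        distillation_weight=config.distillation_weight * (1.0 + severity),
        aggregation_interval_s=config.aggregation_interval_s / (1.0 + severity),
    )


base = FederatedConfig(learning_rate=0.01, distillation_weight=0.5, aggregation_interval_s=600.0)
calm = adapt(base, {"illumination": 0.0, "noise": 0.0, "motion": 0.0})
stormy = adapt(base, {"illumination": 0.2, "noise": 1.5, "motion": 0.4})
```

Feeding the returned configuration into the next round of local training (step S2) or aggregation (step S4) closes the feedback loop described above.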

[0070] The implementation results are explained below:

[0071] Through the above implementation methods, this invention achieves privacy-preserving collaborative learning among various heterogeneous security sensing devices in a smart security booth scenario. It can integrate multimodal sensing information without centralizing raw data, thereby improving the security system's perception, recognition, and response capabilities in complex environments. Furthermore, the adoption of federated double distillation and the environmental disturbance adaptive mechanism allows this invention to maintain good communication efficiency, model generalization ability, and system robustness even under embedded and edge computing conditions.

Claims

1. A multimodal collaborative learning method for smart security booths based on federated learning, characterized in that the method includes: S1, collecting environmental security sensing data from multiple heterogeneous security sensing nodes to generate a multimodal security sensing feature set; S2, each heterogeneous security sensing node performing model training or feature learning locally based on the multimodal security sensing feature set, generating local model parameters or distillation information; S3, jointly constraining and optimizing the learning results of each heterogeneous security sensing node based on a multi-objective loss function to generate a multimodal knowledge representation; S4, aggregating the distillation information from the multiple heterogeneous security sensing nodes using a federated double distillation mechanism to generate a global security model; S5, sending the global security model or its model parameters to each heterogeneous security sensing node to update the local model on the node side; S6, obtaining environmental disturbance feature information and adaptively adjusting the federated learning process based on the environmental disturbance feature information.

2. The method according to claim 1, characterized in that the heterogeneous security sensing nodes include one or more of drones, robot dogs, and unmanned vehicles; and the multimodal security sensing data includes one or more of video data, point cloud data, audio data, and data collected by other sensors.

3. The method according to claim 1, characterized in that the multimodal security sensing feature set generated in S1 is obtained by preprocessing and feature extraction of the original sensing data; in S2, each heterogeneous security sensing node does not upload the original security sensing data when performing model training locally; the distillation information generated in S2 includes intermediate feature representations, output distribution information, or feature mapping information; and the multi-objective loss function in S3 includes a multi-task learning loss term and a multimodal consistency constraint loss term.

4. The method according to claim 1, characterized in that the federated double distillation mechanism in S4 includes node-side distillation and federated-side distillation; the federated aggregation process in S4 does not require all heterogeneous security sensing nodes to use the same model structure; the global security model in S5 is broadcast to all heterogeneous security sensing nodes; in S5, each heterogeneous security sensing node updates its local model parameters or corrects its model based on the global security model; and the environmental disturbance features in S6 include one or more of changes in illumination, noise interference, and motion disturbances.

5. The method according to claim 1, characterized in that the method repeats S2 to S6, forming an adaptive feedback loop.

6. The method according to claim 1, characterized in that the multi-task learning loss term is used to constrain the collaborative learning relationship between different security tasks; and the multimodal consistency constraint loss term is used to constrain the consistency of the representations of features of different modalities in a unified feature space.

7. The method according to claim 1, characterized in that the node-side distillation is used to generate node distillation information characterizing the local learning results of each heterogeneous security sensing node; the federated-side distillation is used to fuse the distillation information from multiple heterogeneous security sensing nodes to generate a global security model; and the adaptive adjustment includes adjusting the node-side local training parameters, or adjusting the federated aggregation frequency or distillation weights.

8. A multimodal collaborative learning system for smart security booths based on federated learning, characterized in that the system includes: multiple heterogeneous security sensing nodes, used to collect multimodal security sensing data and perform local model learning; and a smart security booth, used to perform federated learning scheduling, federated double distillation aggregation, and adaptive adjustment to environmental disturbances; wherein the smart security booth is communicatively connected to each heterogeneous security sensing node.

9. The system according to claim 8, characterized in that the heterogeneous security sensing nodes include drones, robot dogs, or unmanned vehicles.

10. The system according to claim 8, characterized in that the smart security booth is equipped with a federated model management module, an adaptive adjustment module, and a communication control module.