An AI-based intelligent video monitoring and behavior analysis early warning system

By constructing a multi-dimensional feature linkage system of scene, behavior, and environment, together with reinforcement-learning-driven dynamic threshold adjustment, the invention addresses the misjudgments and missed judgments of existing systems in cross-scene applications and improves the accuracy and reliability of the intelligent video surveillance and behavior analysis early warning system.

CN122245042A · Pending · Publication Date: 2026-06-19 · DATANG YUNCHENG POWER GENERATION CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DATANG YUNCHENG POWER GENERATION CO LTD
Filing Date
2026-04-23
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing theft alarm systems suffer from serious misjudgments and omissions when applied across different scenarios. They are unable to automatically adjust analysis strategies and theft warning thresholds according to different scenarios, resulting in poor accuracy and reliability of the warning systems. Furthermore, they lack the effective application of advanced artificial intelligence technologies such as reinforcement learning.

Method used

A multi-source sample library construction module is used to obtain a four-in-one data sample library of scene, behavior, environment, and fusion. The theft warning threshold is dynamically adjusted through reinforcement learning strategy. Combined with the intelligent monitoring real-time perception module and the theft warning and feedback module, the alarm threshold is intelligently matched and adaptively adjusted with the scene and environment.

Benefits of technology

It significantly reduced the false alarm rate in cross-scenario applications, improved the accuracy, reliability, and adaptability of the early warning system, and enhanced the accuracy of behavioral early warnings through long-term iterative optimization, reducing the false and missed warning rates.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122245042A_ABST

Abstract

This invention discloses an AI-based intelligent video surveillance and behavior analysis early warning system, specifically relating to the field of alarm system technology. It includes a multi-source sample library construction module, a theft initial early warning threshold configuration module, an intelligent monitoring real-time perception module, a theft early warning threshold decision module, and a theft early warning and feedback module. The multi-source sample library construction module collects raw monitoring data from multiple target scenarios, simultaneously collects manually labeled data, and obtains a set of behavior feature vectors through a behavior feature extraction unit, outputting a scenario-behavior-environment-fusion four-linked data sample library. This invention, by constructing a multi-dimensional feature linkage system of scenario-behavior-environment, combined with differentiated initial thresholds and a dynamic adjustment mechanism, achieves accurate adaptation to complex multi-source scenarios, breaking through the limitations of traditional monitoring systems that rely on a single threshold to adapt to multiple scenarios, and improving the accuracy, reliability, and adaptability of the early warning system.

Description

Technical Field

[0001] This invention relates to the field of alarm system technology, specifically to an AI-based intelligent video surveillance and behavior analysis early warning system.

Background Technology

[0002] Demand for security in all kinds of venues continues to grow. Intelligent monitoring and theft behavior analysis and early warning systems are widely used to maintain public safety, ensure production order, and prevent theft. Continuing advances in AI technology provide technical support for such systems and can improve their early warning performance and level of intelligence.

[0003] Existing theft alarm systems are often designed and developed for specific scenarios. Their preset alarm rules are applicable to normal behavior patterns and environmental conditions in that specific scenario. They cannot automatically adjust the analysis strategy and theft warning threshold according to the characteristics of different scenarios. This leads to a large number of false alarms when applied across different scenarios, misreporting normal behavior as abnormal, or missed alarms, failing to detect real abnormal behavior in time, which seriously affects the accuracy and reliability of the warning system. Some studies have attempted to adapt to different scenarios and environmental conditions by adding more rules, but as the complexity of the scenario increases, the number of rules grows exponentially, making the system overly complex.

[0004] Although existing technologies can provide theft warnings, existing alarm systems use fixed thresholds and suffer from core defects such as poor scene adaptability, low behavior analysis accuracy, and high false alarm and missed warning rates, so they cannot meet the needs of increasingly complex and changing scenarios. In addition, most existing methods make no effective use of advanced artificial intelligence technologies such as reinforcement learning. The present invention therefore proposes an intelligent monitoring and theft behavior analysis and warning system that adopts a reinforcement-learning-driven dynamic threshold adjustment mechanism and performs hierarchical warning and effect feedback to improve the accuracy, reliability, and adaptability of the warning system.

Summary of the Invention

[0005] To overcome the aforementioned deficiencies of the prior art, embodiments of the present invention provide an AI-based intelligent video surveillance and behavior analysis early warning system to address the problems mentioned in the background art.

[0006] To achieve the above objectives, the present invention provides the following technical solution: an AI-based intelligent video surveillance and behavior analysis early warning system, comprising:

Multi-source sample library construction module: includes a multi-source scene data acquisition unit and a behavior feature extraction unit. It collects raw monitoring data of the target multi-source scenes together with manually labeled data, obtains a set of behavior feature vectors through the behavior feature extraction unit, and outputs a scene-behavior-environment-fusion four-in-one data sample library.

Initial theft warning threshold configuration module: based on the scene-behavior-environment-fusion four-in-one data sample library, it obtains the normal behavior-environment fusion sample library under the different monitoring scenes, fits a probability distribution model of the normal behavior features for each scene, and outputs a scene-environment-adapted initial theft warning threshold matrix.

Intelligent monitoring real-time perception module: collects real-time monitoring data, real-time light sensor data, and device distribution coordinate data, and outputs the real-time fused feature vector and the current scene classification result.

Theft warning threshold decision module: based on the real-time fused feature vector, the current scene classification result, and historical warning data, it outputs the dynamically adjusted theft warning threshold and the threshold adjustment step size record through a reinforcement learning strategy.

Theft warning and feedback module: based on the real-time fused feature vector and the dynamically adjusted theft warning threshold, it outputs the theft warning level signal, the theft warning log, and the adjustment effect statistics, and feeds the adjustment effect statistics back to the theft warning threshold decision module.

[0007] The technical effects and advantages of this invention are as follows:

1. This invention constructs a multi-dimensional feature linkage system of scene, behavior, and environment, combined with differentiated initial thresholds and a dynamic adjustment mechanism. It thereby matches alarm thresholds intelligently to scenes and environments, significantly reduces the false alarm rate in cross-scene applications, and improves the accuracy, reliability, and adaptability of the early warning system.

2. The intelligent monitoring real-time perception module of this invention adopts an attention fusion mechanism to improve the recognition of multi-dimensional features. Through online optimization by reinforcement learning, the alarm system can adapt to environmental changes and long-term performance drift, which improves reliability and adaptability and in turn helps improve the accuracy of behavioral early warning and reduce the false and missed early warning rates.

3. This invention continuously updates historical data through the effect feedback mechanism in the theft warning and feedback module and drives the reinforcement learning model to optimize the threshold adjustment strategy. This achieves long-term iterative improvement of system performance and the coordinated optimization of feature fusion, scene matching, and threshold adjustment, so that the system adapts to scene changes and behavior pattern updates in long-term use.

Attached Figure Description

[0008] Figure 1 is a schematic diagram of the overall process of the present invention.

[0009] Figure 2 is a schematic diagram of the multi-source sample library construction module of the present invention.

[0010] Figure 3 is a flowchart of the intelligent monitoring real-time perception module of the present invention.

Detailed Implementation

[0011] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0012] As shown in Figure 1, the present invention provides an AI-based intelligent video surveillance and behavior analysis early warning system, including a multi-source sample library construction module, an initial theft warning threshold configuration module, an intelligent monitoring real-time perception module, a theft warning threshold decision module, and a theft warning and feedback module. The multi-source sample library construction module is connected to the initial theft warning threshold configuration module; the intelligent monitoring real-time perception module is connected to both the initial theft warning threshold configuration module and the theft warning threshold decision module; and the theft warning and feedback module is connected to the theft warning threshold decision module.

[0013] As shown in Figure 2, the multi-source sample library construction module includes a multi-source scene data acquisition unit and a behavior feature extraction unit. It collects raw monitoring data of the target multi-source scenes together with manually labeled data, obtains a set of behavior feature vectors through the behavior feature extraction unit, and outputs a scene-behavior-environment-fusion four-in-one data sample library. It comprises the following steps:

A1: Multi-source scene data acquisition unit: monitoring equipment acquires the raw monitoring data D_raw of the target multi-source scenes, comprising the monitoring sets V_s of the different scenes and the corresponding scene environmental sensor data sets S_env, D_raw = {V_s, S_env}. The manually labeled data T_an includes the scene type label T_sc and the normal/abnormal behavior label T_be. The environmental sensor data includes the light intensity L and the preset monitoring device distribution coordinate set D_eq.

In this embodiment, the monitoring equipment is, for example, high-definition network cameras (1080P resolution, 30 frames/second) that continuously record monitoring footage at different times (e.g., day/night) and under different environmental conditions (e.g., low light/normal lighting). For densely populated scenes such as shopping malls, additional peak and off-peak traffic footage is collected so that the samples cover dynamically changing scenes. Light sensors are deployed in each monitoring scene, and positioning modules are deployed on the preset monitoring equipment. Sensor data is transferred over RS485/Ethernet interfaces, timestamped, and aligned with the monitoring footage so that the footage and the environmental parameters are synchronized in time.

Professional annotators label the footage frame by frame with scene type labels (e.g., shopping mall - normal lighting), normal/abnormal behavior labels, and environmental feature sets (e.g., low light - dense equipment). Annotation tools (e.g., LabelMe, VGG Image Annotator) export structured annotation files, which are then associated with the corresponding monitoring data and sensor data. The target multi-source scenes are the multiple scene types the system is required to cover (e.g., shopping mall, warehouse, convenience store); the equipment includes, for example, shelves, cash registers, and jewelry cabinets. The equipment coordinates are expressed in the same coordinate system as the coordinates of personnel behavior.
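The time alignment of sensor readings with video frames described in A1 can be illustrated with a minimal sketch. It assumes nearest-timestamp matching, which the text does not specify; the function name `align_nearest` and the sample values are hypothetical.

```python
from bisect import bisect_left

def align_nearest(frame_ts, sensor_ts, sensor_values):
    """For each frame timestamp, pick the sensor reading with the nearest timestamp.

    sensor_ts must be sorted in ascending order; all timestamps are in seconds.
    Nearest-neighbour matching is an assumption, not a rule stated in the text.
    """
    aligned = []
    for t in frame_ts:
        pos = bisect_left(sensor_ts, t)
        # Candidates: the reading just before and just after t (where they exist).
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(sensor_ts)]
        best = min(candidates, key=lambda j: abs(sensor_ts[j] - t))
        aligned.append(sensor_values[best])
    return aligned

frame_ts = [0.0, 1 / 30, 2 / 30, 1.0]   # frame times at 30 frames/second, plus one later frame
sensor_ts = [0.0, 0.5, 1.0]             # hypothetical light-sensor sampling times
lux = [120.0, 118.0, 35.0]              # hypothetical light-intensity readings
aligned_lux = align_nearest(frame_ts, sensor_ts, lux)
print(aligned_lux)
```

Each frame then carries one light-intensity value, which is the form the later normalization and fusion steps need.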

[0014] A2: The behavior feature extraction unit performs the following steps:

A2.1: For the M target scenes, $m \in \{1, 2, \ldots, M\}$ (for example, M = 3 corresponds to shelf goods, cash register, and jewelry display case), extract from the m-th scene $Q_m$ monitoring sequence segments $V_{fra}$ of N frames each (for example, $Q_m \ge 1000$), with $q \in \{1, 2, \ldots, Q_m\}$. The frame sequence of the q-th segment is $V_{fra}^{m,q} = (v_1^{m,q}, v_2^{m,q}, \ldots, v_N^{m,q})$, $i \in \{1, 2, \ldots, N\}$. Z-score preprocessing is applied to $V_{fra}^{m,q}$ to obtain $V_{norm}^{m,q}$, using the mean and standard deviation of the $Q_m$ N-frame monitoring sequence segments of the m-th scene. Each segment is also scaled to a preset fixed size of width × height × number of channels, such as 640×480×3 or 1280×720×3. In this embodiment, the number of frames N can be adjusted to the actual application scenario, for example 8, 16, or 32.
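The Z-score step in A2.1 can be sketched as follows. The sketch assumes the mean and standard deviation are pooled over all pixel values of all segments of one scene, and it represents each frame as a flat list of values; the function name and the small `eps` guard are illustrative additions.

```python
from statistics import mean, pstdev

def zscore_segments(segments, eps=1e-8):
    """Z-score normalize pixel values with statistics pooled over one scene's segments.

    segments[q][i] is frame i of segment q, given as a flat list of pixel values.
    eps avoids division by zero for constant input (an added safeguard).
    """
    pooled = [v for seg in segments for frame in seg for v in frame]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [[[(v - mu) / (sigma + eps) for v in frame] for frame in seg]
            for seg in segments]

# Two toy segments, each with N = 2 frames of two "pixels".
segments = [[[0.0, 2.0], [4.0, 6.0]], [[8.0, 10.0], [12.0, 14.0]]]
normed = zscore_segments(segments)
flat = [v for seg in normed for frame in seg for v in frame]
print(round(mean(flat), 6), round(pstdev(flat), 6))
```

After normalization the pooled values have approximately zero mean and unit standard deviation, so segments recorded under different lighting become comparable.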

[0015] A2.2: Using the YOLO model, perform person detection on each frame $V_{norm,i}^{m,q}$ of the normalized q-th segment $V_{norm}^{m,q}$ of the m-th scene, and output the single-frame set of person bounding boxes $B_i^{m,q}$ and the set of bounding-box confidences $C_i^{m,q}$:

$(B_i^{m,q}, C_i^{m,q}) = \mathrm{YOLO}(V_{norm,i}^{m,q})$

$B_i^{m,q} = \{(x_{1,k,i}^{m,q}, y_{1,k,i}^{m,q}, x_{2,k,i}^{m,q}, y_{2,k,i}^{m,q}) \mid k = 1, 2, \ldots, K_i^{m,q}\}$

$C_i^{m,q} = \{c_{k,i}^{m,q} \mid c_{k,i}^{m,q} = P_{obj,i}^{m,q}(k) \times P_{cls,i}^{m,q}(k),\ k = 1, 2, \ldots, K_i^{m,q}\}$

Here $(x_{1,k,i}^{m,q}, y_{1,k,i}^{m,q})$ and $(x_{2,k,i}^{m,q}, y_{2,k,i}^{m,q})$ are the top-left and bottom-right coordinates of the bounding box of the k-th person in the i-th frame, and $K_i^{m,q}$ is the number of valid persons in that frame after filtering out invalid targets whose confidence $c_{k,i}^{m,q}$ is below the corresponding threshold (e.g., $c_{k,i}^{m,q} < 0.5$). $P_{obj,i}^{m,q}(k)$ is the probability that a person is present in the k-th detection box (the objectness branch output of YOLO, in [0,1]), and $P_{cls,i}^{m,q}(k)$ is the probability that the k-th detection box belongs to the "person" category (the classification branch output of YOLO, in [0,1]). For example, $c_{k,i}^{m,q} = 0.8 \times 0.9 = 0.72 \ge 0.5$, so this target is retained.

A2.3: Using the OpenPose model (which includes body and hand keypoint detection branches), crop the person region given by the bounding box $B_i^{m,q}$ from the i-th frame and extract n pose keypoints of the person, including the torso and hands, in a predefined fixed order. The model outputs the coordinates $(x_{I,k,i}^{m,q}, y_{I,k,i}^{m,q})$ and the confidence $s_{I,k,i}^{m,q}$ of each keypoint, $I \in \{1, 2, \ldots, n\}$, giving the set of person pose point coordinates $D_i^{m,q} = \mathrm{OpenPose}(\mathrm{Crop}(V_{norm,i}^{m,q}, B_i^{m,q}))$, where Crop() is the image cropping function.

Invalid keypoints whose confidence is below the corresponding threshold (e.g., $s_{I,k,i}^{m,q} < 0.2$) are filled in from adjacent valid keypoints. Let $N_{va}$ be the number of adjacent keypoints $I' \in \{I+1, I-1\}$ whose confidence is at or above the threshold; the invalid keypoint takes the average of those valid adjacent points. If keypoint I is 1, only the valid point I+1 is used. If 3 or more consecutive keypoints are invalid, the adjacent range is expanded to I±2. If the neighbors are still invalid, the average of the same keypoint in all other frames of the same person is used. Finally, the coordinate set of all pose keypoints in the i-th frame is obtained: $D_i^{m,q} = \{(x_{I,k,i}^{m,q}, y_{I,k,i}^{m,q}, s_{I,k,i}^{m,q}) \mid I = 1, 2, \ldots, n;\ k = 1, 2, \ldots, K_i^{m,q}\}$.

In this embodiment, the OpenPose model is an enhanced version based on existing technology: it extends the standard OpenPose human pose estimation with hand keypoint detection (or facial keypoints, foot keypoints, etc.). The n preset pose keypoints are, for example, 18 body keypoints defined in a fixed order: 1-nose, 2-neck, 3-right shoulder, 4-right elbow, 5-right wrist, 6-left shoulder, 7-left elbow, 8-left wrist, 9-right hip, 10-right knee, 11-right ankle, 12-left hip, 13-left knee, 14-left ankle, 15-right eye, 16-left eye, 17-right ear, 18-left ear. The hand keypoints are, for example, the palm plus 5 fingers × 4 joints (fingertip, middle joint, base joint, metacarpophalangeal joint, etc.).
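The confidence filter of A2.2 and the keypoint filling rule of A2.3 can be sketched in plain Python. The detector and pose model themselves are not called; the inputs stand for their outputs. The filling function is a simplified reading of the rule: it tries neighbors at distance 1, then distance 2, and omits the cross-frame fallback. All function names and sample values are hypothetical.

```python
def filter_detections(boxes, p_obj, p_cls, thr=0.5):
    """Keep boxes whose confidence c = P_obj * P_cls reaches the threshold."""
    kept = []
    for box, po, pc in zip(boxes, p_obj, p_cls):
        c = po * pc
        if c >= thr:
            kept.append((box, c))
    return kept

def fill_keypoints(kps, thr=0.2):
    """Fill low-confidence keypoints with the mean of valid neighbours.

    kps is a list of (x, y, s) in the fixed keypoint order. Neighbours at
    distance 1 are tried first, then distance 2. A point with no valid
    neighbour is left unchanged (the cross-frame fallback is omitted here).
    """
    valid = [s >= thr for _, _, s in kps]
    filled = list(kps)
    for idx, (_, _, s) in enumerate(kps):
        if valid[idx]:
            continue
        for radius in (1, 2):
            nbrs = [j for j in range(idx - radius, idx + radius + 1)
                    if j != idx and 0 <= j < len(kps) and valid[j]]
            if nbrs:
                fx = sum(kps[j][0] for j in nbrs) / len(nbrs)
                fy = sum(kps[j][1] for j in nbrs) / len(nbrs)
                filled[idx] = (fx, fy, s)
                break
    return filled

boxes = [(10, 20, 110, 220), (5, 5, 50, 50)]          # (x1, y1, x2, y2)
kept = filter_detections(boxes, p_obj=[0.8, 0.6], p_cls=[0.9, 0.5])
kps = [(0.0, 0.0, 0.9), (99.0, 99.0, 0.1), (2.0, 4.0, 0.8)]
filled = fill_keypoints(kps)
print(kept, filled)
```

The first box reproduces the worked example in the text (0.8 × 0.9 = 0.72, retained), while the second (0.6 × 0.5 = 0.30) is dropped. When the first keypoint is invalid, only the following keypoint can serve as a neighbor, which matches the special case stated for keypoint 1.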

[0016] A2.4: First, within the N-frame normalized segment $V_{norm}^{m,q}$ of the q-th segment of the m-th scene, the same person is matched across frames using Intersection over Union (IoU). If the IoU between a person's bounding box in frame i and a bounding box in frame i-1 is greater than or equal to the corresponding threshold (e.g., 0.5), the two boxes are determined to belong to the same person. The keypoint sequence of the k-th person across the N frames is recorded as $D_k^{m,q} = \{D_{k,i}^{m,q} \mid i \in \{1, 2, \ldots, N\}\}$.

Then, by differencing the horizontal and vertical coordinates of the pose keypoints, the displacement of each keypoint between frame i-1 and frame i is obtained, giving the displacement set of the k-th person:

$\Delta D_{k,i}^{m,q} = \{(x_{I,k,i}^{m,q} - x_{I,k,i-1}^{m,q},\ y_{I,k,i}^{m,q} - y_{I,k,i-1}^{m,q}) \mid I = 1, 2, \ldots, n\}, \quad i = 2, \ldots, N$

$\Delta D_k^{m,q} = \{\Delta D_{k,i}^{m,q} \mid i = 2, \ldots, N\}$

Finally, the temporal behavior feature vector of the k-th person over the N frames is obtained through a bidirectional long short-term memory network (BiLSTM):

$F_{te,k}^{m,q} = \mathrm{BiLSTM}(\Delta D_{k,2}^{m,q}, \Delta D_{k,3}^{m,q}, \ldots, \Delta D_{k,N}^{m,q})$

In this embodiment, the IoU is the area of the overlap between the bounding box in frame i and the bounding box in frame i-1 divided by the area of their union, with a result in [0,1]. BiLSTM is a deep learning model for processing temporal data; it encodes the keypoint displacements ΔD of the N-1 consecutive frame pairs into one temporal behavior feature vector.
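The IoU test and the keypoint displacement computation of A2.4 can be sketched as below. The BiLSTM encoder is not included; the sketch stops at the displacement sequence that would be fed to it. Box and track values are made up for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def displacements(track):
    """Keypoint displacements between consecutive frames of one person's track.

    track[i] is the list of (x, y) keypoints in frame i; N frames give N-1 entries.
    """
    return [[(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, cur)]
            for prev, cur in zip(track, track[1:])]

prev_box = (0, 0, 10, 10)
same = iou(prev_box, (1, 0, 11, 10)) >= 0.5    # small shift: same person
other = iou(prev_box, (5, 0, 15, 10)) >= 0.5   # large shift: IoU = 1/3, not matched

track = [[(0, 0), (1, 1)], [(1, 0), (1, 3)], [(3, 0), (2, 3)]]
deltas = displacements(track)
print(same, other, deltas)
```

The union is computed as the sum of both areas minus the overlap, which keeps the result in [0,1].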

[0017] A2.5: First, based on the coordinates $(x_{I,k,i}^{m,q}, y_{I,k,i}^{m,q})$ and the confidences $s_{I,k,i}^{m,q}$ of the n pose keypoints, calculate the mean pose key feature of the k-th person, $F_{po,k}^{m,q}$. Then concatenate $F_{po,k}^{m,q}$ and $F_{te,k}^{m,q}$ with the Cat function to obtain the behavior feature vector of the k-th person, $F_{Cat,k}^{m,q} = \mathrm{Cat}(F_{po,k}^{m,q}, F_{te,k}^{m,q})$. Next, reduce the dimension of $F_{Cat,k}^{m,q}$ with a one-dimensional convolutional layer Conv1d to obtain $F_{be,k}^{m,q} = \mathrm{Conv1d}(F_{Cat,k}^{m,q})$; the reduced dimension is chosen to equal the dimension of the temporal behavior feature vector $F_{te,k}^{m,q}$, for example 64. Finally, the set of behavior feature vectors of all persons is obtained, $F_{be}^{m,q} = \{F_{be,k}^{m,q} \mid k = 1, 2, \ldots, K_i^{m,q}\}$, and traversing all segments gives the set of behavior feature vectors of all segments, $F_{be}^{m}$.

A3: First, apply Min-Max normalization to the set of behavior feature vectors $F_{be}^{m}$ of the m-th scene to obtain the normalized set $F_{be,norm}^{m} = \{F_{be,norm}^{m,q} \mid q \in \{1, 2, \ldots, Q_m\}\}$. Similarly, normalize the environmental parameters of the m-th scene, namely the light intensity L and the distribution coordinates $D\_eq_i$ of each device, to obtain the normalized set of environmental features $T\_en_{norm}^{m}$, which includes the normalized light intensity and the normalized encoded set of device distribution coordinates. The scene encoding uses one-hot encoding to transform the scene type label T_sc into a scene encoding vector $T\_sc^{m}$; for example, the encoding vector for the office area is (1, 0, 0). Then the behavior-environment fusion feature is calculated:

$F_{fus}^{m} = \beta \times F_{be,norm}^{m} + (1 - \beta) \times T\_en_{norm}^{m}$

where β is the weight of the behavior features in the fusion (during the sample library construction phase, the weights are allocated dynamically according to scene type using MLP + Softmax; for theft behavior, β is typically greater than 1-β). Finally, the scene-behavior-environment-fusion four-in-one data sample library is obtained:

$D_{lib} = \{(T\_sc^{m}, F_{be,norm}^{m}, T\_en_{norm}^{m}, F_{fus}^{m}, T\_be^{m}) \mid m \in \{1, 2, \ldots, M\}\}$

where $T\_be^{m}$ is the behavior label, 1 for normal and 0 for abnormal.

In this embodiment, the combined use of YOLO (object detection model) and OpenPose (pose estimation model) is existing technology in fields such as intelligent monitoring and behavior analysis. The distribution coordinates $D\_eq_i$ of each device are normalized as $D\_eq\_quant_i = (D\_eq_i - D\_eq\_min) / (D\_eq\_max - D\_eq\_min)$, where D_eq_max and D_eq_min are the maximum and minimum values of the device coordinate encoding, respectively.
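The normalization and fusion in A3 can be sketched with short helper functions. The sketch assumes the behavior and environment feature vectors have the same length so that the weighted sum is defined, and it uses a fixed β rather than the MLP + Softmax weighting; both are simplifying assumptions, and the helper names are hypothetical.

```python
def min_max(values):
    """Min-Max normalize a list of numbers to [0, 1] (all zeros if constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(index, size):
    """One-hot scene encoding, e.g. index 0 of 3 scenes gives [1, 0, 0]."""
    return [1 if i == index else 0 for i in range(size)]

def fuse(f_be_norm, t_en_norm, beta):
    """Behavior-environment fusion: beta * F_be,norm + (1 - beta) * T_en,norm."""
    if len(f_be_norm) != len(t_en_norm):
        raise ValueError("feature vectors must have the same length")
    return [beta * b + (1 - beta) * e for b, e in zip(f_be_norm, t_en_norm)]

f_be_norm = min_max([2.0, 4.0, 6.0])       # toy behavior features
device_x = min_max([10.0, 30.0, 50.0])     # toy device coordinates (D_eq_quant)
t_en_norm = [1.0, 0.5, 0.0]                # toy normalized environment features
f_fus = fuse(f_be_norm, t_en_norm, beta=0.7)
print(one_hot(0, 3), f_be_norm, device_x, f_fus)
```

With β = 0.7 the behavior features dominate the fused vector, in line with the remark that β is typically larger than 1-β for theft behavior.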

[0018] Initial theft warning threshold configuration module: based on the scene-behavior-environment-fusion four-in-one data sample library, it obtains the normal behavior-environment fusion sample library under the different monitoring scenes, fits a probability distribution model of the normal behavior features for each scene, and outputs a scene-environment-adapted initial theft warning threshold matrix. It comprises the following steps:

B1: First, from the scene-behavior-environment-fusion four-in-one data sample library $D_{lib}$, obtain the fusion sample library of normal behaviors in each monitoring scene, $D_{lib}^{m}$, and extract the set of behavior-environment feature vectors of all normal-labeled samples in the m-th scene, $F_{norm}^{m} = \{F_{norm,j}^{m} \mid j \in \{1, 2, \ldots, J\}\}$, where J is the total number of normal behavior features in the scene. Then calculate the Euclidean distance from each normal behavior-environment feature in the scene to the mean of all normal behavior-environment features in the scene to obtain the distance set, and take the 95th percentile of the distance set as the base theft warning threshold of the scene, $Th_{base}^{m}$.

B2: From the environmental data corresponding to all normal behavior feature vectors, obtain the environmental level information of the monitoring footage, and correct the threshold $Th_{base}^{m}$ according to the preset mapping between the environmental level and the correction factor η(L) to generate the initial threshold adapted to the environmental level:

$Th_{init}^{m,L} = Th_{base}^{m} \times \eta(L)$

The environmental level information includes illumination level information and equipment density information. The illumination level is one or more of low light (illuminance coefficient below the normal range), normal light, and strong light (illuminance coefficient above the normal range). The equipment density is one or more of dense and sparse, based on the number of devices: a dense equipment environment has a number of monitoring devices greater than or equal to a threshold (e.g., 5), and a sparse equipment environment has fewer than the threshold.

For example, the set of environmental features is preset to five environmental levels, $L \in \{1, 2, \ldots, 5\}$: low light, normal light, strong light, dense equipment, and sparse equipment. A correction factor η(L) is preset for each level from a statistical analysis of the misjudgment rate of historical samples at that level (the misjudgment rate is high before correction and decreases after correction); for example, the corresponding η(L) are 1.1, 1.0, 1.05, 0.97, and 1.08 respectively. Finally, all scenes are traversed to obtain the scene-environment-adapted initial theft warning threshold matrix:

$Th_{init} = \{Th_{init}^{m,L} \mid m \in \{1, 2, \ldots, M\},\ L \in \{1, 2, \ldots, L1\}\}$

where M is the number of scenes and L1 is the number of environmental levels.

In this embodiment, the behavior feature error is large under low light, so the threshold is relaxed to avoid misjudging normal behavior. Under normal lighting the recognition accuracy is stable, so the base threshold is kept. Strong light overexposes the image (with a smaller error than low light), so the threshold is relaxed slightly. Densely placed equipment carries a high theft risk, so the threshold is tightened to judge abnormal behavior more strictly. Sparsely placed equipment carries a low theft risk, so the threshold is relaxed to reduce unnecessary suspected anomalies. The light intensity is normalized as the coefficient L_quant = L / L_max, where L_max is the maximum light intensity.
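Steps B1 and B2 can be sketched as follows. The text does not say how the 95th percentile is computed, so linear interpolation between order statistics is assumed here. The correction factors are the example values from the text; the dictionary keys and function names are hypothetical.

```python
import math

def percentile(values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a non-empty list."""
    s = sorted(values)
    rank = (len(s) - 1) * p / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)

def base_threshold(normal_features, p=95):
    """Th_base: p-th percentile of Euclidean distances to the mean normal feature."""
    dim = len(normal_features[0])
    mu = [sum(f[d] for f in normal_features) / len(normal_features)
          for d in range(dim)]
    dists = [math.dist(f, mu) for f in normal_features]
    return percentile(dists, p)

# Example correction factors eta(L) from the text, one per environmental level.
ETA = {"low_light": 1.1, "normal_light": 1.0, "strong_light": 1.05,
       "dense_equipment": 0.97, "sparse_equipment": 1.08}

def init_thresholds(th_base):
    """One row of the initial threshold matrix: Th_init = Th_base * eta(L)."""
    return {level: th_base * eta for level, eta in ETA.items()}

normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]   # toy normal features
th_base = base_threshold(normal)
row = init_thresholds(th_base)
print(th_base, row)
```

Running this once per scene and stacking the rows gives the M × L1 matrix; a factor above 1 relaxes the threshold and a factor below 1 tightens it.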

[0019] As shown in Figure 3, the intelligent monitoring real-time perception module collects real-time monitoring data, real-time light sensor data, and device distribution coordinate data, and outputs the real-time fused feature vector and the current scene classification result. It comprises the following steps:

C1: First, collect the real-time monitoring data of the target scene and extract $Q_m$ N-frame monitoring sequence segments $V_{fra,act}$. Input them into the behavior feature extraction unit to obtain the set of behavior feature vectors of all segments, $F_{be,act}^{m}$, and normalize it to obtain the normalized set $F_{be,act,norm}^{m}$. The normalized environmental feature set $T\_en_{act,norm}^{m}$ is obtained in the same way.

C2: Concatenate $F_{be,act,norm}^{m}$ and $T\_en_{act,norm}^{m}$ with the feature concatenation operation Cat, map the concatenated features $\mathrm{Cat}(F_{be,act,norm}^{m}, T\_en_{act,norm}^{m})$ to scalar values with a multilayer perceptron (MLP), and apply the Softmax function to map the values into the (0,1) interval. This gives the weight α of the behavior features and the weight (1-α) of the environmental features in the fusion:

$\alpha = \mathrm{Softmax}(\mathrm{MLP}(\mathrm{Cat}(F_{be,act,norm}^{m}, T\_en_{act,norm}^{m})))$

The real-time fused feature vector is

$F_{act,fus}^{m} = \alpha \times F_{be,act,norm}^{m} + (1 - \alpha) \times T\_en_{act,norm}^{m}$

C3: Using the cosine similarity matching function Sim, calculate the cosine similarity between the real-time fused feature vector $F_{act,fus}^{m}$ and the fusion feature $F_{fus}^{m}$ corresponding to each scene encoding in the sample library, and take the scene with the maximum similarity as the current scene classification result, i.e., the scene matching function $C\_real = \arg\max(\mathrm{Sim}(F_{act,fus}^{m}, F_{fus}^{m}))$.

Theft warning threshold decision module: based on the real-time fused feature vector, the current scene classification result, and historical warning data, it outputs the dynamically adjusted theft warning threshold and the threshold adjustment step size record through a reinforcement learning strategy. It comprises the following steps:

D1: The reinforcement learning state space S is composed of the real-time fused feature vector $F_{act,fus}^{m}$, the one-hot encoding $\mathrm{OneHot}(C\_real^{m})$ of the current scene classification result, and the historical early warning data DH, which includes the early warning accuracy $ACC_{his}$, the false warning rate $FP_{his}$, and the missed warning rate $FN_{his}$ of the most recent early warning period:

$S = \mathrm{Cat}(F_{act,fus}^{m}, \mathrm{OneHot}(C\_real^{m}), DH)$

Then define the reward function R:

$R = a1 \times (1 - FP_{his}) + a2 \times (1 - FN_{his}) + a3 \times SS$

where a1, a2, and a3 are the corresponding weights (e.g., a1 = 0.6, a2 = 0.3, and a3 = 0.1, determined with the Analytic Hierarchy Process (AHP) from the false warning rate, missed warning rate, and real-time weight ratio of historical scenario tests). SS is the real-time decision score, $SS = \exp(-\Delta t / \tau)$, where Δt is the decision time from state input to output of the adjustment action A, and τ is the maximum allowable time (e.g., 200 ms). If τ is 1, then SS is 1; otherwise, SS is 0.5.

Finally, define the action space A for adjusting the theft warning threshold (e.g., A = +0.15 if the threshold increases by 15%, A = -0.05 if it decreases by 5%, and 0 if the warning threshold remains unchanged): $A = (A_1, A_2, \ldots, A_{n2})$, where $A_{n2}$ is the n2-th theft warning threshold adjustment amount and n2 is the number of theft warning threshold adjustments. The action set includes options to raise the threshold, lower the threshold, and keep it unchanged. The raising and lowering actions each include at least two different adjustment ranges, for example A = {+5%, +10%, +15%, -5%, -10%, -15%, 0}.

D2: Based on the real-time state space S, the DQN policy network calculates the expected reward R of each theft warning threshold adjustment action in the action space A, and the action with the largest expected reward is selected as the final theft warning threshold adjustment ΔA, $\Delta A = \pi(S; \theta)$, where π() is the DQN policy network and θ denotes its parameters (weights, biases, etc.). Then the standard deviation of the fused feature vector, $\sigma(F_{act,fus}^{m})$, is used to optimize and record the adjustment step size of the theft warning threshold:

$\mathrm{Step}(\Delta A) = \Delta A \times \lambda, \quad \lambda = \min[1,\ \sigma(F_{act,fus}^{m}) / \sigma(th)]$

where λ is the scaling factor and σ(th) is the upper bound of the standard deviation of the fused feature vectors corresponding to normal behavior in the current scene. Finally, the initial warning threshold $Th_{init}^{m,L}$ for the current scene m and the L-th environmental level is looked up, and the dynamically adjusted theft warning threshold is obtained:

$Th_{dy}^{m,L} = Th_{init}^{m,L} \times (1 + \mathrm{Step}(\Delta A))$

In this embodiment, for example, the fused feature vector [0.1, 0.2, 0.3, ..., 0.2] has similar values in each dimension and therefore a small standard deviation, whereas the fused feature vector [0.1, 0.8, 0.05, ..., 0.9] has large differences between dimensions and therefore a large standard deviation.
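Steps C2, C3, D1, and D2 can be sketched end to end with stand-in values. Several points are assumptions rather than the patent's method: α is taken from a Softmax over two hypothetical logits (one per feature group), the reward is assumed to contain a third term a3 × SS, and the DQN is replaced by a hypothetical table of expected rewards per action. All numeric inputs are made up.

```python
import math
from statistics import pstdev

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match_scene(f_act_fus, library):
    """C3: scene whose stored fusion feature is most similar to the live one."""
    return max(library, key=lambda scene: cosine(f_act_fus, library[scene]))

def reward(fp_his, fn_his, dt_ms, tau_ms=200.0, weights=(0.6, 0.3, 0.1)):
    """D1 (assumed form): R = a1*(1-FP) + a2*(1-FN) + a3*exp(-dt/tau)."""
    a1, a2, a3 = weights
    return a1 * (1 - fp_his) + a2 * (1 - fn_his) + a3 * math.exp(-dt_ms / tau_ms)

def dynamic_threshold(th_init, delta_a, f_act_fus, sigma_th):
    """D2: Th_dy = Th_init * (1 + delta_a * min(1, sigma(F) / sigma_th))."""
    lam = min(1.0, pstdev(f_act_fus) / sigma_th)
    return th_init * (1 + delta_a * lam)

# C2: attention-style weights from two hypothetical MLP logits.
alpha, _ = softmax([1.2, 0.3])
f_be, t_en = [0.9, 0.1, 0.0], [0.7, 0.2, 0.1]
f_act_fus = [alpha * b + (1 - alpha) * e for b, e in zip(f_be, t_en)]

# C3: match against per-scene fusion features from the sample library.
library = {"mall": [1.0, 0.0, 0.0], "warehouse": [0.0, 1.0, 0.0]}
scene = match_scene(f_act_fus, library)

# D2: pick the action with the largest expected reward (stand-in for the DQN).
q_values = {0.05: 0.61, 0.10: 0.74, -0.05: 0.40, 0.0: 0.55}
delta_a = max(q_values, key=q_values.get)
th_dy = dynamic_threshold(2.0, delta_a, f_act_fus, sigma_th=0.2)
print(scene, delta_a, th_dy, reward(0.1, 0.2, dt_ms=0.0))
```

Because λ is capped at 1, a widely spread feature vector applies the full selected adjustment, while a tightly clustered one scales the step down.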

[0020] In this embodiment, it should be specifically noted that the early warning accuracy rate is the proportion of correct judgments within the most recent early warning period; the false warning rate is the ratio of the number of times normal behavior is flagged as abnormal to the total number of warnings; and the missed warning rate is the ratio of the number of times abnormal behavior was not flagged, plus the number of times the warning level was lower than the actual level, to the total number of actual abnormal behaviors. The reinforcement learning decision model can adopt a deep reinforcement learning algorithm known in the art, such as a DQN policy network, which is existing technology. The model is trained before deployment: based on the scene-behavior-environment-fusion four-in-one feature sample library, simulated state-action-reward training data is generated to train the DQN model and optimize the parameters θ (weights, biases, etc.) until the reward value of the model converges (for example, the reward value stabilizes above 0.8). After training, the model is deployed to the theft warning threshold decision module.
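The three statistics defined above can be sketched as follows. The event representation (pairs of predicted and actual severity, with 0 meaning no warning or normal behavior and larger integers meaning more severe) and the rule that a judgment is "correct" when the two severities are equal are assumptions made for illustration.

```python
def warning_metrics(events):
    """Return (accuracy, false warning rate, missed warning rate).

    events: list of (predicted, actual) severities.
    Assumption: 0 = no warning / normal behavior, larger = more severe.
    """
    total = len(events)
    correct = sum(1 for p, a in events if p == a)
    warnings = sum(1 for p, _ in events if p > 0)
    false_warnings = sum(1 for p, a in events if p > 0 and a == 0)
    abnormal = sum(1 for _, a in events if a > 0)
    # Missed: abnormal behavior not warned, or warned below its actual level.
    missed = sum(1 for p, a in events if a > 0 and p < a)

    acc = correct / total if total else 0.0
    fp = false_warnings / warnings if warnings else 0.0
    fn = missed / abnormal if abnormal else 0.0
    return acc, fp, fn


# Hypothetical warning period with eight verified events.
events = [(0, 0), (0, 0), (1, 0), (2, 2), (1, 3), (0, 2), (3, 3), (0, 0)]
print(warning_metrics(events))
```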

[0021] Theft warning and feedback module: By calculating the deviation between the real-time fused feature vector and the normal fused feature vector and comparing it with the dynamically adjusted theft warning threshold, it outputs the theft warning level signal, the theft warning log, and adjustment effect statistics, and feeds the adjustment effect statistics back to the theft warning threshold decision module, including the following steps:

E1: Using the Euclidean distance function $Dist$, calculate the behavioral deviation $An$ between the real-time fused feature vector $F^{act,fus}_{m}$ and the mean $\mu(F^{norm}_{m})$ of the normal fused features of the current scene in the sample library, $An = Dist(F^{act,fus}_{m}, \mu(F^{norm}_{m}))$. If $An \geq Th^{dy}_{m,L} \times k_1$, with $k_1 \geq 1.2$, output the Level 1 theft warning signal Alert1, triggering the Level 1 emergency intervention rule. If $Th^{dy}_{m,L} \times k_2 \leq An < Th^{dy}_{m,L} \times k_1$, with $1 < k_2 < k_1$, where $k_1$ and $k_2$ are two risk coefficients, output the Level 2 theft warning signal Alert2, triggering the Level 2 key attention rule. If $Th^{dy}_{m,L} \leq An < Th^{dy}_{m,L} \times k_2$, output the Level 3 theft warning signal Alert3, triggering the Level 3 prompt verification rule. If $An < Th^{dy}_{m,L}$, no warning is issued (None);

E2: Based on the timestamp $t$, the monitoring location coordinates $Loc(t)$, the scene classification result $C\_real_{m}(t)$, the real-time behavioral feature vector set $F^{be,act,norm}_{m}(t)$, the real-time environmental feature set $T\_en^{act,norm}_{m}(t)$, and the alert level $Alert(t)$, generate a structured theft alert log $Log(t)$, where $Log(t) = \langle t, Loc(t), C\_real_{m}(t), F^{be,act,norm}_{m}(t), T\_en^{act,norm}_{m}(t), Alert(t) \rangle$;

E3: Based on the warning results and post-event verification, calculate the warning accuracy rate, false warning rate, and missed warning rate at the current moment, and feed them back to the theft warning threshold decision module to update the state space for the next warning cycle.
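The level assignment in step E1 can be sketched as a short function. The default values k1 = 1.5 and k2 = 1.2 are placeholders chosen only to satisfy the stated constraints, and the function names are hypothetical.

```python
import math


def deviation(fused, normal_mean):
    """An = Dist(F_act_fus, mu(F_norm)) using the Euclidean distance."""
    return math.dist(fused, normal_mean)


def alert_level(an, th_dy, k1=1.5, k2=1.2):
    """Map the deviation An to Alert1/Alert2/Alert3, or None for no warning.

    k1 and k2 are placeholder risk coefficients (assumed values).
    """
    if not (1.0 < k2 < k1 and k1 >= 1.2):
        raise ValueError("risk coefficients must satisfy 1 < k2 < k1 and k1 >= 1.2")
    if an >= th_dy * k1:
        return "Alert1"   # Level 1: emergency intervention
    if an >= th_dy * k2:
        return "Alert2"   # Level 2: key attention
    if an >= th_dy:
        return "Alert3"   # Level 3: prompt verification
    return None           # below the dynamic threshold: no warning


# Hypothetical dynamic threshold of 1.0.
for an in (1.6, 1.3, 1.1, 0.9):
    print(an, alert_level(an, th_dy=1.0))
```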

[0022] In this embodiment, it should be specifically noted that the risk coefficients are obtained by extracting the samples labeled as abnormal theft behavior from the sample library, calculating the distance between the fused features of each such sample and the mean of the normal behavior features to obtain the deviation distribution of theft behavior (for example, statistics may show that 90% of theft samples have an anomaly degree ≥ 1.2 and 50% of theft samples have an anomaly degree ≥ 1.1), and determining the risk coefficients $k_1$ and $k_2$ from that distribution. Their specific values can be determined by a grid search on historical data or in a simulation environment: a set of preset candidate values is tested (such as $k_1 \in \{1.1, 1.3, 1.5\}$ and $k_2 \in \{1.05, 1.1, 1.15\}$), and the pair of values that maximizes the accuracy of the Level 1 warning while meeting the management requirements for the distribution of warnings across levels is selected as the final parameters.
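One possible form of such a grid search is sketched below. The sample representation (deviation-to-threshold ratio plus a theft label), the use of Level 1 precision as the "accuracy" score, and the two recall floors that stand in for the management requirements are all assumptions; this embodiment does not define them. The sketch also enforces the constraints from step E1, which excludes the candidate $k_1 = 1.1$.

```python
from itertools import product


def theft_recall(samples, k):
    """Share of theft samples whose deviation ratio is at least k."""
    theft = [ratio for ratio, is_theft in samples if is_theft]
    return sum(r >= k for r in theft) / len(theft) if theft else 0.0


def level1_precision(samples, k1):
    """Share of Level 1 warnings (ratio >= k1) that are real theft."""
    flagged = [is_theft for ratio, is_theft in samples if ratio >= k1]
    return sum(flagged) / len(flagged) if flagged else 0.0


def grid_search(samples, k1_candidates, k2_candidates,
                min_level1_recall=0.5, min_level2_recall=0.9):
    """Return (score, k1, k2) for the best admissible pair, or None."""
    best = None
    for k1, k2 in product(k1_candidates, k2_candidates):
        if not (1.0 < k2 < k1 and k1 >= 1.2):   # constraints from step E1
            continue
        # Hypothetical stand-ins for the "management requirements".
        if theft_recall(samples, k1) < min_level1_recall:
            continue
        if theft_recall(samples, k2) < min_level2_recall:
            continue
        score = level1_precision(samples, k1)
        if best is None or score > best[0]:
            best = (score, k1, k2)
    return best


# Hypothetical labeled samples: (An / Th ratio, is_theft).
samples = [(1.6, True), (1.4, True), (1.35, True), (1.12, True), (1.08, True),
           (1.32, False), (1.07, False), (0.9, False), (0.8, False)]
print(grid_search(samples, [1.1, 1.3, 1.5], [1.05, 1.1, 1.15]))
```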

[0023] Secondly: The accompanying drawings of the embodiments disclosed in this invention show only the structures involved in those embodiments; other structures can follow common designs. Where no conflict arises, features of the same embodiment and of different embodiments of this invention can be combined with each other. Finally, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. An AI-based intelligent video surveillance and behavior analysis early warning system, characterized in that it comprises:
Multi-source sample library construction module: includes a multi-source scene data acquisition unit and a behavior feature extraction unit; it collects raw monitoring data of the target multi-source scenes, collects manually labeled data, obtains a set of behavior feature vectors through the behavior feature extraction unit, and outputs a scene-behavior-environment-fusion four-in-one data sample library;
Initial theft warning threshold configuration module: based on the scene-behavior-environment-fusion four-in-one data sample library, it obtains the normal behavior-environment fusion sample library for different monitoring scenes, fits a probability distribution model of the normal behavior features for each scene, and outputs the scene-environment adapted initial theft warning threshold matrix;
Intelligent monitoring real-time perception module: collects real-time monitoring data, real-time data from light sensors, and device distribution coordinate data, and outputs the real-time fused feature vector and the current scene classification result;
Theft warning threshold decision module: based on the real-time fused feature vector, the current scene classification result, and historical warning data, it outputs the dynamically adjusted theft warning threshold and the threshold adjustment step size record through a reinforcement learning strategy;
Theft warning and feedback module: by calculating the deviation between the real-time fused feature vector and the normal fused feature vector and comparing it with the dynamically adjusted theft warning threshold, it outputs the theft warning level signal, the theft warning log, and adjustment effect statistics, and feeds the adjustment effect statistics back to the theft warning threshold decision module.

2. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 1, characterized in that the behavior feature extraction unit includes:
A2.1: For $M$ types of target scenes, $m \in (1, 2, \ldots, M)$, extract $Q_{m}$ N-frame monitoring sequence segments $V^{fra}$ from the $m$-th scene, $q \in (1, 2, \ldots, Q_{m})$; perform Z-score preprocessing on the frame sequence $V^{fra}_{m,q}$ of the $q$-th segment to obtain $V^{norm}_{m,q}$, where $V^{fra}_{m,q} = (v^{1}_{m,q}, v^{2}_{m,q}, \ldots, v^{N}_{m,q})$, $i \in (1, 2, \ldots, N)$, and $V^{norm}_{m,q}$ is obtained by Z-score normalization using the mean and standard deviation of the $Q_{m}$ N-frame monitoring sequence segments of the $m$-th scene; at the same time, each segment is scaled to a preset fixed size of width × height × number of channels;
A2.2: Using the YOLO model, perform personnel target detection on each frame $V^{norm,i}_{m,q}$ of the normalized $q$-th segment $V^{norm}_{m,q}$ in the $m$-th scene, and output the single-frame set of personnel bounding boxes $B^{i}_{m,q}$ and the confidence set of personnel bounding boxes $C^{i}_{m,q}$, $(B^{i}_{m,q}, C^{i}_{m,q}) = YOLO(V^{norm}_{m,q})$, $C^{i}_{m,q} = \{c^{k,i}_{m,q} \mid c^{k,i}_{m,q} = P^{obj,i}_{m,q}(k) \times P^{cls,i}_{m,q}(k),\ k = 1, 2, \ldots, K^{i}_{m,q}\}$, where $K^{i}_{m,q}$ is the effective number of people in a single frame of the segment; invalid targets with $c^{k,i}_{m,q}$ < the corresponding threshold are filtered out; $P^{obj,i}_{m,q}(k)$ represents the probability that a person is present in the $k$-th detection box, and $P^{cls,i}_{m,q}(k)$ represents the probability that the $k$-th detection box belongs to the personnel category;
A2.3: Using the OpenPose model, crop the person region from the $i$-th frame according to the personnel bounding box $B^{i}_{m,q}$, extract $n$ pose keypoints of the person, including the torso and hands, defined in a predefined fixed order, and output the coordinates $(x^{I,k,i}_{m,q}, y^{I,k,i}_{m,q})$ and the keypoint confidence $s^{I,k,i}_{m,q}$ of each keypoint, $I \in (1, 2, \ldots, n)$, to obtain the set of personnel pose keypoint coordinates $D^{i}_{m,q}$, $D^{i}_{m,q} = OpenPose(Crop(V^{norm,i}_{m,q}, B^{i}_{m,q}))$, where $Crop()$ is the image cropping function. Invalid keypoints with $s^{I,k,i}_{m,q}$ < the corresponding threshold are filled in with the average of the adjacent valid keypoints, where $N_{va}$ is the number of adjacent valid points, i.e., points with index in $\{I+1, I-1\}$ and confidence ≥ the corresponding threshold. If keypoint $I$ is 1, only the average over valid point $I+1$ is taken. If ≥ 3 consecutive keypoints are invalid, the adjacent range is expanded to $I \pm 2$. If the keypoints are still invalid, the average of the same keypoint in all other frames of the same person is taken. Finally, the coordinate set of all pose keypoints in the $i$-th frame is obtained, $D^{i}_{m,q} = \{(x^{I,k,i}_{m,q}, y^{I,k,i}_{m,q}, s^{I,k,i}_{m,q}) \mid I = 1, 2, \ldots, n;\ k = 1, 2, \ldots, K^{i}_{m,q}\}$.

3. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 2, characterized in that the behavior feature extraction unit further includes:
A2.4: First, for the N-frame monitoring sequence segment $V^{fra}$ of the $q$-th normalized segment $V^{norm}_{m,q}$ in the $m$-th scene, the Intersection over Union (IOU) method is used to match the same person: if the IOU between the bounding box of a person in frame $i$ and the bounding box of a person in frame $(i-1)$ is greater than or equal to the corresponding threshold, they are determined to be the same person. The keypoint sequence of the $k$-th person over the N frames is recorded as $D^{k}_{m,q}$, $D^{k}_{m,q} = \{D^{k,i}_{m,q} \mid i \in (1, 2, \ldots, N)\}$; then, by differencing the horizontal and vertical coordinates of the pose keypoints, the displacements of the pose keypoints between adjacent frames $i-1$ and $i$ are obtained, giving the displacement set $\Delta D^{k}_{m,q}$ of the $k$-th person. Finally, the temporal behavioral feature vector $F^{te,k}_{m,q}$ of the $k$-th person over the N frames is obtained through a bidirectional long short-term memory network (BiLSTM);
A2.5: First, based on the coordinates $(x^{I,k,i}_{m,q}, y^{I,k,i}_{m,q})$ and keypoint confidences $s^{I,k,i}_{m,q}$ of the $n$ pose keypoints, calculate the mean pose keypoint feature $F^{po,k}_{m,q}$ of the $k$-th person. Then, concatenate $F^{po,k}_{m,q}$ and $F^{te,k}_{m,q}$ with the Cat function to obtain the behavior feature vector $F^{Cat,k}_{m,q}$ of the $k$-th person, $F^{Cat,k}_{m,q} = Cat(F^{po,k}_{m,q}, F^{te,k}_{m,q})$. Then, reduce the dimension of $F^{Cat,k}_{m,q}$ through a one-dimensional convolutional layer Conv1d to obtain $F^{be,k}_{m,q}$, $F^{be,k}_{m,q} = Conv1d(F^{Cat,k}_{m,q})$, where the reduced dimension is determined by the dimension of the temporal behavioral feature vector $F^{te,k}_{m,q}$. Finally, the set of behavioral feature vectors for all personnel is obtained, $F^{be}_{m,q} = \{F^{be,k}_{m,q} \mid k = 1, 2, \ldots, K^{i}_{m,q}\}$, and all segments are traversed to obtain the set $F^{be}_{m}$ of behavioral feature vectors for all segments.
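The person matching and displacement computation of step A2.4 can be sketched as follows. The box format (x1, y1, x2, y2), the threshold value 0.5, and the choice of the highest-IOU candidate when several previous boxes pass the threshold are assumptions; the claim only states that an IOU at or above the threshold identifies the same person.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_person(prev_boxes, box, threshold=0.5):
    """Index of the frame (i-1) box matching `box`, or None if no IOU >= threshold.

    Assumption: when several boxes qualify, the one with the highest IOU wins.
    """
    best_idx, best_score = None, threshold
    for idx, prev in enumerate(prev_boxes):
        score = iou(prev, box)
        if score >= best_score:
            best_idx, best_score = idx, score
    return best_idx


def displacements(keypoint_seq):
    """Per-keypoint (dx, dy) between adjacent frames for one person.

    keypoint_seq: list over frames, each a list of (x, y) keypoints.
    """
    return [[(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(f0, f1)]
            for f0, f1 in zip(keypoint_seq, keypoint_seq[1:])]


prev = [(0, 0, 2, 2), (10, 10, 12, 12)]
print(match_person(prev, (0, 0, 2, 2)), match_person(prev, (5, 5, 6, 6)))
print(displacements([[(0, 0), (1, 1)], [(1, 2), (1, 1)]]))
```

The resulting displacement sequence is what would be passed to the BiLSTM; the network itself is outside the scope of this sketch.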

4. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 1, characterized in that the scene-behavior-environment-fusion four-in-one data sample library is obtained as follows: first, Min-Max normalization is performed on the behavioral feature vector set $F^{be}_{m}$ of the $m$-th scene to obtain the normalized set of behavioral feature vectors $F^{be,norm}_{m}$, $F^{be,norm}_{m} = \{F^{be,norm}_{m,q} \mid q \in (1, 2, \ldots, Q_{m})\}$. Similarly, the environmental parameters of the $m$-th scene, namely the illumination intensity $L$ and the distribution coordinates $D\_eq_{i}$ of each device, are normalized to obtain the normalized set of environmental features $T\_en^{norm}_{m}$, which includes the normalized illumination intensity and the normalized encoded set of device distribution coordinates; the scene encoding uses one-hot encoding to transform the scene type label $T\_sc$ into a scene encoding vector $T\_sc_{m}$. Next, the behavior-environment fusion feature is calculated, $F^{fus}_{m} = \beta \times F^{be,norm}_{m} + (1 - \beta) \times T\_en^{norm}_{m}$, where $\beta$ represents the weight of the behavioral features during fusion; finally, the scene-behavior-environment-fusion four-in-one data sample library $D^{lib}$ is obtained, $D^{lib} = \{(T\_sc_{m}, F^{be,norm}_{m}, T\_en^{norm}_{m}, F^{fus}_{m}, T\_be_{m}) \mid m \in (1, 2, \ldots, M)\}$, where $T\_be_{m}$ is a behavior label, with 1 for normal and 0 for abnormal.
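A minimal sketch of the normalization and fusion in this claim follows. It assumes that Min-Max normalization is applied to one feature dimension across samples and that the behavioral and environmental vectors have equal length so that the weighted sum is defined; the claim does not state either point, and the value β = 0.75 is a placeholder.

```python
def min_max(values):
    """Min-Max normalize a list of scalars to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature maps to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(behavior, environment, beta):
    """F_fus = beta * F_be_norm + (1 - beta) * T_en_norm, element-wise."""
    if len(behavior) != len(environment):
        raise ValueError("behavior and environment vectors must have equal length")
    return [beta * b + (1 - beta) * e for b, e in zip(behavior, environment)]


print(min_max([2, 4, 6]))
print(fuse([1.0, 0.0], [0.0, 1.0], beta=0.75))
```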

5. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 1, characterized in that the scene-environment adapted initial theft warning threshold matrix is obtained as follows:
B1: First, based on the scene-behavior-environment-fusion four-in-one data sample library $D^{lib}$, obtain the fusion sample library $D^{lib}_{m}$ of normal behaviors under different monitoring scenes, and extract the set $F^{norm}_{m}$ of behavior-environment feature vectors of all normal labels in the $m$-th scene, $F^{norm}_{m} = \{F^{norm,j}_{m} \mid j \in (1, 2, \ldots, J)\}$, where $J$ is the total number of normal behavioral features in the scene; then, calculate the Euclidean distance from each normal behavior-environment feature in the scene to the mean of all normal behavior-environment features in the scene to obtain the distance set, and take the 95th percentile of the distance set as the initial theft warning threshold $Th^{base}_{m}$ of the scene;
B2: Based on the environmental feature set corresponding to all normal behavior feature vectors, obtain the environmental level information of the monitoring picture, and, according to the preset mapping between the environmental level and the correction factor $\eta(L)$, correct the threshold $Th^{base}_{m}$ to generate the final threshold $Th^{init}_{m,L}$ adapted to the environment level, $Th^{init}_{m,L} = Th^{base}_{m} \times \eta(L)$. The environmental level information includes illumination level information and device density information. The illumination level includes one or more of low light, normal light, and strong light. The device density information includes one or more of dense and sparse device distribution, determined from the number of devices. Finally, all scenes are traversed to obtain the scene-environment adapted initial theft warning threshold matrix $Th^{init}$, $Th^{init} = \{Th^{init}_{m,L} \mid m \in (1, 2, \ldots, M),\ L \in (1, 2, \ldots, L1)\}$, where $M$ is the number of scenes and $L1$ is the number of environment levels.
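The threshold construction in this claim can be sketched as below. The linear-interpolation percentile, the scene and level names, and the η values are assumptions introduced for illustration only; the claim specifies just the 95th quantile of the distances and a preset mapping η(L).

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (assumed method)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def base_threshold(normal_features, q=95.0):
    """Th_base: q-th percentile of Euclidean distances to the normal-feature mean."""
    count, dim = len(normal_features), len(normal_features[0])
    mean = [sum(f[d] for f in normal_features) / count for d in range(dim)]
    distances = [math.dist(f, mean) for f in normal_features]
    return percentile(distances, q)


def initial_threshold_matrix(base, eta):
    """Th_init[(scene, level)] = Th_base[scene] * eta[level]."""
    return {(scene, level): th * factor
            for scene, th in base.items()
            for level, factor in eta.items()}


# Hypothetical scene name, environment levels, and correction factors.
base = {"warehouse": 2.0}
eta = {"low_light": 1.5, "normal_light": 1.0}
print(initial_threshold_matrix(base, eta))
```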

6. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 1, characterized in that obtaining the real-time fused feature vector and the current scene classification result includes:
C1: First, collect the real-time monitoring data $Vs^{act}$ of the target scene, extract $Q_{m}$ N-frame monitoring sequence segments $V^{fra,act}$, and input them into the behavior feature extraction unit to obtain the set $F^{be,act}_{m}$ of behavioral feature vectors for all segments. Then, normalize the set of behavioral feature vectors $F^{be,act}_{m}$ to obtain the normalized set of behavioral feature vectors $F^{be,act,norm}_{m}$. Similarly, the normalized environmental feature set $T\_en^{act,norm}_{m}$ is obtained;
C2: First, $F^{be,act,norm}_{m}$ and $T\_en^{act,norm}_{m}$ are concatenated by the feature concatenation operation Cat; then a multilayer perceptron (MLP) maps the concatenated features $Cat(F^{be,act,norm}_{m}, T\_en^{act,norm}_{m})$ to scalar values; finally, the Softmax function maps the values to the (0,1) interval to obtain the weight $\alpha$ of the behavioral features and the weight $(1 - \alpha)$ of the environmental features during fusion, i.e., $\alpha = Softmax(MLP(Cat(F^{be,act,norm}_{m}, T\_en^{act,norm}_{m})))$. The real-time fused feature vector is $F^{act,fus}_{m}$, $F^{act,fus}_{m} = \alpha \times F^{be,act,norm}_{m} + (1 - \alpha) \times T\_en^{act,norm}_{m}$;
C3: Using the cosine similarity matching function $Sim$, calculate the cosine similarity between the real-time fused feature vector $F^{act,fus}_{m}$ and the fusion feature $F^{fus}_{m}$ corresponding to each scene encoding in the sample library, and take the scene corresponding to the maximum similarity as the current scene classification result, i.e., the scene matching function is $C\_real = \arg\max(Sim(F^{act,fus}_{m}, F^{fus}_{m}))$.
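Steps C2 and C3 can be illustrated with a short sketch. It assumes the MLP emits one score per modality, so that Softmax over the two scores yields $\alpha$ and $1 - \alpha$; this is only one reading of the claim, and the MLP itself is replaced by hand-picked scores. The scene names and vectors are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity Sim(a, b); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_scene(fused, library):
    """C_real: the scene whose stored fusion feature is most similar to `fused`."""
    return max(library, key=lambda scene: cosine(fused, library[scene]))


# Assumption: two hand-picked scores stand in for the MLP output.
alpha, one_minus_alpha = softmax([0.0, 0.0])
behavior, environment = [1.0, 0.2], [1.0, 0.0]
fused = [alpha * b + one_minus_alpha * e for b, e in zip(behavior, environment)]

library = {"gate": [1.0, 0.0], "yard": [0.0, 1.0]}  # hypothetical scene library
print(alpha, fused, match_scene(fused, library))
```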

7. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 1, characterized in that obtaining the dynamically adjusted theft warning threshold and the threshold adjustment step size record includes:
D1: Based on the real-time fused feature vector $F^{act,fus}_{m}$, the one-hot encoding $OneHot(C\_real_{m})$ of the current scene classification result, and the historical early warning data $DH$ (which includes the early warning accuracy $ACC_{his}$, the false warning rate $FP_{his}$, and the missed warning rate $FN_{his}$ for the most recent early warning period), define the reinforcement learning state space $S$, $S = Cat(F^{act,fus}_{m}, OneHot(C\_real_{m}), DH)$; then define the reward function $R$, $R = a_1 \times (1 - FP_{his}) + a_2 \times (1 - FN_{his}) + a_3 \times SS$, where $a_1$, $a_2$, and $a_3$ are the corresponding weights, $SS$ is the real-time decision score, $SS = \exp(-\Delta t / \tau)$, $\Delta t$ is the decision time from state input to output of the adjustment action $A$, and $\tau$ is the maximum allowable time; finally, define the action space $A$ for the theft warning threshold adjustment, $A = (A_1, A_2, \ldots, A_{n2})$, where $A_{n2}$ is the $n2$-th theft warning threshold adjustment amount and $n2$ is the number of theft warning threshold adjustments. The action set includes options to raise the threshold, lower the threshold, and keep it unchanged. The raise and lower actions each include at least two different adjustment ranges;
D2: Based on the real-time state space $S$, the DQN policy network calculates, for each action in the action space $A$, the expected reward $R$ that the action can obtain, and the action with the largest expected reward is selected as the final theft warning threshold adjustment $\Delta A$, $\Delta A = \pi(S; \theta)$, where $\pi()$ is the DQN policy network and $\theta$ is the parameter of the DQN policy network; then the standard deviation $\sigma(F^{act,fus}_{m})$ of the fused feature vector is used to optimize and record the adjustment step size of the theft warning threshold, $Step(\Delta A) = \Delta A \times \lambda(\sigma(F^{act,fus}_{m}))$, where $\lambda$ is the scaling factor, $\lambda = \min[1, \sigma(F^{act,fus}_{m}) / \sigma(th)]$, and $\sigma(th)$ is the upper bound of the standard deviation of the fused feature vector corresponding to normal behavior in the current scene; finally, based on the current scene $m$ and the $L$-th environment level, the corresponding initial warning threshold $Th^{init}_{m,L}$ is obtained, and the dynamically adjusted theft warning threshold $Th^{dy}_{m,L}$ is computed as $Th^{dy}_{m,L} = Th^{init}_{m,L} \times (1 + Step(\Delta A))$.
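The reward of step D1 and the greedy selection of step D2 can be sketched as follows. The weights, the 200 ms limit, and the seven-element action set are the example values given in the description; the list of expected rewards stands in for the output of a trained DQN and is hypothetical.

```python
import math

# Example action set from the description: +/-5%, +/-10%, +/-15%, and no change.
ACTIONS = [0.05, 0.10, 0.15, -0.05, -0.10, -0.15, 0.0]


def decision_score(dt_ms, tau_ms=200.0):
    """SS = exp(-dt / tau); tau = 200 ms is the example value from the description."""
    return math.exp(-dt_ms / tau_ms)


def reward(fp_his, fn_his, ss, a1=0.6, a2=0.3, a3=0.1):
    """R = a1 * (1 - FP_his) + a2 * (1 - FN_his) + a3 * SS."""
    return a1 * (1 - fp_his) + a2 * (1 - fn_his) + a3 * ss


def select_action(expected_rewards, actions=ACTIONS):
    """Greedy choice: the action with the largest expected reward."""
    best = max(range(len(actions)), key=lambda i: expected_rewards[i])
    return actions[best]


# Hypothetical network output: one expected reward per action.
expected = [0.1, 0.7, 0.2, 0.0, 0.3, 0.1, 0.4]
print(select_action(expected))
print(round(reward(fp_his=0.10, fn_his=0.20, ss=decision_score(50.0)), 4))
```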

8. The AI-based intelligent video surveillance and behavior analysis early warning system according to claim 1, characterized in that the theft warning and feedback module includes:
E1: Using the Euclidean distance function $Dist$, calculate the behavioral deviation $An$ between the real-time fused feature vector $F^{act,fus}_{m}$ and the mean $\mu(F^{norm}_{m})$ of the normal fused features of the current scene in the sample library, $An = Dist(F^{act,fus}_{m}, \mu(F^{norm}_{m}))$. If $An \geq Th^{dy}_{m,L} \times k_1$, with $k_1 \geq 1.2$, output the Level 1 theft warning signal Alert1, triggering the Level 1 emergency intervention rule. If $Th^{dy}_{m,L} \times k_2 \leq An < Th^{dy}_{m,L} \times k_1$, with $1 < k_2 < k_1$, where $k_1$ and $k_2$ are two risk coefficients, output the Level 2 theft warning signal Alert2, triggering the Level 2 key attention rule. If $Th^{dy}_{m,L} \leq An < Th^{dy}_{m,L} \times k_2$, output the Level 3 theft warning signal Alert3, triggering the Level 3 prompt verification rule. If $An < Th^{dy}_{m,L}$, no warning is issued (None);
E2: Based on the timestamp $t$, the monitoring location coordinates $Loc(t)$, the scene classification result $C\_real_{m}(t)$, the real-time behavioral feature vector set $F^{be,act,norm}_{m}(t)$, the real-time environmental feature set $T\_en^{act,norm}_{m}(t)$, and the alert level $Alert(t)$, generate a structured theft alert log $Log(t)$, where $Log(t) = \langle t, Loc(t), C\_real_{m}(t), F^{be,act,norm}_{m}(t), T\_en^{act,norm}_{m}(t), Alert(t) \rangle$;
E3: Based on the warning results and post-event verification, calculate the warning accuracy rate, false warning rate, and missed warning rate at the current moment, and feed them back to the theft warning threshold decision module to update the state space for the next warning cycle.
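One way to represent the structured log $Log(t)$ of step E2 is a small record type serialized to JSON. The field names, the JSON format, and the sample values are assumptions; the claim only lists the six components of the tuple.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class AlertLog:
    """Log(t) = <t, Loc(t), C_real(t), F_be(t), T_en(t), Alert(t)>."""
    t: float                               # timestamp
    loc: Tuple[float, float]               # monitoring location coordinates
    scene: str                             # scene classification result
    behavior_features: Sequence[float]     # normalized real-time behavior features
    environment_features: Sequence[float]  # normalized real-time environment features
    alert: Optional[str]                   # "Alert1", "Alert2", "Alert3", or None


def to_json(log):
    """Serialize a log record; the JSON layout is an illustrative choice."""
    return json.dumps(asdict(log), sort_keys=True)


# Hypothetical record.
record = AlertLog(t=1700000000.0, loc=(12.5, 48.0), scene="warehouse",
                  behavior_features=[0.4, 0.9], environment_features=[0.2, 0.7],
                  alert="Alert2")
print(to_json(record))
```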