Campus monitoring behavior recognition method and system based on reasoning enhancement

By introducing an attention caching mechanism and a length extrapolation strategy into the context-aware encoder, the Transformer model is optimized, solving the problems of computational complexity and recognition accuracy in long video streams in campus security monitoring, and achieving efficient and accurate behavior recognition.

CN119785296B · Active · Publication Date: 2026-06-12 · JINAN PRESCHOOL TEACHERS COLLEGE +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JINAN PRESCHOOL TEACHERS COLLEGE
Filing Date
2025-01-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing real-time video analytics technologies suffer from high computational complexity, redundancy, and inefficiency in campus security monitoring. They struggle to handle long video streams and the inconsistency between training and inference data leads to decreased recognition accuracy.

Method used

We employ a context-aware encoder based on reasoning enhancement, introduce an attention caching mechanism and a length extrapolation strategy, and optimize the computational complexity and sequence length processing capability of the Transformer model through a spatiotemporal causal self-attention module and hybrid relative position encoding.

Benefits of technology

It improves the accuracy and robustness of campus security monitoring, reduces computational complexity and training costs, and enhances the efficiency and accuracy of real-time behavior recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a campus monitoring behavior recognition method and system based on reasoning enhancement, belonging to the technical field of campus safety. The method comprises the following steps: acquiring a real-time video stream of campus monitoring; preprocessing the acquired video stream to obtain a spatiotemporal embedding sequence; and recognizing the action behavior in the spatiotemporal embedding sequence by using a context-aware encoder based on reasoning enhancement, wherein an attention caching mechanism and a length extrapolation mechanism are introduced into the context-aware encoder. The application improves computational efficiency by means of the attention caching mechanism, and makes full use of the context information of long sequences by means of the hybrid relative position encoding in the length extrapolation mechanism, so that the accuracy of action recognition is significantly improved.

Description

Technical Field

[0001] This invention belongs to the field of campus security technology, and in particular relates to a campus monitoring behavior recognition method and system based on reasoning enhancement.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] In response to the increasingly complex needs of campus security monitoring, real-time video analytics has become a core research direction for ensuring campus safety. This technology requires models to perform instant analysis based solely on current and historical information, which is crucial for quickly identifying and responding to potential threats. Its applications extend beyond routine security monitoring, playing a key role in risk warning and rapid emergency response. By processing and analyzing video data in real time, this technology can promptly detect abnormal behavior, predict potential risks, and provide rapid decision support in emergency situations, thereby comprehensively improving the efficiency and accuracy of campus security management. This advanced analytical approach provides a smarter and more proactive solution for campus security management, effectively enhancing schools' ability to cope with various security challenges.

[0004] To accurately understand current behavior, the system needs to process and analyze contextual information over a relatively long time span, which inevitably increases computational complexity. Simultaneously, to meet the demands of real-time applications, the system must be able to process continuously input video streams with extremely low latency. These two requirements make the implementation of real-time behavior recognition technology particularly complex.

[0005] Existing methods include those based on RNN architectures, whose sequential processing suits streaming data. On long sequences, however, RNNs are prone to vanishing or exploding gradients, which makes long-range dependencies difficult to capture. Other models are based on the Transformer architecture, which captures long-range dependencies effectively through self-attention and yields significant performance improvements. However, the Transformer is inefficient and redundant when processing video streams. First, the Transformer lacks a memory mechanism, so the interactions between elements within the window must be recalculated at each time step. Since most elements within the window are identical at adjacent time steps, this leads to a large amount of redundant computation. Second, the computational complexity of self-attention is $O(n^2)$: as the sequence length $n$ increases, the computational cost increases quadratically, making extremely long contexts difficult to handle. Furthermore, a discrepancy between the sequence length of the training data and the sequence length during inference can reduce the accuracy of behavior recognition during inference, while training on long sequences increases training costs.

Summary of the Invention

[0006] To overcome the shortcomings of the existing technologies, this invention proposes a campus surveillance behavior recognition method and system based on reasoning enhancement. First, an attention caching mechanism is used to avoid redundant calculations of duplicate elements, reducing the computational complexity to $O(n)$. Second, a length extrapolation strategy is proposed, which enables the model to handle longer sequences during testing than during training, thereby improving performance and reducing the cost of training on long sequences.

[0007] To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions:

[0008] Firstly, a campus surveillance behavior recognition method based on reasoning enhancement is disclosed, including:

[0009] Obtain real-time video streams from campus surveillance cameras;

[0010] The acquired video stream is preprocessed to obtain a spatiotemporal embedding sequence;

[0011] A context-aware encoder based on reasoning enhancement is used to identify actions and behaviors in spatiotemporally embedded sequences;

[0012] The context-aware encoder incorporates an attention caching mechanism and a length extrapolation mechanism.

[0013] Furthermore, the reasoning-enhanced context-aware encoder includes a spatiotemporal causal self-attention module, global average pooling, and an action recognition classifier.

[0014] Furthermore, in the spatiotemporal causal self-attention module, an attention caching mechanism and a length extrapolation mechanism are introduced in each spatiotemporal causal self-attention layer.

[0015] Furthermore, the attention caching mechanism introduces a key cache and a value cache to store the keys and values of previously computed video frames. When a new video frame arrives, the attention caching mechanism updates the cache as follows:

[0016] $K_t = \mathrm{Concat}(K_{t-1}, k_t)$

[0017] $V_t = \mathrm{Concat}(V_{t-1}, v_t)$

[0018] where $K_t$ denotes the key cache; $V_t$ denotes the value cache; $K_{t-1}$ denotes the key cache of the previous frame; $V_{t-1}$ denotes the value cache of the previous frame; $k_t$ denotes the key vector, with $k_t = x_t W_K$; $v_t$ denotes the value vector, with $v_t = x_t W_V$; $x_t$ denotes the feature vector of the current frame; $W_K$ and $W_V$ denote the linear mapping matrices for keys and values, respectively; and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation.

[0019] Furthermore, the length extrapolation mechanism employs hybrid relative position coding, using learnable position coding at close intervals to capture dynamic information of the video sequence; and using linearly decaying position bias at long intervals to give the model extrapolation capability during the inference stage.

[0020] Furthermore, the spatiotemporal embedding sequence is input into the spatiotemporal causal self-attention module to obtain the enhanced spatiotemporal sequence. The feature map of the last frame in the enhanced spatiotemporal sequence is extracted, processed by global average pooling, and then fed into the classifier to obtain the action category of the current frame.

[0021] Secondly, a campus surveillance behavior recognition system based on reasoning enhancement is disclosed, including:

[0022] The acquisition module is configured to acquire real-time video streams from campus surveillance cameras.

[0023] The preprocessing module is configured to preprocess the acquired video stream to obtain a spatiotemporal embedding sequence;

[0024] The recognition module is configured to use a context-aware encoder based on reasoning enhancement to recognize actions and behaviors in spatiotemporally embedded sequences.

[0025] The context-aware encoder incorporates an attention caching mechanism and a length extrapolation mechanism.

[0026] Thirdly, an electronic device is disclosed, including a memory and a processor, as well as computer instructions stored in the memory and running on the processor, wherein the computer instructions, when executed by the processor, complete the steps of the above-mentioned reasoning-enhanced campus monitoring behavior recognition method.

[0027] Fourthly, a computer-readable storage medium is disclosed for storing computer instructions, which, when executed by a processor, complete the steps of the aforementioned reasoning-enhanced campus surveillance behavior recognition method.

[0028] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0029] This invention provides a campus surveillance behavior recognition method based on reasoning enhancement. By leveraging the architectural advantages of Transformer, it effectively handles long-distance dependencies, improves the accuracy of prediction and the robustness of recognition, thereby more efficiently ensuring campus security.

[0030] This invention proposes an inference enhancement technique that improves the inference speed and accuracy of a model without incurring additional training costs. The technique comprises a cached attention mechanism and a length extrapolation strategy. The cached attention mechanism avoids redundant attention calculations by saving previously computed keys and values, reducing the computational complexity to $O(n)$ and thereby improving the speed of real-time recognition. The length extrapolation strategy, through hybrid relative position encoding (HRPE), enables the model to handle longer sequences in the inference phase of behavior recognition than in the training phase, thereby improving monitoring performance and recognition accuracy while reducing the cost of training on long sequences.

[0031] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0032] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0033] Figure 1 is a schematic diagram of the campus surveillance behavior recognition method based on reasoning enhancement described in Embodiment 1 of the present invention.

Detailed Implementation

[0034] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0035] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.

[0036] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0037] Example 1

[0038] In one or more embodiments, a campus surveillance behavior recognition method based on reasoning enhancement is disclosed, comprising the following steps:

[0039] Step 1: Obtain real-time video stream from campus surveillance.

[0040] Step 2: Preprocess the acquired video stream to obtain a spatiotemporal embedding sequence.

[0041] Step 2-1: For a given video stream $F = \{f_1, \dots, f_t\}$, the goal of online action detection is to identify, in real time, the action occurring in the latest video frame $f_t$, where $t$ denotes the number of video frames observed so far. Note that any future information (frames after $f_t$) is unavailable.

[0042] Step 2-2: Use patch embedding to divide each frame of the video stream into multiple small blocks, and use linear transformation to map each small block into a fixed-dimensional vector, thereby converting a single frame image into an embedding sequence that can be processed by the Transformer model.

[0043] Step 2-3: Stack the embeddings of the multiple frames together to obtain a spatiotemporal embedding sequence $X$ containing $T \times N$ embeddings, where $T$ denotes the length of the stream buffer and $N$ denotes the number of embeddings per frame.
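As an illustrative, non-limiting sketch of Steps 2-2 and 2-3, the following NumPy code splits each frame into non-overlapping blocks, applies a linear mapping to each block, and stacks the per-frame results. The function names `patch_embed` and `embed_stream`, the randomly initialized projection `weight`, and all sizes are assumptions made for illustration only and are not taken from the embodiment.

```python
import numpy as np

def patch_embed(frame, patch, weight, bias):
    """Split one H x W x C frame into non-overlapping patches and project each to d dims."""
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0, "frame size must be divisible by patch size"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    blocks = frame.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return blocks @ weight + bias  # (N, d)

def embed_stream(frames, patch, weight, bias):
    """Stack per-frame embeddings into a (T, N, d) spatiotemporal sequence."""
    return np.stack([patch_embed(f, patch, weight, bias) for f in frames])

rng = np.random.default_rng(0)
T, H, W, C, P, D = 4, 32, 32, 3, 16, 8  # hypothetical sizes
frames = rng.random((T, H, W, C))
weight = rng.normal(size=(P * P * C, D))  # stand-in for a learned projection
bias = np.zeros(D)

X = embed_stream(frames, P, weight, bias)
print(X.shape)  # (4, 4, 8): T frames, N = (32/16)^2 embeddings per frame, d = 8
```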

[0044] The spatiotemporal embedding sequence obtained through block embedding operations can efficiently extract local features, reduce computational complexity, and adapt to dynamic, multi-scene, and multi-view tasks.

[0045] In specific applications of campus surveillance, it not only improves the efficiency of detecting abnormal behavior, but also helps the model adapt flexibly to changing environments, thereby significantly improving overall performance and practicality.

[0046] Step 3: Employ an inference-enhanced context-aware encoder (CAE) to capture contextual information from cached video frames, accurately identifying the current action. This includes:

[0047] The context-aware encoder maintains a fixed-length queue as a stream buffer to cache the video stream and dynamically updates it as new video frames arrive. During each update, the CAE applies a spatiotemporal attention mechanism to the video frames in the stream buffer to capture contextual information and identify the current action.
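The fixed-length queue described above might be sketched as follows. This is a minimal illustration that assumes a Python `collections.deque` with `maxlen` as the queue; the class name `StreamBuffer` and the string placeholders for frame embeddings are hypothetical.

```python
from collections import deque

class StreamBuffer:
    """Fixed-length FIFO of the most recent frame embeddings (hypothetical helper)."""

    def __init__(self, max_frames):
        # A deque with maxlen silently discards the oldest item once it is full.
        self.frames = deque(maxlen=max_frames)

    def push(self, frame_embedding):
        """Add the newest frame and return the current window, oldest first."""
        self.frames.append(frame_embedding)
        return list(self.frames)

buf = StreamBuffer(max_frames=3)
for t in range(5):
    window = buf.push(f"frame_{t}")
print(window)  # ['frame_2', 'frame_3', 'frame_4']
```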

[0048] Step 3-1: Construct a spatiotemporal causal self-attention module, in which an attention caching mechanism and a length extrapolation mechanism are introduced in each spatiotemporal causal self-attention layer.

[0049] Step 3-1-1: Stack multiple spatiotemporal causal self-attention layers and add a causal mask to the attention matrix of each layer to ensure that the embedding at each time step only focuses on the current and previous time steps when modeling the sequence context.
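The causal mask of Step 3-1-1 can be illustrated with the sketch below. It assumes that the mask is applied at frame granularity, so that embeddings within the same frame may attend to one another, and that it is additive (0 for allowed pairs, negative infinity for blocked pairs). Both choices, and the name `frame_causal_mask`, are assumptions for illustration.

```python
import numpy as np

def frame_causal_mask(num_frames, tokens_per_frame):
    """Additive mask: token i may attend to token j only if frame(j) <= frame(i)."""
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    allowed = frame_id[:, None] >= frame_id[None, :]
    return np.where(allowed, 0.0, -np.inf)

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask)
```

Adding this mask to the attention scores before normalization removes any contribution from future frames.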

[0050] Step 3-1-2: Introduce an attention caching mechanism (CAM) in each spatiotemporal causal self-attention layer.

[0051] The context-aware encoder captures the spatiotemporal dependencies of the video frames within the buffer using spatiotemporal causal self-attention layers. However, because adjacent windows share overlapping frames, the interactions between the elements of those frames are recalculated at every step, which causes a large amount of redundant computation.

[0052] To improve inference efficiency, an attention caching mechanism is introduced in each spatiotemporal causal self-attention layer of the CAE: a key cache $K_t$ and a value cache $V_t$ store the keys and values of the previously computed video frames. When a new video frame arrives, CAM updates the cache according to the following formulas:

[0053] $K_t = \mathrm{Concat}(K_{t-1}, k_t)$ (1)

[0054] $V_t = \mathrm{Concat}(V_{t-1}, v_t)$ (2)

[0055] where $K_t$ denotes the key cache; $V_t$ denotes the value cache; $K_{t-1}$ denotes the key cache of the previous frame; $V_{t-1}$ denotes the value cache of the previous frame; $k_t$ denotes the key vector, with $k_t = x_t W_K$; $v_t$ denotes the value vector, with $v_t = x_t W_V$; $x_t$ denotes the feature vector of the current frame; and $W_K$ and $W_V$ denote the linear mapping matrices for keys and values, respectively. $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, which in formulas (1) and (2) concatenates the key/value cache from the previous time step with the key/value of the current time step, thereby updating the key/value cache.
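The cache update of formulas (1) and (2) might look as follows in NumPy. This is a sketch under stated assumptions: the current frame is represented by an `(N, d)` array of embeddings, the caches start empty, and the projection matrices are random stand-ins for learned weights. The function name `update_cache` is hypothetical.

```python
import numpy as np

def update_cache(k_cache, v_cache, x_t, w_k, w_v):
    """Append the keys and values of the newest frame to the caches."""
    k_t = x_t @ w_k  # keys of the current frame
    v_t = x_t @ w_v  # values of the current frame
    new_k = np.concatenate([k_cache, k_t], axis=0)
    new_v = np.concatenate([v_cache, v_t], axis=0)
    return new_k, new_v

rng = np.random.default_rng(0)
N, D = 4, 8  # hypothetical embeddings per frame and feature dimension
w_k = rng.normal(size=(D, D))
w_v = rng.normal(size=(D, D))

k_cache = np.empty((0, D))
v_cache = np.empty((0, D))
for _ in range(3):  # three frames arrive one after another
    x_t = rng.normal(size=(N, D))
    k_cache, v_cache = update_cache(k_cache, v_cache, x_t, w_k, w_v)

print(k_cache.shape, v_cache.shape)  # (12, 8) (12, 8)
```

Only the newest frame is projected at each step; the keys and values of earlier frames are reused unchanged.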

[0056] In this way, each layer only needs to calculate the relevant features for the current segment, as shown in the following formula:

[0057] $A_t = \dfrac{q_t K_t^{\top}}{\sqrt{d}}$ (3)

[0058] where $A_t$ denotes the attention matrix; $q_t$ denotes the query vector, with $q_t = x_t W_Q$, $x_t$ being the feature vector of the current frame; $W_Q$ denotes the linear mapping matrix for the query; $K_t$ denotes the key cache; and $\sqrt{d}$ denotes the normalization parameter determined by the feature dimension $d$. Formula (3) computes the attention scores over all elements.

[0059] Through the key-value caching mechanism, the original spatiotemporal causal self-attention is simplified to cross-attention between the elements of a single time step and the cached elements of all time steps: whenever a new video frame arrives, attention only needs to be computed between the queries of the new frame and the key-value pairs in the cache. This reduces the computational complexity from $O(n^2)$ to $O(n)$, effectively reducing redundant calculations and significantly improving the efficiency of spatiotemporal reasoning.
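The equivalence between cached cross-attention and full causal self-attention can be checked numerically for a single layer. The sketch below is illustrative only: it assumes standard softmax normalization of the scaled scores, a frame-level causal mask, random weights, and hypothetical sizes; none of these details are specified by the embodiment.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, N, D = 5, 3, 8  # hypothetical: frames, embeddings per frame, feature dimension
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
frames = rng.normal(size=(T, N, D))

# Baseline: recompute causal self-attention over the whole window.
x = frames.reshape(T * N, D)
frame_id = np.repeat(np.arange(T), N)
scores = (x @ w_q) @ (x @ w_k).T / np.sqrt(D)
scores = np.where(frame_id[:, None] >= frame_id[None, :], scores, -np.inf)
full_out = softmax(scores) @ (x @ w_v)

# Cached variant: each new frame only queries the cached keys and values.
k_cache = np.empty((0, D))
v_cache = np.empty((0, D))
for x_t in frames:
    k_cache = np.concatenate([k_cache, x_t @ w_k])
    v_cache = np.concatenate([v_cache, x_t @ w_v])
    cached_out = softmax((x_t @ w_q) @ k_cache.T / np.sqrt(D)) @ v_cache

# The output for the newest frame is identical in both variants.
print(np.allclose(full_out[-N:], cached_out))  # True
```

In this sketch the baseline scores a `(T*N) x (T*N)` matrix at every step, whereas the cached variant scores only an `N x (T*N)` block for the newest frame, which is the source of the linear cost per step.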

[0060] Step 3-1-3: Design a Length Extrapolation Mechanism (LEM) to overcome the difference in length distribution between the training and testing phases through positional encoding, enabling models trained on short sequences to utilize longer contexts during the testing phase, thereby improving recognition accuracy.

[0061] The attention score between the element at position $i$ and the element at position $j$ in the video frames can be expressed as:

[0062] $A_{i,j} = \dfrac{q_i k_j^{\top}}{\sqrt{d}} + b_{i,j}$ (4)

[0063] where $q_i k_j^{\top}$ denotes the dot product between the query at position $i$ and the key at position $j$, used to calculate the attention score; $b_{i,j}$ is the position bias term; and $\sqrt{d}$ denotes the normalization parameter determined by the feature dimension $d$.

[0064] Then, a hybrid relative position encoding (HRPE) is designed. In close-range intervals ($i - j \le n$), a learnable positional encoding ($q_i r_{i-j}^{\top}$) is used to capture dynamic information of the video sequence; in long-range intervals ($i - j > n$), a linearly decaying position bias is used to give the model extrapolation capability during the inference phase. The formula is as follows:

[0065] $b_{i,j} = \begin{cases} q_i r_{i-j}^{\top}, & i - j \le n \\ -m\,(i - j) + \beta, & i - j > n \end{cases}$ (5)

[0066] where $b_{i,j}$ denotes the position bias of the hybrid positional encoding for positions $i$ and $j$; $q_i$ denotes the query at position $i$; $r_{i-j}$ denotes the positional encoding for relative position $i - j$; $m$ denotes the linear decay slope; and $\beta$ denotes the offset of the linearly decaying bias, used to ensure a smooth transition at the junction point $i - j = n$, that is, $\beta = q_i r_n^{\top} + m n$.
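One possible reading of the hybrid relative position encoding is sketched below. It assumes that the learnable term is the dot product of the query with a row of a learnable table indexed by relative distance, that the long-range term has slope `-slope`, and that the offset is chosen so the two regimes agree at the junction distance `n`. The function name `hybrid_bias`, the table `rel_table`, and all numeric values are hypothetical.

```python
import numpy as np

def hybrid_bias(q_i, rel_table, slope, distances):
    """Position bias for one query over the given relative distances (i - j >= 0)."""
    n = rel_table.shape[0] - 1     # junction distance (assumed inclusive)
    learned = rel_table @ q_i      # learnable bias for distances 0..n
    beta = learned[n] + slope * n  # assumed offset: both regimes meet at distance n
    d = np.asarray(distances)
    near = learned[np.minimum(d, n)]  # clipped index keeps the lookup in range
    far = -slope * d + beta
    return np.where(d <= n, near, far)

rng = np.random.default_rng(0)
D, n, m = 8, 4, 0.5
q_i = rng.normal(size=D)
rel_table = rng.normal(size=(n + 1, D))  # stand-in for learned encodings

train_bias = hybrid_bias(q_i, rel_table, m, np.arange(8))   # training-length window
test_bias = hybrid_bias(q_i, rel_table, m, np.arange(16))   # longer window at inference
print(test_bias.round(3))
```

Because distances beyond `n` depend only on the slope and the offset, the same parameters yield biases for a window longer than any seen during training.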

[0067] During the training phase, the CAE takes a stream buffer of length $T$ ($T > n$) as input. During the testing phase, the cache size in the attention caching mechanism (that is, the size of the key cache and the value cache) is enlarged, extending the time span of the stream buffer beyond $T$. At this point, the linear decay interval of HRPE generates the position biases for the extended time segment. This expanded historical information provides a broader context for actions, thereby improving the robustness of predictions.

[0068] Step 3-2: Input the spatiotemporal embedding sequence into the spatiotemporal causal self-attention module to obtain the enhanced spatiotemporal sequence, take the feature map of its last frame, apply global average pooling, and feed the result into a classifier to obtain the action category of the current frame.

[0069] In this embodiment, the classifier is a neural network including a fully connected layer and a Softmax activation function, which improves the detection efficiency of abnormal behavior under campus monitoring.
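The recognition head of Step 3-2 might be sketched as follows, assuming the enhanced sequence has shape `(T, N, d)` and the classifier is a single fully connected layer followed by Softmax. The function name `classify_last_frame`, the random weights, and the number of classes are illustrative assumptions.

```python
import numpy as np

def classify_last_frame(enhanced, w_fc, b_fc):
    """Pool the last frame's feature map and classify it with FC + Softmax."""
    last = enhanced[-1]          # feature map of the last frame, shape (N, d)
    pooled = last.mean(axis=0)   # global average pooling over the N embeddings
    logits = pooled @ w_fc + b_fc
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
T, N, D, NUM_CLASSES = 5, 4, 8, 3  # hypothetical sizes
enhanced = rng.normal(size=(T, N, D))
w_fc = rng.normal(size=(D, NUM_CLASSES))  # stand-in for learned weights
b_fc = np.zeros(NUM_CLASSES)

label, probs = classify_last_frame(enhanced, w_fc, b_fc)
print(label, probs.round(3))
```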

[0070] Example 2

[0071] In one or more embodiments, a campus surveillance behavior recognition system based on reasoning enhancement is disclosed, comprising:

[0072] The acquisition module is configured to acquire real-time video streams from campus surveillance cameras.

[0073] The preprocessing module is configured to preprocess the acquired video stream to obtain a spatiotemporal embedding sequence;

[0074] The recognition module is configured to use a context-aware encoder based on reasoning enhancement to recognize actions and behaviors in spatiotemporally embedded sequences.

[0075] The context-aware encoder incorporates an attention caching mechanism and a length extrapolation mechanism in its spatiotemporal causal self-attention layer.

[0076] Example 3

[0077] This embodiment provides an electronic device, including a memory and a processor, as well as computer instructions stored in the memory and running on the processor. When the computer instructions are executed by the processor, they complete the steps of the above-described reasoning-enhanced campus monitoring behavior recognition method.

[0078] Example 4

[0079] This embodiment provides a computer-readable storage medium for storing computer instructions, which, when executed by a processor, complete the steps of the above-described reasoning-enhanced campus monitoring behavior recognition method.

[0080] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0081] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0082] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, whereby a series of operational steps are performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0083] The descriptions of each embodiment in the above embodiments have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0084] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A campus surveillance behavior recognition method based on reasoning enhancement, characterized in that it comprises:
obtaining real-time video streams from campus surveillance cameras;
preprocessing the acquired video stream to obtain a spatiotemporal embedding sequence;
using a context-aware encoder based on reasoning enhancement to recognize action behaviors in the spatiotemporal embedding sequence;
wherein the context-aware encoder incorporates an attention caching mechanism and a length extrapolation mechanism;
the context-aware encoder based on reasoning enhancement includes a spatiotemporal causal self-attention module, global average pooling, and an action recognition classifier; in the spatiotemporal causal self-attention module, the attention caching mechanism and the length extrapolation mechanism are introduced in each spatiotemporal causal self-attention layer;
the attention caching mechanism introduces a key cache and a value cache to store the keys and values of previously computed video frames, and when a new video frame arrives, the attention caching mechanism updates the cache according to: $K_t = \mathrm{Concat}(K_{t-1}, k_t)$; $V_t = \mathrm{Concat}(V_{t-1}, v_t)$; where $K_t$ denotes the key cache; $V_t$ denotes the value cache; $K_{t-1}$ denotes the key cache of the previous frame; $V_{t-1}$ denotes the value cache of the previous frame; $k_t$ denotes the key vector, with $k_t = x_t W_K$; $v_t$ denotes the value vector, with $v_t = x_t W_V$; $x_t$ denotes the feature vector of the current frame; $W_K$ and $W_V$ denote the linear mapping matrices for keys and values, respectively; and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation;
the length extrapolation mechanism is designed to overcome, through positional encoding, the difference in length distribution between the training and testing phases, enabling models trained on short sequences to utilize longer contexts during the testing phase;
the attention score between the element at position $i$ and the element at position $j$ in the video frames can be expressed as: $A_{i,j} = \dfrac{q_i k_j^{\top}}{\sqrt{d}} + b_{i,j}$; where $q_i k_j^{\top}$ denotes the dot product between the query at position $i$ and the key at position $j$, used to calculate the attention score; $b_{i,j}$ is the position bias term; and $\sqrt{d}$ denotes the normalization parameter determined by the feature dimension $d$;
a hybrid relative position encoding is designed, in which a learnable positional encoding ($q_i r_{i-j}^{\top}$) is used in close-range intervals ($i - j \le n$) to capture dynamic information of the video sequence, and a linearly decaying position bias is used in long-range intervals ($i - j > n$) to give the model extrapolation capability during the inference phase, as shown in the following formula: $b_{i,j} = \begin{cases} q_i r_{i-j}^{\top}, & i - j \le n \\ -m\,(i - j) + \beta, & i - j > n \end{cases}$; where $b_{i,j}$ denotes the position bias of the hybrid positional encoding for positions $i$ and $j$; $q_i$ denotes the query at position $i$; $r_{i-j}$ denotes the positional encoding for relative position $i - j$; $m$ denotes the linear decay slope; and $\beta$ denotes the offset of the linearly decaying bias, used to ensure a smooth transition at the junction point $i - j = n$, that is, $\beta = q_i r_n^{\top} + m n$.

2. The campus surveillance behavior recognition method based on reasoning enhancement as described in claim 1, characterized in that the preprocessing to obtain the spatiotemporal embedding sequence specifically comprises: using block embedding to divide each frame of the video stream into multiple small blocks, and mapping each small block to a fixed-dimensional vector through a linear transformation; and stacking the multiple frames together to obtain a spatiotemporal embedding sequence $X$ containing $T \times N$ embeddings, where $T$ denotes the length of the stream buffer and $N$ denotes the number of embeddings per frame.

3. The campus surveillance behavior recognition method based on reasoning enhancement as described in claim 1, characterized in that the length extrapolation mechanism employs hybrid relative position encoding, using learnable positional encoding at close-range intervals to capture dynamic information of the video sequence, and using a linearly decaying position bias at long-range intervals to give the model extrapolation capability during the inference stage.

4. The campus surveillance behavior recognition method based on reasoning enhancement as described in claim 1, characterized in that the spatiotemporal embedding sequence is input into the spatiotemporal causal self-attention module to obtain the enhanced spatiotemporal sequence; the feature map of the last frame in the enhanced spatiotemporal sequence is extracted and, after global average pooling, fed into the classifier to obtain the action category of the current frame.

5. A campus surveillance behavior recognition system based on reasoning enhancement, characterized in that it comprises:
an acquisition module configured to acquire real-time video streams from campus surveillance cameras;
a preprocessing module configured to preprocess the acquired video stream to obtain a spatiotemporal embedding sequence;
a recognition module configured to use a context-aware encoder based on reasoning enhancement to recognize action behaviors in the spatiotemporal embedding sequence;
wherein the context-aware encoder incorporates an attention caching mechanism and a length extrapolation mechanism;
the context-aware encoder based on reasoning enhancement includes a spatiotemporal causal self-attention module, global average pooling, and an action recognition classifier; in the spatiotemporal causal self-attention module, the attention caching mechanism and the length extrapolation mechanism are introduced in each spatiotemporal causal self-attention layer;
the attention caching mechanism introduces a key cache and a value cache to store the keys and values of previously computed video frames, and when a new video frame arrives, the attention caching mechanism updates the cache according to: $K_t = \mathrm{Concat}(K_{t-1}, k_t)$; $V_t = \mathrm{Concat}(V_{t-1}, v_t)$; where $K_t$ denotes the key cache; $V_t$ denotes the value cache; $K_{t-1}$ denotes the key cache of the previous frame; $V_{t-1}$ denotes the value cache of the previous frame; $k_t$ denotes the key vector, with $k_t = x_t W_K$; $v_t$ denotes the value vector, with $v_t = x_t W_V$; $x_t$ denotes the feature vector of the current frame; $W_K$ and $W_V$ denote the linear mapping matrices for keys and values, respectively; and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation;
the length extrapolation mechanism is designed to overcome, through positional encoding, the difference in length distribution between the training and testing phases, enabling models trained on short sequences to utilize longer contexts during the testing phase;
the attention score between the element at position $i$ and the element at position $j$ in the video frames can be expressed as: $A_{i,j} = \dfrac{q_i k_j^{\top}}{\sqrt{d}} + b_{i,j}$; where $q_i k_j^{\top}$ denotes the dot product between the query at position $i$ and the key at position $j$, used to calculate the attention score; $b_{i,j}$ is the position bias term; and $\sqrt{d}$ denotes the normalization parameter determined by the feature dimension $d$;
a hybrid relative position encoding is designed, in which a learnable positional encoding ($q_i r_{i-j}^{\top}$) is used in close-range intervals ($i - j \le n$) to capture dynamic information of the video sequence, and a linearly decaying position bias is used in long-range intervals ($i - j > n$) to give the model extrapolation capability during the inference phase, as shown in the following formula: $b_{i,j} = \begin{cases} q_i r_{i-j}^{\top}, & i - j \le n \\ -m\,(i - j) + \beta, & i - j > n \end{cases}$; where $b_{i,j}$ denotes the position bias of the hybrid positional encoding for positions $i$ and $j$; $q_i$ denotes the query at position $i$; $r_{i-j}$ denotes the positional encoding for relative position $i - j$; $m$ denotes the linear decay slope; and $\beta$ denotes the offset of the linearly decaying bias, used to ensure a smooth transition at the junction point $i - j = n$, that is, $\beta = q_i r_n^{\top} + m n$.

6. An electronic device, characterized in that it comprises a memory and a processor, as well as computer instructions stored in the memory and running on the processor, which, when executed by the processor, perform the campus surveillance behavior recognition method based on reasoning enhancement as described in any one of claims 1-4.

7. A computer-readable storage medium, characterized in that it is used to store computer instructions which, when executed by a processor, perform the campus surveillance behavior recognition method based on reasoning enhancement as described in any one of claims 1-4.