System and method for scalable complex event patterns for video analytics

By compiling queries into NFAs and utilizing surrogate models and adaptive event instantiation techniques, the high cost and resource waste of complex event pattern matching in video analytics systems are solved, achieving efficient and accurate processing of complex event patterns and significantly improving system throughput and resource utilization efficiency.

CN122295705A · Pending · Publication Date: 2026-06-26 · CENT FOR PERCEPTUAL & INTERACTIVE INTELLIGENCE (CPII) LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CENT FOR PERCEPTUAL & INTERACTIVE INTELLIGENCE (CPII) LTD
Filing Date
2025-02-17
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing video analytics systems incur high processing costs and waste resources when matching complex event patterns: GPU resources are spent localizing and processing events that never contribute to a match.

Method used

The query compiler is used to convert the query into a nondeterministic finite automaton (NFA), the surrogate model is used to predict the surrogate score, the probabilistic pattern matching module (PPAT) filters out windows with a high probability of mismatch, and the adaptive event instantiation module (AEM) adaptively instantiates the events in the window until it is determined whether there is a match or not, and then returns the window that is guaranteed to match.

Benefits of technology

The approach reduces processing cost, improves processing speed, matches complex event patterns efficiently and accurately, reduces wasted GPU resources, and increases throughput by 2.3 to 5.7 times while maintaining accuracy.



Abstract

This invention relates to a system for identifying complex event patterns in real-time video analytics data input, the system comprising a query compiler, a surrogate model, a first inference module, and a second inference module. The invention also relates to a method for identifying complex event patterns in real-time video analytics data input, comprising: converting a query into a nondeterministic finite automaton using a query compiler; predicting a surrogate score using a surrogate model; estimating the probability that a window matches a query pattern; filtering out windows unlikely to match the query pattern; adaptively instantiating events within the window until it can be determined whether a match is guaranteed or impossible; and returning a window guaranteed to contain a match.

Description

Technical Field

[0001] This invention relates to the field of real-time video analytics, and in particular to systems and methods for identifying complex event patterns in video data. It relates to the optimization of pattern recognition and reasoning processes in video streams, and to probabilistic pattern matching and sequential processing techniques.

Background Technology

[0002] The ability to process and analyze digital events (including but not limited to video streams) in real time has become extremely important, as it facilitates rapid decision-making and intelligent automation. However, despite recent efforts, existing deep video analytics systems primarily support queries for simple events, such as selecting frames containing a red car or calculating the average number of cars in a video. Because video streams are temporal, queries that match event patterns can support a wider range of use cases and applications. For example, in traffic analysis, right turns on red are permitted in many places, but it is a traffic violation if the driver does not come to a complete stop before turning. While this "rolling stop followed by a right turn" behavior is not uncommon, it poses a significant risk to pedestrians on the right-hand sidewalk. Detecting this violation requires identifying both the "rolling stop" and "right turn" events and matching them against a temporal "followed by" pattern. In sports analytics, coaches monitor the game and adjust tactics based on the situation on the court. For example, in basketball, the "pick and roll" is a common offensive tactic used to create scoring opportunities. Player A sets a "screen" on a defender, creating an opportunity for (1) teammate B to take an open shot or (2) player A to move toward the basket ("roll") and receive a pass. To identify this tactic, all "screen," "roll," "pass," and "shoot" events must be extracted and matched against specific patterns within a specific time window (e.g., 5 seconds). In shopping behavior analysis, the way goods are displayed in a store can significantly influence customers' purchasing decisions. Therefore, shopping behavior must be analyzed to understand customer preferences and develop strategies to boost sales. For example, if a large number of customers are observed heading to the checkout area right after picking up beer, placing the beer area farther from the checkout counter could increase customers' shopping time. Such behavior cannot be identified using traditional techniques that mine transaction data, such as association rules, because many interesting events occur before customers check out. Instead, these behaviors can be identified by running various pattern matching queries over large amounts of video.

[0003] To match complex patterns of different events in a video, a deep understanding of the video content is essential. Although recent multimodal large language models (MLLMs) such as GPT-4, Gemini, LLaMA-VID, and LLaVA have made some progress, their pattern matching capabilities remain rudimentary. More importantly, these models treat every frame of the video of interest as a token when processing each query. In other words, every video frame must pass through a massive model with billions of parameters, resulting in exceptionally high processing cost and time.

[0004] Alternatively, one could rely on a unimodal model and fast selection techniques (such as Zeus and probabilistic predicates (PP)) to first localize all individual events of interest (e.g., "rolling stop", "right turn"). A modern Complex Event Processing (CEP) engine could then be used as a post-processing step to match the query pattern. However, this "localize everything, then CEP" approach localizes many unnecessary events that ultimately do not match the query pattern, wasting valuable graphics processing unit (GPU) resources. Another approach is to directly train or fine-tune a unimodal binary classification model for each specific query pattern. However, this approach does not scale, especially when there are a large number of potential query patterns to consider.

[0005] In scalable video analytics, the database community has been actively developing solutions to reduce processing costs and improve efficiency. However, existing work in this field has not solved the challenge of handling pattern-matching queries. Instead, it has focused mainly on simpler queries such as selection, top-k, and aggregation. Most of these works employ cascade methods to accelerate processing: a lightweight but less accurate surrogate model first runs a quick inference pass over the data, and the approximate results then guide the search or sampling process. A more accurate but more expensive oracle model is used only when necessary.

[0006] For example, U.S. patent application number 10733457 may employ artificial intelligence and deep learning mechanisms to determine a subject's intent based on real-time video feeds. The pre-trained intent model in this patent is trained to predict possible state action changes based on the probability of the subject's behavior, using predetermined parameters and their corresponding scores. However, this model may exhaust GPU resources in determining the probability of the subject's behavior. Furthermore, it has not disclosed the ability to reduce mismatched queries, thus reducing its effectiveness in determining subject behavior.

[0007] A U.S. patent application (application number 9292493) discloses a system for detecting deceptive behavior in digitally expressed human communication. This system uses a computer to classify text input as genuine or deceptive, combining psycholinguistic cue analysis with statistical analysis and modeling. The patent discloses tokenization, stemming, pruning, and no-punctuation (NOP) steps performed in the model. These are preprocessing steps used to remove words that appear only once in the dataset. However, when processing each query, such models may treat every frame in the video of interest as a token, so each video frame must pass through a large model with billions of parameters, resulting in exceptionally high processing costs or times.

[0008] International patent application WO2022183138A2 may disclose a method for automatically enhancing the sentiment cognition of natural language content through processing circuitry. This method incorporates a classification model that categorizes sentiment, connotation, or lexical strength rather than using probabilities. While the system includes a preprocessing engine that transforms text data into a form matching the rules required for further analysis by a natural language rules engine and a multimedia classification engine, it may lack the scalability needed to achieve high accuracy at a lower cost.

[0009] Furthermore, Chinese patent application CN112269805B may disclose a data processing method, data processing apparatus, electronic device, and computer-readable storage medium configured to perform image processing on tagged customer group data and determine target customer group data for application in refined marketing. It employs an algorithm capable of segmenting patterns, which can determine a minimum time window for an efficient pattern matching system. While this algorithm can greatly aid in matching pattern queries, the approach may localize many unnecessary events that ultimately do not match the query pattern, thus wasting valuable GPU resources.

[0010] Therefore, there are still some challenges to be addressed in order to efficiently and accurately process the large amounts of data required for pattern matching queries without wasting valuable GPU resources.

Summary of the Invention

[0011] One object of this invention is to provide a scalable complex event processing system and method, including but not limited to video processing, that reduces cost or shortens processing time. Another object of this invention is to provide a system and method for scalable complex event processing, including but not limited to video processing, that avoids localizing unnecessary events that ultimately do not match the query pattern and waste valuable GPU resources.

[0012] Another object of the present invention is to provide a scalable complex event processing system and method, including but not limited to video processing, capable of handling complex queries to match complex event patterns, wherein multiple events occur sequentially.

[0013] To achieve at least the above-mentioned main objects, the technical solution adopted by the present invention is a system for identifying complex event patterns in real-time video analytics data input, the system comprising a query compiler, a surrogate model, a first inference module, and a second inference module.

[0014] These objectives can also be achieved through a method for identifying complex event patterns in real-time video analytics data input, including using a query compiler to convert queries into nondeterministic finite automata; using a surrogate model to predict surrogate scores; estimating the probability that a window matches a query pattern; filtering out windows that are unlikely to match the query pattern; adaptively instantiating events within the window until it can be determined whether a match is guaranteed or impossible; and returning the window that is guaranteed to contain a match.

Attached Figure Description

[0015] The features of the invention will be more readily understood and recognized when reading the following detailed description in conjunction with the accompanying drawings of preferred embodiments of the invention, wherein:

[0016] Figure 1 shows examples of two different window configurations;

[0017] Figure 2 shows the nondeterministic finite automaton (NFA) compiled from query Q1;

[0018] Figure 3 shows an overview of the invention;

[0019] Figure 4(a) shows a window consisting of five segments;

[0020] Figure 4(b) shows the deterministic finite automaton (DFA) derived from the NFA of query Q2;

[0021] Figure 4(c) shows the non-homogeneous Markov chain constructed for query Q2;

[0022] Figure 5 shows an implementation scheme for calculating all possible events for a specific segment;

[0023] Figure 6 shows the multi-window optimization;

[0024] Figure 7 shows the performance of the present invention and the video processing model in terms of throughput and relative accuracy;

[0025] Figure 8 shows the percentage of input video frames that are ultimately instantiated by the expensive oracle model in each query;

[0026] Figure 9 shows the processing time breakdown for each query in the present invention;

[0027] Figure 10 shows the average throughput and relative accuracy of the present invention when all queries have different F1 accuracy targets;

[0028] Figure 11 shows the throughput and accuracy of the present invention for different pattern lengths in query Q4;

[0029] Figure 12 shows the average throughput and accuracy of the present invention for all queries, with window sizes ranging from 30 to 300 segments;

[0030] Figure 13 shows the average throughput and accuracy of the present invention under different adaptive event instantiation (AEM) strategies.

Detailed Implementation

[0031] Specific embodiments of the present invention are disclosed herein as requested. However, it should be understood that the disclosed embodiments are merely examples of the invention, which may be implemented in many different forms. Therefore, the specific structural and functional details disclosed herein should not be construed as limiting, but rather serve as the basis for the claims. It should be understood that the accompanying drawings and their detailed description are not intended to limit the invention to the specific forms disclosed herein; rather, the invention covers all modifications, equivalents, and alternatives falling within the scope defined by the claims. As used throughout this application, the word “may” indicates optional (i.e., possible) rather than mandatory (i.e., required). Similarly, the words “comprising” and “including” mean including but not limited to. Furthermore, unless otherwise stated, the word “a” means “at least one” and the word “a plurality” means one or more. When using abbreviations or technical terms, these refer to their generally accepted meanings known in the art.

[0032] The invention is also referred to as "Bobsled"; the two terms are used interchangeably throughout the specification.

[0033] The term "data input" refers to the time window or relevant segment of the video stream being decoded.

[0034] The term "first inference module" refers to the Probabilistic Pattern Matching Module (PPAT), and "second inference module" refers to the Adaptive Event Instantiation Module (AEM).

[0035] In one embodiment of the present invention, a system for identifying complex event patterns in real-time video analytics data input is provided, characterized in that it includes: a query compiler; a surrogate model; a first inference module; and a second inference module.

[0036] In one embodiment of the invention, the query compiler is configured to automatically infer the shortest duration of all events in the query for which no minimum duration is specified, and to convert the query into a nondeterministic finite automaton (NFA).

[0037] In one embodiment of the invention, the surrogate model is configured to predict surrogate scores for the data input, which the first inference module uses to estimate the probability that a window matches the query pattern.

[0038] In one embodiment of the present invention, the first inference module is a probabilistic pattern matching (PPAT) module configured to filter out windows that are highly likely not to match the query pattern.

[0039] In one embodiment of the present invention, the first inference module is a discrete-time non-homogeneous Markov chain (NHMC) with the ability to process non-homogeneous data input.

[0040] In one embodiment of the present invention, the first inference module estimates the probability of a window matching the query pattern based on the nondeterministic finite automaton (NFA) provided by the query compiler and the surrogate score predicted by the surrogate model, thereby filtering out windows that are highly likely not to match the query pattern.

[0041] In one embodiment of the invention, the second inference module is an adaptive event instantiation (AEM) module, configured to adaptively instantiate events within each window that passes the first inference module until it can be determined whether a match is guaranteed or impossible, and to return the windows for which a match is guaranteed.

[0042] In one embodiment of the present invention, the second inference module further includes an event instantiation strategy.

[0043] In one embodiment of the present invention, the second inference module sequentially selects and instantiates events within the windows passed on by the first inference module.

[0044] In one embodiment of the present invention, the event instantiation strategy includes, but is not limited to, a greedy strategy.

[0045] In one embodiment of the present invention, the greedy strategy is calculated using the following formula:

[0046]

[0047] In one embodiment of the present invention, the first inference module further caches the transition matrix and reuses these matrices to reduce the computational overhead of future query matching.
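The caching described in this embodiment might look like the following sketch. It is an illustration under stated assumptions, not the patented implementation: the DFA (`DELTA`), the event labels, the function names, and the choice to key the cache on a segment's tuple of surrogate probabilities are all hypothetical. The idea shown is only that overlapping windows share segments, so a per-segment transition matrix can be built once and reused.

```python
from functools import lru_cache

STATES = (0, 1, 2)
EVENTS = ("RS", "RT", "OTHER")
# Hypothetical DFA for "RS followed later by RT"; state 2 is accepting.
DELTA = {
    (0, "RS"): 1, (0, "RT"): 0, (0, "OTHER"): 0,
    (1, "RS"): 1, (1, "RT"): 2, (1, "OTHER"): 1,
    (2, "RS"): 2, (2, "RT"): 2, (2, "OTHER"): 2,
}

@lru_cache(maxsize=None)
def transition_matrix(probs):
    """Build one segment's transition matrix; probs is a tuple aligned with EVENTS."""
    matrix = [[0.0] * len(STATES) for _ in STATES]
    for s in STATES:
        for event, p in zip(EVENTS, probs):
            matrix[s][DELTA[(s, event)]] += p
    return tuple(tuple(row) for row in matrix)

def window_match_probability(segment_probs):
    """Multiply the start distribution through each segment's (cached) matrix."""
    dist = [1.0, 0.0, 0.0]
    for probs in segment_probs:
        m = transition_matrix(probs)
        dist = [sum(dist[i] * m[i][j] for i in STATES) for j in STATES]
    return dist[2]

# Two overlapping windows of size 2 over three segments (made-up scores).
segments = [(0.5, 0.0, 0.5), (0.0, 0.4, 0.6), (0.1, 0.1, 0.8)]
p_first = window_match_probability(segments[0:2])
p_second = window_match_probability(segments[1:3])
cache = transition_matrix.cache_info()
```

Here the middle segment belongs to both windows, so its matrix is computed for the first window and served from the cache for the second.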

[0048] In one embodiment of the invention, the data input includes, but is not limited to, decoded video data streams obtained from multiple sources.

[0049] In one embodiment of the invention, the surrogate model is pre-trained using real-time data input.

[0050] In one embodiment of the invention, the system has the ability to process data input individually and in batches, thereby having scalability.

[0051] This invention also teaches a method for identifying complex event patterns in real-time video analytics data input, characterized by the following steps: converting a query into a nondeterministic finite automaton using a query compiler; predicting a surrogate score using a surrogate model; estimating the probability that a window matches the query pattern; filtering out windows that are unlikely to match the query pattern; adaptively instantiating events within the window until it can be determined whether a match is guaranteed or impossible; and returning a window that is guaranteed to contain a match.

[0052] In another embodiment of the present invention, the step of using a query compiler to convert a query into a nondeterministic finite automaton (NFA) further includes: obtaining the minimum event duration, measured offline from the labels of a given validation set; and converting the query into a nondeterministic finite automaton (NFA).

[0053] In another embodiment of the present invention, the step of estimating the probability of a window matching a query pattern further includes: using a power set construction algorithm to obtain a deterministic finite automaton (DFA) from an initial nondeterministic finite automaton (NFA); constructing a first inference module based on the deterministic finite automaton (DFA) and the surrogate score, wherein the first inference module is a discrete-time nonhomogeneous Markov chain (NHMC); and calculating the pattern matching probability using the discrete-time nonhomogeneous Markov chain (NHMC).
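The three steps of this embodiment can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes that each segment carries exactly one event label, that the surrogate model yields an independent probability distribution over labels for each segment, and that the NFA is given as a plain dictionary. The names `determinize` and `match_probability`, the event labels, and the example NFA are hypothetical.

```python
from itertools import chain

def determinize(nfa, start, accepting, alphabet):
    """Power set construction: build a DFA whose states are sets of NFA states."""
    start_set = frozenset([start])
    dfa, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for event in alphabet:
            target = frozenset(chain.from_iterable(
                nfa.get((q, event), ()) for q in current))
            dfa[current][event] = target
            todo.append(target)
    dfa_accepting = {s for s in dfa if s & accepting}
    return dfa, start_set, dfa_accepting

def match_probability(dfa, start, accepting, segment_dists):
    """Non-homogeneous Markov chain: one transition step per segment, with
    step-specific probabilities taken from that segment's surrogate scores."""
    dist = {start: 1.0}
    for probs in segment_dists:
        nxt = {}
        for state, p in dist.items():
            for event, pe in probs.items():
                target = dfa[state][event]
                nxt[target] = nxt.get(target, 0.0) + p * pe
        dist = nxt
    return sum(p for state, p in dist.items() if state in accepting)

# Hypothetical NFA for "RS followed later by RT". From q0, an RS event can
# either stay in q0 or move to q1, so its outgoing edges are not mutually exclusive.
ALPHABET = ("RS", "RT", "OTHER")
NFA = {
    ("q0", "RS"): {"q0", "q1"}, ("q0", "RT"): {"q0"}, ("q0", "OTHER"): {"q0"},
    ("q1", "RS"): {"q1"}, ("q1", "RT"): {"q1", "q2"}, ("q1", "OTHER"): {"q1"},
    ("q2", "RS"): {"q2"}, ("q2", "RT"): {"q2"}, ("q2", "OTHER"): {"q2"},
}
dfa, dfa_start, dfa_accepting = determinize(NFA, "q0", {"q2"}, ALPHABET)

# Made-up surrogate scores for a window of two segments.
window = [
    {"RS": 0.5, "RT": 0.0, "OTHER": 0.5},
    {"RS": 0.0, "RT": 0.4, "OTHER": 0.6},
]
p_match = match_probability(dfa, dfa_start, dfa_accepting, window)
```

After determinization every event leads to exactly one DFA state, so the outgoing probabilities of each state sum to 1 at every step. In this two-segment example the only matching sequence is RS followed by RT, and the computed probability is 0.5 × 0.4 = 0.2.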

[0054] In another embodiment of the present invention, the probability of a window matching a query pattern is calculated using the following formula:

[0055]

[0056] In another embodiment of the invention, the step of filtering out windows that are highly likely not to match the query pattern further includes deriving a rejection threshold.

[0057] In another embodiment of the invention, the step of filtering out windows that are highly likely not to match the query pattern further includes prioritizing the filtering of negative windows and deferring the remaining processing to the second inference module.

[0058] In another embodiment of the invention, the method further includes further reasoning about windows that cannot be filtered out.

[0059] In another embodiment of the invention, the further reasoning step includes using information entropy to unify the decision-making process so that it covers both the case where the window matches the query and the case where it does not.

[0060] In another embodiment of the invention, the step of further reasoning about unfilterable windows further includes selecting segments sequentially and instantiating the selected segments one at a time in a cost-effective manner.

[0061] In another embodiment of the invention, the step of further reasoning about unfilterable windows further includes: selecting the video segment with the highest expected conditional mutual information at each step; and considering future windows in the decision-making process.

[0062] The background details, advantages, and accompanying embodiments of the invention will be further discussed below:

[0063] Detailed introduction

[0064] Unlike many existing works, Bobsled does not require training or fine-tuning an oracle or surrogate model for each query. First, it employs a lightweight, pre-trained unimodal model as a surrogate to capture the event distribution of each segment of the video stream. Using this information, it derives the probability that a window matches the query pattern and uses that probability to form a cascade: if the probability falls below a certain threshold, the segments within the window are discarded and the oracle model performs no further processing on them.

[0065] However, deriving the probability that a window matches from the probabilities of the individual events within the window is not trivial. Specifically, matching a window of events against a pattern requires transforming the query pattern into a nondeterministic finite automaton (NFA) and consuming the events in the window to determine whether a final state can be reached. Because pattern queries can compile into NFAs with non-mutually exclusive outgoing edges, these NFAs cannot be used directly to form a probabilistic model such as a Markov chain (the probabilities on non-mutually exclusive edges can sum to more than 1). Bobsled introduces probabilistic pattern matching (PPAT), which uses non-homogeneous Markov chains to solve this problem.

[0066] For windows that PPAT cannot directly filter out, Bobsled performs further optimizations to minimize inference cost. For example, consider a pattern query that searches a window of size 4 for the following pattern: "Rolling Stop" (RS) → "Right Turn" (RT) → "Left Turn" (LT).

[0067] Consider the window shown in Figure 1 under two different configurations. The window covers four segments. After scanning with the surrogate model, a discrete distribution over the potential events of each segment is obtained. In case 1, a smart adaptive event instantiation (AEM) strategy can select a single segment and pass it to the oracle model first to instantiate its true label. Once the oracle model confirms that this segment is not a "right turn", the window can be rejected early, because a non-"right turn" event at the second position of the window immediately violates the query pattern. This smart approach allows early stopping: it incurs only the cost of passing one segment to the oracle model before moving on to the next window.

[0068] Conversely, a basic strategy passes the segments within the window to the oracle model sequentially, starting from the first. If the oracle model identifies the first segment as a "rolling stop," then, unlike the smart strategy above, the basic strategy can discard the window only after it also instantiates the next segment and finds that it is not a "right turn." The cost is that two segments must be instantiated before the window is discarded. Designing a smart instantiation strategy is not easy, because the optimal choice depends not only on the probability of early rejection but also on the probability of early acceptance when the window contains a positive match. Consider case 2 in Figure 1. There, once two segments have been instantiated as "rolling stop" and "right turn," the smart strategy should select, of the two remaining segments, the one that may allow the window to be accepted early without processing the other.
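The early-stopping behavior discussed above can be sketched as follows. This is an illustrative sketch under assumptions, not the patented AEM module: the DFA, the simplified two-event pattern ("RS followed later by RT"), the function names, and the `choose` callback are all hypothetical. The sketch treats every uninstantiated segment as "any event" and stops as soon as every completion of the window matches or none can.

```python
EVENTS = ("RS", "RT", "OTHER")
# Hypothetical DFA for "RS followed later by RT"; state 2 is accepting.
DELTA = {
    (0, "RS"): 1, (0, "RT"): 0, (0, "OTHER"): 0,
    (1, "RS"): 1, (1, "RT"): 2, (1, "OTHER"): 1,
    (2, "RS"): 2, (2, "RT"): 2, (2, "OTHER"): 2,
}

def window_status(delta, start, accepting, alphabet, labels):
    """Classify a partially instantiated window.
    labels[i] is the oracle label of segment i, or None if not yet instantiated."""
    states = {start}
    for label in labels:
        events = alphabet if label is None else (label,)
        states = {delta[(s, e)] for s in states for e in events}
    if states <= accepting:
        return "match"         # every completion ends in an accepting state
    if not states & accepting:
        return "no_match"      # no completion can end in an accepting state
    return "undetermined"

def adaptive_instantiate(delta, start, accepting, alphabet, n_segments, oracle, choose):
    """Call the oracle on one segment at a time until the outcome is certain."""
    labels = [None] * n_segments
    oracle_calls = 0
    while True:
        status = window_status(delta, start, accepting, alphabet, labels)
        if status != "undetermined":
            return status, oracle_calls
        i = choose(labels)     # must return the index of an uninstantiated segment
        labels[i] = oracle(i)
        oracle_calls += 1

def first_unknown(labels):
    """Basic strategy: instantiate segments in order."""
    return labels.index(None)

truth_a = ["RS", "RT", "OTHER", "OTHER"]
truth_b = ["OTHER", "OTHER", "OTHER", "RT"]
result_a = adaptive_instantiate(DELTA, 0, {2}, EVENTS, 4, truth_a.__getitem__, first_unknown)
result_b = adaptive_instantiate(DELTA, 0, {2}, EVENTS, 4, truth_b.__getitem__, first_unknown)
```

In the first window the match is guaranteed after two oracle calls, and in the second a match becomes impossible after three, so neither window needs all four segments instantiated. A smarter `choose` function could reduce the number of calls further.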

[0069] Given the complexity and probabilistic nature of the decision space, Bobsled formulates the problem as a sequential information maximization (SIM) problem. Under this formulation Bobsled employs a greedy strategy that makes near-optimal choices by maximizing the mutual information among the instantiation options. To date, Bobsled is the first system to handle pattern matching queries over video streams efficiently. It accelerates complex event processing while achieving the target accuracy. Experiments show that Bobsled improves throughput by 2.3x to 5.7x without significant loss of accuracy.
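One way to read the greedy, information-based choice is sketched below. This is an assumption-laden illustration rather than the patent's formula, which is not reproduced in this text: it scores each uninstantiated segment by the expected reduction in the entropy of the match outcome (the mutual information between that segment's label and the outcome), assuming segment labels are independent and distributed according to the surrogate scores. The DFA, function names, and scores are hypothetical.

```python
import math

# Hypothetical DFA for "RS followed later by RT"; state 2 is accepting.
DELTA = {
    (0, "RS"): 1, (0, "RT"): 0, (0, "OTHER"): 0,
    (1, "RS"): 1, (1, "RT"): 2, (1, "OTHER"): 1,
    (2, "RS"): 2, (2, "RT"): 2, (2, "OTHER"): 2,
}
ACCEPTING = {2}

def match_probability(dists):
    """Probability that the window matches, given one label distribution per segment."""
    state_dist = {0: 1.0}
    for probs in dists:
        nxt = {}
        for state, p in state_dist.items():
            for event, pe in probs.items():
                target = DELTA[(state, event)]
                nxt[target] = nxt.get(target, 0.0) + p * pe
        state_dist = nxt
    return sum(p for state, p in state_dist.items() if state in ACCEPTING)

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def greedy_pick(dists, known):
    """Pick the uninstantiated segment with the largest expected entropy reduction.
    known maps a segment index to its oracle label."""
    current = [{known[i]: 1.0} if i in known else d for i, d in enumerate(dists)]
    h_now = binary_entropy(match_probability(current))
    best_index, best_gain = None, -1.0
    for i, d in enumerate(dists):
        if i in known:
            continue
        h_after = 0.0
        for event, pe in d.items():
            if pe == 0.0:
                continue
            trial = current[:i] + [{event: 1.0}] + current[i + 1:]
            h_after += pe * binary_entropy(match_probability(trial))
        gain = h_now - h_after
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain

# Made-up surrogate scores: segment 0 is probably RS, segment 1 is a coin flip.
window = [
    {"RS": 0.9, "RT": 0.0, "OTHER": 0.1},
    {"RS": 0.0, "RT": 0.5, "OTHER": 0.5},
]
choice, gain = greedy_pick(window, known={})
```

In this example the second segment (index 1) is chosen: its label is the most uncertain, and a label other than RT there rejects the window outright. This mirrors the early-rejection intuition above, although a full strategy would also weigh oracle cost and future windows.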

[0070] Video event detection

[0071] In recent years, Transformer-based models have changed the world. The latest multimodal large language models (MLLMs), such as GPT-4, Sora, Gemini, LLaMA-VID, and LLaVA, are capable of handling tasks like text-to-video generation and video question answering, marking a significant step towards artificial general intelligence (AGI). However, current MLLMs lack sophisticated pattern matching capabilities. Even with future advancements, the inherent need to process every frame may make MLLMs computationally prohibitive in database-driven video analytics. Applying the broader capabilities of MLLMs to complex event video processing could result in unnecessarily resource-intensive processing. Furthermore, an event-pattern language can capture query intent precisely, whereas achieving the same precision through natural language prompt engineering is almost an art. The language modality of MLLMs is therefore not needed for this application, yet it introduces unnecessary overhead because a large proportion of model parameters is allocated to language processing. For detecting single events in a video, unimodal video recognition models are sufficient.

[0072] Bobsled supports complex video event processing using a unimodal model. It is independent of the type of event occurring: it supports both actions detected by a video recognition model and objects detected by an image model. In the following discussion, this invention focuses on action-based events, as they are more relevant to specific use cases.

[0073] Video recognition models take consecutive segments (frame sequences) as input and utilize spatial information within each frame and temporal information across frames for recognition. These models can be broadly categorized into two types: convolution-based models (such as C3D and R(2+1)D) and Transformer-based models (such as VideoMAE and X-CLIP). Convolution-based models use 3D convolutions instead of 2D convolutions to capture temporal information within video segments. Transformer-based models, on the other hand, split the input video into tokens and employ self-attention mechanisms to capture contextual information.

[0074] These video recognition models typically take a segment of 16, 64, or 128 consecutive frames as input and can recognize multiple events occurring within a segment. For example, given a segment, they can output simultaneous events such as "Player A is setting a screen" and "Player B is dribbling." Furthermore, because an action may span multiple segments, a model may output the same label (for example, "feeding") for hundreds of consecutive segments.

[0075] Deep video analytics

[0076] Current systems optimized for video queries support various query types, including selection, LIMIT, aggregation, and top-k. Most of these systems rely on query-specific surrogate (proxy) models. For example, they train a lightweight binary classifier as a "proxy" for a query predicate (e.g., COUNT(CAR) > 5). Frames whose surrogate score (i.e., the probability, according to the surrogate model, that the frame satisfies the query) falls below a certain threshold are discarded, so the expensive oracle model never processes them.
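The cascade described in this paragraph can be sketched in a few lines. This is a generic illustration, not the code of any particular system: `cascade_select`, the stand-in `proxy` and `oracle` callables, and the toy frames are hypothetical.

```python
def cascade_select(frames, proxy, oracle, threshold):
    """Run the cheap proxy on every frame and the expensive oracle only on
    frames whose proxy score reaches the threshold."""
    results, oracle_calls = [], 0
    for frame in frames:
        if proxy(frame) < threshold:
            continue               # discarded without touching the oracle
        oracle_calls += 1
        if oracle(frame):
            results.append(frame)
    return results, oracle_calls

# Toy stand-ins: frames are integers, the proxy score grows with the frame
# index, and the oracle accepts even-numbered frames.
frames = list(range(10))
selected, calls = cascade_select(
    frames,
    proxy=lambda f: f / 10,
    oracle=lambda f: f % 2 == 0,
    threshold=0.5,
)
```

In this toy run only frames 5 through 9 reach the oracle, so half of the oracle work is skipped. Any true positive with a proxy score below the threshold is lost, which is why the threshold has to be calibrated against an accuracy target.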

[0077] The training data for the surrogate model is obtained by extracting frames from the query video, feeding them to the oracle model, and treating the oracle model's output as the true label. The threshold is calibrated based on different levels of accuracy guarantees, according to the relationship between the surrogate model's empirical recall / precision relative to the oracle label and the desired recall / precision target.
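A purely empirical version of this calibration, for a recall target, might look like the sketch below. It is an assumption-based illustration: `calibrate_threshold` and the sample data are hypothetical, and the sketch omits any statistical correction that a system offering formal guarantees would add.

```python
import math

def calibrate_threshold(scores, labels, target_recall):
    """Return the largest threshold t such that keeping items with score >= t
    retains at least target_recall of the oracle-positive validation items."""
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("validation set contains no positive examples")
    must_keep = max(1, math.ceil(target_recall * len(positives)))
    return positives[must_keep - 1]

# Made-up validation data: surrogate scores and oracle labels.
scores = [0.9, 0.8, 0.3, 0.2, 0.6]
labels = [1, 1, 1, 0, 0]
t_relaxed = calibrate_threshold(scores, labels, target_recall=0.66)
t_strict = calibrate_threshold(scores, labels, target_recall=1.0)
```

With a 66% recall target the threshold is 0.8, which discards three of the five items; demanding full recall lowers it to 0.3, so fewer items can be filtered. The stricter the accuracy target, the less work the surrogate can save.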

[0078] However, surrogate models are not the only way to provide surrogate scores. Systems like TASTI and Seiden employ an alternative approach: they first process a subset of frames using an oracle model and then use the labels obtained from this subset to interpolate surrogate scores for the remaining frames. These techniques can be applied to any surrogate-based system, including Bobsled. ExSample can answer LIMIT queries without relying on any surrogate scores. It uses Thompson sampling to select frames from the queried video for the oracle model to process and stops early after finding a sufficient number of results. However, for selection or pattern queries that aim to find all answers, it remains unclear how to decide when to stop the search early while still meeting specific recall or precision targets, especially without additional information such as surrogate scores.

[0079] Some systems are designed to adaptively select the most suitable model or configuration (e.g., image size, sampling rate) from a set of inference models or configurations. Their goal is to use more expensive models or configurations to process important frames, while using less expensive models or configurations to process irrelevant frames. For example, Zeus, which focuses on action localization, uses reinforcement learning agents to predict the configuration of the next video segment to be processed. Other works, such as Rekall and VOCAL, primarily focus on programming models for handling complex events in videos and data tagging. In contrast, our work pioneers an exploration of the efficiency problem of pattern matching in videos.

[0080] Complex Event Processing

[0081] Complex Event Processing (CEP) systems are designed for pattern matching in traditional data streams. A CEP query consists of a sequence of events occurring in a specific order, along with constraints on those events. Various query languages have been proposed to describe patterns, including SQL-TS, Cayuga, and SASE+. These languages support event sequences, Kleene closure, complex predicates, and contiguity constraints, and are therefore more expressive than the languages used for regular expression matching.

[0082] A typical CEP query consists of three clauses: "SEQ", "WHERE", and "WITHIN". The "SEQ" clause defines a sequence of events and associated quantifiers that set the minimum and maximum number of occurrences of an event. One of these quantifiers is the Kleene plus, denoted "+", which specifies one or more occurrences of an event, as in "A+". A quantifier can also specify an explicit range of occurrences: for example, "A{3, 5}" means that event "A" occurs at least three times and at most five times. The "WHERE" clause specifies the conditions on the events, while the "WITHIN" clause specifies the window size.

[0083] SEQ (A, B+, C) // Query Q1

[0084] WHERE A.condition = 'sunny'

[0085] AND B.condition = 'rainy'

[0086] AND C.condition = 'cloudy'

[0087] WITHIN 1 hour

[0088] Q1 above is an example CEP query that captures a series of weather conditions: within the past hour, the weather changed from "sunny" to "rainy", stayed rainy for a while, and finally became "cloudy". The Kleene plus quantifier matches one or more "rainy" events. By default, the continuity between these events is relaxed. Strict continuity requires two events to be consecutive in the data stream, which is typical of regular expression matching. Relaxed continuity removes this requirement: all irrelevant events are skipped until the next relevant event is found. This is especially important in practice, because noisy events in the data stream should be ignored by the query.

[0089] During execution, CEP queries are typically compiled into a nondeterministic finite automaton (NFA). Formally, an NFA can be represented as a 5-tuple $(S, \Theta, \delta, s_0, s_f)$, where $S$ is a set of states, $\Theta$ is a set of conditions, $\delta$ is the transition function, $s_0$ is the initial state, and $s_f$ is the final state. The transition function $\delta$ defines the state transitions of the NFA. For example, $\delta(s_i, \theta) = s_j$ denotes the transition from state $s_i$ to state $s_j$ under condition $\theta$. A condition is a Boolean expression composed of basic operators ($\land$, $\lor$, $\lnot$), comparison operators (such as $=$, $<$, $>$), and a set of variables, and it evaluates to either true or false. An ε-transition models the Kleene plus operator under relaxed continuity, allowing the matching process to move from one state to another without consuming any event; its condition always evaluates to "true".

[0090] Figure 2 shows the NFA compiled from query Q1. In this example, the matching process must satisfy the condition "sunny" to move from state "A" to state "B". There is an ε-transition from state "B" to state "C", which can be taken without consuming any event. Therefore, a "rainy" event encountered in state "B" can be consumed while remaining in state "B", or the matching process can first enter state "C" via the ε-transition and then consume the "rainy" event through the "¬cloudy" transition, remaining in state "C". This highlights the inherent nondeterminism of an NFA: the conditions associated with two edges leaving a given state need not be mutually exclusive.

[0091] If some sequence of transitions leads to a final state after consuming some events, a match is found. If no match is found in the current window, the matching process slides the window to the next event, for example from X[ABDD]EE to XA[BDDE]E, where "[" and "]" mark the beginning and end of the window. After a match is found, the matching process has different options for moving the window. For example, the "skip-past-last-event" strategy skips past the last matched event and moves the window from X[ABCD]EEFF to XABC[DEEF]F.
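The window movement described above can be illustrated with a minimal sketch. It assumes a pattern made only of single events under relaxed continuity (no quantifiers), and the names `find_match` and `slide` are hypothetical helpers introduced here for illustration rather than part of any CEP system.

```python
def find_match(window, pattern):
    """Return the index of the last matched event in `window`, or None.

    Relaxed continuity: events that do not advance the pattern are skipped.
    """
    pos, last = 0, None
    for label in pattern:
        while pos < len(window) and window[pos] != label:
            pos += 1  # skip irrelevant events
        if pos == len(window):
            return None  # the pattern cannot be completed in this window
        last = pos
        pos += 1
    return last


def slide(stream, pattern, size):
    """Return (window_text, matched) for every window that is evaluated."""
    evaluated, start = [], 0
    while start + size <= len(stream):
        window = stream[start:start + size]
        last = find_match(window, pattern)
        evaluated.append((window, last is not None))
        # skip-past-last-event after a match, otherwise slide by one event
        start += last + 1 if last is not None else 1
    return evaluated


if __name__ == "__main__":
    print(slide("XABCDEEFF", "ABC", 4))
    print(slide("XABDDEE", "ABC", 4))
```

In this sketch a match ending at event C moves the next window to start at D, whereas a window without a match advances by a single event.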

[0092] The primary focus of CEP systems is to increase throughput, thereby facilitating the processing of high-speed data streams. A key challenge lies in the complexity of evaluating NFAs on each window. Much work has addressed this issue in terms of pattern matching cost and memory consumption. However, CEP for video events should have a different focus, as the main bottleneck stems from model inference rather than the pattern matching process. For example, on an NVIDIA GeForce RTX 3090 GPU, a typical video recognition model like R(2+1)D-18 can recognize 184 segments per second when each segment uses 16 frames. In contrast, a CEP system running on a single CPU core can process over 100,000 events per second. Therefore, the challenge shifts to minimizing model inference in order to maximize overall system throughput.

[0093] CEP in Bobsled

[0094] Bobsled aims to speed up the processing of complex events in video streams. It supports events recognized by any recognition model, including image classification, action recognition, object detection, etc. Currently, events that are not the direct output of a recognition model are not supported. For example, the event "more than five cars" is not the direct output of any common object detection model, but an event post-processed by an external module based on the model's direct output. However, if the model is fine-tuned to directly return the event "more than five cars", Bobsled can also support it. The reason is that Bobsled avoids any query-specific surrogate models, because this approach cannot scale as the number of queries increases. Instead, Bobsled uses surrogate models that perform the same task as the oracle model. These models are usually pre-trained and provided together with the oracle model. For example, the latest version of the YOLOv10 object detector has multiple variants: nano, small, medium, large, and extra-large. The nano version can be used as the surrogate model, and the extra-large version can be used as the oracle model. If there is no lightweight version of the oracle model, the user can first prepare one with a model quantization tool.

[0095] Bobsled implements a subset of the Flink CEP pattern API and uses it for complex event processing in video streams:

[0096] Query = <Component>

[0097] [. <Continuity><Component> ] *

[0098] [.window(<Window Size>)] // in number of segments

[0099] [.f1(<F1 Target>)]

[0100] .model(<Oracle Model>, <Surrogate Model>)

[0101] Component = (<Condition>)

[0102] [.minDuration(<Minimum Duration>)] // in number of segments

[0103] [.maxDuration(<Maximum Duration>)] // same as above

[0104] Continuity = followedBy | next

[0105] Compared to Flink's CEP API, one modification is the introduction of the "model" keyword, which requires users to specify the oracle model and the surrogate model. Another modification is the addition of the "f1" keyword, which lets users specify a target F1 score (default 0.99); Bobsled uses this target to speed up processing. For example:

[0106] ('rolling stop').minDuration(1) // Query Q2

[0107] .next('right turn').minDuration(1)

[0108] .followedBy('left turn').minDuration(1)

[0109] .window(5)

[0110] .f1(0.99)

[0111] .model(r(2+1)d-18, c3d-3)

[0112] All other keywords are inherited from Flink CEP. The keywords "minDuration" and "maxDuration" specify the shortest and longest duration of an event, respectively. In the example, "minDuration(1)" means that the event must match at least once, which is equivalent to the Kleene plus quantifier. As another example, "('right turn').minDuration(4)" means that the "right turn" event must appear in at least 4 consecutive segments to be considered a match. Strict continuity is specified by the keyword "next", while relaxed continuity is specified by "followedBy". The window size is set by the keyword "window"; if it is not specified, the default in Bobsled is 30 seconds. Once a match is found, Bobsled uses the "skip-past-last-event" strategy to avoid returning overlapping results, thus ensuring the diversity of results.

[0113] System Overview

[0114] Figure 3 presents a system overview of Bobsled. More specifically, the invention comprises four parts: a query compiler, a surrogate model, a probabilistic pattern (PPAT), and adaptive event instantiation (AEM). Bobsled can handle multiple queries across multiple video streams. For simplicity, further details are given below from the perspective of a single query on a single video stream.

[0115] In addition to transforming queries into NFAs, the compiler infers the minimum duration of each event in a query when the user does not specify one. This is necessary because events perceived by a user in a video (such as actions) typically correspond to multiple raw model predictions spanning multiple segments. For example, when using segments of 16 frames (approximately 0.5 seconds), an "eating" action perceived by a user is typically represented by multiple "eating" labels predicted by the model over multiple consecutive segments. Determining a precise minimum duration for different types of events is challenging for users (e.g., what should the minimum duration be for a "drinking" event or an "eating" event?). Choosing without this knowledge can hurt accuracy (when the minimum duration is set too high) or efficiency (when the minimum duration is set to 1, which reduces selectivity and adds overhead through frequent use of the Kleene plus operator). Therefore, in Bobsled, the compiler automatically infers the minimum duration of every unspecified event in the query by analyzing a validation set.

[0116] The NFA, together with the surrogate scores predicted by the surrogate model, is used to construct a non-homogeneous Markov chain (NHMC), which serves as the probabilistic pattern (PPAT) for estimating the probability that a window matches the query pattern. Windows with a high probability of not matching are filtered out early, without any further processing of the segments within them. The PPAT component makes this decision using a rejection threshold derived from the target F1 score specified in the query. An acceptance threshold could also be used, so that windows with a probability above it are returned immediately without further processing. However, accepting windows on the basis of an acceptance threshold can produce false positives and reduce precision. Similarly, filtering on the basis of the rejection threshold can produce false negatives and reduce recall. Given that Bobsled uses the F1 score (which combines precision and recall) as its accuracy target, and that complex event queries are usually more selective than simple event queries, PPAT determines only the rejection threshold at this stage, prioritizing the filtering of negative windows and deferring the remaining work to the adaptive event instantiation (AEM) module. Windows that PPAT cannot filter are passed to AEM for further inference. The purpose of AEM is to select the minimum number of segments for the oracle model to process, enabling early rejection or early acceptance of the window.

[0117] For example, query Q2 finds windows of size 5 in which the first event is "rolling stop" (RS), the next event is "right turn" (RT), and a "left turn" (LT) event then occurs at any later position. Now consider a window of events, where "?" denotes segments that have not yet been instantiated by the oracle model. For the window in Figure 4(a), if $c_2$ is selected first for oracle inference and confirmed to be a "left turn" event, the entire window can be rejected early without instantiating any other segment. Alternatively, if three segments are selected for oracle inference and their results already form a "rolling stop", "right turn", "left turn" match, the window can be accepted early without instantiating the remaining two segments. In short, given a window, the goal of AEM is to select the next segment that maximizes the probability of early acceptance if the window eventually matches the query, and the next segment that maximizes the probability of early rejection if it eventually does not. However, whether the window matches the query is not known in advance. To address this problem, this invention uses information entropy to unify the two cases and solves them as a near-optimal solution to the Sequential Information Maximization (SIM) problem.

[0118] Query compiler

[0119] The query compiler in Bobsled follows the implementation in SASE+, which first compiles each query into an NFA. It's worth noting that this invention does not implement a match cache as in SASE+, because model inference is the main bottleneck in Bobsled, and converting to a basic NFA is sufficient for these purposes.

[0120] Since each event perceived by the user spans multiple consecutive model-predicted events, if the query does not specify a minimum duration for an event, the query compiler automatically annotates it with one before converting the query to an NFA. The minimum duration of each event is obtained offline from a given validation set with ground-truth labels. For each event, Bobsled measures the duration (in number of segments) of each occurrence in the validation set and records the minimum. For example, if the "rolling stop" event occurs 5 times in the validation set with durations of 5, 6, 6, 7, and 10, then the minimum duration for "rolling stop" is 5.
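The offline measurement described above amounts to a run-length scan over the per-segment ground-truth labels. The following is a minimal sketch under that reading; the function name `min_durations` and the flat list-of-labels input format are assumptions made for illustration.

```python
from itertools import groupby


def min_durations(labels):
    """Map each event label to the length of its shortest consecutive run.

    `labels` holds one ground-truth label per segment, in stream order.
    """
    result = {}
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        result[label] = min(length, result.get(label, length))
    return result


if __name__ == "__main__":
    # Five "rolling stop" occurrences with durations 5, 6, 6, 7 and 10,
    # separated by single segments of some other label.
    validation = []
    for duration in (5, 6, 6, 7, 10):
        validation += ["rolling stop"] * duration + ["other"]
    print(min_durations(validation)["rolling stop"])
```

Each maximal run of identical labels is treated as one occurrence of the event, so the shortest run per label is the inferred minimum duration.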

[0121] Probability pattern

[0122] Given a window of known events, the existence of a match can be determined directly by evaluating the query's nondeterministic finite automaton (NFA): if a series of transitions leads to the final state, the NFA returns a match. To reduce calls to the oracle model, a surrogate model is used to obtain a discrete distribution over the events of each segment, and these uncertain events are used to build a probabilistic pattern (PPAT) that computes the pattern matching probability. The PPAT is essentially a non-homogeneous Markov chain (NHMC), as detailed below.

[0123] Non-homogeneous Markov chains as probabilistic patterns

[0124] Given a query pattern $Q$, a non-homogeneous Markov chain (NHMC) is constructed for each window $W$ to estimate the probability that $W$ matches $Q$.

[0125] Non-homogeneous Markov chains (NHMCs) are a variant of Markov chains. In a Markov chain, the transition probabilities leaving any given state must sum to 1. However, since the transitions of an NFA compiled from a query may not be mutually exclusive, an NFA cannot be used directly as the state space and transitions of a Markov chain. To construct the state space and transitions of an NHMC, an automaton is needed that is equivalent to the query pattern but whose transitions are mutually exclusive. Deterministic finite automata (DFAs) meet this requirement. Although there are algorithms (such as Brzozowski derivatives) that convert query patterns directly into DFAs, these methods are generally complex and difficult to implement. In this invention, the powerset construction algorithm is used to obtain a DFA from the initial NFA. The DFA can be represented as a 5-tuple $(S, \Theta, \delta, s_0, s_f)$, where $S$ is a set of states, $\Theta$ is a set of conditions, $\delta$ is the transition function, $s_0$ is the initial state, and $s_f$ is the final state. For any two states $s_i$ and $s_j$, either there is exactly one transition condition $\theta_{ij}$ from $s_i$ to $s_j$ such that $\delta(s_i, \theta_{ij}) = s_j$, or there is no transition at all.
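As an illustration of the powerset construction mentioned above, the following is a minimal textbook sketch over a finite alphabet of event labels. The example NFA for a "sunny, rainy+, cloudy" pattern, its state numbering, and the helper names are assumptions chosen for illustration; they are not taken from the figures.

```python
from collections import deque

EPS = None  # marker for epsilon-transitions


def eps_closure(states, delta):
    """All NFA states reachable from `states` through epsilon-transitions only."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def nfa_to_dfa(delta, start, finals, alphabet):
    """Powerset construction; each DFA state is a frozenset of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta, d_finals = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        current = queue.popleft()
        if current & finals:
            d_finals.add(current)
        for symbol in alphabet:
            moved = set()
            for s in current:
                moved |= set(delta.get((s, symbol), ()))
            target = eps_closure(moved, delta)
            d_delta[(current, symbol)] = target  # exactly one target per symbol
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_delta, d_start, d_finals


def dfa_accepts(dfa, events):
    d_delta, state, d_finals = dfa
    for event in events:
        state = d_delta[(state, event)]
    return state in d_finals


# Hypothetical NFA: 0 -sunny-> 1 -rainy-> 2 (loops on rainy), 2 -eps-> 3,
# state 3 skips non-cloudy events and moves to the final state 4 on cloudy.
NFA = {
    (0, "sunny"): {1},
    (1, "sunny"): {1}, (1, "cloudy"): {1}, (1, "rainy"): {2},
    (2, "rainy"): {2}, (2, EPS): {3},
    (3, "sunny"): {3}, (3, "rainy"): {3}, (3, "cloudy"): {4},
}
DFA = nfa_to_dfa(NFA, 0, {4}, ["sunny", "rainy", "cloudy"])
```

In state 2 a "rainy" event can either loop or be skipped after the epsilon-transition, which is the kind of overlap the construction removes: every DFA state has exactly one successor per event label.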

[0126] Figure 4(b) shows the DFA derived from the NFA compiled from query Q2. The transitions in this DFA are mutually exclusive. The states and transitions of the DFA serve as the basis for the states and transitions of the NHMC, with the transition probabilities derived from the surrogate scores. However, probabilistic pattern matching cannot be modeled with a standard (homogeneous) Markov chain, because the transition probabilities are non-homogeneous: the probability of moving from one state to another is not static but depends on the segment currently being consumed. For example, consider the DFA in Figure 4(b) and the window in Figure 4(a). If segments $c_1$ and $c_2$ have been consumed and some state $s$ has been reached, the probability of staying at $s$ is given by the surrogate scores of $c_3$. If the state is still $s$ after consuming $c_3$, however, the probability of staying at $s$ is given by the surrogate scores of $c_4$ instead. This is why a non-homogeneous Markov chain is used as the PPAT.

[0127] In an NHMC, the transition probabilities are represented by a set of transition matrices, one per time step. Here, a discrete-time NHMC is used. Assume the state set is $S = \{s_0, \dots, s_{m-1}\}$. Each transition matrix $P_t$ is an $m \times m$ square matrix, where $m$ is the number of states in the NHMC. The entry $P_t[i][j]$ ($0 \le i, j < m$) is the probability of transitioning from state $s_i$ to state $s_j$ at time $t$. Each transition matrix $P_t$ is constructed from the surrogate scores of segment $c_t$. To build the transition matrix, a random variable $X_t$ is defined to represent the distribution of events that may occur in segment $c_t$, as determined by the surrogate scores. Then $P_t[i][j] = \Pr(\theta_{ij} \mid X_t)$ is the probability of transitioning from state $s_i$ to state $s_j$ under condition $\theta_{ij}$; if there is no transition from $s_i$ to $s_j$, $P_t[i][j]$ is set to 0. Here, $\Pr(\theta_{ij} \mid X_t)$ is the probability that condition $\theta_{ij}$ is true under the distribution of $X_t$. Therefore, a window composed of $n$ segments yields an NHMC with $n$ transition matrices.

[0128] One point to note is the final state of the DFA, because a Markov chain itself has no final state that ends the pattern matching process. To address this, a self-transition with probability 1 is added to the final state of the DFA. Formally, if $s_i$ is the final state, then $P_t[i][i]$ is set to 1 for every transition matrix $P_t$. In this way, a valid NHMC can be formed from the query and the surrogate scores.

[0129] Figure 4(c) shows the NHMC and its transition matrices for query Q2 based on the window in Figure 4(a). For example, consider the entry of transition matrix $P_1$ for the transition out of the initial state. This entry indicates that, when processing of the window starts, the probability of transitioning from the initial state $s_0$ to the next state $s_1$ is 0.99, because the probability that the first segment $c_1$ is a "rolling stop" is 0.99. Similarly, consider the corresponding entry of transition matrix $P_2$. This entry indicates that, after $c_1$ has been consumed and $s_1$ has been reached, the probability of transitioning from state $s_1$ to state $s_2$ is 0.7, because the probability that $c_2$ is a "right turn" is 0.7.

[0130] Given an NHMC, the probability that window $W$ matches query $Q$ equals the probability of transitioning from the initial state $s_0$ to the final state $s_f$. In an NHMC, the transition probabilities after $n$ steps are obtained by multiplying the $n$ transition matrices $P_1, \dots, P_n$. The resulting matrix gives the probability of transitioning between each pair of states after $n$ steps. To extract the specific transition probability from the initial state to the final state, this invention employs two indicator vectors, $\pi_0$ and $\pi_f$. Here, $\pi_0$ is the indicator vector of the initial state $s_0$: $\pi_0[i] = 1$ if $s_i = s_0$, and all other entries are set to 0. Similarly, $\pi_f$ is the indicator vector of the final state $s_f$: $\pi_f[i] = 1$ if $s_i = s_f$, and all other entries are set to 0. Therefore, for a window $W$ composed of segments $c_1, \dots, c_n$ and a query $Q$, the probability of a match is:

[0131] $\Pr(W \text{ matches } Q) = \pi_0^{\top} \left( \prod_{t=1}^{n} P_t \right) \pi_f$ (1)

[0132] From a computational perspective, the time complexity of calculating this pattern matching probability is $O(n m^3)$ with standard matrix multiplication, where $n$ is the window size and $m$ is the number of states in the NHMC, which equals the number of states in the DFA. Although queries with long patterns may have a larger $m$, in practice $m$ typically ranges from single to double digits. Furthermore, longer pattern queries are also more selective, which gives PPAT stronger filtering capability and thereby reduces the number of calls to the oracle model.
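The computation in Formula 1 can be sketched in a few lines of plain Python. The five-state DFA below is a hypothetical stand-in for a Q2-like pattern (it is not the DFA of Figure 4(b)), and apart from the 0.99 and 0.7 scores quoted earlier, the surrogate scores in the usage example are made up for illustration. The sketch propagates the indicator vector through the matrices one segment at a time, which yields the same value as forming the full matrix product.

```python
LABELS = ["RS", "RT", "LT", "OTHER"]
START, SEEN_RS, SEEN_RT, FINAL, DEAD = range(5)

# Hypothetical DFA: RS, then immediately RT (strict), then LT at any later point.
DFA = {}
for _label in LABELS:
    DFA[(START, _label)] = SEEN_RS if _label == "RS" else DEAD
    DFA[(SEEN_RS, _label)] = {"RS": SEEN_RS, "RT": SEEN_RT}.get(_label, DEAD)
    DFA[(SEEN_RT, _label)] = FINAL if _label == "LT" else SEEN_RT
    DFA[(FINAL, _label)] = FINAL  # self-transition with probability 1
    DFA[(DEAD, _label)] = DEAD


def transition_matrix(scores, num_states=5):
    """Build one NHMC transition matrix from a segment's surrogate scores.

    `scores` maps each label to the surrogate probability of that label.
    """
    P = [[0.0] * num_states for _ in range(num_states)]
    for (i, label), j in DFA.items():
        P[i][j] += scores.get(label, 0.0)
    return P


def match_probability(window, start=START, final=FINAL):
    """Indicator(start)^T * P_1 * ... * P_n * indicator(final)."""
    matrices = [transition_matrix(scores) for scores in window]
    m = len(matrices[0])
    v = [1.0 if i == start else 0.0 for i in range(m)]
    for P in matrices:
        v = [sum(v[i] * P[i][j] for i in range(m)) for j in range(m)]
    return v[final]


if __name__ == "__main__":
    window = [
        {"RS": 0.99, "OTHER": 0.01},
        {"RT": 0.7, "LT": 0.3},
        {"LT": 1.0},
        {"OTHER": 1.0},
        {"OTHER": 1.0},
    ]
    print(match_probability(window))
```

Because every row of each matrix sums to 1 when the scores of a segment sum to 1, the propagated vector remains a probability distribution over DFA states.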

[0133] Rejection threshold

[0134] Given the user-specified F1 target, this invention converts it into a rejection threshold for PPAT. This threshold is determined using a validation set that includes surrogate scores and ground-truth labels. Since the F1 score combines recall and precision, and PPAT's primary objective is to filter out negative windows, the precision target $p$ is set to 1 and the focus is placed on the recall target $r$. Because $F_1 = 2pr / (p + r)$, the recall target can be derived from the F1 target as $r = F_1 / (2 - F_1)$. To achieve the recall target, this invention employs the Probabilistic Predicate (PP) method to determine the rejection threshold. PP searches for the maximum threshold that satisfies the desired recall target on the collected data. Specifically, given a threshold $\tau$, the recall is the ratio of the number of windows that are judged likely to match the query (pattern matching probability at least $\tau$) and actually match it to the total number of windows that actually match the query. The search starts from threshold 1 and lowers the threshold until the recall just reaches the target $r$. Windows whose pattern matching probability is lower than $\tau$ can then be filtered out while the recall target is still met. In other words, the threshold strikes a balance between efficiency (filtering non-matching windows) and accuracy (achieving the recall target). Future work will use more sophisticated techniques to derive thresholds with stricter quality guarantees.
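The conversion from an F1 target to a rejection threshold can be sketched as follows. This is a simplified reading of the search described above: it sorts the pattern matching probabilities of the truly matching validation windows instead of scanning thresholds downward from 1, and the function names and input format are assumptions made for illustration.

```python
import math


def recall_target(f1_target):
    """Recall needed to reach the F1 target when precision is fixed at 1."""
    return f1_target / (2.0 - f1_target)


def rejection_threshold(probabilities, matches, recall):
    """Largest threshold that still keeps `recall` of the truly matching windows.

    `probabilities[i]` is the pattern matching probability of validation
    window i and `matches[i]` its ground-truth label. Windows whose
    probability is below the returned value would be filtered out.
    """
    positives = sorted((p for p, y in zip(probabilities, matches) if y), reverse=True)
    if not positives:
        return 1.0
    # Number of positive windows that must survive the filter.
    keep = max(1, math.ceil(recall * len(positives) - 1e-12))
    return positives[keep - 1]


if __name__ == "__main__":
    r = recall_target(0.99)
    probs = [0.9, 0.8, 0.6, 0.2, 0.5, 0.1]
    truth = [True, True, True, True, False, False]
    print(r, rejection_threshold(probs, truth, 0.75))
```

With the default F1 target of 0.99, the derived recall target is roughly 0.98, so at most about 2% of the matching validation windows may fall below the threshold.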

[0135] Adaptive Event Instantiation (AEM)

[0136] In Bobsled, windows that pass PPAT are sent to AEM for further processing. The goal of AEM is to dynamically determine the next segment to be instantiated by the oracle model, so that windows can be rejected or accepted early. In other words, it makes a series of online choices that produce the segment instantiation order with the lowest cost. AEM is an optimization technique designed to improve efficiency without compromising accuracy. When a window offers no opportunity for early stopping (e.g., consider the query ('A').next('B').next('C') and the window [ABC]), the instantiation order does not matter; AEM instantiates all segments to guarantee correctness. However, since pattern matching queries are generally selective, AEM is an effective technique that can significantly improve speed.

[0137] AEM works by selecting the next segment that maximizes the probability of acceptance if the window eventually matches the query; otherwise, it selects the next segment that maximizes the probability of rejection. The challenge is that whether the current window will ultimately match the query is unknown. To address this uncertainty, AEM uses information entropy to unify the decision-making process for these two possible outcomes.

[0138] Let $Y$ be a random variable representing whether window $W$ matches query $Q$. $Y$ follows a Bernoulli distribution: $Y = 1$ if $W$ matches $Q$, and $Y = 0$ otherwise. The uncertainty of $Y$ can be expressed by its entropy $H(Y)$, which is computed from the probability $p = \Pr(Y = 1)$ that $W$ matches $Q$: $H(Y) = -p \log p - (1 - p) \log (1 - p)$. Initially, when it is uncertain whether $W$ matches $Q$, the entropy $H(Y)$ is greater than 0. Let $X_{\mathcal{A}}$ be a random variable representing the events already instantiated within the window. As more segments are instantiated, the conditional entropy $H(Y \mid X_{\mathcal{A}})$ gradually decreases and eventually reaches 0. Therefore, minimizing the conditional entropy is equivalent to maximizing the probability of early acceptance or rejection.
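The entropy of the Bernoulli match variable can be evaluated directly. The short sketch below uses base-2 logarithms, which is an assumption since the text does not fix the base, and the function name `match_entropy` is introduced here for illustration.

```python
import math


def match_entropy(p):
    """Entropy in bits of a Bernoulli match variable with P(match) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # the outcome is certain, so no uncertainty remains
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99, 1.0):
        print(p, match_entropy(p))
```

The entropy is largest when a match and a non-match are equally likely and drops to zero once the window can be accepted or rejected with certainty.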

[0139] Given this unified form, AEM aims to drive the conditional entropy to zero at minimal cost. However, since events are unknown beforehand, AEM selects segments sequentially, instantiating them one at a time. To minimize the number of segments instantiated, AEM aims to minimize the conditional entropy $H(Y \mid X_{\mathcal{A}})$ as quickly as possible, where $Y$ indicates whether the window matches the query and $X_{\mathcal{A}}$ denotes the events instantiated so far. This is equivalent to maximizing the mutual information between $Y$ and $X_{\mathcal{A}}$, which measures the amount of information that $X_{\mathcal{A}}$ contains about $Y$. According to information theory, $I(Y; X_{\mathcal{A}}) = H(Y) - H(Y \mid X_{\mathcal{A}})$, where $I(Y; X_{\mathcal{A}})$ denotes the mutual information between $Y$ and $X_{\mathcal{A}}$. Since the entropy $H(Y)$ is a constant, maximizing $I(Y; X_{\mathcal{A}})$ minimizes $H(Y \mid X_{\mathcal{A}})$. Therefore, this problem is essentially a Sequential Information Maximization (SIM) problem.

[0140] In SIM, there are random variables $Y$ and $X_1, \dots, X_n$, where $Y$ depends deterministically on $X_1, \dots, X_n$ through a function $f$, i.e., $Y = f(X_1, \dots, X_n)$. The random variables $X_i$ are observable, whereas $Y$ cannot be observed directly. To obtain information about $Y$, a test can be performed on each $X_i$ to reveal its outcome $x_i$. Each test incurs one unit of cost. Given a fixed budget of $k$ tests, the goal is to sequentially select the tests that are most informative about $Y$. This series of choices is encoded as a policy $\pi$, where $\pi(j)$ denotes the test performed at step $j$. For example, in the first step, the test on $X_{\pi(1)}$ is performed and the outcome $x_{\pi(1)}$ is observed. The outcomes collected by $\pi$ are denoted $x_{\pi}$.

[0141] The optimal policy $\pi^{*}$ maximizes the mutual information between $Y$ and $X_{\pi}$: $\pi^{*} = \arg\max_{\pi} I(Y; X_{\pi})$. However, finding the optimal policy $\pi^{*}$ is NP-hard. Instead, a widely used greedy strategy is employed: at each step $k$, select the test $i$ that maximizes the expected conditional mutual information $I(Y; X_i \mid x_{\mathcal{A}})$, where $i \notin \mathcal{A}$ and $x_{\mathcal{A}}$ denotes the events already instantiated in the previous $k - 1$ steps. This approach is simple and has been shown to be close to optimal. In fact, SIM is similar to multi-armed bandits and reinforcement learning (RL), where the reward can be defined as a reduction in uncertainty. However, in reinforcement learning, both the environment and the inherent uncertainty are unknown, so the solution must both explore and exploit the environment. In contrast, in SIM, the probability distribution of each $X_i$ is already known through the surrogate scores. Therefore, the SIM solution can focus on exploiting this known information to achieve near-optimal results without extensive exploration.

[0142] Now, consider a query $Q$ and a window $W$ composed of $n$ segments $c_1, \dots, c_n$. This invention defines random variables $Y$ and $X_1, \dots, X_n$, where $Y$ represents whether $W$ matches $Q$ and $X_i$ represents the possible events of segment $c_i$. The greedy strategy for the SIM problem is applied with a budget of $n$ tests, because at most $n$ segments can be instantiated in a window, so that at each step the segment with the maximum expected conditional mutual information is instantiated. Let $\mathcal{A}$ be the set of indices of the segments instantiated so far and $x_{\mathcal{A}}$ their observed events. The conditional mutual information is $I(Y; X_i \mid x_{\mathcal{A}}) = H(Y \mid x_{\mathcal{A}}) - H(Y \mid X_i, x_{\mathcal{A}})$. Because $H(Y \mid x_{\mathcal{A}})$ is the same for all candidate segments, maximizing the expected conditional mutual information is the dual problem of minimizing the expected conditional entropy $H(Y \mid X_i, x_{\mathcal{A}})$. Therefore, at step $k$, excluding the segments that have already been instantiated, the segment selected is:

[0143] $i^{*} = \arg\min_{i \notin \mathcal{A}} H(Y \mid X_i, x_{\mathcal{A}})$ (2)

[0144] Efficiently calculate the expected conditional entropy

[0145] Formula 2 requires the expected conditional entropy $H(Y \mid X_i, x_{\mathcal{A}})$ of each candidate segment $c_i$, where $x_{\mathcal{A}}$ denotes the events instantiated so far. It is computed as:

[0146] $H(Y \mid X_i, x_{\mathcal{A}}) = \sum_{e \in E} \Pr(X_i = e \mid x_{\mathcal{A}}) \, H(Y \mid X_i = e, x_{\mathcal{A}})$ (3)

[0147] where $E$ is the event space. The conditional entropy $H(Y \mid X_i = e, x_{\mathcal{A}})$ is computed as:

[0148] $H(Y \mid X_i = e, x_{\mathcal{A}}) = -\sum_{y \in \{0, 1\}} \Pr(Y = y \mid X_i = e, x_{\mathcal{A}}) \log \Pr(Y = y \mid X_i = e, x_{\mathcal{A}})$ (4)

[0149] where $Y$ follows a Bernoulli distribution:

[0150] $\Pr(Y = 1 \mid X_i = e, x_{\mathcal{A}}) = p_{i,e}$ (5), $\Pr(Y = 0 \mid X_i = e, x_{\mathcal{A}}) = 1 - p_{i,e}$ (6)

[0151] According to Formula 1, the pattern matching probability is computed by multiplying $n$ transition matrices, which come from a window of $n$ segments. Therefore, to calculate $\Pr(Y = 1 \mid X_i = e, x_{\mathcal{A}})$ and $\Pr(Y = 0 \mid X_i = e, x_{\mathcal{A}})$, let $p_{i,e}$ be the pattern matching probability under the condition $X_i = e$, where:

[0152] $p_{i,e} = \pi_0^{\top} \, L_i \, P_i^{(e)} \, R_i \, \pi_f$ (7)

[0153] Here $\pi_0$ and $\pi_f$ are the indicator vectors defined in Formula 1. $P_i^{(e)}$ is the transition matrix of the current segment of interest $c_i$, conditioned on its label being $e$. $L_i = P_1 \cdots P_{i-1}$ and $R_i = P_{i+1} \cdots P_n$ are the products of the transition matrices of the segments before and after $c_i$, which are derived both from segments that have specific labels (i.e., have been instantiated) and from segments that are still uncertain. For example, consider the window in Figure 5, in which some segments have already been instantiated. The purpose of Formula 2 is to select, among the uncertain segments, the one with the minimum expected conditional entropy. To calculate the expected conditional entropy of a segment $c_i$ according to Formula 3, the entire event space must be considered, i.e., {"RS", "FS", "RT", "LT"}. In Formula 7, $L_i$ is built from the transition matrices of the segments preceding $c_i$, and $R_i$ is built from those of the segments following $c_i$. $P_i^{(e)}$ is the transition matrix constructed for $c_i$ under the condition that $X_i$ equals "RS", "FS", "RT", or "LT".

[0154] Formula 3 requires enumerating all possible events for a specific segment. Formula 7 has a time complexity of $O(n m^3)$, where $m$ is the dimension of the square transition matrices and $n$ is the window size. Denote the size of the event space by $|E|$. The complexity of calculating Formula 3 is therefore $O(|E| n m^3)$, and a naive implementation of Formula 2 requires $O(|E| n^2 m^3)$. Although $m$ is generally very small, $n$ can be very large. Note that, when Formula 3 is calculated for a specific segment $c_i$, the transition matrices of the other segments do not change while the event space of $c_i$ is enumerated. The two products in Formula 7 that cover the segments before and after $c_i$ therefore remain unchanged and only need to be computed once per window, which amortizes the complexity of Formula 7 to $O(m^3)$. The overall complexity of calculating Formula 3 is thus reduced to $O(|E| m^3)$, and the complexity of Formula 2 to $O(|E| n m^3)$, which is linear in the window size.
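The greedy selection with shared prefix and suffix products can be sketched as follows. The five-state DFA is a hypothetical Q2-like automaton, segments are assumed independent so that the surrogate score of a label is used as its probability in Formula 3, and base-2 logarithms are assumed. For brevity the sketch caches prefix and suffix vectors (already multiplied by the indicator vectors) rather than full matrix products.

```python
import math

LABELS = ["RS", "RT", "LT", "OTHER"]
START, SEEN_RS, SEEN_RT, FINAL, DEAD = range(5)
M = 5
DFA = {}  # hypothetical DFA: RS, then immediately RT, then LT at any later point
for _label in LABELS:
    DFA[(START, _label)] = SEEN_RS if _label == "RS" else DEAD
    DFA[(SEEN_RS, _label)] = {"RS": SEEN_RS, "RT": SEEN_RT}.get(_label, DEAD)
    DFA[(SEEN_RT, _label)] = FINAL if _label == "LT" else SEEN_RT
    DFA[(FINAL, _label)] = FINAL
    DFA[(DEAD, _label)] = DEAD


def entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def matrix(scores):
    P = [[0.0] * M for _ in range(M)]
    for (i, label), j in DFA.items():
        P[i][j] += scores.get(label, 0.0)
    return P


def vec_mat(v, P):  # row vector times matrix
    return [sum(v[i] * P[i][j] for i in range(M)) for j in range(M)]


def mat_vec(P, u):  # matrix times column vector
    return [sum(P[i][j] * u[j] for j in range(M)) for i in range(M)]


def next_segment(window, instantiated):
    """Pick the uninstantiated segment with the lowest expected conditional entropy.

    `window[t]` maps labels to probabilities (one-hot if already instantiated);
    `instantiated` is the set of indices that must not be selected again.
    """
    mats = [matrix(scores) for scores in window]
    # prefix[t] = indicator(start)^T * P_0 ... P_{t-1}, computed once per window
    prefix = [[1.0 if i == START else 0.0 for i in range(M)]]
    for P in mats:
        prefix.append(vec_mat(prefix[-1], P))
    # suffix[t] = P_t ... P_{n-1} * indicator(final), computed once per window
    suffix = [[1.0 if i == FINAL else 0.0 for i in range(M)]]
    for P in reversed(mats):
        suffix.append(mat_vec(P, suffix[-1]))
    suffix.reverse()
    best, best_h = None, float("inf")
    for t, scores in enumerate(window):
        if t in instantiated:
            continue
        h = 0.0
        for label, prob in scores.items():
            if prob == 0.0:
                continue
            left = vec_mat(prefix[t], matrix({label: 1.0}))
            p_match = sum(a * b for a, b in zip(left, suffix[t + 1]))
            h += prob * entropy(p_match)
        if h < best_h:
            best, best_h = t, h
    return best, best_h
```

For instance, with `window = [{"RS": 0.9, "OTHER": 0.1}, {"RT": 0.5, "OTHER": 0.5}, {"LT": 1.0}]` and `instantiated = {2}`, the second segment is selected, because knowing its label leaves less expected uncertainty about the match than knowing the label of the first.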

[0155] Multi-window optimization

[0156] The greedy strategy instantiates segments within the window in a near-optimal order and stops as early as possible. However, the strategy is near-optimal only with respect to the window currently being processed. In fact, instantiating a segment in the current window can also affect the early acceptance or rejection of future windows. For example, consider the following query:

[0157] ('rolling stop').minDuration(2) // Query Q3

[0158] .next('right turn').minDuration(2)

[0159] .window(5)

[0160] .f1(0.99)

[0161] .model(r(2+1)d-18, c3d-3)

[0162] In this query, the first event is "rolling stop" (RS) and the second event is "right turn" (RT). Each event must span at least two segments to form a match.

[0163] Now consider the video stream in Figure 6, shown with its oracle labels. None of the first five overlapping windows, $W_1$ to $W_5$, matches the query pattern. If only window $W_1$ is considered, two different segments can each be instantiated first, and either choice rejects the window early at the same cost of one segment inference. However, instantiating one of them also rejects several of the following windows early, whereas instantiating the other rejects only one additional window besides $W_1$. In fact, the instantiation choice made in the current window affects all future windows, including those that do not overlap with it. This is because a window that overlaps with the current window also overlaps with other windows that are further away.

[0164] With this in mind, the present invention enhances the original greedy strategy by also considering the gains for future windows when selecting the next segment to instantiate. However, in a video stream, future events (such as the events contained in segments that have not yet arrived) are not visible at decision time. Therefore, only the potential impact on "partial windows" is considered, that is, sub-windows that overlap with the current window but contain only currently visible events. When deciding which segment to instantiate next while processing a window, one must therefore consider not only the gain for that window but also the potential gains for its partial windows.

[0165] To take partial windows into account during decision-making, an entropy can be derived for each of them, where $Y_j$ indicates whether partial window $W_j$ matches query $Q$. However, these entropies need not be calculated individually. Instead, the query $Q$ can be rewritten as $Q^{*}$ so that the pattern in $Q$ can begin matching at any position within the window. For example, query Q3 can be rewritten as:

[0166] ('true').minDuration(0) // Query Q3*

[0167] .next('rolling stop').minDuration(2)

[0168] .next('right turn').minDuration(2)

[0169] .window(5)

[0170] .f1(0.99)

[0171] .model(r(2+1)d-18, c3d-3)

[0172] The rewritten query essentially captures the semantics of matching the original window and all its corresponding partial windows. It's important to note that considering partial windows is only a heuristic, but empirical results show that this approach consistently delivers some performance improvements in practice.
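The effect of this rewriting can be illustrated with a minimal sketch. It assumes a simplified matching semantics that is not taken from the specification: each pattern step consumes at least its minimum number of consecutive segments, steps follow one another immediately (as with `.next`), matching is anchored at the first segment of the window, and the event name 'true' acts as a wildcard. The function name `match_at_start` and the sample labels are hypothetical; the `window`, `f1`, and `model` clauses of the query are omitted.

```python
def match_at_start(labels, pattern):
    """Return True if `pattern` matches a prefix of `labels`, anchored at index 0.

    `labels` is a list of per-segment labels; `pattern` is a list of
    (event_name, min_duration) steps in strict sequence.
    The event name 'true' is treated as a wildcard (an assumption).
    """
    if not pattern:
        return True
    (name, min_dur), rest = pattern[0], pattern[1:]
    n = 0  # number of segments consumed so far by the first event
    while True:
        # Once the minimum duration is met, try to match the remaining steps.
        if n >= min_dur and match_at_start(labels[n:], rest):
            return True
        # Otherwise the first event must be able to consume one more segment.
        if n >= len(labels) or (name != "true" and labels[n] != name):
            return False
        n += 1


Q3 = [("rolling stop", 2), ("right turn", 2)]
Q3_STAR = [("true", 0)] + Q3  # rewritten query: wildcard prefix of any length

window = ["other", "rolling stop", "rolling stop", "right turn", "right turn"]
anchored = match_at_start(window, Q3)        # pattern must start at segment 0
rewritten = match_at_start(window, Q3_STAR)  # pattern may start anywhere
any_offset = any(match_at_start(window[k:], Q3) for k in range(len(window) + 1))
```

Under these assumptions, a single evaluation of the rewritten query gives the same answer as evaluating the original query on every suffix of the window, which is why the partial windows do not need to be evaluated one by one.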

[0173] Discussion, implementation and optimization

[0174] In one embodiment, the invention is implemented using C++ and Python. Decord is used for video decoding, PyTorch for model training and inference, and NumPy for matrix computation.

[0175] Cross-window transition matrix sharing

[0176] The pattern matching probability is calculated by constructing a non-homogeneous Markov chain (NHMC). For each window, a sequence of transition matrices, whose length depends on the window size, must be constructed to form the NHMC. When the window size is large, constructing these transition matrices becomes computationally expensive. However, because the sliding windows overlap, Bobsled caches and reuses the transition matrices as much as possible. The idea is similar to that in "Efficiently Computing Expected Conditional Entropy," except that it is applied across different windows. This avoids recomputing transition matrices for different windows and thus reduces overhead.
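The caching idea can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: it uses a hypothetical three-state automaton for a toy pattern "event A, later event B" (state 0 waits for A, state 1 waits for B, state 2 is an absorbing match state), assumes one transition matrix per segment built from that segment's surrogate scores, and caches only per-segment matrices. The names `TransitionCache` and `match_probability` are hypothetical, and plain Python lists are used so that the sketch has no dependencies.

```python
class TransitionCache:
    """Builds each segment's transition matrix once and reuses it across windows."""

    def __init__(self, scores):
        self.scores = scores  # scores[t] = (p_A, p_B): surrogate scores of segment t
        self.cache = {}       # segment index -> 3x3 transition matrix
        self.builds = 0       # number of matrices actually constructed

    def matrix(self, t):
        if t not in self.cache:
            p_a, p_b = self.scores[t]
            self.cache[t] = [
                [1.0 - p_a, p_a, 0.0],  # state 0: wait for A
                [0.0, 1.0 - p_b, p_b],  # state 1: wait for B
                [0.0, 0.0, 1.0],        # state 2: matched (absorbing)
            ]
            self.builds += 1
        return self.cache[t]


def match_probability(cache, start, size):
    """Probability of reaching the match state within window [start, start + size)."""
    dist = [1.0, 0.0, 0.0]  # the chain starts in state 0
    for t in range(start, start + size):
        m = cache.matrix(t)
        dist = [sum(dist[i] * m[i][j] for i in range(3)) for j in range(3)]
    return dist[2]


scores = [(0.9, 0.0), (0.0, 0.8), (0.1, 0.1), (0.0, 0.5)]
cache = TransitionCache(scores)
p_first = match_probability(cache, 0, 3)   # window over segments 0..2
p_second = match_probability(cache, 1, 3)  # window over segments 1..3, reuses 1 and 2
```

In this example the two overlapping windows cover six segment positions in total, but only four matrices are built, because the matrices for the two shared segments are reused.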

[0177] Batch inference

[0178] In AEM, the greedy strategy intentionally selects only one segment for processing at each step, thus avoiding overprocessing of unnecessary segments. However, this in turn reduces GPU utilization. To compensate, Bobsled employs batch inference: during each inference iteration, AEM selects segments in batches from concurrent video streams, and the GPU processes them together. Segments needed by multiple queries on the same video stream are shared.
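One way to picture the batching and sharing described above is the following minimal sketch. The function `run_batched_step`, the request tuple layout, and the stand-in `fake_oracle` are assumptions made for illustration only; in an actual embodiment the oracle call would be a batched model inference on the GPU.

```python
def run_batched_step(requests, oracle_batch, batch_size=8):
    """Serve one inference iteration.

    `requests` is a list of (query_id, stream_id, segment_index) tuples, one per
    segment chosen by AEM. `oracle_batch` takes a list of (stream_id,
    segment_index) keys and returns one label per key.
    """
    # Deduplicate: a segment requested by several queries is inferred only once.
    unique, seen = [], set()
    for _, stream_id, segment in requests:
        key = (stream_id, segment)
        if key not in seen:
            seen.add(key)
            unique.append(key)

    # Run the oracle on batches of at most `batch_size` segments.
    labels = {}
    for i in range(0, len(unique), batch_size):
        batch = unique[i:i + batch_size]
        for key, label in zip(batch, oracle_batch(batch)):
            labels[key] = label

    # Hand each requesting query its (possibly shared) result.
    return {(q, s, g): labels[(s, g)] for q, s, g in requests}


calls = []

def fake_oracle(batch):
    """Stand-in for the oracle model; records each batch it receives."""
    calls.append(list(batch))
    return [f"label-{stream}-{segment}" for stream, segment in batch]

requests = [("q1", "cam0", 3), ("q2", "cam0", 3), ("q1", "cam1", 7)]
results = run_batched_step(requests, fake_oracle)
```

Here two queries request the same segment of stream `cam0`, so the oracle receives a single batch of two segments rather than three separate inferences.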

[0179] Cold start and data drift

[0180] Bobsled initially disables PPAT. It collects surrogate scores and oracle labels from the video stream being processed, and enables PPAT once a sufficient number of labels has been obtained. During cold start, the system runs using only AEM. If there are stringent latency constraints, additional computational resources can be allocated as needed. Whenever a significant change in data distribution is detected, Bobsled recalculates the thresholds to accommodate the new distribution.

[0181] In video stream analysis, patterns of interest (e.g., "a rolling stop followed by a right turn") typically span short windows of a few seconds to a few minutes, rather than long windows of several hours. Although Bobsled derives the rejection threshold from the validation set, a given query pattern may have no matches in the validation set, making it impossible to determine an appropriate threshold. In such cases, the present invention currently assumes a rejection threshold of 0 and relies on Adaptive Event Instantiation (AEM). Furthermore, given that action recognition models are still under active development, they may sometimes fail to achieve satisfactory quality. However, better models are rapidly emerging, and Bobsled continues to deliver significant speedups regardless of model accuracy.

[0182] Further embodiments of the present invention are illustrated by the following working examples:

[0183] Evaluation

[0184] Bobsled was evaluated using four real-world datasets. The experiments were conducted on an Intel Xeon W-2123 server equipped with 80 GB of memory and an NVIDIA GeForce RTX 3090 GPU.

[0185] Data

[0186] Table 1 (left) shows details of the four real-world datasets. The BDD100K dataset is a collection of dashcam videos, each approximately 40 seconds long. Previous researchers manually labeled 200 videos in this dataset with five different action labels, and these 200 videos were concatenated into a single video stream. The MERL dataset consists of 106 videos, filmed with a fixed overhead camera, of people shopping in a grocery store. Each video is approximately 2 minutes long, and the videos were concatenated into a single video stream. This dataset also includes manually labeled actions. In addition, two other video streams, Cooking and DailyLife, were collected from YouTube. These videos have no manually labeled actions; for these datasets, ten action labels identified by an oracle model were used as ground truth in the experiments.

[0187] Table 1

[0188]

[0189] Models

[0190] Table 1 (right) provides details of the oracle and surrogate models used in the experiments. All models were pre-trained. For the first two datasets, R(2+1)D-18 was used as the oracle model and C3D-3 as the surrogate model. R(2+1)D-18 has 18 layers, while C3D-3 has 3 layers. Each of these two datasets has its own models, pre-trained on its respective training data. The table also lists model accuracy and processing speed (in frames per second). For the YouTube datasets, the pre-trained X-CLIP-B/14 was used as the oracle model and X-CLIP-B/32 as the surrogate model, because these videos have no domain-specific pre-trained models. Since X-CLIP is a general-purpose zero-shot model, it is also more computationally expensive, as reflected in its processing speed (frames per second). Because the YouTube videos have no manually labeled actions, the accuracies of X-CLIP-B/14 and X-CLIP-B/32 listed in the table are taken from the Kinetics-400 results in the X-CLIP paper.

[0191] Queries

[0192] Table 2 lists the queries used in the evaluation. The queries were constructed so that each returns meaningful results. Each dataset has two benchmark queries, with pattern lengths ranging from two to five events. The table also reports query selectivity.

[0193]

[0194] Baselines

[0195] We constructed four baselines to evaluate the system:

[0196] • Full oracle localization (full oracle). In this baseline, the oracle model is used to localize all events, and then Flink's CEP engine is applied to match patterns.

[0197] • Full surrogate localization (full surrogate). Same as above, except that the surrogate model is used instead of the oracle model.

[0198] • Localizing events of interest with Zeus. Here, the state-of-the-art action localization system Zeus first localizes only the events specified in the query, and Flink's CEP engine is then used to match the query pattern. Zeus also accepts a target F1 score as input to balance speed and accuracy.

[0199] • Query-Specific Model Filtering (QMF). Finally, a model-driven approach can be considered, in which a query-specific model is trained for each query. This model takes a window of frames as input and outputs the probability that the window matches the specific query pattern. Windows with a probability below a rejection threshold are discarded. The model is built on the surrogate models (i.e., C3D-3 and X-CLIP-B/32), with two additional layers, and is fine-tuned using query-specific training data. Its training method is similar to that of the surrogate model in Bobsled, and the rejection threshold is calibrated in the same way as in PPAT. All frames in the remaining windows are further processed by the oracle model, and matches are identified using Flink's CEP engine.

[0200] Both Zeus and QMF's query-specific models require offline training. For the BDD100K and MERL datasets, the data were divided into training, validation, and test sets according to the proportions suggested in previous studies. For the YouTube data, the data were divided evenly into training, validation, and test sets.

[0201] Evaluation metrics

[0202] This section reports the system's performance in terms of throughput (frames per second) and accuracy (F1 score). Because F1 scores differ significantly across datasets and queries, accuracy is expressed as the F1 score relative to the full oracle baseline. In calculating the F1 score, we follow the approach used in Zeus and use Intersection over Union (IoU) as the criterion for true positives. IoU is a widely used metric for evaluating the agreement between predicted intervals and ground truth intervals in temporal action localization. A match is considered a true positive when its IoU with a match from the full oracle baseline is greater than 0.5. Therefore, a relative accuracy of 1 indicates that, under this IoU criterion, the system's results are identical to those of the full oracle.
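The metric described above can be sketched as follows. The sketch assumes that matches are represented as half-open (start, end) intervals and that each reference match can be paired with at most one predicted match, chosen greedily by highest IoU; the pairing rule and the function names `temporal_iou` and `relative_f1` are assumptions for illustration and are not specified in the text.

```python
def temporal_iou(a, b):
    """Intersection over Union of two intervals given as (start, end)."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def relative_f1(predicted, reference, iou_threshold=0.5):
    """F1 score of `predicted` matches against `reference` (full oracle) matches."""
    unmatched = list(reference)
    true_positives = 0
    for p in predicted:
        # Pair the prediction with the best still-unmatched reference interval.
        best = max(unmatched, key=lambda r: temporal_iou(p, r), default=None)
        if best is not None and temporal_iou(p, best) > iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


oracle_matches = [(1, 10), (40, 50)]
system_matches = [(0, 10), (20, 30)]
score = relative_f1(system_matches, oracle_matches)
```

In this example the first predicted interval overlaps a reference interval with an IoU of 0.9 and counts as a true positive, while the second overlaps nothing, so precision and recall are both 0.5.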

[0203] Default configuration

[0204] By default, 16 frames are considered a segment, approximately half a second in length. The batch size is set to 8 to fully utilize the GPU. The default window size for queries and the F1 target value are 60 segments and 0.99, respectively.

[0205] End-to-end comparison

[0206] Figure 7 shows the performance of Bobsled and the other systems in terms of throughput and relative accuracy. As the reference baseline, the full oracle achieves an accuracy of 1 but also has the lowest throughput. In contrast, the full surrogate baseline consistently achieves the highest throughput but the worst accuracy. Table 1 (right) illustrates why: pre-trained surrogate models tend to process frames an order of magnitude faster than their corresponding oracle models, but at the cost of varying degrees of accuracy loss. Consistent with the main results of previous studies, a solution based solely on a surrogate model cannot achieve satisfactory accuracy. Since the relative accuracy of the full surrogate baseline is close to 0, it is not discussed further.

[0207] As a query-specific solution, QMF achieves good relative accuracy for certain queries (Q1, Q2, Q3, Q5, Q8). However, it suffers a significant accuracy loss for other queries (Q4, Q6), and its accuracy on query Q7 is almost unacceptable. Moreover, QMF's throughput is not significantly better than that of the full oracle, except for the queries with substantial accuracy losses (i.e., Q6 and Q7). This is because query-specific models, given their lightweight architecture and the limited number of positive samples (matches) in the training set, struggle to distinguish windows that match from those that do not. Therefore, building a query-specific model as an early filter does not appear to be an effective approach.

[0208] Zeus behaves almost the opposite of QMF. Compared to the full oracle, Zeus shows a significant throughput improvement but also a significant decrease in accuracy. This is because the F1 targets are fundamentally different: Zeus's F1 target applies to individual action events, whereas in this setting the F1 target applies to the overall query pattern.

[0209] In contrast, Bobsled consistently maintained accuracy close to that of the full oracle across all queries. In fact, Bobsled achieved 100% relative accuracy for all queries it ran. Bobsled's throughput was 2.3x to 5.7x higher than the full oracle's and 2.1x to 5.7x higher than QMF's. Compared to Zeus, Bobsled's throughput was 1.2x to 4.0x higher, except for Q4, where Zeus's throughput was slightly higher than Bobsled's, but at an unacceptable loss of accuracy.

[0210] Ablation study

[0211] An ablation study was conducted to examine the effectiveness of the various techniques in Bobsled. Figure 8 shows the percentage of video frames that the expensive oracle model ultimately instantiates for each query.

[0212] For reference, the inventors included (a) the instantiation rate of the full oracle, which is always 1 (i.e., all frames are instantiated using the oracle model). The inventors then reported (b) the percentage of frames remaining after filtering with PPAT when AEM is disabled; (c) the percentage of frames the oracle model had to instantiate under AEM when PPAT is disabled; and (d) the percentage of frames the oracle model had to instantiate when both PPAT and AEM are enabled.

[0213] As shown in the figure, the pruning effect of PPAT varies with the dataset and query. For certain queries and datasets (such as Q3 and Q7), PPAT can discard 10% to 39% of frames early. Queries Q5 and Q8 are exceptions because they have no matches in the validation set; in these cases, Bobsled assumes a rejection threshold of 0 and relies entirely on Adaptive Event Instantiation (AEM). For query Q4, PPAT performs poorly because the surrogate model is poor at detecting the "check shelves" event specified in the query. This results in a low rejection threshold, which limits PPAT's ability to filter out video windows. Nevertheless, PPAT's limited pruning ability is acceptable in such cases as long as its overhead is low, because Bobsled is designed so that AEM performs further pruning. The pruning effect of AEM is independent of PPAT: AEM alone reduces the number of instantiated frames by 59% to 86% across all queries. When PPAT and AEM are used together (i.e., Bobsled), only 8.8% (Q3) to 31.3% (Q1) of all video frames are instantiated.

[0214] The pruning capabilities of PPAT and AEM are not free. Figure 9 therefore shows a breakdown of Bobsled's processing time for each query. AEM has stronger pruning capability than PPAT, but it also requires more computation. Nevertheless, their computational overhead is worthwhile overall: together they consume only a small fraction of the total processing time while significantly reducing the number of frames processed by the oracle model.

[0215] Impact of F1 target

[0216] Figure 10 shows the average throughput and relative accuracy of Bobsled over all queries under different F1 accuracy targets. As expected, Bobsled's accuracy improves as the accuracy target becomes stricter, at the cost of reduced throughput. More importantly, when the F1 target is above 0.9 (which most users prefer), Bobsled's accuracy and throughput gains remain high and stable.

[0217] Impact of pattern length

[0218] Figure 11 shows the throughput and accuracy of Bobsled as the pattern length of Q4 varies. Q4 was chosen because it has the longest pattern (length 5) in the query set, so the pattern length can be controlled by removing events from the query. As the figure shows, Bobsled's throughput increases with pattern length, while accuracy remains near 100%. A longer pattern is more selective, which gives Bobsled's PPAT and AEM more opportunities to prune windows early and thus improves throughput.

[0219] The effect of window size

[0220] Figure 12 shows the average throughput and accuracy of Bobsled over all queries as the window size increases from 30 segments to 300 segments. Unsurprisingly, throughput decreases as the window size increases, because a larger window requires processing more segments. However, when the window size increases tenfold from 30 to 300 segments, throughput drops by less than 50%, indicating that Bobsled remains effective.

[0221] Impact of instantiation strategy

[0222] Figure 13 shows the average throughput and accuracy of Bobsled under different Adaptive Event Instantiation (AEM) strategies. The experiments tested: (i) a sequential strategy, i.e., processing the segments within the window one by one from the beginning; (ii) a random strategy, i.e., randomly selecting one segment within the window each time; (iii) an interval strategy, i.e., selecting segments at a fixed step size that depends on the pattern length, which ensures that if the first instantiated segments do not match any event specified in the query pattern, the window of interest can be rejected at the lowest cost and the maximum number of windows can be skipped; (iv) a greedy strategy, i.e., Bobsled's greedy strategy without multi-window optimization (MWO); and (v) greedy strategy + multi-window optimization, i.e., Bobsled's greedy strategy with multi-window optimization. As the figure shows, the greedy strategy is more effective than the other basic strategies, and multi-window optimization further improves the pruning ability. Note that the instantiation strategy does not affect accuracy; therefore, relative accuracy is not reported here.
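The three basic strategies can be pictured as different orderings of the segment indices within a window, as in the minimal sketch below. The function names are hypothetical, and the `stride` parameter of the interval ordering is an assumption: the text states only that the step size is related to the pattern length, so no particular value is implied here. The greedy strategies are not sketched, since they depend on the entropy computation described elsewhere in the specification.

```python
import random


def sequential_order(window_size):
    """Strategy (i): instantiate segments from the start of the window."""
    return list(range(window_size))


def random_order(window_size, seed=None):
    """Strategy (ii): instantiate segments in a random order."""
    order = list(range(window_size))
    random.Random(seed).shuffle(order)
    return order


def interval_order(window_size, stride):
    """Strategy (iii): visit every `stride`-th segment first, then fill the gaps.

    `stride` is a hypothetical parameter standing in for the step size that
    the text relates to the pattern length.
    """
    order = []
    for offset in range(stride):
        order.extend(range(offset, window_size, stride))
    return order


example = interval_order(6, 3)
```

Each function returns a permutation of the segment indices; an AEM loop would instantiate segments in that order and stop as soon as the window can be accepted or rejected.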

[0223] Conclusion

[0224] In summary, this invention provides a novel video stream processing system designed to effectively support complex event queries. Bobsled employs probabilistic pattern matching and adaptive event instantiation techniques to avoid processing unnecessary video segments. Experimental results show that Bobsled can significantly improve processing speed without significantly reducing accuracy.

[0225] The present invention is not limited to the foregoing embodiments and drawings. It will be apparent to those skilled in the art that various substitutions, modifications, and alterations can be made without departing from the scope of the present invention.

Claims

1. A system for identifying complex event patterns in real-time video analytics data input, characterized in that it comprises: a query compiler; a surrogate model; a first inference module; and a second inference module.

2. The system for identifying complex event patterns in real-time video analytics data input according to claim 1, characterized in that the query compiler is configured to automatically infer the minimum duration of every event in the query for which no minimum duration is specified, and to convert the query into a nondeterministic finite automaton (NFA).

3. The system for identifying complex event patterns in real-time video analytics data input according to claim 1, characterized in that the surrogate model is configured to predict surrogate scores for the data input, which the first inference module uses to estimate the probability that a window matches the query pattern.

4. The system for identifying complex event patterns in real-time video analytics data input according to claim 1, characterized in that the first inference module is a probabilistic pattern matching (PPAT) module configured to filter out windows that are highly unlikely to match the query pattern.

5. The system for identifying complex event patterns in real-time video analytics data input according to claim 1 or 4, characterized in that the first inference module is a discrete-time non-homogeneous Markov chain (NHMC) capable of processing non-homogeneous data input.

6. The system for identifying complex event patterns in real-time video analytics data input according to claim 4, characterized in that the first inference module estimates the probability that a window matches the query pattern based on the nondeterministic finite automaton provided by the query compiler and the surrogate scores predicted by the surrogate model, thereby filtering out windows that are highly unlikely to match the query pattern.

7. The system for identifying complex event patterns in real-time video analytics data input according to claim 1, characterized in that the second inference module is an Adaptive Event Instantiation (AEM) module configured to adaptively instantiate events within the windows passed by the first inference module until it can be determined whether a match is guaranteed or impossible, and then to return the windows that are guaranteed to contain a match.

8. The system for identifying complex event patterns in real-time video analytics data input according to claim 1 or 7, characterized in that the second inference module further includes an event instantiation strategy.

9. The system for identifying complex event patterns in real-time video analytics data input according to claim 8, characterized in that the second inference module sequentially selects and instantiates events within the windows passed by the first inference module.

10. The system for identifying complex event patterns in real-time video analytics data input according to claim 8, characterized in that the event instantiation strategy includes, but is not limited to, a greedy strategy.

11. The method for identifying complex event patterns in real-time video analytics data input according to claim 10, characterized in that the greedy strategy is calculated using the following formula:

12. The system for identifying complex event patterns in real-time video analytics data input according to claim 1 or 9, characterized in that the first inference module further caches the transition matrices and reuses them to reduce the computational overhead of future query matching.

13. The system for identifying complex event patterns in real-time video analytics data input according to claim 1, characterized in that the data input includes, but is not limited to, decoded video data streams obtained from multiple sources.

14. The system for identifying complex event patterns in real-time video analytics data input according to claim 1, characterized in that the surrogate model is pre-trained using real-time data input.

15. The system for identifying complex event patterns in real-time video analytics data input according to any of the preceding claims, characterized in that the system is capable of processing data input both individually and in batches, and is thereby scalable.

16. A method for identifying complex event patterns in real-time video analytics data input, characterized in that it comprises the following steps: converting a query into a nondeterministic finite automaton (NFA) using a query compiler; predicting surrogate scores using a surrogate model; estimating the probability that a window matches the query pattern; filtering out windows that are unlikely to match the query pattern; adaptively instantiating events within a window until it can be determined whether a match is guaranteed or impossible; and returning a window that is guaranteed to contain a match.

17. The method for identifying complex event patterns in real-time video analytics data input according to claim 16, characterized in that the step of using a query compiler to convert the query into a nondeterministic finite automaton (NFA) further includes: obtaining a minimum event duration measured offline from a given validation set, taken as the minimum value measured from the labels; and converting the query into a nondeterministic finite automaton (NFA).

18. The method for identifying complex event patterns in real-time video analytics data input according to claim 16, characterized in that the step of estimating the probability that the window matches the query pattern further includes: obtaining a deterministic finite automaton (DFA) from the initial nondeterministic finite automaton (NFA) using the powerset construction algorithm; constructing the first inference module based on the deterministic finite automaton (DFA) and the surrogate scores, wherein the first inference module is a discrete-time non-homogeneous Markov chain (NHMC); and calculating the pattern matching probability using the discrete-time non-homogeneous Markov chain (NHMC).

19. The method for identifying complex event patterns in real-time video analytics data input according to claim 18, characterized in that the probability that the window matches the query pattern is calculated using the following formula:

20. The method for identifying complex event patterns in real-time video analytics data input according to claim 16, characterized in that the step of filtering out windows that are unlikely to match the query pattern further includes deriving a rejection threshold.

21. The method for identifying complex event patterns in real-time video analytics data input according to claim 16 or 20, characterized in that the step of filtering out windows that are unlikely to match the query pattern further includes prioritizing the filtering of negative windows and deferring the remaining processing to the second inference module.

22. The method for identifying complex event patterns in real-time video analytics data input according to claim 21, characterized in that the method further includes performing further inference on windows that cannot be filtered out.

23. The method for identifying complex event patterns in real-time video analytics data input according to claim 22, characterized in that the step of further inference further includes using information entropy to unify the decision-making process so that it covers both the case in which the window matches the query and the case in which it does not.

24. The method for identifying complex event patterns in real-time video analytics data input according to claim 22, characterized in that the step of performing further inference on windows that cannot be filtered out further includes selecting segments sequentially and instantiating the selected segments one at a time in a cost-effective manner.

25. The method for identifying complex event patterns in real-time video analytics data input according to claim 22, characterized in that the step of performing further inference on windows that cannot be filtered out further includes: at each step, selecting the video segment with the highest expected conditional mutual information; and considering future windows in the decision-making process.