An intelligent video analysis method based on large model scheduling and a storage medium
By using large-scale model scheduling and dynamic selection of the optimal model for video analysis, the problems of poor model versatility and insufficient adaptability in existing technologies are solved, achieving efficient and accurate video analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN XINGHUAN HENGYU INFORMATION TECH CO LTD
- Filing Date
- 2025-03-25
- Publication Date
- 2026-06-23
AI Technical Summary
Existing intelligent video analytics technologies suffer from poor model versatility, low analysis efficiency, and insufficient adaptability to complex scenarios, resulting in low efficiency and insufficient accuracy when processing complex and ever-changing video data.
An intelligent video analysis method based on large model scheduling is adopted. By collecting video data, preprocessing and extracting features, dynamically selecting the optimal model for analysis, and performing result fusion and online incremental learning, the system's adaptability and computing resource utilization are improved.
It significantly improves the accuracy and flexibility of video analytics, enabling it to quickly adapt to changes in different scenarios and tasks, improve the utilization of computing resources and analysis efficiency, and reduce misjudgments and omissions.
Abstract
Description
Technical Field
[0001] This invention relates to the field of video analytics technology, and in particular to an intelligent video analytics method and storage medium based on large model scheduling.
Background Technology
[0002] With the widespread application of video surveillance technology in numerous fields, such as security, transportation, and industrial production, the demand for intelligent video analytics is growing rapidly. Traditional intelligent video analytics methods typically rely on small models designed for specific tasks and scenarios. These models often exhibit limitations when processing complex and ever-changing real-world video data. On the one hand, a single model cannot simultaneously meet the needs of multiple analysis tasks, leading to the deployment of multiple models, increasing system complexity and resource consumption. On the other hand, when video scenes or tasks change, adjusting and updating traditional models is difficult, making it impossible to quickly adapt to new situations. Furthermore, when facing large-scale video data, traditional computing architectures and model collaboration methods cannot fully utilize computing resources, further limiting the efficiency and accuracy of video analytics.
[0003] Therefore, the applicant proposes an intelligent video analysis method based on large model scheduling, which aims to solve the problems of poor model versatility, low analysis efficiency, and insufficient adaptability to complex scenarios in existing intelligent video analysis technologies.
Summary of the Invention
[0004] This invention proposes an intelligent video analysis method based on large model scheduling, which solves the problems of poor model versatility, low analysis efficiency, and insufficient adaptability to complex scenes in existing intelligent video analysis technologies. The technical solution of this invention is implemented as follows:
[0005] An intelligent video analysis method based on large model scheduling includes: acquiring raw video data and preprocessing it to segment the continuous video stream into a series of video segments with independent analytical value; for each video segment, extracting image features and spatiotemporal features in parallel, quantizing and normalizing the features, and screening highly relevant features to reduce redundancy; selecting the optimal model from a large model library based on video content description reports, task priorities, and model performance indicators, and dynamically allocating the model to appropriate nodes according to the resource status of computing nodes; calling the selected large model for deep analysis, outputting target location, category, confidence level, and behavior description results, and generating a structured analysis report, including event time, location, abnormal behavior labels, and confidence level information; integrating the output results of different models through a model fusion method to eliminate conflicts and noise; collecting inaccurately identified samples, manually labeling them, and inputting them into an incremental learning module, using knowledge distillation or adaptive parameter update strategies to update model parameters and structure while retaining the original performance, thereby improving the system's adaptability.
[0006] As a preferred technical solution, preprocessing includes: using Gaussian filtering to remove noise interference in the video image, and using histogram equalization technology to enhance the contrast and brightness of the image, making the target object clearer and more identifiable, laying a good foundation for subsequent feature extraction and analysis; it also includes segmenting the continuous video stream into a series of video segments with independent analytical value based on the timestamps of the video data and key event detection algorithms.
[0007] As a preferred technical solution, in traffic monitoring scenarios, video segments are divided according to the time points when vehicles pass through specific intersections or road sections; on industrial production lines, segments are divided according to the product's production cycle or the completion nodes of key processes.
[0008] As a preferred technical solution, video content feature extraction is achieved through the following methods: a feature extraction module based on a deep convolutional neural network (CNN) utilizes convolutional kernels of different sizes and multi-layer convolutional pooling structures to perform multi-scale feature extraction on video images, obtaining rich low-level and high-level features; a spatiotemporal feature extraction module based on a recurrent neural network (RNN) and its variants models the temporal dimension information in the video sequence; then, the features output by different feature extraction modules are quantized and normalized to ensure that their numerical ranges are uniform and comparable; and finally, a subset of features with high relevance to the current video analysis task is selected by using information gain, chi-square test, or Relief algorithms.
[0009] As a preferred technical solution, the specific method for generating a video content description report is as follows:
[0010] Step S1: Target recognition and localization. The system automatically detects the main targets in the video, marks their positions and classifies them. At the same time, it generates a credibility score for each recognition result to reflect the accuracy of the recognition.
[0011] Step S2: Scene classification. Based on the overall characteristics of the video footage, the scene is classified into a specific type to help understand the background environment of the target.
[0012] Step S3: Behavioral analysis. Analyze the target's behavioral patterns, record the duration and dynamic trends of the behavior, and label abnormal behaviors and their probabilities to provide a preliminary explanation of the behavior.
[0013] Step S4: Additional information extraction. Extract additional information, including the video's shooting time, location, weather conditions, and contextual information, to enhance the report's completeness.
[0014] Step S5: Structured integration. Organize the above information into tables or lists to clearly present the target category, location, behavior description, confidence level, and scenario type data, providing a unified basis for subsequent analysis.
[0015] As a preferred technical solution, the optimal model should be selected from the large model library according to the following method: For tasks requiring high-precision target detection and classification, a large target detection model based on the Transformer architecture with high-resolution feature extraction capabilities and a large amount of target category training data should be scheduled; for complex behavior analysis tasks, a large behavior recognition model based on 3D convolutional neural networks that has been extensively trained on multiple behavior patterns and has good generalization ability should be selected.
[0016] As a preferred technical solution, the specific methods for behavior description and probability estimation are as follows:
[0017] Step a) Behavioral Feature Extraction and Encoding: A deep learning-based feature extraction model performs multi-level feature extraction on the target behavior in the video clip. The extracted behavioral features are quantified and encoded, converting them into numerical vectors that can be processed by a computer for subsequent model analysis and computation.
[0018] Step b) Behavior classification and recognition: Using a pre-trained behavior classification model, the encoded behavior feature vector is input into the model. The model classifies and recognizes the target behavior based on the various behavior patterns and feature distributions it has learned. For each possible behavior category, the model calculates its corresponding probability score.
[0019] Step c) Behavior description generation: Based on the behavior classification results and probability scores, and combined with predefined behavior description templates, generate detailed behavior description statements.
[0020] Step d) Uncertainty handling and supplementary information: When the probability distribution of behavior is relatively dispersed, that is, no behavior category has a significantly high probability, the model will reflect this uncertainty in the behavior description; at the same time, the model will also combine other information in the video to supplement and improve the behavior description, so as to provide richer behavior analysis results.
[0021] As a preferred technical solution, the specific implementation steps of the result fusion processing are as follows:
[0022] Step a) Data preparation and standardization: Collect the analysis results of different large models on the same video segment, including the target's location information, target category labels, behavior recognition results, and related confidence scores; standardize the output results of different models to ensure the consistency and comparability of data formats;
[0023] Step b) Fusion based on probability statistics, including target location fusion, target category fusion, and behavior recognition result fusion;
[0024] Step c) Deep learning model-assisted fusion: Construct a deep learning model specifically for result fusion, use the output results of different models as the input features of the fusion model, train the fusion model on a large amount of labeled video data, and enable it to learn the optimal fusion method between the results of different models.
[0025] Step d) Conflict resolution and result optimization: Conflicts are resolved using rule-based methods or further data analysis.
[0026] Step e) Output the fusion result. After the above steps, the final fusion result is obtained, including the fused target location, category, behavior and other information. This result is then organized into a unified format for output for subsequent application processing.
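As an illustration of the probability-based fusion in step b), the following minimal sketch fuses detections that different models produced for the same target. It assumes the detections have already been matched to one target (for example by bounding-box overlap) and standardized into dictionaries; the field names `bbox`, `category`, and `confidence`, and the choice of confidence-weighted averaging and voting, are assumptions made for this example and are not specified by the source.

```python
from collections import defaultdict

def fuse_boxes(detections):
    """Confidence-weighted average of boxes given as (x_min, y_min, x_max, y_max)."""
    total = sum(d["confidence"] for d in detections)
    return tuple(
        sum(d["confidence"] * d["bbox"][i] for d in detections) / total
        for i in range(4)
    )

def fuse_labels(detections, key="category"):
    """Confidence-weighted vote; returns the winning label and its share of the votes."""
    votes = defaultdict(float)
    for d in detections:
        votes[d[key]] += d["confidence"]
    label = max(votes, key=votes.get)
    return label, votes[label] / sum(votes.values())

# Hypothetical outputs of three models for the same target.
detections = [
    {"bbox": (10, 10, 50, 50), "category": "car", "confidence": 0.9},
    {"bbox": (14, 10, 54, 50), "category": "truck", "confidence": 0.3},
    {"bbox": (10, 10, 50, 50), "category": "car", "confidence": 0.6},
]
fused_box = fuse_boxes(detections)
fused_label, fused_score = fuse_labels(detections)
```

The same weighted vote could be reused for behavior labels by passing a different `key`. The learned fusion model of step c) would replace these fixed rules with trained weights.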
[0027] As a preferred technical solution, during video analysis, the online monitoring module continuously evaluates and verifies the analysis results of the large model. The collected incremental data will be manually labeled and preprocessed before entering the online incremental learning module. After online incremental learning, the updated large model will be reinvested in the video analysis task, continuously improving the system's intelligence and adaptability, and ensuring that it can cope with the ever-changing video data and analysis task requirements.
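The monitoring-and-labeling loop described above can be sketched as a small buffer that queues low-confidence results for manual labeling and releases labeled batches to the incremental learner. The class name, the confidence threshold, and the batch size are hypothetical choices for this example; the source does not state how inaccurate results are detected.

```python
class IncrementalSampleBuffer:
    """Collects low-confidence results, then hands labeled batches to a learner."""

    def __init__(self, confidence_threshold=0.5, batch_size=2):
        self.confidence_threshold = confidence_threshold
        self.batch_size = batch_size
        self.pending = {}   # sample_id -> predicted label awaiting manual review
        self.labeled = []   # (sample_id, corrected_label) pairs ready for training

    def observe(self, sample_id, predicted_label, confidence):
        """Queue a result for manual labeling when the model is not confident."""
        if confidence < self.confidence_threshold:
            self.pending[sample_id] = predicted_label
            return True
        return False

    def add_label(self, sample_id, corrected_label):
        """Record the human-provided label for a queued sample."""
        if sample_id in self.pending:
            del self.pending[sample_id]
            self.labeled.append((sample_id, corrected_label))

    def next_batch(self):
        """Return a batch once enough labels exist, otherwise None."""
        if len(self.labeled) < self.batch_size:
            return None
        batch = self.labeled[:self.batch_size]
        self.labeled = self.labeled[self.batch_size:]
        return batch
```

A batch returned by `next_batch()` would then be passed to whatever update strategy is in use, such as knowledge distillation or adaptive parameter updates.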
[0028] A non-transitory storage medium for storing a program that executes the above-described intelligent video analytics method based on large model scheduling.
[0029] Compared with existing technologies, this solution has the following advantages:
[0030] (1) Significantly improves the accuracy of video analysis: The dynamic large model scheduling strategy can accurately select the most suitable model based on the real-time characteristics of the video content, ensuring that the most appropriate analysis methods can be used in various complex scenarios, thereby greatly reducing the probability of misjudgment and missed judgment. The multi-level feature fusion and adaptive adjustment mechanism fully explores the rich information in video data. By reasonably allocating weights to different levels of features, the model can understand the video content more comprehensively and accurately, thereby improving the accuracy of tasks such as target detection and behavior recognition. For example, in industrial production quality inspection, it can more accurately identify the subtle defects and assembly problems of products, reducing the outflow of defective products. Distributed computing and model collaborative optimization enable the various models to complement and cooperate with each other in the analysis process. Through information interaction and collaborative work, the accuracy of the overall analysis is further improved. For example, in traffic scenarios, the collaboration of multiple models can more accurately judge the driving intention of vehicles and traffic conditions, reducing the misjudgment of traffic accidents.
[0031] (2) Significantly enhanced system flexibility and adaptability: Based on different application scenarios and diverse task requirements, it flexibly schedules suitable models from a rich, large model library to easily cope with various complex and changing situations. Whether in security, transportation, industrial production, or other fields, it can quickly adapt to new tasks and scenario changes, meeting the personalized needs of different users. It supports online incremental learning and model update functions, enabling real-time collection of newly emerging video data and inaccurately identified samples, timely updating the model's parameters and structure to quickly adapt to new targets, behaviors, and scenarios, maintaining efficient analysis capabilities for constantly changing video data, and consistently providing accurate and up-to-date analysis results.
[0032] (3) Effectively improves computing resource utilization and analysis efficiency: Distributed computing architecture divides video data for parallel processing, avoiding the resource bottleneck of traditional centralized computing, giving full play to the advantages of cluster computing resources, and greatly improving data processing speed. For example, in large-scale video surveillance systems, video streams from multiple cameras can be processed simultaneously to ensure real-time requirements. Model co-optimization algorithms reduce redundant calculations and resource waste through efficient information sharing and collaborative work among models, further improving the utilization of computing resources and overall analysis efficiency. For example, in intelligent transportation systems, multiple related models work together to quickly process large amounts of traffic video data, provide timely feedback on traffic conditions, and provide strong support for traffic management.
Attached Figure Description
[0033] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0034] Figure 1 is a flowchart of an intelligent video analysis method based on large model scheduling according to the present invention.
Detailed Implementation
[0035] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.
[0036] Referring to Figure 1, this invention proposes an intelligent video analysis method based on large model scheduling, which includes the following steps: video data acquisition and preprocessing, video content feature extraction and analysis, large model dynamic scheduling and task allocation, video analysis and result generation based on large model, result fusion and post-processing, and online incremental learning and model updating. This method solves the problems of poor model versatility, low analysis efficiency, and insufficient adaptability to complex scenarios in existing intelligent video analysis technologies.
[0037] The specific steps are as follows:
[0038] 1. Video data acquisition and preprocessing:
[0039] 1) Video data is collected using cameras distributed in different locations with varying perspectives and parameter settings to ensure comprehensive coverage of the monitored area and diverse information collection. Optimized data transmission protocols ensure that the collected video data is transmitted to the data processing center in real time, guaranteeing data integrity and timeliness.
[0040] 2) At the data processing center, preliminary preprocessing operations are performed on the received video data, including but not limited to noise reduction, filtering, image enhancement, and color correction. For example, Gaussian filtering is used to remove noise interference from the video image, and histogram equalization technology is used to enhance the contrast and brightness of the image, making the target object clearer and more distinguishable, laying a good foundation for subsequent feature extraction and analysis.
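The two preprocessing operations named above can be sketched in plain Python for a single grayscale frame. This is a minimal sketch: the frame is assumed to be a list of rows of integer intensities in 0..255, and the kernel size, sigma, and edge handling (replicating border pixels) are assumptions, since the source does not specify them. A production system would more likely call an image processing library.

```python
import math

def gaussian_kernel(size=3, sigma=1.0):
    """Return a normalized size x size Gaussian kernel (size must be odd)."""
    half = size // 2
    kernel = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
               for x in range(-half, half + 1)]
              for y in range(-half, half + 1)]
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]

def gaussian_filter(image, size=3, sigma=1.0):
    """Smooth a grayscale image, replicating edge pixels at the border."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr = min(max(r + dy, 0), h - 1)
                    cc = min(max(c + dx, 0), w - 1)
                    acc += kernel[dy + half][dx + half] * image[rr][cc]
            row.append(round(acc))
        out.append(row)
    return out

def equalize_histogram(image, levels=256):
    """Spread intensities over the full range using the cumulative histogram."""
    hist = [0] * levels
    for row in image:
        for v in row:
            hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in image]
    return [[round((cdf[v] - cdf_min) / (total - cdf_min) * (levels - 1))
             for v in row]
            for row in image]
```

Applying `gaussian_filter` first and `equalize_histogram` second follows the order in which the text lists them: noise is suppressed before contrast is stretched, so that noise is not amplified by the equalization.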
[0041] 3) Based on the timestamps of the video data and key event detection algorithms, the continuous video stream is segmented into a series of video segments with independent analytical value. For example, in traffic monitoring scenarios, video segments can be divided according to the time points when vehicles pass through specific intersections or road sections; on industrial production lines, they can be segmented according to the product's production cycle or the completion nodes of key processes.
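A minimal sketch of the segmentation step, assuming that a key event detection algorithm has already produced a list of event times (for example, the moments a vehicle passes an intersection). The function name and the rule "start a new segment at each event time" are assumptions for illustration; the source does not define the exact boundary rule.

```python
def segment_by_events(frame_times, event_times):
    """Split sorted frame timestamps into segments, starting a new one at each event."""
    boundaries = sorted(event_times)
    segments, current, b = [], [], 0
    for t in frame_times:
        # Close the current segment once a frame reaches the next event boundary.
        while b < len(boundaries) and t >= boundaries[b]:
            if current:
                segments.append(current)
                current = []
            b += 1
        current.append(t)
    if current:
        segments.append(current)
    return segments

# Frames at one-second intervals, with events detected at t=2 and t=4.
segments = segment_by_events([0, 1, 2, 3, 4, 5], [2, 4])
```

On a production line, `event_times` would instead hold the completion times of key processes, with no change to the function.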
[0042] 2. Video content feature extraction and analysis:
[0043] 1) For each video segment, multiple parallel feature extraction modules are launched simultaneously. The feature extraction module, based on a deep convolutional neural network (CNN), utilizes convolutional kernels of different sizes and multi-layer convolutional pooling structures to perform multi-scale feature extraction on the video image, obtaining rich low-level and high-level features. For example, 3x3, 5x5, and 7x7 convolutional kernels are used to extract local detail features, medium-scale features, and global semantic features of the image, respectively, and a series of pooling operations are used to reduce the resolution of the feature maps while retaining key information.
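To make the multi-scale idea concrete, the sketch below applies 3x3, 5x5, and 7x7 kernels to one grayscale frame and follows each with 2x2 max pooling. It is only a structural illustration: a real CNN learns its kernel weights and stacks many layers, whereas this example uses fixed averaging kernels as placeholders, and all function names are hypothetical.

```python
def convolve_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation, as in CNNs)."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(w - k + 1)]
            for r in range(h - k + 1)]

def max_pool(feature_map, size=2):
    """Halve the resolution by keeping the maximum of each size x size block."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, w - size + 1, size)]
            for r in range(0, h - size + 1, size)]

def multi_scale_features(image, kernel_sizes=(3, 5, 7)):
    """Return one pooled feature map per kernel size."""
    features = {}
    for k in kernel_sizes:
        # Averaging kernel used as a stand-in for learned weights.
        kernel = [[1.0 / (k * k)] * k for _ in range(k)]
        features[k] = max_pool(convolve_valid(image, kernel))
    return features

frame = [[1.0] * 8 for _ in range(8)]
features = multi_scale_features(frame)
```

Larger kernels see a wider neighborhood per output value, which is why the text associates 3x3 with local detail and 7x7 with more global structure.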
[0044] 2) A spatiotemporal feature extraction module based on Recurrent Neural Networks (RNNs) and their variants (such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs)) models the temporal dimension information in video sequences. By processing video frames sequentially, it captures spatiotemporal features such as the motion trajectory of targets, speed changes, temporal patterns of behavior, and interactions between targets. For example, in pedestrian behavior analysis, an LSTM network can be used to memorize the walking path and speed changes of pedestrians, thereby determining the pedestrian's behavioral intentions, such as whether they are loitering, running, or walking normally.
[0045] 3) Quantize and normalize the features output by different feature extraction modules to ensure uniformity and comparability of their numerical ranges. Then, using feature selection methods such as information gain, chi-square test, or Relief algorithm, select a subset of features highly relevant to the current video analysis task. This reduces subsequent computational load and interference from redundant information, improving the model's analytical efficiency and accuracy.
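The normalization and selection step can be illustrated with min-max scaling followed by information gain ranking. This is a sketch under stated assumptions: features are given as named columns of numbers, continuous values are discretized into equal-width bins before computing entropy, and the bin count and function names are choices made for this example rather than details from the source.

```python
import math
from collections import Counter

def min_max_normalize(column):
    """Scale values to the range 0..1 (all zeros if the column is constant)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels, bins=2):
    """Entropy reduction obtained by splitting the labels on the binned feature."""
    binned = [min(int(v * bins), bins - 1) for v in min_max_normalize(column)]
    n = len(labels)
    conditional = 0.0
    for b in set(binned):
        subset = [y for x, y in zip(binned, labels) if x == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def select_top_features(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    gains = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]

# Hypothetical features for four pedestrian tracks.
columns = {"speed": [0.1, 0.2, 0.9, 1.0], "noise": [0.5, 0.1, 0.4, 0.2]}
labels = ["walk", "walk", "run", "run"]
selected = select_top_features(columns, labels, k=1)
```

Here `speed` separates the two classes perfectly, while `noise` carries no information about the label, so only `speed` is kept.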
[0046] 3. Dynamic scheduling and task allocation for large models:
[0047] 1) After the video content feature extraction and analysis are completed, the real-time video content analysis system quickly performs a comprehensive and in-depth interpretation of the feature information of the video segments. It uses target detection algorithms to determine the main target types and quantities in the video, scene classification algorithms to determine the environmental scene category of the video, and behavior analysis algorithms to preliminarily identify the behavioral patterns and trends of the targets. Based on this information, it generates a detailed video content description report.
[0048] The specific method for generating a video content description report is as follows:
[0049] First, for the object detection part, the system accurately determines the location information of each object appearing in the video, marking the specific location of each object in the video frame in the form of bounding box coordinates, and also identifying its category, such as identifying the object as a pedestrian, vehicle (further subdivided into cars, trucks, motorcycles, etc.), animal, or specific industrial product. Furthermore, for each object, a confidence score is generated, reflecting the model's certainty about the object recognition result, presented as a percentage, to help subsequent analysis judge the reliability of the object recognition result. For example, an object identified as a car has a confidence score of 90%, indicating that the model has a high degree of confidence in this recognition result.
[0050] Secondly, in terms of scene classification, the system uses a pre-trained scene classification model to accurately categorize the environment of the video into specific types, such as indoor office scenes, outdoor park scenes, urban street traffic scenes, and industrial production workshop scenes. This classification process is based not only on the overall features of the image but also on various factors such as lighting conditions, background elements, and target distribution. For example, for a video clip with numerous traffic signs, road markings, and frequent vehicle traffic, the system will accurately classify it as an urban street traffic scene.
[0051] Furthermore, the behavior analysis module delves into the behavioral patterns and trends of targets in the video. For pedestrians, it determines whether they are walking, running, standing, sitting, or loitering, and whether they are performing specific actions such as waving, jumping, or fighting. For vehicles, it analyzes their direction of travel (east, south, west, north, etc.), speed changes (acceleration, deceleration, constant speed), compliance with traffic rules (running red lights, illegal lane changes, etc.), and interactions with other targets (such as overtaking and yielding between vehicles, and pedestrians crossing with vehicles). Simultaneously, it assigns corresponding behavior tags and durations to each behavior, such as "pedestrian loitering at an intersection for 10 seconds" or "vehicle accelerating on a main road for 5 seconds," to more clearly describe the target's behavioral dynamics.
[0052] In addition, the system analyzes and records other important elements in the video, such as the video's shooting time and location (associated with the camera's geographic location information), weather conditions (where they can be inferred from the footage, such as rain, sunshine, or snow), and visual features like light intensity and color distribution. This additional information further enriches the video content description report, providing more comprehensive contextual support for subsequent large-scale model scheduling and video analysis.
[0053] Finally, all the above information is integrated to generate a structured video content description report. The report is presented in a clear and easy-to-understand format, such as a table. Each row records the target's ID, category, location, confidence level, behavior label, behavior duration, as well as scene category, shooting time, location, weather conditions, and light intensity. This provides accurate and detailed information for subsequent large-scale model selection and task allocation, ensuring that the entire intelligent video analysis system can make optimal decisions based on the actual situation of the video content, thereby improving the accuracy and efficiency of video analysis.
[0054] Such video content description reports can comprehensively reflect the key information of video segments, making the system more accurate and efficient in large model scheduling and task allocation, giving full play to the advantages of each large model, and improving the performance and effectiveness of the entire intelligent video analysis method.
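One possible shape for the structured report is sketched below, with one record per target plus scene-level context. The class and field names are assumptions derived from the columns listed in the text (target ID, category, location, confidence level, behavior label, behavior duration, scene category, shooting time, location, weather conditions, light intensity); the source does not prescribe a concrete data format.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Tuple

@dataclass
class TargetRecord:
    target_id: int
    category: str
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    confidence: float                # 0..1
    behavior_label: str
    behavior_duration_s: float

@dataclass
class ContentReport:
    scene_category: str
    shooting_time: str
    location: str
    weather: str
    light_intensity: float
    targets: List[TargetRecord] = field(default_factory=list)

    def rows(self):
        """Flatten into one table row per target, repeating the scene-level context."""
        context = {k: v for k, v in asdict(self).items() if k != "targets"}
        return [{**asdict(t), **context} for t in self.targets]

# Hypothetical report for one traffic segment.
report = ContentReport(
    scene_category="urban street traffic",
    shooting_time="2025-03-25T08:00:00",
    location="camera 12",
    weather="rain",
    light_intensity=0.4,
    targets=[TargetRecord(1, "car", (10, 20, 110, 80), 0.9, "accelerating", 5.0)],
)
table = report.rows()
```

Each dictionary in `table` corresponds to one row of the table described in the text, which a scheduler can then query by scene category or target category.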
[0055] 2) Based on the video content description report, combined with the preset task priority list and model performance evaluation metrics, the intelligent scheduling algorithm quickly searches and matches within the large model library to select the most suitable large model for the current video analysis task. Each large model in the model library is equipped with detailed metadata information such as performance metrics, applicable scenarios, and input / output requirements, enabling the scheduling algorithm to accurately select a model. For example, for tasks requiring high-precision object detection and classification, a large object detection model based on the Transformer architecture with high-resolution feature extraction capabilities and a large amount of target category training data is scheduled; for complex behavior analysis tasks, a large behavior recognition model based on a 3D convolutional neural network that has been extensively trained on multiple behavior patterns and has good generalization ability is selected.
[0056] 3) During the task allocation phase, considering the load balancing and real-time requirements of computing resources, an algorithm based on dynamic resource allocation and task queue scheduling is adopted to assign selected large models to appropriate computing nodes. The resource status of computing nodes (including CPU and GPU utilization, memory capacity, network bandwidth, etc.) is fed back to the scheduling algorithm through a real-time monitoring system to ensure that each large model can run in a resource-sufficient and stable computing environment, fully leveraging its performance advantages and improving the overall response speed and analysis efficiency of the system.
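A minimal sketch of priority-queue scheduling with resource-aware node selection. The node and task fields, the rule "choose the eligible node whose busiest processor is least loaded", and the way utilization is updated after an assignment are all assumptions for this example; the source names the inputs (CPU and GPU utilization, memory, bandwidth) but not the algorithm.

```python
import heapq

def pick_node(nodes, required_memory_gb):
    """Choose the least-loaded node with enough free memory, or None."""
    eligible = [n for n in nodes if n["free_memory_gb"] >= required_memory_gb]
    if not eligible:
        return None
    # Load is the worse of CPU and GPU utilization, both in 0..1.
    return min(eligible, key=lambda n: max(n["cpu_util"], n["gpu_util"]))

def schedule(tasks, nodes):
    """Assign tasks in priority order (lower number means more urgent)."""
    queue = [(t["priority"], i, t) for i, t in enumerate(tasks)]
    heapq.heapify(queue)
    assignments = {}
    while queue:
        _, _, task = heapq.heappop(queue)
        node = pick_node(nodes, task["memory_gb"])
        if node is None:
            assignments[task["name"]] = None  # no node can host it right now
            continue
        # Reserve the resources so later tasks see the updated status.
        node["free_memory_gb"] -= task["memory_gb"]
        node["gpu_util"] = min(1.0, node["gpu_util"] + task["gpu_cost"])
        assignments[task["name"]] = node["name"]
    return assignments

nodes = [
    {"name": "n1", "cpu_util": 0.2, "gpu_util": 0.5, "free_memory_gb": 16},
    {"name": "n2", "cpu_util": 0.3, "gpu_util": 0.1, "free_memory_gb": 8},
]
tasks = [
    {"name": "behavior", "priority": 2, "memory_gb": 12, "gpu_cost": 0.4},
    {"name": "detect", "priority": 1, "memory_gb": 6, "gpu_cost": 0.4},
]
assignments = schedule(tasks, nodes)
```

In a live system the node dictionaries would be refreshed from the real-time monitoring system rather than updated locally.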
[0057] 4. Video analysis and result generation based on large models:
[0058] 1) The preprocessed and feature-extracted video data is input into the selected large-scale model. Based on its pre-trained parameters and complex network architecture, the large-scale model performs in-depth analysis and reasoning on the video features. For example, the object detection large-scale model accurately determines the location, category, and confidence score of objects in the video through multi-layer convolution and fully connected operations, while outputting the object's feature vector for subsequent model interaction and analysis; the behavior recognition large-scale model utilizes its learned behavior patterns and semantic information to classify and label the behavior of objects, determine whether it belongs to normal or abnormal behavior, and provide detailed behavior descriptions and probability estimates.
[0059] The specific methods for behavioral description and probability estimation are as follows:
[0060] a) Behavioral feature extraction and encoding:
[0061] Deep learning-based feature extraction models (such as 3D convolutional neural networks or convolutional networks combined with spatiotemporal attention mechanisms) perform multi-level feature extraction on target behavior in video clips. These features not only cover changes in the appearance of the target (such as human posture, limb movements, vehicle steering angle, etc.), but also the target's trajectory, speed changes, and interaction features with the surrounding environment and other targets over time (such as changes in distance between people, relative positional relationships between vehicles and traffic facilities, etc.).
[0062] The extracted behavioral features are quantified and encoded, transforming them into numerical vectors that can be processed by computers for subsequent model analysis and computation. For example, for human behavior, information such as joint position changes, limb movement direction, and speed might be encoded into fixed-length vectors, with each dimension representing a specific behavioral feature parameter.
[0063] b) Behavior classification and recognition:
[0064] By using a pre-trained behavior classification model, encoded behavior feature vectors are input into the model. Based on the various behavior patterns and feature distributions it has learned, the model classifies and identifies target behaviors. For example, in security monitoring scenarios, the model can distinguish between normal behaviors such as walking, standing, and talking, and abnormal behaviors such as fighting, running, and intrusion.
[0065] For each possible behavior category, the model calculates its corresponding probability score. This process is typically achieved through the model's output layer using a softmax function. The softmax function converts the model's predicted scores for each behavior category into a probability distribution, ensuring that the sum of the probabilities for all behavior categories is 1. For example, if the model identifies that a person's behavior in a video may belong to three categories: "walking normally," "running," and "loitering," after softmax calculation, the probability of "walking normally" is 0.6, the probability of "running" is 0.2, and the probability of "loitering" is 0.2.
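The softmax step can be written in a few lines. The input scores below are chosen for illustration so that the output reproduces the 0.6 / 0.2 / 0.2 example in the text; the source does not give the underlying scores.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for "walking normally", "running", "loitering" (illustrative values).
categories = ["walking normally", "running", "loitering"]
probabilities = softmax([math.log(3), 0.0, 0.0])
by_category = dict(zip(categories, probabilities))
```

Because exp(ln 3) = 3 and exp(0) = 1, the three exponentials are in the ratio 3 : 1 : 1, which normalizes to 0.6, 0.2, and 0.2.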
[0066] c) Behavior description generation:
[0067] Based on the behavior classification results and probability scores, combined with predefined behavior description templates, detailed behavior descriptions are generated. For example, if the model determines that the target behavior belongs to the "running" category and has a high probability (e.g., above 0.8), the generated behavior description might be: "The target is running at a relatively fast speed in the video footage. The probability of running is estimated at 0.85. Its behavior shows obvious signs of panic, suggesting a possible emergency." For complex behavioral scenarios, such as interactions between multiple people, the model further analyzes the behavior of each target and the relationships between them, generating more detailed and accurate descriptions, such as: "Three people are engaged in a violent physical conflict within the video area. One person exhibits aggressive behavior, while the other two are in a defensive state. The probability of the conflict is estimated at 0.9."
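Template-based description generation can be sketched as a lookup keyed by the most probable category. The template wording, the fallback template, and the function name are hypothetical; the source says only that predefined templates are combined with the classification result and its probability.

```python
# Hypothetical templates; {p} is filled with the estimated probability.
TEMPLATES = {
    "running": ("The target is running in the video footage. "
                "The probability of running is estimated at {p:.2f}."),
    "fighting": ("The target is involved in a physical conflict. "
                 "The probability of the conflict is estimated at {p:.2f}."),
}
DEFAULT_TEMPLATE = ("The target's behavior is classified as {label} "
                    "with an estimated probability of {p:.2f}.")

def describe_behavior(probabilities):
    """Fill the template of the most probable behavior category."""
    label = max(probabilities, key=probabilities.get)
    template = TEMPLATES.get(label, DEFAULT_TEMPLATE)
    return template.format(label=label, p=probabilities[label])

sentence = describe_behavior({"running": 0.85, "walking normally": 0.15})
```

Richer statements, such as the multi-person conflict example, would need templates that take several targets and their roles as parameters.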
[0068] d) Uncertainty handling and supplementary information:
[0069] When the probability distribution of behaviors is relatively dispersed, meaning that no single behavior category has a significantly high probability, the model will reflect this uncertainty in the behavior description. For example, if the probabilities of the three behavior categories "normal walking," "slow movement," and "brief pause" are 0.35, 0.3, and 0.35, respectively, the behavior description might be: "The target's behavior in the video footage has a certain degree of uncertainty. The probabilities of normal walking, slow movement, and brief pause are relatively close, approximately 0.35, 0.3, and 0.35, respectively. Further observation is needed to clarify its behavioral intention."
[0070] At the same time, the model will also combine other information from the video, such as the target's appearance (wearing a specific uniform may imply their professional identity, thus affecting the interpretation of behavior) and the scene environment (behavior in dangerous areas may have a higher risk meaning), to supplement and improve the behavior description, so as to provide richer, more accurate and practically meaningful behavior analysis results.
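Steps c) and d) can be illustrated with a small template-based generator. The sketch below is an assumption-laden example: the function name, the wording of the templates, and the `margin` threshold are hypothetical, and only the 0.8 threshold echoes the example in the text.

```python
def describe_behavior(probs, high=0.8, margin=0.1):
    """Fill a description template from a {behavior: probability} mapping.

    `high` and `margin` are illustrative thresholds (assumptions).
    Expects at least two behavior categories.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top, p1), (_, p2) = ranked[0], ranked[1]
    if p1 >= high:
        return f"The target is {top} (estimated probability {p1:.2f})."
    if p1 - p2 < margin:
        # Dispersed distribution: report the uncertainty instead of one label.
        listing = ", ".join(f"{name} {p:.2f}" for name, p in ranked)
        return ("The target's behavior is uncertain; the candidate "
                f"probabilities are close ({listing}). "
                "Further observation is needed.")
    return f"The target is most likely {top} (estimated probability {p1:.2f})."

confident = describe_behavior({"running": 0.85, "walking normally": 0.15})
dispersed = describe_behavior(
    {"normal walking": 0.35, "slow movement": 0.30, "brief pause": 0.35}
)
print(confident)
print(dispersed)
```

A production system would additionally merge appearance and scene information into the description, which this sketch leaves out.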
[0071] Through the above steps, the system can provide detailed behavioral descriptions and probability estimates for target behaviors in videos, providing key decision-making basis and information support for video analysis applications in fields such as security monitoring, intelligent transportation, and industrial production. This helps users to more accurately understand the events and behaviors occurring in videos, promptly identify potential problems or anomalies, and take corresponding measures.
[0072] 2) During the analysis process, the large model also generates intermediate results and auxiliary information, which can be used by subsequent processing modules or other related models. For example, when the semantic segmentation large model segments a video scene, it will simultaneously output the semantic category of each pixel and the boundary information of the region. This information is of great importance for target localization, scene understanding, and subsequent behavior analysis.
[0073] 3) Finally, the large model generates a detailed video analysis report based on the analysis results, including key information such as the target's location, category, behavioral tags, the time and location of the event, and the confidence level of the analysis results, so as to facilitate subsequent result fusion and application processing.
[0074] 5. Result fusion and post-processing:
[0075] 1) Since different large models may produce multiple analysis results for the same video segment, result fusion processing is required. The specific implementation steps are as follows:
[0076] a) Data preparation and standardization:
[0077] Collect the analysis results of different large models on the same video segment. These results may include the target's location information (represented in coordinate form, such as the coordinates of the top left and bottom right corners of the bounding box), target category labels (such as "pedestrian", "vehicle", "animal", etc.), behavior recognition results (such as "walking", "running", "standing", etc.), and related confidence scores (usually a value between 0 and 1, indicating the degree of certainty of the model for the result).
[0078] The outputs of different models should be standardized to ensure data format consistency and comparability. For example, if some models output target position coordinates as relative coordinates with the image center as the origin, while other models output absolute coordinates, they need to be uniformly converted to the same coordinate system. For confidence scores, if there are different calculation methods or value ranges, they also need to be normalized to ensure that they are all within the same 0 to 1 range.
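As a sketch of this standardization step, the following example converts center-origin relative boxes to absolute pixel coordinates and rescales confidence scores to the 0 to 1 range. The exact relative-coordinate convention is not specified in the text; the code assumes fractions of the frame size measured from the image center, and the function names are hypothetical.

```python
def center_relative_to_absolute(box, width, height):
    """Map (x1, y1, x2, y2), given as fractions of the frame size measured
    from the image center, to pixel coordinates with a top-left origin."""
    x1, y1, x2, y2 = box
    return ((x1 + 0.5) * width, (y1 + 0.5) * height,
            (x2 + 0.5) * width, (y2 + 0.5) * height)

def rescale_confidence(score, low, high):
    """Linearly map a score from [low, high] onto [0, 1], clamping overflow."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (score - low) / (high - low)))

# A box centered in a 1920x1080 frame, and a model scoring on a 0-100 scale.
abs_box = center_relative_to_absolute((-0.25, -0.25, 0.25, 0.25), 1920, 1080)
conf = rescale_confidence(87, 0, 100)
print(abs_box, conf)
```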
[0079] b) Fusion based on probability statistics:
[0080] Target location fusion: For the location information of the same target detected by multiple models, a weighted average method is used for fusion. Each model is assigned a corresponding weight based on its performance evaluation metrics (such as average accuracy and recall in the target localization task). For example, if model A has a high accuracy of 90% in previous target localization tests, it is assigned a higher weight, such as 0.6; model B has an accuracy of 80%, and is assigned a weight of 0.4. When calculating the fused target location coordinates, the target location coordinates output by each model are multiplied by its corresponding weight, and then summed to obtain the final fused location coordinates.
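The weighted average described above can be sketched as follows. The bounding boxes are hypothetical; the weights 0.6 and 0.4 are the ones from the example, and the function divides by the weight sum so that unnormalized weights also work.

```python
def fuse_boxes(boxes, weights):
    """Weighted average of (x1, y1, x2, y2) boxes from several models."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return tuple(
        sum(w * box[i] for box, w in zip(boxes, weights)) / total
        for i in range(4)
    )

box_a = (100.0, 50.0, 200.0, 150.0)  # model A, weight 0.6
box_b = (110.0, 60.0, 210.0, 160.0)  # model B, weight 0.4
fused_box = fuse_boxes([box_a, box_b], [0.6, 0.4])
print(fused_box)
```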
[0081] Target category fusion: The prediction results of each model for the target category are statistically analyzed, and a fusion decision is made based on their confidence scores. A common method is majority voting: if a majority of models predict a certain target category, that category is used as the fused target category. Confidence scores are also considered: if a model has a very high confidence score for a certain category (e.g., above 0.9), that prediction is given a larger weight in the overall judgment even if the category does not have a majority vote. For example, if three models predict the target as "pedestrian," "vehicle," and "pedestrian" respectively, where the model predicting "vehicle" has a confidence score of only 0.6 and the two models predicting "pedestrian" have confidence scores of 0.8 and 0.7, then the fused target category is determined to be "pedestrian."
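One possible way to combine majority voting with confidence scores is a confidence-weighted vote, sketched below. The text does not prescribe a formula: the `boost` multiplier for very confident predictions is an assumption, and only the 0.9 threshold and the example predictions come from the description.

```python
from collections import defaultdict

def fuse_category(predictions, high_conf=0.9, boost=2.0):
    """Pick a category from (label, confidence) pairs.

    Each vote counts in proportion to its confidence; votes above
    `high_conf` are multiplied by `boost` (a hypothetical factor).
    """
    scores = defaultdict(float)
    for label, conf in predictions:
        scores[label] += conf * (boost if conf > high_conf else 1.0)
    return max(scores, key=scores.get)

preds = [("pedestrian", 0.8), ("vehicle", 0.6), ("pedestrian", 0.7)]
fused_label = fuse_category(preds)
print(fused_label)
```

With these inputs the "pedestrian" votes total 1.5 against 0.6 for "vehicle", so the majority wins; a single prediction above 0.9 could outweigh two weak majority votes.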
[0082] Behavior recognition result fusion: Similar to target category fusion, this involves statistically analyzing the recognition results and confidence scores of each model for the target behavior. A probability distribution-based fusion method can be used, such as a weighted sum of the behavior category probability distributions output by each model, to obtain a fused behavior probability distribution. The behavior category with the highest probability is then selected as the final fused behavior result. The temporal continuity and logical consistency of the behavior are also considered. If a behavior was identified as "walking" by most models at the previous moment of a video segment, and some models predict other behaviors at the current moment that have a certain logical continuity with "walking" (such as "accelerating walking" or "turning while walking"), then the fusion process will favor the category related to the behavior at the previous moment.
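The distribution-level fusion with a temporal-continuity preference might look like the sketch below. The additive `bonus` for behaviors related to the previous moment is an assumed mechanism (the text only says such categories are favored), and all probabilities are hypothetical.

```python
def fuse_behavior(distributions, weights, previous=None, related=None, bonus=0.1):
    """Weighted sum of per-model behavior distributions.

    `previous` is the behavior fused at the previous moment; `related` maps a
    behavior to the set of behaviors considered logically continuous with it.
    Favored behaviors receive `bonus` before renormalization (an assumption).
    """
    total = sum(weights)
    fused = {}
    for dist, w in zip(distributions, weights):
        for behavior, p in dist.items():
            fused[behavior] = fused.get(behavior, 0.0) + (w / total) * p
    if previous is not None:
        favored = {previous} | set((related or {}).get(previous, ()))
        for behavior in fused:
            if behavior in favored:
                fused[behavior] += bonus
        norm = sum(fused.values())
        fused = {b: p / norm for b, p in fused.items()}
    return max(fused, key=fused.get), fused

dists = [{"walking": 0.5, "running": 0.5}, {"walking": 0.4, "running": 0.6}]
no_prior, _ = fuse_behavior(dists, [0.5, 0.5])
with_prior, fused_dist = fuse_behavior(
    dists, [0.5, 0.5], previous="walking",
    related={"walking": {"accelerating walking", "turning while walking"}},
    bonus=0.2,
)
print(no_prior, with_prior)
```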
[0083] c) Deep learning model-assisted fusion:
[0084] Construct a deep learning model specifically for result fusion, such as a multilayer perceptron (MLP) or a convolutional neural network (CNN). Use the outputs of different models (including target location, category, behavior, and corresponding confidence scores) as input features for this fusion model.
[0085] The fusion model is trained on a large amount of labeled video data, enabling it to learn the optimal fusion method between different model results. For example, through training, the model can learn which model results are more reliable in certain scenarios, and how to adjust fusion weights and decision strategies based on the output features of different models to obtain more accurate fusion results.
[0086] The trained fusion model can further optimize and integrate the analysis results of different large models in practical applications, improving the accuracy and robustness of result fusion. It is especially suitable for situations where there are large differences or high uncertainties in the results of multiple models in complex scenarios.
[0087] d) Conflict resolution and outcome optimization:
[0088] During the fusion process, some conflicts may arise, such as significant differences in the two models' predictions of the target location, or completely different but similarly confident classifications of the target category. These conflicts are resolved using rule-based methods or further data analysis. For example, if the difference in the target location predictions between the two models exceeds a certain threshold (determined based on video resolution and target size), the feature extraction process and analysis results of both models are re-examined. This may reveal that one model has false positives or inaccurate feature extraction, thus eliminating its result. For target category conflicts, if a clear judgment cannot be made using confidence scores and majority voting, contextual information from the video (such as the surrounding environment of the target, the categories of other related targets, etc.) is used for auxiliary judgment.
[0089] The fused results are optimized and post-processed to remove potential noise and redundant information. For example, if outliers appear in the target location coordinates after fusion (such as those outside the video frame), they are corrected or removed. For some low-confidence target detection results or behavior recognition results, if their impact on the overall analysis results is small and there is some uncertainty, their weight can be appropriately reduced or they can be ignored directly to improve the accuracy and reliability of the final fusion results.
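A minimal sketch of the rule-based checks in this step is given below: a center-distance test for location conflicts, clipping of out-of-frame coordinates, and removal of low-confidence results. The function names, the pixel threshold, and the minimum confidence are hypothetical; the text only states that the threshold depends on video resolution and target size.

```python
import math

def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def boxes_conflict(box_a, box_b, threshold):
    """True when the two box centers are more than `threshold` pixels apart."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(ax - bx, ay - by) > threshold

def clip_box(box, width, height):
    """Clamp a box to the frame; return None if nothing remains inside it."""
    x1, y1, x2, y2 = box
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        return None
    return (x1, y1, x2, y2)

def drop_low_confidence(results, min_conf):
    """Keep only (item, confidence) pairs at or above `min_conf`."""
    return [(item, c) for item, c in results if c >= min_conf]

print(boxes_conflict((0, 0, 10, 10), (100, 100, 110, 110), threshold=50))
print(clip_box((-20, 10, 50, 2000), 1920, 1080))
```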
[0090] e) Output of fusion results:
[0091] After the above steps, the final fusion result is obtained, including information such as the fused target location, category, and behavior. This information is then formatted and output in a unified format for subsequent application processing. For example, a structured data list is output, where each element contains the target's unique identifier, fused location coordinates, final target category, behavior description, and corresponding confidence score. This provides accurate, reliable, and consistent decision-making support for subsequent video analytics applications (such as security incident alarms, traffic flow statistics, and industrial production quality inspection).
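The structured output list could, for instance, be represented as below. The field names and the JSON encoding are assumptions made for illustration; the text only lists which pieces of information each element contains.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class FusedTarget:
    target_id: str      # unique identifier of the target
    box: tuple          # fused (x1, y1, x2, y2) location coordinates
    category: str       # final target category
    behavior: str       # behavior description
    confidence: float   # confidence score of the fused result

results = [
    FusedTarget("t-001", (104.0, 54.0, 204.0, 154.0), "pedestrian", "walking", 0.82),
]
payload = json.dumps([asdict(r) for r in results], indent=2)
print(payload)
```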
[0092] 2) Based on specific application requirements, further post-processing is performed on the final fusion output. In the field of security monitoring, if the analysis results indicate abnormal behavior or security incidents, the system automatically generates alarm information and sends relevant video clips and analysis reports to the monitoring personnel's terminal devices via SMS, email, or dedicated monitoring software. Simultaneously, it links with other security systems (such as access control systems and alarm systems) for emergency response. In industrial production quality inspection, the analysis results are compared with preset quality standards to generate a product quality inspection report, including information such as the type, location, and severity of product defects. This report is then fed back to the production control system for adjustment and optimization of the production process.
[0093] 6. Online incremental learning and model updates:
[0094] 1) During video analysis, the online monitoring module continuously evaluates and verifies the analysis results of the large model. By comparing them with manually labeled standard answers or accurate results in historical data, it identifies targets, behaviors, or video segments with new features that were not accurately identified. For example, in intelligent transportation systems, if a new type of vehicle is incorrectly identified as another type of vehicle, or if a new traffic violation is not detected by the existing model, these video samples will be automatically collected.
[0095] The collected incremental data will undergo manual annotation and preprocessing before entering the online incremental learning module. This module employs advanced incremental learning algorithms, such as knowledge distillation-based methods or adaptive parameter update strategies, to transfer knowledge from the new data into the existing large model while avoiding negative impacts on the performance of the original model.
[0096] 2) During training, the model's parameters and structure are adaptively adjusted based on the characteristics of the new data and the model's current state. For example, for newly emerging target categories, corresponding neurons are added to the last layer of the model, and the new parameters are trained and updated through the backpropagation algorithm so that the model can accurately identify these new targets; for new behavioral patterns, the parameters of the feature extraction and classifier parts of the intermediate layers of the model are adjusted to enhance the model's ability to understand and recognize new behaviors.
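Adding neurons to the last layer for a new target category can be sketched as extending the output weight matrix by one row per new class while leaving the existing rows untouched. The example below uses plain Python lists and hypothetical weights; training the new rows by backpropagation, which a real system would do with a deep learning framework, is not shown.

```python
import random

def add_output_classes(weights, biases, n_new, scale=0.01, seed=0):
    """Return (weights, biases) extended with `n_new` output neurons.

    `weights` holds one row per class. Existing rows are copied unchanged;
    new rows get small random values (an illustrative initialization).
    """
    rng = random.Random(seed)
    dim = len(weights[0])
    new_rows = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(n_new)]
    return [row[:] for row in weights] + new_rows, list(biases) + [0.0] * n_new

def logits(weights, biases, features):
    """Linear output layer: one score per class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # two existing classes
b = [0.1, -0.1]
W2, b2 = add_output_classes(W, b, n_new=1)

x = [1.0, 2.0, 3.0]
old_scores, new_scores = logits(W, b, x), logits(W2, b2, x)
print(len(old_scores), len(new_scores))
```

Because the original rows are preserved, the scores for the previously learned classes are identical before and after the extension, which is the starting point for training the new neuron without disturbing the old ones.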
[0097] 3) After online incremental learning, the updated large model is redeployed to video analysis tasks, continuously improving the system's intelligence and adaptability and ensuring that it can cope with ever-changing video data and analysis task requirements.
[0098] Beneficial effects:
[0099] 1) Significantly improves the accuracy of video analytics:
[0100] Dynamic large-scale model scheduling strategies can accurately select the most suitable model based on the real-time characteristics of video content, ensuring that the most appropriate analysis methods can be used in various complex scenarios, thereby greatly reducing the probability of false alarms and missed detections. For example, in security monitoring, the analysis of crowd behavior can more accurately identify abnormal behavior and avoid false alarms or missed dangers due to model mismatch.
[0101] Multi-level feature fusion and adaptive adjustment mechanisms fully tap into the rich information in video data. By rationally allocating weights to features at different levels, the model can more comprehensively and accurately understand video content, thereby improving the accuracy of tasks such as object detection and behavior recognition. For example, in industrial production quality inspection, it can more accurately identify subtle defects and assembly problems in products, reducing the number of defective products.
[0102] Distributed computing and model co-optimization enable various models to complement and collaborate during the analysis process, further improving the accuracy of the overall analysis through information exchange and collaborative work. For example, in traffic scenarios, multiple models working together can more accurately determine vehicle driving intentions and traffic conditions, reducing misjudgments of traffic accidents.
[0103] 2) Significantly enhances system flexibility and adaptability:
[0104] This solution can flexibly select suitable models from a rich model library based on different application scenarios and diverse task requirements, easily handling various complex and ever-changing situations. Whether in security, transportation, industrial production, or other fields, it can quickly adapt to new tasks and scenario changes, meeting the personalized needs of different users.
[0105] It supports online incremental learning and model update functions, which can collect newly emerging video data and inaccurately identified samples in real time, and update the model parameters and structure in a timely manner, so that it can quickly adapt to new targets, behaviors and scenarios, maintain efficient analysis capabilities for constantly changing video data, and always provide accurate and up-to-date analysis results.
[0106] 3) Effectively improves computing resource utilization and analysis efficiency:
[0107] Distributed computing architectures divide and process video data in parallel, avoiding the resource bottlenecks of traditional centralized computing, fully leveraging the advantages of cluster computing resources, and greatly improving data processing speed. For example, in large-scale video surveillance systems, video streams from multiple cameras can be processed simultaneously, ensuring real-time requirements.
[0108] Model co-optimization algorithms reduce redundant calculations and resource waste through efficient information sharing and collaborative work among models, further improving the utilization of computing resources and overall analysis efficiency. For example, in intelligent transportation systems, multiple related models working together can quickly process large amounts of traffic video data, provide timely feedback on traffic conditions, and offer strong support for traffic management.
[0109] The following is an explanation using a specific embodiment:
[0110] Example: Urban security monitoring system
[0111] In a modern city's security monitoring network, a large number of high-definition cameras are deployed, covering various key areas of the city, such as commercial centers, residential communities, and transportation hubs. The intelligent video analysis method of this invention is applied to this security monitoring system, and the specific implementation process is as follows:
[0112] 1. Video data acquisition and preprocessing:
[0112] The cameras capture video data at a high frame rate and transmit it in real time to the server cluster in the monitoring center via a wired network. The servers apply Gaussian and median filtering to the received video data in real time to remove noise caused by factors such as light flicker and wind-induced disturbance. At the same time, an adaptive histogram equalization algorithm is used to enhance the images, improving the contrast and clarity of target objects for subsequent feature extraction and analysis.
[0114] Based on urban traffic flow and population activity patterns, the video data is divided into segments of 30 seconds each to ensure that each segment can reflect relatively complete scene information and dynamic population behavior.
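The fixed 30-second segmentation in this embodiment can be sketched as follows. The function name is hypothetical, and the example works on timestamps only; cutting the actual video stream is outside its scope.

```python
def split_into_segments(duration_s, segment_s=30.0):
    """Return (start, end) times in seconds that cover [0, duration_s)."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        # The last segment is shortened if the stream ends early.
        segments.append((start, min(start + segment_s, duration_s)))
        start += segment_s
    return segments

segments = split_into_segments(95)
print(segments)
```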
[0115] 2. Video content feature extraction and analysis:
[0116] For each video segment, a CNN-based feature extraction module is activated, using convolutional kernels of different sizes (such as 3x3, 5x5, and 7x7) to extract multi-scale features from the video images, obtaining texture, color, and shape features. Simultaneously, an LSTM-based spatiotemporal feature extraction module models the motion trajectories, speed changes, and behavioral patterns of people and vehicles in the video sequence, capturing their time-series features.
[0117] The extracted features are normalized so that their values are within the range of [0,1]. Then, a feature selection algorithm based on information gain is used to select a subset of features closely related to the security monitoring task, such as facial features, clothing features, behavioral features of personnel, and vehicle model, color, and driving direction features.
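The normalization and information-gain selection described above can be sketched as below. The example assumes discrete feature values (continuous features would first need to be binned) and uses tiny hypothetical data; the function names are illustrative.

```python
import math
from collections import Counter

def min_max(values):
    """Scale values linearly into [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = ["abnormal", "abnormal", "normal", "normal"]
informative = [1, 1, 0, 0]    # perfectly separates the two classes
uninformative = [1, 0, 1, 0]  # carries no information about the class

scaled = min_max([2, 4, 6])
gain_a = information_gain(informative, labels)
gain_b = information_gain(uninformative, labels)
print(scaled, gain_a, gain_b)
```

Features would then be ranked by their gain and the highest-scoring subset retained.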
[0118] 3. Dynamic scheduling and task allocation for large models:
[0119] When a video clip enters the analysis phase, the task awareness module quickly analyzes the video content. If it detects a crowd gathering with signs of abnormal behavior (such as arguing or pushing), the system immediately schedules a large model with high-precision behavior recognition capabilities from the model library and assigns it to a computing node with sufficient GPU resources for task processing. Simultaneously, for video clips in other normal scenarios, such as vehicle traffic monitoring at intersections, lightweight object detection and traffic flow analysis models are scheduled to CPU resources for processing, so that computing resources are used efficiently and tasks are completed quickly.
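A highly simplified sketch of this scheduling decision is shown below. All class names, fields, model names, and scores are hypothetical; a real scheduler would also weigh task priority and live resource metrics, which the example reduces to a single free-capacity number.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    task: str          # e.g. "behavior" or "detection"
    accuracy: float    # performance indicator used for ranking
    needs_gpu: bool

@dataclass
class Node:
    name: str
    has_gpu: bool
    free_capacity: float  # fraction of the node that is currently idle

def pick_model(models, task):
    """Choose the most accurate model registered for the task."""
    candidates = [m for m in models if m.task == task]
    if not candidates:
        raise LookupError(f"no model for task {task!r}")
    return max(candidates, key=lambda m: m.accuracy)

def pick_node(nodes, model):
    """Choose the idlest node of the kind (GPU or CPU) the model requires."""
    candidates = [n for n in nodes if n.has_gpu == model.needs_gpu]
    if not candidates:
        raise LookupError(f"no suitable node for {model.name!r}")
    return max(candidates, key=lambda n: n.free_capacity)

library = [
    Model("behavior-large", "behavior", 0.93, needs_gpu=True),
    Model("behavior-small", "behavior", 0.85, needs_gpu=False),
    Model("detector-lite", "detection", 0.88, needs_gpu=False),
]
cluster = [Node("gpu-0", True, 0.7), Node("gpu-1", True, 0.2), Node("cpu-0", False, 0.9)]

crowd_model = pick_model(library, "behavior")
crowd_node = pick_node(cluster, crowd_model)
traffic_model = pick_model(library, "detection")
traffic_node = pick_node(cluster, traffic_model)
print(crowd_model.name, crowd_node.name, traffic_model.name, traffic_node.name)
```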
[0120] 4. Video analysis and result generation based on large models:
[0121] The behavior recognition model performs in-depth analysis of the input video features, leveraging its pre-training on a large amount of crowd behavior data from different scenarios to accurately determine whether crowd behavior is abnormal and to output detailed behavior descriptions and confidence scores for abnormal behaviors. For example, for a conflict involving multiple people, the model might output, "A group of people were detected engaging in a violent physical conflict at [specific location], lasting approximately [X] seconds. The confidence score for this abnormal behavior is 0.92, involving approximately [specific number] people, some of whom exhibit aggressive behavior, potentially causing injury and public disorder." The target detection model quickly and accurately detects and identifies targets such as people and vehicles in the video, determining their location, category, and related attribute information, such as vehicle license plate numbers and the gender and age range of people. It generates a corresponding confidence score for each detection result, for example, "A black sedan with license plate number [specific license plate number] was detected at [specific coordinate location]. The vehicle type recognition confidence score is 0.95, the direction of travel is due east, and the speed is approximately [specific speed] km/h."
[0122] Based on the analysis results, corresponding security monitoring reports are generated, including the time, location, type of behavior, and information on the personnel and vehicles involved in the abnormal behavior. At the same time, confidence estimates are given for the analysis results of each target and behavior to enable monitoring personnel to conduct further verification and processing.
[0123] 5. Result fusion and post-processing:
[0124] The analysis results from different large-scale models are fused using a probability-based fusion algorithm to eliminate potential duplicate target information and inconsistent behavior judgments, resulting in an accurate and complete security monitoring report. For example, for the detection results of multiple target detection models for the same vehicle, the final vehicle location and category information are determined by calculating the weighted average of location and category confidence scores. For behavior recognition results, a deep learning model is used to fuse the behavior labels and probabilities from different models to obtain the most reliable behavior judgment conclusion.
[0125] Based on the severity of the analysis results, the system automatically generates corresponding alarm information and sends it to the mobile terminal devices of monitoring personnel. Simultaneously, it stores the video analysis results in a database for subsequent querying and statistical analysis. Furthermore, the system can integrate with other security systems; for example, when abnormal behavior is detected, it automatically triggers nearby alarm devices and notifies surrounding security personnel to handle the situation on-site.
[0126] 6. Online incremental learning and model updates:
[0127] During video analysis, the system continuously collects video samples of abnormal behaviors that were not accurately identified, as well as video data of newly emerging target types (such as new models of drones, electric scooters, etc.). These samples are then manually labeled and entered into the online incremental learning module.
[0128] The online incremental learning module employs a knowledge distillation-based incremental learning algorithm to transfer knowledge from new data into the existing large model. By adding a small number of neurons to the original model and adjusting some of its connection weights, it enables the model to quickly learn new behavioral patterns and target features without significantly impacting the performance of the original model on previously learned tasks. For example, when a new type of drone flies in a no-fly zone, online incremental learning allows the model to quickly identify this new target and behavioral pattern, include it in the abnormal behavior monitoring scope, and update the relevant behavior recognition and early warning mechanisms. After incremental learning, the updated large model is redeployed to security monitoring tasks, continuously improving the system's ability to identify and handle new situations.
[0129] Specifically, it has the following advantages:
[0130] 1) Highly efficient automated analysis;
[0131] 2) Quantified reliability of behavior judgments;
[0132] 3) Flexible adaptation to different scenarios;
[0133] 4) Support for complex decision-making;
[0134] 5) Multi-dimensional information fusion.
[0135] Its practical applications include security, transportation, and industry.
[0136] As can be seen from the above embodiments, the intelligent video analysis method based on large model scheduling of the present invention can effectively improve the accuracy, efficiency and flexibility of video analysis in urban security monitoring systems, providing strong technical support for urban security and having significant practical application value and social benefits.
[0137] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An intelligent video analysis method based on large model scheduling, characterized in that it comprises:
collecting and preprocessing raw video data, and dividing the continuous video stream into a series of video segments with independent analytical value;
for each video segment, extracting image features and spatiotemporal features in parallel, quantifying and normalizing the features, and selecting highly relevant features to reduce redundancy;
based on a video content description report, combined with task priority and model performance indicators, selecting the optimal model from a large model library, and dynamically allocating the model to an appropriate node according to the resource status of the computing nodes;
calling the selected large model for in-depth analysis, outputting the target location, category, confidence level, and behavior description results, and generating a structured analysis report including event time, location, abnormal behavior labels, and confidence level information;
fusing the outputs of different models by means of a fusion model to eliminate conflicts and noise;
collecting inaccurately identified samples, manually labeling them, and inputting them into an incremental learning module, and using knowledge distillation or adaptive parameter update strategies to update the model parameters and structure while retaining the original performance, thereby improving the system's adaptability;
wherein the video content description report is generated as follows:
Step S1: Target recognition and localization: the main targets in the video are automatically detected, their positions are marked, and they are classified; at the same time, a credibility score is generated for each recognition result to reflect the accuracy of the recognition;
Step S2: Scene classification: based on the overall characteristics of the video footage, the scene is classified into a specific type to help understand the background environment of the target;
Step S3: Behavioral analysis: the target's behavioral patterns are analyzed, the duration and dynamic trends of the behavior are recorded, and abnormal behaviors and their probabilities are labeled to provide a preliminary explanation of the behavior;
Step S4: Additional information extraction: the video's shooting time, location, weather conditions, and contextual information are extracted to enhance the report's completeness;
Step S5: Structured integration: the above information is organized into tables or lists to clearly present the target category, location, behavior description, confidence level, and scene type data, providing a unified basis for subsequent analysis;
and wherein the optimal model is selected from the large model library as follows: for tasks requiring high-precision object detection and classification, a large object detection model based on the Transformer architecture, with high-resolution feature extraction capabilities and a large amount of training data for object categories, is selected; for complex behavior analysis tasks, a large behavior recognition model based on 3D convolutional neural networks, which has been extensively trained on multiple behavior patterns and has good generalization ability, is selected.
2. The intelligent video analysis method based on large model scheduling as described in claim 1, characterized in that the preprocessing includes: using Gaussian filtering to remove noise interference in the video image, and using histogram equalization to enhance the contrast and brightness of the image, making the target object clearer and more identifiable and laying a good foundation for subsequent feature extraction and analysis; and segmenting the continuous video stream into a series of video segments with independent analytical value based on the timestamps of the video data and key event detection algorithms.
3. The intelligent video analysis method based on large model scheduling as described in claim 2, characterized in that, in traffic monitoring scenarios, video segments are divided based on the time when a vehicle passes through a specific intersection or road segment; on industrial production lines, segments are divided according to the product's production cycle or the completion node of key processes.
4. The intelligent video analysis method based on large model scheduling as described in claim 1, characterized in that video content feature extraction is achieved as follows: a feature extraction module based on a deep convolutional neural network (CNN) uses convolutional kernels of different sizes and multi-layer convolution-pooling structures to extract features from video images at multiple scales, obtaining rich low-level and high-level features; a spatiotemporal feature extraction module based on a recurrent neural network (RNN) and its variants models the temporal dimension information in the video sequence; the features output by the different feature extraction modules are then quantized and normalized to ensure that their numerical ranges are uniform and comparable; finally, information gain, the chi-square test, or the Relief algorithm is used to select a subset of features that are highly relevant to the current video analysis task.
5. The intelligent video analysis method based on large model scheduling as described in claim 1, characterized in that behavior description and probability estimation are performed as follows: Step a) Behavioral feature extraction and encoding: a deep learning-based feature extraction model performs multi-level feature extraction on the target behavior in the video segment; the extracted behavioral features are quantified and encoded, converting them into numerical vectors that can be processed by a computer for subsequent model analysis and computation; Step b) Behavior classification and recognition: the encoded behavior feature vector is input into a pre-trained behavior classification model; the model classifies and recognizes the target behavior based on the behavior patterns and feature distributions it has learned, and calculates a corresponding probability score for each possible behavior category; Step c) Behavior description generation: based on the behavior classification results and probability scores, combined with predefined behavior description templates, detailed behavior description statements are generated; Step d) Uncertainty handling and supplementary information: when the probability distribution of behaviors is relatively dispersed, that is, no behavior category has a significantly high probability, the model reflects this uncertainty in the behavior description; at the same time, the model combines other information in the video to supplement and improve the behavior description, so as to provide richer behavior analysis results.
6. The intelligent video analysis method based on large model scheduling as described in claim 1, characterized in that result fusion is implemented as follows: Step a) Data preparation and standardization: the analysis results of different large models on the same video segment are collected, including the target's location information, target category labels, behavior recognition results, and related confidence scores; the outputs of the different models are standardized to ensure the consistency and comparability of data formats; Step b) Fusion based on probability statistics, including target location fusion, target category fusion, and behavior recognition result fusion; Step c) Deep learning model-assisted fusion: a deep learning model specifically for result fusion is constructed, the outputs of the different models are used as the input features of the fusion model, and the fusion model is trained on a large amount of labeled video data so that it learns the optimal fusion method between the results of different models; Step d) Conflict resolution and result optimization: conflicts are resolved using rule-based methods or further data analysis; Step e) Output of the fusion result: after the above steps, the final fusion result is obtained, including the fused target location, category, and behavior information, and this information is organized into a unified format and output for subsequent application processing.
7. The intelligent video analysis method based on large model scheduling as described in claim 1, characterized in that, during video analysis, the online monitoring module continuously evaluates and verifies the analysis results of the large model; the collected incremental data is manually labeled and preprocessed before entering the online incremental learning module; after online incremental learning, the updated large model is redeployed to the video analysis task, continuously improving the system's intelligence and adaptability to ensure that it can cope with ever-changing video data and analysis task requirements.
8. A non-transitory storage medium, characterized in that it stores a program for executing the intelligent video analysis method based on large model scheduling as described in any one of claims 1 to 7.