Method, device, processor and computer-readable storage medium for operation specification review based on a multimodal large model

By using a multimodal large model for operational specification review, the problems of high labor costs and low efficiency in traditional review have been solved. This has enabled intelligent supervision and efficient automated compliance review throughout the entire process, improving the objectivity and timeliness of operational specification review.

CN121233764B · Active · Publication Date: 2026-06-23 · THE THIRD RES INST OF MIN OF PUBLIC SECURITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
THE THIRD RES INST OF MIN OF PUBLIC SECURITY
Filing Date
2025-10-10
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Traditional operational procedure reviews are characterized by high labor costs, low efficiency, strong subjectivity, difficulty in handling complex scenarios, and time-consuming and labor-intensive manual report filling.

Method used

It employs a multimodal large model for multi-source data access and preprocessing, and uses a large view model, a large language model, and a large speech model for multimodal data content understanding and scenario analysis. Combined with cross-modal information fusion and consistency verification, it enables intelligent application output and interactive display, and supports full-process supervision.

Benefits of technology

It has improved the objectivity and timeliness of operational procedure review, reduced human error, enhanced the comprehensiveness of supervision, reduced workload, and supported intelligent supervision throughout the entire life cycle.

[Figure: CN121233764B_ABST]

Abstract

The present application relates to a method for operation specification review based on a multimodal large model, comprising the following steps: performing multi-source data access and preprocessing, in which data related to the operation process is obtained from different types of data sources and preprocessed; performing intelligent data analysis on the preprocessed data with the multimodal large model, in which multimodal data content understanding and scene analysis yield an operation process summary and operation behavior data; performing cross-modal information fusion and consistency verification on the operation behavior data and the summary, and submitting them to the various operation specification review operators for processing; and performing intelligent application output and interactive display. With the method, device, processor, and computer-readable storage medium of the present application, the entire operation process is placed under full-element, full-chain, fully intelligent supervision, which raises the level of standardization and provides closed-loop supervision covering the complete life cycle from the start of the operation to the final output of the operation report.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and public safety technology, and particularly to the field of operational procedure review. Specifically, it refers to a method, apparatus, processor, and computer-readable storage medium for operational procedure review based on a multimodal large model.

Background Technology

[0002] With the acceleration of digital transformation, artificial intelligence technology, especially multimodal large model technology, has provided innovative solutions for operational procedure review. Traditional operational procedure review suffers from high labor costs, low efficiency, and strong subjectivity. Furthermore, models struggle to handle complex scenarios, and manually filling out electronic reports is time-consuming and labor-intensive.

[0003] With its powerful capabilities for analyzing and understanding multimodal data such as audio, video, and text, the multimodal large model can provide process summarization, intelligent monitoring, compliance review, and risk alerts for operational implementation. By learning from large volumes of regulations and multimodal audio-visual operational cases, the large model can automatically identify issues such as behavioral deviations and procedural violations during the operational process, thereby improving the objectivity and timeliness of supervision.

[0004] Currently, many local governments and organizations are exploring the "AI + supervision" model, such as intelligent review of operational processes or law enforcement, and automatic generation and review of reports. This technological innovation not only aligns with the national policy direction of "Internet + supervision" but also helps to build a more transparent and standardized operational implementation system, promoting standardized practice in government bodies and organizations.

Summary of the Invention

[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method, apparatus, processor and computer-readable storage medium for operational specification review based on a multimodal large model that is objective, timely and widely applicable.

[0006] To achieve the above objectives, the present invention provides a method, apparatus, processor, and computer-readable storage medium for operational specification review based on a multimodal large model, as follows:

[0007] The main feature of this method for operational specification review based on a multimodal large model is that the method includes the following steps:

[0008] (1) Perform multi-source data access and preprocessing, obtain relevant data from different types of data sources, and perform data preprocessing;

[0009] (2) Perform multimodal large model intelligent data analysis on the preprocessed data: use the large view model, large language model, and large speech model to perform multimodal data content understanding and scene analysis to obtain the operation process summary and process data;

[0010] (3) Perform cross-modal information fusion and consistency verification on the process data and summaries, and submit them to the various operation specification review operators for processing;

[0011] (4) Perform intelligent application output and interactive display. Based on the output results of operator processing, the front-end page performs early warning display and statistical analysis to realize data query and business processing.

[0012] Preferably, step (1) specifically includes the following steps:

[0013] (1.1) Process text data into text tokens using a text encoder;

[0014] (1.2) Encode the audio data into voice tokens using a Whisper encoder;

[0015] (1.3) Process the recorder video, surveillance video, and images into visual tokens using a visual encoder.

[0016] Preferably, step (1) of performing multi-source data access specifically includes:

[0017] Relevant data interfaces are obtained from multiple data sources, connected, and subjected to standardized processing.

[0018] Preferably, step (2) specifically includes the following steps:

[0019] (2.1) Generate the original case summary from the text data of the electronic report document using a large language model;

[0020] (2.2) Use a multimodal large model to understand and describe the on-site images and operation process videos frame by frame, and combine the audio recording summary obtained by speech analysis to identify and extract key frames;

[0021] (2.3) Perform speech analysis on the audio using a speech model to generate speech-to-text and subtitles, and generate audio summaries using a large language model.

[0022] Preferably, step (2.2) specifically includes:

[0023] The recorder data is separated into audio and video; a summary is generated using a large speech model; and key frames of key videos are extracted using a large view model.

[0024] The main feature of this device for operational specification review based on a multimodal large model is that the device includes:

[0025] A processor is configured to execute computer-executable instructions;

[0026] The memory stores one or more computer-executable instructions, which, when executed by the processor, implement the various steps of the above-described method for reviewing operational specifications based on a multimodal large model.

[0027] The processor for implementing operational specification review based on a multimodal large model is characterized in that the processor is configured to execute computer-executable instructions, and when the computer-executable instructions are executed by the processor, the various steps of the aforementioned method for implementing operational specification review based on a multimodal large model are implemented.

[0028] The computer-readable storage medium is characterized in that it stores a computer program that can be executed by a processor to implement the various steps of the above-described method for reviewing operational specifications based on a multimodal large model.

[0029] This invention employs a method, apparatus, processor, and computer-readable storage medium for operational specification review based on a multimodal large model. By constructing a "1+3+N" collaborative architecture, comprising "one intelligent central platform, three AI engines, and N types of operators and application scenarios", it supports flexible access to new data sources and new application scenarios, adapting to different business and scenario needs. It achieves full-element, full-chain, and fully intelligent supervision of the entire operational process, solving problems such as low efficiency, strong subjectivity, and narrow coverage in traditional operational specification review, thus improving standardization. It possesses multimodal fusion analysis capabilities, achieving for the first time deep fusion and semantic alignment of text, audio, and video modalities in operational specification review, enhancing the comprehensiveness of supervision. It enables automated compliance review, driven by both a large model and a rule base, automatically identifying operational procedure violations and inconsistencies between reports and audio/video recordings, and reducing human error. It also significantly reduces workload by generating summaries and automatically writing reports, easing the burden on employees. It provides closed-loop supervision covering the entire lifecycle from the start of the operation to the final output of the operation report, realizing intelligent supervision of the entire process of "pre-event reminder, in-event monitoring, and post-event audit".

Attached Figure Description

[0030] Figure 1 is a flowchart of the method for reviewing operational specifications based on a multimodal large model according to the present invention.

Detailed Implementation

[0031] To more clearly describe the technical content of the present invention, the following description is provided in conjunction with specific embodiments.

[0032] The method for operational specification review based on a multimodal large model of the present invention includes the following steps:

[0033] (1) Perform multi-source data access and preprocessing, obtain operation process related data from different types of data sources, and perform data preprocessing;

[0034] (2) Perform multimodal large model intelligent data analysis on the preprocessed data: through the large view model, large language model, and large speech model, perform multimodal data content understanding and scenario analysis to obtain the operation process summary and case data;

[0035] (3) Perform cross-modal information fusion and consistency verification on the operation process data and summary, and submit them to the various operation specification review operators for processing;

[0036] (4) Perform intelligent application output and interactive display. Based on the output results of operator processing, the front-end page performs early warning display and statistical analysis to realize data query and business processing.

[0037] In a preferred embodiment of the present invention, step (1) specifically includes the following steps:

[0038] (1.1) Process text data into text tokens using a text encoder;

[0039] (1.2) Encode the audio data into voice tokens using a Whisper encoder;

[0040] (1.3) Process the recorder video, surveillance video, and images into visual tokens using a visual encoder.

[0041] As a preferred embodiment of the present invention, step (1) of performing multi-source data access specifically includes:

[0042] Relevant data interfaces are obtained from multiple data sources, connected, and subjected to standardized processing.

[0043] In a preferred embodiment of the present invention, step (2) specifically includes the following steps:

[0044] (2.1) Generate the original summary from the text data of the electronic report document using a large language model;

[0045] (2.2) Use a multimodal large model to understand and describe the frames of images and videos, and combine the audio recording summary obtained from speech analysis to identify and extract key frames;

[0046] (2.3) Perform speech analysis on the audio using a speech model to generate speech-to-text and subtitles, and generate audio summaries using a large language model.

[0047] In a preferred embodiment of the present invention, step (2.2) specifically comprises:

[0048] The recorder data is separated into audio and video; case summaries are generated using a large speech model; and key frames of key videos are extracted using a large view model.

[0049] The apparatus for operational specification review based on a multimodal large model of the present invention includes:

[0050] A processor is configured to execute computer-executable instructions;

[0051] The memory stores one or more computer-executable instructions, which, when executed by the processor, implement the various steps of the above-described method for reviewing operational specifications based on a multimodal large model.

[0052] The processor of the present invention for implementing operation specification review based on a multimodal large model is configured to execute computer-executable instructions, which, when executed by the processor, implement the various steps of the above-described method for implementing operation specification review based on a multimodal large model.

[0053] The computer-readable storage medium of the present invention stores a computer program thereon, which can be executed by a processor to implement the various steps of the above-described method for reviewing operational specifications based on a multimodal large model.

[0054] This invention relates to the fields of artificial intelligence and public safety technology, and particularly to an intelligent operational procedure review method and system based on a multimodal large model. The method and system automatically analyze multimodal data produced during the operational process, such as text, audio, and video, in order to review compliance, verify consistency, and generate audio-visual case summaries, thereby improving the intelligence and efficiency of operational procedure review. Compared with existing technologies, this invention helps to make operational procedure review automated and intelligent, achieves higher efficiency, and promotes intelligent and people-oriented implementation of operational procedures.

[0055] This invention aims to address the challenges in the field of operational procedure review by enabling intelligent analysis of operational process data based on multimodal large model technology, under conditions of significantly reduced manual review. This allows for the automatic generation of structured case data and summaries, thereby rapidly completing standardized review and case quality analysis, and ultimately improving case-handling efficiency.

[0056] To achieve the above objectives, the present invention provides the following technical solution:

[0057] As shown in Figure 1, this invention proposes a method for reviewing operational specifications based on a multimodal large model, with the following steps:

[0058] Step 1: Implement multi-source data access and preprocessing. Obtain relevant data from different types of data sources (audio and video recordings of on-site operations, on-site images, on-site monitoring videos, electronic reports, etc.) and perform data preprocessing. Relevant data interfaces can be obtained from multiple data sources for connection and standardized processing.
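The standardized processing of items from several data sources described above could be sketched as follows. This is a minimal illustration, not the patented implementation: the `SourceRecord` type, the extension-to-modality table, and the source names are hypothetical and do not appear in the patent text.

```python
from dataclasses import dataclass, field
from pathlib import PurePath

# Hypothetical mapping from file extension to modality (an assumption for illustration).
EXTENSION_TO_MODALITY = {
    ".txt": "text", ".pdf": "text", ".docx": "text",
    ".wav": "audio", ".mp3": "audio",
    ".mp4": "video", ".avi": "video",
    ".jpg": "image", ".png": "image",
}


@dataclass
class SourceRecord:
    """Standardized description of one item pulled from a data source."""
    source: str      # e.g. "recorder", "surveillance", "report_system" (hypothetical names)
    uri: str         # location of the raw item
    modality: str    # "text", "audio", "video", or "image"
    metadata: dict = field(default_factory=dict)


def standardize(source: str, uri: str, **metadata) -> SourceRecord:
    """Map a raw (source, uri) pair to a SourceRecord, inferring the modality."""
    suffix = PurePath(uri).suffix.lower()
    modality = EXTENSION_TO_MODALITY.get(suffix)
    if modality is None:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return SourceRecord(source=source, uri=uri, modality=modality, metadata=metadata)


def group_by_modality(records):
    """Group records so that each modality can be sent to its own encoder."""
    groups = {}
    for record in records:
        groups.setdefault(record.modality, []).append(record)
    return groups
```

Grouping by modality at this stage keeps the later routing to the text, audio, and visual encoders a simple dictionary lookup.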

[0059] Multimodal data preprocessing: Text data such as reports and electronic documents are processed into text tokens by a text encoder, audio data is encoded into voice tokens by a Whisper encoder, and recorder videos, surveillance videos, and images are transformed into visual tokens by a visual encoder.

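[0060] The text data $X_{\text{text}}$ is input to the text encoder $E_{\text{text}}$ to obtain the text token sequence $T_{\text{text}} = E_{\text{text}}(X_{\text{text}})$, where $L_t$ is the sequence length of the text tokens. The text encoder is based on the Transformer architecture, and its processing flow is represented as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[0061] where $Q$, $K$, and $V$ represent the query, key, and value matrices generated from the input text features, respectively, and $d_k$ indicates the dimension of the key vector;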
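For concreteness, the attention step defined in terms of Q, K, V and the key dimension can be sketched in NumPy. This assumes the standard scaled dot-product attention of the Transformer architecture; the function name and the single-head, unbatched shapes are illustrative choices rather than details given in the patent.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Q has shape (n_queries, d_k), K has shape (n_keys, d_k),
    and V has shape (n_keys, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors.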

[0062] The audio data $X_{\text{audio}}$ is input to the audio encoder $E_{\text{audio}}$ to obtain the audio token sequence $T_{\text{audio}}$, where $L_a$ is the sequence length of the audio tokens. The audio encoder is based on the Whisper architecture: the audio signal $X_{\text{audio}}$ is converted into a log-Mel spectrogram, features are then extracted through convolutional and downsampling layers, and finally the voice tokens are generated by a Transformer encoder:

[0063] $$T_{\text{audio}} = \mathrm{TransformerEncoder}\big(\mathrm{Conv}(\mathrm{LogMel}(X_{\text{audio}}))\big);$$
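The first stage of the audio path, converting a waveform into a log-Mel spectrogram, can be illustrated with a small NumPy sketch. The parameter values (16 kHz sampling, a 400-sample window, a 160-sample hop, 80 Mel bands) and the HTK-style Mel formula are assumptions chosen for illustration; the patent does not state them, and the convolutional and Transformer stages are not shown.

```python
import numpy as np


def hz_to_mel(f):
    """HTK-style Hz to Mel conversion (an assumed convention)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular Mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Return an (n_mels, n_frames) log-Mel spectrogram of a 1-D signal."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < n_fft:
        signal = np.pad(signal, (0, n_fft - signal.size))
    n_frames = 1 + (signal.size - n_fft) // hop
    # Build overlapping frames, apply a Hann window, and take the power spectrum.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    # Clamp before the logarithm so silent frames do not produce -inf.
    return np.log10(np.maximum(mel, 1e-10))
```

In a full encoder this array would be the input to the convolutional and downsampling layers mentioned in the text.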

[0064] Video data is converted into images by fixed-interval sampling, and the image data $X_{\text{img}}$ is then input to the visual encoder $E_{\text{vis}}$ to obtain the visual token sequence $T_{\text{vis}}$, where $L_v$ is the sequence length of the visual tokens. The visual encoder is a Vision Transformer model: the input image is cut into $N$ image patches $x_p^1, \dots, x_p^N$, each image patch is projected into a token vector using a projection matrix, and, after the position encoding is added, the result is input into the Transformer encoder:

[0065] $$z_0 = [\,x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E\,] + E_{pos},$$ where $E$ is the projection matrix and $E_{pos}$ is the position encoding matrix.
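The visual path described above, fixed-interval frame sampling followed by patch projection with an added position encoding, might look like the following NumPy sketch. The function names, the patch size, and the use of random matrices for `E` and `E_pos` in any usage are illustrative assumptions; a trained Vision Transformer would supply learned values, and no class token is modeled here.

```python
import numpy as np


def sample_frame_indices(n_frames, fps, interval_s):
    """Indices of the frames taken every `interval_s` seconds."""
    step = max(1, int(round(fps * interval_s)))
    return list(range(0, n_frames, step))


def patchify(image, patch):
    """Split an (H, W, C) image into N flattened patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Reorder to (row block, column block, row in patch, column in patch, channel).
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def embed_patches(image, E, E_pos, patch):
    """Project each patch with E and add the position encoding E_pos."""
    x_p = patchify(image, patch)
    return x_p @ E + E_pos
```

For a 32 by 32 RGB image with 16-pixel patches this yields four tokens, each built from 768 pixel values.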

[0066] Step 2: Multimodal data intelligent analysis, which enables content understanding and feature extraction of different modalities of data through three major AI engines. The preprocessed data from Step 1 is centrally processed for multimodal large-scale intelligent data analysis. The large-scale view model, large-scale language model, and large-scale speech model are used as three major AI multimodal model engines to achieve content understanding of text, audio, video, and other multimodal data. This automatically generates original summaries, extracts keyframes, and automatically generates summaries of operation process records, thereby enabling intelligent analysis for operational specification review.

[0067] Furthermore, in a preferred embodiment, the specific data analysis process of the three AI engines in step 2 above is as follows:

[0068] Step 201: Generate the original report summary from text data such as electronic report documents using a large language model, including elements such as time, location, task, cause, process, result, tools, and methods;

[0069] Step 202: Using a multimodal large model, perform frame-by-frame understanding and description of the on-site images and operation process videos. Combined with the audio recording summary obtained from speech analysis, complete the identification and extraction of key frames. According to standardization requirements, the key frames must include key scenes such as document display, identity notification, operation process, rights notification, report result display, and signature confirmation.

[0070] Step 203: For audio, perform speech analysis using a speech model to convert speech to text and generate subtitles, and generate a recorded audio summary using a large language model.
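One way to picture the key-frame requirement in Step 202 is a coverage check over per-frame scene labels. The sketch below assumes, hypothetically, that the large view model returns a scene label and a confidence for each sampled frame; the label strings and the `select_key_frames` helper are illustrative and not taken from the patent.

```python
# Required key scenes, paraphrasing the list in Step 202; the identifiers are hypothetical.
REQUIRED_SCENES = (
    "document_display",
    "identity_notification",
    "operation_process",
    "rights_notification",
    "report_result_display",
    "signature_confirmation",
)


def select_key_frames(frame_labels, required=REQUIRED_SCENES):
    """Pick the most confident frame for each required scene.

    frame_labels is an iterable of (frame_index, scene_label, confidence).
    Returns (key_frames, missing), where key_frames maps scene -> frame index
    and missing lists the required scenes that no frame covered.
    """
    best = {}
    for index, scene, confidence in frame_labels:
        if scene not in required:
            continue
        if scene not in best or confidence > best[scene][1]:
            best[scene] = (index, confidence)
    key_frames = {scene: index for scene, (index, _) in best.items()}
    missing = [scene for scene in required if scene not in key_frames]
    return key_frames, missing
```

A non-empty `missing` list is the kind of signal a downstream compliance operator could turn into a warning.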

[0071] Furthermore, in a preferred embodiment, in step 202 above the recorder data can be separated into audio and video; a summary is generated using the large speech model, and key frames of key videos are extracted using the large view model.

[0072] Step 3: Cross-modal information fusion and consistency verification. The operation process data, summaries, and other results obtained from the multimodal large model engines are integrated and then submitted to the various operation specification review operators, which check and judge the compliance of the operation process and the consistency between reports and audio and video records.

[0073] Operator processing for operation specification review: the original report summary, key frames, audio recording summary, and other data obtained by the multimodal large model engines in Step 2 undergo cross-modal information fusion and consistency verification, and the compliance and consistency of the operation process are judged through operator management or algorithm services.
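The consistency check between a report and its audio and video record can be pictured as a field-by-field comparison of extracted elements. In the patent the comparison is carried out by large models; the sketch below substitutes a plain normalized string comparison as a stand-in, and the field names and the `check_consistency` helper are hypothetical.

```python
def normalize(value):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(str(value).lower().split())


def check_consistency(report, recording, fields=("time", "location", "task", "result")):
    """Compare elements extracted from the report with those from the recording.

    Both arguments are dicts of extracted elements. Each finding has a status of
    "consistent", "conflict", or "missing" (when either side lacks the field).
    """
    findings = []
    for name in fields:
        r, a = report.get(name), recording.get(name)
        if r is None or a is None:
            status = "missing"
        elif normalize(r) == normalize(a):
            status = "consistent"
        else:
            status = "conflict"
        findings.append({"field": name, "status": status, "report": r, "recording": a})
    return findings
```

Exact matching is deliberately naive; a semantic comparison would be needed to treat paraphrases as consistent.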

[0074] Furthermore, in a preferred embodiment, an intelligent central platform can be established for the above-mentioned operational specification review process: a unified multimodal operational specification review AI platform to achieve cross-modal understanding of various types of data, including textual data, reports, and audio and video data from recorders;

[0075] Build three major AI multimodal model engines: a large view model, a large language model, and a large speech model;

[0076] Support N types of operation specification operators and application scenarios, such as evidence consistency verification, process compliance review, procedure verification, and automatic generation of summaries and reports.

[0077] Step 4: Intelligent application output and interactive display. The results of operator and application processing are presented: the front-end page provides early warning display, query, and statistical analysis, while the back end handles data query and business processing.

[0078] In a specific embodiment of the present invention, this embodiment adopts a "1+3+N" intelligent collaborative construction model, namely, an overall architecture of "one intelligent central platform, three major AI engines, and N types of scenario operators and applications," to promote the full-process, full-element, and fully intelligent supervision of operational implementation. Specifically, it involves establishing an intelligent central platform, namely a unified multimodal operation specification review AI platform, to achieve cross-modal understanding of various types of data, including text-based reports, electronic documents, and audio and video data from recorders; and constructing three major AI engines: the intelligent central platform uses a large view model, a large language model, and a large voice model as three major AI multimodal model engines to realize the application of N types of standardized operation scenarios, such as consistency verification of reports and audio and video recording evidence, process compliance review, procedure verification, and automatic generation of summaries and reports.

[0079] A multimodal intelligent hub is established, and based on the analysis results of the AI engine, multiple applications of operation specification review are supported through violent behavior inspection operators, operation compliance review operators, and consistency check operators. For example, it can check whether there is violent behavior, whether the process complies with laws, regulations and relevant rules, and whether the reports and audio and video records are consistent, thereby realizing the automation and intelligence of operation specification review.
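The "N types of operators" idea can be sketched as a small registry in which each review operator is a function over the fused analysis results. Everything here is an assumption for illustration: the registry, the operator names, and the keys of the `case` dictionary are hypothetical, and real operators would draw on model outputs rather than precomputed lists.

```python
from typing import Callable, Dict, List

Operator = Callable[[dict], List[str]]
OPERATORS: Dict[str, Operator] = {}


def register(name):
    """Decorator that adds a review operator to the registry under `name`."""
    def decorator(fn):
        OPERATORS[name] = fn
        return fn
    return decorator


@register("violent_behavior_inspection")
def violent_behavior(case):
    return [f"violent behavior flagged at frame {i}" for i in case.get("violence_frames", [])]


@register("operation_compliance_review")
def compliance(case):
    seen = set(case.get("observed_steps", []))
    return [f"required step not observed: {s}"
            for s in case.get("required_steps", []) if s not in seen]


@register("consistency_check")
def consistency(case):
    return [f"report and recording disagree on {k}"
            for k in case.get("conflicting_fields", [])]


def run_operators(case, names=None):
    """Run the selected operators (all by default) and collect their findings."""
    selected = names if names is not None else list(OPERATORS)
    return {name: OPERATORS[name](case) for name in selected}
```

Adding a new scenario then amounts to registering one more function, which is one plausible reading of the flexibility the "1+3+N" architecture aims for.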

[0080] The analysis results of various operators and applications reviewed in the operation specifications are displayed interactively through the front-end and back-end systems. The front-end interface supports: real-time early warning information display, statistical analysis reports (such as violation rate, high-frequency problem types), custom queries, and visual dashboards; the back-end system supports: data storage and indexing, business process processing, model iteration training, and feedback loop functions.
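The statistics named for the front-end reports, such as the violation rate and high-frequency problem types, reduce to simple aggregation over review results. The sketch below assumes a hypothetical record shape in which each review carries a list of problem-type strings; the patent does not specify the data model.

```python
from collections import Counter


def summarize_reviews(reviews, top_n=3):
    """Aggregate review results for a statistics panel.

    reviews is a list of dicts, each with a "violations" list of problem-type strings.
    """
    total = len(reviews)
    flagged = sum(1 for r in reviews if r["violations"])
    counts = Counter(v for r in reviews for v in r["violations"])
    return {
        "total": total,
        # Share of reviewed operations with at least one violation.
        "violation_rate": flagged / total if total else 0.0,
        "top_problem_types": counts.most_common(top_n),
    }
```

The violation rate here counts operations rather than individual violations, so one operation with several problems contributes once.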

[0081] Data integration covers three kinds of sources: document data, that is, structured or unstructured documents such as reports and electronic documents; recorder data, accessed through audio and video interfaces; and electronic data, including multimodal evidence materials such as reports, images, and surveillance videos.

[0082] The data sources for this technical solution are mainly audio and video data from the recorder, and text data such as electronic documents of operation implementation reports.

[0083] This technical solution encodes text, audio, and video data using different encoders for multimodal data processing. A large language model, a large speech model, and a large view model then perform content analysis and scene understanding. This process is primarily handled by the large models themselves, without utilizing RAG technology or external knowledge bases. Based on this, the large models are used to fuse and verify cross-modal information such as summaries and elements from processed audio, video, and reports. Finally, various operators are applied for operational compliance review, such as violence detection operators, compliance review operators, and consistency operators.

[0084] This technical solution combines audio modal summarization with operational standardization requirements, using a large view model to understand and analyze sampled video frames to determine whether they are key scenes in the operation process. By combining audio and video modalities, the ability of the large view model to understand the scene can be significantly improved.

[0085] For the specific implementation scheme of this embodiment, please refer to the relevant descriptions in the above embodiments, which will not be repeated here.

[0086] It is understood that the same or similar parts in the above embodiments can be referred to each other, and the contents not described in detail in some embodiments can be referred to the same or similar contents in other embodiments.

[0087] It should be noted that in the description of this invention, the terms "first," "second," etc., are used for descriptive purposes only and should not be construed as indicating or implying relative importance. Furthermore, in the description of this invention, unless otherwise stated, "a plurality of" means at least two.

[0088] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process, and the scope of the preferred embodiments of the invention includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which embodiments of the invention pertain.

[0089] It should be understood that various parts of the present invention can be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware stored in memory and executed by a suitable instruction execution device. For example, if implemented in hardware, as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.

[0090] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The corresponding program can be stored in a computer-readable storage medium. When the program is executed, it includes one or a combination of the steps of the method embodiments.

[0091] Furthermore, the functional units in the various embodiments of the present invention can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.

[0092] The storage media mentioned above can be read-only memory, disk, or optical disk, etc.

[0093] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0094] This invention employs a method, apparatus, processor, and computer-readable storage medium for operational specification review based on a multimodal large model. By constructing a "1+3+N" collaborative architecture, comprising "one intelligent central platform, three AI engines, and N types of operators and application scenarios", it supports flexible access to new data sources and new application scenarios, adapting to different business and scenario needs. It achieves full-element, full-chain, and fully intelligent supervision of the entire process, solving problems such as low efficiency, strong subjectivity, and narrow coverage in traditional operational specification reviews, thus improving standardization. It possesses multimodal fusion analysis capabilities, achieving for the first time deep fusion and semantic alignment of text, audio, and video modalities in operational specification review, enhancing the comprehensiveness of supervision. It enables automated compliance review, driven by both a large model and a rule base, automatically identifying procedural violations and contradictory evidence and reducing human error. It significantly reduces workload and eases the burden on employees through summary generation and automatic report writing. It provides closed-loop supervision throughout the entire process, covering the entire lifecycle from the occurrence to the end of the operation, achieving intelligent supervision of the entire process: "pre-event prompts, in-event monitoring, and post-event auditing."

[0095] In this specification, the invention has been described with reference to specific embodiments thereof. However, it will be apparent that various modifications and variations can be made without departing from the spirit and scope of the invention. Therefore, the specification and drawings should be considered illustrative rather than restrictive.

Claims

1. A method for reviewing operational specifications based on a multimodal large model, characterized in that the method includes the following steps:
(1) Perform multi-source data access and preprocessing, obtain relevant data from different types of data sources, and perform data preprocessing;
(2) Perform multimodal large model intelligent data analysis on the preprocessed data: through the large view model, large language model, and large speech model, perform multimodal data content understanding and scenario analysis to obtain the operation implementation process summary and process data;
(3) Perform cross-modal information fusion and consistency verification on the process data and operation process summaries, and submit them to the various operation specification review operators for processing;
(4) Perform intelligent application output and interactive display: based on the output results of operator processing, the front-end page performs early warning display and statistical analysis to realize data query and business processing;
Step (1) specifically includes the following steps:
(1.1) Process text data into text tokens using a text encoder;
(1.2) Encode the audio data into voice tokens using a Whisper encoder;
(1.3) Process the operation process recorder video, monitoring video, and images into visual tokens using a visual encoder;
Step (2) specifically includes the following steps:
(2.1) Generate the original summary of the operation process from the text data of the electronic report document using a large language model;
(2.2) Use a multimodal large model to understand and describe the operation site images and operation process videos frame by frame, and combine the audio record summary obtained by speech analysis to identify and extract key frames;
(2.3) Analyze the process audio with a speech model to generate speech-to-text and subtitles, and generate a summary of the recorded audio with a large language model;
Step (2.2) specifically comprises: the recorder data is separated into audio and video; an operation process summary is generated using a large speech model; and key frames of the video are extracted using a large view model.

2. The method for reviewing operational specifications based on a multimodal large model according to claim 1, characterized in that the multi-source data access in step (1) is specifically as follows: relevant data interfaces are obtained from multiple data sources, connected, and subjected to standardized processing.

3. A device for reviewing operational procedures based on a multimodal large model, characterized in that the device includes: a processor configured to execute computer-executable instructions; and a memory storing one or more computer-executable instructions which, when executed by the processor, implement the steps of the method for reviewing operational specifications based on a multimodal large model as described in claim 1 or 2.

4. A processor for operational specification review based on a multimodal large model, characterized in that the processor is configured to execute computer-executable instructions which, when executed by the processor, implement the steps of the method for reviewing operational specifications based on a multimodal large model as described in claim 1 or 2.

5. A computer-readable storage medium, characterized in that it stores a computer program that can be executed by a processor to implement the steps of the method for reviewing operational specifications based on a multimodal large model as described in claim 1 or 2.