An anomaly detection method and device for monitoring equipment

By combining the matching probability analysis of image and text information and utilizing supervised and unsupervised learning anomaly detection models, the accuracy and cost issues of anomaly detection in existing technologies for monitoring equipment are solved, and efficient and automated detection of monitoring equipment is achieved.

CN116189079B · Active · Publication Date: 2026-06-19 · SHANGHAI XINYI INTELLIGENT TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI XINYI INTELLIGENT TECH CO LTD
Filing Date: 2022-12-30
Publication Date: 2026-06-19


IPC: G06F16/58; G06V20/52

AI Tagging

Application Domain

Character and pattern recognition; Metadata still image retrieval


Abstract

The purpose of this application is to provide an anomaly detection method and device for monitoring equipment. Compared with existing technologies, this application acquires the monitoring image of the monitoring equipment at a target time and the first text information of the monitoring image stored by the network device at the target time, wherein the monitoring image contains first image information; the monitoring image and the first text information are input into an anomaly detection model; the matching probability of the first text information and the first image information is obtained through the anomaly detection model, and the presence of an anomaly in the monitoring equipment is determined by the matching probability. In this way, the anomaly detection problem is transformed into a matching problem of the first image information and the first text information, making full use of the characteristics of monitoring scenarios such as urban monitoring. It can uniformly detect faults in the monitoring equipment itself and connection anomalies between the monitoring equipment and the monitoring system without complex engineering details. The monitoring system maintained in this way has strong reliability and maintainability.


Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to an anomaly detection technology for monitoring equipment.

Background Technology

[0002] As a large-scale distributed system, a monitoring system requires high reliability and maintainability. In practical use, monitoring equipment may experience anomalies such as screen freezes and wiring errors. With the increasing scale of the distributed system, equipment malfunctions become almost inevitable. Therefore, a method for detecting anomalies in monitoring equipment is needed.

[0003] Existing anomaly detection for monitoring equipment usually relies on equipment self-testing or manual troubleshooting. Equipment self-testing can detect anomalies such as faulty equipment lines or whether the equipment is operating, but it cannot determine whether the equipment is monitoring effectively during operation; for example, it cannot detect a frozen screen. Manual troubleshooting, while highly accurate, is extremely costly and not suitable for large-scale monitoring scenarios.

Summary of the Invention

[0004] The purpose of this application is to provide a method and device for anomaly detection in monitoring equipment.

[0005] According to one aspect of this application, an anomaly detection method for a monitoring device is provided, wherein the method includes:

[0006] The monitoring image of the monitoring device at a target time and the first text information of the monitoring image stored by the network device at the target time are obtained, wherein the monitoring image has first image information.

[0007] The monitoring image and the first text information are input into the anomaly detection model;

[0008] The matching probability of the first text information and the first image information is obtained through the anomaly detection model, so as to determine whether the monitoring device has an anomaly based on the matching probability.

[0009] Furthermore, the anomaly detection model includes a first anomaly detection model with supervised learning and a second anomaly detection model with unsupervised learning, and the step of inputting the surveillance image and the first text information into the anomaly detection model includes:

[0010] The monitoring image and the first text information are input into the first anomaly detection model to obtain the first matching probability between the first image information on the monitoring image and the first text information in the network device.

[0011] The monitoring image and the first text information are input into the second anomaly detection model to obtain a second matching probability between the first image information on the monitoring image and the first text information in the network device.

[0012] Obtaining the matching probability of the first text information and the first image information through the anomaly detection model includes:

[0013] The first matching probability and the second matching probability are fused together according to a preset fusion rule to obtain the matching probability of the first text information and the first image information. If the matching probability is greater than a preset first threshold, it is determined that the monitoring device is not abnormal.

[0014] Furthermore, the method also includes:

[0015] A labeled image is pre-acquired, wherein the labeled image is a monitoring image with manually annotated second image information, and the second image information has corresponding second text information; wherein the first anomaly detection model is trained based on the labeled image and the second text information.

[0016] The second anomaly detection model is trained based on the monitoring image whose first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold, wherein the second threshold is greater than the first threshold.
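The selection rule in [0016] can be sketched as a simple filter. This is a minimal illustration, not the patented implementation: the function name, the `(image_id, probability)` data layout, and the numeric thresholds are all assumptions made for the example.

```python
from typing import List, Tuple


def select_training_images(
    scored: List[Tuple[str, float]],
    first_threshold: float,
    second_threshold: float,
) -> List[str]:
    """Keep image ids whose first matching probability exceeds the second threshold.

    `scored` pairs a hypothetical image id with the first matching probability
    output by the trained first (supervised) model.
    """
    # The source requires the second threshold to be greater than the first.
    if second_threshold <= first_threshold:
        raise ValueError("second threshold must be greater than the first threshold")
    return [image_id for image_id, p1 in scored if p1 > second_threshold]


# Illustrative values only.
scored = [("cam_a", 0.97), ("cam_b", 0.80), ("cam_c", 0.55)]
selected = select_training_images(scored, first_threshold=0.7, second_threshold=0.9)
print(selected)
```

Using a stricter threshold for training data than for detection means that only confidently matched images feed the unsupervised model.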

[0017] Further, the first anomaly detection model includes the target detection model and the text recognition model, and the second image information is identified by a bounding box. The step of training the first anomaly detection model based on the labeled image and the second text information includes:

[0018] The target detection model is trained based on the labeled image so that the target detection model can detect bounding boxes on the labeled image;

[0019] The text recognition model is trained based on the second text information and the second image information marked with a bounding box on the labeled image.

[0020] Further, the text information includes time text information and location text information, and the image information includes time image information and location image information, wherein training the text recognition model based on the second text information and the second image information marked with a bounding box on the labeled image includes:

[0021] The text recognition model is trained based on the second time image information and the second location image information marked with a label box on the labeled image, so that the text recognition model can recognize the time text information and location text information contained in the second time image information and the second location image information.

[0022] The text recognition model is trained by matching the second time text information, the second location text information, the time text information and the location text information recognized by the text recognition model, so that the text recognition model can match the time text information with the second time text information and match the location text information with the second location text information.

[0023] Further, the step of matching and training the text recognition model based on the second time text information, the second location text information, the time text information and the location text information recognized by the text recognition model includes:

[0024] Construct a cost matrix from the loss values corresponding to all matching methods between the second time text information and the second location text information, on the one hand, and the time text information and the location text information, on the other;

[0025] The matching method that minimizes the sum of the loss values in the cost matrix is used as the matching training result of the text recognition model.

[0026] Furthermore, the method also includes:

[0027] Matching and training the text recognition model based on the second time text information, the second location text information, the time text information and the location text information recognized by the text recognition model includes:

[0028] Convolution calculations are performed using the learning parameters with the time text information and the location text information, respectively.

[0029] Based on the convolution calculation result and the matching rule, the matching method combination of the second time text information and the second location text information with the time text information and the location text information is determined, and the matching method combination is used as the matching training result of the text recognition model.

[0030] Furthermore, the method also includes:

[0031] A preset virtual surveillance image is defined as a surveillance image with randomly added third image information and / or a related image similar to the surveillance image. The third image information has corresponding third text information. The step of training the second anomaly detection model based on the surveillance image whose first matching probability with the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold includes:

[0032] The second anomaly detection model is trained based on the virtual surveillance image and / or the surveillance image whose first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold.

[0033] Before the first anomaly detection model is trained, the second anomaly detection model is trained based on the virtual surveillance image.

[0034] After the first anomaly detection model is trained, the second anomaly detection model is trained based on the virtual surveillance image and the surveillance image whose first matching probability output by the first anomaly detection model is greater than a preset second threshold. The training amount of the virtual surveillance image is reduced, and the training amount of the surveillance image whose first matching probability is greater than a preset first threshold is increased, until the training amount of the virtual surveillance image is 0.
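The gradual shift from virtual surveillance images to real monitoring images in [0033] and [0034] can be sketched as a batch-composition schedule. The source does not say how the amount is reduced, so the linear decay, the function name, and the `decay_steps` parameter are assumptions.

```python
from typing import Tuple


def batch_composition(step: int, decay_steps: int, batch_size: int) -> Tuple[int, int]:
    """Return (virtual_count, real_count) for one training batch.

    Assumption: the share of virtual images decays linearly from 100% at
    step 0 to 0% at `decay_steps`, then stays at 0.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    remaining = max(0.0, 1.0 - step / decay_steps)
    virtual_count = round(batch_size * remaining)
    return virtual_count, batch_size - virtual_count


for step in (0, 50, 100, 150):
    print(step, batch_composition(step, decay_steps=100, batch_size=32))
```

Step 0 corresponds to the phase before the first model is trained, when only virtual images are available.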

[0035] Further, training the second anomaly detection model based on the virtual surveillance image and / or the surveillance image whose first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold includes:

[0036] n images, namely virtual surveillance images and / or monitoring images whose first matching probability between the first text information and the first image information, as output by the trained first anomaly detection model, is greater than a preset second threshold, together with their corresponding n pieces of third text information and / or first text information, constitute n image-text pairs.

[0037] For each image-text pair, an image feature and a text feature of the image-text pair are extracted according to a preset image encoder and a preset text encoder;

[0038] Construct $n^2$ matching methods between the n image features and the n text features, wherein the $n^2$ matching methods include n positive samples representing correct matching methods and $(n^2 - n)$ negative samples representing incorrect matching methods;

[0039] Based on the n positive samples and the $(n^2 - n)$ negative samples, the second anomaly detection model is trained through contrastive learning, and the training of the second anomaly detection model is optimized based on the contrastive learning loss.
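The contrastive training step over n image-text pairs can be sketched in pure Python. The source does not name a specific loss, so the symmetric cross-entropy (InfoNCE-style) form and the `temperature` value are assumptions. The feature vectors stand in for the outputs of the preset image and text encoders.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_softmax(row: Sequence[float]) -> List[float]:
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_feats, text_feats, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over an n x n similarity matrix.

    The n diagonal entries are the positive samples; the remaining
    n*n - n entries are the negative samples.
    """
    n = len(image_feats)
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    # Entry (i, j) scores image i against text j.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Correctly paired features yield a much lower loss than mismatched ones, which is the signal the training optimizes.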

[0040] Further, the step of inputting the surveillance image and the first text information into the second anomaly detection model to obtain a second matching probability between the first image information on the surveillance image and the first text information in the network device includes:

[0041] A test text sequence is generated based on the first text information, wherein the test text sequence includes the first text information and several augmented text information generated by augmenting the first text information;

[0042] The monitoring image and the text sequence to be tested are input into the second anomaly detection model, and the image features of the monitoring image and the text features of the text sequence to be tested are extracted according to the preset image encoder and the preset text encoder, respectively.

[0043] Calculate the similarity between the image features and each text feature, and use the normalized similarity values as each matching probability;

[0044] Obtain the matching probability between the image feature and the text feature corresponding to the first text information, and use the matching probability as the second matching probability.
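The inference steps in [0041] to [0044] can be sketched as follows. The source only says the similarity values are normalized; the use of cosine similarity, softmax normalization, and the `temperature` parameter are assumptions, and the feature vectors are placeholders for encoder outputs.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def second_matching_probability(
    image_feat: Sequence[float],
    text_feats: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> Tuple[float, List[float]]:
    """Return the probability for text_feats[0] and the full distribution.

    Convention assumed here: text_feats[0] encodes the first text information,
    and the remaining entries encode the augmented texts.
    """
    sims = [cosine(image_feat, t) / temperature for t in text_feats]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs[0], probs


p_first, probs = second_matching_probability(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
)
print(round(p_first, 4))
```

The augmented texts act as distractors: the probability assigned to the recorded text is high only when it fits the image better than the alternatives.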

[0045] Furthermore, the method also includes:

[0046] A preset time error is defined, wherein training the second anomaly detection model based on the virtual surveillance image and / or the surveillance image whose first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold includes:

[0047] The virtual surveillance images and / or the monitoring images whose first matching probability between the first text information and the first image information, as output by the trained first anomaly detection model, is greater than a preset second threshold, together with their corresponding third text information and / or first text information, constitute n image-text pairs.

[0048] For each image-text pair, an image feature is extracted from that pair according to a preset image encoder.

[0049] Extract (2X+1) text features of the image text pair based on the preset text encoder and the time error X;

[0050] Construct $(2X+1)n^2$ matching methods between the n image features and the $(2X+1)n$ text features, wherein the $(2X+1)n^2$ matching methods include $(2X+1)n$ positive samples representing correct matching methods and $(2X+1)(n^2 - n)$ negative samples representing incorrect matching methods;

[0051] Based on the $(2X+1)n$ positive samples and the $(2X+1)(n^2 - n)$ negative samples, the second anomaly detection model is trained through contrastive learning, and the training of the second anomaly detection model is optimized based on the contrastive learning loss.
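The sample counts under a time error X can be checked with a small label matrix. This is only a counting sketch; the column layout, in which the (2X+1) text features of one pair sit next to each other, is an assumption.

```python
from typing import List


def build_labels(n: int, x: int) -> List[List[int]]:
    """Label matrix with n rows (images) and (2X+1)*n columns (text features).

    Column c is assumed to belong to image-text pair c // (2X+1), so each
    image has (2X+1) positive texts and (2X+1)*(n-1) negative texts.
    """
    k = 2 * x + 1
    return [[1 if col // k == row else 0 for col in range(k * n)] for row in range(n)]


n, x = 3, 1
labels = build_labels(n, x)
positives = sum(sum(row) for row in labels)
negatives = sum(len(row) for row in labels) - positives
print(positives, negatives)
```

For n = 3 and X = 1 the matrix has 27 entries, of which 9 are positive and 18 are negative, in line with the counts stated in the claim.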

[0052] Further, the step of inputting the surveillance image and the first text information into the second anomaly detection model to obtain a second matching probability between the first image information on the surveillance image and the first text information in the network device includes:

[0053] Generate (2X+1) basic text information based on the first text information and the time error;

[0054] A test text sequence is generated based on (2X+1) basic text information, wherein the test text sequence includes (2X+1) basic text information and several augmented text information generated by augmenting the basic text information;

[0055] The monitoring image and the text sequence to be tested are input into the second anomaly detection model, and the image features of the monitoring image and the text features of the text sequence to be tested are extracted according to the preset image encoder and the preset text encoder, respectively.

[0056] Calculate the similarity between the image features and each text feature, and use the normalized similarity values as each matching probability;

[0057] Obtain the matching probability between the image feature and the text feature corresponding to the basic text information, and take the maximum value of the matching probability as the second matching probability.
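Generating the (2X+1) basic text information from the recorded time can be sketched with the standard library. The source does not give the unit of the time error or the timestamp format; one-second steps and the `YYYY-MM-DD HH:MM:SS` format (taken from the example timestamps in the description) are assumptions.

```python
from datetime import datetime, timedelta
from typing import List

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the example timestamps


def basic_time_texts(time_text: str, x: int) -> List[str]:
    """Return the (2X+1) candidate time texts from t-X to t+X seconds."""
    base = datetime.strptime(time_text, TIME_FORMAT)
    return [
        (base + timedelta(seconds=delta)).strftime(TIME_FORMAT)
        for delta in range(-x, x + 1)
    ]


candidates = basic_time_texts("2021-08-22 22:20:20", x=1)
print(candidates)
```

Each candidate is then scored against the image, and the maximum matching probability over the candidates is used, which tolerates small clock offsets between the device and the backend.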

[0058] Further, the step of inputting the surveillance image and the first text information into the first anomaly detection model to obtain the first matching probability between the first image information on the surveillance image and the first text information in the network device includes:

[0059] The (2X+1) basic text information and the monitoring image are input into the first anomaly detection model, and the matching probability between the monitoring image and each of the basic text information is calculated respectively. The maximum matching probability is taken as the first matching probability.

[0060] Further, after training the second anomaly detection model based on the monitoring image whose first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold, the method further includes:

[0061] The second anomaly detection model is optimized based on the monitoring images whose matching probability is greater than a preset second threshold, determined by the first and second anomaly detection models after training, according to a preset fusion rule.

[0062] Furthermore, the fusion rules include:

[0063] The matching probability y = α·M1(x) + (1-α)·M2(x)

[0064] Where M1(x) is the first matching probability, M2(x) is the second matching probability, and α is a weight parameter that decreases continuously as the amount of training data of the second anomaly detection model increases until it decreases to a preset limit value.
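The fusion rule can be sketched directly from the formula. The source states only that α decreases as the training data of the second model grows, down to a preset limit; the linear decay and all numeric defaults below are assumptions.

```python
def alpha_schedule(
    num_samples: int,
    alpha_start: float = 0.9,
    alpha_limit: float = 0.3,
    decay_per_sample: float = 1e-4,
) -> float:
    """Weight of the first model; decays linearly (an assumption) to a floor."""
    return max(alpha_limit, alpha_start - decay_per_sample * num_samples)


def fuse(m1: float, m2: float, alpha: float) -> float:
    """y = alpha * M1(x) + (1 - alpha) * M2(x)."""
    return alpha * m1 + (1.0 - alpha) * m2


def is_normal(m1: float, m2: float, alpha: float, first_threshold: float) -> bool:
    """The device is judged not abnormal when the fused probability exceeds the threshold."""
    return fuse(m1, m2, alpha) > first_threshold


alpha = alpha_schedule(num_samples=2000)
print(alpha, fuse(0.8, 0.6, alpha))
```

As the second model sees more data, the fused score leans more on its output and less on the supervised model.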

[0065] According to another aspect of this application, a computer-readable medium is also provided, on which computer-readable instructions are stored, which can be executed by a processor to perform the operation as described above.

[0066] According to another aspect of this application, an anomaly detection device for monitoring equipment is also provided, wherein the device includes:

[0067] One or more processors; and

[0068] A memory storing computer-readable instructions, which, when executed, cause the processor to perform the operations described above.

[0069] Compared with existing technologies, this application obtains the monitoring image of the monitoring device at a target time and the first text information of the monitoring image stored by the network device at the target time, wherein the monitoring image contains first image information; the monitoring image and the first text information are input into an anomaly detection model; the matching probability of the first text information and the first image information is obtained through the anomaly detection model, and the presence of an anomaly in the monitoring device is determined by the matching probability. In this way, the anomaly detection problem is transformed into a matching problem of first image information and first text information, making full use of the characteristics of monitoring scenarios such as urban monitoring. It can uniformly detect faults in the monitoring device itself and connection anomalies between the monitoring device and the monitoring system without complex engineering details. The monitoring system maintained in this way has strong reliability and maintainability.

Attached Figure Description

[0070] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0071] Figure 1 shows a flowchart of an anomaly detection method for a monitoring device according to one aspect of this application.

[0072] Figure 2 shows a flowchart of an anomaly detection method for a monitoring device according to a preferred embodiment of this application.

[0073] Figure 3 shows a schematic diagram of a monitoring image in a monitoring scenario to which this application applies.

[0074] The same or similar reference numerals in the accompanying drawings represent the same or similar parts.

Detailed Implementation

[0075] The present invention will now be described in further detail with reference to the accompanying drawings.

[0076] In a typical configuration of this application, the terminal, the device of the service network, and the trusted party all include one or more processors (CPUs), input / output interfaces, network interfaces, and memory.

[0077] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0078] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0079] In some monitoring scenarios, such as urban surveillance, the monitoring image shown in Figure 3 includes information provided by the monitoring device, such as the monitoring picture of a specific scene and the time and location information of that scene. This time and location information (i.e., the first image information) is displayed in the monitoring image in image form. At the same time, the monitoring system backend (i.e., the network device) records the correct time and location information (i.e., the first text information) corresponding to the monitoring device in text form, based on the monitoring device's acquisition frequency and target location. Taking Figure 3 as an example, the monitoring image contains the time information "2021-08-22 22:20:20" and the location information "204 Wukang Road 2HG" in image form. If there is no abnormality in the monitoring equipment, the monitoring system backend has the time information "2021-08-22 22:20:20" and the location information "204 Wukang Road 2HG" recorded in text form. In this scenario, when the monitoring equipment's screen freezes, the first image information on the monitoring image provided by the monitoring equipment cannot be matched with the first text information recorded by the monitoring system's backend. For example, the time information of the monitoring image is "2021-08-22 22:20:20" and the location information is "204 Wukang Road 2HG", while the correct time information recorded by the monitoring system's backend is "2021-08-22 22:24:46" and the location information is "204 Wukang Road 2HG". Therefore, by matching the first image information on the monitoring image with the first text information recorded by the monitoring system's backend, it is possible to determine whether the monitoring equipment is malfunctioning based on the matching result.

[0080] To further illustrate the technical means adopted and the effects achieved in this application, the technical solution of this application will be clearly and completely described below in conjunction with the accompanying drawings and preferred embodiments.

[0081] Figure 1 illustrates an anomaly detection method for a monitoring device according to this application, comprising:

[0082] S11: obtaining the monitoring image of the monitoring device at the target time and the first text information of the monitoring image stored by the network device at the target time, wherein the monitoring image has first image information;

[0083] S12: inputting the monitoring image and the first text information into the anomaly detection model;

[0084] S13: obtaining the matching probability of the first text information and the first image information through the anomaly detection model, so as to determine whether the monitoring device has an anomaly based on the matching probability.

[0085] In this embodiment, in step S11, the monitoring image of the monitoring device at the target time and the first text information of the monitoring image stored by the network device at the target time are obtained, wherein the monitoring image has first image information.

[0086] If the monitoring images and / or first text information of the monitoring equipment and / or network equipment cannot be obtained, it can be directly determined that the monitoring equipment and / or network equipment is malfunctioning, and maintenance information will be automatically sent. For example, improper or incorrect connection between the monitoring system and the monitoring equipment will cause the monitoring equipment to fail to display the monitoring images normally. In this case, the monitoring images cannot be obtained, and maintenance information will be sent directly to the designated recipient. Furthermore, if valid first image information cannot be extracted from the acquired monitoring images, it also indicates that the monitoring equipment is malfunctioning. For example, if there is a short circuit, open circuit, or poor contact of the BNC connector between the video cable core and the shielding mesh of the monitoring equipment, there will be large-area mesh interference on the monitoring screen. In this case, valid first image information cannot be obtained from the monitoring images, so it is determined that the monitoring equipment is malfunctioning, and maintenance information will be automatically sent. It should be clarified that monitoring equipment malfunctions include faults in the monitoring equipment itself, such as failure to display the monitoring screen or lag, as well as connection malfunctions between the monitoring equipment and the monitoring system, such as short circuits, open circuits, or interference.

[0087] In step S12, the acquired monitoring image and the first text information are input into the anomaly detection model. In step S13, the matching probability of the first text information and the first image information is obtained through the anomaly detection model, so as to determine whether the monitoring device has an anomaly based on the matching probability.
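Steps S11 to S13, together with the missing-input rule in [0086], can be sketched as a single decision function. The `model` callable, the function name, and the default threshold are hypothetical; the threshold comparison follows the rule that a matching probability above the first threshold means no anomaly.

```python
from typing import Callable, Optional


def detect_anomaly(
    image: Optional[bytes],
    first_text: Optional[str],
    model: Callable[[bytes, str], float],
    first_threshold: float = 0.5,
) -> bool:
    """Return True when the monitoring device is judged abnormal.

    `model` is a hypothetical stand-in for the anomaly detection model; it
    returns the matching probability between the image and the text.
    """
    # S11: if either input cannot be obtained, report an anomaly directly.
    if image is None or not first_text:
        return True
    # S12 and S13: score the pair and compare against the threshold.
    return model(image, first_text) <= first_threshold


def stub_model(image: bytes, text: str) -> float:
    """Hypothetical model: high probability only when the text appears in the bytes."""
    return 0.95 if text.encode() in image else 0.05


frame = b"...2021-08-22 22:20:20..."
print(detect_anomaly(frame, "2021-08-22 22:20:20", stub_model))
print(detect_anomaly(frame, "2021-08-22 22:24:46", stub_model))
print(detect_anomaly(None, "2021-08-22 22:24:46", stub_model))
```

The second call mirrors the frozen-screen example: the image still shows the old timestamp while the backend has moved on, so the pair no longer matches.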

[0088] This method determines whether the monitoring equipment is abnormal by matching the first image information and the first text information. It fully utilizes the characteristics of urban monitoring scenarios, enabling unified detection of both equipment malfunctions and connection anomalies between the equipment and the monitoring system without complex engineering details. Monitoring systems maintained in this way exhibit strong reliability and maintainability. However, it should be clarified that while scalability is a necessary characteristic for distributed systems, unlike cloud computing data centers, monitoring systems have relatively fixed inputs—only the video signals from the monitoring equipment itself—without fluctuating traffic over time. Therefore, scalability is not a problem to be solved in this case.

[0089] In another preferred embodiment, as shown in Figure 2, the anomaly detection model includes a first anomaly detection model with supervised learning and a second anomaly detection model with unsupervised learning. Step S21 in Figure 2 is the same or substantially the same as step S11 in the embodiment of Figure 1, and is therefore not repeated here but is incorporated by reference. Steps S22, S23, and S24 include: inputting the monitoring image and the first text information into the first anomaly detection model to obtain a first matching probability between the first image information on the monitoring image and the first text information in the network device; inputting the monitoring image and the first text information into the second anomaly detection model to obtain a second matching probability between the first image information on the monitoring image and the first text information in the network device; fusing the first matching probability and the second matching probability according to a preset fusion rule to obtain a matching probability between the first text information and the first image information; if the matching probability is greater than a preset first threshold, determining that the monitoring device is not abnormal. Here, an anomaly detection model combining supervised and unsupervised learning is used, which reduces the annotation cost of the anomaly detection model while ensuring its detection accuracy. It should be noted that steps S22 and S23 need not be performed in any particular order. Since the inputs of the first and second anomaly detection models do not depend on each other, and each model requires a certain amount of time for detection, the two detections can be run concurrently to save time.
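Because steps S22 and S23 do not depend on each other, they can be dispatched concurrently. The sketch below uses a thread pool from the Python standard library; the two model callables are hypothetical stubs, and a thread pool is just one possible way to run them in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

Model = Callable[[bytes, str], float]


def run_models_concurrently(
    image: bytes, text: str, first_model: Model, second_model: Model
) -> Tuple[float, float]:
    """Run both models on the same inputs and return (M1, M2)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_1 = pool.submit(first_model, image, text)
        future_2 = pool.submit(second_model, image, text)
        return future_1.result(), future_2.result()


def first_model_stub(image: bytes, text: str) -> float:
    """Hypothetical supervised model output."""
    return 0.92


def second_model_stub(image: bytes, text: str) -> float:
    """Hypothetical unsupervised model output."""
    return 0.88


m1, m2 = run_models_concurrently(b"frame", "2021-08-22 22:20:20",
                                 first_model_stub, second_model_stub)
print(m1, m2)
```

The two results are then passed to the fusion step S24.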

[0090] Specifically, during model training, a first anomaly detection model is trained based on manually labeled images. These labeled images are surveillance images with manually labeled second image information, and the second image information has corresponding second text information. Here, both the second image information and the second text information are determined by the first image information of the surveillance image. For example, if the first image information provided by the monitoring equipment is the first time image information "2021-08-22 22:20:20" and the first location image information "204 Wukang Road 2HG", then the manually recorded second text information is the second time text information "2021-08-22 22:20:20" and the second location text information "204 Wukang Road 2HG". At the same time, the second image information is manually labeled as the second time image information "2021-08-22 22:20:20" and the second location image information "204 Wukang Road 2HG", and these are marked with bounding boxes. This surveillance image is then used as the labeled image for training the first anomaly detection model.

[0091] The first anomaly detection model includes an object detection model and a text recognition model. The object detection model is trained on labeled images to identify all bounding boxes on the labeled images. No restrictions are placed on the object detection model; suitable models include, but are not limited to, RetinaNet. The text recognition model is trained on manually annotated bounding boxes in the labeled images and the second image information within them. This allows the text recognition model to identify the text content (i.e., time and location information) corresponding to the second time and location image information represented in image form. No restrictions are placed on the text recognition model; suitable models include, but are not limited to, CRNN. After the first training of the text recognition model, matching training is performed on the second time and location text information, as well as the time and location text information identified by the model. This allows the text recognition model to match time text information with the second time text information and location text information with the second location text information. It should be clarified that the matching training enables the text recognition model to correctly classify text information as time or location without needing to know the specific meaning of the text information.

[0092] In one feasible matching training method, the loss values of all possible pairings between the labeled text (the second time text information and the second location text information) and the recognized text (the time text information and the location text information) are arranged into a cost matrix; the pairing combination with the smallest sum of loss values in the cost matrix is taken as the matching training result of the text recognition model.

[0093] Specifically, all matching methods between the second time text information and the second location text information and the time text information and location text information are as follows:

[0094]

[0095]

[0096] In the text recognition model, time text information 1 and location text information 2 in the form of token sequences are extracted. Based on the time text information 1 and location text information 2 in the form of token sequences and the second time text information a and the second location text information b in the form of token sequences, the corresponding loss values of the above four matching methods are calculated. The loss function used includes, but is not limited to, CTC Loss. The cost matrix is as follows:

[0097]
        a          b
1       Loss_1a    Loss_1b
2       Loss_2a    Loss_2b

[0098] The optional matching combinations are 1a-2b and 2a-1b. In a simple and direct evaluation method, the sum of the loss values of each matching combination is calculated, namely (Loss_1a + Loss_2b) and (Loss_1b + Loss_2a), and the combination with the smallest sum is selected as the matching training result of the text recognition model for the current training round.
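The minimum-sum selection over the cost matrix can be illustrated with a short sketch. The function name `best_matching` and the loss values are hypothetical and chosen only for illustration; the sketch simply enumerates every one-to-one pairing, which is adequate for the 2×2 case described here.

```python
from itertools import permutations

def best_matching(cost):
    """Return the one-to-one pairing with the smallest total loss.

    cost[i][j] is the loss of pairing recognized text i (rows: 1, 2)
    with labeled text j (columns: a, b).
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total

# Hypothetical loss values, not taken from the application.
cost = [[0.2, 3.1],   # Loss_1a, Loss_1b
        [2.7, 0.4]]   # Loss_2a, Loss_2b
perm, total = best_matching(cost)
# perm == (0, 1) corresponds to the combination 1a-2b.
```

For larger matrices an assignment solver would be preferable to brute-force enumeration, but that goes beyond the two-item case in this embodiment.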

[0099] In another feasible matching training method, learning parameters and matching rules are preset, and convolution calculations are performed with the time text information and the location text information respectively according to the learning parameters; the matching method combination of the second time text information and the second location text information with the time text information and the location text information is determined according to the convolution calculation result and the matching rule, and the matching method combination is used as the matching training result of the text recognition model.

[0100] Specifically, the text recognition model extracts time text information 1 and location text information 2 in multi-dimensional vector form. Convolution calculations are then performed on these two elements using preset learning parameters. The selected matching combination (1a-2b or 2a-1b) is determined based on the sign of the convolution result and preset rules. The learning parameters are automatically optimized during the text recognition model's training process. For example, if the preset rule is that a negative convolution result matches the second location text information b, and vice versa, a positive result matches the second time text information a, then if the convolution result of time text information 1 is non-negative, the matching method 1a-2b is selected.

[0101] It should be clarified here that the text recognition model's identification of time and location text information in the second image information only involves extracting the text content from the second image information. At this stage, the text recognition model is unaware of the distinction between time and location content; that is, it does not know which text content represents time information and which represents location information. The purpose of the text recognition model's matching training is to enable the model to distinguish between time and location, that is, to correctly match the time text information with the second time text information, and to correctly match the location text information with the second location text information.

[0102] Through the supervised learning training method described above, when a monitoring image and first text information are input into the first anomaly detection model, the model first identifies the target in the monitoring image with bounding boxes based on previous target detection training. Then, it identifies the text content (time text information and location text information) represented by the first image information (i.e., first time image information and first location image information) within the bounding boxes. The model then pairs the time text information with the first time text information in the first text information, and the location text information with the first location text information in the first text information. Finally, it calculates the probability that the time text information can be decoded into the first time text information, and the probability that the location text information can be decoded into the first location text information. The minimum of the two probabilities is taken as the first matching probability between the first image information and the first text information. The two probability values are compared with a first threshold. If either probability is less than the first threshold, the monitoring device is considered to be abnormal. Because the first anomaly detection model is a supervised learning model, high detection accuracy can be guaranteed in the initial stage of anomaly detection.
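The decision rule in this paragraph can be sketched as follows. The function names and the example probabilities are illustrative assumptions; taking the minimum of the two probabilities and comparing it with the first threshold is equivalent to flagging an anomaly when either probability falls below the threshold.

```python
def first_matching_probability(p_time, p_location):
    """First matching probability: the smaller of the two decoding probabilities."""
    return min(p_time, p_location)

def is_abnormal(p_time, p_location, first_threshold):
    """The device is considered abnormal if either probability is below the threshold."""
    return first_matching_probability(p_time, p_location) < first_threshold

# Hypothetical values: the time stamp decodes well, the location does not.
flag = is_abnormal(p_time=0.97, p_location=0.42, first_threshold=0.5)
```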

[0103] The second anomaly detection model is an unsupervised learning model. After the first anomaly detection model meets the training accuracy requirements, the second anomaly detection model is trained based on monitoring images whose first matching probability output by the first anomaly detection model is greater than a second threshold, wherein the second threshold is greater than the first threshold.

[0104] Here, the first threshold is used to determine whether the monitoring equipment is abnormal, and the second threshold is used to filter the training data of the second anomaly detection model. To ensure the high reliability of the training data, the second threshold is greater than the first threshold. Only when the first matching probability of a certain monitoring image output by the first anomaly detection model is greater than the second threshold, that is, when the monitoring equipment providing the monitoring image is not abnormal and the credibility of the monitoring image is high, is the monitoring image used as the training data of the second anomaly detection model.

[0105] In a preferred embodiment, a virtual surveillance image is preset. The virtual surveillance image is a surveillance image, and / or an associated image similar to the surveillance image, to which third image information marked by a bounding box has been randomly added. The third image information has corresponding third text information. Before the first anomaly detection model is trained, the second anomaly detection model is trained based on the virtual surveillance image. After the first anomaly detection model is trained, the second anomaly detection model is trained based on the virtual surveillance image and the surveillance images whose first matching probability output by the first anomaly detection model is greater than the preset second threshold. The training amount of the virtual surveillance image is then reduced, and the training amount of the surveillance images whose first matching probability is greater than the preset second threshold is increased, until the training amount of the virtual surveillance image is 0.

[0106] Here, since the first anomaly detection model cannot provide reliable training data to the second anomaly detection model before meeting the accuracy requirements of model training, an independent pre-training method for the second anomaly detection model is provided, enabling the second anomaly detection model and the first anomaly detection model to start model training simultaneously. After the first anomaly detection model meets the accuracy requirements of model training, it provides real and reliable training data to the second anomaly detection model, and gradually increases the training volume of this part of real data while reducing the training volume of virtual monitoring images, thereby shortening the training time of the entire anomaly detection model while ensuring the adaptability and detection accuracy of the model to the application scenario.
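The gradual shift from virtual to real training data can be sketched as below. The application does not specify how quickly the virtual share is reduced, so the linear schedule, the parameter `decay_steps`, and the dictionary key `"p1"` are assumptions made only for illustration.

```python
def select_real_samples(samples, second_threshold):
    """Keep only real images whose first matching probability exceeds the second threshold."""
    return [s for s in samples if s["p1"] > second_threshold]

def batch_composition(step, decay_steps, batch_size):
    """Return (num_virtual, num_real) for a training step.

    Assumed linear schedule: the virtual share falls from 100% at step 0
    to 0% at step `decay_steps` and stays at 0 afterwards.
    """
    virtual_fraction = max(0.0, 1.0 - step / decay_steps)
    num_virtual = round(batch_size * virtual_fraction)
    return num_virtual, batch_size - num_virtual

# Hypothetical outputs of the first anomaly detection model.
candidates = [{"id": 1, "p1": 0.99}, {"id": 2, "p1": 0.70}, {"id": 3, "p1": 0.95}]
reliable = select_real_samples(candidates, second_threshold=0.9)
```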

[0107] Specifically, obtain a monitoring image without the first image information or a similar associated image, and automatically add third image information marked with a bounding box to it according to a predetermined program, including third time image information marked with a bounding box and third location image information marked with a bounding box. Use this image as a virtual monitoring image. Here, the automatically added third image information does not need to be consistent with the image content, but should conform to basic rules. For example, roughly divide the time of the image through the overall brightness of the image. If the image content is at night, the third time image information should also be within the time range of night. Record the third image information added by the predetermined program in text form as third text information, including third time text information and third location text information.
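A minimal sketch of how a predetermined program might choose third time text information that is consistent with image brightness is given below. The brightness cutoff `night_below`, the hour ranges, and the function name `make_third_text` are all assumptions; the application only states that the added time should fall in a plausible range, for example night for a dark image. Rendering the text onto the image inside a bounding box is not shown.

```python
import random
from datetime import datetime

NIGHT_HOURS = [20, 21, 22, 23, 0, 1, 2, 3, 4, 5]  # assumed definition of night
DAY_HOURS = list(range(6, 20))                    # assumed definition of day

def make_third_text(mean_brightness, date, location, rng, night_below=80):
    """Pick a random time of day consistent with the image's mean brightness (0-255)."""
    hours = NIGHT_HOURS if mean_brightness < night_below else DAY_HOURS
    stamp = date.replace(hour=rng.choice(hours),
                         minute=rng.randrange(60),
                         second=rng.randrange(60))
    # Third time text information and third location text information.
    return stamp.strftime("%Y-%m-%d %H:%M:%S"), location

rng = random.Random(0)
time_text, location_text = make_third_text(
    mean_brightness=30, date=datetime(2021, 8, 22),
    location="204 Wukang Road 2HG", rng=rng)
```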

[0108] Before the first anomaly detection model meets the accuracy requirements of model training, n virtual monitoring images and their corresponding third text information are obtained, and each virtual monitoring image is paired with its corresponding third text information to form an image-text pair. For each of the n image-text pairs, an image feature I_x and a text feature T_x are extracted by a preset image encoder and a preset text encoder, respectively, where 0 < x ≤ n. The n^2 matching methods between the n image features and the n text features are then constructed, as follows:

[0109]
        T_1       T_2       T_3       ……    T_n
I_1     I_1T_1    I_1T_2    I_1T_3    ……    I_1T_n
I_2     I_2T_1    I_2T_2    I_2T_3    ……    I_2T_n
I_3     I_3T_1    I_3T_2    I_3T_3    ……    I_3T_n
……      ……        ……        ……        ……    ……
I_n     I_nT_1    I_nT_2    I_nT_3    ……    I_nT_n

[0110] Among them, the n^2 matching methods include n positive samples representing correct matching methods, namely I_xT_x, where 0 < x ≤ n, and (n^2 - n) negative samples representing incorrect matching methods. The second anomaly detection model is trained through contrastive learning based on the n positive samples and the (n^2 - n) negative samples, and its training is optimized based on the contrastive learning loss function, which includes, but is not limited to, InfoNCE loss.
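As an illustration of the contrastive objective, the following sketch computes a symmetric InfoNCE-style loss over an n×n similarity matrix whose diagonal entries are the positive pairs. The symmetric averaging over both directions and the `temperature` value follow common CLIP-style practice and are assumptions; the application names InfoNCE only as one possible loss.

```python
import math

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE loss for an n x n similarity matrix.

    sim[i][j] is the similarity between image feature I_i and text feature T_j;
    the diagonal (i == j) holds the n positive samples.
    """
    n = len(sim)

    def directional_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # cross-entropy with the positive at index i
        return total / n

    columns = [list(c) for c in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (directional_loss(sim) + directional_loss(columns))

# Hypothetical similarity matrices: well-aligned versus misaligned features.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
```

A lower loss for `aligned` than for `misaligned` reflects that minimizing this loss pushes matched image-text pairs together and mismatched pairs apart.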

[0111] Here, the image-text matching principle of the CLIP (Contrastive Language-Image Pre-training) model is used to establish a multimodal (image-text) second anomaly detection model. The step of extracting bounding boxes through the target detection model is omitted. The image features of the entire virtual surveillance image are extracted directly by the image encoder, and the text features of the corresponding third text information are extracted by the text encoder. Since extracting text features does not require distinguishing between the third time text information and the third location text information, the two are concatenated, and the text features of the concatenated result are extracted. The second anomaly detection model is then trained using contrastive learning on positive and negative samples to establish the matching relationship between image features and text features, and is optimized using the contrastive learning loss function. This approach avoids the manual annotation cost and related technical problems that target detection training would otherwise introduce for the second anomaly detection model, saving on model training expenses.

[0112] After the first anomaly detection model meets the required training accuracy, surveillance images with a first matching probability greater than the preset second threshold are added to the training data of the second anomaly detection model. The second anomaly detection model is then trained using the aforementioned training method. The amount of this real training data is gradually increased, while the amount of virtual surveillance images used for training is decreased until it reaches 0.

[0113] In the process of anomaly detection using the second anomaly detection model, firstly, several augmented text information derived from the first text information are automatically generated, and the first text information and the augmented text information constitute the test text sequence. Then, the monitoring image and the test text sequence are input into the second anomaly detection model, and the image features of the monitoring image and the text features of the first text information and each augmented text information in the test text sequence are extracted according to the preset image encoder and preset text encoder, respectively. The similarity between the image features and each text feature is calculated through the multimodal second anomaly detection model, and the normalized similarity values are used as the matching probability between the monitoring image and the first text information and each augmented text information. The matching probability between the image feature and the text feature corresponding to the first text information is obtained, and this matching probability is used as the second matching probability between the monitoring image and the first text information.
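The normalization step can be sketched as follows. Using a softmax over the similarities is an assumption borrowed from CLIP-style zero-shot matching; the application states only that the similarity values are normalized. The convention that the first text information sits at index 0 of the test text sequence is also assumed for illustration.

```python
import math

def second_matching_probability(similarities, temperature=1.0):
    """Normalize image-text similarities and return the probability of the first text.

    similarities[0] belongs to the first text information; the remaining
    entries belong to the augmented text information in the test text sequence.
    """
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return probs[0], probs

# Hypothetical similarities: the recorded text matches the image best.
p2, probs = second_matching_probability([2.0, 0.0, 0.0])
```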

[0114] Furthermore, the first matching probability and the second matching probability are fused into a matching probability of the first text information and the first image information according to a preset fusion rule. If the matching probability is greater than a preset first threshold, it is determined that the monitoring device is not abnormal; otherwise, if the matching probability is less than or equal to the preset first threshold, the monitoring device is abnormal. The fusion rule is: matching probability y = α·M1(x) + (1-α)·M2(x); where M1(x) is the first matching probability, M2(x) is the second matching probability, and α is a weight parameter that decreases continuously as the amount of training data of the second anomaly detection model increases until it decreases to a preset limit value.

[0115] Furthermore, the second anomaly detection model can be optimized based on monitoring images whose matching probability is greater than a preset second threshold, as determined by the first and second anomaly detection models.

[0116] Here, because the training data for the second anomaly detection model is virtual surveillance images and the amount of training data is limited, the accuracy of the second anomaly detection model is lower than that of the first anomaly detection model in the initial stage. However, as the amount of virtual surveillance images used for training the second anomaly detection model decreases and the amount of real training data increases, the accuracy of the second anomaly detection model will gradually surpass that of the first anomaly detection model through continuous optimization and training. Therefore, α is set as a weight parameter that decreases continuously with the increase of the amount of training data for the second anomaly detection model until it reaches a preset limit value. This ensures that the detection results of the anomaly detection model in the early stage are mainly affected by the detection results of the supervised learning first anomaly detection model, while in the later stage they are mainly affected by the detection results of the unsupervised learning second anomaly detection model. This application provides a strategy for fusing model inference results that adjusts with changes in the amount of training data. It integrates supervised and unsupervised learning anomaly detection models, obtaining an anomaly detection model whose performance gradually improves with the time of model use at a lower annotation cost.
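The fusion rule y = α·M1(x) + (1-α)·M2(x) and a decreasing weight α can be sketched as below. The application does not give the exact schedule for α, so the linear decrease, the starting value, the limit value, and `ramp_samples` are assumptions chosen only to illustrate a weight that shrinks with the amount of training data and then stays at a preset limit.

```python
def alpha_schedule(num_samples, alpha_start=0.9, alpha_limit=0.2, ramp_samples=10_000):
    """Assumed linear decrease of alpha with training data volume, clamped at the limit."""
    slope = (alpha_start - alpha_limit) / ramp_samples
    return max(alpha_limit, alpha_start - slope * num_samples)

def fused_probability(m1, m2, alpha):
    """Fusion rule: y = alpha * M1(x) + (1 - alpha) * M2(x)."""
    return alpha * m1 + (1 - alpha) * m2

def device_is_abnormal(m1, m2, alpha, first_threshold):
    """Abnormal when the fused matching probability is at or below the first threshold."""
    return fused_probability(m1, m2, alpha) <= first_threshold

# Early in deployment the supervised model dominates; later the unsupervised one does.
early = fused_probability(0.8, 0.6, alpha_schedule(0))
late = fused_probability(0.8, 0.6, alpha_schedule(1_000_000))
```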

[0117] In practical applications, network latency and other factors can cause discrepancies between the first time image information on the monitoring image and the first time text information recorded in the background. Therefore, in a preferred embodiment, a time error is preset, and the first time text information in the first text information is augmented into several instances based on this time error, while the first location text information remains unchanged, thereby generating several items of basic text information. In one possible embodiment, an acceptable error of X minutes is set and the time represented by the first time text information is A; then (2X+1) items of basic text information are generated, covering (A-X) to (A+X) in one-minute steps. For example, if the time error is preset to one minute and the first time text information is "2021-08-22 22:20:20", it is expanded to "2021-08-22 22:19:20", "2021-08-22 22:20:20", and "2021-08-22 22:21:20". The first location text information, "204 Wukang Road 2HG", remains unchanged. The three items of basic text information are therefore "2021-08-22 22:19:20" "204 Wukang Road 2HG", "2021-08-22 22:20:20" "204 Wukang Road 2HG", and "2021-08-22 22:21:20" "204 Wukang Road 2HG". Alternatively, the seconds field of the first time text information can be ignored, and the first time text information expanded to "2021-08-22 22:19", "2021-08-22 22:20", and "2021-08-22 22:21". In this case, the seconds value in the expanded basic text information is unconstrained, and any value from 0 to 59 seconds is accepted. Similarly, when the time error is set to two minutes and the first time text information is "2021-08-22 22:20:20", ignoring the seconds it is augmented to "2021-08-22 22:18", "2021-08-22 22:19", "2021-08-22 22:20", "2021-08-22 22:21", and "2021-08-22 22:22". This augmentation method provides more basic text information, improving accuracy during both training and testing.
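The time augmentation can be sketched with the standard library as follows. The function name `augment_time_text` and the `keep_seconds` switch are illustrative; the one-minute step is inferred from the (2X+1) count and the examples in this paragraph.

```python
from datetime import datetime, timedelta

def augment_time_text(time_text, location_text, error_minutes, keep_seconds=True):
    """Generate (2X+1) basic text items covering A-X .. A+X minutes.

    The location text is left unchanged. With keep_seconds=False the seconds
    field is dropped, so any seconds value is implicitly accepted.
    """
    base = datetime.strptime(time_text, "%Y-%m-%d %H:%M:%S")
    fmt = "%Y-%m-%d %H:%M:%S" if keep_seconds else "%Y-%m-%d %H:%M"
    return [((base + timedelta(minutes=k)).strftime(fmt), location_text)
            for k in range(-error_minutes, error_minutes + 1)]

one_minute = augment_time_text("2021-08-22 22:20:20", "204 Wukang Road 2HG", 1)
two_minutes = augment_time_text("2021-08-22 22:20:20", "204 Wukang Road 2HG", 2,
                                keep_seconds=False)
```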

[0118] For the first anomaly detection model, the training process remains unchanged. During the detection process, (2X+1) basic text information and monitoring images are input into the first anomaly detection model. The matching probability between the monitoring image and each basic text information is calculated respectively, and the maximum matching probability between the first image information and the basic text information is taken as the first matching probability.

[0119] For the second anomaly detection model, during model training, the text information in each of the n image-text pairs used for training is first augmented into (2X+1) items according to the time error of X minutes, using the method above, before the text features are extracted. Then (2X+1) text features of each image-text pair are extracted by the preset text encoder. For all n image-text pairs, (2X+1)n^2 matching methods between the n image features and the (2X+1)n text features are constructed. Among these (2X+1)n^2 matching methods, (2X+1)n are positive samples representing correct matching methods and (2X+1)(n^2 - n) are negative samples representing incorrect matching methods. The second anomaly detection model is trained by contrastive learning based on the (2X+1)n positive samples and the (2X+1)(n^2 - n) negative samples, and is optimized according to the contrastive learning loss.
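The sample counts in this paragraph can be checked with a small sketch that builds the positive/negative mask for n image features and (2X+1)n text features. The column ordering, in which the (2X+1) augmented texts of each pair are stored contiguously, is an assumption made for illustration.

```python
def positive_mask(n, x_minutes):
    """mask[i][j] is True when text feature j was augmented from image i's own text.

    Columns are assumed to be grouped per image-text pair, (2X+1) columns each.
    """
    k = 2 * x_minutes + 1
    return [[j // k == i for j in range(k * n)] for i in range(n)]

mask = positive_mask(n=4, x_minutes=1)
positives = sum(cell for row in mask for cell in row)      # (2X+1) * n
negatives = sum(not cell for row in mask for cell in row)  # (2X+1) * (n^2 - n)
```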

[0120] Taking a time error of one minute, with the text information augmented into three items, as an example: three text features corresponding to the three items of text information are extracted by the preset text encoder. One image-text pair therefore corresponds to one image feature I_x and three text features T_x1, T_x2, T_x3, where 0 < x ≤ n. For n image-text pairs, 3n^2 matching methods between the n image features and the 3n text features are constructed, as follows:

[0121]
        T_11       T_12       T_13       …    T_n1       T_n2       T_n3
I_1     I_1T_11    I_1T_12    I_1T_13    …    I_1T_n1    I_1T_n2    I_1T_n3
I_2     I_2T_11    I_2T_12    I_2T_13    …    I_2T_n1    I_2T_n2    I_2T_n3
……      ……         ……         ……         …    ……         ……         ……
I_n     I_nT_11    I_nT_12    I_nT_13    …    I_nT_n1    I_nT_n2    I_nT_n3

In the detection process of the second anomaly detection model, several items of augmented text information derived from each item of basic text information are first generated automatically, forming a test text sequence composed of the (2X+1) items of basic text information and the augmented text information. The monitoring image and the test text sequence are then input into the second anomaly detection model. The image features of the monitoring image, and the text features of the (2X+1) items of basic text information and of each item of augmented text information in the test text sequence, are extracted by the preset image encoder and the preset text encoder, respectively. The similarity between the image features and each text feature is calculated by the multimodal second anomaly detection model, and the normalized similarity values are used as the matching probabilities between the monitoring image and each item of basic and augmented text information. The matching probabilities between the image features and the text features corresponding to the (2X+1) items of basic text information are obtained, and the maximum of these is used as the second matching probability between the monitoring image and the first text information. This method eliminates the impact of network transmission time errors on the detection results.

[0124] Compared with existing technologies, this application obtains the monitoring image of the monitoring device at a target time and the first text information of the monitoring image stored by the network device at the target time, wherein the monitoring image contains first image information; the monitoring image and the first text information are input into an anomaly detection model; the matching probability of the first text information and the first image information is obtained through the anomaly detection model, and the presence of an anomaly in the monitoring device is determined by the matching probability. In this way, the anomaly detection problem is transformed into a matching problem of first image information and first text information, making full use of the characteristics of monitoring scenarios such as urban monitoring. It can uniformly detect faults in the monitoring device itself and connection anomalies between the monitoring device and the monitoring system without complex engineering details. The monitoring system maintained in this way has strong reliability and maintainability.

[0125] Furthermore, embodiments of this application also provide a computer-readable medium having computer-readable instructions stored thereon, which can be executed by a processor to implement the aforementioned method.

[0126] This application embodiment also provides an anomaly detection device for monitoring equipment, wherein the device includes:

[0127] One or more processors; and

[0128] A memory storing computer-readable instructions, which, when executed, cause the processor to perform the operations of the aforementioned method.

[0129] For example, when executed, computer-readable instructions cause the one or more processors to: acquire a monitoring image of the monitoring device at a target time and first text information of the monitoring image stored by the network device at the target time, wherein the monitoring image has first image information;

[0130] Input the surveillance image and the first text information into the anomaly detection model;

[0131] The matching probability of the first text information and the first image information is obtained through the anomaly detection model, so as to determine whether the monitoring device has an anomaly based on the matching probability.

[0132] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the invention. Therefore, the embodiments should be considered illustrative and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in the apparatus claims may also be implemented by a single unit or device in software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any particular order.

Claims

1. An anomaly detection method for monitoring equipment, wherein the method includes: the monitoring image of the monitoring device at a target time and the first text information of the monitoring image stored by the network device at the target time are obtained, wherein the monitoring image has first image information; a virtual monitoring image is preset, the virtual monitoring image being a monitoring image with randomly added third image information and / or an associated image similar to the monitoring image, wherein the third image information has corresponding third text information; the monitoring image and the first text information are input into the first anomaly detection model and the second anomaly detection model to obtain the first matching probability and the second matching probability of the first image information on the monitoring image and the first text information in the network device, respectively, wherein the first anomaly detection model is a supervised learning anomaly detection model and the second anomaly detection model is an unsupervised learning anomaly detection model; the first matching probability and the second matching probability are fused according to a preset fusion rule to obtain the matching probability of the first text information and the first image information; if the matching probability is greater than a preset first threshold, it is determined that the monitoring device is not abnormal; the second anomaly detection model is trained based on the virtual surveillance image and / or the surveillance image whose first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold; specifically, before the first anomaly detection model is trained, the second anomaly detection model is trained based on the virtual surveillance image; after the first anomaly detection model is trained, the second anomaly detection model is trained based on the virtual surveillance image and the surveillance image whose first matching probability output by the first anomaly detection model is greater than a preset second threshold; the training amount of the virtual surveillance image is reduced, and the training amount of the surveillance image whose first matching probability is greater than a preset first threshold is increased, until the training amount of the virtual surveillance image is 0; and the second threshold is greater than the first threshold.

2. The method of claim 1, wherein the method further includes: a labeled image is pre-acquired, wherein the labeled image is a monitoring image with manually annotated second image information, and the second image information has corresponding second text information, wherein the first anomaly detection model is trained based on the labeled image and the second text information.

3. The method according to claim 2, wherein the first anomaly detection model includes a target detection model and a text recognition model, and the second image information is identified by bounding boxes, wherein the step of training the first anomaly detection model based on the labeled image and the second text information includes: the target detection model is trained based on the labeled image so that the target detection model can detect bounding boxes on the labeled image; the text recognition model is trained based on the second text information and the second image information marked with a bounding box on the labeled image.

4. The method according to claim 3, wherein the text information includes time text information and location text information, and the image information includes time image information and location image information, wherein the step of training the text recognition model based on the second text information and the second image information marked with a bounding box on the labeled image includes: the text recognition model is trained based on the second time image information and the second location image information marked with a bounding box on the labeled image, so that the text recognition model can recognize the time text information and location text information contained in the second time image information and the second location image information; the text recognition model is trained by matching based on the second time text information, the second location text information, and the time text information and the location text information recognized by the text recognition model, so that the text recognition model can match the time text information with the second time text information and match the location text information with the second location text information.

5. The method according to claim 4, wherein the step of matching and training the text recognition model based on the second time text information, the second location text information, and the time text information and the location text information recognized by the text recognition model includes: a cost matrix is constructed from the loss values of all matching methods between the second time text information and the second location text information and the time text information and the location text information; the matching method that minimizes the sum of the loss values in the cost matrix is used as the matching training result of the text recognition model.

6. The method of claim 4, wherein the step of matching and training the text recognition model based on the second time text information, the second location text information, and the time text information and the location text information recognized by the text recognition model includes: convolution calculations are performed using the learning parameters with the time text information and the location text information, respectively; based on the convolution calculation results and the matching rules, the matching method combination of the second time text information and the second location text information with the time text information and the location text information is determined, and the matching method combination is used as the matching training result of the text recognition model.

7. The method according to claim 1, wherein the step of training the second anomaly detection model based on the virtual monitoring image and/or the monitoring image for which the first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold includes: forming n image-text pairs from n virtual monitoring images and/or monitoring images for which the first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than the preset second threshold, together with their corresponding n pieces of third text information and/or first text information; for each image-text pair, extracting an image feature and a text feature of the image-text pair using a preset image encoder and a preset text encoder; constructing n² matching methods between the n image features and the n text features, wherein the n² matching methods include n positive samples representing correct matching methods and (n² - n) negative samples representing incorrect matching methods; and training the second anomaly detection model through contrastive learning based on the n positive samples and the (n² - n) negative samples, and optimizing the training of the second anomaly detection model based on the contrastive learning loss.
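As a rough illustration of the contrastive training in claim 7, the n × n similarity matrix below has the n correct image-text pairs on its diagonal and the (n² - n) incorrect pairs off the diagonal. The symmetric cross-entropy form of the loss, the temperature value, and the toy feature vectors are assumptions for this sketch; the claim only states that a contrastive learning loss is used, and real features would come from the preset image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(image_feats, text_feats):
    """sim[i][j] compares image i with text j; the diagonal holds the positive samples."""
    return [[cosine(img, txt) for txt in text_feats] for img in image_feats]


def _mean_cross_entropy(rows, temperature):
    """Average cross-entropy where the correct class of row k is column k."""
    total = 0.0
    for k, row in enumerate(rows):
        logits = [s / temperature for s in row]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[k]
    return total / len(rows)


def contrastive_loss(sim, temperature=0.07):
    """Assumed symmetric loss: image-to-text and text-to-image terms averaged."""
    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (_mean_cross_entropy(sim, temperature) + _mean_cross_entropy(columns, temperature))


# Toy features standing in for encoder outputs (n = 3).
image_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_text_feats = aligned_text_feats[1:] + aligned_text_feats[:1]

n = len(image_feats)
n_positive, n_negative = n, n * n - n
loss_aligned = contrastive_loss(similarity_matrix(image_feats, aligned_text_feats))
loss_shuffled = contrastive_loss(similarity_matrix(image_feats, shuffled_text_feats))
print(n_positive, n_negative, loss_aligned < loss_shuffled)
```

Minimizing this loss pushes the diagonal similarities up and the off-diagonal similarities down, which is the behavior the positive and negative samples in the claim are meant to induce.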

8. The method of claim 7, wherein the step of inputting the monitoring image and the first text information into the first anomaly detection model and the second anomaly detection model to obtain the first matching probability and the second matching probability of the first image information on the monitoring image and the first text information in the network device includes: generating a test text sequence based on the first text information, wherein the test text sequence includes the first text information and several pieces of augmented text information generated by augmenting the first text information; inputting the monitoring image and the test text sequence into the second anomaly detection model, and extracting the image feature of the monitoring image and the text features of the test text sequence using the preset image encoder and the preset text encoder, respectively; calculating the similarity between the image feature and each text feature, and using the normalized similarity values as the respective matching probabilities; and obtaining the matching probability between the image feature and the text feature corresponding to the first text information, and using that matching probability as the second matching probability.
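The inference step of claim 8 can be sketched as follows, assuming that the similarity is cosine similarity and that the normalization is a softmax with a temperature; neither choice is stated in the claim. The feature vectors are placeholders for the outputs of the preset encoders, and by the convention of this sketch the first entry of the text feature list corresponds to the first text information.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def softmax(logits):
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def second_matching_probability(image_feat, text_feats, temperature=0.07):
    """text_feats[0] belongs to the first text information; the rest are augmented texts."""
    sims = [cosine(image_feat, t) for t in text_feats]
    probs = softmax([s / temperature for s in sims])
    return probs[0], probs


# Placeholder features: the image agrees with the stored text in the first case only.
image_feat = [0.9, 0.1, 0.0]
matching_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mismatching_texts = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

p_match, probs_match = second_matching_probability(image_feat, matching_texts)
p_mismatch, _ = second_matching_probability(image_feat, mismatching_texts)
print(p_match > p_mismatch)
```

The augmented texts give the normalization something to compete against: a stored text that disagrees with the image loses probability mass to whichever augmented text fits better.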

9. The method of any one of claims 4 to 6, wherein the method further includes defining a preset time error X, and wherein training the second anomaly detection model based on the virtual monitoring image and/or the monitoring image for which the first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold includes: forming n image-text pairs from the virtual monitoring images and/or the monitoring images for which the first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than the preset second threshold, together with their corresponding third text information and/or first text information; for each image-text pair, extracting an image feature of the pair using a preset image encoder, and extracting (2X+1) text features of the pair using a preset text encoder and the time error X; constructing (2X+1)n² matching methods between the n image features and the (2X+1)n text features, wherein the (2X+1)n² matching methods include (2X+1)n positive samples representing correct matching methods and (2X+1)(n² - n) negative samples representing incorrect matching methods; and training the second anomaly detection model through contrastive learning based on the (2X+1)n positive samples and the (2X+1)(n² - n) negative samples, and optimizing the training of the second anomaly detection model based on the contrastive learning loss.
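The sample counts in claim 9 can be checked with a short enumeration. This sketch assumes that the time error X is measured in whole seconds, that each text is a timestamp followed by a location, and that the (2X+1) texts for a pair are obtained by shifting the timestamp by every offset from -X to +X; the timestamp format, the helper names, and the sample data are illustrative only.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def tolerant_texts(time_text, location_text, x):
    """Return the (2X+1) texts whose time lies within +/- x seconds of time_text."""
    base = datetime.strptime(time_text, TIME_FORMAT)
    return [
        f"{(base + timedelta(seconds=offset)).strftime(TIME_FORMAT)} {location_text}"
        for offset in range(-x, x + 1)
    ]


def build_matchings(pairs, x):
    """Enumerate every (image index, text, is_positive) combination.

    A matching is positive when the text was derived from the image's own pair.
    """
    texts = [
        (pair_index, text)
        for pair_index, (time_text, location_text) in enumerate(pairs)
        for text in tolerant_texts(time_text, location_text, x)
    ]
    return [
        (image_index, text, image_index == pair_index)
        for image_index in range(len(pairs))
        for pair_index, text in texts
    ]


pairs = [
    ("2022-12-30 08:15:00", "Gate 1"),
    ("2022-12-30 09:30:10", "Gate 2"),
    ("2022-12-30 23:59:59", "Gate 3"),
]
X = 2
matchings = build_matchings(pairs, X)
positives = sum(1 for _, _, is_positive in matchings if is_positive)
negatives = len(matchings) - positives
print(len(matchings), positives, negatives)
```

With n = 3 and X = 2 the enumeration yields (2X+1)n² = 45 matchings, of which (2X+1)n = 15 are positive and (2X+1)(n² - n) = 30 are negative, in agreement with the counts stated in the claim.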

10. The method of claim 9, wherein the step of inputting the monitoring image and the first text information into the first anomaly detection model and the second anomaly detection model to obtain the first matching probability and the second matching probability of the first image information on the monitoring image and the first text information in the network device includes: generating (2X+1) pieces of basic text information based on the first text information and the time error; generating a test text sequence based on the (2X+1) pieces of basic text information, wherein the test text sequence includes the (2X+1) pieces of basic text information and several pieces of augmented text information generated by augmenting the basic text information; inputting the monitoring image and the test text sequence into the second anomaly detection model, and extracting the image feature of the monitoring image and the text features of the test text sequence using the preset image encoder and the preset text encoder, respectively; calculating the similarity between the image feature and each text feature, and using the normalized similarity values as the respective matching probabilities; and obtaining the matching probabilities between the image feature and the text features corresponding to the basic text information, and taking the maximum of those matching probabilities as the second matching probability.
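A minimal sketch of the time-tolerant inference in claim 10 follows. Several parts are assumptions: the time error is taken to be in seconds, the augmentation is modeled as distractor texts shifted by whole hours (the claim does not say how augmented texts are produced), and `similarity` is a hypothetical stand-in for the encoder-based image-text similarity.

```python
import math
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def shifted(time_text, location, delta):
    """Format the text obtained by shifting the stored time by delta."""
    moment = datetime.strptime(time_text, TIME_FORMAT) + delta
    return f"{moment.strftime(TIME_FORMAT)} {location}"


def basic_texts(time_text, location, x):
    """The (2X+1) basic texts within +/- x seconds of the stored time."""
    return [shifted(time_text, location, timedelta(seconds=d)) for d in range(-x, x + 1)]


def augmented_texts(time_text, location, count=3):
    """Hypothetical augmentation: distractors 1..count hours after the stored time."""
    return [shifted(time_text, location, timedelta(hours=k)) for k in range(1, count + 1)]


def second_matching_probability(similarity, time_text, location, x):
    """Softmax over the whole test sequence, then the maximum over the basic texts."""
    basics = basic_texts(time_text, location, x)
    sequence = basics + augmented_texts(time_text, location)
    logits = [similarity(text) for text in sequence]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return max(e / total for e in exps[: len(basics)])


def make_similarity(overlay_text):
    """Stub scorer: high score only for the text actually shown on the image."""
    return lambda text: 10.0 if text == overlay_text else 0.0


# Overlay is one second off the stored time: inside the tolerance window (X = 2).
p_normal = second_matching_probability(
    make_similarity("2022-12-30 08:15:01 Gate 3"), "2022-12-30 08:15:00", "Gate 3", 2
)
# Overlay is three hours off: it coincides with a distractor, not with a basic text.
p_abnormal = second_matching_probability(
    make_similarity("2022-12-30 11:15:00 Gate 3"), "2022-12-30 08:15:00", "Gate 3", 2
)
print(p_normal > p_abnormal)
```

Taking the maximum over the basic texts means that a small clock offset between the device overlay and the stored record is not reported as an anomaly, while a larger offset still is.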

11. The method of claim 10, wherein the step of inputting the monitoring image and the first text information into the first anomaly detection model and the second anomaly detection model to obtain the first matching probability and the second matching probability of the first image information on the monitoring image and the first text information in the network device includes: inputting the (2X+1) pieces of basic text information and the monitoring image into the first anomaly detection model, calculating the matching probability between the monitoring image and each piece of basic text information, and taking the maximum matching probability as the first matching probability.

12. The method of claim 1, wherein, after training the second anomaly detection model based on the virtual monitoring image and/or the monitoring image for which the first matching probability of the first text information and the first image information output by the trained first anomaly detection model is greater than a preset second threshold, the method further includes: optimizing the second anomaly detection model based on monitoring images for which the matching probability, determined from the trained first anomaly detection model and the trained second anomaly detection model according to a preset fusion rule, is greater than the preset second threshold.

13. The method of claim 1 or 12, wherein the fusion rule includes: ， where M1(x) is the first matching probability, M2(x) is the second matching probability, and the weight parameter decreases continuously as the amount of training data for the second anomaly detection model increases, until it reaches a preset limit value.

14. A computer-readable medium having stored thereon computer-readable instructions that are executable by a processor to implement the method as claimed in any one of claims 1 to 13.

15. An anomaly detection device for monitoring equipment, wherein the device includes: one or more processors; and a memory storing computer-readable instructions which, when executed, cause the one or more processors to perform the operations of the method as described in any one of claims 1 to 13.