A scene attention-based detection method, device, equipment and storage medium
By constructing a hierarchical network architecture, the system first detects scenes of interest in visible light images, and then schedules object detection models according to priority. This solves the problem of wasted resources in multi-category scene detection in existing technologies and achieves efficient and accurate object detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING INST OF REMOTE SENSING INFORMATION
- Filing Date
- 2024-12-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing visible light image detection models cannot monitor multiple scene categories simultaneously, resulting in wasted resources and low detection efficiency.
A hierarchical network architecture based on scene detection is constructed. First, all scenes of interest in the image are detected by a scene blind detection model. Then, a single-class object detection model is scheduled to perform object detection according to scene priority.
By reducing invalid computations across the entire image, detection efficiency and accuracy are improved, adapting to the flexible needs of various application scenarios.
Figure CN120047660B_ABST
Description
Technical Field
[0001] This application relates to the field of digital image processing technology, and in particular to a method, apparatus, device and storage medium for detecting scenes of interest.
Background Technology
[0002] Classical object detection in visible light images typically refers to closed-set, fixed-category detection, in which an object-centric model trained for a single category achieves state-of-the-art performance. However, in open-world, multi-category visible light data, a single-category object detection model can only detect objects of one category and cannot monitor other categories at the same time. Other regions of interest are ignored or discarded in this task. As a result, much of the information contained in visible light images is wasted, which greatly reduces the practical value of visible light data.
[0003] In summary, how to detect the many undisclosed hotspot regions in visible light images without prior knowledge is a problem that urgently needs to be solved.
Summary of the Invention
[0004] This application aims to at least partially address one of the technical problems in the related art.
[0005] Therefore, the first objective of this application is to propose a scene-of-interest detection method to address the problem that existing technologies cannot detect the many undisclosed and unknown hotspot regions in visible light images.
[0006] The second objective of this application is to provide an apparatus.
[0007] The third objective of this application is to propose an electronic device.
[0008] The fourth objective of this application is to provide a computer-readable storage medium.
[0009] The fifth objective of this application is to provide a computer program product.
[0010] To achieve the above objectives, a first aspect of this application proposes a method for detecting scenes of interest, comprising:
[0011] Obtain publicly available visible light image datasets;
[0012] A scene blind detection model is constructed based on a multi-scene supervised learning detection neural network model. The scene blind detection model is trained using the publicly available visible light image dataset to obtain the trained scene blind detection model.
[0013] The trained scene blind detection model is used to perform a preliminary blind detection on the visible light image to be detected, thereby obtaining all scenes included in the visible light image to be detected.
[0014] Based on all scenes included in the visible light image to be detected, the corresponding single-class object detection model is invoked to obtain the detection results.
[0015] Preferably, obtaining the publicly available visible light image dataset includes:
[0016] Acquire publicly available visible light images, including high-resolution images, that differ in resolution and differ significantly in their characteristics. The images include visible light data samples from various complex scenes and from different sources, resolutions, and imaging conditions, including conditions that severely degrade image texture.
[0017] Preferably, the step of constructing a scene blind detection model based on a multi-scene supervised learning detection neural network model, and training the scene blind detection model using the publicly available visible light image dataset to obtain the trained scene blind detection model includes:
[0018] A scene blind detection model is constructed using a deep convolutional neural network, which is used to extract feature information from the input image.
[0019] Using an open visible light image dataset in which all scenes of interest are annotated, the scene blind detection model is trained by supervised learning, resulting in the trained scene blind detection model.
[0020] Preferably, the step of calling the corresponding single-class object detection model based on all scenes included in the visible light image to be detected, and obtaining the detection results includes:
[0021] Based on the detection results of the scene blind detection model, determine the scene contained in the image;
[0022] For the detected scene, a single-class object detection model corresponding to the scene is scheduled for object detection based on a single-class detection model scheduling mechanism. The scheduling can be performed according to the priority of the scene.
[0023] Preferably, the single-class detection model scheduling mechanism includes:
[0024] Scene-type-based scheduling calls different single-class detection models for different scene types;
[0025] Priority-based scheduling schedules the object detection models according to preset scene priorities, so that the scene with the highest priority is detected first.
[0026] Scheduling based on target importance involves ranking the importance of scenarios and prioritizing the detection of scenarios with the highest importance.
[0027] Preferably, the single-class object detection model uses the YOLO network and incorporates attention and multi-scale fusion mechanisms for model construction.
[0028] To achieve the above objectives, a second aspect of this application provides a scene-of-interest detection device, comprising:
[0029] The data acquisition module acquires publicly available visible light image datasets;
[0030] The training module constructs a scene blind detection model based on a multi-scene supervised learning detection neural network model, and trains the scene blind detection model using the publicly available visible light image dataset to obtain the trained scene blind detection model.
[0031] The blind detection module uses the trained scene blind detection model to perform a preliminary blind detection on the visible light image to be detected, and obtains all scenes included in the visible light image to be detected.
[0032] The detection model invocation module invokes the corresponding single-class object detection model based on all scenes included in the visible light image to be detected, and obtains the detection results.
[0033] To achieve the above objectives, a third aspect of this application provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
[0034] The memory stores computer-executed instructions;
[0035] The processor executes computer execution instructions stored in the memory to implement the method described in any of the preceding descriptions.
[0036] To achieve the above objectives, a fourth aspect of this application provides a computer-readable storage medium, comprising computer-executable instructions stored therein, which, when executed by a processor, are used to implement the method described in any of the above embodiments.
[0037] To achieve the above objectives, a fifth aspect of this application provides a computer program product including computer instructions for causing a computer to perform the method described in the first aspect or any corresponding embodiment thereof.
[0038] This application provides a scene-of-interest detection method that constructs a hierarchical network architecture. First, it detects scenes of interest before performing object detection on those scenes. The first layer employs a scene blind detection model that simultaneously detects all categories of interest in visible light data. Specifically, through supervised learning, it performs preliminary blind detection on regions of interest in open visible light image datasets to identify all scenes of interest within the dataset. Second, the second layer, based on the blind detection results, simultaneously or selectively schedules single-class object detection models according to priority. By detecting scenes first, it reduces unnecessary computation across the entire image, improving detection efficiency. Targeting specific scenes for object detection improves accuracy. The object detection model can be flexibly scheduled according to different needs, adapting to various application scenarios.
[0039] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0040] The above and / or additional aspects and advantages of this application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:
[0041] Figure 1 is a flowchart of a first specific embodiment of the scene-of-interest detection method provided by the present invention;
[0042] Figure 2 is a schematic diagram of a scene detection strategy for any visible light image;
[0043] Figure 3 is a summary diagram of intelligent object detection algorithms;
[0044] Figure 4 is a schematic diagram of a scene detection strategy model;
[0045] Figure 5 is a structural block diagram of a scene-of-interest detection device provided in an embodiment of the present invention.
Detailed Implementation
[0046] The core of this invention is to provide a method, apparatus, device, and storage medium for detecting scenes of interest. By constructing a layered network architecture, the scene of interest is detected first, and then object detection is performed on the scene of interest, which reduces the invalid computation of the whole image and improves the detection efficiency.
[0047] To enable those skilled in the art to better understand the present invention, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are merely some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0048] Please refer to Figure 1, which is a flowchart of a first specific embodiment of the scene-of-interest detection method provided by the present invention. The specific operation steps are as follows:
[0049] Step S101: Obtain a publicly available visible light image dataset.
[0050] Acquire publicly available visible light images, including high-resolution images, that differ in resolution and differ significantly in their characteristics. The images include visible light data samples from various complex scenes and from different sources, resolutions, and imaging conditions, including conditions that severely degrade image texture.
[0051] Step S102: Construct a scene blind detection model based on a multi-scene supervised learning detection neural network model, and train the scene blind detection model using a publicly available visible light image dataset to obtain the trained scene blind detection model.
[0052] A scene blind detection model is constructed using a deep convolutional neural network. This scene blind detection model is used to extract feature information from the input image.
[0053] Using an open visible light image dataset in which all scenes of interest are annotated, the scene blind detection model is trained by supervised learning, resulting in the trained scene blind detection model.
[0054] Step S103: Use the trained scene blind detection model to perform preliminary blind detection on the visible light image to be detected, and obtain all scenes included in the visible light image to be detected.
[0055] Step S104: Based on all scenes included in the visible light image to be detected, call the corresponding single-class object detection model to obtain the detection results.
[0056] Based on the detection results of the scene blind detection model, determine the scene contained in the image;
[0057] For the detected scene, a single-class object detection model corresponding to the scene is scheduled for object detection based on a single-class detection model scheduling mechanism. The scheduling can be performed according to the priority of the scene.
[0058] The single-class detection model scheduling mechanism includes:
[0059] Scene-type-based scheduling calls different single-class detection models for different scene types;
[0060] Priority-based scheduling schedules the object detection models according to preset scene priorities, so that the scene with the highest priority is detected first.
[0061] Scheduling based on target importance involves ranking the importance of scenarios and prioritizing the detection of scenarios with the highest importance.
[0062] The single-class object detection model uses the YOLO network and incorporates attention and multi-scale fusion mechanisms for model construction.
[0063] This embodiment provides a scene-of-interest detection method that constructs a layered network architecture. First, scenes of interest are detected, and then object detection is performed on those scenes. The first layer employs a scene blind detection model, simultaneously detecting scenes of interest across all categories of interest in visible light data. That is, through supervised learning, preliminary blind detection is performed on regions of interest in open visible light image datasets to identify all scenes of interest contained within the dataset. Second, the second layer, based on the blind detection results, simultaneously or selectively schedules single-class object detection models according to priority. By detecting scenes first, unnecessary computation across the entire image is reduced, improving detection efficiency. Targeted object detection for specific scenes improves accuracy. Object detection models can be flexibly scheduled according to different needs, adapting to various application scenarios.
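Steps S103 and S104 above can be illustrated with a minimal sketch. It is only one possible reading of the embodiment: the data classes, the `detect` function, the integer priority table, and the toy stand-in models are all hypothetical names introduced here for illustration, and real trained models would replace the stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class SceneDetection:
    scene: str         # scene category reported by the scene blind detection model
    confidence: float  # score in [0, 1]
    region: Box        # where the scene lies in the image


@dataclass
class ObjectDetection:
    label: str
    confidence: float
    box: Box


# Hypothetical call signatures standing in for trained models.
SceneModel = Callable[[object], List[SceneDetection]]
ObjectModel = Callable[[object, Box], List[ObjectDetection]]


def detect(image, scene_model: SceneModel,
           detectors: Dict[str, ObjectModel],
           priority: Dict[str, int]) -> List[ObjectDetection]:
    """Blind scene detection (S103), then per-scene object detection (S104)."""
    scenes = scene_model(image)
    # Assumed convention: a larger value means higher priority; unknown scenes go last.
    scenes.sort(key=lambda s: priority.get(s.scene, 0), reverse=True)
    results: List[ObjectDetection] = []
    for s in scenes:
        detector = detectors.get(s.scene)
        if detector is None:
            continue  # no single-class model registered for this scene
        results.extend(detector(image, s.region))
    return results


# Toy stand-ins so the sketch runs without trained weights.
def fake_scene_model(image):
    return [SceneDetection("parking_lot", 0.8, (0, 0, 50, 50)),
            SceneDetection("urban_road", 0.9, (50, 0, 100, 50))]


def fake_vehicle_detector(image, region):
    return [ObjectDetection("vehicle", 0.95, region)]


def fake_space_detector(image, region):
    return [ObjectDetection("parking_space", 0.7, region)]


results = detect(
    image=None,
    scene_model=fake_scene_model,
    detectors={"urban_road": fake_vehicle_detector,
               "parking_lot": fake_space_detector},
    priority={"urban_road": 2, "parking_lot": 1},
)
print([r.label for r in results])  # prints ['vehicle', 'parking_space']
```

Because each single-class detector only receives the region of its own scene, no detector is run over the parts of the image that contain no matching scene.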
[0064] Based on the above embodiments, this embodiment describes the scene-of-interest detection method in more detail, as shown in Figure 2. The details are as follows:
[0065] Before introducing the embodiments of this application, the following background is introduced first:
[0066] As shown in Figure 3, deep learning object detection algorithms are mainly divided into two categories: two-stage detection and single-stage detection. Two-stage algorithms first generate candidate boxes as samples and then apply image classification algorithms to the candidate regions, while single-stage detection algorithms directly regress the predicted objects.
[0067] Two-stage object detection algorithm
[0068] Two-stage object detection algorithms achieve high accuracy primarily because they use Region Proposal Networks (RPNs) to generate precise candidate boxes, which reduces the false detection rate. They offer better detection performance for large objects and complex scenes. Representative algorithms include Faster R-CNN, Mask R-CNN, and Cascade R-CNN. However, they require two stages of computation, making them slower and unsuitable for real-time object detection.
[0069] Single-stage object detection algorithm
[0070] Single-stage object detection algorithms perform well in various application scenarios and typically offer fast inference speeds. The choice of algorithm depends on the specific application requirements, including factors such as accuracy, speed, and resource consumption. Furthermore, pre-trained models for these algorithms are often available in deep learning frameworks (such as TensorFlow and PyTorch), allowing for the rapid development of custom object detection applications. They can achieve online real-time object detection, but their detection accuracy is not as high as that of two-stage object detection algorithms.
[0071] Anchor-free detection algorithm
[0072] Classical single-stage and two-stage object detection methods require pre-setting object bounding boxes (anchors), which adds extra complexity to the detection process. Anchor-free object detection algorithms, on the other hand, do not require pre-setting anchors; instead, they directly predict the object's location and category from within the image. Examples of such algorithms include YOLOv1, CornerNet, and CenterNet. The advantage of anchor-free algorithms is that they avoid the process of setting and tuning anchors, simplifying the algorithm structure.
[0073] Single-stage object detection, represented by the YOLO network, formulates object detection as a regression problem. Research based on it has grown rapidly, and it is often used for scene detection focused on specific scenarios. Its biggest advantage is its high detection speed. Although its detection accuracy is lower than that of two-stage object detection models, it can reach similar levels with additional techniques such as attention mechanisms and multi-scale fusion. Therefore, single-stage object detection models have greater application value.
[0074] As shown in Figure 4, this embodiment includes the following two-layer architecture:
[0075] First layer: Scene blind detection model
[0076] Model structure: A scene blind detection model is designed using a deep convolutional neural network (CNN), which can effectively extract image features.
[0077] Training data: An open visible light image dataset, labeled with all scene categories of interest, is used for supervised learning.
[0078] Blind detection process: By performing preliminary scene detection on all categories of interest in the visible light data, all scenes of interest contained in the dataset are identified.
[0079] Output: The output of this layer is a list of detected scene categories and their corresponding confidence scores.
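The first-layer output described above, a list of scene categories with confidence scores, could be derived from per-category scores as in the following sketch. The function name `scenes_from_scores`, the example scores, and the 0.5 threshold are assumptions made for illustration; the source does not specify a threshold.

```python
from typing import Dict, List, Tuple


def scenes_from_scores(scores: Dict[str, float],
                       threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Keep scene categories whose score reaches the threshold, highest score first."""
    kept = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical per-category scores from the scene blind detection model.
raw_scores = {"parking_lot": 0.62, "urban_road": 0.91, "forest": 0.08}
detected = scenes_from_scores(raw_scores)
print(detected)  # prints [('urban_road', 0.91), ('parking_lot', 0.62)]
```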
[0080] Second layer: Object detection model
[0081] Strategy scheduling: Based on the blind detection results of the first layer, prioritize scheduling the targeted object detection model (i.e., the single-class object detection model).
[0082] Detection execution: In a defined target scene, perform object detection to identify and locate objects of interest.
[0083] Priority scheduling: Based on the importance or priority of the scene, different object detection models are scheduled to achieve optimal resource allocation.
[0084] In one embodiment, policy scheduling includes:
[0085] The scene blind detection model analyzes input images to identify various scene information. The identification result includes not only the scene type but also the scene's importance or priority in the application. Different scenes may have different impacts on the detection task; therefore, scheduling subsequent detection tasks based on scene priority is a key optimization of this invention.
[0086] Scene type: Scenes can be of various types, such as city, nature, and indoors. Each type of image will contain different backgrounds and objects.
[0087] Priority setting: The priority of different scenarios may be set based on a variety of factors. For example, in security monitoring systems, outdoor scenarios may have a higher priority than indoor scenarios because outdoor areas are more prone to emergencies; in medical image processing, specific areas (such as tumor locations) may require higher detection priority.
[0088] Based on the importance of each scene, the scene categories output by the first-layer scene detection system are weighted or prioritized according to specific rules. These priorities can be static (such as preset rules) or dynamic (such as adjustments based on real-time data feedback or historical experience).
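The weighting of scene categories by static or dynamic priority can be sketched as follows. The multiplicative combination of preset weight, dynamic adjustment, and detection confidence is an assumption chosen for simplicity, and `STATIC_PRIORITY`, `rank_scenes`, and the scene names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical preset rules: a larger weight means a more important scene.
STATIC_PRIORITY: Dict[str, float] = {"urban_road": 3.0, "parking_lot": 2.0, "indoor": 1.0}


def rank_scenes(detected: List[Tuple[str, float]],
                static_priority: Dict[str, float],
                dynamic_boost: Optional[Dict[str, float]] = None
                ) -> List[Tuple[str, float]]:
    """Rank scenes by static weight x dynamic boost x confidence (assumed rule)."""
    boost = dynamic_boost or {}
    scored = [
        (name, static_priority.get(name, 1.0) * boost.get(name, 1.0) * conf)
        for name, conf in detected
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


detected_scenes = [("indoor", 0.9), ("urban_road", 0.6), ("parking_lot", 0.7)]

# Static rules only.
static_order = [name for name, _ in rank_scenes(detected_scenes, STATIC_PRIORITY)]
print(static_order)  # prints ['urban_road', 'parking_lot', 'indoor']

# Dynamic adjustment, for example after real-time feedback raises the weight of "indoor".
dynamic_order = [name for name, _ in
                 rank_scenes(detected_scenes, STATIC_PRIORITY, {"indoor": 2.5})]
print(dynamic_order)  # prints ['indoor', 'urban_road', 'parking_lot']
```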
[0089] Model scheduling mechanism
[0090] Scene detection and scheduling is a dynamic decision-making process, the core of which is to select the appropriate single-class object detection model based on scene priority. Specifically, the scheduling mechanism can be implemented in the following ways:
[0091] Scene-based scheduling: When a specific scene (such as an urban traffic scene) is detected, the corresponding traffic monitoring object detection model is scheduled. This model is specifically designed to identify targets such as traffic signals, vehicles, and pedestrians. Different scene types may require different single-class detection models.
[0092] Priority-based scheduling: When a scenario is determined to have a high priority (e.g., an emergency), a higher-performance or faster-responding object detection model is scheduled for rapid detection. For example, in emergency scenarios in security monitoring, a low-latency single-class detection model can be selected to ensure that possible abnormal events are identified as early as possible.
[0093] Target importance-based scheduling: For each scenario, models can be scheduled based on the detection importance of different targets. For example, in natural scenes, it may be necessary to prioritize the detection of rare animals or dangerous species, while in other scenarios, ordinary targets (such as ordinary pedestrians) can be detected by models with lower priority.
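The three scheduling modes above can be combined in a small registry, sketched below under stated assumptions. `ModelSpec`, `REGISTRY`, `schedule`, the latency figures, and the importance values are all hypothetical; the first model listed for each target is treated as the default, and a high-priority scene switches to the lowest-latency variant.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ModelSpec:
    name: str          # hypothetical model identifier
    target: str        # the single object class this model detects
    latency_ms: float  # assumed typical inference latency
    importance: int    # larger means the target matters more in this scene


# Scene-based scheduling: each scene type maps to its own single-class models.
REGISTRY: Dict[str, List[ModelSpec]] = {
    "urban_traffic": [
        ModelSpec("vehicle_accurate", "vehicle", 30.0, 3),
        ModelSpec("vehicle_fast", "vehicle", 8.0, 3),
        ModelSpec("pedestrian", "pedestrian", 12.0, 2),
        ModelSpec("traffic_signal", "traffic_signal", 10.0, 1),
    ],
}


def schedule(scene: str, high_priority: bool) -> List[ModelSpec]:
    """Choose one model per target for the scene and order them by target importance."""
    by_target: Dict[str, List[ModelSpec]] = {}
    for spec in REGISTRY.get(scene, []):
        by_target.setdefault(spec.target, []).append(spec)
    chosen = []
    for specs in by_target.values():
        if high_priority:
            # Priority-based scheduling: prefer the fastest-responding variant.
            chosen.append(min(specs, key=lambda s: s.latency_ms))
        else:
            chosen.append(specs[0])  # default variant listed first
    # Target importance-based scheduling: most important targets run first.
    return sorted(chosen, key=lambda s: s.importance, reverse=True)


urgent = [s.name for s in schedule("urban_traffic", high_priority=True)]
routine = [s.name for s in schedule("urban_traffic", high_priority=False)]
print(urgent)   # prints ['vehicle_fast', 'pedestrian', 'traffic_signal']
print(routine)  # prints ['vehicle_accurate', 'pedestrian', 'traffic_signal']
```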
[0094] Scheduling algorithms and decision rules
[0095] To make the scheduling mechanism efficient and accurate, a scheduling algorithm can be designed to dynamically determine which object detection models to invoke based on the type and priority of the scene. The following are some possible scheduling strategies:
[0096] Static rule scheduling: Based on the scene category and its priority, static rules are used to select the corresponding single-class detection model. For example, when the scene type is "city street" and the priority is high, the "pedestrian detection" model is scheduled for detection.
[0097] Dynamic feedback scheduling: In real-time applications, the scheduling mechanism can also be dynamically adjusted based on historical data and the current state. For example, if a scenario has changed frequently in historical data (such as changes in traffic conditions due to weather changes), the detection model can be selected by real-time feedback scheduling.
[0098] Hybrid scheduling strategy: Combining static rules and dynamic adjustments to form a hybrid scheduling system. Initially, a detection model is selected based on scenario type and priority. During real-time operation, the model priority is adjusted or alternative models are selected based on real-time feedback (such as uncertainty or confidence of detection results).
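A hybrid strategy of this kind might look like the following sketch: a static rule table picks the initial model, and the confidence of its result decides whether an alternative model is tried. The rule table, the model names, the 0.5 confidence threshold, and the toy models that return only a confidence value are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy model type: takes an image and returns the confidence of its best detection.
Detector = Callable[[object], float]

# Static rules: (scene type, priority level) -> models to try, in order (hypothetical).
STATIC_RULES: Dict[Tuple[str, str], List[str]] = {
    ("city_street", "high"): ["pedestrian_detection", "pedestrian_detection_fallback"],
}


def hybrid_schedule(scene: str, priority: str, image,
                    models: Dict[str, Detector],
                    min_confidence: float = 0.5) -> Optional[Tuple[str, float]]:
    """Start from the static rule; move to the next model while confidence is too low."""
    best: Optional[Tuple[str, float]] = None
    for name in STATIC_RULES.get((scene, priority), []):
        confidence = models[name](image)
        if best is None or confidence > best[1]:
            best = (name, confidence)
        if confidence >= min_confidence:
            break  # feedback is good enough, no alternative model needed
    return best


calls: List[str] = []


def primary(image) -> float:
    calls.append("primary")
    return 0.3  # low confidence triggers the dynamic adjustment


def fallback(image) -> float:
    calls.append("fallback")
    return 0.8


models = {"pedestrian_detection": primary,
          "pedestrian_detection_fallback": fallback}
choice = hybrid_schedule("city_street", "high", None, models)
print(choice)  # prints ('pedestrian_detection_fallback', 0.8)
```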
[0099] Prioritization and performance optimization
[0100] The scheduling process must consider not only the priority of the scenarios but also the allocation of computing resources. To ensure that the system can detect high-priority scenarios in a timely manner without causing delays in the detection of low-priority scenarios, the following performance optimization strategies can be adopted:
[0101] Dynamic resource allocation: When multiple scenes are detected simultaneously, computing resources can be dynamically allocated according to the priority of the scenes. For example, more GPU resources can be allocated to high-priority scenes, and detection resources for low-priority scenes can be reduced.
[0102] Multithreading and Parallel Processing: To improve efficiency, multithreading or parallel computing techniques can be employed. Scenes with different priorities are detected in separate threads, so that high-priority tasks are completed in the shortest possible time while other tasks are performed in parallel.
[0103] For example, in an intelligent traffic monitoring system, the system first identifies "urban road" scenes in the image using a scene blind detection model, and then sets them as high priority according to preset rules. Next, the system schedules a "vehicle detection" model to perform object detection on the road scene, quickly identifying illegal parking, traffic congestion, or other issues. Similarly, if the scene is a "parking lot," another single-class object detection model specifically for "parking space detection" is scheduled.
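The priority-ordered, parallel execution described in this section can be sketched with a priority queue feeding a thread pool. This is an illustrative assumption rather than the implementation of the embodiment: `run_by_priority` and the task tuples are hypothetical, the lambdas stand in for real detection calls, and GPU resource allocation is not modeled.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Task = Tuple[int, str, Callable[[], str]]  # (priority, scene name, detection job)


def run_by_priority(tasks: List[Task], workers: int = 2) -> Tuple[List[str], List[str]]:
    """Submit detection jobs to a thread pool, highest priority first.

    Returns the scene names in submission order and the job results in the same order.
    """
    # heapq is a min-heap, so the priority is negated; idx breaks ties deterministically.
    heap = [(-prio, idx, scene, job) for idx, (prio, scene, job) in enumerate(tasks)]
    heapq.heapify(heap)
    order: List[str] = []
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while heap:
            _, _, scene, job = heapq.heappop(heap)
            order.append(scene)
            futures.append(pool.submit(job))
        results = [future.result() for future in futures]
    return order, results


tasks: List[Task] = [
    (1, "parking_lot", lambda: "parking space detection done"),
    (3, "urban_road", lambda: "vehicle detection done"),
    (2, "indoor", lambda: "person detection done"),
]
order, results = run_by_priority(tasks)
print(order)  # prints ['urban_road', 'indoor', 'parking_lot']
```

With two workers, the two highest-priority scenes start immediately and the lowest-priority scene starts as soon as a worker becomes free, so low-priority detection is delayed but not starved.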
[0104] Subsequent processing and optimization
[0105] After object detection is completed, the detection results can be optimized through post-processing, such as using the non-maximum suppression (NMS) method to remove redundant detection boxes, or further adjusting the detection priority based on the confidence level of the detection results.
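The non-maximum suppression step mentioned above can be illustrated with a plain greedy implementation. The 0.5 IoU threshold and the example boxes are assumptions; production systems would normally use an optimized library routine instead.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # prints [0, 2]: the second box overlaps the first (IoU about 0.68) and is removed
```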
[0106] RSBD consists of an image encoder, a text encoder, and an object decoder. The text encoder processes any task-related description, including object categories, names of any kind, titles about objects, and reference expressions. Prior information acts as a cue or label, such as the location information of the input scene and the name of the object scene. These are then integrated into the detector to extract the object scene from the image based on the text and prior information input.
[0107] For the image backbone, the ResNet network is used to extract multi-scale features of image objects; the text encoder uses a seq2seq approach to encode the text and establishes a correspondence between the text and the objects in the image backbone features; finally, the text is fed, together with the prior object location information, into a YOLOX with dynamic class heads to build an object decoder for object scene detection.
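The source does not state how the dynamic class heads use the encoded text. One common way to let the set of classes follow the text input is to score each region feature against the text embeddings, and the sketch below illustrates only that general idea. The two-dimensional vectors, the scene names, and the function names are hypothetical, and nothing here is taken from the actual RSBD implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_regions(region_features: List[Sequence[float]],
                     text_embeddings: Dict[str, Sequence[float]]) -> List[str]:
    """Label each region with the text description whose embedding is most similar."""
    return [
        max(text_embeddings, key=lambda name: cosine(feature, text_embeddings[name]))
        for feature in region_features
    ]


# Hypothetical embeddings; a real text encoder would produce much longer vectors.
text_embeddings = {"airport": [1.0, 0.0], "harbor": [0.0, 1.0]}
region_features = [[0.9, 0.1], [0.2, 0.8]]
labels = classify_regions(region_features, text_embeddings)
print(labels)  # prints ['airport', 'harbor']
```

Under this assumption, adding a new scene category only requires a new text description and its embedding, not a retrained classification layer.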
[0108] This embodiment provides a scene-of-interest detection method that constructs a layered network architecture. First, scenes of interest are detected, and then object detection is performed on those scenes. The first layer employs a scene blind detection model, simultaneously detecting scenes of interest across all categories of interest in visible light data. That is, through supervised learning, preliminary blind detection is performed on regions of interest in open visible light image datasets to identify all scenes of interest contained within the dataset. Second, the second layer, based on the blind detection results, simultaneously or selectively schedules single-class object detection models according to priority. By detecting scenes first, unnecessary computation across the entire image is reduced, improving detection efficiency. Targeted object detection for specific scenes improves accuracy. Object detection models can be flexibly scheduled according to different needs, adapting to various application scenarios.
[0109] Please refer to Figure 5, which is a structural block diagram of a scene-of-interest detection device provided in an embodiment of the present invention. The device may include:
[0110] Data acquisition module 100 acquires publicly available visible light image datasets;
[0111] Training module 200 constructs a scene blind detection model based on a multi-scene supervised learning detection neural network model, and trains the scene blind detection model using the publicly available visible light image dataset to obtain the trained scene blind detection model;
[0112] The blind detection module 300 uses the trained scene blind detection model to perform a preliminary blind detection on the visible light image to be detected, and obtains all scenes included in the visible light image to be detected.
[0113] The detection model invocation module 400 invokes the corresponding single-class object detection model based on all scenes included in the visible light image to be detected, and obtains the detection results.
[0114] This embodiment provides a scene-based detection device for implementing the aforementioned scene-based detection method. Therefore, the specific implementation of the scene-based detection device can be found in the embodiment section of the scene-based detection method described above. For example, the data acquisition module 100, training module 200, blind detection module 300, and detection model invocation module 400 are respectively used to implement steps S101, S102, S103, and S104 in the aforementioned scene-based detection method. Therefore, the specific implementation can be referred to the description of the corresponding embodiments, which will not be repeated here.
[0115] To implement the above embodiments, this application also proposes an electronic device, including: a processor and a memory communicatively connected to the processor; the memory stores computer execution instructions; the processor executes the computer execution instructions stored in the memory to implement the method provided in the foregoing embodiments.
[0116] To implement the above embodiments, this application also proposes a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the methods provided in the foregoing embodiments.
[0117] To implement the above embodiments, this application also proposes a computer program product, including a computer program that, when executed by a processor, implements the methods provided in the foregoing embodiments.
[0118] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in this application all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0119] It should be noted that personal information collected from users should be used for legitimate and reasonable purposes and should not be shared or sold outside of these legitimate uses. Furthermore, such collection / sharing should only be conducted after receiving the user's informed consent, including but not limited to notifying the user to read the user agreement / user notice and sign an agreement / authorization that includes authorization of relevant user information before the user uses the function. In addition, any necessary steps must be taken to protect and safeguard access to such personal information data and ensure that others with access to personal information data comply with their privacy policies and procedures.
[0120] This application is intended to provide an implementation scheme for users to selectively prevent the use or access to their personal information data. Specifically, this disclosure is intended to provide hardware and / or software to prevent or block access to such personal information data. Once personal information data is no longer needed, risks can be minimized by restricting data collection and deleting data. Furthermore, where applicable, such personal information is de-identified to protect user privacy.
[0121] In the foregoing descriptions of the embodiments, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with that embodiment or example, which are included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0122] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0123] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing custom logic functions or processes, and the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.
[0124] The logic and / or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable medium may be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example, by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.
[0125] It should be understood that various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0126] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
[0127] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0128] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.
Claims
1. A scene attention-based detection method, characterized by comprising:
obtaining a publicly available visible light image dataset, including: acquiring high-resolution visible light images of different resolutions and publicly available visible light images with large differences in characteristics, including visible light data samples of various complex scenes and images with severe texture degradation caused by different sources, resolutions, and imaging conditions;
constructing a scene blind detection model based on a multi-scene supervised learning detection neural network model, and training the scene blind detection model using the publicly available visible light image dataset to obtain a trained scene blind detection model, wherein the output of the scene blind detection model is a list of detected scene categories and corresponding confidence scores;
using the trained scene blind detection model to perform a preliminary blind detection on a visible light image to be detected, thereby obtaining all scenes included in the visible light image to be detected;
invoking, based on all scenes included in the visible light image to be detected, the corresponding single-class object detection model to obtain detection results, wherein the single-class object detection model uses a YOLO network and incorporates attention and multi-scale fusion mechanisms;
wherein invoking the corresponding single-class object detection model based on all scenes included in the visible light image to be detected to obtain the detection results includes: determining, based on the detection results of the scene blind detection model, the scenes contained in the image; and, for each detected scene, scheduling the single-class object detection model corresponding to that scene for object detection based on a single-class detection model scheduling mechanism, wherein the scheduling can be performed according to the priority of the scene;
wherein the single-class detection model scheduling mechanism includes: scene-type-based scheduling, which calls different single-class detection models for different scene types; priority-based scheduling, which uses preset scene priorities and schedules the object detection model to detect the scene with the highest priority; and target-importance-based scheduling, which ranks the scenes by importance and prioritizes detection of the most important scenes;
the method further including: RSBD consists of an image encoder, a text encoder, and an object decoder; the text encoder processes any description related to the task, including object categories, names of any kind, titles about objects, and referring expressions; prior information acts as a cue or label, the prior information being the location information of the input scene and the name of the object scene; the text encoder and the prior information are integrated into the detector to extract the object scene from the image based on the text and prior information input; the Image Backbone model uses a ResNet network to extract multi-scale features from image objects; the TextEncoder model uses a seq2seq approach to encode the text and establishes a correspondence between the text and the objects in the Image Backbone; finally, the result is input together with the prior object location information into a YOLOX with dynamic class heads, forming the object decoder for object scene detection.
2. The method according to claim 1, wherein constructing the scene blind detection model based on the multi-scene supervised learning detection neural network model and training it using the publicly available visible light image dataset to obtain the trained scene blind detection model includes: constructing the scene blind detection model using a deep convolutional neural network, which is used to extract feature information from the input image; and, using the publicly available visible light image dataset, annotating all scenes of interest and training the scene blind detection model by supervised learning to obtain the trained scene blind detection model.
3. A scene attention-based detection apparatus, characterized in that the apparatus implements the method as described in claim 1, the apparatus comprising: a data acquisition module, which acquires a publicly available visible light image dataset; a training module, which constructs a scene blind detection model based on a multi-scene supervised learning detection neural network model and trains the scene blind detection model using the publicly available visible light image dataset to obtain the trained scene blind detection model; a blind detection module, which uses the trained scene blind detection model to perform a preliminary blind detection on the visible light image to be detected and obtains all scenes included in the visible light image to be detected; and a detection model invocation module, which invokes the corresponding single-class object detection model based on all scenes included in the visible light image to be detected and obtains the detection results.
4. An electronic device, comprising: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the method as described in any one of claims 1-2.
5. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, are used to implement the method as described in any one of claims 1-2.
6. A computer program product comprising computer programs / instructions, characterized in that, when the computer programs / instructions are executed by a processor, they implement the method of any one of claims 1-2.
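As an illustrative note on the scheduling mechanism recited in claim 1, the following is a minimal Python sketch of how a list of detected scene categories and confidence scores might be turned into a priority-ordered sequence of single-class detector calls. It is not the applicant's implementation. The names SceneHit, schedule_detectors, and run_pipeline, the min_confidence threshold of 0.5, the convention that a larger number means higher priority, and the tie-break on confidence are all assumptions made for illustration; the blind detector and the single-class detectors are passed in as plain callables standing in for the trained models.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SceneHit:
    """One entry of the blind detector's output (hypothetical structure)."""
    category: str
    confidence: float


# A single-class detector takes an image and returns its detections.
Detector = Callable[[Any], List[Any]]


def schedule_detectors(
    hits: List[SceneHit],
    detectors: Dict[str, Detector],
    priority: Dict[str, int],
    min_confidence: float = 0.5,  # assumed threshold, not from the source
) -> List[Tuple[str, Detector]]:
    """Return (scene, detector) pairs, highest-priority scene first."""
    # Scene-type-based scheduling: keep only scenes that have a detector.
    kept = [
        h for h in hits
        if h.confidence >= min_confidence and h.category in detectors
    ]
    # Priority-based scheduling: larger priority value first (assumption);
    # ties are broken by higher confidence (assumption).
    kept.sort(key=lambda h: (-priority.get(h.category, 0), -h.confidence))
    plan: List[Tuple[str, Detector]] = []
    seen = set()
    for h in kept:
        if h.category not in seen:  # call each scene's detector only once
            seen.add(h.category)
            plan.append((h.category, detectors[h.category]))
    return plan


def run_pipeline(
    image: Any,
    blind_detect: Callable[[Any], List[SceneHit]],
    detectors: Dict[str, Detector],
    priority: Dict[str, int],
    min_confidence: float = 0.5,
) -> Dict[str, List[Any]]:
    """Blind-detect scenes, then run the scheduled single-class detectors."""
    hits = blind_detect(image)
    results: Dict[str, List[Any]] = {}
    for scene, detector in schedule_detectors(
        hits, detectors, priority, min_confidence
    ):
        results[scene] = detector(image)
    return results
```

In this sketch, only the detectors for scenes actually reported by the blind detection step are invoked, which corresponds to the stated aim of avoiding invalid computation over the entire image. Target-importance-based scheduling could be expressed in the same way by supplying an importance ranking in place of the preset priority table.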