An image semantic classification and region segmentation method based on CNN-SNN-GRL fusion
By employing the CNN-SNN-GRL fusion method, this invention addresses insufficient high-level semantic feature extraction and weak data fitting for hyperspectral images, achieving accurate and autonomous image region segmentation applicable to fields such as remote sensing ecological monitoring, medical image diagnosis, and industrial quality inspection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- JINGCHU UNIV OF TECH
- Filing Date
- 2026-02-03
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, image region segmentation methods have difficulty effectively extracting high-level semantic features from hyperspectral images, and their ability to fit image feature data is weak. Furthermore, dynamic graph representation learning and updating rely excessively on manual processes and lack adaptability.
The CNN-SNN-GRL fusion method is adopted: a convolutional neural network extracts features, a Siamese neural network calculates feature similarity, a dynamic graph is constructed and Laplacian-normalized, and the network weights are updated with a gradient loss based on the Hilbert-Schmidt independence criterion (HSIC), thereby achieving accurate extraction and topological association of high-level semantic features of images.
It achieves accurate extraction and topological association of high-level semantic features of images, overcomes the limitations of traditional graph convolutional networks (GCNs), improves the accuracy and autonomy of image region segmentation, and provides an innovative technical solution for fields such as remote sensing ecological monitoring, medical image diagnosis, and industrial quality inspection.

Figure CN122244497A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and more specifically, to an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion.
Background Technology
[0002] Image region segmentation is one of the main research directions in the field of computer vision. Image region segmentation can divide an image into regions with semantic consistency, which can provide basic support for image processing tasks such as image recognition and target tracking. It has wide applications in fields such as remote sensing hyperspectral imaging, ecological monitoring, medical image processing, urban land planning, military defense, robotics, agricultural technology, autonomous driving, video processing, industrial quality inspection and network communication.
[0003] Currently, typical research on applying GCN to image region segmentation includes: multi-scale feature fusion, cross-modal adversarial learning, semantic segmentation and change detection, superpixel graph structure, fusion of attention mechanism and graph learning, and fusion of graph learning and convolutional neural network.
[0004] Based on the above analysis of existing image region segmentation methods, using Graph Convolutional Networks (GCNs) for image region segmentation requires addressing two key issues: (1) Extraction of high-level semantic features of images, especially the complete spatial and spectral high-level semantic information in hyperspectral images. In GCN, the Convolutional Neural Network (CNN) is responsible for extracting the semantic features of images. However, a CNN can only build up image semantic features by stacking layers, which makes it difficult to propagate and reason over those features. Moreover, a CNN obtains image features by convolving over the spatial information of the image and pays little attention to the image's topological information. Therefore, the structure of the Graph Representation Learning (GRL) network needs to be optimized to obtain better global topological information of images for describing their high-level semantic features.
[0005] (2) Improving the ability to fit image feature data and enhancing the adaptive learning and update performance of graph representation learning.
[0006] However, some GCNs design the adjacency matrix with Euclidean or Mahalanobis distance functions, which gives poor fitting ability for high-dimensional nonlinear hyperspectral image (HSI) feature data. In addition, dynamic graph representation learning and graph representation updates rely too heavily on manual work and adapt poorly.
Summary of the Invention
[0007] To address the issues of weak fitting ability for high-dimensional nonlinear HSI image feature data, excessive reliance on manual processing for dynamic graph representation learning and updating, and poor adaptability in existing technologies, this application provides an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion.
[0008] The embodiments of this application are implemented as follows. This application provides an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion, including: acquiring an original image and preprocessing it to obtain region-blocked images; extracting features from the original image and the region-blocked images with a convolutional neural network to obtain a complete image feature set and a region feature set; inputting the region feature set into a Siamese neural network, introducing the intersection over union (IoU) to divide positive and negative samples, and calculating the feature similarity evaluation results of the positive and negative samples; defining the region features as graph nodes and calculating dynamic graph weights from the positive and negative sample feature similarity evaluation results to construct a dynamic graph; obtaining high-level semantic features, based on graph representation learning, by applying dynamic convolution operations and Laplacian normalization to the dynamic graph; and, combining upsampling and convolution operations, updating the network weights and features with a gradient loss based on the Hilbert-Schmidt independence criterion, thereby updating the dynamic graph.
[0009] Based on the updated dynamic graph, the image segmentation result is output.
[0010] In one possible implementation, the step of acquiring the original image and preprocessing it to obtain region-blocked images further includes: inputting the original image; setting the size of the region-block images and the number of region blocks; and dividing the original image into multiple regions of the same scale.
[0011] In one possible implementation, the step of inputting the region feature set into a Siamese neural network, introducing the intersection over union (IoU) to divide positive and negative samples, and calculating the feature similarity evaluation results of the positive and negative samples further includes: defining positive and negative samples according to the IoU; inputting the positive and negative samples into the two sub-networks of the Siamese neural network, respectively, to obtain the feature vectors of the positive and negative sample images; calculating the similarity between the positive and negative sample image features from these feature vectors; performing classification and similarity learning on that similarity with fully connected and Softmax layers to obtain classification results and similarity learning results; and conducting a comprehensive evaluation of the classification and similarity learning results to obtain the similarity evaluation result.
[0012] In one possible implementation, defining positive and negative samples according to the IoU further includes: determining the anchor boxes for positive and negative samples using overlapping windows; and determining the positive and negative samples by setting different step ratios relative to the anchor box.
[0013] In one possible implementation, the Siamese neural network further compensates for feature errors and label errors using an improved cross-entropy loss function, which is expressed as:
[0014] (formula not reproduced), where the symbols denote, in order: the total number of image samples; the feature label of each input image sample and its predicted label; the two outputs of the Siamese neural network; a Lagrange multiplier; the matrix transpose operation; two region blocks; and the edge feature weights of the graph nodes.
[0015] In one possible implementation, defining the region features as graph nodes and calculating dynamic graph weights based on the positive and negative sample feature similarity evaluation results to construct a dynamic graph further includes: The feature vectors in the region feature set are defined as graph nodes, and each graph node corresponds to a region block image. Based on the positive and negative sample feature similarity evaluation results, the graph edge weights are calculated. The graph edge weights reflect the similarity or correlation between image region blocks. The larger the weight value, the more similar or related the two image region blocks are. A dynamic graph is constructed based on the graph nodes and the graph edge weights.
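As a rough illustration of this step, the sketch below builds a symmetric weighted adjacency matrix whose entries are pairwise similarity scores between region-block feature vectors, so a larger weight means two blocks are more similar. It is a minimal sketch under stated assumptions: cosine similarity stands in for the Siamese network's similarity evaluation, and the function names and the pruning `threshold` are hypothetical, not taken from this application.

```python
import numpy as np

def build_dynamic_graph(features, similarity_fn, threshold=0.5):
    """Weighted adjacency over region-block nodes (hypothetical helper).

    features: (N, D) array, one feature vector per region block (graph node).
    similarity_fn: callable returning a similarity score for two feature vectors.
    Edges whose score falls below `threshold` are dropped; no self-loops.
    """
    n = features.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity_fn(features[i], features[j])
            if s >= threshold:
                adj[i, j] = adj[j, i] = s  # symmetric; edge weight = similarity
    return adj

def cosine_similarity(a, b):
    # Assumption: stand-in for the Siamese network's similarity evaluation.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Usage: three blocks; the first two have nearly parallel features.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
A = build_dynamic_graph(feats, cosine_similarity, threshold=0.5)
```

Because the matrix is recomputed from the current features, it changes whenever the features are updated, which is the sense in which the graph is dynamic in this sketch.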
[0016] In one possible implementation, obtaining the associated high-level semantic features based on graph representation learning, by applying dynamic convolution operations and Laplacian normalization to the dynamic graph, further includes: defining a dynamic convolution filter and performing dynamic convolution on the dynamic graph to obtain the dynamic graph encoding result; normalizing the graph node features with the Laplacian operator to obtain the Laplacian feature matrix of the image; and designing a mapping matrix and combining the dynamic graph encoding result, the mapping matrix, and the complete image features to associate the image's high-level semantic features, obtaining the associated high-level semantic features.
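The normalization formula itself is not given in this text. One standard choice, assumed here, is the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}, where A is the weighted adjacency matrix and D is its diagonal degree matrix. The hypothetical helper below computes it with NumPy.

```python
import numpy as np

def normalized_laplacian(adj):
    """Symmetric normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.

    Isolated nodes (degree 0) get a zero row/column in the normalized
    adjacency term, so their diagonal entry in L is 1.
    """
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    nz = deg > 0
    inv_sqrt[nz] = deg[nz] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    return np.eye(adj.shape[0]) - norm_adj

# Usage: two nodes joined by one edge.
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
lap = normalized_laplacian(adj)  # expected: [[1, -1], [-1, 1]]
```

For non-negative edge weights the eigenvalues of this matrix lie in [0, 2], which keeps repeated propagation steps numerically well behaved.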
[0017] In one possible implementation, combining upsampling and convolution operations to update the network weights and features with the HSIC gradient loss, and thereby update the dynamic graph, further includes: upsampling the high-level semantic features with upsampling layers to restore the spatial resolution of the image; performing convolution on the upsampled features to further extract local features and semantic information of the image; introducing the Hilbert-Schmidt independence criterion to construct a gradient loss function; calculating the gradient of this loss function with respect to the model parameters to obtain the gradient change detection results; and combining the upsampling and convolution results with the gradient change detection results to achieve joint learning and training of CNN-SNN-GRL and update the dynamic graph.
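HSIC itself is a standard kernel dependence measure. A minimal sketch of the usual biased empirical estimator, HSIC(X, Y) = tr(K H L H) / (n - 1)^2, with Gaussian-kernel Gram matrices K and L and centering matrix H = I - (1/n) 1 1^T, is shown below. How this application turns HSIC into its gradient loss is not stated in this text, so the kernel choice, bandwidths, and function names are assumptions. In an autodiff framework the same expression could be differentiated with respect to model parameters; NumPy is used here only to show the computation.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows of x (n x d)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2, with H the centering matrix."""
    n = x.shape[0]
    k = gaussian_kernel(x, sigma_x)
    l = gaussian_kernel(y, sigma_y)
    h = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2

# Usage: HSIC is larger for dependent variables than for independent ones.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
y_dependent = x ** 2 + 0.1 * rng.standard_normal((200, 1))  # nonlinear dependence on x
y_independent = rng.standard_normal((200, 1))               # independent of x
```

Gaussian kernels are used because they let HSIC detect nonlinear dependence (as with the squared relationship above), not only linear correlation.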
[0018] In one possible implementation, the gradient loss function is expressed as:
[0019] (formula not reproduced)
[0020] (formula not reproduced)
[0021] Where the symbols denote: the number of high-level semantic feature categories of the image; the number of pixels in the image; the probability that a given image pixel is predicted to belong to a given high-level semantic feature category; the weight assigned to each high-level semantic feature category; the true high-level semantic feature category; the respective update results of two quantities; the weight of the graph edge between two nodes in the graph; the dynamic graph encoding; and the weights used for learning and training the convolutional neural network.
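Several of the symbols listed for the gradient loss (number of categories, number of pixels, per-pixel predicted probability for each category, per-category weights, true category) are the ingredients of a class-weighted pixel-wise cross-entropy. The sketch below shows only that generic term, as an assumption; it is not a reconstruction of the application's formula, and the HSIC and graph-weight terms are omitted. The function name is hypothetical.

```python
import numpy as np

def weighted_pixel_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """Class-weighted cross-entropy averaged over pixels.

    probs: (P, C) predicted class probabilities for P pixels and C categories.
    labels: (P,) integer true category of each pixel.
    class_weights: (C,) weight per category.
    """
    p_true = probs[np.arange(len(labels)), labels]  # probability of the true class
    w = class_weights[labels]                       # weight of each pixel's true class
    return float(-np.mean(w * np.log(p_true + eps)))

# Usage: two pixels, two categories, the second category weighted twice as heavily.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
weights = np.array([1.0, 2.0])
loss = weighted_pixel_cross_entropy(probs, labels, weights)
```

This version averages over pixels; some formulations instead normalize by the sum of the weights.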
[0022] In one possible implementation, outputting the image segmentation result based on the updated dynamic graph further includes: extracting feature information from the updated dynamic graph; and generating the image segmentation result from that feature information.
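One simple way to turn per-node predictions into a segmentation map, assumed here for illustration, is to take each region block's most probable class and paint it over that block's pixels. The sketch assumes the blocks came from a non-overlapping, equal-size split ordered row by row; the function name is hypothetical.

```python
import numpy as np

def blocks_to_label_map(block_labels, rows, cols, block_h, block_w):
    """Paint each region block's predicted class over its pixels.

    block_labels: (rows * cols,) class index per block, row-major order.
    Returns a (rows * block_h, cols * block_w) label map.
    """
    grid = np.asarray(block_labels).reshape(rows, cols)
    # Repeat each block label block_h times down and block_w times across.
    return grid.repeat(block_h, axis=0).repeat(block_w, axis=1)

# Usage: four graph nodes (a 2 x 2 grid of 2 x 3 blocks) with class probabilities.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = probs.argmax(axis=1)
seg = blocks_to_label_map(labels, rows=2, cols=2, block_h=2, block_w=3)
```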
[0023] The technical solution provided in this application can achieve at least the following beneficial effects: This application presents an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion. By fusing convolutional neural networks (CNN), Siamese neural networks (SNN), and graph representation learning (GRL), the method constructs a dynamic graph structure and embeds an intersection-over-union (IoU) similarity evaluation mechanism to achieve accurate extraction and topological association of high-level semantic features of images. It establishes a multi-level feature encoding system of "block features → similarity evaluation → dynamic graph aggregation → closed-loop update" and, combined with Laplacian normalization verification and HSIC gradient optimization, completes the multi-level fusion and autonomous evolution of semantic features. Medical anatomical labels and prior knowledge of hyperspectral ground objects are introduced and jointly optimized with the dynamic convolution filters and mapping matrices, which enhances cross-scene feature discrimination and supports the analysis of complex region segmentation mechanisms. This method overcomes three major limitations of traditional graph convolutional networks (GCN): weak node feature description, ambiguous loss function definition, and insufficient dynamic update capability. It achieves deep analysis of image semantic classification and region segmentation, from low-level feature extraction to high-level semantic aggregation, providing an innovative technical solution for accurate segmentation in fields such as remote sensing ecological monitoring, medical image diagnosis, and industrial quality inspection.
Attached Figure Description
[0024] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0025] Figure 1 is a flowchart of an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion according to an exemplary embodiment of this application;
Figure 2 is a schematic diagram of the method for setting positive and negative samples in the SNN model according to an exemplary embodiment of this application;
Figure 3 is a schematic diagram of the SNN model structure and its learning and training strategy according to an exemplary embodiment of this application;
Figure 4 is a flowchart of an adjacency matrix determination method according to an exemplary embodiment of this application;
Figure 5 is a schematic diagram of the CNN-GRL network model structure and the image region segmentation processing flow according to an exemplary embodiment of this application;
Figure 6 is a schematic diagram of the ROC test results for four types of networks;
Figure 7(a) is a schematic diagram of the Accuracy results of the four networks in the image feature extraction task, and Figure 7(b) is a schematic diagram of the corresponding Loss values;
Figure 8 is a schematic diagram of six original images;
Figure 9 is a schematic diagram of the EFMs obtained from the first original image by four different networks;
Figure 10 is a schematic diagram of the EFMs obtained from the second original image by four different networks;
Figure 11 is a schematic diagram of the EFMs obtained from the third original image by four different networks;
Figure 12 is a schematic diagram of the EFMs obtained from the fourth original image by four different networks;
Figure 13 is a schematic diagram of the EFMs obtained from the fifth original image by four different networks;
Figure 14 is a schematic diagram of the EFMs obtained from the sixth original image by four different networks;
Figure 15 is an image edge feature map extracted by the CNN-GRL network;
Figure 16 is a schematic diagram of the feature histogram detection results for animal images extracted by the CNN-GRL network;
Figure 17 is a schematic diagram of the feature histogram detection results for building images extracted by the CNN-GRL network;
Figure 18 is a schematic diagram of the feature histogram detection results for flower images extracted by the CNN-GRL network;
Figure 19 is a schematic diagram of the feature histogram detection results for a home image extracted by the CNN-GRL network;
Figure 20 is a schematic diagram of the feature histogram detection results for the hyperspectral image extracted by the CNN-GRL network;
Figure 21 is a schematic diagram of the feature histogram detection results for a medical image extracted by the CNN-GRL network;
Figure 22 is a schematic diagram of the test results of CNN-GRL network weight and high-level semantic feature updates, where Figure 22(a) is a schematic diagram of the effect of … on … and …, Figure 22(b) is a schematic diagram of the effect of the number of input samples on … and …, and Figure 22(c) is a schematic diagram of the effect of … and … on …;
Figure 23 is a schematic diagram of the accuracy detection results of the network model, where Figure 23(a) shows the results as a function of the number of input samples and Figure 23(b) as a function of the number of iterations;
Figure 24 shows the NMI detection results of the network model, where Figure 24(a) shows the results as a function of the number of input samples and Figure 24(b) as a function of the number of iterations;
Figure 25 is a visualization of the aggregated distribution of image semantic features, where Figure 25(a) corresponds to the first and second original images, Figure 25(b) to the third and fourth original images, and Figure 25(c) to the fifth and sixth original images;
Figure 26 is a schematic diagram of the FOM detection results of the network model, where Figure 26(a) shows the results as a function of the number of input samples and Figure 26(b) as a function of the number of iterations;
Figure 27 is a schematic diagram of the ASD detection results of the network model, where Figure 27(a) shows the results as a function of the number of input samples and Figure 27(b) as a function of the number of iterations;
Figure 28 is a schematic diagram of the DSC detection results of the network model, where Figure 28(a) shows the results as a function of the number of input samples and Figure 28(b) as a function of the number of iterations;
Figure 29 is a schematic diagram of the RVD detection results of the network model, where Figure 29(a) shows the results as a function of the number of input samples and Figure 29(b) as a function of the number of iterations;
Figure 30 is a visualization of image segmentation, where Figures 30(a) to 30(f) are the visualizations of the first to sixth original images, respectively;
Figure 31 is a schematic diagram of the convergence test results of eight network models in the image region segmentation task, where Figure 31(a) shows the Accuracy results and Figure 31(b) shows the Loss values;
Figure 32 is a comparison of the high-level semantic feature verification results of the GNN-GRL network;
Figure 33 is a comparison of the runtime complexity verification results of various network models.
Detailed Implementation
[0026] To make the objectives, implementation methods and advantages of this application clearer, the exemplary implementation methods of this application will be clearly and completely described below with reference to the accompanying drawings of the exemplary embodiments of this application. Obviously, the exemplary embodiments described are only some embodiments of this application, and not all embodiments. It should be understood that the specific embodiments described herein are only used to explain this application and are not intended to limit this application.
[0027] It should be noted that the brief descriptions of terms in this application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of this application. Unless otherwise stated, these terms should be understood in their ordinary and common meaning.
[0028] The terms "first," "second," "third," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar or related objects or entities, and do not necessarily imply a specific order or sequence, unless otherwise specified. It should be understood that such terms are interchangeable where appropriate.
[0029] The terms “comprising” and “having”, and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a product or device that includes a series of components is not necessarily limited to the components explicitly listed, but may include other components that are not explicitly listed or that are inherent to such a product or device.
[0030] Before explaining the image semantic classification and region segmentation method based on CNN-SNN-GRL fusion provided in the embodiments of this application, the application scenarios and implementation environment of the embodiments of this application will be introduced first.
[0031] Image region segmentation is one of the main research directions in the field of computer vision. Image region segmentation can divide an image into semantically consistent regions, providing fundamental support for image processing tasks such as image recognition and target tracking. It has wide applications in fields such as remote sensing hyperspectral imaging, ecological monitoring, medical image processing, urban land planning, military defense, robotics, agricultural technology, autonomous driving, video processing, industrial quality inspection, and network communication.
[0032] Graph Convolutional Networks (GCNs) were first applied to image region segmentation mainly to improve image feature extraction and semantic segmentation performance.
[0033] Currently, typical research on applying GCN to image region segmentation includes: multi-scale feature fusion, cross-modal adversarial learning, semantic segmentation and change detection, superpixel graph structure, fusion of attention mechanism and graph learning, and fusion of graph learning and convolutional neural network.
[0034] Based on the above analysis of existing image region segmentation methods, using Graph Convolutional Networks (GCNs) for image region segmentation requires addressing two key issues: (1) Extraction of high-level semantic features of images, especially the complete spatial and spectral high-level semantic information in hyperspectral images. In GCN, the Convolutional Neural Network (CNN) is responsible for extracting the semantic features of images. However, a CNN can only build up image semantic features by stacking layers, which makes it difficult to propagate and reason over those features. Moreover, a CNN obtains image features by convolving over the spatial information of the image and pays little attention to the image's topological information.
[0035] Therefore, the structure of Graph Representation Learning (GRL) networks needs to be optimized to obtain better global topological information of images for describing their high-level semantic features.
[0036] (2) Improving the ability to fit image feature data and enhancing the adaptive learning and update performance of graph representation learning. Some GCNs design the adjacency matrix with Euclidean or Mahalanobis distance functions, which results in poor fitting ability for high-dimensional nonlinear HSI feature data. Dynamic graph representation learning and graph representation updates also rely too heavily on manual work and adapt poorly.
[0037] Therefore, it is necessary to improve the similarity calculation method between high-dimensional node features of images and adopt an end-to-end training method to achieve autonomous learning and updating of graph representation.
[0038] In Geographic Graph Convolutional Networks (GCNs), graph reasoning of image content is based on Perceived Organizational Support (POS) theory. GCN nodes are represented by different image regions, and GCN edges are represented by the similarity between these regions. GCNs can capture the dependencies between semantic features of the image and encode the topological information of the GCN graph. However, GCN nodes lack the ability to describe high-level semantic features of the image, making it difficult to accurately match the weights of GCN edges with these features, thus reducing the autonomy of GCN dynamic graph computation.
[0039] To address the aforementioned issues, this application provides an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion. A CNN is embedded within a GRL network to construct a CNN-GRL network model, so that the network can perform graph representation learning and classification of high-level semantic image features in a unified way and establish associations between these features within the graph structure. A Siamese Neural Network (SNN) extracts positive and negative sample image features and calculates the similarity between them, and an Intersection over Union (IoU) criterion is introduced to evaluate these features, thereby obtaining the high-level semantic features of the image. A gradient loss function constructed with the Hilbert-Schmidt Independence Criterion (HSIC) achieves dynamic graph updates. The CNN, SNN, CNN-GRL, and HSIC are jointly trained.
[0040] Next, the technical solutions of this application and how they solve the aforementioned technical problems will be described in detail through embodiments and in conjunction with the accompanying drawings. The embodiments can be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application.
[0041] Figure 1 is a flowchart of an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion according to an exemplary embodiment of this application.
[0042] In one exemplary embodiment, as shown in Figure 1, an image semantic classification and region segmentation method based on CNN-SNN-GRL fusion is provided. In this embodiment, the method may include the following steps:
Step 100: Obtain the original image and preprocess it to obtain region-blocked images.
[0043] In one exemplary embodiment, the specific implementation process includes: inputting the original image, defined as …; dividing the original image into region blocks of the same scale to obtain a set of region blocks …, where N is the number of region-block images and each element of the set is one region-block image; and setting the size of each region-block image to …, so that … can be represented as ….
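The blocking step can be sketched in NumPy as follows, assuming an H x W x C array whose height and width are exact multiples of the block size, and a row-by-row block order. The function name is hypothetical.

```python
import numpy as np

def split_into_blocks(image, block_h, block_w):
    """Split an H x W x C image into non-overlapping, equal-size region blocks.

    Returns an array of shape (N, block_h, block_w, C), ordered row by row,
    where N = (H // block_h) * (W // block_w).
    """
    h, w, c = image.shape
    if h % block_h or w % block_w:
        raise ValueError("image size must be a multiple of the block size")
    rows, cols = h // block_h, w // block_w
    # (rows, block_h, cols, block_w, C) -> (rows, cols, block_h, block_w, C)
    blocks = image.reshape(rows, block_h, cols, block_w, c).swapaxes(1, 2)
    return blocks.reshape(rows * cols, block_h, block_w, c)

# Usage: a toy 4 x 6 image with 3 bands, split into a 2 x 2 grid of 2 x 3 blocks.
img = np.arange(4 * 6 * 3).reshape(4, 6, 3)
blocks = split_into_blocks(img, 2, 3)
```

Each block then becomes one graph node in the later steps.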
[0044] The images can come from different datasets, such as Cora, Citeseer, and PubMed, and can be natural images, remote sensing images, medical images, etc. The purpose of the blocking is to divide the image into multiple regions for subsequent feature extraction and processing. The resulting region blocks are defined as the graph nodes of CNN-GRL and serve as the input for subsequent feature extraction.
[0045] Step 200: Extract features from the original image and the region segmented image using a convolutional neural network to obtain a complete set of image features and region features.
[0046] In an exemplary embodiment, a convolutional neural network (CNN) typically consists of multiple convolutional layers, pooling layers, and fully connected layers. Convolutional layers are used to extract local features of an image, pooling layers are used to reduce the dimensionality of feature maps, and fully connected layers are used to classify features. Image regions are divided into blocks and input into the trained CNN. Through the processing of convolutional and pooling layers, feature vectors of each image region block are extracted. These feature vectors can represent key information in the image, such as edges, textures, and shapes.
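A minimal NumPy sketch of the per-block pipeline (convolution, ReLU, max pooling) is shown below. It is a simplification under stated assumptions: the block is single-band, the filters are random rather than trained, and global average pooling replaces the fully connected layer to produce one feature vector per block. All names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, kernels):
    """'Valid' 2D cross-correlation (what CNN layers compute) of a single-band
    map x (H x W) with K kernels (K x kh x kw); returns K x (H-kh+1) x (W-kw+1)."""
    kh, kw = kernels.shape[1:]
    windows = sliding_window_view(x, (kh, kw))  # (H-kh+1, W-kw+1, kh, kw)
    return np.einsum("ijab,kab->kij", windows, kernels)

def max_pool2x2(fmap):
    """2 x 2 max pooling with stride 2 (an odd trailing row/column is dropped)."""
    k, h, w = fmap.shape
    fmap = fmap[:, : h // 2 * 2, : w // 2 * 2]
    return fmap.reshape(k, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def block_feature(block, kernels):
    """Conv -> ReLU -> max pool -> per-channel global average: one vector per block."""
    fmap = np.maximum(conv2d_valid(block, kernels), 0.0)
    return max_pool2x2(fmap).mean(axis=(1, 2))

# Usage: eight untrained 3 x 3 filters applied to one 16 x 16 region block.
rng = np.random.default_rng(0)
kernels = rng.standard_normal((8, 3, 3))
block = rng.random((16, 16))
feature = block_feature(block, kernels)
```

The feature length equals the number of filters, so every block yields a vector of the same size regardless of its content.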
[0047] Figure 2 is a schematic diagram of the method for setting positive and negative samples in the SNN model according to an exemplary embodiment of this application.
[0048] Furthermore, to obtain a dynamic graph with high-level semantic features, some embodiments of this application provide a CNN-GRL network module that introduces SNN and IoU, so that the image features extracted by CNN-GRL include high-level semantic features and can be used to judge high-level semantic information. Judging the extracted high-level semantic features in real time is key to ensuring that the constructed CNN-GRL graph is dynamic.
[0049] The classification feature labels of the original image … are difficult to use directly for training the SNN model on high-level semantic features, so the labels of positive and negative image samples need to be redefined.
[0050] To obtain positive and negative sample labels, the positive and negative samples of the input image must first be set.
[0051] Some embodiments of this application use the IoU value of positive and negative samples as the evaluation criterion. The IoU value is the ratio of the area where two regions overlap to the area of their union.
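For two axis-aligned boxes, IoU is the area of their intersection divided by the area of their union. The helper below is a generic sketch that assumes (x1, y1, x2, y2) box coordinates; it does not encode the step-ratio rules used in this application, and the function name is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Usage: shift a 4 x 4 box horizontally and measure the overlap with the original.
half_shift = iou((0, 0, 4, 4), (2, 0, 6, 4))     # 8 / 24
quarter_shift = iou((0, 0, 4, 4), (1, 0, 5, 4))  # 12 / 20
```

For instance, shifting a 4 x 4 box by half its width gives an IoU of 1/3, and by a quarter of its width gives 0.6, so larger shifts relative to an anchor yield lower IoU.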
[0052] Step 300: Input the region feature set into the Siamese neural network, introduce the intersection over union (IoU) to divide positive and negative samples, and calculate the feature similarity evaluation results of the positive and negative samples.
[0053] The specific process of introducing SNN and IoU is as follows: positive and negative samples are defined according to the IoU; the positive and negative samples are input into the two sub-networks of the Siamese neural network, respectively, to obtain the feature vectors of the positive and negative sample images; the similarity between the positive and negative sample image features is calculated from these feature vectors; fully connected and Softmax layers perform classification and similarity learning on that similarity, yielding classification results and similarity learning results; and a comprehensive evaluation of these results gives the similarity evaluation result.
[0054] Some embodiments of this application specify how positive and negative samples are set for SNN model learning and training, as shown in Figure 2: the anchor bounding boxes for positive and negative samples are determined using an overlapping window (the white box in Figure 2), and positive and negative samples are determined by setting different step ratios relative to the anchor box. In this application, positive samples are set with a step ratio of 1:2 (the red box in Figure 2), and negative samples are set with a step ratio of 1:4 (the green box in Figure 2).
[0055] Among them, the positive samples are presented using a Gaussian distribution of sample features. Based on this, the labels for positive sample learning training are designated as Negative samples are presented using a global, uniform distribution of sample features. Based on this, the labels for negative sample learning training are designated as .
[0056] Step 400: Define the region features as graph nodes, and calculate the dynamic graph weights by combining the positive and negative sample feature similarity evaluation results to construct a dynamic graph.
[0057] The main steps of graph representation learning include: each node in the graph sends the acquired image feature information to its neighboring nodes, and all nodes fuse the image feature information of their neighboring nodes; nonlinear transformations are used to improve the graph representation learning model's ability to describe image feature information; and the image feature information of all nodes is fused to form a complete graph. The ability of the graph nodes to describe image feature information will directly affect the quality of the constructed graph.
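One round of the message passing described in [0057] (send to neighbours, fuse, apply a nonlinear transformation, then fuse all nodes into a whole-graph representation) can be sketched as follows. Mean aggregation with self-loops, a single linear map, a ReLU nonlinearity and mean pooling are illustrative assumptions, and the function names are hypothetical; this is not the aggregation rule of this application.

```python
import numpy as np


def message_passing_step(adj: np.ndarray, features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One round: each node averages its neighbours' (and its own) features,
    then a linear map and a ReLU nonlinearity are applied (assumed choices)."""
    a_hat = adj + np.eye(adj.shape[0])          # self-loops so a node keeps its own information
    deg = a_hat.sum(axis=1, keepdims=True)       # number of contributors per node
    aggregated = (a_hat @ features) / deg        # mean of neighbour + self features
    return np.maximum(aggregated @ weight, 0.0)  # nonlinear transformation (ReLU)


def graph_readout(node_features: np.ndarray) -> np.ndarray:
    """Fuse all node features into one graph-level vector (mean pooling, assumed)."""
    return node_features.mean(axis=0)
```

Stacking several such rounds lets information travel beyond immediate neighbours, which is why the quality of each node's description affects the quality of the final graph.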
[0058] Figure 3 is a schematic diagram of the SNN model structure and learning/training strategy in an exemplary embodiment of this application.
[0059] In one exemplary embodiment, this application employs an SNN model to assign weights among the edge features of graph nodes. The two sub-networks share weights and evaluate the similarity between the edge features of graph nodes in order to match the feature categories of unknown image samples.
[0060] Within the sample set, two regions are defined and a feature category label is set for them: one label value indicates that the two regions belong to the same feature category, and the other indicates that they belong to different feature categories.
[0061] Therefore, the input to the SNN model can be represented as .
[0062] To reduce the computational complexity of the model, the SNN model is trained offline as a binary classifier. The outputs of the SNN model are concatenated into a one-dimensional vector, and the feature similarity of graph nodes is then learned through a fully connected (FC) layer and a Softmax layer.
[0063] Define the total number of image samples participating in the learning and training of the SNN model. For each input image sample, its feature label is used as the label of the predicted result. During learning and training, one value of the predicted label indicates that the two regions are similar, and the other value indicates that they are not similar.
[0064] Define the two outputs of the SNN model. Using a Lagrange multiplier and the matrix transpose operation, the Softmax cross-entropy loss function of the SNN model is expressed by Formula (1). This refines the cross-entropy loss function commonly used in SNN models, which does not take into account the error between the input feature labels and the predicted feature labels, a common problem in current machine learning networks. Compared with Contrastive Loss and Triplet Loss, two well-established SNN loss functions, the loss function designed in this application also has certain advantages. The learning and training of the predicted labels has also been improved to take into account the feature error between the two regions. The internal structure and learning/training strategy of the SNN model constructed in this application are shown in Figure 3.
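The offline binary-classification setup of [0062] (shared-weight branches, concatenation, an FC layer and Softmax) can be sketched in NumPy as below. This is a forward pass with a plain Softmax cross-entropy, not the improved loss of Formula (1), whose exact form is not reproduced in the text; the class name `SiameseSimilarity`, the layer sizes and the order of the two output classes are assumptions.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class SiameseSimilarity:
    """Hypothetical two-branch network: shared encoder -> concatenation -> FC -> Softmax."""

    def __init__(self, in_dim: int, emb_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.normal(scale=0.1, size=(in_dim, emb_dim))  # shared by both branches
        self.w_fc = rng.normal(scale=0.1, size=(2 * emb_dim, 2))     # FC layer on the concatenation
        self.b_fc = np.zeros(2)

    def embed(self, x: np.ndarray) -> np.ndarray:
        return relu(x @ self.w_enc)

    def predict_proba(self, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
        # Concatenate both branch outputs into one 1-D vector per pair.
        joint = np.concatenate([self.embed(x1), self.embed(x2)], axis=-1)
        # Assumed column order: [P(not similar), P(similar)].
        return softmax(joint @ self.w_fc + self.b_fc)


def cross_entropy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Plain Softmax cross-entropy (not the improved Formula (1))."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))
```

Because both branches call the same `embed`, the weight sharing described in [0059] holds by construction.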
[0065] In an exemplary embodiment, the Laplacian normalization verification includes: checking the orthogonality of the eigenvectors (the product of the eigenvector matrix and its transpose equals the identity matrix); verifying symmetry (the Laplacian matrix equals its transpose); and confirming positive semidefiniteness (all eigenvalues are non-negative).
[0066] Figure 4 is a flowchart of the adjacency matrix determination method in an exemplary embodiment of this application.
[0067] Step 500: Based on graph representation learning technology, through dynamic convolution operation and Laplacian normalization processing, combined with the dynamic graph, the associated high-level semantic features are obtained.
[0068] In one exemplary embodiment, the association of high-level semantic features of the image mainly involves CNN-GRL graph coding, adjacency matrix design, and graph learning training mapping matrix design.
[0069] A CNN-GRL graph consists of two parts: graph nodes and the edges connecting them. It is defined by a graph node set (which includes the number of graph nodes), a graph edge set (which includes the number of graph edges), and a weighted symmetric adjacency matrix.
[0070] The elements of the adjacency matrix describe the relationships between nodes in the graph: each element is the weight of the edge between the corresponding pair of nodes, and together these weights constitute the weights of the entire CNN-GRL network.
[0071] When the weight satisfies the threshold condition, the edge between the two nodes exists; otherwise, it does not. The threshold can be determined by calculating a Gaussian kernel function with a defined smoothness control factor. For two nodes, their high-level semantic features and high-level semantic feature labels are obtained using Formula 1; the high-level semantic feature distance, the feature label distance, and the maximum threshold feature distance between them are then used to calculate the edge weight with Formula (2). The degree of a CNN-GRL graph node is the number of its neighboring nodes and is also one of the high-level semantic features of the graph node. The degree matrix is a diagonal matrix whose dimension equals the number of graph nodes.
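A sketch of how Gaussian-kernel edge weights could combine a feature distance and a label distance with a maximum threshold distance, as described in [0071]. Formula (2) is not recoverable from the text, so the particular combination used here (sum of squared distances inside one Gaussian kernel, a 0/1 label-mismatch distance, and a hard cut-off at `d_max`) is an assumption, as is the function name; the degree matrix follows the text in counting neighbouring nodes.

```python
import numpy as np


def build_graph(features: np.ndarray, labels: np.ndarray,
                sigma: float = 1.0, d_max: float = 2.0):
    """Return a weighted symmetric adjacency matrix W and a diagonal degree matrix D.

    Assumed weight: exp(-(d_feat^2 + d_label^2) / (2 sigma^2)) when d_feat <= d_max,
    otherwise 0 (no edge). d_label is 0 for equal labels and 1 otherwise.
    """
    diff = features[:, None, :] - features[None, :, :]
    d_feat = np.linalg.norm(diff, axis=-1)                          # semantic feature distance
    d_label = (labels[:, None] != labels[None, :]).astype(float)    # label distance (0/1)
    w = np.exp(-(d_feat ** 2 + d_label ** 2) / (2.0 * sigma ** 2))  # Gaussian kernel
    w[d_feat > d_max] = 0.0   # beyond the maximum threshold distance the edge does not exist
    np.fill_diagonal(w, 0.0)  # no self-loops
    # Degree as in the text: number of neighbouring nodes of each node.
    degree = np.diag((w > 0).sum(axis=1).astype(float))
    return w, degree
```

Here `sigma` plays the role of the smoothness control factor: a larger value makes weights decay more slowly with distance.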
[0072] Laplacian normalization is then performed based on the Laplace operator (LO), to ensure consistency of the matrix properties in the calculations.
[0073] The normalized matrix can be regarded as the Laplacian feature matrix of the image, and its features are high-level semantic features of the image.
[0074] Define the eigenvector matrix extracted during the learning and training of the CNN-GRL model, i.e., the eigenvector matrix of the graph nodes. Through eigendecomposition with this eigenvector matrix, the corresponding eigenvalues can be determined; whether they are true high-level semantic features requires further verification.
[0075] Here, the eigenvalue matrix is a diagonal matrix whose diagonal elements are the eigenvalues, and the transpose of the eigenvector matrix appears in the decomposition.
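The Laplacian normalization of [0072], the eigendecomposition of [0074] and [0075], and the three verification checks listed in [0065] can be sketched with the standard symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2). Using this particular normalization (with weighted degrees) is an assumption, and the function names are hypothetical.

```python
import numpy as np


def normalized_laplacian(w: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian I - D^(-1/2) W D^(-1/2) (assumed normalization)."""
    deg = w.sum(axis=1)                    # weighted degree of each node
    inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    inv_sqrt[nz] = 1.0 / np.sqrt(deg[nz])  # isolated nodes keep 0 to avoid division by zero
    d_inv_sqrt = np.diag(inv_sqrt)
    return np.eye(w.shape[0]) - d_inv_sqrt @ w @ d_inv_sqrt


def decompose_and_verify(lap: np.ndarray, tol: float = 1e-8):
    """Eigendecomposition L = U diag(eigvals) U^T plus the checks listed in [0065]."""
    eigvals, u = np.linalg.eigh(lap)  # eigh assumes a symmetric input
    n = lap.shape[0]
    checks = {
        "orthogonal_eigenvectors": np.allclose(u @ u.T, np.eye(n), atol=tol),
        "symmetric": np.allclose(lap, lap.T, atol=tol),
        "positive_semidefinite": bool(np.all(eigvals >= -tol)),
        "reconstructs": np.allclose(u @ np.diag(eigvals) @ u.T, lap, atol=tol),
    }
    return eigvals, u, checks
```

For a two-node graph with a single edge, the normalized Laplacian has eigenvalues 0 and 2, and all checks pass.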
[0076] Dynamic convolution operations in CNN-GRL graphs are a key step in CNN-GRL graph encoding and the foundation for high-level semantic feature association in images.
[0077] Define the dynamic convolution filter; the dynamic convolution operation is given by Formula (3).
[0078] Following the definition of Chebyshev polynomials, define the polynomial order and the Chebyshev polynomial of each order, together with the corresponding Chebyshev coefficients. The scaled matrix is a diagonal matrix, and the remaining parameter is a Lagrange multiplier. The solution is obtained by constructing a Softmax activation function.
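The Chebyshev-polynomial filtering of [0077] and [0078] resembles the standard ChebNet approximation y = sum_k theta_k T_k(L~) x, with L~ = 2L/lambda_max - I and the recursion T_k = 2 L~ T_{k-1} - T_{k-2}. The sketch below implements that standard form under the assumption that it matches Formula (3); the default lambda_max = 2 (the upper bound of the normalized Laplacian's spectrum), the scalar coefficient per order, and the function name are assumptions.

```python
import numpy as np


def chebyshev_conv(lap: np.ndarray, x: np.ndarray, theta, lambda_max: float = 2.0) -> np.ndarray:
    """ChebNet-style filtering: sum_k theta[k] * T_k(L_tilde) @ x.

    lap: (n, n) normalized Laplacian; x: (n, f) node features;
    theta: one scalar coefficient per order (K = len(theta) terms, assumed).
    lambda_max = 2.0 is the spectral upper bound of the normalized Laplacian.
    """
    n = lap.shape[0]
    l_tilde = (2.0 / lambda_max) * lap - np.eye(n)  # rescale the spectrum into [-1, 1]
    t_prev = x                                      # T_0(L~) x = x
    t_curr = l_tilde @ x                            # T_1(L~) x = L~ x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (l_tilde @ t_curr) - t_prev  # T_k = 2 L~ T_{k-1} - T_{k-2}
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```

The recursion avoids an explicit eigendecomposition, and a filter of order K only mixes information from nodes at most K hops apart.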
[0079] The output of the dynamic graph convolution in CNN-GRL is the dynamic graph encoding result. Given the adjacency matrix used in the dynamic convolution and the activation function of the CNN-GRL model, the dynamic graph encoding result is computed by Formula (4) from the encoding result of the previous level.
[0082] The adjacency matrix is learned and trained through the SNN model: by incorporating the SNN model, the high-level semantic features of the two image regions and the predicted label matrix are obtained in turn using Formula 1.
[0083] This application improves on the approach of designing the adjacency matrix directly from Laplacian-normalized image feature labels. The detailed definition is shown in Figure 4.
[0084] After the dynamic graph encoding is obtained, the mapping matrix is trained through graph learning. To associate the high-level semantic features of the image, the basic image features learned by the CNN network are fused with the dynamic graph encoding.
[0085] The mapping matrix is determined from the cross-entropy error of the image features learned by the CNN network and by the CNN-GRL network, calculated as Formula (5).
[0086] For the weights learned by the CNN network, some embodiments of this application use deep residual learning for training. The cross-entropy error is computed with a norm-based operation. In addition, the CNN-GRL network also takes two factors into account: graph nodes and graph edges.
[0087] Meanwhile, the cross-entropy error calculation also considers the error between the input feature labels and the predicted feature labels.
[0088] The learned mapping matrix is usually designed based on the error between the output features of the graph dynamic convolution and the predicted image features.
[0089] Some embodiments of this application improve this rule by computing the error between the initial graph node semantic features and the predicted image semantic features. The improvement also incorporates the image semantic feature errors generated during the learning and training of the CNN, SNN, and GRL networks. This constrains the high-level image semantic features during computation, improves the accuracy of image semantic feature classification and aggregation, and significantly reduces the complexity of computing semantic feature errors step by step between the individual models.
[0090] In addition, the mapping matrix design also takes into account the errors between the high-level semantic feature labels of images, further improving the model's performance in image semantic feature classification and aggregation.
[0091] The final high-level semantic features of the image output by the CNN-GRL network model are represented as follows: .
[0092] Step 600: Using a combination of upsampling and convolution operations, update the network weights and features through gradient loss based on the Hilbert-Schmidt independence criterion to update the dynamic graph.
[0093] In one exemplary embodiment, the autonomous update strategy of the CNN-GRL dynamic graph is also important for improving the discriminative ability of the high-level semantic features of the CNN-GRL network.
[0094] This application uses upsampling (UP) and convolution (Conv) for end-to-end learning and training to update the network weights and high-level semantic features, and uses HSIC gradient change theory to determine the loss function of the CNN-GRL network. Define the number of high-level semantic feature categories of an image, the number of pixels in the image, the probability that a given pixel is predicted as a given high-level semantic feature category, the weight used to determine each category, and the true category of the high-level semantic features; the updated weights and features are then calculated by Formula (6). Based on the number of high-level semantic feature categories and the number of image pixels, this CNN-GRL dynamic graph self-update strategy calculates the model's dynamic weights and high-level semantic feature update values by introducing the category probabilities and the category determination weights.
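The quantities listed in [0094] (number of categories, number of pixels, per-pixel category probabilities, category weights and true categories) are those of a class-weighted pixel-wise cross-entropy. Since Formula (6) itself is not reproduced, the sketch below shows that standard loss and its gradient with respect to the logits purely as an illustration; it omits any HSIC term, and the function names are hypothetical.

```python
import numpy as np


def weighted_pixel_cross_entropy(probs: np.ndarray, true_cls: np.ndarray,
                                 class_weights: np.ndarray, eps: float = 1e-12) -> float:
    """-(1/N) * sum_i w[y_i] * log p[i, y_i] over N pixels (assumed form).

    probs: (N, C) Softmax probabilities per pixel; true_cls: (N,) true category indices;
    class_weights: (C,) category weights.
    """
    n_pixels = probs.shape[0]
    picked = probs[np.arange(n_pixels), true_cls]
    return float(-np.mean(class_weights[true_cls] * np.log(picked + eps)))


def grad_wrt_logits(probs: np.ndarray, true_cls: np.ndarray,
                    class_weights: np.ndarray) -> np.ndarray:
    """Gradient of the loss above w.r.t. the pre-Softmax logits: w[y_i] * (p_i - onehot_i) / N."""
    n, c = probs.shape
    onehot = np.eye(c)[true_cls]
    return class_weights[true_cls][:, None] * (probs - onehot) / n
```

A gradient step on the network weights would back-propagate `grad_wrt_logits` through the UP+Conv layers; how this application couples that step with the HSIC criterion is not specified here.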
[0095] Image representation learning networks update network model weights and high-level semantic features primarily based on loss functions, which is a common mechanism used in most machine learning models.
[0096] The CNN-GRL model provided in some embodiments of this application proposes a loss function design strategy based on the update of network model weights and high-level semantic features. The feasibility of this strategy is based on the fact that the loss function is constructed and updated during the backpropagation verification stage of the network model, rather than during the forward propagation learning and training stage.
[0097] Some embodiments of this application use a calculation method that implements the design and updating of the loss function simultaneously, which also significantly reduces the computational complexity of the network model.
[0098] Compared with existing GCN network model construction methods, CNN-GRL networks can not only describe the relationships between high-level semantic features of adjacent images, but also establish high-level semantic feature associations over longer distances, and improve the flexibility of weight and graph structure updates.
[0099] Step 700: Output the image segmentation result based on the updated dynamic graph.
[0100] In one possible implementation, as shown in Figure 5, the CNN-GRL network model provided in some embodiments of this application includes three parts: a dynamic graph representation module, a high-level semantic feature association module, and a dynamic graph update module.
[0101] Its image region segmentation processing flow includes: In the CNN-GRL network model provided in this application, the image features extracted by the CNN convolutional layer are defined as CNN-GRL graph nodes.
[0102] The SNN model is used to extract features from positive and negative image samples. IoU is introduced to evaluate the similarity of positive and negative sample features, obtain high-level semantic features of the image, use a fully connected layer (FC) to classify the high-level semantic features, design a Softmax function to learn the similarity of the high-level semantic features, obtain the edges of the CNN-GRL graph, and complete the learning and training of the dynamic graph representation module CNN-GRL.
[0103] By designing adjacency and mapping matrices, dynamic graph convolution is performed on the acquired high-level semantic features of the image. Then, the high-level semantic features are mapped and aggregated with the image features extracted by the CNN convolutional layer to obtain associated high-level semantic features, thus completing the learning and training of the high-level semantic feature association module.
[0104] End-to-end learning is performed using upsampling layers and convolutional layers. The final high-level semantic information map of the image is obtained through the HSIC gradient loss function and updated in real time. During backpropagation, the CNN-GRL state map can also be adjusted according to the gradient of the HSIC loss function.
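The HSIC gradient loss is not given explicitly in the text. For reference, the standard biased empirical HSIC estimator with Gaussian kernels, HSIC = tr(K H L H) / (n - 1)^2 with centering matrix H = I - (1/n) 11^T, is sketched below. How this application turns HSIC into a loss (which features and labels are paired, and the kernel widths) is not stated, so those choices and the function names are assumptions.

```python
import numpy as np


def gaussian_gram(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian (RBF) kernel matrix for the rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)  # squared distances, clipped at 0
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(x: np.ndarray, y: np.ndarray, sigma_x: float = 1.0, sigma_y: float = 1.0) -> float:
    """Biased empirical HSIC between paired samples x (n, dx) and y (n, dy)."""
    n = x.shape[0]
    k = gaussian_gram(x, sigma_x)
    l = gaussian_gram(y, sigma_y)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)
```

HSIC is near zero when the two variables are independent and grows with their dependence, so a loss could, for example, reward dependence between features and labels; that use is an assumption here.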
[0105] In one possible implementation, this application also provides a CNN-GRL image region segmentation algorithm, which completes the image region segmentation task by performing graph association representation on the high-level semantic features of the image.
[0106] Using existing image segmentation datasets as priors, joint graph representation learning and training is performed on the SNN, CNN-GRL, and HSIC components introduced in the CNN-GRL network model to obtain control parameters, which are then applied to different types of image segmentation tasks.
[0107] In the SNN network, fixed convolutional layer parameters are set, while the CNN-GRL network mainly learns and trains the parameters of the FC layer. The HSIC gradient descent algorithm is used to realize the weight and dynamic high-level semantic feature map update of the CNN-GRL network.
[0108] The specific algorithm flow is as follows. Input: image X and region segmentation rules N; the CNN network and feature extractor; the number of iterations of the SNN network; the number of iterations of the CNN-GRL network; the smoothness control factor of the Gaussian kernel function; the high-level semantic maximum threshold feature distance; the order of the Chebyshev polynomial and its coefficient vector; and the weights.
[0109] Initialization parameters: the predicted labels, the number of high-level semantic feature categories of the image, the number of pixels in the image, the high-level semantic feature category probabilities, the true categories of the high-level semantic features, and the following optimization parameters.
[0110] (1) Extract the complete image features using the CNN network, delineate image regions, and extract the region block image features.
(2) Introduce IoU to evaluate the similarity between region pairs and determine the similarity evaluation result.
(3) Train and update the SNN model over multiple iterations, optimizing its parameters.
(4) Design the CNN-GRL graph structure based on the Gaussian kernel function.
(5) Apply the Laplace operator to obtain the normalized Laplacian, design the dynamic convolution filter, and perform dynamic encoding through the adjacency matrix.
(6) Train the mapping matrix by graph learning and perform high-level semantic feature association of the image.
(7) Using upsampling convolution (UP+Conv) combined with HSIC gradient change theory, perform joint learning and training.
Output: the segmentation result image.
[0111] As can be seen, some embodiments of this application fuse CNN, SNN, and GRL network models, with the main research objectives of high-level semantic feature extraction, judgment, classification, and aggregation of images. By utilizing the HSIC gradient change rule, the constructed network model is jointly trained. The CNN-GRL network model constructed in this application has achieved excellent image region segmentation results.
[0112] In CNN models, introducing IoU for initial image region segmentation is the foundation for constructing the GRL graph node structure, the criterion for positive and negative sample division in SNN models, and the guarantee for high-level semantic feature extraction of images.
[0113] Compared with currently used SNN models, this application defines image semantic feature labels and calculates the error between the input semantic feature labels and the predicted semantic features, further enhancing high-level semantic feature extraction. In combination with the characteristics of the GRL model, the similarity evaluation method between GRL graph node features is also improved: the commonly used evaluation rule based on image feature distance is abandoned, and the model's loss function is used for learning and training instead, so that the similarity of high-level semantic features of the image is evaluated directly through weight allocation.
[0114] In the CNN-GRL model, model weights are designed by combining the distance of high-level semantic features and the distance of feature labels. The operation rules that only use the distance of image features are improved and the consistency with the SNN model is maintained. In the design of dynamic convolutional filters, a Laplacian normalization processing mechanism is introduced to ensure that the extracted image features are high-level semantic features, and a judgment strategy for high-level semantic features is formulated.
[0115] In dynamic graph encoding, the adjacency matrix is designed based on the predicted image feature labels. The graph learning training mapping matrix design uses error calculations between the initial graph node semantic features and feature labels, and the predicted image semantic features and feature labels. This effectively controls the limitations of high-level semantic features in the computation process, improves the accuracy of image semantic feature classification and aggregation, and reduces the complexity of semantic feature error calculations. The conventional mechanism of updating the loss function through backpropagation is improved. Based on the network model weights and the update status of high-level semantic features, an update strategy for the CNN-GRL model loss function is formulated. This achieves synchronous design and update of the loss function, reducing the computational complexity of the network model.
[0116] The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion proposed in this application can be widely applied to image feature extraction, feature classification, feature aggregation, image recognition, object tracking, image region segmentation, and other computational tasks in the field of machine vision processing.
[0117] To verify this application, the image semantic classification and region segmentation method based on CNN-SNN-GRL fusion provided in this application is applied to different experimental scenarios and compared with existing technologies.
[0118] The experimental environment and main parameters were as follows: an AMD Ryzen 5 5600G with Radeon Graphics 3.90 GHz processor, 16.0 GB of memory, and the Windows 10 64-bit operating system. MATLAB 2014a was used to construct the CNN, SNN, and CNN-GRL network models. The SNN and CNN-GRL network models were trained first, with a learning rate of 0.001, the Adam optimizer, momentum of 0.9, weight decay of 0.0005, and 0 to 900 training iterations. Six datasets (Cora, Citeseer, Pubmed, Tox21, IndianPines, and TCGABrainMRIdatasets) were used to train the SNN and CNN-GRL networks. The trained CNN-GRL network was then used to perform region segmentation on six image categories: animals, flowers, buildings, rooms, hyperspectral images, and medical images. The experimental images were normalized as follows: the image size was set to 591×591 pixels, the resolution to 300 dpi, and noise was removed with the CNN. The ratio of positive to negative samples was 1:10, and the number of positive samples was 2000.
[0119] The experiments include: 1. A comparative analysis of the performance of the improved SNN model in some embodiments of this application. The models participating in the comparison are those of references [29], [30], and [31].
[0120] 2. An evaluation of the key control parameters learned when training the CNN-GRL model in some embodiments of this application. The experiment focuses on comparing and analyzing the performance of high-level semantic feature extraction, feature classification, and feature aggregation by the CNN-GRL model.
[0121] 3. A comparative analysis with the network models proposed in references [13], [16], [19], [22], [24], [26], and [28] to verify the advantages of the CNN-GRL model in some embodiments of this application.
[0122] Experiment 1: SNN models typically use Cross-Entropy Loss, Contrastive Loss, and Triplet Loss for network learning, training, and validation. The SNN model in some embodiments of this application improves upon Cross-Entropy Loss by redefining the loss function, and the labels for the prediction results have also been redefined.
[0123] The experiment compares the loss functions and predicted-label calculation methods used in references [29], [30], and [31] with those of this application, verifying the performance of the loss function and the accuracy of the predicted labels.
[0124] The experiment mainly verifies the model's image edge feature extraction performance and network convergence, with the number of training iterations set to [number missing]. The evaluation metrics were F1 Score (F1), Kappa Statistic (KS), Hausdorff Distance (HD), Dice Similarity Coefficient (DSC), Volume Overlap Error (VOE), Receiver Operating Characteristic (ROC), feature extraction accuracy (Accuracy), loss function value (Loss value), and Edge Feature Map (EFM). The experimental results are shown in Table 1 and Figures 6 to 14.
[0125] Table 1 shows the image edge feature detection results using four different networks.
[0126] The F1 score combines two metrics, Precision and Recall, and can verify the accuracy of extracted image edge features. The higher the F1 score, the higher the accuracy of image edge feature extraction.
[0127] The KS value can be used to detect the similarity of image edge feature vector matrices. The larger the detection value, the higher the similarity of the image edge feature matrices.
[0128] HD is used to determine the vector distance between image edge features; the smaller the distance, the more similar the image edge features.
[0129] The DSC value can evaluate the degree of overlap between the image feature assessment prediction results and the true label. The larger the DSC value, the higher the accuracy of the extracted image edge features.
[0130] The VOE value can be used to evaluate the classification accuracy of image edge features; the smaller the detection value, the higher the classification accuracy of image edge features.
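The edge-detection metrics of [0126] to [0130] can be computed on binary masks as sketched below. For binary masks the F1 score and the DSC are algebraically identical (2TP / (2TP + FP + FN)); VOE is taken as 1 - |A ∩ B| / |A ∪ B|, Kappa as Cohen's kappa over pixels, and HD as the symmetric Hausdorff distance between foreground pixel coordinates. These formulations and the function names are assumptions about how the metrics were computed, and `hausdorff` assumes both masks are non-empty.

```python
import numpy as np


def _counts(pred: np.ndarray, truth: np.ndarray):
    """Pixel-level confusion counts for two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn


def f1_score(pred, truth) -> float:
    tp, fp, fn, _ = _counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


dice = f1_score  # for binary masks DSC = 2|A∩B| / (|A| + |B|), the same value as F1


def voe(pred, truth) -> float:
    tp, fp, fn, _ = _counts(pred, truth)
    union = tp + fp + fn
    return 1.0 - tp / union if union else 0.0


def kappa(pred, truth) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    tp, fp, fn, tn = _counts(pred, truth)
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (p_obs - p_exp) / (1.0 - p_exp) if p_exp < 1.0 else 1.0


def hausdorff(pred, truth) -> float:
    """Symmetric Hausdorff distance between foreground pixel coordinates (both masks non-empty)."""
    a = np.argwhere(pred)
    b = np.argwhere(truth)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

For a 2×2 ground-truth square and a prediction with one extra adjacent pixel, F1 = DSC = 8/9, VOE = 0.2, Kappa = 11/13 and HD = 1.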
[0131] The results in Table 1 show that the four network models involved in the experiment all performed well in terms of the accuracy of extracted image edge features, the similarity of edge feature vector matrices, the distance error of edge feature vectors, the overlap between edge feature prediction results and real feature labels, and the accuracy of edge feature classification.
[0132] The loss function and the learning and training strategy for the predicted labels used in some embodiments of this application outperform the methods proposed in the other three references.
[0133] The ROC (Receiver Operating Characteristic) metric is used to determine the authenticity of the extracted image edge features: the higher the ROC value, the closer the extracted features are to the true image edge features. The test results in Figure 6 show that the ROC curves of all four network models lie above the reference line, indicating that the extracted image edge features are highly realistic. The results indicate that the SNN network model in some embodiments of this application has certain advantages.
[0134] The method for evaluating network convergence involves continuously increasing the number of training iterations of the network model and detecting the accuracy and loss value of the extracted image edge features. Network convergence verification is also an important part of the computer image vision processing ablation experiments. The results in Figure 7 show that, in some embodiments of this application, the SNN network model, after 100 iterations of training, achieves an accuracy value of approximately 90% and a loss value of approximately 0.1. Furthermore, the accuracy and loss values detected in subsequent iterations are very stable, indicating that the SNN network model has fast convergence speed and stable convergence control, and the two experimental evaluation metrics are also quite ideal.
[0135] The test results in Figures 8 to 14 show that the image edge feature information extracted by the four network models contains little low-frequency information and noise. The edge features extracted by the three reference models show varying degrees of information loss; ranked from most to least loss of edge detail, the models are: reference [29], reference [31], reference [30], and the SNN model in some embodiments of this application. The experiment further verifies the advantages of the SNN model in some embodiments of this application in image edge feature extraction.
[0136] The above results verify the validity of the loss function and the predicted feature labels defined in the SNN network model of some embodiments of this application, as well as the accuracy of Formula 1. The loss functions and predicted-label solving methods of the three network models in references [29], [30], and [31] are given in Formula (7). Reference [29] uses the standard cross-entropy function to construct the loss function of its network model. It does not take into account the error between the input image feature labels and the predicted labels during training, and its predicted-label calculation does not take into account the error between the two different feature categories.
[0137] The network model in reference [30] designs its loss function by directly comparing the two inputs. Although this controls the distance between features of different categories, the loss error is somewhat less accurate. Its predicted labels take into account the error between different feature categories but lack comparison operations between feature categories.
[0138] The loss function constructed by the network model in reference [31] considers the error between features of different categories, but the error control method it uses is not well suited to SNN network models. Its predicted-label solution gives too little weight to the error between the two different feature categories, and its calculation rules are too simple.
[0139] Experiment 2: In some embodiments of this application, the CNN-GRL network model includes an SNN network model, and the same applies below.
[0140] Key control parameters include: the dynamic convolution filter (which determines the adjacency matrix), the mapping matrix, the loss function, and the network weights.
[0141] The experiment focuses on the performance of these control parameters, specifically the accuracy of high-level semantic feature extraction and classification, and verifies the update capability of the network weights.
[0142] The evaluation metrics set for the experiment include: the mean squared error (MSE) between the extracted and the expected high-level semantic features during learning and training, peak signal-to-noise ratio (PSNR), peak mean squared error (PMSE), structural similarity index measure (SSIM), the coefficient of determination (R²) as explained variance, learned perceptual image patch similarity (LPIPS), Fréchet Inception Distance (FID) between high-level semantic feature distributions, Spearman rank-order correlation coefficient (SROCC), Pearson linear correlation coefficient (PLCC), and the classification accuracy, precision, and recall of high-level semantic features, as well as the evaluation parameters in Table 1.
[0143] Image edge features and feature histograms were also detected. The experimental data are shown in Tables 2 and 3 and Figures 15 to 21.
[0144] Table 2 Results of high-level semantic feature detection in CNN-GRL network
[0145] The mean squared error (MSE) metric measures the error between the high-level semantic features extracted by the CNN-GRL network and the expected high-level semantic features during training. It reflects both the variance and the bias of the two types of features, which makes it a reliable evaluation metric. The experiment sets the MSE evaluation threshold to 10: an MSE ≤ 10 indicates that the extracted high-level semantic features are highly accurate, and an MSE > 10 indicates low accuracy. The smaller the MSE, the more accurate the extracted image features.
[0146] The PSNR evaluation metric mainly detects the noise control performance during the image feature extraction process. The experiment sets the evaluation threshold for PSNR to 40. When the detected PSNR is ≥40, it indicates that the noise control performance of the high-level semantic features extracted by the CNN-GRL network is good. When the detected PSNR is <40, it indicates that the noise control ability is not strong. The larger the PSNR, the stronger the noise control ability of the CNN-GRL network.
[0147] The PMSE evaluation metric can simultaneously detect the error between image features and the noise control effect. The experiment is set to 0≤PMSE≤1. The smaller the detection value, the higher the accuracy of the high-level semantic features extracted by the CNN-GRL network and the better the noise control effect.
[0148] The SSIM metric compares the extracted high-level semantic feature map with the expected one in three respects: luminance, contrast, and structure (the spatial distribution of pixels). Its range in the experiment is -1 ≤ SSIM ≤ 1. SSIM = 1 means that the high-level semantic features extracted by the CNN-GRL network are identical to the expected high-level semantic features. Values near 0 indicate little structural similarity, and negative values indicate inverted structure.
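The three SSIM components in [0148] (luminance, contrast, and structure) combine into a single expression. The sketch below evaluates that expression over the whole map as one window. This is a simplification: standard SSIM averages the same expression over local, typically Gaussian-weighted, windows. The constants K1 = 0.01 and K2 = 0.03 are the usual defaults and are assumed here.

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """Single-window SSIM over the whole map (no sliding window)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


feature_map = np.arange(16, dtype=float).reshape(4, 4) * 10  # hypothetical map
print(global_ssim(feature_map, feature_map))          # identical maps give 1
print(global_ssim(feature_map, 255.0 - feature_map))  # inverted structure is negative
```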
[0149] The R² metric mainly involves three evaluation parameters between the high-level semantic features of the image and the expected high-level semantic features during learning and training: variance, standard deviation, and interquartile range. It can be used to assess how well the loss function constructed by the CNN-GRL network fits the data. Its range in the experiment is 0 ≤ R² ≤ 1. The larger the detected value, the better the loss function constructed by the CNN-GRL network performs and the stronger the explanatory power of the network.
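As a reference point for [0149], the standard coefficient of determination is R² = 1 - SS_res / SS_tot. The sketch below computes it for flattened feature values. Under this definition R² is at most 1, and it falls below 0 when the predictions are worse than simply predicting the mean. The 0 to 1 range adopted in the experiment is therefore a reporting convention, not a property of the formula. The function name is hypothetical.

```python
import numpy as np


def r2_score(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot; undefined when y_true has zero variance."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


expected_features = [1.0, 2.0, 3.0, 4.0]   # hypothetical expected values
predicted_features = [1.1, 1.9, 3.2, 3.8]  # hypothetical extracted values
print(r2_score(expected_features, predicted_features))
```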
[0150] LPIPS itself extracts deep image features and compares images on those features, which makes it well suited to analyzing high-level semantic features. Its range in the experiment is 0 ≤ LPIPS ≤ 1. The smaller the detected value, the more accurate the high-level semantic features extracted by the CNN-GRL network.
[0151] FID computes the distance between image features from their distributions in feature space. Its range in the experiment is 0 ≤ FID ≤ 1. The smaller the detected value, the more accurate the high-level semantic features extracted by the CNN-GRL network.
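FID, as commonly defined, fits a Gaussian to each set of feature vectors and measures the Fréchet distance between the two: ||mu_1 - mu_2||^2 + Tr(S_1 + S_2 - 2 (S_1 S_2)^(1/2)). The sketch below implements that formula in NumPy on arbitrary feature matrices, with one sample per row. Standard FID uses Inception-v3 features and has no upper bound, so the 0 to 1 range in [0151] presumably relies on a normalization that is not described here. The function name and the example data are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Rows are samples and columns are feature dimensions (at least 2 of each).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))  # hypothetical expected feature vectors
shifted = reference + 0.5              # same covariance, shifted mean
print(frechet_distance(reference, shifted))  # approximately 4 * 0.5**2 = 1.0
```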
[0152] SROCC assesses the rank (monotonic) correlation of image features, based on the ranks of the feature vector elements. Its range in the experiment is 0 ≤ SROCC ≤ 1. The larger the detected value, the more similar the high-level semantic features extracted by the CNN-GRL network are to the expected high-level semantic features.
[0153] PLCC not only measures the linear correlation of image features directly but also directly evaluates the prediction results of the CNN-GRL network. Its range in the experiment is -1 ≤ PLCC ≤ 1. The closer |PLCC| is to 1, the more strongly the high-level semantic features extracted by the CNN-GRL network correlate with the expected high-level semantic features. PLCC = 0 indicates no linear correlation.
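The rank-based and linear correlations in [0152] and [0153] differ only in one respect: PLCC correlates the raw values, while SROCC correlates their ranks. A minimal NumPy sketch follows, with ties given their average rank. The function names are hypothetical.

```python
import numpy as np


def plcc(x, y) -> float:
    """Pearson linear correlation coefficient."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


def average_ranks(v) -> np.ndarray:
    """1-based ranks; tied values share the mean of their positions."""
    v = np.asarray(v, dtype=np.float64)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def srocc(x, y) -> float:
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


predicted = [1.0, 2.0, 3.0, 4.0]      # hypothetical feature scores
expected = [10.0, 20.0, 30.0, 1000.0]
print(srocc(predicted, expected), plcc(predicted, expected))  # monotonic but not linear
```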
[0154] Accuracy, precision, and recall directly assess how accurately image features are extracted. The higher their values, the more accurate the high-level semantic features extracted by the CNN-GRL network.
[0155] Image edge feature maps are essentially the edge maps obtained after image region segmentation, so they allow the segmentation results to be evaluated directly. Image feature histograms, in turn, show whether the information in an edge feature map is genuine feature information.
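The three metrics in [0154], and the F1 score mentioned in [0156], all follow from the counts of true positives, false positives, and false negatives. A plain-Python sketch for binary labels follows. The label encoding (1 marks the target class) and the function name are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 means the feature belongs to the target class.
print(classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```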
[0156] The data in Table 2 show that the four key control parameters defined for the CNN-GRL network in some embodiments of this application perform well and give the network strong dynamic update capability. These parameters are the dynamic convolutional filter, the adjacency matrix, the mapping matrix, and the network weights. The loss function designed for the SNN network and the loss function and network weights designed for the CNN-GRL network likewise perform well. Comparing the evaluation parameters common to Tables 1 and 2 shows that the detected F1, KS, HD, DSC, and VOE values all improve after the CNN-GRL network is introduced. This verifies the necessity of introducing the CNN-GRL network in some embodiments of this application.
[0157] In the image edge feature map detected in Figure 15, the edges are clear and the noise is low. The detected feature histogram contains no grayscale variation information; all of its information belongs to image edges. Comparative analysis of the image edge features detected in Figures 15 to 21 showed that the CNN-GRL network extracted richer edge feature information. This further verifies the necessity of introducing the CNN-GRL network in some embodiments of this application.
[0158] Based on the definitions in some embodiments of this application of the image Laplacian feature matrix, the dynamic convolution filter, the mapping matrix, the loss function, the network weights, and the update weights, the image features extracted by the CNN-GRL network after learning and training are as follows: , , , and .
[0159] The experiment verified whether the above features belong to high-level semantic features, and the experimental detection results are shown in Table 3.
[0160] Table 3. Validation results of high-level semantic features of CNN-GRL network
[0161] Among the verification indicators, three conditions are checked for the extracted features. Orthogonality of the feature vectors is judged by Formula 8, symmetry by Formula 9, and positive semidefiniteness by Formula 10. Whenever the corresponding condition is met, the detection result is marked √.
[0162] The validation metric Information Value (IV) is used to assess the extracted image feature label values. The IV value must lie between 0.00 and 2.00 and is determined according to Formula 11.
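Formulas 8 to 10 are not reproduced in this text. The √ checks can be sketched as follows, assuming the formulas express the standard conditions:
- a matrix is symmetric when it equals its transpose;
- it is positive semidefinite when it is symmetric and all its eigenvalues are non-negative;
- its eigenvectors are orthogonal when their Gram matrix is the identity.

The function names, tolerances, and example Laplacian are illustrative assumptions. For repeated eigenvalues, a general eigensolver may return non-orthogonal vectors within an eigenspace, so the orthogonality flag is only reliable when the eigenvalues are distinct.

```python
import numpy as np


def check_symmetric_psd_orthogonal(m: np.ndarray, tol: float = 1e-9) -> dict:
    """Return the three pass/fail flags used in Table 3 for a square matrix."""
    m = np.asarray(m, dtype=np.float64)
    symmetric = bool(np.allclose(m, m.T, atol=tol))
    # Positive semidefinite: symmetric with all eigenvalues >= 0 (up to tol).
    psd = symmetric and bool(np.all(np.linalg.eigvalsh(m) >= -tol))
    # Orthogonality: the unit-norm eigenvectors returned by eig are mutually orthogonal.
    _, vecs = np.linalg.eig(m)
    gram = vecs.conj().T @ vecs
    orthogonal = bool(np.allclose(gram, np.eye(m.shape[0]), atol=1e-8))
    return {"orthogonal": orthogonal, "symmetric": symmetric, "psd": psd}


def mark(flag: bool) -> str:
    """Render a flag the way Table 3 does."""
    return "√" if flag else "×"


# Graph Laplacian L = D - A of a 3-node path graph (a hypothetical example).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
lap = np.diag(adj.sum(axis=1)) - adj
print({k: mark(v) for k, v in check_symmetric_psd_orthogonal(lap).items()})
```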
[0163] In theory, the valid range of IV is [0, +∞), and the larger the value, the stronger the predictive ability. In practice, however, when IV > 1 its accuracy is relatively weak [32].
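Formula 11 is not reproduced here. As an illustration only, the sketch below uses the standard weight-of-evidence definition of IV over binned positive and negative sample counts. This is an assumption about what Formula 11 computes, and the binning and the smoothing constant eps are hypothetical.

```python
import math


def information_value(pos_counts, neg_counts, eps=1e-6):
    """IV = sum_i (p_i - n_i) * ln(p_i / n_i) over the bins of a feature label.

    p_i and n_i are the shares of positive and negative samples falling in
    bin i; eps guards against empty bins (an assumed smoothing choice).
    """
    total_pos, total_neg = sum(pos_counts), sum(neg_counts)
    iv = 0.0
    for pos, neg in zip(pos_counts, neg_counts):
        p = max(pos / total_pos, eps)
        n = max(neg / total_neg, eps)
        iv += (p - n) * math.log(p / n)  # each term is >= 0
    return iv


def iv_in_range(iv, low=0.00, high=2.00):
    """Range check stated in [0162]."""
    return low <= iv <= high


# Hypothetical bin counts for one feature label.
iv = information_value([80, 20], [20, 80])
print(round(iv, 4), iv_in_range(iv))  # 1.6636 True
```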
[0164] The validation metric SHapley Additive exPlanations (SHAP) [33] is used to measure the impact of the predicted image feature labels on the network model. Its valid range is [-1, 1]. The larger the detected value, the more accurate the feature labels predicted by the network model.
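The SHAP values in [0164] are Shapley values of a model's prediction. For a handful of inputs, they can be computed exactly by enumerating feature subsets, as sketched below. Features outside a subset are replaced by baseline values, which is one common convention and is assumed here. Practical SHAP tooling approximates this sum. The toy model and the baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley value of each input of `model` at point `x`.

    Inputs outside a coalition are set to `baseline`; the cost grows as 2**n,
    so this is only practical for a few inputs.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy(z):
    """Hypothetical model: linear terms plus one interaction term."""
    return 2.0 * z[0] + 3.0 * z[1] + z[0] * z[2]


# Expected contributions: 2.5, 3.0, 0.5 (the interaction is split evenly).
print(shapley_values(toy, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```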
[0165] The results of the verification test in Table 3 show that the CNN-GRL network constructed in some embodiments of this application can accurately extract high-level semantic features of images.
[0166] (8) (9) (10) (11)
Similarly, the orthogonality, symmetry, and positive semidefiniteness of the eigenvectors of the remaining features, the adjacency matrix, and the mapping matrix can be determined by reference to the above formulas.
[0167] To verify the autonomous update capability of the CNN-GRL network weights and high-level semantic features, the number of CNN-GRL training iterations was set to [number missing]. The update values of the network weights and high-level semantic features, and their rates of change, were then tested.
[0168] The number of positive samples in the CNN-GRL input images was set to 0 to 2000, and the same update values and rates of change were tested.
[0169] The dynamic range of the two CNN-GRL control parameters was set to 0.0 to 1.0, and the update values and rates of change were tested. The test results are shown in Figure 22.
[0170] The data in Figure 22 show the trend as the number of training iterations and the number of input image samples increase. The tested network weight update values, the high-level semantic feature update values, and their rates of change all show a non-linear upward trend.
[0171] The update value varies continuously between 0.00 and 1.00 and generally increases. The eigenvalues change constantly, and the feature maps consist entirely of high-frequency edge information, with no low-frequency information found.
[0172] As the two network weight parameters increase from 0.0 to 0.6, the two rates of change rise rapidly and non-linearly, reaching their maximum at 0.6. As the parameters increase from 0.6 to 1.0, the rates of change begin to decline, but only gradually. At 1.0 they not only fail to drop to or near 0% but remain relatively high, indicating that the two model control parameters are highly effective.
[0173] The two parameters differ only slightly in their impact on the update, which shows that the update strategy formulated in some embodiments of this application is essential. The detection data verify that the CNN-GRL network has excellent real-time update capability.
[0174] The above experimental results verify that the CNN-GRL network model in some embodiments of this application performs well in high-level semantic feature extraction and feature classification. Compared with the conventional GCN model, the dynamic convolution filter, mapping matrix, loss function, and network weights of the CNN-GRL model in some embodiments of this application play a crucial role. Using the parameter definitions above, the convolutional filter, mapping matrix, loss function, and network weights of the conventional GCN model are mainly expressed by Formula 12.
[0175] (12)
In Formula 12, one operation only normalizes the Laplacian matrix without actually introducing...; the convolution transformation is instead performed on Chebyshev polynomials. A second operation only applies the cross-entropy function to compute the error between the original image features and the high-level semantic features. It ignores the error between the original image and the high-level semantic feature labels, as well as the global feature factors of the image. A third operation directly compares the input and predicted image labels using the cross-entropy function. More importantly, Formula 12 incorporates no update strategy for these parameters. Its network weight calculation is based only on the error of the high-level semantic features of the image and omits the crucial factor of high-level semantic feature label error.
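To make the contrast with the conventional GCN concrete, the sketch below implements the standard Chebyshev-polynomial graph convolution that conventional GCNs use. It builds the symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2), rescales it to L~ = 2L / lambda_max - I, and outputs sum_k T_k(L~) X Theta_k. This illustrates that textbook filter under the common approximation lambda_max = 2. It is not a reconstruction of Formula 12, whose symbols were not recoverable, and the function names are hypothetical.

```python
import numpy as np


def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian I - D^-1/2 A D^-1/2 (isolated nodes unscaled)."""
    adj = np.asarray(adj, dtype=np.float64)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]


def cheb_conv(x, adj, thetas, lambda_max=2.0):
    """Chebyshev graph convolution: sum_k T_k(L_scaled) @ x @ thetas[k]."""
    lap = normalized_laplacian(adj)
    l_scaled = 2.0 * lap / lambda_max - np.eye(lap.shape[0])
    t_prev = x                        # T_0(L) x = x
    out = t_prev @ thetas[0]
    if len(thetas) == 1:
        return out
    t_curr = l_scaled @ x             # T_1(L) x = L x
    out = out + t_curr @ thetas[1]
    for k in range(2, len(thetas)):
        # Recurrence T_k = 2 L T_{k-1} - T_{k-2}
        t_prev, t_curr = t_curr, 2.0 * l_scaled @ t_curr - t_prev
        out = out + t_curr @ thetas[k]
    return out


rng = np.random.default_rng(0)
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3 region nodes
node_feats = rng.normal(size=(3, 2))                  # hypothetical node features
weights = [rng.normal(size=(2, 4)) for _ in range(3)]  # K = 3 filter taps
print(cheb_conv(node_feats, adjacency, weights).shape)  # (3, 4)
```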
[0176] Experiment 3: To verify the comprehensive performance of the CNN-GRL network model in some embodiments of this application in image region segmentation tasks, the experiment compared it with the network models proposed in references [13], [16], [19], [22], [24], [26], and [28]. Each of these seven network models used its own proposed image segmentation strategy to build an experimental platform. The CNN-GRL network in some embodiments of this application used the model structure and image region segmentation process shown in Figure 5 to build its experimental platform.
[0177] The network model used in the experiment was trained using the six datasets listed above to perform semantic feature aggregation and region segmentation tasks for six image categories: animals, flowers, buildings, rooms, hyperspectral images, and medical images. The experiment mainly evaluated and analyzed the network model's image feature clustering performance, image region segmentation performance, network model convergence, and network model runtime complexity.
[0178] The evaluation metrics for image feature clustering performance include:
- the best-mapping accuracy between cluster assignment labels and true labels [34];
- the normalized mutual information (NMI) between cluster assignment labels and true labels [35];
- visualization graphs of the image semantic feature distribution.
[0179] The effective value of accuracy is 0% to 100%. The higher the detection value, the more accurate the feature label mapping.
[0180] The effective value of NMI is [0,1]. The higher the detection value, the more information the image feature clustering results share with the real labels, and the better the clustering effect.
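The best-mapping accuracy [34] and NMI [35] described in [0178] to [0180] can be sketched in plain Python as below. The best mapping is found by brute force over one-to-one assignments, which is only practical for a few clusters; at scale, the Hungarian algorithm (for example scipy.optimize.linear_sum_assignment) is the usual choice. NMI is normalized here by the arithmetic mean of the two entropies, which is one of several conventions in use. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cluster_accuracy(true_labels, cluster_labels):
    """Accuracy under the best one-to-one mapping of clusters to labels."""
    clusters = sorted(set(cluster_labels))
    targets = sorted(set(true_labels) | set(cluster_labels))
    best = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t)
        best = max(best, hits)
    return best / len(true_labels)


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, cluster_labels):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    count_t, count_c = Counter(true_labels), Counter(cluster_labels)
    mi = sum(c / n * math.log(c * n / (count_t[t] * count_c[k]))
             for (t, k), c in joint.items())
    denom = (entropy(true_labels) + entropy(cluster_labels)) / 2
    return mi / denom if denom > 0 else 1.0


truth = [0, 0, 0, 1, 1, 2]     # hypothetical true feature categories
clusters = [1, 1, 0, 0, 0, 2]  # hypothetical cluster assignments
print(cluster_accuracy(truth, clusters), nmi(truth, clusters))
```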
[0181] In the visualization of the detected semantic feature distribution, all image semantic features are defined and classified, and the features of each category are grouped and aggregated. The fewer features from other categories that appear within an aggregated category, the better the semantic feature aggregation and classification. Good aggregation provides strong support for image segmentation tasks.
[0182] The evaluation metrics for image region segmentation performance include:
- the Figure of Merit (FOM);
- the average surface distance (ASD) between the predicted edge and the true edge;
- the Dice Similarity Coefficient (DSC);
- the relative volume difference (RVD) between the predicted segmentation region and the true target region;
- visualization of the image segmentation results.
[0183] The effective value of FOM is [0,1]. The higher the detection value, the higher the accuracy of the image semantic feature segmentation result and the lower the false negative rate and false positive rate.
[0184] The valid range of ASD is [0, +∞). The smaller the detected value, the more accurate the segmentation of image feature edges and the greater the overlap between the predicted edge and the true edge.
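A sketch of [0184] follows. It assumes that ASD is the symmetric average of nearest-neighbour distances between predicted and true edge pixels, with edges given as non-empty boolean masks. Distances are in pixel units; any normalization used in the experiments is not specified. Brute-force pairwise distances keep the code short but use memory proportional to the product of the two edge-pixel counts. The function name is hypothetical.

```python
import numpy as np


def average_surface_distance(pred_edges: np.ndarray, true_edges: np.ndarray) -> float:
    """Symmetric mean nearest-neighbour distance between two edge-pixel sets."""
    p = np.argwhere(pred_edges).astype(np.float64)  # (P, 2) pixel coordinates
    t = np.argwhere(true_edges).astype(np.float64)  # (T, 2) pixel coordinates
    dists = np.sqrt(((p[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1))  # (P, T)
    pred_to_true = dists.min(axis=1).sum()  # each predicted pixel to nearest true pixel
    true_to_pred = dists.min(axis=0).sum()  # each true pixel to nearest predicted pixel
    return float((pred_to_true + true_to_pred) / (len(p) + len(t)))


true_mask = np.zeros((4, 5), dtype=bool)
true_mask[:, 3] = True                      # hypothetical true vertical edge
pred_mask = np.zeros((4, 5), dtype=bool)
pred_mask[:, 2] = True                      # prediction offset by one pixel
print(average_surface_distance(pred_mask, true_mask))  # 1.0
```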
[0185] The effective value of DSC is [0,1]. The larger the detection value, the higher the degree of overlap between the image prediction and the actual segmentation result, and the better the image segmentation effect.
[0186] The effective value of RVD is [0,1]. The smaller the detection value, the smaller the volume error between the predicted segmentation region and the actual segmentation region, and the better the image segmentation result.
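DSC and RVD from [0185] and [0186] can be computed on binary masks as in the minimal NumPy sketch below. RVD is taken here as the absolute relative volume difference |V_pred - V_true| / V_true. This is an assumption chosen to match the non-negative range in the text. Note that this quantity can exceed 1 when the predicted region is more than twice the true volume. The function names are hypothetical.

```python
import numpy as np


def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient 2|P & T| / (|P| + |T|); 1.0 if both are empty."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, truth).sum() / total)


def rvd(pred: np.ndarray, truth: np.ndarray) -> float:
    """Absolute relative volume difference |V_pred - V_true| / V_true."""
    v_pred = int(pred.astype(bool).sum())
    v_true = int(truth.astype(bool).sum())
    return abs(v_pred - v_true) / v_true


truth_region = np.zeros((4, 4), dtype=bool)
truth_region[0:2, 0:2] = True   # hypothetical true region, 4 pixels
pred_region = np.zeros((4, 4), dtype=bool)
pred_region[0:2, 0:3] = True    # prediction, 6 pixels, 4 overlapping
print(dice(pred_region, truth_region), rvd(pred_region, truth_region))  # 0.8 0.5
```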
[0187] In the image segmentation result visualization, the more accurate and detailed the division of different regions, the better the image segmentation effect.
[0188] The main evaluation metrics for verifying the convergence of a network model are the accuracy (the best-mapping accuracy between cluster labels and true labels) and the loss value. The faster the accuracy approaches its maximum and the loss approaches its minimum, the stronger the convergence of the network model, especially when this happens with fewer iterations and fewer input samples.
[0189] The main evaluation metrics for verifying the runtime complexity of a network model include the running time T and CPU utilization V for completing two tasks: image feature semantic segmentation and image region segmentation visualization detection. The smaller the values of T and V, the lower the runtime complexity of the network model.
[0190] The reported values of the above evaluation parameters are averages over the six datasets. For the visualizations of the semantic feature distribution and the segmentation results, only one image from each of the six datasets is shown. The experiment trained the models for a set number of iterations, with 2000 positive samples and a positive-to-negative sample ratio of 1:10. The experimental results are shown in Figures 11 to 21.
[0191] The results of the tests in Figures 23 and 24 show that as the number of input samples and the number of iterations increase, the Accuracy and NMI values of the eight network models tested in the experiment show a non-linear and steady upward trend. When the number of input samples and the number of iterations reach their maximum values, the Accuracy and NMI values are both above 85%, indicating that the eight network models have good performance in image feature extraction, feature classification and aggregation.
[0192] In some embodiments of this application, the CNN-GRL network model achieves accuracy and NMI values of 97% and 94% respectively when the number of input samples reaches the maximum value, and 94% and 98% respectively when the number of iterations reaches the maximum value, indicating that the model has significant advantages in image feature extraction, classification and aggregation.
[0193] These results also verify that the mathematical models and control parameters designed in some embodiments of this application are highly reliable and stable.
[0194] The detection results in Figure 25 cover the image selected from each of the six datasets. All eight network models in the experiment accurately classified every feature category of each image. After feature aggregation, the boundaries between categories were clear, indicating that the feature label classification rules of the models are effective and that their feature extraction and classification performance is good. However, after aggregation, every model's category clusters contained some features belonging to other categories. The CNN-GRL network model in some embodiments of this application showed only a small number of such features in each category. This indicates good feature mapping and aggregation performance, which provides a solid foundation for subsequent accurate image segmentation. It also further verifies two points: the mathematical models and control parameters designed for the network model in some embodiments of this application are highly reliable and stable, and the joint learning and training strategy of the SNN, CNN-GRL, and HSIC modules is necessary.
[0195] The detection results in Figure 26 show that, as the number of input samples and the number of iterations increase, the FOM values of the eight network models show a non-linear, stable upward trend. When both reach their maximum values, the FOM values all exceed 80%, indicating that all eight models segment image semantic features well; the model proposed in reference [19] is slightly less stable. The FOM values of the CNN-GRL network in some embodiments of this application reach 96% and 94% (at the maximum number of input samples and the maximum number of iterations, respectively). This indicates a clear advantage in image semantic feature segmentation. It also verifies that three components designed in some embodiments of this application are highly reliable and stable: the SNN module, the IoU strategy for judging the similarity of positive and negative sample features, and the convolutional filter.
[0196] The detection results in Figure 27 show that, as the number of input samples and the number of iterations increase, the ASD values of the eight network models exhibit a non-linear, steady decreasing trend. When both reach their maximum values, the ASD values are all below 0.25, indicating that all eight models segment image semantic features well. The ASD values of the CNN-GRL network in some embodiments of this application fall below 0.05 and 0.08, respectively, further illustrating the model's advantage in image semantic feature segmentation. This further verifies that the following components of some embodiments of this application are highly reliable and stable: the constructed SNN module, the IoU strategy for determining the similarity of positive and negative sample features, the convolutional filter module, and the loss function.
[0197] The detection results in Figure 28 show that the DSC values of the eight network models exhibit a non-linear, stable upward trend as the number of input samples and iterations increases. When both reach their maximum values, the DSC values are all above 0.80, indicating that the eight network models have good image region segmentation performance. The DSC values of the CNN-GRL network in some embodiments of this application rise above 0.92 and 0.95, respectively, further demonstrating the model's advantage in image region segmentation. This verifies that the adjacency matrix, mapping matrix, loss function, and network weights defined in some embodiments of this application are highly reliable and stable. The same holds for the CNN-GRL dynamic graph construction, high-level semantic feature association mechanism, and dynamic graph self-update strategy.
[0198] The detection results in Figure 29 show that the RVD values of the eight network models exhibit a non-linear, stable decreasing trend as the number of input samples and iterations increases. When both reach their maximum values, the RVD values are all below 0.20, indicating that the image region segmentation performance of the eight network models is good. With respect to the number of input samples, the RVD values of the models in references [16] and [24] are slightly less stable. The RVD values of the CNN-GRL network model in some embodiments of this application fall below 0.05 and 0.06, respectively, further demonstrating that the model has certain advantages in image region segmentation. This further verifies that the adjacency matrix, mapping matrix, loss function, and network weights defined in some embodiments of this application are highly reliable and stable. The same holds for the CNN-GRL dynamic graph construction, high-level semantic feature association mechanism, and dynamic graph self-updating strategy.
[0199] The results of the detection in Figure 30 show that in the detection of one selected image from the six image datasets, all eight network models involved in the experiment were able to accurately segment the most obvious different regions in each image. The main regions in all six images were accurately segmented, indicating that the network models all have basic image segmentation functions and the image region segmentation performance is at a good level.
[0200] In cases where image regions are complex, the models in the literature often exhibit unclear segmentation of a small number of regions. The CNN-GRL network model in some embodiments of this application segments image regions accurately and meticulously. Compared to the seven network models proposed in the literature, its image region segmentation performance is slightly superior, particularly in accurately identifying green plants within desert areas of hyperspectral images and the Great Wall in distant architectural images. This further verifies that the construction of SNN and CNN-GRL modules, the design and introduction of the CNN-GRL high-level semantic feature association mechanism, the designed CNN-GRL dynamic graph self-update strategy, and the formulated joint learning and training strategy of SNN, CNN-GRL, and HSIC modules in some embodiments of this application all play a decisive role in image region segmentation tasks.
[0201] The results in Figure 31 show that, among the eight network models in the experiment, even the worst case reached its maximum accuracy of 86% and its minimum loss value of 0.12 after 170 training iterations. This indicates that the convergence of all eight models is at a good level.
[0202] After 100 iterations of training, the accuracy of the CNN-GRL network model rapidly increases to a maximum of 95%, and subsequent iterations remain relatively stable at this maximum. Similarly, after 100 iterations of training, the loss value rapidly decreases to a minimum of 0.02, and subsequent iterations remain relatively stable at this minimum.
[0203] The results demonstrate that the CNN-GRL model has excellent network convergence and, compared with some current advanced image region segmentation models, has certain advantages in image semantic feature aggregation and image region segmentation.
[0204] The runtime complexity of the network models is verified mainly on two tasks: image semantic feature segmentation and image region segmentation visualization detection. The evaluation metrics are the task completion time T and the CPU utilization V.
[0205] The detection data shown in Figure 32 indicate that the CNN-GRL model proposed in this embodiment of the invention is significantly superior to existing models in terms of the comprehensiveness of detected features.
[0206] The detection results shown in Figure 33 indicate that the operational complexity of the CNN-GRL model in some embodiments of this application is not high.
[0207] It can be seen that by fusing CNN, SNN, and GRL network models, with the main research objectives of high-level semantic feature extraction, judgment, classification, and aggregation of images, and utilizing the HSIC gradient change rule to jointly train the constructed network models, the CNN-GRL network models constructed in some embodiments of this application have achieved excellent image region segmentation results. The main research conclusions are as follows.
[0208] In CNN models, introducing IoU for initial image region segmentation is the foundation for constructing the GRL graph node structure, the criterion for positive and negative sample division in SNN models, and the guarantee for high-level semantic feature extraction of images.
[0209] Compared with currently used SNN models, some embodiments of this application improve high-level semantic feature extraction in two ways. First, image semantic feature labels are defined. Second, the conventional cross-entropy loss function, which computes only the image feature error, is extended to also calculate the error between the input semantic feature labels and the predicted semantic features. Furthermore, in view of the characteristics of the GRL model, the method for evaluating similarity between GRL graph node features is improved. The commonly used rule based on image feature distance is abandoned; instead, the model is trained with its loss function, and the similarity of high-level semantic features is evaluated directly through weight allocation.
[0210] In the CNN-GRL model, model weights are designed by combining the distance between high-level semantic features and the distance between feature labels, improving the computational rules that only use image feature distance while maintaining consistency with SNN models. In the dynamic convolutional filter design, a Laplacian normalization mechanism is introduced to ensure that the extracted image features are high-level semantic features, and a decision strategy for high-level semantic features is formulated. In dynamic graph encoding, the adjacency matrix is designed based on the predicted image feature labels, and the graph learning training mapping matrix design uses error calculations between the initial graph node semantic features and feature labels, and between the predicted image semantic features and feature labels. This effectively controls the limitations of high-level semantic features in the computation process, improves the accuracy of image semantic feature classification and aggregation, and reduces the complexity of semantic feature error calculations. The conventional mechanism of updating the loss function through backpropagation is improved. Based on the network model weights and the update status of high-level semantic features, an update strategy for the CNN-GRL model loss function is formulated. The design and update of the loss function are synchronized, reducing the computational complexity of the network model.
[0211] Related experiments have verified the correctness and reliability of the above conclusions. The CNN-GRL network model proposed in some embodiments of this application can be widely applied to computer vision processing fields such as image feature extraction, feature classification, feature aggregation, image recognition, object tracking, and image region segmentation.
[0212] The DOIs of the documents used in some embodiments of this application are as follows:
Reference [13] DOI: 10.1016/j.aei.2025.103222.
Reference [16] DOI: 10.1016/j.inffus.2025.103025.
Reference [19] DOI: 10.1007/s11432-023-4073-y.
Reference [22] DOI: 10.1016/j.asoc.2024.112657.
Reference [24] DOI: 10.1109/TCYB.2025.3531657.
Reference [26] DOI: 10.1016/j.eswa.2024.123678.
Reference [28] DOI: 10.1109/TGRS.2021.3123423.
Reference [29] DOI: 10.1109/TIP.2025.3546481.
Reference [30] DOI: 10.1109/TMM.2024.3521796.
Reference [31] DOI: 10.1109/TGRS.2024.3456678.
It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially as indicated, these steps are not necessarily executed in the indicated order. Unless explicitly stated in some embodiments of this application, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the steps or stages of other steps.
[0213] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0214] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. A method for image semantic classification and region segmentation based on CNN-SNN-GRL fusion, characterized by comprising:
acquiring an original image and preprocessing it to obtain a region-blocked image;
extracting features from the original image and the region-blocked image with a convolutional neural network to obtain a complete image feature set and a region feature set;
inputting the region feature set into a Siamese neural network, introducing the intersection-over-union (IoU) ratio to divide positive and negative samples, and calculating the feature similarity evaluation results of the positive and negative samples;
defining the region features as graph nodes and calculating dynamic graph weights from the positive and negative sample feature similarity evaluation results to construct a dynamic graph;
based on graph representation learning, obtaining high-level semantic features by combining the dynamic graph with dynamic convolution operations and Laplacian normalization;
combining upsampling and convolution operations, updating the network weights and features using a gradient loss based on the Hilbert-Schmidt independence criterion (HSIC), thereby updating the dynamic graph; and
outputting the image segmentation result based on the updated dynamic graph.
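Claim 1 updates the weights with a gradient loss based on the Hilbert-Schmidt independence criterion. The claim does not give the estimator, so the sketch below assumes the common biased empirical form HSIC(X, Y) = tr(K H L H) / (n - 1)^2. Here K and L are Gaussian kernel matrices and H = I - (1/n) 11^T is the centring matrix. The kernel widths are hypothetical. In training, the gradient of such a term would come from automatic differentiation; this sketch only evaluates its value.

```python
import numpy as np


def rbf_kernel(x: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian kernel matrix exp(-||x_i - x_j||^2 / (2 sigma^2)); rows are samples."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic(x: np.ndarray, y: np.ndarray, sigma_x: float = 1.0, sigma_y: float = 1.0) -> float:
    """Biased empirical HSIC; larger values indicate stronger dependence."""
    n = x.shape[0]
    k = rbf_kernel(x, sigma_x)
    l = rbf_kernel(y, sigma_y)
    h = np.eye(n) - np.ones((n, n)) / n  # centring matrix
    return float(np.trace(k @ h @ l @ h) / (n - 1) ** 2)


rng = np.random.default_rng(1)
features = rng.normal(size=(200, 1))        # hypothetical feature values
dependent = features ** 2                   # nonlinearly dependent target
independent = rng.normal(size=(200, 1))     # unrelated target
print(hsic(features, dependent), hsic(features, independent))
```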
2. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 1, characterized in that, The step of acquiring the original image and preprocessing the original image to obtain a region-blocked image further includes: Input the original image; Set the size of the region-divided image and the number of region blocks; The original image is divided into multiple regions of the same scale.
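Claim 2 sets the size of the region-divided image and the number of region blocks, then divides the original image into regions of the same scale. A minimal NumPy sketch follows. It assumes the image height and width are exact multiples of the block grid; a real pipeline might instead resize or pad. The function name is hypothetical.

```python
import numpy as np


def split_into_blocks(image: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Split an (H, W[, C]) image into rows * cols equal blocks, row-major order."""
    h, w = image.shape[:2]
    if h % rows or w % cols:
        raise ValueError("image size must be divisible by the block grid")
    bh, bw = h // rows, w // cols
    # (rows, bh, cols, bw, ...) -> (rows, cols, bh, bw, ...) -> (rows*cols, bh, bw, ...)
    blocks = image.reshape(rows, bh, cols, bw, *image.shape[2:]).swapaxes(1, 2)
    return blocks.reshape(rows * cols, bh, bw, *image.shape[2:])


image = np.arange(16).reshape(4, 4)  # hypothetical 4 x 4 grayscale image
blocks = split_into_blocks(image, 2, 2)
print(blocks.shape)  # (4, 2, 2)
print(blocks[1])     # top-right block
```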
3. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 1, characterized in that, The step of inputting the region feature set into the Siamese neural network, and simultaneously introducing the intersection-over-union ratio (IoU) to divide positive and negative samples, and calculating the feature similarity evaluation result of positive and negative samples, further includes: Positive and negative samples are defined according to the intersection-union ratio (IU). The positive and negative samples are respectively input into the two sub-networks of the Siamese neural network, and the feature vectors of the positive and negative sample image features are obtained. Based on the feature vectors of positive and negative sample images, the similarity between the features of positive and negative sample images is calculated. By using fully connected layers and Softmax layers, the similarity between the features of the positive and negative sample images is classified and similarity is learned, resulting in classification and similarity learning results. Based on the classification results and similarity learning results, a comprehensive evaluation is conducted to obtain the similarity evaluation result.
4. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 3, characterized in that defining positive and negative samples according to the IoU further includes: determining the anchor boxes for the positive and negative samples using overlapping windows; and determining the positive and negative samples by setting different step sizes between the candidate windows and the anchor box.
5. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 3, characterized in that the Siamese neural network further compensates for feature errors and label errors through an improved cross-entropy loss function (the formula itself is not reproduced in this text), in which the terms denote: the total number of image samples; the feature label of a given input image sample; the predicted label result; the two outputs of the Siamese neural network; a Lagrange multiplier; the matrix transpose operation; two region blocks; and the edge feature weights of the graph nodes.
6. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 1, characterized in that the step of defining the region features as graph nodes and calculating dynamic graph weights from the positive and negative sample feature similarity evaluation results to construct a dynamic graph further includes: defining each feature vector in the region feature set as a graph node, each graph node corresponding to one region block image; calculating the graph edge weights from the positive and negative sample feature similarity evaluation results, where an edge weight reflects the similarity or correlation between two image region blocks and a larger weight indicates that the two blocks are more similar or more closely related; and constructing the dynamic graph from the graph nodes and the graph edge weights.
7. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 1, characterized in that the step of obtaining the associated high-level semantic features by applying graph representation learning, dynamic convolution operations, and Laplacian normalization to the dynamic graph further includes: defining a dynamic convolution filter and performing dynamic convolution on the dynamic graph to obtain a dynamic graph encoding result; normalizing the graph node features with the Laplacian operator to obtain the Laplacian feature matrix of the image; and designing a mapping matrix and combining the dynamic graph encoding result, the mapping matrix, and the complete image features to associate the high-level semantic features of the image, thereby obtaining the associated high-level semantic features.
8. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 1, characterized in that the step of combining upsampling and convolution operations and updating the network weights and features through a gradient loss based on the Hilbert-Schmidt independence criterion, thereby updating the dynamic graph, further includes: upsampling the high-level semantic features with upsampling layers to restore the spatial resolution of the image; performing convolution on the upsampled features to further extract local features and semantic information of the image; introducing the Hilbert-Schmidt independence criterion to construct a gradient loss function; calculating the gradient of the gradient loss function with respect to the model parameters to obtain gradient change detection results; and combining the results of the upsampling and convolution operations with the gradient change detection results to achieve joint learning and training of CNN-SNN-GRL and thereby update the dynamic graph.
9. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 8, characterized in that the gradient loss function is expressed by a formula (not reproduced in this text) in which the terms denote: the number of high-level semantic feature categories of the image; the number of pixels in the image; the probability that a given image pixel is predicted as a given high-level semantic feature category; the weights used to determine the high-level semantic feature categories of the image; the true high-level semantic feature category; two terms that are, respectively, the update results of two other quantities; the weight of the graph edge between two graph nodes; the dynamic graph encoding; and the weights used for learning and training the convolutional neural network.
10. The image semantic classification and region segmentation method based on CNN-SNN-GRL fusion as described in claim 1, characterized in that the step of outputting the image segmentation result based on the updated dynamic graph further includes: extracting feature information from the updated dynamic graph; and generating the image segmentation result from the feature information.
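The preprocessing step of claim 2 (dividing the original image into region blocks of the same scale) can be illustrated with a minimal NumPy sketch. The function name `split_into_blocks` and the choice to crop any rows or columns that do not fill a whole block are assumptions for illustration; the claim itself only fixes the block size and the number of blocks.

```python
import numpy as np


def split_into_blocks(image: np.ndarray, block_h: int, block_w: int) -> np.ndarray:
    """Split an H x W (x C) image into equal-size, non-overlapping region blocks.

    Hypothetical helper: rows/columns that do not fill a whole block are cropped.
    Returns shape (num_blocks, block_h, block_w, ...) in row-major block order.
    """
    h, w = image.shape[:2]
    rows, cols = h // block_h, w // block_w
    cropped = image[: rows * block_h, : cols * block_w]
    rest = cropped.shape[2:]  # channel dimensions, if any
    # (rows, block_h, cols, block_w, ...) -> (rows, cols, block_h, block_w, ...)
    blocks = cropped.reshape(rows, block_h, cols, block_w, *rest).swapaxes(1, 2)
    return blocks.reshape(rows * cols, block_h, block_w, *rest)
```

With a 4 x 4 image and 2 x 2 blocks this yields four blocks in row-major order, the first being the top-left quadrant.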
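Claim 3 describes a Siamese network whose two sub-networks produce feature vectors, a similarity computation, a fully connected plus Softmax stage, and a final comprehensive evaluation. The sketch below only shows that data flow. The layer sizes, the random (untrained) weights, cosine similarity as the similarity measure, and the weighted average used as the comprehensive evaluation are all assumptions, since the claim does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained weights. Both branches use the SAME encoder weights,
# which is what makes the two sub-networks "Siamese".
W_enc = rng.standard_normal((16, 8))  # region feature (16-d) -> embedding (8-d)
W_fc = rng.standard_normal((8, 2))    # FC layer: |f1 - f2| -> logits (similar, dissimilar)


def encode(x: np.ndarray) -> np.ndarray:
    """Shared sub-network: one linear layer followed by ReLU."""
    return np.maximum(x @ W_enc, 0.0)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def similarity_evaluation(x1: np.ndarray, x2: np.ndarray, alpha: float = 0.5) -> float:
    f1, f2 = encode(x1), encode(x2)
    # Similarity of the two feature vectors (cosine, clipped against rounding error).
    cos = float(np.clip(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12), -1.0, 1.0))
    # Classification branch: FC + Softmax, probability of the "similar" class.
    p_similar = float(softmax(np.abs(f1 - f2) @ W_fc)[0])
    # Assumed comprehensive evaluation: weighted average of both results, in [0, 1].
    return alpha * (cos + 1.0) / 2.0 + (1.0 - alpha) * p_similar
```

Because both branches share `W_enc` and the classifier sees only `|f1 - f2|`, swapping the two inputs gives the same score, which is the defining property of a Siamese arrangement.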
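Claim 4 defines positive and negative samples by IoU against an anchor box, with candidate windows offset from the anchor by different step sizes. A minimal sketch follows. The box format `(x1, y1, x2, y2)`, horizontal-only shifts, and the 0.5 / 0.3 thresholds are assumptions borrowed from common detection practice, not values given in the claim.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_by_stride(anchor, strides, pos_thr=0.5, neg_thr=0.3):
    """Shift an anchor-sized window by each stride and label it by IoU.

    Hypothetical thresholds: IoU >= pos_thr -> 1 (positive), IoU < neg_thr -> 0
    (negative), otherwise None (ignored).
    """
    labels = {}
    for s in strides:
        window = (anchor[0] + s, anchor[1], anchor[2] + s, anchor[3])
        v = iou(anchor, window)
        labels[s] = 1 if v >= pos_thr else 0 if v < neg_thr else None
    return labels
```

For a 32 x 32 anchor, a shift of 8 pixels gives IoU 0.6 (positive), 16 gives 1/3 (ignored), and 24 gives about 0.14 (negative).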
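Claim 6 turns each region feature vector into a graph node and derives edge weights from the similarity evaluation, with larger weights for more similar blocks. The sketch below is one way to realize this; it substitutes plain cosine similarity for the Siamese evaluation of claim 3 and keeps only each node's `k` most similar neighbours, both of which are assumptions. Rebuilding the matrix whenever the features are updated is what would make the graph dynamic.

```python
import numpy as np


def build_dynamic_graph(region_feats: np.ndarray, k: int = 2) -> np.ndarray:
    """Return an N x N edge-weight matrix over N region-block nodes.

    Cosine similarity stands in for the Siamese similarity evaluation (assumption).
    Each node keeps edges to its k most similar neighbours; the result is symmetric,
    and a larger weight means more similar region blocks.
    """
    x = region_feats / (np.linalg.norm(region_feats, axis=1, keepdims=True) + 1e-12)
    sim = np.clip(x @ x.T, 0.0, None)   # negative similarity treated as "unrelated"
    np.fill_diagonal(sim, 0.0)          # no self-loops at this stage
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[::-1][:k]  # indices of the k largest similarities
        adj[i, nbrs] = sim[i, nbrs]
    return np.maximum(adj, adj.T)       # keep an edge if either endpoint selected it
```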
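Claim 7 normalizes the graph node features with the Laplacian operator but gives no formula. A common choice, assumed here, is the symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2), paired with a GCN-style propagation step ReLU(D~^(-1/2) (A + I) D~^(-1/2) H W). The dynamic convolution filter and the mapping matrix of the claim are not modelled; `w` is a hypothetical learned weight matrix.

```python
import numpy as np


def sym_normalize(adj: np.ndarray) -> np.ndarray:
    """D^(-1/2) A D^(-1/2); isolated nodes (degree 0) get zero rows and columns."""
    deg = adj.sum(axis=1)
    safe = np.where(deg > 0, deg, 1.0)  # avoid division by zero
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(safe), 0.0)
    return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]


def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian L = I - D^(-1/2) A D^(-1/2)."""
    return np.eye(adj.shape[0]) - sym_normalize(adj)


def graph_conv(adj: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One GCN-style step: ReLU(D~^(-1/2) (A + I) D~^(-1/2) H W), D~ = degrees of A + I."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    return np.maximum(sym_normalize(a_hat) @ h @ w, 0.0)
```

For a two-node graph with one unit-weight edge, L is [[1, -1], [-1, 1]], whose eigenvalues 0 and 2 sit at the ends of the range [0, 2] that bounds every symmetric normalized Laplacian of a graph with non-negative weights.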
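Claim 8 builds its gradient loss on the Hilbert-Schmidt independence criterion (HSIC). The standard biased empirical estimator, HSIC = tr(K H L H) / (n - 1)^2, with Gaussian kernel matrices K and L and centering matrix H = I - (1/n) 1 1^T, is sketched below. The kernel bandwidth `sigma` is an assumption, and how the patent weights this term inside its loss is not stated in the text.

```python
import numpy as np


def gaussian_kernel(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian (RBF) kernel matrix for samples stored as the rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T  # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))


def hsic(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    k, l = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2
```

The estimate stays close to zero for independent samples and grows as the two variables become dependent, which is what makes it usable as a differentiable dependence term in a training loss.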
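The formula in claim 9 is not reproduced, but several of its listed terms (the number of categories, the number of pixels, per-pixel class probabilities, per-category weights, and the true category) match a class-weighted pixel-wise cross-entropy. Assuming that reading, that component alone can be sketched as follows. The graph-edge, dynamic-graph-encoding, and CNN-weight terms of the claim are omitted because their form is unknown, and the function name is hypothetical.

```python
import numpy as np


def weighted_pixel_ce(probs: np.ndarray, labels: np.ndarray,
                      class_weights: np.ndarray, eps: float = 1e-12) -> float:
    """Class-weighted cross-entropy averaged over N pixels (assumed loss component).

    probs: (N, C) predicted probabilities p[i, c]; labels: (N,) true class indices;
    class_weights: (C,) per-category weights w[c].
    L = -(1/N) * sum_i sum_c w[c] * y[i, c] * log p[i, c]
      = -(1/N) * sum_i w[labels[i]] * log p[i, labels[i]]   (one-hot y)
    """
    n = probs.shape[0]
    p_true = probs[np.arange(n), labels]  # probability assigned to each pixel's true class
    return float(-np.mean(class_weights[labels] * np.log(p_true + eps)))
```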
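Claim 10 leaves open how the segmentation result is generated from the updated graph. One simple possibility, assumed here, is to give each region block the class with the highest node score and paint that label over the block's pixels. `node_scores` is a hypothetical per-node output of the updated dynamic graph, and the row-major block order is likewise an assumption.

```python
import numpy as np


def blocks_to_segmentation(node_scores: np.ndarray, grid_rows: int, grid_cols: int,
                           block_h: int, block_w: int) -> np.ndarray:
    """Assign each region block its highest-scoring class and expand it to pixels.

    node_scores: (grid_rows * grid_cols, C) per-node class scores in row-major
    block order (hypothetical interface).
    Returns an integer label map of shape (grid_rows * block_h, grid_cols * block_w).
    """
    block_labels = node_scores.argmax(axis=1).reshape(grid_rows, grid_cols)
    # Repeat each block label over its block_h x block_w pixel area.
    return np.repeat(np.repeat(block_labels, block_h, axis=0), block_w, axis=1)
```

This produces only block-level boundaries; a finer pixel-level result would need the upsampling and convolution path of claim 8.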