A visual saliency localization method based on Fourier series region modeling
By defining the shape of a two-dimensional closed region with a Fourier series and generating a shape mask with the winding number theorem, the method addresses the blurred boundaries and unfocused regions of existing visual saliency localization methods. It localizes salient regions accurately, with clear boundaries, and is applicable to a variety of deep learning models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2026-04-15
- Publication Date
- 2026-06-19
AI Technical Summary
Existing visual saliency localization methods suffer from problems such as blurred boundaries and unfocused regions, especially the Grad-CAM and Prompt-CAM methods, which are deficient in boundary localization.
The shape of a two-dimensional closed region is defined by a Fourier series, and the salient region is located by enclosing the salient pixels. The shape mask is generated from the Fourier coefficients using the winding number theorem, and the shape parameters are optimized by gradient backpropagation to form a shape mask region with clear boundaries.
The method focuses precisely on salient regions with clear boundaries, is applicable to various deep learning models, and requires no modification to the model structure, which makes it versatile and highly adaptable.
Figure CN122244549A_ABST
Description
Technical Field
[0001] This invention belongs to the field of image data processing or generation technology, and specifically relates to a visual saliency localization method based on Fourier series region modeling.
Background Technology
[0002] Visual saliency localization (VSL) is a core technique for interpretability analysis of deep learning models. It aims to accurately locate salient regions in an image upon which the model's decisions depend, thus supporting the interpretability of model behavior. Existing VSL methods primarily revolve around class activation mapping (CAM) mechanisms, using heatmaps to display the salient regions that the model focuses on, resulting in a series of representative technical solutions. For example: The Grad-CAM method estimates pixel-wise importance and visualizes the contribution weight of each pixel to the model's classification decision in the form of a heatmap, thereby locating the salient regions of interest to the model. In classification tasks, it calculates the contribution of each channel of the feature map of the last convolutional layer to the final classification decision, generating a low-resolution localization map reflecting the class contribution. This map is then upsampled to the corresponding image pixel size, visually demonstrating the salient regions that play a key role in the model's decision.
[0003] The Prompt-CAM method is designed for Vision Transformers (ViT) and introduces learnable class prompts (ClassPrompts). It generates fine-grained heatmaps through the attention interaction between the prompts and the image patches, so that the focus of the visualized heatmap converges from background noise to the core region of the target object. This enables the localization of the visually salient regions behind the decisions of a ViT model and improves the interpretability of the ViT model.
[0004] However, existing visual saliency localization methods suffer from problems such as blurred boundaries and unfocused regions.
[0005] For example, the Grad-CAM method suffers from blurred boundaries and unfocused regions in saliency localization. Grad-CAM is an extension of class activation mapping. It uses gradient-weighted features from the final convolutional layer of a CNN to generate a low-resolution heatmap, which is then upsampled back to the original image size for visualization. In essence, it quantifies the contribution of every pixel in the image and generates a heatmap from these contribution values to achieve visual saliency localization. Because the method relies on the low-resolution features of deep convolutional layers, its contribution estimates are spatially coarse. Furthermore, when the map is upsampled with an interpolation algorithm, the originally discrete weights are turned into continuous values. As a result, the salient regions lack clear boundaries and appear to diffuse outward. Moreover, because of deep feature downsampling and the receptive field size, the generated heatmap only reflects the relative importance of all pixels and fails to focus on key details.
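The Grad-CAM computation described above can be sketched as follows. This is a minimal NumPy illustration rather than the reference implementation: the function names `grad_cam` and `upsample_bilinear` are hypothetical, and the activations and gradients are assumed to have already been extracted from the last convolutional layer as arrays of shape (channels, h, w).

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse class activation map from feature maps and their gradients.

    Both inputs are assumed to have shape (channels, h, w).
    """
    weights = gradients.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted channel sum -> (h, w)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                          # normalize to [0, 1]
    return cam

def upsample_bilinear(cam, size):
    """Separable linear interpolation of an (h, w) map to (size, size)."""
    h, w = cam.shape
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    # Interpolate along x for every row, then along y for every column.
    rows = np.stack([np.interp(xs, np.arange(w), r) for r in cam])
    return np.stack([np.interp(ys, np.arange(h), c) for c in rows.T], axis=1)
```

With a 7x7 map that has a single active cell, upsampling to 224x224 spreads that cell into several thousand pixels with intermediate values, which is the boundary blur discussed in this paragraph.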
[0006] The Prompt-CAM method is designed specifically for models with the ViT architecture. It introduces class prompts for fine-tuning, which enables it to capture fine-grained features. However, it still generates heatmaps from pixel contribution estimates, so it does not overcome the blurred boundaries and unfocused regions inherent in heatmap representations of visual saliency. Furthermore, the method is only applicable to ViT models and generalizes poorly.
[0007] Therefore, a visual saliency localization method is needed to solve the problems of blurred boundaries and unfocused regions in existing technologies.
Summary of the Invention
[0008] This invention directly uses a Fourier series to define the shape of a two-dimensional closed region and localizes the salient region by enclosing the salient pixels. The invention transforms the localization result from a heatmap with blurred boundaries into a shape mask region with clear geometric boundaries. By establishing a differentiable mapping from the shape parameters to the mask image, the shape contour can be optimized directly from the gradient fed back by the model, thereby focusing precisely on the salient region. This overcomes the inherent defects of the traditional combination of class activation mapping and heatmap localization, namely blurred boundaries and unfocused regions.
[0009] This invention provides the following technical solution: a visual saliency localization method based on Fourier series region modeling, comprising the following steps: Step 1: Initialize training parameters and load the model.
[0010] Step 2: Randomly initialize the Fourier coefficients and transform them into closed curves on a two-dimensional plane according to the Fourier series expansion.
[0011] Step 3: Based on the winding number theorem, differentiably map the Fourier coefficients to a saliency mask, multiply the mask with the original image, and feed the result into the classification model.
[0012] Step 4: Construct and calculate the total loss function, and update the Fourier coefficients.
[0013] Step 5: Repeat steps 3 and 4 until the maximum number of iterations is reached to obtain the optimized Fourier coefficients. The pixel region in which the shape mask defined by the optimized Fourier coefficients approaches 1 is the visual saliency region.
[0014] Preferably, in step 1, the learning parameters set in the initialization settings include: Fourier series order, gradient optimization learning rate, total number of iterations, and regularization weights; the classification model parameters are loaded and frozen in the model loading process.
[0015] Preferably, in step 2, when a Fourier series is used to parameterize the two-dimensional closed shape, the shape contour is:
$$z(t) = x(t) + i\,y(t) = \sum_{k=-K}^{K} c_k\, e^{ikt}, \quad t \in [0, 2\pi) \tag{1}$$
where $x(t)$ and $y(t)$ are the Cartesian x and y coordinates of the contour boundary, $i$ is the imaginary unit, $c_k$ are the Fourier series coefficients corresponding to the different frequency components, $K$ is the order of the Fourier series, and $t$ is the position of a point on the two-dimensional closed contour.
[0016] Preferably, step 3 includes the following sub-steps: Step 3-1: Calculate the winding value of the closed curve defined in Step 2 for all pixels using the winding integral.
[0017] Step 3-2: Generate a shape mask based on the winding values. Assign 1 to the pixels inside the area enclosed by the closed curve, and assign 0 to the remaining pixels.
[0018] Step 3-3: Multiply the shape mask element-wise with the input image to generate a mask image, feed the mask image into the classification model for forward propagation and obtain the model decision output.
[0019] More preferably, in step 3, an analytical mapping from the shape representation parameters to the image pixel space is established using the winding number theorem; while keeping the entire computational chain continuously differentiable, the abstract geometric parameters are transformed into a visual mask image. For any pixel $p$ in the image, its winding number $W(p)$ with respect to the closed contour $C$ generated by the Fourier series is:
$$W(p) = \frac{1}{2\pi i} \oint_C \frac{dz}{z - p} \tag{2}$$
where $C$ is the closed shape contour defined by the Fourier series, $z = z(t)$ is the parametric representation of the contour, the pixel $p$ is treated as a point in the complex plane, and $\oint_C$ denotes integration of the expression over all points on the closed contour $C$.
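As a worked step, the contour integral of the winding number can be pulled back to the curve parameter, which makes the dependence on the Fourier coefficients explicit. The notation here is an assumption for illustration: the contour is written as $z(t)=\sum_{k=-K}^{K} c_k e^{ikt}$ with $t \in [0, 2\pi)$, the pixel $p$ is a complex number, $W(p)$ is the winding number, and $N$ uniform samples $t_n = 2\pi n / N$ are used for the quadrature.

```latex
\begin{aligned}
W(p) &= \frac{1}{2\pi i}\oint_C \frac{dz}{z-p}
      = \frac{1}{2\pi i}\int_0^{2\pi} \frac{z'(t)}{z(t)-p}\,dt,
\qquad z'(t) = \sum_{k=-K}^{K} i k\, c_k\, e^{ikt}, \\
W(p) &\approx \frac{1}{2\pi i}\cdot\frac{2\pi}{N}\sum_{n=0}^{N-1}\frac{z'(t_n)}{z(t_n)-p}
      = \frac{1}{iN}\sum_{n=0}^{N-1}\frac{z'(t_n)}{z(t_n)-p}.
\end{aligned}
```

Each term is a rational function of the coefficients $c_k$, so the approximation is differentiable with respect to them whenever no sample point $z(t_n)$ coincides with the pixel $p$.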
[0020] Preferably, step 4 includes the following sub-steps: Step 4-1: Design a bi-objective loss function that maximizes classification confidence while minimizing the shape mask area.
[0021] Step 4-2: Use gradient backpropagation to update the Fourier coefficients according to the loss function in Step 4-1.
[0022] More preferably, the bi-objective loss function is:
$$\mathcal{L}(c) = -\log \mathcal{C}_y\big(I \odot M(c)\big) + \lambda\, \overline{M(c)} \tag{3}$$
where $c$ denotes the Fourier coefficients; $\mathcal{C}$ denotes the classifier, a classification network that outputs a probability distribution over the categories, and $\mathcal{C}_y$ is the probability it assigns to category $y$; $I$ is the original image; $\odot$ denotes element-wise multiplication; $M(c)$ is the normalized grayscale mask of the shape determined by the Fourier coefficients; $y$ is the true label of the image; $\overline{M(c)}$ is the mean value of the normalized grayscale mask, which reflects the mask area; and $\lambda$ is the regularization weight set in step 1.
[0023] The beneficial effects of this invention are: 1. This invention directly defines the shape boundary with Fourier coefficients so that it encloses the salient pixel region, and uses gradients to update the shape representation parameters, which changes the boundary shape and hence the shape of the salient region. It maximizes the classification confidence while minimizing the area of the salient region. It overcomes the blurred boundaries, unfocused regions, and weak adaptability of previous methods for locating salient regions.
[0024] 2. This invention relies on gradient backpropagation to update Fourier coefficients (shape representation parameters). Ultimately, it achieves a visual saliency localization method with clear edges, spatial accuracy, and adaptability to various deep learning models.
[0025] 3. This invention does not require any modification to the structure and parameters of the classification model, nor does it require the introduction of additional input information. It simply applies a learnable shape mask to the input image, directly inputs the mask image into the original model and obtains the output result, and then uses the backpropagation gradient of the model to iteratively update the custom shape parameters, ultimately achieving accurate localization of visually salient regions. The entire process is not limited to a specific network architecture and can be seamlessly adapted to various mainstream visual models, possessing significant advantages such as strong versatility, simple access, and wide applicability.
Description of the Drawings
[0026] Figure 1 is a schematic diagram of the framework of the visual saliency localization method based on Fourier series region modeling according to the present invention; Figure 2 is a diagram illustrating the visual saliency region localization effect of the present invention; Figure 3 is a schematic diagram illustrating the steps and effects of an embodiment of the present invention; Figure 4 is a schematic diagram of the method steps of the present invention.
Detailed Description of the Embodiments
[0027] The relevant technologies of this invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this invention, and not all embodiments. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0028] As shown in Figures 1-4, the visual saliency localization method based on Fourier series region modeling in this embodiment proceeds as follows: Step 1: Initialize training parameters and load the model.
[0029] Set the learning parameters: the order $K$ of the Fourier series, the gradient optimization learning rate, the total number of iterations, and the regularization weight; load the classification model and freeze its parameters.
[0030] Step 2: Randomly initialize the Fourier coefficients $c_k$ and transform them into a closed curve on the two-dimensional plane according to the Fourier series expansion.
[0031] Step 3: Based on the winding number theorem, differentiably map the Fourier coefficients to a saliency mask, multiply the mask with the original image, and feed the result into the classification model.
[0032] Step 3-1: Calculate the winding value of the closed curve defined in Step 2 for all pixels using the winding integral.
[0033] Step 3-2: Generate a shape mask based on the winding value. Pixels inside the area enclosed by the closed curve are assigned 1, and all other pixels are assigned 0.
[0034] Step 3-3: Multiply the shape mask $M$ element-wise with the input image $I$ to generate the mask image $I \odot M$; feed the mask image into the classification model for forward propagation and obtain the model decision output.
[0035] Step 4: Construct and calculate the total loss function, and update the Fourier coefficients.
[0036] Step 4-1: Design a bi-objective loss function that maximizes the classification confidence while minimizing the shape mask area.
[0037] Step 4-2: Use gradient backpropagation to update the Fourier coefficients according to the loss function in Step 4-1.
[0038] Step 5: Repeat steps 3 and 4 until the maximum number of iterations is reached, obtaining the optimized Fourier coefficients. The pixel region in which the shape mask defined by the optimized coefficients approaches 1 is the visual saliency region.
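Steps 1 to 5 can be illustrated end to end with a deliberately simplified toy. Everything specific here is an assumption made for the sketch and is not taken from the patent: `toy_confidence` stands in for a frozen classifier, the gradient is estimated by central finite differences instead of backpropagation, plain gradient descent with step halving replaces the optimizer, the order is fixed at K = 1, and the coefficients are initialized deterministically rather than randomly so the run is reproducible.

```python
import numpy as np

def soft_mask(params, size, num_points=128):
    """Winding-number mask for K = 1.

    params is a real vector [Re c_-1, Re c_0, Re c_1, Im c_-1, Im c_0, Im c_1].
    """
    half = len(params) // 2
    coeffs = params[:half] + 1j * params[half:]
    K = (half - 1) // 2
    k = np.arange(-K, K + 1)
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    basis = np.exp(1j * np.outer(t, k))
    z = basis @ coeffs                     # contour samples z(t_n)
    dz = basis @ (1j * k * coeffs)         # derivative samples z'(t_n)
    ys, xs = np.mgrid[0:size, 0:size]
    p = (xs + 0.5) + 1j * (ys + 0.5)       # pixel centers as complex numbers
    w = (dz / (z - p[..., None])).sum(axis=-1) / (1j * num_points)
    return np.clip(w.real, 0.0, 1.0)

def toy_confidence(masked_image):
    """Hypothetical frozen 'classifier': confidence grows with visible energy."""
    return 1.0 / (1.0 + np.exp(-(masked_image.sum() - 4.0)))

def total_loss(params, image, reg_weight):
    mask = soft_mask(params, image.shape[0])
    confidence = toy_confidence(image * mask)
    return -np.log(confidence + 1e-12) + reg_weight * mask.mean()

def numeric_grad(params, image, reg_weight, eps=0.05):
    """Central finite differences (a stand-in for backpropagation)."""
    grad = np.zeros_like(params)
    for j in range(len(params)):
        step = np.zeros_like(params)
        step[j] = eps
        grad[j] = (total_loss(params + step, image, reg_weight)
                   - total_loss(params - step, image, reg_weight)) / (2 * eps)
    return grad

def optimize(image, params, reg_weight=1.0, lr=2.0, iterations=30):
    losses = [total_loss(params, image, reg_weight)]
    for _ in range(iterations):
        grad = numeric_grad(params, image, reg_weight)
        step, current = lr, losses[-1]
        while step > 1e-3:                 # halve the step until the loss drops
            candidate = params - step * grad
            cand_loss = total_loss(candidate, image, reg_weight)
            if cand_loss < current:
                params, current = candidate, cand_loss
                break
            step *= 0.5
        losses.append(current)
    return params, losses
```

For example, with a 24x24 image whose only bright pixels are `image[10:14, 10:14]` and a starting circle `np.array([0.0, 12.0, 9.0, 0.0, 12.0, 0.0])` (center (12, 12), radius 9), `optimize` returns the updated coefficients and a loss history that never increases, because a step is only accepted when it lowers the loss.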
[0039] Furthermore, regarding step 2, which uses a Fourier series to parameterize arbitrary two-dimensional closed shapes: based on the expressive power of Fourier series, this invention represents an arbitrary shape completely, in a continuous and differentiable manner, as a set of optimizable Fourier coefficients $c_k$ (the shape representation parameters). The shape contour is represented in Fourier series expansion form:
$$z(t) = x(t) + i\,y(t) = \sum_{k=-K}^{K} c_k\, e^{ikt}, \quad t \in [0, 2\pi) \tag{1}$$
where $x(t)$ and $y(t)$ are the Cartesian x and y coordinates of the contour boundary, $i$ is the imaginary unit, and $c_k$ are the Fourier coefficients used to define the shape (the core optimizable parameters). Any two-dimensional closed shape can be characterized by changing the Fourier coefficients. $K$, the order of the Fourier series, determines the level of detail of the shape representation: a larger $K$ gives a more refined shape representation but also makes the optimization harder to converge. Complex Fourier coefficients of different orders control different geometric features of the shape: $c_0$ is the centroid of the shape, $c_{\pm 1}$ define the basic shape outline (a circle or an ellipse), and the higher-order coefficients $c_k$ with $|k| \ge 2$ add fine geometric details to the shape.
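The roles of the low-order coefficients can be checked numerically with a small sketch. It assumes the two-sided complex form $z(t)=\sum_{k=-K}^{K} c_k e^{ikt}$ with $t \in [0, 2\pi)$; the function name `fourier_contour` and the ordering of the coefficient array (k = -K, ..., K) are hypothetical choices made for illustration.

```python
import numpy as np

def fourier_contour(coeffs, num_points=256):
    """Sample the closed contour z(t) = sum_k c_k * exp(i*k*t).

    coeffs is a complex array of length 2K+1 ordered k = -K, ..., K.
    Returns complex samples whose real and imaginary parts are x(t) and y(t).
    """
    K = (len(coeffs) - 1) // 2
    k = np.arange(-K, K + 1)
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    return np.exp(1j * np.outer(t, k)) @ coeffs

# c_0 sets the centroid; c_1 and c_-1 together give an ellipse.
coeffs = np.array([0, 0.5, 3 + 2j, 2, 0], dtype=complex)   # K = 2
z = fourier_contour(coeffs)
x, y = z.real, z.imag
```

With these values the contour is $3 + 2.5\cos t + i\,(2 + 1.5\sin t)$, an ellipse centered at (3, 2) with semi-axes 2.5 and 1.5; setting the $k = \pm 2$ entries to nonzero values would perturb this outline with finer detail.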
[0040] Furthermore, regarding the operation in step 3 of mapping the Fourier coefficients to a mask image based on the winding number theorem: this invention uses the winding number theorem to establish an analytical mapping from the shape representation parameters (the Fourier coefficients) to the image pixel space. While keeping the entire computational chain continuously differentiable, this mapping transforms the abstract geometric parameters into a visual mask image (the image region defined by the shape mask). For any pixel $p$ in the image, its winding number $W(p)$ with respect to the closed contour $C$ generated by the Fourier series is defined by the line integral over the contour shown in Formula 2:
$$W(p) = \frac{1}{2\pi i} \oint_C \frac{dz}{z - p} \tag{2}$$
where $C$ is the closed shape contour defined by the Fourier series, $z = z(t)$ is the parametric representation of the contour, and the pixel $p$ is treated as a point in the complex plane. $W(p)$ is the winding number of the closed curve $C$ around the pixel $p$, that is, the number of times the curve wraps around the point. This invention uses the integral property of the winding number (its value differs markedly between points inside and points outside a closed curve) to determine directly which pixels lie within the defined shape. The region formed by the pixels inside the generated boundary shape is the salient region located by this method.
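A minimal NumPy sketch of this mapping is given below. It assumes the contour form $z(t)=\sum_{k=-K}^{K} c_k e^{ikt}$, approximates the contour integral with a uniform quadrature over the curve parameter, and places pixel centers at half-integer coordinates; the function name `winding_mask` and these conventions are illustrative assumptions, not details confirmed by the patent.

```python
import numpy as np

def winding_mask(coeffs, height, width, num_points=512):
    """Approximate the winding number of the Fourier contour around each pixel.

    coeffs is a complex array of length 2K+1 ordered k = -K, ..., K.
    Returns a real (height, width) array close to 1 inside a
    counterclockwise contour and close to 0 outside it.
    """
    K = (len(coeffs) - 1) // 2
    k = np.arange(-K, K + 1)
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    basis = np.exp(1j * np.outer(t, k))
    z = basis @ coeffs                      # contour samples z(t_n)
    dz = basis @ (1j * k * coeffs)          # derivative samples z'(t_n)
    ys, xs = np.mgrid[0:height, 0:width]
    p = (xs + 0.5) + 1j * (ys + 0.5)        # pixel centers as complex numbers
    # (1 / (2*pi*i)) * integral of z'(t) / (z(t) - p) dt, discretized.
    integrand = dz[None, None, :] / (z[None, None, :] - p[..., None])
    w = integrand.sum(axis=-1) / (1j * num_points)
    return w.real

# A circle of radius 8 centered at (16, 16) on a 32x32 grid.
circle = np.array([0, 16 + 16j, 8], dtype=complex)
mask = winding_mask(circle, 32, 32)
```

Because the result is a smooth function of the coefficients away from the curve itself, it can in principle be differentiated with respect to them. A contour traversed clockwise (for example, one dominated by the $k = -1$ coefficient) yields values near -1 inside, so the sign or absolute value may need handling in practice.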
[0041] Furthermore, in step 4, the present invention designs a bi-objective optimization loss function:
$$\mathcal{L}(c) = -\log \mathcal{C}_y\big(I \odot M(c)\big) + \lambda\, \overline{M(c)} \tag{3}$$
where $c$ denotes the Fourier coefficients; $\mathcal{C}$ denotes the classifier, a classification network that outputs a probability distribution over the categories, and $\mathcal{C}_y$ is the probability it assigns to category $y$; $I$ is the original image; $\odot$ denotes element-wise multiplication; $M(c)$ is the normalized grayscale mask of the shape determined by the Fourier coefficients; $y$ is the true label of the image; $\overline{M(c)}$ is the mean value of the normalized grayscale mask, which reflects the mask area; and $\lambda$ is the regularization weight. The loss minimizes the shape mask area $\overline{M(c)}$ while maximizing the classification confidence of the model. By iteratively optimizing the shape parameters, the shape boundary gradually converges and focuses on the core visual region on which the model's decision depends.
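The two objectives can be combined as in the following sketch. The exact functional form is an assumption made here: a cross-entropy-style confidence term plus the mean mask value scaled by a regularization weight. The name `bi_objective_loss` and the `classifier` callable, which is assumed to return a probability vector, are hypothetical.

```python
import numpy as np

def bi_objective_loss(classifier, image, mask, label, reg_weight):
    """Confidence term plus area term for one image.

    classifier: callable mapping an image array to a probability vector
                (assumed interface, standing in for a frozen network).
    mask:       array with values in [0, 1], same shape as image.
    """
    probs = classifier(image * mask)                  # element-wise masking
    confidence_term = -np.log(probs[label] + 1e-12)   # small when confident
    area_term = mask.mean()                           # proxy for mask area
    return confidence_term + reg_weight * area_term
```

A larger `reg_weight` pushes the optimization toward smaller masks at the cost of classification confidence, so it controls the trade-off between the two objectives.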
Example
[0042] To demonstrate the effectiveness and feasibility of the visual saliency localization method based on Fourier series region modeling proposed in this invention, this embodiment uses the ImageNet dataset for visual saliency localization experiments. This dataset is used for image classification tasks and contains samples of 1000 different categories.
[0043] In the experiment, the parameters of the ResNet50 target classification model were loaded and fixed, and visually salient regions were extracted using a Fourier series-based region modeling technique. This technique extracts visually salient regions from the image by mapping the Fourier series to a grayscale mask and multiplying it element-wise with the original image. During optimization, the Adam optimizer was used with a learning rate of 0.01, a maximum number of iterations of 10000, and a highest frequency component K of 6. Iterative optimization of the Fourier series parameters indirectly adjusted the mask image, thereby accurately locating the visually salient regions in the original image. During testing, the masked visually salient regions were fed into a classifier for inference, and the classifier's output category was compared with the original image label. If they matched, it indicated that the mask determined by the Fourier series had located the visually salient regions in the image, verifying the effectiveness of the localization method.
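The test-time check described above can be sketched as follows. The function name `check_localization`, the `classifier` callable returning a probability vector, and the 0.5 threshold used to binarize the mask are assumptions for illustration; the patent does not state how the mask is binarized at test time.

```python
import numpy as np

def check_localization(classifier, image, mask, label, threshold=0.5):
    """Return (prediction matches label, fraction of pixels kept).

    classifier: callable mapping an image array to a probability vector
                (assumed interface).
    threshold:  assumed cutoff for turning the soft mask into a binary one.
    """
    binary = (mask >= threshold).astype(image.dtype)
    probs = classifier(image * binary)        # classify the masked image only
    predicted = int(np.argmax(probs))
    kept_fraction = float(binary.mean())      # share of image pixels retained
    return predicted == label, kept_fraction
```

The second return value gives the share of the original pixels that the located region occupies, which is the quantity reported when comparing how compact the salient regions are.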
[0044] The mainstream salient region localization algorithm Grad-CAM is selected as the baseline for comparison. In the specific implementation of Grad-CAM, the feature map generated by the fourth residual stage (layer4) of the ResNet50 backbone network is extracted and mapped to the original image space for visualization. The visualization results of the localization method based on Fourier series region modeling and of Grad-CAM are shown in Figure 3.
[0045] The results show that both the method of this invention and Grad-CAM can effectively locate salient regions in images that are highly correlated with category attributes, demonstrating their reliability as visual analysis tools. Taking Figure 3D as an example, when locating the candle region in the input image, Grad-CAM marks both the candle and the cake regions as discriminative salient regions, whereas the method of this patent focuses precisely on the candlelight region of the image, and only 0.68% of the pixels of the original image are sufficient for correct identification and classification. This method therefore shows a significant advantage in localization accuracy: it generates sharper edges and localizes key features within a smaller spatial area, which verifies its superiority over the comparison method.
[0046] This invention directly utilizes Fourier coefficients to define shape boundaries that enclose salient regions. The shape boundary definition parameters can be iteratively optimized through model feedback gradients, gradually converging and focusing on the core visual region of interest for model decision-making. Compared to traditional CAM-based visual saliency localization results, the salient regions determined by this invention have clearer boundaries and are more focused. The salient regions determined by this invention occupy only a tiny pixel area of the original input image. Relying on this small, clearly defined core region, the model can achieve high-confidence correct classification of the target category. The visual saliency localization method based on Fourier series region modeling proposed in this invention can accurately delineate the core visual regions upon which the model relies to complete classification decisions.
[0047] In summary, the visual saliency localization method based on Fourier series region modeling proposed in this invention directly defines the shape of a two-dimensional closed region through Fourier series, transforming the saliency localization result from the traditional boundary-fuzzy heatmap into a shape mask region with clear geometric boundaries. This method utilizes the winding number theorem to establish a differentiable mapping from shape representation parameters to image pixel space, ensuring the continuous differentiability of the entire computational chain and realizing the transformation of abstract geometric parameters into a visual mask image. Simultaneously, by designing a bi-objective loss function, it maximizes classification confidence while minimizing the shape mask area, relying on gradient backpropagation to update Fourier coefficients, causing the shape boundary to gradually converge and focus on the core visual region of interest for model decision-making. The entire process requires no modification to the classification model structure and parameters, nor the introduction of additional input information; by simply applying a learnable shape mask to the input image, it achieves seamless adaptation to various mainstream visual models, effectively solving the problems of fuzzy boundaries, unfocused regions, and weak method adaptability in existing technologies. It provides a precise and universal visual saliency localization scheme for the interpretability analysis of deep learning models.
[0048] It should be emphasized that the above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any way. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention shall still fall within the scope of the technical solution of the present invention.
Claims
1. A visual saliency localization method based on Fourier series region modeling, characterized in that it comprises the following steps: Step 1: Initialize training parameters and load the model; Step 2: Randomly initialize the Fourier coefficients and transform them into a closed curve on a two-dimensional plane according to the Fourier series expansion; Step 3: Based on the winding number theorem, differentiably map the Fourier coefficients to a saliency mask, multiply the mask with the original image, and feed the result into the classification model; Step 4: Construct and calculate the total loss function, and update the Fourier coefficients; Step 5: Repeat steps 3 and 4 until the maximum number of iterations is reached to obtain the optimized Fourier coefficients, wherein the pixel region in which the shape mask defined by the optimized Fourier coefficients approaches 1 is the visual saliency region.
2. The visual saliency localization method based on Fourier series region modeling according to claim 1, characterized in that, in step 1, the learning parameters set during initialization include: the Fourier series order, the gradient optimization learning rate, the total number of iterations, and the regularization weight; and the classification model parameters are loaded and frozen during model loading.
3. The visual saliency localization method based on Fourier series region modeling according to claim 1, characterized in that, in step 2, when a Fourier series is used to parameterize the two-dimensional closed shape, the shape contour is:
$$z(t) = x(t) + i\,y(t) = \sum_{k=-K}^{K} c_k\, e^{ikt}, \quad t \in [0, 2\pi) \tag{1}$$
where $x(t)$ and $y(t)$ are the Cartesian x and y coordinates of the contour boundary, $i$ is the imaginary unit, $c_k$ are the Fourier series coefficients corresponding to the different frequency components, $K$ is the order of the Fourier series, and $t$ is the position of a point on the two-dimensional closed contour.
4. The visual saliency localization method based on Fourier series region modeling according to claim 1, characterized in that step 3 includes the following sub-steps: Step 3-1: Calculate the winding value of the closed curve defined in Step 2 for all pixels using the winding integral; Step 3-2: Generate a shape mask based on the winding values, assigning 1 to the pixels inside the area enclosed by the closed curve and 0 to the remaining pixels; Step 3-3: Multiply the shape mask element-wise with the input image to generate a mask image, feed the mask image into the classification model for forward propagation, and obtain the model decision output.
5. The visual saliency localization method based on Fourier series region modeling according to claim 4, characterized in that, in step 3, an analytical mapping from the shape representation parameters to the image pixel space is established using the winding number theorem; while keeping the entire computational chain continuously differentiable, the abstract geometric parameters are transformed into a visual mask image; for any pixel $p$ in the image, its winding number $W(p)$ with respect to the closed contour $C$ generated by the Fourier series is:
$$W(p) = \frac{1}{2\pi i} \oint_C \frac{dz}{z - p} \tag{2}$$
where $C$ is the closed shape contour defined by the Fourier series, $z = z(t)$ is the parametric representation of the contour, the pixel $p$ is treated as a point in the complex plane, and $\oint_C$ denotes integration of the expression over all points on the closed contour $C$.
6. The visual saliency localization method based on Fourier series region modeling according to claim 1, characterized in that step 4 includes the following sub-steps: Step 4-1: Design a bi-objective loss function that maximizes the classification confidence while minimizing the shape mask area; Step 4-2: Use gradient backpropagation to update the Fourier coefficients according to the loss function in Step 4-1.
7. The visual saliency localization method based on Fourier series region modeling according to claim 6, characterized in that the bi-objective loss function is:
$$\mathcal{L}(c) = -\log \mathcal{C}_y\big(I \odot M(c)\big) + \lambda\, \overline{M(c)} \tag{3}$$
where $c$ denotes the Fourier coefficients; $\mathcal{C}$ denotes the classifier, a classification network that outputs a probability distribution over the categories, and $\mathcal{C}_y$ is the probability it assigns to category $y$; $I$ is the original image; $\odot$ denotes element-wise multiplication; $M(c)$ is the normalized grayscale mask of the shape determined by the Fourier coefficients; $y$ is the true label of the image; $\overline{M(c)}$ is the mean value of the normalized grayscale mask, which reflects the mask area; and $\lambda$ is a regularization weight.