Panoramic image salient object segmentation method and device based on field of view perception

By converting panoramic images into equidistant cylindrical projections and using a field-of-view-aware convolutional neural network segmentation model, the method addresses boundary discontinuities and distortion in panoramic image segmentation and achieves more effective salient object segmentation results.

CN115457275B · Active · Publication Date: 2026-06-30 · BEIHANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIHANG UNIV
Filing Date: 2022-09-16
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing salient object segmentation methods perform poorly on 360-degree panoramic images and ignore the continuous and complete field of view of the panoramic image, resulting in boundary discontinuities and distortion problems.

Method used

An equidistant cylindrical projection is used to convert panoramic images into two-dimensional planar images, and a field-aware convolutional neural network segmentation model is used for salient object segmentation. A sample adaptive field-aware transformation module is used to enhance the model's adaptability to distortion and boundary effects.

Benefits of technology

It improves the accuracy and reliability of salient object segmentation in panoramic images, preserves the complete panoramic view, and enhances adaptability to distortion and boundary effects.

Abstract

This disclosure presents a method and apparatus for salient object segmentation in panoramic images with field-of-view awareness. One specific implementation of the method includes: acquiring a panoramic image; performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzing the equidistant cylindrical projection image; acquiring a preset field-of-view awareness convolutional neural network segmentation model, and segmenting the equidistant cylindrical projection image according to the preset field-of-view awareness convolutional neural network segmentation model to obtain a salient object segmentation result image. This implementation improves the reliability of salient object segmentation in panoramic images.

Description

Technical Field

[0001] The embodiments disclosed herein relate to computer vision technology, specifically to a method and apparatus for salient object segmentation of panoramic images with field-of-view perception.

Background Technology

[0002] 360-degree panoramic images can display scene information with a 360-degree horizontal and 180-degree vertical field of view. Compared to traditional images, panoramic images typically contain richer scene content, which has led to their increasing use in real-world applications. Salient object segmentation, which automatically extracts the parts of interest in an image while ignoring the parts that are not of interest, is a fundamental problem in computer vision. Research on salient object segmentation of 360-degree panoramic images is of great significance for their compression, transmission, and parsing.

[0003] Currently, salient object segmentation methods are mainly based on traditional two-dimensional planar images, where they have achieved good results. However, when these methods, which perform well on traditional planar images, are applied to 360-degree panoramic images, the results are unsatisfactory. Furthermore, existing salient object segmentation methods for 360-degree panoramic images are relatively few and have low reliability. For example, one method designs a distortion-adaptive module that slices the equidistant cylindrical projection image into four equal-sized image blocks to learn different feature kernels, and designs a multi-scale module to integrate contextual features, thereby improving the performance of panoramic image salient object segmentation; another method designs a multi-stage salient object segmentation method that handles equidistant cylindrical projection images by using perspective views with less distortion together with object-level semantic saliency ranking. These existing techniques mainly focus on mitigating distortion rather than adapting to it, and they ignore a key advantage of panoramic images: a continuous and complete panoramic view. Whether equidistant cylindrical projection images are sliced into blocks or perspective views covering only local fields of view are used, the complete panoramic view is lost, and further problems such as additional boundary discontinuities may arise.

Summary of the Invention

[0004] The summary portion of this disclosure is intended to provide a brief overview of the concepts, which will be described in detail in the detailed description portion. This summary portion is not intended to identify key or essential features of the claimed technical solutions, nor is it intended to limit the scope of the claimed technical solutions.

[0005] Some embodiments of this disclosure propose a method and apparatus for salient object segmentation of panoramic images with field-of-view awareness, in order to solve one or more of the technical problems mentioned in the background section above.

[0006] Some embodiments of this disclosure provide a view-aware panoramic image salient object segmentation method, the method comprising: acquiring a panoramic image; performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzing the equidistant cylindrical projection image; acquiring a preset view-aware convolutional neural network segmentation model, and processing the equidistant cylindrical projection image according to the preset view-aware convolutional neural network segmentation model to obtain a salient object segmentation result image.

[0007] One of the above embodiments of this disclosure has the following beneficial effects. First, a panoramic image is acquired. A panoramic image is a 360-degree omnidirectional image that displays scene information with a 360-degree horizontal and 180-degree vertical field of view; it is the input processed by this solution. Then, equidistant cylindrical projection is applied to the panoramic image, projecting it onto a two-dimensional plane to obtain an equidistant cylindrical projection image. Finally, a preset field-of-view-aware convolutional neural network segmentation model is acquired, and the equidistant cylindrical projection image is processed according to that model to obtain a salient object segmentation result image. Converting the panoramic image into a two-dimensional equidistant cylindrical projection image facilitates subsequent processing by the convolutional neural network model while preserving the complete panoramic field of view, which benefits the segmentation of salient objects in the panoramic image. The preset field-of-view-aware convolutional neural network segmentation model can be a pre-trained convolutional neural network. Its sample adaptive field-of-view transformation module learns features under different fields of view through horizontal, vertical, and scaling field-of-view transformations, which enhances the model's adaptability to objects affected by distortion, boundary effects, or multi-scale variation and improves the reliability of salient object segmentation in panoramic images.

Attached Figure Description

[0008] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and elements are not necessarily drawn to scale.

[0009] Figure 1 is a flowchart of some embodiments of a field-of-view-aware panoramic image salient object segmentation method according to the present disclosure;

[0010] Figure 2 is a schematic diagram of a panoramic image before and after field-of-view-aware salient object segmentation according to the present disclosure;

[0011] Figure 3 is a schematic diagram of the overall framework of the field-of-view-aware convolutional neural network segmentation model used in the field-of-view-aware panoramic image salient object segmentation method according to the present disclosure;

[0012] Figure 4 is a schematic diagram of the overall framework of the horizontal field-of-view transformation submodule and the zoom field-of-view transformation submodule within the field-of-view transformation submodule of the present disclosure;

[0013] Figure 5 is a schematic structural diagram of some embodiments of a panoramic image salient object segmentation apparatus according to the present disclosure.

Detailed Implementation

[0014] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0015] It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings. Unless otherwise specified, the embodiments and features described in this disclosure can be combined with each other.

[0016] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0017] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0018] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

[0019] This disclosure will now be described in detail with reference to the accompanying drawings and embodiments.

[0020] Figure 1 is a flowchart 100 of some embodiments of the field-of-view-aware panoramic image salient object segmentation method according to the present disclosure. The method includes steps 101 to 103, as detailed below:

[0021] Step 101: Obtain the panoramic image.

[0022] In some embodiments, the entity executing the field-of-view-aware panoramic image salient object segmentation method can acquire panoramic images via a wired or wireless connection. These panoramic images can be 360-degree omnidirectional images that display scene information with a 360-degree horizontal and 180-degree vertical field of view, more closely resembling the three-dimensional scenes that humans experience in real-world environments.

[0023] As an example, for the panoramic image, refer to Figure 2, which shows a panoramic image before and after field-of-view-aware salient object segmentation. The image on the left of Figure 2 can represent the panoramic image mentioned above. Due to their wide field of view, panoramic images typically contain more scene information, but this also presents more challenges for salient object segmentation. These challenges include boundary discontinuities caused by projection, distortion levels that vary with position, and salient objects of varying scale. All of these factors affect the reliability of salient object segmentation in panoramic images.

[0024] Step 102: Project the panoramic image to obtain an equidistant cylindrical projection image and analyze the equidistant cylindrical projection image.

[0025] In some embodiments, the execution entity can perform projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyze that image. Specifically, so that the panoramic image can be processed with a mature two-dimensional convolutional neural network, the original three-dimensional spherical panoramic image can be projected onto a two-dimensional plane by equidistant cylindrical projection, yielding a rectangular equidistant cylindrical projection image with an aspect ratio of 2:1. Furthermore, to facilitate algorithm research, the characteristics of the equidistant cylindrical projection image can be analyzed. These characteristics may include the degree of distortion in salient regions.
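As a minimal sketch of the coordinate relationship behind an equidistant cylindrical projection, the following Python maps pixel centres to spherical angles and back. The function names, the angle conventions (longitude in [-pi, pi), latitude in [-pi/2, pi/2]) and the half-pixel offset are assumptions made for illustration and are not taken from the disclosure. The helper `horizontal_stretch` reports the well-known 1/cos(latitude) horizontal stretching of this projection; it is not the distortion degree D defined in the disclosure.

```python
import math

def pixel_to_sphere(col, row, width, height):
    """Map an equirectangular pixel centre to (longitude, latitude) in radians.

    Assumed convention: longitude grows left to right over [-pi, pi),
    latitude grows bottom to top over [-pi/2, pi/2].
    """
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (row + 0.5) / height) * math.pi
    return lon, lat

def sphere_to_pixel(lon, lat, width, height):
    """Inverse mapping: spherical angles back to fractional pixel coordinates."""
    col = (lon / (2.0 * math.pi) + 0.5) * width - 0.5
    row = (0.5 - lat / math.pi) * height - 0.5
    return col, row

def horizontal_stretch(row, height):
    """Horizontal stretch of a pixel row relative to the equator: 1 / cos(latitude)."""
    _, lat = pixel_to_sphere(0, row, 1, height)
    return 1.0 / math.cos(lat)

# A 2:1 image: the stretch is 1 at the equator and grows toward the poles.
width, height = 1024, 512
print(horizontal_stretch(255.5, height), horizontal_stretch(0, height))
```

Rows near the top and bottom of the image are stretched far more than rows near the equator, which is one way to see why distortion in this projection varies with position.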

[0026] Optionally, when performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzing that image, the execution entity may also perform the following step:

[0027] Based on the corresponding annotations of the equidistant cylindrical projection image, the degree of distortion in the salient regions of the equidistant cylindrical projection image is analyzed to obtain the degree of distortion in the salient regions of the equidistant cylindrical projection image. The degree of distortion in the salient regions can be determined using the following formula:

[0028]

[0029] Here, D denotes the distortion degree of the salient region; j denotes the pixel row index in the pixel image contained in the equidistant cylindrical projection image; y denotes the ordinate of a pixel of that pixel image in the planar coordinate system, and $y_j$ denotes the ordinate of the pixels in row j. Q denotes the set of corresponding coordinate pairs, each consisting of the planar coordinate of a pixel of the pixel image and its corresponding pixel spherical coordinate; N denotes the number of coordinate pairs in Q; h denotes the height of the equidistant cylindrical projection image; and E denotes the projection operator from the planar coordinate system to the projected spherical coordinate system. The remaining terms of the formula denote, in turn: the pixel spherical coordinate in the projected spherical coordinate system corresponding to y; the pixel spherical coordinate corresponding to $y_j$; the salient region of the pixel image at ordinate $y_j$; and the region on the spherical image that corresponds to the pixel spherical coordinate of row j.

[0030] Step 103: Obtain a preset field-aware convolutional neural network segmentation model, and perform segmentation processing on the above equidistant cylindrical projection image according to the preset field-aware convolutional neural network segmentation model to obtain a salient object segmentation result image.

[0031] In some embodiments, the execution entity may acquire a preset field-aware convolutional neural network segmentation model and segment the equidistant cylindrical projection image according to the preset field-aware convolutional neural network segmentation model to obtain a salient object segmentation result image. The preset field-aware convolutional neural network segmentation model may be a trained convolutional neural network.

[0032] As an example, for the salient object segmentation result image, refer to Figure 2, which shows a panoramic image before and after field-of-view-aware salient object segmentation. The image on the right of Figure 2 represents the salient object segmentation result image described above. For the preset field-of-view-aware convolutional neural network segmentation model, refer to Figure 3, which illustrates the overall framework of the model. The model includes a basic feature extraction module, a channel adaptation module, a feature fusion module, an output prediction module, and a sample adaptive field-of-view transformation module. The sample adaptive field-of-view transformation module comprises a sample adaptation submodule, a scaling field-of-view transformation submodule, a vertical field-of-view transformation submodule, a horizontal field-of-view transformation submodule, a field-of-view preservation submodule, and a preset number of convolution operations. Each step in Figure 3 corresponds to a step of the segmentation processing performed on the equidistant cylindrical projection image according to the preset field-of-view-aware convolutional neural network segmentation model.

[0033] In some optional implementations of certain embodiments, the execution entity obtains a preset field-aware convolutional neural network segmentation model and performs segmentation processing on the equidistant cylindrical projection image according to the preset field-aware convolutional neural network segmentation model to obtain a salient object segmentation result image, which may include the following steps:

[0034] The first step involves extracting, adapting, and modulating features at multiple stages from the aforementioned equidistant cylindrical projection image to obtain multiple stage modulated sub-features.

[0035] In some optional implementations of certain embodiments, the above-described multi-stage feature extraction, adaptation, and modulation of the equidistant cylindrical projection image to obtain multi-stage modulated sub-features may include the following sub-steps:

[0036] The first sub-step involves extracting features from the equidistant cylindrical projection image at multiple stages based on the basic feature extraction module in the preset field-of-view perception convolutional neural network segmentation model, thereby obtaining multiple basic sub-features at multiple stages.

[0037] In some optional implementations of certain embodiments, the execution entity can sequentially perform five stages of feature extraction on the equidistant cylindrical projection image according to the basic feature extraction module in the preset field-aware convolutional neural network segmentation model, obtaining multiple stage basic sub-features corresponding to stages two, three, four, and five, respectively. The basic sub-features corresponding to stage one can be the basic sub-features obtained from the first feature extraction of the equidistant cylindrical projection image, which are not subsequently processed and therefore not included in the multiple stage basic sub-features. The basic feature extraction module can be a feature extraction backbone network.

[0038] As an example, the feature extraction backbone network could be a 50-layer Residual Network (ResNet), which could include multiple convolutional layers and downsampling layers. When the size of the equidistant cylindrical projection image is represented as H×W×3, the size of the basic sub-features can be represented as... Here, H denotes the height of the equidistant cylindrical projection image, W denotes its width, k denotes the stage index, C denotes the number of channels, and $C_k$ denotes the number of channels in the k-th stage.

[0039] There may be five downsampling layers, and the multiple stages may be the last four stages. That is, the basic feature extraction module performs feature extraction on the equidistant cylindrical projection image in five stages, one per downsampling layer, and passes the basic sub-features of the second, third, fourth, and fifth stages on for subsequent model processing.
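To make the five-stage layout concrete, the sketch below tabulates per-stage feature sizes. It assumes that each of the five downsampling layers halves the spatial resolution and that the channel widths are those of a standard ResNet-50 (64, 256, 512, 1024, 2048); neither value is stated in the text above, and the function name `stage_feature_shapes` is hypothetical.

```python
def stage_feature_shapes(height, width, channels=(64, 256, 512, 1024, 2048)):
    """Return {stage: (h, w, c)}, assuming every stage halves the resolution.

    The default channel counts are those of a standard ResNet-50 and are
    an assumption here, not values given in the disclosure.
    """
    shapes = {}
    for k, c in enumerate(channels, start=1):
        shapes[k] = (height // 2 ** k, width // 2 ** k, c)
    return shapes

shapes = stage_feature_shapes(512, 1024)
# Stage 1 is computed but not passed on; stages 2 to 5 feed the later modules.
used = {k: shapes[k] for k in (2, 3, 4, 5)}
for k, shape in used.items():
    print(k, shape)
```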

[0040] The second sub-step involves adapting the basic sub-features of the multiple stages based on the channel adaptation module in the preset field-of-view-aware convolutional neural network segmentation model to obtain multiple stage-adapted sub-features.

[0041] In some optional implementations of certain embodiments, the execution entity can perform two convolution operations on the basic sub-features of the multiple stages to obtain multiple stage-adapted sub-features. The spatial sizes of the convolution kernels in the two convolution operations are 3×3 and 1×1, respectively, and the number of convolution kernels in the two convolution operations decreases sequentially. The number of convolution kernels in the convolutional layer corresponding to the last convolution operation can be 64. Each convolutional layer corresponding to each convolution operation is connected to a Batch Normalization (BN) layer and an activation function layer. The number of convolutional layers, the kernel size of each convolutional layer, and the number of convolution kernels can all be adjusted according to the network requirements. Furthermore, different types of BN layers and activation function layers can be selected as needed.
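The layer layout of such a channel adaptation module can be listed as follows. This is only a bookkeeping sketch: the intermediate kernel count `mid_channels` is a hypothetical choice (the text only says the kernel count decreases and can end at 64), and bias-free convolutions are assumed when counting weights.

```python
def channel_adapter_layers(in_channels, mid_channels, out_channels=64):
    """List the layers of a 3x3 conv -> 1x1 conv adapter with their weight counts.

    mid_channels is a hypothetical value; the disclosure only states that the
    number of kernels decreases across the two convolutions.
    """
    if mid_channels <= out_channels:
        raise ValueError("the kernel count is expected to decrease")
    layers = []
    for name, k, cin, cout in (("conv3x3", 3, in_channels, mid_channels),
                               ("conv1x1", 1, mid_channels, out_channels)):
        layers.append((name, k * k * cin * cout))  # bias-free conv weights (assumed)
        layers.append(("batch_norm", 2 * cout))    # one scale and one shift per channel
        layers.append(("activation", 0))           # e.g. ReLU, no parameters
    return layers

layers = channel_adapter_layers(in_channels=2048, mid_channels=256)
total = sum(count for _, count in layers)
print(layers, total)
```

Most of the weights sit in the 3×3 convolution, so the choice of the intermediate kernel count largely determines the cost of the adapter.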

[0042] The third sub-step involves modulating the multiple stage-adapted sub-features based on the feature fusion module in the preset field-of-view-aware convolutional neural network segmentation model to obtain multiple stage modulated sub-features.

[0043] In some optional implementations of certain embodiments, the above-mentioned modulation processing of the multiple stage adaptation sub-features based on the feature fusion module in the preset field-aware convolutional neural network segmentation model to obtain multiple stage modulated sub-features may include the following steps:

[0044] First, the feature fusion module of the current stage modulates the adapted sub-feature of the current stage together with the enhanced sub-feature of the adjacent higher-level stage to obtain the modulated sub-feature of the current stage.

[0045] Then, in response to determining that no enhanced sub-feature of an adjacent higher-level stage exists, only the adapted sub-feature of the current stage is used as the input to the feature fusion module. The feature fusion module includes a preset number of convolution operations. The input to the feature fusion module of the fifth stage is the adapted sub-feature of the fifth stage. The input to the feature fusion module of the fourth stage is the adapted sub-feature of the fourth stage and the enhanced sub-feature of the fifth stage. The input to the feature fusion module of the third stage is the adapted sub-feature of the third stage and the enhanced sub-feature of the fourth stage. The input to the feature fusion module of the second stage is the adapted sub-feature of the second stage and the enhanced sub-feature of the third stage.

[0046] As an example, the above feature fusion process can be represented by the following formula:

[0047]

[0048] Here, F denotes a feature and k denotes the stage index. The stage-indexed feature terms in the formula denote, respectively, the adapted sub-feature of stage k, the modulated sub-feature of stage k, the enhanced sub-feature of stage k+1, and the intermediate feature of stage k. U denotes the upsampling function. C denotes the channel-dimension concatenation function. CONV1 denotes a convolutional layer whose kernels have a spatial size of 1×1. CONV33 denotes two convolutional layers whose kernels have a spatial size of 3×3. * denotes the convolution operation. The size of the intermediate features can be represented as... Here, H denotes the height of the equidistant cylindrical projection image and W denotes its width.

[0049] Each convolutional layer is followed by a Batch Normalization (BN) layer and an activation function layer. During the feature fusion process described above, features of the same size can be summed element-wise. The number of convolutional layers, the kernel size of each convolutional layer, and the number of convolutional kernels can all be adjusted according to the network's needs. Furthermore, different types of BN layers and activation function layers can be selected as needed.
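The stage-by-stage wiring of the feature fusion modules described above can be traced with the following sketch. It only illustrates which inputs reach which stage; `fuse` and `enhance` are hypothetical placeholders for the feature fusion module and the subsequent enhancement step, and strings stand in for feature tensors.

```python
def top_down_pass(adapted, fuse, enhance, stages=(5, 4, 3, 2)):
    """Run the fusion modules from the deepest stage to the shallowest.

    adapted: {stage: adapted sub-feature}
    fuse(adapted_k, enhanced_higher_or_None) -> modulated sub-feature of stage k
    enhance(modulated_k) -> enhanced sub-feature of stage k
    """
    modulated, enhanced = {}, {}
    for k in stages:
        higher = enhanced.get(k + 1)  # None at stage 5: no higher stage exists
        modulated[k] = fuse(adapted[k], higher)
        enhanced[k] = enhance(modulated[k])
    return modulated, enhanced

# Trace the wiring with strings instead of tensors.
def fuse(a, e):
    return f"M({a})" if e is None else f"M({a},{e})"

def enhance(m):
    return f"E({m})"

adapted = {k: f"A{k}" for k in (2, 3, 4, 5)}
modulated, enhanced = top_down_pass(adapted, fuse, enhance)
print(modulated[5])
print(modulated[4])
```

The trace shows that stage 5 is fused from its adapted sub-feature alone, while every lower stage also receives the enhanced sub-feature of the stage directly above it.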

[0050] The second step involves enhancing the modulated sub-features of the multiple stages to obtain enhanced sub-features of multiple stages. Specifically, the way the human eye observes panoramic images is applied to the neural network's feature learning on equidistant cylindrical projection images.

[0051] In some optional implementations of certain embodiments, the execution entity enhances the multiple stage modulated sub-features to obtain multiple stage enhanced sub-features, which may include the following step: enhancing the multiple stage modulated sub-features according to the sample adaptive field-of-view transformation module in the preset field-of-view-aware convolutional neural network segmentation model to obtain multiple stage enhanced sub-features.

[0052] In some optional implementations of certain embodiments, the execution entity enhances the multiple-stage modulation sub-features according to the sample adaptive field-view transformation module in the preset field-view-aware convolutional neural network segmentation model to obtain multiple-stage enhanced sub-features, which may include the following steps:

[0053] The first sub-step involves performing field-view preservation processing on the modulated sub-features of the above multiple stages based on the field-view preservation sub-module in the above-mentioned sample adaptive field-view transformation module, thereby obtaining multiple stage field-view preservation sub-features.

[0054] In some optional implementations of certain embodiments, the execution entity may perform a preset number of convolution operations on each stage modulation sub-feature of the multiple stage modulation sub-features according to the field-view preservation sub-module in the sample adaptive field-view transformation module, thereby obtaining multiple stage field-view preservation sub-features. The preset number of convolution operations on each stage modulation sub-feature of the multiple stage modulation sub-features may be a combination of a certain number of convolutional layers containing convolutional kernels with spatial sizes of 3×3 and 1×1, and may be further learning of the modulation sub-features.

[0055] The second sub-step involves performing field-view transformation processing on the aforementioned multiple-stage modulation sub-features based on the field-view transformation sub-module in the aforementioned sample adaptive field-view transformation module, thereby obtaining multiple-stage field-view transformation sub-features.

[0056] In some optional implementations of certain embodiments, the execution entity performs field-view transformation processing on the multiple stage modulation sub-features according to the field-view transformation sub-module in the sample adaptive field-view transformation module to obtain multiple stage field-view transformation sub-features, which may include the following steps:

[0057] First, based on the horizontal field of view transformation submodule in the above-mentioned field of view transformation submodule, the horizontal field of view transformation processing is performed on each stage modulation sub-feature of the above-mentioned multiple stage modulation sub-features to obtain multiple stage horizontal field of view transformation sub-features.

[0058] Secondly, based on the vertical field of view transformation submodule in the above-mentioned field of view transformation submodule, the vertical field of view transformation processing is performed on each stage modulation sub-feature of the above-mentioned multiple stage modulation sub-features to obtain multiple stage vertical field of view transformation sub-features.

[0059] Then, according to the scaling field of view transformation submodule in the above field of view transformation submodule, scaling field of view transformation processing is performed on each of the above multiple stage modulation sub-features to obtain multiple stage scaling field of view transformation sub-features.

[0060] Finally, based on each of the aforementioned multi-stage horizontal field-of-view transformation sub-features, each of the aforementioned multi-stage vertical field-of-view transformation sub-features, and each of the aforementioned multi-stage zoom field-of-view transformation sub-features, a field-of-view transformation sub-feature for each stage is obtained. Specifically, the human eye's observation behavior of a panoramic image can be divided into horizontal left-right observation behavior, vertical up-down observation behavior, and front-back near-far observation behavior. These horizontal left-right observation behavior, vertical up-down observation behavior, and front-back near-far observation behavior are then applied to the neural network's feature learning of the aforementioned equidistant cylindrical projection image.

[0061] The transformation parts of the above-mentioned horizontal field of view transformation submodule, the above-mentioned vertical field of view transformation submodule, and the above-mentioned zoom field of view transformation submodule all contain certain parallel sub-branches, and each sub-branch in the above-mentioned parallel sub-branch has the same form but different parameters.

[0062] Furthermore, although the specific transformation functions for the above-mentioned horizontal field of view transformation processing, vertical field of view transformation processing, and zoom field of view transformation processing are different, the main operation process is the same. The above-mentioned main operation process includes: forward field of view transformation processing, forward field of view transformation feature learning, inverse field of view transformation processing, inverse field of view transformation feature learning, and feature fusion of sub-branch of field of view transformation.
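The shared operation flow (forward transformation, feature learning, inverse transformation, feature learning, fusion of sub-branches) can be sketched generically as below. All names are hypothetical, and a circular shift of a one-dimensional list is used purely as a stand-in for a field-of-view transformation; in the disclosure the transformations act on feature maps and are defined through the Möbius-based function described later.

```python
def run_branch(feature, forward, learn_fwd, inverse, learn_inv):
    """One sub-branch: transform, learn, transform back, learn again."""
    x = forward(feature)
    x = learn_fwd(x)
    x = inverse(x)
    return learn_inv(x)

def run_submodule(feature, branches, fuse):
    """Apply every parallel sub-branch to the same input and fuse the outputs."""
    return fuse([run_branch(feature, *branch) for branch in branches])

def roll(shift):
    """Stand-in transform (hypothetical): circular shift of a feature row."""
    def apply(row):
        k = (-shift) % len(row)
        return row[k:] + row[:k]
    return apply

def identity(row):
    return row

def mean(rows):
    return [sum(values) / len(rows) for values in zip(*rows)]

# Same form, different parameters: two branches with different shifts.
branches = [(roll(s), identity, roll(-s), identity) for s in (1, 3)]
out = run_submodule([1.0, 2.0, 3.0, 4.0], branches, mean)
print(out)
```

With identity learners the output equals the input, which confirms that each inverse transformation returns the feature to the original field of view before the sub-branches are fused.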

[0063] In some optional implementations of certain embodiments, the above-described main operation flow may include:

[0064] First, the forward field-of-view transformation processing function described above is shown in the following formula:

[0065] $P'_e = T\left(SP^{-1}\left(f\left(SP\left(T^{-1}(P_e)\right)\right)\right)\right) = F(P_e)$.

[0066] Where F represents the above-mentioned forward field-of-view transformation processing function. $P_e$ represents the coordinates of the projection point of the above modulation sub-feature in the planar coordinate system. $P'_e$ represents the transformed projection point coordinates, in the planar coordinate system corresponding to the equidistant cylindrical projection image, after the aforementioned forward field-of-view transformation. T represents the spherical coordinate transformation function from the projection spherical coordinates in the projection spherical coordinate system corresponding to the equidistant cylindrical projection image to the planar coordinates in the planar coordinate system corresponding to the equidistant cylindrical projection image. $T^{-1}$ represents the inverse function of T. SP represents the spherical polar projection function that projects the spherical coordinates in the projection spherical coordinate system onto the complex plane coordinates in the projection complex plane coordinate system corresponding to the above-mentioned equidistant cylindrical projection image. $SP^{-1}$ represents the inverse function of SP. f represents the Möbius transform function that satisfies preset conditions; under different preset conditions it implements the above-mentioned horizontal field-of-view transformation processing, vertical field-of-view transformation processing, and zoom field-of-view transformation processing.
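The composition F = T ∘ SP⁻¹ ∘ f ∘ SP ∘ T⁻¹ can be illustrated with a minimal Python sketch. The coordinate conventions here are assumptions that the text does not specify: longitude spans [-π, π) across the image width, latitude runs from +π/2 at the top row to -π/2 at the bottom row, and SP is taken to be the stereographic projection from the north pole. The function names `T`, `T_inv`, `SP`, `SP_inv`, and `forward_fov_transform` are hypothetical.

```python
import cmath
import math


def T_inv(x, y, w, h):
    """Planar coordinates of the projection image -> point on the unit sphere (assumed convention)."""
    lon = (x / w) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (y / h) * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def T(p, w, h):
    """Point on the unit sphere -> planar coordinates of the projection image."""
    X, Y, Z = p
    lon = math.atan2(Y, X)
    lat = math.asin(max(-1.0, min(1.0, Z)))
    return ((lon + math.pi) / (2.0 * math.pi) * w,
            (math.pi / 2.0 - lat) / math.pi * h)


def SP(p):
    """Stereographic projection from the north pole (undefined at Z = 1)."""
    X, Y, Z = p
    return complex(X, Y) / (1.0 - Z)


def SP_inv(z):
    """Inverse stereographic projection: complex plane -> unit sphere."""
    r2 = abs(z) ** 2
    return (2.0 * z.real / (r2 + 1.0),
            2.0 * z.imag / (r2 + 1.0),
            (r2 - 1.0) / (r2 + 1.0))


def forward_fov_transform(x, y, w, h, f):
    """P'_e = T(SP^{-1}(f(SP(T^{-1}(P_e))))) for one point and one Möbius map f."""
    return T(SP_inv(f(SP(T_inv(x, y, w, h)))), w, h)


# Multiplying by e^{i*30deg} rotates the sphere about its vertical axis,
# which shifts a point horizontally by 30/360 of the image width.
shifted = forward_fov_transform(
    100.0, 60.0, 360.0, 180.0,
    lambda z: cmath.exp(1j * math.radians(30)) * z,
)
```

Under these assumptions, a pure rotation about the vertical axis changes only the horizontal coordinate, which matches the intuition of looking left or right in the panorama.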

[0067] Secondly, the specific form of the Möbius transform function that satisfies the above preset conditions is shown in the following formula:

[0068] $f(z) = \dfrac{az + b}{cz + d}, \quad a, b, c, d \in \mathbb{C}, \quad ad - bc \neq 0$

[0069] Where $\mathbb{C}$ represents the complex projection plane corresponding to the above equidistant cylindrical projection image. a, b, c, and d each represent a constant complex number in the above complex projection plane. z represents the complex projection variable.

[0070] The inverse field-of-view transformation processing function described above is expressed as $F^{-1}$, the inverse function of the forward field-of-view transformation processing function described above. The aforementioned forward field-of-view transformation feature learning, inverse field-of-view transformation feature learning, and field-of-view transformation sub-branch feature fusion all include convolution operations.
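Because T and SP are invertible, inverting F reduces to inverting the Möbius map f. The following sketch, using the general Möbius form with complex constants a, b, c, d, shows one way to evaluate f and its inverse. The function names `mobius` and `mobius_inverse` are hypothetical.

```python
def mobius(z, a, b, c, d):
    """General Möbius map f(z) = (a*z + b) / (c*z + d), requiring a*d - b*c != 0."""
    if a * d - b * c == 0:
        raise ValueError("degenerate Möbius map: a*d - b*c must be non-zero")
    return (a * z + b) / (c * z + d)


def mobius_inverse(w, a, b, c, d):
    """Inverse map f^{-1}(w) = (d*w - b) / (-c*w + a)."""
    if a * d - b * c == 0:
        raise ValueError("degenerate Möbius map: a*d - b*c must be non-zero")
    return (d * w - b) / (-c * w + a)


# Round trip: applying f and then f^{-1} should return the starting point.
z0 = 0.3 + 0.4j
coeffs = (1 + 2j, 0.5, -0.2j, 2)
round_trip = mobius_inverse(mobius(z0, *coeffs), *coeffs)
```

The inverse field-of-view transformation then has the same structure as the forward one, with `mobius_inverse` in place of `mobius`.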

[0071] Specifically, the difference in rotation angles between the branches in the horizontal field-of-view transformation submodule can be set to 30 degrees. The vertical field-of-view transformation submodule can have two sub-branches, with rotation angles of 30 degrees and -30 degrees respectively.

[0072] Optionally, the above-mentioned zoom field transformation submodule can combine the zoom center parameter with the zoom scale factor parameter.

[0073] As an example, if the spatial size of the module input feature is h×w, then the scaling center coordinates can be ... and ..., and the above set of scaling factors can be {0.8, 1.2, 0.7, 1.3}.

[0074] Optionally, the rotation angle parameters of each branch in the aforementioned horizontal field-of-view transformation submodule and the aforementioned vertical field-of-view transformation submodule, as well as the center parameter and scale factor in the aforementioned scaling field-of-view transformation submodule, can be adjusted according to actual task requirements. The aforementioned sample adaptive field-of-view transformation module can be defined as a feature enhancement module for use in other panoramic feature enhancement processes.

[0075] In some optional implementations of certain embodiments, f represents a Möbius transform function that satisfies preset conditions. Implementing the above-mentioned horizontal field-of-view transformation processing, the above-mentioned vertical field-of-view transformation processing, and the above-mentioned zoom field-of-view transformation processing under different constraints may include:

[0076] First, the Möbius transform function used for the aforementioned horizontal and vertical field-of-view transformation processes is defined as the first Möbius transform function. The specific form of the first Möbius transform function is shown in the following formula:

[0077]

[0078] Where $\bar{b}$ represents the complex conjugate of b, and $\bar{a}$ represents the complex conjugate of a.

[0079] Secondly, in response to the determination that the rotation angle of the direction of the projected spherical vector passing through the origin in the projected spherical coordinate system is within a preset range, the specific forms of the complex numbers a and b in the first Möbius transformation function are as follows:

[0080]

[0081] Where L represents the projected spherical vector passing through the origin in the above-mentioned projected spherical coordinate system. l represents the first coordinate of the above-mentioned projected spherical vector. m represents the second coordinate of the above-mentioned projected spherical vector. n represents the third coordinate of the above-mentioned projected spherical vector. i represents the imaginary unit. θ represents the rotation angle in the above-mentioned direction.

[0082] Then, in response to determining L = (0, 0, 1), the first Möbius transform function is determined as the horizontal field of view transformation processing function.

[0083] Then, in response to determining L = (0, 1, 0), the first Möbius transform function is determined as the vertical field of view transformation processing function.
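The horizontal and vertical cases can be illustrated with a minimal Python sketch. It assumes that the first Möbius transform takes the standard sphere-rotation form f(z) = (az + b) / (-b̄z + ā), and that a and b are derived from the axis L = (l, m, n) and the angle θ as a = cos(θ/2) + i·n·sin(θ/2) and b = (-m + i·l)·sin(θ/2). This parameterization is an assumption matched to a stereographic projection from the north pole; the sign convention in the original formulas may differ. The function names are hypothetical.

```python
import cmath
import math


def rotation_mobius_coeffs(axis, theta):
    """Coefficients (a, b) of a sphere rotation by theta about axis (assumed convention)."""
    l, m, n = axis
    norm = math.sqrt(l * l + m * m + n * n)
    l, m, n = l / norm, m / norm, n / norm
    half = theta / 2.0
    a = complex(math.cos(half), n * math.sin(half))
    b = complex(-m * math.sin(half), l * math.sin(half))
    return a, b


def first_mobius(z, a, b):
    """Rotation-type Möbius map f(z) = (a*z + b) / (-conj(b)*z + conj(a))."""
    return (a * z + b) / (-b.conjugate() * z + a.conjugate())


theta = math.radians(30)

# L = (0, 0, 1): horizontal field-of-view transformation (rotation about the vertical axis).
a_h, b_h = rotation_mobius_coeffs((0, 0, 1), theta)

# L = (0, 1, 0): vertical field-of-view transformation, with two branches at +30 and -30 degrees.
vertical_branches = [rotation_mobius_coeffs((0, 1, 0), sign * theta) for sign in (1, -1)]
```

With L = (0, 0, 1), b vanishes and the map reduces to multiplication by e^{iθ}, a pure horizontal rotation. With L = (0, 1, 0), the map moves content toward or away from the poles, which corresponds to looking up or down.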

[0084] Next, the Möbius transform function used for the above-mentioned scaling view transformation process is determined as the second Möbius transform function. The specific form of the second Möbius transform function is shown in the following formula:

[0085]

[0086] Where a′ represents a complex number in exponential form in the aforementioned projected complex plane. ρ represents the modulus of a′. e′ represents the natural exponent. θ′ represents the argument of a′.

[0087] Subsequently, in response to the determination that θ′=0 and ρ<1, the above second Möbius transform function is determined as a contraction function centered at the origin.

[0088] Finally, in response to the determination that θ′=0 and ρ>1, the above second Möbius transform function is determined as an expansion function centered at the origin.

[0089] Optionally, in response to determining that the scaling center of the target transformation is not at the origin, the scaling center can first be rotated to the origin, the scaling transformation can be applied there, and the result can then be rotated back to the original position.
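The zoom case can be sketched in Python under the assumption that the second Möbius transform has the form f(z) = a′z with a′ = ρ·e^{iθ′}, which is consistent with the contraction (ρ < 1) and expansion (ρ > 1) conditions stated above. For a zoom center away from the origin, the sketch uses a rotation-type Möbius map that carries the center to the origin, scales there, and maps back. The function names `second_mobius` and `zoom_about` are hypothetical.

```python
import cmath


def second_mobius(z, rho, theta_prime=0.0):
    """Scaling-type Möbius map f(z) = rho * e^{i*theta'} * z (assumed form)."""
    return rho * cmath.exp(1j * theta_prime) * z


def zoom_about(z, center, rho):
    """Zoom about an arbitrary center given as a complex-plane coordinate.

    Step 1: a rotation-type Möbius map sends `center` to the origin.
    Step 2: scale about the origin.
    Step 3: the inverse rotation sends the origin back to `center`.
    """
    to_origin = (z - center) / (1 + center.conjugate() * z)
    scaled = second_mobius(to_origin, rho)
    return (scaled + center) / (1 - center.conjugate() * scaled)


# One parallel sub-branch per scaling factor from the example set.
scale_factors = [0.8, 1.2, 0.7, 1.3]
center = 0.3 + 0.1j
branch_outputs = [zoom_about(0.5 + 0.5j, center, rho) for rho in scale_factors]
```

The zoom center is a fixed point of `zoom_about`, so content at the center stays in place while its surroundings contract or expand.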

[0090] As an example, for the above-mentioned field-of-view transformation submodule, reference can be made to Figure 4, which shows the overall framework of the horizontal field-of-view transformation submodule and the zoom field-of-view transformation submodule. Figure 4(a) shows the overall framework of the horizontal field-of-view transformation submodule, where $F_h$ represents the first Möbius transform function mentioned above. Figure 4(b) shows the overall framework of the zoom field-of-view transformation submodule, where $F_{zm}$ represents the second Möbius transform function mentioned above. The steps in Figure 4 correspond to the steps in the main operation flow described above. Here, the vertical field-of-view transformation submodule is similar to the horizontal field-of-view transformation submodule. c1 represents the convolutional neural network. in represents the input. out represents the output. The SE (Squeeze-and-Excitation) network in Figure 4 can be composed of a combination of global average pooling layers, fully connected layers, and classification layers.

[0091] The third sub-step involves performing field-of-view adaptive processing on the multiple stage modulation sub-features based on the sample adaptive field-of-view transformation module, thereby obtaining multiple stage sample adaptive sub-features.

[0092] In some optional implementations of certain embodiments, the process of performing field-view adaptive processing on the multiple stage modulation sub-features to obtain multiple stage sample adaptive sub-features based on the sample adaptive field-view transformation module can include the following steps: performing a preset number of convolution operations on each stage modulation sub-feature of the multiple stages according to the sample adaptive sub-module in the sample adaptive field-view transformation module to obtain the multiple stage sample adaptive sub-features. Wherein, the value of each feature of each stage of the sample adaptive sub-feature is between 0 and 1.

[0093] Optionally, the above sample adaptation module can be represented by the following formula:

[0094]

[0095] Where k represents the stage number. The left-hand side of the formula represents the sample adaptive sub-feature of stage k. Sigmoid represents the classification function. FC represents the fully connected function. ReLU represents the activation function. GAP represents the global average pooling function.
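A minimal Python sketch of such a gating computation follows. It assumes the operators are composed in the usual Squeeze-and-Excitation order, Sigmoid(FC(ReLU(FC(GAP(V))))), and that the output has one weight per branch; both the composition order and the output size are assumptions. A feature is represented as a list of channels, each a 2-D list, and all names and weights are hypothetical.

```python
import math


def gap(feature):
    """Global average pooling: one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def fc(vec, weight, bias):
    """Fully connected layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def sigmoid(vec):
    return [1.0 / (1.0 + math.exp(-v)) for v in vec]


def sample_adaptive_weights(feature, w1, b1, w2, b2):
    """Assumed composition: Sigmoid(FC(ReLU(FC(GAP(feature)))))."""
    return sigmoid(fc(relu(fc(gap(feature), w1, b1)), w2, b2))


# Toy input: 2 channels of size 2x2, gated into 4 branch weights.
feature = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0, 0.0]
omega = sample_adaptive_weights(feature, w1, b1, w2, b2)
```

The final Sigmoid keeps every weight strictly between 0 and 1, in line with the stated value range of the sample adaptive sub-features.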

[0096] The fourth sub-step involves fusing the aforementioned multiple stage field-of-view preservation sub-features, field-of-view transformation sub-features, and sample adaptive sub-features to obtain the aforementioned multiple stage enhanced sub-features.

[0097] In some optional implementations of certain embodiments, the execution entity fuses the multiple stage field-of-view preservation sub-features, field-of-view transformation sub-features, and sample adaptive sub-features to obtain the multiple stage enhanced sub-features, which may include the following steps:

[0098] First, for each of the above stages, the aforementioned field-view preservation sub-features and field-view transformation sub-features are adaptively stacked along the channel dimension based on the aforementioned adaptive sample sub-features. The specific implementation of this adaptive stacking is shown in the following formula:

[0099] $V_f = \mathrm{Concat}(\omega_n \times V_n), \quad n = 0, 1, 2, 3.$

[0100] Where V represents a feature. $V_f$ represents the stacking result feature. $V_0$ represents the above-mentioned field-of-view preservation sub-feature. $V_1$ represents the horizontal field-of-view transformation sub-feature among the above-mentioned field-of-view transformation sub-features. $V_2$ represents the vertical field-of-view transformation sub-feature among the above-mentioned field-of-view transformation sub-features. $V_3$ represents the scaling field-of-view transformation sub-feature among the above-mentioned field-of-view transformation sub-features. Concat represents the stacking operation on the above features along the channel dimension. n represents a constant. $\omega_n$ represents the feature value of the sample adaptive sub-feature. The feature values of the aforementioned sample adaptive sub-feature can range from 0 to 1.

[0101] Optionally, the feature values of the above-mentioned sample adaptive sub-feature can be used to adjust the feature weights of the field-of-view preservation sub-feature and of the horizontal, vertical, and scaling field-of-view transformation sub-features, so that the above-mentioned stacking result feature is applicable to different samples.

[0102] Optionally, the above-mentioned horizontal field of view transformation submodule, vertical field of view transformation submodule, and zoom field of view transformation submodule can be deleted or reduced according to the actual needs of the task.

[0103] Then, perform two convolution operations on the stacked features to obtain the corresponding enhanced sub-features for this stage.
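The adaptive stacking step can be sketched in Python as follows. Each branch feature is represented as a list of channels (2-D lists), each branch is scaled by its weight, and the channel lists are concatenated. The two subsequent convolution operations are omitted, and the function name `adaptive_stack` is hypothetical.

```python
def adaptive_stack(branches, weights):
    """V_f = Concat(omega_n * V_n): scale each branch, then stack along channels."""
    if len(branches) != len(weights):
        raise ValueError("one weight is required per branch")
    stacked = []
    for weight, feature in zip(weights, branches):
        for channel in feature:
            stacked.append([[weight * value for value in row] for row in channel])
    return stacked


# Four branches (preservation, horizontal, vertical, scaling), one 1x2 channel each.
V = [[[[1.0, 2.0]]] for _ in range(4)]
omega = [1.0, 0.5, 0.0, 0.25]
V_f = adaptive_stack(V, omega)
```

A weight near 0 suppresses a branch for the current sample, which has a similar effect to removing that submodule, while a weight near 1 passes the branch through unchanged.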

[0104] The third step is to generate a salient object segmentation result image based on the enhanced sub-features from the above multiple stages.

[0105] Optionally, an output prediction module can be composed of a preset number of convolutional layers. The obtained enhanced sub-features can be segmented based on the output prediction module to obtain a salient object segmentation result image.

[0106] Optionally, during the training phase of the aforementioned preset field-of-view-aware convolutional neural network segmentation model, the enhanced sub-features of each of the multiple stages can be connected to the output prediction module, and the salient object segmentation result of each stage can be supervised, so that each stage learns better features. This improves the accuracy of the salient object segmentation results.
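Per-stage supervision of this kind is often implemented as a sum of per-stage losses. The sketch below assumes a binary cross-entropy loss and assumes that each stage prediction has already been resized to the resolution of the ground-truth mask and flattened; the text does not specify the loss function, so both points are assumptions, and the function names are hypothetical.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between flattened predictions and a flattened mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def deep_supervision_loss(stage_preds, target):
    """Sum of the per-stage losses, so that every stage receives a training signal."""
    return sum(bce(pred, target) for pred in stage_preds)


mask = [1.0, 0.0]
uncertain = deep_supervision_loss([[0.5, 0.5], [0.5, 0.5]], mask)
confident = deep_supervision_loss([[0.9, 0.1], [0.9, 0.1]], mask)
```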

[0107] In practice, using a well-trained convolutional neural network to process equidistant cylindrical projection images can yield more accurate salient object segmentation results, thus improving the reliability of salient object segmentation for panoramic images.

[0108] In practice, the above-mentioned panoramic image salient object segmentation method and related content have achieved salient object segmentation of panoramic images. In the process of salient object segmentation, a series of processing such as feature extraction, feature modulation, and feature enhancement are performed on the panoramic image, which improves the accuracy and reliability of panoramic image salient object segmentation.

[0109] With further reference to Figure 5, as an implementation of the methods shown in the above figures, this disclosure provides some embodiments of a field-of-view-aware panoramic image salient object segmentation apparatus. These apparatus embodiments correspond to the method embodiments shown in Figure 1, and the apparatus can be specifically applied to various electronic devices.

[0110] As shown in Figure 5, a field-of-view-aware panoramic image salient object segmentation apparatus 500 in some embodiments includes: an acquisition unit 501, a projection unit 502, and a segmentation unit 503. The acquisition unit 501 is configured to acquire a panoramic image; the projection unit 502 is configured to project the panoramic image to obtain an equidistant cylindrical projection image and analyze the equidistant cylindrical projection image; the segmentation unit 503 is configured to acquire a preset field-of-view-aware convolutional neural network segmentation model and segment the equidistant cylindrical projection image according to the preset field-of-view-aware convolutional neural network segmentation model to obtain a salient object segmentation result image.

[0111] It can be understood that the units described in the apparatus 500 correspond to the steps in the method described with reference to Figure 2. Therefore, the operations, features, and beneficial effects described above for the method also apply to the apparatus 500 and the units contained therein, and will not be repeated here.

[0112] In some implementations, clients and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected by digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.

[0113] Computer program code for performing operations of some embodiments of this disclosure can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0114] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0115] The above description is merely a selection of preferred embodiments of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in the embodiments of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept. For example, technical solutions formed by substituting the above-described features with (but not limited to) technical features with similar functions disclosed in the embodiments of this disclosure.

Claims

1. A field-of-view-aware panoramic image salient object segmentation method, comprising: acquiring a panoramic image; performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image, and analyzing the equidistant cylindrical projection image; acquiring a preset field-of-view-aware convolutional neural network segmentation model, and segmenting the equidistant cylindrical projection image according to the preset field-of-view-aware convolutional neural network segmentation model to obtain a salient object segmentation result image; wherein the field-of-view-aware convolutional neural network segmentation model includes a basic feature extraction module, a channel adaptation module, a feature fusion module, an output prediction module, and a sample adaptive field-of-view transformation module; the sample adaptive field-of-view transformation module includes a sample adaptive submodule, a scaling field-of-view transformation submodule, a vertical field-of-view transformation submodule, a horizontal field-of-view transformation submodule, a field-of-view preservation submodule, and a preset number of convolution operations; and wherein performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzing the equidistant cylindrical projection image includes: based on the annotations corresponding to the equidistant cylindrical projection image, analyzing the degree of distortion in the salient regions of the equidistant cylindrical projection image to obtain the salient region distortion degree of the equidistant cylindrical projection image, wherein the specific formula for determining the salient region distortion degree is as follows: ..., where ... represents the salient region distortion degree; ... represents the pixel row index number in the pixel image included in the equidistant cylindrical projection image; ... represents the ordinate of a pixel in the pixel image in the planar coordinate system;
... represents the ordinate, in the planar coordinate system, of the pixels in the row of the pixel image with the above pixel row index number; ... represents the pixel spherical coordinate in the projected spherical coordinate system corresponding to the ordinate of a pixel in the pixel image in the planar coordinate system; ... represents the pixel spherical coordinate in the projected spherical coordinate system corresponding to the ordinate, in the planar coordinate system, of the pixels in that row of the pixel image; ... represents the set of corresponding coordinate pairs, each consisting of the pixel coordinate of a pixel in the pixel image in the planar coordinate system and the corresponding pixel spherical coordinate; ... represents the height of the equidistant cylindrical projection image; ... represents the projection operator from the planar coordinate system to the projected spherical coordinate system; ... represents the number of corresponding coordinate pairs in the set of corresponding coordinate pairs; ... represents the salient region in the pixel image corresponding to the ordinate, in the planar coordinate system, of the pixels in that row; and ... represents the salient region on the spherical image corresponding to the pixel spherical coordinate of the ordinate, in the planar coordinate system, of the pixels in that row.

2. The method according to claim 1, wherein segmenting the equidistant cylindrical projection image according to the preset field-of-view-aware convolutional neural network segmentation model to obtain a salient object segmentation result image includes: performing multiple stages of feature extraction, adaptation, and modulation on the equidistant cylindrical projection image to obtain multiple stage modulation sub-features; enhancing the multiple stage modulation sub-features to obtain multiple stage enhanced sub-features, wherein this enhancement applies human visual perception of panoramic images to the neural network's feature learning of the equidistant cylindrical projection image and may include the following step: based on the sample adaptive field-of-view transformation module in the preset field-of-view-aware convolutional neural network segmentation model, enhancing the multiple stage modulation sub-features to obtain the multiple stage enhanced sub-features; and generating a salient object segmentation result image based on the multiple stage enhanced sub-features.

3. The method according to claim 2, wherein performing multiple stages of feature extraction, adaptation, and modulation on the equidistant cylindrical projection image to obtain multiple stage modulation sub-features includes: based on the basic feature extraction module in the preset field-of-view-aware convolutional neural network segmentation model, performing multiple stages of feature extraction on the equidistant cylindrical projection image to obtain multiple stage basic sub-features; based on the channel adaptation module in the preset field-of-view-aware convolutional neural network segmentation model, adapting the multiple stage basic sub-features to obtain multiple stage adapted sub-features; and based on the feature fusion module in the preset field-of-view-aware convolutional neural network segmentation model, modulating the multiple stage adapted sub-features to obtain the multiple stage modulation sub-features.

4. The method according to claim 3, wherein, The basic feature extraction module in the preset field-of-view perception convolutional neural network segmentation model performs multi-stage feature extraction on the equidistant cylindrical projection image to obtain multiple stage basic sub-features, including: Based on the basic feature extraction module in the preset field-of-view perception convolutional neural network segmentation model, the equidistant cylindrical projection image is subjected to five stages of feature extraction in sequence to obtain multiple basic sub-features corresponding to the second, third, fourth and fifth stages respectively.

5. The method according to claim 4, wherein, The process involves adapting the multiple stage-specific basic sub-features according to the channel adaptation module in the preset field-aware convolutional neural network segmentation model, resulting in multiple stage-adapted sub-features, including: Two convolution operations are performed on the basic sub-features of the multiple stages to obtain multiple stage-adaptive sub-features. The spatial size of the convolution kernels in the two convolution operations is 3×3 and 1×1, respectively, and the number of convolution kernels in the two convolution operations decreases sequentially.

6. The method according to claim 5, wherein modulating the multiple stage adapted sub-features according to the feature fusion module in the preset field-of-view-aware convolutional neural network segmentation model to obtain multiple stage modulation sub-features includes: based on the feature fusion module of the current stage, modulating the adapted sub-feature of each stage together with the enhanced sub-feature of the higher-level stage adjacent to the current stage to obtain the modulation sub-feature of the current stage; and in response to determining that the enhanced sub-feature of the higher-level stage adjacent to the current stage does not exist, using only the adapted sub-feature of the current stage as the input of the feature fusion module, wherein the feature fusion module includes a preset number of convolution operations.

7. The method according to claim 6, wherein enhancing the multiple stage modulation sub-features according to the sample adaptive field-of-view transformation module in the preset field-of-view-aware convolutional neural network segmentation model to obtain multiple stage enhanced sub-features includes: based on the field-of-view preservation submodule in the sample adaptive field-of-view transformation module, performing field-of-view preservation processing on the multiple stage modulation sub-features to obtain multiple stage field-of-view preservation sub-features; based on the field-of-view transformation submodule in the sample adaptive field-of-view transformation module, performing field-of-view transformation processing on the multiple stage modulation sub-features to obtain multiple stage field-of-view transformation sub-features; based on the sample adaptive submodule in the sample adaptive field-of-view transformation module, performing field-of-view adaptive processing on the multiple stage modulation sub-features to obtain multiple stage sample adaptive sub-features, wherein the field-of-view adaptive processing includes the following step: the sample adaptive submodule in the sample adaptive field-of-view transformation module performs a preset number of convolution operations on the modulation sub-feature of each of the multiple stages to obtain the multiple stage sample adaptive sub-features, wherein each feature value of the sample adaptive sub-feature of each stage is between 0 and 1; and fusing the multiple stage field-of-view preservation sub-features, field-of-view transformation sub-features, and sample adaptive sub-features to obtain the multiple stage enhanced sub-features.
The fusion of these features includes the following steps: for each of the multiple stages, the field-of-view preservation sub-feature and the field-of-view transformation sub-features are adaptively stacked along the channel dimension based on the sample adaptive sub-feature, wherein the specific implementation of the adaptive stacking is shown in the following formula: $V_f = \mathrm{Concat}(\omega_n \times V_n), \ n = 0, 1, 2, 3$, where V represents a feature; $V_f$ represents the stacking result feature; $V_0$ represents the field-of-view preservation sub-feature; $V_1$ represents the horizontal field-of-view transformation sub-feature among the field-of-view transformation sub-features; $V_2$ represents the vertical field-of-view transformation sub-feature among the field-of-view transformation sub-features; $V_3$ represents the scaling field-of-view transformation sub-feature among the field-of-view transformation sub-features; Concat represents the stacking operation on the above features along the channel dimension; n represents a constant; and $\omega_n$ represents the feature value of the sample adaptive sub-feature, which is between 0 and 1; and the stacking result feature is subjected to two convolution operations to obtain the corresponding enhanced sub-feature for this stage.

8. The method according to claim 7, wherein, The view preservation submodule in the sample adaptive view transformation module performs view preservation processing on the multiple stage modulation sub-features to obtain multiple stage view preservation sub-features, including: Based on the field-view preservation submodule in the sample adaptive field-view transformation module, each stage modulation sub-feature of the multiple stages is subjected to a preset number of convolution processes to obtain multiple stage field-view preservation sub-features.

9. The method according to claim 8, wherein, The view transformation submodule in the sample adaptive view transformation module performs view transformation processing on the multiple stage modulation sub-features to obtain multiple stage view transformation sub-features, including: Based on the horizontal field of view transformation submodule in the field of view transformation submodule, horizontal field of view transformation processing is performed on each stage modulation sub-feature of the multiple stage modulation sub-features to obtain multiple stage horizontal field of view transformation sub-features. Based on the vertical field of view transformation submodule in the field of view transformation submodule, vertical field of view transformation processing is performed on each stage modulation sub-feature of the multiple stage modulation sub-features to obtain multiple stage vertical field of view transformation sub-features. According to the scaling field of view transformation submodule in the field of view transformation submodule, scaling field of view transformation is performed on each stage modulation sub-feature of the multiple stage modulation sub-features to obtain multiple stage scaling field of view transformation sub-features. Based on each stage of the horizontal field of view transformation sub-features, each stage of the vertical field of view transformation sub-features, and each stage of the zoom field of view transformation sub-features, a field of view transformation sub-feature for each stage is obtained. The human eye's observation behavior of a panoramic image is divided into horizontal left-right observation behavior, vertical up-down observation behavior, and front-back near-far observation behavior. These behaviors are then applied to the neural network's feature learning of the equidistant cylindrical projection image. 
The transformation parts of the horizontal field-of-view transformation submodule, the vertical field-of-view transformation submodule, and the zoom field-of-view transformation submodule each contain several parallel sub-branches; the sub-branches share the same form but have different parameters. Although the specific transformation functions for the horizontal, vertical, and zoom field-of-view transformation processing differ, the main operational flow is the same. This main operational flow comprises: forward field-of-view transformation processing, forward field-of-view transformation feature learning, inverse field-of-view transformation processing, inverse field-of-view transformation feature learning, and field-of-view transformation sub-branch feature fusion.

The forward field-of-view transformation processing function is shown in the following formula:

$$p' = T(p) = S\left(P^{-1}\left(M\left(P\left(S^{-1}(p)\right)\right)\right)\right)$$

where $T$ represents the forward field-of-view transformation processing function; $p$ represents the projection point coordinates of the modulation sub-feature in the planar coordinate system; $p'$ represents the transformed projection point coordinates, after the forward field-of-view transformation processing, in the planar coordinate system corresponding to the equidistant cylindrical projection image; $S$ represents the spherical coordinate transformation function, which transforms projection spherical coordinates in the projection spherical coordinate system corresponding to the equidistant cylindrical projection image into planar coordinates in the planar coordinate system corresponding to that image; $S^{-1}$ represents the inverse function of $S$; $P$ represents the stereographic (spherical polar) projection function, which projects spherical coordinates in the projection spherical coordinate system onto projection complex plane coordinates in the projection complex plane coordinate system corresponding to the equidistant cylindrical projection image; $P^{-1}$ represents the inverse function of $P$; and $M$ represents the Möbius transformation function satisfying preset conditions, which implements the horizontal, vertical, and zoom field-of-view transformations under different preset conditions.

The specific form of the Möbius transformation function is shown in the following formula:

$$M(z) = \frac{az + b}{cz + d}, \qquad a, b, c, d \in \mathbb{C}$$

where $\mathbb{C}$ represents the projection complex plane corresponding to the equidistant cylindrical projection image; $a$, $b$, $c$, and $d$ each represent a constant complex number in the projection complex plane; and $z$ represents the projection complex variable.

The inverse field-of-view transformation processing function is the inverse function of the forward field-of-view transformation processing function.
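The forward transformation described above chains five maps: plane to sphere, sphere to complex plane, a Möbius transformation, complex plane back to sphere, and sphere back to plane. The following is a minimal sketch of that chain for a single point, under stated assumptions: pixel coordinates `(u, v)` with longitude linear in `u` and latitude linear in `v`, and stereographic projection taken from the north pole `(0, 0, 1)`. The patent text does not fix these conventions, and all function names are hypothetical.

```python
import math


def plane_to_sphere(u, v, width, height):
    """Equirectangular pixel (u, v) -> point (x, y, z) on the unit sphere."""
    lon = (u / width) * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v / height) * math.pi   # latitude in [-pi/2, pi/2]
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def sphere_to_plane(x, y, z, width, height):
    """Unit-sphere point -> equirectangular pixel (u, v)."""
    lon = math.atan2(y, x)
    lat = math.asin(max(-1.0, min(1.0, z)))
    return ((lon + math.pi) / (2.0 * math.pi) * width,
            (math.pi / 2.0 - lat) / math.pi * height)


def stereographic(x, y, z):
    """Project from the north pole onto the complex plane (undefined at z = 1)."""
    return complex(x, y) / (1.0 - z)


def inverse_stereographic(w):
    """Complex-plane point -> unit-sphere point."""
    r2 = abs(w) ** 2
    return (2.0 * w.real / (r2 + 1.0),
            2.0 * w.imag / (r2 + 1.0),
            (r2 - 1.0) / (r2 + 1.0))


def mobius(w, a, b, c, d):
    """General Moebius map (a w + b) / (c w + d); requires a d - b c != 0."""
    if a * d - b * c == 0:
        raise ValueError("degenerate Moebius coefficients")
    return (a * w + b) / (c * w + d)


def inverse_mobius(w, a, b, c, d):
    """Inverse of mobius(): (d w - b) / (-c w + a)."""
    return (d * w - b) / (-c * w + a)


def forward_view_transform(u, v, width, height, coeffs):
    """plane -> sphere -> complex plane -> Moebius -> sphere -> plane."""
    w = stereographic(*plane_to_sphere(u, v, width, height))
    return sphere_to_plane(*inverse_stereographic(mobius(w, *coeffs)), width, height)


def inverse_view_transform(u, v, width, height, coeffs):
    """Same chain with the inverse Moebius map."""
    w = stereographic(*plane_to_sphere(u, v, width, height))
    return sphere_to_plane(*inverse_stereographic(inverse_mobius(w, *coeffs)), width, height)
```

Applying `forward_view_transform` and then `inverse_view_transform` with the same coefficients returns the starting pixel up to floating-point error, which is the property the inverse field-of-view transformation relies on. A feature-map version would evaluate this mapping on a coordinate grid and resample the features, a step not shown here.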
The forward field-of-view transformation feature learning, the inverse field-of-view transformation feature learning, and the field-of-view transformation sub-branch feature fusion each include convolution operations.

The Möbius transformation function satisfying preset conditions implementing the horizontal field-of-view transformation, the vertical field-of-view transformation, and the zoom field-of-view transformation under different preset conditions comprises:

determining the Möbius transformation function used for the horizontal and vertical field-of-view transformation processing as a first Möbius transformation function, wherein the specific form of the first Möbius transformation function is shown in the following formula: [formula missing], in which the conjugates of two of its complex numbers appear;

in response to determining that the direction rotation angle of a vector passing through the origin in the projection spherical coordinate system is within a preset range, giving the specific form of those two complex numbers of the first Möbius transformation function by the following formula: [formula missing], whose terms denote the projection spherical vector passing through the origin in the projection spherical coordinate system, the first, second, and third coordinates of that projection spherical vector, the imaginary unit, and the direction rotation angle;

in response to determining that [condition missing], determining the first Möbius transformation function as the horizontal field-of-view transformation processing function;

in response to determining that [condition missing], determining the first Möbius transformation function as the vertical field-of-view transformation processing function;

determining the Möbius transformation function used for the zoom field-of-view transformation processing as a second Möbius transformation function, wherein the specific form of the second Möbius transformation function is shown in the following formula: [formula missing], whose terms denote a complex exponential in the projection complex plane, the modulus of that complex exponential, the base of the natural exponential, and the argument of that complex exponential;

in response to determining that [condition missing], determining the second Möbius transformation function as a field-of-view contraction function centered at the origin; and

in response to determining that [condition missing], determining the second Möbius transformation function as a field-of-view expansion function centered at the origin.
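The two families of Möbius transformations named in this claim can be illustrated with standard forms from complex analysis. The sketch below is not taken from the patent, whose formulas and conditions are not legible in this text. It assumes the unitary form f(z) = (a z + b) / (-conj(b) z + conj(a)) with |a|^2 + |b|^2 = 1 for sphere rotations, which is consistent with the claim's mention of two conjugates, and f(z) = rho * e^(i*theta) * z for zoom. It also assumes stereographic projection from the north pole, so that the argument of z is the longitude. All function names are hypothetical.

```python
import cmath
import math


def rotation_mobius(a, b):
    """Return f(z) = (a z + b) / (-conj(b) z + conj(a)), a sphere rotation."""
    a, b = complex(a), complex(b)
    if not math.isclose(abs(a) ** 2 + abs(b) ** 2, 1.0, rel_tol=1e-9):
        raise ValueError("requires |a|^2 + |b|^2 = 1")
    return lambda z: (a * z + b) / (-b.conjugate() * z + a.conjugate())


def horizontal_rotation(theta):
    """Rotation about the polar axis: a = e^{i theta/2}, b = 0, so f(z) = e^{i theta} z."""
    return rotation_mobius(cmath.exp(0.5j * theta), 0.0)


def vertical_rotation(theta):
    """Rotation about a horizontal axis: a = cos(theta/2), b = sin(theta/2)."""
    return rotation_mobius(math.cos(theta / 2.0), math.sin(theta / 2.0))


def zoom_mobius(rho, theta=0.0):
    """Scaling about the plane origin: f(z) = rho * e^{i theta} * z."""
    lam = rho * cmath.exp(1j * theta)
    return lambda z: lam * z


def longitude(z):
    """Longitude of the sphere point whose stereographic image is z."""
    return cmath.phase(z)


def latitude(z):
    """Latitude of the sphere point whose stereographic image is z."""
    r2 = abs(z) ** 2
    return math.asin((r2 - 1.0) / (r2 + 1.0))


z0 = complex(1.0, 0.0)                        # longitude 0, latitude 0
z_h = horizontal_rotation(math.pi / 4)(z0)    # longitude changes, latitude does not
z_v = vertical_rotation(math.pi / 6)(z0)      # latitude changes, longitude does not
z_s = zoom_mobius(2.0)(z0)                    # modulus doubles
```

Under these conventions the horizontal case changes only longitude, the vertical case moves the sample point along its meridian, and scaling by `rho` pushes points away from or toward the plane origin. Which range of `rho` the patent labels contraction and which expansion depends on conventions that cannot be recovered from this text.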