Target positioning method based on shared multi-layer perceptron and sliding window transformer

By combining a network structure of a shared multilayer perceptron and a sliding window transformer, the problem of geometric information loss in image target localization is solved, achieving high-precision and fast target localization results.

CN117593374B · Active · Publication Date: 2026-06-30 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SOUTHEAST UNIV
Filing Date: 2023-12-22
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing image target localization methods are not very accurate in complex backgrounds, especially in open vocabulary detection and weakly supervised target localization techniques, where the geometric information of the image is severely lost, affecting the localization accuracy.

Method used

A network structure combining a shared multilayer perceptron (Shared MLP) and a sliding window transformer (Swin Transformer) is adopted. Features are extracted hierarchically and locally, and coordinates and confidence are predicted separately. Residual convolutional layers and the Swin Transformer abstract the features, while the Shared MLP reduces the total number of weights, improving training speed and accuracy.

Benefits of technology

It achieves high-precision target positioning in complex backgrounds, improving positioning accuracy and computation speed, and outperforming traditional methods.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a target localization method based on a shared multilayer perceptron and a sliding window transformer to locate targets in an image and determine their pixel coordinates. The method comprises a coordinate prediction part and a confidence prediction part. Each image frame is divided into several small square regions. For each small square region, a set of coordinates and its confidence score are predicted, and the set of coordinates with the highest confidence score is used as the final prediction result. In the coordinate prediction part, residual convolutional layers are used to abstract the features of each square region, and a shared multilayer perceptron (Shared MLP) is used to summarize the coordinate prediction results. In the confidence prediction part, a Swin Transformer is used to abstract the features of each square region, and a Shared MLP is used to summarize the confidence prediction results. This invention combines the Swin Transformer, the Shared MLP, and residual convolutional layers from deep neural networks, proposing a novel network structure for image target localization that outperforms traditional methods.

Description

Technical Field

[0001] This invention belongs to the field of computer vision, specifically relating to a target localization method based on a shared multilayer perceptron and a sliding window transformer.

Background Technology

[0002] In the field of image object localization, recent research has mainly focused on open-vocabulary detection and weakly supervised object localization. The former enables an agent to locate unknown or nearly unseen objects, while the latter trains the agent using only image-level category labels to predict the location of objects in an image. Although these studies have achieved preliminary results, they are still some distance from accurate localization in complex backgrounds.

[0003] Traditional deep learning object localization algorithms, while employing different methods to abstract image features, often rely on fully connected layers to arrive at the final result, resulting in some loss of geometric information. Shared MLP (Shared Multilayer Perceptron), a widely used structure in point cloud processing, is characterized by different parts of the network sharing the same weights and parameters to abstract the chaotic point cloud features. The network structure proposed in this invention does not rely on fully connected layers in the feature summarization stage; instead, it applies Shared MLP to image feature processing, maximizing the preservation of geometric features and reducing the total number of weights. Experiments show that using Shared MLP accelerates training and convergence.

[0004] The Swin Transformer (sliding window transformer) is an improved structure based on the Transformer, which hierarchically and locally extracts features. Specifically, it extracts features hierarchically using modules of different sizes and performs computation locally within a sequentially sliding window. It has demonstrated excellent performance on numerous image datasets. In the confidence prediction section of this invention, the Swin Transformer is used to abstract features. Experiments show that using the Swin Transformer in confidence prediction can effectively improve prediction accuracy.

Summary of the Invention

[0005] To address the aforementioned problems, this invention discloses a target localization method based on a shared multilayer perceptron and a sliding window transformer. The method includes a coordinate prediction part and a confidence prediction part. Each image frame is divided into several small square regions. For each small square region, a set of coordinates and its confidence score are predicted, and the set of coordinates with the highest confidence score is used as the final prediction result. In the coordinate prediction part, residual convolutional layers are used to abstract the features of each square region, and the coordinate prediction results are summarized using a shared multilayer perceptron. In the confidence prediction part, a Swin Transformer is used to abstract the features of each square region, and the confidence prediction results are summarized using a shared multilayer perceptron.

[0006] To achieve the above objectives, the technical solution of the present invention is as follows:

[0007] The target localization method based on shared multilayer perceptron and sliding window transformer includes the following steps:

[0008] (1) Perform a bias-free convolution on the input image to extract the first layer of features, and divide the image evenly into several square regions;

[0009] The input image is a tensor representing the pixel values of each point in a three-channel image (red, green, and blue). This image may contain the target to be located. The tensor size is (3, m, n), where 3 represents the three channels, and m and n represent the length and width of the input image, respectively.

[0010] The first layer of features is a tensor obtained by convolving the input image without bias.

[0011] The image is divided into several square regions, i regions along its length and k regions along its width. These square regions are identical in shape and size. Therefore, m and n should be divisible by i and k, respectively.
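The division in step (1) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: i and k are taken to count the regions along the length and width (as in paragraph [0013]), the function names region_grid and region_of_pixel are hypothetical, and the values 448 and 7 are the ones used later in the embodiment.

```python
def region_grid(m, n, i, k):
    """Return the pixel size (cell_len, cell_wid) of each region when an
    m x n image is split into i regions along its length and k along its width."""
    if m % i != 0 or n % k != 0:
        raise ValueError("m must be divisible by i and n by k")
    return m // i, n // k


def region_of_pixel(row, col, m, n, i, k):
    """Return the (region_row, region_col) index of the region containing a pixel."""
    cell_len, cell_wid = region_grid(m, n, i, k)
    return row // cell_len, col // cell_wid


# Embodiment values: a 448 x 448 image split into 7 x 7 regions.
print(region_grid(448, 448, 7, 7))                 # (64, 64)
print(region_of_pixel(100, 300, 448, 448, 7, 7))   # (1, 4)
```

With these values every region covers 64 x 64 pixels, so each of the 49 regions later receives one coordinate prediction and one confidence score.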

[0012] (2) The first layer features are used as inputs to the coordinate prediction part and the confidence prediction part, respectively, and the corresponding output tensors are obtained.

[0013] The coordinate prediction part includes residual convolutional layers, fully connected layers, and a Shared MLP. The input to this part is the first layer of features, and the output tensor size is (2n, i, k), where n is the number of targets to be located (not the image width n of step (1)), and i and k are the numbers of square regions along the length and width of the input image, respectively. This tensor represents the coordinates of each target for each square region. This part includes the following sub-steps:

[0014] (2.1a) A residual convolutional network is used to abstract the first layer features of the input to increase the feature dimension. Taking advantage of the property of changing the image size after convolution, the size of the abstracted features is adjusted to match each square region. Assuming the dimension of the abstracted features is dim, then the size of the abstracted features is (dim, i, k), which is called the region coordinate feature.

(2.1b) A fully connected layer is used to further abstract the region coordinate features of size (dim, i, k) in (2.1a), resulting in a feature of size (dim, 1, 1), which is called the global feature.

[0015] (2.1c) The global feature described in (2.1b) is copied i and k times in the second and third dimensions respectively to obtain a global feature of size (dim,i,k). This global feature is then concatenated with the regional coordinate feature in (2.1a) to obtain the total feature (2dim,i,k).

[0016] (2.1d) The total features from (2.1c) are summarized using a Shared MLP to obtain an output tensor of size (2n, i, k). This tensor represents the coordinates of each target for each square region. The Shared MLP applies the same weights at every position along the i and k dimensions.
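Sub-steps (2.1c) and (2.1d) can be illustrated with a small sketch in plain Python, where nested lists stand in for tensors. Several things here are assumptions rather than details from the patent: the helper names tile_and_concat and shared_mlp are hypothetical, the Shared MLP is reduced to a single linear layer without activation, a per-channel mean stands in for the fully connected layer of (2.1b), and the tiny sizes (dim = 4, i = k = 3) are chosen only to keep the example readable.

```python
import random


def tile_and_concat(region_feat, global_feat):
    """Copy a (dim,) global vector to every cell and append it to the
    (dim, i, k) region features, giving a (2*dim, i, k) nested list."""
    i, k = len(region_feat[0]), len(region_feat[0][0])
    tiled = [[[g] * k for _ in range(i)] for g in global_feat]
    return region_feat + tiled


def shared_mlp(feat, weights, biases):
    """Apply one linear layer with the SAME weights to every (row, col) cell.
    feat: (c_in, i, k); weights: (c_out, c_in); biases: (c_out,).
    Returns a (c_out, i, k) nested list."""
    c_in, i, k = len(feat), len(feat[0]), len(feat[0][0])
    out = []
    for w_row, b in zip(weights, biases):
        plane = [[b + sum(w_row[c] * feat[c][r][q] for c in range(c_in))
                  for q in range(k)] for r in range(i)]
        out.append(plane)
    return out


random.seed(0)
dim, i, k, n = 4, 3, 3, 1
region_feat = [[[random.random() for _ in range(k)] for _ in range(i)]
               for _ in range(dim)]
# Stand-in for the fully connected layer of (2.1b): a per-channel mean.
global_feat = [sum(map(sum, plane)) / (i * k) for plane in region_feat]
total = tile_and_concat(region_feat, global_feat)        # (2*dim, i, k)
weights = [[random.uniform(-1, 1) for _ in range(2 * dim)] for _ in range(2 * n)]
biases = [0.0] * (2 * n)
coords = shared_mlp(total, weights, biases)              # (2n, i, k)
print(len(total), len(coords), len(coords[0]), len(coords[0][0]))  # 8 2 3 3
```

Because the same weight matrix is reused for every cell, the layout of the i x k grid is carried through to the output unchanged, which is the property the description relies on to preserve geometric information.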

[0017] The confidence prediction part includes a Swin Transformer and a Shared MLP layer. The input to this part is the first layer features, and the output tensor size is (2n, i, k). This tensor represents the probability of each target existing in each square region; the probability of its existence is called the confidence level. This part includes the following steps:

[0018] (2.2a) The first-layer features are abstracted using the Swin Transformer. Leveraging the feature-sizing property of the Stage module (the main module of the Swin Transformer), the size of the abstracted features is adjusted to match each square region. Assuming the dimension of the abstracted features is dimS, then the size of the abstracted features is (dimS, i, k), which is called the region confidence feature. It should be noted that this step only uses the Stage module (the main module of the Swin Transformer), and the final fully connected layer of the standard Swin Transformer is discarded in this step.

[0019] (2.2b) The Shared MLP is used to summarize the region confidence features in (2.2a) to obtain an output tensor of size (2n, i, k). This tensor represents the probability of each target existing in each square region. The Shared MLP applies the same weights at every position along the i and k dimensions.
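The Swin Transformer used in (2.2a) computes attention inside local windows rather than over the whole feature map. As a rough illustration only, the sketch below groups the cells of a small grid into windows, with an optional cyclic shift in the style of the shifted-window scheme described in the Swin Transformer paper. The function name window_partition and the 4 x 4 grid are hypothetical, and the attention masking that a full implementation applies to windows that wrap around the border is omitted.

```python
def window_partition(h, w, win, shift=0):
    """Group the positions of an h x w feature grid into win x win windows.
    With shift > 0 the grid is first rolled cyclically by `shift` cells.
    Returns a list of windows, each a list of (row, col) positions
    in the original (unshifted) grid."""
    if h % win or w % win:
        raise ValueError("grid size must be a multiple of the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([((top + r + shift) % h, (left + c + shift) % w)
                            for r in range(win) for c in range(win)])
    return windows


regular = window_partition(4, 4, 2)
shifted = window_partition(4, 4, 2, shift=1)
print(regular[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(shifted[0])  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Alternating between the regular and the shifted partition lets information pass between neighbouring windows while each attention computation stays local.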

[0020] (3) Based on the output tensor of the confidence prediction part in step (2), find the square region where the target is located by the maximum value index, and then obtain the target coordinate tensor based on the output tensor of the coordinate prediction part in step (2).

[0021] The target coordinate tensor has a size of (n, 4). This tensor represents the coordinates of each target in the input image and the probability of its existence.

[0022] (4) Set a threshold t. Targets in the target coordinate tensor with a confidence level less than t are considered non-existent. For targets that are considered to exist, their coordinate tensors are obtained. At this point, the target localization task is completed.
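Steps (3) and (4) amount to an argmax over the confidence map followed by a threshold test. The sketch below shows this for a single target and is a simplification under stated assumptions: the function name locate is hypothetical, the confidence output is taken to be already reduced to one value in [0, 1] per region, and the result holds only x, y, and confidence, since the text does not spell out the layout of the (n, 4) target coordinate tensor.

```python
def locate(coords, conf, t):
    """coords: (2, i, k) nested list of per-region (x, y) predictions for one target.
    conf: (i, k) nested list of per-region confidences, assumed to lie in [0, 1].
    Returns (x, y, confidence), or None when the best confidence is below t."""
    i, k = len(conf), len(conf[0])
    # Step (3): index of the region with the highest confidence.
    r, q = max(((r, q) for r in range(i) for q in range(k)),
               key=lambda rq: conf[rq[0]][rq[1]])
    best = conf[r][q]
    # Step (4): a target whose confidence is below the threshold is treated as absent.
    if best < t:
        return None
    return coords[0][r][q], coords[1][r][q], best


conf = [[0.1, 0.2],
        [0.9, 0.3]]
coords = [[[10.0, 20.0], [30.0, 40.0]],   # x prediction of each region
          [[11.0, 21.0], [31.0, 41.0]]]   # y prediction of each region
print(locate(coords, conf, 0.5))    # (30.0, 31.0, 0.9)
print(locate(coords, conf, 0.95))   # None
```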

[0023] The beneficial effects of this invention are:

[0024] This invention discloses a target localization method based on a shared multilayer perceptron and a sliding window transformer. It organically combines structures such as the Swin Transformer, Shared MLP, and residual convolution in deep neural networks, and proposes a novel network structure for image target localization. This method has high accuracy, fast computation speed, and superior performance compared to traditional methods.

Attached Figure Description

[0025] Figure 1 is a schematic diagram of the target localization method based on a shared multilayer perceptron and a sliding window transformer.

[0026] Figure 2 is a schematic diagram of a shared multilayer perceptron (Shared MLP).

[0027] Figure 3 demonstrates the effectiveness of this invention: taking an image of a hand holding a small ball as an example, the geometric center of its outline is located. The position of the small circle represents the positioning result, while the box is added manually.

Detailed Implementation

[0028] The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that the following specific embodiments are for illustrative purposes only and are not intended to limit the scope of the invention.

[0029] The following detailed description is provided with reference to steps (1) to (4) in the invention description and the following conditions:

[0030] (a) The tensor size of the input image in step (1) of the invention is (3, 448, 448), that is, m = n = 448.

[0031] (b) The image described in step (1) of the invention is divided evenly into square regions, seven along its length and seven along its width, i.e., i = k = 7.

[0032] (c) The coordinate prediction part in step (2) of the invention, wherein the number of targets to be located is 1, i.e., n = 1.

[0033] (d) The coordinate prediction part in step (2) of the invention, wherein the dimension of the abstracted feature is 512, i.e., dim = 512.

[0034] (e) The confidence prediction part in step (2) of the invention content, wherein the abstracted feature dimension is 768, i.e., dimS = 768.

[0035] (f) The threshold in step (4) of the invention's content is 0.5, i.e., t = 0.5: a target with a confidence level less than 0.5 is considered non-existent.

[0036] Referring to the conditions in (a)-(f), the target localization method based on a shared multilayer perceptron and a sliding window transformer is specifically implemented as follows:

[0037] (1) Perform a bias-free convolution on the input image to extract the first layer of features, and divide the image evenly into several square regions;

[0038] The input image is a tensor representing the pixel values of each point in a three-channel image (red, green, and blue). This image may contain the target to be located. The tensor size is (3, 448, 448), where 3 represents the three channels and 448 represents the length and width of the input image.

[0039] The first layer of features is a tensor obtained by convolving the input image without bias.

[0040] The image is divided into several square regions, with a length of 7 square regions and a width of 7 square regions. These square regions are all the same size and shape.

[0041] (2) The first layer features are used as inputs to the coordinate prediction part and the confidence prediction part, respectively, and the corresponding output tensors are obtained.

[0042] The coordinate prediction part includes residual convolutional layers, fully connected layers, and a Shared MLP. The input to this part is the first layer's features, and the output tensor size is (2,7,7), where 2 = 2n with n = 1, since the single target to be located has two coordinates, and each 7 is the number of square regions along the length and width of the input image, respectively. This tensor represents the coordinates of each target within each square region. This part includes the following steps, whose flow is shown in the coordinate prediction part of Figure 1.

[0043] (2.1a) A residual convolutional network is used to abstract the first layer of input features to increase the feature dimension. Taking advantage of the image size change after convolution, the size of the abstracted features is adjusted to match each square region. Assuming the feature dimension after abstraction is 512, the size of the abstracted features is (512, 7, 7), which is called the region coordinate feature.

[0044] (2.1b) Using a fully connected layer, the region coordinate features of size (512,7,7) in (2.1a) are further abstracted to obtain features of size (512,1,1), which are called global features.

[0045] (2.1c) The global feature described in (2.1b) is copied 7 times in the second and third dimensions respectively to obtain a global feature of size (512,7,7). This global feature is then concatenated with the regional coordinate feature in (2.1a) to obtain the total feature (1024,7,7).

[0046] (2.1d) The total features from (2.1c) are summarized using the Shared MLP to obtain an output tensor of size (2,7,7). This tensor represents the coordinates of each target for each square region. The Shared MLP applies the same weights at every position along the two dimensions of size 7.
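The reduction in weights that motivates the Shared MLP can be checked with simple arithmetic on the embodiment's sizes. The comparison below is a sketch under an assumption that is not stated in the patent: both alternatives are modelled as a single linear layer with bias, mapping the (1024,7,7) total features to the (2,7,7) output. The helper name linear_params is hypothetical.

```python
def linear_params(n_in, n_out):
    """Number of weights plus biases in one linear layer."""
    return n_in * n_out + n_out


c_in, c_out, i, k = 1024, 2, 7, 7

# Shared MLP: one 1024 -> 2 layer reused at each of the 7 x 7 positions.
shared = linear_params(c_in, c_out)

# Fully connected alternative: flatten everything, then map 50176 -> 98.
dense = linear_params(c_in * i * k, c_out * i * k)

print(shared)            # 2050
print(dense)             # 4917346
print(dense // shared)   # 2398
```

Under this single-layer assumption the shared layer needs roughly 2,400 times fewer parameters than a fully connected layer over the flattened features.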

[0047] The confidence prediction part includes a Swin Transformer and a Shared MLP layer. The input to this part is the first layer features, and the output tensor size is (2,7,7). This tensor represents the probability of each target existing in each square region; the probability of its existence is called the confidence level. This part includes the following steps:

[0048] (2.2a) The first-layer features are abstracted using the Swin Transformer. Leveraging the feature-sizing property of the Stage module (the main module of the Swin Transformer), the size of the abstracted features is adjusted to match each square region. Assuming the feature dimension after abstraction is 768, the size of the abstracted features is (768, 7, 7), which is called the region confidence feature. It is important to note that this step only uses the Stage module (the main module of the Swin Transformer), and the final fully connected layer of the standard Swin Transformer is discarded in this step.

[0049] (2.2b) The Shared MLP is used to summarize the region confidence features in (2.2a) to obtain an output tensor of size (2,7,7). This tensor represents the probability of each target existing in each square region. The Shared MLP applies the same weights at every position along the two dimensions of size 7.

[0050] (3) Based on the output tensor of the confidence prediction part in step (2), find the square region where the target is located by the maximum value index, and then obtain the target coordinate tensor based on the output tensor of the coordinate prediction part in step (2).

[0051] The target coordinate tensor has a size of (1,4). This tensor represents the coordinates of each target in the input image and the probability of its existence.

[0052] (4) Set the threshold to 0.5. Targets in the target coordinate tensor with a confidence level less than 0.5 are considered non-existent. For targets that are considered to exist, their coordinate tensors are obtained. At this point, the target localization task is completed.

[0053] In one specific embodiment, the effectiveness of the method of the present invention is verified through experiments. The experiments include performance testing and system ablation experiments. The experimental system setup is as follows:

[0054] 1. In terms of hardware, it is based on the Kinect2 color camera and RTX 2080Ti GPU.

[0055] 2. In terms of software, the above-mentioned functional code was written based on Python 3.9 and PyTorch 1.12.1.

[0056] 3. During the training phase, the neural network described in this invention is trained using hundreds of images containing a human hand in different positions. Training uses the Adam optimizer, an adaptive gradient method.

[0057] Its specific implementation effects are shown in Figure 3. Taking an image of a hand holding a small ball as an example, the geometric center of its outline is located. The positions of the small circles are the positioning results, while the boxes are added manually.

[0058] In performance testing, this invention was compared with other similar image feature extraction algorithms on the same task. The results show that the present invention has the lowest mean square error in positioning coordinates and the fastest convergence speed. The experimental results are shown in Table 1.

[0059] Table 1 shows the performance of different methods implemented for this task.

[0060]

[0061] In Table 1, Reference 1 is:

[0062] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

[0063] Reference 2 is:

[0064] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.

[0065] Reference 3 is:

[0066] Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 10012-10022.

[0067] Reference 4 is:

[0068] Zhao G, Ge W, Yu Y. GraphFPN: Graph feature pyramid network for object detection[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 2763-2772.

[0069] In the system ablation experiments, the structure used to abstract features in the coordinate prediction part and in the confidence prediction part was replaced in turn with a Shared MLP, a multilayer perceptron, and a Swin Transformer, and the performance of each variant was measured. The results show that, as the feature-abstraction structure, Shared MLP achieves the best performance for coordinate prediction, while Swin Transformer achieves the best performance for confidence prediction. Table 2 shows the test results for the confidence prediction part, and Table 3 shows the test results for the coordinate prediction part.

[0070] Table 2

[0071] Results of changing the feature-abstraction structure of the confidence prediction part.

[0072]

[0073] Table 3

[0074] Results of changing the feature-abstraction structure of the coordinate prediction part.

[0075]

[0076] The embodiments described above are merely illustrative of certain implementations of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A target localization method based on a shared multilayer perceptron and a sliding window transformer, characterized in that the method includes the following steps:
(1) Perform a bias-free convolution on the input image to extract the first layer of features, and divide the image evenly into several square regions;
(2) The first layer features are used as inputs to the coordinate prediction part and the confidence prediction part, respectively, and the corresponding output tensors are obtained. The coordinate prediction part includes residual convolutional layers, fully connected layers, and a Shared MLP. The input to this part is the first layer of features, and the output tensor size is (2n, i, k), where n is the number of targets to be located; i and k are the numbers of square regions along the length and width of the input image, respectively; this tensor represents the coordinates of each target for each square region. This part includes the following sub-steps:
(2.1a) Use a residual convolutional network to abstract the first layer of input features to increase the feature dimension, and take advantage of the property of changing the image size after convolution to adjust the size of the abstracted features to match each square region; assuming the dimension of the abstracted features is dim, then the size of the abstracted features is (dim, i, k), which is called the region coordinate feature;
(2.1b) A fully connected layer is used to further abstract the region coordinate features of size (dim, i, k) in (2.1a) to obtain features of size (dim, 1, 1), which are called global features;
(2.1c) The global feature described in (2.1b) is copied i and k times in the second and third dimensions respectively to obtain a global feature of size (dim, i, k). This global feature is then concatenated with the regional coordinate feature in (2.1a) to obtain the total feature (2dim, i, k);
(2.1d) The total features in (2.1c) are summarized using the Shared MLP to obtain an output tensor of size (2n, i, k); this tensor represents the coordinates of each target for each square region; the Shared MLP applies the same weights at every position along the i and k dimensions.
The confidence prediction part includes a Swin Transformer and a Shared MLP layer. The input to this part is the first layer features, and the output tensor size is (2n, i, k). This tensor represents the probability of each target existing in each square region; the probability of its existence is called the confidence level. This part includes the following sub-steps:
(2.2a) Use the Swin Transformer to abstract the first layer of features. Utilize the feature size change ...
(2.2b) The Shared MLP is used to summarize the region confidence features in (2.2a) to obtain an output tensor of size (2n, i, k); this tensor represents the probability of each target existing in each square region; the Shared MLP applies the same weights at every position along the i and k dimensions.
(3) Based on the output tensor of the confidence prediction part in step (2), find the square region where the target is located by the maximum value index, and then obtain the target coordinate tensor based on the output tensor of the coordinate prediction part in step (2);
(4) Set a threshold t. Targets in the target coordinate tensor with a confidence level less than t are determined not to exist. For targets that are determined to exist, their coordinate tensor is obtained. At this point, the target localization task is completed.

2. The target localization method based on a shared multilayer perceptron and a sliding window transformer according to claim 1, characterized in that the input image in step (1) is a tensor representing the pixel values of each point in a three-channel image of red, green, and blue; the image contains the target to be located; the tensor size is (3, m, n1), where 3 represents the three channels, and m and n1 represent the length and width of the input image, respectively; the first layer of features is a tensor obtained by convolving the input image without bias; the image is divided into several square regions, i square regions along its length and k square regions along its width; these square regions have the same shape and size; therefore, m and n1 should be divisible by i and k, respectively.

3. The target localization method based on a shared multilayer perceptron and a sliding window transformer according to claim 1, characterized in that, The target coordinate tensor mentioned in step (3) has a size of (n, 4); this tensor represents the coordinates of each target on the input image and the probability of its existence.