Maritime target recognition method and device based on multi-view joint
By combining 3D sampling and a mask autoencoder with a dynamically updated storage queue, the method addresses the difficulty existing techniques have in mining the association information between views of maritime targets, and achieves higher recognition accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TIANJIN UNIV
- Filing Date
- 2023-04-21
- Publication Date
- 2026-06-30
AI Technical Summary
Existing supervised learning-based multi-view maritime target recognition methods rely on a large amount of labeled information, and the quality of the labeled information affects the model performance, making it difficult to effectively mine the correlation information between maritime target views, resulting in insufficient recognition accuracy.
A 3D sampling strategy is adopted to sample the view sequence of maritime targets as a whole. The mask autoencoder is used to learn the correlation information between multiple views, and the instance-level correlation information is mined through a dynamically updated long storage queue. The feature of maritime targets is extracted by combining view reconstruction loss and similarity model loss for iterative optimization.
By fully exploiting the visual information of maritime targets and the correlation between similar targets, it improves the accuracy of self-supervised maritime target identification.
Figure CN116363379B_ABST
Description
Technical Field
[0001] This invention relates to the field of multi-view maritime target recognition, and in particular to a method and apparatus for maritime target recognition based on multi-view joint recognition.
Background Technology
[0002] Despite the highly developed global maritime transport system, ship collisions and groundings have remained frequent in recent years. To ensure navigational safety, ships require advanced identification technologies to detect other vessels and obstacles within a certain range and classify them. With advancements in image processing hardware and software, images captured by cameras now contain sufficiently rich information, and multi-view maritime target recognition technology based on image processing has therefore received widespread attention in the field of ship identification and classification [1-2].
[0003] In recent years, deep learning has emerged and developed rapidly, demonstrating increasingly superior performance in image recognition tasks. Researchers have also proposed many view-based 3D model retrieval network models, such as MVCNN [3] and GVCNN [4]. These methods use virtual cameras to convert 3D models into sequences of 2D views for 3D model retrieval, and they suggest a novel approach to maritime target recognition: using deep learning to extract information about maritime targets from views captured by cameras. However, most of these methods are based on supervised learning, requiring datasets with a large amount of difficult-to-collect labeled information, and the quality of the labeled information also affects model performance.
[0004] In recent years, self-supervised learning has developed rapidly and has shown superior performance compared to traditional supervised learning in some areas of computer vision. These methods can learn general representations of image data without labels and also exhibit good performance in linear classification and other downstream tasks. Among them, MAE [5] demonstrated that masked autoencoders can efficiently learn visual representations from images, and this approach was subsequently extended to other domains such as video and audio [6]. Maritime targets can be rendered from different perspectives by multiple cameras to obtain a sequence of views. Therefore, learning the representation of maritime targets through a mask autoencoder has become a novel approach. Unlike ordinary two-dimensional images, there is a strong correlation between a set of views of a maritime target, and there is also a certain degree of association between similar maritime targets of the same type. How to mine and utilize this information has become a challenge.
Summary of the Invention
[0005] This invention provides a method and apparatus for maritime target recognition based on multi-view joint analysis. The invention utilizes a 3D sampling strategy to perform overall sampling of the view sequence of maritime targets and employs an autoencoder for overall prediction to learn the correlation information between multiple views of maritime targets. Furthermore, it mines instance-level correlation information based on visible markers to learn the correlation information between similar maritime targets, thereby improving the accuracy of self-supervised maritime target recognition. Details are described below:
[0006] A method for identifying maritime targets based on multi-view joint approach, the method comprising:
[0007] A 3D sampling strategy is used to sample the entire view sequence and add position embedding to generate visible blocks and mask blocks;
[0008] The visible blocks are encoded using an encoder to obtain visible markers, and a shared learnable vector is generated for all mask blocks as mask markers.
[0009] The decoder reconstructs the input view sequence based on the visible markers and mask markers, and calculates the mean square error of the reconstructed mask block and the original mask block in the pixel space to obtain the view reconstruction loss;
[0010] A set of visible markers for maritime targets, including the current batch of data, is constructed using a dynamically updated long storage queue. Instance-level association information is mined using a graph-based, parameterless method to learn the association information between similar maritime targets and obtain the similarity model loss.
[0011] The sum of the view reconstruction loss and the similarity model loss is used as the objective function of the network for iterative optimization. Finally, the encoder is used as a feature extractor for maritime targets to perform multi-view maritime target recognition.
[0012] A maritime target identification device based on multi-view joint method, the device comprising: a processor and a memory, the memory storing program instructions, the processor calling the program instructions stored in the memory to cause the device to perform any of the method steps described above.
[0013] A computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the steps of the method described above.
[0014] The beneficial effects of the technical solution provided by this invention are:
[0015] 1. This invention treats the view sequence of maritime targets as a three-dimensional structure, adopts a three-dimensional sampling strategy for overall sampling, and uses a mask autoencoder for overall prediction of the view sequence; it not only learns the features of each two-dimensional view, but also learns the relevant information between multiple views.
[0016] 2. This invention stores a large number of visible markers of maritime targets through a storage queue, and uses a graph-based, parameter-free method to mine instance-level association information using the visible markers. Based on the learning and reconstruction of features of individual maritime targets, it considers the potential association information between similar maritime targets of the same category to improve the quality of the learned maritime target representation.
[0017] Therefore, this invention can fully learn and mine the visual information of maritime targets and the correlation between similar maritime targets, thereby improving the accuracy of self-supervised maritime target identification.
Attached Figure Description
[0018] Figure 1 is a flowchart of a maritime target identification method based on multi-view joint approach;
[0019] Figure 2 is a network structure diagram of a multi-view joint maritime target recognition method.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below.
[0021] Example 1
[0022] A method for maritime target identification based on multi-view joint approach (see Figure 1). The method includes the following steps:
[0023] 101: By setting up multiple cameras around the target at sea for rendering, a sequence of views of the target at sea is generated;
[0024] 102: Use a 3D sampling strategy to sample the entire view sequence and add position embedding to generate visible blocks and mask blocks;
[0025] 103: Encode the visible blocks using an encoder to obtain visible markers, and generate a shared learnable vector as a mask marker for all mask blocks;
[0026] 104: The decoder reconstructs the input view sequence based on the visible markers and mask markers, and calculates the mean square error of the reconstructed mask block and the original mask block in the pixel space to obtain the view reconstruction loss;
[0027] 105: Construct a set of visible markers for maritime targets that includes the current batch of data using a dynamically updated long storage queue, and use a graph-based, parameterless method to mine instance-level association information, learn the association information between similar maritime targets, and obtain the similarity model loss.
[0028] 106: The sum of the view reconstruction loss and the similarity model loss is used as the objective function of the network to iteratively optimize the network. Finally, the encoder is used as the feature extractor for maritime targets to perform multi-view maritime target recognition.
[0029] In summary, this invention proposes a novel multi-view maritime target identification method. This method is based on a mask autoencoder and incorporates instance-level association information mining, thereby improving the accuracy of multi-view maritime target identification.
[0030] Example 2
[0031] The scheme in Example 1 will be further described below with specific examples and calculation formulas:
[0032] 201: By setting up multiple cameras around the target at sea for rendering, a sequence of views of the target at sea is generated;
[0033] A set of viewpoints is selected around the maritime target, i.e., the positions where the virtual cameras are placed. In this invention, 12 viewpoints are selected. In this embodiment of the invention, a camera is placed every 30 degrees around the target mesh, elevated 30 degrees above the ground plane and pointing towards the centroid of the mesh. The rendered view of each camera is acquired in clockwise order to obtain a view sequence of the maritime target.
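The camera placement described above can be sketched as follows. This is only an illustrative sketch: the function names, the unit radius, the z-up coordinate convention, and the choice of starting azimuth are assumptions that the source does not specify.

```python
import numpy as np

def camera_positions(num_views=12, elevation_deg=30.0, radius=1.0):
    """Return (num_views, 3) camera centres on a ring around the origin."""
    # Assumption: z is up; decreasing azimuth gives a clockwise order
    # when the ring is seen from above.
    azimuths = -np.deg2rad(np.arange(num_views) * (360.0 / num_views))
    elev = np.deg2rad(elevation_deg)
    x = radius * np.cos(elev) * np.cos(azimuths)
    y = radius * np.cos(elev) * np.sin(azimuths)
    z = np.full(num_views, radius * np.sin(elev))
    return np.stack([x, y, z], axis=1)

def look_directions(positions, centroid):
    """Unit vectors pointing from each camera towards the target centroid."""
    d = centroid - positions
    return d / np.linalg.norm(d, axis=1, keepdims=True)

cams = camera_positions()                   # 12 cameras, 30 degrees apart
dirs = look_directions(cams, np.zeros(3))   # centroid assumed at the origin
```

Rendering one view per camera in this order would give the 12-view sequence used in the following steps.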
[0034] 202: Use a 3D sampling strategy to sample the entire view sequence and add position embedding to generate visible blocks and mask blocks;
[0035] In this embodiment of the invention, each view of a maritime target is scaled to 224*224 pixels, and 12 views are sequentially connected to obtain a 12*224*224 three-dimensional structure. This embodiment then divides this three-dimensional structure along its three dimensions into non-overlapping regular grid blocks, each grid block being 2*16*16 pixels in size, resulting in a total of 6*14*14 grid blocks. These grid blocks are flattened and processed through a linear projection layer to obtain block embedding information.
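A minimal sketch of this block division, assuming single-channel views so that each block holds 2*16*16 = 512 values as in the text. The embedding dimension `D = 768` and the random projection matrix `W_proj` are placeholders for the learned linear projection layer, not values taken from the source.

```python
import numpy as np

def patchify_3d(views, pt=2, ph=16, pw=16):
    """Split a (T, H, W) view stack into flattened, non-overlapping blocks."""
    T, H, W = views.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    gt, gh, gw = T // pt, H // ph, W // pw
    x = views.reshape(gt, pt, gh, ph, gw, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5)          # (gt, gh, gw, pt, ph, pw)
    return x.reshape(gt * gh * gw, pt * ph * pw)

rng = np.random.default_rng(0)
views = rng.random((12, 224, 224))             # stand-in for 12 rendered views
blocks = patchify_3d(views)                    # 6*14*14 blocks of 512 pixels

D = 768                                        # assumed embedding dimension
W_proj = rng.standard_normal((blocks.shape[1], D)) * 0.02  # placeholder weights
P = blocks @ W_proj                            # block embeddings, shape (n, D)
```

Each row of `blocks` covers two consecutive views, which is what lets a single block carry information from more than one viewpoint.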
[0036] Furthermore, embodiments of the present invention add learnable position embeddings to each grid block to preserve position information. Embodiments of the present invention have two position embeddings: one is a view position embedding, which describes the position information of the grid block in a two-dimensional view, using 2D sine and cosine position embeddings; the other is a sequence position embedding, which describes the position information of the grid block in a view sequence, using standard 1D position embeddings.
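The two position embeddings can be sketched as below. The 2D sine and cosine form follows the common fixed-frequency construction. Using the same sinusoidal form for the 1D sequence embedding, and summing the two embeddings, are assumptions; the source only names the embedding types.

```python
import numpy as np

def sincos_1d(positions, dim):
    """Fixed sine/cosine embedding of 1D positions; dim must be even."""
    assert dim % 2 == 0
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim / 2)))
    angles = np.outer(positions, omega)                 # (len, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(grid_h, grid_w, dim):
    """Half of the channels encode the row index, half the column index."""
    assert dim % 4 == 0
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_r = sincos_1d(rows.ravel(), dim // 2)
    emb_c = sincos_1d(cols.ravel(), dim // 2)
    return np.concatenate([emb_r, emb_c], axis=1)       # (grid_h*grid_w, dim)

D = 768                                     # assumed embedding dimension
view_pos = sincos_2d(14, 14, D)             # position inside a 2D view
seq_pos = sincos_1d(np.arange(6), D)        # position along the view sequence
# Broadcast-add so that every one of the 6*14*14 blocks gets one vector.
pos = (seq_pos[:, None, :] + view_pos[None, :, :]).reshape(6 * 14 * 14, D)
```

The resulting `pos` array has one row per grid block, in sequence-major order, and would be added to the block embeddings.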
[0037] Define the set of grid blocks for the view sequence as $P \in \mathbb{R}^{n \times D}$, where n is the number of grid blocks, i.e., 6*14*14, and D is the dimension of block embedding.
[0038] Finally, in this embodiment of the invention, all grid blocks are randomly sampled. The sampled grid blocks are the visible blocks, defined as $P_v \in \mathbb{R}^{\alpha \times D}$; the unsampled grid blocks are the mask blocks that the network needs to predict, defined as $P_m \in \mathbb{R}^{\beta \times D}$, where α = k*(1-λ), β = k*λ, λ is the masking rate, and k denotes the total number of grid blocks (n above). Considering the significant information redundancy in the view sequences of maritime targets, this embodiment of the invention employs a masking rate as high as 90%, which greatly reduces computational load and accelerates network training. Furthermore, compared to sampling each view separately, performing 3D sampling and overall prediction on the view sequence of the same maritime target helps the network learn the correlation information between views of the same maritime target.
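The random sampling step might look like the sketch below. Rounding the number of visible blocks to the nearest integer is an assumption, since 6*14*14 blocks at a 90% masking rate does not give a whole number; the function name and the random block embeddings are illustrative only.

```python
import numpy as np

def random_mask(n, mask_ratio, rng):
    """Randomly split n block indices into visible and masked index sets."""
    num_visible = int(round(n * (1.0 - mask_ratio)))   # assumed rounding rule
    perm = rng.permutation(n)
    visible_idx = np.sort(perm[:num_visible])
    mask_idx = np.sort(perm[num_visible:])
    return visible_idx, mask_idx

rng = np.random.default_rng(0)
n, D = 6 * 14 * 14, 768                    # D is an assumed embedding dimension
P = rng.standard_normal((n, D))            # stand-in for block embeddings
visible_idx, mask_idx = random_mask(n, 0.9, rng)
P_v, P_m = P[visible_idx], P[mask_idx]     # visible blocks and mask blocks
```

Because the indices are drawn over the whole 3D structure rather than per view, some views may contribute several visible blocks and others none.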
[0039] 203: Encode the visible blocks using an encoder to obtain visible markers, and generate a shared learnable vector as a mask marker for all mask blocks;
[0040] The encoder in this embodiment of the invention adopts the ViT (Vision Transformer) architecture [7] and encodes only the visible blocks. With a mask rate of up to 90%, the encoder processes only about one-tenth of the grid blocks, which reduces time and memory complexity. If $F_e(\cdot)$ represents the encoder, then the visible marker $T_v$ is defined as:
[0041] $T_v = F_e(P_v) \in \mathbb{R}^{\alpha \times D}$
[0042] For the mask blocks, this embodiment of the invention generates a shared learnable vector as the mask marker, defined as $T_m \in \mathbb{R}^{\beta \times D}$. The visible markers output by the encoder and the mask markers generated in this embodiment of the invention constitute the complete marker set T for all grid blocks in the view sequence:
[0043] $T = T_v \cup T_m \in \mathbb{R}^{n \times D}$
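One way to assemble the complete marker set is sketched below. It assumes the markers are put back at their original block positions before decoding, and it uses a constant vector in place of the learnable mask marker; both points, like the function name, are illustrative assumptions.

```python
import numpy as np

def assemble_marker_set(T_v, visible_idx, mask_token, n):
    """Place visible markers at their original block positions and fill
    every masked position with the same shared mask marker."""
    T = np.tile(mask_token, (n, 1))   # (n, D), every row is the mask marker
    T[visible_idx] = T_v              # overwrite the sampled positions
    return T

n, D = 8, 4                           # toy sizes for illustration
visible_idx = np.array([1, 5])
T_v = np.full((2, D), 2.0)            # stand-in for encoder outputs
mask_token = np.full(D, 7.0)          # stand-in for the shared learnable vector
T = assemble_marker_set(T_v, visible_idx, mask_token, n)
```

In a trained network the mask marker would be a single parameter vector updated by backpropagation and shared by all masked positions.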
[0044] 204: The decoder reconstructs the input view sequence based on the visible markers and mask markers, and calculates the mean square error of the reconstructed mask block and the original mask block in the pixel space to obtain the view reconstruction loss;
[0045] The network structure in this embodiment of the invention is asymmetric. The decoder is a smaller ViT than the encoder, and its input is a visible marker and a mask marker. The decoder predicts the mask marker based on the visible marker and outputs the reconstructed result of the input view sequence:
[0046] $Y = F_d(T) \in \mathbb{R}^{n \times c}$
[0047] Here, $F_d(\cdot)$ represents the decoder, c = 2*16*16 represents the number of pixels in each grid block, and the output Y is the prediction of the pixel values of all grid blocks. The original pixel values of the flattened grid blocks in step 202 are defined as $X \in \mathbb{R}^{n \times c}$. The view reconstruction loss $L_{rec}$ is obtained by calculating the mean square error between the reconstructed mask blocks and the original mask blocks in pixel space, based on the decoder's prediction results:
[0048]
[0049]
[0050] Where N represents the batch size, that is, the number of maritime targets in the current batch.
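The formulas for the view reconstruction loss are not reproduced in this text. The sketch below shows one form consistent with the description: a mean square error over the masked blocks of each target, averaged over the batch. Normalising by the number of masked pixels is an assumption, and the function name and toy data are hypothetical.

```python
import numpy as np

def view_reconstruction_loss(Y, X, mask_idx):
    """Y, X: (N, n, c) predicted and original block pixels.
    mask_idx: (N, beta) indices of the masked blocks of each target."""
    per_target = []
    for y, x, idx in zip(Y, X, mask_idx):
        # Assumed normalisation: mean over the beta*c masked pixel values.
        per_target.append(np.mean((y[idx] - x[idx]) ** 2))
    return float(np.mean(per_target))      # average over the N targets

N, n, c = 2, 4, 3                           # toy sizes for illustration
Y = np.zeros((N, n, c))
X = np.ones((N, n, c))
mask_idx = np.array([[0, 1], [2, 3]])
loss = view_reconstruction_loss(Y, X, mask_idx)
```

Only masked blocks contribute, so errors on visible blocks do not affect the value.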
[0051] 205: Construct a set of visible markers for maritime targets, including the current batch of data, using a dynamically updated long storage queue, and mine instance-level association information using a graph-based, parameter-free method to learn the association information between similar maritime targets and obtain the similarity model loss.
[0052] During network training, this embodiment of the invention maintains a dynamically updated long storage queue to store the visible markers of maritime targets obtained during the most recent training iterations. The length of the storage queue is related to the size of the training set, and in principle, it should contain a certain number of maritime targets of each category in the training set. After each batch of data is trained, this embodiment of the invention stores the visible markers of the obtained maritime targets into the storage queue and dequeues the oldest batch of data. If the length of the storage queue is N′, the visible markers in the storage queue are defined as:
[0053] $Q = \{q_1, q_2, \ldots, q_{N'}\}$
[0054] Combine the visible markers of the current batch of N maritime targets with the visible markers in the storage queue to form a visible marker set E:
[0055] $E = (\cup_N T_v) \cup Q$
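The storage queue and the construction of the set E can be sketched as follows. The sketch assumes each maritime target is represented in the queue by a single D-dimensional vector (for example, its visible markers pooled into one vector); the source does not state how a target's markers are stored. The class and function names are hypothetical.

```python
from collections import deque
import numpy as np

class MarkerQueue:
    """Fixed-length FIFO queue of per-target marker vectors."""
    def __init__(self, max_len):
        self.buf = deque(maxlen=max_len)   # oldest entries drop automatically

    def enqueue(self, batch_markers):
        for m in batch_markers:
            self.buf.append(m)

    def as_array(self, dim):
        if not self.buf:
            return np.empty((0, dim))
        return np.stack(list(self.buf))

def build_marker_set(batch_markers, queue):
    """Stack the current batch on top of the queued markers to form E."""
    Q = queue.as_array(batch_markers.shape[1])
    return np.concatenate([batch_markers, Q], axis=0)

rng = np.random.default_rng(0)
queue = MarkerQueue(max_len=4)             # N' = 4 in this toy example
batches = [rng.standard_normal((2, 3)) for _ in range(4)]
for b in batches[:3]:                      # three earlier training batches
    queue.enqueue(b)
E = build_marker_set(batches[3], queue)    # current batch plus queue
```

With a batch of N targets and a full queue of length N′, the set built this way has N+N′ rows, with the current batch first.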
[0056] Based on this set of visible markers, a graph-based, parameter-free method is used to adaptively find similar maritime targets. This method can be summarized in two points: for each sample, the closest sample is considered its similar sample; if samples A and B, and samples B and C are two pairs of similar samples, then samples A and C are also considered similar samples.
[0057] For each sample in the visible marker set, this embodiment of the invention finds the closest sample by calculating the cosine similarity between the sample and all other samples. Based on this, this embodiment of the invention can establish a sparse symmetric adjacency matrix according to the following formula:
[0058]
[0059] Based on this adjacency matrix, an undirected graph G = (V, E) containing several connected components can be generated, where V is a node in the graph, representing each visible marker, and E is an edge in the graph, representing the similarity relationship between visible markers. Nodes belonging to the same connected component are either directly connected or connected through intermediate nodes. In this embodiment of the invention, the visible markers represented by these nodes are considered as similar samples to each other. The similarity relationship between different visible markers is defined as follows:
[0060]
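The adjacency and similarity formulas are not reproduced in this text. The sketch below implements the two rules stated in the description, under the assumption that two samples are linked when either one is the other's nearest neighbour by cosine similarity, and that samples in the same connected component are similar (with the diagonal set to zero). All function names are hypothetical.

```python
import numpy as np

def first_neighbor_adjacency(E):
    """Symmetric adjacency: i and j are linked if one is the other's
    nearest neighbour under cosine similarity."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)         # a sample is not its own neighbour
    nn = sim.argmax(axis=1)
    m = len(E)
    A = np.zeros((m, m), dtype=bool)
    A[np.arange(m), nn] = True
    return A | A.T

def connected_components(A):
    """Label each node with the index of its connected component."""
    m = len(A)
    labels = -np.ones(m, dtype=int)
    current = 0
    for start in range(m):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:                        # depth-first traversal
            u = stack.pop()
            for v in np.flatnonzero(A[u]):
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def similarity_relation(E):
    """S[i, j] = 1 when i and j share a connected component, else 0."""
    labels = connected_components(first_neighbor_adjacency(E))
    S = (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(S, 0)                  # assumption: no self-similarity
    return S

angles = np.deg2rad([0, 5, 12, 90, 95, 102])   # two well-separated groups
pts = np.stack([np.cos(angles), np.sin(angles)], axis=1)
A = first_neighbor_adjacency(pts)
S = similarity_relation(pts)
```

In the toy data the first and third points are not directly linked, yet they end up similar through the second point, which is the transitivity rule described above.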
[0061] Based on the similarity relationship between visible markers, this embodiment of the invention constructs a similarity model loss function. This loss function attempts to make the visible markers of similar maritime targets in the feature space as close as possible, and the visible markers of dissimilar maritime targets as far apart as possible, specifically expressed as follows:
[0062]
[0063]
[0064] Where N+N′ is the number of visible markers, and τ is the temperature hyperparameter.
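The formulas for the similarity model loss are not reproduced in this text. The sketch below is one contrastive form consistent with the symbols described, namely the share of exponentiated cosine similarity, scaled by the temperature τ, that falls on similar samples. It is an assumption rather than the patent's exact formula, and the function name, default temperature, and toy data are hypothetical.

```python
import numpy as np

def similarity_model_loss(E, S, N, tau=0.1):
    """E: (N+N', D) marker set whose first N rows are the current batch.
    S: (N+N', N+N') 0/1 similarity relation. Returns the mean loss."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    logits = (X[:N] @ X.T) / tau                  # (N, N+N')
    idx = np.arange(N)
    logits[idx, idx] = -np.inf                    # exclude self-pairs
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    pos = (exp * S[:N]).sum(axis=1)               # mass on similar samples
    denom = exp.sum(axis=1)                       # mass on all other samples
    has_pos = pos > 0                             # skip targets with no match
    if not has_pos.any():
        return 0.0
    return float(np.mean(-np.log(pos[has_pos] / denom[has_pos])))

angles = np.deg2rad([0, 5, 12, 90, 95, 102])      # two well-separated groups
pts = np.stack([np.cos(angles), np.sin(angles)], axis=1)
labels = np.array([0, 0, 0, 1, 1, 1])
S_good = (labels[:, None] == labels[None, :]).astype(int)
np.fill_diagonal(S_good, 0)
S_bad = 1 - S_good                                # deliberately wrong relation
np.fill_diagonal(S_bad, 0)
loss_good = similarity_model_loss(pts, S_good, N=2)
loss_bad = similarity_model_loss(pts, S_bad, N=2)
```

The loss is small when similar samples are already close in feature space and large when the samples marked as similar are far apart, which matches the stated goal of pulling similar targets together and pushing dissimilar ones apart.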
[0065] 206: The sum of the view reconstruction loss and the similarity model loss is used as the objective function of the network to iteratively optimize the network. Finally, the encoder is used as the feature extractor for maritime targets to perform multi-view maritime target recognition.
[0066] Define the network's objective function, Loss:
[0067] $Loss = L_{rec} + L_{sim}$
[0068] After the network iteration optimization is completed, the embodiments of the present invention use the encoder to extract the features of the visible blocks in the view sequence, namely the visible markers, as feature descriptors of maritime targets. Compared with the traditional view-based 3D model retrieval method, which extracts the features of each view and then performs view feature fusion, this avoids the shortcomings of various feature fusion schemes.
[0069] A multi-view joint maritime target identification device includes a processor and a memory. The memory stores program instructions, and the processor invokes the program instructions stored in the memory to execute the following method steps of Embodiments 1 and 2:
[0070] A 3D sampling strategy is used to sample the entire view sequence and add position embedding to generate visible blocks and mask blocks;
[0071] The visible blocks are encoded using an encoder to obtain visible markers, and a shared learnable vector is generated for all mask blocks as mask markers.
[0072] The decoder reconstructs the input view sequence based on the visible markers and mask markers, and calculates the mean square error of the reconstructed mask block and the original mask block in the pixel space to obtain the view reconstruction loss;
[0073] A set of visible markers for maritime targets, including the current batch of data, is constructed using a dynamically updated long storage queue. Instance-level association information is mined using a graph-based, parameterless method to learn the association information between similar maritime targets and obtain the similarity model loss.
[0074] The sum of the view reconstruction loss and the similarity model loss is used as the objective function of the network for iterative optimization. Finally, the encoder is used as a feature extractor for maritime targets to perform multi-view maritime target recognition.
[0075] The specific 3D sampling strategy is as follows:
[0076] A three-dimensional structure is obtained by sequentially connecting the view sequence of maritime targets. This structure is then divided into non-overlapping regular grid blocks along its three dimensions. After being flattened, the grid blocks are processed by a linear projection layer to obtain block embedding information. The position information of each grid block is preserved based on the view position embedding and the sequence position embedding.
[0077] Define the set of grid blocks for the view sequence as $P \in \mathbb{R}^{n \times D}$, where n is the number of grid blocks and D is the dimension of block embedding; the randomly sampled grid blocks are the visible blocks, defined as $P_v \in \mathbb{R}^{\alpha \times D}$; the unsampled grid blocks are the mask blocks that the network needs to predict, defined as $P_m \in \mathbb{R}^{\beta \times D}$, where α = k*(1-λ), β = k*λ, λ is the mask rate, and k denotes the total number of grid blocks (n above).
[0078] Furthermore, the view position embedding is used to describe the position information of the grid block in the two-dimensional view; the sequence position embedding is used to describe the position information of the grid block in the view sequence.
[0079] Here, the visible marker $T_v$ is:
[0080] $T_v = F_e(P_v) \in \mathbb{R}^{\alpha \times D}$
[0081] The mask marker is: $T_m \in \mathbb{R}^{\beta \times D}$
[0082] where $F_e$ is the encoder; $P_v$ is the visible blocks; $\mathbb{R}$ is the set of real numbers; α is the number of visible blocks obtained from sampling each maritime target; D is the dimension of block embedding; β is the number of mask blocks for each maritime target.
[0083] Here, the visible marker set E is:
[0084] $E = (\cup_N T_v) \cup Q$
[0085] where N represents the number of maritime targets in the current batch, and Q is the set of visible markers in the storage queue: $Q = \{q_1, q_2, \ldots, q_{N'}\}$, where N′ is the length of the storage queue and q is the visible marker for a single target.
[0086] Here, the view reconstruction loss $L_{rec}$ is:
[0087]
[0088]
[0089] where N represents the number of maritime targets in the current batch, $L_{rec}^{i}$ is the view reconstruction loss for the i-th maritime target in the current batch, n is the number of grid blocks, c is the number of pixels contained in each grid block, and $y_{j,k}$ and $x_{j,k}$ represent the reconstructed value and the original value of pixel k in grid block j, respectively.
[0090] Furthermore, the similarity model loss $L_{sim}$ is:
[0091]
[0092]
[0093] where N represents the number of maritime targets in the current batch, N′ is the length of the storage queue, and N+N′ is the number of visible markers; $L_{sim}^{i}$ is the similarity model loss for the i-th maritime target in the current batch, τ is the temperature hyperparameter, $S_{i,j}$ is the similarity relationship between different visible markers, $x_i$ is the visible marker for the i-th maritime target in the current batch, and $x_j$ and $x_k$ are samples in the visible marker set E.
[0094] It should be noted that the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention will not be repeated here.
[0095] The execution entities of the aforementioned processor and memory can be devices with computing functions such as computers, microcontrollers, and single-chip microcomputers. In specific implementations, the embodiments of the present invention do not limit the execution entities and can select them according to the needs of actual applications.
[0096] Data signals are transmitted between the memory and the processor via a bus, which will not be elaborated upon in this embodiment of the invention.
[0097] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium, the storage medium including a stored program, which, when the program is running, controls the device where the storage medium is located to execute the method steps in the above embodiments.
[0098] The computer-readable storage medium includes, but is not limited to, flash memory, hard disk, solid-state drive, etc.
[0099] It should be noted that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments, and the embodiments of the present invention will not be repeated here.
[0100] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of the present invention is generated.
[0101] A computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. Computer instructions can be stored in or transmitted through a computer-readable storage medium. A computer-readable storage medium can be any available medium accessible to a computer or a data storage device such as a server or data center that integrates one or more available media. The available medium can be magnetic or semiconductor, etc.
[0102] References:
[0103] [1] Zhang Ning, Jiang Chunzi, Lin Jiahao. Research on target identification of ships at sea [J]. China Water Transport (Second Half Month), 2021, 21(01): 1-4.
[0104] [2] Zhang Kun, Luo Yasong, Liu Zhong. Research on maritime target recognition technology based on YOLOv4 [J]. Journal of Ordnance Equipment Engineering, 2022, 43(04): 211-217.
[0105] [3] Su H, Maji S, Kalogerakis E, et al. Multi-view convolutional neural networks for 3D shape recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 945-953.
[0106] [4]Feng Y, Zhang Z, Zhao
[0107] [5]He K, Chen learners[J].arXiv preprint arXiv:2205.09113,2022.
[0108] [7] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.
[0109] Unless otherwise specified, the model numbers of the various devices in this embodiment of the invention are not limited, and any device that can perform the above functions is acceptable.
[0110] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of a preferred embodiment, and the sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0111] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A maritime target identification method based on multi-view joint approach, characterized in that the method includes:
a 3D sampling strategy is used to sample the entire view sequence and add position embedding to generate visible blocks and mask blocks;
the visible blocks are encoded using an encoder to obtain visible markers, and a shared learnable vector is generated for all mask blocks as mask markers;
the decoder reconstructs the input view sequence based on the visible markers and mask markers, and calculates the mean square error of the reconstructed mask block and the original mask block in the pixel space to obtain the view reconstruction loss;
a set of visible markers for maritime targets, including the current batch of data, is constructed using a dynamically updated long storage queue; instance-level association information is mined using a graph-based, parameter-free method to learn the association information between similar maritime targets and obtain the similarity model loss;
the sum of the view reconstruction loss and the similarity model loss is used as the objective function of the network for iterative optimization; finally, the encoder is used as the feature extractor for maritime targets to perform multi-view maritime target recognition;
the specific three-dimensional sampling strategy is as follows:
a three-dimensional structure is obtained by sequentially connecting the view sequence of maritime targets; this structure is then divided into non-overlapping regular grid blocks along its three dimensions; after being flattened, the grid blocks are processed by a linear projection layer to obtain block embedding information; the position information of each grid block is preserved based on the view position embedding and the sequence position embedding;
define the set of grid blocks for the view sequence as $P \in \mathbb{R}^{n \times D}$, where n is the number of grid blocks and D is the dimension of block embedding; the grid blocks obtained by random sampling are the visible blocks, defined as $P_v \in \mathbb{R}^{\alpha \times D}$; the unsampled grid blocks are the mask blocks that the network needs to predict, defined as $P_m \in \mathbb{R}^{\beta \times D}$, where α = k*(1-λ), β = k*λ, λ is the mask rate, and α represents the number of visible blocks obtained from sampling each maritime target.
2. The maritime target identification method based on multi-view joint analysis according to claim 1, characterized in that, The view position embedding is used to describe the position information of the grid block in the two-dimensional view; the sequence position embedding is used to describe the position information of the grid block in the view sequence.
3. The maritime target identification method based on multi-view joint method according to claim 1, characterized in that, the visible marker $T_v$ is: $T_v = F_e(P_v) \in \mathbb{R}^{\alpha \times D}$; the mask marker is: $T_m \in \mathbb{R}^{\beta \times D}$; where $F_e$ is the encoder; $P_v$ is the visible blocks; D is the dimension of block embedding; β is the number of mask blocks for each maritime target.
4. The maritime target identification method based on multi-view joint method according to claim 1, characterized in that, the visible marker set E is: $E = (\cup_N T_v) \cup Q$; where N indicates the number of maritime targets in the current batch, Q is the set of visible markers in the storage queue: $Q = \{q_1, q_2, \ldots, q_{N'}\}$, and q is a visible marker for a single target.
5. The maritime target identification method based on multi-view joint method according to claim 1, characterized in that, the view reconstruction loss $L_{rec}$ is: ; ; where N indicates the number of maritime targets in the current batch, $L_{rec}^{i}$ is the view reconstruction loss of the i-th maritime target in the current batch, n is the number of grid blocks, c is the number of pixels in each grid block, and $y_{j,k}$ and $x_{j,k}$ represent the reconstructed value and the original value of pixel k in grid block j, respectively.
6. The maritime target identification method based on multi-view joint method according to claim 1, characterized in that, the similarity model loss $L_{sim}$ is: ; ; where N indicates the number of maritime targets in the current batch, N′ is the length of the storage queue, N+N′ is the number of visible markers in the visible marker set, $L_{sim}^{i}$ is the similarity model loss for the i-th maritime target in the current batch, τ is the temperature hyperparameter, $S_{i,j}$ is the similarity relationship between different visible markers, $x_i$ is the visible marker for the i-th maritime target in the current batch, and $x_j$ and $x_k$ are samples in the visible marker set E.
7. A maritime target identification device based on multi-view joint identification, characterized in that, The device includes a processor and a memory, the memory storing program instructions, the processor calling the program instructions stored in the memory to cause the device to perform the steps of the method according to any one of claims 1-6.
8. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the steps of the method described in any one of claims 1-6.