A multimodal visual detection and recognition method for underwater optical fusion devices

By adaptively adjusting the contrast threshold of the SIFT algorithm and using gradient magnitude and confidence level to dynamically adjust the feature point extraction of underwater multimodal images, the problem of inaccurate feature point recognition in underwater environments is solved, and the accuracy and stability of multimodal image fusion are improved.

CN121746897B · Active · Publication Date: 2026-06-30 · DEEPIN ARTIFICIAL INTELLIGENCE TECHNOLOGY (SHENYANG) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: DEEPIN ARTIFICIAL INTELLIGENCE TECHNOLOGY (SHENYANG) CO LTD
Filing Date: 2025-12-24
Publication Date: 2026-06-30


Abstract

This invention relates to a multimodal visual detection and recognition method for underwater optical fusion equipment, belonging to the field of image processing technology. Starting from any point in the current modal image, the method determines the hypothetical edge segment corresponding to that point. The confidence that the starting point is a true edge point is determined from the magnitude and consistency of the edge features of the points in the hypothetical edge segment. The edge texture ratio of any region in the image is determined from the confidence of all points within that region, and the corner point ratio within that region is determined from the irregularity of the hypothetical edge segments. Based on these two ratios, an adaptive contrast threshold is determined for extracting corner points in that region with the SIFT algorithm. This yields all corner points of the current modal image, after which multimodal image fusion and the subsequent visual detection and recognition are completed. By obtaining an adaptive contrast threshold for each region in the modal image, this invention improves the accuracy and efficiency of corner point determination in modal images, thereby enhancing the multimodal visual fusion effect.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and in particular to a multimodal visual detection and recognition method for underwater optical fusion equipment.

Background Technology

[0002] In marine scientific research and underwater environmental monitoring, the extremely weak lighting, complex suspended particles, and severe multipath scattering in underwater environments significantly limit the clarity, contrast, and stability of single-type optical images. Therefore, fusing multi-source data can effectively compensate for the limitations of a single modality, improving feature representation and recognition robustness. However, under the unique optical characteristics and complex interference factors of the underwater environment, how to efficiently, accurately, and stably fuse large amounts of multi-source image data remains a significant research challenge.

[0003] In underwater multimodal visual fusion, the SIFT (Scale Invariant Feature Transform) algorithm is widely used for registration and alignment of multimodal images due to its scale invariance and strong robustness. However, its effectiveness is significantly limited in underwater environments. This is because underwater images generally suffer from issues such as light attenuation, scattering from suspended particles, and weakened texture, resulting in generally small local gradient changes. The SIFT algorithm relies on a fixed contrast threshold (approximately 0.04 by default) to select key feature points during feature extraction. However, this fixed threshold cannot adapt to the characteristics of weakly textured underwater regions, leading to the accidental deletion of many valid but low-gradient corner points. This results in insufficient and unevenly distributed feature points, further causing problems such as unstable cross-modal matching and increased image alignment errors.

[0004] In other words, the current underwater multimodal visual fusion process suffers from the technical problem that the multimodal image fusion effect is poor because feature points on the images to be fused are identified inaccurately.

Summary of the Invention

[0005] In view of this, the present invention provides a multimodal visual detection and recognition method for underwater optical fusion devices to solve the technical problem of poor fusion effect caused by inaccurate recognition of feature points on the images to be fused in the current underwater multimodal visual fusion process.

[0006] The present invention provides a multimodal visual detection and recognition method for an underwater optical fusion device, comprising:

[0007] Multimodal images are acquired using an underwater optical fusion device. The grayscale image of any modal image is used as the image to be analyzed, and the pixels with non-zero gradient magnitudes on the image to be analyzed are used as the points to be analyzed.

[0008] Using any point to be analyzed as a base point, within a preset neighborhood of the base point, find the point to be analyzed with the highest probability of forming an edge with the base point based on the gradient direction difference and position difference, and use the obtained matching point as a new base point. Repeat the matching point determination process until the number of matching points reaches a preset number, and obtain the matching point sequence of any point to be analyzed.

[0009] Based on the magnitude and stability of the probability of each matching point in the matching point sequence of any point to be analyzed, the confidence of any point to be analyzed as an edge pixel is determined. All points to be analyzed are clustered according to their positions and gradient magnitudes. Using the confidence as the weight, the weighted average of the gradient magnitudes of all points to be analyzed in any cluster is calculated. Then, the edge saliency of any cluster is determined by combining the mean of the gradient magnitudes of all points to be analyzed in any cluster.

[0010] Determine the maximum curvature of the trajectory formed by the matching points corresponding to any point to be analyzed. Use the maximum value of the maximum curvature corresponding to each point to be analyzed in any cluster as the corner saliency of any cluster. Use the edge saliency and the corner saliency to determine the threshold adjustment coefficient for the region corresponding to any cluster and obtain the adaptive contrast threshold of the region.

[0011] The corner points of the image to be analyzed are extracted by the scale-invariant feature transformation algorithm based on the adaptive contrast threshold of the region corresponding to each type of cluster. Multimodal fusion is performed based on the corner points of each modality image to complete visual detection and recognition.

[0012] Further, determining the probability of forming an edge includes:

[0013] Each point to be analyzed within the preset neighborhood of the base point is denoted as a neighborhood point. The absolute value of the difference between the gradient direction of the base point and the gradient direction of any neighborhood point is calculated and denoted as the first difference degree. The distance between the base point and any neighborhood point is calculated along the gradient direction of the base point and denoted as the second difference degree. The distance between the base point and any neighborhood point is denoted as the third difference degree.

[0014] The probability of the base point forming an edge with any of the neighboring points is determined based on the first difference degree, the second difference degree, and the third difference degree, wherein the probability degree is inversely proportional to the first difference degree, the second difference degree, and the third difference degree.

[0015] Furthermore, determining the confidence level of any of the points to be analyzed as edge pixels includes:

[0016] The probability degree of each matching point in the matching point sequence of any point to be analyzed is calculated sequentially when it is determined, and the calculated probability degrees are used to form the probability degree sequence of any point to be analyzed.

[0017] The confidence level of any point to be analyzed as an edge point is determined based on the mean and the coefficient of variation of all probability degrees in the probability degree sequence of that point. The confidence level is directly proportional to the mean of the probability degrees in the sequence and inversely proportional to their coefficient of variation.

[0018] Furthermore, the clustering of all the points to be analyzed based on their positions and gradient magnitudes includes:

[0019] The x-coordinates, y-coordinates, and gradient magnitudes of all the points to be analyzed are normalized. Based on the normalized x-coordinates, normalized y-coordinates, and normalized gradient magnitudes of each point to be analyzed, the HDBSCAN clustering method is used to cluster all the points to be analyzed.

[0020] Further, determining the edge saliency of any cluster includes:

[0021] The normalized value of the ratio of the weighted average of the gradient magnitudes of all points to be analyzed in a given cluster to the mean of the gradient magnitudes of all points to be analyzed in the same cluster is used as the edge saliency of that cluster.

[0022] Further, determining the threshold adjustment coefficient for the region corresponding to any of the clusters includes:

[0023] Calculate the mean corner saliency of each cluster, and use the ratio of the corner saliency of any cluster to the mean corner saliency as the relative corner saliency of any cluster;

[0024] A threshold adjustment coefficient for the region corresponding to any cluster is determined based on the relative corner saliency and the edge saliency of that cluster. The threshold adjustment coefficient is inversely proportional to the relative corner saliency and directly proportional to the edge saliency.

[0025] Furthermore, obtaining the adaptive contrast threshold of the region includes:

[0026] The difference between the upper and lower limits of the contrast threshold of the scale-invariant feature transform algorithm is used as the total adjustable threshold. The threshold adjustment term is determined based on the threshold adjustment coefficient and the total adjustable threshold. The sum of the threshold adjustment term and the lower limit of the contrast threshold is used as the adaptive contrast threshold for the region corresponding to any cluster.

[0027] Furthermore, the extraction of corner points from the image to be analyzed includes:

[0028] The corner points in the region corresponding to any cluster are extracted by the scale-invariant feature transform algorithm based on the adaptive contrast threshold of the region corresponding to any cluster, and the corner points in the regions corresponding to all clusters are taken as the corner points of the image to be analyzed.
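As an illustration of paragraphs [0022] to [0026], the following is a minimal sketch of how the adaptive contrast threshold could be assembled. The text fixes only the structure (lower limit plus an adjustment term scaled by the adjustable range) and the direction of each dependence. The specific coefficient `edge_sal / (1 + rel)` and the limit values `t_low=0.01` and `t_high=0.04` are assumptions chosen for illustration and are not taken from the patent.

```python
def adaptive_contrast_threshold(edge_sal: float,
                                corner_sal: float,
                                mean_corner_sal: float,
                                t_low: float = 0.01,    # hypothetical lower limit
                                t_high: float = 0.04):  # hypothetical upper limit
    """Sketch of a per-region SIFT contrast threshold.

    edge_sal        : edge saliency of the cluster, assumed to lie in [0, 1]
    corner_sal      : corner saliency of the cluster (maximum curvature)
    mean_corner_sal : mean corner saliency over all clusters
    """
    # Relative corner saliency: this cluster compared with the average cluster
    rel = corner_sal / mean_corner_sal if mean_corner_sal > 0 else 0.0
    # Assumed adjustment coefficient: grows with edge saliency and shrinks
    # as the relative corner saliency grows; stays in [0, 1] for edge_sal in [0, 1]
    k = edge_sal / (1.0 + rel)
    # Lower limit plus the adjustment term over the total adjustable range
    return t_low + k * (t_high - t_low)


# A corner-rich region receives a lower threshold than a corner-poor one
t_corner_rich = adaptive_contrast_threshold(0.8, 2.0, 1.0)
t_corner_poor = adaptive_contrast_threshold(0.8, 0.5, 1.0)
```

Under these assumptions the threshold never leaves the interval between the two limits, so a region cannot be assigned a value outside the range considered usable for the SIFT algorithm.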

[0029] The advantages of this invention compared to the prior art are:

[0030] This invention uses pixels with non-zero gradient magnitudes in any modal image as the points to be analyzed. Starting from any given point, it iteratively searches for matching points within a preset neighborhood to determine the most likely edge segment beginning at that point. Based on the magnitude and stability of the probability that the base point and its matching point form an edge in each matching iteration, the confidence of the starting point as an edge pixel is determined. Then, all points are clustered based on position and gradient magnitude. The weighted average of the gradient magnitudes of all points in any cluster is calculated using the obtained confidence as the weight, and compared with the unweighted mean gradient magnitude of the points in that cluster to determine the edge saliency of the corresponding region. At the same time, the corner saliency of the cluster is determined based on the irregularity of the edge segments formed by the points in each cluster. Finally, the adaptive contrast threshold of the corresponding region is determined by combining the edge saliency and the corner saliency, completing the adaptive corner acquisition for each region in the modal image. Multimodal fusion and visual detection and recognition are then performed based on the corners of each modal image. In short, this invention constructs hypothetical edge segments in modal images, starting from each point to be analyzed. Based on these hypothetical edge segments, it quantifies the texture proportion and texture irregularity in each cluster region of the modal image. This enables adaptive determination of the most suitable contrast threshold for corner extraction in different regions using the scale-invariant feature transform algorithm, which can significantly improve the accuracy of corner determination in modal images and enhance the multimodal fusion effect.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a flowchart illustrating a multimodal visual detection and recognition method for an underwater optical fusion device provided in Embodiment 1 of the present invention.

Detailed Implementation

[0033] To further illustrate the technical solution of the present invention, specific embodiments are described below.

[0034] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of the invention include a particular feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. Furthermore, a particular feature, structure, or characteristic in one or more embodiments may be combined in any suitable form, and the terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless otherwise specifically emphasized.

[0035] It should be understood that the sequence number of each step in the following embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0036] Method Implementation Examples:

[0037] See Figure 1, which is a flowchart illustrating a multimodal visual detection and recognition method for an underwater optical fusion device provided in Embodiment 1 of the present invention. As shown in Figure 1, the method may include the following steps:

[0038] S101, acquire multimodal images through underwater optical fusion equipment, use the grayscale image of any modal image as the image to be analyzed, and use the pixels with non-zero gradient magnitudes on the image to be analyzed as the points to be analyzed.

[0039] In underwater multi-source image fusion tasks, underwater images are affected by factors such as light attenuation and scattering from suspended particles, resulting in problems such as low overall or local contrast, weakened texture, and flat local gradient changes. Traditional SIFT (Scale-Invariant Feature Transform) algorithms use a fixed contrast threshold (typically 0.04) for feature point or keypoint extraction, which is ill-suited to the characteristics of such images. If the contrast threshold is set too high, weakly textured local areas may be ignored, leading to the omission of important feature points and affecting the accuracy and stability of multi-source image fusion. Conversely, using a uniformly small contrast threshold may result in the extraction of too many irrelevant feature points, such as noise, increasing the possibility of mismatches and affecting the selection of effective corner points and the accuracy of subsequent fusion.

[0040] Therefore, the fixed threshold strategy has significant limitations in underwater image processing, especially in multi-source image fusion. This contradiction highlights the lack of adaptive perception of local image content in fixed threshold methods. Therefore, underwater multi-source image data fusion requires an adaptive mechanism that can dynamically perceive local texture intensity, gradient distribution, and their global importance, and dynamically adjust the contrast threshold during feature extraction. This would improve the accuracy and matching precision of corner point extraction, further enhancing the overall effect and stability of underwater multi-source image fusion.

[0041] This embodiment aims to optimize the application of the SIFT algorithm in the multi-source visual image fusion process using an underwater optical fusion device. First, the device's built-in visual image sensor simultaneously acquires different types of image data, i.e., multimodal images. This data is then processed by the device's internal image fusion processing module. The image fusion module comprises several sub-modules, among which the preprocessing module adaptively adjusts the contrast threshold in the SIFT algorithm to improve the accuracy of corner extraction. By dynamically adjusting the contrast threshold, key features in the images can be effectively extracted under different underwater environmental conditions. The adjusted SIFT algorithm selects valid corner points and combines these corner points with the image fusion algorithm to achieve accurate fusion of multi-source images.

[0042] Specifically, after acquiring multimodal images, each modal image needs to be processed separately to determine its adaptive contrast threshold when applied to the SIFT algorithm. Taking any modal image as an example, it is first grayscaled to serve as the image to be analyzed. Then, the Sobel operator is used to calculate the gradient magnitude and direction of each pixel in the entire image, that is, to determine the gradient vector of each pixel.

[0043] Considering that key feature points in an image are not distributed in areas with stable gray levels, in order to reduce computational load and improve the accuracy of subsequent determination of the adaptive contrast threshold, pixels with zero gradient magnitude are first removed from the image to be analyzed, and pixels with non-zero gradient magnitude are selected as the points to be analyzed.
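Step S101 as described in paragraphs [0042] and [0043] can be sketched as follows. This is an illustrative sketch only: it assumes the grayscale image is available as a 2-D NumPy array, applies the 3x3 Sobel kernels directly with array slicing, and replicates border pixels, a boundary choice the text does not specify. The function names are hypothetical.

```python
import numpy as np

def sobel_gradients(gray: np.ndarray):
    """Return (magnitude, direction) of the Sobel gradient of a 2-D grayscale image."""
    # Replicate the border so the output has the same shape as the input (assumption)
    g = np.pad(gray.astype(np.float64), 1, mode="edge")
    # Horizontal response: weighted right column minus weighted left column
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical response: weighted bottom row minus weighted top row
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    magnitude = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx)  # gradient direction in radians
    return magnitude, direction

def points_to_analyze(magnitude: np.ndarray) -> np.ndarray:
    """(row, col) coordinates of all pixels with non-zero gradient magnitude."""
    return np.argwhere(magnitude > 0)

# Toy image with a single vertical step edge between columns 2 and 3
gray = np.zeros((6, 6))
gray[:, 3:] = 255
mag, ang = sobel_gradients(gray)
pts = points_to_analyze(mag)
```

On the toy image only the two columns adjacent to the step have a non-zero gradient, so the flat regions are discarded before any further analysis, which is the computational saving paragraph [0043] refers to.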

[0044] S102, taking any point to be analyzed as the base point, within the preset neighborhood of the base point, find the point to be analyzed with the highest probability of forming an edge with the base point based on the gradient direction difference and position difference, and use the obtained matching point as the new base point. Repeat the matching point determination process until the number of matching points reaches a preset number, and obtain the matching point sequence of any point to be analyzed.

[0045] The point to be analyzed could be a real edge texture or a pseudo-feature caused by underwater turbidity and noise. Therefore, it is necessary to further determine whether it belongs to a valid texture region. Considering that the length of a real edge is significantly longer than that of a pseudo-feature region caused by underwater turbidity and noise, and that each point on the edge is continuous and has a high consistency in gradient direction, we can first find the neighboring pixels that are most likely to form an edge with any point to be analyzed in a preset neighborhood of the image to be analyzed, based on the differences in gradient direction and position. Then, based on the found neighboring pixels, we repeat the process of finding the pixels that are most likely to form an edge with the current base pixel in the neighborhood of the current base pixel, until the number of matching points determined by this repeated search process reaches a preset number, satisfying the characteristic that an edge generally has a certain length. This constitutes the matching point sequence corresponding to the initial point to be analyzed. This matching point sequence can characterize the edge feature degree of a hypothetical edge segment of a certain length obtained from the point to be analyzed as an edge point, thereby reflecting the credibility of the point to be analyzed as a real edge pixel.

[0046] Specifically, firstly, based on any point to be analyzed as a base point, the probability of forming an edge segment between it and each neighboring point within a preset neighborhood needs to be determined. This probability is constructed based on the characteristics that edge or texture pixels generally have continuity and directional similarity, including:

[0047] Each point to be analyzed within the preset neighborhood of the base point is denoted as a neighborhood point. The absolute value of the difference between the gradient direction of the base point and the gradient direction of any neighborhood point is calculated and denoted as the first difference degree. The distance between the base point and any neighborhood point is calculated along the gradient direction of the base point and denoted as the second difference degree. The distance between the base point and any neighborhood point is denoted as the third difference degree.

[0048] The probability of the base point forming an edge with any of the neighboring points is determined based on the first difference degree, the second difference degree, and the third difference degree, wherein the probability degree is inversely proportional to the first difference degree, the second difference degree, and the third difference degree.

[0049] Furthermore, as a preferred option, the degree of probability is:

[0050]

[0051] Here, the left-hand side of the formula denotes the probability that the i-th point to be analyzed and the s-th point to be analyzed within its preset neighborhood form an edge. The i-th point to be analyzed is the aforementioned base point, and the s-th point is one of the neighborhood points within the preset neighborhood of the base point. In this embodiment, the preset neighborhood is preferably a 24-neighborhood, meaning that all points to be analyzed within the 5x5 window centered on the i-th point, excluding the i-th point itself, are treated as neighborhood points. This neighborhood size helps limit the influence of scattering from suspended particles in the water on weak-edge analysis, thereby improving noise resistance during the analysis.

The remaining terms of the formula are as follows. The normalization function may be, for example, linear normalization or norm normalization. The two direction terms are the gradient direction of the i-th point to be analyzed and the gradient direction of the s-th point within its preset neighborhood. The absolute value of their difference measures how closely the two gradient directions agree: the smaller the value, the more similar the gradient directions of the two pixels, and the better they conform to edge features such as texture. The next term is the distance between the i-th point and the s-th point measured along the gradient direction of the i-th point: the smaller this value, the better the two pixels match the characteristics of pixels lying on the same edge. The last term is the distance between the i-th point and the s-th point: since edge pixels are usually continuous, the smaller this distance, the better the two pixels conform to the characteristics of edge pixels.
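The formula of paragraph [0050] is not reproduced in this text, so the following is only a hedged sketch of a probability degree with the stated properties, decreasing in each of the three difference degrees. The combination `exp(-(a + b + c))`, the linear normalization constants, and the wrapping of the direction difference to [0, π] are all assumptions made for illustration.

```python
import math

# Largest possible distance to a neighbor inside a 5x5 window (24-neighborhood)
MAX_DIST = 2 * math.sqrt(2)

def edge_probability(base, neighbor, theta_base, theta_neighbor):
    """Hypothetical probability degree that `base` and `neighbor` lie on one edge.

    base, neighbor : (row, col) pixel coordinates
    theta_*        : gradient directions in radians
    """
    # First difference degree: gradient-direction difference, wrapped to [0, pi]
    d_theta = abs(theta_base - theta_neighbor) % (2 * math.pi)
    d_theta = min(d_theta, 2 * math.pi - d_theta)
    # Displacement from base to neighbor as (dx, dy) = (column step, row step)
    dx = neighbor[1] - base[1]
    dy = neighbor[0] - base[0]
    # Second difference degree: displacement projected onto the base gradient direction
    d_along = abs(dx * math.cos(theta_base) + dy * math.sin(theta_base))
    # Third difference degree: Euclidean distance between the two points
    d_euclid = math.hypot(dx, dy)
    # Assumed combination: linearly normalize each term, then apply exp(-sum)
    return math.exp(-(d_theta / math.pi
                      + d_along / MAX_DIST
                      + d_euclid / MAX_DIST))

# Base point whose gradient points along +x, so the edge itself runs vertically
p_along_edge = edge_probability((2, 2), (3, 2), 0.0, 0.0)   # neighbor on the edge
p_across_edge = edge_probability((2, 2), (2, 3), 0.0, 0.0)  # neighbor across the edge
p_rotated = edge_probability((2, 2), (3, 2), 0.0, math.pi / 2)
```

With this choice a neighbor displaced perpendicular to the gradient (along the edge) scores higher than one displaced along the gradient, and a neighbor whose gradient direction differs scores lower still, which matches the qualitative behavior described in paragraph [0051].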

[0052] Through the above probability calculation process, there will be a probability between the i-th point to be analyzed, which is currently the base point, and each of its neighboring points in the preset neighborhood. The neighboring point with the maximum probability value is found, and this neighboring point is the matching point of the i-th point to be analyzed. Then, the obtained matching point is used as a new base point to find its matching point again. This process is repeated until the number of matching points found reaches the preset number. Then, the found matching points can be used to form a sequence, which is the matching point sequence of the i-th point to be analyzed at the beginning.

[0053] An example of the matching point sequence acquisition process is as follows: First, take any point to be analyzed as the base point, assuming it is pixel a. Then, obtain the neighborhood point with the highest probability of forming an edge with pixel a within the preset neighborhood of pixel a (i.e., the 24 neighborhood mentioned above), and assume it is b. Record pixel b as the matching point of pixel a, and mark pixel a, such as marking it as -1, to indicate that it has been used as a base point and cannot be used as a matching point in the subsequent matching point search process. Then, take pixel b as the center, which is the new base point, and obtain the unmarked pixel with the highest probability of forming an edge with pixel b within the preset neighborhood of pixel b, and assume it is pixel c. Record pixel c as the matching point of pixel b, and then mark pixel b in the same way as pixel a. Then, take pixel c as the center, which is the new base point, and obtain the unmarked pixel with the highest probability of forming an edge with pixel c, and then repeat the above operation.

[0054] In the examples above, each pixel is easily understood to be a point to be analyzed with a gradient magnitude that is not zero. It can also be seen that in the process of repeatedly searching for matching points, the matching point determined in the current matching point search stage will not be a point to be analyzed that has already been used as a base point, so as to avoid the repeated search process from getting stuck in an infinite loop and failing to find an effective sequence of matching points.

[0055] In this embodiment, the preset number is preferably 8, that is, the number of repeated search iterations for matching points is set to 8. In this way, the length of the determined matching point sequence can meet the continuity characteristics that edge pixels usually have, thus effectively distinguishing it from the false features generated by underwater turbidity and noise.
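The iterative search of paragraphs [0052] to [0055] can be sketched as a greedy trace. In this sketch, the scoring function is passed in as a parameter because the exact probability formula is not reproduced in this text; the toy score `1 / (1 + distance)` used in the example is purely hypothetical. The early exit when no unmarked neighbor remains is also an assumption, since the text does not discuss that case.

```python
from typing import Callable, List, Set, Tuple

Point = Tuple[int, int]

def trace_matching_sequence(start: Point,
                            candidates: Set[Point],
                            score: Callable[[Point, Point], float],
                            steps: int = 8,
                            radius: int = 2):
    """Greedily build the matching point sequence of `start`.

    candidates : all points to be analyzed (non-zero gradient magnitude)
    score      : probability degree that two points form an edge
    steps      : preset number of matching points (8 in the embodiment)
    radius     : 2 gives the 5x5 window, i.e. the 24-neighborhood
    """
    used = {start}          # points already used as a base point ("marked")
    base = start
    matches: List[Point] = []
    probs: List[float] = []
    for _ in range(steps):
        best, best_p = None, -1.0
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                q = (base[0] + dr, base[1] + dc)
                if q in used or q not in candidates:
                    continue  # skip marked points and zero-gradient pixels
                p = score(base, q)
                if p > best_p:
                    best, best_p = q, p
        if best is None:
            break  # assumption: stop early if no unmarked neighbor exists
        matches.append(best)
        probs.append(best_p)
        used.add(best)      # the match becomes the next base point
        base = best
    return matches, probs

# Toy example: ten points on a vertical line, scored by a hypothetical 1/(1+distance)
line = {(r, 3) for r in range(10)}
toy_score = lambda a, b: 1.0 / (1.0 + ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5)
seq, seq_probs = trace_matching_sequence((0, 3), line, toy_score)
```

Marking every former base point prevents the trace from stepping back onto itself, which is the infinite-loop safeguard described in paragraph [0054].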

[0056] S103, based on the magnitude and stability of the probability of each matching point in the matching point sequence of any point to be analyzed, determine the confidence of any point to be analyzed as an edge pixel. Cluster all points to be analyzed according to their positions and gradient magnitudes. Using the confidence as the weight, calculate the weighted average of the gradient magnitudes of all points to be analyzed in any cluster. Then, combine the average of the gradient magnitudes of all points to be analyzed in any cluster to determine the edge saliency of any cluster.

[0057] After obtaining the matching point sequence for any point to be analyzed, how well the sequence conforms to the high similarity and consistency expected of pixels on a true edge can be measured from two quantities: the probability degree of each matching point at the time it was determined, and the similarity, or stability, of those probability degrees across the sequence. Together these characterize the confidence that the point to be analyzed corresponding to the matching point sequence is an edge pixel. Determining this confidence includes:

[0058] The probability degree of each matching point in the matching point sequence of any point to be analyzed is calculated sequentially when it is determined, and the calculated probability degrees are used to form the probability degree sequence of any point to be analyzed.

[0059] The confidence level of any point to be analyzed as an edge point is determined based on the mean and the coefficient of variation of all probability degrees in the probability degree sequence of that point. The confidence level is directly proportional to the mean of the probability degrees in the sequence and inversely proportional to their coefficient of variation.

[0060] Furthermore, as a preferred option, the confidence level is:

[0061]

[0062] Here, the left-hand side of the formula denotes the confidence that the i-th point to be analyzed is an edge pixel. The formula involves the number of matching points in the matching point sequence of the i-th point to be analyzed, the probability degree of the r-th matching point in that sequence at the time it was determined, the natural constant e, and the coefficient of variation of the probability degrees of all matching points in the sequence, that is, the coefficient of variation of the probability degree sequence of the i-th point. The smaller this coefficient of variation, the more stable the probability degree sequence, meaning the probability degrees of the matching points in the sequence are more similar to one another, and the greater the confidence that the i-th point to be analyzed is an edge pixel. The formula also uses the mean of the probability degrees of all matching points in the sequence at the time they were determined. The larger this mean, the greater the probability that each base point and its matching point form an edge. Consequently, the greater the confidence that the trajectory formed by the i-th point to be analyzed and its matching points belongs to an edge, and the greater the confidence that the i-th point to be analyzed is an edge pixel.
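Because the formula of paragraph [0061] is not reproduced in this text, the sketch below shows only one form consistent with the description: the mean probability degree multiplied by e raised to the negative coefficient of variation. This product form, the use of the population standard deviation, and the function name are assumptions.

```python
import math
from statistics import mean, pstdev

def edge_confidence(probs):
    """Hypothetical confidence that a point is an edge pixel.

    probs : probability degrees recorded while tracing its matching point sequence
    """
    mu = mean(probs)
    if mu == 0:
        return 0.0                 # no support for an edge at all
    cv = pstdev(probs) / mu        # coefficient of variation of the sequence
    # Assumed form: rises with the mean, falls as the sequence becomes unstable
    return mu * math.exp(-cv)

stable = edge_confidence([0.5] * 8)          # constant sequence, cv = 0
unstable = edge_confidence([0.9, 0.1] * 4)   # same mean, strongly fluctuating
```

Two sequences with the same mean are therefore separated by their stability, so a point whose trace alternates between strong and weak links is trusted less than one with uniformly moderate links.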

[0063] Since different regions within the same image, such as the image to be analyzed, contain varying degrees of texture intensity, the optimal contrast threshold also differs. Therefore, it is necessary to process different regions separately. To address this, this embodiment further establishes a three-dimensional clustering space, where X, Y, and Z represent the x-coordinate, y-coordinate, and gradient magnitude of the points to be analyzed, respectively. Simultaneously, these three coordinate dimensions are linearly normalized to the interval between 0 and 1. In other words, the x-coordinate, y-coordinate, and gradient magnitude of all points to be analyzed are normalized to obtain normalized x-coordinates, normalized y-coordinates, and normalized gradient magnitudes.

[0064] Then, each point to be analyzed is mapped to its three-dimensional coordinates, and the HDBSCAN algorithm is used to divide the points in the three-dimensional clustering space into different clusters. It should be noted that the HDBSCAN algorithm is an upgraded version of the DBSCAN algorithm that has fewer parameters and is more robust. It only requires setting the min_cluster_size parameter, which is set to 2 here, meaning that the smallest admissible cluster contains two points. In other words, at least two similar points to be analyzed are required to form a cluster.

[0065] In this way, each point on the image to be analyzed can be classified into a different cluster, and each cluster corresponds to a region on the image to be analyzed.

[0066] Next, for any region, i.e., any cluster, the gradient magnitude of each point under the cluster is weighted using the confidence level of the points to be analyzed as the weight, and the weighted average of the gradient magnitudes of all points under the cluster is calculated. This weighted average is compared with the mean of the gradient magnitudes of all points under the same cluster to characterize the proportion of edge texture in the region corresponding to that cluster, i.e., edge saliency. This edge saliency effectively characterizes the texture intensity within the region. It is easy to understand that the stronger the texture intensity within the region, the greater the original contrast within the region may be. Therefore, when using the SIFT algorithm to detect feature points or keypoints in the subsequent corner detection, a larger adaptive contrast threshold should actually be used.

[0067] The method for obtaining the weighted average of the gradient magnitudes of all points to be analyzed under a cluster is as follows:

[0068]

[0069] where $\bar{G}^{w}_a$ denotes the weighted average of the gradient magnitudes of all points to be analyzed in the a-th cluster; $N_a$ denotes the total number of points to be analyzed in the a-th cluster; $G_{a,i}$ denotes the gradient magnitude of the i-th point to be analyzed in the a-th cluster; and $C_{a,i}$ denotes the confidence level that the i-th point to be analyzed in the a-th cluster is an edge pixel. The weighted average thus combines, for the region corresponding to the a-th cluster, the gradient magnitude of each point with the confidence that the point is an edge pixel. It reflects two dimensions: the number of edge pixels in the region and the gradient magnitude of those edge pixels. The larger the combined result, the greater the amount or degree of suspected edge texture in the region.
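The formula of paragraph [0068] is not reproduced in this text. As a hedged reading of paragraphs [0066], [0067] and [0069], and not a restatement of the original formula, the confidence-weighted average could take the following form, where $G_{a,i}$ is the gradient magnitude and $C_{a,i}$ the edge confidence of the i-th point in the a-th cluster, and $N_a$ is the number of points in that cluster (symbols chosen here for illustration). Dividing by $N_a$ rather than by the sum of the weights is an assumption, made because paragraph [0069] states that the value also reflects how many points in the region are edge pixels.

```latex
\bar{G}^{w}_a = \frac{1}{N_a} \sum_{i=1}^{N_a} C_{a,i}\, G_{a,i}
```

Under this reading, and provided the confidences lie in $[0, 1]$, the weighted value never exceeds the plain mean of the gradient magnitudes, and it approaches that mean only when nearly every point in the cluster is a confident edge pixel.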

[0070] By comparing the obtained weighted average of the gradient magnitudes with the mean of the gradient magnitudes of all points to be analyzed within the same region, the edge saliency of that region can be determined, including:

[0071] The normalized value of the ratio of the weighted average of the gradient magnitudes of all points to be analyzed in any cluster to the mean of the gradient magnitudes of all points to be analyzed in that cluster is used as the edge saliency of that cluster.

[0072] The edge saliency is expressed by the following formula:

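[0073] $$E_a = \mathrm{norm}\left(\frac{\bar{G}^{w}_a}{\bar{G}_a}\right)$$

[0074] where $E_a$ denotes the edge saliency of the region corresponding to the a-th cluster; $\mathrm{norm}(\cdot)$ denotes the normalization function; $\bar{G}^{w}_a$ denotes the confidence-weighted average of the gradient magnitudes of the points to be analyzed in the a-th cluster; and $\bar{G}_a$ denotes the mean gradient magnitude of the points to be analyzed in the a-th cluster.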
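The edge saliency of a single cluster can be sketched in Python as follows. This is an illustrative sketch under two stated assumptions that do not come from the text: the confidence-weighted average is taken as the sum of confidence times gradient magnitude divided by the number of points, and the unspecified normalization function is replaced by clamping the ratio to [0, 1]. The function name `edge_saliency` is hypothetical.

```python
def edge_saliency(gradients, confidences):
    """Edge saliency of one cluster from per-point gradients and confidences."""
    n = len(gradients)
    if n == 0 or n != len(confidences):
        raise ValueError("gradients and confidences must be non-empty and equal in length")
    plain_mean = sum(gradients) / n
    if plain_mean == 0:
        return 0.0
    # Assumed weighting: confidence-weighted sum divided by the point count.
    weighted = sum(c * g for c, g in zip(confidences, gradients)) / n
    ratio = weighted / plain_mean
    # Assumed normalization: clamp the ratio to [0, 1].
    return min(1.0, max(0.0, ratio))
```

For example, a cluster whose points all have confidence 0.5 yields a saliency of 0.5 regardless of the gradient magnitudes, while a cluster of fully confident edge pixels yields 1.0.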

[0075] S104, determine the maximum curvature of the trajectory formed by the matching points corresponding to any point to be analyzed, take the maximum value of the maximum curvature corresponding to each point to be analyzed in any cluster as the corner saliency of any cluster, and determine the threshold adjustment coefficient of the region corresponding to any cluster using the edge saliency and the corner saliency to obtain the adaptive contrast threshold of the region.

[0076] The aforementioned steps allow us to obtain the edge saliency of any region, representing the relative proportion of edge texture in each region. However, the relative proportion of edge texture only represents the intensity of edge features in a region from one perspective. Even with the same edge feature intensity, there may be differences in texture smoothness. That is, some regions may have more regular edges with a higher proportion of straight lines, while others may have more complex edges with a lower proportion of straight lines. The edge saliency of these two types of regions may be the same, but if the same adaptive contrast threshold is used, it will obviously result in more key points or corner points being identified in regions with regular edges, and fewer key points or corner points being identified in regions with complex edges, which is unreasonable.

[0077] Therefore, for any cluster, after the matching point sequence of each point to be analyzed has been obtained through the aforementioned steps, the trajectory formed by the matching points in the matching point sequence of any point to be analyzed is determined; this trajectory is the position distribution curve of those points on the image to be analyzed. The maximum curvature along the trajectory is calculated, that is, the curvature at the position where the direction of the trajectory changes most, and this is taken as the maximum curvature of that point to be analyzed. Then, for the current cluster, the largest of the maximum curvatures of all its points to be analyzed is selected as the corner saliency of the cluster. In this way each cluster (and its corresponding region) obtains a corner saliency. If the corner saliency of a cluster is larger than the average corner saliency of all clusters on the image to be analyzed, the corresponding region is more likely to contain corner points or other feature points, so its threshold coefficient can be lowered appropriately to retain more feature points in the region. Otherwise, the threshold of the region should be raised to avoid selecting useless or excessive corner points, which would reduce the accuracy and efficiency of subsequent fusion.
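The text does not specify how curvature is evaluated on a discrete trajectory. As one possible sketch, the following Python code uses the Menger curvature (the reciprocal of the circumradius) of every three consecutive trajectory points, takes the maximum along each trajectory, and then takes the maximum over all trajectories of a cluster as its corner saliency. The choice of Menger curvature and all function names are assumptions made for illustration.

```python
import math


def menger_curvature(p, q, r):
    """Curvature of the circle through three 2-D points (0 for collinear points)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    # Twice the signed triangle area.
    cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(p, r)
    if a * b * c == 0:
        # Coincident points: curvature is undefined, treated as 0 here.
        return 0.0
    return 2.0 * abs(cross) / (a * b * c)


def max_trajectory_curvature(trajectory):
    """Largest curvature over all consecutive point triples of one trajectory."""
    triples = (trajectory[i:i + 3] for i in range(len(trajectory) - 2))
    return max((menger_curvature(*t) for t in triples), default=0.0)


def corner_saliency(trajectories):
    """Corner saliency of a cluster: maximum over its points' trajectories."""
    return max((max_trajectory_curvature(t) for t in trajectories), default=0.0)
```

A straight trajectory gives a curvature of 0, whereas a trajectory containing a sharp turn gives a large value, so a cluster with at least one sharply bending trajectory receives a high corner saliency.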

[0078] Based on the above analysis, the threshold adjustment coefficient for the region corresponding to any cluster can be determined according to the edge saliency and the corner saliency, including:

[0079] Calculate the mean corner saliency over all clusters, and use the ratio of the corner saliency of any cluster to the mean corner saliency as the relative corner saliency of that cluster;

[0080] A threshold adjustment coefficient for the region corresponding to any cluster is determined based on the relative corner saliency and the edge saliency. The threshold adjustment coefficient is inversely proportional to the relative corner saliency and directly proportional to the edge saliency.

[0081] Further, as a preferred embodiment, the threshold adjustment coefficient is:

[0082] $$\alpha_a = \mathrm{norm}\left(E_a \cdot \frac{\bar{K}}{K_a}\right)$$

[0083] where $\alpha_a$ denotes the threshold adjustment coefficient for the region corresponding to the a-th cluster; $\mathrm{norm}(\cdot)$ denotes the normalization function; $E_a$ denotes the edge saliency of the a-th cluster; $\bar{K}$ denotes the mean corner saliency over all clusters; $K_a$ denotes the corner saliency of the a-th cluster; and $K_a / \bar{K}$ denotes the relative corner saliency of the a-th cluster. Since a larger relative corner saliency should result in a smaller final adaptive contrast threshold for the region, as analyzed above, the inverse proportional form $\bar{K} / K_a$ is used here. The smaller this value, the more important the feature points in the region are and the smaller the threshold coefficient; otherwise, the threshold coefficient is larger, which avoids selecting too many useless corner points.
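The dependence of the threshold adjustment coefficient on the two saliencies can be sketched as follows. This Python sketch assumes the coefficient is the edge saliency divided by the relative corner saliency, with the unspecified normalization function replaced by clamping to [0, 1]; both choices, and the function name, are assumptions for illustration rather than the original formula.

```python
import math


def threshold_adjustment_coefficients(edge_saliencies, corner_saliencies):
    """Per-cluster coefficients: rise with edge saliency, fall with relative corner saliency."""
    if not corner_saliencies or len(edge_saliencies) != len(corner_saliencies):
        raise ValueError("inputs must be non-empty and equal in length")
    mean_corner = sum(corner_saliencies) / len(corner_saliencies)
    coefficients = []
    for e, k in zip(edge_saliencies, corner_saliencies):
        # Relative corner saliency: this cluster against the mean over all clusters.
        relative = k / mean_corner if mean_corner > 0 else 0.0
        if relative > 0:
            raw = e / relative
        else:
            # No curvature at all: push the coefficient to its upper bound.
            raw = math.inf if e > 0 else 0.0
        # Assumed normalization: clamp to [0, 1].
        coefficients.append(min(1.0, max(0.0, raw)))
    return coefficients
```

For two clusters with equal edge saliency, the one with the larger corner saliency receives the smaller coefficient, which matches the stated intent of retaining more feature points where corners are more likely.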

[0084] Based on the threshold adjustment coefficient of the region corresponding to each cluster, the adaptive contrast threshold of each region can be obtained, including:

[0085] The difference between the upper and lower limits of the contrast threshold of the scale-invariant feature transform (SIFT) algorithm is used as the total adjustable threshold. The threshold adjustment term is determined based on the threshold adjustment coefficient and the total adjustable threshold. The sum of the threshold adjustment term and the lower limit of the contrast threshold is used as the adaptive contrast threshold for the region corresponding to any cluster.

[0086] As a further preferred option, the adaptive contrast threshold is:

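[0087] $$T_a = T_{\min} + \alpha_a \left(T_{\max} - T_{\min}\right)$$

[0088] where $T_a$ denotes the adaptive contrast threshold for the region corresponding to the a-th cluster; $\alpha_a$ denotes the threshold adjustment coefficient of that region; $T_{\min}$ denotes the lower limit of the contrast threshold in the SIFT algorithm; $T_{\max}$ denotes the upper limit of the contrast threshold in the SIFT algorithm; $T_{\max} - T_{\min}$ denotes the total adjustable threshold; and $\alpha_a \left(T_{\max} - T_{\min}\right)$ denotes the threshold adjustment term.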
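The construction of paragraph [0085] amounts to a linear interpolation between the lower and upper limits of the contrast threshold. A minimal Python sketch follows, assuming the threshold adjustment term is the product of the coefficient and the total adjustable threshold; the function name and the limit values used in the usage note are hypothetical and not taken from the text.

```python
def adaptive_contrast_threshold(coefficient, lower, upper):
    """Interpolate between the lower and upper contrast-threshold limits."""
    if upper < lower:
        raise ValueError("upper limit must not be smaller than lower limit")
    total_adjustable = upper - lower
    # Threshold adjustment term, assumed to be coefficient * total adjustable threshold.
    adjustment_term = coefficient * total_adjustable
    return lower + adjustment_term
```

With hypothetical limits of 0.02 and 0.08, a coefficient of 0 returns the lower limit, a coefficient of 1 returns the upper limit, and a coefficient of 0.5 returns the midpoint of about 0.05.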

[0089] At this point, the adaptive contrast threshold used by the SIFT algorithm to extract key points, i.e., corner points, in the region corresponding to each cluster in the image to be analyzed can be determined.

[0090] S105, based on the adaptive contrast threshold of the region corresponding to each cluster, the corner points of the image to be analyzed are extracted by the scale-invariant feature transform (SIFT) algorithm, and multimodal fusion is performed based on the corner points of each modal image to complete visual detection and recognition.

[0091] The above steps yield an adaptive contrast threshold for the region corresponding to any cluster. Then, the SIFT algorithm is used to determine the corner points within each cluster's corresponding region based on the adaptive contrast threshold. The corner points from all clusters' regions are then integrated as the corner points of the image to be analyzed. Similarly, the corner points on each modal image can be determined; these corner points are also the stable feature points on the modal image.
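The text does not state how the SIFT algorithm is invoked region by region. One possible arrangement, sketched below in Python purely as an assumption, is to detect candidate keypoints once with a permissive threshold, assign each candidate to the cluster of its nearest point to be analyzed, and keep it only if its contrast response reaches the adaptive threshold of that cluster. The tuple layouts, the nearest-point assignment, and the function name are all hypothetical.

```python
import math


def filter_keypoints_by_region(keypoints, region_points, region_thresholds):
    """Keep candidate keypoints whose contrast passes their region's threshold.

    keypoints: list of (x, y, contrast) candidates.
    region_points: list of (x, y, cluster_id) for the clustered points.
    region_thresholds: dict mapping cluster_id to its adaptive contrast threshold.
    """
    kept = []
    for x, y, contrast in keypoints:
        # Assumed region assignment: cluster of the nearest clustered point.
        _, _, cluster_id = min(
            region_points,
            key=lambda p: math.hypot(p[0] - x, p[1] - y),
        )
        if abs(contrast) >= region_thresholds[cluster_id]:
            kept.append((x, y, contrast))
    return kept
```

In this arrangement the same candidate contrast can pass in a region with a low adaptive threshold and be rejected in a region with a high one, and the union of the retained candidates forms the corner points of the image to be analyzed.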

[0092] Subsequently, after stable feature points have been obtained for each modal image, various image fusion algorithms can be employed to achieve spatial alignment and fusion of the cross-modal images. For example, RANSAC can be used to eliminate mismatched feature point pairs, affine or perspective transformations can be used for multi-source image registration, and strategies such as pyramid-based wavelet fusion, weighted average fusion, or feature layer fusion can be used to combine the texture details, structural information, and brightness features of the different modal images. This fusion process can significantly improve the accuracy of multi-source underwater images in terms of texture preservation, structural consistency, and overall scene reconstruction, providing more reliable input data for subsequent recognition and detection tasks.

[0093] In this embodiment of the invention, the adaptive contrast threshold for a local region of any modal image in the multimodal images acquired by the underwater optical fusion device is obtained as follows. First, all pixels of the image to be analyzed are screened by gradient magnitude to obtain the points to be analyzed. Then, each point to be analyzed and its neighboring points to be analyzed are examined to obtain the probability that they form an edge, and the neighboring point with the highest probability is taken as the matching point. By repeating this matching point determination, the matching point sequence of the starting point to be analyzed is obtained. Next, the probabilities of the matching points in the matching point sequence of each point to be analyzed are used to determine the confidence level that this point is an edge pixel. Subsequently, all points to be analyzed are clustered based on their positions and gradient magnitudes. Confidence is then used as the weight to calculate the weighted average of the gradient magnitudes of the points to be analyzed in each cluster, and the edge saliency of the region corresponding to the cluster is determined by comparing this weighted average with the unweighted mean gradient magnitude of all points to be analyzed in the same cluster. At the same time, the corner saliency of the cluster is determined from the curvature of the trajectories formed by the matching point sequences of the points to be analyzed in the cluster. Finally, the corner saliency and edge saliency are combined to determine the contrast threshold adjustment coefficient of the region corresponding to the cluster, from which the adaptive contrast threshold of that region is obtained. Thus, the corner points of the whole image to be analyzed are determined by the scale-invariant feature transform (SIFT) algorithm using the adaptive contrast threshold of each region on the image. Finally, the modal images are fused on the basis of their corner points, enabling more accurate visual detection and recognition of underwater targets.

[0094] This invention improves the accuracy and efficiency of corner point selection in modal images by adaptively determining the contrast threshold of each region of the modal images, thereby improving the accuracy of multi-source image fusion and enabling efficient acquisition of underwater equipment images, providing more reliable input data for subsequent tasks such as target recognition and detection.

[0095] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A multimodal visual detection and recognition method for underwater optical fusion equipment, characterized in that the method includes: Multimodal images are acquired using an underwater optical fusion device. The grayscale image of any modal image is used as the image to be analyzed, and the pixels with non-zero gradient magnitudes on the image to be analyzed are used as the points to be analyzed. Using any point to be analyzed as a base point, the point to be analyzed with the highest probability of forming an edge with the base point is found within a preset neighborhood of the base point, based on the gradient direction difference and position difference, and is taken as the matching point; the obtained matching point is then used as a new base point, and the matching point determination process is repeated until the number of matching points reaches a preset number, yielding the matching point sequence of that point to be analyzed. Based on the magnitude and stability of the probabilities of the matching points in the matching point sequence of any point to be analyzed, the confidence that the point to be analyzed is an edge pixel is determined. All points to be analyzed are clustered according to their positions and gradient magnitudes. Using the confidence as the weight, the weighted average of the gradient magnitudes of all points to be analyzed in any cluster is calculated, and the edge saliency of the cluster is determined by combining it with the mean of the gradient magnitudes of all points to be analyzed in the cluster. The maximum curvature of the trajectory formed by the matching points corresponding to any point to be analyzed is determined, and the maximum of the maximum curvatures of the points to be analyzed in any cluster is used as the corner saliency of the cluster. The edge saliency and the corner saliency are used to determine the threshold adjustment coefficient of the region corresponding to any cluster, and the adaptive contrast threshold of the region is obtained. The corner points of the image to be analyzed are extracted by the scale-invariant feature transform algorithm based on the adaptive contrast threshold of the region corresponding to each cluster. Multimodal fusion is performed based on the corner points of each modal image to complete visual detection and recognition.
Determining the probability of forming an edge includes: Each point to be analyzed within the preset neighborhood of the base point is denoted as a neighborhood point. The absolute value of the difference between the gradient direction of the base point and the gradient direction of any neighborhood point is calculated and denoted as the first difference degree. The distance between the base point and the neighborhood point along the gradient direction of the base point is calculated and denoted as the second difference degree. The distance between the base point and the neighborhood point is denoted as the third difference degree. The probability that the base point forms an edge with the neighborhood point is determined based on the first difference degree, the second difference degree, and the third difference degree, and this probability is inversely proportional to each of them.
Determining the confidence that any point to be analyzed is an edge pixel includes: The probability of each matching point in the matching point sequence of the point to be analyzed, as computed when that matching point was determined, is taken in sequence, and these probabilities form the probability sequence of the point to be analyzed. The confidence that the point to be analyzed is an edge pixel is determined based on the mean and the coefficient of variation of all probabilities in the probability sequence. The confidence is directly proportional to the mean of all probabilities in the probability sequence of the point to be analyzed and inversely proportional to their coefficient of variation.

2. The multimodal visual detection and recognition method for underwater optical fusion equipment according to claim 1, characterized in that the clustering of all the points to be analyzed according to their positions and gradient magnitudes includes: The x-coordinates, y-coordinates, and gradient magnitudes of all the points to be analyzed are normalized. Based on the normalized x-coordinates, normalized y-coordinates, and normalized gradient magnitudes of each point to be analyzed, the HDBSCAN clustering method is used to cluster all the points to be analyzed.

3. The multimodal visual detection and recognition method for underwater optical fusion equipment according to claim 1, characterized in that determining the edge saliency of any cluster includes: The normalized value of the ratio of the weighted average of the gradient magnitudes of all points to be analyzed in any cluster to the mean of the gradient magnitudes of all points to be analyzed in that cluster is used as the edge saliency of that cluster.

4. The multimodal visual detection and recognition method for underwater optical fusion equipment according to claim 1, characterized in that determining the threshold adjustment coefficient for the region corresponding to any cluster includes: Calculate the mean corner saliency over all clusters, and use the ratio of the corner saliency of any cluster to the mean corner saliency as the relative corner saliency of that cluster; A threshold adjustment coefficient for the region corresponding to any cluster is determined based on the relative corner saliency and the edge saliency. The threshold adjustment coefficient is inversely proportional to the relative corner saliency and directly proportional to the edge saliency.

5. The multimodal visual detection and recognition method for underwater optical fusion equipment according to claim 1 or 4, characterized in that obtaining the adaptive contrast threshold of the region includes: The difference between the upper and lower limits of the contrast threshold of the scale-invariant feature transform algorithm is used as the total adjustable threshold. The threshold adjustment term is determined based on the threshold adjustment coefficient and the total adjustable threshold. The sum of the threshold adjustment term and the lower limit of the contrast threshold is used as the adaptive contrast threshold for the region corresponding to any cluster.

6. The multimodal visual detection and recognition method for underwater optical fusion equipment according to claim 1, characterized in that the extraction of corner points from the image to be analyzed includes: The corner points in the region corresponding to any cluster are extracted by the scale-invariant feature transform algorithm based on the adaptive contrast threshold of that region, and the corner points in the regions corresponding to all clusters are taken as the corner points of the image to be analyzed.