Target recognition and tracking method and system based on multi-source image fusion
Through priority-based band selection, image decomposition and fusion, spatiotemporal correlation graph construction, and graph neural network inference, the problem of insufficient robustness and continuity of target tracking under severe weather conditions is solved, and stable, continuous target tracking in complex environments is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING YOUSHENG ZHIGUANG TECH CO LTD
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies lack robustness and continuity in target tracking under adverse weather conditions. In particular, under complex weather conditions such as rain, snow, fog, and haze, the image fusion process has limited adaptability to changes in the quality of sensor input data, resulting in insufficient robustness of target features and making it easy for trajectory interruptions or misidentification to occur.
By prioritizing the bands of a multispectral sensor, decomposing the image hierarchy using an atmospheric scattering model, and constructing a spatiotemporal correlation graph using a graph neural network, target recognition and tracking are achieved. Specific steps include band selection, transmittance decomposition, image fusion, target recognition, spatiotemporal correlation graph construction, and graph neural network inference to optimize the target's trajectory.
Under complex weather conditions, stable and continuous target tracking was achieved through image quality optimization and intelligent association reasoning, which enhanced the accuracy of cross-frame node association and the reliability of target correspondence, and solved the problem of insufficient robustness in target tracking.
Figure CN122244102A_ABST
Description
Technical Field
[0001] This application relates to the field of image target tracking technology, and in particular to a target recognition and tracking method and system based on multi-source image fusion.
Background Technology
[0002] The target recognition and tracking method based on multi-source image fusion aims to improve the detection and tracking capabilities of targets of interest by integrating image data from different sources. This method has broad application prospects in fields such as security monitoring, intelligent transportation, and environmental perception.
[0003] In existing technologies, a strategy of feature extraction and weighted fusion of images acquired by multiple sensors is often adopted to comprehensively utilize the complementarity of information from different bands or modes. For example, visible light and infrared images are fused at the pixel level or feature level to enhance the overall performance of the target.
[0004] Further approaches apply target detection algorithms to the fused image sequence to obtain target locations and use filtering or data association methods to link target locations between different image frames to form a trajectory. However, current methods have limited adaptability to changes in the quality of sensor input data during image fusion in complex weather conditions such as rain, snow, fog, and haze. This results in insufficient robustness of target features, making the tracker prone to trajectory interruptions or misidentification when the target is briefly occluded or its appearance changes. Therefore, existing technologies suffer from insufficient continuity in target tracking under harsh environments.
Summary of the Invention
[0005] This application provides a target recognition and tracking method and system based on multi-source image fusion to solve the problems of low robustness and poor continuity in the tracking of moving targets under severe weather conditions in the prior art.
[0006] To address the aforementioned technical problems, in a first aspect, this application provides a target recognition and tracking method based on multi-source image fusion, comprising: Based on priority information, the raw light received by the multispectral sensor is filtered by band to generate a preprocessed image set. The priority information is determined based on meteorological visibility data and spectral attenuation data. The transmittance map of the preprocessed image set is calculated using an atmospheric scattering model, and based on the transmittance map, each image in the preprocessed image set is decomposed into a first image layer and a second image layer. In the image fusion framework, the second image layers from different bands are fused to obtain a fused detail layer, and the fused detail layer is reconstructed and merged with each of the first image layers to generate the corresponding fused image; Target recognition is performed on the image sequence composed of the fused images to obtain the target information set in each image frame, and a spatiotemporal correlation graph is constructed based on the target information set; The spatiotemporal correlation graph is input into a graph neural network, and the graph neural network is used to infer the correspondence between nodes in the spatiotemporal correlation graph. Based on the correspondence, the final motion trajectory of the target is generated, and the target is continuously tracked based on the final motion trajectory.
[0007] Optionally, the step of inputting the spatiotemporal correlation graph into a graph neural network and inferring the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network includes: The spatiotemporal correlation graph is input into the attention graph convolutional layer of the graph neural network. In the attention graph convolutional layer, a multi-head attention mechanism is used to assign weights to the connection edges between each node and its neighboring nodes. Based on the weights, the features of the neighboring nodes are aggregated, and the feature representation of each node is updated to obtain the first updated feature vector. The first updated feature vector is input into the memory enhancement module of the graph neural network. In the memory enhancement module, the temporal features of the nodes are modeled through a long short-term memory network, and the second updated feature vector is output. The second updated feature vector is input into the matching calculation layer of the graph neural network. In the matching calculation layer, the initial cosine similarity between any two second updated feature vectors is calculated, and motion constraints are introduced to correct the initial cosine similarity in order to generate the matching probability between nodes. The Hungarian algorithm is used to globally optimize and allocate the matching probability through the output layer of the graph neural network to obtain the optimal matching pairs of nodes between different image frames. The node pairs in the optimal matching pairs are determined as the correspondence of the same target in different frames.
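The final assignment step above can be illustrated with a small sketch. It is not the Hungarian algorithm itself: for brevity it enumerates all assignments by brute force, which yields the same optimal matching on small matrices but scales factorially. The function name `best_assignment` and the sample probability matrix are hypothetical.

```python
from itertools import permutations


def best_assignment(prob):
    """Return the (row, col) pairs that maximize the total matching probability.

    prob[r][c] is the matching probability between node r of one frame and
    node c of another frame. Brute-force stand-in for the Hungarian algorithm;
    requires len(prob) <= len(prob[0]).
    """
    n_rows, n_cols = len(prob), len(prob[0])
    if n_rows > n_cols:
        raise ValueError("expected at least as many columns as rows")
    best_score, best_cols = float("-inf"), None
    for cols in permutations(range(n_cols), n_rows):
        score = sum(prob[r][c] for r, c in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    return list(enumerate(best_cols))


# Hypothetical matching probabilities between 3 nodes in frame k and frame k+1.
prob = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
pairs = best_assignment(prob)
```

In this example a greedy choice would pair row 0 with column 0 (0.90) and leave row 1 with a poor match, whereas the global optimum pairs row 0 with column 1 and row 1 with column 0, which is the behavior the global allocation step is meant to provide.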
[0008] Optionally, calculating, in the matching calculation layer, the initial cosine similarity between any two second updated feature vectors and introducing motion constraints to correct the initial cosine similarity to generate the matching probability between nodes includes: in the matching calculation layer, the ratio of the dot product of the two second updated feature vectors to the product of their moduli is calculated to obtain the initial cosine similarity; the position information and time information of the two nodes corresponding to the two second updated feature vectors are obtained from the spatiotemporal correlation graph; based on the position information and the time information, the ratio of the displacement difference to the time difference between the targets represented by the two nodes is calculated to obtain a metric value; the metric value is compared with a preset metric value threshold to generate a motion constraint factor; the initial cosine similarity is multiplied by the motion constraint factor to obtain the target cosine similarity; and the target cosine similarity is input into the Sigmoid function for normalization, and the matching probability between nodes is output.
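The matching-probability calculation described above can be sketched as follows. The text does not specify how the motion constraint factor is derived from the threshold comparison, so this sketch assumes a factor of 1.0 when the metric value (displacement over time difference) is within the threshold and a small penalty otherwise. The names `motion_factor`, `max_speed`, and `penalty` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the moduli."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def motion_factor(pos_a, t_a, pos_b, t_b, max_speed, penalty=0.1):
    """Compare displacement / time difference with a threshold (assumed rule)."""
    dt = abs(t_b - t_a)
    if dt == 0:
        return penalty  # assumption: nodes in the same frame are penalized
    metric = math.dist(pos_a, pos_b) / dt
    return 1.0 if metric <= max_speed else penalty


def matching_probability(u, v, pos_a, t_a, pos_b, t_b, max_speed):
    """Corrected cosine similarity passed through the Sigmoid function."""
    target_similarity = cosine_similarity(u, v) * motion_factor(
        pos_a, t_a, pos_b, t_b, max_speed
    )
    return 1.0 / (1.0 + math.exp(-target_similarity))


# Same appearance, displacement of 5 units over 1 frame.
p_plausible = matching_probability([1, 0], [1, 0], (0, 0), 0, (3, 4), 1, max_speed=10)
p_too_fast = matching_probability([1, 0], [1, 0], (0, 0), 0, (3, 4), 1, max_speed=2)
```

Because the cosine similarity lies in [-1, 1], the Sigmoid output of this sketch stays roughly between 0.27 and 0.73; a real implementation might rescale the input before normalization.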
[0009] Optionally, generating the target's final motion trajectory based on the correspondence, and continuously tracking the target based on the final motion trajectory, includes: Based on the correspondence, multiple nodes belonging to the same target are identified, and the position information of the nodes is extracted according to the image frame order corresponding to the nodes. The position information is smoothed using the Kalman filter algorithm, and the target's state in the next frame of the current image frame is predicted based on the processed position information to generate the target's initial motion trajectory. Based on the initial motion trajectory, the motion pattern of the target is learned using a gated recurrent unit network to predict the target's predicted position in subsequent image frames. The target information set of subsequent image frames is associated and matched with the predicted position, and the matching result is verified. If the verification is successful, the node corresponding to the target information set is added to the correspondence. Based on the location information of the newly added node, the initial motion trajectory is updated and optimized using a trajectory optimization algorithm to generate the final motion trajectory, and continuous tracking is achieved based on the final motion trajectory.
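As an illustration of the Kalman filtering and next-frame prediction step only (the gated recurrent unit network and the trajectory optimization algorithm are not covered), the following is a minimal sketch of a one-dimensional constant-velocity Kalman filter. The constant-velocity motion model, the noise values `q` and `r`, and the function name `kalman_cv_1d` are assumptions; a two-dimensional track could be handled by filtering each coordinate separately.

```python
def kalman_cv_1d(measurements, dt=1.0, q=0.01, r=1.0):
    """Filter 1-D positions with a constant-velocity Kalman filter.

    Returns (filtered_positions, predicted_next_position).
    q (process noise) and r (measurement noise) are illustrative values.
    """
    p, v = 0.0, 0.0                                  # state: position, velocity
    P00, P01, P10, P11 = 1000.0, 0.0, 0.0, 1000.0    # large initial uncertainty
    filtered = []
    for z in measurements:
        # Predict: x' = F x, P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
        p = p + dt * v
        n00 = P00 + dt * (P10 + P01) + dt * dt * P11 + q
        n01 = P01 + dt * P11
        n10 = P10 + dt * P11
        n11 = P11 + q
        # Update with a position-only measurement (H = [1, 0]).
        S = n00 + r
        k0, k1 = n00 / S, n10 / S
        y = z - p
        p, v = p + k0 * y, v + k1 * y
        P00, P01 = (1 - k0) * n00, (1 - k0) * n01
        P10, P11 = n10 - k1 * n00, n11 - k1 * n01
        filtered.append(p)
    return filtered, p + dt * v


# Hypothetical x-coordinates of one target over six consecutive frames.
xs = [0.0, 1.1, 1.9, 3.0, 4.1, 5.0]
filtered_xs, next_x = kalman_cv_1d(xs)
```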
[0010] Optionally, the step of performing target recognition on the image sequence composed of the fused images to obtain a target information set in each image frame, and constructing a spatiotemporal correlation graph based on the target information set, includes: For each frame of the image sequence composed of the fused images, a convolutional neural network is used to perform target detection, detecting multiple potential target regions in the image; Extract the appearance features and location information of each potential target region, and form a corresponding target information set based on the appearance features and location information; Each target information in the target information set of all consecutive frames of images is used as a node to form a node set; For any two nodes belonging to different image frames, the appearance similarity and motion consistency between the corresponding target information sets are calculated using a Siamese network; Calculate the correlation degree between nodes based on the appearance similarity and the motion consistency; If the correlation degree is greater than the preset correlation degree threshold, a connection edge is established between the corresponding two nodes, and a spatiotemporal correlation graph is formed based on the set of nodes and all connection edges.
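The graph construction above can be sketched with plain data structures. This sketch replaces the Siamese network with a cosine similarity on precomputed appearance features, and models motion consistency as an exponential decay of the distance between detections. Both substitutions, the equal weighting, and the names `correlation`, `build_graph`, `weight`, and `scale` are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correlation(a, b, weight=0.5, scale=20.0):
    """Combine appearance similarity and motion consistency (assumed forms)."""
    appearance = cosine_similarity(a["feat"], b["feat"])
    motion = math.exp(-math.dist(a["pos"], b["pos"]) / scale)
    return weight * appearance + (1.0 - weight) * motion


def build_graph(nodes, threshold=0.6):
    """Connect nodes from different frames whose correlation exceeds the threshold."""
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if nodes[i]["frame"] == nodes[j]["frame"]:
                continue  # edges only link detections across frames
            if correlation(nodes[i], nodes[j]) > threshold:
                edges.append((i, j))
    return edges


# Two hypothetical detections in frame 0 and two in frame 1.
nodes = [
    {"frame": 0, "pos": (10, 10), "feat": [1.0, 0.0]},
    {"frame": 0, "pos": (100, 50), "feat": [0.0, 1.0]},
    {"frame": 1, "pos": (12, 11), "feat": [0.9, 0.1]},
    {"frame": 1, "pos": (98, 52), "feat": [0.1, 0.9]},
]
edges = build_graph(nodes)
```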
[0011] Optionally, the step of performing band filtering on the raw light received by the multispectral sensor according to priority information to generate a preprocessed image set includes: Acquire meteorological visibility data and spectral attenuation data; Based on the meteorological visibility data, query the portion of the spectral attenuation data that corresponds to the current meteorological conditions, and obtain the attenuation coefficients corresponding to each of the multiple candidate bands. By comparing the attenuation coefficients of the multiple candidate bands, the infrared band with the smallest attenuation coefficient and the visible light band with the smallest attenuation coefficient are determined as the infrared band and visible light band with the smallest attenuation. Priority information is generated based on the attenuation coefficients of the infrared band and the visible light band with the least attenuation. Based on the priority information, a filter control command is generated, which is used to instruct the passband center and bandwidth of the optical filter unit to be adjusted to match the infrared band and visible light band with the least attenuation, respectively. The optical filtering unit is controlled according to the filtering control command to filter the raw light entering the multispectral sensor so that the light with the least attenuation in the infrared and visible light bands can pass through preferentially and be received by the multispectral sensor to form a preprocessed image set.
[0012] In a second aspect, this application provides a target recognition and tracking system based on multi-source image fusion, comprising: a filtering module, used to filter the raw light received by the multispectral sensor by band according to priority information to generate a preprocessed image set, wherein the priority information is determined based on meteorological visibility data and spectral attenuation data; a calculation module, used to calculate the transmittance map of the preprocessed image set using an atmospheric scattering model, and, based on the transmittance map, decompose each image in the preprocessed image set into a first image layer and a second image layer; a fusion module, used to fuse the second image layers from different bands in the image fusion framework to obtain a fused detail layer, and to reconstruct and merge the fused detail layer with each of the first image layers to generate the corresponding fused image; a recognition module, used to perform target recognition on the image sequence composed of the fused images, obtain the target information set in each image frame, and construct a spatiotemporal correlation graph based on the target information set; an input module, used to input the spatiotemporal correlation graph into the graph neural network, and infer the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network; and a generation module, used to generate the final motion trajectory of the target based on the correspondence, and to continuously track the target based on the final motion trajectory.
[0013] In a third aspect, this application provides an electronic device, comprising: a memory, used to store a computer program; and a processor, configured to execute the computer program to implement the steps of the target recognition and tracking method based on multi-source image fusion as described in the first aspect above.
[0014] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the target recognition and tracking method based on multi-source image fusion as described in the first aspect above.
[0015] The technical solution provided in this application has the following beneficial effects: First, the image quality can be improved and meteorological interference can be suppressed by band screening. Then, the scene structure and detail information can be effectively separated by transmittance decomposition. In this way, key details are enhanced and the overall image performance is improved in the fusion framework. Next, cross-frame target connections can be established by target recognition and association graph construction. Finally, stable and continuous tracking of the target under complex conditions is achieved by using graph neural network inference and trajectory update.
[0016] Furthermore, this application first aggregates node features in the attention graph convolutional layer, then fuses temporal information through the memory enhancement module, and then calculates and corrects the matching probability in the matching calculation layer by incorporating motion constraints. Finally, the Hungarian algorithm is used to optimize and obtain the node correspondence. Therefore, this application enhances the representational ability of node features, improves the accuracy of cross-frame node association, and obtains a more reliable target correspondence through global optimization of the matching process.
[0017] These or other aspects of this application will become more apparent in the following description of the embodiments.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a flowchart of a target recognition and tracking method based on multi-source image fusion provided in an embodiment of this application; Figure 2 is a schematic diagram of a specific implementation of the target recognition and tracking method based on multi-source image fusion provided in an embodiment of this application; Figure 3 is a schematic structural diagram of a target recognition and tracking system based on multi-source image fusion provided in an embodiment of this application.
Detailed Implementation
[0020] The problem of insufficient target tracking continuity in the aforementioned existing technologies under complex weather conditions stems from the limited ability of the image fusion process to cope with changes in the quality of input data, and the lack of sufficient robustness of the target association strategy in dealing with sudden changes in target appearance or temporary occlusion. This limits the actual performance of the overall system in harsh environments.
[0021] To address this issue, this application proposes a target recognition and tracking method based on multi-source image fusion. This method first dynamically selects the imaging band based on real-time meteorological data to optimize the quality of the input image from a physical perspective. Then, it decomposes and fuses the image using an atmospheric scattering model to highlight the detailed features of the target and suppress environmental interference. Finally, it constructs a spatiotemporal correlation graph of the target and uses a graph neural network for intelligent reasoning to achieve robust correlation and trajectory update of the target across frames.
[0022] Therefore, this solution effectively overcomes the effects of image quality degradation and target feature instability under severe weather conditions by coordinating and integrating front-end data acquisition optimization, mid-end image enhancement processing and back-end intelligent association reasoning. This allows the target to maintain the continuity of its trajectory and the consistency of its identity even when it deforms, disappears temporarily or occludes itself, thus solving the core defect of insufficient robustness in target tracking in existing technologies.
[0023] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are merely some embodiments of the present application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0024] The core of this application is to provide a target recognition and tracking method based on multi-source image fusion. A flowchart of one specific implementation is shown in Figure 1. As shown in Figure 1, the method includes: Step 101: Based on priority information, the raw light received by the multispectral sensor is filtered by band to generate a preprocessed image set. The priority information is determined based on meteorological visibility data and spectral attenuation data.
[0025] The priority information is a set of data sorted according to the attenuation of each band under the current weather conditions, used to indicate the priority selection order of light from different bands; the preprocessed image set is a set of image data generated by a multispectral sensor after physical filtering of the original light.
[0026] In this embodiment, step 101 includes the following process: Step 1011: Generate a filter control command based on priority information. The filter control command is used to instruct the passband center and bandwidth of the optical filter unit to be adjusted to match the infrared band and visible light band with the least attenuation, respectively.
[0027] In step 1011, the filter control command is a control signal used to control the adjustment of the hardware parameters of the optical filter unit. The command includes the passband center wavelength value and passband width value that need to be set. The optical filter unit is a tunable optical device that can change the range of light wavelengths that it allows to pass through according to the received command.
[0028] For example, the priority information acquisition process is as follows: acquire meteorological visibility data and spectral attenuation data; based on the meteorological visibility data, query the portion of the spectral attenuation data corresponding to the current meteorological conditions, and acquire the attenuation coefficients corresponding to multiple candidate bands; compare the attenuation coefficients of the multiple candidate bands, and select, among the infrared candidates and among the visible light candidates respectively, the band with the smallest attenuation coefficient as the least-attenuated infrared band and the least-attenuated visible light band; generate priority information based on the attenuation coefficients of these two bands. This embodiment does not limit the mapping relationship between the above attenuation coefficients and priorities.
[0029] In this embodiment, the center wavelength value of the infrared band with the least attenuation and the center wavelength value of the visible light band with the least attenuation are first read according to the band order specified in the priority information. Then, based on the sensor's design parameters, the required passband width B is determined, and a filter control command containing the two center wavelength values and B is generated.
[0030] In practical applications, for example, if the current priority information indicates that the long-wave infrared band is 10 micrometers and the specific visible light band is 0.55 micrometers, then the filter control command is set to adjust the passband center of the optical filter unit to 10 micrometers and 0.55 micrometers respectively, and set the passband width to 0.2 micrometers.
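The band selection and command generation described in step 1011 can be sketched as follows. The attenuation coefficients, the `Band` type, and the dictionary layout of the command are hypothetical; only the 10 micrometer and 0.55 micrometer centers and the 0.2 micrometer passband width come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    name: str
    kind: str           # "infrared" or "visible"
    center_um: float    # passband center wavelength in micrometers
    attenuation: float  # attenuation coefficient under current weather (made-up values)


def select_priority_bands(bands):
    """Return the least-attenuated infrared band and visible light band."""
    infrared = [b for b in bands if b.kind == "infrared"]
    visible = [b for b in bands if b.kind == "visible"]
    if not infrared or not visible:
        raise ValueError("need at least one candidate band of each kind")
    return (min(infrared, key=lambda b: b.attenuation),
            min(visible, key=lambda b: b.attenuation))


def make_filter_command(best_ir, best_vis, bandwidth_um):
    """Build a filter control command: passband centers plus a shared width."""
    half = bandwidth_um / 2.0

    def entry(band):
        return {"center_um": band.center_um,
                "passband_um": (band.center_um - half, band.center_um + half)}

    return {"infrared": entry(best_ir), "visible": entry(best_vis),
            "bandwidth_um": bandwidth_um}


candidates = [
    Band("long-wave infrared", "infrared", 10.0, 0.12),
    Band("mid-wave infrared", "infrared", 4.0, 0.35),
    Band("green", "visible", 0.55, 0.80),
    Band("red", "visible", 0.65, 0.95),
]
best_ir, best_vis = select_priority_bands(candidates)
command = make_filter_command(best_ir, best_vis, bandwidth_um=0.2)
```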
[0031] Step 1012: Control the optical filtering unit according to the filtering control command to filter the original light entering the multispectral sensor so that the light with the least attenuation in the infrared band and visible light band can pass through preferentially and be received by the multispectral sensor to form a preprocessed image set.
[0032] In step 1012, the raw light is the ambient light that enters the imaging system directly without filtering.
[0033] In this embodiment, the optical filtering unit receives and executes the filtering control command, adjusting its internal optical components so that its passband centers match the two target center wavelengths specified in the command. Once adjusted, when external light enters the imaging system, the optical filter unit allows infrared light (with a center wavelength around 10 micrometers and a passband range of 9.9 to 10.1 micrometers) and visible light (with a center wavelength around 0.55 micrometers and a passband range of 0.45 to 0.65 micrometers) to pass preferentially, while blocking interfering light from other wavelengths. Subsequently, the multispectral sensor performs photoelectric conversion and imaging on the light in these two bands, generating a set of infrared images and a set of visible light images. These two sets of images together constitute the preprocessed image set.
[0034] In practical applications, in a foggy scenario, the system calculates based on visibility data and a spectral attenuation model that the long-wave infrared and blue-green light bands are least affected by fog. The system generates corresponding filter control commands to drive the tunable filter to adjust the passband center to 10 micrometers and 0.5 micrometers, respectively. The adjusted filter allows the reflected light from objects in these two bands to effectively enter the camera sensor, thereby acquiring infrared and visible light images that are clearer and less affected by scattering than other bands, which serve as a preprocessed image set.
[0035] This application achieves active screening of imaging light at the physical level through step 101 and its sub-steps, effectively suppressing interference caused by atmospheric scattering and suspended particles, and providing a high-quality initial data foundation for subsequent image processing.
[0036] Step 102: Calculate the transmittance map of the preprocessed image set using an atmospheric scattering model, and based on the transmittance map, decompose each image in the preprocessed image set into a first image layer and a second image layer.
[0037] The atmospheric scattering model used in this application can be constructed based on the classic McCartney model structure. This model mainly includes two core components: a direct attenuation term and an ambient light term. The direct attenuation term describes the portion of the target reflected light that reaches the sensor after atmospheric transmission, and its mathematical expression is J(x)×t(x), where J(x) represents the true radiance of the scene at point x, and t(x) represents the atmospheric transmittance at that point. The ambient light term describes the contribution of atmospheric scattered light itself to imaging, expressed as A×(1-t(x)), where A represents the global atmospheric light value. The complete imaging equation of the model is I(x)=J(x)×t(x)+A×(1-t(x)), where I(x) is the pixel intensity value finally observed by the sensor. In a specific example of this invention, for a visible light image acquired in a foggy scene, the global atmospheric light value A is first estimated from the image I(x) based on the dark channel prior method. Then, the transmittance t(x) is estimated pixel by pixel by solving the imaging equation, and finally, the clear scene image J'(x) is obtained by inversion.
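The imaging equation and its inversion can be checked numerically with a short sketch. The lower bound `t_min` on the transmittance is an assumption added here to avoid dividing by values near zero; it is not stated in the text.

```python
def observe(J, t, A):
    """Imaging equation: I = J * t + A * (1 - t)."""
    return J * t + A * (1.0 - t)


def recover(I, t, A, t_min=0.1):
    """Invert the imaging equation for J, clamping t from below (assumption)."""
    t = max(t, t_min)
    return (I - A) / t + A


# A scene radiance of 100 seen through transmittance 0.5 with A = 220.
I_obs = observe(J=100.0, t=0.5, A=220.0)   # direct term 50 + ambient term 110
J_rec = recover(I_obs, t=0.5, A=220.0)     # inversion returns the original radiance
```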
[0038] A transmittance map is a two-dimensional data map with the same size as the input image. The value at each location in the map represents the proportion of reflected light from the scene at that location that penetrates the atmosphere and reaches the sensor. The first image layer reflects the image components of the scene's broad illumination and basic structure, while the second image layer reflects the image components of the scene's local details and textures.
[0039] In this embodiment, step 102 includes the following process: Step 1021: Obtain multiple single-band images from the preprocessed image set, wherein each single-band image contains pixels and the pixel intensity value corresponding to each pixel.
[0040] In step 1021, a single-band image is an image in the preprocessed image set that corresponds to a specific wavelength range; the pixel intensity value is a quantized value of the brightness or radiation intensity recorded at each pixel location in the image.
[0041] In this embodiment of the application, from the preprocessed image set generated in step 101, a single-band image belonging to the infrared band and a single-band image belonging to the visible light band are read respectively. Each pixel in these images contains a pixel intensity value ranging from 0 to 255.
[0042] In practical applications, from the preprocessed image set collected in foggy scenes, single-band image A collected in the long-wave infrared band and single-band image B collected in the blue-green light band are extracted respectively. The size of images A and B is 640 pixels by 480 pixels, and each pixel stores an intensity value representing the brightness of that point.
[0043] Step 1022: For each single-band image, the pixel intensity value is input into the atmospheric scattering model, and the atmospheric scattering model calculates the transmittance value of each pixel in the single-band image based on the pixel intensity value.
[0044] In step 1022, the transmittance value is a value between 0 and 1, representing the transmittance of light on the atmospheric path corresponding to the pixel, where 0 means completely impenetrable and 1 means completely transparent.
[0045] In this embodiment, for each single-band image, the transmittance is estimated using a dark channel prior method based on the relationship between light intensity attenuation and propagation distance in the atmospheric scattering model. Specifically, the single-band image is first divided into multiple small regions. Within each small region, the point with the minimum pixel intensity value is found, and this minimum value constitutes the dark channel value for that region. Then, assuming that the atmospheric light composition is a known or estimable constant, the initial transmittance of each small region is calculated using the proportional relationship between the dark channel value and the atmospheric light composition. Finally, through smoothing techniques such as guided filtering, the initial transmittance based on the small region is refined to each pixel, thereby obtaining the transmittance value corresponding to each pixel in the single-band image.
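A minimal sketch of the region-based estimate described above, for a single-band image stored as a list of rows. It assumes the atmospheric light A is already known, uses the common form t = 1 - ω·D/A with ω = 0.95 (a conventional choice in dark channel prior implementations, not a value given in the text), and omits the guided-filter refinement. The names `block_dark_channel`, `estimate_transmittance`, `omega`, and `t_min` are hypothetical.

```python
def block_dark_channel(image, block=2):
    """Assign to every pixel the minimum intensity of its non-overlapping block."""
    h, w = len(image), len(image[0])
    dark = [[0.0] * w for _ in range(h)]
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            ys = range(y0, min(y0 + block, h))
            xs = range(x0, min(x0 + block, w))
            m = min(image[y][x] for y in ys for x in xs)
            for y in ys:
                for x in xs:
                    dark[y][x] = m
    return dark


def estimate_transmittance(image, A, block=2, omega=0.95, t_min=0.1):
    """Initial transmittance per pixel from the dark channel value and A."""
    dark = block_dark_channel(image, block)
    return [[max(t_min, 1.0 - omega * d / A) for d in row] for row in dark]


# Tiny hypothetical image: a hazy (bright) left half and a clearer right half.
image = [
    [200, 180, 60, 90],
    [210, 190, 80, 70],
]
t_map = estimate_transmittance(image, A=220.0, block=2)
```

The resulting `t_map` has the same dimensions as the input image, which matches the definition of the transmittance map; the hazier left block receives a lower transmittance than the right block.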
[0046] The relationship between light intensity attenuation and propagation distance is the theoretical basis of the atmospheric scattering model. This relationship shows that when light propagates in the atmosphere, its intensity decreases exponentially with the increase of propagation distance. This physical law is expressed by McCartney et al. as the specific imaging equation I(x)=J(x)×t(x)+A×(1-t(x)). The transmittance t(x) is the key variable describing the degree of light intensity attenuation, which is negatively correlated with the propagation distance. In this application, the propagation distance is not directly measured. Instead, the dark channel prior is used as a bridge because the dark channel value D is related to the transmittance t(x) in the dense fog region. Thus, t(x) can be indirectly solved by estimating D and A from the image itself, realizing the transition from the physical model to the specific algorithm.
[0047] This application does not specifically limit the relationship; it can be set according to the actual situation.
[0048] Step 1023: Summarize the transmittance values of all pixels in the single-band image to generate a transmittance map corresponding to the single-band image.
[0049] In this embodiment of the application, the transmittance values calculated in step 1022, which correspond one-to-one with each pixel in the single-band image, are combined in the same two-dimensional spatial arrangement as the original image to form a two-dimensional matrix with the same size as the single-band image. This matrix is the transmittance map.
[0050] In practical applications, for a visible light single-band image B with a size of 640×480, after the calculation in step 1022, 640×480 transmittance values will be obtained. These values are arranged according to their row and column positions in the image to form a two-dimensional matrix with 640 rows and 480 columns. This matrix is the transmittance map corresponding to image B.
[0051] Step 1024: Using the transmittance map, extract the first image layer from the single-band image, and remove the first image layer from the single-band image to obtain the second image layer.
[0052] In this embodiment, the first image layer is extracted by inverting the single-band image using the transmittance map based on the imaging equation of the atmospheric scattering model. Specifically, for each pixel, the formula J = (I - A) / t + A is used for calculation, where I is the pixel intensity value of the single-band image, A is the estimated atmospheric light value, and t is the transmittance value of the pixel. The calculated J value is the intensity value of the first image layer at that pixel. After obtaining the complete first image layer J, the difference between the original single-band image I and the first image layer J is calculated, i.e., R = I - J, to obtain the second image layer R.
[0053] In practical applications, taking pixel (100, 150) as an example, given that I is 120, A is 220, and t is 0.87, the intensity value of the first image layer at this point is calculated as J = (120-220) / 0.87 + 220 ≈ 105. Then, the intensity value of the second image layer at this point is calculated as R = 120-105 = 15. This calculation process is repeated for all pixels in the image to generate the complete first image layer J and the second image layer R respectively.
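The per-pixel decomposition described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names and the lower bound `t_min` on the transmittance (a guard against division by values near zero) are assumptions that do not appear in the application.

```python
def decompose_pixel(i, a, t, t_min=0.1):
    """Split one observed intensity into first-layer (J) and second-layer (R) values.

    i: observed pixel intensity I, a: atmospheric light A, t: transmittance.
    t_min is an assumed safeguard, not taken from the application.
    """
    t = max(t, t_min)
    j = (i - a) / t + a  # inversion of I = J*t + A*(1 - t)
    r = i - j            # second image layer R = I - J
    return j, r


def decompose_image(image, atmospheric_light, transmittance_map):
    """Apply decompose_pixel to every pixel of a 2D list image."""
    first_layer, second_layer = [], []
    for row_i, row_t in zip(image, transmittance_map):
        pairs = [decompose_pixel(i, atmospheric_light, t)
                 for i, t in zip(row_i, row_t)]
        first_layer.append([p[0] for p in pairs])
        second_layer.append([p[1] for p in pairs])
    return first_layer, second_layer


# Worked example from the text: I = 120, A = 220, t = 0.87
j, r = decompose_pixel(120, 220, 0.87)
print(round(j), round(r))  # 105 15
```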
[0054] This application effectively decomposes an image affected by atmospheric scattering into a base layer that reflects a clear scene structure and a detail layer that contains key details of the target through step 102 and its sub-steps, laying an important foundation for subsequent targeted feature fusion.
[0055] Step 103: In the image fusion framework, the second image layers from different bands are fused to obtain a fused detail layer, and the fused detail layer is reconstructed and merged with each of the first image layers to generate the corresponding fused image.
[0056] In step 103, the image fusion framework is a software processing flow for organically combining components from images of different bands to generate a new image. In this application, the image fusion framework is specifically responsible for weighted fusion of the second image layer and merging the fusion result with the first image layer.
[0057] The fused detail layer is a new image layer obtained by combining the features of multiple second image layers. This layer concentrates the most detailed and texture information from images of different bands. Reconstruction merging refers to the operation of superimposing the fused detail layer with the first image layer according to the corresponding pixel positions, thereby generating a final image with clear structure and enhanced details.
[0058] In this embodiment of the application, firstly, within the image fusion framework, corresponding fusion weights are assigned to the second image layers from different bands according to the priority information determined in step 101; then, these weights are used to weight and superimpose each second image layer to generate a fused detail layer; finally, the generated fused detail layer is added to the first image layer extracted from any channel image at the pixel level to complete the reconstruction and merging and output the fused image.
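A minimal sketch of the weighted fusion and reconstruction merging is given below. The application states only that fusion weights are assigned according to the priority information; normalizing the priorities so that the weights sum to 1 is an assumption made here for illustration, and all function names and sample values are hypothetical.

```python
def priority_weights(priorities):
    """Turn band priorities into fusion weights that sum to 1 (assumed scheme)."""
    total = sum(priorities)
    return [p / total for p in priorities]


def fuse_detail_layers(detail_layers, weights):
    """Pixel-wise weighted sum of the second image layers of all bands."""
    rows, cols = len(detail_layers[0]), len(detail_layers[0][0])
    return [[sum(w * layer[r][c] for w, layer in zip(weights, detail_layers))
             for c in range(cols)]
            for r in range(rows)]


def reconstruct(base_layer, fused_detail):
    """Pixel-wise addition of the fused detail layer to a first image layer."""
    return [[b + d for b, d in zip(base_row, detail_row)]
            for base_row, detail_row in zip(base_layer, fused_detail)]


# Two bands with hypothetical priorities 3 and 1, each a 1x2 detail layer
weights = priority_weights([3, 1])                       # [0.75, 0.25]
fused = fuse_detail_layers([[[15.0, 8.0]], [[5.0, 4.0]]], weights)
image = reconstruct([[105.0, 90.0]], fused)
print(fused, image)
```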
[0059] Step 104: Perform target recognition on the image sequence composed of the fused images to obtain the target information set in each image frame, and construct a spatiotemporal correlation graph based on the target information set.
[0060] The image sequence is a set of images composed of the fused images generated in step 103 arranged in chronological order; the target information set is a data set that summarizes the descriptive information of each target detected in a frame of an image, including the appearance feature vector and position coordinates of the target.
[0061] The spatiotemporal correlation graph is graph-structured data used to characterize the correlation between targets in multiple frames of images. Each node in the spatiotemporal correlation graph represents a set of target information, and each connecting edge represents the possibility of correlation between the targets corresponding to two nodes.
[0062] In this embodiment, step 104 includes the following process: Step 1041: For each frame of the image sequence composed of the fused images, a convolutional neural network is used to perform target detection to detect multiple potential target regions in the image.
[0063] In step 1041, the convolutional neural network is a deep learning model that automatically extracts image features through multi-layer convolution operations. In this application, the model is pre-trained using a large amount of image data labeled with target bounding boxes, thereby learning to recognize specific categories of targets in the image. Furthermore, this application does not impose specific limitations on the structural design of the layers and other components used in the internal structure of the convolutional neural network, which can be set according to the actual situation. A potential target region is a local image region in an image that is identified by the convolutional neural network as containing a target, and it is usually identified by a rectangular bounding box.
[0064] In this embodiment, each frame of the fused image sequence is first input into a pre-trained convolutional neural network model. The convolutional neural network performs multi-layer feature extraction and classification on the input image, and finally outputs the coordinates of multiple rectangular bounding boxes and their corresponding target category confidence scores. The image region enclosed by all bounding boxes with confidence scores exceeding a preset detection threshold is determined as the potential target region of the frame image.
[0065] In practical applications, for a fused image generated in a foggy traffic scene, it is input into a vehicle detection convolutional neural network trained based on the YOLO architecture. The network outputs three bounding boxes with confidence scores of 0.92, 0.85 and 0.45, respectively. With a detection threshold of 0.5, the regions corresponding to the first two bounding boxes are identified as potential target regions, while the third bounding box is filtered out because its confidence score is lower than the threshold.
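The threshold filtering in this example can be sketched as below. The detector itself is not modeled; the dictionary layout and the bounding-box coordinates are hypothetical placeholders, and only the three confidence scores and the 0.5 threshold come from the text.

```python
def filter_detections(detections, threshold=0.5):
    """Keep only detections whose confidence score exceeds the threshold."""
    return [d for d in detections if d["score"] > threshold]


# Scores from the example; boxes (x, y, w, h) are made-up placeholders
detections = [
    {"box": (320, 240, 60, 40), "score": 0.92},
    {"box": (120, 260, 55, 38), "score": 0.85},
    {"box": (500, 100, 30, 20), "score": 0.45},
]
potential_regions = filter_detections(detections, threshold=0.5)
print([d["score"] for d in potential_regions])  # [0.92, 0.85]
```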
[0066] Step 1042: Extract the appearance features and location information of each potential target region, and form a corresponding target information set based on the appearance features and location information.
[0067] In step 1042, the appearance feature is a high-dimensional numerical vector obtained by feature encoding of image pixels within the potential target region, used to characterize the unique visual attributes of the target; the location information is data describing the spatial location of the potential target region in the image, which typically includes the horizontal and vertical coordinates of the center point of the bounding box, as well as the width and height of the bounding box.
[0068] In this embodiment of the application, for each potential target region detected in step 1041, the image within the region is first cropped and scaled to a fixed size; then the cropped image patch is input into a pre-trained feature extraction network, which outputs a fixed-length numerical vector as the appearance feature of the target; at the same time, the center point coordinates (x, y) of the bounding box corresponding to the potential target region, as well as the width w and height h, are recorded to form the position information; finally, the appearance feature vector and position information of the target are packaged to form a complete target information set.
[0069] In practical applications, for the potential target region with a confidence level of 0.92, its image content is first cropped and scaled to 64 pixels by 64 pixels; then, it is input into a ResNet network to extract a 256-dimensional appearance feature vector, for example, [0.1, 0.85, -0.3, ...]; at the same time, the coordinates of its bounding box center point are recorded as (320, 240), the width is 60, and the height is 40; these data together constitute a target information set.
[0070] Step 1043: Take each target information in the target information set of all consecutive frames of images as a node to form a node set.
[0071] In this embodiment of the application, the process of constructing the node set is as follows: first, a time window is set, for example, 5 consecutive frames of images; then, all target information sets formed in step 1042 are collected from these 5 frames of images; finally, each target information set is regarded as an independent entity and is used as a node in the spatiotemporal association graph. The collection of all these nodes constitutes the node set.
[0072] In practical applications, when processing a video stream, five consecutive frames from frame 10 to frame 14 are selected as the fused images. After detection and feature extraction, these five frames yield a total of 12 target information sets. These 12 target information sets are then used as 12 nodes, which together constitute the node set currently being processed.
[0073] Step 1044: For any two nodes belonging to different image frames, use a Siamese network to calculate the appearance similarity and motion consistency between the corresponding target information sets.
[0074] In step 1044, a Siamese network is a deep learning model that includes two sub-network structures with shared parameters, specifically designed to compare the similarity between two inputs. Furthermore, this application does not impose specific limitations on the structural design of the internal structure of the Siamese network, such as the layers, and can set them according to the actual situation.
[0075] Appearance similarity is a value between 0 and 1 used to quantify the degree of similarity between two targets in appearance. The closer the value is to 1, the more similar they are. Motion consistency is a measure of how reasonable the motion trajectories of two targets are in terms of physical laws.
[0076] In this embodiment, firstly, for any two nodes belonging to different image frames in the node set, their corresponding appearance feature vectors are input into the two sub-networks of the Siamese network respectively; the Siamese network calculates and outputs a scalar value as the appearance similarity between the two nodes; simultaneously, for these two nodes, motion consistency is calculated based on the coordinates and timestamps in their location information. Specifically, assuming node i is located in frame t with coordinates (x_i, y_i) and node j is located in frame t+1 with coordinates (x_j, y_j), the displacement vector (x_j - x_i, y_j - y_i) is first calculated, and then the predicted displacement is calculated based on the time interval between the two frames. The degree of matching between the actual displacement and the predicted displacement is used as the motion consistency metric.
[0077] In practical applications, node A in frame 10 is compared with node B in frame 11. Let the appearance feature vector of node A be f_A and that of node B be f_B; f_A and f_B are input into the Siamese network, which outputs the appearance similarity S_app = 0.78. Node A is located at (300, 200) and node B is located at (310, 205), so the actual displacement is (10, 5). The time interval between the two frames is 0.1 seconds. Assuming the displacement predicted based on the historical trajectory is (12, 5), the matching degree between the actual displacement (10, 5) and the predicted displacement is evaluated on the basis of cosine similarity, and the resulting motion consistency is taken to be S_mot = 0.95.
[0078] Step 1045: Calculate the correlation degree between nodes based on the appearance similarity and the motion consistency.
[0079] In this embodiment, the correlation between nodes is calculated by weighted fusion of appearance similarity and motion consistency. The weighted fusion formula can be specifically R = α × S_app + β × S_mot, where R represents the correlation degree, S_app represents the appearance similarity, S_mot represents the motion consistency, and α and β are preset weighting coefficients that satisfy α + β = 1. This formula integrates two different dimensions of measurement into a comprehensive correlation score.
[0080] In practical applications, following the previous example, let α be 0.6 and β be 0.4. Given an appearance similarity of 0.78 and a motion consistency of 0.95, the correlation degree between node A and node B is R = 0.6 × 0.78 + 0.4 × 0.95 = 0.468 + 0.38 = 0.848.
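The weighted fusion of the two measures can be illustrated with a short sketch. The function name and the check that the two coefficients sum to 1 are illustrative choices; the coefficient values and inputs are those of the worked example.

```python
def correlation_degree(appearance_similarity, motion_consistency,
                       alpha=0.6, beta=0.4):
    """Weighted fusion of appearance similarity and motion consistency."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    return alpha * appearance_similarity + beta * motion_consistency


# Worked example: appearance similarity 0.78, motion consistency 0.95
r = correlation_degree(0.78, 0.95)
print(round(r, 3))  # 0.848
```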
[0081] Step 1046: If the correlation degree is greater than the preset correlation degree threshold, then establish a connection edge between the corresponding two nodes, and construct a spatiotemporal correlation graph based on the node set and all connection edges.
[0082] In this embodiment, a correlation threshold T is first set, for example, 0.7; then, all node pairs belonging to different image frames in the node set are traversed, and the correlation R of each pair of nodes is calculated; for any pair of nodes, if its correlation R is greater than the threshold T, an undirected connection edge is established between the two nodes; finally, all the nodes and all the connection edges that meet the conditions are combined to form a complete graph structure, namely, a spatiotemporal correlation graph.
[0083] In practical applications, for the aforementioned node set of 12 nodes, the correlation degree of all node pairs belonging to different image frames is calculated; assuming that the correlation degree of 20 pairs of nodes exceeds the preset threshold of 0.7, a connection edge is established between each of these 20 pairs of nodes; finally, these 12 nodes and 20 connection edges constitute a spatiotemporal correlation graph describing the cross-frame correlation relationship of the targets in these 5 frames of images.
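Edge construction by thresholding can be sketched as follows. The node dictionaries, the identifiers, and the lookup table of precomputed correlation degrees are hypothetical; in the method described here the correlation degree would come from steps 1044 and 1045.

```python
from itertools import combinations


def build_edges(nodes, correlation, threshold=0.7):
    """Return undirected edges (id_a, id_b, R) for cross-frame pairs with R > threshold."""
    edges = []
    for a, b in combinations(nodes, 2):
        if a["frame"] == b["frame"]:
            continue  # only nodes from different image frames are compared
        r = correlation(a, b)
        if r > threshold:
            edges.append((a["id"], b["id"], r))
    return edges


# Hypothetical nodes and precomputed correlation degrees
nodes = [
    {"id": "A", "frame": 10},
    {"id": "B", "frame": 11},
    {"id": "C", "frame": 11},
]
scores = {("A", "B"): 0.848, ("A", "C"): 0.41}
edges = build_edges(nodes, lambda a, b: scores[(a["id"], b["id"])])
print(edges)  # [('A', 'B', 0.848)]
```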
[0084] This application effectively organizes the discrete target detection results in consecutive image frames into a graph structure with spatiotemporal correlation through step 104 and its sub-steps, providing a direct and rich data foundation for subsequent intelligent reasoning using graph models to solve the target identity association problem.
[0085] Step 105: Input the spatiotemporal correlation graph into a graph neural network, and infer the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network.
[0086] The structure design and training process of the graph neural network can be set according to the actual situation, and this embodiment does not limit this. The correspondence refers to the matching relationship between nodes in different image frames pointing to the same real target.
[0087] In this embodiment, step 105 includes the following process, as shown in Figure 2: Step 1051: Input the spatiotemporal correlation graph into the attention graph convolutional layer of the graph neural network. In the attention graph convolutional layer, a weight is assigned to the connection edge between each node and its neighboring nodes through a multi-head attention mechanism. Based on the weight, the features of the neighboring nodes are aggregated, and the feature representation of each node is updated to obtain the first updated feature vector.
[0088] In step 1051, the attention graph convolutional layer is a specific network layer in a graph neural network. This layer uses an attention mechanism to dynamically determine the importance of neighboring nodes when information is aggregated. The first updated feature vector is a new feature representation obtained after the original features of the nodes are processed by the attention graph convolutional layer.
[0089] In this embodiment, the node features and edge features of the spatiotemporal correlation graph are first input into the attention graph convolutional layer. For each central node in the graph, the layer calculates the attention coefficient between it and each of its neighboring nodes. This coefficient is determined by the features of the central node, the features of the neighboring nodes, and the features of the connecting edges. A multi-head attention mechanism is adopted, that is, multiple independent attention calculations are performed in parallel and the results are concatenated. Finally, the features of the neighboring nodes are weighted and summed according to the calculated attention coefficients, and then combined and nonlinearly transformed with the features of the central node itself to output the first updated feature vector of the node.
[0090] In practical applications, suppose there is a node N1 in the spatiotemporal relation graph, which has three neighboring nodes N2, N3, and N4. In the attention graph convolutional layer, the attention coefficients of N1 with N2, N3, and N4 are calculated respectively, and the weight values are obtained, for example, 0.6, 0.3, and 0.1. Then, the feature vectors of N2, N3, and N4 are weighted and averaged using these weights to obtain an aggregated neighbor feature. Finally, this aggregated feature is combined with the feature of N1 itself, and after passing through a fully connected layer and an activation function, the first updated feature vector of node N1 is generated, for example, a numerical vector with a length of 128 dimensions.
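The neighbor aggregation in this example can be sketched for a single attention head. Only the weights 0.6, 0.3 and 0.1 come from the text; the two-dimensional neighbor features are made up for illustration, and the subsequent combination with the features of N1, the fully connected layer and the activation function are deliberately left out.

```python
def aggregate_neighbors(weights, neighbor_features):
    """Attention-weighted sum of neighbor feature vectors (one head)."""
    dim = len(neighbor_features[0])
    return [sum(w * f[k] for w, f in zip(weights, neighbor_features))
            for k in range(dim)]


# Attention weights of N1 toward N2, N3, N4 from the example
weights = [0.6, 0.3, 0.1]
# Hypothetical 2-dimensional features of N2, N3, N4
neighbor_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aggregated = aggregate_neighbors(weights, neighbor_features)
print([round(x, 3) for x in aggregated])  # [0.7, 0.4]
```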
[0091] Step 1052: Input the first updated feature vector into the memory enhancement module of the graph neural network. In the memory enhancement module, the temporal features of the nodes are modeled through the long short-term memory network, and the second updated feature vector is output.
[0092] In step 1052, the memory enhancement module is a component in the graph neural network used to capture the changing patterns of node features over time, and the second updated feature vector is a feature representation that incorporates the historical temporal information of the nodes.
[0093] In this embodiment of the application, the first updated feature vectors corresponding to the same target in each frame are arranged into a temporal sequence according to the time order of the image frames; then this temporal sequence is input into a long short-term memory network; the gating structure of the long short-term memory network will filter and memorize the information in the sequence, and the hidden state output at the last time step is the second updated feature vector that incorporates the historical information of the target.
[0094] In practical applications, for the same vehicle target, the five nodes corresponding to five consecutive frames of images are extracted, and their first updated feature vectors obtained in step 1051 are arranged in frame order as [V1, V2, V3, V4, V5]. These five feature vectors are then sequentially input into a long short-term memory network with 128 hidden units. After the network processes the fifth vector V5, the state of its hidden units is extracted and used as the second updated feature vector representing the complete temporal information of the target.
[0095] Step 1053: Input the second updated feature vector into the matching calculation layer of the graph neural network. In the matching calculation layer, calculate the initial cosine similarity between any two second updated feature vectors, and introduce motion constraints to correct the initial cosine similarity in order to generate the matching probability between nodes.
[0096] The matching computation layer is a specific functional layer in the graph neural network. It is specifically responsible for calculating the probability that any two nodes in the graph belong to the same target. The input of this layer is the node feature vector after being processed by the aforementioned network layers, and the output is the probability value representing the strength of the association between each pair of nodes.
[0097] Motion constraints are a reasonable restriction imposed on the node matching process based on the physical laws of target motion. Specifically, the motion constraints utilize the position and time information of the target represented by the node to calculate the consistency of its motion state and generate a correction coefficient to adjust the similarity calculated solely based on appearance features.
[0098] The initial cosine similarity is a raw value that is calculated using the cosine similarity formula and only reflects the degree of similarity in the directions of two feature vectors. The matching probability between nodes is a value between 0 and 1. This matching probability quantifies the possibility that nodes in two different image frames point to the same real target. This value is determined by the similarity of the nodes' appearance features and the consistency of their motion. The closer the value is to 1, the higher the probability of a successful match.
[0099] Step 1053 may specifically include the following steps: A1: In the matching calculation layer, the ratio of the dot product to the modulus product between the two second updated feature vectors is calculated to obtain the initial cosine similarity.
[0100] In this embodiment of the application, for any two second updated feature vectors U and V, their dot product is first calculated, that is, the corresponding dimension values are multiplied and then summed; then the magnitude of vector U and the magnitude of vector V are calculated respectively, the magnitude being the square root of the sum of the squares of the values of each dimension of the vector; finally, the dot product is divided by the product of the two magnitudes to obtain the initial cosine similarity, which is in the range of negative 1 to positive 1.
[0101] In practical applications, let vector U be [0.5, 0.1, -0.2] and vector V be [0.4, 0.2, 0.1]; the dot product is 0.5 × 0.4 + 0.1 × 0.2 + (-0.2) × 0.1 = 0.2; the magnitude of U is √(0.5² + 0.1² + (-0.2)²) = √0.30 ≈ 0.548; the magnitude of V is √(0.4² + 0.2² + 0.1²) = √0.21 ≈ 0.458; the initial cosine similarity is then 0.2 / (0.548 × 0.458) ≈ 0.797.
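Step A1 can be checked numerically with a short sketch. The function name and the error raised for zero-length vectors are illustrative choices rather than part of the application.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Worked example from the text
sim = cosine_similarity([0.5, 0.1, -0.2], [0.4, 0.2, 0.1])
print(round(sim, 3))  # 0.797
```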
[0102] A2: Obtain the position and time information of the two nodes corresponding to the two second update feature vectors from the spatiotemporal correlation graph.
[0103] In step A2, the location information includes the coordinates of the center point of the target represented by the node in the image, and the time information is the timestamp or frame number of the image frame to which the node belongs.
[0104] A3: Based on the location information and the time information, calculate the ratio of the displacement difference to the time difference between the targets represented by the two nodes to obtain the measurement value.
[0105] In step A3, the metric is used to quantify the consistency of the motion speed between the two targets.
[0106] In this embodiment of the application, it is assumed that node i is located at position (x_i, y_i) at time t_i and node j is located at position (x_j, y_j) at time t_j. First, the displacement difference is calculated as Δd = √((x_j - x_i)² + (y_j - y_i)²); then the absolute value of the time difference is calculated as Δt = |t_j - t_i|. Finally, the ratio of the displacement difference to the time difference, S = Δd / Δt, is used as the metric, which represents the average velocity.
[0107] In practical applications, node i is located in frame 10 with coordinates (100, 200) and node j is located in frame 12 with coordinates (130, 210); the displacement difference is then √(30² + 10²) = √1000 ≈ 31.62 pixels, the time difference is 2 frames, and the metric is S = 31.62 / 2 = 15.81 pixels per frame.
[0108] A4: Compare the measured value with a preset measured value threshold to generate a motion constraint factor.
[0109] In step A4, the motion constraint factor is a coefficient used to correct appearance similarity based on motion rationality, and its value is between 0 and 1.
[0110] In this embodiment, a reasonable maximum speed threshold S_max is preset, and the ratio r = S / S_max of the metric S to this threshold is calculated. If r is less than or equal to 1, the motion is considered reasonable and the motion constraint factor C is set to 1; if r is greater than 1, the motion is considered less likely and the motion constraint factor C is set to 1 / r.
[0111] In practical applications, the maximum speed threshold S_max is set to 20 pixels per frame. When the metric S is 15.81, the ratio r is calculated as 15.81 / 20 ≈ 0.79; since r is less than 1, the motion constraint factor C is 1.
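Steps A3 and A4 can be sketched together as follows. The function names are hypothetical; time is measured in frames, as in the worked example, and the error raised for a zero time difference is an added safeguard.

```python
import math


def speed_metric(pos_i, t_i, pos_j, t_j):
    """Ratio of the displacement difference to the absolute time difference."""
    displacement = math.hypot(pos_j[0] - pos_i[0], pos_j[1] - pos_i[1])
    dt = abs(t_j - t_i)
    if dt == 0:
        raise ValueError("nodes must belong to different frames")
    return displacement / dt


def motion_constraint_factor(s, s_max):
    """Return 1 for plausible speeds, 1 / r when the speed exceeds the threshold."""
    r = s / s_max
    return 1.0 if r <= 1 else 1.0 / r


# Worked example: frame 10 at (100, 200), frame 12 at (130, 210), threshold 20
s = speed_metric((100, 200), 10, (130, 210), 12)
c = motion_constraint_factor(s, 20)
print(round(s, 2), c)  # 15.81 1.0
```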
[0112] A5: Multiply the initial cosine similarity by the motion constraint factor to obtain the target cosine similarity.
[0113] In practical applications, with an initial cosine similarity of 0.797 and a motion constraint factor of 1, the target cosine similarity D = 0.797 × 1 = 0.797.
[0114] A6: Input the target cosine similarity into the Sigmoid function for normalization and output the matching probability between nodes.
[0115] In this embodiment of the application, the matching probability P is calculated by the formula P = 1 / (1 + exp(-k × (D - d0))), where exp represents the natural exponential function, k is a scaling factor that controls the steepness of the function, and d0 is the offset. The Sigmoid function maps the target cosine similarity D to a value between 0 and 1, which serves as the matching probability.
[0116] In practical applications, let k=10, d0=0.7, and target cosine similarity D=0.797; then the matching probability P=1 / (1+exp(-10×(0.797-0.7)))=1 / (1+exp(-0.97))≈1 / (1+0.379)≈0.725.
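Steps A5 and A6 can be reproduced with the short sketch below. The function name is illustrative; the default values of k and d0 are those of the worked example and are not prescribed by the application.

```python
import math


def matching_probability(initial_similarity, constraint_factor, k=10.0, d0=0.7):
    """Scale the cosine similarity by the motion constraint factor, then apply a sigmoid."""
    d = initial_similarity * constraint_factor  # target cosine similarity D
    return 1.0 / (1.0 + math.exp(-k * (d - d0)))


# Worked example: initial cosine similarity 0.797, motion constraint factor 1
p = matching_probability(0.797, 1.0)
print(round(p, 3))  # 0.725
```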
[0117] Step 1054: Through the output layer of the graph neural network, the matching probability is globally optimized and allocated using the Hungarian algorithm to obtain the optimal matching pairs of nodes between different image frames, and the node pairs in the optimal matching pairs are determined as the correspondence of the same target in different frames.
[0118] In step 1054, the correspondence is the inferred link relationship that indicates different nodes represent the same target.
[0119] In this embodiment of the application, the nodes of two consecutive frames of images are first used as the two parts of the bipartite graph, and the matching probability calculated in step 1053 is used as the weight of the connecting edge to construct a weighted bipartite graph. Then, the Hungarian algorithm is applied to find a matching scheme on the bipartite graph such that the sum of the weights of the matching edges is maximized. The two nodes connected by each set of matching edges found by the algorithm are determined to be the correspondence of the same target in different frames.
[0120] In practical applications, assume that frame 10 has 3 nodes (a1, a2, a3) and frame 11 has 2 nodes (b1, b2), and that the matching probabilities between them form a 3 × 2 matrix. The Hungarian algorithm finds the combination that maximizes the total matching probability, such as the matching pairs (a1-b1, a2-b2), with no matching object found for a3; it is then confirmed that a1 and b1 correspond to the same target, and that a2 and b2 correspond to the same target.
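The global assignment can be illustrated on a small matrix. The sketch below uses exhaustive enumeration rather than the Hungarian algorithm itself: for a handful of nodes both yield the same maximum-weight matching, while a real system would more likely rely on a polynomial-time solver such as SciPy's `linear_sum_assignment` with `maximize=True`. The probability values are hypothetical, and the code assumes that the second frame has no more nodes than the first.

```python
from itertools import permutations


def best_assignment(prob):
    """Maximum-total-probability matching by brute force.

    prob[i][j] is the matching probability between node i of the earlier
    frame and node j of the later frame; assumes len(prob[0]) <= len(prob).
    """
    n_rows, n_cols = len(prob), len(prob[0])
    best_total, best_pairs = float("-inf"), []
    for rows in permutations(range(n_rows), n_cols):
        total = sum(prob[r][c] for c, r in enumerate(rows))
        if total > best_total:
            best_total = total
            best_pairs = [(r, c) for c, r in enumerate(rows)]
    return sorted(best_pairs), best_total


# Hypothetical 3 x 2 matrix: rows are frame-10 nodes, columns are frame-11 nodes
prob = [
    [0.9, 0.2],
    [0.3, 0.8],
    [0.4, 0.1],
]
pairs, total = best_assignment(prob)
print(pairs)  # [(0, 0), (1, 1)]; the third frame-10 node stays unmatched
```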
[0121] This application, through step 105 and its sub-steps, comprehensively utilizes attention mechanisms, temporal modeling, motion constraints, and global optimization algorithms to achieve accurate and robust inference of the correspondence between nodes in complex spatiotemporal graphs, providing a key basis for generating continuous and accurate target motion trajectories.
[0122] Step 106: Based on the correspondence, generate the final motion trajectory of the target, and continuously track the target based on the final motion trajectory.
[0123] The final trajectory is a complete path describing the continuous change of the spatial position of the same target over a continuous period of time; continuous tracking refers to the process of continuously acquiring new data in subsequent time after generating the initial trajectory of the target, and associating the newly detected targets with the existing trajectories in order to update and extend the target trajectory.
[0124] In this embodiment, step 106 includes the following process: Step 1061: Based on the correspondence, determine multiple nodes belonging to the same target, and extract the position information of the nodes according to the image frame order corresponding to the nodes.
[0125] In step 1061, the location information is the coordinate data of the target represented by the node in its corresponding image frame.
[0126] In this embodiment of the application, firstly, all nodes that are linked together are extracted from the optimal matching pairs obtained in step 105, and these nodes are determined to belong to the same target; then, these nodes are sorted according to the time order of the image frames to which they belong; finally, the position coordinates of each node are read from the target information set of each node, so as to obtain a series of position sequences of the target in multiple consecutive frames.
[0127] Step 1062: Use the Kalman filter algorithm to smooth the position information, and predict the target's state in the next frame of the current image frame based on the processed position information to generate the target's initial motion trajectory.
[0128] In step 1062, the initial trajectory is a preliminary description of the target's short-term future movement path, generated based on historical observations and predictions.
[0129] In this embodiment, the position sequence obtained in step 1061 is first used as the observation input of the Kalman filter algorithm. The Kalman filter performs optimal estimation of the target's position and velocity state through its prediction and update steps, and outputs a smoothed position sequence. Then, the state vector of the Kalman filter after the update at the last observed frame, which contains position and velocity, is used to predict the target's position in the frame following the last frame currently processed. Finally, the smoothed historical position sequence is connected with the predicted position of the next frame to form the target's initial motion trajectory.
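A minimal sketch of the smoothing and one-step prediction is given below for a single coordinate, under a constant-velocity motion model. The application does not specify the motion model or the noise settings: the process noise `q`, the measurement noise `r`, the initial covariance, and the idea of filtering the x and y coordinates independently are all assumptions made for illustration.

```python
def kalman_track(observations, dt=1.0, q=1e-3, r=1.0):
    """Constant-velocity Kalman filter over one coordinate.

    Returns the filtered positions and the predicted position one step ahead.
    q, r and the initial covariance are assumed values.
    """
    p, v = float(observations[0]), 0.0
    cov = [[1.0, 0.0], [0.0, 100.0]]  # velocity is initially very uncertain
    smoothed = [p]
    for z in observations[1:]:
        # Prediction step: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]]
        p = p + v * dt
        cov = [
            [cov[0][0] + dt * (cov[0][1] + cov[1][0]) + dt * dt * cov[1][1] + q,
             cov[0][1] + dt * cov[1][1]],
            [cov[1][0] + dt * cov[1][1],
             cov[1][1] + q],
        ]
        # Update step: only the position is observed (H = [1, 0])
        innovation = z - p
        s = cov[0][0] + r
        k0, k1 = cov[0][0] / s, cov[1][0] / s
        p, v = p + k0 * innovation, v + k1 * innovation
        cov = [
            [(1 - k0) * cov[0][0], (1 - k0) * cov[0][1]],
            [cov[1][0] - k1 * cov[0][0], cov[1][1] - k1 * cov[0][1]],
        ]
        smoothed.append(p)
    return smoothed, p + v * dt


# Hypothetical x coordinates of one target over five frames
xs = [100, 110, 120, 130, 140]
smoothed_x, next_x = kalman_track(xs)
print([round(x, 1) for x in smoothed_x], round(next_x, 1))
```

The same function could be applied to the y coordinates to obtain a two-dimensional initial motion trajectory.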
[0130] Step 1063: Based on the initial motion trajectory, the motion pattern of the target is learned using a gated recurrent unit network to predict the target's predicted position in subsequent image frames.
[0131] In this embodiment of the application, a large amount of historical trajectory data is first used to train the gated recurrent unit network so that it learns the general rules of the target motion pattern. In the application stage, the position sequence of the initial motion trajectory generated in step 1062 is input into the trained gated recurrent unit network. The network outputs predictions of a series of positions of the target in subsequent frames, such as 5 frames, based on the learned motion pattern. These positions constitute the predicted position sequence.
[0132] Step 1064: Associate and match the target information set of subsequent image frames with the predicted position, and verify the matching result. If the verification is successful, add the node corresponding to the target information set to the correspondence.
[0133] In this embodiment, the actual image of the next frame is first processed, and the target information set of the frame is obtained through step 104. Then, the distance between the position of each target information set in the frame and the predicted position for that frame obtained in step 1063 is calculated. The actual target that is closest to the predicted position and whose distance is less than a preset distance threshold is associated with the predicted position. Finally, the association result is verified, for example, by checking whether the appearance features of the actual target are sufficiently similar to the appearance features of the target on the historical trajectory. If the verification passes, the node corresponding to the actual target is confirmed to belong to the same target and is added to the correspondence established in step 105.
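The nearest-neighbor association against the predicted position can be sketched as follows. The candidate layout, identifiers, coordinates and threshold value are hypothetical, and the subsequent appearance-based verification is not modeled here.

```python
import math


def associate(predicted, candidates, max_distance):
    """Return the candidate closest to the predicted position, or None.

    A candidate is accepted only if its distance is below max_distance.
    """
    best, best_dist = None, max_distance
    for cand in candidates:
        cx, cy = cand["center"]
        dist = math.hypot(cx - predicted[0], cy - predicted[1])
        if dist < best_dist:
            best, best_dist = cand, dist
    return best


# Hypothetical predicted position and detections in the new frame
candidates = [
    {"id": "det-1", "center": (152, 216)},
    {"id": "det-2", "center": (300, 80)},
]
match = associate((150, 215), candidates, max_distance=20.0)
print(match["id"] if match else None)  # det-1
```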
[0134] It should be noted that the embodiments of this application do not limit the value of the preset distance threshold, and can be set according to specific circumstances.
[0135] Step 1065: Based on the location information corresponding to the newly added node, update and optimize the initial motion trajectory using a trajectory optimization algorithm to generate the final motion trajectory, and achieve continuous tracking based on the final motion trajectory.
[0136] In this embodiment, the actual observed position corresponding to the newly added node is first appended to the historical position sequence of the target. The trajectory optimization algorithm is then reapplied to the entire updated position sequence; for example, cubic spline interpolation is used to fit the discrete points and generate a continuous, smooth motion curve, which is the updated final motion trajectory. For each subsequent frame, steps 1063 to 1065 (prediction, association matching, verification, and trajectory update) are repeated, so as to achieve continuous tracking of the target.
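The trajectory update of step 1065 can be sketched with a natural cubic spline, one common form of cubic spline interpolation. The sketch assumes the curve is parameterized by frame index, that frame indices are strictly increasing, and that natural boundary conditions (zero second derivative at both ends) are acceptable; the function names are hypothetical and this application does not specify the spline variant.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    if n < 1:
        raise ValueError("need at least two points")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the quadratic coefficients.
    l, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution for the per-segment coefficients b, c, d.
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the segment containing x, clamping to the first/last segment.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


def update_trajectory(frames, positions, new_frame, new_position):
    """Append a verified observation and refit the whole position sequence."""
    frames = list(frames) + [new_frame]
    positions = list(positions) + [new_position]
    spline_x = natural_cubic_spline(frames, [p[0] for p in positions])
    spline_y = natural_cubic_spline(frames, [p[1] for p in positions])

    def trajectory(t):
        return spline_x(t), spline_y(t)

    return frames, positions, trajectory
```

Because an interpolating spline passes through every observation, this variant smooths the path between frames but does not suppress measurement noise; noise reduction is left to the filtering applied earlier in step 106.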
[0137] Through step 106 and its sub-steps, this application generates a continuous motion trajectory from the discrete node correspondence. By combining filtering, neural network prediction, and online association verification, it can dynamically adapt to changes in target motion, thereby achieving stable and continuous long-term tracking of the target.
[0138] Figure 3 is a schematic diagram of a target recognition and tracking system based on multi-source image fusion provided in an embodiment of this application. As shown in Figure 3, the system includes the following modules. The filtering module 31 is used to perform band filtering on the raw light received by the multispectral sensor according to priority information to generate a preprocessed image set. The priority information is determined based on meteorological visibility data and spectral attenuation data.
[0139] The calculation module 32 is used to calculate the transmittance map of the preprocessed image set using an atmospheric scattering model, and based on the transmittance map, decompose each image in the preprocessed image set into a first image layer and a second image layer.
[0140] The fusion module 33 is used to fuse the second image layers from different bands in the image fusion framework to obtain a fused detail layer, and to reconstruct and merge the fused detail layer with each of the first image layers to generate the corresponding fused image.
[0141] The recognition module 34 is used to perform target recognition on the image sequence composed of the fused images, obtain the target information set in each image frame, and construct a spatiotemporal correlation graph based on the target information set.
[0142] The input module 35 is used to input the spatiotemporal correlation graph into the graph neural network, and infer the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network.
[0143] The generation module 36 is used to generate the final motion trajectory of the target based on the correspondence, and to continuously track the target based on the final motion trajectory.
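The way modules 31 to 36 hand data to one another can be sketched as a simple pipeline. This is only a structural illustration: the class name `TrackingSystem`, the field names, the callable signatures, and the choice to run modules 31 to 33 per frame before building the graph over the whole sequence are assumptions made for the sketch, and each module is treated as an opaque callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class TrackingSystem:
    """Hypothetical wiring of modules 31-36; every module is an opaque callable."""

    filtering: Callable[[Any, Any], Any]   # 31: (raw light, priority info) -> preprocessed image set
    calculation: Callable[[Any], Any]      # 32: image set -> first/second image layers
    fusion: Callable[[Any], Any]           # 33: layers -> fused image
    recognition: Callable[[Sequence[Any]], Any]  # 34: fused image sequence -> spatiotemporal graph
    inference: Callable[[Any], Any]        # 35 (input module): graph -> node correspondence
    generation: Callable[[Any], Any]       # 36: correspondence -> final motion trajectory

    def run(self, raw_frames, priority_info):
        fused_sequence = []
        for raw in raw_frames:
            images = self.filtering(raw, priority_info)
            layers = self.calculation(images)
            fused_sequence.append(self.fusion(layers))
        # The graph spans all frames, so it is built after the per-frame stages.
        graph = self.recognition(fused_sequence)
        correspondence = self.inference(graph)
        return self.generation(correspondence)
```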
[0144] The target recognition and tracking system based on multi-source image fusion in this application is used to implement the aforementioned target recognition and tracking method based on multi-source image fusion. Therefore, for the specific implementation of the system, reference may be made to the description of the corresponding method embodiments above, which will not be repeated here.
[0145] This application also provides an electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of any of the above-described target recognition and tracking methods based on multi-source image fusion.
[0146] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described target recognition and tracking methods based on multi-source image fusion.
[0147] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as USB flash drives, read-only memory, random access memory, portable hard drives, magnetic disks, or optical disks.
[0148] The embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the target recognition and tracking method based on multi-source image fusion.
[0149] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0150] The above provides a detailed description of a target recognition and tracking method and system based on multi-source image fusion provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are merely for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of this application.
Claims
1. A target recognition and tracking method based on multi-source image fusion, characterized in that the method comprises: performing band filtering, based on priority information, on the raw light received by a multispectral sensor to generate a preprocessed image set, wherein the priority information is determined based on meteorological visibility data and spectral attenuation data; calculating a transmittance map of the preprocessed image set using an atmospheric scattering model, and, based on the transmittance map, decomposing each image in the preprocessed image set into a first image layer and a second image layer; fusing, in an image fusion framework, the second image layers from different bands to obtain a fused detail layer, and reconstructing and merging the fused detail layer with each of the first image layers to generate the corresponding fused image; performing target recognition on the image sequence composed of the fused images to obtain the target information set in each image frame, and constructing a spatiotemporal correlation graph based on the target information set; inputting the spatiotemporal correlation graph into a graph neural network, and inferring the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network; and generating the final motion trajectory of the target based on the correspondence, and continuously tracking the target based on the final motion trajectory.
2. The method of claim 1, wherein the step of inputting the spatiotemporal correlation graph into a graph neural network and inferring the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network includes: inputting the spatiotemporal correlation graph into the attention graph convolutional layer of the graph neural network, wherein, in the attention graph convolutional layer, a multi-head attention mechanism is used to assign weights to the connection edges between each node and its neighboring nodes, the features of the neighboring nodes are aggregated based on the weights, and the feature representation of each node is updated to obtain a first updated feature vector; inputting the first updated feature vector into the memory enhancement module of the graph neural network, wherein, in the memory enhancement module, the temporal features of the nodes are modeled through a long short-term memory network and a second updated feature vector is output; inputting the second updated feature vector into the matching calculation layer of the graph neural network, wherein, in the matching calculation layer, the initial cosine similarity between any two second updated feature vectors is calculated, and motion constraints are introduced to correct the initial cosine similarity so as to generate the matching probability between nodes; using the Hungarian algorithm, through the output layer of the graph neural network, to perform globally optimal assignment on the matching probabilities to obtain the optimal matching pairs of nodes between different image frames; and determining the node pairs in the optimal matching pairs as the correspondence of the same target in different frames.
3. The method of claim 2, wherein the step of calculating, in the matching calculation layer, the initial cosine similarity between any two second updated feature vectors and introducing motion constraints to correct the initial cosine similarity so as to generate the matching probability between nodes includes: calculating, in the matching calculation layer, the ratio of the dot product of the two second updated feature vectors to the product of their moduli to obtain the initial cosine similarity; obtaining, from the spatiotemporal correlation graph, the position information and time information of the two nodes corresponding to the two second updated feature vectors; calculating, based on the position information and the time information, the ratio of the displacement difference to the time difference between the targets represented by the two nodes to obtain a metric value; comparing the metric value with a preset metric value threshold to generate a motion constraint factor; multiplying the initial cosine similarity by the motion constraint factor to obtain the target cosine similarity; and inputting the target cosine similarity into the Sigmoid function for normalization, and outputting the matching probability between nodes.
4. The method of claim 1, wherein the step of generating the final motion trajectory of the target based on the correspondence and continuously tracking the target based on the final motion trajectory includes: identifying, based on the correspondence, multiple nodes belonging to the same target, and extracting the position information of the nodes according to the order of the image frames corresponding to the nodes; smoothing the position information using the Kalman filter algorithm, and predicting the state of the target in the frame following the current image frame based on the processed position information to generate the initial motion trajectory of the target; learning, based on the initial motion trajectory, the motion pattern of the target using a gated recurrent unit network to predict the predicted position of the target in subsequent image frames; associating and matching the target information set of the subsequent image frames with the predicted position, verifying the matching result, and, if the verification is successful, adding the node corresponding to the target information set to the correspondence; and updating and optimizing the initial motion trajectory using a trajectory optimization algorithm based on the position information of the newly added node to generate the final motion trajectory, and achieving continuous tracking based on the final motion trajectory.
5. The method of claim 1, wherein the step of performing target recognition on the image sequence composed of the fused images to obtain the target information set in each image frame, and constructing a spatiotemporal correlation graph based on the target information set, includes: for each frame of the image sequence composed of the fused images, performing target detection using a convolutional neural network to detect multiple potential target regions in the image; extracting the appearance features and position information of each potential target region, and forming the corresponding target information set based on the appearance features and the position information; using each item of target information in the target information sets of all consecutive image frames as a node to form a node set; for any two nodes belonging to different image frames, calculating the appearance similarity and motion consistency between the corresponding target information sets using a Siamese network; calculating the correlation degree between the nodes based on the appearance similarity and the motion consistency; and, if the correlation degree is greater than a preset correlation degree threshold, establishing a connection edge between the corresponding two nodes, and forming the spatiotemporal correlation graph based on the node set and all connection edges.
6. The method of claim 1, wherein the step of performing band filtering on the raw light received by the multispectral sensor according to priority information to generate a preprocessed image set includes: generating a filter control command based on the priority information, wherein the filter control command is used to instruct that the passband center and bandwidth of an optical filtering unit be adjusted to match the infrared band and the visible light band with the least attenuation, respectively; and controlling the optical filtering unit according to the filter control command to filter the raw light entering the multispectral sensor, so that the light with the least attenuation in the infrared and visible light bands passes through preferentially and is received by the multispectral sensor to form the preprocessed image set.
7. The method of claim 1, wherein the step of calculating the transmittance map of the preprocessed image set using an atmospheric scattering model, and decomposing each image in the preprocessed image set into a first image layer and a second image layer based on the transmittance map, includes: acquiring multiple single-band images from the preprocessed image set, wherein each single-band image contains pixels and the pixel intensity value corresponding to each pixel; for each single-band image, inputting the pixel intensity values into the atmospheric scattering model, wherein the atmospheric scattering model calculates the transmittance value of each pixel in the single-band image based on the pixel intensity values; aggregating the transmittance values of all pixels in the single-band image to generate the transmittance map corresponding to the single-band image; and using the transmittance map to extract the first image layer from the single-band image, and removing the first image layer from the single-band image to obtain the second image layer.
8. A target recognition and tracking system based on multi-source image fusion, characterized in that the system comprises: a filtering module, used to perform band filtering on the raw light received by a multispectral sensor according to priority information to generate a preprocessed image set, wherein the priority information is determined based on meteorological visibility data and spectral attenuation data; a calculation module, used to calculate a transmittance map of the preprocessed image set using an atmospheric scattering model, and, based on the transmittance map, decompose each image in the preprocessed image set into a first image layer and a second image layer; a fusion module, used to fuse the second image layers from different bands in an image fusion framework to obtain a fused detail layer, and to reconstruct and merge the fused detail layer with each of the first image layers to generate the corresponding fused image; a recognition module, used to perform target recognition on the image sequence composed of the fused images, obtain the target information set in each image frame, and construct a spatiotemporal correlation graph based on the target information set; an input module, used to input the spatiotemporal correlation graph into a graph neural network, and infer the correspondence between nodes in the spatiotemporal correlation graph through the graph neural network; and a generation module, used to generate the final motion trajectory of the target based on the correspondence, and to continuously track the target based on the final motion trajectory.
9. An electronic device, comprising: a memory, used to store a computer program; and a processor, configured to execute the computer program to implement the steps of the target recognition and tracking method based on multi-source image fusion as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the target recognition and tracking method based on multi-source image fusion as described in any one of claims 1 to 7.