Twin network target tracking method, device, equipment, medium and program product
By extracting environmental features and generating modulation parameters in a Siamese network, and dynamically adjusting the feature map response, the problem of decreased target tracking accuracy in dynamic scenes is solved, and accurate tracking in complex environments is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 广州新华学院
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing target tracking methods suffer from decreased tracking accuracy or even target loss in dynamic scenes due to a lack of environmental context modeling. This is caused by changes in lighting, target deformation, or background interference.
By introducing environmental feature extraction through Siamese networks, modulation parameters are generated to perform affine transformation on the initial target feature map, dynamically adjusting the response intensity of the feature layer, and introducing an environment-adapted template for similarity calculation during the feature matching stage.
It improves the positioning accuracy and tracking performance of target tracking in complex dynamic scenarios, and enhances the adaptability of twin networks in deformation or interference scenarios.
Figure CN122244471A_ABST
Description
Technical Field
[0001] This invention relates to the technical field of twin networks, and in particular to a twin network target tracking method, apparatus, device, medium, and program product.
Background Technology
[0002] Target tracking, as a core technology in computer vision, has wide applications in intelligent monitoring, human-computer interaction, and autonomous driving. However, the inventors have found that related methods typically use fixed template feature maps for target matching in dynamic scenes, lacking explicit modeling and fusion of environmental context information. When changes in lighting, target deformation, or background interference occur, static templates struggle to adapt to dynamic changes in the target's appearance, leading to decreased tracking accuracy or even target loss. Therefore, how to dynamically adjust template features according to the current environment to improve tracking performance in dynamic scenes has become a problem to be solved.
Summary of the Invention
[0003] The main objective of this invention is to provide a twin network target tracking method, apparatus, device, medium, and program product, aiming to solve the problem that existing methods cannot adapt to dynamic environmental changes such as illumination, deformation, and background interference, resulting in a decrease in tracking accuracy.
[0004] To achieve the above objective, the present invention provides a target tracking method based on a Siamese (twin) network, comprising: after acquiring an initial image of a target object, tracking the positional changes of the target object to obtain a target region video image of the current frame; extracting features from the target region video image through the backbone network of the twin network to obtain a region feature map, and obtaining an initial target feature map extracted by the backbone network from the initial image; extracting environmental features from the target region video image to obtain environmental feature data; generating modulation parameters based on the environmental feature data, and performing an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map; and performing target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image.
[0005] Further, the step of extracting features from the target region video image through the backbone network of the twin network to obtain a region feature map includes: inputting the target region video image into the convolutional layer of the backbone network, and performing a region-by-region weighted summation on the target region video image to output a region convolutional map; inputting the region convolutional map into the activation layer of the backbone network, and performing a non-linear mapping on each data point in the region convolutional map to output an activation feature map; inputting the activation feature map into the pooling layer of the backbone network, and selecting the maximum value in each local region of the activation feature map to generate a local region feature map; and inputting the local region feature maps into the feature integration layer of the backbone network, and combining all the local region feature maps to obtain the region feature map.
[0006] Further, the step of extracting environmental features from the target region video image to obtain environmental feature data includes: dividing the target region video image into a grid to obtain multiple image sub-blocks; performing pixel statistics on each image sub-block to extract the average brightness and the texture intensity value of that sub-block; and using the average brightness and the texture intensity value as the environmental feature data.
[0007] Further, the step of generating modulation parameters based on the environmental feature data includes: inputting the environmental feature data into a pre-trained modulation parameter generation network; processing the environmental feature data through the modulation parameter generation network to output a scaling factor vector and an offset vector corresponding to the feature channels of the initial target feature map; and using the scaling factor vector and the offset vector as the modulation parameters.
[0008] Further, the step of performing an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map includes: obtaining the multiple feature channels of the initial target feature map; for each feature channel, multiplying all feature values in the feature channel by the scaling factor of the corresponding channel in the scaling factor vector, and then adding the offset of the corresponding channel in the offset vector, to obtain the adjusted feature values of the feature channel; and combining the adjusted feature values of all the feature channels to obtain the environmental modulation feature map.
[0009] Further, the step of performing target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image includes: sliding the environmental modulation feature map over the region feature map according to a preset unit position; at each sliding position, multiplying the environmental modulation feature map point by point with the currently covered local area of the region feature map and accumulating the results to obtain the response value at that sliding position; filling the response values of all the sliding positions into a preset response map matrix in the sliding order, finding the largest response value in the response map matrix, and taking the position of that response value as the peak position; and calculating coordinates on the region feature map based on the peak position and the preset unit position to obtain the current target position coordinates, and determining the current target position coordinates as the position of the target object in the target region video image.
[0010] The present invention also provides a twin network target tracking device, applied to any one of the twin network target tracking methods described above, comprising: an acquisition module, used to track the positional changes of the target object after acquiring an initial image of the target object, and to obtain the target region video image of the current frame; an analysis module, used to extract features from the target region video image through the backbone network of the twin network to obtain a region feature map, and to obtain an initial target feature map extracted by the backbone network from the initial image; an association module, used to extract environmental features from the target region video image to obtain environmental feature data; a processing module, used to generate modulation parameters based on the environmental feature data, and to perform an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map; and a control module, used to perform target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image.
[0011] The present invention also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the twin network target tracking method as described in any of the preceding claims.
[0012] The present invention also provides a readable storage medium storing a computer program that, when executed by a processor, implements the steps of the twin network target tracking method as described in any of the preceding claims.
[0013] The present invention also provides a computer program product comprising a computer program that, when executed by a processor, implements the steps of the twin network target tracking method as described in any one of claims 1 to 6.
[0014] The present invention provides a twin network target tracking method, apparatus, device, medium, and program product, which have the following beneficial effects: By introducing environmental feature extraction into the Siamese network tracking method, the dynamic environmental information of the current frame's search region can be explicitly quantified into environmental feature data, solving the problem of template-scene disconnect caused by ignoring environmental context in traditional methods. By generating modulation parameters based on environmental feature data and performing affine transformation on the initial target feature map, the response intensity of each feature layer can be dynamically adjusted according to environmental changes such as illumination and background. This solves the problem that static templates cannot adapt to dynamic changes in target appearance, realizing the transformation of template features from fixed to environmentally adaptive, thus maintaining accurate target representation even in deformed or disturbed scenarios. By performing target tracking analysis using environmental modulation feature maps and region feature maps, the environment-adapted template can be introduced into similarity calculation during the feature matching stage, solving the tracking drift problem caused by decreased matching degree in dynamic environments. This improves positioning accuracy and tracking performance, effectively enhancing the adaptability of Siamese networks in complex dynamic scenes.
Attached Figure Description
[0015] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0016] Figure 1 is a schematic diagram of a twin network target tracking system according to an embodiment of the present invention; Figure 2 is a flowchart illustrating a twin network target tracking method according to an embodiment of the present invention; Figure 3 is a schematic diagram of a twin network target tracking device according to an embodiment of the present invention; Figure 4 is a schematic diagram of the structure of a computer device according to an embodiment of the present invention.
[0017] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof. It should also be understood that, as used in this specification and the appended claims, the term "and/or" refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0020] Furthermore, in the description of this invention and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0021] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of the invention include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0022] It should be understood that the sequence number of each step in the following embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0023] To illustrate the technical solution of the present invention, specific embodiments are described below.
[0024] Target tracking, as a core technology in computer vision, has wide applications in intelligent monitoring, human-computer interaction, and autonomous driving. However, the inventors have found that related methods typically use fixed template feature maps for target matching in dynamic scenes, lacking explicit modeling and fusion of environmental context information. When changes in lighting, target deformation, or background interference occur, static templates struggle to adapt to dynamic changes in the target's appearance, leading to decreased tracking accuracy or even target loss. Therefore, how to dynamically adjust template features according to the current environment to improve tracking performance in dynamic scenes has become a problem to be solved.
[0025] To address the aforementioned issues, this invention proposes a twin network target tracking method, apparatus, device, medium, and program product. By introducing environmental feature extraction into the twin network tracking method, the dynamic environmental information of the current frame search area can be explicitly quantified into environmental feature data, solving the problem of template and scene separation caused by neglecting environmental context in traditional methods. By generating modulation parameters based on environmental feature data and performing affine transformation on the initial target feature map, the response intensity of each feature layer can be dynamically adjusted according to environmental changes such as illumination and background, solving the problem that static templates cannot adapt to dynamic changes in target appearance. This achieves the transformation of template features from fixed to environmentally adaptive, thus maintaining accurate target representation even in deformed or disturbed scenarios. By performing target tracking analysis on the environmental modulation feature map and the region feature map, the environmentally adapted template can be introduced into the similarity calculation during the feature matching stage, solving the tracking drift problem caused by decreased matching degree in dynamic environments, improving positioning accuracy and tracking performance, and effectively enhancing the adaptability of twin networks in complex dynamic scenes.
[0026] The twin network target tracking method provided in this invention can be applied to, for example, the twin network target tracking system shown in Figure 1. The system includes a terminal device and a twin network target tracking apparatus, wherein the terminal device communicates with the server via a network or bus.
[0027] The twin network target tracking device can be a terminal device, which refers to a device that corresponds to the server and provides local services to customers. This terminal device includes, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.
[0028] The twin network target tracking device can also be a server, which can be implemented using a standalone server or a server cluster composed of multiple servers.
[0029] As shown in Figure 2, the present invention provides a target tracking method based on Siamese networks, comprising: Step S1: After acquiring the initial image of the target object, track the positional changes of the target object to obtain the target region video image of the current frame. Specifically, the center coordinates of the target object in the previous frame image, as well as the width and height of the bounding box, are read from the tracking results of the previous frame. The width and height of the bounding box of the previous frame are multiplied by a preset search area expansion factor to calculate the rectangular area to be searched in the current frame. This area is centered at the center coordinates of the previous frame, and its width and height are the expanded dimensions. It is then determined whether the current frame is the first frame of the video sequence. If it is the first frame, the initial image is directly used as the target region video image because there is no tracking information from the previous frame. If it is not the first frame, the image region corresponding to the aforementioned rectangular area is cropped from the original image of the current frame, and the cropped image is used as the target region video image of the current frame.
[0030] Step S2: Extract features from the target region video image through the backbone network of the twin network to obtain a region feature map, and obtain the initial target feature map extracted by the backbone network from the initial image. Specifically, the backbone network adopts a multi-layer convolutional neural network structure, and its parameters are shared between the template branch and the search branch. The target region video image of the current frame is input into the backbone network. First, it passes through multiple convolutional layers, which perform a region-by-region weighted summation to extract the shallow edge and texture features of the image and output the corresponding convolutional feature map. Then, the activation layer performs a non-linear mapping on each data point in the convolutional feature map to enhance the feature representation ability. Next, the pooling layer selects the maximum value in each local region of the activated feature map to retain the main features and reduce the spatial resolution. Finally, the feature integration layer combines all the pooled local features to form the final region feature map.
[0031] Step S3: Extract environmental features from the target region video image to obtain environmental feature data. Specifically, the target region video image is divided into a grid, uniformly segmenting it into multiple image sub-blocks, each covering a local region of the image. For each sub-block, all pixels within it are traversed, the brightness value of each pixel is calculated, and the brightness values are averaged to obtain the average brightness value of the sub-block. The texture intensity value of the sub-block is extracted by calculating the grayscale difference between adjacent pixels or by using local binary patterns (LBP), to characterize the richness of texture within the sub-block. The average brightness values and texture intensity values of all sub-blocks are collected to form a set of multi-dimensional data. This data reflects the overall lighting conditions and background texture distribution of the target region in the current frame, and constitutes the environmental feature data.
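As a non-limiting illustration, the grid statistics of step S3 can be sketched in Python as follows. The function name `environment_features`, the 2 x 2 default grid, and the choice of the mean absolute difference between horizontally and vertically adjacent pixels as the texture intensity are assumptions made for illustration; the embodiment equally allows local binary patterns.

```python
def environment_features(gray, grid_rows=2, grid_cols=2):
    """Return [mean brightness, texture intensity] for each grid cell, flattened.

    `gray` is a grayscale image given as a list of rows of pixel values.
    The grid size and the texture measure are illustrative assumptions.
    """
    height, width = len(gray), len(gray[0])
    features = []
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            # Boundaries of the current sub-block.
            r0, r1 = gr * height // grid_rows, (gr + 1) * height // grid_rows
            c0, c1 = gc * width // grid_cols, (gc + 1) * width // grid_cols
            block = [row[c0:c1] for row in gray[r0:r1]]

            # Average brightness over all pixels of the sub-block.
            pixels = [p for row in block for p in row]
            mean_brightness = sum(pixels) / len(pixels)

            # Texture intensity: mean absolute grayscale difference between
            # each pixel and its right and lower neighbours.
            diffs = []
            for r, row in enumerate(block):
                for c, p in enumerate(row):
                    if c + 1 < len(row):
                        diffs.append(abs(p - row[c + 1]))
                    if r + 1 < len(block):
                        diffs.append(abs(p - block[r + 1][c]))
            texture = sum(diffs) / len(diffs) if diffs else 0.0

            features.extend([mean_brightness, texture])
    return features


gray = [
    [10, 10, 0, 100],
    [10, 10, 100, 0],
    [50, 50, 20, 20],
    [50, 50, 20, 20],
]
print(environment_features(gray))
```

In this example the flat sub-blocks yield a texture intensity of zero, while the checkerboard sub-block in the upper right yields a high value, which is the kind of contrast the environmental feature data is meant to capture.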
[0032] Step S4: Generate modulation parameters based on the environmental feature data, and perform an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map. Specifically, the environmental feature data is input into a pre-trained modulation parameter generation network. This network processes the environmental feature data through fully connected layers, mapping it into a scaling factor vector and an offset vector corresponding to the feature channels of the initial target feature map. Each element in the scaling factor vector controls the degree of amplification or suppression of the corresponding feature channel, and each element in the offset vector shifts the values of the corresponding feature channel as a whole. The pre-stored initial target feature map is then read; it contains multiple feature channels, each corresponding to the response to one visual pattern. For each feature channel, all feature values within that channel are extracted in turn, multiplied by the scaling factor of the corresponding channel in the scaling factor vector, and added to the offset of the corresponding channel in the offset vector to obtain the adjusted feature values for that feature channel. After the above calculations are completed for all feature channels, the adjusted feature channels are recombined in their original order to form the environmental modulation feature map.
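As a non-limiting illustration, step S4 can be sketched in Python as follows. The sketch assumes a single fully connected layer for each output vector and uses hand-picked weights; the names `generate_modulation` and `modulate` and all weight values are hypothetical, whereas the embodiment uses a pre-trained network whose architecture and parameters are not specified here.

```python
def generate_modulation(env_features, w_scale, b_scale, w_shift, b_shift):
    """Map environmental feature data to a scaling vector and an offset vector.

    One fully connected layer per output vector (an assumption); each weight
    matrix has one row per feature channel of the template.
    """
    def linear(weights, bias):
        return [sum(w * x for w, x in zip(row, env_features)) + b
                for row, b in zip(weights, bias)]

    return linear(w_scale, b_scale), linear(w_shift, b_shift)


def modulate(template, scale, shift):
    """Per-channel affine transformation of template[channel][row][col]:
    every value v in channel c becomes scale[c] * v + shift[c]."""
    return [[[scale[c] * v + shift[c] for v in row] for row in channel]
            for c, channel in enumerate(template)]


# Hypothetical two-dimensional environmental feature data and two channels.
env = [1.0, 2.0]
scale, shift = generate_modulation(
    env,
    w_scale=[[1.0, 0.0], [0.0, 0.5]], b_scale=[0.0, 1.0],
    w_shift=[[0.0, 0.0], [1.0, 0.0]], b_shift=[0.5, 0.0],
)
template = [[[1.0, 2.0]], [[3.0, 4.0]]]   # 2 channels, each 1 x 2
modulated = modulate(template, scale, shift)
print(scale, shift, modulated)
```

The channel order of the output matches that of the input, so the modulated map can be used directly as the matching template in the next step.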
[0033] Step S5: Perform target matching analysis on the environmental modulation feature map and the regional feature map to determine the position of the target object in the target region video image.
[0034] Specifically, the environmental modulation feature map is used as a matching template. Starting from the top left corner of the region feature map, the template slides sequentially from left to right and top to bottom, following a preset unit sliding step. At each sliding position, the environmental modulation feature map is multiplied point-by-point with the currently covered local region of the region feature map. All product results are summed to obtain a response value, which characterizes the similarity between the template and the local region at the current position. The response values calculated at each sliding position are filled into a preset blank response map matrix according to the sliding order, forming a complete response map matrix. The element with the largest response value is found in the response map matrix, and its row and column position is taken as the peak position. Based on the row and column index of the peak position in the response map matrix, combined with the preset sliding step and the offset of the target region video image relative to the original image, the center coordinates of the target in the original image of the current frame are calculated through coordinate transformation. This center coordinate is output as the position of the target in the current frame, completing the tracking task for this frame.
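As a non-limiting illustration, the sliding match, peak search, and coordinate transformation of step S5 can be sketched in Python as follows. The function names, the `feature_stride` parameter (image pixels per feature-map cell), and the convention of taking the center of the covered window as the target center are assumptions for illustration; the text does not fix the exact coordinate transformation.

```python
def response_map(template, search, step=1):
    """Slide template[c][y][x] over search[c][y][x] and return the response map."""
    th, tw = len(template[0]), len(template[0][0])
    sh, sw = len(search[0]), len(search[0][0])
    rows = (sh - th) // step + 1
    cols = (sw - tw) // step + 1
    response = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            # Point-by-point product with the covered region, summed over
            # all channels and positions.
            for c in range(len(template)):
                for y in range(th):
                    for x in range(tw):
                        total += template[c][y][x] * search[c][i * step + y][j * step + x]
            response[i][j] = total
    return response


def peak_position(response):
    """Return (row, col) of the largest response (first one on ties)."""
    best_i, best_j = 0, 0
    for i, row in enumerate(response):
        for j, value in enumerate(row):
            if value > response[best_i][best_j]:
                best_i, best_j = i, j
    return best_i, best_j


def to_image_coords(peak, step, template_size, feature_stride, crop_offset):
    """Map a peak to a center (x, y) in the original frame (assumed convention)."""
    row, col = peak
    th, tw = template_size
    fx = col * step + (tw - 1) / 2      # window center, in feature-map cells
    fy = row * step + (th - 1) / 2
    ox, oy = crop_offset                # top-left corner of the cropped region
    return ox + fx * feature_stride, oy + fy * feature_stride


template = [[[1, 1], [1, 1]]]                       # 1 channel, 2 x 2
search = [[[0, 0, 0, 0],
           [0, 0, 5, 5],
           [0, 0, 5, 5],
           [0, 0, 0, 0]]]                           # 1 channel, 4 x 4
response = response_map(template, search)
peak = peak_position(response)
center = to_image_coords(peak, 1, (2, 2), feature_stride=8, crop_offset=(60, 60))
print(peak, center)
```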
[0035] This invention provides a twin network target tracking method. By introducing environmental feature extraction into the twin network tracking method, it can explicitly quantify the dynamic environmental information of the current frame search area into environmental feature data, solving the problem of template and scene separation caused by ignoring environmental context in traditional methods. By generating modulation parameters based on environmental feature data and performing affine transformation on the initial target feature map, the response intensity of each feature layer can be dynamically adjusted according to environmental changes such as illumination and background. This solves the problem that static templates cannot adapt to dynamic changes in target appearance, realizing the transformation of template features from fixed to environmentally adaptive, thus maintaining accurate target representation even in deformed or disturbed scenarios. By performing target tracking analysis on the environmental modulation feature map and the region feature map, the environmentally adapted template can be introduced into the similarity calculation during the feature matching stage, solving the tracking drift problem caused by decreased matching degree in dynamic environments, improving positioning accuracy and tracking performance, and effectively enhancing the adaptability of twin networks in complex dynamic scenes.
[0036] In one embodiment, after acquiring an initial image of the target object, tracking the positional changes of the target object to obtain a target region video image of the current frame includes: reading the center coordinates and bounding box dimensions of the target object from the tracking results of the previous frame; and, based on the preset search area expansion factor, multiplying the width and height of the bounding box by the expansion factor to obtain the search area width and height of the current frame, and determining the rectangular range of the search area with the center coordinates as the center. Specifically, the width and height of the bounding box read from the previous frame are multiplied by a preset search area expansion factor. The expansion factor is preset based on the target object's movement speed; for example, it can be set to two or three, so that all possible locations of the target object in the current frame are included in the search range. Multiplying the width and height by the expansion factor gives the search area width and search area height for the current frame. Using the center coordinates read from the previous frame as the geometric center of the rectangular region, and combining the calculated search area width and height, a rectangular region is determined in the image coordinate system. The range of this rectangular region is defined by the coordinates of the upper left and lower right corners. The specific calculation method is as follows: the left boundary is obtained by subtracting half the search area width from the center x-coordinate, and the right boundary is obtained by adding half the search area width to the center x-coordinate; the upper boundary is obtained by subtracting half the search area height from the center y-coordinate, and the lower boundary is obtained by adding half the search area height to the center y-coordinate.
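As a non-limiting illustration, the boundary calculation described above can be written in Python as follows. The function name `search_rectangle` and the example values are assumptions for illustration only.

```python
def search_rectangle(cx, cy, box_w, box_h, expansion=2.0):
    """Return (left, top, right, bottom) of the search area for the current frame.

    (cx, cy) and (box_w, box_h) come from the tracking result of the previous
    frame; `expansion` is the preset search area expansion factor.
    """
    search_w = box_w * expansion
    search_h = box_h * expansion
    left = cx - search_w / 2
    right = cx + search_w / 2
    top = cy - search_h / 2
    bottom = cy + search_h / 2
    return left, top, right, bottom


# A 40 x 20 box centered at (100, 80), expanded by a factor of two.
print(search_rectangle(100, 80, 40, 20, expansion=2.0))
# -> (60.0, 60.0, 140.0, 100.0)
```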
[0037] Determine whether the current frame is the first frame. If it is the first frame, then use the initial image directly as the video image of the target area. If it is not the first frame, then the image region corresponding to the rectangular range is cropped from the original image of the current frame, and the cropped image is used as the video image of the target region.
[0038] Specifically, when it is determined that the current frame is not the first frame, the original image data of the current frame is acquired. Then, based on the rectangular range, the boundary coordinates of the rectangular region are located on the original image. After localization, an image cropping operation is performed, which extracts all pixels within the rectangular range from the original image to form a sub-image of the same size as the rectangular range. The cropping operation is implemented through pointer offset or pixel copying to ensure that the original image data is not modified. The cropped sub-image is the target area video image, which retains both the target itself and information about its surrounding environment.
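As a non-limiting illustration, the first-frame check and the cropping operation can be sketched in Python as follows. The image is represented as a list of pixel rows, and the crop is made by pixel copying so that the original frame is left unmodified. The function names and the clamping of the rectangle to the image bounds are assumptions; the text does not state how a rectangle extending beyond the image is handled.

```python
def crop_region(image, left, top, right, bottom):
    """Copy the pixels inside the rectangle into a new sub-image."""
    height, width = len(image), len(image[0])
    # Clamp to the image bounds (an assumption, see the note above).
    x0 = max(0, int(round(left)))
    y0 = max(0, int(round(top)))
    x1 = min(width, int(round(right)))
    y1 = min(height, int(round(bottom)))
    # List slicing builds new row lists, so the original image is untouched.
    return [row[x0:x1] for row in image[y0:y1]]


def target_region(frame, initial_image, rect, is_first_frame):
    """Use the initial image on the first frame, otherwise crop the rectangle."""
    if is_first_frame:
        return initial_image
    return crop_region(frame, *rect)


frame = [[r * 10 + c for c in range(6)] for r in range(5)]
crop = target_region(frame, initial_image=frame, rect=(1, 1, 4, 3), is_first_frame=False)
print(crop)   # -> [[11, 12, 13], [21, 22, 23]]
```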
[0039] The method provided in this embodiment solves the problem of disconnected target position information between consecutive frames by reading the center coordinates and bounding box size of the target object after acquiring the tracking result of the previous frame, and transferring the positioning information of historical frames to the current frame. By enlarging the bounding box size according to a preset expansion factor and determining the rectangular range using the center coordinates, the possible movement area of the target object in the current frame is restricted to a local area, solving the problem of difficulty in quickly locating the target after movement and reducing computational overhead. By determining whether the current frame is the first frame and selecting to directly use the initial image or crop the rectangular area based on the determination result, the problem of not being able to determine the search area when the first frame lacks information from the previous frame is solved, providing a guarantee for the smooth execution of the tracking process.
[0040] In one embodiment, the step of extracting features from the target region video image through the backbone network of the twin network to obtain a region feature map includes: The target region video image is input into the convolutional layer of the backbone network, and the target region video image is weighted and summed region by region to output a region convolutional map. Specifically, the target region video image is fed as input data into the first convolutional layer of the backbone network. This convolutional layer contains multiple convolutional kernels, each smaller than the input image. Starting from the top left corner of the input image, the convolutional layer sequentially multiplies each kernel with the currently covered local region of the image, following a preset stride. All product results are then summed to obtain an output value. After the convolutional kernel completes one full slide across the entire image, all output values are arranged according to their sliding positions, forming a region convolutional map.
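As a non-limiting illustration, the sliding weighted summation performed by a convolutional layer can be sketched in Python as follows. The sketch assumes a single-channel input, no padding, and no bias term; the function name `conv_layer` and the example kernels are hypothetical and are not the trained kernels of the backbone network.

```python
def conv_layer(image, kernels, stride=1):
    """Apply each kernel to a single-channel image; return one output map per kernel."""
    outputs = []
    for kernel in kernels:
        kh, kw = len(kernel), len(kernel[0])
        out_h = (len(image) - kh) // stride + 1
        out_w = (len(image[0]) - kw) // stride + 1
        out = []
        for i in range(out_h):
            row = []
            for j in range(out_w):
                # Multiply the kernel with the covered local region and sum.
                acc = 0.0
                for y in range(kh):
                    for x in range(kw):
                        acc += kernel[y][x] * image[i * stride + y][j * stride + x]
                row.append(acc)
            out.append(row)
        outputs.append(out)
    return outputs


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
horizontal_diff = [[-1, 1]]     # responds to left-to-right intensity changes
vertical_sum = [[1], [1]]       # adds each pixel to the one below it
maps = conv_layer(image, [horizontal_diff, vertical_sum])
print(maps)
```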
[0041] The region convolutional map is input into the activation layer of the backbone network, and a non-linear mapping is performed on each data point in the region convolutional map to output an activation feature map. The activation feature map is input into the pooling layer of the backbone network, and the maximum value in each local region is selected from the activation feature map to generate a local region feature map. Specifically, the activation feature map is fed as input data into the pooling layer. The pooling layer has a pre-defined pooling window of a fixed size, such as a square window with a width and height of two pixels, and a stride. The pooling window starts from the top left corner of the activation feature map and slides sequentially from left to right and top to bottom according to the set stride. At each sliding position, the pooling window covers a local region of the activation feature map, which contains multiple data points. The pooling operation iterates through all data points within this local region, compares their values, and selects the maximum value as the output value for that local region. The pooling window outputs one maximum value at each sliding position, and all maximum values are arranged in the order in which the window slides, forming a local region feature map. Since the stride of the pooling window is usually equal to the window size, with a two-pixel window the spatial width and height of the local region feature map are reduced to approximately half of those of the input activation feature map.
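As a non-limiting illustration, the activation and max pooling operations can be sketched in Python as follows. The text only requires a non-linear mapping; the use of ReLU here is an assumption, as are the function names.

```python
def relu(feature_map):
    """Element-wise non-linear mapping (ReLU is an assumed choice)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, window=2, stride=2):
    """Keep the maximum of each window; the window slides left to right, top to bottom."""
    out_h = (len(feature_map) - window) // stride + 1
    out_w = (len(feature_map[0]) - window) // stride + 1
    return [[max(feature_map[i * stride + y][j * stride + x]
                 for y in range(window) for x in range(window))
             for j in range(out_w)]
            for i in range(out_h)]


conv_map = [[-1, 2, 3, -4],
            [5, -6, 7, 8],
            [0, 1, -2, -3],
            [4, -5, 6, 0]]
activated = relu(conv_map)
pooled = max_pool(activated, window=2, stride=2)
print(pooled)   # -> [[5, 8], [4, 6]]
```

With a 2 x 2 window and a stride of two, the 4 x 4 input is reduced to 2 x 2, matching the halving of width and height described above.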
[0042] The local region feature maps are input into the feature integration layer of the backbone network, and all the local region feature maps are combined to obtain the region feature map.
[0043] Specifically, all local region feature maps are fed into the feature integration layer as input data. The feature integration layer first aligns the multiple input local region feature maps along the channel dimension, ensuring that each feature map has the same spatial height and width. After alignment, the feature integration layer concatenates all local region feature maps along the channel dimension, arranging the feature maps in channel order to form a single feature map with the number of channels equal to the sum of the number of channels in each input feature map. During the concatenation process, the spatial position of each feature map remains unchanged, thus the output feature map retains the spatial structure information of each local region. After concatenation, the feature integration layer performs further convolution processing on the concatenated feature map to enhance the feature interaction between channels. The final output region feature map contains multi-level, multi-dimensional feature information extracted from the target region video image. While the spatial resolution of this feature map is lower than that of the input image, each data point has richer semantic meaning, effectively representing the appearance characteristics of the target object.
[0044] The method provided in this embodiment maps the original pixel space to the feature space by inputting the video image of the target region into a convolutional layer and performing weighted summation region by region, thus solving the problem that low-level visual information cannot be directly used for target matching. By using activation layers to perform non-linear mapping on each data point in the region convolutional map, the expressive power of the features is enhanced, solving the problem that linear transformations are unable to distinguish complex patterns and highlighting salient features. Furthermore, by using pooling layers to select the maximum value within a local region of the activation feature map, the spatial resolution is reduced while preserving the main features, solving the problem of heavy computational burden caused by high feature dimensionality and improving the local invariance of features.
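The convolution, activation, and pooling steps of this embodiment can be illustrated with a minimal NumPy sketch. It is not the backbone network of the invention: the function names, the single-channel input, and the kernel values are assumptions made for illustration, and the feature integration layer is reduced to a plain channel-wise stack without the further convolution described above.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image`; each output is the sum of element-wise products."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            patch = image[top:top + kh, left:left + kw]
            out[r, c] = np.sum(patch * kernel)  # weighted sum of the covered region
    return out

def relu(x):
    """Non-linear mapping applied to every data point (ReLU is an assumed choice)."""
    return np.maximum(x, 0.0)

def max_pool(x, window=2, stride=2):
    """Keep the maximum value inside each pooling window."""
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            out[r, c] = x[top:top + window, left:left + window].max()
    return out

def backbone_stage(image, kernels):
    """Run conv -> activation -> pool once per kernel and stack the results.

    Stacking along the last axis stands in for the feature integration layer.
    """
    maps = [max_pool(relu(conv2d_valid(image, k))) for k in kernels]
    return np.stack(maps, axis=-1)  # shape (height, width, channels)
```

With a 6x6 input, two 3x3 kernels, and the default 2x2 pooling window with a stride of two, the convolution yields 4x4 maps and pooling halves them to 2x2, so the stacked result has shape (2, 2, 2).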
[0045] In one embodiment, the step of extracting environmental features from the video image of the target area to obtain environmental feature data includes: The video image of the target area is divided into grids to obtain multiple image sub-blocks; Pixel statistics are performed on each of the image sub-blocks, and the average brightness and texture intensity values of each image sub-block are extracted; Specifically, pixel statistical analysis is performed on each image sub-block sequentially. For the currently processed sub-block, all pixels within it are traversed, and the luminance component value of each pixel is read. The luminance component values of all pixels are summed to obtain the total luminance, which is then divided by the total number of pixels in the sub-block to calculate the average luminance of the sub-block. Texture intensity is then calculated for the sub-block by comparing the luminance differences between adjacent pixels. Each row of pixels within the sub-block is traversed, and the absolute value of the luminance difference between adjacent pixels is calculated and summed; the same applies to each column of pixels within the sub-block. The sums of the luminance differences in the row and column directions are then added to obtain the texture intensity value of the sub-block. For each image sub-block, the above pixel statistical operation is performed independently to extract the average luminance and texture intensity value of the sub-block. After all sub-blocks have been extracted, two sets of numerical sequences are obtained: one set represents the average luminance of each sub-block, and the other set represents the texture intensity value of each sub-block.
[0046] The average brightness and the texture intensity value are used as the environmental feature data.
[0047] Specifically, starting with the first image sub-block, the average brightness value of each sub-block is read sequentially. The read average brightness values are stored in a pre-defined array, the length of which is equal to the total number of image sub-blocks, following the original arrangement of the sub-blocks. After reading and storing the average brightness values, all image sub-blocks are traversed again, and the texture intensity value of each sub-block is read sequentially. Similarly, the read texture intensity values are stored in another pre-defined array, following the original arrangement of the sub-blocks. The index order of the two arrays corresponds exactly; that is, the first element of the first array corresponds to the average brightness value of the first image sub-block, and the first element of the second array corresponds to the texture intensity value of the first image sub-block. Finally, these two arrays are merged into a single environmental feature dataset.
[0048] The method provided in this embodiment divides the target area video image into multiple local sub-blocks by grid division, solving the technical problem that global environmental information is difficult to quantify finely and enabling environmental feature extraction to focus on different regions of the image. By performing pixel statistics on each image sub-block to extract the average brightness and texture intensity values, the original pixel information in the image sub-block can be transformed into quantifiable numerical indicators, solving the problem that abstract environmental changes cannot be directly used for subsequent processing, and presenting environmental information in a concrete numerical form. By using the average brightness and texture intensity values as environmental feature data, the local statistical results of the image sub-blocks can be integrated into a complete data sequence, solving the technical problem that scattered environmental information cannot be represented holistically. This allows the twin network to dynamically adjust template features according to the current frame's illumination and background state, improving the adaptability of the tracking method in dynamic environments.
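The grid statistics of this embodiment can be sketched as follows. The function name `environmental_features` and its parameters are hypothetical. The sketch assumes a single-channel luminance image, drops any remainder pixels when the image size is not divisible by the grid, and merges the two arrays by simple concatenation, which is only one possible reading of "merged into a single environmental feature dataset".

```python
import numpy as np

def environmental_features(luma, grid_rows, grid_cols):
    """Return [mean brightness per block..., texture intensity per block...]."""
    height, width = luma.shape
    block_h, block_w = height // grid_rows, width // grid_cols
    brightness, texture = [], []
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            block = luma[gr * block_h:(gr + 1) * block_h,
                         gc * block_w:(gc + 1) * block_w].astype(np.float64)
            # Average brightness: total luminance divided by the pixel count.
            brightness.append(block.mean())
            # Texture intensity: absolute luminance differences between
            # neighbours along each row, plus the same along each column.
            row_diff = np.abs(np.diff(block, axis=1)).sum()
            col_diff = np.abs(np.diff(block, axis=0)).sum()
            texture.append(row_diff + col_diff)
    # Both lists follow the same block order, so index i in each half
    # refers to the same image sub-block.
    return np.concatenate([brightness, texture])
```

For a grid of N sub-blocks the result has length 2N: the first N entries are average brightness values and the last N are texture intensity values, in the same sub-block order.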
[0049] In one embodiment, generating modulation parameters based on the environmental feature data includes: The environmental feature data is input into a pre-trained modulation parameter generation network; The modulation parameter generation network is a multi-layer neural network structure independent of the backbone network. This network has been trained offline beforehand, learning the mapping relationship from environmental feature data to modulation parameters during training. Its internal parameters remain fixed during the online tracking phase. The modulation parameter generation network consists of fully connected layers, with the number of input layer nodes matching the dimensionality of the environmental feature data, and the number of output layer nodes related to the total number of feature channels in the initial target feature map.
[0050] The environmental feature data is processed by the modulation parameter generation network to output a scaling factor vector and an offset vector corresponding to the feature channels of the initial target feature map; Specifically, after the modulation parameter generation network completes its forward computation, its output layer generates two sets of numerical sequences in parallel. The output layer contains two independent output ports, each outputting a set of values. The first output port is accessed, and all values output by that port are read. During the reading process, the values are acquired sequentially according to their order of appearance in the output port; the first acquired value is recorded as the first element of the sequence, the second acquired value as the second element, and so on, until all values output by that port have been read. After reading, this set of values is stored in a pre-allocated memory area in the order of reading, forming a scaling factor vector. The length of the scaling factor vector is determined by the number of nodes in the output port, which is equal to the total number of feature channels in the initial target feature map. After reading and storing the scaling factor vector, the process moves to accessing the second output port, reading all values output by that port sequentially in the same manner, and storing them in another pre-allocated memory area in the order of reading, forming an offset vector. The length of the offset vector is also equal to the total number of feature channels in the initial target feature map. During storage, the index order of elements in the scaling factor vector and the offset vector is kept completely consistent; that is, elements with the same index position in both vectors act on the same feature channel. 
Both the scaling factor vector and the offset vector are stored in floating-point form, and the precision of each value meets the requirements of affine transformation calculation.
[0051] The scaling factor vector and the offset vector are used as the modulation parameters.
[0052] The method provided in this embodiment, by inputting environmental feature data into a pre-trained modulation parameter generation network, can automatically map the numerical sequences representing illumination and texture into scaling factors and offsets corresponding to feature channels. This solves the problem of the lack of quantitative correlation between environmental information and template adjustment, giving the modulation parameter generation process a deterministic computational basis. By outputting scaling factor vectors and offset vectors through the modulation parameter generation network, corresponding control parameters can be generated for each feature channel of the initial target feature map. This solves the problem that traditional methods use a uniform adjustment method for all feature channels, which cannot adapt to different visual modes, and enables differentiated control of various features such as edges, textures, and colors. By using both the scaling factor vector and the offset vector as modulation parameters, subsequent affine transformations can simultaneously possess the ability to adjust channel scale and perform overall translation, solving the problem that single-parameter adjustment methods cannot simultaneously address feature enhancement and feature offset compensation.
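A modulation parameter generation network with two output ports can be sketched as a small fully connected network. Everything specific here is an assumption: the class name `ModulationNet`, the single hidden layer, the ReLU activation, and the randomly initialized weights, which merely stand in for parameters that the description says are learned offline and then kept fixed.

```python
import numpy as np

class ModulationNet:
    """Fully connected network mapping environmental features to (scale, shift)."""

    def __init__(self, in_dim, hidden_dim, num_channels, seed=0):
        rng = np.random.default_rng(seed)
        # Placeholder weights; in the described method these come from offline training.
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        # Two independent output heads, one per output port.
        self.w_scale = rng.normal(0.0, 0.1, (hidden_dim, num_channels))
        self.b_scale = np.ones(num_channels)   # bias of 1 keeps scaling near identity
        self.w_shift = rng.normal(0.0, 0.1, (hidden_dim, num_channels))
        self.b_shift = np.zeros(num_channels)  # bias of 0 keeps the offset near zero

    def __call__(self, env_features):
        hidden = np.maximum(env_features @ self.w1 + self.b1, 0.0)
        scale = hidden @ self.w_scale + self.b_scale
        shift = hidden @ self.w_shift + self.b_shift
        # Both vectors have one float entry per feature channel, in the same order.
        return scale.astype(np.float32), shift.astype(np.float32)
```

The input dimension matches the length of the environmental feature data, and each output vector has one entry per feature channel of the initial target feature map, so index i in both vectors addresses the same channel.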
[0053] In one embodiment, the step of performing an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map includes: Obtaining multiple feature channels of the initial target feature map; Here, a feature channel refers to each independent layer along the depth dimension of the initial target feature map. Each feature channel corresponds to the response to a specific visual mode, such as an edge response channel, a texture response channel, or a color response channel. Each channel consists of multiple feature values arranged in two-dimensional space.
[0054] Specifically, the initial target feature map is read, which is stored as a three-dimensional data block. The three dimensions correspond to the height, width, and number of channels of the feature map, respectively. The dimensional information of this three-dimensional data block is read to obtain the number of channels. A channel traversal index is built based on the number of channels, starting from the first index value and incrementing sequentially to the total number of channels. Simultaneously with index building, a unique channel identifier is assigned to each feature channel, with each identifier corresponding one-to-one with the index value. After the channel identifiers are assigned, all feature channels are arranged in order according to their channel identifiers.
[0055] For each feature channel, all feature values in the feature channel are multiplied by the scaling factor of the corresponding channel in the scaling factor vector, and then added to the offset of the corresponding channel in the offset vector to obtain the adjusted feature value of the feature channel. Specifically, starting from the channel index value, the feature channel corresponding to the current index is obtained. Simultaneously, based on the same channel index, the scaling factor value at the corresponding position is extracted from the scaling factor vector, and the offset value at the corresponding position is extracted from the offset vector. After determining the scaling factor and offset of the current channel, each spatial position within that feature channel is traversed, and the feature value stored at each position is read sequentially. For the currently read feature value, a multiplication operation is performed, multiplying the feature value by the scaling factor to obtain the scaled intermediate value; then, an addition operation is performed, adding the result of the multiplication operation to the offset to obtain the adjusted new feature value. After calculating a feature value, the new feature value is written back to the corresponding position of the current feature channel, overwriting the original feature value. The process continues to traverse the next feature value within the current channel, repeating the multiplication and addition operations until all feature values within the current channel have been processed. After the current channel is processed, the channel index is incremented, moving to the next feature channel, and the above operations are repeated until all feature channels have completed feature value adjustment.
[0056] The adjusted feature values of all the feature channels are combined to obtain the environmental modulation feature map.
[0057] Specifically, after traversing all feature channels, multiple adjusted feature channels are stored. Each feature channel exists as a two-dimensional array, and each two-dimensional array is accompanied by its original channel identifier. The channel identifier of the first feature channel is read, and the arrangement order of the channel in the final feature map is determined based on the identifier. A new three-dimensional data storage area is created. The height and width of this storage area are the same as the height and width of a single feature channel, and the depth (i.e., the number of channels) is equal to the total number of channels in the initial target feature map. The program copies the two-dimensional data array of each adjusted feature channel sequentially to the corresponding depth layer of the three-dimensional storage area according to the ascending order of the channel identifiers. That is, the two-dimensional array corresponding to the first channel identifier is stored in the first layer, the two-dimensional array corresponding to the second channel identifier is stored in the second layer, and so on. During the copying process, the spatial position of the data points within each two-dimensional array remains unchanged, ensuring that the spatial structure of the feature map is completely preserved. After all feature channels are copied, the three-dimensional storage area is filled, forming a new three-dimensional feature map, which is the environment modulation feature map. The environment modulation feature map inherits the spatial structure and number of channels of the initial target feature map, but the values in each channel have been dynamically adjusted according to the current environmental features, which has the ability to adapt to the current lighting conditions and background texture, and provides input for subsequent target matching analysis with the region feature map.
[0058] The method provided in this embodiment, by acquiring multiple feature channels of the initial target feature map and processing each channel independently, enables differentiated control of feature layers for different visual modes, solving the problem that traditional methods, which uniformly adjust all features, cannot adapt to various feature response characteristics. By multiplying all feature values in each feature channel by the scaling factor of the corresponding channel in the scaling factor vector, the method can enhance or suppress the feature response at all spatial locations within the channel based on environmental characteristics, solving the problem that static templates cannot dynamically adjust response intensity with changes in lighting and background. By adding the offset of the corresponding channel in the offset vector to each scaled feature value, the method can perform overall translation compensation of feature values on top of the scaling adjustment, solving the problem that a scaling operation alone cannot correct the overall offset of feature values. By combining the adjusted feature values of all feature channels, the adjusted channels are restored to a complete feature map structure, solving the problem of feature data fragmentation after channel-by-channel adjustment.
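The channel-wise affine transformation amounts to computing, for every channel c and spatial position (i, j), the adjusted value scale[c] * F[i, j, c] + shift[c]. A minimal sketch follows, assuming the feature map is stored as a (height, width, channels) array. The function name `modulate` is hypothetical, and unlike the described in-place overwrite it returns a new array.

```python
import numpy as np

def modulate(template, scale, shift):
    """Apply a per-channel scale and offset to a (H, W, C) feature map."""
    if template.shape[-1] != scale.shape[0] or scale.shape != shift.shape:
        raise ValueError("scale and shift need one entry per feature channel")
    # Broadcasting applies scale[c] and shift[c] to every spatial position
    # of channel c, which matches the channel-by-channel loop in the text.
    return template * scale + shift
```

The result keeps the spatial size and channel count of the input, so it can be used directly as the template in the subsequent matching step.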
[0059] In one embodiment, the step of performing target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image includes: The environmental modulation feature map is slid over the region feature map in steps of a preset unit position for matching. At each sliding position, the environmental modulation feature map is multiplied point by point with the currently covered local area of the region feature map, and the results are accumulated to obtain the response value at that sliding position. Specifically, the environmental modulation feature map and the region feature map are read to obtain the width and height of the environmental modulation feature map. Based on a preset sliding step size, the environmental modulation feature map is first placed at the top-left corner of the region feature map, where it covers a local area of the region feature map. Each spatial position and each feature channel of the environmental modulation feature map is traversed. For each data point, the feature value of that point is read from the environmental modulation feature map, and the feature value at the corresponding position in the covered local area of the region feature map is read at the same time. The two feature values are multiplied to obtain a product value. After all positions and all channels have been traversed, all product values are accumulated into a sum, which is the response value of the current sliding position. After this response value is recorded, the environmental modulation feature map is moved one step to the right according to the preset sliding step size, and the point-by-point multiplication and accumulation is repeated to obtain the response value of the next sliding position. When the environmental modulation feature map reaches the rightmost boundary of the region feature map, it moves down one step and returns to the leftmost side to continue sliding to the right, until all possible positions on the region feature map have been covered.
[0060] The response values of all the sliding positions are filled into a preset response map matrix in the sliding order, the largest response value is found in the response map matrix, and the position of that response value is taken as the peak position. Specifically, the number of rows and columns of the response map matrix is calculated based on the size of the region feature map, the size of the environmental modulation feature map, and the preset sliding step size. The number of rows is obtained by subtracting the height of the environmental modulation feature map from the height of the region feature map, dividing by the sliding step size, rounding down to an integer, and then adding one. Similarly, the number of columns is obtained by subtracting the width of the environmental modulation feature map from the width of the region feature map, dividing by the sliding step size, rounding down to an integer, and then adding one. The program allocates a two-dimensional matrix in memory based on the calculated numbers of rows and columns and initializes all elements to zero. Each time a sliding position is calculated, a response value is obtained. The program calculates the row and column indices of that sliding position in the response map matrix from the starting coordinates of the current sliding position on the region feature map, and writes the response value to the corresponding element of the matrix. Once all elements of the response map matrix are filled, the program iterates through all elements, starting from the first element, comparing the value of each element with the currently recorded maximum value. If the value of the current element is greater than the recorded maximum value, the maximum value record is updated, and the row and column indices of that element are recorded. 
After the traversal is complete, the recorded maximum value is the peak value of the response map matrix, and the row and column indices corresponding to the maximum value are the peak position.
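The sliding multiplication, the response map matrix, and the peak search can be sketched as below. The function names `response_map` and `peak_position` are hypothetical, and both feature maps are assumed to be stored as (height, width, channels) arrays with the same channel count. The sketch uses integer floor division for the matrix size, which keeps every sliding window inside the region feature map.

```python
import numpy as np

def response_map(search, template, stride=1):
    """Cross-correlate `template` over `search`; both are (H, W, C) arrays."""
    search_h, search_w, _ = search.shape
    tmpl_h, tmpl_w, _ = template.shape
    rows = (search_h - tmpl_h) // stride + 1
    cols = (search_w - tmpl_w) // stride + 1
    resp = np.zeros((rows, cols), dtype=np.float64)
    for r in range(rows):
        for c in range(cols):
            top, left = r * stride, c * stride
            window = search[top:top + tmpl_h, left:left + tmpl_w, :]
            # Point-by-point products over all positions and channels, accumulated.
            resp[r, c] = np.sum(window * template)
    return resp

def peak_position(resp):
    """Row and column index of the largest response (first one on ties)."""
    r, c = np.unravel_index(np.argmax(resp), resp.shape)
    return int(r), int(c)
```

Because `np.argmax` returns the first maximum in row-major order, ties are resolved the same way as the described traversal, which updates the record only when a strictly greater value is found.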
[0061] The coordinates of the region feature map are calculated based on the peak position and the preset unit position to obtain the current target position coordinates, and the current target position coordinates are determined as the position of the target object in the target region video image.
[0062] Specifically, the row and column indices of the peak position are read, along with the preset sliding step size. The row and column indices of the peak position are multiplied by the sliding step size to obtain the starting coordinates of the environmental modulation feature map on the region feature map, i.e., the coordinates of the top-left corner of the local region of the region feature map that best matches the target. Since the environmental modulation feature map itself has a certain width and height, the center of the target object is usually located at the center point of this local region. Half the width and half the height of the environmental modulation feature map are added to the starting coordinates to calculate the center point coordinates of the target object on the region feature map. The region feature map is obtained by downsampling the target region video image through the backbone network, so its spatial size is smaller than that of the original image. The scaling ratio used by the backbone network during downsampling is read, and the center point coordinates on the region feature map are multiplied by this scaling ratio to convert them back to pixel coordinates on the target region video image. These converted coordinates are the position of the target object in the target region video image and are stored as the tracking result for the current frame.
[0063] The method provided in this embodiment solves the problem of lacking a quantitative means of comparison between the template and the search area by sliding the environmental modulation feature map over the region feature map according to a preset step size, and multiplying and accumulating each covered local area point by point, so that all possible positions obtain a uniform similarity measure. By filling the response values of all sliding positions into the response map matrix and finding the peak position, the position with the highest matching degree is selected from the global range, solving the problem that scattered local matching results cannot determine the optimal position. By performing coordinate conversion based on the peak position and step size, the response map index is mapped back to the actual coordinates of the original image, establishing the correspondence between the feature space and the image space, so that the output target position can be directly used for tracking in subsequent frames.
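The conversion from a peak index back to image coordinates can be written out as a short sketch. The function name `peak_to_image_coords` and its parameter names are hypothetical, and the sketch assumes a single uniform downsampling ratio for both axes and returns coordinates in (row, column) order.

```python
def peak_to_image_coords(peak_row, peak_col, stride,
                         template_h, template_w, downsample):
    """Map a response-map peak to pixel coordinates in the search image."""
    # Top-left corner of the best-matching window on the region feature map.
    top = peak_row * stride
    left = peak_col * stride
    # Centre of that window on the region feature map.
    center_row = top + template_h / 2.0
    center_col = left + template_w / 2.0
    # Undo the backbone's downsampling to return to image pixels.
    return center_row * downsample, center_col * downsample
```

For example, a peak at row 2 and column 1 with a step size of 1, a 2x2 template, and a downsampling ratio of 8 maps to the window centre (3.0, 2.0) on the region feature map and to pixel coordinates (24.0, 16.0) on the image.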
[0064] As shown in Figure 3, the present invention also provides a twin network target tracking device, applied to any one of the twin network target tracking methods described above, comprising: The acquisition module is used to track the positional changes of the target object after acquiring an initial image of the target object, and obtain a target region video image of the current frame; The analysis module is used to extract features from the target region video image through the backbone network in the twin network to obtain a region feature map, and to obtain an initial target feature map extracted by the backbone network from the initial image; The association module is used to extract environmental features from the target region video image to obtain environmental feature data; The processing module is used to generate modulation parameters based on the environmental feature data, and to perform an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map; The control module is used to perform target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image.
[0065] This invention also provides a computer device. As shown in Figure 4, the computer device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor. When the processor executes the computer program, it implements the steps in any of the above method embodiments, or when the processor executes the computer program, it implements the functions of each module / unit in the above device embodiments.
[0066] For example, the computer program may be divided into one or more modules / units, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules / units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program in the computer device.
[0067] Those skilled in the art will understand that the computer device shown in Figure 4 is merely an example and does not constitute a limitation on the computer device. It may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device may also include input / output devices, network access devices, buses, etc.
[0068] The aforementioned processor can be a Central Processing Unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0069] The memory can be an internal storage unit of the computer device, such as a hard drive or RAM. The memory can also be an external storage device of the computer device, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card. Furthermore, the memory can include both internal and external storage units of the computer device.
[0070] This invention also provides a readable storage medium storing a computer program that, when executed by a processor, implements the steps described in the various method embodiments above.
[0071] This invention provides a computer program product that, when run on an electronic device, enables the electronic device to perform the steps described in the various method embodiments above.
[0072] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying the computer program code to a photographing device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.
[0073] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0074] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0075] In the embodiments provided by this invention, it should be understood that the disclosed apparatus / device and method can be implemented in other ways. For example, the apparatus / device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0076] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0077] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A twin network (Siamese network) target tracking method, characterized in that it comprises: after acquiring an initial image of a target object, tracking positional changes of the target object to obtain a target region video image of the current frame; performing feature extraction on the target region video image through a backbone network of the twin network to obtain a region feature map, and obtaining an initial target feature map extracted by the backbone network from the initial image; performing environmental feature extraction on the target region video image to obtain environmental feature data; generating modulation parameters based on the environmental feature data, and performing an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map; and performing target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image.
2. The twin network target tracking method according to claim 1, characterized in that the step of performing feature extraction on the target region video image through the backbone network of the twin network to obtain a region feature map comprises: inputting the target region video image into a convolutional layer of the backbone network, which computes a weighted sum over the target region video image region by region and outputs a region convolution map; inputting the region convolution map into an activation layer of the backbone network, which applies a non-linear mapping to each data point of the region convolution map and outputs an activation feature map; inputting the activation feature map into a pooling layer of the backbone network, which selects the maximum value within each local region of the activation feature map to generate local region feature maps; and inputting the local region feature maps into a feature integration layer of the backbone network, which combines all of the local region feature maps to obtain the region feature map.
3. The twin network target tracking method according to claim 1, characterized in that the step of performing environmental feature extraction on the target region video image to obtain environmental feature data comprises: dividing the target region video image into a grid to obtain multiple image sub-blocks; performing pixel statistics on each image sub-block to extract the average brightness and the texture intensity value of that sub-block; and using the average brightness and the texture intensity value as the environmental feature data.
4. The twin network target tracking method according to claim 1, characterized in that the step of generating modulation parameters based on the environmental feature data comprises: inputting the environmental feature data into a pre-trained modulation parameter generation network; processing the environmental feature data through the modulation parameter generation network to output a scaling factor vector and an offset vector corresponding to the feature channels of the initial target feature map; and using the scaling factor vector and the offset vector as the modulation parameters.
5. The twin network target tracking method according to claim 4, characterized in that the step of performing an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map comprises: obtaining the multiple feature channels of the initial target feature map; for each feature channel, multiplying all feature values in the feature channel by the scaling factor of the corresponding channel in the scaling factor vector and then adding the offset of the corresponding channel in the offset vector, to obtain the adjusted feature values of the feature channel; and combining the adjusted feature values of all the feature channels to obtain the environmental modulation feature map.
6. The twin network target tracking method according to claim 1, characterized in that the step of performing target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image comprises: sliding the environmental modulation feature map over the region feature map in steps of a preset unit position; at each sliding position, multiplying the environmental modulation feature map point by point with the currently covered local area of the region feature map and accumulating the products to obtain the response value at that sliding position; filling the response values of all sliding positions into a preset response map matrix in sliding order, finding the largest response value in the response map matrix, and taking its position as the peak position; and calculating coordinates in the region feature map from the peak position and the preset unit position to obtain current target position coordinates, and determining the current target position coordinates as the position of the target object in the target region video image.
7. A twin network target tracking device, characterized in that it is applied to the twin network target tracking method according to any one of claims 1 to 6 and comprises: an acquisition module, configured to track positional changes of a target object after acquiring an initial image of the target object, and to obtain a target region video image of the current frame; an analysis module, configured to perform feature extraction on the target region video image through the backbone network of the twin network to obtain a region feature map, and to obtain an initial target feature map extracted by the backbone network from the initial image; an association module, configured to perform environmental feature extraction on the target region video image to obtain environmental feature data; a processing module, configured to generate modulation parameters based on the environmental feature data, and to perform an affine transformation on the initial target feature map based on the modulation parameters to obtain an environmental modulation feature map; and a control module, configured to perform target matching analysis on the environmental modulation feature map and the region feature map to determine the position of the target object in the target region video image.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the twin network target tracking method according to any one of claims 1 to 6.
9. A readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the twin network target tracking method according to any one of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises a computer program that, when executed by a processor, implements the steps of the twin network target tracking method according to any one of claims 1 to 6.
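For illustration only, the layer sequence recited in claim 2 (convolution, activation, pooling, feature integration) can be sketched in plain Python as follows. The function names, the choice of ReLU as the non-linear mapping, the 2x2 pooling window, and the example kernels are all assumptions made for this sketch; the claims do not fix any of them, and a practical backbone network would use learned kernels and many more layers.

```python
def conv2d(image, kernel):
    """Region-by-region weighted sum of a 2-D image with one kernel (no padding)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - k_h + 1, len(image[0]) - k_w + 1
    return [[sum(kernel[dy][dx] * image[y + dy][x + dx]
                 for dy in range(k_h) for dx in range(k_w))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(feature_map):
    """Non-linear mapping applied to every data point (ReLU is an assumption)."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum value of each non-overlapping size x size local region."""
    rows, cols = len(feature_map) // size, len(feature_map[0]) // size
    return [[max(feature_map[r * size + dy][c * size + dx]
                 for dy in range(size) for dx in range(size))
             for c in range(cols)]
            for r in range(rows)]


def extract_region_features(image, kernels):
    """Feature integration: stack one pooled map per kernel into a channel list."""
    return [max_pool(relu(conv2d(image, kernel))) for kernel in kernels]


# Hypothetical 3x3 grayscale image and two hand-written 2x2 kernels.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, -1]],
           [[0, 0], [0, 1]]]
region_feature_map = extract_region_features(image, kernels)
```

In a twin network the same function, with the same kernels, would be applied to both the initial image and the target region video image, which is what makes the two feature maps comparable.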
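The chain described in claims 3 to 5 (grid statistics, modulation parameter generation, per-channel affine transformation) can be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not define how texture intensity is measured, so the population standard deviation of each sub-block is used here as a stand-in; the pre-trained modulation parameter generation network is replaced by a single hypothetical linear layer; and all weights in the example are placeholder values rather than trained parameters.

```python
from statistics import fmean, pstdev


def grid_environment_features(image, rows, cols):
    """Divide a grayscale image into rows x cols sub-blocks and return
    [brightness, texture, brightness, texture, ...] over all sub-blocks."""
    block_h, block_w = len(image) // rows, len(image[0]) // cols
    features = []
    for r in range(rows):
        for c in range(cols):
            pixels = [image[y][x]
                      for y in range(r * block_h, (r + 1) * block_h)
                      for x in range(c * block_w, (c + 1) * block_w)]
            features.append(fmean(pixels))   # average brightness
            features.append(pstdev(pixels))  # assumed texture-intensity proxy
    return features


def linear(weights, biases, inputs):
    """One fully connected layer: weights @ inputs + biases."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def generate_modulation(features, w_scale, b_scale, w_shift, b_shift):
    """Hypothetical generator: one scaling factor and one offset per channel."""
    return linear(w_scale, b_scale, features), linear(w_shift, b_shift, features)


def modulate(feature_map, gammas, betas):
    """Per-channel affine transformation: value * scaling factor + offset."""
    return [[[value * gamma + beta for value in row] for row in channel]
            for channel, gamma, beta in zip(feature_map, gammas, betas)]


# Hypothetical 4x4 target region image split into a 2x2 grid (8 features).
frame = [[10, 10, 200, 200],
         [10, 10, 200, 200],
         [50, 60, 0, 255],
         [70, 80, 255, 0]]
env = grid_environment_features(frame, rows=2, cols=2)

# Placeholder weights for a two-channel template; real values would be learned.
gammas, betas = generate_modulation(
    env,
    w_scale=[[0.0] * 8, [0.01] + [0.0] * 7], b_scale=[1.0, 1.0],
    w_shift=[[0.0] * 8, [0.0] * 8], b_shift=[0.5, -0.5])

initial_template = [[[1.0, 2.0], [3.0, 4.0]],
                    [[10.0, 20.0], [30.0, 40.0]]]
modulated_template = modulate(initial_template, gammas, betas)
```

Because the scaling factor and offset are shared by every value in a channel, the transformation changes how strongly each channel responds without altering the spatial layout of the template.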
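The sliding match of claim 6 amounts to a cross-correlation followed by a peak search. The sketch below is illustrative only and rests on two assumptions: the "preset unit position" is interpreted as a stride in feature-map cells, and the returned coordinates are taken as the top-left corner of the best-matching window in the region feature map. The function names and example data are hypothetical.

```python
def response_map(template, search, stride=1):
    """Slide `template` over `search` (each a list of 2-D channels) and return
    the matrix of multiply-accumulate response values in sliding order."""
    t_h, t_w = len(template[0]), len(template[0][0])
    s_h, s_w = len(search[0]), len(search[0][0])
    responses = []
    for top in range(0, s_h - t_h + 1, stride):
        row = []
        for left in range(0, s_w - t_w + 1, stride):
            total = 0.0
            for t_channel, s_channel in zip(template, search):
                for dy in range(t_h):
                    for dx in range(t_w):
                        total += t_channel[dy][dx] * s_channel[top + dy][left + dx]
            row.append(total)
        responses.append(row)
    return responses


def locate_peak(responses, stride=1):
    """Find the first largest response and convert its index in the response
    map into (row, column) coordinates in the region feature map."""
    peak_row, peak_col = 0, 0
    for r, row in enumerate(responses):
        for c, value in enumerate(row):
            if value > responses[peak_row][peak_col]:
                peak_row, peak_col = r, c
    return peak_row * stride, peak_col * stride


# Hypothetical single-channel example: the template pattern sits at row 1, column 2.
template = [[[1.0, 2.0],
             [3.0, 4.0]]]
search = [[[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 2.0],
           [0.0, 0.0, 3.0, 4.0],
           [0.0, 0.0, 0.0, 0.0]]]
responses = response_map(template, search)
position = locate_peak(responses)  # (1, 2) for this example
```

Mapping the feature-map coordinates back to pixel coordinates in the target region video image would additionally depend on the total downsampling factor of the backbone network, which the claims leave unspecified.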