Target detection method and device for dedicated railway operation scenarios, and storage medium

The method constructs a training dataset and optimizes the target detection model with a space-frequency coordination module and a cascaded feature alignment module. This addresses missed detections and positioning drift of small targets in dedicated railway operation and achieves high-precision target detection in complex environments.

CN122289660A · Pending · Publication Date: 2026-06-26 · RIZHAO PORT GRP CO LTD +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
RIZHAO PORT GRP CO LTD
Filing Date
2026-04-03
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In the context of dedicated railway operation, existing target detection models struggle to accurately extract effective features under complex conditions such as high dust, low illumination, and motion blur, leading to missed detections of small targets and location drift.

Method used

A training dataset is constructed, including image degradation features and small targets. A basic target detection model is built and optimized. Through a location-aware spatial-frequency coordination module and a frequency-guided cascaded feature alignment module, a spatial high-frequency saliency map and geometric offset are generated. Multiple rounds of iterative training are conducted to improve feature robustness and positioning accuracy.

Benefits of technology

It effectively solves the problems of feature loss and noise interference under complex working conditions, improves the detection accuracy and positioning precision of small targets, and enhances the model's perception capability in complex environments.

✦ Generated by Eureka AI based on patent content.

Figure CN122289660A_ABST

Abstract

This invention relates to the field of target detection technology and specifically discloses a target detection method, device, and storage medium for dedicated railway operation scenarios. The method includes: constructing a training dataset; building a basic target detection model based on the detection accuracy requirements of dedicated railway operation scenarios; optimizing the basic target detection model, which includes constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency-domain transforms, and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the prediction of the feature map's geometric offset based on the spatial high-frequency saliency map and to deform and align the semantic attention map based on that offset; and performing multi-round iterative training based on a combined loss function including a detection loss and an auxiliary constraint loss. The method can improve the accuracy of small-target detection in dedicated railway scenarios.

Description

Technical Field

[0001] This invention relates to the field of target detection technology, and in particular to a target detection method, a target detection device, and a storage medium for a dedicated railway operation scenario.

Background Technology

[0002] With the rapid development of deep learning technology, object detection algorithms based on convolutional neural networks have achieved extremely high detection accuracy on publicly available datasets in clear and ideal environments, and have been widely deployed in fields such as intelligent transportation. However, in practical engineering applications, the operating environment of dedicated railways (such as ports and mining areas) is harsh. Locomotive-mounted cameras often face complex operating environments such as high dust, low light, and water vapor interference. The severe mechanical vibrations generated by the locomotive during high-speed operation or when passing through switch areas can also cause serious motion blur, ultimately leading to non-ideal degradation of the acquired images. In addition, the background of dedicated railways is complex, and heavy-load trains have long braking distances, requiring extremely high precision in detecting small, distant intruding foreign objects. Even under good weather conditions, the complex background along dedicated railway lines and the inherent depth-of-field limitations of telephoto lenses often result in blurred features of small, distant targets, making them difficult for general models to accurately capture, leading to missed detection of critical obstacles and seriously threatening train safety.

[0003] Image degradation in such complex operating scenarios leads to two serious problems: first, high-frequency information such as edges and textures is severely lost, resulting in blurred target outlines; second, a large amount of background noise is introduced into the image, making it difficult for feature extractors to distinguish foreground targets from environmental interference. Traditional deep learning detectors are usually trained based on the "independent and identically distributed" assumption. When faced with data with severe domain shift and quality degradation, their feature extraction capabilities will significantly decrease, leading to missed detections (especially of small targets) and localization drift.

[0004] Therefore, how to solve the problem of inaccurate feature extraction caused by complex operating conditions such as blur and low-light noise in dedicated railway operation has become a technical problem that those skilled in the art urgently need to solve.

Summary of the Invention

[0005] This invention provides a target detection method, a target detection device, and a storage medium for dedicated railway operation scenarios, solving the problem in related technologies that effective features cannot be accurately extracted under the complex operating conditions of dedicated railway operation, such as blur and low-light noise.

[0006] As a first aspect of the present invention, a target detection method for a dedicated railway operation scenario is provided, comprising:

[0007] A training dataset is constructed, which includes image data under multiple dedicated railway operating conditions. The image data under each operating condition include image degradation features and small targets. The image degradation features include at least motion blur, low-light noise, and defocus blur. The small targets are targets in the image data whose coverage area is smaller than a preset small-target pixel threshold.

[0008] A basic target detection model is built based on the detection accuracy requirements of dedicated railway operation scenarios. The basic target detection model includes at least a backbone network for feature extraction, a feature pyramid network for multi-scale feature fusion, and a detection head network for target classification and localization.

[0009] The basic target detection model is optimized, wherein the optimization process includes at least: constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency-domain transforms; and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the prediction of the feature map's geometric offset based on the spatial high-frequency saliency map, and to deform and align the semantic attention map based on the geometric offset.

[0010] The training dataset is input into the optimized basic target detection model, and multiple rounds of iterative training are performed based on a combined loss function including detection loss and auxiliary constraint loss to obtain a target detection model for the dedicated railway operation scenario. The target detection model for the dedicated railway operation scenario can perform target detection on real-time acquired images of the input dedicated railway operation scenario and obtain target detection results.

[0011] Furthermore, constructing the location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency-domain transforms includes:

[0012] A dual-stream feature splitting architecture is constructed, and the interfaces of the dual-stream feature splitting architecture are aligned. The dual-stream feature splitting architecture includes a spatial anchor branch and a frequency-aware branch. The spatial anchor branch is used to maintain the physical spatial structure of the feature map and perform downsampling, and the frequency-aware branch is used to capture the local texture details of the feature map.

[0013] The feature map of the frequency-aware branch is divided into blocks to obtain multiple local blocks, and each local block is independently transformed into the frequency domain.

[0014] Based on the semantic information and absolute position encoding of the spatial anchor branch, the channel and spectrum gating mask of each local block is obtained;

[0015] The spectrum of each local block is weighted and filtered according to the gate mask, and the spectrum of the weighted and filtered local blocks is spatially restored.

[0016] The high-frequency energy information of each local block is aggregated to obtain a spatial high-frequency saliency map that reflects the distribution of high-frequency texture in the feature map.
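The block-wise transform and energy aggregation of the frequency-aware branch can be sketched as follows. This is a minimal illustration and not the patented implementation: it assumes a single-channel map, non-overlapping 4 x 4 blocks, an orthonormal DCT-II, and a hypothetical rule that treats coefficients with `u + v >= cutoff` as high-frequency. The names `dct2`, `split_blocks`, `high_freq_saliency` and `cutoff` are illustrative.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def split_blocks(fmap, size):
    """Split an H x W map into a grid of non-overlapping size x size blocks."""
    h, w = len(fmap), len(fmap[0])
    return [[[row[c:c + size] for row in fmap[r:r + size]]
             for c in range(0, w, size)]
            for r in range(0, h, size)]

def high_freq_saliency(fmap, size=4, cutoff=2):
    """Per-block sum of squared DCT coefficients with u + v >= cutoff (assumed rule)."""
    saliency = []
    for block_row in split_blocks(fmap, size):
        row = []
        for block in block_row:
            spec = dct2(block)
            row.append(sum(spec[u][v] ** 2
                           for u in range(size) for v in range(size)
                           if u + v >= cutoff))
        saliency.append(row)
    return saliency

# 8 x 8 toy map: flat everywhere except a checkerboard texture
# in the bottom-right 4 x 4 block.
fmap = [[1.0] * 8 for _ in range(8)]
for r in range(4, 8):
    for c in range(4, 8):
        fmap[r][c] = float((r + c) % 2)

saliency = high_freq_saliency(fmap)  # 2 x 2 map, one value per block
```

In this toy input only the textured block receives a non-negligible saliency value, which is the behaviour the saliency map is meant to capture. The gating and restoration steps are not included here.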

[0017] Furthermore, based on the semantic information and absolute position encoding of the spatial anchor branch, the channel and spectrum gating mask for each local block is obtained, including:

[0018] Adaptive average pooling is performed on the spatial features extracted by the spatial anchor branch to compress their spatial resolution to a dimension consistent with the number of local blocks, and the pooled features are flattened into spatial semantic descriptors.

[0019] The spatial semantic descriptor is fused with a pre-initialized absolute position embedding matrix to obtain position-aware contextual features;

[0020] The location-aware contextual features are input into a multilayer perceptron and mapped to channel weights to obtain channel-gated branches that include texture information.

[0021] The location-aware contextual features are input into a multilayer perceptron and mapped to spectral weights to obtain a spectral-gated branch for filtering general effective frequency components.

[0022] By performing a joint mask based on the channel weights and the spectrum weights, the channel and spectrum gating masks for each local block are obtained.
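The gating steps above can be sketched as follows. The text does not specify the fusion operator, the perceptron sizes, or how the two weight sets are combined, so this sketch assumes additive fusion with the position embedding, a two-layer perceptron with a sigmoid output, and an outer product as the joint mask. All names (`mlp`, `init_mlp`, `gate_masks`) and the random parameters are hypothetical stand-ins for learned values.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(vec, weight, bias):
    """Affine map; weight is a list of rows with shape (out, in)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def mlp(vec, params):
    """Two-layer perceptron: ReLU hidden layer, sigmoid output in (0, 1)."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, h) for h in linear(vec, w1, b1)]
    return [sigmoid(o) for o in linear(hidden, w2, b2)]

def init_mlp(rng, n_in, n_hidden, n_out):
    """Random parameters standing in for trained weights."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]
    return (mat(n_hidden, n_in), [0.0] * n_hidden,
            mat(n_out, n_hidden), [0.0] * n_out)

def gate_masks(descriptors, pos_embed, channel_mlp, spectral_mlp):
    """Return one (channels x frequencies) gating mask per local block."""
    masks = []
    for desc, pos in zip(descriptors, pos_embed):
        # Additive fusion with the absolute position embedding (assumed).
        context = [d + p for d, p in zip(desc, pos)]
        channel_w = mlp(context, channel_mlp)    # one weight per channel
        spectral_w = mlp(context, spectral_mlp)  # one weight per DCT coefficient
        # Joint mask as an outer product of the two weight vectors (assumed).
        masks.append([[c * s for s in spectral_w] for c in channel_w])
    return masks

rng = random.Random(0)
num_blocks, channels, freqs = 4, 3, 16  # 16 = coefficients of a 4 x 4 block
descriptors = [[rng.random() for _ in range(channels)]
               for _ in range(num_blocks)]
pos_embed = [[rng.uniform(-0.1, 0.1) for _ in range(channels)]
             for _ in range(num_blocks)]
masks = gate_masks(descriptors, pos_embed,
                   init_mlp(rng, channels, 8, channels),
                   init_mlp(rng, channels, 8, freqs))
```

Because each block has its own position embedding, two blocks with identical pooled content can still receive different masks, which is one way to read the position-aware behaviour described above.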

[0023] Further, the spectrum of each local block is weighted and filtered according to the gate mask, and the spectrum of the weighted and filtered local blocks is spatially restored, including:

[0024] The spectrum of each local block is weighted and filtered according to the gated mask to obtain a modulated spectrum sequence, wherein the frequency components determined to be noise or background have their weights suppressed, and the high-frequency components determined to be target edges or textures have their weights enhanced.

[0025] The modulated spectrum sequence is subjected to a two-dimensional inverse discrete cosine transform and then, through a folding operation, restored to a frequency domain restored feature with the same spatial size as the spatial feature extracted by the spatial anchor branch.

[0026] The spatial features extracted by the spatial anchor branch are fused with the frequency domain restored features to obtain a fused feature map;

[0027] The channel dimension of the fused feature map is then expanded to complete channel recovery.
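The weighted filtering, inverse transform, and folding steps can be sketched as follows. This sketch assumes an orthonormal DCT-II, so that the transpose of the transform matrix is its inverse, and uses hand-written masks in place of the learned gate mask. The fusion with the spatial anchor branch and the channel recovery are not shown, and the function names are illustrative.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C times its transpose is the identity."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos((2 * x + 1) * k * math.pi / (2 * n))
             for x in range(n)] for k in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dct2(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def idct2(spec):
    c = dct_matrix(len(spec))
    return matmul(matmul(transpose(c), spec), c)

def modulate(spec, mask):
    """Element-wise weighting of a block spectrum by a gate mask."""
    return [[s * m for s, m in zip(srow, mrow)]
            for srow, mrow in zip(spec, mask)]

def fold(blocks, size):
    """Reassemble a grid of size x size blocks into one 2-D map."""
    out = []
    for block_row in blocks:
        for r in range(size):
            out.append([v for block in block_row for v in block[r]])
    return out

block = [[float(4 * r + c) for c in range(4)] for r in range(4)]
ones = [[1.0] * 4 for _ in range(4)]                  # pass-through mask
dc_only = [[1.0 if (u, v) == (0, 0) else 0.0 for v in range(4)]
           for u in range(4)]                         # keeps only the block mean

restored = idct2(modulate(dct2(block), ones))     # equals block up to rounding
smoothed = idct2(modulate(dct2(block), dc_only))  # every entry is the block mean
folded = fold([[restored, smoothed]], 4)          # 4 rows x 8 columns
```

The two masks mark the extremes: an all-ones mask leaves the block unchanged, while a mask that keeps only the DC coefficient removes all texture. A learned mask would sit between these, suppressing some coefficients and enhancing others.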

[0028] Furthermore, constructing the frequency-guided cascaded feature alignment module in the feature pyramid network, to guide the prediction of the feature map's geometric offset based on the spatial high-frequency saliency map and to deform and align the semantic attention map based on the geometric offset, includes:

[0029] The spatial reference anchor points, features to be calibrated, and cross-domain frequency priors required for the current fusion level are obtained to construct a multi-scale combined feature including multi-dimensional spatial indication information. The spatial reference anchor points include the lateral connection features after the output features of the backbone network are reduced by lateral convolution. The features to be calibrated include the upsampled features obtained after the features of the previous level of the feature pyramid network are upsampled. The cross-domain frequency priors include the spatial high-frequency saliency map.

[0030] The pixel-level dense offset field is predicted based on the multi-scale combined features, and the dense offset field is adaptively adjusted based on the cross-domain frequency prior.

[0031] The feature to be calibrated is resampled and deformed according to the dense offset field to obtain a calibrated feature aligned with the spatial reference anchor point;

[0032] An initial spatial semantic attention map is obtained based on the multi-scale combined features, and the initial spatial semantic attention map is resampled and deformed based on the dense offset field to obtain a final spatial semantic attention map corresponding to the geometric alignment features.

[0033] The calibrated features are denoised and filtered based on the final spatial semantic attention map, and then fused with the lateral connection features to obtain the fused features output by the feature pyramid network.

[0034] Further, the pixel-level dense offset field is predicted based on the multi-scale combined features, and the dense offset field is adaptively adjusted based on the cross-domain frequency prior, including:

[0035] An offset prediction network is constructed, and the multi-scale combined features are used as input to the offset prediction network to learn the displacement vector of the feature to be calibrated relative to the spatial reference anchor point, thereby obtaining a pixel-level dense offset field.

[0036] The offset prediction network is adaptively adjusted based on the cross-domain frequency prior, wherein the offset predicted by the offset prediction network is positively correlated with the spatial response intensity of the cross-domain frequency prior.
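The resampling step and the positive correlation between offset and saliency can be sketched as follows. The offset prediction network itself is not modelled: the raw offsets are supplied by hand. Multiplying each offset by the local saliency value is only one assumed way to obtain the stated positive correlation, and bilinear interpolation with border clamping is likewise an assumption. All function names are illustrative.

```python
import math

def bilinear_sample(fmap, y, x):
    """Sample fmap at a fractional (y, x) position, clamping to the border."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def modulate_offsets(offsets, saliency):
    """Scale each (dy, dx) by the saliency at that pixel (assumed coupling)."""
    return [[(dy * s, dx * s) for (dy, dx), s in zip(orow, srow)]
            for orow, srow in zip(offsets, saliency)]

def warp(fmap, offsets):
    """Resample fmap at (row + dy, col + dx) for every output pixel."""
    return [[bilinear_sample(fmap, r + dy, c + dx)
             for c, (dy, dx) in enumerate(row)]
            for r, row in enumerate(offsets)]

# Toy feature to be calibrated: values grow by 10 per column.
feature = [[0.0, 10.0, 20.0] for _ in range(3)]
# Raw offsets: sample one pixel to the right everywhere.
raw_offsets = [[(0.0, 1.0)] * 3 for _ in range(3)]
# Saliency prior: strong in the top row, weaker below, zero at the bottom.
saliency = [[1.0] * 3, [0.5] * 3, [0.0] * 3]

offsets = modulate_offsets(raw_offsets, saliency)
calibrated = warp(feature, offsets)
# calibrated is [[10, 20, 20], [5, 15, 20], [0, 10, 20]]
```

The same `warp` call, applied with the same offsets to an attention map, would keep that map aligned with the calibrated feature, which is the cascading described in [0032].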

[0037] Further, the training dataset is input into the optimized base object detection model, and multiple rounds of iterative training are performed based on a combined loss function including detection loss and auxiliary constraint loss, including:

[0038] Construct a saliency-guided and sparsity-constrained loss for the aforementioned space-frequency coordination module;

[0039] Construct an offset smoothing loss for the cascaded feature alignment module;

[0040] The auxiliary loss is formed by combining the saliency-guided and sparsity-constrained loss with the offset smoothing loss.

[0041] The auxiliary loss and the detection loss of the basic target detection model are weighted and fused to obtain the total loss function;

[0042] The optimized basic target detection model is trained through multiple rounds of iterations based on the total loss function to obtain a target detection model for the dedicated railway operation scenario.
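The loss combination in the steps above can be sketched as follows. The text gives no closed form for the offset smoothing loss or for the fusion weights, so this sketch assumes a first-order smoothness penalty (mean absolute difference between neighbouring offsets) and a single hypothetical weight `aux_weight` on the auxiliary loss.

```python
def offset_smoothness(offsets):
    """Mean absolute difference between horizontally and vertically adjacent offsets."""
    h, w = len(offsets), len(offsets[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            dy, dx = offsets[r][c]
            if c + 1 < w:  # right neighbour
                ny, nx = offsets[r][c + 1]
                total += abs(ny - dy) + abs(nx - dx)
                count += 1
            if r + 1 < h:  # lower neighbour
                ny, nx = offsets[r + 1][c]
                total += abs(ny - dy) + abs(nx - dx)
                count += 1
    return total / count if count else 0.0

def total_loss(detection, saliency_sparsity, smoothness, aux_weight=0.1):
    """Detection loss plus a weighted auxiliary loss (weight is an assumption)."""
    auxiliary = saliency_sparsity + smoothness
    return detection + aux_weight * auxiliary

uniform = [[(0.5, 0.5)] * 3 for _ in range(3)]  # perfectly smooth field
jump = [[(0.0, 0.0), (0.0, 1.0)]]               # one abrupt change

smooth_uniform = offset_smoothness(uniform)     # 0.0
smooth_jump = offset_smoothness(jump)           # 1.0
loss = total_loss(detection=1.0, saliency_sparsity=0.2,
                  smoothness=0.3, aux_weight=0.1)
```

Keeping the auxiliary weight small relative to the detection loss is one way to supervise the added modules without destabilising the base detector, in line with the stated goal of the combined loss.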

[0043] Furthermore, a saliency-guided and sparsity-constrained loss is constructed for the aforementioned space-frequency coordination module, including:

[0044] Construct a multi-scale saliency supervision flow based on the aforementioned spatial high-frequency saliency map;

[0045] Generate a resolution-adaptive ground-truth Gaussian heatmap;

[0046] A dynamically weighted focal loss function is obtained based on the absolute error between the spatial high-frequency saliency map and the ground-truth Gaussian heatmap. The expression is:

[0047] [formula not reproduced in the source text],

[0048] where the symbols, most of which were lost from the source text, denote in order: the spatial high-frequency saliency map; the ground-truth Gaussian heatmap (subscript gt); the pixel index; the focusing factor; and the total number of pixels;

[0049] Applying a regularization constraint to the gate mask yields a frequency domain sparsity loss function. The expression is:

[0050] [formula not reproduced in the source text],

[0051] where the symbols, lost from the source text, denote in order: the gate mask; and the total number of elements in the gate mask.
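The two losses are described here only in words, so the following sketch shows one plausible form and should not be read as the patent's formulas. It assumes a Gaussian heatmap built from target centres with a caller-chosen `sigma`, a focal-style loss whose per-pixel weight is the absolute error raised to the focusing factor `gamma`, and a sparsity loss equal to the mean absolute value of the gate mask. All names and default values are hypothetical.

```python
import math

def gaussian_heatmap(h, w, centers, sigma):
    """Ground-truth heatmap: a unit-peak Gaussian at each (row, col) centre."""
    heat = [[0.0] * w for _ in range(h)]
    for cy, cx in centers:
        for r in range(h):
            for c in range(w):
                g = math.exp(-((r - cy) ** 2 + (c - cx) ** 2)
                             / (2 * sigma ** 2))
                heat[r][c] = max(heat[r][c], g)
    return heat

def focal_saliency_loss(pred, target, gamma=2.0):
    """Mean of |e|**gamma * |e|, where e is the per-pixel error (assumed form)."""
    errors = [abs(p - t)
              for prow, trow in zip(pred, target)
              for p, t in zip(prow, trow)]
    return sum((e ** gamma) * e for e in errors) / len(errors)

def sparsity_loss(mask_values):
    """Mean absolute value of the gate mask entries (assumed L1 regulariser)."""
    return sum(abs(m) for m in mask_values) / len(mask_values)

# One target centred at (2, 2) on a 5 x 5 saliency map.
target = gaussian_heatmap(5, 5, centers=[(2, 2)], sigma=1.0)
blank = [[0.0] * 5 for _ in range(5)]

loss_perfect = focal_saliency_loss(target, target)  # 0.0
loss_blank = focal_saliency_loss(blank, target)     # positive
loss_sparse = sparsity_loss([0.0, 0.5, 1.0])        # 0.5
```

Under this assumed form, pixels with large errors dominate the loss while well-predicted pixels contribute little. Resolution adaptivity could be obtained by choosing `sigma` per pyramid level, although the text does not state how.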

[0052] As another aspect of the present invention, a target detection device for a dedicated railway operation scenario is provided, for implementing the target detection method for a dedicated railway operation scenario described above, wherein the device includes:

[0053] The dataset construction module is used to construct a training dataset, which includes image data under multiple dedicated railway operating conditions. The image data under each operating condition include image degradation features and small targets. The image degradation features include at least motion blur, low-light noise, and defocus blur. The small targets are targets in the image data whose coverage area is smaller than a preset small-target pixel threshold.

[0054] The model building module is used to build a basic target detection model based on the detection accuracy requirements of the dedicated railway operation scenario. The basic target detection model includes at least a backbone network for feature extraction, a feature pyramid network for multi-scale feature fusion, and a detection head network for target classification and localization.

[0055] The model optimization module is used to optimize the basic target detection model. The optimization process includes at least: constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency-domain transforms; and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the prediction of the feature map's geometric offset based on the spatial high-frequency saliency map, and to deform and align the semantic attention map based on the geometric offset.

[0056] The model training module is used to input the training dataset into the optimized basic target detection model, and perform multiple rounds of iterative training based on a combined loss function including detection loss and auxiliary constraint loss to obtain a target detection model under the dedicated railway operation scenario. The target detection model under the dedicated railway operation scenario can perform target detection on the real-time acquired images under the input dedicated railway operation scenario and obtain target detection results.

[0057] As another aspect of the present invention, a storage medium is provided for storing computer instructions that are loaded and executed by a processor to implement the target detection method in the dedicated railway operation scenario described above.

[0058] The target detection method for dedicated railway operation scenarios provided by this invention constructs a training dataset, builds a basic target detection model based on the detection accuracy requirements of dedicated railway operation scenarios, and then optimizes that model. The location-aware space-frequency coordination module effectively solves the problems of high-frequency feature loss and non-uniform degradation in complex operating scenarios, improving feature robustness and enhancing the locomotive's perception capabilities under complex operating conditions such as high dust and low visibility. It adaptively recovers the edge and texture details of small targets at the source, significantly enhancing the model's robustness to visual degradation such as motion blur and low-light noise. Furthermore, the frequency-guided geometric and semantic cascaded feature alignment module effectively solves the problems of pixel-level misalignment and background noise interference in multi-scale feature fusion, improving positioning accuracy and enabling precise positioning and intrusion judgment of small targets at long distances. Finally, to maintain the stability of the basic target detection framework while effectively supervising the feature selection capability of the space-frequency coordination module and the geometric transformation stability of the cascaded alignment module, a multi-task combined loss function is constructed for multi-round iterative training. Therefore, the target detection method provided by this invention can effectively solve the problem that existing general target detection models (such as Faster R-CNN) fail to accurately extract effective features under the complex working conditions commonly encountered in dedicated railway operation, such as motion blur, low-light noise, and defocus blur, which results in missed detection of small targets and positioning drift; it thus improves the accuracy of small target detection.

Attached Figure Description

[0059] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used together with the following detailed description to explain the invention, but do not constitute a limitation thereof.

[0060] Figure 1 is a flowchart of the target detection method for a dedicated railway operation scenario provided by the present invention.

[0061] Figure 2 is a flowchart for constructing the space-frequency coordination module provided by the present invention.

[0062] Figure 3 is a flowchart for obtaining the channel and spectrum gating mask of each local block provided by the present invention.

[0063] Figure 4 is a flowchart for weighted filtering of the local blocks and spatial restoration of the filtered spectra provided by the present invention.

[0064] Figure 5 is a flowchart for constructing the frequency-guided cascaded feature alignment module provided by the present invention.

[0065] Figure 6 is a flowchart for adaptive adjustment of the dense offset field guided by the cross-domain frequency prior provided by the present invention.

[0066] Figure 7 is a flowchart of multi-round iterative training based on the combined loss function provided by the present invention.

[0067] Figure 8 is a flowchart for constructing the saliency-guided and sparsity-constrained loss provided by the present invention.

[0068] Figure 9 is a structural block diagram of the target detection device for a dedicated railway operation scenario provided by the present invention.

[0069] Figure 10 is a structural block diagram of the electronic device provided by the present invention.

[0070] Figure 11 is a schematic diagram of a dedicated railway scenario provided by the present invention.

Detailed Implementation

[0071] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0072] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0073] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of the invention described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0074] When handling image degradation in complex scenarios, most mainstream object detection methods (such as Faster R-CNN and the YOLO series) use convolutional backbone networks such as ResNet for feature extraction, combined with a Feature Pyramid Network (FPN) for multi-scale feature fusion. Specifically, the backbone network extracts image features progressively through stacked convolutional and pooling layers. Convolution primarily extracts local textures in the spatial domain using sliding windows; as the network deepens, resolution decreases while semantic information is enhanced. In the feature fusion stage, a standard FPN typically works top-down: it upsamples deep, semantically rich features using nearest-neighbor or bilinear interpolation, then adds them to or concatenates them with shallow, high-resolution features through lateral connections, fusing multi-scale information to accommodate both large and small targets. However, while this general detection framework performs well on clear images, it has significant technical limitations in non-ideal imaging scenarios with blur, noise, and other interference. The main reasons are as follows. First, the backbone network loses high-frequency information. Existing convolutional neural networks primarily exhibit low-pass filter characteristics. As network depth increases and downsampling is applied, high-frequency information in the image, such as the edges and texture details of small targets, is gradually smoothed out or lost. For blurry or low-contrast images, spatial domain convolution alone is insufficient to capture effective boundary information, leading to missed detections or misclassifications. Existing frequency domain enhancement methods typically employ global transformations, which easily disrupt local spatial structures and fail to perceive small targets accurately. Second, feature fusion suffers from spatial misalignment and introduces noise. During FPN feature fusion, deep features have undergone multiple downsampling steps, so when they are upsampled and fused with shallow features, the two often cannot be strictly aligned at the pixel level, resulting in inaccurate detection box localization. Furthermore, under complex interference conditions, shallow features often contain a large amount of background noise. Direct fusion introduces this noise into the deep semantic features, producing unclear and inaccurate features and seriously restricting the model's detection performance in complex scenarios.
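The standard top-down fusion described in this paragraph can be sketched as follows, assuming single-channel maps, a scale factor of 2, nearest-neighbor upsampling, and element-wise addition. The function names are illustrative, and the lateral convolution that normally precedes the addition is omitted.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbor upsampling: each value is copied into a factor x factor patch."""
    return [[v for v in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]

def fuse_top_down(deep, shallow, factor=2):
    """Upsample the deep map and add it element-wise to the shallow map."""
    up = upsample_nearest(deep, factor)
    return [[u + s for u, s in zip(urow, srow)]
            for urow, srow in zip(up, shallow)]

deep = [[1.0, 2.0],
        [3.0, 4.0]]                       # low resolution, high semantics
shallow = [[0.0] * 4 for _ in range(4)]   # high-resolution lateral feature

fused = fuse_top_down(deep, shallow)
# fused == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Each deep value is spread uniformly over a 2 x 2 patch, so any sub-cell position information is lost. This illustrates why the upsampled and shallow features may not line up at pixel level, and why whatever noise the shallow map carries is added in directly.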

[0075] Based on this, this embodiment provides a target detection method for a dedicated railway operation scenario. Figure 1 is a flowchart of the method provided by an embodiment of the present invention. As shown in Figure 1, the method includes:

[0076] S100. Construct a training dataset, which includes image data under multiple dedicated railway operating conditions. The image data under each operating condition include image degradation features and small targets. The image degradation features include at least motion blur, low-light noise, and defocus blur. The small targets are targets in the image data whose coverage area is smaller than a preset small-target pixel threshold.

[0077] In this embodiment of the invention, a training dataset is constructed by specifically selecting multi-condition visual scene image data containing degradation features such as blurriness, noise interference, and small targets.

[0078] It should be noted that, in this embodiment of the invention, a "small target" is defined by the size of the target relative to the pixel dimensions of the original image. For example, in a 1024*1024-pixel image, a target covering fewer than 32*32 pixels can be considered a small target. This matches the current mainstream definition, the MS COCO standard: Small refers to an area less than 32*32 pixels, Medium to an area between 32*32 and 96*96 pixels, and Large to an area greater than 96*96 pixels. Therefore, in this embodiment of the invention, a small target can be understood as a target that occupies a small pixel area in the input image and, after multiple downsampling operations by the backbone network, lacks sufficient spatial resolution in the deep feature map to maintain its geometric structure features. More specifically, it is a target whose coverage area is less than 32*32 pixels in the originally acquired image. Because of their extremely low pixel proportion, such targets are prone to feature dissipation (feature loss) during the downsampling of conventional convolutional neural networks. Therefore, the preset small-target pixel threshold in this embodiment of the invention can specifically be 32*32 pixels.
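The size thresholds above can be expressed as a short check. This sketch assumes the area is taken as bounding-box width times height in pixels of the original image; treating an area of exactly 32*32 or 96*96 as Medium is an assumption, and the function name is illustrative.

```python
def size_category(width, height, small_side=32, large_side=96):
    """Classify a target by bounding-box area using the 32*32 and 96*96 thresholds."""
    area = width * height
    if area < small_side * small_side:
        return "small"
    if area <= large_side * large_side:
        return "medium"
    return "large"

boxes = [(20, 20), (40, 60), (120, 100)]  # (width, height) in pixels
labels = [size_category(w, h) for w, h in boxes]
# labels == ["small", "medium", "large"]
```

A dataset builder could use such a check to confirm that each operating condition contributes enough targets below the preset threshold.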

[0079] As shown in Figure 11, the red, green, and blue boxes represent different target categories, and targets covering no more than 32*32 pixels are classified as small targets.

[0080] S200. Based on the detection accuracy requirements of the dedicated railway operation scenario, a basic target detection model is built. The basic target detection model includes at least a backbone network for feature extraction, a feature pyramid network for multi-scale feature fusion, and a detection head network for target classification and localization.

[0081] In this embodiment of the invention, a general object detection model based on deep learning (such as Faster R-CNN) is built as the basic framework. Given the high accuracy requirements for detecting small targets in complex operating scenarios (such as blurred or noisy environments), this embodiment selects the classic Faster R-CNN as the basic object detection framework for verification. However, those skilled in the art should understand that the core modules proposed in this invention are universal, and the basic object detection model can also adopt single-stage or multi-stage detection frameworks such as the YOLO series, RetinaNet, and FCOS.

[0082] S300. Optimize the basic target detection model, wherein the optimization process includes at least: constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency-domain transforms, and enhancing the texture details of the feature map based on the spatial high-frequency saliency map; and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the prediction of the feature map's geometric offset according to the spatial high-frequency saliency map, and deforming and aligning the semantic attention map based on the geometric offset.

[0083] In this embodiment of the invention, a space-frequency collaborative sensing module is constructed at each stage of the backbone network to extract enhanced features and generate high-frequency prior information. A frequency-guided geometric and semantic cascaded feature alignment module is constructed in the feature pyramid to achieve accurate feature alignment and denoising using prior information.

[0084] Specifically, to address the prevalent issues of texture detail attenuation and non-uniform degradation in long-distance imaging during dedicated railway locomotive operation, this invention constructs a position-aware spatial-frequency collaborative module at the end of each stage of the backbone network. This module employs a parallel dual-stream architecture: it preserves the physical spatial structure while establishing a local frequency feature library through block-based DCT (Discrete Cosine Transform). It then adaptively recovers high-frequency details using a decoupled gating mechanism that incorporates absolute position encoding, ultimately generating a spatial high-frequency saliency map that reflects the texture distribution and serves as a cross-level prior.

[0085] In this embodiment of the invention, during actual operation, distant targets often occupy a very small proportion in the image (small distant targets). During the FPN upsampling process, even a tiny pixel deviation can cause the detection box to deviate from the target center, leading to ranging errors. To address the spatial misalignment and noise interference issues in multi-scale fusion, this invention constructs a frequency-guided cascaded feature alignment module at each level of the FPN. The core logic of this frequency-guided cascaded feature alignment module is to use precisely positioned lateral features as anchors, guided by the spatial high-frequency saliency map transmitted by the backbone network, to perform geometric calibration on the position-drifting upsampled features, and simultaneously perform cascaded deformation of semantic attention.

[0086] S400. Input the training dataset into the optimized basic target detection model, and perform multiple rounds of iterative training based on the combined loss function including detection loss and auxiliary constraint loss to obtain a target detection model for the dedicated railway operation scenario. The target detection model for the dedicated railway operation scenario can perform target detection on the real-time acquired images of the input dedicated railway operation scenario and obtain target detection results.

[0087] In this embodiment of the invention, in order to effectively supervise the feature selection capability of the spatial-frequency collaboration module and the geometric transformation stability of the cascaded alignment module while maintaining the stability of the basic target detection framework, this embodiment of the invention constructs a multi-task combined loss function. This combined loss function consists of the basic Faster R-CNN detection loss, the saliency guidance and sparsity constraint loss for the spatial-frequency module, and the smoothing regularization loss for the alignment module.

[0088] Therefore, the target detection method for dedicated railway operation scenarios provided by this invention constructs a training dataset and builds a basic target detection model based on the detection accuracy requirements of dedicated railway operation scenarios. This basic target detection model is then optimized. The constructed position-aware space-frequency collaboration module effectively solves the problems of high-frequency feature loss and non-uniform degradation in complex operating scenarios, improving feature robustness and enhancing the locomotive's perception capabilities under complex operating conditions such as high dust and low visibility. It adaptively recovers the edge and texture details of small targets from the source, significantly enhancing the model's perception capabilities against visual degradation phenomena such as motion blur and low-light noise. Furthermore, the constructed frequency-guided geometric and semantic cascaded feature alignment module effectively solves the problems of pixel-level misalignment and background noise interference in multi-scale feature fusion, improving positioning accuracy and enabling precise positioning and intrusion judgment of small targets at long distances. Finally, to maintain the stability of the basic target detection framework while effectively supervising the feature selection capability of the space-frequency collaboration module and the geometric transformation stability of the cascaded alignment module, a multi-task combined loss function is constructed to achieve multi-round iterative training. 
Therefore, the target detection method for dedicated railway operation scenarios provided by this invention addresses the failure of existing general target detection models (such as Faster R-CNN) to accurately extract effective features under the complex working conditions commonly encountered in dedicated railway operation, such as motion blur, low-light noise, and defocus blur, which results in missed detection of small targets and positioning drift. The method thus improves the accuracy of small target detection.

[0089] In this embodiment of the invention, as a specific implementation of constructing the training dataset, image data that matches the actual application scenario and contains typical image degradation features (such as motion blur, low-light noise, and defocus blur) and small targets is selected. After manual annotation, these images form the training dataset, that is, a multi-condition visual scene training dataset captured from the perspective of a dedicated railway locomotive in operation. The training dataset includes not only non-ideal samples such as low light, motion blur, and dust interference, but also conventional samples with good lighting but complex backgrounds and small targets, so that the model is robust under various operating conditions. Let the training dataset be denoted as $D = \{(I_i, Y_i)\}_{i=1}^{N}$, where $N$ represents the total number of samples, $I_i$ represents the input image data of the $i$-th sample, and $Y_i = \{(b_{i,j}, c_{i,j})\}_{j=1}^{M_i}$ represents the true annotation information of the $i$-th sample, in which $M_i$ is the number of targets in the image, $b_{i,j}$ is the true bounding box coordinates of the $j$-th target, $c_{i,j} \in \{1, \dots, K\}$ is the category to which the $j$-th target belongs, and $K$ is the total number of categories. It should be noted that, to ensure the model's robustness to complex environments, the training dataset should cover as many degrees of degradation and scene types as possible.

[0090] Given the high accuracy requirements for detecting small targets in complex operating scenarios (such as blurred or noisy environments), this embodiment of the invention selects the classic Faster R-CNN as the basic target detection framework for verification; it serves as the benchmark into which the position-aware spatial-frequency collaborative module and the cascaded alignment module are subsequently embedded. However, those skilled in the art should understand that the basic target detection model of this embodiment is universal and can also employ single-stage or multi-stage detection frameworks such as YOLO, RetinaNet, and FCOS. Specifically, the basic target detection model mainly includes three parts. The first part is the backbone network together with the feature pyramid network. The backbone uses ResNet-50 as the basic feature extraction network; before improvement, it extracts image feature maps from the bottom up through residual blocks stacked in four stages (Stages 1-4), each stage outputting one feature map. As the layers deepen, the resolution gradually decreases (to 1/4, 1/8, 1/16, and 1/32 of the original image, respectively), while the semantic information gradually increases. To address the multi-scale detection problem, the Feature Pyramid Network (FPN) uses lateral connections and a top-down path to upsample deep features and fuse them with shallow features, generating a feature pyramid that is used as input for the subsequent detection heads. The second part is the Region Proposal Network (RPN), which generates candidate regions on the feature maps output by the FPN through a sliding window, distinguishes between foreground and background, and performs coarse bounding box regression. The third part is the Region of Interest (ROI) Head, which uses the ROI Align operation to extract the features of candidate regions of different sizes into a fixed size and performs the final category classification and bounding box refinement.

[0091] The original detection loss function $L_{det}$ of the basic Faster R-CNN model is defined as shown in formula (1):

[0092] $$L_{det} = L_{rpn} + L_{roi} = \left(L_{rpn}^{cls} + L_{rpn}^{reg}\right) + \left(L_{roi}^{cls} + L_{roi}^{reg}\right) \qquad (1)$$

[0093] where $L_{rpn}$ represents the loss of the region proposal network, including the foreground/background classification loss $L_{rpn}^{cls}$ and the candidate box regression loss $L_{rpn}^{reg}$; $L_{roi}$ represents the loss of the final detection head, including the target category classification loss $L_{roi}^{cls}$ and the bounding box regression loss $L_{roi}^{reg}$. The classification losses use cross-entropy loss, and the regression losses use Smooth L1 loss.
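As an illustration of how the four terms of the detection loss combine, the following NumPy sketch sums a cross-entropy term and a Smooth L1 term for the RPN and for the detection head. The function names, the unweighted sum, and the Smooth L1 transition point `beta=1.0` are assumptions made for this sketch; the original text does not state per-term weights.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits is (n, classes), labels is (n,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1: quadratic below beta, linear above (beta is assumed)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())

def detection_loss(rpn_logits, rpn_labels, rpn_deltas, rpn_targets,
                   roi_logits, roi_labels, roi_deltas, roi_targets):
    """Unweighted sum of RPN and detection-head classification/regression terms."""
    rpn = cross_entropy(rpn_logits, rpn_labels) + smooth_l1(rpn_deltas, rpn_targets)
    roi = cross_entropy(roi_logits, roi_labels) + smooth_l1(roi_deltas, roi_targets)
    return rpn + roi
```

In practice a detection framework computes these terms only over sampled anchors and proposals; that sampling is omitted here for brevity.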

[0094] Those skilled in the art will understand that in subsequent steps, the present invention structurally reconstructs the backbone network and the feature pyramid network in the above-mentioned basic framework and makes targeted improvements to the loss function, while the RPN and ROI Head parts can retain their existing technical structures. The choice of ResNet-50 is merely an example for illustrative purposes; the basic backbone network can also be replaced with ResNet-101 or other convolutional neural networks with similar hierarchical structures.

[0095] In this embodiment of the invention, the basic target detection model is optimized. Specifically, to address the common problems of texture detail attenuation and non-uniform degradation in long-distance imaging during dedicated railway locomotive operation, this invention constructs a position-aware spatial-frequency collaborative module at the end of each stage of the backbone network. This spatial-frequency collaborative module adopts a parallel dual-stream architecture. While preserving the physical spatial structure, it establishes a local frequency feature library through block-based DCT and adaptively recovers high-frequency details using a decoupled gating mechanism that introduces absolute position encoding, ultimately generating a spatial high-frequency saliency map that reflects the texture distribution and serves as a cross-level prior.

[0096] Specifically, as shown in Figure 2, constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency domain variations includes:

[0097] S310a. Construct a dual-stream feature splitting architecture and align the interfaces of the dual-stream feature splitting architecture, wherein the dual-stream feature splitting architecture includes a spatial anchor branch and a frequency-aware branch. The spatial anchor branch is used to maintain the physical spatial structure of the feature map and perform downsampling, and the frequency-aware branch is used to capture local texture details of the feature map.

[0098] Specifically, the module receives as input the feature map output by the dimensionality-reduction convolutional layer at the front end of the residual block, whose channel count is the number of channels after dimensionality reduction. This input is divided into two parallel processing branches: one branch is responsible for preserving the physical spatial structure and performing downsampling (the spatial anchor branch), and the other is responsible for capturing local texture details (the frequency-aware branch). After processing, the fused features from the two branches are sent to the dimensionality-increasing convolutional layer at the back end to restore the number of channels and complete the residual connection.

[0099] As a specific implementation of the spatial anchor branch, the spatial anchor branch applies convolution kernels to the input feature map to extract spatial features. Given that downsampling operations in the ResNet architecture typically occur in intermediate convolutional layers, if downsampling is required at the current backbone network level (such as the first block of Stage 2, 3, or 4), it can be performed directly in this spatial branch: using a stride of 2 in the convolution reduces the resolution of the feature map to half that of the input, establishing the spatial resolution benchmark for the entire module.

[0100] As a specific implementation of the frequency-aware branch, in order to maintain spatial scale consistency with the spatial anchor branch, if the spatial anchor branch performs a downsampling operation with a stride of 2, the frequency-aware branch first applies an average pooling layer to the input feature map to perform anti-aliasing downsampling and obtain the preprocessed feature map. Average pooling is used here instead of max pooling in order to reduce the resolution while preserving the image's texture information as smoothly as possible, avoiding spectral distortion caused by high-frequency signal aliasing and providing faithful input for subsequent frequency domain analysis.
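The anti-aliasing downsampling step can be sketched as follows. The 2×2 window with stride 2 is an assumption chosen to match the stride-2 downsampling of the spatial anchor branch; the original pooling window size is not recoverable from the text, and the function name is hypothetical.

```python
import numpy as np

def avg_pool_2x2(feature):
    """Downsample a (C, H, W) feature map by averaging each 2x2 window (stride 2)."""
    c, h, w = feature.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature[:, :h2 * 2, :w2 * 2]      # drop an odd trailing row/column
    windows = trimmed.reshape(c, h2, 2, w2, 2)  # group pixels into 2x2 windows
    return windows.mean(axis=(2, 4))

# A checkerboard is the highest-frequency pattern a grid can hold. Plain strided
# subsampling would keep only one phase of it, while averaging yields a flat 0.5.
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(np.float64)[None]
pooled = avg_pool_2x2(checker)
```

The checkerboard example shows why averaging is the safer choice before frequency analysis: the unresolvable pattern is smoothed to its mean instead of being aliased into a spurious constant of 0 or 1.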

[0101] S320a. Perform block operation on the feature map of the frequency sensing branch to obtain multiple local blocks, and perform frequency domain transformation processing on each local block independently.

[0102] In this embodiment of the invention, in order to overcome the problem that traditional global frequency domain transformation destroys the local spatial structure and easily loses the features of small targets, the feature map of the frequency sensing branch is divided into blocks, and frequency domain transformation is performed independently on each local block.

[0103] Specifically, using tensor reshaping or sliding window expansion operations, the feature map is divided, in row-major index order, into non-overlapping local blocks of a preset spatial size, yielding a block sequence. The total number of blocks is determined by the block size and by the height and width of the feature map after downsampling.

[0104] Furthermore, each local block in the block sequence is transformed to the frequency domain independently (in this embodiment, a two-dimensional discrete cosine transform (DCT) is preferred). According to the principle of energy conservation, each spatial block containing 64 pixels is mapped to a spectral block containing 64 frequency coefficients. To facilitate subsequent neural network processing, each two-dimensional spectrum matrix is flattened into a one-dimensional spectrum vector of dimension 64, ultimately yielding the frequency domain feature sequence. Each element of this vector corresponds to a specific combination of horizontal and vertical frequencies, ranging from low frequencies (DC component) to high frequencies (edge texture).
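The block partitioning and per-block DCT can be sketched in NumPy as below. The sketch assumes 8×8 blocks (consistent with the 64 pixels per block stated above), an orthonormal DCT-II, and a (C, H, W) feature layout; the function names are hypothetical.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)  # DC row scaling makes the matrix orthonormal
    return basis

def split_blocks(feature, block=8):
    """(C, H, W) -> (N, C, block, block), with blocks in row-major order."""
    c, h, w = feature.shape
    gh, gw = h // block, w // block
    x = feature[:, :gh * block, :gw * block].reshape(c, gh, block, gw, block)
    return x.transpose(1, 3, 0, 2, 4).reshape(gh * gw, c, block, block)

def block_dct(feature, block=8):
    """Blockwise 2D DCT, flattened to (N, C, block*block) spectral vectors."""
    d = dct_matrix(block)
    blocks = split_blocks(feature, block)
    spectrum = d @ blocks @ d.T  # separable 2D DCT applied to every block
    return spectrum.reshape(blocks.shape[0], blocks.shape[1], block * block)
```

Because the basis is orthonormal, the sum of squared coefficients equals the sum of squared pixel values in each block, which is the energy conservation property the paragraph refers to. Index 0 of each 64-dimensional vector is the DC component.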

[0105] S330a. Based on the semantic information and absolute position encoding of the spatial anchor branch, obtain the channel and spectrum gating mask for each local block;

[0106] In this embodiment of the invention, the semantic information and absolute position encoding of the spatial branch are used to dynamically predict, for each local block, a dual filtering strategy over channels and spectrum, so as to adapt to non-uniformly distributed degradation in the image (for example, blur is usually stronger at the edge of the lens than at the center).

[0107] Specifically, as shown in Figure 3, obtaining the channel and spectrum gating mask for each local block based on the semantic information and absolute position encoding of the spatial anchor branch includes:

[0108] S331a. Adaptive average pooling is performed on the spatial features extracted from the spatial anchor branch to compress the spatial resolution of the spatial features to a dimension consistent with the number of local blocks, and the spatial features are flattened into spatial semantic descriptors.

[0109] In this embodiment of the invention, adaptive average pooling is first performed on the spatial features extracted from the spatial anchor branch to compress their spatial resolution to a dimension consistent with the number of blocks, and the result is flattened into a spatial semantic descriptor. This establishes a one-to-one mapping between spatial locations and frequency blocks.

[0110] S332a. The spatial semantic descriptor is fused with the pre-initialized absolute position embedding matrix to obtain position-aware contextual features;

[0111] In this embodiment of the invention, a learnable absolute position embedding matrix is initialized and fused element by element with the spatial semantic descriptor to obtain the location-aware context features. It should be noted that, to accommodate input images of different resolutions, when the spatial size of the input feature map changes, the absolute position embedding matrix is adaptively scaled using bilinear interpolation or a parameter sharing mechanism to maintain dimensional alignment. The absolute position embedding matrix supplements the absolute coordinate information that the convolution operation lacks, enabling the network to distinguish between image center and edge regions and thereby generate differentiated frequency enhancement strategies for the different degrees of blur at different locations.

[0112] S333a. The location-aware contextual features are input into a multilayer perceptron and mapped to channel weights to obtain channel-gated branches that include texture information.

[0113] In this embodiment of the invention, a decoupled design is used to generate the gated mask, which reduces the number of parameters and improves the independence of feature selection. This gating mechanism implements soft thresholding rather than hard pruning. Specifically, a channel-gated branch is constructed: the location-aware context features are input to a multilayer perceptron (MLP) and mapped to channel weights, which, after Sigmoid activation, are used to select feature channels rich in texture information.

[0114] S334a. The location-aware contextual features are input into a multilayer perceptron and mapped to spectral weights to obtain a spectral gating branch for filtering general effective frequency components.

[0115] In this embodiment of the invention, a spectrum-gated branch is constructed: the location-aware context features are input to a multilayer perceptron (MLP) and mapped to spectral weights, which, after Sigmoid activation, are used to select generally effective frequency components (such as high-frequency edges). These spectral weights are shared across all channels.

[0116] S335a. Perform a joint mask based on the channel weights and the spectrum weights to obtain the channel and spectrum gating mask for each local block.

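[0117] Specifically, the final joint gating mask is generated through broadcast multiplication of the channel weights and the spectral weights. Its dimensions cover the local blocks, the feature channels, and the frequency components, so it enables fine-grained adjustment across different channels, different spatial positions, and different frequency components.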
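The decoupled gating described in steps S332a to S335a can be sketched as follows. The element-wise fusion is taken to be addition, and each MLP is a two-layer perceptron with ReLU; both are assumptions, as are the function names and the hidden width. The shapes follow the text: channel weights per block, spectral weights per block shared across channels, and a joint mask obtained by broadcasting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU; depth and activation are assumptions."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def joint_gate_mask(descriptor, pos_embed, channel_params, spectrum_params):
    """descriptor, pos_embed: (N, C). Returns a joint mask of shape (N, C, F)."""
    context = descriptor + pos_embed                      # position-aware context (assumed additive)
    channel_w = sigmoid(mlp(context, *channel_params))    # (N, C) channel gate
    spectrum_w = sigmoid(mlp(context, *spectrum_params))  # (N, F) gate shared across channels
    return channel_w[:, :, None] * spectrum_w[:, None, :] # broadcast product

# Illustrative sizes: 6 blocks, 4 channels, 64 frequency components.
rng = np.random.default_rng(0)
N, C, F, HIDDEN = 6, 4, 64, 16

def init_params(out_dim):
    return (rng.standard_normal((C, HIDDEN)) * 0.1, np.zeros(HIDDEN),
            rng.standard_normal((HIDDEN, out_dim)) * 0.1, np.zeros(out_dim))

mask = joint_gate_mask(rng.standard_normal((N, C)), np.zeros((N, C)),
                       init_params(C), init_params(F))
```

The decoupling keeps the parameter count proportional to C + F outputs per block instead of C × F, while the broadcast product still yields a distinct weight for every (block, channel, frequency) triple. All mask values lie strictly between 0 and 1, which is the soft-thresholding behavior described above.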

[0118] S340a. The spectrum of each local block is weighted and filtered according to the gate mask, and the spectrum of the weighted and filtered local blocks is spatially restored.

[0119] In this embodiment of the invention, the generated gated mask is used to perform weighted filtering of the local spectrum, remove background noise and enhance effective texture, and then the features are restored back to the spatial domain and fused with the spatial branches.

[0120] Specifically, as shown in Figure 4, weighting and filtering the spectrum of each local block according to the gate mask and spatially restoring the weighted and filtered spectrum of the local blocks includes:

[0121] S341a. The spectrum of each local block is weighted and filtered according to the gated mask to obtain the modulated spectrum sequence, wherein the frequency components determined to be noise or background have their weights suppressed, and the high-frequency components determined to be target edges or textures have their weights enhanced.

[0122] In this embodiment of the invention, the gated mask is multiplied element-wise with the spectrum of each local block. For frequency components identified as noise or background, the weights approach 0 and the components are suppressed; for high-frequency components identified as target edges or textures, the weights approach 1 and the components are preserved, and thereby enhanced relative to the suppressed components.

[0123] S342a. Perform a two-dimensional inverse discrete cosine transform on the modulated spectrum sequence, and restore it through a folding operation to a frequency domain restored feature of the same spatial size as the spatial features extracted by the spatial anchor branch.

[0124] In this embodiment of the invention, a two-dimensional inverse discrete cosine transform is performed on the modulated spectral sequence, which is then restored through a folding operation to a spatial feature map of the same size as the spatial features extracted by the spatial anchor branch.

[0125] S343a. The spatial features extracted from the spatial anchor point branch are fused with the frequency domain restoration features to obtain the fused feature map.

[0126] In this embodiment of the invention, the spatial features extracted from the spatial anchor branch and the frequency domain restored features are added element-wise, and the fused feature map is output. Through this residual connection structure, the frequency domain branch acts as a high-frequency detail injector, compensating for the detail loss caused by spatial convolution.

[0127] S344a. The fused feature map is subjected to dimensionality upscaling to complete channel recovery.

[0128] Finally, the fused feature is fed to the dimensionality-increasing convolutional layer at the back end of the residual block to complete channel recovery.
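Steps S341a to S343a (spectral weighting, inverse DCT, folding, and element-wise fusion) can be sketched together. The sketch assumes 8×8 blocks, an orthonormal DCT, and a (C, H, W) layout with H and W divisible by the block size; `to_spectrum` is included only so the round trip can be demonstrated, and all function names are hypothetical. The final channel-restoring convolution is omitted.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis

def to_spectrum(feature, block=8):
    """(C, H, W) -> (N, C, block*block) blockwise DCT, blocks in row-major order."""
    c, h, w = feature.shape
    gh, gw = h // block, w // block
    d = dct_matrix(block)
    blocks = feature.reshape(c, gh, block, gw, block).transpose(1, 3, 0, 2, 4)
    return (d @ blocks @ d.T).reshape(gh * gw, c, block * block)

def restore_and_fuse(spectrum, mask, spatial, block=8):
    """Gate the spectrum, invert the DCT, fold blocks back, and add to `spatial`."""
    c, h, w = spatial.shape
    gh, gw = h // block, w // block
    d = dct_matrix(block)
    modulated = (spectrum * mask).reshape(gh * gw, c, block, block)  # weighted filtering
    blocks = d.T @ modulated @ d                                     # inverse 2D DCT per block
    folded = (blocks.reshape(gh, gw, c, block, block)
              .transpose(2, 0, 3, 1, 4).reshape(c, h, w))            # fold to (C, H, W)
    return spatial + folded                                          # residual fusion
```

With an all-ones mask the frequency branch reproduces its input exactly, and with an all-zeros mask the output reduces to the spatial branch alone, which makes the "detail injector" role of the frequency branch easy to verify.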

[0129] S350a. Aggregate the high-frequency energy information of each local block to obtain a spatial high-frequency saliency map that reflects the high-frequency texture distribution in the feature map.

[0130] In this embodiment of the invention, high-frequency energy information in each local block is aggregated to construct a spatial saliency map that reflects the high-frequency texture distribution of the image, which is then transmitted as prior information across layers to the subsequent feature fusion network.

[0131] Specifically, based on the frequency domain topology of the DCT (the upper left corner represents low frequency and the lower right corner represents high frequency), a high-frequency index mask is preset. In the spectrum matrix, the mask sets the region whose coordinates fall within a preset threshold to 0 and the remaining region to 1, thereby eliminating the DC component and low-frequency contour information.

[0132] Furthermore, for each block, the weighted sum of its joint gating mask with the high-frequency index mask is calculated. The resulting value characterizes the significance of the effective high-frequency texture in the local region as perceived by the backbone network.

[0133] Furthermore, the high-frequency energy values of all blocks are reorganized according to their spatial arrangement in the original image to form a single-channel two-dimensional feature map, namely the spatial high-frequency saliency map. The highlighted areas in this map indicate potential object edge locations in the blurred image. The map is then transmitted via skip connections to the corresponding levels of the feature pyramid network (FPN) to guide subsequent feature alignment.
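The saliency map construction can be sketched as below. The low-frequency region is assumed to be the set of coefficients with u + v below a threshold, and the weighted sum is assumed to run over both channels and frequency components; the threshold value of 4, the 8×8 block size, and the function names are illustrative assumptions, since the exact rule is not recoverable from the text.

```python
import numpy as np

def high_freq_index_mask(block=8, threshold=4):
    """Flattened mask: 0 where u + v < threshold (low frequency), 1 elsewhere."""
    u, v = np.meshgrid(np.arange(block), np.arange(block), indexing="ij")
    return (u + v >= threshold).astype(np.float64).reshape(-1)

def saliency_map(gate_mask, grid_h, grid_w, block=8, threshold=4):
    """gate_mask: (N, C, block*block) -> (grid_h, grid_w) single-channel map."""
    hf = high_freq_index_mask(block, threshold)
    energy = (gate_mask * hf).sum(axis=(1, 2))  # one high-frequency score per block
    return energy.reshape(grid_h, grid_w)       # row-major block order -> 2D layout
```

Each output pixel corresponds to one local block, so the map is 1/8 of the feature map resolution in each direction under the 8×8 assumption and has to be resized before it is used at an FPN level.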

[0134] In actual operation, distant targets often occupy a very small proportion in the image (small targets at a distance). During the FPN upsampling process, even a tiny pixel deviation can cause the detection box to deviate from the target center, leading to ranging errors. To address the spatial misalignment and noise interference issues in multi-scale fusion, this embodiment of the invention constructs a frequency-guided geometric and semantic cascade alignment module at each level of the FPN. The core logic of this cascaded feature alignment module is to use precisely positioned lateral features as anchors, guided by the spatial high-frequency saliency map transmitted by the backbone network, to perform geometric calibration on the position-drifting upsampled features, and simultaneously perform cascaded deformation of semantic attention.

[0135] In this embodiment of the invention, as shown in Figure 5, constructing a frequency-guided cascaded feature alignment module in the feature pyramid network, to guide the prediction of the geometric offset of the feature map based on the spatial high-frequency saliency map and to deform and align the semantic attention map based on the geometric offset, includes:

[0136] S310b: Obtain the spatial reference anchor point, the feature to be calibrated, and the cross-domain frequency prior required for the current fusion level, so as to construct a multi-scale combined feature including multi-dimensional spatial indication information. The spatial reference anchor point includes the lateral connection feature after the output feature of the backbone network is reduced by lateral convolution. The feature to be calibrated includes the upsampled feature obtained after the feature of the previous level of the feature pyramid network is upsampled. The cross-domain frequency prior includes the spatial high-frequency saliency map.

[0137] In this embodiment of the invention, multi-scale feature preparation and frequency prior injection are performed. The spatial reference anchor point, the object to be calibrated, and the cross-domain frequency prior required for the current fusion level are obtained. A multi-scale feature combination containing multi-dimensional spatial indication information is constructed to clarify the alignment benchmark and target.

[0138] Specifically, the output feature of the current level (stage) of the backbone network is obtained and reduced in dimensionality by a lateral convolution to obtain the lateral connection feature. Since this lateral connection feature retains the most complete spatial geometry and has undergone the spatial-frequency collaborative enhancement described above, its edge texture is accurate. Therefore, it is used as the spatial reference anchor point of the cascaded feature alignment module, and its coordinates remain fixed in the subsequent process without geometric deformation.

[0139] Furthermore, the feature of the level above in the FPN is obtained and upsampled to obtain the upsampled feature. Although the upsampled feature is rich in semantic information, its activation response position often deviates from the center of the real object due to the smoothing effect of the interpolation algorithm. Therefore, it is used as the feature to be calibrated by this cascaded feature alignment module.

[0140] Furthermore, the spatial high-frequency saliency map generated by the space-frequency coordination module is received, and its resolution is adjusted from its original size to match the current level using adaptive bilinear interpolation. The reference feature, the feature to be calibrated, and the frequency prior map are then concatenated along the channel dimension to form the combined feature. At this point, the network simultaneously perceives, at the pixel level, where the feature should be (the reference feature), where it currently is (the feature to be calibrated), and where the high-frequency edges are (the frequency prior map).
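The construction of the combined feature can be sketched as follows: the saliency map is resized bilinearly to the current level and concatenated with the lateral and upsampled features along the channel axis. The pixel-center alignment convention of the resize and the function names are assumptions; the original text only specifies bilinear interpolation and channel-wise concatenation.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Bilinear resize of an (H, W) map using pixel-center alignment (assumed)."""
    in_h, in_w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def build_combined_feature(lateral, upsampled, saliency):
    """lateral, upsampled: (C, H, W); saliency: (h, w). Returns (2C + 1, H, W)."""
    _, h, w = lateral.shape
    prior = resize_bilinear(saliency, h, w)[None, :, :]  # one extra prior channel
    return np.concatenate([lateral, upsampled, prior], axis=0)
```

The saliency map contributes a single channel, so for C-channel features the offset predictor receives 2C + 1 input channels under this sketch.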

[0141] S320b: Predict a pixel-level dense offset field based on the multi-scale combined features, and guide the dense offset field to adaptively adjust based on the cross-domain frequency prior.

[0142] In this embodiment of the invention, the pixel-level dense offset field is predicted based on combined features, and the network is guided to focus on learning the alignment error of the blurred boundary by using the introduced frequency prior map, so as to avoid generating invalid drift in flat areas.

[0143] Specifically, as shown in Figure 6, predicting a pixel-level dense offset field based on the multi-scale combined features and adaptively adjusting the dense offset field based on the cross-domain frequency prior includes:

[0144] S321b. Construct an offset prediction network and use the multi-scale combined features as input to the offset prediction network to learn the displacement vector of the feature to be calibrated relative to the spatial reference anchor point, thereby obtaining a pixel-level dense offset field.

[0145] In this embodiment of the invention, a lightweight offset prediction network consisting of multiple convolutional layers is constructed and takes the combined features as input. The network learns the displacement vector of the feature to be calibrated relative to the spatial reference anchor point and outputs a pixel-level dense offset field, whose channel count depends on the kernel size of the subsequent deformable convolution. For each sampling point, the offset field contains the offset vector of its coordinates and the corresponding sampling point weight.

[0146] S322b. Adaptively adjust the offset prediction network according to the cross-domain frequency prior, wherein the offset predicted by the offset prediction network is positively correlated with the spatial response intensity of the cross-domain frequency prior.

[0147] In this embodiment of the invention, a frequency guidance mechanism is introduced: because the spatial high-frequency saliency map is explicitly concatenated into the input features, gradients during backpropagation drive the network to adaptively adjust its prediction strategy. Specifically, in regions where the saliency map has high response values (i.e., regions identified by the backbone as rich in texture or edges), the network predicts larger offsets to correct significant spatial misalignments, while in regions with low response values (i.e., flat backgrounds), the network predicts offsets close to 0. This mechanism concentrates alignment operations on edge and texture regions and avoids overfitting to background noise.

[0148] S330b: Resample and deform the feature to be calibrated according to the dense offset field to obtain a calibrated feature aligned with the spatial reference anchor point;

[0149] In this embodiment of the invention, the predicted offset field is used to resample and deform the object to be calibrated so that its spatial position is strictly aligned with the spatial reference anchor point.

[0150] Specifically, a deformable convolution operator, with the dense offset field as its geometric deformation parameters, processes the upsampled feature to obtain the geometric alignment feature.

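[0151] In this embodiment of the invention, the mathematical expression of the geometric alignment feature is as follows: $F_{align}(p) = \sum_{k} w_k \cdot F_{up}(p + p_k + \Delta p_k) \cdot \Delta m_k$, where $F_{up}$ is the upsampled feature, $p$ indicates the current pixel position, $p_k$ indicates the sampling offset of the standard convolution kernel, $w_k$ is the kernel weight at that sampling point, and $\Delta p_k$ and $\Delta m_k$ represent the predicted offset and modulation scalar, respectively. Through this operation, the semantic activation centers of deep features are geometrically shifted to the texture edge locations indicated by the spatial high-frequency saliency map, resolving the ghosting problem commonly found in feature pyramids.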
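The resampling step can be illustrated with a loop-based NumPy sketch of a modulated deformable convolution: each kernel tap samples the input at its regular grid position plus a predicted offset, using bilinear interpolation, and scales the sample by a modulation scalar. The 3×3 kernel, the tensor layouts, zero padding outside the image, and the function names are assumptions for illustration; a practical implementation would use an optimized operator.

```python
import numpy as np

def bilinear_sample(feature, ys, xs):
    """Sample a (C, H, W) map at float coordinates (H, W each); zero outside."""
    c, h, w = feature.shape
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    out = np.zeros((c,) + ys.shape)
    for dy in (0, 1):
        for dx in (0, 1):
            yi, xi = y0 + dy, x0 + dx
            weight = (1 - np.abs(ys - yi)) * (1 - np.abs(xs - xi))
            valid = (yi >= 0) & (yi < h) & (xi >= 0) & (xi < w)
            vals = feature[:, np.clip(yi, 0, h - 1), np.clip(xi, 0, w - 1)]
            out += vals * (weight * valid)
    return out

def modulated_deform_conv(feature, weight, offsets, modulation, k=3):
    """feature (Cin, H, W); weight (Cout, Cin, k*k);
    offsets (k*k, 2, H, W) as (dy, dx); modulation (k*k, H, W)."""
    _, h, w = feature.shape
    py, px = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out = np.zeros((weight.shape[0], h, w))
    r = k // 2
    for idx in range(k * k):
        ky, kx = idx // k - r, idx % k - r   # regular grid position of this tap
        ys = py + ky + offsets[idx, 0]       # grid position plus predicted offset
        xs = px + kx + offsets[idx, 1]
        sampled = bilinear_sample(feature, ys, xs) * modulation[idx]
        out += np.einsum("oc,chw->ohw", weight[:, :, idx], sampled)
    return out
```

With zero offsets and unit modulation this reduces to an ordinary convolution, so the offsets can be read as per-pixel corrections on top of the regular sampling grid.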

[0152] S340b: Obtain an initial spatial semantic attention map based on the multi-scale combined features, and resample and deform the initial spatial semantic attention map based on the dense offset field to obtain a final spatial semantic attention map corresponding to the geometric alignment features.

[0153] In traditional methods, the features are geometrically deformed while the attention weights remain based on the original, erroneous positions, which causes a secondary misalignment. To solve this problem, this embodiment of the invention provides a cascade alignment strategy that corrects features and attention synchronously.

[0154] Specifically, convolutional layers are applied to the multi-scale combined features to reduce their dimensionality and transform them, generating an initial spatial semantic attention map. This attention map indicates which regions in the feature map are foreground objects and which are background noise, but it is still based on an unaligned coordinate system.

[0155] Furthermore, an offset sharing strategy is implemented: the same offset field generated in step S320b is directly reused. The initial attention map is resampled and deformed using a bilinear grid sampler or the same deformable convolution operation to obtain a final attention map that strictly corresponds to the geometric alignment features (i.e., [formula missing]). This cascaded design ensures that the highlighted attention-activated area always precisely covers the deformed object features, so that features and attention move synchronously and spatial inconsistencies are eliminated.

[0156] S350b: Denoise and filter the calibrated features according to the final spatial semantic attention map, and fuse them with the lateral connection features to obtain the fused features output by the feature pyramid network.

[0157] In this embodiment of the invention, the aligned attention map is used to denoise and filter the calibrated features, and residual fusion is performed with the original reference features to output the final high-quality FPN features.

[0158] Specifically, an activation function is applied to the final attention map, mapping it to a weight map on a normalized interval.

[0159] Furthermore, this weight map is used to weight the geometric alignment features pixel by pixel; the formula is as follows: [formula missing]. This step uses semantic information to suppress background noise (such as raindrops or salt-and-pepper noise) that may be carried in the upsampled features, retaining only the semantic information related to the object.

[0160] Furthermore, the filtered features are added element-wise to the original lateral connection features to obtain the output features of the current FPN layer. The specific formula is as follows: [formula missing].
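Steps [0158] to [0160] can be sketched as a gating followed by a residual addition. This is a minimal single-channel illustration that assumes a sigmoid activation producing weights in (0, 1); the activation name and interval are not legible in this text, and the function names `sigmoid` and `fuse_level` are hypothetical.

```python
import math

def sigmoid(v):
    """Logistic function, assumed here as the activation that yields (0, 1) weights."""
    return 1.0 / (1.0 + math.exp(-v))

def fuse_level(aligned, attention, lateral):
    """Gate the aligned features with the attention weights, then add the lateral features.

    aligned:   H x W geometric alignment features
    attention: H x W final (already warped) attention map, before activation
    lateral:   H x W lateral connection features (spatial reference anchor)
    """
    h, w = len(aligned), len(aligned[0])
    return [
        [lateral[y][x] + sigmoid(attention[y][x]) * aligned[y][x] for x in range(w)]
        for y in range(h)
    ]

# Where the attention response is strongly negative, the upsampled content is
# suppressed and the output falls back to the lateral feature.
fused = fuse_level([[2.0, 2.0]], [[0.0, -50.0]], [[1.0, 1.0]])
```

The residual form means that a poorly predicted attention map can at worst remove the upsampled contribution; it cannot erase the lateral features.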

[0161] This output feature combines the accurate texture localization and frequency priors provided by the backbone network with the rich semantic information provided by upsampling and cascade alignment, and it is used as the input to the detection head in the next stage.

[0162] To maintain the stability of the basic target detection framework while effectively monitoring the feature selection capability of the spatial-frequency collaboration module and the geometric transformation stability of the cascaded alignment module, this embodiment of the invention constructs a multi-task combined loss function. This multi-task combined loss function consists of the basic Faster R-CNN detection loss, the saliency guidance and sparsity constraint loss for the spatial-frequency module, and the smoothing regularization loss for the alignment module.

[0163] In this embodiment of the invention, as shown in Figure 7, the training dataset is input into the optimized basic object detection model, and multiple rounds of iterative training are performed based on a combined loss function including detection loss and auxiliary constraint loss, which includes:

[0164] S410. Construct a saliency-guided and sparsity-constrained loss for the aforementioned space-frequency cooperative module;

[0165] In this embodiment of the invention, a saliency-guided and sparsity-constrained loss is constructed for the space-frequency coordination module. Its purpose is to supervise the spatial high-frequency saliency map generated above (…) so that it accurately focuses on the object region, while preventing the frequency-domain branch from introducing redundant noise. To this end, a soft focal loss based on a Gaussian heatmap and a frequency-domain sparsity loss are provided.

[0166] Specifically, as shown in Figure 8, constructing the saliency-guided and sparsity-constrained loss for the aforementioned space-frequency coordination module includes:

[0167] S411. Construct a multi-scale saliency supervision flow based on the aforementioned spatial high-frequency saliency map;

[0168] In this embodiment of the invention, a multi-scale saliency supervision flow is constructed. The set of spatial high-frequency saliency maps output at each stage of the backbone network is denoted as [set name missing], indexed over the total number of levels (usually corresponding to Stages 2-5 of ResNet, with strides of 4, 8, 16, and 32, respectively). The resolution of each map is determined by the stride of its level.

[0169] S412. Generate a resolution-adaptive ground truth Gaussian heatmap;

[0170] In this embodiment of the invention, for each level, the spatial downsampling stride of that level is used to map the coordinates of the original annotation box, yielding the projection box at that feature level. Based on the projection box, the corresponding ground-truth Gaussian heatmap is generated on the grid at that level's resolution. Its generation formula assigns a weight to each pixel of the level: [formula missing]. A hyperparameter controls the decay range of the Gaussian distribution. This method ensures that the ground-truth heatmap is strictly aligned spatially with the saliency map output by the network. The heatmap approaches 1 at the center of the object and gradually decays toward the edges, while the area outside the box is set to 0.
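The heatmap generation in S412 can be sketched as follows. The exact generation formula is not legible in this text, so the sketch assumes an axis-aligned Gaussian whose standard deviations are a fraction `alpha` of the projected box width and height, evaluated at cell centers. The function name `gaussian_heatmap`, the parameter `alpha`, and the cell-center convention are illustrative assumptions.

```python
import math

def gaussian_heatmap(box, stride, height, width, alpha=0.25):
    """Ground-truth heatmap for one annotation box at one pyramid level.

    box:    (x1, y1, x2, y2) in input-image pixels
    stride: spatial downsampling stride of the level (e.g. 4, 8, 16, 32)
    height, width: resolution of the saliency map at this level
    alpha:  assumed hyperparameter controlling how fast the Gaussian decays
    """
    # Project the annotation box onto the feature grid.
    x1, y1, x2, y2 = (v / stride for v in box)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx = max(alpha * (x2 - x1), 1e-6)
    sy = max(alpha * (y2 - y1), 1e-6)

    heat = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # center of the grid cell
            if x1 <= px <= x2 and y1 <= py <= y2:
                heat[y][x] = math.exp(
                    -((px - cx) ** 2 / (2 * sx ** 2) + (py - cy) ** 2 / (2 * sy ** 2))
                )
            # Cells outside the projected box keep the value 0.
    return heat

# One box on a 64 x 64 image, supervised at the stride-8 level (8 x 8 grid).
heat = gaussian_heatmap((8, 8, 48, 48), stride=8, height=8, width=8)
```

Calling the function once per level with that level's stride yields heatmaps that match the saliency maps in resolution. When an image contains several boxes, taking the element-wise maximum over their heatmaps is a common choice, although the original text does not state how overlaps are handled.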

[0171] S413. Obtain a focal loss function with dynamic weights based on the absolute error between the spatial high-frequency saliency map and the ground-truth Gaussian heatmap. The expression is:

[0172] [formula missing],

[0173] where the symbols denote, respectively, the spatial high-frequency saliency map, the ground-truth Gaussian heatmap (subscript gt), the pixel index, the focusing factor, and the total number of pixels;
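The loss in S413 can be illustrated with a short sketch. Since the expression itself is not legible in this text, the sketch assumes one common form that matches the description: a per-pixel binary cross-entropy between the saliency map and the soft heatmap, weighted by the absolute error raised to the focusing factor, and averaged over all pixels. The function name `dynamic_focal_loss` and the default `gamma=2.0` are assumptions.

```python
import math

def dynamic_focal_loss(pred, target, gamma=2.0, eps=1e-7):
    """Assumed form: mean over pixels of |s - g|^gamma * BCE(s, g).

    pred:   H x W saliency map with values in [0, 1]
    target: H x W ground-truth Gaussian heatmap with values in [0, 1]
    gamma:  focusing factor; larger values emphasize poorly predicted pixels
    """
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for s, g in zip(row_p, row_t):
            s = min(max(s, eps), 1.0 - eps)  # keep the logarithms finite
            bce = -(g * math.log(s) + (1.0 - g) * math.log(1.0 - s))
            total += abs(s - g) ** gamma * bce
            count += 1
    return total / count

# A confident false response on background costs far more than a mild one.
mild = dynamic_focal_loss([[0.1]], [[0.0]])
severe = dynamic_focal_loss([[0.9]], [[0.0]])
```

Because the weight depends on the absolute error rather than on a hard positive or negative label, the same expression handles the continuous heatmap values between 0 and 1.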

[0174] It should be understood that the focal loss function with dynamic weights can solve the problem of imbalanced positive and negative samples and adapts to the continuous-valued nature of Gaussian heatmaps. Furthermore, this focal loss function uses the semantic prior of the bounding boxes to force the network to highlight textured areas of objects and suppress background noise, while the dynamic weighting mechanism lets the network focus on hard-to-distinguish edge regions.

[0175] S414. Apply a regularization constraint to the gating mask to obtain the frequency-domain sparsity loss function. The expression is:

[0176] [formula missing],

[0177] where the symbols denote the gating mask and the total number of elements in the gating mask.

[0178] In this embodiment of the invention, to prevent the gating network described above from opening all frequency channels in order to satisfy the saliency supervision, and thereby introducing invalid high-frequency background noise, a regularization constraint is applied to the generated gating mask. This frequency-domain sparsity loss function is based on the prior assumption that effective texture is sparse in the frequency domain, and it prompts the network to activate only the few frequency components that are most critical to feature representation.
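One form consistent with the description in S414 is a mean absolute value (L1) penalty over the gating mask, which is the usual way to encourage sparsity. The type of norm is not legible in this text, so the L1 choice and the symbols below ($m_j$ for the $j$-th element of the gating mask, $M$ for the total number of elements) are assumptions:

```latex
\mathcal{L}_{\mathrm{sparse}} = \frac{1}{M} \sum_{j=1}^{M} \lvert m_j \rvert
```

Under this form, every open gate contributes to the loss, so a gate stays open only when the gain in the saliency and detection terms outweighs its cost.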

[0179] S420. Construct the offset smoothing loss for the cascaded feature alignment module;

[0180] In this embodiment of the invention, for the offset prediction of the feature alignment module, in order to prevent messy geometric deformation in blurry or flat areas lacking strong texture, a smoothness regularization based on total variation is introduced.

[0181] Specifically, the dense offset field output above is obtained and the offset smoothing loss is calculated, that is, the change in offset between adjacent pixels is constrained: [formula missing], where the symbols denote the offset vector of the current pixel and the offset vectors of its right and lower neighboring pixels, respectively. This loss forces the offset field to remain locally continuous, ensuring that the geometric alignment operation conforms to the rigid or semi-rigid motion of physical entities and preventing unnatural tearing of the feature map.
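The smoothing term can be sketched as a total-variation penalty on the offset field. The sketch below assumes an L1 difference between each offset vector and its right and lower neighbors, averaged over all neighbor pairs; the norm and the normalization are assumptions, as is the function name `offset_smooth_loss`.

```python
def offset_smooth_loss(offsets):
    """Total-variation style smoothness penalty on a dense offset field.

    offsets: H x W grid where offsets[y][x] is a (dy, dx) pair.
    Returns the mean L1 difference between each offset vector and its
    right and lower neighbors (assumed norm and normalization).
    """
    h, w = len(offsets), len(offsets[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y, x + 1), (y + 1, x)):  # right and lower neighbor
                if ny < h and nx < w:
                    cur, nb = offsets[y][x], offsets[ny][nx]
                    total += sum(abs(a - b) for a, b in zip(cur, nb))
                    count += 1
    return total / count if count else 0.0

# A uniform shift (rigid translation) is not penalized at all.
uniform = [[(1.0, -1.0)] * 3 for _ in range(2)]
rigid_cost = offset_smooth_loss(uniform)
```

A constant offset field yields zero loss, which reflects the stated intent: coherent, rigid-like shifts are free, while abrupt changes between neighboring pixels are penalized.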

[0182] S430. An auxiliary loss is formed based on the saliency-guided and sparsity-constrained loss and the offset smoothing loss;

[0183] S440. The auxiliary loss and the detection loss of the basic target detection model are weighted and fused to obtain the total loss function;

[0184] In this embodiment of the invention, the above-mentioned auxiliary loss term is weighted and fused with the basic Faster R-CNN detection loss defined above to form the final optimization target.

[0185] Specifically, the total loss function is defined as:

[0186] [formula missing],

[0187] where the detection term is the original loss of the basic object detection model, consistent with the previous definition, which includes the classification and regression losses of the RPN, as well as the cross-entropy classification loss and Smooth L1 regression loss of the ROI head. The three remaining symbols are the balance coefficients of the respective auxiliary losses, typically taken in the range of 0.01 to 1.
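Based on steps S430 and S440, the total loss can be written as a weighted sum of the detection loss and the three auxiliary terms. The symbol names below are chosen for illustration and may differ from the notation of the original formula:

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{det}}
  + \lambda_1 \mathcal{L}_{\mathrm{sal}}
  + \lambda_2 \mathcal{L}_{\mathrm{sparse}}
  + \lambda_3 \mathcal{L}_{\mathrm{smooth}}
```

Here $\mathcal{L}_{\mathrm{det}}$ is the Faster R-CNN detection loss, $\mathcal{L}_{\mathrm{sal}}$ the focal loss with dynamic weights, $\mathcal{L}_{\mathrm{sparse}}$ the frequency-domain sparsity loss, $\mathcal{L}_{\mathrm{smooth}}$ the offset smoothing loss, and $\lambda_1, \lambda_2, \lambda_3$ the balance coefficients.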

[0188] S450. The optimized basic target detection model is trained through multiple rounds of iterations based on the total loss function to obtain a target detection model for the dedicated railway operation scenario.

[0189] Specifically, the model is trained and detection is performed based on the constructed training dataset of complex operation scenarios. During backpropagation, the detection loss primarily drives the backbone network, FPN, and detection head to learn the object detection task, while the three auxiliary losses serve as auxiliary supervision signals that optimize the frequency filtering capability of the space-frequency coordination module and the alignment stability of the feature alignment module, thereby significantly improving the model's feature representation and localization capabilities in blurred and noisy scenes without changing the basic detection head structure.

[0190] Those skilled in the art should understand that the Faster R-CNN model is selected only as a basic carrier to verify the effectiveness of the space-frequency coordination module and the feature alignment module of this invention. The core modules proposed in this invention are general-purpose plug-in modules and are not limited to two-stage detectors. They can be applied to any model whose architecture has a backbone network for feature extraction and a feature pyramid or similar structure for multi-scale fusion. In addition, the frequency domain transformation mentioned in this invention refers to a mathematical transformation that converts image signals from the spatial domain to the frequency domain. The discrete cosine transform (DCT) is preferred in the embodiments of this invention, but other transforms such as the discrete Fourier transform (DFT) and the discrete wavelet transform (DWT) are also covered.

[0191] In summary, the target detection method for dedicated railway operation scenarios provided by this invention effectively solves the problems of high-frequency feature loss and non-uniform degradation in complex operating scenarios by constructing a location-aware space-frequency collaborative module, thereby improving feature robustness and enhancing the locomotive's perception capabilities under complex operating conditions such as high dust and low visibility. This space-frequency collaborative module overcomes the limitations of traditional convolutional network low-pass filtering by constructing a local frequency feature library and a decoupling gating mechanism. Compared with conventional methods in the prior art, this invention can adaptively recover the edge and texture details of small targets from the source, significantly enhancing the model's perception capabilities for visual degradation phenomena such as motion blur and low-light noise. At the same time, under normal lighting and visibility conditions, this module can mine ignored high-frequency texture details in the image, further improving the recognition accuracy of small foreign objects and achieving performance gains for the model in all-weather operating scenarios. The constructed frequency-guided geometric and semantic cascaded feature alignment module effectively solves the problems of pixel-level misalignment and background noise interference in multi-scale feature fusion, improving positioning accuracy and achieving accurate positioning and intrusion judgment of small targets at long distances. This cascaded feature alignment module innovatively utilizes the spatial high-frequency saliency map transmitted by the backbone network as a priori guide, overcoming the shortcomings of traditional deformable convolutions that blindly drift in flat or blurry regions. 
Through geometric calibration with lateral features as anchor points and attention-based cascaded deformation, it achieves pixel-level precise feature alignment while effectively suppressing background noise introduced by upsampling and eliminating detection ghosting. The constructed combined loss function effectively solves the model training convergence problem under supervised training on complex operational degradation scenario datasets, improving the model's detection performance. Addressing the limitations of degraded image data in complex operational scenarios, this combined loss function fully leverages the semantic priors of bounding boxes for supervised learning, forcing the network to focus on object textures and maintain the stability of geometric transformations. This strategy significantly improves the model's detection performance in the complex operating environment of dedicated railways without increasing inference computation costs. Therefore, the target detection method for dedicated railway operation scenarios proposed in this invention has good scene adaptability, significantly resists interference under non-ideal conditions, and further improves the recall rate of small targets under ideal conditions without introducing additional false detections.

[0192] As another embodiment of the present invention, a target detection device 100 for a dedicated railway operation scenario is provided, for implementing the target detection method for a dedicated railway operation scenario described above. As shown in Figure 9, the device includes:

[0193] The dataset construction module 110 is used to construct a training dataset, which includes image data under multiple dedicated railway operating conditions. The image data under each operating condition includes image degradation features and small targets. The image degradation features include at least motion blur, low-light noise, and defocus blur. The small targets include targets whose coverage area in the image data is smaller than a preset small-target pixel threshold.

[0194] The model building module 120 is used to build a basic target detection model according to the detection accuracy requirements of the dedicated railway operation scenario. The basic target detection model includes at least a backbone network for feature extraction, a feature pyramid network for multi-scale feature fusion, and a detection head network for target classification and localization.

[0195] The model optimization module 130 is used to optimize the basic target detection model. The optimization process includes at least constructing a location-aware spatial-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a two-stream architecture and local frequency domain changes; and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the geometric offset of the predicted feature map according to the spatial high-frequency saliency map, and to deform and align the semantic attention map based on the geometric offset.

[0196] The model training module 140 is used to input the training dataset into the optimized basic target detection model, and perform multiple rounds of iterative training based on a combined loss function including detection loss and auxiliary constraint loss to obtain a target detection model under the dedicated railway operation scenario. The target detection model under the dedicated railway operation scenario can perform target detection on the real-time acquired images under the input dedicated railway operation scenario and obtain target detection results.

[0197] Therefore, the target detection device for dedicated railway operation scenarios provided by this invention constructs a training dataset and builds a basic target detection model based on the detection accuracy requirements of dedicated railway operation scenarios. This basic target detection model is then optimized. The constructed position-aware space-frequency collaborative module effectively solves the problems of high-frequency feature loss and non-uniform degradation in complex operating scenarios, improves feature robustness, and enhances the locomotive's perception capability under complex operating conditions such as high dust and low visibility. It adaptively recovers the edge and texture details of small targets from the source, significantly enhancing the model's perception capability against visual degradation phenomena such as motion blur and low-light noise. Furthermore, the constructed frequency-guided geometric and semantic cascaded feature alignment module effectively solves the problems of pixel-level misalignment and background noise interference in multi-scale feature fusion, improving positioning accuracy and enabling precise positioning and intrusion judgment of small targets at long distances. Finally, to maintain the stability of the basic target detection framework while effectively supervising the feature selection capability of the space-frequency collaborative module and the geometric transformation stability of the cascaded alignment module, a multi-task combined loss function is constructed to achieve multi-round iterative training. 
Therefore, the target detection device for dedicated railway operation scenarios provided by this invention can effectively solve the problems of existing general target detection models (such as Faster R-CNN) failing to accurately extract effective features when dealing with complex working conditions such as blur, low-light noise, and defocus blur commonly encountered in dedicated railway operation, resulting in missed detection of small targets and positioning drift, thereby improving the accuracy of small target detection.

[0198] The specific working principle of the target detection device in the dedicated railway operation scenario of the present invention can be referred to the description of the target detection method in the dedicated railway operation scenario above, and will not be repeated here.

[0199] As another embodiment of the present invention, a storage medium is provided, wherein computer instructions are stored, which are loaded and executed by a processor to implement the target detection method in the dedicated railway operation scenario described above.

[0200] In this embodiment of the invention, a non-transitory computer-readable storage medium is provided. The computer-readable storage medium stores computer-executable instructions that can execute the target detection method for a dedicated railway operation scenario in any of the above method embodiments. The storage medium may be a magnetic disk, optical disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk drive (HDD), or solid-state drive (SSD), etc.; the storage medium may also include combinations of the above types of memory.

[0201] As another embodiment of the present invention, an electronic device is provided, comprising a memory and a processor, wherein the processor is communicatively connected to the memory, the memory is used to store a computer program, and the processor is used to load and execute the computer program to implement the target detection method in the dedicated railway operation scenario described above.

[0202] As shown in Figure 10, the electronic device 10 may include: at least one processor 11, such as a CPU (Central Processing Unit), at least one communication interface 13, a memory 14, and at least one communication bus 12. The communication bus 12 is used to enable communication between these components. The communication interface 13 may include a display screen or a keyboard; optionally, the communication interface 13 may also include a standard wired interface or a wireless interface. The memory 14 may be high-speed RAM (Random Access Memory) or non-volatile memory, such as at least one disk drive. Optionally, the memory 14 may also be at least one storage device located remotely from the aforementioned processor 11. The memory 14 stores application programs, and the processor 11 calls the program code stored in the memory 14 to execute any of the aforementioned method steps.

[0203] The communication bus 12 can be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. The communication bus 12 can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is represented by a single thick line in Figure 10, but this does not mean that there is only one bus or one type of bus.

[0204] The memory 14 may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, hard disk drive (HDD) or solid-state drive (SSD); the memory 14 may also include a combination of the above types of memory.

[0205] The processor 11 can be a central processing unit (CPU), a network processor (NP), or a combination of CPU and NP.

[0206] The processor 11 may further include a hardware chip. This hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

[0207] Optionally, memory 14 is also used to store program instructions. Processor 11 can invoke the program instructions to implement the target detection method for the dedicated railway operation scenario shown in the embodiment of Figure 1 of the present invention.

[0208] It is understood that the above embodiments are merely exemplary implementations used to illustrate the principles of the present invention, and the present invention is not limited thereto. For those skilled in the art, various modifications and improvements can be made without departing from the spirit and essence of the present invention, and these modifications and improvements are also considered to be within the scope of protection of the present invention.

Claims

1. A target detection method for a dedicated railway operation scenario, characterized in that the method comprises: constructing a training dataset that includes image data under multiple dedicated railway operating conditions, wherein the image data under each operating condition includes image degradation features and small targets, the image degradation features include at least motion blur, low-light noise, and defocus blur, and the small targets include targets whose coverage area in the image data is smaller than a preset small-target pixel threshold; building a basic target detection model according to the detection accuracy requirements of the dedicated railway operation scenario, wherein the basic target detection model includes at least a backbone network for feature extraction, a feature pyramid network for multi-scale feature fusion, and a detection head network for target classification and localization; optimizing the basic target detection model, wherein the optimization includes at least constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency domain changes, and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the geometric offset of the predicted feature map according to the spatial high-frequency saliency map and to deform and align the semantic attention map based on the geometric offset; and inputting the training dataset into the optimized basic target detection model and performing multiple rounds of iterative training based on a combined loss function including detection loss and auxiliary constraint loss to obtain a target detection model for the dedicated railway operation scenario, wherein the target detection model for the dedicated railway operation scenario can perform target detection on input images acquired in real time in the dedicated railway operation scenario and obtain target detection results.

2. The target detection method in a dedicated railway operation scenario according to claim 1, characterized in that, A location-aware space-frequency coordination module is constructed in the backbone network to generate a spatial high-frequency saliency map using a two-stream architecture and local frequency domain variations, including: A dual-stream feature splitting architecture is constructed, and the interfaces of the dual-stream feature splitting architecture are aligned. The dual-stream feature splitting architecture includes a spatial anchor branch and a frequency-aware branch. The spatial anchor branch is used to maintain the physical spatial structure of the feature map and perform downsampling, and the frequency-aware branch is used to capture the local texture details of the feature map. The feature map of the frequency sensing branch is divided into blocks to obtain multiple local blocks, and each local block is independently processed by frequency domain transformation. Based on the semantic information and absolute position encoding of the spatial anchor branch, the channel and spectrum gating mask of each local block is obtained; The spectrum of each local block is weighted and filtered according to the gate mask, and the spectrum of the weighted and filtered local blocks is spatially restored. The high-frequency energy information of each local block is aggregated to obtain a spatial high-frequency saliency map that reflects the distribution of high-frequency texture in the feature map.

3. The target detection method in the operation scenario of a dedicated railway as described in claim 2, characterized in that, Based on the semantic information and absolute position encoding of the spatial anchor branch, the channel and spectrum gating mask for each local block is obtained, including: Adaptive average pooling is performed on the spatial features extracted from the spatial anchor point branches to compress the spatial resolution of the spatial features to a dimension consistent with the number of local blocks, and the spatial features are flattened into spatial semantic descriptors. The spatial semantic descriptor is fused with a pre-initialized absolute position embedding matrix to obtain position-aware contextual features; The location-aware contextual features are input into a multilayer perceptron and mapped to channel weights to obtain channel-gated branches that include texture information. The location-aware contextual features are input into a multilayer perceptron and mapped to spectral weights to obtain a spectral-gated branch for filtering general effective frequency components. By performing a joint mask based on the channel weights and the spectrum weights, the channel and spectrum gating masks for each local block are obtained.

4. The target detection method in the operation scenario of a dedicated railway as described in claim 2, characterized in that, The spectrum of each local block is weighted and filtered according to the gate mask, and the spectrum of the weighted and filtered local blocks is spatially restored, including: The spectrum of each local block is weighted and filtered according to the gated mask to obtain a modulated spectrum sequence, wherein the frequency components determined to be noise or background have their weights suppressed, and the high-frequency components determined to be target edges or textures have their weights enhanced. The modulated spectral sequence is subjected to a two-dimensional discrete cosine inverse transform, and then restored to a frequency domain restored feature with the same spatial feature size as the spatial feature extracted by the spatial anchor point branch through a folding operation. The spatial features extracted from the spatial anchor point branch are fused with the frequency domain restored features to obtain a fused feature map; The fused feature map is then subjected to dimensionality upscaling to complete channel recovery.

5. The target detection method in a dedicated railway operation scenario according to claim 1, characterized in that, A frequency-guided cascaded feature alignment module is constructed in the feature pyramid network to guide the geometric offset of the predicted feature map based on the spatial high-frequency saliency map, and to deform and align the semantic attention map based on the geometric offset, including: The spatial reference anchor points, features to be calibrated, and cross-domain frequency priors required for the current fusion level are obtained to construct a multi-scale combined feature including multi-dimensional spatial indication information. The spatial reference anchor points include the lateral connection features after the output features of the backbone network are reduced by lateral convolution. The features to be calibrated include the upsampled features obtained after the features of the previous level of the feature pyramid network are upsampled. The cross-domain frequency priors include the spatial high-frequency saliency map. The pixel-level dense offset field is predicted based on the multi-scale combined features, and the dense offset field is adaptively adjusted based on the cross-domain frequency prior. The feature to be calibrated is resampled and deformed according to the dense offset field to obtain a calibrated feature aligned with the spatial reference anchor point; An initial spatial semantic attention map is obtained based on the multi-scale combined features, and the initial spatial semantic attention map is resampled and deformed based on the dense offset field to obtain a final spatial semantic attention map corresponding to the geometric alignment features. The calibrated features are denoised and filtered based on the final spatial semantic attention map, and then fused with the lateral connection features to obtain the fused features output by the feature pyramid network.

6. The target detection method in a dedicated railway operation scenario according to claim 5, characterized in that, The method includes predicting a pixel-level dense offset field based on the multi-scale combined features, and adaptively adjusting the dense offset field based on the cross-domain frequency prior, including: An offset prediction network is constructed, and the multi-scale combined features are used as input to the offset prediction network to learn the displacement vector of the feature to be calibrated relative to the spatial reference anchor point, thereby obtaining a pixel-level dense offset field. The offset prediction network is adaptively adjusted based on the cross-domain frequency prior, wherein the offset predicted by the offset prediction network is positively correlated with the spatial response intensity of the cross-domain frequency prior.

7. The target detection method in a dedicated railway operation scenario according to claim 1, characterized in that inputting the training dataset into the optimized basic target detection model and performing multiple rounds of iterative training based on a combined loss function including detection loss and auxiliary constraint loss includes: constructing a saliency-guided and sparsity-constrained loss for the space-frequency coordination module; constructing an offset smoothing loss for the cascaded feature alignment module; combining the saliency-guided and sparsity-constrained loss with the offset smoothing loss to form the auxiliary constraint loss; weighting and fusing the auxiliary constraint loss with the detection loss of the basic target detection model to obtain the total loss function; and training the optimized basic target detection model through multiple rounds of iteration based on the total loss function to obtain the target detection model for the dedicated railway operation scenario.
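For illustration only, the loss combination of claim 7 can be sketched as follows. The claim does not give the form of the offset smoothing loss or the fusion weights, so the total-variation style penalty in `offset_smoothness`, the unweighted sum inside the auxiliary term, and the `aux_weight` value are all assumptions, and both function names are hypothetical.

```python
def offset_smoothness(offsets):
    """Mean absolute difference between neighbouring offset components
    (a total-variation style penalty; assumed form)."""
    h, w = len(offsets), len(offsets[0])
    total, count = 0.0, 0
    for i in range(h):
        for j in range(w):
            # Compare with the pixel below and the pixel to the right.
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < h and nj < w:
                    total += abs(offsets[ni][nj][0] - offsets[i][j][0])
                    total += abs(offsets[ni][nj][1] - offsets[i][j][1])
                    count += 2
    return total / count if count else 0.0


def total_loss(detection, saliency_sparsity, smoothness, aux_weight=0.1):
    """Weighted fusion of the detection loss and the auxiliary constraint loss."""
    auxiliary = saliency_sparsity + smoothness
    return detection + aux_weight * auxiliary


field = [[(0.0, 0.0), (1.0, 0.0)]]
smooth = offset_smoothness(field)
loss = total_loss(detection=2.0, saliency_sparsity=0.5, smoothness=smooth, aux_weight=0.5)
```

A spatially constant offset field yields zero smoothing loss under this form, so the penalty discourages only abrupt changes between neighbouring displacements.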

8. The target detection method in a dedicated railway operation scenario according to claim 7, characterized in that constructing a saliency-guided and sparsity-constrained loss for the space-frequency coordination module includes: constructing a multi-scale saliency supervision flow based on the spatial high-frequency saliency map; generating a resolution-adaptive ground-truth Gaussian heatmap; obtaining a dynamically weighted focal loss function based on the absolute error between the spatial high-frequency saliency map and the ground-truth Gaussian heatmap, the expression being: [formula not reproduced in the source text], in which the defined quantities are, in order, the spatial high-frequency saliency map, the ground-truth Gaussian heatmap (denoted with the subscript gt), the pixel index, the focusing factor, and the total number of pixels; and applying a regularization constraint to the gate mask to obtain a frequency-domain sparsity loss function, the expression being: [formula not reproduced in the source text], in which the defined quantities are, in order, the gate mask and the total number of elements in the gate mask.
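For illustration only, one plausible reading of the two losses in claim 8 is sketched below. The exact expressions do not survive in this text, so both forms are assumptions: a binary cross-entropy term weighted per pixel by the absolute error raised to the focusing factor, averaged over all pixels, and a mean absolute value (L1) penalty on the gate mask. The function names, `gamma`, and `eps` are hypothetical.

```python
import math


def saliency_focal_loss(saliency, heatmap, gamma=2.0, eps=1e-7):
    """Assumed form: mean over pixels of |s - g|**gamma * BCE(s, g).

    The absolute error acts as a dynamic weight, so pixels that are already
    well predicted contribute little to the loss.
    """
    total, count = 0.0, 0
    for s_row, g_row in zip(saliency, heatmap):
        for s, g in zip(s_row, g_row):
            p = min(max(s, eps), 1.0 - eps)  # keep log() finite
            bce = -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
            total += abs(s - g) ** gamma * bce
            count += 1
    return total / count


def gate_sparsity_loss(gate_mask):
    """Assumed form: mean absolute value of the gate mask (L1 regularization)."""
    values = [abs(v) for row in gate_mask for v in row]
    return sum(values) / len(values)


focal = saliency_focal_loss([[0.5]], [[1.0]])
sparsity = gate_sparsity_loss([[0.0, -0.5], [1.0, 0.5]])
```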

9. A target detection device for a dedicated railway operation scenario, used to implement the target detection method in a dedicated railway operation scenario as described in any one of claims 1 to 8, characterized by comprising: a dataset construction module, used to construct a training dataset that includes image data under multiple dedicated railway operating conditions, wherein the image data under each dedicated railway operating condition includes image degradation features and small targets, the image degradation features include at least motion blur, low-light noise, and defocus blur, and the small targets include targets in the image data whose coverage area is smaller than a preset small-target pixel threshold; a model building module, used to build a basic target detection model based on the detection accuracy requirements of the dedicated railway operation scenario, wherein the basic target detection model includes at least a backbone network for feature extraction, a feature pyramid network for multi-scale feature fusion, and a detection head network for target classification and localization; a model optimization module, used to optimize the basic target detection model, wherein the optimization includes at least constructing a location-aware space-frequency coordination module in the backbone network to generate a spatial high-frequency saliency map using a dual-stream architecture and local frequency domain changes, and constructing a frequency-guided cascaded feature alignment module in the feature pyramid network to guide the geometric offset of the predicted feature map based on the spatial high-frequency saliency map and to deform and align the semantic attention map based on the geometric offset; and a model training module, used to input the training dataset into the optimized basic target detection model and perform multiple rounds of iterative training based on a combined loss function including detection loss and auxiliary constraint loss, to obtain a target detection model for the dedicated railway operation scenario, wherein the target detection model performs target detection on input images acquired in real time in the dedicated railway operation scenario and outputs target detection results.

10. A storage medium, characterized in that it is used to store computer instructions, which are loaded and executed by a processor to implement the target detection method in a dedicated railway operation scenario as described in any one of claims 1 to 8.