A wide-band signal detection method based on dual-channel feature and frequency injection
By employing a dual-channel feature and frequency injection signal detection method, the problems of limited cross-frequency band feature extraction and domain drift are solved, enabling efficient and robust signal detection in complex electromagnetic environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2026-04-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing signal detection methods are limited in cross-frequency band feature extraction and have poor cross-domain generalization ability when dealing with complex electromagnetic spectrum environments. Furthermore, changes in hardware equipment lead to a degradation in model performance, making it difficult to adapt to rapidly changing electromagnetic environments.
A signal detection method based on dual-channel feature and frequency injection is adopted. By introducing asymmetric downsampling and elongated pooling kernels, combined with dual-channel input and Gaussian Fourier mapping, the weights of the convolution kernels are dynamically adjusted to adapt to heterogeneous signal characteristics and suppress the influence of environmental changes.
It significantly improves the robustness and cross-domain generalization ability of broadband signal detection, increases the accuracy of boundary regression, reduces the model training frequency, and maintains detection accuracy and efficiency.
Figure CN122247528A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of radio monitoring and communication signal processing technology, specifically relating to a broadband signal detection method based on dual-channel features and frequency injection.
Background Technology
[0002] With the rapid development of radio communication technology, the electromagnetic spectrum environment is becoming increasingly crowded and complex. Efficient and robust detection of wideband non-cooperative signals has become a core task in radio monitoring and spectrum management. In actual field monitoring and wideband reception scenarios, signals in different frequency bands exhibit significant differences in bandwidth, duration, and modulation texture. For example, FM broadcasting bands are mostly narrowband, frequently transmitted signals, while certain specific service bands are filled with burst or frequency-hopping signals. Simultaneously, real-world environments are often accompanied by non-stationary fluctuations in noise floor, and the distribution of data collected under different temporal, spatial, and hardware conditions varies significantly. This "domain drift" phenomenon leads to performance degradation of detection models. Traditional signal detection methods typically rely on manual experience to set fixed decision thresholds, resulting in poor generalization ability and high false alarm rates, making them unsuitable for rapidly changing and complex electromagnetic environments. Therefore, in-depth research into highly generalizable and high-precision intelligent signal detection technologies is of great significance for improving electromagnetic situational awareness capabilities in complex environments.
[0003] Existing deep learning-based signal detection methods primarily convert the signal into a two-dimensional time-frequency map, then directly transfer and apply mature object detection architectures from the computer vision field (such as the YOLOv8 series) for feature extraction and bounding box prediction, as seen in CN118823366A and CN120468794A. The underlying logic of these conventional approaches is as follows. In the data preprocessing stage, the network receives a single-channel time-frequency matrix input and performs a global min-max linear normalization, mapping the power to the range of 0 to 1. In the feature extraction stage, the backbone network uses globally shared convolutional kernels and a fixed spatial receptive field, defaults to a spatially symmetric downsampling strategy, and relies on standard square spatial pyramid pooling kernels to fuse multi-scale local features. In the bounding box regression stage, the algorithm typically employs a distribution focal loss (DFL) mechanism, using a preset conventional regression upper bound parameter to predict the probability distribution of signal boundary offsets. Although such schemes have some effect on simulation datasets, they suffer from a series of problems when faced with real-world wideband signals, such as severe degradation of cross-domain generalization performance and forced truncation of wideband target bounding boxes.
[0004] Field tests and engineering deployments revealed several significant drawbacks of existing technologies when processing complex electromagnetic signals. First, the wide bandwidth and duration of signals across different frequency bands vary considerably. Existing general-purpose detection networks typically employ globally shared convolutional kernels and fixed receptive fields, making it difficult for a single network model to dynamically adapt and effectively extract features from cross-frequency, multi-scale heterogeneous signals. While using multi-model stacking or integration to process different frequency bands separately can alleviate this problem to some extent, it significantly increases hardware storage costs and forward inference time.
[0005] Secondly, the existing models have poor cross-domain generalization ability, and their performance deteriorates significantly when processing data obtained from different spatiotemporal environments and hardware acquisition devices.
Summary of the Invention
[0006] To overcome the problems existing in the prior art, the present invention aims to propose a wideband signal detection method based on dual-channel feature extraction and frequency injection. Addressing the limitations of cross-band feature extraction and domain drift caused by changes in hardware and software environments in target detection networks within complex electromagnetic spectrum environments, a signal detection method based on dual-channel feature extraction and frequency injection (YOLOv8s-FDC) is designed. First, the baseline network structure is optimized by introducing asymmetric downsampling and strip pooling kernels to match wideband signal features. Second, at the network input, a signal probability matrix is generated using the fitted environmental noise floor and concatenated with the original time-frequency plot to form a dual-channel input, thus suppressing the problem of poor model generalization ability caused by environmental changes. Finally, in the feature extraction layer, the frequency coordinates are converted into high-dimensional features using Gaussian Fourier mapping and injected into the backbone network, giving the model frequency band perception capability. The convolution channel weights are dynamically adjusted to adapt to heterogeneous signal features. This invention, through dual-channel input and frequency injection mechanism, gives the model frequency band perception capability and suppresses domain drift. Combined with asymmetric network reconstruction, it significantly improves the robustness of wideband signal detection, cross-domain generalization capability, and boundary regression accuracy of large-span targets.
[0007] To achieve the above objectives, the present invention adopts the following technical solution: A broadband signal detection method based on dual-channel features and frequency injection, the method comprising the following steps:
S1: Acquire the time-frequency slice of the broadband radio frequency signal to be detected and its corresponding center frequency information.
S2: Use the time-frequency slice as the first channel and the signal probability matrix generated from the noise floor fitting result as the second channel, and concatenate them along the channel dimension.
S3: Map the center frequency information into a high-dimensional feature vector, inject the high-dimensional feature vector into the feature extraction layer of the convolutional neural network, dynamically adjust the convolutional kernel weights, and output a multi-scale feature map, wherein the feature extraction layer includes a cascaded frequency-aware feature extraction module and a spatial pyramid pooling module.
S4: Based on the multi-scale feature map output by the spatial pyramid pooling module, output the target confidence and bounding box regression parameters through the decoupled detection head to complete the wideband signal detection.
The feature extraction layer of the convolutional neural network employs an asymmetric downsampling strategy, and the upper limit of the bounding box regression parameters is extended to accommodate the wideband signal span.
[0008] The use of the signal probability matrix generated based on the noise floor fitting result as the second channel, as described in S2, specifically includes: The original time-frequency slices are transformed from the logarithmic domain to the linear domain, and average pooling or mean calculation is performed along the time axis to generate a one-dimensional power spectrum vector characterizing the background energy distribution. The one-dimensional power spectral vector is transformed back to the logarithmic domain and standardized, then input into a pre-trained noise floor fitting model to output a one-dimensional noise floor estimate. The one-dimensional estimate is extended into a two-dimensional matrix along the time axis. The difference tensor is obtained by subtracting the noise floor estimate from the original two-dimensional time-frequency slice. The difference tensor is then mapped into a signal probability matrix with values in the range [0,1] through a Sigmoid activation function layer.
[0009] The pre-trained noise floor fitting model is a neural network model built on the stacked denoising autoencoder (SAD-DAE) architecture. The model is configured to learn the spectral distribution characteristics of the background noise by minimizing the reconstruction error.
[0010] In S3, the center frequency information is mapped into a high-dimensional feature vector, specifically including: Obtain the normalized center frequency scalar of the current time-frequency slice; Construct a Gaussian Fourier mapping matrix, and use the mapping matrix to perform matrix multiplication and sine / cosine transformation on the normalized center frequency scalar in sequence, so as to map the one-dimensional frequency scalar into a fixed-dimensional periodic embedding vector, that is, the high-dimensional feature vector.
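The mapping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `gaussian_fourier_embedding`, the standard deviation `sigma`, the factor 2π, and the choice of 32 projection frequencies (giving the 64-dimensional vector mentioned elsewhere in the description) are assumptions made for the example.

```python
import numpy as np

def gaussian_fourier_embedding(f_norm, B):
    """Map a normalized center frequency scalar to a periodic embedding.

    f_norm : float, normalized center frequency (assumed to lie in [0, 1]).
    B      : array of shape (m,), fixed projection frequencies drawn once
             from a Gaussian distribution (the "mapping matrix").
    Returns an array of shape (2*m,): [sin(2*pi*B*f), cos(2*pi*B*f)].
    """
    proj = 2.0 * np.pi * B * f_norm          # matrix multiplication (here a 1-D case)
    return np.concatenate([np.sin(proj), np.cos(proj)])

# Assumed hyperparameters: sigma controls how fast the embedding varies with frequency.
rng = np.random.default_rng(0)
sigma = 10.0
B = rng.normal(0.0, sigma, size=32)          # 32 frequencies -> 64-dim embedding

emb_a = gaussian_fourier_embedding(0.25, B)
emb_b = gaussian_fourier_embedding(0.26, B)
```

Because `B` is sampled once and then kept fixed, two slices with nearby center frequencies receive distinct but smoothly related embeddings.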
[0011] In S3, the high-dimensional feature vector is injected into the feature extraction layer of the convolutional neural network, and the convolutional kernel weights are dynamically adjusted, specifically including: The high-dimensional feature vector is input into a multilayer perceptron and decoded to generate scaling and translation coefficients that match the number of channels in the current feature map. The scaling and translation coefficients are used to perform a channel-by-channel affine transformation on the original convolutional feature map in the feature extraction layer, so that the network can adaptively adjust the feature response weights according to the frequency band coordinates. The original convolutional feature map refers to the intermediate feature map directly output by the convolutional layer in the feature extraction layer after performing convolution operations on the input data; The step of performing a channel-wise affine transformation on the original convolutional feature map in the feature extraction layer using the scaling and translation coefficients includes: applying the scaling and translation coefficients to the intermediate feature map to generate a frequency-modulated feature map.
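A minimal sketch of this channel-wise affine modulation is given below. The two-layer perceptron, its ReLU activation, the hidden width, and the names `mlp_scale_shift` and `modulate` are assumptions for illustration; the source only states that an MLP produces scaling and translation coefficients matching the channel count.

```python
import numpy as np

def mlp_scale_shift(emb, W1, b1, W2, b2):
    """Decode a frequency embedding into per-channel scale s and shift b.

    The final layer outputs 2*C values that are split into s and b,
    each of shape (C,). A single hidden ReLU layer is assumed.
    """
    h = np.maximum(W1 @ emb + b1, 0.0)
    out = W2 @ h + b2
    s, b = np.split(out, 2)
    return s, b

def modulate(F, s, b):
    """Channel-wise affine transform of a feature map F of shape (C, H, W)."""
    return s[:, None, None] * F + b[:, None, None]

# Toy dimensions: 64-dim embedding, 16 hidden units, C = 8 channels.
rng = np.random.default_rng(1)
emb = rng.normal(size=64)
W1, b1 = rng.normal(size=(16, 64)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)) * 0.1, np.zeros(16)

F = rng.normal(size=(8, 4, 32))              # intermediate convolutional feature map
s, b = mlp_scale_shift(emb, W1, b1, W2, b2)
F_mod = modulate(F, s, b)                    # frequency-modulated feature map
```

Each channel is rescaled and shifted by a single pair of numbers, so the cost of the modulation is negligible compared with the convolution that produced `F`.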
[0012] The convolutional neural network adopts the YOLO network architecture, preferably YOLOv8s. Its feature extraction layer includes a frequency-aware feature extraction module, which is composed of multiple convolutional layers and frequency modulation layers. The frequency modulation layer is used to perform the affine transformation. The features obtained by the frequency-aware feature extraction module are fed into the pyramid pooling module to obtain a multi-scale feature map.
[0013] The asymmetric downsampling strategy is as follows: In the deep feature extraction of the convolutional neural network, the stride of the convolutional layer is set to an asymmetric form, that is, the stride in the time axis direction is set to t, and the stride in the frequency axis direction is set to f, t < f, so as to preserve the temporal resolution of burst signals; then, the maximum pooling kernel size in the spatial pyramid pooling module is adjusted from k×k to 1×k or 1×n (n>k), so that the pooling operation only performs feature aggregation in the frequency axis direction, thereby expanding the receptive field in the frequency axis direction while preserving the temporal resolution.
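The effect of the two changes can be illustrated with a short sketch. It assumes an array layout of (time, frequency) so that a 1×k kernel acts along the last axis, a 3×3 convolution with padding 1 for the downsampling layer, and an odd kernel size k with stride 1 and "same" padding for the pooling; none of these details are fixed by the source, and the helper names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def asymmetric_downsample_shape(t, f, stride_t=1, stride_f=2):
    """Output size of a 3x3 convolution (padding 1) with per-axis strides."""
    def out(n, s):
        return (n + 2 * 1 - 3) // s + 1
    return out(t, stride_t), out(f, stride_f)

def strip_max_pool(x, k):
    """Max pooling with a 1 x k kernel (odd k), stride 1, 'same' padding.

    x has shape (time, frequency); aggregation happens only along the
    frequency axis, so every time row is processed independently.
    """
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="constant", constant_values=-np.inf)
    return sliding_window_view(xp, k, axis=1).max(axis=-1)

# Stride (1, 2): the time axis keeps its length, the frequency axis is halved.
shape_out = asymmetric_downsample_shape(128, 2048)

# A single active cell spreads along frequency but not along time.
x = np.zeros((4, 9))
x[2, 4] = 1.0
y = strip_max_pool(x, 5)
```

In the toy example the single non-zero cell influences five neighbouring frequency bins in its own row and nothing in the adjacent time rows, which is the behaviour the strip kernel is meant to provide.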
[0014] The upper limit of the bounding box regression parameter is extended, and its value is determined according to the maximum physical width of the input time-frequency slice, preferably 256, so as to match and cover the maximum physical boundary of the broadband signal spanning the entire frequency band and eliminate the risk of the bounding box being forcibly truncated.
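Why the upper limit matters can be seen from a small worked example. The sketch assumes the YOLOv8-style decoding in which each box side is the expected value of a softmax distribution over `reg_max` integer bins (0 to `reg_max` − 1), multiplied by the stride of the feature map; the helper names are hypothetical.

```python
import numpy as np

def dfl_expected_offset(logits):
    """Expected bin index of a softmax distribution over reg_max bins."""
    z = logits - logits.max()                # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(p @ np.arange(logits.size))

def max_box_span(reg_max, stride):
    """Largest box extent (in input pixels) along one axis.

    Each of the two opposite sides can reach at most (reg_max - 1) * stride
    from the anchor point.
    """
    return 2 * (reg_max - 1) * stride

# With the default reg_max = 16, even the coarsest level (stride 32) is capped.
span_default = max_box_span(16, 32)
# With reg_max = 256, the finest fused level (stride 8) already exceeds 2048.
span_extended = max_box_span(256, 8)

# A distribution concentrated on the last bin decodes to roughly reg_max - 1.
logits = np.full(16, -50.0)
logits[-1] = 50.0
offset = dfl_expected_offset(logits)
```

Under these assumptions a target spanning a full 2048-bin axis cannot be represented with `reg_max = 16` (at most 960 pixels at stride 32), whereas `reg_max = 256` covers it at every stride of 8 or more.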
[0015] The processing method of the first channel is as follows: only local Min-Max linear normalization is performed on the original time-frequency slice to map the power into the non-negative interval in order to preserve the modulation texture features inside the signal; The local Min-Max linear normalization refers to: based on the sliding window mechanism, for each pixel in the time-frequency slice, selecting a neighborhood window of a preset size centered on the pixel, calculating the maximum and minimum power values within the neighborhood window, and using the maximum and minimum power values to perform normalization mapping on the pixel.
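A minimal sketch of sliding-window Min-Max normalization follows. The window size, the edge-replicating padding, and the small constant `eps` that guards against division by zero in flat regions are assumptions; the source only specifies a neighborhood window of preset size centered on each pixel.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_minmax_normalize(P, win=15, eps=1e-6):
    """Normalize each pixel by the min and max of its (win x win) neighborhood.

    P   : 2-D array of power values (for example in dB).
    win : odd window size (assumed value).
    eps : assumed stabilizer for windows where max == min.
    """
    pad = win // 2
    Pp = np.pad(P, pad, mode="edge")                     # replicate borders
    windows = sliding_window_view(Pp, (win, win))        # shape (H, W, win, win)
    p_min = windows.min(axis=(-2, -1))
    p_max = windows.max(axis=(-2, -1))
    return (P - p_min) / (p_max - p_min + eps)

rng = np.random.default_rng(2)
P = rng.normal(-100.0, 3.0, size=(32, 64))               # noise-like slice in dB
P[10:14, 20:30] += 25.0                                  # a stronger emission
X1 = local_minmax_normalize(P, win=7)
```

Because the minimum and maximum are taken locally, a weak emission far from a strong one is still stretched over the full output range instead of being compressed toward zero as it would be under global normalization.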
[0016] The noise floor fitting results are generated by a pre-trained deep learning noise floor fitting model or a mean filtering algorithm and a nonlinear recursive smoothing (NLR) filtering algorithm.
[0017] Compared with the prior art, the present invention has the following advantages:
1. Introduce active physical priors to replace passive global normalization.
This invention constructs a dual-channel tensor from the original time-frequency plot and the signal probability matrix through a dual-channel input mechanism. The first channel uses local sliding-window normalization to preserve relative power fluctuations. The second channel injects a physical prior: a SAD-DAE model fits the noise floor, and the probability matrix is generated through differencing and Sigmoid mapping.
[0018] Unlike traditional methods that rely solely on internal network parameters to learn to distinguish between signals and noise, this invention introduces a physical prior (noise floor model) at the input stage. The probability matrix acts as a guiding signal, forcing the network to focus on high-probability regions and directly eliminating background clutter. Furthermore, this invention is no longer heavily reliant on the distribution of training data, maintaining robustness even when hardware changes or sudden shifts in the spatiotemporal environment alter the noise floor distribution, significantly reducing the frequency of retraining.
[0019] 2. Break the translation invariance limitation of convolutional neural networks.
[0020] This invention employs a frequency injection mechanism (FreqIn) to explicitly encode a one-dimensional center frequency scalar into a 64-dimensional periodic feature vector using a Gaussian Fourier mapping (GFF). A dynamic modulation frequency-aware feature extraction module (C2f_Freq) generates scaling and shifting parameters via an MLP to perform an affine transformation on the feature map.
[0021] On the one hand, this invention creatively introduces absolute frequency coordinates as conditional variables into the network, breaking the inherent limitations of CNNs. This enables the network to possess explicit frequency band-aware priors, allowing it to accurately locate the absolute frequency position of the object being processed. On the other hand, compared to existing technologies that train multiple dedicated models, this invention does not require increasing the number of model parameters; a single network can dynamically adjust the convolutional kernel weights based on the frequency prior. Without increasing computational redundancy, it achieves an effect similar to multi-expert models, solving the problem of limited cross-frequency band feature extraction.
[0022] 3. This invention addresses the underlying architecture reconstruction of broadband signals, resolving the contradiction between large target truncation and small target loss.
[0023] First, this invention employs asymmetric downsampling and strip pooling, allowing for a strategy of setting a time axis step size of 1 and a frequency axis step size of 2 in deep networks; it reconstructs the square pooling kernel of the SPPF module into an asymmetric strip shape. Second, this invention recalculates regmax based on the input width (expanding it to 256), overcoming the prediction limitations of Anchor-Free heads. This invention features asymmetric receptive field matching: this is a precise adaptation to the physical form of the time-frequency graph. The strip pooling kernel expands the receptive field in the frequency axis direction, specifically capturing the long-span characteristics of broadband signals, while preserving the resolution of the time axis to detect transient signals. Through mathematical parameter reconstruction, this invention completely eliminates the engineering risks of large-scale signal detection, solving the technical problems of broadband signals being segmented into multiple segments or the inability to close bounding boxes in existing technologies, and possessing full-band coverage capability.
[0024] 4. This invention is based on YOLOv8s (lightweight) and makes targeted improvements without introducing significant parameter inflation.
[0025] Experimental data show that this invention (YOLOv8s-FDC) improves detection accuracy (mAP@0.5) by more than 7% with only an increase of about 2ms inference time. Compared to simply increasing the network size (e.g., from v8n to v8m), this invention achieves higher feature extraction efficiency by introducing prior knowledge and structural fine-tuning. This demonstrates that in specific domains (electromagnetic spectrum sensing), network structure optimization based on physical distribution characteristics has greater engineering application value than simply increasing model depth and the number of channels.
[0026] In summary, this invention solves the data distribution problem through dual-channel input, the prior problem of feature extraction through frequency injection, and the physical shape matching problem through asymmetric reconstruction. The combination of these three elements constitutes a highly innovative and complete technical solution that significantly surpasses the application effect of existing general object detection technologies in this field.
Attached Figure Description
[0027] Figure 1 Network architecture diagram.
[0028] Figure 2 C2f_Freq architecture diagram.
[0029] Figure 3 Time-frequency diagram for complex frequency bands.
[0030] Figure 4 Time-frequency diagram of sparse frequency bands of signal.
[0031] Figure 5 ROC curve diagram.
[0032] Figure 6 Model detection results.
[0033] Figure 7 Comparison of detection results in the 88MHz~102MHz frequency band, where (a) is the signal truth label, (b) is the detection result of the YOLOv8n model, and (c) is the detection result of the YOLOv8s-FDC model.
[0034] Figure 8 Comparison of detection results in the 305MHz~320MHz frequency band, where (a) is the signal truth label, (b) is the detection result of the YOLOv8s model, and (c) is the detection result of the YOLOv8s-FDC model.
[0035] Figure 9 Comparison of detection results in the 443MHz~455MHz frequency band. (a) Signal truth label, (b) YOLOv8s model detection result, (c) YOLOv8s-FDC model detection result.
[0036] Figure 10 Comparison of detection results in the 576MHz~590MHz frequency band. (a) Signal truth label, (b) YOLOv8s model detection results, (c) YOLOv8s-FDC model detection results.
[0037] Figure 11 Comparison of detection results in the 872MHz~886MHz frequency band. (a) Signal truth label, (b) YOLOv8s model detection result, (c) YOLOv8s-FDC model detection result.
[0038] Figure 12 Comparison of detection results in the 958MHz~974MHz frequency band. (a) Signal truth label, (b) YOLOv8s model detection result, (c) YOLOv8s-FDC model detection result.
[0039] Figure 13 Comparison of test results.
[0040] Figure 14 Comparison of test results.
[0041] Figure 15 Comparison of test results.
[0042] Figure 16 Comparison of model prediction results (Example 1).
[0043] Figure 17 Comparison of model prediction results (Example 2).
Detailed Implementation
[0044] The present invention will now be described in further detail with reference to specific embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of protection of the present invention.
[0045] Example 1: Signal Detection Method Based on Dual-Channel and Frequency Injection
A broadband signal detection method based on dual-channel feature extraction and frequency injection (see Figure 1) includes the following steps:
Step S101: Obtain the time-frequency slice of the broadband radio frequency signal to be detected and its corresponding center frequency information.
[0046] The raw radio frequency signal is acquired by a broadband receiver, and time-frequency analysis is performed on the signal using short-time Fourier transform (STFT) to generate a two-dimensional time-frequency slice. The horizontal axis represents time, the vertical axis represents frequency, and the grayscale value of a pixel represents the signal power intensity.
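A time-frequency slice of this kind can be produced with a few lines of NumPy. The sketch below assumes complex baseband (IQ) samples, a Hann window, and illustrative values for the FFT size and hop length; the array is laid out as (time, frequency), which is a storage choice for the example and not something the source prescribes.

```python
import numpy as np

def stft_power_db(x, n_fft=256, hop=128):
    """Short-time Fourier transform power in dB, shape (frames, n_fft).

    x is a 1-D array of complex baseband samples. The frequency axis is
    shifted so that bin n_fft // 2 corresponds to the center frequency.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    spec = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    power = np.abs(spec) ** 2
    return 10.0 * np.log10(power + 1e-12)    # small floor avoids log10(0)

# Test tone at fs / 8 above the center frequency.
n = np.arange(2048)
tone = np.exp(2j * np.pi * n / 8.0)
S_db = stft_power_db(tone)
peak_bins = S_db.argmax(axis=1)
```

With a 256-point FFT, a tone at one eighth of the sampling rate falls 32 bins above the center bin 128, so every frame peaks at bin 160.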
[0047] Simultaneously, obtain the center frequency information corresponding to this time-frequency slice. The center frequency information serves as an auxiliary feature to guide subsequent feature extraction processes.
[0048] Step S102: Construct a dual-channel input tensor.
[0049] This embodiment constructs a dual-channel input structure to enhance the network's ability to perceive signal characteristics.
[0050] First channel (original time-frequency plot channel): This channel applies local Min-Max linear normalization to the original time-frequency slice.
[0051] Specifically, a sliding window mechanism is used: for each pixel in the time-frequency slice, a neighborhood window of a preset size centered on the pixel is selected.
[0052] The maximum power value $P_{\max}(i,j)$ and the minimum power value $P_{\min}(i,j)$ within this neighborhood window are calculated, and the following formula is used to linearly map the power value to a non-negative interval: $\hat{P}(i,j) = \dfrac{P(i,j) - P_{\min}(i,j)}{P_{\max}(i,j) - P_{\min}(i,j)}$, where $P(i,j)$ is the power value of the pixel at position $(i,j)$ and $\hat{P}(i,j)$ is its normalized value.
[0053] This operation can adaptively enhance local contrast, preserve subtle modulation texture features within the signal, and avoid the loss of weak signal features due to excessive global energy differences.
[0054] Second channel (signal probability matrix channel): This channel generates the signal probability matrix based on the noise floor fitting result.
[0055] Specifically, the detailed process of generating the noise floor fitting result and constructing the probability matrix is as follows. First, data dimensionality reduction and feature extraction: the original two-dimensional time-frequency slice is averaged along the time axis (by mean calculation or an average pooling operation) to compress the time dimension and generate a one-dimensional power spectrum vector that represents the global background energy distribution within the current observation slice.
[0056] Secondly, standardization preprocessing: The one-dimensional power spectrum vector is standardized (Z-score standardization) to convert it into a distribution with zero mean and unit variance, so as to eliminate the interference caused by the drastic fluctuation of absolute power amplitude of different hardware devices or different frequency bands to the model input layer.
[0057] Subsequently, noise floor fitting: The standardized one-dimensional power spectrum vector is input into a pre-trained deep learning noise floor fitting model (SAD-DAE network model). The model compresses and extracts the deep spectral distribution features of the background noise through its internal encoder network, and then reconstructs them through the decoder network to output a one-dimensional noise floor estimate vector.
[0058] Finally, dimension alignment and probability mapping: the output one-dimensional noise floor estimate vector is copied and expanded along the time axis to construct a two-dimensional noise floor matrix whose spatial dimensions exactly match those of the original two-dimensional time-frequency slice. The two-dimensional noise floor matrix is subtracted from the original two-dimensional time-frequency slice to obtain a differential tensor representing the local signal-to-noise ratio (SNR). This differential energy is then mapped to a signal probability matrix with values in the range [0, 1] using a Sigmoid activation function layer. In this matrix, noise regions with SNR below a threshold are nonlinearly compressed to approach 0, while true signal regions above the threshold are activated to approach 1, thus achieving efficient background clutter removal at the input, before the first convolutional layer.
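The four steps above can be sketched end to end as follows. The pre-trained SAD-DAE model is not available here, so a flat median estimator (`median_floor`) stands in for it as a clearly hypothetical placeholder. Undoing the standardization on the model output and the optional threshold offset `thr_db` are also assumptions, since the source gives no numerical threshold.

```python
import numpy as np

def median_floor(z):
    """Placeholder for the pre-trained noise floor model (NOT the SAD-DAE):
    returns a flat floor at the median of the standardized spectrum."""
    return np.full_like(z, np.median(z))

def signal_probability(S_db, floor_model=median_floor, thr_db=0.0):
    """Build the second-channel probability matrix from a dB slice (time, freq)."""
    # 1) Average over time in the linear domain, then return to dB.
    lin = 10.0 ** (S_db / 10.0)
    psd_db = 10.0 * np.log10(lin.mean(axis=0))
    # 2) Z-score standardization of the 1-D power spectrum.
    mu, sd = psd_db.mean(), psd_db.std() + 1e-12
    z = (psd_db - mu) / sd
    # 3) Noise floor fitting, mapped back to dB (assumed de-standardization).
    floor_db = floor_model(z) * sd + mu
    # 4) Expand along time, subtract, and squash with a sigmoid.
    floor_2d = np.tile(floor_db, (S_db.shape[0], 1))
    diff = S_db - floor_2d - thr_db
    return 1.0 / (1.0 + np.exp(-diff))

# Toy slice: flat -100 dB background with a 30 dB emission in three bins.
S_db = np.full((8, 64), -100.0)
S_db[:, 10:13] = -70.0
prob = signal_probability(S_db)
prob_thr = signal_probability(S_db, thr_db=6.0)
dual = np.stack([S_db, prob])                # channel-dimension concatenation
```

With a plain sigmoid, pixels lying exactly on the estimated floor map to 0.5; shifting the difference by a few dB, as `thr_db` does in this sketch, pushes the background toward 0 while leaving strong emissions close to 1.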
[0059] Step S103: Perform frequency injection feature extraction.
[0060] The dual-channel input tensor is fed into the feature extraction layer of the convolutional neural network.
[0061] The feature extraction layer includes a backbone network and a spatial pyramid pooling module. The backbone network consists of multiple cascaded feature extraction units, at least some of which are C2f_Freq modules with a frequency modulation mechanism.
[0062] In this embodiment, a frequency injection mechanism is introduced. Specifically, the center frequency information obtained in step S101 is mapped to a high-dimensional feature vector, which is then broadcast over the feature map in the backbone network and applied element-wise to dynamically adjust the convolution kernel weights. First, the center frequency information obtained in step S101 is transformed into a fixed-dimensional periodic embedding vector using Gaussian Fourier mapping (GFF). Then, this embedding vector is decoded using a multilayer perceptron (MLP) to generate scaling coefficients *s* and translation coefficients *b* that match the number of channels in the current feature map. These two coefficients are then used to perform a channel-wise affine transformation on the original convolutional feature map *F* in the backbone network: $F' = s \odot F + b$, where $\odot$ indicates element-wise multiplication and $F'$ is the frequency-modulated feature map. This operation allows the network to adaptively adjust its feature extraction strategy based on changes in center frequency.
[0063] Step S104: Signal detection based on asymmetric downsampling and improved SPP.
[0064] An asymmetric downsampling strategy is employed during feature extraction.
[0065] Unlike traditional symmetric downsampling, this embodiment sets an asymmetric stride in the deep layers of the network. Specifically, the stride in the time axis direction is set to 1, which stops downsampling along time, maintains high temporal resolution, and prevents the temporal characteristics of short burst signals from collapsing, while the stride in the frequency axis direction is set to 2 for regular downsampling.
[0066] Meanwhile, for the Spatial Pyramid Pooling (SPP) module, this embodiment reconstructs its internal pooling kernel into an asymmetric elongated shape (such as...). This allows the pooling operation to aggregate only in the frequency axis direction, thereby expanding the receptive field in the frequency axis direction to cover the long-span characteristics of broadband signals.
[0067] Finally, the target confidence score representing the presence of the signal and the bounding box regression parameters are output by the decoupled detection head. To accommodate the large span of broadband signals and eliminate the risk of boundary truncation, this embodiment sets the upper limit regmax of the bounding box regression parameter in the distribution focal loss (DFL) function to 256, enabling the model to predict frequency spans that cover the entire map.
[0068] Example 2: Signal Detection System
This embodiment provides a signal detection system corresponding to the method described above. The system includes:
Data acquisition module: Used to execute step S101 to acquire broadband radio frequency signals and time-frequency slices.
[0069] Data preprocessing module: Used to perform step S102 and construct a dual-channel input tensor.
[0070] Feature extraction and detection module: Used to execute steps S103 and S104, containing a processor that stores the parameters of the C2f_Freq module, and outputting the detection results.
[0071] Experimental Results and Analysis
This section first introduces the hyperparameter configuration of the training model, and then analyzes the performance of the model.
[0072] Experimental Environment Setup and Evaluation Indicators
Experimental environment and parameter configuration are fundamental to model training and performance evaluation. This section provides a detailed explanation of the underlying hardware and software environment of the experimental platform, the specific hyperparameters for model training, and the evaluation metrics used to quantify model detection performance.
[0073] Network Architecture Description
This paper proposes a signal detection network based on dual-channel features and frequency injection. The overall network architecture is shown in Figure 1. This network is built upon and optimized from the YOLOv8s object detection framework. Based on the characteristics of broadband spectrum data, this network employs a dual-channel input mechanism and a frequency injection mechanism. Data processing is mainly divided into five stages: network input, frequency prior injection, feature extraction, multi-scale feature fusion, and decoupled detection output.
[0074] 1) Network Input. In the data preprocessing stage, this paper only sends data slices of the specified test frequency band into the network. First, a two-channel tensor with dimensions B×2×128×2048 is input. Channel 1 is the original broadband time-frequency plot slice. Channel 2 is the signal probability matrix, and the calculation method of the probability matrix is detailed in Section 4.3.3. Next, the center frequency scalar corresponding to the slice is input to represent the frequency band of the current slice.
[0075] 2) Frequency Injection. To address the issue of gradient vanishing or representation degradation during the forward propagation of single frequency values in deep networks, this paper constructs a global condition generator. This module introduces a Gaussian Fourier mapping (GFF) mechanism to explicitly encode scalar frequencies as 64-dimensional periodic feature vectors. This vector effectively amplifies the numerical differences between adjacent frequency bands and serves as a global frequency band prior condition, which is distributed in parallel to the feature extraction nodes at each level of the backbone and neck network.
[0076] 3) Feature Extraction. Signals in different frequency bands exhibit significant differences in bandwidth, duration, and modulation texture. To address these frequency domain feature differences, this paper modifies the native YOLOv8 architecture, replacing all C2f modules in its feature extraction and fusion paths with a custom frequency-aware module, C2f_Freq. In C2f_Freq, the 64-dimensional frequency embedding vector input from the front end is first decoded by a multilayer perceptron, mapped to a scaling factor s and a bias term b that match the number of feature channels in the current image. Subsequently, the module performs an affine transformation on the input feature x. This mechanism endows the network with cross-frequency band adaptive feature extraction capabilities, enabling the activation response range to dynamically adjust with frequency.
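The modulation step can be sketched as follows. This is a simplified illustration in plain Python, not the C2f_Freq implementation: the two-layer MLP with random weights, the hidden width, and the affine form x' = s * x + b applied per channel are assumptions, since the source states only that an MLP produces s and b and that an affine transformation is applied.

```python
import random


def linear(vec, weights, bias):
    """Dense layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def make_mlp(in_dim, hidden_dim, channels, seed=0):
    """Random two-layer MLP mapping an embedding to 2*channels outputs (hypothetical)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(2 * channels, hidden_dim), "b2": [0.0] * (2 * channels),
    }


def decode_scale_bias(embedding, mlp, channels):
    """Decode the frequency embedding into per-channel scale s and bias b."""
    hidden = [max(0.0, v) for v in linear(embedding, mlp["w1"], mlp["b1"])]  # ReLU
    out = linear(hidden, mlp["w2"], mlp["b2"])
    return out[:channels], out[channels:]


def modulate(features, scale, bias):
    """Channel-wise affine transform on features[channel][row][col]."""
    return [
        [[scale[c] * value + bias[c] for value in row] for row in fmap]
        for c, fmap in enumerate(features)
    ]


embedding = [0.1] * 64          # stand-in for the 64-dimensional frequency embedding
mlp = make_mlp(in_dim=64, hidden_dim=16, channels=2)
s, b = decode_scale_bias(embedding, mlp, channels=2)
features = [[[1.0, 2.0]], [[3.0, 4.0]]]   # 2 channels, 1x2 feature map each
print(modulate(features, s, b))
```

The same embedding yields one (s, b) pair per channel, so every spatial position in a channel is rescaled identically; only the frequency band changes the response range.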
[0077] 4) Feature Fusion. In the multi-scale feature fusion stage, the network adopts the PAN-FPN structure to achieve top-down and bottom-up information interaction. Broadband signals typically have a large span in the time-frequency map, and shallow feature maps (with downsampling rates of 2 or 4) have small receptive fields, making it difficult to capture the complete signal contour. Therefore, the feature fusion stage directly discards low-level shallow features and only performs cross-scale splicing on deep semantic features with downsampling rates of 8, 16, and 32 times. This reduces computational overhead while maintaining target detection accuracy.
[0078] 5) Adaptability Optimization. The fused multi-scale features are directly input into the decoupled detection head for coordinate decoding, completing the regression of the signal bounding box. Broadband signals have a large span on the frequency axis, and the native YOLOv8 regression receptive field designed for visual targets easily leads to truncation of the bounding box. Therefore, this paper modifies the regression level RegMax of the detection head, expanding it from the default 16 to 256, solving the bounding box truncation problem in the detection of non-cooperative signals in the broadband band and ensuring the integrity of the detection results.
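The truncation effect can be checked with a short calculation. The sketch below assumes the usual distribution-focal-loss decoding, in which each box side is the expected bin index over RegMax bins, expressed in units of the feature-map stride; the function names are hypothetical.

```python
import math


def dfl_expected_distance(logits):
    """Decode one box side: softmax over the bins, then the expected bin index."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return sum(index * e / total for index, e in enumerate(exps))


def max_box_extent(reg_max, stride):
    """Largest width (or height), in input pixels, reachable from one anchor point.

    Each side is at most (reg_max - 1) bins, and one bin equals one stride.
    """
    return 2 * (reg_max - 1) * stride


INPUT_WIDTH = 2048  # input tensor width given in the source
for reg_max in (16, 256):
    extents = [max_box_extent(reg_max, stride) for stride in (8, 16, 32)]
    print(reg_max, extents, [extent >= INPUT_WIDTH for extent in extents])
# reg_max=16  -> [240, 480, 960]: no level can span the 2048-wide input
# reg_max=256 -> [4080, 8160, 16320]: every level can
```

Under this assumption, the default setting cannot regress a box wider than 960 pixels even at stride 32, which is consistent with the truncation described for signals that span most of the time-frequency plot.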
[0079] Figure 2 shows the C2f_Freq module. The C2f module is the core feature extraction unit in the baseline YOLOv8s network; this study reconstructs it into the frequency-aware C2f_Freq module to achieve dynamic feature injection. When the image feature tensor propagates forward through the backbone network and enters the C2f_Freq module, the network uses the scaling and translation coefficients generated in the previous step to perform a channel-wise affine transformation on the feature map.
[0080] Hardware and software platform configuration. All experiments in this chapter are conducted on a unified hardware and software platform. The experimental environment is configured as follows: (1) Operating system: Windows 10; (2) Hardware: NVIDIA GeForce RTX 3090 (24GB); (3) Basic software environment: Conda 23.7.4 for environment management and Python 3.10.19 as the programming language; (4) Deep learning framework: PyTorch 2.9.1 with CUDA 12.6 for GPU computing; (5) Core dependency libraries: Ultralytics 8.4.7, NumPy 2.2.5, Pandas 2.3.3, SciPy 1.15.3, and hdf5storage 0.2.2.
[0081] Model training hyperparameter configuration. The hyperparameters and network configuration for model training in this chapter are as follows: (1) Input tensor size: 128×2048; (2) Initial learning rate: 1× ; (3) Learning rate schedule: the learning rate is decayed dynamically using cosine annealing; (4) Batch size: 8; (5) Maximum number of epochs: 256; (6) Early stopping: if the validation loss does not decrease for 20 consecutive epochs, training is terminated early to prevent overfitting; (7) Loss function weights: the weight of the bounding box regression loss (box) is set to 15.0, and the weight of the distribution focal loss (dfl) is set to 8.0.
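The schedule and stopping rule can be sketched as below. The 256-epoch limit and the patience of 20 come from the configuration above; the initial learning rate is not legible in the source, so `lr0` is left as a parameter and 1.0 is used only as a stand-in. The final learning-rate fraction and all names are assumptions.

```python
import math


def cosine_lr(epoch, total_epochs, lr0, final_fraction=0.01):
    """Cosine annealing from lr0 down to lr0 * final_fraction (assumed floor)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr0 * (final_fraction + (1.0 - final_fraction) * cos_term)


class EarlyStopper:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=20)
# Toy validation losses: improve for 30 epochs, then plateau.
for epoch in range(256):
    lr = cosine_lr(epoch, total_epochs=256, lr0=1.0)  # lr0=1.0 is a placeholder
    val_loss = max(1.0 - 0.02 * epoch, 0.4)
    if stopper.step(val_loss):
        print("stopped at epoch", epoch, "with lr multiplier", round(lr, 4))
        break
```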
[0082] Detection frequency band delineation. This study uses broadband radio frequency (RF) data from real-world sampling, covering 88MHz to 1000MHz and encompassing the core frequency bands for civilian communications and broadcasting services. Given the non-stationary, time-varying nature of the real electromagnetic environment, continuous signal detection was not conducted across the entire range, for two reasons. First, the electromagnetic environment in some dedicated frequency bands (such as police trunking communication bands and conventional VHF/UHF communication bands) is highly complex, with many densely overlapping signals. Manual labeling in these bands easily leads to mislabeling and omissions, which impairs feature learning and results in poor generalization. Second, some frequency bands have low spectrum utilization, making continuous detection in them impractical.
[0083] To ensure the reliability of model training and performance evaluation, this study ultimately selected six frequency bands for signal detection: 88MHz~108MHz, 280MHz~360MHz, 366MHz~500MHz, 500MHz~860MHz, 860MHz~894MHz, and 934MHz~1000MHz. Samples of frequency bands excluded from monitoring are shown in Figure 3 and Figure 4. Such bands are difficult to label or contain only sparse signals, so they are excluded from the experiments in this paper.
[0084] Figure 3 and Figure 4 show example time-frequency plots of a complex frequency band and a sparse-signal frequency band, respectively.
[0085] Evaluation metrics. After model training, detection performance on broadband time-frequency maps must be evaluated objectively using quantitative metrics. This experiment uses the mean average precision (mAP@0.5 and mAP@0.5:0.95) to measure the model's detection performance. Both metrics are calculated based on the Intersection over Union (IoU).
[0086] Intersection over Union (IoU) measures the degree of overlap between the model's predicted bounding boxes and the ground truth labeled bounding boxes. A predicted bounding box is considered a true positive (TP) only if the IoU is greater than a set threshold. Otherwise, it is considered a false positive (FP). Ground truth targets that are not detected are considered false negatives (FN).
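A minimal sketch of this counting rule follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and predictions are matched greedily, one-to-one, in the order given (taken to be descending confidence); the matching strategy and function names are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """Return (TP, FP, FN) using greedy one-to-one matching."""
    matched = set()
    tp = fp = 0
    for pred in predictions:
        best_iou, best_index = 0.0, None
        for index, truth in enumerate(ground_truths):
            if index in matched:
                continue
            overlap = iou(pred, truth)
            if overlap > best_iou:
                best_iou, best_index = overlap, index
        if best_index is not None and best_iou > iou_threshold:
            matched.add(best_index)   # true positive: IoU exceeds the threshold
            tp += 1
        else:
            fp += 1                   # false positive: no sufficiently overlapping target
    fn = len(ground_truths) - len(matched)  # undetected ground-truth targets
    return tp, fp, fn


preds = [(0, 0, 10, 10), (100, 100, 110, 110)]
truths = [(0, 0, 10, 8), (50, 50, 60, 60)]
print(count_matches(preds, truths))  # (1, 1, 1)
```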
[0087] Precision (P) represents the proportion of true signals among the targets predicted as "signals" by the model. Recall (R) represents the proportion of true signal targets correctly detected by the model. The formulas for both are as follows:

$$P = \frac{TP}{TP + FP} \tag{4-13}$$

$$R = \frac{TP}{TP + FN} \tag{4-14}$$

In model evaluation, precision and recall constrain each other. Average precision (AP) is the area under the precision-recall curve (PR curve) and comprehensively evaluates the model's detection performance for a single class. Its theoretical formula is:

$$AP = \int_{0}^{1} P(R)\,dR \tag{4-15}$$

Mean Average Precision (mAP) is the arithmetic mean of the AP values over N detection classes and measures the overall performance of the model in multi-class tasks. Its calculation formula is:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{4-16}$$

In the formula, N is the total number of target categories and $AP_i$ is the average precision for the i-th category. This study only detects valid signals and does not classify the specific modulation category of the signal, i.e., N=1. In this case, the model's output mAP is numerically equal to the AP for the single category.
[0088] To comprehensively evaluate the model's ability to capture signal locations and the accuracy of boundary regression in time-frequency plots, this experiment focuses on examining the following two comprehensive indicators: mAP@0.5: This refers to the mAP value calculated when the IoU threshold is set to 0.5. This metric mainly reflects the model's overall ability to capture and detect the macroscopic location of signals.
[0089] mAP@0.5:0.95: The average mAP value for each IoU threshold from 0.5 to 0.95 (step size 0.05). This metric requires the model to accurately define the time-frequency boundaries of the signal and is used to evaluate the regression accuracy of the bounding box.
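The two indicators can be sketched as follows for the single-class case. AP is computed here as the area under the precision envelope using all-point interpolation, which is an assumption (evaluation toolkits differ in the interpolation they use); the function names are hypothetical.

```python
def average_precision(scored_hits, num_ground_truths):
    """AP from (confidence, is_true_positive) pairs at one IoU threshold."""
    if num_ground_truths == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_hit in ranked:
        if is_hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truths)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(ap_per_threshold):
    """With N=1 class, mAP@0.5:0.95 is the mean of the per-threshold AP values."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


hits = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision(hits, num_ground_truths=2))
print(IOU_THRESHOLDS)
```

In practice the `is_true_positive` flags are recomputed for each threshold in `IOU_THRESHOLDS`, so a box that counts as a hit at IoU 0.5 may become a miss at 0.9, which is why mAP@0.5:0.95 is sensitive to boundary accuracy.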
[0090] Performance evaluation. To verify the effectiveness and generality of the proposed dual-channel input mechanism and frequency injection mechanism, this section applies the FDC (Frequency-aware Dual-Channel) mechanism to three official benchmark architectures of different parameter scales (YOLOv8n, YOLOv8s, and YOLOv8m) and compares each baseline with its FDC counterpart. To ensure an objective evaluation, inference for all models was run on the same hardware environment and dataset. The comparison results are shown in Table 1.1.
[0091] The test set used in this evaluation consists of two parts: 5,000 simulated samples generated using MATLAB, and 5,000 extended samples derived from signals collected in the field at Xidian University, which were manually annotated, extracted as targets, and then reassembled.
[0092] Table 1.1 Model Performance Comparison
Model Name | mAP@0.5 | mAP@0.5:0.95 | Algorithm running time / ms
YOLOv8n | 0.8487 | 0.6904 | 9.25
YOLOv8n-FDC | 0.9234 | 0.7512 | 10.50
YOLOv8s | 0.8897 | 0.7682 | 16.79
YOLOv8s-FDC | 0.9639 | 0.8146 | 18.78
YOLOv8m | 0.8952 | 0.7715 | 34.50
YOLOv8m-FDC | 0.9681 | 0.8211 | 37.20
Table 1.1 reflects the impact of network size and of the FDC mechanism on detection performance. Among the baseline models, expanding the network from YOLOv8s to YOLOv8m increases the algorithm running time from 16.79ms to 34.50ms, but improves mAP@0.5 and mAP@0.5:0.95 by only 0.55 and 0.33 percentage points, respectively. This indicates that, for time-frequency map signals, simply increasing the number of network parameters yields limited gains in feature extraction capability, with diminishing marginal returns. In contrast, after the FDC mechanism is introduced, both detection accuracy (mAP@0.5) and localization integrity (mAP@0.5:0.95) improve significantly at every scale. Notably, YOLOv8n-FDC, with a running time of only 10.50ms, surpasses YOLOv8m in mAP@0.5 (92.34% versus 89.52%), showing that the dual-channel input and frequency injection mechanisms improve feature representation more effectively than expanding the network size.
[0093] In practical applications, model selection must balance detection accuracy and inference speed. Although YOLOv8m-FDC achieves the highest mAP@0.5 of all groups (96.81%), this is only 0.42 percentage points higher than YOLOv8s-FDC (96.39%), while its algorithm running time is 18.42ms longer. In real-time signal detection scenarios, this added latency can easily lead to data backlog. Considering both computational efficiency and detection performance, this paper selects YOLOv8s-FDC as the final model.
[0094] Figure 5 plots the ROC curves for each model. Dashed lines represent the original baseline models, and solid lines represent the improved models incorporating the FDC mechanism. For the same network size, the solid line lies above the dashed line of the same color. This indicates that the FDC mechanism improves the detection probability of real signals for the YOLOv8n, YOLOv8s, and YOLOv8m architectures alike.
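The trade-off can be reproduced directly from the values in Table 1.1. The short script below only restates that arithmetic; the dictionary layout and function names are illustrative.

```python
# (mAP@0.5, mAP@0.5:0.95, running time in ms), copied from Table 1.1.
RESULTS = {
    "YOLOv8n": (0.8487, 0.6904, 9.25),
    "YOLOv8n-FDC": (0.9234, 0.7512, 10.50),
    "YOLOv8s": (0.8897, 0.7682, 16.79),
    "YOLOv8s-FDC": (0.9639, 0.8146, 18.78),
    "YOLOv8m": (0.8952, 0.7715, 34.50),
    "YOLOv8m-FDC": (0.9681, 0.8211, 37.20),
}


def compare(model_a, model_b):
    """Return (mAP@0.5 gain in percentage points, extra running time in ms) of b over a."""
    map_a, _, time_a = RESULTS[model_a]
    map_b, _, time_b = RESULTS[model_b]
    return round((map_b - map_a) * 100.0, 2), round(time_b - time_a, 2)


# Cost of adding FDC at each network size.
for size in ("n", "s", "m"):
    print(size, compare(f"YOLOv8{size}", f"YOLOv8{size}-FDC"))

# Cost of scaling the FDC model from s to m.
print(compare("YOLOv8s-FDC", "YOLOv8m-FDC"))  # (0.42, 18.42)
```

Adding FDC costs between 1.25ms and 2.70ms for a gain of more than 7 percentage points at every size, whereas scaling the FDC model from s to m costs 18.42ms for 0.42 percentage points.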
[0095] Cross-domain generalization performance analysis and evaluation. In this paper, cross-domain generalization performance refers to the model's ability to maintain stable detection accuracy when faced with data acquired at different times, in different geographical locations, and from different receiving devices.
[0096] Two test sets are used in this evaluation. The first is a mixed test set of 10,000 samples in two parts: 5,000 simulated samples generated using MATLAB, and 5,000 extended samples built from signals collected in the field at Xidian University that were manually labeled, extracted, and reassembled. The second is a real-world dataset of 100 samples, consisting of raw data collected in the field at a test site in Southwest China without any preprocessing.
[0097] All dataset samples are stored in .mat format; each .mat file covers a wide bandwidth from 20MHz to 1000MHz and spans 120 time frames. During performance evaluation, the model was tested only on the six frequency bands defined above. To calculate the evaluation metrics, the 100 samples of the real-world dataset were manually labeled. Because the real-world data has a complex background and blurred signal edges, the manual labels may contain biases, which may affect the computed mAP values to some extent.
[0098] The performance comparison results are shown in Table 1.2.
[0099] Table 1.2 Model Performance Comparison
Dataset type | Model Name | mAP@0.5 | mAP@0.5:0.95 | Algorithm running time / ms
Mixed test set | YOLOv8s | 0.8897 | 0.7682 | 16.79
Mixed test set | YOLOv8s-FDC | 0.9639 | 0.8146 | 18.78
Real-world dataset | YOLOv8s | 0.8125 | 0.6433 | 16.82
Real-world dataset | YOLOv8s-FDC | 0.9012 | 0.7358 | 18.85
As shown in Table 1.2, the detection accuracy of both models decreased when the test data was switched from the mixed test set to the real-world dataset. The baseline YOLOv8s model declined more sharply: its mAP@0.5 dropped to 0.8125, and its mAP@0.5:0.95, which reflects the accuracy of bounding box localization, dropped to 0.6433. This indicates that the fluctuating background noise in the real-world data interfered with the feature extraction of the base network, making it difficult for the model to accurately define the true physical boundaries of the signal.
[0100] In contrast, the YOLOv8s-FDC model proposed in this paper still maintains mAP@0.5 of 0.9012 and mAP@0.5:0.95 of 0.7358 on the real-world dataset. The data shows that while the model performance degrades on the real-world dataset, it still exhibits good adaptability and can effectively distinguish weak signals from background clutter.
[0101] Figure 6 shows an example of the detection results of the YOLOv8s-FDC model on the real-world dataset. As the figure shows, when a weak signal lies at the edge of a strong signal, the model is prone to missing the weak signal. Here, a weak signal lies at the edge of a strong broadband signal near 380MHz, but the model failed to output a corresponding predicted bounding box. The main reason is that strong signals in real-world environments often exhibit local energy leakage, which raises the noise level in the edge region, so the weak features are misjudged by the network as background clutter and filtered out. This indicates that the current model's ability to extract features of weak targets in scenarios with dense overlap of strong and weak signals still needs improvement.
[0102] Cross-band signal detection performance analysis and evaluation. Because signals in different frequency bands differ significantly in bandwidth, duration, and modulation texture, standard object detection networks, limited by a fixed receptive field, often struggle with cross-frequency, multi-scale feature extraction. To verify the detection capability and generalization performance of the YOLOv8s-FDC model under these conditions, this section selected test samples across six frequency bands: 88MHz~108MHz, 280MHz~360MHz, 366MHz~500MHz, 500MHz~860MHz, 860MHz~894MHz, and 934MHz~1000MHz. These samples cover various signal types, ranging from narrowband to broadband. The YOLOv8s model was used as the baseline control group to verify whether the FDC mechanism improves actual detection capability for signals of different frequency bands and scales.
[0103] This evaluation uses the same two-part test set as the performance evaluation: 5,000 simulated samples generated using MATLAB and 5,000 extended samples derived from signals collected in the field at Xidian University, which were manually annotated, extracted as targets, and then reassembled.
[0104] Table 1.3 Comparison of Cross-Band Signal Detection Performance
Frequency band | Model Name | mAP@0.5 | mAP@0.5:0.95
88MHz~108MHz | YOLOv8s | 0.976 | 0.825
88MHz~108MHz | YOLOv8s-FDC | 0.985 | 0.835
280MHz~360MHz | YOLOv8s | 0.925 | 0.796
280MHz~360MHz | YOLOv8s-FDC | 0.975 | 0.825
366MHz~500MHz | YOLOv8s | 0.868 | 0.778
366MHz~500MHz | YOLOv8s-FDC | 0.968 | 0.818
500MHz~860MHz | YOLOv8s | 0.856 | 0.765
500MHz~860MHz | YOLOv8s-FDC | 0.959 | 0.809
860MHz~894MHz | YOLOv8s | 0.912 | 0.728
860MHz~894MHz | YOLOv8s-FDC | 0.954 | 0.803
934MHz~1000MHz | YOLOv8s | 0.795 | 0.707
934MHz~1000MHz | YOLOv8s-FDC | 0.941 | 0.792
As shown in Table 1.3, the detection accuracy of the two models is similar in the 88MHz~108MHz frequency band. However, the performance of the baseline YOLOv8s model degrades significantly in other frequency bands. For example, in the 860MHz~894MHz band its mAP@0.5:0.95 drops to 0.728, and in the 934MHz~1000MHz band its mAP@0.5 drops to 0.795.
[0105] In contrast, the YOLOv8s-FDC model maintained stable detection capability in cross-band testing. Its mAP@0.5 varied only between 0.941 and 0.985 across the six test frequency bands, whereas the baseline ranged from 0.795 to 0.976. These data indicate that the FDC mechanism mitigates the baseline model's tendency toward missed detections and inaccurate boundary localization in complex frequency bands, improving adaptability and stability for cross-band signals without significantly increasing computational overhead.
[0106] The following section selects one test sample from each of the six frequency bands to visually compare the detection performance of the models.
[0107] 1. 88MHz~108MHz. The 88MHz~102MHz frequency band mainly consists of narrowband signals with high signal-to-noise ratios. These signals have a concentrated energy distribution and clear boundary contours on the time-frequency plot, placing relatively low demands on the network's receptive field and feature extraction capability. The YOLOv8s model and the YOLOv8s-FDC model therefore perform identically, with no missed or false detections, and both bound the targets accurately. Figure 7(a) shows the ground-truth label of the signal. Figure 7(b) shows the detection results of the YOLOv8s model, with no false positives or false negatives. Figure 7(c) shows the detection results of the YOLOv8s-FDC model, also with no false positives or false negatives.
[0108] 2. 280MHz~360MHz. Figure 8(a) shows the ground-truth label of the signal. In tests within the 305MHz~320MHz frequency band, the YOLOv8s model misclassified local high-energy regions within a single signal as independent signal entities, outputting two prediction boxes for the same target near 314MHz, as shown in Figure 8(b). The YOLOv8s-FDC model does not exhibit this problem, as shown in Figure 8(c).
[0109] 3. 366MHz~500MHz. Figure 9(a) shows the ground-truth label of the signal. In tests within the 443MHz~455MHz frequency band, the YOLOv8s model exhibited a missed detection near 445MHz and a false detection near 454MHz, as shown in Figure 9(b): it not only missed an actual signal but also split one complete signal in this band into two independent signals. In contrast, the YOLOv8s-FDC model accurately detected the signal near 445MHz and output a single complete prediction box for the signal near 454MHz, avoiding both the missed detection and the misjudgment, as shown in Figure 9(c).
[0110] 4. 500MHz~860MHz. Figure 10(a) shows the ground-truth label of the signal. In tests within the 576MHz~590MHz frequency band, a broadband signal filled the entire time-frequency plot. The baseline YOLOv8s model failed to identify the overall profile of this signal and detected it as two separate small signals, as shown in Figure 10(b). In contrast, the YOLOv8s-FDC model identified the entire image as a single, complete signal target, as shown in Figure 10(c). The reason is that the YOLOv8s-FDC model introduces a dual-channel input mechanism whose second channel is the signal probability matrix calculated from the global noise floor fitting results. Guided by this probability matrix, the YOLOv8s-FDC model can output a complete prediction box even when a single signal fills the entire image, avoiding the misclassification of large signals as noise or their segmentation into multiple small signals.
[0111] 5. 860MHz~894MHz. Figure 11(a) shows the ground-truth label of the signal. In tests within the 872MHz~886MHz frequency band, the YOLOv8s model detected all target signals, but its predicted bounding boxes were significantly too small, as shown in Figure 11(b). This occurs because real broadband signals typically exhibit energy attenuation at their edges. The YOLOv8s model captures only the high-energy main body of the signal, which has obvious visual features, and mistakes the attenuated edge regions for background noise. In contrast, the YOLOv8s-FDC model accurately defines the complete boundary of the signal, as shown in Figure 11(c). The reason is that the signal probability matrix input through the second channel preserves the edge energy attenuation information, while the frequency injection mechanism provides the network with frequency coordinate features that help it learn signal morphology patterns in different frequency bands. Guided by both mechanisms, the model can perceive and extract the edge attenuation features of the signal, solving the problem of undersized bounding boxes.
[0112] 6. 934MHz~1000MHz. Figure 12(a) shows the ground-truth label of the signal. In tests within the 958MHz~974MHz frequency band, the YOLOv8s model detected only a high-energy central region with a bandwidth of approximately 1MHz and misclassified the remaining low-energy components as background noise, as shown in Figure 12(b). In contrast, the YOLOv8s-FDC model accurately and completely frames the entire signal, as shown in Figure 12(c). The main reason is that the signal probability matrix introduced by the dual-channel input mechanism preserves the overall structural features of the signal, avoiding incomplete predictions.
[0113] In summary, the YOLOv8s model is only suitable for simple, narrowband, high signal-to-noise ratio scenarios. When faced with complex signals that occupy the entire time-frequency map or have edge energy attenuation, it is prone to problems such as incorrect segmentation of single targets, filtering out local low-energy regions as noise, and incomplete bounding box regression. In contrast, the proposed YOLOv8s-FDC model exhibits stronger cross-frequency band adaptability. By introducing a dual-channel input mechanism and a frequency injection mechanism, the network can directly acquire the complete spatial structure and edge transition information of the signal. Experiments show that these two mechanisms effectively compensate for the shortcomings of the YOLOv8s model in feature extraction, not only avoiding repeated misjudgments caused by internal signal energy fluctuations, but also accurately defining the edges of the signal, ensuring that the model can accurately and completely define the true boundaries of target signals in various complex frequency bands.
[0114] Ablation experiments. This section presents ablation experiments on the core parameters and the core modules.
[0115] Core parameter ablation experiment. In the dual-channel input mechanism, the quality of the second-channel input matrix directly determines the network's ability to suppress background clutter and to perceive target boundaries. The probability matrix is generated as follows: (1) Subtract the noise floor fitting result output by the SAD-DAE model from the original broadband time-frequency plot to obtain the residual matrix; (2) Screen the residual matrix by energy using a decision threshold; (3) Using a specific probability mapping function, transform the screened residuals into a numerical matrix with values in the range [0,1].
[0116] As demonstrated by the above process, the setting of the decision threshold and the selection of the probability mapping function are key parameters affecting the quality of the second channel matrix. To investigate the impact of the decision threshold and the probability mapping function on the model's detection performance, this section conducts ablation experiments on the core parameters of YOLOv8s-FDC (with a sigmoid mapping function and a decision threshold of 2). The experiments mainly compare two probability mapping functions: a 0-1 mapping (binary) and a non-linear smooth mapping (Sigmoid function). Four decision thresholds of 0, 2, 4, and 6 were set.
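The three-step generation process and the two mapping functions can be sketched as follows. This is a hedged illustration only: the source does not give the screening rule or the exact sigmoid, so the sketch assumes residuals in dB, screening by shifting the residual by the threshold, a binary map that outputs 1 above the threshold, and a unit-slope sigmoid centered on the threshold. All function names are hypothetical.

```python
import math


def residual_matrix(spectrogram, noise_floor):
    """Step 1: residual = time-frequency plot minus fitted noise floor (element-wise)."""
    return [
        [value - floor for value, floor in zip(spec_row, floor_row)]
        for spec_row, floor_row in zip(spectrogram, noise_floor)
    ]


def stable_sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probability_matrix(residual, threshold=2.0, mapping="sigmoid"):
    """Steps 2 and 3: threshold screening followed by mapping into [0, 1]."""
    if mapping not in ("binary", "sigmoid"):
        raise ValueError("mapping must be 'binary' or 'sigmoid'")
    result = []
    for row in residual:
        shifted = [value - threshold for value in row]  # assumed screening rule
        if mapping == "binary":
            result.append([1.0 if v > 0.0 else 0.0 for v in shifted])
        else:
            result.append([stable_sigmoid(v) for v in shifted])
    return result


# One time frame crossing a signal edge: noise, attenuated edge, signal body.
spectrogram = [[-100.0, -99.0, -97.0, -90.0]]
noise_floor = [[-100.0, -100.0, -100.0, -100.0]]
residual = residual_matrix(spectrogram, noise_floor)      # [[0, 1, 3, 10]]
print(probability_matrix(residual, 2.0, "binary"))       # [[0.0, 0.0, 1.0, 1.0]]
print(probability_matrix(residual, 2.0, "sigmoid"))
```

Under these assumptions, the binary map turns the attenuated edge into a hard step, while the sigmoid assigns it intermediate values, so the gradual roll-off at the signal boundary remains visible to the network.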
[0117] Table 1.4 records the evaluation results of model performance under different parameter combinations. Among them, mAP@0.5 is used to evaluate the model's ability to detect signals, mAP@0.5:0.95 reflects the regression accuracy of the prediction boundary, and the algorithm running time reflects the time cost of a single forward inference of the model.
[0118] This evaluation uses the same two-part test set as the performance evaluation: 5,000 simulated samples generated using MATLAB and 5,000 extended samples derived from signals collected in the field at Xidian University, which were manually annotated, extracted as targets, and then reassembled.
[0119] Table 1.4 Ablation Experiment Results of Core Parameters
Probability mapping method | Decision threshold | mAP@0.5 | mAP@0.5:0.95 | Algorithm running time / ms
binary | 0 | 0.8437 | 0.6923 | 18.68
binary | 2 | 0.9251 | 0.7653 | 18.67
binary | 4 | 0.8973 | 0.7437 | 18.69
binary | 6 | 0.8719 | 0.7231 | 18.68
sigmoid | 0 | 0.8315 | 0.6819 | 18.77
sigmoid | 2 | 0.9639 | 0.8146 | 18.78
sigmoid | 4 | 0.9413 | 0.7931 | 18.79
sigmoid | 6 | 0.9027 | 0.7715 | 18.77
The data in Table 1.4 show that the parameter configuration has a significant impact on the detection performance of the model.
[0120] First, observing the changes in the decision threshold reveals that when the decision threshold is set to 0, the mAP@0.5 and mAP@0.5:0.95 indices for both the 0-1 mapping and the nonlinear smoothing mapping are at low levels. This phenomenon indicates that an excessively low threshold assigns a high probability weight to background noise, which is equivalent to artificially amplifying the interference of background clutter, leading to more false alarms in the model's prediction. When the decision threshold is set to 6, the mAP@0.5 and mAP@0.5:0.95 indices for both mapping methods show a significant decline. The reason for this phenomenon is that an excessively high decision threshold can erase the characteristics of weak signals, thus causing missed detections.
[0121] In a horizontal comparison of mapping methods, although the 0-1 mapping performs reasonably well in terms of mAP@0.5, its mAP@0.5:0.95 ratio is significantly limited. This is because the 0-1 mapping cuts off the energy attenuation at the edges of the broadband signal, losing signal edge information, making it difficult for the predicted bounding boxes regressed by the network to accurately fit the true signal boundaries. In contrast, when using a nonlinear smoothing mapping and setting the decision threshold to 2, the model achieves an optimal mAP@0.5 of 0.9639. This indicates that when using a nonlinear smoothing mapping and setting the decision threshold to 2, the model achieves the best balance between suppressing background clutter and preserving signal edge features.
[0122] (1) Comparison of mapping functions under the same decision threshold This section conducts a comparative analysis of 0-1 mapping and nonlinear smoothing mapping under a fixed decision threshold. This section focuses on analyzing the differences between these two mapping methods in processing signal edge features, and, based on prediction results, analyzes the impact of different mapping strategies on the model's clutter suppression capability, boundary regression accuracy (mAP@0.5:0.95), and overall detection performance (mAP@0.5).
[0123] Figure 13 (a) is the truth label for the signal. (From...) Figure 13 (b) It can be seen that when using nonlinear smoothing mapping and setting the decision threshold to 0, the model exhibits significant redundant detection problems. First, near the 443MHz band, the model misclassifies local high-energy regions within a single signal as independent signal entities, resulting in the same target being repeatedly output as three prediction boxes. Second, near the 451MHz band, the model's prediction boundary is much larger than the actual signal boundary. The reason for this phenomenon is that when using nonlinear smoothing mapping and setting the decision threshold to 0, the model amplifies the probability weight of the background noise, causing the boundary between the actual signal and background clutter to become blurred. The predicted boxes regressed by the network not only include the target signal but also incorrectly include large areas of surrounding noise within the signal range. When using 0-1 mapping and setting the decision threshold to 0, the phenomenon of prediction boundaries being much larger than the actual signal boundary does not occur near the 451MHz band. The predicted boxes effectively avoid large areas of surrounding noise, such as... Figure 13 As shown in (c). However, the model still exhibits a redundancy detection problem near the 443MHz band, similarly segmenting and misidentifying local high-energy regions within a single broadband signal as independent signal entities.
[0124] Figure 14 (a) is the truth label for the signal. (From...) Figure 14 (b) Figure 14The comparison results in (c) show that the two mapping methods exhibit significant differences in handling signal boundaries. In real electromagnetic environments, the energy distribution of broadband signals in the time-frequency domain is not an ideal step change; its edges are often accompanied by natural energy attenuation. The 0-1 mapping directly cuts off this continuous energy decay, causing signal edges below the judgment threshold to be forcibly zeroed out, thus being misjudged as background noise by the model. This loss of boundary information makes it impossible for the network to perceive the complete span of the signal when regressing the prediction box, which is intuitively manifested as the shrinkage of the prediction boundary. In contrast, the nonlinear smooth mapping can completely preserve the signal edge feature information, enabling the model to accurately capture the signal boundary while effectively isolating background noise.
[0125] (2) Comparison of different thresholds under a fixed mapping. With the probability mapping method fixed, this section compares decision thresholds of different values and analyzes how the threshold level affects the prediction results.
[0126] Figure 15 compares the prediction results of the model under different decision thresholds (set to 6, 4, and 2, respectively) when nonlinear smooth mapping is used. As Figure 15(b) shows, when the decision threshold is set to 6, the model misses signals in both the 412 MHz and 417 MHz frequency bands. When the threshold is reduced to 4, the signal in the 412 MHz band is detected, but the signal in the 417 MHz band is still missed, as shown in Figure 15(c). When the threshold is lowered to 2, the signals in both frequency bands are captured completely and accurately, as shown in Figure 15(d).
[0127] These results show that the decision threshold has a decisive impact on the model's overall detection performance. In complex broadband spectrum environments, some signals have low energy because of long-distance transmission loss or channel fading. If the threshold is set too high, the nonlinear smooth mapping treats these low-energy real targets as background noise and suppresses them, leading to missed detections and a lower mAP@0.5. The decision threshold is therefore not an absolute constant in practical broadband spectrum monitoring: it needs to be adjusted according to the specific task requirements and the actual noise distribution of the target frequency band to balance precision and recall.
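The trade-off described here can be sketched with a small threshold sweep. The example assumes that the probability is a Sigmoid of the excess over the noise floor shifted by the decision threshold, and that a cell counts as "signal" when its probability exceeds 0.5; the signal and noise excess values are hypothetical and are not measurements from the experiments.

```python
import math

def signal_probability(excess_db, threshold_db):
    """Sigmoid of the excess over the noise floor, shifted by the decision threshold."""
    return 1.0 / (1.0 + math.exp(-(excess_db - threshold_db)))

def count_above_half(excess_values_db, threshold_db):
    """Count entries whose mapped probability exceeds 0.5."""
    return sum(signal_probability(e, threshold_db) > 0.5 for e in excess_values_db)

# Hypothetical excess energy (dB) of three real signals and three noise fluctuations.
signals_db = [7.0, 5.0, 3.0]
noise_db = [1.0, -0.5, -2.0]

# threshold -> (signals kept, noise cells kept)
sweep = {
    t: (count_above_half(signals_db, t), count_above_half(noise_db, t))
    for t in (6.0, 4.0, 2.0, 0.0)
}
for threshold, (kept_signals, kept_noise) in sweep.items():
    print(f"threshold {threshold:.0f}: signals kept {kept_signals}/3, noise kept {kept_noise}/3")
```

With these assumed values, lowering the threshold from 6 to 2 recovers the weaker signals, while lowering it further to 0 starts to admit noise fluctuations.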
[0128] Core module ablation experiment. To verify the independent contributions and synergistic effects of the two key sub-modules, the frequency injection mechanism and the dual-channel input mechanism, on detection performance, this section conducts ablation experiments using YOLOv8s as the baseline network. The evaluation includes a comparison of quantitative metrics and an analysis of prediction results in complex electromagnetic environments.
[0129] Four comparative models were set up for this ablation experiment. The first is the official baseline model YOLOv8s, which takes a single-channel time-frequency plot as input and serves as the control group. YOLOv8s+FreqIn adds only the frequency injection module to the baseline. It introduces the frequency coordinate vector of the signal in the deep feature extraction stage of the network, so that frequency prior information dynamically adjusts the feature channel weights; this helps the model distinguish the semantic features of different frequency bands in the normalized time-frequency plot and alleviates cross-band feature confusion. YOLOv8s+Dual-Channel adds only the dual-channel input mechanism to the baseline, using the signal probability matrix as the second channel. The probability matrix guides the network to focus on high-probability signal regions, suppressing background clutter while strengthening target boundary capture. YOLOv8s-FDC is the final proposed model, which integrates both the dual-channel input mechanism and the frequency injection mechanism.
[0130] The size and source of the test set used in this evaluation are as follows. The test set consists of two parts: 5000 simulated samples generated in MATLAB, and 5000 extended samples obtained by manually labeling and extracting signals collected in the field at a university and then recombining them.
[0131] Table 1.5 Model Performance Comparison
Model Name | mAP@0.5 | mAP@0.5:0.95 | Algorithm running time / ms
YOLOv8s | 0.8897 | 0.7682 | 16.79
YOLOv8s+FreqIn | 0.9250 | 0.7854 | 17.50
YOLOv8s+Dual-Channel | 0.9412 | 0.7920 | 18.10
YOLOv8s-FDC | 0.9639 | 0.8146 | 18.78
Table 1.5 records the results of the ablation experiments. The data show that both the frequency injection mechanism and the dual-channel input mechanism improve the model's detection performance. With the frequency injection mechanism (YOLOv8s+FreqIn), mAP@0.5 increases to 0.9250, indicating that frequency coordinate references help the network distinguish the features of different frequency bands in the normalized time-frequency plot. With the dual-channel input mechanism (YOLOv8s+Dual-Channel), mAP@0.5 reaches 0.9412, a larger accuracy gain than frequency injection alone. This indicates that using the probability matrix to highlight the entire signal region and suppress background noise directly improves feature extraction in low signal-to-noise ratio environments. The YOLOv8s-FDC model reaches an mAP@0.5 of 0.9639, and its mAP@0.5:0.95 improves to 0.8146. In terms of computational overhead, YOLOv8s-FDC takes 18.78 ms per run, only 1.99 ms more than the baseline model. These results show that the model improves the detection rate and boundary localization accuracy for complex broadband signals at the cost of a small increase in inference latency.
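The gains quoted from Table 1.5 can be re-derived with a short script. It uses only the values in the table; the dictionary layout and variable names are illustrative.

```python
# (mAP@0.5, mAP@0.5:0.95, running time in ms), copied from Table 1.5.
results = {
    "YOLOv8s": (0.8897, 0.7682, 16.79),
    "YOLOv8s+FreqIn": (0.9250, 0.7854, 17.50),
    "YOLOv8s+Dual-Channel": (0.9412, 0.7920, 18.10),
    "YOLOv8s-FDC": (0.9639, 0.8146, 18.78),
}

base_map50, base_map5095, base_ms = results["YOLOv8s"]

# Differences of each model relative to the YOLOv8s baseline.
deltas = {
    name: (map50 - base_map50, map5095 - base_map5095, ms - base_ms)
    for name, (map50, map5095, ms) in results.items()
}

for name, (d50, d5095, dms) in deltas.items():
    print(f"{name:22s} mAP@0.5 {d50:+.4f}  mAP@0.5:0.95 {d5095:+.4f}  time {dms:+.2f} ms")
```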
[0132] To examine the model's detection performance, two typical examples are analyzed below. Example 1 compares the detection results of the models as follows. Figure 16(a) shows the ground-truth labels for the signal. In the prediction results of the baseline model YOLOv8s, for the narrowband low signal-to-noise ratio signal at the 479 MHz position in Figure 16(b), only a portion of the signal is detected, i.e., the signal is truncated. The underlying reason is that the baseline model receives a single-channel time-frequency map and performs global Min-Max linear normalization during data preprocessing. Because other high-power signals exist in this frequency band, this linear normalization significantly compresses the dynamic range of weak signals, leaving only weak pixel gradients for this narrowband signal in the tensor matrix. The baseline YOLOv8s lacks prior knowledge of specific frequency bands; its fixed convolutional receptive field cannot extract a coherent signal profile from the severely compressed feature space and ultimately captures only signal fragments with slightly higher local signal-to-noise ratios.
[0133] In the prediction results of the model with the dual-channel input mechanism (YOLOv8s+Dual-Channel), shown in Figure 16(c), a missed detection occurs at the 474 MHz position. This is because the model uses the signal probability matrix as the second channel. During the nonlinear mapping, regions whose differential energy is below the set threshold are compressed nonlinearly toward zero. Because this narrowband signal has a low signal-to-noise ratio (SNR), its energy fluctuations are not significantly higher than the background noise, so it is misjudged as noise and filtered out. The signal probability matrix of the second channel greatly suppresses false alarms, but it is prone to missed detections for low-SNR signals.
[0134] In the prediction results of the model with the frequency injection mechanism (YOLOv8s+FreqIn), shown in Figure 16(d), the signal at 474 MHz is successfully detected, but a narrowband signal near 482 MHz is incorrectly split into two independent prediction boxes. This illustrates the two-sided effect of the frequency modulation module in practical applications. On the one hand, the Gaussian Fourier mapping at the network front end transforms frequency coordinates into high-dimensional features, and the C2f_Freq module uses them to dynamically scale the convolution weights, which gives the network the ability to sense frequency bands. Therefore, even if single-channel normalization severely weakens the signal features, the network can still capture weak signals based on the morphological patterns characteristic of that frequency band. On the other hand, without a dual-channel input mechanism providing a signal probability matrix to suppress background clutter, the network is susceptible to background noise. When the narrowband signal at 482 MHz exhibits energy fluctuations along the time axis, the bounding box regression branch is disturbed by local noise and mistakes the energy dips of the signal for signal boundaries, fragmenting a single continuous signal.
[0135] The YOLOv8s-FDC model overcomes the shortcomings of the three models above (YOLOv8s, YOLOv8s+Dual-Channel, YOLOv8s+FreqIn): it detects low signal-to-noise ratio signals completely while keeping narrowband signal boundaries unfragmented. The detection results are shown in Figure 16(e).
[0136] The following is a comparison of the model detection results in Example 2: Figure 17(a) shows the signal truth label (GT), which fully marks the true location and boundary of all valid signals in the 469~481MHz frequency band, serving as a benchmark for detection performance.
[0137] The detection results of the baseline YOLOv8s model (Figure 17(b)) show several defects: the signal at 469 MHz is truncated, and the signal at 481 MHz is missed. The core reason is that the global Min-Max normalization of the single-channel input compresses the dynamic range of weak signals, and the model's fixed convolutional receptive field cannot extract the compressed weak-signal features, which leads to missed detections and signal truncation.
[0138] The YOLOv8s+Dual-Channel model (Figure 17(c)) detects strong signals more accurately than the baseline. However, the predicted box at 476 MHz is too small, and the signal at 481 MHz is missed. This is because, when the signal probability matrix of the second channel filters noise through the energy threshold, it misclassifies low signal-to-noise ratio signals whose energy fluctuations are close to the noise floor as noise. Although this suppresses false alarms, it leads to missed detections of weak signals.
[0139] The YOLOv8s+FreqIn model (Figure 17(d)) incorrectly splits the 476 MHz ~ 477 MHz broadband continuous signal into two prediction boxes, resulting in signal fragmentation. This is because the frequency injection mechanism gives the model frequency band sensing capabilities, enabling it to capture weak signal features; however, it lacks the clutter suppression capability of the signal probability matrix, making it susceptible to interference from signal energy fluctuations, leading to boundary misjudgment and signal fragmentation.
[0140] The YOLOv8s-FDC model (Figure 17(e)) detects all signals in this example accurately and completely. The model integrates the dual-channel input and the frequency injection mechanism so that their advantages complement each other: the signal probability matrix suppresses clutter and boundary misjudgments, while the frequency injection mechanism preserves weak-signal detection capability, and together they improve detection performance in complex scenarios.
[0141] The basic YOLOv8 model possesses spatial translation invariance; in broadband spectra, however, the frequency axis (X-axis) is not translation invariant. A signal located at 100 MHz (FM band) and a signal located at 1000 MHz exhibit significant differences in bandwidth, duration, and modulation texture. The frequency injection mechanism breaks translation invariance by injecting frequency coordinate constraints. This is equivalent to endowing the convolutional kernel with frequency band awareness, enabling the network to dynamically adjust feature weights according to the frequency band of the current signal during feature extraction, so that the model no longer relies on fixed visual templates to detect signals across all frequency bands.
[0142] In practice, the background noise of a real electromagnetic environment fluctuates and contains colored noise. With only a single-channel input, the basic YOLOv8 model has to separate the signal from the background noise within a limited network depth, which is quite challenging. The dual-channel input mechanism essentially introduces a physical prior. The signal probability matrix removes most of the background noise, allowing the receptive field of the first convolutional kernel to focus on high-energy regions. This makes it much easier for the network to learn the "noise-signal" boundary, thereby significantly reducing false alarms in noisy environments.
[0143] Compared to the three models mentioned above (YOLOv8s, YOLOv8s+FreqIn, YOLOv8s+Dual-Channel), the YOLOv8s-FDC model achieved the best global performance in the module ablation experiment.
Claims
1. A broadband signal detection method based on dual-channel features and frequency injection, characterized in that the method includes the following steps:
S1: acquire the time-frequency slice of the broadband radio frequency signal to be detected and its corresponding center frequency information;
S2: use the time-frequency slice as the first channel and the signal probability matrix generated based on the noise floor fitting result as the second channel, and perform channel dimension splicing;
S3: map the center frequency information into a high-dimensional feature vector, inject the high-dimensional feature vector into the feature extraction layer of the convolutional neural network, dynamically adjust the convolutional kernel weights, and output a multi-scale feature map; wherein the feature extraction layer includes a cascaded frequency-aware feature extraction module and a spatial pyramid pooling module;
S4: based on the multi-scale feature map output by the spatial pyramid pooling module, output the target confidence and bounding box regression parameters through the decoupled detection head to complete the wideband signal detection;
wherein the feature extraction layer of the convolutional neural network employs an asymmetric downsampling strategy, and the upper limit of the bounding box regression parameters is extended to accommodate the wideband signal span.
2. The method according to claim 1, characterized in that using the signal probability matrix generated based on the noise floor fitting result as the second channel, as described in S2, specifically includes:
transforming the original time-frequency slice from the logarithmic domain to the linear domain, and performing average pooling or mean calculation along the time axis to generate a one-dimensional power spectrum vector characterizing the background energy distribution;
transforming the one-dimensional power spectrum vector back to the logarithmic domain and standardizing it, then inputting it into a pre-trained noise floor fitting model to output a one-dimensional noise floor estimate;
extending the one-dimensional estimate into a two-dimensional matrix along the time axis;
subtracting the noise floor estimate from the original two-dimensional time-frequency slice to obtain a difference tensor;
mapping the difference tensor into a signal probability matrix with values in the range [0,1] through a Sigmoid activation function layer.
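The steps of claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the time-frequency slice is assumed to be a list of time rows holding power in dB, the pre-trained noise floor fitting model is replaced by a hypothetical stand-in (`median_floor`), and the model output is assumed to be in standardized units that are mapped back to dB before subtraction.

```python
import math
import statistics

def signal_probability_matrix(tf_slice_db, fit_noise_floor):
    """tf_slice_db[t][f]: power in dB. Returns a matrix of probabilities in [0, 1]."""
    n_time, n_freq = len(tf_slice_db), len(tf_slice_db[0])
    # Log domain -> linear domain, then mean along the time axis.
    mean_linear = [
        sum(10.0 ** (row[f] / 10.0) for row in tf_slice_db) / n_time
        for f in range(n_freq)
    ]
    # Back to the log domain, then standardize (zero mean, unit variance).
    spectrum_db = [10.0 * math.log10(p) for p in mean_linear]
    mu = statistics.fmean(spectrum_db)
    sigma = statistics.pstdev(spectrum_db) or 1.0
    standardized = [(s - mu) / sigma for s in spectrum_db]
    # Assumption: the model returns the floor in standardized units; map it back to dB.
    floor_db = [v * sigma + mu for v in fit_noise_floor(standardized)]
    # Broadcast the 1-D floor along time, subtract, and apply the Sigmoid.
    return [
        [1.0 / (1.0 + math.exp(-(row[f] - floor_db[f]))) for f in range(n_freq)]
        for row in tf_slice_db
    ]

def median_floor(standardized_spectrum):
    """Hypothetical stand-in for the pre-trained model: a flat floor at the median."""
    m = statistics.median(standardized_spectrum)
    return [m] * len(standardized_spectrum)

# Hypothetical slice: 4 time steps x 6 frequency bins, -100 dB noise, -80 dB signal in bin 2.
tf_slice = [[-100.0, -100.0, -80.0, -100.0, -100.0, -100.0] for _ in range(4)]
prob = signal_probability_matrix(tf_slice, median_floor)
print([round(p, 3) for p in prob[0]])
```

In this sketch, cells at the noise floor map to a probability of about 0.5 because no threshold shift is applied, which corresponds to the threshold-0 setting discussed in the description.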
3. The method according to claim 2, characterized in that the pre-trained noise floor fitting model is a neural network model built on the stacked denoising autoencoder (SAD-DAE) architecture, and the model is configured to learn the spectral distribution characteristics of background noise by minimizing the reconstruction error.
4. The method according to claim 1, characterized in that, In S3, the center frequency information is mapped into a high-dimensional feature vector, specifically including: Obtain the normalized center frequency scalar of the current time-frequency slice; Construct a Gaussian Fourier mapping matrix, and use the mapping matrix to perform matrix multiplication and sine / cosine transformation on the normalized center frequency scalar in sequence, so as to map the one-dimensional frequency scalar into a fixed-dimensional periodic embedding vector, that is, the high-dimensional feature vector.
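A minimal sketch of the Gaussian Fourier mapping in claim 4 is given below. Several details are assumptions rather than statements of the claim: the mapping matrix is drawn once from a zero-mean Gaussian with a hypothetical standard deviation `scale`, a factor of 2π is applied before the sine and cosine, and the center frequency is normalized linearly between hypothetical band limits.

```python
import math
import random

def normalize_center_frequency(f_hz, f_min_hz, f_max_hz):
    """Linear normalization to [0, 1]; the band limits are hypothetical parameters."""
    return (f_hz - f_min_hz) / (f_max_hz - f_min_hz)

def make_gaussian_fourier_matrix(num_frequencies, scale=10.0, seed=0):
    """Fixed random projection (a column vector for a scalar input), sampled from N(0, scale^2)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(num_frequencies)]

def gaussian_fourier_embedding(freq_norm, mapping):
    """Map a normalized frequency scalar to [sin(2*pi*B*f), cos(2*pi*B*f)]."""
    projected = [2.0 * math.pi * b * freq_norm for b in mapping]
    return [math.sin(p) for p in projected] + [math.cos(p) for p in projected]

mapping = make_gaussian_fourier_matrix(8)
f_norm = normalize_center_frequency(443e6, 400e6, 500e6)
embedding = gaussian_fourier_embedding(f_norm, mapping)
print(len(embedding))  # twice the number of rows in the mapping matrix
```

Because the mapping matrix is fixed after sampling, the same center frequency always yields the same embedding, while nearby frequencies yield embeddings that differ smoothly.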
5. The method according to claim 1, characterized in that, In S3, the high-dimensional feature vector is injected into the feature extraction layer of the convolutional neural network, and the convolutional kernel weights are dynamically adjusted, specifically including: The high-dimensional feature vector is input into a multilayer perceptron and decoded to generate scaling and translation coefficients that match the number of channels in the current feature map. The scaling and translation coefficients are used to perform a channel-by-channel affine transformation on the original convolutional feature map in the feature extraction layer, so that the network can adaptively adjust the feature response weights according to the frequency band coordinates. The original convolutional feature map refers to the intermediate feature map directly output by the convolutional layer in the feature extraction layer after performing convolution operations on the input data; The step of performing a channel-wise affine transformation on the original convolutional feature map in the feature extraction layer using the scaling and translation coefficients includes: applying the scaling and translation coefficients to the intermediate feature map to generate a frequency-modulated feature map.
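The channel-by-channel affine transformation of claim 5 can be sketched as below. The sketch assumes a two-layer perceptron with a ReLU hidden layer whose output is split into one scaling and one translation coefficient per channel, and it applies them as `scale * x + shift`; the weights, layer sizes, and the exact form of the affine transform are illustrative assumptions.

```python
def mlp_scale_shift(embedding, w_hidden, b_hidden, w_out, b_out):
    """Decode the frequency embedding into per-channel (scale, shift) coefficients."""
    hidden = [
        max(0.0, sum(w * e for w, e in zip(row, embedding)) + b)  # ReLU
        for row, b in zip(w_hidden, b_hidden)
    ]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w_out, b_out)]
    channels = len(out) // 2
    return out[:channels], out[channels:]

def frequency_modulate(feature_map, scale, shift):
    """feature_map[c][t][f]: apply scale[c] * x + shift[c] to every value of channel c."""
    return [
        [[scale[c] * v + shift[c] for v in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]

# Hypothetical toy weights: 2-dim embedding, 2 hidden units, 2 feature channels.
embedding = [1.0, 0.0]
w_hidden, b_hidden = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_out = [[2.0, 0.0], [0.5, 0.0], [1.0, 0.0], [-1.0, 0.0]]
b_out = [0.0, 0.0, 0.0, 0.0]

scale, shift = mlp_scale_shift(embedding, w_hidden, b_hidden, w_out, b_out)
features = [[[1.0, 2.0]], [[4.0, 6.0]]]  # 2 channels, 1 time step, 2 frequency bins
modulated = frequency_modulate(features, scale, shift)
print(scale, shift, modulated)
```

Since the coefficients depend only on the frequency embedding, the same convolutional feature map is rescaled differently for slices taken from different frequency bands.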
6. The method according to claim 5, characterized in that, The convolutional neural network adopts the YOLO network architecture, and its feature extraction layer includes a frequency-aware feature extraction module. The frequency-aware feature extraction module consists of multiple convolutional layers and frequency modulation layers. The frequency modulation layer is used to perform the affine transformation. The features obtained by the frequency-aware feature extraction module are fed into the pyramid pooling module to obtain a multi-scale feature map.
7. The method according to claim 1, characterized in that, The asymmetric downsampling strategy is as follows: In the deep feature extraction of the convolutional neural network, the stride of the convolutional layer is set to an asymmetric form, that is, the stride in the time axis direction is set to t, and the stride in the frequency axis direction is set to f, t < f, so as to preserve the temporal resolution of burst signals; then, the maximum pooling kernel size in the spatial pyramid pooling module is adjusted from k×k to 1×k or 1×n, n>k, so that the pooling operation only performs feature aggregation in the frequency axis direction, thereby expanding the receptive field in the frequency axis direction while preserving the temporal resolution.
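The effect of the asymmetric stride and the 1×k pooling kernel in claim 7 can be illustrated with a short sketch. It assumes the standard convolution output-size formula and a stride-1, same-size max pooling along the frequency axis with an odd kernel width; the kernel, padding, stride values, and the toy feature map are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def max_pool_1xk(feature_map, k):
    """Stride-1 max pooling over k neighbouring frequency bins (k odd); time rows are untouched.

    feature_map[t][f]: rows are time steps, columns are frequency bins.
    """
    half = k // 2
    n_freq = len(feature_map[0])
    return [
        [max(row[max(0, f - half):min(n_freq, f + half + 1)]) for f in range(n_freq)]
        for row in feature_map
    ]

# Asymmetric stride: time stride t = 1, frequency stride f = 2 (t < f).
time_out = conv_output_size(640, kernel=3, stride=1, padding=1)
freq_out = conv_output_size(640, kernel=3, stride=2, padding=1)
print(time_out, freq_out)

# A single active cell spreads along the frequency axis only.
toy = [[0, 0, 9, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]
pooled = max_pool_1xk(toy, k=5)
print(pooled)
```

The second time row stays at zero after pooling, showing that a 1×k kernel enlarges the receptive field along frequency without mixing adjacent time steps.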
8. The method according to claim 1, characterized in that, The upper limit of the bounding box regression parameter described in S4 is expanded, and its value is determined based on the maximum physical width of the input time-frequency slice, so as to match and cover the maximum physical boundary of the broadband signal spanning the entire frequency band, eliminating the risk of the bounding box being forcibly truncated.
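How the regression upper limit relates to the slice width can be sketched under an explicit assumption that is not stated in the claim: the detection head is taken to predict each side offset of a box as one of `reg_max` integer bins (0 to `reg_max - 1`) measured in units of the feature-map stride, as in distribution-based regression heads. The function names, stride, and slice width below are hypothetical.

```python
import math

def max_box_width_px(reg_max, stride):
    """Widest box representable when each side offset is at most (reg_max - 1) * stride."""
    return 2 * (reg_max - 1) * stride

def required_reg_max(slice_width_px, stride):
    """Smallest reg_max whose widest box covers the full slice width."""
    return math.ceil(slice_width_px / (2 * stride)) + 1

default_width = max_box_width_px(reg_max=16, stride=8)
needed = required_reg_max(slice_width_px=640, stride=8)
print(default_width, needed, max_box_width_px(needed, 8))
```

Under this assumption, a limit of 16 bins at stride 8 cannot represent a box spanning a 640-pixel-wide slice, whereas a limit derived from the slice width can.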
9. The method according to claim 1, characterized in that, The processing method of the first channel is as follows: only local Min-Max linear normalization is performed on the original time-frequency slice to map the power into the non-negative interval in order to preserve the modulation texture features inside the signal.
10. The method according to claim 9, characterized in that, The local Min-Max linear normalization refers to: based on the sliding window mechanism, for each pixel in the time-frequency slice, selecting a neighborhood window of a preset size centered on the pixel, calculating the maximum and minimum power values within the neighborhood window, and using the maximum and minimum power values to perform normalization mapping on the pixel.
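The sliding-window normalization of claim 10 can be sketched as follows, alongside a global Min-Max normalization for comparison. The window size, the `eps` guard against division by zero, the clipping of the window at the slice borders, and the example power values are assumptions made for illustration.

```python
def local_min_max_normalize(tf_slice, window=3, eps=1e-12):
    """Normalize each pixel by the min and max of the window centred on it (clipped at borders)."""
    half = window // 2
    n_time, n_freq = len(tf_slice), len(tf_slice[0])
    out = []
    for t in range(n_time):
        row_out = []
        for f in range(n_freq):
            neighborhood = [
                tf_slice[i][j]
                for i in range(max(0, t - half), min(n_time, t + half + 1))
                for j in range(max(0, f - half), min(n_freq, f + half + 1))
            ]
            lo, hi = min(neighborhood), max(neighborhood)
            row_out.append((tf_slice[t][f] - lo) / (hi - lo + eps))
        out.append(row_out)
    return out

def global_min_max_normalize(tf_slice):
    """Normalize every pixel by the min and max of the whole slice."""
    flat = [v for row in tf_slice for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in tf_slice]

# Hypothetical one-row slice in dB: weak signal in bin 1, strong signal in bin 5.
tf_slice = [[-100.0, -95.0, -100.0, -100.0, -100.0, -20.0, -100.0]]
global_norm = global_min_max_normalize(tf_slice)
local_norm = local_min_max_normalize(tf_slice, window=3)
print(global_norm[0][1], round(local_norm[0][1], 3))
```

In this example the weak signal is compressed to 0.0625 by the global normalization because of the strong signal elsewhere in the slice, while the local normalization keeps it near 1.0 within its own neighborhood.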
Citation Information
Patent Citations
Feature extraction method and system based on two-dimensional time-frequency analysis, and storage medium
CN118823366A
Radar signal detection and identification method and device based on deep learning
CN120468794A