A machine vision-based sound barrier sound-absorbing material filling quality detection method
Infrared and visible light synchronized video frames are registered at pixel level, and an improved octave convolution module introduces a mid-frequency branch on which deformable convolution is performed. Combined with a cross-attention mechanism and inter-frame motion compensation, this addresses the missing mid-frequency transition band in frequency-domain features and the lack of prior information in inter-frame detection when inspecting the filling quality of sound-absorbing materials in sound barriers, achieving more accurate and continuous detection results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUBEI LUAN NEW MATERIALS CO LTD
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-26
AI Technical Summary
Existing methods for detecting the filling quality of sound-absorbing materials in sound barriers lack mid-frequency transition band information in frequency-domain feature extraction, their multi-scale feature aggregation lacks prior guidance, and their inter-frame detection lacks continuity, resulting in poor continuity of detection results, boundary jitter, and missed detections.
By pixel-level registration of infrared and visible light synchronized video frames, an improved octave convolution module detection network is constructed. An intermediate frequency branch is introduced and an offset field is generated for deformable convolution. Cross-frequency fusion is completed by combining a cross-attention mechanism. Multi-scale feature aggregation is performed using dilated convolution layers, and feature extraction is performed in the next frame after inter-frame motion compensation.
It compensates for information gaps in frequency decomposition, improves the accuracy of multi-scale feature aggregation, reduces boundary jitter and missed detections in video stream detection, and enhances the continuity and accuracy of detection.
Figure CN122289263A_ABST
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a machine vision-based method for detecting the filling quality of sound-absorbing materials in sound barriers.
Background Technology
[0002] As a core noise reduction component in traffic noise control projects, the quality of the sound-absorbing material filling within sound barriers directly affects their overall noise reduction performance. During actual production, construction, installation, and long-term service, sound-absorbing materials often exhibit filling quality defects such as internal voids, sparse distribution, or misalignment due to improper filling processes, gravity settlement, or environmental vibrations. Traditional quality inspection methods rely heavily on manual tapping and listening, or partial disassembly and sampling, which are inefficient, subjective, and compromise the structural integrity of the components, making them unsuitable for large-scale engineering inspections.
[0003] In the prior art, Chinese patent document CN117557775B discloses a method for detecting substation power equipment based on infrared and visible light fusion. This method uses a cross-attention structure with dynamically weighted adaptive allocation to fuse multimodal features from infrared and visible light images, and then outputs the detection results via a feature prediction network. While this method employs a strategy combining pixel-wise weighting and an attention mechanism at the multimodal fusion level, it does not perform multi-band decomposition of frequency domain features during feature extraction, does not involve the construction and utilization of mid-frequency transition band information, and does not adjust the convolution sampling position based on the differences in information richness in local regions of different modal images.
[0004] Visible light imaging can capture the texture and boundary details of a material's surface, while infrared imaging can reflect the density distribution within the material through differences in thermal conductivity. In the frequency domain feature extraction stage of detection using dual-modal information, the conventional octave convolution method only divides the features into high-frequency and low-frequency paths, ignoring the mid-frequency transition band information that contains both surface texture and internal thermal gradient features. The above frequency decomposition method also faces another limitation when processing dual-modal input: it fails to adjust the convolution sampling position according to the differences in information distribution between the infrared and visible light channels in local space, leading to alignment deviations and information loss during cross-band feature interaction.
[0005] The filling defects in sound-absorbing materials for sound barriers vary greatly in scale, ranging from tiny gaps to large-area settlement. In practical deployments, it has been found that existing receptive field aggregation methods apply the same concatenation weights to features at different scales and cannot adjust the bias according to statistical priors of the actual defect scale. This limits the ability of multi-scale feature aggregation to represent defects with complex morphology. Furthermore, when detecting continuous video sequences, existing networks often treat each frame as an independent image and fail to extract the prior features of defect regions identified in the current frame and pass them to subsequent frames. Lacking inter-frame motion compensation and a cross-frame defect prior feedback mechanism, they are susceptible to background interference or slight jitter during actual inspections, resulting in poor continuity of detection results, boundary jitter, and missed detections.
Summary of the Invention
[0006] To address the technical problems of existing methods for detecting the filling quality of sound-absorbing materials in sound barriers, namely the missing mid-frequency transition band in frequency-domain features and the lack of prior guidance in multi-scale aggregation and inter-frame detection, this application provides a machine vision-based method for detecting the filling quality of sound-absorbing materials in sound barriers, including: A synchronized infrared and visible light video frame sequence is acquired and, after pixel-level registration, concatenated along the channel dimension to form multi-channel input features; A detection network with an improved octave convolution module is constructed. The multi-channel input features are decomposed into three branches: high frequency, mid frequency and low frequency. The mid-frequency branch is generated by concatenating the high-frequency features after low-pass filtering and downsampling with the low-frequency features after upsampling and high-pass filtering. An offset field is generated based on the ratio of the local spatial entropy of the infrared and visible light channels, and deformable convolution is performed on the mid-frequency branch. For each branch, the per-channel variance of the feature map is averaged along the channel dimension, a minimum constant is added, and the square root of the result is used as the temperature coefficient. Cross-frequency fusion is completed through a cross-attention mechanism. The three-frequency fusion features are input into multiple dilated convolutional layers with different dilation rates. The output of each layer is weighted by element-wise multiplication with a learnable scale bias coefficient, and the weighted outputs are concatenated along the channel dimension. The initial values of the scale bias coefficients are obtained by mapping the expected value of the distribution of the logarithm of the defect connected-component areas in the training set.
The defect bounding boxes and filling density levels are output by a dual-head decoder. When sparse or hollow defect regions exist in the current frame, the low-frequency feature map of the defect region is extracted as the spatial prior weight matrix, which is fed back to the feature extraction stage of the next frame after inter-frame motion compensation.
[0007] This application concatenates pixel-level registered infrared and visible light synchronized video frames into multi-channel input features and introduces a mid-frequency branch into the detection network, which compensates for the information gap between high and low frequencies in conventional frequency decomposition. An offset field is generated based on the ratio of the local spatial entropy of the infrared and visible light channels, and deformable convolution is performed on the mid-frequency branch, so that the convolution sampling points shift toward regions of thermal radiation anomalies or abrupt texture changes. For each branch, the per-channel variance of the feature map is averaged along the channel dimension, a minimum constant is added, and the square root of the result is used as the temperature coefficient. Cross-frequency fusion is completed through a cross-attention mechanism. The three-frequency fused features are processed by dilated convolutional layers with different dilation rates and then weighted by scale bias coefficients, so that the aggregation proportions of the multi-scale receptive fields match the defect scale distribution. When a defect region exists in the current frame, the low-frequency feature map is extracted as a spatial prior weight matrix and fed back to the feature extraction stage of the next frame through inter-frame motion compensation. This reduces boundary jitter and missed detections caused by background interference or jitter in video stream detection.
[0008] Preferably, the step of concatenating the pixel-level registered frames along the channel dimension to form multi-channel input features includes: Edge contour feature points are extracted from the infrared video frames and the visible light video frames respectively, and the two sets of feature points are initially matched to generate matching point pairs. Using the visible light video frames as the reference, the random sample consensus (RANSAC) algorithm is used to eliminate mismatched point pairs, and the homography transformation matrix between the infrared and visible light images is calculated. Based on the homography transformation matrix, perspective transformation and bilinear interpolation resampling are performed on the infrared video frames to achieve pixel-level spatial alignment between the infrared video frames and the visible light video frames. The aligned single-channel infrared image and the three-channel visible light image are concatenated along the channel dimension to generate a four-channel tensor as the multi-channel input feature.
[0009] Preferably, the generation method of the mid-frequency branch includes: performing local spatial smoothing on the features of the high-frequency branch using a Gaussian kernel function, and downsampling through a max pooling layer with a stride of 2; performing double upsampling on the features of the low-frequency branch using bilinear interpolation, and extracting the high-frequency edge response after upsampling using a Laplacian operator; unifying the smoothed high-frequency features after downsampling and the low-frequency features after extracting the edge response to the same spatial resolution and then concatenating them in the channel dimension, performing channel dimensionality reduction and feature fusion through a 1×1 convolutional layer to generate the mid-frequency branch, which contains smoothed high-frequency texture details and low-frequency structural edge information at the channel level.
[0010] Preferably, the step of generating the offset field based on the ratio of the local spatial entropy of the infrared and visible light channels and performing deformable convolution on the mid-frequency branch includes: Set up a sliding window and calculate the information entropy of the pixel grayscale distribution of the infrared channel and of the visible light channel within the local window respectively; Calculate the ratio of the infrared spatial entropy to the visible light spatial entropy, take the logarithm of the ratio matrix and normalize it using the hyperbolic tangent function to obtain a weight map with values in the range of [-1, 1]. Input the weight map into a micro-convolutional network containing two consecutive convolutional layers, and output an offset field whose number of channels is twice the number of spatial sampling points of the deformable convolution kernel; Using the offset field as the deformation reference of the sampling grid, deformable convolution is performed on the mid-frequency branch, so that the sampling grid deforms spatially according to the difference in information richness between the two modalities.
[0011] Preferably, the cross-frequency fusion via the cross-attention mechanism includes: Calculate the variance of the feature maps of the high-frequency branch, the mid-frequency branch and the low-frequency branch in each channel in the spatial dimension. Take the mean of the variance of each channel along the channel dimension, add a minimum constant and take the square root, and use it as the temperature coefficient of the corresponding branch. The query vector is generated by mapping mid-frequency features, and the key vector and value vector are generated by mapping high-frequency and low-frequency features together. Calculate the inner product matrix between the query vector and the key vector, divide the inner product matrix by the temperature coefficient of the mid-frequency branch, scale it, and then obtain the attention weight matrix through the Softmax function; The attention weight matrix is multiplied by the value vector, and the output weighted features are residually connected to the original mid-frequency features.
[0012] With the temperature coefficient introduced, when the feature responses are highly dispersed, the scaling operation makes the attention weights tend toward a uniform distribution, reducing the interference of isolated noise responses with the fusion result.
[0013] Preferably, the step of inputting the three-frequency fusion features into multiple dilated convolutional layers with different dilation rates, weighting the output of each layer by element-wise multiplication with a learnable scale bias coefficient, and concatenating the weighted outputs includes: Set up four parallel dilated convolution branches with dilation rates of 1, 6, 12, and 18 respectively; Collect the areas of all labeled defect connected components in the training set, take the natural logarithm of each area, and fit a Gaussian distribution to obtain the expected value and standard deviation. The expected value and standard deviation are mapped through a multilayer perceptron and normalized by the Sigmoid function to generate a four-dimensional vector, which serves as the initial value of the scale bias coefficients of the four dilated convolution branches. During training, the output features of each dilated convolution branch are weighted by element-wise multiplication with its scale bias coefficient and then concatenated along the channel dimension, so that the weighting of each branch matches the actual defect area distribution.
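The initialization of the scale bias coefficients described in [0013] can be sketched as follows. This is only an illustration under stated assumptions: the maximum-likelihood Gaussian fit is taken to be the sample mean and population standard deviation of the log areas, and the multilayer perceptron is a hypothetical single-hidden-layer network with ReLU whose weights (`w1`, `b1`, `w2`, `b2`) are placeholders, since the text does not specify its structure.

```python
import numpy as np

def log_area_stats(areas):
    """Fit a Gaussian to ln(area): returns (expected value, standard deviation)."""
    logs = np.log(np.asarray(areas, dtype=np.float64))
    return float(logs.mean()), float(logs.std())

def init_scale_bias(mu, sigma, w1, b1, w2, b2):
    """Map (mu, sigma) through a small MLP (assumed structure) and a Sigmoid."""
    hidden = np.maximum(w1 @ np.array([mu, sigma]) + b1, 0.0)  # ReLU hidden layer
    logits = w2 @ hidden + b2                                   # one value per branch
    return 1.0 / (1.0 + np.exp(-logits))                        # Sigmoid -> (0, 1)

def weighted_concat(branch_outputs, scale_bias):
    """Scale each dilated-branch output (C, H, W) and concatenate along channels."""
    return np.concatenate([s * f for s, f in zip(scale_bias, branch_outputs)], axis=0)

# Example with placeholder MLP weights (hypothetical hidden width of 8).
rng = np.random.default_rng(0)
mu, sigma = log_area_stats([20.0, 150.0, 900.0, 4000.0])
coeffs = init_scale_bias(mu, sigma,
                         rng.normal(size=(8, 2)), np.zeros(8),
                         rng.normal(size=(4, 8)), np.zeros(4))
```

The four resulting coefficients correspond to the branches with dilation rates 1, 6, 12, and 18 and remain learnable during training.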
[0014] Preferably, the step of feeding back to the feature extraction stage of the next frame after inter-frame motion compensation includes: when the classification branch of the dual-head decoder determines that the filling density level of a detection box region in the current frame is sparse or hollow, a binary mask with the same size as the detection box is generated, and the binary mask is scaled to the spatial size of the low-frequency branch feature map by nearest neighbor interpolation; the activated feature map of the corresponding receptive field region of the low-frequency branch in the current frame is extracted using the scaled binary mask, and a spatial prior weight matrix is generated by applying max pooling along the channel dimension followed by the Sigmoid function; after inter-frame motion compensation is performed on the spatial prior weight matrix, it is scaled to the spatial size of the input features of the next frame by bilinear interpolation and multiplied element-wise with the features of the corresponding coordinate region in the next frame; for the first frame of the video sequence, or when no defects are detected in the previous frame, the spatial prior weight matrix is initialized to an all-zero matrix.
[0015] Preferably, the multiple dilated convolutional layers with different dilation rates are accompanied by a global average pooling branch. The output of the global average pooling branch is upsampled back to the spatial size of the dilated convolution branches and then concatenated along the channel dimension with the weighted output of each dilated convolution branch. The concatenated features are then fused and reduced in channel dimension by a 1×1 convolutional layer.
[0016] Preferably, the detection network is trained by using the Complete Intersection over Union (CIoU) loss function to regress the defect bounding boxes, combined with the focal loss function to supervise the classification of the filling density level, so as to constrain the update of the detection network weights.
[0017] Preferably, the spatial prior weight matrix is added to a constant value after motion compensation and then subjected to upper limit clamping to generate an enhancement coefficient matrix. The enhancement coefficient matrix is then multiplied element-wise with the input feature map of the next frame within the corresponding coordinate range of the defect region.
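The enhancement step in [0017] can be sketched as below. The constant `base` and the clamp value `upper` are hypothetical, since the text only says "a constant value" and "upper limit clamping"; the prior is assumed to be already motion-compensated and resized to the feature map.

```python
import numpy as np

def apply_prior(next_feat, prior, box, base=1.0, upper=1.5):
    """Enhance next-frame features inside the defect box.

    next_feat: (C, H, W) input features of the next frame.
    prior:     (H, W) motion-compensated spatial prior weight matrix.
    box:       (y0, x0, y1, x1) defect region in feature coordinates.
    """
    y0, x0, y1, x1 = box
    coeff = np.minimum(prior + base, upper)        # add constant, clamp upper limit
    out = next_feat.astype(np.float64).copy()
    out[:, y0:y1, x0:x1] *= coeff[y0:y1, x0:x1]    # element-wise, defect region only
    return out
```

With `base=1.0`, an all-zero prior yields a coefficient of 1 everywhere, so frames without a previously detected defect pass through unchanged.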
[0018] The technical solution of this application has the following beneficial technical effects: This application stitches infrared and visible light synchronized video frames into multi-channel input features through pixel-level registration and introduces an intermediate frequency branch into the detection network. This compensates for the information gap between high and low frequencies in conventional frequency decomposition. An offset field is generated based on the ratio of the local spatial entropy of the infrared and visible light channels, and deformable convolution is performed on the intermediate frequency branch. The variance within each channel of each branch feature map is averaged along the channel dimension, a minimum constant is added, and the square root is taken as the temperature coefficient. Cross-frequency fusion is completed through a cross-attention mechanism. After processing by dilated convolutional layers with different dilation rates, the feature maps are weighted and stitched with scale bias coefficients and output by a dual-head decoder. When there is a defect in the current frame, the low-frequency feature map is extracted as a spatial prior weight matrix and fed back to the next frame through inter-frame motion compensation. This reduces boundary jitter and missed detections in video stream detection.
[0019] Furthermore, during the mid-frequency branch generation process, high-frequency features are smoothed with a Gaussian kernel and then downsampled, while low-frequency features are upsampled and their edge responses are extracted using the Laplacian operator, before the two are fused by convolutional dimensionality reduction. In the offset field generation stage, the local information entropy of the infrared and visible light channels is calculated separately through a sliding window, and the ratio is input into the micro-convolutional network after nonlinear mapping. During the training stage, the Complete Intersection over Union (CIoU) loss function and the focal loss function are used for joint supervision, concentrating the gradient on hard-to-classify samples. The spatial prior weight matrix is added to a constant value after motion compensation and clamped to an upper limit to form an enhancement coefficient matrix, which is then multiplied element-wise within the defect region to enhance the response at historical defect locations.
Attached Figure Description
[0020] Figure 1 is a flowchart of the machine vision-based method for detecting the filling quality of sound-absorbing materials in a sound barrier described in this application; Figure 2 is a schematic diagram illustrating the scaling of the attention weight distribution in cross-frequency interaction; Figure 3 is a schematic diagram of the Gaussian fit to the distribution of the logarithmic defect area; Figure 4 is a schematic diagram comparing model performance in the ablation experiment.
Detailed Implementation
[0021] The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments.
[0022] Referring to Figure 1, a machine vision-based method for detecting the filling quality of sound-absorbing materials in a sound barrier comprises steps S1 to S4.
[0023] S1, multimodal video acquisition and pixel registration.
[0024] The system uses a hardware-triggered synchronization control protocol to acquire the video stream output by a binocular acquisition system consisting of an infrared thermal imager and a visible light industrial camera. It reads the dual-source video stream and extracts the synchronized infrared and visible light video frame sequences according to their timestamps. The infrared video frames acquired by the infrared thermal imager have a resolution of 640×512 and a bit depth of 16 bits per channel; the visible light video frames acquired by the visible light industrial camera have a resolution of 1920×1080 and a bit depth of 8 bits per channel. Due to the difference in viewing angle between the infrared thermal imager and the visible light industrial camera, pixel-level registration of the infrared and visible light video frames is required before they are stitched together along the channel dimension to form a multi-channel input feature.
[0025] Using the visible light video frames as the reference coordinate system, edge contour feature points are extracted from both the infrared and visible light video frames. The Canny edge detection algorithm is used for edge extraction, with the low and high thresholds set to 50 and 150 respectively for the infrared video frames and to 100 and 200 respectively for the visible light images. The ORB algorithm is used to extract corner points from the extracted edge maps and generate feature descriptor vectors. Initial matching point pairs are calculated using brute-force matching with the Hamming distance, and erroneous matching pairs are eliminated using a distance ratio threshold of 0.75.
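The brute-force Hamming matching with the 0.75 distance ratio test can be sketched in NumPy as follows. This is an illustrative stand-in rather than the embodiment's actual code: the descriptors are assumed to be binary descriptors packed into `uint8` rows (as ORB descriptors commonly are), and the function name `hamming_ratio_match` is hypothetical.

```python
import numpy as np

def hamming_ratio_match(desc_a, desc_b, ratio=0.75):
    """Match binary descriptors (uint8 rows) by Hamming distance with a ratio test.

    Returns a list of (index_in_a, index_in_b, distance). desc_b needs >= 2 rows.
    """
    # XOR every descriptor pair, then count the differing bits.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)   # shape (len(a), len(b))
    matches = []
    for i, row in enumerate(dist):
        order = np.argsort(row, kind="stable")
        best, second = order[0], order[1]
        # Keep the pair only if the best match is clearly better than the runner-up.
        if row[best] < ratio * row[second]:
            matches.append((i, int(best), int(row[best])))
    return matches
```

Pairs that survive this test are the "filtered matching point pairs" passed on to the homography estimation.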
[0026] It should be noted that the initial matching point pairs contain a large number of erroneous matches, and using them directly to solve for the transformation matrix would reduce registration accuracy. Therefore, the filtered matching point pairs are input into the random sample consensus (RANSAC) algorithm to iteratively calculate the homography transformation matrix between the infrared and visible light images. During the RANSAC iterations, the maximum number of iterations is set to 2000 and the reprojection error threshold is set to 3 pixels. Too few iterations would make the matrix estimate unstable and make it difficult to reject all abnormal matching points, while too many iterations would increase computation time with diminishing accuracy gains; the value of 2000 in this embodiment is an engineering compromise. The reprojection error threshold must match the registration accuracy required downstream. If the threshold is too large, the retained inliers will still contain matching pairs with large errors; if it is too small, reasonable matching points may be eliminated, leaving insufficient effective data. In this embodiment, 3 pixels is selected.
[0027] After obtaining the 3×3 homography transformation matrix, perspective transformation and bilinear interpolation resampling are performed on the infrared video frame according to the homography transformation matrix. The spatial resolution of the infrared frame is resampled to the same size as the visible light frame, 1920×1080, so that the infrared video frame and the visible light video frame achieve pixel-level spatial alignment.
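The perspective transformation with bilinear resampling can be illustrated with a small NumPy sketch. It assumes `H` maps infrared pixel coordinates (x, y) to visible-light pixel coordinates, and that output pixels falling outside the infrared frame are filled with zero; both are assumptions for illustration, and the function name `warp_bilinear` is hypothetical.

```python
import numpy as np

def warp_bilinear(src, H, out_h, out_w):
    """Warp single-channel `src` onto an (out_h, out_w) grid using homography H.

    Each output pixel is mapped back through H^-1 and sampled bilinearly.
    """
    h, w = src.shape
    H_inv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # homogeneous (3, N)
    sx, sy, sw = H_inv @ dst
    sx, sy = sx / sw, sy / sw                                    # source coordinates
    valid = (sx >= 0) & (sx <= w - 1) & (sy >= 0) & (sy <= h - 1)
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    fx, fy = sx - x0, sy - y0
    top = src[y0, x0] * (1 - fx) + src[y0, x0 + 1] * fx
    bot = src[y0 + 1, x0] * (1 - fx) + src[y0 + 1, x0 + 1] * fx
    out = np.where(valid, top * (1 - fy) + bot * fy, 0.0)        # zero outside source
    return out.reshape(out_h, out_w)
```

In the embodiment the output grid would be 1080 rows by 1920 columns, matching the visible light frame.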
[0028] The aligned infrared image is mapped to the range of 0 to 255 using min-max normalization and converted to an 8-bit single-channel format. Along the channel dimension, the infrared tensor with 1 channel, a height of 1080, and a width of 1920 is concatenated with the visible light tensor with 3 channels, a height of 1080, and a width of 1920, generating a four-channel floating-point tensor with 4 channels, a height of 1080, and a width of 1920. This tensor is standardized with means of 0.485, 0.456, 0.406, and 0.5 and standard deviations of 0.229, 0.224, 0.225, and 0.2, respectively, and used as the standardized multi-channel input feature for the subsequent detection network. The means and standard deviations of the first three channels follow the empirical statistics of the ImageNet dataset, while the mean of 0.5 and standard deviation of 0.2 for the fourth channel, corresponding to the infrared image, are determined from the statistical distribution of the infrared image grayscale. In other implementations, the standardization parameters can be adjusted based on the statistical distribution of the multimodal sound barrier dataset actually collected.
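A minimal sketch of the normalization and channel concatenation in [0028] follows. One step is an assumption not stated in the text: the 8-bit values are divided by 255 before standardization, because the listed means and standard deviations are on a 0 to 1 scale. The function name `build_input` is hypothetical.

```python
import numpy as np

# Per-channel statistics from [0028]: three visible channels, then infrared.
MEAN = np.array([0.485, 0.456, 0.406, 0.5]).reshape(4, 1, 1)
STD = np.array([0.229, 0.224, 0.225, 0.2]).reshape(4, 1, 1)

def build_input(ir16, rgb8):
    """ir16: (H, W) aligned 16-bit infrared; rgb8: (H, W, 3) 8-bit visible.

    Returns a standardized (4, H, W) float32 tensor.
    """
    ir = ir16.astype(np.float64)
    span = ir.max() - ir.min()
    # Min-max normalization to 0..255, guarding against a flat image.
    ir8 = np.zeros_like(ir) if span == 0 else (ir - ir.min()) / span * 255.0
    ir8 = np.round(ir8).astype(np.uint8)
    stacked = np.concatenate([rgb8.transpose(2, 0, 1), ir8[None]], axis=0)
    x = stacked.astype(np.float64) / 255.0       # assumed scaling to [0, 1]
    return ((x - MEAN) / STD).astype(np.float32)
```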
[0029] This detection network is a deep neural network for target localization and classification. The network takes a normalized multi-channel input feature tensor aligned with multimodal characteristics as input and outputs a detection result containing the defect target category, confidence level, and spatial bounding box coordinates. The network structure includes a backbone module for basic multi-level feature extraction and high- and low-frequency feature decoupling, a neck network module for cross-scale and cross-modal feature fusion and aggregation, and a decoder detection head module for predicting bounding boxes and category attributes.
[0030] S2, improved octave convolution and cross-frequency fusion.
[0031] A detection network incorporating an improved octave convolution module is constructed to decompose multi-channel input features into high-frequency, mid-frequency, and low-frequency branches in the spatial domain. The backbone network receives standardized multi-channel input feature tensors, extracts basic features through conventional convolutional layers, and then splits the features into high-frequency and low-frequency branches according to channel proportions. The high-frequency branch preserves spatial details and texture edge information, while the low-frequency branch retains global structure and low-frequency background information after 2D convolutional downsampling with a stride of 2.
[0032] The construction of the mid-frequency branch requires the simultaneous utilization of texture details in high-frequency features and structural boundary information in low-frequency features. For high-frequency features with an input dimension of 128 channels, a spatial height of 128, and a spatial width of 128, a two-dimensional Gaussian kernel function with a kernel size of 5×5 and a standard deviation of 1.2 is used for local spatial smoothing. The Gaussian cutoff frequency corresponding to the standard deviation of 1.2 suppresses surface particulate noise while preserving the edge contour of the sound-absorbing holes. A stride of 1 and zero-padding are used to maintain the spatial dimension. After passing through a max-pooling layer with a kernel size of 2×2 and a stride of 2, the feature space is downsampled to a size of 128 channels, a height of 64, and a width of 64.
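The high-frequency path of [0032] (5×5 Gaussian smoothing with a standard deviation of 1.2, stride 1, zero padding, then 2×2 max pooling with stride 2) can be sketched as below. The smoothing is assumed to be applied to each channel independently with the same kernel; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.2):
    """Normalized 2D Gaussian kernel."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def smooth_and_pool(feat, kernel):
    """feat: (C, H, W) with even H and W. Returns (C, H/2, W/2)."""
    c, h, w = feat.shape
    r = kernel.shape[0] // 2
    padded = np.pad(feat, ((0, 0), (r, r), (r, r)))      # zero padding keeps H x W
    out = np.zeros((c, h, w), dtype=np.float64)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    # 2x2 max pooling with stride 2.
    return out.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))
```

Applied to a tensor of shape (128, 128, 128), this yields the (128, 64, 64) tensor described above.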
[0033] The low-frequency feature map has 64 channels, a height of 32, and a width of 32. After bilinear interpolation doubles the spatial scale, the feature map has 64 channels, a height of 64, and a width of 64. A 3×3 discrete Laplacian operator with a center weight of 8 and eight neighboring weights of -1 is used to perform channel-wise high-pass filtering convolution on the upsampled features, extracting the high-frequency response at the edges of the sound-absorbing holes on the material surface. The convolution weights are fixed and not updated, the stride is 1, and reflection padding is used at the borders to avoid boundary artifacts.
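The low-frequency path of [0033] can be sketched as follows. The bilinear upsampling here uses half-pixel sample centers with border clamping, which is one common convention and an assumption, since the text does not fix the alignment rule. The Laplacian uses the stated fixed weights and reflection padding.

```python
import numpy as np

LAPLACIAN = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float64)

def upsample2x_bilinear(feat):
    """feat: (C, H, W) -> (C, 2H, 2W) using half-pixel centers (assumed convention)."""
    c, h, w = feat.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    fy = (ys - y0)[None, :, None]
    fx = (xs - x0)[None, None, :]
    top = feat[:, y0][:, :, x0] * (1 - fx) + feat[:, y0][:, :, x1] * fx
    bot = feat[:, y1][:, :, x0] * (1 - fx) + feat[:, y1][:, :, x1] * fx
    return top * (1 - fy) + bot * fy

def laplacian_highpass(feat):
    """Channel-wise 3x3 Laplacian, stride 1, reflection padding, fixed weights."""
    c, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="reflect")
    out = np.zeros((c, h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    return out
```

Because the kernel weights sum to zero, flat regions produce no response and only edges survive.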
[0034] Both feature paths now have a spatial resolution of 64 in height and 64 in width. After concatenation along the channel dimension, an intermediate mixed tensor with 192 channels (128 + 64), a height of 64, and a width of 64 is formed. A 1×1 pointwise convolutional layer with batch normalization and SiLU activation compresses the number of channels to 96, generating the mid-frequency branch. In other implementations, the number of channels in the mid-frequency branch can be adjusted within the range of 64 to 128 to balance the overall computational cost of the network against detection accuracy.
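The channel arithmetic of [0034] can be checked with a short sketch. A 1×1 convolution is a per-pixel linear map over channels, written here with `einsum`. Batch normalization is omitted for brevity, and the weights are random placeholders rather than trained values.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def fuse_mid_branch(high_path, low_path, weight, bias):
    """Concatenate (128, H, W) and (64, H, W), then 1x1 conv to 96 channels + SiLU."""
    mixed = np.concatenate([high_path, low_path], axis=0)            # (192, H, W)
    fused = np.einsum("oc,chw->ohw", weight, mixed) + bias[:, None, None]
    return silu(fused)

rng = np.random.default_rng(0)
mid = fuse_mid_branch(rng.normal(size=(128, 64, 64)),
                      rng.normal(size=(64, 64, 64)),
                      rng.normal(scale=0.05, size=(96, 192)),        # placeholder weights
                      np.zeros(96))
```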
[0035] It is worth noting that infrared and visible light images differ in information richness at the same spatial location. If local spatial distribution differences are not introduced to adjust the convolution sampling position, the deformable convolution of the mid-frequency branch will sample with a uniform grid, failing to focus on key areas with thermal radiation anomalies or abrupt texture changes. Therefore, an offset field is generated based on the ratio of the local spatial entropy of the infrared and visible light channels, and deformable convolution is performed on the mid-frequency branch using this offset field.
[0036] For single-channel infrared images (1080 pixels high, 1920 pixels wide) and visible light images converted to grayscale, a 9×9 sliding window is used to cover the typical diameter range of the sound-absorbing holes in the sound barrier, with a sliding step size of 2 to reduce computational redundancy. Within each sliding window, the 256 grayscale levels are discretized into 16 grayscale level intervals. The number of levels ensures the accuracy of entropy estimation while avoiding instability in probability estimation due to insufficient samples per level within the window. The pixel probabilities of each interval are statistically analyzed, and the local spatial entropy of the infrared channel and the visible light channel are calculated using the Shannon entropy relation. The Shannon entropy relation is a standard measure of the uncertainty of random variables in information theory. To prevent the denominator from being zero, a minimum constant is added. The ratio of the two values is calculated to obtain the local entropy ratio matrix.
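The sliding-window entropy in [0036] can be sketched as follows. Several details are assumptions: only windows that fit entirely inside the image are evaluated (no padding), the logarithm is base 2 (the base cancels in the ratio), and the value of the small constant `eps` is a placeholder.

```python
import numpy as np

def local_entropy(gray, window=9, stride=2, bins=16):
    """gray: (H, W) uint8. Shannon entropy of a 16-bin histogram per window."""
    levels = (gray.astype(np.int64) * bins) // 256       # 0..255 -> 0..bins-1
    h, w = gray.shape
    rows = range(0, h - window + 1, stride)
    cols = range(0, w - window + 1, stride)
    out = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            patch = levels[r:r + window, c:c + window].ravel()
            counts = np.bincount(patch, minlength=bins)
            p = counts[counts > 0] / counts.sum()        # skip empty bins
            out[i, j] = -(p * np.log2(p)).sum()
    return out

def entropy_ratio(ir_gray, vis_gray, eps=1e-6):
    """Ratio of infrared to visible local entropy; eps keeps the denominator nonzero."""
    return local_entropy(ir_gray) / (local_entropy(vis_gray) + eps)
```

With 16 bins the entropy is bounded by 4 bits, and an 81-pixel window gives about five samples per bin on average, which is the trade-off the paragraph describes.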
[0037] To eliminate gradient instability caused by extreme ratios, a natural logarithmic transformation is applied to each element of the ratio matrix. The transformed elements are then smoothly mapped to the interval [-1, 1] using the hyperbolic tangent function, forming a weight map. The weight map is downsampled so that its height and width match the 64×64 spatial size of the mid-frequency features. Positive values in the weight map indicate that infrared thermal anomalies are more prominent, while negative values indicate that visible light texture variations dominate.
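The log-then-tanh mapping can be written in a couple of lines. The lower clamp `tiny` is an assumption added only to avoid taking the logarithm of zero; the downsampling step is not shown because its method is not specified. Note that tanh(ln r) equals (r² − 1)/(r² + 1), so a ratio of 1 maps to 0 and reciprocal ratios map to opposite signs.

```python
import numpy as np

def weight_map(ratio, tiny=1e-12):
    """Map an entropy-ratio matrix to [-1, 1] via tanh(ln(ratio))."""
    return np.tanh(np.log(np.maximum(ratio, tiny)))

w = weight_map(np.array([0.25, 1.0, 4.0]))
```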
[0038] The weight map is input into a miniature convolutional network, which is a lightweight fully convolutional feature mapping network. Its input is the single-channel, smoothed, downsampled weight map of the local spatial entropy ratio. Its output is a tensor of two-dimensional spatial coordinate offset components, used as the offset field for the subsequent deformable convolution. The network consists of a first 3×3 standard convolutional layer, a rectified linear unit (ReLU) activation layer following it, and a second 3×3 standard convolutional layer, connected in series. No activation function is applied after the second convolutional layer; the offset field tensor is output directly so that the offsets can take positive or negative values, allowing the convolution kernel sampling points to shift in any direction. The two convolutional layers extract nonlinear spatial patterns and output a tensor with 18 channels, a height of 64, and a width of 64. The number of channels equals 2 × 9, that is, one horizontal and one vertical offset component for each of the 9 spatial sampling points of the 3×3 deformable convolution kernel.
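The two-layer structure in [0038] can be sketched in NumPy as below. The hidden width of 16 and the use of zero padding (to keep the 64×64 size) are assumptions, and the weights are random placeholders.

```python
import numpy as np

def conv3x3(x, weight, bias):
    """x: (Cin, H, W); weight: (Cout, Cin, 3, 3). Stride 1, zero padding 1."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx],
                             padded[:, dy:dy + h, dx:dx + w])
    return out + bias[:, None, None]

def offset_field(wmap, params):
    """wmap: (H, W) weight map -> (18, H, W) offsets; no activation on the output."""
    hidden = np.maximum(conv3x3(wmap[None], params["w1"], params["b1"]), 0.0)  # ReLU
    return conv3x3(hidden, params["w2"], params["b2"])

rng = np.random.default_rng(0)
params = {"w1": rng.normal(size=(16, 1, 3, 3)), "b1": np.zeros(16),   # hidden width assumed
          "w2": rng.normal(size=(18, 16, 3, 3)), "b2": np.zeros(18)}
offsets = offset_field(np.tanh(rng.normal(size=(64, 64))), params)
```

Leaving the last layer linear is what allows both positive and negative offsets.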
[0039] Using the offset field as the deformation reference for the sampling grid, deformable convolution is performed on the mid-frequency branch. In the deformable convolution stage, the mid-frequency branch feature map with 96 channels, a height of 64, and a width of 64 is used as input. The standard 3×3 grid sampling points are displaced by the coordinate offsets in the offset field, causing the sampling points to cluster towards regions of thermal radiation anomalies or abrupt texture changes. Since the offsets are generally fractional rather than integer, bilinear interpolation is used to calculate the feature values at the offset coordinates.
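A minimal sketch of sampling a feature map at fractional offset coordinates is given below. The helper names, the (dy, dx) layout of the 9 offsets, and the choice to read out-of-range neighbours as zero are assumptions; the text does not state how the 18 offset channels are ordered or how borders are handled.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D map at fractional (y, x); out-of-range neighbours read as 0."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * feat[yy, xx]
    return val

def deform_sample_points(feat, cy, cx, offsets):
    """offsets: (9, 2) array of (dy, dx) for the 3x3 grid centred at (cy, cx)."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.array([bilinear_sample(feat, cy + gy + oy, cx + gx + ox)
                     for (gy, gx), (oy, ox) in zip(grid, offsets)])
```

With all offsets equal to zero the function returns the ordinary 3×3 neighbourhood, so a standard convolution is the special case of an all-zero offset field.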
[0040] The variance within each channel of each branch feature map is averaged along the channel dimension, a minimum constant is added, and the square root of the result is used as a temperature coefficient. A cross-attention mechanism is then used to perform cross-frequency interaction and fusion of the mid-frequency, high-frequency, and low-frequency features. Specifically, high-frequency, mid-frequency, and low-frequency feature tensors whose height and width have been normalized to 64 are obtained. For each branch, the variance of each channel's feature map is calculated over the height and width directions, and the per-channel variances are averaged along the channel dimension to obtain a scalar global variance. A minimum constant of $10^{-4}$ is added to prevent division by zero, and the square root is taken as the temperature coefficient specific to that branch. In complex defect scenarios, the extracted temperature coefficients range from 0.5 to 2.5; this parameter represents the degree of dispersion of the feature responses at each frequency.
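The temperature coefficient computation can be written in a few lines. This is a sketch under the stated definitions; the function name `temperature` is hypothetical.

```python
import numpy as np

def temperature(feat, eps=1e-4):
    """feat: (C, H, W). Returns sqrt(mean over channels of spatial variance + eps)."""
    per_channel_var = feat.var(axis=(1, 2))     # variance over height and width
    return float(np.sqrt(per_channel_var.mean() + eps))
```

For a perfectly flat feature map the variance is zero and the coefficient falls back to sqrt(1e-4) = 0.01, which keeps the later division well defined.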
[0041] When constructing the cross-attention computation graph, the query vectors are generated by mapping the mid-frequency features, and the key and value vectors are generated by jointly mapping the high-frequency and low-frequency features. The mid-frequency features are passed through a 1×1 linear mapping convolutional layer and then reshaped and transposed to serve as the query vector matrix. The high-frequency and low-frequency features are concatenated along the channel dimension to form a concatenated feature with twice the original number of channels. This concatenated feature is reduced in dimensionality by two parallel 1×1 linear convolutions and mapped into a key vector matrix and a value vector matrix. Batch matrix multiplication is used to calculate the inner product of the query vector matrix and the transposed key vector matrix, yielding the raw correlation matrix.
[0042] After the inner product matrix of the query vectors and the key vectors is calculated, it is scaled by dividing it by the temperature coefficient of the mid-frequency branch and then input into the Softmax function to obtain the attention weight matrix. Because the temperature coefficient represents feature dispersion, when the input features are widely dispersed, the scaling makes the attention weight distribution more uniform and smooth, preventing the network from over-focusing on the high-frequency response of a single noise source. Referring to Figure 2, the attention weight matrix is multiplied by the value vectors to output a weighted aggregated feature containing cross-frequency correlation context. The weighted feature is reshaped back to its original two-dimensional spatial structure, and a pixel-wise residual connection is made with the original mid-frequency feature through a learnable scaling parameter initialized to 0, yielding the enhanced mid-frequency feature. The high-frequency feature, the enhanced mid-frequency feature, and the low-frequency feature are concatenated along the channel dimension to form a multi-frequency joint tensor with 288 channels. The number of channels is compressed to 256 by a 1×1 convolutional layer to generate the three-frequency fused feature map.
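The temperature-scaled attention and the zero-initialized residual can be sketched as follows. This is an illustrative simplification: the 1×1 projection layers are omitted, `q`, `k`, `v` are assumed to be already projected token matrices, and the names `cross_attention`, `tau`, and `gamma` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, tau, mid, gamma=0.0):
    """q, k, v, mid: (N, d) token matrices with N = H*W.

    tau is the mid-frequency temperature coefficient; gamma is the learnable
    residual scale, which the text initializes to 0.
    """
    attn = softmax(q @ k.T / tau, axis=-1)       # scaled attention weights
    return mid + gamma * (attn @ v), attn

rng = np.random.default_rng(0)
q, k, v, mid = (rng.normal(size=(16, 8)) for _ in range(4))
out_init, attn_low_tau = cross_attention(q, k, v, tau=0.5, mid=mid)
_, attn_high_tau = cross_attention(q, k, v, tau=2.5, mid=mid)
```

With `gamma` at its initial value of 0 the block returns the mid-frequency feature unchanged, so training starts from the unmodified branch. A larger `tau` flattens each row of the attention matrix, which is the smoothing effect described above.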
[0043] S3, multi-scale feature aggregation and decoding detection.
[0044] For a three-frequency fusion feature map with 256 input channels, the three-frequency fusion features are input into multiple sets of dilated convolutional layers with different dilation rates. A multi-scale feature extraction structure with five parallel branches is set up: four parallel dilated convolutional branches with dilation rates of 1, 6, 12, and 18, and one global average pooling branch. The branch with a dilation rate of 1 is a standard 3×3 convolutional layer with 64 output channels, used to preserve the original resolution features; the three branches with dilation rates of 6, 12, and 18 all have a 3×3 kernel size and 64 output channels, used to progressively expand the receptive field coverage. The global average pooling layer is used to obtain global context information. After pooling, the number of channels is mapped to 64 by a 1×1 convolution and then aligned back to the original spatial size by bilinear interpolation upsampling.
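As a worked check of how the dilation rates expand the receptive field, the spatial extent covered by a single k×k kernel with dilation rate d is d·(k − 1) + 1. The short sketch below (helper name hypothetical) applies this to the four branches and counts the channels of the five concatenated 64-channel outputs.

```python
def effective_extent(kernel, dilation):
    """Side length in pixels spanned by one dilated kernel."""
    return dilation * (kernel - 1) + 1

# Four dilated 3x3 branches from the text.
extents = {d: effective_extent(3, d) for d in (1, 6, 12, 18)}

# Four dilated branches plus one global pooling branch, 64 channels each.
concat_channels = 64 * (len(extents) + 1)
```

The branches therefore span 3, 13, 25, and 37 pixels on the 64×64 feature map, and their concatenation with the pooling branch has 320 channels.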
[0045] Verification was carried out on the training dataset for detecting defects in the sound-absorbing materials of sound barriers. During the network model initialization phase, the defect bounding box label files of the images in the training dataset were parsed offline, and the product of the length and width of each ground truth box was calculated to obtain the defect area in pixels. The defect area exhibits a long-tailed distribution: most defects are concentrated at small areas, while the area of a few defects can be tens of times larger than that of small defects. Directly using the original area as statistical input can easily bias the fitting results towards the frequently occurring small defects. Therefore, the natural logarithm of the area was taken, the frequencies were counted, and a histogram was plotted. A Gaussian distribution was then fitted to the logarithmic area histogram to obtain the expected value and standard deviation of the area distribution. Here, the expected value is 8.5, corresponding to an area of approximately 4900 square pixels, and the standard deviation is 1.2. Referring to Figure 3, the original defect area exhibits a typical long-tailed distribution, which follows a normal distribution after logarithmic transformation, so the expected value and standard deviation of the distribution can be obtained stably through fitting.
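The log-area statistics can be illustrated with the sketch below. Note the substitution: instead of fitting a Gaussian curve to a histogram as the text describes, this example uses the sample mean and standard deviation of ln(area), which is the maximum-likelihood Gaussian fit; the function name and the synthetic boxes are assumptions.

```python
import numpy as np

def fit_log_area(boxes):
    """boxes: (N, 2) array of box widths and heights in pixels.

    Returns (mu, sigma) of ln(area), using sample moments rather than a
    histogram curve fit (a simplification of the procedure in the text).
    """
    log_area = np.log(boxes[:, 0] * boxes[:, 1])
    return float(log_area.mean()), float(log_area.std())

# Synthetic square boxes whose log-area follows N(8.5, 1.2^2).
rng = np.random.default_rng(0)
side = np.sqrt(np.exp(rng.normal(8.5, 1.2, size=20000)))
mu, sigma = fit_log_area(np.stack([side, side], axis=1))
typical_area = float(np.exp(8.5))   # roughly 4.9e3 square pixels
```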
[0046] The expected value and standard deviation form a two-dimensional prior feature vector, which is mapped to the initial values of the scale bias coefficients of the four dilated convolutional branches by a lightweight multilayer perceptron. The lightweight multilayer perceptron receives this two-dimensional vector, expands it from 2 dimensions to a 16-dimensional hidden feature space through a first fully connected layer, reduces it from 16 dimensions to a 4-dimensional output through a second fully connected layer after ReLU activation, and normalizes the output through a Sigmoid activation layer. During training, the internal weights of this multilayer perceptron are updated synchronously with the gradient of the total network loss. The expected value and standard deviation serve as fixed statistical prior inputs, and the trained multilayer perceptron outputs stable initial values of the scale bias coefficients. In similar object detection systems, this type of feedforward network typically uses two to three fully connected layers to complete the mapping from low-dimensional priors to multi-branch weights. Since the scale bias coefficients must be limited to a bounded range to ensure training stability, a Sigmoid function is used to restrict each element of the output feature to the range of 0 to 1. The mathematical expression of this function is: $\sigma(z) = \frac{1}{1 + e^{-z}}$; in the formula, $z$ is the output feature value passed from the second fully connected layer to the Sigmoid function. The mapping generates a four-dimensional vector with values ranging from 0 to 1, which serves as the initial values of the scale bias coefficients of the four dilated convolution branches.
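A minimal forward pass of the 2 → 16 → 4 perceptron is sketched below. The random weights and their 0.1 scale are placeholders for illustration only; in the described method the weights are learned with the rest of the network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_scale_bias(mu, sigma, w1, b1, w2, b2):
    """Map the (mu, sigma) prior to four coefficients in (0, 1)."""
    x = np.array([mu, sigma])
    hidden = np.maximum(w1 @ x + b1, 0.0)   # first layer, 2 -> 16, then ReLU
    return sigmoid(w2 @ hidden + b2)        # second layer, 16 -> 4, then Sigmoid

rng = np.random.default_rng(0)
w1, b1 = 0.1 * rng.normal(size=(16, 2)), np.zeros(16)   # placeholder weights
w2, b2 = 0.1 * rng.normal(size=(4, 16)), np.zeros(4)
coeffs = init_scale_bias(8.5, 1.2, w1, b1, w2, b2)
```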
[0047] During the forward and backward propagation training cycles of the network, the four generated scalar coefficients are set as trainable floating-point parameters and are updated along with the gradient of the network's total loss. After activation, the output of each dilated convolution branch is weighted element-wise by its corresponding scale bias coefficient. The weighted features of the four dilated branches and the feature of the global pooling branch are concatenated along the channel dimension to obtain a tensor with a total of 320 channels. A linear convolutional layer with a 1×1 kernel is applied for channel compression and fusion, restoring the output to 256 channels and generating a three-frequency fusion feature tensor that has a dense, large receptive field and is corrected by the data scale prior.
[0048] The fused features are input into a dual-head decoder containing regression and classification branches, i.e., a decoupled detection head module. The regression prediction branch uses the Complete Intersection over Union (CIoU) loss function to calculate the coordinate regression offset of the target bounding box and outputs the defect bounding box. The classification prediction branch uses a focal loss function based on cross-entropy to predict the discrete classification probabilities of the sound-absorbing material filling density and outputs the filling density level. Conditional judgment logic is set in the video stream temporal processing stage: when the classification result of a preset anchor box region reaches the preset sparse or void threshold, the cross-frame defect prior feedback process is triggered.
[0049] S4, enhanced cross-frame defect prior feedback.
[0050] When processing the time-series video frames captured while the sound barrier inspection vehicle is moving, in order to maintain detection stability and eliminate inter-frame missed detections, if sparse or void defect areas exist in the current frame, the low-frequency feature map of the defect area is extracted as a spatial prior weight matrix, which is fed back to the feature extraction stage of the next frame through inter-frame motion compensation.
[0051] When the classification branch of the dual-head decoder determines that the fill density level of the detection box region in the current frame is sparse or void, a corresponding binary mask is generated on the blank matrix of the entire image based on the coordinates of the bounding box. The mask has a value of 1 inside and a background value of 0 outside, and the mask size is the same as the detection box size. For sparse defects, the trigger condition is that the category confidence score output by the classification branch is greater than 0.6; for void defects, the trigger condition is that the category confidence score is greater than 0.85. The confidence threshold for void defects is higher than that for sparse defects because void defects have more obvious thermal features in infrared images and their classification confidence score is naturally higher. Using a higher threshold can avoid thermal fluctuations in normal areas being misjudged as void defects. In other embodiments, the confidence threshold for sparse defects can be adjusted in the range of 0.5 to 0.7, and the confidence threshold for void defects can be adjusted in the range of 0.75 to 0.9, which can be determined by those skilled in the art through conventional experiments based on actual working conditions.
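The trigger conditions and the full-image binary mask can be sketched as follows. The tuple layout of a detection and the function name are assumptions for illustration; the thresholds of 0.6 and 0.85 and the 1080×1920 frame size come from the text.

```python
import numpy as np

# Confidence thresholds from this embodiment.
THRESHOLDS = {"sparse": 0.6, "void": 0.85}

def defect_mask(detections, height=1080, width=1920):
    """detections: iterable of (label, score, x1, y1, x2, y2) in pixels.

    Returns a full-image mask that is 1 inside triggering boxes, 0 elsewhere.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for label, score, x1, y1, x2, y2 in detections:
        if label in THRESHOLDS and score > THRESHOLDS[label]:
            mask[y1:y2, x1:x2] = 1
    return mask

dets = [("sparse", 0.65, 10, 20, 30, 50),      # triggers (0.65 > 0.6)
        ("void", 0.80, 100, 100, 120, 120),    # does not trigger (0.80 <= 0.85)
        ("dense", 0.99, 0, 0, 5, 5)]           # dense regions never trigger
mask = defect_mask(dets)
```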
[0052] A binary mask is used to extract the activation feature map of the receptive field region corresponding to the low-frequency branch of the detection network in the current frame. The low-frequency branch feature map has a height of 270 and a width of 480. The binary mask is scaled to the spatial size of the low-frequency branch feature map using nearest neighbor interpolation, and the activation tensor under the corresponding receptive field is extracted using the scaled mask. The multi-channel features are compressed into a single channel by calculating the maximum value along the channel dimension, and then normalized to the 0 to 1 interval by the Sigmoid activation function to generate a spatial prior weight matrix whose size is limited by the bounding box size.
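One way to realize this step is sketched below. Two points are assumptions rather than statements of the source: the floor-based index convention used for nearest neighbor scaling, and applying the mask after the Sigmoid so that positions outside the bounding box are exactly 0 instead of sigmoid(0) = 0.5.

```python
import numpy as np

def spatial_prior(low_feat, mask_full):
    """low_feat: (C, h, w) low-frequency features; mask_full: (H, W) binary mask."""
    _, h, w = low_feat.shape
    H, W = mask_full.shape
    rows = (np.arange(h) * H) // h          # nearest-neighbour source rows
    cols = (np.arange(w) * W) // w          # nearest-neighbour source columns
    mask = mask_full[rows][:, cols]         # mask at feature-map resolution
    pooled = low_feat.max(axis=0)           # maximum along the channel dimension
    prior = 1.0 / (1.0 + np.exp(-pooled))   # Sigmoid normalization to (0, 1)
    return prior * mask                     # zero outside the bounding box

mask_full = np.zeros((8, 12), dtype=np.uint8)
mask_full[4:8, 0:8] = 1
feat = np.random.default_rng(0).normal(size=(3, 2, 3))
prior = spatial_prior(feat, mask_full)
```

In the embodiment the mask would be 1080×1920 and the feature map 270×480, a factor of 4 in each direction; the toy sizes here only keep the example small.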
[0053] For the first frame of a video sequence, or when no defect region is detected in the previous frame, the spatial prior weight matrix is initialized to an all-zero matrix with the same spatial size as the low-frequency branch feature map. In this case, the element-wise multiplication enhancement is not performed in the feature extraction stage of the next frame, and the detection network processes that frame by conventional forward propagation.
[0054] When processing the next frame, motion compensation is performed on the spatial prior weight matrix based on inter-frame motion estimation. The dense optical flow algorithm is used to calculate the horizontal and vertical two-dimensional displacement vector fields of the visible light images between adjacent frames. Based on the obtained motion estimation vector fields, the stored spatial prior weight matrix of the previous frame is subjected to affine deformation and position update, and motion compensation is performed to ensure that the historical prior is correctly projected onto the actual spatial position of the object in the next frame.
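The position update can be illustrated with a simple warp, assuming a dense flow field is already available (the optical flow estimation itself is not shown). The backward lookup with nearest-pixel rounding and the (dx, dy) channel layout are simplifying assumptions, not the method's stated procedure.

```python
import numpy as np

def warp_prior(prior, flow):
    """prior: (H, W); flow: (H, W, 2) holding (dx, dy) from previous to next frame.

    Each output pixel reads the prior from where it came from in the
    previous frame; pixels whose source lies outside the image become 0.
    """
    h, w = prior.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs - flow[..., 0]).astype(int)
    src_y = np.rint(ys - flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(prior)
    out[valid] = prior[src_y[valid], src_x[valid]]
    return out

prior_prev = np.zeros((6, 8))
prior_prev[2, 3] = 1.0
uniform_flow = np.zeros((6, 8, 2))
uniform_flow[..., 0], uniform_flow[..., 1] = 2.0, 1.0   # 2 px right, 1 px down
prior_next = warp_prior(prior_prev, uniform_flow)
```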
[0055] The compensated spatial prior weight matrix is scaled by bilinear interpolation to the spatial dimensions of the input features of the next frame, i.e., the same dimensions as the original input image (height 1080, width 1920). During scaling, the boundary regions transition smoothly through interpolation. A constant 1 is added to the scaled spatial prior weight matrix to generate the enhancement coefficient matrix. To prevent local feature responses from accumulating through repeated enhancement across multiple frames, an upper limit clamp is applied to the enhancement coefficient matrix: elements greater than 1.5 are truncated to 1.5, limiting the value range of the enhancement coefficient matrix to between 1 and 1.5. The clamped enhancement coefficient matrix is then multiplied element-wise with the input feature map of the next frame within the coordinate range corresponding to the defect region. When defects are detected in the same coordinate region for more than 5 consecutive frames, the spatial prior weight matrix for that region is reset to an all-zero matrix, and subsequent frames generate a new spatial prior weight matrix based on the current detection results. The frame count threshold must match the video capture frame rate and the speed of the inspection vehicle. If the number of frames is too small, the prior information is discarded before being fully utilized; if it is too large, the reset is delayed. In this embodiment, 5 frames are used. In other embodiments, the number of frames can be adjusted within the range of 3 to 8 according to the actual inspection speed.
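The clamp and the consecutive-frame reset can be sketched as two small helpers. The names and the per-region counter are hypothetical; the constants 1, 1.5, and 5 come from this embodiment.

```python
import numpy as np

def enhancement(prior, upper=1.5):
    """Add 1 to the prior and clamp from above, giving values in [1, upper]."""
    return np.minimum(1.0 + prior, upper)

def update_streak(streak, detected, max_frames=5):
    """Track consecutive detections in one region.

    Returns (new_streak, reset), where reset is True once the region has been
    detected for more than max_frames consecutive frames.
    """
    streak = streak + 1 if detected else 0
    if streak > max_frames:
        return 0, True
    return streak, False

coeff = enhancement(np.array([0.0, 0.3, 0.8]))

streak, resets = 0, []
for _ in range(6):                      # six consecutive detections
    streak, reset = update_streak(streak, True)
    resets.append(reset)
```

Since the prior lies in the range 0 to 1 after the Sigmoid, the clamp only affects positions where the prior exceeds 0.5.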
[0056] Furthermore, the detection network is trained with end-to-end supervision using a composite loss function. For spatial regression localization of defect bounding boxes, the Complete Intersection over Union (CIoU) loss function is used. In addition to the overlap, this loss function considers aspect ratio consistency and minimizes the distance between center points, accelerating convergence for narrow sound-absorbing material defect boundaries. For classification of the filling density level, a focal loss function is used in conjunction. Following common practice for handling imbalanced positive and negative samples in deep learning object detection, this embodiment sets the focusing parameter of the focal loss to 2 and the positive-negative sample balance factor to 0.25. A focusing parameter of 2 sufficiently attenuates the loss weights of correctly classified samples, concentrating the training gradient on severely defective samples that are difficult to distinguish; a balance factor of 0.25 weakens the gradient share of easily classified negative samples in the dense, normal state. In actual engineering deployments, the focusing parameter can be adjusted from 1 to 3 according to the ratio of positive to negative samples, and the balance factor can be adjusted from 0.1 to 0.5.
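The effect of the two focal loss parameters can be seen in a small sketch. It uses the common binary form of the focal loss as an assumption; the text describes a multi-level density classification and does not give the exact formulation.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p                 # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha     # class balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)    # well-classified positive, strongly down-weighted
hard = focal_loss(0.1, 1)    # misclassified positive, keeps most of its loss
```

With a focusing parameter of 2, the modulating factor (1 − p_t)² is 0.01 for the easy sample and 0.81 for the hard one, so the hard sample dominates the gradient.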
[0057] To verify the contribution of each module to the detection performance, a comparative verification experiment was conducted on a deep learning workstation equipped with two RTX 3090 graphics cards and running Ubuntu 20.04. The experimental data consisted of a multimodal video sequence of defects in the sound-absorbing material of sound barriers, collected in the field. This sequence contained 2500 pairs of synchronized infrared and visible light video frames, of which 2000 pairs were used for network training and 500 pairs for verification testing. Annotators skilled in the art marked the defect locations with rectangular bounding boxes and graded the filling density based on the average grayscale deviation within the marked area in the infrared image. The difference between the average grayscale value of the infrared channel within the marked area and the global average grayscale value of the infrared image in that frame was recorded as the grayscale deviation: a grayscale deviation less than 15 was marked as dense, a grayscale deviation between 15 and 40 was marked as sparse, and a grayscale deviation greater than 40 was marked as void. These thresholds were determined from the grayscale statistical distribution of the infrared images of the sound barrier collected in the field; in other embodiments, they can be adjusted by those skilled in the art based on the actual thermal conductivity characteristics of the material. The AdamW optimizer was used during training, with an initial learning rate of 0.001 and a cosine annealing decay strategy. The batch size was set to 16, and the model was trained for a total of 20 epochs. The average accuracy was used to measure the model's accuracy in classifying and locating sparse or void defects, and the number of frames per second was used to measure the real-time performance of the detection.
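The labeling rule can be expressed as a short sketch. The helper names are hypothetical, and treating the boundary values 15 and 40 as sparse is an assumption, since the text does not say which class the exact boundaries belong to.

```python
import numpy as np

def grayscale_deviation(ir, box):
    """Mean infrared gray level inside box (x1, y1, x2, y2) minus the frame mean."""
    x1, y1, x2, y2 = box
    return float(ir[y1:y2, x1:x2].mean() - ir.mean())

def grade(deviation):
    """Density label from the grayscale deviation (boundary handling assumed)."""
    if deviation < 15:
        return "dense"
    if deviation <= 40:
        return "sparse"
    return "void"

ir = np.full((100, 100), 100.0)
ir[10:20, 10:20] = 150.0                      # a warm 10x10 region
dev = grayscale_deviation(ir, (10, 10, 20, 20))
label = grade(dev)
```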
[0058] A baseline model with only visible light single-modal images as input and a conventional backbone network was used as the control group; its average accuracy was 72.4%, and its detection speed was 45 frames per second. After adding the infrared and visible light registration and stitching module and the cross-frequency interactive fusion module utilizing the temperature coefficient to the baseline, the average accuracy increased to 78.6%, and the detection speed was 38 frames per second. After cascading the deformable convolution module that generates an offset field from local spatial entropy, the average accuracy reached 83.2%, and the speed was 35 frames per second. Adding the multi-scale dilated convolution branches with scale bias coefficients initialized by the multilayer perceptron further increased the average accuracy to 86.5%, at a speed of 32 frames per second. With the complete scheme, including the cross-frame defect prior feedback enhancement based on low-frequency spatial priors, an average accuracy of 91.3% was achieved on the test set, and the video detection speed stabilized at 28 frames per second. Referring to Figure 4, as the modules of multimodal registration stitching, cross-frequency interactive fusion, spatial entropy deformable convolution, scale-biased multi-scale dilated convolution, and cross-frame defect prior feedback enhancement are stacked in sequence, the average accuracy increases from 72.4% to 91.3%, and the detection speed decreases from 45 frames per second to 28 frames per second. Real-time detection capability is still maintained after all modules are stacked.
[0059] It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the scope of protection of this application. Therefore, the scope of protection of this patent application shall be determined by the appended claims.
Claims
1. A machine vision-based method for detecting the filling quality of sound-absorbing materials in sound barriers, characterized in that it comprises:
S1. Acquiring a sequence of synchronized infrared and visible light video frames and, after pixel-level registration, stitching them together along the channel dimension to form multi-channel input features;
S2. Constructing a detection network with an improved octave convolution module; decomposing the multi-channel input features into three branches: high frequency, mid frequency, and low frequency, wherein the mid-frequency branch is generated by concatenating the high-frequency features after low-pass filtering and downsampling with the low-frequency features after high-pass filtering and upsampling; generating an offset field based on the ratio of the local spatial entropy of the infrared and visible light channels and performing deformable convolution on the mid-frequency branch; obtaining a temperature coefficient by taking the average of the variance in each channel of the feature map along the channel dimension, adding a minimum constant, and taking the square root; and completing cross-frequency fusion through a cross-attention mechanism;
S3. Inputting the three-frequency fusion features into multiple groups of dilated convolutional layers with different dilation rates, wherein the outputs of each layer are weighted by element-wise multiplication using learnable scale bias coefficients and then concatenated by channel, the initial values of the scale bias coefficients are obtained by mapping the expected value of the logarithmic distribution of the area of the defect connected domains in the training set, and the defect bounding boxes and filling density levels are output by a dual-head decoder;
S4. When there are sparse or void defect regions in the current frame, extracting the low-frequency feature map of the defect region as the spatial prior weight matrix, which is fed back to the feature extraction stage of the next frame through inter-frame motion compensation.
2. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 1, characterized in that the stitching along the channel dimension after pixel-level registration to form multi-channel input features includes: extracting edge contour feature points of the infrared video frames and the visible light video frames respectively, and performing initial matching between the edge contour feature points of the infrared video frames and those of the visible light video frames to generate matching point pairs; using the visible light video frames as a reference, eliminating mismatched point pairs with a random sample consensus algorithm, and calculating the homography transformation matrix between the infrared and visible light images; based on the homography transformation matrix, performing perspective transformation and bilinear interpolation resampling on the infrared video frames to achieve pixel-level spatial alignment between the infrared video frames and the visible light video frames; and concatenating the aligned single-channel infrared image and the three-channel visible light image along the channel dimension to generate a four-channel tensor as the multi-channel input features.
3. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 1, characterized in that the generation of the mid-frequency branch includes: applying a Gaussian kernel function to the features of the high-frequency branch for local spatial smoothing, and downsampling them through a max pooling layer with a stride of 2; upsampling the features of the low-frequency branch by a factor of two using bilinear interpolation, and using the Laplacian operator to extract the high-frequency edge responses after upsampling; and unifying the smoothed, downsampled high-frequency features and the low-frequency features after edge response extraction to the same spatial resolution, concatenating them along the channel dimension, and performing channel dimensionality reduction and feature fusion through a 1×1 convolutional layer to generate the mid-frequency branch.
4. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 1, characterized in that generating an offset field based on the ratio of the local spatial entropy of the infrared and visible light channels and performing deformable convolution on the mid-frequency branch includes: setting up a sliding window and calculating the information entropy of the pixel grayscale distribution in the infrared channel and the visible light channel within the local window respectively; calculating the ratio of the infrared spatial entropy to the visible light spatial entropy, taking the logarithm of the ratio matrix, and normalizing it using the hyperbolic tangent function to obtain a weight map with values in the range of [-1, 1]; inputting the weight map into a miniature convolutional network containing two consecutive convolutional layers, and outputting an offset field whose number of channels is twice the number of spatial sampling points of the deformable convolution kernel; and, using the offset field as the deformation reference of the sampling grid, performing deformable convolution on the mid-frequency branch.
5. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 1, characterized in that the cross-frequency fusion completed through the cross-attention mechanism includes: calculating, over the spatial dimensions, the variance of each channel of the feature maps of the high-frequency branch, the mid-frequency branch, and the low-frequency branch; taking the mean of the per-channel variances along the channel dimension, adding a minimum constant, and taking the square root as the temperature coefficient of the corresponding branch; generating the query vector by mapping the mid-frequency features, and generating the key vector and the value vector by jointly mapping the high-frequency features and the low-frequency features; calculating the inner product matrix of the query vector and the key vector, scaling the inner product matrix by dividing it by the temperature coefficient of the mid-frequency branch, and passing it through the Softmax function to obtain the attention weight matrix; and multiplying the attention weight matrix by the value vector, the output weighted features being residually connected to the original mid-frequency features.
6. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 1, characterized in that inputting the three-frequency fusion features into multiple groups of dilated convolutional layers with different dilation rates, weighting the outputs of each layer by element-wise multiplication using learnable scale bias coefficients, and concatenating them by channel includes: setting up four parallel dilated convolution branches with dilation rates of 1, 6, 12, and 18 respectively; collecting the areas of all labeled defect connected components in the training set, taking the natural logarithm of the areas, and obtaining the expected value and standard deviation by fitting a Gaussian distribution; mapping the expected value and standard deviation through a multilayer perceptron and normalizing the result by the Sigmoid function to generate a four-dimensional vector as the initial values of the scale bias coefficients of the four dilated convolution branches; and, during training, weighting the output features of each dilated convolution branch by element-wise multiplication using the scale bias coefficients and then concatenating them by channel.
7. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 1, characterized in that the feeding back to the feature extraction stage of the next frame through inter-frame motion compensation includes: when the classification branch of the dual-head decoder determines that the filling density level of the detection box region in the current frame is sparse or void, generating a binary mask with the same size as the detection box, and scaling the binary mask to the spatial size of the low-frequency branch feature map by nearest neighbor interpolation; extracting the activation feature map of the receptive field region corresponding to the low-frequency branch in the current frame using the scaled binary mask, and generating a spatial prior weight matrix by taking the maximum along the channel dimension and applying the Sigmoid function; after performing inter-frame motion compensation on the spatial prior weight matrix, scaling it to the spatial size of the input features of the next frame by bilinear interpolation, and multiplying it element-wise with the features of the corresponding coordinate region in the next frame; and, for the first frame of the video sequence or when no defects are detected in the previous frame, initializing the spatial prior weight matrix to an all-zero matrix.
8. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 6, characterized in that the multiple groups of dilated convolutional layers with different dilation rates further include a global average pooling branch; the output of the global average pooling branch is upsampled and aligned back to the spatial size of the dilated convolution branches, and then concatenated by channel with the weighted output of each dilated convolution branch; after concatenation, the result is fused through channel dimensionality reduction by a 1×1 convolutional layer.
9. The method for detecting the filling quality of sound-absorbing materials in sound barriers based on machine vision according to claim 7, characterized in that the detection network is trained by using the Complete Intersection over Union (CIoU) loss function to supervise the regression of the defect bounding box and by using the focal loss function to supervise the classification of the filling density level, thereby constraining the update of the detection network weights.
10. The machine vision-based method for detecting the filling quality of sound-absorbing materials in sound barriers according to claim 7 or 9, characterized in that the spatial prior weight matrix, after motion compensation, is added to a constant value and then subjected to an upper limit constraint to generate an enhancement coefficient matrix, and the enhancement coefficient matrix is then multiplied element-wise with the input feature map of the next frame within the corresponding coordinate range of the defect region.