Method and system for detecting defects in a concrete filled steel tube based on a time-frequency domain audio signal
The method combines an acoustic sensor array with time-domain and frequency-domain feature processing models and an attention module. It addresses the high cost of ultrasonic testing for defect detection in steel-concrete composite pipes, and the dependence of its accuracy on probe position, to achieve more efficient and accurate defect detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA RAILWAY 25TH BUREAU GRP
- Filing Date
- 2025-06-23
- Publication Date
- 2026-06-30
AI Technical Summary
Existing ultrasonic testing technology is costly in detecting defects in steel-concrete composite pipes, its accuracy depends on the probe position, and it is difficult to capture time-domain and frequency-domain feature information, thus failing to fully utilize the multidimensional features of audio signals.
Audio sequences are acquired using an acoustic sensor array. The time-frequency domain features are extracted and processed by a convolutional long short-term memory network model. An attention module is then used to fuse time-domain and frequency-domain features, thereby improving the accuracy of defect detection.
It improves the accuracy and robustness of defect detection in steel-concrete composite pipes, reduces dependence on sensor location, and enhances the ability to capture subtle defect features.
Figure CN120847367B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of steel-concrete composite defect detection and processing technology, and in particular to a method and system for detecting steel-concrete composite defects based on time-frequency domain audio signals.
Background Technology
[0002] Concrete-filled steel tube (CFST) structures are composite structures formed by filling a steel tube with concrete. The steel tube confines the concrete, enhancing its compressive strength, while the concrete delays the buckling of the steel tube. With its core advantages of high load-bearing capacity, good seismic resistance, and rapid construction, CFST has become the preferred structural form for super high-rise buildings, long-span bridges, and major infrastructure projects. In the future, the integration of intelligent monitoring, low-carbon materials, and seismic-resistant technologies will further expand its engineering boundaries.
[0003] Defects in concrete-filled steel tubing mainly include: interface voids (such as gaps between the steel tubing and the concrete), uneven internal density, and secondary defects after repair (such as repair materials hindering the penetration of elastic waves). These defects significantly affect the load-bearing capacity and durability of the structure, and in severe cases, may lead to structural failure. Currently, ultrasonic testing technology is the main method for detecting defects in concrete-filled steel tubing, but this method has obvious limitations: First, ultrasonic testing equipment is expensive and requires professional operators; second, the detection accuracy is heavily dependent on the accuracy of the probe placement, and it is difficult to ensure detection stability in complex on-site construction environments; third, traditional methods cannot simultaneously capture time-domain and frequency-domain feature information, and cannot fully utilize the multi-dimensional feature information in audio signals.
Summary of the Invention
[0004] The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
[0005] The main objective of this disclosure is to propose a method and system for detecting defects in steel-concrete composite tubes based on time-frequency domain audio signals. This method can fully utilize the time-domain and frequency-domain audio features in the audio sequence on the surface of the steel-concrete composite tube to improve the accuracy of detecting defects.
[0006] The first aspect of this application proposes a method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals, the method comprising:
[0007] Audio sequences of steel-concrete composite pipes are acquired based on an acoustic sensor array.
[0008] Multiple time-domain audio features are extracted from the audio sequence and fused to obtain time-domain fused audio features; the multiple time-domain audio features are Fourier transformed to generate multiple frequency-domain audio features and fused to obtain frequency-domain fused audio features.
[0009] The time-domain fused audio features are input into a time-domain feature processing model to obtain the first audio features output by the time-domain feature processing model; wherein, the time-domain feature processing model includes multiple cascaded convolutional long short-term memory networks, and a convolutional layer connected to the last convolutional long short-term memory network;
[0010] The frequency domain fused audio features are input into the frequency domain feature processing model to obtain the second audio features output by the frequency domain feature processing model; wherein, the frequency domain feature processing model has the same network structure as the time domain feature processing model;
[0011] The first audio feature and the second audio feature are input into the attention module to enhance and fuse the first audio feature and the second audio feature based on the attention module to obtain the fourth audio feature. The defect detection result of the steel pipe concrete is output based on the fourth audio feature according to the fully connected layer.
[0012] In some implementations, the temporal feature processing model includes four convolutional long short-term memory networks, and there is a residual connection between the output features of the first convolutional long short-term memory network and the output features of the third convolutional long short-term memory network.
[0013] In some implementations, an attention module is provided between each of the convolutional long short-term memory networks in the temporal feature processing model and a corresponding convolutional long short-term memory network in the frequency domain feature processing model;
[0014] The input feature of the next convolutional long short-term memory network in the temporal feature processing model is: the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding attention module;
[0015] The input feature of the next convolutional long short-term memory network in the frequency domain feature processing model is: the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding attention module.
[0016] The process of the attention module outputting features between the convolutional long short-term memory network in the temporal feature processing model and the convolutional long short-term memory network in the frequency domain feature processing model includes:
[0017] The output features of the previous convolutional long short-term memory network in the time-domain feature processing model and the output features of the previous convolutional long short-term memory network in the frequency-domain feature processing model are enhanced with attention and fused to obtain the output features.
[0018] In some implementations, the attention module includes a channel attention block and a spatial attention block;
[0019] The step of inputting the first audio feature and the second audio feature into the attention module, and performing attention enhancement and fusion on the first audio feature and the second audio feature based on the attention module to obtain the fourth audio feature includes:
[0020] The first audio feature is input into the channel attention block to obtain the output first channel attention feature;
[0021] The first audio feature and the first channel attention feature are multiplied pixel by pixel to obtain the second channel attention feature;
[0022] The second channel attention feature is input into the spatial attention block to obtain the output first spatial attention feature;
[0023] The first spatial attention feature and the second channel attention feature are multiplied pixel by pixel to obtain the second spatial attention feature;
[0024] The second audio feature is input into the channel attention block to obtain the output third channel attention feature;
[0025] The second audio feature and the third channel attention feature are multiplied pixel by pixel to obtain the fourth channel attention feature;
[0026] The fourth channel attention feature is input into the spatial attention block to obtain the output third spatial attention feature;
[0027] The third spatial attention feature and the fourth channel attention feature are multiplied pixel by pixel to obtain the fourth spatial attention feature;
[0028] The second spatial attention feature and the fourth spatial attention feature are concatenated to obtain the fourth audio feature.
[0029] In some implementations, inputting the first audio feature into the channel attention block to obtain the output first channel attention feature includes:
[0030] The first audio feature is subjected to global average pooling in the spatial dimension to obtain the first intermediate audio feature.
[0031] The first audio feature is subjected to global max pooling in the spatial dimension to obtain the second intermediate audio feature;
[0032] The first intermediate audio feature is input into a multilayer perceptron to obtain the third intermediate audio feature;
[0033] The second intermediate audio feature is input into a multilayer perceptron to obtain the fourth intermediate audio feature;
[0034] By concatenating the third intermediate audio feature and the fourth intermediate audio feature, a fifth intermediate audio feature is obtained;
[0035] The first channel attention feature is obtained by activating the fifth intermediate audio feature using the Sigmoid function.
[0036] In some implementations, inputting the second channel attention feature into the spatial attention block to obtain the output first spatial attention feature includes:
[0037] The second channel attention feature is subjected to global average pooling along the channel dimension to obtain the sixth intermediate audio feature;
[0038] The second channel attention feature is subjected to global max pooling along the channel dimension to obtain the seventh intermediate audio feature;
[0039] The sixth intermediate audio feature and the seventh intermediate audio feature are concatenated to obtain the eighth intermediate audio feature;
[0040] The eighth intermediate audio feature is input into the convolutional layer to obtain the output ninth intermediate audio feature;
[0041] The first spatial attention feature is obtained by activating the ninth intermediate audio feature using the Sigmoid function.
[0042] In some implementations, extracting multiple temporal audio features from the audio sequence and fusing the multiple temporal audio features includes:
[0043] The energy spectrum feature extraction algorithm based on wavelet packet decomposition extracts multiple time-domain audio features from the audio sequence;
[0044] The multiple time-domain audio features are fused based on the principal component analysis algorithm.
[0045] A second aspect of this application proposes a defect detection system for steel-concrete composite pipes based on time-frequency domain audio signals, the system comprising:
[0046] An audio signal acquisition module is used to acquire audio sequences of steel-concrete composite structures based on an acoustic sensor array.
[0047] The audio feature extraction module is used to extract multiple time-domain audio features from the audio sequence, and fuse the multiple time-domain audio features to obtain time-domain fused audio features; and to generate multiple frequency-domain audio features by Fourier transform of the multiple time-domain audio features, and fuse the multiple frequency-domain audio features to obtain frequency-domain fused audio features.
[0048] The first processing module is used to input the time-domain fused audio features into the time-domain feature processing model to obtain the first audio features output by the time-domain feature processing model; wherein, the time-domain feature processing model includes multiple cascaded convolutional long short-term memory networks, and a convolutional layer connected to the last convolutional long short-term memory network;
[0049] The second processing module is used to input the frequency domain fused audio features into the frequency domain feature processing model to obtain the second audio features output by the frequency domain feature processing model; wherein, the frequency domain feature processing model has the same network structure as the time domain feature processing model;
[0050] The defect detection module is used to input the first audio feature and the second audio feature into the attention module, so as to enhance and fuse the first audio feature and the second audio feature based on the attention module to obtain a fourth audio feature, and output the defect detection result of the steel-concrete composite according to the fourth audio feature based on the fully connected layer.
[0051] A third aspect of this application provides an electronic device including at least one controller and a memory for communicative connection with the controller; the memory stores instructions executable by the at least one controller to cause the at least one controller to perform a time-frequency domain-based method for detecting defects in steel-concrete composite structures as described above.
[0052] A fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when executed, implements the above-described method for detecting defects in steel-concrete composite based on time-frequency domain audio signals.
[0053] The method provided in this embodiment has the following advantages:
[0054] This method acquires audio time-series data from the surface of steel-concrete composite tubes using an acoustic sensor array, then extracts time-domain fused audio features and frequency-domain fused audio features from that data. The time-domain fused audio features are analyzed by the time-domain feature processing model and the frequency-domain fused audio features by the frequency-domain feature processing model, so the signal is analyzed in two different domains and more defect features are captured from the audio feature maps of each domain. Furthermore, each feature processing model consists of multiple cascaded convolutional long short-term memory networks, which extract audio signal features layer by layer and improve robustness to long-sequence signals, enhancing the model's ability to capture weak defect features in the audio signals of steel-concrete composite tubes. Finally, an attention module fuses the time-domain and frequency-domain features extracted by the two models and applies attention enhancement to improve their expressive power, thereby improving defect feature capture and ultimately the accuracy of defect detection in steel-concrete composite tubes.
[0055] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0056] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0057] Figure 1 is a schematic flowchart of a method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals, provided in an embodiment of this application;
[0058] Figure 2 is a schematic diagram of the structure of the time-domain feature processing model and the frequency-domain feature processing model provided in the embodiments of this application;
[0059] Figure 3 is a schematic diagram of a steel-concrete composite defect detection system based on time-frequency domain audio signals provided in an embodiment of this application;
[0060] Figure 4 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0061] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0062] In the description of this application, the use of terms such as "first," "second," etc., is for the purpose of distinguishing technical features only and should not be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated or the order of the technical features indicated.
[0063] In the description of this application, it should be understood that the orientation descriptions, such as up, down, etc., are based on the orientation or positional relationship shown in the accompanying drawings, and are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application.
[0064] Concrete-filled steel tube (CFST) structures are composite structures formed by filling a steel tube with concrete. The steel tube confines the concrete, enhancing its compressive strength, while the concrete delays the buckling of the steel tube. With its core advantages of high load-bearing capacity, good seismic resistance, and rapid construction, CFST has become the preferred structural form for super high-rise buildings, long-span bridges, and major infrastructure projects. In the future, the integration of intelligent monitoring, low-carbon materials, and seismic-resistant technologies will further expand its engineering boundaries.
[0065] Defects in concrete-filled steel tubing mainly include: interface voids (such as gaps between the steel tubing and the concrete), uneven internal density, and secondary defects after repair (such as repair materials hindering the penetration of elastic waves). These defects significantly affect the load-bearing capacity and durability of the structure, and in severe cases, may lead to structural failure. Currently, ultrasonic testing technology is the main method for detecting defects in concrete-filled steel tubing, but this method has obvious limitations: First, ultrasonic testing equipment is expensive and requires professional operators; second, the detection accuracy is heavily dependent on the accuracy of the probe placement, and it is difficult to ensure detection stability in complex on-site construction environments; third, traditional methods cannot simultaneously capture time-domain and frequency-domain feature information, and cannot fully utilize the multi-dimensional feature information in audio signals.
[0066] As shown in Figure 1 and Figure 2, one embodiment of this application provides a method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals, the method comprising the following steps S100 to S500:
[0067] Step S100: Acquire the audio sequence of the steel-concrete composite material based on the acoustic sensor array.
[0068] Step S200: Extract multiple time-domain audio features from the audio sequence and fuse the multiple time-domain audio features to obtain time-domain fused audio features; Fourier transform the multiple time-domain audio features to generate multiple frequency-domain audio features and fuse the multiple frequency-domain audio features to obtain frequency-domain fused audio features.
[0069] Step S300: Input the time-domain fused audio features into the time-domain feature processing model to obtain the first audio features output by the time-domain feature processing model; wherein, the time-domain feature processing model includes multiple cascaded convolutional long short-term memory networks, and a convolutional layer connected to the last convolutional long short-term memory network.
[0070] Step S400: Input the frequency domain fused audio features into the frequency domain feature processing model to obtain the second audio features output by the frequency domain feature processing model; wherein, the network structure of the frequency domain feature processing model is the same as that of the time domain feature processing model.
[0071] Step S500: Input the first audio feature and the second audio feature into the attention module, which enhances and fuses them to obtain the fourth audio feature; a fully connected layer then outputs the defect detection result of the steel-concrete composite based on the fourth audio feature.
[0072] In step S100, multiple sound wave receiving devices can be arranged in a specific spatial distribution to form an acoustic sensor array, such as using a ring array or matrix layout to achieve multi-angle acquisition of audio sequences from the surface of the steel pipe.
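As a rough illustration of the ring layout mentioned above, the following sketch computes evenly spaced sensor coordinates around a tube. The function name, the tube radius, and the single-height layout are all hypothetical; the source does not specify sensor counts or spacing.

```python
import numpy as np

def ring_array_positions(n_sensors, tube_radius_m, height_m=0.0):
    """Return (n_sensors, 3) xyz coordinates evenly spaced on one ring.

    Hypothetical helper: the source only says a ring or matrix layout may be used.
    """
    angles = 2.0 * np.pi * np.arange(n_sensors) / n_sensors
    return np.column_stack([
        tube_radius_m * np.cos(angles),   # x on the tube circumference
        tube_radius_m * np.sin(angles),   # y on the tube circumference
        np.full(n_sensors, height_m),     # all sensors at the same height
    ])
```

A matrix layout could be approximated by stacking several such rings at different heights.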
[0073] In step S200, after the audio signal is acquired, time-domain audio features are extracted first. For example, a wavelet packet decomposition algorithm extracts the energy spectrum features of multiple sub-bands, such as the energy values of the 0-1 kHz and 1-2 kHz bands. After dimensionality reduction by principal component analysis, these form a 128-dimensional time-domain fused audio feature. Here, the time-domain fused audio feature is the combined time-domain parameter obtained by extracting energy spectrum features through wavelet packet decomposition and then fusing them through principal component analysis dimensionality reduction. It characterizes the energy distribution, in the time dimension, of the audio signal acquired from the steel-concrete composite surface.
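The sub-band energy extraction can be sketched with a Haar wavelet packet tree written in plain NumPy. The choice of the Haar wavelet, the decomposition depth, and the function name are assumptions made for illustration; the source does not name a wavelet or a level.

```python
import numpy as np

def haar_wavelet_packet_energies(signal, level=3):
    """Return the energy of each of the 2**level terminal sub-bands.

    Assumption: Haar filters. Sub-bands come out in natural filter-bank order,
    which is not strictly increasing frequency order.
    """
    nodes = [np.asarray(signal, dtype=float)]
    for _ in range(level):
        next_nodes = []
        for node in nodes:
            if len(node) % 2:                       # pad odd-length nodes
                node = np.append(node, node[-1])
            even, odd = node[0::2], node[1::2]
            next_nodes.append((even + odd) / np.sqrt(2.0))  # low-pass branch
            next_nodes.append((even - odd) / np.sqrt(2.0))  # high-pass branch
        nodes = next_nodes
    return np.array([np.sum(n ** 2) for n in nodes])
```

Because the Haar filters are orthonormal, the sub-band energies sum to the signal energy whenever the signal length is divisible by `2**level`, which makes a convenient sanity check.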
[0074] Here, the time-domain audio features are further subjected to a Fast Fourier Transform to generate frequency-domain audio features. Principal component analysis is then used to fuse these features into a 128-dimensional frequency-domain fused audio feature. The frequency-domain fused audio feature refers to the comprehensive frequency-domain parameters formed by extracting and fusing the spectral parameters after performing a Fourier transform on the time-domain features. This can reflect the resonance characteristics of the internal structure of the steel-concrete composite tube.
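A minimal sketch of this step, assuming each row of `time_features` holds the time-domain features of one recording: the magnitude of a real FFT stands in for the frequency-domain features, and PCA is computed through a singular value decomposition. The function names and the use of the magnitude spectrum are assumptions; the source states only that a Fast Fourier Transform and principal component analysis are applied.

```python
import numpy as np

def frequency_features(time_features):
    """Magnitude spectrum of each row (assumed form of the frequency features)."""
    return np.abs(np.fft.rfft(time_features, axis=1))

def pca_fuse(features, n_components):
    """Project rows onto the top principal components (PCA via SVD)."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Illustrative sizes only; the source uses 128 components.
rng = np.random.default_rng(0)
time_features = rng.normal(size=(40, 16))
freq_fused = pca_fuse(frequency_features(time_features), n_components=4)
```

The same `pca_fuse` helper could be applied to the time-domain features to obtain the time-domain fused feature.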
[0075] In step S300, the Convolutional Long Short-Term Memory (ConvLSTM) network is a hybrid architecture that combines convolutional operations with recurrent neural networks. By setting up multiple cascaded ConvLSTM networks and establishing residual connections, it can simultaneously handle local feature learning and temporal correlation analysis for both temporal and frequency domain audio features. The ConvLSTM model is an extension of LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network), using convolutional operations to transform input to state and state to state, thereby extracting temporal and spatial correlation features.
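To make the input-to-state and state-to-state convolutions concrete, here is a single ConvLSTM time step in NumPy. It is a sketch under several assumptions that the source does not state: features are treated as one-dimensional maps of shape (channels, length), kernels have odd width, and peephole connections are omitted.

```python
import numpy as np

def conv1d_same(x, w):
    """Cross-correlate x (C_in, L) with w (C_out, C_in, K); output (C_out, L)."""
    c_out, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    length = x.shape[1]
    out = np.zeros((c_out, length))
    for j in range(k):
        out += w[:, :, j] @ xp[:, j:j + length]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, w_x, w_h, b):
    """One ConvLSTM step; w_x, w_h, and b stack the four gates along axis 0."""
    hidden = h.shape[0]
    gates = conv1d_same(x, w_x) + conv1d_same(h, w_h) + b[:, None]
    i = sigmoid(gates[:hidden])                 # input gate
    f = sigmoid(gates[hidden:2 * hidden])       # forget gate
    o = sigmoid(gates[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(gates[3 * hidden:])             # candidate cell state
    c_next = f * c + i * g
    h_next = o * np.tanh(c_next)
    return h_next, c_next
```

Cascading several such cells, with the hidden state of one serving as the input of the next, gives the layer-by-layer extraction described above.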
[0076] In step S500, the attention module may refer to a fusion mechanism that includes both channel attention and spatial attention paths, and feature enhancement is achieved through pixel-by-pixel multiplication to improve the complementary expression of temporal and frequency domain features.
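The final stage of step S500, in which a fully connected layer maps the fused feature to a detection result, might look like the sketch below. The softmax output and the idea of one score per defect class are assumptions; the source does not say how the result is encoded.

```python
import numpy as np

def classify(fused_feature, weight, bias):
    """Flatten the fused feature, apply one fully connected layer, then softmax.

    weight: (n_classes, fused_feature.size); bias: (n_classes,).
    The class set itself is hypothetical.
    """
    logits = weight @ fused_feature.ravel() + bias
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()
```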
[0077] Compared with existing technologies, this method acquires audio time-series data from the steel-concrete composite surface using an acoustic sensor array, then extracts time-domain fused audio features and frequency-domain fused audio features from this data. The time-domain fused audio features are analyzed by the time-domain feature processing model and the frequency-domain fused audio features by the frequency-domain feature processing model, so the signal is analyzed in two different domains and more defect features are captured from the audio feature maps of each domain. Furthermore, each feature processing model is composed of multiple cascaded convolutional long short-term memory networks, which extract audio signal features layer by layer and improve robustness to long-sequence signals, enhancing the model's ability to capture weak defect features in the audio signals of the steel-concrete composite. Finally, an attention module fuses the time-domain and frequency-domain features extracted by the two models and applies attention enhancement to improve feature expressiveness and defect feature capture, ultimately increasing the accuracy of defect detection in the steel-concrete composite.
[0078] Furthermore, as shown in Figure 2, an attention module is provided between each convolutional long short-term memory network in the time-domain feature processing model and a corresponding convolutional long short-term memory network in the frequency-domain feature processing model.
[0079] The input feature of the next convolutional long short-term memory network in the temporal feature processing model is the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding attention module.
[0080] The input feature of the next convolutional long short-term memory network in the frequency domain feature processing model is the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding attention module.
[0081] The process of the attention module outputting features between the convolutional long short-term memory network in the time-domain feature processing model and the convolutional long short-term memory network in the frequency-domain feature processing model includes:
[0082] The output features of the previous convolutional long short-term memory network in the time-domain feature processing model and the output features of the previous convolutional long short-term memory network in the frequency-domain feature processing model are enhanced with attention and fused to obtain the output features.
[0083] In this embodiment, an attention module is set between each convolutional long short-term memory network in the temporal feature processing model and a corresponding convolutional long short-term memory network in the frequency domain feature processing model. This allows for information sharing between the temporal and frequency domain feature processing models through the attention module, fully mining detailed features in the time and frequency domains and improving the expressive power of the features.
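The information sharing between the two branches can be sketched as a data-flow loop. The stages and the attention module are passed in as callables, so the code shows only the wiring: each next stage receives the concatenation of the previous stage output and the attention output. Skipping the attention exchange after the last stage, and omitting the final convolutional layer and any residual connection, are simplifications assumed here.

```python
import numpy as np

def run_two_branches(x_time, x_freq, time_stages, freq_stages, attention):
    """Run paired stages with an attention exchange between consecutive stages.

    Features have shape (channels, length); concatenation is along channels.
    Requires at least one stage per branch.
    """
    t_in, f_in = x_time, x_freq
    last = len(time_stages) - 1
    for idx, (t_stage, f_stage) in enumerate(zip(time_stages, freq_stages)):
        t_out, f_out = t_stage(t_in), f_stage(f_in)
        if idx < last:
            shared = attention(t_out, f_out)   # attention-enhanced, fused feature
            t_in = np.concatenate([t_out, shared], axis=0)
            f_in = np.concatenate([f_out, shared], axis=0)
    return t_out, f_out
```

Note that concatenation enlarges the channel count seen by every stage after the first, so each later ConvLSTM must be sized for the wider input.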
[0084] Further, in step S500, the attention module fuses the first audio feature and the second audio feature to obtain the fourth audio feature, including the following steps S510 to S590:
[0085] Step S510: Input the first audio feature into the channel attention block to obtain the output first channel attention feature.
[0086] Step S520: Multiply the first audio feature and the first channel attention feature pixel by pixel to obtain the second channel attention feature.
[0087] Step S530: Input the second channel attention feature into the spatial attention block to obtain the output first spatial attention feature.
[0088] Step S540: Multiply the first spatial attention feature and the second channel attention feature pixel by pixel to obtain the second spatial attention feature.
[0089] Step S550: Input the second audio feature into the channel attention block to obtain the output third channel attention feature.
[0090] Step S560: Multiply the second audio feature and the third channel attention feature pixel by pixel to obtain the fourth channel attention feature.
[0091] Step S570: Input the fourth channel attention feature into the spatial attention block to obtain the output third spatial attention feature.
[0092] Step S580: Multiply the third spatial attention feature and the fourth channel attention feature pixel by pixel to obtain the fourth spatial attention feature.
[0093] Step S590: The second spatial attention feature and the fourth spatial attention feature are concatenated to obtain the fourth audio feature.
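Steps S510 to S590 amount to applying the same two-stage attention to each feature and concatenating the results. The sketch below takes the channel and spatial attention blocks as callables, so it shows only the order of operations. Sharing one pair of blocks between both features, and concatenating along the channel axis, are assumptions.

```python
import numpy as np

def attend(feature, channel_block, spatial_block):
    """Channel attention followed by spatial attention on one (C, H, W) feature."""
    channel_weights = channel_block(feature)            # S510 / S550
    channel_refined = feature * channel_weights         # S520 / S560
    spatial_weights = spatial_block(channel_refined)    # S530 / S570
    return channel_refined * spatial_weights            # S540 / S580

def fuse(first, second, channel_block, spatial_block):
    """Concatenate the two attended features along the channel axis (S590)."""
    a = attend(first, channel_block, spatial_block)
    b = attend(second, channel_block, spatial_block)
    return np.concatenate([a, b], axis=0)
```

The multiplications rely on NumPy broadcasting: channel weights of shape (C, 1, 1) and spatial weights of shape (1, H, W) both expand to the full (C, H, W) feature.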
[0094] Among them, the channel attention block refers to the module that generates channel weight coefficients by aggregating spatial dimension information. Specifically, it can be implemented by combining global average pooling and global max pooling with a multilayer perceptron to highlight important frequency band information. See the following embodiments for details.
[0095] Spatial attention blocks are modules that generate spatial weight coefficients by aggregating channel dimension information. This can be implemented using channel pooling combined with convolutional layers to enhance features in key regions. Pixel-wise multiplication refers to element-wise multiplication of the attention weights with the original features, which can be achieved through matrix dot multiplication. This is used for feature enhancement, as detailed in subsequent examples.
[0096] The channel attention block first performs global average pooling and global max pooling on the input features simultaneously, capturing different statistical properties respectively. The two pooling results are then processed independently by a multilayer perceptron and concatenated, generating channel attention weights through a sigmoid function. These weights, multiplied by the original features, adaptively adjust the contribution of each channel.
[0097] The spatial attention block performs max pooling and average pooling on the channel attention features along the channel dimension, concatenates them, and generates a spatial attention map through a convolutional layer. This map is then multiplied with the channel attention features to achieve feature filtering based on spatial location.
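A NumPy sketch of such a spatial attention block follows. It assumes a (C, H, W) feature layout, a single convolution with a two-channel odd-sized kernel, and zero padding; none of these details are given in the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feature, kernel, bias=0.0):
    """Return (1, H, W) spatial weights for a (C, H, W) feature.

    kernel: (2, k, k) with odd k, applied to the stacked [mean, max] maps.
    """
    pooled = np.stack([feature.mean(axis=0), feature.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feature.shape[1:]
    response = np.full((h, w), float(bias))
    for dy in range(k):
        for dx in range(k):
            window = padded[:, dy:dy + h, dx:dx + w]                # (2, H, W)
            response += np.tensordot(kernel[:, dy, dx], window, axes=1)
    return sigmoid(response)[None]
```

The filtered feature is then `feature * spatial_attention(feature, kernel)`, with the weights broadcast across channels.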
[0098] The time-domain and frequency-domain features are concatenated after undergoing dual attention processing to form an enhanced feature that integrates time and frequency information.
[0099] This method improves the accuracy of channel weights by combining dual pooling with a multilayer perceptron, introduces spatial attention to achieve fine-grained feature map screening, and finally preserves the complementarity of time-frequency features through splicing and fusion. It can effectively solve the problem of insufficient interaction between time-domain and frequency-domain features. In the process of steel pipe concrete defect detection, the dual attention mechanism suppresses noise interference and enhances the expressive ability of defect-related features. This method can reduce the impact of acoustic sensor position deviation on detection results and improve the accuracy of defect identification.
[0100] Further, step S510, which involves inputting the first audio feature into the channel attention block to obtain the output first channel attention feature, includes the following steps S5110 to S5160:
[0101] Step S5110: Perform global average pooling on the spatial dimension of the first audio feature to obtain the first intermediate audio feature. Global average pooling on the spatial dimension refers to calculating the average value of the feature values at each position along the spatial dimension of the audio feature. Specifically, it can be achieved by averaging along the height and width dimensions of the feature map, which is used to capture the overall distribution information in the spatial dimension.
[0102] Step S5120: Perform global max pooling on the first audio feature in the spatial dimension to obtain the second intermediate audio feature. Global max pooling in the spatial dimension refers to extracting the maximum value of the feature value at each position along the spatial dimension. Specifically, it can be achieved by taking the maximum value along the height and width dimensions of the feature map, which is used to capture significant local features in the spatial dimension.
[0103] Step S5130: Input the first intermediate audio feature into the multilayer perceptron to obtain the third intermediate audio feature; the multilayer perceptron is a neural network structure composed of fully connected layers and activation functions, such as using two fully connected layers with the ReLU activation function, to perform nonlinear transformation on the pooled features.
[0104] Step S5140: Input the second intermediate audio feature into the multilayer perceptron to obtain the fourth intermediate audio feature.
[0105] Step S5150: Concatenate the third intermediate audio feature and the fourth intermediate audio feature to obtain the fifth intermediate audio feature.
[0106] Step S5160: The fifth intermediate audio feature is activated based on the Sigmoid function to obtain the first channel attention feature. Sigmoid function activation refers to mapping the values to the 0-1 range to generate channel attention weights.
[0107] This method can improve the sensitivity of channel attention features to defects in steel-concrete composite tubes, enhance the model's ability to identify defects, reduce false detections or missed detections caused by differences in local feature responses, and adapt to the feature distribution patterns of different defect types through a dual-pooling mechanism.
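Steps S5110 to S5160 can be sketched as follows. This is an illustrative sketch only. The source states that the two multilayer perceptron outputs are concatenated; so that the result provides one weight per channel for the subsequent pixel-wise multiplication, the sketch assumes that the multilayer perceptron maps C inputs to C/2 outputs. That sizing, the hidden width, the random weights, and all function and variable names are assumptions not stated in the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(v, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU activation in between."""
    hidden = np.maximum(w1 @ v + b1, 0.0)
    return w2 @ hidden + b2

def channel_attention(feature, w1, b1, w2, b2):
    """feature: (C, H, W). Returns one weight per channel, shape (C,)."""
    first = feature.mean(axis=(1, 2))         # S5110: global average pooling over H and W
    second = feature.max(axis=(1, 2))         # S5120: global max pooling over H and W
    third = mlp(first, w1, b1, w2, b2)        # S5130
    fourth = mlp(second, w1, b1, w2, b2)      # S5140: same multilayer perceptron
    fifth = np.concatenate([third, fourth])   # S5150: (C/2,) + (C/2,) -> (C,)
    return sigmoid(fifth)                     # S5160: map to the 0-1 range

rng = np.random.default_rng(0)
C, H, W, hidden_units = 8, 4, 6, 4            # hypothetical sizes
first_audio_feature = rng.normal(size=(C, H, W))
w1 = rng.normal(scale=0.1, size=(hidden_units, C))
b1 = np.zeros(hidden_units)
w2 = rng.normal(scale=0.1, size=(C // 2, hidden_units))  # assumed C/2 outputs
b2 = np.zeros(C // 2)

first_channel_attention = channel_attention(first_audio_feature, w1, b1, w2, b2)
# Pixel-wise multiplication with the input gives the second channel attention feature.
second_channel_attention = first_audio_feature * first_channel_attention[:, None, None]
```

In a trained model the weights `w1`, `b1`, `w2`, and `b2` would be learned rather than drawn at random.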
[0108] Further, step S530, which inputs the second channel attention feature into the spatial attention block to obtain the output first spatial attention feature, includes the following steps S5310 to S5350:
[0109] Step S5310: Perform global average pooling along the channel dimension on the second channel attention feature to obtain the sixth intermediate audio feature. Global average pooling along the channel dimension refers to a statistical operation that compresses the feature map along the channel direction. Specifically, it can be implemented by calculating the mean across channels at each spatial location, which extracts the overall distribution information of the channel dimension.
[0110] Step S5320: Perform global max pooling along the channel dimension on the second channel attention feature to obtain the seventh intermediate audio feature. Global max pooling along the channel dimension refers to selecting the maximum value across channels at each spatial location. Specifically, it can be implemented by retaining the extreme values of feature responses across channels to capture salient features in the channel dimension.
[0111] Step S5330: The sixth intermediate audio feature and the seventh intermediate audio feature are concatenated to obtain the eighth intermediate audio feature. The concatenation operation refers to connecting two feature tensors along a specified dimension. Specifically, it can be achieved by concatenating different pooling results along the channel dimension to fuse multi-dimensional feature information.
[0112] Step S5340: Input the eighth intermediate audio feature into the convolutional layer to obtain the output ninth intermediate audio feature; convolutional layer processing refers to feature transformation through learnable convolutional kernels. Specifically, a single-layer 3×3 convolutional kernel can be used to model the spatial relationship of the concatenated features to generate spatial attention weights.
[0113] Step S5350: Activate the ninth intermediate audio feature based on the Sigmoid function to obtain the first spatial attention feature. Sigmoid function activation refers to a non-linear operation that maps features to the 0-1 interval. Specifically, it can be implemented by calculating the S-shaped function value element by element, which is used to highlight key regions in the spatial dimension.
[0114] This method guides spatial attention generation by fusing channel dimension statistical information, which effectively enhances the model's ability to capture defect features such as uneven density inside concrete and obstruction by repair materials, reduces the dependence on the accuracy of sensor placement, and achieves more stable detection performance in complex engineering scenarios.
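Steps S5310 to S5350 can be sketched as follows. This is a minimal sketch under stated assumptions: the 3×3 kernel follows the example given in step S5340, while the zero padding, the random kernel values, the feature sizes, and all function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3x3_same(x, kernel, bias=0.0):
    """Apply one 3x3 filter to a (2, H, W) input with zero padding; returns (H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.full((h, w), bias, dtype=float)
    for i in range(3):
        for j in range(3):
            # Weighted sum over both input maps for kernel offset (i, j).
            out += np.tensordot(kernel[:, i, j], padded[:, i:i + h, j:j + w], axes=1)
    return out

def spatial_attention(feature, kernel):
    """feature: (C, H, W). Returns a spatial attention map of shape (H, W)."""
    sixth = feature.mean(axis=0)            # S5310: average pooling across channels
    seventh = feature.max(axis=0)           # S5320: max pooling across channels
    eighth = np.stack([sixth, seventh])     # S5330: concatenate into a (2, H, W) tensor
    ninth = conv3x3_same(eighth, kernel)    # S5340: single 3x3 convolutional layer
    return sigmoid(ninth)                   # S5350: map to the 0-1 interval

rng = np.random.default_rng(0)
second_channel_attention = rng.normal(size=(8, 4, 6))   # hypothetical (C, H, W) input
kernel = rng.normal(scale=0.1, size=(2, 3, 3))          # stand-in for learned weights

first_spatial_attention = spatial_attention(second_channel_attention, kernel)
# Pixel-wise multiplication gives the second spatial attention feature.
second_spatial_attention = second_channel_attention * first_spatial_attention[None, :, :]
```

The same map is applied to every channel, so the block filters by spatial location only; the per-channel weighting has already been applied upstream.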
[0115] Further, step S200 involves extracting multiple temporal audio features from the audio sequence and fusing these features, including the following steps S210 and S220:
[0116] Step S210: Extract multiple temporal audio features from the audio sequence using an energy spectrum feature extraction algorithm based on wavelet packet decomposition. This algorithm decomposes the audio signal into sub-signals of different frequency bands using multi-level wavelet packet decomposition and calculates the energy value of each sub-signal to form a feature vector. Specifically, a three-level wavelet packet decomposition can be employed, which yields the energy distribution over eight frequency bands. This method effectively captures the differences in energy distribution caused by defects of different scales within concrete.
[0117] Step S220: Fuse the multiple temporal audio features based on the principal component analysis (PCA) algorithm. The PCA algorithm projects high-dimensional features into a low-dimensional space through an orthogonal transformation, retaining the principal components along the directions of maximum variance. Specifically, it can be implemented using eigenvalue decomposition of the covariance matrix. Its purpose is to eliminate redundant information between temporal features and improve the efficiency of subsequent model processing.
[0118] The wavelet packet decomposition energy spectrum features of this method can finely divide frequency bands, accurately reflecting the energy attenuation characteristics of defects such as voids and cracks inside concrete at different frequency bands. The application of principal component analysis further solves the problem of dimensionality explosion in traditional feature fusion, making the features more discriminative. This method achieves multi-scale fine extraction of time-domain audio features of steel-concrete composites, enhancing the expressive power of defect features. Energy spectrum features can effectively distinguish the frequency band energy changes caused by different defect types, while principal component analysis improves feature robustness, enabling the subsequent model to maintain stable detection accuracy even when there are deviations in sensor positions, reducing the dependence on sensor deployment accuracy.
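The wavelet packet energy extraction and the PCA fusion described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the source does not name a wavelet, so the Haar wavelet is assumed here; the frame count, frame length, number of retained components, and all function names are hypothetical.

```python
import numpy as np

def haar_split(x):
    """One Haar analysis step: low-pass and high-pass halves, each downsampled by 2."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def wavelet_packet_energy(signal, levels=3):
    """Energy of each terminal node of a full wavelet packet tree (2**levels bands).

    The signal length must be divisible by 2**levels. Bands are returned in
    natural filter-bank order, not sorted by frequency.
    """
    nodes = [np.asarray(signal, dtype=float)]
    for _ in range(levels):
        # Unlike a plain wavelet transform, every node is split again.
        nodes = [band for node in nodes for band in haar_split(node)]
    return np.array([np.sum(node ** 2) for node in nodes])

def pca_fuse(features, n_components):
    """Project (n_samples, n_features) data onto its leading principal components."""
    centered = features - features.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(covariance)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]     # largest variance first
    return centered @ eigvecs[:, order]

rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 64))   # 20 hypothetical audio frames of 64 samples each
energy_features = np.stack([wavelet_packet_energy(f) for f in frames])   # (20, 8)
fused = pca_fuse(energy_features, n_components=3)                        # (20, 3)
```

Because the Haar filters are orthonormal, the eight band energies of a frame sum to the energy of the frame itself, which provides a simple sanity check on the decomposition.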
[0119] Furthermore, the fusion of multiple frequency domain audio features in step S200 includes the following steps:
[0120] Multiple frequency domain audio features are fused based on the principal component analysis algorithm.
[0121] This method effectively solves the problem of low model training efficiency caused by the explosion of frequency domain feature dimensions, while avoiding the interference of high-frequency noise on defect detection accuracy. The frequency domain fusion audio features fused by the principal component analysis algorithm can significantly reduce the data dimensionality while retaining the core defect information, thereby improving the training speed of the subsequent frequency domain feature processing model and making it easier to converge, ultimately enhancing the accuracy and robustness of steel-concrete composite defect detection.
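The frequency-domain fusion can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of this application: it assumes that each time-domain feature is a sequence whose magnitude spectrum serves as the frequency-domain feature, and that the number of retained principal components is chosen by a cumulative variance threshold of 95%. The threshold, the synthetic data, and all function names are hypothetical.

```python
import numpy as np

def magnitude_spectrum(time_features):
    """Fourier-transform each row (one feature sequence) and keep the magnitudes."""
    return np.abs(np.fft.rfft(time_features, axis=1))

def pca_retain_variance(features, keep=0.95):
    """PCA keeping the fewest components whose cumulative variance ratio reaches `keep`."""
    centered = features - features.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    eigvals = np.clip(eigvals[::-1], 0.0, None)   # descending; clip round-off negatives
    eigvecs = eigvecs[:, ::-1]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, keep)) + 1, len(eigvals))
    return centered @ eigvecs[:, :k], k

# Synthetic stand-in data: 30 sequences of 32 samples, each a mix of two
# sinusoids with random amplitudes plus a small amount of noise.
rng = np.random.default_rng(0)
t = np.arange(32)
amplitude = rng.uniform(0.5, 1.5, size=(30, 2))
time_features = (amplitude[:, :1] * np.sin(2 * np.pi * 3 * t / 32)
                 + amplitude[:, 1:] * np.sin(2 * np.pi * 7 * t / 32)
                 + 0.01 * rng.normal(size=(30, 32)))

freq_features = magnitude_spectrum(time_features)        # (30, 17)
fused_freq, n_kept = pca_retain_variance(freq_features)  # (30, n_kept)
print(freq_features.shape, fused_freq.shape)
```

In this synthetic case almost all variance lies in the two excited frequency bins, so the projection keeps far fewer than the 17 spectral dimensions while the low-variance noise bins are discarded.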
[0122] As shown in Figure 3, one embodiment of this application provides a defect detection system for steel-concrete composite pipes based on time-frequency domain audio signals. The system includes:
[0123] The audio signal acquisition module 1100 is used to acquire the audio sequence of steel-concrete composite based on an acoustic sensor array;
[0124] The audio feature extraction module 1200 is used to extract multiple time-domain audio features from an audio sequence, and fuse the multiple time-domain audio features to obtain time-domain fused audio features; it also uses Fourier transform of the multiple time-domain audio features to generate multiple frequency-domain audio features, and fuses the multiple frequency-domain audio features to obtain frequency-domain fused audio features.
[0125] The first processing module 1300 is used to input the temporal fusion audio features into the temporal feature processing model to obtain the first audio features output by the temporal feature processing model; wherein, the temporal feature processing model includes multiple cascaded convolutional long short-term memory networks, and a convolutional layer connected to the last convolutional long short-term memory network.
[0126] The second processing module 1400 is used to input the frequency domain fused audio features into the frequency domain feature processing model to obtain the second audio features output by the frequency domain feature processing model; wherein, the network structure of the frequency domain feature processing model is the same as that of the time domain feature processing model;
[0127] The defect detection module 1500 is used to input the first audio feature and the second audio feature into the attention module, so as to enhance and fuse the first audio feature and the second audio feature based on the attention module to obtain the fourth audio feature, and to output the defect detection result of the steel-concrete composite from the fourth audio feature through the fully connected layer.
[0128] It should be noted that this embodiment of the steel-concrete composite defect detection system based on time-frequency domain audio signals is based on the same inventive concept as the above-described embodiment of the steel-concrete composite defect detection method based on time-frequency domain audio signals. Therefore, the relevant content of the above-described embodiment of the steel-concrete composite defect detection method based on time-frequency domain audio signals is also applicable to this embodiment of the steel-concrete composite defect detection system based on time-frequency domain audio signals, and will not be repeated here.
[0129] Referring to Figure 4, this application also provides an electronic device, which includes:
[0130] At least one memory;
[0131] At least one processor;
[0132] At least one program;
[0133] The program is stored in memory, and the processor executes at least one program to implement the above-described method for detecting defects in steel-concrete composite based on time-frequency domain audio signals.
[0134] This electronic device can be any smart terminal, including mobile phones, tablets, personal digital assistants (PDAs), and in-vehicle computers.
[0135] The electronic devices according to embodiments of this application will now be described in detail.
[0136] The processor 1600 can be implemented using a general-purpose central processing unit (CPU), microprocessor, application specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.
[0137] The memory 1700 can be implemented as a read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory 1700 can store the operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the program code is stored in the memory 1700 and is called and executed by the processor 1600 to execute the steel-concrete composite defect detection method based on time-frequency domain audio signals according to the embodiments of this application.
[0138] The input / output interface 1800 is used to implement information input and output.
[0139] The communication interface 1900 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0140] The bus 2000 transmits information between the various components of the device (e.g., processor 1600, memory 1700, input / output interface 1800, and communication interface 1900).
[0141] The processor 1600, memory 1700, input / output interface 1800 and communication interface 1900 are connected to each other within the device via bus 2000.
[0142] This application embodiment also provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the above-described method for detecting defects in steel-concrete composite based on time-frequency domain audio signals.
[0143] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0144] The embodiments described in this application are intended to more clearly illustrate the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. Those skilled in the art will know that with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0145] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.
[0146] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0147] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0148] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0149] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0150] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0151] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0152] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0153] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0154] The above is a detailed description of the preferred embodiments of this application. However, the embodiments of this application are not limited to the above-described implementation methods. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the embodiments of this application. All such equivalent modifications or substitutions are included within the scope defined by the claims of the embodiments of this application.
Claims
1. A method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals, characterized in that, The method includes: Audio sequences of steel-concrete composite pipes are acquired based on an acoustic sensor array. Multiple time-domain audio features are extracted from the audio sequence and fused to obtain time-domain fused audio features; the multiple time-domain audio features are Fourier transformed to generate multiple frequency-domain audio features and fused to obtain frequency-domain fused audio features. The temporal fusion audio features are input into a temporal feature processing model to obtain the first audio feature output by the temporal feature processing model; wherein, the temporal feature processing model includes multiple cascaded convolutional long short-term memory networks, and a convolutional layer connected to the last convolutional long short-term memory network; the temporal feature processing model includes four convolutional long short-term memory networks, and there is a residual connection between the output features of the first convolutional long short-term memory network and the output features of the third convolutional long short-term memory network; The frequency domain fused audio features are input into the frequency domain feature processing model to obtain the second audio features output by the frequency domain feature processing model; wherein, the frequency domain feature processing model has the same network structure as the time domain feature processing model; The first audio feature and the second audio feature are input into the first attention module to enhance and fuse the first audio feature and the second audio feature based on the first attention module to obtain the fourth audio feature. The defect detection result of the steel pipe concrete is output based on the fourth audio feature according to the fully connected layer. A second attention module is provided between each of the convolutional long short-term memory networks in the temporal feature processing model and a corresponding convolutional long short-term memory network in the frequency domain feature processing model. The input feature of the next convolutional long short-term memory network in the temporal feature processing model is: the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding second attention module; The input feature of the next convolutional long short-term memory network in the frequency domain feature processing model is: the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding second attention module; The process of the second attention module outputting features between the convolutional long short-term memory network in the temporal feature processing model and the convolutional long short-term memory network in the frequency domain feature processing model includes: The output features of the previous convolutional long short-term memory network in the time-domain feature processing model and the output features of the previous convolutional long short-term memory network in the frequency-domain feature processing model are enhanced with attention and fused to obtain the output features.
2. The method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals according to claim 1, characterized in that, The first attention module includes a channel attention block and a spatial attention block; The step of inputting the first audio feature and the second audio feature into a first attention module, and then performing attention enhancement and fusion on the first audio feature and the second audio feature based on the first attention module to obtain a fourth audio feature, includes: The first audio feature is input into the channel attention block to obtain the output first channel attention feature; The first audio feature and the first channel attention feature are multiplied pixel by pixel to obtain the second channel attention feature; The second channel attention feature is input into the spatial attention block to obtain the output first spatial attention feature; The first spatial attention feature and the second channel attention feature are multiplied pixel by pixel to obtain the second spatial attention feature; The second audio feature is input into the channel attention block to obtain the output third channel attention feature; The second audio feature and the third channel attention feature are multiplied pixel by pixel to obtain the fourth channel attention feature; The fourth channel attention feature is input into the spatial attention block to obtain the output third spatial attention feature; The third spatial attention feature and the fourth channel attention feature are multiplied pixel by pixel to obtain the fourth spatial attention feature; The second spatial attention feature and the fourth spatial attention feature are concatenated to obtain the fourth audio feature.
3. The method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals according to claim 2, characterized in that, The step of inputting the first audio feature into the channel attention block to obtain the output first channel attention feature includes: The first audio feature is subjected to global average pooling in the spatial dimension to obtain the first intermediate audio feature. The first audio feature is subjected to global max pooling in the spatial dimension to obtain the second intermediate audio feature; The first intermediate audio feature is input into a multilayer perceptron to obtain the third intermediate audio feature; The second intermediate audio feature is input into a multilayer perceptron to obtain the fourth intermediate audio feature; By concatenating the third intermediate audio feature and the fourth intermediate audio feature, a fifth intermediate audio feature is obtained; The first channel attention feature is obtained by activating the fifth intermediate audio feature using the Sigmoid function.
4. The method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals according to claim 3, characterized in that, The step of inputting the second channel attention feature into the spatial attention block to obtain the output first spatial attention feature includes: The second channel attention feature is subjected to global average pooling along the channel dimension to obtain the sixth intermediate audio feature; The second channel attention feature is subjected to global max pooling along the channel dimension to obtain the seventh intermediate audio feature; The sixth intermediate audio feature and the seventh intermediate audio feature are concatenated to obtain the eighth intermediate audio feature; The eighth intermediate audio feature is input into the convolutional layer to obtain the output ninth intermediate audio feature; The first spatial attention feature is obtained by activating the ninth intermediate audio feature using the Sigmoid function.
5. The method for detecting defects in steel-concrete composite structures based on time-frequency domain audio signals according to claim 1, characterized in that, The step of extracting multiple temporal audio features from the audio sequence and fusing the multiple temporal audio features includes: The energy spectrum feature extraction algorithm based on wavelet packet decomposition extracts multiple time-domain audio features from the audio sequence; The multiple time-domain audio features are fused based on the principal component analysis algorithm.
6. A defect detection system for steel-concrete composite pipes based on time-frequency domain audio signals, characterized in that, The system includes: An audio signal acquisition module is used to acquire audio sequences of steel-concrete composite structures based on an acoustic sensor array. The audio feature extraction module is used to extract multiple time-domain audio features from the audio sequence, and fuse the multiple time-domain audio features to obtain time-domain fused audio features; and to generate multiple frequency-domain audio features by Fourier transform of the multiple time-domain audio features, and fuse the multiple frequency-domain audio features to obtain frequency-domain fused audio features. The first processing module is used to input the temporal fused audio features into a temporal feature processing model to obtain the first audio features output by the temporal feature processing model; wherein, the temporal feature processing model includes multiple cascaded convolutional long short-term memory networks, and a convolutional layer connected to the last convolutional long short-term memory network; the temporal feature processing model includes four convolutional long short-term memory networks, and there is a residual connection between the output features of the first convolutional long short-term memory network and the output features of the third convolutional long short-term memory network; The second processing module is used to input the frequency domain fused audio features into the frequency domain feature processing model to obtain the second audio features output by the frequency domain feature processing model; wherein, the frequency domain feature processing model has the same network structure as the time domain feature processing model; A defect detection module is used to input the first audio feature and the second audio feature into a first attention module, so as to enhance and fuse the first audio feature and the second audio feature based on the first attention module to obtain a fourth audio feature, and output the defect detection result of the steel-concrete composite according to the fourth audio feature based on the fully connected layer; a second attention module is set between each convolutional long short-term memory network in the temporal feature processing model and a corresponding convolutional long short-term memory network in the frequency domain feature processing model; The input feature of the next convolutional long short-term memory network in the temporal feature processing model is: the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding second attention module; The input feature of the next convolutional long short-term memory network in the frequency domain feature processing model is: the concatenation feature of the output feature of the corresponding previous convolutional long short-term memory network and the output feature of the corresponding second attention module; The process of the second attention module outputting features between the convolutional long short-term memory network in the temporal feature processing model and the convolutional long short-term memory network in the frequency domain feature processing model includes: The output features of the previous convolutional long short-term memory network in the time-domain feature processing model and the output features of the previous convolutional long short-term memory network in the frequency-domain feature processing model are enhanced with attention and fused to obtain the output features.
7. An electronic device, characterized in that, It includes at least one controller and a memory for communicatively connecting with the controller; the memory stores instructions executable by the at least one controller, the instructions being executed by the at least one controller to cause the at least one controller to perform the steel-concrete composite defect detection method based on time-frequency domain audio signals as described in any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores computer-executable instructions for causing a computer to perform the steel-concrete composite defect detection method based on time-frequency domain audio signals as described in any one of claims 1 to 5.