A lightweight sound detection system and method for dry-wood pests based on improved MobileNetV3
The lightweight sound detection system based on an improved MobileNetV3 addresses the large parameter counts and insufficient detection capability of existing technology, enabling efficient and accurate detection of borers on embedded devices and improving detection sensitivity and accuracy.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: NORTHWEST A & F UNIV
- Filing Date: 2026-04-30
- Publication Date: 2026-06-19
AI Technical Summary
Existing acoustic detection technologies for forest pest monitoring suffer from large parameter counts and heavy computational load, which make deployment on embedded devices difficult. They cannot effectively identify the short-duration, high-frequency signals of borer vibrations, and their poor versatility makes it difficult to cover varied borer detection scenarios.
A lightweight sound detection system based on an improved MobileNetV3 is adopted. The audio signal is noise-suppressed and the sampling rate is optimized through a data preprocessing module. The spectrogram features are extracted by combining a bidirectional Mel filter bank. The feature capture capability is enhanced by a lightweight feature extraction network and a multispectral channel attention module. The weak feature learning capability is improved by a multi-scale feature fusion module. Finally, early detection is achieved through a pest species prediction module.
A lightweight trunk borer detection system has been developed, significantly reducing the number of parameters and computational load. It can be deployed on resource-constrained embedded devices, achieving a 97.18% accuracy rate with 10-fold cross-validation. It effectively identifies weak vibration signals in noisy environments, improving the sensitivity and accuracy of detection.

Figure CN122245348A_ABST
Description
Technical Field
[0001] This application relates to the field of forestry pest monitoring technology, and in particular, but not exclusively, to a lightweight sound detection system and method for borers based on an improved MobileNetV3.
Background Technology
[0002] Tree borers are one of the major threats to forest ecosystems. Their larvae bore into the trunks to feed, and are characterized by their strong concealment, long period of damage, and difficulty in control. They can easily cause huge losses in many fields such as forestry production, landscaping, construction, and home furnishing.
[0003] Existing acoustic detection technologies have several limitations. First, the large number of model parameters and computational demands involved in current acoustic detection technologies make them difficult to deploy on embedded devices or mobile terminals, and they are not suitable for on-site detection needs in forests. Second, traditional channel attention mechanisms can only capture low-frequency features and cannot effectively identify short-duration, high-frequency signals from borer vibrations. Third, weak amplitude features are easily lost during feature learning, resulting in insufficient sensitivity for detecting early-stage pests. Finally, existing models are mostly designed for specific pests such as the red-brown weevil, and their versatility is poor, making it difficult to cover various borer detection scenarios.
[0004] Therefore, there is an urgent need for a lightweight, highly sensitive, and universal sound detection system to solve the problems existing in current acoustic detection technologies and to achieve early real-time detection of various borers in forests.
Summary of the Invention
[0005] This application provides a lightweight sound detection system and method for borers based on an improved MobileNetV3.
[0006] The technical solution of the embodiments of this application is implemented as follows:
In a first aspect, embodiments of this application provide a lightweight sound detection system for borers based on an improved MobileNetV3. The system includes a data preprocessing module, a spectrogram extraction and fusion module, a lightweight feature extraction network, a multispectral channel attention module, a multi-scale feature fusion module, and a pest species prediction module, wherein:
The data preprocessing module is used to sequentially perform noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization on the original audio signal to obtain a standardized audio signal.
The spectrogram extraction and fusion module is used to perform a short-time Fourier transform on the standardized audio signal based on a bidirectional Mel filter bank to extract logarithmic and chromatic features, and to normalize the logarithmic and chromatic features and then concatenate and fuse them to obtain a fused spectrogram.
The lightweight feature extraction network is used to perform multi-scale basic feature extraction on the fused spectrogram and to output multiple intermediate feature maps at different scales. Training of the network employs a dynamic channel pruning mechanism.
The multispectral channel attention module is used to extract multi-band frequency features of each intermediate feature map through a two-dimensional discrete cosine transform, to learn frequency weights through a multilayer perceptron, and to weight the channels of each intermediate feature map to obtain multiple weighted feature maps at different scales. The multispectral channel attention module is embedded in the lightweight feature extraction network.
The multi-scale feature fusion module is used to transform the multiple weighted feature maps at different scales to a unified channel dimension through a feature pyramid network, to upsample the low-resolution feature maps by bilinear interpolation, and to concatenate them with the high-resolution feature map to obtain a fused feature map.
The pest species prediction module is used to output the probability of pest presence and the prediction results of health and damage status based on the fused feature map.
[0007] In the technical solution provided in this application, the data preprocessing module sequentially performs noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization on the original audio signal to obtain a standardized audio signal, thereby effectively separating and suppressing environmental noise while retaining high-fidelity borer vibration signals. In the spectrogram extraction and fusion module, a short-time Fourier transform is performed on the standardized audio signal based on a bidirectional Mel filter bank to extract logarithmic and chromatic features, which are normalized and then concatenated and fused to obtain a fused spectrogram, thereby enhancing the extraction of effective features and reducing mid-frequency noise. The lightweight feature extraction network performs multi-scale basic feature extraction on the fused spectrogram and outputs multiple intermediate feature maps at different scales. The network has only 0.02M parameters and a computational cost of only 9.51M FLOPs, reductions of 99.52% and 86.83% respectively compared to the original MobileNetV3, indicating a significant lightweight advantage in both parameter count and computational cost. In the multispectral channel attention module, multi-band frequency features of each intermediate feature map are extracted through a two-dimensional discrete cosine transform, frequency weights are learned through a multilayer perceptron, and the channels of each intermediate feature map are weighted to obtain multiple weighted feature maps at different scales, thereby enhancing the ability to capture features over short time intervals and enabling the model to focus on effective high-frequency and low-frequency features with temporal regularity. In the multi-scale feature fusion module, a feature pyramid network transforms the multiple weighted feature maps at different scales to a unified channel dimension. After bilinear interpolation upsampling, the low-resolution feature maps are concatenated with the high-resolution feature map to obtain a fused feature map that contains both shallow detail features and deep semantic features, thereby improving the learning of weak-amplitude features. The pest species prediction module determines the presence probability and the health / damage status prediction results of borers based on the fused feature map. The 10-fold cross-validation accuracy reaches 97.18%, an improvement of 0.94% compared to the original MobileNetV3, and weak vibration signals are effectively identified in noisy environments. Through a lightweight network structure design and a multi-dimensional feature enhancement mechanism, the technical solution provided in this application achieves early, efficient, and accurate detection of the vibration signals of various borers. Furthermore, the model has a small number of parameters and low computational cost, allowing for convenient deployment on resource-constrained embedded terminal devices and providing reliable technical support for forestry pest and disease control.
[0008] Optionally, the data preprocessing module includes a noise adaptive suppression unit, a dynamic duration adjustment unit, and a sampling rate optimization unit, wherein:
the noise adaptive suppression unit is used to perform dynamic filtering based on a "V"-shaped frequency band distribution: it analyzes the time-frequency distribution of the original audio signal through a short-time Fourier transform, identifies and adaptively suppresses noise frequency bands within a preset mid-frequency range, and identifies and retains the effective features of borers, wherein the effective features of borers include low-frequency component features below a first preset frequency and high-frequency component features above a second preset frequency;
the dynamic duration adjustment unit is used to perform duration statistical analysis on the noise-suppressed audio signal. If the audio duration is less than a preset fixed duration, zero vectors whose length does not exceed a preset padding threshold are padded at random positions at the beginning and end of the audio; if the audio duration is longer than the preset fixed duration, a window of the preset fixed duration is slid across the audio with a preset step size to cut out at least one audio segment of fixed duration, thus obtaining an audio signal of fixed duration;
the sampling rate optimization unit is used to optimize the sampling rate of the fixed-duration audio signal using a segmented mean downsampling method, reducing the original sampling rate to the target sampling rate to obtain a standardized audio signal, wherein the segmented mean downsampling method is as follows: according to the ratio between the target sampling rate and the original sampling rate, the audio sampling points are divided into groups, with each first and second adjacent point forming a group; the mean of each group is computed, and the result is used as the target sampling point after downsampling.
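The segmented mean downsampling described in this paragraph can be illustrated with a minimal sketch. It assumes that the original rate is an integer multiple of the target rate and that an incomplete trailing group is dropped. The function name and the example rates are hypothetical, since the source does not state the actual sampling rates.

```python
def segmented_mean_downsample(samples, orig_rate, target_rate):
    """Average consecutive groups of samples to lower the sampling rate."""
    if orig_rate % target_rate != 0:
        raise ValueError("this sketch assumes an integer rate ratio")
    group = orig_rate // target_rate  # 2 when the rate is halved
    usable = len(samples) - len(samples) % group  # drop an incomplete tail group
    return [sum(samples[i:i + group]) / group for i in range(0, usable, group)]


# Hypothetical rates: the source does not give the original or target rate.
audio = [0.0, 2.0, 4.0, 6.0, 1.0, 3.0]
downsampled = segmented_mean_downsample(audio, orig_rate=32000, target_rate=16000)
print(downsampled)  # [1.0, 5.0, 2.0]
```

Averaging each group acts as a crude low-pass filter before decimation, which is one plausible reason for preferring it over simply discarding every second sample.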
[0009] Optionally, the bidirectional Mel filter bank is constructed based on the "V"-shaped distribution pattern of the effective characteristics of borers, including a low-frequency conventional Mel filter bank covering a preset low-frequency band and a high-frequency inverted Mel filter bank covering a preset high-frequency band. The preset low-frequency band is 0-2kHz, and the low-frequency conventional Mel filter bank contains 48 filters. The preset high-frequency band is 6-8kHz, and the high-frequency inverted Mel filter bank contains 24 filters. The low-frequency conventional Mel filter bank and the high-frequency inverted Mel filter bank are spliced in the frequency domain to form a 72-dimensional bidirectional Mel frequency feature.
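As a rough illustration of how the 48 + 24 filter layout could be placed, the sketch below computes filter centre frequencies only. It assumes the common 2595·log10(1 + f/700) Mel formula and assumes that the "inverted" bank is a mirrored Mel layout, so that filters become denser towards 8 kHz. The source specifies neither detail, and all function names are hypothetical.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(f_lo, f_hi, n_filters):
    """Centre frequencies of n_filters filters equally spaced on the Mel scale."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + step * (i + 1)) for i in range(n_filters)]


def inverted_mel_centers(f_lo, f_hi, n_filters):
    """Assumed inversion: mirror a conventional Mel layout inside the band,
    so resolution is finest near f_hi instead of near f_lo."""
    base = mel_centers(0.0, f_hi - f_lo, n_filters)
    return sorted(f_hi - c for c in base)


low = mel_centers(0.0, 2000.0, 48)             # conventional bank, 0-2 kHz
high = inverted_mel_centers(6000.0, 8000.0, 24)  # inverted bank, 6-8 kHz
centers = low + high                            # 72 bands in total
print(len(centers))  # 72
```

With this layout the 2-6 kHz region receives no filters at all, which matches the stated goal of discarding the mid-frequency noise band.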
[0010] Optionally, the spectrogram extraction and fusion module includes a short-time Fourier transform unit, a filtering unit, a feature extraction unit, and a feature fusion unit, wherein:
the short-time Fourier transform unit is used to perform a short-time Fourier transform on the standardized audio signal to obtain a linear spectrum;
the filtering unit is used to obtain low-frequency bidirectional Mel features by weighted summation of the low-frequency Mel filter bank and the low-frequency linear spectrum, and to obtain high-frequency bidirectional Mel features by weighted summation of the high-frequency inverted Mel filter bank and the high-frequency linear spectrum; the low-frequency bidirectional Mel features and the high-frequency bidirectional Mel features are then concatenated in the frequency dimension to form a 72-dimensional bidirectional Mel spectrogram. The filtering process is expressed by the following formula:

$$M_{\mathrm{low}}(m, t) = \sum_{k} H_{\mathrm{low}}(m, k)\, S_{\mathrm{low}}(k, t), \qquad M_{\mathrm{high}}(n, t) = \sum_{j} H_{\mathrm{high}}(n, j)\, S_{\mathrm{high}}(j, t)$$

In the formula, $m$ indicates the index of the low-frequency Mel filter; $t$ indicates the time frame index; $k$ indicates the frequency bin index in the low-frequency region; $n$ indicates the index of the high-frequency Mel filter; $j$ indicates the frequency bin index in the high-frequency region; $M_{\mathrm{low}}$ represents the low-frequency Mel-spectrum output matrix; $H_{\mathrm{low}}$ represents the weighting function of the low-frequency Mel filter bank; $S_{\mathrm{low}}$ represents the linear spectrum in the low-frequency region; $M_{\mathrm{high}}$ represents the high-frequency Mel-spectrum output matrix; $H_{\mathrm{high}}$ represents the weighting function of the high-frequency Mel filter bank; and $S_{\mathrm{high}}$ represents the linear spectrum in the high-frequency region;
the feature extraction unit is used to extract logarithmic and chromatic features from the bidirectional Mel spectrogram. The logarithmic feature characterizes the bidirectional Mel frequency energy distribution with a frequency dimension of 72, and the chromatic feature characterizes the relative frequency variation with a frequency dimension of 24;
the feature fusion unit is used to normalize the logarithmic and chromatic features and concatenate them along the frequency dimension to obtain a fused spectrogram with dimensions of 1×96×157. Here, 1 represents a single channel, 96 represents the fused frequency feature dimension, and 157 represents the time dimension. The 96-dimensional fused frequency feature is obtained by concatenating the 72-dimensional logarithmic feature and the 24-dimensional chromatic feature.
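The weighted summation performed by the filtering unit and the frequency-axis fusion performed by the feature fusion unit can be sketched in plain Python. Min-max scaling is an assumption here, because the source does not say which normalization is used, and the function names and toy inputs are hypothetical.

```python
def apply_filter_bank(weights, spectrum):
    """weights: [n_filters][n_bins]; spectrum: [n_bins][n_frames].
    Returns [n_filters][n_frames], each entry a weighted sum over frequency bins."""
    n_frames = len(spectrum[0])
    return [[sum(w * spectrum[k][t] for k, w in enumerate(row))
             for t in range(n_frames)] for row in weights]


def min_max_normalize(feature):
    """Assumed normalization: scale the whole feature map into [0, 1]."""
    flat = [v for row in feature for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in feature]


def fuse(log_feature, chroma_feature):
    """Concatenate the normalized features along the frequency axis."""
    return min_max_normalize(log_feature) + min_max_normalize(chroma_feature)


mel = apply_filter_bank([[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]],
                        [[2.0, 4.0], [2.0, 0.0], [6.0, 2.0]])
print(mel)  # [[3.0, 4.0], [7.0, 2.0]]

# Toy stand-ins for the 72-band logarithmic and 24-band chromatic features.
log_feature = [[float(f + t) for t in range(157)] for f in range(72)]
chroma_feature = [[float((f * t) % 7) for t in range(157)] for f in range(24)]
fused = fuse(log_feature, chroma_feature)
print(len(fused), len(fused[0]))  # 96 157
```

Normalizing each feature before concatenation keeps the 72 logarithmic rows and the 24 chromatic rows on a comparable scale, so neither dominates the first convolution.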
[0011] Optionally, the lightweight feature extraction network is obtained by adjusting the hyperparameters of the MobileNetV3 base network. The lightweight feature extraction network includes one standard convolutional layer cascaded with four depthwise separable convolutional layers. The standard convolutional layer uses a 7×7 kernel with a stride of 2 and four output channels. The four depthwise separable convolutional layers have expansion rates of 1, 3, 3, and 1 respectively, 3×3 kernels, strides of 1, 2, 2, and 1 respectively, and 8, 16, 32, and 64 output channels respectively. The lightweight feature extraction network performing multi-scale basic feature extraction on the fused spectrogram and outputting multiple intermediate feature maps at different scales includes: in the standard convolutional layer, the fused spectrogram undergoes first-level standard convolutional processing to capture global features of the shallow high-resolution spectrogram and control the initial computational load, resulting in the first-level output feature map; in the four depthwise separable convolutional layers, the first-level output feature map is sequentially subjected to four levels of depthwise separable convolutional processing to compress the feature map size and increase the number of channels, generating four intermediate feature maps at different scales. The shapes of these four intermediate feature maps are 8×48×79, 16×24×40, 32×12×20, and 64×12×20, respectively. During training, the channels of the four intermediate feature maps at different scales are dynamically pruned: the importance score of each channel is calculated based on the L1 norm of the feature map channel, and channels with importance scores below a preset threshold are pruned.
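The stated feature map shapes can be checked with standard convolution output arithmetic. The sketch below assumes a padding of kernel // 2 in every layer, which the source does not state. Under that assumption the 1×96×157 input reproduces the four listed shapes. The layer names are hypothetical labels.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(freq=96, frames=157):
    # (label, kernel, stride, output channels); padding = kernel // 2 is assumed.
    layers = [
        ("conv7x7", 7, 2, 4),
        ("dsconv1", 3, 1, 8),
        ("dsconv2", 3, 2, 16),
        ("dsconv3", 3, 2, 32),
        ("dsconv4", 3, 1, 64),
    ]
    shapes = []
    h, w = freq, frames
    for name, kernel, stride, channels in layers:
        h = conv_out(h, kernel, stride, kernel // 2)
        w = conv_out(w, kernel, stride, kernel // 2)
        shapes.append((name, (channels, h, w)))
    return shapes


shapes = trace_shapes()
for name, shape in shapes:
    print(name, shape)
# conv7x7 (4, 48, 79)
# dsconv1 (8, 48, 79)
# dsconv2 (16, 24, 40)
# dsconv3 (32, 12, 20)
# dsconv4 (64, 12, 20)
```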
[0012] Optionally, the multispectral channel attention module is specifically used to:
perform a two-dimensional discrete cosine transform on the intermediate feature maps at four different scales, extract multiple frequency components covering frequencies from low to high by adjusting the frequency index, and generate multiple frequency weight elements, wherein the low-frequency components correspond to the slow boring signal of the pest, and the high-frequency components correspond to the rapid gnawing signal of the pest. The two-dimensional discrete cosine transform is expressed by the following formula:

$$\mathrm{Freq}^{i} = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{:,h,w} \cos\!\left(\frac{\pi u_i}{H}\left(h + \frac{1}{2}\right)\right) \cos\!\left(\frac{\pi v_i}{W}\left(w + \frac{1}{2}\right)\right), \quad i \in \{0, 1, \dots, n-1\}$$

In the formula, $\mathrm{Freq}^{i}$ indicates the frequency weight element corresponding to the $i$-th frequency component; $h$ indicates the spatial position index along the height direction of the feature map; $H$ indicates the height of the feature map; $w$ indicates the spatial position index along the width direction of the feature map; $W$ indicates the width of the feature map; $X$ is the intermediate feature map; $X_{:,h,w}$ indicates the feature vector of all channels at spatial position $(h, w)$; $u_i$ represents the frequency index of the two-dimensional discrete cosine transform in the vertical direction; $v_i$ represents the frequency index of the two-dimensional discrete cosine transform in the horizontal direction; $i$ indicates the index of the frequency component; and $n$ is the predetermined total number of frequency components to be extracted;
concatenate the multiple frequency weight elements to obtain a complete frequency weight vector, and input this vector into a predefined two-layer fully connected multilayer perceptron network for frequency weight learning, generating a frequency attention weight vector matching the channel dimension of the intermediate feature map, wherein the number of hidden layer nodes in the multilayer perceptron network is 1 / 4 of the number of channels in the intermediate feature map;
extract the temporal correlation information of the intermediate feature map in the time dimension, capturing specific vibration sequences within a preset time window to generate a temporal attention weight vector;
fuse the frequency attention weight vector and the temporal attention weight vector to generate a frequency-temporal fusion weight vector, and multiply the frequency-temporal fusion weight vector point-by-point with the intermediate feature map to generate a weighted feature map.
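The frequency weight elements produced by the two-dimensional discrete cosine transform can be sketched as follows. The sketch projects every channel onto a few DCT-II cosine bases and concatenates the results. The multilayer perceptron and the temporal attention branch are omitted, and the function names and the chosen (u, v) pairs are hypothetical.

```python
import math


def dct_basis(u, v, height, width):
    """2D DCT-II cosine basis for vertical index u and horizontal index v."""
    return [[math.cos(math.pi * u * (h + 0.5) / height)
             * math.cos(math.pi * v * (w + 0.5) / width)
             for w in range(width)] for h in range(height)]


def multispectral_descriptor(feature_map, freq_indices):
    """feature_map: [C][H][W]. For each (u, v) pair, project every channel onto
    the basis, then concatenate the per-channel results into one vector."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    descriptor = []
    for u, v in freq_indices:
        basis = dct_basis(u, v, height, width)
        for channel in feature_map:
            descriptor.append(sum(channel[h][w] * basis[h][w]
                                  for h in range(height) for w in range(width)))
    return descriptor


# Two constant channels of size 4 x 5: only the (0, 0) component is non-zero.
fmap = [[[1.0] * 5 for _ in range(4)], [[2.0] * 5 for _ in range(4)]]
desc = multispectral_descriptor(fmap, [(0, 0), (1, 0), (0, 2)])
print([round(x, 6) for x in desc])
```

The (0, 0) component is the plain spatial sum, which is proportional to global average pooling. The additional components are what let the attention weights react to higher-frequency structure in the spectrogram.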
[0013] Optionally, the multi-scale feature fusion module includes a channel unification unit, an upsampling unit, a feature map stitching unit, a channel compression unit, and a feature refinement and filtering unit, wherein:
the channel unification unit is used to perform 1×1 pointwise convolution processing on the weighted feature maps of four different scales through a feature pyramid network, thereby unifying the channel dimension of each weighted feature map to 64;
the upsampling unit is used, after channel unification and taking the first-layer weighted feature map with a spatial size of 48×79 as the reference, to perform 2x bilinear interpolation upsampling on the second-layer weighted feature map with a spatial size of 24×40, 4x bilinear interpolation upsampling on the third-layer weighted feature map with a spatial size of 12×20, and 4x bilinear interpolation upsampling on the fourth-layer weighted feature map with a spatial size of 12×20, so as to unify the spatial size of the four weighted feature maps to 48×79;
the feature map stitching unit is used to stitch the four weighted feature maps along the channel dimension after their spatial size has been unified, to obtain an initial fused feature map;
the channel compression unit is used to perform 1×1 pointwise convolution compression on the initial fused feature map to obtain a fused feature map with 64 channels and a spatial size of 48×79;
the feature refinement and filtering unit is used to input the fused feature map into a preset small discriminant network to score the effectiveness of each region of the fused feature map and to select the regions with scores higher than a preset threshold.
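Because 2 × 40 and 4 × 20 both give 80 rather than 79, the upsampling has to target the reference size 48×79 directly rather than use an exact integer scale factor. The sketch below resizes one channel with bilinear interpolation using a half-pixel-centre convention. That convention is an assumption, since the source does not specify it, and the function name is hypothetical.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid to (out_h, out_w) with bilinear interpolation.
    Uses half-pixel centres and clamps sample positions at the borders."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# One channel of a third- or fourth-layer map (12 x 20), resized to 48 x 79.
level = [[3.0] * 20 for _ in range(12)]
up = bilinear_resize(level, 48, 79)
print(len(up), len(up[0]))  # 48 79
```

After all four maps share the 48×79 size, stitching them along the channel dimension gives 4 × 64 = 256 channels, which the channel compression unit reduces back to 64.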
[0014] Optionally, the pest species prediction module is specifically used to: perform global average pooling on each channel of the fused feature map, compressing each fused feature map with a spatial size of 48×79 into a 64-dimensional feature vector with the same dimension as the number of input channels; input the 64-dimensional feature vector into a fully connected layer, and map the feature vector to the category space through linear transformation to obtain a category score vector with a dimension equal to the total number of trunk-boring pest categories; and normalize the category score vector through the Softmax function to obtain the probability of existence and the prediction results of the health and damage status of trunk-boring pests.
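The prediction head described in this paragraph (global average pooling, one fully connected layer, Softmax) can be sketched in plain Python. The toy feature map, the weights, and the class count are hypothetical. In the described system the pooled vector has 64 entries and the number of outputs equals the number of pest categories.

```python
import math


def global_average_pool(feature_map):
    """feature_map: [C][H][W] -> one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def linear(vector, weights, bias):
    """Fully connected layer: one score per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: 2 channels of size 2 x 2 and 3 hypothetical classes.
fmap = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [2.0, 2.0]]]
pooled = global_average_pool(fmap)                        # [4.0, 1.0]
scores = linear(pooled, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
probs = softmax(scores)
print(pooled, scores)  # [4.0, 1.0] [4.0, 1.0, 2.5]
```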
[0015] In a second aspect, embodiments of this application provide a lightweight sound detection method for borers based on an improved MobileNetV3. The method is applied to a lightweight sound detection system for borers based on an improved MobileNetV3, the system including a data preprocessing module, a spectrogram extraction and fusion module, a lightweight feature extraction network, a multispectral channel attention module, a multi-scale feature fusion module, and a pest species prediction module. The method includes:
sequentially performing noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization on the original audio signal to obtain a standardized audio signal;
performing a short-time Fourier transform on the standardized audio signal based on a bidirectional Mel filter bank to extract logarithmic and chromatic features, and normalizing the logarithmic and chromatic features and then concatenating and fusing them to obtain a fused spectrogram;
performing multi-scale basic feature extraction on the fused spectrogram and outputting multiple intermediate feature maps at different scales;
extracting multi-band frequency features of each intermediate feature map using a two-dimensional discrete cosine transform, learning frequency weights using a multilayer perceptron, and weighting the channels of each intermediate feature map to obtain multiple weighted feature maps at different scales, wherein the multispectral channel attention module is embedded within the lightweight feature extraction network;
transforming the multiple weighted feature maps at different scales to a unified channel dimension using a feature pyramid network, performing bilinear interpolation upsampling on the low-resolution feature maps, and concatenating them with the high-resolution feature map to obtain a fused feature map;
outputting, based on the fused feature map, the probability of pest presence and the prediction results of the pest's health and damage status.
[0016] In a third aspect, embodiments of this application provide an electronic device, including a memory and a processor. The memory stores a computer program that can run on the processor. When the processor executes the program, it implements the steps of the above-mentioned lightweight sound detection method for borers based on an improved MobileNetV3.
[0017] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the aforementioned lightweight sound detection method for borers based on an improved MobileNetV3.
[0018] The beneficial effects of the technical solutions provided in this application include at least the following: This application provides a lightweight sound detection system and method for borers based on an improved MobileNetV3. The data preprocessing module sequentially performs noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization on the original audio signal to obtain a standardized audio signal, which effectively separates and suppresses environmental noise while preserving high-fidelity borer vibration signals. In the spectrogram extraction and fusion module, a short-time Fourier transform is performed on the standardized audio signal using a bidirectional Mel filter bank to extract logarithmic and chromatic features, which are normalized and then concatenated and fused to obtain a fused spectrogram, thereby enhancing the extraction of effective features and reducing mid-frequency noise. The lightweight feature extraction network extracts multi-scale basic features from the fused spectrogram and outputs multiple intermediate feature maps at different scales. The network has only 0.02M parameters and a computational cost of only 9.51M FLOPs, reductions of 99.52% and 86.83% respectively compared to the original MobileNetV3, indicating a significant lightweight advantage in both parameter count and computational cost. In the multispectral channel attention module, multi-band frequency features of each intermediate feature map are extracted through a two-dimensional discrete cosine transform, frequency weights are learned through a multilayer perceptron, and the channels of each intermediate feature map are weighted to obtain multiple weighted feature maps at different scales, thereby enhancing the ability to capture features over short time intervals and enabling the model to focus on effective high-frequency and low-frequency features with temporal regularity. In the multi-scale feature fusion module, a feature pyramid network transforms the multiple weighted feature maps at different scales to a unified channel dimension. After bilinear interpolation upsampling, the low-resolution feature maps are concatenated with the high-resolution feature map to obtain a fused feature map that contains both shallow detail features and deep semantic features, thereby improving the learning of weak-amplitude features. The pest species prediction module determines the presence probability and the health / damage status prediction results of borers based on the fused feature map. The 10-fold cross-validation accuracy reaches 97.18%, an improvement of 0.94% compared to the original MobileNetV3, and weak vibration signals are effectively identified in noisy environments. Through a lightweight network structure design and a multi-dimensional feature enhancement mechanism, the technical solution provided in this application achieves early, efficient, and accurate detection of the vibration signals of various borers. Furthermore, the model has a small number of parameters and low computational cost, allowing for convenient deployment on resource-constrained embedded terminal devices and providing reliable technical support for forestry pest and disease control.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort, wherein:
Figure 1 is a schematic diagram of a lightweight sound detection system for borers based on an improved MobileNetV3, provided in an embodiment of this application;
Figure 2 is a flowchart of a lightweight sound detection method for borers based on an improved MobileNetV3, provided in an embodiment of this application;
Figure 3 is a schematic diagram of the hardware entity of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. The following embodiments are used to illustrate this application, but are not intended to limit the scope of this application. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0021] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.
[0022] It should be noted that the terms "first, second, and third" used in the embodiments of this application are merely to distinguish similar objects and do not represent a specific order of objects. It is understood that "first, second, and third" can be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.
[0023] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of this application pertain. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0024] The embodiments of this application will be further described below with reference to the accompanying drawings.
[0025] In view of the current problems in the sound detection of borers in the field of forest pest monitoring technology, this application provides a lightweight sound detection system and method for borers based on an improved MobileNetV3.
[0026] The technical solution of this application is described below, starting with the system implementation of this application.
[0027] Please refer to Figure 1, which shows a schematic diagram of a lightweight sound detection system for borers based on an improved MobileNetV3 according to an embodiment of this application. As shown in Figure 1, the system includes a data preprocessing module 01, a spectrogram extraction and fusion module 02, a lightweight feature extraction network 03, a multispectral channel attention module 04, a multi-scale feature fusion module 05, and a pest species prediction module 06. The data preprocessing module 01 sequentially performs noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization on the original audio signal to obtain a standardized audio signal. The spectrogram extraction and fusion module 02 performs a short-time Fourier transform on the standardized audio signal based on a bidirectional Mel filter bank to extract logarithmic and chromatic features, which are normalized and then concatenated and fused to obtain a fused spectrogram. The lightweight feature extraction network 03 performs multi-scale basic feature extraction on the fused spectrogram, outputting multiple intermediate feature maps at different scales. The multispectral channel attention module 04 extracts multi-band frequency features of each intermediate feature map through a two-dimensional discrete cosine transform, learns frequency weights through a multilayer perceptron, and weights the channels of each intermediate feature map to obtain multiple weighted feature maps at different scales. The multispectral channel attention module is embedded in the lightweight feature extraction network. The multi-scale feature fusion module 05 transforms the multiple weighted feature maps at different scales to a unified channel dimension through a feature pyramid network, performs bilinear interpolation upsampling on the low-resolution feature maps, and concatenates them with the high-resolution feature map to obtain a fused feature map. The pest species prediction module 06 outputs the probability of pest presence and the prediction results of health and damage status based on the fused feature map.
[0028] In this embodiment, the data preprocessing module 01 includes a noise adaptive suppression unit, a dynamic duration adjustment unit, and a sampling rate optimization unit. Specifically, the noise adaptive suppression unit applies dynamic filtering based on the "V"-shaped frequency-band distribution of the effective borer features: the time-frequency distribution of the original audio signal is analyzed through a short-time Fourier transform, noise frequency bands within a preset mid-frequency range are identified and adaptively suppressed, and the effective low-frequency and high-frequency features of borers are retained. The preset mid-frequency range is 2kHz-6kHz. The effective features include low-frequency components below a first preset frequency and high-frequency components above a second preset frequency; in this embodiment, the first preset frequency is 2kHz and the second preset frequency is 6kHz. In the dynamic duration adjustment unit, a statistical analysis of duration is performed on the noise-suppressed audio signal. The audio durations are mainly concentrated between 6.143 and 22.528 seconds. Taking 20 seconds as the baseline, a strategy of random zero-padding and sliding clipping is adopted. If the audio duration is less than the preset fixed duration of 20 seconds, zero vectors with a length not exceeding the preset padding threshold are inserted at random positions at the beginning and end of the audio; the preset padding threshold does not exceed 2 seconds, to avoid feature dilution. If the audio duration is greater than the preset fixed duration of 20 seconds, a window of the preset fixed duration is slid across the audio with a preset step size of 0.5 seconds to clip at least one audio segment of fixed duration, retaining segments that contain potential borer signals and improving data diversity. The result is an audio signal with a uniform duration of 20 seconds.
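The duration adjustment described above can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not fix: the function names are hypothetical, the required padding is split at random between the head and the tail of the clip, every full window is kept (no selection of segments with potential borer signals is performed), and the interaction between the 2-second padding threshold and clips much shorter than 20 seconds is not modeled.

```python
import random


def pad_to_length(samples, target_len, rng):
    """Zero-pad a short clip to target_len, splitting the padding at random
    between the beginning and the end of the clip (assumed strategy)."""
    deficit = target_len - len(samples)
    if deficit <= 0:
        return list(samples)
    head = rng.randint(0, deficit)   # zeros placed before the signal
    tail = deficit - head            # zeros placed after the signal
    return [0.0] * head + list(samples) + [0.0] * tail


def sliding_clips(samples, window_len, step_len):
    """Cut every full window of window_len samples, advancing by step_len."""
    last_start = len(samples) - window_len
    return [samples[s:s + window_len]
            for s in range(0, last_start + 1, step_len)]


def adjust_duration(samples, sample_rate, target_s=20.0, step_s=0.5, rng=None):
    """Return a list of clips that all last exactly target_s seconds."""
    rng = rng or random.Random(0)
    window_len = int(round(target_s * sample_rate))
    step_len = int(round(step_s * sample_rate))
    if len(samples) < window_len:
        return [pad_to_length(samples, window_len, rng)]
    return sliding_clips(samples, window_len, step_len)
```

With a 22.5-second recording, the 0.5-second step yields six overlapping 20-second clips (window starts at 0, 0.5, ..., 2.5 seconds), which is how the sliding strategy increases data diversity.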
[0029] Furthermore, in the sampling rate optimization unit, the original sampling rate of the 20-second audio signal is 44.1kHz, which can lead to an imbalance between the time and frequency axes of the spectrogram. Therefore, a segmented averaging downsampling method is used to optimize the sampling rate. Based on the ratio between the target sampling rate and the original sampling rate, the audio sampling points are divided into a first part and a second part, and the points in each part are averaged in groups of a first number of points and a second number of points, respectively. Each group average is used as one target sampling point after downsampling, thereby reducing the original sampling rate to the target sampling rate of 8kHz and yielding a standardized audio signal. The first and second parts include 19,500 and 24,600 sampling points, respectively, which together make up the 44,100 sampling points of one second of audio, and the first and second numbers of points are 5 and 6, respectively. That is, the first 19,500 sampling points are averaged in groups of 5, giving 3,900 points, and the remaining 24,600 sampling points are averaged in groups of 6, giving 4,100 points, for a total of 8,000 points per second. This unifies the sampling rate to 8kHz and brings the spectrogram to a shape of 1×96×157, which is closer to square and meets the model input requirements.
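The arithmetic of the segmented averaging can be checked with a short Python sketch. It assumes that the 19,500 / 24,600 split is applied to every one-second block of 44,100 samples and that a trailing partial second is dropped; the function name and parameters are illustrative only.

```python
def segment_average_downsample(samples, src_rate=44100, first_len=19500,
                               first_group=5, second_group=6):
    """Reduce each one-second block of src_rate samples by group averaging."""
    out = []
    for start in range(0, len(samples) - src_rate + 1, src_rate):
        block = samples[start:start + src_rate]
        # First part: 19,500 samples averaged in groups of 5 -> 3,900 points.
        for i in range(0, first_len, first_group):
            out.append(sum(block[i:i + first_group]) / first_group)
        # Second part: 24,600 samples averaged in groups of 6 -> 4,100 points.
        for i in range(first_len, src_rate, second_group):
            out.append(sum(block[i:i + second_group]) / second_group)
    return out
```

Each second therefore contributes 3,900 + 4,100 = 8,000 output points, so a 20-second clip of 882,000 samples becomes 160,000 samples at 8kHz.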
[0030] In this embodiment, the spectrogram extraction and fusion module 02 includes a short-time Fourier transform unit, a filtering unit, a feature extraction unit, and a feature fusion unit. Specifically, in the short-time Fourier transform unit, a short-time Fourier transform is performed on the standardized audio signal to obtain a linear spectrum. In the filtering unit, a low-frequency bidirectional Mel feature is obtained by weighted summation of a low-frequency Mel filter bank and the low-frequency linear spectrum, and a high-frequency bidirectional Mel feature is obtained by weighted summation of a high-frequency inverted Mel filter bank and the high-frequency linear spectrum. The low-frequency bidirectional Mel feature and the high-frequency bidirectional Mel feature are then concatenated in the frequency dimension to form a 72-dimensional bidirectional Mel spectrogram, thereby achieving enhanced extraction of effective features and reduction of mid-frequency noise. The bidirectional Mel filter bank is constructed based on the "V"-shaped distribution pattern of effective characteristics of borers. It includes a low-frequency conventional Mel filter bank covering a preset low-frequency band and a high-frequency inverted Mel filter bank covering a preset high-frequency band. The preset low-frequency band is 0-2kHz, and the low-frequency conventional Mel filter bank contains 48 filters. The preset high-frequency band is 6-8kHz, and the high-frequency inverted Mel filter bank contains 24 filters. After being spliced in the frequency domain, the low-frequency conventional Mel filter bank and the high-frequency inverted Mel filter bank form a 72-dimensional bidirectional Mel frequency characteristic. 
The filtering process is expressed by the following formula: $$M_{\mathrm{low}}(m,t)=\sum_{k}W_{\mathrm{low}}(m,k)\,S_{\mathrm{low}}(k,t),\qquad M_{\mathrm{high}}(n,t)=\sum_{j}W_{\mathrm{high}}(n,j)\,S_{\mathrm{high}}(j,t)$$ In the formula, $m$ indicates the index of the low-frequency Mel filter; $t$ indicates the time frame index; $k$ indicates the frequency bin index in the low-frequency region; $n$ indicates the index of the high-frequency Mel filter; $j$ indicates the frequency bin index in the high-frequency region; $M_{\mathrm{low}}$ represents the low-frequency Mel-spectrum output matrix; $W_{\mathrm{low}}$ represents the weighting function of the low-frequency Mel filter bank; $S_{\mathrm{low}}$ represents the linear spectrum in the low-frequency region; $M_{\mathrm{high}}$ represents the high-frequency Mel-spectrum output matrix; $W_{\mathrm{high}}$ represents the weighting function of the high-frequency Mel filter bank; and $S_{\mathrm{high}}$ represents the linear spectrum in the high-frequency region.
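One possible construction of such a bidirectional filter bank is sketched below in Python. Several points are assumptions rather than details given in the text: the HTK-style Mel formula, the triangular filter shape, building the inverted bank by mirroring the Mel spacing so that the narrowest filters lie next to the upper band edge, and the STFT parameters in the usage note (a hypothetical `sample_rate` of 16000 and `n_fft` of 1024, chosen only so that the bins span the 6-8kHz band). All function names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_edges(f_lo, f_hi, n_filters):
    """n_filters + 2 edge frequencies, equally spaced on the Mel scale."""
    m_lo, m_hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    step = (m_hi - m_lo) / (n_filters + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_filters + 2)]


def inverted_mel_edges(f_lo, f_hi, n_filters):
    """Mirror the Mel spacing so the narrowest filters sit next to f_hi."""
    mirrored = mel_edges(0.0, f_hi - f_lo, n_filters)
    return sorted(f_hi - e for e in mirrored)


def triangular_bank(edges, bin_freqs):
    """One triangular weight row W[m][k] per consecutive triple of edges."""
    bank = []
    for lo, mid, hi in zip(edges, edges[1:], edges[2:]):
        row = []
        for f in bin_freqs:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def apply_bank(bank, spectrum):
    """Weighted summation M[m][t] = sum_k W[m][k] * S[k][t]."""
    n_frames = len(spectrum[0])
    return [[sum(w * spectrum[k][t] for k, w in enumerate(row) if w)
             for t in range(n_frames)] for row in bank]


def bidirectional_mel(spectrum, sample_rate, n_fft):
    """48 low-band rows (0-2 kHz) stacked on 24 high-band rows (6-8 kHz)."""
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    low = triangular_bank(mel_edges(0.0, 2000.0, 48), bin_freqs)
    high = triangular_bank(inverted_mel_edges(6000.0, 8000.0, 24), bin_freqs)
    # Filters are zero outside their band, so the full spectrum can be passed.
    return apply_bank(low, spectrum) + apply_bank(high, spectrum)
```

Calling `bidirectional_mel(spectrum, 16000, 1024)` on a spectrum with 513 frequency bins returns 72 rows per time frame; no filter covers the 2-6kHz range, which is how the mid-frequency band is left out of the feature.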
[0031] Furthermore, in the feature extraction unit, logarithmic features and chroma features are extracted from the bidirectional Mel spectrogram. The logarithmic features characterize the bidirectional Mel frequency energy distribution and have a frequency dimension of 72; the chroma features characterize the relative frequency variation and have a frequency dimension of 24. In the feature fusion unit, the logarithmic and chroma features are each Min-Max normalized to the [0,1] interval and then concatenated along the frequency dimension to obtain a fused spectrogram of size 1×96×157 that matches the model input requirements. Here, 1 represents a single channel; 96 represents the fused frequency feature dimension, obtained by concatenating the 72-dimensional logarithmic feature and the 24-dimensional chroma feature; and 157 represents the time dimension.
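The normalization and concatenation step can be sketched in a few lines of Python. The sketch assumes both features are stored as nested lists with rows as frequency and columns as time, and it returns zeros for a constant input to avoid dividing by zero; the function names are hypothetical and the extraction of the two features themselves is not shown.

```python
def min_max_normalize(feature):
    """Scale a 2-D feature (rows = frequency, columns = time) into [0, 1]."""
    lo = min(min(row) for row in feature)
    hi = max(max(row) for row in feature)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in feature]  # constant input
    return [[(v - lo) / span for v in row] for row in feature]


def fuse_features(log_mel, chroma):
    """Normalize each feature separately, then stack along frequency."""
    if len(log_mel[0]) != len(chroma[0]):
        raise ValueError("features must share the same number of time frames")
    fused = min_max_normalize(log_mel) + min_max_normalize(chroma)
    return [fused]  # leading axis of length 1 = single channel
```

Normalizing the two features separately keeps the 24 chroma rows from being swamped by the much larger dynamic range of the 72 logarithmic rows.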
[0032] In this embodiment, the lightweight feature extraction network 03 is based on the MobileNetV3 network. It is obtained by adjusting the network's hyperparameters, including reducing the number of network layers, optimizing the expansion rate of the inverted residual blocks, and employing depthwise separable convolutions, to reduce the number of parameters and the computational cost. The lightweight feature extraction network 03 consists of one standard convolutional layer cascaded with four depthwise separable convolutional layers. Removing redundant inverted residual blocks reduces the network depth and the number of computational steps, and using depthwise separable convolutions instead of traditional convolutions reduces the computational cost further. The standard convolutional layer uses a 7×7 kernel with a stride of 2 to capture global features from the shallow, high-resolution spectrogram, and it outputs four channels to control the initial computational cost. The original MobileNetV3 expansion rate is 6. After adjusting the expansion rate of the inverted residual blocks, the expansion rates of the four depthwise separable convolutional layers are set to 1, 3, 3, and 1, respectively. The convolutional kernels are all 3×3, and the strides are 1, 2, 2, and 1, respectively, which gradually compresses the feature map size and improves computational efficiency. The numbers of output channels are 8, 16, 32, and 64, respectively. By dynamically adjusting the growth rate of the number of channels, effective features are retained while the amount of computation is reduced.
[0033] In this embodiment, a lightweight feature extraction network 03 is used to extract basic features from the fused spectrogram at multiple scales, outputting multiple intermediate feature maps at different scales. Specifically, in the standard convolutional layer, the fused spectrogram undergoes a first-level standard convolutional process to capture the global features of the shallow high-resolution spectrogram and control the initial computational cost, resulting in the first-level output feature map. In the four depthwise separable convolutional layers, the first-level output feature map undergoes four levels of depthwise separable convolution processing sequentially to compress the feature map size and increase the number of channels, generating four intermediate feature maps at different scales with shapes of 8×48×79, 16×24×40, 32×12×20, and 64×12×20, respectively. The depthwise separable convolutional layers adopt a depthwise convolution + pointwise convolution structure, where depthwise convolution is responsible for extracting spatial features and pointwise convolution is responsible for adjusting the number of channels, reducing the computational cost by 89% compared to traditional convolution.
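The feature-map shapes listed above can be reproduced with a small Python sketch. It assumes each convolution uses a padding of half the kernel size (the text does not state the padding), and the function names are illustrative. The last helper gives the standard multiply-add ratio of a depthwise-plus-pointwise pair relative to a full convolution, $1/C_{\mathrm{out}} + 1/K^2$, ignoring the expansion layers.

```python
def conv_out(size, kernel, stride):
    """Output length of a convolution with padding kernel // 2 (assumed)."""
    pad = kernel // 2
    return (size + 2 * pad - kernel) // stride + 1


def trace_shapes(height=96, width=157):
    """Shapes after the 7x7 stem and the four depthwise separable layers."""
    # (out_channels, kernel, stride) taken from the embodiment.
    layers = [(4, 7, 2), (8, 3, 1), (16, 3, 2), (32, 3, 2), (64, 3, 1)]
    shapes = []
    for channels, kernel, stride in layers:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        shapes.append((channels, height, width))
    return shapes


def separable_cost_ratio(out_channels, kernel=3):
    """Multiply-adds of depthwise + pointwise relative to a standard conv."""
    return 1.0 / out_channels + 1.0 / (kernel * kernel)
```

Under this padding assumption `trace_shapes()` gives 4×48×79 for the stem followed by 8×48×79, 16×24×40, 32×12×20, and 64×12×20. For a 3×3 kernel the cost ratio approaches 1/9 as the number of output channels grows, which corresponds to a reduction of roughly 89%.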
[0034] It is worth noting that the network training adopts a dynamic channel pruning mechanism. During the training process, the channels of the intermediate feature maps at four different scales are dynamically pruned. The importance score of each channel is calculated based on the L1 norm of the feature map channels, and channels with importance scores below a preset threshold are dynamically pruned. The number of effective channels in each layer can be adaptively adjusted according to the differences in vibration signal characteristics of different borers. For example, low-discrimination channels are pruned for longhorn beetle vibration signals, while channels with matching features are retained for bark beetle signals. This further reduces redundant calculations, reducing the number of parameters by another 10%-15%, and improving the average detection speed by about 8% in multi-pest mixed detection scenarios.
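The channel scoring step can be sketched as follows. This is a minimal illustration, not the training-time mechanism itself: the score is taken as the L1 norm divided by the number of elements (an assumption, so that one threshold can be reused across scales), pruned channels are zeroed rather than removed, and the names are hypothetical.

```python
def channel_l1_scores(feature_map):
    """Mean absolute activation per channel of a C x H x W nested list."""
    scores = []
    for channel in feature_map:
        total = sum(abs(v) for row in channel for v in row)
        count = sum(len(row) for row in channel)
        scores.append(total / count)
    return scores


def prune_channels(feature_map, threshold):
    """Zero out channels whose importance score is below the threshold."""
    scores = channel_l1_scores(feature_map)
    keep = [s >= threshold for s in scores]
    pruned = [ch if k else [[0.0] * len(row) for row in ch]
              for ch, k in zip(feature_map, keep)]
    return pruned, keep
```

The returned `keep` mask is what a deployment step would use to drop the corresponding filters and obtain the parameter savings.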
[0035] In this embodiment, to address the short-duration characteristics of borer vibration signals, the original channel attention of the MobileNetV3 base network is replaced by a multispectral channel attention module 04 embedded within the lightweight feature extraction network 03, so as to achieve full-band feature capture. In the multispectral channel attention module 04, the multi-band frequency features of the spectrogram of each intermediate feature map are extracted using a two-dimensional discrete cosine transform, frequency weights are learned using a multilayer perceptron, and the channels of each intermediate feature map are weighted to obtain multiple weighted feature maps at different scales. Specifically, a two-dimensional discrete cosine transform is performed on the four intermediate feature maps at different scales. By adjusting the frequency indices, multiple frequency components covering low to high frequencies are extracted, generating multiple frequency weight elements. When the frequency indices $(u_i, v_i)$ are small, low-frequency components are extracted; these correspond to the slow boring signals of the pest. When the frequency indices are increased, high-frequency components are extracted; these correspond to the rapid biting signals of the pest. Together the components cover the entire frequency band from 0 to 8 kHz. The two-dimensional discrete cosine transform is calculated by the following equation: $$\mathrm{Freq}^{i}=\sum_{h=0}^{H-1}\sum_{w=0}^{W-1}X_{:,h,w}\cos\left(\frac{\pi u_i}{H}\left(h+\frac{1}{2}\right)\right)\cos\left(\frac{\pi v_i}{W}\left(w+\frac{1}{2}\right)\right),\quad i\in\{0,1,\ldots,n-1\}$$ In the formula, $\mathrm{Freq}^{i}$ indicates the frequency weight element corresponding to the $i$-th frequency component; $h$ indicates the spatial position index along the height direction of the feature map; $H$ indicates the height of the feature map; $w$ indicates the spatial position index along the width direction of the feature map; $W$ indicates the width of the feature map; $X$ is the intermediate feature map; $X_{:,h,w}$ indicates the feature vector of all channels at spatial location $(h,w)$; $u_i$ represents the frequency index of the two-dimensional discrete cosine transform in the vertical direction; $v_i$ represents the frequency index of the two-dimensional discrete cosine transform in the horizontal direction; $i$ indicates the index of the frequency component; and $n$ represents the predefined total number of frequency components to be extracted. The frequency weight elements are concatenated to obtain the complete frequency weight vector $\mathrm{Freq}=\mathrm{cat}\left(\mathrm{Freq}^{0},\mathrm{Freq}^{1},\ldots,\mathrm{Freq}^{n-1}\right)$.
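The projection of a feature map onto two-dimensional discrete cosine bases can be sketched in plain Python. The sketch assumes the common DCT-II form, in which each basis value is the product of cos(pi·u/H·(h + 1/2)) and cos(pi·v/W·(w + 1/2)), and it leaves the choice of frequency index pairs to the caller; the function names are illustrative.

```python
import math


def dct_basis(u, v, height, width):
    """B[h][w] = cos(pi*u/H*(h+1/2)) * cos(pi*v/W*(w+1/2))."""
    return [[math.cos(math.pi * u / height * (h + 0.5)) *
             math.cos(math.pi * v / width * (w + 0.5))
             for w in range(width)] for h in range(height)]


def frequency_component(feature_map, u, v):
    """Project every channel of a C x H x W map onto one DCT basis."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    basis = dct_basis(u, v, height, width)
    return [sum(channel[h][w] * basis[h][w]
                for h in range(height) for w in range(width))
            for channel in feature_map]


def frequency_vector(feature_map, indices):
    """Concatenate the components for all (u, v) pairs into one vector."""
    vector = []
    for u, v in indices:
        vector.extend(frequency_component(feature_map, u, v))
    return vector
```

For the index pair (0, 0) both cosines equal 1, so the component reduces to the plain sum over all spatial positions, that is, global average pooling up to a constant factor. Adding higher index pairs is what lets the attention see more than the average energy of a channel.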
[0036] Furthermore, the frequency weight vector is input into a predefined two-layer fully connected multilayer perceptron for frequency weight learning, which strengthens the weights of effective frequency-band features and weakens the weights of noisy frequency bands, generating a frequency attention weight vector whose dimension matches the intermediate feature map. The number of hidden-layer nodes in the multilayer perceptron is one-quarter of the number of channels in the intermediate feature map. Temporal correlation information in the intermediate feature map is extracted along the time dimension to capture specific vibration sequences within a predefined time window, generating a temporal attention weight vector. The frequency attention weight vector and the temporal attention weight vector are then fused to generate a frequency-temporal fusion weight vector that captures the temporal patterns of pest vibrations, such as the specific vibration sequences of a longhorn beetle within 5 consecutive seconds, thus strengthening the features of effective temporal segments. Finally, the frequency-temporal fusion weight vector is multiplied element-wise with the intermediate feature map to enhance the effective features and accurately capture short-time discontinuous vibration signals. The resulting weighted feature map lets the model focus on short-time, discontinuous, temporally regular effective features, improving feature discrimination. When detecting pests with strong temporal vibration characteristics, such as the red palm weevil, the accuracy is improved by about 5%.
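A rough Python sketch of the weighting step follows. Much of it is assumed rather than taken from the text: the ReLU hidden activation and sigmoid output, the weight matrices `w1` and `w2` (hypothetical placeholders for learned parameters), and in particular the fusion rule, which here simply multiplies the per-channel frequency weight by a per-frame temporal weight. How the temporal weights are computed is not shown.

```python
import math


def mlp_channel_weights(freq_vector, w1, w2):
    """Two fully connected layers: ReLU hidden layer, sigmoid output."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, freq_vector)))
              for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


def apply_weights(feature_map, freq_w, time_w):
    """Scale value (c, h, w) by freq_w[c] * time_w[w] (assumed fusion rule).

    The last axis of the C x H x W map is treated as the time axis.
    """
    return [[[v * freq_w[c] * time_w[w] for w, v in enumerate(row)]
             for row in channel]
            for c, channel in enumerate(feature_map)]
```

For a feature map with 8 channels, `w1` would have 8 // 4 = 2 rows, matching the one-quarter hidden width stated above, and `w2` would map those 2 hidden values back to 8 channel weights.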
[0037] In this embodiment, the multi-scale feature fusion module 05 includes a channel unification unit, an upsampling processing unit, a feature map stitching unit, a channel compression unit, and a feature refinement and filtering unit. Specifically, in the channel unification unit, a feature pyramid network is used to perform 1×1 pointwise convolution processing on four weighted feature maps of different scales with shapes of 8×48×79, 16×24×40, 32×12×20, and 64×12×20, respectively, to uniformly adjust the channel dimension of each weighted feature map to 64, ensuring consistent channel dimensions during fusion. In the upsampling processing unit, based on the first layer weighted feature map with a spatial size of 48×79, bilinear interpolation upsampling is performed on the low-resolution feature map. Specifically, the second layer weighted feature map with a spatial size of 24×40 after channel unification is upsampled by 2 times bilinear interpolation, the third layer weighted feature map with a spatial size of 12×20 is upsampled by 4 times bilinear interpolation, and the fourth layer weighted feature map with a spatial size of 12×20 is upsampled by 4 times bilinear interpolation, so that its size matches the high-resolution feature map of the first layer, and the spatial size of the four weighted feature maps is unified to 48×79.
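The upsampling step can be illustrated with a pure-Python bilinear resize. One detail is worth noting: doubling 24×40 or quadrupling 12×20 gives 48×80 rather than 48×79, so this sketch interpolates to an explicit target size instead of using an integer scale factor. The half-pixel sampling convention and the function name are assumptions.

```python
def bilinear_resize(channel, out_h, out_w):
    """Resize one H x W map to out_h x out_w with bilinear interpolation."""
    in_h, in_w = len(channel), len(channel[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back to input coordinates and clamp.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = channel[y0][x0] * (1 - fx) + channel[y0][x1] * fx
            bottom = channel[y1][x0] * (1 - fx) + channel[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

Applying `bilinear_resize(channel, 48, 79)` to every channel of the 24×40 and 12×20 maps brings all four scales to the common 48×79 grid before they are concatenated.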
[0038] Furthermore, in the feature map stitching unit, the four weighted feature maps with unified channel dimensions and spatial sizes are stitched together along the channel dimensions to generate an initial fused feature map with a size of 64×48×79, thereby improving feature utilization. This fused feature map simultaneously contains shallow detail features (such as weak vibration amplitude) and deep semantic features (such as pest species features). In the channel compression unit, the initial fused feature map is compressed using a 1×1 pointwise convolution to obtain a fused feature map with 64 channels and a spatial size of 48×79. It is noteworthy that in the feature refinement and filtering unit, a small discriminative network is set up and the fused feature map is input into it. Each region of the fused feature map is scored for effectiveness, and regions with scores higher than a preset threshold are selected for subsequent classification. Through the feature refinement and filtering unit, key regions containing pest features can be accurately selected, such as the small activity areas of early pests on tree trunks, filtering out irrelevant background features, solving the problem of lost weak amplitude features, and reducing interference from invalid features, thus improving the accuracy of early pest detection by 6%-9%.
[0039] In this embodiment, the pest species prediction module 06 is used to output the pest presence probability and health / damage status prediction results based on the fused feature map. Specifically, global average pooling is performed on each channel of the fused feature map, compressing each fused feature map with a spatial size of 48×79 into a 64-dimensional feature vector with the same dimension as the number of input channels; the 64-dimensional feature vector is input into a fully connected layer, and a linear transformation is used to map the feature vector to the category space to obtain a category score vector with a dimension equal to the total number of stem borer pest categories; the category score vector is normalized using the Softmax function to obtain the presence probability distribution of stem borers, and the health / damage status prediction results are determined based on the presence probability distribution results.
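The prediction head described above (global average pooling, a fully connected layer, and Softmax) can be sketched as follows. The weights and bias are hypothetical placeholders for learned parameters, and the function names are illustrative.

```python
import math


def global_average_pool(feature_map):
    """Average each channel of a C x H x W map to a single value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def linear(vector, weights, bias):
    """Fully connected layer: one score per row of weights."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(feature_map, weights, bias):
    """Return the class probability distribution for one fused feature map."""
    return softmax(linear(global_average_pool(feature_map), weights, bias))
```

With a 64×48×79 fused feature map, pooling yields a 64-dimensional vector, so `weights` has one row of 64 values per pest category.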
[0040] In one specific embodiment, 2548 audio samples of borer vibrations from five types of wood-boring pests, including longhorn beetles, bark beetles, and red palm weevils, were collected. Of these, 1754 were healthy samples and 794 were damaged samples. The audio sampling environment included 10 types of natural environmental noise, such as wind and rain, traffic noise, human voices, and animal calls, to simulate a real-world multi-noise environment in a forest. All audio samples were originally sampled at 44.1 kHz, recorded in a single channel, with individual audio lengths ranging from 6.143 seconds to 22.528 seconds. The collected audio samples were manually labeled to determine the pest species and damage status of each sample, serving as the true labels for model training and evaluation. Furthermore, based on the above dataset, a 10-fold cross-validation method was used to train and evaluate the improved MobileNetV3 lightweight sound detection model provided in this application. Specifically, all 2548 samples were randomly and evenly divided into 10 disjoint subsets. Each time, 8 subsets were selected as the training set (80%), 1 subset as the validation set (10%), and the remaining subset as the test set (10%). This process was repeated 10 times, selecting a different subset for the test set each time. The average of the 10 evaluations was used to evaluate the model performance. During training, the hyperparameters were set as follows: initial learning rate of 0.001, batch size of 128, number of training epochs of 100, stochastic gradient descent as the optimizer, and cross-entropy loss as the loss function.
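The 8/1/1 rotation used in the cross-validation can be sketched in Python. Which fold serves as the validation set in each round is not stated, so the sketch assumes it is the fold following the test fold; the seed and function name are likewise illustrative.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, val, test) index lists, one triple per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds        # assumption: next fold validates
        test = folds[k]
        val = folds[val_k]
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, val_k) for i in fold]
        yield train, val, test
```

With 2548 samples the folds hold 255 or 254 samples each, so every round trains on roughly 2038 samples and each sample appears in the test set exactly once across the ten rounds.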
[0041] After 10-fold cross-validation, the improved MobileNetV3 lightweight sound detection model provided in this application achieved an average accuracy of 97.18% on the test set, an improvement of 0.94% over the original MobileNetV3 model, effectively identifying weak vibration signals in noisy environments. The model has only 0.02M parameters, a reduction of approximately 99.52% compared to the original MobileNetV3 model, and a computational cost of only 9.51M FLOPs, a reduction of approximately 86.83% compared to the original MobileNetV3 model. The model provided in this application therefore achieves higher accuracy than the original MobileNetV3 while exhibiting significant advantages in both parameter count and computational cost, which verifies the effectiveness of the lightweight design of this invention and enables deployment on resource-constrained terminal devices. In addition, to verify the contribution of the multispectral channel attention module and the multi-scale feature fusion module to the detection performance, an ablation experiment was conducted. The experimental results showed that after introducing the multispectral channel attention module, the model's ability to identify short-time discontinuous high-frequency signals was significantly enhanced, and the accuracy in the red palm weevil detection task was improved by about 5%. After introducing the multi-scale feature fusion module and the feature refinement and filtering unit, the model's detection accuracy for weak vibration signals of early pests was improved by 6% to 9%. This indicates that the model with multispectral attention and multi-scale fusion can adapt to the vibration characteristics of various stem-boring pests and has a wide range of applications and strong versatility.
[0042] Furthermore, to verify the detection performance of the method of the present invention in actual forest environments, embedded detection devices equipped with the model of this application were deployed in three forest farms at different geographical locations. The devices collected vibration audio signals inside the trees in real time. After noise suppression, duration adjustment and sampling rate optimization by the data preprocessing module, the data was input into the model for inference and output the probability of pest presence and the prediction results of health and damage status. The model detection results were compared with the verification results of manual wood dissection. The test results showed that the on-site detection accuracy of the method of this application reached 95.2%, the false alarm rate was less than 3%, and the average detection time per sample was no more than 0.1 seconds, which meets the application requirements of real-time rapid detection in forests.
[0043] In summary, the lightweight sound detection system for borers based on an improved MobileNetV3 provided in this application preprocesses the original audio signal through a data preprocessing module, sequentially performing noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization to obtain a standardized audio signal. This effectively separates and suppresses environmental noise while preserving the borer vibration signals with high fidelity. In the spectrogram extraction and fusion module, a short-time Fourier transform is performed on the standardized audio signal based on a bidirectional Mel filter bank to extract logarithmic and chroma features, which are normalized and then concatenated and fused to obtain a fused spectrogram, thereby enhancing the extraction of effective features and reducing mid-frequency noise. A lightweight feature extraction network is used to extract multi-scale basic features from the fused spectrogram, outputting multiple intermediate feature maps at different scales. The network has only 0.02M parameters and a computational cost of only 9.51M FLOPs, reductions of 99.52% and 86.83%, respectively, compared to the original MobileNetV3, which indicates a significant lightweight advantage in both parameter count and computational cost. In the multispectral channel attention module, the multi-band frequency features of the spectrogram of each intermediate feature map are extracted through a two-dimensional discrete cosine transform, frequency weights are learned through a multilayer perceptron, and the channels of each intermediate feature map are weighted to obtain multiple weighted feature maps at different scales, thereby enhancing the ability to capture features over short time intervals and enabling the model to focus on effective high-frequency and low-frequency features with temporal regularity over short time intervals. In the multi-scale feature fusion module, a feature pyramid network transforms the multiple weighted feature maps at different scales to a unified channel dimension, and after bilinear interpolation upsampling the low-resolution feature maps are concatenated with the high-resolution feature map to obtain a fused feature map that contains both shallow detail features and deep semantic features, thereby improving the learning ability for weak-amplitude features. Through the pest species prediction module, the presence probability and the health / damage status prediction results of borers are determined based on the fused feature map. The 10-fold cross-validation accuracy reaches 97.18%, an improvement of 0.94% compared to the original MobileNetV3, effectively identifying weak vibration signals in noisy environments. The technical solution provided in this application, through a lightweight network structure design and a multi-dimensional feature enhancement mechanism, achieves early, efficient, and accurate detection of the vibration signals of various borers. Furthermore, the model has a small number of parameters and a low computational cost, allowing convenient deployment on resource-constrained embedded terminal devices and providing reliable technical support for forestry pest and disease control.
[0044] The above is a description of the system embodiments of this application. Based on the foregoing embodiments, the method embodiments of this application are described below.
[0045] Please refer to Figure 2, which shows a flowchart of a lightweight sound detection method for borers based on an improved MobileNetV3 according to an embodiment of this application. This method is applied to the lightweight sound detection system for borers based on an improved MobileNetV3 shown in Figure 1. For details not disclosed in the method embodiments, please refer to the system embodiments. The system includes a data preprocessing module, a spectrogram extraction and fusion module, a lightweight feature extraction network, a multispectral channel attention module, a multi-scale feature fusion module, and a pest species prediction module. As shown in Figure 2, the method includes the following steps S210 to S260.
[0046] Step S210: Subject the original audio signal to noise adaptive suppression, dynamic duration adjustment, and sampling rate optimization in sequence to obtain a standardized audio signal.
Step S220: Perform a short-time Fourier transform on the standardized audio signal based on the bidirectional Mel filter bank to extract logarithmic and chroma features; after normalizing the logarithmic and chroma features, concatenate and fuse them to obtain a fused spectrogram.
Step S230: Extract basic features from the fused spectrogram at multiple scales and output multiple intermediate feature maps at different scales.
Step S240: Extract the multi-band frequency features of the spectrogram of each intermediate feature map through a two-dimensional discrete cosine transform, learn the frequency weights through a multilayer perceptron, and weight the channels of each intermediate feature map to obtain multiple weighted feature maps at different scales. The multispectral channel attention module is embedded in the lightweight feature extraction network.
Step S250: Through the feature pyramid network, transform the multiple weighted feature maps at different scales to a unified channel dimension; after performing bilinear interpolation upsampling on the low-resolution feature maps, concatenate them with the high-resolution feature map to obtain a fused feature map.
Step S260: Output the probability of pest presence and the prediction results of health and damage status based on the fused feature map.
[0047] It should be noted that, in the embodiments of this application, if the aforementioned lightweight sound detection method for borers based on the improved MobileNetV3 is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of this application, or the part that contributes to related technologies, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a magnetic disk, or an optical disk. Thus, the embodiments of this application are not limited to any specific hardware and software combination.
[0048] Correspondingly, embodiments of this application provide a computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the steps in any of the above embodiments of a lightweight sound detection method for borers based on improved MobileNetV3. Correspondingly, embodiments of this application also provide a computer program product, which, when executed by a processor of an electronic device, is used to implement the steps in any of the above embodiments of a lightweight sound detection method for borers based on improved MobileNetV3.
[0049] Based on the same technical concept, this application provides an electronic device for implementing the lightweight sound detection method for borers based on an improved MobileNetV3 described in the above method embodiments. Figure 3 is a hardware entity diagram of an electronic device provided in an embodiment of this application. As shown in Figure 3, the electronic device 300 includes a memory 310 and a processor 320. The memory 310 stores a computer program that can run on the processor 320. When the processor 320 executes the program, it implements the steps in any of the embodiments of this application of the lightweight sound detection method for borers based on an improved MobileNetV3.
[0050] The memory 310 is configured to store instructions and applications executable by the processor 320, and can also cache data to be processed or already processed by the processor 320 and various modules in the electronic device (e.g., image data, audio data, voice communication data and video communication data), which can be implemented by flash memory or random access memory (RAM).
[0051] When the processor 320 executes a program, it implements the steps of any of the above embodiments of the lightweight sound detection method for borers based on an improved MobileNetV3. The processor 320 typically controls the overall operation of the electronic device 300.
[0052] The aforementioned processor can be at least one of the following: Application Specific Integrated Circuit (ASIC), Digital Signal Processor (DSP), Digital Signal Processing Device (DSPD), Programmable Logic Device (PLD), Field Programmable Gate Array (FPGA), Central Processing Unit (CPU), Controller, Microcontroller, and Microprocessor. It is understood that other electronic devices can also implement the functions of the aforementioned processor, and this application does not specifically limit the specific implementation.
[0053] The aforementioned computer storage medium/memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM), etc.; or it can be various electronic devices that include one or any combination of the above-mentioned memories, such as mobile phones, computers, tablet devices, personal digital assistants, etc.
[0054] It should be noted that the descriptions of the storage medium and device embodiments above are similar to the descriptions of the method embodiments above, and have similar beneficial effects. For technical details not disclosed in the storage medium and device embodiments of this application, please refer to the descriptions of the method embodiments of this application for understanding.
[0055] It should be understood that the phrase "one embodiment" or "an embodiment" throughout the specification means that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment" or "in an embodiment" appearing throughout the specification does not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. It should be understood that in the various embodiments of this application, the sequence numbers of the above-described processes do not imply a sequential order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. The sequence numbers of the above-described embodiments are merely descriptive and do not represent the superiority or inferiority of the embodiments.
[0056] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0057] In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components can be combined, or integrated into another system, or some features can be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed can be through some interfaces, and the indirect coupling or communication connection between devices or units can be electrical, mechanical, or other forms.
[0058] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of the embodiments of this application, depending on actual needs.
[0059] In addition, each functional unit in the various embodiments of this application can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.
[0060] Alternatively, if the integrated units described above are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, or the parts that contribute to related technologies, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause the device automatic test line to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROMs, magnetic disks, or optical disks.
[0061] The methods disclosed in the several method embodiments provided in this application can be arbitrarily combined without conflict to obtain new method embodiments.
[0062] The features disclosed in the several method or device embodiments provided in this application can be arbitrarily combined without conflict to obtain new method or device embodiments.
[0063] The above description is merely an embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A lightweight sound detection system for borers based on an improved MobileNetV3, characterized in that the system includes: a data preprocessing module, used to sequentially perform noise adaptive suppression, dynamic duration adjustment and sampling rate optimization on the raw audio signal to obtain a standardized audio signal; a spectrogram extraction and fusion module, used to perform a short-time Fourier transform on the standardized audio signal based on a bidirectional Mel filter bank, extract logarithmic features and chromaticity features, and then normalize, splice and fuse the logarithmic features and chromaticity features to obtain a fused spectrogram; a lightweight feature extraction network, used to extract basic features from the fused spectrogram at multiple scales and output multiple intermediate feature maps at different scales, wherein the network training adopts a dynamic channel pruning mechanism; a multispectral channel attention module, used to extract multi-band frequency features of the spectrogram of each intermediate feature map through a two-dimensional discrete cosine transform, learn frequency weights through a multilayer perceptron, and weight the channels of each intermediate feature map to obtain multiple weighted feature maps of different scales, wherein the multispectral channel attention module is embedded in the lightweight feature extraction network; a multi-scale feature fusion module, used to transform the weighted feature maps of multiple different scales to a unified channel dimension through a feature pyramid network, and then perform bilinear interpolation upsampling on the low-resolution feature map and concatenate it with the high-resolution feature map to obtain a fused feature map; and a pest species prediction module, used to output the probability of pest presence and the prediction results of health and damage status based on the fused feature map.
2. The system according to claim 1, characterized in that the data preprocessing module includes a noise adaptive suppression unit, a dynamic duration adjustment unit, and a sampling rate optimization unit, wherein: the noise adaptive suppression unit is used for dynamic filtering based on the "V"-shaped frequency band distribution: it analyzes the time-frequency distribution of the original audio signal through a short-time Fourier transform, identifies and adaptively suppresses the noise frequency band within the preset mid-frequency range, and identifies and retains the effective features of the borer, the effective features including low-frequency component features below a first preset frequency and high-frequency component features above a second preset frequency; the dynamic duration adjustment unit is used to perform duration statistical analysis on the noise-suppressed audio signal: if the audio duration is less than the preset fixed duration, zero vectors with a length not exceeding the preset filling threshold are filled at random positions at the beginning and end of the audio; if the audio duration is greater than the preset fixed duration, a window of the preset fixed duration is slid across the audio with a preset step size to cut out at least one audio segment of fixed duration, thereby obtaining an audio signal of fixed duration; the sampling rate optimization unit is used to optimize the sampling rate of the fixed-duration audio signal using a segmented mean downsampling method, reducing the original sampling rate to the target sampling rate to obtain a standardized audio signal, wherein the segmented mean downsampling method is as follows: according to the ratio between the original sampling rate and the target sampling rate, the audio sampling points are divided into consecutive groups of two, the first and second points of each group are averaged, and the result is used as the target sampling point after downsampling.
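The duration adjustment and segmented mean downsampling of claim 2 can be illustrated with a minimal Python sketch. The function names, the 32 kHz to 16 kHz rates used in the usage note, and the behaviour when a clip is too short for the filling threshold (an error is raised) are assumptions made for illustration only; the claim does not specify them.

```python
import random


def segmented_mean_downsample(samples, orig_rate, target_rate):
    """Average consecutive groups of samples; group size = orig_rate // target_rate."""
    if orig_rate % target_rate != 0:
        raise ValueError("this sketch assumes an integer rate ratio")
    group = orig_rate // target_rate
    usable = len(samples) - len(samples) % group  # drop an incomplete trailing group
    return [sum(samples[i:i + group]) / group for i in range(0, usable, group)]


def adjust_duration(samples, fixed_len, step, max_pad, rng=None):
    """Return a list of segments, each exactly fixed_len samples long."""
    rng = rng or random.Random(0)
    n = len(samples)
    if n < fixed_len:
        missing = fixed_len - n
        if missing > max_pad:
            # Assumed behaviour: the claim does not say what happens in this case.
            raise ValueError("clip is too short for the padding threshold")
        head = rng.randint(0, missing)  # random split of the zeros between start and end
        return [[0.0] * head + list(samples) + [0.0] * (missing - head)]
    # Slide a fixed-length window with the preset step size.
    return [list(samples[s:s + fixed_len]) for s in range(0, n - fixed_len + 1, step)]
```

With a rate ratio of two, `segmented_mean_downsample([1, 3, 5, 7], 32000, 16000)` averages the pairs (1, 3) and (5, 7) and returns `[2.0, 6.0]`.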
3. The system according to claim 1, characterized in that the bidirectional Mel filter bank is constructed based on the "V"-shaped distribution pattern of the effective characteristics of borers. It includes a low-frequency conventional Mel filter bank covering a preset low-frequency band and a high-frequency inverted Mel filter bank covering a preset high-frequency band. The preset low-frequency band is 0-2 kHz, and the low-frequency conventional Mel filter bank contains 48 filters. The preset high-frequency band is 6-8 kHz, and the high-frequency inverted Mel filter bank contains 24 filters. The low-frequency conventional Mel filter bank and the high-frequency inverted Mel filter bank are spliced in the frequency domain to form a 72-dimensional bidirectional Mel frequency feature.
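One possible construction of the 72-filter bidirectional bank in claim 3 is sketched below in plain Python. The 16 kHz sampling rate, the 1024-point FFT, the triangular filter shape, and the way the inverted Mel spacing is produced (mirroring conventional Mel spacing inside the band so that resolution is finest near 8 kHz) are all assumptions; the claim fixes only the bands and the filter counts.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_edges(f_lo, f_hi, n_filters):
    """n_filters + 2 band edges, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    return [mel_to_hz(lo + (hi - lo) * i / (n_filters + 1)) for i in range(n_filters + 2)]


def inverted_mel_edges(f_lo, f_hi, n_filters):
    """Mirror mel spacing inside the band so resolution is finest near f_hi (assumed)."""
    base = mel_edges(0.0, f_hi - f_lo, n_filters)
    return sorted(f_hi - e for e in base)


def triangular_bank(edges, bin_freqs):
    """One triangular weight row per filter, evaluated at the STFT bin frequencies."""
    bank = []
    for left, centre, right in zip(edges, edges[1:], edges[2:]):
        row = []
        for f in bin_freqs:
            if left < f <= centre:
                row.append((f - left) / (centre - left))
            elif centre < f < right:
                row.append((right - f) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


SAMPLE_RATE, N_FFT = 16000, 1024  # assumed values, not stated in the claims
bin_freqs = [k * SAMPLE_RATE / N_FFT for k in range(N_FFT // 2 + 1)]
low_bank = triangular_bank(mel_edges(0.0, 2000.0, 48), bin_freqs)
high_bank = triangular_bank(inverted_mel_edges(6000.0, 8000.0, 24), bin_freqs)
bidirectional_bank = low_bank + high_bank  # 48 + 24 = 72 filter rows
```

Concatenating the two lists of rows corresponds to splicing the banks in the frequency domain; the 2-6 kHz mid band receives zero weight in every filter.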
4. The system according to claim 3, characterized in that the spectrogram extraction and fusion module includes a short-time Fourier transform unit, a filtering unit, a feature extraction unit, and a feature fusion unit, wherein: the short-time Fourier transform unit is used to perform a short-time Fourier transform on the standardized audio signal to obtain a linear spectrum; the filtering unit is used to obtain low-frequency bidirectional Mel features by weighted summation of the low-frequency Mel filter bank and the low-frequency linear spectrum, and to obtain high-frequency bidirectional Mel features by weighted summation of the high-frequency inverted Mel filter bank and the high-frequency linear spectrum. The low-frequency and high-frequency bidirectional Mel features are then concatenated along the frequency dimension to form a 72-dimensional bidirectional Mel spectrogram. The filtering process is expressed by the following formula:
$$M_{\mathrm{low}}(m, t) = \sum_{k} W_{\mathrm{low}}(m, k)\, S_{\mathrm{low}}(k, t), \qquad M_{\mathrm{high}}(n, t) = \sum_{j} W_{\mathrm{high}}(n, j)\, S_{\mathrm{high}}(j, t)$$
In the formula, $m$ indicates the index of the low-frequency Mel filter; $t$ indicates the time frame index; $k$ indicates the frequency bin index in the low-frequency region; $n$ indicates the index of the high-frequency Mel filter; $j$ indicates the frequency bin index in the high-frequency region; $M_{\mathrm{low}}$ represents the low-frequency Mel-spectrum output matrix; $W_{\mathrm{low}}$ represents the weighting function of the low-frequency Mel filter bank; $S_{\mathrm{low}}$ represents the linear spectrum in the low-frequency region; $M_{\mathrm{high}}$ represents the high-frequency Mel-spectrum output matrix; $W_{\mathrm{high}}$ represents the weighting function of the high-frequency Mel filter bank; $S_{\mathrm{high}}$ represents the linear spectrum in the high-frequency region; the feature extraction unit is used to extract logarithmic features and chromaticity features from the bidirectional Mel spectrogram, respectively.
The logarithmic features are used to characterize the bidirectional Mel frequency energy distribution, with a frequency dimension of 72; the chromaticity features are used to characterize the relative frequency variation characteristics, with a frequency dimension of 24. The feature fusion unit is used to normalize the logarithmic features and the chromaticity features respectively, and then splice and fuse them along the frequency dimension to obtain a fused spectrogram with a size of 1×96×157, where 1 represents a single channel, 96 represents the fused frequency feature dimension, and 157 represents the time dimension. The 96-dimensional fused frequency feature is obtained by splicing the 72-dimensional logarithmic features and the 24-dimensional chromaticity features.
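The weighted summation and the 72 + 24 = 96 fusion of claim 4 can be sketched as follows. This is an illustration under assumptions: the natural logarithm with a small offset, the zero-mean and unit-variance normalization, and all function names are hypothetical choices, and the computation of the chromaticity features themselves is not shown because the claim does not define it.

```python
import math


def apply_filter_bank(bank, spectrum):
    """bank: filters x bins, spectrum: bins x frames -> filters x frames (weighted sum)."""
    n_frames = len(spectrum[0])
    return [[sum(w * spectrum[k][t] for k, w in enumerate(row)) for t in range(n_frames)]
            for row in bank]


def log_compress(mel, eps=1e-10):
    """Logarithmic feature; eps avoids log(0) (assumed detail)."""
    return [[math.log(v + eps) for v in row] for row in mel]


def normalise(feature, eps=1e-8):
    """Zero-mean, unit-variance over the whole feature matrix (assumed scheme)."""
    flat = [v for row in feature for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat))
    return [[(v - mean) / (std + eps) for v in row] for row in feature]


def fuse(log_feature, chroma_feature):
    """Stack both features along the frequency axis and add a leading channel axis."""
    return [normalise(log_feature) + normalise(chroma_feature)]
```

Feeding a 72×157 logarithmic feature and a 24×157 chromaticity feature to `fuse` yields a nested list of shape 1×96×157, matching the fused spectrogram size stated in the claim.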
5. The system according to claim 1, characterized in that, The lightweight feature extraction network is obtained by adjusting the hyperparameters of the MobileNetV3 base network. The lightweight feature extraction network includes one cascaded standard convolutional layer and four depthwise separable convolutional layers. The standard convolutional layer uses a 7×7 kernel with a stride of 2 and four output channels. The expansion rates of the four depthwise separable convolutional layers are set to 1, 3, 3, and 1 respectively, with 3×3 kernels, strides of 1, 2, 2, and 1 respectively, and output channels of 8, 16, 32, and 64 respectively. The lightweight feature extraction network performs multi-scale basic feature extraction on the fused spectrogram, outputting multiple intermediate feature maps at different scales, including: In the standard convolutional layer, the fused spectrogram is subjected to the first layer of standard convolution processing to capture the global features of the shallow high-resolution spectrogram and control the initial computational load, thereby obtaining the first layer output feature map. In the four depthwise separable convolutional layers, the output feature map of the first layer is subjected to four levels of depthwise separable convolution processing in sequence to compress the feature map size and increase the number of channels, generating four intermediate feature maps of different scales. The shapes of the four intermediate feature maps of different scales are 8×48×79, 16×24×40, 32×12×20, and 64×12×20, respectively. During training, the intermediate feature map channels at four different scales are dynamically pruned. The importance score of each channel is calculated based on the L1 norm of the feature map channel, and channels with importance scores below a preset threshold are pruned.
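The feature map shapes listed in claim 5 can be checked with standard convolution output arithmetic, and the L1-norm pruning criterion can be sketched in a few lines. The padding values (3 for the 7×7 kernel, 1 for the 3×3 kernels) and the use of the mean absolute activation as the importance score are assumptions; the claim states only the kernels, strides, channel counts and that the score is based on the L1 norm.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height=96, width=157):
    """Follow the fused spectrogram through the stem and the four separable layers."""
    # Stem: 7x7 kernel, stride 2, 4 output channels; padding = kernel // 2 is assumed.
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    shapes = []
    for channels, stride in zip((8, 16, 32, 64), (1, 2, 2, 1)):
        h, w = conv_out(h, 3, stride, 1), conv_out(w, 3, stride, 1)
        shapes.append((channels, h, w))
    return shapes


def l1_channel_mask(feature_map, threshold):
    """feature_map: channels x H x W nested lists.

    Scores each channel by its mean absolute activation (a size-normalised L1 norm,
    assumed here) and keeps the channels whose score reaches the threshold.
    """
    scores = [sum(abs(v) for row in ch for v in row) / (len(ch) * len(ch[0]))
              for ch in feature_map]
    return [s >= threshold for s in scores], scores
```

Under these padding assumptions `trace_shapes()` returns `[(8, 48, 79), (16, 24, 40), (32, 12, 20), (64, 12, 20)]`, which agrees with the four intermediate feature map shapes in the claim.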
6. The system according to claim 1, characterized in that the multispectral channel attention module is specifically used for the following. A two-dimensional discrete cosine transform is performed on the intermediate feature maps at four different scales. By adjusting the frequency index, multiple frequency components covering the range from low to high frequencies are extracted, generating multiple frequency weight elements. The low-frequency components correspond to the slow boring signal of the pest, while the high-frequency components correspond to the rapid gnawing signal. The two-dimensional discrete cosine transform is calculated as follows:
$$F_i = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X_{:,h,w} \cos\!\left(\frac{\pi u_i}{H}\left(h + \frac{1}{2}\right)\right) \cos\!\left(\frac{\pi v_i}{W}\left(w + \frac{1}{2}\right)\right), \qquad i \in \{0, 1, \dots, n-1\}$$
In the formula, $F_i$ indicates the frequency weight element corresponding to the $i$-th frequency component; $h$ indicates the spatial position index along the height direction of the feature map; $H$ indicates the height of the feature map; $w$ indicates the spatial position index along the width direction of the feature map; $W$ indicates the width of the feature map; $X$ is the intermediate feature map; $X_{:,h,w}$ indicates the feature vector of all channels at spatial location $(h, w)$; $u_i$ represents the frequency index of the two-dimensional discrete cosine transform at the vertical frequency; $v_i$ represents the frequency index of the two-dimensional discrete cosine transform at the horizontal frequency; $i$ indicates the index of the frequency component; $n$ indicates the preset total number of frequency components to be extracted. By concatenating the multiple frequency weight elements, a complete frequency weight vector is obtained. The frequency weight vector is input into a preset two-layer fully connected multilayer perceptron network for frequency weight learning, generating a frequency attention weight vector with the same dimension as the intermediate feature map. The number of hidden layer nodes in the multilayer perceptron network is 1/4 of the number of channels in the intermediate feature map.
The temporal correlation information of the intermediate feature map in the time dimension is extracted, and the specific vibration timing within a preset time window is captured to generate a temporal attention weight vector. The frequency attention weight vector and the temporal attention weight vector are fused to generate a frequency-temporal fusion weight vector; the frequency-temporal fusion weight vector is then multiplied point-by-point with the intermediate feature map to generate a weighted feature map.
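The frequency branch of claim 6 can be illustrated with the sketch below. It assumes that the attention vector holds one weight per channel, that the hidden layer uses ReLU and the output layer uses a sigmoid, and that the chosen frequency index pairs and the random (untrained) weights are purely illustrative. The temporal branch is omitted because the claim does not say how the temporal attention weights are computed.

```python
import math
import random


def dct_descriptor(x, u, v):
    """2D DCT coefficient (u, v) of every channel of x (channels x H x W)."""
    H, W = len(x[0]), len(x[0][0])
    basis = [[math.cos(math.pi * u * (h + 0.5) / H) * math.cos(math.pi * v * (w + 0.5) / W)
              for w in range(W)] for h in range(H)]
    return [sum(ch[h][w] * basis[h][w] for h in range(H) for w in range(W)) for ch in x]


def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * vi for wi, vi in zip(row, vec)) + b for row, b in zip(weights, bias)]


def frequency_attention(x, freq_indices, w1, b1, w2, b2):
    """Concatenate DCT descriptors, run a two-layer MLP, return one weight per channel."""
    descriptor = [val for (u, v) in freq_indices for val in dct_descriptor(x, u, v)]
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]            # ReLU (assumed)
    return [1.0 / (1.0 + math.exp(-o)) for o in dense(hidden, w2, b2)]  # sigmoid (assumed)


def reweight(x, weights):
    """Scale every channel of x by its attention weight."""
    return [[[v * a for v in row] for row in ch] for ch, a in zip(x, weights)]


# Illustrative run: 8 channels, hidden size 8 / 4 = 2, three frequency components.
rng = random.Random(0)
C, n_hidden = 8, 2
x = [[[rng.random() for _ in range(5)] for _ in range(4)] for _ in range(C)]
freqs = [(0, 0), (0, 1), (1, 0)]  # hypothetical (u_i, v_i) pairs
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(C * len(freqs))] for _ in range(n_hidden)]
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(C)]
weights = frequency_attention(x, freqs, w1, [0.0] * n_hidden, w2, [0.0] * C)
weighted = reweight(x, weights)
```

The component with $u_i = v_i = 0$ reduces to the sum over all spatial positions of each channel, so global average pooling is recovered as a special case up to a constant factor.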
7. The system according to claim 1, characterized in that, The multi-scale feature fusion module includes a channel unification unit, an upsampling processing unit, a feature map stitching unit, a channel compression unit, and a feature refinement and filtering unit, wherein: The channel unification unit is used to perform 1×1 pointwise convolution processing on four weighted feature maps of different scales through the feature pyramid network, and uniformly adjust the channel dimension of each weighted feature map to 64. The upsampling processing unit is used to perform 2x bilinear interpolation upsampling on the second-layer weighted feature map with a spatial size of 24×40 after channel unification, based on the first-layer weighted feature map with a spatial size of 48×79, 4x bilinear interpolation upsampling on the third-layer weighted feature map with a spatial size of 12×20, and 4x bilinear interpolation upsampling on the fourth-layer weighted feature map with a spatial size of 12×20, so as to unify the spatial size of the four weighted feature maps to 48×79; The feature map stitching unit is used to stitch together four weighted feature maps with unified spatial dimensions along the channel dimension to obtain an initial fused feature map; The channel compression unit is used to perform 1×1 pointwise convolution compression on the initial fused feature map to obtain a fused feature map with 64 channels and a spatial size of 48×79. The feature refinement and filtering unit is used to input the fused feature map into a preset small discriminant network, score the effectiveness of each region of the fused feature map, and filter out regions with scores higher than a preset threshold.
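The spatial unification and channel concatenation of claim 7 can be sketched as below. Exact 2x and 4x upsampling of 24×40 and 12×20 maps would give 48×80, so this sketch assumes that each map is resized directly to the 48×79 target size; the half-pixel sampling convention is also an assumption. The 1×1 pointwise convolutions and the discriminant network are left out.

```python
def bilinear_resize(plane, out_h, out_w):
    """Resize one H x W plane with bilinear interpolation (half-pixel centres)."""
    in_h, in_w = len(plane), len(plane[0])
    out = []
    for i in range(out_h):
        # Map the output row centre back to input coordinates and clamp to the grid.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            top = plane[y0][x0] * (1 - fx) + plane[y0][x1] * fx
            bottom = plane[y1][x0] * (1 - fx) + plane[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def fuse_scales(feature_maps, out_h=48, out_w=79):
    """Resize every channels x H x W map to the target size, then concatenate channels."""
    fused = []
    for fmap in feature_maps:
        fused.extend(bilinear_resize(plane, out_h, out_w) for plane in fmap)
    return fused
```

With four inputs of 64 channels each, `fuse_scales` would return 256 planes of size 48×79, which the channel compression unit then reduces to 64 channels.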
8. The system according to claim 1, characterized in that, The pest species prediction module is specifically used for: Global average pooling is performed on each channel of the fused feature map to compress each fused feature map with a spatial size of 48×79 into a 64-dimensional feature vector with the same dimension as the number of input channels; The 64-dimensional feature vector is input into a fully connected layer, and a linear transformation is used to map the feature vector to the category space to obtain a category score vector with a dimension equal to the total number of categories of borers. The category score vectors are normalized using the Softmax function to obtain the probability of existence of borers and the prediction results of their health and damage status.
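The prediction head of claim 8 (global average pooling, one fully connected layer, Softmax) can be sketched as follows. The function names and the weight values in the usage note are hypothetical; in the claimed system the fused feature map has 64 channels of size 48×79 and the number of outputs equals the total number of borer categories.

```python
import math


def global_average_pool(feature_map):
    """channels x H x W -> one mean value per channel."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0])) for ch in feature_map]


def linear(vec, weights, bias):
    """Fully connected layer mapping the pooled vector to category scores."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable Softmax: subtract the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(feature_map, weights, bias):
    """Category probabilities for one fused feature map."""
    return softmax(linear(global_average_pool(feature_map), weights, bias))
```

For example, `predict([[[1.0, 3.0]], [[2.0, 2.0]]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])` pools the two channels to `[2.0, 2.0]`, produces equal scores, and returns `[0.5, 0.5]`.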
9. A lightweight sound detection method for borers based on an improved MobileNetV3, characterized in that a lightweight sound detection system for borers based on an improved MobileNetV3 is applied. The system includes a data preprocessing module, a spectrogram extraction and fusion module, a lightweight feature extraction network, a multispectral channel attention module, a multi-scale feature fusion module, and a pest species prediction module. The method includes: the original audio signal is subjected to noise adaptive suppression, dynamic duration adjustment and sampling rate optimization in sequence to obtain a standardized audio signal; the standardized audio signal is subjected to a short-time Fourier transform based on a bidirectional Mel filter bank to extract logarithmic features and chromaticity features, and the logarithmic features and chromaticity features are then normalized and spliced together to obtain a fused spectrogram; multi-scale basic feature extraction is performed on the fused spectrogram to output multiple intermediate feature maps at different scales; multi-band frequency features of the spectrogram of each intermediate feature map are extracted by a two-dimensional discrete cosine transform, frequency weights are learned by a multilayer perceptron, and the channels of each intermediate feature map are weighted to obtain multiple weighted feature maps of different scales, wherein the multispectral channel attention module is embedded in the lightweight feature extraction network; the multiple weighted feature maps at different scales are transformed to a unified channel dimension through a feature pyramid network, and the low-resolution feature map is then subjected to bilinear interpolation upsampling and concatenated with the high-resolution feature map to obtain a fused feature map; the probability of pest presence and the prediction results of health and damage status are output based on the fused feature map.
10. An electronic device comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that, When the processor executes the program, it implements the steps of the method of claim 9.