Campus abnormal sound detection method and system fusing background noise
By combining a dual-branch feature extraction network and a deep single-classification network, the problem of low accuracy and high false alarm rate in campus abnormal sound detection under complex noise environments is solved, and accurate abnormal sound detection in multiple campus scenarios is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WUHAN INST OF TECH
- Filing Date
- 2026-04-22
- Publication Date
- 2026-06-16
AI Technical Summary
Existing methods for detecting abnormal sounds on campus have low accuracy, high false alarm rate, and poor scene adaptability in complex background noise environments, and cannot effectively identify abnormal sounds in multiple campus scenarios.
By employing a dual-branch feature extraction network combined with a deep single-classification network and a noise discriminator, and through the fusion of time-domain and frequency-domain features, the system learns the manifold of normal sound features, identifies and cancels background noise in real time, and dynamically adjusts the detection threshold to achieve accurate detection of abnormal sounds.
It improves the accuracy and robustness of abnormal sound detection on campus, reduces the false alarm rate, and is adaptable to different campus scenarios such as classrooms and playgrounds, accurately identifying abnormal sounds.
Figure CN122224221A_ABST
Description
Technical Field
[0001] This invention relates to the fields of acoustic signal processing and artificial intelligence detection technology, and more specifically to a method and system for detecting abnormal sounds on campus that incorporates background noise.
Background Technology
[0002] The campus environment has complex background noise levels, ranging from quiet periods during class to noisy times during breaks, physical education classes, and group activities. Simple decibel threshold alarms are prone to generating false alarms during normal activities.
[0003] Existing methods for detecting abnormal sounds mainly include: performing speech enhancement on a positive sample sound dataset to obtain an enhanced sample sound dataset; training a pre-defined feature extraction model using the positive sample sound dataset and the enhanced sample sound dataset to obtain a trained feature extraction model; extracting features from the positive sample sound dataset and the sound data to be tested using the trained feature extraction model to obtain a standard acoustic feature dataset and the acoustic feature data to be tested; calculating the mean and covariance of the standard acoustic feature dataset to obtain the standard acoustic mean and standard acoustic covariance; calculating the distance between the acoustic feature data to be tested and the standard acoustic feature dataset based on the standard acoustic mean and standard acoustic covariance; and determining the sound data to be tested as abnormal data when the distance is greater than a preset threshold. This existing approach is stated to improve the accuracy of abnormal sound detection.
[0004] The above solution achieves speech enhancement only through tone conversion, audio adjustment, and white-noise injection. White noise is a single noise type that does not reflect the real, complex background noise of campus scenes such as classrooms, playgrounds, and corridors. The enhanced samples therefore do not match actual application scenarios, and the trained model generalizes poorly.
[0005] Feature extraction relies solely on a basic structure of encoding and convolutional layers, which cannot accurately capture key features of abnormal sounds such as short-term transients and frequency-sensitive sub-bands, so the feature representation is neither comprehensive nor discriminative enough. In addition, although the distance calculation combines the mean and covariance, anomaly detection uses a fixed preset threshold that ignores the differences in acoustic features between scenes. This easily leads to false alarms and missed alarms across the multiple scenes on campus.
Summary of the Invention
[0006] This invention addresses the technical problems existing in the prior art by providing a method and system for detecting abnormal sounds on campus that integrates background noise. This solves the technical pain points of traditional detection methods, such as low accuracy, high false alarm rate, and poor scene adaptability in complex background noise environments on campus.
[0007] According to a first aspect of the present invention, a method for detecting abnormal sounds on campus by incorporating background noise is provided, comprising:
[0008] Step 1: Collect normal sound samples and noise samples from different scenes in the campus environment, and generate noisy abnormal sound samples. All the collected normal sound samples, noise samples and generated noisy abnormal sound samples constitute the training set.
[0009] Step 2: Extract the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network, and aggregate the time-domain features and frequency-domain features of each sample to obtain the fused features of each sample;
[0010] Step 3: Based on the fusion features of normal sound samples and noise samples in the training set, a deep single-classification network and a noise discriminator are jointly trained, and the training effect of the deep single-classification network is verified by using the fusion features of noisy abnormal sound samples. The deep single-classification network is used to learn the center position of the normal sound feature manifold, and the noise discriminator is used to identify noise from the sound.
[0011] Step 4: Receive the campus audio stream to be tested in real time, identify the background noise in the campus audio stream to be tested based on the trained noise discriminator, and perform noise cancellation on the campus audio stream to be tested in the time-frequency domain based on the inverse filter to obtain the noise-cancelled campus audio stream to be tested.
[0012] Step 5: Extract the temporal and frequency domain features of the purified campus audio stream based on the dual-branch feature extraction network and aggregate them to obtain the fused features of the purified campus audio stream.
[0013] Step 6: Calculate the distance between the fusion features of the purified campus audio stream and the center of the sphere of the normal sound feature manifold;
[0014] Step 7: Based on the distance, determine whether the campus audio stream to be tested contains abnormal sounds.
[0015] Furthermore, step 1, which involves collecting normal sound samples and noise samples from different scenarios within the campus environment, and generating noisy abnormal sound samples, includes:
[0016] A distributed microphone array is deployed to collect normal sound samples and real noise samples of different scene types in the campus environment, and a dynamic noise database is established. The database contains real noise samples of different scene types and intensities, denoted as $n_r$;
[0017] Based on each real noise sample $n_r$, its scene type label $s$, and a preset random latent variable $z$, the generator of a conditional generative adversarial network generates the corresponding synthetic noise sample $\tilde{n}$. The generation formula of the generator is:
[0018] $$\tilde{n} = G(n_r, z \mid s)$$
[0019] where $G$ denotes the generator;
[0020] The generated synthetic noise sample $\tilde{n}$ is dynamically mixed with a clean, noise-free abnormal sound sample to obtain a noisy abnormal sound sample.
[0021] Furthermore, dynamically mixing the generated synthetic noise sample $\tilde{n}$ with a clean, noise-free abnormal sound sample to obtain a noisy abnormal sound sample includes:
[0022] Let the clean, noise-free abnormal sound sample be $x_a$, the generated synthetic noise sample be $\tilde{n}$, and the mixed noisy abnormal sound sample be $x_{mix}$. Based on the adaptive adjustment mechanism of the signal-to-noise ratio (SNR), the mixed noisy abnormal sound sample $x_{mix}$ is expressed as:
[0023] $$x_{mix} = \alpha\, x_a + (1 - \alpha)\, \tilde{n}$$
[0024] where $\alpha \in [0, 1]$ is a mixing weight that is adaptively adjusted according to the SNR.
[0025] Furthermore, the dual-branch network includes a time-domain branch network and a frequency-domain branch network. In step 2, the time-domain features and frequency-domain features of each sample in the training set are extracted based on the dual-branch feature extraction network, and the time-domain features and frequency-domain features of each sample are aggregated to obtain the fused features of each sample, including:
[0026] The time-domain branch network uses a 1D residual convolutional network to extract the short-term transient features of each sample, which serve as the time-domain features $F_t$;
[0027] The frequency-domain branch network converts the original sound signal of each sample into a time-frequency map through the wavelet packet transform. The time-frequency map is then input into a focused Transformer module, which extracts the frequency-domain features $F_f$ from the time-frequency map using a multi-head attention mechanism;
[0028] A gated fusion unit dynamically aggregates the time-domain features $F_t$ and the frequency-domain features $F_f$ of each sample to generate the fused features $F$ of each sample. The expression for dynamic aggregation is:
[0029] $$F = \sigma\left(W_g\,[F_t; F_f]\right) \odot F_t + \left(1 - \sigma\left(W_g\,[F_t; F_f]\right)\right) \odot F_f$$
[0030] where $\sigma$ denotes the sigmoid activation function, $W_g$ is the weight matrix of the gated fusion unit, $[F_t; F_f]$ denotes the concatenation of the time-domain features $F_t$ and the frequency-domain features $F_f$, and $\odot$ denotes element-wise multiplication.
[0031] Furthermore, step 3, based on the fusion features of normal sound samples and noise samples in the training set, jointly trains a deep single-classification network and a noise discriminator, and verifies the training effect of the deep single-classification network using the fusion features of noisy abnormal sound samples, including:
[0032] A deep single-classification network is trained based on the fusion features of normal sound samples in the training set to learn the normal sound feature manifold, so that the fusion features of normal sound samples are concentrated in a closed sphere in the feature space.
[0033] Specifically, the optimization objective minimizes the distance between the fused features of the normal sound samples and the sphere center $c$ of the normal sound feature manifold. The optimization objective is expressed as:
[0034] $$\min_{W}\; \frac{1}{N}\sum_{i=1}^{N} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\,\|W\|_F^2$$
[0035] where $\phi(x_i; W)$ is the fused feature of the normal sound sample $x_i$ extracted by the dual-branch feature extraction network, $c$ is the sphere center of the normal sound feature manifold, $\lambda$ is the preset regularization parameter, $W$ denotes the weights of the dual-branch feature extraction network, and $N$ is the number of normal sound samples;
[0036] A noise discriminator is trained based on the fusion features of noise samples, and the regularization constraint of the noise discriminator is:
[0037]
[0038] where the terms of the constraint are the generated synthetic noise samples, the fused features of the synthetic noise samples, and the threshold for synthetic noise samples;
[0039] In each iteration of the joint training of the deep single-classification network and the noise discriminator, the sphere center position c is iteratively updated using a moving average. The regularization constraint of the noise discriminator is calculated from the updated sphere center position c, and the calculated regularization constraint is fed back to the deep single-classification network;
[0040] Based on this regularization constraint, the sphere center position c is updated again using the moving average. The iteration is repeated until the termination condition is met, yielding the trained deep single-classification network and noise discriminator.
[0041] The termination condition is that the average distance between the fusion features of the normal sound sample and the center position c of the sphere satisfies the distance threshold, and the regularization constraint of the noise discriminator satisfies the constraint condition.
[0042] In each iteration of training, the fusion features of noisy abnormal sound samples are input into the deep single-classification network. If the fusion features of noisy abnormal sound samples are not enclosed in the smallest closed sphere of normal sound, it indicates that the deep single-classification network has a preliminary abnormality recognition ability; otherwise, the parameters of the deep single-classification network are adjusted and training continues.
[0043] Further, step 4 involves receiving the campus audio stream to be tested in real time, identifying background noise in the audio stream based on a trained noise discriminator, and performing noise cancellation on the audio stream in the time-frequency domain based on an inverse filter to obtain a noise-cancelled purified audio stream. This includes:
[0044] The system receives the campus audio stream to be tested in real time, identifies the background noise in the stream and its scene type using the trained noise discriminator, and constructs an inverse filter by calling the generator of the conditional generative adversarial network that corresponds to the identified scene type.
[0045] Based on the inverse filter, noise cancellation is performed on the campus audio stream under test in the time-frequency domain to obtain the purified campus audio stream under test after noise cancellation.
[0046] The noise cancellation in the time-frequency domain is expressed as follows:
[0047]
[0048] where the quantities in the expression are: the time-frequency representation of the original campus audio stream to be tested;
[0049] the estimate of the noise power spectrum, which characterizes the intensity of the noise in the frequency domain;
[0050] the adaptive suppression coefficient, which dynamically adjusts the amount of suppression according to the intensity of the noise;
[0051] and the time-frequency representation of the purified campus audio stream to be tested after noise cancellation.
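The expression for this step is not reproduced in the text, so the following Python sketch shows only one plausible reading of noise cancellation in the time-frequency domain: a power spectral subtraction in which a per-bin suppression coefficient grows as the noise estimate dominates. The function name `suppress_noise`, the parameters `base`, `gain`, and `floor`, and the coefficient rule itself are assumptions made for illustration and are not taken from the invention. The generator-based inverse filter is not modeled here.

```python
import math


def suppress_noise(power_spec, noise_power, base=1.0, gain=1.0, floor=0.01):
    """Power spectral subtraction with an adaptive suppression coefficient.

    power_spec:  list of frames, each a list of per-bin power values
    noise_power: per-bin noise power estimate (same bin count as each frame)
    base, gain, floor and the coefficient rule are illustrative assumptions.
    """
    cleaned = []
    for frame in power_spec:
        if len(frame) != len(noise_power):
            raise ValueError("frame and noise estimate must have the same bin count")
        row = []
        for p, n in zip(frame, noise_power):
            snr = p / n if n > 0 else math.inf
            # Suppress harder when the noise estimate dominates the bin.
            alpha = base + gain / (1.0 + snr)
            # Keep a small spectral floor so the power never becomes negative.
            row.append(max(p - alpha * n, floor * p))
        cleaned.append(row)
    return cleaned
```

In a full pipeline the cleaned power values would typically be combined with the phase of the original signal before transforming back to the time domain; that step is omitted in this sketch.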
[0052] Furthermore, step 6, calculating the distance between the fusion features of the purified campus audio stream under test and the center position of the normal sound feature manifold, includes:
[0053] The Mahalanobis distance between the fusion features of the purified campus audio stream and the center of the sphere of the normal sound feature manifold is calculated using the following formula:
[0054] $$d_M = \sqrt{(F - c)^{\top}\, \Sigma^{-1}\, (F - c)}$$
[0055] where $F$ is the fused feature of the purified campus audio stream to be tested, $c$ is the sphere center of the normal sound feature manifold, $\Sigma$ is the covariance matrix of the fused features of the normal sound samples, and $\Sigma^{-1}$ is its inverse matrix.
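As an illustration of the distance in step 6, the following Python sketch computes the Mahalanobis distance between a fused feature vector and the sphere center. To avoid forming the inverse covariance matrix explicitly, it solves a linear system with Gaussian elimination, which is an implementation choice of this sketch rather than something stated in the text. The names `solve_linear` and `mahalanobis` are hypothetical.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(feature, center, covariance):
    """Distance of `feature` from `center` under the given covariance matrix."""
    diff = [f - c for f, c in zip(feature, center)]
    # y = covariance^{-1} * diff, obtained without an explicit inverse.
    y = solve_linear(covariance, diff)
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))
```

With an identity covariance the result reduces to the Euclidean distance, which gives a quick sanity check.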
[0056] Furthermore, step 7, based on the distance, determines whether the campus audio stream to be tested contains abnormal sounds, including:
[0057] The noise discriminator identifies the scene type of background noise in the campus audio stream under test, and determines the anomaly detection threshold corresponding to the scene type.
[0058] When the Mahalanobis distance is greater than or equal to the anomaly detection threshold, it is determined that the campus audio stream under test contains abnormal sounds and an anomaly alarm is issued; otherwise, it is determined that the campus audio stream under test does not contain abnormal sounds.
[0059] Furthermore, the step of identifying the scene type of background noise in the campus audio stream under test based on the noise discriminator and determining the anomaly detection threshold corresponding to the scene type includes:
[0060] For the current scene type $k$, obtain the set of Mahalanobis distances over a predetermined historical time period, and calculate the mean $\mu_k$ and standard deviation $\sigma_k$ of all Mahalanobis distances in the set;
[0061] Based on the mean $\mu_k$, the standard deviation $\sigma_k$, and the scene factor $\gamma_k$, dynamically calculate the anomaly detection threshold $\tau_k$ as:
[0062] $$\tau_k = \mu_k + \gamma_k\, \sigma_k$$
[0063] where $\mu_k$ and $\sigma_k$ are the mean and standard deviation, respectively, of the historical Mahalanobis distances for the current scene.
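A minimal Python sketch of a scene-dependent dynamic threshold follows. It assumes the threshold is the historical mean plus the scene factor times the historical standard deviation, that the population standard deviation is used, and that history is kept in a fixed-length window. The class name `SceneThreshold` and the `window` parameter are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class SceneThreshold:
    """Tracks recent Mahalanobis distances for one scene type."""

    def __init__(self, scene_factor, window=500):
        self.scene_factor = scene_factor      # assumed per-scene multiplier
        self.history = deque(maxlen=window)   # assumed fixed-length history

    def update(self, distance):
        self.history.append(distance)

    def threshold(self):
        if len(self.history) < 2:
            raise ValueError("need at least two historical distances")
        return mean(self.history) + self.scene_factor * pstdev(self.history)

    def is_abnormal(self, distance):
        # Distances at or above the threshold are flagged, as in step 7.
        return distance >= self.threshold()
```

One instance per scene type (for example, one for classrooms and one for playgrounds) could be held in a dictionary keyed by the scene label returned by the noise discriminator.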
[0064] According to a second aspect of the present invention, a campus abnormal sound detection system that integrates background noise is provided, comprising:
[0065] The acquisition module is used to collect normal sound samples and noise samples from different scenarios in the campus environment, as well as generate noisy abnormal sound samples. All the collected normal sound samples, noise samples and generated noisy abnormal sound samples constitute the training set.
[0066] The first feature extraction module is used to extract the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network, and to aggregate the time-domain features and frequency-domain features of each sample to obtain the fused features of each sample.
[0067] The training module is used to jointly train a deep single-classification network and a noise discriminator based on the fusion features of normal sound samples and noise samples in the training set, and to verify the training effect of the deep single-classification network using the fusion features of noisy abnormal sound samples. The deep single-classification network is used to learn the center position of the normal sound feature manifold, and the noise discriminator is used to identify noise from the sound.
[0068] The noise processing module is used to receive the campus audio stream to be tested in real time, identify the background noise in the campus audio stream to be tested based on the trained noise discriminator, and perform noise cancellation processing on the campus audio stream to be tested in the time and frequency domain based on the inverse filter to obtain the noise-cancelled purified campus audio stream to be tested.
[0069] The second feature extraction module is used to extract and aggregate the temporal and frequency domain features of the purified campus audio stream based on the dual-branch feature extraction network to obtain the fused features of the purified campus audio stream.
[0070] The calculation module is used to calculate the distance between the fusion features of the purified campus audio stream under test and the center of the sphere of the normal sound feature manifold;
[0071] The determination module is used to determine whether the campus audio stream under test contains abnormal sounds based on the distance.
[0072] This invention provides a method and system for detecting abnormal sounds on campus that integrates background noise. To address the scarcity of abnormal sound samples on campus, it uses a deep single-classification network to learn the manifold of normal sound features, so that the anomaly detection logic is centered on normal sound features. The optimization objective concentrates the normal sound features within a closed sphere in the feature space, which removes the need to train on a large number of abnormal samples. Anomaly detection then only requires determining whether the feature under test deviates from the normal feature manifold, which solves the sample collection difficulty caused by the sporadic nature of abnormal sounds on campus. In addition, a noise discriminator is trained at the same time, which sharpens the feature separation between campus background noise and normal or abnormal sounds and reduces the false alarm rate caused by noise interference.
Attached Figure Description
[0073] Figure 1 is a flowchart of a campus abnormal sound detection method that integrates background noise, according to an embodiment of the present invention;
[0074] Figure 2 is a schematic diagram of the dynamic noise database construction and generative adversarial network framework, according to an embodiment of the present invention;
[0075] Figure 3 is a schematic diagram of generating noisy abnormal sound samples through adaptive mixing enhancement;
[0076] Figure 4 is a schematic diagram of the dual-branch feature extraction network structure;
[0077] Figure 5 is a schematic diagram of the deep single-classification and noise regularization training framework;
[0078] Figure 6 is a schematic diagram of a campus abnormal sound detection system that integrates background noise, according to an embodiment of the present invention;
Figure 7 is a schematic diagram of the hardware structure of a possible electronic device provided by the present invention;
Figure 8 is a schematic diagram of the hardware structure of a possible computer-readable storage medium provided by the present invention.
Detailed Implementation
[0079] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention. In addition, the technical features of the various embodiments or individual embodiments provided by the present invention can be arbitrarily combined with each other to form feasible technical solutions. Such combinations are not constrained by the order of steps and / or structural composition patterns, but must be based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.
[0080] Addressing the challenges of complex and variable background noise in campus environments, significant differences in acoustic characteristics across different scenarios, and the scarcity of samples due to the sporadic nature of abnormal sounds, this invention overcomes the technical pain points of traditional detection methods, which suffer from low accuracy, high false alarm rates, and poor scenario adaptability in such environments. It constructs a campus abnormal sound detection method that integrates background noise adversarial game theory. This method optimizes the entire process, from constructing diverse noise samples, accurately extracting time-frequency features, adapting to feature manifold learning with limited samples, targeted noise cancellation, to scenario-based dynamic threshold detection. This enables automatic and accurate detection of abnormal sounds in complex campus noise environments, improving detection accuracy and robustness, reducing false alarm and false negative rates. The system can effectively adapt to different campus scenarios such as classrooms, playgrounds, and corridors, accurately identifying various abnormal campus sounds such as alarms and shouts, providing reliable acoustic detection technology support for campus safety monitoring.
[0081] Figure 1 is a flowchart of a campus abnormal sound detection method that integrates background noise according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:
[0082] Step 1: Collect normal sound samples and noise samples from different scenes in the campus environment, and generate noisy abnormal sound samples. All the collected normal sound samples, noise samples and generated noisy abnormal sound samples constitute the training set.
[0083] It can be understood that distributed microphone arrays are deployed to collect background noise in the campus environment. As shown in Figure 2, the microphone arrays collect noise data in real time from different scenes (such as classrooms, playgrounds, and corridors). In this way, a dynamic noise database is built that contains background noise samples of different types and intensities. These real noise samples are denoted as $n_r$, and each real noise sample is labeled with its scene type. The microphone arrays also collect noise-free normal sound samples within the campus environment.
[0084] A conditional generative adversarial network (CGAN) is used to generate adversarial noise samples (also called synthetic noise samples). The generator $G$ of the CGAN receives a real noise sample $n_r$ and a random latent variable $z$ and, based on the scene label $s$ of the real noise sample $n_r$, generates the corresponding synthetic noise sample $\tilde{n}$. The generation formula is:
[0085] $$\tilde{n} = G(n_r, z \mid s)$$
[0086] Here, the conditional generative adversarial network performs noise enhancement on the collected real noise samples $n_r$ to avoid interference from the complex campus environment and obtain purer, clearer noise samples.
[0087] As shown in Figure 2, the discriminator $D$ of the conditional generative adversarial network distinguishes the generated synthetic noise samples $\tilde{n}$ from the real noise samples $n_r$ based on the scene label $s$.
[0088] Referring to Figure 3, the generated synthetic noise sample $\tilde{n}$ is dynamically mixed with a collected clean, noise-free abnormal sound sample to obtain a noisy abnormal sound sample. During mixing, the mixing weights of the synthetic noise sample and the clean abnormal sound sample are adaptively adjusted according to the signal-to-noise ratio (SNR), producing abnormal sound samples that contain diverse background noise; these are referred to as noisy abnormal sound samples.
[0089] Let the clean abnormal sound sample be $x_a$, the generated synthetic noise sample be $\tilde{n}$, and the mixed noisy abnormal sound data be $x_{mix}$. Based on the adaptive SNR adjustment mechanism, the mixed data is expressed as:
[0090] $$x_{mix} = \alpha\, x_a + (1 - \alpha)\, \tilde{n}$$
[0091] where $\alpha \in [0, 1]$ is a mixing weight that is adaptively adjusted according to the SNR. When $\alpha$ is close to 1, noise interference is weak and the proportion of the clean abnormal sound sample is high; when $\alpha$ is close to 0, noise interference is strong and the proportion of the noise sample is high.
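The SNR-adaptive mixing can be sketched in Python as a convex combination of the clean abnormal sound and the synthetic noise. The text does not state how the mixing weight is derived from the SNR, so the logistic mapping in `mixing_weight` and its parameters `midpoint_db` and `slope` are purely illustrative assumptions; only the behavior at the extremes (weight near 1 for weak noise, near 0 for strong noise) follows the description.

```python
import math


def mixing_weight(snr_db, midpoint_db=0.0, slope=0.5):
    """Hypothetical SNR-to-weight mapping: a logistic curve in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def mix_samples(clean_abnormal, synthetic_noise, snr_db):
    """Blend a clean abnormal sound with synthetic noise, sample by sample."""
    if len(clean_abnormal) != len(synthetic_noise):
        raise ValueError("both signals must have the same length")
    alpha = mixing_weight(snr_db)
    return [alpha * x + (1.0 - alpha) * n
            for x, n in zip(clean_abnormal, synthetic_noise)]
```

Sweeping `snr_db` over a range of values for each clean abnormal sound is one way such a scheme could produce training samples with varied noise intensity.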
[0092] Finally, the normal sound samples, the synthetic noise samples, and the noisy abnormal sound samples form the training set.
[0093] Step 2: Extract the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network, and aggregate the time-domain features and frequency-domain features of each sample to obtain the fused features of each sample.
[0094] It can be understood that, referring to Figure 4, a dual-branch feature extraction network extracts the time-domain and frequency-domain features of each sample in the training set, and the two sets of features are then fused to obtain the fused features of each sample.
[0095] The dual-branch feature extraction network comprises a time-domain branch network and a frequency-domain branch network. The time-domain branch network uses a 1D residual convolutional network to extract short-term transient features (time-domain features) for each sample. This time-domain branch network captures local time-domain changes in the sound signal through convolutional layers, effectively extracting transient information, such as the fluctuations of sudden sounds. The frequency-domain branch network first converts the original sound signal into a time-frequency map through wavelet packet transform to capture the dynamic features of frequency changes over time. Then, the time-frequency map is input into a focusing Transformer module. This Transformer module uses a multi-head attention mechanism to focus on sub-bands sensitive to abnormal sounds in the frequency dimension. These sub-bands have higher discriminative power in the expression of abnormal sounds, thus extracting frequency-domain features.
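To make the wavelet packet step of the frequency-domain branch concrete, the following Python sketch performs a full wavelet packet decomposition with the Haar wavelet and reports the energy in each sub-band. The text does not name the wavelet or the decomposition depth, so the Haar filter, the `levels` parameter, and the function names are assumptions. The leaf nodes are returned in natural (not frequency) order.

```python
import math


def haar_split(x):
    """One Haar analysis step: low-pass and high-pass halves of x."""
    s = 1.0 / math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return low, high


def wavelet_packet(signal, levels):
    """Split every node at every level, giving 2**levels leaf sub-bands."""
    if levels < 1 or len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes


def subband_energies(nodes):
    """Energy per sub-band; the orthonormal Haar step preserves total energy."""
    return [sum(v * v for v in node) for node in nodes]
```

Stacking the leaf coefficient sequences row by row gives a sub-band by time grid, which is one simple form a time-frequency map fed to an attention module could take.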
[0096] Finally, the extracted time-domain features $F_t$ and frequency-domain features $F_f$ of each sample are dynamically aggregated by a gated fusion unit to obtain the fused features of each sample. The fusion formula is:
[0097] $$F = \sigma\left(W_g\,[F_t; F_f]\right) \odot F_t + \left(1 - \sigma\left(W_g\,[F_t; F_f]\right)\right) \odot F_f$$
[0098] where $\sigma$ denotes the sigmoid activation function, $W_g$ is the weight matrix of the gated fusion unit, $[F_t; F_f]$ denotes the concatenation of the time-domain features $F_t$ and the frequency-domain features $F_f$, and $\odot$ denotes element-wise multiplication. $F_t$ and $F_f$ are the features extracted in the time domain and the frequency domain, respectively, and $F$ is the final fused feature. The purpose of this fusion formula is to weight the contribution of each feature type by dynamically adjusting the fusion ratio, thereby improving the model's accuracy and robustness in detecting abnormal sounds. When the time-domain features $F_t$ are more informative in a given situation, the gate $\sigma(W_g\,[F_t; F_f])$ approaches 1; otherwise the fusion depends more on the frequency-domain features $F_f$. The fused feature $F$ integrates time-domain and frequency-domain information and provides a more comprehensive feature representation for subsequent abnormal sound detection.
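A gated fusion unit of this kind can be sketched in plain Python. The sketch assumes that the time-domain and frequency-domain feature vectors have the same length, that the gate is a sigmoid of a linear map of their concatenation without a bias term, and that the fused feature is the gate-weighted blend of the two; the names `gated_fusion` and `gate_weights` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(time_feat, freq_feat, gate_weights):
    """Blend two equal-length feature vectors with a learned gate.

    gate_weights is a d x 2d matrix (list of rows) applied to the
    concatenation [time_feat; freq_feat]; no bias term is assumed.
    """
    d = len(time_feat)
    if len(freq_feat) != d or len(gate_weights) != d:
        raise ValueError("feature and gate dimensions do not match")
    concat = list(time_feat) + list(freq_feat)
    gate = [sigmoid(sum(w * x for w, x in zip(row, concat)))
            for row in gate_weights]
    # Gate near 1 favors the time-domain feature, near 0 the frequency-domain one.
    return [g * t + (1.0 - g) * f
            for g, t, f in zip(gate, time_feat, freq_feat)]
```

With all-zero gate weights every gate value is 0.5 and the fused feature is the plain average of the two inputs, which is a convenient sanity check before training.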
[0099] Based on the above dual-branch network, the fusion features of each sample (including normal sound samples, noise samples, and noisy abnormal sound samples) in the training set are extracted.
[0100] Step 3: Based on the fusion features of normal sound samples and noise samples in the training set, a deep single-classification network and a noise discriminator are jointly trained, and the training effect of the deep single-classification network is verified by using the fusion features of noisy abnormal sound samples. The deep single-classification network is used to learn the center position of the normal sound feature manifold, and the noise discriminator is used to identify noise from the sound.
[0101] It can be understood that, referring to Figure 5, this step jointly trains a deep single-classification network and a noise discriminator based on the fused features of the normal sound samples and noise samples in the training set. During training, the fused features of the noisy abnormal sound samples are used to verify the training effect of the deep single-classification network after each round of training. Specifically, the deep single-classification network learns the center position of the normal sound feature manifold, while the noise discriminator identifies noise in the sound, separating the valid sound signal from the noise.
[0102] Specifically, a deep single-classification network is used to learn the normal sound feature manifold: the fused features of the normal sound samples in the training set are input into the deep single-classification network for training, and a suitable normal sound feature manifold is learned so that the features of normal sounds are concentrated in a closed sphere in the feature space. The core of this process is an optimization objective that minimizes the distance between the normal sound features and the sphere center c while controlling the complexity of the model.
[0103] The optimization objective of the training is:
[0104] $$\min_{W}\; \frac{1}{N}\sum_{i=1}^{N} \left\| \phi(x_i; W) - c \right\|^2 + \frac{\lambda}{2}\,\|W\|_F^2$$
[0105] where $\phi(x_i; W)$ is the fused feature of the normal sound sample $x_i$ obtained through the dual-branch feature extraction network, $c$ is the sphere center of the normal sound feature manifold, $\lambda$ is the preset regularization parameter, $W$ denotes the weights of the dual-branch feature extraction network, and $N$ is the number of normal sound samples. The goal is to minimize the distance between the feature $\phi(x_i; W)$ of each normal sound sample and the sphere center $c$, while the regularization term $\frac{\lambda}{2}\|W\|_F^2$ controls the complexity of the network and avoids overfitting. The sphere center $c$ is iteratively updated by a moving average, so that the model gradually adapts to the distribution of the data.
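The value of such an objective can be computed as in the Python sketch below. It assumes the common one-class form: the mean squared distance of the normal-sample features to the sphere center plus half the regularization parameter times the sum of squared weights. The exact scaling constants are an assumption, and `weights` stands for a flattened list of network parameters.

```python
def one_class_objective(features, center, weights, lam):
    """Mean squared distance to the center plus an L2 weight penalty.

    features: fused feature vectors of normal sound samples
    center:   sphere center vector
    weights:  flattened network weights (illustrative stand-in)
    lam:      preset regularization parameter
    """
    if not features:
        raise ValueError("at least one feature vector is required")
    distance_term = sum(
        sum((f - c) ** 2 for f, c in zip(feat, center)) for feat in features
    ) / len(features)
    penalty_term = 0.5 * lam * sum(w * w for w in weights)
    return distance_term + penalty_term
```

In training, a gradient-based optimizer would minimize this value with respect to the network weights; the sketch only evaluates it for given features.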
[0106] While the deep single-classification network is trained, a noise discriminator is trained at the same time on the noise samples so that background noise and valid sound signals are clearly distinguished, and a regularization constraint is applied to the noise samples:
[0107]
[0108] where the noise samples are synthetic noise samples, their fusion features are obtained through the dual-branch feature extraction network, and a threshold is set for the noise samples. Through this regularization constraint, noise samples are forced away from the normal sound decision boundary, thereby improving the accuracy of abnormal sound detection.
[0109] During training, the deep single-classification network and the noise discriminator are trained jointly. When the deep single-classification network is trained, the sphere center position c is updated iteratively by a moving average, as follows:
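The joint objective can be illustrated with a minimal sketch. It assumes a hinge-style noise penalty that is positive only when a noise feature lies within a margin of the sphere center, which is one plausible reading of the regularization constraint and not necessarily the formula used in the filing. The function names, the `margin` parameter and the `noise_weight` parameter are hypothetical, and the weight-decay term is omitted for brevity.

```python
import numpy as np


def one_class_loss(normal_feats: np.ndarray, center: np.ndarray) -> float:
    """Mean squared Euclidean distance of normal features to the sphere center."""
    diffs = normal_feats - center
    return float(np.mean(np.sum(diffs ** 2, axis=1)))


def noise_margin_penalty(noise_feats: np.ndarray, center: np.ndarray, margin: float) -> float:
    """Assumed hinge penalty: non-zero only for noise features whose squared
    distance to the center is below `margin` (that is, too close to normal sound)."""
    sq_dist = np.sum((noise_feats - center) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, margin - sq_dist)))


def joint_loss(normal_feats, noise_feats, center, margin, noise_weight=1.0) -> float:
    """One-class term plus the weighted noise penalty (weight decay omitted)."""
    return one_class_loss(normal_feats, center) + noise_weight * noise_margin_penalty(
        noise_feats, center, margin
    )
```

With this form, a noise sample that already lies far from the center contributes nothing to the loss, while one that drifts toward the normal region raises the loss and is pushed away during training.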
[0110]
[0111] where the terms are the sphere center position at the current iteration, the sphere center position at the previous iteration, the center of the fusion features of the current batch of normal sound samples, and the moving average coefficient, which is typically taken as 0.9 or 0.99.
[0112] Based on the sphere center position c generated in each iteration, the regularization constraint of the noise discriminator is calculated and fed back to the deep single-classification network. Based on the regularization constraint, the sphere center position is updated using the moving average. The iteration is repeated until the termination condition is met, yielding the trained deep single-classification network and noise discriminator.
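The moving-average update of the sphere center can be sketched as follows. This assumes the conventional exponential-moving-average form, in which the coefficient weights the previous center and its complement weights the current batch center; the name `update_center` and the parameter `momentum` are illustrative and do not come from the source.

```python
import numpy as np


def update_center(prev_center: np.ndarray, batch_feats: np.ndarray, momentum: float = 0.9) -> np.ndarray:
    """Exponential moving average of the sphere center.

    batch_feats has shape (batch_size, feature_dim) and holds the fused
    features of the current batch of normal sound samples.
    """
    batch_center = batch_feats.mean(axis=0)
    return momentum * prev_center + (1.0 - momentum) * batch_center
```

A coefficient of 0.9 or 0.99 means each batch moves the center by only 10% or 1% of the gap to the batch mean, which keeps the center stable against noisy batches.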
[0113] The termination condition is that the average distance between the fusion feature of the normal sound sample and the center position c of the sphere reaches the minimum (or meets the threshold condition), and the regularization constraint of the noise discriminator is maximized (or meets the preset constraint condition), or the number of iterations reaches the set number of iterations.
[0114] After each iteration of training, the training effect of the deep single-classification network is verified using noisy abnormal sound samples. Specifically, the fusion features of the noisy abnormal sound samples are input into the deep single-classification network. If the fusion features of the noisy abnormal sound samples are not enclosed in the smallest closed sphere of normal sound, it indicates that the deep single-classification network has a preliminary abnormality recognition ability. Otherwise, the parameters of the deep single-classification network are adjusted and training continues.
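The per-iteration validation can be illustrated with a small check of how many noisy abnormal samples fall outside the sphere. The text does not state how the sphere radius is obtained, so this sketch assumes it is taken as a quantile of the normal-sample distances (the maximum by default); `sphere_radius`, `fraction_outside` and the `quantile` parameter are hypothetical names.

```python
import numpy as np


def sphere_radius(normal_feats: np.ndarray, center: np.ndarray, quantile: float = 1.0) -> float:
    """Assumed radius: a quantile of the normal-sample distances to the center."""
    dists = np.linalg.norm(normal_feats - center, axis=1)
    return float(np.quantile(dists, quantile))


def fraction_outside(abnormal_feats: np.ndarray, center: np.ndarray, radius: float) -> float:
    """Share of noisy abnormal samples whose features lie outside the sphere."""
    dists = np.linalg.norm(abnormal_feats - center, axis=1)
    return float(np.mean(dists > radius))
```

A fraction close to 1 after a training round would correspond to the "preliminary abnormality recognition ability" described above, while a low fraction would indicate that training should continue.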
[0115] The input to the deep single-classification network includes the normal sound samples from the training set. Each sample obtains a fused feature representation through the dual-branch feature extraction network in step 2.
[0116] The output of the deep single-classification network is as follows: through training, the network learns a feature manifold of normal sound and describes the distribution of normal sound with a minimal closed sphere. The center of this sphere is c, and the radius is continuously updated based on the training data. All normal sound features are mapped into this sphere in the feature space. To achieve this, the optimization objective minimizes the distance between each normal sound feature and the sphere center c.
[0117] During the training of the deep single-classification network, a noise discriminator is jointly trained. The synthetic noise samples, that is, background noise, are likewise input into the dual-branch feature extraction network, where their fused features are calculated. To ensure sufficient differentiation between noise samples and normal sounds, a regularization constraint is applied to the noise samples, causing their features to move away from the decision boundary of normal sound. When the feature of a noise sample is too close to the sphere center c, the regularization term increases, forcing the noise sample away from the region of normal sound and thereby improving the accuracy of abnormal sound detection.
[0118] Suppose there are three normal sound samples, and the distances of their fusion feature representations from the sphere center c are, respectively:
[0119]
[0120] In this embodiment, the training objective of the deep single-classification network is to concentrate these features as close as possible to the sphere center, minimizing their distance to it.
[0121] Suppose the distance between the fusion feature of a noise sample and the sphere center c, together with the set threshold, are given. In this case the noise sample is far from the sphere center, and the regularization constraint becomes:
[0122]
[0123] The results show that noise samples are far from the characteristic manifold of normal sound, and need to be pushed further away by regularization constraints to enhance the separation between abnormal sound and noise samples.
[0124] Step 3 not only optimizes the feature manifold of normal sound to ensure its tight concentration in the feature space, but also ensures that noise samples are far from the decision boundary of normal sound through regularization constraints on noise samples, thereby improving the accuracy and robustness of abnormal sound detection.
[0125] Step 4: Receive the campus audio stream to be tested in real time, identify the background noise in the campus audio stream to be tested based on the trained noise discriminator, and perform noise cancellation in the time-frequency domain based on the inverse filter to obtain the noise-cancelled campus audio stream to be tested.
[0126] Understandably, this step receives the campus audio stream to be tested in real time and determines whether it contains abnormal sounds. First, based on the trained noise discriminator, the scene type of background noise in the campus audio stream is identified. Based on the scene type of the background noise, the corresponding noise generator G in the conditional adversarial network is invoked to construct an inverse filter. The noise in the campus audio stream is then canceled in the time-frequency domain using this inverse filter. Specifically, noise cancellation in the time-frequency domain involves:
[0127]
[0128] The input quantity is the time-frequency representation of the original audio signal to be tested.
[0129] The noise term is an estimate of the noise power spectrum, which represents the intensity of the noise in the frequency domain.
[0130] α is an adaptive suppression coefficient, which dynamically adjusts the amount of suppression according to the noise intensity. When the noise is strong, α is larger, which strengthens noise suppression; when the noise is weak, α is smaller, which avoids excessive suppression.
[0131] The output is the time-frequency representation of the purified campus audio stream to be tested after noise cancellation.
[0132] By adjusting the noise power spectrum estimate and the adaptive suppression coefficient, the noise component in the audio signal is effectively removed. After noise cancellation, a purified audio signal is obtained in which the background noise has been effectively reduced.
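As an illustration of noise cancellation in the time-frequency domain, the sketch below uses classical power spectral subtraction with a spectral floor. This is a swapped-in stand-in and not the generator-derived inverse filter of the invention, whose exact formula is not shown in the text. The linear rule in `adaptive_alpha`, its bounds, and the `floor` parameter are assumptions made only for this example.

```python
import numpy as np


def adaptive_alpha(X: np.ndarray, noise_psd: np.ndarray,
                   alpha_min: float = 1.0, alpha_max: float = 4.0) -> float:
    """Hypothetical rule: map the noise-to-signal power ratio, clipped to
    [0, 1], linearly onto [alpha_min, alpha_max] so that stronger noise
    yields a larger suppression coefficient."""
    signal_power = np.mean(np.abs(X) ** 2)
    noise_power = np.mean(noise_psd)
    ratio = np.clip(noise_power / (signal_power + 1e-12), 0.0, 1.0)
    return float(alpha_min + (alpha_max - alpha_min) * ratio)


def suppress_noise(X: np.ndarray, noise_psd: np.ndarray, alpha: float,
                   floor: float = 0.01) -> np.ndarray:
    """Power spectral subtraction on a complex time-frequency matrix X.

    noise_psd must broadcast against X (for example shape (freq, 1)).
    The phase of X is kept; only the magnitude is scaled.
    """
    power = np.abs(X) ** 2
    # Subtract the scaled noise power, but never go below a small floor.
    clean_power = np.maximum(power - alpha * noise_psd, floor * power)
    gain = np.sqrt(clean_power / (power + 1e-12))
    return gain * X
```

The floor prevents negative power estimates when α is large, which is one common way to limit the over-suppression that the description warns about.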
[0133] Step 5: Extract the temporal and frequency domain features of the purified campus audio stream based on the dual-branch feature extraction network and aggregate them to obtain the fused features of the purified campus audio stream.
[0134] Understandably, the purified signal is input into the dual-branch feature extraction network of step 2 to extract its time-domain and frequency-domain features, which are then aggregated to obtain the fused features. These fused features are used for the subsequent abnormal sound detection and help the system identify abnormal sounds, such as sudden alarms and shouts, more accurately, thereby improving detection accuracy and reducing the false alarm rate. Combining noise cancellation with feature extraction in this way ensures the efficiency of abnormal sound detection.
[0135] Step 6: Calculate the distance between the fusion features of the purified campus audio stream and the center of the sphere of the normal sound feature manifold.
[0136] Understandably, the fusion feature of the purified campus audio stream to be tested is compared with the sphere center position c of the normal sound feature manifold. The sphere center position c is obtained from the normal sound samples through the training in step 3 and represents the center of normal sound in the feature space. The Mahalanobis distance is used to measure the distance between the feature point to be tested and the normal sound feature manifold; its formula is:
[0137]
[0138] where the quantities involved are the fused feature vector of the audio to be tested, the sphere center c of the normal sound feature manifold, the covariance matrix of the fusion features of the normal sound samples, and the inverse of that covariance matrix.
[0139] Mahalanobis distance takes into account the correlation between features, making it more suitable than Euclidean distance for measuring distances in a multidimensional feature space.
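A minimal sketch of the Mahalanobis distance computation follows, using the standard definition: the square root of the quadratic form of the difference vector with the inverse covariance matrix. The function and argument names are illustrative. The sketch solves a linear system instead of forming the explicit inverse, which is a numerical choice made here and not something the source specifies.

```python
import numpy as np


def mahalanobis_distance(feature: np.ndarray, center: np.ndarray, cov: np.ndarray) -> float:
    """Mahalanobis distance between a fused feature vector and the sphere center."""
    diff = feature - center
    # Solve cov @ z = diff, so that diff @ z equals diff^T cov^{-1} diff.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))
```

With an identity covariance matrix the result reduces to the Euclidean distance; a large variance along one feature dimension shrinks the contribution of deviations along that dimension.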
[0140] Step 7: Based on the distance, determine whether the campus audio stream to be tested contains abnormal sounds.
[0141] In one embodiment, step 7, determining whether the campus audio stream to be tested contains abnormal sounds based on distance, includes:
[0142] Step 71: Based on the noise discriminator, identify the scene type of background noise in the campus audio stream to be tested, and determine the anomaly detection threshold corresponding to the scene type.
[0143] Each campus area may have different background noise and normal sound characteristics. Therefore, the anomaly detection threshold needs to be dynamically adjusted according to the current scene, which ensures that the system's sensitivity to abnormal sounds is appropriate in different scenes. Based on the scene type of the background noise in the campus audio stream to be tested, the set of Mahalanobis distances between campus audio streams of the same scene type and the normal sound feature manifold within a predetermined historical time period is obtained, and the mean and standard deviation of all Mahalanobis distances in this set are calculated, where k denotes the current scene type.
[0144] Based on the mean, the standard deviation and the scene factor β, the anomaly detection threshold is dynamically calculated; its formula is:
[0145]
[0146] where the two statistics are the mean and standard deviation, respectively, of the historical Mahalanobis distances for the current scene k.
[0147] Step 72: When the Mahalanobis distance is greater than or equal to the anomaly detection threshold, it is determined that the campus audio stream under test contains abnormal sounds and an anomaly alarm is issued; otherwise, it is determined that the campus audio stream under test does not contain abnormal sounds.
[0148] Specifically, when the calculated Mahalanobis distance reaches or exceeds the dynamically adjusted threshold, the system triggers an alarm, indicating that the current audio stream may contain abnormal sounds. The alarm mechanism relies on the normal sound feature manifold of step 3 and the noise cancellation processing of step 4 to ensure more accurate detection results.
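The scene-dependent threshold and the decision of step 72 can be sketched as below. The threshold formula itself is not shown in the text, so this example assumes the common form "mean plus scene factor times standard deviation"; that form, the use of the population standard deviation, and the function names are assumptions.

```python
import statistics
from typing import Sequence


def dynamic_threshold(history: Sequence[float], scene_factor: float) -> float:
    """Assumed form: mean + scene_factor * standard deviation of the
    historical Mahalanobis distances recorded for one scene type."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return mean + scene_factor * std


def contains_abnormal_sound(distance: float, threshold: float) -> bool:
    """Step 72: an alarm is raised when the distance reaches or exceeds the threshold."""
    return distance >= threshold
```

Under this assumed form, a noisier scene such as a playground has a larger mean and spread of historical distances and therefore a higher threshold than a classroom for the same scene factor.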
[0149] Input data includes:
[0150] Features to be tested: the features of the purified audio signal, obtained through the noise cancellation of step 4 and the feature extraction network.
[0151] Normal sound feature manifold: the normal sound feature manifold learned in step 3, which includes the sphere center position c, representing the center of normal sound in the feature space, and the covariance matrix, representing the distribution of the normal sound features.
[0152] Let the fusion feature of the audio stream to be tested be:
[0153]
[0154] Let the sphere center position c of the normal sound feature manifold be:
[0155]
[0156] Let the covariance matrix of the fusion features of the normal sound samples be:
[0157]
[0158] 1): Calculate the Mahalanobis distance between the fusion feature of the audio stream to be tested and the sphere center c of the normal sound feature manifold.
[0159] Calculate the difference between the fusion feature and the sphere center c:
[0160]
[0161] Calculate the inverse of the covariance matrix, which gives:
[0162]
[0163] The Mahalanobis distance is:
[0164]
[0165] 2): Set the scene factor according to the scene. Assuming a classroom scene, the mean and standard deviation of the historical Mahalanobis distances and the scene factor β give the dynamic anomaly threshold:
[0166]
[0167] 3): Compare the calculated Mahalanobis distance with the threshold:
[0168] If the Mahalanobis distance is greater than or equal to the threshold, an abnormal alarm is triggered.
[0169] Otherwise, no alarm is triggered.
[0170] In this embodiment, the Mahalanobis distance is less than the threshold; therefore, no alarm is triggered.
[0171] Steps 6 and 7 calculate the Mahalanobis distance between the feature to be tested and the normal sound feature manifold, combine it with the dynamic threshold to determine whether an anomaly is present, and finally output the alarm result or update the noise database.
[0172] Referring to Figure 6, a campus abnormal sound detection system fusing background noise according to an embodiment of the present invention is illustrated, comprising:
[0173] The acquisition module 601 is used to acquire normal sound samples and noise samples from different scenes in the campus environment, as well as generate noisy abnormal sound samples. All the acquired normal sound samples, noise samples and generated noisy abnormal sound samples constitute the training set.
[0174] The first feature extraction module 602 is used to extract the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network, and to aggregate the time-domain features and frequency-domain features of each sample to obtain the fused features of each sample.
[0175] Training module 603 is used to jointly train a deep single-classification network and a noise discriminator based on the fusion features of normal sound samples and noise samples in the training set, and to verify the training effect of the deep single-classification network using the fusion features of noisy abnormal sound samples. The deep single-classification network is used to learn the center position of the normal sound feature manifold, and the noise discriminator is used to identify noise from the sound.
[0176] The noise processing module 604 is used to receive the campus audio stream to be tested in real time, identify the background noise in the campus audio stream to be tested based on the trained noise discriminator, and perform noise cancellation processing on the campus audio stream to be tested in the time-frequency domain based on the inverse filter to obtain the noise-cancelled purified campus audio stream to be tested.
[0177] The second feature extraction module 605 is used to extract and aggregate the temporal and frequency domain features of the purified campus audio stream based on the dual-branch feature extraction network to obtain the fused features of the purified campus audio stream.
[0178] The calculation module 606 is used to calculate the distance between the fusion features of the purified campus audio stream under test and the center position of the normal sound feature manifold;
[0179] The determination module 607 is used to determine whether the campus audio stream to be tested contains abnormal sounds based on the distance.
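The inference-side modules 604 to 607 can be pictured as a simple chain of callables. The sketch below is only a structural illustration: the class name `Detector`, its field names and the fixed `threshold` field are hypothetical, and the actual denoiser, feature extractor and distance function would be the trained components described above.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Detector:
    denoise: Callable[[np.ndarray], np.ndarray]   # role of noise processing module 604
    extract: Callable[[np.ndarray], np.ndarray]   # role of second feature extraction module 605
    distance: Callable[[np.ndarray], float]       # role of calculation module 606
    threshold: float                              # used by determination module 607

    def contains_abnormal_sound(self, audio: np.ndarray) -> bool:
        """Run denoising, feature extraction, distance calculation and the decision."""
        purified = self.denoise(audio)
        feature = self.extract(purified)
        return self.distance(feature) >= self.threshold
```

Keeping each module behind a plain callable makes it possible to swap the per-scene threshold or the denoiser without touching the rest of the chain.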
[0180] It is understood that the campus abnormal sound detection system that integrates background noise provided by the present invention corresponds to the campus abnormal sound detection method that integrates background noise provided in the foregoing embodiments. The relevant technical features of the campus abnormal sound detection system that integrates background noise can be referred to the relevant technical features of the campus abnormal sound detection method that integrates background noise, and will not be repeated here.
[0181] Please refer to Figure 7, which is a schematic diagram of an embodiment of the electronic device provided by the present invention. As shown in Figure 7, an embodiment of the present invention provides an electronic device 700, including a memory 710, a processor 720, and a computer program 711 stored in the memory 710 and executable on the processor 720. When the processor 720 executes the computer program 711, it implements the steps of the campus abnormal sound detection method fusing background noise.
[0182] Please refer to Figure 8, which is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention. As shown in Figure 8, this embodiment provides a computer-readable storage medium 800, on which a computer program 811 is stored. When the computer program 811 is executed by a processor, it implements the steps of the campus abnormal sound detection method fusing background noise.
[0183] The present invention provides a method and system for detecting abnormal sounds on campus by incorporating background noise, which has the following beneficial effects:
[0184] 1. This invention utilizes a deep single-classification network to learn the manifold of normal sound features, addressing the challenge of scarce abnormal sound samples on campus and optimizing the anomaly detection logic centered on normal sound features. The network optimizes the target to concentrate normal sound features within a closed sphere in the feature space, eliminating the need for extensive training with numerous abnormal samples. Anomaly detection is achieved simply by determining whether the target feature deviates from the normal feature manifold, solving the problem of sample collection difficulties caused by sporadic abnormal sounds on campus. Furthermore, regularization terms control network complexity, and iterative updates of the sphere center using the moving average ensure the model continuously adapts to subtle changes in the campus sound environment, guaranteeing the accuracy and stability of feature manifold learning.
[0185] 2. This invention effectively enhances the distinguishability of campus background noise from normal and abnormal sounds by applying regularization constraints to noise samples and simultaneously training a noise discriminator, significantly reducing the false alarm rate caused by noise interference. The noise regularization constraint formula forces the features of noise samples away from the decision boundary of the normal sound feature manifold, allowing the model to clearly distinguish between background noise and valid sound signals, avoiding misclassifying strong noise as abnormal sound. The noise discriminator accurately identifies different types of campus background noise, providing a basis for subsequent targeted noise cancellation. It cuts off the interference path of noise on anomaly detection at the feature level, allowing the model to maintain accurate recognition of valid sound signals even in complex noisy environments.
[0186] 3. The system utilizes an inverse filter bank based on noise type identification and adaptive suppression coefficient adjustment to achieve accurate noise cancellation in the time-frequency domain on campus, providing a high-purity audio signal for abnormal sound detection. The system first identifies the type of background noise in the real-time audio stream using a noise discriminator, then calls the corresponding generator to construct a matching inverse filter bank to specifically cancel the noise in the time-frequency domain. Simultaneously, the suppression coefficient α is adaptively adjusted according to the noise intensity; the coefficient is increased to strengthen suppression when the noise is strong, and decreased when the noise is weak to avoid over-suppressing the effective signal. This operation effectively removes the masking effect of complex background noise on abnormal sounds on campus, significantly improving the purity of the audio signal. This allows subsequent feature extraction and anomaly detection to be performed based on a clearer signal, significantly improving detection accuracy.
[0187] 4. This invention achieves high adaptability and accuracy in detecting abnormal sounds in different campus scenarios by combining Mahalanobis distance calculation with a scenario-based dynamic threshold adjustment mechanism for anomaly judgment, while balancing detection sensitivity and a low false alarm rate. The Mahalanobis distance considers the correlation between features and, compared with the Euclidean distance, measures the deviation between the tested audio features and the normal sound feature manifold more accurately, allowing accurate judgments even with complex feature dimensions. At the same time, the system dynamically calculates the detection threshold based on historical Mahalanobis distance statistics for different campus scenarios such as classrooms and playgrounds, combined with the scenario factor β. This allows the detection standard to adapt to the acoustic feature differences of different scenarios, avoiding the problem of a uniform threshold being too sensitive in some scenarios, leading to false alarms, or too insensitive in others, leading to missed alarms. Ultimately, this achieves high-accuracy detection and low false alarm rate control for abnormal sounds across the entire campus.
[0188] It should be noted that the descriptions of each embodiment in the above embodiments have different focuses. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0189] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0190] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each process and/or block of the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0191] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0192] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0193] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0194] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A method for detecting abnormal sounds on campus by fusing background noise, characterized in that it comprises:
Step 1: Collect normal sound samples and noise samples from different scenes in the campus environment, and generate noisy abnormal sound samples. All the collected normal sound samples, noise samples and generated noisy abnormal sound samples constitute the training set.
Step 2: Extract the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network, and aggregate the time-domain features and frequency-domain features of each sample to obtain the fused features of each sample.
Step 3: Based on the fusion features of normal sound samples and noise samples in the training set, jointly train a deep single-classification network and a noise discriminator, and verify the training effect of the deep single-classification network using the fusion features of noisy abnormal sound samples. The deep single-classification network is used to learn the center position of the normal sound feature manifold, and the noise discriminator is used to identify noise in the sound.
Step 4: Receive the campus audio stream to be tested in real time, identify the background noise in the campus audio stream to be tested based on the trained noise discriminator, and perform noise cancellation on the campus audio stream to be tested in the time-frequency domain based on the inverse filter to obtain the purified campus audio stream to be tested after noise cancellation.
Step 5: Extract the time-domain and frequency-domain features of the purified campus audio stream to be tested based on the dual-branch feature extraction network and aggregate them to obtain the fused features of the purified campus audio stream to be tested.
Step 6: Calculate the distance between the fusion features of the purified campus audio stream to be tested and the center of the sphere of the normal sound feature manifold.
Step 7: Based on the distance, determine whether the campus audio stream to be tested contains abnormal sounds.
2. The campus abnormal sound detection method according to claim 1, characterized in that, in step 1, collecting normal sound samples and noise samples from different scenes in the campus environment and generating noisy abnormal sound samples comprises:
collecting, through a deployed distributed microphone array, normal sound samples and real noise samples of different scene types in the campus environment, and establishing a dynamic noise database that includes the real noise samples of different scene types and intensities;
based on each real noise sample, its scene type label and a preset random latent variable, generating a corresponding synthetic noise sample with the generator G of a conditional generative adversarial network;
dynamically mixing the generated synthetic noise sample with a clean, noise-free abnormal sound sample to obtain a noisy abnormal sound sample.
3. The campus abnormal sound detection method according to claim 2, characterized in that dynamically mixing the generated synthetic noise sample with a clean, noise-free abnormal sound sample to obtain a noisy abnormal sound sample comprises:
based on an adaptive adjustment mechanism of the signal-to-noise ratio (SNR), expressing the mixed noisy abnormal sound sample as a weighted combination of the clean, noise-free abnormal sound sample and the generated synthetic noise sample, wherein the mixing weight is adaptively adjusted according to the signal-to-noise ratio.
4. The campus abnormal sound detection method according to claim 1, characterized in that the dual-branch feature extraction network includes a time-domain branch network and a frequency-domain branch network, and in step 2, extracting the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network and aggregating them to obtain the fused features of each sample comprises:
the time-domain branch network using a 1D residual convolutional network to extract short-term transient features of each sample, which are used as the time-domain features;
the frequency-domain branch network converting the original sound signal of each sample into a time-frequency map through wavelet packet transform, and inputting the time-frequency map into a focused Transformer module, which extracts the frequency-domain features from the time-frequency map using a multi-head attention mechanism;
a gated fusion unit dynamically aggregating the time-domain features and frequency-domain features of each sample to generate the fusion features of each sample, wherein the expression for the dynamic aggregation involves the sigmoid activation function, the weight matrix of the gated fusion unit, the concatenation of the time-domain features and frequency-domain features, and element-wise multiplication.
5. The campus abnormal sound detection method according to claim 1, characterized in that, in step 3, jointly training the deep single-classification network and the noise discriminator based on the fusion features of the normal sound samples and noise samples in the training set, and validating the training effect of the deep single-classification network using the fusion features of the noisy abnormal sound samples, includes: training the deep single-classification network based on the fusion features of the normal sound samples in the training set to learn the normal sound feature manifold, so that the fusion features of the normal sound samples are concentrated within a closed sphere in the feature space; specifically, the optimization objective minimizes the distance between the fusion features of the normal sound samples and the sphere center of the normal sound feature manifold, and is expressed as: $\min_{W} \frac{1}{n} \sum_{i=1}^{n} \lVert \phi(x_i; W) - c \rVert^2 + \frac{\lambda}{2} \lVert W \rVert_F^2$; where $\phi(x_i; W)$ is the fusion feature of the normal sound sample $x_i$ extracted through the dual-branch feature extraction network, $n$ is the number of normal sound samples, $c$ is the sphere center of the normal sound feature manifold, $\lambda$ is a preset regularization parameter, and $W$ denotes the weights of the dual-branch feature extraction network; training the noise discriminator based on the fusion features of the noise samples, the regularization constraint of the noise discriminator being: [formula not reproduced in this text]; where the quantities involved are the generated synthetic noise samples, the fusion features of the synthetic noise samples, and the threshold for the synthetic noise samples; in each iteration of the joint training of the deep single-classification network and the noise discriminator, the sphere center $c$ is iteratively updated using a moving average, the regularization constraint of the noise discriminator is calculated based on the updated sphere center, and the calculated regularization constraint is fed back to the deep single-classification network; based on that regularization constraint, the sphere center is again updated using the moving average; the iteration is repeated until a termination condition is met, yielding the trained deep single-classification network and noise discriminator, the termination condition being that the average distance between the fusion features of the normal sound samples and the sphere center $c$ satisfies a distance threshold and the regularization constraint of the noise discriminator satisfies a constraint condition; in each training iteration, the fusion features of the noisy abnormal sound samples are input into the deep single-classification network; if those fusion features are not enclosed in the smallest closed sphere of normal sound, the deep single-classification network is deemed to have a preliminary abnormality recognition ability; otherwise, the parameters of the deep single-classification network are adjusted and training continues.
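The quantities tracked in one iteration of the joint training in claim 5 might, as a rough sketch, be computed as follows. The one-class loss is the mean squared distance of normal-sound fusion features to the sphere center, and the center follows a moving average of the batch mean. The hinge form of `noise_penalty`, the `momentum` value, and all function names are assumptions made for illustration; the weight regularization term and the actual network update are omitted.

```python
def sq_dist(z, c):
    """Squared Euclidean distance between feature vector z and center c."""
    return sum((a - b) ** 2 for a, b in zip(z, c))


def one_class_loss(features, center):
    """Mean squared distance of normal-sound features to the sphere center."""
    return sum(sq_dist(z, center) for z in features) / len(features)


def update_center(center, features, momentum=0.9):
    """Moving-average update of the center toward the batch mean.

    momentum is a hypothetical smoothing factor in [0, 1).
    """
    n = len(features)
    batch_mean = [sum(z[j] for z in features) / n for j in range(len(center))]
    return [momentum * c + (1.0 - momentum) * m
            for c, m in zip(center, batch_mean)]


def noise_penalty(noise_features, center, margin):
    """Assumed hinge penalty on synthetic-noise features.

    It is positive while a noise feature lies closer to the center than the
    margin (in squared distance) and zero once it has been pushed outside.
    """
    total = sum(max(0.0, margin - sq_dist(z, center)) for z in noise_features)
    return total / len(noise_features)
```

Under these assumptions, a training loop would stop once `one_class_loss` falls below the distance threshold and `noise_penalty` satisfies the constraint condition, mirroring the termination condition stated in the claim.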
6. The campus abnormal sound detection method according to claim 2, characterized in that, in step 4, receiving the campus audio stream under test in real time, identifying the background noise in the campus audio stream under test based on the trained noise discriminator, and performing noise cancellation on it in the time-frequency domain based on an inverse filter to obtain the noise-cancelled purified campus audio stream under test, includes: receiving the campus audio stream under test in real time, identifying the background noise in it and the scene type of that background noise based on the trained noise discriminator, and constructing the inverse filter by calling the generator of the conditional adversarial network corresponding to the identified scene type; based on the inverse filter, performing noise cancellation on the campus audio stream under test in the time-frequency domain to obtain the purified campus audio stream under test, the noise cancellation in the time-frequency domain being expressed as: [formula not reproduced in this text]; where the quantities involved are the time-frequency representation of the original campus audio stream under test; an estimate of the noise power spectrum, which characterizes the intensity of the noise in the frequency domain; an adaptive suppression coefficient, which dynamically adjusts the amount of suppression according to the intensity of the noise; and the time-frequency representation of the purified campus audio stream under test after noise cancellation.
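For orientation only, the time-frequency noise cancellation of claim 6 can be approximated by a classic power spectral subtraction gain. This swaps a hand-written gain in for the inverse filter that the claim builds from a conditional adversarial network generator, so it illustrates the roles of the four quantities rather than the claimed filter. The `adaptive_alpha` mapping, its `low`, `high`, and `ref` parameters, and the spectral `floor` are hypothetical.

```python
import math


def adaptive_alpha(noise_power, low=1.0, high=2.0, ref=1.0):
    """Hypothetical suppression coefficient that grows with noise intensity.

    It rises linearly from `low` to `high` as noise_power goes from 0 to `ref`
    and stays at `high` beyond that.
    """
    ratio = min(max(noise_power / ref, 0.0), 1.0)
    return low + (high - low) * ratio


def spectral_subtract(spec, noise_power, floor=0.01):
    """Apply a power spectral subtraction gain to a complex spectrogram.

    spec: list of frames, each a list of complex time-frequency bins.
    noise_power: per-frequency-bin estimate of the noise power spectrum.
    floor: minimum squared gain, which limits over-subtraction artifacts.
    """
    cleaned = []
    for frame in spec:
        out = []
        for y, p_n in zip(frame, noise_power):
            power = abs(y) ** 2
            if power == 0.0:
                out.append(0j)  # nothing to scale in an empty bin
                continue
            alpha = adaptive_alpha(p_n)
            gain_sq = max(1.0 - alpha * p_n / power, floor)
            out.append(math.sqrt(gain_sq) * y)  # the phase of y is preserved
        cleaned.append(out)
    return cleaned
```

Bins with no estimated noise pass through unchanged, while bins dominated by noise are attenuated down to the floor instead of being zeroed.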
7. The campus abnormal sound detection method according to claim 1, characterized in that, in step 6, calculating the distance between the fusion features of the purified campus audio stream under test and the sphere center of the normal sound feature manifold includes: calculating the Mahalanobis distance between the fusion features of the purified campus audio stream under test and the sphere center of the normal sound feature manifold using the following formula: $D_M(z) = \sqrt{(z - c)^{\top} \Sigma^{-1} (z - c)}$; where $z$ is the fusion feature of the purified campus audio stream under test, $c$ is the sphere center of the normal sound feature manifold, $\Sigma$ is the covariance matrix of the fusion features of the normal sound samples, and $\Sigma^{-1}$ is its inverse matrix.
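The Mahalanobis distance of claim 7 can be sketched in plain Python as follows. Rather than forming the inverse covariance matrix explicitly, the sketch solves a linear system with the covariance matrix, which yields the same distance; the helper names are illustrative, and a well-conditioned covariance matrix is assumed.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(a[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (a[i][n] - tail) / a[i][i]
    return x


def mahalanobis(z, c, cov):
    """Mahalanobis distance between feature z and center c under covariance cov."""
    diff = [zi - ci for zi, ci in zip(z, c)]
    y = solve(cov, diff)  # y = cov^{-1} (z - c)
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))
```

With an identity covariance matrix the result reduces to the Euclidean distance; a larger variance along one feature dimension shrinks that dimension's contribution to the distance.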
8. The campus abnormal sound detection method according to claim 7, characterized in that, in step 7, determining based on the distance whether the campus audio stream under test contains abnormal sounds includes: identifying the scene type of the background noise in the campus audio stream under test based on the noise discriminator, and determining the anomaly detection threshold corresponding to that scene type; when the Mahalanobis distance is greater than or equal to the anomaly detection threshold, determining that the campus audio stream under test contains abnormal sounds and issuing an anomaly alarm; otherwise, determining that the campus audio stream under test does not contain abnormal sounds.
9. The campus abnormal sound detection method according to claim 8, characterized in that identifying the scene type of the background noise in the campus audio stream under test based on the noise discriminator and determining the anomaly detection threshold corresponding to the scene type includes: based on the current scene type, obtaining the set of Mahalanobis distances over a predetermined historical time period, and calculating the mean $\mu_k$ and the standard deviation $\sigma_k$ of all Mahalanobis distances in the set, where $k$ denotes the current scene type; based on the mean $\mu_k$, the standard deviation $\sigma_k$, and the scene factor $\beta_k$, dynamically calculating the anomaly detection threshold $T_k$ as: $T_k = \mu_k + \beta_k \sigma_k$; where $\mu_k$ and $\sigma_k$ are the mean and the standard deviation, respectively, of the historical Mahalanobis distances for the current scene.
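The scene-dependent threshold of claims 8 and 9 might be sketched as below, assuming the threshold is the historical mean plus a scene factor times the historical standard deviation. The use of the population standard deviation and the function names are assumptions for illustration; the claims do not specify either.

```python
import statistics


def dynamic_threshold(history, scene_factor):
    """Threshold from historical Mahalanobis distances of the current scene.

    history: distances observed over the predetermined historical period.
    scene_factor: multiplier assigned to the current scene type.
    """
    mu = statistics.fmean(history)
    sigma = statistics.pstdev(history)  # population std; an assumption
    return mu + scene_factor * sigma


def is_abnormal(distance, history, scene_factor):
    """Flag an abnormal sound when the distance reaches the threshold."""
    return distance >= dynamic_threshold(history, scene_factor)
```

A noisier scene such as a playground would typically carry a larger scene factor than a classroom, so that ordinary loud activity stays below the threshold; any specific factor values would have to be tuned per deployment.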
10. A campus abnormal sound detection system that integrates background noise, characterized in that it comprises: an acquisition module, used to collect normal sound samples and noise samples from different scenarios in the campus environment and to generate noisy abnormal sound samples, wherein all the collected normal sound samples, the noise samples, and the generated noisy abnormal sound samples constitute the training set; a first feature extraction module, used to extract the time-domain features and frequency-domain features of each sample in the training set based on the dual-branch feature extraction network, and to aggregate the time-domain features and frequency-domain features of each sample to obtain the fusion features of each sample; a training module, used to jointly train a deep single-classification network and a noise discriminator based on the fusion features of the normal sound samples and noise samples in the training set, and to validate the training effect of the deep single-classification network using the fusion features of the noisy abnormal sound samples, wherein the deep single-classification network is used to learn the sphere center position of the normal sound feature manifold, and the noise discriminator is used to identify noise in the sound; a noise processing module, used to receive the campus audio stream under test in real time, identify the background noise in the campus audio stream under test based on the trained noise discriminator, and perform noise cancellation on the campus audio stream under test in the time-frequency domain based on the inverse filter to obtain the noise-cancelled purified campus audio stream under test; a second feature extraction module, used to extract and aggregate the time-domain features and frequency-domain features of the purified campus audio stream under test based on the dual-branch feature extraction network to obtain the fusion features of the purified campus audio stream under test; a calculation module, used to calculate the distance between the fusion features of the purified campus audio stream under test and the sphere center of the normal sound feature manifold; and a determination module, used to determine, based on the distance, whether the campus audio stream under test contains abnormal sounds.