Abnormal sound detection method and apparatus, storage medium, and electronic device

By extracting dual-channel features from the frequency domain vector sequence of audio information, the problem of information loss in non-human voice detection by traditional acoustic feature extraction methods is solved, and higher accuracy in abnormal sound detection is achieved.

CN115565548B · Active · Publication Date: 2026-06-23 · CHINA TELECOM CORP LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHINA TELECOM CORP LTD
Filing Date: 2022-09-20
Publication Date: 2026-06-23

IPC: G10L25/30; G10L25/51

AI Tagging

Application Domain

Speech analysis

Technology Topics

Sound detection; Feature extraction

AI Technical Summary

Technical Problem

Traditional acoustic feature extraction methods are prone to losing high-frequency information when detecting abnormal non-human voices, leading to misjudgments, especially in machine sound detection where the accuracy is not high.

Method used

A dual-channel feature extraction method based on frequency domain vector sequences is adopted. The audio information is preprocessed and dual-channel features are extracted through a sound model to obtain the target vector sequence. The audio information is then judged as abnormal based on the vector distance.

Benefits of technology

It improves the accuracy of abnormal sound detection and better preserves and extracts audio information outside the frequency range of human speech, making the method applicable to a wider range of abnormal sound detection fields.

Abstract

The application belongs to the technical field of artificial intelligence, and relates to an abnormal sound detection method and device, a storage medium, and an electronic device. The method comprises the following steps: preprocessing to-be-processed audio information to obtain a frequency domain vector sequence corresponding to the to-be-processed audio information; inputting the frequency domain vector sequence into a sound model, and performing dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence; and determining a vector distance according to the target vector sequence and the frequency domain vector sequence, and judging whether the to-be-processed audio information is abnormal audio information according to the vector distance. The application can improve the accuracy of the abnormal sound detection result.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to an abnormal sound detection method, an abnormal sound detection system, a computer storage medium, and an electronic device.

Background Technology

[0002] With the development of artificial intelligence technology, it has become possible to detect abnormal sounds using artificial intelligence technology, which is more accurate than manual detection of abnormal sounds.

[0003] Currently, when detecting abnormal sounds, traditional acoustic feature extraction methods are used to convert audio from the time domain to the frequency domain. The abnormal sounds are then judged based on the frequency domain information obtained from the conversion. However, traditional acoustic feature extraction methods generally use frequency domain features such as Fbank or Mel-frequency cepstral coefficients. These feature extraction methods are mostly used for speech feature extraction. Since the frequency range of machine sound and speech may not be consistent, this feature extraction method will lose some high-frequency information and may misjudge some high-frequency abnormal audio.

[0004] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this application.

Summary of the Invention

[0005] The purpose of this application is to provide an abnormal sound detection method, an abnormal sound detection system, a computer storage medium, and an electronic device, thereby improving the detection accuracy of various abnormal sounds to at least a certain extent.

[0006] Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part from practice of this application.

[0007] According to a first aspect of this application, an abnormal sound detection method is provided, comprising:

[0008] Preprocess the audio information to be processed to obtain a frequency domain vector sequence corresponding to the audio information to be processed;

[0009] The frequency domain vector sequence is input into the sound model, and the sound model performs dual-channel feature extraction on the frequency domain vector sequence to obtain a target vector sequence with the same dimension as the frequency domain vector sequence.

[0010] The vector distance is determined based on the target vector sequence and the frequency domain vector sequence, and whether the audio information to be processed is abnormal audio information is determined based on the vector distance.

[0011] According to a second aspect of this application, an abnormal sound detection device is provided, comprising:

[0012] The preprocessing module is used to preprocess the audio information to be processed in order to obtain a frequency domain vector sequence corresponding to the audio information to be processed.

[0013] The model processing module is used to input the frequency domain vector sequence into the sound model, and perform dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence.

[0014] The anomaly detection module is used to determine the vector distance based on the target vector sequence and the frequency domain vector sequence, and to determine whether the audio information to be processed is abnormal audio information based on the vector distance.

[0015] According to a third aspect of this application, a computer storage medium is provided, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the above-described abnormal sound detection method.

[0016] According to a fourth aspect of this application, an electronic device is provided, characterized in that it comprises:

[0017] Processor; and

[0018] Memory for storing the executable instructions of the processor;

[0019] The processor is configured to execute the above-described abnormal sound detection method by executing the executable instructions.

[0020] As can be seen from the above technical solutions, the abnormal sound detection method, abnormal sound detection device, computer storage medium, and electronic device in the exemplary embodiments of this application have at least the following advantages and positive effects:

[0021] The abnormal sound detection method in this application preprocesses the audio information to be processed to obtain a frequency domain vector sequence corresponding to the audio information. Then, the obtained frequency domain vector sequence is input into a sound model, which performs dual-channel feature extraction on the frequency domain vector sequence to obtain a target vector sequence with the same dimension as the frequency domain vector sequence. Finally, the vector distance between the target vector sequence and the frequency domain vector sequence is determined, and the vector distance is used to determine whether the audio information to be processed is abnormal. The abnormal sound detection method in this application improves the accuracy of abnormal sound detection because the sound model can perform dual-channel feature extraction on the frequency domain vector sequence, which yields more audio information and improves the accuracy of the target vector sequence.

[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this application.

Attached Figure Description

[0023] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. It is obvious that the drawings described below are merely some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.

[0024] Figure 1 shows a schematic diagram of the system architecture to which the abnormal sound detection method in the embodiments of this application is applied.

[0025] Figure 2 shows a schematic flowchart of the abnormal sound detection method in the embodiments of this application.

[0026] Figure 3 shows a schematic diagram of the sound model in an embodiment of this application.

[0027] Figure 4 shows a schematic diagram of the structure of the autoencoder sub-model 303 in the embodiments of this application.

[0028] Figure 5 shows a schematic diagram of the structure of the sound model to be trained in an embodiment of this application.

[0029] Figure 6 shows a schematic diagram of the structure of the autoencoder sub-model 504 to be trained in the embodiments of this application.

[0030] Figure 7 shows a schematic block diagram of the abnormal sound detection device in this application.

[0031] Figure 8 shows a schematic diagram of a computer system architecture suitable for implementing the embodiments of this application.

Detailed Implementation

[0032] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this application more comprehensive and complete, and to fully convey the concept of the exemplary embodiments to those skilled in the art.

[0033] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.

[0034] The terms “a,” “an,” “the,” and “said” are used in this specification to indicate the presence of one or more elements / components / etc.; the terms “including” and “having” are used to indicate an open-ended inclusion and to mean that there may be other elements / components / etc. in addition to the listed elements / components / etc.; the terms “first” and “second” are used only as labels and do not limit the number of objects.

[0035] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0036] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.

[0037] In the related technologies of this application, in the task of abnormal sound detection, due to limitations, it is often possible to obtain only a large number of normal audio samples for model training, but there is a lack of sufficiently large-scale abnormal sound samples, or even, in some cases, it is impossible to obtain abnormal sound samples at all. In this situation, it is necessary to use normal audio samples for unsupervised model training to achieve the purpose of detecting abnormal sounds.

[0038] Currently, many abnormal sound detection technologies still employ traditional acoustic feature extraction methods (Mel spectrograms, FBanks, Mel-frequency cepstral coefficients (MFCC), etc.). While these methods are effective at extracting features from human speech, the frequency range of speech is concentrated around 1 kHz, and the human ear can hear frequencies from 20 Hz to 20 kHz. However, the frequency of the sound being detected may not fall within this range. Therefore, traditional acoustic feature extraction methods are not necessarily suitable for sounds emitted by objects such as lathes, bearings, and gears; simply applying these methods may result in the loss of important audio information.

[0039] In view of the problems existing in related technologies, this application proposes an abnormal sound detection method.

[0040] Before providing a detailed description of the technical solutions in the embodiments of this application, the technical terms that may be involved in the embodiments of this application will be explained and described first.

[0041] (1) Short-time Fourier transform (STFT) is a mathematical transform related to Fourier transform, used to determine the frequency and phase of the sinusoidal wave in the local region of a time-varying signal.

[0042] (2) Autoencoder (AE) is a type of artificial neural network used in semi-supervised and unsupervised learning. It learns a representation of the input information by taking the input information itself as the learning target.

[0043] After introducing the technical terms that may be involved in the embodiments of this application, the abnormal sound detection method in this application will be described in detail.

[0044] Figure 1 shows a block diagram of an exemplary system architecture to which the technical solutions of this application are applied.

[0045] As shown in Figure 1, the system architecture 100 may include a terminal device 101, a server 102, and a network 103. The terminal device 101 may be any of various electronic devices with a display screen and a sound acquisition device, such as a smartphone, tablet, laptop, desktop computer, smart TV, or smart vehicle terminal. The sound acquisition device may be, for example, an embedded or external microphone or another device capable of sound acquisition. The server 102 may be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. The network 103 may be a communication medium of any connection type capable of providing a communication link between the terminal device 101 and the server 102, such as a wired communication link or a wireless communication link.

[0046] In an exemplary embodiment of this application, terminal device 101 can collect audio information through its own or externally connected sound acquisition device, and then send the audio information as audio information to be processed to server 102 through network 103. After receiving the audio information to be processed, server 102 can preprocess it to obtain a frequency domain vector sequence corresponding to the audio information to be processed; then it can call a sound model, input the frequency domain vector sequence into the sound model, and perform dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence; then it can determine the vector distance based on the target vector sequence and the frequency domain vector sequence, and determine whether the audio information to be processed is abnormal audio information based on the vector distance.

[0047] In an exemplary embodiment of this application, terminal device 101 may also receive audio information to be processed sent by other terminal devices and send the audio information to be processed to server 102 through network 103, so that server 102 calls the sound model to process the audio information to be processed and determines whether the audio information to be processed is an abnormal sound.

[0048] Of course, the abnormal sound detection method in this application embodiment can also be executed by the terminal device 101. After the terminal device 101 collects the audio information to be processed or receives the audio information to be processed sent by other terminal devices, it can call the sound model to process the stored audio information and determine whether the audio information to be processed is an abnormal sound based on the processing result.

[0049] Depending on the implementation requirements, the system architecture in this application embodiment can have any number of terminal devices, networks, and servers. For example, the server can be a server group composed of multiple server devices.

[0050] The technical solutions provided in this application can be applied to terminal device 101 or server 102. The abnormal sound detection method in this application is based on a sound model, which is a machine learning model involving artificial intelligence.

[0051] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0052] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0053] Computer vision (CV) is the science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in recognizing and measuring targets, and then performs image processing to create images more suitable for human observation or transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image annotation, OCR, video processing, video semantic understanding, video content / behavior recognition, abnormal sound detection, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), and other technologies.

[0054] Machine Learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and instructional learning.

[0055] The following detailed description of the technical solutions provided in this application, including the abnormal sound detection method, abnormal sound detection device, computer-readable medium, and electronic device, is based on specific embodiments.

[0056] Figure 2 shows a flowchart of the abnormal sound detection method. As shown in Figure 2, the abnormal sound detection method includes:

[0057] Step S210: Preprocess the audio information to be processed to obtain a frequency domain vector sequence corresponding to the audio information to be processed;

[0058] Step S220: Input the frequency domain vector sequence into the sound model, and perform dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence;

[0059] Step S230: Determine the vector distance based on the target vector sequence and the frequency domain vector sequence, and determine whether the audio information to be processed is abnormal audio information based on the vector distance.

[0060] The abnormal sound detection method of this application preprocesses the audio information to be processed to obtain a frequency domain vector sequence corresponding to the audio information; then, the obtained frequency domain vector sequence is input into a sound model, and the sound model performs dual-channel feature extraction on the frequency domain vector sequence to obtain a target vector sequence with the same dimension as the frequency domain vector sequence; finally, the vector distance is determined based on the target vector sequence and the frequency domain vector sequence, and the vector distance is used to determine whether the audio information to be processed is abnormal audio information. The abnormal sound detection method of this application improves the accuracy of abnormal sound detection because the sound model can perform dual-channel feature extraction on the frequency domain vector sequence, which yields more audio information and improves the accuracy of the target vector sequence.
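Step S230 can be illustrated with a minimal sketch. The text above does not fix the distance measure, the way per-frame distances are combined, or a threshold, so the Euclidean distance, the mean over frames, and the `threshold` parameter below are all assumptions made for illustration only.

```python
import numpy as np

def frame_distances(freq_seq, target_seq):
    """Per-frame Euclidean distance between two (frames, bins) arrays."""
    if freq_seq.shape != target_seq.shape:
        raise ValueError("sequences must have the same dimension")
    return np.linalg.norm(freq_seq - target_seq, axis=1)

def is_abnormal(freq_seq, target_seq, threshold):
    """Flag the clip when the mean per-frame distance exceeds the threshold.

    The mean aggregation and the threshold are hypothetical choices.
    """
    score = float(frame_distances(freq_seq, target_seq).mean())
    return score > threshold, score

# Toy input: two frames with three frequency bins each.
freq_seq = np.array([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]])
close_target = freq_seq.copy()        # the model reproduced the input well
far_target = np.zeros_like(freq_seq)  # the model failed to reproduce it

normal_result = is_abnormal(freq_seq, close_target, threshold=1.0)
abnormal_result = is_abnormal(freq_seq, far_target, threshold=1.0)
```

Because the target vector sequence has the same dimension as the frequency domain vector sequence, the two arrays can be subtracted directly without any resampling or projection.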

[0061] The steps of the abnormal sound detection method shown in Figure 2 are explained in detail below.

[0062] In step S210, the audio information to be processed is preprocessed to obtain a frequency domain vector sequence corresponding to the audio information to be processed.

[0063] In an exemplary embodiment of this application, after receiving the audio information to be processed, preprocessing can be performed on the audio information to be processed to convert it from the time domain to the frequency domain and obtain a frequency domain vector sequence. The preprocessing in this embodiment includes a first preprocessing stage and a second preprocessing stage. The first preprocessing stage involves cleaning the audio information to be processed by noise reduction, echo cancellation, and dereverberation to remove interference information. The second preprocessing stage involves performing time-frequency domain conversion on the audio information to be processed to convert the audio from the time domain to the frequency domain and obtain a frequency domain vector sequence corresponding to the audio information to be processed.

[0064] In an exemplary embodiment of this application, the frequency domain vector sequence corresponding to the audio information to be processed can be obtained by short-time Fourier transform. Specifically, the audio information to be processed is first divided into frames, each frame is then windowed, and the frequency domain information, i.e., the frequency domain vector, is extracted from each windowed frame. Finally, the frequency domain vectors of all frames are concatenated to obtain the frequency domain vector sequence corresponding to the audio information to be processed. Here, windowing means multiplying each frame by a time-limited window function h(t), on the assumption that the non-stationary signal is stationary within the short time interval of the analysis window. By moving the window function h(t) along the time axis, the signal is analyzed segment by segment to obtain a set of local spectra of the signal.
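The framing, windowing, and per-frame transform described in this paragraph can be sketched as follows. The Hann window, the frame length of 512 samples, the hop of 256 samples, and the use of the magnitude spectrum are assumptions; the text only requires a time-limited window h(t) and a frequency domain vector per frame.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Frame, window, and transform a 1-D signal.

    Returns an array of shape (n_frames, frame_len // 2 + 1), one
    frequency domain vector per frame.
    """
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # assumed choice for h(t)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slide the analysis window along the time axis.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # Window each frame, then take the one-sided FFT of each row.
    spectra = np.fft.rfft(frames * window, axis=1)
    return np.abs(spectra)

# One second of a 1 kHz tone sampled at 16 kHz (toy input).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

freq_seq = stft_magnitude(tone)
peak_bin = int(freq_seq[0].argmax())
peak_hz = peak_bin * sample_rate / 512
```

With these settings each bin spans 31.25 Hz, so the 1 kHz tone falls on bin 32. Unlike a Mel filter bank, every bin up to the Nyquist frequency is kept at uniform resolution.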

[0065] In step S220, the frequency domain vector sequence is input into the sound model, and the sound model performs dual-channel feature extraction on the frequency domain vector sequence to obtain a target vector sequence with the same dimension as the frequency domain vector sequence.

[0066] In an exemplary embodiment of this application, after obtaining the frequency domain vector sequence, the frequency domain vector sequence can be processed to obtain the target vector sequence corresponding to the audio information to be processed. In this embodiment, a sound model is used to extract features from the frequency domain vector sequence. This sound model is a fusion model that includes multiple sub-models with different functions.

[0067] Figure 3 shows a schematic diagram of the structure of the sound model. As shown in Figure 3, the sound model 300 includes an input layer 301, a first pointwise convolutional sub-model 302, an autoencoder sub-model 303, a second pointwise convolutional sub-model 304, and an output layer 305 connected in sequence. The input layer 301 inputs the frequency domain vector sequence to the first pointwise convolutional sub-model 302, which extracts features from the frequency domain vector sequence to obtain a first feature vector. The first feature vector is then input to the autoencoder sub-model 303, which performs dual-channel encoding and dual-channel decoding on the first feature vector to obtain a second feature vector. The second feature vector is then input to the second pointwise convolutional sub-model 304, which extracts features from the second feature vector to obtain the target vector sequence. Finally, the output layer 305 outputs the target vector sequence.

[0068] Next, the sound model and information processing flow in the embodiments of this application will be described in detail.

[0069] In an exemplary embodiment of this application, both the first pointwise convolutional sub-model 302 and the second pointwise convolutional sub-model 304 are pointwise convolutional neural network models. The first pointwise convolutional sub-model 302 contains M convolutional kernels, each of size 1×1, where M is a positive integer less than the dimension of the frequency domain vector sequence. The second pointwise convolutional sub-model 304 contains N convolutional kernels, each of size 1×1, where N is equal to the dimension of the frequency domain vector sequence. That is, the dimension of the final output target vector sequence is the same as the dimension of the frequency domain vector sequence. The first pointwise convolutional sub-model 302 mainly extracts important features from the frequency domain vector sequence and reduces the amount of data to be processed. For example, the first pointwise convolutional sub-model 302 can convert the frequency domain vector sequence into a 256-dimensional first feature vector. The second pointwise convolutional sub-model 304 converts the dimension of the second feature vector back to the dimension of the frequency domain vector sequence, which makes it straightforward to calculate the vector distance and detect abnormal sounds based on it.
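A 1×1 convolution applies the same linear map to every frame independently, so it can be sketched as a matrix product. In the sketch below the input dimension of 257, the random weights, and the zero biases are assumptions, and the autoencoder sub-model 303 that sits between the two pointwise sub-models is left out so that only the dimension changes are shown.

```python
import numpy as np

def pointwise_conv(seq, weights, bias):
    """1x1 convolution over a (frames, channels_in) sequence.

    weights has shape (channels_in, n_kernels); each column is one kernel.
    """
    return seq @ weights + bias

rng = np.random.default_rng(0)
frames, dim, m = 61, 257, 256  # m < dim for the first sub-model
freq_seq = rng.standard_normal((frames, dim))

# Hypothetical, untrained parameters used only to show the shapes.
w1, b1 = rng.standard_normal((dim, m)) * 0.01, np.zeros(m)
w2, b2 = rng.standard_normal((m, dim)) * 0.01, np.zeros(dim)

first_feature = pointwise_conv(freq_seq, w1, b1)    # M = 256 kernels
target_like = pointwise_conv(first_feature, w2, b2) # N = dim kernels

# Each frame is processed on its own: transforming one frame alone
# gives the same result as transforming it inside the full sequence.
single = pointwise_conv(freq_seq[:1], w1, b1)
```

Choosing N equal to the input dimension is what allows the output to be compared with the frequency domain vector sequence element by element.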

[0070] In an exemplary embodiment of this application, the autoencoder sub-model 303 includes symmetrically arranged encoding units 303-1 and decoding units 303-2. The number of encoding units 303-1 and decoding units 303-2 can be one or more, and the number of encoding units is the same as the number of decoding units. Since the autoencoder model uses input information as a learning target and performs representation learning on the input information, the internal structure of the encoding units 303-1 and decoding units 303-2 is symmetrical. That is, if a convolutional layer is provided in the encoding unit 303-1, then a corresponding deconvolutional layer must be provided in the decoding unit 303-2, and so on.

[0071] Next, the structure of the autoencoder sub-model is explained, taking as an example an autoencoder sub-model 303 that includes one encoding unit and one decoding unit.

[0072] Figure 4 shows a schematic diagram of the structure of the autoencoder sub-model 303. As shown in Figure 4, the encoding unit 303-1 includes a first normalization layer 401, a first convolutional layer 402 and a second convolutional layer 403 connected to the first normalization layer 401, a first sigmoid activation layer 404 connected to the first convolutional layer 402, and a first weighting layer 405 connected to the second convolutional layer 403 and the first sigmoid activation layer 404. The decoding unit 303-2 includes a second normalization layer 406, a first deconvolutional layer 407 and a second deconvolutional layer 408 connected to the second normalization layer 406, a second sigmoid activation layer 409 connected to the first deconvolutional layer 407, and a second weighting layer 410 connected to the second deconvolutional layer 408 and the second sigmoid activation layer 409. The parameters of the first normalization layer 401 and the second normalization layer 406 are the same, and the parameters of the first convolutional layer 402, the second convolutional layer 403, the first deconvolutional layer 407, and the second deconvolutional layer 408 are the same; all of these are one-dimensional convolutional layers or one-dimensional deconvolutional layers.

[0073] After receiving the first feature vector, the encoding unit 303-1 can perform dual-channel feature extraction to obtain a first weighted feature vector. Specifically, the first feature vector is normalized by a first normalization layer 401 to obtain a third feature vector; then, the third feature vector is convolved by a first convolutional layer 402, and the extracted features are processed by a first sigmoid activation layer 404 to obtain a weight vector. Simultaneously, the third feature vector is convolved by a second convolutional layer 403 to obtain a fourth feature vector; finally, the fourth feature vector is weighted by a first weighting layer 405 according to the weight vector to obtain the first weighted feature vector.

[0074] Here, the first convolutional layer 402 and the first sigmoid activation layer 404 form the first channel, and the second convolutional layer 403 forms the second channel. For each element of the convolved third feature vector, the first channel outputs a value in the range (0,1), which can be regarded as a weight value. Since the first convolutional layer 402 and the second convolutional layer 403 have the same structure, the convolved third feature vector in the first channel has the same dimensions as the fourth feature vector. By weighting the fourth feature vector according to the weight vector through the first weighting layer 405, information can be extracted from each position of the fourth feature vector with the corresponding weight. Since the value produced by the sigmoid function lies in the range (0,1), whereas the ReLU activation function effectively applies coefficients of only 0 or 1, dual-channel feature extraction can retain more audio information. At the same time, since the weight values corresponding to the individual sub-feature vectors are not all the same, important information in the audio information to be processed can be distinguished from unimportant information and retained to different degrees, laying the foundation for accurate detection of abnormal sounds.

[0075] In an exemplary embodiment of this application, the first weighting layer 405 is a Hadamard layer, which obtains the first weighted feature vector by performing a Hadamard product operation on the weight vector and the fourth feature vector. The Hadamard product operation multiplies the weight value and the feature value located at the same coordinate in the weight vector and the fourth feature vector to form a weighted value for that coordinate; the first weighted feature vector is then constructed from the weighted values of all coordinates.
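The dual-channel processing of the encoding unit 303-1 described in paragraphs [0073] to [0075] can be illustrated with a minimal NumPy sketch. The following are assumptions made for illustration only and are not stated in this application: the normalization is a layer normalization over the channel axis, the one-dimensional convolution uses stride 1 without padding, the kernel layout is (kernel size, input channels, output channels), and all function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumed normalization: zero mean, unit variance over the channel axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1d(x, kernel, bias):
    # x: (T, C_in); kernel: (K, C_in, C_out); stride 1, no padding (assumed).
    k = kernel.shape[0]
    t_out = x.shape[0] - k + 1
    out = np.empty((t_out, kernel.shape[2]))
    for t in range(t_out):
        out[t] = np.tensordot(x[t:t + k], kernel, axes=([0, 1], [0, 1])) + bias
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoding_unit(first_feature, k_gate, b_gate, k_feat, b_feat):
    third = layer_norm(first_feature)                # normalization layer 401
    weight = sigmoid(conv1d(third, k_gate, b_gate))  # layers 402 and 404
    fourth = conv1d(third, k_feat, b_feat)           # layer 403
    return weight * fourth                           # Hadamard layer 405
```

Because every weight lies in (0,1), each element of the output is a scaled-down copy of the corresponding element of the fourth feature vector rather than being either kept or zeroed.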

[0076] In an exemplary embodiment of this application, the first weighted feature vector is input to the decoding unit 303-2 by the first weighting layer 405. The decoding unit 303-2 performs dual-channel feature extraction on the first weighted feature vector to obtain the second weighted feature vector, which is the second feature vector output by the autoencoder sub-model 303.

[0077] The processing flow of the first weighted feature vector by the decoding unit 303-2 is similar to that of the first feature vector by the encoding unit 303-1. Specifically, after receiving the first weighted feature vector, the second normalization layer 406 normalizes the first weighted feature vector to obtain the fifth feature vector. Then, the fifth feature vector is deconvolved by the first deconvolution layer 407, and the features extracted by the deconvolution are processed by the second sigmoid activation layer 409 to obtain the weight vector. At the same time, the fifth feature vector is deconvolved by the second deconvolution layer 408 to obtain the sixth feature vector. Finally, the sixth feature vector is weighted by the second weighting layer 410 according to the weight vector to obtain the second weighted feature vector.

[0078] The logic of obtaining the second weighted feature vector through dual-channel feature extraction in the decoding unit 303-2 is the same as the logic of obtaining the first weighted feature vector through dual-channel feature extraction in the encoding unit 303-1. The only difference is that the first deconvolution layer 407 and the second deconvolution layer 408 perform deconvolution processing on the fifth feature vector, which is the inverse of the convolution processing performed on the third feature vector by the first convolution layer 402 and the second convolution layer 403. Therefore, the process of obtaining the second weighted feature vector will not be described again here.

[0079] In an exemplary embodiment of this application, the second weighting layer 410 is the same as the first weighting layer 405; both are Hadamard layers. The second weighting layer 410 multiplies the weight value and the feature value located at the same coordinate in the weight vector and the sixth feature vector to obtain the weighted values that constitute the second weighted feature vector. It is worth noting that since the first feature vector output by the first pointwise convolutional sub-model 301 is a one-dimensional vector, the first convolutional layer 402 and the second convolutional layer 403 in this embodiment are both one-dimensional convolutional layers, and the first deconvolutional layer 407 and the second deconvolutional layer 408 are both one-dimensional deconvolutional layers.
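The decoding unit 303-2 can be sketched in the same way, with the convolutions replaced by one-dimensional deconvolutions (transposed convolutions). As before, this is only an illustrative sketch: layer normalization, stride 1, the kernel layout (kernel size, input channels, output channels), and all names are assumptions not taken from this application.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Assumed normalization over the channel axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deconv1d(x, kernel, bias):
    # x: (T, C_in); kernel: (K, C_in, C_out); stride-1 transposed convolution.
    k, _, c_out = kernel.shape
    out = np.zeros((x.shape[0] + k - 1, c_out))
    for t in range(x.shape[0]):
        # Each input step scatters a K-step patch into the output.
        out[t:t + k] += np.einsum("i,kio->ko", x[t], kernel)
    return out + bias

def decoding_unit(first_weighted, k_gate, b_gate, k_feat, b_feat):
    fifth = layer_norm(first_weighted)                 # normalization layer 406
    weight = sigmoid(deconv1d(fifth, k_gate, b_gate))  # layers 407 and 409
    sixth = deconv1d(fifth, k_feat, b_feat)            # layer 408
    return weight * sixth                              # Hadamard layer 410
```

With stride 1 and kernel size K, the transposed convolution lengthens a sequence by K - 1 steps, which is exactly the length removed by an unpadded convolution with the same kernel size.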

[0080] In exemplary embodiments of this application, there can be multiple encoding units and decoding units. For example, six encoding units and six decoding units can be set. Of course, other numbers of encoding units and decoding units can also be set, and this application embodiment does not specifically limit this. In this application embodiment, by setting multiple encoding units and multiple decoding units to encode and decode the first feature vector, the accuracy of the second feature vector can be improved, the information in the audio information to be processed can be fully obtained, and thus the accuracy of abnormal sound detection can be improved.

[0081] After generating the second feature vector, the decoding unit 303-2 inputs the second feature vector into the second pointwise convolutional sub-model 304, and performs feature extraction on the second feature vector through the second pointwise convolutional sub-model 304 to obtain the target vector sequence. The target vector sequence is a vector sequence with the same dimension as the frequency domain vector sequence generated after feature extraction of the audio information to be processed.

[0082] In step S230, a vector distance is determined based on the target vector sequence and the frequency domain vector sequence, and whether the audio information to be processed is abnormal audio information is determined based on the vector distance.

[0083] In an exemplary embodiment of this application, after obtaining the target vector sequence, in order to determine whether the audio information to be processed is abnormal audio information, the vector distance between the target vector sequence and the frequency domain vector sequence can be obtained, and the determination can be made based on the vector distance. In the embodiments of this application, the vector distance can specifically be the L2 distance, which is also known as the Euclidean distance, and its calculation formula is shown in equation (1):

[0084] $$ d = \sqrt{\sum_{i=1}^{N} \left( x_i - y_i \right)^2} \qquad (1) $$

[0085] Where x_i is the i-th element in the target vector sequence, y_i is the i-th element in the frequency domain vector sequence, and N is the maximum length of the target vector sequence and the frequency domain vector sequence.
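As a small illustration, the L2 distance of equation (1) can be computed as follows. The sketch assumes that both sequences have already been flattened into equal-length lists of numbers; the function name is hypothetical.

```python
import math

def l2_distance(target, freq):
    # Euclidean distance between two equal-length sequences of numbers.
    if len(target) != len(freq):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(target, freq)))
```

For example, `l2_distance([3.0, 0.0], [0.0, 4.0])` evaluates to 5.0.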

[0086] It is worth noting that the vector distance in the embodiments of this application can also be other types of distance, and the embodiments of this application do not specifically limit this.

[0087] In an exemplary embodiment of this application, after obtaining the vector distance, the vector distance can be compared with a distance threshold. When the vector distance is less than or equal to the distance threshold, the audio information to be processed is determined to be normal audio information; when the vector distance is greater than the distance threshold, the audio information to be processed is determined to be abnormal audio information. The distance threshold is determined based on the distance of normal audio information. Specifically, the frequency domain vector sequence corresponding to the normal audio information can be input into a sound model, and the sound model outputs a target vector sequence corresponding to the normal audio information. The distance threshold can then be determined based on the target vector sequence and the frequency domain vector sequence corresponding to the normal audio information. Furthermore, a distance threshold range can be determined based on the obtained distance threshold. When the vector distance is within the distance threshold range, the audio information to be processed is determined to be normal audio information; when the vector distance is outside the distance threshold range, the audio information to be processed is determined to be abnormal audio information.
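The comparison against the distance threshold can be sketched as follows. This application does not specify how the threshold is derived from the distances of normal audio information; the rule used here (mean plus k standard deviations of the normal distances) is purely a hypothetical example, as are the function names and the default value of k.

```python
import statistics

def fit_threshold(normal_distances, k=3.0):
    # Hypothetical rule: mean + k * standard deviation of the distances
    # measured on normal audio information.
    mean = statistics.fmean(normal_distances)
    std = statistics.pstdev(normal_distances)
    return mean + k * std

def is_abnormal(distance, threshold):
    # Normal when distance <= threshold, abnormal when distance > threshold.
    return distance > threshold
```

Other rules, such as taking the maximum distance observed on normal audio information, would fit the same comparison step.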

[0088] In an exemplary embodiment of this application, before the frequency domain vector sequence corresponding to the audio information to be processed is input into the sound model, a sound model to be trained needs to be trained to obtain a sound model with stable performance. During training, audio information samples are first obtained and preprocessed to obtain their corresponding frequency domain vector sequences. The frequency domain vector sequences are then input into the sound model to be trained, which performs feature extraction on them to obtain a prediction vector sequence. Finally, a loss function is constructed based on the frequency domain vector sequence corresponding to the audio information samples and the prediction vector sequence, and the parameters of the sound model to be trained are optimized according to the loss function to obtain the sound model.

[0089] In an exemplary embodiment of this application, the architecture of the sound model to be trained is substantially the same as that of the sound model: it includes a first pointwise convolutional sub-model to be trained, an autoencoder sub-model to be trained, and a second pointwise convolutional sub-model to be trained. The autoencoder sub-model to be trained includes one or more encoding units to be trained and the same number of decoding units to be trained. During training, a Dropout layer can be added to the sound model to be trained. By randomly removing some neurons, the Dropout layer prevents overfitting during model training and thereby improves the robustness of the model. One or more Dropout layers can be set in the sound model to be trained. For example, Dropout layers can be set at one or more of the following three locations: between the first pointwise convolutional sub-model to be trained and the autoencoder sub-model to be trained; after the weighted layers to be trained contained in the encoding units and decoding units to be trained; and after the second pointwise convolutional sub-model to be trained.

[0090] Figure 5 is a schematic diagram of the structure of the sound model to be trained. As shown in Figure 5, the sound model to be trained includes an input layer 501, a first pointwise convolutional sub-model to be trained 502, a first Dropout layer 503, an autoencoder sub-model to be trained 504, a second pointwise convolutional sub-model to be trained 505, a second Dropout layer 506, and an output layer 507.

[0091] Figure 6 is a schematic diagram of the structure of the autoencoder sub-model 504 to be trained. As shown in Figure 6, the autoencoder sub-model 504 to be trained includes an encoding unit 601 to be trained and a decoding unit 602 to be trained. The encoding unit 601 to be trained includes a first training layer normalization layer 603, a first training convolutional layer 604 and a second training convolutional layer 605 connected to the first training layer normalization layer 603, a first training sigmoid activation layer 606 connected to the first training convolutional layer 604, a first training weighted layer 607 connected to the second training convolutional layer 605 and the first training sigmoid activation layer 606, and a third Dropout layer 608 connected to the first training weighted layer 607. The decoding unit 602 to be trained includes a second training layer normalization layer 609, a first training deconvolution layer 610 and a second training deconvolution layer 611 connected to the second training layer normalization layer 609, a second training sigmoid activation layer 612 connected to the first training deconvolution layer 610, a second training weighted layer 613 connected to the second training deconvolution layer 611 and the second training sigmoid activation layer 612, and a fourth Dropout layer 614 connected to the second training weighted layer 613. It is worth noting that, in addition to the single encoding unit to be trained and the single decoding unit to be trained shown in Figure 6, the autoencoder sub-model to be trained can also consist of multiple encoding units and multiple decoding units to be trained, as long as the number of encoding units to be trained equals the number of decoding units to be trained.

[0092] The data processing flow of each sub-model in the sound model to be trained is the same as that of each sub-model in the sound model shown in Figure 3. The only difference is that features in the feature vector output by the first pointwise convolutional sub-model to be trained 502 are randomly removed by the first Dropout layer 503; features in the weighted feature vector output by the first training weighted layer 607 are randomly removed by the third Dropout layer 608; features in the weighted feature vector output by the second training weighted layer 613 are randomly removed by the fourth Dropout layer 614; and features in the feature vector output by the second pointwise convolutional sub-model to be trained 505 are randomly removed by the second Dropout layer 506. Randomly removing features through Dropout layers at different stages of the model during training prevents overfitting of the trained model.
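The random removal performed by a Dropout layer can be illustrated with the following sketch. The rescaling of the surviving features by 1 / (1 - rate), often called inverted dropout, is a common convention and an assumption here; this application only states that features are randomly removed. The function name and parameters are hypothetical.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    # At inference time (or with rate 0) the input passes through unchanged.
    if not training or rate == 0.0:
        return x
    # Keep each feature with probability 1 - rate and zero the others.
    keep = rng.random(x.shape) >= rate
    # Assumed convention: rescale survivors so the expected value is unchanged.
    return x * keep / (1.0 - rate)
```

In the structure of Figures 5 and 6, such a function would be applied to the outputs of sub-models 502 and 505 and of the weighted layers 607 and 613 during training only.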

[0093] In an exemplary embodiment of this application, the L2 distance determined based on the frequency domain vector sequence corresponding to the audio information sample and the prediction vector sequence can be used as the loss function. The optimal model parameter values are determined by minimizing the loss function, and a stable sound model is then generated based on the determined optimal model parameter values. In embodiments of this application, other types of distance can also be used to construct the loss function; this application does not specifically limit the type of distance used.
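The training procedure (predict, measure a reconstruction loss, update the parameters) can be sketched with a deliberately simplified stand-in model. The sketch below is not the sound model of this application: it uses a two-layer linear autoencoder instead of the pointwise convolutional and dual-channel sub-models, the squared L2 distance averaged over samples instead of the L2 distance itself, and plain gradient descent. All names and hyperparameter values are hypothetical.

```python
import numpy as np

def train_linear_autoencoder(samples, hidden, steps=200, lr=0.01, seed=0):
    # samples: (n, dim) array of frequency domain vectors from normal audio.
    rng = np.random.default_rng(seed)
    n, dim = samples.shape
    w_enc = rng.normal(scale=0.1, size=(dim, hidden))
    w_dec = rng.normal(scale=0.1, size=(hidden, dim))
    losses = []
    for _ in range(steps):
        code = samples @ w_enc
        pred = code @ w_dec                  # prediction vector sequence
        residual = pred - samples
        # Assumed loss: squared L2 distance, averaged over the samples.
        losses.append(float(np.sum(residual ** 2) / n))
        grad_pred = 2.0 * residual / n
        grad_dec = code.T @ grad_pred
        grad_enc = samples.T @ (grad_pred @ w_dec.T)
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec, losses
```

Only normal samples are needed, because the input itself serves as the learning target; the recorded losses should decrease as the reconstruction improves.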

[0094] The abnormal sound detection method in this application can be applied to various fields involving abnormal sound detection, such as public security, industry, national defense, military, security, medical care, etc. For example, in the industrial field, abnormal sound detection can be used to determine whether a machine is malfunctioning; in the security field, abnormal sound detection can be used to determine whether there is illegal intrusion; in the medical field, abnormal sound detection can be used to determine whether an organ is diseased, etc.

[0095] Taking the determination of whether a machine is faulty as an example, the sound of the machine running can be collected, and the audio information corresponding to the collected sound can be preprocessed to obtain the corresponding frequency domain vector sequence. Then, the frequency domain vector sequence can be input into the sound model for dual-channel feature extraction to obtain a target vector sequence with the same dimension as the frequency domain vector sequence. Then, the vector distance is determined based on the target vector sequence and the frequency domain vector sequence. Finally, the vector distance is compared with the distance threshold determined based on the sound collected when the machine is running normally, so as to determine whether the machine is faulty based on the comparison result.

[0096] In an exemplary embodiment of this application, a 1×1 pointwise convolutional neural network (pointwise convolutional sub-model) is used as a feature extractor after using a short-time Fourier transform. The network learns how to extract features through continuous iteration, enabling it to adapt to audio samples of different frequencies and thus retain more audio information. Therefore, after training the sound model to be trained is completed, the convolutional neural network feature extractor can be used for other tasks. The convolutional neural network feature extractor can accurately extract features of different types of sounds, thereby improving the execution accuracy of other tasks.

[0097] In an exemplary embodiment of this application, the preprocessing of the audio information to be processed may alternatively consist only of cleaning the audio information to be processed to remove interference information. In that case, a time-frequency domain transformation layer is added to the sound model. The time-frequency domain transformation layer performs a short-time Fourier transform on the received preprocessed audio information to transform it from the time domain to the frequency domain and obtain the corresponding frequency domain vector sequence. The frequency domain vector sequence is then processed by the other sub-models in the sound model. The processing flow of the frequency domain vector sequence is the same as that in the above embodiment and will not be repeated here.
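The short-time Fourier transform that produces the frequency domain vector sequence can be sketched as follows. The frame length, hop size, Hann window, and the use of magnitudes (discarding phase) are assumptions chosen for illustration; this application does not specify them. The function name is hypothetical.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    # Returns one magnitude spectrum per frame: shape (frames, frame_len // 2 + 1).
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))
```

Each row of the result is one frequency domain vector, so the whole array plays the role of the frequency domain vector sequence fed to the remaining sub-models.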

[0098] In the abnormal sound detection method of this application embodiment, the frequency domain vector sequence obtained by preprocessing the audio information to be processed is input into the sound model. The sound model performs dual-channel feature extraction on the frequency domain vector sequence to obtain a target vector sequence with the same dimension as the frequency domain vector sequence. Then, the vector distance is determined based on the target vector sequence and the frequency domain vector sequence, and whether the audio information to be processed is abnormal audio information is determined based on the vector distance. The abnormal sound detection in this application embodiment is based on the sound model, which includes a pointwise convolutional sub-model and an autoencoder sub-model. Since the convolutional feature extraction method can take into account both audio frequency domain features and self-learned acoustic features, the sound model in this application can solve the problem of information loss that may occur in traditional audio feature extraction methods at non-human speech frequencies. This allows the abnormal sound detection method to be applied to a wider range of fields, such as identifying audio at frequencies outside the range of human hearing and extracting important information from it. In addition, since the autoencoder sub-model is an unsupervised training model, it does not require a large number of normal audio information samples and abnormal audio information samples to be prepared in advance during training, which avoids the problem that the model cannot be trained because abnormal audio information samples are insufficient. Furthermore, the sound model employs dual-channel feature extraction: one channel uses the sigmoid activation function, while the other does not. The channel using the sigmoid activation function effectively calculates a weight for each position, and the vector value at each position is discounted by that weight. Compared with the ReLU activation function, whose coefficients can only be 0 or 1, this method of calculating positional weights is more flexible and allows more information to be retained in the sound model's computation. It determines whether and how much information should be retained at each position, avoiding the information loss caused by the indiscriminate truncation of ReLU. This improves the sound model's ability to capture audio details and ultimately enhances the accuracy of abnormal sound detection.

[0099] This application also provides an abnormal sound detection device. Figure 7 is a schematic diagram of the abnormal sound detection device. As shown in Figure 7, the abnormal sound detection device 700 may include a preprocessing module 701, a model processing module 702, and an anomaly detection module 703. Wherein:

[0100] Preprocessing module 701 is used to preprocess the audio information to be processed in order to obtain a frequency domain vector sequence corresponding to the audio information to be processed;

[0101] Model processing module 702 is used to input the frequency domain vector sequence into the sound model, and perform dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same size as the frequency domain vector sequence;

[0102] The anomaly detection module 703 is used to determine the vector distance based on the target vector sequence and the frequency domain vector sequence, and to determine whether the audio information to be processed is abnormal audio information based on the vector distance.

[0103] In one embodiment of this application, the preprocessing module 701 is configured as follows:

[0104] The audio information to be processed is cleaned, and a short-time Fourier transform is performed on the cleaned audio information to obtain the frequency domain vector sequence.

[0105] In one embodiment of this application, the sound model includes a first pointwise convolutional sub-model, an autoencoder sub-model, and a second pointwise convolutional sub-model; the model processing module 702 includes:

[0106] The first processing unit is configured to extract features from the frequency domain vector sequence using the first pointwise convolutional sub-model to obtain a first feature vector.

[0107] The second processing unit is used to perform dual-channel encoding and dual-channel decoding on the first feature vector through the autoencoder sub-model to obtain the second feature vector.

[0108] The third processing unit is used to extract features from the second feature vector using the second pointwise convolutional sub-model to obtain the target vector sequence.

[0109] In one embodiment of this application, the first pointwise convolutional sub-model contains M convolutional kernels, and the second pointwise convolutional sub-model contains N convolutional kernels, wherein M is a positive integer less than the dimension of the frequency domain vector sequence, and N is equal to the dimension of the frequency domain vector sequence.
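The role of M and N can be illustrated by writing a pointwise (1×1) convolution as one linear map applied independently to every vector in the sequence. The sketch assumes the sequence is stored as an array of shape (frames, channels); the function name and the example sizes in the usage note are hypothetical.

```python
import numpy as np

def pointwise_conv(seq, kernels, bias):
    # seq: (T, C_in); kernels: (C_in, C_out), one column per convolutional kernel.
    # A 1x1 convolution mixes channels within a frame and never across frames.
    return seq @ kernels + bias
```

For instance, with a frequency domain dimension of 129, choosing M = 32 kernels of shape (129,) for the first sub-model compresses each frame to 32 values, and N = 129 kernels of shape (32,) for the second sub-model restore the original dimension, so that the target vector sequence can be compared element by element with the frequency domain vector sequence.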

[0110] In one embodiment of this application, the autoencoder sub-model includes symmetrically arranged encoding and decoding units; the second processing unit includes:

[0111] The first encoding unit is used to perform dual-channel feature extraction on the first feature vector to obtain a first weighted feature vector.

[0112] The first decoding unit is used to perform dual-channel feature extraction on the first weighted feature vector to obtain a second weighted feature vector, and to use the second weighted feature vector as the second feature vector.

[0113] In an exemplary embodiment of this application, the encoding unit includes a first normalization layer, a first convolutional layer and a second convolutional layer connected to the first normalization layer, a first sigmoid activation layer connected to the first convolutional layer, and a first weighted layer connected to the second convolutional layer and the first sigmoid activation layer; the first encoding unit is configured as follows:

[0114] The first feature vector is normalized by the first normalization layer to obtain the third feature vector;

[0115] The third feature vector is processed by convolution through the first convolutional layer, and the features extracted by convolution are processed by the first sigmoid activation layer to obtain a weight vector.

[0116] The third feature vector is convolved by the second convolutional layer to obtain the fourth feature vector.

[0117] The first weighted feature vector is obtained by weighting the fourth feature vector according to the weight vector through the first weighted layer.

[0118] In an exemplary embodiment of this application, the first weighting layer is a Hadamard layer; the step of weighting the fourth feature vector according to the weight vector through the first weighting layer to obtain the first weighted feature vector is configured as follows:

[0119] The weight value and feature value corresponding to the same coordinate in the weight vector and the fourth feature vector are multiplied together to obtain the first weighted feature vector.

[0120] In an exemplary embodiment of this application, the decoding unit includes a second normalization layer, a first deconvolution layer and a second deconvolution layer connected to the second normalization layer, a second sigmoid activation layer connected to the first deconvolution layer, and a second weighting layer connected to the second deconvolution layer and the second sigmoid activation layer; the first decoding unit is configured as follows:

[0121] The first weighted feature vector is normalized by the second normalization layer to obtain the fifth feature vector;

[0122] The fifth feature vector is deconvolved by the first deconvolution layer, and the features extracted by the deconvolution are processed by the second sigmoid activation layer to obtain a weight vector.

[0123] The fifth feature vector is deconvolved by the second deconvolution layer to obtain the sixth feature vector.

[0124] The second weighting layer performs weighting processing on the sixth feature vector according to the weight vector to obtain the second weighted feature vector.

[0125] In an exemplary embodiment of this application, the second weighting layer is a Hadamard layer; the step of weighting the sixth feature vector according to the weight vector through the second weighting layer to obtain the second weighted feature vector is configured as follows:

[0126] The weight value and feature value corresponding to the same coordinate in the weight vector and the sixth feature vector are multiplied together to obtain the second weighted feature vector.

[0127] In an exemplary embodiment of this application, the anomaly detection module 703 is configured as follows:

[0128] Compare the vector distance with a distance threshold;

[0129] When the vector distance is less than or equal to the distance threshold, the audio information to be processed is determined to be normal audio information;

[0130] When the vector distance is greater than the distance threshold, the audio information to be processed is determined to be abnormal audio information.

[0131] In an exemplary embodiment of this application, the vector distance is an L2 distance, and the distance threshold is an L2 distance determined based on normal audio information.

[0132] In an exemplary embodiment of this application, the abnormal sound detection device 700 further includes:

[0133] The prediction module is used to obtain audio information samples before inputting the frequency domain vector sequence into the sound model, input the frequency domain vector sequence corresponding to the audio information samples into the sound model to be trained, and extract features from the frequency domain vector sequence through the sound model to be trained to obtain a prediction vector sequence.

[0134] The optimization module is used to construct a loss function based on the frequency domain vector sequence corresponding to the audio information sample and the prediction vector sequence, and to optimize the parameters of the sound model to be trained based on the loss function to obtain the sound model.

[0135] In an exemplary embodiment of this application, the optimization module is configured as follows:

[0136] The L2 distance is determined as the loss function based on the frequency domain vector sequence corresponding to the audio information sample and the prediction vector sequence.

[0137] In an exemplary embodiment of this application, the sound model to be trained includes a first pointwise convolutional sub-model to be trained, an autoencoder sub-model to be trained, and a second pointwise convolutional sub-model to be trained; the autoencoder sub-model to be trained includes an encoding unit to be trained and a decoding unit to be trained.

[0138] In an exemplary embodiment of this application, the abnormal sound detection device 700 is configured as follows:

[0139] A Dropout layer is placed between the first pointwise convolutional sub-model to be trained and the autoencoder sub-model to be trained; and / or

[0140] A Dropout layer is placed after the weighted layers to be trained contained in the encoding unit and the decoding unit; and / or

[0141] A Dropout layer is set after the second pointwise convolutional sub-model to be trained.

[0142] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0143] Furthermore, although the steps of the method in this application are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and / or a step may be broken down into multiple steps.

[0144] Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, mobile terminal, or network device, etc.) to execute the method according to the embodiments of this application.

[0145] Figure 8 is a schematic diagram of a computer system architecture for implementing an electronic device according to embodiments of the present application. The electronic device may be located in a terminal device or a server.

[0146] It should be noted that the computer system 800 of the electronic device shown in Figure 8 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0147] As shown in Figure 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 802 or programs loaded from storage section 808 into random access memory (RAM) 803. The RAM 803 also stores various programs and data required for system operation. The CPU 801, ROM 802, and RAM 803 are interconnected via a bus 804. An input / output interface 805 (I / O interface) is also connected to the bus 804.

[0148] In some embodiments, the following components are connected to the input / output interface 805: an input section 806 including a keyboard, mouse, etc.; an output section 807 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 808 including a hard disk, etc.; and a communication section 809 including a network interface card such as a local area network card, modem, etc. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the input / output interface 805 as needed. A removable medium 811, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 810 as needed so that computer programs read from it can be installed into the storage section 808 as needed.

[0149] Specifically, according to embodiments of this application, the processes described in the various method flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 809, and / or installed from removable medium 811. When the computer program is executed by central processing unit 801, it performs various functions defined in the system of this application.

[0150] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.

[0151] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0152] It should be noted that although several modules or units of the device for performing actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0153] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, and includes several instructions to cause an electronic device to execute the method according to the embodiments of this application.

[0154] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A method for detecting abnormal sounds, characterized in that, the method includes: The audio information to be processed is preprocessed to obtain a frequency domain vector sequence corresponding to the audio information to be processed; The frequency domain vector sequence is input into the sound model, and the sound model performs dual-channel feature extraction on the frequency domain vector sequence to obtain a target vector sequence with the same dimension as the frequency domain vector sequence; The vector distance is determined based on the target vector sequence and the frequency domain vector sequence, and whether the audio information to be processed is abnormal audio information is determined based on the vector distance; The sound model includes a first pointwise convolutional sub-model, an autoencoder sub-model, and a second pointwise convolutional sub-model; The step of inputting the frequency domain vector sequence into a sound model and performing dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence includes: The first feature vector is obtained by extracting features from the frequency domain vector sequence using the first pointwise convolutional sub-model. The first feature vector is processed by dual-channel encoding and dual-channel decoding using the autoencoder sub-model to obtain the second feature vector. Features are extracted from the second feature vector using the second pointwise convolutional sub-model to obtain the target vector sequence.

2. The method according to claim 1, characterized in that, The preprocessing of the audio information to be processed to obtain a frequency domain vector sequence corresponding to the audio information to be processed includes: The audio information to be processed is cleaned, and a short-time Fourier transform is performed on the cleaned audio information to obtain the frequency domain vector sequence.
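
By way of non-limiting illustration, the preprocessing of claim 2 might be sketched as follows. The claim does not specify the cleaning operation, the window, or the frame sizes, so the DC-offset removal in `clean`, the Hann window, and the values `frame_len=8` and `hop=4` are all assumptions chosen for the example, as are the function names.

```python
import cmath
import math


def clean(samples):
    """Hypothetical cleaning step: remove the DC offset from the signal."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def stft_magnitudes(samples, frame_len=8, hop=4):
    """Return one magnitude spectrum (frequency domain vector) per frame."""
    # Hann window to reduce spectral leakage (an assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        # A real signal only needs the bins 0 .. frame_len / 2.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# A test tone that completes two cycles per 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
sequence = stft_magnitudes(clean(signal))
```

Each entry of `sequence` is one frequency domain vector; for this test tone the largest magnitude falls in bin 2 of every frame.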

3. The method according to claim 1, characterized in that, The first pointwise convolutional sub-model contains M convolutional kernels, and the second pointwise convolutional sub-model contains N convolutional kernels, where M is a positive integer less than the total number of frequency domain vectors contained in the frequency domain vector sequence, and N is equal to the total number of frequency domain vectors contained in the frequency domain vector sequence.
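
As a rough sketch of the kernel counts in claim 3, the example below treats each frequency domain vector as one input channel of a pointwise (1x1) convolution. That reading, the function name `pointwise_conv`, and all sizes and weights are assumptions for illustration only.

```python
def pointwise_conv(channels, kernels):
    """Apply a 1x1 convolution: mix channels independently at each position.

    channels: list of C input channels, each a list of length L.
    kernels:  list of K kernels, each a list of C weights.
    Returns K output channels of length L.
    """
    length = len(channels[0])
    return [
        [sum(w * ch[pos] for w, ch in zip(kernel, channels))
         for pos in range(length)]
        for kernel in kernels
    ]


# Hypothetical sizes: T = 4 frequency domain vectors, each with 3 bins.
sequence = [[1.0, 2.0, 3.0],
            [0.0, 1.0, 0.0],
            [2.0, 2.0, 2.0],
            [1.0, 0.0, 1.0]]

# M = 2 kernels (M < T) compress the sequence to 2 channels.
first_kernels = [[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.5, 0.5, 0.0]]
compressed = pointwise_conv(sequence, first_kernels)

# N = 4 kernels (N == T) restore the original number of channels.
second_kernels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
restored = pointwise_conv(compressed, second_kernels)
```

Because N equals the number of input vectors, `restored` has the same shape as `sequence`, which is what allows a distance between the two to be computed later.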

4. The method according to claim 1, characterized in that, The autoencoder sub-model includes symmetrically arranged encoding and decoding units; The step of performing dual-channel encoding and dual-channel decoding on the first feature vector using the autoencoder sub-model to obtain the second feature vector includes: The encoding unit performs dual-channel feature extraction on the first feature vector to obtain a first weighted feature vector. The decoding unit performs dual-channel feature extraction on the first weighted feature vector to obtain a second weighted feature vector, and uses the second weighted feature vector as the second feature vector.

5. The method according to claim 4, characterized in that, The encoding unit includes a first normalization layer, a first convolutional layer and a second convolutional layer connected to the first normalization layer, a first sigmoid activation layer connected to the first convolutional layer, and a first weighted layer connected to the second convolutional layer and the first sigmoid activation layer. The step of performing dual-channel feature extraction on the first feature vector through the encoding unit to obtain the first weighted feature vector includes: The first feature vector is normalized by the first normalization layer to obtain the third feature vector; The third feature vector is processed by convolution through the first convolutional layer, and the features extracted by convolution are processed by the first sigmoid activation layer to obtain a weight vector. The third feature vector is convolved by the second convolutional layer to obtain the fourth feature vector. The first weighted feature vector is obtained by weighting the fourth feature vector according to the weight vector through the first weighted layer.

6. The method according to claim 5, characterized in that, The first weighted layer is a Hadamard layer; The step of weighting the fourth feature vector according to the weight vector through the first weighting layer to obtain the first weighted feature vector includes: The weight value and feature value corresponding to the same coordinate in the weight vector and the fourth feature vector are multiplied together to obtain the first weighted feature vector.
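
The encoding unit of claims 5 and 6 can be sketched, under stated assumptions, as a gated convolution: one branch yields sigmoid weights and the other yields features, and the Hadamard layer multiplies them coordinate by coordinate. The zero-mean, unit-variance form of the normalization, the single-channel 1-D convolution, the kernel sizes, and all names below are illustrative assumptions rather than details given in the claims.

```python
import math


def normalize(vec, eps=1e-5):
    """Zero-mean, unit-variance normalization (assumed form of the layer)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def conv1d(vec, kernel):
    """Valid 1-D convolution in cross-correlation form, stride 1."""
    k = len(kernel)
    return [sum(kernel[j] * vec[i + j] for j in range(k))
            for i in range(len(vec) - k + 1)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encoding_unit(first_feature, gate_kernel, value_kernel):
    # First normalization layer -> third feature vector.
    third = normalize(first_feature)
    # Channel 1: first convolutional layer, then sigmoid -> weight vector.
    weights = [sigmoid(x) for x in conv1d(third, gate_kernel)]
    # Channel 2: second convolutional layer -> fourth feature vector.
    fourth = conv1d(third, value_kernel)
    # Hadamard layer: multiply values at the same coordinate.
    return [w * f for w, f in zip(weights, fourth)]


encoded = encoding_unit([1.0, 2.0, 3.0, 4.0, 5.0],
                        gate_kernel=[0.0, 0.0],
                        value_kernel=[1.0, -1.0])
```

With an all-zero gate kernel every weight is 0.5, so the output is simply half of the second branch; trained gate weights would instead emphasize or suppress individual coordinates.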

7. The method according to claim 4, characterized in that, The decoding unit includes a second normalization layer, a first deconvolution layer and a second deconvolution layer connected to the second normalization layer, a second sigmoid activation layer connected to the first deconvolution layer, and a second weighting layer connected to the second deconvolution layer and the second sigmoid activation layer. The step of performing dual-channel feature extraction on the first weighted feature vector through the decoding unit to obtain the second weighted feature vector includes: The first weighted feature vector is normalized by the second normalization layer to obtain the fifth feature vector; The fifth feature vector is deconvolved by the first deconvolution layer, and the features extracted by the deconvolution are processed by the second sigmoid activation layer to obtain a weight vector. The fifth feature vector is deconvolved by the second deconvolution layer to obtain the sixth feature vector. The second weighting layer performs weighting processing on the sixth feature vector according to the weight vector to obtain the second weighted feature vector.

8. The method according to claim 7, characterized in that, The second weighted layer is a Hadamard layer; The step of weighting the sixth feature vector according to the weight vector through the second weighting layer to obtain the second weighted feature vector includes: The weight value and feature value corresponding to the same coordinate in the weight vector and the sixth feature vector are multiplied together to obtain the second weighted feature vector.
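
For the decoding unit of claims 7 and 8, a minimal sketch is given below, assuming that "deconvolution" means a stride-1 transposed 1-D convolution and that the normalization is zero-mean, unit-variance. The helper names and kernel values are hypothetical.

```python
import math


def normalize(vec, eps=1e-5):
    """Zero-mean, unit-variance normalization (assumed form of the layer)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def deconv1d(vec, kernel):
    """Transposed 1-D convolution, stride 1.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output length is len(vec) + len(kernel) - 1.
    """
    out = [0.0] * (len(vec) + len(kernel) - 1)
    for i, v in enumerate(vec):
        for j, w in enumerate(kernel):
            out[i + j] += v * w
    return out


def decoding_unit(first_weighted, gate_kernel, value_kernel):
    # Second normalization layer -> fifth feature vector.
    fifth = normalize(first_weighted)
    # Channel 1: first deconvolution layer, then sigmoid -> weight vector.
    weights = [sigmoid(x) for x in deconv1d(fifth, gate_kernel)]
    # Channel 2: second deconvolution layer -> sixth feature vector.
    sixth = deconv1d(fifth, value_kernel)
    # Hadamard layer: multiply values at the same coordinate.
    return [w * s for w, s in zip(weights, sixth)]


decoded = decoding_unit([1.0, 2.0, 3.0, 4.0],
                        gate_kernel=[0.0, 0.0],
                        value_kernel=[1.0, 1.0])
```

A size-2 kernel lengthens a 4-element input to 5 elements, undoing the shortening that a valid convolution with the same kernel size would cause on the encoding side.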

9. The method according to claim 1, characterized in that, The step of determining whether the audio information to be processed is abnormal audio information based on the vector distance includes: Compare the vector distance with a distance threshold; When the vector distance is less than or equal to the distance threshold, the audio information to be processed is determined to be normal audio information; When the vector distance is greater than the distance threshold, the audio information to be processed is determined to be abnormal audio information.

10. The method according to claim 9, characterized in that, The vector distance is the L2 distance, and the distance threshold is the L2 distance determined based on normal audio information.
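
The decision rule of claims 9 and 10 can be illustrated with the short sketch below. How the threshold is derived from normal audio is not detailed in the claims, so `threshold_from_normal`, which takes the largest distance observed on normal audio, is a hypothetical rule, and the numeric values are made up for the example.

```python
import math


def l2_distance(seq_a, seq_b):
    """L2 distance between two equally shaped vector sequences."""
    return math.sqrt(sum((a - b) ** 2
                         for va, vb in zip(seq_a, seq_b)
                         for a, b in zip(va, vb)))


def threshold_from_normal(distances):
    """Hypothetical rule: use the largest distance seen on normal audio."""
    return max(distances)


def is_abnormal(freq_seq, target_seq, threshold):
    # Greater than the threshold -> abnormal; less than or equal -> normal.
    return l2_distance(freq_seq, target_seq) > threshold


threshold = threshold_from_normal([0.3, 0.5, 0.4])
freq_seq = [[1.0, 2.0], [3.0, 4.0]]
close_target = [[1.0, 2.0], [3.0, 4.5]]   # distance 0.5
far_target = [[0.0, 0.0], [0.0, 0.0]]     # distance sqrt(30)

normal_case = is_abnormal(freq_seq, close_target, threshold)
abnormal_case = is_abnormal(freq_seq, far_target, threshold)
```

A distance exactly equal to the threshold is classified as normal, matching the "less than or equal to" wording of claim 9.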

11. The method according to claim 1, characterized in that, Before inputting the frequency domain vector sequence into the sound model, the method further includes: Acquire audio information samples, input the frequency domain vector sequence corresponding to the audio information samples into the sound model to be trained, and extract features from the frequency domain vector sequence through the sound model to be trained to obtain a prediction vector sequence. A loss function is constructed based on the frequency domain vector sequence corresponding to the audio information sample and the prediction vector sequence, and the parameters of the sound model to be trained are optimized based on the loss function to obtain the sound model.

12. The method according to claim 11, characterized in that, The step of constructing a loss function based on the frequency domain vector sequence corresponding to the audio information sample and the prediction vector sequence includes: The L2 distance, determined based on the frequency domain vector sequence corresponding to the audio information sample and the prediction vector sequence, is used as the loss function.
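
The training of claims 11 and 12 can be sketched with a deliberately tiny stand-in model. Here `toy_model`, which scales every value by a single parameter `w`, is a hypothetical placeholder for the sound model to be trained, and the finite-difference gradient, learning rate, and step count are assumptions made only to keep the example self-contained.

```python
import math


def l2_loss(inputs, predictions):
    """L2 distance between the input sequence and the prediction sequence."""
    return math.sqrt(sum((x - p) ** 2
                         for vx, vp in zip(inputs, predictions)
                         for x, p in zip(vx, vp)))


def toy_model(inputs, w):
    """Hypothetical stand-in for the sound model: scales every value by w."""
    return [[w * x for x in vec] for vec in inputs]


def train(samples, w=0.0, lr=0.01, steps=10, h=1e-6):
    for _ in range(steps):
        # Central finite difference approximates d(loss)/dw.
        up = l2_loss(samples, toy_model(samples, w + h))
        down = l2_loss(samples, toy_model(samples, w - h))
        w -= lr * (up - down) / (2 * h)
    return w


samples = [[1.0, 2.0], [3.0, 4.0]]
initial_loss = l2_loss(samples, toy_model(samples, 0.0))
trained_w = train(samples)
final_loss = l2_loss(samples, toy_model(samples, trained_w))
```

Because the loss compares the model output with its own input, only normal audio samples are needed; the parameter moves toward the value that reproduces the input, and the loss falls accordingly.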

13. The method according to claim 11, characterized in that, The sound model to be trained includes a first pointwise convolutional sub-model to be trained, an autoencoder sub-model to be trained, and a second pointwise convolutional sub-model to be trained. The autoencoder sub-model to be trained includes an encoding unit to be trained and a decoding unit to be trained.

14. The method according to claim 13, characterized in that, The method further includes: A Dropout layer is placed between the first pointwise convolutional sub-model to be trained and the autoencoder sub-model to be trained; and / or A Dropout layer is placed after the weighted layers to be trained contained in the encoding unit and the decoding unit; and / or A Dropout layer is set after the second pointwise convolutional sub-model to be trained.
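
What a Dropout layer does at each of the positions listed in claim 14 can be sketched as follows. The claim does not state the dropout rate or variant, so the commonly used "inverted" form (rescaling the kept values during training) and the rate of 0.5 are assumptions, and `dropout` is an illustrative name.

```python
import random


def dropout(vec, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate` in training.

    Kept values are divided by (1 - rate) so the expected sum is unchanged.
    At inference time the input passes through untouched. Requires rate < 1.
    """
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


rng = random.Random(0)
features = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(features, rate=0.5, training=True, rng=rng)
infer_out = dropout(features, rate=0.5, training=False, rng=rng)
```

Dropout is active only while the model is being trained, which is consistent with claim 14 placing the layers in the sub-models "to be trained".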

15. An abnormal sound detection device, characterized in that, the device includes: The preprocessing module is used to preprocess the audio information to be processed in order to obtain a frequency domain vector sequence corresponding to the audio information to be processed. The model processing module is used to input the frequency domain vector sequence into the sound model, and perform dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence. An anomaly detection module is used to determine the vector distance based on the target vector sequence and the frequency domain vector sequence, and to determine whether the audio information to be processed is abnormal audio information based on the vector distance. The sound model includes a first pointwise convolutional sub-model, an autoencoder sub-model, and a second pointwise convolutional sub-model; The step of inputting the frequency domain vector sequence into a sound model and performing dual-channel feature extraction on the frequency domain vector sequence through the sound model to obtain a target vector sequence with the same dimension as the frequency domain vector sequence includes: The first feature vector is obtained by extracting features from the frequency domain vector sequence using the first pointwise convolutional sub-model. The first feature vector is processed by dual-channel encoding and dual-channel decoding using the autoencoder sub-model to obtain the second feature vector. Features are extracted from the second feature vector using the second pointwise convolutional sub-model to obtain the target vector sequence.

16. A computer storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the abnormal sound detection method according to any one of claims 1 to 14.

17. An electronic device, characterized in that, the electronic device includes: a processor; and a memory for storing executable instructions of the processor; The processor is configured to execute the abnormal sound detection method according to any one of claims 1 to 14 by executing the executable instructions.