abnormal noise detection methods

By extracting the frequency and time domain energy features of the vehicle screen audio data, constructing a temporal feature tensor, and using a model for detection, the problem of inconsistent detection results and low efficiency caused by human auditory experience is solved, and efficient and accurate abnormal noise detection is achieved.

CN122306367A · Pending · Publication Date: 2026-06-30 · CHONGQING JINKANG NEW ENERGY VEHICLE CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHONGQING JINKANG NEW ENERGY VEHICLE CO LTD
Filing Date: 2026-04-07
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, the detection of abnormal noises in automotive screens relies on human auditory experience and subjective judgment, resulting in poor consistency of detection results and low efficiency, making it difficult to meet the needs of large-scale production lines.

Method used

By acquiring audio data from the vehicle screen, extracting multiple target audio frames, determining frequency domain features and time domain energy features, constructing a time-series feature tensor, and using a pre-trained abnormal noise recognition model for detection, the system can replace manual listening and identification.

Benefits of technology

It improves the accuracy and consistency of abnormal noise detection in automotive screens, and realizes automated and high-speed abnormal noise judgment, meeting the high-efficiency testing needs of production lines.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a method for detecting abnormal noises, belonging to the field of automotive parts quality inspection technology. The method includes: acquiring audio data corresponding to a vehicle screen and extracting multiple target audio frames from the audio data; determining the frequency domain features and time domain energy features corresponding to each target audio frame; determining a time-series feature tensor based on the frequency domain features and time domain energy features corresponding to each target audio frame; and detecting abnormal noises on the vehicle screen based on the time-series feature tensor. Because the time-series feature tensor is built from the frequency domain features and time domain energy features of each target audio frame and is then used to detect abnormal noises on the vehicle screen, the accuracy of abnormal noise detection on the vehicle screen can be improved. This solves the technical problem that existing technologies rely on the auditory experience and subjective judgment of inspectors, which leads to poor consistency of detection results and low detection efficiency.

Description

Technical Field

[0001] This application relates to the field of automotive parts quality inspection technology, and in particular to a method for detecting abnormal noises.

Background Technology

[0002] With the development of automotive intelligence and connectivity, the functions of in-vehicle display systems are becoming increasingly rich. The screen is an important component of the rear-seat entertainment system, and its assembly quality directly affects the user experience. In the production and testing phase of automotive screens, abnormal noise detection is one of the key indicators for determining whether there are assembly defects or abnormal components.

[0003] Currently, the detection of abnormal noises in automotive screens primarily relies on manual listening. Inspectors place the component under test in a quiet environment and subjectively judge the presence of abnormal noise based on its sound performance during startup, operation, and shutdown. However, this method depends entirely on the inspector's auditory experience and subjective judgment, making it susceptible to environmental noise interference and individual differences, resulting in inconsistent test results. Furthermore, in large-scale production lines, manual listening is inefficient, fails to meet the demands of fast-paced production, and incurs high labor costs, thus hindering improvements in production efficiency.

Summary of the Invention

[0004] The purpose of this application is to provide a method for detecting abnormal noises, thereby solving the technical problem that existing technologies rely on the auditory experience and subjective judgment of the testing personnel, resulting in poor consistency of testing results and low testing efficiency. The specific technical solution is as follows: In a first aspect of this application, an abnormal noise detection method is provided, the method comprising: acquiring audio data corresponding to the vehicle screen, and extracting multiple target audio frames from the audio data; determining the frequency domain features and time domain energy features corresponding to each of the target audio frames; determining a temporal feature tensor based on the frequency domain features and time domain energy features corresponding to each target audio frame; and performing abnormal noise detection on the vehicle screen based on the temporal feature tensor.

[0005] In an optional implementation, determining the frequency domain features corresponding to each of the target audio frames includes: for any of the target audio frames, performing a short-time Fourier transform on the target audio frame to obtain a power spectrum; and transforming the power spectrum to obtain the frequency domain features corresponding to the target audio frame.

[0006] In an optional implementation, determining the time domain energy features corresponding to each of the target audio frames includes: for any one of the target audio frames, sampling the target audio frame to obtain multiple amplitude values; and determining, based on the multiple amplitude values, the time domain energy features corresponding to the target audio frame.

[0007] In an optional implementation, determining the temporal feature tensor based on the frequency domain features and time domain energy features corresponding to each of the target audio frames includes: for any of the target audio frames, splicing together the frequency domain features and time domain energy features corresponding to the target audio frame according to a preset splicing strategy to obtain the fusion features corresponding to the target audio frame; and combining the fusion features corresponding to each of the target audio frames to obtain the temporal feature tensor.

[0008] In an optional implementation, performing abnormal noise detection on the vehicle screen based on the temporal feature tensor includes: inputting the temporal feature tensor into a pre-trained abnormal noise recognition model to obtain an abnormal noise evaluation value; if the abnormal noise evaluation value is greater than a preset abnormal noise threshold, determining that the vehicle screen is abnormal; and if the abnormal noise evaluation value is less than or equal to the preset abnormal noise threshold, determining that the vehicle screen is normal.

[0009] In an optional implementation, the pre-trained abnormal noise recognition model includes: a feature extraction subnetwork, a bidirectional gated recurrent unit subnetwork, an attention aggregation layer, and a classification layer. The step of inputting the temporal feature tensor into a pre-trained abnormal noise recognition model to obtain an abnormal noise evaluation value includes: inputting the temporal feature tensor into the feature extraction subnetwork to obtain a local time-frequency pattern feature sequence; inputting the local time-frequency pattern feature sequence into the bidirectional gated recurrent unit subnetwork to obtain a temporal dependency sequence; inputting the temporal dependency sequence into the attention aggregation layer to obtain a global temporal representation vector; and inputting the global temporal representation vector into the classification layer to obtain the abnormal noise evaluation value.

[0010] In an optional implementation, the feature extraction subnetwork comprises N levels of convolutional blocks, each level of which consists of a convolutional layer, a normalization layer, a ReLU activation function layer, and a max pooling layer, where N is a positive integer. The step of inputting the temporal feature tensor into the feature extraction subnetwork to obtain a local time-frequency pattern feature sequence includes: performing the following step iteratively until the N-th first feature map is obtained: inputting the i-th first feature map to be processed into the i-th level convolutional block to obtain the i-th first feature map, wherein the i-th first feature map serves as the (i+1)-th first feature map to be processed; the first first feature map to be processed is the temporal feature tensor; the N-th first feature map is the local time-frequency pattern feature sequence; and i takes the values 1, 2, 3, ..., N in sequence.
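
The iterative chaining of convolutional blocks described above can be sketched in a few lines. This is a deliberately simplified illustration, not the patented network: it works on a single-channel 1-D sequence instead of a 2-D feature map, uses per-sequence standardization as a stand-in for the normalization layer, and the function names and kernels are hypothetical.

```python
import math

def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution (cross-correlation); output length equals input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(x))]

def normalize(x, eps=1e-5):
    """Standardize to zero mean and unit variance (assumed stand-in for the normalization layer)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

def max_pool(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def conv_block(x, kernel):
    """One level: convolution -> normalization -> ReLU -> max pooling."""
    return max_pool(relu(normalize(conv1d_same(x, kernel))))

def feature_extraction(x, kernels):
    """Apply N blocks in order; the output of block i is the input of block i+1."""
    feature_map = x  # the first feature map to be processed is the input itself
    for kernel in kernels:  # i = 1, 2, ..., N
        feature_map = conv_block(feature_map, kernel)
    return feature_map  # the N-th feature map

signal = [float(i % 5) for i in range(16)]
kernels = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]  # hypothetical odd-length kernels, N = 2
features = feature_extraction(signal, kernels)
```

With pooling size 2, each level halves the sequence length, so two levels reduce 16 input values to 4.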

[0011] In an optional implementation, the bidirectional gated recurrent unit subnetwork includes a forward gated recurrent unit and a backward gated recurrent unit. The step of inputting the local time-frequency pattern feature sequence into the bidirectional gated recurrent unit subnetwork to obtain a temporal dependency sequence includes: expanding the local time-frequency pattern feature sequence along the time dimension to obtain a time-series feature vector sequence; inputting the time-series feature vector sequence into the forward gated recurrent unit and the backward gated recurrent unit respectively to obtain a forward temporal feature sequence and a backward temporal feature sequence; and concatenating the forward temporal feature sequence and the backward temporal feature sequence to obtain the temporal dependency sequence.
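
A minimal pure-Python sketch of the bidirectional pass is given here for illustration. It assumes one common GRU formulation (update gate z, reset gate r, candidate state n, new state (1 - z) * n + z * h); the source does not specify the gate equations, and the parameter layout, helper names, and constant weights are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def gate(w, u, b, x, h, activation):
    """activation(W x + U h + b), computed element by element."""
    return [activation(p + q + r) for p, q, r in zip(matvec(w, x), matvec(u, h), b)]

def gru_step(x, h, p):
    """One GRU time step (assumed formulation, see the lead-in)."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sigmoid)        # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sigmoid)        # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    n = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)  # candidate state
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def run_gru(sequence, p, hidden_size):
    h = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs

def bidirectional_gru(sequence, p_forward, p_backward, hidden_size):
    """Run forward and backward passes, then concatenate per time step."""
    forward = run_gru(sequence, p_forward, hidden_size)
    # The backward pass reads the sequence in reverse; its outputs are flipped
    # back so that index t refers to the same time step in both directions.
    backward = run_gru(sequence[::-1], p_backward, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def constant_params(input_size, hidden_size, value):
    """Hypothetical helper: fill every weight with one value, biases with zero."""
    def mat(rows, cols):
        return [[value] * cols for _ in range(rows)]
    params = {}
    for name in ("z", "r", "h"):
        params["W" + name] = mat(hidden_size, input_size)
        params["U" + name] = mat(hidden_size, hidden_size)
        params["b" + name] = [0.0] * hidden_size
    return params

sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
dependency = bidirectional_gru(
    sequence, constant_params(3, 2, 0.1), constant_params(3, 2, -0.1), hidden_size=2
)
```

Each output step has twice the hidden size because the forward and backward states are concatenated. A trained model would use learned weights rather than constants.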

[0012] In an optional implementation, the step of inputting the temporal dependency sequence into the attention aggregation layer to obtain a global temporal representation vector includes: for any time step feature in the temporal dependency sequence, inputting the time step feature into the attention weight matrix of the attention aggregation layer to obtain the original score corresponding to the time step feature; normalizing the original scores corresponding to the time step features to obtain the attention weight corresponding to each time step feature; and performing a weighted sum over the temporal dependency sequence based on the attention weights corresponding to the time step features to obtain the global temporal representation vector.
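
The three steps (score, normalize, weighted sum) can be illustrated as follows. This sketch assumes that the attention weight matrix reduces to a single weight vector producing one scalar score per time step, and that the normalization is a softmax; neither detail is stated in the source, and the function name is hypothetical.

```python
import math

def attention_pool(sequence, weight):
    """Aggregate a sequence of time step features into one vector.

    sequence: list of T feature vectors, each of length D
    weight:   length-D scoring vector (assumed form of the attention weights)
    """
    # Step 1: one original (raw) score per time step.
    scores = [sum(w * h for w, h in zip(weight, step)) for step in sequence]
    # Step 2: softmax normalization (assumed), shifted by the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Step 3: weighted sum over time for each feature dimension.
    dim = len(sequence[0])
    pooled = [sum(a * step[d] for a, step in zip(weights, sequence)) for d in range(dim)]
    return pooled, weights

pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With an all-zero scoring vector every time step receives the same weight, so the result is the plain average of the inputs. Learned weights would instead emphasize the time steps that carry abnormal-noise evidence.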

[0013] In an optional implementation, the step of inputting the global temporal representation vector into the classification layer to obtain the abnormal noise evaluation value includes: inputting the global temporal representation vector into the classification layer to obtain a first original output value and a second original output value; and normalizing the first and second original output values to obtain the abnormal noise evaluation value.
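
As an illustration of the normalization and the threshold comparison from the first aspect, the sketch below applies a two-class softmax and treats the probability of the second output as the abnormal noise evaluation value. The softmax choice, the ordering of the two outputs, the 0.5 threshold, and the function names are all assumptions made for the example.

```python
import math

def abnormal_noise_evaluation(normal_output, abnormal_output):
    """Two-class softmax; returns the normalized value for the 'abnormal' output."""
    peak = max(normal_output, abnormal_output)  # shift for numerical stability
    e_normal = math.exp(normal_output - peak)
    e_abnormal = math.exp(abnormal_output - peak)
    return e_abnormal / (e_normal + e_abnormal)

def judge_screen(evaluation_value, threshold=0.5):
    """Strictly greater than the threshold means abnormal; otherwise normal."""
    return "abnormal" if evaluation_value > threshold else "normal"

score = abnormal_noise_evaluation(0.0, 2.0)
verdict = judge_screen(score)
```

Note that a value exactly equal to the threshold is judged normal, matching the "less than or equal to" branch of the rule.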

[0014] In a second aspect of this application, an abnormal noise detection device is also provided, the device comprising: an audio frame extraction module, used to acquire audio data corresponding to the vehicle screen and extract multiple target audio frames from the audio data; a feature determination module, used to determine the frequency domain features and time domain energy features corresponding to each of the target audio frames; a temporal feature tensor determination module, used to determine the temporal feature tensor based on the frequency domain features and time domain energy features corresponding to each of the target audio frames; and an abnormal noise detection module, used to detect abnormal noises on the vehicle screen based on the temporal feature tensor.

[0015] In a third aspect of the embodiments of this application, an electronic device is also provided, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the abnormal noise detection method described in any of the first aspects above.

[0016] In a fourth aspect of the embodiments of this application, a storage medium is also provided, wherein the storage medium stores instructions that, when executed on a computer, cause the computer to perform any of the abnormal noise detection methods described in the first aspect above.

[0017] In a fifth aspect of the embodiments of this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to perform any of the abnormal noise detection methods described in the first aspect above.

[0018] The technical solution provided in this application acquires audio data corresponding to a vehicle screen and extracts multiple target audio frames from the audio data; determines the frequency domain features and time domain energy features corresponding to each target audio frame; determines a temporal feature tensor based on the frequency domain features and time domain energy features corresponding to each target audio frame; and performs abnormal noise detection on the vehicle screen based on the temporal feature tensor. By determining the temporal feature tensor based on the frequency domain features and time domain energy features corresponding to each target audio frame, and thus performing abnormal noise detection on the vehicle screen, the accuracy of abnormal noise detection on the vehicle screen can be improved. This solves the technical problem of existing technologies relying on the auditory experience and subjective judgment of testing personnel, resulting in poor consistency of detection results and low detection efficiency.

Attached Figure Description

[0019] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0020] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0021] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings denote similar elements. Unless otherwise stated, the figures in the drawings are not drawn to scale.

[0022] Figure 1 is a schematic diagram of the implementation process of an abnormal noise detection method provided in an embodiment of this application; Figure 2 is a schematic diagram of the implementation process of another abnormal noise detection method provided in an embodiment of this application; Figure 3 is a schematic diagram of the implementation process of a method for determining abnormal noise evaluation values provided in an embodiment of this application; Figure 4 is a schematic diagram of the structure of an abnormal noise detection device provided in an embodiment of this application; Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0024] The following disclosure provides numerous different embodiments or examples for implementing various structures of this application. To simplify the disclosure, specific examples of components and arrangements are described below. These are merely examples and are not intended to limit the scope of this application. Furthermore, reference numerals and / or letters may be repeated in different examples. Such repetition is for simplification and clarity and does not in itself indicate a relationship between the various embodiments and / or arrangements discussed.

[0025] To address the technical problem of existing technologies relying on the auditory experience and subjective judgment of inspectors, leading to poor consistency and low detection efficiency, this application provides a method for detecting abnormal noises from vehicle screens. This method involves acquiring audio data corresponding to the vehicle screen and extracting multiple target audio frames from the audio data; determining the frequency domain features and temporal energy features corresponding to each target audio frame; determining a temporal feature tensor based on the frequency domain features and temporal energy features of each target audio frame; and performing abnormal noise detection on the vehicle screen based on the temporal feature tensor. By determining the temporal feature tensor based on the frequency domain features and temporal energy features of each target audio frame, the accuracy of abnormal noise detection on vehicle screens can be improved.

[0026] Figure 1 is a schematic diagram of the implementation process of an abnormal noise detection method provided in an embodiment of this application. As shown in Figure 1, the method may specifically include the following steps: S101, acquire the audio data corresponding to the vehicle screen, and extract multiple target audio frames from the audio data.

[0027] The aforementioned vehicle screens refer to in-vehicle devices with display functions, such as car ceiling-mounted displays installed on the roof of a car; central control screens located in the center of the cockpit that integrate navigation, entertainment, and vehicle settings; instrument panel displays used to display key data such as vehicle speed, battery level, and fault information; and passenger entertainment screens located in front of the passenger seat that provide independent audio-visual entertainment functions. If these vehicle screens experience structural loosening or component failure during startup, operation, or shutdown, they will produce abnormal sounds, requiring audio detection to identify the defects.

[0028] The audio data corresponding to the vehicle screen mentioned above refers to the sound signal emitted by the vehicle screen during the entire process of startup, operation, and shutdown, which is collected by a microphone in the quiet environment of the production line. The sampling rate can be 48kHz and the quantization accuracy is 16bit.

[0029] The target audio frame refers to the short-time analysis unit obtained after the audio data is processed into frames. The frame length can be 25ms and the frame shift can be 10ms. A Hamming window can be applied to maintain the stationarity of the signal in the time domain, which is convenient for subsequent feature extraction.

[0030] In this embodiment, audio data corresponding to the vehicle screen is acquired, and multiple target audio frames are extracted from the audio data. This step provides well-aligned, low-redundancy time-series units (i.e., multiple target audio frames) for subsequent feature extraction, ensuring a balance between temporal resolution and computational efficiency.

[0031] For example, if the vehicle screen is a ceiling-mounted display, the screen under test can be placed in a quiet box with a noise floor of less than 30dB, with the microphone positioned 50cm horizontally from the screen's sound radiating surface. The screen is controlled to run automatically in the sequence of "start-run-shutdown" while audio acquisition starts simultaneously. The acquired raw audio is saved in WAV format, and the continuous signal is then divided into several temporally overlapping audio frames through framing and windowing operations to obtain multiple target audio frames, which serve as the basic units for subsequent feature extraction.

[0032] S102, determine the frequency domain features and time domain energy features corresponding to each target audio frame.

[0033] The aforementioned frequency domain features can be a 128-dimensional Mel spectrum extracted from each target audio frame. The Mel spectrum reflects the frequency distribution as perceived by human hearing and is used to capture the characteristic patterns of abnormal noises in the frequency domain.

[0034] The aforementioned time domain energy features refer to the short-time energy of each target audio frame. The short-time energy characterizes the sound intensity of a single audio frame, reflects the change in sound intensity along the time axis, and assists in identifying abnormal noises caused by sudden energy changes.
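
For illustration, short-time energy is commonly computed as the sum of squared amplitude values within a frame. The source does not give the exact formula, so this definition and the function name are assumptions.

```python
def short_time_energy(frame):
    """Sum of squared sample amplitudes in one frame (assumed definition)."""
    return sum(sample * sample for sample in frame)

# Two toy frames: a quiet one and one containing a sudden click-like burst.
quiet_frame = [0.01, -0.02, 0.01, 0.0]
burst_frame = [0.01, 0.9, -0.8, 0.02]
energies = [short_time_energy(f) for f in (quiet_frame, burst_frame)]
```

The burst frame yields a far larger energy than the quiet frame, which is the kind of sudden change along the time axis that this feature is meant to expose.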

[0035] In this embodiment, the frequency domain features and time domain energy features corresponding to each target audio frame are determined. This step is used to convert the target audio frame into more discriminative acoustic features while preserving complementary information in the time and frequency domains.

[0036] S103, based on the frequency domain features and time domain energy features corresponding to each target audio frame, determine the temporal feature tensor.

[0037] The aforementioned temporal feature tensor refers to the structured feature data formed by concatenating the frequency domain features (such as 128-dimensional Mel spectrum) and temporal energy features (such as 1-dimensional short-time energy) corresponding to each target audio frame in the frequency dimension and arranging them in chronological order. Its dimension is 129×T (where 129 is the feature dimension of a single frame and T is the total number of target audio frames, such as T=50). This tensor integrates both temporal and frequency domain information and preserves the temporal sequence characteristics of the features.

[0038] In this embodiment, the temporal feature tensor can be determined based on the frequency domain features and temporal energy features corresponding to each target audio frame, thereby organizing the multidimensional features of each target audio frame into a structured tensor form in chronological order, satisfying the requirements of the temporal model for the input format, and ensuring the alignment of samples in the time dimension within a batch.

[0039] Specifically, for each target audio frame, its 128-dimensional Mel spectrum and 1-dimensional short-time energy can be concatenated along the frequency axis to obtain a 129-dimensional feature vector sequence. The sequence is then truncated or zero-padded based on the frame number T of the shortest sample to form a unified 129×T temporal feature tensor.
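
The concatenation and length unification can be sketched as below. The function name and the list-of-lists layout (feature dimension by time) are assumptions, and the input is assumed to contain at least one frame.

```python
def build_feature_tensor(mel_frames, energies, target_frames):
    """Fuse per-frame features and return a (feature_dim x target_frames) nested list.

    mel_frames:    list of per-frame Mel vectors (e.g. 128 values each)
    energies:      list of per-frame short-time energies
    target_frames: unified number of frames T
    """
    # Append the 1-dimensional energy to each Mel vector: 128 + 1 = 129 values.
    fused = [list(mel) + [energy] for mel, energy in zip(mel_frames, energies)]
    feature_dim = len(fused[0])
    # Truncate sequences that are too long, zero-pad those that are too short.
    fused = fused[:target_frames]
    fused += [[0.0] * feature_dim for _ in range(target_frames - len(fused))]
    # Transpose from (T x feature_dim) to (feature_dim x T).
    return [[frame[d] for frame in fused] for d in range(feature_dim)]

mel_frames = [[0.1] * 128, [0.2] * 128, [0.3] * 128]
energies = [1.0, 2.0, 3.0]
tensor = build_feature_tensor(mel_frames, energies, target_frames=5)
```

Here three frames are padded to T = 5, so the last two columns are zeros and the final row holds the energies.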

[0040] S104, perform abnormal noise detection on the vehicle screen based on the temporal feature tensor.

[0041] In this embodiment, abnormal noise detection of vehicle screens can be performed based on time-series feature tensors to determine whether there are abnormal noises on the vehicle screens. This enables automated abnormal noise determination in production line scenarios, replacing the traditional method of manual listening and solving the problems of strong subjectivity, high missed detection rate, and low efficiency of manual detection. It also meets the automated detection needs of large-scale, high-cycle production lines.

[0042] Based on the above description of the technical solution provided in the embodiments of this application, audio data corresponding to the vehicle screen is obtained, and multiple target audio frames are extracted from the audio data; the frequency domain features and time domain energy features corresponding to each target audio frame are determined; a temporal feature tensor is determined based on the frequency domain features and time domain energy features corresponding to each target audio frame; and abnormal noise detection is performed on the vehicle screen based on the temporal feature tensor.

[0043] By determining the temporal feature tensor based on the frequency domain features and temporal energy features corresponding to each target audio frame, the abnormal noise detection of vehicle screens can be performed. This can improve the accuracy of abnormal noise detection of vehicle screens and solve the technical problem that the existing technology relies on the auditory experience and subjective judgment of the detection personnel, resulting in poor consistency of detection results and low detection efficiency.

[0044] Figure 2 is a schematic diagram of the implementation process of another abnormal noise detection method provided in an embodiment of this application. As shown in Figure 2, the method may specifically include the following steps: S201, acquire the audio data corresponding to the vehicle screen, and extract multiple target audio frames from the audio data.

[0045] In this embodiment of the application, audio data corresponding to the vehicle screen is obtained, and multiple target audio frames are extracted from the audio data.

[0046] Specifically, acquiring audio data corresponding to the vehicle screen can be achieved by placing the vehicle screen in a preset operating environment; sending control commands to the vehicle screen in the preset operating environment to control the vehicle screen to operate according to the preset operating conditions, thereby obtaining operating data; and preprocessing the operating data to obtain the audio data corresponding to the vehicle screen. The preset operating environment can be a low-noise acquisition environment constructed using a soundproof enclosure, with the background noise of the soundproof enclosure strictly controlled below 30dB. Simultaneously, the audio acquisition microphone is fixed at a preset position 50cm horizontally from the sound radiation surface of the vehicle screen, thus isolating irrelevant signals such as production line environmental noise and electromagnetic interference, ensuring that the acquired sound signal is the direct sound emitted by the ceiling-mounted screen, and reducing signal attenuation and distortion. Control commands refer to electrical signal trigger commands issued by the production line automation control system that automatically control the working state of the vehicle screen in a fixed sequence of start, run, and stop. This enables standardized and automated control of the vehicle screen's operating conditions, ensuring consistency in the acquisition conditions for different samples. The operational data refers to the continuous audio signal synchronously acquired by the microphone from the vehicle screen under the aforementioned operating conditions. During acquisition, the sampling rate was set to 48kHz and the quantization precision to 16bit, ensuring coverage of the characteristic frequency bands of abnormal noises from the vehicle screen while also meeting the requirement for capturing faint abnormal noises. The raw operational data is saved in lossless WAV format. 
Preprocessing may include operations such as DC component removal and signal pre-emphasis on the operational data. DC removal eliminates DC offset in the audio signal to avoid affecting the accuracy of subsequent feature extraction. Pre-emphasis increases the proportion of high-frequency signals through filtering, enhancing the characteristic performance of abnormal noises in the high-frequency range and further improving the quality of the raw audio data.
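
The two preprocessing operations can be illustrated with a short sketch. DC removal is shown as mean subtraction and pre-emphasis as the first-order filter y[n] = x[n] - a * x[n-1]; the coefficient 0.97 is a conventional default and an assumption here, since the source does not give a value.

```python
def remove_dc(signal):
    """Subtract the mean so the signal is centered on zero."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]; y[0] = x[0]."""
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]

raw = [1.0, 2.0, 3.0]
centered = remove_dc(raw)
emphasized = pre_emphasis(centered)
```

Slowly varying content is attenuated because consecutive samples nearly cancel, while fast changes pass through, which raises the relative share of high-frequency content.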

[0047] Specifically, extracting multiple target audio frames from audio data can include framing the audio data to obtain multiple temporally continuous audio frames; and windowing each audio frame to obtain multiple target audio frames. Framing refers to dividing the continuous analog audio signal of the audio data into discrete short-time analysis units based on fixed time parameters. In this embodiment, the frame length can be set to 25ms and the frame shift to 10ms. This parameter setting ensures that a single audio frame can completely capture the temporal characteristics of a single short-term abnormal sound, while also ensuring signal continuity through partial overlap between frames, avoiding feature loss, and maintaining temporal stationarity of the signal within the frame, laying the foundation for subsequent frequency and temporal feature extraction. Windowing refers to applying Hamming window processing to each segmented audio frame. Through the weight allocation of the Hamming window, the edge signals of the audio frame are smoothly attenuated, effectively suppressing the spectral leakage problem generated during subsequent Fourier transform, reducing feature extraction errors, and making the extracted features more closely match the true characteristics of the audio signal. The target audio frames obtained after framing and windowing processing serve as the smallest basic processing unit for subsequent acoustic feature extraction.
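
A minimal sketch of framing and Hamming windowing with the parameters from this embodiment (48kHz sampling, 25ms frames, 10ms shift) follows. The function names are hypothetical, and dropping trailing samples that do not fill a whole frame is an assumption.

```python
import math

def hamming(length):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]

def frame_signal(signal, sample_rate=48000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window to each."""
    frame_len = sample_rate * frame_ms // 1000  # 1200 samples at 48 kHz
    hop = sample_rate * shift_ms // 1000        # 480 samples at 48 kHz
    window = hamming(frame_len)
    frames = []
    # Trailing samples that do not fill a complete frame are dropped (assumption).
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

one_second = [1.0] * 48000  # constant test signal, one second long
target_frames = frame_signal(one_second)
```

With these settings consecutive frames overlap by 15ms (720 samples), and one second of audio yields 98 frames of 1200 samples each.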

[0048] S202, for any target audio frame, perform a short-time Fourier transform on the target audio frame to obtain the power spectrum.

[0049] In this embodiment, for any target audio frame, a short-time Fourier transform is performed on the target audio frame to obtain the power spectrum. The power spectrum reflects the energy distribution of each frame's audio signal across its frequency components.

[0050] Specifically, a Fourier transform can be performed on each target audio frame to convert the audio amplitude signal in the time domain into a complex form signal in the frequency domain. This preserves both the frequency distribution information of the single-frame signal and the temporal characteristics of the signal through the time-series properties of the frame. The number of Fourier transform points can be set according to the sampling rate and frame length. In this embodiment, a 1024-point Fast Fourier Transform is used, with a frequency resolution of approximately 46.875Hz.
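
The power spectrum and the quoted frequency resolution can be checked with the sketch below. It uses a naive discrete Fourier transform for clarity (a production system would use an FFT routine), normalizes power as |X[k]|^2 / n_fft, and truncates or zero-pads each frame to the transform length. These choices and the function names are assumptions; in particular, the source does not say how a 25ms frame at 48kHz (1200 samples) is mapped onto 1024 transform points.

```python
import cmath
import math

def power_spectrum(frame, n_fft):
    """One-sided power spectrum |X[k]|^2 / n_fft for k = 0 .. n_fft // 2."""
    padded = (list(frame) + [0.0] * n_fft)[:n_fft]  # truncate or zero-pad to n_fft
    spectrum = []
    for k in range(n_fft // 2 + 1):
        coeff = sum(
            padded[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)
        )
        spectrum.append(abs(coeff) ** 2 / n_fft)
    return spectrum

def frequency_resolution(sample_rate, n_fft):
    """Spacing between adjacent frequency bins in Hz."""
    return sample_rate / n_fft

# Small demonstration: a sine wave that completes exactly 8 cycles in 64 samples.
n_fft = 64
tone = [math.sin(2.0 * math.pi * 8 * n / n_fft) for n in range(n_fft)]
spectrum = power_spectrum(tone, n_fft)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
resolution = frequency_resolution(48000, 1024)
```

The energy of the test tone concentrates in bin 8, and 48000 / 1024 gives exactly 46.875 Hz per bin, consistent with the figure stated above.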

[0051] S203, transform the power spectrum to obtain the frequency domain features corresponding to the target audio frame.

[0052] In the embodiments of this application, the power spectrum can be converted to obtain the frequency domain features corresponding to the target audio frame.

[0053] The process of converting the power spectrum to obtain the frequency domain features corresponding to the target audio frame can include: obtaining the type information of the vehicle screen; determining the conversion parameters based on the type information of the vehicle screen; and converting the power spectrum based on the conversion parameters to obtain the frequency domain features corresponding to the target audio frame.

[0054] The vehicle screen type information can include the inherent attributes of the device, such as the model, size, hardware configuration, and specifications of the sound-generating components. Different types of vehicle screens have different sound pattern frequency distributions and abnormal noise characteristic frequency bands due to hardware differences. The conversion parameters refer to the construction parameters of the Mel filter bank adapted to different types of vehicle screens, including the frequency coverage range of the filter bank, the Mel scale division interval, and the number of filters. For example, for small-sized ceiling screens, the abnormal noises are mostly concentrated in the high-frequency range. The conversion parameters can be adjusted to improve the resolution of the Mel filter bank in the high-frequency range.

[0055] For example, retrieve the model, hardware configuration, and other type information of the ceiling screen under test from the production line equipment information database; based on the pre-stored type information-conversion parameter mapping table, match the Mel filter bank conversion parameters (such as frequency coverage range, number of filters, etc.) corresponding to the type of ceiling screen; construct a dedicated Mel filter bank based on the matched conversion parameters, and perform filtering, logarithmic transformation, and other operations on the power spectrum obtained in step S202 to obtain the 128-dimensional frequency domain characteristics (Mel spectrum) adapted to this type of equipment.
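
The lookup-then-filter flow above could look roughly like the sketch below. It assumes NumPy, a common Hz-to-Mel formula, and triangular filters; the table `CONVERSION_PARAMS`, its key, and its parameter values are hypothetical placeholders for the pre-stored mapping table, and the small constant added before the logarithm is also an assumption.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate, f_min, f_max):
    """Triangular Mel filters with shape (n_filters, n_fft // 2 + 1)."""
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # n_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2))
    bank = np.zeros((n_filters, bin_freqs.size))
    for i in range(n_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


# Hypothetical "type information -> conversion parameters" mapping table.
CONVERSION_PARAMS = {
    "ceiling_screen_small": {"n_filters": 128, "f_min": 200.0, "f_max": 24000.0},
}


def log_mel_features(power_spec, screen_type, n_fft=1024, sample_rate=48_000):
    """Filter the power spectrum with a type-specific bank, then take the log."""
    p = CONVERSION_PARAMS[screen_type]
    bank = mel_filter_bank(p["n_filters"], n_fft, sample_rate, p["f_min"], p["f_max"])
    return np.log(bank @ np.asarray(power_spec, dtype=float) + 1e-10)


features = log_mel_features(np.ones(513), "ceiling_screen_small")  # shape (128,)
```

With 128 filters on a 1024-point FFT, the lowest filters can be narrower than one FFT bin, so a production filter bank would likely need a different low-frequency layout or a longer FFT.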

[0056] In another embodiment of this application, to further improve the effectiveness of the frequency domain features, after obtaining the power spectrum, the type information of the vehicle screen can also be acquired. Based on the type information of the vehicle screen, correction parameters are determined, and the power spectrum is corrected using the correction parameters to obtain the corrected power spectrum. The correction parameters are frequency weighting coefficients obtained based on the power spectrum characteristics of different types of vehicle screens during normal operation. These coefficients are used to perform frequency domain weighted correction on the collected power spectrum, suppressing the inherent normal acoustic energy of different types of ceiling-mounted screens, amplifying the abnormal noise characteristic frequency band energy of that type of device, and reducing the interference of device type differences on abnormal noise detection.

[0057] The above-mentioned transformation of the power spectrum to obtain the frequency domain features corresponding to the target audio frame can be achieved by transforming the corrected power spectrum to obtain the frequency domain features corresponding to the target audio frame.

[0058] For example, extract the type information of the ceiling screen to be tested, retrieve the power spectrum correction parameters (frequency weighting coefficients) corresponding to this type of ceiling screen from the feature library; perform a frequency-point weighted operation on the correction parameters and the original power spectrum obtained in step S202 to suppress the energy of the normal sound signature frequency band and amplify the energy of the abnormal sound feature frequency band to obtain the corrected power spectrum; construct a Mel filter bank based on the exclusive transformation parameters of this type of ceiling screen, filter and logarithmically transform the corrected power spectrum to obtain the frequency domain features corresponding to the target audio frame.
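
The frequency-point weighting step can be illustrated with a short sketch, assuming NumPy and a 513-bin power spectrum from a 1024-point FFT. The function name and the coefficient values are hypothetical; in the described method the coefficients would come from the feature library for the specific screen type.

```python
import numpy as np


def correct_power_spectrum(power_spec, weights):
    """Multiply each frequency bin by its type-specific weighting coefficient."""
    power_spec = np.asarray(power_spec, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if power_spec.shape != weights.shape:
        raise ValueError("weights must have one coefficient per frequency bin")
    return power_spec * weights


# Hypothetical coefficients: attenuate bins 0-255, amplify bins 256-512.
weights = np.concatenate([np.full(256, 0.5), np.full(257, 2.0)])
raw = np.ones(513)
corrected = correct_power_spectrum(raw, weights)
```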

[0059] After the frequency domain features corresponding to the target audio frame are obtained, they can be standardized. Each feature value, together with the mean and standard deviation of the frequency domain features, is substituted into the normalization formula to obtain the normalized feature value. The normalization formula is as follows:

$$\hat{x} = \frac{x - \mu}{\sigma + \varepsilon}$$

[0060] where $x$ is a feature value in the frequency domain features, $\hat{x}$ is the normalized feature value, $\sigma$ is the standard deviation of the frequency domain features, $\mu$ is the mean of the frequency domain features, and $\varepsilon$ is a small positive smoothing term used to prevent the denominator from being zero.
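
A minimal sketch of this standardization, assuming NumPy and a form in which the smoothing term is added to the standard deviation; the function name and the default `eps` value are assumptions, since the source does not give a legible value for the smoothing term.

```python
import numpy as np


def standardize(features, eps=1e-8):
    """Subtract the mean and divide by (standard deviation + eps)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean()) / (features.std() + eps)


normalized = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
# A constant frame has zero standard deviation; eps keeps the result finite.
flat = standardize([3.0, 3.0, 3.0])
```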

[0061] S204: For any target audio frame, sample the target audio frame to obtain multiple amplitude values.

[0062] In this embodiment, for any target audio frame, the target audio frame is sampled to obtain multiple amplitude values. Sampling refers to discretizing the amplitude of the target audio frame, after frame segmentation and windowing, at a sampling rate consistent with that of the target audio frame (e.g., 48kHz), that is, extracting the sound pressure amplitude corresponding to each sampling point in the target audio frame. Each amplitude is a discretized numerical representation of the continuous audio signal and directly reflects the sound intensity at the corresponding moment. The multiple amplitude values are the set of amplitude values of all sampling points within a single target audio frame. For example, for a target audio frame with a frame length of 25ms and a sampling rate of 48kHz, the amplitude values of 1200 sampling points (48000 × 0.025 = 1200) can be extracted, forming the multiple amplitude values corresponding to that frame.

[0063] Specifically, for any target audio frame obtained in step S201, based on a sampling rate of 48kHz, the sound pressure amplitude corresponding to all discrete sampling points in the frame can be extracted point by point, and all amplitudes can be arranged in chronological order to form the amplitude sequence of the target audio frame, thus obtaining a set of multiple amplitudes.

[0064] S205: Determine the temporal energy characteristics corresponding to the target audio frame based on the multiple amplitude values.

[0065] In this embodiment, the temporal energy characteristics corresponding to the target audio frame are determined based on multiple amplitude values. The temporal energy characteristics are the short-time energy of a single audio frame calculated based on the amplitude sequence. These characteristics characterize the overall sound intensity of a single target audio frame, intuitively reflecting the magnitude of the sound energy within the frame, and effectively identifying abnormal noises caused by sudden energy increases due to component impacts, jamming, etc.

[0066] Specifically, the short-time energy of the target audio frame can be calculated using the root mean square algorithm based on the multiple amplitude values obtained in step S204.

[0067] In another embodiment of this application, to improve the accuracy of identifying abnormal noises from different types of ceiling screens using time-domain energy features, after obtaining multiple amplitude values, the type information of the vehicle screen can also be acquired. Based on the type information of the vehicle screen, amplitude weights are determined, and the multiple amplitude values ​​are corrected using the amplitude weights to obtain multiple corrected amplitude values. The amplitude weights are time-domain weighting coefficients obtained based on the normal operating amplitude characteristics of different types of vehicle screens. Different types of vehicle screens have different fluctuation ranges and energy baselines in their normal amplitudes due to differences in operating conditions. By correcting the original amplitude values ​​using amplitude weights, the energy baselines of different types of devices can be unified, amplifying the abnormal characteristics of energy mutations and improving the sensitivity of identifying abnormal noises caused by energy mutations.

[0068] The above method of determining the temporal energy characteristics of the target audio frame based on multiple amplitudes can be achieved by transforming multiple corrected amplitudes to determine the temporal energy characteristics of the target audio frame.

[0069] Specifically, the type information of the vehicle screen (such as the ceiling screen to be tested) can be retrieved from the production line equipment information database. Based on the pre-stored type information-amplitude weight mapping table, the amplitude weight coefficient corresponding to the type of vehicle screen is matched. The original amplitudes obtained in step S204 are then weighted point by point with the amplitude weights to obtain multiple corrected amplitudes. The correction process can suppress the normal amplitude fluctuation of this type of vehicle screen and amplify abnormal amplitude mutations. The multiple corrected amplitudes are then substituted into the root mean square formula to calculate the temporal energy characteristics corresponding to the target audio frame. The corrected energy characteristics can better highlight the energy changes related to abnormal noise.
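
The root mean square calculation, with the optional point-by-point amplitude correction, might be sketched as below. NumPy is assumed; the function name `temporal_energy` and the weight values in the usage lines are illustrative only.

```python
import numpy as np

SAMPLE_RATE = 48_000   # Hz
FRAME_SECONDS = 0.025  # 25 ms frame
SAMPLES_PER_FRAME = round(SAMPLE_RATE * FRAME_SECONDS)  # 1200 samples


def temporal_energy(amplitudes, amplitude_weights=None):
    """Root mean square of a frame, after optional point-by-point weighting."""
    a = np.asarray(amplitudes, dtype=float)
    if amplitude_weights is not None:
        # A scalar or a per-sample array both work through broadcasting.
        a = a * np.asarray(amplitude_weights, dtype=float)
    return float(np.sqrt(np.mean(a ** 2)))


plain = temporal_energy([1.0, -1.0, 1.0, -1.0])          # no correction
corrected = temporal_energy([1.0, -1.0, 1.0, -1.0], 0.5)  # hypothetical weight
```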

[0070] S206. For any target audio frame, according to a preset splicing strategy, the frequency domain features and time domain energy features corresponding to the target audio frame are spliced ​​together to obtain the fused features corresponding to the target audio frame.

[0071] In this embodiment, for any target audio frame, the frequency domain features and time domain energy features corresponding to the target audio frame are spliced ​​together according to a preset splicing strategy to obtain the fused features corresponding to the target audio frame. The preset splicing strategy is a pre-defined frequency dimension splicing strategy. It can be that the frequency domain features corresponding to the target audio frame (e.g., 128-dimensional frequency domain features, Mel spectrum) are used as the basic feature dimension, and time domain energy features (e.g., 1-dimensional time domain energy features) are added along the frequency dimension to the dimensional sequence of the frequency domain features, forming a single-frame fused feature with unified dimensions. This ensures the continuity of the feature dimensions and does not destroy the frequency distribution pattern of the frequency domain features and the independent representation of the time domain energy features. The fused feature is a feature vector (e.g., a 129-dimensional feature vector) obtained after feature splicing of each target audio frame. This vector simultaneously contains the frequency distribution information and time domain energy intensity information of the sound emitted from the vehicle screen, providing a complete representation of the single-frame audio soundprint features.

[0072] Specifically, the frequency domain features and temporal energy features of any target audio frame can be retrieved. According to the preset frequency dimension splicing strategy, the temporal energy features are added as a new dimension and spliced ​​to the end of the frequency domain features (or at a specified frequency dimension position) to form the fused features corresponding to the target audio frame.

[0073] It should be noted that the spliced ​​fused feature vector can be standardized and verified to ensure that the feature dimensions and numerical scale are consistent, thus completing the construction of single-frame fused features.

[0074] S207, combine the fusion features corresponding to each target audio frame to obtain the temporal feature tensor.

[0075] In this embodiment, the fusion features corresponding to each target audio frame are combined to obtain a temporal feature tensor. The temporal feature tensor is a three-dimensional tensor data formed by integrating the fusion features of all target audio frames in time order. Its basic dimension can be 129×T (129 is the dimension of a single-frame fusion feature, and T is the total number of target audio frames, such as 100). After dimension adjustment and temporal alignment, it is finally converted into the standard dimension for the adaptation model input.
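
Steps S206 and S207 can be sketched together as follows, assuming NumPy, the energy value appended at the end of the 128-dimensional frequency features, and a channel-first layout of (1, 129, T) for the model input. The layout and the function names are assumptions; the source only states that the 129×T data is adjusted to the model's standard input dimension.

```python
import numpy as np


def fuse_frame(freq_features, energy):
    """Append the 1-D energy feature to the end of the frequency features."""
    return np.concatenate([np.asarray(freq_features, dtype=float), [float(energy)]])


def build_temporal_tensor(freq_frames, energies):
    """Stack per-frame fused features in time order into a (1, 129, T) tensor."""
    fused = [fuse_frame(f, e) for f, e in zip(freq_frames, energies)]
    matrix = np.stack(fused, axis=1)   # shape (129, T)
    return matrix[np.newaxis, ...]     # add an assumed channel axis


# Usage with T = 100 synthetic frames.
freq_frames = [np.full(128, float(t)) for t in range(100)]
energies = [0.1 * t for t in range(100)]
tensor = build_temporal_tensor(freq_frames, energies)
```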

[0076] S208: Input the temporal feature tensor into the pre-trained abnormal noise recognition model to obtain the abnormal noise evaluation value.

[0077] In this embodiment of the application, the temporal feature tensor can be input into a pre-trained abnormal noise recognition model to obtain an abnormal noise evaluation value.

[0078] The pre-trained abnormal noise recognition model includes: a feature extraction subnetwork, a bidirectional gated recurrent unit subnetwork, an attention aggregation layer, and a classification layer.

[0079] For details on how to input the temporal feature tensor into the pre-trained abnormal noise recognition model to obtain the abnormal noise evaluation value, refer to the method shown in Figure 3. Figure 3 illustrates the implementation flow of a method for determining the abnormal noise evaluation value according to an embodiment of this application, which may specifically include the following steps: S301: Input the temporal feature tensor into the feature extraction subnet to obtain the local time-frequency pattern feature sequence.

[0080] In this embodiment of the application, the temporal feature tensor can be input into the feature extraction subnet to obtain the local time-frequency pattern feature sequence.

[0081] The feature extraction subnetwork contains N levels of convolutional blocks. Each level consists of a convolutional layer, a normalization layer, a ReLU activation function layer, and a max pooling layer, where N is a positive integer (e.g., N = 2). The feature extraction subnetwork is used to perform progressive convolution, dimensionality reduction, and nonlinear transformations on the temporal feature tensor to capture local time-frequency pattern features within and between audio frames.

[0082] Inputting the temporal feature tensor into the feature extraction subnet to obtain the local time-frequency pattern feature sequence may include iteratively executing the following step until the N-th first feature map is obtained: input the i-th first feature map to be processed into the i-th level convolutional block to obtain the i-th first feature map, where the i-th first feature map serves as the (i+1)-th first feature map to be processed; the 1st first feature map to be processed is the temporal feature tensor; the N-th first feature map is the local time-frequency pattern feature sequence; and i takes the values 1, 2, 3, ..., N in sequence.

[0083] For example, the temporal feature tensor (the first feature map to be processed) is input into the first-level convolutional block, and after convolution, normalization, activation and pooling, the first feature map is obtained; the first feature map is used as the second feature map to be processed and input into the second-level convolutional block, and after the same processing, the second feature map is obtained. This feature map is the local time-frequency pattern feature sequence.

[0084] For example, this feature extraction subnetwork contains two levels of convolutional blocks, with 1 input channel and 32 and 64 output channels respectively, and a kernel size of 3×3. Normalization and the ReLU activation function are applied after each convolutional stage. Finally, a 2×2 max pooling layer is used to downsample in both the time and frequency dimensions to extract the local time-frequency pattern feature sequence. The convolution operation formula is as follows:

$$Y(i, j) = \sum_{m=0}^{K_h - 1} \sum_{n=0}^{K_w - 1} X(i + m,\, j + n)\, W(m, n) + b$$

where $X$ is the temporal feature tensor, $W$ is the convolution kernel weight, $b$ is a preset bias term (which can take the value 0.0001), $Y$ is the first feature map, $K_h$ is the height of the convolution kernel ($K_h = 3$ in this example), $K_w$ is the width of the convolution kernel ($K_w = 3$ in this example), $m$ is the sliding index in the vertical direction (from 0 to $K_h - 1$), used to traverse all positions of the convolution kernel in the row direction, $n$ is the sliding index in the horizontal direction (from 0 to $K_w - 1$), used to traverse all positions of the convolution kernel in the column direction, and $(i, j)$ are the position coordinates in the first feature map.
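
One single-channel convolution, ReLU, and 2×2 max pooling pass can be sketched as below. This is a simplified illustration, assuming NumPy, one input and one output channel, no padding, and no normalization layer; the real subnetwork described above uses 32 and 64 output channels, and the function names are hypothetical.

```python
import numpy as np


def conv2d_single(x, w, b=0.0001):
    """Slide kernel w over x (no padding), summing products plus bias b."""
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return y


def relu(x):
    return np.maximum(x, 0.0)


def max_pool_2x2(x):
    """Keep the maximum of each non-overlapping 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


# Usage on a small 5x5 input with an all-ones 3x3 kernel.
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))
feature_map = conv2d_single(x, w)          # shape (3, 3)
pooled = max_pool_2x2(relu(feature_map))   # shape (1, 1)
```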

[0085] S302: Input the local time-frequency pattern feature sequence into the bidirectional gated recurrent unit subnet to obtain the temporal dependency sequence.

[0086] In this embodiment, the local time-frequency pattern feature sequence is input into a bidirectional gated recurrent unit (BiGRU) subnet to obtain a temporal dependency sequence. The subnet includes a forward gated recurrent unit and a backward gated recurrent unit. The forward gated recurrent unit processes the feature sequence in ascending time order (from frame 1 to frame T), capturing the dependence of later features on earlier features. The backward gated recurrent unit processes the feature sequence in reverse time order (from frame T to frame 1), capturing the dependence of earlier features on later features.

[0087] Inputting the local time-frequency pattern feature sequence into the bidirectional gated recurrent unit subnet to obtain the temporal dependency sequence may include the following steps: the local time-frequency pattern feature sequence is expanded along the time dimension to obtain a temporal feature vector sequence; this temporal feature vector sequence is input into the forward gated recurrent unit and the backward gated recurrent unit, respectively, to obtain a forward temporal feature sequence and a backward temporal feature sequence; and the forward and backward temporal feature sequences are concatenated to obtain the temporal dependency sequence. The forward temporal feature sequence is the output of the forward gated recurrent unit after processing the temporal feature vector sequence; it characterizes the historical temporal dependency features of the vehicle screen's sound emission, reflecting the feature change pattern in forward time. The backward temporal feature sequence is the output of the backward gated recurrent unit after processing the temporal feature vector sequence; it characterizes the future temporal dependency features of the vehicle screen's sound emission, reflecting the feature change pattern in reverse time.

[0088] For example, the local time-frequency pattern feature sequence is expanded along the time dimension into a sequence of feature vectors, which is input into the bidirectional gated recurrent unit, with 128 hidden units in each direction and 2 layers. The forward and backward gated recurrent units can simultaneously capture the temporal dependencies in both directions. Their core update formula is:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $t$ is the time step index, representing the t-th target audio frame, $h_t$ is the output vector of the current target audio frame, $h_{t-1}$ is the output vector of the previous target audio frame, $z_t$ is the update gate, $\tilde{h}_t$ is the candidate hidden state, and $\odot$ denotes element-wise multiplication.

[0089] By modeling simultaneously with the forward and backward gated recurrent units, the temporal output sequence $\{h_1, h_2, \dots, h_T\}$ is obtained.
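
The bidirectional processing can be illustrated with a small single-layer sketch, assuming NumPy, a standard gated recurrent unit cell with update and reset gates, and the update convention h_t = (1 - z_t) * h_{t-1} + z_t * candidate. The source names only the update gate and candidate state, so the reset gate, the random initialization, and all class and function names are assumptions; the embodiment itself uses 2 layers and 128 hidden units per direction.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Minimal GRU cell; index 0 = update gate, 1 = reset gate, 2 = candidate."""

    def __init__(self, input_size, hidden_size, rng):
        self.W = rng.normal(0.0, 0.1, (3, hidden_size, input_size))
        self.U = rng.normal(0.0, 0.1, (3, hidden_size, hidden_size))
        self.b = np.zeros((3, hidden_size))

    def step(self, x, h_prev):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h_prev + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h_prev + self.b[1])
        h_cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h_prev) + self.b[2])
        return (1.0 - z) * h_prev + z * h_cand


def run_gru(cell, xs, hidden_size):
    """Run the cell over xs in the given order, returning all hidden states."""
    h = np.zeros(hidden_size)
    outputs = []
    for x in xs:
        h = cell.step(x, h)
        outputs.append(h)
    return np.stack(outputs)


def bidirectional_gru(xs, fwd, bwd, hidden_size):
    forward = run_gru(fwd, xs, hidden_size)
    # Process in reverse time order, then flip back so rows align by time step.
    backward = run_gru(bwd, xs[::-1], hidden_size)[::-1]
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 8))  # 5 time steps, 8 features each
fwd, bwd = GRUCell(8, 4, rng), GRUCell(8, 4, rng)
sequence = bidirectional_gru(xs, fwd, bwd, hidden_size=4)  # shape (5, 8)
```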

[0090] S303: Input the temporal dependency sequence into the attention aggregation layer to obtain the global temporal representation vector.

[0091] In this embodiment, the temporal dependency sequence is input to the attention aggregation layer to obtain a global temporal representation vector. This global temporal representation vector is used to fuse all feature information of the temporal dependency sequence, with a focus on retaining key anomaly features.

[0092] Specifically, inputting the temporal-dependent sequence into the attention aggregation layer to obtain the global temporal representation vector may include the following steps: Step 31: For any time step feature in the time-dependent sequence, input the time step feature into the attention weight matrix of the attention aggregation layer to obtain the original score corresponding to the time step feature.

[0093] In this embodiment, for any time step feature in the time-dependent sequence, the time step feature is input into the attention weight matrix of the attention aggregation layer to obtain the original score corresponding to the time step feature. The attention weight matrix is ​​a trainable parameter matrix obtained through model pre-training in the attention aggregation layer, and its dimension matches the dimension of the time step feature vector (e.g., when the time step feature is a 512-dimensional vector, the attention weight matrix has a dimension of 512×1), thus mapping the high-dimensional time step feature vector to a one-dimensional value and quantifying the correlation between each time step feature and the abnormal sound determination. The original score is a one-dimensional scalar value obtained by matrix multiplication of the time step feature vector and the attention weight matrix; its magnitude directly represents the importance of the corresponding time step feature to the abnormal sound determination.

[0094] Step 32: Normalize the original scores corresponding to the features at each time step to obtain the attention weights corresponding to the features at each time step.

[0095] In this embodiment, the original scores corresponding to each time step feature are normalized to obtain the attention weights corresponding to each time step feature. Normalization refers to performing a Softmax normalization operation on the original scores of all time step features. The Softmax function maps the original scores to a numerical range of 0 to 1, ensuring that the sum of the normalized scores corresponding to all time step features is 1. This achieves weight amplification for key time step features related to abnormal sounds and weight suppression for irrelevant features.

[0096] Step 33: Based on the attention weights corresponding to the features at each time step, the temporal dependent sequence is weighted and summed to obtain the global temporal representation vector.

[0097] In this embodiment, the time-dependent sequence is weighted and summed based on the attention weights corresponding to the features at each time step to obtain a global time-series representation vector.

[0098] For example, a temporally dependent sequence contains three time-step features, namely feature vectors h1=[1,2,3], h2=[4,5,6], and h3=[7,8,9]. After steps 31 and 32, the corresponding attention weights are A1=0.1, A2=0.2, and A3=0.7, respectively. First, each feature vector is weighted element-wise to obtain A1×h1=[0.1,0.2,0.3], A2×h2=[0.8,1.0,1.2], and A3×h3=[4.9,5.6,6.3]. Then, the weighted feature vectors are summed according to their dimensions, i.e., [0.1+0.8+4.9,0.2+1.0+5.6,0.3+1.2+6.3]=[5.8,6.8,7.8]. This result is the global temporal representation vector.

[0099] For example, for the temporal dependency sequence $\{h_1, h_2, \dots, h_T\}$, the attention weights $\alpha_t$ are calculated along the time dimension:

$$\alpha_t = \frac{\exp(W^{\top} h_t)}{\sum_{k=1}^{T} \exp(W^{\top} h_k)}$$

where $\alpha_t$ is the attention weight, $h_t$ is the feature vector of the temporal dependency sequence at time $t$, and $W$ is the trainable weight matrix of the attention layer. The term $\exp(W^{\top} h_t)$ applies the exponential to the feature importance score at the current time $t$, ensuring that the weights are positive; the summation $\sum_{k=1}^{T}$ normalizes the scores over all time steps, ensuring that all attention weights $\alpha_t$ sum to 1. A weighted sum is then performed to obtain the global temporal representation vector $H$:

$$H = \sum_{t=1}^{T} \alpha_t h_t$$

where $\alpha_t$ is the attention weight, $h_t$ is the feature vector of the temporal dependency sequence at time $t$, and $H$ is the global temporal representation vector.
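
Steps 31 to 33 can be sketched as follows, assuming NumPy and a weight vector of the same length as each time-step feature. The function names are hypothetical, and subtracting the maximum score before exponentiation is a numerical-stability choice that does not change the Softmax result. The usage lines reproduce the three-step numeric example given earlier in this embodiment.

```python
import numpy as np


def attention_weights(h, w):
    """Softmax over time of the raw scores h_t . w; h has shape (T, D)."""
    scores = h @ w
    scores = scores - scores.max()  # stability shift, result is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


def weighted_sum(h, alphas):
    """Global representation: sum over time of alpha_t * h_t."""
    return alphas @ h


h = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Weights taken from the worked example: result is close to [5.8, 6.8, 7.8].
example = weighted_sum(h, np.array([0.1, 0.2, 0.7]))

# With an all-zero weight vector every step scores equally (weights of 1/3).
uniform = attention_weights(h, np.zeros(3))
```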

[0100] S304: Input the global temporal representation vector into the classification layer to obtain the abnormal noise evaluation value.

[0101] In this embodiment, the global temporal representation vector is input into the classification layer to obtain the abnormal sound evaluation value.

[0102] Specifically, inputting the global temporal representation vector into the classification layer to obtain the abnormal noise evaluation value may include the following steps: Step 41: Input the global temporal representation vector into the classification layer to obtain the first original output value and the second original output value.

[0103] In this embodiment, the global temporal representation vector is input to a classification layer to obtain a first original output value and a second original output value. The classification layer includes a trainable weight matrix and bias terms that match the dimension of the global temporal representation vector. After inputting the global temporal representation vector into the classification layer, a linear transformation involving matrix multiplication and bias addition maps the high-dimensional global temporal representation vector to a two-dimensional original output value. This two-dimensional original output value includes both the first and second original output values. The first original output value is the linear output score corresponding to the normal vehicle screen category, and the second original output value is the linear output score corresponding to the abnormal noise category of the vehicle screen.

[0104] Specifically, the calculation process for the classification layer is represented as follows:

$$\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = W_c H + b_c$$

where $H$ is the global temporal representation vector, $W_c$ is the trainable weight matrix, $b_c$ is the bias term, $z_1$ is the first original output value, $z_2$ is the second original output value, and $T$, the total number of target audio frames, is the number of time steps aggregated into $H$.

[0105] Step 42: Normalize the first and second original output values to obtain the abnormal noise evaluation value.

[0106] In this embodiment, the first and second original output values are normalized to obtain the abnormal noise evaluation value. Specifically, the first and second original output values are substituted into the Softmax function, and the original scores are mapped to positive values through exponential operation. Then, normalization is performed so that the sum of the probability values corresponding to the two categories is 1. The maximum value is selected from the normalized first and second original output values as the abnormal noise evaluation value.

[0107] Specifically, the first and second original output values can be substituted into the Softmax function to obtain the normalized first and second original output values, where the Softmax function is:

$$p_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}; \quad p_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2}}$$

where $p_1$ is the normalized first original output value, $z_1$ is the first original output value, $z_2$ is the second original output value, and $p_2$ is the normalized second original output value.

[0108] For example, if the first original output value is 1.2 (corresponding to normal) and the second original output value is 3.5 (corresponding to abnormal noise), after substituting the two values into the Softmax function, the probability value of the normal category is 0.09 and the probability value of the abnormal noise category is 0.91. The maximum value of 0.91 is selected as the abnormal noise evaluation value.
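
The numeric example above can be checked with a short sketch using only the Python standard library. The function name `softmax2` is hypothetical, and subtracting the larger input before exponentiation is a stability choice that leaves the probabilities unchanged.

```python
import math


def softmax2(z1, z2):
    """Two-class Softmax returning (p1, p2), which sum to 1."""
    m = max(z1, z2)
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    total = e1 + e2
    return e1 / total, e2 / total


# Raw scores from the example: 1.2 (normal) and 3.5 (abnormal noise).
p_normal, p_abnormal = softmax2(1.2, 3.5)
# The embodiment takes the larger probability as the evaluation value.
evaluation_value = max(p_normal, p_abnormal)
```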

[0109] The pre-trained abnormal noise recognition model is trained through the following steps: A training sample set is obtained, comprising multiple audio samples labeled with normal and / or abnormal noise tags. For any given audio sample, multiple sample audio frames are extracted; the frequency domain features and temporal energy features corresponding to each sample audio frame are determined; based on the frequency domain features and temporal energy features corresponding to each sample audio frame, the sample temporal feature tensor corresponding to the audio sample is determined; the sample temporal feature tensors corresponding to each of the multiple audio samples are input into the initial abnormal noise recognition model to obtain a prediction result sequence; based on the prediction result sequence and the labels corresponding to each of the multiple audio samples, a loss function is constructed; the initial abnormal noise recognition model is iteratively optimized according to the loss function to obtain the pre-trained abnormal noise recognition model.

[0110] The loss function can be the cross-entropy loss function:

$$L = -\sum_{i=1}^{C} y_i \log(p_i)$$

where $y_i$ is the label for the i-th category (when the audio sample belongs to the i-th category, $y_i = 1$; otherwise $y_i = 0$), $p_i$ is the predicted probability that the audio sample belongs to the i-th category, and $C$ is the total number of categories, with $C = 2$ (corresponding to the two categories: normal and abnormal noise).
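
For a single audio sample, the cross-entropy loss could be computed as in the sketch below, using only the Python standard library. The function name and the clipping constant `eps` (which guards against taking the logarithm of zero) are assumptions; the probabilities in the usage line are illustrative values.

```python
import math


def cross_entropy(labels, probs, eps=1e-12):
    """Negative sum of y_i * log(p_i) over the categories of one sample."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    return -sum(y * math.log(max(p, eps)) for y, p in zip(labels, probs))


# One-hot label "abnormal noise" with predicted probabilities (0.09, 0.91).
loss = cross_entropy([0, 1], [0.09, 0.91])
```

A confident correct prediction yields a loss near zero, while a confident wrong prediction yields a large loss, which is what drives the iterative optimization of the model.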

[0111] S209. If the abnormal noise assessment value is greater than the preset abnormal noise threshold, the vehicle screen is determined to be abnormal.

[0112] In this embodiment, if the abnormal noise assessment value is greater than a preset abnormal noise threshold, the vehicle screen is determined to be abnormal. The preset abnormal noise threshold is a judgment threshold determined based on production line testing accuracy requirements and verification with a large number of samples; for example, 0.5 is used as a quantitative standard to distinguish between normal and abnormal vehicle screens, ensuring the consistency and accuracy of the test results.

[0113] S210, if the abnormal noise assessment value is less than or equal to the preset abnormal noise threshold, then the vehicle screen is determined to be normal.

[0114] In this embodiment of the application, if the abnormal noise evaluation value is less than or equal to the preset abnormal noise threshold, the vehicle screen is determined to be normal.
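
The decision rule of steps S209 and S210 reduces to a single comparison, sketched below with a hypothetical function name and the example threshold of 0.5. A value exactly equal to the threshold is judged normal, matching the "less than or equal to" condition.

```python
def judge_screen(evaluation_value, threshold=0.5):
    """Return "abnormal" only when the value strictly exceeds the threshold."""
    return "abnormal" if evaluation_value > threshold else "normal"


result = judge_screen(0.91)
```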

[0115] Corresponding to the above method embodiments, this application also provides an abnormal noise detection device. As shown in Figure 4, the device may include an audio frame extraction module 401, a feature determination module 402, a temporal feature tensor determination module 403, and an abnormal noise detection module 404.

[0116] The audio frame extraction module 401 is used to acquire audio data corresponding to the vehicle screen and extract multiple target audio frames from the audio data; the feature determination module 402 is used to determine the frequency domain features and time domain energy features corresponding to each target audio frame; the temporal feature tensor determination module 403 is used to determine the temporal feature tensor based on the frequency domain features and temporal energy features corresponding to each target audio frame; and the abnormal noise detection module 404 is used to detect abnormal noises on the vehicle screen based on the temporal feature tensor.

[0117] This application also provides an electronic device. As shown in Figure 5, it includes a processor 501, a communication interface 502, a memory 503, and a communication bus 504, wherein the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504. The memory 503 is used to store computer programs. In one embodiment of this application, when the processor 501 executes a program stored in the memory 503, it performs the following steps: acquiring audio data corresponding to the vehicle screen and extracting multiple target audio frames from the audio data; determining the frequency domain features and time domain energy features corresponding to each target audio frame; determining the time-series feature tensor based on the frequency domain features and time domain energy features corresponding to each target audio frame; and performing abnormal noise detection on the vehicle screen based on the time-series feature tensor.

[0118] The communication bus mentioned in the above electronic devices can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not indicate that there is only one bus or one type of bus.

[0119] The communication interface is used for communication between the aforementioned electronic devices and other devices.

[0120] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0121] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0122] In another embodiment provided in this application, a storage medium is also provided, which stores instructions that, when run on a computer, cause the computer to execute any of the abnormal noise detection methods described in the above embodiments.

[0123] In another embodiment provided in this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the abnormal noise detection methods described in the above embodiments.

[0124] The above embodiments can be implemented entirely or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are produced. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a storage medium or transmitted from one storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

[0125] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0126] The embodiments in this specification are described in a related manner, and identical or similar parts of different embodiments can be cross-referenced. Each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so their description is relatively brief; for the relevant details, refer to the description of the method embodiments.

[0127] The above description is merely a specific embodiment of this application, enabling those skilled in the art to understand or implement this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application are included within the protection scope of this application.

Claims

1. A method for detecting abnormal noises, characterized in that the method includes: acquiring audio data corresponding to a vehicle screen, and extracting multiple target audio frames from the audio data; determining the frequency domain features and time domain energy features corresponding to each of the target audio frames; determining a time-series feature tensor based on the frequency domain features and time domain energy features corresponding to each of the target audio frames; and performing abnormal noise detection on the vehicle screen based on the time-series feature tensor.
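
As a non-limiting illustration of the frame extraction step in claim 1, the following Python sketch splits a one-dimensional audio signal into fixed-length, overlapping frames. The claim does not prescribe a framing scheme; the function name `extract_frames` and the parameters `frame_len` and `hop` are assumptions introduced only for this example.

```python
import numpy as np

def extract_frames(audio: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal into frames of shape (num_frames, frame_len).

    Assumption: fixed-length frames taken every `hop` samples; trailing
    samples that do not fill a whole frame are dropped.
    """
    if audio.ndim != 1:
        raise ValueError("audio must be one-dimensional")
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    if len(audio) < frame_len:
        raise ValueError("audio is shorter than one frame")
    num_frames = 1 + (len(audio) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return audio[idx]
```

A 50 % overlap (`hop = frame_len // 2`) is a common choice for short-time analysis, but any hop that suits the production-line recording could be used.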

2. The method according to claim 1, characterized in that determining the frequency domain features corresponding to each of the target audio frames includes: for any one of the target audio frames, performing a short-time Fourier transform on the target audio frame to obtain a power spectrum; and transforming the power spectrum to obtain the frequency domain features corresponding to the target audio frame.
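
One possible reading of claim 2 is sketched below in Python. For a single frame, the short-time Fourier transform reduces to a windowed discrete Fourier transform. The claim does not say which transform is applied to the power spectrum; logarithmic (decibel) compression is used here purely as an assumption, and other transforms such as mel filtering could be substituted. The name `frame_frequency_features` and the parameter `eps` are hypothetical.

```python
import numpy as np

def frame_frequency_features(frame: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Return log-power spectrum features for one audio frame."""
    frame = np.asarray(frame, dtype=np.float64)
    window = np.hanning(len(frame))          # taper to reduce spectral leakage
    spectrum = np.fft.rfft(frame * window)   # one-sided DFT of the windowed frame
    power = np.abs(spectrum) ** 2            # power spectrum
    # Assumed transform: decibel compression; eps avoids log(0).
    return 10.0 * np.log10(power + eps)
```

For a frame of length L the result has L // 2 + 1 values, one per non-negative frequency bin.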

3. The method according to claim 1, characterized in that determining the time domain energy features corresponding to each of the target audio frames includes: for any one of the target audio frames, sampling the target audio frame to obtain multiple amplitude values; and determining the time domain energy features corresponding to the target audio frame based on the multiple amplitude values.
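
The energy computation of claim 3 can be illustrated with the Python sketch below. The claim only states that the features are derived from multiple amplitude values; the two quantities chosen here (short-time energy and root-mean-square amplitude) and the name `frame_energy_features` are assumptions made for the example.

```python
import numpy as np

def frame_energy_features(frame: np.ndarray) -> np.ndarray:
    """Return [short-time energy, RMS amplitude] for one audio frame."""
    amplitudes = np.asarray(frame, dtype=np.float64)
    energy = np.sum(amplitudes ** 2)           # sum of squared amplitudes
    rms = np.sqrt(np.mean(amplitudes ** 2))    # root-mean-square amplitude
    return np.array([energy, rms])
```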

4. The method according to claim 1, characterized in that the step of determining the time-series feature tensor based on the frequency domain features and time domain energy features corresponding to each of the target audio frames includes: for any one of the target audio frames, splicing the frequency domain features and time domain energy features corresponding to the target audio frame according to a preset splicing strategy to obtain the fusion features corresponding to the target audio frame; and combining the fusion features corresponding to each of the target audio frames to obtain the time-series feature tensor.
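
A minimal Python sketch of the splicing and combining steps of claim 4 follows. The "preset splicing strategy" is not detailed in the claim; this example assumes simple concatenation with the frequency domain features first and the energy features last, then stacks the per-frame vectors in time order. The name `build_feature_tensor` is hypothetical.

```python
import numpy as np

def build_feature_tensor(freq_feats, energy_feats) -> np.ndarray:
    """Fuse per-frame features and stack them into a (num_frames, dim) tensor.

    freq_feats:   sequence of per-frame frequency domain feature vectors
    energy_feats: sequence of per-frame energy feature vectors
    """
    if len(freq_feats) != len(energy_feats):
        raise ValueError("both inputs must describe the same number of frames")
    # Assumed splicing strategy: [frequency features | energy features].
    fused = [np.concatenate([f, e]) for f, e in zip(freq_feats, energy_feats)]
    # Axis 0 is the time (frame) axis.
    return np.stack(fused, axis=0)
```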

5. The method according to claim 1, characterized in that the step of performing abnormal noise detection on the vehicle screen based on the time-series feature tensor includes: inputting the time-series feature tensor into a pre-trained abnormal noise recognition model to obtain an abnormal noise evaluation value; if the abnormal noise evaluation value is greater than a preset abnormal noise threshold, determining that the vehicle screen is abnormal; and if the abnormal noise evaluation value is less than or equal to the preset abnormal noise threshold, determining that the vehicle screen is normal.
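
The decision rule of claim 5 can be written as the short Python sketch below. The function name `judge_screen` and the string labels are illustrative assumptions; the threshold value itself is a preset that the claim leaves open.

```python
def judge_screen(evaluation_value: float, threshold: float) -> str:
    """Apply the threshold rule: strictly greater means abnormal."""
    # A value exactly equal to the threshold is treated as normal.
    return "abnormal" if evaluation_value > threshold else "normal"
```

Note that the comparison is strict, so a value equal to the threshold yields a "normal" result.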

6. The method according to claim 5, characterized in that the pre-trained abnormal noise recognition model includes: a feature extraction subnetwork, a bidirectional gated recurrent unit subnetwork, an attention aggregation layer, and a classification layer; and the step of inputting the time-series feature tensor into the pre-trained abnormal noise recognition model to obtain the abnormal noise evaluation value includes: inputting the time-series feature tensor into the feature extraction subnetwork to obtain a local time-frequency pattern feature sequence; inputting the local time-frequency pattern feature sequence into the bidirectional gated recurrent unit subnetwork to obtain a temporal dependency sequence; inputting the temporal dependency sequence into the attention aggregation layer to obtain a global temporal representation vector; and inputting the global temporal representation vector into the classification layer to obtain the abnormal noise evaluation value.

7. The method according to claim 6, characterized in that the feature extraction subnetwork contains N levels of convolutional blocks, each of which consists of a convolutional layer, a normalization layer, a ReLU activation function layer, and a max pooling layer, where N is a positive integer; and the step of inputting the time-series feature tensor into the feature extraction subnetwork to obtain the local time-frequency pattern feature sequence includes: performing the following step iteratively until the N-th first feature map is obtained: inputting the i-th first feature map to be processed into the i-th level convolutional block to obtain the i-th first feature map, wherein the i-th first feature map serves as the (i+1)-th first feature map to be processed; the 1st first feature map to be processed is the time-series feature tensor; the N-th first feature map is the local time-frequency pattern feature sequence; and i takes the values 1, 2, 3, ..., N in sequence.
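
The iteration over N convolutional blocks in claim 7 is sketched below in Python with NumPy. Several simplifications are assumptions of this example and not statements about the claimed model: the convolution is one-dimensional along the time axis with the feature dimension treated as channels, the normalization layer is replaced by a per-channel standardization over time, the pooling size is 2, and the weights are supplied by the caller. The names `conv_block` and `feature_extraction_subnet` are hypothetical.

```python
import numpy as np

def conv_block(x, weight, bias, pool=2, eps=1e-5):
    """One block: convolution -> normalization -> ReLU -> max pooling.

    x:      (T, C_in) feature map, time on axis 0
    weight: (K, C_in, C_out) kernel, bias: (C_out,)
    """
    k = weight.shape[0]
    t_out = x.shape[0] - k + 1
    if t_out < pool:
        raise ValueError("input too short for this kernel and pool size")
    # windows[j, t] = x[t + j], so the einsum below is a valid 1-D convolution
    # (cross-correlation) along time.
    windows = np.stack([x[j:j + t_out] for j in range(k)], axis=0)
    y = np.einsum("ktc,kco->to", windows, weight) + bias
    # Stand-in for the normalization layer: standardize each channel over time.
    y = (y - y.mean(axis=0)) / np.sqrt(y.var(axis=0) + eps)
    y = np.maximum(y, 0.0)                               # ReLU
    t_pool = y.shape[0] // pool
    # Non-overlapping max pooling along time.
    return y[: t_pool * pool].reshape(t_pool, pool, -1).max(axis=1)

def feature_extraction_subnet(tensor, params):
    """Apply the blocks in order; params is a list of (weight, bias) pairs."""
    fmap = tensor                      # 1st feature map to be processed
    for weight, bias in params:        # i = 1, 2, ..., N
        fmap = conv_block(fmap, weight, bias)   # output feeds block i + 1
    return fmap                        # N-th feature map
```

Each block shortens the time axis, so the number of blocks N and the kernel size bound the minimum number of input frames.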

8. The method according to claim 6, characterized in that the bidirectional gated recurrent unit subnetwork includes a forward gated recurrent unit and a backward gated recurrent unit; and the step of inputting the local time-frequency pattern feature sequence into the bidirectional gated recurrent unit subnetwork to obtain the temporal dependency sequence includes: unfolding the local time-frequency pattern feature sequence along the time dimension to obtain a time-series feature vector sequence; inputting the time-series feature vector sequence into the forward gated recurrent unit and the backward gated recurrent unit respectively to obtain a forward temporal feature sequence and a backward temporal feature sequence; and concatenating the forward temporal feature sequence and the backward temporal feature sequence to obtain the temporal dependency sequence.
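
The forward and backward passes of claim 8 can be illustrated with the NumPy sketch below. It uses one common formulation of the gated recurrent unit (update gate, reset gate, candidate state); the claim does not fix the exact gate equations, and the randomly initialized weights from the hypothetical helper `init_gru_params` merely stand in for trained parameters.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru_params(rng, in_dim, hidden):
    """Random placeholder weights; a trained model would supply real ones."""
    p = {}
    for gate in ("z", "r", "n"):
        p["W" + gate] = rng.normal(scale=0.1, size=(hidden, in_dim))
        p["U" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
        p["b" + gate] = np.zeros(hidden)
    return p

def gru_pass(seq, p):
    """Run a GRU over seq of shape (T, in_dim); return states of shape (T, hidden)."""
    h = np.zeros(p["bz"].shape[0])
    outputs = []
    for x in seq:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])        # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])        # reset gate
        n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h) + p["bn"])  # candidate state
        h = (1.0 - z) * h + z * n
        outputs.append(h)
    return np.stack(outputs)

def bidirectional_gru(seq, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at every time step."""
    forward = gru_pass(seq, fwd_params)
    # Process the reversed sequence, then flip back so time steps line up.
    backward = gru_pass(seq[::-1], bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)
```

With a hidden size of H per direction, each time step of the result has 2H values.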

9. The method according to claim 6, characterized in that the step of inputting the temporal dependency sequence into the attention aggregation layer to obtain the global temporal representation vector includes: for any time step feature in the temporal dependency sequence, applying the attention weight matrix of the attention aggregation layer to the time step feature to obtain the raw score corresponding to the time step feature; normalizing the raw scores corresponding to the time step features to obtain the attention weight corresponding to each time step feature; and computing a weighted sum of the temporal dependency sequence based on the attention weights corresponding to the time step features to obtain the global temporal representation vector.
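
The attention aggregation of claim 9 is sketched below in Python. Two details are assumptions of this example: the attention weight matrix is taken to be a single projection `w` of shape (D, 1) that maps each time step feature to a scalar score, and the normalization is a softmax over time. The name `attention_pool` is hypothetical.

```python
import numpy as np

def attention_pool(seq: np.ndarray, w: np.ndarray):
    """Aggregate a (T, D) sequence into a single D-dimensional vector.

    Returns (context, weights), where weights has shape (T,) and sums to 1.
    """
    scores = (seq @ w).ravel()             # raw score per time step
    scores = scores - scores.max()         # shift for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()    # softmax normalization
    context = weights @ seq                # weighted sum over time steps
    return context, weights
```

Returning the weights alongside the pooled vector makes it possible to inspect which time steps contributed most to a given judgment.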

10. The method according to claim 6, characterized in that the step of inputting the global temporal representation vector into the classification layer to obtain the abnormal noise evaluation value includes: inputting the global temporal representation vector into the classification layer to obtain a first raw output value and a second raw output value; and normalizing the first raw output value and the second raw output value to obtain the abnormal noise evaluation value.
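
A minimal Python sketch of claim 10 follows. It assumes that the classification layer is a single linear layer producing two raw output values, that the normalization is a two-way softmax, and that the second output corresponds to the abnormal class; none of these choices is stated in the claim. The name `abnormal_noise_score` is hypothetical.

```python
import numpy as np

def abnormal_noise_score(vec: np.ndarray, W: np.ndarray, b: np.ndarray) -> float:
    """Map a D-dimensional vector to a score in (0, 1).

    W: (2, D) weights, b: (2,) biases of the assumed linear classification layer.
    """
    logits = W @ vec + b                   # first and second raw output values
    logits = logits - logits.max()         # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()   # two-way softmax
    # Assumption: index 1 is the abnormal class.
    return float(probs[1])
```

The resulting value lies between 0 and 1, so it can be compared directly with a preset abnormal noise threshold.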