A VDR Voice Endpoint Detection Method

CN116246664B · Active · Publication Date: 2026-06-30 · DALIAN MARITIME UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DALIAN MARITIME UNIVERSITY
Filing Date
2022-12-19
Publication Date
2026-06-30


Abstract

This invention discloses a VDR speech endpoint detection method, comprising: extracting four types of feature information from an audio signal and obtaining the first-order and second-order differences of each; inputting the zero-padded feature map into a residual network with an attention mechanism to extract complex abstract features of the feature map; calculating the feature centroids corresponding to the initial output values of 0 and 1; searching the initial output of speech endpoint detection for abrupt changes lasting less than 100 ms, defining them as short-term abrupt changes, and calculating the similarity between the feature centroid of each abrupt change and the feature centroids of the 0 and 1 judgment results over the entire audio file; and updating the VDR speech endpoint detection output value according to these similarities to obtain the final VDR speech endpoint detection output value. The method suppresses short-term abrupt changes in endpoint detection, thereby accurately locating the speech in the VDR audio signal.

Description

Technical Field

[0001] This invention relates to the field of speech recognition technology, and in particular to a method for VDR speech endpoint detection.

Background Technology

[0002] Speech endpoint detection is a crucial step in applications such as speech enhancement, speech recognition, and keyword detection. In ship voyage data recorders (VDRs), the ultra-high frequency (UHF) communication audio signal is contaminated by equipment, communication, and environmental noise, resulting in a low signal-to-noise ratio; sometimes even the human ear struggles to locate the speech.

[0003] UHF communication audio from ship voyage data recorders (VDRs) plays a crucial role in presenting the ship's operational status and communication content, providing valuable information for navigation safety and accident investigations. However, due to factors such as equipment, communication, and environmental noise, VDR audio signals have a low signal-to-noise ratio, resulting in poor clarity and intelligibility. This hinders the application of VDR audio in accident investigations and ship safety monitoring. Furthermore, relying solely on data-driven machine learning methods for category judgment often leads to misclassifications, causing short-term error spikes in endpoint detection, which in turn limits the performance of voice endpoint detection systems.

Summary of the Invention

[0004] To achieve robust speech endpoint detection in complex background noise, this invention proposes a speech endpoint detection method based on a residual network and an attention mechanism, specifically including the following steps:

[0005] Extract feature information from the audio signal, including Mel frequency cepstral coefficients (MFCC), Gammatone frequency cepstral coefficients (GFCC), Bark frequency cepstral coefficients (BFCC), and amplitude-based spectral root cepstral coefficients (MSRCC);

[0006] Obtain the first-order and second-order differences of each of the above four types of feature information;

[0007] The acquired feature information is reconstructed to obtain a feature map with a certain number of pixels;

[0008] Zero-padding is applied to the feature map;

[0009] The zero-padded feature map is input into a residual network with an attention mechanism to extract complex abstract features from the feature map;

[0010] The initial VDR speech endpoint detection output value is obtained by passing the extracted features sequentially through a dense layer, a flatten layer, another dense layer, and a Sigmoid activation function. An output of the Sigmoid activation function greater than or equal to 0.5 is rounded up to 1, indicating a speech segment; an output less than 0.5 is rounded down to 0, indicating no speech segment;

[0011] Calculate the feature centroids corresponding to the initial output values of 0 and 1, and denote them as g0 and g1, respectively;

[0012] Search the initial output of speech endpoint detection for abrupt changes lasting less than 100 ms and define them as short-term abrupt changes. Calculate the similarity between the feature centroid of each abrupt-change segment and the feature centroids of the class-0 and class-1 judgment results over the entire audio file;

[0013] Update the VDR speech endpoint detection output value according to the feature-centroid similarity of each short-term abrupt change, thereby obtaining the final VDR speech endpoint detection output value.

[0014] The similarity is defined as:

[0015] $$\mathrm{sim}(X_h, g_i) = \frac{\langle X_h, g_i \rangle}{\lVert X_h \rVert \, \lVert g_i \rVert}$$

[0016] where $X_h$ is the average feature centroid of the abrupt-change segment, $g_i$ is the feature centroid corresponding to the i-th preliminary judgment (i = 0 or 1), $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\lVert \cdot \rVert$ denotes the L2 norm.

[0017] By adopting the above technical solution, this invention provides a speech endpoint detection method for VDR audio signals that combines a residual network with an attention mechanism. The method first uses the residual network and attention mechanism to obtain preliminary speech endpoint detection results, then post-processes short-term abrupt changes in those results so that they are suppressed, thereby accurately locating the speech in the VDR audio signal.

Description of the Drawings

[0018] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a flowchart of the method of the present invention;

[0020] Figure 2 is a time-domain waveform diagram of a VDR audio signal in an embodiment of the present invention;

[0021] Figure 3 is the spectrogram of the VDR audio signal in an embodiment of the present invention;

[0022] Figure 4 is a feature map after feature extraction in an embodiment of the present invention;

[0023] Figure 5 is a diagram of the preliminary judgment result of voice endpoint detection in an embodiment of the present invention;

[0024] Figure 6 is a schematic diagram of the final output of voice endpoint detection in an embodiment of the present invention.

Detailed Implementation

[0025] To make the technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention:

[0026] As shown in Figure 1, the VDR voice endpoint detection method includes the following steps:

[0027] S1: Feature extraction from audio signals

[0028] ① Divide the audio signal into frames approximately 23.3ms long;

[0029] ② Extract the 20-dimensional Mel-frequency cepstral coefficient (MFCC) features of the audio signal. First, pre-emphasize the audio signal and apply a Hamming window to each frame. Next, compute the Fast Fourier Transform (FFT) and filter the energy spectrum with a bank of 20 Mel-scale triangular filters. Finally, take the logarithm and apply a Discrete Cosine Transform (DCT) to obtain the MFCC coefficients.

[0030] ③ Extract the 20-dimensional Gammatone frequency cepstral coefficients (GFCC) of the audio signal. The first part of the process is the same as for MFCC. After the FFT, the spectrum is smoothed with a Gammatone filter bank. A cube-root operation is then applied to model the physiological mechanisms of the human peripheral auditory system. Finally, the GFCC coefficients are obtained by DCT.

[0031] ④ Extract the 20-dimensional Bark frequency cepstral coefficients (BFCC) of the audio signal. The procedure differs from MFCC only in that the filter bank is a Bark-scale filter bank.

[0032] ⑤ Extract the 20-dimensional amplitude-based spectral root cepstral coefficients (MSRCC) of the audio signal. First, pre-emphasize and window the signal, and compute the energy spectrum by FFT. Then filter with a Mel filter bank. Next, apply a power function to the filtered spectral energy coefficients. Finally, obtain the MSRCC coefficients by DCT.

[0033] ⑥ Obtain the first-order and second-order differences of the above four features, respectively.
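The MFCC steps in ② can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the sampling rate, the non-overlapping frames, the pre-emphasis coefficient 0.97, the FFT size (equal to the frame length), and all function names are assumptions that the text does not specify. GFCC, BFCC, and MSRCC would follow the same skeleton with a different filter bank or a different nonlinearity before the DCT.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=23.3, hop_ms=None):
    """Split x into frames of about frame_ms; hop defaults to the frame length (assumed)."""
    frame_len = int(round(sr * frame_ms / 1000.0))
    hop = frame_len if hop_ms is None else int(round(sr * hop_ms / 1000.0))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale, shape (n_filters, n_fft // 2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, freqs.size))
    for m in range(n_filters):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def dct2(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2.0 * n))
    return x @ basis.T

def mfcc(x, sr, n_coeff=20, pre_emph=0.97):
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])             # pre-emphasis (0.97 is assumed)
    frames = frame_signal(x, sr)
    frames = frames * np.hamming(frames.shape[1])              # Hamming window per frame
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # energy spectrum
    energies = power @ mel_filterbank(n_coeff, n_fft, sr).T    # 20 Mel-scale triangular filters
    return dct2(np.log(energies + 1e-10), n_coeff)             # logarithm, then DCT
```

Calling `mfcc(x, sr)` on a one-dimensional float array returns one 20-dimensional row per frame. The small constant added before the logarithm only guards against taking the log of zero.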

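[0034] S2: Reconstruct the aforementioned four features, together with their first-order and second-order differences, into a feature map of 60×4 pixels (20 coefficients plus 20 first-order and 20 second-order differences for each of the four feature types).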
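One way to arrive at a 60×4 map per frame is sketched below. The layout is an assumption: each feature type contributes 20 coefficients, 20 first-order differences, and 20 second-order differences (60 values), and the four feature types form the four columns. The text does not say whether the differences are simple frame-to-frame differences or regression-based deltas; simple differences are used here, and the function names are hypothetical.

```python
import numpy as np

def delta(feat):
    """First-order difference along the time axis; the first frame's difference is zero."""
    return np.diff(feat, axis=0, prepend=feat[:1])

def build_feature_maps(mfcc, gfcc, bfcc, msrcc):
    """Stack four (n_frames, 20) feature matrices into (n_frames, 60, 4) feature maps."""
    columns = []
    for f in (mfcc, gfcc, bfcc, msrcc):
        d1 = delta(f)                                         # first-order difference
        d2 = delta(d1)                                        # second-order difference
        columns.append(np.concatenate([f, d1, d2], axis=1))   # (n_frames, 60)
    return np.stack(columns, axis=-1)                         # (n_frames, 60, 4)
```

Each slice `maps[t]` is then the 60×4 feature map for frame `t`.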

[0035] S3: Zero-padding is applied to the input feature map to ensure that all feature information is preserved in subsequent convolutional processing.
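The effect of zero-padding can be illustrated with a small NumPy sketch. The padding width of one pixel and the 3×3 kernel are assumptions chosen for illustration; the text does not state the kernel size used by the network.

```python
import numpy as np

def zero_pad(feature_map, pad=1):
    """Surround a 2-D feature map with `pad` rows and columns of zeros."""
    return np.pad(feature_map, ((pad, pad), (pad, pad)), mode="constant", constant_values=0.0)

def conv2d_valid(x, kernel):
    """Plain 'valid' 2-D correlation, written with loops for clarity rather than speed."""
    kh, kw = kernel.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

fmap = np.arange(240, dtype=float).reshape(60, 4)   # stand-in for one 60x4 feature map
kernel = np.ones((3, 3))
unpadded_out = conv2d_valid(fmap, kernel)           # border information is lost
padded_out = conv2d_valid(zero_pad(fmap), kernel)   # output keeps the input's 60x4 size
```

Without padding, a 3×3 kernel shrinks the 4-pixel-wide map to 2 columns, so padding matters a great deal for a map this narrow.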

[0036] S4: Input the zero-padded feature map into a residual network with an attention mechanism to extract complex abstract features from the feature map.
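The text does not describe the layout of the residual network or the type of attention used. Purely as an illustration of how a residual connection and an attention weighting can be combined, the sketch below uses a squeeze-and-excitation style channel attention and a 1×1 convolution as the residual branch. Every design choice here (block layout, attention type, weight shapes, function names) is an assumption, and the weights are untrained placeholders.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (H, W, C). Squeeze by global average, then excite with two small dense layers."""
    squeezed = x.mean(axis=(0, 1))                # (C,)
    weights = sigmoid(relu(squeezed @ w1) @ w2)   # (C,), each weight in (0, 1)
    return x * weights                            # rescale every channel

def residual_attention_block(x, w_mix, w1, w2):
    """y = relu(x + attention(F(x))), with F a 1x1 convolution (channel mixing) plus ReLU."""
    fx = relu(x @ w_mix)                          # (H, W, C)
    return relu(x + channel_attention(fx, w1, w2))

rng = np.random.default_rng(0)
x = rng.standard_normal((60, 4, 8))               # hypothetical 8-channel activation
y = residual_attention_block(
    x,
    rng.standard_normal((8, 8)) * 0.1,            # w_mix: channel mixing weights
    rng.standard_normal((8, 2)),                  # w1: squeeze to 2 hidden units
    rng.standard_normal((2, 8)),                  # w2: expand back to 8 channels
)
```

The skip connection `x + ...` lets the block fall back to the identity when the residual branch contributes nothing, which is the property that makes deep residual networks trainable.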

[0037] S5: Pass the extracted features through a dense layer, a flatten layer, another dense layer, and a Sigmoid activation function in sequence to obtain the initial output of VDR speech endpoint detection. An output greater than or equal to 0.5 is judged as 1; otherwise it is judged as 0.
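The final decision rule can be written in a few lines. The sketch follows the threshold stated in the summary and in claim 1, where an output greater than or equal to 0.5 maps to 1; the function names and the idea of passing pre-activation values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def initial_decisions(pre_activations):
    """Map the last dense layer's outputs to 0/1 frame labels via Sigmoid and a 0.5 threshold."""
    probs = sigmoid(np.asarray(pre_activations, dtype=float))
    return (probs >= 0.5).astype(int)   # 1 = speech segment, 0 = no speech segment
```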

[0038] S6: Calculate the feature centroids corresponding to the initial outputs 0 and 1, respectively, and denote them as g0 and g1.
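A minimal sketch of the centroid computation follows, assuming that a feature centroid is the mean feature vector of all frames assigned to a class. The text does not state which representation the centroids are computed on (input features or network activations), so `features` here is simply any per-frame vector.

```python
import numpy as np

def class_centroids(features, labels):
    """features: (n_frames, d); labels: (n_frames,) of 0/1. Returns (g0, g1)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    g0 = features[labels == 0].mean(axis=0)   # centroid of frames judged as non-speech
    g1 = features[labels == 1].mean(axis=0)   # centroid of frames judged as speech
    return g0, g1
```

If one class never occurs in the file, its centroid is undefined (NumPy returns NaN with a warning), so a practical implementation would need to handle that case separately.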

[0039] S7: Search the initial output of speech endpoint detection for abrupt changes lasting less than 100 ms, and calculate the similarity between the feature centroid of each abrupt-change segment and the feature centroids of the two judgment results. This similarity is defined as:

[0040] $$\mathrm{sim}(X_h, g_i) = \frac{\langle X_h, g_i \rangle}{\lVert X_h \rVert \, \lVert g_i \rVert}$$

[0041] In the above formula, $X_h$ is the average feature centroid of the abrupt-change segment, $g_i$ is the feature centroid corresponding to the i-th preliminary judgment (i = 0 or 1), $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\lVert \cdot \rVert$ denotes the L2 norm.

[0042] S8: Set the category of each short-term abrupt change to the category whose feature centroid has the higher similarity.
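Steps S6 to S8 can be sketched together as a post-processing pass over the frame labels. The sketch assumes non-overlapping 23.3 ms frames, a non-empty label sequence in which both classes occur, and a similarity computed as the inner product divided by the product of L2 norms (cosine similarity), which is what the symbol definitions in S7 suggest. Function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_runs(labels):
    """Return (start, end, value) for each run of equal labels; end is exclusive."""
    labels = np.asarray(labels)
    change = np.flatnonzero(np.diff(labels)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    return [(int(s), int(e), int(labels[s])) for s, e in zip(starts, ends)]

def smooth_short_runs(labels, features, frame_ms=23.3, max_ms=100.0):
    """Relabel every run shorter than max_ms by its closer whole-file class centroid."""
    labels = np.asarray(labels)
    features = np.asarray(features, dtype=float)
    # S6: centroids g0 and g1 over the whole file, from the initial decisions
    g = [features[labels == 0].mean(axis=0), features[labels == 1].mean(axis=0)]
    out = labels.copy()
    for start, end, _ in find_runs(labels):
        if (end - start) * frame_ms < max_ms:            # S7: short-term abrupt change
            x_h = features[start:end].mean(axis=0)       # centroid of the short segment
            sims = [cosine_similarity(x_h, g[0]), cosine_similarity(x_h, g[1])]
            out[start:end] = int(np.argmax(sims))        # S8: adopt the more similar class
    return out
```

With 23.3 ms frames, a run of up to four frames (93.2 ms) counts as short. Note that a short run is not flipped blindly: it keeps its label when its own centroid is closer to the centroid of the class it was originally assigned to.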

[0043] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A VDR voice endpoint detection method, characterized in that it comprises:
extracting feature information of the audio signal, including 20-dimensional Mel frequency cepstral coefficients, 20-dimensional Gammatone frequency cepstral coefficients, 20-dimensional Bark frequency cepstral coefficients, and 20-dimensional amplitude-based spectral root cepstral coefficients;
obtaining the first-order and second-order differences of the 20-dimensional Mel frequency cepstral coefficients, the 20-dimensional Gammatone frequency cepstral coefficients, the 20-dimensional Bark frequency cepstral coefficients, and the 20-dimensional amplitude-based spectral root cepstral coefficients, respectively;
reconstructing the acquired feature information to obtain a feature map of 60×4 pixels;
applying zero-padding to the feature map;
inputting the zero-padded feature map into a residual network with an attention mechanism to extract complex abstract features from the feature map;
obtaining the initial VDR speech endpoint detection output value by sequentially passing through a dense layer, a flatten layer, another dense layer, and a Sigmoid activation function, wherein an output of the Sigmoid activation function greater than or equal to 0.5 is rounded up to 1, indicating a speech segment, and an output less than 0.5 is rounded down to 0, indicating no speech segment;
calculating the feature centroids corresponding to the initial output values of 0 and 1, denoted g0 and g1, respectively;
searching the initial output of speech endpoint detection for abrupt changes lasting less than 100 ms and defining them as short-term abrupt changes, and calculating the similarity between the feature centroid of each abrupt-change segment and the feature centroids of the class-0 and class-1 judgment results over the entire audio file; and
updating the VDR speech endpoint detection output value according to the feature-centroid similarity of each short-term abrupt change, thereby obtaining the final VDR speech endpoint detection output value.

2. The VDR voice endpoint detection method according to claim 1, characterized in that the similarity is defined as:
$$\mathrm{sim}(X_h, g_i) = \frac{\langle X_h, g_i \rangle}{\lVert X_h \rVert \, \lVert g_i \rVert} \tag{1}$$
where $X_h$ is the average feature centroid of the abrupt-change segment, $g_i$ is the feature centroid corresponding to the i-th preliminary judgment (i = 0 or 1), $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\lVert \cdot \rVert$ denotes the L2 norm.